Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix binary encoder for columntransformer #169

Conversation

datarian
Copy link

I discovered that when using the BinaryEncoder in a sklearn.ColumnTransformer, the passed params are lost.

This is because the encoder gets instantiated twice in a ColumnTransformer. Currently, params are not registered to self in BinaryEncoder.init(), so they are lost when the ColumnTransformer is put to work.

Disclaimer: I was able to correctly binary encode in a local debug session. However, as there are so many tests failing on the upstream master currently, it was hard to find out whether my solution has an undesired impact.

Also, I am confused by ordinal.py L323-L326. Is this a bug? It seems to correctly encode both with the -2 and np.nan...

@datarian
Copy link
Author

I wrote a gist to reproduce the behaviour. Before my edits, the ColumnTransformer-transformed data contains no na-values. After the fix, the data is encoded correctly.

@janmotl
Copy link
Collaborator

janmotl commented Mar 20, 2019

I fixed categorical-encoding/category_encoders/__init__.py. Majority of the fails should disappear (and there should not be any fail when running the tests locally). I will look at the proposed changes.

@janmotl
Copy link
Collaborator

janmotl commented Mar 21, 2019

Can you write your gist into a unit test? Reasoning: I was unable to unpickle the data (likely because of a wrong version of Pandas). And it would be nice to have a unit test in order to make sure that the bug does not return.

@datarian
Copy link
Author

Can you write your gist into a unit test? Reasoning: I was unable to unpickle the data (likely because of a wrong version of Pandas). And it would be nice to have a unit test in order to make sure that the bug does not return.

Yes, sure, I will write one.

I now set up a py27 environment to be as close to the test cases as possible for local tests. Many are now passing, just 6 left. But there is a weird problem with test_helpers.verify_numeric(). It fails to assert int64 being np.dtype(int).

@janmotl
Copy link
Collaborator

janmotl commented Mar 21, 2019

Ordinal.py L323-L326 look ok to me. When handle_missing == 'return_nan', we replace -2 with nan (or whatever we use to represent a missing value) at the transform time at L305-306. When handle_missing == 'value', we either encode the missing value into a positive value (because we have encountered it in the training data) or into -2 (I do not recall anymore why we make the distinction but I am ok with that.)

@janmotl
Copy link
Collaborator

janmotl commented Mar 21, 2019

I can confirm the issue with test_helpers.verify_numeric(). Just in my case, it fails on int32 and int64 passes.

inside sklearn.ColumnTransformer
@janmotl
Copy link
Collaborator

janmotl commented Mar 21, 2019

A possible solution to verify_numeric:

def verify_numeric(X_test):
    """
    Test that all attributes in the DataFrame are numeric.
    """
    for col in X_test:
        assert pd.api.types.is_numeric_dtype(X_test[col])

Also, we may rename test_helpers to helpers. And that will allow us to test helpers with:

import numpy as np
import pandas as pd
from unittest2 import TestCase  # or `from unittest import ...` if on Python 3.4+

from category_encoders.tests.helpers import verify_numeric


class TestHelpers(TestCase):

    def test_is_numeric_pandas(self):
        # Whole numbers, regardless of the byte length, should not raise AssertionError
        X = pd.DataFrame(np.ones([5, 5]), dtype='int32')
        verify_numeric(pd.DataFrame(X))

        X = pd.DataFrame(np.ones([5, 5]), dtype='int64')
        verify_numeric(pd.DataFrame(X))

        # Strings should raise AssertionError
        X = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])
        with self.assertRaises(Exception):
            verify_numeric(pd.DataFrame(X))

    def test_is_numeric_numpy(self):
        # Whole numbers, regardless of the byte length, should not raise AssertionError
        X = np.ones([5, 5], dtype='int32')
        verify_numeric(pd.DataFrame(X))

        X = np.ones([5, 5], dtype='int64')
        verify_numeric(pd.DataFrame(X))

        # Floats
        X = np.ones([5, 5], dtype='float32')
        verify_numeric(pd.DataFrame(X))

        X = np.ones([5, 5], dtype='float64')
        verify_numeric(pd.DataFrame(X))

@janmotl
Copy link
Collaborator

janmotl commented Mar 21, 2019

Is the issue with sklearn.ColumnTransformer limited to BinaryEncoder or are other encoders also affected?

@datarian
Copy link
Author

A possible solution to verify_numeric:

def verify_numeric(X_test):
    """
    Test that all attributes in the DataFrame are numeric.
    """
    for col in X_test:
        assert pd.api.types.is_numeric_dtype(X_test[col])

Yes, or this:

def verify_numeric(X_test):
    """
    Test that all attributes in the DataFrame are numeric.
    """
    _NUMERIC_KINDS = set('buifc')
    
    for dt in X_test.dtypes:
        assert(dt.kind in _NUMERIC_KINDS)

Also, we may rename test_helpers to helpers. And that will allow us to test helpers with:

import numpy as np
import pandas as pd
from unittest2 import TestCase  # or `from unittest import ...` if on Python 3.4+

from category_encoders.tests.helpers import verify_numeric


class TestHelpers(TestCase):

    def test_is_numeric_pandas(self):
        # Whole numbers, regardless of the byte length, should not raise AssertionError
        X = pd.DataFrame(np.ones([5, 5]), dtype='int32')
        verify_numeric(pd.DataFrame(X))

        X = pd.DataFrame(np.ones([5, 5]), dtype='int64')
        verify_numeric(pd.DataFrame(X))

        # Strings should raise AssertionError
        X = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])
        with self.assertRaises(Exception):
            verify_numeric(pd.DataFrame(X))

    def test_is_numeric_numpy(self):
        # Whole numbers, regardless of the byte length, should not raise AssertionError
        X = np.ones([5, 5], dtype='int32')
        verify_numeric(pd.DataFrame(X))

        X = np.ones([5, 5], dtype='int64')
        verify_numeric(pd.DataFrame(X))

        # Floats
        X = np.ones([5, 5], dtype='float32')
        verify_numeric(pd.DataFrame(X))

        X = np.ones([5, 5], dtype='float64')
        verify_numeric(pd.DataFrame(X))

That would surely be a good idea!

@datarian
Copy link
Author

datarian commented Mar 21, 2019

Is the issue with sklearn.ColumnTransformer limited to BinaryEncoder or are other encoders also affected?

I have only looked into BinaryEncoder. Basically, every encoder that does not set parameters as instance variables will suffer from this. I will have a look at them. I checked all init() functions for the other encoders. It looks like they should be working.

Added unittests for helpers
@janmotl
Copy link
Collaborator

janmotl commented Mar 21, 2019

Awesome. I propose to replace the BinaryEncoder specific test with a parameterized test in test_encoders:

    def test_column_transformer(self):
       # see issue #169
        for encoder_name in (set(encoders.__all__) - {'HashingEncoder'}): # HashingEncoder does not accept handle_missing parameter
            with self.subTest(encoder_name=encoder_name):

                # we can only test one data type at once. Here, we test string columns.
                tested_columns = ['unique_str', 'invariant', 'underscore', 'none', 'extra']

                # ColumnTransformer instantiates the encoder twice -> we have to make sure the encoder settings are correctly passed
                ct = ColumnTransformer([
                    ("dummy_encoder_name", getattr(encoders, encoder_name)(handle_missing="return_nan"), tested_columns)
                ])
                obtained = ct.fit_transform(X, y)

                # the old-school approach
                enc = getattr(encoders, encoder_name)(handle_missing="return_nan", return_df=False)
                expected = enc.fit_transform(X[tested_columns], y)

                np.testing.assert_array_equal(obtained, expected)

It takes longer to run. But we can at least be reasonably sure that all the encoders (but HashingEncoder) remain compatible with ColumnTransformer.

florian added 2 commits March 21, 2019 11:39
@janmotl
Copy link
Collaborator

janmotl commented Mar 21, 2019

I propose to create a new pull request for changes to verify_numeric (I am good with both solutions), renaming of test_helpers to helpers and creation of new test_helpers.

Also, test_helpers can be extended to also test category datatype:

        # Categories should raise AssertionError
        X = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']], dtype='category')
        with self.assertRaises(Exception):
            verify_numeric(pd.DataFrame(X))

Can you do it? Reasoning: I should not do both, writing a pr and merging the pr.

assertWarning tests correctly.
@datarian
Copy link
Author

I propose to create a new pull request for changes to verify_numeric (I am good with both solutions), renaming of test_helpers to helpers and creation of new test_helpers.

Also, test_helpers can be extended to also test category datatype:

        # Categories should raise AssertionError
        X = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']], dtype='category')
        with self.assertRaises(Exception):
            verify_numeric(pd.DataFrame(X))

Can you do it? Reasoning: I should not do both, writing a pr and merging the pr.

Yeah, I can do that. Will have to revert a couple of commits then.

Also, I found tests in test_basen and test_one_hot failing on my side because of how expected UserWarnings are handled. I started correcting those as well. Will also crate a separate PR for those.

@janmotl
Copy link
Collaborator

janmotl commented Mar 21, 2019

Also, I found tests in test_basen and test_one_hot failing on my side because of how expected UserWarnings are handled. I started correcting those as well. Will also crate a separate PR for those.

That would be really helpful - I have never managed to reproduce them on my machine.

@datarian
Copy link
Author

Also, I found tests in test_basen and test_one_hot failing on my side because of how expected UserWarnings are handled. I started correcting those as well. Will also crate a separate PR for those.

That would be really helpful - I have never managed to reproduce them on my machine.

Hm, that's strange... Well anyways this is how they pass on my machine:

message = 'inverse_transform is not supported because transform impute '\
                  'the unknown category nan when encode city'

        with self.assertWarns(UserWarning, msg=message) as w:
            enc.inverse_transform(result)

@datarian
Copy link
Author

Okay, so I created PR's #170 and #171. If #170 is merged first, then #171, this one here should hopefully pass checks.

@janmotl
Copy link
Collaborator

janmotl commented Mar 21, 2019

We have to update the minimal requirement on sklearn (at lest for development/testing environment in requirements.txt - I do not know whether we should update setup.py).

@janmotl
Copy link
Collaborator

janmotl commented Mar 22, 2019

And we have to bump the version in .travis.yml...

@datarian
Copy link
Author

So, I discovered another bug in basen inverse transform:
The current test_encoders.test_inverse_transform() just happens to pass because the cols are the last in the test df. If there's another one appended at the end (the na_categorical I added in this case), the order of columns after inverse transform is no longer equal and the test fails.

-> Should we fix BaseNEncoder.inverse_transform() or change helpers.verify_inverse_transform (which is probably a bad idea since we not only want to check if the same columns are returned in correct order....)?

@janmotl
Copy link
Collaborator

janmotl commented Mar 22, 2019

the na_categorical I added in this case

Good. Now I know why the tests are failing in test_classification :).

Should we fix BaseNEncoder.inverse_transform() or change helpers.verify_inverse_transform

If only BaseNEncoder is the offender, it makes sense to me to fix the root cause. And add a parameterized test to make sure that the deficiency does not appear in the future in some of the encoders

@datarian
Copy link
Author

the na_categorical I added in this case

Good. Now I know why the tests are failing in test_classification :).

Should we fix BaseNEncoder.inverse_transform() or change helpers.verify_inverse_transform

If only BaseNEncoder is the offender, it makes sense to me to fix the root cause. And add a parameterized test to make sure that the deficiency does not appear in the future in some of the encoders

See #172.

@janmotl janmotl merged commit 325a570 into scikit-learn-contrib:master Mar 22, 2019
@datarian datarian deleted the fix-binary-encoder-for-columntransformer branch March 23, 2019 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants