Fix binary encoder for columntransformer #169

datarian · 2019-03-19T12:10:47Z

I discovered that when using the BinaryEncoder in a sklearn.ColumnTransformer, the passed params are lost.

This is because the encoder gets instantiated twice in a ColumnTransformer. Currently, params are not registered to self in BinaryEncoder.init(), so they are lost when the ColumnTransformer is put to work.

Disclaimer: I was able to correctly binary encode in a local debug session. However, as there are so many tests failing on the upstream master currently, it was hard to find out whether my solution has an undesired impact.

Also, I am confused by ordinal.py L323-L326. Is this a bug? It seems to correctly encode both with the -2 and np.nan...

datarian · 2019-03-19T12:30:26Z

I wrote a gist to reproduce the behaviour. Before my edits, the ColumnTransformer-transformed data contains no na-values. After the fix, the data is encoded correctly.

janmotl · 2019-03-20T13:58:33Z

I fixed categorical-encoding/category_encoders/__init__.py. Majority of the fails should disappear (and there should not be any fail when running the tests locally). I will look at the proposed changes.

…r-for-columntransformer

janmotl · 2019-03-21T08:13:28Z

Can you write your gist into a unit test? Reasoning: I was unable to unpickle the data (likely because of a wrong version of Pandas). And it would be nice to have a unit test in order to make sure that the bug does not return.

datarian · 2019-03-21T08:24:21Z

Can you write your gist into a unit test? Reasoning: I was unable to unpickle the data (likely because of a wrong version of Pandas). And it would be nice to have a unit test in order to make sure that the bug does not return.

Yes, sure, I will write one.

I now set up a py27 environment to be as close to the test cases as possible for local tests. Many are now passing, just 6 left. But there is a weird problem with test_helpers.verify_numeric(). It fails to assert int64 being np.dtype(int).

janmotl · 2019-03-21T08:51:49Z

Ordinal.py L323-L326 look ok to me. When handle_missing == 'return_nan', we replace -2 with nan (or whatever we use to represent a missing value) at the transform time at L305-306. When handle_missing == 'value', we either encode the missing value into a positive value (because we have encountered it in the training data) or into -2 (I do not recall anymore why we make the distinction but I am ok with that.)

janmotl · 2019-03-21T09:15:59Z

I can confirm the issue with test_helpers.verify_numeric(). Just in my case, it fails on int32 and int64 passes.

inside sklearn.ColumnTransformer

janmotl · 2019-03-21T09:32:20Z

A possible solution to verify_numeric:

def verify_numeric(X_test):
    """
    Test that all attributes in the DataFrame are numeric.
    """
    for col in X_test:
        assert pd.api.types.is_numeric_dtype(X_test[col])

Also, we may rename test_helpers to helpers. And that will allow us to test helpers with:

import numpy as np
import pandas as pd
from unittest2 import TestCase  # or `from unittest import ...` if on Python 3.4+

from category_encoders.tests.helpers import verify_numeric


class TestHelpers(TestCase):

    def test_is_numeric_pandas(self):
        # Whole numbers, regardless of the byte length, should not raise AssertionError
        X = pd.DataFrame(np.ones([5, 5]), dtype='int32')
        verify_numeric(pd.DataFrame(X))

        X = pd.DataFrame(np.ones([5, 5]), dtype='int64')
        verify_numeric(pd.DataFrame(X))

        # Strings should raise AssertionError
        X = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])
        with self.assertRaises(Exception):
            verify_numeric(pd.DataFrame(X))

    def test_is_numeric_numpy(self):
        # Whole numbers, regardless of the byte length, should not raise AssertionError
        X = np.ones([5, 5], dtype='int32')
        verify_numeric(pd.DataFrame(X))

        X = np.ones([5, 5], dtype='int64')
        verify_numeric(pd.DataFrame(X))

        # Floats
        X = np.ones([5, 5], dtype='float32')
        verify_numeric(pd.DataFrame(X))

        X = np.ones([5, 5], dtype='float64')
        verify_numeric(pd.DataFrame(X))

janmotl · 2019-03-21T09:44:09Z

Is the issue with sklearn.ColumnTransformer limited to BinaryEncoder or are other encoders also affected?

datarian · 2019-03-21T09:47:55Z

A possible solution to verify_numeric:

def verify_numeric(X_test):
    """
    Test that all attributes in the DataFrame are numeric.
    """
    for col in X_test:
        assert pd.api.types.is_numeric_dtype(X_test[col])

Yes, or this:

def verify_numeric(X_test):
    """
    Test that all attributes in the DataFrame are numeric.
    """
    _NUMERIC_KINDS = set('buifc')
    
    for dt in X_test.dtypes:
        assert(dt.kind in _NUMERIC_KINDS)

Also, we may rename test_helpers to helpers. And that will allow us to test helpers with:

import numpy as np
import pandas as pd
from unittest2 import TestCase  # or `from unittest import ...` if on Python 3.4+

from category_encoders.tests.helpers import verify_numeric


class TestHelpers(TestCase):

    def test_is_numeric_pandas(self):
        # Whole numbers, regardless of the byte length, should not raise AssertionError
        X = pd.DataFrame(np.ones([5, 5]), dtype='int32')
        verify_numeric(pd.DataFrame(X))

        X = pd.DataFrame(np.ones([5, 5]), dtype='int64')
        verify_numeric(pd.DataFrame(X))

        # Strings should raise AssertionError
        X = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])
        with self.assertRaises(Exception):
            verify_numeric(pd.DataFrame(X))

    def test_is_numeric_numpy(self):
        # Whole numbers, regardless of the byte length, should not raise AssertionError
        X = np.ones([5, 5], dtype='int32')
        verify_numeric(pd.DataFrame(X))

        X = np.ones([5, 5], dtype='int64')
        verify_numeric(pd.DataFrame(X))

        # Floats
        X = np.ones([5, 5], dtype='float32')
        verify_numeric(pd.DataFrame(X))

        X = np.ones([5, 5], dtype='float64')
        verify_numeric(pd.DataFrame(X))

That would surely be a good idea!

datarian · 2019-03-21T09:48:50Z

Is the issue with sklearn.ColumnTransformer limited to BinaryEncoder or are other encoders also affected?

I have only looked into BinaryEncoder. Basically, every encoder that does not set parameters as instance variables will suffer from this. ~~I will have a look at them.~~ I checked all init() functions for the other encoders. It looks like they should be working.

Added unittests for helpers

janmotl · 2019-03-21T10:29:13Z

Awesome. I propose to replace the BinaryEncoder specific test with a parameterized test in test_encoders:

    def test_column_transformer(self):
       # see issue #169
        for encoder_name in (set(encoders.__all__) - {'HashingEncoder'}): # HashingEncoder does not accept handle_missing parameter
            with self.subTest(encoder_name=encoder_name):

                # we can only test one data type at once. Here, we test string columns.
                tested_columns = ['unique_str', 'invariant', 'underscore', 'none', 'extra']

                # ColumnTransformer instantiates the encoder twice -> we have to make sure the encoder settings are correctly passed
                ct = ColumnTransformer([
                    ("dummy_encoder_name", getattr(encoders, encoder_name)(handle_missing="return_nan"), tested_columns)
                ])
                obtained = ct.fit_transform(X, y)

                # the old-school approach
                enc = getattr(encoders, encoder_name)(handle_missing="return_nan", return_df=False)
                expected = enc.fit_transform(X[tested_columns], y)

                np.testing.assert_array_equal(obtained, expected)

It takes longer to run. But we can at least be reasonably sure that all the encoders (but HashingEncoder) remain compatible with ColumnTransformer.

encoders through ColumnTransformer

janmotl · 2019-03-21T10:46:18Z

I propose to create a new pull request for changes to verify_numeric (I am good with both solutions), renaming of test_helpers to helpers and creation of new test_helpers.

Also, test_helpers can be extended to also test category datatype:

        # Categories should raise AssertionError
        X = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']], dtype='category')
        with self.assertRaises(Exception):
            verify_numeric(pd.DataFrame(X))

Can you do it? Reasoning: I should not do both, writing a pr and merging the pr.

assertWarning tests correctly.

datarian · 2019-03-21T11:04:23Z

I propose to create a new pull request for changes to verify_numeric (I am good with both solutions), renaming of test_helpers to helpers and creation of new test_helpers.

Also, test_helpers can be extended to also test category datatype:
        # Categories should raise AssertionError
        X = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']], dtype='category')
        with self.assertRaises(Exception):
            verify_numeric(pd.DataFrame(X))
Can you do it? Reasoning: I should not do both, writing a pr and merging the pr.

Yeah, I can do that. Will have to revert a couple of commits then.

Also, I found tests in test_basen and test_one_hot failing on my side because of how expected UserWarnings are handled. I started correcting those as well. Will also crate a separate PR for those.

janmotl · 2019-03-21T11:37:00Z

Also, I found tests in test_basen and test_one_hot failing on my side because of how expected UserWarnings are handled. I started correcting those as well. Will also crate a separate PR for those.

That would be really helpful - I have never managed to reproduce them on my machine.

This reverts commit c96143b.

This reverts commit 99b35cd.

This reverts commit 22c1eb0.

datarian · 2019-03-21T12:37:16Z

Also, I found tests in test_basen and test_one_hot failing on my side because of how expected UserWarnings are handled. I started correcting those as well. Will also crate a separate PR for those.

That would be really helpful - I have never managed to reproduce them on my machine.

Hm, that's strange... Well anyways this is how they pass on my machine:

message = 'inverse_transform is not supported because transform impute '\
                  'the unknown category nan when encode city'

        with self.assertWarns(UserWarning, msg=message) as w:
            enc.inverse_transform(result)

datarian · 2019-03-21T12:40:05Z

Okay, so I created PR's #170 and #171. If #170 is merged first, then #171, this one here should hopefully pass checks.

janmotl · 2019-03-21T13:30:43Z

We have to update the minimal requirement on sklearn (at lest for development/testing environment in requirements.txt - I do not know whether we should update setup.py).

…r-for-columntransformer

BinaryEncoder

janmotl · 2019-03-22T07:11:23Z

And we have to bump the version in .travis.yml...

datarian · 2019-03-22T08:01:14Z

So, I discovered another bug in basen inverse transform:
The current test_encoders.test_inverse_transform() just happens to pass because the cols are the last in the test df. If there's another one appended at the end (the na_categorical I added in this case), the order of columns after inverse transform is no longer equal and the test fails.

-> Should we fix BaseNEncoder.inverse_transform() or change helpers.verify_inverse_transform (which is probably a bad idea since we not only want to check if the same columns are returned in correct order....)?

janmotl · 2019-03-22T08:18:53Z

the na_categorical I added in this case

Good. Now I know why the tests are failing in test_classification :).

Should we fix BaseNEncoder.inverse_transform() or change helpers.verify_inverse_transform

If only BaseNEncoder is the offender, it makes sense to me to fix the root cause. And add a parameterized test to make sure that the deficiency does not appear in the future in some of the encoders

datarian · 2019-03-22T09:57:22Z

the na_categorical I added in this case

Good. Now I know why the tests are failing in test_classification :).

Should we fix BaseNEncoder.inverse_transform() or change helpers.verify_inverse_transform

If only BaseNEncoder is the offender, it makes sense to me to fix the root cause. And add a parameterized test to make sure that the deficiency does not appear in the future in some of the encoders

See #172.

Merge remote-tracking branch 'upstream/master' into fix-binary-encode…

20b94f6

…r-for-columntransformer

Added unittest for binary encoder

89cf1fa

inside sklearn.ColumnTransformer

Renamed test_helpers to helpers

22c1eb0

Added unittests for helpers

florian added 2 commits March 21, 2019 11:39

added parametric test for running

57ba094

encoders through ColumnTransformer

Deleted debug prints

99b35cd

Fixed test_ordinal to pass

c96143b

assertWarning tests correctly.

datarian mentioned this pull request Mar 21, 2019

Refactoring of test_helper #170

Merged

florian added 3 commits March 21, 2019 13:07

Revert "Fixed test_ordinal to pass"

aa24d65

This reverts commit c96143b.

Revert "Deleted debug prints"

82b1ef1

This reverts commit 99b35cd.

Revert "Renamed test_helpers to helpers"

d7acbb4

This reverts commit 22c1eb0.

datarian mentioned this pull request Mar 21, 2019

Fix userwarning unittests #171

Merged

Update test_binary.py

b5f32f6

Merge remote-tracking branch 'upstream/master' into fix-binary-encode…

343bc24

…r-for-columntransformer

florian added 4 commits March 21, 2019 16:25

fixed helpers import

0cd24e9

Reimplemented lost fix for

465f9ac

BinaryEncoder

pass instance variables to basen instantiation

ec13d1c

Bumped required scikit-learn version: >=0.20.2

5763ae3

bumped scikit-learn version: >= 0.20.2

f3f55f8

bump scikit-learn version number: >=0.20.2

8c71668

janmotl merged commit 325a570 into scikit-learn-contrib:master Mar 22, 2019

datarian mentioned this pull request Mar 23, 2019

Fix tests with new col na categorical #175

Merged

datarian deleted the fix-binary-encoder-for-columntransformer branch March 23, 2019 07:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix binary encoder for columntransformer #169

Fix binary encoder for columntransformer #169

datarian commented Mar 19, 2019

datarian commented Mar 19, 2019

janmotl commented Mar 20, 2019

janmotl commented Mar 21, 2019

datarian commented Mar 21, 2019

janmotl commented Mar 21, 2019

janmotl commented Mar 21, 2019

janmotl commented Mar 21, 2019

janmotl commented Mar 21, 2019

datarian commented Mar 21, 2019

datarian commented Mar 21, 2019 •

edited

janmotl commented Mar 21, 2019

janmotl commented Mar 21, 2019

datarian commented Mar 21, 2019

janmotl commented Mar 21, 2019

datarian commented Mar 21, 2019

datarian commented Mar 21, 2019

janmotl commented Mar 21, 2019

janmotl commented Mar 22, 2019

datarian commented Mar 22, 2019

janmotl commented Mar 22, 2019

datarian commented Mar 22, 2019

Fix binary encoder for columntransformer #169

Fix binary encoder for columntransformer #169

Conversation

datarian commented Mar 19, 2019

datarian commented Mar 19, 2019

janmotl commented Mar 20, 2019

janmotl commented Mar 21, 2019

datarian commented Mar 21, 2019

janmotl commented Mar 21, 2019

janmotl commented Mar 21, 2019

janmotl commented Mar 21, 2019

janmotl commented Mar 21, 2019

datarian commented Mar 21, 2019

datarian commented Mar 21, 2019 • edited

janmotl commented Mar 21, 2019

janmotl commented Mar 21, 2019

datarian commented Mar 21, 2019

janmotl commented Mar 21, 2019

datarian commented Mar 21, 2019

datarian commented Mar 21, 2019

janmotl commented Mar 21, 2019

janmotl commented Mar 22, 2019

datarian commented Mar 22, 2019

janmotl commented Mar 22, 2019

datarian commented Mar 22, 2019

datarian commented Mar 21, 2019 •

edited