Fix binary encoder for columntransformer #169
I wrote a gist to reproduce the behaviour. Before my edits, the data transformed by the ColumnTransformer contained no NA values. After the fix, the data is encoded correctly.
I fixed |
Can you write your gist into a unit test? Reasoning: I was unable to unpickle the data (likely because of a wrong version of Pandas). And it would be nice to have a unit test in order to make sure that the bug does not return. |
Yes, sure, I will write one. I have now set up a py27 environment to be as close to the test cases as possible for local tests. Many tests now pass, with just 6 left. But there is a weird problem with test_helpers.verify_numeric(): it fails when asserting that int64 is np.dtype(int).
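To illustrate why such an assertion can fail: np.dtype(int) is the platform's default integer type, so it matches either int32 or int64 but never both. The sketch below contrasts an exact comparison with a check on the dtype kind. The helper names `is_numeric_exact` and `is_numeric_kind` are made up for this illustration, and the exact comparison is only an assumption about what the failing helper did.

```python
import numpy as np

def is_numeric_exact(dtype):
    # Brittle: np.dtype(int) is the platform default integer,
    # so only one integer width can ever match.
    return np.dtype(dtype) in (np.dtype(int), np.dtype(float))

def is_numeric_kind(dtype):
    # Robust: 'b', 'u', 'i', 'f', 'c' cover bool, unsigned, signed,
    # float and complex, regardless of byte length.
    return np.dtype(dtype).kind in set("buifc")

for name in ("int32", "int64", "float32", "float64", "object"):
    print(name, is_numeric_exact(name), is_numeric_kind(name))
```

Which of int32 and int64 passes the exact check depends on the platform and NumPy version, which would explain why the failure does not reproduce everywhere.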
Ordinal.py L323-L326 look ok to me. When |
I can confirm the issue with |
inside sklearn.ColumnTransformer
A possible solution to

    def verify_numeric(X_test):
        """
        Test that all attributes in the DataFrame are numeric.
        """
        for col in X_test:
            assert pd.api.types.is_numeric_dtype(X_test[col])

Also, we may rename

    import numpy as np
    import pandas as pd
    from unittest2 import TestCase  # or `from unittest import ...` if on Python 3.4+
    from category_encoders.tests.helpers import verify_numeric

    class TestHelpers(TestCase):

        def test_is_numeric_pandas(self):
            # Whole numbers, regardless of the byte length, should not raise AssertionError
            X = pd.DataFrame(np.ones([5, 5]), dtype='int32')
            verify_numeric(pd.DataFrame(X))
            X = pd.DataFrame(np.ones([5, 5]), dtype='int64')
            verify_numeric(pd.DataFrame(X))
            # Strings should raise AssertionError
            X = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])
            with self.assertRaises(Exception):
                verify_numeric(pd.DataFrame(X))

        def test_is_numeric_numpy(self):
            # Whole numbers, regardless of the byte length, should not raise AssertionError
            X = np.ones([5, 5], dtype='int32')
            verify_numeric(pd.DataFrame(X))
            X = np.ones([5, 5], dtype='int64')
            verify_numeric(pd.DataFrame(X))
            # Floats
            X = np.ones([5, 5], dtype='float32')
            verify_numeric(pd.DataFrame(X))
            X = np.ones([5, 5], dtype='float64')
            verify_numeric(pd.DataFrame(X))
Is the issue with sklearn.ColumnTransformer limited to BinaryEncoder or are other encoders also affected? |
Yes, or this:

    def verify_numeric(X_test):
        """
        Test that all attributes in the DataFrame are numeric.
        """
        _NUMERIC_KINDS = set('buifc')
        for dt in X_test.dtypes:
            assert dt.kind in _NUMERIC_KINDS

That would surely be a good idea!
I have only looked into BinaryEncoder. Basically, every encoder that does not set parameters as instance variables will suffer from this. |
Added unittests for helpers
Awesome. I propose to replace the

    def test_column_transformer(self):
        # see issue #169
        for encoder_name in (set(encoders.__all__) - {'HashingEncoder'}):  # HashingEncoder does not accept handle_missing parameter
            with self.subTest(encoder_name=encoder_name):
                # we can only test one data type at once. Here, we test string columns.
                tested_columns = ['unique_str', 'invariant', 'underscore', 'none', 'extra']

                # ColumnTransformer instantiates the encoder twice -> we have to make sure the encoder settings are correctly passed
                ct = ColumnTransformer([
                    ("dummy_encoder_name", getattr(encoders, encoder_name)(handle_missing="return_nan"), tested_columns)
                ])
                obtained = ct.fit_transform(X, y)

                # the old-school approach
                enc = getattr(encoders, encoder_name)(handle_missing="return_nan", return_df=False)
                expected = enc.fit_transform(X[tested_columns], y)

                np.testing.assert_array_equal(obtained, expected)

It takes longer to run. But we can at least be reasonably sure that all the encoders (but HashingEncoder) remain compatible with
encoders through ColumnTransformer
I propose to create a new pull request for changes to

Also,

    # Categories should raise AssertionError
    X = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']], dtype='category')
    with self.assertRaises(Exception):
        verify_numeric(pd.DataFrame(X))

Can you do it? Reasoning: I should not both write the PR and merge it.
assertWarning tests correctly.
Yeah, I can do that. I will have to revert a couple of commits then. Also, I found tests in test_basen and test_one_hot failing on my side because of how expected UserWarnings are handled. I started correcting those as well and will create a separate PR for them.
That would be really helpful - I have never managed to reproduce them on my machine. |
Hm, that's strange... Well, anyway, this is how they pass on my machine:

    message = 'inverse_transform is not supported because transform impute '\
              'the unknown category nan when encode city'
    with self.assertWarns(UserWarning, msg=message) as w:
        enc.inverse_transform(result)
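A side note on this pattern: in the standard-library unittest, the msg argument of assertWarns is only the text shown when the assertion fails; it is not compared against the warning itself. If the warning text should be checked too, assertWarnsRegex (Python 3.2+) is one option. A minimal sketch, where `inverse_transform_stub` is a hypothetical stand-in for the real `enc.inverse_transform(result)` call:

```python
import unittest
import warnings

def inverse_transform_stub():
    # Hypothetical stand-in: emits the same UserWarning text as quoted above.
    warnings.warn(
        'inverse_transform is not supported because transform impute '
        'the unknown category nan when encode city',
        UserWarning,
    )

class TestWarning(unittest.TestCase):
    def test_warning_text_is_checked(self):
        # The pattern is searched for in the warning message, so a
        # different UserWarning would make this test fail.
        with self.assertWarnsRegex(UserWarning, 'inverse_transform is not supported'):
            inverse_transform_stub()
```

Whether the stricter check is wanted here is a judgment call; the plain assertWarns version only verifies that some UserWarning was raised.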
We have to update the minimal requirement on |
And we have to bump the version in |
So, I discovered another bug in basen inverse transform: -> Should we fix BaseNEncoder.inverse_transform() or change helpers.verify_inverse_transform (which is probably a bad idea since we not only want to check if the same columns are returned in correct order....)? |
Good. Now I know why the tests are failing in
If only |
See #172. |
I discovered that when using the BinaryEncoder in a sklearn.ColumnTransformer, the passed params are lost.
This is because the encoder gets instantiated twice in a ColumnTransformer. Currently, params are not registered to self in BinaryEncoder.__init__(), so they are lost when the ColumnTransformer is put to work.
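The mechanism can be sketched without scikit-learn. The snippet below is a simplified illustration, not the actual library code: `naive_clone` is a made-up stand-in for the re-instantiation step, which rebuilds an estimator from the constructor parameters it can read back off the instance, and `LossyEncoder` / `FixedEncoder` are hypothetical classes. How a missing attribute is handled in practice differs between scikit-learn versions; falling back to None here is an assumption.

```python
import inspect

def naive_clone(estimator):
    """Rebuild an estimator from the constructor parameters found on the instance."""
    cls = type(estimator)
    names = [p for p in inspect.signature(cls.__init__).parameters if p != "self"]
    # A parameter that was never stored under its own name cannot be recovered.
    params = {name: getattr(estimator, name, None) for name in names}
    return cls(**params)

class LossyEncoder:
    def __init__(self, handle_missing="value"):
        # The parameter is used, but not registered on self under its own name.
        self._strategy = handle_missing

class FixedEncoder:
    def __init__(self, handle_missing="value"):
        # Registered on self, so the second instantiation sees the same value.
        self.handle_missing = handle_missing

lossy_copy = naive_clone(LossyEncoder(handle_missing="return_nan"))
fixed_copy = naive_clone(FixedEncoder(handle_missing="return_nan"))
print(lossy_copy._strategy)       # the user's setting is gone
print(fixed_copy.handle_missing)  # the user's setting survives
```

Storing every constructor argument on self under the same name is the convention that lets the second instance receive the user's settings.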
Disclaimer: I was able to correctly binary encode in a local debug session. However, as there are so many tests failing on the upstream master currently, it was hard to find out whether my solution has an undesired impact.
Also, I am confused by ordinal.py L323-L326. Is this a bug? It seems to correctly encode both with the -2 and np.nan...