
OrdinalEncoder fails when the input is entirely numeric or can be converted to numeric values #229

Closed
djrscally opened this issue Jan 7, 2020 · 7 comments


djrscally commented Jan 7, 2020

Summary

OrdinalEncoder.fit() throws an exception when the input values are entirely numeric (e.g. [1, 2, 3, 4, 5]) or can be converted to numeric (e.g. ['001', '002', '003', '004', '005'], or worse, strings containing whitespace like [' 01 ', ' 02 ', ' 03 ']).

Version of category_encoders
2.1.0

Reproducible code

import numpy as np
from category_encoders import OrdinalEncoder

X = np.array(['1', '2', '3', '4', '5'])
oe = OrdinalEncoder(return_df=False)

oe.fit(X)

Expected Outcome

The encoder is fit and returns itself.

Actual Outcome

The following exception is thrown:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<command-2844558405733313> in <module>
      5 oe = OrdinalEncoder(return_df=False)
      6 
----> 7 oe.fit(X)

/databricks/python/lib/python3.7/site-packages/category_encoders/ordinal.py in fit(self, X, y, **kwargs)
    144 
    145         X_temp = self.transform(X, override_return_df=True)
--> 146         self.feature_names = X_temp.columns.tolist()
    147 
    148         # drop all output columns with 0 variance.

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

This was preventing me from using OrdinalEncoder to encode some Item ID numbers, which are made up entirely of numeric characters (albeit stored as text).

By way of explanation as to why I was doing this in the first place: the Item IDs themselves are usually long, unwieldy numbers like 00311684374, and there's a bunch of stuff that specifically wants to know the number of unique categories in a feature, for which a simple .max() on the encoded column can be employed. There are other things in a machine learning Pipeline (Keras Embedding layers being the example that triggered this) that expect all values in a feature to fall within the range 0:max_encoded_value, and that's harder to code for without actually encoding the values. Finally, although the IDs are currently all numeric, the field in the source of the data is very much _alpha_numeric, and I definitely can't guarantee that it will remain numeric-only in the future. Currently, I can't future-proof against that possibility using this class.
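For concreteness, a minimal sketch of the kind of use I'm after (the 'item_id' column name and the example values are illustrative):

import pandas as pd
from category_encoders import OrdinalEncoder

# Illustrative ID-like values, stored as text; the column is explicitly named in cols
X = pd.DataFrame({'item_id': ['00311684374', '00311684375', 'A0311684376']})

encoded = OrdinalEncoder(cols=['item_id']).fit_transform(X)

# Ordinal codes start at 1, so the max of the encoded column gives the number
# of distinct categories, e.g. for sizing a Keras Embedding layer.
n_categories = int(encoded['item_id'].max())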

@djrscally (Author)

So this happens because, on the call to .fit(), the array is passed to utils.convert_input(), which converts all such columns to numeric:

X = X.apply(lambda x: pd.to_numeric(x, errors='ignore'))

fit() then runs utils.get_obj_cols(), which now returns an empty list because there are no longer any object columns after that silent conversion. When the call to X_temp = self.transform(X, override_return_df=True) is made, it therefore trips this condition:

        if not list(self.cols):
            return X if self.return_df else X.values
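To illustrate the effect of the conversion in utils.convert_input(), here is a minimal standalone sketch of what pd.to_numeric(..., errors='ignore') does to a digits-only string column:

import pandas as pd

# A string column of digits...
X = pd.DataFrame({'item_id': ['001', '002', '003']})
print(X.dtypes)            # item_id    object

# ...gets silently converted to a numeric column, so utils.get_obj_cols()
# no longer sees any object columns to encode.
X_converted = X.apply(lambda x: pd.to_numeric(x, errors='ignore'))
print(X_converted.dtypes)  # item_id    int64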

I don't really see why that silent conversion to numeric values is done. I think the ideal solution is to remove it, but I suspect that won't be entirely backwards compatible: anyone who neither explicitly specifies which columns to encode nor converts intended-to-be-numeric object columns to numeric before fitting an OrdinalEncoder will accidentally encode features that are meant to stay numeric. That probably means a warning should be raised.

Alternatively, I think that we could edit ordinal.py to change how X is returned when self.cols is an empty list, like so:

        if not list(self.cols):
            return X if self.return_df or override_return_df else X.values

That matches how it's handled at the end of the .transform() function anyway:

        if self.return_df or override_return_df:
            return X
        else:
            return X.values

Thoughts?

janmotl added a commit that referenced this issue Jan 8, 2020

janmotl commented Jan 8, 2020

Nice bug report. I removed the silent conversion. All the current tests + some new tests seem to work fine.

Do you have code where the removal of the silent conversion would change the behaviour? I didn't find anything.

janmotl added a commit that referenced this issue Jan 8, 2020

djrscally commented Jan 8, 2020

I think so, yeah. Take, for example, an array containing numeric data formatted as strings that is not intended to be encoded as categories. Under the current behaviour this is simply converted to numbers, which works fine:

>>> import numpy as np
>>> from category_encoders import OrdinalEncoder

>>> X = np.array(
...  [
...    ['A','1'],
...    ['2','9'],
...    ['3','3.5'],
...    ['4','4'],
...    ['5','5']
...  ]
...)
>>> oe = OrdinalEncoder(return_df=False)

>>> oe.fit_transform(X)
array([[1. , 1. ],
       [2. , 9. ],
       [3. , 3.5],
       [4. , 4. ],
       [5. , 5. ]])

So say the first column is categorical, and that's the column the user intends to encode, while the second is numeric data the user does not intend to encode. The numeric column is simply converted as-is to a numeric format, and so all is hunky dory.

After this change, though, it will be treated as another categorical column and encoded, so I would expect the output to change to this:

array([[1. , 1. ],
       [2. , 2. ],
       [3. , 3. ],
       [4. , 4. ],
       [5. , 5. ]])

It will still work downstream in an sklearn estimator or whatever, but that's kind of A Bad Thing™: the code will still function (so the user isn't alerted by a failure that something is wrong), but a model trained on that data is likely to suffer, since the meaning of that feature has been ruined.

I still think that that's the right approach, but it probably needs a warning or something.


janmotl commented Jan 13, 2020

I get it now.

I think it deserves a warning in the release notes like:

Now, if you pass a string NumPy array and do not explicitly specify which columns to encode (e.g. with cols=[0, 1, 5]), all the columns get encoded. The former implementation attempted to estimate which columns should be directly converted to numbers and which should be encoded (e.g. with the ordinal encoder), but this proved unreliable (see issue #229). If we can't reliably estimate which columns to encode, it is better to leave the decision up to the user. Note also that the former behaviour could have been interpreted as a bug, because the documentation is (and was) saying that by default all string columns will be encoded.

If you want a warning in the code, propose the code (with a description of how to silence it).
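For reference, a quick sketch of the behaviour that note describes, assuming integer positions are valid values for cols when a NumPy array is passed in, as the cols=[0, 1, 5] example above suggests:

import numpy as np
from category_encoders import OrdinalEncoder

X = np.array([['A', '1'],
              ['B', '9'],
              ['C', '3.5']])

# Without cols, every string column now gets encoded, including the numeric-looking one.
print(OrdinalEncoder(return_df=False).fit_transform(X))

# With cols, only the named column is encoded; the second column is left untouched.
print(OrdinalEncoder(cols=[0], return_df=False).fit_transform(X))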

@djrscally (Author)

Sure, I can do that. I'll probably reference sklearn's ChangedBehaviorWarning from sklearn.exceptions, unless you'd rather I don't do that.

I'll make the warning message be something like:

Input contains a string-formatted column containing entirely numeric values. Each distinct value will now be encoded to its own category. To avoid this behaviour, specify which columns to encode using the cols parameter. See version x.y.z release notes for details.
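Roughly like this, just a sketch; where exactly the check lives and how the numeric-looking column gets detected is still to be worked out:

import warnings
from sklearn.exceptions import ChangedBehaviorWarning

# Sketch: raised when a string column whose values all parse as numbers is
# about to be encoded as categories.
warnings.warn(
    "Input contains a string-formatted column containing entirely numeric "
    "values. Each distinct value will now be encoded to its own category. "
    "To avoid this behaviour, specify which columns to encode using the "
    "cols parameter.",
    ChangedBehaviorWarning
)

# and it can be silenced with:
# warnings.filterwarnings('ignore', category=ChangedBehaviorWarning)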


janmotl commented Jan 13, 2020

Sounds good to me.

@PaulWestenthanner (Collaborator)

The example given works now. This was probably fixed when the whole convert_input functionality was refactored.
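A quick check with the original reproducer (the printed values are what I'd expect from ordinal codes starting at 1, not verified output):

import numpy as np
from category_encoders import OrdinalEncoder

X = np.array(['1', '2', '3', '4', '5'])
oe = OrdinalEncoder(return_df=False)

# Now fits without the AttributeError; each value gets its own category,
# e.g. [[1], [2], [3], [4], [5]]
print(oe.fit_transform(X))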
