-
-
Notifications
You must be signed in to change notification settings - Fork 25.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Add 'if_binary' option to drop argument of OneHotEncoder #16245
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. It looks pretty good,
A bit related to categories="dtype" in #15396 I could imagine The alternative could be to make this breaking change in 1.0. WDYT? cc @thomasjpfan |
Do you mean |
Yes, never mind my comment for this PR, particularly since this doesn't add a new parameter. The comment was more about how to make this the default in the long term. |
@thomasjpfan would be great if you could have a look ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @rushabh-v !
So this currently drops the first collumn for any input with 2 classes, not just binary,
>>> OneHotEncoder(drop=None).fit_transform([['a'], ['b'], ['a']]).A
array([[1., 0.],
[0., 1.],
[1., 0.]])
>>> OneHotEncoder(drop='if_binary').fit_transform([['a'], ['b'], ['a']]).A
array([[0.],
[1.],
[0.]])
I would expect to this only take effect for actually binary column, not any input with 2 classes. You could add it as a counter example in unit tests.
Enabling it if the drop='if_binary' the column is of dtype bool, might be easier. For int/float I suppose one could apply it if np.unique(X[col]) is [0, 1], but it's a bit more expensive. Let's wait for input from other reviewers, before making changes.
@rth how would you do to tag a feature as "binary"? I would somehow assume that 2 categories features should be close to a binary feature, isn't it? |
I think here we are assuming that two categories means binary. The point of
this is that as long as there are only two categories, encoding both is
merely redundant.
|
OK that also works. Though, I think binary features has a very specific meaning, as either a column with binary dtype or a column with [0, 1]. A column with ["yes", "no"] or ["true", "false"], can be considered binary, but IMO it's up to the initial preprocessing to convert it to binary (also I have rarely seen this in practice). It might be best to avoid that vocabulary in this PR. Say if you have a column with countries It might be OK to keep the name
(since this PR also drops columns with 1 features I think). WDYT? |
No, it doesn't drop columns with 1 category. |
Thanks for the confirmation. It is currently behaving strangely in that case though which probably needs fixing, >>> OneHotEncoder().fit_transform([['a'], ['a'], ['a']]).A
array([[1.],
[1.],
[1.]])
>>> OneHotEncoder(drop='if_binary').fit_transform([['a'], ['a'], ['a']]).A
array([[0.],
[0.],
[0.]]) In general just to clarify, implementation wise is this expected to behave identically to |
In every case when there is a binary feature, But IMO, for features with one category |
The above example with >>> OneHotEncoder(drop='first').fit_transform([['a'], ['a'], ['a']]).A
array([], shape=(3, 0), dtype=float64) In terms of implementation I think we might want to change the signature of >>> est = OneHotEncoder(drop='if_binary')
>>> est.fit(np.atleast_2d(['a', 'b', 'c']))
OneHotEncoder(categories='auto', drop='if_binary',
dtype=<class 'numpy.float64'>, handle_unknown='error',
sparse=True)
>>> est.drop_idx_
array([0, 0, 0]) In general we should compute |
|
And even if we change the logic of |
Yes, it may have some additional cost but basically the logic for drop should be,
so we cannot fix drop_idx_ in transform, as the determination of "binary" features need to happen based on train data in fit. |
What I meant was that we can edit the And in case of correct me if I am wrong! |
Yes, that is what I mean. Something like: in fit, self.drop_idx_ = self._compute_drop_idx(self.categories) with def _compute_drop_idx(self, categories)
...
elif (isinstance(self.drop, str) and self.drop == 'first'):
return np.array([0 if len(col_cats) > 2 else None
for col_cats in categories], dtype=object) and then in |
So right now I am considering implementing it with object array just to check if it works fine or not. then we will change it if ints are required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @rushabh-v !
I'm still gettig a 500 from codecov, not sure what's going on. Not merging just yet in case @thomasjpfan wants to have a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inverse_transform
needs to be updated and tested with the new 'if_binary'
option.
Another small decision: if we look across the codebase, do we tend to use underscores, spaces, hyphens or other separators for words in param values? Should it be |
+1 for 'if_binary' |
@thomasjpfan I have updated the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy with if_binary
as the keyword option.
@@ -265,6 +265,16 @@ def test_one_hot_encoder_inverse(sparse_, drop): | |||
enc.inverse_transform(X_tr) | |||
|
|||
|
|||
def test_one_hot_encoder_inverse_if_binary(): | |||
X = [['Male', 1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make the input a numpy array:
X = np.array([['Male', 1],
['Female', 3],
['Female', 2]])
and then
assert_array_equal(ohe.inverse_transform(X_tr), X)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sklearn/preprocessing/_encoders.py
Outdated
@@ -448,8 +474,13 @@ def inverse_transform(self, X): | |||
n_transformed_features = sum(len(cats) | |||
for cats in self.categories_) | |||
else: | |||
n_transformed_features = sum(len(cats) - 1 | |||
for cats in self.categories_) | |||
if isinstance(self.drop, str) and self.drop == 'if_binary': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use elif if it is appropriate here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
why it says |
Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
All greens now! |
Thanks @rushabh-v and all reviewers! |
Fixes #12502