New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Adds missing value support to OneHotEncoder #17317
ENH Adds missing value support to OneHotEncoder #17317
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but needs docs and a whatsnew. Also, you should explain why using None and NaN in a single column is prohibited.
@@ -350,6 +363,27 @@ def _compute_drop_idx(self): | |||
len(self.drop))) | |||
missing_drops = [(i, val) for i, val in enumerate(self.drop) | |||
if val not in self.categories_[i]] | |||
|
|||
missing_drops = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the missing_drops implementation from two lines above still work? And if not you should at least remove it? but set operations should work on NaN, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this hasn't been addressed, right? There is still an implementation two lines above that is then overwritten. And I am pretty sure that the old code still works. you can also use list.index which works on NaN as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since self.categories_[i]
is a numpy array, I was trying to avoid converting it into another data structure and making a copy.
if none_in_diff and nan_in_diff: | ||
raise ValueError("Input wiith both types of missing, None and " | ||
"np.nan, is not supported") | ||
if none_in_diff: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary? Does sort
fail to sort? Can you add a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment may have lost its context. The nans
were removed from the sets above because they do not work with set differencing in this context.
As for the Nones
, we could have left the Nones
in the set, but it would complicate the _extract_missing
helper.
Updated PR with:
|
What's the motivation for treating None as missing? |
Test failures, @thomasjpfan |
Test failures are fixed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I am still slightly hopeful two places I comment on can be simplified, otherwise ready to merge I'd say :)
@@ -350,6 +363,27 @@ def _compute_drop_idx(self): | |||
len(self.drop))) | |||
missing_drops = [(i, val) for i, val in enumerate(self.drop) | |||
if val not in self.categories_[i]] | |||
|
|||
missing_drops = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this hasn't been addressed, right? There is still an implementation two lines above that is then overwritten. And I am pretty sure that the old code still works. you can also use list.index which works on NaN as expected.
sklearn/utils/_encode.py
Outdated
|
||
|
||
def _unique_python(values, *, return_inverse): | ||
# Only used in `_uniques`, see docstring there for details | ||
try: | ||
uniques = sorted(set(values)) | ||
uniques_set = set(values) | ||
missing_values_in_set = [value for value in (None, np.nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is None not sortable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not:
sorted(['a', 'b', None])
# TypeError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't match float('nan')
, nor (np.array(0)/0).item()
. Does it matter?
To be careful, I think you might be safer with:
missing_values_in_set = [value for value in uniques_set
if value is None or is_scalar_nan(value)]
def test_ohe_missing_values_both_missing_values(): | ||
# test both types of missing of missing values are treated as its own | ||
# category | ||
X = np.array([['a', 'b', None, 'a', np.nan]], dtype=object).T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if it's necessary but having np.nan
twice would check the deduplication logic (again, I know it's covered in other tests as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm done here... Lgtm!
Let's solicit another review...
We have ?4 weeks for another review to push this into 0.24 ... volunteers? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
besides LGTM
maybe I would not have done with a new MissingValues class but I understand the argument to enforce typing.
>>> enc.transform(X).toarray() | ||
array([[0., 1., 0., 0., 1., 0.], | ||
[1., 0., 0., 0., 0., 1.], | ||
[0., 0., 1., 1., 0., 0.]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a note here about what happens if None and np.nan are present in the same column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also pass explicitly here the handle_unknown parameter. To document it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated PR:
- The user guide now describes what happens when
nan
andNone
are present. - The default value of
handle_unknown='error'
is passed explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thx @thomasjpfan !
Two approvals here and the opportunity to close three issues and moving forward with the milestone! |
OneHotEncoder supports categorical features with missing values by considering the missing values as an additional category.
OneHotEncoder supports categorical features with missing values by considering the missing values as an additional category.
OMG THIS WAS MERGED!!! I'm so happy! |
Yes, thanks @thomasjpfan for putting in the hard work to make this finally happen!! |
Reference Issues/PRs
Fixes #11996
Closes #15009
Closes #13028
Towards #15796
What does this implement/fix? Explain your changes.
Adds missing value support to
OneHotEncoder
.For numerical data,
np.nan
is represents missing values. For object dtypes,None
andnp.nan
is support for missing values.