Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Adds missing value support to OneHotEncoder #17317

Merged

Conversation

thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented May 23, 2020

Reference Issues/PRs

Fixes #11996
Closes #15009
Closes #13028
Towards #15796

What does this implement/fix? Explain your changes.

Adds missing value support to OneHotEncoder.

For numerical data, np.nan is represents missing values. For object dtypes, None and np.nan is support for missing values.

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but needs docs and a whatsnew. Also, you should explain why using None and NaN in a single column is prohibited.

@@ -350,6 +363,27 @@ def _compute_drop_idx(self):
len(self.drop)))
missing_drops = [(i, val) for i, val in enumerate(self.drop)
if val not in self.categories_[i]]

missing_drops = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the missing_drops implementation from two lines above still work? And if not you should at least remove it? but set operations should work on NaN, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this hasn't been addressed, right? There is still an implementation two lines above that is then overwritten. And I am pretty sure that the old code still works. you can also use list.index which works on NaN as expected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since self.categories_[i] is a numpy array, I was trying to avoid converting it into another data structure and making a copy.

sklearn/utils/_encode.py Outdated Show resolved Hide resolved
sklearn/utils/_encode.py Outdated Show resolved Hide resolved
if none_in_diff and nan_in_diff:
raise ValueError("Input wiith both types of missing, None and "
"np.nan, is not supported")
if none_in_diff:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? Does sort fail to sort? Can you add a comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment may have lost its context. The nans were removed from the sets above because they do not work with set differencing in this context.

As for the Nones, we could have left the Nones in the set, but it would complicate the _extract_missing helper.

sklearn/utils/_encode.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_encode.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_encode.py Outdated Show resolved Hide resolved
@thomasjpfan
Copy link
Member Author

Updated PR with:

  1. Support for both np.nan and None support for object dtype.
  2. Updated examples that do not need an imputer anymore before the OneHotEncoder.

@NicolasHug
Copy link
Member

What's the motivation for treating None as missing?

doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
@jnothman
Copy link
Member

Test failures, @thomasjpfan

@thomasjpfan
Copy link
Member Author

Test failures are fixed!

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I am still slightly hopeful two places I comment on can be simplified, otherwise ready to merge I'd say :)

@@ -350,6 +363,27 @@ def _compute_drop_idx(self):
len(self.drop)))
missing_drops = [(i, val) for i, val in enumerate(self.drop)
if val not in self.categories_[i]]

missing_drops = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this hasn't been addressed, right? There is still an implementation two lines above that is then overwritten. And I am pretty sure that the old code still works. you can also use list.index which works on NaN as expected.

sklearn/preprocessing/_encoders.py Outdated Show resolved Hide resolved
sklearn/preprocessing/tests/test_encoders.py Outdated Show resolved Hide resolved


def _unique_python(values, *, return_inverse):
# Only used in `_uniques`, see docstring there for details
try:
uniques = sorted(set(values))
uniques_set = set(values)
missing_values_in_set = [value for value in (None, np.nan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is None not sortable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not:

sorted(['a', 'b', None])
# TypeError

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't match float('nan'), nor (np.array(0)/0).item(). Does it matter?

To be careful, I think you might be safer with:

        missing_values_in_set = [value for value in uniques_set
                                 if value is None or is_scalar_nan(value)]

sklearn/preprocessing/tests/test_encoders.py Outdated Show resolved Hide resolved
def test_ohe_missing_values_both_missing_values():
# test both types of missing of missing values are treated as its own
# category
X = np.array([['a', 'b', None, 'a', np.nan]], dtype=object).T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if it's necessary but having np.nan twice would check the deduplication logic (again, I know it's covered in other tests as well).

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm done here... Lgtm!

Let's solicit another review...

@jnothman
Copy link
Member

We have ?4 weeks for another review to push this into 0.24 ... volunteers?

Copy link
Member

@agramfort agramfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

besides LGTM

maybe I would not have done with a new MissingValues class but I understand the argument to enforce typing.

>>> enc.transform(X).toarray()
array([[0., 1., 0., 0., 1., 0.],
[1., 0., 0., 0., 0., 1.],
[0., 0., 1., 1., 0., 0.]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a note here about what happens if None and np.nan are present in the same column?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also pass explicitly here the handle_unknown parameter. To document it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated PR:

  1. The user guide now describes what happens when nan and None are present.
  2. The default value of handle_unknown='error' is passed explicitly.

Copy link
Member

@agramfort agramfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thx @thomasjpfan !

@cmarmo
Copy link
Contributor

cmarmo commented Oct 8, 2020

Two approvals here and the opportunity to close three issues and moving forward with the milestone!
@thomasjpfan do you mind syncing (just to be sure) and then this could probably be merged?

@lorentzenchr lorentzenchr merged commit eeddc17 into scikit-learn:master Oct 9, 2020
amrcode pushed a commit to amrcode/scikit-learn that referenced this pull request Oct 19, 2020
OneHotEncoder supports categorical features with missing values by
considering the missing values as an additional category.
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
OneHotEncoder supports categorical features with missing values by
considering the missing values as an additional category.
ogrisel added a commit to thomasjpfan/scikit-learn that referenced this pull request Oct 27, 2020
@amueller
Copy link
Member

OMG THIS WAS MERGED!!! I'm so happy!

@jnothman
Copy link
Member

jnothman commented Nov 1, 2020

Yes, thanks @thomasjpfan for putting in the hard work to make this finally happen!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle missing values in OneHotEncoder
8 participants