
[MRG + 2] Add Drop Option to OneHotEncoder. #12908

Merged

@drewmjohnston
Contributor

drewmjohnston commented Jan 3, 2019

Reference Issues/PRs

Fixes #6488 and fixes #6053. This builds upon some of the code from #12884 (thanks @NicolasHug!), but also incorporates functionality which lets the user manually specify which category in each column they would like to be dropped, so this is a more general solution along the lines of what @amueller suggested in #6053. This is useful in some cases (such as OLS regression) where the dropped group affects the interpretation of coefficients.
This should also fix #9361, which has less functionality and appears stalled.

What does this implement/fix? Explain your changes.

This code implements a new parameter (`drop`) in OneHotEncoder, which can take any of three values:
- None (default), which keeps the existing behavior (no category is dropped).
- 'first', which drops the first category in each feature.
- a list of length n_features, which allows manual specification of the reference category to drop for each feature.
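A minimal sketch of the three modes, assuming the post-merge API (sparse output is densified with `.toarray()` for readability; exact keyword spellings may vary across scikit-learn versions):

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'US'], ['female', 'UK'], ['female', 'US']]

# drop='first': the first (sorted) category of each feature is the reference
enc = OneHotEncoder(drop='first')
out = enc.fit_transform(X).toarray()
# 'female' and 'UK' are dropped, so each feature yields one column:
# [[1. 1.]
#  [0. 0.]
#  [0. 1.]]

# list form: explicitly pick the reference category for each feature
enc2 = OneHotEncoder(drop=['male', 'UK'])
out2 = enc2.fit_transform(X).toarray()
```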

Any other comments?

This new feature does not work in legacy mode (as discussed in #6053), and it requires explicitly passing categories='auto' when the input is all integers, so as not to interfere with the ongoing change in how OneHotEncoder treats integer input.
This code is also incompatible with handle_unknown='ignore', since in that case an all-zero row is ambiguous: it could mean the sample was in the dropped (reference) category, or that its category was unknown.
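A sketch of the all-integer case (hypothetical data; on current scikit-learn, categories='auto' is the only behavior anyway, so the explicit spelling mattered mainly during the deprecation period):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0, 1],
              [1, 2],
              [2, 0]])

# categories='auto' opts out of the legacy integer handling;
# drop='first' then removes one column per feature
enc = OneHotEncoder(categories='auto', drop='first')
out = enc.fit_transform(X).toarray()
# 3 categories per feature, minus the dropped one -> 2 + 2 = 4 columns
```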

Closes #12884

@jnothman
Member

jnothman left a comment

Not yet looked in detail at transform or tests. Nice work!

@drewmjohnston

Contributor Author

drewmjohnston commented Jan 17, 2019

Thanks for the feedback @jnothman! I just got back from a vacation and am working my way through these.

@qinhanmin2014

Member

qinhanmin2014 commented Jan 25, 2019

When do we need to specify which features to drop? Maybe #12884 is enough?

@drewmjohnston

Contributor Author

drewmjohnston commented Jan 25, 2019

@qinhanmin2014 The main appeal of specifying what feature to drop is for situations in which the coefficient interpretation is important (and done relative to the dropped category), for example in unregularized regression.
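To make that concrete, a hypothetical sketch (data and category names invented for illustration): with an intercept and one dropped reference level, each OLS coefficient reads as the outcome difference relative to the dropped group.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# toy data: group means are a=1, b=2, c=4; 'a' is chosen as the reference
X = np.array([['a'], ['a'], ['b'], ['b'], ['c'], ['c']])
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0, 4.0])

Xt = OneHotEncoder(drop=['a']).fit_transform(X).toarray()
reg = LinearRegression().fit(Xt, y)

# intercept is the mean of the reference group 'a' (1.0);
# coefficients are the shifts of 'b' (+1.0) and 'c' (+3.0) relative to 'a'
```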

@amueller

Member

amueller commented Jan 27, 2019

I think it would be good to support what feature to drop. If you do this in regularized linear regression, picking which to drop will actually change the optimization problem.

@qinhanmin2014

Member

qinhanmin2014 commented Jan 28, 2019

@qinhanmin2014 The main appeal of specifying what feature to drop is for situations in which the coefficient interpretation is important (and done relative to the dropped category), for example in unregularized regression.

I think it would be good to support what feature to drop. If you do this in regularized linear regression, picking which to drop will actually change the optimization problem.

Thanks, now I agree that this one is better than #12884

Drew Johnston added some commits Jan 28, 2019

Drew Johnston
Added new way to encode features to be dropped. Added support for leaving all values of some features. Updated behavior to drop columns with the same value throughout by default.
Reviewed hunk in sklearn/preprocessing/_encoders.py:

```python
drop_value = self.drop_idx_[i]
"""
Add the cells where the drop value is present to the list
of those to be masked-out. Decrement the indices of the values
"""
```

@jnothman

jnothman Jan 28, 2019

Member

I see three options for implementing transform:

  1. ignore the added category when encoding, a bit like handle_unknown='ignore'
  2. remove the value once encoded as an int
  3. remove the columns once one-hot encoded

I think you've implemented #2. Why is this one chosen? What are its pros and cons in terms of code simplicity and computational complexity?

@drewmjohnston

drewmjohnston Jan 30, 2019

Author Contributor

To do the first option, it seems like I'd have to modify _transform(), which is in the BaseEncoder class. We're not adding drop options to the other classes that inherit from BaseEncoder, so I thought that changing this in BaseEncoder might not be the best approach.
The approach I took with 2) had some advantages, since I was able to mix it in to reuse some of the logic that was already used for masking unknown values. It also avoids having to do column slicing on an output that might be a CSR matrix (as would be the case in 3)).
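In spirit (a NumPy sketch with hypothetical names, not the PR's actual code), option 2 amounts to dropping on the integer codes before building the one-hot matrix:

```python
import numpy as np

# hypothetical integer-encoded feature column with 3 categories
X_int = np.array([0, 2, 1, 2])
drop_idx = 1                        # category chosen as the reference

keep = X_int != drop_idx            # rows in the reference group stay all-zero
codes = X_int - (X_int > drop_idx)  # shift codes above the dropped one down by 1

out = np.zeros((X_int.shape[0], 3 - 1))
out[np.flatnonzero(keep), codes[keep]] = 1.0
# out -> [[1. 0.]
#         [0. 1.]
#         [0. 0.]
#         [0. 1.]]
```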

@drewmjohnston

drewmjohnston Feb 4, 2019

Author Contributor

@jnothman do you think that one of the other approaches is superior?

@drewmjohnston

Contributor Author

drewmjohnston commented Feb 8, 2019

@jnothman any further comments?

@jnothman
Member

jnothman left a comment

Sorry this is only a partial review

@jnothman

Member

jnothman commented Feb 25, 2019

Discussing with @jorisvandenbossche, we think it's unnecessary complexity to support None being in the list of drops.... this can be achieved, if the user needs it, with a ColumnTransformer. Supporting None in the list adds unnecessary complexity to the code, and may make it hard to support having None be a valid (or missing value) category in the input in the future. I hope it's not too much effort, @drewmjohnston, to pull back that feature.
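For the record, a hedged sketch of the ColumnTransformer alternative (column indices are hypothetical): apply `drop` only to the features that need it, and a plain encoder to the rest.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# drop a reference level for column 0 only; keep all levels of column 1
ct = ColumnTransformer(
    [('dropped', OneHotEncoder(drop='first'), [0]),
     ('full', OneHotEncoder(), [1])],
    sparse_threshold=0,  # force a dense array for easy inspection
)

X = [['a', 'x'], ['b', 'y'], ['a', 'y']]
out = ct.fit_transform(X)
# column 0 contributes 1 column (the 'b' indicator), column 1 contributes 2
```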

@drewmjohnston

Contributor Author

drewmjohnston commented Feb 25, 2019

That's not a problem--it'll make the code cleaner and allow us to check the dtype of drop_idx too. I'll incorporate this and re-push.

Drew Johnston added some commits Feb 25, 2019

@jnothman
Member

jnothman left a comment

Thanks for making those changes, @drewmjohnston

@jorisvandenbossche
Contributor

jorisvandenbossche left a comment

Latest changes look all good, thanks!

@drewmjohnston

Contributor Author

drewmjohnston commented Feb 26, 2019

Great--made the small documentation fix you recommended.

@drewmjohnston

Contributor Author

drewmjohnston commented Feb 26, 2019

@jorisvandenbossche OK, sounds good. I removed this in a moment of confusion when I thought this was going into 0.22, not 0.21.

@jorisvandenbossche

Contributor

jorisvandenbossche commented Feb 26, 2019

@drewmjohnston Can you also add back the test? (it's in here: 2eb2c16)

@drewmjohnston

Contributor Author

drewmjohnston commented Feb 26, 2019

OK, removed the extra code and added the tests for the drop/n_values interactions.

@jorisvandenbossche
Contributor

jorisvandenbossche left a comment

OK, looks all good now!

@drewmjohnston drewmjohnston changed the title [MRG + 1] Add Drop Option to OneHotEncoder. [MRG + 2] Add Drop Option to OneHotEncoder. Feb 26, 2019

@amueller

Member

amueller commented Feb 26, 2019

Merging as it has two +1s, and AppVeyor is kinda redundant now.

@NicolasHug

Contributor

NicolasHug commented Feb 26, 2019

You didn't merge, @amueller?

@jorisvandenbossche jorisvandenbossche merged commit 350cd4a into scikit-learn:master Feb 26, 2019

11 checks passed:

- LGTM analysis (C/C++): no code changes detected
- LGTM analysis (JavaScript): no code changes detected
- LGTM analysis (Python): no new or fixed alerts
- ci/circleci (deploy): tests passed
- ci/circleci (doc): tests passed
- ci/circleci (doc-min-dependencies): tests passed
- ci/circleci (lint): tests passed
- codecov/patch: 100% of diff hit (target 92.48%)
- codecov/project: 92.49% (+0.01%) compared to 314686a
- continuous-integration/appveyor/pr: AppVeyor build succeeded
- continuous-integration/travis-ci/pr: Travis CI build passed
@jorisvandenbossche

Contributor

jorisvandenbossche commented Feb 26, 2019

@drewmjohnston Thanks a lot!

@jorisvandenbossche jorisvandenbossche added this to the 0.21 milestone Feb 26, 2019

@drewmjohnston

Contributor Author

drewmjohnston commented Feb 26, 2019

Great! Thanks for the help, everyone. You guys make contributing to this a breeze.

@drewmjohnston drewmjohnston deleted the drewmjohnston:leave_one_out_ohe branch Feb 26, 2019

@jrbourbeau jrbourbeau referenced this pull request Feb 28, 2019

Merged

Fix sklearn dev tests #474

Kiku-git added a commit to Kiku-git/scikit-learn that referenced this pull request Mar 4, 2019
