
[MRG + 2] Add Drop Option to OneHotEncoder. #12908

Merged

Conversation

@drewmjohnston (Contributor) commented Jan 3, 2019

Reference Issues/PRs

Fixes #6488 and fixes #6053. This builds upon some of the code from #12884 (thanks @NicolasHug!), but also adds functionality that lets the user manually specify which category in each column should be dropped, making this a more general solution along the lines of what @amueller suggested in #6053. This is useful in some cases (such as OLS regression) where the dropped group affects the interpretation of the coefficients.
This should also fix #9361, which has less functionality and appears stalled.

What does this implement/fix? Explain your changes.

This code implements a new parameter, drop, in OneHotEncoder, which accepts any of three values:

- None (default), which keeps the existing behavior (no category is dropped).
- 'first', which drops the first category of each feature.
- a list of length n_features, which allows manual specification of the reference category for each feature.
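
For illustration, a minimal usage sketch (invented data; assuming scikit-learn 0.21, the release this PR targets, where sparse=False requests a dense output):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['male', 'US'], ['female', 'UK'], ['female', 'US']])

# Default: no category is dropped (2 + 2 = 4 output columns).
enc = OneHotEncoder(sparse=False).fit(X)
print(enc.transform(X).shape)  # (3, 4)

# drop='first': the first category of each feature becomes the reference.
enc = OneHotEncoder(drop='first', sparse=False).fit(X)
print(enc.transform(X).shape)  # (3, 2)

# drop as a list: manually pick the reference category for each feature.
enc = OneHotEncoder(drop=['male', 'US'], sparse=False).fit(X)
print(enc.drop_idx_)  # index of the dropped category within each feature
```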

Any other comments?

This new feature does not work in legacy mode (this was discussed in #6053), and it requires manually specifying categories='auto' when the input is all integers, so as not to interfere with the ongoing change in the treatment of integers in OneHotEncoder.
This code is also incompatible with handle_unknown='ignore', since in that case it is impossible to tell whether an all-zero row encodes the dropped reference category or an unknown category.
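
For example, a minimal sketch of the incompatibility (assuming the merged 0.21 behavior of raising a ValueError at fit time):

```python
from sklearn.preprocessing import OneHotEncoder

# An all-zero row would be ambiguous between "reference category" and
# "unknown category", so this combination is rejected.
enc = OneHotEncoder(drop='first', handle_unknown='ignore')
try:
    enc.fit([['a'], ['b']])
except ValueError as exc:
    print(exc)
```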

Closes #12884

@jnothman (Member) left a comment

Not yet looked in detail at transform or tests. Nice work!

(Ten review threads on sklearn/preprocessing/_encoders.py, since resolved.)
@drewmjohnston (Contributor, Author) commented Jan 17, 2019

Thanks for the feedback @jnothman! I just got back from a vacation and am working my way through these.

@qinhanmin2014 (Member) commented Jan 25, 2019

When do we need to specify which category to drop? Maybe #12884 is enough?

@drewmjohnston (Contributor, Author) commented Jan 25, 2019

@qinhanmin2014 The main appeal of specifying which category to drop is for situations in which coefficient interpretation matters (interpretation is relative to the dropped category), for example in unregularized regression.
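
For illustration, a minimal sketch with invented data (assuming scikit-learn 0.21): the intercept becomes the mean outcome of the reference category, and the remaining coefficient is the offset relative to it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Invented data: the outcome depends on one categorical feature.
X = np.array([['control'], ['treated'], ['control'], ['treated']])
y = np.array([1.0, 3.0, 1.2, 2.8])

# With 'control' dropped, it becomes the reference group.
enc = OneHotEncoder(drop=['control'], sparse=False)
model = LinearRegression().fit(enc.fit_transform(X), y)
print(model.intercept_)  # ~1.1, the mean outcome for 'control'
print(model.coef_)       # ~[1.8], offset of 'treated' vs. 'control'
```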

@amueller (Member) commented Jan 27, 2019

I think it would be good to support specifying which category to drop. If you do this in regularized linear regression, picking which one to drop will actually change the optimization problem.
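
A quick sketch of that point with invented data: the L2 penalty applies to the offsets from the reference category, so changing the reference changes what gets shrunk, and the fits genuinely differ.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

X = np.array([['a'], ['b'], ['c'], ['a'], ['b'], ['c']])
y = np.array([1.0, 2.0, 4.0, 1.2, 2.1, 3.9])

# Fit the same ridge model with two different reference categories.
for ref in (['a'], ['c']):
    Xt = OneHotEncoder(drop=ref, sparse=False).fit_transform(X)
    model = Ridge(alpha=1.0).fit(Xt, y)
    print(ref, np.round(model.predict(Xt), 3))  # predictions differ
```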

@qinhanmin2014 (Member) commented Jan 28, 2019

> The main appeal of specifying which category to drop is for situations in which coefficient interpretation matters (interpretation is relative to the dropped category), for example in unregularized regression.

> I think it would be good to support specifying which category to drop. If you do this in regularized linear regression, picking which one to drop will actually change the optimization problem.

Thanks, now I agree that this one is better than #12884

Drew Johnston added 2 commits Jan 28, 2019
…ving all values of some features. Updated behavior to by default drop columns with same value throughout
Review thread on sklearn/preprocessing/_encoders.py:
```python
drop_value = self.drop_idx_[i]
# Add the cells where the drop value is present to the list
# of those to be masked out. Decrement the indices of the values
```

@jnothman (Member) commented Jan 28, 2019

I see three options for implementing transform:

1. ignore the dropped category when encoding, a bit like handle_unknown='ignore'
2. remove the value once encoded as an int
3. remove the columns once one-hot encoded

I think you've implemented option 2. Why was this one chosen? What are its pros and cons in terms of code simplicity and computational complexity?

@drewmjohnston (Contributor, Author) commented Jan 30, 2019

To do the first option, it seems I'd have to modify _transform(), which lives in the BaseEncoder class. Since we're not adding drop options to the other classes that inherit from BaseEncoder, changing BaseEncoder didn't seem like the best approach.
The approach I took with option 2 had some advantages: I was able to reuse some of the logic already in place for masking unknown values, and it avoids column slicing on an output that might be a CSR matrix (as option 3 would require).
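
A hypothetical sketch of the idea behind option 2 (not the PR's actual code): once a feature's values are integer codes, mask out the dropped code and shift the codes above it down, so the remaining codes index the reduced one-hot columns directly.

```python
import numpy as np

# Integer codes for one feature, e.g. from an ordinal encoding step.
codes = np.array([0, 2, 1, 2, 0])
drop_code = 1  # code of the reference category to drop

# Mask out occurrences of the dropped code...
keep = codes != drop_code
# ...and decrement the codes above it, closing the gap it leaves.
adjusted = codes[keep] - (codes[keep] > drop_code)
print(adjusted)  # [0 1 1 0]
```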

@drewmjohnston (Contributor, Author) commented Feb 4, 2019

@jnothman do you think that one of the other approaches is superior?

@drewmjohnston (Contributor, Author) commented Feb 8, 2019

@jnothman any further comments?

@jnothman (Member) left a comment

Sorry, this is only a partial review.

@jnothman (Member) left a comment

Thanks for making those changes, @drewmjohnston

@jorisvandenbossche (Member) left a comment

Latest changes look all good, thanks!

@drewmjohnston (Contributor, Author) commented Feb 26, 2019

Great! I made the small documentation fix you recommended.

@drewmjohnston (Contributor, Author) commented Feb 26, 2019

@jorisvandenbossche OK, sounds good. I removed this in a moment of confusion when I thought this was going into 0.22, not 0.21.

@jorisvandenbossche (Member) commented Feb 26, 2019

@drewmjohnston Can you also add back the test? (it's in here: 2eb2c16)

@drewmjohnston (Contributor, Author) commented Feb 26, 2019

OK, removed the extra code and added the tests for the drop/n_values interactions.

@jorisvandenbossche (Member) left a comment

OK, looks all good now!

@drewmjohnston changed the title from [MRG + 1] Add Drop Option to OneHotEncoder. to [MRG + 2] Add Drop Option to OneHotEncoder. Feb 26, 2019
@amueller (Member) commented Feb 26, 2019

Merging, as it has two +1s and AppVeyor is kinda redundant now.

@NicolasHug (Member) commented Feb 26, 2019

You didn't merge, @amueller?

@jorisvandenbossche merged commit 350cd4a into scikit-learn:master Feb 26, 2019
18 checks passed
@jorisvandenbossche (Member) commented Feb 26, 2019

@drewmjohnston Thanks a lot!

@jorisvandenbossche added this to the 0.21 milestone Feb 26, 2019
@drewmjohnston (Contributor, Author) commented Feb 26, 2019

Great! Thanks for the help, everyone. You guys make contributing to this a breeze.

@drewmjohnston deleted the drewmjohnston:leave_one_out_ohe branch Feb 26, 2019
xhlulu added 3 commits to xhlulu/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde added a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019