
OneHotEncoding - Defining a reference category #6053

Closed
ghost opened this issue Dec 16, 2015 · 10 comments · Fixed by #12908

Comments

@ghost

ghost commented Dec 16, 2015

In order to avoid multicollinearity in modelling, the number of dummy-coded variables needed should be one less than the number of categories. Therefore, it would be very useful if OneHotEncoder could accept a reference category as an input parameter.
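For context, the fix referenced above eventually added a `drop` parameter to `OneHotEncoder` (available since scikit-learn 0.21). A minimal sketch of the requested behaviour, assuming that released API: with k categories and a dropped reference category, only k - 1 dummy columns are produced.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Three categories -> only two dummy columns once a reference
# category is dropped (here: the first category alphabetically).
X = np.array([["red"], ["green"], ["blue"], ["green"]])

enc = OneHotEncoder(drop="first")  # requires scikit-learn >= 0.21
dummies = enc.fit_transform(X).toarray()

print(enc.categories_)  # categories are sorted: blue, green, red
print(dummies.shape)    # k - 1 = 2 columns for k = 3 categories
```

Here 'blue' becomes the reference category, so a 'blue' row encodes as all zeros.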

@amueller
Member

Not sure if that was raised somewhere already. Multicollinearity is really not a problem for any model in scikit-learn. But feel free to create a pull request. The OneHotEncoder is being restructured quite heavily right now, though.

@drewmjohnston
Contributor

I'm interested in working on this feature! I ran into some problems using a OneHotEncoder in a pipeline that used a Keras Neural Network as the classifier. I was attempting to transform a few columns of categorical features into a dummy variable representation and feed the resulting columns (plus some numerical variables that were passed through) into the NN for classification. However, the one hot encoding played poorly with the collinear columns, and my model performed poorly out of sample. I was eventually able to design a workaround, but it seems to me that it would be valuable to have a tool in scikit-learn that could do this simply.
I see the above pull request, which began to implement this in the DictVectorizer class, but it looks like it was never merged (probably due to some unresolved review suggestions). Is there anything stopping this from being implemented in the OneHotEncoder case instead?
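The pipeline use case described above can be sketched with `ColumnTransformer`, assuming the `drop` parameter that was eventually released in scikit-learn 0.21 (the data and transformer names here are made up for illustration):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column, one numeric column.
X = np.array([["a", 1.0], ["b", 2.0], ["a", 3.0]], dtype=object)

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), [0])],
    remainder="passthrough",  # the numeric column passes through unchanged
)
Xt = ct.fit_transform(X)

# Depending on the version/data, the stacked output can be sparse;
# densify for inspection before feeding it to a downstream model.
Xt = Xt.toarray() if hasattr(Xt, "toarray") else Xt
print(Xt.shape)  # one dummy column (k=2, reference dropped) + one numeric column
```

With the reference category dropped, the dummy columns fed to the downstream classifier are no longer collinear with the intercept.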

@amueller
Member

I think we'd accept a PR; I'm a bit surprised there's none yet. We also changed the OneHotEncoder quite a bit recently, so you probably don't want to modify the "legacy" mode. One question is whether/how we allow users to specify which category to drop. In regularized models this actually makes a difference, IIRC.
We could have a parameter drop that's 'none' by default, and could be 'first' or a data structure with the values to drop, e.g. a list/numpy array of length n_features (all input features are categorical in the new OneHotEncoder).
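The design sketched above is roughly what shipped in scikit-learn 0.21: `drop` defaults to `None`, and also accepts `'first'` or an array-like of one reference category per input feature. A hedged sketch of the per-feature variant (the category values here are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two categorical input features; pick the reference category
# for each feature explicitly via an array of length n_features.
X = np.array([["male", "US"], ["female", "EU"], ["female", "US"]])

enc = OneHotEncoder(drop=np.array(["female", "US"]))
Xt = enc.fit_transform(X).toarray()

# One column dropped per feature: (2-1) + (2-1) = 2 columns remain.
print(Xt.shape)
```

A row matching both reference categories ("female", "US") encodes as all zeros.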

@drewmjohnston
Contributor

drewmjohnston commented Nov 28, 2018 via email

@drewmjohnston
Contributor

I've got a working implementation of this now--going to draw up some tests and then submit a PR later this week.

@NicolasHug
Member

@drewmjohnston I think #12884 solves this issue right?

@drewmjohnston
Contributor

Yes, I think it should! I've also added functionality for manually specifying the dropped value for each feature.

@drewmjohnston
Contributor

(this is along the lines of what @amueller suggested)

@NicolasHug
Member

Ok.

Please add a reference to #12884 in your PR so reviewers know that both PRs address the same thing, yours being more general. Also feel free to use some of my code if you deem relevant.

@drewmjohnston
Contributor

drewmjohnston commented Jan 2, 2019 via email
