
OneHotEncoding - Defining a reference category #6053

Closed
ghost opened this issue Dec 16, 2015 · 10 comments · Fixed by #12908

Comments

@ghost

ghost commented Dec 16, 2015

In order to avoid multicollinearity in modelling, the number of dummy-coded variables needed should be one less than the number of categories. Therefore, it would be very useful if OneHotEncoder could accept a reference category as an input parameter.
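For context, the fix referenced above eventually added a `drop` parameter to `OneHotEncoder` (available since scikit-learn 0.21). A minimal sketch of the requested behaviour, assuming that released API: with k categories and a dropped reference category, only k - 1 dummy columns are produced.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Three categories -> only two dummy columns once a reference
# category is dropped (here: the first category alphabetically).
X = np.array([["red"], ["green"], ["blue"], ["green"]])

enc = OneHotEncoder(drop="first")  # requires scikit-learn >= 0.21
dummies = enc.fit_transform(X).toarray()

print(enc.categories_)  # categories are sorted: blue, green, red
print(dummies.shape)    # k - 1 = 2 columns for k = 3 categories
```

Here 'blue' becomes the reference category, so a 'blue' row encodes as all zeros.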

@amueller
Member

Not sure if that was raised somewhere already. Multicollinearity is really not a problem for any model in scikit-learn. But feel free to create a pull request. The OneHotEncoder is being restructured quite heavily right now, though.

@drewmjohnston
Contributor

I'm interested in working on this feature! I ran into some problems using a OneHotEncoder in a pipeline that used a Keras Neural Network as the classifier. I was attempting to transform a few columns of categorical features into a dummy variable representation and feed the resulting columns (plus some numerical variables that were passed through) into the NN for classification. However, the one hot encoding played poorly with the collinear columns, and my model performed poorly out of sample. I was eventually able to design a workaround, but it seems to me that it would be valuable to have a tool in scikit-learn that could do this simply.
I see the above pull request, which began to implement this in the DictVectorizer class, but it looks like it was never merged (probably due to some unresolved review suggestions). Is there anything stopping this from being implemented in the OneHotEncoder case instead?
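The pipeline use case described above can be sketched with `ColumnTransformer`, assuming the `drop` parameter that was eventually released in scikit-learn 0.21 (the data and transformer names here are made up for illustration):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column, one numeric column.
X = np.array([["a", 1.0], ["b", 2.0], ["a", 3.0]], dtype=object)

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), [0])],
    remainder="passthrough",  # the numeric column passes through unchanged
)
Xt = ct.fit_transform(X)

# Depending on the version/data, the stacked output can be sparse;
# densify for inspection before feeding it to a downstream model.
Xt = Xt.toarray() if hasattr(Xt, "toarray") else Xt
print(Xt.shape)  # one dummy column (k=2, reference dropped) + one numeric column
```

With the reference category dropped, the dummy columns fed to the downstream classifier are no longer collinear with the intercept.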

@amueller
Member

I think we'd accept a PR; I'm a bit surprised there's none yet. We also changed the OneHotEncoder quite a bit recently, so you probably don't want to modify the "legacy" mode. One question is whether/how we allow users to specify which category to drop. In regularized models this actually makes a difference, IIRC.
We could have a parameter drop that's 'none' by default, and could be 'first' or a data structure with the values to drop, e.g. a list/numpy array of length n_features (all input features are categorical in the new OneHotEncoder).
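The design sketched above is roughly what shipped in scikit-learn 0.21: `drop` defaults to `None`, and also accepts `'first'` or an array-like of one reference category per input feature. A hedged sketch of the per-feature variant (the category values here are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two categorical input features; pick the reference category
# for each feature explicitly via an array of length n_features.
X = np.array([["male", "US"], ["female", "EU"], ["female", "US"]])

enc = OneHotEncoder(drop=np.array(["female", "US"]))
Xt = enc.fit_transform(X).toarray()

# One column dropped per feature: (2-1) + (2-1) = 2 columns remain.
print(Xt.shape)
```

A row matching both reference categories ("female", "US") encodes as all zeros.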

@drewmjohnston
Contributor

drewmjohnston commented Nov 28, 2018 via email

@drewmjohnston
Contributor

I've got a working implementation of this now--going to draw up some tests and then submit a PR later this week.

@NicolasHug
Member

@drewmjohnston I think #12884 solves this issue right?

@drewmjohnston
Contributor

Yes, I think it should! I've also added functionality for manually specifying the dropped value for each feature.

@drewmjohnston
Contributor

(this is along the lines of what @amueller suggested)

@NicolasHug
Member

Ok.

Please add a reference to #12884 in your PR so reviewers know that both PRs address the same thing, yours being more general. Also feel free to use some of my code if you deem relevant.

@drewmjohnston
Contributor

drewmjohnston commented Jan 2, 2019 via email
