OneHotEncoding - Defining a reference category #6053
Comments
Not sure if that was raised somewhere already. Multicollinearity is really not a problem in any model in scikit-learn. But feel free to create a pull request. The OneHotEncoder is being restructured quite heavily right now, though. |
I'm interested in working on this feature! I ran into some problems using a OneHotEncoder in a pipeline that used a Keras Neural Network as the classifier. I was attempting to transform a few columns of categorical features into a dummy variable representation and feed the resulting columns (plus some numerical variables that were passed through) into the NN for classification. However, the one hot encoding played poorly with the collinear columns, and my model performed poorly out of sample. I was eventually able to design a workaround, but it seems to me that it would be valuable to have a tool in scikit-learn that could do this simply. |
I think we'd accept a PR. I'm a bit surprised there's none yet. We also changed the OneHotEncoder quite a bit recently. You probably don't want to modify the "legacy" mode. A question is whether/how we allow users to specify which category to drop. In regularized models this actually makes a difference IIRC. |
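The point about regularized models can be checked with a small experiment. This is only a sketch on made-up data: it assumes the `drop` parameter as it exists in released scikit-learn (which postdates this comment), and the helper name `fitted_values` is hypothetical. With a penalty on the coefficients, each kept category is shrunk toward the dropped one, so the choice of reference changes the fit.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up for illustration): one categorical feature, three levels.
X = np.array([["A"], ["A"], ["B"], ["B"], ["C"], ["C"]], dtype=object)
y = np.array([0.0, 0.0, 1.0, 1.0, 5.0, 5.0])


def fitted_values(reference):
    """Fit a ridge model on k-1 dummy columns, omitting `reference`."""
    encoder = OneHotEncoder(drop=[reference])
    X_enc = encoder.fit_transform(X).toarray()
    # The intercept is not penalized; the dummy coefficients are.
    return Ridge(alpha=1.0).fit(X_enc, y).predict(X_enc)


pred_drop_a = fitted_values("A")
pred_drop_c = fitted_values("C")

# The two reference choices give different fitted values under the penalty.
print(np.round(pred_drop_a, 3))
print(np.round(pred_drop_c, 3))
```

Without the penalty (ordinary least squares), both encodings span the same column space once the intercept is included, so the fitted values would coincide.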
Reading through the comments on the old PR, I was thinking that those
options seem to be the natural choice. I'm in the midst of graduate school
applications right now so my time is somewhat limited, but this seems to be
something that is going to keep appearing in my work, so I'm going to have
to address this (or keep using workarounds) at some point.
On Wed, Nov 28, 2018 at 3:49 PM Andreas Mueller wrote:
I think we'd accept a PR. I'm a bit surprised there's none yet. We also
changed the OneHotEncoder quite a bit recently. You probably don't want to
modify the "legacy" mode. A question is whether/how we allow users to
specify which category to drop. In regularized models this actually makes a
difference IIRC.
We could have a parameter drop that's 'none' by default, and could be
'first' or a data structure with the values to drop. It could be a list/numpy
array of length n_features (all input features are categorical in the new
OneHotEncoder).
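For reference, a minimal sketch of how such a parameter behaves, assuming the `drop` argument that later shipped in scikit-learn (0.21 and newer), where the default is `None` rather than the string 'none' floated here. The toy data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two categorical features, three levels each (toy data).
X = np.array(
    [["red", "S"], ["green", "M"], ["blue", "L"], ["green", "S"]],
    dtype=object,
)

# drop="first": omit the first category of every feature
# (categories are sorted, so "blue" and "L" are dropped here).
first = OneHotEncoder(drop="first").fit(X)
X_first = first.transform(X).toarray()

# drop=[...]: one reference category per feature, in feature order.
chosen = OneHotEncoder(drop=["green", "S"]).fit(X)
X_chosen = chosen.transform(X).toarray()

print(first.categories_)
print(X_first.shape, X_chosen.shape)
print(chosen.drop_idx_)  # index of the dropped category within each feature
```

A row whose values are all reference categories, such as ("green", "S") under the second encoder, is encoded as all zeros.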
|
I've got a working implementation of this now--going to draw up some tests and then submit a PR later this week. |
@drewmjohnston I think #12884 solves this issue right? |
Yes, I think it should! I've also added functionality for manually specifying the dropped value for each feature. |
(this is along the lines of what @amueller suggested) |
Ok. Please add a reference to #12884 in your PR so reviewers know that both PRs address the same thing, yours being more general. Also feel free to use some of my code if you deem relevant. |
Sounds great--I think your testing suite will come in handy. Thank you for
the help :).
On Wed, Jan 2, 2019 at 2:53 PM Nicolas Hug wrote:
Ok.
Please add a reference to #12884 in your PR so
reviewers know that both PRs address the same thing, yours being more
general. Also feel free to use some of my code if you deem relevant.
|
In order to avoid multicollinearity in modelling, the number of dummy-coded variables needed should be one less than the number of categories. Therefore, it would be very good if OneHotEncoding could accept a reference category as an input variable.
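To make the request concrete, here is a small pure-Python sketch of dummy coding against a reference category. It is illustrative only: the function name `dummy_encode` and its signature are hypothetical and not part of scikit-learn.

```python
def dummy_encode(values, reference):
    """Encode categories as k-1 indicator columns, omitting `reference`."""
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError(f"reference {reference!r} not found in data")
    # Keep every category except the reference one.
    kept = [c for c in categories if c != reference]
    # One row per input value; the reference category becomes all zeros.
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


colors = ["red", "green", "blue", "green"]
columns, encoded = dummy_encode(colors, reference="blue")
print(columns)  # ['green', 'red']
print(encoded)  # [[0, 1], [1, 0], [0, 0], [1, 0]]
```

With three categories only two columns are produced, and the reference category "blue" is represented implicitly by the all-zero row.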