ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147
Comments
If no one is working on this issue, I can do it. By the way, I believe pandas supports one-hot encoding (I found this on StackOverflow). What are your thoughts on detecting that the input is a pandas DataFrame and using the pandas native encoder?
I don't think we want to use But feel free to work on this!
Perfect, I'll work on this!
Just to add, one key challenge when returning an array is mapping feature importances back to the original column names when you've applied OneHotEncoder. It would be a big step forward to replace the prefixes
We will try to tackle this one during the sprints in Paris this week.
…erting to array scikit-learn#12147 (scikit-learn#13253)" This reverts commit d94af6f.
Related to #12086.
This was actually solved by #13253 (handling a DataFrame column by column, preserving each column's dtype).
Left-over to do from #9151 (comment)
The idea is to support DataFrames without converting them to a contiguous array. This conversion is not needed, as the transformer encodes the input column by column anyway, so it would be rather easy to preserve the datatypes per column.
This would avoid converting a potentially mixed-dtype DataFrame (e.g. ints and object strings) to a full object array.
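To illustrate the point about mixed dtypes, here is a minimal sketch using only pandas and NumPy. The column names and data are made up for illustration, and `np.unique` stands in for the per-column category detection; this is not scikit-learn's actual implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame: one integer column, one string column.
df = pd.DataFrame(
    {"size": [1, 2, 1, 3], "color": ["red", "blue", "red", "green"]}
)

# Converting the whole frame to a single array forces a common dtype,
# which for ints mixed with strings is object.
whole_dtype = np.asarray(df).dtype

# Handling the frame column by column keeps each column's own dtype.
per_column_categories = [np.unique(df[col].to_numpy()) for col in df.columns]

print(whole_dtype)                      # dtype of the full conversion
print(per_column_categories[0].dtype)   # integer dtype is preserved
print(per_column_categories[1])         # sorted unique strings
```

With the column-by-column approach, the integer column never passes through an object array, so its categories stay integers.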
This can introduce a slight change in behaviour (it can change the `dtype` of the `categories_` in certain edge cases, e.g. when you had a mixture of float and int columns).

(Note that this does not yet necessarily mean special handling for certain pandas dtypes such as the categorical dtype, see #12086. As an initial step, we could still do a `check_array` on each column / coerce each column to a numpy array.)
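The float/int edge case mentioned above can be sketched as follows. This is an assumption-laden illustration: the frame is hypothetical and `np.unique` again stands in for how categories might be collected, rather than reproducing the encoder's code.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing an int column and a float column.
df = pd.DataFrame({"count": [1, 2, 2], "ratio": [0.5, 1.5, 0.5]})

# Whole-frame conversion upcasts everything to a common float dtype,
# so categories found for the int column come out as floats.
whole = np.asarray(df)
cats_from_array = [np.unique(whole[:, i]) for i in range(whole.shape[1])]

# Column-by-column conversion keeps the int column as ints.
cats_per_column = [np.unique(df[col].to_numpy()) for col in df.columns]

print(cats_from_array[0].dtype)   # float dtype after upcasting
print(cats_per_column[0].dtype)   # integer dtype preserved
```

The category values are the same in both cases; only their dtype differs, which is the behaviour change the issue warns about.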