ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147

Closed
jorisvandenbossche opened this issue Sep 24, 2018 · 7 comments · Fixed by #13253

Comments

@jorisvandenbossche
Member

Left-over to do from #9151 (comment)

The idea is to support DataFrames without converting them to a contiguous array. This conversion is not needed, as the transformer encodes the input column by column anyway, so it would be rather easy to preserve the datatypes per column.

This would avoid converting a potentially mixed-dtype DataFrame (eg ints and object strings) to a full object array.

This can introduce a slight change in behaviour (it can change the dtype of the categories_ in certain edge cases, eg when you have a mixture of float and int columns).

(Note that this does not yet necessarily mean special handling for certain pandas dtypes such as categorical dtype, see #12086. In an initial step, we could still do a check_array on each column / coerce each column to a numpy array.)
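
To illustrate the dtype point above, here is a minimal sketch using only pandas and NumPy. It does not show the encoder's actual implementation; the column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame: one integer column, one string column.
mixed = pd.DataFrame({
    "count": [1, 2, 1],
    "color": ["red", "blue", "red"],
})

# Converting the whole frame at once forces one common dtype: object.
as_array = np.asarray(mixed)
print(as_array.dtype)  # object

# Coercing each column separately keeps the integer column as integers.
per_column = [np.asarray(mixed[name]) for name in mixed.columns]
print(per_column[0].dtype)  # int64

# Edge case from the issue text: a mixture of int and float columns.
numeric = pd.DataFrame({"a": [1, 2, 1], "b": [0.5, 1.5, 0.5]})
whole = np.asarray(numeric)  # everything is upcast to float64

# Categories of column "a" found via the full array vs. via the column itself.
cats_from_array = np.unique(whole[:, 0])
cats_from_column = np.unique(np.asarray(numeric["a"]))
print(cats_from_array.dtype, cats_from_column.dtype)  # float64 int64
```

The last two lines show the behaviour change mentioned above: the categories found for the same column change from float to int once the column is no longer upcast together with its neighbours.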

@GMarzinotto
Contributor

If there is no-one working on this issue I can do it.

By the way, I believe pandas supports one-hot encoding; I found this on StackOverflow.

What are your thoughts on detecting that the input is a pandas DataFrame and using the pandas native encoder?

@jorisvandenbossche
Member Author

I don't think we want to use pd.get_dummies for now (assuming you are referring to that). Even apart from the question of whether we would want to depend on it, it does not give us everything that would be needed for the OneHotEncoder (eg specifying categories per column, handling unknown values, etc).
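
A small sketch of the difference, assuming a reasonably recent scikit-learn and pandas. The `train` and `test` frames and the category list are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test data; "purple" never appears in training.
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["red", "purple"]})

# pd.get_dummies derives its columns from whatever values it is given,
# so the two frames end up with different, incompatible columns.
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)
print(list(train_dummies.columns))  # ['color_blue', 'color_red']
print(list(test_dummies.columns))   # ['color_purple', 'color_red']

# OneHotEncoder accepts an explicit category list per column and can
# ignore values it has not been told about.
encoder = OneHotEncoder(
    categories=[["blue", "green", "red"]],
    handle_unknown="ignore",
)
encoder.fit(train)
encoded = encoder.transform(test).toarray()
print(encoded)
# "red" maps to [0, 0, 1]; the unknown "purple" maps to all zeros.
```

The output layout of the encoder is fixed at fit time, which is what a downstream estimator needs.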

But feel free to work on this!

@GMarzinotto
Contributor

Perfect, I'll work on this!

@plpxsk
Contributor

plpxsk commented Feb 23, 2019

Just to add, one key challenge when returning an array is mapping feature importances back to the original column names when you've applied OneHotEncoder.

It would be a big step forward to replace the prefixes x0_, x1_, etc with the proper column names.

See https://stackoverflow.com/q/54570947/3217870

@jorisvandenbossche
Member Author

We will try to tackle this one during the sprints in Paris this week.

xhluca pushed a commit to xhluca/scikit-learn that referenced this issue Apr 28, 2019
@amueller
Member

amueller commented Aug 6, 2019

related to #12086

@jorisvandenbossche
Member Author

This was actually solved by #13253 (handling a DataFrame column by column, preserving each column's dtype).
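
A quick way to check this behaviour, assuming a scikit-learn release that includes that change. The frame below is made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mixed-dtype frame: integers and strings side by side.
frame = pd.DataFrame({
    "count": [1, 2, 1],
    "color": ["red", "blue", "red"],
})

encoder = OneHotEncoder().fit(frame)

# With column-by-column handling, each entry of categories_ keeps the
# dtype of its source column rather than a shared object dtype.
for name, cats in zip(frame.columns, encoder.categories_):
    print(name, cats.dtype, cats.tolist())
```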

5 participants