ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147

Closed
jorisvandenbossche opened this issue Sep 24, 2018 · 7 comments · Fixed by #13253

Comments

@jorisvandenbossche (Member) commented Sep 24, 2018

Left-over to do from #9151 (comment)

The idea is to support DataFrames without converting them to a contiguous array. This conversion is not needed, as the transformer encodes the input column by column anyway, so it would be rather easy to preserve the dtype of each column.

This would avoid converting a potentially mixed-dtype DataFrame (eg ints and object strings) to a full object array.

This can introduce a slight change in behaviour (it can change the dtype of the categories_ in certain edge cases, eg when you had a mixture of float and int columns).

(Note that this does not yet necessarily mean having special handling for certain pandas dtypes such as categorical dtype, see #12086. In an initial step, we could still do a check_array on each column / coerce each column to a numpy array.)
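To illustrate the dtype problem described above, here is a minimal sketch using only pandas and NumPy. The column names and values are made up for illustration; it shows what whole-frame conversion does compared with per-column conversion, and is not the encoder's actual code path.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype input: one integer column, one string column.
df = pd.DataFrame({"count": [1, 2, 3], "label": ["x", "y", "x"]})

# Converting the whole frame forces one common dtype for all columns.
whole = df.to_numpy()
print(whole.dtype)

# Converting column by column keeps each column's own dtype.
per_column = [df[name].to_numpy() for name in df.columns]
print([col.dtype.kind for col in per_column])

# The float/int edge case: converted together, the int column is upcast.
numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
together = numeric.to_numpy()
int_alone = numeric["i"].to_numpy()
print(together.dtype, int_alone.dtype)
```

In the first frame the integers end up inside an object array, which is the conversion the proposal wants to avoid.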

@GMarzinotto (Contributor) commented Oct 13, 2018

If no one is working on this issue, I can do it.

By the way, I believe pandas supports one-hot encoding; I found this on StackOverflow.

What are your thoughts on detecting that the input is a pandas DataFrame and using the pandas native encoder?

@jorisvandenbossche (Member, Author) commented Oct 13, 2018

I don't think we want to use pd.get_dummies for now (assuming you are referring to that). Even apart from the question of whether we would want to depend on it, it does not give us everything that would be needed for the OneHotEncoder (eg specifying categories per column, handling unknown values, etc).

But feel free to work on this!
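A small sketch of the difference mentioned above, assuming a recent scikit-learn and made-up data: pd.get_dummies derives its output columns from whatever values are present, while OneHotEncoder can be given fixed categories per column and told to ignore unknown values.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test frames; "green" never appears in training.
train = pd.DataFrame({"color": ["red", "blue"], "size": [1, 2]})
test = pd.DataFrame({"color": ["green"], "size": [2]})

# get_dummies produces different columns for different data.
train_cols = list(pd.get_dummies(train[["color"]]).columns)
test_cols = list(pd.get_dummies(test[["color"]]).columns)
print(train_cols, test_cols)

# OneHotEncoder: explicit categories per column, unknowns encoded as all zeros.
enc = OneHotEncoder(
    categories=[["blue", "red"], [1, 2, 3]],
    handle_unknown="ignore",
)
enc.fit(train)
encoded = enc.transform(test).toarray()
print(encoded)
```

The encoded row always has five columns (two for color, three for size), regardless of which values show up at transform time.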

@GMarzinotto (Contributor) commented Oct 13, 2018

Perfect, I'll work on this!

@pavopax (Contributor) commented Feb 23, 2019

Just to add, one key challenge when returning an array is mapping feature importances back to the original column names when you've applied OneHotEncoder.

It would be a big step forward to replace the prefixes x0_, x1_, etc with the proper column names.

See https://stackoverflow.com/q/54570947/3217870
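As a possible workaround for the naming problem above, the output names can be rebuilt by hand from the fitted categories_ attribute and the DataFrame's own column names. This is only a sketch with made-up data; the "column_category" naming scheme is an assumption chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical input frame.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 1]})

enc = OneHotEncoder().fit(df)

# categories_ holds one array of categories per input column, in column order,
# so pairing it with df.columns gives one name per encoded output column.
names = [
    f"{column}_{category}"
    for column, categories in zip(df.columns, enc.categories_)
    for category in categories
]
print(names)
```

These names line up with the columns of enc.transform(df), so they can be zipped with a model's feature importances.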

@jorisvandenbossche (Member, Author) commented Feb 25, 2019

We will try to tackle this one during the sprints in Paris this week.

@jorisvandenbossche jorisvandenbossche added this to To do in Sprint Paris 2019 via automation Feb 25, 2019
@jorisvandenbossche jorisvandenbossche moved this from To do to In progress in Sprint Paris 2019 Feb 25, 2019
@jnothman jnothman removed this from In progress in Sprint Paris 2019 Feb 28, 2019
jorisvandenbossche added a commit that referenced this issue Mar 1, 2019
xhlulu added a commit to xhlulu/scikit-learn that referenced this issue Apr 28, 2019
xhlulu added a commit to xhlulu/scikit-learn that referenced this issue Apr 28, 2019
@amueller (Member) commented Aug 6, 2019

Related to #12086

@adrinjalali adrinjalali added this to To do in Pandas Oct 21, 2019
@jorisvandenbossche (Member, Author) commented Oct 28, 2019

This was actually solved by #13253 (handling a DataFrame column by column, preserving each column's dtype).
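A quick way to check this behaviour, assuming a scikit-learn release that includes #13253, is to fit an encoder on a DataFrame with an int and a float column and inspect the dtype of each entry in categories_. The data below is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame mixing an integer and a float column.
df = pd.DataFrame({"i": [1, 2, 1], "f": [0.5, 1.5, 0.5]})

enc = OrdinalEncoder().fit(df)

# One dtype kind per column: "i" for integer, "f" for float.
kinds = [categories.dtype.kind for categories in enc.categories_]
print(kinds)
```

If the frame were converted to a single array first, both entries would be expected to come out as float.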

Pandas automation moved this from To do to Done Oct 28, 2019