Clarify relationship to ColumnTransformer #173

Open
micahjsmith opened this issue Sep 26, 2018 · 7 comments

Comments

@micahjsmith

Scikit-learn v0.20.0 was just released and includes the ColumnTransformer, which is closely related to the DataFrameMapper. It might be helpful for people who come to the sklearn-pandas project to see a note in the README explaining how the two approaches differ and when one should be preferred over the other.

@devforfu
Collaborator

devforfu commented Oct 7, 2018

@micahjsmith Yes, agreed, it seems like ColumnTransformer does a lot of the same work in a similar way. It is definitely worth mentioning in the README file. Would you like to make a PR with documentation describing the similarities and differences between the two?

@micahjsmith
Author

Sure, I can give it a shot

@ganesh-krishnan

Just checking on this issue. Looks like sklearn now has ColumnTransformer. As such, I'm not sure if there are any additional benefits to using sklearn-pandas. Would someone mind clarifying?

@micahjsmith
Author

micahjsmith commented Sep 9, 2020

I never followed up above...

But maybe we can start collecting the differences in this thread. ColumnTransformer is close to feature parity, though I presume the APIs may still change. The two are quite similar overall, with some minor differences.

API differences

| Functionality | DataFrameMapper | ColumnTransformer |
| --- | --- | --- |
| drop unmapped cols | `default = False` | `remainder = 'drop'` |
| drop specific cols | `drop_cols = ['A', 'B']` | `transformer = 'drop'` |
| passthrough unmapped cols | `default = None` | `remainder = 'passthrough'` |
| passthrough specific cols | `transformer = None` | `transformer = 'passthrough'` |
| output dataframe | `df_out = True` | n/a |
| apply prefix and suffix | `prefix` and `suffix` options | n/a |
| apply default transformer | `default = SomeTransformer()` | n/a |
| global prefix and suffix | `prefix` and `suffix` kwargs | n/a |
| feature naming | user-specified or automatic | user-specified or use `make_column_transformer` |
| column selection | `str` or `List[str]` | `str`, array-like of `str`, `int`, array-like of `int`, array-like of `bool`, `slice` or callable |
| treatment of sparse data | only if `sparse=True` and has sparse output | by default, configurable by `sparse_threshold` |
| supervised transformations | yes | yes |
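
To make the ColumnTransformer column of the table concrete, here is a minimal sketch, assuming scikit-learn 0.20 or later and pandas are installed. The toy DataFrame, its column names, and the transformer names (`scaled`, `onehot`, `ignored`) are hypothetical and only for illustration. It is not meant as a definitive recipe, just one way the drop, passthrough, naming, and sparse options might be combined.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "city": ["NY", "SF", "NY"],
    "user_id": [101, 102, 103],
    "score": [0.1, 0.5, 0.9],
})

ct = ColumnTransformer(
    transformers=[
        ("scaled", StandardScaler(), ["age"]),   # user-specified step name
        ("onehot", OneHotEncoder(), ["city"]),
        ("ignored", "drop", ["user_id"]),        # drop specific columns
    ],
    remainder="passthrough",  # pass through unmapped columns (here: "score")
    sparse_threshold=0.0,     # 0 means the stacked output is always dense
)
X = ct.fit_transform(df)
print(X.shape)  # (3, 4): 1 scaled + 2 one-hot + 1 passthrough column

# make_column_transformer derives step names from the transformer classes.
# Its default remainder is 'drop', so "user_id" and "score" are discarded.
auto = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(), ["city"]),
    sparse_threshold=0.0,
)
X_auto = auto.fit_transform(df)
print([name for name, _, _ in auto.transformers])
```

Going by the table, the DataFrameMapper counterparts of `remainder="passthrough"` and the `"drop"` step would be `default=None` and `drop_cols=[...]`.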

Other functionality

  • gen_features

Does this look about right? Are we missing anything?

@micahjsmith
Author

@ganesh-krishnan does the above look about right?

@ganesh-krishnan

Looks good per my understanding.

I'm not an expert in either by any means. I was struggling with which one to choose.
A table like this should be very valuable to folks trying to make a decision.

@arora123

A few basic differences between DataFrameMapper() and ColumnTransformer():
https://github.com/arora123/Python-for-Data-Science/blob/master/DataFrameMapper_Vs_Column_Transformer.ipynb
