repr of column transformer unhelpful #13770

Open
amueller opened this issue May 2, 2019 · 6 comments

Comments

@amueller
Member

amueller commented May 2, 2019

Here's a screenshot from a tutorial I did the other day:
[image: screenshot of the ColumnTransformer repr]

The repr currently includes part of the dataframe, which I find of questionable value to begin with. Additionally, there's an ellipsis inside the dataframe, which is hard to see but means that there are actually two different dataframes being shown here (the first part is from the mask of the first transformer, the second from the mask of the second transformer).

I'm not sure what a good solution would be. Possibly a custom repr that hides the columns if they are too long to show? Or trying more magic in the general repr? @NicolasHug had a hard time making that much smarter when he tried in the fall.
Given how central the ColumnTransformer is, I think we should make sure it's easy to understand, including the repr.
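One way to read the "hide the columns if they are too long" idea is a small helper that falls back to a placeholder when the column spec does not fit on one short line. The sketch below is only an illustration under that assumption: `summarize_columns` and `max_chars` are hypothetical names, not part of scikit-learn.

```python
def summarize_columns(columns, max_chars=30):
    """Return repr(columns) if it is short and single-line, else a placeholder.

    Hypothetical helper for illustration; not scikit-learn API.
    """
    text = repr(columns)
    if len(text) <= max_chars and "\n" not in text:
        return text
    try:
        size = f" of length {len(columns)}"
    except TypeError:  # the object does not define len()
        size = ""
    return f"<{type(columns).__name__}{size}>"


print(summarize_columns(["age", "hours-per-week"]))  # ['age', 'hours-per-week']
print(summarize_columns([False, True] * 7))          # <list of length 14>
```

A short list of column names is shown as-is, while a long boolean mask collapses to its type and length.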

@NicolasHug
Member

I'll take a look

@jnothman
Member

jnothman commented May 2, 2019 via email

@amueller
Member Author

amueller commented May 3, 2019

Really? Why?
This is

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.read_csv("https://github.com/amueller/ml-workshop-1-of-4/raw/master/notebooks/data/adult.csv", index_col=0)
cat_features = data.dtypes == 'object'
ct = make_column_transformer((OneHotEncoder(sparse=False), cat_features),
                             (StandardScaler(), ~cat_features))

or something similar.
This seems like a pretty obvious and intuitive pattern to me. (Yes, I could have used remainder, but then the repr would be quite different, I think.)

@amueller
Member Author

amueller commented May 3, 2019

Oh, I guess the person was using stable, so maybe in 0.21 this is less bad.
I'm getting

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('onehotencoder',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               n_values=None, sparse=False),
                                 age               False
workclass          True
education          True
education-num     False
marital-status     True
occupation...
hours-per-week    False
native-country     True
income             True
dtype: bool),
                                ('standardscaler',
                                 StandardScaler(copy=True, with_mean=True,
                                                with_std=True),
                                 age                True
workclass         False
education         False
education-num      True
marital-status    False
occupation        False
relationship      False
race              False
gender            False
capital-gain       True
capital-loss       True
hours-per-week     True
native-country    False
income            False
dtype: bool)],
                  verbose=False)

for the code above. That's still not what I'd call friendly, though...

@NicolasHug
Member

NicolasHug commented May 3, 2019

There are 2 ways of "fixing" it that should not be too hard to implement:

Option 1: cut dataframes and numpy arrays after the first \n

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('onehotencoder',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               n_values=None, sparse=False),
                                 age               False...),
                                ('standardscaler',
                                 StandardScaler(copy=True, with_mean=True,
                                                with_std=True),
                                 age                True...)],
                  verbose=False)
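As a rough illustration of Option 1, the cut could look like the sketch below: take the repr, keep only the text before the first newline, and mark the cut with an ellipsis. `cut_after_first_line` and `FakeMask` are hypothetical names used only here; `FakeMask` stands in for a pandas boolean Series so the example needs no third-party imports.

```python
def cut_after_first_line(obj):
    """Return repr(obj) cut at the first newline, with '...' appended if cut."""
    text = repr(obj)
    first, sep, _rest = text.partition("\n")
    # `sep` is empty when the repr had no newline, so nothing was cut.
    return first + "..." if sep else first


class FakeMask:
    """Hypothetical stand-in for a pandas boolean Series (multi-line repr)."""

    def __repr__(self):
        return "age               False\nworkclass          True\ndtype: bool"


print(cut_after_first_line(FakeMask()))  # first line of the mask, then '...'
print(cut_after_first_line([1, 2, 3]))   # single-line reprs are left unchanged
```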

Option 1 with print_changed_only=True:

ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(sparse=False),
                                 age               False...),
                                ('standardscaler', StandardScaler(),
                                 age                True...)])
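The "changed only" display above prints just the parameters that differ from their defaults. In scikit-learn 0.21 this is switched on with `sklearn.set_config(print_changed_only=True)`. The sketch below shows the underlying idea on a toy class by comparing attributes with the `__init__` signature defaults; `changed_params`, `short_repr`, and `ToyEncoder` are hypothetical names, and this is not how scikit-learn itself implements it.

```python
import inspect


def changed_params(obj):
    """Return {name: value} for constructor parameters that differ from defaults.

    Assumes each parameter is stored on the instance under the same name.
    """
    signature = inspect.signature(type(obj).__init__)
    changed = {}
    for name, param in signature.parameters.items():
        if name == "self":
            continue
        value = getattr(obj, name)
        # Parameters without a default are always shown.
        # Note: `!=` is ambiguous for array-like values; real code needs more care.
        if param.default is inspect.Parameter.empty or value != param.default:
            changed[name] = value
    return changed


def short_repr(obj):
    """Build a repr that lists only the non-default parameters."""
    args = ", ".join(f"{k}={v!r}" for k, v in changed_params(obj).items())
    return f"{type(obj).__name__}({args})"


class ToyEncoder:
    """Hypothetical estimator-like class; not scikit-learn's OneHotEncoder."""

    def __init__(self, categories=None, drop=None, sparse=True,
                 handle_unknown="error"):
        self.categories = categories
        self.drop = drop
        self.sparse = sparse
        self.handle_unknown = handle_unknown


print(short_repr(ToyEncoder(sparse=False)))  # ToyEncoder(sparse=False)
print(short_repr(ToyEncoder()))              # ToyEncoder()
```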

Option 2: don't cut, but indent correctly

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('onehotencoder',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               n_values=None, sparse=False),
                                 age               False
                                 workclass          True
                                 education          True
                                 education-num     False
                                 marital-status     True
                                 occupation...
                                 hours-per-week    False
                                 native-country     True
                                 income             True
                                 dtype: bool),
                                ('standardscaler',
                                 StandardScaler(copy=True, with_mean=True,
                                                with_std=True),
                                 age                True
                                 workclass         False
                                 education         False
                                 education-num      True
                                 marital-status    False
                                 occupation        False
                                 relationship      False
                                 race              False
                                 gender            False
                                 capital-gain       True
                                 capital-loss       True
                                 hours-per-week     True
                                 native-country    False
                                 income            False
                                 dtype: bool)],
                  verbose=False)
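Option 2 amounts to padding every continuation line of a multi-line repr so it starts in the same column as the first line. A minimal sketch of that idea follows; `indent_continuation`, `format_pair`, and `FakeMask` are hypothetical names, with `FakeMask` standing in for a pandas boolean Series.

```python
def indent_continuation(text, column):
    """Pad every line after the first so that it starts at `column`."""
    first, *rest = text.split("\n")
    pad = " " * column
    return "\n".join([first] + [pad + line for line in rest])


def format_pair(name, value):
    """Format `name=value`, aligning a multi-line repr under its first line."""
    prefix = f"{name}="
    return prefix + indent_continuation(repr(value), len(prefix))


class FakeMask:
    """Hypothetical stand-in for a pandas boolean Series (multi-line repr)."""

    def __repr__(self):
        return "age               False\nworkclass          True\ndtype: bool"


# Every line of the mask now starts in the column right after 'columns='.
print(format_pair("columns", FakeMask()))
```

In a real pretty-printer the column would come from the current output position rather than from the length of a prefix, which is the part that makes this harder inside a nested repr.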

LMK which one you prefer

@jnothman
Member

jnothman commented May 4, 2019 via email


4 participants