repr of column transformer unhelpful #13770

amueller · 2019-05-02T19:43:03Z

Here's a screenshot from a tutorial I did the other day:

The repr currently includes part of the dataframe, which I find of questionable value to begin with. Additionally, there's an ellipsis inside the dataframe, which is hard to see but means that there are actually two different dataframes being shown here (the first part is from the mask of the first transformer, the second from the mask of the second transformer).

I'm not sure what a good solution would be. Possibly a custom repr that hides the columns if they are too long to show? Or trying more magic in the general repr? @NicolasHug had a hard time making that much smarter when he tried in the fall.
Given how central the ColumnTransformer is, I think we should make sure it's easy to understand, including the repr.

NicolasHug · 2019-05-02T20:16:34Z

I'll take a look

jnothman · 2019-05-02T23:21:19Z

I'd never thought to use a mask like that, let alone with a series. I would use a list of column names where a callable is not possible...

amueller · 2019-05-03T21:30:15Z

Really? Why?
This is

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.read_csv("https://github.com/amueller/ml-workshop-1-of-4/raw/master/notebooks/data/adult.csv", index_col=0)
cat_features = data.dtypes == 'object'
ct = make_column_transformer((OneHotEncoder(sparse=False), cat_features),
                             (StandardScaler(), ~cat_features))

or something similar.
This seems like a pretty obvious and intuitive pattern to me. (Yes, I could have used remainder, but then the repr would be quite different, I think.)

amueller · 2019-05-03T21:31:54Z

Oh, I guess the person was using stable, so maybe in 0.21 this is less bad.
I'm getting

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('onehotencoder',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               n_values=None, sparse=False),
                                 age               False
workclass          True
education          True
education-num     False
marital-status     True
occupation...
hours-per-week    False
native-country     True
income             True
dtype: bool),
                                ('standardscaler',
                                 StandardScaler(copy=True, with_mean=True,
                                                with_std=True),
                                 age                True
workclass         False
education         False
education-num      True
marital-status    False
occupation        False
relationship      False
race              False
gender            False
capital-gain       True
capital-loss       True
hours-per-week     True
native-country    False
income            False
dtype: bool)],
                  verbose=False)

for the code above. Really friendly is different, though...

NicolasHug · 2019-05-03T21:47:30Z

There are 2 ways of "fixing" it that should not be too hard to implement:

Option 1: cut dataframes and numpy arrays after the first \n

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('onehotencoder',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               n_values=None, sparse=False),
                                 age               False...),
                                ('standardscaler',
                                 StandardScaler(copy=True, with_mean=True,
                                                with_std=True),
                                 age                True...)],
                  verbose=False)

Option 1 with change_only=True:

ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(sparse=False),
                                 age               False...),
                                ('standardscaler', StandardScaler(),
                                 age                True...)])
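To make Option 1 concrete, here is a minimal sketch of the truncation step, assuming the pretty printer already has the value in hand. `truncate_repr` is a hypothetical helper name for illustration, not something that exists in scikit-learn:

```python
def truncate_repr(value):
    """Return repr(value) cut after its first line.

    Hypothetical helper: "..." is appended only when something was dropped,
    so single-line reprs come back unchanged.
    """
    text = repr(value)
    first_line, newline, _rest = text.partition("\n")
    return first_line + "..." if newline else first_line
```

Applied to a boolean Series mask, this would keep only the first row of the Series repr, as in the output above.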

Option 2: don't cut, but indent correctly

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('onehotencoder',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               n_values=None, sparse=False),
                                 age               False
                                 workclass          True
                                 education          True
                                 education-num     False
                                 marital-status     True
                                 occupation...
                                 hours-per-week    False
                                 native-country     True
                                 income             True
                                 dtype: bool),
                                ('standardscaler',
                                 StandardScaler(copy=True, with_mean=True,
                                                with_std=True),
                                 age                True
                                 workclass         False
                                 education         False
                                 education-num      True
                                 marital-status    False
                                 occupation        False
                                 relationship      False
                                 race              False
                                 gender            False
                                 capital-gain       True
                                 capital-loss       True
                                 hours-per-week     True
                                 native-country    False
                                 income            False
                                 dtype: bool)],
                  verbose=False)
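Option 2 boils down to padding every continuation line of a nested repr to the current column. A rough sketch, assuming the printer knows the indentation width at that point (`indent_continuation` is a hypothetical name, not the actual implementation):

```python
def indent_continuation(text, width):
    """Indent every line after the first by `width` spaces.

    Hypothetical helper: the first line is left alone because the printer
    has already positioned the cursor at the right column for it.
    """
    first, *rest = text.split("\n")
    pad = " " * width
    return "\n".join([first] + [pad + line for line in rest])
```

With `width=33` this would line up the Series rows under `age` as in the output above.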

LMK which one you prefer

jnothman · 2019-05-04T09:43:16Z

If we had an easy tool for make_feature_selector(dtype=object) then that would be the right idiom.
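A rough sketch of what such a tool could look like, relying on the fact that ColumnTransformer accepts a callable as the column specification. The name `make_feature_selector` and its signature are taken from the comment above and are hypothetical; the sketch assumes the input is DataFrame-like, with `X.dtypes` mapping column names to dtypes. (Later scikit-learn releases ship `sklearn.compose.make_column_selector`, which covers this use case.)

```python
def make_feature_selector(dtype):
    """Hypothetical helper: build a callable that selects columns by dtype."""
    def select(X):
        # Assumption: X.dtypes supports .items() yielding (column, dtype),
        # as a pandas DataFrame's dtypes Series does.
        return [name for name, col_dtype in X.dtypes.items()
                if col_dtype == dtype]
    return select
```

The result can be passed where the boolean mask was used, e.g. `(OneHotEncoder(sparse=False), make_feature_selector(object))`, so the repr would show a function object instead of the contents of a Series.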

NicolasHug mentioned this issue Aug 3, 2019

really weird __repr__ #14546

Closed

thomasjpfan added the module:compose label Feb 27, 2022
