
[MRG + 1] Fix pprint ellipsis #13436

Merged: 15 commits into scikit-learn:master on Apr 18, 2019

Conversation

NicolasHug (Member)

Reference Issues/PRs

Fixes #13372

What does this implement/fix? Explain your changes.

When the repr of an estimator is really long (more than 700 non-blank characters), we use a bruteforce ellipsis and cut the string in the middle. The old behaviour cut the string without ignoring blank characters, so indentation counted toward the limit, leading to unnecessarily short reprs.

This PR also fixes a weird side effect of the ellipsis that would immediately append the right side of the string after the ellipsis, leading to confusing reprs. We now keep the whole line of the right side.
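
For illustration, here is a minimal sketch of the intended behaviour (the helper name, the exact regexes, and the line-boundary handling are assumptions for exposition, not the merged implementation; the corner case where no \n separates the two sides, discussed further down, is ignored):

import re

N_CHAR_MAX = 700  # cutoff from the description above

def ellipsize(repr_):
    # only non-blank characters count toward the limit
    if len(re.sub(r'\s', '', repr_)) <= N_CHAR_MAX:
        return repr_  # short enough: no ellipsis needed
    lim = N_CHAR_MAX // 2  # approx. non-blank chars kept on each end
    # end of the lim-th non-blank character from the left
    left = re.match(r'^(\s*\S){%d}' % lim, repr_).end()
    # start of the lim-th non-blank character from the right
    right = len(repr_) - re.match(r'^(\s*\S){%d}' % lim, repr_[::-1]).end()
    # extend the right side back to the start of its line, so the
    # ellipsis is not immediately followed by an unrelated fragment
    right = repr_.rfind('\n', 0, right) + 1
    return repr_[:left] + '...\n' + repr_[right:]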

Any other comments?

@jnothman (Member) left a comment

I think this is a good idea... but a simpler solution might be just to constrain the maximum number of lines rather than characters. (Although putting the ellipsis alone on a line might also have some trickiness to it.)

It would be easier to test various cases if N_CHAR_MAX could be altered in the test.

sklearn/base.py Outdated
-    lim = N_CHAR_MAX // 2
-    repr_ = repr_[:lim] + '...' + repr_[-lim:]
+    lim = N_CHAR_MAX // 2  # apprx number of chars to keep on both ends
+    repr_array = np.array(list(repr_))

jnothman (Member):

If you are really intent on using array for this, do you need to convert to list first? But I'm not sure you benefit much from array anyway

NicolasHug (Member, Author):

That's the only way I found :/

In [2]: np.array('abcdef')
Out[2]: array('abcdef', dtype='<U6')

In [3]: np.array(list('abcdef'))
Out[3]: array(['a', 'b', 'c', 'd', 'e', 'f'], dtype='<U1')

In [7]: np.array('abcdef', dtype='<U1')
Out[7]: array('a', dtype='<U1')

I'm also not a big fan of using numpy here TBH. What we ultimately need is to find the nth non-blank character in the string, from the start and from the end. I couldn't think of a pythonic way to do this, but I'd be happy to consider alternative solutions!


I don't think we should constrain the number of lines, because many lines don't necessarily mean many non-blank characters.

Are you suggesting we could constrain the number of non-blank characters, and cut according to the number of lines? Something like this (warning, buggy):

        if len(''.join(repr_.split())) > N_CHAR_MAX:  # check non-blank chars
            N_LINES = 10  # totally arbitrary
            splitted = repr_.split('\n')
            splitted = splitted[:N_LINES] + ['...'] + splitted[-N_LINES:]
            repr_ = '\n'.join(splitted)

I would find it reasonable. But it might be hard to find an appropriate value for N_LINES so that the number of non-blank characters is approximately equal to N_CHAR_MAX.

@jnothman (Member) commented Mar 13, 2019

np.array('abcdef').view(('<U1', 6))

does what you want in numpy

import re

def nth_nonblank(s, n):
    # index of the nth non-whitespace character of s (1-based)
    return re.match(r'^(\s*\S){%d}' % n, s).end() - 1

for i in range(1, 10):
    print(i, nth_nonblank('a bc de f gh i jk', i))

does what you want in regex.

import itertools
import string

def nth_nonblank(s, n):
    # pull the (n-1)-th item from the stream of non-whitespace indices
    return next(itertools.islice((i for i, c in enumerate(s)
                                  if c not in string.whitespace),
                                 n - 1, n))

does what you want in iterators.

Okay. Let's not worry about the number of lines constraint for now.

NicolasHug (Member, Author):

Wow impressive... Thanks!!

I went for the regex solution, which I believe is fairly easy to understand, given some appropriate comments.

@jnothman (Member) left a comment

Have you handled the case where there is no \n between the right and left sides? Where there is one? Where there is more than one? Tests are better off with N_CHAR_MAX configurable.

sklearn/base.py Outdated
# categoric...
# handle_unknown='ignore',
# hence the addition of .*\n which matches until the next \n
right_side = \

jnothman (Member):
Please use parentheses rather than \ for line continuation

@NicolasHug (Member, Author)

Tests are better off with N_CHAR_MAX configurable

You mean it should be added to the global config? Or as an optional parameter to __repr__(self, N_CHAR_MAX=700)?

@jnothman (Member) commented Mar 13, 2019 via email

@NicolasHug (Member, Author)

OK, I'll try to come up with some tests.

I can definitely see the current implementation producing weird results when there is no \n in between (pretty sure having one or more is fine).

Would it be an acceptable solution to compute left_lim and right_lim, and only do the ellipsis if there is a \n in repr_[left_lim:right_lim]?
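
For concreteness, here is a sketch of that guard (illustrative only: the helper name is hypothetical, and left_lim and right_lim are assumed to be the already-computed cut indices):

def maybe_ellipsize(repr_, left_lim, right_lim):
    # only cut when the elided span crosses a line break, so the two
    # remaining parts cannot end up spliced onto the same line
    if '\n' in repr_[left_lim:right_lim]:
        return repr_[:left_lim] + '...' + repr_[right_lim:]
    return repr_  # no newline in between: leave the repr untouched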

@jnothman (Member) commented Mar 13, 2019 via email

@NicolasHug (Member, Author) commented Mar 19, 2019

Just FYI Joel, I updated the code and the tests as per your last comments. Here are the rendered reprs, since looking at the diff is quite difficult:

# With bruteforce ellipsis, left and right sides are on different lines
Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                                                 verbose=0)),
                                                                  ('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean=True,
                                                                                  with_std=True)...
                                                                                 dtype=<class 'numpy.float64'>,
                                                                                 handle_unknown='ignore',
                                                                                 n_values=None,
                                                                                 sparse=True))]),
                                                  ['embarked', 'sex',
                                                   'pclass'])])),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))])
# With very small N_CHAR_MAX

Pi...
                                    warm_start=False))])
# With N_CHAR_MAX == n_non_blanks (no ellipsis, as expected)
Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                                                 verbose=0)),
                                                                  ('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean=True,
                                                                                  with_std=True))]),
                                                  ['age', 'fare']),
                                                 ('cat',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(copy=True,
                                                                                 fill_value='missing',
                                                                                 missing_values=nan,
                                                                                 strategy='constant',
                                                                                 verbose=0)),
                                                                  ('onehot',
                                                                   OneHotEncoder(categorical_features=None,
                                                                                 categories=None,
                                                                                 drop=None,
                                                                                 dtype=<class 'numpy.float64'>,
                                                                                 handle_unknown='ignore',
                                                                                 n_values=None,
                                                                                 sparse=True))]),
                                                  ['embarked', 'sex',
                                                   'pclass'])])),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))])
# With N_CHAR_MAX == n_non_blanks - 1 (adding ellipsis, on the same line)

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                                                 verbose=0)),
                                                                  ('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean=True,
                                                                                  with_std=True))]),
                                                  ['age', 'fare']),
                                                 ('cat',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(copy=True,
                                                                                 fill_value='missing',
                                                                                 missing_values=n...n,
                                                                                 strategy='constant',
                                                                                 verbose=0)),
                                                                  ('onehot',
                                                                   OneHotEncoder(categorical_features=None,
                                                                                 categories=None,
                                                                                 drop=None,
                                                                                 dtype=<class 'numpy.float64'>,
                                                                                 handle_unknown='ignore',
                                                                                 n_values=None,
                                                                                 sparse=True))]),
                                                  ['embarked', 'sex',
                                                   'pclass'])])),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))])
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

@jnothman (Member) left a comment

I think this is a lot better than what we currently have, but I'm a little concerned that the user can't see the ellipsis at a glance, whereas if we generally tried to just do it by lines we could have a line dedicated to '...', making it more visible.

It's tempting to just remove the character limit...

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

expected = """
jnothman (Member):

I had, btw, expected that with a configurable N_CHAR_MAX you'd make this test case much simpler...

NicolasHug (Member, Author):

You mean testing with an estimator that has a shorter repr?

I agree it'd be better (I'll do it if that's what you meant)

jnothman (Member):

Yes. Looking at the long, pathological cases gives us a better sense of the user experience, but makes the tests harder to work with.

NicolasHug (Member, Author):

Updated, thanks. It looks much better.

@jnothman jnothman added this to the 0.21 milestone Apr 2, 2019
@NicolasHug NicolasHug changed the title [MRG] Fix pprint ellipsis [MRG + 1] Fix pprint ellipsis Apr 5, 2019
@ogrisel (Member) left a comment

I agree this is already an improvement. I also like the suggestion to use the ellipsis to replace whole lines instead (so as to make it more visible), but this can be explored in another PR.
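
For reference, a corrected variant of the line-based sketch from earlier in the thread (the helper name and N_LINES value are illustrative; this is the deferred idea, not merged code):

def line_ellipsis(repr_, n_lines=10):
    lines = repr_.split('\n')
    if len(lines) <= 2 * n_lines:
        # short enough; also avoids the line-duplication bug of the
        # earlier sketch when the two slices would overlap
        return repr_
    # the ellipsis gets a line of its own, visible at a glance
    return '\n'.join(lines[:n_lines] + ['...'] + lines[-n_lines:])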

@ogrisel ogrisel merged commit 1fe00b5 into scikit-learn:master Apr 18, 2019
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed commits to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Successfully merging this pull request may close these issues:

Confusing pretty print repr for nested Pipeline