[MRG + 1] Fix pprint ellipsis #13436
Conversation
I think this is a good idea... but a simpler solution might be just to constrain the maximum number of lines rather than characters. (Although putting the ellipsis alone on a line might also have some trickiness to it.)
It would be easier to test various cases if N_CHAR_MAX could be altered in the tests.
sklearn/base.py (outdated)

```diff
-lim = N_CHAR_MAX // 2
-repr_ = repr_[:lim] + '...' + repr_[-lim:]
+lim = N_CHAR_MAX // 2  # approx number of chars to keep on both ends
+repr_array = np.array(list(repr_))
```
If you are really intent on using array for this, do you need to convert to list first? But I'm not sure you benefit much from array anyway
That's the only way I found :/
```python
In [2]: np.array('abcdef')
Out[2]: array('abcdef', dtype='<U6')

In [3]: np.array(list('abcdef'))
Out[3]: array(['a', 'b', 'c', 'd', 'e', 'f'], dtype='<U1')

In [7]: np.array('abcdef', dtype='<U1')
Out[7]: array('a', dtype='<U1')
```
I'm also not a big fan of using numpy here, TBH. What we ultimately need is to find the nth non-blank character in the string, from the start and from the end. I couldn't think of a Pythonic way to do this, but I'd be happy to consider alternative solutions!
I don't think we should constrain the number of lines, because lots of lines doesn't necessarily mean lots of non-blank characters.
Are you suggesting we could constrain the number of non-blank characters, and cut according to the number of lines? Something like (warning, buggy):
```python
if len(''.join(repr_.split())) > N_CHAR_MAX:  # check non-blank chars
    N_LINES = 10  # totally arbitrary
    splitted = repr_.split('\n')
    splitted = splitted[:N_LINES] + ['...'] + splitted[-N_LINES:]
    repr_ = '\n'.join(splitted)
```
I would find that reasonable. But it might be hard to find an appropriate value for N_LINES so that the number of non-blank characters is approximately equal to N_CHAR_MAX.
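For what it's worth, a corrected sketch of that line-based truncation might look as follows; the guard against reprs with too few lines and the default parameter values are my own assumptions, not code from this PR:

```python
def truncate_by_lines(repr_, n_char_max=700, n_lines=10):
    """Cut a long repr in the middle, keeping whole lines on both ends
    and a dedicated '...' line between them (illustrative sketch only)."""
    n_nonblank = len(''.join(repr_.split()))  # count non-blank chars only
    if n_nonblank <= n_char_max:
        return repr_  # short enough: leave untouched
    lines = repr_.split('\n')
    if len(lines) <= 2 * n_lines:
        return repr_  # too few lines to cut without repeating any
    # keep n_lines from each end, with the ellipsis alone on its own line
    return '\n'.join(lines[:n_lines] + ['...'] + lines[-n_lines:])
```

The length guard matters because in the snippet above the head and tail slices can overlap on short inputs, duplicating lines around the ellipsis.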
```python
np.array('abcdef').view(('<U1', 6))
```
does what you want in numpy.
```python
import re

def nth_nonblank(s, n):
    # index of the nth non-whitespace character in s
    return re.match(r'^(\s*\S){%d}' % n, s).end() - 1

for i in range(1, 10):
    print(i, nth_nonblank('a bc de f gh i jk', i))
```
does what you want in regex.
```python
import itertools
import string

def nth_nonblank(s, n):
    # lazily find the index of the nth non-whitespace character
    return next(itertools.islice((i for i, c in enumerate(s)
                                  if c not in string.whitespace),
                                 n - 1, n))
```
does what you want in iterators.
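For completeness, the regex and iterator approaches give the same answer; here they are side by side under distinct names (the `_re`/`_iter` suffixes are mine, added so both can coexist in one snippet):

```python
import itertools
import re
import string

def nth_nonblank_re(s, n):
    # index of the nth non-whitespace character, via an anchored regex
    return re.match(r'^(\s*\S){%d}' % n, s).end() - 1

def nth_nonblank_iter(s, n):
    # same index, via a lazy generator over character positions
    return next(itertools.islice((i for i, c in enumerate(s)
                                  if c not in string.whitespace),
                                 n - 1, n))

# the two variants agree on every valid n for this sample string
assert all(nth_nonblank_re('a bc de f gh i jk', i)
           == nth_nonblank_iter('a bc de f gh i jk', i)
           for i in range(1, 10))
```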
Okay. Let's not worry about the number of lines constraint for now.
Wow, impressive... Thanks!!
I went for the regex solution, which I believe is fairly easy to understand given some appropriate comments.
Have you handled the case where there is no \n between the right and left parts? Where there is one? Where there is more than one? Tests are better off with N_MAX_CHARS configurable.
sklearn/base.py (outdated)

```python
# categoric...
# handle_unknown='ignore',
# hence the addition of .*\n which matches until the next \n
right_side = \
```
Please use parentheses rather than \ for line continuation
You mean it should be added to the global config? Or as an optional parameter to ...?

An optional parameter just for testing, or a class attribute that can be monkey-patched.
OK, I'll try to come up with some tests. I can definitely see the current implementation producing weird results when there are no ... Would it be an acceptable solution to compute ...?
That's like constraining the number of lines. If there's a very long line (e.g. a very long string) then you're better off putting in an ellipsis. I think using (.*\n)? might do what you need.
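To illustrate the suggestion: `(.*\n)?` greedily consumes the rest of the current line including its newline when one exists, and matches the empty string otherwise, so it can push a raw cut index to the next line boundary without failing on newline-free strings. A minimal illustration, not the PR's actual code; the helper name is mine:

```python
import re

def snap_to_next_line(s, cut):
    # (.*\n)? consumes the remainder of the current line plus its
    # newline if present; with no newline left it matches nothing,
    # so the cut index is returned unchanged
    return cut + re.match(r'(.*\n)?', s[cut:]).end()
```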
Just FYI Joel, I updated the code and the tests as per your last comments. Here are the rendered reprs, since looking at the diff is quite difficult:

```
# With bruteforce ellipsis, left and right side are on a different line
Pipeline(memory=None,
steps=[('preprocessor',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True)...
dtype=<class 'numpy.float64'>,
handle_unknown='ignore',
n_values=None,
sparse=True))]),
['embarked', 'sex',
'pclass'])])),
('classifier',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False))])
```

```
# With very small N_MAX_CHAR
Pi...
warm_start=False))])
```

```
# WITH N_MAX_CHAR == n_non_blanks (no ellipsis, as expected)
Pipeline(memory=None,
steps=[('preprocessor',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True))]),
['age', 'fare']),
('cat',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(copy=True,
fill_value='missing',
missing_values=nan,
strategy='constant',
verbose=0)),
('onehot',
OneHotEncoder(categorical_features=None,
categories=None,
drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='ignore',
n_values=None,
sparse=True))]),
['embarked', 'sex',
'pclass'])])),
('classifier',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False))])
```

```
# WITH N_MAX_CHAR == n_non_blanks - 1 (adding ellipsis, on the same line)
Pipeline(memory=None,
steps=[('preprocessor',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True))]),
['age', 'fare']),
('cat',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(copy=True,
fill_value='missing',
missing_values=n...n,
strategy='constant',
verbose=0)),
('onehot',
OneHotEncoder(categorical_features=None,
categories=None,
drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='ignore',
n_values=None,
sparse=True))]),
['embarked', 'sex',
'pclass'])])),
('classifier',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False))])
```

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
```
I think this is a lot better than what we currently have, but I'm a little concerned that the user can't see the ellipsis at a glance, whereas if we generally tried to just do it by lines we could have a line dedicated to ..., making it more visible.
It's tempting to just remove the character limit...
```python
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

expected = """
```
I had, btw, expected that with a configurable N_CHAR_MAX you'd make this test case much simpler...
You mean testing with an estimator that has a shorter repr? I agree it'd be better (I'll do it if that's what you meant).
Yes. Looking at the long, pathological cases gives us a better sense of the user experience, but makes the tests harder to work with.
Updated, thanks. It looks much better.
I agree this is already an improvement. I also like the suggestion to use ellipsis to replace whole lines instead (so as to make it more visible) but this can be explored in another PR.
This reverts commit 628ad46.
Reference Issues/PRs
Fixes #13372
What does this implement/fix? Explain your changes.
When the repr of an estimator is really long (more than 700 non-blank characters), we use a bruteforce ellipsis and cut the string in the middle. The old behaviour would cut the string without ignoring blank characters, leading to unnecessarily short reprs.
This PR also fixes a weird side effect of the ellipsis: the right side of the string was appended immediately after the ellipsis, leading to confusing reprs. We now keep the whole line of the right side.
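A rough sketch of the behaviour described above; the helper name and threshold handling are illustrative, and the real implementation in sklearn/base.py differs in detail (e.g. in how the right side is snapped to a line boundary):

```python
import re

def ellipsize(repr_, n_char_max=700):
    """Middle-cut a long repr, budgeting *non-blank* characters only."""
    # count non-blank characters, the fix this PR introduces
    if len(''.join(repr_.split())) <= n_char_max:
        return repr_
    lim = n_char_max // 2  # non-blank budget for each side
    # position just past the lim-th non-blank char from the start
    left = re.match(r'^(\s*\S){%d}' % lim, repr_).end()
    # position of the lim-th non-blank char from the end, found by
    # running the same anchored match on the reversed string
    right = len(repr_) - re.match(r'^(\s*\S){%d}' % lim, repr_[::-1]).end()
    return repr_[:left] + '...' + repr_[right:]
```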
Any other comments?