[MRG + 1] Fix pprint ellipsis #13436
Conversation
I think this is a good idea... but a simpler solution might be just to constrain the maximum number of lines rather than characters. (Although putting the ellipsis alone on a line might also have some trickiness to it.)
It would be easier to test various cases if N_CHAR_MAX could be altered in the tests.
sklearn/base.py (outdated)

```diff
-lim = N_CHAR_MAX // 2
-repr_ = repr_[:lim] + '...' + repr_[-lim:]
+lim = N_CHAR_MAX // 2  # approx number of chars to keep on both ends
+repr_array = np.array(list(repr_))
```
If you are really intent on using array for this, do you need to convert to list first? But I'm not sure you benefit much from array anyway
That's the only way I found :/
```python
In [2]: np.array('abcdef')
Out[2]: array('abcdef', dtype='<U6')

In [3]: np.array(list('abcdef'))
Out[3]: array(['a', 'b', 'c', 'd', 'e', 'f'], dtype='<U1')

In [7]: np.array('abcdef', dtype='<U1')
Out[7]: array('a', dtype='<U1')
```
I'm also not a big fan of using numpy here, TBH. What we ultimately need is to find the nth non-blank character in the string, from the start and from the end. I couldn't think of a Pythonic way to do this, but I'd be happy to consider alternative solutions!
I don't think we should constrain the number of lines, because lots of lines doesn't necessarily mean lots of non-blank characters.
Are you suggesting we could constrain the number of non-blank characters, and cut according to the number of lines? Something like (warning, buggy):
```python
if len(''.join(repr_.split())) > N_CHAR_MAX:  # check non-blank chars
    N_LINES = 10  # totally arbitrary
    splitted = repr_.split('\n')
    splitted = splitted[:N_LINES] + ['...'] + splitted[-N_LINES:]
    repr_ = '\n'.join(splitted)
```
I would find that reasonable. But it might be hard to find an appropriate value for N_LINES so that the number of non-blank characters is approximately equal to N_CHAR_MAX.
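For what it's worth, a corrected sketch of that line-based truncation might look as follows; the guard against reprs with too few lines and the default parameter values are my own assumptions, not code from this PR:

```python
def truncate_by_lines(repr_, n_char_max=700, n_lines=10):
    """Cut a long repr in the middle, keeping whole lines on both ends
    and a dedicated '...' line between them (illustrative sketch only)."""
    n_nonblank = len(''.join(repr_.split()))  # count non-blank chars only
    if n_nonblank <= n_char_max:
        return repr_  # short enough: leave untouched
    lines = repr_.split('\n')
    if len(lines) <= 2 * n_lines:
        return repr_  # too few lines to cut without repeating any
    # keep n_lines from each end, with the ellipsis alone on its own line
    return '\n'.join(lines[:n_lines] + ['...'] + lines[-n_lines:])
```

The length guard matters because in the snippet above the head and tail slices can overlap on short inputs, duplicating lines around the ellipsis.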
```python
np.array('abcdef').view(('<U1', 6))
```
does what you want in numpy.
```python
import re

def nth_nonblank(s, n):
    # index of the nth non-whitespace character in s
    return re.match(r'^(\s*\S){%d}' % n, s).end() - 1

for i in range(1, 10):
    print(i, nth_nonblank('a bc de f gh i jk', i))
```
does what you want in regex.
```python
import itertools
import string

def nth_nonblank(s, n):
    # lazily find the index of the nth non-whitespace character
    return next(itertools.islice((i for i, c in enumerate(s)
                                  if c not in string.whitespace),
                                 n - 1, n))
```
does what you want in iterators.
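For completeness, the regex and iterator approaches give the same answer; here they are side by side under distinct names (the `_re`/`_iter` suffixes are mine, added so both can coexist in one snippet):

```python
import itertools
import re
import string

def nth_nonblank_re(s, n):
    # index of the nth non-whitespace character, via an anchored regex
    return re.match(r'^(\s*\S){%d}' % n, s).end() - 1

def nth_nonblank_iter(s, n):
    # same index, via a lazy generator over character positions
    return next(itertools.islice((i for i, c in enumerate(s)
                                  if c not in string.whitespace),
                                 n - 1, n))

# the two variants agree on every valid n for this sample string
assert all(nth_nonblank_re('a bc de f gh i jk', i)
           == nth_nonblank_iter('a bc de f gh i jk', i)
           for i in range(1, 10))
```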
Okay. Let's not worry about the number of lines constraint for now.
Wow, impressive... Thanks!!
I went for the regex solution, which I believe is fairly easy to understand given some appropriate comments.
Have you handled the case where there is no \n between the right and left parts? Where there is one? Where there is more than one? Tests are better off with N_MAX_CHARS configurable.
sklearn/base.py (outdated)

```python
# categoric...
# handle_unknown='ignore',
# hence the addition of .*\n which matches until the next \n
right_side = \
```
Please use parentheses rather than \ for line continuation
You mean it should be added to the global config? Or as an optional parameter to ...?

An optional parameter just for testing, or a class attribute that can be monkey-patched.
OK, I'll try to come up with some tests. I can definitely see the current implementation producing weird results when there are no ... Would it be an acceptable solution to compute ...?
That's like constraining the number of lines. If there's a very long line (e.g. a very long string) then you're better off putting in an ellipsis. I think using (.*\n)? might do what you need.
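To illustrate the suggestion: `(.*\n)?` greedily consumes the rest of the current line including its newline when one exists, and matches the empty string otherwise, so it can push a raw cut index to the next line boundary without failing on newline-free strings. A minimal illustration, not the PR's actual code; the helper name is mine:

```python
import re

def snap_to_next_line(s, cut):
    # (.*\n)? consumes the remainder of the current line plus its
    # newline if present; with no newline left it matches nothing,
    # so the cut index is returned unchanged
    return cut + re.match(r'(.*\n)?', s[cut:]).end()
```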
Just FYI Joel, I updated the code and the tests as per your last comments. Here are the rendered reprs, since looking at the diff is quite difficult:

```
# With bruteforce ellipsis, left and right side are on a different line
Pipeline(memory=None,
steps=[('preprocessor',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True)...
dtype=<class 'numpy.float64'>,
handle_unknown='ignore',
n_values=None,
sparse=True))]),
['embarked', 'sex',
'pclass'])])),
('classifier',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False))])
```

```
# With very small N_MAX_CHAR
Pi...
warm_start=False))])
```

```
# WITH N_MAX_CHAR == n_non_blanks (no ellipsis, as expected)
Pipeline(memory=None,
steps=[('preprocessor',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True))]),
['age', 'fare']),
('cat',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(copy=True,
fill_value='missing',
missing_values=nan,
strategy='constant',
verbose=0)),
('onehot',
OneHotEncoder(categorical_features=None,
categories=None,
drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='ignore',
n_values=None,
sparse=True))]),
['embarked', 'sex',
'pclass'])])),
('classifier',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False))])
```

```
# WITH N_MAX_CHAR == n_non_blanks - 1 (adding ellipsis, on the same line)
Pipeline(memory=None,
steps=[('preprocessor',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True))]),
['age', 'fare']),
('cat',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(copy=True,
fill_value='missing',
missing_values=n...n,
strategy='constant',
verbose=0)),
('onehot',
OneHotEncoder(categorical_features=None,
categories=None,
drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='ignore',
n_values=None,
sparse=True))]),
['embarked', 'sex',
'pclass'])])),
('classifier',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False))])
```

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
```
I think this is a lot better than what we currently have, but I'm a little concerned that the user can't see the ellipsis at a glance, whereas if we generally tried to just do it by lines we could have a line dedicated to ..., making it more visible.
It's tempting to just remove the character limit...
```python
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

expected = """
```
I had, btw, expected that with a configurable N_CHAR_MAX you'd make this test case much simpler...
You mean testing with an estimator that has a shorter repr? I agree it'd be better (I'll do it if that's what you meant).
Yes. Looking at the long, pathological cases gives us a better sense of the user experience, but makes the tests harder to work with.
Updated, thanks. It looks much better.
I agree this is already an improvement. I also like the suggestion to use ellipsis to replace whole lines instead (so as to make it more visible) but this can be explored in another PR.
This reverts commit 628ad46.
Reference Issues/PRs
Fixes #13372
What does this implement/fix? Explain your changes.
When the repr of an estimator is really long (more than 700 non-blank characters), we use a bruteforce ellipsis and cut the string in the middle. The old behaviour would cut the string without ignoring blank characters, leading to unnecessarily short reprs.
This PR also fixes a weird side effect of the ellipsis: the right side of the string was appended immediately after the ellipsis, leading to confusing reprs. We now keep the whole line of the right side.
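A rough sketch of the behaviour described above; the helper name and threshold handling are illustrative, and the real implementation in sklearn/base.py differs in detail (e.g. in how the right side is snapped to a line boundary):

```python
import re

def ellipsize(repr_, n_char_max=700):
    """Middle-cut a long repr, budgeting *non-blank* characters only."""
    # count non-blank characters, the fix this PR introduces
    if len(''.join(repr_.split())) <= n_char_max:
        return repr_
    lim = n_char_max // 2  # non-blank budget for each side
    # position just past the lim-th non-blank char from the start
    left = re.match(r'^(\s*\S){%d}' % lim, repr_).end()
    # position of the lim-th non-blank char from the end, found by
    # running the same anchored match on the reversed string
    right = len(repr_) - re.match(r'^(\s*\S){%d}' % lim, repr_[::-1]).end()
    return repr_[:left] + '...' + repr_[right:]
```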
Any other comments?