New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Avoid nonlinear complexity for repr of nested estimators when print_c… #18508
[MRG] Avoid nonlinear complexity for repr of nested estimators when print_c… #18508
Conversation
As a reference, what is the runtime when |
With |
I found the remaining problem. For |
To fix it, we can either put an additional check specific to pipelines (or iterators of estimators) but it feels quite ugly, maybe parameters that have no default value should just be automatically considered as changed ? It is indirect but cleaner / simpler. |
That makes sense yes. We can compare the default with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the new check with _empty
help with printing time @Xethan ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for you work @Xethan
This looks good.
I think we should add a short test in sklearn/utils/tests/test_pprint.py
in order to prevent any regression in the future. The test could just count the number of calls to repr
as you've previously done (I believe). LMK if you need any guidance.
Since this provides a significant improvement we should also write an Enhancement
entry in doc/whats_new/v0.24.rst
. Make sure to credit yourself and mention this PR, like other entries there.
Thanks!
sklearn/utils/_pprint.py
Outdated
return True | ||
if init_params[k] == inspect._empty: # k has no default value | ||
return True | ||
# avoid `repr` on nested estimators, it causes nonlinear complexity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# avoid `repr` on nested estimators, it causes nonlinear complexity | |
# try to avoid calling repr on nested estimators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the test and verified it fails as expected when any of the two fixes is commented out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Xethan , some last nits but LGTM!
sklearn/utils/tests/test_pprint.py
Outdated
def transform(self, X, copy=None): | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this because we never call fit
on the pipeline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When transform
was not there, I had this error :
E TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'DummyEstimator(estimator=DummyEstimator())' (type <class 'sklearn.utils.tests.test_pprint.test_complexity_print_changed_only.<locals>.DummyEstimator'>) doesn't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also going to add estimator.transform(None)
because a coverage test has failed due to this function not being called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's preferable to just add # pragma: no cover
instead
doc/whats_new/v0.24.rst
Outdated
- |Enhancement| Avoid nonlinear complexity for ``repr`` of nested estimators | ||
when `print_changed_only=True`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid the term nonlinear complexity and instead be very explicit that the calls are now faster.
Also, this should be instead in a Miscellaneous
section as done in doc/whats_new/v0.23.rst
- |Enhancement| Avoid nonlinear complexity for ``repr`` of nested estimators | |
when `print_changed_only=True`. | |
- |Enhancement| Calls to ``repr`` are now faster when `print_changed_only=True`, especially with meta-estimators. |
sklearn/utils/tests/test_pprint.py
Outdated
class ReprCounter(BaseEstimator): | ||
nb_times_repr_called = 0 | ||
def __repr__(self): | ||
ReprCounter.nb_times_repr_called += 1 | ||
return super().__repr__() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of defining this base class, would it be equivalent to just have nb_times_repr_called
as a global variable and DummyEstimator.__repr__
to be the current __repr__
of ReprCounter
?
42ddb05
to
aa4d02c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @Xethan
Minor comment, otherwise LGTM!
sklearn/utils/tests/test_pprint.py
Outdated
return super().__repr__() | ||
|
||
def transform(self, X, copy=None): # pragma: no cover | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
return self | |
return X |
aa4d02c
to
102433c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good!
if init_params[k] == inspect._empty: # k has no default value | ||
return True | ||
# try to avoid calling repr on nested estimators | ||
if (isinstance(v, BaseEstimator) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could check if either v or init_params[k] is an estimator, in fact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there code using estimators as default value ? I have never seen any and I doubt it would be nested estimators but I can add the check if you think it's useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have seen places (maybe not in sklearn itself) where the default value is an estimator, but then v
in those cases is also an estimator. And since this is only a minor performance issue, I'm fine as is.
Yes, you are right, it would not be recommended to use a mutable default
value
|
Is there anything left to do then, or can it be merged ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Xethan
if init_params[k] == inspect._empty: # k has no default value | ||
return True | ||
# try to avoid calling repr on nested estimators | ||
if (isinstance(v, BaseEstimator) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have seen places (maybe not in sklearn itself) where the default value is an estimator, but then v
in those cases is also an estimator. And since this is only a minor performance issue, I'm fine as is.
…nt_c… (scikit-learn#18508) * Avoid nonlinear complexity for repr of nested estimators when print_changed_only=True * Automatically consider parameters with no default value as changed * Use helper function for improved readability * Add a non-regression test and a release note entry Co-authored-by: nathan <nathan@sancare.fr>
…nt_c… (scikit-learn#18508) * Avoid nonlinear complexity for repr of nested estimators when print_changed_only=True * Automatically consider parameters with no default value as changed * Use helper function for improved readability * Add a non-regression test and a release note entry Co-authored-by: nathan <nathan@sancare.fr>
Reference Issues/PRs
Fixes #18490
What does this implement/fix? Explain your changes.
When
print_changed_only
isTrue
andrepr
is called on an estimator, parameters are checked to see if they are different from the default value usingrepr
. This causes a complexity issue for nested estimators because it will recursively callrepr
on the attributes that are estimators and so on, so we first check if the parameter is an estimator with a type different than the default value (often the parameter will have no default value) and exit theif
early.I tested it on the case that caused me trouble and it went from 30+ minutes to roughly 1 minute.
Any other comments?
I don't know if it is possible to do a non-regression test as functionality is unchanged. I don't know how to use asv benchmarks and I would need to reproduce the issue with base estimators first.