Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Avoid nonlinear complexity for repr of nested estimators when print_c… #18508

Conversation

Xethan
Copy link
Contributor

@Xethan Xethan commented Oct 1, 2020

Reference Issues/PRs

Fixes #18490

What does this implement/fix? Explain your changes.

When print_changed_only is True and repr is called on an estimator, parameters are checked to see if they are different from the default value using repr. This causes a complexity issue for nested estimators because it will recursively call repr on the attributes that are estimators and so on, so we first check if the parameter is an estimator with a type different than the default value (often the parameter will have no default value) and exit the if early.

I tested it on the case that caused me trouble and it went from 30+ minutes to roughly 1 minute.

Any other comments?

I don't know if it is possible to do a non-regression test as functionality is unchanged. I don't know how to use asv benchmarks and I would need to reproduce the issue with base estimators first.

@thomasjpfan
Copy link
Member

I tested it on the case that caused me trouble and it went from 30+ minutes to roughly 1 minute.

As a reference, what is the runtime when print_changed_only=False in your use case?

@Xethan
Copy link
Contributor Author

Xethan commented Oct 2, 2020

With print_changed_only=False, the runtime is ~2 seconds.
I have tried investigating further with no success for now.

@Xethan
Copy link
Contributor Author

Xethan commented Oct 2, 2020

I found the remaining problem. For Pipeline objects, the attribute steps is not filtered by the condition I added and thus repr is called on steps and the estimators it contains.

@Xethan
Copy link
Contributor Author

Xethan commented Oct 2, 2020

To fix it, we can either put an additional check specific to pipelines (or iterators of estimators) but it feels quite ugly, maybe parameters that have no default value should just be automatically considered as changed ? It is indirect but cleaner / simpler.

@NicolasHug
Copy link
Member

parameters that have no default value should just be automatically considered as changed ?

That makes sense yes. We can compare the default with inspect._empty.

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the new check with _empty help with printing time @Xethan ?

sklearn/utils/_pprint.py Outdated Show resolved Hide resolved
Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you work @Xethan
This looks good.

I think we should add a short test in sklearn/utils/tests/test_pprint.py in order to prevent any regression in the future. The test could just count the number of calls to repr as you've previously done (I believe). LMK if you need any guidance.

Since this provides a significant improvement we should also write an Enhancement entry in doc/whats_new/v0.24.rst. Make sure to credit yourself and mention this PR, like other entries there.

Thanks!

sklearn/utils/_pprint.py Show resolved Hide resolved
return True
if init_params[k] == inspect._empty: # k has no default value
return True
# avoid `repr` on nested estimators, it causes nonlinear complexity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# avoid `repr` on nested estimators, it causes nonlinear complexity
# try to avoid calling repr on nested estimators

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the test and verified it fails as expected when any of the two fixes is commented out.

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Xethan , some last nits but LGTM!

Comment on lines 594 to 595
def transform(self, X, copy=None):
return self
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this because we never call fit on the pipeline

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When transform was not there, I had this error :

E               TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'DummyEstimator(estimator=DummyEstimator())' (type <class 'sklearn.utils.tests.test_pprint.test_complexity_print_changed_only.<locals>.DummyEstimator'>) doesn't

Copy link
Contributor Author

@Xethan Xethan Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also going to add estimator.transform(None) because a coverage test has failed due to this function not being called.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's preferable to just add # pragma: no cover instead

Comment on lines 592 to 593
- |Enhancement| Avoid nonlinear complexity for ``repr`` of nested estimators
when `print_changed_only=True`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should avoid the term nonlinear complexity and instead be very explicit that the calls are now faster.

Also, this should be instead in a Miscellaneous section as done in doc/whats_new/v0.23.rst

Suggested change
- |Enhancement| Avoid nonlinear complexity for ``repr`` of nested estimators
when `print_changed_only=True`.
- |Enhancement| Calls to ``repr`` are now faster when `print_changed_only=True`, especially with meta-estimators.

Comment on lines 584 to 588
class ReprCounter(BaseEstimator):
nb_times_repr_called = 0
def __repr__(self):
ReprCounter.nb_times_repr_called += 1
return super().__repr__()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of defining this base class, would it be equivalent to just have nb_times_repr_called as a global variable and DummyEstimator.__repr__ to be the current __repr__ of ReprCounter?

@Xethan Xethan force-pushed the avoid_nonlinear_complexity_for_repr_of_nested_estimators branch 2 times, most recently from 42ddb05 to aa4d02c Compare October 2, 2020 16:41
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @Xethan

Minor comment, otherwise LGTM!

return super().__repr__()

def transform(self, X, copy=None): # pragma: no cover
return self
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
return self
return X

@Xethan Xethan force-pushed the avoid_nonlinear_complexity_for_repr_of_nested_estimators branch from aa4d02c to 102433c Compare October 3, 2020 16:09
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good!

if init_params[k] == inspect._empty: # k has no default value
return True
# try to avoid calling repr on nested estimators
if (isinstance(v, BaseEstimator) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could check if either v or init_params[k] is an estimator, in fact

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there code using estimators as default value ? I have never seen any and I doubt it would be nested estimators but I can add the check if you think it's useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen places (maybe not in sklearn itself) where the default value is an estimator, but then v in those cases is also an estimator. And since this is only a minor performance issue, I'm fine as is.

@jnothman
Copy link
Member

jnothman commented Oct 5, 2020 via email

@Xethan
Copy link
Contributor Author

Xethan commented Oct 6, 2020

Is there anything left to do then, or can it be merged ?

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Xethan

if init_params[k] == inspect._empty: # k has no default value
return True
# try to avoid calling repr on nested estimators
if (isinstance(v, BaseEstimator) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen places (maybe not in sklearn itself) where the default value is an estimator, but then v in those cases is also an estimator. And since this is only a minor performance issue, I'm fine as is.

@adrinjalali adrinjalali merged commit cdff9b0 into scikit-learn:master Oct 8, 2020
5 checks passed
amrcode pushed a commit to amrcode/scikit-learn that referenced this pull request Oct 19, 2020
…nt_c… (scikit-learn#18508)

* Avoid nonlinear complexity for repr of nested estimators when print_changed_only=True

* Automatically consider parameters with no default value as changed

* Use helper function for improved readability

* Add a non-regression test and a release note entry

Co-authored-by: nathan <nathan@sancare.fr>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…nt_c… (scikit-learn#18508)

* Avoid nonlinear complexity for repr of nested estimators when print_changed_only=True

* Automatically consider parameters with no default value as changed

* Use helper function for improved readability

* Add a non-regression test and a release note entry

Co-authored-by: nathan <nathan@sancare.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Very long repr for nested estimator when print_changed_only=True
5 participants