Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MNT] Refactor .clone() #281

Merged
merged 10 commits into from
Feb 25, 2024
Merged

[MNT] Refactor .clone() #281

merged 10 commits into from
Feb 25, 2024

Conversation

tpvasconcelos
Copy link
Contributor

@tpvasconcelos tpvasconcelos commented Jan 31, 2024

Reference Issues/PRs

#280

What does this implement/fix? Explain your changes.

  • Simplify .clone()'s implementation to 3 easy-to-understand steps:
    1. clone self
    2. if check_clone, run extra checks
    3. return cloned object
  • Refactor extra clone checks into a private _check_clone() function
  • Refactor the private ._check() method into a private _check() function
    • I think this is more idiomatic since the current ._check() implementation was not accessing any instance (self) attributes or methods. It would only recursively call itself again as self._clone(...), in practice making it a static method.
    • I don't think this function should ever depend on any private instance attributes or ever mutate the state of the object being cloned.
    • In practice, this is 100% equivalent in behaviour but I think it's a more intentional and idiomatic implementation.
    • As a bonus (but probably not coincidentally), makes it more in line with sklearn's reference implementation

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Any other comments?

PR checklist

For all contributions
  • I've reviewed the project documentation on contributing
  • I've added myself to the list of contributors.
  • The PR title starts with either [ENH], [CI/CD], [MNT], [DOC], or [BUG] indicating whether
    the PR topic is related to enhancement, CI/CD, maintenance, documentation, or a bug.
For code contributions
  • Unit tests have been added covering code functionality
  • Appropriate docstrings have been added (see documentation standards)
  • New public functionality has been added to the API Reference

Copy link

codecov bot commented Jan 31, 2024

Codecov Report

Attention: Patch coverage is 84.21053% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 84.62%. Comparing base (306958d) to head (5b81057).
Report is 9 commits behind head on main.

Files Patch % Lines
skbase/base/_base.py 84.21% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #281      +/-   ##
==========================================
- Coverage   85.07%   84.62%   -0.46%     
==========================================
  Files          45       45              
  Lines        3015     3050      +35     
==========================================
+ Hits         2565     2581      +16     
- Misses        450      469      +19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

skbase/base/_base.py Dismissed Show resolved Hide resolved
@fkiraly fkiraly added the enhancement Adding new functionality label Jan 31, 2024
@fkiraly
Copy link
Contributor

fkiraly commented Jan 31, 2024

Nice, thanks!

Btw, I'm not sure I remember where the "check clone" code is coming from. I added the config to ignore it by default but still have it there just in case. I always thought it is sth from sklearn, but apparently not?

@tpvasconcelos
Copy link
Contributor Author

@fkiraly thanks for taking a look at this already. I opened a draft PR just because I wanted to start typing some notes in the description.

Btw, I'm not sure I remember where the "check clone" code is coming from. I added the config to ignore it by default but still have it there just in case. I always thought it is sth from sklearn, but apparently not?

It’s a bit strange indeed. It doesn’t seem like the right place to run such checks. If these checks ever fail it’s because the BaseObject’s implementation is not compliant with the scikit estimator protocol which requires all params passed to the constructor to be set as instance variables “as is” during instantiation. Having these checks at runtime during a cloning call seems quite arbitrary and out of place (and unnecessary overhead). If someone wants to check whether their estimator is scikit-compliant, this should probably happen in a test suite and maybe a helper function could be provided (maybe already exists?). Also, the check only runs for self and never for inner/nested objects

@fkiraly
Copy link
Contributor

fkiraly commented Feb 1, 2024

It’s a bit strange indeed. It doesn’t seem like the right place to run such checks.

Ah, I remember!

I agree with you and that's why I turned it into an opt-in configurable feature.

When we refactored the base classes from sktime, we added some tests independent of sktime estimator types. The test test_clone_raises_error_for_nonconforming_objects was added, which checks for various error messages being raised.

I did not add that test (it was @RNKuhns), I believe they were added because back then sklearn had similar tests. This was at a time where BaseObject.clone still dispatched to sklearn clone directly.

Then we removed the sklearn dependency, I must have copied over code, that must have caused both tests and code to be in the same package - which must have been when I saw those strange checks and thought the same, seems out of place at runtime, instead tests should check that. But the old tests want that, so I added a config that makes the checks opt-in, and set that config in the test test_clone_raises_error_for_nonconforming_objects (instead of deleting both checks and the tests, which I thought was the "safer" bet, just in case this is still sth sklearn wants under who knows what circumstance).

@tpvasconcelos
Copy link
Contributor Author

Thanks for the background!

What's your take, should this be deprecated and removed in a future version? Or you see value in leaving it in?

@fkiraly
Copy link
Contributor

fkiraly commented Feb 1, 2024

Hm, I would say, let's deprcate, given that it seems gone in sklearn - if a user sets the config, they should be pointed to a check_estimator-like utility instead.

This could be TestAllObjects.run_tests, or w could make a check_estimator like in sktime that points to it.

@tpvasconcelos
Copy link
Contributor Author

@fkiraly For the reasons mentioned above, I think that moving the cloning logic to a separate function makes sense. In that regard, this PR would be ready to review and merge.

However, after doing a quick scan through the codebase, it seems like the use of the term 'estimator' is misused in a few places. I guess that abstracting BaseEstimator was the initial motivation for this framework but then you also abstracted it further into a BaseObject but never updated the explicit references to "estimator" in these base classes. Specifically I'm talking about references in the BaseObject and _FlagManager implementations, both of which should be agnostic to whether the actual implementation is an "estimator" or something else.

Some of these references are easy to fix (just replace "estimator" with something else {"base object"?}). See tpvasconcelos#2 for example. However, both BaseObject.clone_tags() and other methods in _FlagManager use the term estimator in their signature. Changing this would cause a breaking change.

Thoughts?

@fkiraly
Copy link
Contributor

fkiraly commented Feb 21, 2024

In that regard, this PR would be ready to review and merge.

Makes sense - do you want ot turn it done then, and resolve the QL issues?

However, after doing a quick scan through the codebase, it seems like the use of the term 'estimator' is misused in a few places.

Yes - the history is that, initially only estimators (with "fit") were inheriting from a sklearn BaseEstimator-like interface. This expanded as different kinds of object became necessary, i.e., without fit, in sktime - leading to the BaseObject abstraction. However, by that time, many docstrings and variable names already had "estimator" in them.

In creating skbase, we did quite some renaming to avoid confusion, but mostly focused on user or developer facing locations. Already that created a lot of problems, since renaming has to be consistent everywhere, and also it needs to be consistent with a refactor in sktime. I still half wish we hadn't renamed anything, see, e.g., the refactor sktime/sktime#4283 which fails for unclear reasons but likely because of renaming - comparing to other packages where using skbase from scratch has created no problems.

@fkiraly
Copy link
Contributor

fkiraly commented Feb 23, 2024

So, @tpvasconcelos, if you're done - could you remove the WIP tag and turn the PR to ready? Just to make sure we're not merging something incomplete.

@tpvasconcelos tpvasconcelos changed the title [WIP][MNT] Refactor .clone() [MNT] Refactor .clone() Feb 23, 2024
@tpvasconcelos tpvasconcelos marked this pull request as ready for review February 23, 2024 12:54
@tpvasconcelos
Copy link
Contributor Author

Opened!

@tpvasconcelos
Copy link
Contributor Author

@fkiraly I'll open this PR once we close this one: #293

(I don't know how to daisy-chain PRs from fork repos)

@fkiraly
Copy link
Contributor

fkiraly commented Feb 24, 2024

(no major change, just moved _clone to the end for better readability)

@fkiraly fkiraly merged commit 580244e into sktime:main Feb 25, 2024
21 of 23 checks passed
fkiraly pushed a commit that referenced this pull request Sep 22, 2024
…more general references (#293)

#### Reference Issues/PRs

Depends on #281

#### What does this implement/fix? Explain your changes.

See #281 (comment)
and relevant replies.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Adding new functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants