Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] skbase clone vs sklearn clone - comparison and actions #280

Open
fkiraly opened this issue Jan 31, 2024 · 2 comments
Open

[ENH] skbase clone vs sklearn clone - comparison and actions #280

fkiraly opened this issue Jan 31, 2024 · 2 comments
Labels
API design API design & software architecture

Comments

@fkiraly
Copy link
Contributor

fkiraly commented Jan 31, 2024

posted by @tpvasconcelos in #275 - opening a new issue to discuss design and actions arising, if any


Just for reference, I'm attaching a visual diff between the current BaseObject._clone() implementation and sklearn's _clone_parametrized():

image

Summary of differences:

  1. skbase does not clone dict values (@fkiraly do you know why?)
  2. sklearn also clones the _metadata_request and _sklearn_output_config attributes
  3. skbase now also clones the config flags

I need to think more about this to give any meaningful input here.

Either way, some tech discussions on uniformization of APIs are going to be scheduled for later this year between the two projects (sktime, sklearn) - in case you are interested.

@fkiraly definitely

Originally posted by @tpvasconcelos in #275 (comment)

@fkiraly fkiraly added the API design API design & software architecture label Jan 31, 2024
@fkiraly
Copy link
Contributor Author

fkiraly commented Jan 31, 2024

Summary of differences:

These are likely due to changes that were merged in sklearn since the skbase _clone was copied from there.

The forking was conscious, as discussed in #275 - for instance, the more recently added _sklearn_output_config in sklearn is better handled by set_config/get_config - they keep proliferating attrs that are really hard to keep track of if you are not an sklearn core dev...

  1. skbase does not clone dict values (@fkiraly do you know why?)

The simple reason is that sklearn did not support it back then. Seems like a reasonable idea to add.

  1. sklearn also clones the _metadata_request and _sklearn_output_config attributes

Yes, this stuff was added recently in sklearn, see above. This is too specific to sklearn estimators or pipelines imo to be part of a super-base class, and imo one example where sklearn has recently gone down a rabbit hole of decreasing cohesion as a side effect of adding more and more features - in particular, features that sklearn-like packages don't necessarily need - and then having to deal with them all over the place.

For instance, sth like get_config / set_config is useful, but _sklearn_metadata_config is extremely package specific, and each such addition makes it less and less easy to use sklearn BaseEstimator as a starting point for a sklearn-like package that is not identical with sklearn interfaces.

Feels a bit like a gradual "Lasagna-fication" of once crisp architecture...

  1. skbase now also clones the config flags

Yes, given that it happens in the same place where the sklearn config is being dealt with validates your approach, because the same solution was found independently twice.

@fkiraly fkiraly changed the title skbase clone vs sklearn clone - comparison and actions [ENH] skbase clone vs sklearn clone - comparison and actions Jan 31, 2024
@fkiraly
Copy link
Contributor Author

fkiraly commented Feb 24, 2024

#281 improves the structure, but does not add support for dict

fkiraly pushed a commit that referenced this issue Feb 25, 2024
#### Reference Issues/PRs

#280

#### What does this implement/fix? Explain your changes.

- Simplify `.clone()`'s implementation to 3 easy-to-understand steps:
    1. clone `self`
    2. if `check_clone`, run extra checks
    3. return cloned object
- Refactor extra clone checks into a private `_check_clone()` function
- Refactor the private `._check()` method into a private `_check()`
function
- I think this is more idiomatic since the current `._check()`
implementation was not accessing any instance (`self`) attributes or
methods. It would only recursively call itself again as
`self._clone(...)`, in practice making it a static method.
- I don't think this function should ever depend on any private instance
attributes or ever mutate the state of the object being cloned.
- In practice, this is 100% equivalent in behaviour but I think it's a
more _intentional_ and idiomatic implementation.
- As a bonus (but probably not coincidentally), makes it more in line
with sklearn's reference implementation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design API design & software architecture
Projects
None yet
Development

No branches or pull requests

1 participant