
[ENH] Alternative design for dataset as an object #5270

Draft: wants to merge 6 commits into base: main
Conversation

@hazrulakmal (Collaborator) commented Sep 18, 2023

Reference Issues/PRs

Alternative design prototype to #4333, working towards #5105 and #5196.

What does this implement/fix? Explain your changes.

The design from #4333 is great and opens up the possibility of using the tagging system. However, the problem is that datasets do not have the same characteristics as `BaseObject`. `BaseObject` has a number of methods that are entirely unrelated to datasets, as the design is fundamentally tailored to "estimator" objects. To name a few of the unrelated methods: `load_from_serial`, `get_params`, `clone`, `reset`, etc.

I propose we create the dataset base class from the ground up; this PR fills in the implementation details.
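A rough sketch of the direction (illustrative names, not the final code in this PR):

```python
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Illustrative ground-up dataset base class, independent of BaseObject."""

    def __init__(self, save_dir=None):
        self.save_dir = save_dir

    def load(self):
        """Return the dataset, downloading it first if necessary."""
        return self._load()

    @abstractmethod
    def _load(self):
        """Load the dataset into memory; implemented by concrete datasets."""


class UCRDataset(BaseDataset):
    """Illustrative loader for a single UEA/UCR classification dataset."""

    def __init__(self, name, save_dir=None):
        self.name = name
        super().__init__(save_dir=save_dir)

    def _load(self):
        ...  # download-and-parse logic for the named dataset goes here
```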

Did you add any tests for the change?

Will add tests as the implementation progresses; this is a work in progress.

PR checklist

  • TSC
  • TSF
  • one pre-installed dataset
  • refactor dataset download routines

@hazrulakmal added the API design and module:datasets&loaders labels on Sep 19, 2023
@hazrulakmal self-assigned this on Sep 19, 2023
@fkiraly (Collaborator) left a comment
A comment rather than a full review.

  • I like the overall design of the base class, i.e., methods and intended use. I will have to play around with it to see the user experience, but I think it is an improvement on my original design.
  • Regarding inheriting from the base class: (i) I think `clone` is useful, as datasets might be parametric, e.g., the TSF datasets have a name. (ii) `get_params` is also useful imo; this is a crucial inspection method. (iii) `save` could be used to move the dataset to another HD or cloud location, and would be fully in line with its usual meaning.
    • So I would do it, to avoid proliferating inheritance trees. There's nothing about fitting in there either; that's `BaseEstimator`, not `BaseObject`.
    • Another benefit is automatic compatibility with composition and scikit-base retrieval utilities.
    • And we do not lose the tagging and config system, which I think we really want? We want to avoid duplicating the parts we like, as patches in scikit-base will not automatically apply except through inheritance. (See the sketch after this list.)
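To illustrate what inheritance buys us, a minimal sketch assuming scikit-base's `skbase.base.BaseObject` (the dataset class and its tags are illustrative):

```python
from skbase.base import BaseObject


class TSFDataset(BaseObject):
    """Illustrative dataset inheriting from scikit-base's BaseObject."""

    # the tag system comes for free from BaseObject
    _tags = {"task": "forecasting"}

    def __init__(self, name="m4_hourly"):
        self.name = name
        super().__init__()


dataset = TSFDataset(name="m4_yearly")
dataset.get_params()     # {'name': 'm4_yearly'} - inspection for free
clean = dataset.clone()  # clean copy with identical parameters
dataset.get_tag("task")  # 'forecasting' - tag retrieval for free
```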

Smaller design questions related to the conceptual model:

  • I wouldn't say that all datasets have to be downloaded from somewhere. I would also account for: (i) in-memory generation from a pseudo-random seed; (ii) loading from the hard drive, possibly as a user extension case. So not everything will have `download`? (A sketch of (i) and (ii) follows after this list.)
  • Interesting question, perhaps out of scope but worth thinking about: how do we model dataset collections, e.g., "all datasets from M5"?
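A rough sketch of what (i) and (ii) could look like (illustrative, minimal classes):

```python
import numpy as np
import pandas as pd


class GeneratedDataset:
    """Case (i): dataset generated in memory from a pseudo-random seed."""

    def __init__(self, n_timepoints=100, random_state=42):
        self.n_timepoints = n_timepoints
        self.random_state = random_state

    def load(self):
        rng = np.random.default_rng(self.random_state)
        return pd.Series(rng.standard_normal(self.n_timepoints).cumsum())


class LocalDataset:
    """Case (ii): dataset loaded from the user's hard drive, no download."""

    def __init__(self, path):
        self.path = path

    def load(self):
        return pd.read_csv(self.path, index_col=0)
```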

@hazrulakmal (Collaborator, Author) commented Sep 22, 2023

> I think `clone` is useful, as datasets might be parametric, e.g., the TSF datasets have a name.

Yes, TSF and other loaders offer parameters to select which dataset to load and how the output should look. However, I believe there is no need for `clone` to turn a TSF dataset from a stateful object back into a specification object. The dataset module's primary purpose is to facilitate downloading, loading data into memory, or a mix of both. E.g., to specify a different dataset, such as one with a distinct name, you can simply instantiate a new object with the corresponding name, and you usually do this at the very start of the workflow rather than somewhere in the middle.

Please correct me if I'm mistaken, but it appears to me that the purpose of `clone` is to guarantee a clean object; in the case of an estimator, to return to the pre-fitting state. In the case of datasets, I don't see a similar need: even the logic for downloading versus reusing an existing dataset is handled within the download routines themselves.

The same applies to `reset`: it is not clear when users would ever need it.
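Concretely, the workflow I have in mind looks roughly like this (using the hypothetical `TSFDataset` from above):

```python
# selecting a different dataset is plain re-instantiation,
# done once at the start of the workflow - no clone() needed:
m4_hourly = TSFDataset(name="m4_hourly")
m4_yearly = TSFDataset(name="m4_yearly")

y = m4_hourly.load()  # loading data is the object's main job
```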

FYI, my understanding of these methods relies on your explanation from #2804

> `save` could be used to move the dataset to another HD or cloud location, and would be fully in line with its usual meaning.

Perhaps `download` does the same thing as `save` but under a different name. Or they can differ in that `save` is used after the dataset is loaded into memory, when users want to write it somewhere other than the place it was read from.
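A sketch of that possible distinction (hypothetical method names and semantics, not implemented code):

```python
dataset = TSFDataset(name="m4_hourly")

dataset.download(path="~/sktime_data")    # fetch from the external source
y = dataset.load()                        # read into memory
dataset.save(path="s3://my-bucket/data")  # write the loaded copy elsewhere
```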

> Another benefit is automatic compatibility with composition and scikit-base retrieval utilities.

Do you mind pointing me to how I can understand what the word composition refers to, and what some examples of scikit-base retrieval utilities are? That way it will be clearer on my side what these two mean.

> And we do not lose the tagging and config system, which I think we really want?

In fact, this is the downside that I recognise and agree on: we lose the tagging and config system if we don't go with the `BaseObject` approach. If we can truly justify inheriting `clone` & `reset`, then I have no other objections.

> I wouldn't say that all datasets have to be downloaded from somewhere. I would also account for: (i) in-memory generation from a pseudo-random seed; (ii) loading from the hard drive, possibly as a user extension case. So not everything will have `download`?

Yup, I will account for this as I continue iterating. I recognise that the current base class is more of a parent class for datasets that require downloading from external sources.

As of now, what does (i) look like?

> Interesting question, perhaps out of scope but worth thinking about: how do we model dataset collections, e.g., "all datasets from M5"?

I think for now we should focus on porting datasets from academic archive repos that hold datasets in a uniform format, like UEA & UCR classification, Monash forecasting, and Monash UEA & UCR regression. For other repo sources like Kaggle, where the M5 datasets and some other famous hierarchical datasets are parked, the datasets are usually large, so we would probably need to compress them and save them in our own storage for retrieval. That requires data cleaning etc., and I don't know how much compression would help. Afaik, Kaggle has no API that can help us fetch the datasets from them.

@fkiraly (Collaborator) commented Sep 23, 2023

> it appears to me that the purpose of `clone` is to guarantee a clean object; in the case of an estimator,

Yes, there are two of these, `clone` and `reset`:

  • `clone` creates a clean copy
  • `reset` resets self to the clean specification state
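
Illustrated on an estimator (the same semantics would carry over to any `BaseObject`):

```python
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()

f = NaiveForecaster()
f.fit(y)

f2 = f.clone()  # new, unfitted object with the same parameters; f unchanged
f.reset()       # f itself is returned to its clean specification state
```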

@fkiraly (Collaborator) commented Sep 23, 2023

> Do you mind pointing me to how I can understand what the word composition refers to

A composition in the sense of object-oriented programming: https://en.wikipedia.org/wiki/Composition_over_inheritance

An example would be a forecasting pipeline. Crucially, compositions also follow the strategy pattern; an example is what `get_params` does for a forecasting pipeline (e.g., `ForecastingPipeline`) and how it calls `get_params` of the components.
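For instance (a minimal, illustrative pipeline):

```python
from sktime.forecasting.compose import ForecastingPipeline
from sktime.forecasting.naive import NaiveForecaster
from sktime.transformations.series.boxcox import BoxCoxTransformer

pipe = ForecastingPipeline(
    steps=[
        ("boxcox", BoxCoxTransformer()),
        ("forecaster", NaiveForecaster(strategy="mean")),
    ]
)

# the composite delegates to get_params of its components; component
# parameters appear under keys such as "forecaster__strategy"
pipe.get_params()
```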

> and what some examples of scikit-base retrieval utilities are?

`registry.all_estimators`
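For example:

```python
from sktime.registry import all_estimators

# lookup of all forecasters; with datasets in the same class hierarchy,
# the same utility could enumerate and filter dataset objects by tag
all_estimators(estimator_types="forecaster")
```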
