
[ENH] Alternative design for dataset as an object #5270

Draft: wants to merge 6 commits into base: main
Conversation

@hazrulakmal (Collaborator) commented Sep 18, 2023

Reference Issues/PRs

Alternative design prototype to #4333, working towards #5105 and #5196.

What does this implement/fix? Explain your changes.

The design from #4333 is great and opens up the possibility of using the tagging system. However, the problem is that datasets do not have the same characteristics as `BaseObject`. `BaseObject` has a number of methods that are entirely unrelated to datasets, as the design is fundamentally tailored to "estimator" objects. To name a few of the unrelated methods: `load_from_serial`, `get_params`, `clone`, `reset`, etc.

I propose we create the dataset base class from the ground up; this PR fills in the implementation details.
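A rough sketch of the direction (illustrative names, not the final code in this PR):

```python
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Illustrative ground-up dataset base class, independent of BaseObject."""

    def __init__(self, save_dir=None):
        self.save_dir = save_dir

    def load(self):
        """Return the dataset, downloading it first if necessary."""
        return self._load()

    @abstractmethod
    def _load(self):
        """Load the dataset into memory; implemented by concrete datasets."""


class UCRDataset(BaseDataset):
    """Illustrative loader for a single UEA/UCR classification dataset."""

    def __init__(self, name, save_dir=None):
        self.name = name
        super().__init__(save_dir=save_dir)

    def _load(self):
        ...  # download-and-parse logic for the named dataset goes here
```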

Did you add any tests for the change?

Will add tests as the implementation progresses; this is a work in progress.

PR checklist

  • TSC
  • TSF
  • one pre-installed dataset
  • refactor dataset download routines

@hazrulakmal added the API design and module:datasets&loaders labels on Sep 19, 2023
@hazrulakmal self-assigned this on Sep 19, 2023
@fkiraly (Collaborator) left a comment
A comment rather than a full review.

  • I like the overall design of the base class, i.e., methods and intended use. I will have to play around with it to see the user experience, but I think it is an improvement on my original design.
  • Regarding inheriting from the base class: (i) I think `clone` is useful, as datasets might be parametric, e.g., the TSF datasets have a name. (ii) `get_params` is also useful imo; this is a crucial inspection method. (iii) `save` could be used to move the dataset to another HD or cloud location, and would be fully in line with its usual meaning.
    • So I would do it, to avoid proliferating inheritance trees. There's nothing about fitting in there either; that's `BaseEstimator`, not `BaseObject`.
    • Another benefit is automatic compatibility with composition and scikit-base retrieval utilities.
    • And we do not lose the tagging and config system, which I think we really want? We want to avoid duplicating the parts we like, as patches in scikit-base will not automatically apply except through inheritance. (See the sketch after this list.)
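To illustrate what inheritance buys us, a minimal sketch assuming scikit-base's `skbase.base.BaseObject` (the dataset class and its tags are illustrative):

```python
from skbase.base import BaseObject


class TSFDataset(BaseObject):
    """Illustrative dataset inheriting from scikit-base's BaseObject."""

    # the tag system comes for free from BaseObject
    _tags = {"task": "forecasting"}

    def __init__(self, name="m4_hourly"):
        self.name = name
        super().__init__()


dataset = TSFDataset(name="m4_yearly")
dataset.get_params()     # {'name': 'm4_yearly'} - inspection for free
clean = dataset.clone()  # clean copy with identical parameters
dataset.get_tag("task")  # 'forecasting' - tag retrieval for free
```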

Smaller design questions related to the conceptual model:

  • I wouldn't say that all datasets have to be downloaded from somewhere. I would also account for: (i) in-memory generation from a pseudo-random seed; (ii) loading from the hard drive, possibly as a user extension case. So not everything will have `download`? (A sketch of (i) and (ii) follows after this list.)
  • Interesting question, perhaps out of scope but worth thinking about: how do we model dataset collections, e.g., "all datasets from M5"?
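A rough sketch of what (i) and (ii) could look like (illustrative, minimal classes):

```python
import numpy as np
import pandas as pd


class GeneratedDataset:
    """Case (i): dataset generated in memory from a pseudo-random seed."""

    def __init__(self, n_timepoints=100, random_state=42):
        self.n_timepoints = n_timepoints
        self.random_state = random_state

    def load(self):
        rng = np.random.default_rng(self.random_state)
        return pd.Series(rng.standard_normal(self.n_timepoints).cumsum())


class LocalDataset:
    """Case (ii): dataset loaded from the user's hard drive, no download."""

    def __init__(self, path):
        self.path = path

    def load(self):
        return pd.read_csv(self.path, index_col=0)
```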

@hazrulakmal (Collaborator, Author) commented Sep 22, 2023

> I think `clone` is useful, as datasets might be parametric, e.g., the TSF datasets have a name.

Yes, TSF and other loaders offer parameters to select which dataset to load and how the output should look. However, I believe there is no need for `clone` to turn a TSF dataset from a stateful object back into a specification object. The dataset module's primary purpose is to facilitate downloading, loading data into memory, or a mix of both. E.g., to specify a different dataset, such as one with a distinct name, you can simply instantiate a new object with the corresponding name, and you usually do this at the very start of the workflow rather than somewhere in the middle.

Please correct me if I'm mistaken, but it appears to me that the purpose of `clone` is to guarantee a clean object; in the case of an estimator, to return to the pre-fitting state. In the case of datasets, I don't see a similar need: even the logic for downloading versus reusing an existing dataset is handled within the download routines themselves.

The same applies to `reset`: it is not clear when users would ever need it.
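Concretely, the workflow I have in mind looks roughly like this (using the hypothetical `TSFDataset` from above):

```python
# selecting a different dataset is plain re-instantiation,
# done once at the start of the workflow - no clone() needed:
m4_hourly = TSFDataset(name="m4_hourly")
m4_yearly = TSFDataset(name="m4_yearly")

y = m4_hourly.load()  # loading data is the object's main job
```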

FYI, my understanding of these methods relies on your explanation from #2804

> `save` could be used to move the dataset to another HD or cloud location, and would be fully in line with its usual meaning.

Perhaps `download` does the same thing as `save` but under a different name. Or they can differ in that `save` is used after the dataset is loaded into memory, when users want to write it somewhere other than the place it was read from.
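A sketch of that possible distinction (hypothetical method names and semantics, not implemented code):

```python
dataset = TSFDataset(name="m4_hourly")

dataset.download(path="~/sktime_data")    # fetch from the external source
y = dataset.load()                        # read into memory
dataset.save(path="s3://my-bucket/data")  # write the loaded copy elsewhere
```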

> Another benefit is automatic compatibility with composition and scikit-base retrieval utilities.

Do you mind pointing me to how I can understand what the word composition refers to, and what some examples of scikit-base retrieval utilities are? That way it will be clearer on my side what these two mean.

> And we do not lose the tagging and config system, which I think we really want?

In fact, this is the downside that I recognise and agree on: we lose the tagging and config system if we don't go with the `BaseObject` approach. If we can truly justify inheriting `clone` & `reset`, then I have no other objections.

> I wouldn't say that all datasets have to be downloaded from somewhere. I would also account for: (i) in-memory generation from a pseudo-random seed; (ii) loading from the hard drive, possibly as a user extension case. So not everything will have `download`?

Yup, I will account for this as I continue iterating. I recognise that the current base class is more of a parent class for datasets that require downloading from external sources.

As of now, what does (i) look like?

> Interesting question, perhaps out of scope but worth thinking about: how do we model dataset collections, e.g., "all datasets from M5"?

I think for now we should focus on porting datasets from academic archive repos that hold datasets in a uniform format, like UEA & UCR classification, Monash forecasting, and Monash UEA & UCR regression. For other repo sources like Kaggle, where the M5 datasets and some other famous hierarchical datasets are parked, the datasets are usually large, so we would probably need to compress them and save them in our own storage for retrieval. That requires data cleaning etc., and I don't know how much compression would help. Afaik, Kaggle has no API that can help us fetch the datasets from them.

@fkiraly (Collaborator) commented Sep 23, 2023

> it appears to me that the purpose of `clone` is to guarantee a clean object; in the case of an estimator,

Yes, there are two of these, `clone` and `reset`:

  • `clone` creates a clean copy
  • `reset` resets self to the clean specification state
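
Illustrated on an estimator (the same semantics would carry over to any `BaseObject`):

```python
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()

f = NaiveForecaster()
f.fit(y)

f2 = f.clone()  # new, unfitted object with the same parameters; f unchanged
f.reset()       # f itself is returned to its clean specification state
```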

@fkiraly (Collaborator) commented Sep 23, 2023

> Do you mind pointing me to how I can understand what the word composition refers to

A composition in the sense of object-oriented programming: https://en.wikipedia.org/wiki/Composition_over_inheritance

An example would be a forecasting pipeline. Crucially, compositions also follow the strategy pattern; an example is what `get_params` does for a forecasting pipeline (e.g., `ForecastingPipeline`) and how it calls `get_params` of the components.
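For instance (a minimal, illustrative pipeline):

```python
from sktime.forecasting.compose import ForecastingPipeline
from sktime.forecasting.naive import NaiveForecaster
from sktime.transformations.series.boxcox import BoxCoxTransformer

pipe = ForecastingPipeline(
    steps=[
        ("boxcox", BoxCoxTransformer()),
        ("forecaster", NaiveForecaster(strategy="mean")),
    ]
)

# the composite delegates to get_params of its components; component
# parameters appear under keys such as "forecaster__strategy"
pipe.get_params()
```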

> and what some examples of scikit-base retrieval utilities are?

`registry.all_estimators`
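For example:

```python
from sktime.registry import all_estimators

# lookup of all forecasters; with datasets in the same class hierarchy,
# the same utility could enumerate and filter dataset objects by tag
all_estimators(estimator_types="forecaster")
```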
