[EHN] Alternative design for dataset as an object #5270
base: main
Conversation
A comment rather than a full review.

- I like the overall design of the base class, i.e., methods and intended use. I will have to play around to see the user experience, but I think it is an improvement on my original design.
- Regarding inheriting from the base class: (i) I think `clone` is useful, as datasets might be parametric, e.g., the TSF datasets have a name. (ii) `get_params` is also useful imo; it is a crucial inspection method. (iii) `save` could be used to move the dataset to another HD or cloud location, and would be fully in line with its usual meaning.
- So I would do it, to avoid proliferating inheritance trees. There's nothing about fitting in there either; that's `BaseEstimator`, not `BaseObject`.
- Another benefit is automatic compatibility with composition and `scikit-base` retrieval utilities.
- And we do not lose the tagging and config system, which I think we really want? We want to avoid duplicating the parts we like, as patches in `scikit-base` will not automatically apply except in the case of inheritance.
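To make the point about `clone` and `get_params` concrete, here is a standalone sketch (deliberately not importing `scikit-base` itself) of why those inherited methods are useful for a parametric dataset. `TSFDataset` and its `name` parameter are hypothetical illustrations, not the actual API.

```python
# Standalone sketch of the get_params / clone pattern for a parametric
# dataset. BaseObjectSketch is a minimal stand-in for scikit-base's
# BaseObject interface; TSFDataset and its `name` parameter are hypothetical.

import inspect


class BaseObjectSketch:
    """Minimal stand-in for a scikit-base-style base object."""

    def get_params(self):
        # collect constructor parameters by name, scikit-learn style
        sig = inspect.signature(type(self).__init__)
        return {k: getattr(self, k) for k in sig.parameters if k != "self"}

    def clone(self):
        # a fresh, distinct copy carrying the same parameters
        return type(self)(**self.get_params())


class TSFDataset(BaseObjectSketch):
    """Hypothetical parametric dataset: `name` selects which TSF set."""

    def __init__(self, name="m1_yearly"):
        self.name = name


ds = TSFDataset(name="m3_monthly")
print(ds.get_params())   # {'name': 'm3_monthly'}
print(ds.clone().name)   # 'm3_monthly', but a distinct object
```

Inheriting these for free, rather than re-implementing them per dataset class, is the "avoid proliferating inheritance trees" argument above.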
Smaller design questions related to the conceptual model:
- I wouldn't say that all datasets have to be downloaded from somewhere. I would also account for: (i) in-memory generation from a pseudo-random seed; (ii) loading from the hard drive, possibly the user extension case. So not everything will have `download`?
- Interesting question, perhaps out of scope but worth thinking about: how do we model dataset collections? E.g., "all datasets from M5"?
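The "not everything has `download`" point can be sketched by keeping only `load` on the base class and putting `download` on a subclass. This is an assumption about how the hierarchy could be split; all class and method names below are hypothetical.

```python
# Sketch: the base class only requires `load`; `download` lives on a
# subclass, so in-memory generated datasets (case (i) above) don't carry
# a meaningless download method. All names are hypothetical.

from abc import ABC, abstractmethod
import random


class BaseDataset(ABC):
    @abstractmethod
    def load(self):
        """Return the data; how it is obtained is up to the subclass."""


class DownloadedDataset(BaseDataset):
    """Variant that fetches from an external source before loading."""

    def __init__(self, url):
        self.url = url

    def download(self, to_path):
        ...  # e.g. fetch self.url to to_path

    def load(self):
        ...  # read the downloaded file


class GeneratedDataset(BaseDataset):
    """Variant (i): generated in memory from a pseudo-random seed."""

    def __init__(self, seed=42, n=5):
        self.seed = seed
        self.n = n

    def load(self):
        rng = random.Random(self.seed)
        return [rng.gauss(0, 1) for _ in range(self.n)]


# same seed -> same data, and no `download` method is needed
assert GeneratedDataset(seed=1).load() == GeneratedDataset(seed=1).load()
```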
Yes, TSF or other loaders offer some kind of parameters to select the dataset you wish to load, or how the output should look. However, I believe there's no necessity to have

Please correct me if I'm mistaken, but it appears to me that the purpose of the

The same case is also not very clear for when users ever need to do

FYI, my understanding of these methods relies on your explanation from #2804
Do you mind pointing me to how I can understand what the word "composition" refers to, and to some examples of scikit-base retrieval utilities? That way it will be clearer on my side what these two mean.
In fact, this is the bad part that I recognise and agree on: we lose the tagging and config system if we don't go with the
Yup, will account for this as I continue with the iteration. I recognise that the current base class is more of a parent class for a dataset that requires downloading from external sources. As of now, what does (i) look like?
I think for now we focus on porting datasets from academic archive repos that hold datasets in some kind of uniform format, like UEA & UCR classification, Monash forecasting, and Monash UEA & UCR regression. For other repo sources like Kaggle, where the M5 datasets and some other famous hierarchical datasets are parked, the datasets are usually large in size, so we probably need to compress them and save them in our own storage somewhere for retrieval. This requires data cleaning, etc., and I don't know how much smaller compression can make them. Afaik, Kaggle has no API that can help us fetch the datasets from them.
Yes, there are two of these,
Composition in the sense of object-oriented programming: https://en.wikipedia.org/wiki/Composition_over_inheritance. An example would be a forecasting pipeline. Crucially, composition also follows the strategy pattern - an example would be what
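As a concrete illustration of composition following the strategy pattern, here is a minimal sketch of a forecasting pipeline that *has* interchangeable steps rather than inheriting their behaviour. All class names are hypothetical, not sktime's actual pipeline API.

```python
# Sketch of composition + strategy pattern: the pipeline delegates to
# whatever transformer/forecaster objects it is composed with.
# All names are hypothetical illustrations.

class Differencer:
    def transform(self, y):
        # first differences of the series
        return [b - a for a, b in zip(y, y[1:])]


class NaiveForecaster:
    def predict(self, y):
        return y[-1]  # last value carried forward


class ForecastingPipeline:
    """Composes a transformer with a forecaster; either can be swapped."""

    def __init__(self, transformer, forecaster):
        self.transformer = transformer
        self.forecaster = forecaster

    def predict(self, y):
        return self.forecaster.predict(self.transformer.transform(y))


pipe = ForecastingPipeline(Differencer(), NaiveForecaster())
print(pipe.predict([1, 2, 4, 7]))  # last first-difference: 3
```

The relevance to datasets: any object exposing a uniform `get_params`/`clone` interface can be slotted into such compositions and discovered by generic retrieval utilities, which is the compatibility benefit mentioned in the review above.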
Reference Issues/PRs
alternative design prototype to #4333 & towards #5105 & #5196
What does this implement/fix? Explain your changes.
Design from #4333 is great and opens up the possibility of using the tagging system. However, the problem is that datasets do not have the same characteristics as `BaseObject`. `BaseObject` has a number of methods that are totally unrelated to datasets, as the design is fundamentally tailored for "estimator" objects. Some of the unrelated methods, to name a few, are `load_from_serial`, `get_params`, `clone`, `reset`, etc. I propose we create the dataset base class from the ground up, and this PR fills in the implementation details.
Did you add any tests for the change?
Will add tests as the implementation progresses. Work in progress.
PR checklist