[ENH] Add real world datasets into dataset module #4695

Open
hazrulakmal opened this issue Jun 12, 2023 · 3 comments
Labels
enhancement Adding new functionality feature request New feature or request module:datasets&loaders data sets and data loaders

Comments

@hazrulakmal
Collaborator

hazrulakmal commented Jun 12, 2023

To conduct a real-world benchmarking study, we need access to real-world datasets. However, the current dataset module only contains a collection of simple/toy forecasting datasets. To address this limitation, I suggest incorporating more complex datasets from two public repositories:

  1. Monash Forecasting
  2. Time Series Extrinsic Regression

The goal is to create a function similar to `load_UCR_UEA_dataset` from the classification task. This function would fetch datasets from the public repositories, download them to the local machine, and load them into memory in a format that is compatible with estimators.

Monash Forecasting has several panel datasets that the current module does not include (see #3465); one example from the repo is London Smart Meters. The other repository is for a regression task, so it should also address issue #4314. Personally, I think it would be better to have a function that lets users download the datasets, as this would reduce the memory footprint of the package and offer many more datasets for users to choose from.

This feature would benefit not only the benchmarking framework, but also users who are interested in working with real-world time series examples. With a single line of code to download and load a dataset, users can focus on modeling rather than on data collection.

fyi @fkiraly @achieveordie @JonathanBechtel

@hazrulakmal hazrulakmal added the enhancement Adding new functionality label Jun 12, 2023
@fkiraly fkiraly added feature request New feature or request module:datasets&loaders data sets and data loaders labels Jun 12, 2023
@yarnabrina
Collaborator

In last Friday's call, @JonathanBechtel was talking about functionalities like `make_time_series_dataset`, similar to scikit-learn's `make_*` functions. I assume one can use it to generate time series with desired features, e.g. trend, seasonality, interaction type, etc. Is this issue going to cover that as well, or is it focused only on loading from the specified public repositories?

(just clarifying, as that would be such a useful feature for quick testing.)
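
For illustration, a minimal sketch of what such a generator could look like, using only the standard library. The name `make_series` and all of its parameters are hypothetical; this is not the API discussed in the call, nor scikit-learn's, just a linear trend plus a sine-wave seasonality plus Gaussian noise.

```python
import math
import random


def make_series(n_timepoints, trend=0.0, season_amplitude=0.0,
                season_period=12, noise_sd=0.0, seed=None):
    """Generate a synthetic series: linear trend + sine seasonality + noise.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    rng = random.Random(seed)  # local RNG so a fixed seed gives a reproducible series
    return [
        trend * t
        + season_amplitude * math.sin(2 * math.pi * t / season_period)
        + rng.gauss(0.0, noise_sd)
        for t in range(n_timepoints)
    ]


# Three years of monthly data with an upward trend and yearly seasonality.
y = make_series(36, trend=0.5, season_amplitude=3.0, season_period=12,
                noise_sd=1.0, seed=42)
```

Passing the same `seed` twice returns the same series, which is what makes a generator like this handy for quick tests.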

@hazrulakmal
Collaborator Author

hazrulakmal commented Jun 13, 2023

I believe that functionalities like `make_time_series_dataset` do not produce real-world datasets, but rather synthetic ones generated from specified characteristics. If I'm not mistaken, there is a separate issue for this matter.

@fkiraly
Collaborator

fkiraly commented Jun 16, 2023

> If I'm not mistaken, there is a separate issue for this matter.

yes, afaik it's this one: #4098

fkiraly pushed a commit that referenced this issue Jul 14, 2023
See #4695. Since this PR implements only one of the two repos mentioned in
the referenced issue, that issue shouldn't be closed.

Add a `load_forecastingdata` dataloader that retrieves forecasting datasets
from the Monash Time Series Forecasting Repository. This dataloader
first looks for the dataset in the extract_path. If the dataset is not
present, it attempts to download the data from
*https://forecastingdata.org/* and saves it to the extract_path.
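
The lookup-then-download behaviour described above can be sketched roughly as follows. This is not the code from the PR: `fetch_or_load`, the `download` callback, and the `.data` file extension are hypothetical names chosen for illustration, and the actual download step is left to the caller.

```python
from pathlib import Path


def fetch_or_load(name, extract_path, download):
    """Return the local path of dataset `name`, downloading it only if missing.

    `download(name, target)` is a caller-supplied function (an assumption of
    this sketch) that writes the dataset file to `target`.
    """
    # Hypothetical on-disk layout: one file per dataset inside extract_path.
    target = Path(extract_path) / f"{name}.data"
    if target.exists():
        # Already downloaded earlier: reuse the local copy, no network access.
        return target
    # Not present yet: make sure the folder exists, then fetch and save.
    target.parent.mkdir(parents=True, exist_ok=True)
    download(name, target)
    return target
```

Keeping the download step as a parameter means the caching logic can be exercised offline with a stub, while the real loader would plug in an HTTP download.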

To ensure consistency with the implementation of `load_UCR_UEA_dataset`,
a dataloader that does the same for classification datasets,
`load_forecastingdata` closely follows it and uses some of the private
functions that it relies on.

Discussed: naming it `fetch_*` instead of `load_*` would better
distinguish it from other dataloaders that load sktime's pre-installed
datasets, as suggested in #4314. For now, we've called it
`load_forecastingdata` for consistency reasons.
3 participants