[ENH] Add real world datasets into dataset module #4695
Comments
In last Friday's call, @JonathanBechtel was talking about functionalities like (just clarifying, as that would be such a useful feature for quick testing.)
I believe that those functionalities like
yes, afaik it's this one: #4098
See #4695. Since this PR implements one of the two repos mentioned in the reference PR, it shouldn't be closed. Add a `load_forecastingdata` dataloader that retrieves forecasting datasets from the Monash Time Series Forecasting Repository. The dataloader first looks for the dataset in `extract_path`. If the dataset is not present, it attempts to download the data from *https://forecastingdata.org/* and saves it to `extract_path`. For consistency with `load_UCR_UEA_dataset`, the dataloader that does the same for classification datasets, `load_forecastingdata` closely follows its implementation and reuses some of the private functions it relies on. Discussed: naming it fetch instead of load would better distinguish it from the dataloaders that load sktime's pre-installed datasets, as suggested in #4314. For now, we've called it `load_forecastingdata` for consistency reasons.
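For readers skimming the thread, the "look in the extract path first, download only if missing" behaviour can be sketched roughly as below. This is a minimal illustration and not sktime's implementation: `fetch_cached`, `_download`, the `.dat` file name, and the placeholder base URL are all hypothetical names made up for this example, and the real dataloader additionally handles archive extraction and parsing.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve


def _download(url: str, target: Path) -> None:
    """Default downloader: fetch `url` and write it to `target`."""
    urlretrieve(url, str(target))


def fetch_cached(
    name: str,
    extract_path: str,
    # Placeholder URL (assumption), not the real repository layout.
    base_url: str = "https://example.invalid/datasets",
    download: Callable[[str, Path], None] = _download,
) -> Path:
    """Return the local path of dataset `name`, downloading only if missing."""
    target_dir = Path(extract_path)
    target = target_dir / f"{name}.dat"  # hypothetical file name
    if target.exists():
        # Already present in extract_path: no network access needed.
        return target
    # Not cached yet: create the directory, download, and keep the file.
    target_dir.mkdir(parents=True, exist_ok=True)
    download(f"{base_url}/{name}.dat", target)
    return target
```

The `download` argument is injectable so the caching logic can be exercised offline, for example with a stub that writes a small file instead of hitting the network.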
To conduct a real-world benchmarking study, we need access to real-world datasets. However, the current dataset module only contains a collection of simple/toy forecasting datasets. To address this limitation, I suggest incorporating more complex datasets from two public repositories:
The goal is to create a function similar to `load_UCR_UEA_dataset` from the classification task. This function would fetch datasets from the public repositories, download them to the local machine, and load them into memory in a format that is compatible with estimators. Monash Forecasting has several panel datasets that the current module does not include (see #3465); one example of a panel dataset from the repo is London Smart Meters. The other repository is for a regression task, so it should also address issue #4314. Personally, I think it would be better to have a function for users to download the datasets, as this would reduce the memory footprint of the package and offer many more datasets for users to choose from.
This feature would benefit not only the benchmarking framework, but also users who are interested in working with real-world time series examples. With a single line of code to download and load a dataset, users can focus on modeling rather than on data collection.
fyi @fkiraly @achieveordie @JonathanBechtel