[ENH] add Monash Forecasting Repository data loader #4826
Conversation
Tagging @achieveordie, since you have worked on the dataset module before.
Side note, @hazrulakmal - given that you have looked at this, do you have any suggestions for #4754?
I see we were thinking of the same reference, @yarnabrina - no, no decision has been taken so far; the last was that we would engage with #4754 and come up with options (which has not happened yet - I assume due to holiday weeks).
The code is excellent and so are the tests (we still need to think about the third point you raised). My general concern right now is that we might be accumulating technical debt if we take direct inspiration from the classification loader code.
As you know, the current implementations for forecasting and classification have a lot of steps in common, resulting in a bloated codebase. If we add a separate repository for a different task, we would unnecessarily increase it even further.
So I suppose we should be asking ourselves some fundamental questions about how we design data loaders and fetchers, so that extending them remains simple and we avoid amassing tech debt that we would have to deal with in the future.
(This may well be outside the scope of this PR, but we need to start thinking in this context to avoid the aforementioned problems.)
I have personally looked at the dataset module codebase, and I hope our understanding of "bloated codebase" is the same. However, I don't think that adding another repository, in this case for a forecasting task, is the root cause of the problem. Instead, I believe it's more about how the module is maintained and structured. Currently, all kinds of dataset functionality (readers, loaders, writers, etc.) are stored in one file. In regard to the third point, I have made some comments on the issue; please do take a look.
I believe we are saying the same point. Quoting myself:
and
I believe both of us are making the same point, i.e. the current design is not suitable for extension, and correcting it may well be beyond the scope of this PR. Perhaps we disagree on the premise that this leads to a bloated codebase if we do not address the first point. Let me try to convince you that it could (perhaps I should have worded my original sentence as a future possibility rather than an immediate result). Looking at

```python
# Allow user to have non-standard extract path
if extract_path is not None:
    local_module = os.path.dirname(extract_path)
    local_dirname = extract_path
else:  # this is the default path for downloaded datasets
    local_module = MODULE
    local_dirname = DIRNAME
if not os.path.exists(os.path.join(local_module, local_dirname)):
    os.makedirs(os.path.join(local_module, local_dirname))
path_to_data_dir = os.path.join(local_module, local_dirname)
# TODO: should create a function to check if the dataset exists
if name not in _list_available_datasets(path_to_data_dir, "forecastingorg"):
    # Dataset is not already present in the datasets directory provided.
    # If it is not there, download and install it.
    # TODO: create a registry function to look up valid dataset names
    # for the classification, regression, and forecasting dataset repos
    if name not in list(tsf_all_datasets):
        raise ValueError(
            f"{name} is not a valid dataset name. The list of valid dataset "
            "names can be found at "
            "sktime.datasets.tsf_dataset_names.tsf_all_datasets"
        )
    url = f"https://zenodo.org/record/{tsf_all[name]}/files/{name}.zip"
    # This also tests the validity of the URL; we can't rely on the HTTP
    # status code as it always returns 200
    try:
        _download_and_extract(
            url,
            extract_path=path_to_data_dir,
        )
    except zipfile.BadZipFile as e:
        raise ValueError(
            f"Invalid dataset name: {name} is not available on extract path "
            f"{extract_path}, nor is it available on "
            f"https://forecastingdata.org/."
        ) from e
```

and comparing it with

```python
if extract_path is not None:
    local_module = os.path.dirname(extract_path)
    local_dirname = extract_path
else:
    local_module = MODULE
    local_dirname = "data"
if not os.path.exists(os.path.join(local_module, local_dirname)):
    os.makedirs(os.path.join(local_module, local_dirname))
if name not in _list_available_datasets(extract_path):
    if extract_path is None:
        local_dirname = "local_data"
    if not os.path.exists(os.path.join(local_module, local_dirname)):
        os.makedirs(os.path.join(local_module, local_dirname))
    if name not in _list_available_datasets(
        os.path.join(local_module, local_dirname)
    ):
        # Dataset is not already present in the datasets directory provided.
        # If it is not there, download and install it.
        url = (
            "https://timeseriesclassification.com/"
            f"ClassificationDownloads/{name}.zip"
        )
        # This also tests the validity of the URL; we can't rely on the HTTP
        # status code as it always returns 200
        try:
            _download_and_extract(
                url,
                extract_path=extract_path,
            )
        except zipfile.BadZipFile as e:
            raise ValueError(
                f"Invalid dataset name: {name} is not available on extract "
                f"path {extract_path}, nor is it available on "
                f"https://timeseriesclassification.com/."
            ) from e
```

we find that most of the core functionality is very similar, and refactoring would avoid the repetition. Should we add more repositories that download from a different source in a different format, we would need to repeat the same process again. This isn't on new developers trying to extend the module, but an inherent flaw of the API design.
I agree with you on this matter. The inherent flaw in the dataset design forces repeated code for the same processing steps. In my opinion, the current design reads more like research code.
Second this. As the dataset collection continues to grow to cover more complicated cases, a major redesign of the data loader abstraction is worth considering - I'm thinking of dataset loaders similar to those in deep-learning libraries, like PyTorch datasets. Your point above should be an important factor to consider. I think this would be a great addition to the work-in-progress enhancements to the kotsu benchmarking framework. Would you be willing to collaborate on this? I believe some early design work is currently being carried out in #4333.
Moving forward, I believe it would be best to open a central issue to discuss the shortfalls of the dataset module, as we have been discussing them sporadically across multiple dataset-related GitHub issues. Regarding this specific issue, I will change the interface.
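For illustration, a torch-style design could put the shared orchestration (caching, download-if-missing) in a base class once, leaving subclasses to implement only the task-specific loading. All names here are a hypothetical sketch, not sktime's actual API:

```python
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Torch-style dataset abstraction (hypothetical sketch, not sktime API)."""

    def __init__(self, name):
        self.name = name
        self._cache = None

    @abstractmethod
    def _load(self):
        """Task-specific loading, implemented by each subclass."""

    def load(self):
        # Shared orchestration lives here once, instead of being copied
        # into every task-specific loader.
        if self._cache is None:
            self._cache = self._load()
        return self._cache


class ForecastingDataset(BaseDataset):
    def _load(self):
        # A real implementation would fetch from forecastingdata.org here;
        # this stub just stands in for the loaded data.
        return f"series for {self.name}"
```

Adding a new repository would then mean writing one small subclass rather than duplicating the path-resolution and download logic.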
Agree with the reasoning above - let's merge this as it is useful functionality.
I'll just rename it to `load_forecastingdata` and add it to the API reference.
(Please let me know, @hazrulakmal, if you'd rather have a different name, etc.)
I'm ok with the naming :)
Reference Issues/PRs
See #4695. Since this PR implements only one of the two repos mentioned in the reference PR, the reference PR shouldn't be closed.
What does this implement/fix? Explain your changes.
Add a `fetch_forecasting` dataloader that retrieves forecasting datasets from the Monash Time Series Forecasting Repository. This dataloader first looks for the dataset in the `extract_path`. If the dataset is not present, it attempts to download the data from https://forecastingdata.org/ and saves it to the `extract_path`.

To ensure consistency with the implementation of `load_UCR_UEA_dataset`, a dataloader that does the same for classification datasets, `load_forecastingdata` closely follows it and reuses some of the private functions it relies on.

Discussed: naming it `fetch` instead of `load` would better distinguish it from other dataloaders that load data from sktime's pre-installed datasets, as per the suggestion in #4314. For now, we've called it `load_forecastingdata` for consistency reasons.

Work in progress
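The look-locally-then-download control flow described above can be sketched in isolation. The helper and parameter names below are illustrative, not part of sktime; the callables are injected so the flow can be exercised without any network access:

```python
import os


def fetch_if_missing(name, extract_path, list_local, download):
    """Return the local path for ``name``, downloading it only when absent.

    ``list_local`` and ``download`` are injected callables (hypothetical),
    standing in for the directory listing and download-and-extract steps.
    """
    if name not in list_local(extract_path):
        # Dataset is not already present locally: download and install it.
        download(name, extract_path)
    return os.path.join(extract_path, name)
```

A loader built this way only hits the network on a cache miss, which matches the behaviour described for `load_forecastingdata`.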
What should a reviewer concentrate their feedback on?
The `fetch_forecastingorg` function.