[ENH] add Monash Forecasting Repository data loader #4826

Merged
merged 9 commits into sktime:main on Jul 14, 2023

Conversation

hazrulakmal
Collaborator

@hazrulakmal hazrulakmal commented Jul 4, 2023

Reference Issues/PRs

See #4695. Since this PR implements only one of the two repos mentioned in that reference, #4695 shouldn't be closed yet.

What does this implement/fix? Explain your changes.

Add a fetch_forecasting dataloader that retrieves forecasting datasets from the Monash Time Series Forecasting Repository. This dataloader first looks for the dataset in the extract_path. If the dataset is not present, it attempts to download the data from https://forecastingdata.org/ and saves it to the extract_path.

To ensure consistency with load_UCR_UEA_dataset, the data loader that does the same for classification datasets, load_forecastingdata closely follows its implementation and reuses some of the private functions it relies on.

Discussed: naming it fetch instead of load would better distinguish it from the other data loaders, which load sktime's pre-installed datasets, as suggested in #4314. For now, we've called it load_forecastingdata for consistency reasons.
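For context, a minimal usage sketch of the intended user-facing behaviour. This assumes the loader is exposed as sktime.datasets.load_forecastingdata and returns the data together with a metadata dict; the exact signature may differ from what is merged.

from sktime.datasets import load_forecastingdata

# The first call downloads "m1_yearly_dataset" from forecastingdata.org (Zenodo)
# into the extract path; subsequent calls load it from disk.
y, metadata = load_forecastingdata(name="m1_yearly_dataset")

print(metadata)  # e.g. frequency, forecast horizon, missing-value and length flags
print(y.head())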

Work in progress

What should a reviewer concentrate their feedback on?

  • fetch_forecastingorg function
  • tests to account for the changes

@hazrulakmal hazrulakmal requested a review from fkiraly as a code owner July 4, 2023 22:20
@hazrulakmal hazrulakmal marked this pull request as draft July 4, 2023 22:20
@hazrulakmal
Collaborator Author

  1. I have made some changes to _list_available_datasets so that it also looks for datasets from the forecastingorg repo, which uses a different format than the classification datasets. FYI, classification datasets from timeseriesclassification.com and regression datasets from tseregression.org share the format <dataset name>_TRAIN.ts and <dataset name>_TEST.ts, while forecasting datasets come in the .tsf format. I added some tests to ensure this works correctly.
  2. I have modified fetch_forecastingorg to first search for the dataset in the directory and load it into memory. If the dataset is not found, fetch_forecastingorg downloads it from the repo. By default, fetch_forecastingorg saves downloaded datasets in the data directory together with the other pre-installed datasets, which is slightly different from load_UCR_UEA_dataset, which creates a new directory, local_data, to save them. Parking downloaded datasets in the data directory is, in my opinion, cleaner and results in simpler code, as there is no need to search sequentially in data and then in local_data before deciding to download; the search can be done in a single directory (reducing redundancy).
  3. I have added a test that checks whether the URL exists and is downloadable, without actually downloading the zip file (for speed), for all available datasets in test_single_problem_loaders. This helps us quickly identify whether the upstream repo has deleted a dataset or changed the URL (although the former is unlikely, as the datasets are stored on a third-party, well-maintained storage platform). However, I'm unsure whether checking the existence of every dataset each time sktime developers run a CI test is desirable. Another option is to trigger such a CI test periodically (weekly, monthly, etc.). I'm not very familiar with CI workflows, so if this needs to be implemented, I would need some guidance on how to do it. A minimal sketch of such a URL check is shown after this list.
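To illustrate point 3, here is a minimal sketch of one way to verify a Zenodo URL without downloading the archive, using a HEAD request so only headers are transferred. The import path of tsf_all is assumed, the check relies on Zenodo returning a non-2xx status for missing files, and the actual test in the PR may implement this differently.

import requests

from sktime.datasets.tsf_dataset_names import tsf_all  # assumed: maps dataset name -> Zenodo record id


def url_is_reachable(name: str, timeout: float = 30.0) -> bool:
    """Check that a Monash dataset's Zenodo URL resolves, without downloading it."""
    url = f"https://zenodo.org/record/{tsf_all[name]}/files/{name}.zip"
    # HEAD returns headers only, so the zip archive itself is never transferred.
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return response.status_code < 400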

Tagging @achieveordie, since you have worked on the datasets module before.

@hazrulakmal hazrulakmal marked this pull request as ready for review July 6, 2023 12:29
@yarnabrina
Collaborator

I think these are related to your third point. I'm not aware whether any decision was taken in dev or council calls.

#4754
#4749

@fkiraly
Collaborator

fkiraly commented Jul 7, 2023

side note, @hazrulakmal - given that you have looked at this, do you have any suggestions for #4754 ?

@fkiraly
Collaborator

fkiraly commented Jul 7, 2023

I'm not aware whether any decision was taken in dev or council calls.

I see we were thinking of the same reference, @yarnabrina - no, no decision has been taken so far; the latest status was that we would engage with #4754 and come up with options (which has not happened yet - I assume due to the holiday weeks).

Collaborator

@achieveordie achieveordie left a comment


The code is excellent and so are the tests (we still need to think about the 3rd point you raised). My general concern right now is that we might be accumulating technical debt if we take direct inspiration from the classification loaders' code.

As you know, the current implementations for forecasting and classification have a lot of steps in common, resulting in a bloated codebase. If we choose to add a separate repository for a different task, we would unnecessarily inflate it even further.

So I suppose we should be asking ourselves some fundamental questions about designing data loaders and fetchers so that extending it remains simple and we avoid amassing any amount of tech debt for us to deal with in the future.

(This could very well be outside of the scope of this PR, but we need to start thinking in this context to avoid the aforementioned problems.)

@hazrulakmal
Collaborator Author

As you know, the current implementations for forecasting and classification have a lot of steps in common, resulting in a bloated codebase. If we choose to add a separate repository for a different task, we would unnecessarily inflate it even further.

I have personally looked at the dataset module codebase, and I hope our understanding of what counts as a bloated codebase is the same. However, I don't think that adding another repository, in this case for a forecasting task, is the root cause of the problem. Instead, I believe it's more about how the module is maintained and structured. Currently, all kinds of dataset functionality (readers, loaders, writers, etc.) live in one file, _data_io.py. This can be a headache for new developers and should be split into separate files for long-term maintainability. Common steps can always be factored out into shared utility functions. Refactoring the module should resolve all of these issues, which is a separate concern from this pull request.

Regarding the third point, I have made some comments on the issue; please take a look.

@achieveordie
Collaborator

I believe we are making the same point.

Quoting myself:

... we should be asking ourselves some fundamental questions about designing data loaders and fetchers so that extending it remains simple ...

and

This could very well be outside of the scope of this PR, but we need to start thinking in this context to avoid the aforementioned problems.

I believe both of us are making the same point, i.e. the current design is not suitable for extension, and correcting it may well be beyond the scope of this PR. Perhaps where we disagree is on the premise that this leads to a bloated codebase if we do not address the first point.

Let me try to convince you that this could lead to a bloated codebase (perhaps I should have worded my original sentence as a future possibility rather than an immediate result). Looking at fetch_forecastingorg():

    # Allow user to have non-standard extract path
    if extract_path is not None:
        local_module = os.path.dirname(extract_path)
        local_dirname = extract_path
    else:  # this is the default path for downloaded datasets
        local_module = MODULE
        local_dirname = DIRNAME

    if not os.path.exists(os.path.join(local_module, local_dirname)):
        os.makedirs(os.path.join(local_module, local_dirname))

    path_to_data_dir = os.path.join(local_module, local_dirname)
    # TODO: should create a function to check if dataset exists
    if name not in _list_available_datasets(path_to_data_dir, "forecastingorg"):
        # Dataset is not already present in the datasets directory provided.
        # If it is not there, download and install it.

        # TODO: create a registry function to look up valid dataset names
        # for the classification, regression, and forecasting dataset repos
        if name not in list(tsf_all_datasets):
            raise ValueError(
                f"{name} is not a valid dataset name. The list of valid dataset "
                "names can be found at "
                "sktime.datasets.tsf_dataset_names.tsf_all_datasets"
            )

        url = f"https://zenodo.org/record/{tsf_all[name]}/files/{name}.zip"

        # This also tests the validity of the URL; we can't rely on the HTML
        # status code as it always returns 200
        try:
            _download_and_extract(
                url,
                extract_path=path_to_data_dir,
            )
        except zipfile.BadZipFile as e:
            raise ValueError(
                f"Invalid dataset name ={name} is not available on extract path ="
                f"{extract_path}. Nor is it available on "
                f"https://forecastingdata.org/.",
            ) from e

and comparing it with _load_dataset():

    if extract_path is not None:
        local_module = os.path.dirname(extract_path)
        local_dirname = extract_path
    else:
        local_module = MODULE
        local_dirname = "data"

    if not os.path.exists(os.path.join(local_module, local_dirname)):
        os.makedirs(os.path.join(local_module, local_dirname))
    if name not in _list_available_datasets(extract_path):
        if extract_path is None:
            local_dirname = "local_data"
        if not os.path.exists(os.path.join(local_module, local_dirname)):
            os.makedirs(os.path.join(local_module, local_dirname))
        if name not in _list_available_datasets(
            os.path.join(local_module, local_dirname)
        ):
            # Dataset is not already present in the datasets directory provided.
            # If it is not there, download and install it.
            url = (
                "https://timeseriesclassification.com/"
                f"ClassificationDownloads/{name}.zip"
            )
            # This also tests the validity of the URL; we can't rely on the HTML
            # status code as it always returns 200
            try:
                _download_and_extract(
                    url,
                    extract_path=extract_path,
                )
            except zipfile.BadZipFile as e:
                raise ValueError(
                    f"Invalid dataset name ={name} is not available on extract path ="
                    f"{extract_path}. Nor is it available on "
                    f"https://timeseriesclassification.com/.",
                ) from e

We find that most of the core functionality is very similar, and refactoring it would avoid the repetition. Should we add more repositories that download from a different source in a different format, we would need to repeat the same process. This isn't on the new developers trying to extend the module, but an inherent flaw of the API design.
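For illustration, a minimal sketch of the kind of shared helper that could absorb the duplicated steps above. The function and parameter names here are hypothetical, not part of the current codebase; the repository-specific pieces (how to list available datasets and how to download) are injected as callables:

import os
import zipfile


def _fetch_dataset(name, url, extract_path, default_module, default_dirname,
                   list_available, download_and_extract):
    """Hypothetical shared helper: resolve the target directory, then download
    the dataset only if it is not already present locally."""
    if extract_path is not None:
        local_module = os.path.dirname(extract_path)
        local_dirname = extract_path
    else:
        local_module, local_dirname = default_module, default_dirname

    data_dir = os.path.join(local_module, local_dirname)
    os.makedirs(data_dir, exist_ok=True)

    if name not in list_available(data_dir):
        try:
            download_and_extract(url, extract_path=data_dir)
        except zipfile.BadZipFile as e:
            raise ValueError(
                f"Invalid dataset name {name}: not available at {data_dir} "
                f"and could not be downloaded from {url}."
            ) from e
    return data_dir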

@hazrulakmal
Collaborator Author

hazrulakmal commented Jul 12, 2023

... the current design is not suitable for extension ...

I agree with you on this matter. The inherent flaw in the dataset design forces the same processing steps to be re-implemented for each repository. In my opinion, the current design reads more like research code.

designing data loaders and fetchers so that extending it remains simple

I second this. As the dataset collection continues to grow to cover more complicated cases, a major redesign of the data loader abstraction is worth considering - I'm thinking of a dataset loader abstraction similar to what deep-learning libraries like PyTorch provide with their dataset classes. Your point above should be an important factor in that design. I think this would be a great addition to the work-in-progress enhancements around the kotsu benchmarking framework. Would you be willing to collaborate on this? I believe some early design work is currently being carried out in #4333.
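Purely as an illustration of that direction, a rough sketch of what a PyTorch-style abstraction could look like (class and method names are hypothetical, not an existing or planned sktime API):

from abc import ABC, abstractmethod


class BaseDatasetLoader(ABC):
    """Hypothetical base class: each repository subclasses it and only
    describes where its data lives and how a raw file is parsed."""

    @abstractmethod
    def download(self, name: str, path: str) -> str:
        """Fetch the raw dataset archive and return the local file path."""

    @abstractmethod
    def parse(self, path: str):
        """Parse the raw file into an sktime-compatible container."""

    def load(self, name: str, path: str = "data"):
        # the shared fetch-then-parse logic lives here once,
        # instead of being repeated in every per-repository loader
        return self.parse(self.download(name, path))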

Should we add more repositories that download from a different source into a different format, we would need to repeat the same process. This isn't on new developers that are trying to extend, but an inherent flaw of the API design

Moving forward, I believe it would be best to open a central issue to discuss the shortfalls of the dataset module, as we have been discussing them sporadically across multiple dataset-related GitHub issues. Regarding this specific PR, I will change the interface from fetch_forecasting to load_forecasting_dataset in order to maintain consistency with the current loader implementations. As you pointed out, we may not want to extend the dataset module using the current design, so any direct improvements to the existing code should probably not be questioned at this time; the underlying problems will be addressed when the dataset module is redesigned. I still aim to get this PR merged, as it offers a useful feature for users despite relying on the existing architecture. What do you think?

fkiraly
fkiraly previously approved these changes Jul 14, 2023
Collaborator

@fkiraly fkiraly left a comment


Agree with the reasoning above - let's merge this as it is useful functionality.

I'll just rename it to load_forecastingdata and add it to the API reference.

@fkiraly
Collaborator

fkiraly commented Jul 14, 2023

(please let me know, @hazrulakmal, if you'd rather have a different name etc)

@hazrulakmal
Collaborator Author

(please let me know, @hazrulakmal, if you'd rather have a different name etc)

I'm ok with the naming :)

@fkiraly fkiraly changed the title [ENH] Add Monash Forecasting Dataset Dataloader [ENH] add Monash Forecasting Repository data loader Jul 14, 2023
@fkiraly fkiraly merged commit fef1fc3 into sktime:main Jul 14, 2023
24 checks passed
@hazrulakmal hazrulakmal deleted the monashforecasting branch July 27, 2023 19:17