[ENH] add Monash Forecasting Repository data loader #4826

Merged
merged 9 commits into sktime:main on Jul 14, 2023

Conversation

hazrulakmal
Collaborator

@hazrulakmal hazrulakmal commented Jul 4, 2023

Reference Issues/PRs

See #4695. Since this PR implements only one of the two repos mentioned in that reference, #4695 shouldn't be closed yet.

What does this implement/fix? Explain your changes.

Add a fetch_forecasting dataloader that retrieves forecasting datasets from the Monash Time Series Forecasting Repository. This dataloader first looks for the dataset in the extract_path. If the dataset is not present, it attempts to download the data from https://forecastingdata.org/ and saves it to the extract_path.

To ensure consistency with load_UCR_UEA_dataset, the data loader that does the same for classification datasets, load_forecastingdata closely follows its implementation and reuses some of the private functions it relies on.

Discussed: naming it fetch instead of load would better distinguish it from the other data loaders, which load sktime's pre-installed datasets, as suggested in #4314. For now, we've called it load_forecastingdata for consistency reasons.
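For context, a minimal usage sketch of the intended user-facing behaviour. This assumes the loader is exposed as sktime.datasets.load_forecastingdata and returns the data together with a metadata dict; the exact signature may differ from what is merged.

from sktime.datasets import load_forecastingdata

# The first call downloads "m1_yearly_dataset" from forecastingdata.org (Zenodo)
# into the extract path; subsequent calls load it from disk.
y, metadata = load_forecastingdata(name="m1_yearly_dataset")

print(metadata)  # e.g. frequency, forecast horizon, missing-value and length flags
print(y.head())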

Work in progress

What should a reviewer concentrate their feedback on?

  • fetch_forecastingorg function
  • tests to account for the changes

@hazrulakmal hazrulakmal requested a review from fkiraly as a code owner July 4, 2023 22:20
@hazrulakmal hazrulakmal marked this pull request as draft July 4, 2023 22:20
@hazrulakmal
Collaborator Author

  1. I have made some changes to _list_available_datasets so that it also looks for datasets from the forecastingorg repo, which uses a different format than the classification datasets. FYI, classification datasets from timeseriesclassification.com and regression datasets from tseregression.org share the format <dataset name>_TRAIN.ts and <dataset name>_TEST.ts, while forecasting datasets come in the .tsf format. I added some tests to ensure this works correctly.
  2. I have modified fetch_forecastingorg to first search for the dataset in the directory and load it into memory. If the dataset is not found, fetch_forecastingorg downloads it from the repo. By default, fetch_forecastingorg saves downloaded datasets in the data directory together with the other pre-installed datasets, which is slightly different from load_UCR_UEA_dataset, which creates a new directory, local_data, to save them. Parking downloaded datasets in the data directory is, in my opinion, cleaner and results in simpler code, as there is no need to search sequentially in data and then in local_data before deciding to download; the search can be done in a single directory (reducing redundancy).
  3. I have added a test that checks whether the URL exists and is downloadable, without actually downloading the zip file (for speed), for all available datasets in test_single_problem_loaders. This helps us quickly identify whether the upstream repo has deleted a dataset or changed the URL (although the former is unlikely, as the datasets are stored on a third-party, well-maintained storage platform). However, I'm unsure whether checking the existence of every dataset each time sktime developers run a CI test is desirable. Another option is to trigger such a CI test periodically (weekly, monthly, etc.). I'm not very familiar with CI workflows, so if this needs to be implemented, I would need some guidance on how to do it. A minimal sketch of such a URL check is shown after this list.
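To illustrate point 3, here is a minimal sketch of one way to verify a Zenodo URL without downloading the archive, using a HEAD request so only headers are transferred. The import path of tsf_all is assumed, the check relies on Zenodo returning a non-2xx status for missing files, and the actual test in the PR may implement this differently.

import requests

from sktime.datasets.tsf_dataset_names import tsf_all  # assumed: maps dataset name -> Zenodo record id


def url_is_reachable(name: str, timeout: float = 30.0) -> bool:
    """Check that a Monash dataset's Zenodo URL resolves, without downloading it."""
    url = f"https://zenodo.org/record/{tsf_all[name]}/files/{name}.zip"
    # HEAD returns headers only, so the zip archive itself is never transferred.
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return response.status_code < 400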

Tagging @achieveordie, since you have worked on the datasets module before.

@hazrulakmal hazrulakmal marked this pull request as ready for review July 6, 2023 12:29
@yarnabrina
Collaborator

I think these are related to your third point. I'm not aware whether any decision was taken in dev or council calls.

#4754
#4749

@fkiraly
Collaborator

fkiraly commented Jul 7, 2023

side note, @hazrulakmal - given that you have looked at this, do you have any suggestions for #4754 ?

@fkiraly
Collaborator

fkiraly commented Jul 7, 2023

I'm not aware whether any decision was taken in dev or council calls.

I see we were thinking of the same reference, @yarnabrina - no, no decision has been taken so far; the latest status was that we would engage with #4754 and come up with options (which has not happened yet - I assume due to the holiday weeks).

Collaborator

@achieveordie achieveordie left a comment


The code is excellent and so are the tests (we still need to think about the 3rd point you raised). My general concern right now is that we might be accumulating technical debt if we take direct inspiration from the classification loaders' code.

As you know, the current implementations for forecasting and classification have a lot of steps in common, resulting in a bloated codebase. If we choose to add a separate repository for a different task, we would unnecessarily inflate it even further.

So I suppose we should be asking ourselves some fundamental questions about designing data loaders and fetchers so that extending it remains simple and we avoid amassing any amount of tech debt for us to deal with in the future.

(This could very well be outside of the scope of this PR, but we need to start thinking in this context to avoid the aforementioned problems.)

@hazrulakmal
Collaborator Author

As you know, the current implementations for forecasting and classification have a lot of steps in common, resulting in a bloated codebase. If we choose to add a separate repository for a different task, we would unnecessarily inflate it even further.

I have personally looked at the dataset module codebase, and I hope our understanding of what counts as a bloated codebase is the same. However, I don't think that adding another repository, in this case for a forecasting task, is the root cause of the problem. Instead, I believe it's more about how the module is maintained and structured. Currently, all kinds of dataset functionality (readers, loaders, writers, etc.) live in one file, _data_io.py. This can be a headache for new developers and should be split into separate files for long-term maintainability. Common steps can always be factored out into shared utility functions. Refactoring the module should resolve all of these issues, which is a separate concern from this pull request.

Regarding the third point, I have made some comments on the issue; please take a look.

@achieveordie
Collaborator

I believe we are making the same point.

Quoting myself:

... we should be asking ourselves some fundamental questions about designing data loaders and fetchers so that extending it remains simple ...

and

This could very well be outside of the scope of this PR, but we need to start thinking in this context to avoid the aforementioned problems.

I believe both of us are making the same point, i.e. the current design is not suitable for extension, and correcting it may well be beyond the scope of this PR. Perhaps where we disagree is on the premise that this leads to a bloated codebase if we do not address the first point.

Let me try to convince you that this could lead to a bloated codebase (perhaps I should have worded my original sentence as a future possibility rather than an immediate result). Looking at fetch_forecastingorg():

    # Allow user to have non-standard extract path
    if extract_path is not None:
        local_module = os.path.dirname(extract_path)
        local_dirname = extract_path
    else:  # this is the default path for downloaded datasets
        local_module = MODULE
        local_dirname = DIRNAME

    if not os.path.exists(os.path.join(local_module, local_dirname)):
        os.makedirs(os.path.join(local_module, local_dirname))

    path_to_data_dir = os.path.join(local_module, local_dirname)
    # TODO: should create a function to check if dataset exists
    if name not in _list_available_datasets(path_to_data_dir, "forecastingorg"):
        # Dataset is not already present in the datasets directory provided.
        # If it is not there, download and install it.

        # TODO: create a registry function to look up valid dataset names
        # for the classification, regression, and forecasting dataset repos
        if name not in list(tsf_all_datasets):
            raise ValueError(
                f"{name} is not a valid dataset name. The list of valid dataset "
                "names can be found at "
                "sktime.datasets.tsf_dataset_names.tsf_all_datasets"
            )

        url = f"https://zenodo.org/record/{tsf_all[name]}/files/{name}.zip"

        # This also tests the validity of the URL; we can't rely on the HTML
        # status code as it always returns 200
        try:
            _download_and_extract(
                url,
                extract_path=path_to_data_dir,
            )
        except zipfile.BadZipFile as e:
            raise ValueError(
                f"Invalid dataset name ={name} is not available on extract path ="
                f"{extract_path}. Nor is it available on "
                f"https://forecastingdata.org/.",
            ) from e

and comparing it with _load_dataset():

    if extract_path is not None:
        local_module = os.path.dirname(extract_path)
        local_dirname = extract_path
    else:
        local_module = MODULE
        local_dirname = "data"

    if not os.path.exists(os.path.join(local_module, local_dirname)):
        os.makedirs(os.path.join(local_module, local_dirname))
    if name not in _list_available_datasets(extract_path):
        if extract_path is None:
            local_dirname = "local_data"
        if not os.path.exists(os.path.join(local_module, local_dirname)):
            os.makedirs(os.path.join(local_module, local_dirname))
        if name not in _list_available_datasets(
            os.path.join(local_module, local_dirname)
        ):
            # Dataset is not already present in the datasets directory provided.
            # If it is not there, download and install it.
            url = (
                "https://timeseriesclassification.com/"
                f"ClassificationDownloads/{name}.zip"
            )
            # This also tests the validity of the URL; we can't rely on the HTML
            # status code as it always returns 200
            try:
                _download_and_extract(
                    url,
                    extract_path=extract_path,
                )
            except zipfile.BadZipFile as e:
                raise ValueError(
                    f"Invalid dataset name ={name} is not available on extract path ="
                    f"{extract_path}. Nor is it available on "
                    f"https://timeseriesclassification.com/.",
                ) from e

We find that most of the core functionality is very similar, and refactoring it would avoid the repetition. Should we add more repositories that download from a different source in a different format, we would need to repeat the same process. This isn't on the new developers trying to extend the module, but an inherent flaw of the API design.
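For illustration, a minimal sketch of the kind of shared helper that could absorb the duplicated steps above. The function and parameter names here are hypothetical, not part of the current codebase; the repository-specific pieces (how to list available datasets and how to download) are injected as callables:

import os
import zipfile


def _fetch_dataset(name, url, extract_path, default_module, default_dirname,
                   list_available, download_and_extract):
    """Hypothetical shared helper: resolve the target directory, then download
    the dataset only if it is not already present locally."""
    if extract_path is not None:
        local_module = os.path.dirname(extract_path)
        local_dirname = extract_path
    else:
        local_module, local_dirname = default_module, default_dirname

    data_dir = os.path.join(local_module, local_dirname)
    os.makedirs(data_dir, exist_ok=True)

    if name not in list_available(data_dir):
        try:
            download_and_extract(url, extract_path=data_dir)
        except zipfile.BadZipFile as e:
            raise ValueError(
                f"Invalid dataset name {name}: not available at {data_dir} "
                f"and could not be downloaded from {url}."
            ) from e
    return data_dir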

@hazrulakmal
Collaborator Author

hazrulakmal commented Jul 12, 2023

... the current design is not suitable for extension ...

I agree with you on this matter. The inherent flaw in the dataset design forces the same processing steps to be re-implemented for each repository. In my opinion, the current design reads more like research code.

designing data loaders and fetchers so that extending it remains simple

I second this. As the dataset collection continues to grow to cover more complicated cases, a major redesign of the data loader abstraction is worth considering - I'm thinking of a dataset loader abstraction similar to what deep-learning libraries like PyTorch provide with their dataset classes. Your point above should be an important factor in that design. I think this would be a great addition to the work-in-progress enhancements around the kotsu benchmarking framework. Would you be willing to collaborate on this? I believe some early design work is currently being carried out in #4333.
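Purely as an illustration of that direction, a rough sketch of what a PyTorch-style abstraction could look like (class and method names are hypothetical, not an existing or planned sktime API):

from abc import ABC, abstractmethod


class BaseDatasetLoader(ABC):
    """Hypothetical base class: each repository subclasses it and only
    describes where its data lives and how a raw file is parsed."""

    @abstractmethod
    def download(self, name: str, path: str) -> str:
        """Fetch the raw dataset archive and return the local file path."""

    @abstractmethod
    def parse(self, path: str):
        """Parse the raw file into an sktime-compatible container."""

    def load(self, name: str, path: str = "data"):
        # the shared fetch-then-parse logic lives here once,
        # instead of being repeated in every per-repository loader
        return self.parse(self.download(name, path))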

Should we add more repositories that download from a different source into a different format, we would need to repeat the same process. This isn't on new developers that are trying to extend, but an inherent flaw of the API design

Moving forward, I believe it would be best to open a central issue to discuss the shortfalls of the dataset module, as we have been discussing them sporadically across multiple dataset-related GitHub issues. Regarding this specific PR, I will change the interface from fetch_forecasting to load_forecasting_dataset in order to maintain consistency with the current loader implementations. As you pointed out, we may not want to extend the dataset module using the current design, so any direct improvements to the existing code should probably not be questioned at this time; the underlying problems will be addressed when the dataset module is redesigned. I still aim to get this PR merged, as it offers a useful feature for users despite relying on the existing architecture. What do you think?

fkiraly
fkiraly previously approved these changes Jul 14, 2023
Collaborator

@fkiraly fkiraly left a comment


Agree with the reasoning above - let's merge this as it is useful functionality.

I'll just rename it to load_forecastingdata and add it to the API reference.

@fkiraly
Collaborator

fkiraly commented Jul 14, 2023

(please let me know, @hazrulakmal, if you'd rather have a different name etc)

@hazrulakmal
Collaborator Author

(please let me know, @hazrulakmal, if you'd rather have a different name etc)

I'm ok with the naming :)

@fkiraly fkiraly changed the title [ENH] Add Monash Forecasting Dataset Dataloader [ENH] add Monash Forecasting Repository data loader Jul 14, 2023
@fkiraly fkiraly merged commit fef1fc3 into sktime:main Jul 14, 2023
24 checks passed
@hazrulakmal hazrulakmal deleted the monashforecasting branch July 27, 2023 19:17