
[ENH] file hosting for buffering or mirroring upstream data repositories #4754

Open · fkiraly opened this issue Jun 23, 2023 · 8 comments

Labels: API design (API design & software architecture), enhancement (adding new functionality), maintenance (continuous integration, unit testing & package distribution), module:datasets&loaders (data sets and data loaders)

fkiraly (Collaborator) commented Jun 23, 2023

Design issue to find a solution that mitigates abrupt changes or failures of upstream data repositories in sktime. Currently, such changes propagate directly to the user, and they are not unlikely, as the data repositories are usually academic and not professionally maintained.

From a design perspective, I see four options:

  1. status quo
  2. drop data set loaders entirely
  3. mirroring: mirror the upstream and use the mirror as primary source
  4. fallback/buffer: mirror the upstream but use the mirror only if the upstream fails

Options 3 and 4 require hosting the files, which for most data repositories means on the order of hundreds of files, at a total volume on the order of gigabytes. The difference between options 3 and 4 is only the order in which download sources are tried, as sketched below.
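To make the distinction between options 3 and 4 concrete, a minimal sketch follows; the URLs are illustrative placeholders, not actual sktime endpoints. Both options keep an ordered list of sources and differ only in which source is primary.

```python
# Hypothetical source orderings for options 3 and 4.
# URLs are placeholders, not sktime's actual endpoints.
UPSTREAM = "https://www.timeseriesclassification.com/Downloads"
MIRROR = "https://example.org/sktime-dataset-mirror"

# option 3 (mirroring): the mirror is the primary source
SOURCES_OPTION_3 = [MIRROR, UPSTREAM]

# option 4 (fallback/buffer): upstream first, mirror only on failure
SOURCES_OPTION_4 = [UPSTREAM, MIRROR]
```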

Hosting mechanisms discussed so far:

The hosting options differ in terms of:

  • throughput, speed
  • cost per month, cost per storage, cost per bandwidth
  • replicability, e.g., ease for a 3rd party to build another mirror
  • maintainability of the hosting, complexity to implement and maintain in sktime code
  • failure risk, support
fkiraly added the enhancement, API design, maintenance, and module:datasets&loaders labels on Jun 23, 2023
hazrulakmal (Collaborator) commented Jul 10, 2023

In my opinion, I would vote for options 3 and 4, but I would not suggest hosting datasets on platforms (AWS, GCP, etc.) that incur ongoing costs, to avoid a financial burden, since we are a non-profit entity. Instead, I suggest storing datasets in sktime's Google Drive. Every Google account comes with 15GB of free storage, which I hope should suffice for our case. @fkiraly, do you have any idea how big the datasets are in GB, since you already made the first move of storing all datasets from the classification website in a GitHub repository? If the allocated 15GB is not sufficient, we can do two things:

  1. Create a secondary sktime email account to get another 15GB of storage.
  2. Be more selective about which datasets to store in Google Drive.

The Google Drive API has rate limits (the number of requests allowed in a given time window), allowing only 20,000 requests per 100 seconds, i.e., about 200 requests per second on average. I think that should be sufficient to support current and future sktime users who use the data loaders. To implement downloading files from Google Drive, we can closely follow this StackOverflow answer (see the sketch below). Additionally, to further mitigate this limit, I think it is important to choose carefully when to use option 3 versus option 4.
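Roughly, the approach would look like this - an untested sketch along the lines of that StackOverflow answer; the confirm-token handling assumes Google Drive's virus-scan interstitial behaves as described there, and the function name is illustrative:

```python
import requests

GDRIVE_URL = "https://docs.google.com/uc?export=download"

def download_from_google_drive(file_id: str, destination: str) -> None:
    """Download a publicly shared Google Drive file by its file ID."""
    session = requests.Session()
    response = session.get(GDRIVE_URL, params={"id": file_id}, stream=True)

    # Large files trigger a virus-scan warning page; Drive then sets a
    # "download_warning" cookie whose value is a confirmation token.
    token = next(
        (v for k, v in response.cookies.items() if k.startswith("download_warning")),
        None,
    )
    if token is not None:
        response = session.get(
            GDRIVE_URL, params={"id": file_id, "confirm": token}, stream=True
        )

    response.raise_for_status()
    with open(destination, "wb") as f:
        for chunk in response.iter_content(chunk_size=32768):
            if chunk:  # skip keep-alive chunks
                f.write(chunk)
```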

The decision to use option 3 or 4 depends on the capacity of the upstream repository, specifically whether or not the API link is professionally maintained and capable of handling high-frequency requests and queries.

  1. If the upstream repository is professionally maintained and can handle high-frequency requests, use option 4 (fallback). Examples of such repositories are Monash Forecasting and Time Series Extrinsic Regression, which use Zenodo, a third-party data storage platform, to store their datasets. A third-party platform like this is unlikely to change the URL (a recent problem with the time series classification URL, see [BUG] fix dead source link for UEA datasets #4705) unless the dataset is deleted from the platform, and it is robust to high-frequency API requests.
  2. If the upstream repository is not professionally maintained, use option 3 (mirroring). For example, www.timeseriesclassification.com stores dataset zip files on its own server, which means that any dataset-fetching request depends on how well they manage the server. One problem I encountered (on top of the other already existing problems that motivate this issue) is connection errors such as the one below, when attempting to verify the integrity of the URL using the same tests as in PR [ENH] add Monash Forecasting Repository data loader #4826.

```
[WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
```

achieveordie (Collaborator) commented

I'm not sure how others feel about this, but we could try to monitor the volume of downloads made by sktime users to better understand whether we require mirroring or a fallback. I agree with @hazrulakmal that hosting separate S3 buckets would be less optimal (maintenance, cost, etc.), and I like the idea of storing the data on Google Drive.

We should be able to make a more informed decision if we monitored download patterns for a given period of time. I don't think we would be breaking any T&C by keeping tabs on downloads, but I would love confirmation from someone more informed.

fkiraly (Collaborator, Author) commented Jul 10, 2023

@hazrulakmal, that's smart - Google Drive is an option that we did not have on the radar!

Although, afaik, the stated purpose of Google Drive is file storage (for a person or group, for their own use) rather than file hosting (from a person or group to a wider population of downloaders).

So, my question: is there not an inherent risk in using Google Drive for something other than the purpose it is commonly used for?

Either way, 20,000 requests per 100 seconds doesn't sound like much, given that dataset loading is a bit of a rarely used feature of sktime.

fkiraly (Collaborator, Author) commented Jul 10, 2023

> The decision to use option 3 or 4 depends on the capacity of the upstream repository, specifically whether or not the API link is professionally maintained and capable of handling high-frequency requests and queries.

@hazrulakmal, I am also convinced by your reasoning here - if I understand correctly, you are intelligently weighing the failure risks of the primary and secondary locations to minimize risk to the user, while keeping in mind the load on the secondary (sktime-controlled) location.

fkiraly (Collaborator, Author) commented Jul 10, 2023

> I'm not sure how others feel about this, but we could try to monitor the volume of downloads made by sktime users to better understand whether we require mirroring or a fallback.

Yes, that sounds like a good idea, @achieveordie.

> I don't think we would be breaking any T&C by keeping tabs on downloads, but I would love confirmation from someone more informed.

I can't see that either, but I also do not consider myself more informed on this particular matter.

hazrulakmal (Collaborator) commented

> Either way, 20,000 requests per 100 seconds doesn't sound like much, given that dataset loading is a bit of a rarely used feature of sktime.

This sounds a bit contradictory - not sure if it's a typo. If data loading is a rarely used feature, then shouldn't 20,000 downloads per 100 seconds be enough to support the current request demand?

> So, my question: is there not an inherent risk in using Google Drive for something other than the purpose it is commonly used for?

I am myself not entirely sure about this.

fkiraly (Collaborator, Author) commented Jul 12, 2023

> This sounds a bit contradictory - not sure if it's a typo. If data loading is a rarely used feature, then shouldn't 20,000 downloads per 100 seconds be enough to support the current request demand?

Yes, it's due to imprecise wording and context reference, sorry.

I meant: X requests per Y seconds isn't in the range of the higher velocity/volume of professional hosting solutions, but it should be more than enough for the velocity/volume we expect.

I was probably intending to write something along the lines of "but it should be enough for".

JonathanBechtel (Contributor) commented

I suppose Google Drive is fine - easy to get started with. A couple of details I'm not sure about, in addition to the request limits: do we know if there's any issue with download latency? I'm not sure how big most of the sktime datasets are; I think most of them are pretty small, so I doubt it's an issue, but that's the only other drawback I can think of, unless the API interface has some difficult issues with it.

fkiraly added a commit that referenced this issue Jul 26, 2023
The UEA benchmark data repository has grown increasingly unreliable, see #4754 - there is also a current failure on `main`.

This PR `xfail`-s the loader tests until this is fixed by a buffer or mirror, see #4754.
fkiraly added a commit that referenced this issue Aug 12, 2023
…4985)

This PR reworks the somewhat messy data loader module, and adds the ability to specify download mirrors for remote datasets.

This is towards #4754, but it does create a framework for arbitrary data loader mirrors - I focused mainly on enabling the mirroring/fallback feature for existing loaders, rather than on a framework-level refactor.

A refactor would also include the forecasting data loader and factor out more of the repetitive code in functions.

In terms of addressing #4754, it should now be easy to add an arbitrary number of mirrors, to prevent a blowout failure for users if the UEA data repository decides to abruptly change its folder structure again without deprecation or warning.
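The mirror/fallback behavior this PR describes can be summarized by the following minimal sketch; the function name, signature, and mirror URLs are illustrative assumptions, not sktime's actual internal API:

```python
from urllib.error import URLError
from urllib.request import urlretrieve

# hypothetical mirror list; the actual sktime URLs may differ
_MIRRORS = [
    "https://timeseriesclassification.com/aeon-toolkit",  # upstream
    "https://example.org/sktime-dataset-mirror",          # backup mirror
]

def _download_with_fallback(filename: str, destination: str) -> str:
    """Try each mirror in order; return the URL that succeeded."""
    last_error = None
    for base_url in _MIRRORS:
        url = f"{base_url}/{filename}"
        try:
            urlretrieve(url, destination)
            return url
        except (URLError, OSError) as err:
            last_error = err  # remember the failure and try the next mirror
    raise RuntimeError(f"all mirrors failed for {filename}") from last_error
```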
fkiraly added a commit that referenced this issue Sep 17, 2023
This sets the mirrors for time series classification data loaders:

1. to the new URL of the UEA repository (the one that is constantly changing)
2. to the mirror URL on the sktime GitHub, as a backup

Testing of the download utility is re-enabled.

Related: #4754, #4749