
[ENH] file hosting for buffering or mirroring upstream data repositories #4754

Open · fkiraly opened this issue Jun 23, 2023 · 8 comments

Labels: API design (API design & software architecture), enhancement (adding new functionality), maintenance (continuous integration, unit testing & package distribution), module:datasets&loaders (data sets and data loaders)

fkiraly (Collaborator) commented Jun 23, 2023

Design issue to find a solution that mitigates abrupt changes or failures of upstream data repositories in sktime. Currently, such changes propagate directly to the user, and they are not unlikely, as the data repositories are usually academic and not professionally maintained.

From a design perspective, I see four options:

  1. status quo
  2. drop data set loaders entirely
  3. mirroring: mirror the upstream and use the mirror as primary source
  4. fallback/buffer: mirror the upstream but use the mirror only if the upstream fails

Options 3 and 4 require hosting the files, which for most data repositories means on the order of hundreds of files, at a total volume on the order of gigabytes. The difference between options 3 and 4 is only the order in which download sources are tried, as sketched below.
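To make the distinction between options 3 and 4 concrete, a minimal sketch follows; the URLs are illustrative placeholders, not actual sktime endpoints. Both options keep an ordered list of sources and differ only in which source is primary.

```python
# Hypothetical source orderings for options 3 and 4.
# URLs are placeholders, not sktime's actual endpoints.
UPSTREAM = "https://www.timeseriesclassification.com/Downloads"
MIRROR = "https://example.org/sktime-dataset-mirror"

# option 3 (mirroring): the mirror is the primary source
SOURCES_OPTION_3 = [MIRROR, UPSTREAM]

# option 4 (fallback/buffer): upstream first, mirror only on failure
SOURCES_OPTION_4 = [UPSTREAM, MIRROR]
```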

Hosting mechanisms discussed so far:

The hosting options differ in terms of:

  • throughput, speed
  • cost per month, cost per storage, cost per bandwidth
  • replicability, e.g., ease for a 3rd party to build another mirror
  • maintainability of the hosting, complexity to implement and maintain in sktime code
  • failure risk, support
fkiraly added the enhancement, API design, maintenance, and module:datasets&loaders labels on Jun 23, 2023
hazrulakmal (Collaborator) commented Jul 10, 2023

In my opinion, I would vote for options 3 and 4, but I would not suggest hosting datasets on platforms (AWS, GCP, etc.) that incur ongoing costs, to avoid a financial burden, since we are a non-profit entity. Instead, I suggest storing datasets in sktime's Google Drive. Every Google account comes with 15GB of free storage, which I hope should suffice for our case. @fkiraly, do you have any idea how big the datasets are in GB, since you already made the first move of storing all datasets from the classification website in a GitHub repository? If the allocated 15GB is not sufficient, we can do two things:

  1. Create a secondary sktime email account to get another 15GB of storage.
  2. Be more selective about which datasets to store in Google Drive.

The Google Drive API has rate limits (the number of requests allowed in a given time window), allowing only 20,000 requests per 100 seconds, i.e., about 200 requests per second on average. I think that should be sufficient to support current and future sktime users who use the data loaders. To implement downloading files from Google Drive, we can closely follow this StackOverflow answer (see the sketch below). Additionally, to further mitigate this limit, I think it is important to choose carefully when to use option 3 versus option 4.
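Roughly, the approach would look like this - an untested sketch along the lines of that StackOverflow answer; the confirm-token handling assumes Google Drive's virus-scan interstitial behaves as described there, and the function name is illustrative:

```python
import requests

GDRIVE_URL = "https://docs.google.com/uc?export=download"

def download_from_google_drive(file_id: str, destination: str) -> None:
    """Download a publicly shared Google Drive file by its file ID."""
    session = requests.Session()
    response = session.get(GDRIVE_URL, params={"id": file_id}, stream=True)

    # Large files trigger a virus-scan warning page; Drive then sets a
    # "download_warning" cookie whose value is a confirmation token.
    token = next(
        (v for k, v in response.cookies.items() if k.startswith("download_warning")),
        None,
    )
    if token is not None:
        response = session.get(
            GDRIVE_URL, params={"id": file_id, "confirm": token}, stream=True
        )

    response.raise_for_status()
    with open(destination, "wb") as f:
        for chunk in response.iter_content(chunk_size=32768):
            if chunk:  # skip keep-alive chunks
                f.write(chunk)
```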

The decision to use option 3 or 4 depends on the capacity of the upstream repository, specifically whether or not the API link is professionally maintained and capable of handling high-frequency requests and queries.

  1. If the upstream repository is professionally maintained and can handle high-frequency requests, use option 4 (fallback). Examples of such repositories are Monash Forecasting and Time Series Extrinsic Regression, which use Zenodo, a third-party data storage platform, to store their datasets. A third-party platform like this is unlikely to change the URL (a recent problem with the time series classification URL, see [BUG] fix dead source link for UEA datasets #4705) unless the dataset is deleted from the platform, and it is robust to high-frequency API requests.
  2. If the upstream repository is not professionally maintained, use option 3 (mirroring). For example, www.timeseriesclassification.com stores dataset zip files on its own server, which means that any dataset-fetching request depends on how well they manage the server. One problem I encountered (on top of the other already existing problems that motivate this issue) is connection errors such as the one below, when attempting to verify the integrity of the URL using the same tests as in PR [ENH] add Monash Forecasting Repository data loader #4826.

```
[WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
```

achieveordie (Collaborator) commented

I'm not sure how others feel about this, but we could try to monitor the volume of downloads made by sktime users to better understand whether we require mirroring or a fallback. I agree with @hazrulakmal that hosting separate S3 buckets would be less optimal (maintenance, cost, etc.), and I like the idea of storing the data on Google Drive.

We should be able to make a more informed decision if we monitored download patterns for a given period of time. I don't think we would be breaking any T&C by keeping tabs on downloads, but I would love confirmation from someone more informed.

fkiraly (Collaborator, Author) commented Jul 10, 2023

@hazrulakmal, that's smart - Google Drive is an option that we did not have on the radar!

Although, afaik, the stated purpose of Google Drive is file storage (for a person or group, for their own use) rather than file hosting (from a person or group to a wider population of downloaders).

So, my question: is there not an inherent risk in using Google Drive for something other than the purpose it is commonly used for?

Either way, 20,000 requests per 100 seconds doesn't sound like much, given that dataset loading is a bit of a rarely used feature of sktime.

fkiraly (Collaborator, Author) commented Jul 10, 2023

> The decision to use option 3 or 4 depends on the capacity of the upstream repository, specifically whether or not the API link is professionally maintained and capable of handling high-frequency requests and queries.

@hazrulakmal, I am also convinced by your reasoning here - if I understand correctly, you are intelligently weighing the failure risks of the primary and secondary locations to minimize risk to the user, while keeping in mind the load on the secondary (sktime-controlled) location.

fkiraly (Collaborator, Author) commented Jul 10, 2023

> I'm not sure how others feel about this, but we could try to monitor the volume of downloads made by sktime users to better understand whether we require mirroring or a fallback.

Yes, that sounds like a good idea, @achieveordie.

> I don't think we would be breaking any T&C by keeping tabs on downloads, but I would love confirmation from someone more informed.

I can't see that either, but I also do not consider myself more informed on this particular matter.

hazrulakmal (Collaborator) commented

> Either way, 20,000 requests per 100 seconds doesn't sound like much, given that dataset loading is a bit of a rarely used feature of sktime.

This sounds a bit contradictory - not sure if it's a typo. If data loading is a rarely used feature, then shouldn't 20,000 downloads per 100 seconds be enough to support the current request demand?

> So, my question: is there not an inherent risk in using Google Drive for something other than the purpose it is commonly used for?

I am myself not entirely sure about this.

fkiraly (Collaborator, Author) commented Jul 12, 2023

> This sounds a bit contradictory - not sure if it's a typo. If data loading is a rarely used feature, then shouldn't 20,000 downloads per 100 seconds be enough to support the current request demand?

Yes, it's due to imprecise wording and context reference, sorry.

I meant: X requests per Y seconds isn't in the range of the higher velocity/volume of professional hosting solutions, but it should be more than enough for the velocity/volume we expect.

I was probably intending to write something along the lines of "but it should be enough for".

JonathanBechtel (Contributor) commented

I suppose Google Drive is fine - easy to get started with. A couple of details I'm not sure about, in addition to the request limits: do we know if there's any issue with download latency? I'm not sure how big most of the sktime datasets are; I think most of them are pretty small, so I doubt it's an issue, but that's the only other drawback I can think of, unless the API interface has some difficult issues with it.

fkiraly added a commit that referenced this issue Jul 26, 2023
The UEA benchmark data repository has grown increasingly unreliable, see #4754 - there is also a current failure on `main`.

This PR `xfail`-s the loader tests until this is fixed by a buffer or mirror, see #4754.
fkiraly added a commit that referenced this issue Aug 12, 2023
…4985)

This PR reworks the somewhat messy data loader module, and adds the ability to specify download mirrors for remote datasets.

This is towards #4754, but it does create a framework for arbitrary data loader mirrors - I focused mainly on enabling the mirroring/fallback feature for existing loaders, rather than on a framework-level refactor.

A refactor would also include the forecasting data loader and factor out more of the repetitive code in functions.

In terms of addressing #4754, it should now be easy to add an arbitrary number of mirrors, to prevent a blowout failure for users if the UEA data repository decides to abruptly change its folder structure again without deprecation or warning.
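The mirror/fallback behavior this PR describes can be summarized by the following minimal sketch; the function name, signature, and mirror URLs are illustrative assumptions, not sktime's actual internal API:

```python
from urllib.error import URLError
from urllib.request import urlretrieve

# hypothetical mirror list; the actual sktime URLs may differ
_MIRRORS = [
    "https://timeseriesclassification.com/aeon-toolkit",  # upstream
    "https://example.org/sktime-dataset-mirror",          # backup mirror
]

def _download_with_fallback(filename: str, destination: str) -> str:
    """Try each mirror in order; return the URL that succeeded."""
    last_error = None
    for base_url in _MIRRORS:
        url = f"{base_url}/{filename}"
        try:
            urlretrieve(url, destination)
            return url
        except (URLError, OSError) as err:
            last_error = err  # remember the failure and try the next mirror
    raise RuntimeError(f"all mirrors failed for {filename}") from last_error
```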
fkiraly added a commit that referenced this issue Sep 17, 2023
This sets the mirrors for time series classification data loaders:

1. to the new URL of the UEA repository (the one that is constantly changing)
2. to the mirror URL on the sktime GitHub, as a backup

Testing of the download utility is re-enabled.

Related: #4754, #4749