[ENH] file hosting for buffering or mirroring upstream data repositories #4754
Comments
I vote for options 3 and 4, but I would not suggest hosting datasets on platforms (AWS, GCP, etc.) that cost money to maintain, to avoid a financial burden, since we are a non-profit entity. Instead, I suggest storing datasets in sktime's Google Drive. Every Google account comes with 15 GB of free storage, which I hope should suffice for our case. @fkiraly, do you have an idea of how big the datasets can be, in GB, since you already made the first move of storing all datasets from the classification website in a GitHub repository? If the allocated 15 GB is not sufficient, we can do two things:
The Google Drive API has rate limits (the number of requests allowed in a given time window): only 20,000 requests per 100 seconds. I think this should be sufficient to support sktime's current and future users who use the data loaders. To implement downloading files from Google Drive, we can closely follow this StackOverflow answer (sketched below). Additionally, to further mitigate this limit, I think it's important to choose carefully when to use options 3 and 4. The decision between options 3 and 4 depends on the capacity of the upstream repository, specifically whether or not the API link is professionally maintained and capable of handling high-frequency requests and queries.
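A minimal sketch of that approach, assuming the file is shared publicly by ID; the function name and the `requests` dependency are illustrative, not an agreed-upon sktime API:

```python
import requests

GDRIVE_URL = "https://docs.google.com/uc?export=download"


def download_from_google_drive(file_id, destination):
    """Download a publicly shared Google Drive file, handling the
    confirmation step Google inserts for large files."""
    session = requests.Session()
    response = session.get(GDRIVE_URL, params={"id": file_id}, stream=True)

    # For large files, Google first returns a virus-scan warning page
    # with a confirmation token in a cookie; resend the request with it.
    token = next(
        (v for k, v in response.cookies.items() if k.startswith("download_warning")),
        None,
    )
    if token:
        response = session.get(
            GDRIVE_URL, params={"id": file_id, "confirm": token}, stream=True
        )

    response.raise_for_status()
    with open(destination, "wb") as f:
        for chunk in response.iter_content(chunk_size=32_768):
            if chunk:  # skip keep-alive chunks
                f.write(chunk)
```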
I'm not sure how others feel about this, but we could try to monitor the volume of downloads made by sktime users to better understand whether we require mirroring or fallback. I agree with @hazrulakmal that hosting separate S3 buckets would be less optimal (maintenance, cost, etc.), and I like the idea of storing the data on Google Drive. We should be able to make a more informed decision if we monitor download patterns for a given period of time. I don't think we would be breaking any T&C by keeping tabs on downloads, but I would love confirmation from someone more informed.
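One hedged way such monitoring could look, assuming the mirrored files were published as GitHub release assets (an assumption, not something decided in this thread): the public GitHub REST API exposes per-asset download counts.

```python
import requests


def release_download_counts(owner, repo):
    """Return {asset_name: download_count} across all releases of a repo.

    owner/repo are placeholders; this only works if datasets ship as
    GitHub release assets rather than plain files in the repo tree.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/releases"
    counts = {}
    for release in requests.get(url, timeout=30).json():
        for asset in release.get("assets", []):
            counts[asset["name"]] = asset["download_count"]
    return counts
```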
@hazrulakmal, that's smart - Google Drive is an option that we did not have on the radar! Although, afaik, the stated purpose of Google Drive is file storage (for a person or group, for their own use) rather than file hosting (from a person or group to a wider population of downloaders). So, my question: is there not an inherent risk in using Google Drive for sth else than the purpose it is commonly used for? Either way, 20,000 requests per 100 seconds don't sound like much, given that dataset loading is a bit of a rarely used feature of `sktime`.
@hazrulakmal, I am also convinced by your reasoning here - if I understand correctly, you are intelligently weighing up the risk of failure of the primary and the secondary location, to minimize risk to the user while keeping in mind the burden to the secondary (`sktime`).
Yes, that sounds like a good idea, @achieveordie.
I can't see that either, but I also do not consider myself more informed on this particular matter.
This sounds a bit contradictory - not sure if it's a typo. But if loading is a rarely used feature, then shouldn't 20,000 downloads per 100 seconds be enough to support the current request demand?
I myself am not entirely sure about this.
Yes, it's due to imprecise wording and context reference, sorry. I meant: X requests per Y seconds is not in the range of the velocity/volume of professional hosting solutions, but it should be more than enough for the velocity/volume we expect. Probably I was intending to write sth along the lines of "but it should be enough for"
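For a rough sense of scale, simple arithmetic on the quota quoted above (a back-of-envelope check, not a measured figure):

```python
# What the Google Drive quota discussed in this thread works out to.
requests_per_window = 20_000
window_seconds = 100

per_second = requests_per_window / window_seconds  # 200 requests/s sustained
per_day = per_second * 86_400                      # 17,280,000 requests/day
print(per_second, per_day)
```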
I suppose Google Drive is fine - easy to get started with. A couple of details I'm not sure about, in addition to request limits: do we know if there are any issues with download latency? I'm not sure how big most of the sktime datasets are; most of them, I think, are pretty small, so I doubt it's an issue, but I suppose that's the only other drawback I can think of, unless the API interface has some difficult issues with it.
…4985) This PR reworks the somewhat messy data loader module and adds the ability to specify download mirrors for remote datasets. This is towards #4754, but it also creates a framework for arbitrary data loader mirrors - I focused mainly on enabling the mirroring/fallback feature for existing loaders, rather than on a framework-level refactor. A refactor would also include the forecasting data loader and would factor out more of the repetitive code in the functions. In terms of addressing #4754, it should now be easy to add an arbitrary number of mirrors to prevent a blowout failure for users if the UEA data repository decides to abruptly change its folder structure again without deprecation or warning.
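A minimal sketch of the mirror/fallback idea the PR describes; the URLs and function name here are illustrative placeholders, not the actual sktime loader API:

```python
from urllib.error import URLError
from urllib.request import urlretrieve

# Hypothetical mirror list: upstream first, then fallbacks.
MIRRORS = [
    "https://upstream.example.org/datasets",
    "https://mirror.example.org/sktime-datasets",
]


def download_with_fallback(rel_path, destination, mirrors=MIRRORS):
    """Try each mirror in order; raise only if every mirror fails."""
    last_err = None
    for base in mirrors:
        try:
            urlretrieve(f"{base}/{rel_path}", destination)
            return destination
        except (URLError, OSError) as err:
            last_err = err  # remember the failure and try the next mirror
    raise RuntimeError(f"all mirrors failed for {rel_path!r}") from last_err
```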
Design issue for finding a solution to mitigating abrupt changes or failures of upstream data repositories in `sktime`. Currently, such changes directly propagate to the user, and they are not unlikely, as the data repositories are usually academic and not professionally maintained.

From a design perspective, I see four options. Options 3 and 4 require hosting the files, which in the case of most data repositories means 100s of files at a total file volume on the order of GB.

Hosting mechanisms discussed so far:

- `sktime` datasets from GitHub LFS buffer mirror instead of direct upstream at UEA (#4749)

The hosting options differ in terms of, among other factors, the required changes to `sktime` code.
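Since GitHub LFS is one of the hosting mechanisms listed above, here is a hedged sketch of what client-side fetching could look like; the owner/repo/branch/path values are placeholders, and serving LFS content through GitHub's media endpoint is an assumption about GitHub's behavior, not a decided sktime design:

```python
import requests


def fetch_lfs_file(owner, repo, branch, path, destination):
    """Fetch an LFS-tracked file via GitHub's media endpoint,
    which resolves LFS pointers to the actual file content."""
    url = f"https://media.githubusercontent.com/media/{owner}/{repo}/{branch}/{path}"
    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()
    with open(destination, "wb") as f:
        for chunk in response.iter_content(chunk_size=32_768):
            f.write(chunk)
    return destination
```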