Caching Support for class Dataset #35642
Comments
I second this; it would be a great enhancement and would remove a lot of boilerplate code from people's projects. I think this can be implemented at the level of `Dataset`. Edit: I can implement this and submit a PR if necessary.
@adamsvystun , are you still working on this? If not, I would love to work on this.
@choidongyeon Not working on this, go ahead.
Hi, the bot's @ mention on this issue was broken so I didn't get notifications. Sorry for the late reply. I'm copying my reply on the PR below. I really don't think we should have this functionality in core, because dataset caching is IMO better handled in code specifically written according to use cases. The reasons are the following:
@ssnl Thank you for your reply in this thread. I see your way of thinking, and these are valid concerns, but may I offer some counter-points to each of your arguments?
Overall, I kindly ask you to consider this issue again; this could be a great quality-of-life improvement. Maybe we can think about how to implement this in a way that addresses all of your concerns.
Hey @ssnl , any thoughts on @adamsvystun 's points and how I should proceed?
@adamsvystun , I am not convinced by the arguments. The proposed API seems quite a bit more complicated and less straightforward than the alternatives. For example, in your snippet, why use inheritance rather than just adding a decorator, which is simpler and much clearer about what it is doing?

```python
class MyDataset(Dataset):
    @cache_decorator
    def io(self, index):
        return xxx

    def __getitem__(self, index):
        return self.aug(self.io(index))
```

For users, one using the proposed API has to remember the interplay among `io`, `aug`, and `__getitem__`. Why add an abstraction when there is no need to have one? The added abstraction also reduces clarity, and can be quite error-prone, because users who are not careful can accidentally turn off random things without knowing! Could you point me to places where users implement such a form of caching? I am not sure that I understand what exact need we want to fulfill here that is not easily achievable using existing libraries. We can surely look at them and discuss whether and how they can be made into a general API, but I don't think the current proposal is the right one. I am also not sure that I understand your points 2 and 3.
I also think that you might have missed my point on workers. The problem is that caching will not help at all when you have workers.
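For concreteness, a dict-based `cache_decorator` of the kind suggested above could be sketched roughly like this (the decorator name comes from the snippet; the internals and the `_io_cache` attribute are assumptions, not an existing PyTorch API):

```python
import functools

def cache_decorator(method):
    """Sketch: memoize a dataset method by its index argument, per instance.

    Cached values are stored on the dataset object itself, so each
    dataset instance keeps its own independent cache.
    """
    @functools.wraps(method)
    def wrapper(self, index):
        # Lazily create the per-instance cache dict on first use.
        cache = self.__dict__.setdefault("_io_cache", {})
        if index not in cache:
            cache[index] = method(self, index)
        return cache[index]
    return wrapper
```

Note that, as pointed out above, with a `DataLoader` using `num_workers > 0` each worker process holds its own copy of the dataset, so a per-instance in-memory cache like this is not shared across workers.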
Could you please share the detailed steps, so that I can try this cache myself? :)
🚀 Feature
For datasets that fit into memory but whose samples are loaded individually, a cache could decrease the time needed to fetch the samples. On each call to the dataset, the cache should be checked for the requested object, and the cached sample returned if available.
Motivation
Loading data from the file system or even worse from network storage can take a considerable amount of time. Since training speed is limited by the time it takes to load the requested data, speed-ups on the data fetching side can lead to improved training time and higher utilization of the computational resources.
Pitch
A simple function like the following could handle the caching:
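A minimal sketch of the idea, assuming the user implements a `load_item` method (the class name and structure here are illustrative assumptions; a real version would subclass `torch.utils.data.Dataset`):

```python
class CachedDataset:
    """Illustrative in-memory caching dataset.

    In practice this would subclass torch.utils.data.Dataset; plain
    Python is used here to keep the sketch self-contained. Subclasses
    implement load_item(index), analogous to implementing __getitem__
    on a plain Dataset.
    """

    def __init__(self):
        self._cache = {}

    def __getitem__(self, index):
        # Check the cache first; fall back to the user's loader on a miss.
        if index not in self._cache:
            self._cache[index] = self.load_item(index)
        return self._cache[index]

    def load_item(self, index):
        raise NotImplementedError
```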
However, for this sample to work, the function `load_item` needs to be implemented by the user, just like when using `Dataset` directly.
Additional context
I started defining the required class; however, I found that my implementation requires too much work from the user when using it as a base class.
I would appreciate any suggestions for reducing the complexity of this proposal: