Caching Support for class Dataset #35642

Open
lukasfolle opened this issue Mar 29, 2020 · 8 comments
Labels
enhancement (Not as big of a feature, but technically not a bug. Should be easy to fix) · module: dataloader (Related to torch.utils.data.DataLoader and Sampler) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@lukasfolle

lukasfolle commented Mar 29, 2020

🚀 Feature

For datasets that fit into memory but whose samples are loaded individually, a cache could decrease the time needed to fetch the samples. On each call to the dataset, the cache should be checked for the requested object and, if present, the cached sample should be returned.

Motivation

Loading data from the file system or, even worse, from network storage can take a considerable amount of time. Since training speed is limited by the time it takes to load the requested data, speed-ups on the data-fetching side can lead to shorter training time and higher utilization of the computational resources.

Pitch

A simple function like the following could handle the caching:

    def get_item(self, index):
        if index in self.cache:
            return self.cache[index]
        sample = self.load_item(index)
        self.cache[index] = sample
        return sample

However, for this sample to work, the function load_item needs to be implemented by the user, just as __getitem__ must be implemented when using Dataset directly.

Additional context

I started defining the required class; however, I found that my implementation requires too much work from the user when using this as a base class.
I would appreciate any suggestions to reduce the complexity of this proposal:

class CachingDataset(Dataset):
    r"""A caching :class:`Dataset`.

    All subclasses should overwrite :meth:`load_item`, supporting fetching a
    data sample from file.
    Subclasses could also optionally overwrite
    :meth:`__len__`, which is expected to return the size of the dataset by many
    :class:`~torch.utils.data.Sampler` implementations and the default options
    of :class:`~torch.utils.data.DataLoader`.
    For datasets that fit into memory, this allows faster data loading from the second epoch on
    by caching the loaded data in `cache`.

    .. note::
      Performance gains will be visible from the second epoch on.

    Example: Loading data from the cache if available using :meth:`get_item`::

        >>> class MyCachingDataset(torch.utils.data.CachingDataset):
        ...     def __init__(self, data):
        ...         super().__init__()
        ...         self.data = data  # e.g. a list of file paths
        ...
        ...     def load_item(self, index):
        ...         sample = read_from_file(self.data[index])
        ...         return sample
    """
    def __init__(self):
        self.cache = dict()

    def __getitem__(self, index):
        return self.get_item(index)

    def get_item(self, index):
        if index in self.cache:
            return self.cache[index]
        sample = self.load_item(index)
        self.cache[index] = sample
        return sample

    def load_item(self, index):
        raise NotImplementedError
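For concreteness, here is a sketch of how the proposed class could be used end to end; the subclass name, the file list, and the torch.load-based reader are illustrative placeholders, not part of the proposal itself:

    import torch
    from torch.utils.data import DataLoader

    class MyCachingDataset(CachingDataset):
        def __init__(self, file_paths):
            super().__init__()              # sets up self.cache
            self.file_paths = file_paths

        def __len__(self):
            return len(self.file_paths)

        def load_item(self, index):
            # the expensive read; only executed on a cache miss
            return torch.load(self.file_paths[index])

    dataset = MyCachingDataset(["sample_0.pt", "sample_1.pt"])
    loader = DataLoader(dataset, batch_size=2, shuffle=True)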

cc @SsnL
@ailzhang added the enhancement, module: dataloader, and triaged labels on Mar 30, 2020
@adamsvystun

adamsvystun commented Apr 26, 2020

I second this; it would be a great enhancement and would remove a lot of boilerplate code from people's projects.

I think this can be implemented at the level of DataLoader. In the same way that there is a shuffle flag, there could be a cache boolean flag that automatically caches the dataset loading (a sketch of what that could look like is below). Very simple.

Edit: I can implement this and submit a PR if necessary.
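For illustration only, not an API that DataLoader actually has: one way such a hypothetical cache flag could work internally is to wrap the user's dataset in a caching view before iteration. The wrapper below is a sketch under that assumption.

    from torch.utils.data import Dataset, DataLoader

    class _CachedDataset(Dataset):
        """Hypothetical wrapper that a cache=True flag could apply internally."""

        def __init__(self, dataset):
            self.dataset = dataset
            self._cache = {}

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, index):
            # delegate to the wrapped dataset only on the first access
            if index not in self._cache:
                self._cache[index] = self.dataset[index]
            return self._cache[index]

    # DataLoader(user_dataset, cache=True, ...) could then be roughly equivalent to:
    # loader = DataLoader(_CachedDataset(user_dataset), batch_size=32, shuffle=True)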

@choidongyeon

@adamsvystun, are you still working on this? If not, I would love to work on it.

@adamsvystun

@choidongyeon Not working on this, go ahead.

@ssnl
Collaborator

ssnl commented Jun 11, 2020

Hi, the bot's @ mention on this issue was broken, so I didn't get notifications. Sorry for the late reply.

I'm copying my reply on the PR below.

I really don't think we should have this functionality in core, because dataset caching is IMO better handled in code specifically written according to use cases. The reasons are the following:

  1. Caching index->data is often not what you want in DL. Most of the time, stochastic augmentations are applied to the data after it is read. While the I/O can be cached, the stochastic augmentations cannot.
  2. Unless the entire dataset is small, the cache won't have any effect until the 2nd epoch is reached, at which point the entire dataset is in memory. But if the dataset is that small, the user might as well load it into memory at the beginning.
  3. When num_workers > 0, not only does this have no effect (unless a sampler that repeatedly samples data is used), since a copy of the dataset object is sent to the workers at the beginning of each epoch, it will also "leak" a lot of memory because a separate cache is kept in each worker.
  4. Better and more robust caching support already exists in the Python core library (functools.lru_cache) and in 3rd-party libs specialized for this (e.g., ring, methodtools, etc.). I don't think PyTorch should maintain another copy. When worker reuse is implemented, users could just use these existing decorators to add caching to their datasets (see the sketch after this list).
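For illustration of that last point, a minimal sketch of the decorator approach; the helper names (read_from_file, random_augment) and their torch.load/randn_like bodies are placeholders assumed here, not anything from this thread:

    import functools

    import torch
    from torch.utils.data import Dataset

    def read_from_file(path):
        # placeholder for the expensive disk/network I/O
        return torch.load(path)

    def random_augment(sample):
        # placeholder for a stochastic augmentation
        return sample + 0.01 * torch.randn_like(sample)

    class MyDataset(Dataset):
        def __init__(self, file_paths):
            self.file_paths = file_paths

        def __len__(self):
            return len(self.file_paths)

        @functools.lru_cache(maxsize=None)  # unbounded; pass a maxsize to cap RAM usage
        def _load(self, index):
            return read_from_file(self.file_paths[index])

        def __getitem__(self, index):
            sample = self._load(index)     # I/O is cached after the first access
            return random_augment(sample)  # the stochastic part runs on every access

With num_workers=0 and enough RAM, the second epoch then reads everything from memory; with workers, each worker would still build its own copy of the cache, which is exactly the caveat in point 3.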

@adamsvystun

adamsvystun commented Jun 16, 2020

@ssnl Thank you for your reply in this thread. I see your way of thinking, and these are valid concerns. But may I offer some counter-points to each of your arguments?

  1. While it is often not what you want, it is also often exactly what you want. Taking the position that a feature is not needed because it is 'often not what you want' seems to lead to a very narrow set of features. I mean, half of the DataLoader init parameters should be thrown out by this argument. For example, shuffle: I might not want to shuffle my data, and people can implement shuffling themselves, so why have it as a parameter? In my opinion, it is because shuffle is often what you want, so it makes life easier for developers and researchers. They don't need to reimplement it every time.
     Regarding augmentations, you can still use augmentations while also doing caching. Even with the implementation currently proposed in the PR, you can do this by extending __getitem__:
class AugmentedCacheDataset(CacheDataset):

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        image = super().__getitem__(idx)  # cached load
        augmented_image = augment(image)  # stochastic part, applied on every access
        return augmented_image

    def get_item(self, idx):
        image = ...  # heavy IO/DB loading, whatever
        return image
  2. I think the most important thing to consider here is that many people across multiple projects implement their own caching. It often does the same thing, so in my opinion the use case is there. But even in the example you picked, consider this: sometimes you work on a smaller sample of the dataset and want to have it in memory, and then switch to a larger dataset which can't be loaded into memory. I wouldn't want to maintain two versions of the dataset, one which loads everything at the beginning and another one that loads each sample in __getitem__. Another scenario is that you want to test locally on a smaller machine with less RAM and load each sample in __getitem__, while when running the full training on a remote machine with plenty of RAM you want to cache. Again, caching is useful here. And these are just some of the scenarios; I am sure there are more.

  3. When worker reuse is implemented, this will still be helpful. The leaking problem can be fixed with a smarter cache (functools.lru_cache with maxsize=None or something).

  4. I think we can use functools.lru_cache; there is no need to maintain another copy.

Overall, I kindly ask you to consider this issue again; this might be a great quality-of-life improvement. Maybe we can think about how to implement it in a way that addresses all of your concerns.

@choidongyeon

Hey @ssnl, any thoughts on @adamsvystun's points and on how I should proceed?

@ssnl
Collaborator

ssnl commented Jun 25, 2020

@adamsvystun, I am not convinced by the arguments. The proposed API seems considerably more complicated and less straightforward than the alternatives. For example, in your snippet, why use inheritance rather than just adding a decorator, which is simpler and much clearer about what it is doing?

class MyDataset(Dataset):
    @cache_decorator  # e.g. functools.lru_cache(maxsize=None)
    def io(self, index):
        return xxx  # the expensive read goes here

    def __getitem__(self, index):
        # only the I/O is cached; the augmentation stays stochastic
        return self.aug(self.io(index))

For users, anyone using the proposed API has to remember the interplay among CacheDataset.__getitem__, get_item, and determinism. For readers, it is nearly impossible to understand what super().__getitem__(idx) does unless one reads the docs, whereas the decorator seems very readable and much simpler and clearer in what it does.

Why add an abstraction when there is no need for one?

Also, the added abstraction reduces clarity and can be quite error-prone, because users who are not careful can accidentally turn off randomness without knowing it!

Could you point me to places where users implement this form of caching? I am not sure that I understand what exact need we want to fulfill here that is not easily achievable using existing libs. We can surely look at them and discuss whether and how they could be made into a general API, but I don't think the current proposal is the right one.

I am also not sure that I understand your points 2 and 3.

  • For 2,

    But even in the example you picked, consider this: sometimes you work on a smaller sample of the dataset and want to have it in memory, and then switch to a larger dataset which can't be loaded into memory. I wouldn't want to maintain two versions of the dataset, one which loads everything at the beginning and another one that loads each sample in __getitem__.

    But caching won't have any effect if your dataset can't be fully loaded into memory: you don't request the same sample again until the next epoch, by which point every other sample has been iterated over, so a hit only happens if the entire dataset still lives in the cache (with a size-limited LRU cache, the sample will have been evicted by then). Therefore, one must use an unlimited cache size to begin with, and thus sort of has to be able to decide whether the dataset fits in memory and turn the cache on vs. off anyway, right? Surely you don't want to let the small machine's script or OS crash by using all the RAM.

    Another scenario is that you want to test locally on a smaller machine with less RAM and load each sample in __getitem__, while when running the full training on a remote machine with plenty of RAM you want to cache. Again, caching is useful here.

    I never argued that caching is not helpful. What I was saying is that (1) I don't see why users can't use existing functionality in the Python core/3rd-party libs, and that (2) the benefit of caching is limited to the single case of having enough RAM to load the entire dataset into memory with num_workers=0, and it is harmful in most other cases.

  • For 3,

    When worker reuse is implemented, this will still be helpful.

    The issue is that, unless worker reuse is implemented, this will not be helpful and will be harmful.

    The leaking problem can be fixed with a smarter cache (functools.lru_cache with maxsize=None or something).

    That does not solve the problem where you essentially load num_workers copies of the dataset into RAM and immediately discard them before ever hitting the cache, assuming the machine hasn't crashed by that point.

I also think that you might have missed my point on workers. The problem is that caching will not help at all when you have workers.

@gaowayne

Could you please share detailed steps so that I can try out this caching? :)
