
Load the entire dataset into memory #8059

@ScanFun

Description


🚀 The feature

I plan to add a feature to the built-in datasets in torchvision.datasets for loading an entire dataset into memory. Perhaps we could add a parameter such as "to_memory" to the datasets' constructors?
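For illustration, usage might look like the following ("to_memory" is only the proposed name, not an existing torchvision argument):

from torchvision import datasets, transforms

# Hypothetical: "to_memory" is the proposed parameter; it does not exist in
# torchvision today.
dataset = datasets.ImageFolder(
    root="path/to/train",
    transform=transforms.ToTensor(),
    to_memory=True,  # eagerly load every image into RAM at construction
)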

I have implemented a prototype of this feature by modifying DatasetFolder (the base class that ImageFolder builds on):

# ATTENTION: This code is UNFINISHED! The two methods below are modified
# copies of DatasetFolder.__init__ and DatasetFolder.__getitem__ from
# torchvision/datasets/folder.py.
from typing import Any, Callable, Optional, Tuple

def __init__(
    self,
    root: str,
    loader: Callable[[str], Any],
    extensions: Optional[Tuple[str, ...]] = None,
    transform: Optional[Callable] = None,
    target_transform: Optional[Callable] = None,
    is_valid_file: Optional[Callable[[str], bool]] = None,
) -> None:
    super().__init__(root, transform=transform, target_transform=target_transform)
    classes, class_to_idx = self.find_classes(self.root)
    samples = self.make_dataset(self.root, class_to_idx, extensions, is_valid_file)

    self.loader = loader
    self.extensions = extensions

    self.classes = classes
    self.class_to_idx = class_to_idx
    self.samples = samples
    self.targets = [s[1] for s in samples]

    # Eagerly load (and transform) every sample so that __getitem__ becomes a
    # plain list lookup instead of a disk read. NOTE: applying the transform
    # here freezes random augmentations for the entire run; only deterministic
    # transforms behave identically to lazy loading.
    self.images = []
    for path, _ in samples:
        sample = self.loader(path)
        if self.transform is not None:
            sample = self.transform(sample)
        self.images.append(sample)

def __getitem__(self, index: int) -> Tuple[Any, Any]:
    """
    Args:
        index (int): Index

    Returns:
        tuple: (sample, target) where target is class_index of the target class.
    """
    # The sample was already loaded in __init__; only the target is fetched.
    _, target = self.samples[index]
    sample = self.images[index]
    if self.target_transform is not None:
        target = self.target_transform(target)
    return sample, target
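As a rough sanity check, the prototype can be exercised like the stock class. Here InMemoryDatasetFolder is a hypothetical name for a class carrying the two methods above, and the path and extensions are placeholders:

from torchvision import transforms
from torchvision.datasets.folder import default_loader

# Hypothetical wrapper class holding the modified __init__/__getitem__ above.
dataset = InMemoryDatasetFolder(
    "path/to/train",
    loader=default_loader,
    extensions=(".jpg", ".png"),
    transform=transforms.ToTensor(),
)

# After construction, indexing is a pure in-memory lookup with no disk I/O.
sample, target = dataset[0]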

Motivation, pitch

I found that training was unusually slow when I used a larger dataset for one of my projects (the dataset was built with ImageFolder).
Looking at a trace from the PyTorch profiler, I found that aten::copy_ took up most of the time.
I took this to mean I had an I/O bottleneck, so I wanted to load the entire dataset into memory, but torchvision does not seem to support that.
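For reference, here is a minimal sketch of how such a trace can be collected with torch.profiler; the dataset path, batch size, and worker count are placeholders:

from torch.profiler import profile, ProfilerActivity
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

dataset = datasets.ImageFolder("path/to/train", transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=32, num_workers=4)

# Profile a handful of batches of pure data loading.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for i, batch in enumerate(loader):
        if i >= 10:
            break

# If operators such as aten::copy_ dominate this table, data movement rather
# than compute is the likely bottleneck.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))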
