🚀 The feature
I would like to add support for loading a dataset entirely into memory to the built-in datasets in torchvision.datasets. One option would be to add a parameter to the datasets' constructors, for example "to_memory".
I have implemented a prototype of this feature by modifying ImageFolder:
# ATTENTION: This code is UNFINISHED!
# These are the modified __init__ and __getitem__ of
# torchvision.datasets.DatasetFolder (the base class of ImageFolder).
# Requires at module level: from typing import Any, Callable, Optional, Tuple

def __init__(
    self,
    root: str,
    loader: Callable[[str], Any],
    extensions: Optional[Tuple[str, ...]] = None,
    transform: Optional[Callable] = None,
    target_transform: Optional[Callable] = None,
    is_valid_file: Optional[Callable[[str], bool]] = None,
) -> None:
    super().__init__(root, transform=transform, target_transform=target_transform)
    classes, class_to_idx = self.find_classes(self.root)
    samples = self.make_dataset(self.root, class_to_idx, extensions, is_valid_file)

    self.loader = loader
    self.extensions = extensions

    self.classes = classes
    self.class_to_idx = class_to_idx
    self.samples = samples
    self.targets = [s[1] for s in samples]

    # New: eagerly load (and transform) every sample once at construction time,
    # so that __getitem__ only has to index into memory.
    self.images = []
    for s in samples:
        sample = self.loader(s[0])
        if self.transform is not None:
            sample = self.transform(sample)
        self.images.append(sample)

def __getitem__(self, index: int) -> Tuple[Any, Any]:
    """
    Args:
        index (int): Index

    Returns:
        tuple: (sample, target) where target is class_index of the target class.
    """
    _, target = self.samples[index]
    # The sample was already loaded and transformed in __init__.
    sample = self.images[index]
    if self.target_transform is not None:
        target = self.target_transform(target)
    return sample, target
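For illustration, this is roughly how the proposed flag could be used. The `to_memory` parameter name and the way it is wired through are assumptions of this sketch, not an existing torchvision API:

# Hypothetical usage sketch of the proposed feature; `to_memory` does not
# currently exist in torchvision, and the dataset path is a placeholder.
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])

# With to_memory=True, every image would be decoded (and transformed) once
# at construction time and kept in RAM, so __getitem__ becomes a plain lookup.
dataset = ImageFolder(root="path/to/train", transform=transform, to_memory=True)

loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0)
for images, targets in loader:
    ...  # training step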
Motivation, pitch
Training was unusually slow when I used a larger dataset in one of my projects (the dataset was built with ImageFolder).
After looking at the output of the PyTorch profiler, I found that aten::copy_ accounted for most of the time.
I took this to mean I had an I/O bottleneck, so I wanted to load the entire dataset into memory, but torchvision does not seem to support that.
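For reference, a minimal sketch of how this kind of measurement can be reproduced with torch.profiler; the dataset root and batch size are placeholders, not the exact setup used above:

import torch
from torch.profiler import profile, ProfilerActivity
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder dataset/loader; substitute your own ImageFolder root.
dataset = datasets.ImageFolder(
    root="path/to/train",
    transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()]),
)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for i, (images, targets) in enumerate(loader):
        if i == 10:  # profile only a handful of batches
            break

# Sort by total CPU time to see which ops (e.g. aten::copy_) dominate.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))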