
[RFC] A common dataset root #3776

Closed
NicolasHug opened this issue May 5, 2021 · 12 comments


NicolasHug commented May 5, 2021

Currently, all datasets have a mandatory root parameter, indicating where the dataset will be or has been downloaded.

It would be more convenient if users didn't need to pass the root and could instead rely on a predefined default. Having a default shared by all datasets would also allow environments with no internet access (looking at you fbcode 👀) to dump all datasets at the root once and for all, and have seamless access to them afterwards.

Note that for downloading model weights, e.g. using fasterrcnn_resnet50_fpn(pretrained=True), we internally rely on load_state_dict_from_url(), which downloads the weights into the directory returned by torch.hub.get_dir() (by default, this is $TORCH_HOME/hub/).

(BTW, where the models are downloaded doesn't seem to be configurable on the torchvision side, but that's another story)
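
For illustration, the hub behaviour can be inspected and redirected like this (the printed path is just the Linux default):

```python
import torch

# The hub cache honours $TORCH_HOME and defaults to ~/.cache/torch/hub.
print(torch.hub.get_dir())  # e.g. /home/me/.cache/torch/hub

# Weights fetched through load_state_dict_from_url() land in
# <hub_dir>/checkpoints; the hub root itself can be redirected with:
torch.hub.set_dir("/tmp/my_torch_hub")
```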

In scikit-learn, a similar logic is used: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.get_data_home.html

Solution

As previously discussed a bit with @pmeier and @datumbox, we could do something similar for torchvision datasets:

  • Introduce torchvision.datasets.getdir() and torchvision.datasets.setdir(). By default, getdir() would return $TORCH_HOME/torchvision_datasets/, which is consistent with $TORCH_HOME/hub/.
  • Introduce a default for all root parameters, where the default is whatever torchvision.datasets.getdir() returns (a minimal sketch of these helpers follows below).
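
A minimal sketch of what these helpers could look like; the resolution order (explicit override, then $TORCH_HOME, then the XDG cache default) mirrors torch.hub.get_dir() and is an assumption, not settled API:

```python
import os

_dataset_dir = None  # module-level override, set via setdir()

def getdir():
    # Hypothetical resolution order, mirroring torch.hub.get_dir():
    # explicit setdir() override > $TORCH_HOME > ~/.cache/torch.
    if _dataset_dir is not None:
        return _dataset_dir
    cache = os.getenv("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    torch_home = os.getenv("TORCH_HOME", os.path.join(cache, "torch"))
    return os.path.join(torch_home, "torchvision_datasets")

def setdir(path):
    global _dataset_dir
    _dataset_dir = path
```

A dataset's __init__ would then accept root=None and fall back to getdir() at runtime.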

Problem

(Yes, here the problem comes after the solution :))

For most datasets this should work fine. For a few of them (namely phototour, UCF101, Kinetics-400, HMDB51, Flickr, EMNIST, COCO), the root parameter is followed by other parameters that have no default, so we can't give root a default without moving it, which would break backward compatibility.

The easiest workaround here would be to introduce defaults for these other parameters as well (see the sketch below). Otherwise, things get tricky.
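
To make the constraint concrete, here is a toy version of the UCF101 signature (parameter list heavily abridged; getdir() is the helper sketched above, and frames_per_clip=16 is an invented placeholder default):

```python
# Today (simplified): required parameters follow root, so Python rejects
#   def __init__(self, root=getdir(), annotation_path, frames_per_clip): ...
# with "non-default argument follows default argument".
class UCF101:
    def __init__(self, root, annotation_path, frames_per_clip):
        self.root = root

# The workaround: give the trailing parameters defaults too and
# validate at runtime.
class UCF101WithDefaults:
    def __init__(self, root=None, annotation_path=None, frames_per_clip=16):
        if annotation_path is None:
            raise ValueError("annotation_path must be provided")
        self.root = root if root is not None else getdir()
```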

Other considerations

Currently, datasets are inconsistent in how they treat the root: some, like MNIST, dump their data in root/TheDatasetName, while others, like Places365, dump their data directly in root.

While unlikely, this can create conflicts between datasets if they use the same file names. Perhaps it would be safer to "fix" datasets like Places365 so that they all use a root/TheDatasetName directory (a sketch of how this could be enforced follows below). This would create the minor inconvenience of re-downloading the dataset for some users, but it's probably for the best?
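
One way to enforce this is a base-class hook that reuses the class name; this is a hypothetical sketch, not existing torchvision code:

```python
import os

class DatasetWithSubdir:
    # Hypothetical base class: every dataset stores its files under
    # root/<ClassName>, so e.g. MNIST and Places365 can never collide.
    def __init__(self, root):
        self.root = os.path.join(root, type(self).__name__)
        os.makedirs(self.root, exist_ok=True)

class Places365(DatasetWithSubdir):
    pass  # data now lives in <root>/Places365 instead of <root>
```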

CC @fmassa @datumbox @pmeier @prabhat00155 @parmeet


@datumbox datumbox changed the title A common dataset root [RFC] A common dataset root May 5, 2021

parmeet commented May 5, 2021

torchtext also has similar issues w.r.t. datasets and currently tries to solve them mostly through decorators.

  1. Regarding the default root: torchtext datasets also have a mandatory root argument. @_wrap_split_argument (among other things) adds a default './data' value to the root argument by changing the function spec.
  2. Regarding consistent treatment of the root folder: for NLP datasets, we have considerable inconsistencies across datasets. Some datasets extract just one file, whereas others may extract several files from the downloaded archive, which makes the root folder dirty. To provide a dedicated directory for each dataset, @_create_dataset_directory creates a folder for each dataset (named after it, if it does not already exist) and modifies the root path to point to this new path.

The above decorators are applied to all the available datasets. Here is an example.
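
Not torchtext's actual implementation, but a rough sketch of the idea behind such a directory-creating decorator (AG_NEWS is used as an example dataset name):

```python
import functools
import os

def create_dataset_directory(dataset_name):
    # Sketch of the idea: rewrite root to a per-dataset subfolder,
    # creating it on first use, before calling the dataset builder.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(root="./data", *args, **kwargs):
            new_root = os.path.join(root, dataset_name)
            os.makedirs(new_root, exist_ok=True)
            return fn(new_root, *args, **kwargs)
        return wrapper
    return decorator

@create_dataset_directory(dataset_name="AG_NEWS")
def AG_NEWS(root, split=("train", "test")):
    ...  # download/extract into root, which now ends in /AG_NEWS
```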

cc: @cpuhrsch


pmeier commented May 5, 2021

I think the common dataset root is a really good idea and we should go for it. I know the current state makes things harder to maintain, but there are no urgent issues solved by this. Thus, I suggest we hold off on the change until the new datapipe functionality lands, which will break torchvision.datasets in multiple other ways anyway. One more BC-breaking change won't make a difference then.


mthrok commented May 7, 2021

  • $TORCH_HOME/torchvision_datasets/, which is consistent with $TORCH_HOME/hub/

This does not look consistent to me. Considering the possibility of placing model parameter files in a similar fashion, $TORCH_HOME/vision/datasets|models/ would be cleaner.


pmeier commented May 7, 2021

This does not look consistent to me. Considering the possibility of placing model parameter files in a similar fashion, $TORCH_HOME/vision/datasets|models/ would be cleaner.

Maybe we can get all domain libraries on board with this change? In that case $TORCH_HOME/datasets would be the best choice IMO. If every dataset always creates a folder in there, we could have all datasets side by side.

@NicolasHug

In that case $TORCH_HOME/datasets would be the best choice IMO

There's a small risk that the same dataset exists in audio and vision. Same with models, where it might be more likely?


pmeier commented May 7, 2021

There's a small risk that the same dataset exists in audio and vision.

Really? I can imagine two scenarios:

  1. A dataset provides vision and audio data at the same time. If torchaudio also switches to the datapipe way of doing things (@mthrok is there a discussion / plan to do that?), this shouldn't be a problem: both libraries can simply work with the same raw data.
  2. There exist two separate datasets with the same name in different domains. That would be a serious blunder by the authors who added the second one. In such a rare case, we could namespace them as FooAudio and FooVision. Since the user no longer has to supply a root, the class itself could still be called Foo in both libraries.

Did I miss anything?

@NicolasHug

we could namespace them to be FooAudio and FooVision. Since the user does not have to supply a root anymore, the class itself could still be called Foo in both libraries

So we'd need a way (i.e. a new private API) for each dataset class to say "this is my name"? And same for models?
It's much easier to consistently rely on TheClass.__name__ IMO

What benefit do you see to having all datasets in $TORCH_HOME/datasets instead of $TORCH_HOME/[vision|audio|text]/datasets?


pmeier commented May 7, 2021

So we'd need a way (i.e. a new private API) for each dataset class to say "this is my name"? And same for models?
It's much easier to consistently rely on TheClass.__name__ IMO

True, but since I imagine the probability of something like this is fairly low, we could go with the simple logic at first and only add something more complicated if a conflict actually pops up.

What benefit do you see to having all datasets in $TORCH_HOME/datasets instead of $TORCH_HOME/[vision|audio|text]/datasets?

All libraries could point to the same directory. Imagine you have a shared drive with datasets: you wouldn't need to tell your users to append the domain identifier to it, but could simply do datasets.setdir($SHARED_DRIVE) for every PyTorch library (sketched below).

I'm aware that this only removes a small inconvenience, but given that the risk of overlapping dataset names is so low, it could be a viable way to improve the UX.
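
The UX being described, assuming every domain library adopted the proposed setdir() (none of these functions exist today):

```python
import os
import torchvision.datasets
import torchaudio.datasets
import torchtext.datasets

shared = os.environ["SHARED_DRIVE"]  # e.g. /mnt/shared/datasets

# Hypothetical setdir() per this proposal: one call per library,
# all pointing at the same directory.
torchvision.datasets.setdir(shared)
torchaudio.datasets.setdir(shared)
torchtext.datasets.setdir(shared)
```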


pmeier commented May 7, 2021

Plus, if there is a dataset that is used in both torchvision and torchaudio, we would need to duplicate the data on disk, which will probably frustrate users who want to use both.

@NicolasHug

All libraries could point to the same directory. Imagine you have a shared drive with datasets: you wouldn't need to tell your users to append the domain identifier to it, but could simply do datasets.setdir($SHARED_DRIVE) for every PyTorch library.

I'm not sure this is a feature that the $TORCH_HOME/[vision|audio|text]/datasets layout lacks: one can set $TORCH_HOME and still have seamless access to each subfolder (see the sketch below).
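
For example (hub is shown because it already honours $TORCH_HOME today; the datasets path is the one proposed above):

```python
import os

# Point everything torch-related at the shared drive in one go.
os.environ["TORCH_HOME"] = "/mnt/shared/torch"

import torch

print(torch.hub.get_dir())
# /mnt/shared/torch/hub -- and, under this proposal, datasets would
# resolve to /mnt/shared/torch/torchvision_datasets etc.
```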


pmeier commented May 7, 2021

True, I forgot that it is also possible to change $TORCH_HOME directly.


mthrok commented May 7, 2021

Maybe we can get all domain libraries on board with this change? In that case $TORCH_HOME/datasets would be the best choice IMO. If every dataset always creates a folder in there, we could have all datasets side by side.

That way, as a user and as a developer, I would always have to worry about possible collisions of dataset directory names among the different libraries, and I do not see the benefit of doing that.

I think this is a typical YAGNI situation. There is no specification for cross-domain data sharing, so we should not include it in this discussion. The proposal here should only concern vision, and I think the idea is generic enough that it can be applied without modification to audio/text as well.

Also, I do not think it matters. I do not like to talk about hypothetical situations, as it is often the opposite of fruitful, but in the future, when we migrate to datapipes, the file-fetcher pipes will access the directories/files managed by the library that provides the pipe API. Say I want a multi-modal data pipeline and combine pipes from different domain libraries, like FooTextPipe, FooVisionPipe and FooAudioPipe, imported from each domain library; then they will access the location each library manages separately, i.e. ~/.cache/torch/text|vision|audio. It's still nicely abstracted and clean. A problem only occurs if the datapipe implementation does not allow such flexibility.
