
[RFC] A common dataset root #3776

Closed
NicolasHug opened this issue May 5, 2021 · 12 comments


NicolasHug commented May 5, 2021

Currently, all datasets have a mandatory root parameter, indicating where the dataset will be or has been downloaded.

It would be more convenient if users didn't need to pass the root and could instead rely on a predefined default. Having a default shared by all datasets would also allow environments with no internet access (looking at you fbcode 👀) to dump all datasets at the root once and for all, and have seamless access to them afterwards.

Note that for downloading model weights, e.g. using fasterrcnn_resnet50_fpn(pretrained=True), we internally rely on load_state_dict_from_url(), which downloads the weights into the directory returned by torch.hub.get_dir() (by default, this is $TORCH_HOME/hub/).

(BTW, where the models are downloaded doesn't seem to be configurable on the torchvision side, but that's another story)
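
For illustration, the hub behaviour can be inspected and redirected like this (the printed path is just the Linux default):

```python
import torch

# The hub cache honours $TORCH_HOME and defaults to ~/.cache/torch/hub.
print(torch.hub.get_dir())  # e.g. /home/me/.cache/torch/hub

# Weights fetched through load_state_dict_from_url() land in
# <hub_dir>/checkpoints; the hub root itself can be redirected with:
torch.hub.set_dir("/tmp/my_torch_hub")
```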

In scikit-learn, a similar logic is used: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.get_data_home.html

Solution

As previously discussed a bit with @pmeier and @datumbox, we could do something similar for torchvision datasets:

  • Introduce torchvision.datasets.getdir() and torchvision.datasets.setdir(). By default, getdir() would return $TORCH_HOME/torchvision_datasets/, which is consistent with $TORCH_HOME/hub/.
  • Introduce a default for all root parameters, where the default is whatever torchvision.datasets.getdir() returns (a minimal sketch of these helpers follows below).
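
A minimal sketch of what these helpers could look like; the resolution order (explicit override, then $TORCH_HOME, then the XDG cache default) mirrors torch.hub.get_dir() and is an assumption, not settled API:

```python
import os

_dataset_dir = None  # module-level override, set via setdir()

def getdir():
    # Hypothetical resolution order, mirroring torch.hub.get_dir():
    # explicit setdir() override > $TORCH_HOME > ~/.cache/torch.
    if _dataset_dir is not None:
        return _dataset_dir
    cache = os.getenv("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    torch_home = os.getenv("TORCH_HOME", os.path.join(cache, "torch"))
    return os.path.join(torch_home, "torchvision_datasets")

def setdir(path):
    global _dataset_dir
    _dataset_dir = path
```

A dataset's __init__ would then accept root=None and fall back to getdir() at runtime.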

Problem

(Yes, here the problem comes after the solution :))

For most datasets this should work fine. For a few of them (namely phototour, UCF101, Kinetics-400, HMDB51, Flickr, EMNIST, COCO), the root parameter is followed by other parameters that have no default, so we can't give root a default without moving it, which would break backward compatibility.

The easiest workaround here would be to introduce defaults for these other parameters as well (see the sketch below). Otherwise, things get tricky.
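
To make the constraint concrete, here is a toy version of the UCF101 signature (parameter list heavily abridged; getdir() is the helper sketched above, and frames_per_clip=16 is an invented placeholder default):

```python
# Today (simplified): required parameters follow root, so Python rejects
#   def __init__(self, root=getdir(), annotation_path, frames_per_clip): ...
# with "non-default argument follows default argument".
class UCF101:
    def __init__(self, root, annotation_path, frames_per_clip):
        self.root = root

# The workaround: give the trailing parameters defaults too and
# validate at runtime.
class UCF101WithDefaults:
    def __init__(self, root=None, annotation_path=None, frames_per_clip=16):
        if annotation_path is None:
            raise ValueError("annotation_path must be provided")
        self.root = root if root is not None else getdir()
```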

Other considerations

Currently, datasets are inconsistent in how they treat the root: some, like MNIST, dump their data in root/TheDatasetName, while others, like Places365, dump their data directly in root.

While unlikely, this can create conflicts between datasets if they use the same file names. Perhaps it would be safer to "fix" datasets like Places365 so that they all use a root/TheDatasetName directory (a sketch of how this could be enforced follows below). This would create the minor inconvenience of re-downloading the dataset for some users, but it's probably for the best?
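
One way to enforce this is a base-class hook that reuses the class name; this is a hypothetical sketch, not existing torchvision code:

```python
import os

class DatasetWithSubdir:
    # Hypothetical base class: every dataset stores its files under
    # root/<ClassName>, so e.g. MNIST and Places365 can never collide.
    def __init__(self, root):
        self.root = os.path.join(root, type(self).__name__)
        os.makedirs(self.root, exist_ok=True)

class Places365(DatasetWithSubdir):
    pass  # data now lives in <root>/Places365 instead of <root>
```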

CC @fmassa @datumbox @pmeier @prabhat00155 @parmeet


@datumbox datumbox changed the title A common dataset root [RFC] A common dataset root May 5, 2021

parmeet commented May 5, 2021

torchtext also has similar issues w.r.t. datasets and currently tries to solve them mostly through decorators.

  1. Regarding the default root: torchtext datasets also have a mandatory root argument. @_wrap_split_argument (among other things) adds a default './data' value to the root argument by changing the function spec.
  2. Regarding consistent treatment of the root folder: for NLP datasets, we have considerable inconsistencies across datasets. Some datasets extract just one file, whereas others may extract several files from the downloaded archive, which makes the root folder dirty. To provide a dedicated directory for each dataset, @_create_dataset_directory creates a folder for each dataset (named after it, if it does not already exist) and modifies the root path to point to this new path.

The above decorators are applied to all the available datasets. Here is an example.
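
Not torchtext's actual implementation, but a rough sketch of the idea behind such a directory-creating decorator (AG_NEWS is used as an example dataset name):

```python
import functools
import os

def create_dataset_directory(dataset_name):
    # Sketch of the idea: rewrite root to a per-dataset subfolder,
    # creating it on first use, before calling the dataset builder.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(root="./data", *args, **kwargs):
            new_root = os.path.join(root, dataset_name)
            os.makedirs(new_root, exist_ok=True)
            return fn(new_root, *args, **kwargs)
        return wrapper
    return decorator

@create_dataset_directory(dataset_name="AG_NEWS")
def AG_NEWS(root, split=("train", "test")):
    ...  # download/extract into root, which now ends in /AG_NEWS
```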

cc: @cpuhrsch


pmeier commented May 5, 2021

I think the common dataset root is a really good idea and we should go for it. I know the current state makes things harder to maintain, but there are no urgent issues solved by this. Thus, I suggest we hold off on the change until the new datapipe functionality lands, which will break torchvision.datasets in multiple other ways anyway. One more BC-breaking change won't make a difference then.


mthrok commented May 7, 2021

  • $TORCH_HOME/torchvision_datasets/, which is consistent with $TORCH_HOME/hub/

This does not look consistent to me. Considering the possibility of placing model parameter files in a similar fashion, $TORCH_HOME/vision/datasets|models/ would be cleaner.


pmeier commented May 7, 2021

This does not look consistent to me. Considering the possibility of placing model parameter files in a similar fashion, $TORCH_HOME/vision/datasets|models/ would be cleaner.

Maybe we can get all domain libraries on board with this change? In that case $TORCH_HOME/datasets would be the best choice IMO. If every dataset always creates a folder in there, we could have all datasets side by side.

@NicolasHug

In that case $TORCH_HOME/datasets would be the best choice IMO

There's a small risk that the same dataset exists in audio and vision. Same with models, where it might be more likely?


pmeier commented May 7, 2021

There's a small risk that the same dataset exists in audio and vision.

Really? I can imagine two scenarios:

  1. A dataset provides vision and audio data at the same time. If torchaudio also switches to the datapipe way of doing things (@mthrok is there a discussion / plan to do that?), this shouldn't be a problem: both libraries can simply work with the same raw data.
  2. There exist two separate datasets with the same name in different domains. That would be a serious blunder by the authors who added the second one. In such a rare case, we could namespace them as FooAudio and FooVision. Since the user no longer has to supply a root, the class itself could still be called Foo in both libraries.

Did I miss anything?

@NicolasHug

we could namespace them to be FooAudio and FooVision. Since the user does not have to supply a root anymore, the class itself could still be called Foo in both libraries

So we'd need a way (i.e. a new private API) for each dataset class to say "this is my name"? And same for models?
It's much easier to consistently rely on TheClass.__name__ IMO

What benefit do you see to having all datasets in $TORCH_HOME/datasets instead of $TORCH_HOME/[vision|audio|text]/datasets?


pmeier commented May 7, 2021

So we'd need a way (i.e. a new private API) for each dataset class to say "this is my name"? And same for models?
It's much easier to consistently rely on TheClass.__name__ IMO

True, but since I imagine the probability of something like this is fairly low, we could go with the simple logic at first and only add something more complicated if a conflict actually pops up.

What benefit do you see to having all datasets in $TORCH_HOME/datasets instead of $TORCH_HOME/[vision|audio|text]/datasets?

All libraries could point to the same directory. Imagine you have a shared drive with datasets: you wouldn't need to tell your users to append the domain identifier to it, but could simply do datasets.setdir($SHARED_DRIVE) for every PyTorch library (sketched below).

I'm aware that this only removes a small inconvenience, but given that the risk of overlapping dataset names is so low, it could be a viable way to improve the UX.
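
The UX being described, assuming every domain library adopted the proposed setdir() (none of these functions exist today):

```python
import os
import torchvision.datasets
import torchaudio.datasets
import torchtext.datasets

shared = os.environ["SHARED_DRIVE"]  # e.g. /mnt/shared/datasets

# Hypothetical setdir() per this proposal: one call per library,
# all pointing at the same directory.
torchvision.datasets.setdir(shared)
torchaudio.datasets.setdir(shared)
torchtext.datasets.setdir(shared)
```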


pmeier commented May 7, 2021

Plus, if there is a dataset that is used in both torchvision and torchaudio, we would need to duplicate the data on disk, which will probably frustrate users who want to use both.

@NicolasHug

All libraries could point to the same directory. Imagine you have a shared drive with datasets: you wouldn't need to tell your users to append the domain identifier to it, but could simply do datasets.setdir($SHARED_DRIVE) for every PyTorch library.

I'm not sure this is a feature that the $TORCH_HOME/[vision|audio|text]/datasets layout lacks: one can set $TORCH_HOME and still have seamless access to each subfolder (see the sketch below).
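
For example (hub is shown because it already honours $TORCH_HOME today; the datasets path is the one proposed above):

```python
import os

# Point everything torch-related at the shared drive in one go.
os.environ["TORCH_HOME"] = "/mnt/shared/torch"

import torch

print(torch.hub.get_dir())
# /mnt/shared/torch/hub -- and, under this proposal, datasets would
# resolve to /mnt/shared/torch/torchvision_datasets etc.
```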


pmeier commented May 7, 2021

True, I forgot that it is also possible to change $TORCH_HOME directly.


mthrok commented May 7, 2021

Maybe we can get all domain libraries on board with this change? In that case $TORCH_HOME/datasets would be the best choice IMO. If every dataset always creates a folder in there, we could have all datasets side by side.

That way, as a user and as a developer, I would always have to worry about possible collisions of dataset directory names among the different libraries, and I do not see the benefit of doing that.

I think this is a typical YAGNI situation. There is no specification for cross-domain data sharing, so we should not include it in this discussion. The proposal here should only concern vision, and I think the idea is generic enough that it can be applied without modification to audio/text as well.

Also, I do not think it matters. I do not like to talk about hypothetical situations, as it is often the opposite of fruitful, but in the future, when we migrate to datapipes, the file-fetcher pipes will access the directories/files managed by the library that provides the pipe API. Say I want a multi-modal data pipeline and combine pipes from different domain libraries, like FooTextPipe, FooVisionPipe and FooAudioPipe, imported from each domain library; then they will access the location each library manages separately, i.e. ~/.cache/torch/text|vision|audio. It's still nicely abstracted and clean. A problem only occurs if the datapipe implementation does not allow such flexibility.
