[RFC] A common dataset root #3776
Comments
torchtext also has similar issues w.r.t. datasets and currently tries to solve them mostly through decorators.
The above decorators are applied to all the available datasets. Here is an example. cc: @cpuhrsch |
I think the common dataset root is a really good idea and we should go for it. I know the current state makes it harder to maintain, but there are no urgent issues solved by this. Thus, I suggest we wait with the change until the new datapipe functionality, which will break |
This does not look consistent to me. Considering the possibility to place model parameter files in a similar fashion, |
Maybe we can get all domain libraries on board with this change? In that case |
There's a small risk that the same dataset exists in audio and vision. Same with models, where it might be more likely? |
Really? I can imagine two scenarios:
Did I miss anything? |
So we'd need a way (i.e. a new private API) for each dataset class to say "this is my name"? And same for models? What benefit do you see to having all datasets in |
True, but since I imagine the probability is fairly low for something like this, we could simply go with the simple logic at first and only add some more complicated logic if something pops up.
All libraries could point to the same directory. Imagine you have a shared drive with datasets. You wouldn't need to tell your users to append the domain identifier to it, but could simply do I'm aware that this is only a small inconvenience, but given that the risk of overlapping dataset names is so low, it could be a viable way to improve the UX. |
Plus, if there is a dataset that is used in |
I'm not sure this is a feature that the |
True, I forgot that it is also possible to change |
That way, as a user and as a developer, I will always have to worry about the possible collision of dataset directory names among the different libraries, but I do not see a benefit in doing that. I think this is a typical YAGNI situation. There is no specification for cross-domain data sharing, so we should not include it in this discussion. The matter proposed here should only concern vision and I think the idea is generic enough and can be applied without modification to Also I do not think it matters. I do not like to talk about hypothetical situations, as it is often quite the opposite of fruitful, but in the future, when we migrate to data pipes, the file-fetcher pipes will access the directories/files managed by the library that manages the pipe API. Say I want to have a multi-modal data pipeline and I combine pipes from different domain libraries, like, |
Currently, all datasets have a mandatory root parameter, indicating where the dataset will be or has been downloaded.
It would be more convenient if users didn't need to pass the root, and just rely on some predefined default behaviour. Also, having a default for all datasets will allow places with no internet access (looking at you fbcode 👀) to dump all datasets once and for all at the root, and have a seamless access to it afterwards.
Note that for downloading model weights, e.g. using fasterrcnn_resnet50_fpn(pretrained=True), we internally rely on load_state_dict_from_url(), which will download the weights in what torch.hub.get_dir() returns (by default, this is $TORCH_HOME/hub/).
(BTW, where the models are downloaded doesn't seem to be configurable on the torchvision side, but that's another story)
In scikit-learn, a similar logic is used: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.get_data_home.html
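For reference, the default location described above can be approximated with the standard library alone. This is only a sketch: the helper names torch_home() and hub_dir() are made up for illustration, and the fallback to ~/.cache/torch (honouring XDG_CACHE_HOME) mirrors what PyTorch does as far as I know, so treat it as an assumption rather than the exact implementation. In real code you would just call torch.hub.get_dir().

```python
import os


def torch_home() -> str:
    """Approximate PyTorch's cache root (hypothetical helper, not a torch API).

    $TORCH_HOME wins if set; otherwise fall back to $XDG_CACHE_HOME/torch,
    and finally to ~/.cache/torch.
    """
    default = os.path.join(os.getenv("XDG_CACHE_HOME", "~/.cache"), "torch")
    return os.path.expanduser(os.getenv("TORCH_HOME", default))


def hub_dir() -> str:
    """Default hub directory, i.e. $TORCH_HOME/hub (hypothetical helper)."""
    return os.path.join(torch_home(), "hub")


if __name__ == "__main__":
    print(hub_dir())
```

The point is simply that a single environment variable already controls where weights end up, which is the behaviour the proposal wants to copy for datasets.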
Solution
As previously discussed a bit with @pmeier and @datumbox we could do something similar for torchvision datasets:
- Introduce torchvision.datasets.getdir() and torchvision.datasets.setdir(). By default getdir() would return $TORCH_HOME/torchvision_datasets/, which is consistent with $TORCH_HOME/hub/.
- Give the root parameters a default, where the default is what torchvision.datasets.getdir() returns.
Problem
(Yes, here the problem comes after the solution :))
For most datasets this should work OK. For a few of them (namely phototour, UCF101, Kinetics-400, HMDB51, Flickr, EMNIST, COCO), the root parameter is followed by other parameters without a default, so we can't introduce a default for root without changing its place, which would break backward compatibility.
The easiest workaround here would be to introduce defaults for these other parameters. Otherwise, things get tricky.
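To make the constraint concrete, here is a rough sketch of the proposed getdir()/setdir() pair together with the "give the other parameters a default too" workaround. Everything in it is hypothetical: ToyDataset and its annotation_path parameter are stand-ins and not an actual torchvision class, the module-level _datasets_dir variable is just one possible way to store the override, and the ~/.cache/torch fallback is a simplification.

```python
import os
from typing import Optional

_datasets_dir: Optional[str] = None  # hypothetical override set via setdir()


def setdir(path: str) -> None:
    """Override the common dataset root (sketch of the proposed API)."""
    global _datasets_dir
    _datasets_dir = os.path.expanduser(path)


def getdir() -> str:
    """Return the override if set, else $TORCH_HOME/torchvision_datasets."""
    if _datasets_dir is not None:
        return _datasets_dir
    home = os.path.expanduser(os.getenv("TORCH_HOME", "~/.cache/torch"))
    return os.path.join(home, "torchvision_datasets")


# Python rejects a parameter without a default after one with a default,
# so `root=None` alone is not possible while keeping the parameter order.
try:
    compile("def f(root=None, annotation_path): pass", "<sketch>", "exec")
except SyntaxError:
    pass  # expected: this signature cannot even be defined


class ToyDataset:
    """Stand-in for a dataset declared as __init__(self, root, annotation_path)."""

    def __init__(self, root: Optional[str] = None,
                 annotation_path: Optional[str] = None) -> None:
        # Workaround: annotation_path gets a placeholder default and is
        # validated at runtime, so existing positional calls keep working.
        if annotation_path is None:
            raise TypeError("annotation_path is required")
        self.root = getdir() if root is None else root
        self.annotation_path = annotation_path
```

With this, ToyDataset("/data", "ann.json") behaves as before, while ToyDataset(annotation_path="ann.json") falls back to getdir(). The cost is that a missing required argument is only caught at runtime instead of by the signature.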
Other considerations
Currently, datasets are inconsistent with respect to how they treat the root: some will dump their data in root/TheDatasetName like MNIST, but some will dump their data directly in root like Places365.
While unlikely, this can create conflicts between datasets if they use the same file names. Perhaps it would be safe here to "fix" the datasets like Places365 so that they all use a root/TheDatasetName directory. This will create a minor inconvenience of re-downloading the dataset for some users, but it's probably for the best?
CC @fmassa @datumbox @pmeier @prabhat00155 @parmeet
cc @pmeier