Is MultiDataset a complete replacement for torch.utils.data.DataLoader? #12

Open
wongjoel opened this issue Sep 14, 2020 · 3 comments
Labels: enhancement (New feature or request)

Comments

@wongjoel

When using webdataset with pytorch-lightning, I discovered that if I pass dataloaders to pytorch-lightning as instances of MultiDataset, training stalls at epoch 0. Once I changed the dataloaders to be instances of torch.utils.data.DataLoader instead, the pytorch-lightning trainer behaved as expected.

Is MultiDataset supposed to completely replace torch.utils.data.DataLoader? If so, is there a way to make it work with pytorch-lightning?
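
For context, here is a minimal sketch of the two setups being compared. The shard pattern, the decode/tuple keys, the `workers=` argument to the experimental MultiDataset class, and the use of `wds.WebDataset` are illustrative assumptions, not a reproduction of the original code:

```python
import torch
import webdataset as wds

# Shared pipeline; shard URL and sample keys are placeholders.
dataset = (
    wds.WebDataset("shards/train-{000000..000099}.tar")
    .decode("pil")
    .to_tuple("jpg", "cls")
    .batched(32)
)

# Setup A: MultiDataset as the loader (the one that stalled at epoch 0
# under pytorch-lightning; workers= is assumed from the experimental API).
loader_a = wds.MultiDataset(dataset, workers=4)

# Setup B: the standard DataLoader (behaved as expected).
# batch_size=None because batching already happens inside the pipeline.
loader_b = torch.utils.data.DataLoader(dataset, batch_size=None, num_workers=4)

# Either object is then handed to the pytorch-lightning Trainer,
# e.g. trainer.fit(model, loader_b)
```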

@tmbdev
Collaborator

tmbdev commented Sep 14, 2020

The DataLoader class is complex and has some problems, in particular when working with IterableDataset. MultiDataset is an experimental class showing what DataLoader might be replaced with in the future.

Among other differences, MultiDataset splits samples among workers differently from DataLoader, and it also determines dataset length differently.

So, for now, use DataLoader if your training framework requires it, but you will have to deal with DataLoader's limitations around IterableDataset. On the other hand, MultiDataset is a good choice inside containers (since it doesn't use shared memory) or when you want a simpler way of controlling the assignment of shards to processes.
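
To illustrate the splitting limitation mentioned above: with a plain IterableDataset, DataLoader starts `num_workers` copies of the dataset, and each copy yields the full stream unless the dataset explicitly partitions its own work. A minimal sketch of the standard workaround using `torch.utils.data.get_worker_info()` (the toy dataset and shard list here are hypothetical):

```python
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedIterable(IterableDataset):
    """Toy iterable dataset: each 'shard' is just a list of samples."""

    def __init__(self, shards):
        self.shards = shards

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            shards = self.shards  # single-process loading: take everything
        else:
            # Each worker takes a disjoint slice of the shards; otherwise
            # every worker would yield the full dataset and each sample
            # would appear num_workers times.
            shards = self.shards[info.id::info.num_workers]
        for shard in shards:
            yield from shard

shards = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
loader = DataLoader(ShardedIterable(shards), num_workers=2, batch_size=3)
# Every sample appears exactly once; drop the get_worker_info() split
# and each sample would show up twice (once per worker).
```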

tmbdev added the enhancement label on Sep 14, 2020
@wongjoel
Author

Thanks for the explanation. That makes sense: MultiDataset is not currently intended as a fully compatible drop-in replacement for torch.utils.data.DataLoader; rather, it provides an alternative that works around DataLoader's limitations.

@tmbdev
Collaborator

tmbdev commented Sep 15, 2020

Yes, they have different use cases. There is a strong desire to refactor DataLoader as well, but we have to take this one step at a time.

Another alternative to either DataLoader or MultiDataset that's in development is Tensorcom, which runs data loaders as explicit, separate processes, simplifying debugging and scaling; Tensorcom also supports RDMA and GPUDirect for very large-scale, high-performance training.
