When using webdataset with pytorch-lightning, I discovered that if I pass dataloaders to pytorch-lightning as instances of MultiDataset, training will stall on epoch 0. Once I changed the dataloaders to be instances of torch.utils.data.DataLoader instead, the pytorch-lightning trainer behaved as expected.
Is MultiDataset supposed to completely replace torch.utils.data.DataLoader? If so, is there a way to make it work with pytorch-lightning?
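For reference, here is a minimal sketch of the arrangement that ended up working. It assumes the pipeline-style webdataset API (`wds.WebDataset` with `.decode()` / `.to_tuple()`); the shard pattern, image size, and `LitModel` are placeholders, not the actual setup from this report:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
import webdataset as wds


class LitModel(pl.LightningModule):
    """Toy LightningModule; stands in for the real model."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(3 * 224 * 224, 10)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        return F.cross_entropy(self.net(images.flatten(1)), labels)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


# Streaming dataset; the shard pattern is hypothetical, and the pipeline
# assumes 224x224 RGB images stored as .jpg with integer .cls labels.
dataset = (
    wds.WebDataset("shards/train-{000000..000099}.tar")
    .decode("torchrgb")
    .to_tuple("jpg", "cls")
)

# Wrapping the IterableDataset in torch.utils.data.DataLoader (rather than
# wds.MultiDataset) is what made the pytorch-lightning trainer proceed.
loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=4)

pl.Trainer(max_epochs=10).fit(LitModel(), loader)
```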
The DataLoader class is complex and has some problems, in particular when it comes to working with IterableDatasets. MultiDataset is an experimental class showing what DataLoader might be replaced with in the future.
Among other differences, MultiDataset handles splitting of samples among workers differently from DataLoader, and it also handles determining dataset length differently.
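To make the DataLoader side of that concrete: with a plain IterableDataset, every worker process runs the same iterator, so without an explicit split each sample is yielded num_workers times. The standard workaround is `torch.utils.data.get_worker_info()`; a generic sketch (not webdataset-specific):

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class CountingStream(IterableDataset):
    """Toy stream of integers 0..n-1, split manually across workers."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        info = get_worker_info()  # None when num_workers=0
        worker_id = info.id if info else 0
        num_workers = info.num_workers if info else 1
        # Strided split: without this, every worker would yield all n items.
        yield from range(worker_id, self.n, num_workers)


loader = DataLoader(CountingStream(10), num_workers=2, batch_size=None)
print(sorted(int(x) for x in loader))  # 0..9, each exactly once
```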
So, for now, you want to use DataLoader if your training framework requires it, but you will have to deal with the limitations in DataLoader for IterableDataset. On the other hand, MultiDataset is a good choice in containers (since it doesn't use shared memory) or if you want a simpler way of controlling the assignment of shards to processes.
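For the MultiDataset side, usage looked roughly like the sketch below in the webdataset examples of that era. The `workers=` keyword argument is an assumption that may not match your installed version, so check the README before relying on it:

```python
import webdataset as wds

dataset = (
    wds.WebDataset("shards/train-{000000..000099}.tar")  # hypothetical shards
    .decode("torchrgb")
    .to_tuple("jpg", "cls")
)

# MultiDataset spawns its own worker processes and assigns whole shards to
# them, rather than splitting a single sample stream the way DataLoader does.
# The workers= keyword is an assumption; consult your webdataset version.
loader = wds.MultiDataset(dataset, workers=4)

for batch in loader:
    pass  # training loop goes here
```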
Thanks for the explanation. That makes sense. So MultiDataset is not currently intended as a fully compatible drop-in replacement for torch.utils.data.DataLoader; rather, it provides an alternative that works around DataLoader's limitations.
Yes, they have different use cases. There is a strong desire to refactor DataLoader as well, but we have to take this one step at a time.
Another alternative to either DataLoader or MultiDataset that's in development is Tensorcom, which runs data loaders as explicit, separate processes, simplifying debugging and scaling; Tensorcom also supports RDMA and GPUDirect for very large scale, high-performance training.