
PyTorch DataLoader improvements for Iterable Dataset #127072

Open

tchaton opened this issue May 24, 2024 · 4 comments
Labels
module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments


tchaton commented May 24, 2024

🚀 The feature, motivation and pitch

Currently, the PyTorch DataLoader makes it hard to design an IterableDataset that functions correctly.

The main reason is that the IterableDataset needs to know about the num_workers, batch_size and shuffle arguments provided to the DataLoader in order to work properly. One could argue for exposing them on the IterableDataset itself, but that isn't intuitive for new users and leads to more issues: users would then have to pass the same batch_size to both the dataset and the DataLoader, which is error-prone. See the sketch below.
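To illustrate the status quo: the standard workaround today is to call get_worker_info() inside __iter__ and shard manually. That covers num_workers, but batch_size and shuffle remain invisible to the dataset. A minimal sketch:

```python
from torch.utils.data import IterableDataset, get_worker_info

class ShardedRange(IterableDataset):
    """Minimal sketch: shard a range of integers across DataLoader workers."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this process yields everything.
            start, step = 0, 1
        else:
            # Each worker yields a disjoint, interleaved slice of the data.
            start, step = info.id, info.num_workers
        return iter(range(start, self.n, step))
```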

I am thinking a simple solution would be to have a protocol for IterableDataset: if the dataset defines a setup method, the DataLoader would call it in its __init__ to pass down the user-provided arguments (num_workers, batch_size, shuffle).
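A minimal sketch of what that could look like (the setup name, signature, and DataLoader hook are hypothetical, not an existing PyTorch API):

```python
from torch.utils.data import IterableDataset

class MyStream(IterableDataset):
    """Hypothetical: a dataset opting into the proposed setup protocol."""

    def setup(self, num_workers: int, batch_size: int, shuffle: bool) -> None:
        # The DataLoader would call this, so users never have to
        # duplicate the loader configuration on the dataset.
        self.num_workers = num_workers
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        yield from range(10)  # placeholder payload

# Hypothetical hook inside DataLoader.__init__:
#   if isinstance(dataset, IterableDataset) and hasattr(dataset, "setup"):
#       dataset.setup(num_workers=num_workers, batch_size=batch_size, shuffle=shuffle)
```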

Interestingly, this is already done for DataPipes: https://github.com/pytorch/pytorch/blob/main/torch/utils/data/dataloader.py#L305, but only for shuffling (paraphrased below).
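Roughly, that existing hook looks like this (a paraphrase, not verbatim; the exact code may differ across versions, see the link above):

```python
from torch.utils.data import IterDataPipe
import torch.utils.data.graph_settings as graph_settings

def _maybe_apply_shuffle(dataset, shuffle):
    # Paraphrase of the DataLoader.__init__ hook linked above:
    # DataPipes are told about `shuffle`, plain IterableDatasets are not.
    if isinstance(dataset, IterDataPipe) and shuffle is not None:
        dataset = graph_settings.apply_shuffle_settings(dataset, shuffle=shuffle)
    return dataset
```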

In the LitData framework (https://github.com/Lightning-AI/litdata), I had to override the PyTorch DataLoader to support this properly (https://github.com/Lightning-AI/litdata/blob/main/src/litdata/streaming/dataloader.py#L479), so that the length of the iterable dataset doesn't cause issues with DDP.

Alternatives

Subclass the PyTorch DataLoader

Additional context

No response

cc @andrewkho @gokulavasan @ssnl @VitalyFedyunin @dzhulgakov

@colesbury colesbury added module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels May 24, 2024

tchaton commented May 25, 2024

Here is an example resulting from this problem, where the IterableDataset is unaware of the batch_size and num_workers: Lightning-AI/pytorch-lightning#19624

andrewkho (Contributor) commented

@tchaton thanks for flagging this! Definitely understand and agree that IterableDatasets are difficult to get correct. Just to check my understanding: I think the proposal you're making is to pass information from the DataLoader to the dataset through a method call, but the user would still be required to set the behaviour up themselves, e.g. to end iteration early.

To be honest, I'm not totally convinced how much of an improvement this is. Re: the linked Lightning issue, that does sound like a pain, but I'm not sure how this proposal would help in that situation.


tchaton commented May 27, 2024

Hey @andrewkho, yes, there are a lot of things that are hard to get right.

As an example, here: https://github.com/Lightning-AI/litdata/blob/f23ba626c48d62a8397c2b7ae4bacfd345df2c86/src/litdata/streaming/dataloader.py#L620, I had to override the DataLoader's length to ensure the reported length was correct. A sketch of that kind of override follows.
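A hedged sketch of what such an override looks like (compute_num_batches here is a hypothetical helper, not LitData's actual method; see the linked source for the real implementation):

```python
from torch.utils.data import DataLoader

class StreamingDataLoader(DataLoader):
    """Sketch only: report the number of batches the sharded dataset
    will actually yield, instead of DataLoader's default estimate."""

    def __len__(self) -> int:
        # compute_num_batches is hypothetical: the dataset knows how its
        # items are split across workers and how they are batched.
        return self.dataset.compute_num_batches(
            num_workers=self.num_workers, batch_size=self.batch_size
        )
```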


tchaton commented Jun 1, 2024

Here is another example of this issue: Lightning-AI/litdata#147
