
PyTorch DataLoader improvements for Iterable Dataset #127072

Open

tchaton opened this issue May 24, 2024 · 4 comments
Labels
module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments


tchaton commented May 24, 2024

🚀 The feature, motivation and pitch

Currently, the PyTorch DataLoader makes it hard to design an IterableDataset that functions correctly.

The main reason is that the IterableDataset needs to know about the num_workers, batch_size and shuffle arguments provided to the DataLoader in order to work properly. One could argue for exposing them on the IterableDataset itself, but that isn't intuitive for new users and leads to more issues: users would then have to pass the same batch_size to both the dataset and the DataLoader, which is error-prone. See the sketch below.
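To illustrate the status quo: the standard workaround today is to call get_worker_info() inside __iter__ and shard manually. That covers num_workers, but batch_size and shuffle remain invisible to the dataset. A minimal sketch:

```python
from torch.utils.data import IterableDataset, get_worker_info

class ShardedRange(IterableDataset):
    """Minimal sketch: shard a range of integers across DataLoader workers."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this process yields everything.
            start, step = 0, 1
        else:
            # Each worker yields a disjoint, interleaved slice of the data.
            start, step = info.id, info.num_workers
        return iter(range(start, self.n, step))
```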

I am thinking a simple solution would be to have a protocol for IterableDataset: if the dataset defines a setup method, the DataLoader would call it in its __init__ to pass down the user-provided arguments (num_workers, batch_size, shuffle).
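A minimal sketch of what that could look like (the setup name, signature, and DataLoader hook are hypothetical, not an existing PyTorch API):

```python
from torch.utils.data import IterableDataset

class MyStream(IterableDataset):
    """Hypothetical: a dataset opting into the proposed setup protocol."""

    def setup(self, num_workers: int, batch_size: int, shuffle: bool) -> None:
        # The DataLoader would call this, so users never have to
        # duplicate the loader configuration on the dataset.
        self.num_workers = num_workers
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        yield from range(10)  # placeholder payload

# Hypothetical hook inside DataLoader.__init__:
#   if isinstance(dataset, IterableDataset) and hasattr(dataset, "setup"):
#       dataset.setup(num_workers=num_workers, batch_size=batch_size, shuffle=shuffle)
```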

Interestingly, this is already done for DataPipes: https://github.com/pytorch/pytorch/blob/main/torch/utils/data/dataloader.py#L305, but only for shuffling (paraphrased below).
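Roughly, that existing hook looks like this (a paraphrase, not verbatim; the exact code may differ across versions, see the link above):

```python
from torch.utils.data import IterDataPipe
import torch.utils.data.graph_settings as graph_settings

def _maybe_apply_shuffle(dataset, shuffle):
    # Paraphrase of the DataLoader.__init__ hook linked above:
    # DataPipes are told about `shuffle`, plain IterableDatasets are not.
    if isinstance(dataset, IterDataPipe) and shuffle is not None:
        dataset = graph_settings.apply_shuffle_settings(dataset, shuffle=shuffle)
    return dataset
```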

In the LitData framework (https://github.com/Lightning-AI/litdata), I had to override the PyTorch DataLoader to support this properly (https://github.com/Lightning-AI/litdata/blob/main/src/litdata/streaming/dataloader.py#L479), so that the length of the iterable dataset doesn't cause issues with DDP.

Alternatives

Subclass the PyTorch DataLoader

Additional context

No response

cc @andrewkho @gokulavasan @ssnl @VitalyFedyunin @dzhulgakov

@colesbury colesbury added module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels May 24, 2024

tchaton commented May 25, 2024

Here is an example resulting from this problem, where the IterableDataset is unaware of the batch_size and num_workers: Lightning-AI/pytorch-lightning#19624

andrewkho (Contributor) commented

@tchaton thanks for flagging this! Definitely understand and agree that IterableDatasets are difficult to get correct. Just to check my understanding: I think the proposal you're making is to pass information from the DataLoader to the dataset through a method call, but the user would still be required to set the behaviour up themselves, e.g. to end iteration early.

To be honest, I'm not totally convinced how much of an improvement this is. Re: the linked Lightning issue, that does sound like a pain, but I'm not sure how this proposal would help in that situation.


tchaton commented May 27, 2024

Hey @andrewkho, yes, there are a lot of things that are hard to get right.

As an example, here: https://github.com/Lightning-AI/litdata/blob/f23ba626c48d62a8397c2b7ae4bacfd345df2c86/src/litdata/streaming/dataloader.py#L620, I had to override the DataLoader's length to ensure the reported length was correct. A sketch of that kind of override follows.
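A hedged sketch of what such an override looks like (compute_num_batches here is a hypothetical helper, not LitData's actual method; see the linked source for the real implementation):

```python
from torch.utils.data import DataLoader

class StreamingDataLoader(DataLoader):
    """Sketch only: report the number of batches the sharded dataset
    will actually yield, instead of DataLoader's default estimate."""

    def __len__(self) -> int:
        # compute_num_batches is hypothetical: the dataset knows how its
        # items are split across workers and how they are batched.
        return self.dataset.compute_num_batches(
            num_workers=self.num_workers, batch_size=self.batch_size
        )
```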


tchaton commented Jun 1, 2024

Here is another example of this issue: Lightning-AI/litdata#147
