
[doc][train] Clarify prepare_data_loader shuffle behavior and include set_epoch usage in all examples #41807

Merged
merged 12 commits into ray-project:master from the prep_dataloader branch
Dec 12, 2023

Conversation

justinvyu (Contributor) commented Dec 11, 2023

Why are these changes needed?

prepare_data_loader adds a DistributedSampler to an existing PyTorch DataLoader object. To do this, it recreates the DataLoader, passing most arguments through from the original object, but it also makes some implicit assumptions that are neither configurable nor visible to the caller.

For example, with vanilla PyTorch it's possible to write: train_dataloader = DataLoader(..., shuffle=False, sampler=DistributedSampler(shuffle=True)). Here, the DataLoader sets shuffle=False, but the DistributedSampler still shuffles on every epoch, so the training data order is not always the same. The DataLoader's shuffle=False argument is effectively ignored because a custom sampler is supplied.
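
For reference, here's a minimal runnable sketch of that vanilla PyTorch pattern (not from this PR; the toy dataset and the explicit num_replicas/rank values stand in for a real distributed setup):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy map-style dataset; in practice this is your real dataset.
dataset = TensorDataset(torch.arange(16).float().unsqueeze(1), torch.arange(16))

# The sampler owns the shuffling. DataLoader's shuffle stays False because
# PyTorch forbids passing shuffle=True together with a custom sampler.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
train_dataloader = DataLoader(dataset, batch_size=4, shuffle=False, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reseed so each epoch yields a different order
    for X, y in train_dataloader:
        pass  # training step goes here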

However, with Ray Train, since this prepare_data_loader utility injects the DistributedSampler for the user, there's no visibility on the shuffle parameter. Ray Train will detect the shuffle parameter set on the original dataloader, then pass that along to the DistributedSampler. So, it's not possible to have this False+True situation.

Additionally, if shuffle=True, DistributedSampler.set_epoch must be called at the start of each epoch so that the dataset ordering differs across epochs for all workers. This is because the sampler's seed is determined at the start of each epoch (epoch seed = base random seed + epoch number).
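
For context, the seeding logic that makes set_epoch necessary looks roughly like this (a simplified paraphrase, not the actual PyTorch implementation):

import torch

# Roughly what DistributedSampler does when shuffle=True: the permutation is
# generated from (base seed + epoch), so if the epoch is never updated via
# set_epoch, every epoch reuses the same permutation.
def epoch_permutation(dataset_len, base_seed, epoch):
    g = torch.Generator()
    g.manual_seed(base_seed + epoch)
    return torch.randperm(dataset_len, generator=g).tolist()

print(epoch_permutation(8, 0, 0))  # same seed + same epoch -> same order every time
print(epoch_permutation(8, 0, 1))  # bumping the epoch changes the order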

Shuffling can be very important for training a model successfully -- if the data order remains the same every epoch, it's possible that training never converges (e.g., we ran into this issue training ResNet-18 on ImageNet).

Example:

import torch
import ray.train.torch

train_dataloader = torch.utils.data.DataLoader(
+   ..., batch_size=..., shuffle=True
)
train_dataloader = ray.train.torch.prepare_data_loader(train_dataloader)

for epoch in range(10):
+   if ray.train.get_context().get_world_size() > 1:
+       # Required for the distributed sampler to shuffle properly across epochs.
+       train_dataloader.sampler.set_epoch(epoch)

    for X, y in train_dataloader:
        # No need to move data to GPU, this is done by `prepare_data_loader`!
        # X, y = X.to("cuda"), y.to("cuda")
        ...

Note: the ray.train.get_context().get_world_size() > 1 condition is needed so that debugging with a single Ray Train worker (num_workers=1) doesn't throw an error. set_epoch is only a valid method on DistributedSampler, and prepare_data_loader only injects that sampler when num_workers > 1.
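
An equivalent guard (an alternative sketch, not what the docs use; train_dataloader is the prepared loader from the example above) is to check the sampler type directly, which also works when debugging with a single worker:

from torch.utils.data.distributed import DistributedSampler

for epoch in range(10):
    # Only a DistributedSampler has (and needs) set_epoch; with num_workers=1,
    # prepare_data_loader leaves the original sampler in place.
    if isinstance(train_dataloader.sampler, DistributedSampler):
        train_dataloader.sampler.set_epoch(epoch)
    for X, y in train_dataloader:
        ...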

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

woshiyyya (Member):

> However, with Ray Train, since this prepare_data_loader utility injects the DistributedSampler for the user, there's no visibility on the shuffle parameter. Ray Train will detect the shuffle parameter set on the original dataloader, then pass that along to the DistributedSampler. So, it's not possible to have this False+True situation.

What would Ray Train do if a user does have shuffle=False but Sampler(shuffle=True)?

justinvyu (Contributor, Author):

@woshiyyya If the dataloader already has a DistributedSampler(shuffle=True) attached, then prepare_data_loader is a no-op, so it will respect the user's custom config. The relevant check in the source is:

and not isinstance(data_loader.sampler, DistributedSampler)
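
To illustrate that no-op path, here's a sketch (not from the PR) of a training function where the user attaches their own DistributedSampler before calling prepare_data_loader, so the shuffle=False + DistributedSampler(shuffle=True) combination is preserved:

import torch
from torch.utils.data.distributed import DistributedSampler
import ray.train
import ray.train.torch

def train_func():
    dataset = ...  # your map-style dataset

    # User-supplied sampler: shuffling is owned by the sampler, not the DataLoader.
    sampler = DistributedSampler(
        dataset,
        num_replicas=ray.train.get_context().get_world_size(),
        rank=ray.train.get_context().get_world_rank(),
        shuffle=True,
    )
    train_dataloader = torch.utils.data.DataLoader(
        dataset, batch_size=..., shuffle=False, sampler=sampler
    )

    # A DistributedSampler is already attached, so prepare_data_loader does not
    # replace it and the user's shuffle config is respected.
    train_dataloader = ray.train.torch.prepare_data_loader(train_dataloader)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # still required for per-epoch shuffling
        for X, y in train_dataloader:
            ...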

woshiyyya (Member) commented Dec 12, 2023

Got it. The logic is indeed quite convoluted. There are actually three factors that may affect the shuffling behavior:

  • Whether the user specified a DistributedSampler in the DataLoader
  • Whether the dataset is an IterableDataset or a map-style Dataset
  • Whether the DataLoader was created with shuffle=True or shuffle=False

Shall we have a table in the docstring to illustrate the priority of these factors?

We can mention that if the user provides a DistributedSampler, then Ray Train will not add a new sampler, and the shuffling behavior will follow the user's config.

@@ -129,6 +129,8 @@ Compare a PyTorch training script with and without Ray Train.

# Training
for epoch in range(10):
if ray.train.get_context().get_world_size() > 1:
A Member commented on this diff:

Also it surprised me that there's no set_epoch in any of these Accelerate examples. So it'd be great to show it in our example.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
justinvyu (Contributor, Author):

@woshiyyya I added some extra notes talking about caveats in this section: https://anyscale-ray--41807.com.readthedocs.build/en/41807/train/getting-started-pytorch.html#set-up-a-dataset

matthewdeng (Contributor) left a comment: 😍

justinvyu merged commit 6b65b56 into ray-project:master on Dec 12, 2023
12 of 14 checks passed
justinvyu deleted the prep_dataloader branch December 12, 2023 at 23:40
aslonnie pushed a commit referencing this pull request on Dec 13, 2023 (…41876):

This PR fixes a previous typo in #41807 that updated the test, but CI didn't actually run the test due to no Ray Serve code being changed.

justinvyu added a commit with the same fix to justinvyu/ray on Dec 14, 2023 (…ray-project#41876), and architkulkarni pushed a commit referencing this pull request on Dec 14, 2023 (…41876) (#41921).