
[doc][train] Clarify prepare_data_loader shuffle behavior and include set_epoch usage in all examples #41807

Merged
merged 12 commits into ray-project:master from the prep_dataloader branch
Dec 12, 2023

Conversation

justinvyu (Contributor) commented Dec 11, 2023

Why are these changes needed?

prepare_data_loader adds a DistributedSampler to an existing PyTorch DataLoader object. To do this, it recreates the DataLoader, passing most arguments through from the original object, but it also makes some implicit assumptions that are neither configurable nor visible to the caller.

For example, with vanilla PyTorch it's possible to write: train_dataloader = DataLoader(..., shuffle=False, sampler=DistributedSampler(shuffle=True)). Here, the DataLoader sets shuffle=False, but the DistributedSampler still shuffles on every epoch, so the training data order is not always the same. The DataLoader's shuffle=False argument is effectively ignored because a custom sampler is supplied.
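
For reference, here's a minimal runnable sketch of that vanilla PyTorch pattern (not from this PR; the toy dataset and the explicit num_replicas/rank values stand in for a real distributed setup):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy map-style dataset; in practice this is your real dataset.
dataset = TensorDataset(torch.arange(16).float().unsqueeze(1), torch.arange(16))

# The sampler owns the shuffling. DataLoader's shuffle stays False because
# PyTorch forbids passing shuffle=True together with a custom sampler.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
train_dataloader = DataLoader(dataset, batch_size=4, shuffle=False, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reseed so each epoch yields a different order
    for X, y in train_dataloader:
        pass  # training step goes here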

However, with Ray Train, since this prepare_data_loader utility injects the DistributedSampler for the user, there's no visibility on the shuffle parameter. Ray Train will detect the shuffle parameter set on the original dataloader, then pass that along to the DistributedSampler. So, it's not possible to have this False+True situation.

Additionally, if shuffle=True, DistributedSampler.set_epoch must be called at the start of each epoch so that the dataset ordering differs across epochs for all workers. This is because the sampler's seed is determined at the start of each epoch (epoch seed = base random seed + epoch number).
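
For context, the seeding logic that makes set_epoch necessary looks roughly like this (a simplified paraphrase, not the actual PyTorch implementation):

import torch

# Roughly what DistributedSampler does when shuffle=True: the permutation is
# generated from (base seed + epoch), so if the epoch is never updated via
# set_epoch, every epoch reuses the same permutation.
def epoch_permutation(dataset_len, base_seed, epoch):
    g = torch.Generator()
    g.manual_seed(base_seed + epoch)
    return torch.randperm(dataset_len, generator=g).tolist()

print(epoch_permutation(8, 0, 0))  # same seed + same epoch -> same order every time
print(epoch_permutation(8, 0, 1))  # bumping the epoch changes the order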

Shuffling can be very important for training a model successfully -- if the data order remains the same every epoch, it's possible that training never converges (e.g., we ran into this issue training ResNet-18 on ImageNet).

Example:

import torch
import ray.train.torch

train_dataloader = torch.utils.data.DataLoader(
+   ..., batch_size=..., shuffle=True
)
train_dataloader = ray.train.torch.prepare_data_loader(train_dataloader)

for epoch in range(10):
+   if ray.train.get_context().get_world_size() > 1:
+       # Required for the distributed sampler to shuffle properly across epochs.
+       train_dataloader.sampler.set_epoch(epoch)

    for X, y in train_dataloader:
        # No need to move data to GPU, this is done by `prepare_data_loader`!
        # X, y = X.to("cuda"), y.to("cuda")
        ...

Note: the ray.train.get_context().get_world_size() > 1 condition is needed so that debugging with a single Ray Train worker (num_workers=1) doesn't throw an error. set_epoch is only a valid method on DistributedSampler, and prepare_data_loader only injects that sampler when num_workers > 1.
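
An equivalent guard (an alternative sketch, not what the docs use; train_dataloader is the prepared loader from the example above) is to check the sampler type directly, which also works when debugging with a single worker:

from torch.utils.data.distributed import DistributedSampler

for epoch in range(10):
    # Only a DistributedSampler has (and needs) set_epoch; with num_workers=1,
    # prepare_data_loader leaves the original sampler in place.
    if isinstance(train_dataloader.sampler, DistributedSampler):
        train_dataloader.sampler.set_epoch(epoch)
    for X, y in train_dataloader:
        ...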

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

woshiyyya (Member):

> However, with Ray Train, since this prepare_data_loader utility injects the DistributedSampler for the user, there's no visibility on the shuffle parameter. Ray Train will detect the shuffle parameter set on the original dataloader, then pass that along to the DistributedSampler. So, it's not possible to have this False+True situation.

What would Ray Train do if a user does have shuffle=False but Sampler(shuffle=True)?

justinvyu (Contributor, Author):

@woshiyyya If the dataloader already has a DistributedSampler(shuffle=True) attached, then prepare_data_loader is a no-op, so it will respect the user's custom config. The relevant check in the source is:

and not isinstance(data_loader.sampler, DistributedSampler)
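
To illustrate that no-op path, here's a sketch (not from the PR) of a training function where the user attaches their own DistributedSampler before calling prepare_data_loader, so the shuffle=False + DistributedSampler(shuffle=True) combination is preserved:

import torch
from torch.utils.data.distributed import DistributedSampler
import ray.train
import ray.train.torch

def train_func():
    dataset = ...  # your map-style dataset

    # User-supplied sampler: shuffling is owned by the sampler, not the DataLoader.
    sampler = DistributedSampler(
        dataset,
        num_replicas=ray.train.get_context().get_world_size(),
        rank=ray.train.get_context().get_world_rank(),
        shuffle=True,
    )
    train_dataloader = torch.utils.data.DataLoader(
        dataset, batch_size=..., shuffle=False, sampler=sampler
    )

    # A DistributedSampler is already attached, so prepare_data_loader does not
    # replace it and the user's shuffle config is respected.
    train_dataloader = ray.train.torch.prepare_data_loader(train_dataloader)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # still required for per-epoch shuffling
        for X, y in train_dataloader:
            ...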

woshiyyya (Member) commented Dec 12, 2023

Got it. The logic is indeed quite convoluted. There are actually three factors that may affect the shuffling behavior:

  • Whether the user specified a DistributedSampler in the DataLoader
  • Whether the dataset is an IterableDataset or a map-style Dataset
  • Whether the DataLoader was created with shuffle=True or shuffle=False

Shall we have a table in the docstring to illustrate the priority of these factors?

We can mention that if the user provides a DistributedSampler, then Ray Train will not add a new sampler, and the shuffling behavior will follow the user's config.

@@ -129,6 +129,8 @@ Compare a PyTorch training script with and without Ray Train.

# Training
for epoch in range(10):
if ray.train.get_context().get_world_size() > 1:
A Member commented on this diff:

Also it surprised me that there's no set_epoch in any of these Accelerate examples. So it'd be great to show it in our example.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
justinvyu (Contributor, Author):

@woshiyyya I added some extra notes talking about caveats in this section: https://anyscale-ray--41807.com.readthedocs.build/en/41807/train/getting-started-pytorch.html#set-up-a-dataset

matthewdeng (Contributor) left a comment: 😍

justinvyu merged commit 6b65b56 into ray-project:master on Dec 12, 2023
12 of 14 checks passed
justinvyu deleted the prep_dataloader branch December 12, 2023 at 23:40
aslonnie pushed a commit referencing this pull request on Dec 13, 2023 (…41876):

This PR fixes a previous typo in #41807 that updated the test, but CI didn't actually run the test due to no Ray Serve code being changed.

justinvyu added a commit with the same fix to justinvyu/ray on Dec 14, 2023 (…ray-project#41876), and architkulkarni pushed a commit referencing this pull request on Dec 14, 2023 (…41876) (#41921).