
[doc][train] Fix ray.train.report docstring to clarify that it is not a barrier #42422

Merged (5 commits) on Jan 26, 2024

Conversation

justinvyu (Contributor)

Why are these changes needed?

ray.train.report gets called by all distributed training workers, but does not actually act as a synchronous barrier for the workers. Ray Train workers continue their training execution immediately after finishing ray.train.report.

Here's what happens:

  • The training coordinator waits on all futures to resolve, so it will be blocked if one of the workers is a straggler.
  • The workers themselves just put their results on a queue; once a result is on the queue, it resolves one of the coordinator's futures.
  • The workers continue onto the next batch/epoch immediately. (The workers will typically run into another barrier via an all-reduce at some point.)
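The coordinator/worker interaction described above can be sketched with a plain-Python simulation. This is illustrative only, not Ray Train's actual implementation: threads and a `queue.Queue` stand in for Ray actors and futures.

```python
import queue
import threading
import time

results: queue.Queue = queue.Queue()
progressed = []  # workers that moved on, regardless of coordinator state
lock = threading.Lock()

def worker(rank: int, delay: float) -> None:
    time.sleep(delay)           # simulate uneven step times (rank 3 is the straggler)
    results.put((rank, delay))  # "report": enqueue the result...
    with lock:
        progressed.append(rank)  # ...and continue immediately -- no barrier here

threads = [threading.Thread(target=worker, args=(r, 0.05 * r)) for r in range(4)]
for t in threads:
    t.start()

# The "coordinator" blocks until all 4 results arrive,
# i.e. until the slowest worker has reported.
gathered = [results.get() for _ in range(4)]
for t in threads:
    t.join()

print(sorted(rank for rank, _ in gathered))  # [0, 1, 2, 3]
```

Note that fast workers append to `progressed` and would start their next step while the coordinator is still waiting on the straggler's result.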

Users should add a barrier themselves if they need workers to be synchronized at the beginning or end of a report. See our lightning integration utility for one example: #40875
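In a PyTorch-backed training loop, that user-added synchronization point would typically be `torch.distributed.barrier()`. Its effect can be demonstrated with the standard-library `threading.Barrier` in a minimal sketch (hypothetical worker setup, illustration only):

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
passed = []
lock = threading.Lock()

def worker(rank: int) -> None:
    # ... report metrics here; the report call itself returns immediately ...
    barrier.wait()  # explicit barrier: no worker proceeds until all have reported
    with lock:
        passed.append(rank)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(passed))  # 4 -- every worker reached and passed the barrier
```

The design trade-off is the one the docstring fix calls out: making report itself a barrier would force every worker to stall on the slowest one at every report, so Ray Train leaves synchronization to the user.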

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Review comments: python/ray/train/_checkpoint.py (resolved), python/ray/train/_internal/session.py (resolved, outdated)
@justinvyu justinvyu merged commit d9bd752 into ray-project:master Jan 26, 2024
9 checks passed
@justinvyu justinvyu deleted the fix_report_docstring branch January 26, 2024 02:43
khluu pushed a commit that referenced this pull request Jan 27, 2024
…ot a barrier (#42422)

`ray.train.report` gets called by all distributed training workers, but does not actually act as a synchronous barrier *for the workers.* Ray Train workers continue their training execution immediately after finishing `ray.train.report`. This PR fixes the docstring to correctly describe this behavior.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: khluu <khluu000@gmail.com>

3 participants