[train/docs] Extend resource guide (training backend + choosing resources) #39202

Merged
merged 16 commits into from
Sep 8, 2023

@krfricke krfricke commented Sep 1, 2023

Why are these changes needed?

This docs update adds a comprehensive guide to choosing resources for distributed training. It also covers setting the distributed communication backend in torch and configuring persistent storage via an environment variable, and it expands on checkpoint restoration. The PR also fixes a few references within the existing checkpointing docs.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 6 commits September 1, 2023 12:13
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>

.. _train-resource-guide:

How many nodes, workers, and resources should I use?

This is good information but I feel like it's really scattered right now and as a user it would be very hard to figure out what to concretely do. It's generally not clear to me if this is intended to be more catered to people who are starting distributed training, or people who already have a job and are facing concrete bottlenecks (maybe we want to apply this to both categories).

Sorry if this comment isn't super concrete. Maybe we can separate this from the PR for now and think more about how to architect this information.

Kai Fricke added 2 commits September 6, 2023 15:52
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
The :class:`Trainer <ray.train.trainer.BaseTrainer>` object you instantiate in the
training script contains the settings to run your training. When you call
:meth:`Trainer.fit() <ray.train.trainer.BaseTrainer.fit>`, the trainer is scheduled
as a :ref:`Ray Actor <actor-key-concept>` and can therefore use resources of its own.
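
Because the trainer runs as an actor and can hold resources, a rough budget for a run has to count it as well. The following is a minimal sketch of that arithmetic, not Ray code: ``ResourceRequest`` and ``total_request`` are hypothetical helpers, and it assumes (not stated in the snippet above) that a run consists of one trainer actor plus ``num_workers`` identical training workers, with the trainer's CPUs requested on top of the workers'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """Hypothetical helper (not part of Ray): resources one training run asks for."""

    cpus: float
    gpus: float


def total_request(num_workers, cpus_per_worker=1.0, gpus_per_worker=0.0, trainer_cpus=1.0):
    """Sum the resources of all workers plus the trainer actor.

    Assumption: the trainer actor's CPUs come on top of the workers' resources.
    The default values here are illustrative, not taken from the Ray docs.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return ResourceRequest(
        cpus=num_workers * cpus_per_worker + trainer_cpus,
        gpus=num_workers * gpus_per_worker,
    )


# Example: 4 workers with 2 CPUs and 1 GPU each, plus 1 CPU for the trainer.
request = total_request(num_workers=4, cpus_per_worker=2, gpus_per_worker=1, trainer_cpus=1)
print(request)
```

Comparing this total against what the cluster actually provides is one quick way to see whether a run can be scheduled at all before tuning anything else.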

Not able to leave the comment on the exact lines since they were not edited in this PR, but I feel like the code snippet should just be moved to an example in the API reference. 1. I feel like we're missing an example in the API reference. 2. I don't think we should include more advanced configs like placement_strategy in the introduction of this guide.

Kai Fricke added 3 commits September 7, 2023 14:05
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@matthewdeng matthewdeng left a comment

Scaling/GPUs guide looks good after the suggested changes. Will let @justinvyu take another pass for the other pages!

…e-guide

# Conflicts:
#	doc/source/train/api/api.rst
#	python/ray/air/config.py
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 8, 2023
@matthewdeng matthewdeng merged commit 8190332 into ray-project:master Sep 8, 2023
15 of 66 checks passed
matthewdeng added a commit to matthewdeng/ray that referenced this pull request Sep 8, 2023
…rces) (ray-project#39202)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
GeneDer pushed a commit that referenced this pull request Sep 8, 2023
…39468)

* [train] Fix issues in migration of tune_cifar_torch_pbt_example (#39158)

Resolves three issues that come up when migrating the `tune_cifar_torch_pbt_example` from Ray 2.6 to Ray 2.7:

1. There is a warning message because PBT uses the `_schedule_trial_save` interface. This is added to the white list attributes so it doesn't come up anymore.
2. PBT malfunctions in Python 2.7, so instead of silently failing, we raise an error and ask users to migrate
3. When users use old `ray.air.Checkpoint` APIs on `ray.train.Checkpoint`, we should raise an actionable error message.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [tune] Make Trainable.save/restore developer APIs (#39391)

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [Telemetry] Add Telemetry for Ray Train Utilities (#39363)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [train] update Train API references & annotations (#39294)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [2.7] Cleanup all LightningTrainer Mentions in Ray Doc (#39406)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [train] remove _max_cpu_fraction_per_node (#39412)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [train] Legacy interface cleanup (`air.Checkpoint`, `LegacyExperimentAnalysis`) (#39289)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>

* [Train][Telemetry] Limit the usage of `ray.train.torch.get_device`. (#39432)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [train-ci] Fix Train examples with authentication buildkite commands. (#39387)

* [train-ci] fix Train examples with authentication buildkite commands.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* [train][doc] Remove preprocessor reference in tune+train user guide (#39442)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [train/docs] Extend resource guide (training backend + choosing resources) (#39202)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* fix docs

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [Minor] Remove remaining LightningTrainer Mentions (#39441)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

---------

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
@krfricke krfricke deleted the doc/train/resource-guide branch September 11, 2023 23:50
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
…rces) (ray-project#39202)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…rces) (ray-project#39202)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>
3 participants