[Doc][Train] Add `accelerator_type` to Ray Train user guide #44882

hongpeng-guo · 2024-04-20T00:09:44Z

Why are these changes needed?

`ScalingConfig` supports a new argument, `accelerator_type`. This PR adds a user guide section with example code that shows how to use it. The generated section of the user guide is appended below:

Related issue number

Closes #44763

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>

justinvyu · 2024-04-22T18:05:22Z

Tip if you haven't seen this already: we build the docs as part of the premerge CI, so you can take a look at your rendered docs: https://anyscale-ray--44882.com.readthedocs.build/en/44882/index.html

hongpeng-guo · 2024-04-22T18:23:21Z

> Tip if you haven't seen this already: we build the docs as part of the premerge CI, so you can take a look at your rendered docs: https://anyscale-ray--44882.com.readthedocs.build/en/44882/index.html

Nice tips! ty

justinvyu

Nice! Made some edit suggestions.

justinvyu · 2024-04-22T18:36:55Z

doc/source/train/user-guides/using-gpus.rst

+Sometimes you might want to specify the accelerator type for a worker. For example, 
+you can specify `accelerator_type="A100"` in the `ScalingConfig` if you want to 
+assign the worker an NVIDIA A100 GPU.


Suggested change

-Sometimes you might want to specify the accelerator type for a worker. For example,
-you can specify `accelerator_type="A100"` in the `ScalingConfig` if you want to
-assign the worker an NVIDIA A100 GPU.
+Ray Train allows you to specify the accelerator type for each worker.
+This is useful if your model training has some GPU memory constraints that requires a specific type of GPU.
+In a heterogeneous Ray cluster, this means that your training workers will be forced to run on the specified GPU type, rather than on any arbitrary GPU node.
+For example, you can specify `accelerator_type="A100"` in the :class:`~ray.train.ScalingConfig` if you want to
+assign each worker a NVIDIA A100 GPU.

justinvyu · 2024-04-22T18:37:33Z

doc/source/train/user-guides/using-gpus.rst

+    import torch
+    from ray.train import ScalingConfig
+    from ray.train.torch import TorchTrainer, get_device
+
+
+    def train_func():
+        assert torch.cuda.is_available()
+
+        device = get_device()
+        assert device == torch.device("cuda:0")
+
+    trainer = TorchTrainer(
+        train_func,
+        scaling_config=ScalingConfig(
+            num_workers=1,
+            use_gpu=True,
+            accelerator_type="A100"
+        )
+    )
+    trainer.fit()


We can cut this down to just show the ScalingConfig.


justinvyu · 2024-04-22T20:00:06Z

doc/source/train/user-guides/using-gpus.rst

+    from ray.train import ScalingConfig
+    from ray.train.torch import TorchTrainer
+
+
+    trainer = TorchTrainer(
+        train_func,
+        scaling_config=ScalingConfig(
+            num_workers=1,
+            use_gpu=True,
+            accelerator_type="A100"
+        )
+    )


Oh, for this one, I'm thinking of just showing:

ScalingConfig(...)

justinvyu · 2024-04-22T20:01:58Z

doc/source/train/user-guides/using-gpus.rst

+Ensure that your cluster has instances with the specified accelerator type 
+or is able to autoscale to fulfill the request.


We can make this a tip:

.. tip:: Ensure that your cluster has instances with the specified accelerator type or is able to autoscale to fulfill the request. Otherwise, your job will hang forever due to unsatisfiable pending resource requests.

Oh nice tip structure. Will try.
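
To make the tip concrete, here is a minimal sketch of a pre-flight check a user could run before launching training. It assumes that Ray advertises accelerator types as resource keys of the form `accelerator_type:<TYPE>` in the mapping returned by `ray.cluster_resources()`. The helper names and the example snapshot are hypothetical and exist only for illustration; the sketch works on a plain dict so it runs without a cluster.

```python
ACCELERATOR_PREFIX = "accelerator_type:"


def available_accelerator_types(cluster_resources):
    """Return the accelerator types advertised in a resource mapping."""
    return sorted(
        key[len(ACCELERATOR_PREFIX):]
        for key, amount in cluster_resources.items()
        if key.startswith(ACCELERATOR_PREFIX) and amount > 0
    )


def has_accelerator(cluster_resources, accelerator_type):
    """Check whether the requested accelerator type is currently advertised."""
    return accelerator_type in available_accelerator_types(cluster_resources)


# Hypothetical snapshot, shaped like the dict from ray.cluster_resources().
example_resources = {
    "CPU": 32.0,
    "GPU": 4.0,
    "accelerator_type:T4": 1.0,
}

print(available_accelerator_types(example_resources))
print(has_accelerator(example_resources, "A100"))
```

On an autoscaling cluster a negative result is not conclusive, because nodes with the requested accelerator may only be added once the request is pending.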


woshiyyya · 2024-04-22T22:27:53Z

doc/source/train/user-guides/using-gpus.rst

+    ScalingConfig(
+            num_workers=1,
+            use_gpu=True,
+            accelerator_type="A100"
+    )


Fix the indent here?
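
For reference, a consistently indented version of the snippet might look like the sketch below. It assumes Ray Train is installed. The arguments are the ones from the diff above; the variable name `scaling_config`, the import line, and the trailing comma are additions for illustration.

```python
from ray.train import ScalingConfig

# Same arguments as in the diff above, with the keyword arguments
# indented one level (four spaces) relative to the call.
scaling_config = ScalingConfig(
    num_workers=1,
    use_gpu=True,
    accelerator_type="A100",
)
```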

woshiyyya · 2024-04-22T22:30:47Z

doc/source/train/user-guides/using-gpus.rst

+Setting the GPU type
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Ray Train allows you to specify the accelerator type for each worker.
+This is useful if your model training has some GPU memory constraints that requires a specific type of GPU.


Users may want to use different accelerator types not only for GPU memory constraints, but also for e.g. compute power, cost efficiency, availability, etc.

Let's just say: "This is useful if you want to use a specific accelerator type for model training."


woshiyyya

Nice work!

justinvyu

Nice!

Add accelerator type to Ray Train user-guides

7516071

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>

hongpeng-guo requested review from matthewdeng, justinvyu, woshiyyya and a team as code owners April 20, 2024 00:09

Merge branch 'master' into doc-accelerator-type

3441a7b

hongpeng-guo assigned justinvyu, matthewdeng and woshiyyya Apr 20, 2024

justinvyu reviewed Apr 22, 2024

View reviewed changes

hongpeng-guo added 3 commits April 22, 2024 11:53

Merge branch 'master' of https://github.com/ray-project/ray into doc-…

7ac8aba

…accelerator-type

adopting suggested comments for doc updates

5b93f67

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>

Merge branch 'doc-accelerator-type' of https://github.com/hongpeng-gu…

3bc56d6

…o/ray into doc-accelerator-type

justinvyu reviewed Apr 22, 2024

View reviewed changes

hongpeng-guo added 3 commits April 22, 2024 13:26

Adding tip structure, and simplifying testcode

48bed0f

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into doc-…

167a87a

…accelerator-type

appending last commit, splitting into two paragraphs

4777cad

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>

woshiyyya reviewed Apr 22, 2024

View reviewed changes

hongpeng-guo added 3 commits April 22, 2024 15:46

fix indentation & minor changes for specific accelerator type

d629cf5

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>

add supported accelerator type list

41c8595

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>

add supported accelerator type list

809e173

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>

woshiyyya approved these changes Apr 24, 2024

View reviewed changes

justinvyu approved these changes Apr 24, 2024

View reviewed changes

justinvyu changed the title ~~[Doc][Train] Add accelerator_type to Ray Train user guides~~ [Doc][Train] Add accelerator_type to Ray Train user guide Apr 24, 2024

justinvyu merged commit 0da794c into ray-project:master Apr 24, 2024
5 checks passed

hongpeng-guo deleted the doc-accelerator-type branch April 25, 2024 00:06
