
[train] simplify TorchTrainer docstring #38049

Merged: 8 commits merged into ray-project:master on Aug 3, 2023

Conversation

matthewdeng (Contributor) commented on Aug 3, 2023

Why are these changes needed?

Generated API Reference: TorchTrainer

This PR simplifies the TorchTrainer docstring with the following changes:

  1. Improve the descriptions of each of the parameters.
  2. Reorganize the parameters (moving dataset_config immediately after datasets).
  3. Clean up the example.
  4. Reduce the amount of content in the description to make it easier to digest.
    1. Some of this is now captured directly in the parameter descriptions.
    2. Some of it should be linked to user guides.

Note: Unfortunately, the arguments are not rendered properly, so the type hints are not visible right now. I'd like to follow up and get this fixed in the future.

(screenshot of the rendered docstring)
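
For reference, a minimal end-to-end sketch in the spirit of the cleaned-up example (the model, dataset, and hyperparameter values below are illustrative placeholders, not the docstring's verbatim code, and the sketch assumes the newer ray.train API that the example's imports suggest):

import torch
import torch.nn as nn

import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# If not using GPUs, set this to False.
use_gpu = True
# Number of distributed training workers.
num_workers = 4


def train_loop_per_worker(config):
    # User-defined training function, run once on each worker.
    model = train.torch.prepare_model(nn.Linear(4, 1))  # Wrap the model for DDP.
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    dataset_shard = train.get_dataset_shard("train")

    for epoch in range(config["num_epochs"]):
        for batch in dataset_shard.iter_torch_batches(batch_size=8, dtypes=torch.float):
            loss = nn.functional.mse_loss(model(batch["x"]), batch["y"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Report metrics back to the Trainer after each epoch.
        train.report({"epoch": epoch, "loss": loss.item()})


train_dataset = ray.data.from_items(
    [{"x": [float(i)] * 4, "y": [float(i)]} for i in range(32)]
)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"lr": 0.01, "num_epochs": 2},
    datasets={"train": train_dataset},
    # dataset_config would go here, immediately after datasets.
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
)
result = trainer.fit()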

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Matthew Deng <matt@anyscale.com>
@richardliaw (Contributor):

Related to your comment: #33429

The fix is to not override __new__ in the base trainer.

@matthewdeng (Contributor, Author):

Yeah... it's not clear to me how to, though. 😢

@richardliaw (Contributor):

Would it be helpful if I made a draft/closed PR, and perhaps we can find someone to help shepherd a proper fix into 2.7?

@matthewdeng (Contributor, Author):

Yeah, that would be great if you have an idea of how to do it. I tried doing something in BaseTrainer.__init__ but couldn't get it to work, since the subclasses don't pass all args in when calling super(...).__init__(...).
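
For context, a minimal sketch of the pattern under discussion (hypothetical class names and arguments, not Ray's actual implementation):

import inspect

class BaseTrainer:
    def __new__(cls, *args, **kwargs):
        # A base-class __new__ that accepts only *args/**kwargs. Tools that
        # introspect the constructor signature (e.g. Sphinx autodoc) may pick
        # this up instead of the subclass's __init__, so the rendered
        # arguments and type hints are lost.
        obj = super().__new__(cls)
        obj._param_dict = {"args": args, "kwargs": kwargs}
        return obj

    def __init__(self, scaling_config=None, datasets=None, **kwargs):
        self.scaling_config = scaling_config
        self.datasets = datasets

class TorchTrainer(BaseTrainer):
    def __init__(self, train_loop_per_worker, train_loop_config=None, **kwargs):
        super().__init__(**kwargs)
        self.train_loop_per_worker = train_loop_per_worker
        self.train_loop_config = train_loop_config

# On many Python versions this prints the generic (*args, **kwargs) from
# __new__ rather than TorchTrainer.__init__'s parameters.
print(inspect.signature(TorchTrainer))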

@woshiyyya (Member) left a comment:

Thank you Matt. This looks great!

Resolved review threads on python/ray/train/torch/torch_trainer.py.
@angelinalg (Contributor) left a comment:

Really, really great improvement. Thank you!

@@ -16,217 +16,121 @@
class TorchTrainer(DataParallelTrainer):
"""A Trainer for data parallel PyTorch training.
Reviewer (Contributor):

Is this the same training for all torch-based trainers? Is it appropriate to mention that Lightning and HF also follow this pattern?

matthewdeng (Author):

Yes, but I'll make this change in a future PR when this becomes the standard way!


import ray
from ray import train
from ray.train import Checkpoint, CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

# If using GPUs, set this to True.
Reviewer (Contributor):

If we're setting it to True by default, does it make sense to make the comment: "If not using GPUs, set this to False."?

matthewdeng (Author):

Good catch.
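
(Presumably the comment in the example was flipped as suggested, i.e. something along these lines, shown here as an assumption:)

# If not using GPUs, set this to False.
use_gpu = True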

num_epochs = 20
num_workers = 3
use_gpu = True
num_workers = 4
Reviewer (Contributor):

Does everyone know what a worker is, or should we add a comment about what this is for?

matthewdeng (Author):

Added a comment.
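
(The exact comment added isn't shown in this view; as an assumption, something along these lines:)

# Number of distributed training workers (one Ray actor/process per worker).
num_workers = 4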

python/ray/train/torch/torch_trainer.py (outdated thread, resolved)
2. Sets up a PyTorch Distributed environment
on these workers as defined by the ``torch_config``.
3. Ingests the input ``datasets`` based on the ``dataset_config``.
4. Runs the input ``train_loop_per_worker(train_loop_config)``
Reviewer (Contributor):

Is this a user-defined function?

matthewdeng (Author):

Yeah, it is.
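
For context, a bare-bones sketch of such a user-defined function (assuming the newer ray.train API suggested by the example's imports; not the docstring's exact code):

from ray import train

def train_loop_per_worker(train_loop_config):
    # Defined by the user; Ray Train runs it once on each worker and passes in
    # the dict supplied as TorchTrainer(train_loop_config=...).
    lr = train_loop_config["lr"]
    # ... set up model/optimizer, train, and report metrics ...
    train.report({"lr": lr})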

Additional resolved (outdated) review threads on python/ray/train/torch/torch_trainer.py and python/ray/air/config.py.
matthewdeng and others added 7 commits August 3, 2023 11:09
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
matthewdeng merged commit a92fa2f into ray-project:master on Aug 3, 2023. 7 of 44 checks passed.
Commits referencing this pull request:

NripeshN pushed a commit to NripeshN/ray that referenced this pull request on Aug 15, 2023.
harborn pushed two commits to harborn/ray that referenced this pull request on Aug 17, 2023.
arvind-chandra pushed a commit to lmco/ray that referenced this pull request on Aug 31, 2023.
vymao pushed a commit to vymao/ray that referenced this pull request on Oct 11, 2023.