
[train] simplify TorchTrainer docstring #38049

Merged: 8 commits merged into ray-project:master on Aug 3, 2023

Conversation

matthewdeng (Contributor) commented on Aug 3, 2023

Why are these changes needed?

Generated API Reference: TorchTrainer

This PR simplifies the TorchTrainer docstring with the following changes:

  1. Improve the descriptions of each of the parameters.
  2. Reorganize the parameters (moving dataset_config immediately after datasets).
  3. Clean up the example.
  4. Reduce the amount of content in the description to make it easier to digest.
    1. Some of this is now captured directly in the parameter descriptions.
    2. Some of it should be linked to user guides.

Note: Unfortunately, the arguments are not rendered properly, so the type hints are not visible right now. I'd like to follow up and get this fixed in the future.

(screenshot of the rendered docstring)
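
For reference, a minimal end-to-end sketch in the spirit of the cleaned-up example (the model, dataset, and hyperparameter values below are illustrative placeholders, not the docstring's verbatim code, and the sketch assumes the newer ray.train API that the example's imports suggest):

import torch
import torch.nn as nn

import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# If not using GPUs, set this to False.
use_gpu = True
# Number of distributed training workers.
num_workers = 4


def train_loop_per_worker(config):
    # User-defined training function, run once on each worker.
    model = train.torch.prepare_model(nn.Linear(4, 1))  # Wrap the model for DDP.
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    dataset_shard = train.get_dataset_shard("train")

    for epoch in range(config["num_epochs"]):
        for batch in dataset_shard.iter_torch_batches(batch_size=8, dtypes=torch.float):
            loss = nn.functional.mse_loss(model(batch["x"]), batch["y"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Report metrics back to the Trainer after each epoch.
        train.report({"epoch": epoch, "loss": loss.item()})


train_dataset = ray.data.from_items(
    [{"x": [float(i)] * 4, "y": [float(i)]} for i in range(32)]
)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"lr": 0.01, "num_epochs": 2},
    datasets={"train": train_dataset},
    # dataset_config would go here, immediately after datasets.
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
)
result = trainer.fit()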

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Matthew Deng <matt@anyscale.com>
@richardliaw (Contributor):

Related to your comment: #33429

The fix is to not override __new__ in the base trainer.

@matthewdeng (Contributor, Author):

Yeah... it's not clear to me how to, though. 😢

@richardliaw (Contributor):

Would it be helpful if I made a draft/closed PR, and perhaps we can find someone to help shepherd a proper fix into 2.7?

@matthewdeng (Contributor, Author):

Yeah, that would be great if you have an idea of how to do it. I tried doing something in BaseTrainer.__init__ but couldn't get it to work, since the subclasses don't pass all args in when calling super(...).__init__(...).
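
For context, a minimal sketch of the pattern under discussion (hypothetical class names and arguments, not Ray's actual implementation):

import inspect

class BaseTrainer:
    def __new__(cls, *args, **kwargs):
        # A base-class __new__ that accepts only *args/**kwargs. Tools that
        # introspect the constructor signature (e.g. Sphinx autodoc) may pick
        # this up instead of the subclass's __init__, so the rendered
        # arguments and type hints are lost.
        obj = super().__new__(cls)
        obj._param_dict = {"args": args, "kwargs": kwargs}
        return obj

    def __init__(self, scaling_config=None, datasets=None, **kwargs):
        self.scaling_config = scaling_config
        self.datasets = datasets

class TorchTrainer(BaseTrainer):
    def __init__(self, train_loop_per_worker, train_loop_config=None, **kwargs):
        super().__init__(**kwargs)
        self.train_loop_per_worker = train_loop_per_worker
        self.train_loop_config = train_loop_config

# On many Python versions this prints the generic (*args, **kwargs) from
# __new__ rather than TorchTrainer.__init__'s parameters.
print(inspect.signature(TorchTrainer))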

@woshiyyya (Member) left a comment:

Thank you Matt. This looks great!

Resolved review threads on python/ray/train/torch/torch_trainer.py.
@angelinalg (Contributor) left a comment:

Really, really great improvement. Thank you!

@@ -16,217 +16,121 @@
class TorchTrainer(DataParallelTrainer):
"""A Trainer for data parallel PyTorch training.
Reviewer (Contributor):

Is this the same training for all torch-based trainers? Is it appropriate to mention that Lightning and HF also follow this pattern?

matthewdeng (Author):

Yes, but I'll make this change in a future PR when this becomes the standard way!


import ray
from ray import train
from ray.train import Checkpoint, CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

# If using GPUs, set this to True.
Reviewer (Contributor):

If we're setting it to True by default, does it make sense to make the comment: "If not using GPUs, set this to False."?

matthewdeng (Author):

Good catch.
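
(Presumably the comment in the example was flipped as suggested, i.e. something along these lines, shown here as an assumption:)

# If not using GPUs, set this to False.
use_gpu = True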

num_epochs = 20
num_workers = 3
use_gpu = True
num_workers = 4
Reviewer (Contributor):

Does everyone know what a worker is, or should we add a comment about what this is for?

matthewdeng (Author):

Added a comment.
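
(The exact comment added isn't shown in this view; as an assumption, something along these lines:)

# Number of distributed training workers (one Ray actor/process per worker).
num_workers = 4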

python/ray/train/torch/torch_trainer.py (outdated thread, resolved)
2. Sets up a PyTorch Distributed environment
on these workers as defined by the ``torch_config``.
3. Ingests the input ``datasets`` based on the ``dataset_config``.
4. Runs the input ``train_loop_per_worker(train_loop_config)``
Reviewer (Contributor):

Is this a user-defined function?

matthewdeng (Author):

Yeah, it is.
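
For context, a bare-bones sketch of such a user-defined function (assuming the newer ray.train API suggested by the example's imports; not the docstring's exact code):

from ray import train

def train_loop_per_worker(train_loop_config):
    # Defined by the user; Ray Train runs it once on each worker and passes in
    # the dict supplied as TorchTrainer(train_loop_config=...).
    lr = train_loop_config["lr"]
    # ... set up model/optimizer, train, and report metrics ...
    train.report({"lr": lr})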

Additional resolved (outdated) review threads on python/ray/train/torch/torch_trainer.py and python/ray/air/config.py.
matthewdeng and others added 7 commits August 3, 2023 11:09
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
matthewdeng merged commit a92fa2f into ray-project:master on Aug 3, 2023. 7 of 44 checks passed.
Commits referencing this pull request:

NripeshN pushed a commit to NripeshN/ray that referenced this pull request on Aug 15, 2023.
harborn pushed two commits to harborn/ray that referenced this pull request on Aug 17, 2023.
arvind-chandra pushed a commit to lmco/ray that referenced this pull request on Aug 31, 2023.
vymao pushed a commit to vymao/ray that referenced this pull request on Oct 11, 2023.