
[Train] Update docstring and user guides for train_loop_config #43691

Merged (9 commits) on Mar 8, 2024
9 changes: 8 additions & 1 deletion doc/source/train/getting-started-pytorch-lightning.rst
@@ -23,7 +23,7 @@ For reference, the final code is as follows:
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

def train_func(config):
@woshiyyya (Member, Author) commented on Mar 6, 2024:

Not showing the config argument here, since we didn't specify train_loop_config in the TorchTrainer in this code snippet. Otherwise users would be confused about where to pass the train_func arguments.

woshiyyya marked this conversation as resolved.
def train_func():
# Your PyTorch Lightning training code here.

scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
@@ -190,6 +190,13 @@ Begin by wrapping your code in a :ref:`training function <train-overview-trainin

Each distributed training worker executes this function.

You can specify the input argument for `train_func` via the Trainer's `train_loop_config` parameter.
A contributor commented:

Optionally, we could extract this section into a separate file and include it, similar to what's being done here.

In the future we may just have a full separate user guide for this.

@woshiyyya (Member, Author) replied:

Good idea. I've extracted the common paragraph into a separate doc.


.. note::

Avoid passing large data objects through `train_loop_config` to reduce
serialization and deserialization overhead. Instead, initialize large
objects (e.g. datasets, models) directly in `train_func`.
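
A minimal sketch of how this fits together (the keys ``lr`` and ``max_epochs`` are
illustrative hyperparameter names, not part of the Ray API):

.. code-block:: python

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # `config` is the dict passed as `train_loop_config` below.
        lr = config["lr"]
        max_epochs = config["max_epochs"]
        # Build your LightningModule and Lightning Trainer here using these values.
        ...

    trainer = TorchTrainer(
        train_func,
        train_loop_config={"lr": 1e-3, "max_epochs": 10},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )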

Ray Train sets up your distributed process group on each worker. You only need to
make a few changes to your Lightning Trainer definition.
10 changes: 9 additions & 1 deletion doc/source/train/getting-started-pytorch.rst
@@ -24,7 +24,7 @@ For reference, the final code will look something like the following:
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

def train_func(config):
def train_func():
# Your PyTorch training code here.
...

@@ -195,6 +195,14 @@ Begin by wrapping your code in a :ref:`training function <train-overview-trainin

Each distributed training worker executes this function.

You can specify the input argument for `train_func` via the Trainer's `train_loop_config` parameter.

.. note::

Avoid passing large data objects through `train_loop_config` to reduce
serialization and deserialization overhead. Instead, initialize large
objects (e.g. datasets, models) directly in `train_func`.
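
As a sketch of the pattern the note above recommends (``batch_size`` is an
illustrative key, not a required name), keep ``train_loop_config`` small and build
heavy objects inside the training function:

.. code-block:: python

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # Only small values (hyperparameters) arrive through `config`.
        batch_size = config["batch_size"]
        # Construct large objects (datasets, models) here on the worker,
        # rather than passing them through `train_loop_config`.
        ...

    trainer = TorchTrainer(
        train_func,
        train_loop_config={"batch_size": 64},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )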

Set up a model
^^^^^^^^^^^^^^

14 changes: 11 additions & 3 deletions doc/source/train/getting-started-transformers.rst
@@ -22,7 +22,7 @@ For reference, the final code follows:
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

def train_func(config):
def train_func():
# Your Transformers training code here.

scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
@@ -212,9 +212,17 @@ You can begin by wrapping your code in a :ref:`training function <train-overview
def train_func(config):
# Your Transformers training code here.

This function executes on each distributed training worker. Ray Train sets up the distributed
process group on each worker before entering this function.
This function executes on each distributed training worker.

You can specify the input argument for `train_func` via the Trainer's `train_loop_config` parameter.

.. note::

Avoid passing large data objects through `train_loop_config` to reduce
serialization and deserialization overhead. Instead, initialize large
objects (e.g. datasets, models) directly in `train_func`.
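
For instance (a sketch only; the keys shown are illustrative), hyperparameters passed
through ``train_loop_config`` can be forwarded into your ``TrainingArguments``:

.. code-block:: python

    from transformers import TrainingArguments

    def train_func(config):
        # `config` is the dict given to the Trainer's `train_loop_config`.
        training_args = TrainingArguments(
            output_dir="output",
            learning_rate=config["learning_rate"],
            num_train_epochs=config["num_epochs"],
        )
        # Build the datasets, model, and transformers.Trainer here.
        ...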

Ray Train sets up the distributed process group on each worker before entering this function.
Put all the logic into this function, including dataset construction and preprocessing,
model initialization, Transformers trainer definition, and more.

4 changes: 3 additions & 1 deletion python/ray/train/torch/torch_trainer.py
@@ -146,7 +146,9 @@ def train_loop_per_worker(config):
:ref:`Ray Train Loop utilities <train-loop-api>`.
train_loop_config: A configuration ``Dict`` to pass in as an argument to
``train_loop_per_worker``.
This is typically used for specifying hyperparameters.
This is typically used for specifying hyperparameters. Passing large
datasets via `train_loop_config` is not recommended and may introduce
significant serialization and deserialization overhead and unexpected issues.
torch_config: The configuration for setting up the PyTorch Distributed backend.
If set to None, a default configuration will be used in which
GPU training uses NCCL and CPU training uses Gloo.