[docs][train] Make Train example titles, headings more consistent (#39606)
Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
3 people committed Sep 14, 2023
1 parent 77b4cb9 commit 4ed4b52
Showing 22 changed files with 531 additions and 397 deletions.
.github/styles/Vocab/Train/accept.txt (2 changes: 1 addition & 1 deletion)
@@ -1,5 +1,5 @@
Horovod
Hugging Face
hyperparameters?
Keras
LightGBM
PyTorch
doc/source/train/deepspeed.rst (15 changes: 9 additions & 6 deletions)
@@ -1,11 +1,14 @@
.. _train-deepspeed:

Training with DeepSpeed
=======================
Get Started with DeepSpeed
==========================

The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `DeepSpeed <https://www.deepspeed.ai/>`_ training across a distributed Ray cluster.

All you need to do is run your existing training code with a TorchTrainer. You can expect the final code to look like this:
Code example
------------

You only need to run your existing training code with a TorchTrainer. You can expect the final code to look like this:

.. code-block:: python
@@ -74,12 +77,12 @@ Below is a simple example of ZeRO-3 training with DeepSpeed only.
keep using `deepspeed.initialize() <https://deepspeed.readthedocs.io/en/latest/initialize.html>`_ as usual to prepare everything
for distributed training.
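
The ZeRO-3 example referenced above is collapsed in this view. As a rough sketch only, not the example from this commit, the following assumes a toy ``torch.nn`` model, a hand-written DeepSpeed config, and two GPU workers:

.. code-block:: python

    import deepspeed
    import torch
    import torch.nn as nn

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker(config):
        # Toy model and config values; replace with your own network and settings.
        model = nn.Linear(16, 1)

        # DeepSpeed sets up ZeRO-3 partitioning and the optimizer inside each
        # Ray Train worker, the same way it would outside of Ray.
        model_engine, optimizer, _, _ = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            config={
                "train_micro_batch_size_per_gpu": 8,
                "zero_optimization": {"stage": 3},
                "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
            },
        )

        for _ in range(10):  # stand-in for a real data loader and epoch loop
            inputs = torch.randn(8, 16).to(model_engine.device)
            loss = model_engine(inputs).mean()
            model_engine.backward(loss)
            model_engine.step()


    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    result = trainer.fit()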

Running DeepSpeed with other frameworks
-------------------------------------------
Run DeepSpeed with other frameworks
-----------------------------------

Many deep learning frameworks have integrated with DeepSpeed, including Lightning, Transformers, Accelerate, and more. You can run all these combinations in Ray Train.

Please check the below examples for more details:
Check the examples below for more details:

.. list-table::
:header-rows: 1
doc/source/train/distributed-tensorflow-keras.rst (104 changes: 49 additions & 55 deletions)
@@ -1,18 +1,19 @@
.. _train-tensorflow-overview:

Distributed Tensorflow & Keras
==============================
Get Started with TensorFlow and Keras
=====================================

Ray Train's `TensorFlow <https://www.tensorflow.org/>`__ integration enables you
to scale your TensorFlow and Keras training loops to many machines and GPUs.
to scale your TensorFlow and Keras training functions to many machines and GPUs.

On a technical level, Ray Train schedules your training workers
and configures ``TF_CONFIG`` for you, allowing you to run
your ``MultiWorkerMirroredStrategy`` training script. See `Distributed
training with TensorFlow <https://www.tensorflow.org/guide/distributed_training>`_
for more information.

Most of the examples in this guide use Tensorflow with Keras, but
Ray Train also works with vanilla Tensorflow.
Most of the examples in this guide use TensorFlow with Keras, but
Ray Train also works with vanilla TensorFlow.


Quickstart
@@ -23,29 +24,27 @@ Quickstart
:end-before: __tf_train_end__


Updating your training function
-------------------------------
Update your training function
-----------------------------

First, you'll want to update your training function to support distributed
First, update your :ref:`training function <train-overview-training-function>` to support distributed
training.


.. note::
The current TensorFlow implementation supports
``MultiWorkerMirroredStrategy`` (and ``MirroredStrategy``). If there are
other strategies you wish to see supported by Ray Train, please let us know
by submitting a `feature request on GitHub <https://github.com/ray-project/ray/issues>`_.
other strategies you wish to see supported by Ray Train, submit a `feature request on GitHub <https://github.com/ray-project/ray/issues>`_.

These instructions closely follow TensorFlow's `Multi-worker training
with Keras <https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras>`_
tutorial. One key difference is that Ray Train will handle the environment
tutorial. One key difference is that Ray Train handles the environment
variable set up for you.

**Step 1:** Wrap your model in ``MultiWorkerMirroredStrategy``.

The `MultiWorkerMirroredStrategy <https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy>`_
enables synchronous distributed training. The ``Model`` *must* be built and
compiled within the scope of the strategy.
enables synchronous distributed training. You *must* build and compile the ``Model`` within the scope of the strategy.

.. code-block:: python
@@ -56,9 +55,8 @@ compiled within the scope of the strategy.
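
As a minimal illustration of this step, with a placeholder ``Dense`` model standing in for the collapsed example:

.. code-block:: python

    import tensorflow as tf


    def train_func():
        # Build and compile the Keras model *inside* the strategy scope.
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential(
                [tf.keras.layers.Dense(1, input_shape=(16,))]
            )
            model.compile(optimizer="adam", loss="mse")

        # Dataset setup and model.fit() follow in the next steps.
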
**Step 2:** Update your ``Dataset`` batch size to the *global* batch
size.

The `batch <https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch>`_
will be split evenly across worker processes, so ``batch_size`` should be
set appropriately.
Set ``batch_size`` appropriately because `batch <https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch>`_
splits evenly across worker processes.

.. code-block:: diff
@@ -67,20 +65,20 @@ set appropriately.
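
For illustration, one way to compute the global batch size inside the training function, assuming the ``ray.train.get_context()`` API and placeholder data shapes:

.. code-block:: python

    import tensorflow as tf
    from ray import train


    def build_dataset(batch_size_per_worker: int) -> tf.data.Dataset:
        # Call this helper from inside the training function, where a
        # Ray Train context is available.
        global_batch_size = (
            batch_size_per_worker * train.get_context().get_world_size()
        )
        features = tf.random.uniform((1024, 16))
        labels = tf.random.uniform((1024, 1))
        return tf.data.Dataset.from_tensor_slices((features, labels)).batch(
            global_batch_size
        )
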
.. warning::
Ray will not automatically set any environment variables or configuration
related to local parallelism / threading
Ray doesn't automatically set any environment variables or configuration
related to local parallelism or threading
:ref:`aside from "OMP_NUM_THREADS" <omp-num-thread-note>`.
If you desire greater control over TensorFlow threading, use
If you want greater control over TensorFlow threading, use
the ``tf.config.threading`` module (e.g.,
``tf.config.threading.set_inter_op_parallelism_threads(num_cpus)``)
at the beginning of your ``train_loop_per_worker`` function.
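
As a sketch of that threading setup, where the ``num_cpus_per_worker`` key is a hypothetical entry in your own ``train_loop_config``:

.. code-block:: python

    import tensorflow as tf


    def train_loop_per_worker(config):
        # Optionally take explicit control of TensorFlow threading per worker.
        num_cpus = config.get("num_cpus_per_worker", 1)
        tf.config.threading.set_inter_op_parallelism_threads(num_cpus)
        tf.config.threading.set_intra_op_parallelism_threads(num_cpus)

        # ... rest of the training function ...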

Creating a :class:`~ray.train.tensorflow.TensorflowTrainer`
-----------------------------------------------------------
Create a TensorflowTrainer
--------------------------

``Trainer``\s are the primary Ray Train classes that are used to manage state and
``Trainer``\s are the primary Ray Train classes for managing state and
executing training. For distributed TensorFlow,
we use a :class:`~ray.train.tensorflow.TensorflowTrainer`
use a :class:`~ray.train.tensorflow.TensorflowTrainer`
that you can set up like this:

.. code-block:: python
@@ -109,38 +107,35 @@ To customize the backend setup, you can pass a
)
For more configurability, please reference the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API.
For more configurability, see the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API.
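
Because the construction snippet above is collapsed here, the following minimal sketch shows one way the trainer can be created; ``train_func`` is the training function from the previous section, and the worker count and ``TensorflowConfig`` usage are placeholder assumptions:

.. code-block:: python

    from ray.train import ScalingConfig
    from ray.train.tensorflow import TensorflowConfig, TensorflowTrainer

    trainer = TensorflowTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
        # Optional: customize the TensorFlow backend setup.
        tensorflow_config=TensorflowConfig(),
    )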


Running your training function
------------------------------
Run a training function
-----------------------

With a distributed training function and a Ray Train ``Trainer``, you are now
ready to start training!
ready to start training.

.. code-block:: python
trainer.fit()
Data loading and preprocessing
------------------------------
Tensorflow per default uses its own internal dataset sharding policy, as described
Load and preprocess data
------------------------

TensorFlow by default uses its own internal dataset sharding policy, as described
`in the guide <https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#dataset_sharding>`__.
If your tensorflow dataset is compatible with distributed loading, you don't need to
If your TensorFlow dataset is compatible with distributed loading, you don't need to
change anything.

If you require more advanced preprocessing, you may want to consider using Ray Data
for distributed data ingest.

There is a guide for using :ref:`Ray Data with Ray Train <data-ingest-torch>`
in our PyTorch guide. Since Ray Data is an independent library, most concepts can
be directly applied to TensorFlow.
for distributed data ingest. See :ref:`Ray Data with Ray Train <data-ingest-torch>`.

The main difference is that you may want to convert your Ray Data dataset shard to
a TensorFlow dataset in your training function so that you can use the Keras
API for model training.

`Here's a full example you can refer to <https://github.com/ray-project/ray/blob/master/python/ray/train/examples/tf/tune_tensorflow_autoencoder_example.py>`__
`See this example <https://github.com/ray-project/ray/blob/master/python/ray/train/examples/tf/tune_tensorflow_autoencoder_example.py>`__
for distributed data loading. The relevant parts are:

.. code-block:: python
@@ -184,8 +179,8 @@ for distributed data loading. The relevant parts are:
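
As a hedged sketch of those relevant parts, where the dataset name ``"train"`` and the ``"features"``/``"label"`` column names are assumptions rather than values from the linked example:

.. code-block:: python

    import tensorflow as tf
    from ray import train


    def train_func():
        # Fetch this worker's shard of the Ray Data dataset passed to the Trainer.
        shard = train.get_dataset_shard("train")

        # Convert the shard to a tf.data.Dataset so the Keras API can consume it.
        tf_dataset = shard.to_tf(
            feature_columns="features",
            label_columns="label",
            batch_size=32,
        )

        # ... build the model under MultiWorkerMirroredStrategy and
        # call model.fit(tf_dataset) ...
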
Reporting results
-----------------
Report results
--------------
During training, the training loop should report intermediate results and checkpoints
to Ray Train. This reporting logs the results to the console output and appends them to
local log files. The logging also triggers :ref:`checkpoint bookkeeping <train-dl-configure-checkpoints>`.
@@ -203,30 +198,29 @@ The easiest way to report your results with Keras is by using the
model.fit(dataset, callbacks=[ReportCheckpointCallback()])
This callback will automatically forward all results and checkpoints from the
Keras training loop to Ray Train.
This callback automatically forwards all results and checkpoints from the
Keras training function to Ray Train.
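
For completeness, a self-contained sketch of this pattern, assuming the ``ray.train.tensorflow.keras`` import path of recent Ray releases and a placeholder model and dataset:

.. code-block:: python

    import tensorflow as tf

    from ray.train import ScalingConfig
    from ray.train.tensorflow import TensorflowTrainer
    from ray.train.tensorflow.keras import ReportCheckpointCallback


    def train_func():
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential(
                [tf.keras.layers.Dense(1, input_shape=(4,))]
            )
            model.compile(optimizer="adam", loss="mse")

        dataset = tf.data.Dataset.from_tensor_slices(
            (tf.random.uniform((64, 4)), tf.random.uniform((64, 1)))
        ).batch(32)

        # The callback forwards per-epoch metrics and checkpoints to Ray Train.
        model.fit(dataset, epochs=2, callbacks=[ReportCheckpointCallback()])


    trainer = TensorflowTrainer(
        train_func, scaling_config=ScalingConfig(num_workers=2)
    )
    result = trainer.fit()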


Aggregating results
~~~~~~~~~~~~~~~~~~~
Aggregate results
~~~~~~~~~~~~~~~~~

TensorFlow Keras automatically aggregates metrics from all workers. If you wish to have more
control over that, consider implementing a `custom training loop <https://www.tensorflow.org/tutorials/distribute/custom_training>`__.


Saving and loading checkpoints
------------------------------
Save and load checkpoints
-------------------------

:class:`Checkpoints <ray.train.Checkpoint>` can be saved by calling ``train.report(metrics, checkpoint=Checkpoint(...))`` in the
training function. This will cause the checkpoint state from the distributed
workers to be saved on the ``Trainer`` (where your python script is executed).
You can save :class:`Checkpoints <ray.train.Checkpoint>` by calling ``train.report(metrics, checkpoint=Checkpoint(...))`` in the
training function. This call saves the checkpoint state from the distributed
workers on the ``Trainer``, where you executed your Python script.

The latest saved checkpoint can be accessed through the ``checkpoint`` attribute of
the :py:class:`~ray.train.Result`, and the best saved checkpoints can be accessed by the ``best_checkpoints``
You can access the latest saved checkpoint through the ``checkpoint`` attribute of
the :py:class:`~ray.train.Result`, and access the best saved checkpoints with the ``best_checkpoints``
attribute.

Concrete examples are provided to demonstrate how checkpoints (model weights but not models) are saved
appropriately in distributed training.
These concrete examples demonstrate how Ray Train appropriately saves checkpoints (model weights but not whole models) in distributed training.


.. code-block:: python
@@ -275,11 +269,11 @@ appropriately in distributed training.
result = trainer.fit()
print(result.checkpoint)
By default, checkpoints will be persisted to local disk in the :ref:`log
By default, checkpoints persist to local disk in the :ref:`log
directory <train-log-dir>` of each run.
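
For reference, a self-contained sketch of the reporting pattern described above; the weights file name and metric keys are arbitrary choices, not part of the Ray API:

.. code-block:: python

    import os
    import tempfile

    import tensorflow as tf

    from ray import train
    from ray.train import Checkpoint, ScalingConfig
    from ray.train.tensorflow import TensorflowTrainer


    def train_func():
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential(
                [tf.keras.layers.Dense(1, input_shape=(4,))]
            )
            model.compile(optimizer="adam", loss="mse")

        dataset = tf.data.Dataset.from_tensor_slices(
            (tf.random.uniform((64, 4)), tf.random.uniform((64, 1)))
        ).batch(32)

        for epoch in range(2):
            history = model.fit(dataset, verbose=0)
            with tempfile.TemporaryDirectory() as tmpdir:
                # Save only the model weights, then report them to Ray Train.
                model.save_weights(os.path.join(tmpdir, "model.h5"))
                train.report(
                    {"epoch": epoch, "loss": history.history["loss"][-1]},
                    checkpoint=Checkpoint.from_directory(tmpdir),
                )


    trainer = TensorflowTrainer(
        train_func, scaling_config=ScalingConfig(num_workers=2)
    )
    result = trainer.fit()
    print(result.checkpoint)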

Loading checkpoints
~~~~~~~~~~~~~~~~~~~
Load checkpoints
~~~~~~~~~~~~~~~~

.. code-block:: python
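
    # NOTE: The original example here is collapsed in this view. This is a
    # hedged sketch of one way to restore weights from the latest reported
    # checkpoint; the "model.h5" file name is an assumption that must match
    # the name used when the checkpoint was saved.
    import os

    import tensorflow as tf

    from ray import train


    def train_func():
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential(
                [tf.keras.layers.Dense(1, input_shape=(4,))]
            )
            model.compile(optimizer="adam", loss="mse")

        # Resume from the latest checkpoint reported to Ray Train, if any.
        checkpoint = train.get_checkpoint()
        if checkpoint:
            with checkpoint.as_directory() as checkpoint_dir:
                model.load_weights(os.path.join(checkpoint_dir, "model.h5"))

        # ... continue training and reporting as usual ...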