[docs][train] Make Train example titles, headings more consistent (#39606)
Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
3 people committed Sep 14, 2023
1 parent 77b4cb9 commit 4ed4b52
Showing 22 changed files with 531 additions and 397 deletions.
.github/styles/Vocab/Train/accept.txt (2 changes: 1 addition & 1 deletion)
@@ -1,5 +1,5 @@
Horovod
Hugging Face
hyperparameters?
Keras
LightGBM
PyTorch
doc/source/train/deepspeed.rst (15 changes: 9 additions & 6 deletions)
@@ -1,11 +1,14 @@
.. _train-deepspeed:

Training with DeepSpeed
=======================
Get Started with DeepSpeed
==========================

The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `DeepSpeed <https://www.deepspeed.ai/>`_ training across a distributed Ray cluster.

All you need to do is run your existing training code with a TorchTrainer. You can expect the final code to look like this:
Code example
------------

You only need to run your existing training code with a TorchTrainer. You can expect the final code to look like this:

.. code-block:: python
@@ -74,12 +77,12 @@ Below is a simple example of ZeRO-3 training with DeepSpeed only.
keep using `deepspeed.initialize() <https://deepspeed.readthedocs.io/en/latest/initialize.html>`_ as usual to prepare everything
for distributed training.
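
The ZeRO-3 example referenced above is collapsed in this view. As a rough sketch only, not the example from this commit, the following assumes a toy ``torch.nn`` model, a hand-written DeepSpeed config, and two GPU workers:

.. code-block:: python

    import deepspeed
    import torch
    import torch.nn as nn

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker(config):
        # Toy model and config values; replace with your own network and settings.
        model = nn.Linear(16, 1)

        # DeepSpeed sets up ZeRO-3 partitioning and the optimizer inside each
        # Ray Train worker, the same way it would outside of Ray.
        model_engine, optimizer, _, _ = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            config={
                "train_micro_batch_size_per_gpu": 8,
                "zero_optimization": {"stage": 3},
                "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
            },
        )

        for _ in range(10):  # stand-in for a real data loader and epoch loop
            inputs = torch.randn(8, 16).to(model_engine.device)
            loss = model_engine(inputs).mean()
            model_engine.backward(loss)
            model_engine.step()


    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    result = trainer.fit()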

Running DeepSpeed with other frameworks
-------------------------------------------
Run DeepSpeed with other frameworks
-----------------------------------

Many deep learning frameworks have integrated with DeepSpeed, including Lightning, Transformers, Accelerate, and more. You can run all these combinations in Ray Train.

Please check the below examples for more details:
Check the examples below for more details:

.. list-table::
:header-rows: 1
doc/source/train/distributed-tensorflow-keras.rst (104 changes: 49 additions & 55 deletions)
@@ -1,18 +1,19 @@
.. _train-tensorflow-overview:

Distributed Tensorflow & Keras
==============================
Get Started with TensorFlow and Keras
=====================================

Ray Train's `TensorFlow <https://www.tensorflow.org/>`__ integration enables you
to scale your TensorFlow and Keras training loops to many machines and GPUs.
to scale your TensorFlow and Keras training functions to many machines and GPUs.

On a technical level, Ray Train schedules your training workers
and configures ``TF_CONFIG`` for you, allowing you to run
your ``MultiWorkerMirroredStrategy`` training script. See `Distributed
training with TensorFlow <https://www.tensorflow.org/guide/distributed_training>`_
for more information.

Most of the examples in this guide use Tensorflow with Keras, but
Ray Train also works with vanilla Tensorflow.
Most of the examples in this guide use TensorFlow with Keras, but
Ray Train also works with vanilla TensorFlow.


Quickstart
@@ -23,29 +24,27 @@ Quickstart
:end-before: __tf_train_end__


Updating your training function
-------------------------------
Update your training function
-----------------------------

First, you'll want to update your training function to support distributed
First, update your :ref:`training function <train-overview-training-function>` to support distributed
training.


.. note::
The current TensorFlow implementation supports
``MultiWorkerMirroredStrategy`` (and ``MirroredStrategy``). If there are
other strategies you wish to see supported by Ray Train, please let us know
by submitting a `feature request on GitHub <https://github.com/ray-project/ray/issues>`_.
other strategies you wish to see supported by Ray Train, submit a `feature request on GitHub <https://github.com/ray-project/ray/issues>`_.

These instructions closely follow TensorFlow's `Multi-worker training
with Keras <https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras>`_
tutorial. One key difference is that Ray Train will handle the environment
tutorial. One key difference is that Ray Train handles the environment
variable set up for you.

**Step 1:** Wrap your model in ``MultiWorkerMirroredStrategy``.

The `MultiWorkerMirroredStrategy <https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy>`_
enables synchronous distributed training. The ``Model`` *must* be built and
compiled within the scope of the strategy.
enables synchronous distributed training. You *must* build and compile the ``Model`` within the scope of the strategy.

.. code-block:: python
@@ -56,9 +55,8 @@ compiled within the scope of the strategy.
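
As a minimal illustration of this step, with a placeholder ``Dense`` model standing in for the collapsed example:

.. code-block:: python

    import tensorflow as tf


    def train_func():
        # Build and compile the Keras model *inside* the strategy scope.
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential(
                [tf.keras.layers.Dense(1, input_shape=(16,))]
            )
            model.compile(optimizer="adam", loss="mse")

        # Dataset setup and model.fit() follow in the next steps.
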
**Step 2:** Update your ``Dataset`` batch size to the *global* batch
size.

The `batch <https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch>`_
will be split evenly across worker processes, so ``batch_size`` should be
set appropriately.
Set ``batch_size`` appropriately because `batch <https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch>`_
splits evenly across worker processes.

.. code-block:: diff
@@ -67,20 +65,20 @@ set appropriately.
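
For illustration, one way to compute the global batch size inside the training function, assuming the ``ray.train.get_context()`` API and placeholder data shapes:

.. code-block:: python

    import tensorflow as tf
    from ray import train


    def build_dataset(batch_size_per_worker: int) -> tf.data.Dataset:
        # Call this helper from inside the training function, where a
        # Ray Train context is available.
        global_batch_size = (
            batch_size_per_worker * train.get_context().get_world_size()
        )
        features = tf.random.uniform((1024, 16))
        labels = tf.random.uniform((1024, 1))
        return tf.data.Dataset.from_tensor_slices((features, labels)).batch(
            global_batch_size
        )
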
.. warning::
Ray will not automatically set any environment variables or configuration
related to local parallelism / threading
Ray doesn't automatically set any environment variables or configuration
related to local parallelism or threading
:ref:`aside from "OMP_NUM_THREADS" <omp-num-thread-note>`.
If you desire greater control over TensorFlow threading, use
If you want greater control over TensorFlow threading, use
the ``tf.config.threading`` module (e.g.,
``tf.config.threading.set_inter_op_parallelism_threads(num_cpus)``)
at the beginning of your ``train_loop_per_worker`` function.
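
As a sketch of that threading setup, where the ``num_cpus_per_worker`` key is a hypothetical entry in your own ``train_loop_config``:

.. code-block:: python

    import tensorflow as tf


    def train_loop_per_worker(config):
        # Optionally take explicit control of TensorFlow threading per worker.
        num_cpus = config.get("num_cpus_per_worker", 1)
        tf.config.threading.set_inter_op_parallelism_threads(num_cpus)
        tf.config.threading.set_intra_op_parallelism_threads(num_cpus)

        # ... rest of the training function ...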

Creating a :class:`~ray.train.tensorflow.TensorflowTrainer`
-----------------------------------------------------------
Create a TensorflowTrainer
--------------------------

``Trainer``\s are the primary Ray Train classes that are used to manage state and
``Trainer``\s are the primary Ray Train classes for managing state and
executing training. For distributed TensorFlow,
we use a :class:`~ray.train.tensorflow.TensorflowTrainer`
use a :class:`~ray.train.tensorflow.TensorflowTrainer`
that you can set up like this:

.. code-block:: python
@@ -109,38 +107,35 @@ To customize the backend setup, you can pass a
)
For more configurability, please reference the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API.
For more configurability, see the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API.
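
Because the construction snippet above is collapsed here, the following minimal sketch shows one way the trainer can be created; ``train_func`` is the training function from the previous section, and the worker count and ``TensorflowConfig`` usage are placeholder assumptions:

.. code-block:: python

    from ray.train import ScalingConfig
    from ray.train.tensorflow import TensorflowConfig, TensorflowTrainer

    trainer = TensorflowTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
        # Optional: customize the TensorFlow backend setup.
        tensorflow_config=TensorflowConfig(),
    )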


Running your training function
------------------------------
Run a training function
-----------------------

With a distributed training function and a Ray Train ``Trainer``, you are now
ready to start training!
ready to start training.

.. code-block:: python
trainer.fit()
Data loading and preprocessing
------------------------------
Tensorflow per default uses its own internal dataset sharding policy, as described
Load and preprocess data
------------------------

TensorFlow by default uses its own internal dataset sharding policy, as described
`in the guide <https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#dataset_sharding>`__.
If your tensorflow dataset is compatible with distributed loading, you don't need to
If your TensorFlow dataset is compatible with distributed loading, you don't need to
change anything.

If you require more advanced preprocessing, you may want to consider using Ray Data
for distributed data ingest.

There is a guide for using :ref:`Ray Data with Ray Train <data-ingest-torch>`
in our PyTorch guide. Since Ray Data is an independent library, most concepts can
be directly applied to TensorFlow.
for distributed data ingest. See :ref:`Ray Data with Ray Train <data-ingest-torch>`.

The main difference is that you may want to convert your Ray Data dataset shard to
a TensorFlow dataset in your training function so that you can use the Keras
API for model training.

`Here's a full example you can refer to <https://github.com/ray-project/ray/blob/master/python/ray/train/examples/tf/tune_tensorflow_autoencoder_example.py>`__
`See this example <https://github.com/ray-project/ray/blob/master/python/ray/train/examples/tf/tune_tensorflow_autoencoder_example.py>`__
for distributed data loading. The relevant parts are:

.. code-block:: python
@@ -184,8 +179,8 @@ for distributed data loading. The relevant parts are:
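
As a hedged sketch of those relevant parts, where the dataset name ``"train"`` and the ``"features"``/``"label"`` column names are assumptions rather than values from the linked example:

.. code-block:: python

    import tensorflow as tf
    from ray import train


    def train_func():
        # Fetch this worker's shard of the Ray Data dataset passed to the Trainer.
        shard = train.get_dataset_shard("train")

        # Convert the shard to a tf.data.Dataset so the Keras API can consume it.
        tf_dataset = shard.to_tf(
            feature_columns="features",
            label_columns="label",
            batch_size=32,
        )

        # ... build the model under MultiWorkerMirroredStrategy and
        # call model.fit(tf_dataset) ...
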
Reporting results
-----------------
Report results
--------------
During training, the training loop should report intermediate results and checkpoints
to Ray Train. This reporting logs the results to the console output and appends them to
local log files. The logging also triggers :ref:`checkpoint bookkeeping <train-dl-configure-checkpoints>`.
@@ -203,30 +198,29 @@ The easiest way to report your results with Keras is by using the
model.fit(dataset, callbacks=[ReportCheckpointCallback()])
This callback will automatically forward all results and checkpoints from the
Keras training loop to Ray Train.
This callback automatically forwards all results and checkpoints from the
Keras training function to Ray Train.
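
For completeness, a self-contained sketch of this pattern, assuming the ``ray.train.tensorflow.keras`` import path of recent Ray releases and a placeholder model and dataset:

.. code-block:: python

    import tensorflow as tf

    from ray.train import ScalingConfig
    from ray.train.tensorflow import TensorflowTrainer
    from ray.train.tensorflow.keras import ReportCheckpointCallback


    def train_func():
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential(
                [tf.keras.layers.Dense(1, input_shape=(4,))]
            )
            model.compile(optimizer="adam", loss="mse")

        dataset = tf.data.Dataset.from_tensor_slices(
            (tf.random.uniform((64, 4)), tf.random.uniform((64, 1)))
        ).batch(32)

        # The callback forwards per-epoch metrics and checkpoints to Ray Train.
        model.fit(dataset, epochs=2, callbacks=[ReportCheckpointCallback()])


    trainer = TensorflowTrainer(
        train_func, scaling_config=ScalingConfig(num_workers=2)
    )
    result = trainer.fit()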


Aggregating results
~~~~~~~~~~~~~~~~~~~
Aggregate results
~~~~~~~~~~~~~~~~~

TensorFlow Keras automatically aggregates metrics from all workers. If you wish to have more
control over that, consider implementing a `custom training loop <https://www.tensorflow.org/tutorials/distribute/custom_training>`__.


Saving and loading checkpoints
------------------------------
Save and load checkpoints
-------------------------

:class:`Checkpoints <ray.train.Checkpoint>` can be saved by calling ``train.report(metrics, checkpoint=Checkpoint(...))`` in the
training function. This will cause the checkpoint state from the distributed
workers to be saved on the ``Trainer`` (where your python script is executed).
You can save :class:`Checkpoints <ray.train.Checkpoint>` by calling ``train.report(metrics, checkpoint=Checkpoint(...))`` in the
training function. This call saves the checkpoint state from the distributed
workers on the ``Trainer``, where you executed your Python script.

The latest saved checkpoint can be accessed through the ``checkpoint`` attribute of
the :py:class:`~ray.train.Result`, and the best saved checkpoints can be accessed by the ``best_checkpoints``
You can access the latest saved checkpoint through the ``checkpoint`` attribute of
the :py:class:`~ray.train.Result`, and access the best saved checkpoints with the ``best_checkpoints``
attribute.

Concrete examples are provided to demonstrate how checkpoints (model weights but not models) are saved
appropriately in distributed training.
These concrete examples demonstrate how Ray Train appropriately saves checkpoints (model weights but not whole models) in distributed training.


.. code-block:: python
@@ -275,11 +269,11 @@ appropriately in distributed training.
result = trainer.fit()
print(result.checkpoint)
By default, checkpoints will be persisted to local disk in the :ref:`log
By default, checkpoints persist to local disk in the :ref:`log
directory <train-log-dir>` of each run.
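
For reference, a self-contained sketch of the reporting pattern described above; the weights file name and metric keys are arbitrary choices, not part of the Ray API:

.. code-block:: python

    import os
    import tempfile

    import tensorflow as tf

    from ray import train
    from ray.train import Checkpoint, ScalingConfig
    from ray.train.tensorflow import TensorflowTrainer


    def train_func():
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential(
                [tf.keras.layers.Dense(1, input_shape=(4,))]
            )
            model.compile(optimizer="adam", loss="mse")

        dataset = tf.data.Dataset.from_tensor_slices(
            (tf.random.uniform((64, 4)), tf.random.uniform((64, 1)))
        ).batch(32)

        for epoch in range(2):
            history = model.fit(dataset, verbose=0)
            with tempfile.TemporaryDirectory() as tmpdir:
                # Save only the model weights, then report them to Ray Train.
                model.save_weights(os.path.join(tmpdir, "model.h5"))
                train.report(
                    {"epoch": epoch, "loss": history.history["loss"][-1]},
                    checkpoint=Checkpoint.from_directory(tmpdir),
                )


    trainer = TensorflowTrainer(
        train_func, scaling_config=ScalingConfig(num_workers=2)
    )
    result = trainer.fit()
    print(result.checkpoint)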

Loading checkpoints
~~~~~~~~~~~~~~~~~~~
Load checkpoints
~~~~~~~~~~~~~~~~

.. code-block:: python
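
    # NOTE: The original example here is collapsed in this view. This is a
    # hedged sketch of one way to restore weights from the latest reported
    # checkpoint; the "model.h5" file name is an assumption that must match
    # the name used when the checkpoint was saved.
    import os

    import tensorflow as tf

    from ray import train


    def train_func():
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential(
                [tf.keras.layers.Dense(1, input_shape=(4,))]
            )
            model.compile(optimizer="adam", loss="mse")

        # Resume from the latest checkpoint reported to Ray Train, if any.
        checkpoint = train.get_checkpoint()
        if checkpoint:
            with checkpoint.as_directory() as checkpoint_dir:
                model.load_weights(os.path.join(checkpoint_dir, "model.h5"))

        # ... continue training and reporting as usual ...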