[train][2.7][4/n] cherry-picks for documentations, tests, examples (#39468)

* [train] Fix issues in migration of tune_cifar_torch_pbt_example (#39158)

  Resolves three issues that come up when migrating the `tune_cifar_torch_pbt_example` from Ray 2.6 to Ray 2.7:

  1. PBT triggers a warning message because it uses the `_schedule_trial_save` interface. The attribute is now whitelisted, so the warning no longer appears.
  2. PBT malfunctions in Ray 2.7 with the old APIs, so instead of failing silently, we raise an error and ask users to migrate.
  3. When users call old `ray.air.Checkpoint` APIs on `ray.train.Checkpoint`, we raise an actionable error message.

  Signed-off-by: Kai Fricke <kai@anyscale.com>

* [tune] Make Trainable.save/restore developer APIs (#39391)

  Signed-off-by: Kai Fricke <kai@anyscale.com>

* [Telemetry] Add Telemetry for Ray Train Utilities (#39363)

  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [train] update Train API references & annotations (#39294)

  Signed-off-by: Matthew Deng <matt@anyscale.com>

* [2.7] Cleanup all LightningTrainer Mentions in Ray Doc (#39406)

  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [train] remove _max_cpu_fraction_per_node (#39412)

  Signed-off-by: Matthew Deng <matt@anyscale.com>

* [train] Legacy interface cleanup (`air.Checkpoint`, `LegacyExperimentAnalysis`) (#39289)

  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
  Co-authored-by: matthewdeng <matt@anyscale.com>

* [Train][Telemetry] Limit the usage of `ray.train.torch.get_device`. (#39432)

  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [train-ci] Fix Train examples with authentication buildkite commands. (#39387)

  * [train-ci] fix Train examples with authentication buildkite commands.

  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* [train][doc] Remove preprocessor reference in tune+train user guide (#39442)

  Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [train/docs] Extend resource guide (training backend + choosing resources) (#39202)

  Signed-off-by: Kai Fricke <kai@anyscale.com>
  Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* fix docs

  Signed-off-by: Matthew Deng <matt@anyscale.com>

* [Minor] Remove remaining LightningTrainer Mentions (#39441)

  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

---------

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
ray-project · Sep 8, 2023 · 4b6e6b2
1 parent 0f7e733
Showing 78 changed files with 1,667 additions and 948 deletions.
diff --git a/.buildkite/pipeline.ml.yml b/.buildkite/pipeline.ml.yml
@@ -54,18 +54,18 @@
     - ./ci/env/env_info.sh
     - bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=tune,-gpu_only,-ray_air,-gpu,-doctest,-needs_credentials python/ray/train/...
 
-# - label: ":train: :key: Train examples with authentication"
-#   conditions: ["NO_WHEELS_REQUIRED", "RAY_CI_TRAIN_AFFECTED", "RAY_CI_BRANCH_BUILD"]
-#   instance_size: medium
-#   commands:
-#     - if [ "$BUILDKITE_PULL_REQUEST" != "false" ]; then exit 0; fi
-#     - cleanup() { if [ "${BUILDKITE_PULL_REQUEST}" = "false" ]; then ./ci/build/upload_build_info.sh; fi }; trap cleanup EXIT
-#     - TRAIN_TESTING=1 ./ci/env/install-dependencies.sh
-#     - ./ci/env/env_info.sh
-#     - credentials=$(python ./ci/env/setup_credentials.py)
-#     - credential_exit_code=$$?
-#     - if [ $credential_exit_code -eq 0 ]; then echo "Credentials are fetched successfully"; bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=needs_credentials 
-#       python/ray/train/... $credentials; else echo "Credentials cannot be fetched"; exit 1; fi
+- label: ":train: :key: Train examples with authentication"
+  conditions: ["NO_WHEELS_REQUIRED", "RAY_CI_TRAIN_AFFECTED", "RAY_CI_BRANCH_BUILD"]
+  instance_size: medium
+  commands:
+    - if [[ "$BUILDKITE_PIPELINE_ID" != "0183465b-c6fb-479b-8577-4cfd743b545d" ]]; then exit 0; fi
+    - trap ./ci/build/upload_build_info.sh EXIT
+    - TRAIN_TESTING=1 ./ci/env/install-dependencies.sh
+    - ./ci/env/env_info.sh
+    - $(python ci/env/setup_credentials.py)
+    - bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=needs_credentials 
+      --test_env=WANDB_API_KEY --test_env=COMET_API_KEY
+      python/ray/train/...
 
 - label: ":brain: RLlib: Benchmarks (Torch 2.x)"
   conditions: ["NO_WHEELS_REQUIRED", "RAY_CI_RLLIB_AFFECTED"]
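
Note on the re-enabled step above: it passes `--test_env=WANDB_API_KEY` and `--test_env=COMET_API_KEY` without values. Bazel reads a bare `--test_env=NAME` from the invoking environment, so the secrets no longer need to appear on the command line. The following is only a rough Python model of that resolution rule; the helper name `resolve_test_env` is hypothetical and is not part of Bazel or Ray.

```python
import os

PREFIX = "--test_env="


def resolve_test_env(flags, environ=None):
    """Model how `--test_env` flags map to the environment seen by tests.

    `NAME=value` sets the variable explicitly; a bare `NAME` is inherited
    from the invoking environment and skipped if it is not set there.
    """
    environ = dict(os.environ) if environ is None else environ
    resolved = {}
    for flag in flags:
        if not flag.startswith(PREFIX):
            continue  # Not a test_env flag; ignore it in this model.
        name, sep, value = flag[len(PREFIX):].partition("=")
        if sep:
            resolved[name] = value
        elif name in environ:
            resolved[name] = environ[name]
    return resolved


# Environment as it would look after the credentials were exported.
exported = {"WANDB_API_KEY": "abcd", "COMET_API_KEY": "efgh"}

# New style: bare names, values come from the environment.
new_style = resolve_test_env(
    ["--test_env=WANDB_API_KEY", "--test_env=COMET_API_KEY"], exported
)
# Old style: values embedded in the flags themselves.
old_style = resolve_test_env(
    ["--test_env=WANDB_API_KEY=abcd", "--test_env=COMET_API_KEY=efgh"], {}
)
print(new_style == old_style)
```

Under this model both invocations give the tests the same variables; the difference is only where the values are visible.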

diff --git a/ci/env/setup_credentials.py b/ci/env/setup_credentials.py
@@ -1,9 +1,11 @@
 """
 This script sets up credentials for some services in the
 CI environment.
-This prints out credentials in the following format, to be ingested
-by as bazel test envs.
---test_env=WANDB_API_KEY=abcd --test_env=COMET_API_KEY=efgh
+This generates a bash script in the following format, which will
+then be sourced to run bazel test with.
+
+export WANDB_API_KEY=abcd
+export COMET_API_KEY=efgh
 """
 import boto3
 import json
@@ -34,16 +36,9 @@ def main():
     except Exception as e:
         print(f"Could not get Ray AIR secrets: {e}")
         sys.exit(1)
-        return
-
-    print(
-        " ".join(
-            [
-                f"--test_env={SERVICES[key]}={ray_air_secrets[key]}"
-                for key in SERVICES.keys()
-            ]
-        )
-    )
+
+    for key in SERVICES.keys():
+        print(f"export {SERVICES[key]}={ray_air_secrets[key]}")
 
 
 if __name__ == "__main__":
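
Note on the new output format: the script now prints one `export NAME=value` line per service. A minimal sketch of that formatting step follows, with two assumptions that are not in the diff: the `SERVICES` mapping shown here is a hypothetical stand-in for the one in the script, and the values are passed through `shlex.quote`, which only makes sense if the output is consumed with `source` or `eval` rather than by bare command substitution.

```python
import shlex

# Hypothetical stand-in for the script's SERVICES mapping:
# secret name in the secret store -> environment variable name.
SERVICES = {
    "wandb_key": "WANDB_API_KEY",
    "comet_ml_key": "COMET_API_KEY",
}


def format_exports(secrets):
    """Return one `export NAME=value` line per configured service."""
    return [
        # shlex.quote leaves simple tokens untouched and single-quotes
        # anything containing shell metacharacters or whitespace.
        f"export {env_name}={shlex.quote(secrets[key])}"
        for key, env_name in SERVICES.items()
    ]


lines = format_exports({"wandb_key": "abcd", "comet_ml_key": "efgh"})
print("\n".join(lines))
```

For simple alphanumeric keys the result matches the format in the docstring above; quoting only changes the output when a value contains characters the shell would otherwise interpret.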

diff --git a/doc/source/ray-overview/examples.rst b/doc/source/ray-overview/examples.rst
@@ -1373,7 +1373,7 @@ Ray Examples
         :link: /train/examples/lightning/vicuna_13b_lightning_deepspeed_finetune
         :link-type: doc
 
-        Fine-tune vicuna-13b-v1.3 with DeepSpeed and LightningTrainer
+        Fine-tune vicuna-13b-v1.3 with DeepSpeed, PyTorch Lightning and Ray Train
 
     .. grid-item-card:: :bdg-secondary:`Code example`
         :class-item: gallery-item training llm pytorch nlp

diff --git a/doc/source/rllib/rllib-saving-and-loading-algos-and-policies.rst b/doc/source/rllib/rllib-saving-and-loading-algos-and-policies.rst
@@ -8,7 +8,7 @@ Saving and Loading your RL Algorithms and Policies
 ##################################################
 
 
-You can use :py:class:`~ray.air.checkpoint.Checkpoint` objects to store
+You can use :py:class:`~ray.train.Checkpoint` objects to store
 and load the current state of your :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`
 or :py:class:`~ray.rllib.policy.policy.Policy` and the neural networks (weights)
 within these structures. In the following, we will cover how you can create these
@@ -26,7 +26,7 @@ or a single :py:class:`~ray.rllib.policy.policy.Policy` instance.
 The Algorithm- or Policy instances that were used to create the checkpoint in the first place
 may or may not have been trained prior to this.
 
-RLlib uses the :py:class:`~ray.air.checkpoint.Checkpoint` class to create checkpoints and
+RLlib uses the :py:class:`~ray.train.Checkpoint` class to create checkpoints and
 restore objects from them.
 
 The main file in a checkpoint directory, containing the state information, is currently
@@ -50,7 +50,7 @@ How do I create an Algorithm checkpoint?
 ----------------------------------------
 
 The :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` ``save()`` method creates a new checkpoint
-(directory with files in it) and returns the path to that directory.
+(directory with files in it).
 
 Let's take a look at a simple example on how to create such an
 Algorithm checkpoint:
@@ -69,8 +69,6 @@ like this:
     $ ls -la
       .
       ..
-      .is_checkpoint
-      .tune_metadata
       policies/
       algorithm_state.pkl
       rllib_checkpoint.json

diff --git a/doc/source/train/api/api.rst b/doc/source/train/api/api.rst
@@ -4,33 +4,24 @@
 Ray Train API
 =============
 
-This page covers framework specific integrations with Ray Train and Ray Train Developer APIs.
-
 .. _train-integration-api:
 .. _train-framework-specific-ckpts:
 
 .. currentmodule:: ray
 
-Ray Train Integrations
-----------------------
-
-.. _train-pytorch-integration:
-
 PyTorch Ecosystem
-~~~~~~~~~~~~~~~~~
-
-Scale out your PyTorch, Lightning, Hugging Face code with Ray TorchTrainer.
+-----------------
 
 .. autosummary::
     :toctree: doc/
 
     ~train.torch.TorchTrainer
     ~train.torch.TorchConfig
-    ~train.torch.TorchCheckpoint
 
+.. _train-pytorch-integration:
 
 PyTorch
-*******
+~~~~~~~
 
 .. autosummary::
     :toctree: doc/
@@ -43,7 +34,7 @@ PyTorch
 .. _train-lightning-integration:
 
 PyTorch Lightning
-*****************
+~~~~~~~~~~~~~~~~~
 
 .. autosummary::
     :toctree: doc/
@@ -55,49 +46,20 @@ PyTorch Lightning
     ~train.lightning.RayDeepSpeedStrategy
     ~train.lightning.RayTrainReportCallback
 
-.. note::
-
-    We will deprecate `LightningTrainer`, `LightningConfigBuilder`,
-    `LightningCheckpoint`, and `LightningPredictor` in Ray 2.8. Please 
-    refer to the :ref:`migration guide <lightning-trainer-migration-guide>` for more info.
-
-.. autosummary::
-    :toctree: doc/
-
-    ~train.lightning.LightningTrainer
-    ~train.lightning.LightningConfigBuilder
-    ~train.lightning.LightningCheckpoint
-    ~train.lightning.LightningPredictor
-
 .. _train-transformers-integration:
 
 Hugging Face Transformers
-*************************
+~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autosummary::
     :toctree: doc/
 
     ~train.huggingface.transformers.prepare_trainer
     ~train.huggingface.transformers.RayTrainReportCallback
 
-.. note::
-
-    We will deprecate `TransformersTrainer`, `TransformersCheckpoint` in Ray 2.8. Please 
-    refer to the :ref:`migration guide <transformers-trainer-migration-guide>` for more info.
-
-.. autosummary::
-    :toctree: doc/
-
-    ~train.huggingface.TransformersTrainer
-    ~train.huggingface.TransformersCheckpoint
 
-Hugging Face Accelerate
-***********************
-
-.. autosummary::
-    :toctree: doc/
-
-    ~train.huggingface.AccelerateTrainer
+More Frameworks
+---------------
 
 Tensorflow/Keras
 ~~~~~~~~~~~~~~~~
@@ -107,21 +69,8 @@ Tensorflow/Keras
 
     ~train.tensorflow.TensorflowTrainer
     ~train.tensorflow.TensorflowConfig
-    ~train.tensorflow.TensorflowCheckpoint
-
-
-Tensorflow/Keras Training Loop Utilities
-****************************************
-
-.. autosummary::
-    :toctree: doc/
-
     ~train.tensorflow.prepare_dataset_shard
-
-.. autosummary::
-
-    ~air.integrations.keras.ReportCheckpointCallback
-
+    ~train.tensorflow.keras.ReportCheckpointCallback
 
 Horovod
 ~~~~~~~
@@ -140,7 +89,6 @@ XGBoost
     :toctree: doc/
 
     ~train.xgboost.XGBoostTrainer
-    ~train.xgboost.XGBoostCheckpoint
 
 
 LightGBM
@@ -150,32 +98,42 @@ LightGBM
     :toctree: doc/
 
     ~train.lightgbm.LightGBMTrainer
-    ~train.lightgbm.LightGBMCheckpoint
 
 
 .. _ray-train-configs-api:
 
-Ray Train Config
-----------------
+Ray Train Configuration
+-----------------------
 
 .. autosummary::
     :toctree: doc/
 
-    ~train.ScalingConfig
-    ~train.RunConfig
     ~train.CheckpointConfig
-    ~train.FailureConfig
     ~train.DataConfig
+    ~train.FailureConfig
+    ~train.RunConfig
+    ~train.ScalingConfig
+    ~train.SyncConfig
 
 .. _train-loop-api:
 
-Ray Train Loop
---------------
+Ray Train Utilities
+-------------------
+
+**Classes**
 
 .. autosummary::
     :toctree: doc/
 
+    ~train.Checkpoint
     ~train.context.TrainContext
+
+**Functions**
+
+.. autosummary::
+    :toctree: doc/
+
+    ~train.get_checkpoint
     ~train.get_context
     ~train.get_dataset_shard
     ~train.report
@@ -190,14 +148,9 @@ Ray Train Output
 
     ~train.Result
 
-.. autosummary::
-    :toctree: doc/
-
-    ~train.Checkpoint
-
 
-Ray Train Base Classes (Developer APIs)
----------------------------------------
+Ray Train Developer APIs
+------------------------
 
 .. _train-base-trainer:
 

diff --git a/doc/source/train/distributed-tensorflow-keras.rst b/doc/source/train/distributed-tensorflow-keras.rst
@@ -191,11 +191,11 @@ to Ray Train. This reporting logs the results to the console output and appends
 local log files. The logging also triggers :ref:`checkpoint bookkeeping <train-dl-configure-checkpoints>`.
 
 The easiest way to report your results with Keras is by using the
-:class:`~air.integrations.keras.ReportCheckpointCallback`:
+:class:`~ray.train.tensorflow.keras.ReportCheckpointCallback`:
 
 .. code-block:: python
 
-    from ray.air.integrations.keras import ReportCheckpointCallback
+    from ray.train.tensorflow.keras import ReportCheckpointCallback
 
     def train_func(config: dict):
         # ...

diff --git a/doc/source/train/doc_code/key_concepts.py b/doc/source/train/doc_code/key_concepts.py
@@ -100,21 +100,6 @@ def train_fn(config):
 # __session_checkpoint_end__
 
 
-# __scaling_config_start__
-from ray.train import ScalingConfig
-
-scaling_config = ScalingConfig(
-    # Number of distributed workers.
-    num_workers=2,
-    # Turn on/off GPU.
-    use_gpu=True,
-    # Specify resources used for trainer.
-    trainer_resources={"CPU": 1},
-    # Try to schedule workers on different nodes.
-    placement_strategy="SPREAD",
-)
-# __scaling_config_end__
-
 # __run_config_start__
 from ray.train import RunConfig
 from ray.air.integrations.wandb import WandbLoggerCallback

diff --git a/doc/source/train/doc_code/tuner.py b/doc/source/train/doc_code/tuner.py
@@ -118,30 +118,24 @@
 # __torch_end__
 
 
-# __tune_preprocess_start__
+# __tune_dataset_start__
 from ray.data.preprocessors import StandardScaler
-from ray.tune import Tuner
-
-prep_v1 = StandardScaler(["worst radius", "worst area"])
-prep_v2 = StandardScaler(["worst concavity", "worst smoothness"])
-tuner = Tuner(
-    trainer,
-    param_space={
-        "preprocessor": tune.grid_search([prep_v1, prep_v2]),
-        # Your other parameters go here
-    },
-)
-# __tune_preprocess_end__
 
 
-# __tune_dataset_start__
 def get_dataset():
-    return ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
+    ds1 = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
+    prep_v1 = StandardScaler(["worst radius", "worst area"])
+    ds1 = prep_v1.fit_transform(ds1)
+    return ds1
 
 
 def get_another_dataset():
-    # imagine this is a different dataset
-    return ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
+    ds2 = ray.data.read_csv(
+        "s3://anonymous@air-example-data/breast_cancer_with_categorical.csv"
+    )
+    prep_v2 = StandardScaler(["worst concavity", "worst smoothness"])
+    ds2 = prep_v2.fit_transform(ds2)
+    return ds2
 
 
 dataset_1 = get_dataset()
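
Note on the change above: preprocessing moves out of the Tuner's `param_space` and into the dataset functions, where each dataset is fitted and transformed before tuning. To illustrate what "fit, then transform" means for standard scaling, here is a plain-Python sketch that does not use Ray. It assumes the population standard deviation and maps zero-variance columns to 0.0; both choices are assumptions and may differ from `ray.data.preprocessors.StandardScaler`.

```python
import statistics


def fit(rows, columns):
    """Compute (mean, std) for each requested column."""
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        # Assumption: population standard deviation.
        stats[col] = (statistics.fmean(values), statistics.pstdev(values))
    return stats


def transform(rows, stats):
    """Return new rows with the fitted columns standardized."""
    out = []
    for row in rows:
        scaled = dict(row)  # Copy so the input rows stay unchanged.
        for col, (mean, std) in stats.items():
            # Assumption: constant columns are mapped to 0.0.
            scaled[col] = (row[col] - mean) / std if std else 0.0
        out.append(scaled)
    return out


rows = [
    {"worst radius": 10.0, "worst area": 100.0, "target": 0},
    {"worst radius": 20.0, "worst area": 300.0, "target": 1},
]
stats = fit(rows, ["worst radius", "worst area"])
scaled_rows = transform(rows, stats)
print(scaled_rows)
```

Columns that were not fitted, such as `target` here, pass through unchanged, which is why each dataset function can return a fully prepared dataset for the Tuner to choose between.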

diff --git a/doc/source/train/examples/lightning/lightning_mnist_example.ipynb b/doc/source/train/examples/lightning/lightning_mnist_example.ipynb
@@ -51,7 +51,7 @@
             "source": [
                 "## Prepare Dataset and Module\n",
                 "\n",
-                "The Pytorch Lightning Trainer takes either `torch.utils.data.DataLoader` or `pl.LightningDataModule` as data inputs. You can keep using them without any changes for the Ray AIR LightningTrainer. "
+                "The Pytorch Lightning Trainer takes either `torch.utils.data.DataLoader` or `pl.LightningDataModule` as data inputs. You can keep using them without any changes with Ray Train. "
             ]
         },
         {