[train][2.7][4/n] cherry-picks for documentations, tests, examples (#39468)

* [train] Fix issues in migration of tune_cifar_torch_pbt_example (#39158)

Resolves three issues that come up when migrating the `tune_cifar_torch_pbt_example` from Ray 2.6 to Ray 2.7:

1. PBT uses the `_schedule_trial_save` interface, which triggered a warning message. The attribute is now whitelisted, so the warning no longer comes up.
2. PBT malfunctions in Ray 2.7, so instead of failing silently, we raise an error and ask users to migrate.
3. When users call old `ray.air.Checkpoint` APIs on a `ray.train.Checkpoint`, we raise an actionable error message.
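
As a rough illustration of item 3 (a sketch, not the code from this PR), a class can intercept lookups of removed method names and raise an error that names the replacement. `NewStyleCheckpoint` and `_LEGACY_HINTS` are hypothetical stand-ins, and the hint texts are assumptions about what a useful migration message could say:

```python
class NewStyleCheckpoint:
    """Hypothetical stand-in for a directory-based checkpoint class."""

    # Illustrative mapping from removed instance methods to migration hints.
    _LEGACY_HINTS = {
        "to_dict": "write your state to files and use `to_directory()`",
        "to_bytes": "write your state to files and use `to_directory()`",
    }

    def __init__(self, path: str):
        self.path = path

    def __getattr__(self, name: str):
        # __getattr__ only runs when normal attribute lookup fails,
        # so existing attributes such as `path` are unaffected.
        hint = self._LEGACY_HINTS.get(name)
        if hint is not None:
            raise AttributeError(
                f"`{name}` is a legacy checkpoint API and is not supported "
                f"by {type(self).__name__}; instead, {hint}."
            )
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {name!r}"
        )


ckpt = NewStyleCheckpoint("/tmp/ckpt")
try:
    ckpt.to_dict()
except AttributeError as err:
    print(err)
```

Raising `AttributeError` (rather than a custom exception type) keeps `hasattr()` and `getattr(obj, name, default)` working as usual for names that do not exist.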

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [tune] Make Trainable.save/restore developer APIs (#39391)

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [Telemetry] Add Telemetry for Ray Train Utilities (#39363)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [train] update Train API references & annotations (#39294)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [2.7] Cleanup all LightningTrainer Mentions in Ray Doc (#39406)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [train] remove _max_cpu_fraction_per_node (#39412)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [train] Legacy interface cleanup (`air.Checkpoint`, `LegacyExperimentAnalysis`) (#39289)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>

* [Train][Telemetry] Limit the usage of `ray.train.torch.get_device`. (#39432)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [train-ci] Fix Train examples with authentication buildkite commands. (#39387)

* [train-ci] fix Train examples with authentication buildkite commands.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* [train][doc] Remove preprocessor reference in tune+train user guide (#39442)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [train/docs] Extend resource guide (training backend + choosing resources) (#39202)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* fix docs

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [Minor] Remove remaining LightningTrainer Mentions (#39441)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

---------

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
5 people committed Sep 8, 2023
1 parent 0f7e733 commit 4b6e6b2
Showing 78 changed files with 1,667 additions and 948 deletions.
24 changes: 12 additions & 12 deletions .buildkite/pipeline.ml.yml
@@ -54,18 +54,18 @@
- ./ci/env/env_info.sh
- bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=tune,-gpu_only,-ray_air,-gpu,-doctest,-needs_credentials python/ray/train/...

-# - label: ":train: :key: Train examples with authentication"
-#   conditions: ["NO_WHEELS_REQUIRED", "RAY_CI_TRAIN_AFFECTED", "RAY_CI_BRANCH_BUILD"]
-#   instance_size: medium
-#   commands:
-#     - if [ "$BUILDKITE_PULL_REQUEST" != "false" ]; then exit 0; fi
-#     - cleanup() { if [ "${BUILDKITE_PULL_REQUEST}" = "false" ]; then ./ci/build/upload_build_info.sh; fi }; trap cleanup EXIT
-#     - TRAIN_TESTING=1 ./ci/env/install-dependencies.sh
-#     - ./ci/env/env_info.sh
-#     - credentials=$(python ./ci/env/setup_credentials.py)
-#     - credential_exit_code=$$?
-#     - if [ $credential_exit_code -eq 0 ]; then echo "Credentials are fetched successfully"; bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=needs_credentials
-#       python/ray/train/... $credentials; else echo "Credentials cannot be fetched"; exit 1; fi
+- label: ":train: :key: Train examples with authentication"
+  conditions: ["NO_WHEELS_REQUIRED", "RAY_CI_TRAIN_AFFECTED", "RAY_CI_BRANCH_BUILD"]
+  instance_size: medium
+  commands:
+    - if [[ "$BUILDKITE_PIPELINE_ID" != "0183465b-c6fb-479b-8577-4cfd743b545d" ]]; then exit 0; fi
+    - trap ./ci/build/upload_build_info.sh EXIT
+    - TRAIN_TESTING=1 ./ci/env/install-dependencies.sh
+    - ./ci/env/env_info.sh
+    - $(python ci/env/setup_credentials.py)
+    - bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=needs_credentials
+      --test_env=WANDB_API_KEY --test_env=COMET_API_KEY
+      python/ray/train/...

- label: ":brain: RLlib: Benchmarks (Torch 2.x)"
conditions: ["NO_WHEELS_REQUIRED", "RAY_CI_RLLIB_AFFECTED"]
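
The re-enabled authentication step above appears to rely on two things: the output of `setup_credentials.py` is turned into exported shell variables, and `--test_env=NAME` without a value tells Bazel to read the value from the environment it runs in, so the secrets never appear on the command line. A minimal Python sketch of that consumer side follows; `parse_exports` and `bazel_test_env_flags` are hypothetical helper names, not part of the Ray CI scripts:

```python
import shlex


def parse_exports(script: str) -> dict:
    """Parse lines of the form `export NAME=value` into a dict."""
    env = {}
    for line in script.splitlines():
        tokens = shlex.split(line)  # applies shell-style quoting rules
        if len(tokens) == 2 and tokens[0] == "export" and "=" in tokens[1]:
            name, value = tokens[1].split("=", 1)
            env[name] = value
    return env


def bazel_test_env_flags(names) -> list:
    """Build name-only flags; the values stay in the environment."""
    return [f"--test_env={name}" for name in names]


# Placeholder values taken from the script's docstring, not real keys.
env = parse_exports("export WANDB_API_KEY=abcd\nexport COMET_API_KEY=efgh\n")
print(bazel_test_env_flags(env))
```

Passing only the variable names keeps the key values out of the build log, which echoes the command line.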
21 changes: 8 additions & 13 deletions ci/env/setup_credentials.py
@@ -1,9 +1,11 @@
"""
This script sets up credentials for some services in the
CI environment.
-This prints out credentials in the following format, to be ingested
-by as bazel test envs.
---test_env=WANDB_API_KEY=abcd --test_env=COMET_API_KEY=efgh
+This generates a bash script in the following format, which will
+then be sourced to run bazel test with.
+export WANDB_API_KEY=abcd
+export COMET_API_KEY=efgh
"""
import boto3
import json
@@ -34,16 +36,9 @@ def main():
     except Exception as e:
         print(f"Could not get Ray AIR secrets: {e}")
         sys.exit(1)
-        return
-
-    print(
-        " ".join(
-            [
-                f"--test_env={SERVICES[key]}={ray_air_secrets[key]}"
-                for key in SERVICES.keys()
-            ]
-        )
-    )
+
+    for key in SERVICES.keys():
+        print(f"export {SERVICES[key]}={ray_air_secrets[key]}")


if __name__ == "__main__":
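
The new output format of this script can be sketched in isolation. The version below is an illustration under assumptions, not the script itself: `format_exports` is a hypothetical helper, the mapping keys are made up, and the `shlex.quote` call is an addition that the diff does not contain.

```python
import shlex


def format_exports(services: dict, secrets: dict) -> str:
    """Render one `export NAME=value` line per configured service."""
    lines = []
    for secret_key, env_name in services.items():
        # shlex.quote leaves simple values unchanged and single-quotes
        # values that contain spaces or shell metacharacters.
        lines.append(f"export {env_name}={shlex.quote(secrets[secret_key])}")
    return "\n".join(lines)


# Hypothetical mapping and placeholder values, for illustration only.
services = {"wandb_key": "WANDB_API_KEY", "comet_key": "COMET_API_KEY"}
secrets = {"wandb_key": "abcd", "comet_key": "efgh"}
print(format_exports(services, secrets))
```

Quoting only helps when the shell re-parses the output, for example through `eval` or `source`. Plain command substitution, as in `$(python ci/env/setup_credentials.py)`, word-splits the output without removing quotes, so that form is only safe for values without whitespace or quote characters.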
2 changes: 1 addition & 1 deletion doc/source/ray-overview/examples.rst
@@ -1373,7 +1373,7 @@ Ray Examples
:link: /train/examples/lightning/vicuna_13b_lightning_deepspeed_finetune
:link-type: doc

-Fine-tune vicuna-13b-v1.3 with DeepSpeed and LightningTrainer
+Fine-tune vicuna-13b-v1.3 with DeepSpeed, PyTorch Lightning and Ray Train

.. grid-item-card:: :bdg-secondary:`Code example`
:class-item: gallery-item training llm pytorch nlp
@@ -8,7 +8,7 @@ Saving and Loading your RL Algorithms and Policies
##################################################


-You can use :py:class:`~ray.air.checkpoint.Checkpoint` objects to store
+You can use :py:class:`~ray.train.Checkpoint` objects to store
and load the current state of your :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`
or :py:class:`~ray.rllib.policy.policy.Policy` and the neural networks (weights)
within these structures. In the following, we will cover how you can create these
@@ -26,7 +26,7 @@ or a single :py:class:`~ray.rllib.policy.policy.Policy` instance.
The Algorithm- or Policy instances that were used to create the checkpoint in the first place
may or may not have been trained prior to this.

-RLlib uses the :py:class:`~ray.air.checkpoint.Checkpoint` class to create checkpoints and
+RLlib uses the :py:class:`~ray.train.Checkpoint` class to create checkpoints and
restore objects from them.

The main file in a checkpoint directory, containing the state information, is currently
@@ -50,7 +50,7 @@ How do I create an Algorithm checkpoint?
----------------------------------------

The :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` ``save()`` method creates a new checkpoint
-(directory with files in it) and returns the path to that directory.
+(directory with files in it).

Let's take a look at a simple example on how to create such an
Algorithm checkpoint:
@@ -69,8 +69,6 @@ like this:
$ ls -la
.
..
-.is_checkpoint
-.tune_metadata
policies/
algorithm_state.pkl
rllib_checkpoint.json
103 changes: 28 additions & 75 deletions doc/source/train/api/api.rst
@@ -4,33 +4,24 @@
Ray Train API
=============

This page covers framework specific integrations with Ray Train and Ray Train Developer APIs.

.. _train-integration-api:
.. _train-framework-specific-ckpts:

.. currentmodule:: ray

Ray Train Integrations
----------------------

.. _train-pytorch-integration:

PyTorch Ecosystem
~~~~~~~~~~~~~~~~~

Scale out your PyTorch, Lightning, Hugging Face code with Ray TorchTrainer.
-----------------

.. autosummary::
:toctree: doc/

~train.torch.TorchTrainer
~train.torch.TorchConfig
~train.torch.TorchCheckpoint

.. _train-pytorch-integration:

PyTorch
*******
~~~~~~~

.. autosummary::
:toctree: doc/
@@ -43,7 +34,7 @@ PyTorch
.. _train-lightning-integration:

PyTorch Lightning
*****************
~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: doc/
Expand All @@ -55,49 +46,20 @@ PyTorch Lightning
~train.lightning.RayDeepSpeedStrategy
~train.lightning.RayTrainReportCallback

.. note::

We will deprecate `LightningTrainer`, `LightningConfigBuilder`,
`LightningCheckpoint`, and `LightningPredictor` in Ray 2.8. Please
refer to the :ref:`migration guide <lightning-trainer-migration-guide>` for more info.

.. autosummary::
:toctree: doc/

~train.lightning.LightningTrainer
~train.lightning.LightningConfigBuilder
~train.lightning.LightningCheckpoint
~train.lightning.LightningPredictor

.. _train-transformers-integration:

Hugging Face Transformers
*************************
~~~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: doc/

~train.huggingface.transformers.prepare_trainer
~train.huggingface.transformers.RayTrainReportCallback

.. note::

We will deprecate `TransformersTrainer`, `TransformersCheckpoint` in Ray 2.8. Please
refer to the :ref:`migration guide <transformers-trainer-migration-guide>` for more info.

.. autosummary::
:toctree: doc/

~train.huggingface.TransformersTrainer
~train.huggingface.TransformersCheckpoint

Hugging Face Accelerate
***********************

.. autosummary::
:toctree: doc/

~train.huggingface.AccelerateTrainer
More Frameworks
---------------

Tensorflow/Keras
~~~~~~~~~~~~~~~~
@@ -107,21 +69,8 @@ Tensorflow/Keras

~train.tensorflow.TensorflowTrainer
~train.tensorflow.TensorflowConfig
~train.tensorflow.TensorflowCheckpoint


Tensorflow/Keras Training Loop Utilities
****************************************

.. autosummary::
:toctree: doc/

~train.tensorflow.prepare_dataset_shard

.. autosummary::

~air.integrations.keras.ReportCheckpointCallback

~train.tensorflow.keras.ReportCheckpointCallback

Horovod
~~~~~~~
@@ -140,7 +89,6 @@ XGBoost
:toctree: doc/

~train.xgboost.XGBoostTrainer
~train.xgboost.XGBoostCheckpoint


LightGBM
@@ -150,32 +98,42 @@ LightGBM
:toctree: doc/

~train.lightgbm.LightGBMTrainer
~train.lightgbm.LightGBMCheckpoint


.. _ray-train-configs-api:

Ray Train Config
----------------
Ray Train Configuration
-----------------------

.. autosummary::
:toctree: doc/

~train.ScalingConfig
~train.RunConfig
~train.CheckpointConfig
~train.FailureConfig
~train.DataConfig
~train.FailureConfig
~train.RunConfig
~train.ScalingConfig
~train.SyncConfig

.. _train-loop-api:

Ray Train Loop
--------------
Ray Train Utilities
-------------------

**Classes**

.. autosummary::
:toctree: doc/

~train.Checkpoint
~train.context.TrainContext

**Functions**

.. autosummary::
:toctree: doc/

~train.get_checkpoint
~train.get_context
~train.get_dataset_shard
~train.report
@@ -190,14 +148,9 @@ Ray Train Output

~train.Result

.. autosummary::
:toctree: doc/

~train.Checkpoint


Ray Train Base Classes (Developer APIs)
---------------------------------------
Ray Train Developer APIs
------------------------

.. _train-base-trainer:

4 changes: 2 additions & 2 deletions doc/source/train/distributed-tensorflow-keras.rst
@@ -191,11 +191,11 @@ to Ray Train. This reporting logs the results to the console output and appends
local log files. The logging also triggers :ref:`checkpoint bookkeeping <train-dl-configure-checkpoints>`.

The easiest way to report your results with Keras is by using the
-:class:`~air.integrations.keras.ReportCheckpointCallback`:
+:class:`~ray.train.tensorflow.keras.ReportCheckpointCallback`:

.. code-block:: python
-    from ray.air.integrations.keras import ReportCheckpointCallback
+    from ray.train.tensorflow.keras import ReportCheckpointCallback
def train_func(config: dict):
# ...
15 changes: 0 additions & 15 deletions doc/source/train/doc_code/key_concepts.py
@@ -100,21 +100,6 @@ def train_fn(config):
# __session_checkpoint_end__


-# __scaling_config_start__
-from ray.train import ScalingConfig
-
-scaling_config = ScalingConfig(
-    # Number of distributed workers.
-    num_workers=2,
-    # Turn on/off GPU.
-    use_gpu=True,
-    # Specify resources used for trainer.
-    trainer_resources={"CPU": 1},
-    # Try to schedule workers on different nodes.
-    placement_strategy="SPREAD",
-)
-# __scaling_config_end__
-
# __run_config_start__
from ray.train import RunConfig
from ray.air.integrations.wandb import WandbLoggerCallback
28 changes: 11 additions & 17 deletions doc/source/train/doc_code/tuner.py
@@ -118,30 +118,24 @@
# __torch_end__


-# __tune_preprocess_start__
+# __tune_dataset_start__
 from ray.data.preprocessors import StandardScaler
-from ray.tune import Tuner
-
-prep_v1 = StandardScaler(["worst radius", "worst area"])
-prep_v2 = StandardScaler(["worst concavity", "worst smoothness"])
-tuner = Tuner(
-    trainer,
-    param_space={
-        "preprocessor": tune.grid_search([prep_v1, prep_v2]),
-        # Your other parameters go here
-    },
-)
-# __tune_preprocess_end__


-# __tune_dataset_start__
 def get_dataset():
-    return ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
+    ds1 = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
+    prep_v1 = StandardScaler(["worst radius", "worst area"])
+    ds1 = prep_v1.fit_transform(ds1)
+    return ds1


 def get_another_dataset():
-    # imagine this is a different dataset
-    return ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
+    ds2 = ray.data.read_csv(
+        "s3://anonymous@air-example-data/breast_cancer_with_categorical.csv"
+    )
+    prep_v2 = StandardScaler(["worst concavity", "worst smoothness"])
+    ds2 = prep_v2.fit_transform(ds2)
+    return ds2


dataset_1 = get_dataset()
Expand Down
@@ -51,7 +51,7 @@
"source": [
"## Prepare Dataset and Module\n",
"\n",
-"The Pytorch Lightning Trainer takes either `torch.utils.data.DataLoader` or `pl.LightningDataModule` as data inputs. You can keep using them without any changes for the Ray AIR LightningTrainer. "
+"The Pytorch Lightning Trainer takes either `torch.utils.data.DataLoader` or `pl.LightningDataModule` as data inputs. You can keep using them without any changes with Ray Train. "
]
},
{