[train][2.7][5/n] cherry-picks for documentations, tests, examples #39515

Merged · 7 commits · Sep 10, 2023
30 changes: 15 additions & 15 deletions doc/source/ray-overview/use-cases.rst
@@ -98,9 +98,9 @@ Learn more about model serving with the following resources.

- `[Talk] Productionizing ML at Scale with Ray Serve <https://www.youtube.com/watch?v=UtH-CMpmxvI>`_
- `[Blog] Simplify your MLOps with Ray & Ray Serve <https://www.anyscale.com/blog/simplify-your-mlops-with-ray-and-ray-serve>`_
- `[Guide] Getting Started with Ray Serve </serve/getting_started>`_
- `[Guide] Model Composition in Serve </serve/model_composition>`_
- `[Gallery] Serve Examples Gallery </serve/tutorials/index>`_
- :doc:`[Guide] Getting Started with Ray Serve </serve/getting_started>`
- :doc:`[Guide] Model Composition in Serve </serve/model_composition>`
- :doc:`[Gallery] Serve Examples Gallery </serve/tutorials/index>`
- `[Gallery] More Serve Use Cases on the Blog <https://www.anyscale.com/blog?tag=ray_serve>`_

Hyperparameter Tuning
@@ -116,11 +116,11 @@ Running multiple hyperparameter tuning experiments is a pattern apt for distribu

Learn more about the Tune library with the following talks and user guides.

- `[Guide] Getting Started with Ray Tune </tune/getting-started>`_
- :doc:`[Guide] Getting Started with Ray Tune </tune/getting-started>`
- `[Blog] How to distribute hyperparameter tuning with Ray Tune <https://www.anyscale.com/blog/how-to-distribute-hyperparameter-tuning-using-ray-tune>`_
- `[Talk] Simple Distributed Hyperparameter Optimization <https://www.youtube.com/watch?v=KgYZtlbFYXE>`_
- `[Blog] Hyperparameter Search with 🤗 Transformers <https://www.anyscale.com/blog/hyperparameter-search-hugging-face-transformers-ray-tune>`_
- `[Gallery] Ray Tune Examples Gallery </tune/examples/index>`_
- :doc:`[Gallery] Ray Tune Examples Gallery </tune/examples/index>`
- `More Tune use cases on the Blog <https://www.anyscale.com/blog?tag=ray-tune>`_

Distributed Training
@@ -139,9 +139,9 @@ Learn more about the Train library with the following talks and user guides.

- `[Talk] Ray Train, PyTorch, TorchX, and distributed deep learning <https://www.youtube.com/watch?v=e-A93QftCfc>`_
- `[Blog] Elastic Distributed Training with XGBoost on Ray <https://www.uber.com/blog/elastic-xgboost-ray/>`_
- `[Guide] Getting Started with Ray Train </train/train>`_
- `[Example] Fine-tune a 🤗 Transformers model </train/examples/transformers/huggingface_text_classification>`_
- `[Gallery] Ray Train Examples Gallery </train/examples>`_
- :doc:`[Guide] Getting Started with Ray Train </train/train>`
- :doc:`[Example] Fine-tune a 🤗 Transformers model </train/examples/transformers/huggingface_text_classification>`
- :doc:`[Gallery] Ray Train Examples Gallery </train/examples>`
- `[Gallery] More Train Use Cases on the Blog <https://www.anyscale.com/blog?tag=ray_train>`_

Reinforcement Learning
@@ -157,9 +157,9 @@ Learn more about reinforcement learning with the following resources.

- `[Course] Applied Reinforcement Learning with RLlib <https://applied-rl-course.netlify.app/>`_
- `[Blog] Intro to RLlib: Example Environments <https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70>`_
- `[Guide] Getting Started with RLlib </rllib/rllib-training>`_
- :doc:`[Guide] Getting Started with RLlib </rllib/rllib-training>`
- `[Talk] Deep reinforcement learning at Riot Games <https://www.anyscale.com/events/2022/03/29/deep-reinforcement-learning-at-riot-games>`_
- `[Gallery] RLlib Examples Gallery </rllib/rllib-examples>`_
- :doc:`[Gallery] RLlib Examples Gallery </rllib/rllib-examples>`
- `[Gallery] More RL Use Cases on the Blog <https://www.anyscale.com/blog?tag=rllib>`_

ML Platform
@@ -181,10 +181,10 @@ End-to-End ML Workflows

The following highlights examples utilizing Ray AI libraries to implement end-to-end ML workflows.

- `[Example] Text classification with Ray </train/examples/transformers/huggingface_text_classification>`_
- `[Example] Object detection with Ray </train/examples/pytorch/torch_detection>`_
- `[Example] Machine learning on tabular data </train/examples/xgboost/xgboost_example>`_
- `[Example] AutoML for Time Series with Ray </ray-core/examples/automl_for_time_series>`_
- :doc:`[Example] Text classification with Ray </train/examples/transformers/huggingface_text_classification>`
- :doc:`[Example] Object detection with Ray </train/examples/pytorch/torch_detection>`
- :doc:`[Example] Machine learning on tabular data </train/examples/xgboost/xgboost_example>`
- :doc:`[Example] AutoML for Time Series with Ray </ray-core/examples/automl_for_time_series>`

Large Scale Workload Orchestration
----------------------------------
@@ -194,4 +194,4 @@ The following highlights feature projects leveraging Ray Core's distributed APIs
- `[Blog] Highly Available and Scalable Online Applications on Ray at Ant Group <https://www.anyscale.com/blog/building-highly-available-and-scalable-online-applications-on-ray-at-ant>`_
- `[Blog] Ray Forward 2022 Conference: Hyper-scale Ray Application Use Cases <https://www.anyscale.com/blog/ray-forward-2022>`_
- `[Blog] A new world record on the CloudSort benchmark using Ray <https://www.anyscale.com/blog/ray-breaks-the-usd1-tb-barrier-as-the-worlds-most-cost-efficient-sorting>`_
- `[Example] Speed up your web crawler by parallelizing it with Ray </ray-core/examples/web-crawler>`_
- :doc:`[Example] Speed up your web crawler by parallelizing it with Ray </ray-core/examples/web-crawler>`
75 changes: 39 additions & 36 deletions doc/source/train/distributed-tensorflow-keras.rst
@@ -230,7 +230,9 @@ appropriately in distributed training.


.. code-block:: python
:emphasize-lines: 23

import json  # needed for json.dump when writing checkpoint metadata below
import os
import tempfile

from ray import train
from ray.train import Checkpoint, ScalingConfig
@@ -254,24 +256,24 @@ appropriately in distributed training.
model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["mse"])

for epoch in range(config["num_epochs"]):
model.fit(X, Y, batch_size=20)
checkpoint = Checkpoint.from_dict(
dict(epoch=epoch, model_weights=model.get_weights())
)
train.report({}, checkpoint=checkpoint)
history = model.fit(X, Y, batch_size=20)

with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
model.save(os.path.join(temp_checkpoint_dir, "model.keras"))
checkpoint_json = os.path.join(temp_checkpoint_dir, "checkpoint.json")
with open(checkpoint_json, "w") as f:
json.dump({"epoch": epoch}, f)
checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

train.report({"loss": history.history["loss"][0]}, checkpoint=checkpoint)

trainer = TensorflowTrainer(
train_func,
train_loop_config={"num_epochs": 5},
scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()

print(result.checkpoint.to_dict())
# {'epoch': 4, 'model_weights': [array([[-0.31858477],
# [ 0.03747174],
# [ 0.28266194],
# [ 0.8626015 ]], dtype=float32), array([0.02230084], dtype=float32)], '_timestamp': 1656107383, '_preprocessor': None, '_current_checkpoint_id': 4}
print(result.checkpoint)

By default, checkpoints will be persisted to local disk in the :ref:`log
directory <train-log-dir>` of each run.
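
For context on what the new directory-based checkpoint contains, here is a minimal sketch of how it could be restored after training — assuming ``result`` is the ``Result`` returned by ``trainer.fit()`` in the example above, and using the ``model.keras`` and ``checkpoint.json`` files that the training loop writes:

.. code-block:: python

    import json
    import os

    import tensorflow as tf

    # Open the reported checkpoint as a local directory and restore its contents.
    with result.checkpoint.as_directory() as checkpoint_dir:
        model = tf.keras.models.load_model(
            os.path.join(checkpoint_dir, "model.keras")
        )
        with open(os.path.join(checkpoint_dir, "checkpoint.json")) as f:
            metadata = json.load(f)  # e.g. {"epoch": 4}
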
@@ -280,7 +282,9 @@ Loading checkpoints
~~~~~~~~~~~~~~~~~~~

.. code-block:: python
:emphasize-lines: 15, 21, 22, 25, 26, 27, 30

import json  # needed for json.dump when writing checkpoint metadata below
import os
import tempfile

from ray import train
from ray.train import Checkpoint, ScalingConfig
@@ -297,37 +301,42 @@ Loading checkpoints
X = np.random.normal(0, 1, size=(n, 4))
Y = np.random.uniform(0, 1, size=(n, 1))

start_epoch = 0
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
# toy neural network : 1-layer
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="linear", input_shape=(4,))])
checkpoint = train.get_checkpoint()
if checkpoint:
# assume that we have run the train.report() example
# and successfully save some model weights
checkpoint_dict = checkpoint.to_dict()
model.set_weights(checkpoint_dict.get("model_weights"))
start_epoch = checkpoint_dict.get("epoch", -1) + 1
with checkpoint.as_directory() as checkpoint_dir:
model = tf.keras.models.load_model(
os.path.join(checkpoint_dir, "model.keras")
)
else:
model = tf.keras.Sequential(
[tf.keras.layers.Dense(1, activation="linear", input_shape=(4,))]
)
model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["mse"])

for epoch in range(start_epoch, config["num_epochs"]):
model.fit(X, Y, batch_size=20)
checkpoint = Checkpoint.from_dict(
dict(epoch=epoch, model_weights=model.get_weights())
)
train.report({}, checkpoint=checkpoint)
for epoch in range(config["num_epochs"]):
history = model.fit(X, Y, batch_size=20)

with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
model.save(os.path.join(temp_checkpoint_dir, "model.keras"))
extra_json = os.path.join(temp_checkpoint_dir, "checkpoint.json")
with open(extra_json, "w") as f:
json.dump({"epoch": epoch}, f)
checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

train.report({"loss": history.history["loss"][0]}, checkpoint=checkpoint)

trainer = TensorflowTrainer(
train_func,
train_loop_config={"num_epochs": 2},
train_loop_config={"num_epochs": 5},
scaling_config=ScalingConfig(num_workers=2),
)
# save a checkpoint
result = trainer.fit()
print(result.checkpoint)

# load a checkpoint
# Start a new run from a loaded checkpoint
trainer = TensorflowTrainer(
train_func,
train_loop_config={"num_epochs": 5},
@@ -336,12 +345,6 @@ Loading checkpoints
)
result = trainer.fit()

print(result.checkpoint.to_dict())
# {'epoch': 4, 'model_weights': [array([[-0.70056134],
# [-0.8839263 ],
# [-1.0043601 ],
# [-0.61634773]], dtype=float32), array([0.01889327], dtype=float32)], '_timestamp': 1656108446, '_preprocessor': None, '_current_checkpoint_id': 3}
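
Note that the rewritten loading example no longer restores a ``start_epoch`` counter the way the old ``Checkpoint.from_dict`` version did; a resumed run starts its loop again at epoch 0. A hypothetical sketch (not part of this PR) of how the counter could be recovered inside ``train_func`` from the ``checkpoint.json`` written above:

.. code-block:: python

    import json
    import os

    from ray import train

    def resume_start_epoch() -> int:
        # Return the epoch to resume from, based on checkpoint.json if a
        # checkpoint was passed to this run; otherwise start from 0.
        checkpoint = train.get_checkpoint()
        if checkpoint is None:
            return 0
        with checkpoint.as_directory() as checkpoint_dir:
            with open(os.path.join(checkpoint_dir, "checkpoint.json")) as f:
                return json.load(f)["epoch"] + 1

The training loop would then iterate over ``range(resume_start_epoch(), config["num_epochs"])`` instead of ``range(config["num_epochs"])``.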


Further reading
---------------