[train][2.7][5/n] cherry-picks for documentations, tests, examples #39515

Merged · 7 commits · Sep 10, 2023
30 changes: 15 additions & 15 deletions doc/source/ray-overview/use-cases.rst
@@ -98,9 +98,9 @@ Learn more about model serving with the following resources.

- `[Talk] Productionizing ML at Scale with Ray Serve <https://www.youtube.com/watch?v=UtH-CMpmxvI>`_
- `[Blog] Simplify your MLOps with Ray & Ray Serve <https://www.anyscale.com/blog/simplify-your-mlops-with-ray-and-ray-serve>`_
- `[Guide] Getting Started with Ray Serve </serve/getting_started>`_
- `[Guide] Model Composition in Serve </serve/model_composition>`_
- `[Gallery] Serve Examples Gallery </serve/tutorials/index>`_
- :doc:`[Guide] Getting Started with Ray Serve </serve/getting_started>`
- :doc:`[Guide] Model Composition in Serve </serve/model_composition>`
- :doc:`[Gallery] Serve Examples Gallery </serve/tutorials/index>`
- `[Gallery] More Serve Use Cases on the Blog <https://www.anyscale.com/blog?tag=ray_serve>`_

Hyperparameter Tuning
@@ -116,11 +116,11 @@ Running multiple hyperparameter tuning experiments is a pattern apt for distribu

Learn more about the Tune library with the following talks and user guides.

- `[Guide] Getting Started with Ray Tune </tune/getting-started>`_
- :doc:`[Guide] Getting Started with Ray Tune </tune/getting-started>`
- `[Blog] How to distribute hyperparameter tuning with Ray Tune <https://www.anyscale.com/blog/how-to-distribute-hyperparameter-tuning-using-ray-tune>`_
- `[Talk] Simple Distributed Hyperparameter Optimization <https://www.youtube.com/watch?v=KgYZtlbFYXE>`_
- `[Blog] Hyperparameter Search with 🤗 Transformers <https://www.anyscale.com/blog/hyperparameter-search-hugging-face-transformers-ray-tune>`_
- `[Gallery] Ray Tune Examples Gallery </tune/examples/index>`_
- :doc:`[Gallery] Ray Tune Examples Gallery </tune/examples/index>`
- `More Tune use cases on the Blog <https://www.anyscale.com/blog?tag=ray-tune>`_

Distributed Training
@@ -139,9 +139,9 @@ Learn more about the Train library with the following talks and user guides.

- `[Talk] Ray Train, PyTorch, TorchX, and distributed deep learning <https://www.youtube.com/watch?v=e-A93QftCfc>`_
- `[Blog] Elastic Distributed Training with XGBoost on Ray <https://www.uber.com/blog/elastic-xgboost-ray/>`_
- `[Guide] Getting Started with Ray Train </train/train>`_
- `[Example] Fine-tune a 🤗 Transformers model </train/examples/transformers/huggingface_text_classification>`_
- `[Gallery] Ray Train Examples Gallery </train/examples>`_
- :doc:`[Guide] Getting Started with Ray Train </train/train>`
- :doc:`[Example] Fine-tune a 🤗 Transformers model </train/examples/transformers/huggingface_text_classification>`
- :doc:`[Gallery] Ray Train Examples Gallery </train/examples>`
- `[Gallery] More Train Use Cases on the Blog <https://www.anyscale.com/blog?tag=ray_train>`_

Reinforcement Learning
@@ -157,9 +157,9 @@ Learn more about reinforcement learning with the following resources.

- `[Course] Applied Reinforcement Learning with RLlib <https://applied-rl-course.netlify.app/>`_
- `[Blog] Intro to RLlib: Example Environments <https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70>`_
- `[Guide] Getting Started with RLlib </rllib/rllib-training>`_
- :doc:`[Guide] Getting Started with RLlib </rllib/rllib-training>`
- `[Talk] Deep reinforcement learning at Riot Games <https://www.anyscale.com/events/2022/03/29/deep-reinforcement-learning-at-riot-games>`_
- `[Gallery] RLlib Examples Gallery </rllib/rllib-examples>`_
- :doc:`[Gallery] RLlib Examples Gallery </rllib/rllib-examples>`
- `[Gallery] More RL Use Cases on the Blog <https://www.anyscale.com/blog?tag=rllib>`_

ML Platform
@@ -181,10 +181,10 @@ End-to-End ML Workflows

The following highlights examples utilizing Ray AI libraries to implement end-to-end ML workflows.

- `[Example] Text classification with Ray </train/examples/transformers/huggingface_text_classification>`_
- `[Example] Object detection with Ray </train/examples/pytorch/torch_detection>`_
- `[Example] Machine learning on tabular data </train/examples/xgboost/xgboost_example>`_
- `[Example] AutoML for Time Series with Ray </ray-core/examples/automl_for_time_series>`_
- :doc:`[Example] Text classification with Ray </train/examples/transformers/huggingface_text_classification>`
- :doc:`[Example] Object detection with Ray </train/examples/pytorch/torch_detection>`
- :doc:`[Example] Machine learning on tabular data </train/examples/xgboost/xgboost_example>`
- :doc:`[Example] AutoML for Time Series with Ray </ray-core/examples/automl_for_time_series>`

Large Scale Workload Orchestration
----------------------------------
@@ -194,4 +194,4 @@ The following highlights feature projects leveraging Ray Core's distributed APIs
- `[Blog] Highly Available and Scalable Online Applications on Ray at Ant Group <https://www.anyscale.com/blog/building-highly-available-and-scalable-online-applications-on-ray-at-ant>`_
- `[Blog] Ray Forward 2022 Conference: Hyper-scale Ray Application Use Cases <https://www.anyscale.com/blog/ray-forward-2022>`_
- `[Blog] A new world record on the CloudSort benchmark using Ray <https://www.anyscale.com/blog/ray-breaks-the-usd1-tb-barrier-as-the-worlds-most-cost-efficient-sorting>`_
- `[Example] Speed up your web crawler by parallelizing it with Ray </ray-core/examples/web-crawler>`_
- :doc:`[Example] Speed up your web crawler by parallelizing it with Ray </ray-core/examples/web-crawler>`
75 changes: 39 additions & 36 deletions doc/source/train/distributed-tensorflow-keras.rst
@@ -230,7 +230,9 @@ appropriately in distributed training.


.. code-block:: python
:emphasize-lines: 23

import json  # needed for json.dump when writing checkpoint metadata below
import os
import tempfile

from ray import train
from ray.train import Checkpoint, ScalingConfig
@@ -254,24 +256,24 @@ appropriately in distributed training.
model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["mse"])

for epoch in range(config["num_epochs"]):
model.fit(X, Y, batch_size=20)
checkpoint = Checkpoint.from_dict(
dict(epoch=epoch, model_weights=model.get_weights())
)
train.report({}, checkpoint=checkpoint)
history = model.fit(X, Y, batch_size=20)

with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
model.save(os.path.join(temp_checkpoint_dir, "model.keras"))
checkpoint_json = os.path.join(temp_checkpoint_dir, "checkpoint.json")
with open(checkpoint_json, "w") as f:
json.dump({"epoch": epoch}, f)
checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

train.report({"loss": history.history["loss"][0]}, checkpoint=checkpoint)

trainer = TensorflowTrainer(
train_func,
train_loop_config={"num_epochs": 5},
scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()

print(result.checkpoint.to_dict())
# {'epoch': 4, 'model_weights': [array([[-0.31858477],
# [ 0.03747174],
# [ 0.28266194],
# [ 0.8626015 ]], dtype=float32), array([0.02230084], dtype=float32)], '_timestamp': 1656107383, '_preprocessor': None, '_current_checkpoint_id': 4}
print(result.checkpoint)

By default, checkpoints will be persisted to local disk in the :ref:`log
directory <train-log-dir>` of each run.
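
For context on what the new directory-based checkpoint contains, here is a minimal sketch of how it could be restored after training — assuming ``result`` is the ``Result`` returned by ``trainer.fit()`` in the example above, and using the ``model.keras`` and ``checkpoint.json`` files that the training loop writes:

.. code-block:: python

    import json
    import os

    import tensorflow as tf

    # Open the reported checkpoint as a local directory and restore its contents.
    with result.checkpoint.as_directory() as checkpoint_dir:
        model = tf.keras.models.load_model(
            os.path.join(checkpoint_dir, "model.keras")
        )
        with open(os.path.join(checkpoint_dir, "checkpoint.json")) as f:
            metadata = json.load(f)  # e.g. {"epoch": 4}
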
@@ -280,7 +282,9 @@ Loading checkpoints
~~~~~~~~~~~~~~~~~~~

.. code-block:: python
:emphasize-lines: 15, 21, 22, 25, 26, 27, 30

import json  # needed for json.dump when writing checkpoint metadata below
import os
import tempfile

from ray import train
from ray.train import Checkpoint, ScalingConfig
@@ -297,37 +301,42 @@ Loading checkpoints
X = np.random.normal(0, 1, size=(n, 4))
Y = np.random.uniform(0, 1, size=(n, 1))

start_epoch = 0
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
# toy neural network : 1-layer
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="linear", input_shape=(4,))])
checkpoint = train.get_checkpoint()
if checkpoint:
# assume that we have run the train.report() example
# and successfully save some model weights
checkpoint_dict = checkpoint.to_dict()
model.set_weights(checkpoint_dict.get("model_weights"))
start_epoch = checkpoint_dict.get("epoch", -1) + 1
with checkpoint.as_directory() as checkpoint_dir:
model = tf.keras.models.load_model(
os.path.join(checkpoint_dir, "model.keras")
)
else:
model = tf.keras.Sequential(
[tf.keras.layers.Dense(1, activation="linear", input_shape=(4,))]
)
model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["mse"])

for epoch in range(start_epoch, config["num_epochs"]):
model.fit(X, Y, batch_size=20)
checkpoint = Checkpoint.from_dict(
dict(epoch=epoch, model_weights=model.get_weights())
)
train.report({}, checkpoint=checkpoint)
for epoch in range(config["num_epochs"]):
history = model.fit(X, Y, batch_size=20)

with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
model.save(os.path.join(temp_checkpoint_dir, "model.keras"))
extra_json = os.path.join(temp_checkpoint_dir, "checkpoint.json")
with open(extra_json, "w") as f:
json.dump({"epoch": epoch}, f)
checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

train.report({"loss": history.history["loss"][0]}, checkpoint=checkpoint)

trainer = TensorflowTrainer(
train_func,
train_loop_config={"num_epochs": 2},
train_loop_config={"num_epochs": 5},
scaling_config=ScalingConfig(num_workers=2),
)
# save a checkpoint
result = trainer.fit()
print(result.checkpoint)

# load a checkpoint
# Start a new run from a loaded checkpoint
trainer = TensorflowTrainer(
train_func,
train_loop_config={"num_epochs": 5},
@@ -336,12 +345,6 @@ Loading checkpoints
)
result = trainer.fit()

print(result.checkpoint.to_dict())
# {'epoch': 4, 'model_weights': [array([[-0.70056134],
# [-0.8839263 ],
# [-1.0043601 ],
# [-0.61634773]], dtype=float32), array([0.01889327], dtype=float32)], '_timestamp': 1656108446, '_preprocessor': None, '_current_checkpoint_id': 3}
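
Note that the rewritten loading example no longer restores a ``start_epoch`` counter the way the old ``Checkpoint.from_dict`` version did; a resumed run starts its loop again at epoch 0. A hypothetical sketch (not part of this PR) of how the counter could be recovered inside ``train_func`` from the ``checkpoint.json`` written above:

.. code-block:: python

    import json
    import os

    from ray import train

    def resume_start_epoch() -> int:
        # Return the epoch to resume from, based on checkpoint.json if a
        # checkpoint was passed to this run; otherwise start from 0.
        checkpoint = train.get_checkpoint()
        if checkpoint is None:
            return 0
        with checkpoint.as_directory() as checkpoint_dir:
            with open(os.path.join(checkpoint_dir, "checkpoint.json")) as f:
                return json.load(f)["epoch"] + 1

The training loop would then iterate over ``range(resume_start_epoch(), config["num_epochs"])`` instead of ``range(config["num_epochs"])``.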


Further reading
---------------