Commit 0c02ff9
Cherry-pick all Serve doc changes for Ray 2.0 (#27960)
Cherry-picks all docs changes for Serve in Ray 2.0.

I did this by overwriting the entire `doc/source/serve/` directory in addition to `doc/source/_toc.yml`. The changes should be isolated to Serve (manually verified).
edoakes committed Aug 18, 2022
1 parent aec51c3 commit 0c02ff9
Showing 102 changed files with 5,437 additions and 4,085 deletions.
38 changes: 20 additions & 18 deletions doc/source/_toc.yml
@@ -168,28 +168,29 @@ parts:
- file: serve/key-concepts
- file: serve/user-guide
sections:
- file: serve/managing-deployments
- file: serve/handling-dependencies
- file: serve/http-guide
- file: serve/http-adapters
- file: serve/handle-guide
- file: serve/ml-models
- file: serve/deploying-serve
- file: serve/monitoring
- file: serve/performance
- file: serve/autoscaling
- file: serve/deployment-graph
- file: serve/scaling-and-resource-allocation
- file: serve/model_composition
- file: serve/dev-workflow
- file: serve/production-guide/index
sections:
- file: serve/deployment-graph/deployment-graph-e2e-tutorial
- file: serve/deployment-graph/chain_nodes_same_class_different_args
- file: serve/deployment-graph/combine_two_nodes_with_passing_input_parallel
- file: serve/deployment-graph/control_flow_based_on_user_inputs
- file: serve/deployment-graph/visualize_dag_during_development
- file: serve/deployment-graph/http_endpoint_for_dag_graph
- file: serve/production
- file: serve/production-guide/config
- file: serve/production-guide/rest-api
- file: serve/production-guide/kubernetes
- file: serve/production-guide/monitoring
- file: serve/production-guide/failures
- file: serve/performance
- file: serve/handling-dependencies
- file: serve/managing-java-deployments
- file: serve/migration
- file: serve/architecture
- file: serve/tutorials/index
- file: serve/faq
sections:
- file: serve/tutorials/deployment-graph-patterns
sections:
- file: serve/tutorials/deployment-graph-patterns/linear_pipeline
- file: serve/tutorials/deployment-graph-patterns/branching_input
- file: serve/tutorials/deployment-graph-patterns/conditional
- file: serve/package-ref

- file: rllib/index
@@ -295,6 +296,7 @@ parts:
- file: cluster/running-applications/index
title: Applications Guide


- caption: References
chapters:
- file: ray-references/api
2 changes: 0 additions & 2 deletions doc/source/ray-core/ray-dag.rst
@@ -154,6 +154,4 @@ More Resources
You can find more application patterns and examples in the following resources
from other Ray libraries built on top of Ray DAG API with same mechanism.

-| `Visualization of DAGs <https://docs.ray.io/en/master/serve/deployment-graph/visualize_dag_during_development.html>`_
-| `DAG Cookbook and patterns <https://docs.ray.io/en/master/serve/deployment-graph.html#patterns>`_
| `Serve Deployment Graph's original REP <https://github.com/ray-project/enhancements/blob/main/reps/2022-03-08-serve_pipeline.md>`_
1 change: 0 additions & 1 deletion doc/source/ray-references/faq.rst
@@ -8,7 +8,6 @@ FAQ
:caption: Frequently Asked Questions

./../tune/faq.rst
-./../serve/faq.rst


Further Questions or Issues?
1 change: 1 addition & 0 deletions doc/source/serve/architecture-2.0.svg
This file was added (SVG image; preview not shown).
109 changes: 69 additions & 40 deletions doc/source/serve/architecture.md
@@ -1,85 +1,115 @@
(serve-architecture)=

-# Serve Architecture
+# Architecture

-This section should help you:
-
-- understand an overview of how each component in Serve works
-- understand the different types of actors that make up a Serve instance
+In this section, we explore Serve's key architectural concepts and components. It will offer insight and overview into:
+- the role of each component in Serve and how they work
+- the different types of actors that make up a Serve application

% Figure source: https://docs.google.com/drawings/d/1jSuBN5dkSj2s9-0eGzlU_ldsRa3TsswQUZM-cMQ29a0/edit?usp=sharing

-```{image} architecture.svg
+```{image} architecture-2.0.svg
:align: center
:width: 600px
```

-## High Level View
+(serve-architecture-high-level-view)=
+## High-Level View

Serve runs on Ray and utilizes [Ray actors](actor-guide).

There are three kinds of actors that are created to make up a Serve instance:

-- Controller: A global actor unique to each Serve instance that manages
+- **Controller**: A global actor unique to each Serve instance that manages
the control plane. The Controller is responsible for creating, updating, and
destroying other actors. Serve API calls like creating or getting a deployment
make remote calls to the Controller.
-- Router: There is one router per node. Each router is a [Uvicorn](https://www.uvicorn.org/) HTTP
+- **HTTP Proxy**: By default there is one HTTP proxy actor on the head node. This actor runs a [Uvicorn](https://www.uvicorn.org/) HTTP
server that accepts incoming requests, forwards them to replicas, and
-responds once they are completed.
-- Worker Replica: Worker replicas actually execute the code in response to a
+responds once they are completed. For scalability and high availability,
+you can also run a proxy on each node in the cluster via the `location` field of [`http_options`](core-apis).
+- **Replicas**: Actors that actually execute the code in response to a
request. For example, they may contain an instantiation of an ML model. Each
-replica processes individual requests from the routers (they may be batched
-by the replica using `@serve.batch`, see the [batching](serve-batching) docs).
+replica processes individual requests from the HTTP proxy (these may be batched
+by the replica using `@serve.batch`, see the [batching](serve-performance-batching-requests) docs).
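As an illustration (not part of this commit), a minimal sketch of the `@serve.batch` batching mentioned in the **Replicas** bullet above, assuming Ray 2.0's public `serve.batch` API; the class and names are illustrative:

```python
from typing import List

from ray import serve
from starlette.requests import Request


@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, inputs: List[bytes]) -> List[str]:
        # Serve gathers up to 8 queued calls and passes them in as one list.
        return [data.decode().upper() for data in inputs]

    async def __call__(self, request: Request) -> str:
        # Each individual request is transparently grouped with others.
        return await self.handle_batch(await request.body())
```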

## Lifetime of a Request

-When an HTTP request is sent to the router, the follow things happen:
+When an HTTP request is sent to the HTTP proxy, the following happens:

-- The HTTP request is received and parsed.
-- The correct deployment associated with the HTTP url path is looked up. The
+1. The HTTP request is received and parsed.
+2. The correct deployment associated with the HTTP URL path is looked up. The
request is placed on a queue.
-- For each request in a deployment queue, an available replica is looked up
-and the request is sent to it. If there are no available replicas (there
-are more than `max_concurrent_queries` requests outstanding), the request
-is left in the queue until an outstanding request is finished.
+3. For each request in a deployment's queue, an available replica is looked up in round-robin fashion
+and the request is sent to it. If there are no available replicas (i.e. there
+are more than `max_concurrent_queries` requests outstanding at each replica), the request
+is left in the queue until a replica becomes available.

-Each replica maintains a queue of requests and executes one at a time, possibly
-using asyncio to process them concurrently. If the handler (the function for the
-deployment or `__call__`) is `async`, the replica will not wait for the
-handler to run; otherwise, the replica will block until the handler returns.
+Each replica maintains a queue of requests and executes requests one at a time, possibly
+using `asyncio` to process them concurrently. If the handler (the deployment function or the `__call__` method of the deployment class) is declared with `async def`, the replica will not wait for the
+handler to run. Otherwise, the replica will block until the handler returns.
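As an illustration (not part of this commit), a hedged sketch of the `async def` handler behavior described above, assuming Ray 2.0's `max_concurrent_queries` deployment option; names are illustrative:

```python
import asyncio

from ray import serve


@serve.deployment(max_concurrent_queries=100)
class AsyncModel:
    async def __call__(self, request) -> str:
        # Because the handler is `async def`, the replica does not block on it
        # and can interleave up to `max_concurrent_queries` in-flight requests.
        await asyncio.sleep(0.01)  # stand-in for async I/O or a model call
        return "done"
```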

-## FAQ
+When making a request via [ServeHandle](serve-handle-explainer) instead of HTTP, the request is placed on a queue in the ServeHandle, and we skip to step 3 above.

(serve-ft-detail)=

-### How does Serve handle fault tolerance?
+## Fault tolerance

Application errors like exceptions in your model evaluation code are caught and
wrapped. A 500 status code will be returned with the traceback information. The
replica will be able to continue to handle requests.

-Machine errors and faults will be handled by Ray. Serve utilizes the [actor
-reconstruction](actor-fault-tolerance) capability. For example, when a machine hosting any of the
-actors crashes, those actors will be automatically restarted on another
+Machine errors and faults will be handled by Ray Serve as follows:
+
+- When replica actors fail, the Controller actor will replace them with new ones.
+- When the HTTP proxy actor fails, the Controller actor will restart it.
+- When the Controller actor fails, Ray will restart it.
+- When using the [KubeRay RayService](https://ray-project.github.io/kuberay/guidance/rayservice/), KubeRay will recover crashed nodes or a crashed cluster. Cluster crashes can be avoided using the [GCS FT feature](https://ray-project.github.io/kuberay/guidance/gcs-ft/).
+- If not using KubeRay, when the Ray cluster fails, Ray Serve cannot recover.
+
+When a machine hosting any of the actors crashes, those actors will be automatically restarted on another
available machine. All data in the Controller (routing policies, deployment
-configurations, etc) is checkpointed to the Ray. Transient data in the
-router and the replica (like network connections and internal request
-queues) will be lost upon failure.
+configurations, etc) is checkpointed to the Ray Global Control Store (GCS) on the head node. Transient data in the
+router and the replica (like network connections and internal request queues) will be lost for this kind of failure.
+See [Serve Health Checking](serve-health-checking) for how actor crashes are detected.
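As an illustration (not part of this commit), the health checking referenced above can also be customized per deployment; a sketch assuming Ray 2.0's optional `check_health` hook and health-check deployment options, with illustrative names:

```python
from ray import serve


@serve.deployment(health_check_period_s=10, health_check_timeout_s=30)
class StatefulModel:
    def __init__(self):
        self.connected = True  # e.g. a connection to an external store

    def check_health(self):
        # The controller calls this periodically; raising an exception marks
        # the replica unhealthy so it can be torn down and replaced.
        if not self.connected:
            raise RuntimeError("lost connection to backing store")

    def __call__(self, request) -> str:
        return "ok"
```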

+(serve-autoscaling-architecture)=
+
+## Ray Serve Autoscaling
+
+Ray Serve's autoscaling feature automatically increases or decreases a deployment's number of replicas based on its load.
+
+![pic](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/autoscaling.svg)
+
+- The Serve Autoscaler runs in the Serve Controller actor.
+- Each ServeHandle and each replica periodically pushes its metrics to the autoscaler.
+- For each deployment, the autoscaler periodically checks ServeHandle queues and in-flight queries on replicas to decide whether or not to scale the number of replicas.
+- Each ServeHandle continuously polls the controller to check for new deployment replicas. Whenever new replicas are discovered, it will send any buffered or new queries to the replica until `max_concurrent_queries` is reached. Queries are sent to replicas in round-robin fashion, subject to the constraint that no replica is handling more than `max_concurrent_queries` requests at a time.
+
+:::{note}
+When the controller dies, requests can still be sent via HTTP and ServeHandles, but autoscaling will be paused. When the controller recovers, the autoscaling will resume, but all previous metrics collected will be lost.
+:::
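As an illustration (not part of this commit), a minimal sketch of the autoscaling knobs this section describes, assuming Ray 2.0's `autoscaling_config` deployment option; the values are illustrative:

```python
from ray import serve


@serve.deployment(
    max_concurrent_queries=15,
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 5,
        # Scale up once replicas average ~10 ongoing requests each.
        "target_num_ongoing_requests_per_replica": 10,
    },
)
class AutoscaledModel:
    def __call__(self, request) -> str:
        return "ok"
```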

+## Ray Serve API Server
+
+Ray Serve provides a [CLI](serve-cli) for managing your Ray Serve instance, as well as a [REST API](serve-rest-api).
+Each node in your Ray cluster provides a Serve REST API server that can connect to Serve and respond to Serve REST requests.
+
+## FAQ

### How does Serve ensure horizontal scalability and availability?

-Serve starts one router per node. Each router will bind the same port. You
+Serve can be configured to start one HTTP proxy actor per node via the `location` field of [`http_options`](core-apis). Each one will bind the same port. You
should be able to reach Serve and send requests to any models via any of the
-servers.
+servers. You can use your own load balancer on top of Ray Serve.

-This architecture ensures horizontal scalability for Serve. You can scale the
-router by adding more nodes and scale the model by increasing the number
-of replicas.
+This architecture ensures horizontal scalability for Serve. You can scale your HTTP ingress by adding more nodes and scale your model inference by increasing the number
+of replicas via the `num_replicas` option of your deployment.
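As an illustration (not part of this commit), a hedged sketch of the two scaling levers mentioned in this answer, a proxy per node and more replicas, assuming Ray 2.0's `serve.start` and deployment APIs:

```python
import ray
from ray import serve

ray.init()
# Run an HTTP proxy on every node rather than only the head node.
serve.start(http_options={"location": "EveryNode"})


@serve.deployment(num_replicas=3)
class Model:
    def __call__(self, request) -> str:
        return "ok"


serve.run(Model.bind())  # each proxy can now route to any of the 3 replicas
```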

### How do ServeHandles work?

-{mod}`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
+{mod}`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to a "router" on the
+same node which routes requests to replicas for a deployment. When a
request is sent from one replica to another via the handle, the
requests go through the same data path as incoming HTTP requests. This enables
the same deployment selection and batching procedures to happen. ServeHandles are
@@ -89,5 +119,4 @@ often used to implement [model composition](serve-model-composition).

Serve utilizes Ray’s [shared memory object store](plasma-store) and in process memory
store. Small request objects are directly sent between actors via network
-call. Larger request objects (100KiB+) are written to a distributed shared
-memory store and the replica can read them via zero-copy read.
+call. Larger request objects (100KiB+) are written to the object store and the replica can read them via zero-copy read.
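As an illustration (not part of this commit), a minimal sketch of calling a deployment through a ServeHandle as described in the FAQ above, assuming Ray 2.0's `serve.run` return value; the names are illustrative:

```python
import ray
from ray import serve


@serve.deployment
class Doubler:
    def __call__(self, x: int) -> int:
        return 2 * x


handle = serve.run(Doubler.bind())  # returns a handle to the ingress deployment
ref = handle.remote(21)             # queues the request; returns an ObjectRef
assert ray.get(ref) == 42           # large payloads travel via the object store
```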
1 change: 0 additions & 1 deletion doc/source/serve/architecture.svg

This file was deleted.

48 changes: 0 additions & 48 deletions doc/source/serve/autoscaling.md

This file was deleted.
