# Build a tool-using agent with LangChain, LangGraph, and Ray Serve on Anyscale

This tutorial guides you through building and deploying a sophisticated, tool-using agent using LangChain, LangGraph, and Ray Serve on Anyscale.

You'll create a scalable microservices architecture where each component—the agent, the LLM, and the tools—runs as an independent, autoscaling service.

* The Agent (built with LangGraph) orchestrates tasks and manages conversation state.

* The LLM (Qwen3 4B) runs in its own service for dedicated, high-speed inference.

* The Tools (a weather API) are exposed via the Model Context Protocol (MCP), an open standard that allows the agent to discover and use them dynamically.

This decoupled design provides automatic scaling, fault isolation, and the flexibility to update or swap components (like LLMs or tools) without changing your agent's code.


## Architecture overview

This project uses a microservices architecture where three independent Ray Serve applications work together.

### Components
* Agent Service (LangGraph): The "brain" of the operation. It orchestrates the multi-step reasoning and manages the conversation state. It's lightweight (CPU-only) and deployed with Ray Serve.

* LLM Service (Ray Serve LLM): The "language engine." It runs the Qwen/Qwen3-4B-Instruct-2507-FP8 model, optimized for tool use. It's deployed with vLLM on a GPU (L4) for high-speed inference and provides an OpenAI-compatible API.

* Tool Service (MCP): The "hands." It exposes a weather API as a set of tools. The agent discovers these tools at runtime using the Model Context Protocol (MCP). It's also a stateless, CPU-only service.

### Benefits of this architecture
This microservice architecture allows each component to scale independently. Your GPU-intensive LLM service can scale up and down based on inference demand, separate from the lightweight, CPU-based agent orchestration.

### Key benefits of using Ray and Anyscale

* Independent Scaling: Scale GPUs for the LLM and CPUs for the agent/tools separately.

* High Availability: Zero-downtime updates and automatic recovery from failures.

* Flexibility: Swap LLMs or add new tools simply by deploying a new service. The agent discovers them at runtime—no code changes needed.

* Enhanced Observability: Anyscale provides comprehensive logs, metrics, and tracing for each service.

### Additional resources

For more information on LLM serving and Ray Serve, see the following:
- [Anyscale LLM Serving documentation](https://docs.anyscale.com/llm/serving)
- [Ray Serve LLM documentation](https://docs.ray.io/en/master/serve/llm/index.html)
- [Anyscale LLM Serving Template](https://console.anyscale.com/template-preview/deployment-serve-llm)




## Dependencies and compute resource requirements

This project uses `pyproject.toml` with locked versions in `uv.lock` for reproducible installations. Run the following command to install dependencies:


In [1]:
%%bash
uv sync

[36m[1mDownloading[0m[39m cpython-3.12.12-linux-x86_64-gnu (download) [2m(31.8MiB)[0m
 [32m[1mDownloading[0m[39m cpython-3.12.12-linux-x86_64-gnu (download)
Using CPython [36m3.12.12[39m
Creating virtual environment at: [36m.venv[39m
[2mResolved [1m58 packages[0m [2min 0.70ms[0m[0m
[36m[1mDownloading[0m[39m tiktoken [2m(1.1MiB)[0m
[36m[1mDownloading[0m[39m pydantic-core [2m(1.9MiB)[0m
[36m[1mDownloading[0m[39m zstandard [2m(5.3MiB)[0m
 [32m[1mDownloading[0m[39m tiktoken
 [32m[1mDownloading[0m[39m pydantic-core
 [32m[1mDownloading[0m[39m zstandard
[2mPrepared [1m55 packages[0m [2min 483ms[0m[0m
[2mInstalled [1m55 packages[0m [2min 33ms[0m[0m
 [32m+[39m [1mannotated-types[0m[2m==0.7.0[0m
 [32m+[39m [1manyio[0m[2m==4.11.0[0m
 [32m+[39m [1mattrs[0m[2m==25.3.0[0m
 [32m+[39m [1mcertifi[0m[2m==2025.8.3[0m
 [32m+[39m [1mcharset-normalizer[0m[2m==3.4.3[0m
 [32m+[39m [1mclick[0m[2m==8.3.0[0m
 [32m+

The deployment requires two compute resources: one L4 GPU (g6.2xlarge instance, 24 GB GPU memory) for the LLM service, and one m5d.xlarge (4 vCPU) for the MCP and agent services.

## Implementation: Building the Services

This project consists of several Python scripts that work together to create and serve the agent.

### Step 1: Create the LLM service

Check out the code in `llm_deploy_qwen.py`. This script deploys the Qwen LLM (`Qwen/Qwen3-4B-Instruct-2507-FP8`) as an OpenAI-compatible API endpoint using Ray Serve's `build_openai_app` utility. This allows you to use the Qwen model with any OpenAI-compatible client, including LangChain.

The following are key configurations in this script:

- **`accelerator_type="L4"`**: Specifies the GPU type. L4 GPUs (Ada Lovelace architecture) are optimized for FP8 precision, making them cost-effective for this quantized model. For higher throughput, use H100 GPUs. For GPU selection guidance, see the [GPU guidance documentation](https://docs.anyscale.com/llm/serving/gpu-guidance).


- **`enable_auto_tool_choice=True`**: Enables the model to automatically decide when to use tools based on the input. This is essential for agent workflows where the LLM needs to determine whether to call a tool or respond directly. For more information on tool calling, see the [tool and function calling documentation](https://docs.anyscale.com/llm/serving/tool-function-calling).

- **`tool_call_parser="hermes"`**: Specifies the parsing strategy for tool calls. The "hermes" parser is designed for models that follow the Hermes function-calling format, which Qwen models support.

- **`trust_remote_code=True`**: Required when loading Qwen models from Hugging Face, as they use custom chat templates and tokenization logic that aren't part of the standard transformers library.
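As a rough illustration of how the settings above might fit together, the following sketch collects them into keyword arguments for `ray.serve.llm.LLMConfig`. The autoscaling bounds and `max_model_len` are assumptions chosen for illustration, and the actual `llm_deploy_qwen.py` may be organized differently.

```python
"""Sketch of an LLM deployment config; values not named in the text are assumptions."""

MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507-FP8"

# Keyword arguments for ray.serve.llm.LLMConfig, kept as a plain dict so the
# settings can be inspected without Ray installed.
LLM_CONFIG_KWARGS = {
    "model_loading_config": {"model_id": MODEL_ID, "model_source": MODEL_ID},
    "accelerator_type": "L4",
    "deployment_config": {
        # Assumed autoscaling bounds; tune them for your traffic.
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
    },
    "engine_kwargs": {
        "enable_auto_tool_choice": True,
        "tool_call_parser": "hermes",
        "trust_remote_code": True,
        # Assumed context cap so the model fits on a single 24 GB L4.
        "max_model_len": 8192,
    },
}


def build_app():
    """Build the OpenAI-compatible Serve app (requires ray[serve,llm])."""
    # Imported lazily so the config above stays importable without Ray.
    from ray.serve.llm import LLMConfig, build_openai_app

    return build_openai_app({"llm_configs": [LLMConfig(**LLM_CONFIG_KWARGS)]})
```

To serve it, a deployment script would typically expose the result at module level, for example `app = build_app()`, so that `serve run` or `anyscale service deploy` can import it.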

**Additional LLM development resources:**
- [LLM serving basics](https://docs.anyscale.com/llm/serving/intro)
- [LLM serving examples and template](https://console.anyscale.com/template-preview/deployment-serve-llm): Comprehensive examples for deploying LLMs with Ray Serve
- [Performance optimization documentation](https://docs.anyscale.com/llm/serving/performance-optimization)
- [Configure structured output](https://docs.anyscale.com/llm/serving/structured-output): Ensure LLM responses match specific schemas


### Step 2: Create the MCP weather tool service

Check out the code in `weather_mcp_ray.py` to deploy weather tools as an MCP (Model Context Protocol) service.

**How the weather tool service works:**

The `weather_mcp_ray.py` script uses `FastMCP` from the MCP Python SDK (`mcp.server.fastmcp`) to define and expose weather-related tools. The service is a FastAPI application deployed with Ray Serve, which makes the tools available over HTTP.

- **FastMCP framework**: The `FastMCP` class provides a way to define tools using Python decorators. Setting `stateless_http=True` makes it suitable for deployment as an HTTP service.

- **Tool registration**: Each function decorated with `@mcp.tool()` becomes an automatically discoverable tool:
  - `get_alerts(state: str)`: Fetches active weather alerts for a given U.S. state code.
  - `get_forecast(latitude: float, longitude: float)`: Retrieves a 5-period forecast for specific coordinates.

- **External API integration**: The service makes asynchronous HTTP requests to the National Weather Service (NWS) API using `httpx`. The `USER_AGENT` header is required by the NWS API to identify the client application.

- **Tool metadata**: The docstrings for each tool function serve as descriptions that the agent uses to understand when and how to call each tool. This is crucial for the LLM to decide which tool to use.

- **Ray Serve deployment**: When deployed with Ray Serve, this becomes a scalable microservice that can handle multiple concurrent tool requests from agent instances.
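To make the tool output concrete, here is a minimal sketch of how a forecast tool might turn NWS forecast periods into the text the agent receives. The helper name `format_forecast` and the `---` separator are hypothetical; the field names follow the NWS forecast response. Note how the docstring doubles as the description an LLM would read.

```python
def format_forecast(periods: list[dict], limit: int = 5) -> str:
    """Get a short weather forecast as plain text.

    Args:
        periods: Forecast periods from the NWS API response.
        limit: Maximum number of periods to include.
    """
    blocks = []
    for period in periods[:limit]:
        blocks.append(
            f"{period['name']}:\n"
            f"Temperature: {period['temperature']}°{period['temperatureUnit']}\n"
            f"Wind: {period['windSpeed']} {period['windDirection']}\n"
            f"Forecast: {period['detailedForecast']}"
        )
    # Separate periods so the LLM can tell them apart.
    return "\n---\n".join(blocks)


if __name__ == "__main__":
    sample = [
        {
            "name": "Today",
            "temperature": 62,
            "temperatureUnit": "F",
            "windSpeed": "5 mph",
            "windDirection": "NW",
            "detailedForecast": "Sunny.",
        }
    ]
    print(format_forecast(sample))
```

Keeping the formatting in a pure function like this makes it easy to unit test without calling the NWS API.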

**Important:** Ray Serve currently only supports stateless HTTP mode in MCP. Set `stateless_http=True` to prevent "session not found" errors when multiple replicas are running:

```python
mcp = FastMCP("weather", stateless_http=True)
```

**Additional resources:**
- [MCP quickstart guide](https://docs.anyscale.com/mcp/mcp-quickstart-guide)
- [Deploy scalable MCP servers](https://docs.anyscale.com/mcp/scalable-remote-mcp-deployment)
- [Anyscale MCP Deployment Template](https://console.anyscale.com/template-preview/mcp-ray-serve)


### Step 3: Create the agent logic

Check out the code in `agent_with_mcp.py` to define the agent that orchestrates the LLM and tools.

The core function is `build_agent`:

```python
async def build_agent():
    mcp_tools = await get_mcp_tools()

    tools = []
    if mcp_tools:
        tools.extend(mcp_tools)
    else:
        # Fallback so you can verify tool-calling quickly.
        tools.append(echo)

    print(f"\n[Agent] Using {len(tools)} tool(s).")

    memory = MemorySaver()
    agent = create_agent(
        llm,
        tools,
        system_prompt=PROMPT,
        checkpointer=memory,
    )
    return agent
```

**How the agent works:**

- **LLM configuration**: Connects to your deployed Qwen model using the OpenAI-compatible API.

- **Tool discovery**: Uses `MultiServerMCPClient` to automatically discover available tools from the MCP service.

- **Agent creation**: Creates an agent with the LLM, tools, and system prompt using LangChain's `create_agent` function.

- **Memory management**: Uses `MemorySaver` to maintain conversation state across multiple turns.
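The tool discovery step could look roughly like the following sketch. The helper signatures here take the URL and token as arguments, which is an assumption made for illustration (the `get_mcp_tools()` in `agent_with_mcp.py` is called without arguments), and the server name `"weather"` is likewise assumed.

```python
def mcp_connections(base_url: str, token: str) -> dict:
    """Build the connection map that MultiServerMCPClient expects."""
    return {
        "weather": {
            "url": base_url,  # Must end with "/mcp".
            "transport": "streamable_http",
            "headers": {"Authorization": f"Bearer {token}"},
        }
    }


async def get_mcp_tools(base_url: str, token: str) -> list:
    """Return tools discovered from the MCP service, or [] if discovery fails."""
    # Imported lazily; requires the langchain-mcp-adapters package.
    from langchain_mcp_adapters.client import MultiServerMCPClient

    try:
        client = MultiServerMCPClient(mcp_connections(base_url, token))
        return await client.get_tools()
    except Exception as exc:
        # An empty list lets build_agent fall back to its echo tool.
        print(f"[MCP] Tool discovery failed: {exc}")
        return []
```

Returning an empty list on failure matches the fallback branch in `build_agent`, which adds the `echo` tool when no MCP tools are available.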



### Step 4: Create the agent deployment script

The `ray_serve_agent_deployment.py` script deploys the agent as a Ray Serve application with a `/chat` endpoint.

```python
import json
from contextlib import asynccontextmanager
from typing import AsyncGenerator
from uuid import uuid4

from fastapi import FastAPI, Request
from fastapi.encoders import jsonable_encoder
from starlette.responses import StreamingResponse
from ray import serve

from agent_with_mcp import build_agent  # Your factory that returns a LangChain / LangGraph agent.

# ----------------------------------------------------------------------
# FastAPI app with an async lifespan hook.
# ----------------------------------------------------------------------
@asynccontextmanager
async def lifespan(app: FastAPI):
    agent = await build_agent()  # Likely compiled with a checkpointer.
    app.state.agent = agent
    try:
        yield
    finally:
        if hasattr(agent, "aclose"):
            await agent.aclose()

fastapi_app = FastAPI(lifespan=lifespan)

@fastapi_app.post("/chat")
async def chat(request: Request):
    """
    POST /chat
    Body: {"user_request": "<text>", "thread_id": "<optional>", "checkpoint_ns": "<optional>"}

    Streams LangGraph 'update' dicts as SSE (one JSON object per event).
    """
    body = await request.json()
    user_request: str = body.get("user_request", "")

    # Threading and checkpoint identifiers.
    thread_id = (
        body.get("thread_id")
        or request.headers.get("X-Thread-Id")
        or str(uuid4())  # New thread per request if none provided.
    )
    checkpoint_ns = body.get("checkpoint_ns")  # Optional namespacing.

    # Build config for LangGraph.
    config = {"configurable": {"thread_id": thread_id}}
    if checkpoint_ns:
        config["configurable"]["checkpoint_ns"] = checkpoint_ns

    async def event_stream() -> AsyncGenerator[str, None]:
        agent = request.app.state.agent
        inputs = {"messages": [{"role": "user", "content": user_request}]}

        try:
            # Stream updates from the agent.
            async for update in agent.astream(inputs, config=config, stream_mode="updates"):
                safe_update = jsonable_encoder(update)
                # Proper SSE framing: "data: <json>\n\n".
                yield f"data: {json.dumps(safe_update)}\n\n"
        except Exception as e:
            # Don't crash the SSE; surface one terminal error event and end.
            err = {"error": type(e).__name__, "detail": str(e)}
            yield f"data: {json.dumps(err)}\n\n"

    # Expose thread id so the client can reuse it on the next call.
    headers = {"X-Thread-Id": thread_id}

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        headers=headers,
    )

# ----------------------------------------------------------------------
# Ray Serve deployment wrapper.
# ----------------------------------------------------------------------
@serve.deployment(ray_actor_options={"num_cpus": 1})
@serve.ingress(fastapi_app)
class LangGraphServeDeployment:
    pass

app = LangGraphServeDeployment.bind()

# Deploy the agent app locally:
# serve run ray_serve_agent_deployment:app

# Deploy the agent using Anyscale service:
# anyscale service deploy ray_serve_agent_deployment:app
```

**How deployment works:**

- **FastAPI lifespan management**: Uses `@asynccontextmanager` to initialize the agent on startup and clean up on shutdown.

- **Streaming endpoint**: The `/chat` endpoint accepts POST requests with a JSON body like the following and returns server-sent events (SSE):
  ```json
  {
    "user_request": "What's the weather?",
    "thread_id": "optional-thread-id",
    "checkpoint_ns": "optional-namespace"
  }
  ```

- **Thread management**: Each conversation can have a `thread_id` to maintain context across requests. If no `thread_id` is provided, a new UUID is generated.

- **Event streaming**: Uses LangGraph's `astream` to emit real-time updates (tool calls, reasoning steps, final answers) as JSON objects.

- **Resource allocation**: The agent deployment is lightweight (1 CPU per replica, as set by `ray_actor_options={"num_cpus": 1}`, and no GPU) because the heavy computation happens in the LLM service.
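The SSE framing used by the `/chat` endpoint can be illustrated with a small round trip. This is a minimal sketch with hypothetical helper names (`sse_frame`, `parse_sse`); it relies on `json.dumps` producing single-line output, so each event fits on one `data:` line.

```python
import json


def sse_frame(payload: dict) -> str:
    """Frame one JSON object as a server-sent event."""
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(stream_text: str) -> list[dict]:
    """Recover the JSON objects from a block of SSE text."""
    events = []
    for line in stream_text.splitlines():
        # Blank lines separate events; only "data: " lines carry payloads.
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events


if __name__ == "__main__":
    stream = sse_frame({"model": {"messages": []}}) + sse_frame({"tools": {"messages": []}})
    print(parse_sse(stream))
```

A client that reads the response line by line, as in the test script later in this tutorial, is effectively doing what `parse_sse` does here.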


## Deploy the services

Now that you've reviewed the code, deploy each service to Anyscale.

### Step 5: Deploy the LLM service

Deploy the Qwen LLM service on Anyscale. This command creates a scalable endpoint for LLM inference:


In [2]:
%%bash
anyscale service deploy llm_deploy_qwen:app --name llm_deploy_qwen_service


(anyscale +1.7s) Restarting existing service 'llm_deploy_qwen_service'.
(anyscale +3.6s) Uploading local dir '.' to cloud storage.
(anyscale +4.2s) Including workspace-managed pip dependencies.
(anyscale +5.1s) Service 'llm_deploy_qwen_service' deployed (version ID: 6uyk5r1b).
(anyscale +5.1s) View the service in the UI: 'https://console.anyscale.com/services/service2_4ebm1f7su1fjgr6bflxgh7hqf6'
(anyscale +5.1s) Query the service once it's running using the following curl command (add the path you want to query):
(anyscale +5.1s) curl -H "Authorization: Bearer VrBDo0s-qNOaP9kugBQtJQhGAIA6EUszb6iJHbB1xDQ" https://llm-deploy-qwen-service-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/


After deployment completes, you'll receive:
- Service URL (for example, `https://llm-deploy-qwen-service-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/v1`)
- API token for authentication

**Save these values—you'll need them to configure the agent.**


### Step 6: Deploy the weather MCP service

Deploy the weather tool service. This creates an endpoint for the agent to discover and call weather tools:


In [3]:
%%bash
anyscale service deploy weather_mcp_ray:app --name weather_mcp_service


(anyscale +1.1s) Restarting existing service 'weather_mcp_service'.
(anyscale +1.9s) Uploading local dir '.' to cloud storage.
(anyscale +2.6s) Including workspace-managed pip dependencies.
(anyscale +3.4s) Service 'weather_mcp_service' deployed (version ID: 6vta7xsr).
(anyscale +3.4s) View the service in the UI: 'https://console.anyscale.com/services/service2_gewuw3u78jnjv5wxzx53tnvdb2'
(anyscale +3.4s) Query the service once it's running using the following curl command (add the path you want to query):
(anyscale +3.4s) curl -H "Authorization: Bearer uyOArxwCNeTpxn0odOW7hGY57tXQNNrF16Yy8ziskrY" https://weather-mcp-service-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/


After deployment completes, you'll receive:
- Service URL (for example, `https://weather-mcp-service-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/mcp`)
- API token for authentication

**Important:** Make sure to include `/mcp` in your URL when configuring the agent.


### Step 7: Configure the agent

Update `agent_with_mcp.py` with the service endpoints you received from the deployments. Modify the following lines:

```python
API_KEY = "<your-llm-service-token>"
OPENAI_COMPAT_BASE_URL = "<your-llm-service-url>/v1"  # Include "/v1".
MODEL = "Qwen/Qwen3-4B-Instruct-2507-FP8"
TEMPERATURE = 0.01
WEATHER_MCP_BASE_URL = "<your-mcp-service-url>/mcp"  # Include "/mcp".
WEATHER_MCP_TOKEN = "<your-mcp-service-token>"
```
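If you prefer not to hardcode tokens in source, one option is to read these values from environment variables. The following is a sketch under that assumption; the environment variable names (`LLM_SERVICE_URL`, `LLM_SERVICE_TOKEN`, `WEATHER_MCP_URL`, `WEATHER_MCP_TOKEN`) and the helper names are hypothetical, not part of the original script.

```python
import os
from typing import Mapping


def with_suffix(url: str, suffix: str) -> str:
    """Ensure the URL ends with the given path suffix exactly once."""
    url = url.rstrip("/")
    return url if url.endswith(suffix) else url + suffix


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read service endpoints and tokens, raising KeyError if one is missing."""
    return {
        "API_KEY": env["LLM_SERVICE_TOKEN"],
        "OPENAI_COMPAT_BASE_URL": with_suffix(env["LLM_SERVICE_URL"], "/v1"),
        "MODEL": "Qwen/Qwen3-4B-Instruct-2507-FP8",
        "TEMPERATURE": 0.01,
        "WEATHER_MCP_BASE_URL": with_suffix(env["WEATHER_MCP_URL"], "/mcp"),
        "WEATHER_MCP_TOKEN": env["WEATHER_MCP_TOKEN"],
    }
```

The `with_suffix` helper guards against the two most common configuration mistakes here: omitting `/v1` or `/mcp`, or leaving a trailing slash on the service URL.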




### Step 8: Deploy the agent service

Deploy the agent itself. For local testing, use `serve run`. For production deployment on Anyscale, see Step 10.

**For local deployment:**

```bash
serve run ray_serve_agent_deployment:app 
```


## Test the agent

### Step 9: Send test requests

With the agent service running, send requests to the `/chat` endpoint. The following script sends a request and streams the response:


In [8]:
import json
import requests

SERVER_URL = "http://127.0.0.1:8000/chat"  # For local deployment.
HEADERS = {"Content-Type": "application/json"}

def chat(user_request: str, thread_id: str | None = None) -> None:
    """Send a chat request to the agent and stream the response."""
    payload = {"user_request": user_request}
    if thread_id:
        payload["thread_id"] = thread_id

    with requests.post(SERVER_URL, headers=HEADERS, json=payload, stream=True) as resp:
        resp.raise_for_status()
        # Capture thread_id for multi-turn conversations.
        server_thread = resp.headers.get("X-Thread-Id")
        if not thread_id and server_thread:
            print(f"[thread_id: {server_thread}]")
        # Stream SSE events.
        for line in resp.iter_lines():
            if not line:
                continue
            txt = line.decode("utf-8")
            if txt.startswith("data: "):
                txt = txt[len("data: "):]
            print(txt, flush=True)

# Test the agent.
chat("What's the weather in Palo Alto?")


[thread_id: ab9ecc2b-6ec5-48d3-bd80-7f6f09659734]
{"model": {"messages": [{"content": "", "additional_kwargs": {"refusal": null}, "response_metadata": {"token_usage": {"completion_tokens": 40, "prompt_tokens": 314, "total_tokens": 354, "completion_tokens_details": null, "prompt_tokens_details": null}, "model_provider": "openai", "model_name": "Qwen/Qwen3-4B-Instruct-2507-FP8", "system_fingerprint": null, "id": "chatcmpl-194a9eba-b073-4a35-9ee2-032c31de689f", "finish_reason": "tool_calls", "logprobs": null}, "type": "ai", "name": null, "id": "lc_run--0467b4b4-4865-4bd1-b3c2-058ac4a6cb94-0", "tool_calls": [{"name": "get_forecast", "args": {"latitude": 37.4419, "longitude": -122.1416}, "id": "chatcmpl-tool-71fdcbd9053941cd84f8b77dff82a719", "type": "tool_call"}], "invalid_tool_calls": [], "usage_metadata": {"input_tokens": 314, "output_tokens": 40, "total_tokens": 354, "input_token_details": {}, "output_token_details": {}}}]}}
{"tools": {"messages": [{"content": "Today:\nTemperature: 62°F
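The raw updates are verbose. As a sketch of how a client might condense them, the following hypothetical helper (`summarize_update`) pulls out tool calls and message text from one parsed update, assuming the `{"<node>": {"messages": [...]}}` shape shown in the output above and the `{"error": ..., "detail": ...}` shape the server emits on failure.

```python
import json


def summarize_update(update: dict) -> list[str]:
    """Turn one streamed update dict into short, human-readable lines."""
    # The server emits {"error": ..., "detail": ...} as a terminal event.
    if "error" in update:
        return [f"[error] {update['error']}: {update.get('detail', '')}"]

    lines = []
    for node, payload in update.items():
        if not isinstance(payload, dict):
            continue
        for msg in payload.get("messages", []):
            # Tool calls appear on AI messages when the model requests a tool.
            for call in msg.get("tool_calls") or []:
                lines.append(f"[{node}] call {call['name']}({json.dumps(call['args'])})")
            content = msg.get("content")
            if isinstance(content, str) and content:
                lines.append(f"[{node}] {content}")
    return lines
```

In the `chat` function, you could call `summarize_update(json.loads(txt))` instead of printing `txt` directly.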

### Step 10: Deploy the agent to production on Anyscale

After testing the agent locally, deploy it to Anyscale for production use. This creates a scalable, managed endpoint with enterprise features.

#### Why deploy to Anyscale

**Production benefits:**
- **Auto-scaling**: Automatically scales replicas based on request volume (0 to N replicas)
- **High availability**: Zero-downtime deployments with automatic failover  
- **Observability**: Built-in metrics, logs, and distributed tracing
- **Cost optimization**: Scale to zero when idle (with appropriate configuration)
- **Load balancing**: Distributes requests across multiple agent replicas
- **Fault isolation**: Agent, LLM, and tools run as separate services

#### Deploy the agent service

Run the following command to deploy your agent to Anyscale. This command packages your code and creates a production-ready service:



In [4]:
%%bash
anyscale service deploy ray_serve_agent_deployment:app --name agent_service_langchain

(anyscale +0.8s) Starting new service 'agent_service_langchain'.
(anyscale +1.5s) Uploading local dir '.' to cloud storage.
(anyscale +2.1s) Including workspace-managed pip dependencies.
(anyscale +3.0s) Service 'agent_service_langchain' deployed (version ID: bkr6yywq).
(anyscale +3.0s) View the service in the UI: 'https://console.anyscale.com/services/service2_ikfi286bzvx7929zhgwvucw2qt'
(anyscale +3.0s) Query the service once it's running using the following curl command (add the path you want to query):
(anyscale +3.0s) curl -H "Authorization: Bearer nZp2BEjdloNlwGyxoWSpdalYGtkhfiHtfXhmV4BQuyk" https://agent-service-langchain-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/


#### Understanding the deployment output

After running the deployment command, you'll receive:
- **Service URL**: The HTTPS endpoint for your agent (e.g., `https://agent-service-langchain-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com`)
- **Authorization token**: Bearer token for authenticating requests
- **Service UI link**: Direct link to monitor your service in the Anyscale console

#### Test the production agent

Once deployed, test your production agent with authenticated requests. Update the following code with your deployment details:




In [None]:
import json
import requests

base_url = "https://agent-service-langchain-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com"  # Replace with your service URL.
token = "nZp2BEjdloNlwGyxoWSpdalYGtkhfiHtfXhmV4BQuyk"  # Replace with your service bearer token.

SERVER_URL = f"{base_url}/chat"  # For Anyscale deployment.
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {token}",
}

def chat(user_request: str, thread_id: str | None = None) -> None:
    """Send a chat request to the agent and stream the response."""
    payload = {"user_request": user_request}
    if thread_id:
        payload["thread_id"] = thread_id

    with requests.post(SERVER_URL, headers=HEADERS, json=payload, stream=True) as resp:
        resp.raise_for_status()
        # Capture thread_id for multi-turn conversations.
        server_thread = resp.headers.get("X-Thread-Id")
        if not thread_id and server_thread:
            print(f"[thread_id: {server_thread}]")
        # Stream SSE events.
        for line in resp.iter_lines():
            if not line:
                continue
            txt = line.decode("utf-8")
            if txt.startswith("data: "):
                txt = txt[len("data: "):]
            print(txt, flush=True)

# Test the agent.
chat("What's the weather in Palo Alto?")

[thread_id: 83c20767-1b5e-4616-af0f-81808be5ffbc]
{"model": {"messages": [{"content": "", "additional_kwargs": {"refusal": null}, "response_metadata": {"token_usage": {"completion_tokens": 40, "prompt_tokens": 314, "total_tokens": 354, "completion_tokens_details": null, "prompt_tokens_details": null}, "model_provider": "openai", "model_name": "Qwen/Qwen3-4B-Instruct-2507-FP8", "system_fingerprint": null, "id": "chatcmpl-977afc75-ae84-47a6-921b-a4e20f83707f", "finish_reason": "tool_calls", "logprobs": null}, "type": "ai", "name": null, "id": "lc_run--b6da0939-6b77-498f-bc37-e310f5306709-0", "tool_calls": [{"name": "get_forecast", "args": {"latitude": 37.4419, "longitude": -122.1416}, "id": "chatcmpl-tool-ade83d4018e34dfda64ebe49c13a5313", "type": "tool_call"}], "invalid_tool_calls": [], "usage_metadata": {"input_tokens": 314, "output_tokens": 40, "total_tokens": 354, "input_token_details": {}, "output_token_details": {}}}]}}
{"tools": {"messages": [{"content": "Today:\nTemperature: 62°F
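For multi-turn conversations, the client needs to send back the `X-Thread-Id` it received on the first response. The following sketch wraps that bookkeeping in a hypothetical `ChatSession` class; the transport is injected as a callable so the logic can be exercised without a running service, and `requests_sender` is one assumed way to supply it.

```python
from typing import Callable, Iterable, Mapping

# A sender takes the JSON payload and returns (response headers, SSE lines).
Sender = Callable[[dict], tuple[Mapping[str, str], Iterable[str]]]


class ChatSession:
    """Remember the thread ID so follow-up questions share context."""

    def __init__(self, send: Sender) -> None:
        self._send = send
        self.thread_id: str | None = None

    def ask(self, user_request: str) -> list[str]:
        payload = {"user_request": user_request}
        if self.thread_id:
            payload["thread_id"] = self.thread_id
        headers, lines = self._send(payload)
        # Keep the server-assigned thread ID from the first response.
        if self.thread_id is None:
            self.thread_id = headers.get("X-Thread-Id")
        return list(lines)


def requests_sender(url: str, headers: dict) -> Sender:
    """Create a sender backed by the requests library."""

    def send(payload: dict):
        import requests  # Imported lazily; only needed for real calls.

        with requests.post(url, headers=headers, json=payload, stream=True) as resp:
            resp.raise_for_status()
            lines = [raw.decode("utf-8") for raw in resp.iter_lines() if raw]
            # resp.headers is case-insensitive, so "X-Thread-Id" matches.
            return resp.headers, lines

    return send
```

With the variables defined earlier, usage might look like `session = ChatSession(requests_sender(SERVER_URL, HEADERS))`, followed by `session.ask("What's the weather in Palo Alto?")` and then `session.ask("And tomorrow?")` on the same thread.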

## Next steps

You've successfully built, deployed, and tested a multi-tool agent using Ray Serve on Anyscale. This architecture demonstrates how to build production-ready AI applications with independent scaling, fault isolation, and dynamic tool discovery.

### Extend your agent

**Add more tools**  
Extend the MCP service with additional capabilities such as database queries, API integrations, or custom business logic. The MCP protocol allows your agent to discover new tools dynamically without code changes. For implementation examples, see the [Anyscale MCP Deployment Template](https://console.anyscale.com/template-preview/mcp-ray-serve).

**Swap or upgrade LLMs**  
Replace the Qwen model with other tool-calling models such as GPT-4, Claude, or Llama variants. Since the LLM runs as a separate service, you can A/B test different models or perform zero-downtime upgrades. For deployment patterns, see the [Anyscale LLM Serving Template](https://console.anyscale.com/template-preview/deployment-serve-llm).

**Build complex workflows**  
Implement sophisticated reasoning patterns with LangGraph, such as multi-agent collaboration, iterative refinement, or conditional branching based on tool outputs.

### Optimize for production

**Monitor performance**  
Use Anyscale's built-in observability to track:
- Request latency and token throughput
- GPU utilization and memory usage
- Tool call patterns and success rates
- Cost per request across services

For detailed metrics guidance, see [Monitor and debug Anyscale workloads](https://docs.anyscale.com/monitoring).

**Scale efficiently**  
Configure auto-scaling policies for each service independently:
- Scale the LLM service based on GPU utilization
- Scale the agent service based on request volume
- Scale tool services based on specific workload patterns

For details, see [Ray Serve autoscaling configuration](https://docs.anyscale.com/llm/serving/parameter-tuning#ray-serve-autoscaling-configuration).
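As a sketch of per-service autoscaling, the snippet below defines separate `autoscaling_config` dictionaries and applies one to a deployment through `.options()`. The numeric values and helper names are assumptions for illustration. Ray Serve's built-in autoscaler reacts to ongoing requests per replica, so that is the signal these settings tune.

```python
# Assumed values; tune them against your own traffic and latency targets.
AGENT_AUTOSCALING = {"min_replicas": 1, "max_replicas": 4, "target_ongoing_requests": 8}
TOOLS_AUTOSCALING = {"min_replicas": 1, "max_replicas": 2, "target_ongoing_requests": 16}


def check_autoscaling(cfg: dict) -> dict:
    """Catch obviously inconsistent settings before deploying."""
    if cfg["min_replicas"] > cfg["max_replicas"]:
        raise ValueError("min_replicas must not exceed max_replicas")
    if cfg["target_ongoing_requests"] <= 0:
        raise ValueError("target_ongoing_requests must be positive")
    return cfg


def apply_autoscaling(deployment, cfg: dict):
    """Return a copy of a Ray Serve deployment with the autoscaling config set."""
    return deployment.options(autoscaling_config=check_autoscaling(cfg))
```

For the agent, this could be used as `app = apply_autoscaling(LangGraphServeDeployment, AGENT_AUTOSCALING).bind()` in place of the plain `.bind()` call.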

### Production best practices

Anyscale services provide enterprise-grade features for running agents in production. Key capabilities include:

- **Zero-downtime deployments**: Update models or agent logic without interrupting service. See [Update an Anyscale service](https://docs.anyscale.com/services/update).

- **Multi-version management**: Deploy up to 10 versions behind a single endpoint for A/B testing and canary deployments. See [Deploy multiple versions of an Anyscale service](https://docs.anyscale.com/services/versions).

- **High availability**: Distribute replicas across availability zones with automatic failover. See [Configure head node fault tolerance](https://docs.anyscale.com/administration/resource-management/head-node-fault-tolerance).


For comprehensive guidance on production deployments, see the [Anyscale Services documentation](https://docs.anyscale.com/services) and [Ray Serve on the Anyscale Runtime](https://docs.anyscale.com/runtime/serve).






