# Deploy a Medium-Sized LLM with Ray Serve LLM

© 2025, Anyscale. All Rights Reserved


💻 **Launch Locally**: You can run this notebook locally, but you'll need access to multiple GPUs.

🚀 **Launch on Cloud**: A Ray Cluster with 4-8 GPUs (Click [here](http://console.anyscale.com/register) to easily start a Ray cluster on Anyscale) is recommended to run this notebook.


This notebook demonstrates how to deploy a medium-sized LLM using Ray Serve LLM. We'll walk through the complete process from configuration to production deployment, covering both local development and cloud deployment with Anyscale Services.

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li>Overview: Why Medium-Sized Models?</li>
    <li>Setting up Ray Serve LLM</li>
    <li>Local Deployment & Inference</li>
    <li>Deploying to Anyscale Services</li>
    <li>Advanced Topics: Monitoring & Optimization</li>
    <li>Summary & Outlook</li>
</ul>
</div>


## Overview: Why Medium-Sized Models?

A medium LLM typically runs on a single node with 4-8 GPUs. It offers a balance between performance and efficiency. These models provide stronger accuracy and reasoning than small models while remaining more affordable and resource-friendly than very large ones.

### Model Size Comparison

Let's understand how different model sizes compare:

| Model Size | Parameters | Memory (FP16) | Typical Use Case | Hardware Requirements |
|------------|------------|---------------|------------------|----------------------|
| **Small** | 7B-13B | 14-26 GB | Prototyping, simple tasks | 1-2 GPUs |
| **Medium** | 70B-80B | 140-160 GB | Production workloads, complex reasoning | 4-8 GPUs |
| **Large** | 400B+ | 800+ GB | Research, maximum capability | Multiple nodes |
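The Memory (FP16) column follows from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces the table values. The helper name `weight_memory_gb` is illustrative and not part of any library, and the result covers weights only; the KV cache and activations need additional memory on top.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    FP16/BF16 store each parameter in 2 bytes, so the default reproduces
    the "Memory (FP16)" column. KV cache and activations are not included.
    """
    # (params_billions * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return params_billions * bytes_per_param


for label, size in [("Small", 7), ("Medium", 70), ("Large", 400)]:
    print(f"{label}: {size}B parameters -> ~{weight_memory_gb(size):.0f} GB in FP16")
```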

### Why Choose Medium-Sized Models?

**Advantages:**
- **Balanced Performance**: Strong accuracy and reasoning capabilities
- **Cost-Effective**: More affordable than very large models
- **Resource Efficient**: Can run on single-node multi-GPU setups
- **Production Ready**: Ideal for scaling applications where large models would be too slow or expensive

**Perfect for:**
- Production workloads requiring good quality at lower cost
- Applications needing stronger reasoning than small models
- Scaling scenarios where large models are too resource-intensive

### Our Example: Llama-3.1-70B

In this tutorial, we'll deploy **Meta's Llama-3.1-70B-Instruct** model, which:
- Has 70 billion parameters
- Requires ~140 GB of GPU memory for the weights alone in FP16 (70B parameters × 2 bytes each)
- Needs 4-8 GPUs for efficient serving
- Provides excellent reasoning and instruction-following capabilities

### Related Examples

- **Small Models**: [Deploy a small-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/small-size-llm/README.html) - 1-2 GPUs
- **Large Models**: [Deploy a large-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/large-size-llm/README.html) - Multiple nodes
- **Workspace Template**: [Run on Anyscale](https://console.anyscale.com/template-preview/deployment-serve-llm?file=%252Ffiles%252Fmedium-size-llm)


## Setting up Ray Serve LLM

Ray Serve LLM provides multiple [Python APIs](https://docs.ray.io/en/latest/serve/api/index.html#llm-api) for defining your application. The main abstractions we'll work with are:

### Key Components

1. **`LLMConfig`**: Configuration object that defines your model, hardware, and deployment settings
2. **`build_openai_app`**: Public function that creates an OpenAI-compatible application from your configuration
3. **Ray Serve**: The underlying orchestration layer that handles scaling and load balancing

### Configuration for Medium-Sized Models

For medium-sized models, we need to:
- Set appropriate `accelerator_type` for the hardware
- Configure **tensor parallelism** with `tensor_parallel_size` to match the number of GPUs

Let's create our configuration:


In [None]:
# serve_llama_3_1_70b.py
from ray.serve.llm import LLMConfig, build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",
        # Or unsloth/Meta-Llama-3.1-70B-Instruct for an ungated model
        model_source="meta-llama/Llama-3.1-70B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=4,
        )
    ),
    accelerator_type="L40S",  # Or another GPU type with similar VRAM, such as "A100-40G"
    # Type `export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>` in a terminal
    runtime_env=dict(env_vars={"HF_TOKEN": os.environ.get("HF_TOKEN")}),
    engine_kwargs=dict(
        max_model_len=32768, # See model's Hugging Face card for max context length
        # Split weights among 8 GPUs in the node
        tensor_parallel_size=8,
    ),
    log_engine_metrics=True,
)

app = build_openai_app({"llm_configs": [llm_config]})

### Configuration Breakdown

Let's understand each part of our configuration:

**Model Loading:**
- `model_id`: Unique identifier for your model in the API
- `model_source`: Hugging Face model path (gated model requires HF token)
- `HF_TOKEN`: Hugging Face token for accessing gated models

**Hardware Configuration:**
- `accelerator_type`: GPU type (L40S, A100-40G, etc.)
- `tensor_parallel_size`: Number of GPUs to split the model across

**Deployment Settings:**
- `autoscaling_config`: Min/max replicas for horizontal scaling

**Monitoring:**
- `log_engine_metrics`: Log LLM-specific engine metrics (Time to First Token, Time Per Output Token, Requests Per Second, and so on)
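
Before settling on `tensor_parallel_size`, it can help to sanity-check the layout. The sketch below is a rough check under stated assumptions: ~140 GB of FP16 weights, 48 GB per L40S, and 64 attention heads (a value taken from the model's Hugging Face config, so verify it for your model). vLLM generally expects the attention head count to be divisible by the tensor-parallel size. The `ShardPlan` and `plan_tensor_parallel` names are hypothetical helpers, not Ray or vLLM APIs.

```python
from dataclasses import dataclass


@dataclass
class ShardPlan:
    weights_per_gpu_gb: float  # weight shard held by each GPU
    heads_divisible: bool      # attention heads split evenly across GPUs
    fits: bool                 # the weight shard alone fits in one GPU


def plan_tensor_parallel(
    weights_gb: float,
    tensor_parallel_size: int,
    gpu_memory_gb: float,
    num_attention_heads: int,
) -> ShardPlan:
    """Rough sanity check for a tensor-parallel layout (weights only)."""
    per_gpu = weights_gb / tensor_parallel_size
    return ShardPlan(
        weights_per_gpu_gb=per_gpu,
        heads_divisible=num_attention_heads % tensor_parallel_size == 0,
        fits=per_gpu < gpu_memory_gb,
    )


# Assumed values: ~140 GB of weights, 48 GB per GPU, 64 attention heads.
for tp in (2, 4, 8):
    print(tp, plan_tensor_parallel(140, tp, 48, 64))
```

A shard that "fits" still needs headroom for the KV cache, so a larger `tensor_parallel_size` leaves more memory per GPU for concurrent requests.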

## Local Deployment & Inference

Now let's deploy our medium-sized LLM locally and query it.

### Prerequisites

**Hardware Requirements:**
- Access to 4-8 GPUs (L40S, A100-40G, or similar) with enough combined GPU memory for the 70B model (~140 GB of weights in FP16, plus room for the KV cache)

**Software Requirements:**
- Ray Serve LLM
- For gated models, a Hugging Face token with authorization to access the model

**Installation:**
```bash
pip install "ray[serve,llm]"
```

**Hugging Face Token:**
```bash
export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>
```
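
Because the configuration reads the token with `os.environ.get("HF_TOKEN")`, an unset variable silently becomes `None`. A minimal sketch of an early check follows; `hf_env_vars` is a hypothetical helper name, and the token string in the demo is a placeholder.

```python
import os
from typing import Mapping


def hf_env_vars(environ: Mapping[str, str]) -> dict:
    """Build the env_vars mapping, failing early if HF_TOKEN is missing."""
    token = environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; run `export HF_TOKEN=...` first.")
    return {"HF_TOKEN": token}


# Against the real environment you would call: hf_env_vars(os.environ)
# The placeholder mapping below only demonstrates the return shape.
print(hf_env_vars({"HF_TOKEN": "hf_example_token"}))
```

The returned dictionary could be passed as `env_vars` inside `runtime_env` in place of the inline `os.environ.get` call.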

### Launching Ray Serve

Let's start our LLM service:


In [None]:
!serve run serve_llama_3_1_70b:app --non-blocking
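
Loading a 70B model can take several minutes, so a script that sends requests may want to confirm the endpoint responds first. The sketch below polls a URL until it returns HTTP 200. It assumes the OpenAI-compatible app exposes a `/v1/models` route; `http_probe` and `wait_until_ready` are hypothetical helper names, and the timeout values are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    url: str,
    timeout_s: float = 1800.0,
    interval_s: float = 10.0,
    probe: Callable[[str], bool] = http_probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until the probe succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example (requires the server to be starting or running):
# wait_until_ready("http://localhost:8000/v1/models")
```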

### Sending Requests

Once deployed, your endpoint is available at `http://localhost:8000`. You can use a placeholder authentication token like `"FAKE_KEY"`.

Let's test our model with some example requests:


In [None]:
from urllib.parse import urljoin
from openai import OpenAI

API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

response = client.chat.completions.create(
    model="my-llama-3.1-70b",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
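
The OpenAI client above is a thin wrapper over plain HTTP. As a minimal sketch using only the standard library, the same (non-streaming) request can be built by hand; it assumes the standard OpenAI-compatible `/v1/chat/completions` route and response shape. `build_chat_request` and `send_chat_request` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(req: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("http://localhost:8000", "FAKE_KEY", "my-llama-3.1-70b", "Tell me a joke")
print(req.get_method(), req.full_url)
# With the local service running:
# print(send_chat_request(req))
```

This form is handy for debugging with tools that do not have the `openai` package installed.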

### Shutting Down

When you're done testing, shut down the service:


In [None]:
!serve shutdown -y

## Deploying to Anyscale Services

For production deployment, we'll use Anyscale Services to deploy our Ray Serve app to a dedicated cluster. The great news is that **no code changes are needed** - we can use the exact same LLM configuration!

### What is an Anyscale Service?

An **Anyscale Service** is a managed deployment that provides:
- **Dedicated Infrastructure**: Your own Ray cluster in the cloud
- **Automatic Scaling**: Handles traffic spikes and load balancing
- **Fault Tolerance**: Resilient against node failures and rolling updates
- **Enterprise Features**: Security, monitoring, and compliance

### Setting up the Configuration File

Let's create the service configuration:
```yaml
# service.yaml
name: deploy-llama-3-70b
image_uri: anyscale/ray-llm:2.49.0-py311-cu128 # Anyscale Ray Serve LLM image. Use `containerfile: ./Dockerfile` to use a custom Dockerfile.
compute_config:
  auto_select_worker_config: true 
working_dir: .
cloud:
applications:
  # Point to your app in your Python module
  - import_path: serve_llama_3_1_70b:app
```


### Launching the Service

Now let's deploy our service to Anyscale:


In [None]:
!anyscale service deploy -f service.yaml --env HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>

### Running Inference on Anyscale

Once deployed, you'll get an endpoint and authentication token. Let's see how to use them:


In [None]:
from openai import OpenAI

client = OpenAI(
    base_url="https://deploy-llama-3-70b-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/v1",
    api_key="2YKUt_IJZ8q8GWT5VPHVitzsHKsddoL6mSszJxzwe5A"
)

response = client.chat.completions.create(
    model="my-llama-3.1-70b",
    messages=[{"role": "user", "content": "Tell me about Anyscale!"}],
    stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
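
For anything beyond a quick test, it is usually safer to keep the endpoint and token out of the source. The sketch below reads them from environment variables; the names `ANYSCALE_SERVICE_URL` and `ANYSCALE_SERVICE_TOKEN` are hypothetical choices for this example, not variables that Anyscale sets for you.

```python
import os
from typing import Mapping, Tuple

# Hypothetical variable names; export them yourself after deploying.
URL_VAR = "ANYSCALE_SERVICE_URL"
TOKEN_VAR = "ANYSCALE_SERVICE_TOKEN"


def load_service_credentials(environ: Mapping[str, str]) -> Tuple[str, str]:
    """Return (base_url, api_key), with base_url normalized to end in /v1."""
    missing = [name for name in (URL_VAR, TOKEN_VAR) if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    base_url = environ[URL_VAR].rstrip("/")
    if not base_url.endswith("/v1"):
        base_url += "/v1"
    return base_url, environ[TOKEN_VAR]


# Against the real environment: base_url, api_key = load_service_credentials(os.environ)
# The placeholder mapping below only demonstrates the normalization.
print(load_service_credentials({URL_VAR: "https://example.invalid/", TOKEN_VAR: "placeholder"}))
```

The returned pair can be passed directly as `base_url` and `api_key` to the `OpenAI` client.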

### Shutting Down the Service

When you're done with your service:


In [None]:
!anyscale service terminate -n deploy-llama-3-70b

## Advanced Topics: Monitoring & Optimization

Now let's explore advanced features for production deployments.

### Enabling LLM Monitoring

The Serve LLM Dashboard offers deep visibility into model performance. Let's enable comprehensive monitoring:


In [None]:
# serve_llama_3_1_70b.py
from ray.serve.llm import LLMConfig, build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",
        model_source="meta-llama/Llama-3.1-70B-Instruct",
    ),
    accelerator_type="L40S",
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=4,
        )
    ),
    runtime_env=dict(
        env_vars={
            "HF_TOKEN": os.environ.get("HF_TOKEN"),
        }
    ),
    engine_kwargs=dict(
        max_model_len=32768,
        tensor_parallel_size=8,
    ),
    # Enable detailed engine metrics
    log_engine_metrics=True
)

app = build_openai_app({"llm_configs": [llm_config]})

Anyscale provides an easy way to visualize your LLM metrics on an integrated Grafana dashboard.

On your Anyscale Workspace or Service page, go to Metrics, then click on the View on Grafana dropdown and select Ray Serve LLM Dashboard.

In [None]:
!serve run serve_llama_3_1_70b:app --non-blocking

Remember to shut down the service when you're done:

In [None]:
!serve shutdown -y

### Improving Concurrency

Ray Serve LLM uses vLLM as its backend engine, which logs the maximum concurrency it can support.  
Example log for 8xL40S:
```console
INFO 08-19 20:57:37 [kv_cache_utils.py:837] Maximum concurrency for 32,768 tokens per request: 17.79x
```
Let's explore optimization strategies:

### Concurrency Optimization Strategies

Below are key strategies to improve model concurrency and performance when serving LLMs.

---

#### 1. Reduce `max_model_len`

* `32,768` tokens → concurrency ≈ **18**
* `16,384` tokens → concurrency ≈ **36**
* **Trade-off:** shorter context window but higher concurrency
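
The numbers above can be derived from the logged value. The reported concurrency is roughly the KV cache capacity in tokens divided by `max_model_len`, so halving the context length doubles the estimate. This is a back-of-the-envelope sketch that assumes the KV cache capacity stays fixed when `max_model_len` changes; the function names are illustrative.

```python
def kv_cache_token_capacity(logged_concurrency: float, max_model_len: int) -> float:
    """Recover the KV cache capacity (in tokens) implied by the vLLM log line."""
    return logged_concurrency * max_model_len


def estimated_concurrency(token_capacity: float, max_model_len: int) -> float:
    """Number of full-length requests that fit in the KV cache at once."""
    return token_capacity / max_model_len


# Values from the example log: 17.79x at 32,768 tokens per request.
capacity = kv_cache_token_capacity(17.79, 32_768)
for length in (32_768, 16_384, 8_192):
    print(f"max_model_len={length}: ~{estimated_concurrency(capacity, length):.2f}x")
```

Real requests are often shorter than `max_model_len`, so observed concurrency can be higher than this worst-case figure.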

---

#### 2. Use Quantized Models

* **FP16 → FP8:** ~50% reduction in weight memory
* **FP16 → INT4:** ~75% reduction in weight memory
* Frees up memory for the KV cache, enabling more concurrent requests
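
To get a feel for how much KV cache the freed memory buys, the sketch below converts saved weight bytes into extra cacheable tokens. The architecture values (80 layers, 8 KV heads, head dimension 128) are assumptions taken from the Llama-3.1-70B Hugging Face config, the KV cache is assumed to stay in FP16, and quantization overheads such as scales are ignored. The function names are illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def extra_kv_tokens(params_billions: float, old_bytes_per_param: float,
                    new_bytes_per_param: float, per_token_bytes: int) -> int:
    """Extra tokens that fit in the memory freed by quantizing the weights."""
    freed_bytes = params_billions * 1e9 * (old_bytes_per_param - new_bytes_per_param)
    return int(freed_bytes // per_token_bytes)


# Assumed Llama-3.1-70B architecture values (check the model config).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print("KV bytes per token:", per_token)
print("FP16 -> FP8 extra tokens:", extra_kv_tokens(70, 2.0, 1.0, per_token))
print("FP16 -> INT4 extra tokens:", extra_kv_tokens(70, 2.0, 0.5, per_token))
```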

---

#### 3. Enable Pipeline Parallelism

* Distribute the model's layers across multiple nodes by setting `pipeline_parallel_size > 1`
* This increases the total KV cache size, at the cost of higher latency from inter-node communication

---

#### 4. Scale with More Replicas

* Horizontally scale across multiple nodes
* Each replica runs an independent model instance
* **Total concurrency = per-replica concurrency × number of replicas**
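
As a worked example of the formula, the sketch below combines the logged per-replica concurrency with the `max_replicas=4` and `tensor_parallel_size=8` settings used earlier, and also reports how many GPUs that implies. `fleet_footprint` is a hypothetical helper, and the concurrency figure is the worst-case estimate from the example log.

```python
def fleet_footprint(
    per_replica_concurrency: float,
    replicas: int,
    tensor_parallel_size: int,
    pipeline_parallel_size: int = 1,
) -> dict:
    """Total concurrency and GPU count when scaling out identical replicas."""
    gpus_per_replica = tensor_parallel_size * pipeline_parallel_size
    return {
        "total_concurrency": per_replica_concurrency * replicas,
        "gpus": replicas * gpus_per_replica,
    }


# 17.79x per replica (example log), scaled from 1 to 4 replicas of 8 GPUs each.
for n in (1, 2, 4):
    print(n, fleet_footprint(17.79, replicas=n, tensor_parallel_size=8))
```

Each added replica needs a full set of GPUs, so autoscaling limits are effectively bounded by the GPUs available in the cluster.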

---

#### 5. Upgrade Hardware

* Example: **L40S (48 GB) → A100 (80 GB)**
* More GPU memory allows higher concurrency
* Faster interconnects (e.g., NVLink) reduce latency
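
A rough way to compare hardware options is to estimate the memory left for the KV cache after the weights are loaded. The sketch below assumes vLLM's default `gpu_memory_utilization` of 0.9 and ~140 GB of FP16 weights, and it ignores activation memory and other runtime overhead, so treat the results as upper bounds. `kv_budget_gb` is an illustrative helper name.

```python
def kv_budget_gb(
    num_gpus: int,
    gpu_memory_gb: float,
    weights_gb: float,
    gpu_memory_utilization: float = 0.9,
) -> float:
    """Upper bound on KV cache memory across the node, in GB."""
    usable = num_gpus * gpu_memory_gb * gpu_memory_utilization
    return usable - weights_gb


l40s = kv_budget_gb(8, 48, 140)
a100 = kv_budget_gb(8, 80, 140)
print(f"8x L40S (48 GB): ~{l40s:.1f} GB for KV cache")
print(f"8x A100 (80 GB): ~{a100:.1f} GB for KV cache")
print(f"Ratio: ~{a100 / l40s:.2f}x")
```

Under these assumptions, moving to 80 GB GPUs roughly doubles the KV cache budget, even though per-GPU memory grows by less than 2x, because the weights take a fixed share.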


## Summary & Outlook

Congratulations! You've successfully learned how to deploy a medium-sized LLM with Ray Serve LLM. Let's summarize what we've covered and look ahead to other possibilities.

### What We Accomplished

**Module 2 Summary:**
1. **Overview**: Understood why medium-sized models (70B parameters) are ideal for production workloads
2. **Configuration**: Set up Ray Serve LLM with tensor parallelism across 8 GPUs
3. **Local Deployment**: Deployed locally and tested with various inference scenarios
4. **Anyscale Services**: Deployed to production with zero code changes
5. **Advanced Topics**: Enabled monitoring, optimized concurrency

### Key Takeaways

- **No Code Changes**: Same configuration works locally and in production
- **Tensor Parallelism**: Essential for medium models to distribute across multiple GPUs
- **Production Ready**: Anyscale Services provide enterprise-grade deployment
- **Monitoring**: Comprehensive dashboards for performance optimization
- **Scalability**: Multiple optimization strategies for different use cases

### Related Examples & Templates

Ray provides many more examples for different scenarios:

**Ray Documentation Examples:**
- [Deploy a small-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/small-size-llm/README.html) - 1-2 GPUs, prototyping
- [Deploy a large-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/large-size-llm/README.html) - Multiple nodes, research

**Anyscale Workspace Templates:**
- [Anyscale LLM Deployment Templates](https://console.anyscale.com/template-preview/deployment-serve-llm) - Ready-to-run examples

### How Other Sizes Differ

Now that you've seen a medium model deployment, here's how other sizes would differ:

**Small Models (7B-13B):**
- No tensor parallelism needed
- Single GPU deployment
- Faster startup time
- Lower concurrency but simpler setup

**Large Models (400B+):**
- Pipeline parallelism across multiple nodes
- More complex infrastructure requirements
- Higher costs but maximum capability
- Research and specialized use cases

### Next Steps

Ready to explore more? Consider:
1. **Try different model sizes** - Deploy small or large models
2. **Experiment with optimizations** - Test quantization and concurrency tuning
3. **Build applications** - Create end-to-end AI applications
4. **Explore advanced features** - Multi-model deployments, custom endpoints

### Resources

- [Ray Serve LLM Documentation](https://docs.ray.io/en/latest/serve/llm/index.html)
- [Anyscale LLM Serving Guide](https://docs.anyscale.com/llm/serving)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Ray Community Forum](https://discuss.ray.io/)


You now have the knowledge to deploy medium-sized LLMs in production with Ray Serve LLM!