# Deploying a medium-size LLM

This tutorial walks you through deploying a medium-size LLM using Ray Serve LLM.  

A medium-size model typically runs on a single node with 4-8 GPUs. For smaller models, see [Deploying a small-size LLM](#), and for larger models, see [Deploying a large-size LLM](#).

---

## Prerequisites

* Access to GPU compute.
* (Optional) A **Hugging Face token** if using gated models like Meta’s Llama. Set it as an environment variable with `export HF_TOKEN=<YOUR-TOKEN-HERE>`.

> Depending on the organization, you can usually request access on the model's Hugging Face page. Approval time varies: for Meta’s Llama models, for example, it can take anywhere from a few hours to several weeks.

**Dependencies:**  
```bash
pip install "ray[serve,llm]"
```

---

## Configure Ray Serve

You can configure your deployment using the Ray Serve LLM Python SDK for fast iteration, or use a Ray Serve config file for better integration and long-term maintainability in your systems.

Make sure to set your Hugging Face token in the config file to run gated models like `Llama-3.1`.

A medium-sized LLM can typically be deployed on a single node with multiple GPUs. To leverage all available GPUs, set `tensor_parallel_size` to the number of GPUs on the node, which distributes the model’s weights evenly across them.
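
As a rough sanity check on why a 70B-parameter model needs several GPUs, the sketch below estimates the weight memory each GPU holds under tensor parallelism. It assumes 16-bit weights (2 bytes per parameter) and a perfectly even split, and it ignores KV cache and activation memory. `weight_memory_per_gpu_gb` is a hypothetical helper for illustration, not part of Ray Serve LLM.

```python
def weight_memory_per_gpu_gb(
    num_params: float, bytes_per_param: int, tensor_parallel_size: int
) -> float:
    """Estimate the weight memory held by each GPU, in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_size / 1e9


# Assumption: 70e9 parameters stored in a 16-bit format, split across 8 GPUs
per_gpu = weight_memory_per_gpu_gb(70e9, bytes_per_param=2, tensor_parallel_size=8)
print(f"{per_gpu:.1f} GB of weights per GPU")
```

Under these assumptions each GPU holds about 17.5 GB of weights, which leaves the rest of a 40 GB card for KV cache and runtime overhead.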

::::{tab-set}

:::{tab-item} Python SDK

Ray Serve LLM provides multiple [Python APIs](https://docs.ray.io/en/latest/serve/api/index.html#llm-api) for defining your application. Use [`build_openai_app`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.build_openai_app.html#ray.serve.llm.build_openai_app) to build a full application from your [`LLMConfig`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) object.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70B",
        # Or Qwen/Qwen2.5-72B-Instruct for an ungated model
        model_source="meta-llama/Llama-3.1-70B-Instruct",
    ),
    accelerator_type="A100-40G",
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=2, max_replicas=4,
        )
    ),
    engine_kwargs=dict(
        max_model_len=32768,
        # Pass your Hugging Face token to the vLLM engine so it can download the gated Llama 3.1 weights.
        # If your model is not gated, you can omit `hf_token`.
        hf_token=os.environ["HF_TOKEN"],
        # Split weights among 8 GPUs in the node
        tensor_parallel_size=8
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})

serve.run(app, blocking=True)
```

:::

:::{tab-item} Serve Config (YAML)

In your Ray Serve config file:
```yaml
applications:
- name: my-medium-llm-app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
      - model_loading_config:
          model_id: my-llama-3.1-70B
          # Or Qwen/Qwen2.5-72B-Instruct for an ungated model
          model_source: meta-llama/Llama-3.1-70B-Instruct
        accelerator_type: A100-40G
        deployment_config:
          autoscaling_config:
            min_replicas: 2
            max_replicas: 4
        engine_kwargs:
          max_model_len: 32768
          # Pass your Hugging Face token to the workers so they can download the gated Llama 3.1 weights.
          # If your model is not gated, you can omit `hf_token`.
          hf_token: <YOUR-TOKEN-HERE>
          # Split weights among 8 GPUs in the node
          tensor_parallel_size: 8
```

Alternatively, Ray Serve LLM provides a user-friendly CLI to generate config files with `python -m ray.serve.llm.gen_config`. More info at [Serving LLM: Generate Config Files](https://docs.ray.io/en/latest/serve/llm/serving-llms.html#generate-config-files).

:::

::::

> Before moving to a production setup, we recommend switching to a [Serve config file](https://docs.ray.io/en/latest/serve/production-guide/config.html). This makes your deployment version-controlled, reproducible, and easier to maintain, for example in CI/CD pipelines.

---

## Deploy

There are different ways to deploy your service depending on how you defined it.  

::::{tab-set}

:::{tab-item} From the Python SDK

Follow the instructions at [Configure Ray Serve (Python SDK)](#configure-ray-serve) to define your app in a Python module.  
> Make sure your script runs your application with `serve.run(app, blocking=True)`.  

In a terminal, run:  
```bash
python serve_my_llama_3_1_70B.py
```

:::

:::{tab-item} From a Serve Config (YAML)  

Follow the instructions at [Configure Ray Serve (Serve Config)](#configure-ray-serve) to define your app with a Ray Serve config file.  

In a terminal, run:
```bash
serve run serve_my_llama_3_1_70B.yaml
```

:::

::::


---

## Sending Requests to your LLM

> Follow the [Deployment instructions](#deploy) to launch your application on your Ray Cluster.

Deployment typically takes a few minutes as the cluster is provisioned, the vLLM server starts, and the model is downloaded.
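
Because startup can take several minutes, it can help to poll the service until it responds before sending real traffic. The following is a minimal sketch, assuming the OpenAI-compatible app answers `GET /v1/models` once the model is loaded. `models_endpoint_ready` and `wait_until_ready` are hypothetical helpers, and the timeout and interval defaults are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def models_endpoint_ready(base_url: str, token: str = "FAKE_KEY") -> bool:
    """Return True if GET <base_url>/v1/models answers with HTTP 200."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet"
        return False


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float = 900.0, interval_s: float = 10.0
) -> bool:
    """Call `probe` until it returns True or roughly `timeout_s` seconds pass."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)
```

For a local deployment, `wait_until_ready(lambda: models_endpoint_ready("http://localhost:8000"))` would block until the endpoint responds or the timeout elapses. Taking the probe as a callable keeps the retry loop independent of how readiness is checked.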

**Retrieve your authentication token and endpoint**  
If running locally, your model is available at `<YOUR-ENDPOINT-HERE> = "http://localhost:8000"`, and you can use a placeholder authentication token: `<YOUR-TOKEN-HERE> = "FAKE_KEY"`.
  > **Note:** The OpenAI client requires an `api_key` value, but a local deployment **doesn't need** a real one, so any placeholder works.  

Otherwise, retrieve both your endpoint and authentication token from your deployment environment’s dashboard or logs.

**Send a request**  
Use the `model_id` defined in your config (here, `my-llama-3.1-70B`) to query your model.

::::{tab-set}

:::{tab-item} Example Curl
```bash
curl -X POST <YOUR-ENDPOINT-HERE>/v1/chat/completions \
  -H "Authorization: Bearer <YOUR-TOKEN-HERE>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-llama-3.1-70B",
        "messages": [{"role": "user", "content": "What is 2 + 2?"}]
      }'
```

:::

:::{tab-item} Example Python
```python
from urllib.parse import urljoin
from openai import OpenAI

api_key = "<YOUR-TOKEN-HERE>"  # for example, "FAKE_KEY" for a local deployment
base_url = "<YOUR-ENDPOINT-HERE>"  # for example, "http://localhost:8000"

client = OpenAI(base_url=urljoin(base_url, "v1"), api_key=api_key)

response = client.chat.completions.create(
    model="my-llama-3.1-70B",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```

:::

::::

---

## Enable LLM Monitoring

The *Serve LLM Dashboard* offers deep visibility into model performance, latency, and system behavior, including:

* Token throughput (tokens/sec)
* Latency metrics: Time To First Token (TTFT), Time Per Output Token (TPOT)
* KV cache utilization

To enable these metrics, go to your LLM config and set `log_engine_metrics: true`. Ensure vLLM V1 is active with `VLLM_USE_V1: "1"`. 
> `VLLM_USE_V1: "1"` is the default value with `ray >= 2.48.0` and can be omitted.
```yaml
applications:
- ...
  args:
    llm_configs:
      - ...
        runtime_env:
          env_vars:
            VLLM_USE_V1: "1"
        ...
        log_engine_metrics: true
```
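
If you define your app with the Python SDK instead of YAML, the same two settings can be passed as `LLMConfig` arguments. The sketch below only builds the keyword arguments as a plain dictionary. It assumes `LLMConfig` accepts `log_engine_metrics` and `runtime_env` under the same names as the YAML keys above, and `monitoring_kwargs` is a hypothetical variable name.

```python
# Assumption: these keys mirror the YAML fields shown above one-to-one.
monitoring_kwargs = dict(
    # Emit vLLM engine metrics (throughput, TTFT, TPOT, KV cache utilization)
    log_engine_metrics=True,
    # Only needed on Ray versions where vLLM V1 is not already the default
    runtime_env=dict(env_vars={"VLLM_USE_V1": "1"}),
)

# Merge into your existing config, for example:
# llm_config = LLMConfig(model_loading_config=..., **monitoring_kwargs)
print(monitoring_kwargs)
```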

---

## Improving Concurrency for LLM Inference

Ray Serve LLM uses [vLLM](https://docs.vllm.ai/en/latest/) as its backend engine, which logs the *maximum concurrency* it can support based on your configuration.  

Example log:
```bash
INFO 08-06 20:15:53 [executor_base.py:118] Maximum concurrency for 8192 tokens per request: 3.53x
```

Here are a few ways to improve concurrency depending on your model and hardware:  

**Reduce `max_model_len`**  
Lowering `max_model_len` reduces the memory needed for KV cache.

> *Example*:  
> Running Llama-3.1-8B on an A10G or L4 GPU:
> * `max_model_len = 8192` → concurrency ≈ 3.5
> * `max_model_len = 4096` → concurrency ≈ 7
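
The relationship in this example can be sketched as simple arithmetic. The snippet assumes the reported concurrency is approximately the number of tokens that fit in the KV cache divided by `max_model_len`, and it backs out the cache capacity from the example log line above. `estimated_concurrency` is a hypothetical helper, not a vLLM or Ray API.

```python
def estimated_concurrency(kv_cache_tokens: float, max_model_len: int) -> float:
    """Approximate how many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


# Assumption: derive the KV cache capacity from the log line "8192 tokens ... 3.53x"
kv_cache_tokens = 3.53 * 8192

for max_model_len in (8192, 4096):
    concurrency = estimated_concurrency(kv_cache_tokens, max_model_len)
    print(f"max_model_len={max_model_len}: concurrency ~ {concurrency:.2f}")
```

Halving `max_model_len` roughly doubles the estimate, which matches the 3.5 and 7 figures in the example.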

**Use Quantized Models**  
Quantizing your model (for example, to FP8) reduces the model's memory footprint, freeing up memory for more KV cache and enabling more concurrent requests.
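
To get a feel for the size of this effect, the sketch below estimates how many extra KV cache tokens the freed memory could hold across the whole node. The architecture numbers (80 layers, 8 key-value heads, head dimension 128) are assumptions based on the public Llama 3.1 70B configuration, the KV cache is assumed to stay in a 16-bit format, and `kv_cache_bytes_per_token` is a hypothetical helper. Real gains depend on engine settings and runtime overhead.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache needed to store one token across all layers."""
    # Factor of 2: each layer stores one key tensor and one value tensor
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


num_params = 70e9
# Going from 16-bit weights (2 bytes) to FP8 (1 byte) frees 1 byte per parameter
freed_bytes = num_params * (2 - 1)

per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
extra_tokens = int(freed_bytes // per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Extra KV cache tokens after quantizing: {extra_tokens:,}")
```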

**Use Pipeline Parallelism**  
Distribute the model's layers across multiple nodes with `pipeline_parallel_size > 1`.

**Upgrade to GPUs with more memory**  
Some GPUs provide significantly more room for KV cache and allow for higher concurrency out of the box.

**Scale with more Replicas**  
In addition to tuning per-GPU concurrency, you can scale *horizontally* by increasing the number of replicas in your config.  
Each replica runs on its own set of GPUs (8 per replica with `tensor_parallel_size: 8`), so raising the replica count increases the total number of concurrent requests your service can handle, especially under sustained or bursty traffic.
```yaml
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 4
```
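
When planning replica counts, keep in mind that each replica needs its own full set of GPUs. The sketch below computes the cluster-wide GPU requirement, assuming each replica uses `tensor_parallel_size * pipeline_parallel_size` GPUs. `gpus_required` is a hypothetical helper for illustration.

```python
def gpus_required(
    replicas: int, tensor_parallel_size: int = 1, pipeline_parallel_size: int = 1
) -> int:
    """Total GPUs needed to run `replicas` copies of the model."""
    return replicas * tensor_parallel_size * pipeline_parallel_size


# Autoscaling bounds from the snippet above, with the 8-way tensor parallelism
# used earlier in this tutorial
for replicas in (1, 4):
    total = gpus_required(replicas, tensor_parallel_size=8)
    print(f"{replicas} replica(s) -> {total} GPUs")
```

With these settings, scaling from 1 to 4 replicas means the cluster must be able to provide between 8 and 32 GPUs.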

*For more details on tuning strategies and hardware guidance, see this [GPU Selection Guide for LLM Serving](#).*

---

## Troubleshooting

**Hugging Face Auth Errors**  
Some models, such as Llama-3, are gated and require prior authorization from the organization. See your model’s documentation for instructions on obtaining access.
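
A missing or empty token is another common cause of these errors. As a minimal sketch, you could check for the token before building your config so the failure is explicit. `require_hf_token` is a hypothetical helper, and it only verifies that `HF_TOKEN` is set, not that the token has been granted access to the model.

```python
import os
from typing import Mapping


def require_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Hugging Face token from `env`, or raise a descriptive error."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Export it before deploying a gated model, "
            "for example: export HF_TOKEN=<YOUR-TOKEN-HERE>"
        )
    return token
```

You could then pass `hf_token=require_hf_token()` in `engine_kwargs` instead of reading `os.environ["HF_TOKEN"]` directly.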

**Out-Of-Memory Errors**  
Out-of-memory (OOM) errors are one of the most common failure modes when deploying LLMs, especially as model sizes and context lengths increase.  
See this [Troubleshooting Guide](#) for common errors and how to fix them.

---

## Summary

In this tutorial, you deployed a medium-size LLM with Ray Serve LLM, from development to production. You learned how to configure Ray Serve LLM, deploy your service on your Ray cluster, and send requests. You also learned how to monitor your app and troubleshoot common issues.