# Deploying a Reasoning LLM

This tutorial walks you through deploying a reasoning LLM using Ray Serve LLM.  

---

## Prerequisites

* Access to GPU compute.
* (Optional) A **Hugging Face token** if using gated models like Meta’s Llama. Export it as an environment variable: `export HF_TOKEN=<YOUR-TOKEN-HERE>`

> Depending on the organization, you can usually request access on the model's Hugging Face page. For example, approval for Meta’s Llama models can take anywhere from a few hours to several weeks.

**Dependencies:**  
```bash
pip install "ray[serve,llm]"
```

---

## Distinction from non-reasoning models

Reasoning models are designed to simulate step-by-step, structured thought processes to solve complex tasks like math, multi-hop QA, or code generation. In contrast, non-reasoning models aim for fast, direct responses and are typically trained for fluency or instruction following without explicit intermediate reasoning. The key distinction lies in whether the model attempts to "think through" the problem before answering.

| **Model Type**          | **Core Behavior**                    | **Use Case Examples**                                    | **Limitation**                                        |
| ----------------------- | ------------------------------------ | -------------------------------------------------------- | ----------------------------------------------------- |
| **Reasoning Model**     | Explicit multi-step thinking process | Math, coding, logic puzzles, multi-hop QA, CoT prompting | Slower response time, more tokens used                |
| **Non-Reasoning Model** | Direct answer generation             | Casual queries, short instructions, single-step answers  | May struggle with complex reasoning or explainability |

Many reasoning-capable models structure their outputs with special markers such as `<think>` tags, or expose reasoning traces inside dedicated fields like `reasoning_content` in the OpenAI API response. Always check the model's documentation to see how thinking is structured and controlled.

> Reasoning LLMs often benefit from long context windows (32K to 200K tokens), high token throughput, low-temperature decoding (such as greedy sampling), and strong instruction tuning or scratchpad-style reasoning.

---

### When to use a reasoning model?

Whether you should use a reasoning model depends on how much information your prompt already provides.

If your input is clear and complete, a standard model is usually faster and more efficient.  
If your input is ambiguous or complex, a reasoning model is better suited: it can work through the problem step by step and fill in gaps through intermediate reasoning.

---

## Parsing Reasoning Outputs

Reasoning models often separate *reasoning* from the *final answer* using tags like `<think>...</think>`. Without a proper parser, this reasoning may end up in the `content` field instead of the dedicated `reasoning_content` field.

To extract reasoning correctly, configure a `reasoning_parser` in your Ray Serve deployment. This tells vLLM how to isolate the model’s thought process from the rest of the output.
> For example, *QwQ* uses the `deepseek_r1` parser. Other models may require different parsers. See the [vLLM docs](https://docs.vllm.ai/en/stable/features/reasoning_outputs.html#supported-models) or your model's documentation to find a supported parser, or [build your own](https://docs.vllm.ai/en/stable/features/reasoning_outputs.html#how-to-support-a-new-reasoning-model) if needed.

```yaml
applications:
- name: reasoning-llm-app
  ...
  args:
    llm_configs:
      - model_loading_config:
          model_id: my-qwq-32B
          model_source: Qwen/QwQ-32B
        ...
        engine_kwargs:
          ...
          reasoning_parser: deepseek_r1 # <-- for QwQ models
```

See [Configure Ray Serve](#configure-ray-serve) for a complete example.

**Example Response**  
When using a reasoning parser, the response is typically structured like this:

```python
ChatCompletionMessage(
    content="The temperature is...",
    ...,
    reasoning_content="Okay, the user is asking for the temperature today and tomorrow..."
)
```
You can extract the content and reasoning like this:
```python
response = client.chat.completions.create(
  ...
)

print(f"Content: {response.choices[0].message.content}")
print(f"Reasoning: {response.choices[0].message.reasoning_content}")
```
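
Because `reasoning_content` is only populated when a reasoning parser is configured, client code may want to read it defensively. The following is a minimal sketch of that idea; `split_message` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins that only mimic the shape of `response.choices[0].message`.

```python
from types import SimpleNamespace


def split_message(message):
    """Return (reasoning, answer) from a chat completion message.

    `reasoning_content` may be missing or None when no reasoning parser
    is configured, so fall back to an empty string instead of raising.
    """
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = getattr(message, "content", None) or ""
    return reasoning, answer


# Stand-in objects that mimic `response.choices[0].message`.
with_parser = SimpleNamespace(content="4", reasoning_content="2 + 2 = 4")
without_parser = SimpleNamespace(content="<think>2 + 2 = 4</think>4")

print(split_message(with_parser))
print(split_message(without_parser))
```

With a real response, you would call `split_message(response.choices[0].message)`.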

---

## Configure Ray Serve

You can configure your deployment using the Ray Serve LLM Python SDK for fast iteration, or use a Ray Serve config file for better integration and long-term maintainability in your systems.

Make sure to set your Hugging Face token in the config file to run gated models.

This example sets `tensor_parallel_size=8` to distribute the model's weights across 8 GPUs in the node.
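
To get a rough feel for the resources the configuration in this section asks for, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not an official sizing formula: it assumes each replica reserves `tensor_parallel_size` GPUs, that the model has roughly 32 billion parameters (inferred from its name), and that weights are stored in 16 bits. The helper names are hypothetical, and the estimate ignores KV cache and activation memory.

```python
def gpus_required(tensor_parallel_size: int, num_replicas: int,
                  pipeline_parallel_size: int = 1) -> int:
    """GPUs needed when every replica holds one full sharded copy of the model."""
    return tensor_parallel_size * pipeline_parallel_size * num_replicas


def weight_gib_per_gpu(num_params: float, bytes_per_param: int,
                       tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU, ignoring KV cache and activations."""
    return num_params * bytes_per_param / tensor_parallel_size / 2**30


# Values taken from the configuration in this section; 32e9 parameters and
# 2 bytes per parameter are assumptions.
total_gpus = gpus_required(tensor_parallel_size=8, num_replicas=2)
per_gpu = weight_gib_per_gpu(num_params=32e9, bytes_per_param=2,
                             tensor_parallel_size=8)

print(f"GPUs needed: {total_gpus}")
print(f"Weights per GPU: ~{per_gpu:.1f} GiB")
```

The remaining memory on each GPU is what is available for the KV cache, which matters for long reasoning traces at `max_model_len=32768`.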

::::{tab-set}

:::{tab-item} Python SDK

Ray Serve LLM provides multiple [Python APIs](https://docs.ray.io/en/latest/serve/api/index.html#llm-api) for defining your application. Use [`build_openai_app`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.build_openai_app.html#ray.serve.llm.build_openai_app) to build a full application from your [`LLMConfig`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) object.
```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwq-32B",
        model_source="Qwen/QwQ-32B",
    ),
    accelerator_type="A100-40G",
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=2, max_replicas=2,
        )
    ),
    engine_kwargs=dict(
        tensor_parallel_size=8,
        max_model_len=32768,
        reasoning_parser="deepseek_r1",  # separates reasoning into `reasoning_content`
        ### Uncomment if your model is gated and needs your Hugging Face token to access it
        # hf_token=os.environ["HF_TOKEN"],
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})

serve.run(app, blocking=True)
```

:::

:::{tab-item} Serve Config (YAML)

In your Ray Serve config file:
```yaml
applications:
- name: reasoning-llm-app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
      - model_loading_config:
          model_id: my-qwq-32B
          model_source: Qwen/QwQ-32B
        accelerator_type: A100-40G
        deployment_config:
          autoscaling_config:
            min_replicas: 2
            max_replicas: 2
        engine_kwargs:
          tensor_parallel_size: 8
          max_model_len: 32768
          reasoning_parser: deepseek_r1
          ### Uncomment if your model is gated and needs your Hugging Face token to access it
          #hf_token: <YOUR-TOKEN-HERE>
```

:::

::::

---

## Deploy

There are different ways to deploy your service depending on how you defined it.  

::::{tab-set}

:::{tab-item} From the Python SDK

Follow the instructions at [Configure Ray Serve (Python SDK)](#configure-ray-serve) to define your app in a Python module `serve_my_qwq_32B.py`.  
> Make sure your script runs your application with `serve.run(app, blocking=True)`.  

In a terminal, run:  
```bash
python serve_my_qwq_32B.py
```

:::

:::{tab-item} From a Serve Config (YAML)  

Follow the instructions at [Configure Ray Serve (Serve Config)](#configure-ray-serve) to define your app with a Ray Serve config file `serve_my_qwq_32B.yaml`.  

In a terminal, run:
```bash
serve run serve_my_qwq_32B.yaml
```

:::

::::

---

## Sending Requests
> Follow the [Deployment instructions](#deploy) to launch your application on your Ray Cluster.

Deployment typically takes a few minutes as the cluster is provisioned, the vLLM server starts, and the model is downloaded.
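
Since startup can take a while, it may help to poll the service before sending requests. The sketch below shows one way to do that with the standard library only. The helper names are hypothetical, and it assumes the application exposes an OpenAI-compatible `/v1/models` route that answers with HTTP 200 once the model is loaded; the demo at the end uses a fake check so it runs without a server.

```python
import time
import urllib.request


def endpoint_is_up(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and connection errors
        return False


def wait_until_ready(check, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Call `check()` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Demo with a fake check that succeeds on the third attempt.
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), timeout_s=5, interval_s=0.01))
```

Against a local deployment, the check would be `lambda: endpoint_is_up("http://localhost:8000/v1/models")`. A deployment that requires authentication would need the token added to the request.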

**Retrieve your authentication token and endpoint**  
If running locally, your model will be available at `<YOUR-ENDPOINT-HERE> = "http://localhost:8000"` and you can use a placeholder authentication token: `<YOUR-TOKEN-HERE> = "FAKE_KEY"`
  > **Note:** The OpenAI client requires an `api_key` value, but a local deployment doesn't need a real one, so any placeholder works.

Otherwise, retrieve both your endpoint and authentication token from your deployment environment’s dashboard or logs.

**Send a request**  
Use the `model_id` defined in your config (here, `my-qwq-32B`) to query your model.

::::{tab-set}

:::{tab-item} Example Curl
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer FAKE_KEY" \
  -d '{
        "model": "my-qwq-32B",
        "messages": [{"role": "user", "content": "Pick three random words with 3 syllables each and count the number of R'\''s in each of them"}]
      }'
```

:::

:::{tab-item} Example Python
```python
from urllib.parse import urljoin
from openai import OpenAI

api_key = "FAKE_KEY"  # replace with <YOUR-TOKEN-HERE> for a remote deployment
base_url = "http://localhost:8000"  # replace with <YOUR-ENDPOINT-HERE>

client = OpenAI(base_url=urljoin(base_url, "v1"), api_key=api_key)

response = client.chat.completions.create(
    model="my-qwq-32B",
    messages=[
        {"role": "user", "content": "What is the sum of all even numbers between 1 and 100?"}
    ]
)

print(f"Reasoning: \n{response.choices[0].message.reasoning_content}\n\n")
print(f"Answer: \n {response.choices[0].message.content}")
```

:::

::::

If you configure the reasoning parser, the reasoning output will appear in the `reasoning_content` field of the response message. Otherwise, it may be included in the main `content` field, typically wrapped in `<think>...</think>` tags.
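
If no reasoning parser is configured and the reasoning arrives inline, a client-side fallback can separate it from the answer. The following is a minimal sketch, assuming the model wraps its reasoning in complete `<think>...</think>` pairs; `split_think_tags` is a hypothetical helper, not part of Ray Serve or vLLM, and it does not handle outputs where the opening tag is missing.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think_tags(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer).

    Handles zero or more <think>...</think> blocks; text outside the
    tags is treated as the final answer.
    """
    reasoning = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>Even numbers 2..100: 50 terms, mean 51.</think>\nThe sum is 2550."
reasoning, answer = split_think_tags(raw)
print(f"Reasoning: {reasoning}")
print(f"Answer: {answer}")
```

Configuring the server-side parser remains the more robust option, since it follows the model's own output format.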

---

## Structured Outputs and Tool Calling

To support structured outputs and tool calls with your reasoning model, see [Structured Output with JSON Mode](#) and [Tool and Function Calling](#).

Use an appropriate reasoning parser so that your model's response is formatted correctly. See [Parsing Reasoning Outputs](#parsing-reasoning-outputs) for more information.

---

## Next Steps

In this tutorial, you deployed a reasoning LLM with Ray Serve LLM, from development to production. You learned how to configure Ray Serve LLM with the right reasoning parser, deploy your service on your Ray cluster, send requests, and parse reasoning outputs in the response.