# Advanced LLM Features with Ray Serve LLM

© 2025, Anyscale. All Rights Reserved


💻 **Launch Locally**: You can run this notebook locally, but you'll need access to GPUs.

🚀 **Launch on Cloud**: A Ray Cluster with GPUs (Click [here](http://console.anyscale.com/register) to easily start a Ray cluster on Anyscale) is recommended to run this notebook.


This notebook explores advanced features and capabilities of Ray Serve LLM beyond basic model deployment. We'll dive into practical examples that showcase the power and flexibility of production LLM serving.

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li>Overview: Advanced Features Preview</li>
    <li>Example: Deploying LoRA Adapters</li>
    <li>Example: Getting Structured JSON Output</li>
    <li>Example: Setting up Tool Calling</li>
    <li>How to Choose an LLM?</li>
    <li>Conclusion: Next Steps</li>
</ul>
</div>


## Overview: Advanced Features Preview

Now that you've mastered the basics of LLM deployment with Ray Serve LLM, let's explore some advanced features that make production LLM serving more powerful and flexible.

### What We'll Cover

In this module, we'll focus on **3 practical examples** that demonstrate advanced capabilities:

1. **LoRA Adapters**: Deploy multiple fine-tuned adapters on a single base model
2. **Structured Output**: Generate consistent JSON and other structured formats
3. **Tool Calling**: Enable models to call external functions and APIs

### Why These Features Matter

**LoRA Adapters** allow you to:
- Serve multiple specialized models from one base model
- Reduce memory usage and deployment complexity
- Switch between different fine-tuned behaviors at runtime

**Structured Output** enables:
- Consistent, parseable responses for applications
- Integration with downstream systems
- Better reliability for production use cases

**Tool Calling** provides:
- Integration with external APIs and databases
- Enhanced model capabilities through function execution
- Building more sophisticated AI applications

### Learning Approach

We'll take a **hands-on approach** - each example will show you:
- Why the feature is useful
- How to configure it
- Working code you can run
- Links to comprehensive guides for deeper learning

Let's dive in!


## Example: Deploying LoRA Adapters

LoRA (Low-Rank Adaptation) adapters are small sets of fine-tuned low-rank weights that are loaded on top of a base model. This allows you to serve multiple specialized behaviors from a single deployment.

### Why Use LoRA Adapters?

- **Parameter Efficiency**: LoRA adapters are typically less than 1% of the base model's size
- **Runtime Adaptation**: Switch between different adapters without reloading the base model
- **Simpler MLOps**: Centralize inference around one model while supporting multiple use cases
- **Cost Effective**: Share expensive base model across multiple specialized tasks

### Example: Code Assistant LoRA

Let's deploy a base model with multiple LoRA adapters. This will allow the model to switch between general and specialized generation.

For this example, we'll use publicly available adapters from Hugging Face.

First, we need to prepare our LoRA adapters and save them in our cloud storage. 

For example, the following script downloads adapters from Hugging Face and uploads them to an AWS S3 bucket:

In [None]:
import os
import boto3
from huggingface_hub import snapshot_download

# Mapping of custom names to Hugging Face LoRA adapter repo IDs
adapters = {
    "nemoguard": "nvidia/llama-3.1-nemoguard-8b-topic-control",
    "cv_job_matching": "LlamaFactoryAI/Llama-3.1-8B-Instruct-cv-job-description-matching",
    "yara": "vtriple/Llama-3.1-8B-yara"
}

# S3 target
bucket_name = "llm-docs-aydin"
base_s3_path = "1-5-multi-lora/lora_checkpoints"

# Initialize S3 client
s3 = boto3.client("s3")

for custom_name, repo_id in adapters.items():
    print(f"\n📥 Downloading adapter '{custom_name}' from {repo_id}...")
    local_path = snapshot_download(repo_id)

    print(f"⬆️ Uploading files to s3://{bucket_name}/{base_s3_path}/{custom_name}/")

    for root, _, files in os.walk(local_path):
        for file_name in files:
            local_file_path = os.path.join(root, file_name)
            rel_path = os.path.relpath(local_file_path, local_path)
            s3_key = f"{base_s3_path}/{custom_name}/{rel_path}".replace("\\", "/")

            print(f"  → {s3_key}")
            s3.upload_file(local_file_path, bucket_name, s3_key)

print("\n✅ All adapters uploaded successfully.")

# List the uploaded objects to confirm
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=base_s3_path)

print(f"Files in s3://{bucket_name}/{base_s3_path}/:")
for obj in response.get("Contents", []):  # "Contents" is absent when nothing matches
    print(obj["Key"])

You should end up with this folder structure for each adapter.
```
s3://your-bucket/lora-adapters/
├── nemoguard/
│   ├── adapter_config.json
│   └── adapter_model.safetensors
├── cv_job_matching/
│   ├── adapter_config.json
│   └── adapter_model.safetensors
└── yara/
    ├── adapter_config.json
    └── adapter_model.safetensors
```
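Before pointing `dynamic_lora_loading_path` at the bucket, it can help to confirm that every adapter folder contains both files shown above. The following is a minimal sketch, not part of Ray Serve LLM: `find_incomplete_adapters` and its arguments are hypothetical names, and it works on a plain list of object keys (for example, the `Key` values returned by `list_objects_v2`).

```python
from collections import defaultdict

# Files each adapter folder is expected to contain (taken from the layout above).
REQUIRED_FILES = {"adapter_config.json", "adapter_model.safetensors"}


def find_incomplete_adapters(keys, base_path, adapter_names):
    """Return {adapter_name: sorted missing files} for adapters lacking required files.

    keys: iterable of object keys, e.g. the "Key" values from list_objects_v2.
    base_path: key prefix that holds one sub-folder per adapter.
    adapter_names: the adapter folder names you expect to find.
    """
    prefix = base_path.strip("/") + "/"
    found = defaultdict(set)
    for key in keys:
        if not key.startswith(prefix):
            continue
        adapter, _, file_name = key[len(prefix):].partition("/")
        found[adapter].add(file_name)

    missing = {}
    for name in adapter_names:
        gap = REQUIRED_FILES - found[name]
        if gap:
            missing[name] = sorted(gap)
    return missing


# Example with hard-coded keys; in practice, collect them from your bucket listing.
example_keys = [
    "1-5-multi-lora/lora_checkpoints/nemoguard/adapter_config.json",
    "1-5-multi-lora/lora_checkpoints/nemoguard/adapter_model.safetensors",
    "1-5-multi-lora/lora_checkpoints/yara/adapter_config.json",
]
print(find_incomplete_adapters(
    example_keys, "1-5-multi-lora/lora_checkpoints", ["nemoguard", "yara"]
))
```

In this example the `yara` folder is reported because its weights file is missing, while `nemoguard` passes the check.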

### Configure Ray Serve LLM with LoRA

Now let's configure our LLM with LoRA support. The key additions are the `lora_config` and enabling LoRA in the engine arguments:


In [None]:
# serve_my_lora_app.py
import os
from ray.serve.llm import LLMConfig, build_openai_app

# Configure LLM with LoRA support
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama",
        # Make sure your Hugging Face token has access to the model.
        # If it doesn't, request access at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
        # or switch to the unsloth/ version for an ungated Llama
        model_source="meta-llama/Llama-3.1-8B-Instruct" # Base model
    ),
    accelerator_type="L4",
    # LoRA configuration
    lora_config=dict(
        dynamic_lora_loading_path="s3://llm-docs-aydin/1-5-multi-lora/lora_checkpoints/",  # Your S3/GCS path
        max_num_adapters_per_replica=3  # (optional) Limit adapters per replica
    ),
    runtime_env=dict(
        env_vars={
            "HF_TOKEN": os.environ.get("HF_TOKEN"), # Set your token beforehand: export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>
            "AWS_REGION": "us-west-2"  # Your AWS region
        }
    ),
    engine_kwargs=dict(
        max_model_len=8192,
        # Enable LoRA support
        enable_lora=True,
        max_lora_rank=32,  # Maximum LoRA rank. Set to the largest rank you plan to use.
        max_loras=3,  # Must match max_num_adapters_per_replica
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
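The `max_lora_rank` setting above has to cover the largest rank among the adapters you serve. PEFT-style adapters record their rank in the `r` field of `adapter_config.json`, so one way to pick the value is to read that field from each downloaded adapter. This is a sketch under that assumption: `max_adapter_rank` is a hypothetical helper, and the demo writes throwaway config files with made-up ranks instead of reading real adapters.

```python
import json
import tempfile
from pathlib import Path


def max_adapter_rank(adapter_dirs):
    """Return the largest LoRA rank ("r") found across adapter_config.json files."""
    ranks = []
    for adapter_dir in adapter_dirs:
        config_path = Path(adapter_dir) / "adapter_config.json"
        with open(config_path) as f:
            ranks.append(int(json.load(f)["r"]))
    return max(ranks)


# Demo with made-up ranks; point this at your snapshot_download() folders instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_dirs = []
    for name, rank in {"adapter_a": 8, "adapter_b": 32}.items():
        adapter_dir = Path(tmp) / name
        adapter_dir.mkdir()
        (adapter_dir / "adapter_config.json").write_text(json.dumps({"r": rank}))
        demo_dirs.append(adapter_dir)
    needed_rank = max_adapter_rank(demo_dirs)

print(needed_rank)  # 32
```

The engine may only accept certain values for `max_lora_rank`, so check the vLLM documentation and round up if the computed rank is not accepted.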

Deploy

In [None]:
!serve run serve_my_lora_app:app --non-blocking

### Using LoRA Adapters

Once deployed, you can query different adapters by specifying them in the model name using the format `<base_model_id>:<adapter_name>`:


In [None]:
#client.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")

############################ Base model request (no adapter) #####################
print("=== Base model ===")
response = client.chat.completions.create(
    model="my-llama",  # no adapter
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")


############################ nemoguard adapter (moderation) #####################
print("=== LoRA: nemoguard ===")
# As per Nemoguard's usage instruction, add this to your system prompt
# https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-topic-control#system-instruction
TOPIC_SAFETY_OUTPUT_RESTRICTION = 'If any of the above conditions are violated, please respond with "off-topic". Otherwise, respond with "on-topic". You must respond with "on-topic" or "off-topic".'
messages_nemoguard = [
    {
        "role": "system",
        "content": f'In the next conversation always use a polite tone and do not engage in any talk about travelling and touristic destinations. {TOPIC_SAFETY_OUTPUT_RESTRICTION}',
    },
    {"role": "user", "content": "Do you know which is the most popular beach in Barcelona?"},
]
response = client.chat.completions.create(
    model="my-llama:nemoguard",  ### with nemoguard adapter
    messages=messages_nemoguard,
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

############################ cv_job_matching adapter (structured JSON output) ############################
print("=== LoRA: cv_job_matching ===")
messages_cv = [
    {
        "role": "system",
        "content": """You are an advanced AI model designed to analyze the compatibility between a CV and a job description. You will receive a CV and a job description. Your task is to output a structured JSON format that includes the following:

1. matching_analysis: Analyze the CV against the job description to identify key strengths and gaps.
2. description: Summarize the relevance of the CV to the job description in a few concise sentences.
3. score: Provide a numerical compatibility score (0-100) based on qualifications, skills, and experience.
4. recommendation: Suggest actions for the candidate to improve their match or readiness for the role.

Your output must be in JSON format as follows:
{
  "matching_analysis": "Your detailed analysis here.",
  "description": "A brief summary here.",
  "score": 85,
  "recommendation": "Your suggestions here."
}
""",
    },
    {
        "role": "user",
        "content": "<CV> Software engineer with 5 years of experience in Python and cloud infrastructure. </CV>\n<job_description> Looking for a backend engineer with Python and AWS experience. </job_description>",
    },
]
response = client.chat.completions.create(
    model="my-llama:cv_job_matching", ### with cv_job_matching adapter
    messages=messages_cv,
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

############################ yara adapter (cybersecurity task) ############################
print("=== LoRA: yara ===")
messages_yara = [{"role": "user", "content": "Generate a YARA rule to detect a PowerShell-based keylogger. Generate ONLY the YARA rule, do not add explanations."}]
response = client.chat.completions.create(
    model="my-llama:yara", ### with yara adapter
    messages=messages_yara,
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Shut down the deployment

In [None]:
!serve shutdown -y

### Key Benefits

- **Single Deployment**: One base model serves multiple specialized behaviors
- **Dynamic Switching**: Change adapters at runtime without restarting
- **Memory Efficient**: Adapters are much smaller than full fine-tuned models
- **Cost Effective**: Share expensive base model across multiple use cases
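The `<base_model_id>:<adapter_name>` convention used in the client requests can be wrapped in a small helper so adapter names are not concatenated by hand throughout an application. This is only an illustrative sketch: `lora_model_id` and `split_model_id` are hypothetical names, not part of Ray Serve LLM or the OpenAI client.

```python
def lora_model_id(base_model_id, adapter_name=None):
    """Build the `model` value for a request: base id alone, or `<base>:<adapter>`."""
    if not adapter_name:
        return base_model_id
    if ":" in adapter_name:
        raise ValueError("adapter_name must not contain ':'")
    return f"{base_model_id}:{adapter_name}"


def split_model_id(model):
    """Split a `model` value back into (base_model_id, adapter_name or None)."""
    base, sep, adapter = model.partition(":")
    return base, (adapter if sep else None)


for adapter in (None, "nemoguard", "cv_job_matching", "yara"):
    print(lora_model_id("my-llama", adapter))
```

Passing the result as `model=` in `client.chat.completions.create(...)` selects the base model when no adapter is given and the named adapter otherwise.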

### Learn More

For comprehensive multi-LoRA deployment guides, see:
- [Multi-LoRA deployment guide on Anyscale](https://docs.anyscale.com/llm/serving/multi-lora) - Complete guide with best practices
- [Multi-LoRA with Ray Serve LLM (Ray docs)](https://docs.ray.io/en/latest/serve/llm/user-guides/multi-lora.html) - Quick-start configuration details


## Example: Getting Structured JSON Output

Many applications need consistent, parseable output from LLMs. Ray Serve LLM supports structured output generation, ensuring your model returns data in the exact format you need.

### Why Structured Output Matters

- **Consistent Format**: Guaranteed JSON structure for downstream processing
- **Integration Ready**: Easy to parse and use in applications
- **Reliability**: Reduces parsing errors and improves system robustness
- **Type Safety**: Enforces data types and required fields

### Example: Car type description

Let's deploy a model. Before choosing one for production, check how it performs on structured output benchmarks.


```yaml
# serve_my_qwen.yaml
applications:
- name: json-mode-app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
      - model_loading_config:
          model_id: my-qwen
          model_source: Qwen/Qwen2.5-3B-Instruct
        accelerator_type: L4
        ### Uncomment if your model is gated and needs your Hugging Face token to access it
        #runtime_env:
        #  env_vars:
        #    HF_TOKEN: <YOUR-TOKEN-HERE>
        engine_kwargs:
          max_model_len: 8192
```

In [None]:
!serve run serve_my_qwen.yaml --non-blocking

### Using Structured Output

Now let's test our structured output model by asking it for a car description:


In [None]:
#json_method1.py
from openai import OpenAI
from pydantic import BaseModel
from enum import Enum

client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")

# (Optional) Use a Pydantic model to define and validate the schema
class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType

# 1. Define your schema
json_schema = CarDescription.model_json_schema()

# 2. Send a request
response = client.chat.completions.create(
    model="my-qwen",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    # 3. Set `response_format` of type `json_schema`
    response_format={
        "type": "json_schema",
        # 4. Provide `name` and `schema` (both required)
        "json_schema": {
            "name": "car-description",  # arbitrary
            "schema": json_schema,  # your JSON schema
        },
    },
)

print(response.choices[0].message.content)

### Expected Output

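The model returns JSON that conforms to the schema. The field values vary between runs and are not guaranteed to be factually accurate; an example response looks like: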

```json
{
  "brand": "Lexus",
  "model": "IS F",
  "car_type": "SUV"
}
```
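On the client side, the same Pydantic model can be reused to turn the returned string into a typed object and to catch any response that does not match the schema. The sketch below assumes Pydantic v2; `parse_car` is a hypothetical helper, and `sample_content` stands in for `response.choices[0].message.content`.

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"


class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType


def parse_car(content):
    """Return a CarDescription, or None if the text does not match the schema."""
    try:
        return CarDescription.model_validate_json(content)
    except ValidationError as err:
        print(f"Response did not match the schema: {err.error_count()} error(s)")
        return None


# Stand-in for response.choices[0].message.content
sample_content = '{"brand": "Lexus", "model": "IS F", "car_type": "SUV"}'
car = parse_car(sample_content)
print(car.brand, car.model, car.car_type.value)

# A value outside the enum is rejected and parse_car returns None
print(parse_car('{"brand": "Lexus", "model": "IS F", "car_type": "van"}'))
```

Validating on the client is a cheap safety net for cases where the schema changes on one side only or a response is truncated by a token limit.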

Shut down the deployment

In [None]:
!serve shutdown -y

### Key Benefits

- **Guaranteed Structure**: Responses are constrained to valid JSON matching your schema
- **Type Safety**: Enforces data types (strings, numbers, arrays, enums)
- **Required Fields**: Ensures all required fields are present
- **Easy Integration**: Load the response with a standard JSON parser; no custom text extraction is needed

### Learn More

For comprehensive structured output guides, see:
- [LLM deployment with structured output on Anyscale](https://docs.anyscale.com/llm/serving/structured-output) - Complete guide with all output formats
- [Request structured output (vLLM documentation)](https://docs.vllm.ai/en/stable/features/structured_outputs.html) - Complete guide on vLLM API for structured outputs

## Example: Setting up Tool Calling

Tool calling enables LLMs to interact with external functions, APIs, and databases. This opens up powerful possibilities for building sophisticated AI applications that can perform actions beyond just text generation.

### Why Tool Calling Matters

- **Enhanced Capabilities**: Models can perform actions, not just generate text
- **Real-time Data**: Access current information from APIs and databases
- **Workflow Automation**: Integrate AI into existing business processes
- **Interactive Applications**: Build chatbots that can actually do things

### Example: Weather Assistant with Tool Calling

Let's deploy a model with tool calling enabled so that it can request weather lookups from a weather API:


In [None]:
# serve_my_qwen3.py
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwen3",
        model_source="Qwen/Qwen3-32B",
    ),
    accelerator_type="L40S",
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    ### Uncomment if your model is gated and needs your Hugging Face token to access it (also add `import os`).
    # runtime_env=dict(env_vars={"HF_TOKEN": os.environ.get("HF_TOKEN")}),
    engine_kwargs=dict(
        tensor_parallel_size=4,
        max_model_len=32768,
        reasoning_parser="qwen3",
        enable_auto_tool_choice=True,
        tool_call_parser="hermes",
    ),
)
app = build_openai_app({"llm_configs": [llm_config]})

Deploy

In [None]:
!serve run serve_my_qwen3:app --non-blocking

### Using Tool Calling

Now let's test our tool-calling model. The model decides when to request a tool call; the client code runs the tool and sends the result back so the model can write the final answer:


In [None]:
#tool_call_client.py
import random
import json
from openai import OpenAI

# Dummy APIs
def get_current_temperature(location: str, unit: str = "celsius"):
    temperature = random.randint(15, 30) if unit == "celsius" else random.randint(59, 86)
    return {
        "temperature": temperature,
        "location": location,
        "unit": unit
    }

def get_temperature_date(location: str, date: str, unit: str = "celsius"):
    temperature = random.randint(15, 30) if unit == "celsius" else random.randint(59, 86)
    return {
        "temperature": temperature,
        "location": location,
        "date": date,
        "unit": unit
    }

# Tools schema definitions
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_temperature",
            "description": "Get current temperature at a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location to get the temperature for, in the format \"City, State, Country\"."
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The unit to return the temperature in. Defaults to \"celsius\"."
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_temperature_date",
            "description": "Get temperature at a location and date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location to get the temperature for, in the format \"City, State, Country\"."
                    },
                    "date": {
                        "type": "string",
                        "description": "The date to get the temperature for, in the format \"Year-Month-Day\"."
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The unit to return the temperature in. Defaults to \"celsius\"."
                    }
                },
                "required": ["location", "date"]
            }
        }
    }
]

######################### Sending request for tool calls #########################
client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")

messages = [
    {
        "role": "system",
        "content": "You are a weather assistant. Use the given functions to get weather data and provide the results."
    },
    {
        "role": "user",
        "content": "What's the temperature in San Francisco now? How about tomorrow? Current Date: 2025-07-29."
    }
]
response = client.chat.completions.create(
    model="my-qwen3",
    messages=messages,
    tools=tools,
    tool_choice= "auto" # let the model decide to use tools or not
)

######################### Process tool calls #########################
for tc in response.choices[0].message.tool_calls or []:  # tool_calls is None when no tool was requested
    print(f"Tool call id: {tc.id}")
    print(f"Tool call function name: {tc.function.name}")
    print(f"Tool call arguments: {tc.function.arguments}")
    print("\n")

# Helper tool map (str -> python callable to your APIs)
helper_tool_map = {
    "get_current_temperature": get_current_temperature,
    "get_temperature_date": get_temperature_date
}

######################### Add your model's tool calls request to the chat history #########################
# `response` is your model's last response containing the tool calls it requests.
# Add the previous response containing the tool calls
messages.append(response.choices[0].message.model_dump())

######################### Add `tool` messages in your chat history #########################
# Loop through the tool calls and create `tool` messages
for tool_call in response.choices[0].message.tool_calls or []:
    call_id, fn_call = tool_call.id, tool_call.function
    
    fn_callable = helper_tool_map[fn_call.name]
    fn_args = json.loads(fn_call.arguments)

    output = json.dumps(fn_callable(**fn_args))

    # Create a new message of role `"tool"` containing the output of your tool
    messages.append({
        "role": "tool",
        "content": output,
        "tool_call_id": call_id
    })

######################### Sending final request #########################

response = client.chat.completions.create(
    model="my-qwen3",
    messages=messages
)


print(response.choices[0].message.content)

In [None]:
!serve shutdown -y

### Key Benefits

- **Intelligent Tool Selection**: Model decides when and which tools to use
- **Structured Parameters**: Tools receive properly formatted arguments
- **Seamless Integration**: Natural conversation flow with tool execution
- **Extensible**: Easy to add new tools and capabilities
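The client above assumes that every requested tool exists and that its arguments are valid JSON matching the function signature. In practice a model can request an unknown tool or produce malformed arguments, so a more defensive dispatcher can return the error as the tool result instead of raising. This is a minimal sketch: `run_tool_call` is a hypothetical helper, and the weather function is a fixed-value stand-in for a real API.

```python
import json


def run_tool_call(name, arguments, tool_map):
    """Execute one tool call and return a JSON string for the `tool` message.

    name: function name requested by the model.
    arguments: JSON-encoded argument string from the tool call.
    tool_map: dict mapping tool names to Python callables.
    """
    fn = tool_map.get(name)
    if fn is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments) if arguments else {}
        if not isinstance(kwargs, dict):
            raise TypeError("arguments must decode to a JSON object")
        return json.dumps(fn(**kwargs))
    except (json.JSONDecodeError, TypeError) as err:
        # Malformed JSON, or arguments that do not fit the function signature.
        return json.dumps({"error": f"invalid arguments for {name}: {err}"})


# Hypothetical stand-in for a real weather API.
def get_current_temperature(location, unit="celsius"):
    return {"temperature": 21, "location": location, "unit": unit}


tool_map = {"get_current_temperature": get_current_temperature}

print(run_tool_call("get_current_temperature", '{"location": "San Francisco, CA, USA"}', tool_map))
print(run_tool_call("get_humidity", "{}", tool_map))
print(run_tool_call("get_current_temperature", '{"city": "Paris"}', tool_map))
```

Each returned string can be used as the `content` of a `tool` message with the matching `tool_call_id`, which gives the model a chance to correct its request on the next turn.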

### Learn More

For comprehensive tool calling guides, see:
- [LLM deployment with tool and function calling on Anyscale](https://docs.anyscale.com/llm/serving/tool-function-calling) - Complete tool calling setup

## How to Choose an LLM?

With so many models available, choosing the right one for your use case is crucial. Here's a practical framework for model selection based on the [Anyscale documentation](https://docs.anyscale.com/llm/serving/intro#selecting-model).

### Model Selection Framework

#### 1. **Model Quality Benchmarks**

Use established benchmarks to evaluate model capabilities:

- **Chatbot Arena**: For conversational capabilities and user preference
- **MMLU-Pro**: For domain-specific performance across academic subjects
- **Code Benchmarks**: For programming and code generation tasks
- **Reasoning Tests**: For logical reasoning and problem-solving

#### 2. **Task and Domain Alignment**

Match your model to your specific use case:

| Model Type | Best For | Example Use Cases |
|------------|----------|-------------------|
| **Base Models** | Next-token prediction, open-ended continuation | Sentence completion, code autocomplete |
| **Instruction-tuned** | Following explicit directions | Chatbots, coding assistants, Q&A |
| **Reasoning-optimized** | Complex problem-solving | Mathematical reasoning, scientific analysis |


#### 3. **Context Window Requirements**

Match context length to your use case:

| Context Length | Use Cases | Memory Impact |
|----------------|-----------|---------------|
| **4K-8K tokens** | Q&A, simple chat | Low memory requirements |
| **32K-128K tokens** | Document analysis, summarization | Moderate memory usage |
| **128K+ tokens** | Multi-step agents, complex reasoning | High memory requirements |
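The memory impact in the table comes mostly from the KV cache, which grows linearly with the number of tokens in flight. A rough estimate per token is 2 (key and value) × layers × KV heads × head dimension × bytes per value. The sketch below applies that formula with an assumed Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache; real engines add overhead and may use quantized caches, so treat the numbers as a lower-bound estimate.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for context_len in (8_192, 32_768, 131_072):
    gib = per_token * context_len / 2**30
    print(f"{context_len:>7} tokens -> {gib:.0f} GiB of KV cache per sequence")
```

Under these assumptions a single 8K-token sequence needs about 1 GiB of cache, a 32K sequence about 4 GiB, and a 128K sequence about 16 GiB, which is why `max_model_len` has a direct effect on how many concurrent requests fit on a GPU.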

#### 4. **Hardware and Cost Considerations**

Balance performance with resource constraints:

- **Small Models (7B-13B)**: 1-2 GPUs, fast deployment, lower cost
- **Medium Models (70B-80B)**: 4-8 GPUs, balanced performance/cost
- **Large Models (400B+)**: Multiple nodes, maximum capability, higher cost
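A quick way to sanity-check these tiers is to estimate the memory needed for the weights alone: parameter count × bytes per parameter. The sketch below is a back-of-the-envelope calculation, not a sizing tool. The helper names, the 90% usable-memory fraction, and the GPU sizes are illustrative assumptions, and the result is only a lower bound because the KV cache and activations need additional memory.

```python
import math


def weight_memory_gb(num_params_billion, bytes_per_param=2):
    """Approximate memory for model weights in GB (16-bit weights by default)."""
    return num_params_billion * bytes_per_param


def min_gpus_for_weights(num_params_billion, gpu_memory_gb, bytes_per_param=2, usable_fraction=0.9):
    """Smallest GPU count whose combined usable memory holds the weights alone."""
    usable_per_gpu = gpu_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(num_params_billion, bytes_per_param) / usable_per_gpu)


# (parameters in billions, GPU memory in GB); GPU sizes here are illustrative.
for params_b, gpu_gb in [(8, 24), (32, 48), (70, 80)]:
    print(f"{params_b}B on {gpu_gb} GB GPUs: "
          f"{weight_memory_gb(params_b):.0f} GB of weights, "
          f"at least {min_gpus_for_weights(params_b, gpu_gb)} GPU(s)")
```

Practical deployments usually use more GPUs than this minimum to leave room for the KV cache and to pick a tensor-parallel size that divides the model's attention heads evenly.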

### Practical Selection Process

1. **Define Requirements**: Latency, accuracy, context length, budget
2. **Benchmark Models**: Test on your specific tasks and data
3. **Consider Trade-offs**: Speed vs. accuracy, cost vs. capability
4. **Start Simple**: Begin with smaller models, scale up as needed
5. **Iterate and Optimize**: Monitor performance and adjust accordingly

### Model Recommendations by Use Case

**For Production Chatbots:**
- Llama 3.1 8B/70B (balanced performance)
- Mistral 7B (fast inference)

**For Code Generation:**
- Code Llama 7B/13B (specialized for code)
- DeepSeek-Coder (reasoning + code)

**For Complex Reasoning:**
- Qwen 3 32B (hybrid thinking)
- DeepSeek-R1 (dedicated reasoning)

**For Document Processing:**
- Llama 3.1 70B (large context)
- Claude 3.5 Sonnet (excellent long context; proprietary and available only through an API, so it can't be self-hosted with Ray Serve LLM)


## Conclusion: Next Steps

Congratulations! You've now explored advanced features of Ray Serve LLM and learned how to deploy sophisticated LLM applications. Let's summarize what we've covered and look ahead to even more possibilities.

### What We Accomplished

**Module 3 Summary:**
1. **LoRA Adapters**: Deployed multiple specialized models from a single base model
2. **Structured Output**: Generated consistent JSON and structured data formats
3. **Tool Calling**: Enabled models to interact with external functions and APIs
4. **Model Selection**: Learned a framework for choosing the right LLM for your use case

### Key Takeaways

- **Advanced Features**: Ray Serve LLM supports sophisticated production capabilities
- **Practical Examples**: Each feature has real-world applications and benefits
- **Easy Integration**: Advanced features build on the same foundation as basic deployment
- **Production Ready**: All features are designed for scalable, reliable deployments

### More Advanced Topics

Ready to dive deeper? Here are additional areas to explore:

**Performance & Optimization:**
- [Choose a GPU for LLM serving](https://docs.anyscale.com/llm/serving/gpu-guidance) - Hardware selection and optimization
- [Tune parameters for LLMs](https://docs.anyscale.com/llm/serving/parameter-tuning) - Advanced configuration tuning
- [Troubleshoot LLM serving](https://docs.anyscale.com/llm/serving/troubleshooting) - Common issues and solutions
- [Optimize performance for Ray Serve LLM](https://docs.anyscale.com/llm/serving/performance-optimization) - Performance optimization guide

**Enterprise Features:**
- **Monitoring & Observability**: Advanced metrics and debugging tools
- **Security & Compliance**: Enterprise-grade security features
- **CI/CD Integration**: Automated deployment and testing pipelines
- **Multi-tenant Deployments**: Serve multiple customers from shared infrastructure

### Next Steps

1. **Practice**: Try deploying your own models with these advanced features
2. **Explore**: Dive into the comprehensive guides we've linked
3. **Build**: Create real applications using what you've learned
4. **Share**: Join the Ray community and share your experiences

### Resources

- [Ray Serve LLM Documentation](https://docs.ray.io/en/latest/serve/llm/index.html)
- [Anyscale LLM Serving Guide](https://docs.anyscale.com/llm/serving)
- [Ray Community Forum](https://discuss.ray.io/)
- [Anyscale Console](https://console.anyscale.com/) - Deploy your models

**Course Complete** 🎉

Thank you for learning with us! You're now ready to build amazing LLM applications with Ray Serve LLM.
