# Chapter 6: Efficient LLM Serving with vLLM and PagedAttention

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ayoisio/genai-on-google-cloud/blob/main/chapter-6/colabs/03_vllm_serving.ipynb)

## Learning Goals

In this notebook, you will learn how to:
- Understand PagedAttention and its impact on LLM serving efficiency
- Configure vLLM for different bottleneck patterns (bandwidth, memory, compute, network)
- Implement batch inference and continuous batching
- Use prefix caching to optimize repeated prompt scenarios
- Benchmark throughput and analyze GPU memory utilization

## Prerequisites

- A Colab environment with GPU (T4, L4, or A100 recommended)
- Basic understanding of LLM inference and GPU memory concepts
- Familiarity with Python and PyTorch

## The Serving Efficiency Problem

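Traditional LLM serving reserves one contiguous KV cache region per request, sized for the maximum sequence length, so most of that region sits unused. This over-reservation is a form of **KV cache fragmentation**: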

```
Traditional Approach:
┌────────────────────────────────────────────────┐
│  Request 1: [████████░░░░░░░░░░░░░░░░░░░░░░]  │  Allocated: 512 tokens
│  Request 2: [██████████████░░░░░░░░░░░░░░░░]  │  Used: 50-200 tokens
│  Request 3: [████░░░░░░░░░░░░░░░░░░░░░░░░░░]  │  Wasted: 60-90%
└────────────────────────────────────────────────┘

PagedAttention:
┌────────────────────────────────────────────────┐
│  [██][██][██][██][░░][██][██][░░][██][██][██]  │  Dynamic pages
│  Req1  Req2  Req1  Req3     Req2     Req1-3   │  Near 100% utilization
└────────────────────────────────────────────────┘
```

This memory waste directly limits concurrent requests and throughput.
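
To make the waste concrete, here is a back-of-the-envelope sketch of KV cache sizing. The model dimensions (32 layers, 8 KV heads, head size 128, 2-byte values) and the 16-token block size are illustrative assumptions, not measurements of a specific model or of vLLM internals.

```python
import math

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def contiguous_waste(used_tokens, reserved_tokens):
    """Fraction of a fixed per-request reservation that is never used."""
    return 1 - used_tokens / reserved_tokens

def paged_waste(used_tokens, block_size=16):
    """Fraction wasted when memory is handed out in fixed-size blocks."""
    blocks = math.ceil(used_tokens / block_size)
    return 1 - used_tokens / (blocks * block_size)

# Assumed model shape, for illustration only
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")

# Compare a 512-token reservation against block-wise allocation
for used in (50, 120, 200):
    print(f"{used:>3} tokens used: "
          f"contiguous waste {contiguous_waste(used, 512):.0%}, "
          f"paged waste {paged_waste(used):.0%}")
```

With paging, the only unused space is the tail of each request's last block, which is why the waste shrinks as sequences grow.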

## 1. Setup and Installation

In [None]:
# Check GPU availability
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"✓ GPU Available: {gpu_name}")
    print(f"✓ GPU Memory: {gpu_memory:.2f} GB")
else:
    print("⚠ No GPU detected!")
    print("Enable GPU: Runtime > Change runtime type > GPU")

In [None]:
# Install vLLM (requires version 0.4.0+ for enable_prefix_caching)
!pip install -q "vllm>=0.4.0"
print("✓ vLLM installed!")

In [None]:
# Import libraries
from vllm import LLM, SamplingParams
import time
import gc

print("✓ Libraries imported!")

## 2. Understanding vLLM Configuration

vLLM parameters map directly to the bottleneck patterns from Chapter 6:

| Parameter | Bottleneck | Description |
|-----------|------------|-------------|
| `tensor_parallel_size` | Pattern 2 (Memory) | Split model across GPUs |
| `gpu_memory_utilization` | Pattern 2 (Memory) | Fraction of GPU memory vLLM may use (weights, activations, and KV cache) |
| `max_num_seqs` | Pattern 1 (Bandwidth) | Max concurrent sequences |
| `enable_prefix_caching` | Pattern 1 (Bandwidth) | Cache common prompt prefixes |
| `max_model_len` | Pattern 2 (Memory) | Maximum context length |
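
As a rough illustration of how the table translates into code, the sketch below collects these parameters into a keyword dictionary. The helper `build_engine_kwargs` and its `bottleneck` labels are hypothetical names made up for this example; the numeric values mirror the starting points used later in this notebook and are not tuned results.

```python
def build_engine_kwargs(model, bottleneck, num_gpus=1, max_model_len=4096):
    """Return keyword arguments for vllm.LLM for one bottleneck pattern.

    `bottleneck` is a hypothetical label ("bandwidth" or "memory") used
    only in this sketch.
    """
    kwargs = {
        "model": model,
        "tensor_parallel_size": 1,
        "gpu_memory_utilization": 0.9,
        "max_model_len": max_model_len,
    }
    if bottleneck == "bandwidth":
        # More concurrent sequences and prefix reuse amortize weight reads.
        kwargs.update(max_num_seqs=256, enable_prefix_caching=True)
    elif bottleneck == "memory":
        # Shard weights across GPUs and cap concurrency to limit KV cache use.
        kwargs.update(tensor_parallel_size=num_gpus, max_num_seqs=64,
                      gpu_memory_utilization=0.85)
    else:
        raise ValueError(f"unknown bottleneck: {bottleneck!r}")
    return kwargs

kwargs = build_engine_kwargs("google/gemma-2-2b-it", "bandwidth")
print(kwargs)
# With vLLM installed, these would be passed as: llm = LLM(**kwargs)
```

Keeping the settings in a plain dictionary makes it easy to log the exact configuration next to each benchmark run.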

## 3. Load a Model with vLLM

We'll use a small model that fits on Colab's GPU for demonstration.

In [None]:
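# Model selection based on available GPU memory
# Note: Gemma checkpoints are gated on Hugging Face; accept the license and
# authenticate (for example with `huggingface-cli login`) before loading them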
gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3

if gpu_memory_gb >= 40:  # A100 or similar
    MODEL_ID = "google/gemma-2-9b-it"
    GPU_MEMORY_UTIL = 0.9
    MAX_MODEL_LEN = 8192
elif gpu_memory_gb >= 20:  # L4 (24 GB)
    MODEL_ID = "google/gemma-2-2b-it"
    GPU_MEMORY_UTIL = 0.9
    MAX_MODEL_LEN = 4096
else:  # Smaller GPU, e.g. T4 (16 GB)
    MODEL_ID = "microsoft/phi-2"
    GPU_MEMORY_UTIL = 0.85
    MAX_MODEL_LEN = 2048

print(f"Selected model: {MODEL_ID}")
print(f"GPU memory utilization: {GPU_MEMORY_UTIL}")
print(f"Max context length: {MAX_MODEL_LEN}")

In [None]:
# Initialize vLLM with production-optimized settings
print(f"Loading {MODEL_ID} with vLLM...")
print("This may take a few minutes for initial download...")
print()

llm = LLM(
    model=MODEL_ID,
    tensor_parallel_size=1,  # Single GPU
    gpu_memory_utilization=GPU_MEMORY_UTIL,
    max_model_len=MAX_MODEL_LEN,
    enable_prefix_caching=True,  # Enable prefix caching for efficiency
    trust_remote_code=True,  # Runs code from the model repo; enable only for models you trust
)

print("\n✓ Model loaded successfully!")

## 4. Basic Inference with vLLM

In [None]:
# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Single prompt inference
prompt = "Explain the key benefits of PagedAttention in LLM serving:"

print("Prompt:", prompt)
print()

start_time = time.time()
outputs = llm.generate([prompt], sampling_params)
latency = time.time() - start_time

print("Response:")
print(outputs[0].outputs[0].text)
print(f"\nLatency: {latency:.2f}s")

## 5. Batch Inference - Continuous Batching Demo

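vLLM's continuous batching schedules work one decode step at a time: as soon as a sequence finishes, its slot goes to a waiting request, so short outputs do not wait for long ones in the same batch.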
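
The toy simulation below sketches why this matters. It counts decode steps only, ignores prefill, and assumes every step costs the same regardless of how many requests are running; the output lengths are made up. It is a conceptual model, not vLLM's scheduler.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens, batch_size):
    """A finished request frees its slot for the next waiting request."""
    waiting = deque(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Fill free slots before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step emits one token per running request
        # Drop requests that just produced their last token.
        running = [r - 1 for r in running if r > 1]
    return steps

lens = [5, 120, 250, 10, 5, 180]  # hypothetical output lengths in tokens
print("static:    ", static_batching_steps(lens, batch_size=2))
print("continuous:", continuous_batching_steps(lens, batch_size=2))
```

In the static case every batch is held hostage by its longest request, while the continuous scheduler keeps both slots busy until the queue drains.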

In [None]:
# Batch of prompts with different expected output lengths
batch_prompts = [
    "What is 2+2?",  # Very short response expected
    "Explain quantum computing in one paragraph.",  # Medium response
    "Write a detailed comparison of GPUs vs TPUs for machine learning training, covering architecture, performance, cost, and use cases.",  # Long response
    "Name three programming languages.",  # Short response
    "What is the capital of France?",  # Very short response
    "Describe the transformer architecture and its key innovations.",  # Medium-long response
]

print(f"Processing batch of {len(batch_prompts)} prompts...")
print()

# Time batch inference
start_time = time.time()
outputs = llm.generate(batch_prompts, sampling_params)
total_time = time.time() - start_time

# Calculate metrics
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
tokens_per_second = total_tokens / total_time

print("="*70)
print("BATCH INFERENCE RESULTS")
print("="*70)

for i, output in enumerate(outputs):
    response = output.outputs[0].text
    num_tokens = len(output.outputs[0].token_ids)
    print(f"\nPrompt {i+1}: {batch_prompts[i][:50]}...")
    print(f"Response ({num_tokens} tokens): {response[:100]}...")

print("\n" + "="*70)
print(f"Total time: {total_time:.2f}s")
print(f"Total tokens generated: {total_tokens}")
print(f"Throughput: {tokens_per_second:.2f} tokens/second")
# Requests run concurrently, so this is amortized time, not per-request latency
print(f"Amortized time per request: {total_time/len(batch_prompts):.2f}s")
print("="*70)

## 6. Prefix Caching Demo

Prefix caching lets vLLM reuse the KV cache blocks of a shared prompt prefix, such as a common system prompt, so later requests skip most of the prefill work for that prefix.
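
One way to picture the mechanism is block-level hashing: each full block of tokens is hashed together with everything before it, and a later prompt reuses every leading block whose hash is already known. The sketch below is a simplified model under that assumption, with a 16-token block size and integer stand-ins for token IDs; it is not vLLM's actual implementation.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block (assumed for this sketch)

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with everything that precedes it."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest  # chaining ties each block to its whole prefix
    return hashes

def cached_prefix_tokens(token_ids, cache):
    """Return how many leading tokens hit the cache, then cache the rest."""
    hits = 0
    for h in block_hashes(token_ids):
        if h in cache:
            hits += 1  # chained hashes mean hits always form a leading run
        else:
            cache.add(h)
    return hits * BLOCK_SIZE

system = list(range(40))  # stand-in token IDs for a shared system prompt
cache = set()
print(cached_prefix_tokens(system + [900, 901, 902], cache))  # first request
print(cached_prefix_tokens(system + [950, 951], cache))       # second request
```

In this model only full blocks are reusable, so a 40-token shared prefix yields 32 cached tokens for the second request; the partial third block is recomputed.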

In [None]:
# Common system prompt (like in chat applications)
system_prompt = """You are a helpful AI assistant specialized in cloud computing and machine learning infrastructure. 
You provide concise, accurate answers focused on practical implementation.
Always consider cost, performance, and scalability in your recommendations.

User: """

# Multiple queries with the same prefix
user_queries = [
    "What GPU should I use for fine-tuning a 7B model?",
    "How do I reduce serving latency for my LLM application?",
    "When should I choose TPUs over GPUs?",
    "What is the best storage option for training data?",
    "How can I reduce cold start times for my model?",
]

# Create full prompts with shared prefix
full_prompts = [system_prompt + query for query in user_queries]

print("Testing prefix caching with shared system prompt...")
print(f"System prompt length: {len(system_prompt)} characters")
print()

# First run - cache population
start_time = time.time()
outputs_first = llm.generate(full_prompts[:2], sampling_params)
first_run_time = time.time() - start_time

# Second run - should benefit from cached prefix
start_time = time.time()
outputs_second = llm.generate(full_prompts[2:], sampling_params)
second_run_time = time.time() - start_time

print("="*70)
print("PREFIX CACHING RESULTS")
print("="*70)
print(f"First batch (2 queries): {first_run_time:.2f}s")
print(f"Second batch (3 queries): {second_run_time:.2f}s")
print(f"Time per query (first): {first_run_time/2:.2f}s")
print(f"Time per query (second): {second_run_time/3:.2f}s")
print()
print("Note: Prefix caching only saves prefill work for the shared prefix.")
print("With a short prefix and long generations, the difference may be small.")
print("="*70)

## 7. Memory and Performance Analysis

In [None]:
# GPU memory analysis
import torch

def get_gpu_memory_info():
    """Get current GPU memory usage."""
    if torch.cuda.is_available():
        # PyTorch allocator statistics cover only this Python process
        allocated = torch.cuda.memory_allocated(0) / 1024**3
        reserved = torch.cuda.memory_reserved(0) / 1024**3
        # Device-wide numbers, including memory held by other processes
        free_bytes, total_bytes = torch.cuda.mem_get_info(0)
        free = free_bytes / 1024**3
        total = total_bytes / 1024**3
        return {
            'allocated_gb': allocated,
            'reserved_gb': reserved,
            'total_gb': total,
            'free_gb': free,
            'utilization_pct': ((total - free) / total) * 100
        }
    return None

mem_info = get_gpu_memory_info()

print("="*70)
print("GPU MEMORY ANALYSIS")
print("="*70)
print(f"Total GPU Memory:     {mem_info['total_gb']:.2f} GB")
print(f"Reserved Memory:      {mem_info['reserved_gb']:.2f} GB")
print(f"Allocated Memory:     {mem_info['allocated_gb']:.2f} GB")
print(f"Free Memory:          {mem_info['free_gb']:.2f} GB")
print(f"Utilization:          {mem_info['utilization_pct']:.1f}%")
print("="*70)
print()
print("vLLM Memory Management:")
print(f"  - gpu_memory_utilization set to: {GPU_MEMORY_UTIL}")
print(f"  - vLLM may use up to {GPU_MEMORY_UTIL * 100:.0f}% of GPU memory (weights, activations, KV cache)")
print(f"  - Remaining {(1-GPU_MEMORY_UTIL) * 100:.0f}% is headroom for spikes")

## 8. Throughput Benchmark

In [None]:
# Benchmark with different batch sizes
import statistics

def benchmark_throughput(llm, num_prompts, num_runs=3):
    """Benchmark throughput with specified number of prompts."""
    
    # Generate test prompts
    test_prompts = [
        f"Write a brief explanation of concept number {i} in machine learning."
        for i in range(num_prompts)
    ]
    
    sampling_params = SamplingParams(
        temperature=0.7,
        max_tokens=100,
    )
    
    latencies = []
    token_counts = []
    
    for run in range(num_runs):
        start_time = time.time()
        outputs = llm.generate(test_prompts, sampling_params)
        latency = time.time() - start_time
        
        total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        
        latencies.append(latency)
        token_counts.append(total_tokens)
    
    avg_latency = statistics.mean(latencies)
    avg_tokens = statistics.mean(token_counts)
    throughput = avg_tokens / avg_latency
    
    return {
        'num_prompts': num_prompts,
        'avg_latency_s': avg_latency,
        'avg_tokens': avg_tokens,
        'throughput_tps': throughput,
        'latency_per_prompt_s': avg_latency / num_prompts,
    }

print("="*70)
print("THROUGHPUT BENCHMARK")
print("="*70)
print()

batch_sizes = [1, 4, 8, 16]
results = []

for batch_size in batch_sizes:
    print(f"Benchmarking batch size {batch_size}...")
    result = benchmark_throughput(llm, batch_size, num_runs=2)
    results.append(result)

print()
print("Results:")
print("-" * 70)
print(f"{'Batch Size':<12} {'Latency (s)':<14} {'Throughput (tok/s)':<20} {'Time/Prompt (s)':<15}")
print("-" * 70)

for r in results:
    print(f"{r['num_prompts']:<12} {r['avg_latency_s']:<14.2f} {r['throughput_tps']:<20.2f} {r['latency_per_prompt_s']:<15.2f}")

print("-" * 70)
print()
print("Key insight: Larger batches improve throughput (tokens/second)")
print("but may increase individual request latency.")

## 9. Configuration for Different Bottleneck Patterns

In [None]:
# Configuration recommendations based on bottleneck patterns

configs = {
    "Pattern 1 - Bandwidth Bottleneck": {
        "symptom": "GPU utilization 50-70%, not improving with optimization",
        "config": {
            "gpu_memory_utilization": 0.9,
            "enable_prefix_caching": True,
            "max_num_seqs": 256,
        },
        "explanation": "Maximize memory usage and cache prefixes to reduce data movement"
    },
    "Pattern 2 - Memory Bottleneck": {
        "symptom": "OOM errors, model doesn't fit on GPU",
        "config": {
            "gpu_memory_utilization": 0.85,
            "tensor_parallel_size": 2,
            "max_num_seqs": 64,
        },
        "explanation": "Reduce memory pressure, split model across GPUs if needed"
    },
    "Pattern 3 - Compute Bottleneck": {
        "symptom": "Sustained 100% GPU utilization",
        "config": {
            "gpu_memory_utilization": 0.9,
            "enable_prefix_caching": True,
        },
        "explanation": "Already optimized - need faster hardware or more GPUs"
    },
    "Pattern 4 - Network Bottleneck": {
        "symptom": "Multi-GPU scaling shows diminishing returns",
        "config": {
            "tensor_parallel_size": "minimum needed to fit model",
        },
        "explanation": "Minimize inter-GPU communication, use NVLink if available"
    },
}

print("="*70)
print("vLLM CONFIGURATION BY BOTTLENECK PATTERN")
print("="*70)

for pattern, info in configs.items():
    print(f"\n{pattern}")
    print("-" * 50)
    print(f"Symptom: {info['symptom']}")
    print(f"Recommended config: {info['config']}")
    print(f"Why: {info['explanation']}")

## 10. Cleanup

In [None]:
# Clean up GPU memory (vLLM may keep some memory until the runtime restarts)
print("Cleaning up...")

del llm
gc.collect()
torch.cuda.empty_cache()

mem_info = get_gpu_memory_info()
print(f"GPU memory after cleanup: {mem_info['reserved_gb']:.2f} GB reserved")
print("✓ Cleanup complete!")

## Summary

### Key vLLM Optimizations

| Feature | Impact | When to Use |
|---------|--------|-------------|
| **PagedAttention** | 2-4x throughput (as reported in the paper) | Always (automatic) |
| **Continuous Batching** | 85-90% GPU utilization | High-traffic scenarios |
| **Prefix Caching** | Faster repeated prompts | Chat applications, agents |
| **Tensor Parallelism** | Fit large models | When model > GPU memory |

### Production Configuration Checklist

1. **Set `gpu_memory_utilization`** to 0.85-0.9 (leave headroom for spikes)
2. **Set `enable_prefix_caching=True`** for conversational workloads
3. **Use minimum `tensor_parallel_size`** that fits your model
4. **Tune `max_num_seqs`** based on your latency requirements
5. **Monitor GPU utilization** to identify your bottleneck pattern
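
Checklist item 3 can be roughed out with simple arithmetic before touching any hardware. The sketch below is an estimate under stated assumptions: `kv_reserve_fraction` is a hypothetical knob for how much of the vLLM memory budget should remain for KV cache and activations, sizes are treated as approximate GB, and candidate sizes are powers of two because vLLM generally requires the attention head count to be divisible by `tensor_parallel_size`.

```python
def min_tensor_parallel_size(num_params_b, gpu_mem_gb, num_heads,
                             dtype_bytes=2, gpu_memory_utilization=0.9,
                             kv_reserve_fraction=0.3, max_gpus=8):
    """Smallest GPU count whose weight shard leaves room for the KV cache.

    kv_reserve_fraction is a hypothetical knob: the share of the vLLM
    memory budget to keep free for KV cache and activations.
    """
    weights_gb = num_params_b * dtype_bytes  # billions of params * bytes ~ GB
    budget_gb = gpu_mem_gb * gpu_memory_utilization * (1 - kv_reserve_fraction)
    tp = 1
    while tp <= max_gpus:
        # Heads must split evenly across tensor-parallel ranks.
        if num_heads % tp == 0 and weights_gb / tp <= budget_gb:
            return tp
        tp *= 2
    return None  # does not fit within max_gpus under these assumptions

# Illustrative inputs: a 9B-parameter model in 16-bit precision on 24 GB GPUs
print(min_tensor_parallel_size(9, gpu_mem_gb=24, num_heads=16))
```

Treat the result as a lower bound to validate with a real load test, since activation memory and context length shift the actual requirement.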

### Next Steps & Learning Labs

| Resource | Description |
|----------|-------------|
| [vLLM Official Documentation](https://docs.vllm.ai/) | Complete guide to vLLM configuration and deployment |
| [vLLM Recipes & Tutorials](https://recipes.vllm.ai/) | Example notebooks and step-by-step tutorials |
| [Model Garden vLLM Tutorial](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_vllm_text_only_tutorial.ipynb) | Deploy models with vLLM on Vertex AI |
| [Gemma Deployment on GKE](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_gemma_deployment_on_gke.ipynb) | Deploy Gemma on GKE using GPU |
| [PagedAttention Paper](https://arxiv.org/abs/2309.06180) | Original research paper on PagedAttention |
| [vLLM Performance Benchmarks](https://perf.vllm.ai/) | Compare vLLM performance across configurations |

### Related Notebooks

- [02_model_garden_deployment.ipynb](./02_model_garden_deployment.ipynb) - Deploy models from Vertex AI Model Garden
- [01_gemma_finetuning.ipynb](./01_gemma_finetuning.ipynb) - Fine-tune Gemma with QLoRA