# vLLM: High-Performance Inference Engine for LLMs

In this notebook, we'll explore vLLM, an advanced inference engine designed to maximize the performance of Large Language Models (LLMs) on GPU hardware. vLLM addresses key performance bottlenecks in traditional inference systems, achieving substantially higher throughput while maintaining low latency.

## 1. Introduction to Inference Engines

### Why LLM Inference is Challenging

Deploying LLMs efficiently presents several challenges:

1. **Memory Constraints**: LLMs have billions of parameters that must be loaded into GPU memory
2. **Attention Computation**: Quadratic scaling with sequence length makes attention expensive
3. **Sequential Generation**: Auto-regressive generation is inherently sequential
4. **Dynamic Batch Sizes**: Variable-length inputs and outputs make static batching inefficient
5. **GPU Utilization**: Traditional inference pipelines often leave GPUs underutilized

Specialized inference engines like vLLM are designed to address these challenges through innovative techniques and optimizations.
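
To make the memory constraint concrete, the sketch below estimates how much GPU memory the weights and the KV cache need. The model shape (32 layers, 32 KV heads, head dimension 128, FP16 values, 7 billion parameters) is an assumption chosen to resemble a typical 7B decoder; it is not read from any checkpoint.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def weight_bytes(num_params, bytes_per_param=2):
    """Bytes needed to hold the model weights."""
    return num_params * bytes_per_param


# Assumed shape, roughly a 7B-parameter decoder in FP16 (illustrative values)
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
weights = weight_bytes(7_000_000_000)

print(f"Weights: {weights / 1e9:.1f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for one 4096-token sequence: {per_token * 4096 / 2**30:.1f} GiB")
```

Under these assumptions a single long sequence already needs gigabytes of cache on top of the weights, which is why KV cache management matters so much for serving.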

### Installation Requirements

Before we begin, here are the requirements for installing vLLM:

```bash
# Nightly pre-release build (for the latest stable release, run: pip install -U vllm)
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

# Or start from an NVIDIA PyTorch container and install vLLM inside it (useful for CUDA compatibility issues)
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
```

vLLM requires:
- NVIDIA GPU with compute capability 7.0+ (V100, T4, A100, H100, etc.)
- CUDA 11.8+ (best with CUDA 12.1)
- NVIDIA drivers (tested with Driver 535+)
- PyTorch 2.5.1 (recommended at the time of writing; each vLLM release pins its own PyTorch version)

### Core Value Propositions of vLLM

vLLM offers several key advantages for LLM inference:

1. **PagedAttention**: Efficient KV cache management that eliminates waste and reduces memory fragmentation
2. **Continuous Batching**: Dynamic handling of requests to maximize GPU utilization 
3. **Optimized CUDA Kernels**: Highly optimized implementations of computational bottlenecks
4. **Tensor Parallelism**: Ability to split model across multiple GPUs
5. **Quantization Support**: Precision reduction (INT8, INT4) with minimal accuracy loss

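These innovations allow vLLM to achieve substantially higher throughput than many other inference engines while maintaining low latency. Reported gains range from about 2x to 18x, depending on the baseline system, model, and workload.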

## 2. PagedAttention: Memory-Efficient KV Caching

### The KV Cache Problem

During LLM inference, the key-value (KV) pairs from previous tokens must be stored to avoid recomputation. Traditional approaches allocate a fixed-size cache for each request based on the maximum possible sequence length, leading to significant memory waste when:

1. Actual sequences are shorter than the maximum length
2. Sequences in a batch have different lengths

This problem becomes more severe with multiple simultaneous requests and longer context windows.
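
A rough way to see the scale of this waste is to compare reserved and used cache slots. The sketch below assumes every request reserves `max_len` slots up front; the request lengths are hypothetical.

```python
def preallocation_waste(actual_lengths, max_len):
    """Fraction of reserved KV cache slots left unused when every request reserves max_len."""
    reserved = len(actual_lengths) * max_len
    used = sum(min(n, max_len) for n in actual_lengths)
    return (reserved - used) / reserved


# Hypothetical batch: four requests of very different lengths, 2048 slots reserved for each
lengths = [120, 512, 64, 1500]
waste = preallocation_waste(lengths, max_len=2048)
print(f"Reserved slots left unused: {waste:.1%}")
```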

### How PagedAttention Works

PagedAttention applies virtual memory concepts to KV cache management:

1. **Blocks Instead of Sequences**: PagedAttention divides the KV cache into fixed-size memory blocks (e.g., 16 tokens per block)
2. **Physical and Logical Separation**: Maintains logical sequences using a block table that maps sequence positions to physical memory blocks
3. **On-Demand Allocation**: Only allocates blocks when needed, rather than preallocating for maximum length
4. **Memory Reuse**: Freed blocks from completed sequences can be immediately reused for new requests

Let's see a basic example of how this works:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Visualization helper for PagedAttention concept
def visualize_paged_attention(block_size=4, num_blocks=10, sequences=None):
    if sequences is None:
        sequences = [
            {'name': 'Seq 1', 'length': 7, 'start_block': 0},
            {'name': 'Seq 2', 'length': 5, 'start_block': 2},
            {'name': 'Seq 3', 'length': 11, 'start_block': 4},
        ]
    
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
    
    # Physical memory blocks
    blocks = np.zeros(num_blocks * block_size)
    colors = ['#ff9999', '#99ff99', '#9999ff']
    
    # Fill blocks based on sequences
    for i, seq in enumerate(sequences):
        start_idx = seq['start_block'] * block_size
        for j in range(min(seq['length'], (num_blocks - seq['start_block']) * block_size)):
            blocks[start_idx + j] = i + 1
    
    # Display physical memory
    ax1.set_title('Physical Memory: KV Cache Blocks')
    for i in range(num_blocks):
        ax1.axvline(x=i*block_size - 0.5, color='black', linestyle='-', alpha=0.3)
        ax1.text(i*block_size + block_size/2 - 0.5, -0.8, f'Block {i}', ha='center')
    
    from matplotlib.colors import ListedColormap  # public import path for the colormap class
    cmap = ListedColormap(['white'] + colors)
    ax1.imshow(blocks.reshape(1, -1), aspect='auto', cmap=cmap, vmin=0, vmax=len(sequences))
    ax1.set_yticks([])
    ax1.set_xticks(np.arange(0, num_blocks*block_size, block_size))
    ax1.set_xticklabels([])
    
    # Display logical sequences
    ax2.set_title('Logical Sequences: How PagedAttention Maps Requests')
    for i, seq in enumerate(sequences):
        logical_seq = np.zeros(num_blocks * block_size)
        for j in range(seq['length']):
            block_idx = seq['start_block'] + j // block_size
            if block_idx < num_blocks:
                token_idx = (block_idx * block_size) + (j % block_size)
                logical_seq[j] = token_idx + 1  # Where this token is stored in physical memory
                
        ax2.scatter(range(seq['length']), [i]*seq['length'], 
                   c=[colors[i]]*seq['length'], label=seq['name'])
        for j in range(seq['length']):
            if logical_seq[j] > 0:
                ax2.text(j, i, f'{int(logical_seq[j]-1)}', ha='center', va='center', fontsize=8)
    
    ax2.set_xlim(-0.5, num_blocks*block_size - 0.5)
    ax2.set_ylim(-0.5, len(sequences) - 0.5)
    ax2.set_yticks(range(len(sequences)))
    ax2.set_yticklabels([s['name'] for s in sequences])
    ax2.legend(loc='upper right')
    ax2.set_xlabel('Logical Token Position')
    
    plt.tight_layout()
    
visualize_paged_attention()
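
The block-table idea can also be sketched as a toy allocator. This is a minimal illustration under simplifying assumptions (one slot per token, a plain free list); the class and method names are hypothetical and this is not vLLM's implementation.

```python
class BlockAllocator:
    """Toy allocator that hands out fixed-size physical blocks on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block IDs not in use
        self.block_tables = {}  # seq_id -> physical block IDs, in logical order
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # every allocated block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def physical_slot(self, seq_id, position):
        """Translate a logical token position into (physical block, offset)."""
        if not 0 <= position < self.lengths[seq_id]:
            raise IndexError("position outside the stored sequence")
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]


alloc = BlockAllocator(num_blocks=4, block_size=4)
for _ in range(5):
    alloc.append_token("A")  # A grows to 5 tokens and holds 2 blocks
for _ in range(3):
    alloc.append_token("B")  # B holds 1 block
alloc.free("A")              # A's blocks return to the free pool
for _ in range(6):
    alloc.append_token("C")  # C reuses a block that A released
print(alloc.block_tables)
print(alloc.physical_slot("C", 5))
```

Sequence C ends up in two non-adjacent physical blocks, one of them recycled from A, while its logical positions stay contiguous. That indirection is the point of the block table.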

### Benefits of PagedAttention

PagedAttention provides several key advantages:

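1. **Memory Efficiency**: Wastes far less KV cache memory than preallocation; the vLLM authors report that existing systems waste 60-80% of KV cache memory, while PagedAttention keeps waste under 4%
2. **Minimal Memory Fragmentation**: Fixed-size blocks eliminate external fragmentation, and internal fragmentation is limited to the last, partially filled block of each sequence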
3. **Support for Variable-Length Sequences**: Efficiently handles sequences of any length
4. **Higher Throughput**: More requests can be processed simultaneously with the same memory

The CUDA implementation of PagedAttention in vLLM includes highly optimized attention kernels that maintain computational efficiency while providing these memory benefits.
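
The fragmentation bound is easy to check numerically: with block-based allocation, each sequence can leave at most `block_size - 1` slots unused. The sketch below uses hypothetical sequence lengths and a block size of 16.

```python
import math


def paged_waste(actual_lengths, block_size):
    """Unused slots when each sequence holds ceil(length / block_size) blocks."""
    allocated = sum(math.ceil(n / block_size) * block_size for n in actual_lengths)
    used = sum(actual_lengths)
    return allocated - used, (allocated - used) / allocated


# Hypothetical sequence lengths
lengths = [120, 512, 64, 1500]
unused, fraction = paged_waste(lengths, block_size=16)
print(f"Unused slots: {unused} ({fraction:.2%} of allocated)")
```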

## 3. Continuous Batching for Maximum GPU Utilization

### Static vs. Continuous Batching

Traditional inference systems use static batching, where:
1. A fixed number of requests are grouped into a batch
2. The entire batch is processed together
3. No new requests can join until the batch completes
4. All sequences in the batch must wait for the longest sequence to finish

This approach leads to poor GPU utilization because:
- GPUs are idle while waiting for new batches
- Shorter sequences waste compute waiting for longer ones
- Batch size must be small to maintain reasonable latency
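
The cost of waiting for the longest sequence can be estimated by counting slot-steps. This is a minimal sketch with made-up request lengths, assuming one token per request per step.

```python
def static_batch_idle(tokens_per_request):
    """Return (busy, idle) slot-steps for a static batch that runs until its longest request ends."""
    steps = max(tokens_per_request)
    busy = sum(tokens_per_request)
    idle = steps * len(tokens_per_request) - busy
    return busy, idle


# Hypothetical batch of four requests with different output lengths
busy, idle = static_batch_idle([10, 5, 15, 3])
print(f"Useful slot-steps: {busy}, idle slot-steps: {idle}, "
      f"utilization: {busy / (busy + idle):.0%}")
```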

Continuous batching solves these problems by dynamically adding and removing sequences from the batch.

### How Continuous Batching Works

vLLM implements continuous batching through a scheduling algorithm that:

1. **Dynamically Adds Requests**: New requests join the batch as soon as they arrive
2. **Schedules at Every Iteration**: The batch is re-formed at each decoding step, so every running sequence advances by one token per step regardless of when it joined
3. **Immediately Removes Completed Sequences**: Frees resources as soon as generation completes
4. **Prioritizes Requests**: The scheduler decides which waiting requests to admit next; first-come, first-served is the default, and priority-based policies are also possible

Let's look at a comparison between static and continuous batching:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Visualizing static vs continuous batching
def compare_batching_methods():
    # Setup
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
    
    # Define request data: [arrival_time, tokens_to_generate]
    requests = [
        [0, 10],   # Request 1: arrives at t=0, needs 10 tokens
        [1, 5],    # Request 2: arrives at t=1, needs 5 tokens
        [3, 15],   # Request 3: arrives at t=3, needs 15 tokens
        [5, 3],    # Request 4: arrives at t=5, needs 3 tokens
        [6, 7],    # Request 5: arrives at t=6, needs 7 tokens
    ]
    
    # Colors for each request
    colors = ['#ff9999', '#99ff99', '#9999ff', '#ffcc99', '#cc99ff']
    
    # Static batching (batch size = 2)
    ax1.set_title('Static Batching (batch size = 2)')
    batch_size = 2
    current_time = 0
    batch_queue = []
    processing_batch = False
    batch_end_time = 0
    
    for t in range(30):  # Simulate 30 time steps
        # Add new arrivals to queue
        for i, req in enumerate(requests):
            if req[0] == t:
                batch_queue.append((i, req[1]))
        
        # Start a new batch once the previous one has finished and enough requests are queued
        if (not processing_batch or t >= batch_end_time) and len(batch_queue) >= batch_size:
            processing_batch = True
            current_batch = batch_queue[:batch_size]
            batch_queue = batch_queue[batch_size:]
            
            # Determine longest sequence in batch
            max_tokens = max([r[1] for r in current_batch])
            batch_end_time = t + max_tokens
            
            # Draw rectangles for each request in this batch
            for idx, (req_idx, tokens) in enumerate(current_batch):
                rect = Rectangle((t, req_idx), max_tokens, 0.7, color=colors[req_idx], alpha=0.7)
                ax1.add_patch(rect)
                
                # Add a horizontal line segment for each generated token
                for token in range(tokens):
                    ax1.plot([t+token, t+token+1], [req_idx+0.3, req_idx+0.3], color='black', alpha=0.5)
                
                # Add idle time (waiting for longest sequence)
                if tokens < max_tokens:
                    rect = Rectangle((t+tokens, req_idx), max_tokens-tokens, 0.7, 
                                    color='lightgray', alpha=0.5, hatch='//')
                    ax1.add_patch(rect)
        
        # Check if current batch is done
        if processing_batch and t == batch_end_time:
            processing_batch = False
    
    # Continuous batching
    ax2.set_title('Continuous Batching')
    active_requests = {}
    
    for t in range(30):  # Simulate 30 time steps
        # Add new arrivals to active requests
        for i, req in enumerate(requests):
            if req[0] == t:
                active_requests[i] = {'start': t, 'tokens_left': req[1]}
        
        # Process all active requests
        finished = []
        for req_idx, req_data in active_requests.items():
            if req_data['tokens_left'] > 0:
                # Draw a line segment for the token generated at this time step
                ax2.plot([t, t+1], [req_idx+0.3, req_idx+0.3], color='black', alpha=0.5)
                
                # Draw rectangle for this time step
                rect = Rectangle((t, req_idx), 1, 0.7, color=colors[req_idx], alpha=0.7)
                ax2.add_patch(rect)
                
                # Decrement tokens left
                req_data['tokens_left'] -= 1
                
                # Check if request is finished
                if req_data['tokens_left'] == 0:
                    finished.append(req_idx)
        
        # Remove finished requests
        for req_idx in finished:
            del active_requests[req_idx]
    
    # Formatting
    for ax in [ax1, ax2]:
        ax.set_xlim(0, 30)
        ax.set_ylim(-0.5, len(requests) - 0.5)
        ax.set_yticks(range(len(requests)))
        ax.set_yticklabels([f'Request {i+1}' for i in range(len(requests))])
        ax.set_xlabel('Time steps')
        ax.grid(True, linestyle='--', alpha=0.3)
    
    # Add legend
    handles = [Rectangle((0,0), 1, 1, color=colors[i]) for i in range(len(requests))]
    handles.append(Rectangle((0,0), 1, 1, color='lightgray', hatch='//'))
    labels = [f'Request {i+1}' for i in range(len(requests))]
    labels.append('Idle (waiting for batch)')
    
    plt.figlegend(handles, labels, loc='lower center', ncol=len(handles), bbox_to_anchor=(0.5, 0))
    plt.tight_layout(rect=[0, 0.05, 1, 1])
    
compare_batching_methods()
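
The plot above draws continuous batching without a limit on batch size. The sketch below adds one: a hypothetical `max_batch_size` caps how many requests decode per step, and waiting requests are admitted first-come, first-served as slots free up. It is a simplified model (one token per request per step, no prefill cost), not vLLM's scheduler.

```python
from collections import deque


def simulate_continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling.

    requests: list of (arrival_time, tokens_to_generate) pairs.
    Returns a dict mapping request index to the time step at which it finishes.
    """
    waiting = deque(sorted(range(len(requests)), key=lambda i: requests[i][0]))
    running = {}      # request index -> tokens still to generate
    finish_time = {}
    t = 0
    while waiting or running:
        # Admit requests that have arrived into any free batch slots
        while waiting and len(running) < max_batch_size and requests[waiting[0]][0] <= t:
            idx = waiting.popleft()
            running[idx] = requests[idx][1]
        # One decoding iteration: every running request produces one token
        for idx in list(running):
            running[idx] -= 1
            if running[idx] <= 0:
                finish_time[idx] = t + 1
                del running[idx]  # the slot is free again at the next step
        t += 1
    return finish_time


# Same workload as the plot above: (arrival_time, tokens_to_generate)
requests = [(0, 10), (1, 5), (3, 15), (5, 3), (6, 7)]
print(simulate_continuous_batching(requests, max_batch_size=2))
print(simulate_continuous_batching(requests, max_batch_size=5))
```

With two slots, a short request that finishes early immediately makes room for a waiting one. With five slots, no request waits at all and each finishes exactly `tokens_to_generate` steps after it arrives.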

### Benefits of Continuous Batching

Continuous batching provides several critical advantages:

1. **Higher GPU Utilization**: Keeps the GPU busy by continuously adding new requests
2. **Lower Latency**: No waiting for batch formation or for longer sequences to complete
3. **Higher Throughput**: Processes more tokens per second by eliminating idle time
4. **Better QoS**: Can implement prioritization for critical requests

When combined with PagedAttention, continuous batching allows vLLM to achieve much higher throughput than traditional inference engines, especially under high load.

## 4. Flash Attention for Compute Optimization

While PagedAttention optimizes memory usage and continuous batching improves scheduling, the actual attention computation remains a bottleneck. Flash Attention is an algorithm that optimizes attention computation by reducing memory I/O between GPU high-bandwidth memory (HBM) and on-chip SRAM.

### How Flash Attention Works

Standard attention implementation has three key inefficiencies:

1. **Multiple HBM Accesses**: Reading Q, K, V matrices multiple times
2. **Storing Large Attention Matrices**: O(N²) memory for sequence length N
3. **Softmax Stability Tricks**: Extra passes to compute max values

Flash Attention addresses these issues by:

1. **Tiled Computation**: Breaking large matrices into smaller blocks that fit in SRAM
2. **Online Softmax Algorithm**: Computing softmax incrementally without storing the full attention matrix
3. **Fused Operations**: Combining multiple operations to reduce memory I/O

The number of floating-point operations remains O(N²), but memory I/O drops sharply and the extra memory required grows linearly with N rather than quadratically.
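
The online softmax step can be illustrated in plain Python for a single query. This is a simplified sketch: values are scalars rather than vectors, the scores are given rather than computed from Q and K, and there is no GPU tiling. The function names and inputs are made up for illustration.

```python
import math


def attention_reference(scores, values):
    """Standard softmax-weighted sum: materializes every weight at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_online(scores, values, block_size):
    """Same result, computed block by block with a running max and running sum."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) over the blocks seen so far
    running_out = 0.0  # unnormalized weighted sum of values seen so far
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale the accumulators so they are expressed relative to the new maximum
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5, 2.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ref = attention_reference(scores, values)
tiled = attention_online(scores, values, block_size=3)
print(abs(ref - tiled) < 1e-9)
```

Only one block of scores is held at a time, yet the result matches the reference up to floating-point rounding. This is why the full attention matrix never needs to be stored.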

### vLLM Integration with Flash Attention

vLLM can use FlashAttention as an attention backend (FlashInfer is another supported backend) to further optimize inference performance:

1. **Reduced Memory Bandwidth**: Lower memory I/O means faster attention computation
2. **Lower GPU Memory Usage**: No need to materialize the full attention matrix
3. **Optimized for Modern GPUs**: Tailored for NVIDIA architectures (Ampere, Hopper, etc.)

FlashAttention also accelerates the backward pass during training, but that benefit does not apply here: vLLM is an inference engine and runs only the forward pass.

The combination of PagedAttention memory efficiency and FlashAttention compute efficiency gives vLLM exceptional performance.

## 5. Implementing Efficient Batch Inference with vLLM

Now let's see how to use vLLM in practice. We'll first look at a simple example of how to load a model and generate completions.

In [None]:
# !pip install vllm
# Basic vLLM usage example

from vllm import LLM, SamplingParams

# Initialize the model
# This loads the model onto the GPU(s)
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=256
)

# Single prompt generation
prompts = ["Write a short story about a robot learning to paint:"]
outputs = llm.generate(prompts, sampling_params)

# Print the generated text
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated text: {generated_text}")
    print("---")

### Batch Processing with Thread Concurrency

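Now let's look at a more complex example that issues requests from multiple threads to simulate concurrent users. Note that the offline `LLM` class is designed for batched calls: it is not documented as thread-safe, so overlapping `generate` calls may be serialized or fail rather than being merged by continuous batching. Treat the timings as illustrative; for realistic concurrent load, send requests to the vLLM server described below.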

In [None]:
import threading
import time
import queue
import random
from vllm import LLM, SamplingParams

# Create a thread-safe queue for results
result_queue = queue.Queue()

# Define some varied prompts with different expected lengths
prompts = [
    "Explain quantum computing in one sentence:",  # Short
    "Write a haiku about programming:",  # Short
    "List 5 tips for productive coding sessions:",  # Medium
    "Explain the difference between supervised and unsupervised learning:",  # Medium
    "Write a short story about a programmer who discovers a bug that leads to an adventure:",  # Long
    "Describe in detail how transformers work in deep learning:",  # Long
]

# Function to generate text for a given prompt
def generate_text(llm, prompt_id, prompt, max_tokens):
    start_time = time.time()
    
    # Configure sampling parameters; only max_tokens varies between requests
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=max_tokens
    )
    
    # Generate the text
    outputs = llm.generate([prompt], sampling_params)
    generated_text = outputs[0].outputs[0].text
    
    # Calculate elapsed time
    elapsed_time = time.time() - start_time
    
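    # Count generated tokens from the returned token IDs; splitting the text on
    # whitespace would count words, not tokens
    num_tokens = len(outputs[0].outputs[0].token_ids)

    # Add result to queue
    result_queue.put({
        'prompt_id': prompt_id,
        'prompt': prompt,
        'text': generated_text,
        'max_tokens': max_tokens,
        'elapsed_time': elapsed_time,
        'tokens_per_second': num_tokens / elapsed_time
    })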

# Main function to simulate concurrent requests
def run_concurrent_inference():
    # Initialize the model once (shared across all threads)
    print("Loading model...")
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
    print("Model loaded!")
    
    # Create threads for concurrent requests
    threads = []
    num_requests = 10  # Total number of requests to process
    
    # Staggered launch of requests to simulate real-world scenario
    for i in range(num_requests):
        # Select a random prompt from our list
        prompt = random.choice(prompts)
        
        # Vary max tokens to simulate different response lengths
        max_tokens = random.choice([32, 64, 128, 256])
        
        # Create a thread for this request
        thread = threading.Thread(
            target=generate_text,
            args=(llm, i, prompt, max_tokens)
        )
        threads.append(thread)
    
    # Start all threads with a small delay between them
    print(f"Starting {num_requests} concurrent requests...")
    start_time = time.time()
    
    for thread in threads:
        thread.start()
        # Small random delay to simulate staggered arrivals
        time.sleep(random.uniform(0.1, 0.5))
    
    # Wait for all threads to complete
    for thread in threads:
        thread.join()
    
    total_time = time.time() - start_time
    print(f"All requests completed in {total_time:.2f} seconds")
    
    # Process and display results
    results = []
    while not result_queue.empty():
        results.append(result_queue.get())
    
    # Sort results by request duration (shortest first)
    results.sort(key=lambda x: x['elapsed_time'])
    
    # Display statistics
    print("\nRequest Statistics:")
    print("-" * 80)
    print(f"{'ID':^5} | {'Max Tokens':^10} | {'Time (s)':^10} | {'Tokens/s':^10} | {'Prompt':30}")
    print("-" * 80)
    
    for result in results:
        prompt_short = result['prompt'][:30] + "..." if len(result['prompt']) > 30 else result['prompt']
        print(f"{result['prompt_id']:^5} | {result['max_tokens']:^10} | {result['elapsed_time']:.2f}s | {result['tokens_per_second']:.2f} | {prompt_short}")
    
    # Calculate aggregate statistics
    # max_tokens is only an upper bound: generation can stop earlier at an end-of-sequence token
    avg_time = sum(r['elapsed_time'] for r in results) / len(results)
    max_total_tokens = sum(r['max_tokens'] for r in results)
    max_tokens_per_sec = max_total_tokens / total_time
    
    print("\nAggregate Statistics:")
    print(f"Average request time: {avg_time:.2f}s")
    print(f"Total tokens requested (upper bound): {max_total_tokens}")
    print(f"Tokens per second (upper bound): {max_tokens_per_sec:.2f}")
    print(f"Throughput: {len(results) / total_time:.2f} requests per second")
    
    # Return results for additional analysis if needed
    return results

# Run the concurrent inference test
results = run_concurrent_inference()
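
Averages can hide slow requests, so it can help to summarize the latency distribution as well. The helper below is a sketch that works on records shaped like those collected above (only the `elapsed_time` key is used); the function name and the timings in the usage example are made up for illustration.

```python
import math
import statistics


def summarize_latencies(results, total_time):
    """Summarize per-request latencies (seconds) and overall request throughput."""
    latencies = sorted(r['elapsed_time'] for r in results)
    # Nearest-rank 95th percentile
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        'mean_s': statistics.mean(latencies),
        'median_s': statistics.median(latencies),
        'p95_s': latencies[p95_index],
        'requests_per_s': len(latencies) / total_time,
    }


# Made-up timings for illustration, not measurements
fake_results = [{'elapsed_time': t} for t in (1.0, 2.0, 3.0, 4.0, 10.0)]
summary = summarize_latencies(fake_results, total_time=10.0)
print(summary)
```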

### Deploying a vLLM Server

For production deployment, vLLM offers a server mode with a REST API compatible with the OpenAI API. This makes it easy to integrate into existing applications.

In [None]:
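# Example command to start a vLLM server. When run from a notebook cell it blocks until
# the server stops, so running it in a separate terminal is usually more convenient.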
!python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --host 0.0.0.0 \
    --port 8000

Then you can query the server using standard HTTP requests that mimic the OpenAI API format:

In [None]:
import requests
import json

# Example of calling a vLLM server with OpenAI-compatible API
def query_vllm_server(prompt, max_tokens=100):
    url = "http://localhost:8000/v1/completions"
    
    payload = {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "stream": False
    }
    
    headers = {
        "Content-Type": "application/json"
    }
    
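    try:
        # A timeout keeps the call from hanging if the server is unreachable
        response = requests.post(url, headers=headers, data=json.dumps(payload), timeout=60)
        response.raise_for_status()  # Raise on HTTP 4xx/5xx instead of parsing an error page
        return response.json()
    except requests.RequestException as e:
        return {"error": str(e)}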

# Example usage (if server is running)
# result = query_vllm_server("Write a haiku about artificial intelligence:")
# print(json.dumps(result, indent=2))
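
A successful reply follows the OpenAI completions format, with the generated text under `choices` and token counts under `usage`. The helper below sketches one way to unpack it; the function name is hypothetical and the sample dictionary is hand-written for illustration, not captured from a server.

```python
def extract_completion(response):
    """Pull the generated text and token usage out of an OpenAI-style completions response."""
    if "error" in response:
        raise RuntimeError(f"request failed: {response['error']}")
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["text"],
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written example shaped like a completions response (not real server output)
sample = {
    "choices": [{"index": 0, "text": " Silent circuits hum", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 9, "completion_tokens": 4, "total_tokens": 13},
}
parsed = extract_completion(sample)
print(parsed)
```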

## 6. Advanced vLLM Configurations

vLLM supports several advanced configurations to further optimize performance:

### Tensor Parallelism

For large models that don't fit on a single GPU, vLLM supports tensor parallelism to split the model across multiple GPUs:

In [None]:
# Example of using tensor parallelism across multiple GPUs
from vllm import LLM, SamplingParams

# Load a larger model with tensor parallelism across GPUs
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,  # Use 4 GPUs
    gpu_memory_utilization=0.85,  # Control memory usage
)
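
A quick back-of-the-envelope check helps when choosing `tensor_parallel_size`. The sketch below assumes the weights split evenly across GPUs and that the number of attention heads must be divisible by the tensor-parallel size. The parameter count and head count are assumed figures for a 70B-class model, and the function name is hypothetical.

```python
def tensor_parallel_plan(num_params, num_attention_heads, tensor_parallel_size,
                         bytes_per_param=2):
    """Estimate per-GPU weight memory and check that attention heads split evenly."""
    if num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"{num_attention_heads} heads cannot be split evenly across "
            f"{tensor_parallel_size} GPUs"
        )
    return {
        "heads_per_gpu": num_attention_heads // tensor_parallel_size,
        "weight_gb_per_gpu": num_params * bytes_per_param / tensor_parallel_size / 1e9,
    }


# Assumed figures: 70B parameters, 64 attention heads, FP16 weights
plan = tensor_parallel_plan(70_000_000_000, num_attention_heads=64, tensor_parallel_size=4)
print(plan)
```

The estimate covers weights only; each GPU also needs room for its share of the KV cache and for activations.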

### Quantization

vLLM supports various quantization methods to reduce memory footprint with minimal impact on accuracy:

In [None]:
# Example of loading a quantized model
from vllm import LLM, SamplingParams

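# Load a model with 4-bit AWQ quantization.
# AWQ requires a checkpoint that has already been quantized with AWQ; the repository
# name below is an assumed example, so substitute the AWQ checkpoint you intend to use.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",  # Activation-aware Weight Quantization
    dtype="half"  # FP16 for non-quantized parts
)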
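
To build intuition for why low-bit weights lose little information, the sketch below quantizes a small group of weights to 4-bit integers and restores them. It uses plain symmetric round-to-nearest with one scale per group, which is much simpler than AWQ and is shown only as an illustration; the weight values are made up.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map the stored integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Made-up weight values for illustration
weights = [0.12, -0.53, 0.70, 0.01, -0.28, 0.44, -0.07, 0.33]
q, scale = quantize_group(weights)
restored = dequantize_group(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"Quantized: {q}")
print(f"Max reconstruction error: {max_error:.3f} (scale {scale:.3f})")
```

Each weight is stored in 4 bits plus a shared scale, and the rounding error per weight is bounded by half the scale.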

### Custom KV Cache Size

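The KV cache receives whatever GPU memory remains after the weights are loaded, within the fraction set by `gpu_memory_utilization`. The parameters below tune how that cache is organized and how much work the scheduler admits at once, which lets you balance memory usage against throughput: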

In [None]:
# Example of controlling KV cache size
from vllm import LLM, SamplingParams

# Adjust block size and max number of batched tokens
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    block_size=16,  # Size of each memory block in tokens
    max_num_batched_tokens=4096,  # Max tokens processed per scheduler step, across all requests
    max_num_seqs=256  # Max number of concurrent sequences
)
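
The relationship between memory budget and cache capacity can be approximated by hand. The sketch below ignores activation memory and other overheads that a real engine must also reserve, so treat it as an upper bound. All numbers are assumptions: a 40 GB GPU, 14 GB of FP16 weights, and 128 KiB of KV cache per token (32 layers, 8 KV heads, head dimension 128, FP16).

```python
def kv_cache_capacity(gpu_memory_bytes, gpu_memory_utilization, weight_bytes,
                      bytes_per_token, block_size):
    """Rough count of KV cache blocks (and tokens) that fit after the weights are loaded."""
    budget = int(gpu_memory_bytes * gpu_memory_utilization) - weight_bytes
    if budget <= 0:
        return 0, 0
    num_blocks = budget // (bytes_per_token * block_size)
    return num_blocks, num_blocks * block_size


# Assumed numbers (see the paragraph above); none are measured
blocks, tokens = kv_cache_capacity(
    gpu_memory_bytes=40 * 10**9,
    gpu_memory_utilization=0.9,
    weight_bytes=14 * 10**9,
    bytes_per_token=128 * 1024,
    block_size=16,
)
print(f"About {blocks} blocks, or {tokens} cached tokens shared by all sequences")
```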

## 7. Conclusion

vLLM represents a significant advancement in LLM inference optimization, addressing key bottlenecks through innovations like:

1. **PagedAttention**: Efficient memory management of KV cache
2. **Continuous Batching**: Dynamic handling of requests for maximum GPU utilization
3. **Flash Attention Integration**: Optimized attention computation
4. **Distributed Inference**: Support for multi-GPU deployment

These optimizations together enable vLLM to deliver substantially higher throughput than many other inference engines while maintaining low latency, with reported gains of roughly 2-18x depending on the baseline and workload. As LLMs continue to grow in size and importance, inference engines like vLLM will be critical for making these models practically deployable in production environments.

## 8. References

- [vLLM Documentation](https://docs.vllm.ai/)
- [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180)
- [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135)
- [Deploying vLLM: A Step-by-Step Guide](https://ploomber.io/blog/vllm-deploy/)
- [vLLM GitHub Repository](https://github.com/vllm-project/vllm)