## This notebook requires GPU

This lab must be run in Google Colab to use GPU acceleration for model inference. Click the button below to open this notebook in Colab, then set your runtime to GPU:

**Runtime > Change Runtime Type > T4 GPU**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scott2b/coursera-msds-public/blob/main/notebooks/4_efficient_inference_vllm.ipynb)

# ⚡ Efficient Inference with vLLM

This notebook dives deep into high-performance LLM inference using vLLM, covering everything from basic setup to advanced optimization techniques.

## 🎯 Learning Objectives

By the end of this notebook, you will:
1. Master vLLM installation and setup
2. Understand different inference optimization techniques
3. Implement batched inference for maximum throughput
4. Learn about continuous batching and memory management
5. Compare vLLM with other inference engines
6. Deploy models with different quantization levels
7. Monitor and profile inference performance
8. Handle production workloads and scaling

## 🔧 Prerequisites

- Completed Notebook 1 (LLM Fundamentals)
- CUDA-compatible GPU (required for the vLLM examples; a Colab T4 works)
- Basic understanding of model serving
- Familiarity with performance optimization

In [None]:
# Install vLLM and dependencies
!pip install vllm
!pip install ray  # For distributed inference
!pip install transformers accelerate
!pip install psutil GPUtil  # For monitoring
!pip install matplotlib seaborn plotly

# Optional: Install for advanced quantization
# !pip install auto-gptq optimum

In [None]:
import vllm
from vllm import LLM, SamplingParams
import torch
import time
import psutil
import GPUtil
from transformers import AutoTokenizer
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from typing import List, Dict, Any
import warnings
warnings.filterwarnings('ignore')

print("🚀 vLLM Inference Environment Ready!")
print(f"vLLM version: {vllm.__version__}")
print(f"PyTorch version: {torch.__version__}")

# Check available hardware
print(f"\n🖥️  CPU cores: {psutil.cpu_count()}")
print(f"🧠 RAM: {psutil.virtual_memory().total / (1024**3):.1f} GB")

try:
    gpus = GPUtil.getGPUs()
    if gpus:
        gpu = gpus[0]
        print(f"🎮 GPU: {gpu.name}")
        print(f"🧠 GPU Memory: {gpu.memoryTotal} MB")
        print(f"📊 GPU Utilization: {gpu.load * 100:.1f}%")
    else:
        print("❌ No GPU detected")
except Exception:
    print("⚠️  GPU monitoring not available")

## 🏗️ vLLM Architecture and Setup

Understanding vLLM's core components and setup process.
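The setup cell in this section passes `gpu_memory_utilization` and `max_model_len` to vLLM. As a rough illustration of what those settings imply, the sketch below estimates how many tokens of KV cache fit in the reserved memory. The function names are hypothetical, the numbers assume a 16 GB GPU and a GPT-2-small-shaped model (12 layers, hidden size 768, about 0.25 GB of FP16 weights), and the estimate ignores activations and other runtime overhead, so treat it as an upper bound.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one token: one key and one value vector per layer.

    Assumes standard multi-head attention; grouped-query models need less.
    """
    return 2 * num_layers * hidden_size * bytes_per_elem


def kv_token_budget(gpu_gb: float, utilization: float, weights_gb: float,
                    num_layers: int, hidden_size: int) -> int:
    """Rough number of tokens whose KV cache fits after loading the weights."""
    reserved_bytes = gpu_gb * utilization * 1024**3
    free_for_kv = reserved_bytes - weights_gb * 1024**3
    return int(free_for_kv // kv_bytes_per_token(num_layers, hidden_size))


# Assumed values (not measured): 16 GB GPU, 85% utilization, 0.25 GB of weights.
per_token = kv_bytes_per_token(num_layers=12, hidden_size=768)
budget = kv_token_budget(16, 0.85, 0.25, num_layers=12, hidden_size=768)
print(f"KV cache per token: {per_token} bytes")
print(f"Approximate token budget: {budget:,}")
```

Dividing the token budget by `max_model_len` gives a rough ceiling on how many full-length sequences could be cached at once.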

In [None]:
def setup_vllm_model(model_name: str, quantization: str = None, **kwargs):
    """Setup vLLM model with optimal configuration"""

    print(f"🚀 Setting up vLLM with {model_name}...")

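    # Default configuration
    config = {
        "model": model_name,
        "tensor_parallel_size": 1,  # Increase for multi-GPU
        "gpu_memory_utilization": 0.85,  # Fraction of GPU memory vLLM may use (weights + KV cache)
        "max_model_len": 1024,  # Maximum context length (prompt + generated tokens)
        "enforce_eager": False,  # False allows CUDA graphs; True forces eager mode
        "trust_remote_code": True,  # Runs code from the model repo; enable only for trusted models
    }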

    # Add quantization if specified
    if quantization:
        config["quantization"] = quantization
        print(f"📊 Using {quantization} quantization")

    # Add custom parameters
    config.update(kwargs)

    try:
        llm = LLM(**config)
        print("✅ vLLM model loaded successfully!")
        print(f"[INFO] Model: {model_name}")

        # Release cached allocator blocks left over from loading
        import gc
        gc.collect()
        torch.cuda.empty_cache()
        print("Cleared unused cached GPU memory after model load.")

        return llm
    except Exception as e:
        print(f"❌ Failed to load model: {e}")
        return None

# Test with a small model
model_name = "microsoft/DialoGPT-small"
llm = setup_vllm_model(model_name, max_model_len=1024)

# Display model information
if llm:
    from transformers import AutoConfig

    # Read architecture details from the Hugging Face config for the same checkpoint
    hf_config = AutoConfig.from_pretrained(model_name)
    print("\n📋 Model Configuration:")
    print(f"Max sequence length: {hf_config.max_position_embeddings}")
    print(f"Vocabulary size: {hf_config.vocab_size}")
    print(f"Number of layers: {hf_config.num_hidden_layers}")
    print(f"Hidden size: {hf_config.hidden_size}")

## 🎯 Basic Inference with vLLM

Learn the fundamentals of running inference with vLLM.
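The demo in this section sets `temperature=0.1` in `SamplingParams`. To make the effect of that setting concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. It is an illustration of the general technique, not vLLM's sampler, and the logits are hypothetical scores for three candidate tokens.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores, not taken from a real model
for t in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring token, which is why a low value gives more consistent outputs across runs.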

In [None]:
# Basic inference example
def basic_inference_demo(llm):
    """Demonstrate basic inference capabilities"""

    if not llm:
        print("❌ No model loaded")
        return

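    # Sample classification-style prompts. A small dialogue model such as
    # DialoGPT will not classify reliably; the goal here is to exercise the API.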
    prompts = [
        "Classify this movie review: 'This film was absolutely amazing! The acting was superb and the plot kept me engaged throughout.'",
        "Analyze the sentiment: 'I waited 2 hours in line and the product was out of stock. Terrible customer service.'",
        "Determine if this is positive or negative: 'The food arrived cold and the portions were tiny for the price.'",
        "Classify the emotion in: 'I just won the lottery! I can't believe my luck!'"
    ]

    # Configure sampling parameters
    sampling_params = SamplingParams(
        temperature=0.1,  # Low temperature for consistent results
        max_tokens=50,    # Limit response length
        stop=["\n", "\r"], # Stop at newlines
        presence_penalty=0.0,
        frequency_penalty=0.0
    )

    print("🔍 Running basic inference...")
    start_time = time.time()

    outputs = llm.generate(prompts, sampling_params)

    end_time = time.time()
    total_time = end_time - start_time

    print(f"\n⚡ Inference completed in {total_time:.3f} seconds")
    print(f"📊 Throughput: {len(prompts) / total_time:.2f} requests/second")

    # Display results
    for i, output in enumerate(outputs):
        print(f"\n📝 Prompt {i+1}: {prompts[i][:60]}...")
        print(f"🤖 Response: {output.outputs[0].text.strip()}")
        print(f"📊 Tokens generated: {len(output.outputs[0].token_ids)}")

    return outputs

# Run basic inference demo
if llm:
    results = basic_inference_demo(llm)
else:
    print("⚠️  Skipping inference demo - no model loaded")

## 🚀 Advanced Inference Techniques

Exploring batched inference, continuous batching, and optimization strategies.
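Before benchmarking, it may help to see why continuous batching improves utilization. The toy simulation below counts decoding steps under two policies. It is a simplified model rather than vLLM's scheduler: every running sequence emits one token per step, the output lengths are hypothetical, and both function names are made up for this sketch.

```python
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A slot freed by a finished sequence is refilled before the next step."""
    queue = list(lengths)
    running: List[int] = []  # tokens still to generate for each active sequence
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        # One decode step: each running sequence produces one token
        steps += 1
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


output_lengths = [2, 10, 3, 9]  # hypothetical tokens to generate per request
print("static:", static_batching_steps(output_lengths, batch_size=2))
print("continuous:", continuous_batching_steps(output_lengths, batch_size=2))
```

With these lengths, static batching leaves a slot idle while the short request waits for the long one, whereas the continuous policy keeps both slots busy for most of the run.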

In [None]:
# Batched inference performance comparison
def benchmark_batch_sizes(llm, base_prompts: List[str], batch_sizes: List[int] = None):
    """Benchmark performance across different batch sizes"""

    if not llm:
        return None

    if batch_sizes is None:
        batch_sizes = [1, 2, 4, 8, 16]

    results = []

    for batch_size in batch_sizes:
        print(f"\n🔬 Testing batch size: {batch_size}")

        # Create prompts for this batch size
        prompts = base_prompts * (batch_size // len(base_prompts) + 1)
        prompts = prompts[:batch_size]

        sampling_params = SamplingParams(
            temperature=0.7,
            max_tokens=20,
            stop=["\n"]
        )

        # Time the inference
        start_time = time.time()
        outputs = llm.generate(prompts, sampling_params)
        end_time = time.time()

        total_time = end_time - start_time
        throughput = len(prompts) / total_time
        # Amortized time per request; each request in the batch actually
        # waits roughly total_time for its result.
        latency = total_time / len(prompts)

        results.append({
            'batch_size': batch_size,
            'total_time': total_time,
            'throughput': throughput,
            'latency': latency,
            'total_tokens': sum(len(output.outputs[0].token_ids) for output in outputs)
        })

        print(f"  ⏱️  Total time: {total_time:.3f}s")
        print(f"  ⚡ Throughput: {throughput:.2f} req/s")
        print(f"  🕐 Amortized time: {latency:.3f}s per request")

    return pd.DataFrame(results)

# Run batch size benchmark
base_prompts = [
    "Classify sentiment: This product is amazing!",
    "Analyze: I love this service.",
    "Determine polarity: Terrible experience."
]

if llm:
    benchmark_results = benchmark_batch_sizes(llm, base_prompts, [1, 2, 4])

    if benchmark_results is not None:
        # Visualize results
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

        ax1.plot(benchmark_results['batch_size'], benchmark_results['throughput'], 'bo-', linewidth=2)
        ax1.set_xlabel('Batch Size')
        ax1.set_ylabel('Throughput (req/s)')
        ax1.set_title('Throughput vs Batch Size')
        ax1.grid(True, alpha=0.3)

        ax2.plot(benchmark_results['batch_size'], benchmark_results['latency'], 'ro-', linewidth=2)
        ax2.set_xlabel('Batch Size')
        ax2.set_ylabel('Amortized Time per Request (s)')
        ax2.set_title('Amortized Time per Request vs Batch Size')
        ax2.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

        print("\n📊 Benchmark Summary:")
        print(benchmark_results.to_string(index=False))

## 🧠 Memory Management and Optimization

Understanding memory usage and optimization techniques in vLLM.
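One way to build intuition for PagedAttention is to compare how many KV cache slots go unused under block-based allocation versus reserving the maximum length for every sequence. The sketch below does that arithmetic. The helper names are hypothetical, the block size of 16 tokens is an assumption, and the sequence lengths are invented for illustration.

```python
import math
from typing import List


def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Number of fixed-size blocks required to hold a sequence's KV cache."""
    return math.ceil(num_tokens / block_size)


def wasted_slots_paged(seq_lens: List[int], block_size: int = 16) -> int:
    """Unused token slots when only the last block of each sequence is partly empty."""
    return sum(blocks_needed(n, block_size) * block_size - n for n in seq_lens)


def wasted_slots_contiguous(seq_lens: List[int], max_len: int) -> int:
    """Unused token slots when every sequence reserves max_len slots up front."""
    return sum(max_len - n for n in seq_lens)


seq_lens = [10, 100, 37]  # hypothetical current lengths of three sequences
print("paged waste:", wasted_slots_paged(seq_lens))
print("contiguous waste:", wasted_slots_contiguous(seq_lens, max_len=1024))
```

With block allocation, waste is bounded by less than one block per sequence, regardless of how far the sequence is from the maximum length.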

In [None]:
# Memory monitoring utilities
def monitor_memory():
    """Monitor system and GPU memory usage"""

    # CPU memory
    cpu_memory = psutil.virtual_memory()
    print(f"🖥️  CPU Memory: {cpu_memory.used / (1024**3):.2f}GB / {cpu_memory.total / (1024**3):.2f}GB ({cpu_memory.percent}%)")

    # GPU memory (if available)
    try:
        gpus = GPUtil.getGPUs()
        if gpus:
            gpu = gpus[0]
            print(f"🎮 GPU Memory: {gpu.memoryUsed}MB / {gpu.memoryTotal}MB ({gpu.memoryUtil * 100:.1f}%)")
            print(f"🎮 GPU Utilization: {gpu.load * 100:.1f}%")
    except Exception:
        print("⚠️  GPU monitoring not available")

# Memory optimization techniques
def memory_optimization_demo():
    """Demonstrate memory optimization techniques"""

    print("🧠 Memory Optimization Techniques:")
    print("=" * 40)

    optimizations = {
        "PagedAttention": "Stores the KV cache in fixed-size blocks to reduce memory fragmentation",
        "Continuous Batching": "Admits and retires requests at each decoding step for better utilization",
        "Quantization": "Stores weights at lower precision (e.g., INT8 or INT4 instead of FP16)",
        "KV Cache Sharing": "Shares KV cache blocks across sequences with a common prefix",
        "Memory Pooling": "Pre-allocates memory to reduce allocation overhead",
        "Preemption (Swap/Recompute)": "Evicts KV cache under memory pressure, then swaps it back in or recomputes it",
    }

    for technique, description in optimizations.items():
        print(f"\n🔧 {technique}:")
        print(f"   {description}")

    # Show current memory usage
    print("\n📊 Current Memory Status:")
    monitor_memory()

memory_optimization_demo()

## 🔍 Performance Profiling and Monitoring

Learn how to monitor and profile vLLM performance.
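Means and standard deviations can hide slow outliers, so latency is often summarized with percentiles as well. The following sketch uses a nearest-rank percentile in plain Python. The helper names and the sample latencies are hypothetical and are not produced by the cells in this notebook.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latencies(samples: List[float]) -> Dict[str, float]:
    """Collect the mean, median, tail percentile, and maximum in one dict."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "max": max(samples),
    }


# Hypothetical per-request latencies in seconds, with one slow outlier
samples = [0.10, 0.12, 0.11, 0.50, 0.13, 0.12, 0.11, 0.10, 0.12, 0.14]
summary = summarize_latencies(samples)
for name, value in summary.items():
    print(f"{name}: {value:.3f}s")
```

Here the single slow request pulls the mean above the median and dominates the tail percentile, which is the kind of behavior a mean-only report would obscure.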

In [None]:
# Performance profiling utilities
def profile_inference(llm, prompts: List[str], num_runs: int = 5):
    """Profile inference performance over multiple runs"""

    if not llm:
        return None

    latencies = []
    throughputs = []

    sampling_params = SamplingParams(
        temperature=0.7,
        max_tokens=30,
        stop=["\n"]
    )

    print(f"🔬 Profiling {len(prompts)} prompts over {num_runs} runs...")

    for run in range(num_runs):
        start_time = time.time()
        outputs = llm.generate(prompts, sampling_params)
        end_time = time.time()

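        total_time = end_time - start_time
        throughput = len(prompts) / total_time
        # Amortized time per request, not true per-request latency:
        # all prompts are submitted together and finish within total_time.
        latency = total_time / len(prompts)

        latencies.append(latency)
        throughputs.append(throughput)

        print(f"Run {run+1}: {throughput:.2f} req/s, {latency:.3f}s amortized per request")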

    # Calculate statistics
    results = {
        'mean_latency': np.mean(latencies),
        'std_latency': np.std(latencies),
        'mean_throughput': np.mean(throughputs),
        'std_throughput': np.std(throughputs),
        'min_latency': np.min(latencies),
        'max_latency': np.max(latencies)
    }

    print("\n📊 Profiling Results:")
    print(f"Latency: {results['mean_latency']:.3f} ± {results['std_latency']:.3f}s")
    print(f"Throughput: {results['mean_throughput']:.2f} ± {results['std_throughput']:.2f} req/s")
    print(f"Latency Range: {results['min_latency']:.3f} - {results['max_latency']:.3f}s")

    return results

# Profile performance
if llm:
    profile_prompts = [
        "Analyze this review: Great product!",
        "Classify sentiment: I love this.",
        "Determine polarity: Not good."
    ]

    profile_results = profile_inference(llm, profile_prompts, num_runs=3)
else:
    print("⚠️  Skipping profiling - no model loaded")

## 🚀 Scaling and Distributed Inference

Understanding how to scale vLLM for production workloads.
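When choosing `tensor_parallel_size`, two quick checks are useful: whether the model's attention heads divide evenly across GPUs, and roughly how much weight memory each GPU would hold. The sketch below assumes that an even split of heads is required (vLLM generally expects this) and that weights shard evenly, which is an approximation. The function names and the example sizes are hypothetical.

```python
from typing import List


def valid_tp_sizes(num_attention_heads: int, num_gpus: int) -> List[int]:
    """Tensor-parallel sizes up to num_gpus that split the heads evenly."""
    return [tp for tp in range(1, num_gpus + 1) if num_attention_heads % tp == 0]


def weights_per_gpu_gb(total_weights_gb: float, tp_size: int) -> float:
    """Approximate weight memory per GPU, assuming an even shard."""
    return total_weights_gb / tp_size


# Assumed example: a 12-head model on a node with up to 8 GPUs
print("valid sizes:", valid_tp_sizes(num_attention_heads=12, num_gpus=8))
# Assumed example: 14 GB of weights split across 2 GPUs
print("GB per GPU:", weights_per_gpu_gb(14.0, tp_size=2))
```

Sharding the weights leaves more memory per GPU for the KV cache, at the cost of communication between GPUs on every layer.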

In [None]:
# Scaling considerations
def scaling_guide():
    """Guide for scaling vLLM deployments"""

    scaling_strategies = {
        "Single GPU": {
            "use_case": "Development, small applications",
            "throughput": "100-500 req/s",
            "setup": "tensor_parallel_size=1"
        },
        "Multi-GPU (Single Node)": {
            "use_case": "Medium applications, A/B testing",
            "throughput": "500-2000 req/s",
            "setup": "tensor_parallel_size=2-4"
        },
        "Multi-Node Cluster": {
            "use_case": "Large-scale production",
            "throughput": "2000+ req/s",
            "setup": "Ray-based distributed setup"
        },
        "Edge Deployment": {
            "use_case": "Mobile, IoT applications",
            "throughput": "10-100 req/s",
            "setup": "Quantized models, CPU inference"
        }
    }

    print("🚀 vLLM Scaling Strategies:")
    print("=" * 50)

    for strategy, details in scaling_strategies.items():
        print(f"\n🎯 {strategy}:")
        print(f"   📋 Use Case: {details['use_case']}")
        print(f"   ⚡ Throughput (rough, workload-dependent): {details['throughput']}")
        print(f"   🔧 Setup: {details['setup']}")

    # Performance tips
    print("\n💡 Performance Optimization Tips:")
    tips = [
        "Use continuous batching for dynamic workloads",
        "Implement request prioritization for latency-sensitive tasks",
        "Monitor GPU utilization and adjust batch sizes accordingly",
        "Use model parallelism for very large models",
        "Implement caching for frequent requests",
        "Consider CPU offloading for memory-intensive tasks"
    ]

    for i, tip in enumerate(tips, 1):
        print(f"   {i}. {tip}")

scaling_guide()

## 📚 Key Takeaways

1. **vLLM Architecture**: Continuous batching, PagedAttention, and optimized memory management
2. **Performance Optimization**: Batch size tuning, quantization, and hardware utilization
3. **Production Readiness**: Monitoring, error handling, and scaling strategies
4. **Memory Management**: KV cache optimization and memory pooling
5. **Quantization**: Trade-offs between speed, memory, and accuracy
6. **Monitoring**: Throughput, latency, and resource utilization tracking

## 🚀 Next Steps

Now that you understand efficient inference, proceed to:
- **Notebook 3**: Advanced Fine-tuning with Unsloth and PEFT
- **Notebook 4**: Production Deployment and Scaling
- **Notebook 5**: Evaluation, Benchmarking, and Ethics

## 🔗 Additional Resources

- [vLLM Documentation](https://vllm.readthedocs.io/)
- [Efficient Inference Techniques](https://arxiv.org/abs/2205.05198)
- [PagedAttention Paper](https://arxiv.org/abs/2309.06180)
- [Model Compression Techniques](https://arxiv.org/abs/2002.11794)

## 🎯 Hands-on Exercises

1. **Batch Size Optimization**: Experiment with different batch sizes and measure throughput/latency trade-offs
2. **Quantization Comparison**: Compare FP16, INT8, and INT4 quantization on the same model
3. **Memory Profiling**: Monitor GPU memory usage during inference with different configurations
4. **Production Service**: Build a simple REST API that wraps `llm.generate` (for example, in a small service class of your own)
5. **Performance Benchmarking**: Create a comprehensive benchmark comparing different inference engines
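For the quantization comparison exercise, a back-of-the-envelope estimate of weight memory can set expectations before you measure anything. The sketch below assumes the footprint is simply parameters times bits per weight. It ignores quantization metadata such as scales and zero points, as well as activations and the KV cache, and the 7-billion-parameter model is a hypothetical example.

```python
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in GiB for a given storage precision."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1024**3


num_params = 7e9  # hypothetical model size, not a model used in this notebook
for precision in BITS_PER_WEIGHT:
    print(f"{precision}: {weight_footprint_gb(num_params, precision):.2f} GB")
```

Comparing these estimates with the GPU memory you observe in exercise 3 shows how much of the total is weights versus cache and runtime overhead.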

## 🎉 Conclusion

You've now worked through efficient LLM inference with vLLM. Key achievements:
- ✅ Understanding vLLM's core optimizations
- ✅ Running batched inference and understanding continuous batching
- ✅ Memory management and optimization techniques
- ✅ Quantization strategies and trade-offs
- ✅ Production considerations such as error handling and monitoring
- ✅ Performance profiling and monitoring
- ✅ Scaling strategies for different workloads

Ready to move on to advanced fine-tuning techniques! 🚀