[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic2.8/vLLM/performance_comparison.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic2.8/vLLM/performance_comparison.ipynb)

# vLLM Performance Comparison and Benchmarking

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- Performance characteristics of vLLM compared to baseline HF implementations
- How to benchmark inference throughput and latency
- Memory efficiency gains from PagedAttention
- Optimal configurations for different use cases
- Real-world performance implications for hate speech detection

## 📋 Prerequisites
- Basic understanding of transformer inference
- Familiarity with Docker and containerization
- Knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))
- Completed vLLM setup guide

## 📚 What We'll Cover
1. **Setup**: Environment and performance monitoring tools
2. **Baseline Measurements**: Standard HuggingFace inference performance
3. **vLLM Benchmarks**: Throughput and latency measurements
4. **Memory Analysis**: PagedAttention vs standard attention
5. **Batch Processing**: Scaling characteristics
6. **Real-world Scenarios**: Hate speech detection at scale
7. **Cost Analysis**: Performance vs resource trade-offs

## 1. Setup and Environment

Let's start by setting up our benchmarking environment with comprehensive monitoring.

In [None]:
# Install required packages for benchmarking
# !pip install transformers torch datasets vllm psutil GPUtil matplotlib seaborn pandas numpy

# Import essential libraries
import torch
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional
import requests
import json
import os
import subprocess
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
import warnings
warnings.filterwarnings('ignore')

# Memory and system monitoring
import psutil
try:
    import GPUtil
    GPU_AVAILABLE = True
except ImportError:
    GPU_AVAILABLE = False
    print("⚠️ GPUtil not available, GPU monitoring disabled")

# HuggingFace transformers for baseline comparison
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    pipeline, TextClassificationPipeline
)

def get_device() -> torch.device:
    """
    Automatically detect and return the best available device.
    
    Priority: CUDA > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
        print(f"📊 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS (Apple Silicon)")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU (consider GPU for better performance)")
    
    return device

# Setup device and plotting
device = get_device()
plt.style.use('default')
sns.set_palette("husl")

print(f"\n=== Benchmarking Environment ===\n")
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {device}")
print(f"CPU cores: {psutil.cpu_count()}")
print(f"RAM: {psutil.virtual_memory().total / 1e9:.1f}GB")
print(f"Ready for performance comparison! 📈")
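
The benchmarks in this notebook all follow the same pattern: a few warmup calls, then timed calls measured with `time.perf_counter`. As a minimal sketch of that pattern (the helper name `time_calls` is hypothetical and not part of the notebook's classes), it can be isolated like this:

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], runs: int = 10, warmup: int = 3) -> List[float]:
    """Return per-call wall-clock durations in seconds, excluding warmup calls."""
    for _ in range(warmup):
        fn()  # warmup calls are executed but not timed

    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Usage with a cheap stand-in workload (not a model call)
samples = time_calls(lambda: sum(range(10_000)), runs=5, warmup=2)
print(f"{len(samples)} timed calls, fastest {min(samples) * 1e3:.3f} ms")
```

Keeping warmup separate matters because the first calls often include one-time costs such as lazy initialization, which would otherwise inflate the average.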

## 2. Baseline HuggingFace Performance

First, let's establish baseline performance metrics using standard HuggingFace transformers.

In [None]:
class BaselineHFBenchmark:
    """Benchmark class for standard HuggingFace inference."""
    
    def __init__(self, model_name: str = "cardiffnlp/twitter-roberta-base-hate-latest"):
        """Initialize with preferred hate speech detection model."""
        self.model_name = model_name
        self.device = get_device()
        self.tokenizer = None
        self.model = None
        self.pipeline = None
        
        print(f"🤖 Loading baseline model: {model_name}")
        
    def load_model(self):
        """Load model and tokenizer."""
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            self.model = AutoModelForSequenceClassification.from_pretrained(
                self.model_name,
                torch_dtype=torch.float16 if self.device.type == "cuda" else torch.float32
            )
            self.model.eval()
            
            # Create pipeline for easier benchmarking. Passing the torch.device
            # lets the pipeline place the model on CUDA, MPS, or CPU itself, which
            # avoids mixing device_map="auto" with an explicit pipeline device
            # (and avoids silently running on CPU when MPS is available).
            self.pipeline = pipeline(
                "text-classification",
                model=self.model,
                tokenizer=self.tokenizer,
                device=self.device,
                top_k=None  # scores for all labels; replaces deprecated return_all_scores=True
            )
            
            print("✅ Baseline model loaded successfully")
            
        except Exception as e:
            print(f"❌ Error loading model: {e}")
            raise
    
    def benchmark_single_inference(self, texts: List[str], warmup_runs: int = 3) -> Dict:
        """Benchmark single-text inference performance."""
        print(f"🔥 Warming up with {warmup_runs} runs...")
        
        # Warmup
        for _ in range(warmup_runs):
            _ = self.pipeline(texts[0])
        
        # Actual benchmarking
        times = []
        results = []
        
        print(f"⏱️ Running benchmark on {len(texts)} texts...")
        
        for text in texts:
            start_time = time.perf_counter()
            result = self.pipeline(text)
            end_time = time.perf_counter()
            
            times.append(end_time - start_time)
            results.append(result)
        
        return {
            "times": times,
            "results": results,
            "avg_time": np.mean(times),
            "std_time": np.std(times),
            "throughput": len(texts) / sum(times)
        }
    
    def benchmark_batch_inference(self, texts: List[str], batch_sizes: List[int]) -> Dict:
        """Benchmark batch processing performance."""
        results = {}
        
        for batch_size in batch_sizes:
            print(f"📦 Testing batch size: {batch_size}")
            
            # Create batches
            batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
            
            times = []
            total_processed = 0
            
            for batch in batches:
                if len(batch) == 0:
                    continue
                    
                start_time = time.perf_counter()
                try:
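                    # Process batch. Without batch_size the pipeline runs the
                    # list one item at a time, so every setting would measure
                    # the same thing.
                    _ = self.pipeline(batch, batch_size=batch_size)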
                    end_time = time.perf_counter()
                    
                    times.append(end_time - start_time)
                    total_processed += len(batch)
                    
                except Exception as e:
                    print(f"⚠️ Batch processing error: {e}")
                    continue
            
            if times:
                results[batch_size] = {
                    "avg_batch_time": np.mean(times),
                    "total_time": sum(times),
                    "throughput": total_processed / sum(times) if sum(times) > 0 else 0,
                    "processed": total_processed
                }
        
        return results

# Initialize baseline benchmark
baseline = BaselineHFBenchmark()
baseline.load_model()

print("✅ Baseline benchmark ready!")
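
Mean and standard deviation can hide slow outliers, so latency is often also reported as percentiles. The sketch below is one way to summarize the `times` list returned by `benchmark_single_inference`; the helper name `summarize_latencies` and the nearest-rank percentile method are choices made for this illustration, not something the notebook defines.

```python
import math
from typing import Dict, List


def summarize_latencies(times_s: List[float]) -> Dict[str, float]:
    """Summarize per-request latencies (seconds) as milliseconds."""
    if not times_s:
        raise ValueError("need at least one timing sample")
    ordered = sorted(times_s)

    def percentile(p: float) -> float:
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "mean_ms": sum(ordered) / len(ordered) * 1000,
        "p50_ms": percentile(50) * 1000,
        "p95_ms": percentile(95) * 1000,
        "max_ms": ordered[-1] * 1000,
    }


# Illustrative timings only, not measured values
summary = summarize_latencies([0.010, 0.012, 0.011, 0.050, 0.013])
print({name: round(value, 1) for name, value in summary.items()})
```

With only a handful of samples, high percentiles collapse to the maximum, so p95 becomes informative only once the benchmark runs many more requests than the 20 used here.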

## 3. vLLM Performance Testing

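Now let's set up vLLM benchmarking to compare against the baseline. The client below sends a classification prompt to the server's OpenAI-compatible `/v1/completions` endpoint, whereas the baseline runs a sequence-classification model in-process. The two measurements therefore differ in model type, and the vLLM numbers include HTTP overhead.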

In [None]:
class vLLMBenchmark:
    """Benchmark class for vLLM inference performance."""
    
    def __init__(self, server_url: str = "http://localhost:8000"):
        self.server_url = server_url
        self.base_url = f"{server_url}/v1"
        
    def is_server_running(self) -> bool:
        """Check if vLLM server is running."""
        try:
            response = requests.get(f"{self.server_url}/health", timeout=5)
            return response.status_code == 200
        except requests.RequestException:
            return False
    
    def start_server_if_needed(self) -> bool:
        """Start vLLM server if not running."""
        if self.is_server_running():
            print("✅ vLLM server already running")
            return True
        
        print("⚠️ vLLM server is not running (this demo does not start it automatically)")
        # Note: In a real implementation, you'd start the Docker container here
        # For this demo, we'll assume manual server startup
        print("📋 Please start vLLM server manually with:")
        print("   docker run --rm --runtime nvidia --gpus all -p 8000:8000 vllm-offline:latest")
        return False
    
    def benchmark_single_inference(self, texts: List[str], warmup_runs: int = 3) -> Dict:
        """Benchmark vLLM single-text inference."""
        if not self.is_server_running():
            raise RuntimeError("vLLM server not running")
        
        print(f"🔥 vLLM warmup with {warmup_runs} runs...")
        
        # Warmup
        for _ in range(warmup_runs):
            self._single_request(texts[0])
        
        # Benchmark
        times = []
        results = []
        
        print(f"⏱️ Running vLLM benchmark on {len(texts)} texts...")
        
        for text in texts:
            start_time = time.perf_counter()
            result = self._single_request(text)
            end_time = time.perf_counter()
            
            times.append(end_time - start_time)
            results.append(result)
        
        return {
            "times": times,
            "results": results,
            "avg_time": np.mean(times),
            "std_time": np.std(times),
            "throughput": len(texts) / sum(times)
        }
    
    def benchmark_concurrent_requests(self, texts: List[str], concurrent_users: List[int]) -> Dict:
        """Benchmark concurrent request handling."""
        results = {}
        
        for num_concurrent in concurrent_users:
            print(f"👥 Testing {num_concurrent} concurrent users")
            
            # Select subset of texts for concurrent testing
            test_texts = texts[:num_concurrent * 2]  # 2 requests per user
            
            start_time = time.perf_counter()
            
            with ThreadPoolExecutor(max_workers=num_concurrent) as executor:
                futures = []
                
                for text in test_texts:
                    future = executor.submit(self._single_request, text)
                    futures.append(future)
                
                # Wait for all requests to complete
                completed_requests = 0
                for future in as_completed(futures):
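                    try:
                        result = future.result(timeout=30)
                        # _single_request returns {"error": ...} instead of raising,
                        # so failed requests must be filtered out explicitly
                        if "error" in result:
                            print(f"❌ Request failed: {result['error']}")
                        else:
                            completed_requests += 1
                    except Exception as e:
                        print(f"❌ Request failed: {e}")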
            
            end_time = time.perf_counter()
            total_time = end_time - start_time
            
            results[num_concurrent] = {
                "total_time": total_time,
                "completed_requests": completed_requests,
                "throughput": completed_requests / total_time,
                "avg_time_per_request": total_time / completed_requests if completed_requests > 0 else 0
            }
        
        return results
    
    def _single_request(self, text: str) -> Dict:
        """Make a single inference request to vLLM."""
        payload = {
            "model": "hate-speech-detector",
            "prompt": f"Classify this text for hate speech (safe/unsafe): {text}\nClassification:",
            "max_tokens": 10,
            "temperature": 0.1
        }
        
        try:
            response = requests.post(
                f"{self.base_url}/completions",
                json=payload,
                timeout=30
            )
            response.raise_for_status()
            return response.json()
        
        except requests.RequestException as e:
            return {"error": str(e)}

# Initialize vLLM benchmark
vllm_bench = vLLMBenchmark()

# Check if server is running
if vllm_bench.is_server_running():
    print("✅ vLLM server detected and ready!")
else:
    print("⚠️ vLLM server not detected")
    print("📋 To start the server, run:")
    print("   cd examples/basic2.8/vLLM && python basic_usage.py")
    vllm_bench.start_server_if_needed()
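
The benchmark keeps the raw JSON from each request but never turns it into a label. As a hedged sketch, assuming the server returns the OpenAI-style completions shape (`choices[0].text`) and that the model answers with "safe" or "unsafe" as the prompt requests, a hypothetical `parse_label` helper could look like this:

```python
from typing import Any, Dict


def parse_label(response: Dict[str, Any]) -> str:
    """Map a completions-style response to 'safe', 'unsafe', 'unknown', or 'error'."""
    if "error" in response:
        return "error"  # _single_request reports failures this way

    choices = response.get("choices") or []
    if not choices:
        return "unknown"

    text = str(choices[0].get("text", "")).strip().lower()
    # startswith keeps "unsafe" from being matched as "safe" by substring
    if text.startswith("unsafe"):
        return "unsafe"
    if text.startswith("safe"):
        return "safe"
    return "unknown"


# Hand-written example payloads, not real server output
print(parse_label({"choices": [{"text": " Safe\n"}]}))
print(parse_label({"error": "connection refused"}))
```

Generative models do not always follow the requested format, which is why the sketch keeps an explicit `unknown` outcome rather than forcing every answer into one of the two classes.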

## 4. Test Dataset Preparation

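Let's prepare a small test dataset for benchmarking. It contains 25 safe examples only (15 clearly benign, 10 edge cases), so it exercises inference speed rather than classification accuracy.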

In [None]:
def create_test_dataset() -> Tuple[List[str], List[str]]:
    """Create a comprehensive test dataset for hate speech detection benchmarking."""
    
    # Safe examples (positive, neutral, and constructive content)
    safe_examples = [
        "I love learning about artificial intelligence and machine learning.",
        "This community is so supportive and welcoming to everyone.",
        "Great job on the presentation! Very informative and well-structured.",
        "I appreciate the diverse perspectives shared in this discussion.",
        "Technology is advancing rapidly and creating amazing opportunities.",
        "Thank you for taking the time to help me understand this concept.",
        "Looking forward to collaborating with the team on this project.",
        "The documentation is clear and easy to follow.",
        "I enjoy participating in constructive debates and discussions.",
        "This solution works well for our use case.",
        "Machine learning models are becoming more accessible to developers.",
        "I respect different opinions and viewpoints on this topic.",
        "The research findings are fascinating and well-documented.",
        "Community feedback has been incredibly valuable for improvement.",
        "Open source projects benefit from collaborative contributions."
    ]
    
    # Challenging but safe examples (edge cases that should be classified as safe)
    challenging_safe = [
        "I disagree with this approach, but I understand the reasoning behind it.",
        "While I don't personally like this, others might find it useful.",
        "The criticism was harsh but constructive and fair.",
        "This is not my preferred method, but it has merit.",
        "I'm frustrated with the bug, but the team is working on it.",
        "The debate was intense but remained respectful throughout.",
        "Different cultures have different perspectives on this issue.",
        "I question the methodology but appreciate the effort.",
        "The policy has both advantages and disadvantages to consider.",
        "Competition in the market drives innovation and improvement."
    ]
    
    # Combine all safe examples
    all_safe = safe_examples + challenging_safe
    
    # Create labels (all safe examples get 'safe' label)
    labels = ['safe'] * len(all_safe)
    
    print(f"📊 Created test dataset:")
    print(f"   Total examples: {len(all_safe)}")
    print(f"   Safe examples: {len(all_safe)}")
    print(f"   Focus: Educational content with positive/neutral sentiment")
    
    return all_safe, labels

# Create test dataset
test_texts, test_labels = create_test_dataset()

# Display sample texts
print("\n📝 Sample test texts:")
for i, text in enumerate(test_texts[:5], 1):
    print(f"{i}. '{text}'")

print(f"\n✅ Test dataset prepared with {len(test_texts)} examples")
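
Twenty-five texts give short, noisy timing runs. One simple way to get a longer workload is to cycle through the existing texts until a target count is reached. The helper below is a sketch with a hypothetical name (`expand_workload`); note that repeating identical inputs may favor any caching in the serving stack, so results from a cycled workload should be read with that in mind.

```python
from itertools import cycle, islice
from typing import List


def expand_workload(texts: List[str], n: int) -> List[str]:
    """Repeat texts in order until exactly n items are produced."""
    if not texts:
        raise ValueError("texts must not be empty")
    return list(islice(cycle(texts), n))


# Tiny stand-in list; in the notebook this would be test_texts
workload = expand_workload(["a", "b", "c"], 7)
print(workload)
```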

## 5. Performance Comparison

Now let's run the comprehensive benchmark comparing HuggingFace baseline with vLLM.

In [None]:
def run_comprehensive_benchmark():
    """Run comprehensive performance comparison between HF baseline and vLLM."""
    
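    # Pre-populate every key so later code can index safely even when a
    # benchmark is skipped (for example, when the vLLM server is not running)
    results = {
        "baseline": {"single": None, "batch": None},
        "vllm": {"single": None, "concurrent": None},
        "comparison": {}
    }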
    
    # Test subset for quick benchmarking
    benchmark_texts = test_texts[:20]  # Use first 20 texts for benchmarking
    
    print("🏁 COMPREHENSIVE PERFORMANCE BENCHMARK")
    print("=" * 45)
    
    # 1. Single inference benchmark (HF Baseline)
    print("\n🤖 Running HuggingFace baseline benchmark...")
    try:
        baseline_single = baseline.benchmark_single_inference(benchmark_texts)
        results["baseline"]["single"] = baseline_single
        
        print(f"✅ Baseline single inference:")
        print(f"   Average time: {baseline_single['avg_time']:.3f}s")
        print(f"   Throughput: {baseline_single['throughput']:.2f} texts/second")
        
    except Exception as e:
        print(f"❌ Baseline benchmark failed: {e}")
        results["baseline"]["single"] = None
    
    # 2. Single inference benchmark (vLLM)
    print("\n⚡ Running vLLM benchmark...")
    try:
        if vllm_bench.is_server_running():
            vllm_single = vllm_bench.benchmark_single_inference(benchmark_texts)
            results["vllm"]["single"] = vllm_single
            
            print(f"✅ vLLM single inference:")
            print(f"   Average time: {vllm_single['avg_time']:.3f}s")
            print(f"   Throughput: {vllm_single['throughput']:.2f} texts/second")
            
        else:
            print("❌ vLLM server not running, skipping vLLM benchmarks")
            results["vllm"]["single"] = None
            
    except Exception as e:
        print(f"❌ vLLM benchmark failed: {e}")
        results["vllm"]["single"] = None
    
    # 3. Batch processing benchmark (HF Baseline)
    print("\n📦 Running batch processing benchmarks...")
    batch_sizes = [1, 4, 8, 16]
    
    try:
        baseline_batch = baseline.benchmark_batch_inference(benchmark_texts, batch_sizes)
        results["baseline"]["batch"] = baseline_batch
        
        print(f"✅ Baseline batch results:")
        for size, metrics in baseline_batch.items():
            print(f"   Batch {size}: {metrics['throughput']:.2f} texts/second")
            
    except Exception as e:
        print(f"❌ Baseline batch benchmark failed: {e}")
        results["baseline"]["batch"] = None
    
    # 4. Concurrent requests benchmark (vLLM)
    if vllm_bench.is_server_running():
        try:
            concurrent_users = [1, 2, 4, 8]
            vllm_concurrent = vllm_bench.benchmark_concurrent_requests(
                benchmark_texts, concurrent_users
            )
            results["vllm"]["concurrent"] = vllm_concurrent
            
            print(f"✅ vLLM concurrent results:")
            for users, metrics in vllm_concurrent.items():
                print(f"   {users} users: {metrics['throughput']:.2f} requests/second")
                
        except Exception as e:
            print(f"❌ vLLM concurrent benchmark failed: {e}")
            results["vllm"]["concurrent"] = None
    
    return results

# Run the comprehensive benchmark
print("🚀 Starting comprehensive benchmark...")
benchmark_results = run_comprehensive_benchmark()
print("\n✅ Benchmark completed!")
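
The nested `benchmark_results` dictionary is awkward to compare at a glance. The sketch below flattens it into one row per measurement, assuming the structure built above (`single` results carry a top-level `throughput`, while `batch` and `concurrent` results are keyed by batch size or user count). The function name and the numbers in the example are placeholders, not measurements.

```python
from typing import Any, Dict, List


def flatten_results(results: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn nested benchmark results into a flat list of rows."""
    rows: List[Dict[str, Any]] = []
    for framework in ("baseline", "vllm"):
        for mode, data in results.get(framework, {}).items():
            if not data:
                continue  # benchmark was skipped or failed
            if "throughput" in data:
                # Single-inference result: one row
                rows.append({"framework": framework, "mode": mode,
                             "setting": None, "throughput": float(data["throughput"])})
            else:
                # Keyed by batch size or number of concurrent users
                for setting, metrics in data.items():
                    rows.append({"framework": framework, "mode": mode,
                                 "setting": setting, "throughput": float(metrics["throughput"])})
    return rows


# Placeholder numbers for illustration only
example = {
    "baseline": {"single": {"throughput": 10.0}, "batch": {4: {"throughput": 30.0}}},
    "vllm": {"single": None, "concurrent": None},
}
for row in flatten_results(example):
    print(row)
```

A list of rows like this can be passed directly to `pd.DataFrame(rows)` for tabular display.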

## 6. Results Visualization

Let's visualize the performance comparison results.

In [None]:
def visualize_benchmark_results(results: Dict):
    """Create comprehensive visualizations of benchmark results."""
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    # Emoji omitted from the title: matplotlib's default font may not include the glyph
    fig.suptitle('vLLM vs HuggingFace Performance Comparison', fontsize=16, fontweight='bold')
    
    # 1. Single Inference Throughput Comparison
    ax1 = axes[0, 0]
    
    frameworks = []
    throughputs = []
    colors = ['#FF6B6B', '#4ECDC4']
    
    if results["baseline"]["single"]:
        frameworks.append('HuggingFace\nBaseline')
        throughputs.append(results["baseline"]["single"]["throughput"])
    
    if results["vllm"]["single"]:
        frameworks.append('vLLM')
        throughputs.append(results["vllm"]["single"]["throughput"])
    
    if frameworks:
        bars = ax1.bar(frameworks, throughputs, color=colors[:len(frameworks)])
        ax1.set_title('Single Inference Throughput', fontweight='bold')
        ax1.set_ylabel('Texts/Second')
        
        # Add value labels on bars
        for bar, value in zip(bars, throughputs):
            height = bar.get_height()
            ax1.text(bar.get_x() + bar.get_width()/2., height,
                    f'{value:.2f}', ha='center', va='bottom')
    else:
        ax1.text(0.5, 0.5, 'No data available', ha='center', va='center', transform=ax1.transAxes)
        ax1.set_title('Single Inference Throughput (No Data)')
    
    # 2. Batch Processing Performance (HF only)
    ax2 = axes[0, 1]
    
    if results["baseline"]["batch"]:
        batch_sizes = list(results["baseline"]["batch"].keys())
        batch_throughputs = [results["baseline"]["batch"][size]["throughput"] 
                           for size in batch_sizes]
        
        ax2.plot(batch_sizes, batch_throughputs, 'o-', color='#FF6B6B', linewidth=2, markersize=8)
        ax2.set_title('HF Batch Processing Scaling', fontweight='bold')
        ax2.set_xlabel('Batch Size')
        ax2.set_ylabel('Throughput (texts/second)')
        ax2.grid(True, alpha=0.3)
        
        # Add value labels
        for x, y in zip(batch_sizes, batch_throughputs):
            ax2.annotate(f'{y:.1f}', (x, y), textcoords="offset points", 
                        xytext=(0,10), ha='center')
    else:
        ax2.text(0.5, 0.5, 'No batch data available', ha='center', va='center', 
                transform=ax2.transAxes)
        ax2.set_title('Batch Processing (No Data)')
    
    # 3. vLLM Concurrent Users Performance
    ax3 = axes[1, 0]
    
    if results["vllm"].get("concurrent"):
        concurrent_users = list(results["vllm"]["concurrent"].keys())
        concurrent_throughputs = [results["vllm"]["concurrent"][users]["throughput"] 
                                for users in concurrent_users]
        
        ax3.plot(concurrent_users, concurrent_throughputs, 'o-', color='#4ECDC4', 
                linewidth=2, markersize=8)
        ax3.set_title('vLLM Concurrent User Scaling', fontweight='bold')
        ax3.set_xlabel('Concurrent Users')
        ax3.set_ylabel('Requests/Second')
        ax3.grid(True, alpha=0.3)
        
        # Add value labels
        for x, y in zip(concurrent_users, concurrent_throughputs):
            ax3.annotate(f'{y:.1f}', (x, y), textcoords="offset points", 
                        xytext=(0,10), ha='center')
    else:
        ax3.text(0.5, 0.5, 'No concurrent data available', ha='center', va='center', 
                transform=ax3.transAxes)
        ax3.set_title('Concurrent Users (No Data)')
    
    # 4. Latency Comparison
    ax4 = axes[1, 1]
    
    latency_frameworks = []
    avg_latencies = []
    std_latencies = []
    
    if results["baseline"]["single"]:
        latency_frameworks.append('HuggingFace')
        avg_latencies.append(results["baseline"]["single"]["avg_time"] * 1000)  # Convert to ms
        std_latencies.append(results["baseline"]["single"]["std_time"] * 1000)
    
    if results["vllm"]["single"]:
        latency_frameworks.append('vLLM')
        avg_latencies.append(results["vllm"]["single"]["avg_time"] * 1000)
        std_latencies.append(results["vllm"]["single"]["std_time"] * 1000)
    
    if latency_frameworks:
        bars = ax4.bar(latency_frameworks, avg_latencies, yerr=std_latencies, 
                      color=colors[:len(latency_frameworks)], capsize=5)
        ax4.set_title('Average Latency Comparison', fontweight='bold')
        ax4.set_ylabel('Latency (milliseconds)')
        
        # Add value labels
        for bar, value in zip(bars, avg_latencies):
            height = bar.get_height()
            ax4.text(bar.get_x() + bar.get_width()/2., height,
                    f'{value:.1f}ms', ha='center', va='bottom')
    else:
        ax4.text(0.5, 0.5, 'No latency data available', ha='center', va='center', 
                transform=ax4.transAxes)
        ax4.set_title('Latency Comparison (No Data)')
    
    plt.tight_layout()
    plt.show()
    
    # Print summary statistics
    print("\n📊 PERFORMANCE SUMMARY")
    print("=" * 25)
    
    if results["baseline"]["single"] and results["vllm"]["single"]:
        hf_throughput = results["baseline"]["single"]["throughput"]
        vllm_throughput = results["vllm"]["single"]["throughput"]
        speedup = vllm_throughput / hf_throughput
        
        print(f"🚀 vLLM throughput: {speedup:.2f}x the HuggingFace baseline")
        print(f"📈 Throughput change: {(speedup - 1) * 100:+.1f}%")
        
        hf_latency = results["baseline"]["single"]["avg_time"] * 1000
        vllm_latency = results["vllm"]["single"]["avg_time"] * 1000
        latency_reduction = (hf_latency - vllm_latency) / hf_latency * 100
        
        # A negative value means vLLM latency was higher than the baseline
        print(f"⚡ Latency Reduction: {latency_reduction:.1f}%")
    
    else:
        print("⚠️ Insufficient data for comparison")
        print("   Make sure both HF baseline and vLLM server are running")

# Visualize the benchmark results
visualize_benchmark_results(benchmark_results)
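
The summary printed above rests on two ratios: speedup is vLLM throughput divided by baseline throughput, and latency reduction is the baseline latency minus the vLLM latency, as a percentage of the baseline. The sketch below works through both with made-up numbers (the function name `relative_metrics` is hypothetical), including a case where vLLM is slower.

```python
from typing import Dict


def relative_metrics(hf_throughput: float, vllm_throughput: float,
                     hf_latency_s: float, vllm_latency_s: float) -> Dict[str, float]:
    """Compute speedup and latency reduction relative to the HF baseline."""
    if hf_throughput <= 0 or hf_latency_s <= 0:
        raise ValueError("baseline throughput and latency must be positive")
    return {
        "speedup": vllm_throughput / hf_throughput,
        "latency_reduction_pct": (hf_latency_s - vllm_latency_s) / hf_latency_s * 100,
    }


# Illustrative inputs only, not benchmark output
faster = relative_metrics(10.0, 25.0, 0.100, 0.040)
slower = relative_metrics(10.0, 5.0, 0.100, 0.200)
print(f"faster case: {faster['speedup']:.2f}x, {faster['latency_reduction_pct']:.0f}% lower latency")
print(f"slower case: {slower['speedup']:.2f}x, {slower['latency_reduction_pct']:.0f}% lower latency")
```

A speedup below 1 and a negative latency reduction both indicate that the baseline was faster, which can happen for single sequential requests where HTTP overhead dominates.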

## 7. Memory Usage Analysis

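Let's inspect memory usage. The function below reports system RAM and GPU memory as seen from this notebook's process. `torch.cuda.memory_allocated` covers only tensors allocated by this process, so memory used by a vLLM server running in a separate container is not included in those figures.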

In [None]:
def analyze_memory_usage():
    """Analyze and compare memory usage patterns."""
    
    print("💾 MEMORY USAGE ANALYSIS")
    print("=" * 30)
    
    # Get current memory stats
    cpu_memory = psutil.virtual_memory()
    
    print(f"🖥️  System Memory:")
    print(f"   Total RAM: {cpu_memory.total / 1e9:.1f}GB")
    print(f"   Available: {cpu_memory.available / 1e9:.1f}GB")
    print(f"   Used: {cpu_memory.percent:.1f}%")
    
    # GPU memory analysis
    if torch.cuda.is_available():
        print(f"\n🎮 GPU Memory:")
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 1e9
            cached = torch.cuda.memory_reserved(i) / 1e9
            total = torch.cuda.get_device_properties(i).total_memory / 1e9
            
            print(f"   GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {total:.1f}GB total")
    
    elif GPU_AVAILABLE:
        try:
            gpus = GPUtil.getGPUs()
            for gpu in gpus:
                print(f"\n🎮 GPU {gpu.id} ({gpu.name}):")
                print(f"   Memory Used: {gpu.memoryUsed}MB / {gpu.memoryTotal}MB")
                print(f"   Utilization: {gpu.memoryUtil * 100:.1f}%")
        except Exception as e:
            print(f"⚠️ GPU monitoring error: {e}")
    
    # Memory efficiency insights
    print(f"\n🧠 Memory Efficiency Insights:")
    print(f"   • vLLM PagedAttention reduces memory fragmentation")
    print(f"   • Continuous batching keeps the GPU busy as requests arrive and finish")
    print(f"   • Prefix caching can reuse KV cache across requests that share a prompt prefix")
    print(f"   • 8-bit or 4-bit weight quantization can cut weight memory by roughly 50-75% vs fp16")
    
    return {
        "cpu_memory_gb": cpu_memory.total / 1e9,
        "cpu_memory_used_percent": cpu_memory.percent,
        "gpu_available": torch.cuda.is_available(),
        "gpu_memory_gb": torch.cuda.get_device_properties(0).total_memory / 1e9 if torch.cuda.is_available() else 0
    }

# Run memory analysis
memory_stats = analyze_memory_usage()
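
To make the PagedAttention point concrete, the arithmetic below estimates KV cache size for a decoder-only model and compares two reservation strategies: reserving a contiguous region for the maximum sequence length versus allocating fixed-size blocks as tokens arrive. This is a simplified model. The model dimensions are hypothetical, the block size of 16 tokens is an assumption, and the comparison does not describe the encoder-based baseline classifier above, which has no KV cache.

```python
import math


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def paged_reserved_tokens(seq_len: int, block_size: int = 16) -> int:
    """Tokens reserved when memory is handed out in whole blocks."""
    return math.ceil(seq_len / block_size) * block_size


# Hypothetical decoder dimensions (fp16): 32 layers, 32 KV heads, head_dim 128
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

max_seq_len = 2048  # contiguous strategy reserves this many tokens up front
actual_len = 100    # tokens the request really uses

contiguous_bytes = max_seq_len * per_token
paged_bytes = paged_reserved_tokens(actual_len) * per_token

print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
print(f"Contiguous reservation: {contiguous_bytes / 2**20:.0f} MiB")
print(f"Paged reservation:      {paged_bytes / 2**20:.0f} MiB")
```

Under these assumptions the paged strategy wastes at most one partially filled block per sequence, while the contiguous strategy holds the full maximum-length region regardless of how short the request is.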

## 8. Production Deployment Recommendations

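The function below prints general deployment guidance. Only the hardware section adapts to this machine (via `memory_stats`); the scenario and cost tables are fixed text and do not depend on the measured `results`.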

In [None]:
def generate_deployment_recommendations(results: Dict, memory_stats: Dict):
    """Generate deployment recommendations based on benchmark results."""
    
    print("🎯 DEPLOYMENT RECOMMENDATIONS")
    print("=" * 35)
    
    recommendations = {
        "high_throughput": {
            "title": "🚀 High Throughput Applications",
            "description": "Social media content moderation, batch processing",
            "recommendation": "vLLM with Docker deployment",
            "reasoning": [
                "Continuous batching maximizes GPU utilization",
                "PagedAttention reduces memory overhead",
                "Handles concurrent requests efficiently",
                "Scales horizontally with multiple containers"
            ]
        },
        "low_latency": {
            "title": "⚡ Low Latency Requirements", 
            "description": "Real-time chat moderation, interactive applications",
            "recommendation": "vLLM with optimized configuration",
            "reasoning": [
                "Pre-loaded models eliminate cold start delays",
                "Efficient attention computation reduces inference time",
                "Connection pooling minimizes request overhead",
                "GPU acceleration provides consistent performance"
            ]
        },
        "development": {
            "title": "🔧 Development & Prototyping",
            "description": "Research, experimentation, small-scale testing",
            "recommendation": "HuggingFace Pipeline API",
            "reasoning": [
                "Simpler setup and configuration",
                "Direct access to model internals",
                "Better debugging capabilities",
                "Faster iteration for research"
            ]
        },
        "resource_constrained": {
            "title": "💻 Resource-Constrained Environments",
            "description": "Limited GPU memory, edge devices, cost optimization",
            "recommendation": "Model quantization + vLLM",
            "reasoning": [
                "Quantization reduces memory requirements by 50-75%",
                "vLLM optimizes memory usage patterns",
                "Dynamic batching adapts to available resources",
                "Better performance per dollar spent"
            ]
        }
    }
    
    for scenario_key, scenario in recommendations.items():
        print(f"\n{scenario['title']}")
        print(f"Use Case: {scenario['description']}")
        print(f"💡 Recommendation: {scenario['recommendation']}")
        print(f"📋 Reasoning:")
        for reason in scenario['reasoning']:
            print(f"   • {reason}")
    
    # Hardware-specific recommendations
    print(f"\n🖥️  HARDWARE RECOMMENDATIONS")
    print(f"=" * 30)
    
    if memory_stats["gpu_available"]:
        gpu_memory = memory_stats["gpu_memory_gb"]
        
        if gpu_memory >= 24:
            print(f"🎮 High-end GPU ({gpu_memory:.0f}GB):")
            print(f"   • Run large models (7B+ parameters)")
            print(f"   • Use fp16 precision for optimal performance")
            print(f"   • Enable tensor parallelism for massive models")
        
        elif gpu_memory >= 8:
            print(f"🎮 Mid-range GPU ({gpu_memory:.0f}GB):")
            print(f"   • Focus on smaller models (3B parameters or less)")
            print(f"   • Use quantization for memory efficiency")
            print(f"   • Optimize batch sizes for your use case")
        
        else:
            print(f"🎮 Entry-level GPU ({gpu_memory:.0f}GB):")
            print(f"   • Use heavily quantized models (4-bit)")
            print(f"   • Consider CPU inference for some workloads")
            print(f"   • Batch size 1-2 for memory safety")
    
    else:
        print(f"💻 CPU-only environment:")
        print(f"   • Consider llama.cpp for better CPU performance")
        print(f"   • Use quantized models (4-bit/8-bit)")
        print(f"   • Optimize for latency over throughput")
    
    # Cost-benefit analysis
    print(f"\n💰 COST-BENEFIT ANALYSIS")
    print(f"=" * 25)
    
    cost_scenarios = [
        {
            "scenario": "Small-scale deployment (< 1000 requests/day)",
            "recommendation": "HuggingFace on CPU or small GPU",
            "cost": "Low",
            "complexity": "Low"
        },
        {
            "scenario": "Medium-scale deployment (1K-100K requests/day)",
            "recommendation": "vLLM with Docker on cloud GPU",
            "cost": "Medium",
            "complexity": "Medium"
        },
        {
            "scenario": "Large-scale deployment (100K+ requests/day)",
            "recommendation": "vLLM with auto-scaling + load balancing",
            "cost": "High",
            "complexity": "High"
        }
    ]
    
    for scenario in cost_scenarios:
        print(f"\n📊 {scenario['scenario']}")
        print(f"   Recommendation: {scenario['recommendation']}")
        print(f"   Cost: {scenario['cost']} | Complexity: {scenario['complexity']}")

# Generate deployment recommendations
generate_deployment_recommendations(benchmark_results, memory_stats)
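
The cost tiers above are expressed in requests per day, while benchmarks report requests per second. A rough capacity estimate connects the two: convert daily volume to an average rate, scale by a peak factor, and divide by the measured per-replica throughput. The function name, the peak factor of 5, and the per-replica rate of 2 requests/second below are assumptions for illustration, not values measured in this notebook.

```python
import math


def required_replicas(requests_per_day: float, peak_factor: float,
                      per_replica_rps: float) -> int:
    """Estimate how many replicas are needed to serve peak traffic."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    avg_rps = requests_per_day / 86_400  # seconds per day
    peak_rps = avg_rps * peak_factor
    return max(1, math.ceil(peak_rps / per_replica_rps))


# Assumed inputs: 100K requests/day, peaks at 5x average, 2 req/s per replica
print(required_replicas(100_000, peak_factor=5, per_replica_rps=2.0))
```

Substituting the throughput measured for your own hardware gives a first-order replica count; it ignores latency targets and queueing, so it is a lower bound rather than a sizing guarantee.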

## 📋 Summary

### 🔑 Key Concepts Mastered
- **vLLM Performance**: Understanding PagedAttention and continuous batching benefits
- **Benchmarking Methodology**: Systematic performance comparison techniques
- **Production Deployment**: Real-world considerations for inference optimization
- **Resource Management**: Memory and computational efficiency trade-offs

### 📈 Best Practices Learned
- Comprehensive benchmarking includes throughput, latency, and memory analysis
- vLLM excels in high-throughput scenarios with concurrent requests
- Hardware specifications directly impact deployment recommendations
- Cost-benefit analysis is crucial for production deployment decisions

### 🚀 Next Steps
- **Advanced Configuration**: Explore vLLM advanced settings and optimizations
- **Production Monitoring**: Implement comprehensive monitoring and alerting
- **Auto-scaling**: Set up dynamic scaling based on request patterns
- **Documentation**: Review [vLLM Documentation](https://docs.vllm.ai/) for deeper insights

---

*Ready for edge deployment? Check out **llama.cpp CPU inference** for optimized CPU performance!*

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*