# Module 2: vLLM Baseline Benchmarking

This notebook runs GuideLLM benchmarks against your single-GPU vLLM deployment to establish a performance baseline for ParasolCloud's customer service workload.

## Prerequisites

Before running this notebook:
- Your vLLM InferenceService (`llama-vllm-single`) should be deployed and in `Ready` state
- Grafana dashboard should be configured to monitor vLLM metrics
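
The first prerequisite can be checked from the notebook itself. The sketch below assumes the InferenceService exposes the usual KServe `status.conditions` list with a `Ready` entry; the helper names `is_ready` and `fetch_isvc` are hypothetical and not part of this workshop's tooling.

```python
import json
import subprocess


def is_ready(isvc: dict) -> bool:
    """Return True if the resource reports a Ready condition with status "True"."""
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def fetch_isvc(name: str = "llama-vllm-single") -> dict:
    """Fetch the InferenceService as JSON via `oc` (needs a logged-in session)."""
    out = subprocess.run(
        ["oc", "get", "inferenceservice", name, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(out.stdout)


# Hand-written status stanza for illustration, not real cluster output
sample = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
print(is_ready(sample))
```

On a live cluster, `is_ready(fetch_isvc())` would perform the same check against the actual deployment.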

## Objectives

By the end of this notebook, you'll have:
- Established a single-GPU performance baseline
- Measured TTFT (time to first token), ITL (inter-token latency), and throughput at different load levels
- Identified the saturation point for the single-GPU deployment

## Setup and Configuration

In [None]:
# Install GuideLLM if not already installed
!pip install guidellm -q

In [None]:
import subprocess
import json
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

# Create output directory for results
output_dir = Path("benchmark-results/module-02")
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Results will be saved to: {output_dir}")

## Get InferenceService URL

Retrieve the URL for your vLLM deployment:

In [None]:
# Get the inference URL from the InferenceService
result = subprocess.run(
    ["oc", "get", "inferenceservice", "llama-vllm-single", "-o", "jsonpath='{.status.url}'"],
    capture_output=True,
    text=True
)

INFERENCE_URL = result.stdout.strip().strip("'")
print(f"InferenceService URL: {INFERENCE_URL}")

# Fail early if the URL lookup returned nothing (this does not test reachability)
if not INFERENCE_URL:
    raise ValueError("Could not retrieve InferenceService URL. Ensure llama-vllm-single is deployed and Ready.")

# Construct the completions endpoint
COMPLETIONS_URL = f"{INFERENCE_URL}/v1/completions"
print(f"Completions endpoint: {COMPLETIONS_URL}")
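
The cell above only confirms that a URL was returned, not that the server answers. A minimal reachability sketch follows, assuming the deployment exposes vLLM's OpenAI-compatible `/v1/models` route and that its TLS certificate is trusted by the notebook environment (a self-signed route certificate would make this return `False`). The helper names are hypothetical.

```python
import urllib.error
import urllib.request


def build_endpoint(base_url: str, path: str = "/v1/completions") -> str:
    """Join a base URL and an API path without doubling the slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_reachable(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/v1/models answers with HTTP 200."""
    try:
        url = build_endpoint(base_url, "/v1/models")
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Covers connection failures, HTTP errors, TLS errors, and malformed URLs
        return False


# Placeholder host for illustration; pass INFERENCE_URL in the notebook
print(build_endpoint("https://example.com/"))
```

Calling `is_reachable(INFERENCE_URL)` before the benchmarks gives a quick failure instead of a multi-minute run that produces no results.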

## Create ParasolCloud Customer Service Dataset

Generate a dataset that simulates ParasolCloud's customer service workload with shared system prompts:

In [None]:
# ParasolCloud customer service prompts with shared prefix
customer_service_prompts = [
    "You are a helpful customer service agent. User asks: How do I reset my password?",
    "You are a helpful customer service agent. User asks: What are your business hours?",
    "You are a helpful customer service agent. User asks: How can I track my order?",
    "You are a helpful customer service agent. User asks: What is your return policy?",
    "You are a helpful customer service agent. User asks: How do I contact support?",
    "You are a helpful customer service agent. User asks: Can I change my shipping address?",
    "You are a helpful customer service agent. User asks: What payment methods do you accept?",
    "You are a helpful customer service agent. User asks: How do I update my account information?",
    "You are a helpful customer service agent. User asks: Where is my refund?",
    "You are a helpful customer service agent. User asks: How long does shipping take?",
    "You are a helpful customer service agent. User asks: Do you offer international shipping?",
    "You are a helpful customer service agent. User asks: Can I cancel my order?",
    "You are a helpful customer service agent. User asks: How do I create an account?",
    "You are a helpful customer service agent. User asks: What is your privacy policy?",
    "You are a helpful customer service agent. User asks: Do you have a mobile app?",
    "You are a helpful customer service agent. User asks: How do I apply a discount code?",
]

# Save to file for GuideLLM
dataset_file = output_dir / "customer-service-prompts.txt"
with open(dataset_file, 'w') as f:
    f.write('\n'.join(customer_service_prompts))

print(f"Created dataset with {len(customer_service_prompts)} prompts")
print(f"Saved to: {dataset_file}")
print(f"\nShared prefix: 'You are a helpful customer service agent.'")
print(f"This simulates ParasolCloud's workload where all requests share a system prompt.")
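
To see how much of each prompt is actually shared, the common prefix can be measured directly. This sketch uses three of the prompts above and counts characters, not tokens, so it only approximates what a tokenizer-level prefix cache would see.

```python
import os

prompts = [
    "You are a helpful customer service agent. User asks: How do I reset my password?",
    "You are a helpful customer service agent. User asks: What are your business hours?",
    "You are a helpful customer service agent. User asks: Can I cancel my order?",
]

# os.path.commonprefix compares character by character, so it works on any strings
shared = os.path.commonprefix(prompts)
avg_len = sum(len(p) for p in prompts) / len(prompts)
shared_fraction = len(shared) / avg_len

print(f"Shared prefix: {shared!r}")
print(f"Share of average prompt length: {shared_fraction:.0%}")
```

Note that the literal shared prefix also includes the `User asks: ` lead-in, since every prompt repeats it.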

## Benchmark 1: Low Load (1 concurrent request)

Understand single-request performance without queuing:

In [None]:
%%time
print("Running low-load benchmark (1 concurrent request)...")
print("This will take approximately 2-3 minutes.\n")

# Run GuideLLM benchmark
subprocess.run([
    "guidellm",
    "--target", COMPLETIONS_URL,
    "--model", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--data", str(dataset_file),
    "--rate", "1",
    "--max-requests", "50",
    "--output-format", "json",
    "--output-file", str(output_dir / "baseline-single-low.json")
], check=True)

print("\n✓ Low-load benchmark complete")
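
The benchmark headings describe load as concurrency, while the commands pass `--rate`. If your GuideLLM version treats that value as requests per second (an assumption worth checking against its documentation), Little's law links the two: average requests in flight equal arrival rate times mean latency. The latency figures below are made up for illustration.

```python
def avg_in_flight(rate_rps: float, mean_latency_s: float) -> float:
    """Little's law: mean number of requests in flight = arrival rate x mean latency."""
    return rate_rps * mean_latency_s


# (rate in req/s, hypothetical mean end-to-end latency in seconds)
for rate, latency in [(1, 0.8), (5, 1.2), (20, 3.0)]:
    print(f"rate={rate:>2} req/s, latency={latency}s -> ~{avg_in_flight(rate, latency):.1f} in flight")
```

Under that reading, a rate of 1 req/s only corresponds to a single concurrent request when each request takes about one second.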

## Benchmark 2: Medium Load (5 concurrent requests)

Increase concurrency to measure throughput scaling:

In [None]:
%%time
print("Running medium-load benchmark (5 concurrent requests)...")
print("This will take approximately 3-4 minutes.\n")

subprocess.run([
    "guidellm",
    "--target", COMPLETIONS_URL,
    "--model", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--data", str(dataset_file),
    "--rate", "5",
    "--max-requests", "100",
    "--output-format", "json",
    "--output-file", str(output_dir / "baseline-single-medium.json")
], check=True)

print("\n✓ Medium-load benchmark complete")

## Benchmark 3: High Load (20 concurrent requests)

Push to saturation to find maximum throughput:

In [None]:
%%time
print("Running high-load benchmark (20 concurrent requests)...")
print("This will take approximately 5-7 minutes.\n")
print("💡 TIP: Open your Grafana dashboard now to watch metrics in real-time!")

subprocess.run([
    "guidellm",
    "--target", COMPLETIONS_URL,
    "--model", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--data", str(dataset_file),
    "--rate", "20",
    "--max-requests", "200",
    "--output-format", "json",
    "--output-file", str(output_dir / "baseline-single-high.json")
], check=True)

print("\n✓ High-load benchmark complete")

## Analyze Results

Load and analyze the benchmark results:

In [None]:
def load_benchmark_results(filepath):
    """Load GuideLLM JSON results."""
    with open(filepath, 'r') as f:
        return json.load(f)

def extract_metrics(results):
    """Extract key metrics from GuideLLM results."""
    summary = results.get('summary', {})
    return {
        'throughput': summary.get('throughput', 0),
        'ttft_p50': summary.get('ttft_p50', 0),
        'ttft_p95': summary.get('ttft_p95', 0),
        'itl_p50': summary.get('itl_p50', 0),
        'itl_p95': summary.get('itl_p95', 0),
        'latency_p95': summary.get('latency_p95', 0),
    }

# Load all benchmark results
low_results = load_benchmark_results(output_dir / "baseline-single-low.json")
medium_results = load_benchmark_results(output_dir / "baseline-single-medium.json")
high_results = load_benchmark_results(output_dir / "baseline-single-high.json")

# Extract metrics
low_metrics = extract_metrics(low_results)
medium_metrics = extract_metrics(medium_results)
high_metrics = extract_metrics(high_results)

print("✓ Results loaded successfully")
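
Because `extract_metrics` falls back to `0` for any missing key, a schema mismatch would silently produce a table of zeros. The sketch below lists the keys actually present in a results file and reports which expected ones are absent. The `summary` layout is taken from the cell above and may not match the output of your GuideLLM version; both helper names are hypothetical.

```python
def list_key_paths(obj, prefix="", max_depth=3):
    """Return dotted key paths found in nested dicts, down to max_depth levels."""
    paths = []
    if isinstance(obj, dict) and max_depth > 0:
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.append(path)
            paths.extend(list_key_paths(value, path, max_depth - 1))
    return paths


def missing_keys(results, expected, section="summary"):
    """Return the expected keys that are absent from results[section]."""
    section_data = results.get(section, {})
    return [k for k in expected if k not in section_data]


# Hand-written example, not real benchmark output
example = {"summary": {"throughput": 4.2, "ttft_p50": 0.05}}
print(list_key_paths(example))
print(missing_keys(example, ["throughput", "ttft_p50", "ttft_p95"]))
```

Running `list_key_paths(low_results)` on a real file shows where the metrics live if they are not under `summary`.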

## Performance Summary Table

In [None]:
# Create summary DataFrame
summary_df = pd.DataFrame([
    {'Load Level': 'Low (1 concurrent)', **low_metrics},
    {'Load Level': 'Medium (5 concurrent)', **medium_metrics},
    {'Load Level': 'High (20 concurrent)', **high_metrics},
])

# Format for display
summary_df['throughput'] = summary_df['throughput'].round(2)
summary_df['ttft_p50'] = (summary_df['ttft_p50'] * 1000).round(0).astype(int)  # Convert to ms
summary_df['ttft_p95'] = (summary_df['ttft_p95'] * 1000).round(0).astype(int)
summary_df['latency_p95'] = (summary_df['latency_p95'] * 1000).round(0).astype(int)

# Rename columns for clarity
summary_df = summary_df.rename(columns={
    'throughput': 'Throughput (req/s)',
    'ttft_p50': 'TTFT P50 (ms)',
    'ttft_p95': 'TTFT P95 (ms)',
    'latency_p95': 'P95 Latency (ms)'
})

print("\n=== Single GPU vLLM Baseline Performance ===")
print(summary_df[['Load Level', 'Throughput (req/s)', 'TTFT P50 (ms)', 'TTFT P95 (ms)', 'P95 Latency (ms)']].to_string(index=False))

# Save summary
summary_df.to_csv(output_dir / "baseline-summary.csv", index=False)
print(f"\n✓ Summary saved to: {output_dir / 'baseline-summary.csv'}")

## Visualize Performance Scaling

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

load_levels = ['Low\n(1)', 'Medium\n(5)', 'High\n(20)']
throughput = summary_df['Throughput (req/s)'].values
ttft_p95 = summary_df['TTFT P95 (ms)'].values

# Throughput plot
axes[0].bar(load_levels, throughput, color=['green', 'orange', 'red'])
axes[0].set_ylabel('Throughput (req/s)', fontsize=12)
axes[0].set_xlabel('Concurrency Level', fontsize=12)
axes[0].set_title('Single GPU Throughput vs. Load', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(throughput):
    axes[0].text(i, v + 1, f'{v:.1f}', ha='center', va='bottom', fontweight='bold')

# TTFT P95 latency plot
axes[1].bar(load_levels, ttft_p95, color=['green', 'orange', 'red'])
axes[1].set_ylabel('TTFT P95 (ms)', fontsize=12)
axes[1].set_xlabel('Concurrency Level', fontsize=12)
axes[1].set_title('Single GPU TTFT P95 vs. Load', fontsize=14, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(ttft_p95):
    axes[1].text(i, v + 10, f'{v:.0f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.savefig(output_dir / 'baseline-performance.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"✓ Chart saved to: {output_dir / 'baseline-performance.png'}")
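
One way to make "saturation point" concrete is to compare offered load with achieved throughput: once the server completes noticeably fewer requests per second than were sent, it is saturated. This sketch assumes the `--rate` values are offered rates in req/s and uses a hypothetical 90% tolerance; the achieved numbers are invented for illustration.

```python
def saturated_levels(offered_rps, achieved_rps, tolerance=0.9):
    """Return offered rates whose achieved throughput fell below tolerance * offered."""
    return [
        offered
        for offered, achieved in zip(offered_rps, achieved_rps)
        if achieved < tolerance * offered
    ]


offered = [1, 5, 20]
achieved = [1.0, 4.9, 11.3]  # made-up values, not measured results
print(saturated_levels(offered, achieved))
```

With real data, `achieved` would come from the throughput column of the summary table.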

## Key Observations

Document your findings:

In [None]:
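import math

max_throughput = high_metrics['throughput']
if max_throughput <= 0:
    raise ValueError("High-load throughput is 0 or missing; check the benchmark output before computing capacity.")
ttft_p95_high = high_metrics['ttft_p95'] * 1000  # Convert to ms
parasolcloud_target = 500  # req/s
# Round up: a fractional GPU still requires a whole one
gpus_needed_naive = math.ceil(parasolcloud_target / max_throughput)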

observations = f"""
=== Single GPU vLLM Baseline Observations ===

Configuration:
- GPUs: 1
- Model: Meta-Llama-3.1-8B-Instruct
- Max model length: 4096 tokens
- GPU memory utilization: 0.9

Performance Metrics (High Load - 20 concurrent):
- Maximum Throughput: ~{max_throughput:.1f} req/s
- TTFT (P95): ~{ttft_p95_high:.0f}ms
- P95 Latency: ~{high_metrics['latency_p95'] * 1000:.0f}ms

Key Observations:
- Single GPU saturates at ~{max_throughput:.0f} req/s
- Latency increases significantly under load
- No cache sharing across requests (every request recomputes full prompt)

ParasolCloud's Challenge:
- Current capacity: {max_throughput:.0f} req/s per GPU
- Target capacity: {parasolcloud_target}+ req/s
- Naive calculation: Need ~{gpus_needed_naive:.0f} GPUs for {parasolcloud_target/max_throughput:.0f}x scaling
- Question: Can intelligent orchestration improve this ratio?
"""

print(observations)

# Save observations
with open(output_dir / 'baseline-observations.txt', 'w') as f:
    f.write(observations)

print(f"\n✓ Observations saved to: {output_dir / 'baseline-observations.txt'}")

## Next Steps

You've established your single-GPU baseline! Key findings:

1. **Maximum throughput**: The high-load run shows the req/s rate at which the single GPU saturates (recorded in `baseline-summary.csv`)
2. **Latency degradation**: TTFT increases significantly as load approaches saturation
3. **No cache sharing**: Each request recomputes the shared "You are a helpful customer service agent" prefix

### What's Next?

In **Module 3**, you'll:
- Scale to 4 GPUs using naive horizontal scaling (round-robin load balancing)
- Run the same benchmarks
- Discover why simple scaling doesn't deliver linear 4x improvement
- Understand the impact of isolated KV caches (no sharing)
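
"Linear 4x improvement" can be quantified as scaling efficiency: multi-GPU throughput divided by the GPU count times the single-GPU baseline. The function below is a minimal sketch with a hypothetical name, and the numbers in the example are placeholders, not measured results.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 = exactly n times the baseline)."""
    if throughput_1 <= 0 or n_gpus <= 0:
        raise ValueError("baseline throughput and GPU count must be positive")
    return throughput_n / (n_gpus * throughput_1)


# Placeholder values: 12 req/s on one GPU, 36 req/s on four
print(f"{scaling_efficiency(36.0, 12.0, 4):.0%}")
```

Plugging this module's baseline in as `throughput_1` makes the Module 3 comparison a single number.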

### Before You Go

✓ Review your Grafana dashboard - take screenshots of:
- Peak throughput (requests per second)
- GPU cache usage
- Request latency distribution

✓ Keep your benchmark results - you'll compare them against scaled deployments in later modules