# Module 2: vLLM Baseline Tail Latency Benchmarking

This notebook runs GuideLLM benchmarks against your single-GPU vLLM deployment to establish a **tail latency baseline** (P95/P99) for ParasolCloud's customer service workload.

## Prerequisites

Before running this notebook:
- Your vLLM InferenceService (`llama-vllm-single`) should be deployed and in `Ready` state
- Grafana dashboard should be configured to monitor vLLM metrics

## Objectives

By the end of this notebook, you'll have:
- Established single-GPU **P95/P99 tail latency baseline** across two critical scenarios:
  - **Scenario 1**: Single-turn requests with large shared prefixes (prefill bottleneck)
  - **Scenario 2**: Multi-turn chat conversations (cache reuse opportunity)
- Identified how tail latency degrades under load
- Documented what your most frustrated users currently experience

## Setup and Configuration

In [None]:
# Install GuideLLM if not already installed
!pip install guidellm -q

In [None]:
import subprocess
import json
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

# Create output directory for results
output_dir = Path("benchmark-results/module-02")
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Results will be saved to: {output_dir}")

## Get InferenceService URL

Retrieve the URL for your vLLM deployment:

In [None]:
# Get the inference URL from the InferenceService
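result = subprocess.run(
    # No shell is involved, so the jsonpath expression must not be wrapped in quotes
    ["oc", "get", "inferenceservice", "llama-vllm-single", "-o", "jsonpath={.status.url}"],
    capture_output=True,
    text=True
)
if result.returncode != 0:
    print(f"oc failed: {result.stderr.strip()}")

INFERENCE_URL = result.stdout.strip()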
print(f"InferenceService URL: {INFERENCE_URL}")

# Fail early if no URL was returned
if not INFERENCE_URL:
    raise ValueError("Could not retrieve InferenceService URL. Ensure llama-vllm-single is deployed and Ready.")

# Construct the completions endpoint
COMPLETIONS_URL = f"{INFERENCE_URL}/v1/completions"
print(f"Completions endpoint: {COMPLETIONS_URL}")
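
The cell above only checks that the URL string is non-empty. A minimal sketch of an actual reachability probe follows, assuming the deployment exposes vLLM's OpenAI-compatible `/v1/models` route without authentication. The helper names `models_url` and `is_reachable` are hypothetical, and your cluster may additionally require a bearer token or custom TLS settings.

```python
import urllib.request

def models_url(base_url: str) -> str:
    """Build the model-listing URL from the service base URL."""
    return base_url.rstrip("/") + "/v1/models"

def is_reachable(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on the model-listing route answers with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # URLError and HTTPError subclass OSError; ValueError covers malformed URLs.
        return False
```

Calling `is_reachable(INFERENCE_URL)` before starting a benchmark would surface routing or readiness problems in seconds rather than minutes into a run.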

## Create ParasolCloud Customer Service Datasets

Generate two datasets to test different tail latency scenarios:

**Scenario 1: Single-turn with large shared prefixes**
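- All requests share one system prompt (roughly 100 words as written, which is likely well under 300 tokens; lengthen it to stress prefill harder)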
- Tests prefill bottleneck and cache miss impact

**Scenario 2: Multi-turn chat conversations**
- Simulates ongoing customer conversations
- Tests KV cache reuse across conversation turns
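
To make the shared-prefix idea concrete, here is a small sketch that measures how much of each prompt is covered by the prefix common to all prompts. It works on characters rather than tokens, so treat the ratio as an approximation. The helper `shared_prefix_ratio` and the demo prompts are hypothetical stand-ins, not the notebook's dataset.

```python
import os

def shared_prefix_ratio(prompts):
    """Fraction of each (non-empty) prompt's characters covered by the common prefix."""
    prefix = os.path.commonprefix(prompts)
    return [len(prefix) / len(p) for p in prompts]

# Hypothetical stand-ins for the notebook's prompts.
demo_prefix = "You are a support agent. " * 4
demo_prompts = [
    demo_prefix + "Customer asks: How do I reset my password?",
    demo_prefix + "Customer asks: What are your support hours?",
]
ratios = shared_prefix_ratio(demo_prompts)
print([f"{r:.0%}" for r in ratios])
```

The higher the ratio, the larger the share of prefill work that is identical from one request to the next.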

In [None]:
# Scenario 1: Single-turn requests with large shared prefixes
# All requests share the same system prompt - testing cache miss impact

# Shared system prompt (roughly 100 words; likely well under 300 tokens, so extend it for a heavier prefill test)
system_prompt = """You are an expert customer service agent for ParasolCloud, a leading enterprise cloud infrastructure provider. 
Your role is to assist customers with technical issues, account questions, billing inquiries, and service recommendations. 
You should be professional, empathetic, and solution-oriented. Always prioritize customer satisfaction while following company policies. 
When addressing technical issues, provide clear step-by-step instructions. For billing questions, explain charges clearly and offer to escalate complex cases. 
If you don't know an answer, be honest and offer to connect the customer with a specialist. Remember to maintain a friendly, helpful tone throughout the interaction."""

single_turn_prompts = [
    f"{system_prompt}\n\nCustomer asks: How do I reset my account password?",
    f"{system_prompt}\n\nCustomer asks: What are your business support hours?",
    f"{system_prompt}\n\nCustomer asks: How can I track my open support tickets?",
    f"{system_prompt}\n\nCustomer asks: What is your data retention policy?",
    f"{system_prompt}\n\nCustomer asks: How do I contact the billing department?",
    f"{system_prompt}\n\nCustomer asks: Can I change my subscription tier mid-cycle?",
    f"{system_prompt}\n\nCustomer asks: What payment methods do you accept?",
    f"{system_prompt}\n\nCustomer asks: How do I update my account contact information?",
    f"{system_prompt}\n\nCustomer asks: Where can I find documentation for your API?",
    f"{system_prompt}\n\nCustomer asks: How long does it take to process a refund?",
    f"{system_prompt}\n\nCustomer asks: Do you offer enterprise-level SLAs?",
    f"{system_prompt}\n\nCustomer asks: Can I cancel my service without penalties?",
    f"{system_prompt}\n\nCustomer asks: How do I enable two-factor authentication?",
    f"{system_prompt}\n\nCustomer asks: What is your service uptime guarantee?",
    f"{system_prompt}\n\nCustomer asks: Do you have a mobile app for account management?",
    f"{system_prompt}\n\nCustomer asks: How do I apply my discount code to my account?",
]

# Scenario 2: Multi-turn chat conversations
# Simulates ongoing customer conversations with growing context

multi_turn_conversations = [
    # Conversation 1
    f"{system_prompt}\n\nCustomer: I'm having trouble logging into my account.",
    f"{system_prompt}\n\nCustomer: I'm having trouble logging into my account.\nAgent: I'd be happy to help. Have you tried resetting your password?\nCustomer: No, how do I do that?",
    f"{system_prompt}\n\nCustomer: I'm having trouble logging into my account.\nAgent: I'd be happy to help. Have you tried resetting your password?\nCustomer: No, how do I do that?\nAgent: You can click the 'Forgot Password' link on the login page.\nCustomer: I don't see that link. Where exactly is it?",
    
    # Conversation 2
    f"{system_prompt}\n\nCustomer: What's included in the enterprise plan?",
    f"{system_prompt}\n\nCustomer: What's included in the enterprise plan?\nAgent: The enterprise plan includes 24/7 support, dedicated account manager, and 99.99% uptime SLA.\nCustomer: What about data backup?",
    f"{system_prompt}\n\nCustomer: What's included in the enterprise plan?\nAgent: The enterprise plan includes 24/7 support, dedicated account manager, and 99.99% uptime SLA.\nCustomer: What about data backup?\nAgent: Yes, automated daily backups with 30-day retention are included.\nCustomer: Can I increase the retention period?",
    
    # Conversation 3
    f"{system_prompt}\n\nCustomer: I was charged twice this month.",
    f"{system_prompt}\n\nCustomer: I was charged twice this month.\nAgent: I apologize for the inconvenience. Let me look into this for you. Can you provide your account number?\nCustomer: It's ACC-12345678.",
    f"{system_prompt}\n\nCustomer: I was charged twice this month.\nAgent: I apologize for the inconvenience. Let me look into this for you. Can you provide your account number?\nCustomer: It's ACC-12345678.\nAgent: Thank you. I see the duplicate charge. I'm processing a refund now.\nCustomer: How long until I see the refund?",
]

# Save both datasets
dataset_dir = output_dir / "datasets"
dataset_dir.mkdir(exist_ok=True)

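# GuideLLM's text loader is assumed to read one prompt per line, so embedded
# newlines are flattened to spaces to keep each prompt on a single line.
def one_line(prompt):
    return " ".join(prompt.split())

single_turn_file = dataset_dir / "single-turn-prompts.txt"
with open(single_turn_file, 'w') as f:
    f.write('\n'.join(one_line(p) for p in single_turn_prompts))

multi_turn_file = dataset_dir / "multi-turn-prompts.txt"
with open(multi_turn_file, 'w') as f:
    f.write('\n'.join(one_line(p) for p in multi_turn_conversations))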

print(f"✓ Created Scenario 1 dataset: {len(single_turn_prompts)} single-turn prompts with shared prefix")
print(f"  System prompt length: ~{len(system_prompt.split())} words")
print(f"  Saved to: {single_turn_file}\n")

print(f"✓ Created Scenario 2 dataset: {len(multi_turn_conversations)} multi-turn conversation turns")
print(f"  Tests growing context across conversation turns")
print(f"  Saved to: {multi_turn_file}\n")

print("These scenarios test two critical tail latency bottlenecks:")
print("  1. Prefill queuing from large shared prefixes (single-turn)")
print("  2. Cache misses in multi-turn conversations (no KV cache reuse)")
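
The multi-turn prompts above are written out by hand, with each turn repeating the full history. A minimal sketch of generating the same growing-context shape programmatically is shown below. The function `build_turn_prompts` and the `(speaker, text)` tuple format are assumptions for illustration, not part of GuideLLM.

```python
def build_turn_prompts(system_prompt, turns):
    """Return one prompt per customer turn, each including all earlier turns.

    `turns` is a list of (speaker, text) tuples in conversation order.
    """
    prompts = []
    history = []
    for speaker, text in turns:
        history.append(f"{speaker}: {text}")
        if speaker == "Customer":
            # Emit a prompt whenever the customer has just spoken.
            prompts.append(system_prompt + "\n\n" + "\n".join(history))
    return prompts

demo_turns = [
    ("Customer", "I was charged twice this month."),
    ("Agent", "Can you provide your account number?"),
    ("Customer", "It's ACC-12345678."),
]
demo = build_turn_prompts("SYSTEM", demo_turns)
print(len(demo), "prompts; last one has", len(demo[-1]), "characters")
```

Because every prompt is a strict extension of the previous one, each later turn repeats everything a server would already have processed for the earlier turn.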

## Scenario 1 Benchmarks: Single-turn with Large Shared Prefixes

Test tail latency impact of prefill queuing with shared system prompts.

We'll run three load levels:
- **Low load** (1 concurrent): Baseline P95/P99 without queuing
- **Medium load** (5 concurrent): Moderate queuing - observe P95 degradation
- **High load** (20 concurrent): Heavy queuing - measure worst-case P95/P99
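
As intuition for why tail latency grows with load, the sketch below simulates a toy single-server FIFO queue with Poisson arrivals and a fixed service time. It is not a model of vLLM, which batches requests continuously. The service time is a hypothetical placeholder, and `percentile` and `simulate_queue` are illustrative helpers.

```python
import math
import random

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def simulate_queue(arrival_rate, service_time, n_requests=2000, seed=0):
    """Return per-request latency (queue wait plus service) in seconds."""
    rng = random.Random(seed)
    clock = 0.0            # arrival time of the current request
    server_free_at = 0.0   # time at which the server finishes its backlog
    latencies = []
    for _ in range(n_requests):
        clock += rng.expovariate(arrival_rate)
        start = max(clock, server_free_at)
        server_free_at = start + service_time
        latencies.append(server_free_at - clock)
    return latencies

SERVICE_TIME = 0.04  # hypothetical seconds per request, not a measured value
for rate in (1, 5, 20):
    lat = simulate_queue(rate, SERVICE_TIME)
    print(f"rate={rate:>2} req/s  P95={percentile(lat, 95) * 1000:.0f} ms  "
          f"P99={percentile(lat, 99) * 1000:.0f} ms")
```

Even in this simplified model, P95 and P99 climb much faster than the median as the arrival rate approaches the server's capacity. That is the pattern the three load levels are meant to expose.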

In [None]:
%%time
print("Running Scenario 1 - Low load (1 concurrent request)...")
print("This will take approximately 2-3 minutes.\n")

# Run GuideLLM benchmark
subprocess.run([
    "guidellm",
    "--target", COMPLETIONS_URL,
    "--model", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--data", str(single_turn_file),
    "--rate", "1",
    "--max-requests", "50",
    "--output-format", "json",
    "--output-file", str(output_dir / "scenario1-single-turn-low.json")
], check=True)

print("\n✓ Scenario 1 low-load benchmark complete")

### Scenario 1 - Medium Load

In [None]:
%%time
print("Running Scenario 1 - Medium load (5 concurrent requests)...")
print("This will take approximately 3-4 minutes.\n")

subprocess.run([
    "guidellm",
    "--target", COMPLETIONS_URL,
    "--model", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--data", str(single_turn_file),
    "--rate", "5",
    "--max-requests", "100",
    "--output-format", "json",
    "--output-file", str(output_dir / "scenario1-single-turn-medium.json")
], check=True)

print("\n✓ Scenario 1 medium-load benchmark complete")

### Scenario 1 - High Load

In [None]:
%%time
print("Running Scenario 1 - High load (20 concurrent requests)...")
print("This will take approximately 5-7 minutes.\n")
print("💡 TIP: Open your Grafana dashboard now to watch metrics in real-time!")

subprocess.run([
    "guidellm",
    "--target", COMPLETIONS_URL,
    "--model", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--data", str(single_turn_file),
    "--rate", "20",
    "--max-requests", "200",
    "--output-format", "json",
    "--output-file", str(output_dir / "scenario1-single-turn-high.json")
], check=True)

print("\n✓ Scenario 1 high-load benchmark complete")

## Scenario 2 Benchmarks: Multi-turn Chat Conversations

Test tail latency impact of cache misses in multi-turn conversations.

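This baseline runs a single vLLM instance, so every turn reaches the same server. In a scaled-out deployment without intelligent routing, conversation turns land on random vLLM instances, causing cache misses and redundant recomputation of the conversation history.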

In [None]:
%%time
print("Running Scenario 2 - Multi-turn chat (medium load)...")
print("This tests conversation latency with growing context.")
print("This will take approximately 3-4 minutes.\n")

subprocess.run([
    "guidellm",
    "--target", COMPLETIONS_URL,
    "--model", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--data", str(multi_turn_file),
    "--rate", "5",
    "--max-requests", "50",
    "--output-format", "json",
    "--output-file", str(output_dir / "scenario2-multi-turn.json")
], check=True)

print("\n✓ Scenario 2 multi-turn benchmark complete")

## Analyze Results

Load and analyze the benchmark results:

In [None]:
def load_benchmark_results(filepath):
    """Load GuideLLM JSON results."""
    with open(filepath, 'r') as f:
        return json.load(f)

def extract_metrics(results):
    """Extract key metrics from GuideLLM results."""
    summary = results.get('summary', {})
    return {
        'throughput': summary.get('throughput', 0),
        'ttft_p50': summary.get('ttft_p50', 0),
        'ttft_p95': summary.get('ttft_p95', 0),
        'itl_p50': summary.get('itl_p50', 0),
        'itl_p95': summary.get('itl_p95', 0),
        'latency_p95': summary.get('latency_p95', 0),
    }

# Load all benchmark results
# File names must match the --output-file paths used in the Scenario 1 cells
low_results = load_benchmark_results(output_dir / "scenario1-single-turn-low.json")
medium_results = load_benchmark_results(output_dir / "scenario1-single-turn-medium.json")
high_results = load_benchmark_results(output_dir / "scenario1-single-turn-high.json")

# Extract metrics
low_metrics = extract_metrics(low_results)
medium_metrics = extract_metrics(medium_results)
high_metrics = extract_metrics(high_results)

print("✓ Results loaded successfully")
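
Because `extract_metrics` falls back to `0` for any missing key, a schema mismatch would show up as silent zeros rather than an error. The sketch below flags missing files and missing summary keys up front. The key names simply mirror the ones this notebook assumes and may not match the report layout of your GuideLLM version. `check_result_file` is a hypothetical helper.

```python
import json
from pathlib import Path

# Keys the notebook expects under "summary"; an assumption about the report schema.
EXPECTED_KEYS = ["throughput", "ttft_p50", "ttft_p95", "itl_p50", "itl_p95", "latency_p95"]

def check_result_file(path, expected_keys=EXPECTED_KEYS):
    """Return a list of problems found in one results file (empty means OK)."""
    path = Path(path)
    if not path.exists():
        return [f"missing file: {path}"]
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"invalid JSON in {path}: {exc}"]
    summary = data.get("summary", {}) if isinstance(data, dict) else {}
    return [f"missing key in summary: {key}" for key in expected_keys if key not in summary]
```

Running this over each results file before building the summary table makes it obvious whether a zero in the table is a real measurement or a missing field.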

## Performance Summary Table

In [None]:
# Create summary DataFrame
summary_df = pd.DataFrame([
    {'Load Level': 'Low (1 concurrent)', **low_metrics},
    {'Load Level': 'Medium (5 concurrent)', **medium_metrics},
    {'Load Level': 'High (20 concurrent)', **high_metrics},
])

# Format for display
summary_df['throughput'] = summary_df['throughput'].round(2)
summary_df['ttft_p50'] = (summary_df['ttft_p50'] * 1000).round(0).astype(int)  # Convert to ms
summary_df['ttft_p95'] = (summary_df['ttft_p95'] * 1000).round(0).astype(int)
summary_df['latency_p95'] = (summary_df['latency_p95'] * 1000).round(0).astype(int)

# Rename columns for clarity
summary_df = summary_df.rename(columns={
    'throughput': 'Throughput (req/s)',
    'ttft_p50': 'TTFT P50 (ms)',
    'ttft_p95': 'TTFT P95 (ms)',
    'latency_p95': 'P95 Latency (ms)'
})

print("\n=== Single GPU vLLM Baseline Performance ===")
print(summary_df[['Load Level', 'Throughput (req/s)', 'TTFT P50 (ms)', 'TTFT P95 (ms)', 'P95 Latency (ms)']].to_string(index=False))

# Save summary
summary_df.to_csv(output_dir / "baseline-summary.csv", index=False)
print(f"\n✓ Summary saved to: {output_dir / 'baseline-summary.csv'}")
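
One way to express how tail latency degrades under load is as a multiple of the low-load value. The sketch below shows the arithmetic. The numbers are placeholders for illustration only, not measured results, and `degradation_factor` is a hypothetical helper.

```python
def degradation_factor(baseline_ms, loaded_ms):
    """How many times slower the loaded percentile is than the baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return loaded_ms / baseline_ms

# Placeholder TTFT P95 values in milliseconds; substitute your own measurements.
example = {"low": 120.0, "medium": 180.0, "high": 600.0}
for level in ("medium", "high"):
    factor = degradation_factor(example["low"], example[level])
    print(f"{level}: {factor:.1f}x the low-load TTFT P95")
```

With real data, the same calculation can be applied to the `TTFT P95 (ms)` column of the summary table.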

## Visualize Performance Scaling

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

load_levels = ['Low\n(1)', 'Medium\n(5)', 'High\n(20)']
throughput = summary_df['Throughput (req/s)'].values
ttft_p95 = summary_df['TTFT P95 (ms)'].values

# Throughput plot
axes[0].bar(load_levels, throughput, color=['green', 'orange', 'red'])
axes[0].set_ylabel('Throughput (req/s)', fontsize=12)
axes[0].set_xlabel('Concurrency Level', fontsize=12)
axes[0].set_title('Single GPU Throughput vs. Load', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(throughput):
    axes[0].text(i, v + 1, f'{v:.1f}', ha='center', va='bottom', fontweight='bold')

# TTFT P95 latency plot
axes[1].bar(load_levels, ttft_p95, color=['green', 'orange', 'red'])
axes[1].set_ylabel('TTFT P95 (ms)', fontsize=12)
axes[1].set_xlabel('Concurrency Level', fontsize=12)
axes[1].set_title('Single GPU TTFT P95 vs. Load', fontsize=14, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(ttft_p95):
    axes[1].text(i, v + 10, f'{v:.0f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.savefig(output_dir / 'baseline-performance.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"✓ Chart saved to: {output_dir / 'baseline-performance.png'}")

## Key Observations

Document your findings:

In [None]:
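max_throughput = high_metrics['throughput']
if not max_throughput:
    # extract_metrics defaults to 0, which would cause a division by zero below
    raise ValueError("Throughput is missing or zero in the high-load results; check the JSON summary keys.")
ttft_p95_high = high_metrics['ttft_p95'] * 1000  # Convert to ms
parasolcloud_target = 500  # req/s
gpus_needed_naive = parasolcloud_target / max_throughput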

observations = f"""
=== Single GPU vLLM Baseline Observations ===

Configuration:
- GPUs: 1
- Model: Meta-Llama-3.1-8B-Instruct
- Max model length: 4096 tokens
- GPU memory utilization: 0.9

Performance Metrics (High Load - 20 concurrent):
- Maximum Throughput: ~{max_throughput:.1f} req/s
- TTFT (P95): ~{ttft_p95_high:.0f}ms
- P95 Latency: ~{high_metrics['latency_p95'] * 1000:.0f}ms

Key Observations:
- Single GPU saturates at ~{max_throughput:.0f} req/s
- Latency increases significantly under load
- No cache sharing across requests (every request recomputes full prompt)

ParasolCloud's Challenge:
- Current capacity: {max_throughput:.0f} req/s per GPU
- Target capacity: {parasolcloud_target}+ req/s
- Naive calculation: Need ~{gpus_needed_naive:.0f} GPUs for {parasolcloud_target/max_throughput:.0f}x scaling
- Question: Can intelligent orchestration improve this ratio?
"""

print(observations)

# Save observations
with open(output_dir / 'baseline-observations.txt', 'w') as f:
    f.write(observations)

print(f"\n✓ Observations saved to: {output_dir / 'baseline-observations.txt'}")
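
The naive GPU estimate divides the target by per-GPU throughput and formats the result with `:.0f`, which rounds to the nearest integer and can understate the count. A small sketch of the same arithmetic with rounding up is shown below, assuming perfectly linear scaling. The per-GPU figure is a placeholder and `gpus_needed` is a hypothetical helper.

```python
import math

def gpus_needed(target_rps, per_gpu_rps):
    """Smallest whole number of GPUs whose combined throughput meets the target,
    assuming throughput scales linearly with GPU count."""
    if per_gpu_rps <= 0:
        raise ValueError("per-GPU throughput must be positive")
    return math.ceil(target_rps / per_gpu_rps)

# Placeholder per-GPU throughput; 500 / 12.4 is about 40.3, so 40 GPUs fall short.
print(gpus_needed(500, 12.4))  # 41
```

Later modules test whether the linear-scaling assumption holds, so treat this count as a first estimate only.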

## Next Steps

You've established your single-GPU baseline! Key findings:

1. **Maximum throughput**: The single GPU saturates at a specific req/s rate
2. **Latency degradation**: TTFT increases significantly as load approaches saturation
3. **No cache sharing**: Each request recomputes the shared ParasolCloud system prompt (the text beginning "You are an expert customer service agent")

### What's Next?

In **Module 3**, you'll:
- Scale to 4 GPUs using naive horizontal scaling (round-robin load balancing)
- Run the same benchmarks
- Discover why simple scaling doesn't deliver linear 4x improvement
- Understand the impact of isolated KV caches (no sharing)

### Before You Go

✓ Review your Grafana dashboard and take screenshots of:
- Peak throughput (requests per second)
- GPU cache usage
- Request latency distribution

✓ Keep your benchmark results. You'll compare them against scaled deployments in later modules.