# M3.4 — Load Testing & Scaling for RAG Systems

**Module**: M3 - Production RAG  
**Focus**: Locust, bottlenecks, horizontal/vertical scaling

---

## Section 1: Objectives & Reality Check

### Learning Objectives

By the end of this module, you will be able to:

1. **Implement** load testing with Locust for RAG systems
2. **Execute** 5 test types: smoke, load, stress, spike, and soak tests
3. **Interpret** performance metrics: p50, p95, p99 latencies, throughput, error rates
4. **Identify** bottlenecks: application code, external services, infrastructure
5. **Apply** scaling strategies: horizontal vs vertical, caching, batching, auto-scaling
6. **Decide** when to load test and when to skip it

### Why Load Testing Matters

Load testing answers critical questions:

- **Capacity**: How many concurrent users can your system handle?
- **Latency**: What's the user experience under load? (p95, p99 reveal worst cases)
- **Reliability**: At what point does the system start failing?
- **Bottlenecks**: Is it the database? External API? Application code?

### The Reality Check: When NOT to Load Test

**Skip load testing when:**

1. **Small scale**: 50 daily users with no growth trajectory  
   *Cost*: 8-12 hours setup + 2-4 hours per sprint maintenance  
   *Benefit*: Minimal for non-critical systems

2. **Staging ≠ Production**: Different environments yield unreliable predictions  
   *Example*: Staging has 1GB test data, production has 100GB  
   *Result*: Load test shows 2s latency, production experiences 20s

3. **Early-stage uncertainty**: User patterns unknown, architecture may pivot  
   *Better approach*: Optimize obvious bottlenecks first (N+1 queries, missing indexes)

**Use load testing when:**

1. **Growth trajectory**: Approaching capacity limits (e.g., 70% CPU sustained)
2. **SLA requirements**: Performance guarantees in contracts (e.g., p99 <2s)
3. **Infrastructure decisions**: Justifying $500/month upgrade with data
4. **Pre-production validation**: Confirming system meets capacity targets

### Key Metrics Explained

| Metric | Definition | Target (typical) | Why It Matters |
|--------|------------|------------------|----------------|
| **Throughput** | Requests per second (RPS) | 20-100 RPS | System capacity |
| **p50 Latency** | Median response time | <500ms | Typical user experience |
| **p95 Latency** | 95th percentile response | <2s | Most users' experience |
| **p99 Latency** | 99th percentile response | <5s | Tail latency (slowest 1% of requests) |
| **Error Rate** | % of failed requests | <0.1% | System reliability |
| **Concurrency** | Simultaneous users | 50-500 | Peak load capacity |

### Cost-Benefit Analysis

**Time investment**:  
- Initial setup: 8-12 hours (locustfile, infrastructure, baseline tests)  
- Per-sprint maintenance: 2-4 hours (update tests, analyze results)  

**Value delivered**:  
- Prevent outages during traffic spikes  
- Data-driven infrastructure decisions (avoid over-provisioning)  
- Identify bottlenecks before production exposure  
- Confidence in scaling strategy  

**Break-even point**: Systems expecting >500 concurrent users or revenue-critical applications

```python
# Verify environment setup
import sys
print(f"Python version: {sys.version}")

# Check if Locust is available
try:
    import locust
    print(f"Locust version: {locust.__version__}")
except ImportError:
    print("⚠️  Locust not installed. Run: pip install locust==2.31.6")

# Expected: Python 3.8+ and Locust 2.31.6
```

### Reality Check Exercise

**Question**: Should you load test in these scenarios?

1. **Scenario A**: Internal tool used by 10 employees, no external traffic  
   **Answer**: ❌ No - overhead not justified for small, stable usage

2. **Scenario B**: Public API with 1000 daily users, growing 20% monthly  
   **Answer**: ✅ Yes - growth trajectory demands capacity planning

3. **Scenario C**: E-commerce site with SLA: p95 <1s, 99.9% uptime  
   **Answer**: ✅ Yes - contractual obligations require validation

4. **Scenario D**: MVP with 50 beta users, architecture may change  
   **Answer**: ❌ No - premature optimization, focus on product-market fit

---

**Next**: Section 2 covers Locust setup and the 5 test types (smoke, load, stress, spike, soak).

## Section 2: Locust Setup & Test Types

### Installation & Project Structure

**Install Locust**:
```bash
pip install locust==2.31.6 python-dotenv==1.0.1
```

**Project structure**:
```
project/
├── locustfile.py          # Test scenarios
├── .env                   # Configuration (TARGET_URL)
├── .env.example           # Template for .env
├── requirements.txt       # Dependencies
└── results/               # Test output (CSV, HTML reports)
```

### The 5 Test Types

| Test Type | Purpose | Users | Duration | When to Use |
|-----------|---------|-------|----------|-------------|
| **Smoke** | Health check | 10 | 2 min | Before each deployment |
| **Load** | Normal capacity | 100 | 10 min | Weekly baseline |
| **Stress** | Breaking point | 1000+ | 15 min | Quarterly capacity planning |
| **Spike** | Traffic surge | 500 instant | 5 min | Before marketing campaigns |
| **Soak** | Memory leaks | 50 | 4 hours | Monthly stability check |

### Test Type Details

#### 1. Smoke Test
**Goal**: Verify system works under minimal load  
**Scenario**: 10 users, 2 minutes  
**Success criteria**: 0% error rate, p95 <3s

```bash
locust -f locustfile.py --host=http://localhost:8000 \
  --users 10 --spawn-rate 2 --run-time 2m --headless
```

**Use case**: Run after every deployment to catch basic breakage

---

#### 2. Load Test
**Goal**: Validate system handles expected traffic  
**Scenario**: 100 users, 10 minutes  
**Success criteria**: Error rate <1%, p95 <2s

```bash
locust -f locustfile.py --host=http://localhost:8000 \
  --users 100 --spawn-rate 10 --run-time 10m --headless
```

**Use case**: Weekly regression test, establish performance baseline

---

#### 3. Stress Test
**Goal**: Find breaking point  
**Scenario**: Ramp gradually to 1000+ users (2 users/s reaches 1000 in about 8 minutes)  
**Success criteria**: Identify at what user count error rate exceeds 5%

```bash
locust -f locustfile.py --host=http://localhost:8000 \
  --users 1000 --spawn-rate 2 --run-time 15m --headless
```

**Use case**: Capacity planning, infrastructure sizing

---

#### 4. Spike Test
**Goal**: Simulate sudden traffic surge (e.g., HackerNews front page)  
**Scenario**: Jump from 10 to 500 users instantly  
**Success criteria**: System recovers without manual intervention

```bash
locust -f locustfile.py --host=http://localhost:8000 \
  --users 500 --spawn-rate 500 --run-time 5m --headless
```

**Use case**: Before product launches, marketing campaigns
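
The command above starts all 500 users from a cold start rather than jumping from a 10-user baseline. One possible way to describe the 10 → 500 jump is a stage table keyed on elapsed time, sketched below in plain Python. The stage durations and the name `spike_stage` are illustrative assumptions. Locust's `LoadTestShape.tick()` returns the same kind of `(user_count, spawn_rate)` tuple (or `None` to stop), so a function like this could back a custom shape class.

```python
# Hypothetical spike profile: (end_time_s, users, spawn_rate).
# The durations are illustrative assumptions, not values from this module.
STAGES = [
    (60, 10, 10),     # 0-60s: 10-user baseline
    (240, 500, 500),  # 60-240s: jump to 500 users at once
    (300, 10, 500),   # 240-300s: drop back to watch recovery
]

def spike_stage(elapsed_s):
    """Return (users, spawn_rate) for the elapsed time, or None when the test is over."""
    for end_time, users, spawn_rate in STAGES:
        if elapsed_s < end_time:
            return users, spawn_rate
    return None

for t in (0, 59, 60, 239, 240, 300):
    print(t, spike_stage(t))
```

The final low-load stage is what lets you check the "recovers without manual intervention" criterion: latency and error rate should return to baseline once the surge ends.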

---

#### 5. Soak Test
**Goal**: Detect memory leaks, resource exhaustion  
**Scenario**: 50 users sustained for 4 hours  
**Success criteria**: Stable memory, no degradation over time

```bash
locust -f locustfile.py --host=http://localhost:8000 \
  --users 50 --spawn-rate 5 --run-time 4h --headless
```

**Use case**: Before major releases, after architectural changes

---

### Locust File Anatomy

See `locustfile.py` in this repository for full implementation. Key components:

1. **HttpUser class**: Defines simulated user behavior
2. **@task decorators**: Weighted task distribution (10:3:1 ratio)
3. **wait_time**: Realistic thinking time between requests (1-3s)
4. **catch_response=True**: Granular success/failure handling
5. **Event listeners**: Track test start/stop with statistics

**Task weights** simulate realistic traffic:
- `/query` endpoint: weight=10 (most frequent)
- `/retrieval` endpoint: weight=3
- `/health` endpoint: weight=1

**Why wait_time matters**: Without it, Locust sends requests non-stop (unrealistic). Real users think, scroll, type between actions.
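
To make the weights and wait time concrete, the sketch below estimates the request mix and the throughput a given user count would generate. It assumes `between(1, 3)` averages 2 seconds and uses a 0.5s average response time, which is an illustrative assumption; the function names are hypothetical.

```python
# Task weights as listed above.
WEIGHTS = {"/query": 10, "/retrieval": 3, "/health": 1}

def traffic_mix(weights):
    """Expected share of requests per endpoint, derived from the task weights."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

def estimated_rps(users, wait_min_s, wait_max_s, avg_latency_s):
    """Rough closed-loop throughput: each user waits, then blocks on one request."""
    mean_wait = (wait_min_s + wait_max_s) / 2
    return users / (mean_wait + avg_latency_s)

for name, share in traffic_mix(WEIGHTS).items():
    print(f"{name:12} {share:.1%}")

# 0.5s average latency is an assumption for illustration.
print(f"~{estimated_rps(100, 1, 3, 0.5):.0f} req/s for 100 users")
```

This is why user count and RPS are not interchangeable: with think time, 100 simulated users produce tens of requests per second, not hundreds.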

```python
# Inspect locustfile.py structure
with open('locustfile.py', 'r') as f:
    lines = f.readlines()
    print(f"Total lines: {len(lines)}")
    print("\nKey components found:")
    for i, line in enumerate(lines, 1):
        if 'class RAGUser' in line:
            print(f"  Line {i}: RAGUser class definition")
        elif '@task' in line:
            print(f"  Line {i}: Task decorator (weighted)")
        elif 'wait_time' in line:
            print(f"  Line {i}: Wait time configuration")

# Expected: 
# - RAGUser class
# - Multiple @task decorators with weights
# - wait_time = between(1, 3)
```

## Section 3: Running Tests & Reading p50/p95/p99

### Starting Locust

**Option 1: Web UI** (recommended for exploration)
```bash
locust -f locustfile.py --host=http://localhost:8000
```
Then open `http://localhost:8089` in a browser and configure:
- Number of users
- Spawn rate
- Host (if not set via --host)

**Option 2: Headless CLI** (recommended for CI/CD)
```bash
locust -f locustfile.py --host=http://localhost:8000 \
  --users 100 --spawn-rate 10 --run-time 10m --headless
```

### Understanding Locust Output

**CLI output example**:
```
Type     Name              # reqs   # fails  Avg    Min    Max    Median p95    p99   req/s failures/s
POST     /query            1523     12       856    234    3421   780    1850   2340  12.3  0.1
GET      /retrieval        458      2        234    112    890    210    520    670   3.7   0.0
GET      /health           152      0        45     23     120    42     89     110   1.2   0.0
```

**Key columns**:
- **Avg**: Mean response time (ms) - **less important** than percentiles
- **Median (p50)**: Half of requests faster, half slower - **typical user**
- **p95**: 95% of requests faster - **most users' experience**
- **p99**: 99% of requests faster - **tail latency** (the slowest 1% of requests; Max is the true worst case)
- **req/s**: Throughput (requests per second)
- **failures/s**: Failed requests per second (error rate % = # fails ÷ # reqs)

### Interpreting Percentiles

**Why p95/p99 matter more than average**:

| Scenario | Avg | p50 | p95 | p99 | Interpretation |
|----------|-----|-----|-----|-----|----------------|
| **Healthy** | 500ms | 450ms | 800ms | 1200ms | Consistent performance |
| **Outliers** | 600ms | 450ms | 3500ms | 8000ms | Some users have terrible experience |
| **Degrading** | 2000ms | 1800ms | 5000ms | 10000ms | System struggling under load |

**Example**: In the Outliers row, the 600ms average looks acceptable, but p99 = 8s means 1% of requests take 8 seconds or longer!
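
The gap between the mean and the tail is easy to reproduce. The sketch below uses the nearest-rank percentile definition on synthetic latencies (made-up data, not measurements); Locust's own percentile calculation may differ in detail.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic latencies (ms): 90 fast requests, 8 slow ones, 2 very slow ones.
latencies = [450] * 90 + [3500] * 8 + [8000] * 2

mean = sum(latencies) / len(latencies)
print(f"mean={mean:.0f}ms p50={percentile(latencies, 50)}ms "
      f"p95={percentile(latencies, 95)}ms p99={percentile(latencies, 99)}ms")
# → mean=845ms p50=450ms p95=3500ms p99=8000ms
```

The mean sits under one second while one request in twenty takes 3.5 seconds or more, which is exactly what an average hides.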

### Success Criteria by Test Type

| Test Type | p50 Target | p95 Target | p99 Target | Error Rate |
|-----------|------------|------------|------------|------------|
| **Smoke** | <500ms | <1s | <2s | 0% |
| **Load** | <1s | <2s | <3s | <1% |
| **Stress** | N/A | N/A | N/A | Find breaking point |
| **Spike** | <2s | <5s | <10s | <5% during spike |
| **Soak** | Stable (no increase) | Stable | Stable | <0.5% |
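
A table like this can be turned into an automated pass/fail gate. The following is a minimal sketch, assuming results arrive as a dict of milliseconds and percentages; `TARGETS` and `check_run` are hypothetical names, and stress and soak tests are left out because their criteria are not fixed thresholds.

```python
# Targets copied from the table above, in milliseconds and percent.
TARGETS = {
    "smoke": {"p50": 500, "p95": 1000, "p99": 2000, "error_pct": 0.0},
    "load": {"p50": 1000, "p95": 2000, "p99": 3000, "error_pct": 1.0},
    "spike": {"p50": 2000, "p95": 5000, "p99": 10000, "error_pct": 5.0},
}

def check_run(test_type, measured):
    """Return the metrics that miss their target (an empty list means pass)."""
    targets = TARGETS[test_type]
    failed = [m for m in ("p50", "p95", "p99") if measured[m] >= targets[m]]
    limit = targets["error_pct"]
    # A 0% target means no failures at all; other targets are strict upper bounds.
    error_ok = measured["error_pct"] == 0 if limit == 0 else measured["error_pct"] < limit
    if not error_ok:
        failed.append("error_pct")
    return failed

# /query numbers from the sample CLI output earlier in this section.
query_run = {"p50": 780, "p95": 1850, "p99": 2340, "error_pct": 0.8}
print("load:", check_run("load", query_run))
print("smoke:", check_run("smoke", query_run))
```

In CI, a non-empty list would typically fail the build.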

### Common Patterns in Results

**Pattern 1: Rate limiting**
```
# reqs   # fails   p95      p99
2000     0         850ms    1200ms   ← Below rate limit
3000     500       920ms    8500ms   ← 429 errors start
```
**Diagnosis**: External API rate limit hit (e.g., OpenAI 500 RPM)

**Pattern 2: Connection pool exhaustion**
```
Time    # reqs   # fails   p95      Error message
0-5m    1000     0         800ms    -
5-10m   1500     300       5000ms   "QueuePool limit reached"
```
**Diagnosis**: Database connection pool too small

**Pattern 3: Memory leak**
```
Time     # reqs   p95      Memory
0-1h     5000     800ms    1.2GB
1-2h     5000     950ms    1.8GB
2-3h     5000     1800ms   2.5GB
3-4h     5000     4500ms   3.2GB ← Degradation
```
**Diagnosis**: Memory leak causes GC thrashing, slows responses
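
For soak tests, the signal is the trend rather than any single reading. A rough sketch of a trend check over the hourly values from Pattern 3 follows; the 1.5x alert threshold is an illustrative assumption, not a standard.

```python
def degradation_ratio(values):
    """Ratio of the last window's value to the first window's value."""
    return values[-1] / values[0]

def growth_per_window(values):
    """Average increase between consecutive windows."""
    return (values[-1] - values[0]) / (len(values) - 1)

# Hourly readings from the memory-leak pattern above.
hourly_p95_ms = [800, 950, 1800, 4500]
hourly_memory_gb = [1.2, 1.8, 2.5, 3.2]

ratio = degradation_ratio(hourly_p95_ms)
print(f"p95 grew {ratio:.1f}x over the run")
print(f"memory grew ~{growth_per_window(hourly_memory_gb):.2f} GB per hour")

# Threshold chosen for illustration only.
print("investigate" if ratio > 1.5 else "stable")
```

With constant request volume per hour, steady memory growth paired with rising p95 points to a leak rather than to load.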

### Exporting Results

**Generate CSV for analysis**:
```bash
locust -f locustfile.py --host=http://localhost:8000 \
  --users 100 --spawn-rate 10 --run-time 10m --headless \
  --csv=results/load_test
```

**Output files**:
- `results/load_test_stats.csv` - Aggregated statistics
- `results/load_test_stats_history.csv` - Time-series data
- `results/load_test_failures.csv` - Error details
- `results/load_test_exceptions.csv` - Exceptions raised inside the test script

**Generate HTML report**:
```bash
locust -f locustfile.py --host=http://localhost:8000 \
  --users 100 --spawn-rate 10 --run-time 10m --headless \
  --html=results/report.html
```

```python
# Simulate parsing Locust CSV results
import csv
from io import StringIO

# Example CSV data (simulated)
csv_data = """Type,Name,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,66%,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%
POST,/query,1523,12,780,856,234,3421,1250,12.3,0.1,780,890,1100,1250,1600,1850,2100,2340,3200,3400,3421
GET,/retrieval,458,2,210,234,112,890,450,3.7,0.0,210,230,250,270,380,520,620,670,850,880,890
GET,/health,152,0,42,45,23,120,50,1.2,0.0,42,45,48,50,65,89,100,110,118,120,120
"""

# Parse CSV
reader = csv.DictReader(StringIO(csv_data))
for row in reader:
    endpoint = row['Name']
    p50 = int(row['50%'])
    p95 = int(row['95%'])
    p99 = int(row['99%'])
    rps = float(row['Requests/s'])
    error_rate = (int(row['Failure Count']) / int(row['Request Count'])) * 100

    print(f"{endpoint:15} | p50: {p50:4}ms | p95: {p95:4}ms | p99: {p99:4}ms | RPS: {rps:4.1f} | Errors: {error_rate:.1f}%")

# Expected:
# /query         | p50:  780ms | p95: 1850ms | p99: 2340ms | RPS: 12.3 | Errors: 0.8%
# /retrieval     | p50:  210ms | p95:  520ms | p99:  670ms | RPS:  3.7 | Errors: 0.4%
# /health        | p50:   42ms | p95:   89ms | p99:  110ms | RPS:  1.2 | Errors: 0.0%
```

## Section 4: Finding Bottlenecks (code, external, infra)

### The 3 Bottleneck Categories

Load tests reveal where your system breaks. Bottlenecks fall into three categories:

1. **Application Code**: Inefficient algorithms, blocking operations
2. **External Services**: OpenAI rate limits, slow database queries
3. **Infrastructure**: CPU, memory, or network constraints

### Investigation Workflow

```
Load Test → Identify symptom → Correlate metrics → Diagnose root cause → Apply fix
```

**Step 1: Run load test and capture symptoms**  
- High p99 latency?
- Error rate spike?
- Throughput plateau?

**Step 2: Correlate with infrastructure metrics**  
- CPU usage during test
- Memory utilization
- Network I/O
- Database connection pool

**Step 3: Analyze logs for patterns**  
- What errors occur at breaking point?
- Are there specific requests causing slowdown?
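
Step 3 often starts with a simple tally of error types around the breaking point. The sketch below counts exception names in log lines; the log format and the sample lines are hypothetical, so the regular expression would need adjusting to your own logs.

```python
import re
from collections import Counter

# Hypothetical log lines; the format is an assumption for illustration.
LOG_LINES = [
    "2024-01-01 12:00:01 ERROR RateLimitError: rate limit exceeded",
    "2024-01-01 12:00:01 INFO request completed in 812ms",
    "2024-01-01 12:00:02 ERROR RateLimitError: rate limit exceeded",
    "2024-01-01 12:00:03 ERROR TimeoutError: upstream timed out",
]

# Captures the exception name that follows the ERROR level marker.
ERROR_PATTERN = re.compile(r"ERROR (\w+):")

def count_error_types(lines):
    """Tally exception names found in ERROR log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

for name, count in count_error_types(LOG_LINES).most_common():
    print(f"{name}: {count}")
```

A single dominant error type usually narrows the search to one of the three bottleneck categories below.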

---

### Bottleneck #1: Application Code

**Symptoms**:
- High CPU usage (>80%) with low concurrency
- Latency increases linearly with users
- No external API errors

**Common culprits**:

#### A. N+1 Query Problem
```python
# BAD: 1 query + N queries (N+1 problem)
users = db.query(User).all()  # 1 query
for user in users:
    user.profile = db.query(Profile).filter_by(user_id=user.id).first()  # N queries

# GOOD: 1 query total (eager loading via JOIN)
from sqlalchemy.orm import joinedload

users = db.query(User).options(joinedload(User.profile)).all()
```

**Impact**: Round trips drop from N+1 to 1 (for 100 users, 101 queries become 1)

#### B. Blocking Operations
```python
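# BAD: Synchronous call blocks the worker for the whole request
from openai import OpenAI

client = OpenAI()

def process_query(query):
    response = client.embeddings.create(input=query, model="text-embedding-3-small")  # Blocks for 200-500ms
    return response.data[0].embedding

# GOOD: Async client yields to the event loop while waiting
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def process_query(query):
    response = await async_client.embeddings.create(input=query, model="text-embedding-3-small")
    return response.data[0].embedding  # Other requests are processed while waiting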
```

**Impact**: 5-10x increase in concurrency capacity

#### C. Inefficient Algorithms
```python
# BAD: O(n²) pairwise comparison
def find_duplicates(documents):
    duplicates = []
    for i, doc1 in enumerate(documents):
        for doc2 in documents[i+1:]:
            if similarity(doc1, doc2) > 0.9:
                duplicates.append((doc1, doc2))
    return duplicates

# GOOD: roughly O(n log n) with an index
def find_duplicates(documents):
    # Use locality-sensitive hashing or a vector index
    # (build_similarity_index is a placeholder, not a library function)
    index = build_similarity_index(documents)
    return index.find_near_duplicates(threshold=0.9)
```

---

### Bottleneck #2: External Services

**Symptoms**:
- 429 errors (rate limiting)
- High latency despite low CPU
- Timeouts or connection errors

**Common culprits**:

#### A. OpenAI Rate Limits
```
Error: Rate limit reached for gpt-4 in organization org-XXX
Limit: 500 requests per minute (RPM)
```

**Solutions**:
1. **Cache embeddings**: Eliminate redundant API calls
2. **Batch requests**: Embed multiple texts in one API call
3. **Retry with backoff**: Handle transient rate limits
4. **Upgrade tier**: Purchase higher rate limits

**Example caching impact**:
- Before: 1000 queries → 1000 OpenAI calls → $0.20
- After: 1000 queries → 200 OpenAI calls (80% cache hit) → $0.04
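
Solution 3 (retry with backoff) can be sketched without any client library. In the example below, `RateLimited` is a stand-in for whatever rate-limit exception your client raises, and the delay parameters are illustrative assumptions; the `sleep` argument is injectable so the demo runs instantly.

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for the client library's rate-limit error (hypothetical)."""

def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call(); on RateLimited, wait base_delay * 2**attempt (capped, jittered) and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries

# Demo: fail twice, then succeed. Sleep is stubbed to record the waits.
attempts = {"n": 0}

def flaky_embed():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return [0.1, 0.2, 0.3]

waits = []
result = retry_with_backoff(flaky_embed, sleep=waits.append)
print(result, "after", len(waits), "retries")
```

Backoff only smooths over short bursts. If the sustained request rate exceeds the limit, caching and batching are still needed.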

#### B. Database Query Slowness
```python
# Monitor slow queries
import functools
import time

def log_slow_queries(func):
    @functools.wraps(func)  # Keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # Monotonic clock, suited to measuring durations
        result = func(*args, **kwargs)
        duration = time.perf_counter() - start
        if duration > 1.0:  # Log queries >1s
            print(f"SLOW QUERY: {func.__name__} took {duration:.2f}s")
        return result
    return wrapper

@log_slow_queries
def search_documents(query_embedding):
    return db.query(Document).filter(...).all()
```

**Fixes**:
- Add indexes on frequently queried columns
- Use vector database (FAISS, Pinecone) instead of brute-force search
- Implement query result caching

#### C. Third-party API Latency
```python
# Monitor external API latency
import requests
import time

def fetch_with_monitoring(url):
    start = time.time()
    response = requests.get(url, timeout=5)
    latency = time.time() - start
    
    if latency > 2.0:
        print(f"WARNING: {url} took {latency:.2f}s")
    
    return response
```

---

### Bottleneck #3: Infrastructure

**Symptoms**:
- High CPU/memory/disk I/O sustained
- Latency spikes correlated with resource exhaustion
- "Out of memory" or "connection refused" errors

**Investigation checklist**:

#### A. CPU Bottleneck
```bash
# Monitor CPU during load test
# -d, joins multiple PIDs with commas, which is the format top -p expects
top -p "$(pgrep -d, -f 'python app.py')"

# Symptom: CPU at 95-100% sustained
```

**Diagnosis**: CPU-bound operations (embeddings, complex algorithms)

**Fixes**:
- **Vertical scale**: Increase CPU cores
- **Horizontal scale**: Add more instances behind load balancer
- **Optimize code**: Profile with `cProfile` to find hotspots
- **Offload**: Use GPU for embeddings

#### B. Memory Bottleneck
```bash
# Monitor memory usage
ps aux | grep python
```

Or use `memory_profiler` for a line-by-line report:

```python
from memory_profiler import profile

@profile
def process_documents(docs):
    embeddings = [embed(doc) for doc in docs]  # Loads all into memory
    return embeddings
```

**Diagnosis**: Memory leak or large object allocation

**Fixes**:
- **Streaming**: Process data in batches instead of all at once
- **Connection pool limits**: Prevent unbounded growth
- **Garbage collection**: Explicitly delete large objects
- **Vertical scale**: Increase RAM (temporary fix)

#### C. Connection Pool Exhaustion
```
Error: QueuePool limit of size 5 overflow 10 reached, connection timed out
```

**Diagnosis**: Database connection pool too small for concurrent users

**Fix**:
```python
from sqlalchemy import create_engine

engine = create_engine(
    DATABASE_URL,
    pool_size=20,        # Increase from default 5
    max_overflow=40,     # Extra connections under load
    pool_timeout=60,     # Wait before error
    pool_recycle=3600,   # Recycle connections hourly
    pool_pre_ping=True   # Verify connection before use
)
```

**Rule of thumb**: `pool_size = expected_concurrent_users / 10`
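
The rule of thumb can be cross-checked against Little's law: connections in use are roughly the request rate multiplied by the time each request holds a connection. The sketch below shows both estimates; the 1.5x headroom factor and the sample numbers are illustrative assumptions, and the function names are hypothetical.

```python
import math

def pool_size_rule_of_thumb(concurrent_users):
    """The rule of thumb above: one pooled connection per 10 concurrent users."""
    return math.ceil(concurrent_users / 10)

def pool_size_littles_law(requests_per_s, db_time_per_request_s, headroom=1.5):
    """Connections in use ~= arrival rate x time each request holds a connection."""
    return math.ceil(requests_per_s * db_time_per_request_s * headroom)

print(pool_size_rule_of_thumb(200))
print(pool_size_littles_law(requests_per_s=40, db_time_per_request_s=0.25))
```

If the two estimates disagree widely, measure how long requests actually hold a connection during a load test before changing `pool_size`.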

---

### Real-World Bottleneck Example

**Scenario**: RAG system breaks at 125 concurrent users

**Symptoms**:
- p95 latency: 2s → 8s
- Error rate: 0% → 25%
- CPU: 40% (not bottleneck)
- Memory: 60% (not bottleneck)

**Investigation**:
```bash
# Check logs during load test
tail -f logs/app.log | grep ERROR

# Output:
# RateLimitError: Requests to the OpenAI API have exceeded rate limits
```

**Diagnosis**: OpenAI rate limit (500 RPM) exceeded. At 125 users × 0.5 queries/sec = 62.5 QPS, the system sends 3750 QPM, well above the limit

**Fixes applied**:
1. Implement query result caching (Redis)
2. Reduce redundant embedding calls (cache document embeddings)
3. Implement retry logic with exponential backoff

**Result after fixes**:
- Capacity increased: 125 → 400 concurrent users
- Cache hit rate: 75%
- OpenAI calls reduced: 3750 QPM → 900 QPM
- Cost savings: $200/month → $50/month

```python
# Bottleneck diagnosis helper
def diagnose_bottleneck(cpu_pct, memory_pct, error_rate, error_type):
    """
    Simple decision tree for bottleneck diagnosis.

    Args:
        cpu_pct: CPU usage percentage (0-100)
        memory_pct: Memory usage percentage (0-100)
        error_rate: Error rate percentage (0-100)
        error_type: Primary error type (str)
    """
    print("=== Bottleneck Diagnosis ===\n")
    print(f"CPU: {cpu_pct}%")
    print(f"Memory: {memory_pct}%")
    print(f"Error Rate: {error_rate}%")
    print(f"Error Type: {error_type}\n")

    if "RateLimitError" in error_type or "429" in error_type:
        print("✅ DIAGNOSIS: External API Rate Limit")
        print("   FIX: Implement caching, reduce API calls, upgrade tier")

    elif "QueuePool" in error_type or "connection" in error_type.lower():
        print("✅ DIAGNOSIS: Connection Pool Exhaustion")
        print("   FIX: Increase pool_size, add connection pooling")

    elif cpu_pct > 80:
        print("✅ DIAGNOSIS: CPU Bottleneck")
        print("   FIX: Optimize algorithms, vertical/horizontal scaling")

    elif memory_pct > 85:
        print("✅ DIAGNOSIS: Memory Bottleneck")
        print("   FIX: Check for memory leaks, implement streaming, add RAM")

    elif error_rate < 1:
        print("✅ DIAGNOSIS: System Healthy")
        print("   No immediate action needed")

    else:
        print("⚠️  DIAGNOSIS: Unknown Issue")
        print("   Check application logs for specific errors")

# Example scenarios
print("Scenario 1: OpenAI Rate Limit Hit")
diagnose_bottleneck(cpu_pct=45, memory_pct=60, error_rate=25, error_type="RateLimitError: 429")

print("\n" + "="*50 + "\n")

print("Scenario 2: Database Connection Pool")
diagnose_bottleneck(cpu_pct=30, memory_pct=50, error_rate=15, error_type="QueuePool limit reached")

print("\n" + "="*50 + "\n")

print("Scenario 3: CPU Overload")
diagnose_bottleneck(cpu_pct=95, memory_pct=40, error_rate=5, error_type="Timeout")

# Expected: Correct diagnosis for each scenario
```

## Section 5: Scaling Playbook (cache, batch, HPA)

### The Scaling Hierarchy

**Before throwing money at infrastructure, optimize first:**

```
1. Measure (load test)
   ↓
2. Optimize code (fix N+1 queries, add indexes)     ← Often 10x gains
   ↓
3. Add caching (Redis, in-memory)                   ← 2-100x speedup
   ↓
4. Batch operations (embeddings, DB queries)        ← 5-10x efficiency
   ↓
5. Vertical scaling (bigger instance)               ← 2-4x capacity
   ↓
6. Horizontal scaling (more instances + LB)         ← Unlimited capacity
   ↓
7. Auto-scaling (HPA - Horizontal Pod Autoscaler)   ← Dynamic elasticity
```

**Rule**: Each step up costs more money and complexity. Exhaust cheaper options first.

---

### Strategy 1: Caching

**Impact**: 10-100x latency reduction, 80-95% cost savings on API calls

#### Query Result Caching
```python
import redis
import hashlib
import json

redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

def cached_query(query_text, top_k=5, ttl=3600):
    # Generate cache key
    cache_key = f"query:{hashlib.md5(query_text.encode()).hexdigest()}:{top_k}"
    
    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        print(f"✅ Cache HIT: {query_text[:30]}...")
        return json.loads(cached)
    
    # Cache miss - compute result
    print(f"❌ Cache MISS: {query_text[:30]}...")
    result = run_rag_query(query_text, top_k)  # Expensive operation
    
    # Store in cache with TTL
    redis_client.set(cache_key, json.dumps(result), ex=ttl)
    return result

# Simulate RAG query
def run_rag_query(query, top_k):
    import time
    time.sleep(0.5)  # Simulate 500ms latency
    return {"answer": "Cached response", "sources": ["doc1", "doc2"]}
```

**Metrics**:
- Before caching: 1000 queries = 1000 API calls = $0.20
- After caching (80% hit rate): 1000 queries = 200 API calls = $0.04
- **Savings**: $0.16 per 1000 queries

#### Embedding Caching
```python
# Cache document embeddings (persist across restarts)
def get_embedding(text, model="text-embedding-3-small"):
    cache_key = f"embed:{hashlib.md5(text.encode()).hexdigest()}"
    
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Expensive OpenAI call
    embedding = openai.embeddings.create(input=text, model=model).data[0].embedding
    redis_client.set(cache_key, json.dumps(embedding))  # No TTL (permanent)
    return embedding
```

**Impact**: Eliminate redundant embedding calls (especially for static documents)

---

### Strategy 2: Batching

**Impact**: 5-10x reduction in API overhead

#### Embedding Batching
```python
# BAD: 10 API calls for 10 documents
embeddings = []
for doc in documents:
    embedding = openai.embeddings.create(input=doc.content, model="text-embedding-3-small")
    embeddings.append(embedding.data[0].embedding)

# GOOD: 1 API call for 10 documents (up to 2048 texts per batch)
texts = [doc.content for doc in documents]
response = openai.embeddings.create(input=texts, model="text-embedding-3-small")
embeddings = [item.embedding for item in response.data]
```

**Impact**: 10x reduction in network overhead, API quota usage
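
When a corpus exceeds the per-request batch limit, the texts need to be split into chunks first. A minimal sketch follows, assuming the 2048-text limit mentioned above; `chunked`, `embed_all`, and the injected `embed_batch` callable are hypothetical names, and a fake embedder stands in for the API so the example runs without credentials.

```python
def chunked(items, batch_size=2048):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(texts, embed_batch, batch_size=2048):
    """Embed texts batch by batch.

    embed_batch is a caller-supplied function that takes a list of texts
    and returns one vector per text, in the same order.
    """
    vectors = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Demo with a fake embedder that records the size of each call.
calls = []

def fake_embed_batch(batch):
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]

vectors = embed_all([f"doc {i}" for i in range(5000)], fake_embed_batch)
print(len(vectors), "vectors in", len(calls), "calls:", calls)
# → 5000 vectors in 3 calls: [2048, 2048, 904]
```

In real use, `embed_batch` would wrap the batched `embeddings.create` call shown above.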

#### Database Batching
```python
# BAD: N individual queries (N+1 problem)
for doc_id in document_ids:
    doc = db.query(Document).filter_by(id=doc_id).first()
    process(doc)

# GOOD: 1 bulk query
docs = db.query(Document).filter(Document.id.in_(document_ids)).all()
for doc in docs:
    process(doc)
```

---

### Strategy 3: Horizontal Scaling

**When**: Code is optimized, caching implemented, vertical scaling limit reached

**Concept**: Run multiple instances behind a load balancer

```
                    ┌───────────────┐
User requests  →    │ Load Balancer │
                    └───────┬───────┘
                            │
            ┌───────────────┼───────────────┐
            ↓               ↓               ↓
        ┌───────┐       ┌───────┐       ┌───────┐
        │ App 1 │       │ App 2 │       │ App 3 │
        └───┬───┘       └───┬───┘       └───┬───┘
            │               │               │
            └───────────────┴───────────────┘
                            ↓
                    ┌───────────────┐
                    │   Database    │
                    │  (shared)     │
                    └───────────────┘
```

**Requirements**:
1. **Stateless application**: No local session storage
2. **Shared state**: Use Redis for sessions, cache
3. **Health checks**: `/health` endpoint for load balancer
4. **Sticky sessions OFF**: Load balancer distributes evenly

**Example (Railway, fixed replica count in `railway.json`)**:
```json
{
  "deploy": {
    "numReplicas": 3,
    "restartPolicyType": "ON_FAILURE",
    "healthcheckPath": "/health",
    "healthcheckTimeout": 100
  }
}
```

---

### Strategy 4: Vertical Scaling

**When**: Single-instance performance needs boost (simpler than horizontal)

**Trade-offs**:
- **Pros**: No code changes, simpler deployment
- **Cons**: Limited ceiling, single point of failure, expensive at high specs

**Example progression**:
1. Start: 1 CPU, 1GB RAM → 50 concurrent users
2. Upgrade: 2 CPU, 4GB RAM → 100 concurrent users
3. Upgrade: 4 CPU, 8GB RAM → 200 concurrent users
4. **Hit ceiling**: 8 CPU, 16GB RAM → 400 users (then switch to horizontal)

---

### Strategy 5: Auto-scaling (HPA)

**Horizontal Pod Autoscaler**: Automatically add/remove instances based on metrics

**Trigger metrics**:
- CPU usage >70% sustained for 5 minutes
- Memory usage >80%
- Custom metric: request queue depth >100

**Example (Kubernetes HPA)**:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-api
  minReplicas: 2           # Always maintain minimum
  maxReplicas: 10          # Cost control limit
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1 min before scaling up
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
```

**Railway/Render auto-scaling**:
- Configure in dashboard: min/max replicas, CPU threshold
- Simpler than Kubernetes, less control
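
To see how the HPA manifest above turns CPU readings into replica counts, the sketch below applies the scaling rule documented for Kubernetes: desired replicas = ceil(current replicas × current metric ÷ target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min/max bounds. It is a simplification that ignores stabilization windows and pod readiness; the function name is hypothetical.

```python
import math

def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=70,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Replica count the autoscaler would aim for, given average CPU utilization."""
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # Close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

for replicas, cpu in [(2, 95), (4, 72), (6, 20), (8, 140)]:
    print(f"{replicas} replicas at {cpu}% CPU -> {desired_replicas(replicas, cpu)}")
```

The last case shows why `maxReplicas` acts as a cost-control limit: the raw formula asks for 16 replicas, and the cap holds it at 10.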

---

### Scaling Decision Matrix

| Symptom | Likely Cause | Solution | Cost | Complexity |
|---------|--------------|----------|------|------------|
| High p99, normal p50 | Outliers/cold starts | Horizontal scale | Medium | Medium |
| High CPU (>80%) | CPU-bound work | Vertical scale OR optimize code | High | Low/High |
| High memory (>85%) | Memory leak | Fix code, then vertical scale | Medium | High |
| DB queries slow | Missing indexes | Add indexes (free!) | Free | Low |
| OpenAI 429 errors | Rate limits | Caching, batching, upgrade tier | Low-High | Medium |
| Inconsistent latency | No caching | Implement Redis cache | Low | Medium |

---

### Real-World Scaling Example

**Starting point**: 1 instance, 2 CPU, 4GB RAM, no caching

**Load test results**:
- Capacity: 50 concurrent users
- p95 latency: 2.5s
- Bottleneck: OpenAI rate limits + repeated queries

**Optimization journey**:

1. **Add query caching (Redis)**  
   - Cost: $10/month (Redis hosting)  
   - Result: Capacity → 150 users (80% cache hit rate)  
   - Time: 4 hours implementation

2. **Implement embedding batching**  
   - Cost: Free (code change)  
   - Result: API calls reduced by 60%  
   - Time: 2 hours implementation

3. **Vertical scale: 4 CPU, 8GB RAM**  
   - Cost: +$40/month  
   - Result: Capacity → 300 users  
   - Time: 5 minutes (infrastructure change)

4. **Horizontal scale: 3 instances + load balancer**  
   - Cost: +$80/month (3x instances)  
   - Result: Capacity → 900 users  
   - Time: 1 hour (setup LB, test)

**Total investment**:
- Cost: +$130/month
- Time: ~7 hours
- Capacity gain: 50 → 900 users (18x improvement)
- ROI: Supports 10x growth without re-architecting

```python
# Simulate caching impact on performance
import time
import random

# Simple in-memory cache for demonstration
cache = {}

def expensive_operation(query_id):
    """Simulate expensive RAG query (500ms)"""
    time.sleep(0.05)  # Reduced for notebook demo (50ms instead of 500ms)
    return f"Result for query {query_id}"

def cached_operation(query_id):
    """Cached version of expensive operation"""
    if query_id in cache:
        return cache[query_id], True  # Cache hit

    result = expensive_operation(query_id)
    cache[query_id] = result
    return result, False  # Cache miss

# Simulate 100 queries with some repetition (realistic pattern)
queries = [random.randint(1, 20) for _ in range(100)]  # Up to 20 unique queries, 100 total

print("Running 100 queries without caching...")
start = time.time()
for q in queries:
    expensive_operation(q)
no_cache_time = time.time() - start

print(f"Time without cache: {no_cache_time:.2f}s\n")

# Reset cache
cache.clear()

print("Running 100 queries WITH caching...")
start = time.time()
hits = 0
misses = 0
for q in queries:
    result, is_hit = cached_operation(q)
    if is_hit:
        hits += 1
    else:
        misses += 1
cache_time = time.time() - start

print(f"Time with cache: {cache_time:.2f}s")
print(f"Cache hits: {hits}")
print(f"Cache misses: {misses}")
print(f"Hit rate: {hits / len(queries) * 100:.1f}%")
print(f"Speedup: {no_cache_time/cache_time:.1f}x")

# Expected:
# - Significant speedup (about 5x, since at most 20 of 100 queries miss)
# - High cache hit rate (about 80%)
```

## Section 6: When NOT to Load Test

### The Cost-Benefit Reality

**Load testing is powerful, but not always the right tool.**

This section helps you decide when to invest in load testing vs. other priorities.

---

### When to SKIP Load Testing

#### 1. Small Scale, No Growth Plans
**Scenario**: Internal tool with 10-50 daily users, stable usage

**Why skip**:
- Setup cost: 8-12 hours
- Maintenance: 2-4 hours per sprint
- Benefit: Minimal (system unlikely to reach capacity limits)

**Better approach**: Monitor basic metrics (response time, error rate), optimize if issues arise

---

#### 2. Staging Environment ≠ Production
**Scenario**: Test environment has different data size, network, or configuration

**Examples of mismatch**:
- Staging: 1GB test data → Production: 100GB real data
- Staging: Single region → Production: Global CDN
- Staging: Mock external APIs → Production: Real APIs with rate limits

**Why skip**: Load test results won't predict production behavior

**Better approach**: 
- Invest in production monitoring (APM tools)
- Conduct small-scale production canary tests
- Use production database replicas for realistic testing

---

#### 3. MVP / Early-Stage Projects
**Scenario**: Building product with uncertain growth, architecture may pivot

**Why skip**: Premature optimization wastes time on code that may be rewritten

**Better approach**:
- Optimize obvious bottlenecks (N+1 queries, missing database indexes)
- Monitor production metrics to identify real user patterns
- Load test after achieving product-market fit and stable architecture

---

#### 4. Limited Failure Modes Covered
**Scenario**: Expecting load tests to provide comprehensive system validation

**Reality check**: Load tests reveal **capacity** limits, not all failure modes

**What load tests DON'T catch**:
- Security vulnerabilities (use penetration testing)
- Data corruption (use integration tests)
- Edge cases in business logic (use unit/functional tests)
- Network partition failures (use chaos engineering)
- Authentication/authorization bugs (use security testing)

**Takeaway**: Load testing complements other testing types; it doesn't replace them

---

### When to USE Load Testing

#### 1. Growth Trajectory
**Scenario**: User base growing 20%+ monthly, approaching capacity limits

**Signs you need load testing**:
- CPU usage trending upward (currently 50%, on track to hit 80% in 2-3 months)
- Response times increasing week-over-week
- Planning to onboard large customer (10x typical usage)

**Value**: Proactively identify scaling needs before outages
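
The "when will we hit the limit" estimate can be sketched as a compound-growth calculation. This is a rough model under stated assumptions: the function name `months_until_threshold` is hypothetical, and it assumes CPU utilization scales linearly with user count and that growth compounds monthly at a constant rate. Real systems often degrade non-linearly, so treat the result as an upper bound on how long you have.

```python
import math

def months_until_threshold(current_pct, threshold_pct, monthly_growth_pct):
    """Months until utilization reaches threshold_pct under compound monthly growth."""
    if current_pct >= threshold_pct:
        return 0.0  # already at or past the threshold
    if monthly_growth_pct <= 0:
        return math.inf  # flat or shrinking load never reaches it
    growth_factor = 1 + monthly_growth_pct / 100
    # Solve current * growth_factor**n = threshold for n
    return math.log(threshold_pct / current_pct) / math.log(growth_factor)

months = months_until_threshold(current_pct=50, threshold_pct=80, monthly_growth_pct=20)
print(f"CPU reaches 80% in about {months:.1f} months")  # about 2.6 months at 20% growth
```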

---

#### 2. SLA Requirements
**Scenario**: Performance guarantees in contracts

**Examples**:
- E-commerce: "p95 latency <1s during checkout"
- API provider: "99.9% uptime, p99 <2s"
- Enterprise customer: "Support 1000 concurrent users"

**Value**: Validate compliance, avoid SLA penalties, maintain reputation
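
Checking load test results against SLA targets like these can be sketched as follows. The helper names `percentile` and `check_sla` are hypothetical, the latency samples are made up for illustration (not measured data), and the nearest-rank percentile method is one of several common definitions; Locust and APM tools may compute percentiles slightly differently.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_sla(latencies_s, targets):
    """Return {percentile: (observed_s, limit_s, passed)} for each SLA target."""
    report = {}
    for pct, limit_s in targets.items():
        observed_s = percentile(latencies_s, pct)
        report[pct] = (observed_s, limit_s, observed_s < limit_s)
    return report

# Hypothetical request latencies in seconds (illustrative only)
latencies = [0.21, 0.22, 0.24, 0.25, 0.26, 0.28, 0.30, 0.31, 0.32, 0.33,
             0.35, 0.36, 0.38, 0.40, 0.42, 0.45, 0.50, 0.60, 0.90, 2.40]

# Targets mirror the examples above: p95 < 1s and p99 < 2s
for pct, (observed_s, limit_s, ok) in check_sla(latencies, {95: 1.0, 99: 2.0}).items():
    print(f"p{pct}: {observed_s:.2f}s (limit {limit_s:.1f}s) {'PASS' if ok else 'FAIL'}")
```

In this made-up sample, p95 passes while p99 fails because of a single slow request, which is the kind of tail behavior an average would hide.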

---

#### 3. Infrastructure Investment Decisions
**Scenario**: Evaluating whether to upgrade hosting plan

**Example decision**:
- Current plan: $40/month, 2 CPU, 4GB RAM
- Upgrade plan: $120/month, 8 CPU, 16GB RAM
- Question: "Will upgrade support expected growth?"

**Load testing provides**:
- Current capacity: 100 concurrent users
- Projected capacity after upgrade: 400 concurrent users
- ROI calculation: $80/month extra for 4x capacity (300 additional concurrent users)

**Value**: Data-driven infrastructure spending (avoid over/under-provisioning)
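
The arithmetic behind this comparison can be sketched as below, using the plan prices and capacities from the example above. The function name `compare_plans` and its return keys are hypothetical, and the capacity figures are only as good as the load test that produced them.

```python
def compare_plans(current_cost, current_users, upgrade_cost, upgrade_users):
    """Summarize the cost and capacity trade-off between two hosting plans."""
    extra_cost = upgrade_cost - current_cost
    extra_users = upgrade_users - current_users
    return {
        "extra_cost": extra_cost,
        "capacity_multiple": upgrade_users / current_users,
        "current_cost_per_user": current_cost / current_users,
        "upgrade_cost_per_user": upgrade_cost / upgrade_users,
        # Marginal cost of each additional concurrent user gained by upgrading
        "marginal_cost_per_user": extra_cost / extra_users,
    }

result = compare_plans(current_cost=40, current_users=100,
                       upgrade_cost=120, upgrade_users=400)
print(f"Extra cost: ${result['extra_cost']}/month "
      f"for {result['capacity_multiple']:.0f}x capacity")
print(f"Cost per concurrent user: ${result['current_cost_per_user']:.2f} "
      f"-> ${result['upgrade_cost_per_user']:.2f}")
print(f"Marginal cost per added user: ${result['marginal_cost_per_user']:.2f}")
```

Here the upgrade lowers the cost per concurrent user from $0.40 to $0.30, so it is worth it only if the extra 300 users of headroom will actually be needed.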

---

#### 4. Pre-Launch Validation
**Scenario**: Launching new feature, marketing campaign, or product

**Examples**:
- Product Hunt launch (expect 10x typical traffic spike)
- Black Friday sale (high concurrency)
- New API endpoint (unknown performance characteristics)

**Value**: Confidence that system handles expected load, identify bottlenecks before public exposure

---

### Decision Framework

Ask these questions:

1. **Do I have a growth trajectory?**  
   - No → Skip load testing, monitor production  
   - Yes → Continue

2. **Does my staging environment match production?**  
   - No → Fix environment or use production testing  
   - Yes → Continue

3. **Do I have SLA requirements or revenue risk?**  
   - No → Low priority, consider deferring  
   - Yes → High priority, load test now

4. **Am I making infrastructure decisions soon?**  
   - No → Monitor, test later when needed  
   - Yes → Load test to inform decision

5. **Is my architecture stable?**  
   - No → Optimize obvious issues first  
   - Yes → Load test to find limits

---

### Alternative Approaches

If load testing doesn't fit, consider:

**1. Production Monitoring (APM)**
- Tools: Datadog, New Relic, Sentry
- Real user monitoring (RUM) captures actual performance
- Cost: $0-100/month
- Effort: 1-2 hours setup

**2. Canary Deployments**
- Roll out changes to 5% of users first
- Monitor error rates, latency before full deployment
- Built into many CI/CD platforms

**3. Feature Flags**
- Enable new features for subset of users
- Gradually increase percentage while monitoring
- Tools: LaunchDarkly, Unleash

**4. Basic Performance Testing**
- Run 10-20 concurrent requests manually
- Check for obvious errors, slow queries
- Effort: 30 minutes, no specialized tools
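
A basic concurrent check like option 4 needs only the standard library. The sketch below is one possible approach, not the module's `locustfile.py`: `run_concurrent` and `fake_request` are hypothetical names, and `fake_request` is a stand-in that sleeps instead of calling a real endpoint. Replace it with your own HTTP client call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(send_request, num_requests=20, max_workers=20):
    """Call send_request() num_requests times in parallel; return (latencies, errors)."""
    def timed_call(_):
        start = time.perf_counter()
        try:
            send_request()
            return time.perf_counter() - start, None
        except Exception as exc:  # record the failure instead of aborting the run
            return time.perf_counter() - start, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(timed_call, range(num_requests)))

    latencies = [elapsed for elapsed, err in results if err is None]
    errors = [err for _, err in results if err is not None]
    return latencies, errors

def fake_request():
    """Stand-in for a real HTTP call (assumption: swap in your own client code)."""
    time.sleep(0.05)

latencies, errors = run_concurrent(fake_request, num_requests=20)
print(f"ok={len(latencies)} errors={len(errors)} slowest={max(latencies):.3f}s")
```

Threads are sufficient here because the work is I/O-bound. For sustained or ramping load, a dedicated tool such as Locust is a better fit.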

---

### Cost-Benefit Summary

| Scenario | Time Investment | Load Test? | Alternative |
|----------|-----------------|------------|-------------|
| 50 daily users, stable | 8-12 hours | ❌ No | Basic monitoring |
| 1000 daily users, 20% growth | 8-12 hours | ✅ Yes | N/A |
| MVP, uncertain architecture | 8-12 hours | ❌ No | Optimize obvious issues |
| SLA: p95 <1s guaranteed | 8-12 hours | ✅ Yes | N/A |
| Staging ≠ Production | 8-12 hours | ❌ No | Production canary testing |
| Pre-launch (expecting spike) | 8-12 hours | ✅ Yes | N/A |

---

### Final Takeaway

**Load testing is a tool, not a requirement.**

**Golden rules**:
1. **Measure real impact**: If you can't quantify benefits, defer
2. **Optimize first**: Fix N+1 queries, add indexes before scaling infrastructure
3. **Match environment**: Only test if staging resembles production
4. **Prioritize**: SLAs and growth justify load testing; curiosity doesn't

**When in doubt**: Start with basic monitoring, load test when evidence shows it's needed.

---

## Summary: M3.4 Complete

You've learned:
1. ✅ When to load test (and when to skip)
2. ✅ 5 test types: smoke, load, stress, spike, soak
3. ✅ Interpreting p50/p95/p99 metrics
4. ✅ Diagnosing bottlenecks: code, external, infrastructure
5. ✅ Scaling playbook: cache, batch, vertical, horizontal, auto-scale

**Next steps**:
- Run `locustfile.py` against your own API
- Analyze results using techniques from Section 3
- Optimize based on bottlenecks identified
- Scale strategically using Section 5 playbook

**Resources**:
- `locustfile.py` - Complete load testing implementation
- `scaling_notes.md` - Detailed scaling reference
- `README.md` - Quick start guide
- `requirements.txt` - Dependencies

In [None]:
# Decision helper: Should you load test?
def should_load_test(daily_users, growth_rate_monthly, has_sla, staging_matches_prod, architecture_stable):
    """
    Decision framework for load testing.
    
    Args:
        daily_users: Number of daily active users
        growth_rate_monthly: Growth rate as percentage (e.g., 20 for 20%)
        has_sla: Boolean - do you have SLA requirements?
        staging_matches_prod: Boolean - does staging match production environment?
        architecture_stable: Boolean - is architecture stable (not MVP)?
    
    Returns:
        int: The score (from -3 to 7). The reasoning and recommendation are printed.
    """
    score = 0
    reasons = []
    
    # Scoring criteria
    if daily_users > 500:
        score += 2
        reasons.append("✅ High user base (>500 daily users)")
    elif daily_users > 100:
        score += 1
        reasons.append("⚠️  Moderate user base (101-500 daily users)")
    else:
        reasons.append("❌ Low user base (≤100 daily users)")
    
    if growth_rate_monthly > 15:
        score += 2
        reasons.append("✅ High growth (>15% monthly)")
    elif growth_rate_monthly > 5:
        score += 1
        reasons.append("⚠️  Moderate growth (>5% to 15% monthly)")
    else:
        reasons.append("❌ Low/no growth (≤5% monthly)")
    
    if has_sla:
        score += 3
        reasons.append("✅ SLA requirements (critical)")
    
    if not staging_matches_prod:
        score -= 2
        reasons.append("❌ Staging ≠ Production (results unreliable)")
    
    if not architecture_stable:
        score -= 1
        reasons.append("⚠️  Architecture unstable (may change)")
    
    # Decision
    print("=== Load Testing Decision Framework ===\n")
    for reason in reasons:
        print(reason)
    
    print(f"\nScore: {score}/7")
    
    if score >= 5:
        print("\n✅ RECOMMENDATION: YES - Load test now")
        print("   Priority: HIGH")
    elif score >= 3:
        print("\n⚠️  RECOMMENDATION: CONSIDER - Load test if planning infrastructure changes")
        print("   Priority: MEDIUM")
    else:
        print("\n❌ RECOMMENDATION: SKIP - Focus on monitoring and optimization")
        print("   Priority: LOW")
    
    return score

# Example scenarios
print("Scenario 1: Early-stage startup")
should_load_test(daily_users=50, growth_rate_monthly=5, has_sla=False, 
                 staging_matches_prod=False, architecture_stable=False)

print("\n" + "="*60 + "\n")

print("Scenario 2: Growing SaaS product")
should_load_test(daily_users=1000, growth_rate_monthly=20, has_sla=True, 
                 staging_matches_prod=True, architecture_stable=True)

print("\n" + "="*60 + "\n")

print("Scenario 3: Established product with moderate growth")
should_load_test(daily_users=500, growth_rate_monthly=10, has_sla=False, 
                 staging_matches_prod=True, architecture_stable=True)

# Expected:
# - Scenario 1: SKIP (score -3)
# - Scenario 2: YES (score 7)
# - Scenario 3: SKIP (score 2)