# Module 11.3: Resource Management & Throttling
## Multi-Tenant SaaS Resource Allocation

**Duration:** 38 minutes  
**Level:** 3  
**Prerequisites:** M11.1 (Tenant Isolation), M11.2 (Tenant Customization), Level 2 M6.3 (Rate Limiting)

## Section 1: Introduction & Hook

### The Noisy Neighbor Problem

You built tenant-specific customization in M11.2. That works great... until **Tenant A starts hammering your system with 10,000 queries per hour** while Tenant B can barely get a response.

**Real-world impact:**
- Response times: 2 seconds → 30 seconds
- OpenAI bill: $500/month → $4,000/month
- 49 other tenants suffering

### What You'll Learn
- Implement per-tenant rate limiting (100 queries/hour per tenant)
- Build a fair query queue that prevents tenant starvation
- Enforce resource quotas (query counts, API tokens, storage)
- Handle emergency quota increases without redeploying
- **Important:** When quotas are premature optimization (<50 tenants)

In [None]:
# Setup and imports
import sys
import logging
import json

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

print("✓ Module 11.3: Resource Management & Throttling")
print("  Focus: Per-tenant quotas and fair scheduling")

# Expected:
# ✓ Module 11.3: Resource Management & Throttling
#   Focus: Per-tenant quotas and fair scheduling

## Section 2: Prerequisites & Setup

### Starting Point Verification

Your Level 3 system currently has:
- 50+ tenants with isolated namespaces (M11.1)
- Per-tenant model customization (M11.2)
- Global API rate limiting (Level 2 M6.3)
- Redis for caching and distributed state

**The gap:** Global rate limits don't prevent individual tenants from consuming disproportionate resources.

### Dependencies Installation

We need Redis for quota tracking and queue management.

In [None]:
# Check dependencies and Redis connection
try:
    import redis
    from config import get_redis_client, Config
    from l2_m11_resource_management_throttling import QuotaManager, FairTenantQueue
    
    print("✓ Dependencies loaded")
    print(f"  Redis config: {Config.REDIS_HOST}:{Config.REDIS_PORT}")
    
    # Test Redis connection
    try:
        r = get_redis_client()
        print("✓ Redis connection successful")
        print(f"  Version: {r.info('server').get('redis_version', 'unknown')}")
    except Exception as e:
        print(f"⚠️ Skipping Redis calls (no connection): {e}")
        r = None
        
except ImportError as e:
    print(f"⚠️ Missing dependencies: {e}")
    print("  Run: pip install -r requirements.txt")
    r = None

# Expected:
# ✓ Dependencies loaded
# ✓ Redis connection successful

## Section 3: Theory Foundation

### Core Concepts

Think of your RAG system like an **apartment building** with 50 tenants sharing infrastructure:

**Without management:** One tenant runs washing machine 24/7 → no water pressure for others

**With management:**
1. **Individual metering** - Track each tenant's usage
2. **Fair quotas** - Set reasonable limits per tenant
3. **Queue discipline** - When demand exceeds capacity, serve fairly
4. **Overflow handling** - What happens when tenants exceed quotas

### Request Flow

```
Request → Tenant ID → Quota Check
                    ├─ Under quota? → Process immediately
                    └─ Over quota? → Queue or Reject

Queue → Round-robin scheduling → Process when capacity available
```
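
The flow above can be sketched as a single decision function. This is a minimal in-memory sketch, not the course module's implementation: `Decision`, `route_request`, and the `max_queue_size` parameter are hypothetical names chosen for illustration, and plain dictionaries stand in for Redis.

```python
from collections import deque
from enum import Enum


class Decision(Enum):
    PROCESS = "process"
    QUEUE = "queue"
    REJECT = "reject"


def route_request(tenant_id, request, usage, limits, queues, max_queue_size=10):
    """Decide what happens to one request, following the flow diagram.

    usage:  dict of tenant_id -> queries used in the current window
    limits: dict of tenant_id -> queries allowed per window
    queues: dict of tenant_id -> deque of waiting requests
    """
    used = usage.get(tenant_id, 0)
    if used < limits[tenant_id]:
        # Under quota: count the query and process it immediately.
        usage[tenant_id] = used + 1
        return Decision.PROCESS

    # Over quota: queue if the tenant's queue has room, otherwise reject.
    queue = queues.setdefault(tenant_id, deque())
    if len(queue) < max_queue_size:
        queue.append(request)
        return Decision.QUEUE
    return Decision.REJECT
```

Bounding the per-tenant queue is a design choice in this sketch: without it, an over-quota tenant could grow the queue without limit.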

### Why This Matters

- **Prevents noisy neighbor** - Saves 20-40% infrastructure costs
- **Predictable performance** - Maintains 2-3s p95 latency under load
- **Cost control** - Prevents $10K+ surprise bills

**Key insight:** Quotas are primarily for **system stability and fairness**, not just billing.

## Section 4: Hands-On Implementation

### Step 1: Per-Tenant Quota Tracker

We'll build a Redis-based system to track query counts, token usage, and storage per tenant with configurable limits.

**Three quota tiers:**
- **Free:** 100 queries/hour, 500K tokens/month
- **Pro:** 1,000 queries/hour, 5M tokens/month
- **Enterprise:** 10,000 queries/hour, 50M tokens/month
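
To make the tiers concrete, here is a minimal sketch of how they might be represented and checked against an hourly fixed window. `TIER_LIMITS`, `hourly_key`, and `InMemoryQuotaTracker` are hypothetical names; the course's `QuotaManager` stores its counters in Redis and may be structured differently.

```python
import time

# Tier limits taken from the list above.
TIER_LIMITS = {
    "free": {"queries_hourly": 100, "tokens_monthly": 500_000},
    "pro": {"queries_hourly": 1_000, "tokens_monthly": 5_000_000},
    "enterprise": {"queries_hourly": 10_000, "tokens_monthly": 50_000_000},
}


def hourly_key(tenant_id, now=None):
    """Build a counter key that changes every hour (fixed-window counting)."""
    now = time.time() if now is None else now
    return f"quota:{tenant_id}:queries_hourly:{int(now // 3600)}"


class InMemoryQuotaTracker:
    """Dictionary-backed stand-in for a Redis-backed quota tracker."""

    def __init__(self):
        self.tiers = {}
        self.counters = {}

    def set_tenant_tier(self, tenant_id, tier):
        if tier not in TIER_LIMITS:
            raise ValueError(f"unknown tier: {tier}")
        self.tiers[tenant_id] = tier

    def record_query(self, tenant_id, now=None):
        """Count one query and return (allowed, current, limit)."""
        tier = self.tiers.get(tenant_id, "free")  # assumed default tier
        limit = TIER_LIMITS[tier]["queries_hourly"]
        key = hourly_key(tenant_id, now)
        current = self.counters.get(key, 0)
        if current >= limit:
            return False, current, limit
        self.counters[key] = current + 1
        return True, current + 1, limit
```

Because the hour number is part of the key, counters reset when the hour rolls over. In Redis, the same effect is usually paired with a TTL so old keys expire on their own.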

In [None]:
from l2_m11_resource_management_throttling import QuotaManager, QuotaType

if r:
    # Initialize quota manager
    qm = QuotaManager(r)
    
    # Set tenant to pro tier
    qm.set_tenant_tier("tenant_demo", "pro")
    print("✓ Set tenant_demo to pro tier")
    
    # Record some queries
    for i in range(3):
        results = qm.record_query("tenant_demo", tokens_used=1000)
        print(f"  Query {i+1}: hourly={results['queries_hourly']}, tokens ok={results.get('tokens_monthly', True)}")
    
    # Check status
    status = qm.get_quota_status("tenant_demo")
    hourly = status["quotas"]["queries_hourly"]
    print(f"\n✓ Quota status: {hourly['current']}/{hourly['limit']} queries ({hourly['percentage']}%)")
else:
    print("⚠️ Skipping API calls (no Redis)")

# Expected:
# ✓ Set tenant_demo to pro tier
# Query 1-3 recorded
# ✓ Quota status: 3/1000 queries (0.3%)

### Step 2: Request Queue with Fair Scheduling

Now we build a queue system that prevents tenant starvation using **round-robin scheduling**.

**How it works:**
- Each tenant has their own FIFO queue
- We process one request from each tenant in turn
- No tenant can monopolize resources
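
The three rules above can be sketched in memory with one deque per tenant. `RoundRobinQueue` is a hypothetical class written for illustration; the course's `FairTenantQueue` is Redis-backed and asynchronous, so its internals will differ.

```python
from collections import OrderedDict, deque


class RoundRobinQueue:
    """Per-tenant FIFO queues, served one request per tenant per turn."""

    def __init__(self):
        # tenant_id -> deque of pending requests.
        # The dictionary order is the service rotation.
        self._queues = OrderedDict()

    def enqueue(self, tenant_id, request):
        self._queues.setdefault(tenant_id, deque()).append(request)

    def dequeue_fair(self):
        """Pop one request from the tenant at the front of the rotation."""
        if not self._queues:
            return None
        tenant_id, queue = next(iter(self._queues.items()))
        request = queue.popleft()
        if queue:
            # Tenant still has work: send it to the back of the rotation.
            self._queues.move_to_end(tenant_id)
        else:
            # Tenant is drained: drop it from the rotation.
            del self._queues[tenant_id]
        return tenant_id, request
```

If tenant `a` enqueues three requests and tenants `b` and `c` enqueue one each, the service order is `a, b, c, a, a`: the heavy tenant waits its turn instead of being served three times in a row.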

In [None]:
from l2_m11_resource_management_throttling import FairTenantQueue, QueuedRequest
import asyncio
import time

async def demo_fair_queue():
    if not r:
        print("⚠️ Skipping API calls (no Redis)")
        return
    
    queue = FairTenantQueue(r, max_queue_size=10)
    
    # Enqueue requests from multiple tenants
    tenants = ["tenant_a", "tenant_b", "tenant_c"]
    for tenant in tenants:
        for i in range(2):
            req = QueuedRequest(
                request_id=f"{tenant}_req_{i}",
                tenant_id=tenant,
                query=f"Query {i} from {tenant}",
                queued_at=time.time()
            )
            await queue.enqueue(req)
            print(f"  Enqueued: {req.request_id}")
    
    # Get stats
    stats = queue.get_queue_stats()
    print(f"\n✓ Queue stats: {stats['total_queued_requests']} total, {stats['active_tenants']} tenants")
    
    # Dequeue fairly (round-robin)
    print("\nDequeuing (round-robin order):")
    for _ in range(3):
        req = await queue.dequeue_fair()
        if req:
            print(f"  → {req.tenant_id}: {req.request_id}")

# Run async function
if r:
    await demo_fair_queue()
else:
    print("⚠️ Skipping API calls (no Redis)")

# Expected:
# Enqueued 6 requests from 3 tenants
# Dequeues alternate between tenants (fair)

## Section 5: Reality Check

### What This DOESN'T Do

**1. Doesn't handle cross-service quotas**
- Our quotas only track queries to this API
- If tenants call OpenAI directly with their own keys, we can't track it

**2. Doesn't optimize for cost**
- We count queries, but not all queries cost the same
- GPT-4 with 8K context costs 100x more than GPT-3.5 with 1K context
- For true cost management, need weighted quotas

**3. Doesn't prevent intentional abuse**
- Malicious tenants can create multiple accounts
- Need additional security (email verification, payment, abuse detection)

### Trade-offs Accepted

- **Complexity:** 600+ lines of quota management code
- **Latency:** Quota checks add 5-15ms per request
- **Operations:** Must monitor queue depth, Redis memory, handle quota requests

### When This Breaks

- **At 500+ tenants:** Redis memory grows to 2-5GB, need clustering
- **At 10,000+ req/sec:** Quota checks bottleneck, need caching
- **With SLA commitments:** Queue doesn't guarantee response time

## Section 6: Alternative Solutions

### Alternative 1: No Quotas (Trust-Based)

**Best for:** <50 tenants, all paying customers

**Pros:** Zero complexity, best UX  
**Cons:** One tenant can impact all others

```python
# Just log usage for monitoring
await usage_tracker.record(tenant_id, request.url.path)
```

### Alternative 2: Hard Limits (No Queuing)

**Best for:** 50-200 tenants with clear tiers

**Pros:** Simple (100 lines vs 600), predictable  
**Cons:** Poor UX (hard rejections), no burst handling

```python
if tenant.usage_this_hour >= limit:
    return JSONResponse(status_code=429, content={"error": "Quota exceeded"})
```

### Alternative 3: Dynamic Throttling

**Best for:** 100+ tenants with variable usage

**Pros:** Better resource utilization, adapts to traffic  
**Cons:** Complex to tune, unpredictable for tenants
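
One way dynamic throttling might work is to scale each tenant's base limit by a factor derived from current system load. The sketch below is an assumption about the approach, not something this module implements: the function name, the 0.6 and 0.9 load thresholds, and the 25% floor are all illustrative values.

```python
def dynamic_limit(base_limit, system_load, low=0.6, high=0.9, floor=0.25):
    """Scale a tenant's hourly limit down as system load rises.

    system_load is utilization in [0, 1]. At or below `low`, tenants keep
    their full limit. At or above `high`, they get `floor` of it. In between,
    the factor is linearly interpolated.
    """
    if not 0.0 <= system_load <= 1.0:
        raise ValueError("system_load must be between 0 and 1")
    if system_load <= low:
        factor = 1.0
    elif system_load >= high:
        factor = floor
    else:
        progress = (system_load - low) / (high - low)
        factor = 1.0 - progress * (1.0 - floor)
    # Never throttle a tenant to zero.
    return max(1, round(base_limit * factor))
```

The "unpredictable for tenants" drawback is visible here: the same tenant gets 1,000 queries/hour on a quiet system and 250 on a busy one.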

### Alternative 4: Reserved Capacity (Enterprise)

**Best for:** Enterprise customers paying $10K+/month

**Pros:** Guaranteed performance, SLA-friendly  
**Cons:** High cost, complex orchestration

## Section 7: When NOT to Use

### Scenario 1: Small Tenant Count (<50 tenants)

**Why it fails:** Complexity cost outweighs benefit

**Use instead:** Alternative 1 (No Quotas) - just monitor and contact heavy users

**Red flags:**
- You spend more time managing quotas than building features
- Team size <5 people

### Scenario 2: Ultra-Low Latency Requirements (<50ms p95)

**Why it fails:** Quota checks add 5-15ms; queuing adds 30-300s

**Use instead:** Alternative 4 (Reserved Capacity) - dedicated resources

**Red flags:**
- Your SLA requires <50ms response time
- Latency SLAs in contracts

### Scenario 3: Highly Unpredictable Traffic (10x+ variance)

**Why it fails:** Fair queuing assumes predictable load; 10x spikes fill queue instantly

**Use instead:** Alternative 2 (Hard Limits) + aggressive auto-scaling

**Red flags:**
- Queue depth regularly >500 requests
- Average wait time >60 seconds

## Section 8: Common Failures

### Failure 1: Noisy Neighbor Exhausts Resources Despite Quotas

**Root cause:** Quotas count queries, not resources. One GPT-4 query with 8K context can cost on the order of 100x more than a GPT-3.5 query with 500 tokens, yet both count as a single query.

**The fix:** Use resource-weighted quotas
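
A minimal sketch of what a weight function could look like. The multipliers, the 4-characters-per-token estimate, and the 1K-token unit are illustrative assumptions; the module's `ResourceWeightedQuota.calculate_query_weight` may use different factors.

```python
# Illustrative multipliers only; real values should come from provider pricing.
MODEL_WEIGHTS = {"gpt-3.5-turbo": 1.0, "gpt-4": 20.0}
TOOL_USE_MULTIPLIER = 1.5
TOKENS_PER_UNIT = 1_000  # assumed: one context unit = 1K tokens


def estimate_tokens(text):
    """Very rough token estimate: about 4 characters per token."""
    return max(1, len(text) // 4)


def calculate_query_weight(query):
    """Return the cost of a query in standard query units (1.0 = baseline)."""
    model_weight = MODEL_WEIGHTS.get(query["model"], 1.0)
    # Small contexts are floored at one unit so no query weighs less than 1.0.
    context_units = max(1.0, estimate_tokens(query["context"]) / TOKENS_PER_UNIT)
    tool_multiplier = TOOL_USE_MULTIPLIER if query.get("use_tools") else 1.0
    return model_weight * context_units * tool_multiplier
```

The quota counter is then incremented by the weight instead of by 1, so an expensive query consumes proportionally more of the tenant's allowance.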

In [None]:
from l2_m11_resource_management_throttling import ResourceWeightedQuota

if r:
    weighted = ResourceWeightedQuota(r)
    
    # Example: Different query costs
    queries = [
        {"model": "gpt-3.5-turbo", "context": "small context", "use_tools": False},
        {"model": "gpt-4", "context": "large " * 1000, "use_tools": True},
    ]
    
    print("Query weights (1.0 = standard query unit):")
    for i, q in enumerate(queries):
        weight = weighted.calculate_query_weight(q)
        print(f"  Query {i+1} ({q['model']}): {weight:.1f}x")
    
    print("\n✓ Weighted quotas prevent resource gaming")
else:
    print("⚠️ Skipping API calls (no Redis)")

# Expected:
# Query 1 (gpt-3.5-turbo): 1.0x
# Query 2 (gpt-4): 40-80x (expensive!)

### Failure 2: Quota Enforcement Bypass via Race Conditions

**Root cause:** The quota check and the increment are not atomic. With concurrent requests, several can read the same counter value before any of them increments it, so all of them pass the check.

**The fix:** Use Redis Lua scripts for atomic check-and-increment
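
A minimal sketch of such a script, registered through redis-py's `register_script`. The Lua body, the `make_atomic_checker` helper, and the argument order are assumptions for illustration; the module's `AtomicQuotaManager` may be implemented differently.

```python
# Lua runs inside Redis as a single atomic step, so no other client can
# read or write the counter between the check and the increment.
CHECK_AND_INCREMENT_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
local limit = tonumber(ARGV[1])
local increment = tonumber(ARGV[2])
if current + increment > limit then
  return {0, current}
end
current = redis.call('INCRBY', KEYS[1], increment)
if current == increment then
  -- First write in this window: start the expiry clock.
  redis.call('EXPIRE', KEYS[1], tonumber(ARGV[3]))
end
return {1, current}
"""


def make_atomic_checker(client):
    """Return check(key, limit, increment, ttl_seconds) -> (allowed, current).

    `client` is expected to be a redis-py client (or anything that exposes
    a compatible register_script method).
    """
    script = client.register_script(CHECK_AND_INCREMENT_LUA)

    def check(key, limit, increment=1, ttl_seconds=3600):
        allowed, current = script(keys=[key], args=[limit, increment, ttl_seconds])
        return bool(allowed), int(current)

    return check
```

With a live server this would be used as `check = make_atomic_checker(redis.Redis())` followed by `check("quota:tenant_atomic:hourly", limit=100)`, where the key name is a placeholder.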

In [None]:
from l2_m11_resource_management_throttling import AtomicQuotaManager

if r:
    atomic_qm = AtomicQuotaManager(r)
    
    # Set low limit for demo
    atomic_qm.set_tenant_tier("tenant_atomic", "free")
    
    # Atomic check-and-increment (prevents race conditions)
    success1, current1, limit1 = atomic_qm.atomic_check_and_increment(
        "tenant_atomic", QuotaType.QUERIES_HOURLY, increment=1
    )
    print(f"Request 1: {'✓ Allowed' if success1 else '✗ Rejected'} ({current1}/{limit1})")
    
    success2, current2, limit2 = atomic_qm.atomic_check_and_increment(
        "tenant_atomic", QuotaType.QUERIES_HOURLY, increment=1
    )
    print(f"Request 2: {'✓ Allowed' if success2 else '✗ Rejected'} ({current2}/{limit2})")
    
    print("\n✓ Atomic operations prevent race condition bypass")
else:
    print("⚠️ Skipping API calls (no Redis)")

# Expected:
# Both requests processed atomically, no bypass

## Section 9: Production Considerations

### Scaling Concerns

**Redis memory growth:**
- 500 tenants × 5 time windows × 100 bytes = ~250KB for quotas
- Add 100MB for queues = **~100MB total**
- Plan for 500MB with overhead

**Quota check latency:**
- Each request: 3 Redis operations (get, increment, check) = 3ms
- At 1000 req/sec, Redis becomes bottleneck
- Solution: Use Redis pipelining
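
A minimal sketch of pipelining with redis-py, which sends several commands in one network round trip. The function name and the idea of passing pre-built, window-stamped key names are assumptions for illustration.

```python
def record_usage_pipelined(client, window_keys, ttl_seconds=3600):
    """Increment several quota counters in a single round trip.

    client:      a redis-py client (anything exposing a compatible pipeline())
    window_keys: counter key names, assumed to already include the window
                 (for example the hour number), so refreshing the TTL on
                 every call does not extend the counting window
    Returns the new counter values in the same order as window_keys.
    """
    pipe = client.pipeline(transaction=True)
    for key in window_keys:
        pipe.incr(key)
        pipe.expire(key, ttl_seconds)
    results = pipe.execute()
    # Results alternate: [incr, expire, incr, expire, ...]. Keep the counters.
    return [int(value) for value in results[0::2]]
```

Pipelining cuts network round trips but only records usage. Rejecting over-quota requests without a race still needs an atomic check-and-increment.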

**Queue worker capacity:**
- 5 workers × 2 req/sec = 10 req/sec = 600 req/min
- If incoming rate exceeds this, queue grows
- Monitor and auto-scale workers when depth >500
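
The worker arithmetic above can be turned into a quick estimate of how fast the queue grows under overload. The function names are hypothetical; the defaults (5 workers at 2 req/sec each, a depth threshold of 500) come from the figures in this section.

```python
def queue_growth_per_minute(incoming_per_sec, workers=5, per_worker_per_sec=2):
    """Net queue growth in requests per minute (negative means draining)."""
    capacity_per_sec = workers * per_worker_per_sec
    return (incoming_per_sec - capacity_per_sec) * 60


def minutes_until_depth(depth_threshold, incoming_per_sec, workers=5, per_worker_per_sec=2):
    """Minutes until an empty queue reaches depth_threshold, or None if it never does."""
    growth = queue_growth_per_minute(incoming_per_sec, workers, per_worker_per_sec)
    if growth <= 0:
        return None
    return depth_threshold / growth
```

For example, at 15 req/sec incoming against 10 req/sec of capacity, the queue grows by 300 requests per minute and crosses the 500-request scaling threshold in under two minutes.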

### Cost at Scale (500 tenants)

- **Redis:** $50-100/month (2GB managed)
- **Queue workers:** $200-400/month (5 instances)
- **Monitoring:** $50-100/month
- **Engineering:** 8-12 hours/month

**Total:** $300-600/month infrastructure + 8-12 hours/month eng time

## Section 10: Decision Card

### ✅ BENEFIT
- Prevents noisy neighbor problem
- Maintains 2-3s p95 latency even under load
- Caps infrastructure costs (~$500/month for 500 tenants)
- Fair queue ensures all tenants served within 60s

### ❌ LIMITATION
- Adds 600+ lines of operational complexity
- Cannot prevent resource gaming without weighted quotas
- Requires human intervention for quota increases
- Queue approach doesn't work for real-time (<5s) requirements

### 💰 COST
- **Initial:** 12-16 hours implementation
- **Ongoing:** $300-600/month + 8-12 hours/month management
- **Complexity:** 3 new failure modes

### 🤔 USE WHEN
- 50-500 tenants on shared infrastructure
- Experiencing noisy neighbor complaints
- Need predictable cost control
- Can accept 5-15ms added latency per request + 30-300s queue wait

### 🚫 AVOID WHEN
- <50 tenants (use Alternative 1: No Quotas)
- Need <50ms latency SLA (use Alternative 4: Reserved Capacity)
- Highly spiky traffic (use Alternative 2: Hard Limits + auto-scale)
- Team <5 people (wait or use managed service)

## Section 11: Practice Exercises

### 🟢 Easy: Basic Per-Tenant Rate Limiting (60-90 min)

Implement simple per-tenant rate limiting without queuing:
- Per-tenant quota tracking in Redis (queries per hour)
- Three tier levels (free/pro/enterprise)
- Middleware that rejects over-quota requests with 429
- Admin endpoint to check tenant quota status

### 🟡 Medium: Fair Queue Management (2-3 hours)

Add queue-based throttling with fair scheduling:
- Build on Easy challenge
- Implement FairTenantQueue with round-robin
- Queue requests when tenant over quota
- Background worker to process queued requests

### 🔴 Hard: Production System (5-6 hours)

Build complete production system:
- All Medium features
- Weighted quotas (resource-aware)
- Atomic quota checking (no race conditions)
- Database-backed configuration
- Bounded queue (global + per-tenant limits)
- Comprehensive monitoring

## Section 12: Summary & Next Steps

### What You Learned

✓ **Per-tenant quota tracking** with Redis (queries, tokens, storage)  
✓ **Fair queue scheduling** that prevents noisy neighbors  
✓ **Atomic quota checking** to avoid race conditions  
✓ **Weighted quotas** for resource-aware limits  
✓ **When quotas are premature** (<50 tenants) vs essential (>50 tenants)

### Critical Takeaways

1. **Quotas are for system stability** first, billing second
2. **Fair scheduling is complex** - queue-depth awareness matters
3. **Emergency quota increases** must not require deploys
4. **Always use atomic operations** for quota checks (Lua scripts)
5. **Bounded queues** prevent memory disasters during spikes

### Real-World Application

You now have multi-tenant resource management for **50-500 tenants** that:
- Prevents noisy neighbors
- Gives sales team agility for quota adjustments
- Works for 80% of SaaS applications

### Next Steps

1. Complete the practice challenge (choose your level)
2. Implement monitoring (Prometheus metrics)
3. Test under load (simulate 1000 req/sec)
4. **Next module:** M11.4 Vector Index Sharding

---

**Great work! You're building real production systems now.**