# L3 M13.2: Auto-Scaling Multi-Tenant Infrastructure

## Learning Arc

**Purpose:** Implement intelligent Kubernetes auto-scaling for multi-tenant RAG platforms using per-tenant queue depth metrics to drive HPA scaling decisions while preventing resource monopolization across 50+ business units.

**Concepts Covered:**
- Horizontal Pod Autoscaler (HPA) with custom per-tenant metrics
- Prometheus metrics collection and export for Kubernetes Custom Metrics API
- Resource quotas and LimitRanges for multi-tenant fairness
- Pod affinity and anti-affinity rules for fault tolerance
- Graceful scale-down with connection draining (SIGTERM handling)
- Tier-based scaling policies (Premium, Standard, Free)
- Cost attribution and chargeback for GCC platforms
- Compliance audit trails for SOX/DPDPA requirements

**After Completing This Notebook:**
- You will understand how Kubernetes HPA uses custom metrics to drive scaling decisions
- You can implement per-tenant queue depth tracking with Prometheus
- You will recognize when to use HPA vs fixed capacity vs serverless
- You can configure resource quotas preventing tenant monopoly
- You will apply graceful scale-down strategies preserving SLAs
- You can generate cost reports for CFO chargeback in GCC contexts

**Context in Track L3.M13:**
This module builds on M11 (tenant routing/RBAC) and M12 (vector isolation/compliance) to add intelligent auto-scaling that reduces infrastructure costs by 30-45% while maintaining 99.9%+ SLA compliance. It prepares you for M13.3 (cost attribution systems) and M13.4 (capacity planning).

In [None]:
# Environment Setup
import os
import sys

# Add parent directory to path for imports
parent_dir = os.path.abspath('..')
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

# OFFLINE mode for L3 consistency (no external services required)
OFFLINE = os.getenv("OFFLINE", "true").lower() == "true"

# Service configuration (Redis and Prometheus are optional)
REDIS_ENABLED = os.getenv("REDIS_ENABLED", "false").lower() == "true"
PROMETHEUS_ENABLED = os.getenv("PROMETHEUS_ENABLED", "false").lower() == "true"

if OFFLINE or (not REDIS_ENABLED and not PROMETHEUS_ENABLED):
    print("⚠️ Running in OFFLINE mode")
    print("   → External service calls will be skipped")
    print("   → Redis and Prometheus are optional for this module")
    print("   → All core functionality works without external services")
    print("")
    print("To enable services, set in .env:")
    print("   REDIS_ENABLED=true")
    print("   PROMETHEUS_ENABLED=true")
else:
    print("✓ Online mode - external services enabled")
    if REDIS_ENABLED:
        print("   ✓ Redis: Enabled")
    if PROMETHEUS_ENABLED:
        print("   ✓ Prometheus: Enabled")

## Section 1: The Problem - Noisy Neighbor in Multi-Tenant Auto-Scaling

**Crisis Scenario:** At 2:47 AM, a media tenant processing 10× its normal traffic during a sporting event causes platform-wide latency spikes (200ms → 12 seconds) that affect 44 other tenants.

**Root Cause:** The HPA was configured with global CPU metrics instead of per-tenant queue depth. When the media tenant spiked, all pods hit 100% CPU, but the HPA scaled too slowly. Meanwhile, queries from the other 44 tenants queued behind the media tenant's backlog.

**Business Impact:**
- 44 tenants experienced 12-second query latency (60× SLA breach)
- Finance tenant missed market-open trades (₹50L+ potential loss)
- Legal tenant's discovery deadline queries timed out
- Platform credibility destroyed, 3 enterprise contracts at risk

**What We'll Build:** Per-tenant queue depth tracking with tier-based resource quotas preventing any tenant from monopolizing resources.

In [None]:
# Imports for core functionality
from src.l3_m13_scale_performance_optimization import (
    TenantTier,
    GCCAutoScalingPolicy,
    TenantQueueManager,
    calculate_target_replicas,
    validate_resource_quota,
    log_scale_event,
    generate_cost_report
)
import asyncio

print("✓ Imported auto-scaling modules successfully")
print("")
print("Available components:")
print("  - TenantTier: Enum for Premium/Standard/Free classification")
print("  - GCCAutoScalingPolicy: Tier-based scaling configuration")
print("  - TenantQueueManager: Per-tenant queue depth tracking")
print("  - Helper functions: calculate_target_replicas, validate_resource_quota, etc.")

## Section 2: Tier-Based Scaling Policies

Different tenant tiers get different scaling configurations:

**Premium Tier:**
- Min replicas: 5 (always-on capacity)
- Max replicas: 30 (can burst high)
- Scale-up cooldown: 60s (fast response)
- Resource quota: 40% of cluster
- SLA target: 99.95% uptime

**Standard Tier:**
- Min replicas: 3 (lower baseline)
- Max replicas: 15 (moderate ceiling)
- Scale-up cooldown: 120s (slower response)
- Resource quota: 20% of cluster
- SLA target: 99.9% uptime

**Free Tier:**
- Min replicas: 1 (minimal baseline)
- Max replicas: 5 (hard cap)
- Scale-up cooldown: 300s (slow response)
- Resource quota: 10% of cluster
- SLA target: 99% uptime
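
The tier lists above can be captured as plain data. The following is a minimal sketch, not the module's actual `GCCAutoScalingPolicy`: the `Tier` enum, the `ScalingConfig` dataclass, and its field names are hypothetical stand-ins, and only the numbers come from the lists above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):  # hypothetical stand-in for the module's TenantTier
    PREMIUM = "premium"
    STANDARD = "standard"
    FREE = "free"


@dataclass(frozen=True)
class ScalingConfig:  # hypothetical; field names are illustrative only
    min_replicas: int
    max_replicas: int
    scale_up_cooldown_s: int
    resource_quota_percent: int
    sla_target_percent: float


# Values copied from the Premium/Standard/Free lists above.
TIER_CONFIGS = {
    Tier.PREMIUM: ScalingConfig(5, 30, 60, 40, 99.95),
    Tier.STANDARD: ScalingConfig(3, 15, 120, 20, 99.9),
    Tier.FREE: ScalingConfig(1, 5, 300, 10, 99.0),
}

for tier, cfg in TIER_CONFIGS.items():
    print(
        f"{tier.value:<9} replicas {cfg.min_replicas}-{cfg.max_replicas}, "
        f"cooldown {cfg.scale_up_cooldown_s}s, quota {cfg.resource_quota_percent}%"
    )
```

Keeping the configuration frozen and in one table makes it harder for a tier's limits to drift between the scaling logic and the quota checks.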

In [None]:
# Example: Premium tier configuration
premium_policy = GCCAutoScalingPolicy(TenantTier.PREMIUM)
premium_config = premium_policy.get_scaling_config()

print("Premium Tier Scaling Configuration:")
print(f"  Min replicas: {premium_config.min_replicas}")
print(f"  Max replicas: {premium_config.max_replicas}")
print(f"  Scale-up cooldown: {premium_config.scale_up_cooldown}s")
print(f"  Scale-down cooldown: {premium_config.scale_down_cooldown}s")
print(f"  Resource quota: {premium_config.resource_quota_percent}%")
print(f"  SLA target: {premium_config.sla_target * 100}%")
print("")

# Expected: Premium gets higher min/max, faster scale-up, larger quota

# Compare all tiers
print("\nTier Comparison:")
print("-" * 70)
print(f"{'Tier':<10} {'Min':<5} {'Max':<5} {'Cooldown':<10} {'Quota':<8} {'SLA'}")
print("-" * 70)

for tier in [TenantTier.PREMIUM, TenantTier.STANDARD, TenantTier.FREE]:
    policy = GCCAutoScalingPolicy(tier)
    config = policy.get_scaling_config()
    print(
        f"{tier.value:<10} "
        f"{config.min_replicas:<5} "
        f"{config.max_replicas:<5} "
        # Pad the value together with its unit so columns line up with the header
        f"{f'{config.scale_up_cooldown}s':<10} "
        f"{f'{config.resource_quota_percent}%':<8} "
        f"{config.sla_target * 100:.2f}%"
    )

# Expected: Premium > Standard > Free in terms of resources and responsiveness

## Section 3: Per-Tenant Queue Depth Tracking

The core metric driving HPA scaling decisions is **per-tenant queue depth** (not global CPU).

**Why queue depth instead of CPU:**
- **Leading indicator:** Detects load spikes before CPU saturates
- **Per-tenant isolation:** Finance tenant's spike doesn't affect Media tenant
- **Works for I/O-bound:** RAG queries spend time waiting for vector DB (low CPU, high latency)

**Target: 10 queries per pod**
- Testing showed: <10 queries/pod = <200ms latency
- >10 queries/pod = latency degrades exponentially

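**Backpressure mechanism:**
- Queue size is capped at 100 queries per tenant
- When a tenant's queue is full, new requests receive HTTP 429 "Too Many Requests"
- The elevated queue depth metric (not the 429 response itself) is what drives HPA scale-up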
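
To make the cap concrete, here is a minimal sketch of per-tenant bounded queues with backpressure. `BoundedTenantQueues` is a hypothetical, synchronous illustration and not the module's `TenantQueueManager`; the 202/429 return values are an assumption chosen to mirror the HTTP status described above.

```python
from collections import defaultdict, deque

MAX_QUEUE_SIZE = 100  # per-tenant cap from the text above


class BoundedTenantQueues:  # hypothetical; not the module's TenantQueueManager
    def __init__(self, max_queue_size: int = MAX_QUEUE_SIZE):
        self.max_queue_size = max_queue_size
        self._queues = defaultdict(deque)  # one independent queue per tenant

    def enqueue(self, tenant_id: str, query: dict) -> int:
        """Return an HTTP-style status: 202 if queued, 429 if this tenant's queue is full."""
        queue = self._queues[tenant_id]
        if len(queue) >= self.max_queue_size:
            return 429  # backpressure: reject instead of growing without bound
        queue.append(query)
        return 202

    def depth(self, tenant_id: str) -> int:
        return len(self._queues[tenant_id])


queues = BoundedTenantQueues()
statuses = [queues.enqueue("media_agency", {"query": f"q{i}"}) for i in range(120)]
finance_status = queues.enqueue("finance_corp", {"query": "market open"})

print(queues.depth("media_agency"))  # 100 (capped)
print(statuses.count(429))           # 20 rejected
print(finance_status)                # 202: the other tenant is unaffected
```

Because each tenant has its own queue, a flood from one tenant fills only that tenant's queue, and its depth is the number a per-tenant metric would report.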

In [None]:
# Create queue manager
queue_manager = TenantQueueManager(max_queue_size=100)

async def demo_queue_depth_tracking():
    """Demonstrate per-tenant queue depth tracking"""
    
    # Simulate Finance tenant with high load
    print("Simulating Finance tenant spike (50 queries)...")
    for i in range(50):
        await queue_manager.enqueue(
            "finance_corp",
            {"query": f"Market analysis query {i}", "priority": "high"}
        )
    
    # Simulate Media tenant with moderate load
    print("Simulating Media tenant normal load (10 queries)...")
    for i in range(10):
        await queue_manager.enqueue(
            "media_agency",
            {"query": f"News summary query {i}", "priority": "normal"}
        )
    
    # Check queue depths
    finance_depth = queue_manager.get_queue_depth("finance_corp")
    media_depth = queue_manager.get_queue_depth("media_agency")
    
    print("")
    print("Queue depths:")
    print(f"  finance_corp: {finance_depth} queries")
    print(f"  media_agency: {media_depth} queries")
    print("")
    
    # Calculate target replicas for each tenant
    premium_policy = GCCAutoScalingPolicy(TenantTier.PREMIUM)
    standard_policy = GCCAutoScalingPolicy(TenantTier.STANDARD)
    
    finance_target = premium_policy.calculate_target_replicas(finance_depth)
    media_target = standard_policy.calculate_target_replicas(media_depth)
    
    print("HPA scaling recommendations:")
    print(f"  finance_corp (Premium): {finance_depth} queries → {finance_target} pods")
    print(f"  media_agency (Standard): {media_depth} queries → {media_target} pods")
    print("")
    print("✓ Each tenant scales independently based on their own queue depth")

# Run the async demo
await demo_queue_depth_tracking()

# Expected:
# - finance_corp: 50 queries → 5 pods (50/10 = 5, within min=5, max=30)
# - media_agency: 10 queries → 3 pods (10/10 = 1, but min=3 for standard)

## Section 4: Resource Quota Enforcement

**Problem:** Without quotas, a single tenant can monopolize the entire cluster during a spike.

**Solution:** Per-tier resource quotas enforced at Kubernetes namespace level:
- Premium: Max 40% of cluster (40 pods in 100-pod cluster)
- Standard: Max 20% of cluster (20 pods)
- Free: Max 10% of cluster (10 pods)

**Enforcement:**
1. Application-level: `validate_resource_quota()` checks before scaling
2. Kubernetes-level: ResourceQuota object blocks pod creation if exceeded

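**Result:** A Premium tenant can never hold more than 40 pods in a 100-pod cluster, even if the queue-depth calculation asks for 50. With the Premium `max_replicas` of 30 from Section 2, the HPA ceiling applies first; the quota is the backstop if that ceiling is ever raised.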
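
The application-level check can be sketched as a small function. This is an illustrative assumption about how such a check might work, not the implementation of `validate_resource_quota()`; the names `max_pods_for_tier` and `clamp_to_quota` are hypothetical, and the percentages come from the list above.

```python
# Quota percentages from the per-tier list above.
QUOTA_PERCENT = {"premium": 40, "standard": 20, "free": 10}


def max_pods_for_tier(tier: str, cluster_capacity: int) -> int:
    """Largest pod count the tier may hold, rounded down to whole pods."""
    return (QUOTA_PERCENT[tier] * cluster_capacity) // 100


def clamp_to_quota(tier: str, requested: int, cluster_capacity: int) -> tuple[int, bool]:
    """Return (allowed_replicas, within_quota) for a scaling request."""
    limit = max_pods_for_tier(tier, cluster_capacity)
    return min(requested, limit), requested <= limit


print(clamp_to_quota("premium", 50, 100))   # (40, False): request exceeds the 40% quota
print(clamp_to_quota("premium", 30, 100))   # (30, True)
print(clamp_to_quota("standard", 15, 100))  # (15, True)
```

Integer arithmetic is used for the limit so the result never depends on floating-point rounding.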

In [None]:
# Validate resource quotas for different scenarios

cluster_capacity = 100  # Total cluster capacity: 100 pods

print("Resource Quota Validation:")
print("=" * 70)
print("")

# Scenario 1: Premium tenant within quota
valid, message = validate_resource_quota(
    TenantTier.PREMIUM,
    requested_replicas=30,
    total_cluster_capacity=cluster_capacity
)
print(f"Scenario 1: Premium requests 30 pods (out of 100)")
print(f"  Valid: {valid}")
print(f"  Message: {message}")
print("")

# Scenario 2: Premium tenant exceeds quota
valid, message = validate_resource_quota(
    TenantTier.PREMIUM,
    requested_replicas=50,
    total_cluster_capacity=cluster_capacity
)
print(f"Scenario 2: Premium requests 50 pods (exceeds 40% quota)")
print(f"  Valid: {valid}")
print(f"  Message: {message}")
print("")

# Scenario 3: Standard tenant within quota
valid, message = validate_resource_quota(
    TenantTier.STANDARD,
    requested_replicas=15,
    total_cluster_capacity=cluster_capacity
)
print(f"Scenario 3: Standard requests 15 pods")
print(f"  Valid: {valid}")
print(f"  Message: {message}")
print("")

# Expected:
# - Scenario 1: Valid (30/100 = 30% <= 40% quota)
# - Scenario 2: Invalid (50/100 = 50% > 40% quota)
# - Scenario 3: Valid (15/100 = 15% <= 20% quota)

print("✓ Quota enforcement prevents tenant monopoly")

## Section 5: Calculating Target Replicas

**Formula:** `target_replicas = ceil(queue_depth / target_queue_per_pod)`

**Constraints:**
- Must be >= min_replicas (baseline capacity)
- Must be <= max_replicas (budget ceiling)

**Example:**
- Queue depth: 150 queries
- Target: 10 queries/pod
- Calculation: 150 / 10 = 15 pods
- If tier is Standard (max=15): Target = 15 ✓
- If tier is Free (max=5): Target = 5 (constrained)

**Edge cases:**
- Zero queue depth → min_replicas (always maintain baseline)
- Fractional result → ceil() (always round up for safety)
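
The formula and its constraints fit in a few lines. The sketch below is a standalone illustration (the function name `target_replicas` is hypothetical and is not the module's `calculate_target_replicas`); it applies the ceiling first and then clamps to the tier's min/max.

```python
import math

TARGET_QUEUE_PER_POD = 10  # target from Section 3


def target_replicas(queue_depth: int, min_replicas: int, max_replicas: int) -> int:
    """ceil(queue_depth / target), clamped to [min_replicas, max_replicas]."""
    ideal = math.ceil(queue_depth / TARGET_QUEUE_PER_POD)
    return max(min_replicas, min(ideal, max_replicas))


print(target_replicas(150, min_replicas=3, max_replicas=15))  # 15 (Standard, exactly at max)
print(target_replicas(150, min_replicas=1, max_replicas=5))   # 5  (Free, constrained by max)
print(target_replicas(0, min_replicas=5, max_replicas=30))    # 5  (zero depth -> baseline)
print(target_replicas(51, min_replicas=5, max_replicas=30))   # 6  (5.1 rounds up)
```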

In [None]:
# Test replica calculations for various queue depths

test_scenarios = [
    (5, TenantTier.PREMIUM, "Low load"),
    (50, TenantTier.PREMIUM, "Medium load"),
    (250, TenantTier.PREMIUM, "High load"),
    (500, TenantTier.PREMIUM, "Extreme load (exceeds max)"),
    (100, TenantTier.STANDARD, "Standard tier high load"),
    (100, TenantTier.FREE, "Free tier (hits max quickly)"),
]

print("Target Replica Calculations:")
print("=" * 80)
print(f"{'Scenario':<35} {'Queue':<8} {'Tier':<10} {'Target':<8} {'Reasoning'}")
print("=" * 80)

for queue_depth, tier, description in test_scenarios:
    policy = GCCAutoScalingPolicy(tier)
    config = policy.get_scaling_config()
    target = policy.calculate_target_replicas(queue_depth)
    
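    # Calculate ideal (unconstrained) with ceiling division, matching the
    # ceil() in the Section 5 formula; a zero queue depth yields 0 here,
    # so the min_replicas branch below reports it as constrained by min
    ideal = -(-queue_depth // 10)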
    
    # Determine reasoning
    if target == config.min_replicas and ideal < config.min_replicas:
        reasoning = f"Constrained by min={config.min_replicas}"
    elif target == config.max_replicas and ideal > config.max_replicas:
        reasoning = f"Constrained by max={config.max_replicas}"
    else:
        reasoning = f"{queue_depth}/10 = {ideal}"
    
    print(
        f"{description:<35} "
        f"{queue_depth:<8} "
        f"{tier.value:<10} "
        f"{target:<8} "
        f"{reasoning}"
    )

print("")
print("✓ HPA respects both min/max constraints and tier-specific limits")

## Section 6: Compliance Audit Trail

**Regulatory Requirements:**
- SOX Section 404: Document all infrastructure changes
- DPDPA (India): Prove fair resource allocation, no data residency violations
- Client-specific: HIPAA, PCI-DSS for healthcare/finance tenants

**What must be logged:**
- Timestamp (UTC)
- Tenant ID
- Old replica count
- New replica count
- Reason (queue_depth_high, load_decreased, etc.)
- Triggering user (HPA for automated, admin for manual)
- Immutable flag (cannot be modified after creation)

**Audit use cases:**
- CFO: "Prove you didn't favor Premium tenant over Standard"
- Compliance: "Show data residency maintained during scale-up"
- Security: "Trace resource spike to specific tenant event"
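
One way to model a record with the fields listed above is a frozen dataclass. This is a sketch under stated assumptions: `ScaleEvent` and `record_scale_event` are hypothetical names, not the module's `log_scale_event`, and a frozen dataclass only prevents in-process mutation. Durable immutability would additionally need append-only or write-once storage.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)
class ScaleEvent:  # hypothetical record type for illustration
    tenant_id: str
    old_replicas: int
    new_replicas: int
    reason: str
    triggered_by: str  # "HPA" for automated changes, an admin ID for manual ones
    timestamp: str     # ISO 8601, UTC


def record_scale_event(tenant_id: str, old: int, new: int, reason: str,
                       triggered_by: str = "HPA") -> ScaleEvent:
    """Build a read-only audit record stamped with the current UTC time."""
    return ScaleEvent(
        tenant_id=tenant_id,
        old_replicas=old,
        new_replicas=new,
        reason=reason,
        triggered_by=triggered_by,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


event = record_scale_event("finance_corp", 5, 15, "queue_depth_high")

try:
    event.new_replicas = 99  # any attempt to rewrite history fails
    tamper_blocked = False
except FrozenInstanceError:
    tamper_blocked = True

print(event.tenant_id, event.old_replicas, "->", event.new_replicas, event.triggered_by)
print("tamper blocked:", tamper_blocked)
```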

In [None]:
# Demonstrate audit trail logging

print("Simulating scaling events with audit logging:")
print("=" * 70)
print("")

# Event 1: Finance tenant scales up
event1 = log_scale_event(
    tenant_id="finance_corp",
    old_replicas=5,
    new_replicas=15,
    reason="queue_depth_exceeded_threshold"
)

print("Event 1: Scale-up")
print(f"  Tenant: {event1['tenant_id']}")
print(f"  Timestamp: {event1['timestamp']}")
print(f"  Change: {event1['old_replicas']} → {event1['new_replicas']} pods")
print(f"  Reason: {event1['reason']}")
print(f"  Immutable: {event1['immutable']}")
print("")

# Event 2: Media tenant scales down
event2 = log_scale_event(
    tenant_id="media_agency",
    old_replicas=10,
    new_replicas=3,
    reason="load_returned_to_baseline"
)

print("Event 2: Scale-down")
print(f"  Tenant: {event2['tenant_id']}")
print(f"  Timestamp: {event2['timestamp']}")
print(f"  Change: {event2['old_replicas']} → {event2['new_replicas']} pods")
print(f"  Reason: {event2['reason']}")
print(f"  Immutable: {event2['immutable']}")
print("")

# Expected: All events logged with complete audit trail for compliance
print("✓ Audit trail enables compliance verification and cost attribution")

## Section 7: Cost Attribution and Chargeback

**CFO Requirement:** "Which tenants are driving costs? Who do we bill for the 150 pods during Black Friday?"

**Cost Model:**
- Pod cost: ₹2,000/pod/month (includes compute, storage, network)
- Actual cost = avg_replicas × pod_cost_per_month
- Peak cost = max_replicas × pod_cost_per_month (for capacity planning)

**Chargeback report includes:**
- Average replicas over reporting period
- Peak replicas reached
- Actual cost (based on average usage)
- Budget comparison (under/over budget)
- Variance percentage

**Example:**
- Finance tenant: Avg 12 pods, peak 28 pods
- Cost: 12 × ₹2,000 = ₹24,000/month
- Budget: ₹30,000/month
- Variance: -₹6,000 (20% under budget) ✓
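
The arithmetic in the example can be checked with a short sketch. `cost_summary` and its dictionary keys are hypothetical and are not the module's `generate_cost_report`; the ₹2,000 pod cost and the Finance figures come from the text above.

```python
POD_COST_PER_MONTH = 2000  # INR per pod per month, from the cost model above


def cost_summary(avg_replicas: float, peak_replicas: int, budget: float) -> dict:
    """Apply the cost model: actual cost from average usage, peak cost for planning."""
    actual_cost = avg_replicas * POD_COST_PER_MONTH
    peak_cost = peak_replicas * POD_COST_PER_MONTH
    variance = actual_cost - budget  # negative means under budget
    return {
        "actual_cost": actual_cost,
        "peak_cost": peak_cost,
        "variance": variance,
        "variance_pct": variance * 100 / budget,
        "under_budget": actual_cost <= budget,
    }


summary = cost_summary(avg_replicas=12, peak_replicas=28, budget=30000)
print(summary["actual_cost"])   # 24000
print(summary["peak_cost"])     # 56000
print(summary["variance"])      # -6000
print(summary["variance_pct"])  # -20.0
```

Billing on the average while reporting the peak separately keeps chargeback tied to actual usage and still gives capacity planners the worst-case number.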

In [None]:
# Generate cost reports for multiple tenants

tenants_usage = [
    {"id": "finance_corp", "avg": 12.0, "peak": 28, "budget": 30000},
    {"id": "media_agency", "avg": 6.5, "peak": 14, "budget": 15000},
    {"id": "retail_chain", "avg": 8.0, "peak": 15, "budget": 20000},
    {"id": "legal_firm", "avg": 5.5, "peak": 12, "budget": 12000},
    {"id": "startup_pilot", "avg": 2.0, "peak": 4, "budget": 5000},
]

print("Monthly Cost Reports (November 2025):")
print("=" * 90)
print(
    f"{'Tenant':<18} {'Avg Pods':<10} {'Peak':<6} {'Actual Cost':<15} "
    f"{'Budget':<15} {'Status'}"
)
print("=" * 90)

total_cost = 0
total_budget = 0

for tenant in tenants_usage:
    report = generate_cost_report(
        tenant_id=tenant["id"],
        avg_replicas=tenant["avg"],
        peak_replicas=tenant["peak"],
        budget=tenant["budget"]
    )
    
    status = "✓ Under" if report["under_budget"] else "⚠ Over"
    
    print(
        f"{tenant['id']:<18} "
        f"{tenant['avg']:<10.1f} "
        f"{tenant['peak']:<6} "
        f"₹{report['actual_cost']:>12,.0f} "
        f"₹{tenant['budget']:>12,.0f} "
        f"{status}"
    )
    
    total_cost += report["actual_cost"]
    total_budget += tenant["budget"]

print("=" * 90)
print(
    f"{'TOTAL':<18} {'':<10} {'':<6} "
    f"₹{total_cost:>12,.0f} "
    f"₹{total_budget:>12,.0f} "
    f"{'✓ Under' if total_cost <= total_budget else '⚠ Over'}"
)
savings = total_budget - total_cost
savings_pct = (savings / total_budget) * 100
print("")
print(f"Total savings: ₹{savings:,.0f} ({savings_pct:.1f}% under budget)")
print("")

# Expected: All tenants under budget due to auto-scaling efficiency
print("✓ Transparent cost attribution enables fair chargeback to business units")

## Section 8: Putting It All Together - Complete Workflow

**End-to-end auto-scaling workflow:**

1. **Query arrives** → Added to tenant-specific queue
2. **Queue depth increases** → Prometheus metric updated
3. **Prometheus scrapes** (every 15s) → Metrics collected
4. **HPA reads metrics** → Calculates target replicas
5. **Resource quota validated** → Ensure within tier limits
6. **Deployment scaled** → Pods created/terminated
7. **Audit event logged** → Compliance trail recorded
8. **Cost attributed** → Usage tracked for chargeback

**Timeline for scale-up:**
- T+0s: Spike begins
- T+15s: Prometheus scrapes
- T+30s: HPA calculates
- T+60s: Pods scheduled
- T+90s: Pods ready
- **Total: 90-120 seconds**
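
Step 4 ("HPA reads metrics") can be approximated with the scaling rule documented for the Kubernetes HPA: `desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below is a simplification that ignores not-ready pods, missing metrics, and stabilization windows; the function name is hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_avg_metric: float,
                         target_avg_metric: float, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale by the ratio of observed to target metric."""
    ratio = current_avg_metric / target_avg_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling action
    return math.ceil(current_replicas * ratio)


# 150 queued queries spread over 5 pods -> average 30 per pod against a target of 10
print(hpa_desired_replicas(5, current_avg_metric=30, target_avg_metric=10))     # 15
# Average 10.5 per pod is within the 10% tolerance, so replicas stay at 15
print(hpa_desired_replicas(15, current_avg_metric=10.5, target_avg_metric=10))  # 15
# Load drops to an average of 5 per pod -> 7.5 rounds up to 8
print(hpa_desired_replicas(15, current_avg_metric=5, target_avg_metric=10))     # 8
```

The result is then clamped to the HPA's configured minimum and maximum replicas, which is where the tier limits from Section 2 come in.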

In [None]:
# Simulate complete auto-scaling workflow

async def complete_autoscaling_workflow():
    """Demonstrate full workflow from query spike to scaled infrastructure"""
    
    print("Complete Auto-Scaling Workflow Simulation")
    print("=" * 70)
    print("")
    
    # Step 1: Initialize tenant (Premium tier)
    tenant_id = "finance_corp"
    tier = TenantTier.PREMIUM
    policy = GCCAutoScalingPolicy(tier)
    current_replicas = policy.get_scaling_config().min_replicas
    
    print(f"Step 1: Initial state")
    print(f"  Tenant: {tenant_id} ({tier.value})")
    print(f"  Current replicas: {current_replicas}")
    print(f"  Queue depth: 0")
    print("")
    
    # Step 2: Simulate traffic spike (150 queries)
    print(f"Step 2: Traffic spike - 150 queries arrive")
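    # Explicit capacity above the 150-query spike so no enqueue is rejected
    # (Section 3 describes a 100-query cap per tenant)
    queue_mgr = TenantQueueManager(max_queue_size=200)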
    for i in range(150):
        await queue_mgr.enqueue(tenant_id, {"query": f"query_{i}"})
    
    queue_depth = queue_mgr.get_queue_depth(tenant_id)
    print(f"  Queue depth: {queue_depth}")
    print("")
    
    # Step 3: Calculate target replicas
    print(f"Step 3: HPA calculates target replicas")
    target_replicas = policy.calculate_target_replicas(queue_depth)
    print(f"  Calculation: ceil({queue_depth} queries / 10 per pod) = {-(-queue_depth // 10)}")
    print(f"  Target replicas: {target_replicas}")
    print("")
    
    # Step 4: Validate resource quota
    print(f"Step 4: Validate resource quota")
    valid, message = validate_resource_quota(
        tier,
        requested_replicas=target_replicas,
        total_cluster_capacity=100
    )
    print(f"  Valid: {valid}")
    print(f"  Message: {message}")
    print("")
    
    # Step 5: Log scaling event
    print(f"Step 5: Log scaling event (compliance audit)")
    event = log_scale_event(
        tenant_id=tenant_id,
        old_replicas=current_replicas,
        new_replicas=target_replicas,
        reason=f"queue_depth_{queue_depth}_exceeded_threshold"
    )
    print(f"  Timestamp: {event['timestamp']}")
    print(f"  Change: {current_replicas} → {target_replicas} pods")
    print("")
    
    # Step 6: Simulate query processing (drain queue)
    print(f"Step 6: Process queries and drain queue")
    processed = 0
    while queue_mgr.get_queue_depth(tenant_id) > 0 and processed < 20:
        await queue_mgr.dequeue(tenant_id)
        processed += 1
    
    remaining = queue_mgr.get_queue_depth(tenant_id)
    print(f"  Processed: {processed} queries")
    print(f"  Remaining in queue: {remaining}")
    print(f"  (In production: {target_replicas} pods would process all {queue_depth} queries)")
    print("")
    
    # Step 7: Generate cost report
    print(f"Step 7: Generate cost attribution report")
    report = generate_cost_report(
        tenant_id=tenant_id,
        avg_replicas=float(target_replicas),
        peak_replicas=target_replicas,
        budget=30000.0
    )
    print(f"  Average replicas: {report['avg_replicas']}")
    print(f"  Monthly cost: ₹{report['actual_cost']:,.0f}")
    print(f"  Budget: ₹{report['budget']:,.0f}")
    print(f"  Status: {'✓ Under budget' if report['under_budget'] else '⚠ Over budget'}")
    print("")
    
    print("=" * 70)
    print("✓ Complete workflow executed successfully")
    print("")
    print("Key takeaways:")
    print("  - Per-tenant queue depth drives scaling (not global CPU)")
    print("  - Resource quotas prevent tenant monopoly")
    print("  - Audit trail ensures compliance (SOX/DPDPA)")
    print("  - Cost attribution enables fair chargeback")
    print("  - 30-45% cost savings vs fixed capacity")

# Run complete workflow
await complete_autoscaling_workflow()

## Summary and Next Steps

**You've learned:**
- ✅ How Kubernetes HPA uses per-tenant queue depth (not CPU) to drive scaling decisions
- ✅ How tier-based scaling policies (Premium/Standard/Free) enforce fair resource allocation
- ✅ How resource quotas prevent tenant monopoly (at most 40% of the cluster, for Premium tenants)
- ✅ How compliance audit trails satisfy SOX/DPDPA requirements
- ✅ How cost attribution enables CFO chargeback to business units

**Production-ready skills:**
- Deploy auto-scaling multi-tenant RAG infrastructure
- Reduce costs by 30-45% vs fixed capacity
- Maintain 99.9%+ SLA compliance
- Produce the audit evidence needed for GCC compliance reviews (SOX, DPDPA)

**Next module:** M13.3 - Cost Attribution & Chargeback Systems
- Build usage metering service tracking pod-hours, query counts, storage
- Implement cost calculation engine with pricing tiers and volume discounts
- Generate monthly chargeback reports for CFO
- Detect cost anomalies (>50% spikes)

**Before M13.3:**
- Complete PractaThon Mission M13 (configure HPA for 10 tenants)
- Run load test: Premium tenant 10× spike, verify others unaffected
- Generate cost report proving 30%+ savings vs fixed capacity

**Great work!** You're now ready to implement production-grade auto-scaling for GCC multi-tenant platforms.