# L3 M12.3: Query Isolation & Rate Limiting

## Learning Arc

**Purpose:** Learn to implement production-grade per-tenant rate limiting using a Redis-backed token bucket algorithm to prevent noisy-neighbor problems in multi-tenant RAG systems. This module demonstrates how to protect platform reliability when shared infrastructure serves multiple business units with unpredictable workloads.

**Concepts Covered:**
- **Token Bucket Algorithm** - Self-refilling rate limiter using Redis TTL (no manual refill logic)
- **Noisy Neighbor Detection** - Sliding window metrics with 3x/5x baseline thresholds
- **Automated Mitigation** - Circuit breakers and rate reductions without manual intervention
- **Multi-Tier Tenant System** - Bronze/Silver/Gold tiers with priority queuing
- **Graceful Degradation** - HTTP 429 responses with Retry-After headers
- **Fairness Guarantees** - Minimum allocation floors (10-25%) during contention
- **Low-Latency Implementation** - Sub-5ms overhead using Redis atomic operations
- **Multi-Channel Notifications** - Slack + Email alerts within 10 seconds of incidents

**After Completing This Notebook:**
- You will understand how to implement token bucket rate limiting with Redis atomic INCR operations
- You can build noisy neighbor detection that flags an offending tenant within 30 seconds using sliding window metrics
- You will design automatic mitigation strategies including circuit breakers and rate limit reduction
- You can create graceful degradation responses with HTTP 429 and Retry-After headers
- You will implement multi-tier tenant systems with priority queuing and fairness guarantees
- You can integrate Prometheus for real-time monitoring and automated alerting
- You will recognize when to use shared infrastructure with rate limiting vs dedicated resources

**Context in Track L3.M12:**
This module builds on **M12.1 (Tenant Identification)** and **M12.2 (Resource Allocation)** by adding enforcement mechanisms to prevent resource monopolization. It prepares you for **M13 (Cost Attribution)** where per-tenant usage metrics enable accurate chargeback.

## 1. Environment Setup

In [None]:
import os
import sys

# Add src to path for imports
if '../src' not in sys.path:
    sys.path.insert(0, '../src')
if '..' not in sys.path:
    sys.path.insert(0, '..')

# OFFLINE mode for L3 consistency
OFFLINE = os.getenv("OFFLINE", "true").lower() == "true"

# Service detection from environment
REDIS_ENABLED = os.getenv("REDIS_ENABLED", "false").lower() == "true"
POSTGRES_ENABLED = os.getenv("POSTGRES_ENABLED", "false").lower() == "true"
PROMETHEUS_ENABLED = os.getenv("PROMETHEUS_ENABLED", "false").lower() == "true"

if OFFLINE or not (REDIS_ENABLED or POSTGRES_ENABLED):
    print("⚠️  Running in OFFLINE mode")
    print("   → External service calls will be skipped")
    print("   → Using in-memory fallbacks for Redis and static configs for PostgreSQL")
    print("   → Set OFFLINE=false, REDIS_ENABLED=true and POSTGRES_ENABLED=true in .env to enable")
    print("")
else:
    print("✓ Online mode - external services enabled")
    print(f"  Redis: {REDIS_ENABLED}")
    print(f"  PostgreSQL: {POSTGRES_ENABLED}")
    print(f"  Prometheus: {PROMETHEUS_ENABLED}")
    print("")

## 2. Import Core Components

We'll import the core classes and their supporting types from our business logic package:

In [None]:
from src.l3_m12_data_isolation_security import (
    TenantRateLimiter,
    TenantConfigLoader,
    NoisyNeighborMitigator,
    NotificationService,
    TenantConfig,
    TenantTier,
    RateLimitResult
)

from config import CLIENTS, InMemoryRateLimiter

print("✓ Imports successful")
print(f"  Services available: {list(CLIENTS.keys())}")

## 3. Understanding the Token Bucket Algorithm

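The token bucket algorithm is the foundation of our rate limiting system. A classic token bucket refills continuously, so it smooths traffic instead of allowing the burst of up to 2x the limit that fixed-window counters admit at window boundaries.

**Key Insight:** We approximate the bucket with a per-minute Redis counter and let key TTL act as the refill, eliminating complex refill logic (the trade-off is that a short burst across a minute boundary is still possible):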
- Key: `{tenant_id}:{minute}` (e.g., `tenant_gold:123456`)
- Value: Current usage count (atomic INCR)
- TTL: 60 seconds (auto-expires, acts as refill)

**Example:**
- Tenant has 100 QPM limit
- At T=0s: First query → INCR creates key with value=1, then EXPIRE sets TTL=60s
- By T=30s: 50 queries so far → key value=50
- At T=60s: Key expires → next query creates new key (fresh bucket)

**Latency:** Sub-5ms (single Redis operation with connection pooling)
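
The counter-plus-TTL behavior described above can be sketched without a Redis server. The class below is a hypothetical in-memory stand-in, not the `TenantRateLimiter` implementation: `WindowCounterSketch`, its `clock` parameter, and the tuple return value are all assumptions made for illustration. With real Redis, the increment would be an atomic `INCR` followed by `EXPIRE` on the same key.

```python
import time
from typing import Callable, Dict, Tuple


class WindowCounterSketch:
    """In-memory stand-in for the Redis INCR + EXPIRE pattern (hypothetical helper)."""

    def __init__(self, window_seconds: int = 60,
                 clock: Callable[[], float] = time.time):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the example can simulate time passing
        # key -> (count, expiry timestamp); mimics a Redis key with a TTL
        self._store: Dict[str, Tuple[int, float]] = {}

    def _key(self, tenant_id: str, now: float) -> str:
        # Same shape as the key described above: {tenant_id}:{minute}
        return f"{tenant_id}:{int(now // self.window_seconds)}"

    def check_limit(self, tenant_id: str, limit: int) -> Tuple[bool, int]:
        now = self.clock()
        # Drop expired keys, as Redis would once their TTL elapses
        self._store = {k: v for k, v in self._store.items() if v[1] > now}
        key = self._key(tenant_id, now)
        count, expiry = self._store.get(key, (0, now + self.window_seconds))
        count += 1  # with Redis this would be a single atomic INCR
        self._store[key] = (count, expiry)
        return count <= limit, count


# Simulated clock: 7 queries in the first minute, then 1 query a minute later
fake_now = [0.0]
limiter = WindowCounterSketch(clock=lambda: fake_now[0])
results = [limiter.check_limit("tenant_demo", limit=5)[0] for _ in range(7)]
print(results)

fake_now[0] = 60.0  # the old key has expired, so the bucket is fresh
after_refill = limiter.check_limit("tenant_demo", limit=5)
print(after_refill)
```

Injecting the clock keeps the sketch deterministic; no `sleep` is needed to observe the refill.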

In [None]:
# Initialize rate limiter
rate_limiter = TenantRateLimiter(
    redis_client=CLIENTS.get("redis"),
    fallback_limiter=CLIENTS.get("in_memory_limiter")
)

print("✓ TenantRateLimiter initialized")
print(f"  Offline mode: {rate_limiter.offline}")
print(f"  Using Redis: {rate_limiter.redis is not None}")
print(f"  Using fallback: {rate_limiter.fallback is not None}")

# Test basic rate limiting
print("\nTesting rate limiter with tenant_demo (limit=5 QPM):")
for i in range(7):
    result = rate_limiter.check_limit("tenant_demo", limit=5)
    status = "✓ ALLOWED" if result.allowed else "✗ BLOCKED"
    print(f"  Query {i+1}: {status} (usage: {result.current_usage}/5)")

# Expected: First 5 allowed, next 2 blocked
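
When a query is blocked, as in the last two iterations above, the caller should receive an HTTP 429 with a `Retry-After` header. The following is a minimal sketch of how such a response might be assembled. It assumes windows are aligned to epoch minutes (consistent with the `{tenant_id}:{minute}` key scheme); `build_429_response` and the body field names are hypothetical and not part of the module's package.

```python
import math
import time
from typing import Dict, Optional, Tuple


def build_429_response(
    limit: int,
    current_usage: int,
    window_seconds: int = 60,
    now: Optional[float] = None,
) -> Tuple[int, Dict[str, str], Dict[str, object]]:
    """Return (status, headers, body) for a rate-limited request (hypothetical helper)."""
    now = time.time() if now is None else now
    # Seconds until the current window ends and the counter resets;
    # rounded up so clients never retry before the reset.
    retry_after = max(1, math.ceil(window_seconds - (now % window_seconds)))
    headers = {"Retry-After": str(retry_after)}
    body = {
        "error": "rate_limit_exceeded",  # field names are illustrative
        "limit": limit,
        "current_usage": current_usage,
        "retry_after_seconds": retry_after,
    }
    return 429, headers, body


# Example: a request blocked 30 seconds into a one-minute window
status, headers, body = build_429_response(limit=5, current_usage=6, now=90.0)
print(status, headers["Retry-After"])
```

Computing the wait from the window boundary gives well-behaved clients a precise back-off instead of a fixed guess.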

## 4. Multi-Tier Tenant System

Tenants differ in business criticality and budget, so we implement Bronze/Silver/Gold tiers:

| Tier | QPM Limit | Priority | Min Allocation % | Cost/Month (INR) | Use Case |
|------|-----------|----------|------------------|------------------|----------|
| Bronze | 100 | 1 | 10% | ₹1,500 | Internal teams, batch jobs |
| Silver | 500 | 2 | 15% | ₹5,000 | Business analytics, dashboards |
| Gold | 2000 | 3 | 25% | ₹15,000 | Customer-facing, real-time services |

**Priority Queuing:** During platform contention (>80% capacity), higher-priority tenants are served first.

**Fairness Floor:** Even during contention, each tier keeps its minimum allocation from the table above (10-25% of capacity), which prevents starvation.
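
One way to combine priority queuing with the fairness floor is a two-pass allocation: first reserve each tier's floor, then hand out the remainder in priority order. The sketch below is an illustration under stated assumptions, not the module's scheduler. The `TIERS` layout and `allocate_capacity` are hypothetical; only the priority and minimum-allocation values come from the table above.

```python
from typing import Dict

# Priority and floor values copied from the tier table; the dict layout is hypothetical
TIERS = {
    "bronze": {"priority": 1, "min_pct": 10},
    "silver": {"priority": 2, "min_pct": 15},
    "gold": {"priority": 3, "min_pct": 25},
}


def allocate_capacity(total_qpm: int, demand: Dict[str, int]) -> Dict[str, int]:
    """Split total_qpm across tiers: floors first, then by descending priority."""
    # Pass 1: guarantee each tier its floor, capped at what it actually demands
    allocation = {
        tier: min(demand.get(tier, 0), total_qpm * cfg["min_pct"] // 100)
        for tier, cfg in TIERS.items()
    }
    remaining = total_qpm - sum(allocation.values())
    # Pass 2: distribute what is left, highest priority first
    for tier in sorted(TIERS, key=lambda t: TIERS[t]["priority"], reverse=True):
        extra = min(demand.get(tier, 0) - allocation[tier], remaining)
        allocation[tier] += extra
        remaining -= extra
    return allocation


# Contention: demand (3000 QPM) far exceeds a 1000 QPM platform capacity
contended = allocate_capacity(1000, {"bronze": 400, "silver": 600, "gold": 2000})
print(contended)

# No contention: every tier simply gets what it asked for
relaxed = allocate_capacity(1000, {"bronze": 50, "silver": 50, "gold": 100})
print(relaxed)
```

Under contention, Gold absorbs all spare capacity while Bronze and Silver still receive their floors, which is the starvation guarantee the fairness floor is meant to provide.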