# L3 M14.1: Multi-Tenant Monitoring & Observability

## Learning Arc

**Purpose:** Global dashboards lie by averaging. When 9 tenants operate at 50ms latency and 1 tenant runs at 5000ms, the platform average shows ~545ms, masking the failing tenant completely. This module teaches you to implement tenant-aware monitoring that prevents "averaging blindness" and detects individual tenant failures.

**Concepts Covered:**
- Tenant-aware metrics with Prometheus (Counter, Histogram, Gauge, Info)
- Label-based multi-tenancy pattern
- Cardinality management (preventing metric explosion)
- Drill-down Grafana dashboards
- Distributed trace context propagation
- SLA budget tracking per tenant
- Resource monopolization detection
- Per-tenant alerting and routing

**After Completing This Notebook:**
- You will understand how global averages hide individual tenant failures
- You can implement per-tenant metrics using Prometheus labels
- You will recognize cardinality explosion risks and mitigation strategies
- You can build drill-down dashboards for tenant isolation
- You will track SLA budgets and fire per-tenant alerts
- You can detect resource monopolization ("noisy neighbor" problem)
- You will propagate tenant context in distributed traces

**Context in Track L3.M14:**
This module builds on **L3 M13 (Multi-Tenant Architecture Patterns)** and prepares you for **L3 M14.2 (Incident Response & Runbooks)**.

## Environment Setup

In [None]:
import os
import sys
import time
from datetime import datetime

# Add src to path for imports
if './src' not in sys.path:
    sys.path.insert(0, './src')

# OFFLINE mode for L3 consistency
OFFLINE = os.getenv("OFFLINE", "false").lower() == "true"

# Prometheus detection
PROMETHEUS_ENABLED = os.getenv("PROMETHEUS_ENABLED", "false").lower() == "true"

if OFFLINE or not PROMETHEUS_ENABLED:
    print("⚠️ Running in OFFLINE/PROMETHEUS_DISABLED mode")
    print("   → Metrics will be stored in-memory")
    print("   → Set PROMETHEUS_ENABLED=true in .env to enable Prometheus server")
else:
    print("✅ Online mode - Prometheus metrics enabled")

# Import our monitoring module
from l3_m14_monitoring_observability import (
    start_query_tracking,
    end_query_tracking,
    track_query,
    update_quota_usage,
    get_tenant_metrics,
    TenantMetricsCollector
)

print("\n✅ Imports successful")

## 1. Introduction & The Hook Problem (2-3 min)

### The Averaging Blindness Problem

Imagine you're monitoring a multi-tenant RAG platform with 50 tenants. Your dashboard shows:
- **Platform Average Latency:** ~149ms
- **Platform Success Rate:** ~99.6%
- **Platform CPU:** 60%

Everything looks healthy! ✅

But here's the reality:
- **49 tenants:** 50ms latency, 100% success rate
- **1 tenant (Finance):** 5000ms latency, 80% success rate

The arithmetic: (49 × 50ms + 5000ms) / 50 = 149ms, and, assuming equal traffic per tenant, (49 × 100% + 80%) / 50 = 99.6%.

**The Problem:** Global averages completely hide the failing tenant.

**The Impact:**
- Finance team's SLA is being violated
- You discover the issue 45 minutes later via an angry email
- By then, the tenant has lost trust in the platform

In [None]:
# Demonstration: How averaging hides outliers

# Simulate 50 tenants
healthy_tenants = [50] * 49  # 49 tenants at 50ms
failing_tenant = [5000]      # 1 tenant at 5000ms

all_latencies = healthy_tenants + failing_tenant
platform_average = sum(all_latencies) / len(all_latencies)

print("🔍 Multi-Tenant Latency Analysis")
print("=" * 50)
print(f"Healthy tenants (49): {healthy_tenants[0]}ms each")
print(f"Failing tenant (1):   {failing_tenant[0]}ms")
print(f"\n📊 Platform Average: {platform_average:.1f}ms")
print("\n❌ Problem: The 5000ms outlier is completely hidden!")
print("   → Finance team experiences terrible performance")
# Report the computed average instead of a hard-coded figure
print(f"   → Dashboard shows 'healthy' {platform_average:.0f}ms average")
print("   → SLA violation goes undetected for 45+ minutes")

# Expected: Shows a 149.0ms average hiding the 5000ms outlier

## 2. Conceptual Foundation (4-5 min)

### The Label-Based Multi-Tenancy Pattern

Instead of one metric per tenant (which doesn't scale), Prometheus uses **labels** to add tenant context to metrics:

```prometheus
# Single metric definition serves ALL tenants
rag_queries_total{tenant_id="finance", status="success"} 10250
rag_queries_total{tenant_id="marketing", status="success"} 8900
```

**Key Concepts:**

1. **Counter:** One-directional tracking (queries processed, errors occurred)
2. **Histogram:** Distribution measurement (latency buckets, token counts)
3. **Gauge:** Up/down values (active queries, quota usage percentage)
4. **Info:** Static metadata (tenant name, tier, region)
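
To make the label pattern concrete, here is a minimal in-memory sketch of how one metric name plus a combination of label values identifies a time series. It is not the `prometheus_client` API and not this module's implementation; `LabeledCounter` is a hypothetical name used only for illustration.

```python
from collections import defaultdict


class LabeledCounter:
    """Hypothetical in-memory counter: one series per unique label combination."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._series = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        if amount < 0:
            raise ValueError("counters only go up")
        # The ordered label values form the key of the time series.
        key = tuple(labels[name] for name in self.label_names)
        self._series[key] += amount

    def value(self, **labels):
        key = tuple(labels[name] for name in self.label_names)
        return self._series.get(key, 0.0)

    def series_count(self):
        return len(self._series)


queries = LabeledCounter("rag_queries_total", ["tenant_id", "status"])
queries.inc(tenant_id="finance", status="success")
queries.inc(tenant_id="finance", status="success")
queries.inc(tenant_id="marketing", status="error")

print(queries.value(tenant_id="finance", status="success"))  # 2.0
print(queries.series_count())                                # 2
```

One metric definition serves every tenant; a new series appears only when a new label combination is first observed.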

### Cardinality Management

**The Rule:** Limit label cardinality to <1000 unique values per label

✅ **Safe:** `{tenant_id}` with 50 tenants → 50 values  
❌ **Unsafe:** `{tenant_id, user_id, query_id}` → 50 × 10K × 100K = 50 billion time series

**Result of explosion:** Prometheus runs out of memory, queries time out, alerting fails
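
One possible mitigation is to cap the number of distinct values a label may take before the metric is recorded. The sketch below is an assumption-laden illustration, not part of this module: `LabelGuard`, its `max_values` parameter, and the `"other"` overflow bucket are hypothetical names.

```python
class LabelGuard:
    """Hypothetical guard that caps the distinct values a label may take."""

    def __init__(self, max_values=1000, overflow_value="other"):
        self.max_values = max_values
        self.overflow_value = overflow_value
        self._seen = set()

    def normalize(self, value):
        # Values already admitted keep their own series.
        if value in self._seen:
            return value
        # Admit new values only while there is room under the cap.
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        # Everything past the cap collapses into one overflow series.
        return self.overflow_value


guard = LabelGuard(max_values=2)
labels = [guard.normalize(t) for t in ["finance", "marketing", "finance", "engineering"]]
print(labels)  # ['finance', 'marketing', 'finance', 'other']
```

The trade-off is that tenants landing in the overflow bucket lose per-tenant visibility, so a cap like this works best as a safety net with its own alert rather than as normal operation.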

In [None]:
# Demonstration: Cardinality calculation

# Safe approach: tenant_id only
safe_tenants = 50
safe_statuses = 2  # success, error
safe_cardinality = safe_tenants * safe_statuses

# Unsafe approach: adding high-cardinality labels
unsafe_tenants = 50
unsafe_users = 10000
unsafe_queries = 100000
unsafe_cardinality = unsafe_tenants * unsafe_users * unsafe_queries

print("🔢 Cardinality Analysis")
print("=" * 60)
print("\n✅ SAFE: {tenant_id, status}")
print(f"   {safe_tenants} tenants × {safe_statuses} statuses = {safe_cardinality:,} time series")
print("   → Prometheus handles this easily")
print("\n❌ UNSAFE: {tenant_id, user_id, query_id}")
print(f"   {unsafe_tenants} × {unsafe_users} × {unsafe_queries} = {unsafe_cardinality:,} time series")
print("   → Prometheus OOM (Out of Memory)")
print("   → Query timeouts")
print("   → Alerting failures")
print("\n💡 Solution: Use logging for high-cardinality data (user_id, query_id)")

# Expected: Shows 100 vs 50 billion series comparison

## 3. Technical Implementation (12-15 min)

### Pattern 1: Start/End Query Tracking

The most common pattern: Track query lifecycle with context propagation
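
A common hazard with start/end tracking is forgetting the end call when the query raises. One way to guard against that, sketched here under assumptions, is a context manager that always closes the tracking context. `tracked_query`, `fake_start`, and `fake_end` are hypothetical names; the stand-ins only mimic the `tenant_id` and `start_time` keys shown in this notebook and omit the `docs_retrieved` and `llm_tokens` arguments.

```python
import time
from contextlib import contextmanager


@contextmanager
def tracked_query(tenant_id, start_fn, end_fn):
    """Run a block under tracking; end_fn is always called with the outcome."""
    context = start_fn(tenant_id)
    try:
        yield context
    except Exception:
        # Record the failure, then let the caller see the original error.
        end_fn(context, status="error")
        raise
    else:
        end_fn(context, status="success")


# Stand-in tracking functions so the sketch runs on its own.
recorded = []


def fake_start(tenant_id):
    return {"tenant_id": tenant_id, "start_time": time.time()}


def fake_end(context, status):
    recorded.append((context["tenant_id"], status))


with tracked_query("finance-team", fake_start, fake_end):
    pass  # the RAG query would run here

try:
    with tracked_query("engineering-team", fake_start, fake_end):
        raise RuntimeError("retrieval failed")
except RuntimeError:
    pass

print(recorded)  # [('finance-team', 'success'), ('engineering-team', 'error')]
```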

In [None]:
# Pattern 1: Start/End tracking for finance team query

print("📊 Pattern 1: Start/End Query Tracking")
print("=" * 60)

# Start tracking
context = start_query_tracking("finance-team")
print(f"\n✅ Started tracking for: {context['tenant_id']}")
print(f"   Start time: {context['start_time']}")

# Simulate query execution
print("\n⏳ Simulating RAG query execution...")
time.sleep(0.5)  # Simulate 500ms query

# End tracking with metrics
end_query_tracking(
    context,
    status="success",
    docs_retrieved=5,
    llm_tokens=1200
)

print("\n✅ Query completed and metrics recorded:")
print("   - Query counter incremented (finance-team, success)")
print("   - Duration histogram updated (~0.5s)")
print("   - Active queries gauge decremented")
print("   - Docs retrieved: 5")
print("   - LLM tokens: 1200")

# Expected: Shows tracking lifecycle with tenant isolation

### Pattern 2: Unified Track Query

For retroactive tracking or log-based metric ingestion
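
As a rough illustration of log-based ingestion, the sketch below replays completed-query events from JSON lines into a tracking callable. The log format and the `backfill_from_log` helper are assumptions made for this example; the keyword names mirror the `track_query` call used in this notebook, and a lambda stands in for it so the block runs on its own.

```python
import json


def backfill_from_log(lines, track_fn):
    """Replay completed-query events from JSON lines; returns (tracked, skipped)."""
    tracked = skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
            track_fn(
                tenant_id=event["tenant_id"],
                status=event["status"],
                duration=float(event["duration"]),
                docs_retrieved=int(event.get("docs_retrieved", 0)),
                llm_tokens=int(event.get("llm_tokens", 0)),
            )
            tracked += 1
        except (ValueError, KeyError, TypeError):
            # Malformed or incomplete events are counted, not silently dropped.
            skipped += 1
    return tracked, skipped


sample_log = [
    '{"tenant_id": "marketing-team", "status": "success", "duration": 1.5, "docs_retrieved": 3, "llm_tokens": 800}',
    "not json",
    '{"tenant_id": "finance-team", "status": "error", "duration": 0.3}',
]

calls = []
result = backfill_from_log(sample_log, lambda **kwargs: calls.append(kwargs))
print(result)  # (2, 1)
```

Returning the skipped count makes it easy to notice when a backfill silently loses events for one tenant.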

In [None]:
# Pattern 2: Unified tracking for marketing team

print("\n📊 Pattern 2: Unified Track Query")
print("=" * 60)

# Track completed query in one call
track_query(
    tenant_id="marketing-team",
    status="success",
    duration=1.5,
    docs_retrieved=3,
    llm_tokens=800
)

print("\n✅ Query tracked for marketing-team:")
print("   Duration: 1.5s")
print("   Docs: 3")
print("   Tokens: 800")
print("   Status: success")
print("\n💡 Use case: Backfilling metrics from event logs")

# Expected: Single-call tracking for simplified usage

### Pattern 3: Quota Usage Tracking

Monitor resource consumption against tenant limits
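
The percentage passed to the quota gauge has to be derived from raw usage somewhere. A minimal sketch of that arithmetic follows; `quota_percent`, `quota_state`, and the 90% warning threshold are hypothetical helpers chosen for illustration, not functions from this module.

```python
def quota_percent(used, limit):
    """Convert raw usage into the percentage a quota gauge would report."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * used / limit


def quota_state(percent, warn_at=90.0):
    """Classify usage so alerts can fire before the hard limit is hit."""
    if percent >= 100.0:
        return "exhausted"
    if percent > warn_at:
        return "warning"
    return "ok"


finance_queries = quota_percent(7_500, 10_000)       # 75.0
marketing_tokens = quota_percent(250_000, 500_000)   # 50.0

print(finance_queries, quota_state(finance_queries))
print(marketing_tokens, quota_state(marketing_tokens))
print(quota_state(quota_percent(9_500, 10_000)))     # warning
```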

In [None]:
# Pattern 3: Quota usage tracking

print("\n📊 Pattern 3: Quota Usage Tracking")
print("=" * 60)

# Finance team has used 7,500 of 10,000 monthly queries
update_quota_usage("finance-team", "queries", 75.0)

# Marketing team has used 250,000 of 500,000 tokens
update_quota_usage("marketing-team", "tokens", 50.0)

print("\n✅ Quota metrics updated:")
print("   finance-team: 75% query quota used")
print("   marketing-team: 50% token quota used")
print("\n💡 Use case: Alert when usage > 90% before hitting hard limit")

# Expected: Quota gauge metrics recorded per tenant

### Multi-Tenant Isolation Demonstration

Simulate multiple tenants with different performance characteristics

In [None]:
# Simulate queries from 3 different tenants

print("\n🏢 Multi-Tenant Simulation")
print("=" * 60)

# Finance team: High-performing tenant
print("\n💼 Finance Team (Premium):")
for _ in range(3):
    track_query("finance-team", "success", 0.8, 5, 1200)
print("   → 3 queries, ~0.8s each, 100% success")

# Marketing team: Standard tenant
print("\n📣 Marketing Team (Standard):")
for _ in range(2):
    track_query("marketing-team", "success", 1.5, 3, 800)
print("   → 2 queries, ~1.5s each, 100% success")

# Engineering team: Experiencing issues
print("\n⚙️ Engineering Team (Free):")
track_query("engineering-team", "error", 0.3, 0, 0)
track_query("engineering-team", "success", 2.5, 2, 500)
print("   → 2 queries: 1 error + 1 slow success")

print("\n📊 Each tenant's metrics are ISOLATED")
print("   → Finance's performance doesn't affect Marketing")
print("   → Engineering's errors are visible per-tenant")
print("   → No averaging blindness!")

# Expected: Shows isolated tracking for each tenant

### Retrieving Tenant Metrics

Query aggregated metrics for a specific tenant

In [None]:
# Retrieve metrics for each tenant

print("\n📈 Retrieving Per-Tenant Metrics")
print("=" * 60)

tenants = ["finance-team", "marketing-team", "engineering-team"]

for tenant in tenants:
    metrics = get_tenant_metrics(tenant)
    print(f"\n{tenant}:")
    print(f"   Total queries: {metrics.get('total_queries', 0)}")

    if 'success_count' in metrics:
        print(f"   Success: {metrics['success_count']}")
        print(f"   Errors: {metrics['error_count']}")
        print(f"   Avg duration: {metrics.get('avg_duration_seconds', 0):.3f}s")

print("\n💡 With Prometheus enabled, query directly:")
print("   rate(rag_queries_total{tenant_id=\"finance-team\"}[5m])")

# Expected: Shows per-tenant aggregated metrics

## 4. Common Failures & Debugging (4-5 min)

### Failure Scenario 1: Averaging Blindness

**Problem:** Platform shows ~99.6% success while one tenant sits at 80% (the other 49 at 100%)

**Detection:**
```promql
# Per-tenant success rate
sum(rate(rag_queries_total{status="success"}[5m])) by (tenant_id)
/
sum(rate(rag_queries_total[5m])) by (tenant_id)
```

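**Alert:** Fire when the per-tenant ratio above drops below `0.95`. If the ratio is stored under a recording rule (hypothetically named `rag_success_rate`), the alert expression is `rag_success_rate < 0.95`.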
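
The same comparison can be worked through offline. This sketch uses made-up request counts (1,000 requests per tenant, an assumption) to show how a healthy-looking platform rate coexists with one tenant breaching a 95% threshold; `success_rates` is a hypothetical helper.

```python
def success_rates(counts):
    """counts: {tenant_id: (successes, total)} -> {tenant_id: success rate}."""
    return {tenant: ok / total for tenant, (ok, total) in counts.items() if total > 0}


# 49 healthy tenants plus one failing tenant, equal traffic (assumed).
counts = {f"tenant-{i:02d}": (1000, 1000) for i in range(49)}
counts["finance-team"] = (800, 1000)

platform_rate = sum(ok for ok, _ in counts.values()) / sum(n for _, n in counts.values())
rates = success_rates(counts)
breaching = sorted(tenant for tenant, rate in rates.items() if rate < 0.95)

print(f"platform success rate: {platform_rate:.3f}")  # 0.996
print(f"tenants below 95%: {breaching}")              # ['finance-team']
```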

### Failure Scenario 2: Resource Monopolization

**Problem:** One tenant consuming 40% CPU, slowing 49 others

**Detection:**
```promql
# rag_active_queries is a gauge, so sum it directly instead of applying rate()
sum(rag_active_queries) by (tenant_id) > 100
```

**Fix:** Implement per-tenant rate limiting
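
One common way to implement per-tenant rate limiting is a token bucket kept separately for each tenant, so a noisy tenant exhausts only its own allowance. The sketch below is illustrative: `TenantRateLimiter` and its parameters are hypothetical, and the clock is injectable so the behavior can be checked deterministically.

```python
import time


class TenantRateLimiter:
    """Hypothetical per-tenant token bucket; each tenant has an independent bucket."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._buckets = {}  # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id):
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (self.burst, now))
        # Refill in proportion to elapsed time, never above the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[tenant_id] = (tokens - 1, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False


fake_now = [0.0]
limiter = TenantRateLimiter(rate_per_sec=1, burst=2, clock=lambda: fake_now[0])

noisy = [limiter.allow("noisy-tenant") for _ in range(3)]
quiet = limiter.allow("quiet-tenant")
fake_now[0] = 1.0  # one second later, one token has been refilled
after_refill = limiter.allow("noisy-tenant")

print(noisy)         # [True, True, False]
print(quiet)         # True
print(after_refill)  # True
```

Rejections from a limiter like this are themselves worth counting per tenant, since a rising rejection count is a direct signal of monopolization attempts.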

### Failure Scenario 3: Latency Masking

**Problem:** A global average of ~149ms (49 tenants at 50ms, 1 at 5000ms) hides the 5000ms outlier

**Detection:**
```promql
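# Aggregate buckets per tenant (keeping le) so each tenant gets its own p99
histogram_quantile(0.99, sum by (le, tenant_id) (rate(rag_query_duration_seconds_bucket[5m])))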
```

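**Alert:** Fire when any tenant's p99 from the query above exceeds 5 seconds (append `> 5` to the expression; the duration metric is in seconds).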
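
To see what a bucket-based p99 means per tenant, here is a simplified Python sketch of quantile estimation from cumulative histogram buckets using linear interpolation within the matching bucket. It approximates the idea behind `histogram_quantile` and omits its edge cases; the bucket boundaries and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with float("inf") as the last bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


inf = float("inf")
healthy = [(0.1, 100), (1.0, 100), (5.0, 100), (10.0, 100), (inf, 100)]
finance = [(0.1, 0), (1.0, 0), (5.0, 80), (10.0, 100), (inf, 100)]

print(f"healthy p99: {histogram_quantile(0.99, healthy):.3f}s")  # 0.099s
print(f"finance p99: {histogram_quantile(0.99, finance):.3f}s")  # 9.750s
```

Because the estimate is interpolated, its accuracy depends on how well the bucket boundaries bracket the latencies that matter for the alert threshold.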

### Failure Scenario 4: Cardinality Explosion

**Problem:** Labeling by `{tenant_id, user_id, query_id}` creates billions of series

**Solution:** Limit labels to low-cardinality dimensions such as `tenant_id` (and `status`); use logging for high-cardinality data such as `user_id` and `query_id`

### Failure Scenario 5: SLA Violation Detection Delay

**Problem:** Without per-tenant error budgets, discovering a violation takes 45+ minutes

**Solution:** Per-tenant alerts fire within 3 minutes. For example, an early warning when a tenant approaches its query quota:
```promql
rag_quota_usage_percent{resource_type="queries"} > 90
```
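
Error-budget tracking itself reduces to simple arithmetic per tenant. The sketch below shows one way to compute it; the SLO targets per tier and the request counts are illustrative assumptions, and `error_budget` is a hypothetical helper rather than part of this module.

```python
# Illustrative SLO targets per tier (assumed values, not from this module).
SLO_BY_TIER = {"premium": 0.999, "standard": 0.99, "free": 0.95}


def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining, fraction_consumed) for one tenant."""
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    consumed = failed_requests / allowed if allowed > 0 else float("inf")
    return allowed, remaining, consumed


# Finance (premium tier) at 80% success over 10,000 requests.
allowed, remaining, consumed = error_budget(SLO_BY_TIER["premium"], 10_000, 2_000)
print(f"allowed failures: {allowed:.1f}")
print(f"remaining budget: {remaining:.1f}")
print(f"budget consumed:  {consumed:.0f}x")

# A standard-tier tenant with 20 failures in 10,000 requests is within budget.
_, std_remaining, std_consumed = error_budget(SLO_BY_TIER["standard"], 10_000, 20)
print(f"standard tenant consumed {std_consumed:.0%} of its budget")
```

A tenant whose consumed fraction exceeds 1 has already violated its SLO for the window, which is the condition a per-tenant alert should catch early.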

## 5. Decision Card (3-4 min)

### When to Use Multi-Tenant Monitoring

**✅ Use this pattern when:**
- You have 10+ tenants with different SLA targets
- Individual tenant failures are hidden by platform averages
- You need to attribute costs (tokens, queries) per tenant
- Regulatory compliance requires tenant data isolation
- You're experiencing "noisy neighbor" resource contention
- You need drill-down capabilities from platform → tenant → user

**❌ When NOT to use:**
- Single-tenant system (use standard Prometheus metrics)
- Fewer than 5 tenants (overhead exceeds benefit)
- All tenants have identical SLAs and resource limits
- You can't manage label cardinality (risk of metric explosion)
- Your monitoring infrastructure can't handle 50-500 time series per tenant

### Trade-offs

| Aspect | Single-Tenant Monitoring | Multi-Tenant Monitoring |
|--------|-------------------------|-------------------------|
| **Cost** | Low (1 metric set) | Medium (50 tenants × 10 metrics = 500 series) |
| **Latency** | No overhead | Minimal (label filtering in PromQL) |
| **Complexity** | Simple (one dashboard) | Higher (per-tenant dashboards + cardinality mgmt) |
| **Visibility** | Global only | Global + per-tenant drill-down |
| **SLA Detection** | Slow (45+ minutes) | Fast (3 minutes per-tenant alerts) |
| **Cardinality Risk** | None | Must manage to <1000 unique values per label |

## 6. Conclusion & Next Steps (2 min)

### What You Learned

✅ **The Problem:** Global averages hide individual tenant failures ("averaging blindness")

✅ **The Solution:** Label-based multi-tenancy with Prometheus
- Counter: Track query counts per tenant
- Histogram: Measure latency distributions per tenant
- Gauge: Monitor active queries and quota usage
- Info: Store tenant metadata

✅ **Key Patterns:**
1. Start/End tracking with context propagation
2. Unified tracking for retroactive metric ingestion
3. Quota usage monitoring
4. Per-tenant alerting

✅ **Critical Rules:**
- Limit label cardinality to <1000 unique values per label
- Keep labels low-cardinality (`tenant_id`, `status`); log high-cardinality data (`user_id`, `query_id`)
- Alert per-tenant to detect SLA violations within 3 minutes

### Production Checklist

Before deploying to production:

- [ ] Prometheus server configured with 15-day retention
- [ ] Grafana dashboards created with per-tenant drill-down
- [ ] AlertManager rules configured for each tenant
- [ ] Cardinality limits enforced (<1000 unique values per label)
- [ ] Alert routing configured to tenant-specific channels
- [ ] OpenTelemetry instrumentation adds `tenant_id` to traces
- [ ] SLA budgets defined per tenant tier (free, standard, premium)

### Next Modules

- **L3 M14.2:** Incident Response & Runbooks
- **L3 M14.3:** Cost Attribution & Chargeback
- **L3 M14.4:** Capacity Planning & Forecasting

### Additional Resources

- [Prometheus Best Practices](https://prometheus.io/docs/practices/naming/)
- [Grafana Multi-Tenancy](https://grafana.com/docs/grafana/latest/administration/datasource-management/)
- [Cardinality Management](https://www.robustperception.io/cardinality-is-key)

---

**🎓 Congratulations!** You now understand how to implement tenant-aware monitoring and avoid the "averaging blindness" problem that plagues multi-tenant systems.

In [None]:
# Final Summary: The Value of Per-Tenant Monitoring

print("\n🎯 Key Takeaway: Per-Tenant Visibility")
print("=" * 60)
print("\n❌ WITHOUT tenant-aware monitoring:")
print("   - Platform shows: ~149ms avg, ~99.6% success")
print("   - Reality: 1 tenant at 5000ms, 80% success")
print("   - Detection time: 45+ minutes")
print("   - Impact: Lost customer trust")
print("\n✅ WITH tenant-aware monitoring:")
print("   - Alert: finance-team p99 > 5s")
print("   - Alert: finance-team success rate < 95%")
print("   - Detection time: 3 minutes")
print("   - Impact: Proactive remediation before customer notices")
print("\n💡 The difference: Tenant isolation in metrics")
print("   → rag_queries_total{tenant_id=\"finance\"}")
print("   → rag_query_duration_seconds{tenant_id=\"finance\"}")
print("\n✅ Notebook complete! Try the FastAPI server:")
print("   ./scripts/run_api.ps1")
print("   Then visit: http://localhost:8000/docs")