# Module 13: Enterprise RAG SaaS - Complete Multi-Tenant Integration

## Learning Arc

**Purpose:**  
This notebook demonstrates a production-ready, multi-tenant SaaS platform that integrates all previous L3 modules into a cohesive system. You'll build a Compliance Copilot that serves multiple paying customers with strict data isolation, flexible configuration, and accurate cost attribution.

**Concepts Covered:**
- Multi-tenant architecture patterns (namespace isolation, context propagation)
- Configuration cascade (system → tenant → query overrides)
- Resource attribution and billing (per-tenant usage tracking)
- Production failure modes and mitigation strategies
- Orchestration patterns for complex workflows

**After Completing This Notebook:**
- Understand when multi-tenancy overhead is justified (5-100 customers)
- Implement tenant context propagation across async boundaries
- Design configuration systems that balance flexibility and safety
- Identify and fix common multi-tenant failure modes
- Make informed trade-offs between simplicity and scale

**Context in Track L3.M13:**  
This capstone module synthesizes everything: query decomposition (M9), multi-agent orchestration (M10), tenant isolation (M11), usage metering (M12), and complete SaaS integration (M13).

## Setup & Environment Check

Before running the code cells below, ensure you have:
1. Installed dependencies: `pip install -r requirements.txt`
2. (Optional) Configured API keys in `.env` for live API calls

**Note:** This notebook runs in OFFLINE mode by default (no real API calls). All LLM and vector store operations are mocked for educational purposes; set `OFFLINE=false` and configure `.env` to enable live calls.

In [None]:
# Environment setup and OFFLINE mode check
import os
import sys
import json
import asyncio
from pathlib import Path

# OFFLINE mode for L3 consistency (no real API calls)
OFFLINE = os.getenv("OFFLINE", "true").lower() == "true"
if OFFLINE:
    print("⚠️  Running in OFFLINE mode — OpenAI/Pinecone calls will be skipped (mocked).")
    print("   To enable live API calls, set OFFLINE=false and configure .env with API keys.\n")
else:
    print("✓ Running in LIVE mode with real API calls.\n")

# Import from our module
from src.l3_m13_complete_saas_build import (
    ComplianceCopilotSaaS,
    ConfigManager,
    UsageTracker,
    TenantContext,
    ModelTier,
    RetrievalMode
)

print("✓ Imports successful")

## 1. Configuration Layer

The configuration cascade enables flexible settings management:
- **System defaults**: Applied to all tenants
- **Tenant defaults**: Override system settings per customer
- **Query-level overrides**: Runtime adjustments without config changes

This pattern supports A/B testing, gradual rollouts, and customer-specific optimizations.
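
The cascade can be sketched as a three-level merge. This is a minimal illustration, not the `ConfigManager` implementation: `EffectiveConfig`, `QUERY_OVERRIDABLE`, `resolve_config`, and the `top_k` field are hypothetical names, and the rule that only some settings may be overridden per query is an assumption added to show one way of keeping runtime overrides safe.

```python
from dataclasses import dataclass, replace
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class EffectiveConfig:
    """Hypothetical settings object; field names are illustrative."""
    model_tier: str = "gpt-3.5"
    retrieval_mode: str = "basic"
    top_k: int = 5


# Assumption: only low-risk settings may be changed on a per-query basis.
QUERY_OVERRIDABLE = frozenset({"retrieval_mode", "top_k"})


def resolve_config(
    system: EffectiveConfig,
    tenant_overrides: Mapping[str, Any],
    query_overrides: Optional[Mapping[str, Any]] = None,
) -> EffectiveConfig:
    """Apply tenant overrides on top of system defaults, then query overrides."""
    query_overrides = dict(query_overrides or {})
    blocked = set(query_overrides) - QUERY_OVERRIDABLE
    if blocked:
        raise ValueError(f"not overridable per query: {sorted(blocked)}")
    tenant_level = replace(system, **tenant_overrides)
    return replace(tenant_level, **query_overrides)
```

Calling `resolve_config(EffectiveConfig(), {"model_tier": "gpt-4"}, {"top_k": 10})` yields a config with the tenant's model tier and the query's `top_k`, while a query that tries to change `model_tier` is rejected.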

In [None]:
# Initialize configuration manager
config_mgr = ConfigManager()

# Set system defaults
config_mgr.set_system_defaults(
    model_tier=ModelTier.GPT35,
    retrieval_mode=RetrievalMode.BASIC
)

# Load tenant configs (simulates database lookup)
async def demo_config():
    config_acme = await config_mgr.load_tenant_config("acme_corp")
    config_beta = await config_mgr.load_tenant_config("beta_inc")
    
    print(f"ACME: {config_acme.model_tier.value}, namespace={config_acme.pinecone_namespace}")
    print(f"BETA: {config_beta.model_tier.value}, namespace={config_beta.pinecone_namespace}")
    
    # Override for premium tenant
    config_mgr.update_tenant_config("acme_corp", model_tier=ModelTier.GPT4)
    updated = await config_mgr.load_tenant_config("acme_corp")
    print(f"ACME upgraded: {updated.model_tier.value}")

await demo_config()
# Expected: 3 lines showing tenant configs and upgrade

## 2. Tenant Context Propagation

Tenant identity must flow through all async operations to ensure:
1. **Namespace isolation** in Pinecone (no cross-tenant data leakage)
2. **Accurate billing attribution** for every API call
3. **Distributed tracing** across service boundaries

The implementation uses a Python `ContextVar` to carry the tenant through async code within a single process, and OpenTelemetry baggage to carry it across service boundaries.

In [None]:
# Demonstrate context propagation
async def async_operation():
    tenant = TenantContext.get_tenant()
    print(f"  → Async task sees tenant: {tenant}")
    return tenant

async def demo_context():
    # Set tenant context
    TenantContext.set_tenant("acme_corp")
    print(f"Set context: {TenantContext.get_tenant()}")
    
    # Context propagates through async calls
    result = await async_operation()
    
    # Clean up
    TenantContext.clear_tenant()
    print(f"After clear: {TenantContext.get_tenant()}")

await demo_context()
# Expected: 3 lines showing context propagation

## 3. Orchestration Pattern

The `ComplianceCopilotSaaS` class coordinates the complete workflow:

1. **Authentication verification** (not shown - assumes upstream API gateway)
2. **Config loading** with cascade logic
3. **Component initialization** (LLM, vector store)
4. **Execution with context** (tenant ID propagated)
5. **Post-processing** for metrics/billing

This centralization enables consistent error handling, logging, and observability.
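
One way to picture steps 2 through 5 is a single coroutine with a `try`/`finally` block, so usage is recorded whether the query succeeds or fails. Everything here is a stand-in: `load_config`, `run_pipeline`, `handle_query`, `tenant_var`, and `usage_log` are hypothetical and do not reflect the internals of `ComplianceCopilotSaaS`.

```python
import asyncio
import time
from contextvars import ContextVar

tenant_var = ContextVar("tenant", default=None)
usage_log = []  # stand-in for a usage tracker


async def load_config(tenant_id):
    # Stand-in for the config cascade lookup (step 2).
    return {"model": "mock-model", "namespace": f"tenant_{tenant_id}"}


async def run_pipeline(config, query_text):
    # Stand-in for retrieval plus generation (step 3); reads tenant from context.
    return f"[{config['model']}] {tenant_var.get()}: {query_text}"


async def handle_query(tenant_id, query_text):
    config = await load_config(tenant_id)
    token = tenant_var.set(tenant_id)  # step 4: execute with tenant context
    started = time.perf_counter()
    success = False
    try:
        answer = await run_pipeline(config, query_text)
        success = True
        return {"answer": answer, "namespace": config["namespace"]}
    finally:
        tenant_var.reset(token)
        # Step 5: post-processing runs on both success and failure.
        usage_log.append({
            "tenant_id": tenant_id,
            "success": success,
            "latency_ms": (time.perf_counter() - started) * 1000,
        })
```

Keeping the bookkeeping in `finally` means a failed query still produces a usage record with `success` set to `False`, which is what makes per-tenant error rates observable.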

In [None]:
# Initialize the orchestrator
copilot = ComplianceCopilotSaaS(
    config_manager=config_mgr,
    usage_tracker=UsageTracker()
)

print("✓ ComplianceCopilotSaaS initialized")
print(f"  - Config manager: {type(copilot.config_manager).__name__}")
print(f"  - Usage tracker: {type(copilot.usage_tracker).__name__}")
print(f"  - Vector store: {type(copilot.vector_store).__name__}")

# Expected: 4 lines showing initialized components

## 4. Document Ingestion with Namespace Isolation

Each tenant's documents are stored in isolated Pinecone namespaces (`tenant_{id}`).

This ensures:
- No cross-tenant data leakage
- Independent document lifecycle management
- Per-tenant storage limit enforcement
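
The isolation property can be illustrated with an in-memory stand-in for the vector store. This is a sketch under assumptions: `InMemoryNamespacedStore` and `namespace_for` are hypothetical, substring matching replaces vector search, and the tenant ID character check is an added safeguard rather than something the module is known to do. Only the `tenant_{id}` naming comes from the text above.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumption: tenant IDs are lowercase letters, digits, and underscores.
_TENANT_ID = re.compile(r"[a-z0-9_]+")


def namespace_for(tenant_id: str) -> str:
    """Derive the namespace from the tenant ID, never from user input."""
    if not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return f"tenant_{tenant_id}"


class InMemoryNamespacedStore:
    """Toy store: every read and write is scoped to exactly one namespace."""

    def __init__(self) -> None:
        self._docs: Dict[str, List[str]] = defaultdict(list)

    def upsert(self, tenant_id: str, docs: List[str]) -> str:
        namespace = namespace_for(tenant_id)
        self._docs[namespace].extend(docs)
        return namespace

    def search(self, tenant_id: str, term: str) -> List[str]:
        namespace = namespace_for(tenant_id)
        # Substring match stands in for similarity search.
        return [d for d in self._docs.get(namespace, []) if term.lower() in d.lower()]
```

Because `search` takes a tenant ID rather than a namespace string, a caller cannot accidentally query another tenant's namespace by passing the wrong value through.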

In [None]:
# Load sample data
with open('example_data.json', 'r') as f:
    data = json.load(f)

# Ingest documents for each tenant
async def demo_ingestion():
    for tenant_id, docs in data['sample_documents'].items():
        result = await copilot.ingest_documents(
            tenant_id=tenant_id,
            documents=docs
        )
        print(f"{tenant_id}: {result['documents_ingested']} docs → {result['namespace']}")

await demo_ingestion()
# Expected: 3 lines showing ingestion per tenant

## 5. Multi-Tenant Query Execution

Each query is executed with:
1. Tenant context propagation (identity flows through all calls)
2. Resource limit checks (quotas enforced per tenant)
3. Namespace-isolated retrieval (only tenant's documents)
4. Model tier selection (based on tenant config)
5. Usage tracking for billing

Query-level overrides allow runtime customization without config changes.
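
The resource limit check in step 2 could look something like the following. It is a sketch only: `TenantQuota`, `QuotaExceeded`, and the idea of a monthly token budget are assumptions, since the notebook does not show how quotas are defined.

```python
from dataclasses import dataclass


class QuotaExceeded(Exception):
    """Raised when a request would push a tenant past its budget."""


@dataclass
class TenantQuota:
    # Hypothetical budget: tokens a tenant may consume per billing period.
    monthly_token_limit: int
    tokens_used: int = 0

    def reserve(self, estimated_tokens: int) -> None:
        """Check the budget before running the query, then record the reservation."""
        if self.tokens_used + estimated_tokens > self.monthly_token_limit:
            raise QuotaExceeded(
                f"{self.tokens_used} used + {estimated_tokens} requested "
                f"> limit {self.monthly_token_limit}"
            )
        self.tokens_used += estimated_tokens
```

Checking before the LLM call, rather than after, keeps a tenant from overshooting its limit on the request that crosses the threshold.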

In [None]:
# Execute queries from sample data
async def demo_queries():
    for query_data in data['sample_queries'][:3]:  # First 3 queries
        response = await copilot.query(
            tenant_id=query_data['tenant_id'],
            query_text=query_data['query']
        )
        
        meta = response['metadata']
        print(f"{meta['tenant_id']}: {meta['latency_ms']:.1f}ms, "
              f"model={meta['model']}, tokens={meta['tokens_used']}")

await demo_queries()
# Expected: 3 lines showing query results per tenant

## 6. Usage Tracking & Billing Attribution

Every operation is tracked with:
- Tenant ID (for attribution)
- Operation type (query, ingestion, etc.)
- Resource consumption (tokens, latency)
- Success/failure status
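
A per-tenant rollup of such records might be computed as below. The event shape, the `summarize_usage` helper, and the prices in `PRICE_PER_1K_TOKENS` are all hypothetical (the prices are placeholders, not vendor pricing); the output keys mirror the ones read in the metrics cell later in this section.

```python
from collections import defaultdict

# Placeholder prices in USD per 1,000 tokens; not real vendor pricing.
PRICE_PER_1K_TOKENS = {"basic": 0.002, "premium": 0.06}


def summarize_usage(events):
    """Group usage events by tenant and estimate LLM cost from token counts."""
    totals = defaultdict(lambda: {
        "total_queries": 0,
        "successful_queries": 0,
        "total_tokens": 0,
        "estimated_llm_cost": 0.0,
    })
    for event in events:
        row = totals[event["tenant_id"]]
        row["total_queries"] += 1
        row["successful_queries"] += int(event["success"])
        # Tokens are counted even for failed calls, since they were consumed.
        row["total_tokens"] += event["tokens"]
        price = PRICE_PER_1K_TOKENS[event["tier"]]
        row["estimated_llm_cost"] += event["tokens"] / 1000 * price
    return dict(totals)
```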

**Cost Breakdown Example (Monthly):**
- Database: $50-200
- Vector Store: $70-500
- LLM APIs: $100-2000
- Observability: $50-300

**Common Failure:** Async billing lag - operations complete but billing delayed.  
**Fix:** Background worker with retry queue.
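
A background worker with a retry queue can be sketched with `asyncio.Queue`. The names `billing_worker`, `flaky_send`, and `run_demo` are hypothetical, the billing call is simulated, and backoff between attempts is omitted to keep the example short; a real worker would likely add a delay and persist the queue.

```python
import asyncio


async def billing_worker(queue, send, dead_letter, max_attempts=3):
    """Deliver usage events, re-queueing failures until max_attempts is reached."""
    while True:
        event, attempt = await queue.get()
        try:
            await send(event)
        except Exception:
            if attempt < max_attempts:
                # Re-queue before task_done() so queue.join() keeps waiting.
                queue.put_nowait((event, attempt + 1))
            else:
                dead_letter.append(event)
        finally:
            queue.task_done()


async def run_demo():
    queue = asyncio.Queue()
    delivered, dead_letter = [], []
    calls = {"count": 0}

    async def flaky_send(event):
        # Simulated billing API that fails on the first call only.
        calls["count"] += 1
        if calls["count"] == 1:
            raise ConnectionError("billing API unavailable")
        delivered.append(event)

    queue.put_nowait(({"tenant_id": "acme_corp", "tokens": 1200}, 1))
    worker = asyncio.create_task(billing_worker(queue, flaky_send, dead_letter))
    await queue.join()  # returns once every event is delivered or dead-lettered
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return delivered, dead_letter
```

Events that exhaust their attempts land in `dead_letter` instead of being dropped, so billing gaps can be reconciled later.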

In [None]:
# View tenant metrics and costs
for tenant_id in ['acme_corp', 'beta_inc', 'gamma_labs']:
    metrics = copilot.get_tenant_metrics(tenant_id, hours=1)
    costs = metrics['costs']
    
    print(f"\n{tenant_id}:")
    print(f"  Queries: {metrics['successful_queries']}/{metrics['total_queries']}")
    print(f"  Avg Latency: {metrics['avg_latency_ms']:.2f}ms")
    print(f"  Tokens: {costs['total_tokens']}")
    print(f"  Est. Cost: ${costs['estimated_llm_cost']:.4f}")

# Expected: Metrics for 3 tenants (~15 lines)

## 7. Common Failure Modes & Fixes

### 1. Cache Race Conditions (Cross-Tenant Leakage)
**Symptom:** Tenant A sees Tenant B's data  
**Cause:** Shared cache without tenant isolation  
**Fix:** Thread-safe caching with tenant-scoped locks
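
As a simplified illustration of the fix, the sketch below makes the tenant ID part of every cache key and guards the dictionary with a lock. Note the substitution: it uses one shared lock with tenant-qualified keys rather than the per-tenant locks named above, which is easier to reason about but serializes all tenants. `TenantScopedCache` is a hypothetical name.

```python
import threading
from typing import Any, Dict, Optional, Tuple


class TenantScopedCache:
    """Cache whose keys always include the tenant, so entries cannot collide."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[Tuple[str, str], Any] = {}

    def get(self, tenant_id: str, key: str) -> Optional[Any]:
        with self._lock:
            return self._data.get((tenant_id, key))

    def put(self, tenant_id: str, key: str, value: Any) -> None:
        with self._lock:
            self._data[(tenant_id, key)] = value

    def clear_tenant(self, tenant_id: str) -> None:
        """Drop every entry belonging to one tenant, e.g. on offboarding."""
        with self._lock:
            for cache_key in [k for k in self._data if k[0] == tenant_id]:
                del self._data[cache_key]
```

Two tenants asking the same question get separate entries, because the lookup key is the `(tenant_id, key)` pair rather than the query text alone.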

### 2. Cascading Rate Limits
**Symptom:** One heavy tenant blocks others  
**Fix:** Per-tenant rate limiting + circuit breakers
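
Per-tenant rate limiting can be sketched as a sliding-window counter kept separately for each tenant. `PerTenantRateLimiter` is a hypothetical class, not the limiter used by the module; the injectable `clock` is an assumption added so the behavior can be tested without sleeping. Circuit breakers are not shown.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PerTenantRateLimiter:
    """Sliding-window limiter with an independent window per tenant."""

    def __init__(
        self,
        max_requests: int = 100,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, tenant_id: str) -> bool:
        now = self._clock()
        hits = self._hits[tenant_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False
        hits.append(now)
        return True
```

Since each tenant has its own deque, a tenant that exhausts its allowance is rejected without affecting the remaining budget of any other tenant.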

### 3. Connection Pool Exhaustion
**Symptom:** Timeouts during bulk operations  
**Fix:** Connection pooling + request batching

### 4. OpenTelemetry Context Loss
**Symptom:** Tracing breaks mid-chain  
**Fix:** Explicit context propagation in async boundaries

### 5. Async Billing Lag
**Symptom:** Usage tracked but billing delayed  
**Fix:** Background worker with retry queue

In [None]:
# Demonstrate rate limiting (Failure Mode #2)
async def demo_rate_limit():
    print("Testing rate limit protection...")
    
    try:
        # Our system allows 100 req/min per tenant
        # Simulate rapid queries (will hit limit in real implementation)
        for i in range(3):
            response = await copilot.query(
                tenant_id="test_tenant",
                query_text=f"Query {i}"
            )
            print(f"  Query {i}: OK")
    
    except Exception as e:
        # Any error lands here, so report its type rather than assuming a rate limit
        print(f"  ✓ Request rejected ({type(e).__name__}): {e}")

await demo_rate_limit()
# Expected: Either 3 OK responses or rate limit error

## 8. Decision Card: When to Use This Architecture

### ✅ Use This When:
- **5-100 paying customers** (sweet spot for multi-tenancy overhead)
- **>500ms P95 latency acceptable** (allows for coordination overhead)
- **Need strong tenant isolation** (data privacy requirements)
- **Have DevOps expertise** (to manage infrastructure)
- **Addressable market of >100 customers** (justifies engineering investment)

### ❌ Avoid This When:
- **<5 customers** → Overhead unjustified, start single-tenant
- **<100 customers in the total market** → Overengineered for small opportunity
- **<500ms latency required** → Coordination overhead too high
- **No DevOps team** → Use managed platforms instead
- **MVP stage** → Start simpler, add multi-tenancy later

### 🔄 Alternative Approaches:
1. **MVP-first phasing**: Single-tenant → Add multi-tenancy incrementally
2. **Microservices**: Separate services per component for independent scaling
3. **Managed platforms**: Hosted RAG solutions for faster time-to-value
4. **Tenant-per-instance**: Single-tenant SaaS copies for premium customers

## 9. Production Deployment Checklist

### Secrets Management
- ✓ All API keys in environment variables (never in code)
- ✓ Rotate secrets regularly (quarterly minimum)
- ✓ Use secret management service (AWS Secrets Manager, HashiCorp Vault)

### Monitoring Dashboards
- ✓ Per-tenant query latency (P50/P95/P99)
- ✓ Error rates by tenant
- ✓ Token usage and cost attribution
- ✓ Rate limit violations

### Alerting Thresholds
- ✓ P95 latency >1000ms
- ✓ Error rate >5%
- ✓ Any cross-tenant data leakage
- ✓ Database connection pool >80%

### Incident Response
- ✓ On-call rotation documented
- ✓ Runbook for common failures
- ✓ Rollback procedure tested
- ✓ Customer communication templates

### Load Testing (Practathon Challenge)
- **Easy (10-15 hrs)**: 3 tenants, basic load testing
- **Medium (15-20 hrs)**: Production-ready with comprehensive testing
- **Hard (25-30 hrs)**: Multi-region deployment with failover

**Target:** 1,000 req/hour across 100+ tenants, P95 < 500ms
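
To put the target in perspective, 1,000 requests per hour is about 0.28 requests per second on average, or roughly 10 requests per hour per tenant when spread evenly over 100 tenants. The helpers below are a small sketch for checking load test results against the target; `average_rps` and `p95_nearest_rank` are hypothetical names, and the nearest-rank method is one of several common percentile definitions.

```python
from typing import Sequence


def average_rps(requests_per_hour: float) -> float:
    """Convert an hourly request target into average requests per second."""
    return requests_per_hour / 3600


def p95_nearest_rank(latencies_ms: Sequence[float]) -> float:
    """Return the 95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_target(latencies_ms: Sequence[float], p95_limit_ms: float = 500.0) -> bool:
    """True when the observed P95 is strictly below the limit."""
    return p95_nearest_rank(latencies_ms) < p95_limit_ms
```

With 20 samples, a single slow outlier does not move the nearest-rank P95, so the check tolerates rare spikes while still catching a consistently slow tail.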

In [None]:
# Final Summary
print("=" * 60)
print("Module 13: Enterprise RAG SaaS - Complete")
print("=" * 60)
print("\n✅ Demonstrated:")
print("  1. Multi-tenant configuration cascade")
print("  2. Context propagation across async boundaries")
print("  3. Namespace isolation for data privacy")
print("  4. Usage tracking & cost attribution")
print("  5. Failure modes & mitigation strategies")
print("\n📊 Key Metrics:")
tenant_ids = ['acme_corp', 'beta_inc', 'gamma_labs']
print(f"  - Tenants configured: {len(tenant_ids)}")
print(f"  - Total queries executed: {sum(copilot.get_tenant_metrics(t, hours=1)['total_queries'] for t in tenant_ids)}")
print("  - P95 latency target: <500ms")
print("\n🎯 Next Steps:")
print("  - Deploy FastAPI wrapper (app.py)")
print("  - Add production monitoring")
print("  - Run load tests (1000 req/hr target)")
print("  - Configure multi-region failover (optional)")
print("=" * 60)