# L3 M13.4: Capacity Planning & Forecasting

## Learning Arc

**Purpose:** This module teaches you to implement proactive capacity management for multi-tenant platforms. You'll learn to analyze historical usage patterns and forecast capacity needs using time-series analysis and linear regression, preventing resource exhaustion before it impacts tenants.

**Concepts Covered:**
- Time-series capacity forecasting with linear regression
- Historical usage pattern analysis from PostgreSQL
- Headroom buffer calculation (20% industry standard)
- Multi-threshold alerting (70%, 80%, 90% utilization)
- Tenant rebalancing for "noisy neighbor" problems
- Batch processing for 50+ tenants
- Seasonal anomaly handling
- Lead time alignment with procurement cycles

**After Completing This Notebook:**
- You will understand how to analyze 6 months of historical usage data
- You can implement linear regression for 3-month capacity forecasting
- You will recognize when to apply headroom buffers (20% standard, 30% for volatile workloads)
- You can design graduated alert thresholds to prevent alert fatigue
- You will identify "noisy neighbor" problems and recommend tenant migrations
- You can batch-process forecasts for 150+ tenant × metric combinations
- You will evaluate trade-offs: explainability vs. accuracy (linear regression vs. LSTM)
- You can align forecast windows with hardware procurement lead times

**Context in Track L3.M13:**
This module builds on M13.1-M13.3 (Performance Optimization Fundamentals) and prepares you for M13.5 (Auto-scaling Implementation) and M13.6 (Cost Optimization).

In [None]:
import os
import sys

# Add src to path for imports
if './src' not in sys.path:
    sys.path.insert(0, './src')

# OFFLINE mode for L3 consistency (no database required)
OFFLINE = os.getenv("OFFLINE", "true").lower() == "true"
DB_ENABLED = os.getenv("DB_ENABLED", "false").lower() == "true"

if OFFLINE or not DB_ENABLED:
    print("⚠️ Running in OFFLINE/DB_DISABLED mode")
    print("   → Database calls will use synthetic data")
    print("   → Set OFFLINE=false and DB_ENABLED=true in .env to enable PostgreSQL")
else:
    print("✓ Online mode - PostgreSQL enabled")

print("\n✓ Environment setup complete")

## Section 1: Understanding Historical Usage Data

Capacity forecasting starts with analyzing historical usage patterns. We collect monthly-aggregated data for three key metrics:
- **CPU Usage:** Compute resource utilization
- **Memory Usage:** RAM consumption
- **Storage Usage:** Disk space utilization

**Why 6 months?** A 6-month window captures two full quarterly cycles without including stale data. A window shorter than 3 months misses quarter-over-quarter patterns; a window longer than 12 months pulls in trends that may no longer apply.

**Why monthly aggregation?** Smooths daily spikes and reduces noise while preserving meaningful trends.
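
The smoothing effect of monthly aggregation can be sketched as follows. This is a minimal illustration, not the module's implementation: `aggregate_monthly` and the sample data are hypothetical, and the sketch assumes a plain mean per calendar month.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def aggregate_monthly(samples):
    """Average (date, value) samples into one value per calendar month."""
    buckets = defaultdict(list)
    for day, value in samples:
        buckets[(day.year, day.month)].append(value)
    # Sort by (year, month) so the result is in chronological order
    return [(f"{y}-{m:02d}", mean(vals)) for (y, m), vals in sorted(buckets.items())]


# Hypothetical daily CPU samples: 50% baseline with a spike to 90% every 10th day
start = date(2024, 1, 1)
daily = [(start + timedelta(days=i), 90.0 if i % 10 == 0 else 50.0) for i in range(60)]

monthly = aggregate_monthly(daily)
for month, avg in monthly:
    print(f"{month}: {avg:.1f}%")
```

The daily series swings between 50% and 90%, while both monthly averages stay in the low-to-mid 50s, which is the kind of noise reduction the forecast relies on.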

In [None]:
from src.l3_m13_capacity_planning import TenantCapacityForecaster, CapacityMetric
from datetime import datetime, timedelta
import numpy as np

# Initialize forecaster (no database connection in offline mode)
forecaster = TenantCapacityForecaster()

# Fetch historical data for a sample tenant
tenant_id = "tenant-ecommerce-001"
metric_name = "cpu_usage"

historical_data = forecaster.get_historical_usage(
    tenant_id=tenant_id,
    metric_name=metric_name,
    months_back=6
)

print(f"Historical data for {tenant_id}/{metric_name}:")
print(f"Total data points: {len(historical_data)}")
print("\nFirst 3 months:")
for metric in historical_data[:3]:
    print(f"  {metric.timestamp.strftime('%Y-%m')}: {metric.value:.1f}%")

# Expected: 6 monthly data points showing gradual growth trend

## Section 2: Linear Regression Forecasting

We use **linear regression** to predict future capacity needs. While advanced ML models (Prophet, LSTM) offer higher accuracy (90-95%), linear regression provides:
- **Explainability:** CFOs understand "2.5% growth per month"
- **Simplicity:** 50 lines of code vs. 500+ for LSTM
- **Speed:** Millisecond predictions vs. seconds

**Trade-off:** 5-10% lower accuracy, but transparent cost projections.
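
To make the "growth per month" reading concrete, here is a minimal least-squares sketch in plain Python. It is not the `TenantCapacityForecaster` implementation (which is not shown in this notebook); `fit_line`, `predict`, and the sample values are hypothetical, and the month index 0..n-1 is assumed as the only feature.

```python
def fit_line(values):
    """Ordinary least squares fit of values against month index 0..n-1.

    Returns (slope, intercept, r_squared).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 data points to fit a line")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, values))
    ss_tot = sum((y - y_mean) ** 2 for y in values)
    # Convention assumed here: a perfectly flat series counts as a perfect fit
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def predict(values, months_ahead):
    """Extrapolate the fitted line months_ahead past the last observation."""
    slope, intercept, r_squared = fit_line(values)
    future_x = len(values) - 1 + months_ahead
    return intercept + slope * future_x, slope, r_squared


# Hypothetical 6 months of CPU usage growing 2.5 points per month
usage = [50.0, 52.5, 55.0, 57.5, 60.0, 62.5]
predicted, slope, r2 = predict(usage, months_ahead=3)
print(f"slope={slope:.2f} points/month, predicted={predicted:.1f}%, R²={r2:.3f}")
```

The slope is the number a budget owner can read directly: here, 2.5 percentage points of growth per month, extrapolated three months past the last observation.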

In [None]:
# Generate forecast using linear regression
forecast_result = forecaster.forecast_capacity(
    historical_data=historical_data,
    months_ahead=3
)

print(f"Forecast Results for {tenant_id}:")
print(f"  Current usage: {forecast_result.current_usage:.1f}%")
print(f"  Predicted usage (3 months): {forecast_result.predicted_usage:.1f}%")
print(f"  With 20% headroom: {forecast_result.predicted_with_headroom:.1f}%")
print(f"  Model confidence (R²): {forecast_result.confidence:.3f}")
print(f"  Alert level: {forecast_result.alert_level}")
print(f"  Recommendation: {forecast_result.recommendation}")

# Expected: Predicted usage 5-10% higher than current, confidence > 0.7

## Section 3: Headroom Buffer Calculation

**Headroom buffer:** Safety margin to absorb unexpected spikes without service degradation.

**Industry standard:** 20% (1.2x multiplier)
- Based on empirical data: Q4 spikes are typically 15-25% above baseline
- Balances safety vs. cost efficiency

**When to adjust:**
- **High-volatility workloads** (e-commerce, retail): 30% (1.3x)
- **Stable workloads** (healthcare, government): 10% (1.1x)
- **Mission-critical systems:** 50% (1.5x)
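
The arithmetic behind these multipliers can be sketched as below. `apply_headroom` and `max_safe_prediction` are hypothetical helpers written for illustration (the module's own `calculate_headroom` may validate or round differently); the sketch assumes headroom is a simple multiplication and capacity is 100%.

```python
def apply_headroom(predicted_usage, headroom_factor=1.2):
    """Scale a predicted usage percentage by the headroom multiplier."""
    if headroom_factor < 1.0:
        raise ValueError("headroom_factor below 1.0 would shrink the forecast")
    return predicted_usage * headroom_factor


def max_safe_prediction(headroom_factor, capacity=100.0):
    """Largest raw prediction that still fits within capacity after headroom."""
    return capacity / headroom_factor


for label, factor in [("stable", 1.1), ("standard", 1.2), ("volatile", 1.3), ("critical", 1.5)]:
    ceiling = max_safe_prediction(factor)
    print(f"{label:9s} x{factor}: raw forecasts above {ceiling:.1f}% exceed capacity")
```

Read the other way around, a larger buffer lowers the raw utilization at which you must act: with the 1.5x factor, any raw forecast above roughly 66.7% already exceeds capacity once headroom is applied.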

In [None]:
from src.l3_m13_capacity_planning import calculate_headroom

# Compare headroom factors
base_prediction = 80.0  # 80% predicted usage

scenarios = [
    ("Conservative (10%)", 1.1),
    ("Standard (20%)", 1.2),
    ("High-volatility (30%)", 1.3),
    ("Mission-critical (50%)", 1.5)
]

print(f"Headroom Buffer Comparison (Base prediction: {base_prediction}%)\n")
for scenario, factor in scenarios:
    with_headroom = calculate_headroom(base_prediction, headroom_factor=factor)
    buffer = with_headroom - base_prediction
    print(f"{scenario:30s} → {with_headroom:5.1f}% (buffer: +{buffer:.1f}%)")

# Expected: Standard 20% buffer adds 16 percentage points to the 80% prediction, giving 96%

## Section 4: Multi-Threshold Alerting

**Graduated alert thresholds** prevent alert fatigue and enable appropriate responses:

| Threshold | Level | Action | Lead Time |
|-----------|-------|--------|----------|
| 70% | CAUTION | Plan ahead | 3+ months |
| 80% | WARNING | Initiate procurement | 1-3 months |
| 90% | CRITICAL | Emergency expansion | < 1 month |

**Why these thresholds?**
- **70%:** Early warning, no urgency
- **80%:** Action required, aligned with typical hardware lead times (2-3 months)
- **90%:** Near capacity, urgent response needed
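
A minimal sketch of how the thresholds and lead times fit together. `classify_usage` and `months_until` are hypothetical stand-ins (not the module's `get_alert_level`), and the sketch assumes inclusive lower bounds and a constant linear growth rate in percentage points per month.

```python
# Checked from highest to lowest so the most severe matching level wins
ALERT_THRESHOLDS = [(90.0, "CRITICAL"), (80.0, "WARNING"), (70.0, "CAUTION")]


def classify_usage(usage_percent):
    """Return the most severe alert level whose threshold is met."""
    for threshold, level in ALERT_THRESHOLDS:
        if usage_percent >= threshold:
            return level
    return "OK"


def months_until(threshold, current_usage, monthly_growth_points):
    """Months until usage reaches threshold at a constant linear growth rate.

    Returns 0.0 if already at or above the threshold, None if usage is not growing.
    """
    if current_usage >= threshold:
        return 0.0
    if monthly_growth_points <= 0:
        return None
    return (threshold - current_usage) / monthly_growth_points


# Hypothetical tenant: 62.5% today, growing 2.5 points per month
current, growth = 62.5, 2.5
print(f"Today: {classify_usage(current)}")
for threshold, level in reversed(ALERT_THRESHOLDS):
    runway = months_until(threshold, current, growth)
    print(f"{level:8s} at {threshold:.0f}%: {runway:.1f} months away")
```

For this tenant the 80% WARNING line is 7 months out, comfortably beyond a 1-3 month procurement window, so no purchase is needed yet.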

In [None]:
from src.l3_m13_capacity_planning import get_alert_level

# Test alert level classification
usage_scenarios = [
    (65.0, "Healthy usage"),
    (72.5, "Approaching threshold"),
    (85.0, "Action required"),
    (92.0, "Critical - urgent response")
]

print("Alert Level Classification:\n")
for usage, description in usage_scenarios:
    alert = get_alert_level(usage)
    print(f"{usage:5.1f}% - {alert:8s} | {description}")

# Expected: 65%=OK, 72.5%=CAUTION, 85%=WARNING, 92%=CRITICAL

## Section 5: Batch Forecasting for Multiple Tenants

Production platforms often manage 50+ tenants; with 3 metrics each, that is 150+ forecasts per run.

**Batch processing strategy:**
- Process in parallel where possible
- Handle failures gracefully (skip failed tenants, log errors)
- Report progress every 10 forecasts
- Complete within 5 minutes target
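
One way the strategy above could look in code is sketched here. This is an assumption-laden illustration rather than how `forecast_all_tenants` works internally: `run_batch` and `fake_forecast` are hypothetical, and a thread pool is assumed because per-tenant work is typically dominated by database I/O.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger("capacity_batch")


def run_batch(tenant_ids, metrics, forecast_fn, max_workers=8, progress_every=10):
    """Run forecast_fn(tenant_id, metric) for every pair; collect results and failures."""
    jobs = [(t, m) for t in tenant_ids for m in metrics]
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(forecast_fn, t, m): (t, m) for t, m in jobs}
        for done, future in enumerate(as_completed(futures), start=1):
            tenant_id, metric = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # Skip the failed pair and keep going; one bad tenant must not stop the batch
                logger.warning("forecast failed for %s/%s: %s", tenant_id, metric, exc)
                failures.append((tenant_id, metric, str(exc)))
            if done % progress_every == 0 or done == len(jobs):
                logger.info("progress: %d/%d forecasts", done, len(jobs))
    return results, failures


def fake_forecast(tenant_id, metric):
    """Hypothetical forecast function that fails for one tenant."""
    if tenant_id == "tenant-broken":
        raise ValueError("no historical data")
    return {"tenant": tenant_id, "metric": metric, "predicted": 75.0}


results, failures = run_batch(
    ["tenant-a", "tenant-b", "tenant-broken"],
    ["cpu_usage", "memory_usage", "storage_usage"],
    fake_forecast,
)
print(f"{len(results)} succeeded, {len(failures)} failed")
```

Returning failures alongside results lets the caller decide whether a partial batch is acceptable, instead of losing the successful forecasts to a single exception.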

In [None]:
# Batch forecast for multiple tenants
tenant_ids = [
    "tenant-ecommerce-001",
    "tenant-fintech-042",
    "tenant-saas-099",
    "tenant-media-123",
    "tenant-healthcare-456"
]

metrics = ["cpu_usage", "memory_usage", "storage_usage"]

print(f"Batch forecasting {len(tenant_ids)} tenants × {len(metrics)} metrics...\n")

batch_results = forecaster.forecast_all_tenants(
    tenant_ids=tenant_ids,
    metrics=metrics
)

print(f"\nBatch Forecast Summary:")
print(f"  Total forecasts: {len(batch_results)}")
print(f"  Expected: {len(tenant_ids) * len(metrics)}")
print(f"\nAlert Distribution:")

alert_counts = {}
for result in batch_results:
    alert_counts[result.alert_level] = alert_counts.get(result.alert_level, 0) + 1

for level, count in sorted(alert_counts.items()):
    print(f"  {level:8s}: {count:2d} forecasts")

# Expected: 15 successful forecasts (5 tenants × 3 metrics)

## Section 6: Tenant Rebalancing Recommendations

**"Noisy neighbor" problem:** One high-usage tenant degrades performance for co-located tenants.

**Detection:**
- Calculate usage imbalance across nodes
- Threshold: an imbalance of up to 30% is acceptable; anything above 30% triggers rebalancing recommendations

**Remediation:**
- Migrate high-usage tenants to underutilized nodes
- Consider dedicated nodes for consistently heavy users
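
The detection and remediation steps above can be sketched for a node-level view as follows. This is a hypothetical greedy heuristic, not the module's `recommend_rebalancing`: the `placement` mapping, the helper names, and the assumption that each tenant's usage is its share of one node's capacity are all illustrative.

```python
def node_loads(placement, tenant_usage):
    """Sum tenant usage per node. placement maps node name -> list of tenant ids."""
    return {node: sum(tenant_usage[t] for t in tenants) for node, tenants in placement.items()}


def imbalance_ratio(loads):
    """(max - min) / max across nodes; 0.0 when every node is idle."""
    highest, lowest = max(loads.values()), min(loads.values())
    return 0.0 if highest == 0 else (highest - lowest) / highest


def suggest_move(placement, tenant_usage, threshold=0.3):
    """Suggest one (tenant, source, target) move, or None if balanced enough."""
    loads = node_loads(placement, tenant_usage)
    if imbalance_ratio(loads) <= threshold:
        return None
    source = max(loads, key=loads.get)
    target = min(loads, key=loads.get)
    gap = loads[source] - loads[target]
    # Moving a tenant with usage u changes the gap to |gap - 2u|,
    # so only tenants with u < gap actually narrow it.
    candidates = [t for t in placement[source] if tenant_usage[t] < gap]
    if not candidates:
        return None
    best = min(candidates, key=lambda t: abs(gap - 2 * tenant_usage[t]))
    return best, source, target


# Hypothetical placement: node-a carries 80% load, node-b only 30%
placement = {"node-a": ["t1", "t2", "t3"], "node-b": ["t4"]}
tenant_usage = {"t1": 40.0, "t2": 25.0, "t3": 15.0, "t4": 30.0}

move = suggest_move(placement, tenant_usage)
print(move)
```

Note that the heaviest tenant is not always the best one to move: here moving `t2` evens both nodes out at 55%, whereas moving `t1` would simply flip the imbalance.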

In [None]:
from src.l3_m13_capacity_planning import recommend_rebalancing

# Simulate unbalanced tenant distribution
tenant_usage = {
    "tenant-ecommerce-001": 72.5,
    "tenant-fintech-042": 55.8,
    "tenant-saas-099": 88.4,  # Noisy neighbor
    "tenant-media-123": 45.2,
    "tenant-healthcare-456": 63.7
}

print("Current Tenant Usage:")
for tenant, usage in sorted(tenant_usage.items(), key=lambda x: x[1], reverse=True):
    print(f"  {tenant:30s}: {usage:5.1f}%")

# Calculate imbalance
max_usage = max(tenant_usage.values())
min_usage = min(tenant_usage.values())
imbalance = (max_usage - min_usage) / max_usage

print(f"\nImbalance Analysis:")
print(f"  Max usage: {max_usage:.1f}%")
print(f"  Min usage: {min_usage:.1f}%")
print(f"  Imbalance ratio: {imbalance:.1%} (threshold: 30%)")

# Get recommendations
recommendations = recommend_rebalancing(tenant_usage, imbalance_threshold=0.3)

if recommendations:
    print(f"\n⚠️ Rebalancing Recommendations ({len(recommendations)}):")
    for move_tenant, source, target in recommendations:
        print(f"  • Move {move_tenant} from {source} to {target}")
else:
    print("\n✓ No rebalancing needed - usage distribution is balanced")

# Expected: 1+ recommendation to move tenant-saas-099 (88.4% usage)

## Section 7: Complete Forecast Example

Let's walk through a complete capacity planning scenario for an e-commerce tenant approaching Q4 (holiday season).

In [None]:
# Complete scenario: E-commerce tenant preparing for Q4
scenario_tenant = "tenant-ecommerce-seasonal"

print("=" * 60)
print("Capacity Planning Scenario: E-Commerce Q4 Preparation")
print("=" * 60)

# Fetch historical data
print("\nStep 1: Analyzing Historical Data (6 months)...")
historical = forecaster.get_historical_usage(
    tenant_id=scenario_tenant,
    metric_name="cpu_usage",
    months_back=6
)
print(f"  ✓ Retrieved {len(historical)} monthly data points")

# Generate standard forecast (20% headroom)
print("\nStep 2: Generating Standard Forecast (20% headroom)...")
standard_forecast = forecaster.forecast_capacity(historical, months_ahead=3)
print(f"  Current: {standard_forecast.current_usage:.1f}%")
print(f"  Predicted: {standard_forecast.predicted_usage:.1f}%")
print(f"  With headroom: {standard_forecast.predicted_with_headroom:.1f}%")
print(f"  Alert: {standard_forecast.alert_level}")

# Generate high-volatility forecast (30% headroom for Q4)
print("\nStep 3: Generating Q4-Adjusted Forecast (30% headroom)...")
q4_forecaster = TenantCapacityForecaster(headroom_factor=1.3)
q4_forecast = q4_forecaster.forecast_capacity(historical, months_ahead=3)
print(f"  Current: {q4_forecast.current_usage:.1f}%")
print(f"  Predicted: {q4_forecast.predicted_usage:.1f}%")
print(f"  With Q4 headroom: {q4_forecast.predicted_with_headroom:.1f}%")
print(f"  Alert: {q4_forecast.alert_level}")

# Decision summary
print("\nStep 4: Capacity Decision...")
if q4_forecast.alert_level in ["WARNING", "CRITICAL"]:
    print("  ⚠️ Action Required: Provision additional capacity before Q4")
    print(f"  Recommendation: {q4_forecast.recommendation}")
else:
    print("  ✓ Current capacity adequate for Q4 with 30% buffer")

print("\n" + "=" * 60)

# Expected: Q4 forecast shows a higher with-headroom value, and an equal or higher alert level, than the standard forecast

## Section 8: Handling Common Failures

Production capacity planning encounters several failure scenarios. Let's explore how to handle them gracefully.

In [None]:
import logging

print("Common Failure Scenarios:\n")

# Scenario 1: Insufficient historical data
print("1. Insufficient Historical Data (< 3 months)")
try:
    # Offline mode generates synthetic data, so this call succeeds here.
    # With a real database, a new tenant may raise ValueError instead.
    short_history = forecaster.get_historical_usage(
        tenant_id="new-tenant",
        metric_name="cpu_usage",
        months_back=2  # Less than the 3-month minimum
    )
    if len(short_history) < 3:
        print(f"   ⚠️ Warning: Only {len(short_history)} months available (min 3 required)")
        print("   → Fix: Use conservative defaults (assume 5% monthly growth)")
except ValueError as e:
    print(f"   ❌ Error: {e}")
    print("   → Fix: Wait for minimum data collection period")

# Scenario 2: Low confidence forecast
print("\n2. Low Confidence Score (R² < 0.5)")
low_conf_forecast = forecaster.forecast_capacity(historical, months_ahead=3)
if low_conf_forecast.confidence < 0.5:
    print(f"   ⚠️ Warning: Low confidence ({low_conf_forecast.confidence:.2f})")
    print("   → Fix: Check data quality, apply moving average smoothing")
else:
    print(f"   ✓ Good confidence: {low_conf_forecast.confidence:.2f}")

# Scenario 3: Lead time mismatch
print("\n3. Lead Time Mismatch (forecast < procurement time)")
forecast_window = 3  # months
procurement_time = 4  # months (hardware delivery)
if forecast_window < procurement_time:
    print(f"   ⚠️ Warning: Forecast window ({forecast_window}mo) < Procurement ({procurement_time}mo)")
    print("   → Fix: Extend forecast to 6 months for hardware procurement")
else:
    print("   ✓ Forecast window aligned with procurement lead time")

# Scenario 4: Empty historical data
print("\n4. Empty Historical Data")
try:
    empty_forecast = forecaster.forecast_capacity([], months_ahead=3)
except ValueError as e:
    print(f"   ❌ Error: {e}")
    print("   → Fix: Verify tenant exists and has collected metrics")

print("\n✓ Failure handling examples complete")

# Expected: All scenarios demonstrate graceful error handling
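
Two of the fixes named in the scenarios above (a conservative default growth rate and moving-average smoothing) can be sketched in a few lines. Both helpers are hypothetical, and treating "5% monthly growth" as compounding is an assumption; the notebook does not say whether the default is compounded or linear.

```python
def project_with_default_growth(current_usage, months_ahead, monthly_growth=0.05):
    """Compound a fixed monthly growth rate when history is too short to fit a trend."""
    return current_usage * (1 + monthly_growth) ** months_ahead


def moving_average(values, window=3):
    """Trailing moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]


# Fallback for a new tenant at 60% usage with only 2 months of data
fallback = project_with_default_growth(60.0, months_ahead=3)
print(f"Fallback projection: {fallback:.1f}%")

# Hypothetical noisy series that alternates around a rising trend
noisy = [50.0, 70.0, 54.0, 74.0, 58.0, 78.0]
smoothed = moving_average(noisy, window=3)
print(f"Smoothed: {smoothed}")
```

Smoothing trades data points for stability: a 3-month window turns 6 observations into 4, so it is worth checking that enough points remain to fit a trend afterwards.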

## Section 9: Production Integration Patterns

Integrating capacity forecasting into production workflows:

**Monitoring Stack:**
- Prometheus: Collects real-time usage metrics
- PostgreSQL: Stores historical aggregates
- This Module: Generates forecasts
- Grafana: Visualizes forecasts + actuals

**Workflow Automation:**
- Apache Airflow: Schedule daily forecast runs
- PagerDuty: Alert on WARNING/CRITICAL thresholds
- Jira: Auto-create capacity expansion tickets

In [None]:
import json

# Simulate production workflow output
print("Production Workflow Simulation\n")
print("=" * 60)

# Step 1: Generate forecasts for all tenants
print("\n[Airflow DAG: daily_capacity_forecast]")
print("  → Fetching active tenants from database...")
active_tenants = ["tenant-001", "tenant-002", "tenant-003"]
print(f"  → Found {len(active_tenants)} active tenants")

print("  → Running batch forecast...")
production_forecasts = forecaster.forecast_all_tenants(
    tenant_ids=active_tenants,
    metrics=["cpu_usage", "memory_usage", "storage_usage"]
)
print(f"  ✓ Generated {len(production_forecasts)} forecasts")

# Step 2: Filter critical alerts
print("\n[Alert Manager]")
critical_alerts = [
    f for f in production_forecasts 
    if f.alert_level in ["WARNING", "CRITICAL"]
]
print(f"  → Found {len(critical_alerts)} alerts requiring action")

for alert in critical_alerts[:3]:  # Show first 3
    print(f"     • {alert.tenant_id}/{alert.metric_name}: {alert.alert_level} ({alert.predicted_with_headroom:.1f}%)")

# Step 3: Generate dashboard data
print("\n[Grafana Dashboard Export]")
dashboard_data = [
    {
        "tenant": f.tenant_id,
        "metric": f.metric_name,
        "current": f.current_usage,
        "predicted": f.predicted_with_headroom,
        "alert": f.alert_level
    }
    for f in production_forecasts[:5]  # First 5 for brevity
]
print("  ✓ Dashboard data prepared (first 5):")
print(json.dumps(dashboard_data, indent=2))

# Step 4: Ticket creation
print("\n[Jira Integration]")
if critical_alerts:
    print(f"  → Creating {len(critical_alerts)} capacity expansion tickets...")
    for alert in critical_alerts[:2]:  # First 2
        print(f"     • JIRA-1234: Expand capacity for {alert.tenant_id}")
        print(f"       Priority: {'P1' if alert.alert_level == 'CRITICAL' else 'P2'}")
else:
    print("  ✓ No tickets needed - all capacity adequate")

print("\n" + "=" * 60)
print("✓ Production workflow simulation complete\n")

# Expected: Demonstrates end-to-end production integration

## Section 10: Decision Card - When to Use This Approach

**Linear Regression Capacity Forecasting** is ideal for:

### ✅ Use When:
- Historical usage shows **linear or near-linear growth** trend
- You need **explainable forecasts** for CFO/budget discussions
- Platform is **mature** with stable tenant base (< 10% churn)
- Forecast horizon is **short-term** (1-6 months)
- Data collection is **monthly-aggregated** (not real-time)
- Acceptable accuracy: **±10-15% prediction error**
- Team **lacks ML expertise** for advanced models
- CFO requires **transparent cost projections**

### ❌ Do NOT Use When:
- Usage patterns are **highly seasonal** (use Prophet or SARIMA)
- **Exponential growth** expected (use exponential smoothing)
- **Real-time** capacity decisions needed (use streaming analytics)
- Platform is **new** with < 3 months data (insufficient history)
- Tenant **churn > 25%** (invalidates historical trends)
- Budget requires **±5% accuracy** (use ensemble methods)
- **Complex interactions** between metrics (use multivariate models)
- Compliance requires **worst-case planning** (use percentile-based forecasting)

### Trade-offs Summary:

| Dimension | Linear Regression | Advanced ML (Prophet, LSTM) |
|-----------|-------------------|-----------------------------|
| Accuracy | 85-90% for linear trends | 90-95% with seasonality |
| Explainability | High (clear coefficients) | Low (black-box) |
| Complexity | Low (50 lines) | High (500+ lines) |
| Training Time | Milliseconds | Minutes to hours |
| Data Needs | 3 months minimum | 12+ months |
| Maintenance | Minimal | High (drift detection) |
| Cost | Free (open-source) | $50-200/month (GPU) |

In [None]:
# Decision helper: Evaluate if linear regression is appropriate

def evaluate_forecasting_approach(scenario):
    """Helper to determine if linear regression is suitable."""
    
    score = 0
    reasons = []
    
    # Positive indicators
    if scenario.get("history_months", 0) >= 6:
        score += 2
        reasons.append("✓ Sufficient historical data (6+ months)")
    
    if scenario.get("growth_pattern") == "linear":
        score += 3
        reasons.append("✓ Linear growth pattern detected")
    
    if scenario.get("requires_explainability"):
        score += 2
        reasons.append("✓ Explainability required for stakeholders")
    
    if scenario.get("forecast_horizon_months", 0) <= 6:
        score += 1
        reasons.append("✓ Short-term forecast horizon (≤6 months)")
    
    # Negative indicators
    if scenario.get("seasonality") == "high":
        score -= 3
        reasons.append("❌ High seasonality (consider Prophet)")
    
    if scenario.get("accuracy_requirement") == "high":  # <5%
        score -= 2
        reasons.append("❌ High accuracy requirement (consider ensemble)")
    
    if scenario.get("history_months", 0) < 3:
        score -= 4
        reasons.append("❌ Insufficient data (< 3 months)")
    
    # Decision
    # Cutoff of 6 (not 5) so that an otherwise ideal scenario with high
    # seasonality (8 - 3 = 5) is rated ACCEPTABLE rather than RECOMMENDED
    if score >= 6:
        recommendation = "✅ RECOMMENDED: Linear regression is well-suited"
    elif score >= 2:
        recommendation = "⚠️ ACCEPTABLE: Linear regression may work with caveats"
    else:
        recommendation = "❌ NOT RECOMMENDED: Consider alternative approaches"
    
    return score, recommendation, reasons

# Test scenarios
scenarios = [
    {
        "name": "Mature SaaS Platform",
        "history_months": 12,
        "growth_pattern": "linear",
        "requires_explainability": True,
        "forecast_horizon_months": 3,
        "seasonality": "low",
        "accuracy_requirement": "medium"
    },
    {
        "name": "E-Commerce (Holiday Seasonality)",
        "history_months": 18,
        "growth_pattern": "linear",
        "requires_explainability": True,
        "forecast_horizon_months": 6,
        "seasonality": "high",
        "accuracy_requirement": "medium"
    },
    {
        "name": "New Startup Platform",
        "history_months": 2,
        "growth_pattern": "exponential",
        "requires_explainability": False,
        "forecast_horizon_months": 3,
        "seasonality": "unknown",
        "accuracy_requirement": "high"
    }
]

print("Decision Helper: Evaluate Forecasting Approach\n")
print("=" * 70)

for scenario in scenarios:
    print(f"\nScenario: {scenario['name']}")
    print("-" * 70)
    
    score, recommendation, reasons = evaluate_forecasting_approach(scenario)
    
    print(f"Score: {score}/8")
    print(f"\n{recommendation}\n")
    print("Analysis:")
    for reason in reasons:
        print(f"  {reason}")

print("\n" + "=" * 70)

# Expected:
# Scenario 1 (Mature SaaS): RECOMMENDED (score 8)
# Scenario 2 (E-Commerce): ACCEPTABLE (score 5, note seasonality)
# Scenario 3 (New Startup): NOT RECOMMENDED (score -5)

## Conclusion: Key Takeaways

You've completed L3 M13.4: Capacity Planning & Forecasting! Here's what you've learned:

### Core Concepts Mastered:
1. **Time-series analysis** using 6 months of monthly-aggregated data
2. **Linear regression forecasting** with 3-month prediction horizon
3. **Headroom buffer calculation** (20% standard, 30% high-volatility)
4. **Multi-threshold alerting** (70%, 80%, 90% utilization)
5. **Batch processing** for 150+ tenant × metric forecasts
6. **Rebalancing recommendations** to address "noisy neighbor" problems

### Critical Design Decisions:
- **Why linear regression?** Explainability beats 5-10% accuracy loss
- **Why 6 months?** Captures two quarterly cycles without stale data
- **Why 20% headroom?** Sized to typical Q4 spikes (15-25%) without over-provisioning
- **Why 3 thresholds?** Graduated responses prevent alert fatigue

### Production Readiness:
- ✅ Handles insufficient data gracefully (fewer than the 3-month minimum)
- ✅ Processes 50+ tenants × 3 metrics in under 5 minutes
- ✅ Stores forecasts for dashboard visualization
- ✅ Generates actionable recommendations
- ✅ Integrates with Prometheus, Grafana, Airflow

### Next Steps:
- **M13.5:** Auto-scaling Implementation (trigger scaling from forecasts)
- **M13.6:** Cost Optimization (right-sizing recommendations)
- **Production:** Connect to real PostgreSQL, set up Grafana dashboards

### Additional Resources:
- [scikit-learn Linear Regression](https://scikit-learn.org/stable/modules/linear_model.html)
- [Google SRE Book - Capacity Planning](https://sre.google/sre-book/software-engineering-in-sre/)
- [Facebook Prophet](https://facebook.github.io/prophet/) (for seasonal forecasting)

---

**Congratulations!** You can now implement production-grade capacity forecasting for multi-tenant platforms. 🎉