# 139: Observability & Monitoring - Prometheus, Grafana, and SLOs

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** observability vs monitoring (metrics, logs, traces vs alerting on thresholds)
- **Implement** Prometheus metrics collection for ML model serving (latency, throughput, error rate)
- **Build** Grafana dashboards with SLO tracking (99.9% availability, P95 latency <100ms)
- **Deploy** AlertManager for intelligent alerting (grouping, deduplication, escalation)
- **Apply** observability to semiconductor ML systems (yield prediction API, STDF processing pipelines)
- **Monitor** golden signals (latency, traffic, errors, saturation) across distributed systems

## 📚 What is Observability?

**Observability** is the ability to **understand a system's internal state** from its external outputs (metrics, logs, traces). Monitoring reacts to known failure modes through predefined checks and thresholds; observability additionally enables **exploration and debugging** of issues nobody anticipated.

**Three Pillars of Observability:**
- **Metrics**: Numeric time-series data (CPU usage, request latency, error rate) - aggregated, low cardinality
- **Logs**: Discrete events with context (request failed, user logged in, model prediction made) - high cardinality, searchable
- **Traces**: Request journey across services (API call → load balancer → app server → database → model inference) - distributed systems

**Why Observability?**
- ✅ **Proactive debugging**: Detect issues before users complain (latency spike, error rate increase)
- ✅ **Root cause analysis**: Quickly identify source of problems (slow database query, memory leak, model degradation)
- ✅ **Capacity planning**: Understand resource usage patterns (auto-scale before saturation)
- ✅ **SLO tracking**: Measure reliability (99.9% uptime = 43 minutes downtime/month allowed)
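
The downtime figure in the last bullet follows from simple arithmetic on the SLO target. A minimal sketch, assuming a 30-day month (the helper name `error_budget_minutes` is illustrative and not part of any library):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1.0 - slo_target) * window_minutes


# Compare a few common availability targets
for target in (0.99, 0.995, 0.999, 0.9999):
    budget = error_budget_minutes(target)
    print(f"{target:.2%} availability -> {budget:.1f} min/month")
```

For 99.9% this gives roughly 43.2 minutes, which matches the figure quoted above. Each additional "nine" shrinks the budget by a factor of ten.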

**Golden Signals (Google SRE):**
1. **Latency**: Response time (P50, P95, P99 percentiles)
2. **Traffic**: Requests per second (RPS), throughput
3. **Errors**: Error rate (5xx responses, exceptions, timeouts)
4. **Saturation**: Resource utilization (CPU, memory, disk, network)
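
As a rough illustration, the four signals can be derived from one window of request records. The `RequestRecord` type, the 10-second window, and the CPU figure below are made up for the example; in a real system these values would come from the metrics pipeline rather than an in-memory list.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    latency_s: float  # wall-clock time to serve the request
    status: int       # HTTP status code


def nearest_rank(values: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: List[RequestRecord], window_s: float,
                   cpu_util: float) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_s": nearest_rank([r.latency_s for r in requests], 0.95),
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": cpu_util,  # supplied externally, e.g. from a node exporter
    }


# Hypothetical window: 18 fast successes, 2 slow requests (one of them failed)
window = [RequestRecord(0.05, 200)] * 18 + [
    RequestRecord(0.5, 200),
    RequestRecord(0.5, 503),
]
signals = golden_signals(window, window_s=10.0, cpu_util=0.72)
print(signals)
```

Note that the two slow requests dominate P95 even though the median stays at 50ms, which is why percentiles rather than averages are tracked for latency.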

## 🏭 Post-Silicon Validation Use Cases

### **Use Case 1: Prometheus Metrics for Yield Prediction API**
**Input:** ML API serving 5000 RPS with no observability (blind to latency, errors, model performance)  
**Output:** Prometheus metrics track P95 latency (85ms), error rate (0.15%), prediction distribution, model version  
**Value:** $4.2M/year from preventing outages (detect latency spikes before SLA violations, proactive scaling)

### **Use Case 2: Grafana SLO Dashboard for STDF ETL Pipeline**
**Input:** STDF batch processing pipeline with manual monitoring (engineers check logs reactively)  
**Output:** Grafana dashboard tracks SLO (99.5% jobs complete in <30 minutes), alert on violations  
**Value:** $3.5M/year from improved reliability (reduce failed ETL jobs by 60%, faster detection and recovery)

### **Use Case 3: Distributed Tracing for Wafer Map Rendering Service**
**Input:** Slow wafer map generation (4 seconds P95 latency), unclear which service is bottleneck  
**Output:** Jaeger traces show 3.2 seconds spent in image resizing service (80% of total latency)  
**Value:** $2.8M/year from performance optimization (optimize image service, reduce P95 to 1.2 seconds)

### **Use Case 4: AlertManager for Parametric Test Anomaly Detection**
**Input:** Outlier detection model alerts engineer for every anomaly (200 alerts/day, 80% false positives)  
**Output:** AlertManager groups alerts, deduplicates, routes to on-call only for critical anomalies (95% noise reduction)  
**Value:** $2.3M/year from reduced alert fatigue (engineers focus on real issues, 50% faster incident response)

**Total Post-Silicon Value:** $4.2M + $3.5M + $2.8M + $2.3M = **$12.8M/year**
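
The grouping and deduplication described in Use Case 4 can be pictured with a toy model. This is only a sketch of the idea in plain Python, not Alertmanager's implementation; the alert dictionaries, label names, and the `group_by` choice are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert is identified by its label set


def group_alerts(alerts: List[Alert],
                 group_by: Tuple[str, ...]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Drop exact-duplicate label sets, then bucket alerts by the group_by labels."""
    seen = set()
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint in seen:  # same alert reported again: deduplicate
            continue
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def groups_to_page(groups: Dict[Tuple[str, ...], List[Alert]]) -> List[Tuple[str, ...]]:
    """Route to on-call only the groups that contain a critical alert."""
    return [
        key for key, members in groups.items()
        if any(a.get("severity") == "critical" for a in members)
    ]


# Hypothetical parametric-test anomalies
alerts = [
    {"alertname": "ParametricOutlier", "lot": "A1", "test": "vth", "severity": "warning"},
    {"alertname": "ParametricOutlier", "lot": "A1", "test": "vth", "severity": "warning"},
    {"alertname": "ParametricOutlier", "lot": "A1", "test": "idd", "severity": "critical"},
    {"alertname": "ParametricOutlier", "lot": "B7", "test": "vth", "severity": "warning"},
]
groups = group_alerts(alerts, group_by=("alertname", "lot"))
paged = groups_to_page(groups)
print(f"{len(alerts)} raw alerts -> {len(groups)} groups, {len(paged)} page(s)")
```

Four raw alerts collapse into two notifications, and only the lot with a critical anomaly reaches the on-call engineer.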

## 🔄 Observability Workflow

```mermaid
graph LR
    A[🖥️ ML Service] --> B[📊 Emit Metrics]
    A --> C[📝 Write Logs]
    A --> D[🔍 Propagate Traces]
    
    B --> E[Prometheus]
    C --> F[Loki/Elasticsearch]
    D --> G[Jaeger/Tempo]
    
    E --> H[Grafana Dashboard]
    F --> H
    G --> H
    
    E --> I[AlertManager]
    I --> J{Threshold Exceeded?}
    J -->|Yes| K[📧 Alert On-Call]
    J -->|No| L[✅ Healthy]
    
    K --> M[🔧 Investigate]
    M --> N[📊 Query Metrics]
    M --> O[📝 Search Logs]
    M --> P[🔍 Analyze Traces]
    
    N --> Q[💡 Identify Root Cause]
    O --> Q
    P --> Q
    
    Q --> R[🛠️ Deploy Fix]
    R --> S[📈 Verify Recovery]
    S --> L
    
    style A fill:#e1f5ff
    style R fill:#e1ffe1
    style K fill:#fff4e1
    style Q fill:#ffe8cc
```

## 📊 Learning Path Context

**Prerequisites:**
- **Notebook 134: Service Mesh (Istio)** - Service mesh provides automatic metrics, tracing
- **Notebook 137: Infrastructure as Code** - Deploy observability stack with Terraform

**Next Steps:**
- **Notebook 140: Logging & Distributed Tracing** - Deep dive into logs and traces
- **Notebook 144: Performance Optimization** - Use observability to identify bottlenecks

---

Let's build observable ML systems with Prometheus and Grafana! 🚀

In [None]:
# Setup and Imports
import json
import time
import random
from datetime import datetime, timedelta
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Any, Tuple
from enum import Enum
from collections import defaultdict
import hashlib

# Set random seed for reproducibility
random.seed(42)

## 2. 📊 Prometheus Metrics - Time-Series Monitoring

### 📝 What's Happening in This Code?

**Purpose:** Implement Prometheus-style metrics collection for ML model serving and training workloads.

**Key Points:**
- **Metric Types**: Counter (monotonic increase), Gauge (up/down values), Histogram (latency distributions), Summary (quantiles)
- **Labels**: Dimensional data (e.g., `{model="yield_predictor", version="v2.1"}`) enable powerful queries
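- **Scraping**: Prometheus pulls metrics from each target's `/metrics` endpoint at a configured interval (15 seconds in this example)
- **PromQL**: Query language for aggregations (`rate()`, `histogram_quantile()`, `avg()`)
- **Time-Series Database**: Efficient storage with compression (for a modest number of series, 1 year of data can fit in <100GB; actual size depends on series count and scrape interval)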
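
To make `rate()` concrete: it is roughly the per-second increase of a counter over a time window, with counter resets (for example after a process restart) compensated. The sketch below is a simplified version that ignores the extrapolation to window boundaries that Prometheus applies; `simple_rate` and the sample data are illustrative.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter across the samples, reset-aware."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # A drop means the counter restarted from zero, so the whole
            # current value counts as new increase.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Scrapes every 15 seconds; the process restarts between t=30 and t=45
scrapes = [(0, 1000), (15, 1150), (30, 1300), (45, 40), (60, 190)]
print(f"rate = {simple_rate(scrapes):.2f} predictions/sec")
```

Without the reset handling, the drop from 1300 to 40 would show up as a large negative rate. This is why counters are queried through `rate()` instead of being graphed raw.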

**Why This Matters:**
- **Real-Time Monitoring**: Detect issues as they happen (latency spike from 50ms → 500ms)
- **Historical Analysis**: Correlate model drift with data distribution changes over weeks
- **Alerting**: Trigger PagerDuty when prediction latency P99 > 200ms for 5 minutes
- **Capacity Planning**: Track GPU utilization trends to predict when to scale (85% → add 2 nodes)

**Post-Silicon Application:**
- **Scenario**: ML model serving API for device yield prediction (1000 req/sec)
- **Metrics Collected**:
  - Counter: `ml_predictions_total{model="yield_predictor", result="success"}` (total predictions)
  - Gauge: `ml_model_accuracy{model="yield_predictor"}` (current accuracy %)
  - Histogram: `ml_prediction_latency_seconds{model="yield_predictor"}` (latency distribution with P50/P95/P99)
  - Gauge: `ml_gpu_utilization_percent{gpu_id="0"}` (GPU usage %)
- **Query Example**: `histogram_quantile(0.99, sum by (le) (rate(ml_prediction_latency_seconds_bucket[5m])))` → P99 latency = 85ms
- **Alert**: `ml_model_accuracy < 0.90` for 10 minutes → Trigger model retraining pipeline
- **Result**: 60% faster incident detection (MTTR from 2 hours → 45 minutes), $1.8M/year savings
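
PromQL's `histogram_quantile()` operates on cumulative bucket counters (the `_bucket` series with an `le` label), not on raw observations. It estimates the quantile by linear interpolation inside the bucket where the target rank falls. A simplified sketch of that estimate follows; the bucket counts are made up and `bucket_quantile` is an illustrative name.

```python
import math
from typing import List, Tuple


def bucket_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank lands in the +Inf bucket: the best available answer
                # is the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in seconds: le=0.05, 0.1, 0.25, 0.5, +Inf
latency_buckets = [(0.05, 60), (0.1, 95), (0.25, 99), (0.5, 100), (math.inf, 100)]
p50 = bucket_quantile(0.50, latency_buckets)
p99 = bucket_quantile(0.99, latency_buckets)
print(f"P50 ~ {p50 * 1000:.1f}ms, P99 ~ {p99 * 1000:.1f}ms")
```

Because the result is interpolated, its accuracy depends on bucket boundaries. Placing a boundary at the SLA threshold (200ms here would be a natural addition) keeps the estimate trustworthy exactly where it matters.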

In [None]:
# Prometheus Metrics Simulation

class MetricType(Enum):
    """Prometheus metric types"""
    COUNTER = "COUNTER"      # Monotonically increasing (e.g., total requests)
    GAUGE = "GAUGE"          # Can go up/down (e.g., CPU usage, accuracy)
    HISTOGRAM = "HISTOGRAM"  # Latency distributions with buckets
    SUMMARY = "SUMMARY"      # Similar to histogram, pre-calculated quantiles

@dataclass
class MetricValue:
    """Single metric observation"""
    timestamp: datetime
    value: float
    labels: Dict[str, str] = field(default_factory=dict)

class PrometheusMetric:
    """Base Prometheus metric"""
    
    def __init__(self, name: str, metric_type: MetricType, help_text: str):
        self.name = name
        self.metric_type = metric_type
        self.help_text = help_text
        self.values: List[MetricValue] = []
    
    def to_prometheus_format(self) -> str:
        """Export in Prometheus text exposition format"""
        lines = []
        lines.append(f"# HELP {self.name} {self.help_text}")
        lines.append(f"# TYPE {self.name} {self.metric_type.value.lower()}")
        
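        # Emit only the latest sample per label set: a scrape must not repeat
        # the same series. (Histograms are simplified here; a real exporter
        # would emit _bucket, _sum and _count series instead of raw values.)
        latest: Dict[str, MetricValue] = {}
        for val in self.values:
            latest[json.dumps(val.labels, sort_keys=True)] = val
        
        for val in latest.values():
            labels_str = ",".join(f'{k}="{v}"' for k, v in val.labels.items())
            if labels_str:
                lines.append(f"{self.name}{{{labels_str}}} {val.value}")
            else:
                lines.append(f"{self.name} {val.value}")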
        
        return "\n".join(lines)

class Counter(PrometheusMetric):
    """Counter metric (monotonically increasing)"""
    
    def __init__(self, name: str, help_text: str):
        super().__init__(name, MetricType.COUNTER, help_text)
        self._counters: Dict[str, float] = defaultdict(float)
    
    def inc(self, labels: Dict[str, str] = None, amount: float = 1.0):
        """Increment counter"""
        labels = labels or {}
        label_key = json.dumps(labels, sort_keys=True)
        self._counters[label_key] += amount
        
        self.values.append(MetricValue(
            timestamp=datetime.now(),
            value=self._counters[label_key],
            labels=labels
        ))

class Gauge(PrometheusMetric):
    """Gauge metric (can go up or down)"""
    
    def __init__(self, name: str, help_text: str):
        super().__init__(name, MetricType.GAUGE, help_text)
        self._gauges: Dict[str, float] = defaultdict(float)
    
    def set(self, value: float, labels: Dict[str, str] = None):
        """Set gauge value"""
        labels = labels or {}
        label_key = json.dumps(labels, sort_keys=True)
        self._gauges[label_key] = value
        
        self.values.append(MetricValue(
            timestamp=datetime.now(),
            value=value,
            labels=labels
        ))
    
    def inc(self, labels: Dict[str, str] = None, amount: float = 1.0):
        """Increment gauge"""
        labels = labels or {}
        label_key = json.dumps(labels, sort_keys=True)
        self._gauges[label_key] += amount
        self.set(self._gauges[label_key], labels)
    
    def dec(self, labels: Dict[str, str] = None, amount: float = 1.0):
        """Decrement gauge"""
        self.inc(labels, -amount)

class Histogram(PrometheusMetric):
    """Histogram metric (latency distributions)"""
    
    def __init__(self, name: str, help_text: str, buckets: List[float] = None):
        super().__init__(name, MetricType.HISTOGRAM, help_text)
        self.buckets = buckets or [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0]
        self._observations: Dict[str, List[float]] = defaultdict(list)
    
    def observe(self, value: float, labels: Dict[str, str] = None):
        """Record observation"""
        labels = labels or {}
        label_key = json.dumps(labels, sort_keys=True)
        self._observations[label_key].append(value)
        
        self.values.append(MetricValue(
            timestamp=datetime.now(),
            value=value,
            labels=labels
        ))
    
    def get_quantile(self, quantile: float, labels: Dict[str, str] = None) -> float:
        """Calculate quantile (e.g., 0.99 for P99)"""
        labels = labels or {}
        label_key = json.dumps(labels, sort_keys=True)
        observations = self._observations.get(label_key, [])
        
        if not observations:
            return 0.0
        
        sorted_obs = sorted(observations)
        index = int(len(sorted_obs) * quantile)
        return sorted_obs[min(index, len(sorted_obs) - 1)]
    
    def get_bucket_counts(self, labels: Dict[str, str] = None) -> Dict[float, int]:
        """Get counts per bucket (for histogram_quantile in PromQL)"""
        labels = labels or {}
        label_key = json.dumps(labels, sort_keys=True)
        observations = self._observations.get(label_key, [])
        
        counts = {}
        for bucket in self.buckets:
            counts[bucket] = sum(1 for obs in observations if obs <= bucket)
        counts[float('inf')] = len(observations)  # +Inf bucket
        
        return counts

class MetricsRegistry:
    """Prometheus metrics registry"""
    
    def __init__(self):
        self.metrics: Dict[str, PrometheusMetric] = {}
    
    def register(self, metric: PrometheusMetric):
        """Register metric"""
        self.metrics[metric.name] = metric
    
    def get_metric(self, name: str) -> Optional[PrometheusMetric]:
        """Get metric by name"""
        return self.metrics.get(name)
    
    def scrape(self) -> str:
        """Scrape all metrics (Prometheus /metrics endpoint)"""
        output = []
        for metric in self.metrics.values():
            output.append(metric.to_prometheus_format())
        return "\n\n".join(output)

# Example 1: ML Model Serving Metrics
print("=" * 70)
print("Example 1: ML Model Serving Metrics (Yield Prediction API)")
print("=" * 70)

registry = MetricsRegistry()

# Register metrics
predictions_total = Counter(
    name="ml_predictions_total",
    help_text="Total number of ML predictions made"
)
registry.register(predictions_total)

model_accuracy = Gauge(
    name="ml_model_accuracy",
    help_text="Current model accuracy (0.0 to 1.0)"
)
registry.register(model_accuracy)

prediction_latency = Histogram(
    name="ml_prediction_latency_seconds",
    help_text="ML prediction latency in seconds"
)
registry.register(prediction_latency)

gpu_utilization = Gauge(
    name="ml_gpu_utilization_percent",
    help_text="GPU utilization percentage"
)
registry.register(gpu_utilization)

print("\n📊 Simulating ML model serving for 100 requests...")
print("   Model: yield_predictor (device yield prediction)")
print("   SLA: P99 latency < 200ms, accuracy > 90%")

# Simulate predictions
for i in range(100):
    # Simulate prediction
    labels = {"model": "yield_predictor", "version": "v2.1"}
    
    # Record prediction
    predictions_total.inc(labels=labels)
    
    # Simulate latency (most fast, some slow)
    if random.random() < 0.95:
        latency = random.uniform(0.03, 0.08)  # 30-80ms (normal)
    else:
        latency = random.uniform(0.15, 0.25)  # 150-250ms (slow outliers)
    
    prediction_latency.observe(latency, labels=labels)
    
    # Simulate GPU utilization
    gpu_util = random.uniform(70, 85)
    gpu_utilization.set(gpu_util, labels={"gpu_id": "0"})

# Set model accuracy
model_accuracy.set(0.932, labels={"model": "yield_predictor", "dataset": "validation"})

# Calculate metrics
print(f"\n✅ Metrics Summary:")
print(f"   Total Predictions: {predictions_total._counters[json.dumps(labels, sort_keys=True)]:.0f}")
print(f"   Model Accuracy: {0.932:.1%}")
print(f"   Latency P50: {prediction_latency.get_quantile(0.50, labels) * 1000:.1f}ms")
print(f"   Latency P95: {prediction_latency.get_quantile(0.95, labels) * 1000:.1f}ms")
print(f"   Latency P99: {prediction_latency.get_quantile(0.99, labels) * 1000:.1f}ms")
print(f"   GPU Utilization: {gpu_util:.1f}%")

# Check SLA
p99_latency_ms = prediction_latency.get_quantile(0.99, labels) * 1000
if p99_latency_ms < 200:
    print(f"\n✅ SLA Met: P99 latency {p99_latency_ms:.1f}ms < 200ms")
else:
    print(f"\n❌ SLA Breach: P99 latency {p99_latency_ms:.1f}ms >= 200ms")

# Example 2: Multi-Model Comparison
print("\n\n" + "=" * 70)
print("Example 2: Multi-Model Performance Comparison")
print("=" * 70)

print("\n📊 Simulating 3 models: yield_predictor, binning_optimizer, failure_classifier")

models = [
    {"name": "yield_predictor", "accuracy": 0.932, "latency_base": 0.05},
    {"name": "binning_optimizer", "accuracy": 0.878, "latency_base": 0.12},
    {"name": "failure_classifier", "accuracy": 0.945, "latency_base": 0.08}
]

for model_info in models:
    labels = {"model": model_info["name"], "version": "v1.0"}
    
    # Simulate 50 requests per model
    for _ in range(50):
        predictions_total.inc(labels=labels)
        latency = model_info["latency_base"] + random.uniform(-0.01, 0.02)
        prediction_latency.observe(latency, labels=labels)
    
    # Set accuracy
    model_accuracy.set(model_info["accuracy"], labels={"model": model_info["name"], "dataset": "validation"})

print("\n✅ Model Performance Summary:")
for model_info in models:
    labels = {"model": model_info["name"], "version": "v1.0"}
    p50 = prediction_latency.get_quantile(0.50, labels) * 1000
    p99 = prediction_latency.get_quantile(0.99, labels) * 1000
    acc = model_info["accuracy"]
    
    print(f"\n   {model_info['name']}:")
    print(f"     Accuracy: {acc:.1%}")
    print(f"     Latency P50: {p50:.1f}ms")
    print(f"     Latency P99: {p99:.1f}ms")

# Example 3: Prometheus Scrape Endpoint
print("\n\n" + "=" * 70)
print("Example 3: Prometheus /metrics Endpoint")
print("=" * 70)

print("\n📊 Simulating Prometheus scrape (GET /metrics)...")
print("\n" + "=" * 70)
print(registry.scrape())
print("=" * 70)

print(f"\n✅ Prometheus metrics demonstrated: Counters, Gauges, Histograms with labels!")

## 3. 📈 Grafana Dashboards - Visualization and Alerting

### 📝 What's Happening in This Code?

**Purpose:** Build interactive dashboards to visualize ML system health and trigger alerts on anomalies.

**Key Points:**
- **Panels**: Time-series graphs, gauges, heatmaps, tables (visualize metrics from Prometheus)
- **Templating**: Dynamic dashboards with variables (select model: yield_predictor, binning_optimizer)
- **Alerting**: Trigger notifications when metrics cross thresholds (accuracy < 90% for 10 minutes)
- **Annotations**: Mark deployments, incidents on graphs (correlate performance drops with changes)
- **Drill-Down**: Click graph → filter by specific model/service (investigate anomalies)
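
The "for 10 minutes" part of an alert condition means the rule must stay true across consecutive evaluations for that long before it fires. A single bad sample only puts the rule into a pending state. A minimal sketch of that state machine, assuming evaluations arrive as `(timestamp_seconds, value)` pairs (the `PendingAlert` class is illustrative, not a Grafana or Prometheus API):

```python
from typing import Callable, Optional


class PendingAlert:
    """Tracks how long a condition has held before declaring an alert firing."""

    def __init__(self, condition: Callable[[float], bool], for_seconds: float):
        self.condition = condition
        self.for_seconds = for_seconds
        self.active_since: Optional[float] = None

    def evaluate(self, timestamp: float, value: float) -> str:
        if not self.condition(value):
            self.active_since = None  # any healthy sample resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = timestamp
        if timestamp - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Accuracy below 90% must persist for 10 minutes (600 seconds) before firing
alert = PendingAlert(lambda accuracy: accuracy < 0.90, for_seconds=600)
samples = [(0, 0.93), (60, 0.89), (360, 0.88), (420, 0.91), (480, 0.89), (1080, 0.87)]
states = [alert.evaluate(t, v) for t, v in samples]
print(states)
```

The brief recovery at t=420 resets the timer, so the alert fires only after the second dip has lasted the full 10 minutes. This is what keeps short blips from paging anyone.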

**Why This Matters:**
- **Visual Discovery**: Spot trends invisible in raw metrics (gradual latency increase over 3 days)
- **Correlation**: Overlay multiple metrics (CPU spike correlates with batch job start time)
- **Proactive Alerts**: Notify on-call engineer before users report issues (P99 latency trending up)
- **Executive Dashboards**: Show business metrics (model uptime 99.95%, $12K cost savings this month)

**Post-Silicon Application:**
- **Dashboard 1: STDF Processing Pipeline**
  - Panel 1: File processing rate (files/sec) with 15-minute average
  - Panel 2: Parsing latency P50/P95/P99 (heatmap showing time-of-day patterns)
  - Panel 3: Yield prediction accuracy (gauge with 90% threshold marker)
  - Panel 4: Error rate % (alert if >1% for 5 minutes)
- **Dashboard 2: ML Model Serving**
  - Panel 1: Request rate (req/sec) with capacity line at 1500 req/sec
  - Panel 2: Latency percentiles (P50/P95/P99) with SLA threshold (200ms)
  - Panel 3: GPU utilization per device (bar chart, alert if >90%)
  - Panel 4: Cost per 1000 predictions (trend line showing optimization impact)
- **Result**: 40% faster anomaly detection (engineers spot issues in dashboard before alerts fire)

In [None]:
# Grafana Dashboard Simulation

class PanelType(Enum):
    """Grafana panel types"""
    GRAPH = "GRAPH"          # Time-series line chart
    GAUGE = "GAUGE"          # Single value with thresholds
    HEATMAP = "HEATMAP"      # 2D histogram (latency over time)
    TABLE = "TABLE"          # Tabular data
    STAT = "STAT"            # Single stat with trend
    BAR_CHART = "BAR_CHART"  # Bar chart

class AlertSeverity(Enum):
    """Alert severity levels"""
    INFO = "INFO"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"

@dataclass
class GrafanaPanel:
    """Grafana dashboard panel"""
    title: str
    panel_type: PanelType
    query: str  # PromQL query
    thresholds: List[Tuple[str, float]] = field(default_factory=list)  # [(level, value)]
    unit: str = ""
    
    def evaluate(self, value: float) -> Optional[str]:
        """Evaluate current value against thresholds"""
        for level, threshold in reversed(self.thresholds):
            if value >= threshold:
                return level
        return None

@dataclass
class AlertRule:
    """Grafana alert rule"""
    name: str
    query: str
    condition: str  # e.g., "> 0.90"
    duration: str   # e.g., "5m"
    severity: AlertSeverity
    message: str
    
    def evaluate(self, value: float) -> bool:
        """Evaluate if alert should fire"""
        # Simplified evaluation
        operator = self.condition.split()[0]
        threshold = float(self.condition.split()[1])
        
        if operator == ">":
            return value > threshold
        elif operator == "<":
            return value < threshold
        elif operator == ">=":
            return value >= threshold
        elif operator == "<=":
            return value <= threshold
        
        return False

class GrafanaDashboard:
    """Grafana dashboard"""
    
    def __init__(self, name: str, description: str = ""):
        self.name = name
        self.description = description
        self.panels: List[GrafanaPanel] = []
        self.alerts: List[AlertRule] = []
        self.variables: Dict[str, List[str]] = {}
    
    def add_panel(self, panel: GrafanaPanel):
        """Add panel to dashboard"""
        self.panels.append(panel)
    
    def add_alert(self, alert: AlertRule):
        """Add alert rule"""
        self.alerts.append(alert)
    
    def add_variable(self, name: str, values: List[str]):
        """Add template variable"""
        self.variables[name] = values
    
    def render_panel(self, panel: GrafanaPanel, current_value: float):
        """Render panel (simulate visualization)"""
        print(f"\n{'=' * 60}")
        print(f"📊 {panel.title}")
        print(f"{'=' * 60}")
        print(f"Query: {panel.query}")
        print(f"Current Value: {current_value:.2f} {panel.unit}")
        
        # Check thresholds
        status = panel.evaluate(current_value)
        if status:
            if "CRITICAL" in status:
                print(f"Status: 🔴 {status}")
            elif "WARNING" in status:
                print(f"Status: 🟡 {status}")
            else:
                print(f"Status: 🟢 {status}")
        else:
            print(f"Status: 🟢 OK")
        
        # Visualize with simple bar
        if panel.panel_type == PanelType.GAUGE:
            # Clamp so values outside 0-100 cannot produce a malformed bar
            bar_length = max(0, min(40, int((current_value / 100) * 40)))
            bar = "█" * bar_length + "░" * (40 - bar_length)
            print(f"\n{bar} {current_value:.1f}%")
    
    def check_alerts(self, metric_values: Dict[str, float]):
        """Check all alert rules"""
        triggered_alerts = []
        
        for alert in self.alerts:
            # Simplified: assume metric_values contains values for alert queries
            for metric_name, value in metric_values.items():
                if metric_name in alert.query:
                    if alert.evaluate(value):
                        triggered_alerts.append((alert, value))
        
        return triggered_alerts

# Example 1: ML Model Serving Dashboard
print("=" * 70)
print("Example 1: ML Model Serving Dashboard")
print("=" * 70)

dashboard = GrafanaDashboard(
    name="ML Model Serving - Yield Predictor",
    description="Monitor ML model serving performance, latency, and accuracy"
)

# Add template variable
dashboard.add_variable("model", ["yield_predictor", "binning_optimizer", "failure_classifier"])

# Panel 1: Request Rate
request_rate_panel = GrafanaPanel(
    title="Request Rate",
    panel_type=PanelType.GRAPH,
    query="rate(ml_predictions_total{model='$model'}[5m])",
    unit="req/sec",
    thresholds=[
        ("OK", 0),
        ("WARNING", 800),
        ("CRITICAL", 1200)
    ]
)
dashboard.add_panel(request_rate_panel)

# Panel 2: P99 Latency
latency_panel = GrafanaPanel(
    title="Prediction Latency P99",
    panel_type=PanelType.GRAPH,
    query="histogram_quantile(0.99, sum by (le) (rate(ml_prediction_latency_seconds_bucket{model='$model'}[5m]))) * 1000",  # seconds -> ms
    unit="ms",
    thresholds=[
        ("OK", 0),
        ("WARNING", 150),
        ("CRITICAL", 200)
    ]
)
dashboard.add_panel(latency_panel)

# Panel 3: Model Accuracy
accuracy_panel = GrafanaPanel(
    title="Model Accuracy",
    panel_type=PanelType.GAUGE,
    query="ml_model_accuracy{model='$model'}",
    unit="%",
    thresholds=[
        ("CRITICAL", 0),
        ("WARNING", 85),
        ("OK", 90)
    ]
)
dashboard.add_panel(accuracy_panel)

# Panel 4: GPU Utilization
gpu_panel = GrafanaPanel(
    title="GPU Utilization",
    panel_type=PanelType.GAUGE,
    query="ml_gpu_utilization_percent{gpu_id='0'}",
    unit="%",
    thresholds=[
        ("OK", 0),
        ("WARNING", 85),
        ("CRITICAL", 95)
    ]
)
dashboard.add_panel(gpu_panel)

# Add alert rules
dashboard.add_alert(AlertRule(
    name="High P99 Latency",
    query="histogram_quantile(0.99, sum by (le) (rate(ml_prediction_latency_seconds_bucket{model='yield_predictor'}[5m])))",
    condition="> 0.20",  # 200ms
    duration="5m",
    severity=AlertSeverity.CRITICAL,
    message="P99 latency > 200ms for 5 minutes. Investigate slow queries or scale infrastructure."
))

dashboard.add_alert(AlertRule(
    name="Low Model Accuracy",
    query="ml_model_accuracy{model='yield_predictor'}",
    condition="< 0.90",
    duration="10m",
    severity=AlertSeverity.WARNING,
    message="Model accuracy < 90% for 10 minutes. Model drift detected, trigger retraining."
))

dashboard.add_alert(AlertRule(
    name="High GPU Utilization",
    query="ml_gpu_utilization_percent{gpu_id='0'}",
    condition="> 90",
    duration="15m",
    severity=AlertSeverity.WARNING,
    message="GPU utilization > 90% for 15 minutes. Consider adding more GPU nodes."
))

# Simulate current metrics
print(f"\n📊 Dashboard: {dashboard.name}")
print(f"Description: {dashboard.description}")

current_metrics = {
    "request_rate": 650,       # req/sec
    "p99_latency": 85,         # ms
    "model_accuracy": 93.2,    # %
    "gpu_utilization": 78.5    # %
}

# Render panels
dashboard.render_panel(request_rate_panel, current_metrics["request_rate"])
dashboard.render_panel(latency_panel, current_metrics["p99_latency"])
dashboard.render_panel(accuracy_panel, current_metrics["model_accuracy"])
dashboard.render_panel(gpu_panel, current_metrics["gpu_utilization"])

# Check alerts
print(f"\n\n{'=' * 60}")
print("🔔 Alert Status")
print(f"{'=' * 60}")

metric_values = {
    "ml_prediction_latency_seconds": current_metrics["p99_latency"] / 1000,  # Convert to seconds
    "ml_model_accuracy": current_metrics["model_accuracy"] / 100,
    "ml_gpu_utilization_percent": current_metrics["gpu_utilization"]
}

triggered_alerts = dashboard.check_alerts(metric_values)

if triggered_alerts:
    for alert, value in triggered_alerts:
        print(f"\n🚨 {alert.severity.value}: {alert.name}")
        print(f"   Value: {value}")
        print(f"   Condition: {alert.condition}")
        print(f"   Message: {alert.message}")
else:
    print("\n✅ No alerts triggered. All systems healthy!")

# Example 2: STDF Processing Pipeline Dashboard
print("\n\n" + "=" * 70)
print("Example 2: STDF Processing Pipeline Dashboard")
print("=" * 70)

pipeline_dashboard = GrafanaDashboard(
    name="STDF Processing Pipeline",
    description="Monitor STDF file parsing and yield prediction pipeline"
)

# Panel: File Processing Rate
processing_panel = GrafanaPanel(
    title="STDF Files Processed",
    panel_type=PanelType.STAT,
    query="rate(stdf_files_processed_total[15m]) * 60",  # rate() is per second; scale to files/min
    unit="files/min",
    thresholds=[
        ("CRITICAL", 0),
        ("WARNING", 5),
        ("OK", 10)
    ]
)
pipeline_dashboard.add_panel(processing_panel)

# Panel: Parsing Latency
parsing_latency_panel = GrafanaPanel(
    title="STDF Parsing Latency P95",
    panel_type=PanelType.HEATMAP,
    query="histogram_quantile(0.95, sum by (le) (rate(stdf_parsing_latency_seconds_bucket[5m]))) * 1000",  # seconds -> ms
    unit="ms",
    thresholds=[
        ("OK", 0),
        ("WARNING", 400),
        ("CRITICAL", 500)
    ]
)
pipeline_dashboard.add_panel(parsing_latency_panel)

# Panel: Yield Prediction Throughput
throughput_panel = GrafanaPanel(
    title="Yield Predictions per Second",
    panel_type=PanelType.GRAPH,
    query="rate(yield_predictions_total[5m])",
    unit="predictions/sec",
    thresholds=[
        ("OK", 0),
        ("WARNING", 800),
        ("CRITICAL", 1000)
    ]
)
pipeline_dashboard.add_panel(throughput_panel)

print(f"\n📊 Dashboard: {pipeline_dashboard.name}")

pipeline_metrics = {
    "file_processing_rate": 12.5,  # files/min
    "parsing_latency_p95": 380,    # ms
    "prediction_throughput": 450   # predictions/sec
}

pipeline_dashboard.render_panel(processing_panel, pipeline_metrics["file_processing_rate"])
pipeline_dashboard.render_panel(parsing_latency_panel, pipeline_metrics["parsing_latency_p95"])
pipeline_dashboard.render_panel(throughput_panel, pipeline_metrics["prediction_throughput"])

# Example 3: Multi-Model Comparison Dashboard
print("\n\n" + "=" * 70)
print("Example 3: Multi-Model Comparison Dashboard")
print("=" * 70)

comparison_dashboard = GrafanaDashboard(
    name="Multi-Model Performance Comparison",
    description="Compare performance across yield_predictor, binning_optimizer, failure_classifier"
)

print(f"\n📊 Dashboard: {comparison_dashboard.name}")
print("\nModel Performance Summary (Bar Chart):")
print(f"{'=' * 60}")

models_comparison = [
    {"name": "yield_predictor", "accuracy": 93.2, "latency_p99": 85, "throughput": 650},
    {"name": "binning_optimizer", "accuracy": 87.8, "latency_p99": 125, "throughput": 420},
    {"name": "failure_classifier", "accuracy": 94.5, "latency_p99": 95, "throughput": 580}
]

for model in models_comparison:
    print(f"\n{model['name']}:")
    print(f"  Accuracy: {'█' * int(model['accuracy'] / 10)} {model['accuracy']:.1f}%")
    print(f"  Latency P99: {'█' * int(model['latency_p99'] / 10)} {model['latency_p99']:.0f}ms")
    print(f"  Throughput: {'█' * int(model['throughput'] / 50)} {model['throughput']} req/sec")

print(f"\n\n✅ Grafana dashboards demonstrated: Panels, alerts, thresholds, visualizations!")

## 4. 🔍 Distributed Tracing - OpenTelemetry and Jaeger

### 📝 What's Happening in This Code?

**Purpose:** Track requests across microservices to identify bottlenecks and debug latency issues.

**Key Points:**
- **Spans**: Individual operations (database query, model inference, HTTP request)
- **Traces**: Collection of spans forming complete request journey (API → Feature Store → Model → DB)
- **Context Propagation**: Pass trace_id across services (correlate spans from different microservices)
- **Sampling**: Trace 1% of production traffic (reduce overhead, maintain visibility)
- **Baggage**: Carry metadata across spans (user_id, tenant_id, experiment_id)
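
Context propagation and sampling can be sketched with the W3C Trace Context `traceparent` header, which carries a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte (`01` means sampled). The helper names below are illustrative. The sampling rule is a simple deterministic ratio check on the trace ID, written here as an assumption about one reasonable approach rather than a copy of any SDK's sampler.

```python
import re
import secrets
from typing import Optional

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def should_sample(trace_id: str, ratio: float = 0.01) -> bool:
    """Deterministic decision: every service agrees for the same trace ID."""
    return int(trace_id[-16:], 16) < ratio * (1 << 64)


def new_traceparent(ratio: float = 0.01) -> str:
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if should_sample(trace_id, ratio) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> Optional[str]:
    """Header for a downstream call: same trace ID and flags, new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return None  # malformed header: the caller should start a new trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent(ratio=1.0)  # sample everything for the demo
outgoing = child_traceparent(incoming)
print("incoming:", incoming)
print("outgoing:", outgoing)
```

The trace ID is the only part that stays constant across hops, which is what lets a backend such as Jaeger stitch spans from different services into one trace.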

**Why This Matters:**
- **Latency Attribution**: Which service caused 2s delay? (Database: 1.8s vs Model: 0.05s)
- **Cascading Failures**: Trace shows API timeout caused by slow feature store query
- **Optimization**: Identify N+1 queries (100 DB calls for 1 prediction → fix with batching)
- **Root Cause Analysis**: Trace shows exactly which span failed and why

**Post-Silicon Application:**
- **Scenario**: STDF processing API latency spike from 200ms → 3s
- **Trace Investigation**:
  - Span 1: API Gateway (10ms) ✅
  - Span 2: Authentication (5ms) ✅
  - Span 3: Feature Store Query (15ms) ✅
  - Span 4: STDF Parser (2950ms) ❌ **bottleneck identified!**
  - Span 5: Yield Prediction Model (25ms) ✅
  - Span 6: Database Write (12ms) ✅
- **Root Cause**: STDF parser loading full 10GB file into memory (OOM thrashing)
- **Fix**: Streaming parser (process 1MB chunks) → latency reduced to 180ms
- **Result**: 60% faster debugging (MTTR from 2 hours → 45 minutes), $1.8M/year savings
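
The latency attribution in the trace investigation above can be reproduced with a few lines. This sketch assumes the six spans run sequentially with no overlap, so their durations simply add up; the tuple layout is chosen for the example.

```python
# (operation, duration in ms) taken from the trace investigation above
spans = [
    ("API Gateway", 10),
    ("Authentication", 5),
    ("Feature Store Query", 15),
    ("STDF Parser", 2950),
    ("Yield Prediction Model", 25),
    ("Database Write", 12),
]

total_ms = sum(duration for _, duration in spans)

# Rank spans by their share of end-to-end latency
for name, duration in sorted(spans, key=lambda s: s[1], reverse=True):
    print(f"{name:<24} {duration:>6} ms  {duration / total_ms:6.1%}")

bottleneck_name, bottleneck_ms = max(spans, key=lambda s: s[1])
bottleneck_share = bottleneck_ms / total_ms
print(f"\nBottleneck: {bottleneck_name} ({bottleneck_share:.1%} of {total_ms} ms)")
```

With one span responsible for nearly all of the roughly 3-second total, optimizing any other service would not move the end-to-end number in a noticeable way.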

In [None]:
# Distributed Tracing Simulation

class SpanKind(Enum):
    """Span types"""
    SERVER = "SERVER"      # Receiving request
    CLIENT = "CLIENT"      # Making request
    INTERNAL = "INTERNAL"  # Internal operation
    PRODUCER = "PRODUCER"  # Message queue producer
    CONSUMER = "CONSUMER"  # Message queue consumer

@dataclass
class Span:
    """Distributed trace span"""
    span_id: str
    trace_id: str
    parent_span_id: Optional[str]
    operation_name: str
    service_name: str
    start_time: datetime
    duration_ms: float
    kind: SpanKind
    tags: Dict[str, Any] = field(default_factory=dict)
    logs: List[Dict[str, Any]] = field(default_factory=list)
    status: str = "OK"  # OK, ERROR
    
    def end_time(self) -> datetime:
        """Calculate end time"""
        return self.start_time + timedelta(milliseconds=self.duration_ms)
    
    def add_tag(self, key: str, value: Any):
        """Add tag (metadata)"""
        self.tags[key] = value
    
    def log_event(self, message: str, level: str = "INFO"):
        """Add log event"""
        self.logs.append({
            "timestamp": datetime.now().isoformat(),
            "level": level,
            "message": message
        })

@dataclass
class Trace:
    """Complete distributed trace"""
    trace_id: str
    spans: List[Span] = field(default_factory=list)
    
    def add_span(self, span: Span):
        """Add span to trace"""
        self.spans.append(span)
    
    def get_root_span(self) -> Optional[Span]:
        """Get root span (no parent)"""
        for span in self.spans:
            if span.parent_span_id is None:
                return span
        return None
    
    def get_critical_path(self) -> List[Span]:
        """Get critical path (longest latency chain)"""
        # Simplified: return spans sorted by start time
        return sorted(self.spans, key=lambda s: s.start_time)
    
    def total_duration_ms(self) -> float:
        """Calculate total trace duration"""
        if not self.spans:
            return 0.0
        root = self.get_root_span()
        return root.duration_ms if root else 0.0
    
    def visualize(self):
        """Visualize trace as waterfall"""
        print(f"\n{'=' * 70}")
        print(f"Trace ID: {self.trace_id}")
        print(f"Total Duration: {self.total_duration_ms():.2f}ms")
        print(f"{'=' * 70}")
        
        root = self.get_root_span()
        if not root:
            return
        
        # Print spans in timeline order
        for span in self.get_critical_path():
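            indent = "  " * (1 if span.parent_span_id else 0)
            # Scale each bar against the root span; a zero-duration root yields an empty bar
            ratio = span.duration_ms / root.duration_ms if root.duration_ms else 0.0
            bar_length = int(ratio * 50)
            bar = "█" * bar_length
            
            status_icon = "✅" if span.status == "OK" else "❌"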
            print(f"\n{status_icon} {indent}{span.service_name}: {span.operation_name}")
            print(f"   {indent}Duration: {span.duration_ms:.2f}ms {bar}")
            
            # Print important tags
            if "error" in span.tags:
                print(f"   {indent}Error: {span.tags['error']}")
            if "db.statement" in span.tags:
                query = span.tags['db.statement'][:50] + "..." if len(span.tags['db.statement']) > 50 else span.tags['db.statement']
                print(f"   {indent}Query: {query}")

class Tracer:
    """Minimal in-memory tracer that mimics OpenTelemetry concepts (simulation only)"""
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.traces: Dict[str, Trace] = {}
    
    def start_trace(self, operation_name: str, trace_id: Optional[str] = None) -> Span:
        """Start new trace (root span)"""
        trace_id = trace_id or f"trace-{uuid.uuid4().hex[:16]}"
        span_id = f"span-{uuid.uuid4().hex[:8]}"
        
        span = Span(
            span_id=span_id,
            trace_id=trace_id,
            parent_span_id=None,
            operation_name=operation_name,
            service_name=self.service_name,
            start_time=datetime.now(),
            duration_ms=0.0,
            kind=SpanKind.SERVER
        )
        
        if trace_id not in self.traces:
            self.traces[trace_id] = Trace(trace_id=trace_id)
        
        return span
    
    def start_span(self, operation_name: str, parent_span: Span, kind: SpanKind = SpanKind.INTERNAL) -> Span:
        """Start child span"""
        span_id = f"span-{uuid.uuid4().hex[:8]}"
        
        span = Span(
            span_id=span_id,
            trace_id=parent_span.trace_id,
            parent_span_id=parent_span.span_id,
            operation_name=operation_name,
            service_name=self.service_name,
            start_time=datetime.now(),
            duration_ms=0.0,
            kind=kind
        )
        
        return span
    
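    def end_span(self, span: Span, duration_ms: float):
        """End span and record it, creating the trace entry if this tracer has not seen it yet"""
        span.duration_ms = duration_ms
        # Child spans may belong to a trace that was started by another service's tracer
        trace = self.traces.setdefault(span.trace_id, Trace(trace_id=span.trace_id))
        trace.add_span(span)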
    
    def get_trace(self, trace_id: str) -> Optional[Trace]:
        """Retrieve trace"""
        return self.traces.get(trace_id)

# Example 1: STDF Processing API Trace (Normal Request)
print("=" * 70)
print("Example 1: STDF Processing API Trace (Normal Request)")
print("=" * 70)

api_tracer = Tracer(service_name="api-gateway")
parser_tracer = Tracer(service_name="stdf-parser")
ml_tracer = Tracer(service_name="ml-model-serving")
db_tracer = Tracer(service_name="postgres")

# Simulate request flow
print("\n📊 Processing STDF file: wafer_test_2024.stdf (500MB)")
print("   Trace: API → Parser → ML Model → Database")

# Span 1: API Gateway
api_span = api_tracer.start_trace(operation_name="POST /api/v1/process-stdf")
api_span.add_tag("http.method", "POST")
api_span.add_tag("http.url", "/api/v1/process-stdf")
api_span.add_tag("file.name", "wafer_test_2024.stdf")
api_span.add_tag("file.size_mb", 500)
time.sleep(0.01)
api_tracer.end_span(api_span, duration_ms=10.0)

# Span 2: Authentication
auth_span = api_tracer.start_span("authenticate_user", parent_span=api_span, kind=SpanKind.INTERNAL)
auth_span.add_tag("user.id", "user-12345")
auth_span.add_tag("auth.method", "jwt")
api_tracer.end_span(auth_span, duration_ms=5.0)

# Span 3: Feature Store Query
feature_span = api_tracer.start_span("fetch_device_metadata", parent_span=api_span, kind=SpanKind.CLIENT)
feature_span.add_tag("feature_store.query", "SELECT * FROM device_metadata WHERE device_id = 'DEV-789'")
api_tracer.end_span(feature_span, duration_ms=15.0)

# Span 4: STDF Parser (longest operation)
parser_span = parser_tracer.start_span("parse_stdf_file", parent_span=api_span, kind=SpanKind.INTERNAL)
parser_span.add_tag("parser.file_size_mb", 500)
parser_span.add_tag("parser.format", "STDF-V4")
parser_span.add_tag("parser.records_parsed", 1500000)
parser_tracer.end_span(parser_span, duration_ms=150.0)

# Span 5: ML Model Inference
ml_span = ml_tracer.start_span("predict_yield", parent_span=api_span, kind=SpanKind.INTERNAL)
ml_span.add_tag("model.name", "yield_predictor")
ml_span.add_tag("model.version", "v2.1")
ml_span.add_tag("prediction.result", 0.87)
ml_tracer.end_span(ml_span, duration_ms=25.0)

# Span 6: Database Write
db_span = db_tracer.start_span("insert_results", parent_span=api_span, kind=SpanKind.CLIENT)
db_span.add_tag("db.system", "postgresql")
db_span.add_tag("db.statement", "INSERT INTO yield_predictions (wafer_id, predicted_yield) VALUES ('W-456', 0.87)")
db_tracer.end_span(db_span, duration_ms=12.0)

# Update root span duration (sum of all operations)
api_span.duration_ms = 217.0  # Total request time

# Collect all spans into single trace
trace_id = api_span.trace_id
complete_trace = Trace(trace_id=trace_id)
complete_trace.add_span(api_span)
complete_trace.add_span(auth_span)
complete_trace.add_span(feature_span)
complete_trace.add_span(parser_span)
complete_trace.add_span(ml_span)
complete_trace.add_span(db_span)

# Visualize trace
complete_trace.visualize()

print("\n✅ Analysis: STDF parser takes 69% of total time (150ms / 217ms)")
print("   Optimization: Implement streaming parser to reduce latency")

# Example 2: Slow Request with Bottleneck
print("\n\n" + "=" * 70)
print("Example 2: STDF Processing API Trace (Slow Request - Bottleneck)")
print("=" * 70)

print("\n📊 Processing STDF file: large_wafer_test.stdf (10GB)")
print("   Trace: API → Parser → ML Model → Database")
print("   ⚠️  Parser experiencing OOM issues (loading full file into memory)")

# Create slow trace
slow_trace_id = f"trace-{uuid.uuid4().hex[:16]}"

# Span 1: API Gateway
slow_api_span = Span(
    span_id=f"span-{uuid.uuid4().hex[:8]}",
    trace_id=slow_trace_id,
    parent_span_id=None,
    operation_name="POST /api/v1/process-stdf",
    service_name="api-gateway",
    start_time=datetime.now(),
    duration_ms=3050.0,  # Very slow!
    kind=SpanKind.SERVER,
    tags={"http.method": "POST", "file.size_mb": 10000}
)

# Span 2: Parser (bottleneck!)
slow_parser_span = Span(
    span_id=f"span-{uuid.uuid4().hex[:8]}",
    trace_id=slow_trace_id,
    parent_span_id=slow_api_span.span_id,
    operation_name="parse_stdf_file",
    service_name="stdf-parser",
    start_time=datetime.now(),
    duration_ms=2950.0,  # 96.7% of total time!
    kind=SpanKind.INTERNAL,
    tags={
        "parser.file_size_mb": 10000,
        "parser.memory_usage_gb": 12.5,
        "error": "OutOfMemoryError: Java heap space (loading full 10GB file)"
    },
    status="ERROR"
)
slow_parser_span.log_event("OOM while parsing large STDF file", level="ERROR")

# Span 3: ML Model
slow_ml_span = Span(
    span_id=f"span-{uuid.uuid4().hex[:8]}",
    trace_id=slow_trace_id,
    parent_span_id=slow_api_span.span_id,
    operation_name="predict_yield",
    service_name="ml-model-serving",
    start_time=datetime.now(),
    duration_ms=30.0,
    kind=SpanKind.INTERNAL,
    tags={"model.name": "yield_predictor"}
)

# Span 4: Database
slow_db_span = Span(
    span_id=f"span-{uuid.uuid4().hex[:8]}",
    trace_id=slow_trace_id,
    parent_span_id=slow_api_span.span_id,
    operation_name="insert_results",
    service_name="postgres",
    start_time=datetime.now(),
    duration_ms=15.0,
    kind=SpanKind.CLIENT,
    tags={"db.system": "postgresql"}
)

slow_trace = Trace(trace_id=slow_trace_id)
slow_trace.add_span(slow_api_span)
slow_trace.add_span(slow_parser_span)
slow_trace.add_span(slow_ml_span)
slow_trace.add_span(slow_db_span)

slow_trace.visualize()

print("\n❌ Bottleneck Identified: STDF parser takes 2950ms (96.7% of total 3050ms)")
print("   Root Cause: OutOfMemoryError - loading full 10GB file into memory")
print("   Fix: Implement streaming parser (process 1MB chunks)")
print("   Expected Improvement: 2950ms → 180ms (94% reduction)")

# Example 3: Multi-Service ML Pipeline Trace
print("\n\n" + "=" * 70)
print("Example 3: Multi-Service ML Pipeline Trace")
print("=" * 70)

print("\n📊 ML Training Pipeline: Data → Preprocess → Train → Validate → Deploy")

pipeline_trace_id = f"trace-{uuid.uuid4().hex[:16]}"
# Generate the root span ID up front so child spans reference their real parent
pipeline_root_span_id = f"span-{uuid.uuid4().hex[:8]}"

pipeline_spans = [
    Span(
        span_id=pipeline_root_span_id, trace_id=pipeline_trace_id, parent_span_id=None,
        operation_name="ml_training_pipeline", service_name="pipeline-orchestrator",
        start_time=datetime.now(), duration_ms=125000.0, kind=SpanKind.SERVER,
        tags={"pipeline.name": "yield_predictor_retrain"}
    ),
    Span(
        span_id=f"span-{uuid.uuid4().hex[:8]}", trace_id=pipeline_trace_id, parent_span_id=pipeline_root_span_id,
        operation_name="fetch_training_data", service_name="data-service",
        start_time=datetime.now(), duration_ms=15000.0, kind=SpanKind.CLIENT,
        tags={"data.source": "s3://ml-stdf-data", "data.size_gb": 50}
    ),
    Span(
        span_id=f"span-{uuid.uuid4().hex[:8]}", trace_id=pipeline_trace_id, parent_span_id=pipeline_root_span_id,
        operation_name="preprocess_features", service_name="preprocessing-service",
        start_time=datetime.now(), duration_ms=25000.0, kind=SpanKind.INTERNAL,
        tags={"features.count": 120, "samples.count": 1500000}
    ),
    Span(
        span_id=f"span-{uuid.uuid4().hex[:8]}", trace_id=pipeline_trace_id, parent_span_id=pipeline_root_span_id,
        operation_name="train_model", service_name="training-service",
        start_time=datetime.now(), duration_ms=75000.0, kind=SpanKind.INTERNAL,
        tags={"model.type": "RandomForest", "epochs": 100, "gpu.count": 4}
    ),
    Span(
        span_id=f"span-{uuid.uuid4().hex[:8]}", trace_id=pipeline_trace_id, parent_span_id=pipeline_root_span_id,
        operation_name="validate_model", service_name="validation-service",
        start_time=datetime.now(), duration_ms=8000.0, kind=SpanKind.INTERNAL,
        tags={"accuracy": 0.932, "precision": 0.905, "recall": 0.918}
    ),
    Span(
        span_id=f"span-{uuid.uuid4().hex[:8]}", trace_id=pipeline_trace_id, parent_span_id=pipeline_root_span_id,
        operation_name="deploy_model", service_name="deployment-service",
        start_time=datetime.now(), duration_ms=2000.0, kind=SpanKind.CLIENT,
        tags={"deployment.target": "kubernetes", "replicas": 3}
    )
]

pipeline_trace = Trace(trace_id=pipeline_trace_id)
for span in pipeline_spans:
    pipeline_trace.add_span(span)

pipeline_trace.visualize()

print("\n✅ Analysis: Training takes 60% of pipeline time (75s / 125s total)")
print("   Optimization: Distribute training across 8 GPUs (expected 50% reduction)")
print("\n✅ Distributed tracing demonstrated: Spans, traces, bottleneck identification!")
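
The bottleneck analysis printed in the examples can also be computed rather than read off the waterfall. The sketch below is a standalone illustration that uses plain tuples instead of the `Span` class; `rank_child_spans` is a hypothetical helper, and the durations are copied from Example 2.

```python
from typing import List, Optional, Tuple

# (operation, duration_ms, parent_operation) rows mirroring Example 2
SpanRow = Tuple[str, float, Optional[str]]

def rank_child_spans(spans: List[SpanRow]) -> List[Tuple[str, float, float]]:
    """Return (operation, duration_ms, share_of_root) for child spans, slowest first."""
    roots = [s for s in spans if s[2] is None]
    if len(roots) != 1 or roots[0][1] <= 0:
        raise ValueError("expected exactly one root span with a positive duration")
    root_duration = roots[0][1]
    children = [s for s in spans if s[2] is not None]
    ranked = sorted(children, key=lambda s: s[1], reverse=True)
    return [(name, dur, dur / root_duration) for name, dur, _ in ranked]

slow_request: List[SpanRow] = [
    ("POST /api/v1/process-stdf", 3050.0, None),
    ("parse_stdf_file", 2950.0, "POST /api/v1/process-stdf"),
    ("predict_yield", 30.0, "POST /api/v1/process-stdf"),
    ("insert_results", 15.0, "POST /api/v1/process-stdf"),
]

for name, duration_ms, share in rank_child_spans(slow_request):
    print(f"{name:<20} {duration_ms:>8.1f}ms  {share:6.1%}")
```

This only ranks direct durations; a true critical-path analysis would also account for spans that run in parallel.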

## 5. 📋 Real-World Projects: Observability in Production

### Project 1: Complete ML Observability Stack 🔍
**Objective:** Build end-to-end observability platform for multi-tenant ML infrastructure

**Business Value:** $2.5M/year from 70% reduction in MTTR and 40% cost optimization

**Features to Implement:**
- **Metrics Collection:**
  - Prometheus exporters for ML models (prediction latency, accuracy, throughput)
  - Custom exporters for STDF parsing (file processing rate, parsing errors)
  - Infrastructure metrics (CPU, memory, GPU utilization, disk I/O)
  - Business metrics (cost per prediction, revenue per model, SLA compliance %)
- **Distributed Tracing:**
  - OpenTelemetry SDK integration in all microservices
  - Jaeger backend for trace storage and visualization
  - Context propagation across HTTP, gRPC, message queues
  - Trace sampling (1% production traffic, 100% errors)
- **Log Aggregation:**
  - Structured logging (JSON format with trace_id, user_id, request_id)
  - Loki/ELK stack for log storage and search
  - Correlation between logs, metrics, traces (jump from graph to logs)
- **Dashboards and Alerting:**
  - Grafana dashboards (model performance, infrastructure, business KPIs)
  - Alert rules (P99 latency > SLA, accuracy < 90%, error rate > 1%)
  - PagerDuty integration for on-call escalation
  - Slack notifications for non-critical alerts
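
As a starting point for the structured-logging item, the standard library alone can emit one JSON object per log line with correlation IDs attached. The `JsonFormatter` class and the field names below are illustrative assumptions; a production setup would more likely rely on an existing JSON logging library and pull `trace_id` from the active span.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    CONTEXT_FIELDS = ("trace_id", "request_id", "user_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy correlation IDs passed through `extra=` onto the JSON payload
        for key in self.CONTEXT_FIELDS:
            value = getattr(record, key, None)
            if value is not None:
                payload[key] = value
        return json.dumps(payload)

logger = logging.getLogger("stdf-parser")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info(
    "parsed STDF file",
    extra={"trace_id": "trace-4f2a9c1d", "request_id": "req-001", "user_id": "user-12345"},
)
```

With `trace_id` present on every line, a log search result can link straight to the matching trace.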

**Tech Stack:** Prometheus, Grafana, OpenTelemetry, Jaeger, Loki, Alertmanager, Python

**Post-Silicon Application:**
- Monitor STDF processing pipeline (parsing latency, yield prediction accuracy)
- Trace slow requests to identify bottlenecks (parser OOM, database timeout)
- Alert on model drift (accuracy drop from 93% → 85%)
- Dashboard showing cost per wafer analyzed ($0.12 → $0.08 after optimization)

**Success Metrics:**
- MTTR reduced from 2 hours → 35 minutes (70% improvement)
- 100% of production services instrumented (metrics + traces + logs)
- P99 query latency < 50ms (Grafana dashboards)
- Cost optimization: $40K/month savings from rightsizing infrastructure

---

### Project 2: SLI/SLO/Error Budget Framework 📊
**Objective:** Implement Site Reliability Engineering (SRE) practices for ML platform

**Business Value:** $1.8M/year from improved reliability and reduced incident costs

**Features to Implement:**
- **Service Level Indicators (SLIs):**
  - Availability: % of successful requests (target: 99.9% = 43 min downtime/month)
  - Latency: P99 prediction latency (target: <200ms)
  - Accuracy: Model accuracy on validation set (target: >90%)
  - Throughput: Predictions per second (target: >1000 req/sec)
- **Service Level Objectives (SLOs):**
  - Define SLOs per service (API: 99.9% availability, Model: 90% accuracy)
  - Multi-window SLOs (7-day, 30-day) to track trends
  - Composite SLOs (availability AND latency AND accuracy)
- **Error Budgets:**
  - Calculate allowed downtime (99.9% SLO ≈ 43 min/month error budget)
  - Track burn rate (how fast error budget consumed)
  - Alerts when 50% of error budget consumed (proactive intervention)
  - Freeze deployments when error budget exhausted (protect reliability)
- **Dashboards:**
  - SLO compliance dashboard (current status, trend, burn rate)
  - Error budget dashboard (remaining budget, days until exhausted)
  - Incident impact dashboard (downtime per incident, cost per incident)
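
A request-based availability SLI and its error budget reduce to a few lines of arithmetic. The sketch below assumes request counts are already available (for example from Prometheus counters); `SloReport`, `evaluate_availability_slo`, and `downtime_budget_minutes` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    allowed_failures: float  # error budget expressed in requests
    budget_consumed: float   # fraction of the budget already spent
    slo_met: bool

def evaluate_availability_slo(total_requests: int, failed_requests: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against its SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = (1.0 - slo_target) * total_requests
    budget_consumed = failed_requests / allowed_failures
    return SloReport(sli, allowed_failures, budget_consumed, sli >= slo_target)

def downtime_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based view of the same budget: minutes of full outage the SLO tolerates."""
    return (1.0 - slo_target) * window_days * 24 * 60

report = evaluate_availability_slo(total_requests=10_000_000, failed_requests=6_000, slo_target=0.999)
print(f"SLI={report.sli:.4%}  budget used={report.budget_consumed:.0%}  met={report.slo_met}")
print(f"99.9% over 30 days allows {downtime_budget_minutes(0.999):.1f} min of downtime")
```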

**Tech Stack:** Prometheus, Grafana, Sloth (SLO generator), Python (burn rate calculator)

**Post-Silicon Application:**
- SLI: STDF parsing success rate (target: 99.5%)
- SLO: Yield prediction latency P99 < 150ms (99% of 30-day window)
- Error Budget: 216 min/month (3.6 hours) of downtime allowed (99.5% target, 30-day month)
- Alert: Error budget 70% consumed in 7 days → defer non-critical deployments

**Success Metrics:**
- 99.95% availability achieved (exceeded 99.9% SLO)
- 0 SLO violations in production (proactive error budget management)
- 60% reduction in severity-1 incidents
- Error budget dashboard used in 100% of deployment decisions

---

### Project 3: Model Performance Monitoring and Drift Detection 🤖
**Objective:** Monitor ML model quality in production and trigger retraining on drift

**Business Value:** $1.2M/year from preventing accuracy degradation and automated retraining

**Features to Implement:**
- **Model Quality Metrics:**
  - Accuracy, precision, recall, F1 per model version
  - Confusion matrix metrics (false positives, false negatives)
  - Calibration metrics (predicted probability vs actual outcome)
  - Business metrics (cost of false positives, revenue from true positives)
- **Data Drift Detection:**
  - Track input feature distributions (mean, std, quantiles)
  - Detect distribution shifts (KL divergence, Wasserstein distance)
  - Alert when feature drift > threshold (e.g., voltage mean shifted 5%)
- **Concept Drift Detection:**
  - Monitor prediction accuracy over sliding windows (7-day, 30-day)
  - Detect gradual accuracy decay (93% → 89% over 2 weeks)
  - Alert when accuracy < threshold for sustained period (90% for 3 days)
- **Automated Retraining:**
  - Trigger retraining pipeline when drift detected
  - A/B test new model vs current model (champion/challenger)
  - Auto-promote new model if accuracy improvement > 2%
  - Rollback if new model accuracy < champion - 1%
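
Feature drift via KL divergence can be sketched with the standard library by binning a reference sample and a production sample into the same buckets. The voltage values, bucket edges, and `DRIFT_THRESHOLD` below are made-up illustration values, and the add-one smoothing is one simple way (an assumption here) to keep the log ratio defined for empty buckets.

```python
import math
import random
from typing import List, Sequence

def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Bin values into len(edges)-1 buckets and return smoothed probabilities."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    # Add-one smoothing keeps every bucket non-zero so the log ratio is defined
    total = sum(counts) + len(counts)
    return [(c + 1) / total for c in counts]

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(P || Q) in nats for two discrete distributions over the same buckets."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

rng = random.Random(42)
reference = [rng.gauss(1.20, 0.05) for _ in range(5000)]  # training-time voltage (synthetic)
same_dist = [rng.gauss(1.20, 0.05) for _ in range(5000)]  # production, no drift
shifted = [rng.gauss(1.26, 0.05) for _ in range(5000)]    # production, mean shifted by 5%

edges = [1.0 + 0.02 * i for i in range(26)]               # 25 buckets from 1.0 to 1.5
ref_hist = histogram(reference, edges)
DRIFT_THRESHOLD = 0.1  # hypothetical; tune on historical data

for name, sample in [("same", same_dist), ("shifted", shifted)]:
    score = kl_divergence(histogram(sample, edges), ref_hist)
    print(f"{name}: KL={score:.3f} drift={score > DRIFT_THRESHOLD}")
```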

**Tech Stack:** Prometheus, Grafana, Evidently AI (drift detection), MLflow, Python

**Post-Silicon Application:**
- Monitor yield predictor accuracy (93% → 87% over 2 weeks = concept drift)
- Detect feature drift (voltage distribution shifted due to new test equipment)
- Trigger retraining with recent 30 days data
- Deploy new model v2.2 (accuracy 92%), retire v2.1

**Success Metrics:**
- Drift detected within 48 hours (prevent prolonged accuracy degradation)
- Automated retraining triggered 8 times/year (before manual intervention needed)
- Model accuracy maintained >90% (prevented 12% drop without monitoring)
- 50% reduction in model maintenance time (automated vs manual retraining)

---

### Project 4: Cost Attribution and Optimization 💰
**Objective:** Track and optimize infrastructure costs per team, model, and workload

**Business Value:** $950K/year from 35% infrastructure cost reduction

**Features to Implement:**
- **Cost Metrics Collection:**
  - Track CPU/memory/GPU hours per pod (Kubernetes metrics)
  - Calculate cost per resource (GPU: $2.5/hour, CPU: $0.05/hour)
  - Attribute costs to labels (team, model, environment)
  - Track storage costs (S3, EBS volumes per team)
- **Cost Dashboards:**
  - Team-level cost breakdown (Team A: $45K/month, Team B: $32K/month)
  - Model-level cost (yield_predictor: $0.08/1000 predictions)
  - Environment cost (production: 60%, staging: 25%, dev: 15%)
  - Trend analysis (cost increasing 15% month-over-month → investigate)
- **Cost Optimization Alerts:**
  - Alert when team exceeds budget (Team A: $50K/month limit, spent $52K)
  - Detect idle resources (GPU node 10% utilized for 7 days → downsize)
  - Identify over-provisioned workloads (pod requests 8GB, uses 2GB → rightsizing)
  - Spot instance opportunities (batch jobs can use spot → 70% cost savings)
- **Optimization Actions:**
  - Auto-scale down idle resources (0 requests for 1 hour → scale to 0)
  - Recommend rightsizing (pod using 25% CPU → reduce from 4 cores to 1)
  - Migrate to spot instances (batch training jobs → 70% savings)
  - Implement caching (reduce redundant feature queries → 40% cost reduction)
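
Cost attribution by label is essentially a grouped sum over resource-hours multiplied by a rate. The sketch below uses the hourly rates quoted above; the `PodUsage` records are invented sample data and the 50% memory rule in `is_overprovisioned` is a hypothetical threshold, not a recommendation from any tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Hourly rates quoted in the list above
GPU_RATE_USD = 2.50
CPU_RATE_USD = 0.05

@dataclass
class PodUsage:
    team: str
    gpu_hours: float
    cpu_core_hours: float
    memory_request_gb: float
    memory_used_gb: float

def cost_by_team(pods: List[PodUsage]) -> Dict[str, float]:
    """Sum compute cost per team label."""
    totals: Dict[str, float] = defaultdict(float)
    for pod in pods:
        totals[pod.team] += pod.gpu_hours * GPU_RATE_USD + pod.cpu_core_hours * CPU_RATE_USD
    return dict(totals)

def is_overprovisioned(pod: PodUsage, max_ratio: float = 0.5) -> bool:
    """Flag pods whose memory use is at most max_ratio of their request (hypothetical rule)."""
    return pod.memory_used_gb <= max_ratio * pod.memory_request_gb

pods = [
    PodUsage("team-a", gpu_hours=100, cpu_core_hours=2000, memory_request_gb=8, memory_used_gb=2),
    PodUsage("team-a", gpu_hours=0, cpu_core_hours=1000, memory_request_gb=4, memory_used_gb=3.5),
    PodUsage("team-b", gpu_hours=40, cpu_core_hours=500, memory_request_gb=16, memory_used_gb=12),
]
print(cost_by_team(pods))
print([is_overprovisioned(p) for p in pods])
```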

**Tech Stack:** Prometheus, Grafana, Kubecost, Python (cost calculator), AWS Cost Explorer

**Post-Silicon Application:**
- Track STDF parsing costs (Team A: 50K files/month = $12K compute cost)
- Identify idle ML training infrastructure ($8K/month wasted on 0% utilized GPUs)
- Optimize: Downsize yield predictor (4 replicas → 2), migrate batch jobs to spot
- Result: $18K/month savings (35% reduction)

**Success Metrics:**
- 100% cost visibility (every workload has cost attribution)
- 35% infrastructure cost reduction ($950K/year savings)
- Team budgets enforced (0 budget overruns after Q1)
- 90% of optimization recommendations implemented

---

### Project 5: Multi-Region Observability and Disaster Recovery 🌍
**Objective:** Implement observability across multi-region deployment for DR and failover

**Business Value:** $720K/year from preventing revenue loss during regional outages

**Features to Implement:**
- **Multi-Region Metrics:**
  - Collect metrics from us-west-2, us-east-1, eu-central-1 (3 regions)
  - Centralized Prometheus federation (aggregate metrics from all regions)
  - Per-region dashboards (latency, throughput, error rate by region)
  - Cross-region comparison (detect region-specific issues)
- **Distributed Tracing Across Regions:**
  - Trace requests across regions (user in EU → API in us-west-2 → DB in us-east-1)
  - Identify cross-region latency (network hop adds 150ms)
  - Optimize data locality (serve EU users from eu-central-1)
- **Health Checks and Failover:**
  - Monitor regional health (API availability, DB replication lag)
  - Automated failover when region unhealthy (us-west-2 down → route to us-east-1)
  - Synthetic monitoring (canary requests every 30s to detect issues)
  - Runbook automation (failover triggered automatically, not manual)
- **Incident Correlation:**
  - Detect AWS region outage (all metrics from us-west-2 stopped)
  - Correlate with AWS status page (us-west-2 EC2 degraded performance)
  - Automatic failover to us-east-1 (minimize downtime)
  - Post-mortem dashboard (impact: 12 min downtime, 1500 failed requests)
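
The failover decision can be reduced to a small rule: mark a region unhealthy after a run of consecutive failed synthetic checks, then route to the first healthy region in priority order. The class and function names below are hypothetical, and the threshold of three failures is an assumed value; actual DNS changes (for example through Route53) are outside this sketch.

```python
from collections import deque
from typing import Deque, Dict, List, Optional

class RegionHealth:
    """Track recent synthetic-check results for one region."""

    def __init__(self, name: str, failure_threshold: int = 3):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recent: Deque[bool] = deque(maxlen=failure_threshold)

    def record(self, check_passed: bool) -> None:
        self.recent.append(check_passed)

    @property
    def healthy(self) -> bool:
        # Unhealthy only after `failure_threshold` consecutive failed checks
        return not (len(self.recent) == self.failure_threshold and not any(self.recent))

def choose_region(preferred_order: List[str], regions: Dict[str, RegionHealth]) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for name in preferred_order:
        if regions[name].healthy:
            return name
    return None

regions = {name: RegionHealth(name) for name in ("us-west-2", "us-east-1")}
order = ["us-west-2", "us-east-1"]

for _ in range(3):                      # three canary checks, 30 s apart in production
    regions["us-west-2"].record(False)  # primary region stops responding
    regions["us-east-1"].record(True)

print(choose_region(order, regions))
```

Requiring several consecutive failures trades a slightly slower failover for fewer flaps on a single dropped canary request.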

**Tech Stack:** Prometheus (federated), Grafana, Thanos (long-term metrics storage), Route53 (DNS failover)

**Post-Silicon Application:**
- Multi-region STDF processing (process US wafer tests in us-west-2, Asia in ap-southeast-1)
- Detect us-west-2 outage (parsing latency spiked to 5s, then metrics stopped)
- Automatic failover to us-east-1 (DNS updated in 60 seconds)
- Result: 2 min downtime vs 45 min manual failover (95% improvement)

**Success Metrics:**
- 99.99% multi-region availability (52 min downtime/year across all regions)
- <5 min failover time (automated vs 45 min manual)
- 0 data loss during failover (replication lag <30 seconds)
- $720K/year revenue protection from prevented outages

---

### Project 6: Real-Time Anomaly Detection and Alerting ⚠️
**Objective:** Build ML-powered anomaly detection to identify issues before they impact users

**Business Value:** $650K/year from proactive incident prevention

**Features to Implement:**
- **Baseline Learning:**
  - Learn normal patterns (latency baseline: P99 = 85ms ± 10ms)
  - Seasonal patterns (traffic spikes during business hours, low at night)
  - Weekly patterns (load higher Mon-Fri, lower Sat-Sun)
- **Anomaly Detection Algorithms:**
  - Statistical methods (3-sigma rule, moving average)
  - ML methods (Isolation Forest, LSTM for time-series)
  - Composite anomalies (latency AND error rate both spike)
- **Smart Alerting:**
  - Suppress false positives (ignore known deployment windows)
  - Alert prioritization (latency spike + error spike = critical, latency alone = warning)
  - Alert aggregation (5 related alerts → 1 incident, not 5 pages)
  - Alert routing (model accuracy alerts → ML team, infra alerts → SRE team)
- **Incident Enrichment:**
  - Attach recent traces (show slow requests during latency spike)
  - Include relevant logs (errors from last 5 minutes)
  - Suggest runbooks (latency spike → check DB connection pool)
  - Link to dashboards (jump to Grafana for investigation)
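
The 3-sigma rule from the statistical-methods item can be sketched as a rolling baseline over recent values. `ThreeSigmaDetector` is a hypothetical class; the window size, warm-up length, and the choice to exclude flagged points from the baseline are assumptions made for this illustration.

```python
import statistics
from collections import deque
from typing import Deque, Optional

class ThreeSigmaDetector:
    """Flag values more than `n_sigma` standard deviations from a rolling baseline."""

    def __init__(self, window: int = 30, n_sigma: float = 3.0, min_samples: int = 10):
        self.history: Deque[float] = deque(maxlen=window)
        self.n_sigma = n_sigma
        self.min_samples = min_samples

    def observe(self, value: float) -> Optional[bool]:
        """Return True/False once a baseline exists, or None while still warming up."""
        verdict: Optional[bool] = None
        if len(self.history) >= self.min_samples:
            mean = statistics.fmean(self.history)
            stdev = statistics.stdev(self.history)
            verdict = abs(value - mean) > self.n_sigma * stdev
        # Only normal points update the baseline, so a spike does not mask the next one
        if not verdict:
            self.history.append(value)
        return verdict

detector = ThreeSigmaDetector()
latencies_ms = [85, 88, 82, 90, 84, 86, 83, 87, 85, 89, 84, 86, 140, 85]
flags = [detector.observe(x) for x in latencies_ms]
print(flags)
```

A fixed rolling window does not model the daily and weekly seasonality listed above; that is where forecasting-based baselines come in.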

**Tech Stack:** Prometheus, Alertmanager, Prophet (time-series forecasting), Python, PagerDuty

**Post-Silicon Application:**
- Detect gradual latency increase (P99: 85ms → 95ms → 110ms over 3 hours)
- Alert before SLA breach (trending to 200ms in 2 hours if unchecked)
- Root cause: Database connection pool exhausted (50/50 connections used)
- Fix: Increase pool size 50 → 100, latency returns to 85ms

**Success Metrics:**
- 80% of incidents detected before user impact (proactive alerting)
- 60% reduction in false positive alerts (ML-based anomaly detection)
- 45 min faster MTTR (enriched alerts with traces/logs)
- 0 SLA breaches from gradual degradation (early warnings)

---

### Project 7: Observability for ML Training Pipelines 🚂
**Objective:** Monitor ML training jobs for failures, resource usage, and optimization

**Business Value:** $480K/year from 40% reduction in training costs and faster iteration

**Features to Implement:**
- **Training Job Metrics:**
  - Training duration (hours per epoch, total training time)
  - Resource utilization (GPU usage %, memory usage GB)
  - Model metrics (training loss, validation accuracy per epoch)
  - Data metrics (samples/sec, data loading time)
- **Job Monitoring Dashboard:**
  - Active jobs (status, duration, GPU usage)
  - Job queue (pending jobs, wait time)
  - Historical analysis (average training time, success rate)
  - Cost per job (GPU hours × $2.50/hour)
- **Failure Detection:**
  - Detect training failures (OOM, NaN loss, timeout)
  - Automatic retry with adjusted config (reduce batch size if OOM)
  - Alert on repeated failures (same hyperparameters failed 3 times)
- **Optimization Recommendations:**
  - Detect inefficient jobs (GPU 30% utilized → increase batch size)
  - Identify slow data loading (GPU idle 40% of time → optimize data pipeline)
  - Recommend hyperparameter changes (learning rate too high → loss diverging)
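
Job cost and the simplest optimization hints follow directly from a handful of per-job metrics. In the sketch below, `TrainingJob` and the threshold rules in `recommendations` are hypothetical, and the two sample jobs reuse the 4-GPU, 4-hour versus 2.5-hour figures from this project.

```python
from dataclasses import dataclass
from typing import List

GPU_RATE_USD = 2.50  # per GPU-hour, as in the dashboard bullet above

@dataclass
class TrainingJob:
    name: str
    hours: float
    gpu_count: int
    avg_gpu_util: float        # 0.0-1.0
    data_wait_fraction: float  # share of wall time the GPUs waited on input

def job_cost(job: TrainingJob) -> float:
    """Cost = wall-clock hours x number of GPUs x hourly GPU rate."""
    return job.hours * job.gpu_count * GPU_RATE_USD

def recommendations(job: TrainingJob) -> List[str]:
    """Apply simple, hypothetical threshold rules to one finished job."""
    tips: List[str] = []
    if job.data_wait_fraction >= 0.40:
        tips.append("optimize data pipeline (GPUs idle waiting on input)")
    elif job.avg_gpu_util <= 0.35:
        tips.append("increase batch size (low GPU utilization)")
    return tips

before = TrainingJob("yield_predictor_retrain", hours=4.0, gpu_count=4,
                     avg_gpu_util=0.35, data_wait_fraction=0.10)
after = TrainingJob("yield_predictor_retrain", hours=2.5, gpu_count=4,
                    avg_gpu_util=0.85, data_wait_fraction=0.05)
print(job_cost(before), recommendations(before))
print(job_cost(after), recommendations(after))
```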

**Tech Stack:** Prometheus, Grafana, MLflow, TensorBoard, Kubernetes Job metrics

**Post-Silicon Application:**
- Monitor yield predictor retraining (4 hour job, 4 GPUs)
- Detect GPU underutilization (35% usage → increase batch size 64 → 128)
- Result: Training time 4 hours → 2.5 hours, cost $40 → $25 per job

**Success Metrics:**
- 95% training job success rate (detect and fix failures automatically)
- 40% reduction in training costs ($480K/year savings)
- GPU utilization increased from 45% → 85% (better resource efficiency)
- 30% faster iteration (reduced training time enables more experiments)

---

### Project 8: Compliance and Audit Trail Observability 📋
**Objective:** Implement observability for regulatory compliance (GDPR, HIPAA, SOC 2)

**Business Value:** $420K/year from automated compliance and avoided fines

**Features to Implement:**
- **Access Audit Logs:**
  - Track who accessed which data when (user_id, timestamp, data_id)
  - Log ML model predictions (input features, output, timestamp)
  - Record data modifications (CRUD operations with user attribution)
  - Immutable logs in S3 (tamper-proof evidence)
- **Compliance Metrics:**
  - Data access frequency (how often PII accessed)
  - Request handling time (GDPR: respond to data subject access requests within one month)
  - Data retention (auto-delete data after retention period)
  - Encryption status (% of data encrypted at rest/in transit)
- **Compliance Dashboards:**
  - GDPR compliance (data access requests, deletion requests, processing time)
  - HIPAA compliance (PHI access logs, encryption status, breach incidents)
  - SOC 2 compliance (access controls, security metrics, incident response)
- **Automated Compliance Reporting:**
  - Generate audit reports on-demand (CSV/PDF for auditors)
  - Evidence collection (logs, metrics, configurations)
  - Compliance attestation (automated checks, manual sign-off)
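
One way to make an audit trail tamper-evident at the application level is a hash chain, where each entry's hash covers the previous entry's hash. This is a swapped-in illustration, not the S3 immutability mechanism named above; `append_event` and `verify_chain` are hypothetical helpers and the events are invented sample data.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64

def _entry_hash(prev_hash: str, event: Dict[str, Any]) -> str:
    """Hash the previous entry's hash together with a canonical JSON form of the event."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_event(log: List[Dict[str, Any]], event: Dict[str, Any]) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({"event": event, "prev_hash": prev_hash, "hash": _entry_hash(prev_hash, event)})

def verify_chain(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; an edited, reordered, or mid-chain deleted entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True

audit_log: List[Dict[str, Any]] = []
append_event(audit_log, {"user_id": "user-12345", "action": "read",
                         "data_id": "wafer-W-456", "ts": "2024-01-15T10:00:00Z"})
append_event(audit_log, {"user_id": "user-67890", "action": "export",
                         "data_id": "wafer-W-456", "ts": "2024-01-15T10:05:00Z"})
print(verify_chain(audit_log))

audit_log[0]["event"]["user_id"] = "someone-else"  # simulate tampering
print(verify_chain(audit_log))
```

Truncating the tail of the log is not detected by this check alone; storing the latest hash somewhere separate would be needed for that.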

**Tech Stack:** Loki (logs), S3 (audit trail storage), Grafana, Python (report generator)

**Post-Silicon Application:**
- Track STDF data access (who viewed wafer test data for device XYZ)
- GDPR compliance: Data subject access request (export all predictions for user)
- Audit trail: Prove encryption at rest (100% of S3 buckets encrypted)
- Result: Passed SOC 2 audit with 90% less prep time (automated evidence)

**Success Metrics:**
- 100% audit trail coverage (every data access logged)
- <1 hour to generate compliance report (vs 1 week manual)
- 0 compliance violations (automated checks prevent issues)
- $420K/year savings from audit automation and avoided fines

## 6. 🎯 Comprehensive Takeaways: Observability Mastery

### Core Concepts

**Three Pillars of Observability:**
- **Metrics**: Time-series numerical data for trends and alerting (Prometheus, Gauge/Counter/Histogram)
- **Logs**: Structured event records for debugging (Loki/ELK, JSON format with trace_id)
- **Traces**: Request flow across services for bottleneck identification (OpenTelemetry, Jaeger)

**Prometheus Metrics:**
- **Counter**: Monotonically increasing values (total predictions, total errors)
- **Gauge**: Values that go up/down (CPU usage, model accuracy, queue length)
- **Histogram**: Latency distributions with buckets (calculate P50/P95/P99 percentiles)
- **Summary**: Quantiles pre-calculated in the client (cheap to query, but unlike histogram buckets they cannot be aggregated across instances)
- **Labels**: Dimensional data for powerful queries (`{model="v2.1", environment="production"}`)
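
To see how percentiles come out of histogram buckets, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation, which is the idea behind PromQL's `histogram_quantile()`. It is a simplified re-implementation for illustration, not Prometheus code, and the bucket counts are invented.

```python
from typing import List, Tuple

def histogram_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Finds the bucket containing the target rank and interpolates linearly inside it.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # cannot interpolate into the +Inf bucket
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound

# Cumulative latency buckets in seconds: le=0.05 saw 600 requests, le=0.1 saw 900, ...
latency_buckets = [(0.05, 600), (0.1, 900), (0.25, 990), (0.5, 1000), (float("inf"), 1000)]
for q in (0.50, 0.95, 0.99):
    print(f"P{round(q * 100)} ~ {histogram_quantile(q, latency_buckets) * 1000:.1f} ms")
```

The estimate is only as precise as the bucket boundaries, which is why bucket edges are usually placed around the SLO threshold.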

**Grafana Dashboards:**
- **Panels**: Time-series graphs, gauges, heatmaps, tables (visualize Prometheus metrics)
- **Templating**: Dynamic dashboards with variables (select model, environment, region)
- **Alerting**: Trigger notifications on threshold breaches (P99 latency > 200ms for 5 min)
- **Annotations**: Mark deployments, incidents on graphs (correlate changes with performance)

**Distributed Tracing:**
- **Spans**: Individual operations (database query, model inference, HTTP request)
- **Traces**: Collection of spans forming complete request journey
- **Context Propagation**: Pass trace_id/span_id across services (correlate distributed operations)
- **Critical Path**: Identify longest latency chain (which service caused delay)

---

### Best Practices

**Instrumentation:**
1. **Instrument at app startup**: Register metrics, initialize tracer before serving traffic
2. **Use semantic conventions**: Follow OpenTelemetry standards (http.method, db.statement)
3. **Label cardinality**: Keep label combinations <10K (avoid high-cardinality like user_id)
4. **Sampling strategy**: Trace 100% errors, 1% success (reduce overhead, maintain visibility)
5. **Structured logging**: JSON format with trace_id, request_id, user_id (correlation)
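
The "100% errors, 1% success" sampling rule can be sketched as a deterministic decision keyed on the trace ID. Because the error status is only known once a request finishes, this corresponds to a tail-based decision; `should_keep_trace` is a hypothetical helper and hashing the ID with SHA-256 is just one possible way to get a stable pseudo-random bucket.

```python
import hashlib

def should_keep_trace(trace_id: str, had_error: bool, success_rate: float = 0.01) -> bool:
    """Keep every errored trace and a deterministic fraction of successful ones."""
    if had_error:
        return True
    # Hash the trace ID to a number in [0, 1); the same ID always gets the same verdict,
    # so every component that sees this trace makes a consistent decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate

trace_ids = [f"trace-{i:08d}" for i in range(100_000)]
kept = sum(should_keep_trace(t, had_error=False) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} successful traces ({kept / len(trace_ids):.2%})")
print(should_keep_trace("trace-00000001", had_error=True))
```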

**Metrics Design:**
1. **RED method**: Rate (requests/sec), Errors (error %), Duration (latency) for services
2. **USE method**: Utilization (CPU %), Saturation (queue length), Errors for infrastructure
3. **Four Golden Signals**: Latency, traffic, errors, saturation (Google SRE)
4. **Business metrics**: Revenue per model, cost per prediction, SLA compliance
5. **Avoid gauge for counters**: Use Counter for totals (cumulative), Gauge for current state

**Alerting:**
1. **Alert on symptoms, not causes**: Alert on user-facing issues (latency, errors), not CPU
2. **Actionable alerts**: Every alert must have clear action (no FYI alerts)
3. **Severity levels**: Critical (page on-call), Warning (ticket), Info (log only)
4. **Alert fatigue**: <5 pages/week per person (tune thresholds, suppress known noise)
5. **Runbook links**: Include link to runbook in alert (accelerate resolution)

**Dashboard Design:**
1. **Top-down organization**: Business metrics → service metrics → infrastructure metrics
2. **Time ranges**: Default 1-hour view, enable 6h/24h/7d/30d selections
3. **Drill-down capability**: Click graph → filter by specific service/model
4. **SLO visibility**: Show SLO compliance, error budget remaining
5. **Multi-environment**: Separate dashboards for prod/staging/dev (avoid confusion)

**Tracing Best Practices:**
1. **Span naming**: Descriptive operation names (POST /api/v1/predict, not handler)
2. **Tag important data**: Add tags for debugging (model_version, user_id, cache_hit)
3. **Baggage for business context**: Propagate tenant_id, experiment_id across services
4. **Error spans**: Mark spans with error status, include error message/stack trace
5. **Sampling configuration**: High-value traces (100%), normal (1%), debug (10%)
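
A sampling policy of "all errors, 1% of successes" can be sketched as below. Note the assumption: whether a trace contains an error is only known once it has finished, so this models a tail-based decision (made after the trace completes), not head-based sampling. The function name `keep_trace` and the hash-based bucketing are illustrative choices.

```python
import hashlib

def keep_trace(trace_id, is_error, success_rate=0.01):
    """Decide whether to keep a finished trace.

    Errors are always kept. Successes are kept for a deterministic
    `success_rate` fraction of trace IDs, so every component that sees
    the same trace_id reaches the same decision.
    """
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate

trace_ids = [f"trace-{i}" for i in range(20_000)]
kept = sum(keep_trace(t, is_error=False) for t in trace_ids)  # roughly 1% of 20,000
```

Hashing the trace ID, instead of calling a random number generator, keeps the decision consistent across services and across retries.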

---

### Advanced Patterns

**SLI/SLO/Error Budget:**
- **SLI (Service Level Indicator)**: Quantitative measure (availability %, latency P99)
- **SLO (Service Level Objective)**: Target threshold (99.9% availability, <200ms P99)
- **Error Budget**: Allowed failure (99.9% = 43 min downtime/month)
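- **Burn Rate**: How fast the error budget is consumed relative to plan (50% used in 1 week of a 30-day window ≈ 2.1x the sustainable rate, which exhausts the budget in 14 days)
- **Alerting**: Alert when the burn rate is high (on track to exhaust the error budget in <7 days)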
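
The error budget and burn rate arithmetic can be checked with a few lines of Python. The helper names and the 30-day window are assumptions for this sketch; a burn rate of 1.0 means the budget would last exactly one window.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(budget_fraction_used, elapsed_days, window_days=30):
    """How many times faster than 'exactly on budget' the budget is burning."""
    return budget_fraction_used / (elapsed_days / window_days)

def days_until_exhausted(rate, budget_fraction_used, window_days=30):
    """Days left before the budget reaches zero if the current rate continues."""
    remaining = 1 - budget_fraction_used
    return remaining * window_days / rate

budget = error_budget_minutes(0.999)         # about 43.2 minutes per 30 days
rate = burn_rate(0.5, elapsed_days=7)        # about 2.14x
days_left = days_until_exhausted(rate, 0.5)  # about 7 more days
```

Under these assumptions, using half the budget in the first week means the whole budget is gone after two weeks unless the failure rate drops.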

**Correlation Between Signals:**
- **Metrics → Logs**: Click latency spike on graph → view logs from that timeframe
- **Metrics → Traces**: High error rate → view failing request traces
- **Logs → Traces**: Log entry with trace_id → jump to full trace in Jaeger
- **Unified UI**: Grafana Explore (query metrics, logs, traces in single pane)

**Cardinality Management:**
- **Problem**: High cardinality (user_id has 1M values) causes memory/query issues
- **Solution**: Aggregate before storing (track user_id in logs, not metrics)
- **Recording rules**: Pre-calculate aggregations (reduce query load)
- **Relabeling**: Drop high-cardinality labels at scrape time
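
The cardinality problem is multiplicative: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below makes that concrete; `worst_case_series` and the label sets are hypothetical.

```python
from math import prod

def worst_case_series(label_values):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(len(values) for values in label_values.values())

safe_labels = {
    "model": ["v2.0", "v2.1"],
    "environment": ["production", "staging", "dev"],
    "status": ["2xx", "4xx", "5xx"],
}
# Adding a user_id label with 1M distinct values multiplies everything by 1M
risky_labels = dict(safe_labels, user_id=range(1_000_000))

safe_count = worst_case_series(safe_labels)    # 2 * 3 * 3 = 18
risky_count = worst_case_series(risky_labels)  # 18 * 1,000,000 = 18,000,000
```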

**Long-Term Storage:**
- **Prometheus**: Short-term (15 days default, local SSD)
- **Thanos/Cortex**: Long-term (months/years, object storage S3)
- **Downsampling**: Store raw for 15 days, 5-min aggregates for 90 days, 1-hour for 1 year
- **Query federation**: Query Prometheus for recent, Thanos for historical
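
Downsampling can be sketched as averaging raw samples into fixed-width time buckets. This is a simplified illustration with a hypothetical `downsample` helper; long-term stores typically keep several aggregates per window (for example min, max, sum, and count) instead of only the mean.

```python
from statistics import mean

def downsample(samples, resolution_seconds):
    """Average raw (timestamp, value) samples into fixed-width time buckets."""
    buckets = {}
    for timestamp, value in samples:
        bucket_start = timestamp - (timestamp % resolution_seconds)
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# One hour of 15-second samples (240 points) with a steadily rising value
raw = [(t, float(t)) for t in range(0, 3600, 15)]
five_minute = downsample(raw, 300)  # 12 points, one per 5-minute bucket
```

Going from 240 points to 12 per hour is what makes year-long dashboard queries affordable, at the cost of losing short spikes inside each window.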

---

### Common Pitfalls

**Metrics Mistakes:**
1. ❌ **Using gauges for totals**: A Gauge can decrease, so `rate()` over it is not meaningful → Solution: Use a Counter for cumulative totals
2. ❌ **Missing labels**: Single metric for all models → Solution: Add model, version labels
3. ❌ **High cardinality**: Labels with 100K+ values (user_id) → Solution: Use logs, not metrics
4. ❌ **No histogram buckets**: Can't calculate an aggregate P99 across instances from per-instance Summary quantiles → Solution: Use Histogram with buckets
5. ❌ **Metric name conflicts**: Two services export same metric name → Solution: Add service prefix

**Alerting Mistakes:**
1. ❌ **Alert on causes**: CPU high → alert (not necessarily a problem) → Solution: Alert on user impact
2. ❌ **No duration**: Alert on single spike → Solution: Alert when condition sustained (5 min)
3. ❌ **Too many alerts**: 50 alerts/day → fatigue → Solution: Tune thresholds, suppress noise
4. ❌ **No runbooks**: Alert with no action → Solution: Link to runbook with troubleshooting steps
5. ❌ **Single threshold**: Fixed 80% CPU threshold → Solution: Dynamic baseline (detect anomalies)
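
The "no duration" pitfall is what the `for:` clause in a Prometheus alerting rule addresses. The sketch below models that behavior in plain Python so the difference between a spike and a sustained breach is easy to see; `sustained_breach` and the sample values are assumptions for illustration.

```python
def sustained_breach(samples, threshold, for_seconds):
    """Return True once the value has stayed above threshold for `for_seconds`.

    samples: list of (timestamp_seconds, value) in time order.
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
            if timestamp - breach_started >= for_seconds:
                return True
        else:
            breach_started = None  # condition cleared, reset the timer
    return False

# P99 latency in seconds, sampled once per minute
spike = [(0, 0.12), (60, 0.45), (120, 0.11), (180, 0.10)]
sustained = [(t, 0.30) for t in range(0, 361, 60)]

spike_fires = sustained_breach(spike, threshold=0.2, for_seconds=300)
sustained_fires = sustained_breach(sustained, threshold=0.2, for_seconds=300)
```

Only the second series would page anyone: the single spike clears before the 5-minute timer expires.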

**Tracing Mistakes:**
1. ❌ **Trace everything**: 100% sampling → overhead → Solution: Sample 1% normal, 100% errors
2. ❌ **No parent span**: Orphaned spans → incomplete trace → Solution: Propagate context across services
3. ❌ **Generic span names**: "process" not "parse_stdf_file" → Solution: Descriptive operation names
4. ❌ **Missing tags**: Can't filter traces → Solution: Add model_version, user_tier tags
5. ❌ **No error handling**: Exceptions don't mark span as error → Solution: Catch exceptions, record them on the span, and set the span status to error

**Dashboard Mistakes:**
1. ❌ **Too many panels**: 50 graphs on one dashboard → Solution: <12 panels per dashboard
2. ❌ **No context**: Graph with no title/units → Solution: Clear titles, units, thresholds
3. ❌ **Fixed time range**: Always 1 hour view → Solution: Enable time range picker
4. ❌ **No drill-down**: Can't investigate anomalies → Solution: Link to detailed dashboards
5. ❌ **Stale dashboards**: Graphs for deleted services → Solution: Regular dashboard cleanup

---

### Production Checklist

**Metrics:**
- [ ] All services export /metrics endpoint (Prometheus scrape)
- [ ] RED metrics for every service (rate, errors, duration)
- [ ] Business metrics (cost per prediction, revenue per model)
- [ ] Infrastructure metrics (CPU, memory, GPU, disk, network)
- [ ] Histogram buckets match SLOs (buckets include SLA threshold)
- [ ] Labels follow convention (environment, service, version)
- [ ] Cardinality <10K per metric (avoid memory issues)
- [ ] Recording rules for expensive queries (pre-aggregate)

**Tracing:**
- [ ] OpenTelemetry SDK initialized in all services
- [ ] Context propagation across HTTP, gRPC, queues
- [ ] Sampling configured (1% normal, 100% errors)
- [ ] Spans tagged with important metadata (model_version, user_tier)
- [ ] Error spans marked with status and error message
- [ ] Jaeger/Tempo backend deployed and scaled
- [ ] Trace retention configured (7 days detailed, 30 days sampled)

**Dashboards:**
- [ ] Executive dashboard (SLO compliance, error budget, costs)
- [ ] Service dashboard per team (latency, errors, throughput)
- [ ] Infrastructure dashboard (CPU, memory, GPU, disk)
- [ ] Model performance dashboard (accuracy, drift, predictions/sec)
- [ ] Alert dashboard (active alerts, incidents, MTTR)
- [ ] Multi-environment separation (prod, staging, dev)
- [ ] Template variables for drill-down (model, service, region)

**Alerting:**
- [ ] Alert rules defined (latency, errors, drift, budget)
- [ ] Severity levels configured (critical, warning, info)
- [ ] PagerDuty integration for on-call escalation
- [ ] Slack notifications for non-critical alerts
- [ ] Alert deduplication (group related alerts)
- [ ] Runbooks linked in alert messages
- [ ] Alert review process (monthly false positive cleanup)

---

### Troubleshooting Guide

**High Cardinality Issues:**
- **Problem**: Prometheus memory usage 50GB+, queries slow
  - **Solution**: Identify high-cardinality metric with `topk(10, count by (__name__) ({__name__=~".+"}))`
  - Fix: Drop high-cardinality label (user_id), move to logs
  - Alternative: Use recording rules to pre-aggregate

**Missing Metrics:**
- **Problem**: Grafana shows "No data" for metric
  - **Solution**: Check Prometheus targets (Status → Targets in Prometheus UI)
  - Verify service /metrics endpoint returns data (`curl http://service:8080/metrics`)
  - Check firewall rules (Prometheus can reach service port)
  - Verify scrape config (job name matches service label)

**Trace Gaps:**
- **Problem**: Trace missing spans from specific service
  - **Solution**: Check OpenTelemetry exporter config (endpoint, headers)
  - Verify network connectivity (service can reach Jaeger collector)
  - Check sampling (might be sampling out traces)
  - Review service logs for OpenTelemetry errors

**Alert Fatigue:**
- **Problem**: 100 alerts/day, engineers ignoring pages
  - **Solution**: Tune thresholds (latency > 200ms for 5 min, not 1 min)
  - Group related alerts (database alerts → single incident)
  - Suppress known issues (maintenance window, expected spikes)
  - Remove non-actionable alerts (CPU high without user impact)
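
Grouping related alerts into one notification can be sketched as below, loosely modeled on Alertmanager's `group_by` setting. The `group_alerts` helper and the alert names and labels are hypothetical.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bundle alerts that share the same values for the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert["alertname"])
    return dict(groups)

alerts = [
    {"alertname": "HighLatency", "service": "yield-api", "severity": "critical"},
    {"alertname": "HighErrorRate", "service": "yield-api", "severity": "critical"},
    {"alertname": "DiskFull", "service": "stdf-parser", "severity": "warning"},
]
# Three firing alerts collapse into two notifications
notifications = group_alerts(alerts, group_by=["service", "severity"])
```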

**Dashboard Performance:**
- **Problem**: Grafana dashboard takes 30 seconds to load
  - **Solution**: Reduce time range (30 days → 7 days)
  - Use recording rules for expensive queries
  - Reduce panel count (<12 panels per dashboard)
  - Cache query results (for example, Grafana query caching where available, or a caching proxy in front of Prometheus)

---

### Next Steps

**Immediate Actions:**
1. Instrument ML services with Prometheus metrics (RED method)
2. Deploy Grafana dashboards (model performance, infrastructure)
3. Add OpenTelemetry tracing to critical paths (prediction API)
4. Configure alerting (latency, accuracy, error rate)
5. Set up basic SLOs (availability, latency thresholds)

**Short-Term (1-3 Months):**
1. Implement distributed tracing across all microservices
2. Build executive dashboards (SLO compliance, costs, business KPIs)
3. Set up log aggregation (Loki/ELK with correlation to traces)
4. Deploy anomaly detection (ML-based alerting)
5. Implement error budgets and burn rate alerting

**Long-Term (3-6 Months):**
1. Multi-region observability with federated Prometheus
2. Advanced SLO framework (composite SLOs, multi-window)
3. Observability-driven development (tracing in unit tests)
4. Chaos engineering with observability (detect resilience issues)
5. ML-powered incident prediction (predict issues before they occur)

**Related Notebooks:**
- **Notebook 131**: Docker ML Containerization (container metrics collection)
- **Notebook 132-133**: Kubernetes ML Fundamentals (pod/service metrics)
- **Notebook 136**: CI/CD for ML (pipeline observability)
- **Notebook 138**: Container Security (security metrics, audit logs)
- **Next**: Logging & Distributed Tracing Deep Dive, SRE Practices for ML

---

### Key Metrics to Track

**Application Metrics:**
- **Request Rate**: Requests per second (track capacity, detect traffic spikes)
- **Latency**: P50/P95/P99 (SLA compliance, user experience)
- **Error Rate**: Errors per second, error % (reliability, incident detection)
- **Saturation**: Queue length, connection pool usage (capacity planning)

**ML Model Metrics:**
- **Predictions/sec**: Throughput (capacity planning)
- **Accuracy**: Model accuracy on validation set (drift detection)
- **Latency**: Prediction latency P99 (SLA compliance)
- **Cache Hit Rate**: Feature cache efficiency (cost optimization)

**Infrastructure Metrics:**
- **CPU/Memory/GPU**: Utilization % (rightsizing, scaling decisions)
- **Disk I/O**: Read/write IOPS (detect bottlenecks)
- **Network**: Bandwidth usage, packet loss (cross-region performance)
- **Cost**: $ per resource, per team, per model (cost optimization)

**Business Metrics:**
- **Revenue**: $ per model, $ per team (business impact)
- **Cost Efficiency**: $ per 1000 predictions (optimization tracking)
- **SLA Compliance**: % uptime, % within latency SLA (customer satisfaction)
- **MTTR**: Mean time to resolution (operational efficiency)

---

**Congratulations!** 🎉 You've mastered observability and monitoring - from Prometheus metrics to Grafana dashboards to distributed tracing. You can now build production-grade observability platforms that detect issues proactively and enable data-driven optimization! 🚀📊

## 🎯 Key Takeaways

### When to Use Observability
- **Production systems**: Any system serving real users (need to detect/debug issues fast)
- **Distributed architectures**: Microservices, service mesh (understand request flows)
- **Performance troubleshooting**: Identify bottlenecks (slow database queries, network latency)
- **Incident response**: Reduce MTTR (mean time to resolution) from hours to minutes
- **Capacity planning**: Understand resource usage trends for scaling decisions

### Limitations
- **Data volume**: High-cardinality metrics, traces, logs generate TB/day (storage costs $500-5K/month)
- **Tool fragmentation**: Prometheus + Jaeger + ELK = 3 systems to learn and maintain
- **Alert fatigue**: Too many alerts → ignored, too few → missed incidents
- **Sampling trade-offs**: Sampling traces saves costs but may miss rare bugs
- **Query complexity**: Learning PromQL, Jaeger query syntax takes time

### Alternatives
- **Application Performance Monitoring (APM)**: Datadog, New Relic all-in-one (expensive, easier)
- **Cloud-native observability**: CloudWatch, Stackdriver (vendor lock-in, integrated)
- **Logging only**: Centralized logging without metrics/traces (cheaper, less insight)
- **Basic monitoring**: Uptime checks, simple metrics (works for simple apps)

### Best Practices
- **Golden signals**: Latency, traffic, errors, saturation (start here)
- **RED method**: Rate, Errors, Duration for services (simple, effective)
- **USE method**: Utilization, Saturation, Errors for resources (CPU, memory, disk)
- **Distributed tracing**: Sample 1-10% of requests (balance cost vs. coverage)
- **Structured logging**: JSON logs with trace IDs for correlation
- **SLO-based alerting**: Alert on SLO burn rate (e.g., error budget depleting 10x faster)

## 🔍 Diagnostic Checks & Mastery

### Implementation Checklist
- ✅ **Prometheus**: Metrics collection (scrape interval 15s)
- ✅ **Grafana dashboards**: Golden signals (latency, traffic, errors, saturation)
- ✅ **Jaeger**: Distributed tracing (sample 1-10%)
- ✅ **ELK Stack**: Centralized logging (Elasticsearch, Logstash, Kibana)
- ✅ **Alerting**: AlertManager for Prometheus rules
- ✅ **SLOs**: Define service level objectives (99.9% uptime)

### Post-Silicon Applications
**ATE Test System Monitoring**: Real-time dashboards for 20 testers, detect anomalies within minutes, reduce downtime costs by $3M/year

### Mastery Achievement
✅ Deploy full observability stack (Prometheus, Grafana, Jaeger, ELK)  
✅ Create RED method dashboards for ML services  
✅ Implement distributed tracing for multi-service requests  
✅ Set up SLO-based alerting to reduce alert fatigue  
✅ Debug production issues with metrics + traces + logs  
✅ Apply to semiconductor test and fab monitoring systems  

**Next Steps**: 130_ML_Observability_Debugging, 154_Model_Monitoring_Observability

In [None]:
# prometheus-ate-exporter.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram
import time
import random

# Define ATE test metrics
ate_tests_total = Counter('ate_tests_total', 'Total tests executed', ['tester_id', 'test_name'])
ate_yield = Gauge('ate_yield_percentage', 'Current yield %', ['tester_id', 'product'])
# Histogram (not Gauge) so the _sum, _count, and _bucket series used in the queries below exist
ate_test_duration = Histogram(
    'ate_test_duration_seconds', 'Test duration', ['tester_id', 'test_name'],
    buckets=(0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0),
)
ate_tester_status = Gauge('ate_tester_status', 'Tester status (1=up, 0=down)', ['tester_id'])

def collect_ate_metrics():
    """Simulate collecting metrics from ATE tester API"""
    while True:
        for tester_id in ['ATE_001', 'ATE_002', 'ATE_003']:
            # Update metrics
            ate_tests_total.labels(tester_id=tester_id, test_name='VDD_LEAKAGE').inc()
            ate_yield.labels(tester_id=tester_id, product='ProductA').set(
                min(100.0, random.gauss(95.5, 2.0))  # 95.5% ± 2% yield, capped at 100%
            )
            ate_test_duration.labels(tester_id=tester_id, test_name='VDD_LEAKAGE').observe(
                max(0.0, random.gauss(0.35, 0.05))  # 350ms ± 50ms
            )
            ate_tester_status.labels(tester_id=tester_id).set(1)  # Tester online

        time.sleep(15)  # Update interval, chosen to match a 15s scrape interval

if __name__ == '__main__':
    start_http_server(8000)  # Expose /metrics for Prometheus (port 8000 is an arbitrary choice)
    collect_ate_metrics()

# Grafana dashboard queries:
"""
# Panel 1: Real-time Yield by Tester
avg(ate_yield_percentage) by (tester_id)

# Panel 2: Test Duration Trend (average over 5-minute windows)
sum(rate(ate_test_duration_seconds_sum[5m])) by (tester_id)
  / sum(rate(ate_test_duration_seconds_count[5m])) by (tester_id)

# Panel 3: Alert - Yield Drop
ate_yield_percentage < 90  # Alert if yield <90%

# Panel 4: Total Tests Executed (last 1 hour)
sum(increase(ate_tests_total[1h])) by (tester_id)
"""

# Post-Silicon Use Case:
# Monitor 10 ATE testers in real-time (yield, test time, uptime)
# Alert if yield drops >2% in 10 minutes → investigate test setup
# Dashboard shows bottleneck tester (longest test duration) → optimize test flow
# Save $620K/year (detect yield issues 2 hours faster × 8 incidents/year × $310K/incident)

## 🏭 Advanced Example: Custom ATE Test Monitoring Dashboard

Prometheus + Grafana for real-time ATE tester health and parametric test metrics.