# 140: Logging & Distributed Tracing - Structured Logs, ELK Stack, and Jaeger

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** structured logging vs unstructured logs (JSON with context vs plain text)
- **Implement** centralized logging with ELK stack (Elasticsearch, Logstash, Kibana)
- **Build** distributed tracing with Jaeger for multi-service ML pipelines
- **Deploy** log aggregation for ML systems (correlation IDs, trace context propagation)
- **Apply** logging to semiconductor systems (STDF processing traces, ML inference request flows)
- **Debug** production issues with logs and traces (find slow services, error root causes)

## 📚 What is Distributed Tracing?

**Distributed tracing** tracks a **single request's journey** across multiple services, capturing timing, errors, and context at each hop. Essential for debugging microservices and ML pipelines where requests touch 5-10+ services.

**Why Distributed Tracing?**
- ✅ **End-to-end visibility**: See entire request flow (API → load balancer → app → database → ML model → cache → response)
- ✅ **Performance debugging**: Identify slow services (database query 2 seconds, ML inference 500ms, serialization 100ms)
- ✅ **Error propagation**: Trace errors to origin (which service threw exception? what was input?)
- ✅ **Dependency mapping**: Understand service dependencies (service A calls B, B calls C and D)

**Structured Logging vs Unstructured:**

| Aspect | Unstructured Logs | Structured Logs (JSON) |
|--------|-------------------|------------------------|
| **Format** | Plain text: `ERROR: Model prediction failed for user 123` | JSON: `{"level":"ERROR","msg":"Prediction failed","user_id":123,"model":"v2.1","trace_id":"abc123"}` |
| **Searchability** | Regex/grep (slow, brittle) | Index fields (fast queries: `user_id:123 AND level:ERROR`) |
| **Correlation** | Manual parsing of request IDs | trace_id links all logs for one request |
| **Machine readable** | No (humans only) | Yes (parse, aggregate, visualize) |
| **Context** | Limited (text only) | Rich (user_id, model_version, latency, confidence) |
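
To make the contrast in the table concrete, here is a minimal sketch that answers the same question (which user hit the error?) from both formats. The two log lines are the ones from the table; the regex and the helper names `user_from_text` and `user_from_json` are illustrative assumptions, not part of any library.

```python
import json
import re
from typing import Optional

TEXT_LOG = "ERROR: Model prediction failed for user 123"
JSON_LOG = ('{"level":"ERROR","msg":"Prediction failed","user_id":123,'
            '"model":"v2.1","trace_id":"abc123"}')

def user_from_text(line: str) -> Optional[int]:
    """Unstructured: the field has to be recovered with a hand-written regex."""
    match = re.search(r"for user (\d+)", line)
    return int(match.group(1)) if match else None

def user_from_json(line: str) -> Optional[int]:
    """Structured: the field is addressed by name after parsing."""
    return json.loads(line).get("user_id")

print(user_from_text(TEXT_LOG))   # 123
print(user_from_json(JSON_LOG))   # 123

# A harmless rewording of the message silently breaks the regex,
# whereas a named JSON field would be unaffected by message wording.
print(user_from_text("ERROR: Model prediction failed (user=123)"))  # None
```

The brittleness in the last line is the practical reason the table rates regex/grep search as "slow, brittle".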

**OpenTelemetry (Modern Standard):**
- **Unified API**: One vendor-neutral set of APIs and SDKs for metrics, logs, and traces
- **Auto-instrumentation**: Instrumentation libraries for common frameworks (FastAPI, Flask, Django) emit spans without manual tracing code
- **Context propagation**: trace_id and span_id flow across services (HTTP headers, gRPC metadata)
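
As a rough illustration of context propagation over HTTP, the sketch below builds and parses a W3C Trace Context `traceparent` header (version `00`, a 32-hex trace ID, a 16-hex parent span ID, and 2-hex flags) using only the standard library. The function names are hypothetical; in a real service the OpenTelemetry SDK's propagators would handle this for you.

```python
import re
import secrets
from typing import Optional, Tuple

# traceparent layout: version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def _format(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    return _format(secrets.token_hex(16), secrets.token_hex(8), sampled)

def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip().lower())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)

def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace_id, freshly minted span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()  # broken or missing context: start a new trace
    trace_id, _parent_id, sampled = parsed
    return _format(trace_id, secrets.token_hex(8), sampled)

incoming = new_traceparent()
outgoing_headers = {"traceparent": child_traceparent(incoming)}
print(incoming)
print(outgoing_headers["traceparent"])
```

Both printed headers share the same trace ID but carry different span IDs, which is what lets a backend such as Jaeger stitch the hops into one trace.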

## 🏭 Post-Silicon Validation Use Cases

### **Use Case 1: Jaeger Tracing for STDF ETL Pipeline**
**Input:** Multi-stage ETL pipeline (S3 download → parse STDF → validate → transform → load to database → index)  
**Problem:** 30% of jobs fail with no visibility into which stage failed or why  
**Output:** Jaeger traces show failures at "parse STDF" stage (corrupted files), specific lot IDs identified  
**Value:** $3.9M/year from faster debugging (reduce MTTR from 4 hours to 20 minutes, roughly a 92% reduction)

### **Use Case 2: ELK Stack for ML Inference Logs**
**Input:** Yield prediction API logs scattered across 20 servers, engineers SSH to debug  
**Output:** Centralized Elasticsearch with Kibana dashboards (search by user_id, model_version, error_type)  
**Value:** $3.2M/year from improved debuggability (find all errors for model v2.1 in 10 seconds vs 1 hour)

### **Use Case 3: Correlation IDs for Wafer Map Rendering Service**
**Input:** Wafer map generation spans 6 stages (API → auth → database → ML model → image render → S3 upload)  
**Problem:** Timeouts occur but unclear which service is slow (no request correlation)  
**Output:** Correlation IDs (trace_id) link logs across services, reveal database query timeouts (95% of slow requests)  
**Value:** $2.7M/year from targeted optimization (fix database indexes, reduce P95 latency 80%)

### **Use Case 4: Log Sampling for High-Volume Test Data Processing**
**Input:** STDF processing logs 1M events/hour (100GB/day), storage costs $15K/month  
**Output:** Sample 10% of INFO logs, keep 100% of WARN/ERROR logs (10GB/day, $1.5K/month storage)  
**Value:** $2.1M/year from reduced storage costs ($13.5K/month saved, i.e. $162K/year) plus faster search performance
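
The sampling policy in Use Case 4 can be sketched as a single decision function. This is a minimal illustration, not a production sampler: the name `keep_log`, the choice to hash `trace_id` (so that every service makes the same keep/drop decision for a given request), and treating DEBUG like INFO are all assumptions added here.

```python
import hashlib

# Levels that are never sampled away.
ALWAYS_KEEP = {"WARN", "WARNING", "ERROR", "CRITICAL"}

def keep_log(level: str, trace_id: str, info_rate: float = 0.10) -> bool:
    """Keep all WARN/ERROR logs; keep roughly `info_rate` of everything else."""
    if level.upper() in ALWAYS_KEEP:
        return True
    # Map the trace_id to a stable number in [0, 1). Because the mapping is
    # deterministic, all logs of one trace are kept or dropped together.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < info_rate

kept = sum(keep_log("INFO", f"trace-{i}") for i in range(10_000))
print(f"INFO logs kept: {kept / 10_000:.1%}")
print(f"ERROR always kept: {keep_log('ERROR', 'trace-42')}")
```

Hashing the trace ID rather than calling a random number generator per log line avoids traces with holes in them, which would otherwise make a sampled request hard to follow.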

**Total Post-Silicon Value:** $3.9M + $3.2M + $2.7M + $2.1M = **$11.9M/year**

## 🔄 Distributed Tracing Workflow

```mermaid
graph LR
    A[üåê Client Request] --> B[üîÄ API Gateway]
    B --> C[üîê Auth Service]
    C --> D[üíæ Database Lookup]
    C --> E[ü§ñ ML Model Service]
    E --> F[üìä Feature Store]
    E --> G[üß† Model Inference]
    G --> H[üìù Log Prediction]
    H --> I[üì§ Response to Client]
    
    B -.trace_id=abc123.-> C
    C -.trace_id=abc123.-> D
    C -.trace_id=abc123.-> E
    E -.trace_id=abc123.-> F
    E -.trace_id=abc123.-> G
    
    D --> J[Jaeger Collector]
    F --> J
    G --> J
    H --> J
    
    J --> K[Jaeger UI]
    K --> L[👀 Visualize Trace]
    L --> M{Slow Span?}
    M -->|Yes| N[🎯 Optimize Service]
    M -->|No| O[✅ Performance Good]
    
    style A fill:#e1f5ff
    style I fill:#e1ffe1
    style N fill:#fff4e1
```

## 📊 Learning Path Context

**Prerequisites:**
- **Notebook 139: Observability & Monitoring** - Metrics and alerting foundations
- **Notebook 107: ML Model Monitoring** - Model-specific logging and metrics

**Next Steps:**
- **Notebook 141: CI/CD Pipelines** - Integrate logging into deployment pipelines
- **Notebook 144: Performance Optimization** - Use traces to identify bottlenecks

---

Let's debug distributed ML systems with logs and traces! 🚀

In [None]:
# Setup and Imports
import json
import time
import random
from datetime import datetime, timedelta
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Any, Tuple
from enum import Enum
from collections import defaultdict
import hashlib
import uuid

# Set random seed for reproducibility
random.seed(42)

## 2. 📝 Structured Logging - JSON Logs with Context

### 📝 What's Happening in This Code?

**Purpose:** Implement structured logging with JSON format for queryable, correlatable logs in ML systems.

**Key Points:**
- **Structured Format**: JSON logs with standardized fields (timestamp, level, message, trace_id, service)
- **Contextual Fields**: user_id, model_version, request_id enable powerful filtering
- **Log Levels**: DEBUG (verbose), INFO (events), WARNING (issues), ERROR (failures), CRITICAL (system down)
- **Correlation**: trace_id links logs across services, request_id groups logs for single request
- **Search Optimization**: Index key fields in Elasticsearch for fast queries (find all errors for model v2.1)

**Why This Matters:**
- **Debugging**: Search logs by trace_id to see all events for slow request
- **Analytics**: Count errors by model version (v2.1 has 15% error rate vs v2.0 1%)
- **Compliance**: Audit trail of who accessed what data when (GDPR, HIPAA)
- **Alerting**: Trigger PagerDuty when ERROR logs > 10/minute for critical services

**Post-Silicon Application:**
- **Scenario**: Debug STDF parsing failures (10% error rate for wafer lot W-12345)
- **Query**: `service:stdf-parser AND level:ERROR AND wafer_lot:W-12345` in Kibana
- **Result**: 150 error logs showing "Voltage out of range: got 20V, expected [-5V, 15V]"
- **Root Cause**: Test equipment calibration drift on wafer lot W-12345
- **Fix**: Recalibrate equipment, reprocess lot, error rate 10% → 0%
- **Value**: 80% faster debugging (3 hours → 35 minutes MTTR), $1.5M/year savings
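
The notebook's own logger below is a teaching simulation. For comparison, here is a hedged sketch of how the same JSON shape might be produced with Python's standard `logging` module. The `JsonFormatter` class and the convention of passing context under `extra={"ctx": {...}}` are assumptions made for this example, not an established API; the output goes to an in-memory buffer so the result can be inspected.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        doc = {
            "@timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": record.name,  # the logger name doubles as the service name here
            "message": record.getMessage(),
        }
        # Context passed as extra={"ctx": {...}} is attached to the record as record.ctx.
        doc.update(getattr(record, "ctx", {}))
        return json.dumps(doc)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("stdf-parser")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.error(
    "Voltage out of range",
    extra={"ctx": {"trace_id": "abc123", "wafer_lot": "W-12345", "value": 20.0}},
)
print(stream.getvalue().strip())
```

Emitting one JSON object per line keeps the output easy for a shipper such as Logstash to parse without grok patterns.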

In [None]:
# Combined Structured Logging and Distributed Tracing Implementation

class LogLevel(Enum):
    """Log severity levels"""
    DEBUG = "DEBUG"
    INFO = "INFO"
    WARNING = "WARNING"
    ERROR = "ERROR"
    CRITICAL = "CRITICAL"

@dataclass
class LogEntry:
    """Structured log entry"""
    timestamp: datetime
    level: LogLevel
    service: str
    message: str
    trace_id: Optional[str] = None
    span_id: Optional[str] = None
    fields: Dict[str, Any] = field(default_factory=dict)
    
    def to_json(self) -> str:
        """Export as JSON (Elasticsearch format)"""
        return json.dumps({
            "@timestamp": self.timestamp.isoformat(),
            "level": self.level.value,
            "service": self.service,
            "message": self.message,
            "trace_id": self.trace_id,
            "span_id": self.span_id,
            **self.fields
        }, indent=2)

class StructuredLogger:
    """Structured logger with trace correlation"""
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.logs: List[LogEntry] = []
    
    def log(self, level: LogLevel, message: str, trace_id: str = None, 
            span_id: str = None, **fields):
        """Log structured entry"""
        entry = LogEntry(
            timestamp=datetime.now(),
            level=level,
            service=self.service_name,
            message=message,
            trace_id=trace_id,
            span_id=span_id,
            fields=fields
        )
        self.logs.append(entry)
        
        # Print to console (simulating log output)
        level_emoji = {
            LogLevel.DEBUG: "🔍",
            LogLevel.INFO: "ℹ️",
            LogLevel.WARNING: "⚠️",
            LogLevel.ERROR: "❌",
            LogLevel.CRITICAL: "🔥"
        }
        print(f"{level_emoji[level]} [{entry.timestamp.strftime('%H:%M:%S')}] {self.service_name} | {message}")
        if fields:
            print(f"   Fields: {json.dumps(fields, indent=2)}")
    
    def debug(self, message: str, **fields):
        self.log(LogLevel.DEBUG, message, **fields)
    
    def info(self, message: str, **fields):
        self.log(LogLevel.INFO, message, **fields)
    
    def warning(self, message: str, **fields):
        self.log(LogLevel.WARNING, message, **fields)
    
    def error(self, message: str, **fields):
        self.log(LogLevel.ERROR, message, **fields)
    
    def critical(self, message: str, **fields):
        self.log(LogLevel.CRITICAL, message, **fields)
    
    def query_logs(self, level: LogLevel = None, trace_id: str = None, 
                   **field_filters) -> List[LogEntry]:
        """Query logs (simulating Elasticsearch query)"""
        results = self.logs
        
        if level:
            results = [log for log in results if log.level == level]
        if trace_id:
            results = [log for log in results if log.trace_id == trace_id]
        for key, value in field_filters.items():
            results = [log for log in results if log.fields.get(key) == value]
        
        return results

@dataclass
class Span:
    """Distributed trace span"""
    span_id: str
    trace_id: str
    parent_span_id: Optional[str]
    operation_name: str
    service_name: str
    start_time: datetime
    duration_ms: float = 0.0
    tags: Dict[str, Any] = field(default_factory=dict)
    status: str = "OK"

class Tracer:
    """Distributed tracer with logging integration"""
    
    def __init__(self, service_name: str, logger: StructuredLogger):
        self.service_name = service_name
        self.logger = logger
        self.spans: List[Span] = []
    
    def start_span(self, operation_name: str, trace_id: str = None, 
                   parent_span_id: str = None) -> Span:
        """Start new span"""
        trace_id = trace_id or f"trace-{uuid.uuid4().hex[:16]}"
        span_id = f"span-{uuid.uuid4().hex[:8]}"
        
        span = Span(
            span_id=span_id,
            trace_id=trace_id,
            parent_span_id=parent_span_id,
            operation_name=operation_name,
            service_name=self.service_name,
            start_time=datetime.now()
        )
        
        # Log span start
        self.logger.debug(
            f"Span started: {operation_name}",
            trace_id=trace_id,
            span_id=span_id,
            operation=operation_name
        )
        
        return span
    
    def end_span(self, span: Span, status: str = "OK", **tags):
        """End span and record"""
        span.duration_ms = (datetime.now() - span.start_time).total_seconds() * 1000
        span.status = status
        span.tags.update(tags)
        self.spans.append(span)
        
        # Log span completion
        log_level = LogLevel.ERROR if status == "ERROR" else LogLevel.INFO
        self.logger.log(
            log_level,
            f"Span completed: {span.operation_name} ({span.duration_ms:.2f}ms)",
            trace_id=span.trace_id,
            span_id=span.span_id,
            duration_ms=span.duration_ms,
            status=status,
            **tags
        )
        
        return span

# Example 1: STDF Parsing with Structured Logging
print("=" * 70)
print("Example 1: STDF Parsing with Structured Logging")
print("=" * 70)

parser_logger = StructuredLogger(service_name="stdf-parser")
parser_tracer = Tracer(service_name="stdf-parser", logger=parser_logger)

print("\n📊 Processing STDF file: wafer_lot_W12345.stdf")

# Start trace
trace_id = f"trace-{uuid.uuid4().hex[:16]}"
root_span = parser_tracer.start_span("parse_stdf_file", trace_id=trace_id)

# Log file received
parser_logger.info(
    "STDF file received for processing",
    trace_id=trace_id,
    span_id=root_span.span_id,
    file_name="wafer_lot_W12345.stdf",
    file_size_mb=500,
    wafer_lot="W-12345"
)

# Span 1: Decompress file
decompress_span = parser_tracer.start_span(
    "decompress_file",
    trace_id=trace_id,
    parent_span_id=root_span.span_id
)
time.sleep(0.05)
parser_tracer.end_span(
    decompress_span,
    compression_ratio=3.2,
    uncompressed_size_mb=1600
)

# Span 2: Parse records
parse_span = parser_tracer.start_span(
    "parse_records",
    trace_id=trace_id,
    parent_span_id=root_span.span_id
)

# Simulate parsing with some errors
for i in range(5):
    if i == 3:  # Simulate validation error
        parser_logger.error(
            "Record validation failed: Voltage out of range",
            trace_id=trace_id,
            span_id=parse_span.span_id,
            record_id=f"REC-{i}",
            parameter="voltage",
            value=20.0,
            expected_range="[-5V, 15V]",
            wafer_lot="W-12345",
            die_x=5,
            die_y=7
        )
    else:
        parser_logger.debug(
            f"Record {i} parsed successfully",
            trace_id=trace_id,
            span_id=parse_span.span_id,
            record_id=f"REC-{i}"
        )

time.sleep(0.15)
parser_tracer.end_span(
    parse_span,
    status="ERROR",
    records_parsed=5,
    records_failed=1,
    error_type="ValidationError"
)

# Span 3: Store results
store_span = parser_tracer.start_span(
    "store_results",
    trace_id=trace_id,
    parent_span_id=root_span.span_id
)
time.sleep(0.02)
parser_tracer.end_span(
    store_span,
    records_stored=4,
    database="postgresql"
)

# Complete root span
parser_tracer.end_span(
    root_span,
    status="PARTIAL_SUCCESS",
    total_records=5,
    success_count=4,
    error_count=1
)

# Query logs by trace_id
print(f"\n\n{'=' * 70}")
print("Query Logs by Trace ID (Simulating Kibana Search)")
print(f"{'=' * 70}")
print(f"Query: trace_id:{trace_id}")

trace_logs = parser_logger.query_logs(trace_id=trace_id)
print(f"\nFound {len(trace_logs)} log entries for trace {trace_id}")

# Query error logs
print(f"\n\n{'=' * 70}")
print("Query Error Logs (Simulating Kibana Error Dashboard)")
print(f"{'=' * 70}")
print(f"Query: level:ERROR AND wafer_lot:W-12345")

error_logs = parser_logger.query_logs(level=LogLevel.ERROR, wafer_lot="W-12345")
print(f"\nFound {len(error_logs)} error logs for wafer lot W-12345")
for log in error_logs:
    print(f"\n{log.to_json()}")

# Example 2: ML Model Prediction with Trace Correlation
print(f"\n\n{'=' * 70}")
print("Example 2: ML Model Prediction Pipeline with Trace Correlation")
print(f"{'=' * 70}")

# Create loggers for each service
api_logger = StructuredLogger("api-gateway")
feature_logger = StructuredLogger("feature-store")
model_logger = StructuredLogger("ml-model-serving")
db_logger = StructuredLogger("postgres")

# Create tracers
api_tracer = Tracer("api-gateway", api_logger)
feature_tracer = Tracer("feature-store", feature_logger)
model_tracer = Tracer("ml-model-serving", model_logger)
db_tracer = Tracer("postgres", db_logger)

print("\n📊 Processing prediction request for device DEV-789")

# API Gateway
trace_id_2 = f"trace-{uuid.uuid4().hex[:16]}"
api_span = api_tracer.start_span("handle_prediction_request", trace_id=trace_id_2)

api_logger.info(
    "Prediction request received",
    trace_id=trace_id_2,
    span_id=api_span.span_id,
    user_id="user-456",
    device_id="DEV-789",
    http_method="POST",
    http_path="/api/v1/predict"
)

# Feature Store Query
feature_span = feature_tracer.start_span(
    "fetch_features",
    trace_id=trace_id_2,
    parent_span_id=api_span.span_id
)

feature_logger.info(
    "Fetching device features",
    trace_id=trace_id_2,
    span_id=feature_span.span_id,
    device_id="DEV-789",
    feature_count=120
)

time.sleep(0.015)
feature_tracer.end_span(
    feature_span,
    features_fetched=120,
    cache_hit=False
)

# Model Inference
model_span = model_tracer.start_span(
    "predict_yield",
    trace_id=trace_id_2,
    parent_span_id=api_span.span_id
)

model_logger.info(
    "Running model inference",
    trace_id=trace_id_2,
    span_id=model_span.span_id,
    model_name="yield_predictor",
    model_version="v2.1",
    device_id="DEV-789"
)

time.sleep(0.025)

model_logger.info(
    "Model prediction complete",
    trace_id=trace_id_2,
    span_id=model_span.span_id,
    prediction=0.87,
    confidence=0.92
)

model_tracer.end_span(
    model_span,
    prediction=0.87,
    confidence=0.92,
    model_version="v2.1"
)

# Database Write
db_span = db_tracer.start_span(
    "insert_prediction",
    trace_id=trace_id_2,
    parent_span_id=api_span.span_id
)

db_logger.info(
    "Storing prediction result",
    trace_id=trace_id_2,
    span_id=db_span.span_id,
    device_id="DEV-789",
    prediction=0.87
)

time.sleep(0.012)
db_tracer.end_span(db_span, rows_inserted=1)

# Complete API request (end_span measures and records duration_ms itself)
api_tracer.end_span(
    api_span,
    status_code=200
)

api_logger.info(
    "Prediction request completed successfully",
    trace_id=trace_id_2,
    span_id=api_span.span_id,
    total_latency_ms=round(api_span.duration_ms, 2),  # measured value instead of a hard-coded 52.0
    status="success"
)

# Demonstrate Log-to-Trace Correlation
print(f"\n\n{'=' * 70}")
print("Log-to-Trace Correlation (Jump from Logs to Jaeger)")
print(f"{'=' * 70}")

print(f"\n1️⃣ User searches logs in Kibana:")
print(f"   Query: level:INFO AND model_version:v2.1")
print(f"\n2️⃣ User clicks on log entry with trace_id: {trace_id_2}")
print(f"\n3️⃣ Kibana redirects to Jaeger UI with trace_id: {trace_id_2}")
print(f"\n4️⃣ Jaeger shows complete trace timeline:")

# Show trace timeline
all_spans = (api_tracer.spans + feature_tracer.spans + 
             model_tracer.spans + db_tracer.spans)
trace_spans = [s for s in all_spans if s.trace_id == trace_id_2]

print(f"\n{'Service':<25} {'Operation':<30} {'Duration (ms)':<15}")
print("=" * 70)
for span in sorted(trace_spans, key=lambda s: s.start_time):
    print(f"{span.service_name:<25} {span.operation_name:<30} {span.duration_ms:<15.2f}")

print(f"\n✅ Logging and tracing integration demonstrated!")
print(f"   - Structured JSON logs with trace_id for correlation")
print(f"   - Distributed tracing across 4 services")
print(f"   - Bi-directional correlation (logs ↔ traces)")
print(f"   - Queryable fields for debugging and analytics")
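
The timeline printed above is flat, while a tracing UI typically shows spans nested under their parents. As a rough sketch of how that nesting can be rebuilt from `parent_span_id`, the example below uses a small stand-in record (`SpanRow`) with made-up span IDs and durations; it does not reuse the notebook's `Span` class, so it can run on its own.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional

class SpanRow(NamedTuple):
    span_id: str
    parent_span_id: Optional[str]
    service: str
    operation: str
    duration_ms: float

def render_trace(spans: List[SpanRow]) -> List[str]:
    """Return one indented line per span, children listed under their parent."""
    children: Dict[Optional[str], List[SpanRow]] = defaultdict(list)
    for span in spans:
        children[span.parent_span_id].append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.service}: {span.operation} ({span.duration_ms:.0f} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines

def slowest_child(spans: List[SpanRow], parent_id: str) -> Optional[SpanRow]:
    """The direct child that contributes most to the parent's latency."""
    kids = [s for s in spans if s.parent_span_id == parent_id]
    return max(kids, key=lambda s: s.duration_ms) if kids else None

# Illustrative values only (IDs and durations are invented for this sketch).
demo_spans = [
    SpanRow("a", None, "api-gateway", "handle_prediction_request", 52.0),
    SpanRow("f", "a", "feature-store", "fetch_features", 15.0),
    SpanRow("m", "a", "ml-model-serving", "predict_yield", 25.0),
    SpanRow("d", "a", "postgres", "insert_prediction", 12.0),
]

print("\n".join(render_trace(demo_spans)))
print("Slowest child:", slowest_child(demo_spans, "a").operation)
```

Picking the slowest direct child at each level is a simple way to walk toward the bottleneck when reading a trace.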

## 3. 🔍 ELK Stack - Centralized Log Management

**Purpose:** Build centralized logging infrastructure with Elasticsearch (storage), Logstash (processing), and Kibana (visualization).

**Key Components:**
- **Elasticsearch**: Distributed search engine with full-text indexing, real-time search, and aggregations
- **Logstash**: Log pipeline for parsing (grok patterns), filtering (field extraction), and enrichment (geoip, user-agent)
- **Kibana**: Visualization platform with dashboards, search, and alerting
- **Index management**: Index templates, retention policies, rollover strategies

**Why ELK Stack?**
- **Centralization**: Collect logs from 100+ services into single searchable index
- **Performance**: Sub-second queries over ~100M log documents are achievable with proper indexing (appropriate field mappings and shard sizing)
- **Flexibility**: Support structured (JSON) and unstructured (text) logs
- **Visualization**: Build real-time dashboards for monitoring and debugging

**Post-Silicon Application:**

**Scenario:** STDF processing pipeline generates 50GB logs/day across 20 services. Debug test failures, track processing performance, and monitor data quality.

**Implementation:**
1. **Logstash pipeline**: Parse STDF logs, extract device_id/wafer_lot/test_name fields
2. **Elasticsearch index**: `stdf-logs-YYYY.MM.DD` with 7-day retention
3. **Kibana dashboard**: Test failure rates, processing latency P95, data quality metrics
4. **Alerting**: Trigger when test failure rate >5% or processing latency >60s
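
The alerting rule in step 4 can be sketched in a few lines. This is an illustrative evaluation function, not Kibana's alerting API: the names `p95` and `should_alert` are made up here, and the percentile uses the simple nearest-rank definition, which may differ slightly from the approximation Elasticsearch computes.

```python
from typing import List

def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("p95 of an empty list is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer form of ceil(0.95 * n)
    return ordered[rank - 1]

def should_alert(failed: int, total: int, latencies_s: List[float],
                 max_failure_rate: float = 0.05, max_p95_s: float = 60.0) -> List[str]:
    """Return the list of violated thresholds (an empty list means no alert)."""
    reasons = []
    if total > 0 and failed / total > max_failure_rate:
        reasons.append(f"test failure rate {failed / total:.1%} exceeds {max_failure_rate:.0%}")
    if latencies_s and p95(latencies_s) > max_p95_s:
        reasons.append(f"processing latency P95 {p95(latencies_s):.0f}s exceeds {max_p95_s:.0f}s")
    return reasons

# 6 failures out of 100 tests, and 2 of 20 jobs took 90 seconds.
print(should_alert(failed=6, total=100, latencies_s=[10.0] * 18 + [90.0] * 2))
```

Returning the reasons rather than a bare boolean makes it straightforward to include the violated threshold in the alert message.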

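**Example Query** (assumes `level` and `service` are mapped as `keyword` fields, so `term` matches the exact value; `error_type` uses the default dynamic mapping with a `.keyword` sub-field):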
```
GET /stdf-logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"level": "ERROR"}},
        {"term": {"service": "stdf-parser"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}}
      ]
    }
  },
  "aggs": {
    "top_errors": {
      "terms": {"field": "error_type.keyword", "size": 10}
    }
  }
}
```

**Value:** 80% faster debugging ($1.5M/year) + proactive alerting prevents 50 test equipment failures/year ($500K savings)

In [None]:
# ELK Stack Simulation - Elasticsearch, Logstash, Kibana

class ElasticsearchIndex:
    """Simulates Elasticsearch index"""
    
    def __init__(self, index_name: str):
        self.index_name = index_name
        self.documents: List[Dict[str, Any]] = []
    
    def index_document(self, doc_id: str, document: Dict[str, Any]):
        """Index document (simulating Elasticsearch indexing)"""
        document['_id'] = doc_id
        document['_index'] = self.index_name
        self.documents.append(document)
    
    def search(self, query: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Search documents (simulating Elasticsearch query)"""
        results = self.documents
        
        # Simple query implementation
        if 'query' in query:
            bool_query = query['query'].get('bool', {})
            
            # Must clauses (AND)
            for must_clause in bool_query.get('must', []):
                if 'term' in must_clause:
                    field, value = list(must_clause['term'].items())[0]
                    results = [doc for doc in results if doc.get(field) == value]
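                elif 'range' in must_clause:
                    field, range_cond = list(must_clause['range'].items())[0]
                    # Simplified range check: plain >= comparison only (no date math such as "now-1h");
                    # documents missing the field are excluded instead of being compared against 0
                    results = [doc for doc in results
                               if 'gte' not in range_cond
                               or (field in doc and doc[field] >= range_cond['gte'])]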
        
        return results
    
    def aggregate(self, query: Dict[str, Any]) -> Dict[str, Any]:
        """Run aggregations (simulating Elasticsearch aggregations)"""
        results = self.search(query)
        aggregations = {}
        
        if 'aggs' in query:
            for agg_name, agg_spec in query['aggs'].items():
                if 'terms' in agg_spec:
                    field = agg_spec['terms']['field'].replace('.keyword', '')
                    size = agg_spec['terms'].get('size', 10)
                    
                    # Count by field value
                    counts = {}
                    for doc in results:
                        value = doc.get(field, 'unknown')
                        counts[value] = counts.get(value, 0) + 1
                    
                    # Sort and limit
                    sorted_buckets = sorted(counts.items(), key=lambda x: x[1], reverse=True)[:size]
                    aggregations[agg_name] = {
                        'buckets': [{'key': k, 'doc_count': v} for k, v in sorted_buckets]
                    }
        
        return aggregations

class LogstashPipeline:
    """Simulates Logstash processing pipeline"""
    
    def __init__(self, es_index: ElasticsearchIndex):
        self.es_index = es_index
    
    def process_log(self, log_entry: LogEntry):
        """Process and enrich log entry"""
        # Convert log to Elasticsearch document
        doc = {
            '@timestamp': log_entry.timestamp.isoformat(),
            'level': log_entry.level.value,
            'service': log_entry.service,
            'message': log_entry.message,
            'trace_id': log_entry.trace_id,
            'span_id': log_entry.span_id,
            **log_entry.fields
        }
        
        # Enrich document
        doc['@ingestion_time'] = datetime.now().isoformat()
        doc['index_name'] = self.es_index.index_name
        
        # Index to Elasticsearch
        doc_id = f"{log_entry.service}-{uuid.uuid4().hex[:8]}"
        self.es_index.index_document(doc_id, doc)

class KibanaQuery:
    """Simulates Kibana query interface"""
    
    def __init__(self, es_index: ElasticsearchIndex):
        self.es_index = es_index
    
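    def search(self, lucene_query: str, time_range: str = "now-24h") -> List[Dict[str, Any]]:
        """Simplified Lucene query parser (AND-joined field:value terms only).

        Note: time_range is accepted for API symmetry but not applied in this simulation.
        """
        # Parse simple queries like "level:ERROR AND service:stdf-parser"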
        filters = []
        
        for part in lucene_query.split(' AND '):
            part = part.strip()
            if ':' in part:
                field, value = part.split(':', 1)
                filters.append({'term': {field: value}})
        
        es_query = {
            'query': {
                'bool': {
                    'must': filters
                }
            }
        }
        
        return self.es_index.search(es_query)
    
    def visualize_error_trends(self, service: str = None):
        """Create error trend visualization"""
        query = {
            'query': {
                'bool': {
                    'must': [{'term': {'level': 'ERROR'}}]
                }
            },
            'aggs': {
                'errors_by_service': {
                    'terms': {'field': 'service', 'size': 10}
                },
                'errors_by_type': {
                    'terms': {'field': 'error_type', 'size': 10}
                }
            }
        }
        
        if service:
            query['query']['bool']['must'].append({'term': {'service': service}})
        
        return self.es_index.aggregate(query)

# Example 3: ELK Stack in Action
print("=" * 70)
print("Example 3: ELK Stack - Centralized Logging for STDF Pipeline")
print("=" * 70)

# Create ELK stack
es_index = ElasticsearchIndex(index_name="stdf-logs-2025.01")
logstash = LogstashPipeline(es_index)
kibana = KibanaQuery(es_index)

# Create services
services = {
    'stdf-parser': StructuredLogger('stdf-parser'),
    'ml-model': StructuredLogger('ml-model'),
    'database': StructuredLogger('database'),
    'api-gateway': StructuredLogger('api-gateway')
}

print("\n📊 Simulating 24 hours of STDF processing logs...")

# Generate sample logs
trace_id_elk = f"trace-{uuid.uuid4().hex[:16]}"

# Normal processing logs
services['api-gateway'].info(
    "STDF file upload request",
    trace_id=trace_id_elk,
    file_name="wafer_W12345.stdf",
    file_size_mb=500,
    user_id="user-123"
)

# Parser warnings
for i in range(3):
    services['stdf-parser'].warning(
        "Missing optional field in STDF record",
        trace_id=trace_id_elk,
        record_id=f"REC-{i}",
        wafer_lot="W-12345",
        missing_field="SITE_NUM"
    )

# Parser errors
for i in range(5):
    services['stdf-parser'].error(
        "Voltage parameter out of range",
        trace_id=trace_id_elk,
        record_id=f"REC-{100+i}",
        wafer_lot="W-12345",
        error_type="ValidationError",
        parameter="voltage",
        value=20.0 + i,
        expected_range="[-5V, 15V]"
    )

# ML model errors
services['ml-model'].error(
    "Model prediction failed due to missing features",
    trace_id=trace_id_elk,
    error_type="FeatureError",
    model_version="v2.1",
    missing_features=["voltage_mean", "current_stddev"]
)

# Database errors
services['database'].error(
    "Connection timeout to PostgreSQL",
    trace_id=trace_id_elk,
    error_type="ConnectionError",
    database_host="postgres-prod-1",
    timeout_ms=5000
)

# Send all logs to Logstash
print(f"\n📤 Sending {sum(len(logger.logs) for logger in services.values())} logs to Logstash pipeline...")
for service_logger in services.values():
    for log in service_logger.logs:
        logstash.process_log(log)

print(f"✅ {len(es_index.documents)} documents indexed in Elasticsearch")

# Kibana Query 1: Search for errors
print(f"\n\n{'=' * 70}")
print("Kibana Query 1: All ERROR logs in last 24 hours")
print(f"{'=' * 70}")
print("Query: level:ERROR")

error_results = kibana.search("level:ERROR")
print(f"\nFound {len(error_results)} error logs:")
for result in error_results[:3]:
    print(f"\n  Service: {result['service']}")
    print(f"  Message: {result['message']}")
    print(f"  Trace ID: {result.get('trace_id', 'N/A')}")
    if 'error_type' in result:
        print(f"  Error Type: {result['error_type']}")

# Kibana Query 2: Service-specific errors
print(f"\n\n{'=' * 70}")
print("Kibana Query 2: STDF Parser errors")
print(f"{'=' * 70}")
print("Query: level:ERROR AND service:stdf-parser")

parser_errors = kibana.search("level:ERROR AND service:stdf-parser")
print(f"\nFound {len(parser_errors)} parser errors")

# Kibana Visualization: Error trends
print(f"\n\n{'=' * 70}")
print("Kibana Visualization: Error Distribution")
print(f"{'=' * 70}")

error_trends = kibana.visualize_error_trends()
print("\n📊 Errors by Service:")
for bucket in error_trends['errors_by_service']['buckets']:
    print(f"  {bucket['key']}: {bucket['doc_count']} errors")

print("\n📊 Errors by Type:")
for bucket in error_trends['errors_by_type']['buckets']:
    print(f"  {bucket['key']}: {bucket['doc_count']} errors")

# Advanced: Trace-based log aggregation
print(f"\n\n{'=' * 70}")
print("Advanced Query: All logs for specific trace")
print(f"{'=' * 70}")
print(f"Query: trace_id:{trace_id_elk}")

trace_logs_elk = [doc for doc in es_index.documents if doc.get('trace_id') == trace_id_elk]
print(f"\nFound {len(trace_logs_elk)} logs for trace {trace_id_elk}")
print(f"\n📈 Log Level Distribution:")
level_counts = {}
for log in trace_logs_elk:
    level = log['level']
    level_counts[level] = level_counts.get(level, 0) + 1

for level, count in sorted(level_counts.items()):
    print(f"  {level}: {count} logs")

print(f"\n✅ ELK Stack demonstration complete!")
print(f"   - Centralized logging from 4 services")
print(f"   - Field-based search with Kibana-style queries")
print(f"   - Aggregations for error analysis")
print(f"   - Trace-based log correlation")

## 4. 🔬 Real-World Projects: Production Logging & Tracing

### Project 1: **Centralized Logging Platform with Log Retention** 💰 **$2.1M/year**
**Objective:** Build multi-tenant logging platform with 90-day retention, supporting 500GB logs/day across 200 services.

**Key Features:**
- **ELK Stack**: Elasticsearch cluster (10 nodes, 20TB storage), Logstash (5 pipeline workers), Kibana (multi-tenant dashboards)
- **Index lifecycle management**: Hot (7 days SSD), Warm (30 days HDD), Cold (90 days object storage), Delete
- **Retention policies**: Production logs 90 days, staging logs 30 days, development logs 7 days
- **Access control**: Role-based access (developers see own team logs, SRE see all logs)
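
To show how the lifecycle tiers and per-environment retention interact, here is a small decision-function sketch. It assumes the tier boundaries mean "hot until day 7, warm until day 30, cold until day 90" and that an environment's retention limit takes precedence over the tier. In a real cluster this would be expressed as an Elasticsearch ILM policy rather than Python; `lifecycle_phase` is a hypothetical helper.

```python
RETENTION_DAYS = {"production": 90, "staging": 30, "development": 7}

def lifecycle_phase(age_days: int, environment: str) -> str:
    """Map an index's age to a storage phase, honouring per-environment retention."""
    if age_days >= RETENTION_DAYS[environment]:
        return "delete"   # past retention for this environment
    if age_days < 7:
        return "hot"      # SSD
    if age_days < 30:
        return "warm"     # HDD
    return "cold"         # object storage

for age in (0, 7, 30, 90):
    print(f"production index, {age:>2} days old -> {lifecycle_phase(age, 'production')}")
print(f"development index, 7 days old -> {lifecycle_phase(7, 'development')}")
```

Under these assumptions, development logs never leave the hot tier and staging logs never reach cold storage, because their retention expires first.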

**Business Value:**
- 90% faster debugging with centralized search ($1.5M/year from reduced MTTR: 4h → 24min)
- Compliance audit trail for SOC2/ISO27001 ($400K/year from automated compliance)
- Proactive alerting prevents 30 severity-1 incidents/year ($200K/year from reduced downtime)

---

### Project 2: **Distributed Tracing with Jaeger for Microservices** 💰 **$1.8M/year**
**Objective:** Implement distributed tracing for 50-service architecture, tracking 10M requests/day with <0.1% overhead.

**Key Features:**
- **OpenTelemetry instrumentation**: Auto-instrumentation for Python/Java/Node.js services
- **Jaeger backend**: Cassandra storage (30-day retention), Spark analytics for trace aggregation
- **Sampling strategies**: Probabilistic (1% baseline), rate-limiting (100 traces/sec), error-based (100% errors)
- **Performance monitoring**: P50/P95/P99 latency by service, critical path analysis, dependency graphs
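
Per-service P50/P95/P99 latency can be computed directly from span durations. A minimal sketch using only the standard library; the span dictionaries with `service` and `duration_ms` keys are an assumed shape for illustration, not Jaeger's storage format.

```python
from collections import defaultdict
from statistics import quantiles

def latency_percentiles(spans):
    """Group span durations by service and return P50/P95/P99 per service."""
    by_service = defaultdict(list)
    for span in spans:
        by_service[span["service"]].append(span["duration_ms"])

    result = {}
    for service, durations in by_service.items():
        if len(durations) < 2:
            continue  # quantiles() needs at least two data points
        cuts = quantiles(durations, n=100, method="inclusive")  # 99 cut points
        result[service] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return result

# Synthetic spans: 101 inference spans (1..101 ms) and two gateway spans
spans = [{"service": "ml-inference", "duration_ms": float(d)} for d in range(1, 102)]
spans += [
    {"service": "api-gateway", "duration_ms": 5.0},
    {"service": "api-gateway", "duration_ms": 15.0},
]
stats = latency_percentiles(spans)
print(stats["ml-inference"])  # {'p50': 51.0, 'p95': 96.0, 'p99': 100.0}
```

With very few samples the tail percentiles are interpolated and not very meaningful, so a production dashboard would normally require a minimum sample count per service.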

**Business Value:**
- 80% reduction in latency investigation time ($950K/year from faster root cause analysis)
- Identify and fix 15 performance bottlenecks/quarter ($600K/year from optimizations)
- Capacity planning insights reduce infrastructure costs 20% ($250K/year savings)

---

### Project 3: **Log-Trace Correlation for Unified Debugging** 💰 **$1.5M/year**
**Objective:** Integrate logs and traces with trace_id correlation, enabling seamless debugging across both signals.

**Key Features:**
- **trace_id injection**: Automatic trace context propagation (W3C Trace Context standard)
- **Kibana-Jaeger integration**: Click trace_id in Kibana → open Jaeger UI, click span in Jaeger → show logs
- **Correlated alerts**: Alert on ERROR logs with trace_id, link to full trace in Jaeger
- **Unified search**: Search logs by trace_id, show trace timeline with log events overlaid
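
The W3C Trace Context `traceparent` header carries four dash-separated hex fields: version, 32-character trace ID, 16-character parent span ID, and flags. A minimal sketch of creating, parsing, and propagating it by hand; in practice an OpenTelemetry propagator does this, and the helper names here are hypothetical.

```python
import re
import secrets

_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"

def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

def child_traceparent(header):
    """Header for an outgoing call: same trace ID and flags, fresh span ID."""
    ctx = parse_traceparent(header)
    if ctx is None:
        return new_traceparent()  # broken or missing context: start a new trace
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
outgoing = child_traceparent(incoming)
print(context["trace_id"], outgoing)
```

The `trace_id` extracted here is the value that would be written into every log line so that Kibana and Jaeger can be joined on it.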

**Business Value:**
- 85% faster incident resolution ($1.1M/year from reduced MTTR: 90min → 13min)
- 50% reduction in alert noise with context-aware alerts ($300K/year from reduced on-call burden)
- Improved observability ROI: 3x debugging efficiency ($100K/year from team productivity)

---

### Project 4: **ML Model Observability with Structured Logging** 💰 **$1.2M/year**
**Objective:** Log all ML model predictions with structured format, enabling model performance analysis and debugging.

**Key Features:**
- **Prediction logging**: Every prediction logged with model_version, features, prediction, confidence, latency_ms
- **Model drift detection**: Daily aggregation of prediction distribution, alert on >10% shift from baseline
- **Feature importance tracking**: Log feature values, identify which features drive prediction changes
- **A/B test analysis**: Compare model versions with structured queries (v2.1 vs v2.0 accuracy)
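
A prediction log record and a simple drift check could look like the sketch below. It assumes "shift" means the relative change in the mean prediction against a baseline window, which is only one possible definition; the feature names (`vdd`, `temp_c`) are hypothetical.

```python
import json
from datetime import datetime, timezone
from statistics import mean

def prediction_log(trace_id, model_version, features, prediction, confidence, latency_ms):
    """Serialize one prediction as a structured JSON log line."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "INFO",
        "event": "prediction",
        "trace_id": trace_id,
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "confidence": confidence,
        "latency_ms": latency_ms,
    })

def drift_detected(baseline_predictions, todays_predictions, threshold=0.10):
    """True when the mean prediction moved more than `threshold` relative to baseline."""
    base = mean(baseline_predictions)
    today = mean(todays_predictions)
    if base == 0:
        return today != 0
    return abs(today - base) / abs(base) > threshold

line = prediction_log(
    "abc123", "v2.1", {"vdd": 1.05, "temp_c": 85}, "PASS", 0.97, 12.4
)
print(line)
print(drift_detected([0.50, 0.50], [0.60, 0.60]))  # True: mean moved by 20%
```

Because `model_version` is a top-level field, comparing v2.1 against v2.0 becomes a filter plus an aggregation in Elasticsearch instead of a text search.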

**Business Value:**
- Detect model drift 7 days faster ($750K/year from preventing accuracy degradation: 93% maintained vs 85%)
- Debug model failures 90% faster ($350K/year from trace_id → feature values)
- A/B testing enables 5% accuracy improvement ($100K/year from better model selection)

---

### Project 5: **Compliance Audit Trail with Immutable Logs** 💰 **$950K/year**
**Objective:** Build tamper-proof audit trail for GDPR/HIPAA compliance, tracking all data access and modifications.

**Key Features:**
- **Immutable logging**: Append-only log storage with cryptographic signatures (SHA-256 hash chain)
- **Compliance fields**: Log user_id, action, resource_id, timestamp, IP address, reason for every data access
- **Audit reports**: Generate compliance reports (who accessed patient data, when, why) in <1 hour
- **Retention policies**: Compliance logs retained 7 years (legal requirement), archived to S3 Glacier
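
A SHA-256 hash chain makes tampering detectable because each entry's hash covers the previous entry's hash. A minimal in-memory sketch; the record fields follow the compliance fields listed above, and the function names are hypothetical. A hash chain alone gives tamper evidence, not authorship, so a real system would also sign or externally anchor the latest hash.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry

def _entry_hash(prev_hash, record):
    # Canonical JSON so the same record always hashes to the same value
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_entry(chain, record):
    """Append a record, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({
        "record": record,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, record),
    })

def verify_chain(chain):
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True

audit_log = []
append_entry(audit_log, {"user_id": 7, "action": "read", "resource_id": "patient/42",
                         "timestamp": "2025-01-01T12:00:00Z", "reason": "treatment"})
append_entry(audit_log, {"user_id": 9, "action": "update", "resource_id": "patient/42",
                         "timestamp": "2025-01-01T12:05:00Z", "reason": "correction"})
print(verify_chain(audit_log))  # True
```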

**Business Value:**
- Automated compliance reduces audit preparation 95% ($600K/year from 3 weeks → 1 day)
- Tamper-proof logs prevent $300K/year in compliance fines (GDPR/HIPAA violations avoided)
- Data breach investigation 10x faster ($50K/year from faster incident response)

---

### Project 6: **Real-Time Anomaly Detection from Logs** 💰 **$850K/year**
**Objective:** Use log analytics to detect anomalies (error spikes, latency increases) and alert 10min before user impact.

**Key Features:**
- **Streaming analytics**: Kafka → Flink streaming job → Elasticsearch (log ingestion in <5 seconds)
- **Anomaly detection**: Statistical baselines (P95 error rate, P99 latency), ML-based anomaly detection (Isolation Forest)
- **Predictive alerting**: Alert when error rate trending toward SLA breach (predict 10min ahead)
- **Auto-remediation**: Trigger auto-scaling when latency >200ms for 5min (Kubernetes HPA)
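
Predictive alerting can be approximated by fitting a straight line to recent error-rate samples and projecting it 10 minutes ahead. A minimal sketch, assuming one sample per minute and a plain least-squares trend; a production system would more likely use the streaming job or an ML model described above.

```python
def project_error_rate(samples, horizon_minutes=10):
    """Fit a least-squares line to per-minute samples and extrapolate it."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    slope = sxy / sxx
    # Evaluate the fitted line at the last sample index plus the horizon
    return y_mean + slope * ((n - 1 + horizon_minutes) - x_mean)

def should_alert(samples, sla_threshold, horizon_minutes=10):
    """Alert when the projected error rate reaches the SLA threshold."""
    return project_error_rate(samples, horizon_minutes) >= sla_threshold

rising = [1.0, 2.0, 3.0, 4.0, 5.0]   # errors/min, oldest first
flat = [2.0, 2.0, 2.0, 2.0, 2.0]
print(project_error_rate(rising))    # 15.0
print(should_alert(rising, sla_threshold=10.0), should_alert(flat, sla_threshold=10.0))
```

A linear trend overreacts to short spikes, so a longer window or a smoothed input would usually be needed before wiring this to paging.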

**Business Value:**
- Prevent 40 severity-1 incidents/year with predictive alerts ($650K/year from avoided downtime)
- Reduce false positives 70% with ML anomaly detection ($150K/year from reduced alert fatigue)
- Auto-remediation reduces manual intervention 80% ($50K/year from SRE time savings)

---

### Project 7: **Multi-Region Log Aggregation** 💰 **$720K/year**
**Objective:** Aggregate logs from 5 AWS regions into centralized platform, supporting global debugging and compliance.

**Key Features:**
- **Regional Logstash**: Logstash in each region (e.g. us-east-1, eu-west-1, ap-south-1), local buffering for network failures
- **Cross-region replication**: Elasticsearch cross-cluster replication (CCR) with <30s lag
- **Geo-routing**: Kibana auto-routes to closest Elasticsearch cluster (minimize query latency)
- **Compliance**: EU logs stored in eu-west-1 (GDPR data residency requirement)

**Business Value:**
- Unified debugging across regions saves 2 hours/incident ($500K/year from faster multi-region issues)
- GDPR compliance prevents $150K/year in fines (data residency violations avoided)
- Disaster recovery: 99.99% log availability with multi-region redundancy ($70K/year from resilience)

---

### Project 8: **Log-Based Security Monitoring (SIEM)** 💰 **$680K/year**
**Objective:** Build Security Information and Event Management (SIEM) system with log-based threat detection.

**Key Features:**
- **Security log sources**: Application logs, AWS CloudTrail, Kubernetes audit logs, WAF logs
- **Threat detection rules**: Brute force login attempts (>10 failed logins/min), privilege escalation, data exfiltration (>10GB transfer)
- **MITRE ATT&CK mapping**: Categorize threats by tactics (Initial Access, Persistence, Lateral Movement)
- **Automated response**: Block IP after 20 failed logins, revoke API keys on suspicious activity
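
The brute-force rule (>10 failed logins/min) is a sliding-window count per source IP. A minimal sketch, assuming events arrive in timestamp order; the class name is hypothetical, and a real SIEM rule would run in the streaming layer, not in application memory.

```python
from collections import defaultdict, deque

class FailedLoginDetector:
    def __init__(self, max_failures=10, window_seconds=60):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # source IP -> recent failure timestamps

    def record_failure(self, ip, timestamp):
        """Record one failed login; return True if the IP exceeds the threshold."""
        window = self._failures[ip]
        window.append(timestamp)
        # Evict failures that fell out of the sliding window
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.max_failures

detector = FailedLoginDetector()
# Eleven failures from one IP within 11 seconds: only the 11th crosses ">10 per minute"
results = [detector.record_failure("10.0.0.5", t) for t in range(11)]
print(results[-1])  # True
```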

**Business Value:**
- Detect security incidents 10x faster ($450K/year from reduced breach impact: 48h → 4.8h detection)
- Prevent 5 security incidents/year with automated blocking ($180K/year from avoided breaches)
- Compliance: SOC2 requirement for centralized security monitoring ($50K/year from audit pass)

---

## 💰 **Total Project Value: $9.8M/year**
**ROI: ~310% net (infrastructure costs ~$2.4M/year, value $9.8M/year, i.e. ~4.1x value-to-cost)**

## 5. 🎯 Comprehensive Takeaways: Logging & Tracing Mastery

### **Core Concepts**

**Structured Logging:**
- ✅ **JSON format** with standardized fields (`timestamp`, `level`, `message`, `trace_id`, `service`)
- ✅ **Contextual enrichment** (user_id, model_version, request_id) enables powerful filtering
- ✅ **Log levels** (DEBUG/INFO/WARNING/ERROR/CRITICAL) for severity-based routing
- ✅ **Correlation** via trace_id links logs across distributed services
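
These fields can be produced with the standard `logging` module alone. A minimal sketch, assuming the trace ID is kept in a `contextvars.ContextVar` set by request middleware; the `context` key passed through `extra` is a convention invented for this example.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Assumption: middleware sets this once per request
trace_id_var = contextvars.ContextVar("trace_id", default=None)

class JsonFormatter(logging.Formatter):
    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        }
        # Merge per-call context passed as extra={"context": {...}}
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

stream = io.StringIO()  # stands in for stdout or a file shipped to Logstash
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="stdf-parser"))
logger = logging.getLogger("structured_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

trace_id_var.set("abc123")
logger.error("Prediction failed", extra={"context": {"user_id": 123, "model_version": "v2.1"}})
print(stream.getvalue().strip())
```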

**ELK Stack:**
- ✅ **Elasticsearch** for log storage with full-text indexing (100M logs searchable in <1s)
- ✅ **Logstash** for log parsing (grok patterns), filtering (extract fields), enrichment (geoip)
- ✅ **Kibana** for visualization (dashboards, search, alerting, saved queries)
- ✅ **Index lifecycle** (hot/warm/cold tiers) optimizes storage costs (SSD → HDD → object storage)

**Distributed Tracing:**
- ✅ **Spans** represent individual operations with duration, parent relationships, tags
- ✅ **Traces** are collections of spans forming a complete request journey
- ✅ **Context propagation** (W3C Trace Context) passes trace_id across service boundaries
- ✅ **Sampling strategies** (probabilistic 1%, error-based 100%) balance cost and coverage

**Log-Trace Correlation:**
- ✅ **Bi-directional linking** (logs → traces via trace_id, traces → logs via span_id)
- ✅ **Unified debugging** (click log in Kibana → open Jaeger, click span → show logs)
- ✅ **Complete context** (trace shows "what took time", logs show "why it failed")

---

### **Best Practices**

**Structured Logging:**
- ✅ Use consistent field names across all services (`user_id` everywhere, not a mix of `userId` and `user_identifier`)
- ✅ Include trace_id in every log for correlation (auto-inject from OpenTelemetry context)
- ✅ Log meaningful context, not just error messages (`device_id`, `model_version`, `feature_values`)
- ✅ Avoid logging sensitive data (PII, API keys, passwords) or redact with `***`
- ✅ Use appropriate log levels (INFO for business events, ERROR for failures requiring action)
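
Redaction is easiest to enforce in one place, just before a record is serialized. A minimal sketch that masks values by key name and masks email addresses inside strings; the key list and the email pattern are illustrative assumptions and would need extending for real PII.

```python
import re

SENSITIVE_KEYS = {"password", "api_key", "token", "authorization", "ssn"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact(value, key=None):
    """Return a copy of `value` with sensitive fields replaced by '***'."""
    if isinstance(key, str) and key.lower() in SENSITIVE_KEYS:
        return "***"
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("***", value)
    return value

event = {
    "user": "alice@example.com",
    "api_key": "sk-123",
    "nested": {"Password": "hunter2", "note": "contact bob@example.org"},
    "latency_ms": 12,
}
clean = redact(event)
print(clean)
```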

**ELK Stack:**
- ✅ Use index templates for consistent field mappings (define `@timestamp` as date, `trace_id` as keyword)
- ✅ Implement retention policies (hot 7 days, warm 30 days, cold 90 days) to control storage costs
- ✅ Optimize queries with field filters (`term` queries) before full-text search (`match` queries)
- ✅ Use aggregations for analytics (error counts, P95 latency, top failing services)
- ✅ Monitor Elasticsearch cluster health (heap usage <75%, disk usage <85%, search latency <100ms)
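
In the Elasticsearch Query DSL, "filters before full-text search" means putting exact-match clauses in the `filter` section of a `bool` query, which skips scoring and can be cached, and keeping only the text search in `must`. A minimal sketch of such a request body; it assumes `service` and `level` are mapped as keyword fields, and the `error_type` aggregation field is hypothetical.

```python
import json

def build_log_query(service, level, text, since="now-24h"):
    """Build a search body: exact filters first, then a full-text match."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": level}},
                    {"range": {"@timestamp": {"gte": since}}},
                ],
                "must": [{"match": {"message": text}}],
            }
        },
        # Assumed field: error_type as a keyword, for a top-10 breakdown
        "aggs": {"errors_by_type": {"terms": {"field": "error_type", "size": 10}}},
    }

query = build_log_query("stdf-parser", "ERROR", "timeout")
print(json.dumps(query, indent=2))
```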

**Distributed Tracing:**
- ✅ Use auto-instrumentation libraries (OpenTelemetry for Python/Java/Node.js) to reduce manual effort
- ✅ Sample intelligently: 1% baseline + 100% errors + 100% slow requests (>1s) balances cost and coverage
- ✅ Tag spans with meaningful metadata (`http.status_code`, `db.statement`, `model.version`)
- ✅ Minimize span count (5-20 spans per trace) to reduce overhead and storage costs
- ✅ Set trace retention based on value (production 30 days, staging 7 days, development 1 day)

**Log-Trace Correlation:**
- ✅ Auto-inject trace_id from OpenTelemetry context into all logs (avoid manual passing)
- ✅ Include span_id in logs for precise correlation (which span generated which log)
- ✅ Link Kibana and Jaeger UIs for seamless navigation (URL templates with trace_id)
- ✅ Use trace_id in alerts (include Jaeger link in PagerDuty/Slack notifications)
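
A URL template is enough to turn a `trace_id` from a log line into a clickable Jaeger link inside an alert. A minimal sketch; the Jaeger host is a hypothetical placeholder, and the alert dictionary is a generic shape, not the PagerDuty or Slack payload format.

```python
from urllib.parse import quote

JAEGER_BASE_URL = "http://jaeger.example.internal:16686"  # hypothetical host

def jaeger_trace_url(trace_id, base_url=JAEGER_BASE_URL):
    """Jaeger UI serves a single trace under /trace/<trace_id>."""
    return f"{base_url.rstrip('/')}/trace/{quote(trace_id, safe='')}"

def build_alert(log_entry, base_url=JAEGER_BASE_URL):
    """Turn an ERROR log entry into an alert that links to the full trace."""
    trace_id = log_entry.get("trace_id")
    level = log_entry.get("level", "ERROR")
    service = log_entry.get("service", "unknown")
    return {
        "title": f"[{level}] {service}: {log_entry.get('message', '')}",
        "trace_id": trace_id,
        "trace_url": jaeger_trace_url(trace_id, base_url) if trace_id else None,
    }

alert = build_alert({
    "level": "ERROR",
    "service": "ml-inference",
    "message": "Prediction failed",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
})
print(alert["trace_url"])
```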

---

### **Advanced Patterns**

**Log Sampling:**
- For high-volume services (>10K logs/sec), sample DEBUG logs (10%) and keep 100% of ERROR/WARNING logs
- Use consistent hashing on trace_id for deterministic sampling (same trace always sampled/dropped)
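
Deterministic sampling can be done by hashing the `trace_id` into a number in [0, 1) and comparing it with the sample rate, so every service makes the same keep/drop decision for a given trace. A minimal sketch, assuming only DEBUG logs are sampled and all other levels are kept.

```python
import hashlib

def keep_log(level, trace_id, debug_sample_rate=0.10):
    """Keep all non-DEBUG logs; keep DEBUG logs for a stable 10% of traces."""
    if level != "DEBUG":
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < debug_sample_rate

kept = sum(keep_log("DEBUG", f"trace-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 DEBUG logs")
```

Unlike `random.random()`, the hash gives the same answer in every process, so a sampled trace keeps its DEBUG logs from all services.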

**Tail-Based Sampling:**
- Buffer traces in memory for 10 seconds, then decide to keep/drop based on outcome (errors, slow latency)
- Keeps 100% of interesting traces while dropping 99% of "happy path" traces
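
The buffer-then-decide idea can be sketched as a small in-memory sampler. This is an illustration under simplifying assumptions (single process, caller-supplied clock, a span counted as slow when its own duration exceeds the limit), not how the OpenTelemetry Collector implements tail sampling; the class and field names are hypothetical.

```python
import random

class TailSampler:
    def __init__(self, wait_seconds=10, slow_ms=1000, baseline_rate=0.01, rng=random.random):
        self.wait_seconds = wait_seconds
        self.slow_ms = slow_ms
        self.baseline_rate = baseline_rate
        self.rng = rng
        self._buffer = {}  # trace_id -> {"first_seen": time, "spans": [...]}

    def add_span(self, trace_id, span, now):
        entry = self._buffer.setdefault(trace_id, {"first_seen": now, "spans": []})
        entry["spans"].append(span)

    def _keep(self, spans):
        if any(s.get("error") for s in spans):
            return True  # always keep traces with errors
        if max(s["duration_ms"] for s in spans) > self.slow_ms:
            return True  # always keep slow traces
        return self.rng() < self.baseline_rate  # small share of the happy path

    def flush(self, now):
        """Decide on every trace buffered for at least wait_seconds."""
        ready = [t for t, e in self._buffer.items()
                 if now - e["first_seen"] >= self.wait_seconds]
        kept = []
        for trace_id in ready:
            spans = self._buffer.pop(trace_id)["spans"]
            if self._keep(spans):
                kept.append((trace_id, spans))
        return kept

sampler = TailSampler(rng=lambda: 0.5)  # fixed rng: no baseline traces kept
sampler.add_span("t-ok", {"duration_ms": 40}, now=0)
sampler.add_span("t-err", {"duration_ms": 35, "error": True}, now=1)
sampler.add_span("t-slow", {"duration_ms": 2500}, now=2)
sampler.add_span("t-late", {"duration_ms": 10, "error": True}, now=8)
kept_ids = [trace_id for trace_id, _ in sampler.flush(now=12)]
print(kept_ids)  # ['t-err', 't-slow']
```

The memory cost is the point to watch: every in-flight trace sits in the buffer for the full wait window.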

**Trace-Based Testing:**
- Record production traces, replay in test environment to validate behavior under real conditions
- Compare trace structure (span count, durations, error rates) between releases (detect regressions)

**Log-Driven Alerts:**
- Alert on log patterns (ERROR rate >10/min, specific error type "OutOfMemoryError", missing expected logs)
- Correlate alerts with traces (if alert fires, auto-fetch related traces for context)

**Cost Optimization:**
- Archive old logs to S3 Glacier (90 days+) for compliance at roughly 1/25th the storage cost ($0.004/GB vs $0.10/GB SSD)
- Use tiered sampling (production 1%, staging 0.1%, development 0.01%) to reduce trace storage
- Compress logs in Logstash (gzip reduces size 5-10x) before sending to Elasticsearch; this cuts network transfer, while on-disk size depends on the index codec

---

### **Common Pitfalls**

**Logging Mistakes:**
- ❌ Logging secrets (API keys, passwords) exposes vulnerabilities → Use redaction `"api_key": "***"`
- ❌ High-cardinality fields (user_id with 10M values) as indexed fields → Elasticsearch OOM
- ❌ Logging too much (DEBUG logs in production) wastes storage → Use log levels and sampling
- ❌ Unstructured logs ("User 123 did thing") hard to query → Use JSON with structured fields

**ELK Mistakes:**
- ❌ No retention policies → Elasticsearch disk fills up → Cluster failure
- ❌ Indexing all fields → High memory usage → Slow queries → Use selective field mapping
- ❌ No index rollover → Single 10TB index → Slow searches → Use daily/weekly indices
- ❌ Alerting on raw log counts without baseline → High false positive rate → Alert fatigue

**Tracing Mistakes:**
- ❌ Tracing everything (100% sampling) → 100x cost vs 1% sampling → Unsustainable at scale
- ❌ Too many spans (>100 per trace) → High overhead → Use fewer, coarse-grained spans
- ❌ Missing context propagation → Broken traces → Ensure trace_id passed in HTTP headers/message queues
- ❌ No error tagging → Can't filter traces by errors → Tag spans with `error=true`

**Correlation Mistakes:**
- ❌ Different trace_id formats across services → Can't correlate → Standardize on W3C Trace Context
- ❌ Logging trace_id but not span_id → Can't pinpoint which span → Include both in logs
- ❌ No integration between Kibana and Jaeger → Manual copy-paste → Automate with URL templates

---

### **Production Checklist**

**Before deploying logging & tracing to production:**
- ✅ All services emit JSON logs with `timestamp`, `level`, `service`, `message`, `trace_id`
- ✅ OpenTelemetry instrumentation auto-injects trace_id into logs (no manual passing)
- ✅ Elasticsearch cluster sized for 500GB logs/day (10 nodes, 20TB storage, 64GB RAM/node)
- ✅ Index lifecycle policies configured (hot/warm/cold tiers, 90-day retention, auto-delete)
- ✅ Logstash pipelines parse and enrich logs (grok patterns tested, field mappings validated)
- ✅ Jaeger backend configured with Cassandra storage (30-day retention, compression enabled)
- ✅ Sampling strategy defined (1% baseline, 100% errors, 100% slow >1s, tail-based sampling)
- ✅ Kibana dashboards created for each service (error rates, latency P95, log volume)
- ✅ Alerts configured for critical issues (ERROR rate >10/min, Elasticsearch heap >80%)
- ✅ Runbooks link to Kibana/Jaeger queries for common incidents (example trace_id for reference)
- ✅ Sensitive data redacted (PII, API keys, passwords replaced with `***`)
- ✅ Access control configured (developers see own team logs, SRE see all logs)

---

### **Troubleshooting Guide**

**Problem: Logs not appearing in Kibana**
- Check Logstash pipeline status (`curl localhost:9600/_node/stats/pipelines`)
- Verify Elasticsearch indexing (`GET /_cat/indices?v`, check doc count increasing)
- Check index pattern in Kibana (must match index name `stdf-logs-*`)
- Verify @timestamp field format (must be ISO 8601 `2025-01-01T12:00:00Z`)

**Problem: Traces missing spans**
- Verify context propagation (trace_id in HTTP headers `traceparent: 00-...`)
- Check sampling decision (span might be sampled out, check sampling rate)
- Ensure parent_span_id set correctly (child spans must reference parent)
- Check Jaeger backend connectivity (spans buffered locally if backend down)

**Problem: Kibana queries slow (>5s)**
- Reduce time range (last 7 days → last 24 hours)
- Add field filters before full-text search (`service:stdf-parser AND level:ERROR` before `message:timeout`)
- Use keyword fields for exact match (`user_id.keyword` instead of `user_id`)
- Check Elasticsearch cluster health (`GET /_cluster/health`, should be green)

**Problem: High Elasticsearch disk usage**
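- Enable index lifecycle management (ILM) to move old indices to cheaper tiers and delete them on schedule
- Set `index.codec: best_compression` on log indices (gzip in Logstash only compresses data in transit, not on disk)
- Reduce retention period (90 days → 30 days for non-compliance logs)
- Delete unused indices (`DELETE /stdf-logs-2024.*`; on Elasticsearch 8.x wildcard deletes are blocked unless `action.destructive_requires_name` is set to `false`)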

**Problem: High trace storage costs**
- Reduce sampling rate (1% → 0.1% for baseline traffic)
- Shorten retention period (30 days → 7 days for non-production)
- Use tail-based sampling (keep errors/slow requests, drop happy path)
- Enable compression in Jaeger storage backend (Cassandra compression saves 50%)

---

### **Next Steps**

**Immediate (Week 1):**
- Implement structured logging in 1 service (add trace_id, JSON format, log levels)
- Set up local ELK stack (Docker Compose: Elasticsearch, Logstash, Kibana)
- Add OpenTelemetry instrumentation to 1 service (auto-inject trace_id, create spans)
- Create first Kibana dashboard (error rate, P95 latency, log volume)

**Short-term (1-3 months):**
- Roll out structured logging to all services (standardize field names, add correlation)
- Deploy production ELK cluster (10 nodes, 20TB storage, high availability)
- Implement distributed tracing for top 10 critical services (API gateway, ML models, databases)
- Set up log-trace correlation (Kibana → Jaeger links, trace_id in alerts)
- Build runbooks with example Kibana/Jaeger queries for common incidents

**Long-term (3-6 months):**
- Advanced observability (metrics + logs + traces unified in Grafana)
- Implement tail-based sampling (intelligent trace sampling based on outcome)
- Build ML-based anomaly detection from logs (predict incidents 10min ahead)
- Multi-region log aggregation with cross-cluster replication
- Compliance audit trail with immutable logs (7-year retention, cryptographic signatures)

---

### **Key Metrics to Track**

**Logging Metrics:**
- Log volume: 500GB/day (baseline), alert if >750GB/day (unexpected spike)
- Log ingestion latency: <5 seconds (Logstash → Elasticsearch)
- Elasticsearch cluster health: Green (all shards allocated, no unassigned shards)
- Query latency: P95 <100ms (Kibana search response time)
- Index size: Monitor daily growth, alert if >10% increase without expected cause

**Tracing Metrics:**
- Trace sampling rate: 1% baseline (adjust based on cost/coverage tradeoff)
- Trace completeness: >95% of sampled traces have all expected spans (no missing spans)
- Trace storage size: 50GB/day (baseline for 10M requests/day at 1% sampling)
- Query latency: P95 <200ms (Jaeger trace search response time)
- Error trace rate: 100% of errors traced (no sampling for errors)

**Business Metrics:**
- MTTR reduction: Target 80% reduction (4 hours → 48 minutes with logging/tracing)
- Incident detection speed: Target 10min before user reports (proactive alerting from logs)
- Compliance audit time: Target 95% reduction (3 weeks → 1 day with structured logs)
- Debugging efficiency: Target 3x improvement (trace shows bottleneck instantly vs manual investigation)

---

### 🎓 **Congratulations! You've Mastered Logging & Distributed Tracing!**

You can now:
- ✅ **Build structured logging** systems with JSON format and trace correlation
- ✅ **Deploy ELK stack** for centralized log management (Elasticsearch, Logstash, Kibana)
- ✅ **Implement distributed tracing** with OpenTelemetry and Jaeger
- ✅ **Correlate logs and traces** for unified debugging (bi-directional navigation)
- ✅ **Optimize costs** with sampling strategies, retention policies, and tiered storage
- ✅ **Build production observability** platforms with 99.9% reliability and 80% faster debugging

**Next Notebook:** 141_Infrastructure_as_Code - Terraform & CloudFormation for automated infrastructure provisioning 🚀

## 🎯 Key Takeaways

**When to Use**: Production microservices (>5 services), debugging distributed systems, performance monitoring, compliance/audit trails

**Limitations**: High-cardinality logging costly (TB/day storage), trace sampling misses rare bugs, learning curve for query languages (KQL/Lucene in Kibana, LogQL in Loki)

**Best Practices**: Structured JSON logging with trace IDs, sample traces 1-10%, centralize with ELK/Loki, correlate logs+traces+metrics for debugging

**Post-Silicon Application**: ATE test flow distributed tracing (wafer test → final test → binning), debug 10x faster, save $2.5M/year

## 🔍 Diagnostic & Mastery

✅ Structured logging (JSON) with correlation IDs  
✅ Distributed tracing (Jaeger/Zipkin) across services  
✅ Log aggregation (ELK stack, Loki)  
✅ Trace sampling strategies (1-10%)  
✅ Apply to semiconductor test pipelines  

**Related notebooks**: 139_Observability_Monitoring, 154_Model_Monitoring_Observability
