# 134: Service Mesh for ML - Istio and Linkerd

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** service mesh architecture (control plane manages config, data plane handles traffic)
- **Implement** traffic management with Istio/Linkerd (canary releases, A/B testing, traffic splitting)
- **Master** observability patterns (distributed tracing with Jaeger, service graphs with Kiali)
- **Apply** mTLS security to post-silicon validation pipelines (automatic certificate management)
- **Build** resilience patterns (circuit breakers, retries with exponential backoff, timeouts)
- **Deploy** production ML systems with service mesh (multi-model inference pipelines)

## 📚 What is a Service Mesh?

A **service mesh** is an infrastructure layer that handles service-to-service communication in microservices architectures. Instead of each service managing its own networking concerns (retries, timeouts, encryption, load balancing), a **sidecar proxy** is injected into each pod to handle these cross-cutting concerns. The result: **no application code changes** to enable mTLS, circuit breakers, and advanced traffic management, and distributed tracing needs only that services forward incoming trace headers.

**Without Service Mesh:**
- Each service implements its own retry logic (inconsistent, hard to update)
- No automatic mTLS (security team manually manages certificates)
- No distributed tracing (debugging latency issues takes hours)
- Traffic management requires code changes (deploy new version to shift traffic)

**With Service Mesh (Istio/Linkerd):**
- **Sidecar proxies** (Envoy, Linkerd2-proxy) handle all networking automatically
- **Control plane** (Pilot and Citadel, consolidated into the single `istiod` binary since Istio 1.5) manages configuration centrally
- **mTLS automatic** (certificates issued, rotated every 24 hours without manual intervention)
- **Distributed tracing** (Jaeger shows request flow across 10+ services)
- **Traffic splitting** (route 10% traffic to new model version without code changes)
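
To make the traffic-splitting bullet concrete, here is a minimal sketch that builds the two Istio resources behind a 90/10 split as plain Python dictionaries and prints them as JSON. The helper name `build_traffic_split`, the host `model-service`, and the `version` pod labels are assumptions for illustration; the `apiVersion` may need adjusting for your Istio release.

```python
import json


def build_traffic_split(host: str, stable_weight: int, canary_weight: int) -> list:
    """Return [DestinationRule, VirtualService] for a weighted split (hypothetical helper)."""
    if stable_weight + canary_weight != 100:
        raise ValueError("weights must sum to 100")

    # The DestinationRule names the subsets by pod label (labels are assumed here)
    destination_rule = {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": f"{host}-versions"},
        "spec": {
            "host": host,
            "subsets": [
                {"name": "v1", "labels": {"version": "v1"}},
                {"name": "v2", "labels": {"version": "v2"}},
            ],
        },
    }

    # The VirtualService assigns a percentage of requests to each subset
    virtual_service = {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-split"},
        "spec": {
            "hosts": [host],
            "http": [{
                "route": [
                    {"destination": {"host": host, "subset": "v1"}, "weight": stable_weight},
                    {"destination": {"host": host, "subset": "v2"}, "weight": canary_weight},
                ]
            }],
        },
    }
    return [destination_rule, virtual_service]


manifests = build_traffic_split("model-service", 90, 10)
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

Shifting more traffic to the new version is then a matter of changing the two weights and re-applying the VirtualService; the application pods are not redeployed.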

**Why Service Mesh?**
- ✅ **Zero Code Changes**: Add mTLS, tracing, retries without modifying application code
- ✅ **Centralized Policy Management**: Control all traffic routing from one place (control plane)
- ✅ **Automatic Observability**: Proxies export metrics and trace spans for every request (applications only need to forward trace headers so the spans join into one trace)
- ✅ **Security by Default**: mTLS enabled for all service-to-service traffic (zero-trust networking)
- ✅ **Progressive Delivery**: Canary releases with automated rollback on metric degradation (driven by a rollout controller such as Flagger or Argo Rollouts on top of the mesh)

## 🏭 Post-Silicon Validation Use Cases

**Use Case 1: Multi-Model Inference Pipeline with Traffic Splitting**
- **Input**: 5-service ML pipeline (feature extraction → wafer map analysis → parametric model → spatial model → ensemble)
- **Output**: Istio routes 10% traffic to new ensemble model v2.5, monitors accuracy for 1 hour
- **Value**: Safe deployment - if accuracy <99%, automatic rollback to v2.4 (no downtime)
- **Business Impact**: **$180K/year savings** (prevent bad model deployments, reduce rollback time 95%)

**Use Case 2: Zero-Trust STDF Processing with Automatic mTLS**
- **Input**: STDF parser → feature extractor → outlier detector → results storage (4 services)
- **Output**: Linkerd automatically encrypts all traffic, rotates certificates every 24 hours
- **Value**: Compliance with data security regulations (SOC 2, ISO 27001) without manual certificate management
- **Business Impact**: **$95K/year savings** (eliminate manual certificate rotation, pass security audits)

**Use Case 3: Resilient Wafer Analysis with Circuit Breakers**
- **Input**: Wafer map analyzer calls external defect classification API (3rd party service, sometimes slow)
- **Output**: Istio circuit breaker opens after 5 failures, prevents cascade failures across pipeline
- **Value**: Pipeline continues processing other wafers (graceful degradation vs complete failure)
- **Business Impact**: **$340K/year savings** (prevent pipeline downtime, maintain 99.9% availability)
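
As a rough sketch of how Use Case 3 maps onto Istio configuration: circuit breaking is expressed as `outlierDetection` on a DestinationRule. The block below builds such a resource as a Python dictionary. The host name `defect-api.example.com`, the helper name, and every value other than the five-failure threshold are assumptions for illustration; an external host would also need to be registered with the mesh.

```python
import json


def build_circuit_breaker(host: str, max_consecutive_errors: int = 5) -> dict:
    """Return a DestinationRule with outlier detection (hypothetical helper)."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": "defect-api-circuit-breaker"},
        "spec": {
            "host": host,
            "trafficPolicy": {
                "outlierDetection": {
                    # Eject a backend after this many consecutive 5xx responses
                    "consecutive5xxErrors": max_consecutive_errors,
                    # How often backends are scanned (assumed value)
                    "interval": "10s",
                    # Minimum time an ejected backend stays out (assumed value)
                    "baseEjectionTime": "30s",
                    # Allow every backend to be ejected if all are failing (assumed value)
                    "maxEjectionPercent": 100,
                }
            },
        },
    }


dr = build_circuit_breaker("defect-api.example.com")
print(json.dumps(dr, indent=2))
```

While the backend is ejected, callers fail fast instead of waiting on a slow dependency, which is what keeps the rest of the pipeline processing other wafers.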

**Use Case 4: A/B Testing for Yield Prediction Models**
- **Input**: Yield predictor v3.0 (new transformer architecture) vs v2.8 (baseline GBM)
- **Output**: Istio routes premium customers to v3.0, regular customers to v2.8 (header-based routing)
- **Value**: Compare model accuracy on real production traffic (not just test data)
- **Business Impact**: **$2.8M/year savings** (v3.0 improves yield prediction by 0.5% → reduced manufacturing waste)
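
The header-based routing in Use Case 4 could look roughly like the VirtualService below, again built as a Python dictionary. The host `yield-predictor`, the header `x-user-segment: premium`, the subset names `v3-0` and `v2-8`, and the helper name are assumptions for illustration; the subsets would need a matching DestinationRule.

```python
import json


def build_header_route(host: str, header: str, value: str,
                       treatment_subset: str, control_subset: str) -> dict:
    """Return a VirtualService that routes on an exact header match (hypothetical helper)."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-ab-test"},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # Rules are evaluated in order; the first matching rule wins
                    "match": [{"headers": {header: {"exact": value}}}],
                    "route": [{"destination": {"host": host, "subset": treatment_subset}}],
                },
                {
                    # No match block: default route for all remaining traffic
                    "route": [{"destination": {"host": host, "subset": control_subset}}],
                },
            ],
        },
    }


vs = build_header_route("yield-predictor", "x-user-segment", "premium", "v3-0", "v2-8")
print(json.dumps(vs, indent=2))
```

Because the split is keyed on a request attribute rather than a random weight, each customer consistently sees the same model version for the duration of the experiment.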

## 🔄 Service Mesh Workflow

```mermaid
graph TB
    subgraph "Control Plane"
        A[Pilot<br/>Service Discovery<br/>Traffic Rules]
        B[Citadel<br/>Certificate Authority<br/>mTLS Certs]
        C[Galley<br/>Configuration<br/>Validation]
    end
    
    subgraph "Data Plane - Application Pods"
        D[Feature Service Pod<br/>App Container + Envoy Proxy]
        E[Model Service Pod<br/>App Container + Envoy Proxy]
        F[Ensemble Service Pod<br/>App Container + Envoy Proxy]
    end
    
    A -->|Config Distribution| D
    A -->|Config Distribution| E
    A -->|Config Distribution| F
    
    B -->|Issue Certificates| D
    B -->|Issue Certificates| E
    B -->|Issue Certificates| F
    
    D -->|mTLS Encrypted| E
    E -->|mTLS Encrypted| F
    
    D -->|Metrics/Traces| G[Prometheus<br/>Jaeger]
    E -->|Metrics/Traces| G
    F -->|Metrics/Traces| G
    
    style A fill:#e1f5ff
    style B fill:#ffe1e1
    style C fill:#fff4e1
    style G fill:#e1ffe1
```

## 📊 Learning Path Context

**Prerequisites:**
- **Notebook 131**: Docker for ML (containerization fundamentals)
- **Notebook 132**: Kubernetes Fundamentals (deployments, services, pods)
- **Notebook 133**: Kubernetes Advanced (operators, CRDs, StatefulSets)

**Next Steps:**
- **Notebook 135**: GitOps (ArgoCD, Flux for declarative deployments)
- **Notebook 136**: CI/CD for ML (Tekton, GitHub Actions with service mesh)
- **Notebook 137**: Infrastructure as Code (Terraform for Kubernetes + Istio)

---

Let's build service mesh systems for ML! 🚀

In [None]:
# Setup and Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import json
import time
from datetime import datetime, timedelta
from pathlib import Path
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Any, Tuple
from enum import Enum
import uuid
import hashlib

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")

print("✅ Environment ready for service mesh simulation")

## 2. 🏗️ Service Mesh Architecture - Control Plane and Data Plane

### 📝 What's Happening in This Section?

**Purpose:** Understand the two-layer architecture of service meshes (control plane manages configuration, data plane handles actual traffic)

**Key Points:**
- **Control Plane**: Manages configuration, distributes policies, handles certificate issuance (Pilot, Citadel, Galley)
- **Data Plane**: Sidecar proxies intercept all traffic, enforce policies, collect metrics (Envoy, Linkerd2-proxy)
- **Sidecar Pattern**: Each pod gets additional container (proxy) that handles networking
- **Service Discovery**: Control plane tells proxies about all services and their endpoints
- **Policy Distribution**: Control plane pushes routing rules, security policies to all proxies
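
The "control plane pushes, proxies cache" relationship can be sketched in a few lines. In Istio, Envoy sidecars receive configuration pushed over the xDS APIs and route from their local copy rather than asking the control plane on every request. The class names `PushControlPlane`, `CachingProxy`, and `ConfigSnapshot` below are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ConfigSnapshot:
    """Mesh configuration at one version (hypothetical type)."""
    version: int
    endpoints: Dict[str, List[str]]
    weights: Dict[str, Dict[str, int]]


class PushControlPlane:
    """Publishes a fresh snapshot to every subscribed proxy on each change."""

    def __init__(self) -> None:
        self._version = 0
        self._endpoints: Dict[str, List[str]] = {}
        self._weights: Dict[str, Dict[str, int]] = {}
        self._subscribers: List[Callable[[ConfigSnapshot], None]] = []

    def subscribe(self, callback: Callable[[ConfigSnapshot], None]) -> None:
        self._subscribers.append(callback)
        callback(self._snapshot())  # new proxies get the current state immediately

    def add_endpoint(self, service: str, address: str) -> None:
        self._endpoints.setdefault(service, []).append(address)
        self._publish()

    def set_weights(self, service: str, weights: Dict[str, int]) -> None:
        self._weights[service] = dict(weights)
        self._publish()

    def _snapshot(self) -> ConfigSnapshot:
        # Copy the state so proxies never share mutable data with the control plane
        return ConfigSnapshot(
            version=self._version,
            endpoints={k: list(v) for k, v in self._endpoints.items()},
            weights={k: dict(v) for k, v in self._weights.items()},
        )

    def _publish(self) -> None:
        self._version += 1
        snapshot = self._snapshot()
        for callback in self._subscribers:
            callback(snapshot)


class CachingProxy:
    """Sidecar stand-in that routes from its last received snapshot."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.config: Optional[ConfigSnapshot] = None

    def on_config(self, snapshot: ConfigSnapshot) -> None:
        self.config = snapshot


cp = PushControlPlane()
proxy = CachingProxy("feature-service-0")
cp.subscribe(proxy.on_config)
cp.add_endpoint("model-service", "10.0.2.10:8080")
cp.set_weights("model-service", {"v1": 90, "v2": 10})
print(proxy.config.version, proxy.config.weights)
```

A useful consequence of this design is that proxies keep routing from their cached snapshot even if the control plane is briefly unavailable.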

**Why This Matters:**
- **Decoupling**: Application code doesn't handle mTLS, retries, metrics (separation of concerns)
- **Centralized Control**: Change traffic routing for all services from one place (control plane)
- **Zero Code Changes**: Add service mesh to existing applications without modifying code
- **Observability**: Proxies automatically export metrics, traces, logs (no manual instrumentation)

**Post-Silicon Application:**
STDF parsing pipeline (4 services: Parser → Extractor → Analyzer → Storage) gets automatic mTLS and distributed tracing, and can add circuit breakers through configuration, by deploying Istio and enabling sidecar injection (label the namespace `istio-injection=enabled`, or opt in per pod with `sidecar.istio.io/inject: "true"`)

In [None]:
# Service Mesh Architecture Simulation

@dataclass
class Certificate:
    """TLS certificate for mTLS."""
    service_name: str
    issued_at: datetime
    expires_at: datetime
    certificate_id: str = field(default_factory=lambda: uuid.uuid4().hex[:12])
    
    def is_valid(self) -> bool:
        """Check if certificate is still valid."""
        now = datetime.now()
        return self.issued_at <= now <= self.expires_at


@dataclass
class ServiceEndpoint:
    """Represents a service endpoint (pod)."""
    service_name: str
    pod_name: str
    ip: str
    port: int
    version: str = "v1"
    healthy: bool = True
    
    def get_address(self) -> str:
        """Get full address."""
        return f"{self.ip}:{self.port}"


@dataclass
class RoutingRule:
    """Traffic routing rule."""
    source_service: str
    destination_service: str
    version_weights: Dict[str, int]  # {"v1": 90, "v2": 10}
    headers: Optional[Dict[str, str]] = None  # Header-based routing
    
    def select_version(self, request_headers: Optional[Dict[str, str]] = None) -> str:
        """Select version based on weights or headers."""
        # Header-based routing takes precedence
        if self.headers and request_headers:
            for header_key, header_value in self.headers.items():
                if request_headers.get(header_key) == header_value:
                    # Simplification: the first non-"v1" version is treated as the test version
                    for version in self.version_weights:
                        if version != "v1":
                            return version
        
        # Weight-based routing
        versions = list(self.version_weights.keys())
        weights = list(self.version_weights.values())
        total = sum(weights)
        # Cast to a plain str because np.random.choice returns a NumPy scalar
        return str(np.random.choice(versions, p=[w / total for w in weights]))


class ControlPlane:
    """Service mesh control plane (Istio Pilot + Citadel)."""
    
    def __init__(self, name: str = "istio-control-plane"):
        self.name = name
        self.services: Dict[str, List[ServiceEndpoint]] = {}
        self.routing_rules: List[RoutingRule] = []
        self.certificates: Dict[str, Certificate] = {}
        self.certificate_lifetime_hours: int = 24
    
    def register_service(self, endpoint: ServiceEndpoint):
        """Register service endpoint."""
        if endpoint.service_name not in self.services:
            self.services[endpoint.service_name] = []
        self.services[endpoint.service_name].append(endpoint)
        print(f"📝 Registered: {endpoint.service_name} ({endpoint.pod_name}) at {endpoint.get_address()}")
    
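    def add_routing_rule(self, rule: RoutingRule):
        """Add a traffic routing rule, replacing any existing rule for the same route."""
        # get_routing_rule() returns the first match, so a stale rule for the same
        # source/destination pair would otherwise shadow the new one
        self.routing_rules = [
            r for r in self.routing_rules
            if not (r.source_service == rule.source_service
                    and r.destination_service == rule.destination_service)
        ]
        self.routing_rules.append(rule)
        print(f"🔀 Routing rule: {rule.source_service} → {rule.destination_service} "
              f"(weights: {rule.version_weights})")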
    
    def issue_certificate(self, service_name: str) -> Certificate:
        """Issue TLS certificate for service (Citadel functionality)."""
        cert = Certificate(
            service_name=service_name,
            issued_at=datetime.now(),
            expires_at=datetime.now() + timedelta(hours=self.certificate_lifetime_hours)
        )
        self.certificates[service_name] = cert
        print(f"🔒 Certificate issued: {service_name} (expires in {self.certificate_lifetime_hours}h)")
        return cert
    
    def get_endpoints(self, service_name: str, version: str = None) -> List[ServiceEndpoint]:
        """Get service endpoints (optionally filtered by version)."""
        endpoints = self.services.get(service_name, [])
        
        if version:
            endpoints = [ep for ep in endpoints if ep.version == version]
        
        # Filter out unhealthy endpoints
        endpoints = [ep for ep in endpoints if ep.healthy]
        
        return endpoints
    
    def get_routing_rule(self, source: str, destination: str) -> Optional[RoutingRule]:
        """Get routing rule for source → destination."""
        for rule in self.routing_rules:
            if rule.source_service == source and rule.destination_service == destination:
                return rule
        return None


@dataclass
class Request:
    """HTTP request."""
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    source_service: str = ""
    destination_service: str = ""
    headers: Dict[str, str] = field(default_factory=dict)
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    timestamp: datetime = field(default_factory=datetime.now)


@dataclass
class Response:
    """HTTP response."""
    request_id: str
    status_code: int = 200
    latency_ms: float = 0.0
    endpoint: Optional[ServiceEndpoint] = None
    encrypted: bool = False


class SidecarProxy:
    """Envoy sidecar proxy (data plane)."""
    
    def __init__(self, service_name: str, pod_name: str, control_plane: ControlPlane):
        self.service_name = service_name
        self.pod_name = pod_name
        self.control_plane = control_plane
        self.certificate: Optional[Certificate] = None
        self.request_count: int = 0
        self.error_count: int = 0
        self.total_latency_ms: float = 0.0
        
        # Request certificate from control plane
        self.certificate = self.control_plane.issue_certificate(self.service_name)
    
    def forward_request(self, request: Request) -> Response:
        """Forward request to destination service."""
        self.request_count += 1
        
        # Get routing rule
        rule = self.control_plane.get_routing_rule(
            self.service_name, 
            request.destination_service
        )
        
        # Select version based on routing rule
        if rule:
            version = rule.select_version(request.headers)
        else:
            version = "v1"  # Default
        
        # Get endpoints for selected version
        endpoints = self.control_plane.get_endpoints(
            request.destination_service, 
            version
        )
        
        if not endpoints:
            self.error_count += 1
            return Response(
                request_id=request.request_id,
                status_code=503,  # Service Unavailable
                latency_ms=5.0
            )
        
        # Load balance across endpoints (random)
        endpoint = np.random.choice(endpoints)
        
        # Simulate latency
        base_latency = np.random.uniform(10, 50)
        if version == "v2":
            base_latency *= 1.2  # v2 slightly slower
        
        # Simulate encryption overhead (mTLS)
        if self.certificate and self.certificate.is_valid():
            base_latency += 2.0  # mTLS overhead
            encrypted = True
        else:
            encrypted = False
        
        latency = base_latency
        self.total_latency_ms += latency
        
        return Response(
            request_id=request.request_id,
            status_code=200,
            latency_ms=latency,
            endpoint=endpoint,
            encrypted=encrypted
        )
    
    def get_metrics(self) -> Dict[str, float]:
        """Get golden metrics (requests, errors, latency)."""
        avg_latency = self.total_latency_ms / self.request_count if self.request_count > 0 else 0
        error_rate = self.error_count / self.request_count if self.request_count > 0 else 0
        
        return {
            "request_count": self.request_count,
            "error_count": self.error_count,
            "error_rate": error_rate,
            "avg_latency_ms": avg_latency
        }


# Example 1: Service Mesh Setup
print("=" * 80)
print("EXAMPLE 1: Service Mesh Architecture Setup")
print("=" * 80)

# Create control plane
control_plane = ControlPlane()

# Register services (ML inference pipeline)
# Feature Service
control_plane.register_service(ServiceEndpoint(
    service_name="feature-service",
    pod_name="feature-service-0",
    ip="10.0.1.10",
    port=8080,
    version="v1"
))

# Model Service v1 (3 replicas)
for i in range(3):
    control_plane.register_service(ServiceEndpoint(
        service_name="model-service",
        pod_name=f"model-service-v1-{i}",
        ip=f"10.0.2.{10+i}",
        port=8080,
        version="v1"
    ))

# Model Service v2 (1 replica, canary)
control_plane.register_service(ServiceEndpoint(
    service_name="model-service",
    pod_name="model-service-v2-0",
    ip="10.0.2.20",
    port=8080,
    version="v2"
))

# Ensemble Service
control_plane.register_service(ServiceEndpoint(
    service_name="ensemble-service",
    pod_name="ensemble-service-0",
    ip="10.0.3.10",
    port=8080,
    version="v1"
))

print(f"\n📊 Services Registered: {len(control_plane.services)}")
print(f"   • feature-service: {len(control_plane.get_endpoints('feature-service'))} endpoints")
print(f"   • model-service: {len(control_plane.get_endpoints('model-service'))} endpoints")
print(f"     - v1: {len(control_plane.get_endpoints('model-service', 'v1'))} replicas")
print(f"     - v2: {len(control_plane.get_endpoints('model-service', 'v2'))} replicas")
print(f"   • ensemble-service: {len(control_plane.get_endpoints('ensemble-service'))} endpoints")

print("\n" + "=" * 80)
print("EXAMPLE 2: Traffic Routing Rules (Canary Release)")
print("=" * 80)

# Add canary routing rule: 90% to v1, 10% to v2
canary_rule = RoutingRule(
    source_service="feature-service",
    destination_service="model-service",
    version_weights={"v1": 90, "v2": 10}
)
control_plane.add_routing_rule(canary_rule)

# Create sidecar proxies
feature_proxy = SidecarProxy("feature-service", "feature-service-0", control_plane)

print("\n💡 Canary Release: 90% traffic to model-v1, 10% to model-v2")

# Simulate 100 requests
print("\n🔄 Simulating 100 requests...")
v1_count = 0
v2_count = 0
latencies_v1 = []
latencies_v2 = []

for i in range(100):
    request = Request(
        source_service="feature-service",
        destination_service="model-service"
    )
    
    response = feature_proxy.forward_request(request)
    
    if response.endpoint:
        if response.endpoint.version == "v1":
            v1_count += 1
            latencies_v1.append(response.latency_ms)
        else:
            v2_count += 1
            latencies_v2.append(response.latency_ms)

# Compute shares from the routed total instead of assuming exactly 100 requests
total_routed = v1_count + v2_count
print(f"\n📊 Traffic Distribution:")
print(f"   • model-v1: {v1_count} requests ({v1_count / total_routed:.0%})")
print(f"   • model-v2: {v2_count} requests ({v2_count / total_routed:.0%})")
print(f"\n📈 Latency:")
print(f"   • model-v1: {np.mean(latencies_v1):.2f} ms (avg)")
print(f"   • model-v2: {np.mean(latencies_v2):.2f} ms (avg)")

print("\n" + "=" * 80)
print("EXAMPLE 3: mTLS Security (Automatic Certificate Issuance)")
print("=" * 80)

print(f"\n🔒 Certificates Issued:")
for service_name, cert in control_plane.certificates.items():
    valid_for = (cert.expires_at - datetime.now()).total_seconds() / 3600
    print(f"   • {service_name}")
    print(f"     - Certificate ID: {cert.certificate_id}")
    print(f"     - Valid for: {valid_for:.1f} hours")
    print(f"     - Status: {'✅ Valid' if cert.is_valid() else '❌ Expired'}")

# Check encryption status
sample_request = Request(
    source_service="feature-service",
    destination_service="model-service"
)
sample_response = feature_proxy.forward_request(sample_request)

print(f"\n🔐 Request Encryption:")
print(f"   • Encrypted: {'✅ Yes (mTLS)' if sample_response.encrypted else '❌ No'}")
print(f"   • Latency overhead: ~2ms (TLS handshake + encryption)")

print("\n💡 Service Mesh Benefits Demonstrated:")
print("   ✅ Automatic service discovery (control plane knows all endpoints)")
print("   ✅ Traffic splitting (90/10 canary release)")
print("   ✅ mTLS encryption (automatic certificate issuance)")
print("   ✅ Load balancing (random selection across endpoints)")
print("   ✅ Metrics collection (requests, errors, latency)")

## 3. 🔀 Traffic Management - Canary Releases and A/B Testing

### 📝 What's Happening in This Section?

**Purpose:** Control traffic flow between service versions for safe deployments and experimentation

**Key Points:**
- **Canary Release**: Gradually shift traffic from old version to new (5% → 10% → 25% → 50% → 100%)
- **A/B Testing**: Route traffic based on user attributes (header, cookie, IP) to compare versions
- **Blue-Green Deployment**: Maintain two identical environments (switch traffic instantly)
- **Traffic Mirroring**: Send copy of production traffic to new version (test without user impact)
- **Weight-Based Routing**: Percentage-based distribution (80% v1, 20% v2)

**Why This Matters:**
- **Risk Mitigation**: Detect issues with 5% traffic before affecting all users
- **Fast Rollback**: Revert to v1 by changing weights (seconds vs hours for full redeployment)
- **Data-Driven Decisions**: Compare metrics (latency, errors, accuracy) between versions
- **Zero Downtime**: Gradual migration ensures always enough healthy pods serving traffic
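
When comparing latency between versions, a raw percentage difference can be misleading on small samples. One common option is Welch's t-test, sketched below with only the standard library. The synthetic latency samples and the helper name `welch_t` are assumptions for illustration, not measurements from any real deployment.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Squared standard errors of each sample mean (sample variance / n)
    se_sq_a = statistics.variance(sample_a) / len(sample_a)
    se_sq_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_b - mean_a) / math.sqrt(se_sq_a + se_sq_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se_sq_a + se_sq_b) ** 2 / (
        se_sq_a ** 2 / (len(sample_a) - 1) + se_sq_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


rng = random.Random(42)
v1_latency_ms = [rng.uniform(10, 50) for _ in range(500)]  # synthetic baseline
v2_latency_ms = [rng.uniform(12, 60) for _ in range(500)]  # synthetic candidate

t_stat, dof = welch_t(v1_latency_ms, v2_latency_ms)
print(f"t = {t_stat:.2f}, df = {dof:.0f}")
```

With hundreds of samples per group, an absolute t value above roughly 2 suggests the difference is unlikely to be noise at about the 5% level; a positive t here means the candidate is slower.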

**Post-Silicon Application:**
Deploy new wafer yield model v2.5 (95.5% validation accuracy) with canary: 5% production traffic to v2.5, monitor for 24 hours (if error rate <0.5% and accuracy ≥95%, increase to 100%)
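
The promotion criteria above can be written as a small gate function. This is a sketch under stated assumptions: the thresholds (error rate below 0.5%, accuracy at least 95%) come from the text, while the `CanaryWindow` type, the `min_requests` guard, and the choice to roll back when a threshold is missed are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    """Counts observed for the canary during one monitoring window (hypothetical type)."""
    requests: int
    errors: int
    labeled_predictions: int
    correct_predictions: int


def canary_decision(window: CanaryWindow,
                    max_error_rate: float = 0.005,
                    min_accuracy: float = 0.95,
                    min_requests: int = 1000) -> str:
    """Return 'promote', 'rollback', or 'hold' for one monitoring window."""
    # Too little evidence to decide either way (min_requests is an assumed guard)
    if window.requests < min_requests or window.labeled_predictions == 0:
        return "hold"

    error_rate = window.errors / window.requests
    accuracy = window.correct_predictions / window.labeled_predictions

    if error_rate < max_error_rate and accuracy >= min_accuracy:
        return "promote"
    return "rollback"


print(canary_decision(CanaryWindow(5000, 12, 2000, 1920)))  # 0.24% errors, 96% accuracy
print(canary_decision(CanaryWindow(5000, 40, 2000, 1920)))  # 0.80% errors
print(canary_decision(CanaryWindow(200, 0, 100, 95)))       # too few requests
```

Note that accuracy can only be computed on predictions whose true outcome is already known, so in practice the accuracy check may lag the error-rate check.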

In [None]:
# Traffic Management Simulation

class CanaryDeployment:
    """Manage gradual canary rollout."""
    
    def __init__(self, control_plane: ControlPlane, source: str, destination: str):
        self.control_plane = control_plane
        self.source = source
        self.destination = destination
        self.current_weight_v1 = 100
        self.current_weight_v2 = 0
        self.history: List[Dict] = []
    
    def update_weights(self, v2_percentage: int):
        """Update traffic weights."""
        self.current_weight_v1 = 100 - v2_percentage
        self.current_weight_v2 = v2_percentage
        
        # Update routing rule
        for rule in self.control_plane.routing_rules:
            if (rule.source_service == self.source and 
                rule.destination_service == self.destination):
                rule.version_weights = {"v1": self.current_weight_v1, "v2": self.current_weight_v2}
                break
        
        print(f"🔄 Updated traffic weights: v1={self.current_weight_v1}%, v2={self.current_weight_v2}%")
    
    def evaluate_metrics(self, proxy: SidecarProxy, num_requests: int = 1000) -> Dict:
        """Send requests and evaluate version metrics."""
        v1_requests = []
        v2_requests = []
        
        for _ in range(num_requests):
            request = Request(
                source_service=self.source,
                destination_service=self.destination
            )
            response = proxy.forward_request(request)
            
            if response.endpoint:
                if response.endpoint.version == "v1":
                    v1_requests.append(response)
                else:
                    v2_requests.append(response)
        
        # Calculate metrics
        metrics = {
            "v1": {
                "count": len(v1_requests),
                "avg_latency": np.mean([r.latency_ms for r in v1_requests]) if v1_requests else 0,
                "error_rate": sum(1 for r in v1_requests if r.status_code >= 400) / len(v1_requests) if v1_requests else 0
            },
            "v2": {
                "count": len(v2_requests),
                "avg_latency": np.mean([r.latency_ms for r in v2_requests]) if v2_requests else 0,
                "error_rate": sum(1 for r in v2_requests if r.status_code >= 400) / len(v2_requests) if v2_requests else 0
            }
        }
        
        self.history.append({
            "timestamp": datetime.now(),
            "v2_weight": self.current_weight_v2,
            "metrics": metrics
        })
        
        return metrics
    
    def gradual_rollout(self, proxy: SidecarProxy, stages: List[int], requests_per_stage: int = 1000):
        """Execute gradual canary rollout."""
        print(f"\n🚀 Starting Canary Rollout: {self.destination}")
        print(f"   Stages: {stages}%")
        print(f"   Requests per stage: {requests_per_stage}\n")
        
        for stage_percentage in stages:
            print(f"{'='*60}")
            print(f"STAGE: {stage_percentage}% traffic to v2")
            print(f"{'='*60}")
            
            # Update weights
            self.update_weights(stage_percentage)
            
            # Evaluate metrics
            metrics = self.evaluate_metrics(proxy, requests_per_stage)
            
            print(f"\n📊 Metrics for stage {stage_percentage}%:")
            print(f"   v1: {metrics['v1']['count']} requests, "
                  f"{metrics['v1']['avg_latency']:.2f} ms avg latency, "
                  f"{metrics['v1']['error_rate']*100:.2f}% errors")
            print(f"   v2: {metrics['v2']['count']} requests, "
                  f"{metrics['v2']['avg_latency']:.2f} ms avg latency, "
                  f"{metrics['v2']['error_rate']*100:.2f}% errors")
            
            # Decision logic
            if metrics['v2']['error_rate'] > 0.05:  # >5% errors
                print(f"\n❌ ERROR RATE TOO HIGH! Rolling back to v1...")
                self.update_weights(0)  # Rollback to 100% v1
                break
            
            # Compare latency only while v1 still receives traffic; at 100% v2 the
            # v1 average is 0, which would otherwise flag a false regression
            if (metrics['v1']['count'] > 0 and
                    metrics['v2']['avg_latency'] > metrics['v1']['avg_latency'] * 1.5):  # 50% slower
                print(f"\n⚠️  LATENCY REGRESSION! Pausing rollout...")
                break
            
            print(f"✅ Stage {stage_percentage}% successful, proceeding...")
            time.sleep(0.1)  # Simulate monitoring period
        else:
            # for/else: runs only when no stage triggered a rollback or pause
            print(f"\n🎉 Canary rollout complete!")
        
        print(f"\n{'='*60}")
        print(f"   Final weight: v1={self.current_weight_v1}%, v2={self.current_weight_v2}%")
        print(f"{'='*60}")


class ABTestManager:
    """Manage A/B testing with header-based routing."""
    
    def __init__(self, control_plane: ControlPlane):
        self.control_plane = control_plane
        self.experiments: Dict[str, Dict] = {}
    
    def create_experiment(self, name: str, source: str, destination: str, 
                         test_header: Dict[str, str], control_version: str = "v1", 
                         treatment_version: str = "v2"):
        """Create A/B test experiment."""
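        # Add routing rule with header matching. Requests without the test header
        # fall through to weight-based routing, so send all of them to control;
        # a 50/50 weight would leak half the control group into treatment.
        rule = RoutingRule(
            source_service=source,
            destination_service=destination,
            version_weights={control_version: 100, treatment_version: 0},
            headers=test_header
        )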
        self.control_plane.add_routing_rule(rule)
        
        self.experiments[name] = {
            "source": source,
            "destination": destination,
            "test_header": test_header,
            "control_version": control_version,
            "treatment_version": treatment_version,
            "control_metrics": [],
            "treatment_metrics": []
        }
        
        print(f"🧪 A/B Test Created: {name}")
        print(f"   Control: {control_version}")
        print(f"   Treatment: {treatment_version}")
        print(f"   Header: {test_header}")
    
    def run_experiment(self, name: str, proxy: SidecarProxy, num_requests: int = 1000):
        """Run A/B test experiment."""
        exp = self.experiments[name]
        
        control_responses = []
        treatment_responses = []
        
        for i in range(num_requests):
            # Alternate between control and treatment groups
            if i % 2 == 0:
                # Control group (no special header)
                request = Request(
                    source_service=exp["source"],
                    destination_service=exp["destination"],
                    headers={}
                )
            else:
                # Treatment group (with test header)
                request = Request(
                    source_service=exp["source"],
                    destination_service=exp["destination"],
                    headers=exp["test_header"]
                )
            
            response = proxy.forward_request(request)
            
            if i % 2 == 0:
                control_responses.append(response)
            else:
                treatment_responses.append(response)
        
        # Calculate metrics
        control_latency = np.mean([r.latency_ms for r in control_responses])
        treatment_latency = np.mean([r.latency_ms for r in treatment_responses])
        
        control_errors = sum(1 for r in control_responses if r.status_code >= 400) / len(control_responses)
        treatment_errors = sum(1 for r in treatment_responses if r.status_code >= 400) / len(treatment_responses)
        
        exp["control_metrics"].append({
            "latency": control_latency,
            "error_rate": control_errors
        })
        exp["treatment_metrics"].append({
            "latency": treatment_latency,
            "error_rate": treatment_errors
        })
        
        print(f"\n📊 A/B Test Results: {name}")
        print(f"   Control ({exp['control_version']}):")
        print(f"     - Requests: {len(control_responses)}")
        print(f"     - Avg Latency: {control_latency:.2f} ms")
        print(f"     - Error Rate: {control_errors*100:.2f}%")
        print(f"   Treatment ({exp['treatment_version']}):")
        print(f"     - Requests: {len(treatment_responses)}")
        print(f"     - Avg Latency: {treatment_latency:.2f} ms")
        print(f"     - Error Rate: {treatment_errors*100:.2f}%")
        
        # Relative latency difference (a simple heuristic, not a significance test)
        latency_diff = ((treatment_latency - control_latency) / control_latency) * 100
        print(f"\n📈 Latency Impact: {latency_diff:+.2f}%")
        
        if abs(latency_diff) < 5:
            print("   ✅ No meaningful latency difference (under 5%)")
        elif latency_diff < 0:
            print("   ✅ Treatment is faster!")
        else:
            print("   ⚠️  Treatment is slower")
        
        return {
            "control_latency": control_latency,
            "treatment_latency": treatment_latency,
            "control_errors": control_errors,
            "treatment_errors": treatment_errors,
            "latency_impact_pct": latency_diff
        }


# Example 1: Canary Deployment with Gradual Rollout
print("=" * 80)
print("EXAMPLE 1: Canary Deployment - Gradual Rollout")
print("=" * 80)

# Setup
canary = CanaryDeployment(control_plane, "feature-service", "model-service")
proxy = SidecarProxy("feature-service", "feature-service-0", control_plane)

# Execute gradual rollout: 5% → 10% → 25% → 50% → 100%
canary.gradual_rollout(proxy, stages=[5, 10, 25, 50, 100], requests_per_stage=500)

# Visualize rollout history
print("\n" + "=" * 80)
print("EXAMPLE 2: Canary Rollout Visualization")
print("=" * 80)

stages = [h["v2_weight"] for h in canary.history]
v1_latencies = [h["metrics"]["v1"]["avg_latency"] for h in canary.history]
v2_latencies = [h["metrics"]["v2"]["avg_latency"] for h in canary.history]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Traffic distribution
ax1.plot(stages, [100-s for s in stages], marker='o', linewidth=2.5, markersize=10, 
         label='v1 (stable)', color='#4ECDC4')
ax1.plot(stages, stages, marker='s', linewidth=2.5, markersize=10, 
         label='v2 (canary)', color='#FF6B6B')
ax1.fill_between(stages, 0, [100-s for s in stages], alpha=0.3, color='#4ECDC4')
ax1.fill_between(stages, 0, stages, alpha=0.3, color='#FF6B6B')
ax1.set_xlabel("Rollout Stage (%)", fontsize=12, fontweight='bold')
ax1.set_ylabel("Traffic Percentage (%)", fontsize=12, fontweight='bold')
ax1.set_title("Canary Release: Gradual Traffic Shift\n(v1 → v2)", 
              fontsize=14, fontweight='bold', pad=20)
ax1.legend(fontsize=11, loc='center left')
ax1.grid(True, alpha=0.3)
ax1.set_ylim(0, 105)

# Plot 2: Latency comparison
ax2.plot(stages, v1_latencies, marker='o', linewidth=2.5, markersize=10, 
         label='v1 latency', color='#4ECDC4')
ax2.plot(stages, v2_latencies, marker='s', linewidth=2.5, markersize=10, 
         label='v2 latency', color='#FF6B6B')
ax2.set_xlabel("Rollout Stage (%)", fontsize=12, fontweight='bold')
ax2.set_ylabel("Avg Latency (ms)", fontsize=12, fontweight='bold')
ax2.set_title("Latency Monitoring During Rollout\n(Check for regressions)", 
              fontsize=14, fontweight='bold', pad=20)
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 80)
print("EXAMPLE 3: A/B Testing with Header-Based Routing")
print("=" * 80)

# Create A/B test
ab_test = ABTestManager(control_plane)
ab_test.create_experiment(
    name="wafer_yield_model_v2_test",
    source="feature-service",
    destination="model-service",
    test_header={"x-user-segment": "premium"},
    control_version="v1",
    treatment_version="v2"
)

# Run experiment
ab_proxy = SidecarProxy("feature-service", "feature-service-1", control_plane)
results = ab_test.run_experiment("wafer_yield_model_v2_test", ab_proxy, num_requests=1000)

print("\n💡 Traffic Management Capabilities Demonstrated:")
print("   ✅ Canary release (gradual 5% → 100% rollout)")
print("   ✅ Automatic rollback (if error rate >5%)")
print("   ✅ A/B testing (header-based routing)")
print("   ✅ Metrics comparison (latency, errors between versions)")
print("   ✅ Risk mitigation (detect issues with small % of traffic)")

## 4. 🛡️ Resilience Patterns - Circuit Breakers, Retries, Timeouts

### 📝 What's Happening in This Section?

**Purpose:** Protect ML pipelines from cascading failures using resilience patterns

**Key Points:**
- **Circuit Breaker**: Stop calling a failing service (open the circuit after 5 consecutive failures, probe again after 30s)
- **Automatic Retries**: Retry transient failures (network blips, temporary overload) with exponential backoff
- **Timeouts**: Prevent hanging requests (fail fast after 10s instead of waiting 5 minutes)
- **Bulkhead Pattern**: Isolate failures (separate thread pools for critical vs non-critical services)
- **Fallback**: Return cached/default result when service unavailable

**Why This Matters:**
- **Prevent Cascade Failures**: One slow service doesn't bring down entire pipeline
- **Improve Availability**: Auto-retry succeeds 80% of the time for transient errors
- **Resource Protection**: Timeouts free up connection pools, prevent resource exhaustion
- **Graceful Degradation**: Return lower-quality result instead of complete failure
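
The bulkhead pattern listed above has no implementation in this notebook, so here is a minimal sketch of the idea. It assumes a semaphore per dependency rather than separate thread pools; the `Bulkhead` class and the compartment names are hypothetical.

```python
import threading


class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust shared capacity."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self.rejected = 0

    def execute(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately when the compartment is full
        if not self._slots.acquire(blocking=False):
            self.rejected += 1
            raise RuntimeError(f"Bulkhead '{self.name}' is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical compartments: critical inference gets more slots than reporting
inference_bulkhead = Bulkhead("inference", max_concurrent=2)
reporting_bulkhead = Bulkhead("reporting", max_concurrent=1)


def nested_call():
    # While one reporting call is in flight, a second one is rejected
    try:
        reporting_bulkhead.execute(lambda: "second")
    except RuntimeError:
        return "rejected"
    return "accepted"


outcome = reporting_bulkhead.execute(nested_call)
print(outcome, reporting_bulkhead.rejected)
print(inference_bulkhead.execute(lambda: "ok"))
```

A full reporting compartment does not affect the inference compartment, which is the isolation the pattern is meant to give.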

**Post-Silicon Application:**
Wafer map analysis pipeline: Spatial analyzer calls external defect classification API (2% error rate) → circuit breaker opens after 5 consecutive failures → return cached classification → auto-retry after 30s → restore service when API healthy
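
The "return cached classification" step in the flow above could look roughly like this. It is a sketch under stated assumptions: `CachedFallback`, the wafer IDs, and the `edge_ring` label are made up for illustration, and a real cache would also need expiry.

```python
from typing import Callable, Dict


class CachedFallback:
    """Serve the last good result for a key when the live call fails."""

    def __init__(self, default: str = "unknown"):
        self.cache: Dict[str, str] = {}
        self.default = default
        self.fallback_count = 0

    def call(self, key: str, func: Callable[[str], str]) -> str:
        try:
            result = func(key)
            self.cache[key] = result  # refresh the cache on every success
            return result
        except Exception:
            self.fallback_count += 1
            return self.cache.get(key, self.default)


def healthy_api(wafer_id: str) -> str:
    return "edge_ring"  # hypothetical defect class


def failing_api(wafer_id: str) -> str:
    raise ConnectionError("defect API unavailable")


fallback = CachedFallback(default="unclassified")
first = fallback.call("W-001", healthy_api)   # live result, stored in the cache
second = fallback.call("W-001", failing_api)  # served from the cache
third = fallback.call("W-002", failing_api)   # no cache entry, default returned
print(first, second, third, fallback.fallback_count)
```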

# Resilience Patterns Simulation

class CircuitBreakerState(Enum):
    """Circuit breaker states."""
    CLOSED = "closed"  # Normal operation
    OPEN = "open"      # Blocking requests (service unhealthy)
    HALF_OPEN = "half_open"  # Testing if service recovered


@dataclass
class CircuitBreaker:
    """Circuit breaker for resilience."""
    service_name: str
    failure_threshold: int = 5  # Open after N consecutive failures
    success_threshold: int = 2  # Close after N consecutive successes (in half-open)
    timeout_seconds: int = 30   # Time before half-open
    
    state: CircuitBreakerState = CircuitBreakerState.CLOSED
    consecutive_failures: int = 0
    consecutive_successes: int = 0
    last_failure_time: Optional[datetime] = None
    total_requests: int = 0
    blocked_requests: int = 0
    
    def call(self, success: bool) -> bool:
        """Attempt call through circuit breaker."""
        self.total_requests += 1
        
        # Check if circuit should transition from OPEN → HALF_OPEN
        if self.state == CircuitBreakerState.OPEN:
            if self.last_failure_time:
                elapsed = (datetime.now() - self.last_failure_time).total_seconds()
                if elapsed >= self.timeout_seconds:
                    print(f"⏰ Circuit breaker timeout elapsed, entering HALF_OPEN state")
                    self.state = CircuitBreakerState.HALF_OPEN
                    self.consecutive_successes = 0
                    self.consecutive_failures = 0
                else:
                    # Still open, block request
                    self.blocked_requests += 1
                    print(f"🚫 Circuit OPEN: Request blocked ({self.blocked_requests} total)")
                    return False
        
        # State: CLOSED or HALF_OPEN, allow request
        if success:
            self.consecutive_failures = 0
            self.consecutive_successes += 1
            
            # Transition: HALF_OPEN → CLOSED
            if (self.state == CircuitBreakerState.HALF_OPEN and 
                self.consecutive_successes >= self.success_threshold):
                print(f"✅ Circuit breaker CLOSED: Service recovered ({self.consecutive_successes} successes)")
                self.state = CircuitBreakerState.CLOSED
            
            return True
        else:
            self.consecutive_successes = 0
            self.consecutive_failures += 1
            self.last_failure_time = datetime.now()
            
            # Transition: CLOSED → OPEN
            # (this request ran and failed, so it is not counted as blocked)
            if (self.state == CircuitBreakerState.CLOSED and 
                self.consecutive_failures >= self.failure_threshold):
                print(f"❌ Circuit breaker OPEN: Too many failures ({self.consecutive_failures}/{self.failure_threshold})")
                self.state = CircuitBreakerState.OPEN
                return False
            
            # Transition: HALF_OPEN → OPEN (service still unhealthy)
            if self.state == CircuitBreakerState.HALF_OPEN:
                print(f"❌ Circuit breaker OPEN: Service still failing")
                self.state = CircuitBreakerState.OPEN
                return False
            
            return False


class RetryPolicy:
    """Retry policy with exponential backoff."""
    
    def __init__(self, max_retries: int = 3, base_delay_ms: float = 100, max_delay_ms: float = 5000):
        self.max_retries = max_retries
        self.base_delay_ms = base_delay_ms
        self.max_delay_ms = max_delay_ms
        self.total_retries = 0
        self.successful_retries = 0
    
    def execute(self, func, *args, **kwargs):
        """Execute function with retries."""
        for attempt in range(self.max_retries + 1):
            try:
                result = func(*args, **kwargs)
                
                if attempt > 0:
                    self.successful_retries += 1
                    print(f"   ✅ Retry {attempt} succeeded")
                
                return result
            except Exception as e:
                if attempt < self.max_retries:
                    self.total_retries += 1
                    
                    # Exponential backoff: 100ms, 200ms, 400ms, 800ms, ...
                    delay_ms = min(self.base_delay_ms * (2 ** attempt), self.max_delay_ms)
                    print(f"   ⚠️  Attempt {attempt + 1} failed: {e}")
                    print(f"   ⏳ Retrying in {delay_ms:.0f}ms...")
                    time.sleep(delay_ms / 1000)
                else:
                    print(f"   ❌ All {self.max_retries} retries exhausted")
                    raise


class TimeoutPolicy:
    """Request timeout policy."""
    
    def __init__(self, timeout_seconds: float = 10.0):
        self.timeout_seconds = timeout_seconds
        self.timeout_count = 0
    
    def execute(self, func, latency_ms: float, *args, **kwargs):
        """Execute function with timeout."""
        if latency_ms > self.timeout_seconds * 1000:
            self.timeout_count += 1
            raise TimeoutError(f"Request exceeded timeout ({latency_ms:.0f}ms > {self.timeout_seconds*1000:.0f}ms)")
        
        return func(*args, **kwargs)


# Example 1: Circuit Breaker Pattern
print("=" * 80)
print("EXAMPLE 1: Circuit Breaker - Protect from Cascading Failures")
print("=" * 80)

circuit_breaker = CircuitBreaker(
    service_name="spatial-analyzer-api",
    failure_threshold=5,
    success_threshold=2,
    timeout_seconds=30
)

print(f"\n🔧 Circuit Breaker Configuration:")
print(f"   • Failure threshold: {circuit_breaker.failure_threshold} (open after N failures)")
print(f"   • Success threshold: {circuit_breaker.success_threshold} (close after N successes)")
print(f"   • Timeout: {circuit_breaker.timeout_seconds}s (before retry)")

# Simulate API calls with failures
print(f"\n{'='*60}")
print(f"Simulating API calls (50% error rate)")
print(f"{'='*60}")

success_count = 0
failure_count = 0
blocked_count = 0

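# Note: with timeout_seconds=30 and only ~1s of simulated traffic, a circuit that
# opens stays OPEN for the rest of this loop; lower the timeout to watch recovery.
for i in range(20):
    # Simulate 50% error rate for first 10 calls
    if i < 10:
        success = np.random.random() > 0.5
    else:
        # API recovered, 95% success rate
        success = np.random.random() > 0.05
    
    print(f"\nRequest {i+1}:")
    # call() returns False for failed and blocked requests alike, so use the
    # breaker's own counter to tell the two apart
    blocked_before = circuit_breaker.blocked_requests
    allowed = circuit_breaker.call(success)
    
    if circuit_breaker.blocked_requests > blocked_before:
        blocked_count += 1
    elif allowed:
        success_count += 1
        print(f"   ✅ Success")
    else:
        failure_count += 1
        print(f"   ❌ Failed")
    
    print(f"   State: {circuit_breaker.state.value.upper()}")
    time.sleep(0.05)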

print(f"\n{'='*60}")
print(f"Circuit Breaker Summary:")
print(f"   • Total requests: {circuit_breaker.total_requests}")
print(f"   • Successful: {success_count}")
print(f"   • Failed: {failure_count}")
print(f"   • Blocked: {blocked_count}")
print(f"   • Final state: {circuit_breaker.state.value.upper()}")
print(f"{'='*60}")

print("\n" + "=" * 80)
print("EXAMPLE 2: Retry Policy with Exponential Backoff")
print("=" * 80)

retry_policy = RetryPolicy(max_retries=3, base_delay_ms=100, max_delay_ms=1000)

# Simulate function that fails first 2 times, succeeds on 3rd
def unreliable_api_call(attempt_tracker: List[int]):
    """Simulates unreliable API."""
    attempt_tracker[0] += 1
    
    if attempt_tracker[0] <= 2:
        raise Exception(f"Network error (attempt {attempt_tracker[0]})")
    
    return {"status": "success", "data": "wafer_analysis_results"}

print(f"\n🔧 Retry Policy Configuration:")
print(f"   • Max retries: {retry_policy.max_retries}")
print(f"   • Base delay: {retry_policy.base_delay_ms}ms")
print(f"   • Max delay: {retry_policy.max_delay_ms}ms")
print(f"   • Backoff: Exponential (100ms → 200ms → 400ms → ...)")

print(f"\n{'='*60}")
print("Calling unreliable API (fails first 2 times)")
print(f"{'='*60}\n")

attempt_tracker = [0]
try:
    result = retry_policy.execute(unreliable_api_call, attempt_tracker)
    print(f"\n✅ Final result: {result}")
except Exception as e:
    print(f"\n❌ Final failure: {e}")

print(f"\n📊 Retry Statistics:")
print(f"   • Total retries: {retry_policy.total_retries}")
print(f"   • Successful retries: {retry_policy.successful_retries}")
# Guard against division by zero when the first attempt succeeds (no retries made)
if retry_policy.total_retries:
    print(f"   • Success rate: {retry_policy.successful_retries / retry_policy.total_retries * 100:.1f}%")

print("\n" + "=" * 80)
print("EXAMPLE 3: Timeout Policy - Fail Fast")
print("=" * 80)

timeout_policy = TimeoutPolicy(timeout_seconds=1.0)

print(f"\n🔧 Timeout Policy Configuration:")
print(f"   • Timeout: {timeout_policy.timeout_seconds}s")
print(f"   • Strategy: Fail fast (don't wait indefinitely)")

# Simulate requests with varying latencies
latencies = [50, 150, 300, 800, 1200, 2000, 100]  # ms

print(f"\n{'='*60}")
print(f"Simulating requests with varying latencies")
print(f"{'='*60}\n")

for i, latency in enumerate(latencies):
    print(f"Request {i+1}: latency={latency}ms")
    
    try:
        def dummy_request():
            return {"status": "success"}
        
        timeout_policy.execute(dummy_request, latency)
        print(f"   ✅ Success ({latency}ms < {timeout_policy.timeout_seconds*1000}ms)")
    except TimeoutError as e:
        print(f"   ❌ Timeout: {e}")

print(f"\n📊 Timeout Statistics:")
print(f"   • Total requests: {len(latencies)}")
print(f"   • Timeouts: {timeout_policy.timeout_count}")
print(f"   • Timeout rate: {timeout_policy.timeout_count / len(latencies) * 100:.1f}%")

# Visualize resilience patterns impact
print("\n" + "=" * 80)
print("EXAMPLE 4: Resilience Patterns Impact Visualization")
print("=" * 80)

# Simulate service with/without resilience patterns
np.random.seed(42)

# Scenario: External API with 20% error rate
num_requests = 100
base_error_rate = 0.20

# Without resilience (direct calls)
without_resilience_successes = []
for _ in range(num_requests):
    success = np.random.random() > base_error_rate
    without_resilience_successes.append(1 if success else 0)

without_resilience_success_rate = np.mean(without_resilience_successes) * 100

# With resilience (retries + circuit breaker)
with_resilience_successes = []
cb = CircuitBreaker(service_name="test", failure_threshold=5, timeout_seconds=5)
retry = RetryPolicy(max_retries=2)

for _ in range(num_requests):
    # Simulate with retries
    success_attempts = []
    for attempt in range(3):  # 1 initial + 2 retries
        success = np.random.random() > base_error_rate
        success_attempts.append(success)
        if success:
            break
    
    final_success = any(success_attempts)
    
    # Circuit breaker check
    allowed = cb.call(final_success)
    
    with_resilience_successes.append(1 if (allowed and final_success) else 0)

with_resilience_success_rate = np.mean(with_resilience_successes) * 100

# Visualization
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 6))

# Plot 1: Success rate comparison
categories = ['Without\nResilience', 'With Retries +\nCircuit Breaker']
success_rates = [without_resilience_success_rate, with_resilience_success_rate]
colors = ['#FF6B6B', '#4ECDC4']

bars = ax1.bar(categories, success_rates, color=colors, alpha=0.8, edgecolor='black', linewidth=2)
ax1.axhline(y=95, color='green', linestyle='--', linewidth=2, label='Target (95%)')
ax1.set_ylabel("Success Rate (%)", fontsize=12, fontweight='bold')
ax1.set_title("Service Availability Improvement\n(20% base error rate)", 
              fontsize=14, fontweight='bold', pad=20)
ax1.set_ylim(0, 105)
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, rate in zip(bars, success_rates):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 2,
             f'{rate:.1f}%', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Plot 2: Cumulative success over time
ax2.plot(np.cumsum(without_resilience_successes), linewidth=2.5, 
         label='Without Resilience', color='#FF6B6B')
ax2.plot(np.cumsum(with_resilience_successes), linewidth=2.5, 
         label='With Resilience', color='#4ECDC4')
ax2.set_xlabel("Request Number", fontsize=12, fontweight='bold')
ax2.set_ylabel("Cumulative Successes", fontsize=12, fontweight='bold')
ax2.set_title("Cumulative Success Over Time\n(Resilience reduces failures)", 
              fontsize=14, fontweight='bold', pad=20)
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

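# Plot 3: Improvement breakdown
# NOTE: the 70/30 split between retries and circuit breaker is an illustrative
# assumption, not a measured attribution. In this simulation the gain comes from
# retries; the circuit breaker limits load on a failing service rather than
# raising the success rate.
improvements = {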
    'Base\nSuccess': 100 - base_error_rate * 100,
    '+Retries': (with_resilience_success_rate - without_resilience_success_rate) * 0.7,
    '+Circuit\nBreaker': (with_resilience_success_rate - without_resilience_success_rate) * 0.3
}

x_pos = np.arange(len(improvements))
bars3 = ax3.bar(improvements.keys(), improvements.values(), 
                color=['#95E1D3', '#F38181', '#AA96DA'], alpha=0.8,
                edgecolor='black', linewidth=2)
ax3.set_ylabel("Contribution (%)", fontsize=12, fontweight='bold')
ax3.set_title("Resilience Pattern Contributions\n(to overall success rate)", 
              fontsize=14, fontweight='bold', pad=20)
ax3.grid(True, alpha=0.3, axis='y')

# Add value labels
for bar in bars3:
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height + 1,
             f'{height:.1f}%', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\n📊 Resilience Impact:")
print(f"   Without resilience: {without_resilience_success_rate:.1f}% success")
print(f"   With resilience: {with_resilience_success_rate:.1f}% success")
print(f"   Improvement: +{with_resilience_success_rate - without_resilience_success_rate:.1f} percentage points")

print("\n💡 Resilience Patterns Demonstrated:")
print("   ✅ Circuit breaker (prevent cascade failures)")
print("   ✅ Automatic retries (exponential backoff)")
print("   ✅ Timeouts (fail fast, free resources)")
print("   ✅ Success rate improvement (80% → 90%+)")

## 5. 🚀 Real-World Projects Using Service Mesh

Build production ML services with Istio and Linkerd:

---

### **Project 1: Multi-Model Inference Pipeline with Canary Releases** ⭐⭐⭐⭐⭐
**Objective:** Deploy 5-service ML inference pipeline with automatic canary releases and rollback

**Business Value:**  
- $280K/year savings (catch regressions before full rollout, reduce downtime from bad deployments)
- 99.95% uptime (circuit breakers prevent cascade failures)
- 3x faster deployments (gradual rollouts with automatic rollback)

**Success Criteria:**
- ✅ Pipeline handles 1000 req/sec with <100ms p99 latency
- ✅ Canary release completes 5% → 100% in 24 hours (no manual intervention)
- ✅ Automatic rollback triggered if error rate >1% or latency >150ms
- ✅ Distributed tracing shows end-to-end request flow across all services
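
The rollback criteria above (error rate >1% or latency >150ms) can be sketched as a small decision function. This is not how Istio itself decides: `CanaryMetrics`, `next_canary_action`, and the stage list are hypothetical names, and in practice a rollout controller would read these metrics from Prometheus and rewrite the VirtualService weights.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    error_rate_pct: float
    p99_latency_ms: float


def next_canary_action(metrics, current_weight, stages=(5, 10, 25, 50, 100),
                       max_error_rate_pct=1.0, max_p99_latency_ms=150.0):
    """Return ("rollback", 0), ("promote", next_weight) or ("complete", 100)."""
    unhealthy = (metrics.error_rate_pct > max_error_rate_pct
                 or metrics.p99_latency_ms > max_p99_latency_ms)
    if unhealthy:
        return ("rollback", 0)
    # Advance to the next stage above the current canary weight
    later = [s for s in stages if s > current_weight]
    if not later:
        return ("complete", 100)
    return ("promote", later[0])


print(next_canary_action(CanaryMetrics(0.4, 120.0), current_weight=5))
print(next_canary_action(CanaryMetrics(1.5, 120.0), current_weight=25))
print(next_canary_action(CanaryMetrics(0.2, 90.0), current_weight=100))
```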

**Pipeline Services:**
1. **Feature Engineering Service:** Extract 50 features from raw wafer test data
2. **Wafer Map Analyzer:** Spatial pattern detection (defects, hotspots)
3. **Parametric Model:** Predict yield from electrical parameters
4. **Spatial Model:** Predict yield from wafer map patterns
5. **Ensemble Combiner:** Weighted voting (parametric 60%, spatial 40%)
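
One way to read the 60/40 weighting in the Ensemble Combiner is as a weighted average of the two yield predictions. That reading is an assumption, as are the function name and the sample inputs below.

```python
def combine_yield_predictions(parametric: float, spatial: float,
                              w_parametric: float = 0.6,
                              w_spatial: float = 0.4) -> float:
    """Weighted average of two yield predictions, each expected in [0, 1]."""
    total = w_parametric + w_spatial
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by the weight sum keeps the result in range for any weights
    return (w_parametric * parametric + w_spatial * spatial) / total


combined = combine_yield_predictions(parametric=0.90, spatial=0.80)
print(f"{combined:.3f}")
```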

**Service Mesh Features:**
- **Istio VirtualService:** Traffic splitting (v2.4: 95%, v2.5: 5%)
- **DestinationRule:** Circuit breaker (max connections: 100, consecutive errors: 5)
- **ServiceEntry + VirtualService:** Register the external defect classification API in the mesh and apply a 10s timeout to calls to it
- **Prometheus metrics:** Request rate, latency, error rate (automatic)
- **Jaeger tracing:** Distributed traces across all 5 services

**Implementation Hints:**
```yaml
# Canary VirtualService
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ensemble-combiner
spec:
  hosts:
  - ensemble-combiner
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: ensemble-combiner
        subset: v2-5
      weight: 100
  - route:
    - destination:
        host: ensemble-combiner
        subset: v2-4
      weight: 95
    - destination:
        host: ensemble-combiner
        subset: v2-5
      weight: 5
---
# Circuit Breaker DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: wafer-map-analyzer
spec:
  host: wafer-map-analyzer
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutive5xxErrors: 5  # replaces the deprecated consecutiveErrors field
      interval: 30s
      baseEjectionTime: 60s
```

**Post-Silicon Application:**  
Deploy new ensemble model v2.5 (96.5% accuracy) with canary release, monitor for 24 hours, auto-rollback if p99 latency >150ms or accuracy drops below 95%

---

### **Project 2: Zero-Trust mTLS for STDF Processing** ⭐⭐⭐⭐
**Objective:** Secure STDF parsing pipeline with automatic mTLS and authorization policies

**Business Value:**  
- $150K/year savings (eliminate manual certificate management, pass security audits)
- Compliance with data security requirements (all traffic encrypted)
- 30-minute incident response (detect unauthorized access via service graph)

**Success Criteria:**
- ✅ All service-to-service traffic encrypted with mTLS (100% coverage)
- ✅ Certificates auto-rotated every 24 hours (zero manual intervention)
- ✅ Authorization policies enforce least-privilege (parser can't call storage directly)
- ✅ Security audit passes (all traffic logged, encrypted, authorized)

**Pipeline Services:**
1. **STDF Parser:** Parse binary STDF (Standard Test Data Format) files
2. **Feature Extractor:** Extract parametric features (Vdd, Idd, frequency)
3. **Outlier Detector:** Detect parametric anomalies (z-score >3)
4. **Results Storage:** Store parsed data in PostgreSQL
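
The z-score rule used by the Outlier Detector can be sketched with the standard library alone. The simulated Idd readings and the injected anomaly are made up for illustration, and using the population standard deviation is an assumption.

```python
import random
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


# Simulated Idd readings in mA (hypothetical), with one injected anomaly
rng = random.Random(0)
idd_ma = [rng.gauss(50.0, 0.5) for _ in range(200)]
idd_ma[17] = 60.0

outliers = zscore_outliers(idd_ma)
print(outliers)
```

A single extreme value inflates the standard deviation it is measured against, so robust variants based on the median and MAD are a common alternative.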

**Service Mesh Security:**
- **Linkerd automatic mTLS:** All traffic encrypted without code changes
- **AuthorizationPolicy (Istio):** Parser → Extractor (allow), Parser → Storage (deny)
- **PeerAuthentication (Istio):** Require mTLS for all services (STRICT mode)
- **Certificate rotation:** Auto-rotate every 24 hours (Linkerd identity)

**Implementation Hints:**
```yaml
# Linkerd automatic mTLS (annotation on namespace)
apiVersion: v1
kind: Namespace
metadata:
  name: stdf-processing
  annotations:
    linkerd.io/inject: enabled
---
# Istio AuthorizationPolicy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: stdf-parser-policy
spec:
  selector:
    matchLabels:
      app: feature-extractor
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/stdf-processing/sa/stdf-parser"]
---
# Deny direct storage access
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: storage-deny-parser
spec:
  selector:
    matchLabels:
      app: results-storage
  action: DENY
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/stdf-processing/sa/stdf-parser"]
```
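
To reason about how the ALLOW and DENY policies above interact, the evaluation order can be modeled in a few lines. This is a simplified sketch: it matches on source principal only and ignores CUSTOM actions, namespaces, and request paths. The `Policy` class and the feature-extractor service account are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Policy:
    app: str               # workload the policy selects (app label)
    action: str            # "ALLOW" or "DENY"
    principals: List[str]  # source identities the rule matches


def is_allowed(policies: List[Policy], target_app: str, source_principal: str) -> bool:
    selected = [p for p in policies if p.app == target_app]
    # 1. Any matching DENY policy rejects the request
    if any(p.action == "DENY" and source_principal in p.principals for p in selected):
        return False
    allow = [p for p in selected if p.action == "ALLOW"]
    # 2. No ALLOW policy on the workload: the request is allowed by default
    if not allow:
        return True
    # 3. Otherwise at least one ALLOW policy must match
    return any(source_principal in p.principals for p in allow)


PARSER = "cluster.local/ns/stdf-processing/sa/stdf-parser"
EXTRACTOR = "cluster.local/ns/stdf-processing/sa/feature-extractor"  # hypothetical

policies = [
    Policy("feature-extractor", "ALLOW", [PARSER]),
    Policy("results-storage", "DENY", [PARSER]),
]

print(is_allowed(policies, "feature-extractor", PARSER))  # parser -> extractor
print(is_allowed(policies, "results-storage", PARSER))    # parser -> storage
print(is_allowed(policies, "results-storage", EXTRACTOR)) # extractor -> storage
```

In this model the extractor reaches storage only because no ALLOW policy is attached to storage; once one is added, every permitted caller has to be listed explicitly.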

**Post-Silicon Application:**  
Secure proprietary wafer test data (trade secrets) with end-to-end mTLS, prevent parser from directly accessing storage (defense in depth)

---

### **Project 3: Chaos Engineering with Fault Injection** ⭐⭐⭐⭐
**Objective:** Test ML pipeline resilience by injecting faults (delays, errors, aborts)

**Business Value:**  
- $220K/year savings (proactively find weaknesses before production incidents)
- 60% reduction in MTTR (mean time to recovery, from 4 hours → 1.5 hours)
- 99.9% availability target (find and fix single points of failure)

**Success Criteria:**
- ✅ Pipeline survives 50% error injection on non-critical services
- ✅ Circuit breakers open after 5 consecutive failures (prevent cascade)
- ✅ Automatic retries succeed 80% of the time for transient errors
- ✅ Critical services (payment, logging) have fallback mechanisms

**Chaos Experiments:**
1. **Latency Injection:** Add 5s delay to feature service (test timeouts)
2. **Error Injection:** 50% error rate on spatial analyzer (test circuit breakers)
3. **Pod Failure:** Kill ensemble combiner pod (test pod restart; requires `kubectl delete pod` or a chaos tool, since Istio abort injection only returns HTTP errors)
4. **Network Partition:** Block traffic between services (test retry logic)

**Service Mesh Fault Injection:**
- **VirtualService with delays:** Inject 5s delay on 20% of requests
- **VirtualService with aborts:** Return HTTP 500 on 50% of requests
- **VirtualService with retries:** Retry 3x with exponential backoff

**Implementation Hints:**
```yaml
# Latency injection
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: feature-service-fault
spec:
  hosts:
  - feature-service
  http:
  - fault:
      delay:
        percentage:
          value: 20.0
        fixedDelay: 5s
    route:
    - destination:
        host: feature-service
---
# Error injection
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: spatial-analyzer-fault
spec:
  hosts:
  - spatial-analyzer
  http:
  - fault:
      abort:
        percentage:
          value: 50.0
        httpStatus: 500
    route:
    - destination:
        host: spatial-analyzer
---
# Retry policy
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ensemble-combiner-retry
spec:
  hosts:
  - ensemble-combiner
  http:
  - retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx
    route:
    - destination:
        host: ensemble-combiner
```
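
Before applying these manifests, it can help to check what the percentages mean for a batch of requests. The sketch below simulates the two fault types with the values from the YAML above (20% delayed by 5s, 50% aborted with HTTP 500). It is a plain Python model, not Envoy's implementation, and it applies both faults to one service for brevity.

```python
import random


def inject_fault(rng, delay_pct=0.0, fixed_delay_s=0.0, abort_pct=0.0, abort_status=500):
    """Return (status_code, added_delay_s) for one simulated request."""
    # Delay and abort are sampled independently, each with its own percentage
    delay = fixed_delay_s if rng.random() * 100 < delay_pct else 0.0
    if rng.random() * 100 < abort_pct:
        return abort_status, delay
    return 200, delay


rng = random.Random(42)
results = [
    inject_fault(rng, delay_pct=20.0, fixed_delay_s=5.0, abort_pct=50.0)
    for _ in range(10_000)
]
delayed = sum(1 for _, d in results if d > 0) / len(results)
aborted = sum(1 for s, _ in results if s == 500) / len(results)
print(f"delayed={delayed:.1%} aborted={aborted:.1%}")
```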

**Post-Silicon Application:**  
Test wafer analysis pipeline resilience: inject 50% errors on external defect API, verify circuit breaker opens and cached classifications used

---

### **Project 4: Distributed Tracing for Performance Debugging** ⭐⭐⭐⭐
**Objective:** Debug slow ML predictions using distributed tracing (Jaeger, Zipkin)

**Business Value:**  
- $95K/year savings (reduce debugging time from 8 hours/week → 2 hours/week)
- 40% latency reduction (identify and optimize slowest services)
- 2x faster root cause analysis (trace shows exact service adding latency)

**Success Criteria:**
- ✅ 100% of requests traced end-to-end (no sampling gaps)
- ✅ Traces show service-by-service latency breakdown
- ✅ Critical path identified (feature extraction adds 45ms, optimize first)
- ✅ Anomaly detection (flag requests >500ms for investigation)

**Tracing Features:**
- **Automatic span creation:** Istio/Linkerd proxies create spans for each service call
- **Trace ID propagation:** Headers (x-request-id, x-b3-traceid) passed through pipeline
- **Service graph:** Visualize request flow (Feature → Model A → Model B → Ensemble)
- **Latency heatmap:** Find p99 latency hotspots
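
To make the trace ID propagation concrete, the sketch below builds B3 headers by hand: a 128-bit trace ID shared by the whole request and a fresh 64-bit span ID per hop. In a mesh the sidecar normally generates these values, so the two helper functions are illustrative names, not part of any Istio or Linkerd API.

```python
import secrets


def new_b3_headers(sampled: bool = True) -> dict:
    """Headers for the first hop of a new trace."""
    return {
        "x-b3-traceid": secrets.token_hex(16),  # 128-bit trace ID, 32 hex chars
        "x-b3-spanid": secrets.token_hex(8),    # 64-bit span ID, 16 hex chars
        "x-b3-sampled": "1" if sampled else "0",
    }


def child_b3_headers(parent: dict) -> dict:
    """Headers for an outbound call: same trace, new span, parent = current span."""
    return {
        "x-b3-traceid": parent["x-b3-traceid"],
        "x-b3-spanid": secrets.token_hex(8),
        "x-b3-parentspanid": parent["x-b3-spanid"],
        "x-b3-sampled": parent["x-b3-sampled"],
    }


root = new_b3_headers()
child = child_b3_headers(root)
print(root)
print(child)
```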

**Implementation Hints:**
```python
import requests

# Headers the application must copy from the inbound request to every outbound
# call; the sidecar creates spans, but only the app can link inbound to outbound.
TRACE_HEADERS = ('x-request-id', 'x-b3-traceid', 'x-b3-spanid',
                 'x-b3-parentspanid', 'x-b3-sampled')

# Application code (propagate trace headers)
def call_next_service(request_headers, data):
    # Forward only the trace headers that are actually present
    trace_headers = {
        name: request_headers[name]
        for name in TRACE_HEADERS
        if request_headers.get(name) is not None
    }
    
    response = requests.post(
        'http://next-service:8080/predict',
        headers=trace_headers,
        json=data,
        timeout=10  # avoid hanging if the downstream service stalls
    )
    
    return response

# Jaeger query (find slow requests)
# UI: http://jaeger:16686
# Search form: Service = ensemble-combiner, Min Duration = 500ms
# Result: Traces sorted by latency, click to see service-by-service breakdown
```

**Post-Silicon Application:**  
Debug slow wafer yield predictions (p99 latency 350ms, target 100ms), trace shows feature extraction adds 120ms → optimize by caching computed features
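
The service-by-service breakdown that a trace view shows can be approximated offline from exported spans. The sketch below assumes child calls run sequentially, so a span's self-time is its duration minus its children's durations. The span dictionaries are a made-up format, and only the 350ms total and 120ms feature extraction figures come from the scenario above; the rest is invented.

```python
from collections import defaultdict


def latency_by_service(spans):
    """Sum self-time per service; each span has id, parent, service, duration_ms."""
    child_time = defaultdict(float)
    for span in spans:
        if span["parent"] is not None:
            child_time[span["parent"]] += span["duration_ms"]
    totals = defaultdict(float)
    for span in spans:
        # Self-time = own duration minus time spent waiting on child spans
        totals[span["service"]] += span["duration_ms"] - child_time[span["id"]]
    return dict(totals)


# Hypothetical trace for one slow request
spans = [
    {"id": "a", "parent": None, "service": "ensemble-combiner", "duration_ms": 350.0},
    {"id": "b", "parent": "a", "service": "feature-extraction", "duration_ms": 120.0},
    {"id": "c", "parent": "a", "service": "parametric-model", "duration_ms": 90.0},
    {"id": "d", "parent": "a", "service": "spatial-model", "duration_ms": 100.0},
]

breakdown = latency_by_service(spans)
slowest = max(breakdown, key=breakdown.get)
print(breakdown)
print(slowest)
```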

---

### **Project 5: Blue-Green Deployment with Instant Rollback** ⭐⭐⭐
**Objective:** Deploy new model version to separate environment, switch traffic instantly

**Business Value:**  
- $180K/year savings (zero downtime during deployments)
- 10-second rollback (vs 15 minutes for rolling update rollback)
- 100% confidence in new version (test with production traffic before cutover)

**Success Criteria:**
- ✅ Blue and green environments identical (same replicas, resources)
- ✅ Traffic switch completes in <10 seconds (VirtualService update)
- ✅ Rollback completes in <10 seconds (revert VirtualService)
- ✅ Zero dropped requests during cutover

**Implementation Hints:**
```yaml
# Blue environment (production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yield-model-blue
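spec:
  replicas: 10
  selector:  # required by apps/v1; must match the pod template labels
    matchLabels:
      app: yield-model
      version: blue
  template: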
    metadata:
      labels:
        app: yield-model
        version: blue
    spec:
      containers:
      - name: model
        image: yield-model:v2.4
---
# Green environment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yield-model-green
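spec:
  replicas: 10
  selector:  # required by apps/v1; must match the pod template labels
    matchLabels:
      app: yield-model
      version: green
  template: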
    metadata:
      labels:
        app: yield-model
        version: green
    spec:
      containers:
      - name: model
        image: yield-model:v2.5
---
# VirtualService (switch traffic)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: yield-model
spec:
  hosts:
  - yield-model
  http:
  - route:
    - destination:
        host: yield-model
        subset: green  # Switch to green (change to blue for rollback); subsets must be defined in a DestinationRule
      weight: 100
```
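
Because cutover and rollback differ only in the subset name, the manifests can be generated from a single parameter. The sketch below builds them as plain dictionaries, including a DestinationRule that maps the `blue` and `green` subsets to the `version` labels used by the Deployments. The function name is hypothetical, and applying the output to a cluster is left out.

```python
import json


def blue_green_manifests(active: str, host: str = "yield-model"):
    """Build the DestinationRule and VirtualService that send all traffic to `active`."""
    if active not in ("blue", "green"):
        raise ValueError("active must be 'blue' or 'green'")
    destination_rule = {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": host},
        "spec": {
            "host": host,
            # Subsets map the names used in the VirtualService to pod labels
            "subsets": [
                {"name": color, "labels": {"version": color}}
                for color in ("blue", "green")
            ],
        },
    }
    virtual_service = {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [{
                "route": [{
                    "destination": {"host": host, "subset": active},
                    "weight": 100,
                }],
            }],
        },
    }
    return destination_rule, virtual_service


dr, vs = blue_green_manifests("green")      # cutover
_, rollback_vs = blue_green_manifests("blue")  # rollback
print(json.dumps(vs, indent=2))
```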

**Post-Silicon Application:**  
Deploy wafer yield model v2.5 to green environment, test with mirror traffic (10%), switch production traffic to green, rollback to blue if issues detected

---

### **Project 6: Multi-Cluster Service Mesh (Global Load Balancing)** ⭐⭐⭐⭐⭐
**Objective:** Deploy ML services across 3 regions (US, EU, Asia) with global load balancing

**Business Value:**  
- $420K/year savings (eliminate manual multi-region deployments)
- 99.99% availability (survive entire region failure)
- 60% latency reduction (route to nearest region)

**Success Criteria:**
- ✅ Services deployed to 3 Kubernetes clusters (us-west, eu-central, asia-east)
- ✅ Cross-cluster service discovery (us-west can call eu-central)
- ✅ Locality-aware load balancing (route to nearest healthy region)
- ✅ Automatic failover (if us-west fails, route to eu-central)
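
The locality and failover criteria above reduce to a simple rule: use the caller's own region if it is healthy, otherwise walk an ordered failover list. The sketch below models that rule in plain Python. The failover order is an assumption, and in Istio this behavior would be configured in the mesh rather than coded in the application.

```python
def pick_region(client_region, healthy, failover_order):
    """Prefer the client's own region; otherwise walk its failover list."""
    for region in [client_region] + failover_order.get(client_region, []):
        if healthy.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


# Hypothetical failover preferences per region
FAILOVER = {
    "us-west": ["eu-central", "asia-east"],
    "eu-central": ["us-west", "asia-east"],
    "asia-east": ["us-west", "eu-central"],
}

healthy = {"us-west": False, "eu-central": True, "asia-east": True}
print(pick_region("us-west", healthy, FAILOVER))    # fails over
print(pick_region("asia-east", healthy, FAILOVER))  # stays local
```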

**Implementation Hints:**
```bash
# Install Istio multi-cluster (shared control plane)
istioctl install --set profile=demo --set values.global.multiCluster.enabled=true

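# Link clusters: export each cluster's credentials and apply them to a peer cluster
# (repeat for every pair; recent istioctl releases drop the experimental "x" prefix)
istioctl x create-remote-secret --context=us-west --name=cluster-us-west | kubectl apply -f - --context=eu-central
istioctl x create-remote-secret --context=eu-central --name=cluster-eu-central | kubectl apply -f - --context=asia-east
istioctl x create-remote-secret --context=asia-east --name=cluster-asia-east | kubectl apply -f - --context=us-west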

# Deploy service to all clusters
kubectl apply -f yield-model-deployment.yaml --context=us-west
kubectl apply -f yield-model-deployment.yaml --context=eu-central
kubectl apply -f yield-model-deployment.yaml --context=asia-east
```

**Post-Silicon Application:**  
Global wafer yield prediction service: US fabs route to us-west cluster, EU fabs to eu-central, Asia fabs to asia-east (reduce latency from 300ms → 50ms)

---

### **Project 7: Rate Limiting and Quota Management** ⭐⭐⭐
**Objective:** Protect ML services from overload with rate limiting (100 req/sec per user)

**Business Value:**  
- $125K/year savings (prevent service overload, maintain SLA for premium users)
- Fair resource allocation (no single user monopolizes resources)
- DDoS protection (automatic throttling of abusive traffic)

**Success Criteria:**
- ✅ Rate limit enforced: 100 req/sec per user (HTTP 429 if exceeded)
- ✅ Premium users get 500 req/sec quota (tiered limits)
- ✅ Global rate limits: 10,000 req/sec cluster-wide (prevent overload)
- ✅ Smooth degradation (throttle gradually, not hard cutoff)
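
The tiered per-user limits above can be modeled with a fixed-window counter: count requests per user per one-second window and answer HTTP 429 once the tier's quota is used up. The sketch below is a single-process illustration with hypothetical names. It never evicts old windows, and it produces a hard cutoff, so a token bucket would be closer to the "smooth degradation" criterion.

```python
from collections import defaultdict

# Requests per second for each tier, taken from the criteria above
LIMITS = {"standard": 100, "premium": 500}


class FixedWindowLimiter:
    def __init__(self, limits):
        self.limits = limits
        self.counts = defaultdict(int)  # (user_id, window_start_second) -> count

    def check(self, user_id: str, tier: str, now: float) -> int:
        """Return the HTTP status for one request: 200 or 429."""
        window = (user_id, int(now))
        if self.counts[window] >= self.limits[tier]:
            return 429
        self.counts[window] += 1
        return 200


limiter = FixedWindowLimiter(LIMITS)
# 300 requests from each user, spread across the first second
standard = [limiter.check("fab-a", "standard", now=i / 300) for i in range(300)]
premium = [limiter.check("fab-b", "premium", now=i / 300) for i in range(300)]
print(standard.count(200), standard.count(429), premium.count(200))
```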

**Implementation Hints:**
```yaml
# Envoy rate limit config
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: rate-limit-filter
spec:
  configPatches:
  - applyTo: HTTP_FILTER
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.ratelimit
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
          domain: yield-model
          rate_limit_service:
            grpc_service:
              envoy_grpc:
                cluster_name: rate_limit_cluster
---
# Rate limit descriptor
domain: yield-model
descriptors:
  - key: user_id
    rate_limit:
      unit: second
      requests_per_unit: 100
  - key: premium_user
    rate_limit:
      unit: second
      requests_per_unit: 500
```

**Post-Silicon Application:**  
Protect STDF parser service from overload (test equipment can generate 500 files/sec burst), rate limit to 100 files/sec per fab, queue excess

---

### **Project 8: Service Mesh Observability Dashboard** ⭐⭐⭐⭐
**Objective:** Build unified observability dashboard (Grafana + Prometheus + Kiali)

**Business Value:**  
- $95K/year savings (single pane of glass, eliminate tool-hopping)
- 50% faster incident response (all metrics in one place)
- Proactive monitoring (alerts fire before users notice issues)

**Success Criteria:**
- ✅ Golden metrics (requests, errors, latency, saturation) for all services
- ✅ Service dependency graph (visualize traffic flow)
- ✅ Alerts configured (error rate >1%, latency p99 >200ms, saturation >80%)
- ✅ Historical data retention (30 days for trend analysis)

**Dashboard Metrics:**
- **Request Rate:** Requests/sec per service
- **Error Rate:** HTTP 5xx/4xx percentage
- **Latency:** p50, p95, p99 latency histograms
- **Saturation:** CPU, memory, connection pool utilization
- **Service Graph:** Real-time traffic flow visualization

**Implementation Hints:**
```yaml
# Prometheus ServiceMonitor (scrapes istiod control-plane metrics; sidecar metrics need a separate PodMonitor)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
  - port: http-monitoring
    interval: 15s
---
# Grafana dashboard (Istio Mesh Dashboard)
# Import: https://grafana.com/grafana/dashboards/7639
# Shows: Request rate, success rate, latency (p50, p90, p99)

# Kiali service graph
# Access: http://kiali:20001
# Shows: Service dependencies, traffic flow, health status
```

**Post-Silicon Application:**  
Monitor entire wafer analysis pipeline (5 services, 50 pods), alert if feature extraction latency >50ms or ensemble accuracy drops below 95%
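
The alert thresholds in the success criteria (error rate >1%, p99 latency >200ms, saturation >80%) amount to a simple comparison per metric. In practice these would live in Prometheus alerting rules; the sketch below only illustrates the logic, and the metric keys and function name are hypothetical.

```python
# Thresholds taken from the success criteria above
THRESHOLDS = {
    "error_rate_pct": 1.0,
    "latency_p99_ms": 200.0,
    "saturation_pct": 80.0,
}


def firing_alerts(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that exceed their threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit  # a missing metric never fires
    )
```

For example, `firing_alerts({"error_rate_pct": 0.4, "latency_p99_ms": 250.0})` reports only the latency alert.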

---

### 💡 **Project Selection Guide**

**Choose Project 1-2** if building production ML pipelines (canary releases, mTLS security)  
**Choose Project 3-4** if improving reliability (chaos engineering, distributed tracing)  
**Choose Project 5-6** if operating at scale (blue-green deployments, multi-cluster)  
**Choose Project 7-8** if optimizing operations (rate limiting, observability)

**All projects include:**
- Complete implementation templates (Istio/Linkerd YAML)
- Post-silicon validation applications
- Business value quantification ($ savings, % improvement)
- Success criteria (measurable objectives)

## 6. 📚 Comprehensive Takeaways - Service Mesh for ML

---

### 🎯 **Core Concepts Summary**

#### **Service Mesh Architecture**
- **Control Plane**: Manages configuration, distributes policies, issues certificates (istiod in current Istio, which consolidates the former Pilot, Citadel, and Galley components)
- **Data Plane**: Sidecar proxies intercept traffic, enforce policies, collect metrics (Envoy, Linkerd2-proxy)
- **Sidecar Pattern**: Inject proxy container into each pod (handles networking without code changes)
- **When to Use**: Microservices with >5 services, need for mTLS, advanced traffic control, observability
- **When NOT to Use**: Monolithic apps, simple request-response, <3 services (overhead not justified)

#### **Traffic Management**
- **Canary Release**: Gradual rollout (5% → 10% → 25% → 50% → 100%) with automatic rollback
- **A/B Testing**: Header-based routing (premium users → v2, regular users → v1)
- **Blue-Green**: Two identical environments, instant traffic switch (zero downtime)
- **Traffic Mirroring**: Copy production traffic to test environment (no user impact)
- **Use Case**: Deploy new ML model version safely, compare metrics between versions

#### **Resilience Patterns**
- **Circuit Breaker**: Stop calling unhealthy service (open after 5 failures, retry after 30s)
- **Retries**: Exponential backoff (100ms, 200ms, 400ms, ...) for transient errors
- **Timeouts**: Fail fast (10s timeout prevents resource exhaustion)
- **Bulkhead**: Isolate failures (separate thread pools for critical services)
- **Impact**: 80% → 92% success rate improvement with retries + circuit breakers
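
The numbers in the resilience list (open after 5 failures, retry after 30s, backoff of 100ms, 200ms, 400ms) can be expressed in a few lines. In a mesh this logic runs in the sidecar proxy, so the following is only a simplified illustration of the behavior, not Envoy's implementation; the class and function names are hypothetical.

```python
def backoff_delays(base_ms=100, factor=2, retries=3):
    """Delays before each retry: 100ms, 200ms, 400ms with the defaults."""
    return [base_ms * factor**attempt for attempt in range(retries)]


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, probes again after `reset_after_s`."""

    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now):
        if self.opened_at is None:
            return True
        # Half-open: let a trial request through once the cool-down has passed
        return now - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now
```

Timestamps are passed in as seconds so the behavior can be checked without waiting; a caller would normally supply `time.monotonic()`.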

#### **Security (mTLS)**
- **Automatic Encryption**: All service-to-service traffic encrypted without code changes
- **Certificate Management**: Auto-rotation every 24 hours (zero manual work)
- **Authorization Policies**: Least-privilege access (service A can call B, not C)
- **Zero-Trust Networking**: Verify every request (never trust, always verify)

---

### 🏗️ **Architecture Best Practices**

#### **1. Istio vs Linkerd - When to Choose**

**Choose Istio when:**
- Need advanced traffic management (complex routing rules, multi-cluster)
- Require extensive observability (Kiali service graph, deep Prometheus integration)
- Multi-protocol support (HTTP, gRPC, TCP, MongoDB, Redis)
- Large organization (hundreds of services, multiple teams)
- **Trade-off**: Higher resource overhead (150-200MB memory per sidecar)

**Choose Linkerd when:**
- Priority is simplicity and performance (minimal config, low resource usage)
- Need just mTLS and basic traffic management (80% use case coverage)
- Smaller clusters (<100 services)
- Rust-based proxy performance (50-80MB memory per sidecar, roughly half of Envoy's latency overhead per the table below)
- **Trade-off**: Fewer features (no advanced traffic splitting, limited multi-cluster)

**Comparison Table:**

| **Feature** | **Istio** | **Linkerd** |
|-------------|-----------|-------------|
| **Proxy** | Envoy (C++) | Linkerd2-proxy (Rust) |
| **Memory per sidecar** | 150-200MB | 50-80MB |
| **CPU overhead** | 5-10% | 2-5% |
| **Latency overhead** | 3-5ms | 1-2ms |
| **Configuration** | Complex (VirtualService, DestinationRule, Gateway) | Simple (ServiceProfile, TrafficSplit) |
| **Multi-cluster** | ✅ Full support | ⚠️ Limited |
| **Traffic management** | ✅ Advanced (weight, header, method, URI) | ✅ Basic (weight-based) |
| **mTLS** | ✅ Automatic | ✅ Automatic |
| **Observability** | ✅ Kiali, Grafana, Jaeger | ✅ Grafana, Jaeger |
| **Learning curve** | Steep | Gentle |
| **Best for** | Large enterprises, complex routing | Startups, performance-critical |
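
To put the per-sidecar memory figures from the table into cluster terms, the arithmetic below multiplies them by a pod count. The 50-pod figure is borrowed from the wafer analysis pipeline example in Project 8; the function name is hypothetical, and the ranges are the table's rough estimates rather than measurements.

```python
# Memory per sidecar in MB (low, high), copied from the comparison table
PER_SIDECAR_MB = {"istio": (150, 200), "linkerd": (50, 80)}


def mesh_memory_mb(pods, mesh):
    """Total sidecar memory range (low, high) in MB for `pods` meshed pods."""
    low, high = PER_SIDECAR_MB[mesh]
    return pods * low, pods * high


for mesh in PER_SIDECAR_MB:
    low, high = mesh_memory_mb(50, mesh)
    print(f"{mesh}: {low}-{high} MB across 50 pods")
```

With these inputs the Istio sidecars account for 7,500-10,000MB and the Linkerd sidecars for 2,500-4,000MB, which is the gap to weigh against Istio's extra features.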

#### **2. Service Mesh Deployment Patterns**

**Pattern 1: Namespace-Level Injection**
```yaml
# Enable auto-injection for namespace
apiVersion: v1
kind: Namespace
metadata:
  name: ml-inference
  labels:
    istio-injection: enabled
```
**Use when**: All services in namespace need service mesh (recommended for new projects)

**Pattern 2: Pod-Level Injection**
```yaml
# Enable injection for specific pod
apiVersion: v1
kind: Pod
metadata:
  annotations:
    sidecar.istio.io/inject: "true"
```
**Use when**: Migrating existing services gradually

**Pattern 3: Manual Injection**
```bash
# Inject sidecar manually
istioctl kube-inject -f deployment.yaml | kubectl apply -f -
```
**Use when**: Testing service mesh on specific workloads

#### **3. Traffic Management Strategies**

**Canary Release Timeline:**
```
Day 1:   5% traffic to v2 (monitor for 24h)
Day 2:  10% traffic to v2 (if error rate <1%)
Day 3:  25% traffic to v2 (if latency <150ms)
Day 4:  50% traffic to v2 (compare accuracy)
Day 5: 100% traffic to v2 (full rollout)

Rollback: Any stage, if metrics degrade → 0% to v2 (instant)
```
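
The timeline above is effectively a small state machine: advance one step when the gates pass, drop to 0% when they fail. A minimal sketch follows, assuming both the error-rate and latency gates are checked at every step (the timeline lists them on different days) and leaving out the accuracy comparison; the function name is hypothetical.

```python
CANARY_STEPS = [5, 10, 25, 50, 100]  # percent of traffic to v2, one step per day


def next_canary_weight(current, error_rate_pct, latency_ms,
                       max_error_rate_pct=1.0, max_latency_ms=150.0):
    """Return the next v2 weight, or 0 to roll back when a gate fails."""
    if error_rate_pct >= max_error_rate_pct or latency_ms >= max_latency_ms:
        return 0  # rollback: send all traffic back to v1
    if current not in CANARY_STEPS:
        raise ValueError(f"unexpected canary weight: {current}")
    index = CANARY_STEPS.index(current)
    # Stay at 100% once the rollout is complete
    return CANARY_STEPS[min(index + 1, len(CANARY_STEPS) - 1)]
```

The returned value would be written into the VirtualService weights by whatever automation drives the rollout.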

**A/B Test Design:**
```python
# Segment A (control): 50% users → model-v1
# Segment B (treatment): 50% users → model-v2
# Metrics: Compare accuracy, latency, error rate
# Duration: 7 days minimum (statistical significance)
# Decision: If v2 accuracy ≥ v1 and latency <150ms → full rollout
```
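
The "statistical significance" condition in the design above can be checked with a two-proportion z-test on the accuracy of each segment. This is one reasonable choice of test, not something the notebook prescribes; the function name and the idea of counting correct predictions per segment are assumptions.

```python
import math


def two_proportion_z_test(correct_a, total_a, correct_b, total_b):
    """Return (z, two_sided_p) for the accuracy difference between B and A."""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0  # both segments are all-correct or all-wrong: no evidence
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return z, p_value
```

A positive `z` with a small `p_value` (for example below 0.05) would support rolling out model-v2; a large `p_value` suggests running the test longer before deciding.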

---

### ⚡ **Performance Optimization**

#### **1. Reduce Sidecar Overhead**

**Linkerd (Lowest Overhead):**
- Memory: 50-80MB per sidecar
- CPU: 2-5% overhead
- Latency: +1-2ms
- **Best for**: Performance-critical ML inference (<10ms target latency)

**Istio with Resource Limits:**
```yaml
# Reduce Istio sidecar resources
apiVersion: v1
kind: Pod
metadata:
  annotations:
    sidecar.istio.io/proxyCPU: "100m"
    sidecar.istio.io/proxyMemory: "128Mi"
```

**Disable Features Not Needed:**
```yaml
# Disable tracing if not used
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: false  # Save 10-15% overhead
```

#### **2. Optimize mTLS Performance**

**Use ECDSA Instead of RSA:**
- RSA: 4096-bit keys (slower, higher CPU)
- ECDSA: 256-bit keys (faster; P-256 offers security comparable to 3072-bit RSA)
- **Impact**: 30% faster TLS handshake

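**Reuse Connections to Avoid Repeated Handshakes:**
```yaml
# Istio's DestinationRule exposes no session-resumption setting; handshake cost is
# amortized by reusing established connections (see Connection Pooling below)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: tls-optimization
spec:
  host: model-service  # host is a required field
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
```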

#### **3. Connection Pooling**

```yaml
# Optimize connection pool
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: connection-pool
spec:
  host: model-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100  # Cap on TCP connections to the host
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 10  # Recycle a connection after 10 requests (0 = unlimited reuse)
```

---

### 🔒 **Security Best Practices**

#### **1. mTLS Configuration**

**Strict Mode (Recommended for Production):**
```yaml
# Require mTLS for all traffic
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: ml-inference
spec:
  mtls:
    mode: STRICT  # Reject plaintext traffic
```

**Permissive Mode (Migration):**
```yaml
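# Allow both mTLS and plaintext
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: ml-inference
spec:
  mtls:
    mode: PERMISSIVE  # Use during migration only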
```

#### **2. Authorization Policies**

**Principle of Least Privilege:**
```yaml
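# Deny requests that originate outside the ml-inference namespace
# (a true deny-all is an ALLOW policy with an empty spec)
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-external-namespaces
  namespace: ml-inference
spec:
  action: DENY
  rules:
  - from:
    - source:
        notNamespaces: ["ml-inference"]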
---
# Allow specific service-to-service calls
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-feature-to-model
spec:
  selector:
    matchLabels:
      app: model-service
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/ml-inference/sa/feature-service"]
```
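
How the two policies above combine depends on Istio's evaluation order: DENY policies are checked first, and once any ALLOW policy applies to a workload, only requests matching an ALLOW rule get through. The sketch below models that order for `model-service` with heavily simplified policies (plain predicates instead of full rule matching); all function and variable names are hypothetical.

```python
FEATURE_SA = "cluster.local/ns/ml-inference/sa/feature-service"


def outside_namespace(source):
    return source["namespace"] != "ml-inference"


def is_feature_service(source):
    return source["principal"] == FEATURE_SA


# Simplified stand-ins for the DENY and ALLOW policies above
POLICIES = [
    {"action": "DENY", "matches": outside_namespace},
    {"action": "ALLOW", "matches": is_feature_service},
]


def is_allowed(source, policies):
    """Evaluate simplified policies in the order DENY, then ALLOW."""
    deny = [p for p in policies if p["action"] == "DENY"]
    allow = [p for p in policies if p["action"] == "ALLOW"]
    if any(p["matches"](source) for p in deny):
        return False
    if not allow:
        return True  # no ALLOW policy on the workload: traffic is permitted
    return any(p["matches"](source) for p in allow)
```

Under this model a caller inside `ml-inference` that is not `feature-service` is rejected by the ALLOW policy alone, which is what makes the setup least-privilege.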

#### **3. Certificate Rotation**

**Automatic Rotation (Istio/Linkerd):**
- Default: Rotate every 24 hours
- Grace period: 12 hours (overlap old + new)
- Zero downtime: Proxies automatically pick up new certs
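
As a worked example of the timing above, the sketch below computes when a proxy would request a replacement certificate. It reads the 12-hour grace period as the window in which old and new certificates overlap, which is an interpretation of the list above rather than a statement about either mesh's internals; the names and the issue time are illustrative.

```python
from datetime import datetime, timedelta, timezone

CERT_LIFETIME = timedelta(hours=24)  # default rotation interval from the list above
GRACE_PERIOD = timedelta(hours=12)   # overlap between old and new certificate


def rotation_due_at(issued_at):
    """Moment at which a replacement certificate should be requested."""
    return issued_at + CERT_LIFETIME - GRACE_PERIOD


def needs_rotation(issued_at, now):
    return now >= rotation_due_at(issued_at)


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)  # illustrative issue time
print(rotation_due_at(issued))
```

With these values a certificate issued at midnight is due for renewal at noon, leaving 12 hours in which both certificates are valid.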

**Monitor Certificate Expiry:**
```prometheus
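# Alert if the istiod root certificate expires soon
# (workload certificates rotate automatically and are not covered by this metric)
citadel_server_root_cert_expiry_timestamp - time() < 86400  # <24h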
```

---

### 🐛 **Troubleshooting Guide**

#### **Common Issues**

**Problem 1: Sidecar not injected**
```bash
# Check namespace label
kubectl get namespace ml-inference --show-labels

# Expected: istio-injection=enabled

# If missing, add label
kubectl label namespace ml-inference istio-injection=enabled

# Restart pods to trigger injection
kubectl rollout restart deployment -n ml-inference
```

**Problem 2: mTLS connection failures**
```bash
# Check the mTLS settings that apply to a pod
# (istioctl authn tls-check is no longer available in newer Istio releases)
istioctl x describe pod pod-name -n ml-inference

# Expected: STRICT (all traffic encrypted)

# If PERMISSIVE or DISABLE, check PeerAuthentication
kubectl get peerauthentication -A
```

**Problem 3: High latency after service mesh deployment**
```bash
# Check sidecar resource limits
kubectl describe pod -n ml-inference | grep -A5 "istio-proxy"

# Increase if needed
kubectl patch deployment model-service -n ml-inference -p '
{
  "spec": {
    "template": {
      "metadata": {
        "annotations": {
          "sidecar.istio.io/proxyCPU": "500m",
          "sidecar.istio.io/proxyMemory": "512Mi"
        }
      }
    }
  }
}'
```

**Problem 4: Circuit breaker not triggering**
```bash
# Check DestinationRule
kubectl get destinationrule -n ml-inference

# Verify outlier detection config
kubectl get destinationrule model-service -n ml-inference -o yaml

# Expected:
# outlierDetection:
#   consecutive5xxErrors: 5  # replaces the deprecated consecutiveErrors
#   interval: 30s
```

**Problem 5: Canary release stuck**
```bash
# Check VirtualService weights
kubectl get virtualservice model-service -o yaml

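# Verify traffic split from the calling pod's sidecar (request counts live in the proxy, not the app)
kubectl exec -it curl-pod -c istio-proxy -- pilot-agent request GET clusters | grep model-service | grep rq_total

# Should show traffic distributed according to weights (90/10)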
```

---

### 📊 **Monitoring and Observability**

#### **1. Golden Metrics**

**Request Rate (Throughput):**
```prometheus
# Requests per second
rate(istio_requests_total[1m])
```

**Error Rate:**
```prometheus
# % of requests with errors
sum(rate(istio_requests_total{response_code=~"5.."}[1m])) / 
sum(rate(istio_requests_total[1m])) * 100
```

**Latency (p50, p95, p99):**
```prometheus
# p99 latency
histogram_quantile(0.99, 
  sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (le)
)
```
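
The `histogram_quantile` call above estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below mirrors that approach as described in the Prometheus documentation, under simplifying assumptions (buckets already sorted, lowest bucket starting at 0); the bucket values in the usage note are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile lies beyond the last finite bucket
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

For instance, with buckets `[(50, 900), (100, 980), (250, 998), (500, 1000), (math.inf, 1000)]` the p99 falls in the 100-250ms bucket, so the estimate is only as precise as the bucket boundaries allow.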

**Saturation:**
```prometheus
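# Approximate open TCP connections to model-service as a % of maxConnections
# (Istio exports opened/closed counters, not a max-connections metric; 100 is the
# maxConnections value from the connection pool DestinationRule)
(sum(istio_tcp_connections_opened_total{destination_service_name="model-service"})
  - sum(istio_tcp_connections_closed_total{destination_service_name="model-service"})) / 100 * 100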
```

#### **2. Service Dependency Graph**

**Kiali Service Graph:**
```bash
# Access Kiali dashboard
kubectl port-forward -n istio-system svc/kiali 20001:20001

# Open: http://localhost:20001
# Shows: Real-time service graph with traffic flow
```

**Prometheus Service Mesh Metrics:**
```bash
# Access Prometheus
kubectl port-forward -n istio-system svc/prometheus 9090:9090

# Query: istio_requests_total
# Group by: source_app, destination_app
```

#### **3. Distributed Tracing**

**Jaeger Trace Query:**
```bash
# Access Jaeger UI
kubectl port-forward -n istio-system svc/jaeger-query 16686:16686

# Query slow traces
# Service: model-service
# Min duration: 500ms
# Result: Traces with service-by-service latency breakdown
```

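**Common Trace Headers (propagate in application code):**
```python
import requests

# Headers the sidecars use to stitch spans into one trace (B3 propagation)
TRACE_HEADERS = [
    'x-request-id',
    'x-b3-traceid',
    'x-b3-spanid',
    'x-b3-parentspanid',
    'x-b3-sampled',
    'x-b3-flags',
]


def forward_trace_headers(incoming_headers):
    """Copy the trace headers present on the incoming request."""
    return {h: incoming_headers[h] for h in TRACE_HEADERS if h in incoming_headers}


def call_next_service(next_service_url, payload, incoming_headers):
    # Propagate when calling next service (header values, not just names)
    return requests.post(next_service_url, json=payload,
                         headers=forward_trace_headers(incoming_headers))
```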

---

### 🚀 **Production Deployment Checklist**

#### **Pre-Deployment**

- [ ] **Service mesh installed** (Istio or Linkerd control plane deployed)
- [ ] **Namespaces labeled** for auto-injection (`istio-injection=enabled`)
- [ ] **mTLS mode set** (STRICT for production, PERMISSIVE for migration)
- [ ] **Resource limits configured** (sidecar CPU/memory limits)
- [ ] **Monitoring stack deployed** (Prometheus, Grafana, Kiali)
- [ ] **Distributed tracing enabled** (Jaeger or Zipkin)
- [ ] **Authorization policies defined** (least-privilege access)

#### **Traffic Management**

- [ ] **VirtualServices created** (traffic routing rules)
- [ ] **DestinationRules configured** (circuit breakers, connection pools)
- [ ] **Canary release strategy defined** (5% → 100% timeline)
- [ ] **Rollback plan tested** (revert to previous version <60 seconds)
- [ ] **A/B testing segments defined** (header-based routing rules)

#### **Resilience**

- [ ] **Circuit breakers configured** (consecutive errors threshold, timeout)
- [ ] **Retry policies set** (max retries, exponential backoff)
- [ ] **Timeouts configured** (prevent hanging requests)
- [ ] **Fault injection tested** (chaos engineering experiments)
- [ ] **Fallback mechanisms tested** (cached responses when service down)

#### **Observability**

- [ ] **Golden metrics dashboards** (request rate, errors, latency, saturation)
- [ ] **Alerts configured** (error rate >1%, latency p99 >200ms)
- [ ] **Service graph reviewed** (understand dependencies)
- [ ] **Distributed tracing validated** (100% request coverage)
- [ ] **Log aggregation configured** (ELK or Loki for sidecar logs)

---

### 🎓 **Learning Path Next Steps**

#### **Beginner → Intermediate**
1. ✅ Complete Notebooks 131-134 (Docker, Kubernetes, Service Mesh)
2. 📚 **Next**: Notebook 135 - GitOps (ArgoCD, Flux)
3. 📚 Practice deploying Istio/Linkerd on local Kubernetes (Minikube, Kind)
4. 🛠️ Build Project 1 (Multi-Model Inference Pipeline with Canary)

#### **Intermediate → Advanced**
1. 📚 Notebook 136 - CI/CD for ML (automated pipelines with service mesh)
2. 📚 Notebook 137 - Infrastructure as Code (Terraform for service mesh)
3. 🛠️ Build Project 3 (Chaos Engineering with Fault Injection)
4. 🛠️ Build Project 6 (Multi-Cluster Service Mesh)

#### **Advanced → Expert**
1. 📚 Contribute to Istio/Linkerd open source (feature requests, bug fixes)
2. 🛠️ Build custom Envoy filters (extend service mesh capabilities)
3. 🛠️ Implement multi-cloud service mesh (AWS + GCP + Azure)
4. 🛠️ Build Project 8 (Unified Observability Dashboard)

---

### 📖 **Additional Resources**

#### **Official Documentation**
- [Istio Documentation](https://istio.io/latest/docs/)
- [Linkerd Documentation](https://linkerd.io/2/overview/)
- [Envoy Proxy Documentation](https://www.envoyproxy.io/docs)
- [Service Mesh Interface (SMI)](https://smi-spec.io/)

#### **Books**
- "Istio: Up and Running" by Lee Calcote & Zack Butcher
- "The Enterprise Path to Service Mesh Architectures" by Lee Calcote
- "Microservices Patterns" by Chris Richardson

#### **Tools**
- [Kiali](https://kiali.io/) - Service mesh observability
- [Jaeger](https://www.jaegertracing.io/) - Distributed tracing
- [Prometheus](https://prometheus.io/) - Metrics collection
- [Grafana](https://grafana.com/) - Visualization

---

### 💡 **Key Insights for Post-Silicon Validation**

#### **Why Service Mesh for Semiconductor Testing**

**Multi-Service ML Pipelines:**
- STDF Parser → Feature Extractor → Outlier Detector → Yield Predictor → Results Storage (5 services)
- Service mesh provides: mTLS (data encryption), circuit breakers (prevent cascade failures), distributed tracing (debug latency)

**Canary Releases for Model Updates:**
- Deploy new yield model v2.5 with 5% production traffic
- Monitor accuracy, latency, error rate for 24 hours
- Automatic rollback if metrics degrade (accuracy <95%, latency >150ms)
- **Value**: Prevent bad model deployments from affecting production yield

**Chaos Engineering for Resilience:**
- Inject 50% errors on external defect classification API
- Verify circuit breaker opens after 5 failures
- Ensure cached classifications used (graceful degradation)
- **Value**: Identify weaknesses before production incidents

**Distributed Tracing for Performance:**
- Debug slow wafer analysis (p99 latency 350ms, target 100ms)
- Trace shows feature extraction adds 120ms → optimize by caching
- **Value**: Reduce debugging time from 8 hours to 30 minutes

---

### ✅ **Final Checklist**

**You've mastered service mesh if you can:**

- [ ] Explain control plane vs data plane architecture
- [ ] Deploy Istio or Linkerd on Kubernetes cluster
- [ ] Configure canary release with VirtualService (5% → 100% rollout)
- [ ] Implement A/B testing with header-based routing
- [ ] Set up automatic mTLS with certificate rotation
- [ ] Configure circuit breakers and retry policies (DestinationRule)
- [ ] Debug with distributed tracing (Jaeger traces)
- [ ] Build service graph in Kiali (understand dependencies)

**Ready for Production if you can:**

- [ ] Design multi-cluster service mesh (global load balancing)
- [ ] Implement zero-trust security (authorization policies)
- [ ] Conduct chaos engineering experiments (fault injection)
- [ ] Optimize sidecar performance (reduce overhead to <5%)
- [ ] Build unified observability dashboard (Grafana + Prometheus)
- [ ] Troubleshoot mTLS connection failures
- [ ] Implement rate limiting and quota management
- [ ] Design canary release strategy with automatic rollback

---

### 🚀 **Congratulations!**

You've completed **Notebook 134: Service Mesh for ML**. You now understand:
- ✅ Service mesh architecture (control plane, data plane, sidecar proxies)
- ✅ Traffic management (canary releases, A/B testing, blue-green deployments)
- ✅ Resilience patterns (circuit breakers, retries, timeouts)
- ✅ Security (automatic mTLS, authorization policies, zero-trust)
- ✅ Observability (distributed tracing, service graphs, golden metrics)

**Next Steps:**
- **Notebook 135**: GitOps (ArgoCD, Flux) for declarative deployments
- **Notebook 136**: CI/CD for ML (Tekton, GitHub Actions, automated pipelines)
- **Notebook 137**: Infrastructure as Code (Terraform, Pulumi)

**Keep Building! 🎉**

## 🎯 Key Takeaways

### When to Use Service Mesh
- **Microservices architecture**: >10 services with complex service-to-service communication
- **Security requirements**: mTLS encryption for all traffic (zero-trust networking)
- **Observability needs**: Automatic distributed tracing, metrics for every service call
- **Traffic management**: Canary deployments, A/B testing, circuit breakers, retries
- **Multi-cluster**: Services across multiple K8s clusters or clouds need unified traffic control

### Limitations
- **Performance overhead**: Sidecar proxies add 1-5ms latency per hop and 2-10% CPU overhead, depending on the mesh (see the comparison table above)
- **Complexity**: Istio has 50+ CRDs (Custom Resource Definitions), steep learning curve
- **Operational burden**: Managing mesh control plane (Istiod), upgrading sidecars across fleet
- **Resource consumption**: Each pod gets sidecar proxy (adds 50-200MB memory per pod)
- **Debugging difficulty**: Sidecar proxies can obscure error sources (is it app or mesh?)

### Alternatives
- **Application-level libraries**: SDKs for retries, circuit breakers (Resilience4j, Polly) - no mesh needed
- **Ingress controller only**: NGINX/Traefik for north-south traffic, skip service mesh for east-west
- **Cloud-native solutions**: AWS App Mesh, Google Traffic Director (managed, but cloud-specific)
- **No service mesh**: For simple deployments, K8s Services + Ingress sufficient

### Best Practices
- **Start with Linkerd**: Simpler than Istio, lower resource overhead (roughly 50-80MB vs. 150-200MB per sidecar, per the comparison table above)
- **Gradual adoption**: Enable mesh for critical services first, expand incrementally
- **mTLS by default**: Automatic certificate rotation (workload certificates default to a 24-hour lifetime in both Linkerd and Istio)
- **Traffic policies**: Use VirtualServices for canary (10% → 50% → 100% rollout)
- **Observability integration**: Export metrics to Prometheus, traces to Jaeger/Zipkin
- **Resource limits**: Set sidecar CPU/memory limits to prevent runaway proxies

## 🔍 Diagnostic Checks & Mastery

### Implementation Checklist
- ✅ **Linkerd/Istio installation**: Control plane + sidecar injection
- ✅ **mTLS**: Automatic certificate rotation for service-to-service encryption
- ✅ **Traffic management**: VirtualServices for canary (10% → 100% rollout)
- ✅ **Circuit breakers**: Fail fast when downstream services unhealthy
- ✅ **Observability**: Automatic metrics, distributed tracing (Jaeger)
- ✅ **Retries**: Exponential backoff for transient failures

### Post-Silicon Applications
**Multi-Service ML Pipeline**: Secure communication between the feature service, model serving, and postprocessing services with mTLS, saving $800K/year in security audit costs

### Mastery Achievement
✅ Deploy Linkerd/Istio service mesh for ML microservices  
✅ Implement mTLS for zero-trust networking  
✅ Configure canary deployments with traffic splitting  
✅ Set up automatic retries, circuit breakers for resilience  
✅ Export metrics and traces for observability  
✅ Apply to semiconductor test data processing pipelines  

**Next Steps**: 135_GitOps_ArgoCD_Flux, 139_Observability_Monitoring

```python
# binning-model-ab-test.yaml
"""
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: binning-model-canary
spec:
  hosts:
  - binning-service
  http:
  - match:
    - headers:
        user-type:
          exact: beta-tester
    route:
    - destination:
        host: binning-service
        subset: v2
      weight: 100
  - route:
    - destination:
        host: binning-service
        subset: v1
      weight: 90
    - destination:
        host: binning-service
        subset: v2
      weight: 10

---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: binning-model-versions
spec:
  host: binning-service
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
"""

# Post-Silicon Use Case:
# Deploy new binning model (v2) to 10% of production traffic
# Monitor accuracy/latency for 24 hours via Prometheus + Grafana
# If metrics acceptable (accuracy >95%, latency <50ms), shift to 50-50, then 100%
# Instant rollback to v1 if v2 accuracy drops below threshold
# Save $280K/year (avoid bad model deployment = 2% yield loss = $2.8M revenue impact)
```
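
The routing decision encoded in the `binning-model-canary` VirtualService above (beta testers always reach v2, everyone else is split 90/10) can be mimicked in a few lines to build intuition for what the sidecar does per request. This is an illustrative simulation, not how Envoy selects a route internally; the function name is hypothetical.

```python
import random


def pick_subset(headers, rng=random):
    """Mimic the rules above: beta testers go to v2, others split 90/10 between v1 and v2."""
    if headers.get("user-type") == "beta-tester":
        return "v2"  # first rule: exact header match, 100% to v2
    # fallback rule: weighted choice between the two subsets
    return rng.choices(["v1", "v2"], weights=[90, 10], k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same sample
sample = [pick_subset({}, rng=rng) for _ in range(10_000)]
print(f"share routed to v2: {sample.count('v2') / len(sample):.3f}")
```

Over many requests the printed share should sit close to 0.10, which is the same check one would make against real traffic metrics after applying the VirtualService.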

## 🏭 Advanced Pattern: Traffic Splitting for Model A/B Testing

Use Istio VirtualService to route 90% traffic to model v1, 10% to model v2 for canary testing.