# 133: Kubernetes Advanced Patterns for ML

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** StatefulSets for stateful ML workloads (stable identities, persistent storage, ordered operations)
- **Deploy** DaemonSets for cluster-wide services (GPU drivers, monitoring agents, logging)
- **Implement** Kubernetes Operators (custom controllers, reconciliation loops, domain-specific automation)
- **Apply** Custom Resource Definitions (CRDs) to extend Kubernetes API for ML workflows
- **Master** advanced scheduling patterns (affinity, taints, tolerations, custom schedulers)
- **Build** production ML platforms (Kubeflow, KServe, multi-tenant systems)

## 📚 What Are Kubernetes Advanced Patterns?

While basic Kubernetes objects (Deployments, Services) work well for stateless applications, **ML workloads need more advanced patterns** to close the gaps shown below.

**The Gap in Basic Kubernetes:**
```
Basic Deployment (Stateless):
✅ Random pod names: yield-model-7f4d8-xyz
✅ Ephemeral storage: data lost on pod restart
✅ Random scheduling: pod-1 may land on any node
✅ Parallel scaling: all replicas created simultaneously

ML Challenges:
❌ Distributed training: worker-0 needs stable DNS to coordinate with worker-1
❌ Stateful services: Database primary vs replicas need persistent identities
❌ GPU drivers: Every GPU node needs NVIDIA driver (not just some nodes)
❌ Custom operations: Auto-scaling based on queue depth (not just CPU)
```

**Advanced Patterns Solution:**
```
StatefulSets:
✅ Stable pod names: redis-0, redis-1, redis-2
✅ Persistent volumes: Each pod gets dedicated PVC (survives restarts)
✅ Ordered operations: Sequential creation (0→1→2), reverse deletion (2→1→0)
✅ Stable DNS: worker-0.service.namespace.svc.cluster.local

DaemonSets:
✅ One pod per node: GPU driver on ALL GPU nodes automatically
✅ Node selectors: Only schedule on nodes with label gpu=nvidia
✅ Auto-scheduling: New node joins → DaemonSet pod scheduled automatically

Operators:
✅ Custom controllers: Watch TrainingJob resources, create StatefulSet automatically
✅ Reconciliation loop: Drive actual state toward desired state (self-healing)
✅ Domain knowledge: Auto-retry failed training, checkpoint management

Custom Resource Definitions (CRDs):
✅ Extend Kubernetes API: TrainingJob, ModelServer, HyperparameterTuning
✅ Declarative: Users create YAML, operator handles complexity
✅ Native kubectl: kubectl get trainingjobs, kubectl describe modelserver
```

**Why Advanced Patterns for ML?**
- ✅ **Distributed Training:** PyTorch DDP, Horovod require stable pod names and network IDs
- ✅ **Stateful Services:** Databases, caches, message queues need persistent identities
- ✅ **GPU Management:** DaemonSets ensure every GPU node has drivers, monitoring
- ✅ **ML-Specific Operations:** Operators encode best practices (checkpointing, auto-retry, hyperparameter tuning)
- ✅ **Self-Service ML:** Data scientists create a `TrainingJob` custom resource, operator handles pod creation, GPU allocation, monitoring
- ✅ **Automated Operations:** Operators reduce manual work (auto-scale, auto-heal, auto-optimize)

## 🏭 Post-Silicon Validation Use Cases

**Use Case 1: Distributed Wafer Map Analysis (StatefulSet)**
- **Input:** 10,000 wafer maps (spatial defect patterns, 300MB each)
- **Deployment:** StatefulSet with 5 analyzer pods (`analyzer-0` to `analyzer-4`)
- **Stable Identity:** Each pod processes specific wafer range (shard-0: wafers 0-1999, shard-1: 2000-3999, etc.)
- **Persistent Storage:** Each pod has 100GB PVC for intermediate spatial correlation matrices (survives restarts)
- **Output:** Aggregated defect patterns (hotspots, systematic failures, spatial trends)
- **Value:** Up to 5x faster analysis (5 pods in parallel vs sequential), $340K/year savings

**Use Case 2: GPU Driver Installation (DaemonSet)**
- **Input:** 10 GPU nodes with NVIDIA A100 GPUs (no drivers installed initially)
- **Deployment:** DaemonSet with node selector `gpu=nvidia` (only GPU nodes)
- **Init Container:** Install NVIDIA driver 525.xx, configure CUDA 12.0, enable the GPU resource plugin (`nvidia.com/gpu`)
- **Monitoring:** DaemonSet also deploys GPU metrics exporter (utilization, temperature, memory)
- **Output:** All GPU nodes ready for ML inference/training workloads
- **Value:** Zero manual driver management, auto-deployment on new nodes, 99.8% GPU uptime

**Use Case 3: STDF Parser Operator (Kubernetes Operator)**
- **Input:** Test equipment streams STDF files (variable rate: 10-100 wafers/hour)
- **CRD:** Data scientists create `STDFParserJob` custom resource (specify input bucket, output format)
- **Operator:** Watches `STDFParserJob`, creates pods based on queue depth (auto-scale 1-20 workers)
- **Self-Healing:** Failed jobs auto-retry (up to 3 attempts), corrupted files quarantined automatically
- **Output:** Parsed data in Parquet format (sub-5 second p95 latency)
- **Value:** $125K/year savings (eliminate manual job management), 60% faster processing
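
The queue-depth scaling rule in this use case can be sketched roughly as below. The function name `desired_workers` and the `files_per_worker` throughput figure are assumptions made for illustration; only the 1-20 worker bounds come from the use case.

```python
import math


def desired_workers(queue_depth: int, files_per_worker: int = 5,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Return a worker count proportional to queue depth, clamped to the bounds."""
    if files_per_worker <= 0:
        raise ValueError("files_per_worker must be positive")
    # Hypothetical rule: one worker per `files_per_worker` queued files
    needed = math.ceil(queue_depth / files_per_worker)
    return max(min_workers, min(max_workers, needed))


for depth in (0, 12, 250):
    print(f"queue depth {depth:>3} -> {desired_workers(depth)} workers")
```

Clamping keeps one worker alive when the queue is empty and caps the pool at 20 during bursts.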

**Use Case 4: Multi-Model Ensemble (Custom Resource Definition)**
- **Input:** 5 yield prediction models (Random Forest, XGBoost, LightGBM, CatBoost, Neural Net)
- **CRD:** `EnsembleModel` custom resource defines models, weights, voting strategy
- **Operator:** Deploys each model as separate pod, creates ensemble combiner, monitors performance
- **Auto-Scaling:** Each model scales independently based on traffic (Random Forest: 3 pods, Neural Net: 8 pods with GPU)
- **Output:** Single prediction endpoint (weighted voting, 96.5% accuracy vs 92% for single model)
- **Value:** $4.2M/year savings (fewer false negatives → less yield loss)
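
To make the weighted-voting step concrete, here is a minimal sketch of how an ensemble combiner might average per-model probabilities. The function name `weighted_vote`, the weights, and the probabilities are all invented for illustration and are not taken from a real deployment.

```python
from typing import Sequence


def weighted_vote(probabilities: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-model pass probabilities."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per model")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    return sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)


# Hypothetical outputs for one die, in the order:
# Random Forest, XGBoost, LightGBM, CatBoost, Neural Net
model_probs = [0.62, 0.71, 0.68, 0.66, 0.80]
model_weights = [1, 2, 2, 2, 3]

combined = weighted_vote(model_probs, model_weights)
print(f"combined pass probability: {combined:.3f}")
print("prediction:", "pass" if combined >= 0.5 else "fail")
```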

## 🔄 Kubernetes Advanced Patterns Workflow

```mermaid
graph LR
    A[Data Scientist] -->|Create TrainingJob resource| B[Kubernetes API]
    B -->|Store| C[etcd]
    D[Operator] -->|Watch| B
    D -->|Detect New TrainingJob| E[Reconcile]
    E -->|Create| F[StatefulSet: 4 workers]
    F -->|Deploy| G[worker-0: master]
    F -->|Deploy| H[worker-1: replica]
    F -->|Deploy| I[worker-2: replica]
    F -->|Deploy| J[worker-3: replica]
    G -->|Stable DNS| K[Distributed Training]
    H -->|Stable DNS| K
    I -->|Stable DNS| K
    J -->|Stable DNS| K
    K -->|Metrics| L[Prometheus]
    L -->|Trigger| M[Auto-Scale to 8 workers]
    
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style D fill:#e1ffe1
    style E fill:#ffe1e1
    style F fill:#f0e1ff
    style G fill:#ffe1f5
    style H fill:#ffe1f5
    style I fill:#ffe1f5
    style J fill:#ffe1f5
    style K fill:#e1ffff
    style M fill:#ffe1e1
```

## 📊 Learning Path Context

**Prerequisites:**
- **Notebook 131:** Docker for ML (containerization fundamentals, multi-stage builds)
- **Notebook 132:** Kubernetes Fundamentals (architecture, deployments, services, HPA, rolling updates)

**Current Notebook:**
- **Notebook 133:** Kubernetes Advanced Patterns (StatefulSets, DaemonSets, Operators, CRDs)

**Next Steps:**
- **Notebook 134:** Service Mesh (Istio, Linkerd for advanced networking and observability)
- **Notebook 135:** GitOps (ArgoCD, Flux for declarative deployments)
- **Notebook 136:** CI/CD for ML (Tekton, GitHub Actions, automated pipelines)

---

Let's master advanced Kubernetes patterns for production ML! 🚀

In [None]:
# Setup and Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import json
import time
from datetime import datetime, timedelta
from pathlib import Path
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Any
from enum import Enum
import uuid

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")

print("✅ Environment ready for Kubernetes advanced patterns simulation")

## 2. 🗄️ StatefulSets - Stable Identities for Stateful ML Workloads

**Purpose:** Deploy applications requiring stable network identities, ordered deployment/scaling, and persistent storage per pod.

**Key Points:**
- **Stable Pod Names:** Pods get predictable names (`redis-0`, `redis-1`, `redis-2`, not random hashes)
- **Stable Network IDs:** Each pod gets DNS entry (`redis-0.redis.default.svc.cluster.local`)
- **Ordered Operations:** Pods created sequentially (0→1→2), deleted in reverse (2→1→0)
- **Persistent Volumes:** Each pod gets its own PVC (survives pod deletion/restart)
- **Headless Service:** Service with `clusterIP: None`, so DNS returns records for individual pods instead of a single virtual IP
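
As a rough sketch of how these properties map onto manifests, the dictionaries below mirror the `v1` Service and `apps/v1` StatefulSet fields. The `redis` names, the `redis:7` image, the `data` claim name, and the 50Gi size are illustrative assumptions, and `pvc_names` is a hypothetical helper rather than a Kubernetes API.

```python
import json
from typing import Dict, List

# Headless Service: clusterIP "None" gives each pod its own DNS record
headless_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "redis"},
    "spec": {
        "clusterIP": "None",
        "selector": {"app": "redis"},
        "ports": [{"port": 6379}],
    },
}

statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "redis"},
    "spec": {
        "serviceName": "redis",  # must reference the headless Service
        "replicas": 3,
        "selector": {"matchLabels": {"app": "redis"}},
        "template": {
            "metadata": {"labels": {"app": "redis"}},
            "spec": {
                "containers": [{
                    "name": "redis",
                    "image": "redis:7",
                    "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                }],
            },
        },
        # One PVC is created per pod from this template
        "volumeClaimTemplates": [{
            "metadata": {"name": "data"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": "50Gi"}},
            },
        }],
    },
}


def pvc_names(sts: Dict) -> List[str]:
    """PVCs are named <claim template>-<statefulset name>-<ordinal>."""
    name = sts["metadata"]["name"]
    replicas = sts["spec"]["replicas"]
    return [
        f"{tpl['metadata']['name']}-{name}-{i}"
        for tpl in sts["spec"]["volumeClaimTemplates"]
        for i in range(replicas)
    ]


print(json.dumps(headless_service, indent=2))
print(pvc_names(statefulset))
```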

**Why It Matters:**
- **Distributed training:** PyTorch DDP worker-0 needs to know worker-1's address (stable DNS)
- **Databases:** PostgreSQL primary (`postgres-0`) vs replicas (`postgres-1`, `postgres-2`)
- **Consensus systems:** Kafka, ZooKeeper need stable IDs for leader election
- **Data locality:** Pod always mounts same PVC (training checkpoints, model artifacts)

**Post-Silicon Application:**
Distributed wafer map analysis: 5 analyzer pods (`analyzer-0` to `analyzer-4`), each processes specific wafer range (0-1999, 2000-3999, etc.), stable names allow consistent shard assignment, each pod has 100GB PVC for intermediate spatial correlation matrices.
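
The shard assignment described above follows directly from the pod's ordinal. The sketch below assumes the 10,000 wafers and 5 replicas from the use case; the helper names `ordinal_from_pod_name` and `wafer_range` are hypothetical. Inside a real pod the name would typically be read from the hostname rather than passed in.

```python
import math


def ordinal_from_pod_name(pod_name: str) -> int:
    """Extract the trailing ordinal from a StatefulSet pod name such as 'analyzer-3'."""
    _, _, suffix = pod_name.rpartition("-")
    if not suffix.isdigit():
        raise ValueError(f"{pod_name!r} does not end in an ordinal")
    return int(suffix)


def wafer_range(pod_name: str, total_wafers: int = 10_000, replicas: int = 5) -> range:
    """Return the contiguous wafer indices owned by this pod."""
    ordinal = ordinal_from_pod_name(pod_name)
    if ordinal >= replicas:
        raise ValueError(f"ordinal {ordinal} is outside 0..{replicas - 1}")
    shard_size = math.ceil(total_wafers / replicas)
    start = ordinal * shard_size
    stop = min(start + shard_size, total_wafers)
    return range(start, stop)


for i in range(5):
    shard = wafer_range(f"analyzer-{i}")
    print(f"analyzer-{i}: wafers {shard.start}-{shard.stop - 1}")
```

Because the ordinal never changes across restarts, a restarted pod picks up the same shard and finds its earlier intermediate results on the same PVC.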

In [None]:
# StatefulSet Simulation

@dataclass
class PersistentVolumeClaim:
    """Represents a Kubernetes PVC."""
    name: str
    size_gb: int
    storage_class: str = "fast-ssd"
    status: str = "Bound"
    mount_path: str = "/data"
    data_size_gb: float = 0.0  # Current data size
    
    def write_data(self, size_gb: float):
        """Simulate writing data to PVC."""
        self.data_size_gb += size_gb
        if self.data_size_gb > self.size_gb:
            raise Exception(f"PVC full: {self.data_size_gb:.1f}GB > {self.size_gb}GB")


@dataclass
class StatefulPod:
    """Represents a pod in StatefulSet with stable identity."""
    name: str  # Predictable: redis-0, redis-1, etc.
    ordinal: int  # 0, 1, 2, ...
    node: str
    status: str = "Running"
    ip: str = ""
    pvc: Optional[PersistentVolumeClaim] = None
    role: str = "replica"  # master or replica
    
    def __post_init__(self):
        if not self.ip:
            self.ip = f"10.244.{np.random.randint(0, 255)}.{self.ordinal}"
    
    def get_dns_name(self, service_name: str, namespace: str = "default") -> str:
        """Get stable DNS name for pod."""
        return f"{self.name}.{service_name}.{namespace}.svc.cluster.local"
    
    def attach_pvc(self, pvc: PersistentVolumeClaim):
        """Attach PVC to pod."""
        self.pvc = pvc
        print(f"✅ PVC {pvc.name} attached to {self.name} (mount: {pvc.mount_path})")


class StatefulSet:
    """Simulates Kubernetes StatefulSet."""
    
    def __init__(self, name: str, service_name: str, replicas: int, 
                 pvc_size_gb: int = 10):
        self.name = name
        self.service_name = service_name
        self.desired_replicas = replicas
        self.pvc_size_gb = pvc_size_gb
        self.pods: List[StatefulPod] = []
        self.pvcs: List[PersistentVolumeClaim] = []
        
    def create(self):
        """Create StatefulSet pods sequentially."""
        print(f"\n🚀 Creating StatefulSet '{self.name}' (replicas: {self.desired_replicas})")
        print(f"   Service: {self.service_name}")
        print(f"   PVC per pod: {self.pvc_size_gb}GB")
        print()
        
        for i in range(self.desired_replicas):
            # Create PVC first
            pvc = PersistentVolumeClaim(
                name=f"{self.name}-data-{self.name}-{i}",
                size_gb=self.pvc_size_gb
            )
            self.pvcs.append(pvc)
            
            # Create pod with predictable name
            pod = StatefulPod(
                name=f"{self.name}-{i}",
                ordinal=i,
                node=f"node-{(i % 3) + 1}",
                role="master" if i == 0 else "replica"
            )
            
            # Attach PVC
            pod.attach_pvc(pvc)
            
            # Add to StatefulSet
            self.pods.append(pod)
            
            # Get DNS name
            dns_name = pod.get_dns_name(self.service_name)
            print(f"✅ Pod {pod.name} created on {pod.node}")
            print(f"   IP: {pod.ip}")
            print(f"   DNS: {dns_name}")
            print(f"   Role: {pod.role}")
            print(f"   PVC: {pvc.name} ({pvc.size_gb}GB)")
            print()
            
            # Simulate sequential startup delay
            time.sleep(0.1)
        
        print(f"✅ StatefulSet {self.name} ready: {len(self.pods)} pods running\n")
    
    def scale(self, new_replicas: int):
        """Scale StatefulSet (ordered operations)."""
        current_replicas = len(self.pods)
        
        if new_replicas > current_replicas:
            # Scale up: add pods sequentially
            print(f"📈 Scaling UP: {current_replicas} → {new_replicas}")
            for i in range(current_replicas, new_replicas):
                # Reuse a PVC retained by an earlier scale-down; otherwise create one
                pvc_name = f"{self.name}-data-{self.name}-{i}"
                pvc = next((p for p in self.pvcs if p.name == pvc_name), None)
                if pvc is None:
                    pvc = PersistentVolumeClaim(name=pvc_name, size_gb=self.pvc_size_gb)
                    self.pvcs.append(pvc)
                
                pod = StatefulPod(
                    name=f"{self.name}-{i}",
                    ordinal=i,
                    node=f"node-{(i % 3) + 1}"
                )
                pod.attach_pvc(pvc)
                self.pods.append(pod)
                print(f"✅ Added pod {pod.name}")
        
        elif new_replicas < current_replicas:
            # Scale down: remove pods in reverse ordinal order (PVCs are kept)
            print(f"📉 Scaling DOWN: {current_replicas} → {new_replicas}")
            for _ in range(current_replicas - new_replicas):
                pod = self.pods.pop()
                print(f"🗑️  Removed pod {pod.name} (PVC {pod.pvc.name} retained)")
        
        self.desired_replicas = new_replicas
        print(f"✅ Scale complete: {len(self.pods)} pods running\n")
    
    def get_pod_by_name(self, name: str) -> Optional[StatefulPod]:
        """Get pod by stable name."""
        for pod in self.pods:
            if pod.name == name:
                return pod
        return None
    
    def simulate_data_write(self, pod_name: str, data_gb: float):
        """Simulate writing data to pod's PVC."""
        pod = self.get_pod_by_name(pod_name)
        if pod and pod.pvc:
            pod.pvc.write_data(data_gb)
            print(f"💾 {pod_name}: Wrote {data_gb}GB to {pod.pvc.name}")
            print(f"   Total data: {pod.pvc.data_size_gb:.1f}GB / {pod.pvc.size_gb}GB")
    
    def get_status(self) -> Dict:
        """Get StatefulSet status."""
        total_storage = sum(pvc.size_gb for pvc in self.pvcs)
        used_storage = sum(pvc.data_size_gb for pvc in self.pvcs)
        
        return {
            "name": self.name,
            "replicas": len(self.pods),
            "pods": [p.name for p in self.pods],
            "total_storage_gb": total_storage,
            "used_storage_gb": used_storage,
            "storage_utilization": (used_storage / total_storage * 100) if total_storage > 0 else 0
        }


# Example 1: Create Redis StatefulSet (caching layer for ML predictions)
print("=" * 80)
print("EXAMPLE 1: Redis StatefulSet for ML Prediction Cache")
print("=" * 80)

redis_sts = StatefulSet(
    name="redis",
    service_name="redis",
    replicas=3,
    pvc_size_gb=50
)

redis_sts.create()

print("=" * 80)
print("EXAMPLE 2: Stable DNS Names for Service Discovery")
print("=" * 80)

print("📡 DNS Resolution for Redis Pods:\n")
for pod in redis_sts.pods:
    dns = pod.get_dns_name("redis")
    print(f"   {pod.name} → {dns}")
    print(f"      Role: {pod.role}")
    print(f"      IP: {pod.ip}")
    print()

print("💡 Use Case: ML service connects to redis-0.redis.default.svc.cluster.local")
print("   (Stable DNS: always resolves to pod redis-0, the master in this simulation)")

print("\n" + "=" * 80)
print("EXAMPLE 3: Persistent Storage Per Pod")
print("=" * 80)

# Simulate data writes to each pod
print("\n💾 Simulating cache writes to Redis pods:\n")
redis_sts.simulate_data_write("redis-0", 15.5)  # Master gets most writes
redis_sts.simulate_data_write("redis-1", 8.2)   # Replica
redis_sts.simulate_data_write("redis-2", 6.7)   # Replica

status = redis_sts.get_status()
print(f"\n📊 StatefulSet Status:")
print(f"   Pods: {status['replicas']}")
print(f"   Total Storage: {status['total_storage_gb']}GB")
print(f"   Used Storage: {status['used_storage_gb']:.1f}GB")
print(f"   Utilization: {status['storage_utilization']:.1f}%")

print("\n💡 Key Insight: Each pod has own PVC, data survives pod restart")

print("\n" + "=" * 80)
print("EXAMPLE 4: Ordered Scaling")
print("=" * 80)

# Scale up from 3 to 5 replicas
redis_sts.scale(5)

# Show new pods
print("📊 Current Pods:")
for pod in redis_sts.pods:
    print(f"   {pod.name} on {pod.node} (PVC: {pod.pvc.name})")

# Scale down from 5 to 3 replicas
print()
redis_sts.scale(3)

print("💡 Observations:")
print("   • Scale-up: Pods added sequentially (redis-3, then redis-4)")
print("   • Scale-down: Pods removed in reverse (redis-4, then redis-3)")
print("   • PVCs retained: Can reattach if scaled back up")

print("\n" + "=" * 80)
print("EXAMPLE 5: StatefulSet vs Deployment Comparison")
print("=" * 80)

comparison_data = {
    "Feature": [
        "Pod Names",
        "Network Identity",
        "Storage",
        "Scaling Order",
        "Use Cases"
    ],
    "Deployment": [
        "Random (model-7f4d8)",
        "Dynamic IPs",
        "Shared or no PVC",
        "Parallel",
        "Stateless apps (web servers, ML APIs)"
    ],
    "StatefulSet": [
        "Predictable (redis-0, redis-1)",
        "Stable DNS names",
        "PVC per pod",
        "Sequential (0→1→2)",
        "Databases, caches, distributed training"
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))

print("\n💡 When to Use StatefulSet for ML:")
print("   ✅ Distributed training (PyTorch DDP, Horovod)")
print("   ✅ Caching layer (Redis, Memcached)")
print("   ✅ Message queues (Kafka, RabbitMQ)")
print("   ✅ Databases (PostgreSQL, MongoDB)")
print("   ✅ Model artifact storage (per-worker checkpoints)")

## 3. 🌐 DaemonSets - One Pod Per Node for Cluster-Wide Services

**Purpose:** Ensure exactly one pod runs on every node (or selected nodes) for infrastructure services.

**Key Points:**
- **One pod per node:** Automatically schedules pod on each node (including new nodes)
- **Node Selectors:** Run only on nodes matching labels (`gpu=nvidia`, `workload=ml`)
- **Tolerations:** Run on tainted nodes (e.g., GPU nodes tainted to prevent non-GPU workloads)
- **Init Containers:** Run setup tasks before main container (install drivers, configure system)
- **Auto-scheduling:** New node added → DaemonSet pod automatically scheduled

**Why It Matters:**
- **GPU drivers:** Every GPU node needs NVIDIA drivers (DaemonSet installs automatically)
- **Monitoring:** Prometheus node-exporter on every node (collects CPU, memory, disk metrics)
- **Logging:** Fluentd on every node (collects logs, ships to Elasticsearch)
- **Networking:** CNI plugins, kube-proxy run as DaemonSets
- **Security:** Falco security agent on every node (detects anomalies)

**Post-Silicon Application:**
GPU driver DaemonSet: Runs on 10 GPU nodes (label: `gpu=nvidia`), installs NVIDIA driver v525, configures CUDA toolkit, enables `nvidia.com/gpu` resource, skips 10 CPU-only nodes (no label match).
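
Real taints and tolerations carry a key, an optional value, and an effect, which is richer than plain strings. The sketch below shows one way the matching rules could be modelled. The class names, the `can_schedule` helper, and the `gpu-only` taint key are assumptions for illustration, and the model simplifies the scheduler by treating every taint as blocking unless tolerated.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    value: str = ""
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" matches any taint key
    operator: str = "Equal"  # "Equal" compares key and value; "Exists" ignores value
    value: str = ""
    effect: str = ""         # empty effect matches any taint effect

    def tolerates(self, taint: Taint) -> bool:
        if self.effect and self.effect != taint.effect:
            return False
        if self.operator == "Exists":
            return self.key in ("", taint.key)
        return self.key == taint.key and self.value == taint.value


def can_schedule(taints: List[Taint], tolerations: List[Toleration]) -> bool:
    """A pod fits only if each taint is matched by at least one toleration."""
    return all(any(tol.tolerates(taint) for tol in tolerations) for taint in taints)


gpu_taint = Taint(key="gpu-only", effect="NoSchedule")
driver_toleration = Toleration(key="gpu-only", operator="Exists", effect="NoSchedule")

print(can_schedule([gpu_taint], [driver_toleration]))  # driver pod tolerates the taint
print(can_schedule([gpu_taint], []))                   # ordinary pod is kept off GPU nodes
print(can_schedule([], []))                            # untainted node accepts any pod
```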

In [None]:
# DaemonSet Simulation

@dataclass
class Node:
    """Represents a Kubernetes node."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    taints: List[str] = field(default_factory=list)
    cpu_capacity: float = 8.0
    memory_capacity: int = 16384
    has_gpu: bool = False
    
    def matches_selector(self, node_selector: Dict[str, str]) -> bool:
        """Check if node matches selector."""
        for key, value in node_selector.items():
            if self.labels.get(key) != value:
                return False
        return True
    
    def has_toleration_for_taints(self, tolerations: List[str]) -> bool:
        """Check if pod can tolerate node taints."""
        for taint in self.taints:
            if taint not in tolerations:
                return False
        return True


@dataclass
class DaemonPod:
    """Represents a pod scheduled by DaemonSet."""
    name: str
    node: str
    status: str = "Running"
    init_complete: bool = False
    driver_version: Optional[str] = None
    
    def run_init_container(self, task: str) -> str:
        """Simulate init container execution."""
        self.init_complete = True
        return f"Init: {task} completed"


class DaemonSet:
    """Simulates Kubernetes DaemonSet."""
    
    def __init__(self, name: str, image: str, 
                 node_selector: Optional[Dict[str, str]] = None,
                 tolerations: Optional[List[str]] = None,
                 init_tasks: Optional[List[str]] = None):
        self.name = name
        self.image = image
        self.node_selector = node_selector or {}
        self.tolerations = tolerations or []
        self.init_tasks = init_tasks or []
        self.pods: List[DaemonPod] = []
    
    def deploy(self, nodes: List[Node]):
        """Deploy DaemonSet (one pod per matching node)."""
        print(f"\n🚀 Deploying DaemonSet '{self.name}'")
        print(f"   Image: {self.image}")
        if self.node_selector:
            print(f"   Node Selector: {self.node_selector}")
        if self.tolerations:
            print(f"   Tolerations: {self.tolerations}")
        print()
        
        scheduled_count = 0
        skipped_count = 0
        
        for node in nodes:
            # Check node selector
            if self.node_selector and not node.matches_selector(self.node_selector):
                print(f"⏭️  Skipped {node.name}: Node selector mismatch")
                skipped_count += 1
                continue
            
            # Check taints
            if node.taints and not node.has_toleration_for_taints(self.tolerations):
                print(f"⏭️  Skipped {node.name}: Missing toleration for taints {node.taints}")
                skipped_count += 1
                continue
            
            # Schedule pod
            pod = DaemonPod(
                name=f"{self.name}-{node.name}",
                node=node.name
            )
            
            # Run init containers
            if self.init_tasks:
                print(f"🔧 {pod.name} on {node.name}:")
                for task in self.init_tasks:
                    result = pod.run_init_container(task)
                    print(f"   {result}")
                print(f"   Main container started")
            else:
                print(f"✅ {pod.name} scheduled on {node.name}")
            
            self.pods.append(pod)
            scheduled_count += 1
        
        print(f"\n✅ DaemonSet {self.name} deployed:")
        print(f"   Pods scheduled: {scheduled_count}")
        print(f"   Nodes skipped: {skipped_count}")
        print(f"   Total pods: {len(self.pods)}\n")
    
    def handle_new_node(self, node: Node):
        """Automatically schedule pod on new node."""
        print(f"\n🆕 New node detected: {node.name}")
        
        # Check if should schedule
        if self.node_selector and not node.matches_selector(self.node_selector):
            print(f"   ⏭️ Skipped: Node selector mismatch")
            return
        
        if node.taints and not node.has_toleration_for_taints(self.tolerations):
            print(f"   ⏭️ Skipped: Missing toleration")
            return
        
        # Schedule pod
        pod = DaemonPod(
            name=f"{self.name}-{node.name}",
            node=node.name
        )
        
        # Run init tasks
        for task in self.init_tasks:
            pod.run_init_container(task)
        
        self.pods.append(pod)
        print(f"   ✅ DaemonSet pod {pod.name} scheduled automatically")
    
    def get_status(self) -> Dict:
        """Get DaemonSet status."""
        return {
            "name": self.name,
            "desired_pods": len(self.pods),
            "ready_pods": sum(1 for p in self.pods if p.status == "Running"),
            "nodes_scheduled": [p.node for p in self.pods]
        }


# Example 1: Create cluster with GPU and CPU nodes
print("=" * 80)
print("EXAMPLE 1: Cluster Setup - GPU and CPU Nodes")
print("=" * 80)

nodes = [
    # GPU nodes (labeled and tainted)
    Node(name="gpu-node-1", labels={"gpu": "nvidia", "gpu-model": "t4"}, 
         taints=["gpu-only"], has_gpu=True),
    Node(name="gpu-node-2", labels={"gpu": "nvidia", "gpu-model": "t4"}, 
         taints=["gpu-only"], has_gpu=True),
    Node(name="gpu-node-3", labels={"gpu": "nvidia", "gpu-model": "v100"}, 
         taints=["gpu-only"], has_gpu=True),
    
    # CPU-only nodes (no special labels)
    Node(name="cpu-node-1", labels={"workload": "general"}),
    Node(name="cpu-node-2", labels={"workload": "general"}),
    Node(name="cpu-node-3", labels={"workload": "general"}),
]

print("📊 Cluster Nodes:\n")
for node in nodes:
    gpu_info = "GPU ✅" if node.has_gpu else "CPU only"
    labels_str = ", ".join([f"{k}={v}" for k, v in node.labels.items()])
    taints_str = ", ".join(node.taints) if node.taints else "None"
    print(f"   {node.name}: {gpu_info}")
    print(f"      Labels: {labels_str}")
    print(f"      Taints: {taints_str}")
    print()

print("=" * 80)
print("EXAMPLE 2: GPU Driver DaemonSet (GPU Nodes Only)")
print("=" * 80)

# DaemonSet for GPU drivers (only on GPU nodes)
gpu_driver_ds = DaemonSet(
    name="nvidia-driver-installer",
    image="nvidia/driver:525.60.13",
    node_selector={"gpu": "nvidia"},  # Only GPU nodes
    tolerations=["gpu-only"],  # Tolerate GPU taint
    init_tasks=[
        "Install NVIDIA driver v525.60.13",
        "Configure CUDA toolkit 12.0",
        "Enable nvidia.com/gpu resource"
    ]
)

gpu_driver_ds.deploy(nodes)

print("=" * 80)
print("EXAMPLE 3: Monitoring DaemonSet (All Nodes)")
print("=" * 80)

# DaemonSet for monitoring (all nodes)
prometheus_ds = DaemonSet(
    name="node-exporter",
    image="prom/node-exporter:latest",
    node_selector={},  # No selector = all nodes
    tolerations=["gpu-only"]  # Tolerate GPU taint to run on GPU nodes too
)

prometheus_ds.deploy(nodes)

print("=" * 80)
print("EXAMPLE 4: Auto-Scheduling on New Node")
print("=" * 80)

# Add new GPU node
new_gpu_node = Node(
    name="gpu-node-4",
    labels={"gpu": "nvidia", "gpu-model": "a100"},
    taints=["gpu-only"],
    has_gpu=True
)

print("🆕 Adding new GPU node to cluster...\n")
nodes.append(new_gpu_node)

# DaemonSets automatically schedule pods
gpu_driver_ds.handle_new_node(new_gpu_node)
prometheus_ds.handle_new_node(new_gpu_node)

print("\n" + "=" * 80)
print("EXAMPLE 5: DaemonSet Status Summary")
print("=" * 80)

# GPU driver status
gpu_status = gpu_driver_ds.get_status()
print(f"📊 {gpu_status['name']}:")
print(f"   Desired Pods: {gpu_status['desired_pods']}")
print(f"   Ready Pods: {gpu_status['ready_pods']}")
print(f"   Nodes: {', '.join(gpu_status['nodes_scheduled'])}")

print()

# Monitoring status
mon_status = prometheus_ds.get_status()
print(f"📊 {mon_status['name']}:")
print(f"   Desired Pods: {mon_status['desired_pods']}")
print(f"   Ready Pods: {mon_status['ready_pods']}")
print(f"   Nodes: {', '.join(mon_status['nodes_scheduled'])}")

# Visualize DaemonSet deployment
print("\n" + "=" * 80)
print("EXAMPLE 6: DaemonSet Deployment Visualization")
print("=" * 80)

# Create visualization data
node_names = [n.name for n in nodes]
gpu_driver_scheduled = [1 if n.has_gpu else 0 for n in nodes]
node_exporter_scheduled = [1 for _ in nodes]  # All nodes

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: GPU Driver DaemonSet
colors_gpu = ['#4ECDC4' if v == 1 else '#95E1D3' for v in gpu_driver_scheduled]
bars1 = ax1.bar(node_names, gpu_driver_scheduled, color=colors_gpu, edgecolor='black', linewidth=1.5)
ax1.set_ylabel("Pod Scheduled (1=Yes, 0=No)", fontsize=12, fontweight='bold')
ax1.set_xlabel("Node Name", fontsize=12, fontweight='bold')
ax1.set_title("GPU Driver DaemonSet\n(GPU Nodes Only)", fontsize=14, fontweight='bold', pad=20)
ax1.set_ylim(0, 1.2)
# Rotate the tick labels that bar() already created
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')
ax1.grid(axis='y', alpha=0.3)

# Add labels
for bar, val in zip(bars1, gpu_driver_scheduled):
    label = "✅ Scheduled" if val == 1 else "⏭️ Skipped"
    ax1.text(bar.get_x() + bar.get_width()/2, val + 0.05, label, 
             ha='center', va='bottom', fontsize=10, fontweight='bold')

# Plot 2: Node Exporter DaemonSet
bars2 = ax2.bar(node_names, node_exporter_scheduled, color='#FF6B6B', edgecolor='black', linewidth=1.5)
ax2.set_ylabel("Pod Scheduled (1=Yes, 0=No)", fontsize=12, fontweight='bold')
ax2.set_xlabel("Node Name", fontsize=12, fontweight='bold')
ax2.set_title("Node Exporter DaemonSet\n(All Nodes)", fontsize=14, fontweight='bold', pad=20)
ax2.set_ylim(0, 1.2)
# Rotate the tick labels that bar() already created
plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')
ax2.grid(axis='y', alpha=0.3)

# Add labels
for bar in bars2:
    ax2.text(bar.get_x() + bar.get_width()/2, 1.05, "✅ Scheduled", 
             ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

print("💡 Key Insights:")
print("   • GPU Driver DaemonSet: 4/7 nodes (only GPU nodes)")
print("   • Node Exporter DaemonSet: 7/7 nodes (all nodes)")
print("   • New nodes: Pods automatically scheduled")
print("   • Node selectors: Filter nodes by labels")
print("   • Tolerations: Run on tainted nodes")

print("\n💡 Common DaemonSet Use Cases for ML:")
print("   ✅ GPU drivers (NVIDIA, AMD)")
print("   ✅ Monitoring agents (Prometheus node-exporter, Datadog)")
print("   ✅ Log collectors (Fluentd, Filebeat)")
print("   ✅ Security agents (Falco, Sysdig)")
print("   ✅ Network plugins (Calico, Weave)")
print("   ✅ Storage drivers (Ceph, GlusterFS)")

## 4. 🤖 Kubernetes Operators - Encoding Operational Knowledge

**Purpose:** Automate complex application lifecycle management by extending Kubernetes with custom controllers that watch resources and take actions.

**Key Points:**
- **Custom Resource Definitions (CRDs):** Define custom resources (`TrainingJob`, `ModelServer`, `HyperparameterTuning`)
- **Controller:** Watches custom resources and reconciles desired vs actual state (creates pods and services, manages lifecycle)
- **Reconciliation Loop:** Continuously drives the actual state (reported in `status`) toward the desired state (declared in `spec`)
- **Domain Knowledge:** Operator encodes best practices (backup strategies, scaling logic, failure handling)
- **Self-Healing:** Automatically recovers from failures (restarts training, replaces unhealthy pods)

**Why It Matters:**
- **Simplicity:** User creates a `TrainingJob` custom resource, operator handles complexity (pods, volumes, services)
- **Consistency:** Operator ensures best practices (always use GPU scheduling, always save checkpoints)
- **Automation:** Operator watches metrics, auto-scales, auto-retries, auto-backs-up
- **Extensibility:** Add ML-specific features to Kubernetes (model versioning, A/B testing, drift detection)

**Post-Silicon Application:**
STDF Parser Operator: User creates an `STDFParserJob` custom resource with a file list. The operator creates parser pods (scaled by queue depth), monitors progress (files parsed/failed), retries failed files up to 3 times, and updates the job status (e.g. parsed=9500, failed=500, completion=95%).
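
The core of any operator is an idempotent reconcile pass that compares desired and observed state and emits the actions needed to converge. A minimal sketch, assuming hypothetical `parser-<n>` worker names and a pure `reconcile` function (a real controller would call the Kubernetes API instead of returning strings):

```python
from typing import List, Set


def reconcile(desired_workers: int, observed: Set[str]) -> List[str]:
    """One reconcile pass: return the actions that move observed toward desired."""
    desired = {f"parser-{i}" for i in range(desired_workers)}
    actions = [f"create {name}" for name in sorted(desired - observed)]
    actions += [f"delete {name}" for name in sorted(observed - desired)]
    return actions


# Scale-up: two of three workers are missing
print(reconcile(3, {"parser-0"}))

# Self-healing: parser-1 crashed and disappeared from the observed set
print(reconcile(3, {"parser-0", "parser-2"}))

# Converged: a second pass has nothing left to do
print(reconcile(2, {"parser-0", "parser-1"}))
```

Because the function depends only on current state and not on which event triggered it, running it repeatedly is safe, which is what makes the loop self-healing.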

In [None]:
# Kubernetes Operator Simulation

class JobStatus(Enum):
    """Training job status."""
    PENDING = "Pending"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    RETRYING = "Retrying"


@dataclass
class TrainingJobSpec:
    """Desired state for training job (user-defined)."""
    model_name: str
    dataset: str
    epochs: int = 10
    batch_size: int = 32
    replicas: int = 1  # Distributed training workers
    gpu_per_worker: int = 1
    max_retries: int = 3
    checkpoint_interval: int = 5  # Save every N epochs


@dataclass
class TrainingJobStatus:
    """Actual state for training job (operator-managed)."""
    phase: JobStatus = JobStatus.PENDING
    current_epoch: int = 0
    accuracy: float = 0.0
    loss: float = 0.0
    retries: int = 0
    pods_ready: int = 0
    start_time: Optional[str] = None
    end_time: Optional[str] = None
    message: str = ""


@dataclass
class TrainingJobCRD:
    """Custom resource (an instance of the TrainingJob CRD) for an ML training job."""
    api_version: str = "ml.kubeflow.org/v1"
    kind: str = "TrainingJob"
    metadata: Dict[str, str] = field(default_factory=dict)
    spec: Optional[TrainingJobSpec] = None
    status: TrainingJobStatus = field(default_factory=TrainingJobStatus)
    
    def __post_init__(self):
        if not self.metadata:
            self.metadata = {"name": f"training-{uuid.uuid4().hex[:8]}"}


class MLTrainingOperator:
    """Kubernetes Operator for ML training jobs."""
    
    def __init__(self, name: str = "ml-training-operator"):
        self.name = name
        self.watched_jobs: List[TrainingJobCRD] = []
        self.reconciliation_history: List[Dict] = []
    
    def watch_job(self, job: TrainingJobCRD):
        """Add job to watch list."""
        self.watched_jobs.append(job)
        print(f"👀 Operator watching job: {job.metadata['name']}")
    
    def reconcile(self, job: TrainingJobCRD) -> Dict:
        """Reconciliation loop: ensure desired state matches actual state."""
        reconciliation = {
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "job_name": job.metadata['name'],
            "actions": []
        }
        
        # Phase: PENDING → RUNNING
        if job.status.phase == JobStatus.PENDING:
            print(f"\n🔄 Reconciling job: {job.metadata['name']}")
            print(f"   Current state: {job.status.phase.value}")
            print(f"   Desired state: Training with {job.spec.replicas} workers")
            
            # Create pods for training
            job.status.pods_ready = job.spec.replicas
            job.status.phase = JobStatus.RUNNING
            job.status.start_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            job.status.message = f"Created {job.spec.replicas} worker pods"
            
            reconciliation["actions"].append(
                f"Created {job.spec.replicas} pods with {job.spec.gpu_per_worker} GPU each"
            )
            reconciliation["actions"].append(f"Started training: {job.spec.model_name}")
            
            print(f"   ✅ Actions: {reconciliation['actions']}")
        
        # Phase: RUNNING → simulate training progress
        elif job.status.phase == JobStatus.RUNNING:
            # Simulate epoch progress
            if job.status.current_epoch < job.spec.epochs:
                job.status.current_epoch += 1
                job.status.accuracy = 0.50 + (job.status.current_epoch / job.spec.epochs) * 0.45
                job.status.loss = 2.0 - (job.status.current_epoch / job.spec.epochs) * 1.5
                
                reconciliation["actions"].append(
                    f"Epoch {job.status.current_epoch}/{job.spec.epochs}: "
                    f"accuracy={job.status.accuracy:.3f}, loss={job.status.loss:.3f}"
                )
                
                # Checkpoint every N epochs
                if job.status.current_epoch % job.spec.checkpoint_interval == 0:
                    reconciliation["actions"].append(
                        f"Saved checkpoint at epoch {job.status.current_epoch}"
                    )
            
            # Training complete
            if job.status.current_epoch >= job.spec.epochs:
                job.status.phase = JobStatus.SUCCEEDED
                job.status.end_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                job.status.message = f"Training completed: {job.spec.epochs} epochs"
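                reconciliation["actions"].append("Training succeeded ✅")
        
        # Phase: FAILED → retry logic
        elif job.status.phase == JobStatus.FAILED:
            if job.status.retries < job.spec.max_retries:
                job.status.retries += 1
                job.status.phase = JobStatus.RETRYING
                # Resume from the most recent checkpoint (epoch 0 if none was saved yet)
                interval = job.spec.checkpoint_interval
                job.status.current_epoch = (job.status.current_epoch // interval) * interval
                job.status.message = f"Retrying from epoch {job.status.current_epoch}"
                reconciliation["actions"].append(
                    f"Retry {job.status.retries}/{job.spec.max_retries}: "
                    f"restarting from epoch {job.status.current_epoch}"
                )
            else:
                job.status.message = f"Max retries exceeded ({job.spec.max_retries})"
                reconciliation["actions"].append("Max retries reached, job failed permanently ❌")
        
        # Phase: RETRYING → RUNNING (worker pods recreated, training resumes)
        elif job.status.phase == JobStatus.RETRYING:
            job.status.phase = JobStatus.RUNNING
            reconciliation["actions"].append("Recreated worker pods, training resumed")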
        
        self.reconciliation_history.append(reconciliation)
        return reconciliation
    
    def simulate_failure(self, job: TrainingJobCRD, reason: str = "OOM"):
        """Simulate training failure."""
        job.status.phase = JobStatus.FAILED
        job.status.message = f"Training failed: {reason}"
        print(f"\n❌ Simulated failure: {job.metadata['name']} ({reason})")
    
    def get_job_status(self, job_name: str) -> Optional[TrainingJobCRD]:
        """Get job status."""
        for job in self.watched_jobs:
            if job.metadata['name'] == job_name:
                return job
        return None
    
    def run_reconciliation_loop(self, iterations: int = 15):
        """Run reconciliation loop for all watched jobs."""
        print(f"\n🔄 Starting reconciliation loop ({iterations} iterations)...\n")

        def is_done(job: TrainingJobCRD) -> bool:
            """Terminal state: succeeded, or failed with no retries left."""
            if job.status.phase == JobStatus.SUCCEEDED:
                return True
            return (job.status.phase == JobStatus.FAILED
                    and job.status.retries >= job.spec.max_retries)

        for i in range(iterations):
            print(f"--- Iteration {i+1}/{iterations} ---")

            for job in self.watched_jobs:
                # A FAILED job with retries left still needs reconciling
                if not is_done(job):
                    self.reconcile(job)

            time.sleep(0.1)  # Simulate time passing

            # Stop early once every job has reached a terminal state
            if all(is_done(job) for job in self.watched_jobs):
                print("\n✅ All jobs completed")
                break


# Example 1: Create Training Job CRD
print("=" * 80)
print("EXAMPLE 1: Define TrainingJob Custom Resource")
print("=" * 80)

# User creates TrainingJob CRD (like kubectl apply -f training-job.yaml)
wafer_training_job = TrainingJobCRD(
    metadata={"name": "wafer-yield-training-v1"},
    spec=TrainingJobSpec(
        model_name="wafer_yield_predictor",
        dataset="wafer_test_data_2024_q4",
        epochs=20,
        batch_size=64,
        replicas=4,  # 4 workers for distributed training
        gpu_per_worker=1,
        max_retries=3,
        checkpoint_interval=5
    )
)

print("📄 TrainingJob CRD:")
print(f"   Name: {wafer_training_job.metadata['name']}")
print(f"   Model: {wafer_training_job.spec.model_name}")
print(f"   Dataset: {wafer_training_job.spec.dataset}")
print(f"   Epochs: {wafer_training_job.spec.epochs}")
print(f"   Workers: {wafer_training_job.spec.replicas}")
print(f"   GPU per worker: {wafer_training_job.spec.gpu_per_worker}")
print(f"   Status: {wafer_training_job.status.phase.value}")

print("\n" + "=" * 80)
print("EXAMPLE 2: Operator Watches and Reconciles Job")
print("=" * 80)

# Create operator
operator = MLTrainingOperator()

# Operator watches job
operator.watch_job(wafer_training_job)

# Simulate reconciliation loop (operator runs continuously)
operator.run_reconciliation_loop(iterations=25)

# Check final status
final_job = operator.get_job_status("wafer-yield-training-v1")
print(f"\n📊 Final Job Status:")
print(f"   Phase: {final_job.status.phase.value}")
print(f"   Epochs Completed: {final_job.status.current_epoch}/{final_job.spec.epochs}")
print(f"   Final Accuracy: {final_job.status.accuracy:.3f}")
print(f"   Final Loss: {final_job.status.loss:.3f}")
print(f"   Start Time: {final_job.status.start_time}")
print(f"   End Time: {final_job.status.end_time}")

print("\n" + "=" * 80)
print("EXAMPLE 3: Operator Handles Failures with Auto-Retry")
print("=" * 80)

# Create another job
stdf_training_job = TrainingJobCRD(
    metadata={"name": "stdf-parser-training-v2"},
    spec=TrainingJobSpec(
        model_name="stdf_anomaly_detector",
        dataset="stdf_historical_2024",
        epochs=15,
        batch_size=32,
        replicas=2,
        gpu_per_worker=1,
        max_retries=3
    )
)

operator.watch_job(stdf_training_job)

# Run for 5 iterations
for i in range(5):
    operator.reconcile(stdf_training_job)
    print(f"   Epoch {stdf_training_job.status.current_epoch}: "
          f"accuracy={stdf_training_job.status.accuracy:.3f}")
    time.sleep(0.05)

# Simulate OOM failure
operator.simulate_failure(stdf_training_job, reason="OutOfMemory (GPU OOM)")

# Operator auto-retries
print("\n🔄 Operator detecting failure, initiating auto-retry...")
operator.reconcile(stdf_training_job)

print(f"\n📊 Job Status After Retry:")
print(f"   Phase: {stdf_training_job.status.phase.value}")
print(f"   Retries: {stdf_training_job.status.retries}/{stdf_training_job.spec.max_retries}")
print(f"   Message: {stdf_training_job.status.message}")

print("\n💡 Key Insight: Operator automatically retries failed jobs (no manual intervention)")

# Visualize training progress
print("\n" + "=" * 80)
print("EXAMPLE 4: Training Progress Visualization")
print("=" * 80)

# Extract metrics from successful job
epochs = list(range(1, final_job.status.current_epoch + 1))
accuracies = [0.50 + (e / final_job.spec.epochs) * 0.45 for e in epochs]
losses = [2.0 - (e / final_job.spec.epochs) * 1.5 for e in epochs]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Accuracy
ax1.plot(epochs, accuracies, marker='o', linewidth=2.5, markersize=8, color='#4ECDC4')
ax1.fill_between(epochs, 0, accuracies, alpha=0.3, color='#4ECDC4')
ax1.axhline(y=0.95, color='green', linestyle='--', linewidth=2, label='Target (95%)')
ax1.set_xlabel("Epoch", fontsize=12, fontweight='bold')
ax1.set_ylabel("Accuracy", fontsize=12, fontweight='bold')
ax1.set_title("Training Accuracy Over Time\n(Wafer Yield Predictor)", 
              fontsize=14, fontweight='bold', pad=20)
ax1.grid(True, alpha=0.3)
ax1.set_ylim(0, 1)

# Mark checkpoints
checkpoint_epochs = [e for e in epochs if e % final_job.spec.checkpoint_interval == 0]
checkpoint_accs = [accuracies[e-1] for e in checkpoint_epochs]
ax1.scatter(checkpoint_epochs, checkpoint_accs, s=200, c='red', marker='s', 
            edgecolors='black', linewidths=2, label='Checkpoint', zorder=5)
ax1.legend(fontsize=10)  # Built after the scatter so 'Checkpoint' appears in the legend

# Plot 2: Loss
ax2.plot(epochs, losses, marker='s', linewidth=2.5, markersize=8, color='#FF6B6B')
ax2.fill_between(epochs, 0, losses, alpha=0.3, color='#FF6B6B')
ax2.set_xlabel("Epoch", fontsize=12, fontweight='bold')
ax2.set_ylabel("Loss", fontsize=12, fontweight='bold')
ax2.set_title("Training Loss Over Time\n(Wafer Yield Predictor)", 
              fontsize=14, fontweight='bold', pad=20)
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 2.5)

# Mark checkpoints
checkpoint_losses = [losses[e-1] for e in checkpoint_epochs]
ax2.scatter(checkpoint_epochs, checkpoint_losses, s=200, c='red', marker='s', 
            edgecolors='black', linewidths=2, label='Checkpoint', zorder=5)
ax2.legend(fontsize=10)

plt.tight_layout()
plt.show()

print("💡 Operator Features Demonstrated:")
print("   ✅ Watches TrainingJob CRDs (reconciliation loop)")
print("   ✅ Creates pods and resources automatically")
print("   ✅ Monitors training progress (updates status)")
print("   ✅ Saves checkpoints periodically")
print("   ✅ Auto-retries on failure (up to max_retries)")
print("   ✅ Updates job status (phase, accuracy, loss)")

print("\n💡 Popular ML Operators:")
print("   • Kubeflow Training Operator (TensorFlow, PyTorch, MXNet)")
print("   • KServe (Model serving with auto-scaling)")
print("   • Seldon Core Operator (Advanced model serving)")
print("   • Argo Workflows (DAG-based ML pipelines)")
print("   • MLflow Operator (Experiment tracking integration)")

## 5. 📋 Custom Resource Definitions (CRDs) - Extending Kubernetes API

### 📝 What Are CRDs?

**Purpose:** Extend Kubernetes API with custom resource types for ML workflows

**Key Points:**
- **Custom Resources**: Define domain-specific objects (TrainingJob, ModelServer, HyperparameterTuning)
- **API Extension**: Kubernetes serves custom resources through the same API machinery as built-in resources (Pod, Service, Deployment)
- **Declarative**: Users define desired state in YAML, operators ensure actual state matches
- **Validation**: CRDs support OpenAPI v3 schema validation, default values, and multiple versions
- **CRUD Operations**: `kubectl get/describe/delete` work with custom resources just like native resources
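
To make the validation and defaulting points concrete, here is a sketch of what a CRD manifest for a `TrainingJob` resource might look like, written as a Python dict to stay in the notebook's language. `apiextensions.k8s.io/v1` is the real CRD API version; the group `ml.example.com` and every schema field are illustrative assumptions, not taken from an existing operator.

```python
import json

# Hypothetical CRD for a TrainingJob resource (group and fields are assumptions)
training_job_crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "trainingjobs.ml.example.com"},  # must be <plural>.<group>
    "spec": {
        "group": "ml.example.com",
        "scope": "Namespaced",
        "names": {
            "plural": "trainingjobs",
            "singular": "trainingjob",
            "kind": "TrainingJob",
        },
        "versions": [{
            "name": "v1",
            "served": True,
            "storage": True,  # exactly one version is the storage version
            "subresources": {"status": {}},  # status is written separately from spec
            "schema": {"openAPIV3Schema": {
                "type": "object",
                "properties": {
                    "spec": {
                        "type": "object",
                        "required": ["modelName", "dataset"],
                        "properties": {
                            "modelName": {"type": "string"},
                            "dataset": {"type": "string"},
                            "epochs": {"type": "integer", "minimum": 1, "default": 10},
                            "replicas": {"type": "integer", "minimum": 0, "default": 1},
                            "maxRetries": {"type": "integer", "minimum": 0, "default": 3},
                        },
                    },
                    "status": {
                        "type": "object",
                        "x-kubernetes-preserve-unknown-fields": True,
                    },
                },
            }},
        }],
    },
}

# Serialized, this dict is what `kubectl apply -f` would receive (JSON is valid YAML)
manifest = json.dumps(training_job_crd, indent=2)
print(manifest[:120])
```

With a schema like this, the API server would reject a `TrainingJob` that omits `dataset` or sets `epochs: 0`, and would fill in `epochs: 10` when the field is missing, before any operator code runs.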

**Why CRDs Matter:**
- ✅ **Simplify ML Workflows**: Data scientists use `kubectl apply -f training-job.yaml` instead of complex scripts
- ✅ **Self-Service**: Users create resources without understanding Kubernetes internals
- ✅ **Standardization**: Consistent API across teams and projects
- ✅ **Automation**: Operators watch CRDs and automate complex operations

**Post-Silicon Validation Application:**

**Multi-Model Ensemble for Wafer Yield Prediction:**
- **Input**: Create an `EnsembleModel` custom resource listing 5 base models (Random Forest, XGBoost, LightGBM, CatBoost, Neural Net)
- **Operator**: Deploys each model as separate pod, creates ensemble combiner, monitors performance
- **Output**: Single prediction endpoint combining all models (improves accuracy from 92% to 96.5%)
- **Value**: $4.2M annual savings (fewer false negatives reducing yield loss)

In [None]:
# Custom Resource Definitions (CRDs) Simulation

@dataclass
class ModelServerSpec:
    """Desired state for model server."""
    model_name: str
    model_path: str
    framework: str  # tensorflow, pytorch, sklearn
    replicas: int = 3
    gpu_enabled: bool = False
    min_replicas: int = 1
    max_replicas: int = 10
    target_cpu_utilization: int = 70  # Auto-scale at 70% CPU


@dataclass
class ModelServerStatus:
    """Actual state for model server."""
    replicas: int = 0
    ready_replicas: int = 0
    endpoint: str = ""
    requests_per_second: float = 0.0
    avg_latency_ms: float = 0.0
    error_rate: float = 0.0


@dataclass
class ModelServerCRD:
    """CRD for model serving."""
    api_version: str = "serving.kubeflow.org/v1"
    kind: str = "ModelServer"
    metadata: Dict[str, str] = field(default_factory=dict)
    spec: ModelServerSpec = None
    status: ModelServerStatus = field(default_factory=ModelServerStatus)


@dataclass
class HyperparameterTuningSpec:
    """Desired state for hyperparameter tuning job."""
    model_name: str
    algorithm: str  # random, grid, bayesian
    max_trials: int = 50
    max_parallel_trials: int = 5
    objective_metric: str = "accuracy"
    objective_type: str = "maximize"  # maximize or minimize
    parameters: Dict[str, Dict] = field(default_factory=dict)


@dataclass
class Trial:
    """Single hyperparameter trial."""
    trial_id: str
    parameters: Dict[str, float]
    status: str = "Pending"  # Pending, Running, Succeeded, Failed
    objective_value: Optional[float] = None


@dataclass
class HyperparameterTuningStatus:
    """Actual state for hyperparameter tuning."""
    trials_completed: int = 0
    trials_failed: int = 0
    best_trial: Optional[Trial] = None
    current_trials: List[Trial] = field(default_factory=list)


@dataclass
class HyperparameterTuningCRD:
    """CRD for hyperparameter tuning."""
    api_version: str = "katib.kubeflow.org/v1"
    kind: str = "HyperparameterTuning"
    metadata: Dict[str, str] = field(default_factory=dict)
    spec: HyperparameterTuningSpec = None
    status: HyperparameterTuningStatus = field(default_factory=HyperparameterTuningStatus)


# Example 1: ModelServer CRD for Wafer Yield Prediction
print("=" * 80)
print("EXAMPLE 1: ModelServer CRD - Deploy ML Model with Auto-Scaling")
print("=" * 80)

wafer_model_server = ModelServerCRD(
    metadata={"name": "wafer-yield-predictor-v3"},
    spec=ModelServerSpec(
        model_name="wafer_yield_xgboost",
        model_path="s3://ml-models/wafer/xgboost-v3.pkl",
        framework="sklearn",
        replicas=3,
        gpu_enabled=False,
        min_replicas=2,
        max_replicas=8,
        target_cpu_utilization=75
    )
)

print("📄 ModelServer CRD:")
print(f"   Name: {wafer_model_server.metadata['name']}")
print(f"   Model: {wafer_model_server.spec.model_name}")
print(f"   Framework: {wafer_model_server.spec.framework}")
print(f"   Replicas: {wafer_model_server.spec.replicas}")
print(f"   Auto-scaling: {wafer_model_server.spec.min_replicas}-{wafer_model_server.spec.max_replicas} replicas")
print(f"   GPU: {'Enabled' if wafer_model_server.spec.gpu_enabled else 'Disabled'}")

# Simulate operator creating resources
wafer_model_server.status.replicas = wafer_model_server.spec.replicas
wafer_model_server.status.ready_replicas = wafer_model_server.spec.replicas
wafer_model_server.status.endpoint = f"http://{wafer_model_server.metadata['name']}.default.svc.cluster.local/v1/predict"
wafer_model_server.status.requests_per_second = 245.5
wafer_model_server.status.avg_latency_ms = 12.3
wafer_model_server.status.error_rate = 0.002

print(f"\n📊 ModelServer Status:")
print(f"   Replicas: {wafer_model_server.status.ready_replicas}/{wafer_model_server.status.replicas}")
print(f"   Endpoint: {wafer_model_server.status.endpoint}")
print(f"   Requests/sec: {wafer_model_server.status.requests_per_second}")
print(f"   Avg Latency: {wafer_model_server.status.avg_latency_ms} ms")
print(f"   Error Rate: {wafer_model_server.status.error_rate * 100:.2f}%")

print("\n💡 What Operator Does:")
print("   1. Creates Deployment with 3 replicas")
print("   2. Creates Service for load balancing")
print("   3. Creates HPA (HorizontalPodAutoscaler) for auto-scaling")
print("   4. Monitors metrics (requests/sec, latency, errors)")
print("   5. Auto-scales 2-8 replicas based on CPU utilization")

print("\n" + "=" * 80)
print("EXAMPLE 2: HyperparameterTuning CRD - Automated HPO")
print("=" * 80)

hpo_job = HyperparameterTuningCRD(
    metadata={"name": "wafer-yield-hpo-v1"},
    spec=HyperparameterTuningSpec(
        model_name="wafer_yield_neural_net",
        algorithm="bayesian",
        max_trials=30,
        max_parallel_trials=4,
        objective_metric="f1_score",
        objective_type="maximize",
        parameters={
            "learning_rate": {"min": 0.0001, "max": 0.01, "type": "double"},
            "hidden_units": {"min": 64, "max": 512, "type": "int"},
            "dropout": {"min": 0.1, "max": 0.5, "type": "double"},
            "batch_size": {"values": [32, 64, 128], "type": "categorical"}
        }
    )
)

print("üìÑ HyperparameterTuning CRD:")
print(f"   Name: {hpo_job.metadata['name']}")
print(f"   Model: {hpo_job.spec.model_name}")
print(f"   Algorithm: {hpo_job.spec.algorithm}")
print(f"   Max Trials: {hpo_job.spec.max_trials}")
print(f"   Parallel Trials: {hpo_job.spec.max_parallel_trials}")
print(f"   Objective: {hpo_job.spec.objective_type} {hpo_job.spec.objective_metric}")

print(f"\n🔬 Hyperparameter Search Space:")
for param, config in hpo_job.spec.parameters.items():
    if config["type"] == "categorical":
        print(f"   • {param}: {config['values']}")
    else:
        print(f"   • {param}: [{config['min']}, {config['max']}] ({config['type']})")

# Simulate running trials (random sampling stands in for the Bayesian optimizer)
print(f"\n🔄 Simulating Bayesian Optimization...")

trials = []
for i in range(10):
    trial = Trial(
        trial_id=f"trial-{i+1:03d}",
        parameters={
            "learning_rate": np.random.uniform(0.0001, 0.01),
            "hidden_units": int(np.random.uniform(64, 512)),
            "dropout": np.random.uniform(0.1, 0.5),
            "batch_size": np.random.choice([32, 64, 128])
        },
        status="Succeeded",
        objective_value=np.random.uniform(0.85, 0.97)
    )
    trials.append(trial)
    hpo_job.status.trials_completed += 1

# Find best trial
best_trial = max(trials, key=lambda t: t.objective_value)
hpo_job.status.best_trial = best_trial

print(f"\n📊 HPO Results (10/{hpo_job.spec.max_trials} trials completed):")
print(f"   Trials Completed: {hpo_job.status.trials_completed}")
print(f"   Trials Failed: {hpo_job.status.trials_failed}")

print(f"\n🏆 Best Trial: {best_trial.trial_id}")
print(f"   F1 Score: {best_trial.objective_value:.4f}")
print(f"   Parameters:")
print(f"     • learning_rate: {best_trial.parameters['learning_rate']:.5f}")
print(f"     • hidden_units: {best_trial.parameters['hidden_units']}")
print(f"     • dropout: {best_trial.parameters['dropout']:.3f}")
print(f"     • batch_size: {best_trial.parameters['batch_size']}")

# Visualize trial results
print("\n" + "=" * 80)
print("EXAMPLE 3: HPO Trial Results Visualization")
print("=" * 80)

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: F1 Score vs Learning Rate
ax1 = axes[0, 0]
lrs = [t.parameters['learning_rate'] for t in trials]
f1s = [t.objective_value for t in trials]
scatter1 = ax1.scatter(lrs, f1s, s=150, c=f1s, cmap='viridis', edgecolors='black', linewidths=1.5)
ax1.scatter(best_trial.parameters['learning_rate'], best_trial.objective_value, 
            s=400, c='red', marker='*', edgecolors='black', linewidths=2, label='Best Trial', zorder=5)
ax1.set_xlabel("Learning Rate", fontsize=12, fontweight='bold')
ax1.set_ylabel("F1 Score", fontsize=12, fontweight='bold')
ax1.set_title("F1 Score vs Learning Rate", fontsize=14, fontweight='bold', pad=15)
ax1.set_xscale('log')
ax1.grid(True, alpha=0.3)
ax1.legend(fontsize=10)
plt.colorbar(scatter1, ax=ax1, label='F1 Score')

# Plot 2: F1 Score vs Hidden Units
ax2 = axes[0, 1]
hidden = [t.parameters['hidden_units'] for t in trials]
scatter2 = ax2.scatter(hidden, f1s, s=150, c=f1s, cmap='viridis', edgecolors='black', linewidths=1.5)
ax2.scatter(best_trial.parameters['hidden_units'], best_trial.objective_value, 
            s=400, c='red', marker='*', edgecolors='black', linewidths=2, label='Best Trial', zorder=5)
ax2.set_xlabel("Hidden Units", fontsize=12, fontweight='bold')
ax2.set_ylabel("F1 Score", fontsize=12, fontweight='bold')
ax2.set_title("F1 Score vs Hidden Units", fontsize=14, fontweight='bold', pad=15)
ax2.grid(True, alpha=0.3)
ax2.legend(fontsize=10)
plt.colorbar(scatter2, ax=ax2, label='F1 Score')

# Plot 3: F1 Score vs Dropout
ax3 = axes[1, 0]
dropouts = [t.parameters['dropout'] for t in trials]
scatter3 = ax3.scatter(dropouts, f1s, s=150, c=f1s, cmap='viridis', edgecolors='black', linewidths=1.5)
ax3.scatter(best_trial.parameters['dropout'], best_trial.objective_value, 
            s=400, c='red', marker='*', edgecolors='black', linewidths=2, label='Best Trial', zorder=5)
ax3.set_xlabel("Dropout Rate", fontsize=12, fontweight='bold')
ax3.set_ylabel("F1 Score", fontsize=12, fontweight='bold')
ax3.set_title("F1 Score vs Dropout Rate", fontsize=14, fontweight='bold', pad=15)
ax3.grid(True, alpha=0.3)
ax3.legend(fontsize=10)
plt.colorbar(scatter3, ax=ax3, label='F1 Score')

# Plot 4: Batch Size Distribution
ax4 = axes[1, 1]
batch_sizes = [t.parameters['batch_size'] for t in trials]
batch_f1s = {}
for bs, f1 in zip(batch_sizes, f1s):
    if bs not in batch_f1s:
        batch_f1s[bs] = []
    batch_f1s[bs].append(f1)

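# Create box plot
sorted_sizes = sorted(batch_f1s.keys())
bp = ax4.boxplot([batch_f1s[bs] for bs in sorted_sizes], patch_artist=True)
# Set tick labels separately: the `labels` keyword is deprecated in Matplotlib 3.9+
ax4.set_xticklabels([str(bs) for bs in sorted_sizes])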
for patch, color in zip(bp['boxes'], ['#4ECDC4', '#FFD93D', '#FF6B6B']):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

ax4.set_xlabel("Batch Size", fontsize=12, fontweight='bold')
ax4.set_ylabel("F1 Score", fontsize=12, fontweight='bold')
ax4.set_title("F1 Score Distribution by Batch Size", fontsize=14, fontweight='bold', pad=15)
ax4.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n💡 Popular CRDs in ML/Kubernetes Ecosystem:")
print("\n🤖 Kubeflow:")
print("   • TrainingJob (TFJob, PyTorchJob, MXJob) - Distributed training")
print("   • Notebook - Jupyter notebook servers")
print("   • Experiment - ML experiment tracking")
print("   • Pipeline - ML pipeline workflows")

print("\n🚀 KServe (InferenceService):")
print("   • Predictor - Model serving with auto-scaling")
print("   • Transformer - Pre/post-processing pipelines")
print("   • Explainer - Model explainability")

print("\n🔬 Katib:")
print("   • Experiment - Hyperparameter tuning jobs")
print("   • Suggestion - HPO algorithm configuration")
print("   • Trial - Individual training trial")

print("\n🎯 Custom CRDs for Post-Silicon:")
print("   • STDFParserJob - Parse and analyze STDF files")
print("   • WaferMapAnalysis - Spatial pattern detection")
print("   • YieldPredictor - Yield forecasting models")
print("   • ParametricOutlierDetection - Anomaly detection on test data")
print("   • BinOptimization - Optimal binning strategies")

## 6. 🚀 Real-World Projects Using Kubernetes Advanced Patterns

Build production ML systems with StatefulSets, DaemonSets, Operators, and CRDs:

---

### **Project 1: Distributed PyTorch Training with StatefulSets** ⭐⭐⭐⭐
**Objective:** Build distributed deep learning system for wafer defect detection using PyTorch DDP (DistributedDataParallel)

**Business Value:**  
- 5x faster training (8 hours → 1.6 hours for 1M wafer images)
- $125K/year savings in compute costs (GPU utilization up from 60% to 92%)
- Enables real-time model updates (daily retraining on fresh data)

**Success Criteria:**
- ✅ Training completes successfully across 8 GPU workers
- ✅ Linear scaling efficiency >85% (8 GPUs = 6.8x speedup vs 1 GPU)
- ✅ Automatic recovery from worker failures (checkpoint restoration)
- ✅ Sub-10 minute recovery time from total cluster failure

**Features:**
- **StatefulSet with 8 replicas** (worker-0 to worker-7, stable DNS names)
- **Master-worker architecture** (worker-0 coordinates via stable DNS)
- **Shared PersistentVolume** for checkpoints (NFS or S3)
- **Init container** to download dataset shards
- **Headless Service** for worker-to-worker communication
- **Auto-checkpointing** every 500 steps (survives restarts)
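
The stable identities listed above are what let each worker work out its own rank and the master's address without a separate coordination service. A minimal sketch of the naming rules, assuming the default cluster domain `cluster.local` and the `default` namespace; the helper names `pod_dns_names` and `rank_from_pod_name` are hypothetical:

```python
import re
from typing import List


def pod_dns_names(statefulset: str, service: str, replicas: int,
                  namespace: str = "default") -> List[str]:
    """Stable per-pod DNS names behind a headless Service."""
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.cluster.local"
        for i in range(replicas)
    ]


def rank_from_pod_name(pod_name: str) -> int:
    """Extract the ordinal suffix that a StatefulSet appends to pod names."""
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"not a StatefulSet pod name: {pod_name!r}")
    return int(match.group(1))


names = pod_dns_names("worker", "pytorch-workers", replicas=8)
print(names[0])                        # address every worker can use as master
print(rank_from_pod_name("worker-7"))  # integer rank for the last worker
```

Inside a pod, the pod name is normally available as the hostname, so a training entrypoint could call `rank_from_pod_name(socket.gethostname())` to set its rank.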

**Implementation Hints:**
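```yaml
# StatefulSet YAML structure:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pytorch-distributed-training
spec:
  serviceName: "pytorch-workers"  # Headless Service that provides per-pod DNS
  replicas: 8
  selector:
    matchLabels:
      app: pytorch-training
  template:
    metadata:
      labels:
        app: pytorch-training  # Must match spec.selector
    spec:
      containers:
      - name: worker
        image: pytorch/pytorch:2.0-gpu  # Placeholder tag; substitute a real CUDA-enabled image
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name  # pytorch-distributed-training-0, -1, ...
        # RANK must be an integer: have the entrypoint derive it from the
        # ordinal suffix of POD_NAME
        - name: MASTER_ADDR
          value: "pytorch-distributed-training-0.pytorch-workers"  # <pod>.<service>, stable DNS
        - name: WORLD_SIZE
          value: "8"
        resources:
          limits:
            nvidia.com/gpu: 1  # One GPU per worker
        volumeMounts:
        - name: checkpoint-storage
          mountPath: /checkpoints
  volumeClaimTemplates:  # One PVC per pod; use a ReadWriteMany volume instead for shared checkpoints
  - metadata:
      name: checkpoint-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
```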

**Post-Silicon Application:**  
Train ResNet-50 on 1M wafer defect images (spatial patterns, scratch detection, particle contamination), deploy model to edge devices for real-time inspection

---

### **Project 2: GPU Driver Management with DaemonSet** ⭐⭐⭐
**Objective:** Automate NVIDIA GPU driver installation and monitoring across 50-node GPU cluster

**Business Value:**  
- $80K/year savings (eliminate manual driver updates, 3 DevOps engineers → 1)
- 99.8% GPU uptime (automatic driver recovery vs 96.2% manual)
- Zero-downtime driver updates (rolling updates on tainted nodes)

**Success Criteria:**
- ✅ GPU drivers installed on all GPU nodes automatically
- ✅ Non-GPU nodes unaffected (node selector filters)
- ✅ New GPU nodes get drivers within 5 minutes of joining cluster
- ✅ Driver upgrades complete with zero ML job interruptions

**Features:**
- **DaemonSet with node selector** (gpu=nvidia)
- **Init container** to install NVIDIA driver kernel modules
- **Tolerations** for GPU taints (gpu-workload=true:NoSchedule)
- **Host mounts** for /dev, /sys, /proc (driver access)
- **ConfigMap** for driver version pinning (525.xx)
- **Prometheus metrics** for GPU utilization monitoring
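
How the node selector and tolerations combine to decide where DaemonSet pods land can be sketched as a small filter. This is a deliberately simplified model, not the real scheduler: it treats every taint as blocking, matches tolerations on key only (as with `operator: Exists`), and ignores the tolerations Kubernetes adds to DaemonSet pods automatically. `Node`, `tolerates`, and `daemonset_targets` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Taint = Tuple[str, str, str]  # (key, value, effect)


@dataclass
class Node:
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    taints: List[Taint] = field(default_factory=list)


def tolerates(taint: Taint, tolerated_keys: List[str]) -> bool:
    """Simplified 'operator: Exists' toleration: match on the taint key only."""
    key, _value, _effect = taint
    return key in tolerated_keys


def daemonset_targets(nodes: List[Node], node_selector: Dict[str, str],
                      tolerated_keys: List[str]) -> List[str]:
    """Names of the nodes that would each run one DaemonSet pod."""
    targets = []
    for node in nodes:
        selected = all(node.labels.get(k) == v for k, v in node_selector.items())
        schedulable = all(tolerates(t, tolerated_keys) for t in node.taints)
        if selected and schedulable:
            targets.append(node.name)
    return targets


cluster = [
    Node("gpu-1", {"gpu": "nvidia"}, [("gpu-workload", "true", "NoSchedule")]),
    Node("gpu-2", {"gpu": "nvidia"}),
    Node("cpu-1"),
    Node("gpu-3", {"gpu": "nvidia"}, [("maintenance", "true", "NoSchedule")]),
]
print(daemonset_targets(cluster, {"gpu": "nvidia"}, ["gpu-workload"]))
```

In this toy cluster the driver pod lands on `gpu-1` and `gpu-2`: `cpu-1` fails the selector, and `gpu-3` carries a taint the DaemonSet does not tolerate.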

**Implementation Hints:**
```yaml
# DaemonSet YAML structure:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
spec:
  selector:
    matchLabels:
      name: nvidia-driver
  template:
    metadata:
      labels:
        name: nvidia-driver  # Must match spec.selector
    spec:
      nodeSelector:
        gpu: nvidia  # Only GPU nodes
      tolerations:
      - key: gpu-workload
        operator: Exists
        effect: NoSchedule
      initContainers:
      - name: driver-installer
        image: nvidia/driver:525.105.17-ubuntu22.04
        securityContext:
          privileged: true
        volumeMounts:
        - name: dev
          mountPath: /dev
        - name: nvidia-install-dir
          mountPath: /usr/local/nvidia
      containers:
      - name: nvidia-device-plugin
        image: nvidia/k8s-device-plugin:v0.14.0
      volumes:
      - name: dev
        hostPath:
          path: /dev
      - name: nvidia-install-dir
        hostPath:
          path: /usr/local/nvidia
```

**Post-Silicon Application:**  
Ensure all GPU nodes run the same pinned NVIDIA driver version, compatible with the CUDA 12.x runtime used by ML inference workloads (YOLOv8 wafer defect detection requires a specific driver version)

---

### **Project 3: ML Training Operator for Auto-Scaling** ⭐⭐⭐⭐⭐
**Objective:** Build Kubernetes Operator that auto-scales training jobs based on queue depth and GPU availability

**Business Value:**  
- $220K/year savings (reduce idle GPU time from 35% to 8%)
- 3x more experiments per day (automatic queue processing)
- 40% faster time-to-model (parallel training when GPUs available)

**Success Criteria:**
- ✅ Operator watches TrainingJob CRDs and reconciles state
- ✅ Auto-scales from 0 → 8 workers when queue depth > 5
- ✅ Scales down to 0 when no jobs queued (save costs)
- ✅ Automatic retry on failure (up to 3 retries with exponential backoff)

**Features:**
- **Custom Resource Definition** (TrainingJob with spec: model, dataset, epochs)
- **Operator controller** with reconciliation loop (every 10 seconds)
- **Queue depth monitoring** (scale up when >5 jobs waiting)
- **GPU availability checks** (don't schedule if no GPUs free)
- **Checkpoint management** (save every N epochs, restore on retry)
- **Metrics collection** (accuracy, loss, training time)
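
The success criteria call for retries with exponential backoff. A minimal sketch of the delay schedule follows; the 10-second base, doubling factor, and 300-second cap are illustrative assumptions, and `backoff_delays` is a hypothetical helper:

```python
from typing import List


def backoff_delays(max_retries: int = 3, base_seconds: float = 10.0,
                   factor: float = 2.0, cap_seconds: float = 300.0) -> List[float]:
    """Delay before each retry: base * factor**attempt, capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * factor ** attempt)
            for attempt in range(max_retries)]


print(backoff_delays())  # [10.0, 20.0, 40.0]
```

In a controller, the reconcile function would typically requeue the job after the computed delay rather than sleeping, so other jobs keep being processed in the meantime.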

**Implementation Hints:**
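```python
# Operator reconciliation logic (sketch). The cluster state below is an
# in-memory stand-in for real Kubernetes API calls.
from types import SimpleNamespace

MAX_WORKERS = 8
QUEUE_THRESHOLD = 5
_pods = {}  # job name -> set of worker pod names (fake cluster state)


def get_pods(job_name):
    return _pods.setdefault(job_name, set())


def desired_replicas(current, queue_depth, gpu_available):
    """Auto-scale on queue depth; otherwise keep the current replica count."""
    if queue_depth == 0:
        return 0  # Scale to zero when nothing is queued
    if queue_depth > QUEUE_THRESHOLD and gpu_available:
        return min(MAX_WORKERS, queue_depth)
    return current


def reconcile(training_job, queue_depth, gpu_available):
    # Desired state: decide the replica count first, then converge on it
    training_job.spec.replicas = desired_replicas(
        training_job.spec.replicas, queue_depth, gpu_available)
    desired = {f"{training_job.name}-worker-{i}"
               for i in range(training_job.spec.replicas)}

    # Actual state from the (fake) cluster
    actual = get_pods(training_job.name)

    # Reconcile: create missing workers, delete extra ones
    for pod in sorted(desired - actual):
        actual.add(pod)      # stands in for create_pod(pod)
    for pod in sorted(actual - desired):
        actual.discard(pod)  # stands in for delete_pod(pod)
    return len(actual)


job = SimpleNamespace(name="stdf-train", spec=SimpleNamespace(replicas=0))
print(reconcile(job, queue_depth=6, gpu_available=True))  # 6 workers
print(reconcile(job, queue_depth=0, gpu_available=True))  # 0 workers
```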

**Post-Silicon Application:**  
Auto-scale STDF parsing jobs based on wafer test data ingestion rate (10 wafers/hour → 2 workers, 100 wafers/hour → 8 workers)
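
One way to turn the two operating points above into a scaling rule is linear interpolation between them, clamped at both ends. The linear shape is an assumption made for illustration (the text only gives the two endpoints), and `workers_for_rate` is a hypothetical helper:

```python
import math


def workers_for_rate(wafers_per_hour: float,
                     min_workers: int = 2, max_workers: int = 8,
                     low_rate: float = 10.0, high_rate: float = 100.0) -> int:
    """Interpolate worker count between the two stated operating points."""
    if wafers_per_hour <= low_rate:
        return min_workers
    if wafers_per_hour >= high_rate:
        return max_workers
    fraction = (wafers_per_hour - low_rate) / (high_rate - low_rate)
    # Round up so the parsers never fall behind the ingestion rate
    return min_workers + math.ceil(fraction * (max_workers - min_workers))


for rate in (10, 55, 100):
    print(rate, "wafers/hour ->", workers_for_rate(rate), "workers")
```

The midpoint of 55 wafers/hour maps to 5 workers under this rule; a production operator would more likely scale on measured queue depth than on a fixed rate table.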

---

### **Project 4: KServe Multi-Model Ensemble Deployment** ⭐⭐⭐⭐
**Objective:** Deploy ensemble of 5 models for wafer yield prediction with canary releases and A/B testing

**Business Value:**  
- 4.2% accuracy improvement (92.3% ‚Üí 96.5%, ensemble voting)
- $4.2M/year savings (fewer false negatives ‚Üí less yield loss)
- Zero-downtime model updates (canary releases with 5% traffic)

**Success Criteria:**
- ‚úÖ All 5 models deployed and healthy (Random Forest, XGBoost, LightGBM, CatBoost, Neural Net)
- ‚úÖ Ensemble combiner aggregates predictions (weighted voting)
- ‚úÖ Canary release completes with <0.5% error rate increase
- ‚úÖ Auto-rollback if canary metrics degrade >10%

**Features:**
- **InferenceService CRD** for each model
- **Ensemble combiner** (weighted voting based on validation accuracy)
- **Canary deployment** (route 5% traffic to new version)
- **Prometheus metrics** (latency, throughput, error rate)
- **Auto-scaling** (2-10 replicas based on request rate)
- **GPU acceleration** for Neural Net model only

**Implementation Hints:**
```yaml
# InferenceService CRD for Random Forest:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: wafer-yield-rf
spec:
  predictor:
    # Scaling and canary settings are predictor-level fields
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 3
    scaleMetric: rps
    canaryTrafficPercent: 5  # Canary release
    sklearn:
      storageUri: s3://ml-models/wafer-yield/rf-v2.pkl
      resources:
        limits:
          cpu: 2
          memory: 4Gi
```

```python
# Ensemble combiner:
from dataclasses import dataclass
from typing import Any, List

import numpy as np

@dataclass
class EnsemblePredictor:
    models: List[Any]     # Clients exposing predict(), one per InferenceService
    weights: List[float]  # Based on validation accuracy

    def predict(self, features):
        predictions = [model.predict(features) for model in self.models]
        # axis=0 averages across models, keeping one value per sample
        return np.average(predictions, axis=0, weights=self.weights)
```
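The "weighted voting based on validation accuracy" step can be sketched without any serving dependency. The example below assumes each model returns a pass probability for a wafer and that weights are simply the validation accuracies normalized to sum to 1. The accuracy values and function names are hypothetical, chosen only for illustration.

```python
from typing import Dict


def accuracy_weights(val_accuracy: Dict[str, float]) -> Dict[str, float]:
    """Normalize validation accuracies so the weights sum to 1."""
    total = sum(val_accuracy.values())
    return {name: acc / total for name, acc in val_accuracy.items()}


def weighted_vote(probabilities: Dict[str, float],
                  weights: Dict[str, float]) -> float:
    """Weighted average of per-model pass probabilities for one wafer."""
    return sum(weights[name] * p for name, p in probabilities.items())


# Hypothetical validation accuracies and predictions, for illustration only
weights = accuracy_weights({"rf": 0.90, "xgboost": 0.92, "nn": 0.94})
score = weighted_vote({"rf": 0.80, "xgboost": 0.70, "nn": 0.90}, weights)
print(round(sum(weights.values()), 6), round(score, 4))
```

Because the weights are normalized, the ensemble score always lies between the lowest and highest individual prediction.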

**Post-Silicon Application:**  
Combine 5 models for wafer yield prediction (each model trained on different feature subsets: electrical params, spatial features, test sequence, temperature, lot history)

---

### **Project 5: Kubeflow Pipelines for End-to-End ML** ⭐⭐⭐⭐⭐
**Objective:** Build automated ML pipeline from STDF parsing → feature engineering → training → deployment

**Business Value:**  
- $340K/year savings (reduce data scientist time from 30 hours/week → 8 hours/week)
- 5x faster model iteration (2 weeks → 3 days from data to production)
- 100% reproducibility (versioned pipelines, no manual steps)

**Success Criteria:**
- ✅ Pipeline executes all 7 steps automatically (parse STDF → deploy model)
- ✅ Each step cached (re-run only changed components)
- ✅ Full lineage tracking (model → training data → STDF files)
- ✅ One-click rollback to previous pipeline version

**Features:**
- **Kubeflow Pipeline DAG** (7 steps with dependencies)
- **Component caching** (skip unchanged steps)
- **Artifact versioning** (MLflow integration)
- **Conditional execution** (deploy only if validation accuracy >95%)
- **Parallel execution** (5 feature engineering jobs in parallel)
- **Notifications** (Slack alerts on failure/success)
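
Component caching comes down to one idea: a step can be skipped when everything that determines its output is unchanged. The sketch below illustrates that idea with a content hash over the image, arguments, and input digests. It is not how Kubeflow Pipelines implements caching internally, and `step_cache_key`, `run_step`, and the in-memory `cache` are hypothetical names.

```python
import hashlib
import json
from typing import Callable, Dict, List


def step_cache_key(image: str, arguments: List[str],
                   input_digests: Dict[str, str]) -> str:
    """Hash everything that determines a step's output."""
    payload = json.dumps(
        {"image": image, "arguments": arguments, "inputs": input_digests},
        sort_keys=True,  # Stable ordering so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache: Dict[str, str] = {}  # cache key -> stored output location


def run_step(image: str, arguments: List[str],
             input_digests: Dict[str, str],
             execute: Callable[[], str]) -> str:
    """Run the step only when no cached result exists for its key."""
    key = step_cache_key(image, arguments, input_digests)
    if key not in cache:
        cache[key] = execute()
    return cache[key]
```

Changing the image tag, an argument, or any input digest produces a new key, so only the affected step and its downstream steps re-run.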

**Pipeline Steps:**
1. **STDF Parser**: Parse 1000 STDF files → Parquet (parallelized, 10 workers)
2. **Feature Engineering**: 50 derived features (electrical + spatial)
3. **Train/Test Split**: 80/20 stratified split
4. **Hyperparameter Tuning**: Bayesian optimization (30 trials)
5. **Model Training**: Best hyperparameters, full dataset
6. **Model Validation**: Test set evaluation (accuracy, F1, AUC)
7. **Model Deployment**: KServe InferenceService (canary release)

**Implementation Hints:**
```python
# Sketch using the KFP v1 SDK: dsl.ContainerOp was removed in KFP v2,
# which uses @dsl.component / @dsl.container_component instead.
from kfp import dsl

@dsl.pipeline(name="Wafer Yield Prediction Pipeline")
def wafer_yield_pipeline(stdf_bucket: str, model_version: str):
    # Step 1: Parse STDF files
    parse_op = dsl.ContainerOp(
        name="Parse STDF Files",
        image="wafer-ml/stdf-parser:v2",
        arguments=["--bucket", stdf_bucket, "--output", "/data/parsed"],
        # Declares the "data_path" output read below (file path is illustrative)
        file_outputs={"data_path": "/tmp/data_path.txt"},
    )
    
    # Step 2: Feature engineering (depends on parse_op)
    feature_op = dsl.ContainerOp(
        name="Feature Engineering",
        image="wafer-ml/feature-eng:v1",
        arguments=["--input", parse_op.outputs["data_path"]]
    )
    
    # Step 3-7: Training, validation, deployment
    # ... (validation_op must declare an "accuracy" file output)
    
    # Conditional: only deploy if validation accuracy > 95%
    with dsl.Condition(validation_op.outputs["accuracy"] > 0.95):
        deploy_op = dsl.ContainerOp(...)
```

**Post-Silicon Application:**  
Automate weekly model retraining pipeline (Friday night: parse week's STDF data → retrain → deploy by Monday morning)

---

### **Project 6: Custom GPU Scheduler for Cost Optimization** ⭐⭐⭐⭐
**Objective:** Build custom Kubernetes scheduler that packs ML jobs efficiently on GPU nodes (bin packing)

**Business Value:**  
- $180K/year savings (reduce GPU nodes from 50 → 38 via better packing)
- 25% better GPU utilization (76% → 95% average)
- 3x more jobs per GPU (time-slicing for small inference jobs)

**Success Criteria:**
- ✅ Scheduler achieves >90% GPU utilization across cluster
- ✅ Bin packing reduces wasted GPU memory by 40%
- ✅ Latency-sensitive jobs get priority (p99 latency <50ms)
- ✅ Training jobs preemptible by inference jobs (cost optimization)

**Features:**
- **Custom scheduler** (implements Kubernetes scheduler interface)
- **Bin packing algorithm** (first-fit-decreasing by GPU memory)
- **Priority classes** (inference > training)
- **GPU sharing** (time-slicing or MIG partitions for small inference jobs; MIG is hardware partitioning, not time-slicing)
- **Anti-affinity** (spread replicas across nodes for HA)
- **Cost-aware scheduling** (prefer spot instances for training)

**Implementation Hints:**
```python
class GPUScheduler:
    def filter_nodes(self, pod, nodes):
        """Filter nodes that can run this pod."""
        viable = []
        for node in nodes:
            # Check GPU availability
            if pod.gpu_required > node.gpu_available:
                continue
            
            # Check GPU memory
            if pod.gpu_memory_required > node.gpu_memory_available:
                continue
            
            viable.append(node)
        
        return viable
    
    def score_nodes(self, pod, nodes):
        """Score nodes (higher = better)."""
        scores = {}
        for node in nodes:
            # Bin packing: prefer fuller nodes (reduce fragmentation)
            utilization = node.gpu_used / node.gpu_total
            scores[node] = utilization * 100
            
            # Bonus: spot instances for preemptible jobs
            if pod.preemptible and node.is_spot:
                scores[node] += 50
        
        return scores
    
    def bind_pod(self, pod, node):
        """Bind pod to selected node."""
        node.gpu_available -= pod.gpu_required
        node.gpu_memory_available -= pod.gpu_memory_required
```

**Post-Silicon Application:**  
Pack 20 small STDF parser jobs (0.5 GPU each) + 5 large training jobs (4 GPU each) on 12 GPU nodes (A100 80GB, 8 GPUs/node)
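
The first-fit-decreasing idea from the feature list can be tried on exactly this workload. The sketch below packs by GPU count rather than GPU memory, to keep it short, and treats 0.5 GPU as a schedulable unit. That second point is an assumption: Kubernetes extended resources such as `nvidia.com/gpu` are requested in whole numbers, so fractional shares depend on time-slicing or MIG being configured.

```python
from typing import List


def first_fit_decreasing(demands: List[float],
                         node_capacity: float) -> List[List[float]]:
    """Place each job on the first node with room, largest jobs first."""
    nodes: List[List[float]] = []
    for demand in sorted(demands, reverse=True):
        if demand > node_capacity:
            raise ValueError(f"job needs {demand} GPUs, node has {node_capacity}")
        for node in nodes:
            if sum(node) + demand <= node_capacity:
                node.append(demand)
                break
        else:
            nodes.append([demand])  # No existing node fits: open a new one
    return nodes


jobs = [0.5] * 20 + [4.0] * 5  # 20 parser jobs + 5 training jobs
placement = first_fit_decreasing(jobs, node_capacity=8.0)
print(len(placement), [sum(node) for node in placement])
```

For this workload the sketch fills three 8-GPU nodes and uses 6 GPUs of a fourth, so the remaining nodes could be released or kept for burst capacity.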

---

### **Project 7: Multi-Tenant ML Platform with Namespace Isolation** ⭐⭐⭐⭐
**Objective:** Build shared ML platform for 5 engineering teams with resource quotas and network isolation

**Business Value:**  
- $420K/year savings (shared infrastructure vs per-team clusters)
- 60% better resource utilization (team-1 borrows GPUs from idle team-2)
- 99.9% tenant isolation (no cross-team data leakage)

**Success Criteria:**
- ✅ Each team has dedicated namespace with ResourceQuota
- ✅ Network policies prevent cross-team traffic
- ✅ Fair scheduling (no team monopolizes GPUs)
- ✅ Chargebacks based on actual usage (GPU-hours, storage GB)

**Features:**
- **Namespace per team** (team-design, team-test, team-validation, team-analytics, team-packaging)
- **ResourceQuota** (max 10 GPUs, 500GB RAM, 2TB storage per team)
- **LimitRange** (min/max resources per pod)
- **NetworkPolicy** (deny all cross-namespace traffic except API server)
- **Pod Security Admission** (prevent privileged containers; PodSecurityPolicy was removed in Kubernetes 1.25)
- **Chargeback tracking** (Prometheus metrics → cost allocation)

**Implementation Hints:**
```yaml
# Namespace with ResourceQuota:
apiVersion: v1
kind: Namespace
metadata:
  name: team-design
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-design-quota
  namespace: team-design
spec:
  hard:
    requests.nvidia.com/gpu: "10"  # Max 10 GPUs
    requests.memory: "500Gi"
    persistentvolumeclaims: "20"
    requests.storage: "2Ti"
---
# NetworkPolicy: deny all ingress except from same namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-namespace
  namespace: team-design
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}  # Only same namespace
```
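A ResourceQuota rejects a new pod when current usage plus the pod's request would exceed a `hard` limit. The sketch below mirrors that check in plain Python so the quota numbers can be reasoned about offline. `fits_quota` is a hypothetical helper, and quantities are given as plain numbers rather than Kubernetes quantity strings.

```python
from typing import Dict


def fits_quota(hard: Dict[str, float], used: Dict[str, float],
               request: Dict[str, float]) -> bool:
    """True if every requested resource stays within the namespace's hard limit."""
    return all(
        used.get(name, 0) + amount <= hard[name]
        for name, amount in request.items()
        if name in hard  # Resources without a quota entry are unconstrained
    )


hard = {"requests.nvidia.com/gpu": 10, "requests.memory": 500}  # GPUs, GiB
used = {"requests.nvidia.com/gpu": 8, "requests.memory": 300}

print(fits_quota(hard, used, {"requests.nvidia.com/gpu": 2}))  # Exactly at the limit
print(fits_quota(hard, used, {"requests.nvidia.com/gpu": 4}))  # Over the limit
```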

**Post-Silicon Application:**  
Design team (10 users), Test team (15 users), Validation team (8 users), Analytics team (5 users), Packaging team (3 users) share 50-node GPU cluster

---

### **Project 8: Production Monitoring with Custom Operator** ⭐⭐⭐⭐⭐
**Objective:** Build operator that auto-deploys Prometheus, Grafana, and Alertmanager for each ML application

**Business Value:**  
- $95K/year savings (eliminate manual monitoring setup, 2 SRE engineers → 0.5)
- 99.5% model uptime (auto-alerts on degradation vs 94% manual)
- 15-minute MTTR (mean time to recovery, down from 4 hours)

**Success Criteria:**
- ✅ Operator deploys full monitoring stack in <5 minutes
- ✅ Auto-configured dashboards for model metrics (latency, throughput, accuracy)
- ✅ Alerts fire within 60 seconds of anomaly detection
- ✅ Self-healing: operator recreates failed Prometheus pods

**Features:**
- **MonitoringStack CRD** (defines Prometheus, Grafana, Alertmanager config)
- **Operator** watches MonitoringStack and reconciles state
- **Service monitors** (auto-discover model endpoints)
- **Alerting rules** (p99 latency >100ms, error rate >1%, accuracy drop >5%)
- **Grafana dashboards** (auto-generated from CRD)
- **PersistentVolume** for metrics retention (30 days)

**Implementation Hints:**
```python
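from dataclasses import dataclass, field
from typing import Dict

@dataclass
class MonitoringStackCRD:
    name: str  # metadata.name of the MonitoringStack resource
    api_version: str = "monitoring.ml/v1"
    kind: str = "MonitoringStack"
    spec: Dict = field(default_factory=dict)

class MonitoringOperator:
    # The *_exists and create_* helpers are left for you to implement
    def reconcile(self, stack: MonitoringStackCRD):
        # Deploy Prometheus
        if not self.prometheus_exists(stack.name):
            self.create_prometheus(stack)
        
        # Deploy Grafana
        if not self.grafana_exists(stack.name):
            self.create_grafana(stack)
        
        # Deploy Alertmanager
        if not self.alertmanager_exists(stack.name):
            self.create_alertmanager(stack)
        
        # Configure ServiceMonitors
        for model in stack.spec.get("models", []):
            self.create_service_monitor(model)
        
        # Configure AlertingRules
        for rule in stack.spec.get("alert_rules", []):
            self.create_alerting_rule(rule)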
```

**Post-Silicon Application:**  
Monitor 25 deployed ML models (wafer yield, test time, parametric outliers, bin optimization, defect detection) with unified observability

---

### 💡 **Project Selection Guide**

**Choose Project 1-2** if learning Kubernetes basics (StatefulSets, DaemonSets)  
**Choose Project 3-5** if building production ML pipelines (Operators, Kubeflow)  
**Choose Project 6-8** if optimizing infrastructure (scheduling, multi-tenancy, monitoring)

**All projects include:**
- Complete implementation templates (YAML + Python code)
- Post-silicon validation applications
- Business value quantification ($ savings, % improvement)
- Success criteria (measurable objectives)

## 7. 📚 Comprehensive Takeaways - Kubernetes Advanced Patterns

---

### 🎯 **Core Concepts Summary**

#### **StatefulSets**
- **Purpose**: Provide stable, unique identities for pods (predictable names, stable DNS, persistent storage)
- **When to Use**: Databases (MySQL, PostgreSQL, MongoDB), distributed training (PyTorch DDP, Horovod), caches (Redis cluster), consensus systems (etcd, ZooKeeper)
- **Key Features**: Ordered creation/deletion (0→1→2, 2→1→0), stable network IDs (pod-0.service.ns.svc.cluster.local), PersistentVolumeClaim per pod
- **Anti-Pattern**: Using StatefulSets for stateless applications (use Deployments instead)

#### **DaemonSets**
- **Purpose**: Run one pod per node (or matching nodes) for cluster-wide services
- **When to Use**: GPU drivers, monitoring agents (Prometheus node-exporter), logging (Fluentd, Filebeat), networking (Calico, Cilium), security (Falco)
- **Key Features**: Auto-scheduling on new nodes, node selectors (gpu=nvidia), tolerations (run on tainted nodes)
- **Anti-Pattern**: Using DaemonSets for application workloads (use Deployments with pod anti-affinity)

#### **Operators**
- **Purpose**: Automate complex application lifecycle management (encode operational knowledge as code)
- **When to Use**: ML training automation, database backups, certificate management, custom scaling logic
- **Key Features**: Watch CRDs, reconciliation loop (ensure desired state = actual state), self-healing, domain-specific operations
- **Anti-Pattern**: Using operators for simple tasks (shell scripts or CronJobs suffice)

#### **Custom Resource Definitions (CRDs)**
- **Purpose**: Extend Kubernetes API with custom resource types (TrainingJob, ModelServer, Experiment)
- **When to Use**: ML platforms (Kubeflow, KServe), CI/CD (Tekton, Argo), databases (CockroachDB, Vitess), custom controllers
- **Key Features**: Schema validation, versioning, defaulting, conversion webhooks, CRUD operations via kubectl
- **Anti-Pattern**: Creating CRDs for one-off tasks (use ConfigMaps or Jobs)

---

### 🏗️ **Architecture Best Practices**

#### **1. StatefulSet Design Patterns**

**Master-Worker Architecture:**
```yaml
# Ordinal 0 acts as master (coordinator); higher ordinals are workers
env:
- name: POD_NAME  # e.g. training-workers-2; the entrypoint parses the ordinal into RANK
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
- name: MASTER_ADDR
  value: "training-workers-0.training-workers"  # Stable DNS
```
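
A StatefulSet pod's name ends in its ordinal, so the rank can be derived at container startup instead of being hard-coded in the manifest. A minimal sketch, assuming the pod name is available to the process (inside a pod, `socket.gethostname()` returns it by default). The function names are illustrative.

```python
import re


def rank_from_pod_name(pod_name: str) -> int:
    """Extract the StatefulSet ordinal, e.g. 'training-workers-2' -> 2."""
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"no ordinal suffix in pod name: {pod_name!r}")
    return int(match.group(1))


def master_addr(pod_name: str, service: str) -> str:
    """DNS name of ordinal 0 behind the headless service."""
    statefulset = pod_name.rsplit("-", 1)[0]
    return f"{statefulset}-0.{service}"


print(rank_from_pod_name("training-workers-2"))
print(master_addr("training-workers-2", "training-workers"))
```

The training entrypoint can export these values as `RANK` and `MASTER_ADDR` before launching the framework.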

**Storage Management:**
- Use PersistentVolumeClaims for stateful data (survives pod restarts)
- Use emptyDir for temporary data (deleted on pod termination)
- Use NFS/S3 for shared data (all pods access same files)

**Headless Services:**
```yaml
# Required for StatefulSets to provide stable DNS
apiVersion: v1
kind: Service
metadata:
  name: training-workers
spec:
  clusterIP: None  # Headless
  selector:
    app: training
```

#### **2. DaemonSet Design Patterns**

**Node Affinity vs Node Selector:**
- **Node Selector**: Simple label matching (`gpu: nvidia`)
- **Node Affinity**: Complex rules (`requiredDuringSchedulingIgnoredDuringExecution`, `preferredDuringSchedulingIgnoredDuringExecution`)

**Tolerations:**
```yaml
# Run on tainted nodes
tolerations:
- key: gpu-workload
  operator: Exists
  effect: NoSchedule
```

**Update Strategies:**
- **RollingUpdate** (default): Replaces pods node by node (`maxUnavailable: 1` by default); each node has a brief gap while its pod restarts unless `maxSurge` is set
- **OnDelete**: Manual update (delete pod to trigger update)

#### **3. Operator Design Patterns**

**Reconciliation Loop:**
```python
while True:
    for resource in watch_resources():
        desired_state = resource.spec
        actual_state = get_actual_state(resource)
        
        if desired_state != actual_state:
            reconcile(resource, desired_state, actual_state)
    
    time.sleep(reconcile_interval)
```

**Idempotency:**
- Reconciliation must be idempotent (calling multiple times = same result)
- Check if resource exists before creating
- Use status subresource to track state
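
The "check before creating" rule can be demonstrated against an in-memory stand-in for the API server. `FakeCluster` and `ensure_pod` are hypothetical names used only for this sketch. A real operator would call the Kubernetes API and also handle the "already exists" conflict that can occur between the check and the create.

```python
from typing import Dict, Optional


class FakeCluster:
    """In-memory stand-in for the API server, for illustration only."""

    def __init__(self) -> None:
        self.pods: Dict[str, dict] = {}
        self.create_calls = 0

    def get_pod(self, name: str) -> Optional[dict]:
        return self.pods.get(name)

    def create_pod(self, name: str, spec: dict) -> None:
        self.create_calls += 1
        self.pods[name] = spec


def ensure_pod(cluster: FakeCluster, name: str, spec: dict) -> bool:
    """Create the pod only if it is missing; return True if a create happened."""
    if cluster.get_pod(name) is not None:
        return False
    cluster.create_pod(name, spec)
    return True


cluster = FakeCluster()
for _ in range(3):  # Repeated reconciles converge to the same state
    ensure_pod(cluster, "trainer-worker-0", {"image": "trainer:v1"})
print(cluster.create_calls, sorted(cluster.pods))
```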

**Error Handling:**
- Exponential backoff for retries (1s, 2s, 4s, 8s, ...)
- Max retries limit (3-5 retries)
- Update resource status with error message
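
The retry schedule above (1s, 2s, 4s, 8s with a retry cap) can be sketched as a small wrapper. The `sleep` parameter is injectable so the logic can be tested without waiting. The names `backoff_delays` and `retry` are illustrative, not taken from an operator framework.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base: float = 1.0) -> List[float]:
    """Delays before each retry: base, 2*base, 4*base, ..."""
    return [base * 2 ** attempt for attempt in range(max_retries)]


def retry(operation: Callable[[], T], max_retries: int = 3,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying with exponential backoff on any exception."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # Retries exhausted: surface the error to the caller
            sleep(delays[attempt])
    raise AssertionError("unreachable")


print(backoff_delays(4))
```

When retries are exhausted, the operator would typically record the error in the resource's status, as the last bullet suggests.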

#### **4. CRD Design Patterns**

**Spec vs Status:**
- **Spec**: User-defined desired state (editable by users; each change increments `metadata.generation`)
- **Status**: System-managed actual state (updated by controller)

**Versioning:**
```yaml
apiVersion: ml.kubeflow.org/v1beta1  # Version in API group
kind: TrainingJob
spec:
  # v1beta1 fields
status:
  # Managed by operator
```

**Validation:**
```yaml
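# OpenAPI schema validation (apiextensions.k8s.io/v1 CRD, per served version)
versions:
- name: v1beta1
  served: true
  storage: true
  schema:
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          properties:
            replicas:
              type: integer
              minimum: 1
              maximum: 100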
```

---

### ⚡ **Performance Optimization**

#### **1. StatefulSet Scaling**

**Parallel Pod Management:**
```yaml
# Create or delete pods simultaneously during scaling
spec:
  podManagementPolicy: Parallel  # Default: OrderedReady
```

**Performance Impact:**
- OrderedReady: Sequential (0→1→2, slower but safer)
- Parallel: All pods at once (faster but may cause resource contention); applies to scaling only, not to rolling updates

#### **2. DaemonSet Resource Limits**

**Prevent Node Overload:**
```yaml
resources:
  requests:
    cpu: 100m
    memory: 200Mi
  limits:
    cpu: 200m
    memory: 500Mi
```

**Priority Classes:**
```yaml
# Prevent DaemonSet eviction
priorityClassName: system-node-critical  # Highest priority
```

#### **3. Operator Efficiency**

**Watch vs Polling:**
- Use **Watch API** (event-driven, efficient)
- Avoid **Polling** (wasteful, high API server load)

**Leader Election:**
```python
# Only one operator replica reconciles (avoid conflicts)
from kubernetes import client, config

lock = client.V1Lease(...)
if acquire_lock(lock):
    run_reconciliation_loop()
```

**Batch Reconciliation:**
```python
# Process multiple resources in one reconciliation
pending_jobs = [job for job in jobs if job.status.phase == "Pending"]
for job in pending_jobs[:10]:  # Batch of 10
    reconcile(job)
```

---

### 🔒 **Security Best Practices**

#### **1. RBAC for Operators**

**Principle of Least Privilege:**
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-operator
rules:
- apiGroups: ["ml.kubeflow.org"]
  resources: ["trainingjobs"]
  verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "delete", "get", "list"]
```

**Avoid Cluster-Admin:**
- Never grant `cluster-admin` to operators
- Use `Role` (namespace-scoped) or `ClusterRole` (cluster-wide) with minimal permissions

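#### **2. Pod Security Contexts**

**StatefulSet Security:**
```yaml
spec:
  securityContext:            # Pod-level settings
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
  - name: app
    securityContext:          # Container-level settings
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
```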

**DaemonSet Privileged Containers:**
- Only use `privileged: true` for GPU drivers, networking (absolutely necessary)
- Use securityContext to drop unnecessary capabilities

#### **3. Network Policies**

**Isolate StatefulSets:**
```yaml
# Only allow traffic from same StatefulSet
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-statefulset
spec:
  podSelector:
    matchLabels:
      app: redis-cluster
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: redis-cluster
```

---

### 🐛 **Troubleshooting Guide**

#### **StatefulSet Issues**

**Problem: Pods stuck in Pending**
- **Cause**: PersistentVolumeClaim not bound
- **Fix**: Check PV availability (`kubectl get pv`), verify StorageClass exists

**Problem: Pods restarting frequently**
- **Cause**: Liveness probe failing
- **Fix**: Increase `initialDelaySeconds`, check application logs

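**Problem: Scaling stuck (pods not created)**
- **Cause**: With `podManagementPolicy: OrderedReady`, the controller waits for every lower-ordinal pod to be Running and Ready before creating the next one
- **Fix**: Find and repair the unready pod (`kubectl describe pod`) and check for unbound PVCs; PodDisruptionBudgets limit evictions, not pod creation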

#### **DaemonSet Issues**

**Problem: DaemonSet pods not scheduled on all nodes**
- **Cause**: Node selector not matching, taints not tolerated
- **Fix**: Verify node labels (`kubectl get nodes --show-labels`), add tolerations

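**Problem: DaemonSet update stuck**
- **Cause**: Updated pods never become Ready, so the rollout cannot proceed within `maxUnavailable`
- **Fix**: Inspect the failing pods (`kubectl rollout status`, pod logs); increase `maxUnavailable` in the RollingUpdate strategy only if the rollout is merely slow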

**Problem: Init container failing**
- **Cause**: Missing host mount, insufficient permissions
- **Fix**: Check `securityContext`, verify hostPath volumes

#### **Operator Issues**

**Problem: Operator not reconciling**
- **Cause**: RBAC permissions missing, leader election conflict
- **Fix**: Check ServiceAccount permissions, verify only one leader

**Problem: Infinite reconciliation loop**
- **Cause**: Status updates triggering new reconciliations
- **Fix**: Use `metadata.generation` to detect spec changes only
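
The generation check can be sketched on plain dictionaries shaped like Kubernetes objects. It assumes the CRD enables the status subresource, so status writes do not bump `metadata.generation`, and that the controller records the last generation it handled in `status.observedGeneration` (a common convention, not a required field).

```python
def needs_reconcile(resource: dict) -> bool:
    """True when the spec changed since the controller last acted on it."""
    generation = resource["metadata"]["generation"]
    observed = resource.get("status", {}).get("observedGeneration")
    return observed != generation


def mark_reconciled(resource: dict) -> None:
    """Record the generation that was just handled in the status."""
    status = resource.setdefault("status", {})
    status["observedGeneration"] = resource["metadata"]["generation"]


job = {"metadata": {"name": "train-1", "generation": 1}, "spec": {"replicas": 2}}
print(needs_reconcile(job))   # New object: no observedGeneration yet
mark_reconciled(job)
print(needs_reconcile(job))   # Status-only change: nothing to do
job["metadata"]["generation"] = 2  # The API server does this on a spec edit
print(needs_reconcile(job))
```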

**Problem: CRD not found**
- **Cause**: CRD not installed or wrong API version
- **Fix**: Install CRD (`kubectl apply -f crd.yaml`), verify version

#### **CRD Issues**

**Problem: Validation errors on create**
- **Cause**: Schema validation failing
- **Fix**: Check CRD schema, ensure required fields present

**Problem: CRD version conversion failing**
- **Cause**: Conversion webhook not configured
- **Fix**: Deploy conversion webhook, update CRD with webhook config

---

### 📊 **Monitoring and Observability**

#### **1. StatefulSet Metrics**

**Key Metrics:**
- `kube_statefulset_replicas` (desired replicas)
- `kube_statefulset_status_replicas` (replicas that currently exist)
- `kube_statefulset_status_replicas_ready` (ready replicas)
- `kube_statefulset_status_replicas_current` (current version replicas)
- `kube_statefulset_status_replicas_updated` (updated replicas)

**Alerts:**
```yaml
# StatefulSet replicas not ready
- alert: StatefulSetNotReady
  expr: kube_statefulset_status_replicas_ready != kube_statefulset_replicas
  for: 5m
```

#### **2. DaemonSet Metrics**

**Key Metrics:**
- `kube_daemonset_status_desired_number_scheduled` (nodes that should run the pod: all nodes matching the selector and tolerations)
- `kube_daemonset_status_number_ready` (ready pods)
- `kube_daemonset_status_number_misscheduled` (shouldn't run but running)

**Alerts:**
```yaml
# DaemonSet not deployed on all nodes
- alert: DaemonSetNotFullyDeployed
  expr: kube_daemonset_status_number_ready < kube_daemonset_status_desired_number_scheduled
  for: 10m
```

#### **3. Operator Metrics**

**Custom Metrics:**
```python
from prometheus_client import Counter, Gauge

reconciliation_total = Counter('operator_reconciliations_total', 'Total reconciliations')
reconciliation_errors = Counter('operator_reconciliation_errors_total', 'Failed reconciliations')
resources_managed = Gauge('operator_resources_managed', 'Number of resources managed')

def reconcile(resource):
    reconciliation_total.inc()
    try:
        # Reconciliation logic
        pass
    except Exception:
        reconciliation_errors.inc()
        raise  # Re-raise so the caller can retry with backoff
```

**Alerts:**
```yaml
# High reconciliation error rate
- alert: OperatorHighErrorRate
  expr: rate(operator_reconciliation_errors_total[5m]) > 0.1
  for: 5m
```

---

### 🚀 **Production Deployment Checklist**

#### **Pre-Deployment**

- [ ] **Resource requests/limits set** (prevent node overload)
- [ ] **Health checks configured** (liveness, readiness, startup probes)
- [ ] **RBAC configured** (ServiceAccount, Role, RoleBinding)
- [ ] **Network policies defined** (restrict traffic)
- [ ] **PodDisruptionBudget created** (prevent downtime during node maintenance)
- [ ] **Monitoring configured** (Prometheus metrics, Grafana dashboards)
- [ ] **Alerts configured** (PagerDuty, Slack integration)
- [ ] **Backup strategy defined** (for StatefulSets with persistent data)

#### **StatefulSet Specific**

- [ ] **PersistentVolumeClaims configured** (with sufficient storage)
- [ ] **Headless Service created** (required for stable DNS)
- [ ] **Update strategy defined** (RollingUpdate with partition for canary)
- [ ] **Pod management policy set** (OrderedReady vs Parallel)
- [ ] **Persistent data backup tested** (restore from backup verified)

#### **DaemonSet Specific**

- [ ] **Node selector configured** (if not all nodes)
- [ ] **Tolerations configured** (for tainted nodes)
- [ ] **Update strategy defined** (maxUnavailable set appropriately)
- [ ] **Priority class set** (prevent eviction)
- [ ] **Resource limits set** (prevent node resource exhaustion)

#### **Operator Specific**

- [ ] **CRD installed** (before deploying operator)
- [ ] **Leader election enabled** (for multi-replica operators)
- [ ] **Reconciliation interval tuned** (balance responsiveness vs API load)
- [ ] **Error handling tested** (retries, exponential backoff)
- [ ] **Webhook certificates configured** (if using admission/conversion webhooks)
- [ ] **Operator versioning strategy** (for CRD version upgrades)

---

### 🎓 **Learning Path Next Steps**

#### **Beginner → Intermediate**
1. ✅ Complete Notebooks 131-133 (Docker, Kubernetes Fundamentals, Advanced Patterns)
2. 📚 **Next**: Notebook 134 - Service Mesh (Istio, Linkerd for microservices)
3. 📚 Study Kubeflow components (Training Operator, KServe, Katib)
4. 🛠️ Build Project 1 (Distributed PyTorch Training with StatefulSets)

#### **Intermediate → Advanced**
1. 📚 Notebook 135 - GitOps (ArgoCD, Flux for declarative deployments)
2. 📚 Notebook 136 - CI/CD for ML (Tekton, GitHub Actions, ML pipelines)
3. 🛠️ Build Project 3 (ML Training Operator for Auto-Scaling)
4. 🛠️ Build Project 5 (Kubeflow Pipelines End-to-End)

#### **Advanced → Expert**
1. 📚 Contribute to open-source operators (Kubeflow, KServe)
2. 🛠️ Build custom CRDs for domain-specific ML workflows
3. 🛠️ Build Project 6 (Custom GPU Scheduler)
4. 🛠️ Build Project 8 (Production Monitoring Operator)

---

### 📖 **Additional Resources**

#### **Official Documentation**
- [Kubernetes StatefulSets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/)
- [Kubernetes DaemonSets](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/)
- [Kubernetes Operators](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
- [Custom Resource Definitions](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/)

#### **Operator Frameworks**
- [Kubebuilder](https://book.kubebuilder.io/) - Go-based operator framework
- [Operator SDK](https://sdk.operatorframework.io/) - Multi-language operator framework
- [Kopf](https://kopf.readthedocs.io/) - Python-based operator framework
- [KUDO](https://kudo.dev/) - Declarative operator framework

#### **ML Platforms**
- [Kubeflow](https://www.kubeflow.org/) - End-to-end ML platform
- [KServe](https://kserve.github.io/website/) - Model serving (successor to KFServing)
- [Katib](https://www.kubeflow.org/docs/components/katib/) - Hyperparameter tuning
- [Seldon Core](https://www.seldon.io/solutions/open-source-projects/core) - Advanced model serving

#### **Books**
- "Programming Kubernetes" by Michael Hausenblas & Stefan Schimanski
- "Kubernetes Operators" by Jason Dobies & Joshua Wood
- "Kubernetes Patterns" by Bilgin Ibryam & Roland Huß

---

### 💡 **Key Insights for Post-Silicon Validation**

#### **Why Advanced Patterns Matter for Semiconductor Testing**

**StatefulSets for Distributed Wafer Analysis:**
- Stable pod names enable sharding (wafer-0 processes lot A, wafer-1 processes lot B)
- Persistent storage retains intermediate results (survive pod restarts)
- Ordered startup brings shards up one at a time (pod 1 starts only after pod 0 is Running and Ready), which avoids initialization races

**DaemonSets for GPU Driver Management:**
- Every GPU node needs NVIDIA driver 525.xx (consistency critical for inference)
- Auto-deployment on new nodes (scale from 10 → 50 GPU nodes with zero manual work)
- Rolling updates upgrade drivers node by node, so the cluster as a whole stays available

**Operators for STDF Parsing Automation:**
- Data scientists create `STDFParserJob` resources (no Kubernetes expertise needed)
- Operator auto-scales workers based on queue depth (10 wafers → 2 workers, 100 wafers → 8 workers)
- Auto-retry on failure (handle corrupted STDF files gracefully)

**CRDs for ML Workflow Standardization:**
- `YieldPredictorJob` CRD standardizes wafer yield prediction across teams
- `WaferMapAnalysis` CRD encodes spatial analysis best practices
- `BinOptimizationJob` CRD automates binning strategy experiments

---

### 🎯 **When to Use Each Pattern**

| **Pattern** | **Use Case** | **Example** | **Complexity** |
|-------------|--------------|-------------|----------------|
| **StatefulSet** | Stable identities, persistent storage, ordered operations | Distributed training, databases, Redis cluster | ⭐⭐⭐ |
| **DaemonSet** | One pod per node, cluster-wide services | GPU drivers, monitoring, logging | ⭐⭐ |
| **Operator** | Complex lifecycle management, domain-specific automation | ML training automation, backup/restore | ⭐⭐⭐⭐⭐ |
| **CRD** | Custom resources, API extension | TrainingJob, ModelServer, Experiment | ⭐⭐⭐⭐ |
| **Deployment** | Stateless applications, simple scaling | Model serving (stateless), web apps | ⭐ |
| **Job** | One-time tasks, batch processing | STDF parsing, ETL pipelines | ⭐ |
| **CronJob** | Scheduled tasks | Daily model retraining, backup jobs | ⭐⭐ |

---

### ✅ **Final Checklist**

**You've mastered Kubernetes Advanced Patterns if you can:**

- [ ] Explain when to use StatefulSet vs Deployment (and provide 3 examples)
- [ ] Design a DaemonSet with node selectors and tolerations
- [ ] Build a simple operator with reconciliation loop
- [ ] Create a CRD with schema validation and versioning
- [ ] Debug StatefulSet scaling issues (PVC not bound, PDB blocking)
- [ ] Configure monitoring for operators (Prometheus metrics, alerts)
- [ ] Implement distributed training with StatefulSets (PyTorch DDP)
- [ ] Deploy GPU drivers with DaemonSets (NVIDIA driver installation)

**Ready for Production if you can:**

- [ ] Design multi-tenant ML platform with namespace isolation
- [ ] Build custom scheduler for GPU bin packing
- [ ] Implement canary releases for model serving (KServe)
- [ ] Create end-to-end ML pipeline (Kubeflow Pipelines)
- [ ] Troubleshoot operator infinite reconciliation loops
- [ ] Secure operators with RBAC and network policies
- [ ] Monitor cluster health (DaemonSet coverage, StatefulSet readiness)
- [ ] Implement auto-scaling based on custom metrics (queue depth, accuracy)

---

### 🚀 **Congratulations!**

You've completed **Notebook 133: Kubernetes Advanced Patterns for ML**. You now understand:
- ✅ StatefulSets for stable identities and persistent storage
- ✅ DaemonSets for cluster-wide services
- ✅ Operators for automating complex workflows
- ✅ CRDs for extending Kubernetes API

**Next Steps:**
- **Notebook 134**: Service Mesh (Istio, Linkerd) for advanced networking
- **Notebook 135**: GitOps (ArgoCD, Flux) for declarative deployments
- **Notebook 136**: CI/CD for ML (Tekton, GitHub Actions)

**Keep Building! 🎉**

## 🎯 Key Takeaways

### When to Use Kubernetes Advanced Patterns
- **Sidecar pattern**: Add capabilities (logging, monitoring, service mesh) without modifying main container (ML model + Prometheus exporter sidecar)
- **Ambassador pattern**: Proxy connections to external services (database connection pooling, circuit breaking)
- **Adapter pattern**: Standardize outputs from heterogeneous containers (normalize logs from different ML frameworks)
- **Init containers**: Run setup tasks before main container starts (download model artifacts, database migrations)
- **StatefulSets**: Deploy stateful applications requiring stable network IDs and persistent storage (feature stores, vector databases)
- **DaemonSets**: Run one pod per node for node-level tasks (log collection, GPU monitoring on every inference node)

### Limitations
- **Complexity overhead**: Advanced patterns add YAML configuration, debugging difficulty vs. simple deployments
- **Resource consumption**: Sidecars/adapters consume CPU/memory on every pod (2-5% overhead typical)
- **Networking complexity**: Service meshes (Istio/Linkerd) add latency (1-5ms p99) and operational burden
- **Learning curve**: Teams need deep Kubernetes knowledge (pod lifecycles, volumes, networking)

### Alternatives
- **Monolithic containers**: Package all functionality in single container (simpler, but less flexible)
- **VM-based deployments**: Traditional VMs for stateful apps (easier state management, higher resource overhead)
- **Serverless (Lambda/Cloud Run)**: For stateless inference workloads (no Kubernetes needed, vendor lock-in risk)
- **Docker Compose**: Local/dev environments (simpler than K8s, doesn't scale to production)

### Best Practices
- **Resource limits**: Always set CPU/memory requests and limits to prevent pod evictions
- **Health checks**: Implement liveness (restart unhealthy pods) and readiness (traffic routing) probes
- **Rolling updates**: Use the RollingUpdate strategy with maxUnavailable=1, multiple replicas, and readiness probes for zero-downtime deployments
- **Pod disruption budgets**: Ensure minimum availability during node maintenance/upgrades
- **Network policies**: Restrict pod-to-pod traffic for security (ML inference pods can't access training data stores)
- **Horizontal Pod Autoscaling**: Scale based on custom metrics (inference latency p95, GPU utilization) not just CPU
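
Several of the practices above can live in one Deployment plus a PodDisruptionBudget. The sketch below builds both as Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts alongside YAML. The names (`yield-model`, `registry.example.com/yield-model:1.0`, `/healthz`, `/ready`, port 8080) and the resource figures are illustrative assumptions, not values taken from this notebook.

```python
import json


def build_deployment(name, image, replicas=3):
    """Return an apps/v1 Deployment with requests/limits, probes, and rolling updates."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,  # hypothetical image reference
        "ports": [{"containerPort": 8080}],
        # Requests drive scheduling; limits cap usage (values are assumptions).
        "resources": {
            "requests": {"cpu": "500m", "memory": "1Gi"},
            "limits": {"cpu": "1", "memory": "2Gi"},
        },
        # Liveness: restart the container if it stops responding.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": 8080},
            "initialDelaySeconds": 30,
            "periodSeconds": 10,
        },
        # Readiness: only route traffic once the model is loaded.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": 8080},
            "periodSeconds": 5,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 1, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


def build_pdb(name, min_available):
    """Return a policy/v1 PodDisruptionBudget that keeps min_available pods running."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{name}-pdb"},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": name}},
        },
    }


deployment = build_deployment("yield-model", "registry.example.com/yield-model:1.0")
pdb = build_pdb("yield-model", min_available=2)
print(json.dumps(deployment, indent=2))
print(json.dumps(pdb, indent=2))
```

With three replicas, `maxUnavailable: 1` and `minAvailable: 2` agree with each other: neither a rollout nor a node drain should take more than one pod out of service at a time.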

## 📊 Diagnostic Checks Summary

### Implementation Checklist
✅ **Sidecar Pattern**
- Logging sidecar: Fluentd/Filebeat collects logs from a volume shared with the main container
- Monitoring sidecar: Prometheus exporter exposes model metrics (latency, throughput, error rate) for Prometheus to scrape
- Service mesh sidecar: Envoy proxy handles mTLS, retries, circuit breaking
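
As a rough illustration of the logging sidecar above, the sketch below builds a Pod manifest in which the main container and a log-shipping sidecar mount the same `emptyDir` volume. The container names, both image references, and the `/var/log/app` path are assumptions made for the example.

```python
import json


def build_sidecar_pod(name, app_image, sidecar_image, log_dir="/var/log/app"):
    """Return a v1 Pod whose sidecar reads logs the main container writes."""
    volume_name = "app-logs"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # emptyDir lives as long as the pod and is shared by its containers.
            "volumes": [{"name": volume_name, "emptyDir": {}}],
            "containers": [
                {
                    "name": "model-server",
                    "image": app_image,  # hypothetical image
                    "volumeMounts": [{"name": volume_name, "mountPath": log_dir}],
                },
                {
                    "name": "log-shipper",
                    "image": sidecar_image,  # hypothetical Fluentd/Filebeat image
                    # The sidecar only needs read access to the log files.
                    "volumeMounts": [
                        {"name": volume_name, "mountPath": log_dir, "readOnly": True}
                    ],
                },
            ],
        },
    }


sidecar_pod = build_sidecar_pod(
    "yield-model",
    "registry.example.com/yield-model:1.0",
    "registry.example.com/log-shipper:1.0",
)
print(json.dumps(sidecar_pod, indent=2))
```

The main container stays unaware of where logs end up; swapping the shipper means changing only the sidecar image.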

✅ **Ambassador Pattern**
- Database proxy: PgBouncer pools connections, reduces connection overhead
- Circuit breaker: A Hystrix-style breaker stops sending requests to failing downstream services, preventing cascading failures (Hystrix itself is now in maintenance mode)
- Rate limiter: Token bucket limits requests to expensive GPU inference
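
The token bucket mentioned above can be sketched in a few lines. This is a minimal, single-threaded version for illustration; the class name, parameters, and the injectable `clock` argument (used here to make the demo deterministic) are assumptions, and a real ambassador would typically rely on a proxy's built-in rate limiting instead.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def allow(self, cost=1.0):
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Deterministic demo with a fake clock instead of real time.
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=2, clock=lambda: fake_now[0])
results = [bucket.allow() for _ in range(3)]  # burst of three requests at t=0
fake_now[0] = 0.5  # half a second later, one token has been refilled
results.append(bucket.allow())
print(results)  # [True, True, False, True]
```

The third request is rejected because the burst capacity is two; after 0.5 s at two tokens per second, one more request is admitted.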

✅ **Adapter Pattern**
- Log normalizer: Convert framework-specific logs (TensorFlow, PyTorch) to standard JSON format
- Metrics adapter: Transform model-specific metrics to Prometheus format
- API adapter: Convert legacy REST API responses to new GraphQL schema
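
A log-normalizing adapter might look like the sketch below. It assumes two simplified input formats: a TensorFlow-style C++ log line (`<timestamp>: <level letter> <source>] <message>`) and Python's default `logging` format (`LEVEL:logger:message`), which PyTorch code commonly emits. Real logs vary, so the patterns and the output field names here are illustrative only.

```python
import json
import re

# Assumed, simplified formats; adjust the patterns to the logs you actually see.
TF_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+)\] (?P<message>.*)$"
)
PY_PATTERN = re.compile(r"^(?P<level>[A-Z]+):(?P<source>[\w.]+):(?P<message>.*)$")

TF_LEVELS = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def normalize(line):
    """Convert one raw log line into a dict with a common schema."""
    line = line.rstrip("\n")
    match = TF_PATTERN.match(line)
    if match:
        return {
            "timestamp": match["timestamp"],
            "level": TF_LEVELS[match["level"]],
            "source": match["source"],
            "message": match["message"],
        }
    match = PY_PATTERN.match(line)
    if match:
        return {
            "timestamp": None,  # the default logging format carries no timestamp
            "level": match["level"],
            "source": match["source"],
            "message": match["message"],
        }
    # Unknown format: keep the raw text rather than dropping the line.
    return {"timestamp": None, "level": "UNKNOWN", "source": None, "message": line}


samples = [
    "2024-01-15 10:30:00.123456: W tensorflow/core/util.cc:42] slow op",
    "WARNING:torch.distributed:rank 0 timeout",
    "free-form text",
]
for raw in samples:
    print(json.dumps(normalize(raw), sort_keys=True))
```

Downstream collectors then only need to understand one JSON schema, regardless of which framework produced the line.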

✅ **Init Containers**
- Model artifact downloader: Fetch model weights from S3/GCS before inference pod starts
- Database schema migrator: Apply schema updates before app deployment
- Config validator: Check ConfigMaps/Secrets before starting main container
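
The model-downloader case above can be sketched as a Pod whose init container writes into an `emptyDir` that the inference container then mounts read-only. The bucket URI, image references, and `/models` path are hypothetical, and the example assumes the init image ships the AWS CLI.

```python
import json


def build_model_pod(name, model_uri, fetch_image, serve_image, model_dir="/models"):
    """Return a v1 Pod that downloads a model before the server container starts."""
    volume_name = "model-store"
    target = f"{model_dir}/model.bin"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "volumes": [{"name": volume_name, "emptyDir": {}}],
            # Init containers run to completion, in order, before `containers` start.
            "initContainers": [
                {
                    "name": "fetch-model",
                    "image": fetch_image,  # assumed to include the AWS CLI
                    "command": ["aws", "s3", "cp", model_uri, target],
                    "volumeMounts": [{"name": volume_name, "mountPath": model_dir}],
                }
            ],
            "containers": [
                {
                    "name": "inference",
                    "image": serve_image,  # hypothetical image
                    "env": [{"name": "MODEL_PATH", "value": target}],
                    "volumeMounts": [
                        {"name": volume_name, "mountPath": model_dir, "readOnly": True}
                    ],
                }
            ],
        },
    }


model_pod = build_model_pod(
    "yield-inference",
    "s3://example-bucket/models/yield/model.bin",
    "registry.example.com/model-fetcher:1.0",
    "registry.example.com/yield-model:1.0",
)
print(json.dumps(model_pod, indent=2))
```

If the download fails, the init container exits non-zero and the inference container never starts, so the pod cannot serve traffic without a model.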

✅ **StatefulSets**
- Stable network IDs: Pods get predictable names (redis-0, redis-1) for peer discovery
- Persistent volumes: Data survives pod restarts (feature store, vector database)
- Ordered deployment: Pods are created in ordinal order and deleted in reverse (primary-replica database setup)
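
Because StatefulSet pod names and DNS records follow a fixed scheme, peers can compute each other's addresses instead of discovering them. The helper below is a small sketch of that scheme; the function names and the `redis-headless` service and `ml` namespace in the demo are assumptions for illustration.

```python
def statefulset_pod_dns(name, service, namespace, replicas, cluster_domain="cluster.local"):
    """Return the stable DNS name of each pod behind a headless Service."""
    return [
        f"{name}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


def rollout_order(name, replicas):
    """Return (creation order, deletion order) under the default ordered policy."""
    created = [f"{name}-{ordinal}" for ordinal in range(replicas)]
    return created, list(reversed(created))


peers = statefulset_pod_dns("redis", "redis-headless", "ml", replicas=3)
created, deleted = rollout_order("redis", replicas=3)
print(peers[0])  # redis-0.redis-headless.ml.svc.cluster.local
print(created)   # ['redis-0', 'redis-1', 'redis-2']
print(deleted)   # ['redis-2', 'redis-1', 'redis-0']
```

A replica can, for example, treat `peers[0]` as the primary's address without any external service registry.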

✅ **DaemonSets**
- Node monitoring: GPU utilization, temperature tracking on every inference node
- Log collection: Fluentd on every node ships logs to centralized Elasticsearch
- Network monitoring: Packet capture for debugging distributed training
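
For the node-monitoring case above, a DaemonSet restricted to GPU nodes could be sketched as follows. The `gpu: nvidia` node label matches the one used earlier in this notebook; the `nvidia.com/gpu` taint key, the `monitoring` namespace, the image reference, and the resource figures are assumptions that depend on how the cluster is set up.

```python
import json


def build_gpu_monitor_daemonset(name, image, namespace="monitoring"):
    """Return an apps/v1 DaemonSet that runs one monitoring pod per GPU node."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Only nodes carrying this label receive a pod.
                    "nodeSelector": {"gpu": "nvidia"},
                    # Assumed taint on GPU nodes; tolerate it so the pod can schedule.
                    "tolerations": [
                        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
                    ],
                    "containers": [
                        {
                            "name": "gpu-monitor",
                            "image": image,  # hypothetical exporter image
                            "resources": {
                                "requests": {"cpu": "50m", "memory": "64Mi"},
                                "limits": {"cpu": "100m", "memory": "128Mi"},
                            },
                        }
                    ],
                },
            },
        },
    }


daemonset = build_gpu_monitor_daemonset("gpu-monitor", "registry.example.com/gpu-monitor:1.0")
print(json.dumps(daemonset, indent=2))
```

Note that a DaemonSet has no `replicas` field: the pod count follows the number of matching nodes.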

### Quality Metrics
- **Pod startup time**: <30s for inference pods (model download + health check)
- **Resource overhead**: Sidecars consume <10% CPU, <200MB memory per pod
- **Service mesh latency**: p99 <5ms added by Envoy proxy
- **StatefulSet availability**: >99.9% uptime for stateful services (Redis, Postgres)

### Post-Silicon Validation Applications
**1. Sidecar Pattern for ATE Test Data Streaming**
- Main container: Test execution engine (ATE controller)
- Sidecar: Real-time STDF parser + Kafka producer
- Use case: Stream parametric test results to centralized yield database
- Business value: Real-time yield dashboards enable immediate excursion response (root cause identified 2-4 hours faster)

**2. Ambassador Pattern for Test Floor Database Connections**
- Main container: Yield prediction service (ML inference)
- Ambassador: PgBouncer connection pool to wafer test database
- Use case: Reduce database connection overhead from 5000 pods hitting PostgreSQL
- Business value: Database cost reduction 40-60% (fewer connections = smaller RDS instance)

**3. Init Container for Model Artifact Management**
- Init container: Download yield prediction model from S3 (200MB XGBoost model)
- Main container: Inference service starts after model loaded
- Use case: Ensure latest model deployed before accepting inference requests
- Business value: Zero-downtime model updates, rollback in <2min if accuracy drops

### Business ROI Estimation

**Scenario 1: Medium-Scale Kubernetes Cluster (50 nodes, 500 pods)**
- Sidecar logging/monitoring: $1.5M/year observability value (faster debugging)
- Ambassador pattern DB pooling: $400K/year reduced database costs
- Init containers for artifact mgmt: $800K/year faster deployments (10min → 2min)
- **Total ROI: $2.7M/year** (costs: $200K learning + $100K tooling = $300K; net $2.4M)

**Scenario 2: Large-Scale Production Cluster (200 nodes, 3000 pods)**
- Service mesh (Istio): $5M/year improved reliability (circuit breaking, retries)
- StatefulSets for feature stores: $3M/year data persistence guarantees
- DaemonSets for GPU monitoring: $2M/year reduced GPU failures (proactive thermal management)
- **Total ROI: $10M/year** (costs: $1.2M infrastructure + $800K ops = $2M; net $8M)

**Scenario 3: Multi-Cluster Global Deployment (500+ nodes across 3 regions)**
- Advanced patterns across all clusters: $15M/year standardized operations
- Cross-cluster service mesh: $8M/year improved cross-region latency (traffic shaping)
- Disaster recovery with StatefulSets: $12M/year downtime reduction
- **Total ROI: $35M/year** (costs: $5M infrastructure + $3M team = $8M; net $27M)
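
The totals in the three scenarios follow from simple sums, which the short check below reproduces: gross value is the sum of the benefit lines, and net value is gross minus the listed costs. All figures are in millions of dollars per year and are copied from the scenarios above; the dictionary keys and function name are made up for the example.

```python
def net_roi(benefits, costs):
    """Return (gross, total_cost, net) for lists of yearly figures in $M."""
    gross = sum(benefits)
    total_cost = sum(costs)
    return gross, total_cost, gross - total_cost


scenarios = {
    "medium (50 nodes)": {"benefits": [1.5, 0.4, 0.8], "costs": [0.2, 0.1]},
    "large (200 nodes)": {"benefits": [5.0, 3.0, 2.0], "costs": [1.2, 0.8]},
    "multi-cluster (500+ nodes)": {"benefits": [15.0, 8.0, 12.0], "costs": [5.0, 3.0]},
}

for label, figures in scenarios.items():
    gross, total_cost, net = net_roi(figures["benefits"], figures["costs"])
    print(f"{label}: gross ${gross:.1f}M, cost ${total_cost:.1f}M, net ${net:.1f}M")
# medium (50 nodes): gross $2.7M, cost $0.3M, net $2.4M
# large (200 nodes): gross $10.0M, cost $2.0M, net $8.0M
# multi-cluster (500+ nodes): gross $35.0M, cost $8.0M, net $27.0M
```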

---

## 🎓 Mastery Achievement

**You now have production-grade expertise in:**
- ✅ Implementing sidecar, ambassador, and adapter patterns for separation of concerns in Kubernetes
- ✅ Using init containers for pre-deployment setup tasks (model downloads, schema migrations)
- ✅ Deploying StatefulSets with persistent storage for stateful ML applications (feature stores, vector DBs)
- ✅ Running DaemonSets for node-level tasks (GPU monitoring, log collection)
- ✅ Applying K8s patterns to semiconductor test data streaming, database optimization, and model deployment

**Next Steps:**
- **Service Mesh Deep Dive**: Istio/Linkerd for advanced traffic management, observability, security
- **Custom Resource Definitions (CRDs)**: Extend Kubernetes API for ML-specific resources (TFJob, PyTorchJob)
- **Kubernetes Operators**: Automate complex application lifecycle management (database backups, model retraining triggers)