# 048: Model Deployment & Serving

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** Production ML system architecture (training, serving, monitoring)
- **Implement** REST APIs with FastAPI for real-time model serving
- **Build** Docker containers for reproducible deployments
- **Deploy** Models to Kubernetes with auto-scaling and load balancing
- **Monitor** Model performance, data drift, and system health in production

## 📚 What is Model Deployment?

**Model Deployment** is the process of making trained ML models available for inference in production environments. It's the bridge between research (Jupyter notebooks) and real-world impact (serving 1M predictions/day at <100ms latency with 99.99% uptime).

**Production ML Stack:**
```
Training Pipeline → Model Registry → Serving Infrastructure → Monitoring
   (offline)          (versioning)      (online inference)      (alerts)
```

**Why Model Deployment Matters?**
- ✅ **Business Value**: Models only create value when serving predictions (research → revenue)
- ✅ **Scale**: Serve 1K-1M predictions/sec (Intel: 500K dies/day, <10ms per prediction)
- ✅ **Reliability**: 99.99% uptime required (NVIDIA: $100K/hour downtime cost)
- ✅ **Latency**: Real-time decisions (<100ms for user-facing, <10ms for embedded)
- ✅ **Monitoring**: Detect model degradation before business impact

## 🏭 Post-Silicon Validation Use Cases

**1. Real-Time Defect Detection (Intel)**
- **Input**: 512 test parameters per die from test equipment
- **Output**: Pass/fail decision + confidence score in <10ms
- **Value**: Screen 500K dies/day, 95% defect detection, $15M savings (reduced test escapes)

**2. Model Serving Platform (NVIDIA)**
- **Input**: Wafer map images + parametric data for quality prediction
- **Output**: Yield forecast + failure mode classification
- **Value**: Kubernetes deployment with auto-scaling, 100K predictions/day, 99.99% uptime, $8M savings

**3. Edge Inference (AMD)**
- **Input**: Sensor data from test equipment (temperature, power, timing)
- **Output**: Anomaly detection on edge devices (no cloud latency)
- **Value**: <1ms inference on FPGA/TPU, real-time monitoring, $5M savings

**4. Multi-Model Orchestration (Qualcomm)**
- **Input**: Test data requiring 5 different models (yield, bin prediction, outlier detection, time-series forecast, root cause)
- **Output**: Unified API serving all models with intelligent routing
- **Value**: Centralized platform for 50+ models, 200K predictions/day, $12M savings

## 🔄 Model Deployment Workflow

```mermaid
graph LR
    A[Train Model<br/>Jupyter/Python] --> B[Validate Model<br/>Offline Metrics]
    B --> C[Register Model<br/>MLflow/Registry]
    C --> D[Package Model<br/>Docker Container]
    D --> E[Deploy to K8s<br/>Auto-scaling]
    E --> F[Serve Predictions<br/>REST API]
    F --> G[Monitor<br/>Metrics/Alerts]
    G --> H{Performance OK?}
    H -->|No| A
    H -->|Yes| F
    
    style A fill:#e1f5ff
    style C fill:#fff4e1
    style E fill:#e1ffe1
    style G fill:#ffe1e1
```

## 📊 Learning Path Context

**Prerequisites:**
- **010: Linear Regression** - Model training basics
- **034: Neural Networks** - Deep learning models
- **008: System Design** - Scalability, load balancing, microservices
- **009: Git & Version Control** - CI/CD pipelines

**Next Steps:**
- **111: MLOps Fundamentals** - End-to-end ML pipelines
- **131: Cloud Deployment** - AWS SageMaker, GCP Vertex AI, Azure ML
- **151: Advanced MLOps** - Feature stores, experiment tracking, A/B testing

---

Let's deploy production ML systems! 🚀

---

## Part 1: REST API with FastAPI

### Why FastAPI for ML Serving?

**FastAPI** is the modern Python framework for building high-performance ML APIs.

**Advantages:**
- ⚡ **Performance**: Async I/O, ~3× faster than Flask (Intel: 10ms → 3ms latency)
- 📝 **Auto-documentation**: Interactive API docs at `/docs` (Swagger UI)
- ✅ **Type Safety**: Pydantic validation catches errors before inference
- 🔄 **Async Support**: Handle 1000+ concurrent requests (NVIDIA: 10K req/sec)
- 🎯 **Production-ready**: Built-in dependency injection; health checks and monitoring hooks are straightforward to add

**Flask vs FastAPI:**
| Feature | Flask | FastAPI |
|---------|-------|---------|
| **Performance** | Sync (WSGI) | Async (ASGI) 3× faster |
| **Type Validation** | Manual | Automatic (Pydantic) |
| **API Docs** | Manual (Swagger) | Auto-generated |
| **Async** | ❌ (gevent workaround) | ✅ Native |
| **Learning Curve** | Easy | Moderate |

---

### FastAPI Model Serving Architecture

**Intel Defect Detection API:**
```
Client Request (JSON with 512 test params)
    ↓
FastAPI Endpoint (/predict)
    ↓
Input Validation (Pydantic)
    ↓
Preprocessing (normalize, handle missing)
    ↓
Model Inference (loaded from disk/cache)
    ↓
Postprocessing (threshold, confidence)
    ↓
JSON Response (pass/fail, score, latency)
```

**Key Components:**
1. **Pydantic Models**: Define input/output schemas
2. **Model Loading**: Load once at startup (not per request)
3. **Health Check**: `/health` endpoint for K8s liveness/readiness
4. **Monitoring**: Log latency, request count, errors
5. **Error Handling**: Graceful failures with informative messages

---

### Production Considerations

**1. Model Loading Strategy:**
- ❌ **Bad**: Load model on every request (1s overhead)
- ✅ **Good**: Load model at startup, store in memory
- ✅ **Better**: Load on-demand with LRU cache (multi-model serving)
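
The "load on-demand with LRU cache" option can be sketched with the standard library alone. This is a minimal illustration, not a production cache: `ModelCache`, `fake_loader`, and the model IDs are hypothetical names, and a real loader would call something like `joblib.load` or `torch.load`.

```python
from collections import OrderedDict
from typing import Any, Callable


class ModelCache:
    """Keep at most `capacity` models in memory, evicting the least recently used."""

    def __init__(self, loader: Callable[[str], Any], capacity: int = 2):
        self._loader = loader      # hypothetical function: model_id -> loaded model
        self._capacity = capacity
        self._models: "OrderedDict[str, Any]" = OrderedDict()
        self.loads = 0             # counts expensive loads, for illustration only

    def get(self, model_id: str) -> Any:
        if model_id in self._models:
            self._models.move_to_end(model_id)   # mark as most recently used
            return self._models[model_id]
        model = self._loader(model_id)           # slow path: load from disk/registry
        self.loads += 1
        self._models[model_id] = model
        if len(self._models) > self._capacity:
            self._models.popitem(last=False)     # evict the least recently used model
        return model


def fake_loader(model_id: str) -> dict:
    # Stand-in for a real deserialization call
    return {"id": model_id}


cache = ModelCache(fake_loader, capacity=2)
cache.get("yield")
cache.get("binning")
cache.get("yield")      # cache hit, no reload
cache.get("outlier")    # cache is full, so "binning" is evicted
print(cache.loads)      # 3
```

The capacity would normally be chosen from available memory divided by the largest model size, so that eviction happens before the process runs out of RAM.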

**2. Batching:**
- Single prediction: Simple but inefficient (10ms inference + 5ms overhead)
- Dynamic batching: Accumulate requests for 10ms, batch infer (2ms per sample)
- Intel: 10× throughput with dynamic batching
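
Dynamic batching, as described above, might look roughly like the following `asyncio` sketch. `MicroBatcher` is a hypothetical helper, the doubling lambda stands in for a real batched model call, and the 10ms window is taken from the text. Note that the batch function here runs synchronously on the event loop; a real service would probably offload it to a thread or process pool.

```python
import asyncio
from typing import Callable, List, Tuple


class MicroBatcher:
    """Collect requests for up to `window_s` seconds, then run one batched call."""

    def __init__(self, batch_fn: Callable[[List[float]], List[float]],
                 window_s: float = 0.01, max_batch: int = 32):
        self._batch_fn = batch_fn
        self._window_s = window_s
        self._max_batch = max_batch
        self._queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []   # recorded for illustration only

    async def submit(self, x: float) -> float:
        # Each caller gets a future that the worker resolves after the batch runs
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((x, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            items = [await self._queue.get()]        # block until the first request
            deadline = loop.time() + self._window_s
            while len(items) < self._max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self._queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self._batch_fn([x for x, _ in items])   # one model call per batch
            self.batch_sizes.append(len(items))
            for (_, fut), y in zip(items, outputs):
                fut.set_result(y)


async def demo() -> Tuple[List[float], List[int]]:
    batcher = MicroBatcher(lambda xs: [2 * x for x in xs])   # placeholder "model"
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(float(i)) for i in range(5)))
    worker.cancel()
    return list(results), batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results)  # [0.0, 2.0, 4.0, 6.0, 8.0]
```

The trade-off is explicit in the two parameters: a longer window or larger `max_batch` raises throughput but adds up to `window_s` of latency to every request.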

**3. Async vs Sync:**
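- CPU-bound inference: Sync is fine, but declare the endpoint with plain `def` so FastAPI runs it in a thread pool; blocking work inside `async def` stalls the event loop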
- I/O-bound (DB lookup, feature store): Use async (don't block)
- NVIDIA: Async feature fetching while model loads

**4. Resource Management:**
- **CPU**: One worker per core (Intel: 32 cores → 32 workers)
- **GPU**: One model per GPU, batch requests (NVIDIA: RTX 4090, batch=32)
- **Memory**: Monitor model size + request buffers (AMD: 8GB model + 2GB buffer)

---

### Performance Targets

**Latency (P99):**
- User-facing: <100ms (recommendation systems)
- Internal tools: <500ms (batch processing acceptable)
- Real-time: <10ms (Intel wafer test, AMD edge devices)
- Embedded: <1ms (FPGA/TPU accelerators)

**Throughput:**
- Small scale: 10-100 req/sec (single instance)
- Medium scale: 1K-10K req/sec (horizontal scaling)
- Large scale: 100K+ req/sec (NVIDIA: GPU batching + load balancer)

**Availability:**
- 99.9% (8.76 hours downtime/year) - Acceptable for internal tools
- 99.99% (52 minutes downtime/year) - Production user-facing
- 99.999% (5 minutes downtime/year) - Critical systems (Intel fab operations)
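
The downtime figures above follow directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.1f} min/year ({minutes / 60:.2f} hours)")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why the jump from 99.99% to 99.999% usually requires redundancy across zones rather than just more pods.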

### 📝 What's Happening in This Code?

**Purpose:** Build production-ready FastAPI service for Intel defect detection model

**Key Points:**
- **Pydantic Models**: `TestData` validates 512 input parameters, `PredictionResponse` structures output
- **Startup Event**: Load ML model once at startup (not per request for performance)
- **Predict Endpoint**: Validates input → preprocess → model inference → postprocess → JSON response
- **Health Check**: `/health` for Kubernetes liveness/readiness probes

**Intel Application**: Test equipment sends 512 parametric measurements via HTTP POST to `/predict`. API returns pass/fail decision + confidence in <10ms. Handles 500K requests/day with 99.99% uptime.

**Why This Matters:** FastAPI's async architecture + type safety enables high-throughput, reliable ML serving. $15M savings from catching defects in real-time during wafer test.

In [None]:
# FastAPI Model Serving Example
# Run with: uvicorn main:app --reload --host 0.0.0.0 --port 8000

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field, validator
from typing import List, Dict, Optional
import numpy as np
import time
from datetime import datetime
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Pydantic models for request/response validation
class TestData(BaseModel):
    """Input schema for die test parameters"""
    die_id: str = Field(..., description="Unique die identifier")
    test_params: List[float] = Field(..., min_items=512, max_items=512, 
                                      description="512 parametric test measurements")
    
    @validator('test_params')
    def validate_params(cls, v):
        # Check for NaN or infinite values
        if any(np.isnan(v)) or any(np.isinf(v)):
            raise ValueError("Test parameters contain NaN or infinite values")
        return v
    
    class Config:
        schema_extra = {
            "example": {
                "die_id": "wafer123_die456",
                "test_params": [0.5] * 512  # Simplified example
            }
        }

class PredictionResponse(BaseModel):
    """Output schema for defect prediction"""
    die_id: str
    prediction: str  # "pass" or "fail"
    confidence: float = Field(..., ge=0.0, le=1.0)
    anomaly_score: float
    inference_time_ms: float
    timestamp: str
    model_version: str

class HealthResponse(BaseModel):
    """Health check response"""
    status: str
    model_loaded: bool
    uptime_seconds: float
    requests_served: int

# Initialize FastAPI app
app = FastAPI(
    title="Intel Die Defect Detection API",
    description="Real-time defect detection for semiconductor wafer test",
    version="1.0.0"
)

# Global state
model = None
model_version = "v1.2.3"
start_time = time.time()
request_count = 0

# Simple mock model for demonstration
class MockDefectDetector:
    """Placeholder for actual trained model (sklearn, PyTorch, etc.)"""
    
    def __init__(self):
        self.threshold = 0.05
        # Reference profile per parameter; a trained model would learn this from data
        self.mean = np.random.randn(512) * 0.1 + 0.5
    
    def predict(self, X: np.ndarray) -> Dict:
        """Compute anomaly score (reconstruction error)"""
        # Simulate autoencoder reconstruction error as the mean squared deviation
        # from the reference profile, which keeps scores on the scale of the threshold
        anomaly_score = np.mean((X - self.mean) ** 2)
        
        prediction = "fail" if anomaly_score > self.threshold else "pass"
        confidence = 1.0 - min(anomaly_score / (self.threshold * 2), 1.0)
        
        return {
            "prediction": prediction,
            "confidence": float(confidence),
            "anomaly_score": float(anomaly_score)
        }

@app.on_event("startup")
async def load_model():
    """Load model at startup (once, not per request)"""
    global model
    logger.info("Loading defect detection model...")
    
    # In production: load from model registry (MLflow, S3, etc.)
    # model = joblib.load("model.pkl")
    # or: model = torch.load("model.pt")
    
    model = MockDefectDetector()
    logger.info(f"Model loaded successfully - version {model_version}")

@app.get("/", tags=["Root"])
async def root():
    """Root endpoint"""
    return {
        "message": "Intel Die Defect Detection API",
        "version": model_version,
        "docs": "/docs",
        "health": "/health"
    }

@app.get("/health", response_model=HealthResponse, tags=["Health"])
async def health_check():
    """Health check endpoint for Kubernetes liveness/readiness probes"""
    # Probes look only at the HTTP status code, so report "not ready" as 503
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {
        "status": "healthy",
        "model_loaded": True,
        "uptime_seconds": time.time() - start_time,
        "requests_served": request_count
    }

@app.post("/predict", response_model=PredictionResponse, tags=["Prediction"])
async def predict(data: TestData):
    """
    Predict die defect status from test parameters
    
    - **die_id**: Unique identifier for the die
    - **test_params**: 512 parametric measurements (voltage, current, timing, etc.)
    
    Returns pass/fail prediction with confidence and anomaly score
    """
    global request_count
    request_count += 1
    
    # Check if model is loaded
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    
    # Start timer
    start = time.time()
    
    try:
        # Convert to numpy array
        X = np.array(data.test_params).reshape(1, -1)
        
        # Model inference
        result = model.predict(X)
        
        # Calculate inference time
        inference_time = (time.time() - start) * 1000  # Convert to ms
        
        # Build response
        response = PredictionResponse(
            die_id=data.die_id,
            prediction=result["prediction"],
            confidence=result["confidence"],
            anomaly_score=result["anomaly_score"],
            inference_time_ms=round(inference_time, 2),
            timestamp=datetime.now().isoformat(),
            model_version=model_version
        )
        
        # Log prediction
        logger.info(f"Predicted {data.die_id}: {result['prediction']} "
                   f"(confidence={result['confidence']:.3f}, latency={inference_time:.2f}ms)")
        
        return response
    
    except Exception as e:
        logger.error(f"Prediction failed for {data.die_id}: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Prediction error: {str(e)}")

@app.post("/predict/batch", tags=["Prediction"])
async def predict_batch(data: List[TestData]):
    """
    Batch prediction for multiple dies (one HTTP round trip for many samples;
    each sample is still scored individually rather than in a vectorized call)
    """
    results = []
    for sample in data:
        result = await predict(sample)
        results.append(result)
    return {"predictions": results, "count": len(results)}

# Demonstration: Simulate API usage
if __name__ == "__main__":
    print("=" * 70)
    print("FASTAPI MODEL SERVING DEMONSTRATION")
    print("=" * 70)
    
    # Simulate model loading
    print("\n🔄 Loading model...")
    model = MockDefectDetector()
    print("✅ Model loaded successfully")
    
    # Simulate predictions
    print("\n📊 Simulating predictions:")
    
    # Normal die
    normal_die = {
        "die_id": "wafer001_die123",
        "test_params": (np.random.randn(512) * 0.1 + 0.5).tolist()
    }
    X_normal = np.array(normal_die["test_params"]).reshape(1, -1)
    result_normal = model.predict(X_normal)
    print(f"  Normal die: {result_normal['prediction']} (score={result_normal['anomaly_score']:.4f})")
    
    # Defective die (anomalous pattern)
    defective_die = {
        "die_id": "wafer001_die456",
        "test_params": (np.random.randn(512) * 0.5 + 0.8).tolist()
    }
    X_defective = np.array(defective_die["test_params"]).reshape(1, -1)
    result_defective = model.predict(X_defective)
    print(f"  Defective die: {result_defective['prediction']} (score={result_defective['anomaly_score']:.4f})")
    
    print("\n📡 API Ready:")
    print("  POST /predict - Single prediction")
    print("  POST /predict/batch - Batch prediction")
    print("  GET /health - Health check")
    print("  GET /docs - Interactive API documentation")
    
    print("\n🚀 To run the API server:")
    print("  uvicorn main:app --reload --host 0.0.0.0 --port 8000")
    print("  Then visit: http://localhost:8000/docs")
    
    print("\n✅ Intel Production Stats:")
    print("  Throughput: 500K predictions/day (5.8 req/sec)")
    print("  Latency: <10ms P99 (target: <10ms)")
    print("  Uptime: 99.99% (52 minutes downtime/year)")
    print("  Business Value: $15M annual savings")
    
    print("=" * 70)

---

## Part 2: Docker Containerization

### Why Docker for ML Models?

**Docker** packages your model + dependencies + code into a portable container that runs identically anywhere.

**Benefits:**
- ✅ **Reproducibility**: Works on dev laptop = works in production (no "works on my machine")
- ✅ **Isolation**: Dependencies don't conflict (TensorFlow 2.x + PyTorch 1.x in separate containers)
- ✅ **Portability**: Deploy to AWS, GCP, Azure, on-prem without changes
- ✅ **Versioning**: Tag images (`intel-defect-v1.2.3`), rollback in seconds
- ✅ **Scaling**: Kubernetes orchestrates thousands of containers

---

### Dockerfile Best Practices

**NVIDIA Model Serving Dockerfile:**

```dockerfile
# Slim base image (single stage shown; a multi-stage build would keep build-essential out of the final image)
FROM python:3.10-slim AS base

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user for security
RUN useradd -m -u 1000 mluser

# Set working directory
WORKDIR /app

# Copy requirements first (Docker layer caching)
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY app/ ./app/
COPY models/ ./models/

# Change ownership to non-root user
RUN chown -R mluser:mluser /app

# Switch to non-root user
USER mluser

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

# Run application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```

**Key Practices:**
1. **Multi-stage builds**: Separate build dependencies from runtime (smaller image)
2. **Layer caching**: Copy requirements.txt before code (faster rebuilds)
3. **Non-root user**: Security best practice (mluser, not root)
4. **Health check**: Docker knows if container is healthy
5. **.dockerignore**: Exclude .git, __pycache__, *.ipynb (smaller context)

---

### Docker Commands Quick Reference

```bash
# Build image
docker build -t intel-defect-api:v1.2.3 .

# Run container locally
docker run -d -p 8000:8000 --name defect-api intel-defect-api:v1.2.3

# View logs
docker logs -f defect-api

# Execute command in container
docker exec -it defect-api bash

# Stop and remove
docker stop defect-api && docker rm defect-api

# Push to registry
docker tag intel-defect-api:v1.2.3 registry.intel.com/ml/defect-api:v1.2.3
docker push registry.intel.com/ml/defect-api:v1.2.3

# Pull from registry
docker pull registry.intel.com/ml/defect-api:v1.2.3
```

---

### Image Optimization

**Before Optimization (NVIDIA):**
```
Image size: 2.5GB
Build time: 10 minutes
Layers: 45
```

**Optimization Strategies:**
1. **Use slim base images**: `python:3.10-slim` (200MB) vs `python:3.10` (1GB)
2. **Multi-stage builds**: Discard build tools in final image
3. **Combine RUN commands**: Each RUN creates a layer
4. **Remove cache**: `pip install --no-cache-dir`, `apt-get clean`
5. **Minimize layers**: Combine related operations

**After Optimization:**
```
Image size: 800MB (68% reduction)
Build time: 3 minutes (70% faster)
Layers: 12 (73% fewer)
```

**NVIDIA Result:** Faster deployments (3 min vs 10 min), lower storage cost ($1K/month → $320/month for 500 images).

---

### AMD Edge Deployment

**Challenge:** Deploy model to test equipment with limited resources (4GB RAM, ARM CPU, no GPU).

**Solution:** Optimize Docker image for edge devices.

**Optimizations:**
1. **Quantize model**: FP32 → INT8 (4× smaller, 3× faster on ARM)
2. **Model pruning**: Remove 50% of weights (minimal accuracy loss)
3. **ARM-specific base image**: `arm64v8/python:3.10-slim`
4. **ONNX Runtime**: 5× faster inference than PyTorch on CPU
5. **Distillation**: Teacher model (large) → Student model (small)

**Results:**
- Model size: 200MB → 12MB (94% reduction)
- Inference: 50ms → 0.8ms (62× faster)
- Memory: 2GB → 150MB (93% reduction)
- Fits on edge device with <1ms latency
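
To make the "FP32 → INT8, 4× smaller" claim concrete, here is a toy sketch of symmetric per-tensor quantization in pure Python. It is an illustration under simplifying assumptions (one scale for the whole tensor, no calibration data); production toolchains such as ONNX Runtime's quantization tools do considerably more. The function names and sample weights are made up for the example.

```python
from array import array
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[array, float]:
    """Symmetric per-tensor quantization: w ~= scale * q with q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0   # avoid a zero scale
    scale = max_abs / 127
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q: array, scale: float) -> List[float]:
    """Map INT8 codes back to approximate float weights."""
    return [scale * v for v in q]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]   # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = array("f", weights).itemsize * len(weights)   # 4 bytes per weight
int8_bytes = q.itemsize * len(q)                           # 1 byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"size ratio: {fp32_bytes // int8_bytes}x, max rounding error: {max_error:.5f}")
```

The rounding error per weight is bounded by half the scale, which is why quantization tends to cost little accuracy when the weight range is narrow. The larger size reduction reported above (200MB → 12MB) also relies on pruning and distillation.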

---

## Part 3: Kubernetes Deployment

### Why Kubernetes for ML Serving?

**Kubernetes (K8s)** is the container orchestration platform for production ML systems.

**Key Features:**
- ⚡ **Auto-scaling**: Scale from 2 to 100 pods based on CPU/memory/custom metrics
- 🔄 **Load Balancing**: Distribute requests across pods automatically
- 💚 **Self-healing**: Restart failed pods, replace unhealthy instances
- 🚀 **Rolling Updates**: Zero-downtime deployments (gradually replace old pods)
- 📊 **Resource Management**: CPU/memory requests & limits per pod
- 🔐 **Secrets Management**: Securely store API keys, credentials

---

### Kubernetes Architecture for ML

**NVIDIA Model Serving on K8s:**
```
                          Ingress (NGINX)
                          Load Balancer
                                 |
                    ┌────────────┼────────────┐
                    ↓            ↓            ↓
            Service (ClusterIP)
                    |
    ┌───────────────┼───────────────┐
    ↓               ↓               ↓
  Pod 1           Pod 2           Pod 3
  (API + Model)   (API + Model)   (API + Model)
  2 CPU, 4GB      2 CPU, 4GB      2 CPU, 4GB
  
Horizontal Pod Autoscaler (HPA)
Scale 2-20 pods based on CPU >70%
```

**Components:**
1. **Deployment**: Defines desired state (3 replicas, resource limits)
2. **Service**: Stable endpoint for pods (load balances requests)
3. **Ingress**: External access via HTTPS with TLS
4. **HPA**: Auto-scaling based on metrics
5. **ConfigMap**: Configuration (model paths, thresholds)
6. **Secret**: Credentials (model registry, database)

---

### Kubernetes Manifests

**Intel Defect Detection Deployment:**

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: defect-detection
  namespace: ml-models
spec:
  replicas: 3  # Start with 3 pods
  selector:
    matchLabels:
      app: defect-detection
  template:
    metadata:
      labels:
        app: defect-detection
        version: v1.2.3
    spec:
      containers:
      - name: api
        image: registry.intel.com/ml/defect-api:v1.2.3
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"  # 1 CPU
          limits:
            memory: "4Gi"
            cpu: "2000m"  # 2 CPUs
        env:
        - name: MODEL_PATH
          value: "/models/defect_v1.2.3.pkl"
        - name: THRESHOLD
          value: "0.05"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: defect-detection-svc
  namespace: ml-models
spec:
  selector:
    app: defect-detection
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: ClusterIP
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: defect-detection-hpa
  namespace: ml-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: defect-detection
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
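  # Custom per-pod metric: requires a metrics adapter (e.g. Prometheus Adapter) to expose it to the HPA
  - type: Pods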
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
```

**Deployment Commands:**
```bash
# Create the namespace once, then apply manifests
kubectl create namespace ml-models
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f hpa.yaml

# Check status
kubectl get pods -n ml-models
kubectl get svc -n ml-models
kubectl get hpa -n ml-models

# View logs
kubectl logs -f deployment/defect-detection -n ml-models

# Scale manually
kubectl scale deployment defect-detection --replicas=10 -n ml-models

# Rolling update (zero downtime)
kubectl set image deployment/defect-detection \
  api=registry.intel.com/ml/defect-api:v1.3.0 -n ml-models

# Rollback
kubectl rollout undo deployment/defect-detection -n ml-models
```

---

### Auto-Scaling Strategies

**1. CPU-based (Simple):**
- Scale when CPU >70% for 30 seconds
- Intel: 3 pods → 8 pods during peak hours (8am-6pm)

**2. Memory-based:**
- Scale when memory >80%
- NVIDIA: Large models require memory management

**3. Custom Metrics (Advanced):**
- Request count: >1000 req/sec → scale up
- Latency: P99 >50ms → scale up
- Queue depth: >100 requests queued → scale up
- Qualcomm: Custom Prometheus metrics for queue depth

**4. Scheduled Scaling:**
- Predictable load patterns
- Scale up at 7am (before production shift)
- Scale down at 7pm (after hours)
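
For the CPU-based and custom-metric strategies, the Horizontal Pod Autoscaler's core calculation is `ceil(currentReplicas × currentMetric / targetMetric)`, skipped when the ratio is within a tolerance (0.1 by default). The sketch below reproduces only that core rule with the 3-20 replica bounds from the manifest above; the real controller also applies stabilization windows and readiness handling that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 3, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale proportionally to metric/target, within bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas              # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, current_metric=140, target_metric=70))   # 6
print(desired_replicas(3, current_metric=72, target_metric=70))    # 3 (within tolerance)
print(desired_replicas(10, current_metric=20, target_metric=70))   # 3 (clamped to minimum)
```

The same function applies to any averaged metric, for example requests per second per pod against the `averageValue: "1000"` target.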

---

### Qualcomm Multi-Model Serving

**Challenge:** Serve 50 different models (yield, binning, outlier, forecast, etc.) efficiently.

**Solution:** Multi-model deployment with intelligent routing.

**Architecture:**
```
API Gateway (single endpoint)
    ↓
Routing Logic (based on model_id in request)
    ↓
┌──────────┬──────────┬──────────┬──────────┐
↓          ↓          ↓          ↓          ↓
Yield      Bin        Outlier    Forecast   RCA
Model      Model      Model      Model      Model
(10 pods)  (5 pods)   (3 pods)   (8 pods)   (2 pods)
```

**Benefits:**
- Resource optimization: Allocate pods based on usage
- Fault isolation: One model fails, others continue
- Independent scaling: Scale yield model without touching others
- A/B testing: Route 10% traffic to new model version

**Results:**
- 50 models serving 200K predictions/day
- 99.99% uptime (roughly 4 minutes downtime/month)
- $12M savings (centralized platform, efficient resource usage)
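
The "route 10% of traffic to a new model version" idea can be sketched with a deterministic hash, so the same die or request key always reaches the same variant. This is one possible approach, not necessarily how the platform described above does it; `pick_variant` and the key format are assumptions for illustration.

```python
import hashlib


def pick_variant(request_key: str, candidate_percent: int = 10) -> str:
    """Deterministically route a fixed share of keys to the candidate model."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in [0, 99]
    return "candidate" if bucket < candidate_percent else "stable"


# Hypothetical die identifiers in the style used earlier in this notebook
keys = [f"wafer{w:03d}_die{d:03d}" for w in range(20) for d in range(100)]
share = sum(pick_variant(k) == "candidate" for k in keys) / len(keys)
print(f"candidate share: {share:.1%}")
```

Hashing rather than random sampling makes the split reproducible, which simplifies debugging and keeps per-die comparisons consistent across retries.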

---

## Part 4: Monitoring & Observability

### Why Monitor ML Models in Production?

**Models degrade over time** due to data drift, concept drift, and system changes. Monitoring catches problems before they impact business.

**What to Monitor:**
1. **System Metrics**: Latency, throughput, error rate, CPU/memory
2. **Model Metrics**: Accuracy, precision, recall, F1 (requires labels)
3. **Data Drift**: Input distribution changes over time
4. **Prediction Drift**: Output distribution changes
5. **Business Metrics**: Revenue impact, user engagement

---

### Three Pillars of Observability

**1. Metrics (Quantitative):**
- Time-series data (latency, requests/sec, accuracy)
- Aggregated: mean, P50, P95, P99
- Tools: Prometheus, Grafana, CloudWatch

**2. Logs (Qualitative):**
- Structured events (prediction logs, errors, warnings)
- Searchable, filterable
- Tools: ELK stack (Elasticsearch, Logstash, Kibana), Splunk

**3. Traces (Causal):**
- Request flow through distributed system
- Identify bottlenecks (DB query slow? Model inference slow?)
- Tools: Jaeger, Zipkin, AWS X-Ray
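
As a small illustration of why the metrics pillar tracks percentiles and not just the mean, the sketch below computes P50/P95/P99 from synthetic latencies. The latency distribution is invented for the example (mostly fast requests with a 1% slow tail).

```python
import random
import statistics

random.seed(0)
# Synthetic latencies in milliseconds: 990 fast requests plus a slow tail of 10
latencies_ms = [random.gauss(5.0, 1.0) for _ in range(990)] + \
               [random.uniform(20.0, 40.0) for _ in range(10)]

cuts = statistics.quantiles(latencies_ms, n=100)   # 99 cut points: P1 .. P99
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
mean = statistics.fmean(latencies_ms)

print(f"mean={mean:.1f}ms p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```

The mean barely moves when 1% of requests are slow, while P99 exposes the tail directly, which is why latency targets in this notebook are stated as P99.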

---

### Prometheus + Grafana Stack

**Intel Monitoring Architecture:**
```
FastAPI (expose /metrics)
    ↓
Prometheus (scrape metrics every 15s)
    ↓
Grafana (visualize dashboards)
    ↓
AlertManager (send alerts to Slack/PagerDuty)
```

**Key Metrics to Track:**
```python
from prometheus_client import Counter, Histogram, Gauge

# Request counters
predictions_total = Counter(
    'predictions_total', 
    'Total predictions',
    ['model_version', 'prediction']
)

# Latency histogram
prediction_latency = Histogram(
    'prediction_latency_seconds',
    'Prediction latency',
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
)

# Model accuracy (when labels arrive)
model_accuracy = Gauge(
    'model_accuracy',
    'Model accuracy over last 1000 predictions'
)

# Anomaly score distribution
anomaly_score = Histogram(
    'anomaly_score',
    'Anomaly scores',
    buckets=[0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
)
```

**Intel Dashboard:**
- Requests/sec: 5.8 (500K/day avg)
- P99 latency: 8.3ms (target: <10ms)
- Error rate: 0.02% (target: <0.1%)
- Accuracy: 95.2% (baseline: 92%)

---

### Data Drift Detection

**Problem:** Training data (2023) != Production data (2024). Model degrades silently.

**AMD Sensor Drift Example:**
- **Training**: Temperature sensors calibrated, range [20°C, 80°C]
- **Production (6 months later)**: Sensors drift, range [22°C, 85°C]
- **Impact**: Model accuracy 92% → 87% (5 percentage point drop)

**Detection Methods:**

**1. Statistical Tests:**
- **Kolmogorov-Smirnov test**: Compare distributions (p-value <0.05 → drift)
- **Population Stability Index (PSI)**: PSI >0.1 → moderate drift, >0.25 → severe drift

**2. Domain Classifier:**
- Train binary classifier: Training data (class 0) vs Production data (class 1)
- Random performance (50% accuracy) → no drift
- High accuracy (>70%) → significant drift

**3. Feature-wise Monitoring:**
- Track mean, std, min, max, percentiles for each feature
- Alert if >2 std deviations from training statistics
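
Feature-wise monitoring can start as simply as comparing production means against training statistics. The sketch below flags features whose production mean sits more than two training standard deviations from the training mean. The feature names and values are invented, and a real check would likely also track spread and percentiles.

```python
import statistics
from typing import Dict, List


def feature_shift_alerts(train: Dict[str, List[float]],
                         prod: Dict[str, List[float]],
                         max_sigma: float = 2.0) -> Dict[str, float]:
    """Return features whose production mean is > max_sigma training std devs away."""
    alerts = {}
    for name, train_values in train.items():
        mu = statistics.fmean(train_values)
        sigma = statistics.stdev(train_values)
        if sigma == 0:
            continue                       # constant feature: z-score undefined
        z = abs(statistics.fmean(prod[name]) - mu) / sigma
        if z > max_sigma:
            alerts[name] = round(z, 2)
    return alerts


# Illustrative data: temperature has drifted upward, power has not
train = {"temperature_c": [48, 50, 52, 49, 51, 50],
         "power_w": [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]}
prod = {"temperature_c": [55, 56, 54, 57, 55, 56],
        "power_w": [10.0, 10.1, 9.9, 10.0, 10.1, 9.9]}

alerts = feature_shift_alerts(train, prod)
print(alerts)  # {'temperature_c': 3.89}
```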

**NVIDIA Implementation:**
```python
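import logging

import numpy as np

logger = logging.getLogger(__name__)

def compute_psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline."""
    # Derive bin edges from the baseline and reuse them for both samples
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Clip production values so out-of-range points land in the outermost bins
    actual = np.clip(actual, edges[0], edges[-1])
    expected_percents = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_percents = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor both proportions so empty bins cannot produce log(0) or division by zero
    expected_percents = np.clip(expected_percents, eps, None)
    actual_percents = np.clip(actual_percents, eps, None)
    return float(np.sum((actual_percents - expected_percents)
                        * np.log(actual_percents / expected_percents)))

# Monitor daily; X_train and X_prod_today are (n_samples, 512) arrays supplied by the caller
def check_drift(X_train, X_prod_today, threshold=0.25):
    for feature_idx in range(X_train.shape[1]):
        psi = compute_psi(X_train[:, feature_idx], X_prod_today[:, feature_idx])
        if psi > threshold:
            logger.warning(f"Severe drift detected in feature {feature_idx}: PSI={psi:.3f}")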
```

**NVIDIA Results:**
- Detected drift 2 weeks before accuracy drop
- Retrained model proactively
- Maintained 99.5% accuracy (no degradation)

---

### Alert Strategy

**Intel Alerting Rules:**

**Critical (PagerDuty - immediate response):**
- API down (health check fails for 2 minutes)
- Error rate >1% for 5 minutes
- P99 latency >50ms for 5 minutes
- Model accuracy <85% (7 points below the 92% baseline)

**Warning (Slack - investigate within 4 hours):**
- Error rate >0.1% for 15 minutes
- P99 latency >20ms for 15 minutes
- Request rate 2× above normal
- Data drift PSI >0.25 for any feature

**Info (Email - review daily):**
- Model accuracy <90%
- Request rate drops >50%
- New error types appear
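
Rules of the form "above X for N minutes" can be evaluated over a window of per-minute samples. The following is a simplified sketch of the two error-rate rules above; in practice this logic would normally live in Prometheus alerting rules, not application code, and the function and constant names here are illustrative.

```python
from typing import List, Optional, Tuple

# (severity, threshold, minutes) for error rate, taken from the rules above
ERROR_RATE_RULES: List[Tuple[str, float, int]] = [
    ("critical", 0.01, 5),    # >1% for 5 minutes
    ("warning", 0.001, 15),   # >0.1% for 15 minutes
]


def error_rate_severity(per_minute_rates: List[float]) -> Optional[str]:
    """Return the highest matching severity; samples are one per minute, newest last."""
    for severity, threshold, minutes in ERROR_RATE_RULES:
        window = per_minute_rates[-minutes:]
        # Fire only if the threshold was exceeded for the entire window
        if len(window) == minutes and all(r > threshold for r in window):
            return severity
    return None


print(error_rate_severity([0.0002] * 10 + [0.02] * 5))   # critical
print(error_rate_severity([0.002] * 15))                 # warning
print(error_rate_severity([0.002] * 14 + [0.0]))         # None
```

Requiring the condition to hold for the whole window is what keeps a single noisy minute from paging someone.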

**Qualcomm Alert Response:**
1. **Investigate**: Check Grafana dashboard, read logs
2. **Triage**: Determine root cause (data drift? system issue? model bug?)
3. **Mitigate**: Rollback to previous version, scale up resources, or retrain
4. **Post-mortem**: Document incident, update runbooks, improve monitoring

---

### Model Performance Tracking

**Challenges:**
- Ground truth labels arrive late (Intel: die pass/fail known after final test, 2 weeks later)
- Can't wait 2 weeks to detect model degradation

**Solutions:**

**1. Proxy Metrics (Real-time):**
- Confidence distribution (sudden drop → model uncertain)
- Anomaly score distribution (shift → input pattern change)
- Prediction distribution (more failures than usual?)

**2. Sampling + Human Labeling:**
- Sample 1% of predictions for immediate expert review
- Intel: 50 dies/day reviewed by engineer (detect issues in 1 day, not 2 weeks)

**3. A/B Testing:**
- Route 10% traffic to new model (candidate)
- Compare metrics: latency, confidence, anomaly scores
- If candidate better, promote to 100%

**4. Shadow Deployment:**
- New model runs in parallel, doesn't affect production
- Compare predictions: if >5% disagreement, investigate
- Safe way to validate new models

**NVIDIA Shadow Deployment:**
- Deployed model v2.0 in shadow mode
- Discovered 8% prediction disagreement with v1.5
- Investigation: v2.0 overfitted to recent data
- Decision: Keep v1.5 in production, retrain v2.0 with more diverse data
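
The disagreement check used in shadow deployment reduces to comparing two aligned prediction lists. A minimal sketch, with invented predictions chosen to reproduce an 8% disagreement and the 5% investigation threshold mentioned above:

```python
from typing import List


def disagreement_rate(primary: List[str], shadow: List[str]) -> float:
    """Fraction of requests where the shadow model disagrees with production."""
    if len(primary) != len(shadow):
        raise ValueError("prediction lists must be aligned request-by-request")
    if not primary:
        return 0.0
    return sum(p != s for p, s in zip(primary, shadow)) / len(primary)


# Illustrative predictions for the same 100 dies from both model versions
primary = ["pass"] * 92 + ["fail"] * 8
shadow = ["pass"] * 84 + ["fail"] * 16

rate = disagreement_rate(primary, shadow)
print(f"{rate:.0%}", "investigate" if rate > 0.05 else "ok")  # 8% investigate
```

Disagreement alone does not say which model is right, so the flagged dies still need labels or expert review before a promotion decision.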

---

## Part 5: Real-World Projects

### Post-Silicon Validation Projects

**1. End-to-End ML Platform (Intel)**
- **Objective**: Production platform for 20+ ML models serving 1M predictions/day
- **Architecture**:
  - **Training Pipeline**: Airflow DAG (data prep → train → validate → register)
  - **Model Registry**: MLflow (version control, stage transitions, lineage)
  - **Serving**: Kubernetes (3-20 pods per model, auto-scaling)
  - **API Gateway**: NGINX Ingress with rate limiting, authentication
  - **Monitoring**: Prometheus + Grafana + AlertManager
  - **Logging**: ELK stack (Elasticsearch, Logstash, Kibana)
  - **CI/CD**: GitHub Actions (test → build Docker → deploy to staging → canary → production)
- **Key Features**:
  - Multi-model serving with intelligent routing
  - A/B testing framework (10-90 split, gradual rollout)
  - Shadow deployment for safe validation
  - Automated retraining on data drift (weekly schedule + on-demand)
  - Feature store (Feast) for training/serving consistency
- **Success Metrics**:
  - 20 models deployed, 1M predictions/day
  - 99.99% uptime (about 4.3 minutes downtime per 30-day month)
  - <10ms P99 latency (meets the <10ms target)
  - Zero manual deployments (fully automated CI/CD)
  - Detect data drift 2 weeks early (proactive retraining)
- **Business Value**: $25M annually (20 models × $1-2M each, automated operations, early drift detection)
- **Implementation**: 12 months (platform design, infrastructure setup, migrate 20 models, train 50 engineers)
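
Uptime targets translate directly into a downtime budget, which is worth computing before committing to an SLA. The helper below is a small worked example; the function name and the 30-day month are assumptions.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Downtime budget in minutes for a given availability over period_days."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.995):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {minutes:.2f} min per 30 days")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the last fraction of a percent tends to dominate infrastructure cost.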

---

**2. Real-Time Edge Inference (AMD)**
- **Objective**: Deploy anomaly detection to 500 test equipment units (ARM CPU, 4GB RAM, no cloud)
- **Architecture**:
  - **Model**: Quantized autoencoder (FP32 → INT8, 200MB → 12MB)
  - **Runtime**: ONNX Runtime (optimized for ARM)
  - **Container**: Docker (ARM64 base image, multi-stage build)
  - **Orchestration**: K3s (lightweight Kubernetes for edge)
  - **Update Mechanism**: GitOps (Fleet pulls updates from Git repo)
  - **Monitoring**: Prometheus agent (ship metrics to central server)
- **Key Features**:
  - Over-the-air updates (deploy to 500 devices in 10 minutes)
  - Offline operation (equipment isolated from internet for security)
  - Local inference (<1ms latency, no cloud round-trip)
  - Fallback model (if primary fails, use simpler rule-based)
  - Gradual rollout (canary to 10 devices → validate → roll out to 500)
- **Success Metrics**:
  - <1ms inference latency (target: <5ms)
  - 150MB memory footprint (fits in 4GB device)
  - 99.9% uptime per device (remote monitoring + auto-restart)
  - Update 500 devices in 10 minutes (was 2 weeks manual)
  - Zero failed updates (atomic updates with rollback)
- **Business Value**: $18M annually (real-time anomaly detection, eliminated cloud costs $500K/year, faster updates)
- **Implementation**: 8 months (model optimization, K3s setup, GitOps pipeline, fleet management)
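
To make the FP32 → INT8 step concrete, here is a minimal NumPy sketch of symmetric per-tensor quantization. It is not the pipeline used in the project above (which relies on ONNX Runtime); the function names and the random weight matrix are illustrative. Note that INT8 storage alone gives a 4× reduction, so a 200MB → 12MB result would need further techniques (for example pruning or a smaller architecture) that are not detailed here.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: weights ~ scale * q, q in [-127, 127]."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 weights from INT8 values."""
    return q.astype(np.float32) * scale

# Random weights standing in for one layer of a model
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"size ratio: {w.nbytes / q.nbytes:.0f}x")
print(f"max abs error: {np.max(np.abs(w - w_hat)):.6f} (half a step = {scale / 2:.6f})")
```

The rounding error per weight is bounded by half a quantization step, which is why accuracy usually has to be re-validated after quantization rather than assumed.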

---

**3. Multi-Region Deployment (NVIDIA)**
- **Objective**: Serve models globally with <100ms latency from any location
- **Architecture**:
  - **Regions**: 3 data centers (US-West, US-East, Asia)
  - **Load Balancing**: GeoDNS routes to nearest region
  - **Kubernetes**: EKS cluster per region (10-50 pods each)
  - **Data Replication**: PostgreSQL primary-replica (read from nearest)
  - **Model Sync**: S3 cross-region replication (models synced in <5 minutes)
  - **Monitoring**: Centralized Grafana (aggregate metrics from all regions)
- **Key Features**:
  - Geo-routing (US users → US cluster, Asia users → Asia cluster)
  - Failover (US-West down → route to US-East automatically)
  - Regional model caching (avoid cross-region model fetches)
  - Data sovereignty compliance (data stays in its region of origin)
  - Disaster recovery (backup to different region, RTO <30 minutes)
- **Success Metrics**:
  - <100ms P99 latency globally (was 300ms single region)
  - 99.995% availability (about 130 seconds downtime per 30-day month)
  - 10K requests/sec globally (3K-4K per region)
  - Zero data loss during region failure (replication lag <5s)
  - $2M cost savings (avoid premium tier single-region solution)
- **Business Value**: $15M annually (global expansion enabled, improved user experience, reduced latency)
- **Implementation**: 6 months (multi-region setup, DR testing, traffic migration)
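
The geo-routing and failover behaviour can be pictured as an ordered preference list per region. In practice this logic lives in GeoDNS and health checks rather than application code; the sketch below only illustrates the decision, and the `FALLBACKS` table and `pick_region` function are hypothetical.

```python
# Hypothetical failover preferences; region names follow the three data centers above
FALLBACKS = {
    "us-west": ["us-west", "us-east", "asia"],
    "us-east": ["us-east", "us-west", "asia"],
    "asia": ["asia", "us-west", "us-east"],
}

def pick_region(nearest: str, healthy: set) -> str:
    """Return the first healthy region in the caller's preference order."""
    for region in FALLBACKS[nearest]:
        if region in healthy:
            return region
    raise RuntimeError("no healthy region available")

all_up = {"us-west", "us-east", "asia"}
print(pick_region("us-west", all_up))                # normal routing
print(pick_region("us-west", all_up - {"us-west"}))  # US-West down
```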

---

**4. Continuous Training Pipeline (Qualcomm)**
- **Objective**: Automatically retrain models weekly using latest production data
- **Architecture**:
  - **Data Pipeline**: Kafka → Spark Streaming → Feature Store (Feast)
  - **Training Orchestration**: Kubeflow Pipelines (DAG for train → evaluate → register → deploy)
  - **Compute**: Kubernetes with GPU nodes (train 10 models in parallel)
  - **Model Registry**: MLflow (track experiments, lineage, staging)
  - **Deployment**: Automated promotion (staging → canary → production)
  - **Monitoring**: Track model performance, trigger retraining on drift
- **Key Features**:
  - Scheduled retraining (every Sunday 2am, low-traffic window)
  - Data drift trigger (PSI >0.25 → immediate retraining)
  - Automated validation (accuracy >90% required for promotion)
  - Rollback on failure (if new model worse, revert to previous)
  - Experiment tracking (compare 1000+ training runs)
- **Success Metrics**:
  - Weekly retraining cycle (was monthly manual)
  - 92% → 95% accuracy (models adapt to recent data)
  - Zero manual interventions (fully automated)
  - 3 hours training time (parallel GPU training)
  - $500K ML engineer time saved (no manual retraining)
- **Business Value**: $20M annually (higher accuracy = better decisions, automation saves $500K, faster adaptation to changes)
- **Implementation**: 5 months (Kubeflow setup, feature store, automated validation, monitor integration)
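
The drift trigger relies on the Population Stability Index, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the fractions of training and production samples falling in bin i. The sketch below is one common way to compute it for a single numeric feature; the quantile binning, the `eps` floor, and the synthetic data are choices made for this example.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a production sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior cut points taken from quantiles of the reference sample
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1)[1:-1])
    # searchsorted maps every value to a bin index in 0..bins-1
    exp_counts = np.bincount(np.searchsorted(cuts, expected, side="right"), minlength=bins)
    act_counts = np.bincount(np.searchsorted(cuts, actual, side="right"), minlength=bins)
    # Floor the fractions so empty bins do not produce log(0)
    exp_frac = np.clip(exp_counts / expected.size, eps, None)
    act_frac = np.clip(act_counts / actual.size, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

# Synthetic feature values, for illustration only
rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)      # production data, no drift
shifted = rng.normal(1.0, 1.0, 5000)   # production data, mean shifted by one sigma

psi_same = population_stability_index(train, same)
psi_shifted = population_stability_index(train, shifted)
print(f"no drift: PSI={psi_same:.3f}, retrain={psi_same > 0.25}")
print(f"shifted:  PSI={psi_shifted:.3f}, retrain={psi_shifted > 0.25}")
```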

---

### General AI/ML Projects

**5. High-Traffic Recommendation API**
- **Objective**: Serve 100K recommendations/sec for e-commerce platform
- **Architecture**: TensorFlow Serving + Kubernetes + Redis caching + CDN
- **Key Features**: Model batching (32 samples), feature caching, multi-tier architecture
- **Success Metrics**: <50ms P99 latency, 99.99% uptime, 15% CTR increase
- **Value**: $50M revenue increase from better recommendations

---

**6. Medical Imaging API**
- **Objective**: Real-time cancer detection from radiology images
- **Architecture**: PyTorch + ONNX Runtime + GPU serving + DICOM integration
- **Key Features**: High-accuracy model (AUC 0.96), explainable AI (Grad-CAM), HIPAA compliance
- **Success Metrics**: <5s inference, 96% sensitivity, 98% specificity, radiologist approval
- **Value**: Early cancer detection saves lives, $10M/year revenue

---

**7. Fraud Detection System**
- **Objective**: Real-time fraud scoring for financial transactions
- **Architecture**: XGBoost + FastAPI + Redis + Kubernetes + real-time feature pipeline
- **Key Features**: <10ms scoring, 1M transactions/day, explainable predictions
- **Success Metrics**: 99.5% fraud detection, 0.5% false positives, $100M fraud prevented
- **Value**: Protect customers, reduce chargebacks

---

**8. Chatbot Backend**
- **Objective**: Deploy LLM for customer support (1M conversations/day)
- **Architecture**: Generative (decoder-only) LLM + FastAPI + vLLM (batching) + GPU + prompt caching
- **Key Features**: Context management, streaming responses, safety filters
- **Success Metrics**: <500ms first token, 90% customer satisfaction, 50% support cost reduction
- **Value**: $20M annual savings from automation

---

## 🎓 Key Takeaways & Next Steps

### What You Learned

**1. REST API Serving (FastAPI):**
- ✅ **FastAPI**: Async performance, auto-docs, type safety; often benchmarked at roughly 3× Flask's throughput on I/O-bound endpoints
- ✅ **Pydantic**: Input/output validation catches errors before inference
- ✅ **Best Practices**: Load model at startup, batch requests, async I/O, health checks
- ✅ **Intel**: 500K predictions/day, <10ms P99 latency, 99.99% uptime

**2. Docker Containerization:**
- ✅ **Reproducibility**: Same environment dev → staging → production
- ✅ **Optimization**: Multi-stage builds, slim images, layer caching (2.5GB → 800MB)
- ✅ **Security**: Non-root user, health checks, minimal attack surface
- ✅ **AMD**: Edge deployment (200MB → 12MB), <1ms inference on ARM

**3. Kubernetes Deployment:**
- ✅ **Auto-scaling**: HPA scales 3-20 pods based on CPU/memory/custom metrics
- ✅ **Self-healing**: Restart failed pods, replace unhealthy instances
- ✅ **Rolling Updates**: Zero-downtime deployments, gradual rollout, instant rollback
- ✅ **NVIDIA**: 100K predictions/day, 99.99% uptime, auto-scale in 30 seconds
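
The core of the HPA scaling rule documented by Kubernetes is `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, clamped to the configured minimum and maximum. The sketch below reproduces just that arithmetic; the real controller also applies a tolerance band and stabilization windows, which are omitted here, and the function name is an assumption.

```python
import math

def desired_replicas(current_replicas: int, current_util: float, target_util: float,
                     min_replicas: int = 3, max_replicas: int = 20) -> int:
    """Core HPA formula, clamped to the [min_replicas, max_replicas] range."""
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))

# 3 pods running at 140% of requested CPU against a 70% target
print(desired_replicas(3, current_util=140, target_util=70))
# 10 pods idling at 20% CPU against a 70% target
print(desired_replicas(10, current_util=20, target_util=70))
```

Because utilization is measured relative to the pod's CPU request, setting realistic `resources.requests` matters as much as choosing the target percentage.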

**4. Monitoring & Observability:**
- ✅ **Prometheus + Grafana**: Track latency, throughput, error rate, model metrics
- ✅ **Data Drift Detection**: PSI, KS test, domain classifier (detect 2 weeks early)
- ✅ **Alerting**: Critical (PagerDuty), Warning (Slack), Info (Email)
- ✅ **Qualcomm**: Continuous training, automated retraining on drift, 95% accuracy maintained

---

### Deployment Architecture Comparison

| Aspect | Flask + VM | FastAPI + Docker | FastAPI + K8s |
|--------|-----------|------------------|---------------|
| **Setup Complexity** | Simple | Moderate | Complex |
| **Performance** | 100 req/sec | 300 req/sec | 10K+ req/sec |
| **Scaling** | Manual (add VMs) | Manual (add containers) | Auto (HPA) |
| **Deployment** | SSH + script | Docker push/pull | `kubectl apply` |
| **Downtime** | Yes (5-10 min) | Minimal (1 min) | Zero (rolling) |
| **Monitoring** | Basic logs | Docker logs | Prometheus/Grafana |
| **Cost (1K req/sec)** | $500/month | $300/month | $200/month |

---

### Deployment Checklist

**Before Production Deployment:**
- ✅ **Model Validation**: Accuracy >90% on hold-out test set
- ✅ **Load Testing**: Simulate 10× expected traffic (Locust, JMeter)
- ✅ **Latency Testing**: P99 <100ms (target based on use case)
- ✅ **Error Handling**: Graceful failures, informative error messages
- ✅ **Security**: API authentication, rate limiting, input sanitization
- ✅ **Documentation**: API docs (/docs), runbooks, architecture diagrams
- ✅ **Monitoring**: Dashboards, alerts, log aggregation
- ✅ **Disaster Recovery**: Backup models, rollback plan, multi-region (optional)

**After Deployment:**
- ✅ **Canary Deploy**: Route 10% → validate → 100%
- ✅ **Shadow Deploy**: Run new model in parallel, compare predictions
- ✅ **Monitor Metrics**: Latency, error rate, model performance, data drift
- ✅ **On-call Rotation**: Engineers on-call for critical alerts
- ✅ **Post-mortem**: Document incidents, improve processes

---

### Performance Optimization Guide

**Latency Optimization:**
1. **Model Level**: Quantization (FP32→INT8), pruning, distillation, ONNX Runtime
2. **Serving Level**: Batching (dynamic batching for throughput), caching (Redis), async I/O
3. **Infrastructure**: GPU (vs CPU), co-location (model + API), CDN (for features)
4. **Intel Example**: 10ms → 3ms (quantization + batching + GPU)

**Throughput Optimization:**
1. **Horizontal Scaling**: More pods/containers/VMs
2. **Vertical Scaling**: More CPU/memory per instance
3. **Batching**: Process 32 samples together (10× throughput)
4. **Load Balancing**: Distribute requests evenly (NGINX, K8s Service)
5. **NVIDIA Example**: 1K → 10K req/sec (GPU batching + 20 pods)

**Cost Optimization:**
1. **Right-sizing**: Don't over-provision (monitor actual usage)
2. **Spot Instances**: 70% cheaper for non-critical workloads
3. **Auto-scaling**: Scale down during low traffic (nights, weekends)
4. **Model Optimization**: Smaller model = less compute = lower cost
5. **AMD Example**: $500K/year cloud costs → $50K/year edge deployment

---

### Real-World Impact Summary

| Company | Solution | Problem Solved | Savings |
|---------|----------|----------------|---------|
| **Intel** | End-to-end ML platform | 20 models, 1M predictions/day | $25M |
| **AMD** | Edge inference | 500 devices, <1ms latency | $18M |
| **NVIDIA** | Multi-region deployment | Global <100ms latency | $15M |
| **Qualcomm** | Continuous training | Weekly retraining, 95% accuracy | $20M |

**Total measurable impact:** $78M across 4 companies

---

### Common Pitfalls & Solutions

**1. Loading Model Per Request:**
- ❌ Problem: 1s overhead, slow inference
- ✅ Solution: Load once at startup, cache in memory
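
Besides a startup hook, a cached loader is a simple way to guarantee the model is deserialized only once per process. This is a minimal sketch: `get_model` is a hypothetical helper, and a pickled dictionary stands in for a real model file.

```python
import functools
import os
import pickle
import tempfile

@functools.lru_cache(maxsize=1)
def get_model(path: str):
    """Deserialize the model on the first call; later calls reuse the cached object."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in "model" for illustration: any picklable object works here
tmp_dir = tempfile.mkdtemp()
model_path = os.path.join(tmp_dir, "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump({"weights": [0.1, 0.2, 0.3]}, f)

first = get_model(model_path)    # reads from disk
second = get_model(model_path)   # served from the in-process cache
print("same object:", first is second)
print(get_model.cache_info())
```

Only unpickle files from trusted sources, since `pickle.load` can execute arbitrary code.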

**2. No Health Checks:**
- ❌ Problem: K8s routes traffic to crashed pods
- ✅ Solution: /health endpoint for liveness/readiness probes

**3. No Monitoring:**
- ❌ Problem: Model degrades silently, business impact unknown
- ✅ Solution: Prometheus + Grafana + alerts on drift/accuracy

**4. No Rollback Plan:**
- ❌ Problem: Bad deployment breaks production, panic
- ✅ Solution: Version models, test in staging, canary deploy, instant rollback

**5. Ignoring Data Drift:**
- ❌ Problem: Model trained on 2023 data, serving 2024 data (92% → 87% accuracy)
- ✅ Solution: Monitor PSI, retrain weekly, alert on drift

**6. Single Point of Failure:**
- ❌ Problem: One server down = entire service down
- ✅ Solution: Deploy multiple replicas, load balancing, auto-healing

---

### Next Steps

**Immediate (This Week):**
1. Build FastAPI endpoint for personal ML model
2. Write Dockerfile and test locally
3. Deploy to Docker Hub or local registry

**Short-term (This Month):**
1. Deploy to Kubernetes (Minikube locally, then cloud)
2. Setup Prometheus + Grafana monitoring
3. Implement auto-scaling with HPA

**Long-term (This Quarter):**
1. Build end-to-end ML platform (training → registry → serving → monitoring)
2. Implement continuous training pipeline
3. Deploy to production with 99.9%+ uptime

---

### Resources

**Books:**
1. *Building Machine Learning Powered Applications* by Emmanuel Ameisen
2. *Designing Machine Learning Systems* by Chip Huyen
3. *Kubernetes Patterns* by Bilgin Ibryam & Roland Huß

**Online:**
- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [Docker Documentation](https://docs.docker.com/)
- [Kubernetes Documentation](https://kubernetes.io/docs/)
- [Prometheus + Grafana Tutorials](https://prometheus.io/docs/tutorials/)

**Courses:**
- [Full Stack Deep Learning](https://fullstackdeeplearning.com/)
- [Made With ML](https://madewithml.com/)
- [Kubernetes for ML Engineers](https://www.coursera.org/learn/kubernetes)

**Practice:**
- Deploy simple model (scikit-learn) with FastAPI
- Containerize with Docker
- Deploy to Kubernetes (Minikube or cloud)
- Add monitoring and alerts

---

**🎉 Congratulations!** You have now covered production ML deployment end to end, from REST APIs through Kubernetes orchestration to monitoring. These are the building blocks for serving 1M predictions/day with <10ms latency and 99.99% uptime.

**Measurable skills gained:**
- Build FastAPI services (often benchmarked at roughly 3× Flask's throughput)
- Containerize models with Docker (reproducible deployments)
- Deploy to Kubernetes with auto-scaling (3-20 pods dynamically)
- Monitor production models (Prometheus + Grafana + alerts)
- Detect and fix data drift 2 weeks early (proactive retraining)
- Achieve 99.99% uptime (about 4.3 minutes downtime per 30-day month)
- Save $15-25M through efficient deployment and monitoring

**Ready for end-to-end MLOps?** Proceed to **Notebook 111: MLOps Fundamentals** to learn complete ML pipelines with feature stores, experiment tracking, and CI/CD! 🚀

In [None]:
from pathlib import Path
import json
import pickle

# Create Dockerfile for ML model deployment
dockerfile_content = """
# Multi-stage build for optimized ML model serving
FROM python:3.11-slim AS builder

# Install dependencies into an isolated virtual environment
WORKDIR /build
COPY requirements.txt .
RUN python -m venv /opt/venv && /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

# Production stage
FROM python:3.11-slim

# Copy the virtual environment from the builder; unlike /root/.local,
# /opt/venv stays readable after switching to the non-root user below
COPY --from=builder /opt/venv /opt/venv
ENV PATH=/opt/venv/bin:$PATH

# Set working directory
WORKDIR /app

# Copy application code
COPY model_server.py .
COPY models/ ./models/
COPY config/ ./config/

# Create non-root user for security
RUN useradd -m -u 1000 mluser && chown -R mluser:mluser /app
USER mluser

# Expose port
EXPOSE 8080

# Health check (stdlib urllib: requests is not in requirements.txt, and
# urlopen raises on HTTP error statuses, so the check fails on 4xx/5xx)
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)"

# Run application
CMD ["python", "model_server.py"]
"""

# Create requirements.txt
requirements_content = """
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
numpy==1.24.3
scikit-learn==1.3.2
pandas==2.1.3
prometheus-client==0.19.0
python-json-logger==2.0.7
"""

# Create FastAPI model serving application
model_server_content = '''
"""
Production ML Model Server with FastAPI
Handles yield prediction for semiconductor wafer test data
"""

from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field
from prometheus_client import Counter, Histogram, generate_latest
import numpy as np
import pickle
import logging
import time
from typing import List, Dict, Any
import json

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='{"timestamp": "%(asctime)s", "level": "%(levelname)s", "message": "%(message)s"}'
)
logger = logging.getLogger(__name__)

# Prometheus metrics
prediction_counter = Counter('model_predictions_total', 'Total predictions made')
prediction_latency = Histogram('model_prediction_latency_seconds', 'Prediction latency')
error_counter = Counter('model_errors_total', 'Total prediction errors')

# Initialize FastAPI app
app = FastAPI(
    title="Wafer Yield Prediction API",
    description="Production ML model for predicting semiconductor wafer yield",
    version="2.0.0"
)

# Load model at startup
model = None

@app.on_event("startup")
async def load_model():
    global model
    try:
        with open('models/yield_predictor_v2.pkl', 'rb') as f:
            model = pickle.load(f)
        logger.info("Model loaded successfully", extra={"model_version": "v2.0"})
    except Exception as e:
        logger.error(f"Failed to load model: {e}")
        raise

# Request/Response models
class PredictionRequest(BaseModel):
    wafer_id: str = Field(..., description="Unique wafer identifier")
    voltage: float = Field(..., ge=1.0, le=1.5, description="Voltage (V)")
    current: float = Field(..., ge=100, le=1000, description="Current (mA)")
    temperature: float = Field(..., ge=20, le=85, description="Temperature (°C)")
    test_time: float = Field(..., ge=0, le=300, description="Test time (seconds)")

    # Pydantic v2 style (pydantic==2.5.0 is pinned); replaces the deprecated inner Config class
    model_config = {
        "json_schema_extra": {
            "example": {
                "wafer_id": "W12345",
                "voltage": 1.2,
                "current": 500,
                "temperature": 25,
                "test_time": 45.5
            }
        }
    }

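class PredictionResponse(BaseModel):
    # Pydantic v2 reserves the "model_" prefix; clear it so the
    # model_version field does not trigger a namespace warning
    model_config = {"protected_namespaces": ()}

    wafer_id: str
    predicted_yield: float = Field(..., ge=0, le=100)
    confidence: float
    model_version: str
    latency_ms: float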

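@app.get("/health")
async def health_check():
    """Health check endpoint for load balancer and Kubernetes probes"""
    if model is None:
        # Report unhealthy until the model is loaded so no traffic is routed here
        return JSONResponse(
            status_code=503,
            content={"status": "unhealthy", "model_loaded": False}
        )
    return {
        "status": "healthy",
        "model_loaded": True,
        "timestamp": time.time()
    }

@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint (plain-text exposition format, not JSON)"""
    # Imported locally to keep this endpoint self-contained
    from fastapi import Response
    from prometheus_client import CONTENT_TYPE_LATEST
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)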

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    """Make yield prediction"""
    start_time = time.time()
    
    try:
        # Extract features
        features = np.array([[
            request.voltage,
            request.current,
            request.temperature,
            request.test_time
        ]])
        
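        # Make prediction
        # NOTE: predict_proba exists on classifiers only; a plain regressor
        # would raise AttributeError here and surface as a 500 below
        prediction = model.predict(features)[0]
        confidence = model.predict_proba(features).max()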
        
        # Calculate latency
        latency_ms = (time.time() - start_time) * 1000
        
        # Update metrics
        prediction_counter.inc()
        prediction_latency.observe(latency_ms / 1000)
        
        # Log prediction
        logger.info(
            "Prediction completed",
            extra={
                "wafer_id": request.wafer_id,
                "predicted_yield": float(prediction),
                "latency_ms": latency_ms
            }
        )
        
        return PredictionResponse(
            wafer_id=request.wafer_id,
            predicted_yield=float(prediction),
            confidence=float(confidence),
            model_version="v2.0",
            latency_ms=latency_ms
        )
        
    except Exception as e:
        error_counter.inc()
        logger.error(f"Prediction failed: {e}", extra={"wafer_id": request.wafer_id})
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/batch_predict")
async def batch_predict(requests: List[PredictionRequest]):
    """Batch prediction for multiple wafers"""
    results = []
    for req in requests:
        result = await predict(req)
        results.append(result)
    return {"predictions": results, "count": len(results)}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080, log_level="info")
'''

# Create docker-compose.yml for local testing
docker_compose_content = """
version: '3.8'

services:
  model-server:
    build: .
    ports:
      - "8080:8080"
    environment:
      - MODEL_VERSION=v2.0
      - LOG_LEVEL=INFO
    volumes:
      - ./models:/app/models:ro
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G
    healthcheck:
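      # python:3.11-slim ships without curl, so probe with the Python stdlib instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)"]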
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: unless-stopped
"""

# Print deployment files
print("üê≥ Docker Deployment Configuration")
print("=" * 80)

print("\nüìÑ Dockerfile (Multi-stage optimized):")
print("-" * 80)
print("Key features:")
print("  ‚úÖ Multi-stage build (reduces image size 60%)")
print("  ‚úÖ Non-root user (security best practice)")
print("  ‚úÖ Health check (Kubernetes readiness probe)")
print("  ‚úÖ Python 3.11 slim base (350MB vs 1GB full image)")

print("\nüìÑ model_server.py (FastAPI Application):")
print("-" * 80)
print("Capabilities:")
print("  ‚úÖ RESTful API with OpenAPI docs")
print("  ‚úÖ Pydantic validation (input validation, type safety)")
print("  ‚úÖ Prometheus metrics (predictions, latency, errors)")
print("  ‚úÖ Structured JSON logging")
print("  ‚úÖ Batch prediction endpoint")
print("  ‚úÖ Health check for load balancer")

print("\nüìÑ docker-compose.yml (Local Development):")
print("-" * 80)
print("Services:")
print("  üöÄ model-server: ML model API (port 8080)")
print("  üìä prometheus: Metrics collection (port 9090)")
print("  üìà grafana: Visualization dashboard (port 3000)")
print("  ")
print("Resource limits:")
print("  CPU: 1-2 cores, Memory: 2-4 GB")

print("\nüîß Build and Run Commands:")
print("-" * 80)
print("# Build Docker image")
print("docker build -t wafer-yield-model:v2.0 .")
print("")
print("# Run single container")
print("docker run -p 8080:8080 wafer-yield-model:v2.0")
print("")
print("# Run with docker-compose (includes monitoring)")
print("docker-compose up -d")
print("")
print("# Test API")
print("curl -X POST http://localhost:8080/predict \\")
print('  -H "Content-Type: application/json" \\')
print('  -d \'{"wafer_id": "W001", "voltage": 1.2, "current": 500, "temperature": 25, "test_time": 45}\'')

print("\nüìä Expected Performance:")
print("-" * 80)
print("  Image size: ~450 MB (multi-stage build)")
print("  Startup time: 3-5 seconds")
print("  Prediction latency: <50ms (p95)")
print("  Throughput: 1000 req/sec (single container)")
print("  Memory footprint: 1.5-2 GB (loaded model + cache)")

print("\nüè≠ Post-Silicon Validation Deployment:")
print("  Use case: Deploy yield predictor to production fab network")
print("  Deployment: 3 containers behind load balancer")
print("  Monitoring: Prometheus + Grafana dashboards")
print("  Integration: REST API called by test equipment controllers")

In [None]:
# Kubernetes deployment configuration
k8s_deployment = """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wafer-yield-model
  namespace: ml-models
  labels:
    app: wafer-yield-model
    version: v2.0
spec:
  replicas: 3
  selector:
    matchLabels:
      app: wafer-yield-model
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max 1 extra pod during update
      maxUnavailable: 0  # No downtime during update
  template:
    metadata:
      labels:
        app: wafer-yield-model
        version: v2.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: model-server
        image: registry.company.com/wafer-yield-model:v2.0
        ports:
        - containerPort: 8080
          name: http
        env:
        - name: MODEL_VERSION
          value: "v2.0"
        - name: LOG_LEVEL
          value: "INFO"
        resources:
          requests:
            cpu: 500m      # 0.5 CPU core
            memory: 1Gi    # 1 GB RAM
          limits:
            cpu: 2000m     # 2 CPU cores max
            memory: 4Gi    # 4 GB RAM max
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2
        volumeMounts:
        - name: model-storage
          mountPath: /app/models
          readOnly: true
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - wafer-yield-model
              topologyKey: kubernetes.io/hostname
"""

k8s_service = """
apiVersion: v1
kind: Service
metadata:
  name: wafer-yield-model-service
  namespace: ml-models
spec:
  type: LoadBalancer
  selector:
    app: wafer-yield-model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
    name: http
  sessionAffinity: ClientIP  # Sticky sessions
"""

k8s_hpa = """
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: wafer-yield-model-hpa
  namespace: ml-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wafer-yield-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60   # Scale up quickly
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
"""

k8s_ingress = """
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: wafer-yield-model-ingress
  namespace: ml-models
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/limit-rps: "100"  # requests/sec per client IP
spec:
  ingressClassName: nginx  # replaces the deprecated kubernetes.io/ingress.class annotation
  tls:
  - hosts:
    - ml-api.company.com
    secretName: ml-api-tls
  rules:
  - host: ml-api.company.com
    http:
      paths:
      - path: /predict
        pathType: Prefix
        backend:
          service:
            name: wafer-yield-model-service
            port:
              number: 80
"""

print("☸️ Kubernetes Deployment Configuration")
print("=" * 80)

print("\n📦 Deployment Manifest:")
print("-" * 80)
print("Configuration:")
print("  • Replicas: 3 (high availability)")
print("  • Rolling update: MaxSurge=1, MaxUnavailable=0 (zero downtime)")
print("  • Resources: 0.5-2 CPU, 1-4 GB memory per pod")
print("  • Probes: Liveness (detect crashes), Readiness (traffic routing)")
print("  • Anti-affinity: Spread pods across nodes")

print("\n🔀 Service (LoadBalancer):")
print("-" * 80)
print("  • Type: LoadBalancer (cloud provider integration)")
print("  • Port: 80 → 8080 (external → internal)")
print("  • Session affinity: ClientIP (sticky sessions)")

print("\n📈 HorizontalPodAutoscaler:")
print("-" * 80)
print("  • Min replicas: 2 (always available)")
print("  • Max replicas: 10 (handle traffic spikes)")
print("  • CPU target: 70% utilization")
print("  • Memory target: 80% utilization")
print("  • Scale-up: Fast (60s window), Scale-down: Slow (300s window)")

print("\n🌐 Ingress (External Access):")
print("-" * 80)
print("  • Domain: ml-api.company.com")
print("  • TLS: Automatic HTTPS with Let's Encrypt")
print("  • Rate limiting: 100 req/sec per IP")

print("\n🚀 Deployment Commands:")
print("-" * 80)
print("# Apply configurations")
print("kubectl apply -f deployment.yaml")
print("kubectl apply -f service.yaml")
print("kubectl apply -f hpa.yaml")
print("kubectl apply -f ingress.yaml")
print("")
print("# Check status")
print("kubectl get pods -n ml-models")
print("kubectl get hpa -n ml-models")
print("kubectl describe deployment wafer-yield-model -n ml-models")
print("")
print("# Rolling update (zero downtime)")
print("kubectl set image deployment/wafer-yield-model \\")
print("  model-server=registry.company.com/wafer-yield-model:v2.1 -n ml-models")
print("")
print("# Rollback if issues")
print("kubectl rollout undo deployment/wafer-yield-model -n ml-models")

print("\nüìä Scaling Behavior Simulation:")
print("-" * 80)
import numpy as np

# Simulate traffic pattern (24 hours)
hours = np.arange(24)
traffic_pattern = np.array([
    20, 15, 10, 10, 15, 30,  # 00:00-05:00 (low)
    50, 80, 100, 90, 85, 95,  # 06:00-11:00 (morning peak)
    100, 110, 100, 95, 90, 100,  # 12:00-17:00 (afternoon peak)
    80, 60, 50, 40, 30, 25   # 18:00-23:00 (evening decline)
])

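# Calculate required pods. Assumes each pod saturates at 100 req/sec, so it
# serves 70 req/sec at the HPA's 70% CPU target.
# Note: the peak here is 110 req/s, which needs 2 pods, so the minimum of 2 applies all day.
cpu_per_pod = 70  # req/sec per pod at the 70% CPU target
required_pods = np.ceil(traffic_pattern / cpu_per_pod).astype(int)
required_pods = np.clip(required_pods, 2, 10)  # Min 2, max 10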

print("Hour | Traffic | Required Pods | Scaling Action")
print("-" * 60)
for h in [0, 6, 12, 18, 23]:
    action = ""
    if h > 0:
        prev_pods = required_pods[h-1]
        curr_pods = required_pods[h]
        if curr_pods > prev_pods:
            action = f"↑ Scale up (+{curr_pods - prev_pods})"
        elif curr_pods < prev_pods:
            action = f"↓ Scale down (-{prev_pods - curr_pods})"
        else:
            action = "→ No change"
    print(f"{h:02d}:00 | {traffic_pattern[h]:3d} req/s | {required_pods[h]:2d} pods        | {action}")

avg_pods = np.mean(required_pods)
fixed_pods = 10  # If no auto-scaling
cost_savings = ((fixed_pods * 24) - np.sum(required_pods)) / (fixed_pods * 24) * 100

print(f"\nüí∞ Cost Analysis:")
print(f"  Fixed capacity (10 pods √ó 24h): {fixed_pods * 24} pod-hours")
print(f"  Auto-scaled capacity: {np.sum(required_pods)} pod-hours")
print(f"  Cost savings: {cost_savings:.1f}%")

print("\nüè≠ Post-Silicon Validation K8s Deployment:")
print("  Cluster: 5 nodes (3 control plane, 2 worker nodes)")
print("  Pods: 2-10 replicas based on wafer test volume")
print("  Storage: NFS-mounted model files (PersistentVolume)")
print("  Networking: Internal ClusterIP for test equipment access")
print("  Monitoring: Prometheus + Grafana on separate namespace")

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from collections import defaultdict
import random

# Simulate production monitoring data
class ProductionMonitor:
    """Monitor ML model performance in production"""
    
    def __init__(self):
        self.metrics = defaultdict(list)
        self.alerts = []
    
    def record_prediction(self, model_version: str, latency_ms: float, 
                         error: bool, actual_yield: float = None, 
                         predicted_yield: float = None):
        """Record prediction metrics"""
        self.metrics[f"{model_version}_latency"].append(latency_ms)
        self.metrics[f"{model_version}_errors"].append(1 if error else 0)
        
        if actual_yield is not None and predicted_yield is not None:
            mae = abs(actual_yield - predicted_yield)
            self.metrics[f"{model_version}_mae"].append(mae)
    
    def check_alerts(self, model_version: str) -> list:
        """Check for alerting conditions"""
        alerts = []
        
        # Latency alert (p95 > 500ms)
        latencies = self.metrics[f"{model_version}_latency"]
        if latencies:
            p95_latency = np.percentile(latencies, 95)
            if p95_latency > 500:
                alerts.append(f"⚠️ High latency: p95={p95_latency:.0f}ms (threshold: 500ms)")
        
        # Error rate alert (>1%)
        errors = self.metrics[f"{model_version}_errors"]
        if errors:
            error_rate = np.mean(errors) * 100
            if error_rate > 1.0:
                alerts.append(f"🚨 High error rate: {error_rate:.2f}% (threshold: 1.0%)")
        
        # Accuracy drift alert (MAE increase >5%): compare the first 100
        # predictions (baseline) with the most recent 100, without overlap
        mae_values = self.metrics[f"{model_version}_mae"]
        if len(mae_values) >= 200:
            recent_mae = np.mean(mae_values[-100:])
            baseline_mae = np.mean(mae_values[:100])
            if baseline_mae > 0:  # avoid division by zero on a perfect baseline
                drift = ((recent_mae - baseline_mae) / baseline_mae) * 100
                if drift > 5.0:
                    alerts.append(f"📉 Accuracy drift: +{drift:.1f}% MAE increase (threshold: 5%)")
        
        return alerts

# A/B Testing simulation
class ABTestManager:
    """Manage A/B tests for model deployments"""
    
    def __init__(self, model_a_version: str, model_b_version: str, 
                 traffic_split: float = 0.9):
        self.model_a = model_a_version
        self.model_b = model_b_version
        self.traffic_split = traffic_split  # fraction routed to model A (default: 90% A, 10% B)
        self.results = {model_a_version: [], model_b_version: []}
    
    def route_request(self) -> str:
        """Route request to A or B based on traffic split"""
        return self.model_a if random.random() < self.traffic_split else self.model_b
    
    def record_result(self, model: str, accuracy: float):
        """Record prediction accuracy"""
        self.results[model].append(accuracy)
    
    def analyze_test(self) -> dict:
        """Statistical analysis of A/B test results"""
        results_a = np.array(self.results[self.model_a])
        results_b = np.array(self.results[self.model_b])
        
        mean_a = np.mean(results_a)
        mean_b = np.mean(results_b)
        std_a = np.std(results_a, ddof=1)  # sample standard deviation
        std_b = np.std(results_b, ddof=1)
        
        # Calculate improvement
        improvement = ((mean_b - mean_a) / mean_a) * 100
        
        # Simple significance test (Welch's t-statistic)
        n_a, n_b = len(results_a), len(results_b)
        std_err = np.sqrt((std_a**2 / n_a) + (std_b**2 / n_b))  # standard error of the difference
        t_stat = (mean_b - mean_a) / std_err if std_err > 0 else 0
        
        # Decision threshold: >2% improvement, t-stat > 2 (roughly p < 0.05)
        decision = "ROLLOUT" if improvement > 2.0 and abs(t_stat) > 2.0 else "HOLD"
        
        return {
            "model_a": self.model_a,
            "model_b": self.model_b,
            "mean_a": mean_a,
            "mean_b": mean_b,
            "improvement_pct": improvement,
            "t_statistic": t_stat,
            "decision": decision,
            "confidence": "High" if abs(t_stat) > 2.5 else "Medium"
        }

# Simulation
monitor = ProductionMonitor()
ab_test = ABTestManager("v2.0", "v2.1", traffic_split=0.9)

print("📊 Production Monitoring & A/B Testing Simulation")
print("=" * 80)

# Simulate 1000 predictions
print("\n🔄 Simulating 1000 production predictions...")
for i in range(1000):
    # Route to model version
    model_version = ab_test.route_request()
    
    # Simulate prediction metrics (v2.1 slightly better)
    if model_version == "v2.0":
        latency = np.random.gamma(shape=2, scale=25)  # Mean ~50ms
        error = random.random() < 0.005  # 0.5% error rate
        mae = np.random.normal(2.5, 0.5)  # MAE ~2.5%
    else:  # v2.1
        latency = np.random.gamma(shape=2, scale=22)  # Slightly faster
        error = random.random() < 0.003  # Lower error rate
        mae = np.random.normal(2.0, 0.4)  # Better accuracy
    
    # Record metrics
    monitor.record_prediction(model_version, latency, error, 
                             actual_yield=95.0, predicted_yield=95.0 - mae)
    ab_test.record_result(model_version, 100 - mae)  # Accuracy as %

# Check alerts
print("\n🚨 Alert Check (v2.0):")
alerts_v20 = monitor.check_alerts("v2.0")
if alerts_v20:
    for alert in alerts_v20:
        print(f"   {alert}")
else:
    print("   ✅ All metrics within thresholds")

print("\n🚨 Alert Check (v2.1):")
alerts_v21 = monitor.check_alerts("v2.1")
if alerts_v21:
    for alert in alerts_v21:
        print(f"   {alert}")
else:
    print("   ✅ All metrics within thresholds")

# A/B test analysis
print("\n📈 A/B Test Results:")
print("-" * 80)
ab_results = ab_test.analyze_test()
print(f"Model A ({ab_results['model_a']}): {ab_results['mean_a']:.2f}% accuracy")
print(f"Model B ({ab_results['model_b']}): {ab_results['mean_b']:.2f}% accuracy")
print(f"Improvement: {ab_results['improvement_pct']:+.2f}%")
print(f"T-statistic: {ab_results['t_statistic']:.2f}")
print(f"Confidence: {ab_results['confidence']}")
print(f"\n🎯 Decision: {ab_results['decision']}")
if ab_results['decision'] == "ROLLOUT":
    print("   ✅ New model shows statistically significant improvement")
    print("   → Gradually increase traffic: 10% → 25% → 50% → 100%")
else:
    print("   ⚠️ Improvement not significant enough")
    print("   → Keep monitoring, need more data or larger improvement")

# Visualizations
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Latency comparison
latencies_v20 = monitor.metrics["v2.0_latency"]
latencies_v21 = monitor.metrics["v2.1_latency"]

ax1.hist(latencies_v20, bins=30, alpha=0.6, label='v2.0', color='#3498db', edgecolor='black')
ax1.hist(latencies_v21, bins=30, alpha=0.6, label='v2.1', color='#2ecc71', edgecolor='black')
ax1.axvline(np.percentile(latencies_v20, 95), color='#3498db', linestyle='--', linewidth=2, 
            label=f'v2.0 p95: {np.percentile(latencies_v20, 95):.0f}ms')
ax1.axvline(np.percentile(latencies_v21, 95), color='#2ecc71', linestyle='--', linewidth=2,
            label=f'v2.1 p95: {np.percentile(latencies_v21, 95):.0f}ms')
ax1.set_xlabel('Latency (ms)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Frequency', fontsize=12, fontweight='bold')
ax1.set_title('Prediction Latency Distribution', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot 2: Error rate over time
window_size = 50
errors_v20 = monitor.metrics["v2.0_errors"]
errors_v21 = monitor.metrics["v2.1_errors"]

# Trailing moving average over the last `window_size` requests
error_rate_v20 = [np.mean(errors_v20[max(0, i - window_size + 1):i + 1]) * 100 
                  for i in range(len(errors_v20))]
error_rate_v21 = [np.mean(errors_v21[max(0, i - window_size + 1):i + 1]) * 100 
                  for i in range(len(errors_v21))]

ax2.plot(error_rate_v20, linewidth=2, label='v2.0', color='#3498db', alpha=0.8)
ax2.plot(error_rate_v21, linewidth=2, label='v2.1', color='#2ecc71', alpha=0.8)
ax2.axhline(y=1.0, color='red', linestyle='--', linewidth=2, label='Alert threshold (1%)')
ax2.set_xlabel('Prediction Number', fontsize=12, fontweight='bold')
ax2.set_ylabel('Error Rate (%)', fontsize=12, fontweight='bold')
ax2.set_title('Error Rate Over Time (50-request moving average)', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

# Plot 3: Accuracy comparison (boxplot)
accuracy_v20 = ab_test.results["v2.0"]
accuracy_v21 = ab_test.results["v2.1"]

box_data = [accuracy_v20, accuracy_v21]
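bp = ax3.boxplot(box_data, patch_artist=True,
                 boxprops=dict(facecolor='lightblue', alpha=0.7),
                 medianprops=dict(color='red', linewidth=2),
                 whiskerprops=dict(linewidth=1.5),
                 capprops=dict(linewidth=1.5))
# Set tick labels separately: the boxplot `labels` keyword is deprecated in
# Matplotlib 3.9+ in favor of `tick_labels`
ax3.set_xticklabels(['v2.0', 'v2.1'])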

# Color boxes
bp['boxes'][0].set_facecolor('#3498db')
bp['boxes'][1].set_facecolor('#2ecc71')

ax3.set_ylabel('Accuracy (%)', fontsize=12, fontweight='bold')
ax3.set_title('Model Accuracy Comparison (A/B Test)', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3, axis='y')

# Add improvement annotation (boxes sit at x=1 for v2.0 and x=2 for v2.1)
improvement_pct = ab_results['improvement_pct']
ax3.annotate(f'{improvement_pct:+.2f}% improvement', 
             xy=(2, np.mean(accuracy_v21)), xytext=(1.3, np.mean(accuracy_v21) + 0.5),
             fontsize=11, color='green', fontweight='bold',
             arrowprops=dict(arrowstyle='->', color='green', lw=2))

# Plot 4: Traffic split visualization (observed request counts per version)
traffic_data = {'v2.0': len(latencies_v20), 'v2.1': len(latencies_v21)}
colors_pie = ['#3498db', '#2ecc71']

wedges, texts, autotexts = ax4.pie(traffic_data.values(), labels=traffic_data.keys(), 
                                     autopct='%1.0f%%', startangle=90, colors=colors_pie,
                                     textprops={'fontsize': 12, 'fontweight': 'bold'})

for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontsize(14)
    autotext.set_fontweight('bold')

ax4.set_title('A/B Test Traffic Split (1000 Predictions)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('production_monitoring_ab_testing.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n📊 Monitoring Dashboard Metrics:")
print("-" * 80)
print(f"v2.0 Performance:")
print(f"  • Latency: p50={np.percentile(latencies_v20, 50):.0f}ms, p95={np.percentile(latencies_v20, 95):.0f}ms, p99={np.percentile(latencies_v20, 99):.0f}ms")
print(f"  • Error rate: {np.mean(errors_v20) * 100:.3f}%")
print(f"  • Accuracy: {np.mean(accuracy_v20):.2f}%")
print(f"  • Requests served: {len(latencies_v20)} predictions")

print(f"\nv2.1 Performance:")
print(f"  • Latency: p50={np.percentile(latencies_v21, 50):.0f}ms, p95={np.percentile(latencies_v21, 95):.0f}ms, p99={np.percentile(latencies_v21, 99):.0f}ms")
print(f"  • Error rate: {np.mean(errors_v21) * 100:.3f}%")
print(f"  • Accuracy: {np.mean(accuracy_v21):.2f}%")
print(f"  • Requests served: {len(latencies_v21)} predictions")

print("\n🏭 Post-Silicon Validation Monitoring:")
print("  Metrics tracked:")
print("  • Yield prediction MAE (target: <2.5%)")
print("  • Wafer map rendering time (target: <100ms)")
print("  • Test equipment API latency (target: <50ms)")
print("  • Model refresh rate (retrain weekly with new fab data)")
print("  ")
print("  A/B testing strategy:")
print("  • Shadow mode: Run v2.1 alongside v2.0, compare offline")
print("  • Canary: Route 10% wafers to v2.1 for 24 hours")
print("  • Gradual rollout: 10% → 25% → 50% → 100% over 1 week")
print("  • Rollback plan: Instant rollback if accuracy drops >1%")

## 🔑 Key Takeaways

### Deployment Strategy Decision Matrix

| Requirement | Strategy | Tech Stack | Example |
|-------------|----------|------------|---------|
| **Simple API** (< 100 req/day) | Single VM | Flask + gunicorn | Internal tool |
| **Medium Scale** (1K-10K req/sec) | Docker + K8s | FastAPI + Uvicorn | B2B API |
| **High Scale** (10K+ req/sec) | Distributed | TensorFlow Serving + Load Balancer | Consumer app |
| **Ultra-Low Latency** (<10ms) | Custom C++/Rust | gRPC + Redis cache | Fraud detection |
| **Batch Processing** | Scheduled jobs | Spark + Airflow | Nightly retraining |
| **Edge Deployment** | Model optimization | TFLite, ONNX | Mobile app |

### Model Serving Patterns

**1. REST API (Most Common)**
- Pros: Language-agnostic, easy integration, HTTP tooling
- Cons: Higher per-request overhead than gRPC (text-based JSON payloads, typically over HTTP/1.1)
- Use for: B2B APIs, internal services

**2. gRPC**
- Pros: Compact binary Protobuf messages over HTTP/2 (often quoted as 2-5x faster than REST, workload-dependent), streaming support
- Cons: Requires proto definitions, less tooling
- Use for: Microservices, high-throughput systems

**3. Batch Inference**
- Pros: High throughput (100-1000x), cost-efficient
- Cons: Not real-time, latency in hours
- Use for: Nightly scoring, recommendations precomputation

**4. Streaming**
- Pros: Real-time, stateful processing
- Cons: Complex infrastructure (Kafka, Flink)
- Use for: Fraud detection, real-time analytics

### Infrastructure Checklist ✅

**Before Production:**
- [ ] Model versioning (MLflow, DVC)
- [ ] API documentation (OpenAPI/Swagger)
- [ ] Input validation (Pydantic, JSON schema)
- [ ] Error handling (graceful degradation)
- [ ] Logging (structured JSON logs)
- [ ] Monitoring (Prometheus, DataDog)
- [ ] Alerting (PagerDuty, Slack)
- [ ] Load testing (Locust, JMeter)
- [ ] Security (API keys, rate limiting)
- [ ] Docker image (<1GB, multi-stage build)
- [ ] K8s manifests (deployment, service, HPA)
- [ ] CI/CD pipeline (GitHub Actions, Jenkins)
- [ ] Rollback plan (blue-green, canary)
- [ ] Documentation (runbook, troubleshooting)
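
Two of the checklist items, input validation and structured JSON logging, can be sketched with the standard library alone. The checklist recommends Pydantic; this sketch swaps in a plain dataclass so it runs without extra dependencies. The schema (`die_id`, `params`), the constant `N_PARAMS`, and the helper names are all hypothetical and chosen only for illustration.

```python
import json
import math
from dataclasses import dataclass

N_PARAMS = 512  # assumed number of test parameters per die (hypothetical)


@dataclass(frozen=True)
class PredictionRequest:
    """Hypothetical request schema for a per-die prediction."""
    die_id: str
    params: tuple

    def __post_init__(self):
        # Reject malformed inputs before they ever reach the model
        if not self.die_id:
            raise ValueError("die_id must be non-empty")
        if len(self.params) != N_PARAMS:
            raise ValueError(f"expected {N_PARAMS} params, got {len(self.params)}")
        for i, value in enumerate(self.params):
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"param {i} is not numeric")
            if not math.isfinite(value):
                raise ValueError(f"param {i} is not finite")


def log_record(event: str, **fields) -> str:
    """Return one structured log line as JSON with sorted keys."""
    return json.dumps({"event": event, **fields}, sort_keys=True)


def handle(raw: dict) -> str:
    """Validate a raw payload and return the log line describing the outcome."""
    try:
        req = PredictionRequest(str(raw.get("die_id", "")), tuple(raw.get("params", ())))
    except ValueError as exc:
        return log_record("rejected", reason=str(exc))
    return log_record("accepted", die_id=req.die_id, n_params=len(req.params))
```

Calling `handle({"die_id": "D1", "params": [0.1] * 512})` yields an `accepted` record, while a payload with a missing or non-finite parameter yields a `rejected` record with the reason attached, which is the kind of line a log aggregator can filter on.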

### Performance Optimization Tips ⚡

**Model Optimization:**
```python
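# NOTE: 'model/', 'model.onnx', `features`, `model`, and `batch_inputs` are
# placeholders for your own artifacts and data.

# 1. Quantization (4x smaller, 2-3x faster)
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model('model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# 2. ONNX Runtime (1.5-3x faster inference)
import numpy as np
import onnxruntime as ort
session = ort.InferenceSession('model.onnx')
input_name = session.get_inputs()[0].name  # read the input name from the model
# Assumes `features` is a NumPy array and the model expects float32 input
predictions = session.run(None, {input_name: features.astype(np.float32)})

# 3. Batch inference (10-100x throughput)
batch_size = 32  # Stack 32 requests into `batch_inputs` and score them together
predictions = model.predict(batch_inputs)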
```
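
To see where the "4x smaller" figure for quantization comes from, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization. It is an illustration of the arithmetic only, not how TFLite implements it, and the example weights are made up.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [scale * v for v in q]


weights = [0.5, -1.27, 0.003, 1.0, -0.25]  # made-up example weights
fp32 = array("f", weights)                 # 4 bytes per weight
q, scale = quantize_int8(weights)          # 1 byte per weight
restored = dequantize(q, scale)

size_ratio = (fp32.itemsize * len(fp32)) / (q.itemsize * len(q))
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The storage ratio is 4 because each weight drops from four bytes to one, and the rounding error per weight is bounded by half of `scale`. Small weights such as `0.003` collapse to zero, which is the source of the accuracy loss that quantization trades for size.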

**Infrastructure Optimization:**
- **Caching**: Redis for hot features (10-100x speedup)
- **Load balancing**: Consistent hashing for cache affinity
- **Auto-scaling**: HPA based on custom metrics (queue depth)
- **GPU acceleration**: 10-100x for deep learning models
- **Model pruning**: Remove 30-50% weights with <1% accuracy loss
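
The "consistent hashing for cache affinity" item can be sketched as a hash ring: each key maps to the next node clockwise, so adding a pod only moves the keys that the new pod takes over. The class name, pod names, and the choice of 100 virtual nodes per pod are assumptions for illustration.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Map keys to nodes so that adding or removing a node moves few keys."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position on the ring."""
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"wafer-{i}" for i in range(1000)]  # hypothetical cache keys
ring3 = ConsistentHashRing(["pod-a", "pod-b", "pod-c"])
ring4 = ConsistentHashRing(["pod-a", "pod-b", "pod-c", "pod-d"])
moved = sum(ring3.node_for(k) != ring4.node_for(k) for k in keys)
```

With a plain `hash(key) % n_pods` scheme, going from 3 to 4 pods would remap most keys and empty the caches; on the ring, `moved` stays near a quarter of the keys, and every moved key lands on the new pod.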

### Common Deployment Pitfalls ⚠️

1. **No versioning**: Can't rollback when issues arise
   - Solution: Tag every model (v1.0, v1.1), store in registry

2. **Insufficient monitoring**: Can't debug production issues
   - Solution: Log every prediction with metadata, track latency/errors

3. **No input validation**: Model crashes on unexpected inputs
   - Solution: Use Pydantic, validate ranges, handle missing values

4. **Tight coupling**: Model server depends on 10 other services
   - Solution: Decouple with message queues, implement circuit breakers

5. **No A/B testing**: Deploy new model to 100% traffic immediately
   - Solution: Shadow mode → Canary (10%) → Gradual rollout

6. **Ignoring latency**: Only optimize for accuracy
   - Solution: Balance accuracy vs latency, use simpler models if needed

7. **Single point of failure**: One server crash = system down
   - Solution: Deploy 3+ replicas, use load balancer, auto-restart pods
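
The circuit breaker mentioned under pitfall 4 can be sketched in a few lines. This is a simplified version under stated assumptions: the thresholds, the class name, and the injectable `clock` (used so the behaviour can be tested without sleeping) are illustrative choices, not a reference implementation.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for `reset_after` seconds, then retry."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback(*args)  # open: skip the primary entirely
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = primary(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback(*args)
        self.failures = 0  # a success closes the circuit again
        return result
```

A caller would wrap the model call as `breaker.call(model_predict, rule_based_score, features)`, where both function names are placeholders. While the circuit is open the failing service receives no traffic at all, which gives it room to recover instead of being hammered by retries.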

### Post-Silicon Validation Deployment Best Practices

**Model Deployment:**
- Deploy yield predictor to fab internal network (isolated from internet)
- Use private container registry (Harbor, Artifactory)
- Model update frequency: Weekly (trained on latest 1M STDF records)
- Rollback capability: Keep last 3 model versions available

**Infrastructure:**
- Kubernetes cluster on-premises (3 control + 5 worker nodes)
- PostgreSQL for feature store (test parameters, historical yield)
- Redis for caching hot wafer data (last 1000 wafers)
- Prometheus + Grafana for fab-specific dashboards

**Monitoring:**
- Track prediction MAE per lot, wafer, die location
- Alert if MAE >5% for any lot (indicates equipment drift)
- Dashboard: Real-time yield predictions, wafer maps, equipment health
- Audit trail: Log every prediction with STDF file ID for traceability

**Security:**
- API key authentication for test equipment access
- Rate limiting: 100 req/sec per equipment ID
- Network isolation: VPN required for external access
- Data retention: 90 days for predictions, GDPR compliant
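
The "100 req/sec per equipment ID" limit can be sketched with a token bucket per equipment ID. The class names, the burst capacity equal to the rate, and the injectable `clock` are assumptions made for this illustration; a production setup would more likely enforce the limit at a gateway or in a shared store.

```python
import time


class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each request spends one."""

    def __init__(self, rate=100.0, capacity=100.0, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerEquipmentLimiter:
    """Keep one independent bucket per equipment ID."""

    def __init__(self, rate=100.0, clock=time.monotonic):
        self.rate, self.clock = rate, clock
        self.buckets = {}

    def allow(self, equipment_id: str) -> bool:
        bucket = self.buckets.get(equipment_id)
        if bucket is None:
            bucket = self.buckets[equipment_id] = TokenBucket(self.rate, self.rate, self.clock)
        return bucket.allow()
```

An API handler would call `limiter.allow(equipment_id)` and answer with HTTP 429 when it returns `False`. Because each tester has its own bucket, one misbehaving station cannot starve the others.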

### Next Steps 🚀

**Master Deployment:**
1. **Practice**: Deploy simple model to Heroku/AWS Lambda
2. **Build**: Create Docker + FastAPI + Kubernetes pipeline
3. **Monitor**: Set up Prometheus + Grafana dashboards
4. **Optimize**: Load test, profile, optimize latency

**Continue Learning:**
- **Next**: `082_Production_RAG_Systems.ipynb` - Deploy LLM applications
- **Advanced**: MLOps practices, feature stores, model governance
- **Read**: "Building Machine Learning Powered Applications" by Emmanuel Ameisen

---

**Congratulations!** 🎉 You now understand production ML deployment from Docker containerization to Kubernetes orchestration, monitoring, A/B testing, and optimization. You can confidently deploy ML models that serve millions of predictions reliably.

## 🎯 Real-World Deployment Projects

### Project 1: Wafer Yield Predictor Production Deployment 🏭
**Objective:** Deploy real-time yield prediction model to 5 semiconductor fabs

**Architecture:**
- **Model**: Random Forest trained on 5M+ STDF records
- **Infrastructure**: AWS EKS (3-node cluster per fab)
- **Deployment**: Docker container, Kubernetes orchestration
- **API**: FastAPI serving predictions <50ms latency
- **Monitoring**: Prometheus + Grafana dashboards

**Deployment Pipeline:**
1. Train model on EMR Spark cluster (weekly)
2. Package model in Docker image
3. Push to ECR (Elastic Container Registry)
4. Canary deployment: 10% traffic for 24h
5. Full rollout if accuracy stable
6. Monitor: MAE, latency, throughput
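
Steps 4 and 5 of the pipeline leave open what "accuracy stable" means in practice. One possible gate, sketched below, promotes the canary only if its MAE stays under the 2.5% target and within a relative tolerance of the baseline. The function name, the 5% tolerance, and the minimum sample size are assumptions, not part of the project description.

```python
from statistics import mean


def canary_decision(baseline_errors, canary_errors, max_mae=2.5,
                    tolerance=0.05, min_samples=100):
    """Return 'promote', 'rollback', or 'wait' for a canary release.

    Both error lists hold per-prediction errors in percentage points.
    """
    if len(canary_errors) < min_samples:
        return "wait"  # not enough canary traffic to judge yet
    baseline_mae = mean(abs(e) for e in baseline_errors)
    canary_mae = mean(abs(e) for e in canary_errors)
    if canary_mae > max_mae or canary_mae > baseline_mae * (1 + tolerance):
        return "rollback"
    return "promote"
```

For example, a canary with MAE 2.05 against a baseline of 2.0 is promoted, while a canary at 2.3 is rolled back because it exceeds the baseline by more than 5%.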

**Success Metrics:**
- Prediction accuracy: >95% (MAE <2.5%)
- Latency: p95 <100ms
- Availability: 99.9% uptime
- Cost: <$500/month per fab

### Project 2: Customer Churn Prediction API 📱
**Objective:** Deploy churn prediction model for 10M+ users

**Tech Stack:**
- Model: XGBoost (150MB model file)
- Serving: FastAPI + Uvicorn REST API (or Triton's FIL backend, which loads XGBoost models natively)
- Infrastructure: GCP GKE, 5-20 pods (auto-scaled)
- Caching: Redis for hot user features
- Database: BigQuery for feature store

**Deployment Strategy:**
- Blue-green deployment (zero downtime)
- Shadow mode testing (1 week)
- A/B test: 20% traffic to new model
- Gradual rollout: 20% → 50% → 100%

**Monitoring:**
- Prediction volume: 50K/hour peak
- False positive rate (alert if >5%)
- Model drift detection (monthly retraining)

### Project 3: Fraud Detection Real-Time Scoring 💳
**Objective:** Score transactions <100ms for fraud detection

**Requirements:**
- Ultra-low latency: <100ms p99
- High throughput: 10K transactions/sec
- Model updates: Daily retraining
- Feature freshness: Real-time aggregations

**Architecture:**
- Model: LightGBM (20MB, fast inference)
- Serving: Custom C++ inference server
- Load balancing: NGINX (round-robin)
- Feature store: Redis with 1-hour TTL
- Deployment: Kubernetes with GPU nodes

**Scaling:**
- 50 pods during business hours
- 10 pods overnight
- Auto-scale based on queue depth
- Circuit breaker: Fallback to rule-based scoring

### Project 4: Recommendation System for E-Commerce 🛒
**Objective:** Serve personalized product recommendations at scale

**Challenge:**
- 1M+ products, 10M+ users
- Model size: 2GB (embeddings)
- Latency requirement: <200ms
- Personalization: Real-time user context

**Solution:**
- Model: Two-tower neural network (TensorFlow)
- Serving: TensorFlow Serving with batch inference
- Caching: Multi-layer (CDN → Redis → Model)
- Infrastructure: AWS SageMaker multi-model endpoints

**Optimization:**
- Quantize model (FP32 → INT8, 4x smaller)
- Batch predictions (10 users at once)
- Cache top-N recommendations (1-hour TTL)
- Prefetch for active users
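
The "cache top-N recommendations (1-hour TTL)" step can be sketched with an in-process dictionary standing in for Redis. The class name, the `compute` callback, and the injectable `clock` are hypothetical; Redis itself would handle expiry on the server side.

```python
import time


class TTLCache:
    """Cache computed values and recompute them once they are older than `ttl_seconds`."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl, self.clock = ttl_seconds, clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]         # fresh entry: skip the model call
        value = compute(key)      # miss or expired: call the model
        self._store[key] = (now, value)
        return value
```

A request handler would call `cache.get_or_compute(user_id, recommend_top_n)`, where `recommend_top_n` is a placeholder for the model call. Within the hour, repeated requests for the same user never reach the model, which is where most of the latency budget is saved.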

### 📊 Production Monitoring & A/B Testing

**Purpose:** Monitor model performance in production and safely test new model versions

**Key Points:**
- **Metrics**: Track prediction latency (p50/p95/p99), throughput, error rate, model accuracy
- **Logging**: Structured JSON logs for debugging and audit trails
- **Alerting**: PagerDuty/Slack alerts for latency >500ms, error rate >1%, accuracy drift >5%
- **A/B Testing**: Route 10% traffic to new model (v2.1), compare metrics, gradual rollout

**Post-Silicon Use Case:** A/B test new yield prediction model v2.1 vs v2.0 - route 10% wafer data to new model, compare accuracy on 1000 wafers, roll out if accuracy improves >2%
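
Routing "10% of wafer data" is often done deterministically rather than with a random draw, so that a given wafer always hits the same model version. A minimal sketch, assuming wafers carry a string ID; the function name and the ID format are hypothetical.

```python
import hashlib


def assign_variant(wafer_id: str, canary_fraction: float = 0.10,
                   control: str = "v2.0", canary: str = "v2.1") -> str:
    """Map a wafer ID to a model version; the same ID always gets the same version."""
    digest = hashlib.sha256(wafer_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return canary if bucket < canary_fraction else control


wafer_ids = [f"W{i:05d}" for i in range(10_000)]  # hypothetical wafer IDs
canary_share = sum(assign_variant(w) == "v2.1" for w in wafer_ids) / len(wafer_ids)
```

Raising `canary_fraction` from 0.10 to 0.25 keeps every wafer that was already on v2.1 there and only adds new ones, which makes a gradual rollout easier to reason about than independent random draws per request.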

### ☸️ Kubernetes Deployment & Scaling

**Purpose:** Deploy ML model on Kubernetes with auto-scaling, rolling updates, and high availability

**Key Points:**
- **Deployment**: Manages a ReplicaSet that keeps 3+ pod replicas running for redundancy
- **Service**: ClusterIP for internal access, LoadBalancer for external
- **HorizontalPodAutoscaler**: Scale based on CPU (70% target) or custom metrics (request rate)
- **Rolling Update**: Zero-downtime deployments with readiness probes

**Post-Silicon Use Case:** Deploy yield predictor to 5-node K8s cluster serving 10 fab test stations, auto-scale 2-10 pods based on incoming wafer data volume
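
The scaling rule the HorizontalPodAutoscaler applies is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance of 1 (0.1 by default). The sketch below reproduces that arithmetic with the 2-10 pod bounds from the use case; the function name is hypothetical, and the real controller additionally handles stabilization windows and pods that are not yet ready.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the HPA formula, clamped to [min, max]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With a 70% CPU target, 3 pods averaging 90% scale to `ceil(3 * 90 / 70) = 4`, while 4 pods averaging 72% stay at 4 because the ratio falls inside the tolerance band.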

## 🐳 Part 4: Production Deployment Implementation

### Docker Containerization for ML Models