# 135: GitOps for ML - ArgoCD and Flux

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** GitOps principles (Git as single source of truth, declarative infrastructure)
- **Implement** ArgoCD for continuous deployment (application sync, automated rollback)
- **Build** Flux workflows for ML model deployment (GitRepository, Kustomization)
- **Apply** GitOps to post-silicon validation pipelines (STDF processing, wafer analysis)
- **Master** progressive delivery patterns (canary with Flagger, blue-green deployments)
- **Deploy** production ML systems with automated reconciliation (drift detection, self-healing)

## 📚 What is GitOps?

**GitOps** is an operational framework where **Git repositories are the single source of truth** for infrastructure and application configuration. Instead of manually running `kubectl apply` or clicking deploy buttons, every change is committed to Git, and automated agents (ArgoCD, Flux) continuously reconcile the cluster state with the Git repository. This creates an **audit trail**, enables **instant rollback** (revert Git commit), and ensures **environment consistency** (dev/staging/prod configurations in Git).

**Traditional Deployment** (Push-based):
```
Developer → CI Pipeline → kubectl apply → Kubernetes Cluster
  Problem: CI needs cluster credentials (security risk)
  Problem: No drift detection (manual changes not caught)
  Problem: Rollback requires rebuilding previous state
```

**GitOps Deployment** (Pull-based):
```
Developer → Git Commit → Git Repository ← ArgoCD/Flux pulls → Kubernetes Cluster
  Benefit: Cluster credentials never leave cluster (secure)
  Benefit: Automatic drift detection (periodic reconciliation, e.g. every 3 minutes, ArgoCD's default polling interval)
  Benefit: Instant rollback (git revert commit)
```

**Why GitOps?**
- ✅ **Single Source of Truth**: Git repository defines cluster state (no "it works on my machine")
- ✅ **Audit Trail**: Every change tracked in Git history (who deployed what, when, why)
- ✅ **Instant Rollback**: `git revert` → automatic rollback to previous working state (<60 seconds)
- ✅ **Drift Detection**: Reconciliation loops detect manual changes (restore Git state automatically)
- ✅ **Multi-Environment Consistency**: Same GitOps workflow for dev/staging/prod (environment-specific overlays)
- ✅ **Disaster Recovery**: Rebuild entire cluster from Git repository (infrastructure as code)
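
The audit-trail and rollback points can be illustrated with a minimal sketch. It assumes a hypothetical `Commit` record and `revert_last` helper (neither belongs to any Git library) and shows that a revert appends a new commit restoring the earlier state, so the history of what was deployed stays intact.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Commit:
    """Hypothetical stand-in for a Git commit that carries the desired state."""
    message: str
    state: Dict[str, str]


def revert_last(history: List[Commit]) -> List[Commit]:
    """Return a new history with one extra commit that restores the state before HEAD."""
    if len(history) < 2:
        raise ValueError("need at least two commits to revert")
    head, previous = history[-1], history[-2]
    undo = Commit(message=f'Revert "{head.message}"', state=dict(previous.state))
    return history + [undo]


history = [
    Commit("deploy model v1", {"image": "model:v1"}),
    Commit("deploy model v2", {"image": "model:v2"}),
]
history = revert_last(history)

# Three commits remain in the log; HEAD describes v1 again.
print(len(history), history[-1].state["image"])
```

Because the bad commit is still in the log, the history continues to answer "who deployed what, when, why" after the rollback.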

## 🏭 Post-Silicon Validation Use Cases

**Use Case 1: STDF Processing Pipeline GitOps Deployment**
- **Input**: STDF parser, feature extractor, outlier detector Kubernetes manifests in Git
- **Output**: ArgoCD syncs Git → cluster every 3 minutes (new model version auto-deployed)
- **Value**: Engineers commit model update → ArgoCD deploys → rollback in <60 seconds if issues detected
- **Business Impact**: **$220K/year savings** (eliminate manual deployment errors, reduce deployment time 90%)

**Use Case 2: Multi-Region Wafer Analysis with Flux**
- **Input**: Wafer map analyzer deployed to 3 regions (US, EU, Asia) with GitRepository CRD
- **Output**: Flux syncs same Git manifests → consistent deployments across regions
- **Value**: Update defect classification model v3.2 → all regions updated in parallel
- **Business Impact**: **$340K/year savings** (ensure model consistency, reduce regional deployment drift)

**Use Case 3: Canary Deployments with Flagger**
- **Input**: New yield prediction model v2.7 (99.2% accuracy vs 98.8% v2.6)
- **Output**: Flagger automatically shifts traffic 5% → 10% → 25% → 50% → 100% based on Prometheus metrics
- **Value**: Automatic rollback if accuracy drops below 99% or latency exceeds 150 ms (no human intervention)
- **Business Impact**: **$1.8M/year savings** (prevent bad model deployments, reduce downtime 95%)

**Use Case 4: Disaster Recovery for Test Infrastructure**
- **Input**: Entire post-silicon test infrastructure (10 microservices, 5 databases, 3 ML models) in Git
- **Output**: Cluster failure → rebuild from Git in 15 minutes (ArgoCD syncs all applications)
- **Value**: Resume wafer testing after infrastructure failure (minimal data loss)
- **Business Impact**: **$420K/year savings** (reduce RTO from 8 hours → 15 minutes, prevent test delays)

## 🔄 GitOps Workflow

```mermaid
graph LR
    A[Developer commits<br/>model update to Git] --> B[Git Repository<br/>single source of truth]
    B --> C[ArgoCD/Flux<br/>pulls changes<br/>every 3 min]
    C --> D[Kubernetes Cluster<br/>applies manifests]
    D --> E{Drift detected?}
    E -->|No| F[Cluster in sync]
    E -->|Yes| C
    
    G[Manual change<br/>kubectl edit] --> D
    
    H[Prometheus Metrics<br/>latency, accuracy] --> I[Flagger<br/>progressive delivery]
    I --> C
    
    style A fill:#e1f5ff
    style B fill:#ffe1e1
    style F fill:#e1ffe1
    style I fill:#fff4e1
```

## 📊 Learning Path Context

**Prerequisites:**
- **Notebook 131**: Docker for ML (containerization fundamentals)
- **Notebook 132**: Kubernetes Fundamentals (deployments, services)
- **Notebook 133**: Kubernetes Advanced (operators, CRDs)
- **Notebook 134**: Service Mesh (traffic management, observability)

**Next Steps:**
- **Notebook 136**: CI/CD for ML (Tekton pipelines, GitHub Actions)
- **Notebook 137**: Infrastructure as Code (Terraform for Kubernetes)
- **Notebook 138**: Container Security & Compliance (Falco, OPA)

---

Let's build GitOps systems for ML! 🚀

In [None]:
# Setup and Imports
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from pathlib import Path
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple
from enum import Enum
import json
import time
import uuid
import hashlib

# Visualization settings
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Random seed for reproducibility
np.random.seed(42)

print("✅ Setup complete - Ready for GitOps simulation")

## 2. 🔧 GitOps Fundamentals - Git as Single Source of Truth

### 📝 What's Happening in This Section?

**Purpose:** Simulate GitOps workflow where Git repository defines desired cluster state, and reconciliation loops detect drift.

**Key Points:**
- **Git Repository**: Stores Kubernetes manifests (YAML files) - single source of truth for cluster state
- **Desired State**: What Git says should exist (model-service v2.5 with 3 replicas)
- **Actual State**: What exists in cluster (might have manual changes: 5 replicas, v2.4)
- **Reconciliation Loop**: Compare desired vs actual every 3 minutes → fix drift automatically
- **Drift Detection**: Manual `kubectl scale` changes detected and reverted to Git state

**Why This Matters:**
- **Eliminates Configuration Drift**: Cluster always matches Git (no "mystery deployments")
- **Audit Trail**: Every change in Git history (who, what, when, why)
- **Disaster Recovery**: Rebuild cluster from Git repository in minutes
- **Rollback**: `git revert` → automatic rollback to previous working state

**Post-Silicon Application:** STDF pipeline deployment tracked in Git - any manual change reverted, ensuring consistent test infrastructure across dev/staging/prod.
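
The simulation in this section compares only replica counts and container images. As a hedged sketch of how the desired-vs-actual comparison could cover arbitrary fields, the hypothetical helper `diff_manifests` below walks nested dictionaries and reports every field defined in Git whose cluster value differs. It assumes manifests are plain dicts, compares lists as whole values, and ignores fields that exist only in the cluster.

```python
from typing import Any, Dict, List, Tuple


def diff_manifests(
    desired: Dict[str, Any], actual: Dict[str, Any], path: str = ""
) -> List[Tuple[str, Any, Any]]:
    """Return (path, desired_value, actual_value) for each Git-defined field that differs."""
    drifts: List[Tuple[str, Any, Any]] = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        have = actual.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse into nested sections such as template.spec
            drifts.extend(diff_manifests(want, have, here))
        elif want != have:
            drifts.append((here, want, have))
    return drifts


# Desired state from Git vs. a cluster that was edited by hand
desired = {"replicas": 3, "template": {"spec": {"containers": [{"image": "stdf-parser:v2.5"}]}}}
actual = {"replicas": 5, "template": {"spec": {"containers": [{"image": "stdf-parser:v2.4"}]}}}

for field_path, want, have in diff_manifests(desired, actual):
    print(f"{field_path}: git={want} cluster={have}")
```

Checking only the fields that Git defines mirrors the idea that Git owns the desired state, while values the cluster fills in on its own are not treated as drift.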

In [None]:
# GitOps Fundamentals - Git Repository and Reconciliation

class SyncStatus(Enum):
    """GitOps sync status"""
    SYNCED = "Synced"  # Cluster matches Git
    OUT_OF_SYNC = "OutOfSync"  # Cluster differs from Git
    SYNCING = "Syncing"  # Reconciliation in progress
    DEGRADED = "Degraded"  # Sync failed

@dataclass
class GitCommit:
    """Git commit representing infrastructure change"""
    commit_hash: str
    author: str
    message: str
    timestamp: datetime
    manifests: Dict[str, Dict]  # Kubernetes manifests (deployment, service, etc.)
    
    def get_manifest(self, name: str) -> Optional[Dict]:
        """Get specific manifest from commit"""
        return self.manifests.get(name)

@dataclass
class KubernetesResource:
    """Simulated Kubernetes resource"""
    name: str
    kind: str  # Deployment, Service, ConfigMap, etc.
    namespace: str
    spec: Dict
    status: Dict = field(default_factory=dict)
    
    def get_replicas(self) -> int:
        """Get replica count (for Deployments)"""
        return self.spec.get('replicas', 0)
    
    def get_image(self) -> str:
        """Get container image version"""
        containers = self.spec.get('template', {}).get('spec', {}).get('containers', [])
        return containers[0].get('image', '') if containers else ''

class GitRepository:
    """Simulated Git repository storing Kubernetes manifests"""
    
    def __init__(self, repo_url: str):
        self.repo_url = repo_url
        self.commits: List[GitCommit] = []
        self.current_commit_index = -1
    
    def commit(self, author: str, message: str, manifests: Dict[str, Dict]) -> GitCommit:
        """Create new commit with Kubernetes manifests"""
        commit = GitCommit(
            commit_hash=hashlib.sha1(f"{time.time()}".encode()).hexdigest()[:8],
            author=author,
            message=message,
            timestamp=datetime.now(),
            manifests=manifests
        )
        self.commits.append(commit)
        self.current_commit_index = len(self.commits) - 1
        return commit
    
    def get_current_commit(self) -> Optional[GitCommit]:
        """Get HEAD commit"""
        if self.current_commit_index >= 0:
            return self.commits[self.current_commit_index]
        return None
    
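    def revert_to_commit(self, commit_hash: str) -> bool:
        """Roll back by pointing HEAD at an earlier commit.

        Simplification: a real `git revert` adds a new commit that undoes the
        change; this simulation only moves the HEAD index and keeps history intact.
        """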
        for i, commit in enumerate(self.commits):
            if commit.commit_hash == commit_hash:
                self.current_commit_index = i
                return True
        return False
    
    def get_history(self) -> List[GitCommit]:
        """Get commit history"""
        return self.commits

class ClusterState:
    """Simulated Kubernetes cluster state"""
    
    def __init__(self, cluster_name: str):
        self.cluster_name = cluster_name
        self.resources: Dict[str, KubernetesResource] = {}
    
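    def apply_resource(self, resource: KubernetesResource):
        """Apply Kubernetes resource (kubectl apply)"""
        import copy  # local import keeps this method self-contained
        key = f"{resource.namespace}/{resource.name}"
        # Store a deep copy: the resource spec is built from the Git manifest dict,
        # so sharing it would let manual_edit() silently change the Git state too
        self.resources[key] = copy.deepcopy(resource)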
    
    def get_resource(self, namespace: str, name: str) -> Optional[KubernetesResource]:
        """Get resource from cluster"""
        key = f"{namespace}/{name}"
        return self.resources.get(key)
    
    def manual_edit(self, namespace: str, name: str, new_spec: Dict):
        """Simulate manual cluster change (kubectl edit) - creates drift"""
        resource = self.get_resource(namespace, name)
        if resource:
            resource.spec.update(new_spec)
            print(f"⚠️ Manual change detected: {name} edited directly in cluster (drift created)")

class GitOpsReconciler:
    """GitOps reconciliation loop (ArgoCD/Flux agent)"""
    
    def __init__(self, git_repo: GitRepository, cluster: ClusterState, sync_interval: int = 180):
        self.git_repo = git_repo
        self.cluster = cluster
        self.sync_interval = sync_interval  # seconds (default 3 minutes)
        self.sync_history: List[Dict] = []
    
    def detect_drift(self) -> List[Tuple[str, str, str]]:
        """Compare Git (desired) vs Cluster (actual) state"""
        drifts = []
        current_commit = self.git_repo.get_current_commit()
        
        if not current_commit:
            return drifts
        
        for resource_name, manifest in current_commit.manifests.items():
            namespace = manifest.get('metadata', {}).get('namespace', 'default')
            name = manifest.get('metadata', {}).get('name', resource_name)
            cluster_resource = self.cluster.get_resource(namespace, name)
            
            if not cluster_resource:
                drifts.append((name, "missing", "Resource in Git but not in cluster"))
                continue
            
            # Check replica drift
            git_replicas = manifest.get('spec', {}).get('replicas', 0)
            cluster_replicas = cluster_resource.get_replicas()
            if git_replicas != cluster_replicas:
                drifts.append((name, "replicas", f"Git: {git_replicas}, Cluster: {cluster_replicas}"))
            
            # Check image drift
            git_containers = manifest.get('spec', {}).get('template', {}).get('spec', {}).get('containers', [])
            git_image = git_containers[0].get('image', '') if git_containers else ''
            cluster_image = cluster_resource.get_image()
            if git_image and git_image != cluster_image:
                drifts.append((name, "image", f"Git: {git_image}, Cluster: {cluster_image}"))
        
        return drifts
    
    def reconcile(self) -> SyncStatus:
        """Reconcile cluster state with Git repository"""
        print(f"\n🔄 Reconciliation started at {datetime.now().strftime('%H:%M:%S')}")
        
        current_commit = self.git_repo.get_current_commit()
        if not current_commit:
            print("❌ No commits in Git repository")
            return SyncStatus.DEGRADED
        
        drifts = self.detect_drift()
        
        if not drifts:
            print("✅ Cluster state matches Git (no drift detected)")
            self.sync_history.append({
                'timestamp': datetime.now(),
                'status': SyncStatus.SYNCED,
                'commit': current_commit.commit_hash,
                'drifts_fixed': 0
            })
            return SyncStatus.SYNCED
        
        print(f"⚠️ Detected {len(drifts)} drift(s):")
        # Loop names avoid shadowing dataclasses.field
        for drifted_name, drift_field, details in drifts:
            print(f"  - {drifted_name}: {drift_field} ({details})")
        
        # Fix drifts: Apply Git state to cluster
        print("\n🔧 Fixing drifts (applying Git state)...")
        for resource_name, manifest in current_commit.manifests.items():
            namespace = manifest.get('metadata', {}).get('namespace', 'default')
            name = manifest.get('metadata', {}).get('name', resource_name)
            kind = manifest.get('kind', 'Unknown')
            
            resource = KubernetesResource(
                name=name,
                kind=kind,
                namespace=namespace,
                spec=manifest.get('spec', {})
            )
            self.cluster.apply_resource(resource)
            print(f"  ✅ {name} synced to Git state")
        
        self.sync_history.append({
            'timestamp': datetime.now(),
            'status': SyncStatus.SYNCED,
            'commit': current_commit.commit_hash,
            'drifts_fixed': len(drifts)
        })
        
        print(f"\n✅ Reconciliation complete - {len(drifts)} drift(s) fixed")
        return SyncStatus.SYNCED
    
    def get_sync_status(self) -> Dict:
        """Get current sync status"""
        drifts = self.detect_drift()
        return {
            'status': SyncStatus.SYNCED if not drifts else SyncStatus.OUT_OF_SYNC,
            'drifts': len(drifts),
            'last_sync': self.sync_history[-1]['timestamp'] if self.sync_history else None
        }

# Example 1: Create Git repository with ML model deployment
print("=" * 70)
print("Example 1: Git Repository as Single Source of Truth")
print("=" * 70)

git_repo = GitRepository(repo_url="https://github.com/ml-team/stdf-pipeline-manifests.git")

# Initial commit: Deploy STDF parser v1.0 with 3 replicas
manifest_v1 = {
    'stdf-parser-deployment': {
        'apiVersion': 'apps/v1',
        'kind': 'Deployment',
        'metadata': {'name': 'stdf-parser', 'namespace': 'ml-inference'},
        'spec': {
            'replicas': 3,
            'template': {
                'spec': {
                    'containers': [
                        {'name': 'stdf-parser', 'image': 'ml-models/stdf-parser:v1.0'}
                    ]
                }
            }
        }
    }
}

commit1 = git_repo.commit(
    author="alice@company.com",
    message="feat: Deploy STDF parser v1.0 with 3 replicas",
    manifests=manifest_v1
)

print(f"\n✅ Commit {commit1.commit_hash}: {commit1.message}")
print(f"   Author: {commit1.author}")
print(f"   Manifests: {list(commit1.manifests.keys())}")

# Example 2: GitOps reconciliation (sync Git → Cluster)
print("\n" + "=" * 70)
print("Example 2: Reconciliation Loop - Sync Git to Cluster")
print("=" * 70)

cluster = ClusterState(cluster_name="production-us-west")
reconciler = GitOpsReconciler(git_repo=git_repo, cluster=cluster)

# First reconciliation: Apply Git state to empty cluster
status1 = reconciler.reconcile()

# Verify cluster state
resource = cluster.get_resource('ml-inference', 'stdf-parser')
print(f"\n📊 Cluster state after sync:")
print(f"   Replicas: {resource.get_replicas()}")
print(f"   Image: {resource.get_image()}")

# Example 3: Drift detection and auto-remediation
print("\n" + "=" * 70)
print("Example 3: Drift Detection - Manual Change Reverted")
print("=" * 70)

# Simulate manual change (engineer scales replicas to 5)
print("\n👤 Engineer manually scales replicas: kubectl scale deployment stdf-parser --replicas=5")
cluster.manual_edit('ml-inference', 'stdf-parser', {'replicas': 5})

# Check drift
drifts_before = reconciler.detect_drift()
print(f"\n⚠️ Drift detected: {len(drifts_before)} difference(s)")
# Loop names must not rebind the module-level `field` imported from dataclasses
for drifted_name, drift_field, details in drifts_before:
    print(f"   - {drifted_name}.{drift_field}: {details}")

# Reconciliation fixes drift
status2 = reconciler.reconcile()

# Verify drift fixed
resource_after = cluster.get_resource('ml-inference', 'stdf-parser')
print(f"\n📊 Cluster state after reconciliation:")
print(f"   Replicas: {resource_after.get_replicas()} (reverted to Git state: 3)")

print(f"\n✅ GitOps fundamentals demonstrated: Git is single source of truth!")


## 3. 🎯 ArgoCD - Declarative Continuous Deployment

### 📝 What's Happening in This Section?

**Purpose:** Implement ArgoCD application controller for automated deployment with health checks and rollback.

**Key Points:**
- **Application CRD**: Define application (Git repo URL, target namespace, sync policy)
- **Sync Policy**: Automated (auto-sync every 3 min) vs Manual (require approval)
- **Health Assessment**: Check deployment rollout status, pod health, service endpoints
- **Auto-Pruning**: Delete resources removed from Git (keep cluster clean)
- **Self-Healing**: Detect manual changes → revert to Git state automatically
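
Auto-pruning reduces to a set difference between what Git declares and what the cluster holds. The sketch below is illustrative only: `plan_sync` and the `legacy-parser` resource are hypothetical names, and it assumes every listed cluster key was originally created by this application.

```python
from typing import List, Set, Tuple


def plan_sync(
    git_keys: Set[str], cluster_keys: Set[str], auto_prune: bool = True
) -> Tuple[List[str], List[str]]:
    """Return (resources to apply, resources to prune) as sorted 'namespace/name' keys."""
    to_apply = sorted(git_keys)
    orphaned = sorted(cluster_keys - git_keys)  # present in cluster, removed from Git
    to_prune = orphaned if auto_prune else []
    return to_apply, to_prune


git_keys = {"ml-inference/stdf-parser", "ml-inference/feature-extractor"}
cluster_keys = git_keys | {"ml-inference/legacy-parser"}

to_apply, to_prune = plan_sync(git_keys, cluster_keys)
print("apply:", to_apply)
print("prune:", to_prune)
```

With `auto_prune=False` the orphaned resource is left running, which is the safer choice while a team is still migrating manifests into Git.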

**Why This Matters:**
- **Declarative Config**: Entire deployment defined in YAML (version controlled)
- **Multi-App Management**: Deploy 10+ applications from single ArgoCD instance
- **RBAC Integration**: Control who can sync which applications (team boundaries)
- **Visual UI**: See sync status, resource tree, deployment history (better than kubectl)

**Post-Silicon Application:** ArgoCD manages STDF pipeline (parser, feature extractor, outlier detector, yield predictor) - health checks ensure all services ready before production traffic.
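
For orientation, the sync policy, auto-pruning, and self-healing settings described above correspond to fields of a real ArgoCD `Application` manifest. The sketch below writes one as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. The application name, the `overlays/production` path, and the choice of the `argocd` namespace and `default` project are assumptions for illustration.

```python
import json

# Sketch of an ArgoCD Application manifest; name, path, and project are assumed values
application = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "stdf-pipeline", "namespace": "argocd"},
    "spec": {
        "project": "default",
        "source": {
            "repoURL": "https://github.com/ml-team/stdf-pipeline-manifests.git",
            "targetRevision": "HEAD",
            "path": "overlays/production",  # hypothetical directory in the repo
        },
        "destination": {
            "server": "https://kubernetes.default.svc",  # the cluster ArgoCD runs in
            "namespace": "ml-inference",
        },
        # Automated sync with pruning and self-healing enabled
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

print(json.dumps(application, indent=2))
```

Omitting the `automated` block leaves the application on manual sync, matching the Manual policy in the list above.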

In [None]:
# ArgoCD - Application Controller and Health Assessment

class HealthStatus(Enum):
    """ArgoCD health status"""
    HEALTHY = "Healthy"  # All resources ready
    PROGRESSING = "Progressing"  # Deployment in progress
    DEGRADED = "Degraded"  # Some resources unhealthy
    MISSING = "Missing"  # Resource not found
    SUSPENDED = "Suspended"  # Resource suspended
    UNKNOWN = "Unknown"  # Health status unknown

class SyncPolicy(Enum):
    """ArgoCD sync policy"""
    AUTOMATED = "Automated"  # Auto-sync every 3 minutes
    MANUAL = "Manual"  # Require manual approval

@dataclass
class ArgoCDApplication:
    """ArgoCD Application CRD"""
    name: str
    git_repo: GitRepository
    target_namespace: str
    sync_policy: SyncPolicy
    auto_prune: bool = True  # Delete resources removed from Git
    self_heal: bool = True  # Revert manual changes
    sync_interval: int = 180  # seconds
    
    # State
    last_sync_commit: Optional[str] = None
    health_status: HealthStatus = HealthStatus.UNKNOWN
    sync_status: SyncStatus = SyncStatus.OUT_OF_SYNC
    
    def should_sync(self) -> bool:
        """Check if app should sync (new commit or drift detected)"""
        current_commit = self.git_repo.get_current_commit()
        if not current_commit:
            return False
        
        # Sync if different commit
        if self.last_sync_commit != current_commit.commit_hash:
            return True
        
        # Sync if self-healing enabled and drift detected
        if self.self_heal and self.sync_status == SyncStatus.OUT_OF_SYNC:
            return True
        
        return False

@dataclass
class ResourceHealth:
    """Health status for individual Kubernetes resource"""
    resource_name: str
    kind: str
    health: HealthStatus
    message: str = ""
    
    def is_healthy(self) -> bool:
        return self.health == HealthStatus.HEALTHY

class ArgoCDController:
    """ArgoCD application controller"""
    
    def __init__(self, cluster: ClusterState):
        self.cluster = cluster
        self.applications: Dict[str, ArgoCDApplication] = {}
        self.sync_operations: List[Dict] = []
    
    def create_application(self, app: ArgoCDApplication):
        """Register ArgoCD application"""
        self.applications[app.name] = app
        print(f"✅ ArgoCD Application created: {app.name}")
        print(f"   Git Repo: {app.git_repo.repo_url}")
        print(f"   Sync Policy: {app.sync_policy.value}")
        print(f"   Auto-Prune: {app.auto_prune}, Self-Heal: {app.self_heal}")
    
    def assess_health(self, app: ArgoCDApplication) -> List[ResourceHealth]:
        """Check health of all resources in application"""
        health_results = []
        current_commit = app.git_repo.get_current_commit()
        
        if not current_commit:
            return health_results
        
        for resource_name, manifest in current_commit.manifests.items():
            namespace = manifest.get('metadata', {}).get('namespace', app.target_namespace)
            name = manifest.get('metadata', {}).get('name', resource_name)
            kind = manifest.get('kind', 'Unknown')
            
            cluster_resource = self.cluster.get_resource(namespace, name)
            
            if not cluster_resource:
                health_results.append(ResourceHealth(
                    resource_name=name,
                    kind=kind,
                    health=HealthStatus.MISSING,
                    message="Resource not found in cluster"
                ))
                continue
            
            # Check deployment health (simulate pod readiness)
            if kind == 'Deployment':
                desired_replicas = cluster_resource.get_replicas()
                # Simulate: 80% chance healthy, 15% progressing, 5% degraded
                rand_val = np.random.random()
                if rand_val < 0.80:
                    health = HealthStatus.HEALTHY
                    message = f"{desired_replicas}/{desired_replicas} pods ready"
                elif rand_val < 0.95:
                    health = HealthStatus.PROGRESSING
                    ready_pods = int(desired_replicas * 0.6)
                    message = f"{ready_pods}/{desired_replicas} pods ready (rollout in progress)"
                else:
                    health = HealthStatus.DEGRADED
                    message = "CrashLoopBackOff detected"
                
                health_results.append(ResourceHealth(
                    resource_name=name,
                    kind=kind,
                    health=health,
                    message=message
                ))
        
        return health_results
    
    def sync_application(self, app_name: str, force: bool = False) -> Dict:
        """Sync ArgoCD application (apply Git state to cluster)"""
        app = self.applications.get(app_name)
        if not app:
            return {'success': False, 'message': f'Application {app_name} not found'}
        
        print(f"\n🔄 Syncing application: {app_name}")
        
        current_commit = app.git_repo.get_current_commit()
        if not current_commit:
            return {'success': False, 'message': 'No commits in Git repository'}
        
        # Apply manifests to cluster
        resources_synced = []
        for resource_name, manifest in current_commit.manifests.items():
            namespace = manifest.get('metadata', {}).get('namespace', app.target_namespace)
            name = manifest.get('metadata', {}).get('name', resource_name)
            kind = manifest.get('kind', 'Unknown')
            
            resource = KubernetesResource(
                name=name,
                kind=kind,
                namespace=namespace,
                spec=manifest.get('spec', {})
            )
            self.cluster.apply_resource(resource)
            resources_synced.append(name)
            print(f"  ✅ {kind}/{name} synced")
        
        # Update application state
        app.last_sync_commit = current_commit.commit_hash
        app.sync_status = SyncStatus.SYNCED
        
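        # Assess health: a degraded resource outranks one that is still progressing
        health_results = self.assess_health(app)
        if all(h.is_healthy() for h in health_results):
            app.health_status = HealthStatus.HEALTHY
        elif any(h.health == HealthStatus.DEGRADED for h in health_results):
            app.health_status = HealthStatus.DEGRADED
        else:
            app.health_status = HealthStatus.PROGRESSING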
        
        sync_record = {
            'timestamp': datetime.now(),
            'app_name': app_name,
            'commit': current_commit.commit_hash,
            'resources_synced': len(resources_synced),
            'health': app.health_status.value
        }
        self.sync_operations.append(sync_record)
        
        print(f"\n✅ Sync complete: {len(resources_synced)} resource(s) synced")
        print(f"   Commit: {current_commit.commit_hash} - {current_commit.message}")
        print(f"   Health: {app.health_status.value}")
        
        return {'success': True, 'sync_record': sync_record}
    
    def get_application_status(self, app_name: str) -> Dict:
        """Get ArgoCD application status"""
        app = self.applications.get(app_name)
        if not app:
            return {}
        
        health_results = self.assess_health(app)
        
        return {
            'name': app.name,
            'sync_status': app.sync_status.value,
            'health_status': app.health_status.value,
            'last_sync_commit': app.last_sync_commit,
            'resources': [
                {
                    'name': h.resource_name,
                    'kind': h.kind,
                    'health': h.health.value,
                    'message': h.message
                }
                for h in health_results
            ]
        }
    
    def rollback_application(self, app_name: str, target_commit_hash: str) -> Dict:
        """Rollback application to previous Git commit"""
        app = self.applications.get(app_name)
        if not app:
            return {'success': False, 'message': f'Application {app_name} not found'}
        
        print(f"\n⏪ Rolling back {app_name} to commit {target_commit_hash}")
        
        # Revert Git repository
        if not app.git_repo.revert_to_commit(target_commit_hash):
            return {'success': False, 'message': f'Commit {target_commit_hash} not found'}
        
        # Sync to rollback commit
        sync_result = self.sync_application(app_name)
        
        if sync_result['success']:
            print(f"✅ Rollback complete - application reverted to commit {target_commit_hash}")
        
        return sync_result

# Example 1: Create ArgoCD application with automated sync
print("=" * 70)
print("Example 1: ArgoCD Application with Automated Sync")
print("=" * 70)

# Create new Git repo for wafer analysis service
git_repo_wafer = GitRepository(repo_url="https://github.com/ml-team/wafer-analysis-manifests.git")

# Initial deployment: Wafer analyzer v1.5 with 4 replicas
manifest_wafer_v1_5 = {
    'wafer-analyzer-deployment': {
        'apiVersion': 'apps/v1',
        'kind': 'Deployment',
        'metadata': {'name': 'wafer-analyzer', 'namespace': 'ml-inference'},
        'spec': {
            'replicas': 4,
            'template': {
                'spec': {
                    'containers': [
                        {'name': 'wafer-analyzer', 'image': 'ml-models/wafer-analyzer:v1.5'}
                    ]
                }
            }
        }
    }
}

commit_wafer1 = git_repo_wafer.commit(
    author="bob@company.com",
    message="feat: Deploy wafer analyzer v1.5 with 4 replicas",
    manifests=manifest_wafer_v1_5
)

# Create ArgoCD application
argocd = ArgoCDController(cluster=cluster)

wafer_app = ArgoCDApplication(
    name="wafer-analyzer",
    git_repo=git_repo_wafer,
    target_namespace="ml-inference",
    sync_policy=SyncPolicy.AUTOMATED,
    auto_prune=True,
    self_heal=True
)

argocd.create_application(wafer_app)

# Sync application
argocd.sync_application("wafer-analyzer")

# Check application status
status = argocd.get_application_status("wafer-analyzer")
print(f"\n📊 Application Status:")
print(f"   Sync: {status['sync_status']}")
print(f"   Health: {status['health_status']}")
print(f"   Resources: {len(status['resources'])}")

# Example 2: Self-healing - Automatic drift remediation
print("\n" + "=" * 70)
print("Example 2: Self-Healing - Automatic Drift Remediation")
print("=" * 70)

# Manual change: Scale replicas to 8
print("\n👤 Engineer manually scales replicas: kubectl scale deployment wafer-analyzer --replicas=8")
cluster.manual_edit('ml-inference', 'wafer-analyzer', {'replicas': 8})

# ArgoCD detects drift (self-healing enabled)
print("\n🔍 ArgoCD reconciliation loop (every 3 minutes)...")
time.sleep(0.5)  # Simulate reconciliation interval

# Self-healing sync
print("⚠️ Drift detected: replicas differ from Git (4 vs 8)")
argocd.sync_application("wafer-analyzer")

# Verify replicas reverted to Git state
resource_wafer = cluster.get_resource('ml-inference', 'wafer-analyzer')
print(f"\n📊 After self-healing:")
print(f"   Replicas: {resource_wafer.get_replicas()} (reverted to Git: 4)")

# Example 3: Rollback to previous commit
print("\n" + "=" * 70)
print("Example 3: Instant Rollback to Previous Commit")
print("=" * 70)

# Deploy v1.6 (new version)
manifest_wafer_v1_6 = {
    'wafer-analyzer-deployment': {
        'apiVersion': 'apps/v1',
        'kind': 'Deployment',
        'metadata': {'name': 'wafer-analyzer', 'namespace': 'ml-inference'},
        'spec': {
            'replicas': 4,
            'template': {
                'spec': {
                    'containers': [
                        {'name': 'wafer-analyzer', 'image': 'ml-models/wafer-analyzer:v1.6'}
                    ]
                }
            }
        }
    }
}

commit_wafer2 = git_repo_wafer.commit(
    author="bob@company.com",
    message="feat: Upgrade wafer analyzer to v1.6",
    manifests=manifest_wafer_v1_6
)

print(f"📝 New commit: {commit_wafer2.commit_hash} - {commit_wafer2.message}")
argocd.sync_application("wafer-analyzer")

# Simulate issue with v1.6 (accuracy drops)
print("\n⚠️ Issue detected: Model v1.6 accuracy dropped from 99.2% → 97.8%")
print(f"⏪ Initiating rollback to commit {commit_wafer1.commit_hash} (v1.5)")

# Rollback to v1.5
argocd.rollback_application("wafer-analyzer", commit_wafer1.commit_hash)

# Verify rollback
resource_after_rollback = cluster.get_resource('ml-inference', 'wafer-analyzer')
print(f"\n📊 After rollback:")
print(f"   Image: {resource_after_rollback.get_image()} (restored to v1.5)")

print(f"\n✅ ArgoCD demonstrated: Automated sync, self-healing, instant rollback!")


## 4. 🚀 Flux and Progressive Delivery with Flagger

### 📝 What's Happening in This Section?

**Purpose:** Implement Flux GitOps toolkit with Flagger for automated canary deployments based on metrics.

**Key Points:**
- **GitRepository CRD**: Watch Git repo for changes (poll interval, branch, authentication)
- **Kustomization CRD**: Apply Kubernetes manifests from GitRepository (prune, health checks)
- **Flagger**: Progressive delivery controller (canary releases based on Prometheus metrics)
- **Canary Analysis**: Automatic traffic shift (5% → 100%) if success rate ≥99%, latency <150ms
- **Automatic Rollback**: Revert to stable version if metrics degrade (no human intervention)

**Why This Matters:**
- **Metric-Driven Deployments**: Deploy new model only if metrics improve (accuracy, latency)
- **Zero Downtime**: Gradual traffic shift limits risk (5% of users test the new version first)
- **Automated Decision**: Flagger decides promote/rollback based on Prometheus (no manual approval)
- **Multi-Stage Canary**: 5% (1h) → 10% (1h) → 25% (1h) → 50% (1h) → 100% (if all stages pass)

**Post-Silicon Application:** Flux deploys yield prediction model v3.0; Flagger routes 5% of wafer analysis traffic to it, monitors accuracy and latency for 1 hour, and auto-promotes if accuracy is ≥99.5%.

In [None]:
# Flux and Flagger - Progressive Delivery with Canary Analysis

class CanaryPhase(Enum):
    """Flagger canary deployment phase"""
    INITIALIZED = "Initialized"  # Canary created
    WAITING = "Waiting"  # Waiting for analysis interval
    PROGRESSING = "Progressing"  # Traffic shifting in progress
    PROMOTING = "Promoting"  # Promoting canary to primary
    FINALIZING = "Finalizing"  # Cleanup canary resources
    SUCCEEDED = "Succeeded"  # Canary promotion successful
    FAILED = "Failed"  # Canary analysis failed, rollback

@dataclass
class PrometheusMetric:
    """Simulated Prometheus metric"""
    name: str
    value: float
    threshold: float
    operator: str  # '<', '>', '<=', '>='
    
    def check_threshold(self) -> bool:
        """Check if metric passes threshold"""
        if self.operator == '<':
            return self.value < self.threshold
        elif self.operator == '>':
            return self.value > self.threshold
        elif self.operator == '<=':
            return self.value <= self.threshold
        elif self.operator == '>=':
            return self.value >= self.threshold
        return False

@dataclass
class FlaggerCanary:
    """Flagger Canary CRD - Progressive delivery configuration"""
    name: str
    target_deployment: str
    service_name: str
    
    # Traffic shifting configuration
    step_weight: int = 5  # Nominal traffic increment; analyze_canary() uses fixed stages (5%, 10%, 25%, 50%, 100%)
    max_weight: int = 100
    
    # Canary analysis configuration
    interval: int = 60  # seconds between analysis
    threshold: int = 5  # minimum consecutive successful checks before promotion (real Flagger: max failed checks before rollback)
    
    # Metrics for canary analysis
    metrics: List[PrometheusMetric] = field(default_factory=list)
    
    # State
    current_weight: int = 0
    analysis_iteration: int = 0
    consecutive_successes: int = 0
    phase: CanaryPhase = CanaryPhase.INITIALIZED
    
    def should_promote(self) -> bool:
        """Check if canary should be promoted to primary"""
        return self.consecutive_successes >= self.threshold and self.current_weight == self.max_weight
    
    def should_rollback(self) -> bool:
        """Check if canary should be rolled back"""
        # Rollback if any metric fails threshold
        return any(not metric.check_threshold() for metric in self.metrics)

class FluxController:
    """Flux GitOps Toolkit controller"""
    
    def __init__(self, cluster: ClusterState):
        self.cluster = cluster
        self.git_repositories: Dict[str, GitRepository] = {}
        self.canaries: Dict[str, FlaggerCanary] = {}
        self.canary_history: List[Dict] = []
    
    def add_git_repository(self, name: str, git_repo: GitRepository):
        """Register GitRepository CRD"""
        self.git_repositories[name] = git_repo
        print(f"✅ Flux GitRepository registered: {name}")
        print(f"   URL: {git_repo.repo_url}")
    
    def sync_kustomization(self, git_repo_name: str, namespace: str) -> Dict:
        """Sync Kustomization CRD (apply manifests from Git)"""
        git_repo = self.git_repositories.get(git_repo_name)
        if not git_repo:
            return {'success': False, 'message': f'GitRepository {git_repo_name} not found'}
        
        current_commit = git_repo.get_current_commit()
        if not current_commit:
            return {'success': False, 'message': 'No commits in Git repository'}
        
        print(f"\n🔄 Flux syncing Kustomization from {git_repo_name}")
        
        # Apply manifests to cluster
        for resource_name, manifest in current_commit.manifests.items():
            name = manifest.get('metadata', {}).get('name', resource_name)
            kind = manifest.get('kind', 'Unknown')
            
            resource = KubernetesResource(
                name=name,
                kind=kind,
                namespace=namespace,
                spec=manifest.get('spec', {})
            )
            self.cluster.apply_resource(resource)
            print(f"  ✅ {kind}/{name} synced")
        
        return {'success': True, 'commit': current_commit.commit_hash}
    
    def create_canary(self, canary: FlaggerCanary):
        """Create Flagger Canary for progressive delivery"""
        self.canaries[canary.name] = canary
        print(f"✅ Flagger Canary created: {canary.name}")
        print(f"   Target: {canary.target_deployment}")
        print(f"   Step Weight: {canary.step_weight}%")
        print(f"   Threshold: {canary.threshold} consecutive successes")
    
    def analyze_canary(self, canary_name: str) -> Dict:
        """Run canary analysis (check metrics and decide promote/rollback)"""
        canary = self.canaries.get(canary_name)
        if not canary:
            return {'success': False, 'message': f'Canary {canary_name} not found'}
        
        print(f"\n🔍 Canary Analysis #{canary.analysis_iteration + 1} - {canary_name}")
        print(f"   Current traffic weight: {canary.current_weight}%")
        
        # Evaluate metrics
        all_metrics_pass = True
        for metric in canary.metrics:
            passes = metric.check_threshold()
            status_icon = "✅" if passes else "❌"
            print(f"   {status_icon} {metric.name}: {metric.value:.2f} {metric.operator} {metric.threshold}")
            if not passes:
                all_metrics_pass = False
        
        canary.analysis_iteration += 1
        
        # Decision logic
        if canary.should_rollback():
            print(f"\n❌ Canary FAILED - Metrics below threshold, rolling back...")
            canary.phase = CanaryPhase.FAILED
            canary.current_weight = 0
            
            self.canary_history.append({
                'timestamp': datetime.now(),
                'canary': canary_name,
                'phase': CanaryPhase.FAILED.value,
                'weight': 0,
                'decision': 'Rollback'
            })
            
            return {'success': False, 'decision': 'rollback', 'phase': CanaryPhase.FAILED.value}
        
        if all_metrics_pass:
            canary.consecutive_successes += 1
            print(f"   ✅ Metrics passed ({canary.consecutive_successes}/{canary.threshold} successes)")
            
            if canary.should_promote():
                print(f"\n✅ Canary SUCCEEDED - Promoting to primary (100% traffic)")
                canary.phase = CanaryPhase.SUCCEEDED
                
                self.canary_history.append({
                    'timestamp': datetime.now(),
                    'canary': canary_name,
                    'phase': CanaryPhase.SUCCEEDED.value,
                    'weight': 100,
                    'decision': 'Promote'
                })
                
                return {'success': True, 'decision': 'promote', 'phase': CanaryPhase.SUCCEEDED.value}
            
            # Increase traffic weight
            if canary.current_weight < canary.max_weight:
                # Progressive stages: 5% → 10% → 25% → 50% → 100%
                if canary.current_weight == 0:
                    canary.current_weight = 5
                elif canary.current_weight == 5:
                    canary.current_weight = 10
                elif canary.current_weight == 10:
                    canary.current_weight = 25
                elif canary.current_weight == 25:
                    canary.current_weight = 50
                elif canary.current_weight == 50:
                    canary.current_weight = 100
                
                canary.phase = CanaryPhase.PROGRESSING
                print(f"   ⬆️ Increasing traffic weight: {canary.current_weight}%")
                
                self.canary_history.append({
                    'timestamp': datetime.now(),
                    'canary': canary_name,
                    'phase': CanaryPhase.PROGRESSING.value,
                    'weight': canary.current_weight,
                    'decision': 'Continue'
                })
        else:
            # Guard branch: not reached while should_rollback() fails fast on any metric failure
            canary.consecutive_successes = 0  # Reset on failure
            print(f"   ⚠️ Metrics failed, success counter reset ({canary.consecutive_successes}/{canary.threshold})")
        
        return {'success': True, 'decision': 'continue', 'phase': canary.phase.value}
    
    def run_canary_deployment(self, canary_name: str, max_iterations: int = 10) -> Dict:
        """Run complete canary deployment (multiple analysis iterations)"""
        canary = self.canaries.get(canary_name)
        if not canary:
            return {'success': False, 'message': f'Canary {canary_name} not found'}
        
        print(f"\n🚀 Starting Canary Deployment: {canary_name}")
        print(f"   Strategy: Progressive delivery (5% → 10% → 25% → 50% → 100%)")
        print("=" * 70)
        
        for iteration in range(max_iterations):
            result = self.analyze_canary(canary_name)
            
            if result['decision'] == 'rollback':
                print(f"\n⏪ Rollback initiated - canary deployment failed")
                return {'success': False, 'final_phase': CanaryPhase.FAILED.value, 'iterations': iteration + 1}
            
            if result['decision'] == 'promote':
                print(f"\n🎉 Promotion complete - canary deployed successfully!")
                return {'success': True, 'final_phase': CanaryPhase.SUCCEEDED.value, 'iterations': iteration + 1}
            
            # Wait for next analysis interval (simulated)
            if iteration < max_iterations - 1:
                print(f"\n⏳ Waiting {canary.interval}s for next analysis...")
                time.sleep(0.3)  # Simulate interval
        
        return {'success': False, 'final_phase': canary.phase.value, 'iterations': max_iterations, 'message': 'Max iterations reached'}

# Example 1: Flux GitRepository and Kustomization
print("=" * 70)
print("Example 1: Flux GitRepository and Kustomization")
print("=" * 70)

# Create Git repo for yield predictor
git_repo_yield = GitRepository(repo_url="https://github.com/ml-team/yield-predictor-manifests.git")

manifest_yield_v2_8 = {
    'yield-predictor-deployment': {
        'apiVersion': 'apps/v1',
        'kind': 'Deployment',
        'metadata': {'name': 'yield-predictor', 'namespace': 'ml-inference'},
        'spec': {
            'replicas': 5,
            'template': {
                'spec': {
                    'containers': [
                        {'name': 'yield-predictor', 'image': 'ml-models/yield-predictor:v2.8'}
                    ]
                }
            }
        }
    }
}

commit_yield1 = git_repo_yield.commit(
    author="alice@company.com",
    message="feat: Deploy yield predictor v2.8 (99.3% accuracy)",
    manifests=manifest_yield_v2_8
)

# Create Flux controller
flux = FluxController(cluster=cluster)
flux.add_git_repository("yield-predictor-repo", git_repo_yield)

# Sync Kustomization
flux.sync_kustomization("yield-predictor-repo", namespace="ml-inference")

# Example 2: Flagger Canary with metric-based analysis
print("\n" + "=" * 70)
print("Example 2: Flagger Canary - Progressive Delivery with Metrics")
print("=" * 70)

# Deploy new version v2.9 (canary)
manifest_yield_v2_9 = {
    'yield-predictor-deployment': {
        'apiVersion': 'apps/v1',
        'kind': 'Deployment',
        'metadata': {'name': 'yield-predictor', 'namespace': 'ml-inference'},
        'spec': {
            'replicas': 5,
            'template': {
                'spec': {
                    'containers': [
                        {'name': 'yield-predictor', 'image': 'ml-models/yield-predictor:v2.9'}
                    ]
                }
            }
        }
    }
}

commit_yield2 = git_repo_yield.commit(
    author="alice@company.com",
    message="feat: Upgrade yield predictor to v2.9 (target 99.5% accuracy)",
    manifests=manifest_yield_v2_9
)

# Create Flagger Canary with metric thresholds
canary_yield = FlaggerCanary(
    name="yield-predictor-canary",
    target_deployment="yield-predictor",
    service_name="yield-predictor-svc",
    step_weight=5,
    threshold=2,  # Require at least 2 consecutive successes before promotion
    metrics=[
        PrometheusMetric(name="request_success_rate", value=99.6, threshold=99.0, operator='>='),  # Success rate ≥99%
        PrometheusMetric(name="request_duration_p99_ms", value=125.0, threshold=150.0, operator='<'),  # Latency <150ms
        PrometheusMetric(name="model_accuracy_percent", value=99.5, threshold=99.0, operator='>=')  # Accuracy ≥99%
    ]
)

flux.create_canary(canary_yield)

# Run canary deployment
result = flux.run_canary_deployment("yield-predictor-canary", max_iterations=12)

print(f"\n📊 Canary Deployment Result:")
print(f"   Success: {result['success']}")
print(f"   Final Phase: {result['final_phase']}")
print(f"   Iterations: {result['iterations']}")

# Example 3: Canary rollback scenario (metrics fail)
print("\n" + "=" * 70)
print("Example 3: Automatic Rollback - Metrics Below Threshold")
print("=" * 70)

# Deploy v3.0 with degraded performance (simulate bad deployment)
manifest_yield_v3_0 = {
    'yield-predictor-deployment': {
        'apiVersion': 'apps/v1',
        'kind': 'Deployment',
        'metadata': {'name': 'yield-predictor', 'namespace': 'ml-inference'},
        'spec': {
            'replicas': 5,
            'template': {
                'spec': {
                    'containers': [
                        {'name': 'yield-predictor', 'image': 'ml-models/yield-predictor:v3.0'}
                    ]
                }
            }
        }
    }
}

commit_yield3 = git_repo_yield.commit(
    author="bob@company.com",
    message="feat: Upgrade yield predictor to v3.0 (new algorithm)",
    manifests=manifest_yield_v3_0
)

# Create canary with failing metrics
canary_yield_bad = FlaggerCanary(
    name="yield-predictor-canary-v3",
    target_deployment="yield-predictor",
    service_name="yield-predictor-svc",
    step_weight=5,
    threshold=2,
    metrics=[
        PrometheusMetric(name="request_success_rate", value=97.2, threshold=99.0, operator='>='),  # FAIL: 97.2% < 99%
        PrometheusMetric(name="request_duration_p99_ms", value=180.0, threshold=150.0, operator='<'),  # FAIL: 180ms > 150ms
        PrometheusMetric(name="model_accuracy_percent", value=98.1, threshold=99.0, operator='>=')  # FAIL: 98.1% < 99%
    ]
)

flux.create_canary(canary_yield_bad)

# Run canary (will fail and rollback)
result_bad = flux.run_canary_deployment("yield-predictor-canary-v3", max_iterations=5)

print(f"\n📊 Canary Deployment Result (v3.0):")
print(f"   Success: {result_bad['success']}")
print(f"   Final Phase: {result_bad['final_phase']}")
print(f"   Decision: Automatic rollback to v2.9 (stable version)")

# Visualize canary deployment history
print("\n" + "=" * 70)
print("Canary Deployment History")
print("=" * 70)

canary_df = pd.DataFrame(flux.canary_history)
if not canary_df.empty:
    print(canary_df[['timestamp', 'canary', 'phase', 'weight', 'decision']].to_string(index=False))

# Visualization: Canary traffic progression
if not canary_df.empty:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Traffic weight progression
    successful_canary = canary_df[canary_df['canary'] == 'yield-predictor-canary']
    if not successful_canary.empty:
        axes[0].plot(range(len(successful_canary)), successful_canary['weight'], marker='o', linewidth=2, color='green')
        axes[0].set_title('Successful Canary: Traffic Progression', fontsize=14, fontweight='bold')
        axes[0].set_xlabel('Analysis Iteration')
        axes[0].set_ylabel('Traffic Weight (%)')
        axes[0].grid(True, alpha=0.3)
        axes[0].axhline(y=100, color='blue', linestyle='--', label='Full Promotion')
        axes[0].legend()
    
    # Plot 2: Failed canary (stays at 0%)
    failed_canary = canary_df[canary_df['canary'] == 'yield-predictor-canary-v3']
    if not failed_canary.empty:
        axes[1].plot(range(len(failed_canary)), failed_canary['weight'], marker='x', linewidth=2, color='red')
        axes[1].set_title('Failed Canary: Automatic Rollback', fontsize=14, fontweight='bold')
        axes[1].set_xlabel('Analysis Iteration')
        axes[1].set_ylabel('Traffic Weight (%)')
        axes[1].grid(True, alpha=0.3)
        axes[1].axhline(y=0, color='orange', linestyle='--', label='Rollback')
        axes[1].legend()
    
    plt.tight_layout()
    plt.show()

print(f"\n✅ Flux and Flagger demonstrated: Metric-driven progressive delivery!")
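
The simulation above hard-codes its traffic stages and never reads `step_weight`. A minimal sketch of how a linear schedule could be derived from `step_weight` and `max_weight` instead; the helper name `linear_weight_schedule` is hypothetical and is not part of Flagger or of the classes in this notebook:

```python
from typing import List

def linear_weight_schedule(step_weight: int, max_weight: int) -> List[int]:
    """Return canary traffic weights, increasing by step_weight up to max_weight."""
    if step_weight <= 0 or max_weight <= 0:
        raise ValueError("step_weight and max_weight must be positive")
    weights = list(range(step_weight, max_weight + 1, step_weight))
    # Always finish at max_weight, even when it is not a multiple of the step
    if not weights or weights[-1] != max_weight:
        weights.append(max_weight)
    return weights

print(linear_weight_schedule(25, 100))  # [25, 50, 75, 100]
print(linear_weight_schedule(30, 100))  # [30, 60, 90, 100]
```

A table-driven schedule like this would let `analyze_canary` replace its `if/elif` chain with a lookup of the next weight.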


## 5. 🚀 Real-World Projects Using GitOps

---

### **Project 1: Multi-Environment STDF Pipeline with ArgoCD** ⭐⭐⭐⭐⭐

**Objective:** Deploy STDF processing pipeline (parser, feature extractor, outlier detector, yield predictor) to dev/staging/prod environments using ArgoCD ApplicationSet.

**Business Value:**
- **$280K/year savings** (eliminate manual deployment errors, reduce deployment time 85%)
- **99.9% deployment success rate** (Git-based rollback in <60 seconds)
- **3x faster releases** (automated sync reduces deployment time from 45 minutes → 15 minutes)

**Success Criteria:**
- All environments synced from single Git repository (environment-specific overlays)
- Automatic drift detection and remediation (reconcile every 3 minutes)
- Zero manual kubectl commands (100% GitOps workflow)
- Audit trail in Git history (every deployment change tracked)
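
Drift detection boils down to comparing the desired state in Git with the live state in the cluster. A minimal sketch of that comparison on plain dictionaries; the function `detect_drift` and the sample manifests are illustrative assumptions, and a real controller such as ArgoCD applies more elaborate diffing rules than this:

```python
from typing import Any, Dict, Tuple

def detect_drift(desired: Dict[str, Any], live: Dict[str, Any], prefix: str = "") -> Dict[str, Tuple[Any, Any]]:
    """Return {field_path: (desired_value, live_value)} for every field that differs."""
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(desired) | set(live)):
        path = f"{prefix}.{key}" if prefix else key
        want, have = desired.get(key), live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse into nested sections such as spec or metadata
            drift.update(detect_drift(want, have, path))
        elif want != have:
            drift[path] = (want, have)
    return drift

# Hypothetical example: someone scaled the Deployment by hand
desired = {"spec": {"replicas": 4, "image": "ml-models/wafer-analyzer:v1.5"}}
live = {"spec": {"replicas": 7, "image": "ml-models/wafer-analyzer:v1.5"}}
print(detect_drift(desired, live))  # {'spec.replicas': (4, 7)}
```

With `selfHeal` enabled, a non-empty result is what would trigger re-applying the Git version.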

**Implementation Hints:**

```yaml
# ApplicationSet for multi-environment deployment
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: stdf-pipeline
  namespace: argocd
spec:
  generators:
  - list:
      elements:
      - cluster: dev
        url: https://kubernetes-dev.company.com
        namespace: ml-inference-dev
      - cluster: staging
        url: https://kubernetes-staging.company.com
        namespace: ml-inference-staging
      - cluster: prod
        url: https://kubernetes-prod.company.com
        namespace: ml-inference-prod
  
  template:
    metadata:
      name: 'stdf-pipeline-{{cluster}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/ml-team/stdf-pipeline-manifests.git
        targetRevision: main
        path: overlays/{{cluster}}  # Environment-specific Kustomize overlay
      destination:
        server: '{{url}}'
        namespace: '{{namespace}}'
      syncPolicy:
        automated:
          prune: true  # Delete resources removed from Git
          selfHeal: true  # Revert manual changes
        syncOptions:
        - CreateNamespace=true
```
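
To make the list generator concrete, the sketch below expands the three elements into the Application names, paths, and destinations that the template would produce. It only mimics the `{{param}}` substitution with a regular expression; the helper names `render` and `expand` are hypothetical, and this is not how the ApplicationSet controller is implemented:

```python
import re
from typing import Dict, List

def render(template: str, params: Dict[str, str]) -> str:
    """Replace each {{name}} placeholder with the matching parameter value."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: params[m.group(1)], template)

# Same elements as the list generator above
elements = [
    {"cluster": "dev", "url": "https://kubernetes-dev.company.com", "namespace": "ml-inference-dev"},
    {"cluster": "staging", "url": "https://kubernetes-staging.company.com", "namespace": "ml-inference-staging"},
    {"cluster": "prod", "url": "https://kubernetes-prod.company.com", "namespace": "ml-inference-prod"},
]

def expand(elements: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Produce one Application summary per generator element."""
    return [
        {
            "name": render("stdf-pipeline-{{cluster}}", params),
            "path": render("overlays/{{cluster}}", params),
            "server": render("{{url}}", params),
            "namespace": render("{{namespace}}", params),
        }
        for params in elements
    ]

for app in expand(elements):
    print(app["name"], "->", app["server"], app["path"])
```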

**Kustomize Overlay Structure:**
```
stdf-pipeline-manifests/
├── base/
│   ├── stdf-parser-deployment.yaml
│   ├── feature-extractor-deployment.yaml
│   ├── outlier-detector-deployment.yaml
│   └── yield-predictor-deployment.yaml
├── overlays/
│   ├── dev/
│   │   └── kustomization.yaml  # 1 replica, debug logging
│   ├── staging/
│   │   └── kustomization.yaml  # 2 replicas, info logging
│   └── prod/
│       └── kustomization.yaml  # 5 replicas, error logging, resource limits
```

**Post-Silicon Application:**
- STDF parser handles 10K wafers/day (dev: 100, staging: 1K, prod: 10K)
- Feature extractor processes parametric test data (voltage, current, frequency)
- Outlier detector flags anomalies (2-sigma threshold)
- Yield predictor forecasts manufacturing yield (99.2% accuracy)

---

### **Project 2: Canary Deployment for Wafer Yield Model with Flagger** ⭐⭐⭐⭐⭐

**Objective:** Implement automated canary deployment for wafer yield prediction model using Flagger with Prometheus metrics (accuracy, latency, error rate).

**Business Value:**
- **$1.8M/year savings** (prevent bad model deployments, reduce downtime 95%)
- **Zero production incidents** from model updates (automatic rollback on metric degradation)
- **5x faster rollback** (automatic vs manual: 2 minutes vs 10 minutes)

**Success Criteria:**
- Progressive traffic shift (5% → 10% → 25% → 50% → 100%) based on metrics
- Automatic promotion if accuracy ≥99.5%, latency <100ms, error rate <0.5%
- Automatic rollback if metrics degrade below thresholds
- Metrics collected from Prometheus every 60 seconds

**Implementation Hints:**

```yaml
# Flagger Canary for yield prediction model
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: yield-predictor
  namespace: ml-inference
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: yield-predictor
  
  service:
    port: 8080
  
  analysis:
    interval: 60s  # Analysis frequency
    threshold: 5  # Max failed metric checks before rollback
    maxWeight: 100
    stepWeight: 5  # Linear traffic increment per step (5%, 10%, 15%, ...)
    
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99  # Success rate ≥99%
      interval: 60s
    
    - name: request-duration
      thresholdRange:
        max: 100  # p99 latency <100ms
      interval: 60s
    
    - name: model-accuracy
      templateRef:
        name: model-accuracy
      thresholdRange:
        min: 99.5  # Model accuracy ≥99.5%
      interval: 60s
    
    webhooks:
    - name: load-test
      url: http://flagger-loadtester/
      timeout: 5s
      metadata:
        type: cmd
        cmd: "hey -z 60s -q 10 -c 2 http://yield-predictor-canary:8080/predict"
```
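
The analysis settings can be read as a small state machine: each interval, every metric is checked against its `thresholdRange`; a passing check advances the traffic weight, and failed checks accumulate until `threshold` is reached. The sketch below is a simplified model under those assumptions, not Flagger's actual implementation; `ThresholdRange`, `RANGES`, and `run_analysis` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ThresholdRange:
    min: Optional[float] = None
    max: Optional[float] = None

    def accepts(self, value: float) -> bool:
        """True when value lies inside the configured bounds."""
        if self.min is not None and value < self.min:
            return False
        if self.max is not None and value > self.max:
            return False
        return True

# Mirrors the three metrics in the Canary spec above
RANGES = {
    "request-success-rate": ThresholdRange(min=99),
    "request-duration": ThresholdRange(max=100),
    "model-accuracy": ThresholdRange(min=99.5),
}

def run_analysis(samples: List[Dict[str, float]], threshold: int = 5,
                 step_weight: int = 5, max_weight: int = 100) -> str:
    """Return 'promote', 'rollback', or 'in-progress' after consuming the samples."""
    weight, failed_checks = 0, 0
    for sample in samples:
        if all(RANGES[name].accepts(value) for name, value in sample.items()):
            weight = min(weight + step_weight, max_weight)
            if weight >= max_weight:
                return "promote"
        else:
            failed_checks += 1
            if failed_checks >= threshold:
                return "rollback"
    return "in-progress"

good = {"request-success-rate": 99.6, "request-duration": 78, "model-accuracy": 99.6}
bad = {"request-success-rate": 99.6, "request-duration": 78, "model-accuracy": 98.7}
print(run_analysis([good] * 20))  # promote
print(run_analysis([bad] * 5))    # rollback
```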

**Prometheus Metrics (Custom):**
```yaml
# ServiceMonitor for model accuracy
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: yield-predictor-metrics
spec:
  selector:
    matchLabels:
      app: yield-predictor
  endpoints:
  - port: metrics
    path: /metrics
```
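
The ServiceMonitor above scrapes `/metrics` on the `metrics` port, so the model server has to expose its accuracy there. A minimal, standard-library sketch of the response body in the Prometheus text exposition format; the function `render_metrics` and the `version` label are assumptions for illustration, the metric name `model_accuracy_percent` is borrowed from the simulation earlier in this notebook, and a real service would more likely use a Prometheus client library:

```python
def render_metrics(accuracy_percent: float, model_version: str) -> str:
    """Build a /metrics response body exposing model accuracy as a gauge."""
    lines = [
        "# HELP model_accuracy_percent Rolling accuracy of the yield predictor (percent).",
        "# TYPE model_accuracy_percent gauge",
        f'model_accuracy_percent{{version="{model_version}"}} {accuracy_percent}',
    ]
    return "\n".join(lines) + "\n"

print(render_metrics(99.6, "v3.0"))
```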

**Post-Silicon Application:**
- Model v3.0 deployed with canary (new algorithm using transformer architecture)
- Baseline v2.9: 99.3% accuracy, 85ms p99 latency
- Canary v3.0: 99.6% accuracy, 78ms p99 latency → promoted to 100% traffic
- Failed canary v3.1: 98.7% accuracy ‚Üí automatic rollback to v3.0 after 5% traffic test

---

### **Project 3: GitOps-Based Disaster Recovery for Test Infrastructure** ⭐⭐⭐⭐

**Objective:** Implement disaster recovery strategy where entire post-silicon test infrastructure (10 microservices, 5 databases, 3 ML models) can be rebuilt from Git repository in <20 minutes.

**Business Value:**
- **$420K/year savings** (reduce RTO from 8 hours → 20 minutes, prevent test delays)
- **99.95% infrastructure availability** (automated recovery vs manual: 99.5%)
- **Zero knowledge dependency** (any engineer can trigger recovery from Git)

**Success Criteria:**
- Complete cluster rebuilt from Git in <20 minutes
- All stateful services restored with PVC backups (Velero integration)
- Automated testing after recovery (synthetic STDF data validation)
- Recovery documented in Git history (audit trail)

**Implementation Hints:**

```yaml
# ArgoCD App-of-Apps pattern (deploy all infrastructure apps)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: infrastructure-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/ml-team/infrastructure-manifests.git
    targetRevision: main
    path: root-app
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

---
# root-app/kustomization.yaml defines all infrastructure apps
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../apps/postgres-db
- ../apps/redis-cache
- ../apps/stdf-parser
- ../apps/feature-extractor
- ../apps/outlier-detector
- ../apps/yield-predictor
- ../apps/wafer-analyzer
- ../apps/defect-classifier
- ../apps/monitoring-stack
- ../apps/logging-stack
```

**Velero Backup Strategy (Stateful Data):**
```bash
# Daily automated backups of PVCs and Kubernetes state
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces ml-inference \
  --storage-location aws-s3 \
  --volume-snapshot-locations aws-ebs

# Restore after disaster (scheduled backups are named <schedule>-<YYYYMMDDhhmmss>)
velero restore create --from-backup daily-backup-20231210020000
```

**Post-Silicon Application:**
- Infrastructure failure scenario: Entire production cluster lost (data center outage)
- Recovery steps:
  1. Provision new Kubernetes cluster (10 minutes, Terraform automation)
  2. Install ArgoCD (2 minutes, Helm chart)
  3. Deploy infrastructure-root app (5 minutes, ArgoCD syncs all apps)
  4. Restore Velero backup (3 minutes, PVC data restored)
  5. Total RTO: 20 minutes vs 8 hours manual recovery
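
The RTO figure is the sum of the step durations, assuming the steps run strictly one after another. A short sketch of that arithmetic using the numbers from the list above (the variable names are illustrative):

```python
# Step durations in minutes, taken from the recovery steps above
recovery_steps_min = {
    "Provision cluster (Terraform)": 10,
    "Install ArgoCD (Helm)": 2,
    "Sync infrastructure-root app": 5,
    "Restore Velero backup": 3,
}

total_rto_min = sum(recovery_steps_min.values())
manual_rto_min = 8 * 60  # 8 hours of manual recovery
speedup = manual_rto_min / total_rto_min

print(f"Automated RTO: {total_rto_min} min")            # Automated RTO: 20 min
print(f"Speedup vs manual recovery: {speedup:.0f}x")    # Speedup vs manual recovery: 24x
```

Provisioning dominates the budget, so any step that can run in parallel with it (for example, pre-staging the Velero restore) would be the first place to look for savings.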

---

### **Project 4: Blue-Green Deployment for Critical ML Services** ⭐⭐⭐

**Objective:** Implement blue-green deployment strategy for critical STDF processing services (instant traffic switch, zero downtime, instant rollback).

**Business Value:**
- **$180K/year savings** (eliminate deployment downtime, reduce rollback time 90%)
- **10-second rollback** (switch traffic back to blue environment instantly)
- **Zero downtime deployments** (no service interruption during upgrades)

**Success Criteria:**
- Separate blue and green environments (identical infrastructure)
- Instant traffic switch via Service selector change (no pod restarts)
- Automated smoke tests before traffic switch (validate green environment)
- Rollback in <10 seconds (revert Service selector to blue)

**Implementation Hints:**

```yaml
# Blue environment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stdf-parser-blue
  namespace: ml-inference
spec:
  replicas: 5
  selector:
    matchLabels:
      app: stdf-parser
      version: blue
  template:
    metadata:
      labels:
        app: stdf-parser
        version: blue
    spec:
      containers:
      - name: stdf-parser
        image: ml-models/stdf-parser:v2.5

---
# Green environment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stdf-parser-green
  namespace: ml-inference
spec:
  replicas: 5
  selector:
    matchLabels:
      app: stdf-parser
      version: green
  template:
    metadata:
      labels:
        app: stdf-parser
        version: green
    spec:
      containers:
      - name: stdf-parser
        image: ml-models/stdf-parser:v2.6

---
# Service (initially points to blue)
apiVersion: v1
kind: Service
metadata:
  name: stdf-parser
  namespace: ml-inference
spec:
  selector:
    app: stdf-parser
    version: blue  # Change to 'green' to switch traffic
  ports:
  - port: 8080
    targetPort: 8080
```

**GitOps Traffic Switch (Git commit):**
```yaml
# 1. Deploy green environment (Git commit)
# 2. Run smoke tests (automated validation)
# 3. Switch Service selector to green (Git commit)
# 4. Monitor metrics for 10 minutes
# 5. If issues, rollback (revert Git commit → Service points to blue)
```
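
The workflow above can be sketched as a function that edits the Service manifest only when the smoke test passes. The dictionary stands in for the manifest stored in Git, and `switch_selector` plus the lambda smoke tests are hypothetical placeholders, not part of any GitOps tool:

```python
import copy
from typing import Callable, Dict

def switch_selector(service: Dict, target: str, smoke_test: Callable[[str], bool]) -> Dict:
    """Return a copy of the Service pointing at `target`, or the unchanged Service if the smoke test fails."""
    if not smoke_test(target):
        return service
    updated = copy.deepcopy(service)
    updated["spec"]["selector"]["version"] = target
    return updated

service_blue = {
    "metadata": {"name": "stdf-parser", "namespace": "ml-inference"},
    "spec": {"selector": {"app": "stdf-parser", "version": "blue"}},
}

# Step 3: switch to green after a passing smoke test (this would be a Git commit)
service_green = switch_selector(service_blue, "green", smoke_test=lambda version: True)
print(service_green["spec"]["selector"]["version"])  # green

# Step 5: rollback is the reverse edit (equivalent to reverting that commit)
rolled_back = switch_selector(service_green, "blue", smoke_test=lambda version: True)
print(rolled_back["spec"]["selector"]["version"])  # blue
```

Returning a new manifest rather than mutating the old one mirrors the Git model: each switch is a separate, revertible change.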

**Post-Silicon Application:**
- STDF parser v2.6 deployed to green environment (new IEEE 1505 standard support)
- Smoke tests validate: parse 100 STDF files, check schema compliance (100% pass)
- Traffic switched to green (Service selector: blue → green)
- Rollback scenario: v2.6 has parsing bug → revert Service selector to blue in <10 seconds

---

### **Project 5: Multi-Cluster GitOps with Cluster API** ⭐⭐⭐⭐⭐

**Objective:** Manage 5 Kubernetes clusters across 3 regions (US, with US-West and US-East sites; EU; Asia) and 2 environments (staging, prod) using GitOps and Cluster API.

**Business Value:**
- **$620K/year savings** (centralized management, reduce ops overhead 70%)
- **99.99% multi-region availability** (automatic failover across regions)
- **Consistent deployments** across all clusters (same Git manifests, environment overlays)

**Success Criteria:**
- Single ArgoCD instance manages 5 clusters
- Cluster provisioning automated with Cluster API (infrastructure as code)
- Application deployments synced across all clusters (same Git repo)
- Regional-specific configurations (US clusters use us-docker-registry, EU uses eu-docker-registry)

**Implementation Hints:**

```yaml
# Cluster API Cluster definition (infrastructure as code)
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: ml-inference-us-west
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: ml-inference-us-west-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: ml-inference-us-west

---
# ArgoCD ApplicationSet for multi-cluster deployment
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: yield-predictor-multicluster
spec:
  generators:
  - list:  # A matrix generator requires exactly two child generators; a single list generator is enough here
      elements:
      - cluster: us-west-prod
        server: https://k8s-us-west.company.com
        region: us
      - cluster: us-east-prod
        server: https://k8s-us-east.company.com
        region: us
      - cluster: eu-west-prod
        server: https://k8s-eu-west.company.com
        region: eu
  
  template:
    metadata:
      name: 'yield-predictor-{{cluster}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/ml-team/yield-predictor-manifests.git
        targetRevision: main
        path: overlays/{{region}}-prod  # Region-specific overlay
      destination:
        server: '{{server}}'
        namespace: ml-inference
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

**Post-Silicon Application:**
- Wafer fabrication sites in 3 regions (US, EU, Asia)
- Each region processes local wafer test data (low latency, data sovereignty)
- Yield prediction model deployed consistently across all regions
- Automatic failover: US-West cluster failure → traffic routed to US-East (global load balancer)
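
The failover rule can be sketched as "prefer the local cluster, otherwise pick a healthy cluster in the same region". The code below is an illustrative assumption about the routing policy, using the three production clusters from the ApplicationSet; `pick_cluster` is a hypothetical helper, and it deliberately never crosses regions so that data stays within its sovereignty boundary:

```python
from typing import Optional, Set

# Cluster names and regions from the ApplicationSet elements above
CLUSTER_REGIONS = {
    "us-west-prod": "us",
    "us-east-prod": "us",
    "eu-west-prod": "eu",
}

def pick_cluster(preferred: str, healthy: Set[str]) -> Optional[str]:
    """Return the preferred cluster if healthy, else a healthy cluster in the same region, else None."""
    if preferred in healthy:
        return preferred
    region = CLUSTER_REGIONS[preferred]
    candidates = sorted(
        name for name, cluster_region in CLUSTER_REGIONS.items()
        if cluster_region == region and name in healthy
    )
    return candidates[0] if candidates else None

# US-West is down: traffic moves to US-East
print(pick_cluster("us-west-prod", {"us-east-prod", "eu-west-prod"}))  # us-east-prod
# EU has no second cluster in this list, so there is no in-region fallback
print(pick_cluster("eu-west-prod", {"us-west-prod", "us-east-prod"}))  # None
```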

---

### **Project 6: Progressive Delivery with Feature Flags (LaunchDarkly + Flagger)** ⭐⭐⭐⭐

**Objective:** Combine Flagger progressive delivery with feature flags (LaunchDarkly) for fine-grained control over new model features.

**Business Value:**
- **$240K/year savings** (separate deployment from feature release, reduce risk 80%)
- **Instant feature rollback** (disable feature flag vs redeploy: 5 seconds vs 5 minutes)
- **A/B testing flexibility** (test multiple model variants with different user segments)

**Success Criteria:**
- Model v3.0 deployed to 100% of pods (via Flagger canary)
- New feature (transformer-based yield prediction) controlled by feature flag (initially 0% users)
- Gradual feature rollout (5% → 100% users) independent of deployment
- Instant feature disable if accuracy drops (toggle flag without redeployment)

**Implementation Hints:**

```python
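# Model code with feature flag (LaunchDarkly server-side Python SDK: launchdarkly-server-sdk)
import ldclient
from ldclient.config import Config

ldclient.set_config(Config("YOUR_SDK_KEY"))
ld_client = ldclient.get()

def predict_yield(wafer_data, user_context):
    # user_context is an ldclient Context, e.g. Context.builder("user-key").build()
    # Feature flag: Use transformer model or legacy GBM model?
    use_transformer = ld_client.variation(
        "transformer-yield-predictor",
        user_context,
        False  # Default value when the flag cannot be evaluated
    )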
    
    if use_transformer:
        return transformer_model.predict(wafer_data)
    else:
        return gbm_model.predict(wafer_data)  # Fallback to stable model
```

```yaml
# Flagger canary deploys model v3.0 (100% traffic, but feature flag at 0%)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: yield-predictor-v3
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: yield-predictor
  analysis:
    interval: 60s
    threshold: 5
    maxWeight: 100
    stepWeight: 10
```

**Feature Flag Rollout:**
```
Day 1: Deploy v3.0 (100% pods), feature flag at 0% (all users use GBM)
Day 2: Feature flag → 5% (5% users test transformer model)
Day 3: Feature flag → 25% (accuracy monitoring: 99.6% vs 99.3% baseline)
Day 4: Feature flag → 100% (full feature release, no redeployment needed)

Rollback: Accuracy drops → disable feature flag (5 seconds) vs redeploy (5 minutes)
```

**Post-Silicon Application:**
- New transformer-based yield predictor (99.7% accuracy vs 99.3% GBM baseline)
- Deploy to production (100% pods via Flagger)
- Feature flag controls which users get transformer predictions (start at 0%)
- Gradual rollout: internal users (5%) → beta customers (25%) → all users (100%)
- Instant rollback: Transformer bug detected → disable flag → all users revert to GBM
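
A percentage rollout only works if each user lands on the same side of the flag every time they are evaluated. The sketch below shows one common way to get that property, by hashing the flag key and user key into a stable bucket. It is an illustration of the idea, not LaunchDarkly's actual bucketing algorithm; `in_rollout` and the `engineer-N` user keys are hypothetical names.

```python
import hashlib


def in_rollout(flag_key, user_key, percentage):
    """Return True if this user falls inside the first `percentage` percent of buckets."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # stable bucket in 0..9999 (steps of 0.01%)
    return bucket < percentage * 100


# Hypothetical user keys, used only to show how the enabled share grows with the flag
users = [f"engineer-{i}" for i in range(1000)]
for pct in (0, 5, 25, 100):
    enabled = sum(in_rollout("transformer-yield-predictor", u, pct) for u in users)
    print(f"{pct:>3}% flag: {enabled} of {len(users)} users on the transformer model")
```

Because the bucket depends only on the flag and user keys, raising the percentage from 5 to 25 keeps every user who was already on the transformer model there, which keeps accuracy comparisons between days meaningful.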

---

### **Project 7: Automated Rollback with Prometheus Alerts** ⭐⭐⭐⭐

**Objective:** Integrate Prometheus alerting with ArgoCD to trigger automatic rollback when production metrics degrade.

**Business Value:**
- **$195K/year savings** (reduce MTTR from 20 minutes → 2 minutes, prevent cascading failures)
- **Automatic incident response** (no manual intervention for known failure patterns)
- **99.95% SLA achievement** (automated rollback prevents prolonged outages)

**Success Criteria:**
- Prometheus alerts trigger ArgoCD rollback (error rate >1%, latency >200ms, accuracy <99%)
- Rollback executed in <2 minutes (Git revert + ArgoCD sync)
- Alert escalation if rollback fails (PagerDuty notification)
- Post-rollback analysis in Git history (alerts logged as comments on Git commits)

**Implementation Hints:**

```yaml
# PrometheusRule (Prometheus Operator CRD): Prometheus evaluates the rules, AlertManager routes firing alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: yield-predictor-alerts
spec:
  groups:
  - name: model-degradation
    interval: 30s
    rules:
    - alert: ModelAccuracyDegraded
      expr: model_accuracy_percent < 99
      for: 5m
      labels:
        severity: critical
        service: yield-predictor
      annotations:
        summary: "Yield predictor accuracy below threshold"
        description: "Model accuracy {{ $value }}% < 99% for 5 minutes"
        rollback_commit: "{{ $labels.previous_commit }}"  # assumes the metric carries a previous_commit label
    
    - alert: HighErrorRate
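      # Ratio of 5xx responses to all responses (a bare rate() is requests/second, not a percentage)
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.01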
      for: 2m
      labels:
        severity: critical
        service: yield-predictor
      annotations:
        summary: "Error rate exceeded 1%"
        rollback_commit: "{{ $labels.previous_commit }}"
```

```python
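# AlertManager webhook handler that asks ArgoCD to sync a previous revision (automated rollback)
import os

import requests

ARGOCD_API = "https://argocd.company.com/api/v1"
ARGOCD_TOKEN = os.environ["ARGOCD_TOKEN"]  # API token allowed to sync the application


def pagerduty_alert(message):
    # Placeholder: replace with a real PagerDuty integration
    print(f"ESCALATION: {message}")


def handle_prometheus_alert(alert):
    """Handle one entry from the `alerts` list of an AlertManager webhook payload."""
    if alert['labels'].get('severity') != 'critical':
        return

    app_name = alert['labels']['service']
    previous_commit = alert['annotations']['rollback_commit']

    # Sync the application to the previous Git revision.
    # ArgoCD's /rollback endpoint expects a deployment history ID rather than a commit,
    # and ArgoCD refuses rollbacks and pinned-revision syncs while automated sync is
    # enabled, so with auto-sync on, revert the commit in Git instead.
    response = requests.post(
        f"{ARGOCD_API}/applications/{app_name}/sync",
        json={"revision": previous_commit, "prune": True},
        headers={"Authorization": f"Bearer {ARGOCD_TOKEN}"},
        timeout=30,
    )

    if response.ok:
        print(f"✅ Automatic rollback triggered for {app_name} to {previous_commit}")
    else:
        # Escalate to PagerDuty
        pagerduty_alert(f"Rollback failed for {app_name}")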

**Post-Silicon Application:**
- Yield predictor v3.2 deployed (new binning algorithm)
- Metrics after 10 minutes: accuracy 98.7% (below 99% threshold)
- Prometheus alert fires: ModelAccuracyDegraded
- Automatic rollback: ArgoCD reverts to v3.1 (previous commit)
- Total MTTR: 2 minutes (detection + rollback) vs 20 minutes manual
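
AlertManager delivers webhooks as a JSON document with an `alerts` list, so several alerts for the same service can arrive in one request. The sketch below shows one way to reduce such a payload to a single rollback decision per service before calling ArgoCD. The function name `rollback_targets`, the commit hashes, and the `stdf-parser` alert are hypothetical; the `severity`, `service`, and `rollback_commit` fields follow the rule definitions above.

```python
def rollback_targets(payload):
    """Map each service with a firing critical alert to the commit it should roll back to."""
    targets = {}
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no action
        labels = alert.get("labels", {})
        commit = alert.get("annotations", {}).get("rollback_commit")
        if labels.get("severity") == "critical" and labels.get("service") and commit:
            # setdefault keeps the first commit seen, so each service is rolled back once
            targets.setdefault(labels["service"], commit)
    return targets


# Hypothetical payload with two firing alerts for one service and one resolved alert
payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "ModelAccuracyDegraded", "severity": "critical", "service": "yield-predictor"},
         "annotations": {"rollback_commit": "abc1234"}},
        {"status": "firing",
         "labels": {"alertname": "HighErrorRate", "severity": "critical", "service": "yield-predictor"},
         "annotations": {"rollback_commit": "abc1234"}},
        {"status": "resolved",
         "labels": {"alertname": "HighErrorRate", "severity": "critical", "service": "stdf-parser"},
         "annotations": {"rollback_commit": "def5678"}},
    ],
}
print(rollback_targets(payload))  # {'yield-predictor': 'abc1234'}
```

Deduplicating per service avoids triggering two rollbacks when accuracy and error-rate alerts fire together for the same deployment.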

---

### **Project 8: GitOps Security Scanning with Checkov and OPA** ⭐⭐⭐

**Objective:** Implement security policy enforcement in GitOps workflow (scan manifests for vulnerabilities, enforce policies before deployment).

**Business Value:**
- **$125K/year savings** (prevent security incidents, reduce compliance audit time 60%)
- **Zero security violations** in production (100% policy enforcement at Git level)
- **Compliance automation** (SOC 2, ISO 27001 requirements validated in CI/CD)

**Success Criteria:**
- All Kubernetes manifests scanned with Checkov before merge (CI/CD integration)
- OPA policies enforce: no privileged containers, resource limits required, approved image registries only
- Policy violations block Git merge (prevent insecure manifests from reaching cluster)
- Security reports generated for each deployment (audit trail)

**Implementation Hints:**

```yaml
# GitHub Actions CI workflow (scan manifests before merge)
name: Security Scan
on:
  pull_request:
    paths:
      - 'manifests/**/*.yaml'

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Checkov scan
      uses: bridgecrewio/checkov-action@master
      with:
        directory: manifests/
        framework: kubernetes
        soft_fail: false  # Fail PR if violations found
    
    - name: OPA policy validation
      # Assumes the opa and conftest binaries were installed on the runner by an earlier step
      run: |
        opa test policies/ -v
        conftest test manifests/ --policy policies/
```

**OPA Policies (Rego):**
```rego
# Written for conftest: `input` is the manifest document itself (not an AdmissionReview),
# and conftest evaluates the `main` package by default. All three rules check Deployments,
# which is what the CI scan sees in manifests/.

# policies/deny-privileged-containers.rego
package main

import rego.v1

deny contains msg if {
  input.kind == "Deployment"
  some container in input.spec.template.spec.containers
  container.securityContext.privileged == true
  msg := sprintf("Privileged container detected: %v", [container.name])
}

# policies/require-resource-limits.rego (same package and import as above)
deny contains msg if {
  input.kind == "Deployment"
  some container in input.spec.template.spec.containers
  not container.resources.limits
  msg := sprintf("Container missing resource limits: %v", [container.name])
}

# policies/approved-registries.rego (same package and import as above)
deny contains msg if {
  input.kind == "Deployment"
  some container in input.spec.template.spec.containers
  not startswith(container.image, "ml-models.company.com/")
  msg := sprintf("Unapproved image registry: %v", [container.image])
}
```

**Post-Silicon Application:**
- Engineer commits STDF parser deployment with privileged: true (security risk)
- GitHub Actions CI triggers Checkov + OPA scans
- OPA policy violation: "Privileged container detected"
- PR blocked from merge (red X on GitHub)
- Engineer fixes: Remove privileged flag, add resource limits
- Re-scan passes → PR approved → ArgoCD deploys secure manifest
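
To make the scan-and-fix loop above concrete, here is a minimal Python sketch of the same three checks (privileged flag, resource limits, approved registry) applied to a Deployment manifest that has already been parsed into a dict. It is a stand-in for what the OPA policies evaluate, not a replacement for them; the `violations` function and the example manifests are hypothetical.

```python
APPROVED_REGISTRY = "ml-models.company.com/"


def violations(manifest):
    """Return a list of policy violation messages for a parsed Deployment manifest."""
    found = []
    if manifest.get("kind") != "Deployment":
        return found
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        if container.get("securityContext", {}).get("privileged") is True:
            found.append(f"Privileged container detected: {name}")
        if not container.get("resources", {}).get("limits"):
            found.append(f"Container missing resource limits: {name}")
        image = container.get("image", "")
        if not image.startswith(APPROVED_REGISTRY):
            found.append(f"Unapproved image registry: {image}")
    return found


# Hypothetical manifests: the first commit, then the engineer's fix
insecure = {"kind": "Deployment", "spec": {"template": {"spec": {"containers": [
    {"name": "stdf-parser", "image": "ml-models.company.com/stdf-parser:1.0",
     "securityContext": {"privileged": True}}]}}}}
fixed = {"kind": "Deployment", "spec": {"template": {"spec": {"containers": [
    {"name": "stdf-parser", "image": "ml-models.company.com/stdf-parser:1.0",
     "resources": {"limits": {"cpu": "1", "memory": "1Gi"}}}]}}}}

print(violations(insecure))
print(violations(fixed))  # []
```

A CI step would fail the pull request whenever the returned list is non-empty, which mirrors the blocked-merge behaviour described above.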

---

## 📊 Summary: 8 GitOps Projects

| **Project** | **Technology** | **Value** | **Complexity** |
|-------------|---------------|-----------|----------------|
| 1. Multi-Environment STDF Pipeline | ArgoCD ApplicationSet | $280K | ⭐⭐⭐⭐⭐ |
| 2. Canary with Flagger | Flagger + Prometheus | $1.8M | ⭐⭐⭐⭐⭐ |
| 3. Disaster Recovery | ArgoCD + Velero | $420K | ⭐⭐⭐⭐ |
| 4. Blue-Green Deployment | ArgoCD + Service Selector | $180K | ⭐⭐⭐ |
| 5. Multi-Cluster GitOps | Cluster API + ArgoCD | $620K | ⭐⭐⭐⭐⭐ |
| 6. Feature Flags + Canary | Flagger + LaunchDarkly | $240K | ⭐⭐⭐⭐ |
| 7. Automated Rollback | Prometheus + ArgoCD API | $195K | ⭐⭐⭐⭐ |
| 8. Security Scanning | Checkov + OPA | $125K | ⭐⭐⭐ |

**Total Business Value:** **$3.86M/year savings** across 8 production GitOps projects.

## 6. 📚 Comprehensive Takeaways - GitOps for ML

---

### 🎯 **Core Concepts Summary**

#### **GitOps Principles**
- **Git as Single Source of Truth**: All infrastructure and application config stored in Git (no manual `kubectl apply`)
- **Declarative Configuration**: Desired state defined in YAML manifests (not imperative scripts)
- **Automated Reconciliation**: Agents (ArgoCD, Flux) continuously sync cluster state with Git (ArgoCD polls every 3 minutes by default; Flux uses the `interval` set on each resource)
- **Pull-based Deployment**: Cluster pulls changes from Git (vs CI pushing to cluster - more secure)
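
The reconciliation idea can be sketched in a few lines of Python: compare the desired state from Git with the live state in the cluster and derive the actions that close the gap. This is a deliberately simplified model, not how ArgoCD or Flux are implemented; `reconcile` and the resource names are hypothetical, and real agents compare full Kubernetes objects rather than small dicts.

```python
def reconcile(desired, live, prune=True):
    """Return the actions needed to make `live` match `desired`.

    Both arguments map a resource name to its spec (plain dicts in this sketch).
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))  # new commit or manual drift
    if prune:
        for name in sorted(live):
            if name not in desired:
                actions.append(("delete", name))  # resource was removed from Git
    return actions


desired = {"deployment/yield-predictor": {"replicas": 5}, "service/yield-predictor": {"port": 80}}
live = {"deployment/yield-predictor": {"replicas": 3}, "configmap/old-settings": {}}
print(reconcile(desired, live))
```

Running the same function again after the actions are applied returns an empty list, which is the steady state an agent checks for on every polling cycle.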

#### **ArgoCD**
- **Application CRD**: Define application (Git repo, target namespace, sync policy)
- **Automated Sync**: Auto-sync every 3 minutes (detect new commits, apply changes)
- **Self-Healing**: Detect manual changes → revert to Git state automatically
- **Health Assessment**: Check deployment rollout, pod readiness, service endpoints
- **Rollback**: `git revert` → ArgoCD syncs to previous commit (<60 seconds)

#### **Flux**
- **GitRepository CRD**: Watch Git repo for changes (poll interval, branch, authentication)
- **Kustomization CRD**: Apply manifests from GitRepository (prune, health checks, dependencies)
- **HelmRelease CRD**: Deploy Helm charts declaratively (values in Git)
- **Image Automation**: Detect new container images → auto-commit to Git → trigger deployment

#### **Flagger (Progressive Delivery)**
- **Canary Analysis**: Gradual traffic shift (5% → 100%) based on Prometheus metrics
- **Metric Thresholds**: Success rate ≥99%, latency <150ms, custom metrics (model accuracy)
- **Automatic Promotion**: Promote canary once traffic reaches `maxWeight` while failed metric checks stay below `threshold` (e.g., fewer than 5 failed checks)
- **Automatic Rollback**: Revert to stable version if metrics degrade (no human intervention)

---

### 🏗️ **Architecture Best Practices**

#### **1. ArgoCD vs Flux - When to Choose**

**Choose ArgoCD when:**
- Need visual UI (web dashboard for sync status, resource tree, deployment history)
- Multi-cluster management from single control plane (centralized GitOps)
- RBAC integration (control who can sync which applications, SSO with OAuth)
- Application dependency management (sync waves, hooks)
- **Trade-off**: More complex setup (requires ArgoCD server + repo server + application controller)

**Choose Flux when:**
- Priority is simplicity and Kubernetes-native design (CRDs only, no external server)
- Helm chart deployment (HelmRelease CRD with values in Git)
- Image automation (auto-update image tags in Git when new versions published)
- GitOps Toolkit approach (modular components: source-controller, kustomize-controller, helm-controller)
- **Trade-off**: No built-in UI (requires separate tools like Weave GitOps Dashboard)

**Comparison Table:**

| **Feature** | **ArgoCD** | **Flux** |
|-------------|-----------|----------|
| **Architecture** | Server-based (API server + UI) | Controller-based (Kubernetes-native) |
| **UI** | ✅ Built-in web dashboard | ⚠️ Optional (Weave GitOps Dashboard) |
| **Multi-cluster** | ✅ Centralized management | ✅ Decentralized (Flux per cluster) |
| **Helm support** | ✅ Via Application CRD | ✅ HelmRelease CRD (better Helm integration) |
| **Image automation** | ⚠️ Requires Argo CD Image Updater | ✅ Built-in (ImageRepository, ImagePolicy) |
| **RBAC** | ✅ Fine-grained (AppProject, JWT tokens) | ✅ Kubernetes RBAC |
| **Sync waves** | ✅ PreSync, Sync, PostSync hooks | ⚠️ Via dependencies in Kustomization |
| **Learning curve** | Moderate (more concepts) | Gentle (Kubernetes-native) |
| **Best for** | Large enterprises, multi-tenancy | Cloud-native teams, Helm-heavy |

#### **2. Multi-Environment Strategy**

**Pattern 1: Single Repo, Multiple Overlays (Kustomize)**
```
manifests/
├── base/
│   ├── deployment.yaml  # Common config
│   ├── service.yaml
│   └── configmap.yaml
└── overlays/
    ├── dev/
    │   └── kustomization.yaml  # Dev-specific: 1 replica, debug logging
    ├── staging/
    │   └── kustomization.yaml  # Staging: 2 replicas, info logging
    └── prod/
        └── kustomization.yaml  # Prod: 5 replicas, error logging, resource limits
```
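
The overlay pattern can be pictured as a base manifest with a small environment-specific patch merged on top. The Python sketch below models that with a recursive dict merge. It is a simplification for intuition only: Kustomize itself applies strategic merge or JSON patches with list-aware rules, and `apply_overlay` plus the example manifests are hypothetical.

```python
import copy


def apply_overlay(base, patch):
    """Return a copy of `base` with `patch` merged in; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "kind": "Deployment",
    "metadata": {"name": "stdf-parser", "labels": {"app": "stdf-parser"}},
    "spec": {"replicas": 1},
}
overlays = {
    "dev": {"metadata": {"labels": {"env": "dev"}}},
    "staging": {"metadata": {"labels": {"env": "staging"}}, "spec": {"replicas": 2}},
    "prod": {"metadata": {"labels": {"env": "prod"}}, "spec": {"replicas": 5}},
}

rendered = {env: apply_overlay(base, patch) for env, patch in overlays.items()}
for env, manifest in rendered.items():
    print(env, manifest["spec"]["replicas"], manifest["metadata"]["labels"])
```

Each environment keeps the shared `app` label from the base and only states what differs, which is why a fix to the base reaches dev, staging, and prod in a single commit.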

**Pattern 2: Repo per Environment**
```
stdf-pipeline-dev/
stdf-pipeline-staging/
stdf-pipeline-prod/
```
**Use when:** Strict environment isolation (different teams, compliance requirements)

**Pattern 3: Monorepo with App-of-Apps**
```
infrastructure-monorepo/
├── apps/
│   ├── stdf-parser/
│   ├── yield-predictor/
│   └── wafer-analyzer/
├── environments/
│   ├── dev.yaml
│   ├── staging.yaml
│   └── prod.yaml
└── root-app.yaml  # ArgoCD App-of-Apps
```

#### **3. Progressive Delivery Strategies**

**Canary Release Timeline:**
```
Stage 1:   5% traffic (analyze for 1 hour)
Stage 2:  10% traffic (analyze for 1 hour)
Stage 3:  25% traffic (analyze for 1 hour)
Stage 4:  50% traffic (analyze for 2 hours)
Stage 5: 100% traffic (full promotion)

Rollback: Any stage, if metrics degrade → 0% traffic (instant)
```

**Metric Thresholds (Prometheus):**
- **Success Rate**: ≥99% (HTTP 2xx / total requests)
- **Latency**: p99 <150ms (99th percentile response time)
- **Custom Metrics**: Model accuracy ≥99.5%, throughput ≥1000 req/sec

**Flagger Configuration Best Practices:**
- **Interval**: 60s (balance between fast feedback and metric stability)
- **Threshold**: 5 (maximum number of failed metric checks before Flagger rolls back; tolerates transient issues without flapping)
- **Max Weight**: 100% (full promotion)
- **Step Weight**: 5% (small increments for low-risk testing)
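
How these four settings interact is easier to see in a small simulation. The sketch below walks through analysis intervals, raising the canary weight by `step_weight` after each passing check and rolling back once the number of failed checks reaches `threshold`. It assumes `threshold` counts failed checks, as in Flagger's analysis spec, and otherwise simplifies Flagger's behaviour; `run_canary` is a hypothetical helper.

```python
def run_canary(check_results, step_weight=5, max_weight=100, threshold=5):
    """Walk through analysis intervals and return (outcome, final_weight, failed_checks)."""
    weight, failed = 0, 0
    for passed in check_results:
        if not passed:
            failed += 1
            if failed >= threshold:
                return ("rolled_back", 0, failed)  # all traffic returns to the stable version
            continue  # hold the current weight and re-check on the next interval
        weight = min(weight + step_weight, max_weight)
        if weight >= max_weight:
            return ("promoted", weight, failed)
    return ("progressing", weight, failed)


print(run_canary([True] * 20))                # ('promoted', 100, 0)
print(run_canary([True] * 3 + [False] * 5))   # ('rolled_back', 0, 5)
print(run_canary([True, False, True, True]))  # ('progressing', 15, 1)
```

With a 5% step and a 60s interval, a clean rollout needs 20 passing checks, so promotion takes roughly 20 minutes; longer soak times per stage require a longer interval or larger gaps between steps.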

---

### ⚡ **Performance Optimization**

#### **1. Reduce ArgoCD Sync Overhead**

**Optimize Application Refresh:**
```yaml
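# Request a one-off hard refresh (bypasses the manifest cache; ArgoCD removes the annotation afterwards).
# The global polling interval is set by timeout.reconciliation in the argocd-cm ConfigMap.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  annotations:
    argocd.argoproj.io/refresh: "hard"  # Force refresh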
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - PruneLast=true  # Prune after sync (safer)
```

**Use ApplicationSet for Scalability:**
- Manage 100+ applications with single ApplicationSet (vs 100 Application CRDs)
- Generators: List, Git, Cluster, Matrix (combine generators)

#### **2. Optimize Flux Reconciliation**

**Tune Kustomization Interval:**
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: yield-predictor
spec:
  interval: 5m  # Required field; use a longer interval for stable apps
  retryInterval: 1m  # Retry on failure
  timeout: 3m  # Sync timeout
```

**Use Dependency Ordering:**
```yaml
# Kustomization with dependencies (database before app)
spec:
  dependsOn:
  - name: postgres-db
  - name: redis-cache
```
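
The effect of `dependsOn` is an apply order in which every Kustomization waits for the ones it names. A minimal Python sketch of that ordering, using the standard library's `graphlib` (Python 3.9+), is shown below. It only illustrates the ordering concept and is not Flux's controller logic; `apply_order` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter


def apply_order(depends_on):
    """Return Kustomization names ordered so that dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


# Maps each Kustomization to the set of Kustomizations it depends on
depends_on = {
    "yield-predictor": {"postgres-db", "redis-cache"},
    "postgres-db": set(),
    "redis-cache": set(),
}
order = apply_order(depends_on)
print(order)  # yield-predictor is always last

# A dependency cycle can never be satisfied and is reported as an error
try:
    apply_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected")
```

The relative order of `postgres-db` and `redis-cache` is not significant because neither depends on the other.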

#### **3. Image Automation Optimization**

**Flux ImageRepository:**
```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: yield-predictor
spec:
  image: ml-models/yield-predictor
  interval: 10m  # Check for new images every 10 minutes
  secretRef:
    name: docker-registry-secret
```

**ImagePolicy (semantic versioning):**
```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: yield-predictor
spec:
  imageRepositoryRef:
    name: yield-predictor
  policy:
    semver:
      range: '>=2.5.0 <3.0.0'  # Auto-update to any 2.x release from 2.5.0 up (minor and patch bumps, no major)
```
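
To show what the `>=2.5.0 <3.0.0` range selects, the sketch below picks the newest tag inside that range from a list of image tags. It handles only plain `MAJOR.MINOR.PATCH` tags and is not Flux's semver implementation; `parse`, `select_latest`, and the tag list are hypothetical.

```python
def parse(tag):
    """Parse 'MAJOR.MINOR.PATCH' (optionally prefixed with 'v') into a tuple of ints, else None."""
    parts = tag.lstrip("v").split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    return tuple(int(p) for p in parts)


def select_latest(tags, lower=(2, 5, 0), upper=(3, 0, 0)):
    """Return the highest tag with lower <= version < upper, or None if nothing matches."""
    candidates = [(parse(t), t) for t in tags if parse(t) is not None]
    in_range = [(version, tag) for version, tag in candidates if lower <= version < upper]
    return max(in_range)[1] if in_range else None


tags = ["2.4.9", "2.5.0", "2.10.1", "2.9.3", "3.0.0", "latest"]
print(select_latest(tags))  # 2.10.1
```

Comparing versions as integer tuples matters here: a plain string comparison would rank `2.9.3` above `2.10.1`.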

---

### 🔒 **Security Best Practices**

#### **1. Git Repository Access Control**

**SSH Key Authentication (Recommended):**
```bash
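# Generate SSH key for ArgoCD/Flux
ssh-keygen -t ed25519 -C "argocd@company.com" -f argocd-deploy-key

# Add public key to GitHub as deploy key (read-only)
# Add private key to Kubernetes secret (ArgoCD also needs the repo URL and type in the secret)
kubectl create secret generic argocd-repo-secret \
  --from-literal=type=git \
  --from-literal=url=git@github.com:ml-team/yield-predictor-manifests.git \
  --from-file=sshPrivateKey=argocd-deploy-key \
  -n argocd

# ArgoCD only picks up secrets labeled as repository credentials
kubectl label secret argocd-repo-secret \
  argocd.argoproj.io/secret-type=repository -n argocd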
```

**HTTPS with Personal Access Token:**
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: github-token
  namespace: flux-system
type: Opaque
stringData:
  username: git
  password: ghp_1234567890abcdef  # GitHub PAT (placeholder; never commit a real token to Git)
```

#### **2. RBAC and Multi-Tenancy**

**ArgoCD AppProject (Team Boundaries):**
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ml-team
spec:
  description: ML Team Applications
  sourceRepos:
  - https://github.com/ml-team/*  # Only ml-team repos
  destinations:
  - namespace: ml-inference  # Only ml-inference namespace
    server: https://kubernetes.default.svc
  clusterResourceWhitelist:
  - group: ''
    kind: Namespace
  roles:
  - name: ml-engineer
    policies:
    - p, proj:ml-team:ml-engineer, applications, sync, ml-team/*, allow
```

#### **3. Secret Management**

**Sealed Secrets (Encrypt secrets in Git):**
```bash
# Install Sealed Secrets controller
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml

# Encrypt secret
echo -n 'my-secret-password' | kubectl create secret generic db-password --dry-run=client --from-file=password=/dev/stdin -o yaml | \
  kubeseal -o yaml > sealed-secret.yaml

# Commit sealed-secret.yaml to Git (safe, encrypted)
# Sealed Secrets controller decrypts in cluster
```

**External Secrets Operator (AWS Secrets Manager, HashiCorp Vault):**
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: db-credentials
  data:
  - secretKey: password
    remoteRef:
      key: prod/db/password
```

---

### 🐛 **Troubleshooting Guide**

#### **Common Issues**

**Problem 1: ArgoCD application stuck in "OutOfSync"**
```bash
# Check sync status
argocd app get stdf-parser

# Force hard refresh
argocd app diff stdf-parser --hard-refresh

# Manual sync
argocd app sync stdf-parser --prune

# Check application events
kubectl describe application stdf-parser -n argocd
```

**Problem 2: Flux Kustomization fails to reconcile**
```bash
# Check Kustomization status
flux get kustomizations

# View reconciliation logs
flux logs --kind=Kustomization --name=yield-predictor

# Force reconciliation
flux reconcile kustomization yield-predictor --with-source

# Check GitRepository source
flux get sources git
```

**Problem 3: Flagger canary stuck in "Progressing"**
```bash
# Check canary status
kubectl describe canary yield-predictor -n ml-inference

# View Flagger controller logs
kubectl logs -n flagger-system deployment/flagger -f

# Check Prometheus metrics
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Query: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))
```

**Problem 4: Git authentication failure**
```bash
# ArgoCD: Check repository connection
argocd repo list

# Flux: Check GitRepository secret
kubectl get secret -n flux-system github-deploy-key -o yaml

# Test SSH connection
ssh -T git@github.com -i /path/to/deploy-key
```

**Problem 5: Resource prune deletes unexpected resources**
```yaml
# Prevent resource deletion with annotation
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    argocd.argoproj.io/sync-options: Prune=false
```

---

### 📊 **Monitoring and Observability**

#### **1. ArgoCD Metrics (Prometheus)**

**Application Sync Status:**
```prometheus
# Count applications by sync status
count by (sync_status) (argocd_app_info)

# Applications out of sync
argocd_app_info{sync_status="OutOfSync"}
```

**Sync Performance:**
```prometheus
# Reconciliation duration (p95); argocd_app_sync_total is a counter, so use the reconcile histogram
histogram_quantile(0.95,
  sum by (le) (rate(argocd_app_reconcile_bucket[5m]))
)
```

#### **2. Flux Metrics (Prometheus)**

**Reconciliation Status:**
```prometheus
# Kustomization reconciliation failures
gotk_reconcile_condition{type="Ready",status="False",kind="Kustomization"}

# Reconciliation duration
rate(gotk_reconcile_duration_seconds_sum[5m]) / 
rate(gotk_reconcile_duration_seconds_count[5m])
```

#### **3. Flagger Metrics**

**Canary Analysis:**
```prometheus
# Canary status distribution (flagger_canary_status: 0 = running, 1 = successful, 2 = failed)
count_values("status", flagger_canary_status)

# Canary failures
flagger_canary_status == 2
```

---

### 🚀 **Production Deployment Checklist**

#### **Pre-Deployment**

- [ ] **Git repository configured** (SSH key or PAT with appropriate permissions)
- [ ] **ArgoCD/Flux installed** (control plane deployed, controllers running)
- [ ] **RBAC configured** (AppProjects for ArgoCD, Kubernetes RBAC for Flux)
- [ ] **Secret management** (Sealed Secrets or External Secrets Operator)
- [ ] **Monitoring stack deployed** (Prometheus, Grafana for GitOps metrics)
- [ ] **Git repository structure** (base manifests, environment overlays)

#### **GitOps Configuration**

- [ ] **Application/Kustomization CRDs created** (one per environment: dev, staging, prod)
- [ ] **Sync policy configured** (automated vs manual, prune, selfHeal)
- [ ] **Health checks defined** (custom health assessments for CRDs)
- [ ] **Sync waves configured** (dependencies, database before app)
- [ ] **Notification configured** (Slack/email alerts on sync failures)

#### **Progressive Delivery (Flagger)**

- [ ] **Canary CRD created** (target deployment, service, analysis config)
- [ ] **Metric thresholds defined** (success rate, latency, custom metrics)
- [ ] **Prometheus metrics available** (ServiceMonitor for application metrics)
- [ ] **Load testing configured** (Flagger loadtester or custom webhooks)
- [ ] **Rollback tested** (verify automatic rollback on metric failures)

#### **Security**

- [ ] **Git commits signed** (GPG signatures for audit trail)
- [ ] **Manifest scanning** (Checkov, OPA policies in CI/CD)
- [ ] **Image signature verification** (Cosign, Notary)
- [ ] **Network policies** (restrict egress to Git repository only)
- [ ] **Audit logging enabled** (ArgoCD audit logs, Flux controller logs)

---

### 🎓 **Learning Path Next Steps**

#### **Beginner → Intermediate**
1. ✅ Complete Notebooks 131-135 (Docker, Kubernetes, Service Mesh, GitOps)
2. 📚 **Next**: Notebook 136 - CI/CD for ML (Tekton, GitHub Actions with GitOps)
3. 📚 Practice: Deploy ArgoCD on local Kubernetes (Minikube, Kind)
4. 🛠️ Build Project 1 (Multi-Environment STDF Pipeline with ArgoCD ApplicationSet)

#### **Intermediate → Advanced**
1. 📚 Notebook 137 - Infrastructure as Code (Terraform + ArgoCD for full stack GitOps)
2. 📚 Notebook 138 - Container Security (Falco, OPA Gatekeeper integrated with GitOps)
3. 🛠️ Build Project 2 (Canary Deployment with Flagger + Prometheus)
4. 🛠️ Build Project 5 (Multi-Cluster GitOps with Cluster API)

#### **Advanced → Expert**
1. 📚 Contribute to ArgoCD/Flux open source (feature requests, bug fixes, plugins)
2. 🛠️ Build custom Argo Workflows (ML pipeline orchestration with GitOps)
3. 🛠️ Implement GitOps for edge deployments (K3s clusters, Akri, Azure Arc)
4. 🛠️ Build Project 7 (Automated Rollback with Prometheus Alerts + ArgoCD API)

---

### 📖 **Additional Resources**

#### **Official Documentation**
- [ArgoCD Documentation](https://argo-cd.readthedocs.io/)
- [Flux Documentation](https://fluxcd.io/docs/)
- [Flagger Documentation](https://docs.flagger.app/)
- [GitOps Working Group (CNCF)](https://opengitops.dev/)

#### **Books**
- "GitOps and Kubernetes" by Billy Yuen, Alexander Matyushentsev, Todd Ekenstam, Jesse Suen
- "Continuous Delivery with Docker and Jenkins" by Rafał Leszko
- "Kubernetes Patterns" by Bilgin Ibryam & Roland Huß

#### **Tools**
- [ArgoCD](https://argo-cd.readthedocs.io/) - Declarative GitOps for Kubernetes
- [Flux](https://fluxcd.io/) - GitOps Toolkit
- [Flagger](https://flagger.app/) - Progressive delivery operator
- [Sealed Secrets](https://sealed-secrets.netlify.app/) - Encrypt secrets in Git
- [Weave GitOps Dashboard](https://www.weave.works/product/gitops-core/) - UI for Flux

---

### 💡 **Key Insights for Post-Silicon Validation**

#### **Why GitOps for Semiconductor Testing**

**Multi-Environment Consistency:**
- STDF pipeline deployed to dev/staging/prod with identical Git workflow
- Environment-specific config (1 replica dev, 5 replicas prod) managed via Kustomize overlays
- **Value**: Eliminate environment drift, ensure test parity across all environments

**Automated Rollback for Model Updates:**
- Yield prediction model v3.2 deployed via ArgoCD, with canary analysis by Flagger
- Metrics degrade (accuracy 98.7% < 99% threshold) → automatic rollback to v3.1
- **Value**: Prevent bad model deployments from affecting production yield analysis

**Disaster Recovery:**
- Entire post-silicon test infrastructure (10 microservices, 5 databases) stored in Git
- Cluster failure → rebuild from Git in 15 minutes (vs 8 hours manual recovery)
- **Value**: Minimize test delays, meet production schedules despite infrastructure failures

**Audit Trail:**
- Every deployment change tracked in Git history (who, what, when, why)
- Compliance requirements (SOC 2, ISO 27001) satisfied with Git-based audit trail
- **Value**: Reduce audit time 60%, demonstrate change control for regulatory compliance

---

### ✅ **Final Checklist**

**You've mastered GitOps if you can:**

- [ ] Explain GitOps principles (Git as single source of truth, pull-based deployment)
- [ ] Deploy ArgoCD application with automated sync and self-healing
- [ ] Configure Flux GitRepository and Kustomization CRDs
- [ ] Implement Flagger canary with Prometheus metric thresholds
- [ ] Rollback deployment via `git revert` (instant rollback to previous commit)
- [ ] Configure multi-environment deployment (dev/staging/prod with overlays)
- [ ] Troubleshoot sync failures (ArgoCD diff, Flux reconcile logs)
- [ ] Integrate secret management (Sealed Secrets or External Secrets)

**Ready for Production if you can:**

- [ ] Design multi-cluster GitOps architecture (centralized ArgoCD or Flux per cluster)
- [ ] Implement progressive delivery (canary releases with automatic promotion/rollback)
- [ ] Configure RBAC for multi-tenancy (AppProjects, team boundaries)
- [ ] Integrate security scanning (Checkov, OPA policies in CI/CD)
- [ ] Build disaster recovery strategy (rebuild cluster from Git in <20 minutes)
- [ ] Automate rollback based on Prometheus alerts (no manual intervention)
- [ ] Implement blue-green deployment (instant traffic switch, <10 second rollback)
- [ ] Manage secrets securely (Sealed Secrets, External Secrets Operator)

---

### 🚀 **Congratulations!**

You've completed **Notebook 135: GitOps for ML - ArgoCD and Flux**. You now understand:
- ✅ GitOps principles (Git as single source of truth, declarative config, pull-based deployment)
- ✅ ArgoCD (application management, automated sync, self-healing, rollback)
- ✅ Flux (GitRepository, Kustomization, HelmRelease, image automation)
- ✅ Flagger (progressive delivery, canary analysis, metric-based promotion/rollback)
- ✅ Multi-environment strategy (Kustomize overlays, App-of-Apps, monorepo)

**Next Steps:**
- **Notebook 136**: CI/CD for ML (Tekton, GitHub Actions, automated pipelines)
- **Notebook 137**: Infrastructure as Code (Terraform + ArgoCD for full GitOps)
- **Notebook 138**: Container Security & Compliance (Falco, OPA Gatekeeper)

**Keep Building! 🎉**

## 🎯 Key Takeaways

### When to Use GitOps
- **Declarative infrastructure**: All K8s manifests in Git (deployments, services, config)
- **Audit trail**: Every change tracked with Git history (who, what, when, why)
- **Rollback capability**: Instant revert to previous working state (`git revert`)
- **Multi-environment consistency**: Promote changes dev → staging → prod via Git branches/tags
- **Team collaboration**: Pull request review workflow for infrastructure changes

### Limitations
- **Git as single source of truth**: Manual `kubectl` changes drift from Git (need drift detection)
- **Secrets management**: Storing secrets in Git risky (need SealedSecrets, SOPS, Vault)
- **Learning curve**: Developers need to learn YAML, Kustomize/Helm, Git workflows
- **Sync latency**: ArgoCD polls Git every 3 minutes by default and Flux at each resource's configured `interval` (use a manual sync or Git webhooks for immediate changes)
- **Complex debugging**: Issues span Git, K8s, ArgoCD - multi-layer troubleshooting

### Alternatives
- **Imperative deployments**: `kubectl apply`, CI/CD scripts push directly to K8s (simpler, less traceable)
- **Helm-only**: Use Helm CLI without GitOps (good for testing, bad for production repeatability)
- **Cloud-native CD**: Spinnaker, Jenkins X for deployment (more features, higher complexity)
- **Manual deployments**: For small teams, direct `kubectl` can work (doesn't scale)

### Best Practices
- **Separate repos**: Infrastructure repo (K8s manifests) vs. application repo (source code)
- **Environment branching**: Main branch = prod, develop = staging (or use Kustomize overlays)
- **Automated sync**: Enable auto-sync with prune for hands-off operations
- **Sync waves**: Order deployments (database → app → ingress) with annotations
- **Health checks**: ArgoCD validates deployments healthy before marking synced
- **Secret encryption**: Use SOPS + age or SealedSecrets for sensitive data in Git

## 🔍 Diagnostic Checks & Mastery

### Implementation Checklist
- ✅ **GitOps repo**: Separate infrastructure repo with K8s manifests
- ✅ **ArgoCD/Flux**: Installed and syncing Git → K8s cluster
- ✅ **Auto-sync**: Enable automatic deployment on Git commit
- ✅ **SOPS/SealedSecrets**: Encrypted secrets in Git
- ✅ **Kustomize/Helm**: Environment-specific overlays (dev/staging/prod)
- ✅ **Sync waves**: Order deployments for dependencies

### Post-Silicon Applications
**Model Deployment Automation**: GitOps-driven yield prediction model deployments across 15 fabs, audit trail for compliance, save $600K/year deployment overhead

### Mastery Achievement
✅ Implement GitOps workflow with ArgoCD or Flux  
✅ Store all K8s manifests in Git with version control  
✅ Automate deployments with Git commits (no manual kubectl)  
✅ Manage secrets securely (SOPS, SealedSecrets)  
✅ Rollback instantly with git revert  
✅ Apply to semiconductor ML model deployment workflows  

**Next Steps**: 136_CICD_ML_Pipelines, 151_MLOps_Fundamentals


```python
# argocd-application.yaml (multi-environment)
"""
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: yield-prediction-prod
  namespace: argocd
spec:
  project: ml-models
  source:
    repoURL: https://github.com/fab/ml-deployments
    targetRevision: main
    path: kustomize/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
"""

# Directory structure:
"""
kustomize/
├── base/
│   ├── deployment.yaml      # Common deployment config
│   ├── service.yaml
│   └── kustomization.yaml
└── overlays/
    ├── dev/
    │   ├── kustomization.yaml    # 1 replica, CPU limits
    │   └── configmap.yaml        # dev database
    ├── staging/
    │   ├── kustomization.yaml    # 2 replicas, higher resources
    │   └── configmap.yaml        # staging database
    └── production/
        ├── kustomization.yaml    # 5 replicas, GPU enabled
        └── configmap.yaml        # prod database (read-only)
"""

# Post-Silicon Use Case:
# Git commit to main branch → ArgoCD auto-syncs production yield model
# Staging branch deploys to staging cluster for validation
# Dev branch deploys to dev cluster with 1 replica for testing
# Audit trail: All deployments tracked via Git commits
# Rollback: Revert Git commit → ArgoCD redeploys previous version
# Save $420K/year (eliminate manual deployment errors, 3 SRE-days/month)
```

## 🏭 Advanced Pattern: Multi-Environment Model Deployment with ArgoCD

Manage dev/staging/prod ML deployments with Git branches and Kustomize overlays.