# Production EKS Deployment - Interview Questions & Answers

## Overview
This notebook covers common interview questions and detailed answers for a production-grade ML inference service deployed on AWS EKS with Terraform, Kubernetes, and FastAPI.

**Project Context:**
- ML Model: Random Forest Classifier (scikit-learn)
- API Framework: FastAPI with Uvicorn
- Container Registry: AWS ECR
- Orchestration: AWS EKS (Elastic Kubernetes Service)
- Infrastructure: Terraform
- Region: ap-southeast-2 (Sydney)

---

# 1. KUBERNETES FUNDAMENTALS

## Q1.1: Explain the difference between Deployment, StatefulSet, and DaemonSet

### Answer:

**Deployment:**
- For **stateless applications** (like our ML API)
- Pods are interchangeable and can be scaled up/down
- Manages ReplicaSets for rolling updates
- Our use case: ML inference service with 2-10 replicas

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-api
spec:
  replicas: 2  # Managed by HPA (2-10)
  selector:
    matchLabels:
      app: ml-inference
```

**StatefulSet:**
- For **stateful applications** (databases, caches)
- Pods have stable, unique identities (pod-0, pod-1, pod-2)
- Maintains persistent storage
- Example: PostgreSQL database with persistent volumes

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: "postgres"
  replicas: 3
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
```

**DaemonSet:**
- Ensures **one pod per node** (or per selected nodes)
- Used for cluster monitoring, logging, networking
- Examples: Prometheus node exporter, Fluentd, Calico

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prometheus-node-exporter
spec:
  selector:
    matchLabels:
      app: prometheus-node-exporter
  template:
    # Pod template automatically deployed to all nodes
```

### Key Differences Table:

| Aspect | Deployment | StatefulSet | DaemonSet |
|--------|-----------|------------|----------|
| Use Case | Stateless apps | Stateful apps | Node-level services |
| Pod Identity | Interchangeable | Stable & unique | One per node |
| Scaling | Any number | Ordered, unique | Fixed per topology |
| Storage | No persistent | Persistent volumes | Host storage |
| Examples | APIs, web servers | Databases, Kafka | Monitoring, logging |
| Our Usage | ✓ ML API | ✗ | ✗ |

### Follow-up Questions:
- Why can't you scale a StatefulSet as easily as a Deployment?
- What happens if you delete a pod in a StatefulSet?
- Why would you use a DaemonSet instead of a Deployment with node affinity?

## Q1.2: Explain Kubernetes Services and the difference between ClusterIP, NodePort, LoadBalancer, and ExternalName

### Answer:

**Kubernetes Service** = Network abstraction that provides stable endpoint for pods

**ClusterIP (Default):**
- Only accessible **within the cluster**
- Internal communication between pods
- Our use case: `ml-inference-service` for internal pod-to-pod communication
- The IP is stable for the life of the Service, but it is a virtual IP that is only routable from inside the cluster

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-inference-service
spec:
  type: ClusterIP
  ports:
    - port: 5000
      targetPort: 5000
  selector:
    app: ml-inference
```

**NodePort:**
- Opens port on **every node** (30000-32767)
- Can access from outside: `<node-ip>:<node-port>`
- Problems: Requires managing many node IPs, port conflicts
- Use case: Development, testing, legacy systems

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-inference-nodeport
spec:
  type: NodePort
  selector:
    app: ml-inference
  ports:
    - port: 5000
      targetPort: 5000
      nodePort: 30500  # Fixed port on all nodes
```

**LoadBalancer:**
- Provisions a **cloud provider load balancer** (an AWS NLB in our case; ALBs are provisioned through Ingress)
- Single public IP/hostname for external access
- Our use case: `ml-inference-lb` service
- Best for: Production APIs needing external access

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-inference-lb
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 5000
  selector:
    app: ml-inference
```

**ExternalName:**
- Creates DNS CNAME record pointing to external service
- Use case: Access external database, legacy systems
- No load balancing, just DNS redirection

```yaml
apiVersion: v1
kind: Service
metadata:
  name: external-database
spec:
  type: ExternalName
  externalName: database.example.com
  ports:
    - port: 5432
```

### Architecture in Our Setup:

```
Internet
  ↓
AWS NLB (LoadBalancer service, port 80)
  ↓
Kubernetes Node 1           Kubernetes Node 2
  ├─ Pod ml-inference-api   ├─ Pod ml-inference-api
  ├─ Pod ml-inference-api   └─ Pod ml-inference-api
  └─ Pod ml-inference-api
  
All pods accessible via:
- ClusterIP: ml-inference-service:5000 (internal)
- LoadBalancer: <NLB-IP>:80 (external)
```
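Inside the cluster, the ClusterIP service is normally reached by its DNS name rather than by IP. The sketch below only builds that name; it assumes the default cluster domain `cluster.local` and the `ml-inference` namespace used elsewhere in this notebook. The helper names are hypothetical.

```python
def service_dns_name(service: str, namespace: str,
                     cluster_domain: str = "cluster.local") -> str:
    """Return the fully qualified in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def service_url(service: str, namespace: str, port: int, path: str = "/") -> str:
    """Build an HTTP URL that a pod in any namespace could use to reach the Service."""
    return f"http://{service_dns_name(service, namespace)}:{port}{path}"


if __name__ == "__main__":
    # Assumed namespace and port, taken from the manifests in this notebook
    print(service_url("ml-inference-service", "ml-inference", 5000, "/health"))
```

Pods in the same namespace can use the short name (`ml-inference-service:5000`); the fully qualified form works from any namespace.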

### Follow-up:
- What happens to existing connections when you scale down pods?
- How does service discovery work in Kubernetes?
- Why use both ClusterIP and LoadBalancer services?

## Q1.3: Explain Kubernetes Ingress and when to use it vs LoadBalancer Service

### Answer:

**Ingress:**
- **Layer 7 (Application layer)** routing
- Host-based and path-based routing
- Multiple services behind single IP
- Requires an Ingress Controller (the AWS Load Balancer Controller in our case)
- Cost-effective for multiple services

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-inference-ingress
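  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip  # Route straight to pod IPs (backend is a ClusterIP service)
spec:
  ingressClassName: alb  # Replaces the deprecated kubernetes.io/ingress.class annotation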
  rules:
  - host: api.ml-inference.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ml-inference-service
            port:
              number: 5000
```

**LoadBalancer Service:**
- **Layer 4 (Transport layer)** routing
- Direct IP routing (TCP/UDP)
- One load balancer per service
- No routing rules (just port mapping)
- Simpler but more expensive

### Comparison:

| Aspect | Ingress | LoadBalancer Service |
|--------|---------|---------------------|
| Layer | L7 (HTTP/HTTPS) | L4 (TCP/UDP) |
| Load Balancer | Shared (ALB) | Dedicated per service (NLB) |
| Host-based routing | ✓ Yes | ✗ No |
| Path-based routing | ✓ Yes | ✗ No |
| Cost | ✓ Lower (1 ALB for multiple services) | ✗ Higher (1 NLB per service) |
| SSL/TLS | ✓ Easy (via annotations) | ✓ Easy (via service) |
| Latency | Slightly higher (extra hop) | Lower (direct) |
| Use Case | REST APIs, web apps | Databases, gaming, low latency |
| Our Usage | ✓ Optional (for domain) | ✓ Primary method |

### Our Architecture Decision:

We use **LoadBalancer service** because:
1. Simple single API service (not multiple)
2. Direct L4 routing is sufficient
3. Lower latency for inference requests
4. Cost difference minimal for single service

If we had multiple services (API, admin, webhook), we'd use **Ingress** instead:
```
api.ml-inference.com → ALB → api service
admin.ml-inference.com → ALB → admin service
webhook.ml-inference.com → ALB → webhook service
```
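The host-based and path-based routing that an Ingress performs can be pictured with a small lookup. This is a minimal sketch, not how the ALB implements it: the rule table and backend names are hypothetical, and the matching follows the element-wise semantics of `pathType: Prefix`.

```python
from typing import Optional

# (host, path prefix, backend service); hypothetical rules for illustration
RULES = [
    ("api.ml-inference.com", "/", "api-service"),
    ("admin.ml-inference.com", "/", "admin-service"),
    ("webhook.ml-inference.com", "/hooks", "webhook-service"),
]


def prefix_matches(rule_path: str, request_path: str) -> bool:
    """pathType: Prefix compares whole path elements, not raw string prefixes."""
    rule_parts = [p for p in rule_path.split("/") if p]
    request_parts = [p for p in request_path.split("/") if p]
    return request_parts[: len(rule_parts)] == rule_parts


def route(host: str, path: str) -> Optional[str]:
    """Return the backend with the longest matching prefix for this host, or None."""
    candidates = [
        (len(rule_path), backend)
        for rule_host, rule_path, backend in RULES
        if rule_host == host and prefix_matches(rule_path, path)
    ]
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    print(route("api.ml-inference.com", "/predict-single"))
    print(route("webhook.ml-inference.com", "/hooks/github"))
```

A LoadBalancer service has no equivalent of this table: it forwards a port to one set of pods, which is why it stays at Layer 4.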

### Follow-up:
- How would you implement path-based routing (e.g., /api vs /health)?
- Can you use both Ingress and LoadBalancer together?
- How do you handle SSL/TLS with Ingress?

# 2. KUBERNETES ADVANCED TOPICS

## Q2.1: Explain health checks in Kubernetes: Liveness, Readiness, and Startup Probes

### Answer:

These three probe types help Kubernetes manage pod lifecycle:

**Startup Probe:**
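- Checks whether the application inside the container has finished starting
- Liveness and readiness probes are held back until it succeeds, so a slow start is not mistaken for a hang
- If it never succeeds within the allowed budget, the container is killed and restarted
- Our config: 30 checks × 10 seconds = 300 seconds max startup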

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 0
  periodSeconds: 10
  failureThreshold: 30  # 30 × 10 s = 5 min max startup
```

**Liveness Probe:**
- Checks if container is **still alive**
- If it fails, the kubelet restarts the container (the pod itself is kept)
- Detects hangs the process cannot recover from on its own, such as deadlocks or infinite loops
- Our config: Check every 10 seconds, restart after 3 failures

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 30  # Wait 30 sec before first check
  periodSeconds: 10        # Check every 10 sec
  failureThreshold: 3      # Restart after 3 failures
```

**Readiness Probe:**
- Checks if container is **ready to accept traffic**
- If it fails, the pod is removed from the Service endpoints (and therefore from the load balancer) but is not restarted
- Allows graceful degradation (e.g., during database migration)
- Our config: Check every 5 seconds, remove from LB after 2 failures

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 10  # Wait 10 sec before first check
  periodSeconds: 5         # Check every 5 sec (more frequent)
  failureThreshold: 2      # Remove from LB after 2 failures
```

### Timeline Example:

```
t=0s     Pod created, container starts
t=0-300s Startup probe runs every 10 s (allows 300s to initialize)
t=30s    Startup successful → Liveness/Readiness probes begin
t≈35s    First readiness check succeeds (successThreshold defaults to 1)
         → Pod becomes READY, added to Service endpoints
t=35s+   Continuous monitoring:
        - Readiness: every 5 sec (fast detection of service degradation)
        - Liveness: every 10 sec (detect hangs)

If readiness fails:
  → Pod removed from Service endpoints (no traffic sent)
  → But pod stays running (app can recover)

If liveness fails:
  → Container is killed and restarted in place (the pod is not recreated)
```
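The threshold behaviour in the timeline above can be sketched as a small counter. This is an illustrative model of how consecutive results flip a probe's state, not the kubelet's actual code; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Tracks consecutive probe results against failure/success thresholds."""
    failure_threshold: int
    success_threshold: int = 1
    healthy: bool = False
    _failures: int = 0
    _successes: int = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result and return the resulting healthy state."""
        if ok:
            self._failures = 0
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._successes = 0
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


def worst_case_detection_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Upper bound on how long consecutive failures take to trip the threshold."""
    return period_seconds * failure_threshold


if __name__ == "__main__":
    readiness = ProbeTracker(failure_threshold=2)
    for result in (True, False, False, True):
        print(result, "->", readiness.record(result))
```

With the values in this notebook, readiness trips after about 10 seconds of failures (5 s × 2), liveness after about 30 seconds (10 s × 3), and the startup budget is 300 seconds (10 s × 30).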

### Probe Types in Our Implementation:

```yaml
containers:
- name: ml-api
  image: ml-inference-service:latest
  ports:
  - containerPort: 5000
  
  # Startup: Give app 5 min to initialize
  startupProbe:
    httpGet:
      path: /health
      port: 5000
    failureThreshold: 30
    periodSeconds: 10
  
  # Readiness: Quick detection of issues (every 5s)
  readinessProbe:
    httpGet:
      path: /health
      port: 5000
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 2
  
  # Liveness: Restart if really dead (every 10s)
  livenessProbe:
    httpGet:
      path: /health
      port: 5000
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3
```

### Follow-up:
- What happens if a pod fails readiness but passes liveness?
- Why different `periodSeconds` for readiness (5) vs liveness (10)?
- How would you implement a custom readiness check?

## Q2.2: Explain Horizontal Pod Autoscaler (HPA) and how it works

### Answer:

**HPA** automatically scales pods based on metrics:

### How HPA Works (Controller Loop):

```
Every 15 seconds (default sync period):

1. Metrics Server collects pod metrics
   ├─ CPU usage (millicores)
   └─ Memory usage (bytes)

2. HPA Controller reads metrics
   └─ Calculates desired replicas based on policy

3. Compare current vs desired
   ├─ If metric / target > 1.1 → Scale UP
   ├─ If metric / target < 0.9 → Scale DOWN (after the stabilization window)
   └─ Otherwise (within the default 10% tolerance) → No change

4. Update Deployment replicas
   └─ Kubernetes creates/destroys pods
```

### Our HPA Configuration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-api
  
  minReplicas: 2      # Always at least 2 for HA
  maxReplicas: 10     # Max 10 to control costs
  
  # Metrics to monitor
  metrics:
  # Metric 1: CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale up if avg CPU > 70%
  
  # Metric 2: Memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale up if avg memory > 80%
  
  # Scaling behavior
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:  # Default selectPolicy is Max: the larger allowance applies
      - type: Percent
        value: 100  # Double the pods (100% increase)
        periodSeconds: 30
      - type: Pods
        value: 2    # Add 2 pods
        periodSeconds: 30
    
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50   # Remove 50% of pods
        periodSeconds: 60
```
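The `behavior` policies cap how far the replica count may move in one period. The sketch below is a simplified reading of those rules: it assumes no other scaling happened earlier in the same period, ignores the stabilization window, and uses the default `selectPolicy: Max`. Function names are hypothetical, and the rounding (up for scale-up, truncation for scale-down) is my understanding of the upstream controller rather than something stated in this notebook.

```python
import math


def scale_up_limit(current: int, percent: int = 100, pods: int = 2) -> int:
    """Highest replica count the scaleUp policies allow in one period."""
    by_percent = math.ceil(current * (1 + percent / 100))
    by_pods = current + pods
    return max(by_percent, by_pods)  # selectPolicy: Max picks the larger allowance


def scale_down_limit(current: int, percent: int = 50) -> int:
    """Lowest replica count the scaleDown policy allows in one period."""
    return int(current * (1 - percent / 100))


def next_replicas(current: int, desired: int,
                  min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Clamp the desired count by the rate limits, then by min/max replicas."""
    if desired > current:
        desired = min(desired, scale_up_limit(current))
    elif desired < current:
        desired = max(desired, scale_down_limit(current))
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(next_replicas(2, 5))  # limited by the scale-up policies
    print(next_replicas(8, 2))  # limited by the 50% scale-down policy
```

Because the scale-up allowance is generous and the scale-down allowance is tight, the configuration reacts quickly to spikes and releases capacity slowly.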

### Scaling Algorithm:

```
desiredReplicas = ceil(
  currentReplicas × 
  (current_metric / target_metric)
)

Example:
- Current: 2 replicas, each using 80% CPU
- Average CPU: 80%
- Target CPU: 70%
- Desired: ceil(2 × (80/70)) = ceil(2.29) = 3 replicas
```
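The formula can be written out directly. This sketch adds the controller's default 10% tolerance, which means small deviations from the target (for example 75% against a 70% target) do not trigger scaling; the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Apply desired = ceil(current * metric / target), skipping changes within tolerance."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    print(desired_replicas(2, 80, 70))  # the worked example above
    print(desired_replicas(2, 75, 70))  # inside the tolerance band
```

The result is then clamped to `minReplicas`/`maxReplicas` and rate-limited by the `behavior` policies before the Deployment is updated.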

### Timeline Example:

```
t=0min    API deployed with 2 replicas (min)
          CPU per pod: 30% of request
          All pods serving traffic

t=5min    Traffic spike! 100 requests/sec
          CPU per pod: 140% of request (well above the 70% target)
          HPA computes: ceil(2 × 140/70) = 4 replicas

t=5:30min HPA has added 2 pods (within the +100% / +2 pods per 30 s policy)
          Now 4 replicas total
          CPU per pod: 70% (load distributed)

t=6min    Traffic still rising
          CPU: 100% → ceil(4 × 100/70) = 6 pods

t=7min    Traffic peaks
          CPU: 82% → ceil(6 × 82/70) = 8 pods

t=8min    Traffic drops sharply
          CPU: 15% → ceil(8 × 15/70) = 2 pods recommended
          HPA waits 5 min (stabilization window)

t=13min   Traffic still low (CPU 15%)
          Scale down capped at 50% per minute: 8 × 0.5 = 4 pods

t=14min   Scale down again: 4 × 0.5 = 2 pods (min)
          Back to baseline
```

### Resource Requests/Limits (Critical for HPA):

```yaml
resources:
  requests:      # What pod needs to run
    cpu: 250m    # Kubernetes reserves 250 millicores
    memory: 512Mi # Kubernetes reserves 512 MB
  
  limits:        # Max allowed
    cpu: 1000m   # Can burst up to 1 CPU
    memory: 1Gi  # Can use up to 1 GB

# HPA calculates: (actual usage / requests) × 100 = utilization %
# Example: Using 175m CPU of 250m request = 70% utilization
```
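The utilization arithmetic in the comment above, as a small helper. It handles only the two CPU quantity forms used in this notebook (`250m` and whole or decimal cores); the function names are hypothetical.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '1' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def cpu_utilization_percent(usage: str, request: str) -> float:
    """Utilization as the HPA sees it: usage relative to the request, not the limit."""
    return parse_cpu_millicores(usage) * 100 / parse_cpu_millicores(request)


if __name__ == "__main__":
    print(cpu_utilization_percent("175m", "250m"))
    print(cpu_utilization_percent("1000m", "250m"))  # at the limit, far above 100%
```

Because the limit (1000m) is four times the request (250m), utilization can legitimately exceed 100%, which is what lets a traffic spike produce a large scale-up in one step.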

### Follow-up:
- Why have both minReplicas and a LoadBalancer service?
- What's the difference between Utilization and AverageValue metrics?
- How would you scale based on custom metrics (e.g., request latency)?

## Q2.3: What is a Pod Disruption Budget (PDB) and why do we use it?

### Answer:

**Pod Disruption Budget** = Guarantee minimum availability during "voluntary" disruptions

### Types of Disruptions:

**Voluntary Disruptions:**
- Node maintenance (patching, upgrading)
- Cluster autoscaling (removing underutilized nodes)
- Manual kubectl drain
- Pod evictions

**Involuntary Disruptions:**
- Hardware failure
- Network partition
- Power outage
- Kernel panic
- PDB does NOT protect against these

### Our PDB Configuration:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ml-inference-pdb
spec:
  minAvailable: 1  # Always keep at least 1 pod running
  selector:
    matchLabels:
      app: ml-inference
```

### Scenario Without PDB:

```
Cluster has 2 nodes with 5 pods total:
Node 1: 3 ml-inference pods
Node 2: 2 ml-inference pods

Node 1 needs security patch:
1. kubectl drain Node 1 (evict all pods)
2. 3 pods are terminated at once (nothing limits concurrent evictions)
3. Only 2 pods remain on Node 2
4. API is degraded (underprovisioned)
5. The ReplicaSet creates 3 replacements, but they need time to schedule, start, and pass probes
6. Requests during this window: high latency/timeouts
   Worst case: if every replica sits on the drained node, availability drops to 0
```

### Scenario With PDB (minAvailable: 1):

```
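Worst-case layout (minimum scale, both pods on one node):
Node 1: 2 ml-inference pods
Node 2: 0 ml-inference pods

Node 1 needs security patch:
1. kubectl drain Node 1 requests eviction of both pods
2. PDB says: "Keep at least 1 pod available"
3. First eviction is allowed (2 → 1 available); the second is refused and retried
4. The ReplicaSet schedules a replacement on Node 2
5. Once the replacement is Ready:
   - The second eviction is allowed
   - At least 1 pod served traffic throughout
6. Drain completes and Node 1 can be patched
7. Pods stay on Node 2 afterwards (they are not moved back automatically)

With 3 pods on Node 1 and 2 on Node 2, minAvailable: 1 would not slow
the drain at all: evicting 3 pods still leaves 2 available. Use a higher
value (e.g. 50%) to limit capacity loss, not just a total outage.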
```

### PDB Strategies:

**Strategy 1: minAvailable (absolute count)**
```yaml
minAvailable: 1  # Always 1+ pods
# Good when you need guaranteed minimum replicas
```

**Strategy 2: minAvailable (percentage)**
```yaml
minAvailable: 50%  # Always 50%+ of replicas
# With 2-10 replicas (HPA), scales dynamically
# Min 2 replicas → minAvailable = 1
# Max 10 replicas → minAvailable = 5
```

**Strategy 3: maxUnavailable (absolute count)**
```yaml
maxUnavailable: 1  # Max 1 pod can be disrupted
# Opposite of minAvailable
```
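How many pods a PDB allows to be evicted at a given moment follows from the three strategies above. The sketch below is an illustrative calculation, assuming percentages are resolved against the expected replica count and rounded up; the function names are hypothetical.

```python
import math
from typing import Optional, Union

IntOrPercent = Union[int, str]


def resolve(value: IntOrPercent, expected_pods: int) -> int:
    """Resolve an absolute count or a percentage string such as '50%' (rounded up)."""
    if isinstance(value, str) and value.endswith("%"):
        return math.ceil(expected_pods * int(value[:-1]) / 100)
    return int(value)


def allowed_disruptions(healthy: int, expected: int,
                        min_available: Optional[IntOrPercent] = None,
                        max_unavailable: Optional[IntOrPercent] = None) -> int:
    """Number of additional voluntary evictions the budget currently permits."""
    if min_available is not None:
        required_healthy = resolve(min_available, expected)
    elif max_unavailable is not None:
        required_healthy = expected - resolve(max_unavailable, expected)
    else:
        raise ValueError("set either min_available or max_unavailable")
    return max(0, healthy - required_healthy)


if __name__ == "__main__":
    print(allowed_disruptions(5, 5, min_available=1))
    print(allowed_disruptions(10, 10, min_available="50%"))
    print(allowed_disruptions(5, 5, max_unavailable=1))
```

The first call shows why `minAvailable: 1` is permissive at higher replica counts: with 5 healthy pods, 4 may be evicted at once.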

### Why minAvailable: 1 for Our Setup:

1. **HPA Range (2-10 replicas):** With at least 2 replicas, `minAvailable: 1` always leaves room to evict one pod, so node drains never block indefinitely
2. **API Service:** Must remain responsive during maintenance
3. **Graceful Degradation:** 1 pod is better than 0
4. **Cost:** minAvailable: 1 doesn't require extra resources

### Interaction with HPA:

```yaml
HPA:
  minReplicas: 2   # HPA keeps at least 2
  maxReplicas: 10

PDB:
  minAvailable: 1  # During disruption, keep at least 1

Result:
- Normal state: 2-10 pods (HPA manages)
- During disruption: At least 1 pod guaranteed
```

### Follow-up:
- What happens if maxUnavailable exceeds minAvailable?
- Can PDB prevent all disruptions?
- How does PDB interact with node affinity?

# 3. AWS & TERRAFORM

## Q3.1: Explain IRSA (IAM Roles for Service Accounts) and why it's better than hardcoding credentials

### Answer:

**IRSA** = Kubernetes ServiceAccount can assume AWS IAM roles

### Traditional Approach (Bad):

```yaml
# Create AWS access keys
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
stringData:
  AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE
  AWS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

---
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: app
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: AWS_ACCESS_KEY_ID
```

**Problems:**
- ✗ Long-lived credentials (never expire)
- ✗ Credentials visible in pod environment
- ✗ Hard to rotate (requires secret update + pod restart)
- ✗ Difficult to audit (who has access?)
- ✗ If pod compromised, attacker gets AWS keys
- ✗ Cannot granularly control which pod has which role

### IRSA Approach (Good):

```yaml
# 1. Kubernetes ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-inference-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::802520734572:role/ml-inference-pod-role

---
# 2. Pod uses the ServiceAccount
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      serviceAccountName: ml-inference-sa
      containers:
      - name: ml-api
        image: ml-inference-service:latest
```

### How IRSA Works (Behind the Scenes):

```
1. EKS Cluster has OIDC Provider
   └─ URL: https://oidc.eks.ap-southeast-2.amazonaws.com/id/EXAMPLEID

2. IAM Trust Relationship
   └─ Role "ml-inference-pod-role" trusts the OIDC provider
      for ServiceAccount "ml-inference:ml-inference-sa"

3. When pod starts:
   a) EKS pod identity webhook mutates the pod spec
      ├─ Injects env vars AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE
      └─ Mounts a projected ServiceAccount token at
         /var/run/secrets/eks.amazonaws.com/serviceaccount/token
   
   b) AWS SDK in the pod makes STS AssumeRoleWithWebIdentity request
      ├─ URL: https://sts.amazonaws.com
      ├─ RoleArn: arn:aws:iam::802520734572:role/ml-inference-pod-role
      └─ WebIdentityToken: (JWT from Kubernetes)
   
   c) AWS STS validates token (checks OIDC issuer, signature, trust policy conditions)
      └─ If valid: Issues temporary credentials
   
   d) Pod receives temporary AWS credentials
      ├─ AccessKeyId / SecretAccessKey
      └─ SessionToken (with expiration)
   
   e) AWS SDK automatically uses these credentials
      └─ No need to hardcode!

4. Credentials auto-refresh
   └─ AWS SDK refreshes before expiration (every ~1 hour)
```
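The link between the ServiceAccount and the IAM role is the `sub` claim in the projected token, which the role's trust policy compares against `system:serviceaccount:<namespace>:<name>`. The sketch below decodes a token payload without verifying the signature (STS does the real verification) and checks that claim. The helper names are hypothetical, and `make_unsigned_token` exists only to produce a demo token.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def expected_subject(namespace: str, service_account: str) -> str:
    """Subject string Kubernetes puts in ServiceAccount tokens."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def trust_condition_matches(claims: dict, namespace: str, service_account: str) -> bool:
    """Mimic the StringEquals condition on the ':sub' key in the trust policy."""
    return claims.get("sub") == expected_subject(namespace, service_account)


def make_unsigned_token(claims: dict) -> str:
    """Build a demo token with an empty signature (illustration only)."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{b64({'alg': 'none'})}.{b64(claims)}."


if __name__ == "__main__":
    token = make_unsigned_token({
        "sub": "system:serviceaccount:ml-inference:ml-inference-sa",
        "aud": ["sts.amazonaws.com"],
    })
    claims = decode_jwt_payload(token)
    print(trust_condition_matches(claims, "ml-inference", "ml-inference-sa"))
```

A pod running under a different ServiceAccount carries a different `sub`, so the same role cannot be assumed from it; that is where the per-pod granularity comes from.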

### IRSA Benefits:

✓ Short-lived credentials (temporary, ~1 hour)  
✓ Auto-rotating (no manual rotation needed)  
✓ Not stored in Kubernetes secrets  
✓ Granular per-pod permissions  
✓ Easy to audit (CloudTrail shows which SA made request)  
✓ Pod doesn't have access to other pods' credentials  
✓ Better security (stolen credential expires soon)  

### Our Setup:

**IAM Role (AWS side):** the permissions policy and the trust relationship are two separate documents attached to the role; they are shown together here under illustrative wrapper keys.
```json
{
  "PermissionsPolicy": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "ecr:GetAuthorizationToken",
          "ecr:BatchGetImage",
          "logs:PutLogEvents"
        ],
        "Resource": "*"
      }
    ]
  },
  "TrustRelationship": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Federated": "arn:aws:iam::802520734572:oidc-provider/oidc.eks.ap-southeast-2.amazonaws.com/id/EXAMPLEID"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
          "StringEquals": {
            "oidc.eks.ap-southeast-2.amazonaws.com/id/EXAMPLEID:sub": "system:serviceaccount:ml-inference:ml-inference-sa"
          }
        }
      }
    ]
  }
}
```

**Kubernetes ServiceAccount:**
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-inference-sa
  namespace: ml-inference
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::802520734572:role/ml-inference-pod-role
```

### Follow-up:
- How does IRSA differ from assuming a cross-account role?
- What happens if the OIDC provider certificate expires?
- Can you have multiple ServiceAccounts per pod?

## Q3.2: Explain Terraform state file and why remote state is important

### Answer:

**Terraform State** = Database of all resources Terraform manages

### What's in State File:

```json
{
  "version": 4,
  "terraform_version": "1.5.0",
  "resources": [
    {
      "type": "aws_eks_cluster",
      "name": "main",
      "instances": [
        {
          "attributes": {
            "id": "ml-inference-prod-cluster",
            "arn": "arn:aws:eks:ap-southeast-2:802520734572:cluster/ml-inference-prod-cluster",
            "status": "ACTIVE",
            "created_at": "2024-02-22T10:00:00Z"
          }
        }
      ]
    }
  ]
}
```
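To make the "database" idea concrete, the sketch below walks a parsed state document and lists resource addresses. It assumes the version 4 layout shown above, plus the optional `mode` and `module` keys that real state files carry; in practice `terraform state list` is the supported way to get this, and the function name here is hypothetical.

```python
import json


def resource_addresses(state: dict) -> list:
    """Return addresses such as 'aws_eks_cluster.main' for every resource in the state."""
    addresses = []
    for resource in state.get("resources", []):
        prefix = "data." if resource.get("mode") == "data" else ""
        base = f"{prefix}{resource['type']}.{resource['name']}"
        module = resource.get("module")  # e.g. "module.vpc" for resources inside modules
        addresses.append(f"{module}.{base}" if module else base)
    return addresses


if __name__ == "__main__":
    sample = json.loads("""
    {
      "version": 4,
      "resources": [
        {"type": "aws_eks_cluster", "name": "main", "instances": []}
      ]
    }
    """)
    print(resource_addresses(sample))
```

Terraform compares these recorded addresses and attributes with the configuration and the real infrastructure to decide what to create, change, or destroy.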

### Local State (Default) - Bad for Teams:

```bash
# terraform.tfstate stored locally
project/
├─ main.tf
├─ variables.tf
└─ terraform.tfstate  ← Only on your machine

Problem:
1. Alice runs terraform apply → state updated on Alice's machine
2. Bob runs terraform plan → sees old state (resources missing)
3. Bob runs terraform apply → creates duplicate resources!
4. Conflicts, race conditions, resource drift
```

### Remote State (S3) - Good for Teams:

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "ml-inference/terraform.tfstate"
    region         = "ap-southeast-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```

**Workflow:**
```
1. State stored in S3 (shared, remote)
2. DynamoDB lock prevents simultaneous applies
3. Alice runs terraform apply:
   a) Terraform acquires DynamoDB lock
   b) Reads state from S3
   c) Makes changes
   d) Writes state back to S3
   e) Releases lock
4. Bob waits for lock (cannot apply simultaneously)
5. Bob runs terraform apply:
   a) Sees latest state (from Alice)
   b) Applies his changes
   c) No conflicts!
```
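The lock-then-write sequence in the workflow can be modelled in a few lines. This is an in-memory stand-in, not the S3 backend's implementation: the dictionary plays the role of the DynamoDB lock table, and all class and function names are hypothetical.

```python
class LockHeldError(Exception):
    """Raised when another owner already holds the state lock."""


class InMemoryLockTable:
    """Stand-in for the lock table: one lock entry per state key."""

    def __init__(self) -> None:
        self._locks = {}

    def acquire(self, lock_id: str, owner: str) -> None:
        # Mirrors a conditional write: succeed only if no entry exists yet
        if lock_id in self._locks:
            raise LockHeldError(f"{lock_id} is locked by {self._locks[lock_id]}")
        self._locks[lock_id] = owner

    def release(self, lock_id: str, owner: str) -> None:
        # Only the holder may release its own lock
        if self._locks.get(lock_id) == owner:
            del self._locks[lock_id]


def apply_change(table: InMemoryLockTable, state_store: dict,
                 lock_id: str, owner: str, change: dict) -> None:
    """Acquire the lock, read state, apply the change, write back, release."""
    table.acquire(lock_id, owner)  # raises before touching state if locked
    try:
        state = dict(state_store.get(lock_id, {}))
        state.update(change)
        state_store[lock_id] = state
    finally:
        table.release(lock_id, owner)


if __name__ == "__main__":
    table, store = InMemoryLockTable(), {}
    key = "ml-inference/terraform.tfstate"
    apply_change(table, store, key, "alice", {"aws_eks_cluster.main": "ACTIVE"})
    print(store[key])
```

If Bob calls `apply_change` while Alice holds the lock, he gets `LockHeldError` and the state is left untouched, which is the behaviour step 4 of the workflow describes.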

### Our Recommendation (in production):

```hcl
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  
  backend "s3" {
    bucket         = "company-terraform-state-802520734572"  # Unique per account
    key            = "ml-inference/terraform.tfstate"
    region         = "ap-southeast-2"
    encrypt        = true  # Encrypt at rest
    dynamodb_table = "terraform-locks"
  }
}
```

### State File Security:

**Sensitive data in state:**
- Database passwords
- API keys
- Private key content

**Protect state file:**
1. Store in S3 with encryption
   ```hcl
   encrypt = true  # Server-side encryption (set kms_key_id to use a KMS key)
   ```

2. Enable versioning
   ```hcl
   resource "aws_s3_bucket_versioning" "terraform" {
     bucket = aws_s3_bucket.terraform.id
     versioning_configuration {
       status = "Enabled"
     }
   }
   ```

3. Block public access
   ```hcl
   resource "aws_s3_bucket_public_access_block" "terraform" {
     bucket = aws_s3_bucket.terraform.id
     block_public_acls       = true
     block_public_policy     = true
     ignore_public_acls      = true
     restrict_public_buckets = true
   }
   ```

4. Enable MFA delete
   ```hcl
   # Requires MFA to delete state versions
   ```

5. Restrict access
   ```json
   {
     "Version": "2012-10-17",
     "Statement": [{
       "Principal": {
         "AWS": "arn:aws:iam::802520734572:role/TerraformRole"
       },
       "Effect": "Allow",
       "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
       "Resource": [
         "arn:aws:s3:::terraform-state-bucket",
         "arn:aws:s3:::terraform-state-bucket/*"
       ]
     }]
   }
   ```

### Never Commit State to Git:

```bash
# .gitignore
terraform.tfstate
terraform.tfstate.*
.terraform/
*.tfvars
```

### Follow-up:
- What's in the dependency lock file (`.terraform.lock.hcl`)?
- How do you recover from state corruption?
- Can you migrate from local to remote state?

# 4. DEPLOYMENT & OPERATIONS

## Q4.1: Explain the deployment process and what happens when you apply Kubernetes manifests

### Answer:

### Step-by-Step Deployment Process:

```
1. TERRAFORM PHASE (Infrastructure)
   ├─ terraform init        → Initialize working directory
   ├─ terraform plan        → Preview changes (review before applying!)
   └─ terraform apply       → Create AWS resources:
      ├─ VPC with 2 public, 2 private subnets
      ├─ NAT Gateways for private subnet egress
      ├─ EKS Control Plane (master nodes, AWS-managed)
      ├─ EKS Node Group (worker nodes, EC2 instances)
      ├─ Security Groups
      ├─ IAM Roles and Policies
      ├─ OIDC Provider
      └─ CloudWatch Log Groups
      
      At this point:
      - EKS cluster is running (empty, no pods yet)
      - Nodes are ready to run containers
      - Cluster is accessible via kubectl

2. KUBECTL CONFIGURATION
   └─ aws eks update-kubeconfig ...
      └─ Adds cluster credentials to ~/.kube/config
      └─ kubectl can now communicate with cluster

3. KUBERNETES PHASE (Manifests)
   ├─ kubectl apply -f 01-namespace-configmap.yaml
   │  ├─ Creates namespace "ml-inference"
   │  ├─ Creates ConfigMap with env variables
   │  ├─ Creates Secret (if provided)
   │  ├─ Creates ServiceAccount with IRSA annotation
   │  ├─ Creates Deployment:
   │  │  ├─ ReplicaSet (manages pod replicas)
   │  │  ├─ Pods (2 initially, managed by HPA)
   │  │  └─ Each pod creates container from ECR image
   │  ├─ Creates Service (ClusterIP) for internal routing
   │  ├─ Creates HPA (watches metrics, scales pods)
   │  └─ Creates PDB (min 1 replica during disruption)
   │
   ├─ kubectl apply -f 02-ingress-network-policy.yaml
   │  ├─ Creates Ingress (optional, for domain routing)
   │  ├─ Creates NetworkPolicy (traffic restrictions)
   │  ├─ Creates ResourceQuota (namespace limits)
   │  └─ Creates LimitRange (per-container limits)
   │
   └─ kubectl apply -f 03-rbac-monitoring.yaml
      ├─ Creates Role (permissions for SA)
      ├─ Creates RoleBinding (assign role to SA)
      └─ Creates ServiceMonitor (for Prometheus)
```

### What Happens When Pod Starts:

```
1. Kubernetes Scheduler picks a node
   └─ Considers: resource requests, node affinity, pod affinity, taints/tolerations

2. kubelet (node agent) receives pod spec
   └─ Tells container runtime (containerd on current EKS nodes) to create container

3. Container Runtime pulls image
   ├─ From ECR: 802520734572.dkr.ecr.ap-southeast-2.amazonaws.com/ml-inference-service:latest
   ├─ Uses ECR credentials from the node's IAM role (image pulls do not use IRSA)
   └─ Caches image on node

4. Container starts
   ├─ Mounts volumes (ConfigMap, ServiceAccount token)
   ├─ Sets environment variables
   ├─ Runs startup command: python src/app.py
   └─ Sets resource limits (CPU: 250m-1000m, Memory: 512Mi-1Gi)

5. Startup Probe begins
   ├─ Every 10 seconds, checks GET /health
   ├─ If fails: increments failure counter
   ├─ After 30 consecutive failures (300 seconds): container is killed and restarted
   └─ After success: switches to liveness/readiness probes

6. Pod Initialization (if successful startup probe)
   ├─ Container fully running
   ├─ Liveness probe begins (restart if unhealthy)
   ├─ Readiness probe begins (remove from LB if unhealthy)
   └─ Pod status: Running

7. Network Connectivity
   ├─ Pod gets IP address (from the CNI plugin; the AWS VPC CNI by default on EKS)
   ├─ Service gets endpoints (list of pod IPs)
   ├─ LoadBalancer gets targets (pod IPs)
   └─ Traffic can flow

8. HPA Monitoring
   ├─ Metrics Server collects CPU/memory every 15 seconds
   ├─ HPA checks metrics every 15 seconds (default sync period)
   ├─ If utilization > 70% (CPU) or > 80% (memory): scale up
   └─ New pods follow same startup process
```

### Timeline Example:

```
t=0s      kubectl apply -f manifests
          └─ API server receives manifests

t=1s      Deployment created
          └─ ReplicaSet created
          └─ 2 Pods created (replicas: 2 in the Deployment; HPA takes over afterwards)

t=2s      Scheduler assigns pods to nodes
          └─ Pod 1 → Node 1
          └─ Pod 2 → Node 2 (anti-affinity)

t=5s      kubelet pulls image
          └─ ~2-3 seconds for 500MB image

t=8s      Container started
          └─ FastAPI server initializing
          └─ Loading ML model (~1 second)

t=10s     Startup probe: 1st check → GET /health → ✓ Success
          └─ One success is enough; switched to liveness/readiness

t=15s     Readiness probe: 1st check → ✓ Success (successThreshold defaults to 1)
          └─ Pod added to Service endpoints
          └─ Pod registered with LoadBalancer targets

t=50s     Traffic begins flowing to pod (after load balancer target health checks pass)
          └─ First inference requests incoming

t=60s     Service fully ready with both pods
          └─ API responding to health checks
          └─ Ready for production traffic
```

### Kubectl apply vs delete vs patch:

```bash
# Apply: Idempotent (safe to run multiple times)
kubectl apply -f manifests.yaml
# → Creates if doesn't exist
# → Updates if exists
# → Stores applied configuration in annotation

# Delete: Removes resource and pods (with grace period)
kubectl delete -f manifests.yaml
# → Pods get 30 seconds (terminationGracePeriodSeconds) to shutdown
# → After 30s: forcefully killed
# → Resource deleted from cluster

# Patch: Surgical update (changes specific fields)
kubectl patch deployment ml-inference-api -n ml-inference \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"ml-api","image":"ml-inference-service:v2"}]}}}}'
# → Updates just the image, keeps other config
# → Triggers rolling update
```

### Follow-up:
- What's the difference between Recreate and RollingUpdate deployment strategy?
- How does graceful shutdown work (SIGTERM vs SIGKILL)?
- What happens if image pull fails?

# 5. ML SPECIFIC QUESTIONS

## Q5.1: How would you handle model versioning in production?

### Answer:

**Model Versioning Strategies:**

### Strategy 1: Docker Image Tags
```bash
# Build with model version
docker build -t ml-inference-service:1.0 .
docker build -t ml-inference-service:1.1 .
docker build -t ml-inference-service:2.0 .

# Deploy specific version
kubectl set image deployment/ml-inference-api \
  ml-api=ml-inference-service:2.0 \
  -n ml-inference

# Automatic rolling update (old pods gradually replaced with new)
```

### Strategy 2: Model Registry
```python
# MLflow, DVC, or custom registry
import mlflow

# Register model (model is a fitted scikit-learn estimator)
mlflow.sklearn.log_model(model, "model", registered_model_name="ml-inference-service")
# → Creates version 1, 2, 3, etc.

# In production: Load specific version
model = mlflow.sklearn.load_model("models:/ml-inference-service/1")
```

### Strategy 3: Canary Deployment
```yaml
# Gradually roll out new model version
# 10% traffic to v2, 90% to v1
# Monitor metrics, then gradually increase
# (subsets v1/v2 must be defined in a matching DestinationRule)

# Phase 1: 10% canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ml-inference
spec:
  hosts:
  - ml-inference
  http:
  - route:
    - destination:
        host: ml-inference
        subset: v1  # Old model
      weight: 90
    - destination:
        host: ml-inference
        subset: v2  # New model
      weight: 10
```

### Our Recommendation:
```yaml
# Store model version in ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-config
data:
  MODEL_VERSION: "2.0.1"
  MODEL_S3_PATH: "s3://models/ml-inference/v2.0.1/model.pkl"

---
# Pod reads this and loads correct model
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: ml-api
        env:
        - name: MODEL_VERSION
          valueFrom:
            configMapKeyRef:
              name: ml-config
              key: MODEL_VERSION
        - name: MODEL_S3_PATH
          valueFrom:
            configMapKeyRef:
              name: ml-config
              key: MODEL_S3_PATH
```

**Advantages:**
- ✓ Update model without rebuilding Docker image
- ✓ Quick rollback (revert the ConfigMap, then run `kubectl rollout restart` so pods pick up the new env values)
- ✓ A/B testing possible (route to different versions)
- ✓ No deployment downtime (the restart is a rolling update)
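
On the application side, the pod only has to read the two environment variables injected from the ConfigMap. The following is a minimal sketch of that step, not the project's actual loader: `ModelConfig` and `load_model_config` are hypothetical names, and downloading the file from S3 is left out.

```python
import os
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ModelConfig:
    version: str
    bucket: str
    key: str


def load_model_config(env=os.environ) -> ModelConfig:
    """Read MODEL_VERSION and MODEL_S3_PATH (injected from the ConfigMap)."""
    version = env["MODEL_VERSION"]
    s3_path = env["MODEL_S3_PATH"]
    parsed = urlparse(s3_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"MODEL_S3_PATH is not an s3:// URI: {s3_path}")
    # netloc is the bucket, the remaining path is the object key
    return ModelConfig(version=version, bucket=parsed.netloc, key=parsed.path.lstrip("/"))


# Offline demo using the values from the ConfigMap above
config = load_model_config({
    "MODEL_VERSION": "2.0.1",
    "MODEL_S3_PATH": "s3://models/ml-inference/v2.0.1/model.pkl",
})
print(config)
```

Failing fast on a malformed path at startup keeps a bad ConfigMap edit from surfacing later as a confusing download error.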

### Follow-up:
- How do you handle incompatible model versions?
- What metrics would you monitor for a new model?
- How would you implement shadow mode testing?

## Q5.2: Explain the flow from prediction request to response

### Answer:

```
1. CLIENT REQUEST
   └─ curl -X POST http://<LB-URL>/predict-single \
      -H 'Content-Type: application/json' \
      -d '{"features": [1,2,3,...,20]}'

2. AWS NLB (Network Load Balancer)
   ├─ Receives request on port 80
   ├─ Performs Layer 4 routing (TCP)
   ├─ Selects backend pod (IP: 10.0.11.45, port: 5000)
   └─ Forwards request

3. KUBERNETES NETWORK
   ├─ Request hits pod network interface (eth0)
   ├─ CNI plugin routes to container (AWS VPC CNI by default on EKS; Calico/Flannel on other clusters)
   └─ Request forwarded to FastAPI server

4. FASTAPI ENDPOINT (/predict-single)
   ├─ Request: POST /predict-single
   ├─ Body: {"features": [1,2,3,...,20]}
   │
   ├─ Pydantic validation
   │  └─ Validates 20 features provided (correct type, range)
   │
   ├─ Feature preprocessing
   │  └─ Reshape to 2D array: (1, 20)
   │
   └─ Model inference
      └─ Call model.predict(features)

5. MODEL PREDICTION (scikit-learn Random Forest)
   ├─ Apply preprocessor (StandardScaler, loaded at startup)
   │  └─ Standardize features: (value - mean) / std
   │
   ├─ Use model (100 decision trees, loaded at startup)
   │  └─ Each tree estimates class probabilities
   │
   ├─ Prediction phase
   │  ├─ Each tree: pass through splits, reach leaf
   │  ├─ Leaf yields class probabilities (classes 0, 1, 2)
   │  ├─ Average the 100 trees' probabilities (scikit-learn uses soft voting, not a hard majority vote)
   │  └─ Return final class (highest mean probability) + probabilities
   │
   └─ Return
      ├─ Prediction: 1 (class)
      └─ Probabilities: [0.05, 0.85, 0.10]

6. API RESPONSE CONSTRUCTION
   ├─ Class 0: 5% confidence
   ├─ Class 1: 85% confidence (highest)
   ├─ Class 2: 10% confidence
   │
   └─ Serialize to JSON
      {
        "prediction": 1,
        "confidence": 0.85,
        "probabilities": [0.05, 0.85, 0.10]
      }

7. FASTAPI SERIALIZATION
   ├─ Convert numpy arrays to JSON
   ├─ Add HTTP headers
   │  ├─ Content-Type: application/json
   │  ├─ Content-Length: 87
   │  └─ ...
   └─ Response code: 200 OK

8. NETWORK RESPONSE
   ├─ Pod sends response to client
   ├─ NLB forwards response (Layer 4 NAT)
   └─ Client receives JSON

9. CLIENT PROCESSING
   ├─ Parse JSON response
   └─ Use prediction: class 1
```
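
Step 6 of the flow can be illustrated in a few lines. This is a sketch under the assumption that the model returns one probability per class as a plain list; `build_response` is a hypothetical helper name, not a function from the project.

```python
def build_response(probabilities):
    """Turn per-class probabilities into the JSON payload described in step 6."""
    # The predicted class is the index with the highest probability
    prediction = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {
        "prediction": prediction,
        "confidence": probabilities[prediction],
        "probabilities": list(probabilities),
    }


response = build_response([0.05, 0.85, 0.10])
print(response)
```

With a real scikit-learn model the input would come from `predict_proba`, converted with `.tolist()` so the payload stays JSON-serializable.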

### Latency Breakdown (typical numbers):

```
Network latency            1 ms     (client to NLB)
NLB routing                1 ms     (L4 load balancer)
Kubernetes network         2 ms     (iptables, pod networking)
FastAPI parsing/validation 1 ms     (request deserialization)
Feature preprocessing      0.1 ms   (StandardScaler on 20 features)
Model inference            2-5 ms   (random forest, 100 trees)
Response serialization     1 ms     (JSON encoding)
Network latency            1 ms     (response to client)
────────────────────────────────
Total                      ~9-12 ms
```
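
The total is just the sum of the per-stage figures. A quick sketch of that arithmetic, using the illustrative numbers from the table (they are not measurements), also shows how much of the budget model inference takes:

```python
# (low_ms, high_ms) per stage, copied from the breakdown above
stages_ms = {
    "client to NLB": (1.0, 1.0),
    "NLB routing": (1.0, 1.0),
    "Kubernetes network": (2.0, 2.0),
    "FastAPI parsing/validation": (1.0, 1.0),
    "feature preprocessing": (0.1, 0.1),
    "model inference": (2.0, 5.0),
    "response serialization": (1.0, 1.0),
    "response to client": (1.0, 1.0),
}

total_low = sum(low for low, _ in stages_ms.values())
total_high = sum(high for _, high in stages_ms.values())
inference_low, inference_high = stages_ms["model inference"]

print(f"total: {total_low:.1f}-{total_high:.1f} ms")
print(f"inference share: {inference_low / total_low:.0%}-{inference_high / total_high:.0%}")
```

Even in the worst case, inference is under half of the request time, which is why the network and serialization overhead is worth amortizing with batching.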

### Optimization Strategies:

**1. Batch Predictions**
```python
# Instead of 1 prediction at a time
# Send 100 predictions in one request
# Amortizes overhead
@app.post("/predict")
async def predict_batch(request: PredictRequest):
    # request.features shape: (100, 20)
    predictions = model.predict(request.features)
    # Return 100 predictions at once (NumPy array converted to a JSON-serializable list)
    return {"predictions": predictions.tolist()}
```

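**2. Model Conversion (ONNX)**
```python
# Convert the scikit-learn model to ONNX and serve it with ONNX Runtime
# Speed-ups vary by model; benchmark before and after
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# 20 float features per row; None = variable batch size
initial_types = [("input", FloatTensorType([None, 20]))]
onnx_model = convert_sklearn(model, initial_types=initial_types)

# Save the converted model so ONNX Runtime can load it
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```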

**3. Caching**
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def predict(features_tuple):
    # features_tuple must be hashable (a tuple, not a list)
    # Repeated identical inputs return the cached result without re-running the model
    return int(model.predict([list(features_tuple)])[0])
```
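
Before applying any of these optimizations, it helps to measure where the time goes. The following is a minimal profiling sketch using only the standard library; `profile_latency` and `fake_predict` are hypothetical names, and `fake_predict` stands in for the real `model.predict` call.

```python
import statistics
import time


def profile_latency(fn, *args, runs=200):
    """Call fn(*args) repeatedly and report latency percentiles in milliseconds."""
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples_ms.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples_ms)}


def fake_predict(features):
    # Stand-in for model.predict(); replace with the real call
    return sum(features) > 10


stats = profile_latency(fake_predict, [0.5] * 20)
print(stats)
```

Reporting p95 rather than the mean matters because tail latency is what users notice under load.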

### Follow-up:
- How would you implement request batching?
- What's the difference between latency and throughput?
- How would you profile model inference time?

# 6. TROUBLESHOOTING & PRODUCTION ISSUES

## Q6.1: Pod is not starting. How would you debug?

### Answer:

### Step 1: Check Pod Status
```bash
kubectl get pods -n ml-inference
# Output:
# NAME                      READY STATUS             AGE
# ml-inference-api-xyz      0/1   ImagePullBackOff   2m
# ml-inference-api-abc      0/1   Pending             3m
# ml-inference-api-def      0/1   CrashLoopBackOff   1m
```

**Status meanings:**
- **Pending:** Waiting to be scheduled (insufficient resources, unsatisfied node selectors/affinity, or unbound volumes)
- **ImagePullBackOff:** Cannot pull image from registry
- **CrashLoopBackOff:** Container starts but crashes immediately
- **Running:** Container running but may not be ready

### Step 2: Describe Pod (detailed info)
```bash
kubectl describe pod <pod-name> -n ml-inference
# Shows:
# - Events (last 10 events)
# - Image pull status
# - Probe failure reasons
# - Node assignment
# - Container status
```

**Common Event Messages:**
```
Failed to pull image "...": rpc error: code = Unknown
  → ECR authentication failed
  → Check that the node IAM role has ECR pull permissions (image pulls use the node role, not IRSA)

Insufficient cpu/memory
  → Node doesn't have enough resources
  → Need more nodes (Cluster Autoscaler)

Liveness probe failed
  → Container is running but the /health endpoint failed
  → App crashed or hanging
```

### Step 3: Check Logs
```bash
# Current logs
kubectl logs <pod-name> -n ml-inference

# Previous logs (if pod crashed)
kubectl logs <pod-name> -n ml-inference --previous

# Stream logs (follow)
kubectl logs -f <pod-name> -n ml-inference

# Logs from specific container (if multiple)
kubectl logs <pod-name> -c ml-api -n ml-inference
```

**Common Log Messages:**
```
ModuleNotFoundError: No module named 'sklearn'
  → Dependency missing in Docker image
  → Update requirements.txt and rebuild image

FileNotFoundError: [Errno 2] No such file or directory: 'models/model.pkl'
  → Model files not in Docker image
  → Check COPY models/ in Dockerfile

PermissionError: /var/log/app.log
  → Pod running as non-root user without write permission
  → Use emptyDir volume for logs

Address already in use: port 5000
  → Two processes trying to use same port
  → Check if app is running twice
```

### Step 4: Probe Failures
```bash
# Test endpoint manually
kubectl port-forward <pod-name> 5000:5000 -n ml-inference
# In another terminal:
curl http://localhost:5000/health

# Or exec into pod
kubectl exec -it <pod-name> -n ml-inference -- /bin/bash
$ curl http://localhost:5000/health
$ python -c "import sklearn; print(sklearn.__version__)"
```

### Step 5: Check Resource Limits
```bash
kubectl describe node <node-name>
# Shows:
# - Allocatable resources
# - Current usage
# - Pods running on node

# Check if pod is being OOMKilled
kubectl get pod <pod-name> -n ml-inference -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Output: OOMKilled
# → Pod using more memory than limit (512Mi)
# → Increase memory limit or reduce model size
```
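
The same OOMKilled check can be scripted against the output of `kubectl get pod <pod-name> -o json`, which is handy when a pod has several containers. This is a sketch: `oom_killed_containers` is a hypothetical helper, and the sample document below is hand-written and trimmed to the fields the function reads.

```python
import json


def oom_killed_containers(pod: dict) -> list:
    """Return names of containers whose last termination reason was OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names


# Trimmed, hand-written example of the pod JSON structure
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "ml-api",
       "restartCount": 4,
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
      {"name": "sidecar", "restartCount": 0, "lastState": {}}
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```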

### Step 6: Events from Cluster
```bash
# View all events in namespace
kubectl get events -n ml-inference --sort-by='.lastTimestamp'

# Watch events in real-time
kubectl get events -n ml-inference --watch
```

### Debugging Checklist:

```
1. Pod status:
   ☐ ImagePullBackOff → Check image name, ECR credentials
   ☐ Pending → Check resources (kubectl describe node, kubectl top nodes)
   ☐ CrashLoopBackOff → Check logs (kubectl logs --previous)

2. Image issues:
   ☐ Image exists in ECR? (aws ecr describe-images ...)
   ☐ Tag is correct? (latest, v1.0, etc.)
   ☐ Node IAM role has ECR permissions? (aws iam list-attached-role-policies ...)

3. App issues:
   ☐ Dependencies installed? (pip list in image)
   ☐ Config files present? (ls -la /app/)
   ☐ Permissions correct? (ls -la <file>, whoami)

4. Probes:
   ☐ /health endpoint works? (curl localhost:5000/health)
   ☐ Port is correct? (5000, not 8000)
   ☐ Timeout too short? (increase initialDelaySeconds)

5. Resources:
   ☐ Enough CPU? (kubectl top nodes)
   ☐ Enough memory? (check for OOMKilled)
   ☐ Disk space? (df -h inside pod)
```

### Follow-up:
- How would you debug a pod that's running but not responding to traffic?
- What metrics would you monitor to detect issues early?
- How would you implement better logging for debugging?

# 7. ARCHITECTURE & DESIGN DECISIONS

## Q7.1: Why did you choose EKS over other options (EC2, Fargate, App Runner)?

### Answer:

### Comparison Table:

| Aspect | EC2 (Manual) | ECS Fargate | App Runner | EKS |
|--------|------------|-----------|-----------|-----|
| **Control** | Full | Medium | Low | Full |
| **Learning Curve** | High | Medium | Low | Very High |
| **Flexibility** | Maximum | Good | Limited | Maximum |
| **Scaling** | Manual/ASG | Automatic | Automatic | Automatic (HPA) |
| **Cost** | Low (self-manage) | Medium | Medium-High | Medium |
| **Multi-pod per node** | ✓ (if deployed) | ✓ | ✗ (one app per service) | ✓ |
| **Pod networking** | Manual | AWS ENI | Simple | CNI plugins |
| **Use Case** | Legacy, simple | Container apps | Simple web apps | Complex microservices |
| **Our Choice** | ✗ | ~ | ✗ | ✓ |

### Why EKS for Our Project:

**1. Future Scalability**
```
Today: Single ML API
Future: Multiple services
  ├─ ML API v1 (current model)
  ├─ ML API v2 (new model, A/B testing)
  ├─ Feature service (feature engineering)
  ├─ Monitoring service (Prometheus)
  ├─ Logging service (Fluentd)
  └─ Admin dashboard

EKS can manage all these efficiently.
Fargate would create N independent services (costly).
```

**2. Advanced Deployment Patterns**
```yaml
# Canary deployments
# 10% traffic → new model, 90% → old model
# Monitor metrics, gradually increase
# → Needs traffic-splitting support (service mesh or ingress controller)

# Rolling updates
# Graceful pod termination (30s shutdown period)
# → Lets in-flight requests finish before a pod exits

# Resource isolation
# Multiple apps on same node
# Each gets guaranteed resources
```

**3. Team Skill Transfer**
```
Kubernetes is industry standard.
Skills transfer to other companies/projects.
ECS is AWS-specific (lock-in).
```

**4. Ecosystem & Integrations**
```
Kubernetes has massive ecosystem:
  - Prometheus (monitoring)
  - Istio (service mesh)
  - Helm (package manager)
  - Operators (custom logic)
  - KNative (serverless)

ECS is limited to AWS services.
```

**5. Multi-cloud Capability**
```
Kubernetes manifests work on:
  - AWS EKS
  - Google GKE
  - Azure AKS
  - On-premises (minikube, kubeadm)
  - Bare metal

Lock-in to single cloud is avoided.
```

### Why Not Fargate (Serverless Containers)?

**Fargate Advantages:**
- No node management
- Simpler to start
- Pay per container (fine-grained billing)

**Fargate Disadvantages:**
- Limited control (can't customize networking, kernel params)
- Cold starts (slight latency on new container creation)
- Usually more expensive for sustained workloads (per-vCPU and per-GB pricing carries a premium over equivalent EC2 capacity; check current regional pricing)
- No bin-packing on EKS: each pod runs on its own Fargate node, and DaemonSets are not supported
- Limited networking (no host networking)
- Vendor lock-in (AWS only)

**When Fargate makes sense:**
- Simple, bursty workloads
- Unpredictable traffic patterns
- Small teams (less ops overhead)
- Short-lived jobs

### Why Not App Runner?

**App Runner is for simple applications:**
```
  ✓ Connect GitHub repo
  ✓ Auto-deploy on commit
  ✓ TLS/SSL automatic
  ✗ No pod management
  ✗ No multi-container
  ✗ Limited customization
```

### Why Not Manual EC2?

**Manual EC2 issues:**
```
  ✗ No auto-scaling (ops calls scaling API)
  ✗ No auto-restart (manual SSH to restart)
  ✗ No rolling updates (downtime during deploy)
  ✗ Ops overhead (patching, security updates)
  ✗ Resource fragmentation (hard to pack apps efficiently)
  ✗ No health checks (monitor manually)
```

### Our Final Decision:

```
EKS because:

✓ Industry standard (Kubernetes)
✓ Scalable to many services
✓ Advanced deployment patterns (canary, shadow, A/B testing)
✓ Multi-cloud flexibility
✓ Rich ecosystem (monitoring, logging, service mesh)
✓ Team skill development
✓ Widely used for ML model serving in industry

Trade-off:
  - Higher learning curve
  - More initial complexity
  - But pays off as system grows
```

### Follow-up:
- At what point would you reconsider (maybe move to Fargate)?
- How would cost change if we used Fargate instead?
- What other deployment targets are viable?

# 8. ADVANCED TOPICS & OPEN-ENDED QUESTIONS

## Q8.1: How would you implement zero-downtime deployments?

### Answer:

### Zero-Downtime Deployment Process:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-api
spec:
  # Key settings for zero-downtime
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # 1 extra pod allowed (total 3 during update)
      maxUnavailable: 0  # 0 pods allowed down (key!)
  
  template:
    spec:
      # Graceful shutdown
      terminationGracePeriodSeconds: 30
      
      containers:
      - name: ml-api
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]
              # Give load balancer 5 sec to drain connections
        
        # Health checks
        readinessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 10
          periodSeconds: 5  # Check often
          failureThreshold: 2
        
        livenessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 30
          periodSeconds: 10
```

### Timeline of Zero-Downtime Deployment:

```
t=0s      kubectl set image deployment/ml-inference-api ml-api=ml-inference-service:v2
          → Triggers rolling update

t=1s      Kubernetes creates new pod (v2)
          Current state:
            - Pod 1 (v1) ← Receiving traffic
            - Pod 2 (v1) ← Receiving traffic
            - Pod 3 (v2) ← Starting (not ready yet)

t=10s     Pod 3 readiness check: PASS (one success is enough with the default successThreshold: 1)
t=15s     Pod 3 registered as a LoadBalancer target
          Current state:
            - Pod 1 (v1) ← Receiving traffic
            - Pod 2 (v1) ← Receiving traffic
            - Pod 3 (v2) ← Now receiving traffic

t=16s     Kubernetes starts terminating Pod 1 (v1)
          → Pod marked Terminating and removed from Service endpoints
          → preStop hook starts (SIGTERM is sent only after it finishes)
          Current state:
            - Pod 1 (v1) ← Shutting down (no new traffic)
            - Pod 2 (v1) ← Receiving traffic
            - Pod 3 (v2) ← Receiving traffic

t=16-21s  preStop hook executes (sleep 5)
          → LoadBalancer drains existing connections
          → In-flight requests complete

t=21s     SIGTERM sent to Pod 1; process shuts down cleanly and the pod terminates
          New pod created for Pod 2 replacement
          Current state:
            - Pod 2 (v1) ← Receiving traffic
            - Pod 3 (v2) ← Receiving traffic
            - Pod 4 (v2) ← Starting

t=31s     Pod 4 becomes ready
          LoadBalancer starts routing to Pod 4

t=32s     Kubernetes starts terminating Pod 2 (v1) (preStop hook, then SIGTERM)
          Current state:
            - Pod 2 (v1) ← Shutting down
            - Pod 3 (v2) ← Receiving traffic
            - Pod 4 (v2) ← Receiving traffic

t=46s     Pod 2 terminates
          Update complete! All pods running v2
          Current state:
            - Pod 3 (v2) ← Receiving traffic
            - Pod 4 (v2) ← Receiving traffic

THROUGHOUT:
- At least 2 pods always ready (replicas: 2 with maxUnavailable: 0)
- LoadBalancer always has healthy targets
- Traffic never interrupted
- No 502/503 errors
```

### Critical Settings for Zero-Downtime:

```yaml
# 1. maxUnavailable: 0
   → Never drop below the desired replica count during the update
   → Ensures continuous availability

# 2. maxSurge: 1 (or percentage)
   → Allows temporary over-provisioning
   → New pod starts before old pod killed

# 3. terminationGracePeriodSeconds: 30
   → Give app 30 seconds to shut down gracefully (the preStop hook counts against this budget)
   → Listen for SIGTERM, finish requests, close connections

# 4. preStop hook
   → Execute before SIGTERM
   → Sleep to allow connection draining
   → Wait for load balancer to remove from targets

# 5. readinessProbe
   → Only add pod to LB after passing
   → Fast detection of issues
   → Remove unhealthy pods from rotation
```
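
Because the preStop hook runs inside the termination grace period, the two settings share one budget. A small sketch of that check follows; `shutdown_budget` is a hypothetical helper, and the 2-second longest request is an assumed figure for illustration.

```python
def shutdown_budget(grace_period_s, pre_stop_sleep_s, longest_request_s):
    """Check that in-flight requests can finish before SIGKILL.

    The grace period starts when termination begins, so time spent in the
    preStop hook is subtracted from what the app has left after SIGTERM.
    """
    remaining = grace_period_s - pre_stop_sleep_s
    return {
        "time_left_after_prestop_s": remaining,
        "fits": longest_request_s <= remaining,
    }


# Values from the manifest: 30s grace period, 5s preStop sleep
print(shutdown_budget(30, 5, longest_request_s=2))
# A preStop sleep that is too long leaves the app no time to drain
print(shutdown_budget(30, 29, longest_request_s=2))
```

If the second case applies, raise `terminationGracePeriodSeconds` rather than shortening the drain sleep.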

### Application Code for Graceful Shutdown:

```python
import asyncio
import signal

import uvicorn
from fastapi import FastAPI
from fastapi.responses import JSONResponse

# load_model, PredictRequest, and the close_* helpers are defined elsewhere in the service

app = FastAPI()
shutdown_event = asyncio.Event()

@app.on_event("startup")
async def startup():
    # Load model, initialize connections
    global model
    model = load_model()

@app.on_event("shutdown")
async def shutdown():
    # Clean up resources
    await close_database_connections()
    await close_cache_connections()

def signal_handler(sig, frame):
    print("SIGTERM received, gracefully shutting down...")
    shutdown_event.set()

# Uvicorn installs its own SIGTERM handler when it starts, which can replace
# this one; verify the behavior with your uvicorn version.
signal.signal(signal.SIGTERM, signal_handler)

@app.get("/health")
async def health():
    if shutdown_event.is_set():
        # FastAPI needs an explicit response object to set a non-200 status
        return JSONResponse({"status": "shutting_down"}, status_code=503)
    return {"status": "healthy"}

@app.post("/predict")
async def predict(request: PredictRequest):
    if shutdown_event.is_set():
        return JSONResponse({"error": "Service shutting down"}, status_code=503)

    # Process prediction (NumPy output converted to a JSON-serializable list)
    result = model.predict(request.features)
    return {"prediction": result.tolist()}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=5000)
```

### Testing Zero-Downtime:

```bash
# Terminal 1: Deploy new version
kubectl set image deployment/ml-inference-api \
  ml-api=ml-inference-service:v2 \
  -n ml-inference

# Terminal 2: Monitor traffic (print only the HTTP status code)
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" http://<LB-URL>/health
  sleep 0.1
done

# Expected result: every line is 200 (no errors)
```
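
For a rollout that takes a minute or more, counting status codes is easier than watching a scrolling terminal. The following sketch does the same polling in Python with the standard library; `http_status` and `poll` are hypothetical helper names, and the demo uses a stubbed fetch so it runs without a cluster.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def http_status(url, timeout=2.0):
    """Return the HTTP status code, or 0 if the connection itself failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure


def poll(url, requests=100, interval_s=0.1, fetch=http_status):
    """Hit the URL repeatedly and count responses per status code."""
    counts = Counter()
    for _ in range(requests):
        counts[fetch(url)] += 1
        time.sleep(interval_s)
    return counts


# Offline demo with a stubbed fetch; against a real cluster, call
# poll("http://<LB-URL>/health") while the rollout is in progress.
demo = poll("http://example.invalid/health", requests=5, interval_s=0, fetch=lambda url: 200)
print(dict(demo))
```

A zero-downtime rollout should leave only the `200` key in the result; any `0`, `502`, or `503` entries point at a draining or readiness gap.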

### Follow-up:
- What's the difference between Recreate and RollingUpdate strategy?
- How would you handle database schema migrations?
- How would you rollback a bad deployment?

## Q8.2: Discuss potential issues and how you'd handle them

### Answer:

### Common Production Issues:

**1. Memory Leak in Model**
```
Problem:
  - Pod memory gradually increases (not reclaimed)
  - After days: Pod OOMKilled
  - Service degrades

Detection:
  - Monitor memory trends (CloudWatch)
  - Prometheus metrics: container_memory_usage_bytes

Solution:
  1. Identify leak (memory profiler in Python, e.g. tracemalloc)
  2. Temporary: Keep a memory limit so a leaking pod is OOMKilled and restarted,
     or schedule a periodic `kubectl rollout restart`
  3. Permanent: Fix code, deploy new version
  
  apiVersion: apps/v1
  kind: Deployment
  spec:
    template:
      spec:
        containers:
        - name: ml-api
          resources:
            limits:
              memory: "512Mi"  # Container is OOMKilled and restarted once it exceeds 512Mi
```

**2. Model Inference Slow Under Load**
```
Problem:
  - API latency increases from 10ms to 500ms
  - Users see timeouts
  - Requests queue up

Causes:
  - Model not optimized (too complex)
  - Features not cached
  - CPU contention (other processes competing)
  - GIL (Python Global Interpreter Lock)

Solutions:
  1. Profile inference (time each step)
  2. Optimize model (quantization, pruning)
  3. Add request queue (celery, message broker)
  4. Use async inference (ProcessPoolExecutor)
  5. Scale horizontally (more pods via HPA)
```

**3. ECR Credential Expiration**
```
Problem:
  - Node IAM role loses ECR permissions, or the ECR auth token cannot be refreshed
  - New pods cannot pull image
  - Pods stay in ErrImagePull / ImagePullBackOff

Prevention:
  - The kubelet refreshes ECR tokens automatically using the node IAM role
  - Ensure the node role has ECR read access (e.g. AmazonEC2ContainerRegistryReadOnly)
  - IRSA covers AWS calls made by the app itself, not image pulls

Debugging:
  kubectl describe pod <pod-name>
  # Look for: Failed to pull image "...": rpc error
  
  # Check policies attached to the node role
  aws iam list-attached-role-policies \
    --role-name <node-role-name>
```

**4. Node Failure**
```
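Problem:
  - Hardware fails (disk, memory, network)
  - All pods on node become unreachable
  - By default ~5 minutes before pods are evicted and rescheduled
    (NotReady detection plus the 300s default toleration for unreachable nodes)

Prevention (HA Setup):
  - Multi-AZ nodes (spread pods across nodes)
  - Pod anti-affinity (pods don't share nodes)
  - Pod Disruption Budget (min replicas during voluntary disruptions such as node drains)
  - Our setup has all these! ✓

What Happens:
  1. Kubernetes detects node unhealthy (NotReady)
  2. Service stops routing to pods on that node
  3. Pods are evicted and recreated on other nodes
  4. HPA scales up if needed
  5. Service continues on surviving replicas (minimal impact)

Recovery Time: traffic shifts to healthy pods within roughly a minute;
               full replica count returns after eviction (tunable via tolerationSeconds)
User Impact: <1% requests fail (if unlucky with timing)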
```

**5. DDoS / Traffic Spike**
```
Problem:
  - Sudden 10x traffic increase
  - Pods max out (high CPU/latency)
  - Users see 503 errors

Automatic Mitigation (HPA):
  - CPU jumps to 85%
  - HPA picks this up on its next sync (every 15 seconds by default, once metrics arrive)
  - Scales up toward the 10-pod maximum
  - Cluster Autoscaler adds nodes if pods are Pending
  - Load distributed
  - Latency returns to normal

Additional Mitigation:
  - AWS WAF (block malicious IPs)
  - Rate limiting (per IP limit)
  - API Gateway throttling
  - Circuit breaker pattern

Time to Recover: 1-3 minutes
```
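
The HPA's scaling decision follows a simple documented formula: `desired = ceil(current_replicas * current_metric / target_metric)`, clamped to the min/max range. The sketch below applies it to this scenario; the 70% CPU target is an assumption for illustration, and the real controller also applies a tolerance band and stabilization windows that are ignored here.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas=2, max_replicas=10):
    """Core HPA formula, clamped to the configured replica range."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


# 2 pods at 85% CPU against an assumed 70% target
print(desired_replicas(2, 85, 70))    # 3
# A 10x spike saturates the range and is capped at maxReplicas
print(desired_replicas(3, 300, 70))   # 10
# Low load never scales below minReplicas
print(desired_replicas(4, 10, 70))    # 2
```

The first case shows that a modest overshoot adds one pod at a time, so reaching 10 replicas takes several sync cycles unless the metric is far above target.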

**6. Database Connection Exhaustion**
```
Problem:
  - DB has 100 connections max
  - 10 pods × 15 connections/pod = 150 (exceeds!)
  - New requests fail with "connection pool exhausted"

Prevention:
  1. Connection pooling (SQLAlchemy pool for Python; HikariCP for JVM services)
  2. Limit connections per pod (pool_size=5)
  3. Scale down pods if DB overloaded
  
  # In app (SQLAlchemy): cap each pod at 5 connections
  from sqlalchemy import create_engine
  engine = create_engine(db_url, pool_size=5, max_overflow=0)
  # 10 pods × 5 connections = 50 (within the 100 limit)

Detection:
  - Monitor DB connection count
  - Alert if > 80% of max
```
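
The pool size per pod follows from the database limit and the HPA maximum. A minimal sketch of that calculation, keeping 20% headroom to match the 80% alert threshold above (`per_pod_pool_size` is a hypothetical helper name):

```python
def per_pod_pool_size(db_max_connections, max_pods, headroom_pct=80):
    """Largest pool size per pod that keeps total usage under the headroom limit."""
    usable = db_max_connections * headroom_pct // 100
    return usable // max_pods


def total_connections(pods, pool_size):
    return pods * pool_size


# The failing configuration from the problem statement
print(total_connections(10, 15))      # 150, over the 100-connection limit
# A safe pool size for 10 pods against a 100-connection database
print(per_pod_pool_size(100, 10))     # 8
```

Sizing against the HPA maximum rather than the current pod count keeps a scale-out event from exhausting the database.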

**7. Bad Model Update**
```
Problem:
  - Deploy new model v2.0
  - Accuracy is actually worse (overfitting)
  - Business impact: wrong predictions

Prevention:
  1. Pre-deployment validation
     - Test on hold-out test set
     - Compare metrics vs old model
     - Require explicit approval
  
  2. Canary deployment
     - Deploy to 10% of traffic
     - Monitor predictions (logging)
     - Compare accuracy vs baseline
     - If good: rollout to 100%
  
  3. Staged deployment
     - Dev → Staging → Production
     - Real validation at each stage

Rollback (if needed):
  kubectl rollout undo deployment/ml-inference-api -n ml-inference
  # Rolls back to the previous revision via a rolling update
  # Time to recover: 30-60 seconds
```
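
The pre-deployment validation step can be automated as a promotion gate in the CI pipeline. This is a sketch only: `should_promote` is a hypothetical helper, the metric names and values are invented for illustration, and it assumes higher is better for every metric.

```python
def should_promote(baseline, candidate, max_regression=0.01):
    """Block promotion if any metric regresses by more than max_regression.

    Returns (ok, failures) where failures maps metric name to
    (baseline_value, candidate_value). Missing metrics count as failures.
    """
    failures = {}
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None or cand_value < base_value - max_regression:
            failures[name] = (base_value, cand_value)
    return (not failures, failures)


# Hypothetical hold-out metrics for the current and candidate models
v1_metrics = {"accuracy": 0.91, "f1_macro": 0.88}
v2_metrics = {"accuracy": 0.93, "f1_macro": 0.84}

ok, failures = should_promote(v1_metrics, v2_metrics)
print(ok, failures)
```

Here the candidate improves accuracy but regresses on macro F1, so the gate blocks it even though the headline number looks better.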

### Monitoring & Alerting Strategy:

```yaml
Key Metrics to Monitor:
  1. Pod health
     - Pods in CrashLoopBackOff
     - Restart count > 10
  
  2. Resource usage
     - CPU > 80%
     - Memory > 85%
     - Disk > 90%
  
  3. API performance
     - Latency p95 > 100ms
     - Error rate > 1%
     - Requests/second trending
  
  4. Model quality
     - Prediction confidence < 50% (drift)
     - Distribution shift detected
  
  5. Cluster health
     - Nodes NotReady
     - Pod eviction rate high
     - Pending pods (resource starved)

Alerting Rules (examples):
  - PodCrashLooping: Check logs from the previous container, roll back if needed
  - HighMemoryUsage: Investigate leak
  - HighAPILatency: Scale out or optimize
  - HighErrorRate: Page on-call engineer
```
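
The numeric thresholds above translate directly into a rule check. The sketch below is a simplified stand-in for what Prometheus alerting rules or CloudWatch alarms would do; the metric key names and the sample values are assumptions for illustration.

```python
# Thresholds taken from the metric list above
THRESHOLDS = {
    "cpu_pct": 80,
    "memory_pct": 85,
    "disk_pct": 90,
    "latency_p95_ms": 100,
    "error_rate_pct": 1,
    "restart_count": 10,
}


def firing_alerts(metrics):
    """Return the names of metrics that exceed their threshold, sorted."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    )


# Hypothetical snapshot of one pod's metrics
sample = {"cpu_pct": 91, "memory_pct": 60, "latency_p95_ms": 140, "error_rate_pct": 0.2}
print(firing_alerts(sample))  # ['cpu_pct', 'latency_p95_ms']
```

In production these checks would also need a duration condition (for example, above threshold for 5 minutes) to avoid paging on short spikes.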

### Follow-up:
- How would you implement automated rollback on error rate?
- What's the RPO/RTO for your deployment?
- How would you test disaster recovery scenarios?

# 9. SUMMARY & KEY TAKEAWAYS

## What We've Covered:

### Kubernetes Fundamentals
✓ Deployment vs StatefulSet vs DaemonSet  
✓ Services (ClusterIP, NodePort, LoadBalancer)  
✓ Ingress vs LoadBalancer  
✓ Health checks (Startup, Liveness, Readiness)  

### Advanced Kubernetes
✓ HorizontalPodAutoscaler (pod scaling)  
✓ PodDisruptionBudget (HA guarantees)  
✓ Zero-downtime deployments  
✓ Resource management (requests/limits)  

### AWS & Infrastructure
✓ IRSA (IAM Roles for Service Accounts)  
✓ Terraform state management  
✓ VPC design (public/private subnets)  
✓ Security groups & networking  

### Production Deployment
✓ Deployment process (Terraform → Kubernetes)  
✓ Debugging pod issues  
✓ Monitoring & observability  
✓ Handling production failures  

### ML Specific
✓ Model versioning strategies  
✓ Inference request flow  
✓ Latency optimization  
✓ Model quality monitoring  

## Key Architecture Principles:

1. **High Availability**
   - Multi-AZ deployment
   - Multiple replicas (2-10)
   - Pod anti-affinity
   - Health checks

2. **Auto-Scaling**
   - Pod level (HPA: CPU/Memory)
   - Node level (Cluster Autoscaler)
   - Respects min/max limits

3. **Security**
   - IRSA (no hardcoded credentials)
   - RBAC (least privilege)
   - NetworkPolicy (traffic control)
   - Pod security context

4. **Observability**
   - CloudWatch Container Insights
   - Centralized logging
   - Prometheus metrics
   - Health checks

5. **Reliability**
   - Zero-downtime deployments
   - Graceful shutdown (30s grace period)
   - Pod Disruption Budgets
   - Automated recovery

## Interview Tips:

1. **Show understanding of trade-offs**
   - Why EKS vs Fargate?
   - Why 2-10 replicas vs fixed?
   - Why maxUnavailable: 0?

2. **Connect to business value**
   - "This ensures 99.9% uptime"
   - "Reduces MTTR from 30min to 1min"
   - "Scales automatically with traffic"

3. **Demonstrate hands-on knowledge**
   - Share debugging experience
   - Explain error messages (not just copy-paste)
   - Reference your actual setup

4. **Ask clarifying questions**
   - "What's the expected traffic pattern?"
   - "What's the acceptable downtime?"
   - "What's the cost constraint?"

5. **Discuss improvements**
   - "We could add Prometheus for detailed metrics"
   - "We should implement canary deployments"
   - "We need automated rollback on error rate"

## Your Competitive Advantage:

You have **complete, production-grade infrastructure code**:
- 2,500+ lines of Terraform & Kubernetes
- Real ML model deployment (not toy example)
- Addresses actual production concerns:
  - Security (IRSA, RBAC, NetworkPolicy)
  - Reliability (health checks, PDB, multi-AZ)
  - Scalability (HPA, cluster autoscaling)
  - Observability (CloudWatch, logs)

This goes **well beyond what most candidates can show**, since typical portfolio projects include minimal infrastructure.

---

**Good luck with your interviews! This architecture demonstrates senior-level cloud engineering skills.** 🚀