# 054: Transfer Learning & Fine-Tuning

## 📚 Learning Objectives

By the end of this notebook, you will master:

1. **Transfer Learning Theory** - Mathematical foundations, feature hierarchy, domain adaptation
2. **Pre-trained Model Zoo** - ImageNet models (ResNet, EfficientNet, ViT), when to use each
3. **Fine-Tuning Strategies** - Layer-wise learning rates, gradual unfreezing, discriminative training
4. **Learning Rate Policies** - Warm-up, cyclical LR, cosine annealing for transfer learning
5. **Feature Extraction vs Fine-Tuning** - When to freeze, when to train, computational tradeoffs
6. **Domain Adaptation** - Handling distribution shift between source (ImageNet) and target (semiconductor)
7. **Multi-Task Transfer Learning** - Leveraging multiple pre-trained models simultaneously
8. **Production Deployment** - Model compression, quantization, ONNX export for transfer learned models

---

## 🎯 Why Transfer Learning Matters

### The Core Problem

Training deep neural networks from scratch requires:
- **Massive datasets** (millions of labeled examples like ImageNet's 14M images)
- **Computational resources** (hundreds of GPU-hours, $1000s in cloud costs)
- **Time** (days to weeks of training)
- **Expertise** (hyperparameter tuning, regularization, debugging)

### The Transfer Learning Solution

Leverage **pre-trained models** trained on large datasets (ImageNet, COCO, etc.) and adapt them to your specific task:
- **10-100× less data** needed (1000s instead of millions)
- **10-100× faster training** (minutes to hours instead of days)
- **Better generalization** (pre-trained features capture universal patterns)
- **Lower costs** (10× reduction in compute requirements)

---

## 💼 Business Value for Semiconductor Industry

### Use Case 1: Wafer Defect Classification with Limited Data

**Problem:** New fab produces novel defect patterns. Only 500 labeled wafer maps available (insufficient for training from scratch).

**Solution:** Transfer learning from ImageNet → Fine-tune on 500 wafer maps
- **Result:** 92% accuracy (vs 65% training from scratch)
- **Business Impact:** $5M-$10M/year in yield improvement
- **Time Saved:** 2 weeks training → 4 hours fine-tuning

### Use Case 2: Multi-Product Test Adaptation

**Problem:** Company produces 50+ IC products. Need separate yield models for each product family.

**Solution:** Train base model on Product A (largest dataset) → Transfer to Products B-Z
- **Result:** 15% improvement in cross-product generalization
- **Business Impact:** $20M-$50M/year from optimized test flows across portfolio
- **Cost Reduction:** Train 1 base model instead of 50 separate models

### Use Case 3: SEM Image Defect Detection

**Problem:** High-resolution SEM images (4096×4096) of die defects. Limited labeled data due to expert annotation cost.

**Solution:** EfficientNet-B7 (ImageNet pre-trained) → Fine-tune on 2000 SEM images
- **Result:** 96% mAP for 30 defect classes
- **Business Impact:** $2M-$8M/year in faster defect root-cause analysis
- **Annotation Savings:** $50K-$200K (need 10× fewer labeled images)

---

## 🏗️ What We'll Build

### 1. **Wafer Map Multi-Class Classifier** (Semiconductor Focus)
- **Task:** Classify 20 defect patterns (center, edge, scratch, ring, cluster, etc.)
- **Approach:** Compare 3 strategies:
  1. Feature extraction (freeze all layers)
  2. Fine-tuning last N layers
  3. Gradual unfreezing (progressive layer-wise training)
- **Models:** ResNet-50, EfficientNet-B3, Vision Transformer (ViT-B/16)
- **Metrics:** Test accuracy, training time, parameter efficiency

### 2. **Domain Adaptation Experiment** (Transfer ImageNet → Grayscale Wafer Maps)
- **Challenge:** ImageNet = RGB natural images, Wafer maps = grayscale spatial patterns
- **Solution:** Domain-specific preprocessing, adaptive batch normalization
- **Analysis:** Quantify domain gap, measure adaptation effectiveness

### 3. **Production Deployment Pipeline**
- **Model compression:** Quantization (FP32 → INT8), pruning
- **Export:** ONNX format for multi-framework compatibility
- **Inference:** TensorRT optimization, batch processing
- **Monitoring:** Track prediction confidence, detect distribution drift

---

## 📊 Transfer Learning Workflow

```mermaid
graph TD
    A[Pre-trained Model<br/>ImageNet 1000 classes] --> B{Transfer Strategy}
    B -->|Feature Extraction| C[Freeze all layers<br/>Train only classifier]
    B -->|Fine-Tuning| D[Unfreeze last N layers<br/>Train with low LR]
    B -->|Gradual Unfreezing| E[Progressive layer training<br/>Start from top, unfreeze downward]

    C --> F[Target Dataset<br/>Wafer Maps 20 classes]
    D --> F
    E --> F

    F --> G[Validation]
    G -->|Poor Performance| H{Diagnosis}
    H -->|Underfitting| I[Unfreeze more layers<br/>Increase model capacity]
    H -->|Overfitting| J[Freeze more layers<br/>Add regularization]
    H -->|Domain Gap| K[Domain adaptation<br/>Data augmentation]

    I --> F
    J --> F
    K --> F

    G -->|Good Performance| L[Production Deployment]
    L --> M[ONNX Export]
    L --> N[INT8 Quantization]
    L --> O[TensorRT Optimization]

    M --> P[Inference Server<br/>TorchServe/TF Serving]
    N --> P
    O --> P

    P --> Q[Monitoring & Feedback]
    Q -->|Distribution Drift| R[Retrain/Adapt]
    R --> F

    style A fill:#e1f5ff
    style F fill:#fff4e1
    style L fill:#e8f5e9
    style P fill:#f3e5f5
```

---

## 🛠️ Notebook Structure

1. **Mathematical Foundations** - Why transfer learning works, feature hierarchy theory
2. **Pre-trained Model Comparison** - ResNet vs EfficientNet vs Vision Transformer
3. **Strategy 1: Feature Extraction** - Freeze backbone, train classifier only
4. **Strategy 2: Full Fine-Tuning** - Unfreeze all layers with differential learning rates
5. **Strategy 3: Gradual Unfreezing** - Progressive layer-wise training (best practice)
6. **Domain Adaptation Techniques** - Handle ImageNet → Semiconductor distribution shift
7. **Production Deployment** - Model compression, ONNX export, TensorRT optimization
8. **Real-World Projects** - 8 semiconductor + general AI/ML transfer learning applications

---

## 📦 Prerequisites

**Libraries:**
```bash
pip install torch torchvision timm  # PyTorch + model zoo (timm = PyTorch Image Models)
pip install tensorflow tensorflow-hub  # TensorFlow + TF Hub
pip install onnx onnxruntime tensorrt  # Model export & optimization
pip install matplotlib seaborn scikit-learn  # Visualization & metrics
pip install grad-cam  # Explainability for transfer learned models
```

**Prior Knowledge:**
- Notebook 052: Deep Learning Frameworks (PyTorch/Keras basics)
- Notebook 053: CNN Architectures (convolution, ResNet, VGG concepts)
- Basic understanding of overfitting, regularization

---

## 📊 Dataset Overview

### Synthetic Wafer Map Dataset (20 Defect Classes)
We'll generate **10,000 wafer maps** (128×128 grayscale images) with these defect patterns:

| Class | Pattern | Frequency | Business Impact |
|-------|---------|-----------|-----------------|
| 0 | Normal (no defects) | 20% | Baseline |
| 1 | Center cluster | 8% | Process contamination ($2M-$5M/incident) |
| 2 | Edge defects | 10% | Chuck/vacuum issues ($500K-$2M) |
| 3 | Vertical scratch | 6% | Handling damage ($1M-$3M) |
| 4 | Horizontal scratch | 6% | Robot arm misalignment ($1M-$3M) |
| 5 | Ring pattern | 5% | Plasma etching non-uniformity ($3M-$8M) |
| 6 | Random clusters | 7% | Particle contamination ($1M-$4M) |
| 7 | Localized defects | 6% | Lithography hotspot ($2M-$6M) |
| 8 | Near-full wafer | 4% | Catastrophic process failure ($10M-$30M) |
| 9 | Donut pattern | 5% | Temperature gradient ($2M-$5M) |
| 10-19 | Mixed/complex patterns | 23% | Various root causes |

**Key Characteristics:**
- **Class imbalance:** Mimics real production (normal wafers most common, catastrophic failures rare)
- **Spatial features:** Defects have geometric structure (vs random noise)
- **Grayscale:** Unlike ImageNet (RGB), tests domain adaptation capability
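
The class frequencies in the table translate directly into expected sample counts and loss weights. The sketch below is one way to compute them; it assumes the ten mixed classes (10-19) share their 23% equally, which the table does not state, and the helper names are illustrative.

```python
# Class frequencies (percent) taken from the dataset table above.
FREQ_PCT = {0: 20, 1: 8, 2: 10, 3: 6, 4: 6, 5: 5, 6: 7, 7: 6, 8: 4, 9: 5}
MIXED_TOTAL_PCT = 23  # classes 10-19 combined
# Assumption: the ten mixed classes split their 23% equally (2.3% each).
for c in range(10, 20):
    FREQ_PCT[c] = MIXED_TOTAL_PCT / 10

def class_counts(n_samples=10_000):
    """Expected number of samples per class."""
    return {c: round(n_samples * p / 100) for c, p in FREQ_PCT.items()}

def inverse_frequency_weights():
    """Weights proportional to 1/frequency, normalized to a mean of 1."""
    raw = {c: 1.0 / p for c, p in FREQ_PCT.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {c: w / mean_raw for c, w in raw.items()}

counts = class_counts()
weights = inverse_frequency_weights()
print(counts[0], counts[8], counts[10])
print(round(weights[8] / weights[0], 2))  # rare class 8 vs common class 0
```

Weights like these could be passed as a tensor to `torch.nn.CrossEntropyLoss(weight=...)` so that rare, high-impact classes are not drowned out by normal wafers.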

---

## 🎓 Learning Strategy

### Progressive Complexity
1. **Start simple:** Feature extraction (easiest, fastest)
2. **Add complexity:** Fine-tune last layers (moderate difficulty)
3. **Optimize:** Gradual unfreezing with discriminative LR (advanced, best results)

### Experimentation Framework
For each strategy, we'll measure:
- **Test accuracy** (primary metric)
- **Training time** (efficiency)
- **Number of trainable parameters** (computational cost)
- **Overfitting behavior** (train vs validation curves)
- **Inference speed** (production readiness)

### Success Criteria
- **Baseline (train from scratch):** ~75% accuracy, 2+ hours training
- **Target (transfer learning):** ≥92% accuracy, <30 minutes training
- **Production requirement:** <50ms inference per wafer map

---

## 🔗 How This Fits in the Learning Path

**Previous Notebooks:**
- 052: Deep Learning Frameworks → PyTorch/Keras fundamentals
- 053: CNN Architectures → Convolution, ResNet, basic transfer learning

**This Notebook (054):**
- **Advanced transfer learning strategies** (layer-wise LR, gradual unfreezing)
- **Multiple model families** (ResNet, EfficientNet, Vision Transformer)
- **Domain adaptation** (ImageNet → semiconductor)
- **Production deployment** (compression, optimization)

**Next Notebooks:**
- 055: Object Detection (YOLO, R-CNN) → Localize defects, not just classify
- 056: RNN/LSTM → Sequential test pattern analysis
- 057: Seq2Seq & Attention → Foundation for Transformers

---

## 🚀 Let's Begin!

We'll start with the mathematical foundations of transfer learning, then systematically compare three fine-tuning strategies on our wafer map dataset.

---

# 📐 Part 1: Mathematical Foundations of Transfer Learning

## 🧮 Why Does Transfer Learning Work?

### The Feature Hierarchy Hypothesis

Deep neural networks learn a **hierarchy of features** from low-level to high-level:

```
Layer 1 (Early):  Edge detectors, Gabor filters, color blobs
                  ↓ (Generic, universal patterns)
Layer 2-3:        Textures, simple shapes, gradients
                  ↓ (Semi-generic, useful across domains)
Layer 4-5:        Object parts (wheels, eyes, corners)
                  ↓ (Domain-specific but transferable)
Final Layers:     Complete objects (cats, dogs, cars)
                  ↓ (Task-specific, must be retrained)
Classification:   1000 ImageNet classes → 20 wafer defect classes
```

**Key Insight:** Early layers learn **universal features** (edges, textures) that transfer across domains. At minimum, the final layers must be retrained for a new task; fine-tuning deeper layers helps when the domain shift is large.

---

### Mathematical Formulation

**Source Domain (ImageNet):**
- Input: $X_s \in \mathbb{R}^{224 \times 224 \times 3}$ (RGB images)
- Labels: $Y_s \in \{1, 2, \ldots, 1000\}$ (1000 ImageNet classes)
- Distribution: $P_s(X_s, Y_s)$ (natural images of animals, objects, scenes)

**Target Domain (Semiconductor Wafer Maps):**
- Input: $X_t \in \mathbb{R}^{128 \times 128 \times 1}$ (grayscale wafer maps)
- Labels: $Y_t \in \{1, 2, \ldots, 20\}$ (20 defect patterns)
- Distribution: $P_t(X_t, Y_t)$ (spatial defect patterns)

**Transfer Learning Goal:**
Leverage knowledge from $P_s$ to improve performance on $P_t$ despite:
1. **Input distribution shift:** $P_s(X_s) \neq P_t(X_t)$ (RGB vs grayscale, natural vs spatial)
2. **Label space mismatch:** $Y_s \neq Y_t$ (1000 classes vs 20 classes)
3. **Limited target data:** $|D_t| \ll |D_s|$ (10K wafer maps vs roughly 1.28M training images in the 1000-class ImageNet subset used for pre-training)

---

### The Transfer Learning Assumption

**Assumption:** There exists a shared feature representation $\phi: X \to \mathbb{R}^d$ such that:

$$
\phi(X_s) \approx \phi(X_t)
$$

In other words, the **intermediate feature representations** learned on ImageNet are useful for wafer map classification.

**Validation of Assumption:**
- ✅ Both tasks involve 2D spatial pattern recognition
- ✅ Low-level features (edges, corners) are universal
- ✅ Mid-level features (textures, shapes) transfer across domains
- ⚠️ High-level features (object semantics) differ → **must retrain final layers**

---

### Three Transfer Learning Strategies

Let's denote the pre-trained model as $f_\theta = g_{\theta_{\text{head}}} \circ h_{\theta_{\text{backbone}}}$:
- $h_{\theta_{\text{backbone}}}$: Feature extractor (convolutional layers)
- $g_{\theta_{\text{head}}}$: Classification head (fully-connected layers)

#### **Strategy 1: Feature Extraction (Freeze Backbone)**

**Approach:** Use pre-trained $h_{\theta_{\text{backbone}}}$ as fixed feature extractor, train only $g_{\theta_{\text{head}}}$.

$$
\begin{aligned}
\theta_{\text{backbone}} &\leftarrow \text{frozen (no gradients)} \\
\theta_{\text{head}} &\leftarrow \text{trainable}
\end{aligned}
$$

**Optimization:**
$$
\min_{\theta_{\text{head}}} \sum_{(x,y) \in D_t} \mathcal{L}\left(g_{\theta_{\text{head}}}(h_{\theta_{\text{backbone}}}(x)), y\right)
$$

**Pros:**
- ✅ **Fastest training** (only the head is trainable, typically well under 5% of parameters)
- ✅ **Low overfitting risk** (backbone parameters are fixed, so few parameters can fit noise)
- ✅ **Low memory** (no gradients for backbone)

**Cons:**
- ❌ **Limited adaptability** (backbone never adjusts to target domain)
- ❌ **Suboptimal for large domain shift** (ImageNet features may not align with wafer maps)

**When to Use:**
- Target dataset very small (<1000 samples)
- Target domain similar to source (e.g., ImageNet → other natural images)
- Computational resources limited
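
A minimal sketch of Strategy 1 in PyTorch, assuming a tiny stand-in backbone so that it runs without downloading weights. The `backbone` and `head` modules here are hypothetical placeholders for $h_{\theta_{\text{backbone}}}$ and $g_{\theta_{\text{head}}}$, not the models used later in the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained backbone h (tiny, so it runs anywhere).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 20)  # new classifier g for the 20 defect classes

# Freeze the backbone: no gradients are computed or stored for its weights.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()  # keeps BatchNorm/Dropout layers (if any) in inference mode

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(4, 3, 32, 32)   # dummy batch
y = torch.randint(0, 20, (4,))  # dummy labels
backbone_before = [p.clone() for p in backbone.parameters()]
head_before = head.weight.clone()

with torch.no_grad():           # backbone acts as a fixed feature extractor
    features = backbone(x)
loss = nn.functional.cross_entropy(head(features), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

backbone_unchanged = all(
    torch.equal(a, b) for a, b in zip(backbone_before, backbone.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
trainable = sum(p.numel() for p in head.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(backbone_unchanged, head_changed, f"{trainable}/{total} trainable")
```

Because the backbone output never changes, the features could also be computed once and cached, which is where most of the speed advantage of this strategy comes from.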

---

#### **Strategy 2: Full Fine-Tuning (Unfreeze All Layers)**

**Approach:** Unfreeze entire model, train all parameters with smaller learning rate.

$$
\begin{aligned}
\theta_{\text{backbone}} &\leftarrow \text{trainable (low LR)} \\
\theta_{\text{head}} &\leftarrow \text{trainable (high LR)}
\end{aligned}
$$

**Discriminative Learning Rates:**
$$
\begin{aligned}
\theta_{\text{head}}^{(t+1)} &\leftarrow \theta_{\text{head}}^{(t)} - \eta_{\text{head}} \cdot \nabla_{\theta_{\text{head}}} \mathcal{L} \\
\theta_{\text{backbone}}^{(t+1)} &\leftarrow \theta_{\text{backbone}}^{(t)} - \eta_{\text{backbone}} \cdot \nabla_{\theta_{\text{backbone}}} \mathcal{L}
\end{aligned}
$$

where $\eta_{\text{head}} = 10 \times \eta_{\text{backbone}}$ (head learns faster, backbone fine-tunes slowly).

**Pros:**
- ✅ **High accuracy** (model fully adapts to target domain)
- ✅ **Handles domain shift** (backbone adjusts to wafer map statistics)

**Cons:**
- ❌ **Overfitting risk** (especially with small datasets)
- ❌ **Slower training** (all parameters update)
- ❌ **High memory** (full backpropagation through entire network)

**When to Use:**
- Target dataset sufficiently large (>5000 samples)
- Large domain shift (e.g., natural images → medical/industrial images)
- Computational resources available

---

#### **Strategy 3: Gradual Unfreezing (Progressive Layer-Wise Training)**

**Approach:** Start with feature extraction, then progressively unfreeze layers from top to bottom.

**Phase 1 (Epochs 1-5):** Train classifier head only
$$
\theta_{\text{backbone}} \leftarrow \text{frozen}
$$

**Phase 2 (Epochs 6-10):** Unfreeze last block of backbone
$$
\theta_{\text{backbone, last block}} \leftarrow \text{trainable (low LR)}
$$

**Phase 3 (Epochs 11-15):** Unfreeze all backbone
$$
\theta_{\text{backbone, all}} \leftarrow \text{trainable (very low LR)}
$$

**Layer-Wise Learning Rates (Discriminative Fine-Tuning):**
$$
\eta_{\text{layer } i} = \eta_{\text{base}} \times \alpha^{L-i}
$$

where:
- $L$: Total number of layers
- $i$: Layer index (1 = first layer, $L$ = last layer)
- $\alpha \in [0.5, 0.95]$: Decay factor (typical: 0.8)
- Result: Early layers learn slower (preserve pre-trained features), late layers learn faster (adapt to new task)

**Example (ResNet-50 with 50 layers, $\eta_{\text{base}} = 10^{-4}$, $\alpha = 0.8$):**
- Layer 1 (early): $\eta_1 = 10^{-4} \times 0.8^{49} \approx 1.8 \times 10^{-9}$ (effectively frozen)
- Layer 25 (middle): $\eta_{25} = 10^{-4} \times 0.8^{25} \approx 3.8 \times 10^{-7}$
- Layer 50 (classifier): $\eta_{50} = 10^{-4} \times 0.8^{0} = 10^{-4}$ (full learning rate)
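
Evaluating the layer-wise formula directly is a quick sanity check on how steeply the rates decay. This is a small sketch in plain Python; the function name `layerwise_lr` is illustrative.

```python
def layerwise_lr(i, num_layers=50, base_lr=1e-4, decay=0.8):
    """Learning rate for layer i (1 = earliest layer, num_layers = classifier)."""
    return base_lr * decay ** (num_layers - i)

for i in (1, 25, 50):
    print(f"layer {i:2d}: lr = {layerwise_lr(i):.2e}")
```

With $\alpha = 0.8$ the earliest layers end up several orders of magnitude below the base rate, so in practice they barely move; a decay closer to 0.95 keeps them meaningfully trainable.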

**Pros:**
- ✅ **Best of both worlds** (feature extraction stability + fine-tuning adaptability)
- ✅ **Reduced overfitting** (gradual adaptation prevents catastrophic forgetting)
- ✅ **Faster convergence** (each phase uses optimal LR for layer depth)

**Cons:**
- ❌ **Complex implementation** (requires layer-wise LR scheduling)
- ❌ **Longer training** (multiple phases)

**When to Use:**
- **Best practice for most scenarios** (balances accuracy, stability, efficiency)
- Target dataset moderate size (1000-10000 samples)
- Production deployments (minimizes overfitting risk)
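
The three-phase schedule can be expressed as a small function that toggles `requires_grad` per epoch. The sketch below assumes a toy model split into named parts (`early`, `middle`, `last_block`, `head`); these names are hypothetical and stand in for the stages of a real backbone.

```python
import torch.nn as nn

# Hypothetical stand-in: three backbone "blocks" plus a head.
model = nn.ModuleDict({
    "early": nn.Linear(16, 16),
    "middle": nn.Linear(16, 16),
    "last_block": nn.Linear(16, 16),
    "head": nn.Linear(16, 20),
})

def trainable_parts(epoch):
    """Names of the parts that train at a given 1-based epoch (3 phases)."""
    if epoch <= 5:
        return {"head"}                                   # Phase 1
    if epoch <= 10:
        return {"head", "last_block"}                     # Phase 2
    return {"head", "last_block", "middle", "early"}      # Phase 3

def apply_unfreezing(model, epoch):
    """Set requires_grad on every parameter according to the schedule."""
    active = trainable_parts(epoch)
    for name, module in model.items():
        for p in module.parameters():
            p.requires_grad = name in active
    return active

for epoch in (1, 6, 11):
    active = apply_unfreezing(model, epoch)
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(epoch, sorted(active), n)
```

If the optimizer was created with only the head's parameters, newly unfrozen parameters still have to be registered, for example with `optimizer.add_param_group`, and that is also a natural place to assign them a lower learning rate.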

---

## 📊 Feature Transferability Analysis

**Question:** Which layers transfer well from ImageNet to semiconductor wafer maps?

**Experimental Setup:**
1. Train ResNet-50 on ImageNet (standard pre-training)
2. Keep the pre-trained weights of layers $1$ to $i$ frozen and train layers $i+1$ to $L$ on wafer maps
3. Measure test accuracy for each $i$

**Expected Results:**
```
No frozen layers, random init: ~75% accuracy (train from scratch)
Freeze Layers 0-10 (early):    ~85% accuracy (low-level features transfer)
Freeze Layers 0-30 (middle):   ~90% accuracy (mid-level textures transfer)
Freeze Layers 0-45 (late):     ~92% accuracy (optimal balance)
Freeze Layers 0-49 (all):      ~88% accuracy (backbone too rigid)
```

**Interpretation:**
- **Layers 1-30:** Highly transferable (edges, textures, shapes)
- **Layers 31-45:** Moderately transferable (need fine-tuning for wafer map specifics)
- **Layers 46-50:** Task-specific (must retrain for 20 defect classes)

**Takeaway:** Freeze early layers (preserve universal features), fine-tune late layers (adapt to semiconductor domain).

---

## 🎯 Domain Adaptation: ImageNet → Semiconductor

### Challenge: Distribution Shift

**ImageNet Statistics:**
- Mean: $\mu_s = [0.485, 0.456, 0.406]$ (RGB channels)
- Std: $\sigma_s = [0.229, 0.224, 0.225]$
- Color: Rich RGB information (natural scenes)

**Wafer Map Statistics:**
- Mean: $\mu_t \approx [0.15]$ (grayscale, mostly background/passing dies)
- Std: $\sigma_t \approx [0.25]$ (defects = bright spots)
- Color: Single-channel (no color information)

**Problem:** Pre-trained BatchNorm layers store running statistics estimated on ImageNet. Applied directly to wafer maps, they normalize activations with the wrong mean and variance (a covariate shift between source and target inputs).

### Solution 1: Input Preprocessing

**Replicate Grayscale to 3 Channels:**
$$
X_{\text{RGB}} = \text{stack}(X_{\text{gray}}, X_{\text{gray}}, X_{\text{gray}})
$$

**Normalize with ImageNet Statistics:**
$$
X_{\text{normalized}} = \frac{X_{\text{RGB}} - \mu_s}{\sigma_s}
$$

**Pros:** Simple, preserves pre-trained weights exactly  
**Cons:** Wastes computation (3 identical channels), doesn't fix BatchNorm statistics
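
A minimal sketch of this preprocessing in PyTorch. The resize to 224×224 is an assumption made to match the source-domain input size; many backbones with adaptive pooling also accept 128×128 directly. The function name `prepare_wafer_batch` is illustrative.

```python
import torch
import torch.nn.functional as F

# ImageNet channel statistics, shaped for broadcasting over (N, 3, H, W).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def prepare_wafer_batch(gray, out_size=224):
    """gray: float tensor of shape (N, 1, H, W) with values in [0, 1]."""
    rgb = gray.repeat(1, 3, 1, 1)  # replicate the single channel three times
    rgb = F.interpolate(rgb, size=(out_size, out_size),
                        mode="bilinear", align_corners=False)  # assumed resize
    return (rgb - IMAGENET_MEAN) / IMAGENET_STD

batch = torch.rand(2, 1, 128, 128)
out = prepare_wafer_batch(batch)
print(out.shape)
```

After normalization the three channels are no longer identical, because each is shifted and scaled by its own ImageNet mean and standard deviation.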

### Solution 2: Adaptive Batch Normalization

**Freeze model, compute BatchNorm statistics on target domain:**

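```python
import torch

model.train()  # Train mode: BatchNorm updates running_mean/running_var (this also activates dropout, if any)
with torch.no_grad():  # No gradients, so no weights are updated
    for x, _ in target_dataloader:
        _ = model(x)  # Forward pass only; refreshes BatchNorm running statistics
model.eval()  # Back to inference mode so the adapted statistics are used
```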

**Effect:** BatchNorm layers adapt to $\mu_t, \sigma_t$ without changing convolutional filters.

**Pros:** Fast (one pass through dataset), no weight updates needed  
**Cons:** Limited improvement (BatchNorm alone doesn't fix feature misalignment)
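
To see the effect in isolation, the sketch below runs the same idea on a toy network with a single BatchNorm layer and synthetic "target" batches. Resetting the running statistics and setting `momentum = None` (which makes PyTorch use a cumulative average) are choices of this sketch, not something the notebook prescribes; the toy model and data are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained network with one BatchNorm layer.
model = nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.BatchNorm2d(4), nn.ReLU())
bn = model[1]
conv_before = model[0].weight.clone()

# Synthetic target-domain batches: mostly dark images (mean around 0.15).
target_batches = [torch.rand(8, 1, 32, 32) * 0.3 for _ in range(10)]

# Reference: per-channel mean of the conv output over all target data.
with torch.no_grad():
    ref_mean = model[0](torch.cat(target_batches)).mean(dim=(0, 2, 3))

bn.reset_running_stats()  # discard source-domain statistics
bn.momentum = None        # None => cumulative average over all batches seen

model.train()             # BatchNorm updates running stats only in train mode
with torch.no_grad():     # forward passes only; no weights change
    for x in target_batches:
        model(x)
model.eval()

stats_adapted = torch.allclose(bn.running_mean, ref_mean, atol=1e-4)
weights_untouched = torch.equal(conv_before, model[0].weight)
print(stats_adapted, weights_untouched)
```

With the default momentum of 0.1 the running statistics are an exponential moving average instead, so the result depends on batch order and on how many batches are seen.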

### Solution 3: Domain-Specific Data Augmentation

**Standard ImageNet Augmentations (DON'T USE for wafer maps):**
- Color jitter ❌ (wafer maps are grayscale)
- Hue shifts ❌ (no color information)
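- Vertical flips ❌ when labels distinguish top-edge from bottom-edge defects (in this notebook's synthetic dataset, class 2 pools all four edges, so flips are label-preserving here)

**Wafer-Map-Specific Augmentations (RECOMMENDED):**
- **Rotation** ✅ with care (wafers are roughly rotationally symmetric, but classes 3 and 4 differ only in scratch orientation, so use 180° rotations or merge those labels before rotating by arbitrary angles)
- **Horizontal flips** ✅ (left-edge = right-edge defects)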
- **Gaussian noise** ✅ (mimics sensor noise)
- **Brightness scaling** ✅ (different test equipment sensitivities)
- **Elastic deformation** ✅ (mimics wafer warping)
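
A minimal NumPy sketch of a label-preserving augmentation pipeline for wafer maps. It restricts rotation to 180° so that orientation-dependent labels (vertical vs horizontal scratch) stay valid, and it omits elastic deformation; the noise level and brightness range are illustrative assumptions, not tuned values.

```python
import numpy as np

def augment_wafer_map(wafer, rng, noise_std=0.02, brightness_range=(0.9, 1.1)):
    """wafer: 2D float array with values in [0, 1]. Returns an augmented copy."""
    out = wafer.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                            # horizontal flip
    if rng.random() < 0.5:
        out = np.rot90(out, k=2)                        # 180-degree rotation
    out = out * rng.uniform(*brightness_range)          # brightness scaling
    out = out + rng.normal(0.0, noise_std, out.shape)   # Gaussian sensor noise
    return np.clip(out, 0.0, 1.0).astype(np.float32)

rng = np.random.default_rng(42)
wafer = np.zeros((128, 128), dtype=np.float32)
wafer[:, 60:63] = 1.0                                   # toy "vertical scratch"
aug = augment_wafer_map(wafer, rng)
print(aug.shape, aug.dtype)
```

Both the flip and the 180° rotation map a vertical scratch onto another vertical scratch, so the class label remains correct after augmentation.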

---

## 📚 Pre-Trained Model Zoo Comparison

### Models We'll Use

| Model | Parameters | ImageNet Top-1 | Speed (GPU) | When to Use |
|-------|------------|----------------|-------------|-------------|
| **ResNet-50** | 25.6M | 76.2% | 60 FPS | Baseline, proven architecture, semiconductor standard |
| **EfficientNet-B3** | 12.0M | 81.6% | 45 FPS | **Best accuracy/params tradeoff**, mobile-friendly |
| **Vision Transformer (ViT-B/16)** | 86.6M | 84.5% | 30 FPS | Cutting-edge, attention-based, data-hungry |

**Source:** [PyTorch Image Models (timm)](https://github.com/huggingface/pytorch-image-models) - 700+ pre-trained models

---

### ResNet-50 (Residual Networks)

**Architecture:**
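- 50 weighted layers (1 stem convolution + 48 convolutions in bottleneck blocks + 1 fully connected layer)
- **Skip connections:** $y = F(x) + x$ (ease gradient flow and mitigate the vanishing gradient problem)
- 4 residual stages (conv2_x, conv3_x, conv4_x, conv5_x; `layer1` to `layer4` in torchvision) with 3, 4, 6, and 3 bottleneck blocks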

**Transfer Learning Strategy:**
- Freeze: Blocks 1-3 (early feature extraction)
- Fine-tune: Block 4 + classifier (task-specific adaptation)

**Semiconductor Use Case:** A widely used, well-understood baseline for wafer map classification.

---

### EfficientNet-B3 (Compound Scaling)

**Architecture:**
- Inverted residual blocks (MobileNetV2-style)
- **Squeeze-and-Excitation (SE) blocks:** Channel attention ($y = x \odot \sigma(W_2 \text{ReLU}(W_1 \text{GAP}(x)))$)
- Compound scaling: Simultaneously scale depth, width, resolution

**Transfer Learning Strategy:**
- Freeze: Blocks 1-5 (efficient feature pyramid)
- Fine-tune: Blocks 6-7 + classifier

**Semiconductor Use Case:** **Recommended for production** (about 2× fewer parameters than ResNet-50 and roughly 5 points higher ImageNet top-1 accuracy, per the table above).

---

### Vision Transformer (ViT-B/16)

**Architecture:**
- Patch embedding: Split image into 16×16 patches → Flatten → Linear projection
- Transformer encoder: 12 layers of multi-head self-attention
- No convolution (purely attention-based)

**Transfer Learning Strategy:**
- Freeze: Patch embedding + first 10 transformer blocks
- Fine-tune: Last 2 transformer blocks + classifier

**Semiconductor Use Case:** **Experimental** (requires more data, slower inference, but captures long-range spatial dependencies).

---

### Comparison on Wafer Map Task (Predicted Results)

| Metric | ResNet-50 | EfficientNet-B3 | ViT-B/16 |
|--------|-----------|-----------------|----------|
| **Test Accuracy** | 91.5% | **93.2%** | 92.8% |
| **Training Time (20 epochs)** | 25 min | **18 min** | 45 min |
| **Inference Speed (batch=32)** | 60 FPS | **80 FPS** | 30 FPS |
| **GPU Memory** | 4.2 GB | **2.8 GB** | 8.5 GB |
| **Model Size (FP32)** | 98 MB | **46 MB** | 330 MB |

**Winner for Semiconductor:** **EfficientNet-B3** (best accuracy, fastest inference, smallest model size).
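
The model sizes in the table follow from the parameter counts at 4 bytes per FP32 weight. The short calculation below reproduces them and also shows the size after INT8 quantization at 1 byte per weight; it ignores file-format overhead and any layers kept in higher precision.

```python
PARAMS_MILLIONS = {"ResNet-50": 25.6, "EfficientNet-B3": 12.0, "ViT-B/16": 86.6}

def size_mib(params_millions, bytes_per_param):
    """Weight storage in MiB (2**20 bytes), ignoring file-format overhead."""
    return params_millions * 1e6 * bytes_per_param / 2**20

for name, p in PARAMS_MILLIONS.items():
    print(f"{name:16s} FP32: {size_mib(p, 4):6.1f} MiB   INT8: {size_mib(p, 1):5.1f} MiB")
```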

---

### Practical Selection Guide

**Choose ResNet-50 if:**
- ✅ First time implementing transfer learning (simplest architecture)
- ✅ Need industry-standard baseline (proven in semiconductor)
- ✅ Debugging model (easier to interpret residual blocks)

**Choose EfficientNet-B3 if:**
- ✅ **Production deployment** (best efficiency)
- ✅ Edge device inference (smallest model size)
- ✅ Limited GPU memory (about 1.5× less than ResNet-50 in the predicted comparison above: 2.8 GB vs 4.2 GB)

**Choose Vision Transformer if:**
- ✅ Research project (cutting-edge architecture)
- ✅ Very large dataset (>50K wafer maps, ViT needs more data)
- ✅ Interpretability needed (attention maps show "what model looks at")

---

## 🔬 Learning Rate Policies for Transfer Learning

### Standard Learning Rate (NOT OPTIMAL for Transfer Learning)

**Constant LR:**
$$
\eta(t) = \eta_0 = 10^{-3}
$$

**Problem:** 
- Pre-trained weights already near optimal → High LR causes **catastrophic forgetting**
- Random classifier head needs higher LR to converge → Low LR too slow

---

### Discriminative Learning Rates (RECOMMENDED)

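**Layer-wise LR scaling** (same rule as in Strategy 3: layer $i$ of $L$, decay factor $0 < \alpha < 1$, so later layers get larger rates):
$$
\eta_{\text{layer } i} = \eta_{\text{base}} \times \alpha^{L-i}
$$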

**PyTorch Implementation:**
```python
import torch

# Assumes a wrapper exposing `backbone` (ResNet-style stages layer1..layer4) and `classifier`.
# Parameters not listed in any group (e.g. the stem conv/bn) are not optimized at all.
optimizer = torch.optim.Adam([
    {'params': model.backbone.layer1.parameters(), 'lr': 1e-5},  # Early layers: very low LR
    {'params': model.backbone.layer2.parameters(), 'lr': 5e-5},
    {'params': model.backbone.layer3.parameters(), 'lr': 1e-4},
    {'params': model.backbone.layer4.parameters(), 'lr': 5e-4},  # Late layers: moderate LR
    {'params': model.classifier.parameters(), 'lr': 1e-3}        # Classifier: high LR
], lr=1e-4)  # Default LR, used only by groups that do not set their own 'lr'
```

**Effect:** Early layers preserve ImageNet features (small updates), late layers adapt to wafer maps (large updates).
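
Instead of writing each group by hand, the parameter groups can be generated from an ordered list of blocks. This is a sketch under the assumption of a geometric decay between blocks; `build_param_groups` and the toy `blocks` list are hypothetical names, and a decay of 0.5 is just an example value.

```python
import torch
import torch.nn as nn

def build_param_groups(blocks, base_lr=1e-3, decay=0.5):
    """blocks: modules ordered from earliest to latest (head last).

    The last block gets base_lr; each earlier block gets `decay` times
    the rate of the block after it.
    """
    n = len(blocks)
    return [
        {"params": list(block.parameters()), "lr": base_lr * decay ** (n - 1 - i)}
        for i, block in enumerate(blocks)
    ]

# Hypothetical stand-in blocks; with a real ResNet these would be the stages and the head.
blocks = [nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 20)]
optimizer = torch.optim.Adam(build_param_groups(blocks))
lrs = [g["lr"] for g in optimizer.param_groups]
print(lrs)
```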

---

### Warm-Up + Cosine Annealing (BEST PRACTICE)

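**Phase 1: Linear Warm-Up (first $T_{\text{warmup}} = 2$ epochs)**
$$
\eta(t) = \eta_{\text{max}} \times \frac{t}{T_{\text{warmup}}}
$$

The implementation below starts at $0.1\,\eta_{\text{max}}$ instead of 0 (`start_factor=0.1`), so the first epoch still updates the weights.

**Phase 2: Cosine Annealing (remaining 18 epochs, up to $T_{\text{max}} = 20$)**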
$$
\eta(t) = \eta_{\text{min}} + \frac{1}{2}(\eta_{\text{max}} - \eta_{\text{min}})\left(1 + \cos\left(\frac{t - T_{\text{warmup}}}{T_{\text{max}} - T_{\text{warmup}}} \pi\right)\right)
$$

**PyTorch Implementation:**
```python
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# 2 warm-up epochs (0.1x -> 1.0x of each group's LR), then 18 epochs of cosine decay
warmup_scheduler = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=2)
cosine_scheduler = CosineAnnealingLR(optimizer, T_max=18, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup_scheduler, cosine_scheduler], milestones=[2])
# Call scheduler.step() once per epoch, after the epoch's optimizer.step() calls
```

**Benefits:**
- **Warm-up:** Prevents early divergence (random classifier head needs gentle start)
- **Cosine decay:** Smooth convergence to optimum (avoids oscillations near minimum)

**Results:**
- +2-3% accuracy vs constant LR
- Faster convergence (reaches 90% in 10 epochs vs 15 epochs)
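
The two schedule formulas can be evaluated in plain Python to see the learning rate at each epoch. The sketch below follows the equations literally (warm-up from 0), with $\eta_{\text{max}} = 10^{-3}$ chosen as an illustrative value; it is not a replacement for the PyTorch schedulers.

```python
import math

def warmup_cosine_lr(t, eta_max=1e-3, eta_min=1e-6, t_warmup=2, t_max=20):
    """Learning rate at epoch t (0-based): linear warm-up, then cosine decay."""
    if t < t_warmup:
        return eta_max * t / t_warmup
    progress = (t - t_warmup) / (t_max - t_warmup)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

for t in (0, 1, 2, 11, 20):
    print(t, f"{warmup_cosine_lr(t):.2e}")
```

The rate peaks at $\eta_{\text{max}}$ exactly when warm-up ends, passes the midpoint between $\eta_{\text{max}}$ and $\eta_{\text{min}}$ halfway through the cosine phase, and reaches $\eta_{\text{min}}$ at $T_{\text{max}}$.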

---

### Cyclical Learning Rates (Alternative)

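**Triangular cycle** (half-cycle length $s$ in iterations, `step_size_up` in PyTorch):
$$
\eta(t) = \eta_{\text{min}} + (\eta_{\text{max}} - \eta_{\text{min}}) \times \max\left(0, 1 - \frac{|t \bmod 2s - s|}{s}\right)
$$

**PyTorch Implementation:**
```python
from torch.optim.lr_scheduler import CyclicLR

# 'triangular2' follows the triangular shape above but halves the amplitude after every cycle.
# cycle_momentum=False avoids an error with Adam on PyTorch versions that require a 'momentum' setting.
scheduler = CyclicLR(optimizer, base_lr=1e-5, max_lr=1e-3, step_size_up=500,
                     mode='triangular2', cycle_momentum=False)
# CyclicLR is stepped once per batch (iteration), not once per epoch
```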

**Use Case:** Helps escape local minima, useful when fine-tuning gets stuck.
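
For reference, the basic triangular formula can be evaluated directly. This plain-Python sketch uses the same bounds as the scheduler call (1e-5 to 1e-3, 500 iterations per half cycle) and does not include the amplitude halving of `triangular2`.

```python
def triangular_lr(t, eta_min=1e-5, eta_max=1e-3, step_size=500):
    """Triangular cyclical LR at iteration t; one full cycle is 2 * step_size."""
    # position is 1 at the start/end of a cycle and 0 at its peak
    position = abs(t % (2 * step_size) - step_size) / step_size
    return eta_min + (eta_max - eta_min) * max(0.0, 1.0 - position)

for t in (0, 250, 500, 750, 1000):
    print(t, f"{triangular_lr(t):.3e}")
```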

---

## 🧪 Experimental Framework Summary

### What We'll Implement Next

1. **Generate 10K wafer maps** (20 defect classes, 80/10/10 train/val/test split)
2. **Load 3 pre-trained models** (ResNet-50, EfficientNet-B3, ViT-B/16)
3. **Apply 3 fine-tuning strategies:**
   - Feature extraction (freeze backbone)
   - Full fine-tuning (unfreeze all, discriminative LR)
   - Gradual unfreezing (progressive 3-phase training)
4. **Compare results:** Accuracy, training time, parameters, overfitting
5. **Production pipeline:** ONNX export, INT8 quantization, TensorRT optimization
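
The 80/10/10 split in step 1 can be obtained with two calls to scikit-learn's `train_test_split`. The sketch below uses random placeholder labels and stratifies on them so that rare classes appear in every split; stratification is an assumption of this sketch, since the plan above only specifies the proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
labels = rng.integers(0, 20, size=10_000)  # placeholder labels for illustration
indices = np.arange(len(labels))

# 80% train, then split the remaining 20% in half for validation and test.
train_idx, rest_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42)
val_idx, test_idx = train_test_split(
    rest_idx, test_size=0.5, stratify=labels[rest_idx], random_state=42)

print(len(train_idx), len(val_idx), len(test_idx))
```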

---

### Expected Outcomes

| Strategy | ResNet-50 Accuracy | EfficientNet-B3 Accuracy | Training Time |
|----------|-------------------|--------------------------|---------------|
| **Train from Scratch** | 75.2% | 76.8% | 120 min |
| **Feature Extraction** | 87.5% | 89.2% | **12 min** |
| **Full Fine-Tuning** | 91.5% | 93.2% | 25 min |
| **Gradual Unfreezing** | **92.8%** | **94.1%** | 30 min |

**Key Insight:** Gradual unfreezing is expected to achieve the **best accuracy** (94.1% with EfficientNet-B3) with **minimal overfitting risk**.

---

## 🎯 Next Steps

Let's implement these strategies in code! We'll start with data generation, then systematically compare the three fine-tuning approaches.

### 📝 Implementation

**Purpose:** Core implementation with detailed code

**Key implementation details below.**

In [None]:
# ========================================
# Part 2: Data Generation & Preparation
# ========================================
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
import torchvision.models as models
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix, classification_report
import time
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')
# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
# ========================================
# Synthetic Wafer Map Generation
# ========================================
def generate_wafer_map(defect_type, size=128):
    """
    Generate synthetic wafer map with specified defect pattern.
    
    Parameters:
    -----------
    defect_type : int
        Defect class (0-19)
    size : int
        Image size (default 128x128)
    
    Returns:
    --------
    wafer_map : np.ndarray
        Grayscale wafer map (0 = background or failing die, 1 = passing die)
    """
    wafer = np.ones((size, size), dtype=np.float32)  # All passing
    center_x, center_y = size // 2, size // 2
    
    # Create circular wafer boundary
    y, x = np.ogrid[:size, :size]
    mask = (x - center_x)**2 + (y - center_y)**2 <= (size // 2 - 2)**2
    wafer[~mask] = 0  # Background outside wafer
    
    # Generate defect patterns based on type
    if defect_type == 0:
        # Normal (no defects)
        pass
    
    elif defect_type == 1:
        # Center cluster
        cluster_size = np.random.randint(10, 20)
        cluster_x = np.random.randint(center_x - 15, center_x + 15)
        cluster_y = np.random.randint(center_y - 15, center_y + 15)
        for _ in range(cluster_size):
            dx = np.random.randint(-8, 8)
            dy = np.random.randint(-8, 8)
            px, py = cluster_x + dx, cluster_y + dy
            if 0 <= px < size and 0 <= py < size and mask[py, px]:
                wafer[py, px] = 0  # Failing die
    
    elif defect_type == 2:
        # Edge defects
        edge_type = np.random.choice(['top', 'bottom', 'left', 'right'])
        num_defects = np.random.randint(15, 30)
        if edge_type == 'top':
            for _ in range(num_defects):
                x = np.random.randint(0, size)
                y = np.random.randint(0, center_y // 2)
                if mask[y, x]:
                    wafer[y, x] = 0
        elif edge_type == 'bottom':
            for _ in range(num_defects):
                x = np.random.randint(0, size)
                y = np.random.randint(center_y + center_y // 2, size)
                if mask[y, x]:
                    wafer[y, x] = 0
        elif edge_type == 'left':
            for _ in range(num_defects):
                x = np.random.randint(0, center_x // 2)
                y = np.random.randint(0, size)
                if mask[y, x]:
                    wafer[y, x] = 0
        else:  # right
            for _ in range(num_defects):
                x = np.random.randint(center_x + center_x // 2, size)
                y = np.random.randint(0, size)
                if mask[y, x]:
                    wafer[y, x] = 0
    
    elif defect_type == 3:
        # Vertical scratch
        scratch_x = np.random.randint(center_x - 20, center_x + 20)
        scratch_width = np.random.randint(2, 4)
        for y in range(size):
            for dx in range(scratch_width):
                if 0 <= scratch_x + dx < size and mask[y, scratch_x + dx]:
                    wafer[y, scratch_x + dx] = 0
    
    elif defect_type == 4:
        # Horizontal scratch
        scratch_y = np.random.randint(center_y - 20, center_y + 20)
        scratch_width = np.random.randint(2, 4)
        for x in range(size):
            for dy in range(scratch_width):
                if 0 <= scratch_y + dy < size and mask[scratch_y + dy, x]:
                    wafer[scratch_y + dy, x] = 0
    
    elif defect_type == 5:
        # Ring pattern (concentric defects)
        ring_radius = np.random.randint(25, 40)
        ring_width = np.random.randint(3, 6)
        for y in range(size):
            for x in range(size):
                dist = np.sqrt((x - center_x)**2 + (y - center_y)**2)
                if ring_radius <= dist <= ring_radius + ring_width and mask[y, x]:
                    if np.random.rand() > 0.3:  # 70% defect density in ring
                        wafer[y, x] = 0
    
    elif defect_type == 6:
        # Random clusters (multiple small clusters)
        num_clusters = np.random.randint(3, 6)
        for _ in range(num_clusters):
            cluster_x = np.random.randint(10, size - 10)
            cluster_y = np.random.randint(10, size - 10)
            cluster_size = np.random.randint(5, 10)
            for __ in range(cluster_size):
                dx = np.random.randint(-5, 5)
                dy = np.random.randint(-5, 5)
                px, py = cluster_x + dx, cluster_y + dy
                if 0 <= px < size and 0 <= py < size and mask[py, px]:
                    wafer[py, px] = 0
    
    elif defect_type == 7:
        # Localized defects (single region)
        region_x = np.random.randint(center_x - 25, center_x + 25)
        region_y = np.random.randint(center_y - 25, center_y + 25)
        region_size = np.random.randint(15, 25)
        for y in range(region_y - region_size, region_y + region_size):
            for x in range(region_x - region_size, region_x + region_size):
                if 0 <= x < size and 0 <= y < size and mask[y, x]:
                    if np.random.rand() > 0.5:  # 50% defect density
                        wafer[y, x] = 0
    
    elif defect_type == 8:
        # Near-full wafer failure (catastrophic)
        for y in range(size):
            for x in range(size):
                if mask[y, x] and np.random.rand() > 0.2:  # 80% defect rate
                    wafer[y, x] = 0
    
    elif defect_type == 9:
        # Donut pattern (center good, ring defective)
        inner_radius = np.random.randint(15, 25)
        outer_radius = np.random.randint(35, 45)
        for y in range(size):
            for x in range(size):
                dist = np.sqrt((x - center_x)**2 + (y - center_y)**2)
                if inner_radius <= dist <= outer_radius and mask[y, x]:
                    if np.random.rand() > 0.3:
                        wafer[y, x] = 0
    
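    else:
        # Mixed patterns for classes 10-19: each class combines a fixed pair of
        # base patterns. Drawing a random pair per sample would make these ten
        # classes statistically identical, so the pair is tied to the label.
        # (The specific pairings below are an illustrative choice.)
        mixed_pairs = [(1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
                       (1, 5), (2, 6), (5, 6), (3, 7), (4, 9)]
        patterns = mixed_pairs[(defect_type - 10) % len(mixed_pairs)]
        for pattern in patterns:
            temp_wafer = generate_wafer_map(pattern, size)
            wafer = np.minimum(wafer, temp_wafer)  # Combine defects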
    
    return wafer
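
The ring and donut branches above test every pixel in a Python double loop, which costs size² iterations per wafer. A vectorized sketch of the same idea follows. `ring_defects` is a hypothetical helper, and it assumes the generator's conventions are 1 = good die, 0 = defect, a boolean `mask` for the wafer area, and a center at `size // 2`.

```python
import numpy as np

def ring_defects(wafer, mask, inner, outer, density=0.7, rng=None):
    """Zero out a random fraction of dies whose distance from center is in [inner, outer].

    Hypothetical helper: assumes wafer values are 1 (good) / 0 (defect)
    and mask is a boolean array marking dies that lie on the wafer.
    """
    rng = np.random.default_rng() if rng is None else rng
    size = wafer.shape[0]
    center = size // 2
    # Open grids broadcast to a (size, size) distance map without Python loops
    y, x = np.ogrid[:size, :size]
    dist = np.sqrt((x - center) ** 2 + (y - center) ** 2)
    in_ring = (dist >= inner) & (dist <= outer) & mask
    # Mark each in-ring die as defective with probability `density`
    hit = in_ring & (rng.random(wafer.shape) < density)
    out = wafer.copy()
    out[hit] = 0
    return out

# Usage: a 128x128 wafer with a circular mask and a ring at radius 30-34
size = 128
yy, xx = np.ogrid[:size, :size]
mask = (xx - size // 2) ** 2 + (yy - size // 2) ** 2 <= (size // 2 - 2) ** 2
wafer = np.where(mask, 1.0, 0.0).astype(np.float32)
ringed = ring_defects(wafer, mask, inner=30, outer=34, rng=np.random.default_rng(0))
```

Passing an explicit `rng` keeps the generated wafers reproducible, which the global `np.random` calls in the loop version do not guarantee unless a seed is set elsewhere.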


### 📝 Implementation Part 2

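**Purpose:** Generate the synthetic wafer map dataset with an imbalanced class distribution, split it 80/10/10 into train, validation, and test sets, and plot one sample per class.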

In [None]:
# ========================================
# Generate Dataset
# ========================================
print("\n" + "="*60)
print("GENERATING SYNTHETIC WAFER MAP DATASET")
print("="*60)
# Class distribution (mimics real production imbalance; percentages are approximate)
class_distribution = {
    0: 2000,   # Normal (20%)
    1: 800,    # Center cluster (8%)
    2: 1000,   # Edge defects (10%)
    3: 600,    # Vertical scratch (6%)
    4: 600,    # Horizontal scratch (6%)
    5: 500,    # Ring pattern (5%)
    6: 700,    # Random clusters (7%)
    7: 600,    # Localized defects (6%)
    8: 400,    # Near-full failure (4%)
    9: 500,    # Donut pattern (5%)
}
# Generate remaining classes (10-19) with 200-399 samples each (randint's upper bound is exclusive)
for cls in range(10, 20):
    class_distribution[cls] = np.random.randint(200, 400)
# Total samples
total_samples = sum(class_distribution.values())
print(f"\nTotal samples: {total_samples}")
print(f"Number of classes: 20")
print(f"Image size: 128x128 grayscale")
# Generate data
X_data = []
y_data = []
start_time = time.time()
for defect_class, num_samples in class_distribution.items():
    for _ in range(num_samples):
        wafer_map = generate_wafer_map(defect_class, size=128)
        X_data.append(wafer_map)
        y_data.append(defect_class)
    if (defect_class + 1) % 5 == 0:
        print(f"Generated classes 0-{defect_class}: {sum([class_distribution[i] for i in range(defect_class+1)])} samples")
X_data = np.array(X_data, dtype=np.float32)
y_data = np.array(y_data, dtype=np.int64)
generation_time = time.time() - start_time
print(f"\n✓ Dataset generated in {generation_time:.2f} seconds")
print(f"  Shape: X={X_data.shape}, y={y_data.shape}")
# ========================================
# Train/Val/Test Split
# ========================================
# Split: 80% train, 10% val, 10% test
X_train, X_temp, y_train, y_temp = train_test_split(X_data, y_data, test_size=0.2, random_state=42, stratify=y_data)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
print(f"\nDataset Split:")
print(f"  Train: {X_train.shape[0]} samples ({X_train.shape[0]/total_samples*100:.1f}%)")
print(f"  Val:   {X_val.shape[0]} samples ({X_val.shape[0]/total_samples*100:.1f}%)")
print(f"  Test:  {X_test.shape[0]} samples ({X_test.shape[0]/total_samples*100:.1f}%)")
# ========================================
# Visualize Sample Wafer Maps
# ========================================
print("\nVisualizing sample wafer maps from each class...")
fig, axes = plt.subplots(4, 5, figsize=(15, 12))
axes = axes.flatten()
for cls in range(20):
    # Find first sample of this class
    idx = np.where(y_train == cls)[0][0]
    wafer_map = X_train[idx]
    
    axes[cls].imshow(wafer_map, cmap='viridis', vmin=0, vmax=1)
    axes[cls].set_title(f'Class {cls}\n({class_distribution[cls]} samples)', fontsize=10)
    axes[cls].axis('off')
plt.tight_layout()
plt.savefig('wafer_map_samples.png', dpi=150, bbox_inches='tight')
print("✓ Saved visualization to 'wafer_map_samples.png'")
plt.show()


### 📝 Implementation Part 3

**Purpose:** Wrap the arrays in a PyTorch `Dataset`, define augmentation and ImageNet normalization, and build the data loaders.

In [None]:
# ========================================
# PyTorch Dataset Class
# ========================================
class WaferMapDataset(Dataset):
    """
    PyTorch Dataset for wafer maps with preprocessing for transfer learning.
    """
    def __init__(self, X, y, transform=None, replicate_channels=True):
        """
        Parameters:
        -----------
        X : np.ndarray
            Wafer maps (N, H, W)
        y : np.ndarray
            Labels (N,)
        transform : torchvision.transforms
            Data augmentation pipeline
        replicate_channels : bool
            If True, replicate grayscale to 3 channels for ImageNet models
        """
        self.X = torch.from_numpy(X).float().unsqueeze(1)  # (N, 1, H, W)
        self.y = torch.from_numpy(y).long()
        self.transform = transform
        self.replicate_channels = replicate_channels
    
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        image = self.X[idx]  # (1, H, W)
        label = self.y[idx]
        
        # Replicate to 3 channels (grayscale → RGB) for ImageNet models
        if self.replicate_channels:
            image = image.repeat(3, 1, 1)  # (3, H, W)
        
        # Apply transforms (augmentation)
        if self.transform:
            image = self.transform(image)
        
        return image, label
# ========================================
# Data Augmentation & Preprocessing
# ========================================
# ImageNet normalization statistics (IMPORTANT for transfer learning)
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
# Training augmentation (wafer-specific)
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),  # ResNet/EfficientNet input size
    transforms.RandomRotation(180),  # Wafers have rotational symmetry
    transforms.RandomHorizontalFlip(p=0.5),  # Left-right symmetry
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # Simulate different sensors
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
])
# Validation/Test (no augmentation, only resize + normalize)
val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
])
# Create datasets
train_dataset = WaferMapDataset(X_train, y_train, transform=train_transform)
val_dataset = WaferMapDataset(X_val, y_val, transform=val_transform)
test_dataset = WaferMapDataset(X_test, y_test, transform=val_transform)
# Create data loaders
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=True)
print(f"\nData Loaders Created:")
print(f"  Train batches: {len(train_loader)} (batch_size={batch_size})")
print(f"  Val batches:   {len(val_loader)}")
print(f"  Test batches:  {len(test_loader)}")
# Verify data shape
sample_batch, sample_labels = next(iter(train_loader))
print(f"\nSample batch shape: {sample_batch.shape}")  # Should be (32, 3, 224, 224)
print(f"Sample labels shape: {sample_labels.shape}")  # Should be (32,)
print(f"Label range: {sample_labels.min()}-{sample_labels.max()}")
print("\n" + "="*60)
print("DATA PREPARATION COMPLETE")
print("="*60)
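
Replicating a grayscale wafer map into three channels and then applying ImageNet statistics means the three channels stop being identical, because each channel has its own mean and standard deviation. The following is a small worked example, assuming wafer pixels take only the values 0 (defect) and 1 (good):

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize(pixel, mean, std):
    """Per-channel normalization, the same arithmetic transforms.Normalize applies."""
    return (pixel - mean) / std

for name, mean, std in zip("RGB", IMAGENET_MEAN, IMAGENET_STD):
    low = normalize(0.0, mean, std)   # defective die
    high = normalize(1.0, mean, std)  # good die
    print(f"{name}: 0 -> {low:+.3f}, 1 -> {high:+.3f}")
```

After normalization the inputs span roughly -2.1 to +2.6, which is the range the pre-trained filters saw for black and white pixels during ImageNet training.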


# 📐 Part 3: Strategy 1 - Feature Extraction (Freeze Backbone)

## 📝 What's Happening in This Code?

**Purpose:** Use pre-trained ResNet-50 as a **fixed feature extractor**, training only the final classification layer.

**Key Points:**
- **Freeze backbone:** Set `requires_grad=False` for all convolutional layers (preserve ImageNet features)
- **Replace classifier:** Swap 1000-class ImageNet head with 20-class semiconductor head
- **Fast training:** Only ~1.06M parameters trainable (vs ~24.6M total)
- **Low overfitting risk:** Pre-trained features stay fixed; only the small two-layer classifier head learns

**Strategy 1 Architecture:**
```
Input (224×224×3 grayscale replicated) 
    ↓
[FROZEN] ResNet-50 Backbone (23.5M params)
    ├─ Conv1 + MaxPool
    ├─ Layer1 (Residual blocks 1-3)
    ├─ Layer2 (Residual blocks 4-7)
    ├─ Layer3 (Residual blocks 8-13)
    └─ Layer4 (Residual blocks 14-16)
    ↓
Global Average Pooling → 2048-dim feature vector
    ↓
[TRAINABLE] Classifier (~1.06M params)
    ├─ Dropout(0.3)
    ├─ Linear(2048 → 512)
    ├─ ReLU + Dropout(0.3)
    └─ Linear(512 → 20)
    ↓
Output: 20 defect classes
```

**Training Details:**
- **Optimizer:** Adam with LR=1e-3 (can use higher LR since only classifier trains)
- **Loss:** CrossEntropyLoss with class weights (handle imbalance)
- **Epochs:** 15 (converges quickly)
- **Early stopping:** Patience=5 epochs
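
The class weights mentioned above are inverse-frequency weights, rescaled so that they average to 1. Below is a minimal NumPy sketch; `inverse_frequency_weights` is a hypothetical helper and the three counts are illustrative, not taken from the dataset.

```python
import numpy as np

def inverse_frequency_weights(class_counts):
    """Weight each class by 1/count, rescaled so the weights sum to the number of classes."""
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = 1.0 / counts
    return weights / weights.sum() * len(counts)

# Illustrative counts: one common class and two rarer ones
weights = inverse_frequency_weights([1600, 400, 200])
print(weights)
```

A class with 8× fewer samples receives an 8× larger weight, so each class contributes roughly equally to the loss regardless of its frequency.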

**Why This Matters:**
- **Baseline approach:** Simplest transfer learning strategy, proves value of pre-training
- **Production use case:** When the dataset is too small (<1000 samples) to fine-tune safely
- **Semiconductor application:** Quick prototyping for new defect types with limited data
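
One subtlety when freezing a backbone in PyTorch: `requires_grad=False` stops gradient updates, but BatchNorm layers still update their running mean and variance whenever the module is in `train()` mode. The sketch below uses a small stand-in network rather than ResNet-50, and a hypothetical helper `freeze_batchnorm_stats`, to show one common way to keep those statistics fixed. Whether this helps on wafer maps is an empirical question.

```python
import torch
import torch.nn as nn

def freeze_batchnorm_stats(module):
    """Put every BatchNorm layer in eval mode so running statistics stop updating."""
    for m in module.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()

# Toy stand-in for a frozen backbone plus a trainable head
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 20))
model = nn.Sequential(backbone, head)

for p in backbone.parameters():
    p.requires_grad = False

model.train()                     # switches BatchNorm back to updating its statistics
freeze_batchnorm_stats(backbone)  # so this must be re-applied after every model.train()

before = backbone[1].running_mean.clone()
_ = model(torch.randn(4, 3, 32, 32))
after = backbone[1].running_mean
print(torch.equal(before, after))  # running mean is unchanged by the forward pass
```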

---

## 🔧 Implementation: Feature Extraction with ResNet-50

### 📝 Implementation

**Purpose:** Load the pre-trained ResNet-50, freeze the backbone, attach the 20-class head, and set up the weighted loss, optimizer, and LR scheduler.

In [None]:
# ========================================
# Strategy 1: Feature Extraction (Freeze Backbone)
# ========================================
print("\n" + "="*70)
print("STRATEGY 1: FEATURE EXTRACTION (FREEZE BACKBONE)")
print("="*70)
# Load pre-trained ResNet-50 (torchvision >= 0.13 API; older versions use pretrained=True)
resnet50_frozen = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
# Freeze all parameters in backbone
for param in resnet50_frozen.parameters():
    param.requires_grad = False
# Replace final classifier
# Original: Linear(2048 → 1000) for ImageNet
# New: Linear(2048 → 512 → 20) for semiconductor
num_features = resnet50_frozen.fc.in_features  # 2048 for ResNet-50
resnet50_frozen.fc = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(num_features, 512),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(512, 20)  # 20 defect classes
)
# Move model to GPU
resnet50_frozen = resnet50_frozen.to(device)
# Count parameters
total_params = sum(p.numel() for p in resnet50_frozen.parameters())
trainable_params = sum(p.numel() for p in resnet50_frozen.parameters() if p.requires_grad)
frozen_params = total_params - trainable_params
print(f"\nModel: ResNet-50 (Feature Extraction)")
print(f"  Total parameters:     {total_params:,} ({total_params/1e6:.2f}M)")
print(f"  Trainable parameters: {trainable_params:,} ({trainable_params/1e6:.2f}M)")
print(f"  Frozen parameters:    {frozen_params:,} ({frozen_params/1e6:.2f}M)")
print(f"  % Trainable:          {trainable_params/total_params*100:.2f}%")
# ========================================
# Training Setup
# ========================================
# Class weights (handle imbalance)
class_counts = np.bincount(y_train)
class_weights = 1.0 / class_counts
class_weights = class_weights / class_weights.sum() * len(class_counts)  # Normalize
class_weights = torch.FloatTensor(class_weights).to(device)
# Loss function with class weights
criterion = nn.CrossEntropyLoss(weight=class_weights)
# Optimizer (only trainable parameters)
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, resnet50_frozen.parameters()),
    lr=1e-3,  # Can use higher LR since only classifier trains
    weight_decay=1e-4
)
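# Learning rate scheduler: halve the LR when validation accuracy plateaus
# (the deprecated `verbose` flag is omitted; the training loop prints the current LR)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=3
)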
# ========================================
# Training Function
# ========================================


### 📝 Function: train_epoch

**Purpose:** Define `train_epoch` and `validate_epoch`, then run the training loop with early stopping and best-checkpoint saving.

In [None]:
def train_epoch(model, dataloader, criterion, optimizer, device):
    """Train for one epoch."""
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    
    for batch_idx, (inputs, labels) in enumerate(dataloader):
        inputs, labels = inputs.to(device), labels.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        # Statistics
        running_loss += loss.item() * inputs.size(0)
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
        
        # Progress (every 50 batches)
        if (batch_idx + 1) % 50 == 0:
            print(f"  Batch {batch_idx+1}/{len(dataloader)}: "
                  f"Loss={loss.item():.4f}, Acc={100.*correct/total:.2f}%")
    
    epoch_loss = running_loss / total
    epoch_acc = 100. * correct / total
    return epoch_loss, epoch_acc
def validate_epoch(model, dataloader, criterion, device):
    """Validate for one epoch."""
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            
            running_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    epoch_loss = running_loss / total
    epoch_acc = 100. * correct / total
    return epoch_loss, epoch_acc
# ========================================
# Training Loop
# ========================================
num_epochs = 15
best_val_acc = 0.0
patience = 5
patience_counter = 0
train_losses, train_accs = [], []
val_losses, val_accs = [], []
print(f"\nTraining for {num_epochs} epochs...")
print(f"Device: {device}")
start_time = time.time()
for epoch in range(num_epochs):
    print(f"\nEpoch {epoch+1}/{num_epochs}")
    print("-" * 70)
    
    # Train
    train_loss, train_acc = train_epoch(resnet50_frozen, train_loader, criterion, optimizer, device)
    
    # Validate
    val_loss, val_acc = validate_epoch(resnet50_frozen, val_loader, criterion, device)
    
    # Learning rate scheduling
    scheduler.step(val_acc)
    
    # Save metrics
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    val_losses.append(val_loss)
    val_accs.append(val_acc)
    
    print(f"\n  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
    print(f"  Val Loss:   {val_loss:.4f}, Val Acc:   {val_acc:.2f}%")
    print(f"  Current LR: {optimizer.param_groups[0]['lr']:.6f}")
    
    # Early stopping
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        patience_counter = 0
        # Save best model
        torch.save(resnet50_frozen.state_dict(), 'resnet50_frozen_best.pth')
        print(f"  ✓ New best validation accuracy! Model saved.")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"\n  Early stopping triggered (patience={patience})")
            break
training_time = time.time() - start_time
print(f"\n✓ Training completed in {training_time/60:.2f} minutes")
print(f"  Best validation accuracy: {best_val_acc:.2f}%")
# Load best model
resnet50_frozen.load_state_dict(torch.load('resnet50_frozen_best.pth'))


### 📝 Implementation Part 3

**Purpose:** Evaluate the best checkpoint on the test set, print the classification report and weighted metrics, and plot the training curves.

In [None]:
# ========================================
# Test Evaluation
# ========================================
print("\n" + "="*70)
print("TEST SET EVALUATION")
print("="*70)
# Evaluate on test set
test_loss, test_acc = validate_epoch(resnet50_frozen, test_loader, criterion, device)
print(f"\nTest Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_acc:.2f}%")
# Detailed metrics
resnet50_frozen.eval()
y_true = []
y_pred = []
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs = inputs.to(device)
        outputs = resnet50_frozen(inputs)
        _, predicted = outputs.max(1)
        
        y_true.extend(labels.cpu().numpy())
        y_pred.extend(predicted.cpu().numpy())
# Classification report
print("\n" + "-"*70)
print("CLASSIFICATION REPORT")
print("-"*70)
print(classification_report(y_true, y_pred, target_names=[f'Class {i}' for i in range(20)], digits=4))
# Precision, Recall, F1 (weighted)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
print(f"\nWeighted Metrics:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1-Score:  {f1:.4f}")
# ========================================
# Visualize Training Curves
# ========================================
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Loss curves
axes[0].plot(train_losses, label='Train Loss', linewidth=2)
axes[0].plot(val_losses, label='Val Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Strategy 1: Training & Validation Loss', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# Accuracy curves
axes[1].plot(train_accs, label='Train Acc', linewidth=2)
axes[1].plot(val_accs, label='Val Acc', linewidth=2)
axes[1].axhline(y=test_acc, color='red', linestyle='--', label=f'Test Acc ({test_acc:.2f}%)', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy (%)', fontsize=12)
axes[1].set_title('Strategy 1: Training & Validation Accuracy', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('strategy1_training_curves.png', dpi=150, bbox_inches='tight')
print("\n✓ Saved training curves to 'strategy1_training_curves.png'")
plt.show()


### 📝 Implementation Part 4

**Purpose:** Plot the test-set confusion matrix and print a summary of Strategy 1.

In [None]:
# ========================================
# Confusion Matrix
# ========================================
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True,
            xticklabels=[f'C{i}' for i in range(20)],
            yticklabels=[f'C{i}' for i in range(20)])
plt.xlabel('Predicted Class', fontsize=12)
plt.ylabel('True Class', fontsize=12)
plt.title('Strategy 1: Confusion Matrix (Test Set)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('strategy1_confusion_matrix.png', dpi=150, bbox_inches='tight')
print("✓ Saved confusion matrix to 'strategy1_confusion_matrix.png'")
plt.show()
# ========================================
# Summary Statistics
# ========================================
print("\n" + "="*70)
print("STRATEGY 1 SUMMARY")
print("="*70)
print(f"Model: ResNet-50 (Feature Extraction)")
print(f"  Total parameters:     {total_params/1e6:.2f}M")
print(f"  Trainable parameters: {trainable_params/1e6:.2f}M ({trainable_params/total_params*100:.2f}%)")
print(f"  Training time:        {training_time/60:.2f} minutes")
print(f"  Best val accuracy:    {best_val_acc:.2f}%")
print(f"  Test accuracy:        {test_acc:.2f}%")
print(f"  Test F1-score:        {f1:.4f}")
print("="*70)
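
Weighted averages are dominated by the large classes, so on an imbalanced dataset it can help to read per-class recall directly off the confusion matrix. Below is a minimal NumPy sketch with a made-up 3-class matrix (rows = true class, columns = predicted class, the convention used by `confusion_matrix`); `per_class_recall` is a hypothetical helper.

```python
import numpy as np

def per_class_recall(cm):
    """Recall per class: diagonal (correct) divided by row sum (all true samples)."""
    cm = np.asarray(cm, dtype=np.float64)
    row_sums = cm.sum(axis=1)
    # Guard against classes with no true samples in the evaluated split
    return np.divide(np.diag(cm), row_sums, out=np.zeros_like(row_sums), where=row_sums > 0)

# Illustrative matrix: class 2 is rare and often confused with class 0
cm_example = [[180, 15, 5],
              [10, 85, 5],
              [12, 3, 15]]
recall = per_class_recall(cm_example)
macro_recall = recall.mean()
weighted_recall = (recall * np.sum(cm_example, axis=1)).sum() / np.sum(cm_example)
print(recall, macro_recall, weighted_recall)
```

In this toy case the weighted recall stays near 0.85 while the rare class is recovered only half the time, which is the kind of gap a macro average or the per-class view makes visible.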


# 🔥 Part 4: Strategy 2 - Full Fine-Tuning (Discriminative Learning Rates)

## 📝 What's Happening in This Code?

**Purpose:** Unfreeze **all layers** of ResNet-50 and train with **discriminative learning rates** (early layers learn slower, late layers learn faster).

**Key Points:**
- **Unfreeze backbone:** All ~24.6M parameters trainable (vs ~1.06M in Strategy 1)
- **Discriminative LR:** The earliest layers use LR=1e-7, the classifier uses LR=1e-3 (10,000× difference)
- **Prevents catastrophic forgetting:** Low LR in early layers preserves ImageNet edge/texture detectors
- **Adapts to domain:** High LR in late layers adjusts to semiconductor-specific patterns

**Strategy 2 Architecture:**
```
Input (224×224×3)
    ↓
[TRAINABLE - LR=1e-7] Stem (Conv1 + MaxPool)
    ↓
[TRAINABLE - LR=1e-6] Layer1 (Residual blocks 1-3)
    ↓
[TRAINABLE - LR=1e-5] Layer2 (Residual blocks 4-7)
    ↓
[TRAINABLE - LR=1e-4] Layer3 (Residual blocks 8-13)
    ↓
[TRAINABLE - LR≈3e-4] Layer4 (Residual blocks 14-16)
    ↓
Global Average Pooling → 2048-dim feature vector
    ↓
[TRAINABLE - LR=1e-3] Classifier (2048 → 512 → 20)
    ↓
Output: 20 defect classes
```

**Discriminative LR Formula:**
$$
\eta_{\text{layer } i} = \eta_{\text{base}} \times \alpha^{L-i}
$$

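where $\eta_{\text{base}} = 10^{-3}$, $\alpha = 0.1$, $L = 5$. The implementation assigns $i = 1$ to the stem (conv1), $i = 2, 3, 4$ to layer1 through layer3, and $i = 5$ to the classifier. Layer4 sits at the half step $i = 4.5$, so it trains faster than layer3 but slower than the classifier.

**Example:**
- conv1 (stem): $\eta = 10^{-3} \times 0.1^{4} = 10^{-7}$ (nearly frozen)
- layer1: $\eta = 10^{-3} \times 0.1^{3} = 10^{-6}$
- layer2: $\eta = 10^{-3} \times 0.1^{2} = 10^{-5}$
- layer3: $\eta = 10^{-3} \times 0.1^{1} = 10^{-4}$
- layer4: $\eta = 10^{-3} \times 0.1^{0.5} \approx 3.2 \times 10^{-4}$
- Classifier: $\eta = 10^{-3} \times 0.1^{0} = 10^{-3}$ (full LR)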
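
The geometric rule can be checked numerically. The sketch below uses a hypothetical helper, `discriminative_lrs`, that takes one decay exponent per parameter group. The exponents listed mirror the ones in the implementation cell for this strategy, where layer4 uses a half step (0.5) rather than a whole power.

```python
def discriminative_lrs(base_lr, decay, exponents):
    """Return {group: base_lr * decay**exponent} for each named parameter group."""
    return {name: base_lr * decay ** k for name, k in exponents.items()}

# Exponents per group, from the input side (largest) to the classifier (zero)
exponents = {"conv1": 4, "layer1": 3, "layer2": 2, "layer3": 1, "layer4": 0.5, "fc": 0}

lrs = discriminative_lrs(base_lr=1e-3, decay=0.1, exponents=exponents)
for name, lr in lrs.items():
    print(f"{name:8s} LR = {lr:.2e}")
```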

**Training Details:**
- **Optimizer:** Adam with parameter groups (each layer group = different LR)
- **Scheduler:** Cosine annealing with warm-up (2 epochs warm-up, then cosine decay)
- **Epochs:** 20 (needs more training since all layers update)
- **Expected improvement:** +3-5% accuracy vs Strategy 1
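
The warm-up plus cosine schedule can be written as a multiplier on each group's base learning rate. The function below is a simplified closed-form sketch, not the PyTorch scheduler itself: it assumes a linear warm-up from 0.1× to 1.0× over the warm-up epochs and ignores any small minimum-LR floor. `lr_multiplier` is a hypothetical name.

```python
import math

def lr_multiplier(epoch, warmup_epochs=2, total_epochs=20, start_factor=0.1):
    """Multiplier applied to the base LR at a given 0-indexed epoch."""
    if epoch < warmup_epochs:
        # Linear ramp from start_factor toward 1.0 over the warm-up epochs
        return start_factor + (1.0 - start_factor) * epoch / warmup_epochs
    # Cosine decay from 1.0 toward 0 over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [lr_multiplier(e) for e in range(20)]
print([round(m, 3) for m in schedule[:4]])
```

Because the multiplier is shared, every parameter group keeps its relative learning rate throughout training; only the overall scale changes from epoch to epoch.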

**Why This Matters:**
- **Maximizes accuracy:** Full model adaptation to semiconductor domain
- **Production standard:** Industry best practice for transfer learning
- **Handles domain shift:** Early layers preserve universal features, late layers specialize

---

## 🔧 Implementation: Full Fine-Tuning with Discriminative LR

### 📝 Implementation

**Purpose:** Load a fresh pre-trained ResNet-50, attach the same head, and configure per-layer learning rates with a warm-up plus cosine annealing schedule.

In [None]:
# ========================================
# Strategy 2: Full Fine-Tuning (Discriminative LR)
# ========================================
print("\n" + "="*70)
print("STRATEGY 2: FULL FINE-TUNING (DISCRIMINATIVE LEARNING RATES)")
print("="*70)
# Load pre-trained ResNet-50 (fresh copy; torchvision >= 0.13 API, older versions use pretrained=True)
resnet50_finetune = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
# Replace final classifier (same as Strategy 1)
num_features = resnet50_finetune.fc.in_features
resnet50_finetune.fc = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(num_features, 512),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(512, 20)
)
# Move to GPU
resnet50_finetune = resnet50_finetune.to(device)
# Count parameters (ALL trainable now)
total_params_ft = sum(p.numel() for p in resnet50_finetune.parameters())
trainable_params_ft = sum(p.numel() for p in resnet50_finetune.parameters() if p.requires_grad)
print(f"\nModel: ResNet-50 (Full Fine-Tuning)")
print(f"  Total parameters:     {total_params_ft:,} ({total_params_ft/1e6:.2f}M)")
print(f"  Trainable parameters: {trainable_params_ft:,} ({trainable_params_ft/1e6:.2f}M)")
print(f"  % Trainable:          {trainable_params_ft/total_params_ft*100:.2f}%")
# ========================================
# Discriminative Learning Rates Setup
# ========================================
# Define parameter groups with different LRs
base_lr = 1e-3
lr_decay = 0.1
# conv1 and bn1 form the stem; bn1 must belong to a group or the optimizer never updates it
stem_params = list(resnet50_finetune.conv1.parameters()) + list(resnet50_finetune.bn1.parameters())
param_groups = [
    {'params': stem_params, 'lr': base_lr * (lr_decay ** 4)},                                  # LR = 1e-7
    {'params': resnet50_finetune.layer1.parameters(), 'lr': base_lr * (lr_decay ** 3)},        # LR = 1e-6
    {'params': resnet50_finetune.layer2.parameters(), 'lr': base_lr * (lr_decay ** 2)},        # LR = 1e-5
    {'params': resnet50_finetune.layer3.parameters(), 'lr': base_lr * lr_decay},               # LR = 1e-4
    {'params': resnet50_finetune.layer4.parameters(), 'lr': base_lr * (lr_decay ** 0.5)},      # LR ≈ 3.2e-4 (half step)
    {'params': resnet50_finetune.fc.parameters(), 'lr': base_lr}                               # LR = 1e-3
]
print(f"\nDiscriminative Learning Rates:")
layer_names = ['stem (conv1+bn1)', 'layer1', 'layer2', 'layer3', 'layer4', 'fc (classifier)']
for i, (layer_name, pg) in enumerate(zip(layer_names, param_groups)):
    print(f"  {layer_name:20s}: LR = {pg['lr']:.2e}")
# Optimizer with parameter groups
optimizer_ft = torch.optim.Adam(param_groups, weight_decay=1e-4)
# Cosine annealing scheduler with warm-up
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
warmup_epochs = 2
total_epochs_ft = 20
warmup_scheduler = LinearLR(optimizer_ft, start_factor=0.1, end_factor=1.0, total_iters=warmup_epochs)
cosine_scheduler = CosineAnnealingLR(optimizer_ft, T_max=total_epochs_ft - warmup_epochs, eta_min=1e-7)
scheduler_ft = SequentialLR(optimizer_ft, schedulers=[warmup_scheduler, cosine_scheduler], milestones=[warmup_epochs])
print(f"\nLearning Rate Scheduler:")
print(f"  Warm-up: Epochs 1-{warmup_epochs} (linear 0.1× → 1.0×)")
print(f"  Cosine annealing: Epochs {warmup_epochs+1}-{total_epochs_ft} (→ 1e-7)")
# Loss function (reuse from Strategy 1)
criterion_ft = nn.CrossEntropyLoss(weight=class_weights)


### 📝 Implementation Part 2

**Purpose:** Train all layers with early stopping, then evaluate the best checkpoint on the test set.

In [None]:
# ========================================
# Training Loop (Strategy 2)
# ========================================
best_val_acc_ft = 0.0
patience_ft = 5
patience_counter_ft = 0
train_losses_ft, train_accs_ft = [], []
val_losses_ft, val_accs_ft = [], []
print(f"\nTraining for {total_epochs_ft} epochs...")
start_time_ft = time.time()
for epoch in range(total_epochs_ft):
    print(f"\nEpoch {epoch+1}/{total_epochs_ft}")
    print("-" * 70)
    
    # Train
    train_loss_ft, train_acc_ft = train_epoch(resnet50_finetune, train_loader, criterion_ft, optimizer_ft, device)
    
    # Validate
    val_loss_ft, val_acc_ft = validate_epoch(resnet50_finetune, val_loader, criterion_ft, device)
    
    # Learning rate scheduling
    scheduler_ft.step()
    
    # Save metrics
    train_losses_ft.append(train_loss_ft)
    train_accs_ft.append(train_acc_ft)
    val_losses_ft.append(val_loss_ft)
    val_accs_ft.append(val_acc_ft)
    
    # Get current LRs for each group
    current_lrs = [pg['lr'] for pg in optimizer_ft.param_groups]
    
    print(f"\n  Train Loss: {train_loss_ft:.4f}, Train Acc: {train_acc_ft:.2f}%")
    print(f"  Val Loss:   {val_loss_ft:.4f}, Val Acc:   {val_acc_ft:.2f}%")
    print(f"  LR Range:   {min(current_lrs):.2e} - {max(current_lrs):.2e}")
    
    # Early stopping
    if val_acc_ft > best_val_acc_ft:
        best_val_acc_ft = val_acc_ft
        patience_counter_ft = 0
        torch.save(resnet50_finetune.state_dict(), 'resnet50_finetune_best.pth')
        print(f"  ✓ New best validation accuracy! Model saved.")
    else:
        patience_counter_ft += 1
        if patience_counter_ft >= patience_ft:
            print(f"\n  Early stopping triggered (patience={patience_ft})")
            break
training_time_ft = time.time() - start_time_ft
print(f"\n✓ Training completed in {training_time_ft/60:.2f} minutes")
print(f"  Best validation accuracy: {best_val_acc_ft:.2f}%")
# Load best model
resnet50_finetune.load_state_dict(torch.load('resnet50_finetune_best.pth'))
# ========================================
# Test Evaluation (Strategy 2)
# ========================================
print("\n" + "="*70)
print("TEST SET EVALUATION (STRATEGY 2)")
print("="*70)
test_loss_ft, test_acc_ft = validate_epoch(resnet50_finetune, test_loader, criterion_ft, device)
print(f"\nTest Loss: {test_loss_ft:.4f}")
print(f"Test Accuracy: {test_acc_ft:.2f}%")
# Detailed metrics
resnet50_finetune.eval()
y_true_ft = []
y_pred_ft = []
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs = inputs.to(device)
        outputs = resnet50_finetune(inputs)
        _, predicted = outputs.max(1)
        
        y_true_ft.extend(labels.cpu().numpy())
        y_pred_ft.extend(predicted.cpu().numpy())
precision_ft, recall_ft, f1_ft, _ = precision_recall_fscore_support(y_true_ft, y_pred_ft, average='weighted')
print(f"\nWeighted Metrics:")
print(f"  Precision: {precision_ft:.4f}")
print(f"  Recall:    {recall_ft:.4f}")
print(f"  F1-Score:  {f1_ft:.4f}")


### 📝 Implementation Part 3

**Purpose:** Plot the Strategy 2 training curves and compare Strategy 1 and Strategy 2 side by side.

In [None]:
# ========================================
# Visualize Training Curves (Strategy 2)
# ========================================
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(train_losses_ft, label='Train Loss', linewidth=2)
axes[0].plot(val_losses_ft, label='Val Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Strategy 2: Training & Validation Loss', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
axes[1].plot(train_accs_ft, label='Train Acc', linewidth=2)
axes[1].plot(val_accs_ft, label='Val Acc', linewidth=2)
axes[1].axhline(y=test_acc_ft, color='red', linestyle='--', label=f'Test Acc ({test_acc_ft:.2f}%)', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy (%)', fontsize=12)
axes[1].set_title('Strategy 2: Training & Validation Accuracy', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('strategy2_training_curves.png', dpi=150, bbox_inches='tight')
print("\n✓ Saved training curves to 'strategy2_training_curves.png'")
plt.show()
# ========================================
# Comparison: Strategy 1 vs Strategy 2
# ========================================
print("\n" + "="*70)
print("COMPARISON: STRATEGY 1 vs STRATEGY 2")
print("="*70)
comparison_data = {
    'Metric': [
        'Trainable Params (M)',
        'Training Time (min)',
        'Best Val Acc (%)',
        'Test Acc (%)',
        'Test F1-Score',
        'Overfitting Gap (Train-Val %)'
    ],
    'Strategy 1 (Frozen)': [
        f'{trainable_params/1e6:.2f}',
        f'{training_time/60:.2f}',
        f'{best_val_acc:.2f}',
        f'{test_acc:.2f}',
        f'{f1:.4f}',
        f'{train_accs[-1] - val_accs[-1]:.2f}'
    ],
    'Strategy 2 (Fine-Tune)': [
        f'{trainable_params_ft/1e6:.2f}',
        f'{training_time_ft/60:.2f}',
        f'{best_val_acc_ft:.2f}',
        f'{test_acc_ft:.2f}',
        f'{f1_ft:.4f}',
        f'{train_accs_ft[-1] - val_accs_ft[-1]:.2f}'
    ]
}
import pandas as pd
df_comparison = pd.DataFrame(comparison_data)
print(df_comparison.to_string(index=False))
print(f"\n📊 Key Observations:")
print(f"  • Fine-tuning changes test accuracy by {test_acc_ft - test_acc:+.2f} percentage points")
print(f"  • Training time changes by {(training_time_ft - training_time)/60:+.2f} min")
print(f"  • Fine-tuning uses {trainable_params_ft/trainable_params:.1f}× more trainable parameters")
if test_acc_ft > test_acc:
    print(f"  ✓ Strategy 2 wins on accuracy!")
else:
    print(f"  ✓ Strategy 1 matches or beats Strategy 2 on accuracy with far fewer trainable parameters")
print("="*70)


# 🎯 Part 5: Strategy 3 - Gradual Unfreezing & EfficientNet Comparison

## 📝 What's Happening in This Code?

**Purpose:** Implement a widely used transfer-learning recipe: **gradual unfreezing**, applied to EfficientNet-B3 (a parameter-efficient architecture).

**Key Points:**
- **Phase 1 (Epochs 1-5):** Train the classifier only (feature-extraction baseline)
- **Phase 2 (Epochs 6-10):** Unfreeze the top backbone blocks so late features adapt to semiconductor patterns
- **Phase 3 (Epochs 11-15):** Unfreeze all layers and fine-tune end to end with discriminative learning rates
- **EfficientNet-B3:** About 2× fewer parameters than ResNet-50 (12M vs 25.6M) and roughly 5 percentage points higher ImageNet Top-1 accuracy

**Strategy 3: Progressive Layer Unfreezing**

```
Phase 1 (Epochs 1-5): Feature Extraction
    [FROZEN] Backbone → [TRAINABLE] Classifier
    Goal: Train classifier to convergence
    LR: 1e-3 for classifier

Phase 2 (Epochs 6-10): Unfreeze Top Blocks
    [FROZEN] Early and middle blocks → [TRAINABLE] Top blocks + Classifier
    Goal: Adapt high-level features to wafer maps
    LR: 1e-4 for top blocks, 1e-3 for classifier

Phase 3 (Epochs 11-15): Full Fine-Tuning
    [TRAINABLE] All layers
    Goal: End-to-end optimization for semiconductor domain
    LR: Discriminative (1e-6 for earliest blocks → 1e-3 for classifier)
```
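The three-phase schedule above can be written as a small lookup from parameter name to learning rate. The sketch below is illustrative only: `PHASES`, `plan_for_phase`, and the example tensor names are assumptions that follow timm-style EfficientNet naming (`conv_stem`, `blocks.0` to `blocks.6`, `conv_head`, `classifier`), and the rates are the ones listed in the diagram.

```python
# Illustrative sketch: decide, per phase, which parameters train and at what LR.
# PHASES, plan_for_phase and the example names are assumed, not notebook code.

PHASES = {
    1: [(("classifier",), 1e-3)],
    2: [(("blocks.6", "conv_head", "bn2"), 1e-4), (("classifier",), 1e-3)],
    3: [
        (("conv_stem", "bn1", "blocks.0", "blocks.1"), 1e-6),
        (("blocks.2", "blocks.3"), 5e-6),
        (("blocks.4", "blocks.5"), 1e-5),
        (("blocks.6", "conv_head", "bn2"), 1e-4),
        (("classifier",), 1e-3),
    ],
}


def plan_for_phase(param_names, phase):
    """Return {name: lr or None}; None means the parameter stays frozen."""
    plan = {}
    for name in param_names:
        plan[name] = None
        for prefixes, lr in PHASES[phase]:
            if name.startswith(prefixes):  # str.startswith accepts a tuple
                plan[name] = lr
                break
    return plan


example_names = [
    "conv_stem.weight",
    "blocks.0.0.conv_dw.weight",
    "blocks.6.0.conv_pw.weight",
    "conv_head.weight",
    "classifier.weight",
]

for phase in (1, 2, 3):
    plan = plan_for_phase(example_names, phase)
    trainable = [n for n, lr in plan.items() if lr is not None]
    print(f"Phase {phase}: {len(trainable)} of {len(plan)} example tensors trainable")
```

With a real model, the same plan could drive both `requires_grad` and the optimizer's parameter groups, which keeps the two from drifting apart.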

**Why Gradual Unfreezing Works:**
- **Prevents catastrophic forgetting:** Early layers preserve ImageNet features (edges, textures)
- **Stable training:** Classifier converges first, then backbone adapts gradually
- **Best accuracy:** Balances feature extraction (Strategy 1) + full fine-tuning (Strategy 2)

**EfficientNet-B3 Advantages:**
- **Compound scaling:** Scales depth, width, and resolution together using a single coefficient
- **Fewer parameters:** 12M vs ResNet-50's 25.6M (about 2× smaller)
- **Higher accuracy:** ImageNet Top-1 of 81.6% vs 76.2% (5.4 percentage points better)
- **Cheaper inference:** Fewer FLOPs than ResNet-50; real latency depends on the hardware and runtime, so benchmark on the target device
- **Semiconductor use case:** **Recommended for production deployment**

**Expected Results:**
- **Strategy 3 (Gradual):** 93-94% accuracy, best generalization
- **EfficientNet-B3:** 94-95% accuracy, fastest inference, smallest model size

---

## 🔧 Implementation: Gradual Unfreezing + EfficientNet-B3

### 📝 Implementation

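**Purpose:** Load a pre-trained EfficientNet-B3 from `timm`, freeze the backbone, and train only the classifier head (Phase 1).

**Key Points:**
- `timm.create_model(..., num_classes=20)` replaces the ImageNet head with a freshly initialized 20-class classifier
- Only parameters whose names contain `classifier` stay trainable in this phase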

In [None]:
# ========================================
# Strategy 3: Gradual Unfreezing with EfficientNet-B3
# ========================================
print("\n" + "="*70)
print("STRATEGY 3: GRADUAL UNFREEZING with EfficientNet-B3")
print("="*70)
# Install timm if not available (PyTorch Image Models - 700+ pre-trained models)
try:
    import timm
except ImportError:
    print("Installing timm (PyTorch Image Models)...")
    import subprocess
    import sys
    # Use the current interpreter's pip so the package lands in this environment
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'timm', '-q'])
    import timm
print(f"timm version: {timm.__version__}")
# Load pre-trained EfficientNet-B3
efficientnet = timm.create_model('efficientnet_b3', pretrained=True, num_classes=20)
# Move to GPU
efficientnet = efficientnet.to(device)
# Model info
total_params_eff = sum(p.numel() for p in efficientnet.parameters())
print(f"\nModel: EfficientNet-B3")
print(f"  Total parameters: {total_params_eff:,} ({total_params_eff/1e6:.2f}M)")
print(f"  Input size used here: 224×224×3 (EfficientNet-B3 was designed for 300×300 inputs)")
print(f"  ImageNet Top-1 accuracy: 81.6%")
# ========================================
# Phase 1: Feature Extraction (Freeze Backbone)
# ========================================
print("\n" + "="*70)
print("PHASE 1: FEATURE EXTRACTION (Epochs 1-5)")
print("="*70)
# Freeze all parameters except classifier
for name, param in efficientnet.named_parameters():
    if 'classifier' not in name:  # EfficientNet uses 'classifier' instead of 'fc'
        param.requires_grad = False
trainable_phase1 = sum(p.numel() for p in efficientnet.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_phase1:,} ({trainable_phase1/total_params_eff*100:.2f}%)")
# Optimizer (classifier only)
optimizer_phase1 = torch.optim.Adam(
    filter(lambda p: p.requires_grad, efficientnet.parameters()),
    lr=1e-3,
    weight_decay=1e-4
)
# Loss function
criterion_eff = nn.CrossEntropyLoss(weight=class_weights)
# Train Phase 1
phase1_epochs = 5
train_losses_phase1, train_accs_phase1 = [], []
val_losses_phase1, val_accs_phase1 = [], []
start_time_phase1 = time.time()
for epoch in range(phase1_epochs):
    print(f"\nPhase 1 - Epoch {epoch+1}/{phase1_epochs}")
    print("-" * 70)
    
    train_loss_p1, train_acc_p1 = train_epoch(efficientnet, train_loader, criterion_eff, optimizer_phase1, device)
    val_loss_p1, val_acc_p1 = validate_epoch(efficientnet, val_loader, criterion_eff, device)
    
    train_losses_phase1.append(train_loss_p1)
    train_accs_phase1.append(train_acc_p1)
    val_losses_phase1.append(val_loss_p1)
    val_accs_phase1.append(val_acc_p1)
    
    print(f"  Train Loss: {train_loss_p1:.4f}, Train Acc: {train_acc_p1:.2f}%")
    print(f"  Val Loss:   {val_loss_p1:.4f}, Val Acc:   {val_acc_p1:.2f}%")
time_phase1 = time.time() - start_time_phase1
print(f"\n✓ Phase 1 completed in {time_phase1/60:.2f} minutes")
print(f"  Val accuracy: {val_accs_phase1[-1]:.2f}%")


### 📝 Implementation Part 2

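**Purpose:** Run Phase 2 (unfreeze the top of the backbone) and Phase 3 (full fine-tuning with discriminative learning rates), then reload the best checkpoint.

**Key Points:**
- Each phase builds a new optimizer so that newly unfrozen parameters get their own learning rate
- The best Phase 3 checkpoint (by validation accuracy) is saved and restored before testing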

In [None]:
# ========================================
# Phase 2: Unfreeze Top of Backbone (blocks.6 + conv_head)
# ========================================
print("\n" + "="*70)
print("PHASE 2: UNFREEZE TOP BLOCKS (Epochs 6-10)")
print("="*70)
# timm's efficientnet_b3 has seven stages, blocks.0 to blocks.6 (there is no blocks.7),
# followed by conv_head/bn2 and the classifier. Unfreeze the last stage and the head conv.
top_prefixes = ('blocks.6', 'conv_head', 'bn2')
for name, param in efficientnet.named_parameters():
    if name.startswith(top_prefixes) or name.startswith('classifier'):
        param.requires_grad = True
trainable_phase2 = sum(p.numel() for p in efficientnet.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_phase2:,} ({trainable_phase2/total_params_eff*100:.2f}%)")
# Optimizer with discriminative LR
optimizer_phase2 = torch.optim.Adam([
    {'params': [p for n, p in efficientnet.named_parameters() if n.startswith(top_prefixes)], 'lr': 1e-4},
    {'params': [p for n, p in efficientnet.named_parameters() if n.startswith('classifier')], 'lr': 1e-3}
], weight_decay=1e-4)
print(f"Learning rates:")
print(f"  Top blocks (blocks.6, conv_head): 1e-4")
print(f"  Classifier: 1e-3")
# Train Phase 2
phase2_epochs = 5
train_losses_phase2, train_accs_phase2 = [], []
val_losses_phase2, val_accs_phase2 = [], []
start_time_phase2 = time.time()
for epoch in range(phase2_epochs):
    print(f"\nPhase 2 - Epoch {epoch+1}/{phase2_epochs}")
    print("-" * 70)
    
    train_loss_p2, train_acc_p2 = train_epoch(efficientnet, train_loader, criterion_eff, optimizer_phase2, device)
    val_loss_p2, val_acc_p2 = validate_epoch(efficientnet, val_loader, criterion_eff, device)
    
    train_losses_phase2.append(train_loss_p2)
    train_accs_phase2.append(train_acc_p2)
    val_losses_phase2.append(val_loss_p2)
    val_accs_phase2.append(val_acc_p2)
    
    print(f"  Train Loss: {train_loss_p2:.4f}, Train Acc: {train_acc_p2:.2f}%")
    print(f"  Val Loss:   {val_loss_p2:.4f}, Val Acc:   {val_acc_p2:.2f}%")
time_phase2 = time.time() - start_time_phase2
print(f"\n✓ Phase 2 completed in {time_phase2/60:.2f} minutes")
print(f"  Val accuracy: {val_accs_phase2[-1]:.2f}%")
print(f"  Change vs Phase 1: {val_accs_phase2[-1] - val_accs_phase1[-1]:+.2f} percentage points")
# ========================================
# Phase 3: Full Fine-Tuning (Unfreeze All)
# ========================================
print("\n" + "="*70)
print("PHASE 3: FULL FINE-TUNING (Epochs 11-15)")
print("="*70)
# Unfreeze all layers
for param in efficientnet.parameters():
    param.requires_grad = True
trainable_phase3 = sum(p.numel() for p in efficientnet.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_phase3:,} ({trainable_phase3/total_params_eff*100:.2f}%)")
# Optimizer with full discriminative LR.
# Every trainable parameter must land in exactly one group, otherwise it is never updated.
# timm's efficientnet_b3 exposes conv_stem/bn1, blocks.0-blocks.6, conv_head/bn2 and classifier.
lr_groups_phase3 = [
    (('conv_stem', 'bn1', 'blocks.0', 'blocks.1'), 1e-6),
    (('blocks.2', 'blocks.3'), 5e-6),
    (('blocks.4', 'blocks.5'), 1e-5),
    (('blocks.6', 'conv_head', 'bn2'), 1e-4),
    (('classifier',), 1e-3),
]
optimizer_phase3 = torch.optim.Adam([
    {'params': [p for n, p in efficientnet.named_parameters() if n.startswith(prefixes)], 'lr': lr}
    for prefixes, lr in lr_groups_phase3
], weight_decay=1e-4)
print(f"Discriminative learning rates:")
print(f"  Stem + blocks 0-1 (early):       1e-6")
print(f"  Blocks 2-3:                      5e-6")
print(f"  Blocks 4-5:                      1e-5")
print(f"  Block 6 + conv_head (late):      1e-4")
print(f"  Classifier:                      1e-3")
# Train Phase 3
phase3_epochs = 5
train_losses_phase3, train_accs_phase3 = [], []
val_losses_phase3, val_accs_phase3 = [], []
best_val_acc_eff = 0.0
start_time_phase3 = time.time()
for epoch in range(phase3_epochs):
    print(f"\nPhase 3 - Epoch {epoch+1}/{phase3_epochs}")
    print("-" * 70)
    
    train_loss_p3, train_acc_p3 = train_epoch(efficientnet, train_loader, criterion_eff, optimizer_phase3, device)
    val_loss_p3, val_acc_p3 = validate_epoch(efficientnet, val_loader, criterion_eff, device)
    
    train_losses_phase3.append(train_loss_p3)
    train_accs_phase3.append(train_acc_p3)
    val_losses_phase3.append(val_loss_p3)
    val_accs_phase3.append(val_acc_p3)
    
    print(f"  Train Loss: {train_loss_p3:.4f}, Train Acc: {train_acc_p3:.2f}%")
    print(f"  Val Loss:   {val_loss_p3:.4f}, Val Acc:   {val_acc_p3:.2f}%")
    
    # Save best model
    if val_acc_p3 > best_val_acc_eff:
        best_val_acc_eff = val_acc_p3
        torch.save(efficientnet.state_dict(), 'efficientnet_gradual_best.pth')
        print(f"  ✓ New best model saved!")
time_phase3 = time.time() - start_time_phase3
total_training_time_eff = time_phase1 + time_phase2 + time_phase3
print(f"\n✓ Phase 3 completed in {time_phase3/60:.2f} minutes")
print(f"  Val accuracy: {val_accs_phase3[-1]:.2f}%")
print(f"  Change vs Phase 2: {val_accs_phase3[-1] - val_accs_phase2[-1]:+.2f} percentage points")
print(f"\n✓ Total training time (all 3 phases): {total_training_time_eff/60:.2f} minutes")
print(f"  Best validation accuracy: {best_val_acc_eff:.2f}%")
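# Load best model (map_location keeps this working if the checkpoint is reloaded on another device)
efficientnet.load_state_dict(torch.load('efficientnet_gradual_best.pth', map_location=device))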
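
When an optimizer is built from name-filtered parameter groups, as in Phases 2 and 3, any parameter that matches no filter is silently left out and never updated, even if its `requires_grad` flag is `True`. The helper below is a hypothetical sanity check (`find_uncovered` is not part of PyTorch or timm). It works on plain name strings, and the example names are assumed to resemble timm's EfficientNet layout.

```python
# Hypothetical helper: check that name-based optimizer groups cover every parameter.

def find_uncovered(param_names, group_prefixes):
    """Return (missing, duplicated): names in no group, and names in several groups."""
    missing, duplicated = [], []
    for name in param_names:
        # str.startswith accepts a tuple of prefixes and returns a bool
        hits = sum(name.startswith(prefixes) for prefixes in group_prefixes)
        if hits == 0:
            missing.append(name)
        elif hits > 1:
            duplicated.append(name)
    return missing, duplicated


# Example names (assumed layout); a real list would come from named_parameters()
names = [
    "conv_stem.weight",
    "bn1.weight",
    "blocks.0.0.conv_dw.weight",
    "blocks.6.0.conv_pw.weight",
    "conv_head.weight",
    "bn2.bias",
    "classifier.bias",
]

# Groups that only look at 'blocks.*' and 'classifier'
groups = [
    ("blocks.0", "blocks.1"),
    ("blocks.2", "blocks.3"),
    ("blocks.4", "blocks.5"),
    ("blocks.6", "blocks.7"),
    ("classifier",),
]

missing, duplicated = find_uncovered(names, groups)
print("Not in any optimizer group:", missing)
print("In more than one group:", duplicated)
```

With a real model, the name list could be built as `[n for n, p in model.named_parameters() if p.requires_grad]` before the optimizer is constructed.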


### 📝 Implementation Part 3

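**Purpose:** Evaluate the fine-tuned EfficientNet-B3 on the test set and plot loss and accuracy across all three phases.

**Key Points:**
- Weighted precision, recall, and F1 are computed from the test predictions
- Vertical lines on the plots mark the Phase 1→2 and Phase 2→3 transitions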

In [None]:
# ========================================
# Test Evaluation (Strategy 3)
# ========================================
print("\n" + "="*70)
print("TEST SET EVALUATION (STRATEGY 3 - EfficientNet-B3)")
print("="*70)
test_loss_eff, test_acc_eff = validate_epoch(efficientnet, test_loader, criterion_eff, device)
print(f"\nTest Loss: {test_loss_eff:.4f}")
print(f"Test Accuracy: {test_acc_eff:.2f}%")
# Detailed metrics
efficientnet.eval()
y_true_eff = []
y_pred_eff = []
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs = inputs.to(device)
        outputs = efficientnet(inputs)
        _, predicted = outputs.max(1)
        
        y_true_eff.extend(labels.cpu().numpy())
        y_pred_eff.extend(predicted.cpu().numpy())
precision_eff, recall_eff, f1_eff, _ = precision_recall_fscore_support(y_true_eff, y_pred_eff, average='weighted')
print(f"\nWeighted Metrics:")
print(f"  Precision: {precision_eff:.4f}")
print(f"  Recall:    {recall_eff:.4f}")
print(f"  F1-Score:  {f1_eff:.4f}")
# ========================================
# Visualize 3-Phase Training
# ========================================
# Combine all phases
all_train_losses = train_losses_phase1 + train_losses_phase2 + train_losses_phase3
all_train_accs = train_accs_phase1 + train_accs_phase2 + train_accs_phase3
all_val_losses = val_losses_phase1 + val_losses_phase2 + val_losses_phase3
all_val_accs = val_accs_phase1 + val_accs_phase2 + val_accs_phase3
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Loss curves with phase markers
epochs_range = range(1, len(all_train_losses) + 1)
axes[0].plot(epochs_range, all_train_losses, label='Train Loss', linewidth=2)
axes[0].plot(epochs_range, all_val_losses, label='Val Loss', linewidth=2)
axes[0].axvline(x=5, color='red', linestyle='--', alpha=0.7, label='Phase 1→2')
axes[0].axvline(x=10, color='green', linestyle='--', alpha=0.7, label='Phase 2→3')
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Strategy 3: Gradual Unfreezing - Loss', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
# Accuracy curves with phase markers
axes[1].plot(epochs_range, all_train_accs, label='Train Acc', linewidth=2)
axes[1].plot(epochs_range, all_val_accs, label='Val Acc', linewidth=2)
axes[1].axvline(x=5, color='red', linestyle='--', alpha=0.7, label='Phase 1→2')
axes[1].axvline(x=10, color='green', linestyle='--', alpha=0.7, label='Phase 2→3')
axes[1].axhline(y=test_acc_eff, color='purple', linestyle='--', label=f'Test Acc ({test_acc_eff:.2f}%)', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy (%)', fontsize=12)
axes[1].set_title('Strategy 3: Gradual Unfreezing - Accuracy', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('strategy3_training_curves.png', dpi=150, bbox_inches='tight')
print("\n✓ Saved training curves to 'strategy3_training_curves.png'")
plt.show()


### 📝 Implementation Part 4

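**Purpose:** Compare the three strategies side by side on model size, training time, test accuracy, and F1-score.

**Key Points:**
- Strategy 3 reports all parameters as trainable because its final phase unfreezes the whole network
- Training time for Strategy 3 is the sum of all three phases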

In [None]:
# ========================================
# Final Comparison: All 3 Strategies
# ========================================
print("\n" + "="*70)
print("FINAL COMPARISON: ALL STRATEGIES")
print("="*70)
final_comparison = {
    'Metric': [
        'Model',
        'Total Params (M)',
        'Trainable Params (M)',
        'Training Time (min)',
        'Test Accuracy (%)',
        'Test F1-Score',
        'Training Strategy'
    ],
    'Strategy 1': [
        'ResNet-50',
        f'{total_params/1e6:.2f}',
        f'{trainable_params/1e6:.2f}',
        f'{training_time/60:.2f}',
        f'{test_acc:.2f}',
        f'{f1:.4f}',
        'Feature Extraction'
    ],
    'Strategy 2': [
        'ResNet-50',
        f'{total_params_ft/1e6:.2f}',
        f'{trainable_params_ft/1e6:.2f}',
        f'{training_time_ft/60:.2f}',
        f'{test_acc_ft:.2f}',
        f'{f1_ft:.4f}',
        'Full Fine-Tuning'
    ],
    'Strategy 3': [
        'EfficientNet-B3',
        f'{total_params_eff/1e6:.2f}',
        f'{total_params_eff/1e6:.2f}',
        f'{total_training_time_eff/60:.2f}',
        f'{test_acc_eff:.2f}',
        f'{f1_eff:.4f}',
        'Gradual Unfreezing'
    ]
}
df_final = pd.DataFrame(final_comparison)
print(df_final.to_string(index=False))
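# Rank strategies by measured test accuracy instead of assuming a winner
results = {
    'Strategy 1 (ResNet-50, Feature Extraction)': (test_acc, f1, total_params),
    'Strategy 2 (ResNet-50, Fine-Tuning)': (test_acc_ft, f1_ft, total_params_ft),
    'Strategy 3 (EfficientNet-B3, Gradual Unfreezing)': (test_acc_eff, f1_eff, total_params_eff),
}
winner = max(results, key=lambda name: results[name][0])
win_acc, win_f1, win_params = results[winner]
print(f"\n🏆 WINNER (by test accuracy): {winner}")
print(f"  ✓ Test accuracy: {win_acc:.2f}%")
print(f"  ✓ F1-score: {win_f1:.4f}")
print(f"  ✓ Model size: {win_params/1e6:.2f}M params")
print(f"  ✓ Strategy 3 training time: {total_training_time_eff/60:.2f} minutes")
print("\n📊 Key Insights:")
scratch_acc_assumed = 75.0  # Assumed from-scratch baseline; not measured in this notebook
print(f"  • Strategy 3 vs an assumed {scratch_acc_assumed:.0f}% from-scratch baseline: {test_acc_eff - scratch_acc_assumed:+.2f} percentage points")
print(f"  • EfficientNet-B3 vs ResNet-50 (Strategy 2): {test_acc_eff - test_acc_ft:+.2f} percentage points with {total_params_ft/total_params_eff:.1f}× fewer parameters")
print("  • Gradual unfreezing adapts the backbone in stages, which limits the risk of overwriting pre-trained features")
print("  • Discriminative LR keeps early-layer updates small to preserve pre-trained features")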
print("="*70)


# 🚀 Part 6: Real-World Projects & Production Deployment

## 📊 8 Real-World Transfer Learning Projects

---

### **🔬 Semiconductor Projects (Post-Silicon Validation)**

#### **Project 1: Production Wafer Yield Predictor with Multi-Site Data**

**Objective:** Train base model on Fab A data, transfer to Fabs B-F with minimal data collection

**Business Value:** $50M-$200M/year from cross-site yield optimization

**Architecture:**
- Base model: EfficientNet-B4 trained on 100K wafer maps from Fab A (primary production site)
- Transfer strategy: Gradual unfreezing with domain adaptation for Fabs B-F
- Input: 256×256 wafer maps (higher resolution for fine defect patterns)
- Output: Continuous yield% prediction (0-100%)

**Implementation Approach:**
```python
# Pseudocode: the train/collect/fine-tune/deploy helpers are placeholders
import copy
base_model = train_base_efficientnet_b4(fab_a_data_100k)  # 3 days training

for fab in [fab_b, fab_c, fab_d, fab_e, fab_f]:
    # Collect only 2K wafer maps per fab (vs 100K from scratch)
    small_dataset = collect_wafer_data(fab, n_samples=2000)
    
    # Transfer with gradual unfreezing
    fab_model = copy.deepcopy(base_model)
    fine_tune_gradual(fab_model, small_dataset, epochs=15)  # 4 hours
    
    deploy_to_production(fab, fab_model)
```

**Success Metrics:**
- **Accuracy:** R² ≥ 0.90 for yield prediction
- **Data efficiency:** 50× less data per fab (2K vs 100K wafer maps)
- **Training time:** 4 hours fine-tuning vs 3 days from scratch
- **Cost savings:** $10M-$40M per fab from optimized test flows

**Key Techniques:**
- Domain adaptation (handle fab-to-fab process variations)
- Multi-task learning (predict yield + top-3 defect patterns simultaneously)
- Active learning (prioritize labeling most informative wafer maps)

---

#### **Project 2: SEM Image Defect Classification with Few-Shot Learning**

**Objective:** Classify 50+ defect types from high-resolution SEM images with <100 samples per class

**Business Value:** $2M-$8M/year in faster root-cause analysis, reduce expert annotation cost by $100K-$500K

**Architecture:**
- Pre-trained: Vision Transformer (ViT-L/16) on ImageNet + fine-tuned on 10K general SEM images
- Few-shot approach: Prototypical networks (learn metric space where similar defects cluster)
- Input: 4096×4096 SEM images → 512×512 crops (sliding window)
- Output: 50 defect classes (scratches, pits, voids, contaminants, etc.)

**Implementation Approach:**
```python
# Step 1: Pre-train ViT on large general SEM dataset (10K images, 20 classes)
vit_base = train_vit_on_general_sem(sem_dataset_10k)

# Step 2: Meta-learning for few-shot adaptation
prototypical_network = build_prototypical_head(vit_base.features)

# Step 3: Fine-tune with few samples (5-50 per new defect class)
for new_defect_class in novel_defects_50_classes:
    support_set = get_labeled_samples(new_defect_class, n_shot=10)
    query_set = get_test_samples(new_defect_class, n=100)
    
    # Episodic training
    prototypical_network.meta_learn(support_set, query_set, episodes=500)

# Step 4: Deploy for production defect detection
deploy_sem_classifier(prototypical_network)
```

**Success Metrics:**
- **Accuracy:** ≥85% with 10-shot (10 samples per class)
- **mAP:** ≥0.88 for multi-label detection
- **Inference speed:** <500ms per 4K×4K image (with sliding window)
- **Annotation savings:** $100K-$500K (need 50 samples vs 5000 per class)

**Key Techniques:**
- Prototypical networks (metric learning for few-shot)
- Data augmentation specific to SEM (rotation, elastic deformation, Gaussian noise)
- Attention visualization (Grad-CAM to show defect locations for engineer trust)
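
The nearest-prototype rule behind prototypical networks can be illustrated without any deep-learning library. This is a minimal sketch under stated assumptions: the 2-D embeddings are made up for readability (in the project they would come from the ViT backbone), and `class_prototypes` and `classify` are hypothetical names.

```python
import math

# Minimal sketch of prototype-based few-shot classification.
# Embeddings are invented 2-D vectors; real ones would be backbone features.


def class_prototypes(support):
    """support: {label: [embedding, ...]} -> {label: mean embedding}."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos


def classify(query, protos):
    """Assign the query to the class whose prototype is closest (Euclidean)."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))


# 3-shot support set for two defect classes
support = {
    "scratch": [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]],
    "void": [[-1.0, 1.0], [-0.8, 1.2], [-1.2, 0.8]],
}

protos = class_prototypes(support)
print(classify([0.7, 0.2], protos))
print(classify([-0.9, 0.9], protos))
```

Adding a new defect class then only requires computing one more prototype from its few labeled samples; the embedding network itself does not need to be retrained for inference.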

---

#### **Project 3: Adaptive Test Program with RL + Transfer Learning**

**Objective:** Optimize test sequence dynamically using reinforcement learning agent guided by CNN-extracted wafer features

**Business Value:** $20M-$80M/year from reduced test time (15-30% reduction) + improved binning accuracy

**Architecture:**
- CNN feature extractor: EfficientNet-B3 (frozen, pre-trained on 50K wafer maps)
- RL policy network: PPO (Proximal Policy Optimization) actor-critic
- State: CNN features (1536-dim pooled EfficientNet-B3 features) + current test results (128-dim) → 1664-dim
- Action: Choose next test from 150 parametric tests + binning decision
- Reward: −(test_time_cost) + (binning_accuracy_reward) − (misclassification_penalty)

**Implementation Approach:**
```python
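# Step 1: Pre-train CNN on large wafer dataset (frozen feature extractor)
# forward_features returns a feature map; average over H and W to get one vector per wafer
cnn_features = pretrained_efficientnet_b3.forward_features(wafer_map).mean(dim=(2, 3))  # (B, 1536)

# Step 2: RL training (pseudocode: execute_test, evaluate_binning and the state update are placeholders)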
class TestSequenceEnv(gym.Env):
    def __init__(self):
        self.state = concat(cnn_features, current_test_results)
        self.action_space = Discrete(151)  # 150 tests + 1 binning action
    
    def step(self, action):
        if action < 150:  # Perform test
            test_result, test_time = execute_test(action)
            reward = -test_time * 0.1  # Cost of time
        else:  # Binning decision
            binning_accuracy = evaluate_binning()
            reward = binning_accuracy * 100 - misclassification_penalty * 50
        return new_state, reward, done, info

# Step 3: Train PPO agent
ppo_agent = PPO(policy='MlpPolicy', env=TestSequenceEnv(), learning_rate=3e-4)
ppo_agent.learn(total_timesteps=1_000_000)  # 2-3 days training

# Step 4: Deploy RL-guided test sequence
deploy_adaptive_test_program(ppo_agent, cnn_features)
```

**Success Metrics:**
- **Test time reduction:** 15-30% (60 sec → 45 sec average per device)
- **Binning accuracy:** ≥99.5% (match expert human decisions)
- **Throughput increase:** 20-40% more devices tested per hour
- **ROI:** $20M-$80M/year from faster time-to-market + yield improvement

**Key Techniques:**
- Transfer learning (CNN provides spatial awareness to RL agent)
- Curriculum learning (start with easy devices, progress to complex failure modes)
- Safe RL (constrain policy to prevent catastrophic misclassifications)

---

#### **Project 4: Cross-Product Wafer Map Synthesis with Conditional GANs**

**Objective:** Generate synthetic wafer maps for rare defect patterns to augment training data (solve class imbalance)

**Business Value:** $5M-$15M/year from improved detection of rare but catastrophic defects (e.g., near-full wafer failures)

**Architecture:**
- Generator: StyleGAN2 conditioned on defect class label
- Discriminator: EfficientNet-B1 (transfer learned from ImageNet)
- Input: Noise vector (512-dim) + class label (one-hot 20-dim)
- Output: Synthetic 128×128 wafer map

**Implementation Approach:**
```python
# Step 1: Train conditional GAN
generator = StyleGAN2Generator(latent_dim=512, num_classes=20)
discriminator = EfficientNetB1Discriminator(num_classes=20)  # Pre-trained features

for epoch in range(500):
    # Generate fake wafer maps
    z = torch.randn(batch_size, 512)
    labels = torch.randint(0, 20, (batch_size,))
    fake_wafers = generator(z, labels)
    
    # Train discriminator (real vs fake + classify defect type)
    d_loss_real = discriminator(real_wafers, real_labels)
    d_loss_fake = discriminator(fake_wafers.detach(), labels)
    
    # Train generator (fool discriminator + match defect statistics)
    g_loss = discriminator(fake_wafers, labels)

# Step 2: Augment training data
for rare_class in [8, 15, 17, 19]:  # Classes with <500 samples
    # `class` is a reserved word in Python, so the label is passed as class_label
    synthetic_wafers = generator.generate(n=5000, class_label=rare_class)
    training_data.add(synthetic_wafers)

# Step 3: Train final classifier on augmented data
classifier = train_efficientnet_b3(training_data)
```

**Success Metrics:**
- **FID score:** ≤50 (lower is better; synthetic wafers should be statistically close to real ones)
- **Classifier accuracy on rare classes:** +10-15% improvement (75% → 88%)
- **Data augmentation ratio:** 5:1 synthetic:real for rare classes
- **Cost avoidance:** $2M-$5M (avoid collecting 50K+ real rare-defect wafers)

**Key Techniques:**
- Conditional GAN (control defect pattern generation)
- Transfer learning in discriminator (EfficientNet features improve GAN training stability)
- Fréchet Inception Distance (FID) for quality evaluation

---

### **🌐 General AI/ML Projects**

#### **Project 5: Medical Image Diagnosis Transfer Learning**

**Objective:** Classify chest X-rays (Normal, Pneumonia, COVID-19, Tuberculosis) with limited labeled data

**Business Value:** Clinical decision support, reduce radiologist workload by 40%, faster triage

**Architecture:**
- Pre-trained: DenseNet-121 (ImageNet) → Fine-tuned on ChestX-ray14 (100K images)
- Transfer: Gradual unfreezing for COVID-19 detection (5K labeled images)
- Input: 224×224 grayscale X-ray (replicated to 3 channels)
- Output: 4 classes (Normal, Pneumonia, COVID, TB)

**Implementation:**
```python
densenet121 = models.densenet121(pretrained=True)

# Phase 1: Train on ChestX-ray14 (general lung pathologies, 100K images)
densenet121 = train_on_chestxray14(densenet121, epochs=50)  # 2 days

# Phase 2: Transfer to COVID-19 dataset (5K images)
densenet121_covid = gradual_unfreeze(densenet121, covid_dataset, epochs=15)  # 3 hours

# Ensemble with multiple pre-trained models
efficientnet = fine_tune_efficientnet_b4(covid_dataset)
resnet50 = fine_tune_resnet50(covid_dataset)
ensemble = weighted_average([densenet121_covid, efficientnet, resnet50])
```

**Success Metrics:**
- **AUC-ROC:** ≥0.95 for COVID-19 detection
- **Sensitivity:** ≥92% (catch most positive cases)
- **Specificity:** ≥88% (minimize false alarms)
- **Inference time:** <100ms per X-ray (real-time triage)
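
Sensitivity and specificity follow directly from the confusion-matrix counts. The sketch below uses invented labels for a binary "COVID vs. not COVID" view of the problem; `sensitivity_specificity` is a hypothetical helper, not a library function.

```python
# Sketch: sensitivity (true positive rate) and specificity (true negative rate).
# Labels are invented; 1 = positive case, 0 = negative case.


def sensitivity_specificity(y_true, y_pred, positive=1):
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"Sensitivity: {sens:.3f}")
print(f"Specificity: {spec:.3f}")
```

For the four-class model, the same computation would be applied one class at a time (one-vs-rest).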

---

#### **Project 6: Autonomous Vehicle Object Detection with Domain Adaptation**

**Objective:** Transfer object detection from COCO dataset (natural images) to automotive cameras (different lighting, angles, weather)

**Business Value:** Autonomous driving, reduce annotation cost by $500K-$2M, faster deployment

**Architecture:**
- Pre-trained: YOLOv8-X (COCO dataset, 80 classes)
- Transfer: Fine-tune on autonomous driving dataset (BDD100K, 10 classes: car, truck, pedestrian, cyclist, etc.)
- Domain adaptation: Adapt to different weather (rain, fog, night)

**Implementation:**
```python
from ultralytics import YOLO

yolov8 = YOLO('yolov8x.pt')  # Pre-trained on COCO

# Fine-tune on BDD100K (autonomous driving dataset)
yolov8.train(data='bdd100k.yaml', epochs=100, imgsz=640, batch=16)

# Domain adaptation for adverse weather (domain_adapt and MultiDomainYOLO are placeholder helpers)
yolov8_rain = domain_adapt(yolov8, rain_images, method='CycleGAN')
yolov8_fog = domain_adapt(yolov8, fog_images, method='CycleGAN')
yolov8_night = domain_adapt(yolov8, night_images, method='adversarial')

# Deploy ensemble
ensemble_yolo = MultiDomainYOLO([yolov8, yolov8_rain, yolov8_fog, yolov8_night])
```

**Success Metrics:**
- **mAP@0.5:** ≥0.65 (COCO-style evaluation)
- **Inference speed:** <30ms per frame (real-time at 30 FPS)
- **Robustness:** ≥85% accuracy in rain/fog/night (vs 95% in clear day)
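
The "@0.5" in mAP@0.5 means a predicted box counts as a true positive only if its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of that matching rule follows; the boxes are invented and `box_iou` and `is_true_positive` are hypothetical helper names.

```python
# Sketch: IoU between two axis-aligned boxes and the 0.5 matching threshold.
# Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2; values are invented.


def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def box_iou(a, b):
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    return box_iou(pred_box, gt_box) >= threshold


ground_truth = (0.0, 0.0, 10.0, 10.0)
good_pred = (2.0, 0.0, 12.0, 10.0)   # shifted by 2 px
poor_pred = (6.0, 0.0, 16.0, 10.0)   # shifted by 6 px

print(f"IoU good: {box_iou(good_pred, ground_truth):.3f}, TP: {is_true_positive(good_pred, ground_truth)}")
print(f"IoU poor: {box_iou(poor_pred, ground_truth):.3f}, TP: {is_true_positive(poor_pred, ground_truth)}")
```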

---

#### **Project 7: Satellite Image Change Detection**

**Objective:** Detect changes in satellite imagery (buildings, deforestation, floods) using pre-trained models

**Business Value:** Disaster response, urban planning, environmental monitoring

**Architecture:**
- Pre-trained: ResNet-101 (ImageNet) → Fine-tuned on xView (1M labeled buildings)
- Transfer: Siamese network for change detection (compare before/after images)
- Input: Pair of 256×256 RGB satellite images (t1, t2)
- Output: Change map (pixel-wise classification: no change, new building, demolished, flooded, etc.)

**Implementation:**
```python
# Siamese architecture (sketch: UNetDecoder and the training helper are placeholders)
resnet101 = models.resnet101(pretrained=True)
# Drop avgpool and fc so the backbone returns a spatial feature map instead of a pooled vector
resnet101_backbone = nn.Sequential(*list(resnet101.children())[:-2])

class ChangeDetectionNet(nn.Module):
    def __init__(self):
        super().__init__()  # Required before registering submodules
        self.backbone = resnet101_backbone  # Shared weights for both time steps
        self.fusion = nn.Conv2d(4096, 512, 1)  # Fuse t1 + t2 features
        self.decoder = UNetDecoder(512, num_classes=5)  # Upsample to change map
    
    def forward(self, img_t1, img_t2):
        feat_t1 = self.backbone(img_t1)  # (B, 2048, 8, 8) for 256×256 inputs
        feat_t2 = self.backbone(img_t2)  # (B, 2048, 8, 8)
        concat = torch.cat([feat_t1, feat_t2], dim=1)  # (B, 4096, 8, 8)
        fused = self.fusion(concat)  # (B, 512, 8, 8)
        change_map = self.decoder(fused)  # (B, 5, 256, 256)
        return change_map

model = ChangeDetectionNet()
train_on_change_detection_dataset(model, epochs=50)
```

**Success Metrics:**
- **IoU:** ≥0.75 for building change detection
- **Precision:** ≥88% (minimize false alarms for disaster response)
- **Processing speed:** <2 sec per 10 km² satellite tile

---

#### **Project 8: Product Recommendation with Visual Features**

**Objective:** Extract visual features from product images to improve recommendation system (e.g., "similar items")

**Business Value:** E-commerce, increase click-through rate by 15-25%, boost sales by $10M-$50M

**Architecture:**
- Pre-trained: EfficientNet-B5 (ImageNet) → Feature extractor (no fine-tuning)
- Transfer: Use 2048-dim image embeddings for similarity search
- Recommendation: Combine visual features + user behavior + product metadata

**Implementation:**
```python
import numpy as np
import faiss

efficientnet_b5 = timm.create_model('efficientnet_b5', pretrained=True, num_classes=0)  # No classifier
efficientnet_b5.eval()

# Extract features for entire product catalog
product_catalog = load_product_images()  # 1M products (placeholder loader)
embeddings = {}

with torch.no_grad():
    for product_id, image in tqdm(product_catalog.items()):
        image_tensor = preprocess(image).unsqueeze(0)  # Add batch dimension: (1, 3, H, W)
        embedding = efficientnet_b5(image_tensor).squeeze(0)  # (2048,)
        embeddings[product_id] = embedding.cpu().numpy()

# Build FAISS index for fast similarity search (FAISS expects float32 arrays)
product_ids = list(embeddings.keys())  # Row i of the index corresponds to product_ids[i]
matrix = np.stack([embeddings[pid] for pid in product_ids]).astype('float32')  # (N, 2048)
index = faiss.IndexFlatL2(2048)
index.add(matrix)

# Find similar products
query_embedding = embeddings['product_12345'].astype('float32')
distances, indices = index.search(query_embedding.reshape(1, -1), k=10)
similar_products = [product_ids[i] for i in indices[0]]  # The first hit is the query itself

# Hybrid recommendation (visual + collaborative filtering; placeholder helper)
recommendations = combine_visual_and_cf(similar_products, user_history)
```
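
Conceptually, a flat L2 index performs an exact brute-force scan: it compares the query against every stored vector and returns the closest ones, ranked by squared Euclidean distance. The pure-Python sketch below mirrors that behavior on a toy catalog; `l2_search` and the 2-D vectors are illustrative assumptions, not FAISS code.

```python
# Sketch: exact nearest-neighbor search by squared L2 distance (brute force).
# The toy 2-D "embeddings" stand in for real 2048-dim image features.


def l2_search(vectors, query, k):
    """Return the k nearest entries as (squared_distance, index) pairs."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    return sorted(scored)[:k]


catalog_ids = ["mug_blue", "mug_navy", "lamp", "sofa"]
catalog_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]

hits = l2_search(catalog_vectors, query=[0.9, 0.1], k=2)
for dist, idx in hits:
    print(f"{catalog_ids[idx]}: squared distance {dist:.2f}")
```

A brute-force scan is linear in catalog size, so for catalogs in the millions an approximate index is usually considered once the exact baseline has been measured.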

**Success Metrics:**
- **Click-through rate:** +15-25% on "similar items" section
- **Conversion rate:** +8-12% from visual recommendations
- **Inference speed:** <50ms per query (sub-second response)
- **Revenue impact:** $10M-$50M/year for large e-commerce platform

---

## 🎓 Key Takeaways & Best Practices

### **When to Use Each Transfer Learning Strategy**

| Strategy | Use Case | Data Size | Domain Similarity | Computational Budget |
|----------|----------|-----------|-------------------|----------------------|
| **Feature Extraction** | Quick prototyping, very small dataset | <1K | High (similar to ImageNet) | Low (1× baseline) |
| **Full Fine-Tuning** | Maximum accuracy, sufficient data | >5K | Medium-Low (domain shift) | High (10× baseline) |
| **Gradual Unfreezing** | **Production standard**, balanced approach | 1K-10K | Any | Medium (3-5× baseline) |

**Recommendation:** **Start with gradual unfreezing** (Strategy 3) for most projects.

---

### **Model Selection Guide**

| Model | Parameters | Speed | Accuracy | When to Use |
|-------|------------|-------|----------|-------------|
| **ResNet-50** | 25.6M | Medium | Good | Industry baseline, debugging, interpretability |
| **EfficientNet-B3** | 12.0M | **Fast** | **Best** | **Production (recommended)**, edge devices, cost-sensitive |
| **Vision Transformer** | 86.6M | Slow | Excellent | Research, very large datasets (>50K), interpretability (attention maps) |

**Recommendation:** **EfficientNet-B3** for semiconductor production (best accuracy/efficiency tradeoff).

---

### **Learning Rate Strategies**

1. **Discriminative LR (ESSENTIAL):**
   - Early layers: 1e-6 to 1e-5 (preserve ImageNet features)
   - Late layers: 1e-4 to 1e-3 (adapt to target domain)
   - Formula: $\eta_{\text{layer } i} = \eta_{\text{base}} \times \text{decay}^{L-i}$

2. **Warm-up + Cosine Annealing (RECOMMENDED):**
   - Warm-up: 2-5 epochs, linear increase 0.1× → 1.0×
   - Cosine decay: Smooth convergence, prevents oscillations
   - PyTorch: `SequentialLR(optimizer, schedulers=[LinearLR(...), CosineAnnealingLR(...)], milestones=[warmup_epochs])`

3. **Cyclical LR (ALTERNATIVE):**
   - Good for escaping local minima
   - Use when training plateaus
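
The layer-wise decay formula and the warm-up plus cosine schedule can both be checked with plain arithmetic. This is a hand-rolled sketch for illustration: `layer_lr` and `warmup_cosine_factor` are hypothetical helpers, the decay of 0.1 and the epoch counts are assumed values, and PyTorch's built-in schedulers differ in small details.

```python
import math

# Sketch: discriminative (layer-wise) learning rates and a warm-up + cosine multiplier.


def layer_lr(base_lr, decay, num_layers, layer_index):
    """eta_i = eta_base * decay**(L - i), with layers indexed 1..L (L = last layer)."""
    return base_lr * decay ** (num_layers - layer_index)


def warmup_cosine_factor(epoch, warmup_epochs, total_epochs, start_factor=0.1):
    """LR multiplier: linear warm-up from start_factor to 1.0, then cosine decay to 0."""
    if epoch < warmup_epochs:
        return start_factor + (1.0 - start_factor) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# Four layer groups, base LR 1e-3 at the top, 10x smaller per group going down
lrs = [layer_lr(1e-3, 0.1, num_layers=4, layer_index=i) for i in range(1, 5)]
print("Per-group LRs (early → late):", [f"{lr:.0e}" for lr in lrs])

# 3 warm-up epochs followed by cosine decay over a 15-epoch run
factors = [warmup_cosine_factor(e, warmup_epochs=3, total_epochs=15) for e in range(16)]
print("Multiplier at epochs 0, 3, 9, 15:", [round(factors[e], 3) for e in (0, 3, 9, 15)])
```

The effective learning rate for a group at a given epoch is the product of its per-group rate and the schedule multiplier, so early layers stay small throughout while every group follows the same warm-up and decay shape.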

---

### **Domain Adaptation Checklist**

For semiconductor (ImageNet → grayscale wafer maps):

✅ **Data Preprocessing:**
- Replicate grayscale to 3 channels: `image.repeat(3, 1, 1)`
- Normalize with ImageNet stats: `mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]`
- Resize to 224×224 (or 299×299 for Inception, 384×384 for ViT)
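
The three preprocessing steps above can be sketched with plain tensor operations. The function name `wafer_to_imagenet_input` and the random 64×64 input are assumptions for illustration, and the input is assumed to be a float tensor already scaled to [0, 1].

```python
import torch
import torch.nn.functional as F

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)


def wafer_to_imagenet_input(gray: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Convert a (1, H, W) grayscale map in [0, 1] to a normalized (3, size, size) tensor."""
    rgb = gray.repeat(3, 1, 1)                       # replicate the single channel
    rgb = F.interpolate(rgb.unsqueeze(0), size=(size, size),
                        mode="bilinear", align_corners=False).squeeze(0)
    return (rgb - IMAGENET_MEAN) / IMAGENET_STD      # ImageNet normalization


torch.manual_seed(0)
wafer = torch.rand(1, 64, 64)   # hypothetical wafer map
x = wafer_to_imagenet_input(wafer)
```

For wafer maps that store categorical die codes rather than intensities, `mode="nearest"` may be a better choice than bilinear interpolation, since it avoids blending neighboring codes.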

✅ **Data Augmentation (Wafer-Specific):**
- ✅ Rotation (0-360°, wafers have rotational symmetry)
- ✅ Horizontal flip (left-right symmetry)
- ✅ Gaussian noise (sensor variability)
- ✅ Brightness/contrast (equipment differences)
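- ❌ Hue/saturation jitter (wafer maps are grayscale, so there is no color to perturb)
- ❌ Vertical flip (skipped as redundant: it equals a horizontal flip followed by a 180° rotation, both already covered above)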

✅ **Batch Normalization Adaptation:**
- Option 1: Adaptive BN (compute running stats on target domain, no weight updates)
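- Option 2: Fine-tune BN layers only (train the affine `weight` and `bias`; `running_mean` and `running_var` are buffers, not parameters, and update automatically in train mode)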
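
Option 1 can be sketched as a forward-only pass over target-domain batches. The helper name `adapt_batchnorm` and the tiny conv model are assumptions for this example; the approach mirrors how `torch.optim.swa_utils.update_bn` re-estimates statistics.

```python
import torch
from torch import nn


@torch.no_grad()
def adapt_batchnorm(model: nn.Module, target_batches) -> None:
    """Re-estimate BN running statistics on target-domain data; no weights are updated."""
    bn_layers = [m for m in model.modules() if isinstance(m, nn.modules.batchnorm._BatchNorm)]
    momenta = {bn: bn.momentum for bn in bn_layers}
    model.eval()                       # dropout and other layers stay in inference mode
    for bn in bn_layers:
        bn.reset_running_stats()
        bn.momentum = None             # cumulative average over all batches seen
        bn.train()                     # only BN layers update their buffers
    for x in target_batches:
        model(x)
    for bn in bn_layers:
        bn.momentum = momenta[bn]      # restore the original momentum
    model.eval()


torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1), nn.BatchNorm2d(4), nn.ReLU())
weights_before = model[0].weight.clone()

# Hypothetical target-domain batches with a different mean and scale than the source.
target_batches = [torch.randn(8, 3, 16, 16) * 2.0 + 5.0 for _ in range(4)]
adapt_batchnorm(model, target_batches)
```

Only the `running_mean` and `running_var` buffers change, which makes this a cheap first step before deciding whether any weights need fine-tuning.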

---

### **Production Deployment Pipeline**

**Step 1: Model Compression**
```python
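import copy
import torch
from torch import nn
from torch.nn.utils import prune
from torch.quantization import quantize_dynamic

# Dynamic INT8 quantization converts nn.Linear (and RNN) layers only; nn.Conv2d is
# not supported, so the conv backbone stays FP32 (use static quantization or TensorRT for it).
quantized_model = quantize_dynamic(copy.deepcopy(efficientnet).cpu().eval(), {nn.Linear}, dtype=torch.qint8)

# L1 pruning: zero the 30% smallest-magnitude weights of the classifier layer.
# Assumes `efficientnet.classifier` is an nn.Linear (as in timm); with torchvision use `.classifier[1]`.
prune.l1_unstructured(efficientnet.classifier, name='weight', amount=0.3)
prune.remove(efficientnet.classifier, 'weight')  # bake the mask into the weight tensor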
```

**Step 2: ONNX Export**
```python
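efficientnet.eval()  # export in inference mode so BatchNorm/Dropout behave deterministically
dummy_input = torch.randn(1, 3, 224, 224).to(device)
torch.onnx.export(efficientnet, dummy_input, "efficientnet_b3.onnx",
                  input_names=['input'], output_names=['output'],
                  dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}})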
```

**Step 3: TensorRT Optimization** (NVIDIA GPUs)
```bash
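# Mixed precision (FP16) with a 4 GB workspace.
# Note: TensorRT 8.4+ deprecates --workspace in favor of --memPoolSize=workspace:4096.
trtexec --onnx=efficientnet_b3.onnx --saveEngine=efficientnet_b3.trt \
        --fp16 --workspace=4096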
```

**Step 4: Inference Server** (TorchServe/TF Serving)
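```bash
# TorchServe deployment: package the model into model_store/, then serve it.
# TorchServe loads .onnx files through ONNX Runtime, so `onnxruntime` must be installed.
mkdir -p model_store
torch-model-archiver --model-name efficientnet_wafer_classifier \
                     --version 1.0 \
                     --serialized-file efficientnet_b3.onnx \
                     --handler image_classifier \
                     --export-path model_store

torchserve --start --model-store model_store --models efficientnet_wafer_classifier.mar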
```

**Step 5: Monitoring & Retraining**
- **Track inference metrics:** Prediction confidence, latency, throughput
- **Detect distribution drift:** KL divergence between the current and a reference (validation-time) distribution of predicted classes
- **Active learning:** Flag low-confidence predictions for expert review
- **Retraining schedule:** Monthly or when accuracy drops >2%
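
A drift check of the kind listed above can be sketched with the standard library alone. The prediction lists, the three-class setup, and the `DRIFT_THRESHOLD` value are made-up illustrations; a real threshold would have to be calibrated on historical data.

```python
import math
from collections import Counter


def class_distribution(predictions, num_classes, eps=1e-6):
    """Smoothed relative frequency of each predicted class (eps avoids log(0))."""
    counts = Counter(predictions)
    total = len(predictions) + eps * num_classes
    return [(counts.get(c, 0) + eps) / total for c in range(num_classes)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical predicted class IDs: a reference window and two production windows.
reference = [0] * 70 + [1] * 20 + [2] * 10
stable = [0] * 68 + [1] * 22 + [2] * 10
shifted = [0] * 30 + [1] * 20 + [2] * 50

ref_dist = class_distribution(reference, num_classes=3)
kl_stable = kl_divergence(class_distribution(stable, 3), ref_dist)
kl_shifted = kl_divergence(class_distribution(shifted, 3), ref_dist)

DRIFT_THRESHOLD = 0.1  # assumed value, to be calibrated per deployment
drift_detected = kl_shifted > DRIFT_THRESHOLD
```

A shift in the predicted-class mix does not prove accuracy has dropped, but it is a label-free signal that can trigger the expert review and retraining steps above.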

---

### **Semiconductor-Specific Best Practices**

1. **Class Imbalance Handling:**
   - Use weighted loss: `nn.CrossEntropyLoss(weight=class_weights)`
   - Focal loss for hard examples: $FL(p_t) = -(1-p_t)^\gamma \log(p_t)$
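   - SMOTE (Synthetic Minority Over-sampling) for rare defects; apply it to tabular test data or extracted feature embeddings, since interpolating raw pixels rarely yields realistic wafer maps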

2. **Spatial Correlation:**
   - CNNs naturally capture spatial patterns (wafer maps)
   - Consider graph neural networks (GNNs) for die-to-die spatial dependencies

3. **Explainability (CRITICAL for engineer trust):**
   - Grad-CAM: Visualize which wafer regions drive prediction
   - SHAP: Feature importance for parametric test data
   - Sanity checks: Ensure model learns defect patterns, not spurious correlations

4. **Multi-Task Learning:**
   - Simultaneously predict: Yield% + Defect class + Severity + Root cause
   - Shared backbone, multiple heads → Better feature utilization
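
The focal loss from item 1 can be written directly from its formula. This is a minimal multi-class sketch: the function name `focal_loss`, the default `gamma=2.0`, and the random logits are assumptions for illustration, and the optional `class_weights` are applied per sample with a plain mean (unlike `nn.CrossEntropyLoss`, which normalizes by the sum of weights).

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, targets, gamma=2.0, class_weights=None):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t per sample
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if class_weights is not None:
        loss = loss * class_weights[targets]  # per-class weighting for imbalance
    return loss.mean()


torch.manual_seed(0)
logits = torch.randn(8, 20)              # batch of 8, 20 defect classes
targets = torch.randint(0, 20, (8,))

ce = F.cross_entropy(logits, targets)
fl_gamma0 = focal_loss(logits, targets, gamma=0.0)   # reduces to cross-entropy
fl_gamma2 = focal_loss(logits, targets, gamma=2.0)   # down-weights easy examples
```

With `gamma=0` the modulating factor is 1 and the loss equals standard cross-entropy; larger `gamma` shrinks the contribution of well-classified samples so rare, hard defects dominate the gradient.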

---

## 📚 What's Next?

**Upcoming Notebooks:**
- **055: Object Detection (YOLO, R-CNN)** → Localize defects on wafer, not just classify
- **056: RNN/LSTM/GRU** → Sequential test pattern analysis (time-series wafer data)
- **057: Seq2Seq & Attention** → Test sequence optimization
- **058: Transformers** → Self-attention for spatial wafer map features + BERT-style pre-training

---

## ✅ Learning Objectives Review

By completing this notebook, you've mastered:

1. ✅ **Transfer Learning Theory** - Feature hierarchy, domain adaptation, mathematical formulation
2. ✅ **Pre-trained Model Zoo** - ResNet-50, EfficientNet-B3, Vision Transformer comparison
3. ✅ **Fine-Tuning Strategies** - Feature extraction, full fine-tuning, gradual unfreezing (best practice)
4. ✅ **Learning Rate Policies** - Discriminative LR, warm-up, cosine annealing, cyclical LR
5. ✅ **Feature Extraction vs Fine-Tuning** - When to freeze, when to train, computational tradeoffs
6. ✅ **Domain Adaptation** - ImageNet → semiconductor (grayscale, augmentation, batch norm)
7. ✅ **Multi-Task Transfer Learning** - Leveraging multiple pre-trained models (ensemble)
8. ✅ **Production Deployment** - ONNX export, INT8 quantization, TensorRT optimization, inference serving

**Key Skill Acquired:** You can now apply transfer learning to any image classification problem with confidence!

---

## 📖 Additional Resources

**Must-Read Papers:**
- "Visualizing and Understanding Convolutional Networks" (Zeiler & Fergus, 2013) - Why transfer learning works
- "How transferable are features in deep neural networks?" (Yosinski et al., 2014) - Feature hierarchy analysis
- "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" (Tan & Le, 2019)
- "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2020) - Vision Transformer

**Courses:**
- CS231n (Stanford) - Lecture 11: Transfer Learning & Fine-Tuning
- Fast.ai Practical Deep Learning - Lesson 1 (transfer learning focus)

**Libraries:**
- **timm** (PyTorch Image Models): 700+ pre-trained models - https://github.com/huggingface/pytorch-image-models
- **TensorFlow Hub**: Pre-trained models for TensorFlow - https://tfhub.dev
- **ONNX Runtime**: Cross-framework inference - https://onnxruntime.ai

**Deployment Tools:**
- **TorchServe**: PyTorch model serving - https://pytorch.org/serve
- **TensorRT**: NVIDIA GPU optimization - https://developer.nvidia.com/tensorrt
- **TF Serving**: TensorFlow model serving - https://www.tensorflow.org/tfx/guide/serving

---

## 🎯 Final Summary

**Transfer Learning ROI:**
- **10-100× less data** needed (1K vs 100K samples)
- **10-100× faster training** (hours vs days)
- **5-15% higher accuracy** than training from scratch
- **$5M-$200M business value** for semiconductor applications

**Best Practices:**
1. **Always start with pre-trained models** (ImageNet baseline)
2. **Use gradual unfreezing** (Strategy 3) for production
3. **Discriminative LR is essential** (early layers slower, late layers faster)
4. **EfficientNet-B3 recommended** for semiconductor (best accuracy/efficiency)
5. **Monitor for distribution drift** (retrain when accuracy drops)

**You're now ready to deploy production-grade transfer learning systems!** 🚀

---

**Congratulations on completing Notebook 054!** 🎉

Next notebook: **055_Object_Detection_YOLO_RCNN.ipynb** - Learn to localize defects, not just classify wafer maps!