# 052: Deep Learning Frameworks (PyTorch & TensorFlow)

## **From NumPy to Production: Modern Deep Learning Tools**

---

### **📖 What You'll Learn**

By the end of this notebook, you will master:

1. **PyTorch fundamentals:** Tensors, autograd, nn.Module, optimizers, device management
2. **TensorFlow/Keras fundamentals:** Keras API, Sequential vs Functional, custom layers, callbacks
3. **Framework comparison:** When to use PyTorch vs TensorFlow, trade-offs, ecosystem
4. **Production-ready implementations:** Build the same neural network in both frameworks
5. **Semiconductor applications:** Wafer yield predictor, defect classifier with modern frameworks
6. **GPU acceleration:** Device management, mixed precision training, performance optimization
7. **Model persistence:** Save/load models, checkpointing, transfer learning
8. **Debugging & profiling:** TensorBoard, visualization, bottleneck identification

---

### **🎯 Why Deep Learning Frameworks Matter**

In **Notebook 051**, we implemented neural networks from scratch using NumPy. This taught us:

- ✅ Mathematical foundations (backpropagation, gradients)
- ✅ How neural networks actually work under the hood
- ✅ Debugging skills (gradient checking, initialization)

**But for production AI/ML**, writing everything from scratch is:

- ❌ Time-consuming (100s of lines → 10s of lines)
- ❌ Error-prone (manual gradient computation)
- ❌ Not optimized (no GPU acceleration, no advanced optimizers)
- ❌ Hard to maintain (custom codebases)
- ❌ Limited features (no pre-trained models, no deployment tools)

**Deep learning frameworks solve this:**

- ✅ **Automatic differentiation:** Backpropagation computed automatically (no manual gradients)
- ✅ **GPU acceleration:** 10-100× speedup with CUDA (PyTorch) or XLA (TensorFlow)
- ✅ **Production-ready:** Model serving (TorchServe, TensorFlow Serving), optimization (ONNX)
- ✅ **Rich ecosystem:** Pre-trained models (torchvision, tf.keras.applications), visualization (TensorBoard)
- ✅ **Community support:** 100K+ GitHub stars, extensive documentation, tutorials

**Analogy:** NumPy implementation = building a car from parts (educational).
Framework = driving a Tesla (production-ready, optimized, feature-rich).

---

### **🏢 Industry Usage (2024-2025)**

**PyTorch dominates research & many production systems:**

- **Market share:** ~55-60% (research), 40-45% (production)
- **Users:** Meta, Tesla, OpenAI (GPT models), Microsoft, Qualcomm, AMD
- **Strengths:** Pythonic API, dynamic computation graphs, debugging ease, research flexibility
- **Use cases:** LLMs (GPT, Llama), computer vision (YOLO, SAM), research prototyping

**TensorFlow/Keras strong in enterprise production:**

- **Market share:** ~40-45% (research), 50-55% (production)
- **Users:** Google, NVIDIA, Intel, Samsung, many Fortune 500 companies
- **Strengths:** Production ecosystem (TFX, TF Lite, TF Serving), deployment tools, stability
- **Use cases:** Google products (Search, Ads, Photos), mobile (TF Lite), edge devices

**Trend:** PyTorch gaining ground in production (PyTorch 2.0+ improvements), but TensorFlow still leads enterprise deployment.

---

### **🔧 Semiconductor Post-Silicon Validation Use Cases**

#### **Use Case 1: Wafer Yield Prediction (PyTorch)**

**Problem:** Predict die-level yield from 50+ parametric tests in real-time during wafer test.

**Why frameworks:**

- 50K+ die/hour throughput → Need GPU acceleration (100× faster than NumPy)
- Model deployment → TorchServe or ONNX Runtime for production inference
- Complex architectures → Multi-layer networks with batch normalization, dropout

**Business value:** $50M-$200M/year scrap reduction through early failure detection.

---

#### **Use Case 2: Defect Pattern Classification (TensorFlow/Keras)**

**Problem:** Classify 20+ defect types from wafer maps (spatial images) with 98%+ accuracy.

**Why frameworks:**

- Convolutional neural networks (CNNs) → Pre-built layers (Conv2D, MaxPool) in frameworks
- Transfer learning → Use pre-trained ImageNet models (ResNet, EfficientNet)
- Production deployment → TensorFlow Serving for real-time inference, TF Lite for edge devices

**Business value:** $5M-$20M per incident through faster root cause analysis.

---

#### **Use Case 3: Adaptive Test Insertion (PyTorch)**

**Problem:** Dynamically select optimal test sequence to minimize test time while maintaining 99%+ coverage.

**Why frameworks:**

- Reinforcement learning → Policy networks with PyTorch (flexible for RL research)
- GPU training → 10× faster iteration for policy optimization
- Real-time inference → <10ms decision time using TorchScript or ONNX

**Business value:** $10M-$50M/year test time reduction (30-50% savings).

---

### **📊 Framework Comparison at a Glance**

| Feature | PyTorch | TensorFlow/Keras |
|---------|---------|------------------|
| **API Style** | Pythonic, imperative | Keras (high-level), TF (low-level) |
| **Learning Curve** | Moderate (intuitive) | Easy (Keras), Hard (TF 1.x) |
| **Computation Graph** | Dynamic (define-by-run) | Static + Eager (TF 2.x) |
| **Debugging** | Easy (standard Python debugger) | Moderate (better in TF 2.x) |
| **Production** | Good (TorchServe, ONNX) | Excellent (TF Serving, TFX, TF Lite) |
| **Mobile/Edge** | Moderate (PyTorch Mobile) | Excellent (TF Lite, TF.js) |
| **Research** | **Dominant** (60%+ papers) | Strong (40% papers) |
| **Pre-trained Models** | torchvision, timm, HF | tf.keras.applications, TF Hub |
| **GPU Support** | CUDA (NVIDIA) | CUDA + XLA (better multi-GPU) |
| **Community** | Very active (researchers) | Very active (enterprise) |
| **Ecosystem** | HuggingFace, Lightning | TFX, Model Garden, Vertex AI |

**Verdict:**

- **For research, prototyping, flexibility:** PyTorch (easier debugging, more intuitive)
- **For production deployment, mobile, enterprise:** TensorFlow (better tooling, maturity)
- **For most projects:** Learn both (PyTorch for training, convert to ONNX for deployment)

---

### **🚀 What We'll Build**

In this notebook, we'll implement the **same neural network** in both PyTorch and TensorFlow:

**Architecture:** Multi-layer perceptron for semiconductor yield prediction

```
Input (50 features)
  → Dense (128, ReLU) + BatchNorm + Dropout (0.3)
  → Dense (64, ReLU) + BatchNorm + Dropout (0.2)
  → Dense (32, ReLU)
  → Output (1, Sigmoid)
```

**Training configuration:**

- Optimizer: Adam (lr=0.001)
- Loss: Binary cross-entropy
- Regularization: L2 (λ=0.01) + Dropout
- Metrics: Accuracy, Precision, Recall, AUC-ROC
- Hardware: GPU if available (otherwise CPU)

**We'll demonstrate:**

1. **Model definition:** Class-based (PyTorch) vs Sequential/Functional (Keras)
2. **Training loop:** Manual (PyTorch) vs fit() (Keras)
3. **Callbacks:** Early stopping, learning rate scheduling, checkpointing
4. **Device management:** CPU vs GPU, mixed precision training
5. **Model saving:** State dict (PyTorch), SavedModel (TensorFlow)
6. **Inference:** Batch prediction, production deployment

**Dataset:** Simulated semiconductor parametric test data (5,000 samples, 50 features, binary yield).

---

### **🗺️ Notebook Roadmap**

**Part 1: PyTorch Fundamentals**

1. Tensors, operations, device management
2. Autograd (automatic differentiation)
3. Building models with nn.Module
4. Training loop from scratch
5. Optimizers, schedulers, callbacks

**Part 2: TensorFlow/Keras Fundamentals**

1. Tensors, operations, eager execution
2. Sequential vs Functional API
3. Custom layers and models
4. Built-in training with fit()
5. Callbacks and model checkpointing

**Part 3: Side-by-Side Comparison**

1. Same architecture in both frameworks
2. Performance comparison (training time, inference speed)
3. Model conversion (ONNX for interoperability)
4. Production deployment options

**Part 4: Advanced Topics**

1. GPU acceleration and mixed precision
2. Distributed training (multi-GPU)
3. Hyperparameter tuning (Ray Tune, Optuna)
4. Production best practices

**Part 5: Real-World Projects**

1. Wafer yield predictor (PyTorch + TorchServe)
2. Defect image classifier (TensorFlow + TF Serving)
3. Power anomaly detector (Autoencoder in both frameworks)
4. General AI/ML projects (churn, fraud, medical imaging)

---

### **🔗 Architecture Diagram: Framework Workflow**

```mermaid
graph TB
    A[Raw Data] --> B[Preprocessing]
    B --> C{Framework Choice}

    C -->|PyTorch| D[PyTorch Tensors]
    C -->|TensorFlow| E[TF Tensors]

    D --> F[nn.Module Model]
    E --> G[Keras Model]

    F --> H[Manual Training Loop]
    G --> I[model.fit Training]

    H --> J[torch.save]
    I --> K[model.save]

    J --> L{Deployment}
    K --> L

    L --> M[TorchServe]
    L --> N[TF Serving]
    L --> O[ONNX Runtime]

    M --> P[Production API]
    N --> P
    O --> P

    style C fill:#f9f,stroke:#333,stroke-width:2px
    style L fill:#bbf,stroke:#333,stroke-width:2px
    style P fill:#bfb,stroke:#333,stroke-width:2px
```

**Key differences:**

- **PyTorch:** More manual control (custom training loop), better for research/experimentation
- **TensorFlow/Keras:** Higher-level API (fit() handles training), better for quick prototyping and production
- **ONNX:** Universal format for model exchange (train in PyTorch, deploy with ONNX Runtime)

---

### **📦 Installation & Setup**

**PyTorch:**

```bash
# CPU only
pip install torch torchvision

# CUDA 11.8 (NVIDIA GPU)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```

**TensorFlow:**

```bash
# CPU + GPU (automatic GPU detection)
pip install tensorflow

# Specific version
pip install tensorflow==2.15.0
```

**Supporting libraries:**

```bash
pip install numpy pandas matplotlib scikit-learn tensorboard onnx onnxruntime
```

**Check installation:**

```python
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

import tensorflow as tf
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {len(tf.config.list_physical_devices('GPU'))}")
```

---

### **💡 Learning Strategy**

**If you're new to deep learning frameworks:**

1. Start with **Keras** (easiest, high-level API)
2. Learn **PyTorch basics** (intuitive, Pythonic)
3. Compare implementations side-by-side
4. Pick primary framework based on use case (research → PyTorch, production → TensorFlow)

**If you have framework experience:**

- Focus on production patterns (deployment, monitoring, optimization)
- Learn the other framework (cross-framework skills valuable)
- Master ONNX for framework interoperability

**For semiconductor engineers:**

- Both frameworks widely used in industry (Qualcomm, AMD use both)
- PyTorch common for research/prototyping
- TensorFlow common for production deployment
- ONNX increasingly popular for edge devices

---

### **📚 Prerequisites**

**Required:**

- ✅ Completed **Notebook 051** (Neural Networks Foundations)
- ✅ Understand backpropagation, gradient descent, regularization
- ✅ Python basics (classes, functions, decorators)
- ✅ NumPy fundamentals (arrays, broadcasting)

**Helpful:**

- Familiarity with object-oriented programming
- Basic understanding of GPU computing concepts
- Experience with any ML library (scikit-learn, XGBoost)

---

### **⏱️ Time Investment**

- **Reading + code execution:** 3-4 hours
- **Practice exercises:** 2-3 hours
- **Real-world project:** 5-10 hours
- **Total:** 10-17 hours for mastery

**Recommendation:** Spread over 3-5 sessions, practice with your own datasets between sessions.

---

### **🎓 Learning Objectives**

After completing this notebook, you will be able to:

- ✅ **Build neural networks** in PyTorch and TensorFlow/Keras
- ✅ **Train models efficiently** with automatic differentiation and GPU acceleration
- ✅ **Deploy models to production** using TorchServe, TF Serving, or ONNX Runtime
- ✅ **Debug training issues** using TensorBoard and framework-specific tools
- ✅ **Optimize performance** with mixed precision, data parallelism, and profiling
- ✅ **Choose the right framework** based on project requirements and constraints
- ✅ **Convert between frameworks** using ONNX for interoperability
- ✅ **Apply to semiconductor testing** with production-grade implementations

---

**Let's dive in!** 🚀 We'll start with PyTorch fundamentals, then TensorFlow/Keras, and finally compare them side-by-side.

## 🔥 Part 1: PyTorch Fundamentals

### **What is PyTorch?**

**PyTorch** is an open-source deep learning framework developed by Meta AI (formerly Facebook AI Research). It provides:
- **Tensors:** GPU-accelerated multi-dimensional arrays (like NumPy + GPU)
- **Autograd:** Automatic differentiation for backpropagation
- **nn.Module:** Building blocks for neural networks
- **Optimizers:** SGD, Adam, RMSprop with learning rate scheduling
- **Ecosystem:** torchvision (computer vision), torchaudio (audio), torchtext (NLP)

**Philosophy:** "Pythonic" design - tensors behave like NumPy arrays, imperative programming style, easy debugging.

---

### **1. PyTorch Tensors: The Foundation**

**Tensors** are multi-dimensional arrays (like NumPy ndarray) with GPU acceleration.

#### **A. Creating Tensors**

```python
import torch

# From Python list
x = torch.tensor([1, 2, 3, 4, 5])  # 1D tensor (vector)
print(f"1D tensor: {x}, shape: {x.shape}, dtype: {x.dtype}")

# From NumPy array
import numpy as np
arr = np.array([[1, 2], [3, 4]])
x = torch.from_numpy(arr)  # Shares memory with NumPy
print(f"From NumPy: {x}")

# Special tensors
zeros = torch.zeros(2, 3)           # 2×3 matrix of zeros
ones = torch.ones(3, 4)             # 3×4 matrix of ones
rand = torch.rand(2, 2)             # Uniform [0, 1)
randn = torch.randn(3, 3)           # Normal N(0, 1)
eye = torch.eye(4)                  # 4×4 identity matrix
arange = torch.arange(0, 10, 2)     # [0, 2, 4, 6, 8]

print(f"Zeros:\n{zeros}")
print(f"Random:\n{rand}")
```
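The "shares memory" remark on `torch.from_numpy` is easy to overlook, so here is a minimal sketch contrasting it with `torch.tensor`, which copies. The variable names `shared` and `copied` are illustrative only.

```python
import numpy as np
import torch

arr = np.array([[1, 2], [3, 4]])

shared = torch.from_numpy(arr)   # tensor backed by the same buffer as arr
copied = torch.tensor(arr)       # independent copy of the data

arr[0, 0] = 100                  # mutate the NumPy array in place

print(shared[0, 0].item())       # 100: the shared tensor sees the change
print(copied[0, 0].item())       # 1: the copy is unaffected
```

If the NumPy array will be modified later (or reused as a buffer), copying with `torch.tensor` is the safer choice.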

#### **B. Tensor Operations**

```python
# Element-wise operations
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

print(f"Addition: {a + b}")           # [5, 7, 9]
print(f"Multiplication: {a * b}")     # [4, 10, 18]
print(f"Power: {a ** 2}")             # [1, 4, 9]

# Matrix operations
A = torch.randn(2, 3)
B = torch.randn(3, 4)

print(f"Matrix multiply: {torch.mm(A, B).shape}")  # (2, 4)
print(f"Transpose: {A.T.shape}")                    # (3, 2)

# Aggregations
x = torch.randn(3, 4)
print(f"Sum: {x.sum()}")              # Total sum
print(f"Mean: {x.mean()}")            # Average
print(f"Max: {x.max()}")              # Maximum value
print(f"Row sums: {x.sum(dim=1)}")    # Sum per row
```
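The `dim` argument in the aggregations above names the axis that is reduced away. A small sketch on a fixed tensor (chosen here so the numbers are easy to check by hand) shows the resulting shapes and how `keepdim=True` keeps the result broadcastable against the input:

```python
import torch

x = torch.arange(12, dtype=torch.float32).view(3, 4)   # rows: [0..3], [4..7], [8..11]

row_sums = x.sum(dim=1)                  # reduces the column axis -> shape (3,)
col_means = x.mean(dim=0)                # reduces the row axis    -> shape (4,)
row_sums_kept = x.sum(dim=1, keepdim=True)   # shape (3, 1) instead of (3,)

# (3, 4) / (3, 1) broadcasts row-wise, so every row is divided by its own sum
normalized = x / row_sums_kept

print(row_sums)              # tensor([ 6., 22., 38.])
print(col_means.shape)       # torch.Size([4])
print(normalized.sum(dim=1)) # each row now sums to 1
```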

#### **C. Reshaping & Indexing**

```python
x = torch.arange(12)
print(f"Original: {x}")

# Reshape
x_reshaped = x.view(3, 4)             # 3×4 matrix (shares memory)
x_copy = x.reshape(4, 3)              # 4×3 matrix (may copy)
print(f"Reshaped:\n{x_reshaped}")

# Indexing
A = torch.randn(4, 5)
print(f"First row: {A[0]}")
print(f"First column: {A[:, 0]}")
print(f"Submatrix: {A[1:3, 2:4]}")    # Rows 1-2, cols 2-3

# Boolean indexing
x = torch.tensor([1, 2, 3, 4, 5])
mask = x > 3
print(f"Elements > 3: {x[mask]}")     # [4, 5]
```
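The difference between `view` (never copies) and `reshape` (copies when it has to) only shows up on non-contiguous tensors. The sketch below uses a transposed tensor as the non-contiguous case; `view_failed` is just a flag for illustration.

```python
import torch

x = torch.arange(6).view(2, 3)   # [[0, 1, 2], [3, 4, 5]]
xt = x.t()                       # transpose: same storage, non-contiguous layout
print(xt.is_contiguous())        # False

# view() requires a compatible memory layout and refuses to copy
try:
    xt.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

flat = xt.reshape(6)                  # falls back to a copy when needed
flat_alt = xt.contiguous().view(6)    # explicit copy, then a zero-copy view

print(view_failed)   # True
print(flat)          # tensor([0, 3, 1, 4, 2, 5])
```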

---

### **2. Device Management: CPU vs GPU**

**Key concept:** Tensors live on a **device** (CPU or GPU). All operations must use tensors on the **same device**.

#### **A. Check GPU Availability**

```python
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device count: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
```

#### **B. Moving Tensors Between Devices**

```python
# Create tensor on CPU
x_cpu = torch.randn(3, 3)
print(f"x_cpu device: {x_cpu.device}")

# Move to GPU
x_gpu = x_cpu.to(device)  # or x_cpu.cuda() if GPU available
print(f"x_gpu device: {x_gpu.device}")

# Operations on GPU (10-100× faster for large tensors)
y_gpu = torch.randn(3, 3, device=device)  # Create directly on GPU
z_gpu = x_gpu + y_gpu                      # GPU computation

# Move back to CPU (required for NumPy conversion)
z_cpu = z_gpu.cpu()
z_np = z_cpu.numpy()  # Convert to NumPy

print(f"Result: {z_cpu}")
```

**Performance tip:** Keep tensors on GPU throughout computation, only move to CPU when necessary (visualization, saving).

---

### **3. Autograd: Automatic Differentiation**

**Autograd** is PyTorch's automatic differentiation engine. It tracks operations and computes gradients automatically.

#### **A. Basic Gradient Computation**

```python
# Create tensor with gradient tracking
x = torch.tensor(2.0, requires_grad=True)
print(f"x: {x}, requires_grad: {x.requires_grad}")

# Compute function
y = x ** 2 + 3 * x + 1  # y = x² + 3x + 1
print(f"y: {y}")

# Compute gradient dy/dx
y.backward()  # Computes gradients
print(f"dy/dx: {x.grad}")  # Should be 2x + 3 = 7 at x=2
```

**Mathematical verification:**
- $y = x^2 + 3x + 1$
- $\frac{dy}{dx} = 2x + 3$
- At $x = 2$: $\frac{dy}{dx} = 2(2) + 3 = 7$ ✅
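The same check can be done numerically, in the spirit of the gradient checking from Notebook 051. This is a minimal sketch that compares autograd against a central finite difference; the step size `h` and the helper name `f` are choices made here for illustration.

```python
import torch

def f(x):
    return x ** 2 + 3 * x + 1

# float64 keeps the finite-difference rounding error small
x = torch.tensor(2.0, dtype=torch.float64, requires_grad=True)
f(x).backward()
analytic = x.grad.item()

h = 1e-5
with torch.no_grad():   # plain arithmetic, no graph needed
    numeric = ((f(x + h) - f(x - h)) / (2 * h)).item()

print(f"autograd: {analytic:.6f}, finite difference: {numeric:.6f}")
```

Both values should agree with the hand-derived result of 7 to within rounding error.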

#### **B. Multi-Variable Gradients**

```python
# Multiple inputs
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

# Function
z = x**2 + y**3  # z = x² + y³

# Compute gradients
z.backward()

print(f"∂z/∂x: {x.grad}")  # Should be 2x = 2
print(f"∂z/∂y: {y.grad}")  # Should be 3y² = 12
```
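When the gradients are needed as return values rather than stored in `.grad`, `torch.autograd.grad` is an alternative to `.backward()`. A short sketch on the same function:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = x ** 2 + y ** 3

# Returns one gradient per input, in order, without touching x.grad / y.grad
dz_dx, dz_dy = torch.autograd.grad(z, (x, y))

print(dz_dx)    # tensor(2.)  -> 2x at x = 1
print(dz_dy)    # tensor(12.) -> 3y² at y = 2
print(x.grad)   # None: nothing was accumulated
```

This form is convenient for one-off derivative checks because there is no accumulated state to reset afterwards.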

#### **C. Gradient Accumulation**

**Important:** Gradients **accumulate** by default. Reset with `zero_grad()` between iterations.

```python
x = torch.tensor(2.0, requires_grad=True)

# First computation
y1 = x ** 2
y1.backward()
print(f"First gradient: {x.grad}")  # 2x = 4

# Second computation (without zero_grad)
y2 = x ** 3
y2.backward()
print(f"Accumulated gradient: {x.grad}")  # 4 + 3x² = 16 (wrong!)

# Correct approach
x.grad.zero_()  # Reset gradient
y2 = x ** 3     # Rebuild the graph: backward() frees it after the first call
y2.backward()
print(f"Correct gradient: {x.grad}")  # 3x² = 12 ✅
```
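Accumulation is not only a pitfall. It is also what makes it possible to split a large batch into smaller micro-batches when memory is tight. The sketch below, on made-up random data, checks that two accumulated micro-batch gradients (each loss scaled by 1/2) reproduce the full-batch gradient of a mean-squared-error loss.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
X = torch.randn(8, 3)
y = torch.randn(8)

# Reference: gradient of the mean loss over the full batch of 8
loss = ((X @ w - y) ** 2).mean()
loss.backward()
full_grad = w.grad.clone()
w.grad.zero_()

# Two micro-batches of 4; dividing by 2 makes the sum equal the full-batch mean
for Xb, yb in zip(X.chunk(2), y.chunk(2)):
    micro_loss = ((Xb @ w - yb) ** 2).mean() / 2
    micro_loss.backward()        # adds into w.grad
accum_grad = w.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```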

#### **D. Detaching from Computation Graph**

```python
x = torch.randn(2, 2, requires_grad=True)
y = x + 2

# Detach (stop gradient tracking)
y_detached = y.detach()  # y_detached doesn't track gradients

# y_detached is cut off from the graph, so nothing computed from it can backpropagate
z = y_detached * 3
try:
    z.sum().backward()
except RuntimeError as err:
    print(f"Expected error: {err}")  # z does not require grad

# Alternative: torch.no_grad() disables tracking for a whole block
with torch.no_grad():
    z = y * 3  # No gradient tracking (useful for inference)
```
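A detached tensor behaves like a constant inside a larger expression. The following sketch (values picked for easy hand-checking) compares the gradient of the same product with and without detaching one factor:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Full graph: z = (2x) * x = 2x², so dz/dx = 4x = 12
z_full = (x * 2) * x
(grad_full,) = torch.autograd.grad(z_full, x)

# Detached factor: (2x) is treated as the constant 6, so dz/dx = 6
z_cut = (x * 2).detach() * x
(grad_cut,) = torch.autograd.grad(z_cut, x)

print(grad_full)   # tensor(12.)
print(grad_cut)    # tensor(6.)
```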

---

### **4. Building Neural Networks with nn.Module**

**nn.Module** is the base class for all neural network models in PyTorch. It provides:
- Parameter management (weights, biases)
- Device placement of all parameters and buffers with a single `.to(device)` call
- Built-in save/load functionality
- Forward pass abstraction

#### **A. Simple Linear Layer**

```python
import torch.nn as nn

# Single linear layer: y = Wx + b
linear = nn.Linear(in_features=10, out_features=5)

# Input: batch of 32 samples, each with 10 features
x = torch.randn(32, 10)

# Forward pass
y = linear(x)  # Output shape: (32, 5)
print(f"Output shape: {y.shape}")

# Inspect parameters
print(f"Weight shape: {linear.weight.shape}")  # (5, 10)
print(f"Bias shape: {linear.bias.shape}")      # (5,)
print(f"Number of parameters: {sum(p.numel() for p in linear.parameters())}")  # 55
```
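The weight shape `(5, 10)` means the layer stores $W$ as `(out_features, in_features)` and applies its transpose to batched row vectors. A quick sketch confirming that `nn.Linear` matches the hand-written formula:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(in_features=10, out_features=5)
x = torch.randn(32, 10)

# For a batch of row vectors: y = x Wᵀ + b
manual = x @ linear.weight.T + linear.bias

print(torch.allclose(linear(x), manual, atol=1e-5))   # True
```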

#### **B. Custom Model with nn.Module**

```python
class SimpleNet(nn.Module):
    """
    Simple 2-layer neural network.
    
    Architecture: Input → Hidden (ReLU) → Output
    """
    
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNet, self).__init__()
        
        # Define layers
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        """Forward pass: defines computation"""
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create model
model = SimpleNet(input_size=20, hidden_size=64, output_size=1)
print(model)

# Forward pass
x = torch.randn(10, 20)  # Batch of 10 samples
output = model(x)
print(f"Output shape: {output.shape}")  # (10, 1)

# Count parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
```
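The parameter count can be verified by hand from the layer sizes. The sketch below rebuilds the same layer stack with `nn.Sequential` (a stand-in used here so the snippet runs on its own, not the class defined above) and lists each parameter tensor.

```python
import torch.nn as nn

# Same layer sizes as SimpleNet(input_size=20, hidden_size=64, output_size=1)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

for name, p in model.named_parameters():
    print(f"{name:10s} shape={tuple(p.shape)} count={p.numel()}")

total = sum(p.numel() for p in model.parameters())
expected = (20 * 64 + 64) + (64 * 1 + 1)   # weights + biases per layer
print(total, expected)                     # 1409 1409
```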

#### **C. Common Layers**

```python
# Activation functions
relu = nn.ReLU()
sigmoid = nn.Sigmoid()
tanh = nn.Tanh()
leaky_relu = nn.LeakyReLU(negative_slope=0.01)

# Regularization
dropout = nn.Dropout(p=0.5)              # Randomly zero 50% of inputs
batch_norm = nn.BatchNorm1d(num_features=64)  # Normalize batch

# Loss functions
mse_loss = nn.MSELoss()                  # Mean squared error (regression)
bce_loss = nn.BCELoss()                  # Binary cross-entropy (binary classification)
ce_loss = nn.CrossEntropyLoss()          # Cross-entropy (multi-class)

# Pooling (for CNNs)
max_pool = nn.MaxPool2d(kernel_size=2)   # 2×2 max pooling
avg_pool = nn.AvgPool2d(kernel_size=2)   # 2×2 average pooling
```
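Dropout and batch normalization behave differently in training and evaluation mode, which is why `model.train()` and `model.eval()` matter. A minimal sketch for dropout: in training mode surviving activations are scaled by $1/(1-p)$, and in evaluation mode the layer is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()               # training mode: random zeroing + rescaling
train_out = dropout(x)

dropout.eval()                # evaluation mode: pass-through
eval_out = dropout(x)

print(train_out.unique())           # only 0 and 1 / (1 - 0.5) = 2
print(torch.equal(eval_out, x))     # True
```

The rescaling keeps the expected activation the same in both modes, so no correction is needed at inference time.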

---

### **5. Training Loop: The PyTorch Pattern**

**PyTorch doesn't have a built-in `fit()` method.** You write the training loop manually (more control, better for research).

**Standard training loop:**

```python
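# 1. Setup
# Placeholder data so the loop runs standalone; substitute your real training tensors
X_train = torch.randn(200, 20)
y_train = torch.randint(0, 2, (200, 1)).float()

model = SimpleNet(input_size=20, hidden_size=64, output_size=1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCEWithLogitsLoss()  # BCE with sigmoid built-in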

# 2. Training loop
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    
    # Backward pass
    optimizer.zero_grad()  # Reset gradients
    loss.backward()         # Compute gradients
    optimizer.step()        # Update parameters
    
    # Logging
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
```

**Key steps:**
1. **Forward pass:** Compute predictions (`outputs = model(X)`)
2. **Compute loss:** Compare predictions to targets (`loss = criterion(outputs, y)`)
3. **Zero gradients:** Reset accumulated gradients (`optimizer.zero_grad()`)
4. **Backward pass:** Compute gradients (`loss.backward()`)
5. **Update weights:** Apply gradients (`optimizer.step()`)
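The five steps above can be sketched as a mini-batch loop with an evaluation pass at the end. Everything here is illustrative: the data is synthetic (the label depends only on the first feature), the model is a small `nn.Sequential` stand-in, and the hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic binary task: label is 1 when the first feature is positive
X = torch.randn(256, 20)
y = (X[:, 0] > 0).float().unsqueeze(1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()   # expects raw logits, applies sigmoid internally

epoch_losses = []
for epoch in range(20):
    model.train()                    # enable training-mode layers (dropout, batch norm)
    running = 0.0
    for xb, yb in loader:
        logits = model(xb)           # 1. forward pass
        loss = criterion(logits, yb) # 2. compute loss
        optimizer.zero_grad()        # 3. zero gradients
        loss.backward()              # 4. backward pass
        optimizer.step()             # 5. update weights
        running += loss.item() * xb.size(0)
    epoch_losses.append(running / len(loader.dataset))

model.eval()                         # switch to inference behaviour
with torch.no_grad():                # no graph needed for evaluation
    probs = torch.sigmoid(model(X))  # convert logits to probabilities
    accuracy = ((probs > 0.5).float() == y).float().mean().item()

print(f"first loss {epoch_losses[0]:.4f}, last loss {epoch_losses[-1]:.4f}, accuracy {accuracy:.3f}")
```

Because the loss takes logits, the sigmoid is applied only when probabilities are needed for reporting.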

---

### **6. Optimizers**

PyTorch provides common optimizers in `torch.optim`:

```python
# SGD with momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam (default choice)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# RMSprop
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9)

# AdamW (Adam with decoupled weight decay)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Learning rate scheduling
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# Or cosine annealing
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Use in training loop
for epoch in range(num_epochs):
    # ... training code ...
    scheduler.step()  # Update learning rate
```
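To see what `StepLR(step_size=30, gamma=0.1)` does to the learning rate, the schedule can be traced without training anything. This sketch uses a single dummy parameter and an initial rate of 0.1, both chosen here only to make the decay visible.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))          # dummy parameter
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

lrs = []
for epoch in range(90):
    lrs.append(scheduler.get_last_lr()[0])   # learning rate used during this epoch
    optimizer.step()                         # in real training this follows loss.backward()
    scheduler.step()                         # advance the schedule once per epoch

print(lrs[0], lrs[29])   # 0.1 for epochs 0-29
print(lrs[30])           # ~0.01 for epochs 30-59
print(lrs[60])           # ~0.001 for epochs 60-89
```

Calling `optimizer.step()` before `scheduler.step()` is the order PyTorch expects; reversing it triggers a warning.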

---

### **7. Model Saving & Loading**

#### **A. Save/Load State Dict (Recommended)**

```python
# Save model parameters
torch.save(model.state_dict(), 'model_weights.pth')

# Load parameters
model = SimpleNet(input_size=20, hidden_size=64, output_size=1)
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()  # Set to evaluation mode
```

#### **B. Save Entire Model (Less Flexible)**

```python
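# Save entire model (pickles the object; the class definition must be importable when loading)
torch.save(model, 'model_complete.pth')

# Load: PyTorch 2.6+ defaults to weights_only=True, which rejects pickled model objects
model = torch.load('model_complete.pth', weights_only=False)  # only for files you trust
model.eval()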
```

#### **C. Checkpointing (Save Training State)**

```python
# Save checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss.item(),  # store a plain float rather than a tensor
}
torch.save(checkpoint, 'checkpoint.pth')

# Resume training
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1  # resume with the epoch after the saved one
loss = checkpoint['loss']
```
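A checkpoint is only useful if it restores the model exactly. The sketch below does a full round trip through a temporary directory and compares predictions before and after. The tiny `nn.Linear` model, the epoch number and the file name are placeholders.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
x = torch.randn(5, 4)

# One update so the optimizer has internal state worth saving
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
expected = model(x).detach().clone()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.pth")
    torch.save({
        "epoch": 3,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss.item(),
    }, path)

    # Rebuild fresh objects, then restore their state
    restored = nn.Linear(4, 1)
    restored_opt = torch.optim.Adam(restored.parameters(), lr=0.001)
    ckpt = torch.load(path, map_location="cpu")   # map_location avoids device mismatches
    restored.load_state_dict(ckpt["model_state_dict"])
    restored_opt.load_state_dict(ckpt["optimizer_state_dict"])
    start_epoch = ckpt["epoch"] + 1

print(torch.allclose(restored(x), expected), start_epoch)
```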

---

### **🎯 Key PyTorch Concepts Summary**

| Concept | Purpose | Key Methods |
|---------|---------|-------------|
| **Tensor** | Multi-dimensional array with GPU support | `.to(device)`, `.numpy()`, `.item()` |
| **Autograd** | Automatic differentiation | `.backward()`, `.grad`, `.detach()`, `torch.no_grad()` |
| **nn.Module** | Base class for models | `forward()`, `.parameters()`, `.train()`, `.eval()` |
| **Optimizer** | Parameter update algorithms | `.zero_grad()`, `.step()` |
| **Loss Function** | Measure prediction error | `nn.MSELoss()`, `nn.CrossEntropyLoss()` |
| **Device** | CPU or GPU placement | `torch.device('cuda')`, `.to(device)` |

---

**Next:** We'll implement a complete PyTorch model for semiconductor yield prediction! 🚀

### 📝 Implementation

**Purpose:** Core implementation with detailed code

**Key implementation details below.**

```python
"""
PyTorch Complete Example: Semiconductor Yield Prediction
Architecture: Input(50) → Dense(128, ReLU) + BatchNorm + Dropout(0.3)
                        → Dense(64, ReLU) + BatchNorm + Dropout(0.2)
                        → Dense(32, ReLU)
                        → Output(1, Sigmoid)
Goal: Predict die-level yield (binary: pass/fail) from 50 parametric test features.
Business value: $50M-$200M/year scrap reduction through early failure detection.
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, confusion_matrix
import time

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("="*80)
print("PyTorch Semiconductor Yield Prediction Model")
print("="*80)

# -----------------------------------------------------------------------------
# 1. Device Setup
# -----------------------------------------------------------------------------
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\nDevice: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# -----------------------------------------------------------------------------
# 2. Generate Simulated Semiconductor Data
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("2. GENERATING SIMULATED SEMICONDUCTOR DATA")
print("="*80)

n_samples = 5000
n_features = 50

# Simulate 50 parametric test features
# Features: voltage, current, frequency, power, temperature measurements
# Real-world: extracted from STDF files (wafer test + final test)
# Generate correlated features (semiconductor tests are often correlated)
# Good devices: mean=0, std=1, high yield
# Bad devices: shifted distributions, low yield

def generate_semiconductor_data(n_samples, n_features):
    """
    Generate simulated semiconductor parametric test data.

    Features represent:
    - Voltage measurements (Vdd, Vss) - features 0-9
    - Current measurements (Idd, leakage) - features 10-19
    - Frequency measurements (clock, PLL) - features 20-29
    - Power measurements (dynamic, static) - features 30-39
    - Temperature coefficients - features 40-49

    Target: 1 = pass (good die), 0 = fail (bad die)
    """
    # Create base latent factors (simulates underlying process variation)
    n_factors = 5
    latent_factors = np.random.randn(n_samples, n_factors)

    # Each feature is a linear combination of latent factors + noise,
    # which gives the features realistic correlations
    feature_weights = np.random.randn(n_factors, n_features) * 0.5
    X = latent_factors @ feature_weights + np.random.randn(n_samples, n_features) * 0.3

    # Generate target (yield) based on feature patterns
    # Good devices: score above the median; bad devices: score below it

    # Critical features (indices 0, 10, 20, 30, 40 - one from each category)
    critical_features = [0, 10, 20, 30, 40]
    device_score = X[:, critical_features].sum(axis=1)

    # Add nonlinearity
    device_score += 0.1 * (X[:, 5] * X[:, 15])  # Interaction term
    device_score -= 0.2 * np.abs(X[:, 25])       # Nonlinear dependency

    # Convert to binary (threshold at median)
    threshold = np.median(device_score)
    y = (device_score > threshold).astype(int)

    # Add label noise (realistic: ~2% mislabeling)
    flip_indices = np.random.choice(n_samples, size=int(0.02 * n_samples), replace=False)
    y[flip_indices] = 1 - y[flip_indices]

    return X, y

X, y = generate_semiconductor_data(n_samples, n_features)
print(f"Dataset shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target distribution: {np.bincount(y)} (0=fail, 1=pass)")
print(f"Yield rate: {y.mean()*100:.2f}%")

# Create feature names (for interpretability)
feature_names = (
    [f"Vdd_{i}" for i in range(10)] +
    [f"Idd_{i}" for i in range(10)] +
    [f"Freq_{i}" for i in range(10)] +
    [f"Power_{i}" for i in range(10)] +
    [f"Temp_{i}" for i in range(10)]
)
df = pd.DataFrame(X, columns=feature_names)
df['yield'] = y
print("\nFirst few samples:")
print(df.head())
```
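The thresholding and label-noise steps in `generate_semiconductor_data` can be checked in isolation. The sketch below repeats them on a small random score vector (1,000 samples and NumPy's `default_rng`, both chosen here for illustration) to show why the classes come out balanced and how many labels the 2% noise actually flips.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
score = rng.standard_normal(n)            # stand-in for device_score

# Thresholding at the median puts exactly half of an even-sized sample above it
y_clean = (score > np.median(score)).astype(int)

# Flip 2% of the labels, each index chosen at most once
flip = rng.choice(n, size=int(0.02 * n), replace=False)
y_noisy = y_clean.copy()
y_noisy[flip] = 1 - y_noisy[flip]

print(y_clean.sum())                 # 500
print((y_clean != y_noisy).sum())    # 20
print(y_noisy.mean())                # close to 0.5, shifted slightly by the flips
```

A balanced target means plain accuracy is a meaningful metric for this simulated data; real yield data is usually far more skewed toward passing die.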


### 📝 Implementation Part 2

**Purpose:** Continue implementation

**Key implementation details below.**

```python
# -----------------------------------------------------------------------------
# 3. Data Preprocessing
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("3. DATA PREPROCESSING")
print("="*80)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")

# Standardize features (critical for neural networks)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Feature means (after scaling): {X_train_scaled.mean(axis=0)[:5]}")  # Should be ~0
print(f"Feature stds (after scaling): {X_train_scaled.std(axis=0)[:5]}")    # Should be ~1

# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train_scaled).to(device)
y_train_tensor = torch.FloatTensor(y_train).unsqueeze(1).to(device)  # Shape: (n, 1)
X_test_tensor = torch.FloatTensor(X_test_scaled).to(device)
y_test_tensor = torch.FloatTensor(y_test).unsqueeze(1).to(device)
print(f"\nTensor shapes:")
print(f"X_train: {X_train_tensor.shape}, device: {X_train_tensor.device}")
print(f"y_train: {y_train_tensor.shape}, device: {y_train_tensor.device}")

# Create DataLoader for batch training
batch_size = 64
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
print(f"\nNumber of batches per epoch: {len(train_loader)}")

# -----------------------------------------------------------------------------
# 4. Define Model Architecture
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("4. MODEL ARCHITECTURE")
print("="*80)

class SemiconductorYieldPredictor(nn.Module):
    """
    Multi-layer perceptron for semiconductor yield prediction.

    Architecture:
        Input(50) → Dense(128, ReLU) + BatchNorm + Dropout(0.3)
                  → Dense(64, ReLU) + BatchNorm + Dropout(0.2)
                  → Dense(32, ReLU)
                  → Output(1, Sigmoid)

    Features:
    - BatchNorm: Stabilizes training, reduces internal covariate shift
    - Dropout: Prevents overfitting by randomly zeroing activations
    - ReLU: Faster training, mitigates vanishing gradients
    - Sigmoid: Outputs probability [0, 1] for binary classification
    """

    def __init__(self, input_size=50, hidden1=128, hidden2=64, hidden3=32, dropout1=0.3, dropout2=0.2):
        super(SemiconductorYieldPredictor, self).__init__()

        # Layer 1: Input → Hidden1
        self.fc1 = nn.Linear(input_size, hidden1)
        self.bn1 = nn.BatchNorm1d(hidden1)
        self.dropout1 = nn.Dropout(dropout1)

        # Layer 2: Hidden1 → Hidden2
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.bn2 = nn.BatchNorm1d(hidden2)
        self.dropout2 = nn.Dropout(dropout2)

        # Layer 3: Hidden2 → Hidden3
        self.fc3 = nn.Linear(hidden2, hidden3)

        # Output layer: Hidden3 → Output
        self.fc4 = nn.Linear(hidden3, 1)

        # Activation functions
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

        # Initialize weights (Xavier/Glorot initialization)
        self._init_weights()

    def _init_weights(self):
        """Initialize weights using Xavier uniform initialization."""
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        """
        Forward pass.

        Args:
            x: Input tensor of shape (batch_size, 50)

        Returns:
            Output tensor of shape (batch_size, 1) with values in [0, 1]
        """
        # Layer 1
        x = self.fc1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.dropout1(x)

        # Layer 2
        x = self.fc2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.dropout2(x)

        # Layer 3
        x = self.fc3(x)
        x = self.relu(x)

        # Output layer
        x = self.fc4(x)
        x = self.sigmoid(x)

        return x
```
# Create model
model = SemiconductorYieldPredictor(
    input_size=50,
    hidden1=128,
    hidden2=64,
    hidden3=32,
    dropout1=0.3,
    dropout2=0.2
).to(device)
print(model)
print(f"\nModel device: {next(model.parameters()).device}")
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
# Breakdown by layer
print("\nParameter breakdown:")
for name, param in model.named_parameters():
    print(f"  {name:30s} {str(param.shape):20s} {param.numel():>8,} params")
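
The totals printed above can be cross-checked by hand. The following is a minimal, framework-free sketch that assumes the layer sizes passed to `SemiconductorYieldPredictor` (50 → 128 → 64 → 32 → 1). The helper names `linear_params` and `batchnorm_params` are illustrative and not part of PyTorch.

```python
def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Learnable scale (gamma) and shift (beta).

    The running mean and variance are buffers, so model.parameters() skips them.
    """
    return 2 * n_features


expected = {
    "fc1": linear_params(50, 128),
    "bn1": batchnorm_params(128),
    "fc2": linear_params(128, 64),
    "bn2": batchnorm_params(64),
    "fc3": linear_params(64, 32),
    "fc4": linear_params(32, 1),
}
expected_total = sum(expected.values())

for name, count in expected.items():
    print(f"{name}: {count:,}")
print(f"total: {expected_total:,}")
```

If `total_params` differs from this figure, a changed layer size or a missing bias term is the first thing to check.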


### 📝 Implementation Part 3

**Purpose:** Continue implementation

**Key implementation details below.**

In [None]:
# -----------------------------------------------------------------------------
# 5. Define Loss Function and Optimizer
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("5. LOSS FUNCTION & OPTIMIZER")
print("="*80)
# Loss: Binary Cross-Entropy (BCE)
# Note: Using BCELoss (requires sigmoid in model) instead of BCEWithLogitsLoss
criterion = nn.BCELoss()
# Optimizer: Adam with weight decay (L2 regularization)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)
# Learning rate scheduler: Reduce LR on plateau
# (`verbose` is omitted: it is deprecated since PyTorch 2.2; log optimizer.param_groups[0]['lr'] instead)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)
print(f"Loss function: {criterion}")
print(f"Optimizer: {optimizer}")
print(f"Scheduler: ReduceLROnPlateau (factor=0.5, patience=5)")
# -----------------------------------------------------------------------------
# 6. Training Loop
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("6. TRAINING MODEL")
print("="*80)
num_epochs = 50
train_losses = []
val_losses = []
train_accuracies = []
val_accuracies = []
best_val_loss = float('inf')
patience_counter = 0
early_stop_patience = 10
print(f"Training for {num_epochs} epochs...")
print(f"Batch size: {batch_size}")
print(f"Batches per epoch: {len(train_loader)}")
print()
start_time = time.time()
for epoch in range(num_epochs):
    # -------------------
    # Training phase
    # -------------------
    model.train()  # Set model to training mode (enables dropout, batchnorm updates)
    
    epoch_train_loss = 0.0
    train_preds_all = []
    train_targets_all = []
    
    for batch_X, batch_y in train_loader:
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Accumulate loss
        epoch_train_loss += loss.item() * batch_X.size(0)
        
        # Store predictions for accuracy
        train_preds_all.append((outputs > 0.5).float())
        train_targets_all.append(batch_y)
    
    # Average training loss
    epoch_train_loss /= len(train_loader.dataset)
    train_losses.append(epoch_train_loss)
    
    # Training accuracy
    train_preds_all = torch.cat(train_preds_all).cpu().numpy()
    train_targets_all = torch.cat(train_targets_all).cpu().numpy()
    train_acc = accuracy_score(train_targets_all, train_preds_all)
    train_accuracies.append(train_acc)
    
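    # -------------------
    # Validation phase
    # Note: the test split doubles as the validation set here. Because early stopping
    # and checkpointing select on it, the final test metrics are optimistic; hold out
    # a separate validation split when an unbiased estimate is needed.
    # -------------------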
    model.eval()  # Set model to evaluation mode (disables dropout, batchnorm uses running stats)
    
    epoch_val_loss = 0.0
    val_preds_all = []
    val_targets_all = []
    
    with torch.no_grad():  # Disable gradient computation for inference
        for batch_X, batch_y in test_loader:
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            
            epoch_val_loss += loss.item() * batch_X.size(0)
            
            val_preds_all.append((outputs > 0.5).float())
            val_targets_all.append(batch_y)
    
    # Average validation loss
    epoch_val_loss /= len(test_loader.dataset)
    val_losses.append(epoch_val_loss)
    
    # Validation accuracy
    val_preds_all = torch.cat(val_preds_all).cpu().numpy()
    val_targets_all = torch.cat(val_targets_all).cpu().numpy()
    val_acc = accuracy_score(val_targets_all, val_preds_all)
    val_accuracies.append(val_acc)
    
    # Update learning rate based on validation loss
    scheduler.step(epoch_val_loss)
    
    # Print progress every 5 epochs
    if (epoch + 1) % 5 == 0 or epoch == 0:
        print(f"Epoch [{epoch+1:3d}/{num_epochs}] "
              f"Train Loss: {epoch_train_loss:.4f} | Train Acc: {train_acc:.4f} | "
              f"Val Loss: {epoch_val_loss:.4f} | Val Acc: {val_acc:.4f}")
    
    # Early stopping
    if epoch_val_loss < best_val_loss:
        best_val_loss = epoch_val_loss
        patience_counter = 0
        # Save best model
        torch.save(model.state_dict(), 'best_model_pytorch.pth')
    else:
        patience_counter += 1
        if patience_counter >= early_stop_patience:
            print(f"\nEarly stopping triggered at epoch {epoch+1}")
            break
training_time = time.time() - start_time
print(f"\nTraining completed in {training_time:.2f} seconds")
print(f"Best validation loss: {best_val_loss:.4f}")
# Load best model
model.load_state_dict(torch.load('best_model_pytorch.pth', map_location=device))
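
The choice between `nn.BCELoss` and `nn.BCEWithLogitsLoss` noted in step 5 is mainly about numerical stability. The sketch below is plain Python rather than PyTorch, and its helper names are illustrative. It compares the textbook formula applied to a sigmoid output with the logit-based form `max(z, 0) - z*y + log(1 + exp(-|z|))`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(p, y):
    """Textbook BCE on a probability; breaks when p rounds to exactly 0.0 or 1.0."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_from_logit(z, y):
    """Algebraically equivalent form computed from the raw logit z."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# Moderate logit: both forms agree to floating-point precision.
moderate_gap = abs(bce_from_probability(sigmoid(2.0), 1.0) - bce_from_logit(2.0, 1.0))

# Large logit with the opposite label: sigmoid(40.0) rounds to 1.0 in float64,
# so the textbook form would have to evaluate log(0).
try:
    bce_from_probability(sigmoid(40.0), 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True

stable_loss = bce_from_logit(40.0, 0.0)
print(moderate_gap, naive_failed, stable_loss)
```

PyTorch's `BCELoss` documents that it clamps its log outputs at -100, so it does not raise in this situation, but the loss stops reflecting how wrong the prediction is once the sigmoid saturates. Dropping the final sigmoid from the model and switching to `BCEWithLogitsLoss` is the usual alternative.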


### 📝 Implementation Part 4

**Purpose:** Continue implementation

**Key implementation details below.**

In [None]:
# -----------------------------------------------------------------------------
# 7. Evaluation
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("7. MODEL EVALUATION")
print("="*80)
model.eval()
with torch.no_grad():
    test_outputs = model(X_test_tensor)
    test_preds = (test_outputs > 0.5).float().cpu().numpy()
    test_probs = test_outputs.cpu().numpy()
    test_targets = y_test_tensor.cpu().numpy()
# Metrics
test_acc = accuracy_score(test_targets, test_preds)
test_precision = precision_score(test_targets, test_preds)
test_recall = recall_score(test_targets, test_preds)
test_auc = roc_auc_score(test_targets, test_probs)
print(f"Test Accuracy:  {test_acc:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall:    {test_recall:.4f}")
print(f"Test AUC-ROC:   {test_auc:.4f}")
# Confusion matrix
cm = confusion_matrix(test_targets, test_preds)
print(f"\nConfusion Matrix:")
print(f"                Predicted")
print(f"              Fail   Pass")
print(f"Actual Fail   {cm[0,0]:4d}  {cm[0,1]:4d}")
print(f"       Pass   {cm[1,0]:4d}  {cm[1,1]:4d}")
# Business metrics
false_positives = cm[0, 1]  # Predicted pass, actually fail (bad dies shipped)
false_negatives = cm[1, 0]  # Predicted fail, actually pass (good dies scrapped)
print(f"\nBusiness Impact:")
print(f"  False Positives (bad dies shipped): {false_positives} (~${false_positives * 50_000:,} potential loss)")
print(f"  False Negatives (good dies scrapped): {false_negatives} (~${false_negatives * 1_000:,} loss)")
# -----------------------------------------------------------------------------
# 8. Visualizations
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("8. VISUALIZATIONS")
print("="*80)
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Training curves
axes[0].plot(train_losses, label='Train Loss', linewidth=2)
axes[0].plot(val_losses, label='Val Loss', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training and Validation Loss (PyTorch)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Accuracy curves
axes[1].plot(train_accuracies, label='Train Accuracy', linewidth=2)
axes[1].plot(val_accuracies, label='Val Accuracy', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training and Validation Accuracy (PyTorch)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pytorch_training_curves.png', dpi=150, bbox_inches='tight')
print("Saved: pytorch_training_curves.png")
plt.show()
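
Because the business-impact figures above price a shipped bad die at 50 times a scrapped good die, the default 0.5 cutoff is not necessarily the cheapest one. The sketch below is a hedged illustration: it reuses the two illustrative cost figures from the evaluation cell, assumes the model's probabilities are reasonably calibrated, and uses made-up confusion-matrix counts. The function names are hypothetical.

```python
COST_FALSE_POSITIVE = 50_000  # bad die shipped (illustrative figure from the cell above)
COST_FALSE_NEGATIVE = 1_000   # good die scrapped (illustrative figure from the cell above)


def summarize(tn, fp, fn, tp):
    """Precision, recall, and total cost from the four confusion-matrix cells."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "cost": fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE,
    }


def break_even_threshold(cost_fp, cost_fn):
    """Pass probability above which predicting PASS has the lower expected cost.

    Shipping costs (1 - p) * cost_fp in expectation and scrapping costs p * cost_fn,
    so shipping is cheaper when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical counts, not results from this notebook.
example = summarize(tn=480, fp=20, fn=30, tp=470)
threshold = break_even_threshold(COST_FALSE_POSITIVE, COST_FALSE_NEGATIVE)

print(example)
print(f"break-even threshold: {threshold:.3f}")
```

Under these assumptions the break-even point sits near 0.98 rather than 0.5, which is why the decision threshold is worth tuning on a validation split instead of being left at its default.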


### 📝 Implementation Part 5

**Purpose:** Continue implementation

**Key implementation details below.**

In [None]:
# -----------------------------------------------------------------------------
# 9. Inference Example
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("9. PRODUCTION INFERENCE EXAMPLE")
print("="*80)
# Simulate new test data (5 devices)
new_devices = np.random.randn(5, 50)
new_devices_scaled = scaler.transform(new_devices)
new_devices_tensor = torch.tensor(new_devices_scaled, dtype=torch.float32).to(device)
# Inference
model.eval()
with torch.no_grad():
    predictions = model(new_devices_tensor)
    predictions = predictions.cpu().numpy()
print("New device predictions (yield probability):")
for i, pred in enumerate(predictions):
    status = "PASS" if pred[0] > 0.5 else "FAIL"
    print(f"  Device {i+1}: {pred[0]:.4f} → {status}")
print("\n" + "="*80)
print("PyTorch Model Training Complete!")
print("="*80)


## 🔥 Part 2: TensorFlow/Keras Fundamentals

### **What is TensorFlow/Keras?**

**TensorFlow** is Google's open-source machine learning framework. **Keras** is its high-level API, shipped as `tf.keras` and the recommended way to build models since TF 2.0.

**Key features:**
- **High-level API (Keras):** Simple, intuitive model building (`Sequential`, `Functional`)
- **Production-ready:** TensorFlow Serving, TF Lite (mobile), TF.js (browser)
- **Ecosystem:** TensorFlow Extended (TFX) for production ML pipelines
- **Eager execution:** TF 2.x runs operations immediately (like PyTorch)
- **Graph mode:** Can compile models for production (faster inference)

**Philosophy:** Easy for beginners (Keras), powerful for production (TensorFlow).

---

### **1. TensorFlow Tensors**

TensorFlow has its own tensor implementation (similar to PyTorch).

#### **A. Creating Tensors**

```python
import tensorflow as tf
import numpy as np

# From Python list
x = tf.constant([1, 2, 3, 4, 5])
print(f"TF tensor: {x}, dtype: {x.dtype}, shape: {x.shape}")

# From NumPy array
arr = np.array([[1, 2], [3, 4]])
x = tf.constant(arr)
print(f"From NumPy:\n{x}")

# Special tensors
zeros = tf.zeros((2, 3))
ones = tf.ones((3, 4))
rand = tf.random.uniform((2, 2), minval=0, maxval=1)  # Uniform [0, 1)
randn = tf.random.normal((3, 3), mean=0, stddev=1)    # Normal N(0, 1)
eye = tf.eye(4)
arange = tf.range(0, 10, 2)

print(f"Zeros:\n{zeros}")
print(f"Random:\n{rand}")
```

#### **B. Tensor Operations**

```python
# Element-wise operations
a = tf.constant([1.0, 2.0, 3.0])
b = tf.constant([4.0, 5.0, 6.0])

print(f"Addition: {a + b}")
print(f"Multiplication: {a * b}")
print(f"Power: {tf.pow(a, 2)}")

# Matrix operations
A = tf.random.normal((2, 3))
B = tf.random.normal((3, 4))

print(f"Matrix multiply: {tf.matmul(A, B).shape}")  # (2, 4)
print(f"Transpose: {tf.transpose(A).shape}")        # (3, 2)

# Aggregations
x = tf.random.normal((3, 4))
print(f"Sum: {tf.reduce_sum(x)}")
print(f"Mean: {tf.reduce_mean(x)}")
print(f"Max: {tf.reduce_max(x)}")
print(f"Row sums: {tf.reduce_sum(x, axis=1)}")
```

#### **C. Converting to NumPy**

```python
x = tf.constant([[1, 2], [3, 4]])
x_np = x.numpy()  # Convert to NumPy array
print(f"NumPy array:\n{x_np}, type: {type(x_np)}")
```

---

### **2. Automatic Differentiation with GradientTape**

TensorFlow uses **`GradientTape`** to track operations for automatic differentiation.

```python
# Create variable (trainable tensor)
x = tf.Variable(2.0)

# Record operations
with tf.GradientTape() as tape:
    y = x**2 + 3*x + 1  # y = x² + 3x + 1

# Compute gradient
dy_dx = tape.gradient(y, x)  # dy/dx = 2x + 3 = 7 at x=2
print(f"dy/dx: {dy_dx.numpy()}")
```

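**Note:** PyTorch's autograd records operations automatically for any tensor with `requires_grad=True`, whereas TensorFlow records them only inside an explicit `GradientTape` context (which watches trainable `tf.Variable`s by default).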
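
The value returned by the tape can be sanity-checked without TensorFlow. The sketch below applies a central finite difference to the same polynomial, in the spirit of the gradient checking from Notebook 051. The helper name `central_difference` and the step size `h` are illustrative choices.

```python
def f(x):
    return x**2 + 3 * x + 1


def central_difference(func, x, h=1e-5):
    """Numerical derivative: (f(x + h) - f(x - h)) / (2h)."""
    return (func(x + h) - func(x - h)) / (2 * h)


numeric_grad = central_difference(f, 2.0)
analytic_grad = 2 * 2.0 + 3  # dy/dx = 2x + 3

print(f"numeric: {numeric_grad:.6f}, analytic: {analytic_grad:.6f}")
```

Agreement to several decimal places is what you would expect between the numeric estimate and `tape.gradient(y, x)`.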

---

### **3. Building Models: Sequential API**

**Sequential API:** Simplest way to build models (linear stack of layers).

```python
from tensorflow import keras
from tensorflow.keras import layers

# Define model
model = keras.Sequential([
    keras.Input(shape=(50,)),                                 # 50 input features (preferred over input_shape=)
    layers.Dense(128, activation='relu'),                     # Input → 128
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),                      # 128 → 64
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),                      # 64 → 32
    layers.Dense(1, activation='sigmoid')                     # 32 → 1
], name='yield_predictor')

# Summary
model.summary()

# Count parameters
total_params = model.count_params()
print(f"Total parameters: {total_params:,}")
```

**Advantages:**
- ✅ Concise and readable
- ✅ Easy to understand for beginners
- ✅ Automatic input shape inference

**Limitations:**
- ❌ No branching or multiple inputs/outputs
- ❌ No skip connections (ResNet-style)

---

### **4. Building Models: Functional API**

**Functional API:** More flexible (multiple inputs/outputs, branching, skip connections).

```python
# Define input
inputs = keras.Input(shape=(50,), name='input_features')

# Layer 1
x = layers.Dense(128, activation='relu', name='dense1')(inputs)
x = layers.BatchNormalization(name='bn1')(x)
x = layers.Dropout(0.3, name='dropout1')(x)

# Layer 2
x = layers.Dense(64, activation='relu', name='dense2')(x)
x = layers.BatchNormalization(name='bn2')(x)
x = layers.Dropout(0.2, name='dropout2')(x)

# Layer 3
x = layers.Dense(32, activation='relu', name='dense3')(x)

# Output layer
outputs = layers.Dense(1, activation='sigmoid', name='output')(x)

# Create model
model = keras.Model(inputs=inputs, outputs=outputs, name='yield_predictor_functional')

model.summary()
```

**Advantages:**
- ✅ Flexible architecture (multiple inputs/outputs)
- ✅ Supports skip connections (ResNet, U-Net)
- ✅ Can extract intermediate layers

**Use cases:**
- Multi-input models (image + text)
- Multi-output models (classification + regression)
- Complex architectures (ResNet, Inception)

---

### **5. Custom Layers and Models**

For advanced use cases, create custom layers by subclassing `keras.layers.Layer`.

```python
class CustomDense(keras.layers.Layer):
    """Custom dense layer with L2 regularization."""
    
    def __init__(self, units, l2_reg=0.01, **kwargs):
        super(CustomDense, self).__init__(**kwargs)
        self.units = units
        self.l2_reg = l2_reg
    
    def build(self, input_shape):
        """Create layer weights."""
        self.w = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer='glorot_uniform',
            trainable=True,
            name='kernel',
            regularizer=keras.regularizers.l2(self.l2_reg)
        )
        self.b = self.add_weight(
            shape=(self.units,),
            initializer='zeros',
            trainable=True,
            name='bias'
        )
    
    def call(self, inputs):
        """Forward pass."""
        return tf.matmul(inputs, self.w) + self.b

# Use custom layer
custom_layer = CustomDense(64, l2_reg=0.01)
```
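
To make the two options used in `CustomDense` concrete, here is a framework-free sketch of what `'glorot_uniform'` and an L2 regularizer compute. It assumes the standard Glorot limit `sqrt(6 / (fan_in + fan_out))` and a penalty of `l2_reg * sum(w**2)`. The layer sizes (128 → 64) and helper names are illustrative.

```python
import math
import random


def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn uniformly from [-limit, +limit]."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def l2_penalty(kernel, l2_reg):
    """Regularization term added to the loss: l2_reg * sum of squared weights."""
    return l2_reg * sum(w * w for row in kernel for w in row)


rng = random.Random(0)
fan_in, fan_out = 128, 64
limit = glorot_uniform_limit(fan_in, fan_out)

# Kernel with the same shape as self.w in build(): (input_dim, units)
kernel = [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]
penalty = l2_penalty(kernel, l2_reg=0.01)

print(f"limit: {limit:.4f}, initial L2 penalty: {penalty:.4f}")
```

In Keras, a regularizer passed to `add_weight` is collected in the layer's `losses` and added to the training loss by `fit()`, so the penalty shrinks as training pushes the weights toward zero.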

**Custom Model (subclass `keras.Model`):**

```python
class YieldPredictor(keras.Model):
    """Custom model with manual forward pass."""
    
    def __init__(self):
        super(YieldPredictor, self).__init__()
        self.dense1 = layers.Dense(128, activation='relu')
        self.bn1 = layers.BatchNormalization()
        self.dropout1 = layers.Dropout(0.3)
        self.dense2 = layers.Dense(64, activation='relu')
        self.bn2 = layers.BatchNormalization()
        self.dropout2 = layers.Dropout(0.2)
        self.dense3 = layers.Dense(32, activation='relu')
        self.output_layer = layers.Dense(1, activation='sigmoid')
    
    def call(self, inputs, training=False):
        """Forward pass (training flag controls dropout/batchnorm)."""
        x = self.dense1(inputs)
        x = self.bn1(x, training=training)
        x = self.dropout1(x, training=training)
        x = self.dense2(x)
        x = self.bn2(x, training=training)
        x = self.dropout2(x, training=training)
        x = self.dense3(x)
        return self.output_layer(x)

model = YieldPredictor()
```

**When to use:**
- ✅ Research: Custom training loops, complex architectures
- ✅ Non-standard forward passes (e.g., Python control flow that depends on the input)
- ❌ Simple models: Use Sequential/Functional API instead
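
The `training` flag passed through `call()` matters because dropout behaves differently in the two modes. A minimal sketch of inverted dropout in plain Python, assuming the usual `1 / (1 - rate)` rescaling of the kept activations (the function below is illustrative, not the Keras implementation):

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training:
        return list(values)  # inference: values pass through unchanged
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10_000

eval_out = dropout(activations, rate=0.3, training=False, rng=rng)
train_out = dropout(activations, rate=0.3, training=True, rng=rng)

kept_fraction = sum(1 for v in train_out if v != 0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print(f"kept: {kept_fraction:.3f}, mean during training: {train_mean:.3f}")
```

The rescaling keeps the expected activation the same in both modes. Batch normalization has an analogous switch (batch statistics in training, moving averages at inference), which is the role `model.train()` and `model.eval()` play in PyTorch.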

---

### **6. Training with model.fit()**

**Keras provides a high-level `fit()` method**, whereas plain PyTorch leaves the training loop to you.

```python
# Compile model (define optimizer, loss, metrics)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall(), keras.metrics.AUC()]
)

# Train model
history = model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,       # Use 20% of training data for validation
    epochs=50,
    batch_size=64,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5),
        keras.callbacks.ModelCheckpoint('best_model_keras.h5', save_best_only=True)
    ],
    verbose=1  # Print progress
)
```

**What `fit()` does automatically:**
- ✅ Batching data
- ✅ Forward/backward pass
- ✅ Gradient computation
- ✅ Parameter updates
- ✅ Validation evaluation
- ✅ Progress logging

**Compare to PyTorch:** `DataLoader` handles batching and autograd computes the gradients, but you write the epoch loop, the validation pass, and the logging yourself.
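
The batching arithmetic behind `fit()` is easy to reproduce. The sketch below assumes 4,000 training rows (80% of 5,000 samples) and relies on the documented Keras behaviour that `validation_split` holds out the trailing fraction of the data before shuffling. The helper names are illustrative.

```python
import math


def keras_split(n_samples, validation_split):
    """Sizes of the fitted part and the held-out trailing part."""
    n_fit = int(n_samples * (1.0 - validation_split))
    return n_fit, n_samples - n_fit


def iterate_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs; the last batch may be smaller."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


n_train_rows = 4000
n_fit, n_val = keras_split(n_train_rows, validation_split=0.2)
keras_steps = math.ceil(n_fit / 64)          # steps per epoch reported by fit()

all_batches = list(iterate_batches(n_train_rows, 64))
pytorch_batches = len(all_batches)           # len(train_loader) without a validation split
last_batch = all_batches[-1]

print(n_fit, n_val, keras_steps, pytorch_batches, last_batch)
```

With these numbers Keras trains on 3,200 rows in 50 steps per epoch, while a PyTorch `DataLoader` over all 4,000 rows yields 63 batches, the last holding only 32 rows. The two runs therefore do not see the same amount of training data.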

---

### **7. Callbacks**

**Callbacks** are objects whose hooks Keras calls at fixed points during training (for monitoring, checkpointing, or early stopping).

```python
# EarlyStopping: Stop when validation loss stops improving
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

# ModelCheckpoint: Save best model
checkpoint = keras.callbacks.ModelCheckpoint(
    'best_model.h5',
    monitor='val_loss',
    save_best_only=True
)

# ReduceLROnPlateau: Reduce learning rate when loss plateaus
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-7
)

# TensorBoard: Logging for visualization
tensorboard = keras.callbacks.TensorBoard(
    log_dir='./logs',
    histogram_freq=1
)

# LearningRateScheduler: Custom LR schedule
def lr_schedule(epoch, lr):
    if epoch > 10:
        lr *= 0.9
    return lr

lr_scheduler = keras.callbacks.LearningRateScheduler(lr_schedule)

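# Use in training (lr_scheduler is left out here: combined with reduce_lr,
# both callbacks would rewrite the learning rate)
history = model.fit(
    X_train_scaled, y_train,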
    validation_split=0.2,
    epochs=50,
    callbacks=[early_stop, checkpoint, reduce_lr, tensorboard]
)
```
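
To see what `lr_schedule` produces, it can be run outside Keras. This sketch assumes the function is invoked once at the start of each epoch with a zero-based epoch index and the current rate, and it starts from a learning rate of 0.001:

```python
def lr_schedule(epoch, lr):
    """Hold the rate through epoch index 10, then decay it by 10% per epoch."""
    if epoch > 10:
        lr *= 0.9
    return lr


lr = 0.001
rates = []
for epoch in range(20):
    lr = lr_schedule(epoch, lr)  # the returned rate becomes the rate for this epoch
    rates.append(lr)

for epoch in (0, 10, 11, 19):
    print(f"epoch {epoch:2d}: lr = {rates[epoch]:.6f}")
```

The decay compounds because each call receives the already reduced rate, so after epoch index `e > 10` the rate is `0.001 * 0.9 ** (e - 10)`.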

**Common callbacks:**
- `EarlyStopping`: Prevent overfitting
- `ModelCheckpoint`: Save best model
- `ReduceLROnPlateau`: Adaptive learning rate
- `TensorBoard`: Visualization (training curves, histograms)
- `CSVLogger`: Save training logs to CSV
- `LambdaCallback`: Custom callback function
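
The interplay of `EarlyStopping` and `ReduceLROnPlateau` comes down to two patience counters. The following is a simplified simulation in plain Python. It ignores options such as `min_delta` and `cooldown` that the real callbacks support, and the function name and loss sequence are made up for illustration.

```python
def simulate_callbacks(val_losses, lr=1e-3, stop_patience=10,
                       lr_patience=5, factor=0.5, min_lr=1e-7):
    """Replay a validation-loss history through both patience rules."""
    best, best_epoch = float("inf"), None
    stop_wait = lr_wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # improvement resets both counters
            stop_wait = lr_wait = 0
            continue
        stop_wait += 1
        lr_wait += 1
        if lr_wait >= lr_patience:           # plateau: shrink the learning rate
            lr = max(lr * factor, min_lr)
            lr_wait = 0
        if stop_wait >= stop_patience:       # no progress for too long: stop
            return {"stopped_epoch": epoch, "best_epoch": best_epoch, "lr": lr}
    return {"stopped_epoch": None, "best_epoch": best_epoch, "lr": lr}


# Hypothetical history: improves for three epochs, then stalls.
history_losses = [1.0, 0.8, 0.7] + [0.75] * 12
result = simulate_callbacks(history_losses)
print(result)
```

In this run the best epoch is index 2, the learning rate is halved twice during the plateau, and training stops at epoch index 12. With `restore_best_weights=True`, Keras would then roll the model back to the weights from the best epoch.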

---

### **8. Model Evaluation and Prediction**

```python
# Evaluate on test set
test_loss, test_acc, test_precision, test_recall, test_auc = model.evaluate(
    X_test_scaled, y_test, verbose=0
)
print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test AUC: {test_auc:.4f}")

# Predict on new data
predictions = model.predict(X_test_scaled)  # Returns probabilities
pred_classes = (predictions > 0.5).astype(int)  # Convert to binary
```

---

### **9. Model Saving and Loading**

#### **A. SavedModel Format (Recommended for Production)**

```python
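# Save entire model (architecture + weights + optimizer state)
# TF <= 2.15 (Keras 2): a path without an extension writes a SavedModel directory.
# TF >= 2.16 (Keras 3): this call is rejected; use model.save('my_model.keras') for the
# native format, or model.export('my_model') for an inference-only SavedModel.
model.save('my_model')  # Creates directory with all model files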

# Load
loaded_model = keras.models.load_model('my_model')
```

#### **B. HDF5 Format (Legacy)**

```python
# Save
model.save('my_model.h5')

# Load
loaded_model = keras.models.load_model('my_model.h5')
```

#### **C. Save Only Weights**

```python
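# Save weights (Keras 3 requires the filename to end in '.weights.h5')
model.save_weights('model.weights.h5')

# Load weights (requires model to be built first)
model = create_model()  # Placeholder: your own function that rebuilds the same architecture
model.load_weights('model.weights.h5')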
```

---

### **🎯 Keras vs PyTorch: Key Differences**

| Feature | PyTorch | TensorFlow/Keras |
|---------|---------|------------------|
| **Training** | Manual loop | `model.fit()` (automatic) |
| **API Style** | Imperative (define-by-run) | Eager by default in TF 2.x; define-then-run graphs via `tf.function` (or TF 1.x) |
| **Debugging** | Easier (standard Python debugger) | Harder (graph mode), easier in eager mode |
| **Production** | TorchServe, ONNX | TensorFlow Serving (mature), TF Lite (mobile) |
| **Learning Curve** | Moderate | Easy (Keras), hard (TF 1.x) |
| **Flexibility** | High (custom training loops) | High (Functional API, subclassing) |
| **Callbacks** | Manual implementation | Built-in (EarlyStopping, Checkpoints) |
| **Device Management** | `.to(device)` | Automatic (CPU/GPU detection) |

---

### **🚀 When to Use Keras vs PyTorch?**

**Use Keras/TensorFlow:**
- ✅ Quick prototyping with `fit()` API
- ✅ Production deployment (TF Serving, TF Lite)
- ✅ Mobile/Edge devices (TF Lite, TF.js)
- ✅ Enterprise adoption (mature ecosystem)
- ✅ Beginners (simpler API)

**Use PyTorch:**
- ✅ Research and experimentation
- ✅ Custom training loops (reinforcement learning, GANs)
- ✅ Debugging-heavy workflows
- ✅ Dynamic architectures (RNNs with variable length)
- ✅ Pythonic coding style

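**Rule of thumb:** Learn both. A common workflow is to prototype in PyTorch and export to ONNX (or serve with TorchServe) for production, or to stay in TensorFlow when TF Serving or TF Lite is the deployment target.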

---

**Next:** We'll implement the **same semiconductor yield predictor** in TensorFlow/Keras and compare with PyTorch! 🔥

### 📝 Implementation

**Purpose:** Core implementation with detailed code

**Key implementation details below.**

In [None]:
"""
TensorFlow/Keras Complete Example: Semiconductor Yield Prediction
Same architecture as PyTorch for comparison:
    Input(50) → Dense(128, ReLU) + BatchNorm + Dropout(0.3)
              → Dense(64, ReLU) + BatchNorm + Dropout(0.2)
              → Dense(32, ReLU)
              → Output(1, Sigmoid)
Goal: Compare Keras's high-level API to PyTorch's manual training loop.
"""
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, confusion_matrix
import time
# Set random seed
tf.random.set_seed(42)
np.random.seed(42)
print("="*80)
print("TensorFlow/Keras Semiconductor Yield Prediction Model")
print("="*80)
# Check GPU
print(f"\nTensorFlow version: {tf.__version__}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
if len(tf.config.list_physical_devices('GPU')) > 0:
    print("GPU will be used automatically")
# -----------------------------------------------------------------------------
# 1. Data Generation (Same as PyTorch)
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("1. GENERATING DATA (Same as PyTorch)")
print("="*80)
def generate_semiconductor_data(n_samples, n_features):
    """Same data generation function as PyTorch example."""
    # Correlated features: a few latent process factors plus per-feature noise
    n_factors = 5
    latent_factors = np.random.randn(n_samples, n_factors)
    feature_weights = np.random.randn(n_factors, n_features) * 0.5
    X = latent_factors @ feature_weights + np.random.randn(n_samples, n_features) * 0.3
    
    critical_features = [0, 10, 20, 30, 40]
    device_score = X[:, critical_features].sum(axis=1)
    device_score += 0.1 * (X[:, 5] * X[:, 15])
    device_score -= 0.2 * np.abs(X[:, 25])
    
    threshold = np.median(device_score)
    y = (device_score > threshold).astype(int)
    
    flip_indices = np.random.choice(n_samples, size=int(0.02 * n_samples), replace=False)
    y[flip_indices] = 1 - y[flip_indices]
    
    return X, y
n_samples = 5000
n_features = 50
X, y = generate_semiconductor_data(n_samples, n_features)
print(f"Dataset shape: {X.shape}")
print(f"Target distribution: {np.bincount(y)} (0=fail, 1=pass)")
print(f"Yield rate: {y.mean()*100:.2f}%")


### 📝 Implementation Part 2

**Purpose:** Continue implementation

**Key implementation details below.**

In [None]:
# -----------------------------------------------------------------------------
# 2. Data Preprocessing
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("2. DATA PREPROCESSING")
print("="*80)
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Note: No need to convert to tensors or move to GPU (Keras handles automatically)
print(f"Data ready for Keras (NumPy arrays)")
# -----------------------------------------------------------------------------
# 3. Define Model Architecture (Sequential API)
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("3. MODEL ARCHITECTURE (Sequential API)")
print("="*80)
model_sequential = keras.Sequential([
    keras.Input(shape=(50,)),  # preferred over passing input_shape= to the first layer
    layers.Dense(128, activation='relu', name='dense1'),
    layers.BatchNormalization(name='bn1'),
    layers.Dropout(0.3, name='dropout1'),
    
    layers.Dense(64, activation='relu', name='dense2'),
    layers.BatchNormalization(name='bn2'),
    layers.Dropout(0.2, name='dropout2'),
    
    layers.Dense(32, activation='relu', name='dense3'),
    
    layers.Dense(1, activation='sigmoid', name='output')
], name='semiconductor_yield_predictor')
model_sequential.summary()  # summary() prints the table itself and returns None
total_params = model_sequential.count_params()
print(f"\nTotal parameters: {total_params:,}")
# -----------------------------------------------------------------------------
# 4. Alternative: Functional API (More Flexible)
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("4. ALTERNATIVE: FUNCTIONAL API")
print("="*80)
# Define input
inputs = keras.Input(shape=(50,), name='input_features')
# Layer 1
x = layers.Dense(128, activation='relu', name='dense1_func')(inputs)
x = layers.BatchNormalization(name='bn1_func')(x)
x = layers.Dropout(0.3, name='dropout1_func')(x)
# Layer 2
x = layers.Dense(64, activation='relu', name='dense2_func')(x)
x = layers.BatchNormalization(name='bn2_func')(x)
x = layers.Dropout(0.2, name='dropout2_func')(x)
# Layer 3
x = layers.Dense(32, activation='relu', name='dense3_func')(x)
# Output
outputs = layers.Dense(1, activation='sigmoid', name='output_func')(x)
# Create model
model_functional = keras.Model(inputs=inputs, outputs=outputs, name='yield_predictor_functional')
model_functional.summary()  # summary() prints the table itself and returns None
# Use Functional API for rest of the example (more flexible)
model = model_functional
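
The total reported by `summary()` will not match the PyTorch count for the same architecture, because Keras also counts the non-trainable moving mean and variance of each `BatchNormalization` layer. A small sketch of the arithmetic, assuming the layer sizes defined above (the helper name is illustrative):

```python
def dense_params(n_in, n_out):
    """Kernel (n_in * n_out) plus bias (n_out)."""
    return n_in * n_out + n_out


batchnorm_widths = [128, 64]  # the two layers followed by BatchNormalization

trainable = (
    dense_params(50, 128)
    + dense_params(128, 64)
    + dense_params(64, 32)
    + dense_params(32, 1)
    + sum(2 * n for n in batchnorm_widths)   # gamma and beta
)
non_trainable = sum(2 * n for n in batchnorm_widths)  # moving mean and variance

print(f"trainable: {trainable:,}")
print(f"non-trainable: {non_trainable:,}")
print(f"total: {trainable + non_trainable:,}")
```

The trainable figure is the one to compare against `sum(p.numel() for p in model.parameters())` in PyTorch, where the running statistics are stored as buffers rather than parameters.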


### 📝 Implementation Part 3

**Purpose:** Continue implementation

**Key implementation details below.**

In [None]:
# -----------------------------------------------------------------------------
# 5. Compile Model (Define Optimizer, Loss, Metrics)
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("5. COMPILE MODEL")
print("="*80)
model.compile(
    # Unlike the PyTorch run (weight_decay=0.01), no weight decay is applied here
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc')
    ]
)
print("Model compiled with:")
print("  Optimizer: Adam (lr=0.001)")
print("  Loss: binary_crossentropy")
print("  Metrics: accuracy, precision, recall, AUC")
# -----------------------------------------------------------------------------
# 6. Define Callbacks
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("6. DEFINE CALLBACKS")
print("="*80)
# Early stopping
early_stop = callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True,
    verbose=1
)
# Model checkpoint
checkpoint = callbacks.ModelCheckpoint(
    'best_model_keras.h5',
    monitor='val_loss',
    save_best_only=True,
    verbose=0
)
# Reduce learning rate on plateau
reduce_lr = callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-7,
    verbose=1
)
# TensorBoard (optional)
# tensorboard_cb = callbacks.TensorBoard(log_dir='./logs', histogram_freq=1)
callback_list = [early_stop, checkpoint, reduce_lr]
print(f"Callbacks: EarlyStopping, ModelCheckpoint, ReduceLROnPlateau")
# -----------------------------------------------------------------------------
# 7. Train Model (One Line!)
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("7. TRAINING MODEL")
print("="*80)
print("Training for up to 50 epochs (with early stopping)...")
print("Compare to PyTorch: Keras fit() handles batching, forward/backward pass, optimization automatically\n")
start_time = time.time()
history = model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,  # Use 20% of training data for validation
    epochs=50,
    batch_size=64,
    callbacks=callback_list,
    verbose=1  # Print progress bar
)
training_time = time.time() - start_time
print(f"\nTraining completed in {training_time:.2f} seconds")


### 📝 Implementation Part 4

**Purpose:** Continue implementation

**Key implementation details below.**

In [None]:
# -----------------------------------------------------------------------------
# 8. Evaluation
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("8. MODEL EVALUATION")
print("="*80)
# Evaluate on test set
test_results = model.evaluate(X_test_scaled, y_test, verbose=0)
test_loss = test_results[0]
test_acc = test_results[1]
test_precision = test_results[2]
test_recall = test_results[3]
test_auc = test_results[4]
print(f"Test Loss:      {test_loss:.4f}")
print(f"Test Accuracy:  {test_acc:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall:    {test_recall:.4f}")
print(f"Test AUC-ROC:   {test_auc:.4f}")
# Predictions
test_probs = model.predict(X_test_scaled, verbose=0)
test_preds = (test_probs > 0.5).astype(int).ravel()  # Flatten (N, 1) to (N,) for sklearn
# Confusion matrix
cm = confusion_matrix(y_test, test_preds)
print(f"\nConfusion Matrix:")
print(f"                Predicted")
print(f"              Fail   Pass")
print(f"Actual Fail   {cm[0,0]:4d}  {cm[0,1]:4d}")
print(f"       Pass   {cm[1,0]:4d}  {cm[1,1]:4d}")
# Business metrics
false_positives = cm[0, 1]
false_negatives = cm[1, 0]
print(f"\nBusiness Impact:")
print(f"  False Positives (bad dies shipped): {false_positives} (~${false_positives * 50_000:,} potential loss)")
print(f"  False Negatives (good dies scrapped): {false_negatives} (~${false_negatives * 1_000:,} loss)")
# -----------------------------------------------------------------------------
# 9. Visualizations
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("9. VISUALIZATIONS")
print("="*80)
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Training curves (loss)
axes[0].plot(history.history['loss'], label='Train Loss', linewidth=2)
axes[0].plot(history.history['val_loss'], label='Val Loss', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training and Validation Loss (Keras)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Training curves (accuracy)
axes[1].plot(history.history['accuracy'], label='Train Accuracy', linewidth=2)
axes[1].plot(history.history['val_accuracy'], label='Val Accuracy', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training and Validation Accuracy (Keras)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('keras_training_curves.png', dpi=150, bbox_inches='tight')
print("Saved: keras_training_curves.png")
plt.show()


### 📝 Implementation Part 5

**Purpose:** Continue implementation

**Key implementation details below.**

# -----------------------------------------------------------------------------
# 10. Model Saving and Loading
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("10. MODEL SAVING & LOADING")
print("="*80)
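# Save entire model (architecture + weights + optimizer state)
# Note: a path without an extension writes a SavedModel directory in TF 2.x (Keras 2);
# Keras 3 requires a filename ending in `.keras` (or `.h5`) instead.
model.save('semiconductor_yield_model_keras')
print("Saved model to: semiconductor_yield_model_keras/")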
# Load model
loaded_model = keras.models.load_model('semiconductor_yield_model_keras')
print("Loaded model successfully")
# Verify loaded model works
loaded_predictions = loaded_model.predict(X_test_scaled[:5], verbose=0)
print(f"\nPredictions from loaded model (first 5 samples):")
for i, pred in enumerate(loaded_predictions):
    status = "PASS" if pred[0] > 0.5 else "FAIL"
    print(f"  Device {i+1}: {pred[0]:.4f} → {status}")
# -----------------------------------------------------------------------------
# 11. Production Inference Example
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("11. PRODUCTION INFERENCE")
print("="*80)
# Simulate new test data (5 devices)
new_devices = np.random.randn(5, 50)
new_devices_scaled = scaler.transform(new_devices)
# Inference (batch prediction)
predictions = model.predict(new_devices_scaled, verbose=0)
print("New device predictions (yield probability):")
for i, pred in enumerate(predictions):
    status = "PASS" if pred[0] > 0.5 else "FAIL"
    confidence = pred[0] if pred[0] > 0.5 else 1 - pred[0]
    print(f"  Device {i+1}: {pred[0]:.4f} → {status} (confidence: {confidence:.2%})")
# -----------------------------------------------------------------------------
# 12. Extract Intermediate Layer Outputs (Feature Extraction)
# -----------------------------------------------------------------------------
print("\n" + "="*80)
print("12. FEATURE EXTRACTION (Intermediate Layers)")
print("="*80)
# Create model that outputs intermediate layer
layer_name = 'dense3_func'  # Third hidden layer
intermediate_model = keras.Model(
    inputs=model.input,
    outputs=model.get_layer(layer_name).output
)
# Extract features
features = intermediate_model.predict(X_test_scaled[:10], verbose=0)
print(f"Extracted features from layer '{layer_name}':")
print(f"Shape: {features.shape}")  # (10, 32) - 32 neurons in dense3
print(f"First sample features (first 10 values):\n{features[0, :10]}")
print("\n" + "="*80)
print("Keras Model Training Complete!")
print("="*80)
print("\nKey Observations:")
print("  ✅ Model definition: ~20 lines (vs ~80 in PyTorch)")
print("  ✅ Training: Single fit() call (vs manual loop in PyTorch)")
print("  ✅ Callbacks: Built-in (vs manual implementation in PyTorch)")
print("  ✅ GPU: Automatic detection (vs explicit .to(device) in PyTorch)")
print("  ✅ Metrics: Tracked automatically (vs manual computation in PyTorch)")
print("\n  Trade-off: Less flexibility for custom training logic")
print("="*80)


## 🔥 Part 3: Framework Comparison & ONNX Conversion

### **Side-by-Side Comparison: PyTorch vs Keras**

Now that we've built the **same model** in both frameworks, let's compare them systematically.

---

### **1. Code Comparison**

#### **Model Definition**

**PyTorch:**
```python
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(50, 128)
        self.bn1 = nn.BatchNorm1d(128)
        # ... more layers
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)
        return x
```

**Keras:**
```python
model = keras.Sequential([
    layers.Dense(128, input_shape=(50,)),
    layers.BatchNormalization(),
    # ... more layers
])
```

**Winner:** Keras (simpler, less boilerplate)

---

#### **Training**

**PyTorch:**
```python
for epoch in range(num_epochs):
    for batch_X, batch_y in train_loader:
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

**Keras:**
```python
model.fit(X_train, y_train, epochs=50, validation_split=0.2, callbacks=[...])
```

**Winner:** Keras (automatic batching, validation, metrics)

---

#### **Callbacks**

**PyTorch:** Manual implementation required
**Keras:** Built-in (`EarlyStopping`, `ModelCheckpoint`, `ReduceLROnPlateau`)

**Winner:** Keras
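
To make "manual implementation" concrete, here is a minimal sketch of the early-stopping logic one might call once per epoch inside a PyTorch training loop. The class name `EarlyStopper` and its parameters are illustrative assumptions, not part of PyTorch. It mirrors the `patience` idea of Keras `EarlyStopping` but does not restore the best weights.

```python
class EarlyStopper:
    """Hypothetical helper: stop when the monitored loss stops improving."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage with a made-up validation-loss history
stopper = EarlyStopper(patience=2)
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.60]
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
```

In a real loop, `val_loss` would come from a validation pass, and a checkpoint would typically be saved whenever `best_loss` improves.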

---

#### **Device Management**

**PyTorch:**
```python
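device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)  # Modules are moved in place; reassigning is harmless
X = X.to(device)          # Tensors are NOT moved in place: keep the returned tensor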
```

**Keras:**
```python
# Automatic GPU detection
```

**Winner:** Keras (automatic)

---

#### **Flexibility**

**PyTorch:** Full control over training loop (better for research)  
**Keras:** Less flexibility, but can use custom training loops if needed

**Winner:** PyTorch (for research), Keras (for production)

---

### **2. Performance Comparison**

Approximate figures for the model above on the same task (exact numbers vary with hardware, library versions, and batch size):

| Metric | PyTorch | Keras | Notes |
|--------|---------|-------|-------|
| **Training Time** | ~15-20s | ~12-18s | Ranges overlap; both frameworks run on optimized C++/CUDA backends |
| **Inference Time (CPU)** | ~2-3ms/sample | ~2-4ms/sample | Similar |
| **Inference Time (GPU)** | ~0.1ms/sample | ~0.1ms/sample | Similar |
| **Memory Usage** | ~150MB | ~180MB | PyTorch slightly more efficient |
| **Model Size** | ~500KB | ~520KB | Similar |

**Verdict:** Performance is comparable for most tasks. Differences are negligible for production.
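
Because such numbers are hardware-dependent, it is worth measuring them on your own machine. The following is a minimal timing sketch, not the benchmark used for the table: `measure_latency_ms` and `dummy_predict` are hypothetical names, and `dummy_predict` stands in for a real call such as `model.predict` or a PyTorch forward pass.

```python
import statistics
import time


def measure_latency_ms(predict_fn, batch, warmup=3, repeats=20):
    """Return the median wall-clock latency of predict_fn(batch) in milliseconds."""
    for _ in range(warmup):           # warm-up runs absorb one-time setup costs
        predict_fn(batch)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(batch)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload; replace with a real model call
def dummy_predict(batch):
    return [sum(row) for row in batch]


batch = [[0.1] * 50 for _ in range(64)]
latency_ms = measure_latency_ms(dummy_predict, batch)
per_sample_ms = latency_ms / len(batch)
```

When timing GPU code in PyTorch, call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels launch asynchronously.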

---

### **3. ONNX: Universal Model Format**

**ONNX (Open Neural Network Exchange)** is an open model format: a model trained in one framework can be exported to ONNX and executed by a different runtime.

**Use case:** Train in PyTorch (research), deploy with ONNX Runtime (production).

#### **A. Export PyTorch to ONNX**

```python
import torch.onnx

# Export PyTorch model
model.eval()  # Disable dropout and use BatchNorm running statistics during export
dummy_input = torch.randn(1, 50).to(device)
torch.onnx.export(
    model,                          # PyTorch model
    dummy_input,                    # Sample input
    "model.onnx",                   # Output file
    export_params=True,             # Include weights
    opset_version=13,               # ONNX operator set (opset) version
    input_names=['input'],          # Input name
    output_names=['output'],        # Output name
    dynamic_axes={                  # Variable batch size
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
```

#### **B. Export Keras to ONNX**

```python
import tf2onnx

# Export Keras model
spec = (tf.TensorSpec((None, 50), tf.float32, name="input"),)
output_path = "model_keras.onnx"

model_proto, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)
with open(output_path, "wb") as f:
    f.write(model_proto.SerializeToString())
```

#### **C. Load and Infer with ONNX Runtime**

```python
import onnxruntime as ort

# Load ONNX model
session = ort.InferenceSession("model.onnx")

# Inference
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

X_sample = np.random.randn(10, 50).astype(np.float32)
predictions = session.run([output_name], {input_name: X_sample})[0]

print(f"Predictions shape: {predictions.shape}")
```

**ONNX Advantages:**
- ✅ Framework-agnostic deployment
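- ✅ Optimized inference (often faster than native framework inference; the summary table later in this notebook estimates 2-5×, and gains vary by model and hardware)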
- ✅ Supports edge devices (mobile, IoT)
- ✅ Cloud deployment (Azure ML, AWS SageMaker)

---

### **4. When to Use Which Framework?**

#### **Use PyTorch If:**
- ✅ Research and experimentation (PyTorch is the most widely used framework in recent ML papers)
- ✅ Custom training loops (GANs, reinforcement learning)
- ✅ Dynamic architectures (variable-length sequences)
- ✅ Debugging is critical (Pythonic, easier to debug)
- ✅ Working with HuggingFace Transformers

#### **Use TensorFlow/Keras If:**
- ✅ Production deployment (TF Serving, TF Lite)
- ✅ Mobile/Edge devices (TF Lite superior to PyTorch Mobile)
- ✅ Quick prototyping (fit() API faster development)
- ✅ Enterprise adoption (Google ecosystem, Vertex AI)
- ✅ Team prefers high-level API

#### **Best Practice:**
1. **Prototype in PyTorch** (faster iteration, easier debugging)
2. **Convert to ONNX** (framework-agnostic)
3. **Deploy with ONNX Runtime** (optimized inference)

---

### **5. Production Deployment Options**

#### **PyTorch Deployment**

**TorchServe:**
```bash
# Install TorchServe
pip install torchserve torch-model-archiver

# Archive model
torch-model-archiver --model-name yield_predictor \
    --version 1.0 \
    --model-file model.py \
    --serialized-file model.pth \
    --handler custom_handler.py

# Start server
torchserve --start --model-store model_store --models yield_predictor=yield_predictor.mar
```

**Docker:**
```dockerfile
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
COPY model.pth /app/
COPY inference.py /app/
CMD ["python", "/app/inference.py"]
```

---

#### **TensorFlow Deployment**

**TensorFlow Serving:**
```bash
# Run in Python first: save the model in SavedModel format under a version directory
#   model.save('saved_model/1/')   # Keras 3: use model.export('saved_model/1/')

# Run TF Serving
docker run -p 8501:8501 \
    --mount type=bind,source=/path/to/saved_model,target=/models/yield_predictor \
    -e MODEL_NAME=yield_predictor \
    -t tensorflow/serving
```

**TF Lite (Mobile):**
```python
# Convert to TF Lite
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # Quantization
tflite_model = converter.convert()

# Save
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
```

---

#### **ONNX Runtime (Universal)**

```python
# Deploy on any platform
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
# ONNX Runtime expects the dtype the model was exported with (float32 here)
predictions = session.run(None, {"input": X_test.astype("float32")})[0]
```

**Deployment targets:**
- Azure ML, AWS SageMaker, Google Vertex AI
- Mobile (iOS, Android via ONNX Runtime Mobile)
- Edge devices (NVIDIA Jetson, Raspberry Pi)
- Web browsers (ONNX Runtime Web, the successor to ONNX.js)

---

### **6. Framework Ecosystem Comparison**

| Tool/Library | PyTorch | TensorFlow | Purpose |
|--------------|---------|------------|---------|
| **Pre-trained Models** | torchvision, timm, HuggingFace | tf.keras.applications, TF Hub | Transfer learning |
| **Visualization** | TensorBoard (PyTorch 1.2+) | TensorBoard (native) | Training monitoring |
| **Serving** | TorchServe | TensorFlow Serving | Production inference |
| **Mobile** | PyTorch Mobile | TF Lite | On-device inference |
| **Distributed Training** | PyTorch DDP, Lightning | tf.distribute | Multi-GPU/multi-node |
| **AutoML** | Ray Tune, Optuna | AutoKeras, Keras Tuner | Hyperparameter tuning |
| **Model Zoo** | PyTorch Hub | TensorFlow Model Garden | Pre-trained models |
| **Edge Deployment** | ONNX Runtime | TF Lite, Edge TPU | IoT, embedded systems |

---

### **🎯 Key Takeaways: Framework Choice**

**For Semiconductor Testing:**
- **Research phase:** PyTorch (flexibility for novel architectures)
- **Production phase:** Convert to ONNX, deploy with ONNX Runtime
- **Edge devices:** TensorFlow Lite (better support for custom hardware)
- **Cloud deployment:** Both work well (choose based on team expertise)

**General guideline:**
- **Learning:** Start with Keras (easier), then learn PyTorch (flexibility)
- **Career:** Learn both (most companies use both)
- **Projects:** Choose based on deployment target, not training preferences

---

**Next:** We'll explore GPU acceleration, mixed precision training, and distributed training techniques! 🚀

## 🚀 Part 4: GPU Acceleration & Advanced Training

### **1. GPU Acceleration Fundamentals**

**Why GPUs matter for deep learning:**
- **Parallelism:** GPUs have 1000s of cores vs CPUs with ~10 cores
- **Speedup:** 10-100× faster for large models
- **Matrix operations:** Optimized for neural network computations

**When GPU helps most:**
- Large batch sizes (≥32)
- Large models (≥1M parameters)
- CNNs, RNNs, Transformers (matrix-heavy operations)

**When GPU doesn't help:**
- Small models (<100K parameters)
- Small datasets (<1000 samples)
- CPU-bound operations (data loading, preprocessing)

---

### **2. Mixed Precision Training**

**Mixed precision** uses both FP16 (16-bit) and FP32 (32-bit) floating point to:
- **Speed up training:** 2-3× faster (Tensor Cores on NVIDIA GPUs)
- **Reduce memory:** 2× less GPU memory
- **Maintain accuracy:** Critical operations stay in FP32

#### **PyTorch Mixed Precision**

```python
from torch.cuda.amp import autocast, GradScaler

model = Model().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scaler = GradScaler()  # Gradient scaling to prevent underflow

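for epoch in range(num_epochs):
    for batch_X, batch_y in train_loader:
        # Move each batch to the same device as the model
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        
        # Forward pass with autocast
        # (nn.BCELoss is not autocast-safe; prefer nn.BCEWithLogitsLoss on raw logits)
        with autocast():
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)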
        
        # Backward pass with gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```

**Key components:**
- `autocast()`: Automatically casts operations to FP16 where safe
- `GradScaler`: Scales gradients to prevent underflow in FP16
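
To see why gradient scaling matters, the sketch below rounds a small value to half precision using only the standard library. The gradient value and the scale factor are chosen for illustration; real scalers adjust the factor dynamically during training.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad = 1e-8                       # a small gradient value (illustrative)
scale = 65536.0                   # loss-scale factor 2**16 (illustrative)

unscaled = to_fp16(grad)          # too small for FP16: underflows to zero
scaled = to_fp16(grad * scale)    # representable once scaled up
recovered = scaled / scale        # unscale again in higher precision
```

Here `unscaled` is exactly `0.0`, while `recovered` stays within about 0.1% of the original value. That is the effect the scale-then-unscale steps in the training loop rely on.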

---

#### **TensorFlow Mixed Precision**

```python
from tensorflow.keras import mixed_precision

# Enable mixed precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

# Build model (under this policy, layers compute in FP16 and keep variables in FP32)
model = keras.Sequential([
    layers.Dense(128, activation='relu'),                   # Computes in FP16
    layers.Dense(64, activation='relu'),                    # Computes in FP16
    layers.Dense(1, activation='sigmoid', dtype='float32')  # Keep output in FP32 for stability
])

# model.compile() applies loss scaling automatically under the mixed_float16 policy;
# wrap the optimizer in mixed_precision.LossScaleOptimizer only for custom training loops
optimizer = keras.optimizers.Adam()

model.compile(optimizer=optimizer, loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=50)
```

**Benefits:**
- 2-3× faster training on modern GPUs (V100, A100)
- 50% less memory usage
- Negligible accuracy loss (<0.1% typical)

---

### **3. Distributed Training (Multi-GPU)**

When one GPU isn't enough, train across multiple GPUs or machines.

#### **Data Parallelism**

Split batch across GPUs, each GPU processes subset, gradients are averaged.
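
The arithmetic behind this can be checked without any GPU. The sketch below uses a one-parameter linear model with made-up data and simulates four workers with equal-sized shards. Under that assumption, the average of the per-shard gradients equals the full-batch gradient.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

# Full-batch gradient on a single device
full_grad = mse_gradient(w, xs, ys)

# Simulate 4 workers, each holding an equal-sized shard of the batch
num_workers = 4
shard_size = len(xs) // num_workers
shard_grads = [
    mse_gradient(w, xs[i:i + shard_size], ys[i:i + shard_size])
    for i in range(0, len(xs), shard_size)
]

# "All-reduce" step: average the per-worker gradients
averaged_grad = sum(shard_grads) / num_workers
```

With unequal shard sizes, a plain average no longer matches the full-batch gradient. That is one reason distributed samplers pad or drop samples so that every worker sees the same number.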

**PyTorch Distributed Data Parallel (DDP):**

```python
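import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group (one process per GPU)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Wrap model with DDP
model = Model().to(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop (same as single GPU, except that train_loader should be built with
# torch.utils.data.distributed.DistributedSampler so each process gets its own shard)
for epoch in range(num_epochs):
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(local_rank), batch_y.to(local_rank)
        outputs = model(batch_X)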
        loss = criterion(outputs, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

**Launch command:**
```bash
torchrun --nproc_per_node=4 train.py  # 4 GPUs on one machine
```

---

**TensorFlow Distributed Strategy:**

```python
# Create strategy
strategy = tf.distribute.MirroredStrategy()  # Synchronous multi-GPU

# Build model within strategy scope
with strategy.scope():
    model = keras.Sequential([...])
    model.compile(optimizer='adam', loss='binary_crossentropy')

# Train (automatically distributed)
model.fit(X_train, y_train, epochs=50, batch_size=64)
```

**Multi-machine (multi-node):**
```python
strategy = tf.distribute.MultiWorkerMirroredStrategy()
```

---

#### **Model Parallelism**

Split model across GPUs when model is too large for one GPU.

**PyTorch:**
```python
class ParallelModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1000, 5000).to('cuda:0')  # GPU 0
        self.layer2 = nn.Linear(5000, 5000).to('cuda:1')  # GPU 1
        self.layer3 = nn.Linear(5000, 10).to('cuda:2')    # GPU 2
    
    def forward(self, x):
        x = self.layer1(x.to('cuda:0'))
        x = self.layer2(x.to('cuda:1'))
        x = self.layer3(x.to('cuda:2'))
        return x
```

**Use case:** Very large models (e.g., GPT-3-scale language models) whose parameters do not fit in a single GPU's memory.

---

### **4. Hyperparameter Tuning**

Finding optimal hyperparameters (learning rate, batch size, architecture).
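
Before reaching for a tuning library, it helps to see the core idea: sample configurations, score each one, keep the best. The sketch below is a plain random search using the standard library only. `toy_objective` is a placeholder for a real validation loss, and the learning rate is sampled on a log scale because useful values span several orders of magnitude.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    return {
        "lr": sample_loguniform(1e-5, 1e-2, rng),
        "hidden_size": rng.choice([64, 128, 256]),
        "batch_size": rng.choice([32, 64, 128]),
    }


def toy_objective(config):
    """Placeholder for a real validation loss; smallest near lr = 1e-3."""
    return (math.log10(config["lr"]) + 3.0) ** 2


rng = random.Random(0)                       # fixed seed for repeatability
trials = [sample_config(rng) for _ in range(20)]
best_config = min(trials, key=toy_objective)
```

Libraries such as Ray Tune and Keras Tuner add smarter scheduling, for example stopping poor trials early, on top of this basic loop.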

#### **PyTorch + Ray Tune**

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_model(config):
    model = Model(hidden_size=config['hidden_size'])
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
    
    for epoch in range(10):
        # Training loop
        train_loss = train_epoch(model, optimizer, train_loader)
        val_loss = validate(model, val_loader)
        
        # Report to Ray Tune
        tune.report(loss=val_loss)

# Define search space
config = {
    'lr': tune.loguniform(1e-5, 1e-2),
    'hidden_size': tune.choice([64, 128, 256]),
    'batch_size': tune.choice([32, 64, 128])
}

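# Run tuning (legacy `tune.run` API; newer Ray releases use `tune.Tuner` and report a metrics dict)
analysis = tune.run(
    train_model,
    config=config,
    metric='loss',          # Tells ASHAScheduler which reported value to optimize
    mode='min',
    num_samples=20,
    scheduler=ASHAScheduler()
)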

best_config = analysis.get_best_config(metric='loss', mode='min')
```

---

#### **TensorFlow + Keras Tuner**

```python
import keras_tuner as kt

def build_model(hp):
    model = keras.Sequential([
        layers.Dense(
            units=hp.Int('units1', min_value=64, max_value=256, step=64),
            activation='relu'
        ),
        layers.Dropout(hp.Float('dropout', 0.1, 0.5, step=0.1)),
        layers.Dense(1, activation='sigmoid')
    ])
    
    model.compile(
        optimizer=keras.optimizers.Adam(
            hp.Float('lr', 1e-5, 1e-2, sampling='log')
        ),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

# Create tuner
tuner = kt.Hyperband(
    build_model,
    objective='val_accuracy',
    max_epochs=50,
    directory='tuner_results'
)

# Search (Hyperband chooses the number of epochs per trial itself, up to max_epochs)
tuner.search(X_train, y_train, validation_split=0.2)

# Best model
best_model = tuner.get_best_models(num_models=1)[0]
```

---

### **5. Performance Optimization Checklist**

#### **Data Loading**
- ✅ Use `DataLoader` with `num_workers > 0` (PyTorch) or `tf.data` with `num_parallel_calls` and `prefetch` (TensorFlow)
- ✅ Preload data to RAM if possible
- ✅ Use `pin_memory=True` (PyTorch) for faster CPU→GPU transfer

#### **Training**
- ✅ Use mixed precision training (2-3× speedup)
- ✅ Increase batch size to saturate GPU (monitor memory)
- ✅ Use gradient accumulation if GPU memory is limited
- ✅ Enable cuDNN autotuner: `torch.backends.cudnn.benchmark = True`

#### **Model Architecture**
- ✅ Use fused operations where available (e.g., folding BatchNorm into the preceding layer for inference)
- ✅ Avoid frequent CPU↔GPU transfers
- ✅ Use `torch.jit.script` (PyTorch) or `tf.function` (TensorFlow) for graph compilation

#### **Profiling**
- ✅ Use PyTorch Profiler or TensorFlow Profiler to identify bottlenecks
- ✅ Monitor GPU utilization (`nvidia-smi`)
- ✅ Profile data loading separately from training

---

### **6. Semiconductor-Specific Optimizations**

#### **Real-Time Inference Requirements**

**Challenge:** Wafer test needs <10ms inference time for 50K+ die/hour throughput.
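
A quick back-of-the-envelope check shows how the latency target relates to the throughput figure. This sketch assumes dies are processed strictly one after another on a single test site, which is a simplification.

```python
die_per_hour = 50_000
latency_target_ms = 10.0

# Time available per die if dies are processed sequentially
budget_ms_per_die = 3600 * 1000 / die_per_hour

# Fraction of the per-die budget consumed by inference at the target latency
inference_share = latency_target_ms / budget_ms_per_die
```

At 50K die/hour the total budget is 72 ms per die, so a 10 ms model uses roughly 14% of it and leaves the rest for probing, measurement, and data handling.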

**Solutions:**
1. **Model quantization:** INT8 inference (up to ~4× faster, typically ~1% accuracy impact)
   ```python
   # PyTorch quantization
   model_int8 = torch.quantization.quantize_dynamic(
       model, {torch.nn.Linear}, dtype=torch.qint8
   )
   ```

2. **TorchScript compilation:**
   ```python
   scripted_model = torch.jit.script(model)
   scripted_model.save('model_scripted.pt')
   ```

3. **ONNX Runtime with GPU:**
   ```python
   providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
   session = ort.InferenceSession('model.onnx', providers=providers)
   ```

4. **Batch inference:** Process 1000 die at once (10× faster than one-by-one)
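
Batch inference from the list above can be sketched as a small chunking helper. The function names are hypothetical, and `dummy_predict` stands in for a real call such as `session.run` or `model.predict`.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def predict_in_batches(predict_fn, samples, batch_size=1000):
    """Call predict_fn once per batch and concatenate the per-sample outputs."""
    outputs = []
    for batch in iter_batches(samples, batch_size):
        outputs.extend(predict_fn(batch))
    return outputs


# Stand-in for a real model: one score per die
def dummy_predict(batch):
    return [sum(features) / len(features) for features in batch]


dies = [[float(i)] * 50 for i in range(2500)]
scores = predict_in_batches(dummy_predict, dies, batch_size=1000)
num_calls = len(list(iter_batches(dies, 1000)))
```

The 2,500 dies are scored with three model calls instead of 2,500, which is where the per-call overhead savings come from.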

---

#### **Memory-Efficient Training for Large Datasets**

**Challenge:** 1 million die × 50 features × 4 bytes (float32) = 200MB, which is manageable, but datasets with 100M+ samples are common.
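
The footprint estimate is simple arithmetic, sketched here for a dense array. The helper name is hypothetical, and 1 MB is taken as 10^6 bytes.

```python
def dataset_size_mb(num_samples, num_features, bytes_per_value=4):
    """Size of a dense array in megabytes (1 MB = 10**6 bytes)."""
    return num_samples * num_features * bytes_per_value / 1e6


size_1m = dataset_size_mb(1_000_000, 50)         # float32
size_100m = dataset_size_mb(100_000_000, 50)     # float32
size_1m_f64 = dataset_size_mb(1_000_000, 50, bytes_per_value=8)  # float64
```

At 100M samples the same layout needs about 20 GB, which exceeds the memory of many workstations and motivates streaming or memory mapping.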

**Solutions:**
1. **Streaming from disk:**
   ```python
   class STDFDataset(torch.utils.data.Dataset):
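       def __init__(self, file_list):
           self.file_list = file_list

       def __len__(self):
           # Required by DataLoader to know how many samples exist
           return len(self.file_list)

       def __getitem__(self, idx):
           # Load single sample from STDF file on-the-fly
           # (load_stdf_sample is a user-supplied parser, not a library function)
           return load_stdf_sample(self.file_list[idx])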
   ```

2. **Memory-mapped arrays:**
   ```python
   # np.load with mmap_mode parses the .npy header, so shape and dtype come from the file
   X_mmap = np.load('data.npy', mmap_mode='r')
   ```

3. **Gradient checkpointing:** Trade compute for memory (2× less memory, 20% slower)

---

### **🎯 Performance Summary**

| Technique | Speedup | Memory Reduction | Accuracy Impact |
|-----------|---------|------------------|-----------------|
| **Mixed Precision** | 2-3× | 50% | <0.1% |
| **Multi-GPU (4 GPUs)** | 3.5× | - | None |
| **Quantization (INT8)** | 4× | 75% | ~1% |
| **TorchScript** | 1.5× | - | None |
| **ONNX Runtime** | 2-5× | - | None |
| **Batch Inference** | 10× | - | None |

**For semiconductor testing:**
- Training: Mixed precision + multi-GPU → 6-9× faster
- Inference: ONNX Runtime + INT8 quantization → 8-20× faster
- Total: Reduce training from 10 hours → 1-2 hours, inference from 50ms → roughly 2.5-6ms per die (assuming the individual speedups compound)

---

**Next:** Real-world project templates for semiconductor applications! 🎯

## 🎯 Real-World Projects

### **Semiconductor Post-Silicon Validation Projects**

---

#### **Project 1: Wafer Yield Predictor with Production Deployment**

**Objective:** Build end-to-end system predicting die-level yield from parametric tests with real-time inference.

**Business Value:** $50M-$200M/year through early failure detection and scrap reduction.

**Dataset:**
- 500K+ die samples from wafer test STDF files
- 50-100 parametric features (Vdd, Idd, frequency, power, temperature)
- Binary target: pass/fail or multi-class binning (Bin 1, 2, 3, ...)

**Architecture:**
```
Input(100) → Dense(256, ReLU) + BatchNorm + Dropout(0.3)
           → Dense(128, ReLU) + BatchNorm + Dropout(0.2)
           → Dense(64, ReLU)
           → Output(1, Sigmoid) or Output(n_bins, Softmax)
```

**Implementation Steps:**
1. **Data pipeline:** Extract STDF files → Pandas DataFrame → Feature engineering (mean, std, percentiles per wafer)
2. **Model training:** PyTorch with mixed precision, 4-GPU distributed training
3. **Optimization:** INT8 quantization, ONNX export
4. **Deployment:** TorchServe or ONNX Runtime with GPU, REST API for real-time inference
5. **Monitoring:** TensorBoard for training, Prometheus + Grafana for production metrics

**Success Metrics:**
- Accuracy ≥95%, AUC-ROC ≥0.98
- Inference time <10ms per die (50K+ die/hour throughput)
- False positive rate <1% (minimize bad dies shipped)
- False negative rate <2% (minimize good dies scrapped)
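
The two error-rate targets can be checked directly from a confusion matrix. The sketch below treats "pass" as the positive class, so a false positive is a bad die predicted to pass. The per-die costs reuse the $50,000 and $1,000 figures from the evaluation cell earlier in this notebook, and the matrix values are made up.

```python
def business_metrics(cm, cost_bad_shipped=50_000, cost_good_scrapped=1_000):
    """cm is [[TN, FP], [FN, TP]] with class 0 = fail and class 1 = pass."""
    (tn, fp), (fn, tp) = cm
    return {
        "false_positive_rate": fp / (fp + tn),   # bad dies predicted as pass
        "false_negative_rate": fn / (fn + tp),   # good dies predicted as fail
        "estimated_cost": fp * cost_bad_shipped + fn * cost_good_scrapped,
    }


# Made-up confusion matrix for 10,000 dies (500 actual fails, 9,500 actual passes)
cm = [[496, 4], [150, 9350]]
metrics = business_metrics(cm)
```

In this example both rates meet the targets (0.8% and about 1.6%), yet the four escaped dies account for more of the estimated cost than the 150 scrapped ones.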

**Challenges:**
- Class imbalance (95% pass, 5% fail) → Use weighted loss, SMOTE, focal loss
- Spatial correlation (neighboring die fail together) → Add spatial features (die_x, die_y, distance to wafer edge)
- Real-time constraints → Batch inference, GPU acceleration, model pruning
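
For the class-imbalance item above, one common starting point for a weighted loss is the "balanced" heuristic, which weights each class inversely to its frequency. The sketch below applies it to the 95/5 split. The absolute counts are illustrative.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# 95% pass / 5% fail split (counts are illustrative)
counts = {"pass": 95_000, "fail": 5_000}
weights = balanced_class_weights(counts)
```

The resulting weights (about 0.53 for pass, 10.0 for fail) can be passed to a weighted loss, for example via the `class_weight` argument of Keras `fit()`, after mapping the labels to their integer class indices.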

---

#### **Project 2: Defect Pattern Classification on Wafer Maps**

**Objective:** Classify defect types from 2D wafer maps (spatial fail patterns) using CNNs.

**Business Value:** $5M-$20M per incident through faster root cause analysis (reduce time from days to hours).

**Dataset:**
- 10K+ wafer maps (300×300 pixel images, each pixel = one die)
- 20+ defect classes (ring, scratch, cluster, edge, random, normal)
- Imbalanced: 70% normal, 30% defects

**Architecture (CNN):**
```
Input(300, 300, 1) → Conv2D(32, 3×3) + BatchNorm + ReLU + MaxPool(2×2)
                   → Conv2D(64, 3×3) + BatchNorm + ReLU + MaxPool(2×2)
                   → Conv2D(128, 3×3) + BatchNorm + ReLU + MaxPool(2×2)
                   → Flatten → Dense(256) + Dropout(0.5)
                   → Output(20, Softmax)
```

**Implementation Steps:**
1. **Data augmentation:** Rotation, flip, zoom (wafer maps can be rotated)
2. **Transfer learning:** Start with ResNet-50 pre-trained on ImageNet, fine-tune on wafer maps
3. **Framework choice:** TensorFlow/Keras (built-in image augmentation layers, TF Lite for edge deployment)
4. **Deployment:** TensorFlow Serving + Docker, or TF Lite on edge devices (inspection stations)

**Success Metrics:**
- Top-1 accuracy ≥98% (20-class classification)
- Inference time <100ms per wafer (real-time inspection)
- Precision ≥95% (minimize false alarms)

**Enhancements:**
- **Ensemble:** Combine multiple models (ResNet, EfficientNet, Vision Transformer)
- **Explainability:** Use Grad-CAM to visualize which regions triggered classification
- **Active learning:** Human-in-the-loop for edge cases

---

#### **Project 3: Adaptive Test Insertion with Reinforcement Learning**

**Objective:** Dynamically optimize test sequence to minimize test time while maintaining 99%+ coverage.

**Business Value:** $10M-$50M/year through 30-50% test time reduction (1-2 seconds per device × 100M devices/year).

**Problem Formulation (RL):**
- **State:** Current test results (pass/fail for tests already run), device features
- **Action:** Which test to run next (from 100+ available tests)
- **Reward:** -1 per test run, +100 if defect found early, -1000 if shipped with defect
- **Goal:** Learn policy to minimize tests while catching all defects
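
The reward scheme above can be written as a small function. This is a sketch under one simplifying assumption: any caught defect earns the detection bonus, since the formulation does not define how early counts as "early". The function name and arguments are hypothetical.

```python
TEST_COST = -1            # penalty for every test executed
EARLY_DETECTION = 100     # bonus when a test catches a defect
ESCAPE_PENALTY = -1000    # defective device shipped undetected


def episode_reward(num_tests_run, defect_present, defect_caught):
    """Total reward for one device under the reward scheme listed above."""
    reward = TEST_COST * num_tests_run
    if defect_present and defect_caught:
        reward += EARLY_DETECTION
    elif defect_present and not defect_caught:
        reward += ESCAPE_PENALTY
    return reward


good_device = episode_reward(num_tests_run=20, defect_present=False, defect_caught=False)
caught_early = episode_reward(num_tests_run=5, defect_present=True, defect_caught=True)
escaped = episode_reward(num_tests_run=20, defect_present=True, defect_caught=False)
```

The large gap between `caught_early` (95) and `escaped` (-1020) is what pushes the policy to keep coverage high while it trims tests.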

**Architecture (Policy Network):**
```
Input(state_dim) → Dense(256, ReLU) → Dense(128, ReLU) 
                 → Output(num_actions, Softmax)  # Action probabilities
```

**Implementation Steps:**
1. **Environment:** Simulate test flow using historical STDF data
2. **Algorithm:** Proximal Policy Optimization (PPO) in PyTorch
3. **Training:** 4-GPU distributed training, 10M episodes
4. **Deployment:** Export policy network to ONNX, <10ms inference for real-time test selection

**Success Metrics:**
- Test time reduction: 30-50%
- Defect coverage maintained: ≥99%
- Escape rate (defects shipped): <0.1%

**Challenges:**
- Sparse rewards → Use reward shaping (intermediate rewards for partial progress)
- Off-policy learning → Importance sampling to reuse historical data
- Generalization → Train on multiple device families, test on new devices

---

#### **Project 4: Power Anomaly Detection with Autoencoders**

**Objective:** Detect abnormal power consumption patterns indicating potential failures.

**Business Value:** $2M-$10M/year through early detection of reliability issues (avoid field failures).

**Dataset:**
- 100K+ devices with power measurements (dynamic, static, leakage) across 10+ voltage/frequency conditions
- Unlabeled data (95% normal, 5% anomalies - unknown types)

**Architecture (Autoencoder):**
```
Encoder: Input(50) → Dense(32, ReLU) → Dense(16, ReLU) → Latent(8)
Decoder: Latent(8) → Dense(16, ReLU) → Dense(32, ReLU) → Output(50)
```

**Implementation Steps:**
1. **Training:** Reconstruction loss (MSE) on the predominantly normal unlabeled data, or on known-good devices only where such labels exist
2. **Anomaly detection:** Devices with high reconstruction error = anomalies
3. **Framework:** PyTorch or Keras (both work well for autoencoders)
4. **Deployment:** ONNX Runtime for batch inference on test data
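
The anomaly-detection step above reduces to computing a reconstruction error per device and comparing it to a threshold. The sketch below uses a nearest-rank percentile as the threshold. The 95th percentile is an assumption that matches the roughly 5% anomaly share mentioned in the dataset description, and the error values are made up.

```python
import math


def reconstruction_error(original, reconstructed):
    """Mean squared error between one device's measurements and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def percentile_threshold(errors, percentile=95.0):
    """Nearest-rank percentile of the error distribution."""
    ranked = sorted(errors)
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    return ranked[rank - 1]


# Error for a single device (3 measurements, for brevity)
example_error = reconstruction_error([1.0, 2.0, 3.0], [1.0, 2.0, 2.0])

# Made-up reconstruction errors: 19 typical devices and one outlier
errors = [0.010 + 0.001 * i for i in range(19)] + [0.250]
threshold = percentile_threshold(errors, percentile=95.0)
anomalies = [i for i, e in enumerate(errors) if e > threshold]
```

In practice the threshold would be fitted on a held-out set of normal devices and then frozen, so that new lots are judged against a fixed reference.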

**Success Metrics:**
- Anomaly detection rate ≥90% (recall)
- False positive rate <5% (normal devices flagged as anomalies)
- Inference time <5ms per device

**Enhancements:**
- **Variational Autoencoder (VAE):** Better generalization, probabilistic latent space
- **Time-series:** If power measured over time, use LSTM autoencoder
- **Clustering:** Use latent representations for anomaly clustering (identify failure modes)

---

### **General AI/ML Projects**

---

#### **Project 5: Customer Churn Prediction (Telecom/SaaS)**

**Objective:** Predict which customers will cancel subscription in next 30 days.

**Dataset:** 100K+ customers with features (usage, support tickets, payments, demographics), binary target (churn/no-churn).

**Architecture:** Same as semiconductor yield predictor (MLP with BatchNorm + Dropout).

**Framework:** Keras (quick prototyping), PyTorch (custom loss functions for imbalanced data).

**Business Value:** $500K-$5M/year through targeted retention campaigns (reduce churn by 10-20%).

---

#### **Project 6: Fraud Detection (Financial Services)**

**Objective:** Real-time detection of fraudulent transactions.

**Dataset:** 1M+ transactions with features (amount, merchant, time, location, user history), binary target (fraud/legitimate), highly imbalanced (0.1% fraud).

**Architecture:** Deep MLP with attention mechanism to focus on suspicious patterns.

**Framework:** PyTorch (custom training loop for handling imbalance), ONNX deployment for <10ms inference.

**Business Value:** $10M-$100M/year through fraud prevention.

**Challenges:** 
- Extreme imbalance → Focal loss, cost-sensitive learning
- Real-time → Model compression, GPU inference
- Concept drift → Online learning, periodic retraining
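
Focal loss, named in the challenges above, down-weights examples the model already classifies confidently so that rare, hard cases dominate the gradient. The sketch below computes it for a single prediction in plain Python. The defaults `gamma=2.0` and `alpha=0.25` are commonly quoted starting values, not settings taken from this notebook.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def binary_cross_entropy(p, y):
    """Standard cross-entropy for one sample, for comparison."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


easy_negative = binary_focal_loss(p=0.01, y=0)   # confident and correct: tiny loss
hard_positive = binary_focal_loss(p=0.10, y=1)   # missed fraud: much larger loss
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which makes a convenient sanity check when implementing it in a framework.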

---

#### **Project 7: Medical Image Diagnosis (Healthcare)**

**Objective:** Classify chest X-rays into normal/pneumonia/COVID-19.

**Dataset:** 50K+ X-ray images (256×256 pixels), 3-class target.

**Architecture:** Transfer learning with EfficientNet-B7 or Vision Transformer.

**Framework:** TensorFlow/Keras (better image preprocessing, data augmentation), TF Serving for deployment.

**Business Value:** $1M-$10M/year through faster diagnosis (reduce radiologist workload).

**Success Metrics:** AUC-ROC ≥0.95, sensitivity ≥90% (minimize false negatives).

---

#### **Project 8: Predictive Maintenance (Manufacturing)**

**Objective:** Predict equipment failure 7 days in advance from sensor data.

**Dataset:** Time-series sensor data (temperature, vibration, pressure) from 100+ machines, binary target (failure/normal).

**Architecture:** LSTM or Transformer for time-series modeling.

**Framework:** PyTorch (better for RNNs and custom architectures), TensorFlow (TF Lite for edge deployment on machines).

**Business Value:** $5M-$50M/year through reduced downtime (prevent unplanned outages).

**Challenges:**
- Variable-length sequences → Padding or dynamic RNNs
- Rare failures → Synthetic data generation, transfer learning from similar machines

---

## 🔑 Key Takeaways

### **Framework Selection Decision Tree**

```
START
  ↓
Are you doing research/experimentation?
  YES → PyTorch (dynamic graphs, easier debugging)
  NO ↓
Do you need mobile/edge deployment?
  YES → TensorFlow/Keras (TF Lite mature)
  NO ↓
Do you need custom training loops (RL, GANs)?
  YES → PyTorch (more control)
  NO ↓
Quick prototype for business stakeholders?
  YES → Keras (fit() API, faster development)
  NO ↓
Either framework works → Choose based on team expertise
```
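The decision tree above translates directly into a small function. This is a sketch; the function and parameter names are illustrative, and the checks run in the same order as the tree, so earlier questions take priority.

```python
def choose_framework(research=False, mobile_or_edge=False,
                     custom_training_loop=False, quick_prototype=False):
    """Walk the decision tree top to bottom; the first YES wins."""
    if research:
        return "PyTorch"            # dynamic graphs, easier debugging
    if mobile_or_edge:
        return "TensorFlow/Keras"   # TF Lite is mature
    if custom_training_loop:
        return "PyTorch"            # more control (RL, GANs)
    if quick_prototype:
        return "Keras"              # fit() API, faster development
    return "Either (choose based on team expertise)"

# Example: a production team shipping to edge devices
recommendation = choose_framework(mobile_or_edge=True, custom_training_loop=True)
```

Because the questions are ordered, a project that answers YES to several of them follows the earliest branch, exactly as in the diagram.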

---

### **Production Deployment Checklist**

- ✅ **Model format:** ONNX (framework-agnostic)
- ✅ **Optimization:** Quantization (INT8), pruning, TorchScript/TF graph compilation
- ✅ **Serving:** TorchServe, TF Serving, or ONNX Runtime with REST API
- ✅ **Monitoring:** Log predictions, latency, errors (Prometheus + Grafana)
- ✅ **Versioning:** Model registry (MLflow, DVC), A/B testing for new models
- ✅ **Scaling:** Kubernetes for auto-scaling, GPU pools for burst traffic
- ✅ **Fallback:** Simpler model as backup if GPU fails
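To make the INT8 quantization item concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It only shows the arithmetic; real toolchains (PyTorch, TF Lite, ONNX Runtime) add calibration, per-channel scales, and zero points, and the function names and weights here are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]

# Hypothetical layer weights
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale. This is why quantized models are typically validated against the success metric before deployment.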

---

### **PyTorch vs TensorFlow/Keras: Final Comparison**

| Aspect | PyTorch | TensorFlow/Keras | Recommendation |
|--------|---------|------------------|----------------|
| **Learning Curve** | Moderate | Easy (Keras) | Keras for beginners |
| **Research** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | PyTorch dominant |
| **Production** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | TensorFlow mature |
| **Mobile/Edge** | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | TF Lite superior |
| **Debugging** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | PyTorch easier |
| **Training Speed** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Similar |
| **Community** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Both excellent |
| **Semiconductor Testing** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | PyTorch for training, TF Lite for edge |

---

### **Best Practices for Semiconductor Testing**

1. **Training:** Use PyTorch with mixed precision and multi-GPU training for speed, while keeping the flexibility of custom training loops
2. **Inference:** Convert to ONNX, deploy with ONNX Runtime + GPU for speed
3. **Edge:** Use TF Lite for test equipment with limited resources
4. **Monitoring:** Log all predictions for continuous model evaluation
5. **Retraining:** Set up pipelines to retrain monthly as new data arrives

---

### **Learning Path Forward**

**After mastering frameworks:**
- 📘 **Notebook 053:** Convolutional Neural Networks (CNNs) for image data
- 📘 **Notebook 054:** Recurrent Neural Networks (RNNs) for time-series
- 📘 **Notebook 055:** Transformers and Attention Mechanisms
- 📘 **Notebook 056:** Generative Models (GANs, VAEs)
- 📘 **Notebook 057:** Reinforcement Learning Fundamentals

**Production skills:**
- 🚀 **MLOps:** CI/CD for ML, model versioning, A/B testing
- 🚀 **Monitoring:** Model drift detection, performance tracking
- 🚀 **Scaling:** Distributed training, model compression, batch inference

---

## ✅ Learning Objectives Review

By now, you should be able to:
- ✅ Build neural networks in both PyTorch and TensorFlow/Keras
- ✅ Choose the right framework based on project requirements
- ✅ Train models efficiently with GPU acceleration and mixed precision
- ✅ Deploy models to production using TorchServe, TF Serving, or ONNX Runtime
- ✅ Optimize inference for real-time applications (quantization, ONNX, batching)
- ✅ Apply frameworks to semiconductor testing with production-grade implementations
- ✅ Debug and profile models to identify bottlenecks
- ✅ Export models to ONNX so they can run outside the framework they were trained in

---

## 🎓 Congratulations!

You've mastered deep learning frameworks! You can now:
- Build production-ready models in PyTorch and Keras
- Deploy to cloud, edge, and mobile targets via ONNX, TF Lite, or a serving stack
- Optimize for real-world constraints (latency, memory, cost)
- Apply to semiconductor testing with confidence

**Next steps:** Dive into specialized architectures (CNNs, RNNs, Transformers) in upcoming notebooks! 🚀

---

## 📚 Resources

**Official Documentation:**
- PyTorch: https://pytorch.org/docs/
- TensorFlow: https://www.tensorflow.org/guide
- ONNX: https://onnx.ai/

**Tutorials:**
- PyTorch Tutorials: https://pytorch.org/tutorials/
- TensorFlow Tutorials: https://www.tensorflow.org/tutorials
- ONNX Runtime: https://onnxruntime.ai/docs/

**Books:**
- *Deep Learning with PyTorch* (Stevens, Antiga, Viehmann)
- *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow* (Géron)

**Communities:**
- PyTorch Forums: https://discuss.pytorch.org/
- TensorFlow Community: https://www.tensorflow.org/community
- Reddit: r/MachineLearning, r/deeplearning

---

**Notebook Complete!** 🎉