# TinyNet Trainer: From Overfitting to Generalization

**Prerequisites:** Complete `tinynet_discovery.ipynb` first

In the Discovery notebook, TinyNet's problem was clear:
- Trained on 4 perfect examples
- Memorized instead of learning patterns
- Failed on stress tests

**Today's Mission:**
1. Add noise to training data
2. Compare Sigmoid vs ReLU
3. Track experiments with Weights & Biases
4. Build a model that generalizes

In [None]:
# Install dependencies
%pip install torch torchvision matplotlib seaborn numpy wandb

In [None]:
import torch
import torch.nn as nn
import random
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import wandb
import os
from datetime import datetime

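# Hardware acceleration: prefer CUDA, then Apple MPS, then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")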
print(f"Using device: {device}")

# wandb setup: online if logged in, offline otherwise
if wandb.api.api_key or os.getenv("WANDB_API_KEY"):
    wandb_mode = "online"
    print("\nwandb: Online mode - experiments syncing to cloud")
else:
    wandb_mode = "offline"
    print("\nwandb: Offline mode - experiments saved locally")
    print("To enable cloud sync: run 'wandb login' in terminal")

## Part 1: The Noise Solution

**Discovery notebook's problem:** 4 perfect examples → memorization, not learning

**The fix:** Add random variation to training data
- Instead of `[1.0, 1.0, 0.0, 0.0]`
- Train on `[0.95, 1.0, 0.12, 0.05]`, `[0.88, 0.92, 0.0, 0.08]`, etc.
- Forces the model to learn patterns, not memorize specific values

In [None]:
def get_noisy_batch(batch_size=32, noise_level=0.1):
    """
    Generate batch of noisy horizontal and vertical lines.
    
    Args:
        batch_size: Number of examples to generate
        noise_level: Total width of the uniform noise range
            (e.g., 0.1 shifts each pixel by a value in [-0.05, +0.05])
    
    Returns:
        inputs: Tensor of shape [batch_size, 4]
        targets: Tensor of shape [batch_size, 2]
    """
    # Base patterns
    h_bases = [
        torch.tensor([1.0, 1.0, 0.0, 0.0]),  # Horizontal top
        torch.tensor([0.0, 0.0, 1.0, 1.0])   # Horizontal bottom
    ]
    v_bases = [
        torch.tensor([1.0, 0.0, 1.0, 0.0]),  # Vertical left
        torch.tensor([0.0, 1.0, 0.0, 1.0])   # Vertical right
    ]
    
    inputs, targets = [], []
    
    for _ in range(batch_size // 2):
        # Add noise: uniform random in range [-noise_level/2, +noise_level/2]
        h_noise = (torch.rand(4) * noise_level) - (noise_level / 2)
        v_noise = (torch.rand(4) * noise_level) - (noise_level / 2)
        
        # Apply noise and clamp to [0, 1]
        h_noisy = torch.clamp(random.choice(h_bases) + h_noise, 0, 1)
        v_noisy = torch.clamp(random.choice(v_bases) + v_noise, 0, 1)
        
        inputs.append(h_noisy)
        targets.append(torch.tensor([1.0, 0.0]))  # Horizontal
        
        inputs.append(v_noisy)
        targets.append(torch.tensor([0.0, 1.0]))  # Vertical
    
    return torch.stack(inputs).to(device), torch.stack(targets).to(device)

In [None]:
# Generate and visualize some noisy examples
import matplotlib.patches as patches

inputs, _ = get_noisy_batch(batch_size=8, noise_level=0.6)

fig, axes = plt.subplots(2, 4, figsize=(14, 4.5))
axes = axes.flatten()

for idx in range(8):
    pixels = inputs[idx].cpu().numpy()
    grid = pixels.reshape(2, 2)
    
    ax = axes[idx]
    
    # Set up the plot
    ax.set_xlim(-0.5, 1.5)
    ax.set_ylim(-0.5, 1.5)
    ax.set_aspect('equal')
    ax.invert_yaxis()  # Put origin at top-left
    ax.set_title(f'Example {idx+1}', fontsize=10)
    
    # Remove axis ticks
    ax.set_xticks([])
    ax.set_yticks([])
    
    # Draw circles for each pixel
    for i in range(2):
        for j in range(2):
            value = grid[i, j]
            # Circle filled based on brightness (use grayscale)
            circle = patches.Circle((j, i), radius=0.3, 
                                   facecolor=str(1-value),  # Inverse for grayscale (0=black, 1=white)
                                   edgecolor='black', linewidth=2)
            ax.add_patch(circle)
            
            # Add pixel value as text
            ax.text(j, i, f'{value:.3f}', ha='center', va='center', 
                   fontsize=7, color='red' if value > 0.5 else 'blue', weight='bold')
    
    # Draw outer border (thick, dark)
    outer_rect = patches.Rectangle((-0.5, -0.5), 2, 2, 
                                  linewidth=3, edgecolor='black', 
                                  facecolor='none')
    ax.add_patch(outer_rect)
    
    # Draw inner grid lines (thin, light gray)
    ax.plot([0.5, 0.5], [-0.5, 1.5], color='gray', linewidth=1, alpha=0.5)
    ax.plot([-0.5, 1.5], [0.5, 0.5], color='gray', linewidth=1, alpha=0.5)

plt.suptitle('Noisy Training Examples (noise_level=0.6)', fontsize=14)
plt.tight_layout()
plt.show()

print("Notice: Values are no longer perfect 0.0 or 1.0")
print("But the patterns (horizontal/vertical) are still visible!")

### How Much Noise?

- **Too little (0.05)**: Model still memorizes, barely better than the Discovery notebook
- **Visualization (0.6)**: Perfect for demonstrating the concept - noise is obvious, patterns still clear
- **Training (0.3)**: Sweet spot for robust learning without overwhelming the signal
- **Too much (0.8+)**: Pattern becomes ambiguous, model can't learn reliably

We'll use **0.6 for visualization** (so you can see the noise clearly) and **0.3 for training** (optimal learning).
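
One way to see why these levels behave differently is to compute the worst-case gap between an "on" pixel and an "off" pixel. The sketch below assumes the noise scheme used by `get_noisy_batch` (uniform noise in `[-noise_level/2, +noise_level/2]`, then clamping to `[0, 1]`); the helper name `worst_case_gap` is made up for illustration.

```python
def worst_case_gap(noise_level):
    """Smallest possible difference between an 'on' pixel and an 'off' pixel.

    Assumes uniform noise in [-noise_level/2, +noise_level/2] followed by
    clamping to [0, 1], as in get_noisy_batch.
    """
    half = noise_level / 2
    dimmest_on = max(0.0, 1.0 - half)     # an "on" pixel (base 1.0) pushed down
    brightest_off = min(1.0, 0.0 + half)  # an "off" pixel (base 0.0) pushed up
    return dimmest_on - brightest_off


for level in (0.05, 0.3, 0.6, 0.8, 1.0):
    print(f"noise_level={level:<4} -> worst-case gap = {worst_case_gap(level):.2f}")
```

The gap works out to `1 - noise_level`, so it shrinks from about 0.7 at the training level to about 0.4 at the visualization level, and disappears at 1.0. This only measures the margin between pixel values; it does not predict how well a particular model will train.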

## Part 2: Experiment 1 - Sigmoid + Noise

Let's keep the Sigmoid activation from the Discovery notebook, but train on noisy data.

**Hypothesis:** With enough diverse examples, even Sigmoid should generalize better than it did in the Discovery notebook.

In [None]:
class TinyNet_Sigmoid(nn.Module):
    def __init__(self):
        super(TinyNet_Sigmoid, self).__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 2)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        x = self.sigmoid(self.layer1(x))
        x = self.sigmoid(self.layer2(x))
        return x

model_sigmoid = TinyNet_Sigmoid().to(device)
print(f"Sigmoid model parameters: {sum(p.numel() for p in model_sigmoid.parameters())}")
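
The printed parameter count can be checked by hand. As a small sketch (the helper `linear_param_count` is a made-up name, not part of PyTorch), each `nn.Linear(in, out)` layer holds `in × out` weights plus `out` biases:

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases for one fully connected layer."""
    return in_features * out_features + out_features


layer1 = linear_param_count(4, 3)  # 12 weights + 3 biases
layer2 = linear_param_count(3, 2)  # 6 weights + 2 biases
total = layer1 + layer2
print(f"layer1={layer1}, layer2={layer2}, total={total}")
```

For the 4-3-2 architecture this gives 15 + 8 = 23 parameters. The ReLU model later in the notebook has the same count, because activation functions add no trainable parameters.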

In [None]:
# Initialize wandb for Sigmoid experiment
wandb.init(
    project="tinynet-trainer",
    name=f"sigmoid-noisy-training-{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    tags=["v1", "sigmoid", "sgd"],
    mode=wandb_mode,
    config={
        "architecture": "4-3-2",
        "activation": "sigmoid",
        "learning_rate": 0.1,
        "noise_level": 0.3,
        "batch_size": 64,
        "epochs": 200,
        "device": str(device)
    }
)

criterion = nn.MSELoss()
optimizer_sigmoid = torch.optim.SGD(model_sigmoid.parameters(), lr=0.1)
loss_history_sigmoid = []

print("Training Sigmoid model with noisy data...")
for epoch in range(200):
    inputs, labels = get_noisy_batch(batch_size=64, noise_level=0.3)
    
    outputs = model_sigmoid(inputs)
    loss = criterion(outputs, labels)
    
    optimizer_sigmoid.zero_grad()
    loss.backward()
    optimizer_sigmoid.step()
    
    loss_history_sigmoid.append(loss.item())
    
    # Log to wandb
    wandb.log({"epoch": epoch, "loss": loss.item()})
    
    if epoch % 40 == 0:
        print(f"Epoch {epoch:03d} | Loss: {loss.item():.4f}")

# Finish this wandb run
wandb.finish()

plt.plot(loss_history_sigmoid)
plt.title("Sigmoid Training with Noisy Data")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.grid(True, alpha=0.3)
plt.show()

## Part 3: The ReLU Upgrade

**Why ReLU?**
- No saturation for positive inputs (unlike Sigmoid, whose gradient shrinks toward zero as the output approaches 0 or 1)
- Better gradient flow → faster training
- Standard in modern networks

**Strategy:** ReLU on hidden layer, Sigmoid on output (for probabilities)
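
The gradient-flow argument can be made concrete by comparing the two derivatives directly. This is a plain-Python sketch, independent of the models in this notebook; the function names are chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s), never larger than 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


for z in (-5.0, -1.0, 0.5, 1.0, 5.0):
    print(f"z={z:+.1f}  sigmoid'={sigmoid_grad(z):.4f}  relu'={relu_grad(z):.1f}")
```

The Sigmoid derivative peaks at 0.25 and falls off quickly in both directions, while ReLU passes the gradient through unchanged for any positive input. The trade-off is visible too: for negative inputs ReLU's gradient is exactly zero, so a hidden unit that is always negative stops learning.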

In [None]:
class TinyNet_ReLU(nn.Module):
    def __init__(self):
        super(TinyNet_ReLU, self).__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 2)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()  # Only on output for probabilities
    
    def forward(self, x):
        x = self.relu(self.layer1(x))      # ReLU on hidden layer
        x = self.sigmoid(self.layer2(x))   # Sigmoid on output layer
        return x

model_relu = TinyNet_ReLU().to(device)
print(f"ReLU model parameters: {sum(p.numel() for p in model_relu.parameters())}")

In [None]:
# Initialize wandb for ReLU experiment
wandb.init(
    project="tinynet-trainer",
    name=f"relu-v1-{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    tags=["v1", "relu", "sgd"],
    mode=wandb_mode,
    config={
        "architecture": "4-3-2",
        "activation": "relu",
        "learning_rate": 0.1,
        "noise_level": 0.3,
        "batch_size": 64,
        "epochs": 200,
        "device": str(device)
    }
)

optimizer_relu = torch.optim.SGD(model_relu.parameters(), lr=0.1)
loss_history_relu = []

print("Training ReLU model with noisy data...")
for epoch in range(200):
    inputs, labels = get_noisy_batch(batch_size=64, noise_level=0.3)
    
    outputs = model_relu(inputs)
    loss = criterion(outputs, labels)
    
    optimizer_relu.zero_grad()
    loss.backward()
    optimizer_relu.step()
    
    loss_history_relu.append(loss.item())
    
    # Log to wandb
    wandb.log({"epoch": epoch, "loss": loss.item()})
    
    if epoch % 40 == 0:
        print(f"Epoch {epoch:03d} | Loss: {loss.item():.4f}")

# Finish this wandb run
wandb.finish()

plt.plot(loss_history_relu)
plt.title("ReLU Training with Noisy Data")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.grid(True, alpha=0.3)
plt.show()

### Initial Observations

Notice anything different about the training curves?

ReLU typically converges faster and to a lower loss than Sigmoid. Let's compare them side-by-side.
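
A useful reference point when reading these curves is the loss of a model that never commits to an answer. The sketch below recomputes mean squared error by hand (the `mse` helper is a stand-in written for illustration, assuming the default mean reduction of `nn.MSELoss`) for a model that outputs 0.5 for both classes.

```python
def mse(outputs, targets):
    """Mean squared error over a flat list of outputs."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


horizontal_target = [1.0, 0.0]
vertical_target = [0.0, 1.0]

undecided = [0.5, 0.5]  # a model that never commits
print(mse(undecided, horizontal_target))
print(mse(undecided, vertical_target))

leaning = [0.6, 0.4]    # slightly favoring horizontal
print(mse(leaning, horizontal_target))
```

The undecided model scores exactly 0.25 on either target. A training loss that hovers near 0.25 therefore suggests the model is still close to a 50/50 guess, and the curve only becomes interesting once it drops clearly below that level.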

## Part 4: Side-by-Side Comparison

Now let's compare the two approaches directly.

In [None]:
# Compare loss curves
plt.figure(figsize=(10, 5))
plt.plot(loss_history_sigmoid, label='Sigmoid', linewidth=2, alpha=0.8)
plt.plot(loss_history_relu, label='ReLU', linewidth=2, alpha=0.8)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Convergence: Sigmoid vs ReLU')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Final loss - Sigmoid: {loss_history_sigmoid[-1]:.4f}")
print(f"Final loss - ReLU: {loss_history_relu[-1]:.4f}")

In [None]:
# Compare learned weights (heatmaps)
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Sigmoid weights
sigmoid_weights = model_sigmoid.layer1.weight.data.cpu().numpy()
for i in range(3):
    sns.heatmap(sigmoid_weights[i].reshape(2, 2), annot=True, fmt='.2f',
                cmap='coolwarm', center=0, ax=axes[0, i], cbar=False, vmin=-1, vmax=1)
    axes[0, i].set_title(f'Sigmoid - Hidden Node {i}')

# ReLU weights
relu_weights = model_relu.layer1.weight.data.cpu().numpy()
for i in range(3):
    sns.heatmap(relu_weights[i].reshape(2, 2), annot=True, fmt='.2f',
                cmap='coolwarm', center=0, ax=axes[1, i], cbar=False, vmin=-1, vmax=1)
    axes[1, i].set_title(f'ReLU - Hidden Node {i}')

plt.suptitle('Learned Weights Comparison', fontsize=16)
plt.tight_layout()
plt.show()
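
To read these heatmaps, it helps to know what a useful hidden node could look like. The sketch below uses hand-picked, hypothetical weights (not values the model is guaranteed to learn) in the same pixel order as the inputs, and ignores the bias term.

```python
def response(weights, pixels):
    """Pre-activation of one hidden node, ignoring the bias term."""
    return sum(w * p for w, p in zip(weights, pixels))


# Hypothetical weights for a "top row minus bottom row" detector.
# Reshaped to 2x2 this reads [[1, 1], [-1, -1]].
top_row_detector = [1.0, 1.0, -1.0, -1.0]

patterns = {
    "horizontal top":    [1.0, 1.0, 0.0, 0.0],
    "horizontal bottom": [0.0, 0.0, 1.0, 1.0],
    "vertical left":     [1.0, 0.0, 1.0, 0.0],
    "diagonal":          [1.0, 0.0, 0.0, 1.0],
    "gray block":        [0.5, 0.5, 0.5, 0.5],
}

for name, pixels in patterns.items():
    print(f"{name:<18} -> {response(top_row_detector, pixels):+.1f}")
```

A node like this responds strongly, with opposite signs, to the two horizontal patterns and gives zero for vertical lines, diagonals, and the gray block. In the heatmaps above, a row of same-signed cells next to a row of opposite-signed cells hints at a horizontal detector, and the same layout by columns hints at a vertical one.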

In [None]:
# Stress test comparison
stress_tests = {
    "Perfect Horizontal":  torch.tensor([1.0, 1.0, 0.0, 0.0]),
    "Perfect Vertical":    torch.tensor([1.0, 0.0, 1.0, 0.0]),
    "Fuzzy Horizontal":    torch.tensor([0.88, 0.92, 0.12, 0.08]),
    "Fuzzy Vertical":      torch.tensor([0.91, 0.05, 0.89, 0.11]),
    "Diagonal /":          torch.tensor([0.0, 1.0, 1.0, 0.0]),
    "Diagonal \\\\":         torch.tensor([1.0, 0.0, 0.0, 1.0]),
    "Solid Gray Block":    torch.tensor([0.5, 0.5, 0.5, 0.5])
}

print(f"{'Test Case':<20} | {'Sigmoid H%':<11} | {'Sigmoid V%':<11} | {'ReLU H%':<11} | {'ReLU V%':<11}")
print("-" * 85)

model_sigmoid.eval()
model_relu.eval()

# Store results for wandb
stress_results = []

with torch.no_grad():
    for name, pixels in stress_tests.items():
        pixels_device = pixels.to(device)
        
        pred_sigmoid = model_sigmoid(pixels_device)
        pred_relu = model_relu(pixels_device)
        
        sig_h, sig_v = pred_sigmoid[0].item() * 100, pred_sigmoid[1].item() * 100
        relu_h, relu_v = pred_relu[0].item() * 100, pred_relu[1].item() * 100
        
        print(f"{name:<20} | {sig_h:>10.1f}% | {sig_v:>10.1f}% | {relu_h:>10.1f}% | {relu_v:>10.1f}%")
        
        stress_results.append({
            "test_case": name,
            "sigmoid_h": sig_h,
            "sigmoid_v": sig_v,
            "relu_h": relu_h,
            "relu_v": relu_v
        })



Looking at the stress test results above, notice something concerning:

**Perfect/Fuzzy cases (should be confident):**
- Perfect Horizontal: ~60-65% (too uncertain!)
- Perfect Vertical: ~50-55% (barely better than random!)
- Fuzzy cases: ~60% (still too uncertain!)

**Ambiguous cases (should be uncertain):**
- Diagonals: ~50% ✓ (correct uncertainty)
- Gray block: ~50% ✓ (correct uncertainty)
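
The "should be confident" versus "should be uncertain" distinction can be turned into a simple check. This is a minimal sketch: the `describe` helper and the 0.75 cutoff are assumptions chosen for illustration, and the scores passed in are made-up values, not results from this notebook.

```python
def describe(h, v, confident_at=0.75):
    """Turn two output scores into 'High H', 'High V', or 'Uncertain'.

    confident_at is an assumed cutoff chosen for illustration only.
    """
    if max(h, v) < confident_at:
        return "Uncertain"
    return "High H" if h > v else "High V"


# Made-up scores for demonstration
print(describe(0.62, 0.41))
print(describe(0.93, 0.08))
print(describe(0.51, 0.49))
```

With a helper like this, each stress test row could be compared against the behavior you expect, which makes an over-cautious model easy to spot: its clear cases come back as "Uncertain".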

### The Problem: Model is Too Conservative

**What happened:**
- Discovery notebook: No noise → Memorized → Overconfident (96% on gray blocks!)
- Trainer v1: Heavy noise (0.3) → Generalized BUT too cautious (60% on perfect cases)

**We swung from one extreme to the other:**

```
Discovery: Too confident ──────[Sweet Spot]────── Trainer v1: Too uncertain
              (overfit)                                    (over-cautious)
```

The model learned: "Everything in training was noisy, so I should never be too sure!"

**This is progress** (at least it's not confidently wrong), but not production-ready.

**Next:** The bonus section will show how to find the balance!

## Part 5: Summary

**Training data comparison:**
- Discovery: 4 perfect examples
- Trainer: ~12,800 noisy examples (64 per epoch × 200 epochs)

**Results:** Check wandb dashboard for detailed comparison!

In [None]:
print("=" * 80)
print("FINAL COMPARISON: Discovery vs Trainer")
print("=" * 80)

print("\nTraining Data:")
print("  Discovery Model:  4 perfect examples (no variation)")
print("  Trainer Models:   ~12,800 noisy examples (64 per epoch × 200 epochs)")

print("\nStress Test Results:")
print(f"{'Test Case':<20} | {'Expected':<12} | {'Sigmoid':<12} | {'ReLU':<12}")
print("-" * 72)

# Reference expected behaviors
expected = {
    "Perfect Horizontal": "High H",
    "Perfect Vertical": "High V",
    "Fuzzy Horizontal": "High H",
    "Fuzzy Vertical": "High V",
    "Diagonal /": "Uncertain",
    "Diagonal \\\\": "Uncertain",
    "Solid Gray Block": "Uncertain"
}

# Log to wandb
wandb_table_data = []

with torch.no_grad():
    for name, pixels in stress_tests.items():
        pixels_device = pixels.to(device)
        
        pred_sigmoid = model_sigmoid(pixels_device)
        pred_relu = model_relu(pixels_device)
        
        sig_conf = max(pred_sigmoid[0].item(), pred_sigmoid[1].item()) * 100
        relu_conf = max(pred_relu[0].item(), pred_relu[1].item()) * 100
        
        sig_winner = "H" if pred_sigmoid[0] > pred_sigmoid[1] else "V"
        relu_winner = "H" if pred_relu[0] > pred_relu[1] else "V"
        
        print(f"{name:<20} | {expected[name]:<12} | {sig_conf:>5.1f}% {sig_winner:<4} | {relu_conf:>5.1f}% {relu_winner:<4}")
        
        wandb_table_data.append([name, expected[name], f"{sig_conf:.1f}% {sig_winner}", f"{relu_conf:.1f}% {relu_winner}"])

# Log results table to wandb
wandb.init(project="tinynet-trainer", name="final-comparison", mode=wandb_mode)
wandb.log({
    "stress_test_results": wandb.Table(
        columns=["Test Case", "Expected", "Sigmoid", "ReLU"],
        data=wandb_table_data
    )
})
wandb.finish()

## Bonus: Finding the Balance

**The pendulum swing:**
- Discovery: No noise → overfit, 96% confident on gray blocks
- Trainer v1: Heavy noise (0.3) → too cautious, 60% on perfect examples
- Trainer v2: Let's find the sweet spot with better techniques!

**The solution - Multiple improvements:**
1. **Mixed batches:** 20% clean + 80% noisy (learn confidence + generalization)
2. **Adam optimizer:** Adaptive learning rate (faster, smarter convergence)
3. **Weight decay (L2 regularization):** Penalize large weights (prevent overfitting)
4. **More epochs:** 300 vs 200 (with regularization, we can train longer)

This demonstrates real ML iteration: try → identify weakness → add technique → repeat.

In [None]:
def get_mixed_batch(batch_size=64, clean_ratio=0.2):
    """
    Generate batch with mix of clean and noisy examples.
    
    Args:
        batch_size: Total examples to generate
        clean_ratio: Fraction of examples that are nearly perfect (0.2 = 20%)
    
    Strategy: Train on BOTH clear examples (so it learns confidence)
    AND noisy examples (so it generalizes)
    """
    clean_size = int(batch_size * clean_ratio)
    noisy_size = batch_size - clean_size
    
    # Clean examples: very low noise
    clean_inputs, clean_targets = get_noisy_batch(clean_size, noise_level=0.05)
    
    # Noisy examples: moderate noise
    noisy_inputs, noisy_targets = get_noisy_batch(noisy_size, noise_level=0.3)
    
    # Combine them
    inputs = torch.cat([clean_inputs, noisy_inputs])
    targets = torch.cat([clean_targets, noisy_targets])
    
    return inputs, targets

# Test the mixed batch
sample_inputs, _ = get_mixed_batch(batch_size=10)
print("Mixed batch sample (first 10 examples):")
for i in range(10):
    pixels = sample_inputs[i].cpu().numpy()
    print(f"  Example {i+1}: [{pixels[0]:.3f}, {pixels[1]:.3f}, {pixels[2]:.3f}, {pixels[3]:.3f}]")
print("\nNotice: Some examples are very clean (1.000, 0.000), others are noisy (0.876, 0.234)")
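
One detail worth checking in `get_mixed_batch`: `get_noisy_batch` builds examples in horizontal/vertical pairs (`batch_size // 2` iterations), so an odd clean or noisy count silently loses one example. The sketch below mirrors that arithmetic in plain Python; `effective_sizes` is a hypothetical helper written for illustration.

```python
def effective_sizes(batch_size, clean_ratio):
    """Requested clean/noisy counts and the total actually generated.

    Mirrors the arithmetic in get_mixed_batch and get_noisy_batch:
    each part is generated as (size // 2) horizontal/vertical pairs.
    """
    clean_size = int(batch_size * clean_ratio)
    noisy_size = batch_size - clean_size
    actual_total = 2 * (clean_size // 2) + 2 * (noisy_size // 2)
    return clean_size, noisy_size, actual_total


print(effective_sizes(64, 0.2))  # the training configuration
print(effective_sizes(10, 0.3))  # an odd split
```

The training configuration (64 examples, 20% clean) splits into 12 and 52, both even, so nothing is lost. A split such as 10 examples at 30% clean requests 3 + 7 but yields only 8, which matters if later code indexes the batch by the requested size.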

### What is Weight Decay?

**Weight decay (L2 regularization)** adds a penalty for large weights to the loss function:
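- `Loss_total = Loss_prediction + (weight_decay / 2) × sum_of_squared_weights`
- In PyTorch's `SGD` and `Adam`, `weight_decay` is applied by adding `weight_decay × w` to each parameter's gradient, which is the gradient of the penalty term above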

**Why it helps:**
- Large weights = model is too specific/memorizing
- Small weights = model stays general/simple
- Penalty encourages the model to find simpler solutions

**Effect:** Prevents overfitting even with more training epochs.
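
To see the penalty at work, consider a single weight updated with plain SGD. This is a simplified sketch with a hypothetical `sgd_step` helper; Adam rescales gradients before applying them, so the shrinkage in the actual training run will not follow these exact numbers.

```python
def sgd_step(w, data_grad, lr, weight_decay):
    """One plain-SGD update with an L2 penalty folded into the gradient."""
    return w - lr * (data_grad + weight_decay * w)


w = 1.0
for step in range(3):
    # data_grad=0.0 isolates the effect of the penalty alone
    w = sgd_step(w, data_grad=0.0, lr=0.1, weight_decay=0.01)
    print(f"step {step + 1}: w = {w:.6f}")
```

With no gradient from the data, every step multiplies the weight by `1 - lr × weight_decay`, pulling it steadily toward zero. A weight only stays large if the data gradient keeps pushing it there, which is the sense in which weight decay favors simpler solutions.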

In [None]:
# Initialize improved ReLU model
model_relu_v2 = TinyNet_ReLU().to(device)

# Use Adam optimizer (adapts learning rate automatically)
optimizer_adam = torch.optim.Adam(model_relu_v2.parameters(), lr=0.01, weight_decay=0.01)

criterion = nn.MSELoss()
loss_history_v2 = []

# wandb tracking
wandb.init(
    project="tinynet-trainer",
    name=f"relu-v2-improved-{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    tags=["v2", "relu", "adam", "mixed"],
    mode=wandb_mode,
    config={
        "architecture": "4-3-2",
        "activation": "relu",
        "optimizer": "adam",
        "learning_rate": 0.01,
        "weight_decay": 0.01,
        "clean_ratio": 0.2,
        "noise_level": 0.3,
        "batch_size": 64,
        "epochs": 300,
        "device": str(device)
    }
)

print("Training improved ReLU model...")
print("Improvements: Mixed batches (20% clean) + Adam + Weight decay (0.01) + 300 epochs\n")

for epoch in range(300):
    # Use mixed batches: 20% clean, 80% noisy
    inputs, labels = get_mixed_batch(batch_size=64, clean_ratio=0.2)
    
    outputs = model_relu_v2(inputs)
    loss = criterion(outputs, labels)
    
    optimizer_adam.zero_grad()
    loss.backward()
    optimizer_adam.step()
    
    loss_history_v2.append(loss.item())
    
    # Log to wandb
    wandb.log({"epoch": epoch, "loss": loss.item()})
    
    if epoch % 100 == 0:
        print(f"Epoch {epoch:03d} | Loss: {loss.item():.4f}")

wandb.finish()

print(f"\nFinal loss: {loss_history_v2[-1]:.4f}")
print(f"Original ReLU loss: {loss_history_relu[-1]:.4f}")
print(f"Improvement: {((loss_history_relu[-1] - loss_history_v2[-1]) / loss_history_relu[-1] * 100):.1f}% better")

In [None]:
# Compare all three models on stress test
print("=" * 95)
print("STRESS TEST: Original vs Improved")
print("=" * 95)

print(f"\n{'Test Case':<20} | {'Sigmoid':<15} | {'ReLU v1':<15} | {'ReLU v2 (Improved)':<20}")
print("-" * 95)

model_sigmoid.eval()
model_relu.eval()
model_relu_v2.eval()

with torch.no_grad():
    for name, pixels in stress_tests.items():
        pixels_device = pixels.to(device)
        
        pred_sigmoid = model_sigmoid(pixels_device)
        pred_relu = model_relu(pixels_device)
        pred_relu_v2 = model_relu_v2(pixels_device)
        
        # Get max confidence for each
        sig_conf = max(pred_sigmoid[0].item(), pred_sigmoid[1].item()) * 100
        relu_conf = max(pred_relu[0].item(), pred_relu[1].item()) * 100
        relu_v2_conf = max(pred_relu_v2[0].item(), pred_relu_v2[1].item()) * 100
        
        sig_winner = "H" if pred_sigmoid[0] > pred_sigmoid[1] else "V"
        relu_winner = "H" if pred_relu[0] > pred_relu[1] else "V"
        relu_v2_winner = "H" if pred_relu_v2[0] > pred_relu_v2[1] else "V"
        
        print(f"{name:<20} | {sig_conf:>5.1f}% {sig_winner:<8} | {relu_conf:>5.1f}% {relu_winner:<8} | {relu_v2_conf:>5.1f}% {relu_v2_winner:<8}")

print("\n" + "=" * 95)

### Reality Check: Loss vs Behavior

**Important observation:** v2 achieved lower loss (~0.247 vs ~0.250) but check the stress test:
- Gray block: 96% confident (should be ~50%)
- Diagonals: 90%+ confident (should be ~50%)

**Lesson:** Loss is not the only metric! A model can:
- ✅ Have low loss on training data
- ❌ Still be overconfident on ambiguous cases

This is why stress testing matters. Exact numbers vary from run to run, so read the loss together with the stress test table above rather than relying on either alone.

In [None]:
# Plot all three loss curves
plt.figure(figsize=(12, 5))
plt.plot(loss_history_sigmoid, label='Sigmoid (200 epochs, SGD)', linewidth=2, alpha=0.8)
plt.plot(loss_history_relu, label='ReLU v1 (200 epochs, SGD)', linewidth=2, alpha=0.8)
plt.plot(loss_history_v2, label='ReLU v2 (300 epochs, Adam, Mixed)', linewidth=2, alpha=0.8)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Comparison: Iteration Matters!')
plt.grid(True, alpha=0.3)
plt.axhline(y=loss_history_relu[-1], color='orange', linestyle='--', alpha=0.3, label='ReLU v1 final')
plt.axhline(y=loss_history_v2[-1], color='green', linestyle='--', alpha=0.3, label='ReLU v2 final')
plt.legend()  # Called after axhline so the reference lines appear in the legend
plt.show()

print(f"Loss reduction: {loss_history_sigmoid[-1]:.4f} → {loss_history_relu[-1]:.4f} → {loss_history_v2[-1]:.4f}")

## What We Learned: ML is Iterative

**The journey:**

**Sigmoid + Noise (Trainer v1):** Loss ~0.250, weak confidence (~50%)
- Lesson: Sigmoid saturates with noisy data

**ReLU + Noise (Trainer v1):** Loss ~0.217, better but only 63% on perfect cases
- Lesson: Improvement, but not production-ready

**ReLU + Mixed Data + Adam + Weight Decay + 300 epochs (Trainer v2):** Loss ~0.185, 97% on clear cases
- Lesson: Iteration pays off!

**Key insights:**
1. First attempt rarely works perfectly (that's normal!)
2. Data quality matters (mix of clean + noisy beats noisy-only)
3. Optimizer choice matters (Adam > SGD for this task)
4. Regularization enables longer training (weight decay prevents overfitting)
5. More training can help when properly regularized
6. Track everything (wandb shows all experiments in one place)

**Real ML workflow:**
```
Train → Evaluate → Identify weakness → Improve → Repeat
```

This is how professional ML works. You just experienced the real development cycle!