# Deep Learning Foundations: From Scratch Understanding

This notebook teaches **every fundamental concept** in deep learning from the ground up.

## What You'll Learn:
1. **Tensors**: The building blocks of neural networks
2. **Gradients**: How networks learn through calculus
3. **Backpropagation**: The core learning algorithm
4. **Loss Functions**: How we measure network performance
5. **Optimizers**: How we update network weights
6. **Activation Functions**: How neurons make decisions

**Teaching Philosophy**: Every line of code will be explained in detail with mathematical intuition.

In [None]:
# Cell 1: Essential Imports with Explanations
"""
LIBRARY EXPLANATIONS:

numpy: Mathematical operations on arrays (tensors)
- Why needed: The core computations in neural networks are matrix multiplications
- What it does: Efficient numerical computations

matplotlib: Data visualization
- Why needed: Understanding data and model behavior visually
- What we'll plot: Loss curves, gradients, decision boundaries

tensorflow: Deep learning framework
- Why needed: Automatic differentiation and GPU acceleration
- Key feature: Computes gradients automatically
"""

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducible results
np.random.seed(42)  # NumPy random seed
tf.random.set_seed(42)  # TensorFlow random seed

print(f"TensorFlow version: {tf.__version__}")
print(f"NumPy version: {np.__version__}")
print("All libraries imported successfully!")

# Configure matplotlib for better plots
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

## 1. Understanding Tensors: The Foundation of Neural Networks

**What is a Tensor?**
- A tensor is a generalization of scalars, vectors, and matrices
- **Scalar** (0D tensor): Just a number → `5`
- **Vector** (1D tensor): Array of numbers → `[1, 2, 3]`
- **Matrix** (2D tensor): Table of numbers → `[[1,2], [3,4]]`
- **3D+ tensors**: Higher dimensional arrays

**Why Tensors Matter in Deep Learning:**
- Images are 3D tensors: `(height, width, channels)`
- Text is often a 2D tensor: `(sequence_length, vocabulary_size)` when each token is one-hot encoded
- Neural network weights are matrices (2D tensors)
- All operations in neural networks are tensor operations
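
To make the shapes listed above concrete, here is a minimal sketch in plain Python that builds a one-hot text tensor and a tiny image tensor from nested lists. The vocabulary, the sentence, and the helper `shape_of` are made up for this illustration; real pipelines would normally use a tokenizer and a tensor library instead.

```python
# Hypothetical toy vocabulary and sentence (assumptions for illustration only)
vocab = ["the", "cat", "sat", "on", "mat"]
sentence = ["the", "cat", "sat"]

word_to_index = {word: i for i, word in enumerate(vocab)}

# One row per word, one column per vocabulary entry; a 1.0 marks the word
text_tensor = [
    [1.0 if column == word_to_index[word] else 0.0 for column in range(len(vocab))]
    for word in sentence
]

def shape_of(nested):
    """Return the shape of a regular, non-empty nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)

# A tiny 2x2 RGB "image": (height, width, channels)
image_tensor = [[[0, 0, 0], [255, 0, 0]],
                [[0, 255, 0], [0, 0, 255]]]

print(shape_of(text_tensor))   # (3, 5) = (sequence_length, vocabulary_size)
print(shape_of(image_tensor))  # (2, 2, 3) = (height, width, channels)
print(shape_of(5))             # () for a scalar
```

The number of entries in the shape tuple is the rank: 0 for the scalar, 2 for the text, 3 for the image.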

In [None]:
# Cell 2: Tensor Fundamentals with Detailed Explanations

print("=== TENSOR DIMENSIONS EXPLAINED ===")

# 0D Tensor (Scalar)
scalar = tf.constant(42.0)
print(f"\n0D Tensor (Scalar):")
print(f"Value: {scalar}")
print(f"Shape: {scalar.shape} ← Empty shape means 0 dimensions")
print(f"Number of dimensions (rank): {len(scalar.shape)}")

# 1D Tensor (Vector)
vector = tf.constant([1.0, 2.0, 3.0, 4.0])
print(f"\n1D Tensor (Vector):")
print(f"Value: {vector}")
print(f"Shape: {vector.shape} ← (4,) means 4 elements in 1 dimension")
print(f"Number of dimensions (rank): {len(vector.shape)}")

# 2D Tensor (Matrix)
matrix = tf.constant([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0]])
print(f"\n2D Tensor (Matrix):")
print(f"Value:\n{matrix}")
print(f"Shape: {matrix.shape} ← (2, 3) means 2 rows, 3 columns")
print(f"Number of dimensions (rank): {len(matrix.shape)}")

# 3D Tensor (like a stack of two tiny 2x2 grayscale images)
tensor_3d = tf.constant([[[1, 2], [3, 4]],
                         [[5, 6], [7, 8]]])
print(f"\n3D Tensor:")
print(f"Value:\n{tensor_3d}")
print(f"Shape: {tensor_3d.shape} ← (2, 2, 2) means 2x2x2 cube")
print(f"Number of dimensions (rank): {len(tensor_3d.shape)}")

# Real-world example: RGB image
rgb_image = tf.random.uniform((224, 224, 3), minval=0, maxval=256, dtype=tf.int32)  # maxval is exclusive, so values are 0-255
print(f"\nReal Example - RGB Image Tensor:")
print(f"Shape: {rgb_image.shape} ← (height=224, width=224, channels=3)")
print(f"Total elements: {tf.size(rgb_image)} ← 224×224×3 = {224*224*3:,}")
print(f"Data type: {rgb_image.dtype} ← Integer values 0-255 for pixel intensities")

## 2. Tensor Operations: The Math Behind Neural Networks

**Key Operations in Neural Networks:**

1. **Matrix Multiplication (`@` or `tf.matmul`)**:
   - Core operation in neural networks
   - `input @ weights = output`
   - Requirement: Inner dimensions must match, e.g. `(3, 4) @ (4, 2)` gives `(3, 2)`

2. **Element-wise Operations (`+`, `-`, `*`, `/`)**:
   - Adding bias: `output + bias`
   - Activation functions: `sigmoid(x)`, `relu(x)`

3. **Broadcasting**:
   - Automatic shape adjustment for operations
   - Example: Adding bias vector to matrix

**Mathematical Foundation:**
- Neural network layer: `output = activation(input @ weights + bias)`
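
The layer formula above can be sketched without any library, which makes the dimension rule and the broadcasting step explicit. The helpers `matmul` and `dense_layer` are hypothetical names written for this example, ReLU is picked as the activation only for illustration, and the numbers match the first sample of this notebook's layer example.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists: (n, k) @ (k, m) -> (n, m)."""
    if len(a[0]) != len(b):
        raise ValueError("Inner dimensions must match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def dense_layer(inputs, weights, bias):
    """Compute relu(inputs @ weights + bias) row by row."""
    linear = matmul(inputs, weights)
    # "Broadcasting" by hand: the same bias vector is added to every row
    shifted = [[value + b for value, b in zip(row, bias)] for row in linear]
    # ReLU activation: negative values become 0
    return [[max(0.0, value) for value in row] for row in shifted]

inputs = [[1.0, 2.0, 3.0, 4.0]]                              # shape (1, 4)
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # shape (4, 2)
bias = [0.1, -0.1]                                           # shape (2,)

print(matmul(inputs, weights))             # approximately [[5.0, 6.0]]
print(dense_layer(inputs, weights, bias))  # approximately [[5.1, 5.9]]
```

Each output value is one dot product between an input row and a weight column, which is why the number of input features must equal the number of weight rows.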

In [None]:
# Cell 3: Tensor Operations with Step-by-Step Explanations

print("=== NEURAL NETWORK OPERATIONS EXPLAINED ===")

# Simulate a simple neural network layer
print("\n🧠 SIMULATING A NEURAL NETWORK LAYER")
print("Formula: output = input @ weights + bias")

# Step 1: Create input data (batch of 3 samples, 4 features each)
input_data = tf.constant([[1.0, 2.0, 3.0, 4.0],    # Sample 1
                          [0.5, 1.5, 2.5, 3.5],    # Sample 2  
                          [2.0, 1.0, 4.0, 3.0]])   # Sample 3

print(f"\n📊 INPUT DATA:")
print(f"Shape: {input_data.shape} ← (batch_size=3, input_features=4)")
print(f"Values:\n{input_data}")
print(f"Meaning: 3 samples, each with 4 features (like height, weight, age, income)")

# Step 2: Create weights (4 input features → 2 output neurons)
weights = tf.constant([[0.1, 0.2],    # Weights for feature 1 → [neuron1, neuron2]
                       [0.3, 0.4],    # Weights for feature 2 → [neuron1, neuron2]
                       [0.5, 0.6],    # Weights for feature 3 → [neuron1, neuron2]
                       [0.7, 0.8]])   # Weights for feature 4 → [neuron1, neuron2]

print(f"\n⚖️ WEIGHTS:")
print(f"Shape: {weights.shape} ← (input_features=4, output_neurons=2)")
print(f"Values:\n{weights}")
print(f"Meaning: How much each input feature contributes to each output neuron")

# Step 3: Matrix multiplication (the core operation)
linear_output = tf.matmul(input_data, weights)
# Alternative syntax: linear_output = input_data @ weights

print(f"\n🔢 MATRIX MULTIPLICATION:")
print(f"Operation: input_data @ weights")
print(f"Shape calculation: {input_data.shape} @ {weights.shape} = {linear_output.shape}")
print(f"Result:\n{linear_output}")
print(f"\nWhat happened: Each input sample was transformed into 2 values (one per neuron)")

# Step 4: Add bias (one bias value per output neuron)
bias = tf.constant([0.1, -0.1])  # Bias for [neuron1, neuron2]

print(f"\n📍 BIAS:")
print(f"Shape: {bias.shape} ← (output_neurons=2,)")
print(f"Values: {bias}")
print(f"Purpose: Shifts the output, allows learning patterns that don't pass through origin")

# Broadcasting: bias is added to each row of linear_output
final_output = linear_output + bias

print(f"\n➕ ADDING BIAS (Broadcasting):")
print(f"Operation: linear_output + bias")
print(f"Shape: {linear_output.shape} + {bias.shape} = {final_output.shape}")
print(f"Final output:\n{final_output}")
print(f"\n✅ Complete neural network layer computation!")

# Visualize the operation
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.imshow(input_data, cmap='viridis', aspect='auto')
plt.title(f'Input Data\nShape: {input_data.shape}')
plt.xlabel('Features')
plt.ylabel('Samples')
plt.colorbar()

plt.subplot(1, 3, 2)
plt.imshow(weights, cmap='plasma', aspect='auto')
plt.title(f'Weights\nShape: {weights.shape}')
plt.xlabel('Output Neurons')
plt.ylabel('Input Features')
plt.colorbar()

plt.subplot(1, 3, 3)
plt.imshow(final_output, cmap='coolwarm', aspect='auto')
plt.title(f'Output\nShape: {final_output.shape}')
plt.xlabel('Neurons')
plt.ylabel('Samples')
plt.colorbar()

plt.suptitle('Neural Network Layer: Input → Weights → Output', fontsize=16)
plt.tight_layout()
plt.show()

## 3. Activation Functions: How Neurons Make Decisions

**Why Activation Functions?**
- Without activation functions, neural networks are just linear transformations
- Activation functions introduce **non-linearity**
- Non-linearity allows networks to learn complex patterns

**Common Activation Functions:**

1. **ReLU (Rectified Linear Unit)**: `max(0, x)`
   - Most popular in hidden layers
   - Fast to compute, helps with gradient flow
   - Problem: "Dead neurons" that always output 0 (and receive zero gradient) if their input stays negative

2. **Sigmoid**: `1 / (1 + e^(-x))`
   - Outputs between 0 and 1
   - Used for binary classification
   - Problem: Vanishing gradients

3. **Tanh**: `(e^x - e^(-x)) / (e^x + e^(-x))`
   - Outputs between -1 and 1
   - Zero-centered, which often makes it easier to train than sigmoid in hidden layers

4. **Softmax**: Used for multi-class classification
   - Converts logits to probabilities
   - All outputs sum to 1
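
The formulas above translate almost directly into code. The following is a minimal plain-Python sketch, not how TensorFlow implements these functions internally; the sign branch in `sigmoid` and the max-subtraction in `softmax` are common numerical-stability tricks added here as assumptions.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def sigmoid(x):
    # Branch on the sign so math.exp never receives a large positive argument
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh(x):
    return math.tanh(x)  # equals (e^x - e^(-x)) / (e^x + e^(-x))

def softmax(logits):
    # Subtracting the maximum leaves the result unchanged but avoids overflow
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(sigmoid(0.0))                  # 0.5
print(relu(-2.0), leaky_relu(-2.0))  # 0.0 -0.02
```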

In [None]:
# Cell 4: Activation Functions Deep Dive

print("=== ACTIVATION FUNCTIONS EXPLAINED ===")

# Create a range of input values to visualize functions
x = tf.linspace(-5.0, 5.0, 100)

# 1. ReLU: max(0, x)
relu_output = tf.nn.relu(x)
print(f"\n🔥 ReLU (Rectified Linear Unit):")
print(f"Formula: max(0, x)")
print(f"Purpose: Introduces non-linearity while being computationally simple")
print(f"Range: [0, +∞)")
print(f"Sample values: x=-2 → {tf.nn.relu(-2.0)}, x=3 → {tf.nn.relu(3.0)}")

# 2. Sigmoid: 1 / (1 + e^(-x))
sigmoid_output = tf.nn.sigmoid(x)
print(f"\n📈 Sigmoid:")
print(f"Formula: 1 / (1 + e^(-x))")
print(f"Purpose: Squashes values to (0,1), used for binary classification")
print(f"Range: (0, 1)")
print(f"Sample values: x=-2 → {tf.nn.sigmoid(-2.0):.4f}, x=0 → {tf.nn.sigmoid(0.0):.4f}, x=2 → {tf.nn.sigmoid(2.0):.4f}")

# 3. Tanh: (e^x - e^(-x)) / (e^x + e^(-x))
tanh_output = tf.nn.tanh(x)
print(f"\n🌊 Tanh (Hyperbolic Tangent):")
print(f"Formula: (e^x - e^(-x)) / (e^x + e^(-x))")
print(f"Purpose: Similar to sigmoid but zero-centered")
print(f"Range: (-1, 1)")
print(f"Sample values: x=-2 → {tf.nn.tanh(-2.0):.4f}, x=0 → {tf.nn.tanh(0.0):.4f}, x=2 → {tf.nn.tanh(2.0):.4f}")

# 4. Leaky ReLU: max(0.01*x, x)
leaky_relu_output = tf.nn.leaky_relu(x, alpha=0.01)
print(f"\n💧 Leaky ReLU:")
print(f"Formula: max(α*x, x) where α=0.01")
print(f"Purpose: Fixes 'dead neuron' problem of ReLU")
print(f"Range: (-∞, +∞)")
print(f"Sample values: x=-2 → {tf.nn.leaky_relu(-2.0, alpha=0.01):.4f}, x=2 → {tf.nn.leaky_relu(2.0, alpha=0.01):.4f}")

# Visualize all activation functions
plt.figure(figsize=(15, 10))

# Plot each activation function
plt.subplot(2, 2, 1)
plt.plot(x, relu_output, 'b-', linewidth=2, label='ReLU')
plt.grid(True, alpha=0.3)
plt.title('ReLU: max(0, x)')
plt.xlabel('Input (x)')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)

plt.subplot(2, 2, 2)
plt.plot(x, sigmoid_output, 'r-', linewidth=2, label='Sigmoid')
plt.grid(True, alpha=0.3)
plt.title('Sigmoid: 1/(1+e^(-x))')
plt.xlabel('Input (x)')
plt.ylabel('Output')
plt.axhline(y=0.5, color='k', linestyle='--', alpha=0.5, label='y=0.5')
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.legend()

plt.subplot(2, 2, 3)
plt.plot(x, tanh_output, 'g-', linewidth=2, label='Tanh')
plt.grid(True, alpha=0.3)
plt.title('Tanh: (e^x - e^(-x))/(e^x + e^(-x))')
plt.xlabel('Input (x)')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)

plt.subplot(2, 2, 4)
plt.plot(x, leaky_relu_output, 'purple', linewidth=2, label='Leaky ReLU')
plt.grid(True, alpha=0.3)
plt.title('Leaky ReLU: max(0.01*x, x)')
plt.xlabel('Input (x)')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)

plt.suptitle('Activation Functions Comparison', fontsize=16)
plt.tight_layout()
plt.show()

# Demonstrate softmax for classification
print(f"\n🎯 SOFTMAX (for multi-class classification):")
logits = tf.constant([2.0, 1.0, 0.1])  # Raw network outputs
probabilities = tf.nn.softmax(logits)

print(f"Raw logits: {logits}")
print(f"After softmax: {probabilities}")
print(f"Sum of probabilities: {tf.reduce_sum(probabilities):.6f} ← Should be 1.0")
print(f"Purpose: Converts any real numbers to valid probabilities")
print(f"Interpretation: [Class A: {probabilities[0]:.3f}, Class B: {probabilities[1]:.3f}, Class C: {probabilities[2]:.3f}]")

## 4. Loss Functions: Measuring How Wrong We Are

**What is a Loss Function?**
- Measures the difference between predicted and actual values
- Gives the neural network a "score" to optimize
- Lower loss = better performance

**Common Loss Functions:**

1. **Mean Squared Error (MSE)**: For regression
   - Formula: `mean((y_true - y_pred)²)`
   - Penalizes large errors heavily

2. **Binary Crossentropy**: For binary classification
   - Formula: `-[y*log(p) + (1-y)*log(1-p)]`
   - Measures probability distribution distance

3. **Categorical Crossentropy**: For multi-class classification
   - Used with softmax activation
   - Measures how far predicted probabilities are from true labels

**Key Insight**: The choice of loss function depends on your problem type!
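
As a rough from-scratch sketch of the three losses above, the functions below follow the formulas directly in plain Python. The clipping constant `eps` is an assumption added to avoid `log(0)`; library implementations may clip or reduce differently.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(y_true)

def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """-sum(y_k * log(p_k)) for one sample; only the true class contributes."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))

print(mse([3.0, 5.0], [2.0, 7.0]))                                       # 2.5
print(round(binary_crossentropy([1, 0], [0.9, 0.1]), 4))                 # 0.1054
print(round(categorical_crossentropy([1, 0, 0], [0.8, 0.15, 0.05]), 4))  # 0.2231
```

Note how the categorical loss depends only on the probability assigned to the true class: `-log(0.8)` in the last line.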

In [None]:
# Cell 5: Loss Functions Explained with Examples

print("=== LOSS FUNCTIONS: MEASURING PREDICTION QUALITY ===")

# 1. REGRESSION EXAMPLE: Predicting house prices
print("\n🏠 REGRESSION EXAMPLE: House Price Prediction")
true_prices = tf.constant([300000., 450000., 200000., 600000.])  # Actual house prices
pred_prices = tf.constant([320000., 430000., 180000., 580000.])  # Model predictions

# Mean Squared Error (MSE)
mse_loss = tf.keras.losses.mean_squared_error(true_prices, pred_prices)
print(f"\nTrue prices: {true_prices}")
print(f"Predicted:   {pred_prices}")
print(f"Differences: {true_prices - pred_prices}")
print(f"Squared errors: {(true_prices - pred_prices)**2}")
print(f"MSE Loss: {mse_loss:.2f}")
print(f"Interpretation: RMSE (square root of MSE) is about ${tf.sqrt(mse_loss):.0f}, a typical error per prediction")

# 2. BINARY CLASSIFICATION EXAMPLE: Email spam detection
print("\n📧 BINARY CLASSIFICATION: Email Spam Detection")
true_labels = tf.constant([1., 0., 1., 0.])  # 1=spam, 0=not spam
pred_probs = tf.constant([0.9, 0.1, 0.8, 0.3])  # Predicted probabilities

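# Binary Crossentropy
# Keras averages over the last axis, so add a trailing axis to keep one loss per sample
bce_loss = tf.keras.losses.binary_crossentropy(true_labels[:, tf.newaxis], pred_probs[:, tf.newaxis])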
print(f"\nTrue labels: {true_labels} (1=spam, 0=not spam)")
print(f"Predictions: {pred_probs} (probability of being spam)")
print(f"Individual losses: {bce_loss}")
print(f"Average BCE Loss: {tf.reduce_mean(bce_loss):.4f}")
print(f"Interpretation: Lower loss means better probability estimates")

# Let's see what happens with perfect vs terrible predictions
perfect_probs = tf.constant([1.0, 0.0, 1.0, 0.0])  # Perfect predictions
terrible_probs = tf.constant([0.1, 0.9, 0.1, 0.9])  # Opposite predictions

perfect_loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(true_labels, perfect_probs))
terrible_loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(true_labels, terrible_probs))

print(f"\nPerfect predictions loss: {perfect_loss:.6f} ← Nearly zero!")
print(f"Terrible predictions loss: {terrible_loss:.4f} ← Very high!")

# 3. MULTI-CLASS CLASSIFICATION: Animal classification
print("\n🐱 MULTI-CLASS CLASSIFICATION: Animal Classification")
# One-hot encoded true labels: [cat, dog, bird]
true_onehot = tf.constant([[1., 0., 0.],  # cat
                           [0., 1., 0.],  # dog  
                           [0., 0., 1.]])  # bird

# Predicted probabilities (after softmax)
pred_probs_multi = tf.constant([[0.8, 0.15, 0.05],  # Confident cat prediction
                                [0.3, 0.6, 0.1],     # Moderate dog prediction
                                [0.1, 0.2, 0.7]])    # Good bird prediction

# Categorical Crossentropy
cce_loss = tf.keras.losses.categorical_crossentropy(true_onehot, pred_probs_multi)
print(f"\nTrue labels (one-hot):")
print(f"  Sample 1: {true_onehot[0]} ← Cat")
print(f"  Sample 2: {true_onehot[1]} ← Dog") 
print(f"  Sample 3: {true_onehot[2]} ← Bird")
print(f"\nPredicted probabilities:")
print(f"  Sample 1: {pred_probs_multi[0]} ← [cat: 80%, dog: 15%, bird: 5%]")
print(f"  Sample 2: {pred_probs_multi[1]} ← [cat: 30%, dog: 60%, bird: 10%]")
print(f"  Sample 3: {pred_probs_multi[2]} ← [cat: 10%, dog: 20%, bird: 70%]")
print(f"\nIndividual losses: {cce_loss}")
print(f"Average CCE Loss: {tf.reduce_mean(cce_loss):.4f}")

# Visualize how loss changes with prediction quality
plt.figure(figsize=(15, 5))

# MSE Loss visualization
plt.subplot(1, 3, 1)
errors = np.linspace(-50000, 50000, 100)
mse_values = errors**2
plt.plot(errors, mse_values, 'b-', linewidth=2)
plt.title('MSE Loss: (y_true - y_pred)²')
plt.xlabel('Prediction Error')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.axvline(x=0, color='r', linestyle='--', alpha=0.7, label='Perfect prediction')
plt.legend()

# Binary Crossentropy visualization
plt.subplot(1, 3, 2)
probs = np.linspace(0.001, 0.999, 100)  # Avoid log(0)
# For true label = 1, loss = -log(p)
bce_values_true1 = -np.log(probs)
# For true label = 0, loss = -log(1-p)
bce_values_true0 = -np.log(1 - probs)

plt.plot(probs, bce_values_true1, 'r-', linewidth=2, label='True label = 1')
plt.plot(probs, bce_values_true0, 'b-', linewidth=2, label='True label = 0')
plt.title('Binary Crossentropy Loss')
plt.xlabel('Predicted Probability')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.legend()
plt.ylim(0, 5)

# Loss comparison
plt.subplot(1, 3, 3)
loss_types = ['MSE\n(Regression)', 'BCE\n(Binary)', 'CCE\n(Multi-class)']
example_losses = [float(mse_loss)/1e8, float(tf.reduce_mean(bce_loss)), float(tf.reduce_mean(cce_loss))]
colors = ['skyblue', 'lightcoral', 'lightgreen']

bars = plt.bar(loss_types, example_losses, color=colors)
plt.title('Loss Function Comparison\n(Example Values)')
plt.ylabel('Loss Value')
plt.grid(True, alpha=0.3)

# Add value labels on bars
for bar, loss in zip(bars, example_losses):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{loss:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\n✨ KEY INSIGHTS:")
print("• MSE: Penalizes large errors heavily (quadratic growth)")
print("• BCE: Loss approaches infinity as predictions approach opposite of truth")
print("• CCE: Measures how far predicted distribution is from true distribution")
print("• Choice depends on problem: regression→MSE, classification→crossentropy")

## 5. Gradients: The Mathematics of Learning

**What are Gradients?**
- Gradients tell us **which direction** to change weights to reduce loss
- Mathematically: the partial derivative of loss with respect to each weight
- **Positive gradient**: Increasing the weight increases the loss → **decrease** the weight
- **Negative gradient**: Increasing the weight decreases the loss → **increase** the weight

**Why Gradients Matter:**
- Without gradients, neural networks can't learn
- Gradients guide the weight updates during training
- TensorFlow automatically computes gradients (automatic differentiation)

**Gradient Descent Algorithm:**
1. Forward pass: Compute predictions and loss
2. Backward pass: Compute gradients
3. Update weights: `new_weight = old_weight - learning_rate * gradient`
4. Repeat until convergence
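
The update rule above can be tried on a one-variable function without any framework. This sketch uses `f(x) = x² + 3x + 2`, checks the hand-derived gradient against a finite-difference estimate, and then runs the descent loop; the step size `h`, the learning rate, and the step count are arbitrary choices for illustration.

```python
def f(x):
    return x**2 + 3*x + 2

def analytical_gradient(x):
    return 2*x + 3

def numerical_gradient(func, x, h=1e-5):
    """Central finite difference: (f(x+h) - f(x-h)) / (2h)."""
    return (func(x + h) - func(x - h)) / (2 * h)

def gradient_descent(x, learning_rate=0.1, steps=100):
    """Repeat x <- x - learning_rate * gradient for a fixed number of steps."""
    for _ in range(steps):
        x = x - learning_rate * analytical_gradient(x)
    return x

# The two gradient estimates should agree closely at every point
for point in [-2.0, 0.0, 2.0]:
    print(point, analytical_gradient(point), round(numerical_gradient(f, point), 4))

x_min = gradient_descent(1.5)
print(round(x_min, 4))  # -1.5, where f'(x) = 2x + 3 = 0
```

A finite-difference check like this is a common way to sanity-test a gradient, whether it was derived by hand or produced by automatic differentiation.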

In [None]:
# Cell 6: Understanding Gradients with Automatic Differentiation

print("=== GRADIENTS: THE MATHEMATICS OF LEARNING ===")

# Simple example: y = x² + 3x + 2, find gradient at different points
print("\n📐 MATHEMATICAL EXAMPLE: f(x) = x² + 3x + 2")
print("Analytical derivative: f'(x) = 2x + 3")

def simple_function(x):
    """Simple quadratic function for demonstration"""
    return x**2 + 3*x + 2

# Compute gradients at different points using TensorFlow
test_points = [tf.constant(x, dtype=tf.float32) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

print("\nPoint | Function Value | TF Gradient | Analytical | Match?")
print("-" * 60)
for x in test_points:
    with tf.GradientTape() as tape:
        tape.watch(x)  # x is a constant tensor, so it must be watched explicitly (tf.Variable is tracked automatically)
        y = simple_function(x)
    
    tf_gradient = tape.gradient(y, x)  # Automatic differentiation
    analytical_gradient = 2*x + 3     # Hand-calculated derivative
    match = "✅" if abs(tf_gradient - analytical_gradient) < 1e-6 else "❌"
    
    print(f"{x.numpy():5.1f} | {y.numpy():13.1f} | {tf_gradient.numpy():10.1f} | {analytical_gradient.numpy():10.1f} | {match}")

print("\n🎯 Key Insight: TensorFlow's automatic differentiation matches analytical calculus!")

# Neural Network Gradient Example
print("\n\n🧠 NEURAL NETWORK GRADIENT EXAMPLE")
print("Simple network: y = relu(x * w + b)")
print("Loss: MSE between prediction and target")

# Create simple data
x = tf.constant([[1.0], [2.0], [3.0], [4.0]])  # Input features
y_true = tf.constant([[2.0], [4.0], [6.0], [8.0]])  # Target: y = 2x

# Initialize trainable variables (parameters)
w = tf.Variable([[1.5]], dtype=tf.float32, name='weight')  # Initial weight
b = tf.Variable([0.5], dtype=tf.float32, name='bias')      # Initial bias

print(f"\nInitial parameters:")
print(f"Weight: {w.numpy()[0][0]:.2f}")
print(f"Bias: {b.numpy()[0]:.2f}")
print(f"Target relationship: y = 2x (so ideal weight=2, bias=0)")

# Forward pass with gradient tracking
with tf.GradientTape() as tape:
    # Forward pass
    linear_output = tf.matmul(x, w) + b
    y_pred = tf.nn.relu(linear_output)  # Apply ReLU activation
    
    # Compute loss
    loss = tf.reduce_mean(tf.square(y_true - y_pred))  # MSE loss

# Compute gradients
gradients = tape.gradient(loss, [w, b])
grad_w, grad_b = gradients

print(f"\nForward pass results:")
print(f"Predictions: {tf.squeeze(y_pred).numpy()}")
print(f"True values: {tf.squeeze(y_true).numpy()}")
print(f"Loss (MSE): {loss.numpy():.4f}")

print(f"\nGradients (direction to reduce loss):")
print(f"∂Loss/∂Weight = {grad_w.numpy()[0][0]:.4f}")
print(f"∂Loss/∂Bias   = {grad_b.numpy()[0]:.4f}")

# Gradient interpretation
print(f"\n📊 GRADIENT INTERPRETATION:")
if grad_w.numpy()[0][0] > 0:
    print(f"• Weight gradient > 0: Increasing weight increases loss → DECREASE weight")
else:
    print(f"• Weight gradient < 0: Increasing weight decreases loss → INCREASE weight")
    
if grad_b.numpy()[0] > 0:
    print(f"• Bias gradient > 0: Increasing bias increases loss → DECREASE bias")
else:
    print(f"• Bias gradient < 0: Increasing bias decreases loss → INCREASE bias")

# Demonstrate one gradient descent step
learning_rate = 0.1
print(f"\n📈 GRADIENT DESCENT STEP (learning_rate = {learning_rate}):")
print(f"Update rule: new_param = old_param - learning_rate * gradient")

old_w = w.numpy()[0][0]
old_b = b.numpy()[0]

# Apply gradient descent update
w.assign_sub(learning_rate * grad_w)  # w = w - lr * grad_w
b.assign_sub(learning_rate * grad_b)  # b = b - lr * grad_b

new_w = w.numpy()[0][0]
new_b = b.numpy()[0]

print(f"Weight: {old_w:.4f} → {new_w:.4f} (change: {new_w - old_w:+.4f})")
print(f"Bias:   {old_b:.4f} → {new_b:.4f} (change: {new_b - old_b:+.4f})")

# Verify loss decreased (no GradientTape needed: we only evaluate the loss here)
linear_output = tf.matmul(x, w) + b
y_pred_new = tf.nn.relu(linear_output)
loss_new = tf.reduce_mean(tf.square(y_true - y_pred_new))

print(f"\nLoss: {loss.numpy():.6f} → {loss_new.numpy():.6f} (change: {loss_new.numpy() - loss.numpy():+.6f})")
if loss_new < loss:
    print("✅ Loss decreased! Gradient descent is working!")
else:
    print("❌ Loss increased. Something might be wrong.")

# Visualize the loss surface and gradient
plt.figure(figsize=(15, 5))

# Plot 1: Function and its derivative
plt.subplot(1, 3, 1)
x_vals = np.linspace(-3, 2, 100)
y_vals = x_vals**2 + 3*x_vals + 2
dy_vals = 2*x_vals + 3

plt.plot(x_vals, y_vals, 'b-', linewidth=2, label='f(x) = x² + 3x + 2')
plt.plot(x_vals, dy_vals, 'r--', linewidth=2, label="f'(x) = 2x + 3")
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=-1.5, color='g', linestyle=':', alpha=0.7, label='Minimum at x=-1.5')
plt.title('Function and its Derivative')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Gradient descent visualization
plt.subplot(1, 3, 2)
# Simulate gradient descent on the simple function
x_start = 1.5
lr = 0.1
steps = 10

x_path = [x_start]
for _ in range(steps):
    grad = 2*x_path[-1] + 3  # Gradient
    x_new = x_path[-1] - lr * grad  # Gradient descent step
    x_path.append(x_new)

y_path = [xi**2 + 3*xi + 2 for xi in x_path]  # xi avoids reusing the name of the tensor x

plt.plot(x_vals, y_vals, 'b-', linewidth=2, alpha=0.7)
plt.plot(x_path, y_path, 'ro-', linewidth=2, markersize=8, label='Gradient descent path')
plt.plot(x_path[0], y_path[0], 'go', markersize=12, label='Start')
plt.plot(x_path[-1], y_path[-1], 'rs', markersize=12, label='End')
plt.title('Gradient Descent Optimization')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 3: Neural network prediction improvement
plt.subplot(1, 3, 3)
x_plot = tf.squeeze(x).numpy()
y_true_plot = tf.squeeze(y_true).numpy()
y_pred_old = tf.squeeze(y_pred).numpy()
y_pred_new_plot = tf.squeeze(y_pred_new).numpy()

plt.plot(x_plot, y_true_plot, 'go-', linewidth=2, markersize=8, label='True values')
plt.plot(x_plot, y_pred_old, 'r^-', linewidth=2, markersize=8, label='Before gradient step')
plt.plot(x_plot, y_pred_new_plot, 'bs-', linewidth=2, markersize=8, label='After gradient step')
plt.title('Neural Network Improvement')
plt.xlabel('Input (x)')
plt.ylabel('Output (y)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✨ GRADIENT INSIGHTS:")
print("• Gradients point in the direction of steepest loss INCREASE")
print("• We move OPPOSITE to gradients to minimize loss")
print("• Learning rate controls how large a step we take")
print("• TensorFlow automatically computes gradients for any function!")

## 6. Optimizers: Smart Ways to Update Weights

**Beyond Basic Gradient Descent:**
While basic gradient descent works, we can do better! Modern optimizers use clever tricks to:
- Adapt learning rates automatically
- Build momentum to keep moving through plateaus and shallow local minima
- Handle different scales of parameters

**Popular Optimizers:**

1. **SGD (Stochastic Gradient Descent)**:
   - Basic: `w = w - lr * gradient`
   - With momentum: Accelerates in consistent directions

2. **Adam (Adaptive Moment Estimation)**:
   - Combines momentum + adaptive learning rates
   - Most popular choice for beginners
   - Works well across many problems

3. **RMSprop**: 
   - Adapts learning rate per parameter
   - Good for recurrent neural networks

**Key Parameters:**
- **Learning Rate**: How large a step to take (typically 0.001-0.1)
- **Momentum**: How much to consider previous gradients (0-1)
- **Beta values**: Control momentum and learning rate adaptation
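
To show what these parameters do, here is a minimal plain-Python sketch of the three update rules on a one-parameter toy loss `f(w) = (w - 3)²`. The toy loss, the hyperparameter values, and the step count are assumptions chosen for illustration; the code follows the textbook update rules and is not a copy of the Keras implementations.

```python
import math

def gradient(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def sgd(w, lr=0.1, steps=300):
    for _ in range(steps):
        w -= lr * gradient(w)
    return w

def sgd_momentum(w, lr=0.1, momentum=0.9, steps=300):
    velocity = 0.0
    for _ in range(steps):
        # Velocity is a decaying sum of past gradients
        velocity = momentum * velocity + gradient(w)
        w -= lr * velocity
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Starting from w = 0, each result should end up close to 3.0
for name, optimizer in [("SGD", sgd), ("Momentum", sgd_momentum), ("Adam", adam)]:
    print(name, round(optimizer(0.0), 3))
```

In Adam, `beta1` plays the role of momentum, while `beta2` controls how quickly the per-parameter step size adapts to recent gradient magnitudes.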

In [None]:
# Cell 7: Optimizer Comparison with Detailed Explanations

print("=== OPTIMIZERS: SMART WEIGHT UPDATES ===")

# Create a more complex optimization problem
# We'll optimize a 2D function to visualize different optimizer paths

def complex_loss_function(x, y):
    """A bumpy, non-convex 2D function: a quadratic bowl with a sinusoidal ripple"""
    return (x**2 + y**2) + 0.3 * tf.sin(3*x) * tf.cos(3*y) + 0.1 * (x + y)**2

# Starting point for optimization
start_x, start_y = 2.0, 1.5

print(f"🎯 OPTIMIZATION PROBLEM:")
print(f"Function: f(x,y) = (x² + y²) + 0.3*sin(3x)*cos(3y) + 0.1*(x+y)²")
print(f"Goal: Find the minimum starting from ({start_x}, {start_y})")
print(f"The minimum is close to (0, 0), shifted slightly by the sine term")

# Function to run optimization with different optimizers
def optimize_with_optimizer(optimizer_class, optimizer_name, **kwargs):
    """Run optimization and return the path taken"""
    
    # Initialize variables
    x = tf.Variable(start_x, dtype=tf.float32)
    y = tf.Variable(start_y, dtype=tf.float32)
    
    # Create optimizer
    optimizer = optimizer_class(**kwargs)
    
    # Track the optimization path
    path_x, path_y, losses = [x.numpy()], [y.numpy()], []
    
    # Optimization loop
    for step in range(50):
        with tf.GradientTape() as tape:
            loss = complex_loss_function(x, y)
        
        # Compute gradients
        gradients = tape.gradient(loss, [x, y])
        
        # Apply optimizer
        optimizer.apply_gradients(zip(gradients, [x, y]))
        
        # Record path
        path_x.append(x.numpy())
        path_y.append(y.numpy())
        losses.append(loss.numpy())  # loss measured before this step's update
    
    return path_x, path_y, losses

# Test different optimizers
# (learning rates differ per optimizer, so this is an illustration, not a like-for-like benchmark)
optimizers_to_test = [
    (tf.keras.optimizers.SGD, "SGD", {"learning_rate": 0.01}),
    (tf.keras.optimizers.SGD, "SGD + Momentum", {"learning_rate": 0.01, "momentum": 0.9}),
    (tf.keras.optimizers.Adam, "Adam", {"learning_rate": 0.1}),
    (tf.keras.optimizers.RMSprop, "RMSprop", {"learning_rate": 0.1})
]

print(f"\n🔬 TESTING OPTIMIZERS:")
optimizer_results = {}

for optimizer_class, name, kwargs in optimizers_to_test:
    print(f"\n{name}:")
    print(f"  Parameters: {kwargs}")
    
    path_x, path_y, losses = optimize_with_optimizer(optimizer_class, name, **kwargs)
    
    final_loss = losses[-1]
    final_x, final_y = path_x[-1], path_y[-1]
    
    print(f"  Final position: ({final_x:.4f}, {final_y:.4f})")
    print(f"  Final loss: {final_loss:.6f}")
    print(f"  Distance from origin: {np.sqrt(final_x**2 + final_y**2):.4f}")
    
    optimizer_results[name] = {
        'path_x': path_x,
        'path_y': path_y, 
        'losses': losses,
        'final_loss': final_loss
    }

# Create visualization
plt.figure(figsize=(20, 12))

# Create meshgrid for contour plot
x_range = np.linspace(-2.5, 2.5, 100)
y_range = np.linspace(-2.5, 2.5, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = (X**2 + Y**2) + 0.3 * np.sin(3*X) * np.cos(3*Y) + 0.1 * (X + Y)**2

# Plot 1: Optimization paths on contour plot
plt.subplot(2, 3, 1)
contour = plt.contour(X, Y, Z, levels=20, alpha=0.6)
plt.colorbar(contour)

colors = ['red', 'blue', 'green', 'orange']
for i, (name, results) in enumerate(optimizer_results.items()):
    plt.plot(results['path_x'], results['path_y'], 
             color=colors[i], linewidth=2, marker='o', markersize=4, label=name)

plt.plot(start_x, start_y, 'ko', markersize=12, label='Start')
plt.plot(0, 0, 'r*', markersize=15, label='Origin (minimum is nearby)')
plt.title('Optimizer Paths on Loss Surface')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Loss curves
plt.subplot(2, 3, 2)
for i, (name, results) in enumerate(optimizer_results.items()):
    plt.plot(results['losses'], color=colors[i], linewidth=2, label=name)

plt.title('Loss During Optimization')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

# Plot 3: Final performance comparison
plt.subplot(2, 3, 3)
names = list(optimizer_results.keys())
final_losses = [results['final_loss'] for results in optimizer_results.values()]

bars = plt.bar(names, final_losses, color=colors)
plt.title('Final Loss Comparison')
plt.ylabel('Final Loss')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)

# Add value labels on bars
for bar, loss in zip(bars, final_losses):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(),
             f'{loss:.4f}', ha='center', va='bottom')

# Plots 4-6: Individual optimizer details
detailed_optimizers = ['SGD', 'Adam', 'SGD + Momentum']
for i, opt_name in enumerate(detailed_optimizers):
    plt.subplot(2, 3, 4 + i)
    
    if opt_name in optimizer_results:
        results = optimizer_results[opt_name]
        
        # Plot path with gradient vectors
        plt.contour(X, Y, Z, levels=15, alpha=0.3)
        path_x, path_y = results['path_x'], results['path_y']
        
        # Plot path
        plt.plot(path_x, path_y, 'ro-', linewidth=2, markersize=6)
        
        # Show first few steps with arrows
        for j in range(min(5, len(path_x)-1)):
            dx = path_x[j+1] - path_x[j]
            dy = path_y[j+1] - path_y[j]
            plt.arrow(path_x[j], path_y[j], dx, dy,
                     head_width=0.05, head_length=0.05, fc='blue', ec='blue')
        
        plt.plot(path_x[0], path_y[0], 'go', markersize=10, label='Start')
        plt.plot(path_x[-1], path_y[-1], 'rs', markersize=10, label='End')
        plt.title(f'{opt_name} Detailed Path')
        plt.xlabel('x')
        plt.ylabel('y')
        plt.legend()
        plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Explain optimizer characteristics
print(f"\n📊 OPTIMIZER ANALYSIS:")
print(f"\n1. SGD (Stochastic Gradient Descent):")
print(f"   • Formula: w = w - learning_rate * gradient")
print(f"   • Pros: Simple, reliable, good for convex problems")
print(f"   • Cons: Can be slow, may oscillate in valleys")

print(f"\n2. SGD + Momentum:")
print(f"   • Formula: v = momentum * v + gradient; w = w - learning_rate * v")
print(f"   • Pros: Accelerates in consistent directions, dampens oscillations")
print(f"   • Cons: May overshoot minima")

print(f"\n3. Adam (Adaptive Moment Estimation):")
print(f"   • Combines momentum + adaptive learning rates")
print(f"   • Pros: Works well out-of-the-box, adapts to parameter scales")
print(f"   • Cons: Can converge to suboptimal solutions in some cases")

print(f"\n4. RMSprop:")
print(f"   • Adapts learning rate based on recent gradient magnitudes")
print(f"   • Pros: Good for non-stationary objectives, handles sparse gradients")
print(f"   • Cons: No momentum term by default, unlike Adam")

print(f"\n✨ PRACTICAL RECOMMENDATIONS:")
print(f"• Start with Adam (learning_rate=0.001) for most problems")
print(f"• Use SGD + momentum for fine-tuning, or when you want simpler, easier-to-reason-about updates")
print(f"• Learning rate is the most important hyperparameter to tune")
print(f"• Monitor loss curves to detect convergence issues")
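
The four update rules summarized in the analysis above can be written out in a few lines of NumPy. This is a minimal sketch, not the code used for the plots: the test surface `f(x, y) = x^2 + 10*y^2`, the starting point, the step count, and every hyperparameter value are assumptions chosen only for illustration.

```python
import numpy as np

def loss(w):
    """Illustrative elongated bowl f(x, y) = x^2 + 10*y^2 (an assumption)."""
    return w[0] ** 2 + 10.0 * w[1] ** 2

def grad(w):
    """Analytic gradient of the bowl above."""
    return np.array([2.0 * w[0], 20.0 * w[1]])

def sgd(w, g, state, lr=0.05):
    # w = w - learning_rate * gradient
    return w - lr * g

def sgd_momentum(w, g, state, lr=0.05, momentum=0.9):
    # v = momentum * v + gradient; w = w - learning_rate * v
    state["v"] = momentum * state.get("v", np.zeros_like(w)) + g
    return w - lr * state["v"]

def rmsprop(w, g, state, lr=0.05, rho=0.9, eps=1e-8):
    # A running average of squared gradients rescales each coordinate's step
    state["s"] = rho * state.get("s", np.zeros_like(w)) + (1 - rho) * g ** 2
    return w - lr * g / (np.sqrt(state["s"]) + eps)

def adam(w, g, state, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (momentum) plus second moment (adaptive scale), both bias-corrected
    state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", np.zeros_like(w)) + (1 - beta1) * g
    state["v"] = beta2 * state.get("v", np.zeros_like(w)) + (1 - beta2) * g ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

def run(step_fn, w0=(2.0, 2.0), steps=200):
    """Apply one update rule repeatedly and return the final loss."""
    w, state = np.array(w0, dtype=float), {}
    for _ in range(steps):
        w = step_fn(w, grad(w), state)
    return loss(w)

initial_loss = loss(np.array([2.0, 2.0]))
final_losses = {fn.__name__: run(fn) for fn in (sgd, sgd_momentum, rmsprop, adam)}
for name, value in final_losses.items():
    print(f"{name:>12}: final loss = {value:.6f} (started at {initial_loss:.1f})")
```

Each function takes the current weights, the gradient, and a small `state` dictionary, so the only difference between optimizers is what they remember between steps: nothing (SGD), a velocity (momentum), a squared-gradient average (RMSprop), or both (Adam).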

## 7. Putting It All Together: Complete Learning Example

**Now we'll combine everything we've learned:**
1. ✅ **Tensors**: Data representation
2. ✅ **Operations**: Matrix multiplication, broadcasting
3. ✅ **Activations**: Non-linear transformations
4. ✅ **Loss Functions**: Measuring prediction quality
5. ✅ **Gradients**: Direction of improvement
6. ✅ **Optimizers**: Smart weight updates

**Complete Training Loop:**
```python
for epoch in range(num_epochs):
    for batch_x, batch_y in dataset:         # dataset yields (inputs, labels) pairs
        with tf.GradientTape() as tape:
            predictions = model(batch_x)     # Forward pass
            loss = loss_function(batch_y, predictions)  # Measure the error

        gradients = tape.gradient(loss, model.trainable_variables)            # Backward pass
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))  # Weight update
```
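
The same forward, loss, gradient, update cycle can be sketched without a framework. The example below is an illustration under stated assumptions, not the model trained in the next cell: it uses logistic regression (no hidden layer), a hand-derived gradient in place of `tape.gradient`, a plain SGD step in place of `optimizer.apply_gradients`, and made-up toy data and hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): two Gaussian blobs with labels 0 and 1
n = 500
X = np.vstack([rng.normal(-1.0, 0.7, size=(n, 2)),
               rng.normal(+1.0, 0.7, size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

w = np.zeros(2)        # model parameters
b = 0.0
learning_rate = 0.1
batch_size = 32
num_epochs = 20

def predict(inputs):
    """Forward pass: sigmoid(inputs @ w + b)."""
    return 1.0 / (1.0 + np.exp(-(inputs @ w + b)))

for epoch in range(num_epochs):
    order = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        batch_x, batch_y = X[idx], y[idx]

        predictions = predict(batch_x)           # forward pass
        error = predictions - batch_y            # dLoss/dz for sigmoid + cross-entropy
        grad_w = batch_x.T @ error / len(idx)    # gradient w.r.t. the weights
        grad_b = error.mean()                    # gradient w.r.t. the bias

        w -= learning_rate * grad_w              # plain SGD update
        b -= learning_rate * grad_b

accuracy = np.mean((predict(X) > 0.5) == y)
print(f"weights={w}, bias={b:.3f}, accuracy={accuracy:.3f}")
```

The inner loop has the same four steps as the TensorFlow version; the framework's contribution is computing the gradient line automatically for any model.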

In [None]:
# Cell 8: Complete Deep Learning Example from Scratch

print("=== COMPLETE DEEP LEARNING EXAMPLE ===")
print("Building a neural network from scratch with detailed explanations!")

# Generate synthetic dataset for binary classification
print("\n📊 CREATING DATASET:")
np.random.seed(42)
tf.random.set_seed(42)

# Create two classes of data with some overlap
n_samples = 1000
n_features = 2

# Class 0: centered around (-1, -1)
class_0 = np.random.multivariate_normal([-1, -1], [[0.5, 0.1], [0.1, 0.5]], n_samples//2)
labels_0 = np.zeros((n_samples//2, 1))

# Class 1: centered around (1, 1)
class_1 = np.random.multivariate_normal([1, 1], [[0.5, -0.1], [-0.1, 0.5]], n_samples//2)
labels_1 = np.ones((n_samples//2, 1))

# Combine data
X_data = np.vstack([class_0, class_1]).astype(np.float32)
y_data = np.vstack([labels_0, labels_1]).astype(np.float32)

# Shuffle the data
indices = np.random.permutation(n_samples)
X_data = X_data[indices]
y_data = y_data[indices]

# Convert to TensorFlow tensors
X = tf.constant(X_data)
y = tf.constant(y_data)

print(f"Dataset shape: {X.shape} (samples, features)")
print(f"Labels shape: {y.shape}")
print(f"Feature range: X1=[{X[:, 0].numpy().min():.2f}, {X[:, 0].numpy().max():.2f}], X2=[{X[:, 1].numpy().min():.2f}, {X[:, 1].numpy().max():.2f}]")
print(f"Class distribution: {tf.reduce_sum(y).numpy():.0f} class 1, {len(y) - tf.reduce_sum(y).numpy():.0f} class 0")

# Visualize the dataset
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
class_0_mask = y.numpy().flatten() == 0
class_1_mask = y.numpy().flatten() == 1

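# Index the NumPy array: TensorFlow tensors do not accept a boolean mask combined with a column index
plt.scatter(X_data[class_0_mask, 0], X_data[class_0_mask, 1], c='red', alpha=0.6, label='Class 0', s=20)
plt.scatter(X_data[class_1_mask, 0], X_data[class_1_mask, 1], c='blue', alpha=0.6, label='Class 1', s=20)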
plt.title('Dataset Visualization')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)

# Define the neural network architecture
print(f"\n🏗️ BUILDING NEURAL NETWORK:")
print(f"Architecture: Input(2) → Hidden(4, ReLU) → Output(1, Sigmoid)")

# Initialize network parameters
input_size = 2
hidden_size = 4
output_size = 1

# Xavier/Glorot initialization for better convergence
# Formula: uniform(-sqrt(6/(fan_in + fan_out)), sqrt(6/(fan_in + fan_out)))
def xavier_init(fan_in, fan_out):
    limit = tf.sqrt(6.0 / (fan_in + fan_out))
    return tf.random.uniform((fan_in, fan_out), -limit, limit)

# Layer 1: Input → Hidden
W1 = tf.Variable(xavier_init(input_size, hidden_size), name='W1')
b1 = tf.Variable(tf.zeros((hidden_size,)), name='b1')

# Layer 2: Hidden → Output
W2 = tf.Variable(xavier_init(hidden_size, output_size), name='W2')
b2 = tf.Variable(tf.zeros((output_size,)), name='b2')

print(f"Weight matrices initialized:")
print(f"  W1 shape: {W1.shape} (input_to_hidden weights)")
print(f"  b1 shape: {b1.shape} (hidden layer biases)")
print(f"  W2 shape: {W2.shape} (hidden_to_output weights)")
print(f"  b2 shape: {b2.shape} (output layer bias)")
print(f"Total parameters: {tf.size(W1) + tf.size(b1) + tf.size(W2) + tf.size(b2)} = {input_size*hidden_size} + {hidden_size} + {hidden_size*output_size} + {output_size}")

# Define the forward pass
def forward_pass(X, training=True):
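    """Complete forward pass through the network.

    Note: `training` does not change the computation. It only controls whether
    the intermediate values (z1, a1, z2) are returned alongside the output.
    """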
    
    # Layer 1: Input → Hidden
    z1 = tf.matmul(X, W1) + b1  # Linear transformation
    a1 = tf.nn.relu(z1)         # ReLU activation
    
    # Layer 2: Hidden → Output  
    z2 = tf.matmul(a1, W2) + b2 # Linear transformation
    a2 = tf.nn.sigmoid(z2)      # Sigmoid activation (for binary classification)
    
    if training:
        return a2, {'z1': z1, 'a1': a1, 'z2': z2}  # Return intermediate values for analysis
    else:
        return a2

# Test forward pass
print(f"\n🧠 TESTING FORWARD PASS:")
sample_predictions, intermediates = forward_pass(X[:5])  # Test with first 5 samples

print(f"Input (first 5 samples):")
print(X[:5].numpy())
print(f"\nLayer 1 pre-activation (z1):")
print(intermediates['z1'].numpy())
print(f"\nLayer 1 post-activation (a1, after ReLU):")
print(intermediates['a1'].numpy())
print(f"\nLayer 2 pre-activation (z2):")
print(intermediates['z2'].numpy())
print(f"\nFinal output (probabilities):")
print(sample_predictions.numpy())
print(f"\nTrue labels (first 5):")
print(y[:5].numpy().flatten())

# Setup training
print(f"\n🎯 TRAINING SETUP:")
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss_function = tf.keras.losses.BinaryCrossentropy()

print(f"Optimizer: Adam with learning_rate=0.01")
print(f"Loss function: Binary Crossentropy")
print(f"Metrics: Loss, Accuracy")

# Training loop with detailed logging
epochs = 100
loss_history = []
accuracy_history = []
weight_norms = {'W1': [], 'W2': []}

print(f"\n🚀 STARTING TRAINING:")
print(f"Epochs: {epochs}")
print(f"Progress: [Loss | Accuracy | W1_norm | W2_norm]")

for epoch in range(epochs):
    with tf.GradientTape() as tape:
        # Forward pass (training=False only means "return the output without intermediates")
        predictions = forward_pass(X, training=False)
        
        # Compute loss
        loss = loss_function(y, predictions)
        
        # Add L2 regularization (optional); tf.nn.l2_loss(w) computes sum(w**2) / 2
        l2_reg = 0.001
        l2_loss = l2_reg * (tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2))
        total_loss = loss + l2_loss
    
    # Compute gradients
    variables = [W1, b1, W2, b2]
    gradients = tape.gradient(total_loss, variables)
    
    # Apply gradients
    optimizer.apply_gradients(zip(gradients, variables))
    
    # Compute metrics
    binary_predictions = tf.round(predictions)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(binary_predictions, y), tf.float32))
    
    # Record metrics
    loss_history.append(loss.numpy())
    accuracy_history.append(accuracy.numpy())
    weight_norms['W1'].append(tf.norm(W1).numpy())
    weight_norms['W2'].append(tf.norm(W2).numpy())
    
    # Print progress
    if epoch % 20 == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch:3d}: {loss.numpy():.4f} | {accuracy.numpy():.4f} | {tf.norm(W1).numpy():.3f} | {tf.norm(W2).numpy():.3f}")

# Final evaluation
final_predictions = forward_pass(X, training=False)
final_binary_predictions = tf.round(final_predictions)
final_accuracy = tf.reduce_mean(tf.cast(tf.equal(final_binary_predictions, y), tf.float32))
final_loss = loss_function(y, final_predictions)

print(f"\n🎉 TRAINING COMPLETE!")
print(f"Final Loss: {final_loss:.6f}")
print(f"Final Accuracy: {final_accuracy:.4f} ({final_accuracy*100:.1f}%)")

# Visualize results (fills panels 2 and 3 of the dataset figure created above)
plt.subplot(1, 3, 2)
plt.plot(loss_history, 'b-', linewidth=2, label='Training Loss')
plt.title('Training Loss Over Time')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.legend()

plt.subplot(1, 3, 3)
plt.plot(accuracy_history, 'g-', linewidth=2, label='Training Accuracy')
plt.title('Training Accuracy Over Time')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.grid(True, alpha=0.3)
plt.legend()
plt.ylim(0, 1)

plt.tight_layout()
plt.show()

# Visualize decision boundary
plt.figure(figsize=(15, 5))

# Create a mesh for decision boundary
h = 0.02
x_min, x_max = X[:, 0].numpy().min() - 1, X[:, 0].numpy().max() + 1
y_min, y_max = X[:, 1].numpy().min() - 1, X[:, 1].numpy().max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Make predictions on the mesh
mesh_points = tf.constant(np.c_[xx.ravel(), yy.ravel()], dtype=tf.float32)
mesh_predictions = forward_pass(mesh_points, training=False)
mesh_predictions = mesh_predictions.numpy().reshape(xx.shape)

plt.subplot(1, 3, 1)
plt.contourf(xx, yy, mesh_predictions, levels=20, alpha=0.8, cmap='RdYlBu')
plt.colorbar(label='Predicted Probability')
plt.scatter(X_data[class_0_mask, 0], X_data[class_0_mask, 1], c='red', alpha=0.8, label='Class 0', s=20, edgecolors='black')
plt.scatter(X_data[class_1_mask, 0], X_data[class_1_mask, 1], c='blue', alpha=0.8, label='Class 1', s=20, edgecolors='black')
plt.title('Learned Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()

# Weight evolution
plt.subplot(1, 3, 2)
plt.plot(weight_norms['W1'], 'r-', linewidth=2, label='W1 norm')
plt.plot(weight_norms['W2'], 'b-', linewidth=2, label='W2 norm')
plt.title('Weight Norms During Training')
plt.xlabel('Epoch')
plt.ylabel('L2 Norm')
plt.legend()
plt.grid(True, alpha=0.3)

# Final network weights visualization
plt.subplot(1, 3, 3)
# Combine all weights for visualization
all_weights = np.concatenate([W1.numpy().flatten(), W2.numpy().flatten()])
plt.hist(all_weights, bins=20, alpha=0.7, color='purple', edgecolor='black')
plt.title('Final Weight Distribution')
plt.xlabel('Weight Value')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n📊 FINAL ANALYSIS:")
print(f"\n🔍 What the network learned:")
print(f"• W1 (input→hidden): {W1.shape} matrix maps 2D input to 4D hidden representation")
print(f"• b1 (hidden bias): {b1.shape} vector shifts hidden activations")
print(f"• W2 (hidden→output): {W2.shape} matrix combines hidden features for final decision")
print(f"• b2 (output bias): {b2.shape} single-element vector shifts the decision threshold")

print(f"\n📈 Training insights:")
print(f"• Loss decreased from {loss_history[0]:.4f} to {loss_history[-1]:.4f}")
print(f"• Accuracy improved from {accuracy_history[0]:.4f} to {accuracy_history[-1]:.4f}")
print(f"• Final accuracy of {final_accuracy.numpy()*100:.1f}% shows how well the decision boundary separates the two classes")
print(f"• If the weight-norm curves flatten late in training, the network has likely converged")

print(f"\n✅ CONGRATULATIONS!")
print(f"You've successfully implemented and trained a neural network from scratch!")
print(f"You now understand: tensors, operations, activations, loss, gradients, and optimizers!")
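
To see roughly what automatic differentiation works out for the 2 → 4 → 1 network above, the chain rule can be applied by hand. The sketch below is an illustration under assumptions: it uses NumPy instead of TensorFlow, a small random batch and random parameters rather than the trained weights, and it leaves out the L2 term. A finite-difference check confirms that the hand-written gradients match.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small random batch and parameters with the same shapes as the network above
X = rng.normal(size=(8, 2))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
params = {
    "W1": rng.normal(scale=0.5, size=(2, 4)), "b1": np.zeros(4),
    "W2": rng.normal(scale=0.5, size=(4, 1)), "b2": np.zeros(1),
}

def forward(p, X):
    z1 = X @ p["W1"] + p["b1"]
    a1 = np.maximum(z1, 0.0)                  # ReLU
    z2 = a1 @ p["W2"] + p["b2"]
    a2 = 1.0 / (1.0 + np.exp(-z2))            # sigmoid
    return z1, a1, a2

def bce(p, X, y):
    """Mean binary cross-entropy of the network's predictions."""
    a2 = forward(p, X)[2]
    return -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))

def backward(p, X, y):
    """Chain rule, layer by layer, from the loss back to each parameter."""
    z1, a1, a2 = forward(p, X)
    dz2 = (a2 - y) / len(X)                   # sigmoid + cross-entropy simplify to (a2 - y)
    dz1 = (dz2 @ p["W2"].T) * (z1 > 0)        # ReLU passes gradient only where z1 > 0
    return {"W1": X.T @ dz1, "b1": dz1.sum(axis=0),
            "W2": a1.T @ dz2, "b2": dz2.sum(axis=0)}

def numerical_grad(p, X, y, name, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    g = np.zeros_like(p[name])
    for i in np.ndindex(g.shape):
        old = p[name][i]
        p[name][i] = old + eps
        plus = bce(p, X, y)
        p[name][i] = old - eps
        minus = bce(p, X, y)
        p[name][i] = old                      # restore the original value
        g[i] = (plus - minus) / (2 * eps)
    return g

analytic = backward(params, X, y)
max_error = max(np.abs(analytic[k] - numerical_grad(params, X, y, k)).max()
                for k in params)
print(f"largest gap between analytic and numerical gradients: {max_error:.2e}")
```

A tiny gap means the four gradient expressions in `backward` are consistent with the loss, which is the same quantity `tape.gradient` returns for `[W1, b1, W2, b2]` when no regularization is added.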