# PyTorch Basics Part 2: Autograd and Neural Networks

Automatic differentiation, gradients, and basic neural network building blocks

## Mathematical Foundation of Automatic Differentiation

**Automatic Differentiation (Autograd)** applies the chain rule from calculus mechanically to the elementary operations of a program. For composite functions, it computes derivatives that are exact up to floating-point error, without symbolic differentiation or finite-difference approximation.

### The Chain Rule
For composite function $f(g(x))$:
$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

### Computational Graph
Each operation creates nodes in a **directed acyclic graph (DAG)**:
- **Forward pass**: Compute function values $f(x)$
- **Backward pass**: Compute gradients $\frac{\partial f}{\partial x}$ using reverse-mode differentiation

### Gradient Computation
For scalar output $y = f(x_1, x_2, \ldots, x_n)$:
$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right)$$

PyTorch implements **reverse-mode AD**, which is efficient for functions $\mathbb{R}^n \rightarrow \mathbb{R}$ (many inputs, scalar output): a single backward pass yields the gradient with respect to all $n$ inputs. This makes it ideal for loss functions in machine learning.
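
As a quick check of the chain rule above, the following sketch differentiates a composite function with autograd and compares the result with the hand-derived product $\frac{df}{dg} \cdot \frac{dg}{dx}$. The choice $f(g) = \sin(g)$, $g(x) = x^2$ and the evaluation point are illustrative assumptions, not part of the original example.

```python
import torch

# Illustrative composite function (an assumption for this sketch):
# g(x) = x^2 and f(g) = sin(g), so df/dx = cos(x^2) * 2x by the chain rule.
x = torch.tensor(1.5, requires_grad=True)
g = x ** 2
f = torch.sin(g)

# Reverse-mode AD: one backward pass from the scalar output f to the input x.
(autograd_grad,) = torch.autograd.grad(f, x)

# Hand-derived chain rule: (df/dg) * (dg/dx)
with torch.no_grad():
    manual_grad = torch.cos(x ** 2) * (2 * x)

print(f"autograd: {autograd_grad.item():.6f}")
print(f"manual:   {manual_grad.item():.6f}")
```

The two printed values should agree up to floating-point rounding.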

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

## Autograd: Automatic Differentiation

### Mathematical Example
Consider the function $z = x^2 + y^3$ applied elementwise, with $\text{loss} = \sum_i z_i$

**Manual Computation:**
- $\frac{\partial z}{\partial x} = 2x$
- $\frac{\partial z}{\partial y} = 3y^2$
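- $\frac{\partial \text{loss}}{\partial x} = \frac{\partial \text{loss}}{\partial z} \frac{\partial z}{\partial x} = 1 \cdot 2x = 2x$
- $\frac{\partial \text{loss}}{\partial y} = \frac{\partial \text{loss}}{\partial z} \frac{\partial z}{\partial y} = 1 \cdot 3y^2 = 3y^2$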

**Reverse-Mode Algorithm:**
1. **Forward pass**: Compute function values and store intermediate results
2. **Backward pass**: Apply chain rule from output to inputs

This demonstrates how PyTorch autograd implements mathematical differentiation computationally.
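
To make the graph behind these two passes visible, the sketch below rebuilds the same function and walks the recorded backward nodes from the output toward the inputs. The helper `walk` is a hypothetical name introduced here, and the printed node class names are internal details that may differ between PyTorch versions.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([1.0, 4.0], requires_grad=True)
loss = (x ** 2 + y ** 3).sum()

# Leaf tensors created by the user have no grad_fn; results of operations do.
print(f"x:    is_leaf={x.is_leaf}, grad_fn={x.grad_fn}")
print(f"loss: is_leaf={loss.is_leaf}, grad_fn={loss.grad_fn}")

def walk(node, depth=0):
    """Print the backward graph, starting at the output node (hypothetical helper)."""
    if node is None:
        return
    print("  " * depth + type(node).__name__)
    for parent, _ in node.next_functions:
        walk(parent, depth + 1)

# The backward pass applies the chain rule in this output-to-input direction.
walk(loss.grad_fn)
```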

In [None]:
# Tensors with gradient tracking
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([1.0, 4.0], requires_grad=True)

print(f"x: {x}")
print(f"y: {y}")
print(f"x.requires_grad: {x.requires_grad}")
print(f"y.requires_grad: {y.requires_grad}")

In [None]:
# Compute a function
z = x**2 + y**3
loss = z.sum()

print(f"z: {z}")
print(f"loss: {loss}")
print(f"loss.requires_grad: {loss.requires_grad}")

In [None]:
# Compute gradients
loss.backward()

print(f"x.grad: {x.grad}")
print(f"y.grad: {y.grad}")

# Manual verification:
# dz/dx = 2x, so at x=[2,3]: [4, 6]
# dz/dy = 3y^2, so at y=[1,4]: [3, 48]
# detach() so the expected values print as plain tensors without a grad_fn
print(f"Expected x.grad: {2 * x.detach()}")
print(f"Expected y.grad: {3 * y.detach()**2}")

## Gradient Accumulation and Zeroing

### Mathematical Principle
**Gradient Accumulation** follows the linearity of differentiation:

For differentiable functions $f_1$, $f_2$, $f$ and a scalar $c$:
$$\frac{\partial}{\partial x}[f_1(x) + f_2(x)] = \frac{\partial f_1}{\partial x} + \frac{\partial f_2}{\partial x}$$
$$\frac{\partial}{\partial x}[c \cdot f(x)] = c \cdot \frac{\partial f}{\partial x}$$

**Why Gradients Accumulate:**
- Each `.backward()` call adds to existing gradients: $\text{grad} \leftarrow \text{grad} + \nabla_{\text{new}} f$
- This enables **gradient accumulation** across multiple loss terms or batches
- **Must manually zero** gradients between independent computations (for example with `x.grad.zero_()` or `optimizer.zero_grad()`)

**Mathematical Interpretation:**
If we compute losses $L_1, L_2$ separately:
$$\nabla (L_1 + L_2) = \nabla L_1 + \nabla L_2$$

PyTorch implements this by accumulating gradients, allowing flexible gradient computation strategies.
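
The identity $\nabla (L_1 + L_2) = \nabla L_1 + \nabla L_2$ can be checked directly. This is a minimal sketch with $L_1 = \sum x^2$ and $L_2 = \sum x^3$ chosen for illustration: two separate backward passes should leave the same gradient in `x.grad` as one pass over the summed loss.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# Two separate backward passes: the gradients of L1 and L2 are summed into x.grad.
L1 = (x ** 2).sum()
L1.backward()
L2 = (x ** 3).sum()
L2.backward()
accumulated = x.grad.clone()

# One pass over the combined loss, computed without touching x.grad.
x.grad = None
(combined,) = torch.autograd.grad((x ** 2).sum() + (x ** 3).sum(), x)

print(f"accumulated over two backward calls: {accumulated}")
print(f"gradient of L1 + L2:                 {combined}")
```

Both results equal $2x + 3x^2$ evaluated at $x = [1, 2]$, that is $[5, 16]$.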

In [None]:
# Gradients accumulate by default
x = torch.tensor([1.0, 2.0], requires_grad=True)

# First computation
y1 = x**2
y1.sum().backward()
print(f"After first backward: x.grad = {x.grad}")

# Second computation (gradients accumulate)
y2 = x**3
y2.sum().backward()
print(f"After second backward: x.grad = {x.grad}")

# Zero gradients
x.grad.zero_()
print(f"After zeroing: x.grad = {x.grad}")

## Neural Network Modules

### Mathematical Foundation of Linear Layers

**Affine Transformation:**
A linear layer implements the affine transformation:
$$\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$$

Where:
- $\mathbf{x} \in \mathbb{R}^{n}$ is the input vector
- $\mathbf{W} \in \mathbb{R}^{m \times n}$ is the weight matrix
- $\mathbf{b} \in \mathbb{R}^{m}$ is the bias vector
- $\mathbf{y} \in \mathbb{R}^{m}$ is the output vector

**Batch Processing:**
For batch input $\mathbf{X} \in \mathbb{R}^{B \times n}$ (B samples):
$$\mathbf{Y} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$$

**Parameter Initialization:**
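- **Xavier/Glorot**: $W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$
- **He initialization**: $W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$ (for ReLU)

Here the second argument of $\mathcal{N}$ is the variance, and $n_{\text{in}}$, $n_{\text{out}}$ are the layer's input and output sizes.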

The linear layer forms the fundamental building block for deep neural networks, implementing learnable linear transformations.
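
The batch formula and the two initialization schemes can be tried out as follows. This is a sketch under stated assumptions: the layer sizes, the seed, and the 256 x 512 test matrix are arbitrary choices, and `nn.Linear` applies its own default initialization, so the schemes above have to be requested explicitly through `torch.nn.init`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(in_features=3, out_features=2)
X = torch.randn(5, 3)  # batch of B=5 inputs with n=3 features

# nn.Linear stores W with shape (out_features, in_features), hence the transpose.
manual = X @ linear.weight.T + linear.bias
matches = torch.allclose(linear(X), manual, atol=1e-6)
print(f"linear(X) equals X @ W.T + b: {matches}")

# Empirical check of the initialization variances on a larger matrix (n_out=256, n_in=512).
w = torch.empty(256, 512)
nn.init.xavier_normal_(w)                        # target std = sqrt(2 / (n_in + n_out))
xavier_std = w.std().item()
nn.init.kaiming_normal_(w, nonlinearity="relu")  # target std = sqrt(2 / n_in)
he_std = w.std().item()

print(f"Xavier std: {xavier_std:.4f} (target {(2 / (512 + 256)) ** 0.5:.4f})")
print(f"He std:     {he_std:.4f} (target {(2 / 512) ** 0.5:.4f})")
```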

In [None]:
# Linear layer (fully connected)
linear = nn.Linear(in_features=3, out_features=2)

print(f"Weight shape: {linear.weight.shape}")
print(f"Bias shape: {linear.bias.shape}")
print(f"Weight:\n{linear.weight}")
print(f"Bias: {linear.bias}")

In [None]:
# Forward pass through linear layer
x = torch.randn(5, 3)  # batch_size=5, input_features=3
output = linear(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Output:\n{output}")

## Activation Functions

### Mathematical Foundation of Non-linearity

**Activation functions** introduce non-linearity into neural networks, enabling them to approximate complex functions. Without activation functions, multiple linear layers would collapse to a single linear transformation.

**Common Activation Functions:**

**ReLU (Rectified Linear Unit):**
$$\text{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$
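- **Derivative**: $\frac{d}{dx}\text{ReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$ (strictly, the derivative is undefined at $x = 0$; using 0 there is a convention, and it is the one PyTorch follows)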

**Sigmoid:**
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
- **Properties**: $\sigma(x) \in (0, 1)$, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
- **Issues**: Gradients vanish for large $|x|$, because $\sigma'(x) \to 0$ as the output saturates; the largest slope is only $\sigma'(0) = 0.25$

**Tanh (Hyperbolic Tangent):**
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1}$$
- **Properties**: $\tanh(x) \in (-1, 1)$, $\tanh'(x) = 1 - \tanh^2(x)$
- **Advantage**: Zero-centered output

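**Universal Approximation Theorem**: A feedforward network with a single hidden layer, a suitable non-linear (non-polynomial) activation, and sufficiently many hidden units can approximate any continuous function on a compact set to arbitrary accuracy. The theorem guarantees that such a network exists; it does not say how to find its weights by training.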
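
The derivative formulas listed above can be verified against autograd. The sketch below is illustrative only: `elementwise_grad` is a hypothetical helper introduced here, and the sample points are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0], requires_grad=True)

def elementwise_grad(fn, inp):
    """Return d fn(x_i) / d x_i for every element (hypothetical helper)."""
    # Summing gives a scalar whose gradient holds the per-element derivatives.
    (grad,) = torch.autograd.grad(fn(inp).sum(), inp)
    return grad

sig_x = torch.sigmoid(x).detach()
tanh_x = torch.tanh(x).detach()

# sigma'(x) = sigma(x) * (1 - sigma(x))
sigmoid_ok = torch.allclose(elementwise_grad(torch.sigmoid, x), sig_x * (1 - sig_x), atol=1e-6)
# tanh'(x) = 1 - tanh(x)^2
tanh_ok = torch.allclose(elementwise_grad(torch.tanh, x), 1 - tanh_x ** 2, atol=1e-6)
# ReLU'(x) is 1 for x > 0 and 0 otherwise
relu_grad = elementwise_grad(F.relu, x)

print(f"sigmoid formula matches autograd: {sigmoid_ok}")
print(f"tanh formula matches autograd:    {tanh_ok}")
print(f"ReLU gradient: {relu_grad}")
```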

In [None]:
# Common activation functions
x = torch.linspace(-3, 3, 100)

# ReLU
relu_output = F.relu(x)

# Sigmoid
sigmoid_output = torch.sigmoid(x)

# Tanh
tanh_output = torch.tanh(x)

# Plot activations
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.plot(x.numpy(), relu_output.numpy())
plt.title('ReLU')
plt.grid(True)

plt.subplot(1, 3, 2)
plt.plot(x.numpy(), sigmoid_output.numpy())
plt.title('Sigmoid')
plt.grid(True)

plt.subplot(1, 3, 3)
plt.plot(x.numpy(), tanh_output.numpy())
plt.title('Tanh')
plt.grid(True)

plt.tight_layout()
plt.show()

## Building a Simple Neural Network

### Mathematical Architecture

**Multi-Layer Perceptron (MLP):**
For a 2-layer network with one hidden layer:
$$\mathbf{h} = \sigma_1(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1)$$
$$\mathbf{y} = \sigma_2(\mathbf{W}_2\mathbf{h} + \mathbf{b}_2)$$

Where:
- $\mathbf{x} \in \mathbb{R}^{d}$ is input
- $\mathbf{W}_1 \in \mathbb{R}^{h \times d}$, $\mathbf{b}_1 \in \mathbb{R}^{h}$ (input to hidden)
- $\mathbf{W}_2 \in \mathbb{R}^{k \times h}$, $\mathbf{b}_2 \in \mathbb{R}^{k}$ (hidden to output)
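- $\sigma_1, \sigma_2$ are activation functions (in the `SimpleNet` class of this notebook, $\sigma_1$ is ReLU and $\sigma_2$ is the identity, so the network returns raw outputs)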

**Parameter Count:**
Total parameters = $(d \times h + h) + (h \times k + k) = h(d + k) + (h + k)$

**Forward Propagation:**
The network computes the composition:
$$f(\mathbf{x}; \boldsymbol{\theta}) = \sigma_2(\mathbf{W}_2 \sigma_1(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)$$

Where $\boldsymbol{\theta} = \{\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2\}$ are learnable parameters.
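
The parameter-count formula can be checked against what PyTorch reports. This sketch assumes $d = 4$, $h = 10$, $k = 3$ (the sizes used for `SimpleNet` in this notebook) and builds an equivalent network with `nn.Sequential` so that the block is self-contained.

```python
import torch.nn as nn

d, h, k = 4, 10, 3  # input, hidden, and output sizes (assumed for illustration)

mlp = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, k))

# Closed-form count from the formula: h(d + k) + (h + k)
formula = h * (d + k) + (h + k)
# Count reported by PyTorch
counted = sum(p.numel() for p in mlp.parameters())

for name, p in mlp.named_parameters():
    print(f"{name:10s} shape={tuple(p.shape)} count={p.numel()}")
print(f"formula: {formula}, counted: {counted}")
```

With these sizes both counts are 83: 50 parameters in the first layer and 33 in the second.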

In [None]:
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, output_size)
        
    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = self.linear2(x)
        return x

# Create network
net = SimpleNet(input_size=4, hidden_size=10, output_size=3)
print(net)

# Count parameters
total_params = sum(p.numel() for p in net.parameters())
trainable_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(f"Total parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")

In [None]:
# Forward pass through network
x = torch.randn(8, 4)  # batch_size=8, input_features=4
output = net(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Output:\n{output}")

## Loss Functions

### Mathematical Foundation of Loss Functions

Loss functions quantify the discrepancy between predictions and targets, providing the objective to minimize during training.

**Mean Squared Error (MSE) - Regression:**
$$L_{\text{MSE}}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

**Properties:**
- Differentiable everywhere
- Penalizes large errors quadratically, which makes it sensitive to outliers
- Corresponds to maximum likelihood estimation under Gaussian noise in the targets

**Cross-Entropy Loss - Classification:**
For multiclass classification with true class $c$ and predicted probabilities $\mathbf{p}$:
$$L_{\text{CE}} = -\log p_c = -\log\left(\frac{e^{z_c}}{\sum_{j=1}^{K} e^{z_j}}\right)$$

Where $z_j$ are logits (pre-softmax outputs).

**Softmax Function:**
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

**Gradient of Cross-Entropy + Softmax:**
$$\frac{\partial L_{\text{CE}}}{\partial z_i} = p_i - \delta_{ic}$$
where $\delta_{ic}$ is 1 if $i = c$ (true class), 0 otherwise.

The gradient is simply the predicted probabilities minus the one-hot target, so each update pushes $p_c$ toward 1 and the other probabilities toward 0.
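
The gradient formula $\frac{\partial L_{\text{CE}}}{\partial z_i} = p_i - \delta_{ic}$ can be confirmed numerically. This is a minimal sketch for a single sample; the three random logits, the seed, and the true class $c = 1$ are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, requires_grad=True)  # K = 3 classes, one sample
target = torch.tensor(1)                     # true class c = 1

# F.cross_entropy expects a batch dimension and raw logits.
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

# Expected gradient: softmax probabilities minus the one-hot target.
p = F.softmax(logits.detach(), dim=0)
one_hot = torch.zeros(3)
one_hot[target] = 1.0
expected = p - one_hot

print(f"autograd gradient: {logits.grad}")
print(f"p - one_hot:       {expected}")
```

Because the probabilities sum to 1, the components of this gradient always sum to 0.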

In [None]:
# Mean Squared Error (MSE) for regression
predictions = torch.randn(10, 1)
targets = torch.randn(10, 1)

mse_loss = nn.MSELoss()
loss = mse_loss(predictions, targets)
print(f"MSE Loss: {loss.item():.4f}")

# Manual calculation
manual_mse = ((predictions - targets)**2).mean()
print(f"Manual MSE: {manual_mse.item():.4f}")

In [None]:
# Cross-Entropy Loss for classification
# Raw logits (before softmax); nn.CrossEntropyLoss applies log-softmax internally
logits = torch.randn(5, 3)  # 5 samples, 3 classes
targets = torch.tensor([0, 1, 2, 1, 0])  # class indices

ce_loss = nn.CrossEntropyLoss()
loss = ce_loss(logits, targets)
print(f"Cross-Entropy Loss: {loss.item():.4f}")

# Convert to probabilities
probabilities = F.softmax(logits, dim=1)
print(f"Probabilities:\n{probabilities}")
print(f"Predicted classes: {probabilities.argmax(dim=1)}")

## Optimizers

### Mathematical Foundation of Optimization

**Gradient Descent** is the fundamental optimization algorithm for minimizing loss functions.

**Vanilla Gradient Descent:**
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}_t)$$

Where:
- $\boldsymbol{\theta}$ are parameters
- $\eta$ is the learning rate
- $\nabla_{\boldsymbol{\theta}} L$ is the gradient of loss w.r.t. parameters

**Stochastic Gradient Descent (SGD):**
Instead of computing the gradient over the full dataset, SGD estimates it from a randomly sampled mini-batch at each step:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} L_{\text{batch}}(\boldsymbol{\theta}_t)$$

**Advanced Optimizers:**

**Adam (Adaptive Moment Estimation):**
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Adam combines momentum (first moment) with adaptive learning rates (second moment).
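
To connect the update equations to the library, the sketch below implements them by hand and compares the result with `torch.optim.Adam` over a few steps. The quadratic toy loss, the starting point, and the hyperparameter values are assumptions chosen for illustration; the comparison covers only plain Adam without weight decay or AMSGrad.

```python
import torch

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
target = torch.tensor([0.5, 0.5], dtype=torch.float64)

def loss_fn(theta):
    # Toy quadratic loss (an assumption for this sketch)
    return ((theta - target) ** 2).sum()

# Reference: the library optimizer
theta_ref = torch.tensor([1.0, -2.0], dtype=torch.float64, requires_grad=True)
opt = torch.optim.Adam([theta_ref], lr=lr, betas=(beta1, beta2), eps=eps)

# Manual implementation of the equations above
theta = torch.tensor([1.0, -2.0], dtype=torch.float64)
m = torch.zeros_like(theta)
v = torch.zeros_like(theta)

for t in range(1, 6):
    # Library step
    opt.zero_grad()
    loss_fn(theta_ref).backward()
    opt.step()

    # Manual step; the gradient of sum((theta - target)^2) is 2 * (theta - target)
    g = 2 * (theta - target)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)

print(f"manual Adam:      {theta}")
print(f"torch.optim.Adam: {theta_ref.detach()}")
```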

In [None]:
# Create a simple optimization example
net = SimpleNet(2, 5, 1)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

# Generate some dummy data
x = torch.randn(20, 2)
y = torch.randn(20, 1)

print("Initial loss:")
initial_output = net(x)
initial_loss = F.mse_loss(initial_output, y)
print(f"Loss: {initial_loss.item():.4f}")

# Training step
optimizer.zero_grad()  # Clear gradients
output = net(x)        # Forward pass
loss = F.mse_loss(output, y)  # Compute loss
loss.backward()        # Backward pass
optimizer.step()       # Update parameters

print("\nAfter one optimization step:")
new_output = net(x)
new_loss = F.mse_loss(new_output, y)
print(f"Loss: {new_loss.item():.4f}")
print(f"Loss change: {new_loss.item() - initial_loss.item():.4f}")

## Simple Training Loop

### Mathematical Training Process

The **training loop** implements empirical risk minimization:
$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \frac{1}{n} \sum_{i=1}^{n} L(f(\mathbf{x}_i; \boldsymbol{\theta}), y_i)$$

**Algorithm:**
1. **Forward Pass**: Compute predictions $\hat{\mathbf{y}} = f(\mathbf{X}; \boldsymbol{\theta})$
2. **Loss Computation**: $L = \frac{1}{B} \sum_{i=1}^{B} \ell(\hat{y}_i, y_i)$
3. **Backward Pass**: Compute $\nabla_{\boldsymbol{\theta}} L$ via backpropagation
4. **Parameter Update**: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} L$
5. **Repeat** until convergence

**Example: Linear Regression**
True model: $y = \mathbf{w}^T\mathbf{x} + b + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$

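**Maximum Likelihood Estimation** under Gaussian noise leads to a squared-error loss:
$$L(\mathbf{w}, b) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i - b)^2$$

This differs from $L_{\text{MSE}}$ (and from `nn.MSELoss`, which averages with $\frac{1}{n}$) only by the constant factor $\frac{1}{2}$, so both have the same minimizer.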

The training process finds parameters that minimize this empirical risk.
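
Because this objective is quadratic in $(\mathbf{w}, b)$, its minimizer is also available in closed form, which gives a reference for what gradient descent should converge to. The sketch below solves the least-squares problem with `torch.linalg.lstsq`; the data-generating setup ($\mathbf{w} = [2, 3]$, $b = 0$, noise scale 0.1, 200 samples, seed 0) is an assumption for illustration.

```python
import torch

torch.manual_seed(0)
n = 200
X = torch.randn(n, 2)
w_true = torch.tensor([[2.0], [3.0]])     # assumed true weights, with true bias 0
y = X @ w_true + 0.1 * torch.randn(n, 1)  # Gaussian noise with sigma = 0.1

# Append a column of ones so the bias b is estimated together with the weights.
A = torch.cat([X, torch.ones(n, 1)], dim=1)

# Least-squares solution of min ||A theta - y||^2
theta = torch.linalg.lstsq(A, y).solution
w_hat = theta[:2].flatten()
b_hat = theta[2].item()

print(f"estimated weights: {w_hat}")
print(f"estimated bias:    {b_hat:.4f}")
```

The estimates should land close to the true values, with a small deviation caused by the noise.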

In [None]:
# Simple regression problem: y = 2x1 + 3x2 + noise
torch.manual_seed(42)

# Generate data
n_samples = 100
x = torch.randn(n_samples, 2)
true_weights = torch.tensor([[2.0], [3.0]])
y = x @ true_weights + 0.1 * torch.randn(n_samples, 1)

# Create model
model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# Training loop
losses = []
for epoch in range(100):
    # Forward pass
    predictions = model(x)
    loss = criterion(predictions, y)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    losses.append(loss.item())
    
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

print(f"\nTrue weights: {true_weights.flatten()}")
print(f"Learned weights: {model.weight.detach().flatten()}")
print("True bias: 0.0")
print(f"Learned bias: {model.bias.item():.4f}")

In [None]:
# Plot training loss
plt.figure(figsize=(8, 4))
plt.plot(losses)
plt.title('Training Loss')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.grid(True)
plt.show()

## Working with Different Optimizers

In [None]:
# Compare different optimizers
def train_with_optimizer(optimizer_class, **kwargs):
    # Uses x and y from the regression cell above.
    # Re-seed so every optimizer starts from the same initial weights.
    torch.manual_seed(0)
    model = nn.Linear(2, 1)
    optimizer = optimizer_class(model.parameters(), **kwargs)
    criterion = nn.MSELoss()

    losses = []
    for epoch in range(50):
        predictions = model(x)
        loss = criterion(predictions, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())

    return losses, model.weight.detach().flatten()

# Test different optimizers
sgd_losses, sgd_weights = train_with_optimizer(torch.optim.SGD, lr=0.1)
adam_losses, adam_weights = train_with_optimizer(torch.optim.Adam, lr=0.1)
rmsprop_losses, rmsprop_weights = train_with_optimizer(torch.optim.RMSprop, lr=0.1)

plt.figure(figsize=(10, 4))
plt.plot(sgd_losses, label='SGD')
plt.plot(adam_losses, label='Adam')
plt.plot(rmsprop_losses, label='RMSprop')
plt.title('Optimizer Comparison')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.grid(True)
plt.show()

print(f"True weights: {true_weights.flatten()}")
print(f"SGD weights: {sgd_weights}")
print(f"Adam weights: {adam_weights}")
print(f"RMSprop weights: {rmsprop_weights}")

## Saving and Loading Models

In [None]:
# Save model state
model = SimpleNet(3, 5, 2)
torch.save(model.state_dict(), 'model_weights.pth')

# Save entire model (pickles the whole object; loading it later requires the SimpleNet class definition)
torch.save(model, 'complete_model.pth')

print("Model saved successfully")

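# Load model state (recommended approach)
new_model = SimpleNet(3, 5, 2)
# weights_only=True restricts unpickling to tensors and plain containers (supported in recent PyTorch releases)
new_model.load_state_dict(torch.load('model_weights.pth', weights_only=True))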
new_model.eval()  # Set to evaluation mode

print("Model loaded successfully")

# Clean up files
import os
os.remove('model_weights.pth')
os.remove('complete_model.pth')