# Beginner's Guide to Deep Learning in PyTorch: Multi-Layer Networks, Mini-Batches, MNIST, L2 Regularization, and Dropout

**Objective:**  
In this tutorial, we'll work on the exercise from last week. We will build and train neural networks step by step using **PyTorch**. 
We will start with simple networks and gradually add complexity and improvements:
1. Build a deeper neural network with multiple hidden layers (vs. a shallow one) and try different activation functions (Tanh, Sigmoid, ReLU).
2. Implement **mini-batch gradient descent** instead of full-batch training and discuss efficiency vs. convergence.
3. Train our network on the **MNIST dataset** (handwritten digit images) with proper data preprocessing for 10-class classification.
4. Add **L2 regularization (weight decay)** to the loss to reduce overfitting and discuss its impact.
5. Manually implement **Dropout** regularization in the network to improve generalization, ensuring it behaves correctly during training and inference.

## 1. Building a Deeper Network with Multiple Hidden Layers

Neural networks are composed of layers of interconnected neurons. 
A **shallow network** typically has only one hidden layer, whereas a **deep network** has multiple hidden layers. 
Deeper networks can capture more complex patterns, but they may be harder to train. 

### Why Multiple Hidden Layers?
- A network with **one hidden layer** can approximate many functions but might require a very large number of neurons to do so.
- **Multiple hidden layers** can learn hierarchical representations: each layer builds more abstract features from the previous one (for example, in image recognition, early layers learn edges, later layers learn object parts).
- Deeper networks can often represent the same function more efficiently, using fewer total neurons than a very wide network with a single hidden layer would need.

However, deeper networks might be more prone to issues like vanishing gradients (especially with certain activation functions), and they require more data and computation to train effectively.

### Activation Functions: ReLU, Tanh, Sigmoid
**Activation functions** introduce non-linearity to networks, enabling them to learn complex patterns. Different hidden layers can use different activations. Common activation functions include:
- **Sigmoid**: Outputs a value between 0 and 1. Historically used in early networks, but it saturates: when the output is close to 0 or 1 the gradient becomes very small, which can slow training. It remains a good fit for binary outputs or a probabilistic interpretation, but is less common in deep hidden layers now.
- **Tanh**: Outputs between -1 and 1. It's zero-centered (unlike Sigmoid) which can be advantageous. But it also saturates at the extremes, causing vanishing gradients for deep networks.
- **ReLU (Rectified Linear Unit)**: Outputs 0 for negative inputs and a linear output for positive inputs. It doesn't saturate for positive values, which helps mitigate the vanishing gradient problem, and it tends to converge faster in practice. A downside is that neurons can "die" if they only output 0 (if inputs are always negative), but this is usually manageable.
- *(There are others like Leaky ReLU, ELU, etc., but we'll focus on these basic ones.)*

**Comparison:** In practice, ReLU is often a good default for hidden layers because of its training efficiency. 
Sigmoid and Tanh can still be used, especially in shallower networks or specific applications, but they may require careful tuning (like smaller learning rates) due to gradient saturation.
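
To make the saturation argument concrete, here is a small sketch that uses autograd to evaluate each activation's derivative at a few sample inputs. The helper name `activation_grads` and the probe points are illustrative choices, not part of the exercise.

```python
import torch

def activation_grads(fn, xs):
    """Return d fn(x) / dx evaluated at each point in xs, via autograd."""
    x = torch.tensor(xs, requires_grad=True)
    fn(x).sum().backward()   # summing lets one backward() call yield every per-point derivative
    return x.grad.tolist()

points = [-5.0, 0.0, 5.0]
sigmoid_grads = activation_grads(torch.sigmoid, points)
tanh_grads = activation_grads(torch.tanh, points)
relu_grads = activation_grads(torch.relu, points)

for name, grads in [("sigmoid", sigmoid_grads), ("tanh", tanh_grads), ("relu", relu_grads)]:
    print(name, [round(g, 4) for g in grads])
```

The Sigmoid derivative peaks at 0.25 and the Tanh derivative at 1 (both at an input of 0), and both shrink toward zero at ±5. The ReLU derivative stays at 1 for any positive input, which is why stacking many ReLU layers tends to attenuate gradients less.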

### Defining a Neural Network in PyTorch
We'll start by defining two simple neural network architectures to compare:
- A **single-layer network**: the input connects directly to the output layer, with no hidden layer at all. This is the shallowest possible model; used for classification, it is essentially a (multi-class) logistic regression model.
- A **multi-layer (deep) network**: one with two hidden layers. You can extend this idea to even more layers as needed.

We'll use fully connected linear layers (`nn.Linear`) for this multi-layer perceptron (MLP). 
For now, let's use ReLU activation in hidden layers as a default (we will mention how to switch to Tanh or Sigmoid).

In [21]:
import torch
import torch.nn as nn

import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Device configuration (use GPU if available for faster training, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Using device:", device)

# Define a simple single-layer network (no hidden layers, just input -> output)
class SingleLayerNet(nn.Module):
    def __init__(self, input_size, output_size):
        super(SingleLayerNet, self).__init__() # call the constructor of the parent class (nn.Module)
        self.linear = nn.Linear(input_size, output_size)
        # Note: No hidden layers, so no activation needed here (we'll apply softmax via loss for classification)
    
    def forward(self, x):
        # x is expected to be of shape [batch_size, input_size]
        out = self.linear(x)  # linear output
        return out

# Define a deeper network with two hidden layers
class TwoHiddenLayerNet(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size, activation_fn=nn.ReLU):
        super(TwoHiddenLayerNet, self).__init__()
        # Define layers
        self.hidden1 = nn.Linear(input_size, hidden_size1)
        self.hidden2 = nn.Linear(hidden_size1, hidden_size2)
        self.output_layer = nn.Linear(hidden_size2, output_size)
        # Activation function for hidden layers (we can pass nn.ReLU, nn.Tanh, nn.Sigmoid, etc.)
        self.activation = activation_fn()
    
    def forward(self, x):
        # First hidden layer + activation
        x = self.hidden1(x)
        x = self.activation(x)   # apply activation (ReLU/Tanh/Sigmoid as specified)
        # Second hidden layer + activation
        x = self.hidden2(x)
        x = self.activation(x)
        # Output layer (no activation here if we'll use CrossEntropyLoss which applies Softmax internally)
        out = self.output_layer(x)
        return out

# Example: Initialize networks for an input of size 784 (e.g., flattened 28x28 image) and 10 output classes.
input_size = 784   # for MNIST images flattened
output_size = 10   # for 10 classes (digits 0-9)
hidden_size1 = 128  # number of neurons in first hidden layer
hidden_size2 = 64   # number of neurons in second hidden layer

model_shallow = SingleLayerNet(input_size, output_size).to(device)
# Available activation functions: nn.ReLU, nn.Tanh, nn.Sigmoid
model_deep = TwoHiddenLayerNet(input_size, hidden_size1, hidden_size2, output_size, activation_fn=nn.ReLU).to(device)

print(model_shallow)
print(model_deep)

Using device: cpu
SingleLayerNet(
  (linear): Linear(in_features=784, out_features=10, bias=True)
)
TwoHiddenLayerNet(
  (hidden1): Linear(in_features=784, out_features=128, bias=True)
  (hidden2): Linear(in_features=128, out_features=64, bias=True)
  (output_layer): Linear(in_features=64, out_features=10, bias=True)
  (activation): ReLU()
)


In the code above:
- `SingleLayerNet` has just one `nn.Linear` layer. This will take the input and directly produce outputs. If we use it for classification with 10 classes, it will produce 10 scores (logits) per input, one for each class.
- `TwoHiddenLayerNet` has two hidden linear layers (`hidden1` and `hidden2`) and an `output_layer`. We apply an activation function after each hidden layer. By default, we use `ReLU` (by passing `nn.ReLU` to the `activation_fn` argument), but we could replace it with `nn.Tanh` or `nn.Sigmoid` when instantiating the model to experiment with those activations.
- We do not apply an activation on the output layer because we'll use `CrossEntropyLoss` for training, which expects raw scores for each class (and internally applies Softmax when computing the loss).

After defining the models, we moved them to the chosen `device` (CPU or GPU). We then printed the model architectures to see the layers.
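
A quick way to check a freshly defined model is to push a dummy batch through it and inspect the output shape. The sketch below is self-contained, so it rebuilds the two-hidden-layer architecture with `nn.Sequential` as a stand-in for `TwoHiddenLayerNet`; the helper `make_mlp` and the random batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_mlp(activation_fn=nn.ReLU):
    """Stand-in for TwoHiddenLayerNet(784, 128, 64, 10, activation_fn)."""
    return nn.Sequential(
        nn.Linear(784, 128), activation_fn(),
        nn.Linear(128, 64), activation_fn(),
        nn.Linear(64, 10),   # raw logits, no activation on the output
    )

dummy_batch = torch.randn(4, 784)   # 4 fake flattened 28x28 images

output_shapes = {}
for act in (nn.ReLU, nn.Tanh, nn.Sigmoid):
    logits = make_mlp(act)(dummy_batch)
    output_shapes[act.__name__] = tuple(logits.shape)
    print(act.__name__, tuple(logits.shape))
```

Whichever activation is chosen, the output has one row per sample and one column per class, so swapping activations never requires changing the layer sizes.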

### Comparing Shallow vs Deep Network Performance
To compare, we'd train both a shallow and a deep network on the same task and see the results:
- The **shallow network** (no hidden layer) is essentially a linear classifier. It may not fit complex patterns in data if the data is not linearly separable.
- The **deep network** (with hidden layers) can fit more complex functions due to the additional layers and non-linear activations.

**Expectations:** On a complex dataset like MNIST (10-class digit images), a deep network will usually achieve higher training accuracy and lower loss than a shallow one, because the deep network can model non-linear relationships better. The shallow network might converge faster initially (fewer parameters to adjust) but will likely plateau at a lower accuracy. The deep network, with more parameters and layers, might take a bit longer to train per epoch but can reach a higher accuracy given enough training. We'll see this in practice in section 3 when we train on MNIST.

We will hold off actual training until we introduce the data in section 3. For now, we have our models defined and ready to use. Next, we will discuss using **mini-batch gradient descent** to train these networks efficiently.

## 2. Implementing Mini-Batch Gradient Descent

When training neural networks, we use **gradient descent** to optimize the model's parameters (weights). There are a few variations of how we can compute the gradients over the training data:
- **Full-batch (Batch) Gradient Descent:** Use *all training examples* to compute the loss and gradient, then update the weights. This means one very large batch (the entire dataset) per update.
- **Stochastic Gradient Descent (SGD):** Use a *single training example* to compute the loss and gradient, update the weights immediately for each training example. This can be very noisy but updates happen very frequently.
- **Mini-Batch Gradient Descent:** Use a **small batch of training examples** (e.g., 16, 32, 64 samples) to compute the loss and gradient, then update weights. This is the most commonly used approach as it balances efficiency and stability.

### Why Mini-Batches?
- Using the **entire dataset** for each update (full-batch) can be very slow if the dataset is large, because the model gets only one weight update per full pass over the data.
- Using a **single example** for each update (SGD) is fast per update, but the direction of the gradient is very noisy and can jump around, sometimes causing instability or requiring more updates to converge.
- **Mini-batches** strike a good balance:
  - They allow vectorized operations on multiple samples at once, which is computationally efficient (especially on GPUs, which excel at parallel operations).
  - The gradient computed on a batch is a good approximation of the gradient on the full dataset, but with some noise that can help escape local minima and prevent certain kinds of overfitting.
  - By adjusting the batch size, we can tune the trade-off: larger batches = more stable, accurate gradient directions (but more memory and possibly slower per update), smaller batches = more noise but faster updates and less memory.
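
The claim that a mini-batch gradient is a noisy approximation of the full-batch gradient can be checked on a toy problem. The sketch below uses a synthetic linear-regression dataset (an assumption made purely for illustration) and measures how far mini-batch gradients stray from the full-batch gradient for two batch sizes.

```python
import torch

torch.manual_seed(0)

# Synthetic regression data: y = 3x + noise (illustrative, not part of the exercise)
X = torch.randn(1024, 1)
y = 3.0 * X + 0.5 * torch.randn(1024, 1)
w = torch.zeros(1, 1)   # current weight at which gradients are evaluated

def mse_grad(xb, yb):
    """Gradient of mean squared error w.r.t. w for the batch (xb, yb)."""
    residual = xb @ w - yb
    return (2.0 * xb.t() @ residual / xb.size(0)).item()

full_grad = mse_grad(X, y)

def mean_deviation(batch_size, num_batches=200):
    """Average |mini-batch gradient - full gradient| over randomly drawn batches."""
    total = 0.0
    for _ in range(num_batches):
        idx = torch.randint(0, X.size(0), (batch_size,))
        total += abs(mse_grad(X[idx], y[idx]) - full_grad)
    return total / num_batches

dev_small = mean_deviation(batch_size=4)
dev_large = mean_deviation(batch_size=64)
print(f"full gradient: {full_grad:.3f}")
print(f"mean deviation, batch size 4:  {dev_small:.3f}")
print(f"mean deviation, batch size 64: {dev_large:.3f}")
```

The larger batch should track the full gradient more closely, at the cost of processing 16 times as many samples per update.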

**Trade-offs between batch sizes:**
- *Computational Efficiency:* Larger batches make better use of hardware (especially GPUs) through parallelism, up to a point. Very small batches might not fully utilize the GPU.
- *Convergence:* Noisy gradients (small batch) can help generalize but too much noise might hinder learning of precise patterns. Very large batches might converge to sharp minima and potentially generalize worse, and you get fewer weight updates for the same number of samples seen.
- *Memory:* Larger batch sizes require more memory to store the data and intermediate activations. You may be limited by GPU memory, for example.

In practice, common batch sizes are 16, 32, 64, 128, etc. Often 32 or 64 is a good start for many problems, but this can be tuned.
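
The "fewer weight updates for the same number of samples" point is simple arithmetic. As a sketch, assuming a 60,000-sample training set (the size of the MNIST training set used later) and a hypothetical helper `updates_per_epoch`:

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of optimizer steps in one pass over the data (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)

num_samples = 60_000
for bs in (1, 16, 32, 64, 128, num_samples):
    print(f"batch_size={bs:>6}: {updates_per_epoch(num_samples, bs):>6} updates per epoch")
```

With a batch size of 32 the model is updated 1,875 times per epoch, whereas full-batch training updates it exactly once.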

### Implementing Mini-Batch Training Loop
PyTorch makes mini-batch training easy with the `DataLoader` class, which automatically divides the dataset into batches for us. We'll see how to use `DataLoader` in the MNIST section. But to illustrate the concept, here's what a typical training loop with mini-batches looks like:



```python
# Assume we have a DataLoader for training data: train_loader
# Also assume we have a model, loss function (criterion), and optimizer defined.

num_epochs = 5
batch_size = 32  # set when creating train_loader; shown here only for reference
for epoch in range(num_epochs):
    model_deep.train()  # set model to training mode (important for dropout, batch norm, etc.)
    total_loss = 0.0
    for batch_idx, (data, labels) in enumerate(train_loader):
        # Move batch to the device (CPU/GPU)
        data, labels = data.to(device), labels.to(device)
        # If data is an image (N, 1, 28, 28 for MNIST), flatten it to (N, 784) for our MLP
        data = data.view(data.size(0), -1)
        
        # Forward pass: compute model output
        outputs = model_deep(data)
        loss = criterion(outputs, labels)  # compute loss for this batch
        
        # Backward pass: compute gradients
        optimizer.zero_grad()    # zero out gradients from previous step
        loss.backward()          # backpropagation: compute gradients of loss w.r.t. parameters
        optimizer.step()         # update parameters using the optimizer (gradient descent step)
        
        total_loss += loss.item()
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Average Loss: {avg_loss:.4f}")
```

In this snippet:
- We loop over each `epoch` (one pass through the dataset).
- For each epoch, we loop over the `train_loader`, which yields mini-batches of `(data, labels)` pairs.
- We move the data and labels to the chosen `device` (so that computation happens on GPU if available).
- If dealing with images, we flatten each batch of images with `data.view(data.size(0), -1)`. For MNIST, each image is 28x28 = 784 features.
- We perform the forward pass on the batch to get outputs, then compute the loss against the true labels for that batch.
- We zero out previous gradients (`optimizer.zero_grad()`), do backprop (`loss.backward()`), and take an optimizer step (`optimizer.step()`) to update weights.
- We accumulate the loss to compute an average loss for the epoch (just for logging).
- We set `model_deep.train()` at the start of training to ensure the model is in training mode (this matters for layers like Dropout or BatchNorm, which behave differently in training vs evaluation).

When created with `shuffle=True`, `DataLoader` reshuffles the data every epoch, and it yields exactly `batch_size` samples per iteration (except possibly the last batch, if the dataset size isn't divisible by the batch size). This greatly simplifies our training loop.

We'll implement this in practice for MNIST in the next section. For now, remember:
- Changing `batch_size` in the DataLoader is how you experiment with different mini-batch sizes (e.g., 16, 32, 64). It's usually as simple as passing a different `batch_size` argument when creating the DataLoader.
- If `batch_size = len(dataset)`, you're effectively doing full-batch gradient descent.
- If `batch_size = 1`, you're doing pure stochastic gradient descent (one sample per update).
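
The three regimes listed above can be seen directly in the batch sizes a `DataLoader` yields. This is a minimal sketch on a made-up dataset of 100 random samples; `toy_dataset` and the helper `batch_sizes` are hypothetical names used only here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 100 samples, 784 features, 10 classes
features = torch.randn(100, 784)
targets = torch.randint(0, 10, (100,))
toy_dataset = TensorDataset(features, targets)

def batch_sizes(dataset, batch_size):
    """Sizes of the batches a DataLoader yields for the given batch_size."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    return [xb.size(0) for xb, _ in loader]

print(batch_sizes(toy_dataset, 32))                 # mini-batches, last one is partial
print(len(batch_sizes(toy_dataset, 1)))             # one sample per update
print(batch_sizes(toy_dataset, len(toy_dataset)))   # a single full batch
```

If the partial last batch is unwanted, `DataLoader` also accepts `drop_last=True`, which discards it.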

Now that we understand mini-batches, let's apply what we've learned by training our network on a real dataset: **MNIST**.

## 3. Training on the MNIST Dataset

The **MNIST dataset** is a classic benchmark in machine learning. It consists of 60,000 training images and 10,000 test images of handwritten digits (0 through 9), each image being 28x28 pixels in grayscale. Our task is to classify each image into the correct digit.

### Data Loading and Preprocessing
PyTorch provides convenient tools to load common datasets like MNIST through `torchvision.datasets`. We will:
- Download/load the MNIST dataset.
- Transform the images into tensor format and normalize the pixel values.
- Flatten each 28x28 image into a 784-dimensional vector, since our neural network expects a 1D feature vector input.
- Prepare DataLoader objects for the training set and test set with a chosen batch size.

Normalization: Neural networks train faster when input features are scaled. Raw MNIST pixel values range from 0 to 255. The `ToTensor()` transform converts each image to a float tensor and scales the pixels to [0, 1], so no manual division by 255 is needed. We could go further and standardize to zero mean and unit variance (subtract 0.1307 and divide by 0.3081, the mean and standard deviation of the MNIST pixels), but the simpler 0-1 scaling is sufficient for this demonstration.
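
The two scaling steps can be sketched on a single fake image using plain tensor operations. The random `raw` tensor below is a stand-in for a real MNIST image, and the code imitates the effect of the transforms rather than calling torchvision itself.

```python
import torch

# Hypothetical stand-in for one raw MNIST image: 28x28 grayscale pixels in [0, 255]
raw = torch.randint(0, 256, (28, 28), dtype=torch.uint8)

# Step 1: what ToTensor() effectively does for such an image:
# convert to float, scale to [0, 1], and add a channel dimension -> (1, 28, 28)
scaled = raw.float().div(255.0).unsqueeze(0)

# Step 2 (optional): standardize with the MNIST mean and std quoted above
mean, std = 0.1307, 0.3081
standardized = (scaled - mean) / std

print(scaled.shape, scaled.min().item(), scaled.max().item())
print(standardized.min().item(), standardized.max().item())
```

With torchvision, the same two steps are usually written as `transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])`.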

Let's load the data:

In [22]:
import torchvision
import torchvision.transforms as transforms

# Define transforms: convert images to tensor and normalize to [0,1]. 
# Also, we won't flatten here; we'll do that in the training loop for clarity.
transform = transforms.ToTensor()

# Load the MNIST training and test datasets
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = torchvision.datasets.MNIST(root='./data', train=False, transform=transform, download=True)

# Define DataLoaders for training and testing
batch_size = 32  # we can experiment with 16, 32, 64, etc.
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

print(f"Number of training samples: {len(train_dataset)}")
print(f"Number of test samples: {len(test_dataset)}")

Number of training samples: 60000
Number of test samples: 10000


In the code above:
- We set `transform = transforms.ToTensor()` which means each image will be converted to a PyTorch tensor and scaled to [0, 1].
- `datasets.MNIST` is used to download (if not already) and load the dataset. We specify `train=True` for the training set and `train=False` for the test set.
- `DataLoader` is used to create an iterable over the dataset. We set `shuffle=True` for the training loader so the data is reshuffled every epoch (batches then differ from epoch to epoch) and `shuffle=False` for the test loader (evaluation results do not depend on the order).
- `batch_size` is set to 32 as an example. You can change this and observe differences in training speed or performance.

### Adjusting the Network for 10 Output Classes
Our network architecture needs to have an output size equal to the number of classes (10 for digits 0-9). In the earlier section, we already defined `output_size = 10`. If you were previously using a network for a different task (say binary classification with 1 output or 2 outputs), you would change it to 10 outputs for MNIST.

We will use the `TwoHiddenLayerNet` defined earlier as our model for MNIST, since a deeper network should perform better than a single-layer one on this task. We already instantiated `model_deep` with `output_size=10`. Let's re-initialize it to make sure we start with a fresh model (since we haven't actually trained it yet) and choose an appropriate loss function and optimizer:

In [23]:
# Re-initialize a fresh model for training on MNIST
model = TwoHiddenLayerNet(input_size, hidden_size1, hidden_size2, output_size, activation_fn=nn.ReLU).to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # for multi-class classification, CrossEntropyLoss is appropriate (it applies Softmax internally)
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# PyTorch's torch.optim.SGD isn't strictly stochastic gradient descent by definition.
# Whether it's "batch," "mini-batch," or "stochastic" depends on how much data is provided per step.

A few notes on the choices:
- We use `nn.CrossEntropyLoss` for multi-class classification. This loss function expects raw logits from the model and true class labels. It will compute softmax internally and then compute the negative log-likelihood. So our model's output should be of shape `[batch_size, 10]` (score for each class) and labels of shape `[batch_size]` with class indices 0-9.
- We chose Stochastic Gradient Descent (SGD) as the optimizer with a learning rate of 0.1. This is a fairly high learning rate for MNIST; we might reduce it or use a more advanced optimizer like Adam for faster convergence, but let's start simple. (We can adjust if needed.)
- `optimizer = torch.optim.SGD(model.parameters(), lr=0.1)` will update our model's parameters using the gradients computed. We did not set `momentum` or `weight_decay` here; momentum can accelerate SGD and weight_decay is for L2 regularization (which we'll add separately later).
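
To see what "computes softmax internally" means, the following sketch compares `nn.CrossEntropyLoss` with the same quantity computed by hand. The random logits and the label values are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw scores: [batch_size, num_classes]
labels = torch.tensor([3, 0, 9, 1])    # class indices: [batch_size]

ce = nn.CrossEntropyLoss()(logits, labels)

# Same quantity by hand: log-softmax, then pick the log-probability of the true class
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), labels].mean()

print(ce.item(), manual.item())
```

The two values agree, which is why the model should return raw logits: adding a softmax layer to the model output would apply softmax twice.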

### Training the Model on MNIST
Now we'll train the model for a few epochs and monitor the performance. For each epoch, we'll compute the average training loss and also compute accuracy on the training set (and later on the test set) to see how well the model is doing.

We'll also compare how our model is doing against what a shallow model would achieve, to highlight the difference. First, let's train the `model` (which is our two-hidden-layer network with ReLU).

In [24]:
num_epochs = 5
for epoch in range(num_epochs):
    model.train()  # make sure to set model to training mode
    running_loss = 0.0
    correct = 0
    total = 0
    
    for batch_idx, (images, labels) in enumerate(train_loader):
        # Move data to device (CPU or GPU)
        images, labels = images.to(device), labels.to(device)
        # Flatten the images into vectors of size 784
        images = images.view(images.size(0), -1)
        
        # Forward pass
        outputs = model(images)             # shape: [batch_size, 10]
        loss = criterion(outputs, labels)   # compute loss for this batch
        
        # Backward and optimize
        optimizer.zero_grad()   # clear previous gradients
        loss.backward()         # compute gradients
        optimizer.step()        # update weights
        
        # Accumulate training statistics
        running_loss += loss.item() * images.size(0)  # loss.item() is average loss per sample in the batch
        _, predicted = torch.max(outputs, 1)          # predicted class is the index of max logit
        correct += (predicted == labels).sum().item() # count how many predictions were correct
        total += labels.size(0)
    
    # Compute average loss and accuracy for the epoch
    epoch_loss = running_loss / total
    epoch_acc = 100 * correct / total
    print(f"Epoch [{epoch+1}/{num_epochs}], Training Loss: {epoch_loss:.4f}, Training Accuracy: {epoch_acc:.2f}%")

Epoch [1/5], Training Loss: 0.3651, Training Accuracy: 89.02%
Epoch [2/5], Training Loss: 0.1302, Training Accuracy: 96.08%
Epoch [3/5], Training Loss: 0.0890, Training Accuracy: 97.27%
Epoch [4/5], Training Loss: 0.0685, Training Accuracy: 97.85%
Epoch [5/5], Training Loss: 0.0540, Training Accuracy: 98.30%


In each epoch:
- We set `model.train()` to ensure dropout (if any) is active and other layers behave in training mode.
- We iterate over `train_loader` to get mini-batches of `images` and `labels`.
- We flatten `images` from shape `(batch_size, 1, 28, 28)` to `(batch_size, 784)` to feed into our linear layers.
- We compute `outputs = model(images)` and then `loss = criterion(outputs, labels)`.
- We perform backpropagation and an optimizer step to update the model's parameters.
- We also track the `running_loss` (accumulating the loss of each batch, weighted by batch size to later compute average loss) and count correct predictions to compute accuracy.
- After the inner loop, we calculate `epoch_loss` and `epoch_acc` for the whole training dataset and print them.
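
Weighting each batch loss by its batch size matters whenever the last batch is smaller than the rest. A small arithmetic sketch with made-up batch losses shows the difference between averaging batch means and averaging per sample:

```python
# Hypothetical per-batch mean losses and batch sizes (last batch is partial)
batch_losses = [0.5, 0.5, 2.0]
batch_sizes = [32, 32, 4]

# Average of batch means: every batch counts equally, whatever its size
mean_of_batch_means = sum(batch_losses) / len(batch_losses)

# Sample-weighted average: matches running_loss / total in the loop above
weighted_sum = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
per_sample_mean = weighted_sum / sum(batch_sizes)

print(f"mean of batch means: {mean_of_batch_means:.4f}")
print(f"per-sample mean:     {per_sample_mean:.4f}")
```

When the dataset size is divisible by the batch size, as with 60,000 MNIST training images and a batch size of 32, the two averages coincide.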

After training for a few epochs, we expect the training loss to decrease and training accuracy to increase. A well-configured network with two hidden layers should reach high accuracy on MNIST (above 95% training accuracy within 5 epochs, provided the learning rate and hidden sizes are reasonable); the run above reaches 98.30%.

Let's evaluate the model on the **test dataset** to see how well it generalizes:

In [25]:
# Evaluate on test data
model.eval()  # set model to evaluation mode (important for dropout, batchnorm, etc.)
test_correct = 0
test_total = 0
test_loss = 0.0

with torch.no_grad():  # disable gradient computation for efficiency
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        images = images.view(images.size(0), -1)  # flatten
        outputs = model(images)
        loss = criterion(outputs, labels)
        test_loss += loss.item() * images.size(0)
        _, predicted = torch.max(outputs, 1)
        test_correct += (predicted == labels).sum().item()
        test_total += labels.size(0)

avg_test_loss = test_loss / test_total
test_accuracy = 100 * test_correct / test_total
print(f"Test Loss: {avg_test_loss:.4f}, Test Accuracy: {test_accuracy:.2f}%")

Test Loss: 0.0834, Test Accuracy: 97.40%


Here we:
- Set `model.eval()` to notify layers like Dropout (if present) to behave in evaluation mode (i.e., not drop neurons).
- Use `torch.no_grad()` to skip gradient calculations since we're only doing forward passes (this makes it faster and uses less memory).
- Loop through the test data and compute the accumulated loss and correct predictions similar to training, but without updating the model.
- Calculate `avg_test_loss` and `test_accuracy`.
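
Since the same evaluation steps recur for every model in this tutorial, they could be wrapped in a reusable function. The sketch below is one possible shape for such a helper; the name `evaluate` and the tiny hand-built check are assumptions for illustration, not part of the original notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion, device="cpu"):
    """Return (average loss per sample, accuracy in percent) over a DataLoader."""
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            inputs = inputs.view(inputs.size(0), -1)   # flatten, as in the loops above
            outputs = model(inputs)
            total_loss += criterion(outputs, labels).item() * inputs.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return total_loss / total, 100.0 * correct / total

# Tiny hand-built check: an identity "classifier" on 2 features
toy_model = nn.Linear(2, 2, bias=False)
with torch.no_grad():
    toy_model.weight.copy_(torch.eye(2))
toy_inputs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
toy_labels = torch.tensor([0, 1, 0, 0])   # last label deliberately disagrees with the model
toy_loader = DataLoader(TensorDataset(toy_inputs, toy_labels), batch_size=2)

toy_loss, toy_acc = evaluate(toy_model, toy_loader, nn.CrossEntropyLoss())
print(f"loss={toy_loss:.4f}, accuracy={toy_acc:.1f}%")
```

The toy model gets three of the four samples right, so the helper reports 75% accuracy. The same function could be called with `model` and `test_loader` from this section.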

**Interpreting Results:** If the model generalizes well, test accuracy should be close to training accuracy. In the run above, the test accuracy (97.40%) is within about one percentage point of the final training accuracy (98.30%), so the gap is small. If the training accuracy is significantly higher than test accuracy, it might indicate overfitting (the model memorized training data patterns that don't generalize to new data). We will address overfitting using regularization in the next sections.

Now, as an additional comparison, let's see how a **shallow network (SingleLayerNet)** would perform on the same task. We can train the `SingleLayerNet` for a few epochs and compare its accuracy to the `TwoHiddenLayerNet`.

In [26]:
# Initialize a single-layer model (logistic regression) for comparison
shallow_model = SingleLayerNet(input_size, output_size).to(device)
optimizer_shallow = torch.optim.SGD(shallow_model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# Train the shallow model for a few epochs
num_epochs = 5
for epoch in range(num_epochs):
    shallow_model.train()
    correct = 0
    total = 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        images = images.view(images.size(0), -1)
        outputs = shallow_model(images)
        loss = criterion(outputs, labels)
        optimizer_shallow.zero_grad()
        loss.backward()
        optimizer_shallow.step()
        _, predicted = torch.max(outputs, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)
    train_acc = 100 * correct / total
    print(f"Epoch [{epoch+1}/{num_epochs}], Shallow Net Training Accuracy: {train_acc:.2f}%")

# Evaluate shallow model on test data
shallow_model.eval()
test_correct = 0
test_total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        images = images.view(images.size(0), -1)
        outputs = shallow_model(images)
        _, predicted = torch.max(outputs, 1)
        test_correct += (predicted == labels).sum().item()
        test_total += labels.size(0)
test_acc = 100 * test_correct / test_total
print(f"Shallow Net Test Accuracy: {test_acc:.2f}%")

Epoch [1/5], Shallow Net Training Accuracy: 88.69%
Epoch [2/5], Shallow Net Training Accuracy: 91.25%
Epoch [3/5], Shallow Net Training Accuracy: 91.63%
Epoch [4/5], Shallow Net Training Accuracy: 91.92%
Epoch [5/5], Shallow Net Training Accuracy: 92.21%
Shallow Net Test Accuracy: 92.26%


For brevity, this cell reports only accuracy, not loss:
- This trains a logistic regression model (no hidden layers). It does reasonably well on MNIST (92.26% test accuracy in the run above) because a linear model can already capture a lot of the variance in digit images, but not as much as a multi-layer network.
- By comparing the results of `shallow_model` vs `model` (deep network), you should observe:
  - The deep network reaches a higher accuracy than the shallow model.
  - The shallow model might train a bit faster initially (fewer parameters, simpler model), but its performance plateaus earlier and lower.
  - This confirms that multiple hidden layers (with non-linear activations) give the network more representational power to solve the task.

So far, we've trained a multi-layer network on MNIST and improved performance over a single-layer network. Next, we'll look at ways to further improve generalization and training through regularization techniques.

## 4. Adding L2 Regularization (Weight Decay)

Our deep network has a lot of parameters (weights and biases). With `input_size=784`, hidden sizes 128 and 64, and 10 outputs, the count is:
- Hidden1 layer: 784 x 128 weights + 128 biases = 100,480
- Hidden2 layer: 128 x 64 weights + 64 biases = 8,256
- Output layer: 64 x 10 weights + 10 biases = 650

That's a total of 109,386 parameters. With a large number of parameters, a network can fit the training data very closely, sometimes even memorizing it. This can lead to **overfitting**, where the model performs much worse on new, unseen data (test set) than on the training set.
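
The layer-by-layer tally above can also be done programmatically. This sketch uses `nn.Sequential` and a bare `nn.Linear` as stand-ins with the same layer sizes as `TwoHiddenLayerNet` and `SingleLayerNet`; the helper `count_parameters` is a hypothetical name.

```python
import torch.nn as nn

def count_parameters(model):
    """Total number of trainable parameters (weights and biases)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stand-ins with the same layer sizes as TwoHiddenLayerNet and SingleLayerNet
deep = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
shallow = nn.Linear(784, 10)

deep_params = count_parameters(deep)
shallow_params = count_parameters(shallow)
print(f"deep: {deep_params:,} parameters, shallow: {shallow_params:,} parameters")
```

The deep model has roughly 14 times as many parameters as the single-layer one, which is what gives it more room both to fit and to overfit.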

**L2 Regularization**, also known as **weight decay**, is one way to combat overfitting:
- The idea is to add an extra term to the loss function that penalizes large weights. Specifically, for each weight $w$, we add $\frac{\lambda}{2} w^2$ (or just $\lambda w^2$ depending on convention) to the loss, where $\lambda$ is a small positive number (the **regularization strength** or coefficient).
- This encourages the network to keep the weights small. Smaller weights generally produce smoother and less complex mappings from inputs to outputs, which often generalize better.
- Effect: The model will try to make a trade-off between fitting the data well (minimizing the original loss) and keeping weights small (minimizing the regularization term). This can prevent any single weight from growing too large to fit some noise or outlier perfectly.

In practice, adding L2 regularization tends to:
- **Reduce overfitting:** The gap between training and test performance often shrinks. Test accuracy may improve if the model was overfitting.
- **Slightly increase training loss:** Because we added an extra penalty, the optimizer might not be able to drive the training loss as low as before (which is fine if the test loss improves).
- **Potentially slow training convergence:** The optimizer has to also minimize the regularization term, which can make it a bit slower to reach minimum on the original loss. Often, this effect is minor if $\lambda$ is small.

### Implementing L2 Regularization in PyTorch
There are two common ways to add L2 regularization in PyTorch:
1. **Manual addition to the loss function:** After computing the normal loss, compute the sum of squares of all weights and add $\lambda$ times that to the loss.
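2. **Using `weight_decay` parameter in the optimizer:** Many PyTorch optimizers (SGD, Adam, etc.) have a `weight_decay` argument which, if set to a non-zero value, adds the gradient of an L2 penalty to each parameter's gradient during the update step. This is convenient, and for plain SGD it is equivalent to the manual approach up to a constant factor: `weight_decay=c` corresponds to the penalty $\frac{c}{2} \sum w^2$, so a manual penalty of $\lambda \sum w^2$ matches `weight_decay` set to $2\lambda$. Note that it applies to every parameter passed to the optimizer, biases included.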
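
The relationship between the two approaches can be verified numerically. The sketch below takes one plain SGD step (no momentum) on a tiny made-up problem, once with a hand-written penalty of $\frac{c}{2} \sum w^2$ and once with `weight_decay=c`, and compares the resulting parameters. The data, model size, and the values of `lr` and `wd` are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 5)
y = torch.randint(0, 3, (8,))
criterion = nn.CrossEntropyLoss()
lr, wd = 0.1, 1e-2   # illustrative values

model_manual = nn.Linear(5, 3)
model_wd = copy.deepcopy(model_manual)   # identical starting weights

# Option 1: add (wd / 2) * sum of squared parameters to the loss by hand
opt_manual = torch.optim.SGD(model_manual.parameters(), lr=lr)
penalty = sum(p.pow(2).sum() for p in model_manual.parameters())
loss = criterion(model_manual(x), y) + 0.5 * wd * penalty
opt_manual.zero_grad()
loss.backward()
opt_manual.step()

# Option 2: let the optimizer apply the decay through weight_decay
opt_wd = torch.optim.SGD(model_wd.parameters(), lr=lr, weight_decay=wd)
loss = criterion(model_wd(x), y)
opt_wd.zero_grad()
loss.backward()
opt_wd.step()

max_diff = max(
    (p1 - p2).abs().max().item()
    for p1, p2 in zip(model_manual.parameters(), model_wd.parameters())
)
print(f"largest parameter difference after one step: {max_diff:.2e}")
```

The difference should be at the level of floating-point rounding. This check covers plain SGD only; adaptive optimizers rescale gradients, so the penalty interacts with them differently, which is the motivation for decoupled variants such as `torch.optim.AdamW`.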

For educational purposes, we'll show the manual method as it makes the concept clear. Let's take our training loop and add L2 regularization to it. We'll use a new model instance to compare training with and without regularization.

In [27]:
# Re-initialize a new model for a fair comparison (same architecture)
model_reg = TwoHiddenLayerNet(input_size, hidden_size1, hidden_size2, output_size, activation_fn=nn.ReLU).to(device)
optimizer_reg = torch.optim.SGD(model_reg.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Set L2 regularization coefficient (lambda)
lambda_reg = 1e-4  # you can experiment with values like 1e-3, 1e-4, 1e-5, etc.

num_epochs = 5
for epoch in range(num_epochs):
    model_reg.train()
    total_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        images = images.view(images.size(0), -1)
        
        outputs = model_reg(images)
        base_loss = criterion(outputs, labels)  # standard data loss
        # Manual L2 penalty: sum of squared values over all parameters.
        # For simplicity this includes biases as well as weights; to penalize
        # weights only, skip 1-D parameters with `if param.ndim > 1`.
        l2_loss = 0.0
        for param in model_reg.parameters():
            l2_loss += torch.sum(param.pow(2))
        # Add the L2 penalty to the base loss
        loss = base_loss + lambda_reg * l2_loss
        
        optimizer_reg.zero_grad()
        loss.backward()
        optimizer_reg.step()
        total_loss += base_loss.item() * images.size(0)  # note: logging base_loss (data loss) for transparency
    avg_loss = total_loss / len(train_dataset)
    print(f"Epoch [{epoch+1}/{num_epochs}], Training Loss (no reg term): {avg_loss:.4f}")

Epoch [1/5], Training Loss (no reg term): 0.3730
Epoch [2/5], Training Loss (no reg term): 0.1295
Epoch [3/5], Training Loss (no reg term): 0.0902
Epoch [4/5], Training Loss (no reg term): 0.0702
Epoch [5/5], Training Loss (no reg term): 0.0559


What we did:
- Set `lambda_reg = 1e-4` as the regularization strength. This is a common scale for weight decay; if we set it too high (e.g., 0.1), the model might underfit because weights are forced to be extremely small.
- In each training iteration, we calculated `base_loss` as usual, then computed `l2_loss` by iterating over `model_reg.parameters()` and summing up `param.pow(2)`. This gives $\sum w^2$ over all parameters (weights and biases).
- We then formed the final `loss = base_loss + lambda_reg * l2_loss`.
- Notice we added the regularization *after* computing base_loss, and we used `base_loss.item()` for logging the part of loss without regularization, just to see how the data fitting is going. The actual loss used for backward includes the L2 term.
- We could skip biases (which are 1D parameters) by checking `param.ndim > 1` (2D weights) to only penalize those, but including biases typically doesn't hurt if lambda is small (biases are far fewer parameters).
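
The bias-skipping variant mentioned in the last point could be sketched as a small helper. The function name `l2_penalty` and its `skip_biases` flag are hypothetical, introduced only for this example; they are not part of PyTorch:

```python
import torch
import torch.nn as nn


def l2_penalty(model, skip_biases=True):
    """Sum of squared parameter values (hypothetical helper, not a PyTorch API).

    With skip_biases=True, parameters with fewer than 2 dimensions are ignored.
    """
    total = torch.zeros((), device=next(model.parameters()).device)
    for param in model.parameters():
        if skip_biases and param.ndim <= 1:
            continue  # 1-D parameters: biases (and, in other models, norm-layer scales)
        total = total + param.pow(2).sum()
    return total


# Tiny check on a layer with known values: 6 weights equal to 1, 2 biases equal to 2
layer = nn.Linear(3, 2)
with torch.no_grad():
    layer.weight.fill_(1.0)
    layer.bias.fill_(2.0)

weights_only = l2_penalty(layer, skip_biases=True).item()   # 6 * 1^2
everything = l2_penalty(layer, skip_biases=False).item()    # 6 * 1^2 + 2 * 2^2
print(weights_only, everything)
```

In the training loop, the manual accumulation could then be replaced by `loss = base_loss + lambda_reg * l2_penalty(model_reg)`. Note that the `ndim` test is a heuristic: it also skips any other 1-D parameters a model might have.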

Next, we evaluate `model_reg` on the test set as before to see the effect:

In [28]:
# Evaluate the L2-regularized model
model_reg.eval()
test_correct = 0
test_total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        images = images.view(images.size(0), -1)
        outputs = model_reg(images)
        _, predicted = torch.max(outputs, 1)
        test_correct += (predicted == labels).sum().item()
        test_total += labels.size(0)
test_acc_reg = 100 * test_correct / test_total
print(f"Test Accuracy with L2 regularization: {test_acc_reg:.2f}%")

Test Accuracy with L2 regularization: 97.65%


**Expected impact of L2 regularization:** 
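- The training loss printed (the data loss only, since we log `base_loss.item()`) might be slightly higher than without regularization, because the optimizer now also has to keep the penalty small. The full loss used for the backward pass is higher still, since it includes the L2 term.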
- The final training accuracy might be a bit lower than the non-regularized model if the original model was able to nearly perfectly fit the training data by making weights large.
- The test accuracy might be higher than the non-regularized model *if* the original was overfitting. If the original model wasn't heavily overfitting, the difference may be small. But if there was a noticeable gap (say, training acc 99% vs test acc 94%), adding weight decay could improve test accuracy by a couple of points while slightly reducing training accuracy, narrowing the gap.

**Tuning $\lambda$:** If $\lambda$ is too high, the model will underfit (too much pressure to keep weights near zero, so it can't even fit the training data well). If $\lambda$ is too low, it might not make any noticeable difference. A good practice is to try a few values (e.g., 1e-3, 1e-4, 1e-5) to see which yields the best validation performance.
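
A sweep over a few values of $\lambda$ might look like the sketch below. To stay self-contained it uses a small synthetic dataset, a toy model, and full-batch training rather than MNIST and `train_loader`. The function `train_and_validate` and all sizes are assumptions made for illustration:

```python
import torch
import torch.nn as nn


def train_and_validate(lambda_reg, x_tr, y_tr, x_val, y_val, epochs=50):
    """Train a toy model with a manual L2 penalty; return validation accuracy.

    Hypothetical helper for illustration, using full-batch gradient descent.
    """
    torch.manual_seed(0)  # same initialization for every lambda, for a fair comparison
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        l2 = sum(p.pow(2).sum() for p in model.parameters())
        loss = criterion(model(x_tr), y_tr) + lambda_reg * l2
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        predictions = model(x_val).argmax(dim=1)
    return (predictions == y_val).float().mean().item()


# Synthetic binary classification data: the label depends on the first two features
torch.manual_seed(0)
x = torch.randn(400, 20)
y = (x[:, 0] + 0.5 * x[:, 1] > 0).long()
x_tr, y_tr = x[:300], y[:300]      # training split
x_val, y_val = x[300:], y[300:]    # held-out validation split

results = {
    lam: train_and_validate(lam, x_tr, y_tr, x_val, y_val)
    for lam in [1e-3, 1e-4, 1e-5]
}
best_lambda = max(results, key=results.get)
for lam, acc in results.items():
    print(f"lambda={lam:g}: validation accuracy {acc:.3f}")
print(f"best lambda on this validation split: {best_lambda:g}")
```

The important design point is that $\lambda$ is chosen on a validation split that is separate from the test set, so the final test accuracy remains an honest estimate.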

Now that we've addressed overfitting via weight decay, let's introduce another powerful regularization technique: **Dropout**.

## 5. Manually Implementing Dropout

**Dropout** is a regularization technique that reduces overfitting by *randomly dropping units (neurons) during training*. 
It's very effective for large neural networks. 
Here's how it works conceptually:
- During each training forward pass, each neuron (in certain layers, usually the hidden layers) has a probability *p* of being "dropped out," meaning its output is set to zero for that pass.
- This forces the network to not rely too heavily on any single neuron, because that neuron might be gone in the next training step. It's like training an ensemble of many smaller networks that each omit different neurons, and they all share weights.
- At training time, when we drop neurons with probability *p*, we *scale up* the remaining active neurons' outputs by *1/(1-p)*. This is called **inverted dropout**, and it ensures that the *expected* sum of inputs to the next layer is the same as it would be without dropout. (The original formulation of dropout instead leaves the training activations unscaled and multiplies all outputs by (1-p) at test time. Inverted dropout moves the scaling into training, so nothing has to change at test time.)
- During inference (evaluation), we **turn off dropout** (no neurons are dropped). We want the full network's capacity when making predictions. The randomness is only for training to introduce robustness.
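
The scaling argument above can be checked numerically. The sketch below applies a dropout mask to a large tensor of ones (an arbitrary choice that makes the means easy to read) and compares the average activation with and without the `1/(1-p)` rescaling:

```python
import torch

torch.manual_seed(0)
p = 0.5                         # dropout probability
x = torch.ones(1_000_000)       # stand-in activations, all equal to 1

# Mask is 1 with probability (1 - p) and 0 with probability p
mask = (torch.rand_like(x) > p).float()

dropped = x * mask              # dropout without rescaling
inverted = dropped / (1.0 - p)  # inverted dropout: rescale the survivors

mean_plain = dropped.mean().item()
mean_inverted = inverted.mean().item()
print(f"no rescaling: {mean_plain:.3f}, inverted dropout: {mean_inverted:.3f}")
```

Without rescaling, the mean activation drops to roughly `1 - p`; with inverted dropout it stays close to the original value of 1, so the next layer sees inputs on the same scale in training and evaluation.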

**Impact of Dropout:**
- It adds noise to the training process (the network sees a slightly different architecture each time), which can slow down training convergence somewhat.
- It greatly helps in preventing overfitting, especially in large networks, since neurons can't co-adapt to each other as much.
- With dropout, you may need to train for more epochs, or with a slightly higher learning rate, to compensate for the noise. The end result is usually a model that generalizes better.

### Implementing Dropout Without PyTorch's Built-in Layers
PyTorch provides `nn.Dropout` modules that one can insert into a network, but here we'll implement it manually to see what's happening under the hood:
- We will manually zero out random neurons' outputs in the forward pass during training.
- We'll multiply the remaining outputs by `1/(1-p)` to maintain the scale.
- We'll use the `self.training` flag inside the `forward` method to check whether the model is in training mode or eval mode (`model.train()` sets `self.training` to True, `model.eval()` sets it to False).

Let's define a new network class similar to our `TwoHiddenLayerNet` but with dropout in the hidden layers:

In [29]:
class TwoHiddenLayerNetWithDropout(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size, dropout_prob=0.5):
        super(TwoHiddenLayerNetWithDropout, self).__init__()
        self.hidden1 = nn.Linear(input_size, hidden_size1)
        self.hidden2 = nn.Linear(hidden_size1, hidden_size2)
        self.output_layer = nn.Linear(hidden_size2, output_size)
        self.activation = nn.ReLU()       # using ReLU for hidden layers
        self.dropout_prob = dropout_prob  # dropout probability (p)
    
    def forward(self, x):
        # Hidden layer 1
        x = self.hidden1(x)
        x = self.activation(x)
        if self.training:  # only apply dropout during training
            # Create a dropout mask with the same shape as x
            # Mask has 0s with probability dropout_prob, 1s with probability (1 - dropout_prob)
            mask = (torch.rand_like(x) > self.dropout_prob).float()
            # Apply mask: zero out some activations
            x = x * mask
            # Scale up the remaining activations to account for dropout (inverted dropout technique)
            x = x / (1.0 - self.dropout_prob)
        
        # Hidden layer 2
        x = self.hidden2(x)
        x = self.activation(x)
        if self.training:
            mask = (torch.rand_like(x) > self.dropout_prob).float()
            x = x * mask
            x = x / (1.0 - self.dropout_prob)
        
        # Output layer (no dropout here, typically we only dropout in hidden layers)
        out = self.output_layer(x)
        return out

# Initialize the network with dropout
dropout_prob = 0.5  # 50% dropout, a common choice
model_dropout = TwoHiddenLayerNetWithDropout(input_size, hidden_size1, hidden_size2, output_size, dropout_prob).to(device)
print(model_dropout)

TwoHiddenLayerNetWithDropout(
  (hidden1): Linear(in_features=784, out_features=128, bias=True)
  (hidden2): Linear(in_features=128, out_features=64, bias=True)
  (output_layer): Linear(in_features=64, out_features=10, bias=True)
  (activation): ReLU()
)


In this class:
- We added a `dropout_prob` attribute to store the dropout probability *p*.
- After computing the activation of each hidden layer, we apply dropout:
  - We create a `mask` the same shape as the layer's output using `torch.rand_like(x) > p`. `torch.rand_like(x)` generates a tensor of uniform random values in [0, 1) with the same shape as `x`. Comparing it to `p` gives a boolean mask that's True (1) with probability (1-p) and False (0) with probability p. We convert that to float (so we have 1s and 0s).
  - We multiply `x` by this `mask`, zeroing out a fraction `p` of the elements of `x`.
  - We then divide `x` by (1-p) to scale up the remaining active neurons' outputs.
- We only do this when `self.training` is True, meaning the model is in training mode. In evaluation mode (`model.eval()` called), the `if self.training` blocks will be skipped, so no dropout will be applied and outputs won't be scaled (effectively, it's as if we are using the whole network with weights scaled properly already).
- We choose not to apply dropout on the output layer in this implementation. Dropout is usually applied to hidden layers. Dropping out units in the output layer (just before softmax) is less common and can hurt performance because the output layer directly corresponds to predictions. So we typically keep the output layer intact.
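
Since the same three lines are repeated after each hidden layer, the masking logic could also be factored into a standalone function. The sketch below is one way to do that. `manual_dropout` is a hypothetical helper name, and the comparison with `torch.nn.functional.dropout` only checks the evaluation-mode behavior (in training mode the two draw different random masks):

```python
import torch
import torch.nn.functional as F


def manual_dropout(x, p, training):
    """Inverted dropout as a standalone function (hypothetical helper)."""
    if not training or p == 0.0:
        return x  # evaluation mode or no dropout: pass activations through unchanged
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1.0 - p)


torch.manual_seed(0)
x = torch.randn(4, 6)
p = 0.5

eval_out = manual_dropout(x, p, training=False)
train_out = manual_dropout(x, p, training=True)

# In training mode, every surviving entry equals x / (1 - p); dropped entries are zero
kept = train_out != 0
values_ok = torch.allclose(train_out[kept], x[kept] / (1.0 - p))

# PyTorch's functional dropout is also the identity when training=False
builtin_eval = F.dropout(x, p=p, training=False)

eval_is_identity = torch.equal(eval_out, x)
builtin_is_identity = torch.equal(builtin_eval, x)
print(eval_is_identity, values_ok, builtin_is_identity)
```

Inside `forward`, each dropout block would then reduce to `x = manual_dropout(x, self.dropout_prob, self.training)`.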

Now, let's train this network with dropout on MNIST and see how it does compared to before:

In [30]:
# Train the model with dropout
model_dropout.train()
optimizer_do = torch.optim.SGD(model_dropout.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

num_epochs = 5
for epoch in range(num_epochs):
    model_dropout.train()
    running_loss = 0.0
    correct = 0
    total = 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        images = images.view(images.size(0), -1)
        
        outputs = model_dropout(images)
        loss = criterion(outputs, labels)
        optimizer_do.zero_grad()
        loss.backward()
        optimizer_do.step()
        
        running_loss += loss.item() * images.size(0)
        _, predicted = torch.max(outputs, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)
    epoch_loss = running_loss / total
    epoch_acc = 100 * correct / total
    print(f"Epoch [{epoch+1}/{num_epochs}], Training Loss: {epoch_loss:.4f}, Training Accuracy: {epoch_acc:.2f}%")

Epoch [1/5], Training Loss: 0.6326, Training Accuracy: 80.54%
Epoch [2/5], Training Loss: 0.3421, Training Accuracy: 90.24%
Epoch [3/5], Training Loss: 0.2903, Training Accuracy: 91.84%
Epoch [4/5], Training Loss: 0.2606, Training Accuracy: 92.52%
Epoch [5/5], Training Loss: 0.2444, Training Accuracy: 93.03%


And then evaluate on the test set as usual:

In [31]:
model_dropout.eval()
test_correct = 0
test_total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        images = images.view(images.size(0), -1)
        outputs = model_dropout(images)
        _, predicted = torch.max(outputs, 1)
        test_correct += (predicted == labels).sum().item()
        test_total += labels.size(0)
test_acc_do = 100 * test_correct / test_total
print(f"Test Accuracy with Dropout: {test_acc_do:.2f}%")

Test Accuracy with Dropout: 96.66%


**What to expect with Dropout:** 
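- The training accuracy reported might be a bit lower than for the model without dropout, because dropout makes the task harder on the training data (it's like training a different, smaller sub-network at each step). You might see the training accuracy rise more slowly or plateau below 100%. Note also that the training accuracy is measured with dropout active, while the test accuracy is measured with the full network; this is why the test accuracy above (96.66%) is higher than the final training accuracy (93.03%).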
- The test accuracy, however, could be higher than the non-dropout model if that model was overfitting. If the original model already generalized well, dropout might not boost test accuracy much, but if it was overfitting, dropout often gives a noticeable improvement.
- Overall, you usually use dropout when you have a large network and a risk of overfitting. In our case, a two-hidden-layer network on MNIST might not strictly need dropout to reach good performance, but this is for learning purposes. On more complex tasks or deeper networks, dropout is very beneficial.

**Important:** Remember to call `model_dropout.eval()` when evaluating; otherwise, dropout will remain active and you'll get inconsistent, random results on the test set (since it will drop random neurons even when you're trying to evaluate performance).
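
The effect of forgetting `eval()` is easy to demonstrate by running the same input through a network twice in each mode. The sketch below uses a small toy network with manual dropout (`TinyDropoutNet`, a hypothetical stand-in for the tutorial's model) and random input data:

```python
import torch
import torch.nn as nn


class TinyDropoutNet(nn.Module):
    """Toy network with manual inverted dropout (hypothetical, for illustration)."""

    def __init__(self, p=0.5):
        super().__init__()
        self.hidden = nn.Linear(10, 64)
        self.out = nn.Linear(64, 3)
        self.p = p

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        if self.training:  # dropout is applied only in training mode
            mask = (torch.rand_like(x) > self.p).float()
            x = x * mask / (1.0 - self.p)
        return self.out(x)


torch.manual_seed(0)
net = TinyDropoutNet()
x = torch.randn(5, 10)

with torch.no_grad():
    net.train()
    train_same = torch.equal(net(x), net(x))  # two passes draw two different masks
    net.eval()
    eval_same = torch.equal(net(x), net(x))   # no masks, so the passes agree

print(f"train mode repeatable: {train_same}, eval mode repeatable: {eval_same}")
```

In training mode the two outputs differ because each forward pass draws a fresh mask; in evaluation mode they are identical.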

### Combining L2 and Dropout
You can use **both** L2 regularization and dropout together. They address overfitting in different ways: L2 softly penalizes complexity (large weights), while dropout adds noise and forces redundancy in the network. In practice, the combination can work better than either technique alone, especially in larger networks, but this should be confirmed on validation data for your task.

If using both:
- You would keep the `lambda_reg` term in the loss (or `weight_decay` in the optimizer), and also include dropout in the model.
- Be sure to tune the hyperparameters; the learning rate might need adjustment, since both regularizers can slow learning slightly.
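
As a minimal sketch of the combined setup, the example below uses PyTorch's built-in `nn.Dropout` together with the optimizer's `weight_decay` argument, rather than the manual versions from this tutorial. The layer sizes follow the model printed earlier (784, 128, 64, 10). The batch is random stand-in data, and `weight_decay = 2 * lambda_reg` is chosen to match the manual penalty $\lambda \sum w^2$ under plain SGD:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same layer sizes as the tutorial's network, with built-in dropout after each hidden layer
model_both = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 10),
)

lambda_reg = 1e-4
optimizer_both = torch.optim.SGD(
    model_both.parameters(), lr=0.1, weight_decay=2 * lambda_reg
)
criterion = nn.CrossEntropyLoss()

# Random stand-in batch (in the tutorial this would come from train_loader)
images = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

# One training step: dropout is active, weight decay is applied by the optimizer
model_both.train()
optimizer_both.zero_grad()
loss = criterion(model_both(images), labels)
loss.backward()
optimizer_both.step()
loss_value = loss.item()

# Evaluation: dropout is switched off by eval()
model_both.eval()
with torch.no_grad():
    preds = model_both(images).argmax(dim=1)

print(f"training loss on the batch: {loss_value:.4f}, predictions shape: {tuple(preds.shape)}")
```

The values of `p`, `lambda_reg`, and the learning rate here are starting points, not tuned settings.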

### Recap and Conclusions
- We built a deep neural network with multiple hidden layers and saw that it can outperform a single-layer (shallow) network on a complex task.
- We used different activation functions (ReLU as default, with mentions of how to use Sigmoid/Tanh) and discussed their impact on training.
- We implemented mini-batch gradient descent to make training efficient and stable, and discussed how batch size affects training.
- We trained on the MNIST dataset, handling data loading, preprocessing (flattening and normalization), and adjusted the network for multi-class output.
- We added L2 regularization (weight decay) to the loss function to combat overfitting, and observed how it can improve generalization.
- We manually implemented dropout in our network, dropping neurons during training to further improve the model's robustness and generalization.