# Deep Learning: More Convolutional Neural Networks

Welcome back! Today’s small-group lecture explores **Convolutional Neural Networks (CNNs)**, building directly on what we've learned so far in deep learning.

By the end of this session, you will be able to:
- Understand the intuition behind CNNs and why they work so well for image data.
- Build a multi-layer PyTorch CNN from scratch.
- Explain the role of activation functions, batch normalization, and max pooling.
- Train, validate, and analyze a CNN’s performance.
- Understand core training concepts such as optimizers, loss functions, epochs, and learning rate.

This lecture is designed to feel like a guided walkthrough — with explanations of not just **what** we do, but **why** we do it.

Let's begin!

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from tqdm import tqdm

print(f"Pytorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")

Pytorch version: 2.3.1.post100
Torchvision version: 0.18.1


### Note: Python must be 3.11.x

If your kernel reports a Python version newer than 3.11.x (for example, 3.12), you will need to downgrade. Please email the staff for help.

**Why does this matter?** Some PyTorch wheels break under Python 3.12 due to ABI changes. Always confirm version compatibility when working with deep learning frameworks.

In [2]:
# DEVICE CONFIGURATION
if torch.backends.mps.is_available():          # Apple Silicon
    device = torch.device("mps")
elif torch.cuda.is_available():                # CUDA GPU
    device = torch.device("cuda")
else:
    device = torch.device("cpu")               # Fallback

print("Using device:", device)

Using device: mps


For today's small group, we will walk through the process of setting up a convolutional neural network ("CNN" for short) using PyTorch (the `torch` package)!

CNNs shine when working with **structured grid data**, especially images — tasks like classification, segmentation, object detection, and more.

Why? Because CNNs:
- Capture local patterns (edges, textures) via **convolutions**.
- Build hierarchical features (shapes → objects) through **stacked layers**.
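- Use **shared weights**, making them parameter-efficient and translation-equivariant: a shifted input produces correspondingly shifted feature maps, and pooling adds a degree of translation invariance on top.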
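
The weight-sharing and translation points can be checked numerically. The sketch below is an illustration rather than part of the lecture code: it assumes a random single-channel 8×8 input and uses circular padding (an assumption made here so that a wrap-around shift is exact at the borders). It compares "convolve, then shift" with "shift, then convolve".

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One 3x3 convolution with 4 output channels. Circular padding is an assumption
# made for this demo so a wrap-around shift commutes exactly with the convolution.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 1, 8, 8)                   # toy "image": batch=1, channels=1
x_shifted = torch.roll(x, shifts=2, dims=3)   # shift 2 pixels along the width

with torch.no_grad():
    shift_after = torch.roll(conv(x), shifts=2, dims=3)  # convolve, then shift
    shift_before = conv(x_shifted)                       # shift, then convolve

equivariant = torch.allclose(shift_after, shift_before, atol=1e-6)
num_weights = sum(p.numel() for p in conv.parameters())  # 4 filters * 1 channel * 3 * 3

print("Shifted input gives shifted features:", equivariant)
print("Weights reused at every position:", num_weights)
```

The same small set of filter weights is applied at every location, which is why the parameter count does not depend on the image size.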

Let’s load a dataset so we can see these ideas in action.

Recall from lecture that CNNs are generally used to process gridded data or images.

Let's begin by loading one of the standard datasets available through `torchvision.datasets`: **CIFAR-10**.

The dataset contains 60,000 small 32×32 color images (50,000 for training and 10,000 for testing) belonging to 10 different classes:

In [3]:
# init preprocessing for CIFAR-10 dataset (images are 32x32x3)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # normalize to [-1, 1]
])
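
To see why a mean and standard deviation of 0.5 map pixel values to [-1, 1], here is a small sketch of the per-channel arithmetic that `Normalize` applies, `(x - mean) / std`. The random tensor stands in for a `ToTensor()` output (values in [0, 1]); the names `pixels`, `normalized`, and `restored` are made up for this illustration.

```python
import torch

# Per-channel mean and std, reshaped so they broadcast over (C, H, W).
mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)

pixels = torch.rand(3, 32, 32)   # stand-in for a ToTensor() image in [0, 1]
pixels[0, 0, 0] = 0.0            # force the extreme values to appear
pixels[0, 0, 1] = 1.0

normalized = (pixels - mean) / std   # what Normalize computes
restored = normalized * std + mean   # undo it, e.g. before plotting

print("min:", normalized.min().item(), "max:", normalized.max().item())
```

Undoing the normalization, as in the last step, is handy when displaying images that have already passed through `transform`.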

In [4]:
batch_size = 100
train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                             download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                            download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

Files already downloaded and verified
Files already downloaded and verified


Great! We have image data now. But what does it look like?

Visualizing your data is an essential first step — especially in computer vision.

Let's plot the different classes below using `matplotlib`. When teaching, this is a great moment to ask:
- *What patterns do you notice?*
- *Which classes might be harder for the network? Why?*

In [5]:
# Plot each class here
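
One possible way to fill in this cell is sketched below. It is a suggestion, not the official solution: the helper names `first_index_per_class` and `plot_one_per_class` are hypothetical, the 2×5 grid assumes exactly 10 classes, and the code relies on the `data`, `targets`, and `classes` attributes of torchvision's `CIFAR10` dataset.

```python
def first_index_per_class(targets, num_classes):
    """Return {class_id: index of the first sample with that label}."""
    first = {}
    for idx, label in enumerate(targets):
        if label not in first:
            first[label] = idx
            if len(first) == num_classes:
                break
    return first


def plot_one_per_class(dataset):
    """Show one raw (un-normalized) example image for each class in `dataset`."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    first = first_index_per_class(dataset.targets, len(dataset.classes))
    fig, axes = plt.subplots(2, 5, figsize=(10, 4))   # assumes 10 classes
    for class_id, ax in enumerate(axes.flat):
        ax.imshow(dataset.data[first[class_id]])      # uint8 array, shape (32, 32, 3)
        ax.set_title(dataset.classes[class_id])
        ax.axis("off")
    plt.tight_layout()
    plt.show()


# In the notebook, after loading the data:
# plot_one_per_class(train_dataset)
```

Plotting from `dataset.data` shows the original pixels, so the images are not affected by the normalization in `transform`.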

## Building a CNN

Staff:
- Reference lecture: CNNs are built by stacking multiple **layers**.
- The first part of a CNN is a sequence of repeated blocks, each made of a convolutional layer, an activation, and a max-pooling layer.
- The example code below shows 3 of these blocks (each also includes batch normalization)!
<p align="left">
    <img src = "https://media.geeksforgeeks.org/wp-content/uploads/20250529121802516451/Convolutional-Neural-Network-in-Machine-Learning.webp" width = "500">
</p>

### Activation Functions

Staff:
- Please briefly review some of the common activation functions (there will be a table on this at the beginning of lecture)
- Discuss 3 of the most common activation functions
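- Be sure to give their names in PyTorch, the framework used in this notebook (TensorFlow equivalents are listed for reference)

### Expanded Notes for Staff
- **ReLU**: cheap to compute and does not saturate for positive inputs, which mitigates vanishing gradients. PyTorch: `nn.ReLU()` or `torch.relu`. TensorFlow: `tf.nn.relu` or `tf.keras.layers.ReLU()`.
- **LeakyReLU**: avoids dead neurons by allowing a small gradient for negative inputs. PyTorch: `nn.LeakyReLU()`. TF: `tf.nn.leaky_relu`.
- **Tanh**: zero-centered but saturates. PyTorch: `nn.Tanh()` or `torch.tanh`. TF: `tf.nn.tanh`.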

Explain *why* we use nonlinear activations: without them, a stack of layers collapses to a single linear transformation — making the network unable to model complex patterns.
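
The "collapse" claim can be verified directly. This is a minimal sketch with arbitrary layer sizes (8 → 16 → 4, chosen only for illustration): two stacked `nn.Linear` layers without an activation match a single merged linear map, while inserting a ReLU breaks that equivalence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

first = nn.Linear(8, 16)
second = nn.Linear(16, 4)
x = torch.randn(5, 8)

with torch.no_grad():
    # Two linear layers applied back to back, no activation in between.
    stacked = second(first(x))

    # The same computation as ONE linear map: W = W2 @ W1, b = W2 @ b1 + b2.
    merged_weight = second.weight @ first.weight             # shape (4, 8)
    merged_bias = second.weight @ first.bias + second.bias   # shape (4,)
    merged = x @ merged_weight.T + merged_bias

    # With a nonlinearity in between, no single linear map reproduces the output.
    with_relu = second(torch.relu(first(x)))

print("Stack without activation equals one linear layer:",
      torch.allclose(stacked, merged, atol=1e-5))
print("Stack with ReLU equals that linear layer:",
      torch.allclose(with_relu, merged, atol=1e-5))
```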

Please add comments to this code:

In [6]:
# Define a simple CNN
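class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Block 1: 3 input channels (RGB) -> 32 feature maps; 32x32 -> 16x16
        self.layer1 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),        # 3x3 conv; padding=1 keeps height/width
            nn.BatchNorm2d(32),                                # normalize activations per channel
            nn.ReLU(),                                         # nonlinear activation
            nn.MaxPool2d(2)                                    # halves height and width
        )
        # Block 2: 32 -> 64 feature maps; 16x16 -> 8x8
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        # Block 3: 64 -> 128 feature maps; 8x8 -> 4x4
        self.layer3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        # Classifier: 128*4*4 flattened features -> 10 class scores (logits)
        self.fc = nn.Linear(128*4*4, 10)  # CIFAR-10 has 10 classes

    def forward(self, x):
        out = self.layer1(x)              # (N, 32, 16, 16)
        out = self.layer2(out)            # (N, 64, 8, 8)
        out = self.layer3(out)            # (N, 128, 4, 4)
        out = out.view(out.size(0), -1)   # flatten to (N, 2048)
        out = self.fc(out)                # (N, 10) raw logits; CrossEntropyLoss applies softmax internally
        return out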

model = CNN().to(device)
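
The `128*4*4` input size of the final linear layer follows from the standard output-size formula, `floor((size + 2*padding - kernel) / stride) + 1`. A short sketch of that arithmetic for the three blocks (the helper name `conv_out` is made up for this example):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32  # CIFAR-10 images are 32x32
for block in (1, 2, 3):
    size = conv_out(size, kernel=3, padding=1)   # 3x3 conv with padding=1 keeps the size
    size = conv_out(size, kernel=2, stride=2)    # MaxPool2d(2) halves it
    print(f"after block {block}: {size}x{size}")

flattened = 128 * size * size   # channels * height * width after block 3
print("features entering the linear layer:", flattened)
```

If the input resolution or the number of pooling layers changes, this number changes too, and the `nn.Linear` layer has to be resized to match.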

### Q: What do you think will happen to your CNN as you change the activation function?
Feel free to try this by swapping `nn.ReLU()` in the `CNN` class for another activation (for example, `nn.Tanh()` or `nn.LeakyReLU()`)!

### A:
Different activation functions drastically change how gradients behave. For example:
- **ReLU**: fast, stable, common default choice.
- **Sigmoid**: saturates for large positive or negative inputs, so gradients vanish → slower training and usually poorer performance.
- **Tanh**: better than sigmoid but still saturates.
- **LeakyReLU**: may improve performance and stability by preventing dead neurons.
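
To make such experiments easier, the activation can be passed in as an argument. The sketch below is one possible refactoring, not the lecture's official code: `conv_block` and its `activation_func` parameter are hypothetical names introduced here.

```python
import torch
import torch.nn as nn

def conv_block(in_channels, out_channels, activation_func=nn.ReLU):
    """Conv -> BatchNorm -> activation -> MaxPool, with a swappable activation."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),
        activation_func(),   # pass the class itself, e.g. nn.Tanh or nn.LeakyReLU
        nn.MaxPool2d(2),
    )

relu_block = conv_block(3, 32)                            # default: ReLU
tanh_block = conv_block(3, 32, activation_func=nn.Tanh)   # same block with Tanh

dummy = torch.randn(2, 3, 32, 32)   # fake batch of two CIFAR-sized images
output_shape = tuple(tanh_block(dummy).shape)
print(output_shape)
```

Only the activation differs between the two blocks, so any change in training behavior can be attributed to it.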


----
## Training a CNN
Staff:
- Please discuss the selection process for optimizer and loss inputs
- Define what learning rate is
- Give an overview of 2-3 common loss functions and their behavior

### Expanded Lecture Notes

**Loss function:** Measures how wrong our predictions are.
- For multi-class classification like CIFAR-10 → `CrossEntropyLoss`.
- Binary labels → `BCEWithLogitsLoss`.
- Regression → `MSELoss`.
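
For intuition about what `CrossEntropyLoss` computes, the sketch below uses made-up logits for two samples and three classes. It checks that the built-in loss matches the average negative log-softmax probability of the correct class; note that it takes raw logits and integer class labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up scores for 2 samples and 3 classes (raw logits, not probabilities).
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])   # integer class index per sample

builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual version: log-softmax, pick the correct class, negate, average.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(builtin.item(), manual.item())
```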

**Optimizer:** The algorithm that updates model parameters.
- `SGD`: simple, but sensitive to learning rate.
- `Adam`: adaptive, stable — great default.
- `RMSprop`: adaptive per-parameter step sizes; an older relative of Adam, which adds momentum and bias correction.

**Learning rate:** Controls the size of each update.
- Too high → unstable / diverges.
- Too low → slow / stuck.
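
A toy example makes the effect of the learning rate concrete. This sketch runs plain gradient descent on `f(w) = w**2`, a stand-in problem chosen for illustration (not the CNN itself), with three different learning rates.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2 * w   # one gradient step
    return w

for lr in (1.1, 0.1, 0.0001):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.4g}")
```

With the largest rate, each step overshoots the minimum at 0 and `w` grows in magnitude; with the smallest, `w` barely moves from its starting value in 20 steps.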

For background on epochs, [this GeeksforGeeks article](https://www.geeksforgeeks.org/machine-learning/epoch-in-machine-learning/) will be useful.

In [7]:
#  Loss and optimizer
learning_rate = 0.001
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

### Q: What might happen if we changed our loss from __ to __ ?

### A:
- Using **MSELoss** for classification usually causes slow learning and poor accuracy.
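- Using **BCEWithLogitsLoss** for multi-class problems fails with integer class labels: it expects float targets with the same shape as the logits, and it treats each class as an independent yes/no decision.
- Using **CrossEntropyLoss** is appropriate for single-label problems with two or more mutually exclusive classes.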
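
The shape problem is easy to reproduce on a fake batch. In this sketch the logits and labels are random placeholders shaped like a CIFAR-10 batch of four; the exact exception type and message may differ between PyTorch versions, so both `ValueError` and `RuntimeError` are caught.

```python
import torch
import torch.nn as nn

fake_logits = torch.randn(4, 10)           # 4 samples, 10 class scores each
fake_labels = torch.tensor([3, 0, 7, 1])   # integer class indices, shape (4,)

# CrossEntropyLoss accepts (N, C) logits with (N,) integer labels.
ce_value = nn.CrossEntropyLoss()(fake_logits, fake_labels)
print("CrossEntropyLoss:", ce_value.item())

# BCEWithLogitsLoss wants targets shaped like the logits, so this is rejected.
try:
    nn.BCEWithLogitsLoss()(fake_logits, fake_labels)
    bce_failed = False
except (ValueError, RuntimeError) as err:
    bce_failed = True
    print("BCEWithLogitsLoss rejected the labels:", err)
```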

### Q: What happens if the `learning_rate` parameter is too high? Or too low?

### A:
- Too high → the loss will oscillate or explode; training will fail.
- Too low → training will get stuck or take far too long.
- Proper learning rate scheduling can dramatically improve training performance.
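
As one example of scheduling, PyTorch's `StepLR` multiplies the learning rate by `gamma` every `step_size` epochs. The sketch below uses a throwaway linear model and arbitrary settings (`step_size=3`, `gamma=0.1`) purely to show how the learning rate evolves; these are not tuned recommendations for the CNN above.

```python
import torch.nn as nn
import torch.optim as optim

toy_model = nn.Linear(4, 2)   # placeholder model, only here to supply parameters
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.StepLR(toy_optimizer, step_size=3, gamma=0.1)

lrs = []
for epoch in range(6):
    lrs.append(toy_optimizer.param_groups[0]["lr"])   # learning rate used this epoch
    # ... one full pass over the training data would go here ...
    toy_optimizer.step()   # stands in for the per-batch updates
    scheduler.step()       # called once per epoch, after the optimizer

print(lrs)
```

In the real training loop, `scheduler.step()` would go at the end of each epoch, after all batches have been processed.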

In [11]:
# Training loop
# STAFF: Please add check for early stopping!!!
num_epochs = 10
for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    model.train()
    running_loss = 0.0
    for images, labels in tqdm(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()

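    # Report the mean loss over all batches in this epoch, not just the last batch
    print(f'Average loss: {running_loss / len(train_loader):.4f} \n')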
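
For the early-stopping check requested in the staff note, one common pattern is to stop when the validation loss has not improved for a fixed number of epochs. The sketch below is a suggestion under stated assumptions: the `EarlyStopping` class, `patience`, and `min_delta` are names introduced here, and the loss values are made up. The notebook as written has no separate validation split, so one would need to be created to use this in the loop above.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss   # improvement: remember it, reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1        # no improvement this epoch
        return self.bad_epochs >= self.patience


# Made-up validation losses, only to show the behavior.
fake_val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(fake_val_losses, start=1):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break

print("Stopped after epoch:", stopped_at, "| best loss:", stopper.best_loss)
```

In the training loop, `should_stop` would be called once per epoch with that epoch's validation loss, followed by `break` when it returns `True`.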

### Q: What happens if you increase `epochs`? Will performance always improve as `epochs` increases?

### A:
Performance improves early on, but after a certain number of epochs the model begins to **overfit**:
- Training accuracy keeps increasing
- Validation accuracy plateaus and then decreases

The model memorizes noise rather than learning generalizable features.

----
## Validating a CNN

Staff: Please add comments/explanations as needed to this code!

Validation is where we measure the model's generalization. Key concepts:
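- **model.eval()** switches layers to inference behavior: dropout is disabled, and batch normalization uses its stored running statistics instead of updating them.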
- **torch.no_grad()** ensures we don't compute gradients.
- We compute accuracy across the entire test set.
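
The effect of `eval()` on batch normalization can be seen on a single layer. This sketch uses a standalone `nn.BatchNorm2d` and a random batch with a deliberately shifted mean and scale (values chosen only for illustration).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
batch = torch.randn(8, 3, 4, 4) * 5 + 2   # per-channel mean near 2, std near 5

bn.train()
train_out = bn(batch)                          # normalized with this batch's statistics
running_after_train = bn.running_mean.clone()  # running stats were updated as a side effect

bn.eval()
with torch.no_grad():
    eval_out = bn(batch)                       # normalized with the stored running statistics
running_after_eval = bn.running_mean.clone()

print("Running mean unchanged in eval mode:",
      torch.equal(running_after_train, running_after_eval))
print("Same output in train and eval mode:",
      torch.allclose(train_out, eval_out))
```

Forgetting `model.eval()` before testing therefore changes both the predictions and the layer's stored statistics.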

In [12]:
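# Testing: measure accuracy on the held-out test set
model.eval()                      # inference mode for BatchNorm (and dropout, if present)
with torch.no_grad():             # no gradients needed for evaluation
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)                 # logits, shape (batch, 10)
        predicted = outputs.argmax(dim=1)       # index of the highest-scoring class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print(f'Test Accuracy: {100 * correct / total:.2f}%')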


----

## Analyzing Performance
- Staff: prompt some reflection about the plot below

Encourage students to think about:
- Does accuracy improve smoothly or noisily?
- Does it plateau? When?
- Signs of overfitting?
- Are more layers or data augmentation needed?

This is a great opportunity to discuss *hyperparameter tuning*, *model capacity*, and *training dynamics*.

In [None]:
# Plot accuracy vs. epochs
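
One way to produce this plot is to record accuracy after every epoch and then draw both curves. The sketch below is a suggestion rather than the official solution: `evaluate_accuracy` and `plot_accuracy_history` are hypothetical helper names, and the training loop would need to call `evaluate_accuracy` once per epoch and append the results to two lists.

```python
import torch
import torch.nn as nn

def evaluate_accuracy(net, loader, device="cpu"):
    """Return the percentage of correctly classified samples in `loader`."""
    net.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            predicted = net(inputs).argmax(dim=1)
            correct += (predicted == targets).sum().item()
            total += targets.size(0)
    return 100.0 * correct / total


def plot_accuracy_history(train_acc, test_acc):
    """Plot per-epoch accuracy lists of equal length."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    epochs = range(1, len(train_acc) + 1)
    plt.plot(epochs, train_acc, marker="o", label="train")
    plt.plot(epochs, test_acc, marker="o", label="test")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy (%)")
    plt.legend()
    plt.show()


# Inside the training loop, at the end of each epoch (remember model.train() afterwards):
# train_history.append(evaluate_accuracy(model, train_loader, device))
# test_history.append(evaluate_accuracy(model, test_loader, device))
# After training: plot_accuracy_history(train_history, test_history)
```

A widening gap between the two curves is the overfitting signal discussed earlier.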