# Neural networks

In this notebook, you will implement fundamental neural network architectures from scratch using NumPy and PyTorch.
These exercises will deepen your understanding of how neural networks work, from basic matrix multiplications to more complex architectures like convolutional neural networks (CNNs).
As part of this notebook, you will build and train neural networks, visualize their learning process, and apply them to real-world biomedical datasets.

## Learning objectives

By the end of this notebook, you will:

- Understand neural networks at a fundamental level.
- Apply and extend neural network architectures.
- Explore real-world applications with CNNs.
- Learn practical deep learning skills with PyTorch.

## Requirements

You will need the following Python packages to complete this notebook:

- [NumPy](https://numpy.org/): for numerical computing.
- [Matplotlib](https://matplotlib.org/): for data visualization.
- [MedMNIST](https://medmnist.com/): dataset of biomedical images.
- [PyTorch](https://pytorch.org/): a commonly used framework for neural networks.

You can install these packages by running:

```
pip install numpy matplotlib medmnist torch torchvision
```

## Neural networks as matrix multiplications

At the core of a neural network is a series of layers that consist of multiple nodes (or neurons), each performing simple calculations and passing information to the next layer.
In a fully-connected neural network, each neuron is connected to every neuron in the adjacent layers.
The connections are represented by weights.
When the input data is passed through the network, these weights are used to compute the activations of each neuron.

For a single layer, the computation for all of its neurons at once is:
$$
z = W \cdot x + b
$$
where:
- $W$ is the weight matrix.
- $x$ is the input vector (either the input data for the first layer or the activations from the previous layer).
- $b$ is the bias vector (one bias per neuron, which shifts the neuron's output even when the input is zero).
- $z$ is the linear combination of the inputs (pre-activation value).

This operation is essentially a **matrix multiplication** between the input vector and the weight matrix, followed by the addition of the bias term. The matrix multiplication allows the network to process multiple inputs at once, making neural networks computationally efficient.
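As a minimal sketch of this computation, the snippet below evaluates $z = W \cdot x + b$ for one layer with NumPy, first for a single input and then for a batch. The sizes and values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical layer: 2 input features, 3 neurons.
W = np.array([[0.5, -1.0],
              [1.5, 2.0],
              [-0.5, 0.0]])      # shape (3, 2): one row of weights per neuron
b = np.array([0.1, -0.2, 0.3])   # shape (3,): one bias per neuron
x = np.array([1.0, 2.0])         # shape (2,): a single input vector

# Single input: z = W . x + b.
z = W @ x + b
print(z)

# A batch of inputs stored as rows: row i of Z equals W . x_i + b.
X = np.array([[1.0, 2.0],
              [0.0, -1.0]])      # shape (2 samples, 2 features)
Z = X @ W.T + b                  # shape (2 samples, 3 neurons)
print(Z)
```

Note that the exercise code later in this notebook stores each weight matrix with shape (inputs, neurons), i.e. the transpose of $W$ here, so a batch of row vectors is multiplied from the left without a transpose.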

## Non-linear activation functions

After the matrix multiplication step, neural networks apply **non-linear activation functions** to the pre-activation values $z$ to introduce non-linearity into the model.
Without these non-linear activations, the neural network would collapse into a single linear transformation (equivalent to a simple linear model), regardless of the number of layers, because a composition of linear maps is itself linear.
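The claim that stacked layers without activations add no expressive power can be checked numerically. The sketch below, using randomly drawn (hypothetical) weights, shows that two linear layers applied in sequence give the same result as one layer with combined parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)  # layer 1: 2 -> 3
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)  # layer 2: 3 -> 1
x = rng.normal(size=2)

# Two layers applied in sequence, with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

print(np.allclose(two_layers, one_layer))
```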

Commonly used activation functions include:

- ReLU (rectified linear unit): The ReLU function outputs the input directly if it's positive, and outputs zero otherwise.
$$
\text{ReLU}(z) = \max(0, z)
$$
ReLU is widely used due to its simplicity and its ability to alleviate the vanishing gradient problem in deep networks.

- Sigmoid: The sigmoid function squashes the input into the range (0, 1), making it useful for binary classification tasks.
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
This function maps large negative inputs to values close to 0 and large positive inputs to values close to 1.

- Tanh (hyperbolic tangent): The tanh function squashes the input into the range (-1, 1). It is similar to the sigmoid function but outputs values symmetrically around zero.
$$
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
$$
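The three activation functions above can be written in a few lines of NumPy. This is a minimal sketch; the function names are chosen for illustration, and `np.tanh` is used in place of the explicit formula because it is numerically more robust.

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z).
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes z into the range (0, 1).
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Squashes z into the range (-1, 1).
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))
print(sigmoid(z))
print(tanh(z))
```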

## Multi-layer neural networks

A multi-layer neural network, often referred to as a **feedforward neural network** or **multi-layer perceptron (MLP)**, consists of multiple layers of neurons:

1. **Input layer**: Receives the input data. Each neuron corresponds to a feature in the data.
2. **Hidden layers**: These layers lie between the input and output layers. Each hidden layer performs matrix multiplication followed by a non-linear activation function. A neural network can have multiple hidden layers, allowing it to capture more complex relationships.
3. **Output layer**: Produces the final predictions. For binary classification, the output layer typically has one neuron with a sigmoid activation function. For multi-class classification, the output layer might have multiple neurons with a softmax activation function.

### Forward propagation through multiple layers

In a multi-layer neural network, each layer transforms the input data and passes it to the next layer.
This process is called **forward propagation**.

For a network with one hidden layer, the forward propagation looks like this:

1. **Input to hidden layer**:
$$
z^{(1)} = W^{(1)} \cdot x + b^{(1)}
$$
$$
a^{(1)} = \text{activation}(z^{(1)})
$$

2. **Hidden layer to output layer**:
$$
z^{(2)} = W^{(2)} \cdot a^{(1)} + b^{(2)}
$$
$$
a^{(2)} = \text{activation}(z^{(2)})
$$
   
Here, $W^{(1)}$ and $W^{(2)}$ are the weight matrices for the first and second layers, $b^{(1)}$ and $b^{(2)}$ are the bias vectors, and $a^{(1)}$ and $a^{(2)}$ are the activations of the hidden and output layers.

This chain of matrix multiplications and activations continues through all the layers of the network.

### Learning in multi-layer networks: Backpropagation

The goal of training a neural network is to adjust the weights and biases so that the predictions closely match the actual targets.
This is done with gradient descent, using an algorithm called **backpropagation** to compute the required gradients.
Each training iteration involves the following steps:

1. **Forward pass**: Compute the network's output by performing matrix multiplication and activation functions layer by layer.
2. **Compute loss**: Calculate how far off the network's predictions are from the actual targets using a loss function (e.g., mean squared error for regression, cross-entropy for classification).
3. **Backward pass (backpropagation)**: Compute the gradients of the loss with respect to the weights and biases using the chain rule of calculus.
4. **Gradient descent**: Update the weights and biases using the gradients to minimize the loss function.

This process is repeated for multiple epochs, gradually improving the performance of the network.
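To make the four steps concrete, here is a minimal sketch for the simplest possible case: a single sigmoid neuron trained on a tiny toy dataset. The data, learning rate, and number of epochs are assumptions made for illustration, not values prescribed by this notebook.

```python
import numpy as np

# Toy data (hypothetical): the label is 1 when the single feature is positive.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([[0.0], [0.0], [1.0], [1.0]])

w = np.zeros((1, 1))
b = np.zeros((1, 1))
learning_rate = 0.5
losses = []

for epoch in range(100):
    # 1. Forward pass: linear step followed by the sigmoid.
    a = 1 / (1 + np.exp(-(X @ w + b)))
    # 2. Compute loss: binary cross-entropy.
    loss = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
    losses.append(loss)
    # 3. Backward pass: gradients of the loss w.r.t. w and b.
    dz = (a - y) / len(X)
    dw = X.T @ dz
    db = dz.sum(axis=0, keepdims=True)
    # 4. Gradient descent: step against the gradient.
    w -= learning_rate * dw
    b -= learning_rate * db

print(f"First loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

With more layers, only steps 1 and 3 grow: the forward pass chains more layers, and the backward pass applies the chain rule through each of them in reverse order.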

### Example of forward propagation through multiple layers

Consider a neural network with the following architecture:
- Input layer: 2 neurons (for a 2D input).
- Hidden layer: 3 neurons with ReLU activation.
- Output layer: 1 neuron with sigmoid activation (for binary classification).

1. Input to hidden layer:
$$
z^{(1)} = W^{(1)} \cdot x + b^{(1)}, \quad a^{(1)} = \text{ReLU}(z^{(1)})
$$
where:
$$
W^{(1)} = \begin{bmatrix}
w_{11}^{(1)} & w_{12}^{(1)} \\
w_{21}^{(1)} & w_{22}^{(1)} \\
w_{31}^{(1)} & w_{32}^{(1)}
\end{bmatrix}, \quad
x = \begin{bmatrix}
x_1 \\
x_2
\end{bmatrix}
$$

2. Hidden layer to output layer:
$$
z^{(2)} = W^{(2)} \cdot a^{(1)} + b^{(2)}, \quad a^{(2)} = \text{Sigmoid}(z^{(2)})
$$

Through this series of operations, the input is transformed into a prediction.
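The example above can be traced numerically. The following sketch uses the column-vector convention of the equations ($W^{(1)}$ has shape 3x2) with made-up parameter values; none of these numbers come from a trained network.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters for a 2-3-1 network.
W1 = np.array([[1.0, -1.0],
               [0.5, 0.5],
               [-1.0, 2.0]])       # shape (3, 2)
b1 = np.array([0.0, 0.1, -0.5])    # shape (3,)
W2 = np.array([[1.0, -2.0, 0.5]])  # shape (1, 3)
b2 = np.array([0.2])               # shape (1,)
x = np.array([1.0, 2.0])           # a single 2D input

# 1. Input to hidden layer.
z1 = W1 @ x + b1   # [-1.0, 1.6, 2.5]
a1 = relu(z1)      # the negative entry is clipped to 0
# 2. Hidden layer to output layer.
z2 = W2 @ a1 + b2  # [-1.75]
a2 = sigmoid(z2)   # predicted probability of class 1

print(z1, a1, z2, a2)
```

Since `a2` is below 0.5 for this input, the network would predict class 0.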

---

In summary, multi-layer neural networks use a combination of matrix multiplications and non-linear activation functions to learn complex relationships in data.
The power of these networks lies in their ability to stack multiple layers of transformations, enabling them to model highly complex patterns.
The learning process is guided by backpropagation and gradient descent, which iteratively adjust the weights and biases to minimize the prediction error.

## Task: Neural networks from scratch using NumPy

In this first exercise, you will create a fully-connected neural network from scratch using NumPy.
This will give you a deep understanding of how neural networks work at the fundamental level, without relying on high-level frameworks.

**Instructions**

- Implement a simple feedforward neural network with one hidden layer.
- The architecture will be:
    - Input layer: 2 neurons (for a 2D synthetic dataset).
    - Hidden layer: 3 neurons with ReLU activation.
    - Output layer: 1 neuron (binary classification) with sigmoid activation.

**Steps**

- Initialize weights: Write a function that initializes the weights and biases. Use small random values for the weights so that the neurons start out different from each other and can learn different features (symmetry breaking).
- Forward propagation: Implement forward propagation using matrix multiplication to calculate the activations of each layer.
- Loss function: Use binary cross-entropy as the loss function.
- Backpropagation: Write the backpropagation algorithm to compute the gradients of the loss with respect to the weights.
- Gradient descent: Update the weights using the gradients computed during backpropagation and a predefined learning rate.
- Training: Train the network on a small synthetic dataset for binary classification.
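For reference, here is a sketch of the binary cross-entropy loss in the notation of the code cell that follows, assuming $m$ training samples, labels $y_i \in \{0, 1\}$, and sigmoid outputs $a_i$ (the entries of `A2`):

$$
L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log a_i + (1 - y_i) \log (1 - a_i) \right]
$$

Because $a_i = \sigma(z_i)$ and $\sigma'(z) = \sigma(z) (1 - \sigma(z))$, the chain rule gives:

$$
\frac{\partial L}{\partial z_i} = \frac{1}{m} (a_i - y_i)
$$

This is why the provided `backpropagation` function starts from `dZ2 = A2 - Y` and divides by `m` when forming the weight and bias gradients. In practice, it is common to clip $a_i$ slightly away from 0 and 1 before taking the logarithm to avoid numerical problems.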

In [None]:
import numpy as np
import matplotlib.pyplot as plt


# Step 1: Generate synthetic data.
def generate_data(num_points, noise=0.1):
    np.random.seed(42)
    # Random points between -1 and 1.
    X = np.random.rand(num_points, 2) * 2 - 1
    # Label: 1 if product of coordinates > 0, else 0.
    Y = (X[:, 0] * X[:, 1] > 0).astype(int).reshape(-1, 1)
    # Add some noise.
    Y = (Y + (np.random.rand(num_points, 1) < noise).astype(int)) % 2
    return X, Y


# Step 2: Initialize weights.
def initialize_weights(input_size, hidden_size, output_size):
    np.random.seed(42)
    W1 = np.random.randn(input_size, hidden_size) * 0.01
    b1 = np.zeros((1, hidden_size))
    W2 = np.random.randn(hidden_size, output_size) * 0.01
    b2 = np.zeros((1, output_size))
    return W1, b1, W2, b2


# Step 3: Forward propagation.
def forward_propagation(X, W1, b1, W2, b2):
    # TODO
    Z1 = None  # Apply W1 to input.
    A1 = None  # ReLU activation.
    Z2 = None  # Apply W2 to A1.
    A2 = None  # Sigmoid activation.
    return A1, A2


# Step 4: Compute loss.
def compute_loss(A2, Y):
    # TODO
    loss = None
    return loss


# Step 5: Backpropagation.
def backpropagation(X, Y, A1, A2, W2):
    m = X.shape[0]
    dZ2 = A2 - Y
    dW2 = np.dot(A1.T, dZ2) / m
    db2 = np.sum(dZ2, axis=0, keepdims=True) / m
    dZ1 = np.dot(dZ2, W2.T) * (A1 > 0)
    dW1 = np.dot(X.T, dZ1) / m
    db1 = np.sum(dZ1, axis=0, keepdims=True) / m
    return dW1, db1, dW2, db2


# Step 6: Gradient descent.
def gradient_descent(W1, b1, W2, b2, dW1, db1, dW2, db2, learning_rate):
    # TODO
    return W1, b1, W2, b2


# Step 7: Training loop.
def train(X, Y, hidden_size, learning_rate, n_epochs):
    input_size = X.shape[1]
    output_size = 1
    W1, b1, W2, b2 = initialize_weights(input_size, hidden_size, output_size)

    loss_history = []
    for i in range(n_epochs):
        A1, A2 = forward_propagation(X, W1, b1, W2, b2)

        loss = compute_loss(A2, Y)
        loss_history.append(loss)

        dW1, db1, dW2, db2 = backpropagation(X, Y, A1, A2, W2)

        W1, b1, W2, b2 = gradient_descent(
            W1, b1, W2, b2, dW1, db1, dW2, db2, learning_rate
        )

        # Print the loss every 100 epochs.
        if i % 100 == 0:
            print(f"Epoch {i:3d}: Loss = {loss:.3f}")

    return W1, b1, W2, b2, loss_history


# Step 8: Predictions.
def predict(X, W1, b1, W2, b2):
    _, A2 = forward_propagation(X, W1, b1, W2, b2)
    predictions = (A2 > 0.5).astype(int)
    return predictions


# Generate synthetic data.
X, Y = generate_data(num_points=200, noise=0.1)

# Train the neural network.
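# TODO: Call train() with a hidden size, learning rate, and number of epochs of your choice.
W1, b1, W2, b2, loss_history = None, None, None, None, None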

# Make predictions on the training data (not a proper evaluation; use held-out data in practice).
predictions = predict(X, W1, b1, W2, b2)
accuracy = np.mean(predictions == Y) * 100
print(f"Accuracy: {accuracy:.2f}%")

# Plot the loss history.
fig, ax = plt.subplots()
ax.plot(loss_history)
ax.set_xlabel("Iteration")
ax.set_ylabel("Loss")
plt.show()
plt.close()

# Plot the decision boundary.
x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = predict(np.c_[xx.ravel(), yy.ravel()], W1, b1, W2, b2)
Z = Z.reshape(xx.shape)

fig, ax = plt.subplots()
ax.contourf(xx, yy, Z, cmap="RdBu", alpha=0.5)
ax.scatter(X[:, 0], X[:, 1], c=Y[:, 0], cmap="RdBu")
plt.show()
plt.close()

## Task: Extend the neural network with multiple layers (NumPy)

Now that you have implemented a simple neural network, it is time to extend it to a deeper architecture by adding more hidden layers.
This will help you understand how deep learning models extract complex features from data.

**Instructions**

- Modify your previous implementation to support an arbitrary number of layers.
- The architecture will now be:
    - Input layer: 2 neurons.
    - Multiple hidden layers with ReLU activation.
    - Output layer: 1 neuron with sigmoid activation.

**Steps**

- Modify forward propagation: Extend your forward propagation function to handle multiple layers.
- Modify backpropagation: Update the backpropagation function to calculate gradients for all layers.
- Activation functions: Experiment with different activation functions like sigmoid and tanh for hidden layers.
- Training: Train your deeper network on the same synthetic dataset and compare the performance.

In [None]:
def initialize_weights_deep(layer_dims):
    np.random.seed(42)
    weights = {}
    n_layers = len(layer_dims)
    for i in range(1, n_layers):
        # He initialization for ReLU.
        weights[f"W{i}"] = np.random.randn(layer_dims[i - 1], layer_dims[i]) * np.sqrt(
            2 / layer_dims[i - 1]
        )
        weights[f"b{i}"] = np.zeros((1, layer_dims[i]))
    return weights


def forward_propagation_deep(X, weights):
    caches = {}
    A = X
    n_layers = len(weights) // 2

    # Iterate through all layers (except the last).
    for i in range(1, n_layers):
        # TODO
        Z = None  # Linear step: apply W{i} and b{i} to A.
        A = None  # ReLU activation.
        caches[f"Z{i}"] = Z
        caches[f"A{i}"] = A

    # Output layer (uses sigmoid).
    # TODO
    ZL = None
    AL = None
    caches[f"Z{n_layers}"] = ZL
    caches[f"A{n_layers}"] = AL

    return AL, caches


def backpropagation_deep(X, Y, weights, caches):
    grads = {}
    m = X.shape[0]
    n_layers = len(weights) // 2

    # Output layer gradients.
    dZL = caches[f"A{n_layers}"] - Y
    grads[f"dW{n_layers}"] = np.dot(caches[f"A{n_layers - 1}"].T, dZL) / m
    grads[f"db{n_layers}"] = np.sum(dZL, axis=0, keepdims=True) / m

    # Backpropagation through remaining layers.
    for i in reversed(range(1, n_layers)):
        dZ = np.dot(dZL, weights[f"W{i+1}"].T) * (caches[f"A{i}"] > 0)
        grads[f"dW{i}"] = np.dot(X.T if i == 1 else caches[f"A{i - 1}"].T, dZ) / m
        grads[f"db{i}"] = np.sum(dZ, axis=0, keepdims=True) / m
        dZL = dZ

    return grads


def gradient_descent_deep(weights, grads, learning_rate):
    # TODO
    return weights


def train_deep(X, Y, layer_dims, learning_rate, n_epochs):
    weights = initialize_weights_deep(layer_dims)

    loss_history = []
    for i in range(n_epochs):
        AL, caches = forward_propagation_deep(X, weights)

        loss = compute_loss(AL, Y)
        loss_history.append(loss)

        grads = backpropagation_deep(X, Y, weights, caches)

        weights = gradient_descent_deep(weights, grads, learning_rate)

        # Print the loss every 100 epochs.
        if i % 100 == 0:
            print(f"Epoch {i:3d}: Loss = {loss:.3f}")

    return weights, loss_history


def predict_deep(X, weights):
    AL, _ = forward_propagation_deep(X, weights)
    predictions = (AL > 0.5).astype(int)
    return predictions


# Generate synthetic data.
X, Y = generate_data(num_points=200, noise=0.1)

# Define the layer dimensions for a deep network (input layer -> hidden layers -> output layer).
layer_dims = [2, 5, 5, 1]  # 2 inputs, 2 hidden layers with 5 neurons each, 1 output.

# Train the deep neural network.
weights, loss_history = train_deep(X, Y, layer_dims, learning_rate=1.0, n_epochs=1000)

# Make predictions on the training data (not a proper evaluation; use held-out data in practice).
predictions = predict_deep(X, weights)
accuracy = np.mean(predictions == Y) * 100
print(f"Accuracy: {accuracy:.2f}%")

# Plot the loss history.
fig, ax = plt.subplots()
ax.plot(loss_history)
ax.set_xlabel("Iteration")
ax.set_ylabel("Loss")
plt.show()
plt.close()

# Plot the decision boundary.
x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = predict_deep(np.c_[xx.ravel(), yy.ravel()], weights)
Z = Z.reshape(xx.shape)

fig, ax = plt.subplots()
ax.contourf(xx, yy, Z, cmap="RdBu", alpha=0.5)
ax.scatter(X[:, 0], X[:, 1], c=Y[:, 0], cmap="RdBu")
plt.show()
plt.close()

**Discuss what happens when you change the number of neurons, the number of layers, and the activation functions.**

Write your observations here.

## Introduction to convolutional neural networks (CNNs)

Convolutional neural networks (CNNs) are a specialized type of neural network primarily used for processing data that has a grid-like structure, such as images.
CNNs have proven to be highly effective in tasks like image classification, object detection, and medical image analysis, among many others.

In traditional fully-connected neural networks, each neuron in one layer is connected to every neuron in the next layer.
While this works well for small datasets, it becomes impractical for high-dimensional data like images.
For example, a 128x128 grayscale image has 16,384 pixels, and a fully-connected neural network would require an enormous number of parameters if each pixel were connected to every neuron in the first layer.

CNNs take advantage of the hierarchical and spatial structure of images to reduce the number of parameters while preserving important features like edges, textures, and shapes.

### Basic building blocks of CNNs

CNNs consist of three main types of layers: convolutional layers, pooling layers, and fully-connected layers.

#### Convolutional layers

The **convolutional layer** is the core building block of a CNN.
Instead of connecting every input pixel to every output neuron, a convolutional layer applies a set of filters (also called kernels) that slide over the input image.
Each filter detects specific features, like edges or textures, by performing an operation called convolution.

How convolution works:
- Each filter has a small size (e.g., 3x3 or 5x5) compared to the full image.
- The filter slides over the input image, and at each position, it computes the dot product between the filter and the corresponding region of the image.
- This produces an **activation map** or **feature map**, which highlights the presence of certain features (e.g., edges or corners) in the image.

For example, consider a 3x3 filter:
$$
\text{Filter} = \begin{bmatrix}
f_{11} & f_{12} & f_{13} \\
f_{21} & f_{22} & f_{23} \\
f_{31} & f_{32} & f_{33}
\end{bmatrix}
$$
and a region of the image:
$$
\text{Image region} = \begin{bmatrix}
p_{11} & p_{12} & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33}
\end{bmatrix}
$$

The convolution operation is the element-wise multiplication and summation of these two matrices:
$$
\text{Convolution output} = (f_{11} \cdot p_{11}) + (f_{12} \cdot p_{12}) + \dots + (f_{33} \cdot p_{33})
$$

This operation is repeated as the filter "slides" across the entire image, generating a feature map.
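The sliding-window operation can be sketched directly in NumPy. The function below is an illustrative, unoptimized implementation (no padding, stride 1, single channel); the image and filter values are made up. Strictly speaking, this computes a cross-correlation, since the filter is not flipped, which is also what deep learning libraries such as PyTorch compute under the name "convolution".

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the filter and the current image region, then sum.
            region = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(region * kernel)
    return feature_map

# Hypothetical 5x5 image with a vertical edge between columns 2 and 3.
image = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)
# An illustrative vertical-edge filter.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
```

The feature map is zero where the filter sees a uniform region and positive where it overlaps the edge, which is the sense in which a filter "detects" a feature.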

#### Pooling layers

After a convolutional layer, it is common to apply a **pooling layer** to reduce the spatial dimensions of the feature maps and down-sample the data.
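Pooling layers have no learnable parameters themselves, but by shrinking the feature maps they reduce the computation and the number of parameters needed in later layers, making the model more efficient and less prone to overfitting.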

Types of pooling:
- **Max pooling**: Takes the maximum value from a set of pixels within a pooling window (e.g., a 2x2 window). Max pooling captures the most prominent features.
- **Average pooling**: Computes the average of the values within the pooling window. Average pooling is less commonly used in modern CNNs compared to max pooling.

For example, with max pooling and a 2x2 window, the pooling layer selects the maximum value from each 2x2 region:
$$
\text{Image patch} = \begin{bmatrix}
1 & 3 \\
2 & 4
\end{bmatrix} \quad \text{Max Pooling} \Rightarrow 4
$$
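As a minimal sketch, 2x2 pooling with stride 2 can be implemented in NumPy by reshaping the feature map into non-overlapping 2x2 blocks. The function names and input values are illustrative, and the code assumes that the height and width are even.

```python
import numpy as np

def to_blocks_2x2(feature_map):
    # Reshape (H, W) into (H/2, 2, W/2, 2) so that axes 1 and 3 index within each 2x2 block.
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2)

def max_pool_2x2(feature_map):
    return to_blocks_2x2(feature_map).max(axis=(1, 3))

def avg_pool_2x2(feature_map):
    return to_blocks_2x2(feature_map).mean(axis=(1, 3))

feature_map = np.array([[1, 3, 5, 0],
                        [2, 4, 1, 2],
                        [7, 0, 3, 3],
                        [1, 1, 9, 6]])

print(max_pool_2x2(feature_map))  # [[4 5] [7 9]]
print(avg_pool_2x2(feature_map))
```

The top-left entry of the max-pooled output is 4, matching the image patch in the example above.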

#### Fully-connected layers

After several convolutional and pooling layers, the high-level features extracted by the CNN are passed to **fully-connected layers**, just like in traditional neural networks.
These layers combine all the learned features to make predictions (e.g., classifying an image).

#### Activation functions in CNNs

As in fully-connected neural networks, CNNs use non-linear activation functions like **ReLU** and **sigmoid** to introduce non-linearity after convolution and fully-connected layers.
The most common activation function in modern CNNs is ReLU, which is used after each convolutional operation:
$$
\text{ReLU}(z) = \max(0, z)
$$

### CNN architecture example

A typical CNN architecture for image classification might consist of:

1. **Input layer**: The input is an image, often represented as a 3D matrix (e.g., width x height x channels). For grayscale images, the number of channels is 1, and for RGB images, it's 3.
   
2. **Convolutional layer**: Applies filters to detect features like edges or textures. The output is a feature map.
   
3. **Pooling layer**: Down-samples the feature map to reduce the spatial dimensions and computation.
   
4. **Convolutional + pooling layers**: Often, multiple convolutional and pooling layers are stacked to progressively detect higher-level features (e.g., shapes, objects).
   
5. **Fully-connected layers**: After the convolution and pooling layers, the feature maps are flattened and passed through fully-connected layers. These layers combine all the learned features to predict the class label.
   
6. **Output layer**: For binary classification tasks, the output layer typically has one neuron with a sigmoid activation. For multi-class classification, it would have as many neurons as there are classes, often with a softmax activation.

### Key concepts in CNNs

#### Stride and padding

- **Stride**: The number of pixels by which the filter moves over the input image. A larger stride reduces the spatial size of the output feature map.
- **Padding**: Sometimes, the input image is padded with zeros around the borders to preserve the spatial dimensions after convolution. This is called **same padding**. Without padding, the output feature map will be smaller than the input.
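The effect of stride and padding on the output size follows a standard formula: for input size $n$, kernel size $k$, padding $p$, and stride $s$, the output size is $\lfloor (n + 2p - k) / s \rfloor + 1$. The helper below is a small sketch of that formula (the function name is made up for illustration), traced through a stack of 3x3 convolutions with padding 1 and 2x2 max pooling, assuming a 28x28 input.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

size = 28                                                # assumed input width/height
size = conv_output_size(size, kernel_size=3, padding=1)  # 3x3 conv, same padding -> 28
size = conv_output_size(size, kernel_size=2, stride=2)   # 2x2 max pooling        -> 14
size = conv_output_size(size, kernel_size=3, padding=1)  # 3x3 conv, same padding -> 14
size = conv_output_size(size, kernel_size=2, stride=2)   # 2x2 max pooling        -> 7
print(size)

# Without padding, a 3x3 filter shrinks the output by 2 pixels.
print(conv_output_size(28, kernel_size=3))
```

This is the arithmetic behind the `64 * 7 * 7` input size of the first fully-connected layer in the PyTorch exercise later in this notebook: 64 feature maps of size 7x7.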

#### Filters and feature maps

- A CNN can have multiple filters in each convolutional layer, where each filter detects different features. For example, one filter might learn to detect horizontal edges, while another might detect vertical edges.
- The **depth** of the feature map corresponds to the number of filters in the convolutional layer.

### Training a CNN

CNNs are trained in a similar way to fully-connected neural networks, using backpropagation and gradient descent.
The difference lies in how the gradients are calculated through the convolutional and pooling layers.
During training, the network learns the optimal values for the filters that best detect the relevant features for the task at hand.

---

In summary, CNNs use convolutional and pooling layers to extract hierarchical features from grid-like data, such as images.
By using small, local connections (filters), CNNs drastically reduce the number of parameters compared to fully connected networks, making them more efficient and capable of learning complex patterns in visual data.
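To put rough numbers on this reduction, the sketch below counts the parameters of a fully-connected layer and a convolutional layer applied to a 128x128 grayscale image. The layer sizes (1,000 hidden neurons, 32 filters of size 3x3) are assumptions chosen for illustration.

```python
def dense_params(n_inputs, n_outputs):
    """One weight per (input, output) pair, plus one bias per output neuron."""
    return n_inputs * n_outputs + n_outputs

def conv_params(in_channels, out_channels, kernel_size):
    """One kernel_size x kernel_size filter per (input, output) channel pair, plus one bias per filter."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

n_pixels = 128 * 128  # 16,384 input pixels
fc = dense_params(n_pixels, 1000)
conv = conv_params(in_channels=1, out_channels=32, kernel_size=3)

print(f"Fully-connected layer: {fc:,} parameters")
print(f"Convolutional layer:   {conv:,} parameters")
```

The convolutional layer's parameter count does not depend on the image size at all, because the same filters are reused at every position.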

## Convolutional neural network for medical image classification (PyTorch)

In this exercise, you will switch to PyTorch and implement a CNN to classify medical images.
CNNs are powerful models for tasks involving image data, such as detecting diseases from medical images like X-rays.

**Instructions**

- You will use a preprocessed dataset of chest X-rays from MedMNIST.
    - Load the dataset using PyTorch's DataLoader and apply necessary transformations like resizing and normalization.
- Implement the CNN architecture.
    - Define a CNN with multiple convolutional layers followed by pooling layers and fully-connected layers.
    - Use ReLU activations and a sigmoid at the output for binary classification.

**Steps**

- Data loading: Write code to load the chest X-ray dataset, applying transformations like normalization.
- CNN model: Implement a CNN with PyTorch's `nn.Module`. Include several convolutional layers followed by pooling and fully-connected layers.
- Training: Train the CNN on the dataset, optimizing with gradient descent (you can use Adam as the optimizer).
- Evaluation: Evaluate the CNN's performance on a test set using accuracy, precision, and recall metrics.
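As a reminder of the evaluation metrics, precision is the fraction of predicted positives that are truly positive, and recall is the fraction of true positives that the model finds. The sketch below computes both from binary labels with NumPy; the function name and the toy labels are made up for illustration, and returning 0 when a denominator is zero is one possible convention.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Toy labels (hypothetical): 3 true positives, 2 false positives, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"Precision: {precision:.2f}, recall: {recall:.2f}")
```

In medical settings the two metrics often matter differently: a low recall means missed cases, while a low precision means many false alarms.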

In [None]:
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from medmnist import PneumoniaMNIST
from torch.utils.data import DataLoader
from torchvision import transforms


# Data loading and preprocessing.
# Apply the following transformations:
# - Convert the images to tensors.
# - Normalize the images (mean 0.5, std 0.5 for grayscale images).
transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5], std=[0.5]),
    ]
)
# Download the dataset (chest X-ray images from MedMNIST).
train_dataset = PneumoniaMNIST(split="train", transform=transform, download=True)
test_dataset = PneumoniaMNIST(split="test", transform=transform, download=True)
# DataLoader to feed the images into the model in batches.
train_loader = DataLoader(dataset=train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=32, shuffle=False)

## TODO: Perform some data exploration by visualizing the images, etc.

In [None]:
# Define a simple CNN model for binary classification.
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # Convolutional layer 1: 1 input channel (grayscale), 32 output channels, kernel size 3.
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
        # Convolutional layer 2: 32 input channels, 64 output channels, kernel size 3.
        self.conv2 = nn.Conv2d(
            in_channels=32, out_channels=64, kernel_size=3, padding=1
        )
        # Max pooling layer: kernel size 2x2.
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Fully connected layer 1.
        self.fc1 = nn.Linear(64 * 7 * 7, 128)  # Flattened size: 64 channels of 7x7 after two 2x2 poolings.
        # Fully connected layer 2 (output layer): a single output for the two classes of PneumoniaMNIST.
        self.fc2 = nn.Linear(128, 1)

    def forward(self, x):
        # Implement the forward pass. Apply the layers as follows:
        # 1. conv1 -> relu -> pool
        # 2. conv2 -> relu -> pool
        # TODO
        # Flatten the image into a 1D vector for fully connected layers.
        x = x.view(x.size(0), -1)
        # 3. fc1 -> relu
        # TODO
        # 4. fc2 (output layer) -> sigmoid
        # TODO
        return x


# Define the model.
model = CNN()
# Binary cross-entropy loss function.
criterion = nn.BCELoss()
# Adam optimizer with learning rate 0.001.
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model.
n_epochs = 20
train_loss = []
# Set the model in training mode.
model.train()
for epoch in range(n_epochs):
    epoch_loss = 0.0
    for i, (images, labels) in enumerate(train_loader):
        # Zero the gradients.
        optimizer.zero_grad()
        # Forward pass.
        outputs = model(images)
        # Compute loss.
        loss = criterion(outputs, labels.float())
        # Backpropagation.
        loss.backward()
        # Update weights.
        optimizer.step()
        # Accumulate the loss.
        epoch_loss += loss.item()

    # Print statistics at the end of every epoch.
    epoch_loss = epoch_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{n_epochs}], Loss: {epoch_loss:.4f}")
    train_loss.append(epoch_loss)

# Evaluate the model.
# Set the model to evaluation mode.
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images)
        predicted = (outputs > 0.5).int()
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

accuracy = correct / total
print(f"Accuracy of the network on the test images: {accuracy:.2%}")
# TODO: Calculate the precision and recall on the test set.
precision = None
recall = None

# Plot training loss over epochs.
fig, ax = plt.subplots()
ax.plot(train_loss)
ax.set_xlabel("Epoch")
ax.set_ylabel("Loss")
plt.show()
plt.close()