# Context Aggregation Networks (CAN)
### Introduction
This notebook implements a Context Aggregation Network (CAN) in PyTorch to classify the MNIST dataset. MNIST consists of 28x28 grayscale images of handwritten digits (0-9), and the goal is to assign each image to one of the 10 digit classes.

Notes:
- The model is a CNN built from dilated convolutions.
- A related architecture is Dilated Residual Networks (CVPR 2017).

### Hyperparameters
We define the following hyperparameters:
- `batch_size`: Number of samples per batch.
- `num_classes`: Number of output classes (10 for digits 0-9).
- `learning_rate`: Learning rate for the optimizer.
- `num_epochs`: Number of times the entire dataset is passed through the network.

In [1]:
# Load in relevant libraries, and alias where appropriate
import torch
import torch.nn as nn
import torchvision
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Define relevant variables for the ML task
batch_size = 64
num_classes = 10
learning_rate = 0.01
num_epochs = 20

# Device will determine whether to run the training on GPU or CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Data Loading and Transformation
We load the MNIST dataset and apply three transformations: resize the images from 28x28 to 32x32 (the input size used throughout this notebook), convert them to tensors, and normalize them with the dataset mean and standard deviation.

In [2]:
#Loading the dataset and preprocessing
train_dataset = datasets.MNIST(root = './MNIST',
                               train = True,
                               transform = transforms.Compose([
                               transforms.Resize((32,32)),
                               transforms.ToTensor(),
                               transforms.Normalize(mean = (0.1307,), std = (0.3081,))]),
                               download = True)

test_dataset = datasets.MNIST(root = './MNIST',
                              train = False,
                              transform = transforms.Compose([
                              transforms.Resize((32,32)),
                              transforms.ToTensor(),
                              transforms.Normalize(mean = (0.1325,), std = (0.3105,))]),
                              download=True)

train_loader = DataLoader(dataset = train_dataset, batch_size = batch_size, shuffle = True)
# Shuffling is unnecessary for evaluation: accuracy does not depend on batch order
test_loader = DataLoader(dataset = test_dataset, batch_size = batch_size, shuffle = False)
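
To make the effect of the normalization step concrete, here is a small sketch of the arithmetic that `transforms.Normalize` applies to each pixel, `(x - mean) / std`, using the training-set statistics from the cell above. The helper name `normalize` is made up for this illustration and is not part of torchvision.

```python
# Training-set statistics passed to transforms.Normalize in the cell above.
MEAN, STD = 0.1307, 0.3081

def normalize(pixel: float, mean: float = MEAN, std: float = STD) -> float:
    """Hypothetical helper: the per-pixel map (x - mean) / std."""
    return (pixel - mean) / std

# After ToTensor, pixel values lie in [0, 1].
background = normalize(0.0)  # black background pixel
ink = normalize(1.0)         # fully saturated stroke pixel

print(f"background -> {background:.4f}, ink -> {ink:.4f}")
# background -> -0.4242, ink -> 2.8215
```

After normalization the mostly black background sits slightly below zero, while stroke pixels take values of up to about 2.8.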

### CAN Model
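We define `CANModel`, which consists of five 3x3 convolutional layers. The first four use increasing dilation rates (1, 2, 4, 8) and the fifth uses a dilation of 1. Each layer's padding equals its dilation, which keeps the spatial dimensions unchanged. Every convolution is followed by LeakyReLU, and an adaptive average pooling layer reduces the final 10-channel feature map to one score per class.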

![CAN](https://i.imgur.com/Yy6NOS1.png)
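
As a rough check on the layer configuration, the sketch below computes the receptive field and output size with the usual formulas for stride-1 convolutions. The helpers `receptive_field` and `output_size` are illustrative names, not PyTorch functions, and stride 1 is assumed throughout (the default for `nn.Conv2d`).

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of stacked stride-1 convolutions: 1 + sum((k - 1) * d)."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

def output_size(size, kernel, dilation, padding):
    """Spatial output size of one stride-1 convolution."""
    return size + 2 * padding - dilation * (kernel - 1)

kernels = [3, 3, 3, 3, 3]
can_dilations = [1, 2, 4, 8, 1]      # dilation rates used by the model
plain_dilations = [1, 1, 1, 1, 1]    # same depth without dilation

print(receptive_field(kernels, can_dilations))    # 33
print(receptive_field(kernels, plain_dilations))  # 11

# With padding equal to dilation, a 3x3 convolution preserves the 32x32 size.
print([output_size(32, 3, d, d) for d in can_dilations])  # [32, 32, 32, 32, 32]
```

Under these assumptions the dilated stack sees a 33x33 window, enough to cover the whole 32x32 input, whereas five undilated 3x3 layers would see only 11x11.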

In [3]:
import torch.nn.functional as F

class CANModel(nn.Module):
    def __init__(self):
        super(CANModel, self).__init__()
        # Define the layers of the CAN model with padding
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, dilation=1, padding=1)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, dilation=2, padding=2)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=3, dilation=4, padding=4)
        self.conv4 = nn.Conv2d(32, 32, kernel_size=3, dilation=8, padding=8)
        self.conv5 = nn.Conv2d(32, 10, kernel_size=3, dilation=1, padding=1)
        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))

    def forward(self, x):
        # Apply layers with LeakyReLU activation function
        x = F.leaky_relu(self.conv1(x))
        x = F.leaky_relu(self.conv2(x))
        x = F.leaky_relu(self.conv3(x))
        x = F.leaky_relu(self.conv4(x))
        x = F.leaky_relu(self.conv5(x))
        
        # Global Average Pooling layer before final output
        x = self.avg_pool(x)
        x = x.view(x.size(0), -1)  # Flatten the output

        return x

### Model, Loss Function, and Optimizer
We instantiate the model, define the loss function as `CrossEntropyLoss`, and use the `Adam` optimizer.

In [4]:
model = CANModel().to(device)

#Setting the loss function
cost = nn.CrossEntropyLoss()

#Setting the optimizer with the model parameters and learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

#Number of batches per epoch, used to average the running loss during training
total_step = len(train_loader)
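
`CrossEntropyLoss` expects raw class scores of shape `(batch, num_classes)` and integer class labels, which is why the model returns its pooled outputs without a softmax. The sketch below illustrates this on random stand-in scores and made-up labels: the built-in loss matches a log-softmax followed by the negative log-likelihood loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # stand-in for model outputs: (batch, num_classes)
labels = torch.tensor([3, 0, 9, 1])  # made-up class indices for illustration

# Built-in loss applied directly to the raw scores
loss_builtin = nn.CrossEntropyLoss()(logits, labels)

# Equivalent two-step computation: log-softmax, then negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(torch.allclose(loss_builtin, loss_manual))  # True
```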

### Training the Model
We train the model for the specified number of epochs. During each epoch:
- Forward pass: Compute the model’s predictions.
- Compute the loss.
- Backward pass: Compute gradients.
- Update the model parameters using the optimizer.
- Track the running loss and accuracy.

In [5]:
for epoch in range(num_epochs):
    correct = 0
    total = 0
    running_loss = 0.0

    for i, (images, labels) in enumerate(train_loader):  
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = cost(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Track loss
        running_loss += loss.item()
        
        # Track accuracy
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    # Calculate average loss and accuracy
    avg_loss = running_loss / total_step
    accuracy = 100 * correct / total
    
    print('Epoch [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%'
          .format(epoch+1, num_epochs, avg_loss, accuracy))

Epoch [1/20], Loss: 0.3412, Accuracy: 88.22%
Epoch [2/20], Loss: 0.0560, Accuracy: 98.22%
Epoch [3/20], Loss: 0.0437, Accuracy: 98.67%
Epoch [4/20], Loss: 0.0393, Accuracy: 98.80%
Epoch [5/20], Loss: 0.0335, Accuracy: 99.00%
Epoch [6/20], Loss: 0.0345, Accuracy: 98.92%
Epoch [7/20], Loss: 0.0300, Accuracy: 99.09%
Epoch [8/20], Loss: 0.0301, Accuracy: 99.07%
Epoch [9/20], Loss: 0.0259, Accuracy: 99.22%
Epoch [10/20], Loss: 0.0274, Accuracy: 99.24%
Epoch [11/20], Loss: 0.0273, Accuracy: 99.20%
Epoch [12/20], Loss: 0.0262, Accuracy: 99.26%
Epoch [13/20], Loss: 0.0219, Accuracy: 99.35%
Epoch [14/20], Loss: 0.0269, Accuracy: 99.24%
Epoch [15/20], Loss: 0.0229, Accuracy: 99.33%
Epoch [16/20], Loss: 0.0203, Accuracy: 99.38%
Epoch [17/20], Loss: 0.0223, Accuracy: 99.38%
Epoch [18/20], Loss: 0.0240, Accuracy: 99.32%
Epoch [19/20], Loss: 0.0252, Accuracy: 99.30%
Epoch [20/20], Loss: 0.0165, Accuracy: 99.52%


### Testing the Model
We evaluate the model on the test dataset without computing gradients to save memory. We calculate the accuracy of the model on the test images.

In [7]:
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
  
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))

Accuracy of the network on the 10000 test images: 99.1 %


### Results
The CAN model achieves an accuracy of 99.1% on the MNIST test set.

Accuracy with different numbers of feature channels (these runs were executed separately because of their long runtime, so only the accuracy results are shown):
- 16 channels accuracy: 99.15%
- 32 channels accuracy: 99.1%
- 64 channels accuracy: 97.38%

For comparison:
- KNN model (k=3) accuracy: 96.33%
- MLP model (n=128) accuracy: 96.75%
- LeNet-5 model (modified) accuracy: 98.85%

### Discussion (CAN with Different Numbers of Channels)
With 16 and 32 feature channels (c), the CAN model reached accuracies of 99.15% and 99.1%. With c=64, however, accuracy was unexpectedly lower (97.38%). A likely cause is that the same learning rate (lr) of 0.01 was used for all three models.
- For c=16 and c=32, lr=0.01 was likely effective because it suited the model's capacity, allowing stable and efficient training.
- For c=64, lr=0.01 may be too aggressive for the larger model, causing updates to overshoot good weights during training. This can result in lower accuracy than the configurations with fewer feature channels.

Potential Solutions:
- Learning Rate Adjustment: For c=64, reducing lr from 0.01 to 0.001 might allow the model to converge more smoothly and avoid overshooting.
- Learning Rate Scheduling: Implementing a learning rate scheduler that decreases the learning rate over time can help the model fine-tune its weights more effectively as training progresses.
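
Both suggestions can be sketched together with PyTorch's built-in `StepLR` scheduler. This is an untested illustration, not a configuration that was run for this notebook: the starting rate of 0.001, `step_size=5`, `gamma=0.5`, the single-layer stand-in model, and the random batch are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Conv2d(1, 64, kernel_size=3, padding=1)  # stand-in for a 64-channel CAN
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # assumed lower starting lr
# Halve the learning rate every 5 epochs (assumed schedule)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

lrs = []
for epoch in range(20):
    lrs.append(optimizer.param_groups[0]["lr"])

    # Placeholder for one epoch of training: a single step on a random batch
    optimizer.zero_grad()
    loss = model(torch.randn(2, 1, 8, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()

    scheduler.step()  # update the learning rate once per epoch

print(lrs[0], lrs[5], lrs[10], lrs[15])
```

In the real training loop, `scheduler.step()` would go at the end of each epoch, after all `optimizer.step()` calls for that epoch.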

### Discussion (CAN with Other Models)
Several factors may contribute to the CAN's higher accuracy relative to the comparison models:
1. **Activation Function (Leaky ReLU vs ReLU)**:
- CAN uses LeakyReLU, which addresses the “dying ReLU” problem by allowing a small, non-zero gradient when the unit is not active. This helps maintain gradient flow and improves learning stability.
- Modified LeNet-5 uses ReLU, which can sometimes lead to neurons becoming inactive and not contributing to learning.

2. **Dilated Convolutions**:
- They allow the network to have a larger receptive field without increasing the number of parameters.
- This helps in capturing more contextual information from the input images, leading to better feature extraction and improved classification performance.

3. **Global Average Pooling**:
- It has no learnable parameters, which reduces the parameter count compared with a fully connected classifier head and helps limit overfitting.
- It also collapses each feature map to a single value, giving a more compact and efficient representation of the features.
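
The parameter claims in points 2 and 3 can be checked with a short sketch. The `count_params` helper and the fully connected `dense_head` are hypothetical and used only for comparison; the notebook's model has no fully connected layer.

```python
import torch
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Hypothetical helper: total number of learnable parameters."""
    return sum(p.numel() for p in module.parameters())

# Point 2: dilation widens the receptive field but adds no weights.
plain = nn.Conv2d(32, 32, kernel_size=3, dilation=1, padding=1)
dilated = nn.Conv2d(32, 32, kernel_size=3, dilation=8, padding=8)
print(count_params(plain), count_params(dilated))  # 9248 9248

x = torch.randn(1, 32, 32, 32)
print(dilated(x).shape)  # torch.Size([1, 32, 32, 32])

# Point 3: a conv to 10 channels plus global average pooling,
# versus a dense layer on the flattened 32x32x32 feature map.
conv_head = nn.Conv2d(32, 10, kernel_size=3, padding=1)
gap = nn.AdaptiveAvgPool2d((1, 1))
dense_head = nn.Linear(32 * 32 * 32, 10)  # hypothetical alternative head

print(count_params(conv_head) + count_params(gap))  # 2890
print(count_params(dense_head))                     # 327690
```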

### Issue to address
One notable issue is that the model can take longer to train with LeakyReLU than with ReLU. A possible reason is that LeakyReLU introduces a small gradient for negative inputs, which may slow convergence.
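
The gradient for negative inputs can be inspected directly with autograd. This minimal sketch uses the default `negative_slope=0.01` of `F.leaky_relu`, the same default the model relies on, and two made-up input values.

```python
import torch
import torch.nn.functional as F

# One negative and one positive input for each activation
x_relu = torch.tensor([-2.0, 3.0], requires_grad=True)
F.relu(x_relu).sum().backward()

x_leaky = torch.tensor([-2.0, 3.0], requires_grad=True)
F.leaky_relu(x_leaky).sum().backward()  # default negative_slope=0.01

print(x_relu.grad)   # tensor([0., 1.])
print(x_leaky.grad)  # tensor([0.0100, 1.0000])
```

ReLU passes no gradient for the negative input, while LeakyReLU passes a gradient equal to its slope of 0.01.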

### Conclusion
- The `CANModel` achieves an accuracy of 99.1% on the MNIST test set, higher than the KNN, MLP, and modified LeNet-5 baselines.
- This is likely attributable to the use of LeakyReLU activation functions, dilated convolutions, and global average pooling, which together improve the model’s ability to learn and generalize from the data.