# Day 02 of upskilling to ML engineer.
Today we continue working with the MNIST dataset. Our goals are:
- Understand and implement a basic Convolutional Neural Network (CNN) in PyTorch for image classification.  
- Learn how CNNs improve performance over simple MLPs for image data.


## What is a CNN
A Convolutional Neural Network (CNN) is a type of deep learning architecture specifically designed for processing _grid-like data_ such as images, using convolutional layers that apply filters to detect local features like edges, textures, and patterns. These networks excel at computer vision tasks because they can automatically learn hierarchical representations, starting with simple features in early layers and combining them into more complex patterns in deeper layers.
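
To make "applying a filter" concrete, here is a minimal sketch that slides one hand-written 3x3 filter over a toy image. The image and the filter values are made up for illustration; in a real CNN the filter values are learned during training rather than written by hand.

```python
import torch
import torch.nn.functional as F

# Toy 6x6 "image": left half dark (0), right half bright (1).
# Shape is (batch, channels, height, width), as PyTorch conv layers expect.
image = torch.zeros(1, 1, 6, 6)
image[:, :, :, 3:] = 1.0

# Hand-written 3x3 filter that responds to dark-to-bright transitions
# when moving from left to right (a vertical edge detector).
kernel = torch.tensor([[-1.0, 0.0, 1.0],
                       [-1.0, 0.0, 1.0],
                       [-1.0, 0.0, 1.0]]).reshape(1, 1, 3, 3)

# Slide the filter over the image (stride 1, no padding): 6x6 -> 4x4
response = F.conv2d(image, kernel)

print(response.shape)            # torch.Size([1, 1, 4, 4])
print(response[0, 0].tolist())   # every row is [0.0, 3.0, 3.0, 0.0]
```

The response is zero over the flat regions and large only where the window straddles the edge, which is the sense in which a filter "detects" a local feature.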

In [None]:
%pip install torch torchvision matplotlib scikit-learn > out.log

In [None]:
# Import PyTorch and torchvision, plus the plotting and metrics helpers used later
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

## Load MNIST Dataset
We'll use the same MNIST dataset from Day 1, which contains 60,000 training images and 10,000 test images of handwritten digits (0-9), each in grayscale at 28x28 pixels.
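
The loading cell below applies `Normalize((0.5,), (0.5,))` after `ToTensor()`. As a small sketch of what that does, the snippet applies the same arithmetic, `(x - mean) / std`, to a few made-up pixel values in the `[0, 1]` range that `ToTensor()` produces.

```python
import torch

# Stand-in for ToTensor() output: pixel intensities scaled to [0, 1]
pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])

mean, std = 0.5, 0.5
# Per-channel operation performed by Normalize((0.5,), (0.5,))
normalized = (pixels - mean) / std

print(normalized.tolist())  # [-1.0, -0.5, 0.0, 1.0]
```

With these values the inputs end up in `[-1, 1]`, centered on zero.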

In [None]:
# Convert images to tensors in [0, 1], then normalize to [-1, 1] (mean 0.5, std 0.5)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load the training and test datasets
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoader objects for training and testing
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

# Print dataset sizes
print(f"Training set size: {len(train_dataset)}")
print(f"Test set size: {len(test_dataset)}")

In [None]:
# Visualize the first 5 samples
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for i in range(5):
    axes[i].imshow(train_dataset[i][0].squeeze(), cmap='gray')
    axes[i].set_title(f"Label: {train_dataset[i][1]}")
    axes[i].axis('off')
plt.show()

## Build a Simple CNN

Our CNN architecture will include:
- **Conv Layer 1**: 1 input channel (grayscale) → 32 feature maps, 3x3 kernel, ReLU activation
- **MaxPool 1**: 2x2 pooling to reduce spatial dimensions
- **Conv Layer 2**: 32 → 64 feature maps, 3x3 kernel, ReLU activation
- **MaxPool 2**: 2x2 pooling
- **Flatten**: Convert 2D feature maps to 1D vector
- **FC Layer 1**: Fully connected layer with 128 neurons
- **FC Layer 2**: Output layer with 10 neurons (one per class)
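
The spatial sizes in this architecture can be checked with the standard output-size formula, `out = floor((in + 2*padding - kernel) / stride) + 1`. The sketch below assumes `padding=1` on both convolutions, as in the model cell that follows; `conv_out` is a helper name introduced here for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a conv or pooling layer for a square input."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                     # MNIST images are 28x28
size = conv_out(size, kernel=3, padding=1)    # Conv Layer 1: 28 -> 28
size = conv_out(size, kernel=2, stride=2)     # MaxPool 1:    28 -> 14
size = conv_out(size, kernel=3, padding=1)    # Conv Layer 2: 14 -> 14
size = conv_out(size, kernel=2, stride=2)     # MaxPool 2:    14 -> 7

flattened = 64 * size * size                  # 64 feature maps of 7x7
print(size, flattened)  # 7 3136
```

This is where the `64 * 7 * 7` input size of the first fully connected layer comes from.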

**Why CNNs work better for images:**
- **Local connectivity**: Each neuron only connects to a small region, preserving spatial structure
- **Parameter sharing**: Same filters applied across the entire image, reducing parameters
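- **Translation invariance**: The same filter slides over every position, so a feature is detected wherever it appears, and pooling makes the output tolerant of small shifts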
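
These properties can be observed directly. The following sketch uses a small, randomly initialized convolution (not the model from this notebook) and a made-up 10x10 image: shifting the input shifts the feature maps by the same amount, and a global max pool over each map then gives the same value for both inputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Small illustrative layer; bias is disabled so empty regions map to exactly zero
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)

image = torch.zeros(1, 1, 10, 10)
image[0, 0, 3:5, 3:5] = 1.0                     # small blob away from the borders
shifted = torch.roll(image, shifts=2, dims=3)   # same blob, 2 pixels to the right

with torch.no_grad():
    out = conv(image)
    out_shifted = conv(shifted)

# Parameter sharing: 4 filters of 3x3 weights, regardless of image size
n_weights = sum(p.numel() for p in conv.parameters())

# Shifting the input shifts the feature maps by the same amount
equivariant = torch.allclose(torch.roll(out, shifts=2, dims=3), out_shifted, atol=1e-6)

# A global max pool over each feature map no longer depends on the position
invariant_after_pool = torch.allclose(
    out.amax(dim=(2, 3)), out_shifted.amax(dim=(2, 3)), atol=1e-6
)

print(n_weights, equivariant, invariant_after_pool)  # 36 True True
```

Strictly speaking, the convolution alone is translation *equivariant*; it is the pooling step that moves the network toward invariance.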

In [None]:
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First convolutional layer: 1 input channel, 32 output channels, 3x3 kernel
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        # Second convolutional layer: 32 input channels, 64 output channels, 3x3 kernel
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        # Max pooling layer: 2x2 window
        self.pool = nn.MaxPool2d(2, 2)
        # Fully connected layers
        # After 2 pooling layers (2x2), 28x28 becomes 7x7
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(0.5)
        
    def forward(self, x):
        # First conv block: Conv -> ReLU -> Pool
        x = self.pool(F.relu(self.conv1(x)))  # 28x28 -> 14x14
        # Second conv block: Conv -> ReLU -> Pool
        x = self.pool(F.relu(self.conv2(x)))  # 14x14 -> 7x7
        # Flatten for fully connected layers
        x = torch.flatten(x, 1)  # keep the batch dimension, flatten the rest to 64 * 7 * 7
        # Fully connected layers with dropout
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Instantiate the model
cnn_model = SimpleCNN()

# Print model architecture
print("CNN Model Architecture:")
print(cnn_model)
total_params = sum(p.numel() for p in cnn_model.parameters() if p.requires_grad)
print(f"\nTotal trainable parameters: {total_params:,}")

## Train the CNN
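We'll train for 3 epochs using the Adam optimizer (learning rate 0.001) and `CrossEntropyLoss`, which takes the raw scores (logits) returned by `SimpleCNN.forward`, so the model needs no softmax layer.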
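
As a quick check of what `CrossEntropyLoss` computes, the sketch below compares it against a manual log-softmax followed by the mean negative log-likelihood of the true class. The logits and labels are random stand-ins, not outputs of the model in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # made-up raw scores: 4 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1])    # made-up true class indices

loss = nn.CrossEntropyLoss()(logits, labels)

# Same quantity by hand: log-softmax over classes, pick the true class, average
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), labels].mean()

print(torch.allclose(loss, manual))  # True
```

Because the softmax is folded into the loss, adding an extra softmax at the end of the network would apply it twice.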

In [None]:
# Move model to device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
cnn_model.to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=0.001)

# Training loop
num_epochs = 3
train_losses = []
train_accuracies = []

for epoch in range(num_epochs):
    cnn_model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    
    for batch_idx, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
        
        # Zero the gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = cnn_model(images)
        loss = criterion(outputs, labels)
        
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        
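        # Track statistics. Accuracy here is measured with dropout active and
        # while the weights are still changing, so treat it as a rough running estimate.
        running_loss += loss.item()
        predicted = outputs.argmax(dim=1)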
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        
        if (batch_idx + 1) % 200 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{batch_idx+1}/{len(train_loader)}], '
                  f'Loss: {loss.item():.4f}')
    
    epoch_loss = running_loss / len(train_loader)
    epoch_acc = 100 * correct / total
    train_losses.append(epoch_loss)
    train_accuracies.append(epoch_acc)
    
    print(f'Epoch [{epoch+1}/{num_epochs}] completed. '
          f'Average Loss: {epoch_loss:.4f}, Training Accuracy: {epoch_acc:.2f}%')

print('\nTraining completed!')

In [None]:
# Plot training loss and accuracy
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(range(1, num_epochs+1), train_losses, 'b-', marker='o')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss')
ax1.grid(True)

ax2.plot(range(1, num_epochs+1), train_accuracies, 'g-', marker='o')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy (%)')
ax2.set_title('Training Accuracy')
ax2.grid(True)

plt.tight_layout()
plt.show()

## Evaluate the CNN on Test Data

In [None]:
# Evaluate on test dataset
cnn_model.eval()
correct = 0
total = 0
all_preds = []
all_labels = []

with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = cnn_model(images)
        predicted = outputs.argmax(dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

test_accuracy = 100 * correct / total
print(f'\nTest Accuracy of CNN: {test_accuracy:.2f}%')

## Visualize Predictions

In [None]:
# Visualize predictions (5 correct, 5 incorrect)
num_correct = 5
num_incorrect = 5
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for ax in axes.flat:
    ax.axis('off')  # hide axes up front so any unused slot stays blank

correct_count = 0
incorrect_count = 0

cnn_model.eval()
with torch.no_grad():
    for image, label in test_dataset:
        if correct_count >= num_correct and incorrect_count >= num_incorrect:
            break
            
        image_tensor = image.unsqueeze(0).to(device)
        output = cnn_model(image_tensor)
        predicted = output.argmax(dim=1)
        
        if predicted.item() == label and correct_count < num_correct:
            axes[0, correct_count].imshow(image.squeeze().cpu(), cmap='gray')
            axes[0, correct_count].set_title(f'✓ Label: {label}')
            axes[0, correct_count].axis('off')
            correct_count += 1
        elif predicted.item() != label and incorrect_count < num_incorrect:
            axes[1, incorrect_count].imshow(image.squeeze().cpu(), cmap='gray')
            axes[1, incorrect_count].set_title(f'✗ Pred: {predicted.item()}, True: {label}', color='red')
            axes[1, incorrect_count].axis('off')
            incorrect_count += 1

plt.suptitle('CNN Predictions: Correct (top) vs Incorrect (bottom)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Confusion Matrix

In [None]:
# Compute and display confusion matrix
cm = confusion_matrix(all_labels, all_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=np.arange(10))

fig, ax = plt.subplots(figsize=(10, 10))
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title("Confusion Matrix: CNN on MNIST", fontsize=16, fontweight='bold')
plt.show()

print("\nPer-class accuracy:")
for i in range(10):
    class_correct = cm[i, i]
    class_total = cm[i, :].sum()
    class_acc = 100 * class_correct / class_total if class_total > 0 else 0
    print(f"Digit {i}: {class_acc:.2f}% ({class_correct}/{class_total})")

## Comparison: CNN vs MLP

Based on Day 1's MLP results and today's CNN:

### MLP (from Day 1)
- **Architecture**: Simple 2-layer network (784 → 16 → 10)
- **Test Accuracy**: ~73-88% (depending on configuration)
- **Parameters**: ~12,700 parameters (784·16 + 16 in the first layer, 16·10 + 10 in the second)
- **Limitation**: Treats image as flat vector, losing spatial structure

### CNN (Day 2)
- **Architecture**: Conv layers + pooling + FC layers
- **Test Accuracy**: Typically 98-99%
- **Parameters**: About 422,000, most of them in the first fully connected layer; far more than the MLP, but with much better feature learning
- **Advantage**: Preserves spatial structure, learns local patterns
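
To put numbers on the parameter comparison, this sketch counts trainable parameters for both networks. The MLP is rebuilt from the layer sizes listed above; its ReLU activation is an assumption (activations add no parameters, so the count is unaffected). The CNN layers use the same shapes as `SimpleCNN`, and `count_params` is a helper name introduced here.

```python
import torch.nn as nn

def count_params(module):
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# MLP with the Day 1 layer sizes (784 -> 16 -> 10); the activation is assumed
mlp = nn.Sequential(nn.Flatten(), nn.Linear(784, 16), nn.ReLU(), nn.Linear(16, 10))

# Layers with the same shapes as SimpleCNN
cnn_layers = {
    "conv1": nn.Conv2d(1, 32, kernel_size=3, padding=1),
    "conv2": nn.Conv2d(32, 64, kernel_size=3, padding=1),
    "fc1": nn.Linear(64 * 7 * 7, 128),
    "fc2": nn.Linear(128, 10),
}

mlp_total = count_params(mlp)
cnn_counts = {name: count_params(layer) for name, layer in cnn_layers.items()}
cnn_total = sum(cnn_counts.values())

print(f"MLP total: {mlp_total:,}")        # MLP total: 12,730
for name, n in cnn_counts.items():
    print(f"{name}: {n:,}")               # conv1: 320, conv2: 18,496, fc1: 401,536, fc2: 1,290
print(f"CNN total: {cnn_total:,}")        # CNN total: 421,642
```

The two convolutional layers together hold fewer than 19,000 parameters; almost all of the CNN's size sits in `fc1`, which is an ordinary dense layer.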

### Why CNNs Win for Images
1. **Spatial hierarchies**: Early layers detect edges, later layers detect complex patterns
2. **Parameter efficiency**: Shared weights across spatial locations
3. **Translation invariance**: A feature is detected wherever it appears, and pooling makes the output tolerant of small shifts
4. **Local connectivity**: Each neuron focuses on a small region

## Stretch Goals (Optional)

Try experimenting with:
1. Adding more convolutional layers
2. Changing kernel sizes (3x3, 5x5)
3. Adjusting the number of filters (16, 32, 64, 128)
4. Different optimizers (SGD with momentum, RMSprop)
5. Batch normalization layers
6. Visualizing learned filters from the first convolutional layer
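
As one possible starting point for several of these experiments, here is a sketch of a variant with 5x5 kernels, different filter counts, batch normalization, and SGD with momentum. The class name `VariantCNN`, the layer sizes, and the hyperparameters are all hypothetical choices for illustration, not tuned or tested settings.

```python
import torch
import torch.nn as nn

class VariantCNN(nn.Module):
    """Hypothetical variant: 5x5 kernels and batch norm after each convolution."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),   # padding=2 keeps 28x28
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # keeps 14x14
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = VariantCNN()
# Illustrative optimizer settings, not tuned values
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Shape check with a random batch standing in for MNIST images
dummy = torch.randn(8, 1, 28, 28)
logits = model(dummy)
print(logits.shape)  # torch.Size([8, 10])
```

Such a model could be dropped into the training loop above in place of `cnn_model`, since it takes the same input shape and returns 10 logits per image.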

In [None]:
# Bonus: Visualize first layer filters
def visualize_filters(model):
    # Get the weights from the first convolutional layer
    filters = model.conv1.weight.detach().cpu().numpy()
    
    # Normalize filters for visualization
    f_min, f_max = filters.min(), filters.max()
    filters = (filters - f_min) / (f_max - f_min)
    
    # Plot all 32 filters in a 4x8 grid
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for i, ax in enumerate(axes.flat):
        ax.imshow(filters[i, 0], cmap='gray')
        ax.set_title(f'Filter {i+1}')
        ax.axis('off')
    
    plt.suptitle('Learned Filters from First Convolutional Layer', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

visualize_filters(cnn_model)