# MNIST Multi-Modal Learning Practice Notebook

In this notebook, you will practice some of the core concepts we have presented. The overall pipeline is as follows:

---

## 1. Data Preparation

- Load the **MNIST** dataset.
- Split the dataset into **train**, **validation**, and **test** sets.
- Horizontally split each image into **upper** and **lower** halves.
- Pad the removed half with zeros to maintain consistent input shapes.

---

## 2. Model Definitions

Define two encoders and one fused model:

1. **CNNEncoder**  
   - A simple CNN with two convolutional layers, each followed by max pooling, and a linear output layer.

2. **MLPEncoder** (implemented below as the `MLP` class)  
   - A simple MLP with two fully connected layers.

3. **FusedModel**  
   - A combined model using:
     - `CNNEncoder` for the **upper half**.
     - `MLPEncoder` for the **lower half**.

---

## 3. Training and Evaluation

For each encoder:

### (i) CNNEncoder

- Use only the **upper half** of the input data.
- Train for **5 epochs**.
- Validate using the validation set.
- Evaluate performance on the test set.

### (ii) MLPEncoder

- Use only the **lower half** of the input data.
- Train for **5 epochs**.
- Validate using the validation set.
- Evaluate performance on the test set.

### (iii) FusedModel

- Use **both upper and lower halves** of the input data.
- Encode the upper half with the CNNEncoder and the lower half with the MLPEncoder, then fuse the two representations by concatenation.
- Train for **5 epochs**.
- Validate using the validation set.
- Evaluate performance on the test set.

### (iv) Different Fusion Strategies
- Explore alternative fusion strategies, for instance average fusion (averaging the two representations).

### (v) Investigate how adding noise to the representations impacts performance
- Add increasing amounts of noise to one or both modalities and monitor performance.
- Visualize the fused representations using PCA or t-SNE.

---
Concepts introduced tomorrow:

### (vi) Self-supervised:
- Instead of directly fusing the modalities, align them with a CLIP-style contrastive objective.
- Visualize the learned representations using PCA or t-SNE.
- Train a linear classifier on top of the learned representations (keeping the encoders frozen).

### (vii) Alignment noise:
- Add increasing amounts of noise to one or both modalities and monitor performance.
- Visualize the fused representations using PCA or t-SNE.

# Imports

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F


# Set hyperparameters, then load and split MNIST

In [2]:
# Hyperparameters
batch_size = 64
learning_rate = 0.01
num_epochs = 5
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5,), (0.5,))])  # One-element tuples: MNIST images have a single channel

# Load MNIST dataset

train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
# Hold out 30% of the training set for validation (fractional lengths need PyTorch >= 1.13)
train_dataset, val_dataset = torch.utils.data.random_split(train_dataset, [0.7, 0.3])
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)

train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)



# Visualize some images

In [None]:
import matplotlib.pyplot as plt
import numpy as np

for images, _ in train_loader:
    # Visualize the first few images in the batch
    num_images_to_show = 5
    for i in range(num_images_to_show):
        image = images[i].squeeze(0).numpy()  # Drop the channel dimension: [1, H, W] -> [H, W]

        # Display the image in grayscale
        plt.imshow(image, cmap='gray')
        plt.axis('off')  # Turn off axis labels
        plt.show()

    break  # Exit the loop after visualizing the first batch

# Define our models - a CNN, an MLP, and a FusedModel

In [None]:
class CNNEncoder(nn.Module):
    """Simple CNN Encoder"""
    def __init__(self):
        super(CNNEncoder, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(64*7*7, 10)  # 64 channels x 7 x 7: two 2x2 poolings reduce the padded 28x28 input to 7x7

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.fc1(x)
        return x

class MLP(nn.Module):
    """Simple 2-Layer MLP"""
    def __init__(self):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.fc1(x.view(x.size(0), -1)))  # Flatten all dimensions except the batch dimension
        x = self.fc2(x)  # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself
        return x

# Fusing representations
class FusedModel(nn.Module):
    #Implement this in practical
    """Model that fuses CNN and MLP representations."""
    def __init__(self):
        super(FusedModel, self).__init__()
        # FILL IN HERE

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FILL IN HERE
        raise NotImplementedError("FusedModel.forward is left as an exercise")
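
One possible way to approach the fusion exercise is sketched below. It is not the reference solution: the class name `ConcatFusionSketch`, the helper `split_halves`, the `rep_dim` parameter, and the choice to concatenate the 10-dimensional encoder outputs before a linear classifier are all assumptions made for illustration. Small stand-in encoders are used so the block runs on its own; in the notebook you would pass `CNNEncoder()` and `MLP()` instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def split_halves(x: torch.Tensor):
    """Hypothetical helper mirroring prepare_data: zero-pad each half back to 28x28."""
    upper = F.pad(x[:, :, :14, :], (0, 0, 0, 14))  # keep top rows, pad bottom
    lower = F.pad(x[:, :, 14:, :], (0, 0, 14, 0))  # keep bottom rows, pad top
    return upper, lower


class ConcatFusionSketch(nn.Module):
    """Hypothetical fusion model: concatenate two encoder outputs, then classify."""

    def __init__(self, upper_encoder: nn.Module, lower_encoder: nn.Module,
                 rep_dim: int = 10, num_classes: int = 10):
        super().__init__()
        self.upper_encoder = upper_encoder
        self.lower_encoder = lower_encoder
        # Concatenation doubles the representation size
        self.classifier = nn.Linear(2 * rep_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        upper, lower = split_halves(x)
        z_upper = self.upper_encoder(upper)
        z_lower = self.lower_encoder(lower)
        fused = torch.cat([z_upper, z_lower], dim=1)  # [B, 2 * rep_dim]
        return self.classifier(F.relu(fused))


# Stand-in encoders (assumption) so the sketch is self-contained
upper_enc = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
lower_enc = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model = ConcatFusionSketch(upper_enc, lower_enc)
logits = model(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```

Splitting inside `forward` keeps the call signature identical to the single-modality models, so the same training loop can drive all three.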

# Utils

In [None]:
import torch.nn.functional as F

def prepare_data(data):
    """Splits MNIST images into two halves horizontally and pads to original shape."""
    upper_half = data[:, :, :14, :]  # Top half: [B, 1, 14, 28]
    lower_half = data[:, :, 14:, :]  # Bottom half: [B, 1, 14, 28]

    # Pad bottom 14 rows with zeros for upper_half
    upper_half_padded = F.pad(upper_half, pad=(0, 0, 0, 14))  # Pad rows: (left, right, top, bottom)

    # Pad top 14 rows with zeros for lower_half
    lower_half_padded = F.pad(lower_half, pad=(0, 0, 14, 0))  # Pad rows: (left, right, top, bottom)

    return upper_half_padded, lower_half_padded


def train_model(model: nn.Module, data_loader, optimizer: optim.Optimizer, criterion: nn.modules.loss._Loss):
    """Train the model."""
    # FILL IN HERE


def evaluate_model(model: nn.Module, data_loader, criterion):
    """Evaluate the model."""
    # FILL IN HERE (compute validation and test loss and accuracy)
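
The two stubs above could be filled in along the following lines. This is a minimal sketch under stated assumptions, not the intended solution: the names `train_one_epoch`, `evaluate`, `keep_upper`, and the optional `select_input` argument (a callable that picks which half of the image a model sees) are hypothetical, and random tensors stand in for MNIST so the block runs without a download.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, data_loader, optimizer, criterion, select_input=None):
    """Run one pass over the data and return the mean training loss."""
    model.train()
    total_loss, n = 0.0, 0
    for images, labels in data_loader:
        inputs = select_input(images) if select_input is not None else images
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)
        n += labels.size(0)
    return total_loss / n


@torch.no_grad()
def evaluate(model, data_loader, criterion, select_input=None):
    """Return (mean loss, accuracy) without updating the model."""
    model.eval()
    total_loss, correct, n = 0.0, 0, 0
    for images, labels in data_loader:
        inputs = select_input(images) if select_input is not None else images
        logits = model(inputs)
        total_loss += criterion(logits, labels).item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        n += labels.size(0)
    return total_loss / n, correct / n


def keep_upper(images):
    """Hypothetical selector: upper half only, zero-padded back to 28x28."""
    return F.pad(images[:, :, :14, :], (0, 0, 0, 14))


# Synthetic stand-in data (assumption): 64 random "images" with random labels
torch.manual_seed(0)
fake_images = torch.randn(64, 1, 28, 28)
fake_labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=16)

toy_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.01)
toy_criterion = nn.CrossEntropyLoss()

train_loss = train_one_epoch(toy_model, loader, toy_optimizer, toy_criterion, keep_upper)
val_loss, val_acc = evaluate(toy_model, loader, toy_criterion, keep_upper)
print(f"train loss {train_loss:.3f}, eval loss {val_loss:.3f}, accuracy {val_acc:.2%}")
```

Passing the half-selection as a callable is one design choice among several; calling `prepare_data` inside the loop and indexing its result would work equally well.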


In [None]:
for images, _ in train_loader:
  upper_half, lower_half = prepare_data(images)
  print(upper_half.shape, lower_half.shape)
  print(images.shape)
  break

# Check that split works

In [None]:
#Look at the first upper_half image
image = upper_half[0].squeeze(0).numpy()  # Drop the channel dimension: [1, H, W] -> [H, W]

# Display the image in grayscale
plt.imshow(image, cmap='gray')
plt.axis('off')  # Turn off axis labels
plt.show()



In [None]:
#Look at the first lower_half image
image = lower_half[0].squeeze(0).numpy()  # Drop the channel dimension: [1, H, W] -> [H, W]

# Display the image in grayscale
plt.imshow(image, cmap='gray')
plt.axis('off')  # Turn off axis labels
plt.show()

In [None]:
#look at first full image
image = images[0].squeeze(0).numpy()  # Drop the channel dimension: [1, H, W] -> [H, W]

# Display the image in grayscale
plt.imshow(image, cmap='gray')
plt.axis('off')  # Turn off axis labels
plt.show()



# Init and train CNN

In [None]:
#Initialize CNN
cnn_encoder = CNNEncoder()
criterion = nn.CrossEntropyLoss()
cnn_optimizer = optim.Adam(cnn_encoder.parameters(), lr=learning_rate)

In [None]:
# Train and evaluate the CNN encoder
for epoch in range(num_epochs):
    train_model(cnn_encoder, train_loader, cnn_optimizer, criterion)
    print(f"Epoch {epoch+1}: CNN Encoder val loss {evaluate_model(cnn_encoder, val_loader, criterion)}")

print("Done training! Evaluating on test set...")
#Test
evaluate_model(cnn_encoder, test_loader, criterion)

# Init and train MLP

In [None]:
#Initialize MLP
mlp = MLP()
criterion = nn.CrossEntropyLoss()
mlp_optimizer = optim.Adam(mlp.parameters(), lr=learning_rate)

# Train and evaluate the MLP
for epoch in range(num_epochs):
    train_model(mlp, train_loader, mlp_optimizer, criterion)
    print(f"Epoch {epoch+1}: MLP val loss {evaluate_model(mlp, val_loader, criterion)}")

print("Done training! Evaluating on test set...")
#Test
evaluate_model(mlp, test_loader, criterion)

# Init and train FusedModel

In [None]:
#Initialize Fusion Encoder
fused_nn = FusedModel()
criterion = nn.CrossEntropyLoss()
fuse_optimizer =  optim.Adam(fused_nn.parameters(), lr=learning_rate)

In [None]:
# Train and evaluate the Fused encoder
for epoch in range(num_epochs):
    train_model(fused_nn, train_loader, fuse_optimizer, criterion)
    print(f"Epoch {epoch+1}: Fusion Encoder val loss {evaluate_model(fused_nn, val_loader, criterion)}")

print("Done training! Evaluating on test set...")
#Test
evaluate_model(fused_nn, test_loader, criterion)

## (iv) Different Fusion Strategies
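
As a starting point for comparing strategies, the sketch below puts concatenation, averaging, and element-wise maximum behind one function. The function name `fuse` and the `strategy` argument are hypothetical, and the maximum variant is an extra illustration beyond the two strategies named in this notebook.

```python
import torch


def fuse(z_a: torch.Tensor, z_b: torch.Tensor, strategy: str = "concat") -> torch.Tensor:
    """Combine two [B, D] representations with the chosen strategy."""
    if strategy == "concat":
        return torch.cat([z_a, z_b], dim=1)  # [B, 2D]
    if strategy == "average":
        return 0.5 * (z_a + z_b)             # [B, D]
    if strategy == "max":
        return torch.maximum(z_a, z_b)       # [B, D]
    raise ValueError(f"Unknown fusion strategy: {strategy}")


z_a = torch.tensor([[1.0, 2.0]])
z_b = torch.tensor([[3.0, 0.0]])
print(fuse(z_a, z_b, "concat"))   # tensor([[1., 2., 3., 0.]])
print(fuse(z_a, z_b, "average"))  # tensor([[2., 1.]])
print(fuse(z_a, z_b, "max"))      # tensor([[3., 2.]])
```

Note that concatenation doubles the feature dimension while averaging and maximum keep it unchanged, so the input size of any classifier placed after the fusion step has to match the strategy.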


## (v) Investigate how adding noise to the representations impacts performance
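
A noise sweep can be sketched as follows. Everything here is a toy stand-in: the helper `add_gaussian_noise` is hypothetical, one-hot vectors play the role of clean representations, and `argmax` plays the role of a trained classifier. In the notebook you would instead add the noise to one or both encoder outputs inside the fused model and re-run evaluation for each noise level.

```python
import torch
import torch.nn.functional as F


def add_gaussian_noise(z: torch.Tensor, sigma: float) -> torch.Tensor:
    """Add zero-mean Gaussian noise with standard deviation sigma to a representation."""
    return z + sigma * torch.randn_like(z)


torch.manual_seed(0)
labels = torch.arange(10).repeat(20)               # 200 toy labels, 20 per class
clean = F.one_hot(labels, num_classes=10).float()  # perfectly separable toy representations

results = {}
for sigma in [0.0, 0.25, 0.5, 1.0, 2.0]:
    noisy = add_gaussian_noise(clean, sigma)
    accuracy = (noisy.argmax(dim=1) == labels).float().mean().item()
    results[sigma] = accuracy
    print(f"sigma={sigma:<4} accuracy={accuracy:.2%}")
```

With `sigma=0.0` the toy accuracy is exactly 100%, which makes it a useful sanity check before sweeping larger values.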

## (vi) Self-supervised
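
A CLIP-style alignment objective can be sketched as a symmetric cross-entropy over cosine similarities, where the matching upper and lower halves of the same image are the positive pair and all other pairs in the batch are negatives. The function name `clip_style_loss` and the fixed temperature of 0.07 are assumptions for illustration (CLIP itself learns the temperature); this is a simplified sketch, not the original CLIP implementation.

```python
import torch
import torch.nn.functional as F


def clip_style_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: row i of z_a should match row i of z_b."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature           # [B, B] scaled cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    loss_a_to_b = F.cross_entropy(logits, targets)       # each a picks its b
    loss_b_to_a = F.cross_entropy(logits.t(), targets)   # each b picks its a
    return 0.5 * (loss_a_to_b + loss_b_to_a)


torch.manual_seed(0)
z_upper = torch.randn(8, 16)
loss_matched = clip_style_loss(z_upper, z_upper.clone())      # perfectly aligned pairs
loss_random = clip_style_loss(z_upper, torch.randn(8, 16))    # unrelated pairs
print(f"matched pairs: {loss_matched.item():.3f}, random pairs: {loss_random.item():.3f}")
```

Once the encoders are trained with such a loss, they can be frozen (for example by setting `requires_grad_(False)` on their parameters) and a single `nn.Linear` layer trained on their outputs as the linear probe.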

## (vii) Alignment noise
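
For the visualization step, a 2-D PCA projection can be computed directly in PyTorch. The sketch below uses a hypothetical helper `pca_2d` and random vectors in place of real encoder outputs; the resulting coordinates can then be passed to a scatter plot colored by digit label.

```python
import torch


def pca_2d(z: torch.Tensor) -> torch.Tensor:
    """Project [N, D] representations onto their first two principal components."""
    centered = z - z.mean(dim=0, keepdim=True)
    # Rows of vh are the principal directions, ordered by decreasing singular value
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:2].T


torch.manual_seed(0)
reps = torch.randn(100, 10)  # stand-in for encoder outputs (assumption)
coords = pca_2d(reps)
print(coords.shape)  # torch.Size([100, 2])

# With matplotlib available, a plot could look like:
# plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10')
```

Computing the projection once on clean representations and once on noisy ones gives a direct visual comparison of how the noise spreads the clusters.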