# ENEL 645 - Assignment 2
## Group 30: Aneesh Bulusu, Long Nguyen, Strahinja Radakovic

In this assignment we were tasked with building a machine learning model that takes an image and a short text description as input and predicts which garbage disposal stream the pictured object belongs in.

We tried several models (particularly for handling the text data), but in the end the following model, which closely follows the tutorials we went over in class, performed best. It uses ResNet50 for image feature extraction and DistilBERT for text feature extraction.

### Part 1: Set-up

Here we import the necessary libraries and select a GPU for training if one is available, falling back to the CPU otherwise.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import models, transforms
from transformers import DistilBertModel, DistilBertTokenizer
import os
import re
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
import torchvision.utils as vutils

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Part 2: Data Preprocessing

In [None]:
# Resize the images to a standard size, convert them to tensors, and normalize them with the ImageNet mean and std.
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Tokenizing the text
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
max_len = 24


# Extracting text from filenames
def extract_text_from_filename(filename):
    text = os.path.splitext(filename)[0]  # Remove extension
    text = text.replace('_', ' ')  # Replace underscores
    text = re.sub(r'\d+', '', text)  # Remove digits
    return text


# Creating a custom dataset by merging the image and text data
class MultimodalDataset(Dataset):
    def __init__(self, image_dir, tokenizer, max_len, transform=None):
        self.image_dir = image_dir
        self.transform = transform
        self.tokenizer = tokenizer
        self.max_len = max_len
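        # Keep only sub-directories so stray files are not treated as classes
        self.classes = sorted(
            d for d in os.listdir(image_dir)
            if os.path.isdir(os.path.join(image_dir, d))
        )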
        self.label_map = {class_name: idx for idx, class_name in enumerate(self.classes)}
        self.samples = []

        # Gathers image file paths and filenames
        for class_name in self.classes:
            class_path = os.path.join(image_dir, class_name)
            if os.path.isdir(class_path):
                for file in os.listdir(class_path):
                    self.samples.append((os.path.join(class_path, file), class_name))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_path, class_name = self.samples[idx]
        label = self.label_map[class_name]

        # Loads image
        image = Image.open(image_path).convert("RGB")
        if self.transform:
            image = self.transform(image)

        # Processes text from filename
        text = extract_text_from_filename(os.path.basename(image_path))
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        # Returns the image tensor, text tensor, attention mask tensor and label tensor
        return {
            'image': image,
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'label': torch.tensor(label, dtype=torch.long)
        }
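
The text input comes entirely from the image filename, so it helps to see what the cleaning step produces. Below is a minimal sketch of the same idea with one assumed addition: it also collapses leftover whitespace, which `extract_text_from_filename` above does not do. The helper name `filename_to_text` and the example filenames are hypothetical.

```python
import os
import re


def filename_to_text(filename: str) -> str:
    """Turn an image filename into a short description string."""
    text = os.path.splitext(filename)[0]      # drop the extension
    text = text.replace("_", " ")             # underscores become spaces
    text = re.sub(r"\d+", "", text)           # drop the numeric suffix
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace (assumed extra step)


# Hypothetical filenames, for illustration only.
examples = ["plastic_bottle_12.png", "pizza_box_3.jpg", "banana_peel.jpeg"]
descriptions = [filename_to_text(name) for name in examples]
for name, desc in zip(examples, descriptions):
    print(f"{name!r} -> {desc!r}")
```

The tokenizer ignores surrounding whitespace anyway, so the extra step mainly makes the intermediate strings easier to inspect.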

### Part 3: Creating the Model

In [None]:
# Extracting image features using ResNet50
class ImageFeatureExtractor(nn.Module):
    def __init__(self):
        super(ImageFeatureExtractor, self).__init__()
        resnet = models.resnet50(weights="IMAGENET1K_V1")
        self.feature_extractor = nn.Sequential(*list(resnet.children())[:-1])  # Removing FC layer

    def forward(self, x):
        x = self.feature_extractor(x)
        return x.view(x.size(0), -1)  
    
    
# Extracting text features using DistilBERT
class TextFeatureExtractor(nn.Module):
    def __init__(self):
        super(TextFeatureExtractor, self).__init__()
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')

    def forward(self, input_ids, attention_mask):
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return output.last_hidden_state[:, 0, :] 
    

# Combining image and text features together
class MultimodalClassifier(nn.Module):
    def __init__(self, image_feature_dim, text_feature_dim, num_classes):
        super(MultimodalClassifier, self).__init__()

        self.image_model = ImageFeatureExtractor()
        self.text_model = TextFeatureExtractor()

        # Combined feature size
        combined_dim = image_feature_dim + text_feature_dim
        self.classifier = nn.Sequential(
            nn.Linear(combined_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, image, input_ids, attention_mask):
        image_features = self.image_model(image)
        text_features = self.text_model(input_ids, attention_mask)

        combined_features = torch.cat((image_features, text_features), dim=1)
        output = self.classifier(combined_features)
        return output
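
The fusion step in `MultimodalClassifier` can be checked in isolation. The sketch below replaces the ResNet50 and DistilBERT outputs with random tensors of the same sizes (2048 and 768), so no pretrained weights are needed. The batch size of 2 and `num_classes = 4` are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

batch, image_dim, text_dim, num_classes = 2, 2048, 768, 4  # batch and num_classes are assumed

# Random stand-ins for the two extractor outputs (no pretrained weights needed).
image_features = torch.randn(batch, image_dim)
text_features = torch.randn(batch, text_dim)

# Same layer layout as the classifier head above.
head = nn.Sequential(
    nn.Linear(image_dim + text_dim, 512),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(512, num_classes),
)
head.eval()  # disable dropout for a deterministic shape check

with torch.no_grad():
    combined = torch.cat((image_features, text_features), dim=1)  # concatenate along features
    logits = head(combined)

combined_shape = tuple(combined.shape)
logits_shape = tuple(logits.shape)
print(combined_shape, logits_shape)
```

Concatenating along `dim=1` keeps one row per sample, so the head sees a single 2816-dimensional vector per image and text pair.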

### Part 4: Training & Evaluation

In [None]:
#  Defining paths to the datasets (the ones below are for the TALC cluster)
TRAIN_PATH = "/work/TALC/enel645_2025w/garbage_data/CVPR_2024_dataset_Train"
VAL_PATH = "/work/TALC/enel645_2025w/garbage_data/CVPR_2024_dataset_Val"
TEST_PATH = "/work/TALC/enel645_2025w/garbage_data/CVPR_2024_dataset_Test"

# Loading the datasets
train_dataset = MultimodalDataset(TRAIN_PATH, tokenizer, max_len, transform=image_transform)
val_dataset = MultimodalDataset(VAL_PATH, tokenizer, max_len, transform=image_transform)
test_dataset = MultimodalDataset(TEST_PATH, tokenizer, max_len, transform=image_transform)

# Creating data loaders
batch_size = 32  # in the future we could experiment with different batch sizes, since batch size can affect how well the model generalizes
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Defining the model
image_feature_dim = 2048  # ResNet50 feature size
text_feature_dim = 768  # DistilBERT CLS token size
num_classes = len(train_dataset.classes)

model = MultimodalClassifier(image_feature_dim, text_feature_dim, num_classes).to(device)

# Parameter Count
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

image_params = count_parameters(model.image_model)
text_params = count_parameters(model.text_model)
total_params = count_parameters(model)


# Defining training parameters
optimizer = optim.AdamW(model.parameters(), lr=2e-5) # in the future, we could look at finding the optimal learning rate
criterion = nn.CrossEntropyLoss()


# Training/Validation loop
epochs = 10  # in the future, we could look into the optimal number of epochs
best_loss = float('inf')

# Script to run the epochs
for epoch in range(epochs):
    model.train()
    total_train_loss = 0
    correct_train = 0
    total_train = 0

    for batch in train_loader:
        images = batch['image'].to(device)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(images, input_ids, attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_train_loss += loss.item()

        # Computing training accuracy
        _, preds = torch.max(outputs, 1)
        correct_train += (preds == labels).sum().item()
        total_train += labels.size(0)

    avg_train_loss = total_train_loss / len(train_loader)
    train_accuracy = correct_train / total_train * 100

    
    
    # Validation
    model.eval()
    total_val_loss = 0
    correct_val = 0
    total_val = 0

    with torch.no_grad():
        for batch in val_loader:
            images = batch['image'].to(device)
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(images, input_ids, attention_mask)
            loss = criterion(outputs, labels)
            total_val_loss += loss.item()

            _, preds = torch.max(outputs, 1)
            correct_val += (preds == labels).sum().item()
            total_val += labels.size(0)

    avg_val_loss = total_val_loss / len(val_loader)
    val_accuracy = correct_val / total_val * 100 

    # Print output
    print(f"Epoch [{epoch+1}/{epochs}], Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}, Train Acc: {train_accuracy:.2f}%, Val Acc: {val_accuracy:.2f}%")

    # Save the best model
    if avg_val_loss < best_loss:
        torch.save(model.state_dict(), 'best_model.pth')
        best_loss = avg_val_loss

### Part 5: Testing

In [None]:
# Loading the previously saved best model
model.load_state_dict(torch.load('best_model.pth', map_location=device))
model.eval()

correct_test = 0
total_test = 0
test_predictions = []
test_labels = []

# Loop over the test set to calculate test accuracy
with torch.no_grad():
    for batch in test_loader:
        images = batch['image'].to(device)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(images, input_ids, attention_mask)

        _, preds = torch.max(outputs, 1)
        correct_test += (preds == labels).sum().item()
        total_test += labels.size(0)

        test_predictions.extend(preds.cpu().numpy())  # Preparing for confusion matrix
        test_labels.extend(labels.cpu().numpy())

test_accuracy = correct_test / total_test * 100

# Printing test accuracy
print(f"Test Accuracy: {test_accuracy:.2f}%")
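
The `test_predictions` and `test_labels` lists collected above are what the confusion matrix in Part 6 is built from. As a rough illustration of what that matrix encodes, here is a pure-Python sketch that tallies one by hand and derives per-class recall from it. The helper names and the toy labels are hypothetical and are not results from our run.

```python
def confusion_counts(true_labels, predicted_labels, num_classes):
    """Return a num_classes x num_classes matrix: rows are true labels, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly (diagonal over row sum)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Hypothetical labels for three classes, for illustration only.
toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 1, 1, 1, 2, 0, 2]

toy_matrix = confusion_counts(toy_true, toy_pred, num_classes=3)
toy_recall = per_class_recall(toy_matrix)
toy_accuracy = sum(toy_matrix[i][i] for i in range(3)) / len(toy_true)

print(toy_matrix)
print([round(r, 2) for r in toy_recall], toy_accuracy)
```

Overall accuracy is the diagonal sum divided by the number of samples, while per-class recall shows which classes account for the errors.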

### Part 6: Visualization

In [None]:
# Confusion Matrix
cm = confusion_matrix(test_labels, test_predictions)
class_names = sorted(train_dataset.classes)  # Ensure correct class order

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.savefig("confusion_matrix.png")
plt.show()



# Visualize first convolutional layer filters
first_conv_weights = model.image_model.feature_extractor[0].weight.data.cpu()
first_conv_weights = (first_conv_weights - first_conv_weights.min()) / (first_conv_weights.max() - first_conv_weights.min())

# Plot filters
plt.figure(figsize=(8, 8))
grid = vutils.make_grid(first_conv_weights, normalize=True, nrow=8)
plt.imshow(grid.permute(1, 2, 0))
plt.title("First Convolutional Layer Filters")
plt.axis("off")
plt.savefig("convolutional_filters.png")
plt.show()

# Printing the number of parameters
print(f"Image Model (ResNet50) Parameters: {image_params:,}")
print(f"Text Model (DistilBERT) Parameters: {text_params:,}")
print(f"Total Trainable Parameters: {total_params:,}")

## Output
*Warnings and less useful prints are omitted. For the full output, see the `slurm-34704.out` file.*

Epoch [1/10], Train Loss: 0.4898, Val Loss: 0.2942, Train Acc: 81.84%, Val Acc: 89.89%

Epoch [2/10], Train Loss: 0.2211, Val Loss: 0.3102, Train Acc: 92.37%, Val Acc: 89.39%

Epoch [3/10], Train Loss: 0.1081, Val Loss: 0.3342, Train Acc: 96.40%, Val Acc: 89.67%

Epoch [4/10], Train Loss: 0.0469, Val Loss: 0.3917, Train Acc: 98.67%, Val Acc: 89.50%

Epoch [5/10], Train Loss: 0.0245, Val Loss: 0.4075, Train Acc: 99.32%, Val Acc: 90.39%

Epoch [6/10], Train Loss: 0.0159, Val Loss: 0.4867, Train Acc: 99.60%, Val Acc: 89.17%

Epoch [7/10], Train Loss: 0.0141, Val Loss: 0.4723, Train Acc: 99.60%, Val Acc: 89.28%

Epoch [8/10], Train Loss: 0.0220, Val Loss: 0.6345, Train Acc: 99.26%, Val Acc: 88.83%

Epoch [9/10], Train Loss: 0.0280, Val Loss: 0.4908, Train Acc: 99.05%, Val Acc: 89.50%

Epoch [10/10], Train Loss: 0.0207, Val Loss: 0.5048, Train Acc: 99.23%, Val Acc: 89.89%





Test Accuracy: 85.66%
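
Because the checkpoint is only overwritten when the validation loss improves, the logged losses determine which epoch's weights were loaded for testing. The sketch below replays that rule on the validation losses from the log above. It is an illustration of the save logic, not part of the submitted run.

```python
# Validation losses copied from the training log in this section, epochs 1 to 10.
val_losses = [0.2942, 0.3102, 0.3342, 0.3917, 0.4075,
              0.4867, 0.4723, 0.6345, 0.4908, 0.5048]

best_loss = float("inf")
saved_epochs = []  # epochs at which 'best_model.pth' would be overwritten
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:  # same rule as the training loop
        best_loss = loss
        saved_epochs.append(epoch)

print(f"Checkpoint written at epochs {saved_epochs}; "
      f"final checkpoint is from epoch {saved_epochs[-1]}")
```

Under this rule the validation loss never drops below its epoch 1 value, so the test accuracy above appears to come from the epoch 1 weights, even though validation accuracy peaks later.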


![confusion matrix](confusion_matrix.png)

![convolutional filters](convolutional_filters.png)


Image Model (ResNet50) Parameters: 23,508,032

Text Model (DistilBERT) Parameters: 66,362,880

Total Trainable Parameters: 91,315,268
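
The three counts above can be cross-checked by hand: the total should equal the two extractors plus the classifier head. The sketch below does that arithmetic. `num_classes = 4` is an assumption, inferred because it is the value that makes the numbers add up, and `linear_params` is a hypothetical helper.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


image_params = 23_508_032   # ResNet50 without its FC layer (from the output above)
text_params = 66_362_880    # DistilBERT (from the output above)
num_classes = 4             # assumption: inferred, not stated in the report

combined_dim = 2048 + 768   # concatenated image and text features
head_params = linear_params(combined_dim, 512) + linear_params(512, num_classes)
total_params = image_params + text_params + head_params

print(f"Classifier head: {head_params:,}")
print(f"Total: {total_params:,}")
print(f"Image share of total: {image_params / total_params:.1%}")
```

With these assumptions the head has 1,444,356 parameters and the sum matches the reported total of 91,315,268.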

## Discussion

### Process:
Throughout this project we went through several models. Initially, we used a ResNet18 backbone for the images and tried tokenizing the text data from scratch. While this worked, our accuracy was only around 73%. We were not satisfied with this, so we switched to DistilBERT for the text, which improved our accuracy by ~7 percentage points.

Once we analyzed the parameter counts, though, we saw an imbalance between the two branches: ~10,000,000 parameters in the ResNet18 image model versus ~66,000,000 in DistilBERT. There are many possible ways to address this, but due to time constraints we upgraded from ResNet18 to ResNet50, knowing that the more complex model would add parameters to the image branch. In doing so, we approximately doubled the number of parameters in the image model.

While the ratio is still far from balanced, we are satisfied with our current results given these limitations.

### Next Steps:
In the future we would like to add dense layers to balance the number of parameters contributed by each of the two sources.
Additionally, more systematic testing of batch size, number of epochs, and learning rate, aimed at finding a good balance between overfitting and underfitting, would be a fun endeavour.
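
One simple way to approach the question of how many epochs to train is early stopping on the validation loss. The sketch below is a hypothetical helper, not something we ran. The `EarlyStopping` class and the `patience=3` setting are assumptions chosen for illustration, and it is applied to the validation losses from the Output section.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Validation losses from the Output section, epochs 1 to 10.
logged_val_losses = [0.2942, 0.3102, 0.3342, 0.3917, 0.4075,
                     0.4867, 0.4723, 0.6345, 0.4908, 0.5048]

stopper = EarlyStopping(patience=3)
stop_epoch = None
for epoch, loss in enumerate(logged_val_losses, start=1):
    if stopper.step(loss):
        stop_epoch = epoch
        break

print(f"Would stop after epoch {stop_epoch}, best val loss {stopper.best_loss}")
```

On the logged losses this rule would have ended training early, which hints that a lower learning rate or stronger regularization may matter more than running additional epochs.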