Connect to Google Drive, which holds the data folder downloaded from Kaggle.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Locate the data in Google Drive. Update the path below to match where the data folder sits in your own Drive.

In [2]:
import os

GOOGLE_DRIVE_PATH_POST_MYDRIVE = 'data'
GOOGLE_DRIVE_PATH = os.path.join('/content', 'drive', 'MyDrive', GOOGLE_DRIVE_PATH_POST_MYDRIVE)
print(os.listdir(GOOGLE_DRIVE_PATH))

['dev.jsonl', 'LICENSE.txt', 'README.md', 'train.jsonl', 'test.jsonl', 'img']
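A quick check that the folder holds everything the later cells read can catch a wrong path early. The sketch below is one way to do it; `missing_entries` and `EXPECTED_ENTRIES` are hypothetical names introduced here, and the expected list is taken from the listing above.

```python
import os

# Entries the later cells rely on, taken from the directory listing above.
EXPECTED_ENTRIES = ["train.jsonl", "dev.jsonl", "test.jsonl", "img"]

def missing_entries(directory, expected=EXPECTED_ENTRIES):
    """Return the expected file or folder names that are not in `directory`."""
    present = set(os.listdir(directory))
    return [name for name in expected if name not in present]
```

Calling `missing_entries(GOOGLE_DRIVE_PATH)` should return an empty list when the path is set correctly.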


Clone the GitHub repository. Keep personal access tokens out of the URL so they are not saved with the notebook.

In [None]:
!git clone https://github.com/ssubedi09/Deep-Learning-Hateful-Memes.git

Cloning into 'Deep-Learning-Hateful-Memes'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 28 (delta 6), reused 13 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (28/28), 380.21 KiB | 1.52 MiB/s, done.
Resolving deltas: 100% (6/6), done.
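If the repository needs authentication, one option is to read the token from the environment at run time instead of typing it into a cell. This is a minimal sketch under that assumption: `build_clone_url` is a hypothetical helper and `GITHUB_TOKEN` is an assumed environment variable name, neither of which the notebook defines.

```python
import os

def build_clone_url(owner, repo, token=None):
    """Build an HTTPS clone URL, embedding a token only when one is supplied."""
    if token:
        return f"https://{owner}:{token}@github.com/{owner}/{repo}.git"
    return f"https://github.com/{owner}/{repo}.git"

# GITHUB_TOKEN is an assumed variable name; it is read at run time and never
# written into the notebook source.
clone_url = build_clone_url(
    "ssubedi09", "Deep-Learning-Hateful-Memes", os.environ.get("GITHUB_TOKEN")
)
```

In a notebook cell the result can then be passed to the shell with `!git clone {clone_url}`.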


Add the cloned repository to the Python path so its modules can be imported.

In [None]:
import sys
sys.path.append('/content/Deep-Learning-Hateful-Memes')

Configure Git if it is not configured yet.

In [None]:
!git config --global user.email "sandipsubedi0926@gmail.com"
!git config --global user.name "ssubedi09"

Push the changes back to the repository. Change the commit message before pushing anything.

In [None]:
#!touch models/.gitkeep
!git add .
!git commit -m "Add new folder"
!git push origin main

Import the core libraries and enable automatic reloading of external modules.

In [4]:
# Just run this block. Please do not modify the following code.
import pandas as pd
import torch

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

Now let's check GPU availability. By default, the notebook uses the GPU if one is available and falls back to the CPU otherwise.

In [6]:
# Check device availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("You are using device: %s" % device)

You are using device: cuda


Load the data. Each row of train.jsonl contains the id, the image location, the label, and the caption text that appears in the image.

In [11]:
path = GOOGLE_DRIVE_PATH + '/train.jsonl'
data=pd.read_json(path,lines=True)
print(f"Data set size: {len(data)}")

Data set size: 8500
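The `lines=True` argument tells pandas that the file is in JSON Lines format, with one JSON object per line. The sketch below shows the same idea with the standard library only; `read_jsonl` is a hypothetical helper and the two records are made up to mirror the four fields described above.

```python
import io
import json

def read_jsonl(handle):
    """Parse one JSON object per non-empty line, as in a JSON Lines file."""
    return [json.loads(line) for line in handle if line.strip()]

# Made-up records with the same four fields as train.jsonl.
sample = io.StringIO(
    '{"id": 1, "img": "img/00001.png", "label": 0, "text": "example caption"}\n'
    '{"id": 2, "img": "img/00002.png", "label": 1, "text": "another caption"}\n'
)
records = read_jsonl(sample)
print(len(records), sorted(records[0]))
```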


Test set. dev.jsonl is loaded as the held-out test set.

In [12]:
path_1 = GOOGLE_DRIVE_PATH + '/dev.jsonl'
test_data=pd.read_json(path_1,lines=True)
print(f"Data set size: {len(test_data)}")

Data set size: 500


Split the data into train and validation sets. The test set was already loaded from dev.jsonl.

In [15]:
from sklearn.model_selection import train_test_split

# Hold out about 5.88% of the 8500 rows (500 rows) as the validation set
train_data, val_data = train_test_split(data, test_size=0.0588, random_state=42)


print(f"Train set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")
print(f"Test set size: {len(test_data)}")
print(train_data.head())


Train set size: 8000
Validation set size: 500
Test set size: 500
         id            img  label  \
2858  27341  img/27341.png      1   
3290  96342  img/96342.png      0   
7100  85126  img/85126.png      0   
7789  93187  img/93187.png      1   
7470  84059  img/84059.png      1   

                                                   text  
2858  no one: white cops: he's black, so i'm assumin...  
3290  listen up sweetheart and let this sink in. i n...  
7100                  gillbet bink says your news sucks  
7789  when wwe has a ppv in world war ll germany inc...  
7470  trans women are women, trans women are women, ...  
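The fraction 0.0588 yields exactly 500 validation rows because scikit-learn rounds the test count up when `test_size` is a float. The arithmetic can be checked directly; the variable names below are only for illustration.

```python
import math

n_samples = 8500        # rows in train.jsonl, from the output above
test_fraction = 0.0588  # the test_size passed to train_test_split

# For a float test_size, scikit-learn takes the ceiling of the product
# and gives the remaining rows to the training set.
n_val = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_val
print(n_train, n_val)
```

Passing an integer such as `test_size=500` would request the same count without relying on rounding.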


Fine-tuning the CLIP model.
This cell imports the pretrained model and processor released by OpenAI.
The model holds the architecture and its pretrained weights.
The processor combines a tokenizer for the caption and an image processor for the image. Together they convert each caption and image into the numeric tensors the model expects.


In [None]:
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



This class wraps the dataframe as a PyTorch dataset. For each row it loads the image and caption and returns the processed image tensor, the text tokens, and the label. It uses the same tokenizer that CLIP uses.

In [None]:
from torch.utils.data import Dataset
from PIL import Image
import torch

class MemeDataset(Dataset):
    def __init__(self, dataframe, processor, image_root_dir, max_length = 18):
        self.df = dataframe.reset_index(drop=True)
        self.processor = processor
        #directory where all images are
        self.image_root = image_root_dir
        #to make sure all captions are of same length
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        #get the data at idx
        row = self.df.loc[idx]
        #extract path to image
        img_path = f"{self.image_root}/{row['img'].split('/')[-1]}"
        #load image in RGB
        image = Image.open(img_path).convert("RGB")
        #load text
        text = row['text']
        #load label
        label = torch.tensor(row['label'], dtype=torch.float)

        # Convert text and image to tokens
        inputs = self.processor(
            text=text,
            images=image,
            return_tensors="pt",
            padding='max_length',
            max_length=self.max_length,
            truncation=True)

        # Remove batch dimension (1) from processor outputs
        inputs = {k: v.squeeze(0) for k, v in inputs.items()}
        #add label
        inputs["labels"] = label
        return inputs
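The `padding='max_length'` and `truncation=True` arguments make every caption come out at the same length, so the examples can be stacked into a batch. The sketch below illustrates the idea on plain lists. It is a simplified stand-in, not the CLIP tokenizer: `pad_or_truncate` is a hypothetical helper, `pad_id=0` is a placeholder, and the real tokenizer also handles special tokens.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly `max_length` long."""
    ids = list(token_ids)[:max_length]     # truncation: drop the overflow
    mask = [1] * len(ids)                  # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding

ids, mask = pad_or_truncate([101, 7, 42], max_length=5)
print(ids, mask)
```

The attention mask is what lets the model ignore the padded positions.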


Apply the MemeDataset class to the train, validation, and test data, and wrap each dataset in a DataLoader.

In [None]:
from torch.utils.data import DataLoader
import torch

BATCH_SIZE = 32
train_dataset = MemeDataset(train_data, processor, image_root_dir = GOOGLE_DRIVE_PATH + "/img")
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_dataset = MemeDataset(val_data, processor, image_root_dir = GOOGLE_DRIVE_PATH + "/img")
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_dataset = MemeDataset(test_data, processor, image_root_dir = GOOGLE_DRIVE_PATH + "/img")
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)


A binary classification head on top of the CLIP model. It concatenates the CLIP embeddings of the image and the text and passes them through a linear layer to make a prediction.


In [None]:
import torch.nn as nn

class CLIPBinaryClassifier(nn.Module):
    def __init__(self, clip_model):
        super().__init__()
        #parent model - pretrained openAI CLIP model
        self.clip = clip_model
        #size of each projected image or text embedding
        dim = self.clip.config.projection_dim
        #Linear classification head with two outputs - Hateful or Not Hateful
        self.classifier = nn.Linear(dim * 2, 2)


    def forward(self, input_ids, attention_mask, pixel_values):
        #forward pass on tokenized inputs, outputs image and text embeddings
        outputs = self.clip(input_ids=input_ids, attention_mask=attention_mask, pixel_values=pixel_values)
        combined = torch.cat([outputs.image_embeds, outputs.text_embeds], dim=1)
        logits = self.classifier(combined)
        return logits
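The size of the classification head follows from the embedding size. As a worked example, assuming `projection_dim` is 512 for this checkpoint (an assumption worth confirming with `model.config.projection_dim`), the head is small:

```python
projection_dim = 512   # assumed value of clip_model.config.projection_dim
num_classes = 2        # Hateful or Not Hateful

in_features = projection_dim * 2        # image and text embeddings concatenated
n_weights = in_features * num_classes   # one weight per input per class
n_biases = num_classes                  # one bias per class
print(in_features, n_weights + n_biases)
```

Under that assumption the head has only a couple of thousand trainable parameters, which is tiny next to the CLIP backbone.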

Training Loop

In [None]:
import torch
import matplotlib.pyplot as plt

#move the classifier to cuda
classifier = CLIPBinaryClassifier(model).to(device)
# Freeze all parameters of the CLIP backbone
for param in classifier.clip.parameters():
    param.requires_grad = False
#optimizer over the classification head only; the learning rate is a hyperparameter
learning_rate = 0.0001
optimizer = torch.optim.AdamW(classifier.classifier.parameters(), lr=learning_rate)
#cross entropy loss
criterion = nn.CrossEntropyLoss()
epochs = 5

#Train and validation loss
train_losses = []
val_losses = []

# Training and validation
for epoch in range(epochs):
    classifier.train()
    total_train_loss = 0
    for batch in train_loader:
        # Move inputs and labels to the correct device
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        pixel_values = batch["pixel_values"].to(device)
        labels = batch["labels"].to(device).long()

        # Forward and backward pass
        optimizer.zero_grad()
        outputs = classifier(input_ids=input_ids, attention_mask=attention_mask, pixel_values=pixel_values)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_train_loss += loss.item()
    avg_train_loss = total_train_loss / len(train_loader)
    train_losses.append(avg_train_loss)
    print(f"Epoch {epoch + 1} Loss: {total_train_loss / len(train_loader):.4f}")

    # Validation
    classifier.eval()
    total_val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            pixel_values = batch["pixel_values"].to(device)
            labels = batch["labels"].to(device).long()

            outputs = classifier(input_ids=input_ids, attention_mask=attention_mask, pixel_values=pixel_values)
            loss = criterion(outputs, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(val_loader)
    val_losses.append(avg_val_loss)
    print(f"Epoch {epoch+1} - Train Loss: {avg_train_loss:.4f} | Val Loss: {avg_val_loss:.4f}")

plt.figure(figsize=(8, 5))
plt.plot(range(1, epochs+1), train_losses, label='Train Loss')
plt.plot(range(1, epochs+1), val_losses, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training & Validation Loss per Epoch')
plt.legend()
plt.grid(True)
plt.show()

Epoch 1 Loss: 0.6643
Epoch 1 - Train Loss: 0.6643 | Val Loss: 0.6534
Epoch 2 Loss: 0.6505
Epoch 2 - Train Loss: 0.6505 | Val Loss: 0.6528
Epoch 3 Loss: 0.6506
Epoch 3 - Train Loss: 0.6506 | Val Loss: 0.6527
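To put these loss values in context, it can help to compare them with the loss of a model that ignores its input. The sketch below computes the expected two-class cross-entropy (natural log, as in `nn.CrossEntropyLoss`) of a constant prediction; the function name is hypothetical.

```python
import math

def cross_entropy_of_constant(p_positive, positive_rate):
    """Expected cross-entropy when every example gets probability `p_positive` for class 1."""
    return -(positive_rate * math.log(p_positive)
             + (1 - positive_rate) * math.log(1 - p_positive))

# A 50/50 guess costs ln(2) whatever the class balance is.
uniform_loss = cross_entropy_of_constant(0.5, positive_rate=0.5)
print(round(uniform_loss, 4))
```

A loss that stays close to ln(2), about 0.693, may indicate that the head has learned little beyond the class balance, so the curves are worth reading with that baseline in mind.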


Accuracy on the test set. Run the training cell first so that `classifier` is defined.

In [None]:
from sklearn.metrics import accuracy_score

classifier.eval()
preds, trues = [], []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        pixel_values = batch["pixel_values"].to(device)
        labels = batch["labels"].to(device)

        outputs = classifier(input_ids=input_ids, attention_mask=attention_mask, pixel_values=pixel_values)
        # outputs holds two logits per example; the predicted class is the index of the larger one
        preds.extend(outputs.argmax(dim=1).cpu().numpy())
        trues.extend(labels.int().cpu().numpy())

print("Accuracy:", accuracy_score(trues, preds))


NameError: name 'classifier' is not defined
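With two logits per example, the predicted class is the index of the larger logit, and accuracy is the share of predictions that match the labels. The sketch below shows both steps on plain lists with made-up logits; `predict_classes` and `accuracy` are hypothetical helpers, not part of the notebook.

```python
def predict_classes(logit_rows):
    """Pick the index of the largest logit in each row (ties go to the lower index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logit_rows]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up logits for three examples, ordered as [not hateful, hateful].
logits = [[1.2, -0.3], [0.1, 0.9], [-2.0, -1.5]]
preds = predict_classes(logits)
print(preds, accuracy([0, 1, 0], preds))
```

The third row shows why a fixed threshold on raw logits is unreliable: both values are negative, yet the second class still has the larger score.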