Connect to Google Drive. The Drive contains the data folder downloaded from Kaggle.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Get the data from Google Drive. Update `GOOGLE_DRIVE_PATH_POST_MYDRIVE` to match where the data folder lives in your Drive.

In [5]:
import os

GOOGLE_DRIVE_PATH_POST_MYDRIVE = 'data'
GOOGLE_DRIVE_PATH = os.path.join('/content', 'drive', 'MyDrive', GOOGLE_DRIVE_PATH_POST_MYDRIVE)
print(os.listdir(GOOGLE_DRIVE_PATH))

['dev.jsonl', 'LICENSE.txt', 'README.md', 'train.jsonl', 'test.jsonl', 'img']


Clone the GitHub repository.

In [3]:
# Do not embed a personal access token in the clone URL; for a private repository, load credentials from a secret store instead.
!git clone https://github.com/ssubedi09/Deep-Learning-Hateful-Memes.git

Cloning into 'Deep-Learning-Hateful-Memes'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 28 (delta 6), reused 13 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (28/28), 380.21 KiB | 1.52 MiB/s, done.
Resolving deltas: 100% (6/6), done.


Add the GitHub repo to the Python path so its modules can be imported.

In [4]:
import sys
sys.path.append('/content/Deep-Learning-Hateful-Memes')

Configure Git if it is not configured yet.

In [5]:
!git config --global user.email "sandipsubedi0926@gmail.com"
!git config --global user.name "ssubedi09"

Push the changes back to the repository. Change the commit message before pushing anything.

In [None]:
# Run git from inside the cloned repository
%cd /content/Deep-Learning-Hateful-Memes
#!touch models/.gitkeep
!git add .
!git commit -m "Add new folder"
!git push origin main

Import everything here

In [6]:
# Just run this block. Please do not modify the following code.
import pandas as pd
import torch

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

Now let's check GPU availability and load some sanity checkers. By default you should use your GPU for this assignment if one is available.

In [7]:
# Check device availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("You are using device: %s" % device)

You are using device: cuda


Load the data here. The train.jsonl file contains the id, image location, label, and caption for each meme.

In [8]:
path = os.path.join(GOOGLE_DRIVE_PATH, 'train.jsonl')
# lines=True reads one JSON object per line (JSON Lines format)
data = pd.read_json(path, lines=True)
print(f"Data set size: {len(data)}")

Data set size: 8500
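
In a JSON Lines file every line is a standalone JSON object, which is why `lines=True` is needed when reading it. As a minimal standard-library sketch of the same parsing, the snippet below reads an in-memory stand-in for the file. The first record mirrors a row from this dataset; the second record and the helper name `read_jsonl` are hypothetical and only there for illustration.

```python
import io
import json

# In-memory stand-in for train.jsonl; the second record is hypothetical.
sample = io.StringIO(
    '{"id": 34076, "img": "img/34076.png", "label": 0, "text": "i\'m taking my clock to school"}\n'
    '{"id": 1, "img": "img/00001.png", "label": 1, "text": "example caption"}\n'
)


def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_jsonl(sample)
print(len(records), sorted(records[0]))
```

Each parsed record is a plain dict with the same four fields that become the DataFrame columns.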


Split the data into train, validation, and test sets.

In [9]:
from sklearn.model_selection import train_test_split

# First split off test set (20%)
train_val_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Then split train_val into train and val (25% of train_val = 20% of total)
train_data, val_data = train_test_split(train_val_data, test_size=0.25, random_state=42)

print(f"Train set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")
print(f"Test set size: {len(test_data)}")
print(train_data.head())


Train set size: 5100
Validation set size: 1700
Test set size: 1700
         id            img  label  \
6895  79328  img/79328.png      0   
8226  34076  img/34076.png      0   
6662  94605  img/94605.png      0   
3939   3472  img/03472.png      1   
3076  20375  img/20375.png      1   

                                                   text  
6895  sorry, trump... i just prefer presidents who w...  
8226                      i'm taking my clock to school  
6662  he was a world savior only if stupid people wo...  
3939  pet niguana for sale!! come with food dish, a ...  
3076  islam is a religion of peace lets celebrate by...  
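
The two-stage split works because 25% of the remaining 80% equals 20% of the total, which gives a 60/20/20 split. The arithmetic can be checked with a small sketch using exact fractions. The helper name `two_stage_split_sizes` is hypothetical, and it assumes the sizes divide evenly (as they do for 8500 rows); scikit-learn's own rounding for uneven sizes is not reproduced here.

```python
from fractions import Fraction


def two_stage_split_sizes(n_total, test_frac, val_frac_of_rest):
    """Return (train, val, test) sizes for a split done in two stages."""
    n_test = int(n_total * test_frac)        # first stage: hold out the test set
    n_rest = n_total - n_test
    n_val = int(n_rest * val_frac_of_rest)   # second stage: validation from the rest
    n_train = n_rest - n_val
    return n_train, n_val, n_test


train_n, val_n, test_n = two_stage_split_sizes(8500, Fraction(1, 5), Fraction(1, 4))
# Validation share of the whole dataset: 1/4 of the remaining 4/5
val_share = Fraction(1, 4) * (1 - Fraction(1, 5))
print(train_n, val_n, test_n, val_share)
```

The sizes match the 5100 / 1700 / 1700 reported by the cell above.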


Fine-tuning the CLIP model.
This part of the code loads the pretrained model and processor released by OpenAI.
`model` contains the architecture and its pretrained weights.
`processor` bundles a tokenizer for the text and an image processor for the image. Together they convert the caption and image into numeric tensors that the model can consume.


In [10]:
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



This class wraps the dataset for CLIP. For each row it loads the image and caption and returns the tokenized text and the preprocessed image, using the same processor that CLIP uses.

In [18]:
from torch.utils.data import Dataset
from PIL import Image
import torch

class MemeDataset(Dataset):
    def __init__(self, dataframe, processor, image_root_dir):
        self.df = dataframe.reset_index(drop=True)
        self.processor = processor
        #directory where all images are
        self.image_root = image_root_dir

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
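        # get the row at position idx (the index was reset in __init__)
        row = self.df.iloc[idx]
        # build the image path from the file name in the 'img' column
        img_path = f"{self.image_root}/{row['img'].split('/')[-1]}"
        # load the image in RGB
        image = Image.open(img_path).convert("RGB")
        # load the caption text
        text = row['text']
        # load the label as a float, as expected by a binary cross-entropy loss
        label = torch.tensor(row['label'], dtype=torch.float)

        # Convert text and image to model inputs. Pad every caption to CLIP's
        # maximum length (77 tokens) so the default DataLoader collation can
        # stack samples with different caption lengths into one batch.
        inputs = self.processor(
            text=text,
            images=image,
            return_tensors="pt",
            padding="max_length",
            max_length=77,
            truncation=True)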

        # Remove batch dimension (1) from processor outputs
        inputs = {k: v.squeeze(0) for k, v in inputs.items()}
        #add label
        inputs["labels"] = label

        return inputs
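
The default DataLoader collation stacks the per-sample tensors, so every sample's `input_ids` and `attention_mask` must have the same length. The plain-Python sketch below shows the idea of padding token ids to a fixed length together with the matching attention mask. The helper name `pad_to_fixed_length` is hypothetical, and the length of 77 and pad id of 49407 are assumptions about the CLIP tokenizer. The naive truncation here simply cuts the list and, unlike a real tokenizer, does not keep the end-of-text token.

```python
MAX_LENGTH = 77   # assumed CLIP text context length
PAD_ID = 49407    # assumed pad token id (same id as the end-of-text token)


def pad_to_fixed_length(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Truncate or right-pad token ids and build the attention mask."""
    ids = list(token_ids)[:max_length]   # naive truncation
    mask = [1] * len(ids)                # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding


short_ids, short_mask = pad_to_fixed_length([49406, 1594, 1043, 49407])
long_ids, long_mask = pad_to_fixed_length([49406] + [256] * 100 + [49407])
print(len(short_ids), sum(short_mask), len(long_ids), sum(long_mask))
```

With every sample at the same length, a batch of eight captions can be stacked into a single `[8, 77]` tensor.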


Apply the MemeDataset class to the training data and create batches for training.

In [24]:
from torch.utils.data import DataLoader
import torch

train_dataset = MemeDataset(train_data, processor, image_root_dir = GOOGLE_DRIVE_PATH + "/img")
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)



{'input_ids': tensor([[49406,  1594,  1043,  1946,   267,   257,   861,   593,   695,  1010,
          3010,   634,  2947,  3010,   634,   256,   592,   631,   607,  1010,
          3010,   634,   256, 49407]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'pixel_values': tensor([[[[-9.4555e-01, -8.1417e-01, -1.0331e+00,  ...,  1.7942e-02,
            1.7853e-01,  2.2232e-01],
          ...

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/data/img/10259.png'

Binary classification head on top of the CLIP model. It concatenates the image and text embeddings and maps them to a single probability.

---



In [None]:
import torch.nn as nn

class CLIPBinaryClassifier(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.clip = base_model
        self.classifier = nn.Linear(self.clip.config.projection_dim * 2, 1)  # Image + text features
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_ids, attention_mask, pixel_values):
        outputs = self.clip(input_ids=input_ids, attention_mask=attention_mask, pixel_values=pixel_values)
        image_embeds = outputs.image_embeds  # [batch_size, dim]
        text_embeds = outputs.text_embeds

        # Concatenate both modalities
        combined = torch.cat([image_embeds, text_embeds], dim=1)  # [batch_size, 2*dim]
        logits = self.classifier(combined)
        probs = self.sigmoid(logits).squeeze(1)
        return probs
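
For a single example, the head boils down to three steps: concatenate the two embeddings, take a weighted sum plus a bias, and squash the result with a sigmoid. The plain-Python sketch below illustrates this with toy 2-dimensional embeddings. The function name `classify_one` and all numbers are made up for illustration; the real embeddings have `projection_dim` entries each.

```python
import math


def classify_one(image_embed, text_embed, weights, bias):
    """Concatenate embeddings, apply one linear unit, then a sigmoid."""
    combined = list(image_embed) + list(text_embed)   # [2 * dim]
    if len(combined) != len(weights):
        raise ValueError("weights must match the concatenated embedding size")
    logit = sum(w * x for w, x in zip(weights, combined)) + bias
    return 1.0 / (1.0 + math.exp(-logit))             # probability in (0, 1)


# Toy values: 2-dim image and text embeddings, so 4 weights in total.
neutral = classify_one([0.0, 0.0], [0.0, 0.0], [1.0, 1.0, 1.0, 1.0], 0.0)
confident = classify_one([1.0, 0.0], [0.0, 1.0], [2.0, 0.0, 0.0, 2.0], 0.0)
print(neutral, confident)
```

A zero logit maps to a probability of exactly 0.5, which is why 0.5 is the usual decision threshold.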

Training Loop

In [None]:
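import torch.nn as nn
from tqdm import tqdm

# Wrap the pretrained CLIP model with the classification head (only once,
# so re-running this cell does not wrap the wrapper).
if not isinstance(model, CLIPBinaryClassifier):
    model = CLIPBinaryClassifier(model)
model = model.to(device)

# The head outputs probabilities, so use binary cross-entropy on probabilities.
criterion = nn.BCELoss()
# Optimizer and learning rate are assumptions; tune them for your run.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(5):
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        pixel_values = batch["pixel_values"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, pixel_values=pixel_values)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # Average loss per batch for this epoch
    print(f"Epoch {epoch+1} Loss: {total_loss / len(train_loader):.4f}")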
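
Since the classification head returns sigmoid probabilities, binary cross-entropy on probabilities is a natural fit for `criterion`. The plain-Python sketch below computes that loss averaged over a batch. It is an illustration of the formula rather than the notebook's code, and the clamping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


uninformed = binary_cross_entropy([0.5, 0.5], [1, 0])   # equals ln(2)
good = binary_cross_entropy([0.9, 0.1], [1, 0])
bad = binary_cross_entropy([0.1, 0.9], [1, 0])
print(uninformed, good, bad)
```

A model that always predicts 0.5 scores ln(2), about 0.693, so an epoch loss well below that suggests the model is learning something.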


Accuracy

In [None]:
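from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader

# Build the validation loader (same preprocessing as training, no shuffling)
val_dataset = MemeDataset(val_data, processor, image_root_dir=GOOGLE_DRIVE_PATH + "/img")
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)

model.eval()
preds, trues = [], []

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        pixel_values = batch["pixel_values"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, pixel_values=pixel_values)
        # Threshold the predicted probabilities at 0.5 to get class labels
        preds.extend((outputs > 0.5).int().cpu().tolist())
        trues.extend(labels.int().cpu().tolist())

print("Accuracy:", accuracy_score(trues, preds))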
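
The accuracy computation amounts to thresholding each probability at 0.5 and counting how often the resulting label matches the true one. A plain-Python sketch of that logic follows. The function name `accuracy_at_threshold` and the sample numbers are made up for illustration.

```python
def accuracy_at_threshold(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [int(p > threshold) for p in probs]   # strict >, so exactly 0.5 predicts class 0
    correct = sum(int(pred == int(y)) for pred, y in zip(preds, labels))
    return correct / len(labels)


# Predictions become [1, 0, 1, 0]; two of the four match the labels.
acc = accuracy_at_threshold([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
print(acc)
```

Accuracy alone can be misleading when the classes are imbalanced, so it is worth checking the label distribution of the validation set alongside this number.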
