# Deep Learning Assignment 2 - Part A: Image Captioning

This notebook covers Part A of the assignment, focusing on implementing and benchmarking an image captioning model.

## A1. Setup & Imports


In [None]:
!pip install evaluate
!pip install rouge_score

In [2]:
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer, GPT2TokenizerFast, AutoModel, BertTokenizer, BertModel
from transformers import AutoProcessor, AutoModelForCausalLM, AutoModelForVision2Seq
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from transformers import default_data_collator
from transformers import Idefics3ForConditionalGeneration, Idefics3Processor
import evaluate
from PIL import Image
import numpy as np
import pandas as pd
from tqdm import tqdm
import os
import copy

## A2. Dataset Preparation

Downloading the dataset from Google Drive, extracting it, and defining paths for the train, validation, and test sets. Also logs in to the Hugging Face Hub (the access token is entered at the `login()` prompt).

In [3]:
!pip install -q gdown

# dataset stored in gdrive zip file
file_id = "1-4zt018qT1M85m1X0v95C9-a6_6YelIQ"
!gdown --id {file_id} --output custom_captions_dataset.zip

Downloading...
From (original): https://drive.google.com/uc?id=1-4zt018qT1M85m1X0v95C9-a6_6YelIQ
From (redirected): https://drive.google.com/uc?id=1-4zt018qT1M85m1X0v95C9-a6_6YelIQ&confirm=t&uuid=2105babe-d0d5-4df9-9b30-82d455200fb6
To: /content/custom_captions_dataset.zip
100% 288M/288M [00:02<00:00, 127MB/s]


In [None]:
from huggingface_hub import login
login()

In [4]:
import zipfile

with zipfile.ZipFile("custom_captions_dataset.zip", 'r') as zip_ref:
    zip_ref.extractall("custom_captions_dataset")

In [5]:
DATASET_PATH = "custom_captions_dataset/custom_captions_dataset"
TEST_IMAGES_PATH = os.path.join(DATASET_PATH, "test")
TRAIN_IMAGES_PATH = os.path.join(DATASET_PATH, "train")
VAL_IMAGES_PATH = os.path.join(DATASET_PATH, "val")

## A3. SmolVLM Zero-Shot Evaluation

Here, we evaluate the performance of a pre-trained SmolVLM in a zero-shot setting on our test dataset.

### A3.1 SmolVLM Model Loading

Loading the SmolVLM model (`HuggingFaceTB/SmolVLM-Instruct`) and its processor.

In [None]:
# Load SmolVLM for zero-shot testing
model_name = "HuggingFaceTB/SmolVLM-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
      model_name,
      trust_remote_code=True,
      attn_implementation="eager"
)
model.to(device)
model.eval()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Idefics3ForConditionalGeneration(
  (model): Idefics3Model(
    (vision_model): Idefics3VisionTransformer(
      (embeddings): Idefics3VisionEmbeddings(
        (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
        (position_embedding): Embedding(729, 1152)
      )
      (encoder): Idefics3Encoder(
        (layers): ModuleList(
          (0-26): 27 x Idefics3EncoderLayer(
            (self_attn): Idefics3VisionAttention(
              (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
            )
            (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (mlp): Idefics3VisionMLP(
              (activation_fn): PytorchGELUTanh()
              (fc1): Linear(in_

### A3.2 Zero-Shot Caption Generation Function

Defining the `zero_shot_captioning` function as required by the assignment. This function takes an image path and generates a caption using the loaded SmolVLM.

In [None]:
# Updated zero-shot captioning
from typing import Optional

def zero_shot_captioning(image_path: str, processor, model, device) -> Optional[str]:
    try:
        image = Image.open(image_path).convert("RGB")
    except FileNotFoundError:
        # Return None so the caller's `is None` check can skip this image
        return None

    prompt = "<image>\nDescribe the image:"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

    max_tokens = 150

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=max_tokens
        )
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
        caption = generated_text.split("Describe the image:")[-1].strip()
        return caption

# to test zero shot caption function
# test_image_sample_path = os.path.join(TEST_IMAGES_PATH, "test_1.jpg")
# smolvlm_caption = zero_shot_captioning(test_image_sample_path, processor, model, device)
# print(f"SmolVLM Zero-Shot Caption: {smolvlm_caption}")
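The last step of `zero_shot_captioning` relies on the decoded text echoing the prompt before the generated caption. A minimal sketch of that post-processing in isolation, assuming the decoded string has that shape (the helper name `extract_caption` and the sample strings are illustrative, not part of the notebook):

```python
PROMPT_TAIL = "Describe the image:"

def extract_caption(generated_text: str, prompt_tail: str = PROMPT_TAIL) -> str:
    """Keep only the text after the last occurrence of the prompt."""
    return generated_text.split(prompt_tail)[-1].strip()

# Hypothetical decoded outputs, for illustration only
echoed = "\nDescribe the image: A brown dog runs across a field. "
not_echoed = "A brown dog runs across a field."

print(extract_caption(echoed))
print(extract_caption(not_echoed))
```

If the prompt is not echoed, `split` returns the whole string, so the caption passes through unchanged. A caption that itself contains the prompt text would be cut short, which is one limitation of this approach.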

### A3.3 SmolVLM Evaluation on Test Set

Loading evaluation metrics (BLEU, ROUGE, METEOR). Iterating through the test set, generating captions using `zero_shot_captioning`, and computing the scores against the reference captions.

In [None]:
# Load evaluation metrics
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")
meteor_metric = evaluate.load("meteor")

# Load csv file
csv_path = os.path.join(DATASET_PATH, "test.csv")
df = pd.read_csv(csv_path)
df["image_path"] = df["filename"].apply(lambda fname: os.path.join(TEST_IMAGES_PATH, fname))

# Initialize result lists
predictions = []
references = []

# Start evaluating
print("Starting evaluation using zero-shot captioning...")

for index, row in tqdm(df.iterrows(), total=len(df), desc="Evaluating SmolVLM"):
    image_path = row["image_path"]
    ref_caption = row["caption"]

    pred_caption = zero_shot_captioning(image_path, processor, model, device)

    # just in case
    if pred_caption is None:
        print(f"Warning: Image not found at {image_path}. Skipping.")
        continue

    predictions.append(pred_caption)
    references.append([ref_caption])


print("\nFinished processing images. Calculating evaluation metrics...")

# 3 metrics
bleu_score = bleu_metric.compute(predictions=predictions, references=references)
rouge_score = rouge_metric.compute(predictions=predictions, references=references)
meteor_score = meteor_metric.compute(predictions=predictions, references=[ref[0] for ref in references])

# printing final results
print("\n📊 SmolVLM Zero-Shot Evaluation Results:")
print(f"🔹 BLEU: {bleu_score}")
print(f"🔹 ROUGE: {rouge_score}")
print(f"🔹 METEOR: {meteor_score}")

print("\n✅ Evaluation complete!")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Starting evaluation using zero-shot captioning...


Evaluating SmolVLM: 100%|██████████| 928/928 [1:32:39<00:00,  5.99s/it]



Finished processing images. Calculating evaluation metrics...

📊 SmolVLM Zero-Shot Evaluation Results:
🔹 BLEU: {'bleu': 0.05751116213026179, 'precisions': [0.2518443757703507, 0.08519012194462958, 0.0340490073202187, 0.014975544523621974], 'brevity_penalty': 1.0, 'length_ratio': 1.7787557193695984, 'translation_length': 111962, 'reference_length': 62944}
🔹 ROUGE: {'rouge1': np.float64(0.3170341037878326), 'rouge2': np.float64(0.10634791729741047), 'rougeL': np.float64(0.23192061369592903), 'rougeLsum': np.float64(0.2282949082709533)}
🔹 METEOR: {'meteor': np.float64(0.2687414556395594)}

✅ Evaluation complete!
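The reported BLEU value can be cross-checked from the other fields in the output above. The sketch below assumes the standard BLEU definition (brevity penalty times the geometric mean of the four n-gram precisions) and uses only the numbers printed by the evaluation; the function name `bleu_from_parts` is illustrative.

```python
import math

def bleu_from_parts(precisions, translation_length, reference_length):
    """BLEU = brevity_penalty * geometric mean of the n-gram precisions."""
    if translation_length > reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - reference_length / translation_length)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty * geo_mean

# Values copied from the SmolVLM BLEU output above
precisions = [0.2518443757703507, 0.08519012194462958,
              0.0340490073202187, 0.014975544523621974]
bleu = bleu_from_parts(precisions, translation_length=111962, reference_length=62944)
length_ratio = 111962 / 62944
print(f"BLEU ~ {bleu:.4f}, length ratio ~ {length_ratio:.2f}")
```

A length ratio of about 1.78 means the zero-shot captions are much longer than the references, so the brevity penalty is 1.0 and the low score comes entirely from the n-gram precisions.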


## A4. Custom Encoder-Decoder Model (ViT + GPT-2)

Defining, training, and evaluating a custom transformer-based encoder-decoder model for image captioning.

### A4.1 Model Definition (`ImageCaptionModel`)

This model uses a pre-trained Vision Transformer (ViT - `WinKawaks/vit-small-patch16-224`) as the encoder and a pre-trained GPT-2 model as the decoder.

In [6]:
class ImageCaptionModel(nn.Module):
    """
    Custom encoder-decoder model for image captioning, using ViT as the encoder and GPT-2 as the decoder.
    """
    def __init__(self, encoder_model_name="WinKawaks/vit-small-patch16-224", decoder_model_name="gpt2", dropout_rate=0.1):
        super(ImageCaptionModel, self).__init__()
        # Using VisionEncoderDecoderModel simplifies connecting ViT with GPT-2
        self.model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
            encoder_model_name, decoder_model_name
        )

        self.tokenizer = AutoTokenizer.from_pretrained(decoder_model_name)
        if self.tokenizer.pad_token is None:
             self.tokenizer.pad_token = self.tokenizer.eos_token
             self.model.config.decoder.pad_token_id = self.model.config.decoder.eos_token_id

        self.model.config.decoder_start_token_id = self.tokenizer.bos_token_id
        self.model.config.pad_token_id = self.tokenizer.pad_token_id
        self.dropout = nn.Dropout(dropout_rate)
        self.image_processor = ViTImageProcessor.from_pretrained(encoder_model_name)


    def forward(self, pixel_values, labels=None):
        """
        Forward pass of the model.
        Returns the model output, which includes the logits and, if labels are provided, the loss.
        """

        outputs = self.model(pixel_values=pixel_values, labels=labels)
        return outputs

    def generate(self, pixel_values, gen_kwargs=None):
        """
        Helper function for generating captions during inference.
        Returns generated token IDs.
        """
        if gen_kwargs is None:
            gen_kwargs = {"max_length": 150, "num_beams": 4}

        if self.model.config.decoder_start_token_id is None and hasattr(self.tokenizer, 'bos_token_id'):
             gen_kwargs['decoder_start_token_id'] = self.tokenizer.bos_token_id
        elif 'decoder_start_token_id' not in gen_kwargs:
             gen_kwargs['decoder_start_token_id'] = self.model.config.decoder_start_token_id

        return self.model.generate(pixel_values=pixel_values, **gen_kwargs)


### A4.2 Training Setup

Defining the `train_model` function. This function handles the training loop, including forward pass, loss calculation, backpropagation, optimizer steps, and validation. It also implements early stopping based on validation loss.

In [None]:
def train_model(model: nn.Module, train_dataloader: DataLoader, val_dataloader: DataLoader, optimizer, device: str, epochs: int):
    """
    Train the model, with early stopping on validation loss.

    Args:
        model (nn.Module): Custom image captioning model (ImageCaptionModel instance).
        train_dataloader (DataLoader): Training data loader.
        val_dataloader (DataLoader): Validation data loader.
        optimizer: Optimizer.
        device (str): Device to use (cuda or cpu).
        epochs (int): Maximum number of epochs.

    No criterion is passed: the Hugging Face model computes the loss internally when labels are provided.
    """
    model.to(device)

    best_val_loss = np.inf
    epochs_no_improve = 0
    best_model_state_dict = None

    early_stopping_patience = 3

    print(f"Starting training on {device} for max {epochs} epochs (Patience: {early_stopping_patience})...")

    for epoch in range(epochs):
        # training
        model.train()
        total_train_loss = 0
        print(f"\n--- Epoch {epoch+1}/{epochs} ---")
        print("Training...")
        for batch_idx, batch in enumerate(train_dataloader):
            pixel_values = batch['pixel_values'].to(device)
            labels = batch['labels'].to(device)

            optimizer.zero_grad()
            outputs = model(pixel_values=pixel_values, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()

            total_train_loss += loss.item()

            if batch_idx % 100 == 0:
                print(f"  Batch {batch_idx}/{len(train_dataloader)}, Current Batch Train Loss: {loss.item():.4f}")

        avg_train_loss = total_train_loss / len(train_dataloader)
        print(f"--- End of Epoch {epoch+1} Training, Average Training Loss: {avg_train_loss:.4f} ---")

        # validating
        model.eval()
        total_val_loss = 0
        print("Validating...")
        # disable grad
        with torch.no_grad():
            for batch_idx, batch in enumerate(val_dataloader):
                pixel_values = batch['pixel_values'].to(device)
                labels = batch['labels'].to(device)

                outputs = model(pixel_values=pixel_values, labels=labels)
                loss = outputs.loss
                total_val_loss += loss.item()

        avg_val_loss = total_val_loss / len(val_dataloader)
        print(f"--- Validation Loss Epoch {epoch+1}: {avg_val_loss:.4f} ---")

        # Early stopping: stop if validation loss has not improved for `early_stopping_patience` (3) consecutive epochs
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            epochs_no_improve = 0
            best_model_state_dict = copy.deepcopy(model.state_dict())
            print(f"Validation loss improved to {best_val_loss:.4f}. Saving model state.")
        else:
            epochs_no_improve += 1
            print(f"Validation loss did not improve for {epochs_no_improve} epoch(s). Best was {best_val_loss:.4f}.")

        if epochs_no_improve >= early_stopping_patience:
            print(f"\nEarly stopping triggered after {epoch+1} epochs.")
            break

    print("\nTraining finished.")

    # restore the best model state into the model
    if best_model_state_dict:
        print(f"Loading model state from epoch with best validation loss: {best_val_loss:.4f}")
        model.load_state_dict(best_model_state_dict)
    else:
        print("Warning: No best model state was saved (validation may not have run or improved). Using final model state.")
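The early-stopping rule used in `train_model` can be looked at in isolation. The following is a minimal pure-Python sketch of the same patience logic applied to a list of validation losses; the function name `early_stopping_trace` and the loss values are made up for illustration and are not results from this notebook.

```python
def early_stopping_trace(val_losses, patience=3):
    """Return (best_epoch, best_loss, stop_epoch) using 1-based epoch numbers.

    stop_epoch is None if the patience limit is never reached.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_no_improve = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
        if epochs_no_improve >= patience:
            return best_epoch, best_loss, epoch
    return best_epoch, best_loss, None

# Made-up validation losses, for illustration only
losses = [2.22, 2.10, 2.12, 2.15, 2.11, 2.05]
print(early_stopping_trace(losses, patience=3))
```

With these values training would stop after epoch 5, never reaching the lower loss at epoch 6, and the state from epoch 2 would be restored. This shows the trade-off a small patience value makes.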

### A4.3 Evaluation Function

Defining the `evaluate_model` function. This function takes the trained model, evaluates it on a given dataloader (test set), generates captions, and calculates BLEU, ROUGE-L, and METEOR scores.

In [None]:
def evaluate_model(model: nn.Module, dataloader: DataLoader, device: str, tokenizer) -> dict:
    """
    Evaluate model with BLEU, ROUGE-L & METEOR scores

    Args:
        model (nn.Module): Trained model (ImageCaptionModel instance).
        dataloader (DataLoader): Test data loader.
        device (str): 'cuda' or 'cpu'.
        tokenizer: The tokenizer used for the decoder, needed for decoding generated IDs.

    Returns:
        dict: BLEU, ROUGE-L, and METEOR scores for the test set.
    """
    model.eval()
    model.to(device)

    bleu_metric = evaluate.load("bleu")
    rouge_metric = evaluate.load("rouge")
    meteor_metric = evaluate.load("meteor")

    predictions = []
    references = []

    print(f"Starting evaluation on {device}...")
    # disable grad
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            pixel_values = batch['pixel_values'].to(device)
            ref_captions_batch = batch['captions']

            gen_kwargs_eval = {
                "max_length": 150,
                "num_beams": 4,
                "pad_token_id": tokenizer.pad_token_id if tokenizer.pad_token_id is not None else model.model.config.pad_token_id,
                "decoder_start_token_id": model.model.config.decoder_start_token_id
            }

            generated_ids = model.generate(
                pixel_values=pixel_values,  # already moved to device above
                gen_kwargs=gen_kwargs_eval
            )

            # decode generated_ids and clean preds
            pred_captions_batch = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            cleaned_predictions = [pred.strip() for pred in pred_captions_batch]

            formatted_references = [[ref] for ref in ref_captions_batch]
            predictions.extend(cleaned_predictions)
            references.extend(formatted_references)

            if batch_idx % 50 == 0:
                 print(f"  Evaluated Batch {batch_idx}/{len(dataloader)}")

    print("Calculating final scores...")
    try:
        if not predictions or not references:
             print("Warning: No predictions or references generated. Cannot compute metrics.")
             return {"BLEU": 0, "ROUGE-L": 0, "METEOR": 0, "Error": "No predictions/references"}

        bleu_score = bleu_metric.compute(predictions=predictions, references=references)
        rouge_score = rouge_metric.compute(predictions=predictions, references=references)
        meteor_score = meteor_metric.compute(predictions=predictions, references=[ref[0] for ref in references]) 

        results = {
            "BLEU": bleu_score['bleu'],
            "ROUGE-L": rouge_score['rougeL'],
            "METEOR": meteor_score['meteor']
        }
        print(f"Evaluation Results: {results}")
        return results

    except Exception as e:
        print(f"Error calculating metrics: {e}")
        print(f"Num predictions: {len(predictions)}, Num references: {len(references)}")
        return {"BLEU": 0, "ROUGE-L": 0, "METEOR": 0, "Error": str(e)}

### A4.4 Dataset and DataLoader Preparation

Loading the train, validation, and test CSV files. Defining the `ImageCaptionDataset` class to process images and tokenize captions. Creating PyTorch DataLoaders for training, validation, and testing.

In [12]:
# Load CSVs
train_df = pd.read_csv(os.path.join(DATASET_PATH, "train.csv"))
val_df = pd.read_csv(os.path.join(DATASET_PATH, "val.csv"))
test_df = pd.read_csv(os.path.join(DATASET_PATH, "test.csv"))

# add full image paths for each set
train_df["image_path"] = train_df["filename"].apply(lambda x: os.path.join(TRAIN_IMAGES_PATH, x))
val_df["image_path"] = val_df["filename"].apply(lambda x: os.path.join(VAL_IMAGES_PATH, x))
test_df["image_path"] = test_df["filename"].apply(lambda x: os.path.join(TEST_IMAGES_PATH, x))

In [10]:
class ImageCaptionDataset(Dataset):
    def __init__(self, dataframe, image_dir, image_processor, tokenizer, max_length=50):
        self.dataframe = dataframe
        self.image_dir = image_dir
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        img_name = self.dataframe.iloc[idx]['filename']
        img_path = os.path.join(self.image_dir, img_name)
        image = Image.open(img_path).convert("RGB")
        caption = self.dataframe.iloc[idx]['caption']

        # process image with ViT processor
        pixel_values = self.image_processor(images=image, return_tensors="pt").pixel_values.squeeze(0)

        # tokenize caption
        labels = self.tokenizer(
            caption,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        ).input_ids.squeeze(0)

        return {
            "pixel_values": pixel_values,
            "labels": labels,
            "captions": caption
        }
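The dataset above pads every caption to `max_length`, and because GPT-2's pad token is set to its EOS token, the padded positions remain ordinary token IDs in `labels`. A common variant, which is not what this notebook does, replaces padded positions with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the loss. A minimal pure-Python sketch of that masking, assuming the tokenizer's `attention_mask` is available (the helper name `mask_padding` is hypothetical; 50256 is GPT-2's EOS token ID and the other IDs are arbitrary):

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Replace token IDs at padded positions (mask == 0) with ignore_index."""
    return [tok if keep == 1 else ignore_index
            for tok, keep in zip(input_ids, attention_mask)]

# Arbitrary caption token IDs padded to length 6 with GPT-2's EOS ID (50256)
input_ids = [64, 3290, 4539, 50256, 50256, 50256]
attention_mask = [1, 1, 1, 0, 0, 0]
print(mask_padding(input_ids, attention_mask))
```

With tensors, the same effect is usually obtained by indexing with the attention mask before returning the item. Whether this would change the scores reported in this notebook has not been tested here.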

### A4.5 Model Initialization and Training

Initializing the `ImageCaptionModel` and the AdamW optimizer, then calling the `train_model` function to train the model. The best model, selected by validation loss, is kept and saved.

In [14]:
# Initialize model
caption_model = ImageCaptionModel(
    encoder_model_name="WinKawaks/vit-small-patch16-224",
    decoder_model_name="gpt2"
)

# using correct processor & tokenizer
image_processor = caption_model.image_processor
tokenizer = caption_model.tokenizer
optimizer = AdamW(caption_model.parameters(), lr=5e-5)

# get dataset by calling ImageCaptionDataset
train_dataset = ImageCaptionDataset(train_df, TRAIN_IMAGES_PATH, image_processor, tokenizer)
val_dataset = ImageCaptionDataset(val_df, VAL_IMAGES_PATH, image_processor, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)

Some weights of ViTModel were not initialized from the model checkpoint at WinKawaks/vit-small-patch16-224 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['transformer.h.0.crossattention.c_attn.bias', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.0.crossattention.q_attn.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.bias', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_attn.bias', 'transformer.h.1.crossattention.c_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.bias', 'transformer.h.1.crossattention.q_attn.weight', 'transformer.h.1.ln_cross_attn.bias', 'transformer.h.1.ln_cross_attn.weight', 'transformer.h.10.crossattention.c_attn.bias', 'transformer.h.10.crossattention.c_attn.weight', 'transformer.h.10.crossattention.c_proj.bias', 'transformer.h.10.cros

##### Training Loop

In [None]:
# Train
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_model(
    model=caption_model,
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    optimizer=optimizer,
    device=device,
    epochs=15,
)

Starting training on cuda for max 15 epochs (Patience: 3)...

--- Epoch 1/15 ---
Training...
  Batch 0/715, Current Batch Train Loss: 4.2833
  Batch 100/715, Current Batch Train Loss: 2.1399
  Batch 200/715, Current Batch Train Loss: 2.5707
  Batch 300/715, Current Batch Train Loss: 2.2617
  Batch 400/715, Current Batch Train Loss: 2.4036
  Batch 500/715, Current Batch Train Loss: 2.6319
  Batch 600/715, Current Batch Train Loss: 1.9107
  Batch 700/715, Current Batch Train Loss: 2.6556
--- End of Epoch 1 Training, Average Training Loss: 2.4068 ---
Validating...
--- Validation Loss Epoch 1: 2.2159 ---
Validation loss improved to 2.2159. Saving model state.

--- Epoch 2/15 ---
Training...
  Batch 0/715, Current Batch Train Loss: 1.9271
  Batch 100/715, Current Batch Train Loss: 2.1354
  Batch 200/715, Current Batch Train Loss: 2.0576
  Batch 300/715, Current Batch Train Loss: 2.0312
  Batch 400/715, Current Batch Train Loss: 2.0783
  Batch 500/715, Current Batch Train Loss: 1.9234
  Batc

In [None]:
torch.save(caption_model.state_dict(), "vit_gpt2_caption_model.pth")

### A4.6 Custom Model Evaluation on Test Set

Loading the test dataset and dataloader. Evaluating the final trained custom model on the test set using the `evaluate_model` function.

In [15]:
test_dataset = ImageCaptionDataset(test_df, TEST_IMAGES_PATH, image_processor, tokenizer)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)
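As a sanity check on the logs, the number of batches a `DataLoader` yields with the default `drop_last=False` is the dataset size divided by the batch size, rounded up. A small sketch using the 928 test images reported in the SmolVLM progress bar (the helper name `num_batches` is illustrative):

```python
import math

def num_batches(num_samples: int, batch_size: int) -> int:
    """Number of batches when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

print(num_batches(928, 8))
```

The result agrees with the 116 batches shown in the evaluation log for the test set.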

In [None]:
results = evaluate_model(caption_model, test_loader, device=device, tokenizer=caption_model.tokenizer)
print("Final Evaluation Metrics:", results)

Starting evaluation on cuda...
  Evaluated Batch 0/116
  Evaluated Batch 50/116
  Evaluated Batch 100/116
Calculating final scores...
Evaluation Results: {'BLEU': 0.0696221900014177, 'ROUGE-L': 0.2917747947065263}
Final Evaluation Metrics: {'BLEU': 0.0696221900014177, 'ROUGE-L': 0.2917747947065263}


### A4.7 Re-Evaluation (Including METEOR)

*(Evaluation was rerun in Colab because METEOR was not working on Kaggle.)*

Reloading the saved model state and re-running evaluation so that all metrics (BLEU, ROUGE-L, METEOR) are computed.

In [17]:
# Reload the saved model and re-run evaluation (METEOR was not available on Kaggle)
# The saved weights were used to recompute METEOR and the other scores in Colab
model = ImageCaptionModel(
    encoder_model_name="WinKawaks/vit-small-patch16-224",
    decoder_model_name="gpt2"
)

# Load the trained weights
model.load_state_dict(torch.load("vit_gpt2_caption_model.pth", map_location=torch.device(device)))

<All keys matched successfully>

In [18]:
results = evaluate_model(model, test_loader, device=device, tokenizer=model.tokenizer)
print("Final Evaluation Metrics:", results)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Starting evaluation on cuda...
  Evaluated Batch 0/116
  Evaluated Batch 50/116
  Evaluated Batch 100/116
Calculating final scores...
Evaluation Results: {'BLEU': 0.0696919985483392, 'ROUGE-L': np.float64(0.2914559915144349), 'METEOR': np.float64(0.23919343552885367)}
Final Evaluation Metrics: {'BLEU': 0.0696919985483392, 'ROUGE-L': np.float64(0.2914559915144349), 'METEOR': np.float64(0.23919343552885367)}
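To put the two evaluations side by side, the sketch below compares the scores printed in this notebook for zero-shot SmolVLM (section A3.3) and the fine-tuned ViT + GPT-2 model (this section), rounded to four decimals. The helper name `compare` is illustrative.

```python
# Scores copied from the evaluation outputs in this notebook (rounded)
smolvlm = {"BLEU": 0.0575, "ROUGE-L": 0.2319, "METEOR": 0.2687}
custom = {"BLEU": 0.0697, "ROUGE-L": 0.2915, "METEOR": 0.2392}

def compare(a: dict, b: dict, name_a: str, name_b: str) -> dict:
    """Return, per metric, the name of the model with the higher score."""
    return {m: (name_a if a[m] > b[m] else name_b) for m in a}

winners = compare(smolvlm, custom, "SmolVLM zero-shot", "ViT + GPT-2")
for metric in smolvlm:
    print(f"{metric:8s} SmolVLM={smolvlm[metric]:.4f}  "
          f"ViT+GPT-2={custom[metric]:.4f}  higher: {winners[metric]}")
```

On these numbers the fine-tuned ViT + GPT-2 model scores higher on BLEU and ROUGE-L, while zero-shot SmolVLM scores higher on METEOR.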


In [None]:
# End of Part A