# Deep Learning Assignment 2 - Part B: Evaluating on Occluded Images

This notebook covers Part B of the assignment. We will:
1. Implement image occlusion by masking patches.
2. Evaluate the pre-trained SmolVLM (zero-shot) and the custom-trained ViT-GPT2 model (from Part A) on occluded test images.
3. Analyze the change in performance (BLEU, ROUGE-L, METEOR) due to occlusion.
4. Save the generated captions along with occlusion levels for Part C.

## B1. Setup & Imports

In [None]:
!pip install evaluate rouge_score

In [6]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from transformers import AutoProcessor, AutoModelForVision2Seq
import evaluate
from PIL import Image
import numpy as np
import pandas as pd
from tqdm import tqdm
import os
import random
import math
import copy

### B1.1 Downloading our trained model and dataset

In [7]:
!pip install -q gdown

# fine-tuned model weights (.pth) stored on Google Drive
file_id = "1IS_jKcD0ginPrwo-OY14DutAsRF7cuMt"
!gdown {file_id} --output vit_gpt2_caption_model.pth

Downloading...
From (original): https://drive.google.com/uc?id=1IS_jKcD0ginPrwo-OY14DutAsRF7cuMt
From (redirected): https://drive.google.com/uc?id=1IS_jKcD0ginPrwo-OY14DutAsRF7cuMt&confirm=t&uuid=5ee4db82-b619-40a9-bd6d-b723a8b3ec84
To: /kaggle/working/vit_gpt2_caption_model.pth
100%|████████████████████████████████████████| 700M/700M [00:08<00:00, 84.4MB/s]


In [8]:
!pip install -q gdown

# dataset stored in gdrive zip file
file_id = "1-4zt018qT1M85m1X0v95C9-a6_6YelIQ"
!gdown {file_id} --output custom_captions_dataset.zip

Downloading...
From (original): https://drive.google.com/uc?id=1-4zt018qT1M85m1X0v95C9-a6_6YelIQ
From (redirected): https://drive.google.com/uc?id=1-4zt018qT1M85m1X0v95C9-a6_6YelIQ&confirm=t&uuid=2cfdb06b-a0b7-41e6-b178-59a2fb173ecf
To: /kaggle/working/custom_captions_dataset.zip
100%|████████████████████████████████████████| 288M/288M [00:04<00:00, 71.5MB/s]


In [9]:
import zipfile

with zipfile.ZipFile("custom_captions_dataset.zip", 'r') as zip_ref:
    zip_ref.extractall("custom_captions_dataset")

## B2. Config and Paths

In [None]:
OCCLUSION_LEVELS_ARG = [0.1, 0.5, 0.8]
PATCH_SIZE = 16
BATCH_SIZE = 8 

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DATASET_PATH = "custom_captions_dataset/custom_captions_dataset"
TEST_IMAGES_PATH = os.path.join(DATASET_PATH, "test")
TEST_CSV_PATH = os.path.join(DATASET_PATH, "test.csv")
CUSTOM_MODEL_PATH = "vit_gpt2_caption_model.pth"

PART_C_DATA_OUTPUT_FILE = "final_raw_results.csv"
RAW_RESULTS_DIR = "part_b_raw_results_csv"
os.makedirs(RAW_RESULTS_DIR, exist_ok=True)

print(f"Device: {DEVICE}")
print(f"Raw Results Dir: {RAW_RESULTS_DIR}")
print(f"Part C Output File: {PART_C_DATA_OUTPUT_FILE}")

Device: cuda
Raw Results Dir: part_b_raw_results_csv
Part C Output File: final_raw_results.csv
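
The occlusion levels configured above are later turned into integer percentages for the raw CSV file names via `int(level*100)`. That works for 0.1, 0.5 and 0.8, but truncation can be off by one for other fractions because of floating-point rounding. A minimal sketch of a safer conversion follows; `level_to_percent` is a hypothetical helper name and is not part of this notebook.

```python
def level_to_percent(level: float) -> int:
    """Convert a fraction in [0, 1] to an integer percentage (hypothetical helper).

    Rounding before casting avoids truncating values such as
    0.29 * 100, which is stored slightly below 29 as a float.
    """
    return int(round(level * 100))


# The three levels used in this notebook map to the same values either way.
percents = [level_to_percent(level) for level in (0.1, 0.5, 0.8)]

# A level that plain truncation gets wrong.
truncated = int(0.29 * 100)
rounded = level_to_percent(0.29)
```

If other occlusion levels are added later, using the rounded value in the file names keeps them aligned with the printed percentages.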


## B3. Load Models and Processors/Tokenizers

Load SmolVLM and the custom-trained ViT-GPT2 model from Part A.

### B3.1 Load SmolVLM

In [None]:
smolvlm_model_name = "HuggingFaceTB/SmolVLM-Instruct"
try:
    smolvlm_processor = AutoProcessor.from_pretrained(smolvlm_model_name)
    smolvlm_model = AutoModelForVision2Seq.from_pretrained(
        smolvlm_model_name,
        trust_remote_code=True,
        attn_implementation="eager" 
    ).to(DEVICE)
    smolvlm_model.eval()
    print("SmolVLM loaded successfully.")
except Exception as e:
    print(f"Error loading SmolVLM: {e}")
    smolvlm_model = None
    smolvlm_processor = None


SmolVLM loaded successfully.


### B3.2 Load Custom Trained Model

We reuse the `ImageCaptionModel` class definition from Part A so that the saved state dict can be loaded into it.


In [None]:
class ImageCaptionModel(nn.Module):
    """
    Custom encoder-decoder model for image captioning, using ViT as the encoder and GPT-2 as the decoder (same as in Part A).
    """
    def __init__(self, encoder_model_name="WinKawaks/vit-small-patch16-224", decoder_model_name="gpt2", dropout_rate=0.1):
        super(ImageCaptionModel, self).__init__()
        self.model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
            encoder_model_name, decoder_model_name
        )
        self.tokenizer = AutoTokenizer.from_pretrained(decoder_model_name)
        if self.tokenizer.pad_token is None:
             self.tokenizer.pad_token = self.tokenizer.eos_token
             self.model.config.decoder.pad_token_id = self.model.config.decoder.eos_token_id

        self.model.config.decoder_start_token_id = self.tokenizer.bos_token_id
        self.model.config.pad_token_id = self.tokenizer.pad_token_id
        self.dropout = nn.Dropout(dropout_rate)
        self.image_processor = ViTImageProcessor.from_pretrained(encoder_model_name)

    def forward(self, pixel_values, labels=None):
        outputs = self.model(pixel_values=pixel_values, labels=labels)
        return outputs

    def generate(self, pixel_values, gen_kwargs=None):
        if gen_kwargs is None:
            gen_kwargs = {"max_length": 150, "num_beams": 4}

        gen_kwargs['pad_token_id'] = self.model.config.pad_token_id
        if self.model.config.decoder_start_token_id is None and hasattr(self.tokenizer, 'bos_token_id'):
             gen_kwargs.setdefault('decoder_start_token_id', self.tokenizer.bos_token_id)
        elif 'decoder_start_token_id' not in gen_kwargs:
             gen_kwargs.setdefault('decoder_start_token_id', self.model.config.decoder_start_token_id)

        return self.model.generate(pixel_values=pixel_values.to(self.model.device), **gen_kwargs) 


# load our custom model that we fine-tuned previously
try:
    custom_model = ImageCaptionModel(
        encoder_model_name="WinKawaks/vit-small-patch16-224",
        decoder_model_name="gpt2"
    )
    custom_model.load_state_dict(torch.load(CUSTOM_MODEL_PATH, map_location=DEVICE, weights_only=True))

    custom_model.to(DEVICE)
    custom_model.eval()
    custom_tokenizer = custom_model.tokenizer
    custom_image_processor = custom_model.image_processor
    print("Custom trained model loaded successfully.")
except FileNotFoundError:
    print(f"Error: Custom model state file not found at {CUSTOM_MODEL_PATH}. Cannot proceed with custom model evaluation.")
    custom_model = None
    custom_tokenizer = None
    custom_image_processor = None
except Exception as e:
    print(f"Error loading custom model state: {e}")
    custom_model = None
    custom_tokenizer = None
    custom_image_processor = None

Some weights of ViTModel were not initialized from the model checkpoint at WinKawaks/vit-small-patch16-224 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['transformer.h.0.crossattention.c_attn.bias', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.0.crossattention.q_attn.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.bias', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_attn.bias', 'transformer.h.1.crossattention.c_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.bias', 'transformer.h.1.crossattention.q_attn.weight', 'transformer.h.1.ln_cross_attn.bias', 'transformer.h.1.ln_cross_attn.weight', 'transformer.h.10.crossattention.c_attn.bias', 'transformer.h.10.crossattention.c_attn.weight', 'transformer.h.10.crossattention.c_proj.bias', 'transformer.h.10.cros

Custom trained model loaded successfully.


## B4. Image Occlusion Function

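Occludes an image by blacking out a randomly chosen fraction of its `patch_size` × `patch_size` patches. `mask_percentage` is that fraction, given as a value between 0 and 1.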

In [None]:
def occlude_image(image: np.ndarray, mask_percentage: float, patch_size: int = 16) -> np.ndarray:
    """
    Apply patch-wise occlusion to an image. Input and output are (H, W, C) numpy arrays;
    mask_percentage is the fraction of patches (0 to 1) to set to black.
    """
    if mask_percentage == 0:
        return image.copy()

    occluded_image = image.copy()
    h, w, _ = occluded_image.shape

    num_patches_h = math.ceil(h / patch_size)
    num_patches_w = math.ceil(w / patch_size)
    total_patches = num_patches_h * num_patches_w

    num_patches_to_mask = int(total_patches * mask_percentage)

    # all indices of patches
    all_patch_indices = [(i, j) for i in range(num_patches_h) for j in range(num_patches_w)]

    # randomly select patches to mask
    indices_to_mask = random.sample(all_patch_indices, num_patches_to_mask)

    for patch_r, patch_c in indices_to_mask:
        r_start = patch_r * patch_size
        r_end = min((patch_r + 1) * patch_size, h)
        c_start = patch_c * patch_size
        c_end = min((patch_c + 1) * patch_size, w)

        # set these to black
        occluded_image[r_start:r_end, c_start:c_end, :] = 0

    return occluded_image
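
The function above draws patches with Python's global `random` module, so the masked patches differ between runs unless a seed is set. The following is a minimal, self-contained sketch of the same patch-masking idea with an explicit seed, plus a check of how much of the image ends up masked. `occlude_patches` and its `seed` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import math

import numpy as np


def occlude_patches(image, fraction, patch_size=16, seed=None):
    """Black out a random fraction of patches (hypothetical, seeded variant)."""
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = out.shape[:2]
    rows = math.ceil(h / patch_size)
    cols = math.ceil(w / patch_size)
    # Same truncation as the notebook: int(total_patches * fraction).
    n_mask = int(rows * cols * fraction)
    chosen = rng.choice(rows * cols, size=n_mask, replace=False)
    for idx in chosen:
        r, c = divmod(int(idx), cols)
        # Slices past the image border are clipped by numpy automatically.
        out[r * patch_size:(r + 1) * patch_size,
            c * patch_size:(c + 1) * patch_size] = 0
    return out


# An all-white 224x224 image has 14 x 14 = 196 patches of size 16.
image = np.full((224, 224, 3), 255, dtype=np.uint8)
occluded = occlude_patches(image, 0.5, seed=0)

# Fraction of pixels that are black in every channel.
masked_fraction = float((occluded == 0).all(axis=-1).mean())

# Re-running with the same seed masks the same patches.
same_again = np.array_equal(occluded, occlude_patches(image, 0.5, seed=0))
```

Because the number of masked patches is truncated to an integer, the masked fraction can fall slightly below the requested level (for example 19 of 196 patches at a level of 0.1).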

## B5. Evaluation & Helper functions
This section defines the caption-generation helpers for each model, a `Dataset` class that loads the test images, and the main evaluation loop.


In [None]:
# generate captions for test images(smolvlm)
def generate_smolvlm_caption(image: Image.Image, processor, model, device, max_new_tokens):
    """ Generates caption for a single image using SmolVLM. """
    if model is None or processor is None:
        return "Error: SmolVLM not loaded"
    try:
        prompt = "<image>\nDescribe the image:"
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
        with torch.no_grad():
            generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False) # Use deterministic generation for consistency
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
        caption = generated_text.split("Describe the image:")[-1].strip()
        return caption
    except Exception as e:
        return f"Error during SmolVLM generation: {e}"


# generate captions for test images(custom model)
def generate_custom_caption(pixel_values: torch.Tensor, model, tokenizer, device, max_length):
    """ Generates caption for a single processed image using the custom model. """
    if model is None or tokenizer is None:
         return "Error: Custom model not loaded"
    try:
        with torch.no_grad():
            gen_kwargs = {
                "max_length": max_length,
                "num_beams": 4,
                "pad_token_id": tokenizer.pad_token_id if tokenizer.pad_token_id is not None else model.model.config.pad_token_id,
                "decoder_start_token_id": model.model.config.decoder_start_token_id
            }
            if pixel_values.dim() == 3:
                pixel_values = pixel_values.unsqueeze(0)
            pixel_values = pixel_values.to(device)

            generated_ids = model.generate(pixel_values=pixel_values, gen_kwargs=gen_kwargs)
        decoded_caption = tokenizer.decode(generated_ids[0], skip_special_tokens=True).strip()
        return decoded_caption
    except Exception as e:
        return f"Error during Custom Model generation: {e}"
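
Both helpers above return a string beginning with `Error` instead of raising when generation fails, so a failed generation would later be scored like any other caption. One possible way to keep such entries out of the metrics is sketched below; `split_failed_generations` is a hypothetical helper, and it assumes that no genuine caption starts with the word "Error".

```python
def split_failed_generations(predictions, references, error_prefix="Error"):
    """Separate usable captions from error placeholders (hypothetical helper).

    Returns the kept predictions, their references, and the indices of
    the entries that were dropped.
    """
    kept_preds, kept_refs, failed_indices = [], [], []
    for index, (pred, ref) in enumerate(zip(predictions, references)):
        if str(pred).startswith(error_prefix):
            failed_indices.append(index)
        else:
            kept_preds.append(pred)
            kept_refs.append(ref)
    return kept_preds, kept_refs, failed_indices


preds = ["a dog on grass", "Error during SmolVLM generation: out of memory"]
refs = ["a dog runs on the grass", "a cat sleeps on a sofa"]
kept_preds, kept_refs, failed = split_failed_generations(preds, refs)
```

Reporting the number of dropped entries alongside the scores makes it clear how many images each metric is actually based on.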


In [None]:
class ImageOcclusionDataset(Dataset):
    def __init__(self, dataframe, image_dir):
        self.dataframe = dataframe
        self.image_dir = image_dir

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        img_filename = row['filename']
        img_path = os.path.join(self.image_dir, img_filename)
        try:
            original_image_pil = Image.open(img_path).convert("RGB")
        except Exception as e:
            print(f"Error loading image {img_path}: {e}")
            original_image_pil = Image.new('RGB', (224, 224)) 
            img_filename = f"ERROR_{img_filename}"

        return {
            "image_pil": original_image_pil,
            "image_id": img_filename,
            "original_caption": row['caption']
        }

In [None]:
def evaluate_on_occluded_images(model, processor_or_tokenizer, image_processor, dataloader, device, occlusion_levels, model_key, gen_params, raw_results_output_dir):
    """
    Evaluates a model on occluded images using a DataLoader. Saves raw results internally to CSVs. Returns scores dict.
    """
    
    if model is None: print(f"Model {model_key} is None. Skipping evaluation."); return None
    print(f"--- Starting Occlusion Evaluation for {model_key.upper()} (Will save raw CSVs) ---")
    levels_to_evaluate = occlusion_levels # baseline (0%) is not evaluated here; Part A scores are used for it later

    raw_results_per_level = {level: {'preds': [], 'refs': [], 'ids': [], 'orig_captions': []} for level in levels_to_evaluate}

    try:
        bleu_metric = evaluate.load("bleu"); rouge_metric = evaluate.load("rouge"); meteor_metric = evaluate.load("meteor")
    except Exception as e: print(f"Error loading metrics for {model_key}: {e}."); return None

    model.eval(); torch.cuda.empty_cache() 
    with torch.no_grad():
        # for all batches
        for batch in tqdm(dataloader, desc=f"Evaluating {model_key}"):
            batch_image_pils = batch["image_pil"]; batch_image_ids = batch["image_id"]; batch_original_captions = batch["original_caption"]
            # for each image
            for i in range(len(batch_image_ids)): 
                image_pil, image_id, original_caption = batch_image_pils[i], batch_image_ids[i], batch_original_captions[i]
                if "ERROR_" in image_id: continue
                try: original_image_np = np.array(image_pil)
                except Exception as e: print(f"Error converting {image_id} to numpy: {e}. Skipping."); continue

                for occ_level in levels_to_evaluate:
                    occluded_image_np = occlude_image(original_image_np, occ_level, PATCH_SIZE)
                    # convert back to image
                    occluded_image_pil = Image.fromarray(occluded_image_np)
                    pred_caption = "Error - Generation Failed"
                    try:
                        if model_key == 'smolvlm':
                            pred_caption = generate_smolvlm_caption(occluded_image_pil, processor_or_tokenizer, model, device, **gen_params)
                        elif model_key == 'custom':
                            if image_processor is None: raise ValueError("Image processor needed")
                            pixel_values = image_processor(images=occluded_image_pil, return_tensors="pt").pixel_values
                            pred_caption = generate_custom_caption(pixel_values, model, processor_or_tokenizer, device, **gen_params)
                    except Exception as e: pred_caption = f"Error during generation: {e}"

                    raw_results_per_level[occ_level]['preds'].append(pred_caption)
                    raw_results_per_level[occ_level]['refs'].append([original_caption])
                    raw_results_per_level[occ_level]['ids'].append(image_id)
                    raw_results_per_level[occ_level]['orig_captions'].append(original_caption)


    # save all raw results to CSV
    print(f"\nSaving raw results CSV files for {model_key} to {raw_results_output_dir}...")
    for level in levels_to_evaluate:
        level_int = int(level*100)
        filename = os.path.join(raw_results_output_dir, f"raw_{model_key}_{level_int}.csv")
        data_for_level = raw_results_per_level[level]
        if data_for_level['ids']:
             try:
                 df_raw = pd.DataFrame({
                     'image_id': data_for_level['ids'],
                     'original_caption': data_for_level['orig_captions'],
                     'generated_caption': data_for_level['preds']
                 })
                 df_raw.to_csv(filename, index=False, encoding='utf-8')
                 print(f"Saved {filename}")
             except Exception as e:
                 print(f"Error saving raw results to {filename}: {e}")
        else:
            print(f"Skipping save for {model_key} at {level*100}% - no data.")


    # calculate the 3 metrics
    print(f"\nCalculating final metrics for {model_key}...")
    scores_per_level = {}
    for level in levels_to_evaluate:
        data = raw_results_per_level[level]; preds = data['preds']; refs = data['refs']; meteor_refs = data['orig_captions']
        if not preds or not refs:
            scores_per_level[level] = {'BLEU': 0, 'ROUGE-L': 0, 'METEOR': 0, 'Error': 'No data'}
            continue
        try:
            preds = [str(p) if pd.notna(p) else "" for p in preds]; meteor_refs = [str(r) if pd.notna(r) else "" for r in meteor_refs]; refs = [[str(r[0]) if pd.notna(r[0]) else ""] for r in refs]
            bleu_score = bleu_metric.compute(predictions=preds, references=refs)['bleu']; rouge_score = rouge_metric.compute(predictions=preds, references=refs)['rougeL']; meteor_score = meteor_metric.compute(predictions=preds, references=meteor_refs)['meteor']
            scores_per_level[level] = { 'BLEU': bleu_score, 'ROUGE-L': rouge_score, 'METEOR': meteor_score }
        except Exception as e: print(f"Error calculating metrics for {model_key}@{level*100}%: {e}"); scores_per_level[level] = {'BLEU': 0, 'ROUGE-L': 0, 'METEOR': 0, 'Error': str(e)}

    print(f"--- Finished Occlusion Evaluation for {model_key.upper()} ---")
    return scores_per_level
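
To make the ROUGE-L numbers reported by the function above easier to interpret, here is a simplified sketch of what the metric measures: an F-score built on the longest common subsequence (LCS) of prediction and reference tokens. It is an illustration under assumptions (lowercasing, whitespace tokenization, a single reference) and not the implementation behind `evaluate.load("rouge")`, so its values will not match the library exactly. `lcs_length` and `rouge_l_f1` are hypothetical names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1 between two strings (illustrative only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


identical = rouge_l_f1("a dog runs", "a dog runs")
partial = rouge_l_f1("a cat sits", "a dog sits on grass")
```

In the second call the LCS is `a ... sits` (two tokens), giving a precision of 2/3 and a recall of 2/5. This is why a short caption that keeps the overall word order of the reference can still score reasonably on ROUGE-L even when several words are wrong.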

In [None]:
# Create the test dataloader for model testing
try:
    test_df = pd.read_csv(TEST_CSV_PATH)
    test_dataset = ImageOcclusionDataset(test_df, TEST_IMAGES_PATH)
    test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=lambda batch: {key: [dic[key] for dic in batch] for key in batch[0]})
    print(f"Created DataLoader with {len(test_dataloader)} batches.")
except FileNotFoundError: print(f"FATAL: Test CSV not found at {TEST_CSV_PATH}."); test_dataloader = None
except Exception as e: print(f"Error creating DataLoader: {e}"); test_dataloader = None

smolvlm_gen_params = {"max_new_tokens": 50}
custom_gen_params = {"max_length": 50}

smolvlm_final_scores = None
custom_final_scores = None

Created DataLoader with 116 batches.
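
The `collate_fn` lambda used above keeps each field as a plain Python list, which is needed because PIL images cannot be stacked into a tensor by the default collation. A named function does the same job and, unlike a lambda, can be pickled if worker processes are ever enabled. The sketch below is plain Python; `collate_as_lists` is a hypothetical name, and the string stand-ins replace real PIL images purely for illustration.

```python
def collate_as_lists(batch):
    """Turn a list of sample dicts into a dict of lists (hypothetical helper).

    Equivalent to the lambda passed as collate_fn above: values are
    gathered per key without any tensor conversion.
    """
    return {key: [sample[key] for sample in batch] for key in batch[0]}


# Stand-in samples shaped like the items returned by ImageOcclusionDataset.
samples = [
    {"image_pil": "<image 1>", "image_id": "a.jpg", "original_caption": "a dog"},
    {"image_pil": "<image 2>", "image_id": "b.jpg", "original_caption": "a cat"},
]
collated = collate_as_lists(samples)
```

It would be passed to the loader as `collate_fn=collate_as_lists` in place of the lambda.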


## Evaluating SmolVLM and Our Custom Model

In [20]:
if test_dataloader and smolvlm_model:
    smolvlm_final_scores = evaluate_on_occluded_images(
        model=smolvlm_model,
        processor_or_tokenizer=smolvlm_processor,
        image_processor=None, 
        dataloader=test_dataloader,
        device=DEVICE,
        occlusion_levels=OCCLUSION_LEVELS_ARG,
        model_key='smolvlm',
        gen_params=smolvlm_gen_params,
        raw_results_output_dir=RAW_RESULTS_DIR 
    )

--- Starting Occlusion Evaluation for SMOLVLM (Will save raw CSVs) ---


[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Evaluating smolvlm: 100%|██████████| 116/116 [4:22:16<00:00, 135.66s/it] 



Saving raw results CSV files for smolvlm to part_b_raw_results_csv...
Saved part_b_raw_results_csv/raw_smolvlm_10.csv
Saved part_b_raw_results_csv/raw_smolvlm_50.csv
Saved part_b_raw_results_csv/raw_smolvlm_80.csv

Calculating final metrics for smolvlm...
--- Finished Occlusion Evaluation for SMOLVLM ---


In [21]:
print("\n--- Scores returned by evaluate_on_occluded_images for SmolVLM ---")
if isinstance(smolvlm_final_scores, dict):
    sorted_levels = sorted([float(k) for k in smolvlm_final_scores.keys()])

    for level in sorted_levels:
        level_key = float(level)
        scores = smolvlm_final_scores.get(level_key)
        print(f"\nOcclusion Level: {level_key*100:.0f}%")
        if isinstance(scores, dict):
            print(f"  BLEU:    {scores.get('BLEU', 'N/A'):.5f}")
            print(f"  ROUGE-L: {scores.get('ROUGE-L', 'N/A'):.5f}")
            print(f"  METEOR:  {scores.get('METEOR', 'N/A'):.5f}")


--- Scores returned by evaluate_on_occluded_images for SmolVLM ---

Occlusion Level: 10%
  BLEU:    0.05252
  ROUGE-L: 0.22532
  METEOR:  0.26010

Occlusion Level: 50%
  BLEU:    0.03954
  ROUGE-L: 0.17631
  METEOR:  0.19322

Occlusion Level: 80%
  BLEU:    0.01424
  ROUGE-L: 0.10445
  METEOR:  0.10684


In [None]:
if test_dataloader and custom_model:
    custom_final_scores = evaluate_on_occluded_images(
        model=custom_model,
        processor_or_tokenizer=custom_tokenizer,
        image_processor=custom_image_processor,
        dataloader=test_dataloader,
        device=DEVICE,
        occlusion_levels=OCCLUSION_LEVELS_ARG,
        model_key='custom',
        gen_params=custom_gen_params,
        raw_results_output_dir=RAW_RESULTS_DIR 
    )

--- Starting Occlusion Evaluation for CUSTOM (Will save raw CSVs) ---


[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Evaluating custom:   0%|          | 0/116 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Evaluating custom: 100%|██████████| 116/116 [33:37<00:00, 17.39s/it]



Saving raw results CSV files for custom to part_b_raw_results_csv...
Saved part_b_raw_results_csv/raw_custom_10.csv
Saved part_b_raw_results_csv/raw_custom_50.csv
Saved part_b_raw_results_csv/raw_custom_80.csv

Calculating final metrics for custom...
--- Finished Occlusion Evaluation for CUSTOM ---


In [23]:
print("\n--- Scores returned by evaluate_on_occluded_images for Custom Model ---")
if isinstance(custom_final_scores, dict):
        sorted_levels = sorted([float(k) for k in custom_final_scores.keys()])

        for level in sorted_levels:
             level_key = float(level) 
             scores = custom_final_scores.get(level_key)
             print(f"\nOcclusion Level: {level_key*100:.0f}%")
             if isinstance(scores, dict):
                 print(f"  BLEU:    {scores.get('BLEU', 'N/A'):.5f}")
                 print(f"  ROUGE-L: {scores.get('ROUGE-L', 'N/A'):.5f}")
                 print(f"  METEOR:  {scores.get('METEOR', 'N/A'):.5f}")


--- Scores returned by evaluate_on_occluded_images for Custom Model ---

Occlusion Level: 10%
  BLEU:    0.06307
  ROUGE-L: 0.28375
  METEOR:  0.23333

Occlusion Level: 50%
  BLEU:    0.04353
  ROUGE-L: 0.25246
  METEOR:  0.20752

Occlusion Level: 80%
  BLEU:    0.03392
  ROUGE-L: 0.24081
  METEOR:  0.19795


## B8. Analysis and Saving Data for Part C

This section uses the final scores returned by the evaluation function for the analysis, then loads the raw CSV files (written by that function) to assemble the data for Part C.
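
Absolute score differences are hard to compare across models whose baselines differ, so a relative change can be a useful complement. The sketch below computes it from the Part A baselines and the occluded scores printed earlier in this notebook (rounded to five decimals); `relative_change` is a hypothetical helper and is not part of the analysis cell.

```python
def relative_change(occluded, baseline):
    """Relative change (occluded - baseline) / baseline (hypothetical helper)."""
    return (occluded - baseline) / baseline


# Part A baselines at 0% occlusion, as listed in this notebook.
baselines = {
    "smolvlm": {"BLEU": 0.05751, "ROUGE-L": 0.23192, "METEOR": 0.26874},
    "custom": {"BLEU": 0.06969, "ROUGE-L": 0.29146, "METEOR": 0.23919},
}

# Scores at 80% occlusion, copied from the evaluation output above.
occluded_80 = {
    "smolvlm": {"BLEU": 0.01424, "ROUGE-L": 0.10445, "METEOR": 0.10684},
    "custom": {"BLEU": 0.03392, "ROUGE-L": 0.24081, "METEOR": 0.19795},
}

relative = {
    model: {
        metric: relative_change(occluded_80[model][metric], baselines[model][metric])
        for metric in baselines[model]
    }
    for model in baselines
}

for model, changes in relative.items():
    summary = ", ".join(f"{metric}={value:+.1%}" for metric, value in changes.items())
    print(f"{model} @ 80%: {summary}")
```

On these numbers, SmolVLM loses a larger share of its baseline score than the custom model on all three metrics at 80% occlusion.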

In [None]:
baseline_smolvlm = {
    'BLEU': 0.05751116213026179,
    'ROUGE-L': 0.23192061369592903,
    'METEOR': 0.2687414556395594
}
baseline_custom = {
    'BLEU': 0.0696919985483392,
    'ROUGE-L': 0.2914559915144349,
    'METEOR': 0.23919343552885367
}

if smolvlm_final_scores is None or custom_final_scores is None:
    print("\nOne or both evaluations did not complete successfully or were skipped. Cannot aggregate results fully.")
else:
    print("\n--- Aggregating Results and Saving Part C Data ---")

    performance_summary = {'smolvlm': {}, 'custom': {}}
    performance_summary['smolvlm'] = smolvlm_final_scores 
    performance_summary['custom'] = custom_final_scores 

    # using score results from part A
    performance_summary['smolvlm'][0.0] = baseline_smolvlm
    performance_summary['custom'][0.0] = baseline_custom


    print("\n--- Final Performance Summary (Using Part A Baselines) ---")
    print(f"\nBaseline (0% Occlusion - from Part A):")
    print(f"  SmolVLM: {performance_summary['smolvlm'].get(0.0, 'N/A')}")
    print(f"  Custom : {performance_summary['custom'].get(0.0, 'N/A')}")

    for level in OCCLUSION_LEVELS_ARG: 
        print(f"\nOcclusion Level: {int(level*100)}%")
        print(f"  SmolVLM: {performance_summary['smolvlm'].get(level, 'N/A - Evaluation might have failed')}")
        print(f"  Custom : {performance_summary['custom'].get(level, 'N/A - Evaluation might have failed')}")


    print("\n--- Performance Change (Score(Occluded) - Score(Baseline)) ---")
    valid_baseline_smol = isinstance(baseline_smolvlm, dict) and 'Error' not in baseline_smolvlm
    valid_baseline_cust = isinstance(baseline_custom, dict) and 'Error' not in baseline_custom

    # calculating changes for each occlusion level
    for level in OCCLUSION_LEVELS_ARG: 
        print(f"\nOcclusion Level: {int(level*100)}%")

        # score for smolvlm 
        current_perf_smol = performance_summary['smolvlm'].get(level) 
        valid_current_smol = isinstance(current_perf_smol, dict) and 'Error' not in current_perf_smol
        if valid_baseline_smol and valid_current_smol:
             try:
                 delta_bleu = current_perf_smol['BLEU'] - baseline_smolvlm['BLEU']
                 delta_rouge = current_perf_smol['ROUGE-L'] - baseline_smolvlm['ROUGE-L']
                 delta_meteor = current_perf_smol['METEOR'] - baseline_smolvlm['METEOR']
                 print(f"  SmolVLM Change: BLEU={delta_bleu:+.4f}, ROUGE-L={delta_rouge:+.4f}, METEOR={delta_meteor:+.4f}")
             except KeyError as ke: print(f"  SmolVLM Change: Error calculating change - Missing key {ke}")
             except Exception as e: print(f"  SmolVLM Change: Error calculating change - {e}")
        else: print(f"  SmolVLM Change: Cannot calculate change due to missing/invalid data.")

        # score for custom model
        current_perf_cust = performance_summary['custom'].get(level) 
        valid_current_cust = isinstance(current_perf_cust, dict) and 'Error' not in current_perf_cust
        if valid_baseline_cust and valid_current_cust:
             try:
                 delta_bleu = current_perf_cust['BLEU'] - baseline_custom['BLEU']
                 delta_rouge = current_perf_cust['ROUGE-L'] - baseline_custom['ROUGE-L']
                 delta_meteor = current_perf_cust['METEOR'] - baseline_custom['METEOR']
                 print(f"  Custom Change : BLEU={delta_bleu:+.4f}, ROUGE-L={delta_rouge:+.4f}, METEOR={delta_meteor:+.4f}")
             except KeyError as ke: print(f"  Custom Change : Error calculating change - Missing key {ke}")
             except Exception as e: print(f"  Custom Change : Error calculating change - {e}")
        else: print(f"  Custom Change : Cannot calculate change due to missing/invalid data.")


    # saving final data for part c
    print(f"\nLoading raw data from {RAW_RESULTS_DIR} and assembling Part C data...")
    part_c_dfs = []
    error_loading_raw = False
    for level in OCCLUSION_LEVELS_ARG: 
        level_int = int(level*100)
        smol_file = os.path.join(RAW_RESULTS_DIR, f"raw_smolvlm_{level_int}.csv")
        cust_file = os.path.join(RAW_RESULTS_DIR, f"raw_custom_{level_int}.csv")
        df_smol, df_cust = None, None

        try: 
            if os.path.exists(smol_file): df_smol = pd.read_csv(smol_file)
            else: print(f"Warning: Raw file not found: {smol_file}")
            if os.path.exists(cust_file): df_cust = pd.read_csv(cust_file)
            else: print(f"Warning: Raw file not found: {cust_file}")
        except Exception as e: print(f"Error loading raw CSVs for level {level_int}%: {e}"); error_loading_raw = True; continue

        if df_smol is None or df_cust is None:
            print(f"Warning: Missing raw data CSV for level {level_int}%. Cannot include in Part C data."); error_loading_raw = True; continue

        # two df for each model
        df_smol = df_smol.rename(columns={'generated_caption': 'generated_caption_smolvlm'})
        df_cust = df_cust.rename(columns={'generated_caption': 'generated_caption_custom'})
        merged_df = pd.merge(
            df_smol[['image_id', 'original_caption', 'generated_caption_smolvlm']],
            df_cust[['image_id', 'generated_caption_custom']], on='image_id', how='inner' 
        )
        merged_df['perturbation_percentage'] = level_int
        part_c_dfs.append(merged_df)

    # merged both parts
    if part_c_dfs:
        final_part_c_df = pd.concat(part_c_dfs, ignore_index=True)
        try:
            final_part_c_df = final_part_c_df[['image_id', 'original_caption', 'generated_caption_smolvlm', 'generated_caption_custom', 'perturbation_percentage']]
            final_part_c_df.to_csv(PART_C_DATA_OUTPUT_FILE, index=False, encoding='utf-8')
            print(f"Data for Part C saved successfully to {PART_C_DATA_OUTPUT_FILE} ({len(final_part_c_df)} records).")
        except Exception as e: print(f"Error saving final Part C data to CSV: {e}")
    elif not error_loading_raw: print("No data assembled for Part C (check if raw CSV files exist in output dir). CSV file not saved.")
    else: print("Could not assemble Part C data due to errors loading raw CSV files.")


--- Aggregating Results and Saving Part C Data ---

--- Final Performance Summary (Using Part A Baselines) ---

Baseline (0% Occlusion - from Part A):
  SmolVLM: {'BLEU': 0.05751116213026179, 'ROUGE-L': 0.23192061369592903, 'METEOR': 0.2687414556395594}
  Custom : {'BLEU': 0.0696919985483392, 'ROUGE-L': 0.2914559915144349, 'METEOR': 0.23919343552885367}

Occlusion Level: 10%
  SmolVLM: {'BLEU': 0.05251569996354642, 'ROUGE-L': 0.2253222082555746, 'METEOR': 0.2601031894310235}
  Custom : {'BLEU': 0.06306653453007512, 'ROUGE-L': 0.2837485057567668, 'METEOR': 0.23333117133535802}

Occlusion Level: 50%
  SmolVLM: {'BLEU': 0.03954059454316779, 'ROUGE-L': 0.17630869575191552, 'METEOR': 0.1932211553338966}
  Custom : {'BLEU': 0.043532610279796144, 'ROUGE-L': 0.2524637112278998, 'METEOR': 0.20752224399128041}

Occlusion Level: 80%
  SmolVLM: {'BLEU': 0.014235201519826829, 'ROUGE-L': 0.10445087834211797, 'METEOR': 0.10684116586939828}
  Custom : {'BLEU': 0.033921258602148105, 'ROUGE-L': 0.24081