## Library Imports and Environment Configuration
This section imports the libraries used throughout the notebook:
- **pandas** for data manipulation using DataFrames.
- **PIL (Python Imaging Library)** for image loading and RGB conversion.
- **torch** for PyTorch model inference.
- **transformers** (`BlipProcessor` and `BlipForQuestionAnswering`) for Visual Question Answering (VQA).
- **evaluate** and **scikit-learn** for the evaluation metrics (BERTScore, TF-IDF).
The device (CPU or GPU) is selected in the configuration cell that follows.

In [None]:
# Import core libraries for data handling, image processing, and model inference
import os
import re
import math
import torch 
import evaluate 
from evaluate import load
import warnings
import pandas as pd 
from PIL import Image 
from tqdm.notebook import tqdm 
import torch.nn.functional as F
from sklearn.metrics import f1_score 
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BlipProcessor, BlipForQuestionAnswering 

## Dataset Path Definition
We define the constant `DATASET_CSV`, the location of the CSV file containing the VQA dataset.
The CSV is expected to include these columns:
- **id**: Image identifier, later used to look up the image path in the image metadata.
- **question**: Natural-language question about the image.
- **answer**: Ground-truth answer used for evaluation.
The same cell sets `IMAGE_BASE_DIR`, `MODEL_NAME`, `DEVICE`, and `BATCH_SIZE`. Centralizing these values makes it easy to change file locations or settings without modifying downstream code.
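
As a quick sanity check, the expected columns can be verified right after loading. The sketch below is illustrative only: it uses a small in-memory CSV in place of the real file, and `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names that are not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical names; the column set mirrors the dataset's `df.head()` output.
REQUIRED_COLUMNS = {"id", "question", "answer"}


def missing_columns(frame: pd.DataFrame) -> set:
    """Return the required columns that are absent from `frame`."""
    return REQUIRED_COLUMNS - set(frame.columns)


# Tiny stand-in for the real dataset CSV.
sample_csv = io.StringIO(
    "id,question,answer\n"
    "718mYsQTQbL,What color is the solid bib?,Yellow\n"
)
sample_df = pd.read_csv(sample_csv)

print(missing_columns(sample_df))                         # empty set: all columns present
print(missing_columns(sample_df.drop(columns="answer")))  # reports the dropped column
```

Failing early on a missing column is usually cheaper than discovering it partway through a long inference run.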

In [None]:
# Paths, model name, device, and batch size used throughout the notebook
DATASET_CSV = '/kaggle/input/image-input/output.csv'
IMAGE_BASE_DIR = '/kaggle/working/images/small'
MODEL_NAME = "Salesforce/blip-vqa-base"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 64
print(f"Using device: {DEVICE}")
print(f"Using Batch Size: {BATCH_SIZE}")

Using device: cuda
Using Batch Size: 64


## Loading Dataset and Model

In [None]:
print("Loading dataset...")
try:
    df = pd.read_csv(DATASET_CSV)
    print(f"Loaded {len(df)} samples.")
except FileNotFoundError:
    print(f"Error: {DATASET_CSV} not found")
    raise  # stop this cell with the original error instead of calling exit(), which ends the kernel session

print(f"Loading model: {MODEL_NAME}...")
processor = BlipProcessor.from_pretrained(MODEL_NAME, use_fast=True)
model = BlipForQuestionAnswering.from_pretrained(MODEL_NAME).to(DEVICE)
model.eval()  # Set model to evaluation mode
print("Model loaded.")

Loading dataset...
Loaded 33866 samples.
Loading model: Salesforce/blip-vqa-base...


Model loaded.


## Visual Question Answering Prediction Function
This cell defines `get_vqa_prediction(image_path, question)`, which:
1. **Loads and preprocesses** the image using PIL and converts it to RGB.
2. **Processes inputs** by combining the image and question through the `processor`, returning PyTorch tensors.
3. **Performs model inference** with `model.generate(...)` under `torch.no_grad()` to produce an answer of at most 10 new tokens.
4. **Decodes** the generated token IDs back into a string with `processor.decode`.
Missing or unreadable image files are handled gracefully: the function prints a warning and returns a placeholder string instead of raising.

In [None]:
def get_vqa_prediction(image_path, question):
    try:
        raw_image = Image.open(image_path).convert('RGB')
    except FileNotFoundError:
        print(f"Warning: Image not found at {image_path}")
        return "[Image Not Found Error]"
    except Exception as e:
        print(f"Warning: Error loading image {image_path}: {e}")
        return "[Image Load Error]"

    inputs = processor(raw_image, question, return_tensors="pt").to(DEVICE)
    with torch.no_grad():  # no gradients are needed during inference
        outputs = model.generate(**inputs, max_new_tokens=10)  # answers are short, so cap the generated length
    answer = processor.decode(outputs[0], skip_special_tokens=True).strip()  # decode token IDs to text
    return answer

In [16]:
df.head()

| Unnamed: 0 | id | question | answer |
|---|---|---|---|
| 0 | 718mYsQTQbL | What are the items in the image? | Bibs |
| 1 | 718mYsQTQbL | What color is the solid bib? | Yellow |
| 2 | 718mYsQTQbL | How many bibs are shown? | Six |
| 3 | 718mYsQTQbL | What material are the bibs? | Cotton |
| 4 | 718mYsQTQbL | Does one bib have a striped pattern? | Yes |


In [None]:
# !gunzip /kaggle/working/images/metadata/images.csv.gz

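Defines the `directory` variable pointing to the folder with listing metadata files, and loads the image metadata (`images.csv`) into `df1`, which maps each `image_id` to its relative `path`.
Centralizing directory paths simplifies file management for batch processing tasks.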

In [None]:
directory = "/kaggle/working/listings/metadata"
df1 = pd.read_csv(r'/kaggle/working/images/metadata/images.csv')

In [None]:
for idx, row in df.iterrows():
    # Loop through the DataFrame row by row; here we only resolve the image path for one sample as a sanity check.
    imageId = row['id']
    question = row['question']
    pt = df1[df1['image_id'] == imageId]
    pt = pt['path'].values[0]
    true_answer = str(row['answer']).lower().strip()
    print(pt, question, true_answer)
    break  # inspect only the first row

4c/4c533ad7.jpg What are the items in the image? bibs
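
Filtering `df1` once per row rescans the whole metadata table each time. One possible alternative, sketched here with toy data, is to build an `image_id` to `path` dictionary once and look paths up by key. The second `image_id`/`path` pair below is invented for illustration, and `path_by_id` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the image metadata table (`df1` in the notebook).
# The first row mirrors the sample output above; the second row is made up.
metadata = pd.DataFrame({
    "image_id": ["718mYsQTQbL", "exampleId0001"],
    "path": ["4c/4c533ad7.jpg", "00/example.jpg"],
})

# Build the lookup once; every later access is a plain dictionary lookup.
path_by_id = dict(zip(metadata["image_id"], metadata["path"]))

print(path_by_id["718mYsQTQbL"])     # 4c/4c533ad7.jpg
print(path_by_id.get("unknown-id"))  # None, rather than an IndexError
```

If `image_id` values repeat, the dictionary keeps the last path seen, whereas `.values[0]` keeps the first, so the two approaches agree only when IDs are unique.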


## Batched Inference Loop

- Calculate number of batches (`math.ceil(len(df)/BATCH_SIZE)`).
- Disable gradients with `torch.no_grad()` for efficiency.
- Loop over data in batches, loading and validating images and questions.
- Skip empty batches or missing files, logging warnings.
- Preprocess batch with `processor`, run `model.generate()`, and decode outputs.
- Normalize predictions and ground truths (lowercase, strip, remove punctuation).
- Store results and original indices for later evaluation.
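
The batch count and the normalization step can be checked in isolation. The sketch below assumes the same regular expression as the inference loop and the dataset size and batch size reported earlier; `normalize_answer` is a hypothetical helper name.

```python
import math
import re


def normalize_answer(text) -> str:
    """Lowercase, strip surrounding whitespace, and drop punctuation."""
    cleaned = str(text).lower().strip()
    return re.sub(r"[^\w\s]", "", cleaned)


# Number of batches for 33866 samples at a batch size of 64.
num_samples, batch_size = 33866, 64
num_batches = math.ceil(num_samples / batch_size)

print(num_batches)                     # 530
print(normalize_answer("  Yellow. "))  # yellow
print(normalize_answer("T-shirt"))     # tshirt
```

Note that removing punctuation also joins hyphenated words, so both the prediction and the ground truth must pass through the same function for exact match to be meaningful.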


In [None]:
print("Running batched inference...")
predictions = []
ground_truths_normalized = [] # Store normalized ground truths for metrics
original_indices = []
num_batches = math.ceil(len(df) / BATCH_SIZE)

with torch.no_grad(): # Disable gradient calculations for inference
    for i in tqdm(range(0, len(df), BATCH_SIZE), total=num_batches, desc="Evaluating Batches"): #progress bar
        batch_df = df.iloc[i:i+BATCH_SIZE]  # rows of the current batch (explicit positional slice)
        batch_images_pil = []
        batch_questions = []
        batch_ground_truths = []
        batch_valid_indices = [] 

        # 1. Load images and collect data for the current batch
        for idx, row in batch_df.iterrows():
            imageId = row['id']
            question = row['question']
            pt = df1[df1['image_id']==imageId]
            pt = pt['path'].values[0]
            true_answer = str(row['answer']).lower().strip()
            img_path = os.path.join(IMAGE_BASE_DIR, pt)  # full path to the image for this row

            try:
                raw_image = Image.open(img_path).convert('RGB')  # read the image and append it to the batch
                batch_images_pil.append(raw_image)
                batch_questions.append(question)
                batch_ground_truths.append(true_answer)
                batch_valid_indices.append(idx) 
            except FileNotFoundError:
                print(f"Warning: Image not found at {img_path}. Skipping row {idx}.")
            except Exception as e:
                print(f"Warning: Error loading image {img_path} for row {idx}: {e}. Skipping.")

        # 2. Process the batch if any valid images were loaded
        if not batch_images_pil:
            print(f"Warning: No valid images loaded for batch starting at index {i}. Skipping batch.")
            continue # Skip to the next batch

        # Use the processor for the entire batch
        inputs = processor(images=batch_images_pil, text=batch_questions, return_tensors="pt", padding=True, truncation=True).to(DEVICE)

        # 3. Generate answers for the batch
        outputs = model.generate(**inputs, max_new_tokens=10)

        # 4. Decode and store results for the batch
        batch_preds_decoded = processor.batch_decode(outputs, skip_special_tokens=True)

        for pred_idx, original_df_idx in enumerate(batch_valid_indices):
            # Normalize prediction
            predicted_answer = batch_preds_decoded[pred_idx].strip().lower()
            predicted_answer = re.sub(r'[^\w\s]', '', predicted_answer) # Basic cleanup

            # Normalize corresponding ground truth
            true_answer_normalized = batch_ground_truths[pred_idx] # Already lowercased/stripped
            true_answer_normalized = re.sub(r'[^\w\s]', '', true_answer_normalized) # Basic cleanup

            predictions.append(predicted_answer)
            ground_truths_normalized.append(true_answer_normalized)
            original_indices.append(original_df_idx) # Store the original index
    

Running batched inference...


Evaluating Batches:   0%|          | 0/530 [00:00<?, ?it/s]

## Results DataFrame Construction
Aggregates the predictions into a new `results_df` DataFrame built from a dictionary of lists, merges it with the original `df` on the stored row index, and saves the result to CSV.
This structure standardizes the output for downstream analysis or export.

In [None]:
results_df = pd.DataFrame({
    'original_index': original_indices,
    'predicted_answer': predictions,
    'ground_truth_normalized': ground_truths_normalized
})
# Ensure the original df has a unique index if it was reset during sampling
df_with_results = df.merge(results_df, left_index=True, right_on='original_index', how='right') # right join to keep only processed rows
results_filename = 'vqa_results_baseline_batched.csv'
df_with_results.to_csv(results_filename, index=False)
print(f"Results saved to {results_filename}")

Results saved to vqa_results_baseline_batched.csv


In [None]:
results_df = pd.read_csv("../VR-mini-Proj-2/BLIP_vqa_results_baseline_batched.csv")

## Evaluations

#### Accuracy and F1

In [None]:
valid_predictions = results_df['predicted_answer'].to_list()
valid_ground_truths = results_df['ground_truth_normalized'].to_list()

if not valid_predictions:
    raise ValueError("No valid predictions available to calculate metrics.")

# 1. Accuracy (Exact Match)
correct_predictions = sum(p == gt for p, gt in zip(valid_predictions, valid_ground_truths))
total_valid = len(valid_predictions)
accuracy = correct_predictions / total_valid if total_valid > 0 else 0
print(f"Accuracy (Exact Match): {accuracy:.4f}")

Accuracy (Exact Match): 0.4248


In [None]:
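# NOTE: this reuses exact-match accuracy. For single-label predictions, accuracy equals
# micro-averaged F1, so the value below is not a true macro-averaged F1 despite the label.
f1_macro_simple = accuracy
print(f"F1 Score (Macro, based on Exact Match): {f1_macro_simple:.4f}")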

F1 Score (Macro, based on Exact Match): 0.4248
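
Exact-match accuracy and macro-averaged F1 generally differ, because macro F1 averages the per-class F1 over every distinct answer. The following is a minimal pure-Python sketch on made-up toy answers that treats each distinct answer string as a class; `exact_match_accuracy` and `macro_f1` are hypothetical helpers, not the metric code used in this notebook.

```python
def exact_match_accuracy(preds, golds):
    """Fraction of predictions that equal their ground truth exactly."""
    if not preds:
        return 0.0
    return sum(p == g for p, g in zip(preds, golds)) / len(preds)


def macro_f1(preds, golds):
    """Unweighted mean of per-class F1, one class per distinct answer string."""
    labels = set(preds) | set(golds)
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, invented for illustration.
toy_preds = ["yes", "yes", "red", "two"]
toy_golds = ["yes", "no", "red", "red"]

print(f"accuracy: {exact_match_accuracy(toy_preds, toy_golds):.4f}")  # 0.5000
print(f"macro F1: {macro_f1(toy_preds, toy_golds):.4f}")              # 0.3333
```

Here half of the answers match exactly, yet macro F1 is lower because the classes `no` and `two` each contribute an F1 of zero to the average.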


#### Computing BERTScore with model `distilbert-base-uncased`

In [None]:
bertscore = load("bertscore")
results = bertscore.compute(
    references=valid_ground_truths,
    predictions=valid_predictions,
    lang="en",
    model_type="distilbert-base-uncased",
    rescale_with_baseline=True,
)
print(results)

In [5]:
import numpy as np
# Mean BERTScore precision, recall, and F1 across all samples (printed in that order)
print(np.mean(results['precision']))
print(np.mean(results['recall']))
print(np.mean(results['f1']))

0.6075401236502055
0.6054488398469636
0.6065456961339067


## TF-IDF Embedding and Cosine Similarity

- **Objective**: Quantify similarity between model predictions and ground-truth answers using TF-IDF representations and cosine similarity.
- **Steps**:  
  1. Initialize a `TfidfVectorizer` and fit it on the combined text of valid predictions and ground truths.  
  2. Transform each set into dense vectors and convert to PyTorch tensors (`pred_vec`, `gt_vec`).  
  3. Compute row-wise cosine similarity (`F.cosine_similarity` with `dim=1`) between each prediction vector and its corresponding ground-truth vector.  
- **Output**: A tensor with one cosine similarity score per sample, indicating how closely each predicted answer matches its reference.  
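
One caveat: scikit-learn's `TfidfVectorizer` uses the default token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters. Single-character answers such as digits therefore map to all-zero vectors and receive a cosine similarity of 0 even when prediction and ground truth match exactly. The sketch below reproduces only that tokenization rule with the standard library (it does not compute TF-IDF weights); `default_tokens` is a hypothetical helper.

```python
import re

# Default `token_pattern` of scikit-learn's TfidfVectorizer:
# a token needs at least two word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(text: str) -> list:
    """Tokens that would survive the default pattern after lowercasing."""
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())


for answer in ["yes", "dark blue", "6", "a"]:
    print(repr(answer), "->", default_tokens(answer))
```

Passing a custom `token_pattern` such as `r"(?u)\b\w+\b"` to `TfidfVectorizer` is one way to keep single-character answers in the vocabulary.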


In [None]:


vectorizer = TfidfVectorizer()
all_sentences = valid_predictions + valid_ground_truths
vectorizer.fit(all_sentences)

vec1 = vectorizer.transform(valid_predictions).toarray()
vec2 = vectorizer.transform(valid_ground_truths).toarray()

pred_vec = torch.tensor(vec1, dtype=torch.float32)
gt_vec = torch.tensor(vec2, dtype=torch.float32)


cos_sim = F.cosine_similarity(pred_vec, gt_vec, dim=1)
print("Cosine similarity:", cos_sim)

Cosine similarity: tensor([1., 1., 0.,  ..., 1., 0., 0.])
