# Visual Question Answering (VQA)
This notebook implements a multi-modal architecture for VQA based on the [DAQUAR dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge/).

The objective of this project is to combine two modalities, images and text, and build a model that can answer natural-language questions about images.

In [2]:
# Imports

import os
import numpy as np
import torch
import torch.nn as nn

from PIL import Image
from datasets import load_dataset

from sklearn.metrics import accuracy_score, f1_score

# Set up device-agnostic code: prefer CUDA, then Apple MPS, then CPU

device = torch.device('cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu'))
print(f"Using device: {device}")

Using device: mps


## Dataset

The processed version of the dataset used in this project can be downloaded from Kaggle [here](https://www.kaggle.com/datasets/tezansahu/processed-daquar-dataset/data). The Kaggle download also contains the original dataset files.

### Original dataset
The original dataset has 3 files:
- `all_qa_pairs.txt`
- `train_images_list.txt`
- `test_images_list.txt`

### Processed dataset

The processed dataset is a cleaned-up version of the full DAQUAR dataset and contains the following files:
- `data.csv`: Processed dataset after normalizing all the questions and converting the data into a tabular format {question, answer, image_id}
- `data_train.csv`: Training data from `train_images_list.txt`
- `data_eval.csv`: Testing data from `test_images_list.txt`
- `answer_space.txt`: List of all possible answers extracted from `all_qa_pairs.txt`
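
Because `answer_space.txt` fixes the set of possible answers, each answer can be treated as a class index. The snippet below is a minimal sketch of that encoding on a made-up three-word vocabulary. The names `toy_answers`, `toy_ans_to_idx`, and `encode_answer` are hypothetical and used only for illustration; the sketch assumes that multi-answer entries are comma-separated and that unknown answers fall back to an `<unk>` class.

```python
# Toy illustration of mapping free-text answers onto a fixed answer space.
# The vocabulary below is made up; the real one comes from answer_space.txt.
toy_answers = ["chair", "table", "lamp"]
toy_ans_to_idx = {a: i for i, a in enumerate(toy_answers)}
toy_ans_to_idx.setdefault("<unk>", len(toy_ans_to_idx))  # fallback class


def encode_answer(raw: str, vocab: dict) -> int:
    """Return the class index of the first comma-separated answer."""
    first = str(raw).split(",")[0].strip().lower()
    return vocab.get(first, vocab["<unk>"])


print(encode_answer("Table", toy_ans_to_idx))        # known answer, case-insensitive
print(encode_answer("lamp, chair", toy_ans_to_idx))  # only the first answer is used
print(encode_answer("sofa", toy_ans_to_idx))         # not in the vocabulary
```

Keeping only the first answer turns the task into single-label classification, at the cost of discarding the remaining answers for multi-answer questions.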

In [3]:
# Set up directory paths

from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data"
IMG_DIR = DATA_DIR / "images"

TRAIN_CSV = DATA_DIR / "data_train.csv"
EVAL_CSV = DATA_DIR / "data_eval.csv"
ANS_TXT = DATA_DIR / "answer_space.txt"

for p in [DATA_DIR, IMG_DIR, TRAIN_CSV, EVAL_CSV, ANS_TXT]:
    assert p.exists(), f"Missing: {p}"

In [4]:
# Load CSVs + answer space and add labels

ds = load_dataset(
    "csv",
    data_files={"train": str(TRAIN_CSV), "test": str(EVAL_CSV)}
)

# answer space
answers = [line.strip().lower() for line in ANS_TXT.read_text(encoding="utf-8").splitlines() if line.strip()]
ans_to_idx = {a: i for i, a in enumerate(answers)}

if "<unk>" not in ans_to_idx:
    ans_to_idx["<unk>"] = len(ans_to_idx)
    answers.append("<unk>")

def _norm(a: str) -> str:
    return str(a).strip().lower()

# Use only the first answer when several comma-separated answers are given
def _to_label(a_raw: str) -> int:
    first = _norm(str(a_raw).split(",")[0])
    return ans_to_idx.get(first, ans_to_idx["<unk>"])

ds = ds.map(lambda b: {"label": [_to_label(a) for a in b["answer"]]}, batched=True)

# Map an image_id to its PNG path under IMG_DIR, failing early if the file is missing
def _resolve_path(image_id: str) -> str:
    stem = image_id if image_id.endswith(".png") else f"{image_id}.png"
    p = IMG_DIR / stem
    if not p.exists():
        raise FileNotFoundError(f"Not found: {p}")
    return str(p)

ds = ds.map(lambda ex: {"image_path": _resolve_path(ex["image_id"])})
print(ds)



DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'image_id', 'label', 'image_path'],
        num_rows: 9974
    })
    test: Dataset({
        features: ['question', 'answer', 'image_id', 'label', 'image_path'],
        num_rows: 2494
    })
})
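
Since answers outside the answer space are mapped to `<unk>`, it may be worth checking how often that fallback is used. The helper below is a sketch on toy data; `unk_fraction` and `toy_labels` are hypothetical names, not part of the notebook.

```python
from collections import Counter


def unk_fraction(labels, unk_idx):
    """Fraction of labels equal to the fallback index (0.0 for empty input)."""
    labels = list(labels)
    if not labels:
        return 0.0
    return Counter(labels)[unk_idx] / len(labels)


# Toy labels where index 3 plays the role of <unk>: 3 of 8 entries fall back.
toy_labels = [0, 2, 3, 3, 1, 3, 0, 2]
print(unk_fraction(toy_labels, unk_idx=3))  # 0.375
```

On the real data, a call along the lines of `unk_fraction(ds["train"]["label"], ans_to_idx["<unk>"])` would presumably give the same statistic for the training split.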
