In this notebook, a **VQA** (visual question answering) model is implemented using the **PyTorch** library.

- Question features are extracted using
  - **Word2Vec or FastText Embeddings**
  - **LSTM layers**
- Image features are available in the dataset.
- The question and image features are fused with
  - **Cross attention** (with VisualBERT)
- The correct answer is predicted with a Dense layer.

**Best Validation Accuracy: 0.881**
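
The pipeline described above can be sketched at the level of tensor shapes. This is only an illustration under stated assumptions: `FusionSketch` is a hypothetical name, and a single `nn.TransformerEncoderLayer` stands in for the pretrained VisualBERT encoder so that the sketch runs without downloading weights. The example uses small hidden sizes to stay fast; the notebook itself uses 768 and 2048.

```python
import torch
from torch import nn


class FusionSketch(nn.Module):
    """Shape-level stand-in for the fusion model (not the notebook's VisualBERT)."""

    def __init__(self, text_dim=300, image_dim=512, hidden=768, visual_dim=2048,
                 n_regions=8, max_length=8, n_answers=10, n_heads=8):
        super().__init__()
        self.n_regions = n_regions
        # Question branch: word vectors -> LSTM states
        self.lstm = nn.LSTM(text_dim, hidden, batch_first=True)
        # Image branch: one feature vector -> n_regions pseudo-regions
        self.image_linear = nn.Linear(image_dim, n_regions * visual_dim)
        self.visual_projection = nn.Linear(visual_dim, hidden)
        # Assumption: a generic transformer layer replaces the pretrained encoder
        self.encoder = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads, batch_first=True)
        self.classifier = nn.Linear(hidden * (n_regions + max_length), n_answers)

    def forward(self, text, image):
        text, _ = self.lstm(text)                                    # (B, L, hidden)
        regions = self.image_linear(image)                           # (B, R * visual_dim)
        regions = regions.view(image.shape[0], self.n_regions, -1)   # (B, R, visual_dim)
        regions = self.visual_projection(regions)                    # (B, R, hidden)
        fused = self.encoder(torch.cat([text, regions], dim=1))      # (B, L + R, hidden)
        return self.classifier(fused.flatten(start_dim=1))           # (B, n_answers) logits


model = FusionSketch(hidden=64, visual_dim=32)
questions = torch.randn(4, 8, 300)   # batch of 4 questions, 8 tokens, 300-dim vectors
images = torch.randn(4, 512)         # batch of 4 image feature vectors
logits = model(questions, images)
print(logits.shape)
```

The output has one row per question and one column per candidate answer.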


# Imports

In [2]:
import gensim.downloader as api
import pandas as pd
import torch
import pickle
from torch import nn
import torchtext
import numpy as np
import json
# from google.colab import drive
import nltk
nltk.download('stopwords')
import string
from nltk.corpus import stopwords
from transformers import VisualBertModel



[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Loading data

## Connecting to drive

In [3]:
# drive.mount('/content/gdrive/', force_remount=True)
# base_path = '/content/gdrive/My Drive/iust/miniVQA/'
base_path = '/kaggle/input/minivqaiust/'
output_path = '/kaggle/working/'

## Setting up GPU

In [27]:
if torch.cuda.is_available():
  device = torch.device("cuda")
else:
  device = torch.device("cpu")
device

device(type='cuda')

## Reading data

### Answers

In [4]:
all_answers = [ 'surfboard', 'eating', 'cake', 'table', 'hat', 'giraffe', 'broccoli', 'woman', 'sunny', 'apple']

### Image features

In [5]:
with open(base_path + 'image_features.pickle', 'rb') as f:
    image_features = pickle.load(f)

### Questions

In [6]:
with open(base_path + 'image_question.json', 'r') as f:
  img_to_q_dict = json.load(f)
  questions = []
  for img_id, img_qs in img_to_q_dict.items():
    for img_q in img_qs:
      q_id, q_text = img_q
      questions.append({
        'q_id': q_id,
        'q_text': q_text,
        'img_id': img_id
      })

questions = sorted(questions, key= lambda q: q['q_id'])

### Subsets

In [7]:
train_csv = pd.read_csv(base_path + 'train.csv', index_col="question_id").sort_index()
train_csv.head()

train_csv["question_text"] = [q["q_text"] for q in questions if q['q_id'] in train_csv.index.values]
train_csv["image_id"] = [q["img_id"] for q in questions if q['q_id'] in train_csv.index.values]


train_q = train_csv["question_text"].values.tolist()
train_a = torch.from_numpy(train_csv["label"].values)


In [8]:
valid_csv = pd.read_csv(base_path + 'val.csv', index_col="question_id").sort_index()
valid_csv.head()

valid_csv["question_text"] = [q["q_text"] for q in questions if q['q_id'] in valid_csv.index.values]
valid_csv["image_id"] = [q["img_id"] for q in questions if q['q_id'] in valid_csv.index.values]


valid_q = valid_csv["question_text"].values.tolist()
valid_a = torch.from_numpy(valid_csv["label"].values)


In [9]:
test_csv = pd.read_csv(base_path + 'test.csv', index_col="question_id").sort_index()
test_csv.head()

test_csv["question_text"] = [q["q_text"] for q in questions if q['q_id'] in test_csv.index.values]
test_csv["image_id"] = [q["img_id"] for q in questions if q['q_id'] in test_csv.index.values]


test_q = test_csv["question_text"].values.tolist()
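
The cells above attach question text to each CSV by relying on both the question list and the CSV index being sorted by question id. A lookup keyed by id does not depend on ordering. The sketch below uses plain Python and made-up records; `question_by_id` and `lookup` are hypothetical names, not part of the notebook.

```python
# Hypothetical records with the same keys as the `questions` list above
questions = [
    {"q_id": 30, "q_text": "what is on the table", "img_id": "img_b"},
    {"q_id": 10, "q_text": "what is the man eating", "img_id": "img_a"},
    {"q_id": 20, "q_text": "what animal is this", "img_id": "img_a"},
]

# Index the records by question id once
question_by_id = {q["q_id"]: q for q in questions}


def lookup(question_ids, field):
    """Return `field` for each id, in the order of `question_ids`."""
    return [question_by_id[q_id][field] for q_id in question_ids]


subset_ids = [20, 10]  # e.g. the index of one CSV split, in any order
subset_text = lookup(subset_ids, "q_text")
subset_images = lookup(subset_ids, "img_id")
print(subset_text, subset_images)
```

With a pandas frame, the same lists could be assigned to the `question_text` and `image_id` columns after calling `lookup` on the frame's index values.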

# Create word embeddings layer

## Download model

In [10]:
embedding_model_name = "word2vec-google-news-300"
# embedding_model_name = "fasttext-wiki-news-subwords-300"

In [11]:
embedding_model = api.load(embedding_model_name)



IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)








## Preprocess questions

### Delete stopwords and punctuation

In [12]:
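stop_words = stopwords.words('english')
punc = string.punctuation

def delete_extra(text_array):
  """Remove English stop words and punctuation from each question string."""
  new_array = []
  for t in text_array:
    new_t = t
    # Only stop words surrounded by spaces are removed, so the first and last words are kept
    for s in stop_words:
      new_t = new_t.replace(f" {s} ", " ")
    for p in punc:
      new_t = new_t.replace(p, "")
    new_array.append(new_t)
  return new_array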
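
Because `delete_extra` matches stop words as space-delimited substrings, a stop word at the very start of a question (or one that is capitalized) is left in place. A token-based variant is one possible alternative. The sketch below is not the notebook's method: `delete_extra_tokens` is a hypothetical name, the stop word set is a small hand-written subset rather than the NLTK list, and lowercasing is an added assumption.

```python
import string

# Hypothetical subset of stop words, used only to keep the sketch self-contained
STOP_WORDS = {"what", "is", "the", "on", "a"}


def delete_extra_tokens(texts, stop_words=STOP_WORDS):
    """Strip punctuation, lowercase, and drop stop words token by token."""
    strip_punctuation = str.maketrans("", "", string.punctuation)
    cleaned = []
    for text in texts:
        tokens = text.lower().translate(strip_punctuation).split()
        cleaned.append(" ".join(tok for tok in tokens if tok not in stop_words))
    return cleaned


cleaned = delete_extra_tokens(["What is the man eating?"])
print(cleaned)  # ['man eating']
```

Switching to this variant would change the token sequences fed to the model, so the accuracies reported in this notebook would not carry over unchanged.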


### Set up word embeddings layer

In [49]:
max_length = 8

In [50]:
class WordEmbeddings(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, text):
        return self.embedding(text)

In [51]:
# Tokenize
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

# Create embedding layer
embed_size = len(embedding_model.get_vector('hello'))
word_embeddings = WordEmbeddings(
    vocab_size = len(embedding_model.index_to_key) + 1,
    embed_dim = embed_size,
)
word_embeddings.embedding.weight.data[0] = torch.zeros(embed_size)
word_embeddings.embedding.weight.data[1:] = torch.from_numpy(embedding_model.vectors)
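
The same layout (row 0 reserved for padding, pretrained vectors shifted up by one) can also be built with `nn.Embedding.from_pretrained`. This is a minimal sketch with a made-up 4-word, 3-dimensional table in place of the real word2vec vectors; freezing the table is an assumption, not something the notebook does.

```python
import numpy as np
import torch
from torch import nn

# Hypothetical pretrained table: 4 words, 3 dimensions
vectors = np.arange(12, dtype=np.float32).reshape(4, 3)

# Prepend a zero row so that index 0 can serve as padding
weights = torch.cat([torch.zeros(1, vectors.shape[1]), torch.from_numpy(vectors)])
embedding = nn.Embedding.from_pretrained(weights, freeze=True, padding_idx=0)

ids = torch.tensor([[1, 4, 0]])   # first word, last word, padding
out = embedding(ids)              # shape (1, 3, 3)
print(out)
```

With `freeze=True` the lookup result does not require gradients, so precomputed embeddings built this way carry no autograd graph.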

In [52]:
def encode(x):
  return [embedding_model.get_index(token, default=-1) + 1 for token in tokenizer(x)]

In [53]:
def padify(xs, l = max_length):
    encoded_x = [encode(x) for x in xs]
    return torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in encoded_x])
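
The encoding convention used by `encode` and `padify` can be illustrated without the embedding model: known tokens get their vocabulary index plus one, while unknown tokens and padding both map to 0. The sketch below uses plain Python lists and a made-up two-word vocabulary; `encode_tokens` and `pad_or_truncate` are hypothetical helper names.

```python
def encode_tokens(tokens, vocab):
    """Map tokens to ids; index 0 is shared by padding and unknown tokens."""
    return [vocab.get(tok, -1) + 1 for tok in tokens]


def pad_or_truncate(ids, length, pad_id=0):
    """Right-pad with pad_id, or cut off extra ids, to reach exactly `length`."""
    return (ids + [pad_id] * length)[:length]


vocab = {"man": 0, "eating": 1}  # hypothetical vocabulary
ids = encode_tokens(["man", "eating", "zzz"], vocab)
padded = pad_or_truncate(ids, 5)
truncated = pad_or_truncate(ids, 2)
print(ids, padded, truncated)  # [1, 2, 0] [1, 2, 0, 0, 0] [1, 2]
```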

In [54]:
# Apply on train
train_q_embeddings = word_embeddings(
    padify(delete_extra(train_q))
)
print('Train q embeddings size:', train_q_embeddings.shape)

train_img = torch.Tensor([image_features[img_id] for img_id in train_csv["image_id"].values])
print('Train image features shape:', train_img.shape)

Train q embeddings size: torch.Size([780, 8, 300])
Train image features shape: torch.Size([780, 512])


In [55]:
# Apply on valid
valid_q_embeddings = word_embeddings(
    padify(delete_extra(valid_q))
)
print('Valid q embeddings size:', valid_q_embeddings.shape)

valid_img = torch.Tensor([image_features[img_id] for img_id in valid_csv["image_id"].values])
print('Valid image features shape:', valid_img.shape)

Valid q embeddings size: torch.Size([110, 8, 300])
Valid image features shape: torch.Size([110, 512])


In [56]:
# Apply on test
test_q_embeddings = word_embeddings(
    padify(delete_extra(test_q))
)
print('Test q embeddings size:', test_q_embeddings.shape)

test_img = torch.Tensor([image_features[img_id] for img_id in test_csv["image_id"].values])
print('Test image features shape:', test_img.shape)

Test q embeddings size: torch.Size([110, 8, 300])
Test image features shape: torch.Size([110, 512])


## Create dataset and dataloader

In [57]:
train_dataset = torch.utils.data.TensorDataset(train_q_embeddings, train_img, train_a)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

In [58]:
valid_dataset = torch.utils.data.TensorDataset(valid_q_embeddings, valid_img, valid_a)
valid_dataloader = torch.utils.data.DataLoader(valid_dataset, batch_size=64, shuffle=True)

# Build model

In [59]:
# Free memory
del image_features
del questions
import gc
_ = gc.collect()  # call the function; a bare `gc.collect` only references it

In [89]:
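class MiniVQA(nn.Module):
    def __init__(self, text_features = 300, image_features = 512, n_image_regions = 8):
        super().__init__()
        self.n_image_regions = n_image_regions
        # NOTE: batch_first keeps its default (False), so the LSTM treats dim 0 of the
        # (batch, seq, feature) input as the time axis; batch_first=True would recur over tokens
        self.lstms = nn.LSTM(text_features, 768, num_layers=1)
        # Expand the single image vector into n_image_regions pseudo-regions of size 2048,
        # the input size of VisualBERT's visual projection
        self.image_linear = nn.Linear(image_features, n_image_regions * 2048)
        self.cross_attn = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
        # Freeze the pretrained VisualBERT weights
        for param in self.cross_attn.parameters():
            param.requires_grad = False

        self.linears = nn.Sequential(
            nn.Linear(768 * (n_image_regions + max_length), 10),
        )

    def forward(self, text, image):
        text = self.lstms(text)[0]
        image = self.image_linear(image)
        image = torch.reshape(image, (image.shape[0], self.n_image_regions, -1))

        features = self.cross_attn(
            inputs_embeds = text,
            visual_embeds = image
        ).last_hidden_state
        features = torch.flatten(features, start_dim=1)
        logits = self.linears(features)
        # NOTE: nn.CrossEntropyLoss applies log-softmax itself, so returning probabilities
        # means softmax is applied twice during training; returning `logits` would avoid that
        return nn.functional.softmax(logits, dim=1)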
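
`MiniVQA.forward` returns softmax probabilities, while `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally. The sketch below, which uses made-up logits for a single 10-class example, shows the effect: even a maximally confident correct prediction cannot push the loss below about 1.46 once probabilities are fed in. That floor, `log(e + 9) - 1`, may explain why the training loss in this notebook levels off near 1.6.

```python
import math

import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()
target = torch.tensor([3])

# Hypothetical logits: very confident and correct for class 3 of 10
logits = torch.full((1, 10), -20.0)
logits[0, 3] = 20.0

loss_from_logits = loss_fn(logits, target).item()

# Feeding probabilities instead applies softmax twice
probs = torch.softmax(logits, dim=1)
loss_from_probs = loss_fn(probs, target).item()

# Best case for 10 classes when the inputs are confined to [0, 1]
floor = math.log(math.e + 9) - 1

print(f"logits: {loss_from_logits:.4f} | probabilities: {loss_from_probs:.4f} | floor: {floor:.4f}")
```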


In [98]:
miniVQA = MiniVQA()
miniVQA.to(device)

MiniVQA(
  (lstms): LSTM(300, 768)
  (image_linear): Linear(in_features=512, out_features=16384, bias=True)
  (cross_attn): VisualBertModel(
    (embeddings): VisualBertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=1)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (visual_token_type_embeddings): Embedding(2, 768)
      (visual_position_embeddings): Embedding(512, 768)
      (visual_projection): Linear(in_features=2048, out_features=768, bias=True)
    )
    (encoder): VisualBertEncoder(
      (layer): ModuleList(
        (0-11): 12 x VisualBertLayer(
          (attention): VisualBertAttention(
            (self): VisualBertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
  

# Train model

## Define constants

In [100]:
learning_rate = 4e-4
epochs = 10

## Define train loop

In [101]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(miniVQA.parameters(), lr=learning_rate)

In [103]:
def pred_val(model, dataloader):
  size = len(dataloader.dataset)
  correct = 0
  # NOTE: accumulated as a sum over batches and never divided; with 110 samples and
  # batch size 64 this is the sum of two batch losses
  avg_loss = 0
  # No gradients are needed for evaluation
  with torch.no_grad():
    for batch, (text, image, y) in enumerate(dataloader):
      pred = model(text.to(device), image.to(device))
      loss = loss_fn(pred, y.to(device))
      output = [torch.argmax(o).item() for o in pred]
      correct += (torch.FloatTensor(output) == y).float().sum()
      avg_loss += loss.item()
  acc = correct / size
  return avg_loss, correct, acc
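
For comparison, an evaluation loop that reports a per-sample mean loss and computes accuracy without a Python-level loop over predictions might look like the sketch below. It is an illustration only: `evaluate` is a hypothetical helper, the model takes a single input rather than the notebook's (text, image) pair, and the tiny identity-weight classifier and data are made up.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def evaluate(model, dataloader, loss_fn, device="cpu"):
    """Return (mean loss per sample, accuracy) over the whole dataloader."""
    model.eval()
    total_loss, correct = 0.0, 0
    for x, y in dataloader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        # Weight each batch loss by its size so that a smaller last batch is handled correctly
        total_loss += loss_fn(logits, y).item() * len(y)
        correct += (logits.argmax(dim=1) == y).sum().item()
    n = len(dataloader.dataset)
    return total_loss / n, correct / n


# Hypothetical 2-class model whose prediction is the larger of its two inputs
model = nn.Linear(2, 2, bias=False)
with torch.no_grad():
    model.weight.copy_(torch.eye(2))

data = TensorDataset(
    torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
    torch.tensor([0, 1, 1]),  # the last label is deliberately wrong for this model
)
mean_loss, accuracy = evaluate(model, DataLoader(data, batch_size=2), nn.CrossEntropyLoss())
print(mean_loss, accuracy)
```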

In [102]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    correct = 0
    avg_loss = 0
    for batch, (text, image, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(text.to(device), image.to(device))
        loss = loss_fn(pred, y.to(device))
        # Backpropagation
        optimizer.zero_grad()
        # retain_graph=True appears to be needed because the question embeddings were
        # computed once, outside the model, and every batch backpropagates through that graph
        loss.backward(retain_graph=True)
        optimizer.step()

        output = [torch.argmax(o).item() for o in pred]
        correct += (torch.FloatTensor(output) == y).float().sum()
        avg_loss += loss.item()

    avg_loss /= len(dataloader)  # mean loss per batch
    acc = correct / size
    val_loss, val_correct, val_acc = pred_val(model, valid_dataloader)
    print(f"training / loss: {avg_loss:>7f} | accuracy: {acc}")
    print(f"val / loss: {val_loss:>7f} | accuracy: {val_acc}")

In [104]:
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, miniVQA, loss_fn, optimizer)
print("Done!")

Epoch 1
-------------------------------
training / loss: 2.311465 | accuracy: 0.09358974546194077
val / loss: 4.620685 | accuracy: 0.10000000149011612
Epoch 2
-------------------------------
training / loss: 2.276808 | accuracy: 0.18076923489570618
val / loss: 4.413741 | accuracy: 0.23636363446712494
Epoch 3
-------------------------------
training / loss: 2.112650 | accuracy: 0.3346153795719147
val / loss: 4.081356 | accuracy: 0.4000000059604645
Epoch 4
-------------------------------
training / loss: 2.030220 | accuracy: 0.4346153736114502
val / loss: 4.167796 | accuracy: 0.3909091055393219
Epoch 5
-------------------------------
training / loss: 2.020676 | accuracy: 0.4256410300731659
val / loss: 3.950049 | accuracy: 0.4909090995788574
Epoch 6
-------------------------------
training / loss: 1.956710 | accuracy: 0.5243589878082275
val / loss: 3.808668 | accuracy: 0.5636363625526428
Epoch 7
-------------------------------
training / loss: 1.884315 | accuracy: 0.5833333134651184
val /

In [106]:
for t in range(epochs, epochs+5):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, miniVQA, loss_fn, optimizer)
print("Done!")

Epoch 11
-------------------------------
training / loss: 1.708471 | accuracy: 0.75
val / loss: 3.309568 | accuracy: 0.7909091114997864
Epoch 12
-------------------------------
training / loss: 1.637171 | accuracy: 0.8269230723381042
val / loss: 3.279270 | accuracy: 0.8363636136054993
Epoch 13
-------------------------------
training / loss: 1.627418 | accuracy: 0.8307692408561707
val / loss: 3.200510 | accuracy: 0.8818181753158569
Epoch 14
-------------------------------
training / loss: 1.595506 | accuracy: 0.8602564334869385
val / loss: 3.246480 | accuracy: 0.8545454740524292
Epoch 15
-------------------------------
training / loss: 1.584935 | accuracy: 0.8769230842590332
val / loss: 3.208555 | accuracy: 0.8545454740524292
Done!


# Predict

In [107]:
# Inference only, so gradients are not tracked
with torch.no_grad():
    pred = miniVQA(test_q_embeddings.to(device), test_img.to(device))
output = np.array([torch.argmax(o).item() for o in pred], dtype='int64')
df = pd.DataFrame({
    'question_id': sorted(test_csv.index.values),
    'label': output
})
print(df.head())
# output_path already ends with '/'
df.to_csv(output_path + 'minivqa-v3.2-submission.csv', index=False)

   question_id  label
0       144000      1
1       436017      7
2       706000      8
3      1497002      8
4      1518004      7
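
To read the predictions as words, the labels can be mapped back to answer strings. The sketch below assumes that each label is an index into the `all_answers` list defined earlier; the notebook does not state this mapping explicitly, so treat it as an assumption. The labels are the five shown in the output above.

```python
all_answers = ['surfboard', 'eating', 'cake', 'table', 'hat',
               'giraffe', 'broccoli', 'woman', 'sunny', 'apple']

# Labels of the first five test predictions, copied from the output above
labels = [1, 7, 8, 8, 7]

# Assumption: label i corresponds to all_answers[i]
answers = [all_answers[label] for label in labels]
print(answers)  # ['eating', 'woman', 'sunny', 'sunny', 'woman']
```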


# Save model

In [108]:
torch.save(miniVQA.state_dict(), output_path + 'minivqa3_v2_weights.pth')
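
Restoring the saved weights later follows the usual `state_dict` round trip: build a model with the same architecture, then load the file into it. The sketch below uses a small `nn.Linear` as a hypothetical stand-in for `MiniVQA` and a temporary directory in place of `output_path`.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 2)  # hypothetical stand-in for MiniVQA

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weights.pth")
    torch.save(model.state_dict(), path)

    # Rebuild the same architecture, then load the stored parameters into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()  # switch to inference mode before predicting
print(torch.equal(model.weight, restored.weight))
```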
