# Visual Question Answering: Part I

### Baseline Approach: A Bag of Words Model

This notebook simply runs the code that builds a baseline VQA model from a basic `Neural Network (Multilayer Perceptron) + Bag of Words`. I highly encourage you to read the [full post here](https://sominwadhwa.github.io/blog/2018/01/01/de/) first.
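
To make the bag-of-words idea concrete, here is a minimal sketch. It assumes a toy embedding table in place of the spaCy vectors and a random vector in place of real VGG features; `TOY_VECTORS`, `question_to_bow`, and the tiny dimensions are hypothetical stand-ins, not the helpers in `src.features`.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors; the notebook uses 300-d spaCy vectors.
TOY_VECTORS = {
    "what": np.array([1.0, 0.0, 0.0, 0.0]),
    "color": np.array([0.0, 1.0, 0.0, 0.0]),
    "is": np.array([0.0, 0.0, 1.0, 0.0]),
    "the": np.array([0.0, 0.0, 0.0, 1.0]),
    "cat": np.array([1.0, 1.0, 0.0, 0.0]),
}

def question_to_bow(question, vectors, dim=4):
    """Sum the vectors of all known words; word order is discarded."""
    total = np.zeros(dim)
    for token in question.lower().rstrip("?").split():
        total += vectors.get(token, np.zeros(dim))
    return total

q_vec = question_to_bow("What color is the cat?", TOY_VECTORS)
img_vec = np.random.rand(6)              # stand-in for a 4096-d VGG feature vector
mlp_input = np.hstack((q_vec, img_vec))  # question and image features side by side
print(q_vec, mlp_input.shape)
```

Because the word vectors are summed, "What color is the cat?" and "Is the cat what color?" map to the same question vector, which is exactly the simplification a bag-of-words baseline makes.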

<p align="center">
  <img src="https://github.com/sominwadhwa/sominwadhwa.github.io/blob/master/assets/vqa/5.jpg?raw=true"/>
</p>

**Let's get all the necessary library imports**

In [1]:
import sys, warnings
warnings.filterwarnings("ignore")
from random import shuffle, sample
import pickle as pk
import gc

import numpy as np
import pandas as pd
import scipy.io
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras.utils import np_utils, generic_utils
from progressbar import Bar, ETA, Percentage, ProgressBar    
from keras.models import model_from_json
from sklearn.preprocessing import LabelEncoder
import spacy
#from spacy.en import English

from src.utils import *
from src.features import *

Using TensorFlow backend.


## Preprocessed Data

The open-source VQA dataset contains multiple open-ended questions about various images. All my experiments were performed with v1 of the dataset (though I've processed v2 of the dataset as well), which contains:

- 82,783 training images from the COCO (Common Objects in Context) dataset.
- 215,407 question-answer pairs for training images.
- 40,504 validation images for running your own tests.
- 121,512 question-answer pairs for validation images.

In [2]:
training_questions = open("preprocessed/v1/ques_train.txt","rb").read().decode('utf8').splitlines()
answers_train      = open("preprocessed/v1/answer_train.txt","rb").read().decode('utf8').splitlines()
images_train       = open("preprocessed/v1/images_coco_id.txt","rb").read().decode('utf8').splitlines()
img_ids            = open('preprocessed/v1/coco_vgg_IDMap.txt').read().splitlines()
vgg_path           = "/floyd/input/vqa_data/coco/vgg_feats.mat"

Let's look at a few questions along with their answers. The first entry in each tuple is the **COCO Image ID**, through which the image can be found at [http://cocodataset.org/#explore](http://cocodataset.org/#explore) by entering the ID in the **search** field.

In [3]:
sample(list(zip(images_train, training_questions, answers_train)), 5)

[('354220', 'What are the elephants doing?', 'walking'),
 ('306440', 'What city is this?', 'new york'),
 ('68576', 'What are they riding in?', 'airplane'),
 ('384023', 'Is the train going through a city?', 'no'),
 ('269273', 'What nationality is the person in the picture?', 'asian')]

In [4]:
%time nlp = spacy.load("en_core_web_md")
print ("Loaded WordVec")

CPU times: user 23.4 s, sys: 808 ms, total: 24.2 s
Wall time: 24.3 s
Loaded WordVec


Load the image features: `4096`-dimensional vectors extracted from the last layer of a VGG network trained on the COCO Dataset.

In [5]:
%time vgg_features = scipy.io.loadmat(vgg_path)
img_features = vgg_features['feats']
id_map = dict()
print ("Loaded VGG Weights")

CPU times: user 9.53 s, sys: 2.57 s, total: 12.1 s
Wall time: 12.1 s
Loaded VGG Weights


In [6]:
gc.collect()

6

In [7]:
upper_lim = 1000 #Number of most frequently occurring answers in COCOVQA (Covering >85% of the total data)
training_questions, answers_train, images_train = freq_answers(training_questions, answers_train, images_train, upper_lim)
print (len(training_questions), len(answers_train),len(images_train))

215407 215407 215407
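
`freq_answers` lives in `src.utils` and is not shown here. A minimal sketch of the filtering idea, assuming it keeps only the triples whose answer is among the `upper_lim` most common ones, could look like this; `keep_frequent_answers` is a hypothetical name and the real helper may differ in detail.

```python
from collections import Counter

def keep_frequent_answers(questions, answers, images, upper_lim):
    """Keep only the triples whose answer is among the `upper_lim` most common."""
    top = {ans for ans, _ in Counter(answers).most_common(upper_lim)}
    kept = [(q, a, i) for q, a, i in zip(questions, answers, images) if a in top]
    if not kept:
        return [], [], []
    q_kept, a_kept, i_kept = zip(*kept)
    return list(q_kept), list(a_kept), list(i_kept)

questions = ["q1", "q2", "q3", "q4", "q5"]
answers = ["yes", "no", "yes", "blue", "no"]
images = ["1", "2", "3", "4", "5"]

q, a, i = keep_frequent_answers(questions, answers, images, upper_lim=2)
print(a)  # ['yes', 'no', 'yes', 'no']
```

Restricting the answer vocabulary this way turns open-ended answering into classification over a fixed set of classes.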


In [8]:
lbl = LabelEncoder()
lbl.fit(answers_train)
nb_classes = len(list(lbl.classes_))
pk.dump(lbl, open('preprocessed/v1/label_encoder_mlp.sav','wb'))
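
For the softmax output, each answer string has to become a class index and then a one-hot row. The sketch below illustrates that idea with plain NumPy in place of `LabelEncoder` (which likewise sorts the unique labels); it is an illustration under those assumptions, not the notebook's `get_answers_sum`.

```python
import numpy as np

answers = ["yes", "no", "blue", "yes"]
classes = sorted(set(answers))               # ['blue', 'no', 'yes']
class_to_idx = {c: i for i, c in enumerate(classes)}

indices = np.array([class_to_idx[a] for a in answers])  # [2, 1, 0, 2]
one_hot = np.eye(len(classes))[indices]      # one row per answer, shape (4, 3)

# Decoding a prediction is the reverse lookup, as inverse_transform does.
decoded = [classes[i] for i in one_hot.argmax(axis=1)]
print(one_hot)
print(decoded)
```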

### Defining the Network Architecture

In [9]:
num_hidden_units  = 1024
num_hidden_layers = 3
batch_size        = 128
dropout           = 0.5
activation        = 'tanh'
img_dim           = 4096
word2vec_dim      = 300

`num_epochs`: Set to the number of epochs you'd wish to run the network for.

`log_interval`: This parameter sets the epoch interval after which a copy of the model weights will be saved.

In [10]:
num_epochs = 1
log_interval = 1

In [11]:
for ids in img_ids:
    id_split = ids.split()
    id_map[id_split[0]] = int(id_split[1])
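
The map built above ties each COCO image ID to a position in the VGG feature matrix. As a rough sketch of how a batch of image features might then be assembled, assume one image per column; that layout, the toy IDs, and the helper name `images_to_matrix` are assumptions for illustration, not the actual `get_images_matrix`.

```python
import numpy as np

# Toy stand-ins: 3-d features for 4 images, one image per column.
img_features = np.arange(12, dtype=float).reshape(3, 4)
id_map_lines = ["9 0", "25 1", "30 2", "34 3"]   # "<coco_id> <column>"

id_map = {}
for line in id_map_lines:
    coco_id, column = line.split()
    id_map[coco_id] = int(column)

def images_to_matrix(coco_ids, id_map, features):
    """Stack one feature column per requested image into a (batch, dim) matrix."""
    columns = [id_map[c] for c in coco_ids]
    return features[:, columns].T

batch = images_to_matrix(["30", "9"], id_map, img_features)
print(batch)
```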

In [12]:
model = Sequential()
model.add(Dense(num_hidden_units, input_dim=word2vec_dim+img_dim, kernel_initializer='uniform'))
model.add(Dropout(dropout))
for i in range(num_hidden_layers):
    model.add(Dense(num_hidden_units, kernel_initializer='uniform'))
    model.add(Activation(activation))
    model.add(Dropout(dropout))
model.add(Dense(nb_classes, kernel_initializer='uniform'))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
#tensorboard = TensorBoard(log_dir='/output/Graph', histogram_freq=0, write_graph=True, write_images=True)
model.summary()

Instructions for updating:
keep_dims is deprecated, use keepdims instead
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1024)              4502528   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1024)              1049600   
_________________________________________________________________
activation_1 (Activation)    (None, 1024)              0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1024)              1049600
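
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-output pair plus one bias per output unit. A short check, where `dense_params` is a hypothetical helper written only for this illustration:

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

word2vec_dim, img_dim, num_hidden_units = 300, 4096, 1024

first = dense_params(word2vec_dim + img_dim, num_hidden_units)
hidden = dense_params(num_hidden_units, num_hidden_units)
print(first, hidden)  # 4502528 1049600
```

The first layer dominates because its input is the 4,396-wide concatenation of question and image features.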

In [13]:
model_dump = model.to_json()
open('baseline_mlp'  + '.json', 'w').write(model_dump)

3150

Since I've already run these experiments once, it makes sense to reuse the model I trained back then. Here I load the weights saved after the 99th epoch of that run and simply continue training from them.

### You may **skip** this step if you wish to build your model from scratch!

**And we're good to go!**

In [None]:
for k in range(num_epochs):
    index_shuffle = list(range(len(training_questions)))
    shuffle(index_shuffle)
    training_questions = [training_questions[i] for i in index_shuffle]
    answers_train = [answers_train[i] for i in index_shuffle]
    images_train = [images_train[i] for i in index_shuffle]
    progbar = generic_utils.Progbar(len(training_questions))
    for ques_batch, ans_batch, im_batch in zip(grouped(training_questions, batch_size, 
                                                       fillvalue=training_questions[-1]), 
                                               grouped(answers_train, batch_size, 
                                                       fillvalue=answers_train[-1]), 
                                               grouped(images_train, batch_size, fillvalue=images_train[-1])):
        %time X_ques_batch = get_questions_sum(ques_batch, nlp)
        %time X_img_batch = get_images_matrix(im_batch, id_map, img_features)
        X_batch = np.hstack((X_ques_batch, X_img_batch))
        Y_batch = get_answers_sum(ans_batch, lbl)
        #loss = model.train_on_batch(X_batch, Y_batch,callbacks= [tensorboard])
        %time loss = model.train_on_batch(X_batch, Y_batch)
        progbar.add(batch_size, values=[('train loss', loss)])

    if k%log_interval == 0:
        model.save_weights("weights/MLP" + "_epoch_{:02d}.hdf5".format(k))
model.save_weights("weights/MLP" + "_epoch_{:02d}.hdf5".format(k))

CPU times: user 2.31 s, sys: 116 ms, total: 2.43 s
Wall time: 1.8 s
CPU times: user 1.91 ms, sys: 307 µs, total: 2.22 ms
Wall time: 1.86 ms
CPU times: user 1.61 s, sys: 224 ms, total: 1.83 s
Wall time: 1.17 s
   128/215407 [..............................] - ETA: 5011s - train loss: 7.2509
CPU times: user 2.29 s, sys: 119 ms, total: 2.41 s
Wall time: 1.81 s
CPU times: user 2.02 ms, sys: 485 µs, total: 2.5 ms
Wall time: 1.75 ms
CPU times: user 598 ms, sys: 136 ms, total: 734 ms
Wall time: 216 ms
   256/215407 [..............................] - ETA: 4219s - train loss: 6.5618
CPU times: user 2.33 s, sys: 122 ms, total: 2.45 s
Wall time: 1.8 s
CPU times: user 2.05 ms, sys: 470 µs, total: 2.52 ms
Wall time: 1.74 ms
CPU times: user 602 ms, sys: 128 ms, total: 730 ms
Wall time: 218 ms
   384/215407 [..............................] - ETA: 3946s - train loss: 6.7655
CPU times: user 2.33 s, sys: 122 ms, total: 2.45 s
Wall time: 1.9 s
CPU times: user 2 ms, sys: 462 µs, total: 2.46 ms
Wall time: 1.76

KeyboardInterrupt: 

CPU times: user 2.25 ms, sys: 226 µs, total: 2.47 ms
Wall time: 1.92 ms
CPU times: user 602 ms, sys: 122 ms, total: 724 ms
Wall time: 281 ms
  1280/215407 [..............................] - ETA: 3718s - train loss: 7.3880
CPU times: user 2.52 s, sys: 137 ms, total: 2.66 s
Wall time: 2.21 s
CPU times: user 2.25 ms, sys: 585 µs, total: 2.84 ms
Wall time: 1.93 ms
CPU times: user 600 ms, sys: 136 ms, total: 735 ms
Wall time: 236 ms
  1408/215407 [..............................] - ETA: 3751s - train loss: 7.1890
CPU times: user 2.3 s, sys: 121 ms, total: 2.43 s
Wall time: 1.84 s
CPU times: user 2.02 ms, sys: 455 µs, total: 2.47 ms
Wall time: 1.72 ms
CPU times: user 611 ms, sys: 138 ms, total: 750 ms
Wall time: 220 ms
  1536/215407 [..............................] - ETA: 3725s - train loss: 6.9902
CPU times: user 2.69 s, sys: 149 ms, total: 2.84 s
Wall time: 2.29 s
CPU times: user 2.08 ms, sys: 557 µs, total: 2.64 ms
Wall time: 2.33 ms
CPU times: user 587 ms, sys: 137 ms, total: 724 ms
Wall tim

KeyboardInterrupt: 

CPU times: user 2.67 ms, sys: 405 µs, total: 3.07 ms
Wall time: 2.49 ms
CPU times: user 604 ms, sys: 137 ms, total: 741 ms
Wall time: 241 ms
  2048/215407 [..............................] - ETA: 3655s - train loss: 6.5601
CPU times: user 2.45 s, sys: 145 ms, total: 2.59 s
Wall time: 2.09 s
CPU times: user 1.9 ms, sys: 497 µs, total: 2.4 ms
Wall time: 2.26 ms
CPU times: user 603 ms, sys: 133 ms, total: 736 ms
Wall time: 263 ms
  2176/215407 [..............................] - ETA: 3671s - train loss: 6.4393
CPU times: user 2.43 s, sys: 141 ms, total: 2.57 s
Wall time: 1.94 s
CPU times: user 2.06 ms, sys: 472 µs, total: 2.53 ms
Wall time: 1.76 ms
CPU times: user 605 ms, sys: 127 ms, total: 732 ms
Wall time: 250 ms
  2304/215407 [..............................] - ETA: 3668s - train loss: 6.3537
CPU times: user 2.47 s, sys: 141 ms, total: 2.61 s
Wall time: 2.05 s
CPU times: user 3.27 ms, sys: 718 µs, total: 3.99 ms
Wall time: 3.72 ms
CPU times: user 589 ms, sys: 129 ms, total: 718 ms
Wall time
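
The training loop batches its inputs with a `grouped` helper from `src.utils`, which this notebook does not show. One common way to write such a helper, assuming it follows the standard `itertools` "grouper" recipe, is sketched below; the real implementation may differ.

```python
from itertools import zip_longest

def grouped(iterable, n, fillvalue=None):
    """Yield tuples of length n, padding the last one with fillvalue."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

questions = ["q1", "q2", "q3", "q4", "q5"]
batches = list(grouped(questions, 2, fillvalue=questions[-1]))
print(batches)  # [('q1', 'q2'), ('q3', 'q4'), ('q5', 'q5')]
```

Padding with an existing example keeps every batch the same size, at the cost of repeating that example in the final batch.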

# Let's evaluate our model!

We're going to evaluate our model on the validation set provided by the **VQA Dataset**, which I've already preprocessed much like our training data.

In [5]:
model = model_from_json(open('baseline_mlp.json').read())
model.load_weights('weights/MLP_epoch_99.hdf5') #Pass in your weights file
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

print ("Model Loaded with Weights")
model.summary()

Model Loaded with Weights
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1024)              4502528   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1024)              1049600   
_________________________________________________________________
activation_1 (Activation)    (None, 1024)              0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1024)              1049600   
_________________________________________________________________
activation_2 (Activation)    (None, 1024)         

**Loading the validation preprocessed data**

In [29]:
val_imgs = open('preprocessed/v1/val_images_coco_id.txt','rb').read().decode('utf-8').splitlines()
val_ques = open('preprocessed/v1/ques_val.txt','rb').read().decode('utf-8').splitlines()
val_ans  = open('preprocessed/v1/answer_val.txt','rb').read().decode('utf-8').splitlines()

In [30]:
label_encoder = pk.load(open('preprocessed/v1/label_encoder_mlp.sav','rb')) # The encoder file written during training

In [None]:
y_pred = []
batch_size = 128 

#print ("Word2Vec Loaded!")

widgets = ['Evaluating ', Percentage(), ' ', Bar(marker='#',left='[',right=']'), ' ', ETA()]
pbar = ProgressBar(widgets=widgets)
#i=1

In [None]:
for qu_batch,an_batch,im_batch in pbar(zip(grouped(val_ques, batch_size, fillvalue=val_ques[0]), grouped(val_ans, batch_size, fillvalue=val_ans[0]), grouped(val_imgs, batch_size, fillvalue=val_imgs[0]))):
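    # Use the same summed word-vector featurizer as in training, so the input stays 300 + 4096 wide
    X_q_batch = get_questions_sum(qu_batch, nlp)
    # Pass the feature matrix (vgg_features['feats']), not the raw dict returned by loadmat
    X_i_batch = get_images_matrix(im_batch, id_map, img_features)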
    X_batch = np.hstack((X_q_batch, X_i_batch))
    y_predict = model.predict_classes(X_batch, verbose=0)
    y_pred.extend(label_encoder.inverse_transform(y_predict))
    #print (i,"/",len(val_ques))
    #i+=1
    #print(label_encoder.inverse_transform(y_predict))


In [32]:
correct_val = 0.0
total = 0
f1 = open('res.txt','w')

for pred, truth, ques, img in zip(y_pred, val_ans, val_ques, val_imgs):
    t_count = 0
    for _truth in truth.split(';'):
        if pred == _truth: # Compare against each individual answer, not the whole ';'-joined string
            t_count += 1 
    if t_count >=2:
        correct_val +=1
    else:
        correct_val += float(t_count)/3

    total +=1

    try:
        f1.write(str(ques))
        f1.write('\n')
        f1.write(str(img))
        f1.write('\n')
        f1.write(str(pred))
        f1.write('\n')
        f1.write(str(truth))
        f1.write('\n')
        f1.write('\n')
    except:
        pass

print ("Accuracy: ", round((correct_val/total)*100,2))
f1.write('Final Accuracy is ' + str(round((correct_val/total)*100,2)))
f1.close()

Accuracy:  48.74
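
The scoring rule in the loop above gives full credit when the prediction matches at least two of the `;`-separated human answers and partial credit (a third per match) otherwise. A self-contained sketch of that rule on made-up data, where `vqa_score` and the sample answers are hypothetical:

```python
def vqa_score(pred, truth, sep=";"):
    """Score one prediction against `sep`-separated human answers."""
    matches = sum(1 for t in truth.split(sep) if t == pred)
    return 1.0 if matches >= 2 else matches / 3

preds = ["yes", "blue", "2"]
truths = ["yes;yes;no", "red;blue;green", "3;4;3"]

scores = [vqa_score(p, t) for p, t in zip(preds, truths)]
accuracy = round(sum(scores) / len(scores) * 100, 2)
print(scores, accuracy)
```

Here the first prediction earns full credit, the second earns a third, and the last earns nothing, which averages out to 44.44.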


There you go, all set to participate in the next VQA Challenge!

If, however, you would like to try out these models on your own custom images, check out **`src/test.py`** with an image and a characteristic question.