# VQA on CLEVR Dataset

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Creation of the vocabulary
First, I built the vocabulary with the `Tokenizer` class: I loaded the training set and fitted the tokenizer on it to obtain the list of unique words.
The vocabulary has 71 words and is used to create the datasets.
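
As an illustration of what fitting a tokenizer produces, here is a minimal pure-Python sketch. It is a stand-in, not the `Tokenizer` used in the notebook: the helper names `build_word_index` and `encode`, the toy questions and the alphabetical tie-breaking are all assumptions made for the example.

```python
import re
from collections import Counter

def build_word_index(questions):
    """Map each word to an integer index, most frequent first (0 is left free for padding)."""
    counts = Counter()
    for q in questions:
        counts.update(re.findall(r"[a-z0-9]+", q.lower()))
    # Sort by decreasing frequency; ties are broken alphabetically (an assumption of this sketch)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def encode(question, word_index):
    """Convert a question into a list of word indices, skipping unknown words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [word_index[w] for w in words if w in word_index]

# Toy questions standing in for the CLEVR training questions
questions = ["Is the cube red?", "Is the sphere large?"]
word_index = build_word_index(questions)
encoded = encode("Is the red sphere large?", word_index)
print(word_index)
print(encoded)
```

Index 0 is never assigned to a word, which is why the model below sizes its embedding as `MAX_NUM_WORDS+1` and can use `mask_zero=True`.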

## Creation of the datasets

The dataset has more than 250k questions, each of which must be associated with its image and answer. I put a lot of effort into optimizing the input pipeline; otherwise, even on a GPU, training would have taken far too long.

At first I tried fetching the data with a generator, along these lines:

    while True: 
        for i in range(batch_size): 
            ... load data here ...
            ... preprocess data here ...
            yield [question, image], answer

However, this approach cannot load samples in parallel, which is exactly what is needed to speed up data fetching. 
I then switched to `tf.data`, the `tensorflow` input pipeline API. This let me build datasets that load and preprocess batches of data in parallel and can be passed directly to `model.fit()`. 
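
To see why parallel loading matters for I/O-bound work, the sketch below compares a sequential loop with a thread pool from the standard library. It is not `tf.data`: `load_sample` is a hypothetical loader, and the `time.sleep` call only simulates reading an image from disk.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_sample(index):
    """Hypothetical loader: the sleep simulates reading an image from disk."""
    time.sleep(0.01)
    return index * 2

indices = list(range(16))

# Sequential loading, as in a plain generator: one sample at a time
start = time.perf_counter()
sequential = [load_sample(i) for i in indices]
sequential_time = time.perf_counter() - start

# Parallel loading: several samples are fetched concurrently, order is preserved
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(load_sample, indices))
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.3f}s, parallel: {parallel_time:.3f}s")
```

Both versions return the same samples in the same order; only the waiting time is overlapped in the parallel one.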

The pipeline follows these steps (for train/validation datasets): 
* load the whole list of data from the JSON file (as a dictionary)
* shuffle all the entries of the dictionary (but, obviously, keeping the triple {question, answer, img_filepath} together)
* split train and validation data
* preprocess _all_ the data:
    - answers were encoded as one-hot vectors;
    - questions were tokenized; 
    - the image file path was converted to a full path (from the root); the image itself is loaded a few steps later; 
* create the datasets using `tf.data.Dataset.from_tensor_slices`. 
----
A note: even though there are many questions, Python handles the text preprocessing quite well, finishing it in a couple of seconds. 
I therefore decided to do all the text processing in advance so as not to delay the loading of the images (surely the bottleneck of the pipeline).
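
The text-side preprocessing steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the entries are toy data, the 80/20 split ratio and the right-padding with zeros are my choices for the example, and the helpers `one_hot` and `pad` are hypothetical names, not functions from `vqa_utils`.

```python
import random

SENTENCE_LENGTH = 6  # toy value; the notebook pads questions to 42 tokens
NUM_CLASSES = 3      # toy value; the notebook has 13 answers

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

def pad(sequence, length):
    """Right-pad with zeros (the padding index), truncating if too long."""
    return (sequence + [0] * length)[:length]

# Hypothetical entries: tokenized question, answer index, image file name
entries = [
    {"question": [1, 2, i + 3], "answer": i % NUM_CLASSES, "image": f"img_{i}.png"}
    for i in range(10)
]

rng = random.Random(1234)
rng.shuffle(entries)  # each {question, answer, image} triple moves as a unit

split = int(0.8 * len(entries))  # assumed 80/20 train/validation split
train, valid = entries[:split], entries[split:]

train_questions = [pad(e["question"], SENTENCE_LENGTH) for e in train]
train_answers = [one_hot(e["answer"], NUM_CLASSES) for e in train]
print(train_questions[0], train_answers[0], train[0]["image"])
```

Shuffling the list of dictionaries, rather than three separate lists, is what keeps each question aligned with its answer and image.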

---
Once the datasets are created, I performed the following operations: 
* additional shuffling (not _strictly_ necessary since the data has already been shuffled before the train/validation split)
* calling the `.repeat()` method, which allows cycling through the dataset across the epochs
* mapping the dataset through a `parse_function` that loads the data in parallel. `dataset.map` accepts a number of parallel calls for applying the function, which is what makes the pipeline efficient. The `parse_function` loads the images as follows: 
    - load the images using `tf.io.read_file(img_filenames)`
    - decode the png format using `tf.image.decode_png(img, channels=3)`
    - convert the images to `float32` type and perform proper resizing
* divide into batches
* prefetch some batches in memory, depending on the system's performance

Using the `tf.data` dataset I managed to improve the efficiency of the training stage by nearly 50%. With shallow models a single epoch took less than half an hour, while for more complex models, with transfer learning and fine-tuning, a single epoch took a full hour. 

In [None]:
import tensorflow
import vqa_utils as vqa

# set the seed and check for GPU using the proper functions
vqa.set_seed(1234)
vqa.check_gpu()

batch_size = 8
# define fitting parameters, max_queue_size, steps_per_epoch, validation_steps according to the batch size.
max_que,e_steps,v_steps = vqa.fit_param(batch_size)

sentence_tokenizer = vqa.create_tokens()

train_dataset = vqa.get_dataset(sentence_tokenizer, batch_size)
valid_dataset = vqa.get_dataset(sentence_tokenizer, batch_size, is_training=False)
test_dataset = vqa.test_dataset(sentence_tokenizer)

## Model Creation 

I report here my best model for this task. Since the training set is big enough, I decided to use transfer learning only for extracting features from the images; the Embedding layer is trained from scratch. 

The basic structure of the network is the same for all the models I tried. It is a Y-shaped network with two inputs and a single output: one input takes the tokenized (and padded) questions and the other takes the images. After the embedding/LSTM and the feature extraction, the two branches are joined and classified with a fully connected (FC) network.

For the extraction of the image features I opted for transfer learning with fine tuning. I experimented with
- VGG16
- ResNet50
- DenseNet

pre-trained on the ImageNet dataset. 

Even though vanilla ResNet and DenseNet gave better results than VGG16, they are also deeper and more complex networks, which made fine-tuning hardly feasible. I kept the VGG16 model because it is big enough to extract meaningful features while still leaving room to fine-tune its weights on this particular dataset (CLEVR).

The final model is constructed in the following way: 
- Embedding and LSTM for the question; 
- VGG16 with the last 5 layers fine-tuned for feature extraction, followed by a `GlobalAveragePooling2D` layer; 

Global average pooling was used to extract a feature vector from the output of the CNN. It was chosen over the `Flatten()` alternative because of its regularizing effect and because it reduces the overall number of parameters.
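
The size of that reduction can be estimated with a little arithmetic. The sketch below assumes the 480x320 input used in this notebook, the standard VGG16 convolutional base (five 2x2 max-pooling layers, 512 output channels) and the 512-unit `Dense` layer that follows the pooling; `dense_params` is a hypothetical helper.

```python
IMG_WIDTH, IMG_HEIGHT = 480, 320
VGG16_STRIDE = 32     # five 2x2 max-pooling layers: 2**5
VGG16_CHANNELS = 512  # channels of the last convolutional block
DENSE_UNITS = 512     # size of the Dense layer after the pooling

# Spatial size of the feature map produced by the convolutional base
feat_w = IMG_WIDTH // VGG16_STRIDE
feat_h = IMG_HEIGHT // VGG16_STRIDE

def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

flatten_features = feat_w * feat_h * VGG16_CHANNELS  # Flatten keeps every position
gap_features = VGG16_CHANNELS                        # pooling keeps one value per channel

flatten_params = dense_params(flatten_features, DENSE_UNITS)
gap_params = dense_params(gap_features, DENSE_UNITS)

print(f"feature map: {feat_w}x{feat_h}x{VGG16_CHANNELS}")
print(f"Dense after Flatten: {flatten_params:,} parameters")
print(f"Dense after global average pooling: {gap_params:,} parameters")
```

Under these assumptions the `Dense` layer after the pooling needs about 150 times fewer parameters than it would after `Flatten()`.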

The two representations are then mapped to arrays of shape (512,) using `Dense` layers with `linear` activation and merged by pointwise multiplication. I also implemented a similar model that merges by concatenation, but I found that pointwise multiplication worked better. 

At this point I built 3 FC classifiers in parallel, each with a different number of units and a different initialization seed. This lets me mimic an ensemble method with a single network trained end to end. It is the best trade-off given the limited GPU availability on the Kaggle platform.

The results of the classifiers are averaged into a single layer. 
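
A small numeric sketch of this averaging step, in plain Python. The logits are invented for illustration and only 4 classes are used instead of 13; `softmax` is a local helper, not the Keras layer.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of the three classifier heads for one sample
heads = [
    softmax([2.0, 0.5, 0.1, -1.0]),  # head 1 prefers class 0
    softmax([1.0, 1.2, 0.0, -0.5]),  # head 2 slightly prefers class 1
    softmax([1.5, 0.2, 0.3, 0.0]),   # head 3 prefers class 0
]

# Element-wise mean over the heads, as an averaging layer would compute
averaged = [sum(column) / len(heads) for column in zip(*heads)]
print([round(p, 3) for p in averaged])
```

The mean of several probability vectors is still a probability vector, so the averaged output can be used directly with a categorical loss.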

---
As loss function I used a focal loss, since it penalizes misclassified points more than correctly classified ones. This helps to deal with the large class imbalance in the training data. 
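
To make the effect concrete, here is a per-sample sketch of the usual focal loss formula, $-\alpha (1 - p_t)^{\gamma} \log p_t$, compared with plain cross-entropy. The implementation in `vqa.focal_loss` is not shown in this notebook, so `gamma=2.0` and the function name `focal_loss_single` are assumptions of this example.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one sample, given the probability of the true class."""
    return -math.log(p_true)

def focal_loss_single(p_true, alpha=1.0, gamma=2.0):
    """Focal loss for one sample: cross-entropy scaled by (1 - p_true) ** gamma."""
    return alpha * (1.0 - p_true) ** gamma * cross_entropy(p_true)

easy, hard = 0.9, 0.1  # well-classified vs. misclassified sample

for name, p in [("easy", easy), ("hard", hard)]:
    ce = cross_entropy(p)
    fl = focal_loss_single(p)
    print(f"{name}: cross-entropy={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.2f}")
```

With these values the easy sample keeps only 1% of its cross-entropy loss while the hard one keeps 81%, so training is dominated by the samples the model still gets wrong.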

In [None]:
MAX_NUM_WORDS = 71
SENTENCE_LENGTH = 42
NUM_ANS = 13
IMG_WIDTH = 480
IMG_HEIGHT = 320

def get_model(): 
    
    def clf_net(activation, SEED):
        """
        Build one classification head on top of the merged features.
        Parameters:
        activation: number of units in the Dense layer
        SEED: seed for the kernel initializer of the Dense layer
        """
        init_x = tensorflow.keras.initializers.glorot_normal(seed=SEED)
        
        x = tensorflow.keras.layers.Dense(activation,activation='relu', kernel_initializer=init_x)(merge_layer)
        x = tensorflow.keras.layers.Dropout(0.5)(x)
        answer = tensorflow.keras.layers.Dense(NUM_ANS, activation='softmax')(x)
        return answer

    # QUESTION BRANCH    
    question_input = tensorflow.keras.layers.Input(shape=(SENTENCE_LENGTH,), name='input_qst')
    x = tensorflow.keras.layers.Embedding(MAX_NUM_WORDS+1, output_dim=50, mask_zero=True)(question_input)
    x = tensorflow.keras.layers.LSTM(units=256)(x)
    x = tensorflow.keras.layers.Dropout(0.3)(x)
    question_output = tensorflow.keras.layers.Dense(units=512, activation='linear')(x)
    
    # IMAGES BRANCH 
    base_model = tensorflow.keras.applications.VGG16(include_top=False, weights='imagenet')
    
    # fine tune the last 5 layers
    for layer in base_model.layers[:-5]:
        layer.trainable = False

    images_input = tensorflow.keras.layers.Input(shape=(IMG_WIDTH,IMG_HEIGHT,3), name='input_img')
    x = base_model(images_input)
    x = tensorflow.keras.layers.GlobalAveragePooling2D()(x)
    images_output = tensorflow.keras.layers.Dense(units=512, activation='linear')(x)

    # MERGING 
    merge_layer = tensorflow.keras.layers.multiply([question_output,images_output])
    
    # FC CLASSIFIER
    answer_x = clf_net(512, SEED=13)
    answer_y = clf_net(256, SEED=37)
    answer_z = clf_net(128, SEED=89)
    
    # OUTPUT
    output_ = tensorflow.keras.layers.average([answer_x, answer_y, answer_z])

    model = tensorflow.keras.models.Model([question_input, images_input], output_)

    loss = vqa.focal_loss(alpha=1.)
    optimizer = tensorflow.keras.optimizers.Adam(learning_rate=5e-4)
    metrics = [
      tensorflow.keras.metrics.CategoricalAccuracy(name='acc'),
      tensorflow.keras.metrics.Precision(name='prec'),
      tensorflow.keras.metrics.Recall(name='rec'),
      tensorflow.keras.metrics.AUC(name='auc'),
    ]

    model.compile(loss=loss, optimizer=optimizer, metrics=metrics)
    return model

model = get_model()
tensorflow.keras.utils.plot_model(model)

All the hyperparameters were chosen by babysitting the model, running it several times to improve its performance. In particular, I focused on: 
- the output dimension of the embedding layer;
- the number of units of the LSTM (I tried 8,18,32,...,512 and found the best performance/epoch-time trade-off at 256);
- the dimensions of the arrays in the merging layer;
- the learning rate (initially 1e-3, then scaled down to 5e-4).

All of these adjustments were carefully tailored with performance and training time in mind.

## Training 

Here is the training procedure. First, I defined some callbacks (using the utility function) for the fitting procedure, most importantly EarlyStopping, which stops the training to avoid overfitting, and then fitted the model. 

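Notice that I also set the following arguments of `model.fit()`: 
* `max_queue_size`, the number of batches to keep in memory, according to the values defined above (the number was tailored with lots of trial and error/crashes); 
* `workers`, the number of threads.

According to the Keras documentation, these arguments (together with `use_multiprocessing`) are used only for generator or `keras.utils.Sequence` inputs, so with a `tf.data` dataset the parallelism comes from the `map` and `prefetch` steps of the pipeline itself.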
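
Because the datasets are repeated indefinitely with `.repeat()`, the length of an epoch has to be given explicitly through `steps_per_epoch` and `validation_steps`. The helper `vqa.fit_param` is not shown here; the sketch below is one plausible way to derive the step counts from the batch size, and the sample counts are hypothetical placeholders rather than the real split sizes.

```python
import math

def steps(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)

batch_size = 8
num_train = 200_000  # hypothetical size of the training split
num_valid = 50_000   # hypothetical size of the validation split

e_steps = steps(num_train, batch_size)
v_steps = steps(num_valid, batch_size)
print(e_steps, v_steps)
```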

In [None]:
weig = vqa.get_class_weights()
call = vqa.callb()

model.fit(
    train_dataset,
    epochs=50,
    steps_per_epoch=e_steps, 
    validation_data=valid_dataset,
    validation_steps=v_steps,
    use_multiprocessing=True, 
    max_queue_size=max_que,
    callbacks = call,
    workers=8, 
    class_weight=weig,
)

## Generating the results

Results are generated by calling `model.predict()` on the test dataset and applying an `argmax` to each output vector of predictions in order to select the most probable answer. 
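
The `argmax` step can be illustrated in plain Python. The prediction vectors below are made up and use 3 classes instead of 13; `vqa.make_prediction` itself is not shown, so this is only a sketch of the selection logic.

```python
def argmax(values):
    """Index of the largest element of a list."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical output of model.predict(): one probability vector per question
predictions = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]

predicted_classes = [argmax(p) for p in predictions]
print(predicted_classes)
```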

In [None]:
results = vqa.make_prediction(model,test_dataset)