This is the code repository for "Joint embedding VQA model based on dynamic word vector", in which the model uses a pre-trained ELMo model to replace the static word embeddings commonly used by other VQA models.
You will need a machine with at least one GPU (>= 8GB memory), 20GB of RAM, and 50GB of free disk space. We strongly recommend using an SSD drive to guarantee high-speed I/O.
You should first install some necessary packages.
- Install Python >= 3.5
- Install Cuda >= 9.0 and cuDNN
- Install all required packages as follows:
$ pip install -r requirements.txt
The image features are extracted using the bottom-up-attention strategy, with each image being represented as a dynamic number (from 10 to 100) of 2048-D features. We store the features for each image in a .npz file. You can prepare the visual features by yourself or download the extracted features from OneDrive or BaiduYun. The download contains three files: train2014.tar.gz, val2014.tar.gz, and test2015.tar.gz, corresponding to the features of the train/val/test images of VQA-v2, respectively. You should place them as follows:
|-- datasets
|-- coco_extract
| |-- train2014.tar.gz
| |-- val2014.tar.gz
| |-- test2015.tar.gz
Besides, we use the VQA samples from the Visual Genome dataset to expand the training samples. Similar to existing strategies, we preprocess the samples with two rules (a rough filtering sketch follows the list):
- Select the QA pairs whose corresponding images appear in the MSCOCO train and val splits.
- Select the QA pairs whose answers appear in the processed answer list (i.e., answers occurring more than 8 times among all VQA-v2 answers).
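The repository ships its own preprocessing, but a minimal sketch of how such a filter could be written (the file layout and field names such as coco_image_id and answer are assumptions for illustration, not the repository's exact code):

```python
import json

def filter_vg_pairs(vg_qa_path, coco_image_ids, answer_list):
    """Keep VG QA pairs whose image is in the MSCOCO train/val splits
    and whose answer occurs in the processed VQA-v2 answer list."""
    with open(vg_qa_path) as f:
        qa_pairs = json.load(f)
    return [
        pair for pair in qa_pairs
        if pair['coco_image_id'] in coco_image_ids   # rule 1: image appears in COCO train/val
        and pair['answer'] in answer_list            # rule 2: answer is in the processed answer list
    ]
```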
For convenience, we provide our processed VG questions and annotations files; you can download them from OneDrive or BaiduYun and place them as follows:
|-- datasets
|-- vqa
| |-- VG_questions.json
| |-- VG_annotations.json
We use a pre-trained ELMo model to obtain language features. The weights and options files should be placed as follows:
|-- utils
|-- elmo
| |-- elmo_2x4096_512_2048cnn_2xhighway_options.json
| |-- elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5
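These option/weight files are the standard AllenNLP ELMo artifacts; a minimal loading sketch using the AllenNLP Elmo module (not necessarily the exact code path used in this repository, and with an illustrative example sentence) could look like:

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = 'utils/elmo/elmo_2x4096_512_2048cnn_2xhighway_options.json'
weight_file = 'utils/elmo/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5'

# One weighted combination of the ELMo layers, no dropout.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

sentences = [['What', 'color', 'is', 'the', 'dog', '?']]
character_ids = batch_to_ids(sentences)            # (batch, seq_len, 50) character ids
outputs = elmo(character_ids)
embeddings = outputs['elmo_representations'][0]    # (batch, seq_len, 1024) contextual word vectors
```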
After that, you can run the following script to set up all the needed configurations for the experiments:
$ sh setup.sh
Running the script will:
- Download the QA files for VQA-v2.
- Unzip the bottom-up features
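If you cannot run setup.sh on your system, a rough Python equivalent of the unzipping step (assuming the archives unpack into per-split folders as shown in the tree below) is:

```python
import os
import tarfile

extract_dir = 'datasets/coco_extract'
for split in ('train2014', 'val2014', 'test2015'):
    archive = os.path.join(extract_dir, split + '.tar.gz')
    with tarfile.open(archive, 'r:gz') as tar:
        # Assumed to produce datasets/coco_extract/<split>/*.npz
        tar.extractall(extract_dir)
```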
Finally, the datasets folder will have the following structure:
|-- datasets
|-- coco_extract
| |-- train2014
| | |-- COCO_train2014_...jpg.npz
| | |-- ...
| |-- val2014
| | |-- COCO_val2014_...jpg.npz
| | |-- ...
| |-- test2015
| | |-- COCO_test2015_...jpg.npz
| | |-- ...
|-- vqa
| |-- v2_OpenEnded_mscoco_train2014_questions.json
| |-- v2_OpenEnded_mscoco_val2014_questions.json
| |-- v2_OpenEnded_mscoco_test2015_questions.json
| |-- v2_OpenEnded_mscoco_test-dev2015_questions.json
| |-- v2_mscoco_train2014_annotations.json
| |-- v2_mscoco_val2014_annotations.json
| |-- VG_questions.json
| |-- VG_annotations.json
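Each COCO_*.jpg.npz file above holds the bottom-up features for one image. A minimal sketch for inspecting one of them (the array key stored inside the archive is not specified here, so it is looked up at runtime):

```python
import glob

import numpy as np

# Pick any extracted feature file from the train split.
feat_path = glob.glob('datasets/coco_extract/train2014/*.npz')[0]
feats = np.load(feat_path)
print(feats.files)              # array keys stored in the archive
x = feats[feats.files[0]]
print(x.shape)                  # roughly (num_boxes, 2048), with 10 to 100 boxes per image
```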
The following script will start training with the default hyperparameters:
$ python3 run.py --RUN='train'
All checkpoint files will be saved to:
ckpts/ckpt_<VERSION>/epoch<EPOCH_NUMBER>.pkl
and the training log file will be placed at:
results/log/log_run_<VERSION>.txt
You can add the following optional arguments (an example command follows the list):

- --VERSION=str, e.g. --VERSION='small_model', to assign a name to your model.
- --GPU=str, e.g. --GPU='2', to train the model on a specified GPU device.
- --NW=int, e.g. --NW=8, to accelerate I/O speed with more data-loading workers.
- --MODEL={'small', 'large'} to choose the model size. (Warning: the large model consumes more GPU memory; Multi-GPU Training or Gradient Accumulation may help if you want to train it with limited GPU memory.)
- --SPLIT={'train', 'train+val', 'train+val+vg'} to combine the training datasets as you want. The default training split is 'train+val+vg'. Setting --SPLIT='train' will trigger the evaluation script to compute the validation score after every epoch automatically.
- --RESUME=True to resume training from saved checkpoint parameters. In this case, you should also assign the checkpoint version --CKPT_V=str and the resumed epoch number --CKPT_E=int.
- --MAX_EPOCH=int to stop training at a specified epoch number.
- --PRELOAD=True to pre-load all the image features into memory during the initialization stage. (Warning: this needs an extra 25~30GB of memory and about 30 minutes of loading time from an HDD drive.)
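For example, to train the small model on GPU 0 with 8 data-loading workers under a custom version name (the values are illustrative):
$ python3 run.py --RUN='train' --VERSION='small_model' --GPU='0' --NW=8 --MODEL='small'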
We recommend using a GPU with at least 8GB of memory. If you don't have such a device, don't worry: we provide two ways to deal with it:
- Multi-GPU Training: if you want to accelerate training or train the model on a device with limited GPU memory, you can use more than one GPU. Add --GPU='0, 1, 2, 3...' and the batch size on each GPU will be adjusted to BATCH_SIZE/#GPUs automatically.
- Gradient Accumulation: if you only have one GPU with less than 8GB of memory, an alternative strategy is to use gradient accumulation during training. Add --ACCU=n to make the optimizer accumulate gradients for n small batches and update the model weights at once (see the sketch after this list). Note that BATCH_SIZE must be divisible by n to run this mode correctly.
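For intuition, a minimal PyTorch-style sketch of gradient accumulation in general (not the repository's exact training loop; all names are illustrative):

```python
def train_epoch(model, loader, optimizer, criterion, accu_steps=2):
    """Accumulate gradients over accu_steps small batches, then update once,
    which emulates an effective batch size of accu_steps * small_batch_size."""
    optimizer.zero_grad()
    for step, (img_feat, ques_ix, answer) in enumerate(loader):
        pred = model(img_feat, ques_ix)
        loss = criterion(pred, answer) / accu_steps   # scale so the summed gradient matches the large batch
        loss.backward()                               # gradients add up in the .grad buffers
        if (step + 1) % accu_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```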
Warning: if you trained the model with the --MODEL argument or with multi-GPU training, the same settings must also be passed during evaluation.
Offline evaluation only supports the VQA 2.0 val split. If you want to evaluate on the VQA 2.0 test-dev or test-std split, please see Online Evaluation.
There are two ways to start:
(Recommended)
$ python3 run.py --RUN='val' --CKPT_V=str --CKPT_E=int
or use the absolute path instead:
$ python3 run.py --RUN='val' --CKPT_PATH=str
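For example, using the version name assigned during training and the epoch of the checkpoint you want to evaluate (the values are illustrative):
$ python3 run.py --RUN='val' --CKPT_V='small_model' --CKPT_E=13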
The evaluations of both the VQA 2.0 test-dev and test-std splits are run as follows:
$ python3 run.py --RUN='test' --CKPT_V=str --CKPT_E=int
Result files are stored in results/result_test/result_run_<'PATH+random number' or 'VERSION+EPOCH'>.json
In our experiments, we compare six models: baseline (random), baseline (w2v), baseline (glove), N-KBSN(s), N-KBSN(m), and N-KBSN(l).
To further explore the best ELMo parameters, three pre-trained ELMo models with different parameter counts are selected in this experiment: N-KBSN(s), N-KBSN(m), and N-KBSN(l). Their parameter counts, LSTM hidden sizes, output sizes, and ELMo sizes are shown in the table below.
Model | Parameters (M) | LSTM Size | Output Size | ELMO Size |
---|---|---|---|---|
ELMO(s) | 13.6 | 1024 | 128 | 256 |
ELMO(m) | 28.0 | 2048 | 256 | 512 |
ELMO(l) | 93.6 | 4096 | 512 | 1024 |
Statistics of several word vectors are shown in the table.
Name | Pre-training corpus (size) | Word vector dimension | Number of word vectors |
---|---|---|---|
word2vec | Google News (100 billion words) | 300 | 3 million |
GloVe | Wikipedia 2014 + Gigaword 5 (6 billion words) | 300 | 400 thousand |
ELMO(s) | __ | 256 | __ |
ELMO(m) | WMT 2011 (800 million words) | 512 | __ |
ELMO(l) | __ | 1024 | __ |
The performance of these models on the val split is reported as follows:
Model | Overall | Yes/No | Number | Other |
---|---|---|---|---|
baseline(random) | 62.34 | 78.77 | 41.92 | 55.27 |
baseline (w2v) | 64.37 | 81.89 | 44.51 | 56.31 |
baseline (glove) | 66.73 | 84.56 | 49.52 | 57.72 |
N-KBSN(s) | 67.27 | 84.76 | 49.31 | 58.73 |
N-KBSN(m) | 67.55 | 85.03 | 49.62 | 59.01 |
N-KBSN(l) | 67.72 | 85.22 | 49.63 | 59.20 |
This project received a lot of help from the open-source MCAN-VQA project. If you are interested in the GloVe version of VQA, you can find a good example here.