Code for a class project of EECS 598/498: Deep Learning at the University of Michigan, Winter 2019.
Some code is borrowed from this PyTorch implementation of Multi-modal Factorized Bilinear Pooling (MFB) for VQA. The code for extracting bottom-up top-down (BUTD) features is adapted from the official implementation.
Python 3.6 and PyTorch 1.0 are required.
```bash
# tensorboardX
pip install tensorboardX

# pytorch
conda install pytorch torchvision cudatoolkit=9.0 -c pytorch

# spacy
conda install -c conda-forge spacy
python -m spacy download en
python -m spacy download en_vectors_web_lg
```
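Not part of the original setup, but a quick sanity check of the environment can be run with the snippet below (a minimal sketch, assuming the packages above installed cleanly):

```python
# Minimal environment sanity check (sketch; not part of the original repo).
import torch
import torchvision
import spacy

print("PyTorch:", torch.__version__)            # expect 1.0.x
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())

# The two spaCy models downloaded above.
nlp = spacy.load("en")                          # English tokenizer pipeline
vectors = spacy.load("en_vectors_web_lg")       # 300-d GloVe word vectors
print("Vector dim:", vectors("question").vector.shape[0])   # expect 300
```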
In addition, preparing BUTD features for TextVQA requires Caffe. Please go to bottom-up-attention and check out its README. The environment is exactly the same as in the original implementation, although we modified some code. An AWS GPU instance is recommended for setting up the environment.
We use two datasets for our experiments: VQA v1.0 and TextVQA v0.5. Each dataset has three splits (`train|val|test`), and each split has three components:

- `ques_file`: JSON file with VQA questions.
- `ans_file`: JSON file with answers to the questions.
- `features_prefix`: path to the image feature `.npy` files.
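To make the layout concrete, here is a hedged sketch of how these three components are consumed; every file name and prefix below is hypothetical, and the real paths are configured through `DATA_PATHS` in `config.py`:

```python
# Sketch of the per-split data layout; all paths below are hypothetical and
# the real ones are taken from DATA_PATHS in config.py.
import json
import numpy as np

with open("data/textvqa/train_ques.json") as f:   # ques_file (hypothetical path)
    questions = json.load(f)
with open("data/textvqa/train_ans.json") as f:    # ans_file (hypothetical path)
    answers = json.load(f)

# features_prefix points to per-image .npy feature files.
features_prefix = "data/textvqa/features/train_"  # hypothetical prefix
image_id = "some_image_id"                        # hypothetical id
features = np.load(features_prefix + image_id + ".npy")
print(features.shape)                             # e.g. (num_boxes, 2048) for BUTD features
```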
The following examples are for TextVQA v0.5 only.
- Download the dataset and the corresponding image files:

  ```bash
  mkdir -p data/textvqa/origin
  cd data/textvqa/origin
  wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5_train.json
  wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5_val.json
  wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5_test.json
  wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
  unzip train_val_images.zip
  cd ../../..
  ```
- Generate ResNet image features (a rough sketch of this step is given after this list):

  ```bash
  python scripts/resnet_feature.py [--split] [--image_dir] [--feature_dir]
  ```
- Generate BUTD image features (a sketch of the tsv-to-npy conversion is given after this list):

  ```bash
  # generate tsv file (Caffe is required)
  cd bottom-up-attention
  ./gen_faster_rcnn_textvqa.sh

  # convert tsv file to npy
  python scripts/butd_feature.py [--split] [--image_dir] [--feature_dir]
  ```
- The VQA v1.0 dataset is already in the desired `ques_file|ans_file` format. Generate the JSON files for TextVQA v0.5:

  ```bash
  python scripts/textvqa_transform.py [--split] [--input_dir] [--output_dir]
  ```
- Modify `DATA_PATHS` in `config.py` to match the dataset and image feature paths accordingly (an illustrative `DATA_PATHS` sketch is given after this list).
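For reference on the ResNet feature step: the snippet below is not the repo's `scripts/resnet_feature.py`, only a minimal sketch of the kind of grid-feature extraction such a script performs; the model choice (ResNet-152), input size, and output layout are assumptions.

```python
# Sketch of grid-level ResNet feature extraction; the actual
# scripts/resnet_feature.py may use a different model, layer, or preprocessing.
import os
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Keep everything up to the last conv block; drop avgpool/fc.
resnet = models.resnet152(pretrained=True)
extractor = torch.nn.Sequential(*list(resnet.children())[:-2]).to(device).eval()

transform = T.Compose([
    T.Resize((448, 448)),   # assumption: 448x448 input -> 14x14 feature grid
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_one(image_path, feature_dir):
    img = transform(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = extractor(img)                    # shape (1, 2048, 14, 14)
    name = os.path.splitext(os.path.basename(image_path))[0]
    np.save(os.path.join(feature_dir, name + ".npy"), feat.squeeze(0).cpu().numpy())
```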
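For the BUTD step, here is a hedged sketch of the tsv-to-npy conversion; the field names follow the tsv layout of the original bottom-up-attention release, and the actual `scripts/butd_feature.py` may differ.

```python
# Sketch of converting a bottom-up-attention tsv file into per-image .npy files;
# field names follow the original BUTD tsv release, and the real
# scripts/butd_feature.py may differ.
import base64
import csv
import os
import sys
import numpy as np

FIELDNAMES = ["image_id", "image_w", "image_h", "num_boxes", "boxes", "features"]
csv.field_size_limit(sys.maxsize)   # rows contain large base64-encoded blobs

def tsv_to_npy(tsv_path, feature_dir):
    os.makedirs(feature_dir, exist_ok=True)
    with open(tsv_path) as f:
        reader = csv.DictReader(f, delimiter="\t", fieldnames=FIELDNAMES)
        for row in reader:
            num_boxes = int(row["num_boxes"])
            feats = np.frombuffer(base64.b64decode(row["features"]), dtype=np.float32)
            feats = feats.reshape(num_boxes, -1)   # typically (num_boxes, 2048)
            np.save(os.path.join(feature_dir, str(row["image_id"]) + ".npy"), feats)
```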
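Finally, an illustrative sketch of what a `DATA_PATHS` entry might look like; the actual keys, nesting, and paths live in `config.py` and may differ.

```python
# Purely illustrative DATA_PATHS sketch; the real structure is defined in config.py.
DATA_PATHS = {
    "textvqa_butd": {                                   # hypothetical experiment key
        "train": {
            "ques_file": "data/textvqa/train_ques.json",
            "ans_file": "data/textvqa/train_ans.json",
            "features_prefix": "data/textvqa/butd_features/train_",
        },
        "val": {
            "ques_file": "data/textvqa/val_ques.json",
            "ans_file": "data/textvqa/val_ans.json",
            "features_prefix": "data/textvqa/butd_features/val_",
        },
        # "test": {...}
    },
}
```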
Our implementation supports multiple models and datasets. Use the following command for training (refer to `config.py` for details on `--options`):

```bash
python train.py [MODEL] [EXP_TYPE] [--options]
```
Some examples:
- MFH baseline on VQA v1.0:

  ```bash
  python train.py mfh baseline
  ```

- DiagNet without OCR on VQA v1.0:

  ```bash
  python train.py mfh glove --EMBED
  ```

- DiagNet on TextVQA v0.5:

  ```bash
  python train.py mfh textvqa_butd --EMBED --OCR --BIN_HELP
  ```
- Download the image files and modify `image_prefix` of `DATA_PATHS` in `config.py` accordingly.
- Run training and get the `.pth` model file in `training/checkpoint` (a checkpoint-inspection sketch is given after this list). For example:

  ```bash
  python train.py mfh glove --EMBED
  ```
- Specify the questions of interest by modifying `QTYPES` in `config.py`.
- Run visualization:

  ```bash
  python predict.py mfh glove --EMBED [--RESUME_PATH]
  ```
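As a side note, a checkpoint under `training/checkpoint` can be inspected before being passed to `--RESUME_PATH`. The snippet below is a minimal sketch: the file name is hypothetical, and whether the checkpoint is a raw `state_dict` or a wrapper dict is an assumption.

```python
# Sketch for inspecting a saved checkpoint; the file name is hypothetical and
# whether it is a raw state_dict or a wrapper dict is an assumption.
import torch

ckpt_path = "training/checkpoint/mfh_glove.pth"   # hypothetical file name
ckpt = torch.load(ckpt_path, map_location="cpu")

# Unwrap if the checkpoint stores the weights under a "state_dict" key.
state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt

for name, value in list(state_dict.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```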