Commit

update toolkit for telling QA
yukezhu committed Nov 13, 2015
1 parent c1e06ee commit d1c4f3b
Showing 10 changed files with 183 additions and 89 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -2,3 +2,6 @@
.DS_Store
.project
.pydevproject
.idea

tests
106 changes: 106 additions & 0 deletions README.md
@@ -1 +1,107 @@
# Visual7W Toolkit

![alt text](http://web.stanford.edu/~yukez/images/img/visual7w_examples.png "Visual7W example QAs")

## Introduction

Visual7W is a large-scale visual question answering (QA) dataset, with object-level groundings and multimodal answers.
Each question starts with one of the seven Ws: *what*, *where*, *when*, *who*, *why*, *how* and *which*.
Please check out [our arXiv paper](http://web.stanford.edu/~yukez/papers/visual7w_arxiv.pdf) for more details.
This toolkit is used for parsing dataset files and evaluating model performance.
Please contact [Yuke Zhu](http://web.stanford.edu/~yukez/) for questions, comments, or bug reports.

## Dataset Overview

The Visual7W dataset is collected on 47,300 COCO images. In total, it has 327,939 QA pairs, together with 1,311,756 human-generated multiple choices and 561,459 object groundings from 36,579 categories. In addition, we provide complete grounding annotations that link the object mentions in the QA sentences to their bounding boxes in the images, and therefore introduce a new QA type with image regions as the visually grounded answers. We refer to questions with textual answers as *telling* QA and to those with visual answers as *pointing* QA. The figure above shows some examples from the Visual7W dataset: the first row shows *telling* QA examples, and the second row shows *pointing* QA examples.

## Evaluation Methods

We use two evaluation methods to measure performance. **Multiple-choice evaluation** aims at selecting the correct option from a pre-defined pool of candidate answers. **Open-ended evaluation** aims at predicting a freeform textual answer given a question and the image. This toolkit provides utility functions to evaluate performance with both methods. We explain the details of these two methods below.

1. **Multiple-choice QA**: We provide four human-generated multiple-choice answers for each question, one of which is the ground-truth. We say the model is correct on a question if it selects the correct answer candidate. Accuracy is used to measure the performance. This is the default (and recommended) evaluation method for Visual7W.

2. **Open-ended QA**: Similar to the top-5 criterion used in the [ImageNet challenges](http://www.image-net.org/), we let the model make *k* different freeform predictions. We say the model is correct on a question if one of the *k* predictions exactly matches the ground-truth. Accuracy is used to measure the performance. This evaluation method only applies to the *telling* QA tasks with textual answers. A minimal sketch of the one-of-*k* check is shown below.
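
The sketch below is illustrative only: the function and variable names are ours rather than part of the toolkit, and the actual matching logic lives in ```evaluate.py```, which lowercases both the prediction and the ground-truth before comparing them.

```
# Minimal sketch of the one-of-k exact-match metric (illustrative only).
# ground_truths: list of answer strings, one per question.
# predictions: list of candidate-answer lists, one per question.
def one_of_k_accuracy(ground_truths, predictions, k=100):
    num_correct = 0
    for answer, candidates in zip(ground_truths, predictions):
        # a question counts as correct if any top-k candidate matches exactly
        if answer.lower() in [c.lower() for c in candidates[:k]]:
            num_correct += 1
    return float(num_correct) / len(ground_truths)
```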

## How to Use

Before using this toolkit, make sure that you have downloaded the Visual7W dataset.
You can use our downloading script in ```datasets/[dataset-name]/download_dataset.sh```
to fetch the dataset JSON to your local disk.

In order to show how to use this toolkit, we have implemented two simple
baseline models based on training set answer frequencies.

### Telling QA

We implement a most-frequent-answer (MFA) baseline in ```predict_baseline.py```.
For open-ended evaluation, we use the top-*k* most frequent training set answers
as the predictions for all test questions. For multiple-choice evaluation, we select
the candidate answer with the highest training set frequency for each test question.
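
For reference, here is a minimal sketch of the frequency-counting idea behind the MFA baseline. It assumes the data provider in ```common/data_provider.py``` (imported the same way ```evaluate.py``` imports it) and a downloaded *telling* dataset; it is an illustration only, not the actual ```predict_baseline.py``` implementation.

```
# Illustrative sketch of the MFA baseline (see predict_baseline.py for the real code).
from collections import Counter
from common.data_provider import getDataProvider

# skip image features; we only need the QA annotations here
dp = getDataProvider('visual7w-telling', load_features=False)

# count how often each answer appears in the training split
answer_counts = Counter(pair['answer'] for pair in dp.iterQAPairs('train'))

# open-ended: predict the same top-k most frequent answers for every test question
top_k_answers = [answer for answer, _ in answer_counts.most_common(100)]
```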

In this demo, we perform open-ended evaluation on the *telling* QA tasks.
To run the MFA baseline on the validation set, use the following command:

```
python predict_baseline.py --dataset visual7w-telling \
--mode open \
--topk 100 \
--split val \
--result_path results
```

It will generate a prediction file ```result_visual7w-telling_open.json``` in the ```results``` folder. Type ```python predict_baseline.py -h``` to learn more about the input arguments.

The command below shows how to use the evaluation script ```evaluate.py``` to evaluate the performance of the open-ended predictions in the ```result_visual7w-telling_open.json``` file. Type ```python evaluate.py -h``` to learn more about the input arguments.

```
python evaluate.py --dataset visual7w-telling \
--mode open \
--topk 100 \
--split val \
--results results/result_visual7w-telling_open.json \
--verbose 1
```

You should see results similar to the following:

```
2015-11-12 22:31:13,141 Evaluated on 28,020 QA pairs with top-100 predictions.
2015-11-12 22:31:13,141 Overall accuracy = 0.371
2015-11-12 22:31:13,142 Question type "what" accuracy = 0.380 (5053 / 13296)
2015-11-12 22:31:13,142 Question type "where" accuracy = 0.099 (456 / 4590)
2015-11-12 22:31:13,142 Question type "when" accuracy = 0.529 (667 / 1262)
2015-11-12 22:31:13,142 Question type "who" accuracy = 0.375 (1079 / 2879)
2015-11-12 22:31:13,142 Question type "why" accuracy = 0.051 (91 / 1782)
2015-11-12 22:31:13,142 Question type "how" accuracy = 0.721 (3037 / 4211)
```

<!--Similarly, we can test the most-frequent-answer baseline with multiple-choice evaluation.-->

<!--```-->
<!--python predict_baseline.py --dataset visual7w-telling \-->
<!-- --mode mc \-->
<!-- --split val \-->
<!-- --result_path results-->
<!--```-->

<!--We can still use ```evaluate.py``` to evaluate the performance.-->

<!--```-->
<!--python evaluate.py --dataset visual7w-telling \-->
<!-- --mode mc \-->
<!-- --split val \-->
<!-- --results results/result_visual7w-telling_mc.json-->
<!-- --verbose 1-->
<!--```-->

### Evaluating Your Own Models

In order to evaluate your own model, please check the format of the sample outputs
produced by the baseline script. In short, a prediction file contains a list of predicted
answers in the ```candidates``` arrays. For multiple-choice QA, the ```candidates``` array
contains only one element, which is the selected multiple-choice option. For open-ended QA,
the ```candidates``` array can contain multiple (up to *k*) predictions, and we use the
one-of-*k* metric to evaluate the performance. A sketch of the expected file format is shown below.
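
For reference, the sketch below shows one possible layout of a prediction file, inferred from the baseline outputs and from how ```evaluate.py``` reads its input (a JSON list of entries with ```qa_id``` and ```candidates``` fields). The concrete values are placeholders.

```
# Sketch of a prediction file (placeholder values, not real model outputs).
import json

predictions = [
    {
        'qa_id': 123456,                   # must match a QA id in the evaluated split
        'candidates': [
            {'answer': 'a red umbrella'},  # open-ended: up to k candidates
            {'answer': 'an umbrella'},
        ],
    },
    # ... one entry per evaluated question
]

# assumes the results/ folder already exists
with open('results/result_visual7w-telling_open.json', 'w') as f:
    json.dump(predictions, f)
```
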
65 changes: 11 additions & 54 deletions common/data_provider.py
@@ -4,32 +4,18 @@
import scipy.io
from collections import defaultdict

# modified from Karpathy's neuraltalk (https://github.com/karpathy/neuraltalk)
class BasicDataProvider:

def __init__(self, dataset, **kwargs):
print 'Initializing data provider for dataset %s...' % (dataset, )

# !assumptions on folder structure
self.dataset_root = kwargs.get('dataset_root', os.path.join('datasets', dataset))
self.feature_root = kwargs.get('feature_root', os.path.join('datasets', dataset))
self.image_root = kwargs.get('image_root', os.path.join('data', 'images', dataset))


# load the dataset into memory
dataset_path = os.path.join(self.dataset_root, 'dataset.json')
dataset_root = kwargs.get('dataset_root', os.path.join('datasets', dataset))
dataset_path = os.path.join(dataset_root, 'dataset.json')
print 'BasicDataProvider: reading %s' % (dataset_path, )
self.dataset = json.load(open(dataset_path, 'r'))

# load the image features into memory
self.load_features = kwargs.get('load_features', True)
if self.load_features:
# load feature
features_path = os.path.join(self.feature_root, 'vgg_fc7_feats.mat')
print 'BasicDataProvider: reading %s' % (features_path, )
features_struct = scipy.io.loadmat(open(features_path, 'rb'))
self.features = features_struct['feats']
# imgid2featidx is a dictionary that maps an image id to the column index of the feature matrix
image_ids = features_struct['image_ids'].ravel()
self.imgid2featidx = {img : i for i, img in enumerate(image_ids)}

# group images by their train/val/test split into a dictionary -> list structure
self.split = defaultdict(list)
for img in self.dataset['images']:
@@ -43,59 +29,30 @@ def __init__(self, dataset, **kwargs):
# and they will be returned in the future with the cache present
def _getImage(self, img):
""" create an image structure """
# lazily fill in some attributes
if self.load_features and not 'feat' in img: # also fill in the features
feature_index = self.imgid2featidx[img['image_id']]
img['feat'] = self.features[:,feature_index]
return img

def _getQAPair(self, qa_pair):
""" create a QA pair structure """
if not 'tokens' in qa_pair:
question_tokens = self.tokenize(qa_pair['question'], 'question')
qa_pair['question_tokens'] = question_tokens
if 'answer' in qa_pair:
answer_tokens = self.tokenize(qa_pair['answer'], 'answer')
qa_pair['answer_tokens'] = answer_tokens
qa_pair['tokens'] = question_tokens + answer_tokens
return qa_pair

def _getQAMultipleChoice(self, qa_pair, shuffle = False):
""" create a QA multiple choice structure """
qa_pair = self._getQAPair(qa_pair)
if 'multiple_choices' in qa_pair:
mcs = qa_pair['multiple_choices']
tokens = [self.tokenize(x, 'answer') for x in mcs]
pos_idx = range(len(mcs)+1)
# random shuffle the positions of multiple choices
if shuffle:
random.shuffle(pos_idx)
qa_pair['mc_tokens'] = []
if shuffle: random.shuffle(pos_idx)
qa_pair['mc_candidates'] = []
for idx, k in enumerate(pos_idx):
if k == 0 and 'answer' in qa_pair:
qa_pair['mc_tokens'].append(qa_pair['answer_tokens'])
if k == 0:
qa_pair['mc_candidates'].append(qa_pair['answer'])
qa_pair['mc_selection'] = idx # record the position of the true answer
else:
qa_pair['mc_tokens'].append(tokens[k-1])
qa_pair['mc_candidates'].append(mcs[k-1])
return qa_pair

# PUBLIC FUNCTIONS
def tokenize(self, sent, token_type=None):
""" convert question or answer into a sequence of tokens """
line = sent[:-1].lower().replace('.', '')
line = ''.join([x if x.isalnum() else ' ' for x in line])
tokens = line.strip().split()
if token_type == 'question':
assert sent[-1] == '?', 'question (%s) must end with question mark.' % sent
tokens.append('?')
if token_type == 'answer':
assert sent[-1] == '.', 'answer (%s) must end with period.' % sent
tokens.append('.')
return tokens

# PUBLIC FUNCTIONS
def getSplitSize(self, split, ofwhat = 'qa_pairs'):
""" return size of a split, either number of QA pairs or number of images """
if ofwhat == 'qa_pairs':
@@ -116,7 +73,7 @@ def sampleImageQAPair(self, split = 'train'):
return out

def sampleImageQAMultipleChoice(self, split = 'train', shuffle = False):
""" sample image QA pair from a split """
""" sample image and a multiple-choice test from a split """
images = self.split[split]

img = random.choice(images)
@@ -166,7 +123,7 @@ def iterQAMultipleChoice(self, split = 'train', shuffle = False):
yield self._getQAMultipleChoice(pair, shuffle)

def iterQAPairs(self, split = 'train'):
for img in self.split[split]:
for img in self.split[split]:
for pair in img['qa_pairs']:
yield self._getQAPair(pair)

@@ -182,5 +139,5 @@ def iterImages(self, split = 'train', shuffle = False, max_images = -1):

def getDataProvider(dataset, **kwargs):
""" we could intercept a special dataset and return different data providers """
assert dataset in ['visual6w'], 'dataset %s unknown' % (dataset, )
assert dataset in ['visual7w-telling'], 'dataset %s unknown' % (dataset, )
return BasicDataProvider(dataset, **kwargs)
2 changes: 1 addition & 1 deletion datasets/README.md
@@ -1 +1 @@
Put your dataset here.
Please use the "download_dataset.sh" script in each folder to fetch the annotation files.
Empty file removed datasets/download_dataset.sh
Empty file.
2 changes: 2 additions & 0 deletions datasets/visual7w-pointing/.gitignore
@@ -0,0 +1,2 @@
dataset.json
vgg_fc7_feats.mat
18 changes: 18 additions & 0 deletions datasets/visual7w-pointing/download_dataset.sh
@@ -0,0 +1,18 @@
#!/usr/bin/env bash

V7W_DB_NAME=v7w_pointing

V7W_URL="http://web.stanford.edu/~yukez/papers/resources/dataset_${V7W_DB_NAME}.zip"
V7W_PATH="dataset_${V7W_DB_NAME}.json"

if [ -f "dataset.json" ]; then
echo "Dataset already exists. Bye!"
exit
fi

echo "Downloading ${V7W_DB_NAME} dataset..."
wget -q $V7W_URL -O dataset.zip
unzip -j dataset.zip
rm dataset.zip
mv $V7W_PATH dataset.json
echo "Done."
18 changes: 18 additions & 0 deletions datasets/visual7w-telling/download_dataset.sh
@@ -0,0 +1,18 @@
#!/usr/bin/env bash

V7W_DB_NAME=v7w_telling

V7W_URL="http://web.stanford.edu/~yukez/papers/resources/dataset_${V7W_DB_NAME}.zip"
V7W_PATH="dataset_${V7W_DB_NAME}.json"

if [ -f "dataset.json" ]; then
echo "Dataset already exists. Bye!"
exit
fi

echo "Downloading ${V7W_DB_NAME} dataset..."
wget -q $V7W_URL -O dataset.zip
unzip -j dataset.zip
rm dataset.zip
mv $V7W_PATH dataset.json
echo "Done."
47 changes: 18 additions & 29 deletions evaluate.py
@@ -23,26 +23,22 @@ def evaluate_top_k(dp, params):
if params['mode'] == 'mc':
logging.info('Multiple-choice QA evaluation')
if top_k != 1:
logging.error('top_k is set to 1 for multiple-choice QA')
logging.info('top_k is set to 1 for multiple-choice QA')
top_k = 1
else:
logging.info('Open-ended QA evaluation')

# split to be evaluated
split = params['split']
if split == 'test':
logging.error('Please use our online server for test set evaluation.')
return

if split not in ['train', 'val']:
if split not in ['train', 'val', 'test']:
logging.error('Error: cannot find split %s.' % split)
return

# load result json
result_file = params['results']
if os.path.isfile(result_file):
try:
results = json.load(open(result_file))
else:
except:
logging.error('Error: cannot read result file from %s' % result_file)
return

@@ -53,8 +49,7 @@ def evaluate_top_k(dp, params):
# fetch all test QA pairs from data provider
pairs = {pair['qa_id']: pair for pair in dp.iterQAPairs(split)}

# question_categories
question_categories = ['what', 'where', 'when', 'who', 'why', 'how']
# record performances per question category
category_total = dict()
category_correct = dict()

@@ -64,21 +59,16 @@ def evaluate_top_k(dp, params):
logging.error('Cannot find QA #%d. Are you using the correct split?' % entry['qa_id'])
return
pair = pairs[entry['qa_id']]
answer_tokens = pair['answer_tokens']
answer = str(pair['answer']).lower()
candidates = entry['candidates'][:top_k]
correct_prediction = False
c = pair['type']
category_total[c] = category_total.get(c, 0) + 1
for candidate in candidates:
prediction = candidate['answer']
if not prediction.endswith('.'): prediction += '.'
prediction_tokens = dp.tokenize(prediction, 'answer')
if prediction_tokens == answer_tokens:
prediction = str(candidate['answer']).lower()
if prediction == answer:
num_correct += 1
correct_prediction = True
category_correct[c] = category_correct.get(c, 0) + 1
break
for c in question_categories:
if pair['question'].lower().startswith(c):
category_total[c] = category_total.get(c, 0) + 1
if correct_prediction: category_correct[c] = category_correct.get(c, 0) + 1
num_total += 1
if (idx+1) % 10000 == 0:
logging.info('Evaluated %s QA pairs...' % format(idx+1, ',d'))
@@ -91,7 +81,7 @@ def evaluate_top_k(dp, params):

verbose = params['verbose']
if verbose:
for c in question_categories:
for c in category_total.keys():
total = category_total.get(c, 0)
correct = category_correct.get(c, 0)
logging.info('Question type "%s" accuracy = %.3f (%d / %d)' % (c, 1.0 * correct / total, correct, total))
@@ -104,20 +94,19 @@

# configure argument parser
parser = argparse.ArgumentParser()
parser.add_argument('-d', '--dataset', default='visual6w', type=str, help='dataset name (default: visual6w)')
parser.add_argument('-m', '--mode', default='open', type=str, help='prediction mode. "mc" denotes multiple-choice QA. "open" denotes open-ended QA.')
parser.add_argument('-k', '--topk', default=1, type=int, help='top k evaluation. k denotes how many candidate answers to be examined.')
parser.add_argument('-j', '--results', default='results/result_visual6w_open.json', help='path to json file contains the results (see the format of the sample files in "results" folder).')
parser.add_argument('-o', '--output_path', default='.', type=str, help='output folder')
parser.add_argument('-d', '--dataset', default='visual7w-telling', type=str, help='dataset name (default: visual7w-telling)')
parser.add_argument('-m', '--mode', default='open', type=str, help='prediction mode: "mc" - multiple-choice QA; "open" - open-ended QA.')
parser.add_argument('-k', '--topk', default=1, type=int, help='top-k evaluation. k is the number of answer candidates to be examined.')
parser.add_argument('-j', '--results', default='results/result_visual7w-telling_open.json', help='path to the JSON file that contains the results (see the format of the sample files in the "results" folder).')
parser.add_argument('-s', '--split', type=str, default='val', help='the split to be evaluated: train / val / test (default: val)')
parser.add_argument('-v', '--verbose', default=0, type=int, help='verbose mode. print performances of 6W categories when enabled.')
parser.add_argument('-v', '--verbose', default=0, type=int, help='verbose mode. report performances of question categories when enabled.')

# parse arguments
args = parser.parse_args()
params = vars(args) # convert to ordinary dict

# load dataset (skipping feature files)
dp = getDataProvider(params['dataset'], load_features = False)
dp = getDataProvider(params['dataset'])

# start evaluation mode
if params['mode'] in ['mc', 'open']:
