Commit

update toolkit for telling QA
yukezhu committed Nov 13, 2015
1 parent c1e06ee commit d1c4f3b
Showing 10 changed files with 183 additions and 89 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -2,3 +2,6 @@
.DS_Store
.project
.pydevproject
.idea

tests
106 changes: 106 additions & 0 deletions README.md
@@ -1 +1,107 @@
# Visual7W Toolkit

![alt text](http://web.stanford.edu/~yukez/images/img/visual7w_examples.png "Visual7W example QAs")

## Introduction

Visual7W is a large-scale visual question answering (QA) dataset, with object-level groundings and multimodal answers.
Each question starts with one of the seven Ws: *what*, *where*, *when*, *who*, *why*, *how* and *which*.
Please check out [our arXiv paper](http://web.stanford.edu/~yukez/papers/visual7w_arxiv.pdf) for more details.
This toolkit is used for parsing dataset files and evaluating model performance.
Please contact [Yuke Zhu](http://web.stanford.edu/~yukez/) for questions, comments, or bug reports.

## Dataset Overview

The Visual7W dataset is collected on 47,300 COCO images. In total, it has 327,939 QA pairs, together with 1,311,756 human-generated multiple choices and 561,459 object groundings from 36,579 categories. In addition, we provide complete grounding annotations that link the object mentions in the QA sentences to their bounding boxes in the images, and therefore introduce a new QA type with image regions as the visually grounded answers. We refer to questions with textual answers as *telling* QA and to those with visual answers as *pointing* QA. The figure above shows some examples from the Visual7W dataset: the first row shows *telling* QA examples, and the second row shows *pointing* QA examples.

## Evaluation Methods

We use two evaluation methods to measure performance. **Multiple-choice evaluation** aims at selecting the correct option from a pre-defined pool of candidate answers. **Open-ended evaluation** aims at predicting a freeform textual answer given a question and the image. This toolkit provides utility functions to evaluate performance with both methods. We explain the details of these two methods below.

1. **Multiple-choice QA**: We provide four human-generated multiple-choice answers for each question, one of which is the ground-truth. We say the model is correct on a question if it selects the correct answer candidate. Accuracy is used to measure the performance. This is the default (and recommended) evaluation method for Visual7W.

2. **Open-ended QA**: Similar to the top-5 criterion used in the [ImageNet challenges](http://www.image-net.org/), we let the model make *k* different freeform predictions. We say the model is correct on a question if one of the *k* predictions exactly matches the ground-truth. Accuracy is used to measure the performance. This evaluation method only applies to the *telling* QA tasks with textual answers. A minimal sketch of the one-of-*k* check is shown below.
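
The sketch below is illustrative only: the function and variable names are ours rather than part of the toolkit, and the actual matching logic lives in ```evaluate.py```, which lowercases both the prediction and the ground-truth before comparing them.

```
# Minimal sketch of the one-of-k exact-match metric (illustrative only).
# ground_truths: list of answer strings, one per question.
# predictions: list of candidate-answer lists, one per question.
def one_of_k_accuracy(ground_truths, predictions, k=100):
    num_correct = 0
    for answer, candidates in zip(ground_truths, predictions):
        # a question counts as correct if any top-k candidate matches exactly
        if answer.lower() in [c.lower() for c in candidates[:k]]:
            num_correct += 1
    return float(num_correct) / len(ground_truths)
```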

## How to Use

Before using this toolkit, make sure that you have downloaded the Visual7W dataset.
You can use our downloading script in ```datasets/[dataset-name]/download_dataset.sh```
to fetch the dataset JSON to your local disk.

In order to show how to use this toolkit, we have implemented two simple
baseline models based on training set answer frequencies.

### Telling QA

We implement a most-frequent-answer (MFA) baseline in ```predict_baseline.py```.
For open-ended evaluation, we use the top-*k* most frequent training set answers
as the predictions for all test questions. For multiple-choice evaluation, we select
the candidate answer with the highest training set frequency for each test question.
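
For reference, here is a minimal sketch of the frequency-counting idea behind the MFA baseline. It assumes the data provider in ```common/data_provider.py``` (imported the same way ```evaluate.py``` imports it) and a downloaded *telling* dataset; it is an illustration only, not the actual ```predict_baseline.py``` implementation.

```
# Illustrative sketch of the MFA baseline (see predict_baseline.py for the real code).
from collections import Counter
from common.data_provider import getDataProvider

# skip image features; we only need the QA annotations here
dp = getDataProvider('visual7w-telling', load_features=False)

# count how often each answer appears in the training split
answer_counts = Counter(pair['answer'] for pair in dp.iterQAPairs('train'))

# open-ended: predict the same top-k most frequent answers for every test question
top_k_answers = [answer for answer, _ in answer_counts.most_common(100)]
```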

In this demo, we perform open-ended evaluation on the *telling* QA tasks.
To run the MFA baseline on the validation set, use the following command:

```
python predict_baseline.py --dataset visual7w-telling \
--mode open \
--topk 100 \
--split val \
--result_path results
```

It will generate a prediction file ```result_visual7w-telling_open.json``` in the ```results``` folder. Type ```python predict_baseline.py -h``` to learn more about the input arguments.

The command below shows how to use the evaluation script ```evaluate.py``` to evaluate the performance of the open-ended predictions in the ```result_visual7w-telling_open.json``` file. Type ```python evaluate.py -h``` to learn more about the input arguments.

```
python evaluate.py --dataset visual7w-telling \
--mode open \
--topk 100 \
--split val \
--results results/result_visual7w-telling_open.json \
--verbose 1
```

You should see results similar to the following:

```
2015-11-12 22:31:13,141 Evaluated on 28,020 QA pairs with top-100 predictions.
2015-11-12 22:31:13,141 Overall accuracy = 0.371
2015-11-12 22:31:13,142 Question type "what" accuracy = 0.380 (5053 / 13296)
2015-11-12 22:31:13,142 Question type "where" accuracy = 0.099 (456 / 4590)
2015-11-12 22:31:13,142 Question type "when" accuracy = 0.529 (667 / 1262)
2015-11-12 22:31:13,142 Question type "who" accuracy = 0.375 (1079 / 2879)
2015-11-12 22:31:13,142 Question type "why" accuracy = 0.051 (91 / 1782)
2015-11-12 22:31:13,142 Question type "how" accuracy = 0.721 (3037 / 4211)
```

<!--Similarly, we can test the most-frequent-answer baseline with multiple-choice evaluation.-->

<!--```-->
<!--python predict_baseline.py --dataset visual7w-telling \-->
<!-- --mode mc \-->
<!-- --split val \-->
<!-- --result_path results-->
<!--```-->

<!--We can still use ```evaluate.py``` to evaluate the performance.-->

<!--```-->
<!--python evaluate.py --dataset visual7w-telling \-->
<!-- --mode mc \-->
<!-- --split val \-->
<!-- --results results/result_visual7w-telling_mc.json-->
<!-- --verbose 1-->
<!--```-->

### Evaluating Your Own Models

In order to evaluate your own model, please check the format of the sample outputs
produced by the baseline script. In short, a prediction file contains a list of predicted
answers in the ```candidates``` arrays. For multiple-choice QA, the ```candidates``` array
contains only one element, which is the selected multiple-choice option. For open-ended QA,
the ```candidates``` array can contain multiple (up to *k*) predictions, and we use the
one-of-*k* metric to evaluate the performance. A sketch of the expected file format is shown below.
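
For reference, the sketch below shows one possible layout of a prediction file, inferred from the baseline outputs and from how ```evaluate.py``` reads its input (a JSON list of entries with ```qa_id``` and ```candidates``` fields). The concrete values are placeholders.

```
# Sketch of a prediction file (placeholder values, not real model outputs).
import json

predictions = [
    {
        'qa_id': 123456,                   # must match a QA id in the evaluated split
        'candidates': [
            {'answer': 'a red umbrella'},  # open-ended: up to k candidates
            {'answer': 'an umbrella'},
        ],
    },
    # ... one entry per evaluated question
]

# assumes the results/ folder already exists
with open('results/result_visual7w-telling_open.json', 'w') as f:
    json.dump(predictions, f)
```
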
65 changes: 11 additions & 54 deletions common/data_provider.py
@@ -4,32 +4,18 @@
import scipy.io
from collections import defaultdict

# modified from Karpathy's neuraltalk (https://github.com/karpathy/neuraltalk)
class BasicDataProvider:

def __init__(self, dataset, **kwargs):
print 'Initializing data provider for dataset %s...' % (dataset, )

# !assumptions on folder structure
self.dataset_root = kwargs.get('dataset_root', os.path.join('datasets', dataset))
self.feature_root = kwargs.get('feature_root', os.path.join('datasets', dataset))
self.image_root = kwargs.get('image_root', os.path.join('data', 'images', dataset))


# load the dataset into memory
dataset_path = os.path.join(self.dataset_root, 'dataset.json')
dataset_root = kwargs.get('dataset_root', os.path.join('datasets', dataset))
dataset_path = os.path.join(dataset_root, 'dataset.json')
print 'BasicDataProvider: reading %s' % (dataset_path, )
self.dataset = json.load(open(dataset_path, 'r'))

# load the image features into memory
self.load_features = kwargs.get('load_features', True)
if self.load_features:
# load feature
features_path = os.path.join(self.feature_root, 'vgg_fc7_feats.mat')
print 'BasicDataProvider: reading %s' % (features_path, )
features_struct = scipy.io.loadmat(open(features_path, 'rb'))
self.features = features_struct['feats']
# imgid2featidx is a dictionary that maps an image id to the column index of the feature matrix
image_ids = features_struct['image_ids'].ravel()
self.imgid2featidx = {img : i for i, img in enumerate(image_ids)}

# group images by their train/val/test split into a dictionary -> list structure
self.split = defaultdict(list)
for img in self.dataset['images']:
@@ -43,59 +29,30 @@ def __init__(self, dataset, **kwargs):
# and they will be returned in the future with the cache present
def _getImage(self, img):
""" create an image structure """
# lazily fill in some attributes
if self.load_features and not 'feat' in img: # also fill in the features
feature_index = self.imgid2featidx[img['image_id']]
img['feat'] = self.features[:,feature_index]
return img

def _getQAPair(self, qa_pair):
""" create a QA pair structure """
if not 'tokens' in qa_pair:
question_tokens = self.tokenize(qa_pair['question'], 'question')
qa_pair['question_tokens'] = question_tokens
if 'answer' in qa_pair:
answer_tokens = self.tokenize(qa_pair['answer'], 'answer')
qa_pair['answer_tokens'] = answer_tokens
qa_pair['tokens'] = question_tokens + answer_tokens
return qa_pair

def _getQAMultipleChoice(self, qa_pair, shuffle = False):
""" create a QA multiple choice structure """
qa_pair = self._getQAPair(qa_pair)
if 'multiple_choices' in qa_pair:
mcs = qa_pair['multiple_choices']
tokens = [self.tokenize(x, 'answer') for x in mcs]
pos_idx = range(len(mcs)+1)
# random shuffle the positions of multiple choices
if shuffle:
random.shuffle(pos_idx)
qa_pair['mc_tokens'] = []
if shuffle: random.shuffle(pos_idx)
qa_pair['mc_candidates'] = []
for idx, k in enumerate(pos_idx):
if k == 0 and 'answer' in qa_pair:
qa_pair['mc_tokens'].append(qa_pair['answer_tokens'])
if k == 0:
qa_pair['mc_candidates'].append(qa_pair['answer'])
qa_pair['mc_selection'] = idx # record the position of the true answer
else:
qa_pair['mc_tokens'].append(tokens[k-1])
qa_pair['mc_candidates'].append(mcs[k-1])
return qa_pair

# PUBLIC FUNCTIONS
def tokenize(self, sent, token_type=None):
""" convert question or answer into a sequence of tokens """
line = sent[:-1].lower().replace('.', '')
line = ''.join([x if x.isalnum() else ' ' for x in line])
tokens = line.strip().split()
if token_type == 'question':
assert sent[-1] == '?', 'question (%s) must end with question mark.' % sent
tokens.append('?')
if token_type == 'answer':
assert sent[-1] == '.', 'answer (%s) must end with period.' % sent
tokens.append('.')
return tokens

# PUBLIC FUNCTIONS
def getSplitSize(self, split, ofwhat = 'qa_pairs'):
""" return size of a split, either number of QA pairs or number of images """
if ofwhat == 'qa_pairs':
@@ -116,7 +73,7 @@ def sampleImageQAPair(self, split = 'train'):
return out

def sampleImageQAMultipleChoice(self, split = 'train', shuffle = False):
""" sample image QA pair from a split """
""" sample image and a multiple-choice test from a split """
images = self.split[split]

img = random.choice(images)
@@ -166,7 +123,7 @@ def iterQAMultipleChoice(self, split = 'train', shuffle = False):
yield self._getQAMultipleChoice(pair, shuffle)

def iterQAPairs(self, split = 'train'):
for img in self.split[split]:
for img in self.split[split]:
for pair in img['qa_pairs']:
yield self._getQAPair(pair)

@@ -182,5 +139,5 @@ def iterImages(self, split = 'train', shuffle = False, max_images = -1):

def getDataProvider(dataset, **kwargs):
""" we could intercept a special dataset and return different data providers """
assert dataset in ['visual6w'], 'dataset %s unknown' % (dataset, )
assert dataset in ['visual7w-telling'], 'dataset %s unknown' % (dataset, )
return BasicDataProvider(dataset, **kwargs)
2 changes: 1 addition & 1 deletion datasets/README.md
@@ -1 +1 @@
Put your dataset here.
Please use the "download_dataset.sh" script in each folder to fetch the annotation files.
Empty file removed datasets/download_dataset.sh
Empty file.
2 changes: 2 additions & 0 deletions datasets/visual7w-pointing/.gitignore
@@ -0,0 +1,2 @@
dataset.json
vgg_fc7_feats.mat
18 changes: 18 additions & 0 deletions datasets/visual7w-pointing/download_dataset.sh
@@ -0,0 +1,18 @@
#!/usr/bin/env bash

V7W_DB_NAME=v7w_pointing

V7W_URL="http://web.stanford.edu/~yukez/papers/resources/dataset_${V7W_DB_NAME}.zip"
V7W_PATH="dataset_${V7W_DB_NAME}.json"

if [ -f "dataset.json" ]; then
echo "Dataset already exists. Bye!"
exit
fi

echo "Downloading ${V7W_DB_NAME} dataset..."
wget -q $V7W_URL -O dataset.zip
unzip -j dataset.zip
rm dataset.zip
mv $V7W_PATH dataset.json
echo "Done."
18 changes: 18 additions & 0 deletions datasets/visual7w-telling/download_dataset.sh
@@ -0,0 +1,18 @@
#!/usr/bin/env bash

V7W_DB_NAME=v7w_telling

V7W_URL="http://web.stanford.edu/~yukez/papers/resources/dataset_${V7W_DB_NAME}.zip"
V7W_PATH="dataset_${V7W_DB_NAME}.json"

if [ -f "dataset.json" ]; then
echo "Dataset already exists. Bye!"
exit
fi

echo "Downloading ${V7W_DB_NAME} dataset..."
wget -q $V7W_URL -O dataset.zip
unzip -j dataset.zip
rm dataset.zip
mv $V7W_PATH dataset.json
echo "Done."
47 changes: 18 additions & 29 deletions evaluate.py
@@ -23,26 +23,22 @@ def evaluate_top_k(dp, params):
if params['mode'] == 'mc':
logging.info('Multiple-choice QA evaluation')
if top_k != 1:
logging.error('top_k is set to 1 for multiple-choice QA')
logging.info('top_k is set to 1 for multiple-choice QA')
top_k = 1
else:
logging.info('Open-ended QA evaluation')

# split to be evaluated
split = params['split']
if split == 'test':
logging.error('Please use our online server for test set evaluation.')
return

if split not in ['train', 'val']:
if split not in ['train', 'val', 'test']:
logging.error('Error: cannot find split %s.' % split)
return

# load result json
result_file = params['results']
if os.path.isfile(result_file):
try:
results = json.load(open(result_file))
else:
except:
logging.error('Error: cannot read result file from %s' % result_file)
return

@@ -53,8 +49,7 @@ def evaluate_top_k(dp, params):
# fetch all test QA pairs from data provider
pairs = {pair['qa_id']: pair for pair in dp.iterQAPairs(split)}

# question_categories
question_categories = ['what', 'where', 'when', 'who', 'why', 'how']
# record performances per question category
category_total = dict()
category_correct = dict()

@@ -64,21 +59,16 @@ def evaluate_top_k(dp, params):
logging.error('Cannot find QA #%d. Are you using the correct split?' % entry['qa_id'])
return
pair = pairs[entry['qa_id']]
answer_tokens = pair['answer_tokens']
answer = str(pair['answer']).lower()
candidates = entry['candidates'][:top_k]
correct_prediction = False
c = pair['type']
category_total[c] = category_total.get(c, 0) + 1
for candidate in candidates:
prediction = candidate['answer']
if not prediction.endswith('.'): prediction += '.'
prediction_tokens = dp.tokenize(prediction, 'answer')
if prediction_tokens == answer_tokens:
prediction = str(candidate['answer']).lower()
if prediction == answer:
num_correct += 1
correct_prediction = True
category_correct[c] = category_correct.get(c, 0) + 1
break
for c in question_categories:
if pair['question'].lower().startswith(c):
category_total[c] = category_total.get(c, 0) + 1
if correct_prediction: category_correct[c] = category_correct.get(c, 0) + 1
num_total += 1
if (idx+1) % 10000 == 0:
logging.info('Evaluated %s QA pairs...' % format(idx+1, ',d'))
@@ -91,7 +81,7 @@ def evaluate_top_k(dp, params):

verbose = params['verbose']
if verbose:
for c in question_categories:
for c in category_total.keys():
total = category_total.get(c, 0)
correct = category_correct.get(c, 0)
logging.info('Question type "%s" accuracy = %.3f (%d / %d)' % (c, 1.0 * correct / total, correct, total))
@@ -104,20 +94,19 @@

# configure argument parser
parser = argparse.ArgumentParser()
parser.add_argument('-d', '--dataset', default='visual6w', type=str, help='dataset name (default: visual6w)')
parser.add_argument('-m', '--mode', default='open', type=str, help='prediction mode. "mc" denotes multiple-choice QA. "open" denotes open-ended QA.')
parser.add_argument('-k', '--topk', default=1, type=int, help='top k evaluation. k denotes how many candidate answers to be examined.')
parser.add_argument('-j', '--results', default='results/result_visual6w_open.json', help='path to json file contains the results (see the format of the sample files in "results" folder).')
parser.add_argument('-o', '--output_path', default='.', type=str, help='output folder')
parser.add_argument('-d', '--dataset', default='visual7w-telling', type=str, help='dataset name (default: visual7w-telling)')
parser.add_argument('-m', '--mode', default='open', type=str, help='prediction mode: "mc" - multiple-choice QA; "open" - open-ended QA.')
parser.add_argument('-k', '--topk', default=1, type=int, help='top-k evaluation. k is the number of answer candidates to be examined.')
parser.add_argument('-j', '--results', default='results/result_visual7w-telling_open.json', help='path to the JSON file that contains the results (see the format of the sample files in the "results" folder).')
parser.add_argument('-s', '--split', type=str, default='val', help='the split to be evaluated: train / val / test (default: val)')
parser.add_argument('-v', '--verbose', default=0, type=int, help='verbose mode. print performances of 6W categories when enabled.')
parser.add_argument('-v', '--verbose', default=0, type=int, help='verbose mode. report performances of question categories when enabled.')

# parse arguments
args = parser.parse_args()
params = vars(args) # convert to ordinary dict

# load dataset (skipping feature files)
dp = getDataProvider(params['dataset'], load_features = False)
dp = getDataProvider(params['dataset'])

# start evaluation mode
if params['mode'] in ['mc', 'open']:
