
First official release update.

seominjoon committed Sep 26, 2018
1 parent 8b66f13 commit 854144792fccf318e865b78f1e8b77a8d25fec9e
Showing with 829 additions and 337 deletions.
  • +1 −1 LICENSE
  • +7 −0 requirements.txt
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2018 University of Washington

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -1,26 +1,131 @@
# Phrase-Indexed Question Answering
# Phrase-Indexed Question Answering (PIQA)
- This is the official GitHub repository for [Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension][paper] (EMNLP 2018).
- The webpage with the leaderboard and submission guidelines is coming soon. For now, please consider reproducing the baseline models and running the official evaluation routine (below) to become familiar with the challenge format.
- Please create a new issue on this repository or contact [Minjoon Seo][minjoon] ([@seominjoon][minjoon-github]) for questions and suggestions.

## Introduction
We will assume that you have read the [paper][paper], though we will try to recap it here. The PIQA challenge is about approaching (existing) extractive question answering tasks via a phrase retrieval mechanism (we plan to hold the challenge for several extractive QA datasets in the near future, though we currently only support SQuAD 1.1). This means we need:

1. **document encoder**: enumerates a list of (phrase, vector) pairs from the document,
2. **question encoder**: maps each question to the same vector space, and
3. **retrieval**: retrieves the (phrasal) answer to the question by performing nearest neighbor search on the list.
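
For concreteness, here is a toy sketch of step 3 in numpy; the phrases, dimensions, and random vectors below are purely illustrative and are not taken from the baseline code:

```python
import numpy as np

# Toy illustration of the retrieval step (not the official evaluator): the document
# encoder has produced N phrase vectors with their phrase strings, and the question
# encoder has produced a single d-dimensional query vector in the same space.
phrases = ["Denver Broncos", "Carolina Panthers", "Levi's Stadium"]  # hypothetical phrases
phrase_vecs = np.random.randn(len(phrases), 128)  # N-by-d phrase matrix
question_vec = np.random.randn(128)               # d-dim question vector

# Retrieval = nearest neighbor search by inner product over the phrase list.
scores = phrase_vecs @ question_vec               # shape (N,)
print(phrases[int(np.argmax(scores))])
```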

While the challenge shares some similarities with document retrieval, a classic problem in the information retrieval literature, a key difference is that the phrase representation will need to be *context-based*, which is more challenging than obtaining the embedding from its *content* alone.

An important aspect of the challenge is the constraint of *independence* between the **document encoder** and the **question encoder**. As we have noted in our paper, most existing models heavily rely on question-dependent representations of the context document. Nevertheless, phrase representations in PIQA need to be completely *independent* of the input question. Not only does this make the challenge quite difficult, but it also means state-of-the-art models cannot be directly used for the task. Hence we have proposed a few reasonable baseline models as a starting point, which can be found in this repository.

Note that it is also not straightforward to strictly enforce this constraint on an evaluation platform such as CodaLab. For instance, the current SQuAD 1.1 evaluator simply provides the test dataset (both context and question) without answers and asks the model to output predictions, which are then compared against the answers. This setup is not great for PIQA because we cannot know whether the submitted model abides by the independence constraint. To resolve this issue, a PIQA submission must consist of the two encoders with explicit independence, and the retrieval is performed on the evaluator side (see the [Submission](#Submission) section below). While it is not as convenient as a vanilla SQuAD submission, we have tried to make it as intuitive and easy as possible :)

## Baseline Models

### 0. Download requirements
Make sure you have Python 3.6. Download and install all requirements by:

### Download requirements
Make sure you have Python 3.x.
chmod +x; ./

### Train
Let `$SQUAD_DIR` be the directory that has both train and dev json files of SQuAD.
Download the SQuAD v1.1 train and dev sets to `$SQUAD_TRAIN_PATH` and `$SQUAD_DEV_PATH`, respectively. Also, for official evaluation, download [`$SQUAD_DEV_CONTEXT_PATH`][squad-context] and [`$SQUAD_DEV_QUESTION_PATH`][squad-question]. Note that a simple script `` is used to obtain both files from the original dev dataset.
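
The name of that script is omitted above, but as a rough, unofficial sketch, splitting the original dev file into a context-only file and a question-only file could look like the following (this assumes the standard SQuAD v1.1 JSON layout; the output file names are our own choice, not the official ones):

```python
import json
import sys

# Hypothetical sketch of such a split (the official script may differ): read a
# SQuAD v1.1 dev file and write a context-only and a question-only version.
with open(sys.argv[1]) as f:  # e.g. the dev json downloaded above
    squad = json.load(f)

context_data = {"version": squad.get("version", "1.1"), "data": []}
question_data = {"version": squad.get("version", "1.1"), "data": []}

for article in squad["data"]:
    c_paras, q_paras = [], []
    for para in article["paragraphs"]:
        # Context file keeps the paragraph text, drops questions and answers.
        c_paras.append({"context": para["context"], "qas": []})
        # Question file keeps question ids and text, drops contexts and answers.
        q_paras.append({"qas": [{"id": qa["id"], "question": qa["question"]}
                                for qa in para["qas"]]})
    context_data["data"].append({"title": article["title"], "paragraphs": c_paras})
    question_data["data"].append({"title": article["title"], "paragraphs": q_paras})

with open("dev-v1.1-context.json", "w") as f:
    json.dump(context_data, f)
with open("dev-v1.1-question.json", "w") as f:
    json.dump(question_data, f)
```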

### 1. Training
In our [paper][paper], we have introduced three baseline models:

For LSTM model:

python --cuda --data_dir $SQUAD_DIR
python --cuda --train_path $SQUAD_TRAIN_PATH --test_path $SQUAD_DEV_PATH

For LSTM+SA model:

python --cuda --num_heads 2 --data_dir $SQUAD_DIR
python --cuda --num_heads 2 --train_path $SQUAD_TRAIN_PATH --test_path $SQUAD_DEV_PATH

For LSTM+SA+ELMo model:

python --cuda --num_heads 2 --elmo --train_path $SQUAD_TRAIN_PATH --test_path $SQUAD_DEV_PATH
By default, these commands will write all output files (saved models, reports, etc.) to `/tmp/piqa`. You can change the directory with the `--output_dir` argument.

### 2. Easy Evaluation
Assuming you trust us that the baseline code abides by the independence constraint, let's simply output the prediction file from the full (context+question) dataset and evaluate it with the original SQuAD v1.1 evaluator. To do this with the LSTM model, run:

python --cuda --num_heads 2 --elmo --data_dir $SQUAD_DIR
python --cuda --mode test --iteration XXXX --test_path $SQUAD_DEV_PATH

Here, `--iteration` indicates the step at which the model of interest was saved. Take a look at the standard output during training and pick the iteration that gives the best performance (which is automatically tracked).
This will output the prediction file at `/tmp/piqa/pred.json`. Now, let's see what the official evaluator thinks about it:

python evaluate-v1.1.py $SQUAD_DEV_PATH /tmp/piqa/pred.json

That was easy! But why do we have *official evaluation* section below? Because we had a big assumption in the beginning, that you trust us that our encoders are independent. But who knows?
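
For reference, `pred.json` is a flat JSON object mapping each question id to an answer string, and the official evaluator scores it with exact match (EM) and token-level F1 after light normalization. A condensed, unofficial sketch of those metrics (the example id and answer are illustrative):

```python
import collections
import json
import re
import string

def normalize(s):
    """Lowercase, strip punctuation, articles, and extra whitespace (as the SQuAD evaluator does)."""
    s = "".join(ch for ch in s.lower() if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    return float(normalize(prediction) == normalize(ground_truth))

def f1(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gt_tokens = normalize(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gt_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

# pred.json maps each SQuAD question id to a single answer string, e.g.
# {"56be4eafacb8001400a50302": "Denver Broncos", ...}
predictions = json.load(open("/tmp/piqa/pred.json"))
```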

### 3. Official Evaluation
We need a strict evaluation method that enforces the independence between the encoders. To do so, our (PIQA) evaluator requires three inputs (instead of two). The first input is identical to that of the official evaluator: the path to the test data with the answers. The second and the third correspond to the directories for the phrase embeddings and the question embeddings, respectively. Here is an example directory structure:

+-- context_emb
| +-- Super_Bowl_50_0.npz
| +-- Super_Bowl_50_0.json
| +-- Super_Bowl_50_1.npz
| +-- Super_Bowl_50_1.json
| ...
+-- question_emb
| +-- 56be4eafacb8001400a50302.npz
| +-- 56d204ade7d4791d00902603.npz
| ...

This looks quite complicated! Let's go through it one by one.

1. **`.npz` is a numpy/scipy matrix dump**: Each `.npz` file corresponds to an *N*-by-*d* matrix. If it is a dense matrix, it needs to be saved via the `numpy.savez()` method, and if it is a sparse matrix (depending on your need), it needs to be saved via the `scipy.sparse.save_npz()` method (see the sketch after this list). Note that `scipy.sparse.save_npz()` is relatively new and old scipy versions do not support it.
2. **each `.npz` in `context_emb` is named after a paragraph id**: Here, the paragraph id is `'%s_%s' % (article_title, str(para_idx))`, where `para_idx` indicates the index of the paragraph within the article (starting at `0`). For instance, if the article `Super_Bowl_50` has 35 paragraphs, then it will have `.npz` files up to `Super_Bowl_50_34.npz`.
3. **each `.npz` in `context_emb` is *N* phrase vectors of *d*-dim**: It is up to the submitted model to decide *N* and *d*. For instance, if the paragraph length is 100 words and we enumerate all possible phrases with length <= 7, then we will approximately have *N* = 700. While we will limit the size of `.npz` per word during the submission so one cannot have a very large dense matrix, we will allow sparse matrices, so *d* can be very large in some cases.
4. **`.json` is a list of *N* phrases**: each phrase corresponds to the phrase vector in the same row of its corresponding `.npz` file. Of course, one can have duplicate phrases (i.e. several vectors per phrase).
5. **each `.npz` in `question_emb` is named after question id**: Here, question id is the official id in original SQuAD 1.1.
6. **each `.npz` in `question_emb` must be a *1*-by-*d* matrix**: Since each question has a single embedding, *N* = 1. Hence the matrix corresponds to the question representation.
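
Putting these rules together, dumping one paragraph and one question might look like the following sketch (the phrase strings, dimensionality, and vectors are made up; only the file naming and formats follow the rules above):

```python
import json
import os

import numpy as np

os.makedirs("context_emb", exist_ok=True)
os.makedirs("question_emb", exist_ok=True)

phrases = ["Denver Broncos", "the Denver Broncos", "24-10"]  # N illustrative phrases
phrase_vecs = np.random.randn(len(phrases), 512)             # N-by-d matrix (N and d are up to you)

# Rules 1-3: dense dump via numpy.savez(), named after the paragraph id.
np.savez("context_emb/Super_Bowl_50_0.npz", phrase_vecs)
# Sparse alternative (rule 1, needs a reasonably recent scipy):
#   import scipy.sparse
#   scipy.sparse.save_npz("context_emb/Super_Bowl_50_0.npz", scipy.sparse.csr_matrix(phrase_vecs))

# Rule 4: the .json lists the N phrases, aligned row-by-row with the matrix.
with open("context_emb/Super_Bowl_50_0.json", "w") as f:
    json.dump(phrases, f)

# Rules 5-6: each question gets a 1-by-d matrix named after its SQuAD question id.
np.savez("question_emb/56be4eafacb8001400a50302.npz", np.random.randn(1, 512))
```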

Following these rules, one should confirm that `context_emb` contains 4134 files (2067 `.npz` files and 2067 `.json` files, i.e. 2067 paragraphs) and `question_emb` contains 10570 files (one file for each question) for SQuAD v1.1 dev dataset. Hint: `ls context_emb/ | wc -l` gives you the count in the `context_emb` folder.
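
The same sanity check can be scripted; the expected counts below are the SQuAD v1.1 dev numbers quoted above:

```python
import glob
import os

npz = {os.path.splitext(os.path.basename(p))[0] for p in glob.glob("context_emb/*.npz")}
jsn = {os.path.splitext(os.path.basename(p))[0] for p in glob.glob("context_emb/*.json")}

assert npz == jsn, "every paragraph needs both an .npz and a .json file"
assert len(npz) == 2067, "expected 2067 paragraphs for the SQuAD v1.1 dev set"
assert len(glob.glob("question_emb/*.npz")) == 10570, "expected one .npz per dev question"
```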

In order to output these directories from our model, we run `` twice with different arguments, once for the document encoder and once for the question encoder.

For document encoder:

python --cuda --mode embed_context --iteration XXXX --test_path $SQUAD_DEV_CONTEXT_PATH

For question encoder:

python --cuda --mode embed_question --iteration XXXX --test_path $SQUAD_DEV_QUESTION_PATH

The encoders will output the embeddings to the default output directory. You can also control the target directories with `--context_emb_dir` and `--question_emb_dir`, respectively. Using uncompressed dense (default) numpy dump for the LSTM model, these directories take about 3 GB of space. Now one can *officially* evaluate by:

python $SQUAD_DEV_PATH /tmp/piqa/context_emb/ /tmp/piqa/question_emb/
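
To get a feel for what the evaluator does with these directories, here is a rough, unofficial sketch that loads one paragraph dump and one question dump and retrieves the highest-scoring phrase (the real evaluator additionally handles sparse matrices, maps each question id to its paragraph, and computes the scores; the file names below are the illustrative ones from the example above):

```python
import json

import numpy as np

# Load one paragraph's N-by-d phrase matrix and its aligned phrase list.
context = np.load("/tmp/piqa/context_emb/Super_Bowl_50_0.npz")
phrase_vecs = context[context.files[0]]
with open("/tmp/piqa/context_emb/Super_Bowl_50_0.json") as f:
    phrases = json.load(f)

# Load one question's 1-by-d embedding and run the nearest neighbor search.
question = np.load("/tmp/piqa/question_emb/56be4eafacb8001400a50302.npz")
question_vec = question[question.files[0]].reshape(-1)

prediction = phrases[int(np.argmax(phrase_vecs @ question_vec))]
print(prediction)
```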

## Submission
We are coordinating with the CodaLab and SQuAD folks to incorporate PIQA evaluation into the CodaLab framework. Submission guidelines will be available soon!

@@ -19,5 +19,6 @@
ans = np.argmax(np.matmul(doc, np.expand_dims(query, -1)), 0)
duration = time.time() - start_time
speed = args.num_vecs * args.num_iters / duration
print('numpy: %.3f ms per %d vecs of %dD, or %d vecs/s' % (
duration * 1000 / args.num_iters, args.num_vecs, args.dim, speed))
print('numpy: %.3f ms per %d vecs of %dD, or %d vecs/s' % (duration * 1000 / args.num_iters, args.num_vecs, args.dim, speed))
