Utilities for running the PACRR neural IR model for Complex Answer Retrieval.




So far, the only feature is similarity matrix (simmat) generation, written to an HDF5 file.


pip install -r requirements.txt

Python version: 3.6


python generate_simmats.py --run [qrels] --outlines [outlines.cbor] --embeddings [embed] --paragraphs [paragraphcorpus.cbor]

By default, output is written to simmats.hdf5. Use --output to choose a different path.

The --run argument is a TREC qrels or run file that indicates which query-paragraph pairs to generate simmats for. The --outlines file is a cbor file that contains the text of the queries. The --paragraphs file is a cbor file that contains the text of the paragraphs. Both cbor files can be found at http://trec-car.cs.unh.edu/datareleases/.
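As a rough illustration of what the --run file contributes, the sketch below collects (qid, docid) pairs from whitespace-separated TREC lines. It relies only on the standard layouts, in which both qrels (`qid 0 docid rel`) and run files (`qid Q0 docid rank score tag`) carry the query id in column 0 and the document id in column 2. The function name `read_pairs` and the sample ids are made up for the example; this is not necessarily how generate_simmats.py reads the file.

```python
import io


def read_pairs(lines):
    """Collect unique (qid, docid) pairs from TREC qrels or run lines."""
    pairs = set()
    for line in lines:
        cols = line.split()
        if len(cols) < 4:
            continue  # skip blank or malformed lines
        # Column 0 is the query id and column 2 the document id in both formats.
        pairs.add((cols[0], cols[2]))
    return pairs


# Hypothetical qrels content, for illustration only.
sample = io.StringIO(
    "enwiki:A/History 0 para1 1\n"
    "enwiki:A/History 0 para2 1\n"
    "enwiki:B 0 para1 1\n"
)
pairs = read_pairs(sample)
qids = {q for q, _ in pairs}
docids = {d for _, d in pairs}
print(f"found {len(pairs)} pairs, {len(qids)} qids, {len(docids)} docids")
# found 3 pairs, 2 qids, 2 docids
```

Passing an open file handle instead of the `StringIO` object works the same way, since the function only iterates over lines.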

This will run on as many CPUs as are available on the machine. To configure this value, use the --pool option.

On 24 CPUs, it takes about 30 minutes to generate simmats for the automatic qrels of one fold (data version 1.5). The process is currently I/O heavy, so there is probably room to make it faster, although simply reading through the paragraphs file takes a while on its own.


python generate_simmats.py --run car-train/train.fold0.cbor.hierarchical.qrels --outlines car-train/train.fold0.cbor.outlines --embeddings glove.6B.50d.txt --paragraphs car-paragraphcorpus/paragraphcorpus.cbor
[2018-06-30 16:52:02,509][__main__:59][DEBUG] - [START] reading run files
1054369it [00:03, 272597.76it/s]
[2018-06-30 16:52:06,377][__main__:51][DEBUG] - found 1054369 pairs, 436851 qids, 1030775 docids
[2018-06-30 16:52:06,377][__main__:61][DEBUG] - [DONE] reading run files [3.8686s]
[2018-06-30 16:52:06,377][__main__:59][DEBUG] - [START] reading outlines
408137it [01:19, 5149.98it/s]
[2018-06-30 16:53:25,628][__main__:51][DEBUG] - found 408004 qids, 266457 headings
[2018-06-30 16:53:25,628][__main__:61][DEBUG] - [DONE] reading outlines [79.2507s]
[2018-06-30 16:53:25,785][__main__:54][WARNING] - missing outlines for 28847 qid(s), e.g. 9/11%20Truth%20movement/History
[2018-06-30 16:53:25,785][__main__:59][DEBUG] - [START] cleaning up missing qids
[2018-06-30 16:53:25,842][__main__:61][DEBUG] - [DONE] cleaning up missing qids [0.0573s]
[2018-06-30 16:53:25,842][__main__:59][DEBUG] - [START] loading embeddings
[2018-06-30 16:53:33,795][__main__:61][DEBUG] - [DONE] loading embeddings [7.9529s]
[2018-06-30 16:53:33,992][__main__:54][WARNING] - missing 35662 token(s) in embeddings, e.g. hauptstadt. These will be treated as binary matches. Consider retraining embeddings.
[2018-06-30 16:53:34,004][__main__:59][DEBUG] - [START] generating simmats
[2018-06-30 16:53:34,019][__main__:51][DEBUG] - pool process started
[2018-06-30 16:53:34,337][__main__:51][DEBUG] - pool process started
100%|██████████████████████████████| 1030775/1030775 [32:53<00:00, 522.19para/s]
[2018-06-30 17:26:28,273][__main__:61][DEBUG] - [DONE] generating simmats [1974.2693s]
[2018-06-30 17:26:28,273][__main__:51][DEBUG] - done!

File size: 5.2G


The output file is organized as follows:

[heading_or_page_id]/[paragraphid] - q x d similarity matrix

To build the full simmat for a query, find all components of the query (i.e., the page_id and the heading ids) and concatenate their matrices along axis 0 (the query-term axis).
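The concatenation step could be sketched as follows. The helper `build_query_simmat` and the ids `page1`, `page1/Heading`, and `para1` are hypothetical, and a plain dict of NumPy arrays stands in for the open HDF5 file so that the snippet runs on its own. It assumes each matrix can be looked up under the key `[heading_or_page_id]/[paragraphid]`, as in the layout described above.

```python
import numpy as np


def build_query_simmat(store, component_ids, paragraph_id):
    """Stack the per-component q x d matrices for one paragraph along axis 0.

    `store` is any mapping from "[heading_or_page_id]/[paragraphid]" to a
    2-D array, such as an open h5py.File or a dict of NumPy arrays.
    """
    parts = [np.asarray(store[f"{cid}/{paragraph_id}"]) for cid in component_ids]
    # All parts share the same number of columns (document terms),
    # so they can be stacked along the query-term axis.
    return np.concatenate(parts, axis=0)


# Stand-in for the HDF5 file: a page title with 2 terms and a heading with
# 3 terms, both scored against a 5-term paragraph.
store = {
    "page1/para1": np.ones((2, 5), dtype=np.float32),
    "page1/Heading/para1": np.zeros((3, 5), dtype=np.float32),
}
full = build_query_simmat(store, ["page1", "page1/Heading"], "para1")
print(full.shape)  # (5, 5)
```

With the real output, the same function should work on a file opened with `h5py.File("simmats.hdf5", "r")`, since h5py groups also support path-style lookups.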

Terms that are not found in the embeddings file are binary matched (i.e., they get a similarity score of 1 on an exact match and 0 otherwise). It's best to retrain the embeddings on all available data so that no terms are missing.
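A minimal sketch of this fallback is shown below. It assumes cosine similarity between embedding vectors, which this README does not state explicitly, and it applies binary matching whenever either term of a pair lacks an embedding. The function name `similarity_matrix` and the toy two-dimensional embeddings are invented for the example.

```python
import numpy as np


def similarity_matrix(query_terms, doc_terms, embeddings):
    """Build a q x d matrix: cosine similarity where both terms have
    embeddings, exact-match (1 or 0) otherwise."""
    simmat = np.zeros((len(query_terms), len(doc_terms)), dtype=np.float32)
    for i, q in enumerate(query_terms):
        for j, d in enumerate(doc_terms):
            if q in embeddings and d in embeddings:
                qv, dv = embeddings[q], embeddings[d]
                denom = np.linalg.norm(qv) * np.linalg.norm(dv)
                # Guard against all-zero vectors to avoid dividing by zero.
                simmat[i, j] = np.dot(qv, dv) / denom if denom > 0 else 0.0
            else:
                # Binary match for terms missing from the embeddings.
                simmat[i, j] = 1.0 if q == d else 0.0
    return simmat


# Toy embeddings; "hauptstadt" is deliberately missing.
embeddings = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
}
m = similarity_matrix(["cat", "hauptstadt"], ["cat", "dog", "hauptstadt"], embeddings)
print(m.shape)  # (2, 3)
```

Here the row for "hauptstadt" is 0 everywhere except the column of the identical document term, which is exactly the binary-match behavior described above.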