PACRR-CAR

Utilities for running the PACRR neural IR model for Complex Answer Retrieval.

So far, there's just similarity matrix generation (to HDF5 file).

Setup

pip install -r requirements.txt

Python version: 3.6

Usage

python generate_simmats.py --run [qrels] --outlines [outlines.cbor] --embeddings [embed] --paragraphs [paragraphcorpus.cbor]

Outputs to simmats.hdf5. To configure, use --output.

The --run argument is a TREC qrels or run file that indicates which simmats to generate. The --outlines file is a cbor file that indicates the text of the queries. The --paragraphs file is a cbor file that contains the text of the paragraphs. Both can be found at http://trec-car.cs.unh.edu/datareleases/.

This will run on as many CPUs as are available on the machine. To configure this value, use the --pool option.

On 24 CPUs, it takes about 30min to generate simmats for automatic in 1 fold (data version 1.5). It's pretty I/O heavy right now, so there's probably a way to make it faster. It takes a while just to read through the paragraphs file itself, though.

Sample

python generate_simmats.py --run car-train/train.fold0.cbor.hierarchical.qrels --outlines car-train/train.fold0.cbor.outlines --embeddings glove.6B.50d.txt --paragraphs car-paragraphcorpus/paragraphcorpus.cbor
[2018-06-30 16:52:02,509][__main__:59][DEBUG] - [START] reading run files
1054369it [00:03, 272597.76it/s]
[2018-06-30 16:52:06,377][__main__:51][DEBUG] - found 1054369 pairs, 436851 qids, 1030775 docids
[2018-06-30 16:52:06,377][__main__:61][DEBUG] - [DONE] reading run files [3.8686s]
[2018-06-30 16:52:06,377][__main__:59][DEBUG] - [START] reading outlines
408137it [01:19, 5149.98it/s]
[2018-06-30 16:53:25,628][__main__:51][DEBUG] - found 408004 qids, 266457 headings
[2018-06-30 16:53:25,628][__main__:61][DEBUG] - [DONE] reading outlines [79.2507s]
[2018-06-30 16:53:25,785][__main__:54][WARNING] - missing outlines for 28847 qid(s), e.g. 9/11%20Truth%20movement/History
[2018-06-30 16:53:25,785][__main__:59][DEBUG] - [START] cleaning up missing qids
[2018-06-30 16:53:25,842][__main__:61][DEBUG] - [DONE] cleaning up missing qids [0.0573s]
[2018-06-30 16:53:25,842][__main__:59][DEBUG] - [START] loading embeddings
[2018-06-30 16:53:33,795][__main__:61][DEBUG] - [DONE] loading embeddings [7.9529s]
[2018-06-30 16:53:33,992][__main__:54][WARNING] - missing 35662 token(s) in embeddings, e.g. hauptstadt. These will be treated as binary matches. Consider retraining embeddings.
[2018-06-30 16:53:34,004][__main__:59][DEBUG] - [START] generating simmats
[2018-06-30 16:53:34,019][__main__:51][DEBUG] - pool process started
...
[2018-06-30 16:53:34,337][__main__:51][DEBUG] - pool process started
100%|██████████████████████████████| 1030775/1030775 [32:53<00:00, 522.19para/s]
[2018-06-30 17:26:28,273][__main__:61][DEBUG] - [DONE] generating simmats [1974.2693s]
[2018-06-30 17:26:28,273][__main__:51][DEBUG] - done!

File size: 5.2G

Details

The output file is organized as follows:

[heading_or_page_id]/[paragraphid] - q x d similarity matrix

When building the full simmat for a query, find all components of the query (i.e., page_id, and heading ids), and concat the matrices along axis 0.

Terms that are not found in the embeddings file are binary matched (i.e, a similarity score of 1 if exact match, otherwise 0). It's best to retrain the embeddings using all available data so there are no missing terms.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README.md		README.md
generate_simmats.py		generate_simmats.py
requirements.txt		requirements.txt
util.py		util.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PACRR-CAR

Setup

Usage

Sample

Details

About

Releases

Packages

Languages

seanmacavaney/pacrr-car

Folders and files

Latest commit

History

Repository files navigation

PACRR-CAR

Setup

Usage

Sample

Details

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages