IR and QA Pipeline System for COVID-19

This repository is maintained by THUNLP and Microsoft AI. It contains ongoing work on an IR and QA pipeline system for the novel coronavirus COVID-19 (SARS-CoV-2). The system is trained on MS MARCO, a large-scale reading comprehension dataset, and transferred directly to the medical domain. We hope this repository helps us work together against COVID-19.

COVID Dataset

The CORD-19 resource is constructed by Semantic Scholar at the Allen Institute and will continue to be updated as new research is published in archival services and peer-reviewed publications. The shared task on Kaggle aims to help specialists in virology, pharmacy, and microbiology find answers to open questions about COVID-19.

IR and QA Pipeline

Document Retrieval

The following models are implemented for an effective document retrieval system.

BM25
Approximate Nearest Neighbor (ANN)
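
As a rough illustration of what BM25 document retrieval computes, here is a minimal, self-contained sketch. It is not the Anserini implementation used by this repository; the function name `bm25_scores`, the toy documents, and the parameter values `k1=0.9` and `b=0.4` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; k1 and b are example values.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF variant with a +1 inside the log, so it is never negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents need more matches.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy collection (made up for this example).
docs = [
    "coronavirus transmission in hospitals".split(),
    "influenza vaccine trial results".split(),
    "incubation period of the novel coronavirus".split(),
]
scores = bm25_scores("coronavirus incubation".split(), docs)
# Document indices sorted from most to least relevant.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The third document matches both query terms and ranks first; the second matches neither and scores zero.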

Paragraph Retrieval

BERT (Base version of BERT with 12 layers)
Distilled BERT (BERT with 3 layers)
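
The paragraph retrieval stage can be thought of as reranking candidate paragraphs with a relevance scorer. The sketch below shows only that control flow. `rerank_paragraphs` and `overlap_score` are hypothetical names, and the term-overlap scorer is a stand-in for the BERT ranking model, not the model itself.

```python
def overlap_score(query, paragraph):
    """Stand-in scorer: fraction of query terms that appear in the paragraph.

    A BERT ranker would replace this function with a model forward pass.
    """
    q_terms = set(query.lower().split())
    p_terms = set(paragraph.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def rerank_paragraphs(query, paragraphs, score_fn, top_k=3):
    """Return the top_k (score, paragraph) pairs, highest score first."""
    scored = [(score_fn(query, p), p) for p in paragraphs]
    # The sort is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy paragraphs (made up for this example).
paragraphs = [
    "The incubation period is estimated at five days",
    "Masks reduce droplet spread",
    "Incubation period estimates vary by study",
]
top = rerank_paragraphs("incubation period", paragraphs, overlap_score, top_k=2)
```

Passing the scorer as an argument keeps the reranking loop unchanged when a larger or distilled model is swapped in.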

QA System

BERT (Base version)

Keyphrase Extraction

Running Systems

Download the checkpoints and the data and index files, then unzip them into the models and retrieval folders, respectively. All resources are available on Tsinghua Cloud and Google Drive. Then install the required packages.

Build the BM25 index using Anserini. Download links for the collections are available in the data folder.

./indexer/bm25_indexer/bin/IndexCollection -collection JsonCollection -es -es.index cord19 -input collection -generator LuceneDocumentGenerator -threads 1 -storePositions -storeDocvectors -storeRawDocs

pip install -r requirements.txt

Set the CUDA device.

export CUDA_VISIBLE_DEVICES=DEVICE_ID
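
The same variable can also be set from inside a Python process, provided this happens before any library initializes CUDA. The snippet below is a minimal sketch; the device id "0" is an example value, not a recommendation from this repository.

```python
import os

# Must be set before a CUDA-using library (for example a deep learning
# framework) initializes the GPU runtime; later changes are not picked up.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # "0" is an example device id

# The value is a comma-separated list of device ids; the process sees the
# listed devices renumbered starting from 0.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
```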

Run the pipeline with the default settings: BM25 document retrieval, BERT paragraph retrieval, and the BERT QA model.

python run_pipeline.py

Use ANN for document retrieval.

python run_pipeline.py --use_ann

Use Distilled BERT for paragraph retrieval.

python run_pipeline.py --ranking_model_path ./models/bert_ranking_model_distilled

Keyphrase Extraction: detailed guides for generating keyphrases are in the kpe folder.

Running Results

The search result is a list of the top-k documents. Each document contains the following fields:

"title": Document title
"keyphrases": Extracted keyphrases
"text": Document text

The QA result is a list of the top-k answers. Each answer contains the following fields:

"text": Answer text
"title": The title of the document the answer comes from

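To make the two result formats concrete, the sketch below builds small example lists with the fields described above and formats them for display. The sample values and the helper names `summarize_document` and `format_answer` are hypothetical, and representing "keyphrases" as a list of strings is an assumption.

```python
# Example search result: a list of documents (values are made up).
search_results = [
    {
        "title": "Example document title",
        "keyphrases": ["example keyphrase", "another keyphrase"],  # assumed type
        "text": "Example document text.",
    },
]

# Example QA result: a list of answers (values are made up).
qa_results = [
    {"text": "Example answer text.", "title": "Example document title"},
]


def summarize_document(doc, max_chars=40):
    """One-line summary: title, keyphrases, and a truncated text snippet."""
    snippet = doc["text"][:max_chars]
    phrases = ", ".join(doc["keyphrases"])
    return f'{doc["title"]} [{phrases}]: {snippet}'


def format_answer(answer):
    """Answer text followed by the title of its source document."""
    return f'{answer["text"]} (source: {answer["title"]})'


doc_lines = [summarize_document(d) for d in search_results]
answer_lines = [format_answer(a) for a in qa_results]
```
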
Contribution

The following people contributed equally to this repository:

Aowei Lu, Jiahua Liu, Kaitao Zhang, Shi Yu, Si Sun, Zhenghao Liu

Project Organizers

Chenyan Xiong
- Microsoft Research AI, Redmond, USA
- Homepage
Zhiyuan Liu
- Tsinghua University
- Homepage
