Two-Step Question Retrieval for Open-Domain QA

Yeon Seonwoo*, Juhee Son*, Jiho Jin, Sang-Woo Lee, Ji-Hoon Kim, Jung-Woo Ha, and Alice Oh (*equal contribution) | Findings of ACL 2022 | Paper

KAIST, Naver CLOVA, Naver AI Lab

Official implementation of Two-Step Question Retrieval for Open-Domain QA

Abstract

The retriever-reader pipeline has shown promising performance in open-domain QA but suffers from a very slow inference speed. Recently proposed question retrieval models tackle this problem by indexing question-answer pairs and searching for similar questions. These models have shown a significant increase in inference speed, but at the cost of lower QA performance compared to the retriever-reader models. This paper proposes a two-step question retrieval model, SQuID (Sequential QuestionIndexed Dense retrieval) and distant supervision for training. SQuID uses two bi-encoders for question retrieval. The first-step retriever selects top-k similar questions, and the secondstep retriever finds the most similar question from the top-k questions. We evaluate the performance and the computational efficiency of SQuID. The results show that SQuID significantly increases the performance of existing question retrieval models with a negligible loss on inference speed.

Getting started

Requirements

Java 11 (to run Pyserini)
Nvidia apex
Python libraries in requirements.txt

Setup

Set the project directory

export WORKING_DIR=/path/to/this/project/folder

Download PAQ files

cd $WORKING_DIR
wget https://dl.fbaipublicfiles.com/paq/v1/TQA_TRAIN_NQ_TRAIN_PAQ.tar.gz
tar -xvzf TQA_TRAIN_NQ_TRAIN_PAQ.tar.gz
mkdir -p $WORKING_DIR/data/paq/paq_qas/
mv TQA_TRAIN_NQ_TRAIN_PAQ/tqa_train_nq_train_PAQ.jsonl $WORKING_DIR/data/paq/paq_qas/

Download the PAQ retriever

cd $WORKING_DIR
wget https://dl.fbaipublicfiles.com/paq/v1/models/retrievers/retriever_multi_base_256.tar.gz
tar -xvzf retriever_multi_base_256.tar.gz
mkdir -p checkpoints/paq_dpr
mv retriever_multi_base_256 checkpoints/paq_dpr/

Download the PAQ faiss index

cd $WORKING_DIR
wget https://dl.fbaipublicfiles.com/paq/v1/models/indices/multi_base_256.hnsw.sq8.faiss
mv multi_base_256.hnsw.sq8.faiss data/paq/paq_qas/

Download PAQ code

cd $WORKING_DIR/code/module/retriever/
git clone https://github.com/facebookresearch/PAQ.git
mv PAQ/paq ./

Download NQ files

mkdir -p $WORKING_DIR/data/paq/nq
cd $WORKING_DIR/data/paq/nq/
wget https://dl.fbaipublicfiles.com/paq/v1/annotated_datasets/NQ-open.train-train.jsonl
wget https://dl.fbaipublicfiles.com/paq/v1/annotated_datasets/NQ-open.train-dev.jsonl
wget https://dl.fbaipublicfiles.com/paq/v1/annotated_datasets/NQ-open.test.jsonl

Training and evaluation

First-step retrieval

Option 1: BM25 as the first-step retriever

This script generates an index file for BM25 retrieval and runs BM25

bash scripts/_0_paq2index.sh
bash scripts/_1_bm25_search.sh

Option 2: RePAQ as the first-step retriever

This script runs RePAQ retrieval on NQ-train/dev/test sets

bash scripts/_1_dpr_search.sh

Training SQuID

Option 1: BM25

bash scripts/_2_train_model_first_step_retriever_bm25.sh

Option 2: RePAQ

bash scripts/_2_train_model_first_step_retriever_repaq.sh

Inference

Option 1: BM25

bash scripts/_3_pred_model_bm25.sh

Option 2: RePAQ

bash scripts/_3_pred_model_dpr.sh

Code reference

https://github.com/facebookresearch/PAQ

Copyright

Acknowledgement

This work was partly supported by NAVER Corp. and the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921).

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commits
code		code
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

code

code

README.md

README.md

requirements.txt

requirements.txt

Repository files navigation

Two-Step Question Retrieval for Open-Domain QA

Abstract

Getting started

Requirements

Setup

Training and evaluation

First-step retrieval

Training SQuID

Inference

Code reference

Copyright

Acknowledgement

About

Releases

Packages

Languages

yeonsw/SQuID

Folders and files

Latest commit

History

Repository files navigation

Two-Step Question Retrieval for Open-Domain QA

Abstract

Getting started

Requirements

Setup

Training and evaluation

First-step retrieval

Training SQuID

Inference

Code reference

Copyright

Acknowledgement

About

Resources

Stars

Watchers

Forks

Languages