
Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback

HongChien Yu, Chenyan Xiong, Jamie Callan

This repository contains the code to reproduce the results reported in Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback.

Dense retrieval systems conduct first-stage retrieval using embedded representations and simple similarity metrics to match a query to documents. Their effectiveness depends on the encoded embeddings capturing the semantics of queries and documents, a challenging task given the shortness and ambiguity of search queries. This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval. ANCE-PRF uses a BERT encoder that consumes the query and the top retrieved documents from a dense retrieval model, ANCE, and it learns to produce better query embeddings directly from relevance labels. It also keeps the document index unchanged to reduce overhead. ANCE-PRF significantly outperforms ANCE and other recent dense retrieval systems on several datasets. Analysis shows that the PRF encoder effectively captures the relevant and complementary information from the PRF documents, while ignoring their noise, using its learned attention mechanism.
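
Conceptually, the PRF query encoder works as in the minimal sketch below (illustrative only, not the repository's code): a BERT model consumes the query concatenated with the top-k feedback documents, and its [CLS] output becomes the refined query embedding, which is scored against the unchanged ANCE document index. All names here are hypothetical.

import torch
from transformers import BertModel, BertTokenizer

# Hedged sketch of the PRF query encoder described above.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased").eval()

def prf_query_embedding(query, feedback_docs, max_len=512):
    # Form "[CLS] query [SEP] doc_1 [SEP] ... [SEP] doc_k" and
    # truncate to BERT's input limit.
    text = f" {tokenizer.sep_token} ".join([query] + feedback_docs)
    inputs = tokenizer(text, truncation=True, max_length=max_len,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # The [CLS] vector is the new query representation; the document
    # index built by ANCE is left unchanged.
    return outputs.last_hidden_state[:, 0]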

Requirements

pip install -r requirements.txt

Data Preparation

Data Preprocessing

Run the following script to preprocess data:

cd data_prep
bash download_data.sh 
bash preprocess_data.sh 

Get ANCE Passage Embeddings

cd data_prep
bash get_ance_embs.sh 
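
For intuition, here is a hedged sketch of what this step produces. get_ance_embs.sh drives the actual ANCE inference code; below, a generic RoBERTa encoder stands in for the ANCE checkpoint, and the [CLS] output is taken as each passage's embedding.

import numpy as np
import torch
from transformers import RobertaModel, RobertaTokenizer

# Illustrative stand-in for the ANCE passage encoder.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base").eval()

def embed_passages(passages, batch_size=32, max_len=128):
    embs = []
    for i in range(0, len(passages), batch_size):
        batch = tokenizer(passages[i:i + batch_size], padding=True,
                          truncation=True, max_length=max_len,
                          return_tensors="pt")
        with torch.no_grad():
            out = encoder(**batch)
        embs.append(out.last_hidden_state[:, 0].numpy())
    return np.concatenate(embs, axis=0)  # shape: (num_passages, hidden)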

Get ANCE Ranking

bash get_ance_ranking.sh 
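
Conceptually, the ranking step is a nearest-neighbor search over the passage embeddings with dot-product similarity. The sketch below uses FAISS, which is an assumption on our part; the script may use its own search code.

import faiss
import numpy as np

# Hypothetical sketch: brute-force inner-product search over the ANCE
# passage embeddings to produce a top-k ranking per query.
def rank(query_embs: np.ndarray, passage_embs: np.ndarray, k: int = 100):
    index = faiss.IndexFlatIP(passage_embs.shape[1])  # dot-product similarity
    index.add(passage_embs.astype(np.float32))
    scores, ids = index.search(query_embs.astype(np.float32), k)
    return scores, ids  # per-query top-k scores and passage ids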

Prepare PRF data

Run the following command to create PRF data from ANCE top-retrieved documents:

cd data_prep 
bash get_prf_data.sh
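
A hedged sketch of the kind of data this step assembles: each query is paired with the text of its top-k ANCE-retrieved passages, which together form the PRF encoder's input. All names below are hypothetical.

# Hypothetical sketch of assembling PRF training inputs.
def build_prf_examples(queries, rankings, passages, k=3):
    # queries: {qid: query_text}; rankings: {qid: [pid, ...]};
    # passages: {pid: passage_text}
    examples = {}
    for qid, query in queries.items():
        feedback = [passages[pid] for pid in rankings[qid][:k]]
        examples[qid] = {"query": query, "prf_docs": feedback}
    return examples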

Training

bash train_encoder.sh
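
As described in the paper, the PRF query encoder is trained directly from relevance labels while the document embeddings stay frozen. The sketch below shows a standard dense-retrieval negative log likelihood objective with dot-product scoring; the exact negative sampling used by train_encoder.sh may differ.

import torch
import torch.nn.functional as F

# Hedged sketch of the training objective: the PRF query embedding is
# scored against a frozen relevant-document embedding and frozen
# negatives by dot product, optimized with negative log likelihood.
def prf_nll_loss(q_emb, pos_doc_emb, neg_doc_embs):
    # q_emb: (B, H); pos_doc_emb: (B, H); neg_doc_embs: (B, N, H)
    pos_scores = (q_emb * pos_doc_emb).sum(-1, keepdim=True)      # (B, 1)
    neg_scores = torch.einsum("bh,bnh->bn", q_emb, neg_doc_embs)  # (B, N)
    logits = torch.cat([pos_scores, neg_scores], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)  # positive is index 0
    return F.cross_entropy(logits, labels)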

While training is running, concurrently run

bash eval.sh

which keeps looking for the newest checkpoint and evaluates it on the MS MARCO dev set. This is admittedly not an efficient use of the GPU in terms of utilization, but it speeds up training by avoiding periodically switching between training and evaluation.
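
In Python terms, the evaluation loop behaves roughly like the hypothetical sketch below; eval.sh is the actual implementation, and the names here are illustrative.

import glob
import os
import time

# Hypothetical sketch: poll the output directory and evaluate each new
# checkpoint on the MS MARCO dev set as it appears.
def watch_and_eval(ckpt_dir, evaluate, poll_seconds=300):
    seen = set()
    while True:
        for ckpt in sorted(glob.glob(os.path.join(ckpt_dir, "checkpoint-*"))):
            if ckpt not in seen:
                seen.add(ckpt)
                evaluate(ckpt)  # run dev evaluation, log to TensorBoard
        time.sleep(poll_seconds)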

In our work, we picked the model that performed best on the MS MARCO dev set, as reported in eval.sh's TensorBoard logs.

Trained Models and Ranking Files

Trained models for k=3 can be downloaded here.

Ranking files for k=3 can be downloaded here.
