
Generative Dense Retrieval: Memory Can Be a Burden

Code for the EACL 2024 (main conference) paper "Generative Dense Retrieval: Memory Can Be a Burden".

Environment

conda env create -f environment.yml
conda activate GDR

Data Process

[1] Dataset Download

Download the NQ Train and Dev datasets from https://ai.google.com/research/NaturalQuestions/download:

NQ Train: https://storage.cloud.google.com/natural_questions/v1.0-simplified/simplified-nq-train.jsonl.gz
NQ Dev: https://storage.cloud.google.com/natural_questions/v1.0-simplified/nq-dev-all.jsonl.gz

Please download both before re-training.
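A minimal download sketch, assuming gsutil is available and that the gs:// paths mirror the browser links above; the target directory ./Data_process/NQ_dataset/raw is our own choice, not something the scripts require:

mkdir -p ./Data_process/NQ_dataset/raw
# The NQ files are served from the public natural_questions bucket.
gsutil cp gs://natural_questions/v1.0-simplified/simplified-nq-train.jsonl.gz ./Data_process/NQ_dataset/raw/
gsutil cp gs://natural_questions/v1.0-simplified/nq-dev-all.jsonl.gz ./Data_process/NQ_dataset/raw/
# Unpack if the preprocessing script expects plain .jsonl files.
gunzip ./Data_process/NQ_dataset/raw/*.jsonl.gz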

[2] Data Preprocessing

Process the downloaded data with NQ_process.py (located in ./Data_process/NQ_dataset/NQ_preprocess).
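A sketch of the invocation; the flag names below are hypothetical placeholders, so check NQ_process.py for its actual arguments:

cd ./Data_process/NQ_dataset/NQ_preprocess
# --train_file / --dev_file / --output_dir are illustrative only,
# not necessarily the script's real interface.
python NQ_process.py \
  --train_file ../raw/simplified-nq-train.jsonl \
  --dev_file ../raw/nq-dev-all.jsonl \
  --output_dir ../processed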

[3] Query Generation

In our study, query generation significantly improves retrieval performance, especially for long-tail queries.

GDR uses a docTTTTTquery checkpoint to generate synthetic queries. Finetuning the docTTTTTquery checkpoint can make the generated query files, and hence the retrieval results, even better. The following command finetunes the model for 4k iterations to predict queries. We assume the tsv training file is at gs://your_bucket/qcontent_train_512.csv; change your_tpu_name, your_tpu_zone, your_project_id, and your_bucket accordingly.

t5_mesh_transformer  \
  --tpu="your_tpu_name" \
  --gcp_project="your_project_id" \
  --tpu_zone="your_tpu_zone" \
  --model_dir="gs://your_bucket/models/" \
  --gin_param="init_checkpoint = 'gs://your_bucket/model.ckpt-1004000'" \
  --gin_file="dataset.gin" \
  --gin_file="models/bi_v1.gin" \
  --gin_file="gs://t5-data/pretrained_models/base/operative_config.gin" \
  --gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
  --gin_param="tsv_dataset_fn.filename = 'gs://your_bucket/qcontent_train_512.csv'" \
  --gin_file="learning_rate_schedules/constant_0_001.gin" \
  --gin_param="run.train_steps = 1008000" \
  --gin_param="tokens_per_batch = 131072" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology ='v2-8'"

Please refer to the docTTTTTquery documentation for further details.
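For reference, a sketch of query prediction with the finetuned checkpoint, adapted from the docTTTTTquery README; the gin files and parameters follow that project and may need adjusting to your setup. In particular, infer_checkpoint_step should match the last finetuning step (1008000 with the command above), and docs.txt / predicted_queries.txt are placeholder file names:

t5_mesh_transformer \
  --tpu="your_tpu_name" \
  --gcp_project="your_project_id" \
  --tpu_zone="your_tpu_zone" \
  --model_dir="gs://your_bucket/models/" \
  --gin_file="gs://t5-data/pretrained_models/base/operative_config.gin" \
  --gin_file="infer.gin" \
  --gin_file="sample_decode.gin" \
  --gin_param="infer_checkpoint_step = 1008000" \
  --gin_param="utils.run.sequence_length = {'inputs': 512, 'targets': 64}" \
  --gin_param="Bitransformer.decode.temperature = 1.0" \
  --gin_param="input_filename = 'gs://your_bucket/docs.txt'" \
  --gin_param="output_filename = 'gs://your_bucket/predicted_queries.txt'" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = 'v2-8'"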

Find more details in NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.

Training

Once data pre-processing is complete, you can launch training with train.sh.
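For example (edit the data and output paths inside train.sh first if your layout differs):

# Assumes the processed data from the steps above is in place.
bash train.sh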

Evaluation

Please run infer.sh with the released checkpoint (download it to ./GDR_model/logs/). You can also run inference with your own checkpoint to evaluate model performance.
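For example, assuming the checkpoint has already been placed under ./GDR_model/logs/:

# Runs inference with the checkpoint in ./GDR_model/logs/; point the
# script at a different checkpoint path to evaluate your own model.
bash infer.sh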

Acknowledgement

We learned a lot and borrowed some code from other open-source projects, such as docTTTTTquery, when building GDR.
