conda env create -f environment.yml
conda activate GDR
[1] Dataset Download
Download the NQ Train and Dev datasets from https://ai.google.com/research/NaturalQuestions/download
NQ Train: https://storage.cloud.google.com/natural_questions/v1.0-simplified/simplified-nq-train.jsonl.gz
NQ Dev: https://storage.cloud.google.com/natural_questions/v1.0-simplified/nq-dev-all.jsonl.gz
Please download both files before re-training.
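If you have the gsutil CLI configured, the same files should also be fetchable directly from the public bucket; the gs:// paths below simply mirror the HTTPS URLs above:

gsutil cp gs://natural_questions/v1.0-simplified/simplified-nq-train.jsonl.gz .
gsutil cp gs://natural_questions/v1.0-simplified/nq-dev-all.jsonl.gz .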
[2] Data Preprocess
You can preprocess the data with NQ_process.py (located in ./Data_process/NQ_dataset/NQ_preprocess).
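NQ_process.py implements the actual preprocessing for this repo; purely as an illustration of what reading the simplified NQ release looks like, here is a minimal sketch (the question_text / document_text field names come from the simplified NQ format; the output file name is hypothetical):

import gzip
import json

def iter_nq_examples(path):
    # Yield (question, document_text) pairs from a simplified NQ .jsonl.gz file.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            yield example["question_text"], example["document_text"]

# Hypothetical output: one tab-separated (query, truncated document) pair per line.
with open("nq_train_pairs.tsv", "w", encoding="utf-8") as out:
    for question, document in iter_nq_examples("simplified-nq-train.jsonl.gz"):
        out.write(question + "\t" + " ".join(document.split()[:512]) + "\n")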
[3] Query Generation
In our study, query generation significantly improves retrieval performance, especially for long-tail queries.
GDR uses the docTTTTTquery checkpoint to generate synthetic queries. If you finetune the docTTTTTquery checkpoint, the generated query files can improve the retrieval results even further. Below we show how to finetune the model. The following command finetunes the model for 4k iterations to predict queries. We assume the tsv training file is at gs://your_bucket/qcontent_train_512.csv (download it from above). Also, change your_tpu_name, your_tpu_zone, your_project_id, and your_bucket accordingly.
t5_mesh_transformer \
--tpu="your_tpu_name" \
--gcp_project="your_project_id" \
--tpu_zone="your_tpu_zone" \
--model_dir="gs://your_bucket/models/" \
--gin_param="init_checkpoint = 'gs://your_bucket/model.ckpt-1004000'" \
--gin_file="dataset.gin" \
--gin_file="models/bi_v1.gin" \
--gin_file="gs://t5-data/pretrained_models/base/operative_config.gin" \
--gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
--gin_param="tsv_dataset_fn.filename = 'gs://your_bucket/qcontent_train_512.csv'" \
--gin_file="learning_rate_schedules/constant_0_001.gin" \
--gin_param="run.train_steps = 1008000" \
--gin_param="tokens_per_batch = 131072" \
--gin_param="utils.tpu_mesh_shape.tpu_topology ='v2-8'"
Please refer to the docTTTTTquery documentation for more details.
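If you do not have TPU access, a lightweight alternative sketch (not the repo's mesh-TensorFlow pipeline above) is to sample synthetic queries with a public doc2query T5 checkpoint on Hugging Face, e.g. castorini/doc2query-t5-base-msmarco:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("castorini/doc2query-t5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/doc2query-t5-base-msmarco")

doc = "Example document text for which we want to generate synthetic queries."
input_ids = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512).input_ids

# Top-k sampling (as in docTTTTTquery) yields several diverse queries per document.
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_k=10,
    num_return_sequences=3,
)
for i, ids in enumerate(outputs):
    print("query", i, ":", tokenizer.decode(ids, skip_special_tokens=True))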
Find more details in NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.
Once the data pre-processing is complete, you can launch training by running train.sh.
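For example (the log file name here is just an illustration):

bash train.sh 2>&1 | tee train.log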
For evaluation, please run infer.sh with the released checkpoint (download it to './GDR_model/logs/'). You can also run inference with your own checkpoint to evaluate model performance.
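A minimal invocation might look like the following (the checkpoint file names depend on the release you download):

mkdir -p ./GDR_model/logs
# place the downloaded checkpoint files under ./GDR_model/logs/ before running
bash infer.sh 2>&1 | tee infer.log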
We learned a lot and borrowed some code from the following projects when building GDR.