
Anchors

Source code of the CIKM 2021 long paper:

Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need,

including the following two parts:

  • Pre-training on corpus based on hyperlinks ✅
  • Fine-tuning on MS MARCO Document Ranking Datasets 🌀

Preinstallation

First, prepare a Python 3 environment and run the following commands:

  git clone https://github.com/zhengyima/anchors.git anchors
  cd anchors
  pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Besides, you should download a BERT model checkpoint in the format of Huggingface Transformers and save it in a directory BERT_MODEL_PATH. In our paper, we use bert-base-uncased. You can download it from the Huggingface official model zoo or the Tsinghua mirror.
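
If you prefer to fetch the checkpoint programmatically, the minimal sketch below (not part of this repo; it assumes the transformers package is installed) downloads bert-base-uncased and saves it under BERT_MODEL_PATH:

# Minimal sketch (not part of this repo): download bert-base-uncased from the
# Huggingface hub and save it in Transformers format under BERT_MODEL_PATH.
import os
from transformers import AutoModel, AutoTokenizer

bert_model_path = os.environ.get("BERT_MODEL_PATH", "./bert-base-uncased")

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# save_pretrained writes config.json, vocab.txt and the model weights
tokenizer.save_pretrained(bert_model_path)
model.save_pretrained(bert_model_path)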

Pre-training on Raw Corpus

Prepare the Corpus Data

The corpus data should have one passage (in JSON format) per line, with its anchor texts saved in an array, e.g.:

{
	"sentence": one sentence s in source page,
	"anchors": [
		{
			"text": anchor text of anchor a1,
			"pos": [the start index of a1 in s, the end index of a1 in s],
			"passage": the destination page of a1
		},
		{
			"text": anchor text of anchor a2,
			"pos": [the start index of a2 in s, the end index of a2 in s],
			"passage": the destination page of a2
		},
		...
	] 
}

For your convenience, we provide the demo corpus file data/corpus/demo_data.txt. You can refer to the demo data to generate your own pre-training corpus, e.g., from a Wikipedia dump.
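
As an illustration of the expected format, here is a minimal sketch (not part of this repo; the sentence, anchor and passage are made-up placeholders, and you should check demo_data.txt to confirm whether pos uses character or token offsets) that appends one corpus line:

# Minimal sketch (not part of this repo): append one line to a corpus file in the
# format described above. All values here are made-up placeholders.
import json

sentence = "Paris is the capital of France ."
anchors = [
    {
        "text": "Paris",
        # start/end index of the anchor in the sentence; check demo_data.txt
        # to confirm whether these are character or token offsets
        "pos": [0, 5],
        "passage": "Paris is the capital and most populous city of France ...",
    },
]

with open("data/corpus/my_corpus.txt", "a", encoding="utf-8") as f:
    f.write(json.dumps({"sentence": sentence, "anchors": anchors}) + "\n")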

Generate Pre-training Samples from the Corpus

The process of generating the pre-training samples is complex, involving a long pipeline that covers four pre-training tasks. Thus, we provide a shell script, shells/gendata.sh, to complete the whole process. If you are interested in the detailed steps, refer to the script. If you just want to run the code, run the following:

 export CORPUS_DATA=./data/corpus/demo_data.txt
 export DATA_PATH=./data/
 export BERT_MODEL_PATH=/path/to/bert_model
 bash shells/gendata.sh

After gendata.sh finishes successfully, the pre-training data will be stored in DATA_PATH/merged/.

Running Pre-training

 export PERTRAIN_OUTPUT_DIR=/path/to/output_path
 bash shells/pretrain.sh

Fine-tuning on MS MARCO

The process of fine-tuning is more complex than pre-training 💤

Thus, the author plans to pack and clean up the fine-tuning code when he is free, e.g., next weekend.

Note: The pre-training of our model is completed in the standard Huggingface manner, so you can apply the output checkpoints of pre-training to any downstream method, just like using bert-base-uncased.
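
For example, a minimal sketch (not part of this repo; it assumes the tokenizer files were saved alongside the weights, otherwise load the tokenizer from BERT_MODEL_PATH instead) of loading the pre-trained checkpoint for downstream use:

# Minimal sketch (not part of this repo): load the pre-training output checkpoint
# with Huggingface Transformers, exactly like loading bert-base-uncased.
from transformers import AutoModel, AutoTokenizer

checkpoint_dir = "/path/to/output_path"  # i.e. the directory passed as the pre-training output path

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModel.from_pretrained(checkpoint_dir)

# `model` can now be plugged into any downstream ranking method,
# e.g. fine-tuning on MS MARCO document ranking.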

Citations

If you use the code and datasets, please cite the following paper:

@inproceedings{DBLP:journals/corr/abs-2108-09346,
  author    = {Zhengyi Ma and
               Zhicheng Dou and
               Wei Xu and
               Xinyu Zhang and
               Hao Jiang and
               Zhao Cao and
               Ji{-}Rong Wen},
  title     = {Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need},
  booktitle = {{CIKM} '21: The 30th {ACM} International Conference on Information
               and Knowledge Management, Virtual Event, QLD, Australia, November 1-5, 2021},
  publisher = {{ACM}},
  year      = {2021}
}
