BERT (Base) Sequence Tagging on OpenKP (PyTorch)
This repository provides the code for the BERT (Base) Sequence Tagging model, which outperforms the baselines (MSMARCO Team) on the OpenKP Leaderboard.
OpenKP (OpenKeyPhrase) is a large-scale, open-domain keyphrase extraction dataset, first released in the paper Open Domain Web Keyphrase Extraction Beyond Language Modeling at EMNLP-IJCNLP 2019. It is now part of the MSMARCO dataset family.
Official Evaluation Results (ranked by F1 @3 on Eval)
| Rank | Model | Submission Date | Precision @1, @3, @5 | Recall @1, @3, @5 | F1 @1, @3, @5 |
|---|---|---|---|---|---|
| 1 | BERT (Base) Sequence Tagging, by Si Sun (Tsinghua University), Chenyan Xiong (MSR AI), Zhiyuan Liu (Tsinghua University) | November 5th, 2019 | 0.484, 0.312, 0.227 | 0.255, 0.469, 0.563 | 0.321, 0.361, 0.314 |
| 2 | Baseline finetuned on Bing Queries, by MSMARCO Team | October 19th, 2019 | 0.397, 0.249, 0.149 | 0.215, 0.391, 0.391 | 0.267, 0.292, 0.209 |
| 3 | Baseline, by MSMARCO Team | October 19th, 2019 | 0.365, 0.237, 0.142 | 0.196, 0.367, 0.367 | 0.244, 0.277, 0.198 |
Requirements: Python 3.5, PyTorch 1.3.0, TensorFlow (tested on 1.14.0; needed only for tensorboardX)
First download the DATA folder, which includes the preprocessed data, our checkpoint, and the extracted keyphrase candidates behind our result on the leaderboard. The default directory structure should be set up as follows:
Bert2Tag
 |-- DATA
      |-- cached_features (saved preprocess data)
      |    |-- openkp.train.json (OpenKP train dataset)
      |    |-- openkp.valid.json (OpenKP Dev dataset)
      |    |-- openkp.eval_public.json (OpenKP Valid dataset)
      |    |-- Dev_reference.json (OpenKP Dev ground-truth keyphrases for test)
      |
      |-- pretrain_model
      |    |-- bert-base-cased
      |    |    |-- vocab.txt
      |    |    |-- config.json
      |    |    |-- pytorch_model.bin
      |    |
      |-- output (our best checkpoint)
      |    |-- epoch_4.checkpoint
      |
      |-- Pred (extracted keyphrase candidates)
           |-- Dev_candidate.json
           |-- EvalPublic_candidate.json (submitted to the leaderboard)
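Before running the commands below, it can help to confirm that the download matches this layout. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the relative paths are read off the directory listing above, so adjust them if your copy differs.

```python
from pathlib import Path

# Relative paths taken from the directory listing above
# (assumption: the Bert2Tag repository root is the base directory).
EXPECTED = [
    "DATA/cached_features/openkp.train.json",
    "DATA/cached_features/openkp.valid.json",
    "DATA/cached_features/openkp.eval_public.json",
    "DATA/cached_features/Dev_reference.json",
    "DATA/pretrain_model/bert-base-cased/vocab.txt",
    "DATA/pretrain_model/bert-base-cased/config.json",
    "DATA/pretrain_model/bert-base-cased/pytorch_model.bin",
    "DATA/output/epoch_4.checkpoint",
    "DATA/Pred/Dev_candidate.json",
    "DATA/Pred/EvalPublic_candidate.json",
]


def missing_files(root="."):
    """Return the expected files that are not present under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_file()]


# Report anything missing relative to the current directory.
for rel in missing_files("."):
    print("missing:", rel)
```

Run it from the repository root; no output means every listed file was found.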
bert-base-cased can also be downloaded from Huggingface's Pytorch-Transformers.
* Reproduce the evaluation results on Dev using our checkpoint
Get the evaluation results using our generated keyphrase candidates
python evaluate.py ./DATA/Pred/Dev_candidate.json ./DATA/cached_features/Dev_reference.json
The evaluate.py script is cloned from the official evaluation script. We can only evaluate the Dev candidates because we do not have the ground-truth keyphrases for Eval. The evaluation results on Dev are shown below:
########################
Metrics
@1 F1:0.35075642965204235 P:0.47745839636913767 R:0.2980584972264246
@3 F1:0.36837115481593546 P:0.3011094301563288 R:0.5318204740292486
@5 F1:0.31765903032922704 P:0.22160363086232984 R:0.6389056984367121
#########################
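For readers unfamiliar with the @k metrics, here is a minimal sketch of how precision, recall, and F1 at a cutoff k can be computed for one document and averaged over documents. It is an illustration, not the official script: the function names are hypothetical, phrases are assumed to be already normalized strings without duplicates, and per-document averaging is an assumption. The official evaluate.py may normalize and aggregate differently.

```python
def prf_at_k(predicted, gold, k):
    """Precision, recall and F1 of the top-k predictions for one document.

    `predicted` is a ranked list of phrases, `gold` a list of reference
    phrases. Exact string match is assumed (illustrative simplification).
    """
    top_k = predicted[:k]
    gold_set = set(gold)
    hits = sum(1 for phrase in top_k if phrase in gold_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


def average_prf_at_k(all_predicted, all_gold, k):
    """Average the per-document scores (assumed aggregation scheme)."""
    scores = [prf_at_k(p, g, k) for p, g in zip(all_predicted, all_gold)]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


predicted = ["keyphrase extraction", "bert", "open domain", "web page", "ranking"]
gold = ["keyphrase extraction", "open domain"]
for k in (1, 3, 5):
    print(k, prf_at_k(predicted, gold, k))
```

In this toy example, raising k from 1 to 3 lifts recall from 0.5 to 1.0 while precision drops from 1.0 to 2/3, the same trade-off visible in the Dev numbers above.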
Generate keyphrase candidates using our checkpoint model
The newly generated keyphrases for Dev and Eval can be found in
* Re-train a new model from scratch
Train a new model using our preprocessed data
Several new files will be generated:
viso folder (if tensorboardX is used): tracks the train and valid losses saved in
./output after each epoch i.
* Re-preprocess source OpenKP datasets
Download the OpenKP dataset from the MS MARCO website to your own directory
Preprocess the dataset using our preprocess.py script (this may take 2 to 3 hours)
python preprocess.py --source_dataset_dir "your own directory" --output_path "your save directory"
* Data Characteristics
The documents come from real-world web pages, with a diverse distribution of topic domains that differs from previous keyphrase extraction datasets (which focus on a single scientific field).
For each document, expert annotators generated the 1-3 most relevant keyphrase labels, each of which must appear in the document.
* Model Architectures
Based on these characteristics of the data, we formulate keyphrase extraction as a soft-select sequence tagging task and, to our knowledge, are the first to introduce BERT into open-domain keyphrase extraction. Our model's workflow has the following 3 stages:
Word-Level Representations: We encode an input document into a sequence of WordPiece token vectors with a pretrained BERT (base), and then take the first sub-token vector of each word to represent the input at the word level.
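The first-sub-token selection can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the repository's code: the helper names are hypothetical, the token vectors stand in for BERT outputs, the WordPiece splits are made up, and special tokens such as [CLS] and [SEP] are ignored.

```python
def first_subtoken_indices(word_pieces):
    """Index of the first sub-token of each word in the flat token sequence.

    `word_pieces` holds one list of WordPiece tokens per word.
    """
    indices = []
    offset = 0
    for pieces in word_pieces:
        indices.append(offset)
        offset += len(pieces)
    return indices


def word_level_vectors(token_vectors, word_pieces):
    """Keep one vector per word: the vector of its first sub-token."""
    return [token_vectors[i] for i in first_subtoken_indices(word_pieces)]


# Three words split into five WordPiece tokens (hypothetical split).
word_pieces = [["open"], ["key", "##phrase"], ["extract", "##ion"]]
# Toy 2-dimensional vectors standing in for BERT token outputs.
token_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

print(first_subtoken_indices(word_pieces))
print(word_level_vectors(token_vectors, word_pieces))
```

The output sequence has one vector per word, so later stages can tag words rather than sub-tokens.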
Phrase-Level Representations: We decode phrases from the word-level vectors with a soft-select method instead of the hard-select method used in standard sequence tagging.
The word-level representations are fed into a classification layer to obtain each word's tag probabilities over 5 classes (O, B, I, E, U), and then we apply different tag patterns to extract different n-grams (1 ≤ n ≤ 5) over the whole sequence.
The result is a collection of n-gram candidates in which each word of an n-gram has exactly one score.
Soft-select Example: considering all 3-grams (B I E) in a document of length L, we can extract (L-3+1) 3-grams sequentially with a sliding window. In each 3-gram, we keep only the B score for the first word, the I score for the middle word, and the E score for the last word.
O: Non-keyphrase; B: Begin word of the keyphrase; I: Middle word of the keyphrase; E: End word of the keyphrase; U: Uni-word keyphrase
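The soft-select step can be sketched as a sliding window over per-word tag probabilities. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, the probabilities are made up, and the patterns for n = 2, 4, 5 (B E, B I I E, B I I I E) are an assumed extension of the 3-gram pattern (B I E) and the uni-word tag U described above.

```python
TAGS = ("O", "B", "I", "E", "U")


def tag_pattern(n):
    """Tag pattern for an n-gram: U for n = 1, otherwise B, I..., E (assumed)."""
    if n == 1:
        return ["U"]
    return ["B"] + ["I"] * (n - 2) + ["E"]


def soft_select(words, tag_probs, max_n=5):
    """Enumerate every n-gram (1 <= n <= max_n) with one score per word.

    `tag_probs[i]` holds the probabilities of word i over TAGS, in order.
    Returns a list of (n-gram words, per-word scores) pairs.
    """
    candidates = []
    for n in range(1, max_n + 1):
        pattern = tag_pattern(n)
        # A document of length L yields L - n + 1 windows of size n.
        for start in range(len(words) - n + 1):
            scores = [
                tag_probs[start + i][TAGS.index(tag)]
                for i, tag in enumerate(pattern)
            ]
            candidates.append((tuple(words[start:start + n]), scores))
    return candidates


words = ["open", "domain", "keyphrase", "extraction"]
tag_probs = [
    # O    B    I    E    U
    [0.1, 0.6, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.5, 0.1],
    [0.2, 0.1, 0.1, 0.3, 0.3],
]

for ngram, scores in soft_select(words, tag_probs):
    if len(ngram) == 3:
        print(ngram, scores)
```

Unlike hard-select tagging, no argmax is taken over tags, so every window stays a candidate and its word scores are passed on to the ranking stage.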
Document-Level Keyphrase: At the last stage, recovering document-level keyphrases from the phrase-level n-grams can be naturally formulated as a ranking task.
Incorporating term frequency, we apply Min Pooling to get the final score of each n-gram (we tested Min / Mean / LogMean Pooling; Min Pooling performed best). Based on the final scores, we extract the 5 top-ranked keyphrase candidates for each document.
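The ranking stage can be sketched as below. This is a hedged illustration rather than the repository's code: the function name is hypothetical, each candidate is assumed to be a pair of (n-gram words, per-word scores), and term frequency is folded in by summing the min-pooled scores of all occurrences of the same phrase, which is only one plausible reading of "incorporating term frequency".

```python
from collections import defaultdict


def rank_keyphrases(candidates, top_k=5):
    """Min-pool each n-gram's word scores, merge repeats, return the top_k.

    `candidates` is a list of (words, word_scores) pairs. Occurrences of the
    same phrase are summed (assumption, see the note above).
    """
    phrase_scores = defaultdict(float)
    for words, word_scores in candidates:
        phrase = " ".join(words).lower()
        # Min Pooling: an n-gram is only as strong as its weakest word.
        phrase_scores[phrase] += min(word_scores)
    ranked = sorted(phrase_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


candidates = [
    (("open", "domain"), [0.6, 0.5]),
    (("keyphrase", "extraction"), [0.9, 0.7]),
    (("open", "domain"), [0.4, 0.8]),  # second occurrence of the same phrase
    (("extraction",), [0.3]),
]

for phrase, score in rank_keyphrases(candidates):
    print(phrase, round(score, 3))
```

With Min Pooling, a single low-probability word pulls the whole n-gram down, which penalizes windows that only partially overlap a true keyphrase.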
For any questions, please contact Si Sun by email at firstname.lastname@example.org; we will do our best to help.