zhaochaocs/MDS-DR

MDS-DR

This is the codebase for our Findings of ACL 2022 paper, "Read Top News First: A Document Reordering Approach for Multi-Document News Summarization".

The code builds on PreSumm and adds an implementation of document reordering. Refer to PreSumm for the implementation of data pre-processing, summarization, and evaluation.

Step 1. Data Preparation

Set the dataset name:

dataset_name=multinews

Prepare the JSON data in the PreSumm format. We provide toy examples under json_data2/multinews.
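For orientation, here is a minimal sketch of building one toy record in the `src`/`tgt` layout that PreSumm's JSON shards use, where each field is a list of tokenized sentences. How this repository marks document boundaries inside `src` is not stated here, so the `DOC_SEP` token and the `make_example` helper below are hypothetical. Check the toy examples under json_data2/multinews for the actual convention.

```python
import json

# Hypothetical marker for document boundaries. The real convention is not
# stated in this README, so treat this token as an assumption.
DOC_SEP = "<doc-sep>"


def make_example(documents, summary):
    """Build one PreSumm-style record from whitespace-tokenizable sentences.

    `documents` is a list of documents, each a list of sentence strings.
    `summary` is a list of sentence strings.
    """
    src = []
    for i, doc in enumerate(documents):
        if i > 0:
            # Insert the (assumed) separator as its own one-token sentence.
            src.append([DOC_SEP])
        src.extend(sentence.split() for sentence in doc)
    tgt = [sentence.split() for sentence in summary]
    return {"src": src, "tgt": tgt}


documents = [
    ["the storm hit the coast on monday .", "thousands lost power ."],
    ["officials opened shelters ."],
]
summary = ["a storm hit the coast and cut power ."]

example = make_example(documents, summary)

# A shard is a JSON list of such records.
shard = json.dumps([example])
```

The whitespace tokenization is only for illustration; real data would normally go through the tokenization steps described in PreSumm.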

Convert the JSON data to the torch format used as BERT input. Files are saved to bert_data2/multinews_doc_cls/.

python preprocess.py -mode format_to_bert_doc -raw_path ../json_data2/${dataset_name} \
        -save_path ../bert_data2/${dataset_name}_doc_cls/ \
        -n_cpus 10 -log_file ../logs/multinews.log -min_src_nsents 1 -doc_separator "unused0"

Step 2. Model Training

Train a document-level reordering model.

dataset_name=multinews_doc_cls
python train.py -task ext -mode train_doc -bert_data_path ../bert_data2/${dataset_name} \
-ext_dropout 0.1 -model_path ../models2/${dataset_name}/ \
-lr 2e-3 -visible_gpus 1 -report_every 1000 \
-save_checkpoint_steps 1000 -batch_size 3000 -train_steps 10000 -accum_count 2 -valid_per_steps 1000 \
-log_file ../logs/multinews.log -use_interval true -warmup_steps 2000 -max_pos 512 \
-test_summary_num_sents 11 -block_trigram false -result_path ../results2/${dataset_name}/
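To give a feel for how `-lr 2e-3` and `-warmup_steps 2000` interact, the sketch below computes a Noam-style learning-rate schedule, which is what PreSumm uses for its extractive models. It assumes this repository keeps that behavior unchanged; the function name `noam_lr` is made up for illustration.

```python
def noam_lr(step, base_lr=2e-3, warmup_steps=2000):
    """Learning rate at a 1-based optimizer step under a Noam-style schedule.

    The rate grows linearly during warmup and then decays with the inverse
    square root of the step. Assumed to match PreSumm's schedule.
    """
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


# Print the rate at a few points of the 10000-step run configured above.
for step in (1, 1000, 2000, 10000):
    print(step, noam_lr(step))
```

Under these assumptions the effective rate peaks at the end of warmup (step 2000) at roughly 4.5e-5, far below the nominal 2e-3 passed on the command line.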

Step 3. Model Test

Run the trained model on each split of the MDS dataset to score the importance of each document.

splits=( "train" "valid" "test" )
for split in "${splits[@]}"
do
python train.py  -task ext -mode test_doc -input ${split} -batch_size 1000 -test_batch_size 5 \
-bert_data_path ../bert_data2/${dataset_name}/ -log_file ../logs/multinews.log \
-model_path ../models2/${dataset_name} -test_from ../models2/${dataset_name}/model_step_10000.pt -sep_optim true \
-use_interval true -visible_gpus 1 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 \
-result_path ../results2/${dataset_name}/${split}
done
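Once each document has an importance score, the reordering step itself amounts to sorting a cluster's documents so that the most important one comes first. The sketch below illustrates that idea only. The format of the files written under `-result_path` is not described here, so the in-memory lists and the `reorder_documents` helper are hypothetical.

```python
def reorder_documents(documents, scores):
    """Return documents sorted from most to least important.

    Ties keep their original relative order because Python's sort is stable.
    """
    if len(documents) != len(scores):
        raise ValueError("need exactly one score per document")
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in order]


# Toy cluster of three documents with made-up importance scores.
cluster = ["doc A text", "doc B text", "doc C text"]
scores = [0.1, 0.9, 0.5]

reordered = reorder_documents(cluster, scores)
```

The reordered cluster can then be concatenated and passed to a summarization model, for which this README defers to PreSumm.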

Citation

@inproceedings{zhao2022read,
  title={Read Top News First: A Document Reordering Approach for Multi-Document News Summarization},
  author={Zhao, Chao and Huang, Tenghao and Chowdhury, Somnath Basu Roy and Chandrasekaran, Muthu Kumar and Mckeown, Kathleen and Chaturvedi, Snigdha},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
  pages={613--621},
  year={2022}
}
