Source code for the EACL 2023 Findings paper "EDU-level Extractive Summarization with Varying Summary Lengths".
- OS: CentOS Linux release 7.9.2009 (Core)
- GPU: 1 x Nvidia v100-SXM2-16GB (Volta) GPU
- CUDA: 11.1.1
- Python 3.7.11
- PyTorch 1.8.0
- pyrouge 0.1.3
- transformers 4.12.5
- allennlp 2.8.0
- pythonrouge
- Please run the command pyrouge_set_rouge_path to set up the ROUGE package.
- nltk
- spacy
- lxml
- StanfordCoreNLP
We adapt the pre-processing steps from DiscoBERT to process each dataset. The pre-processing code in DiscoBERT/data_preparation and its dependencies (e.g., NeuralEDUSeg) are required.
To write documents and reference summaries into separate files:
# CNN/DailyMail
python DiscoBERT/data_preparation/run_nlpyang_prepo.py -mode split -data_dir PATH-TO-DATASET
# Others
python preprocessing/DATASET-NAME_split.py
# tokenize using StanfordCoreNLP
python DiscoBERT/data_preparation/run_nlpyang_prepo.py -mode tokenize -data_dir PATH-TO-DATASET -rel_split_doc_path raw_doc -rel_tok_path tokenized -snlp_path "stanford-corenlp-full-2018-10-05"
# convert XML file to CONLL file
python DiscoBERT/data_preparation/run_nlpyang_prepo.py -mode dplp -data_dir PATH-TO-DATASET -dplp_path PATH-TO-DPLP-REPOSITORY -rel_rst_seg_path segs -rel_tok_path tokenized
# EDU Segmentation
cd NeuralEDUSeg/src
python run.py --segment --input_conll_path PATH-TO-TOKENIZED-DATASET --output_merge_conll_path PATH-TO-DATASET-SEGS --gpu 0
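For readers unfamiliar with EDUs: an Elementary Discourse Unit is roughly a clause-level span, so this step splits each sentence into finer extractive units than whole sentences. A hand-made illustration (the actual NeuralEDUSeg output format and boundaries may differ):

```python
sentence = ("Although the storm delayed flights, "
            "most passengers reached their destinations.")

# One plausible clause-level segmentation (illustrative, not model output):
edus = ["Although the storm delayed flights,",
        "most passengers reached their destinations."]

# The EDUs partition the sentence without losing any text.
assert " ".join(edus) == sentence
```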
# construct RST trees on EDUs (optional)
cd DPLP
python2 rstparser.py PATH-TO-DATASET-SEGS False
python DiscoBERT/data_preparation/run_nlpyang_prepo.py -mode rst -data_dir PATH-TO-DATASET -dplp_path PATH-TO-DPLP-REPOSITORY -rel_rst_seg_path segs
# aggregate datasets into train/test/valid
python DiscoBERT/data_preparation/run_nlpyang_prepo.py -mode format_to_lines -data_dir PATH-TO-DATASET -rel_rst_seg_path segs -rel_tok_path tokenized -rel_save_path chunk -rel_split_sum_path sum -data_name DATASET-NAME -map_path PATH-TO-DATASET-SPLIT-FILES
The prediction file from the fine-tuned RoBERTa model must be prepared at this step. An example prediction file, prediction_example.txt, is provided.
python preprocessing/prepare_candidate_summaries.py
An example of the prepared datasets is data_examples.json.
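For orientation, the prepared dataset files are JSON. The snippet below round-trips a toy record through JSON; every field name in it is invented for illustration, so consult data_examples.json for the real schema used by this repository:

```python
import json

# Invented field names for illustration -- consult data_examples.json
# for the actual schema of the prepared dataset.
record = {
    "doc_edus": ["The company reported record profits,",
                 "driven by strong overseas sales."],
    "summary": "The company reported record profits.",
    # candidate summaries as lists of EDU indices, one list per k
    "candidates": [[0], [0, 1]],
}

# Round-trip through JSON, as the prepared dataset files are JSON.
loaded = json.loads(json.dumps([record]))
```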
Set up the model configuration in the configuration file config/model.json. To better understand the structure of the configuration file, please refer to the AllenNLP tutorial.
- "data_reader": change "token_indexers-transformer-model_name" accordingly if another pre-trained language model is to be used
- "model": change "transformer_name" accordingly if another pre-trained language model is to be used; change "min_pred_unit" and "max_pred_unit" to control the lengths of candidate summaries
- "train_data_path": path to the training dataset
- "validation_data_path": path to the validation dataset
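To make the fields above concrete, a config/model.json skeleton might look like the following. The nesting and all values are guesses for illustration (the actual file shipped with the repository is authoritative):

```json
{
  "data_reader": {
    "token_indexers": {
      "transformer": { "model_name": "roberta-base" }
    }
  },
  "model": {
    "transformer_name": "roberta-base",
    "min_pred_unit": 1,
    "max_pred_unit": 6
  },
  "train_data_path": "data/train.json",
  "validation_data_path": "data/valid.json"
}
```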
- Pre-trained language model: lines 18, 29 and 39 in model/data_reader.py and lines 18 and 44 in model/model.py might need changes to call the corresponding pre-trained language model.
- Number of candidate summaries generated by the model and k value: lines 91 and 92 in model/data_reader.py might need changes depending on the number of candidate summaries and the maximum length.
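As context for these settings, here is a minimal sketch of the varying-length candidate generation the paper describes (top-k EDUs for each k from min_pred_unit up to max_pred_unit). Function and variable names are illustrative, not the repository's actual API:

```python
def generate_candidates(edu_scores, min_pred_unit, max_pred_unit):
    """Return candidate summaries as lists of EDU indices, one per k.

    For each k, take the k highest-scoring EDUs and keep them in
    document order, mirroring the varying-k strategy of EDU-VL.
    (Illustrative sketch; the repository's implementation may differ.)
    """
    ranked = sorted(range(len(edu_scores)),
                    key=lambda i: edu_scores[i], reverse=True)
    candidates = []
    for k in range(min_pred_unit, max_pred_unit):
        candidates.append(sorted(ranked[:k]))  # restore document order
    return candidates

scores = [0.9, 0.1, 0.7, 0.4, 0.8]          # per-EDU salience scores
cands = generate_candidates(scores, 1, 4)
# cands -> [[0], [0, 4], [0, 2, 4]]
```

In the full model, each candidate is then re-encoded and scored so that the best-length summary can be selected per document.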
Run the following command to train the model:
python main.py
Output:
xxxx/model.tar.gz: the trained model, automatically saved under a folder with a randomly generated name
model_info.txt: number of parameters in the model and the model architecture
Run the following command to test the model:
allennlp evaluate PATH-TO-model.tar.gz PATH-TO-test-dataset --output-file evaluation.txt --cuda-device 0 --include-package model
Output:
evaluation.txt: testing results
@inproceedings{wu-etal-2023-edu,
title = "{EDU}-level Extractive Summarization with Varying Summary Lengths",
author = "Wu, Yuping and
Tseng, Ching-Hsun and
Shang, Jiayu and
Mao, Shengzhong and
Nenadic, Goran and
Zeng, Xiao-Jun",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-eacl.123",
pages = "1655--1667",
abstract = "Extractive models usually formulate text summarization as extracting fixed top-k salient sentences from the document as a summary. Few works exploited extracting finer-grained Elementary Discourse Unit (EDU) with little analysis and justification for the extractive unit selection. Further, the selection strategy of the fixed top-k salient sentences fits the summarization need poorly, as the number of salient sentences in different documents varies and therefore a common or best k does not exist in reality. To fill these gaps, this paper first conducts the comparison analysis of oracle summaries based on EDUs and sentences, which provides evidence from both theoretical and experimental perspectives to justify and quantify that EDUs make summaries with higher automatic evaluation scores than sentences. Then, considering this merit of EDUs, this paper further proposes an EDU-level extractive model with Varying summary Lengths (EDU-VL) and develops the corresponding learning algorithm. EDU-VL learns to encode and predict probabilities of EDUs in the document, generate multiple candidate summaries with varying lengths based on various k values, and encode and score candidate summaries, in an end-to-end training manner. Finally, EDU-VL is experimented on single and multi-document benchmark datasets and shows improved performances on ROUGE scores in comparison with state-of-the-art extractive models, and further human evaluation suggests that EDU-constituent summaries maintain good grammaticality and readability.",
}
- Data processing steps are based on code from DiscoBERT.