EDU-VL

Source code for the EACL 2023 Findings paper "EDU-level Extractive Summarization with Varying Summary Lengths".

Environment

  • OS: CentOS Linux release 7.9.2009 (Core)
  • GPU: 1 x NVIDIA V100-SXM2-16GB (Volta)
  • CUDA: 11.1.1

Dependencies

  • Python 3.7.11
  • PyTorch 1.8.0
  • pyrouge 0.1.3
  • transformers 4.12.5
  • allennlp 2.8.0
  • pythonrouge
    • Please run the command pyrouge_set_rouge_path to set up the ROUGE package (see the example after this list).
  • nltk
  • spacy
  • lxml
  • StanfordCoreNLP
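
For example, assuming the ROUGE-1.5.5 package is installed at /absolute/path/to/ROUGE-1.5.5 (replace this placeholder with your own installation path):

pyrouge_set_rouge_path /absolute/path/to/ROUGE-1.5.5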

Pre-processing / Data preparation

We adapt the pre-processing steps from DiscoBERT to process each dataset. The pre-processing code from DiscoBERT/data_preparation and its dependencies (e.g., NeuralEDUSeg) are required.

1) Download the original datasets

2) Prepare datasets

To write documents and reference summaries into separate files:

# CNN/DailyMail
python DiscoBERT/data_preparation/run_nlpyang_prepo.py -mode split -data_dir PATH-TO-DATASET

# Others
python preprocessing/DATASET-NAME_split.py

3) EDU segmentation

# tokenize using StanfordCoreNLP
python DiscoBERT/data_preparation/run_nlpyang_prepo.py -mode tokenize -data_dir PATH-TO-DATASET -rel_split_doc_path raw_doc -rel_tok_path tokenized -snlp_path "stanford-corenlp-full-2018-10-05"

# convert XML file to CONLL file
python DiscoBERT/data_preparation/run_nlpyang_prepo.py -mode dplp -data_dir PATH-TO-DATASET -dplp_path PATH-TO-DPLP-REPOSITORY -rel_rst_seg_path segs -rel_tok_path tokenized

# EDU Segmentation
cd NeuralEDUSeg/src
python run.py --segment --input_conll_path PATH-TO-TOKENIZED-DATASET --output_merge_conll_path PATH-TO-DATASET-SEGS --gpu 0

# construct RST trees on EDUs (optional)
cd DPLP
python2 rstparser.py PATH-TO-DATASET-SEGS False
python DiscoBERT/data_preparation/run_nlpyang_prepo.py -mode rst -data_dir PATH-TO-DATASET  -dplp_path PATH-TO-DPLP-REPOSITORY -rel_rst_seg_path segs

# aggregate datasets into train/test/valid
python DiscoBERT/data_preparation/run_nlpyang_prepo.py -mode format_to_lines -data_dir PATH-TO-DATASET -rel_rst_seg_path segs -rel_tok_path tokenized -rel_save_path chunk -rel_split_sum_path sum -data_name DATASET-NAME -map_path PATH-TO-DATASET-SPLIT-FILES

4) Prepare candidate summaries

The prediction file produced by the fine-tuned RoBERTa model needs to be prepared for this step. An example prediction file, prediction_example.txt, is provided.

python preprocessing/prepare_candidate_summaries.py

An example of the prepared dataset is provided in data_examples.json.

Train

1) Configuration Setup

Set up the model configuration in the configuration file config/model.json. To better understand the structure of the configuration file, please refer to the AllenNLP tutorial. A minimal illustrative sketch of the relevant fields is shown after the list below.

  • "data_reader": change "token_indexers-transformer-model_name" accordingly if other pre-trained language model is to be used
  • "model": change "transformer_name" accordingly if other pre-trained language model is to be used; change "min_pred_unit" and "max_pred_unit" accordingly to control lengths of candidate summaries
  • "train_data_path": path to the training dataset
  • "validation_data_path": path to validation dataset

2) Other Parameters

  • Pre-trained language model: lines 18, 29 and 39 in model/data_reader.py and lines 18 and 44 in model/model.py may need to be changed to load the corresponding pre-trained language model.
  • Number of candidate summaries generated by the model and the k value: lines 91 and 92 in model/data_reader.py may need to be changed depending on the number of candidate summaries and the maximum length.

3) Train Model

Run the following command to train the model:

python main.py

4) Output:

  • xxxx/model.tar.gz: the trained model is automatically saved under a folder with a randomly generated name.
  • model_info.txt: the number of parameters in the model and the model architecture

Test

Run the following command to test the model:

allennlp evaluate PATH-TO-model.tar.gz PATH-TO-test-dataset --output-file evaluation.txt --cuda-device 0 --include-package model

Output:

  • evaluation.txt: test results

Citing

@inproceedings{wu-etal-2023-edu,
    title = "{EDU}-level Extractive Summarization with Varying Summary Lengths",
    author = "Wu, Yuping  and
      Tseng, Ching-Hsun  and
      Shang, Jiayu  and
      Mao, Shengzhong  and
      Nenadic, Goran  and
      Zeng, Xiao-Jun",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.123",
    pages = "1655--1667",
    abstract = "Extractive models usually formulate text summarization as extracting fixed top-k salient sentences from the document as a summary. Few works exploited extracting finer-grained Elementary Discourse Unit (EDU) with little analysis and justification for the extractive unit selection. Further, the selection strategy of the fixed top-k salient sentences fits the summarization need poorly, as the number of salient sentences in different documents varies and therefore a common or best k does not exist in reality. To fill these gaps, this paper first conducts the comparison analysis of oracle summaries based on EDUs and sentences, which provides evidence from both theoretical and experimental perspectives to justify and quantify that EDUs make summaries with higher automatic evaluation scores than sentences. Then, considering this merit of EDUs, this paper further proposes an EDU-level extractive model with Varying summary Lengths (EDU-VL) and develops the corresponding learning algorithm. EDU-VL learns to encode and predict probabilities of EDUs in the document, generate multiple candidate summaries with varying lengths based on various k values, and encode and score candidate summaries, in an end-to-end training manner. Finally, EDU-VL is experimented on single and multi-document benchmark datasets and shows improved performances on ROUGE scores in comparison with state-of-the-art extractive models, and further human evaluation suggests that EDU-constituent summaries maintain good grammaticality and readability.",
}

Acknowledgements

  • Data pre-processing steps are based on code from DiscoBERT.
