This repository is maintained as the final project for the CSE538 NLP course, Fall 2019. The work is based on the paper: A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss.
- Python 2.7
- TensorFlow 1.1.0
- pyrouge (for evaluation)
- tqdm
- Stanford CoreNLP 3.7.0 (for data preprocessing)
- NLTK (for data preprocessing)
Note: Stanford CoreNLP 3.7.0 can be downloaded from here.
Note: To use ROUGE evaluation, you need to download the ROUGE-1.5.5 package from here. Next, follow the instructions from here to install pyrouge and set the ROUGE path to the absolute path of your ROUGE-1.5.5 directory.
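Once pyrouge is installed, the ROUGE path can be registered with pyrouge's bundled helper command (the path below is a placeholder for your own ROUGE-1.5.5 location):

```shell
# Point pyrouge at your local ROUGE-1.5.5 directory (placeholder path).
pyrouge_set_rouge_path /absolute/path/to/ROUGE-1.5.5
```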
- project_scientific_merge_abstract_and_title.py - Parses two files (one containing the abstracts of all scientific papers, the other containing the corresponding titles) and produces one file per paper, where each file contains the abstract and the title as the 'highlight' of the paper.
- custom_make_data.py - Takes the raw dataset generated by the script above and produces tokenized binary files (train.bin, test.bin, val.bin).
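A minimal sketch of what the merge step does, assuming one abstract per line in one file and the matching title on the same line of the other; the function name, file layout, and '@highlight' convention here are illustrative assumptions, not the exact behavior of project_scientific_merge_abstract_and_title.py:

```python
# Sketch of the abstract/title merge step (assumptions noted in the lead-in).
import os

def merge_abstracts_and_titles(abstract_file, title_file, out_dir):
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    with open(abstract_file) as fa, open(title_file) as ft:
        for i, (abstract, title) in enumerate(zip(fa, ft)):
            # One output file per paper: the abstract body, then the title
            # marked as the '@highlight', mirroring the CNN/DM story format.
            out_path = os.path.join(out_dir, 'paper_%06d.story' % i)
            with open(out_path, 'w') as out:
                out.write(abstract.strip() + '\n\n@highlight\n\n' + title.strip() + '\n')
```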
More details on implementation can be found in Section 5 of the Report PDF.
Code for generating the dataset is in the data folder.
Datasets for data-driven summarization of scientific articles: generating the title of a paper from its abstract (title-gen) or the abstract from its full body (body-gen). title-gen was constructed from the MEDLINE dataset, whereas body-gen was constructed from the PubMed Open Access Subset. Here's the data repo
To generate a merged data file containing the abstract and title of each scientific paper:
python project_scientific_merge_abstract_and_title.py <folder containing title and abstract>
To generate a tokenized binary file for each merged data file produced by the command above:
python custom_make_data.py <dir> <prefix> <abstract_tokenized_dir>
Set the paths of the pretrained extractor and abstractor to SELECTOR_PATH and REWRITER_PATH in the script.
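For example, the two variables in the script might look like the following (the checkpoint paths below are placeholders, not the actual values):

```shell
# Placeholder paths -- point these at your own pretrained checkpoints.
SELECTOR_PATH="log/selector/exp_sample/train/bestmodel-xxxx"
REWRITER_PATH="log/rewriter/exp_sample/train/bestmodel-xxxx"
```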
sh scripts/end2end.sh
The trained models will be saved in the log/end2end/${EXP_NAME} directory.
Change the MODE in the script to evalall (i.e., MODE='evalall') and set CKPT_PATH to the model path that you want to test.
If you want to use the best evaluation model, set LOAD_BEST_EVAL_MODEL to True to load the best model in the eval(_${EVAL_METHOD}) directory. The default of LOAD_BEST_EVAL_MODEL is False.
If you set neither CKPT_PATH nor LOAD_BEST_EVAL_MODEL, the latest model in the train directory is loaded automatically.
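Putting the evaluation settings together, the relevant lines in the script would look roughly like this (the checkpoint path is a placeholder):

```shell
MODE='evalall'               # switch from training to evaluation
CKPT_PATH="log/end2end/exp_sample/train/bestmodel-xxxx"  # model to test (placeholder)
LOAD_BEST_EVAL_MODEL=False   # set to True to load the best model from eval(_${EVAL_METHOD})
```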
The evaluation results will be saved under your experiment directory, log/${MODEL}/${EXP_NAME}/.
We used pretrained models as follows:
If you want to get the results of the pretrained models, set two arguments in the scripts:
- Set MODE to evalall (i.e., MODE='evalall').
- Set CKPT_PATH to our pretrained model (e.g., CKPT_PATH="pretrained/bestmodel-xxxx").
The output format is a dictionary:
{
'article': list of article sentences,
'reference': list of reference summary sentences,
'gt_ids': indices of ground-truth extracted sentences,
'decoded': list of output summary sentences
}
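As an illustration, a decoded output in this format can be inspected like this (the sample values below are made up, not real model output):

```python
# A made-up example matching the output format described above.
output = {
    'article': ['Sentence one of the article.', 'Sentence two.'],
    'reference': ['The reference summary.'],
    'gt_ids': [0],
    'decoded': ['The generated summary.'],
}

# The ground-truth extracted sentences are recovered by indexing the article.
gt_sentences = [output['article'][i] for i in output['gt_ids']]
print(gt_sentences)       # -> ['Sentence one of the article.']
print(output['decoded'])  # -> ['The generated summary.']
```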
If you find this repository useful, please cite:
@InProceedings{hsu2018unified,
title={A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss},
author={Hsu, Wan-Ting and Lin, Chieh-Kai and Lee, Ming-Ying and Min, Kerui and Tang, Jing and Sun, Min},
booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year={2018}
}