MeanSum: A Model for Unsupervised Neural Multi-Document Abstractive Summarization
Corresponding paper, accepted to ICML 2019: https://arxiv.org/abs/1810.05739.
- python 3
- torch 0.4.0
Rest of python packages in
Tested in Docker, image =
Create directories that aren't part of the Git repo (checkpoints/, outputs/):
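As a hedged alternative for this step, the two directories named above can also be created from Python. This is a minimal sketch: the helper name `ensure_dirs` and the assumption that both directories sit at the repository root are illustrative and not taken from the repository.

```python
import os

def ensure_dirs(root=".", names=("checkpoints", "outputs")):
    """Create each directory under `root` if it does not exist yet.

    ASSUMPTION: the directories live directly under the repository root.
    Returns the list of paths that were ensured.
    """
    paths = []
    for name in names:
        path = os.path.join(root, name)
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths
```

Calling `ensure_dirs()` from the repository root would create `checkpoints/` and `outputs/` there; calling it again is harmless.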
Install python packages:
With the default parameters for Tensorboard(x?), text logged via writer.add_text() does not show up. Update by:
Downloading data and pretrained models
- Download Yelp data: https://www.yelp.com/dataset and place files in
- Run script to pre-process the data and create train, val, test splits:
- Download subword tokenizer built on Yelp and place in
- Download summarization model and place in
- Download language model and place in
- Download classification model and place in
Download from: link. Each row contains "Input.business_id", "Input.original_review_<num>_id", "Input.original_review__<num>_", "Answer.summary", etc. The "Answer.summary" is the reference summary written by the Mechanical Turk worker.
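To make the row layout concrete, here is a minimal sketch of grouping the reference summaries by business. It assumes the download is a CSV file with a header row (the usual Mechanical Turk export format, but not confirmed here); the function name `load_reference_summaries` and the two toy rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def load_reference_summaries(csv_file):
    """Map each "Input.business_id" to its list of "Answer.summary" texts.

    ASSUMPTION: `csv_file` is an open text file in CSV format whose header
    contains the column names described above.
    """
    summaries = defaultdict(list)
    for row in csv.DictReader(csv_file):
        summaries[row["Input.business_id"]].append(row["Answer.summary"])
    return dict(summaries)

# Toy stand-in for the real file; these rows are NOT real data.
sample = io.StringIO(
    "Input.business_id,Answer.summary\n"
    "b1,Great tacos and friendly staff.\n"
    "b2,Slow service but good coffee.\n"
)
refs = load_reference_summaries(sample)
```

With the real file, the `io.StringIO` stand-in would be replaced by `open(path, newline="")`, where `path` is wherever the download was saved.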
Testing with pretrained model. This will output and save the automated metrics.
Results will be in
NOTE: Unlike some conventions, the 'gpus' option here is the ID of a GPU that is visible, NOT the number of GPUs. Hence, on a machine with a single GPU, pass gpus=0.
python train_sum.py --mode=test --gpus=0 --batch_size=16 --notes=<run_name>
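The ID-versus-count distinction in the note can be illustrated with a small sketch. This is not how train_sum.py necessarily parses the option; `parse_gpu_ids` is a hypothetical helper that only shows the intended reading of the value.

```python
def parse_gpu_ids(gpus):
    """Read a --gpus value as a comma-separated list of GPU IDs.

    ASSUMPTION: illustrative only; the repository's own parsing may differ.
    """
    return [int(tok) for tok in gpus.split(",") if tok.strip()]

single = parse_gpu_ids("0")        # one GPU, whose ID is 0
multi = parse_gpu_ids("0,1,2,3")   # four GPUs, IDs 0 through 3
```

Under this reading, `--gpus=0` selects one device (ID 0) rather than zero devices, and `--gpus=0,1,2,3` selects four.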
Training summarization model (using pre-trained language model and default hyperparams).
The automated metrics results will be in
python train_sum.py --batch_size=16 --gpus=0,1,2,3 --notes=<additional_notes>