- LCM3DS is a large-scale multi-scenario multi-domain dialogue summarization corpus annotated by ChatGPT.
- The LCM3DS corpus is currently available on both Google Drive and Baidu Netdisk.
- LCM3DS is a standardized, high-quality corpus that you can use to pre-train your own model architecture.
- You can use the following code to extract "dialogue-summary" parallel data:
```python
import json
import os

# Adjust these to wherever you stored the corpus (the values below are examples).
dataset_path = '../datasets'
dataset_name = 'LCM3DS.json'

with open(os.path.join(dataset_path, dataset_name), 'r') as rf:
    data = json.load(rf)

dataset = []
for sample in data:
    # Summary annotated by ChatGPT for the original dialogue;
    # <eor>/<eou> delimit the speaker role and the utterance, respectively.
    if 'chatgpt_anno_summ' in sample:
        dialogue = [i['added_role'] + '<eor>' + i['utterance'] + '<eou>' for i in sample['dialogue']]
        summary = sample['chatgpt_anno_summ']
        dataset.append({'dialogue': dialogue, 'summary': summary})
    # Summary for the dialogue with roles replaced by named coreferences.
    if 'role-rep_named-coref_summ' in sample:
        dialogue = [i['named_coref'] + '<eor>' + i['utterance'] + '<eou>' for i in sample['dialogue']]
        summary = sample['role-rep_named-coref_summ']
        dataset.append({'dialogue': dialogue, 'summary': summary})
    # Summary for the dialogue with roles replaced by customer/service tags.
    if 'role-rep_cust-serv_summ' in sample:
        dialogue = [i['cust_serv'] + '<eor>' + i['utterance'] + '<eou>' for i in sample['dialogue']]
        summary = sample['role-rep_cust-serv_summ']
        dataset.append({'dialogue': dialogue, 'summary': summary})
```
- Our fine-tuned (full-data), few-shot, pre-trained, and initialized models can be obtained from the following:
Model | Google Drive | Baidu Netdisk |
---|---|---|
Fine-tuned | SAMSum, DIALOGSUM, TWEETSUMM | SAMSum, DIALOGSUM, TWEETSUMM |
Few-shot | SAMSum, DIALOGSUM, TWEETSUMM | SAMSum, DIALOGSUM, TWEETSUMM |
Pre-trained | MP4-DAP, MP4-DAP-TOP | MP4-DAP, MP4-DAP-TOP |
Initialized | Speaker-BART | Speaker-BART |
- Downstream datasets are currently available on both Google Drive and Baidu Netdisk.
Dataset | Train | Val | Test | Domain |
---|---|---|---|---|
SAMSum | 14,731 | 818 | 819 | ODDS-Online |
DIALOGSUM | 12,460 | 500 | 500 | ODDS-Daily |
TWEETSUMM | 869 | 108 | 110 | CSDS-Tweet |
- The inference results of ChatGPT (zero-shot) on the SAMSum test set (Appendix A of our paper) can be obtained on Google Drive and Baidu Netdisk.
Prompt | R-1 | R-2 | R-L |
---|---|---|---|
Preceding | 37.90 | 15.19 | 35.89 |
InstructGPT | 42.17 | 16.84 | 39.26 |
Subsequent | 40.08 | 15.41 | 37.22 |
You can obtain all the inference results (i.e., full fine-tuning, few-shot, and zero-shot) of our models on Google Drive and Baidu Netdisk.
You can perform inference through the following steps:
Step1: Ensure that the required downstream datasets are stored in the `datasets` folder.
Step2: Make sure the model you wish to test is downloaded and placed in the corresponding subfolder: `models/fine-tuned`, `models/few-shot`, `models/pre-trained`, or `models/initialized`.
Step3: Run `inference.py`. Below is an inference example:
```bash
CUDA_VISIBLE_DEVICES=0 \
python -u inference.py \
--model_path ../models/fine-tuned/MP4-DAP-TOP-SAMSum \
--dataset_name SAMSum \
--gen_use_cache \
--gen_max_length 100 \
--gen_min_length 5 \
--gen_beam_size 5 \
--gen_length_penalty 1.0 \
--gen_no_repeat_ngram_size 0 \
--infer_path ../outputs/Fine-tuned_MP4-DAP-TOP-SAMSum
```
NOTE: When running inference in the zero-shot setting, please ensure that the `--gen_max_length` parameter aligns with our paper.
You can fine-tune through the following steps:
Step1: Ensure the required fine-tuning dataset is stored in the `datasets` folder.
Step2: Ensure that the MP4-DAP or MP4-DAP-TOP pre-trained model is downloaded and placed in the `models/pre-trained` subfolder.
Step3: Run `training.py`. Below is a fine-tuning example:
```bash
CUDA_VISIBLE_DEVICES=6,7,8,9 \
python -u training.py \
--mode fine-tuning \
--model_path ../models/pre-trained/MP4-DAP-TOP \
--ckpt_save_path ../models/fine-tuned/MP4-DAP-TOP-SAMSum-Ours \
--gpus 4 \
--use_ddp \
--max_steps 1155 \
--val_check_interval 0.50 \
--num_sanity_val_steps 2 \
--accumulate_grad_batches 1 \
--progress_bar_refresh_rate 1 \
--lr 3e-05 \
--warmup_steps 100 \
--label_smoothing 0.1 \
--dataset_name Downstream_Datasets/SAMSum \
--max_length_src 1024 \
--max_length_tgt 256 \
--batch_size 16 \
--gen_use_cache \
--gen_max_length 100 \
--gen_min_length 5 \
--gen_beam_size 5 \
--gen_length_penalty 1.0 \
--gen_no_repeat_ngram_size 0
```
NOTE: If you wish to conduct few-shot training, please ensure that the `--few_shot`, `--seed`, and `--num_sample` parameters are set; a hypothetical example is sketched below.
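For instance, a few-shot run might look like the following. The values (seed 42, 100 samples) and the exact form of the `--few_shot` flag are assumptions based on the parameter names above, so check `training.py` for the actual interface:

```bash
# Hypothetical few-shot example; verify the flag forms against training.py.
CUDA_VISIBLE_DEVICES=0 \
python -u training.py \
--mode fine-tuning \
--model_path ../models/pre-trained/MP4-DAP-TOP \
--ckpt_save_path ../models/few-shot/MP4-DAP-TOP-SAMSum-FewShot \
--few_shot \
--seed 42 \
--num_sample 100 \
--dataset_name Downstream_Datasets/SAMSum \
--lr 3e-05 \
--batch_size 16
```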
- Domain-Aware Pre-training (DAP) is used to deepen the model's understanding of multi-scenario multi-domain dialogues, and it suits dialogue-related downstream tasks beyond dialogue summarization (a masking sketch follows this list). The corpus with a 20% masking ratio can be found on Google Drive and Baidu Netdisk, and the corpus with a 40% masking ratio can be found on Google Drive and Baidu Netdisk.
- Task-Oriented Pre-training (TOP) targets the dialogue summarization task; its "dialogue-summary" parallel data can be extracted from LCM3DS with the code shown above.
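For intuition, here is a minimal, illustrative sketch of masking 20% of an utterance's tokens. The released DAP_0.20/DAP_0.40 corpora were pre-built by the authors, and the actual masking strategy may differ (e.g., span infilling rather than single-token masks):

```python
import random

def mask_utterance(utterance, mask_ratio=0.20, mask_token='<mask>'):
    """Illustrative only: replace ~mask_ratio of the whitespace tokens."""
    tokens = utterance.split()
    if not tokens:
        return utterance
    n_mask = max(1, round(len(tokens) * mask_ratio))
    for idx in random.sample(range(len(tokens)), n_mask):
        tokens[idx] = mask_token
    return ' '.join(tokens)

print(mask_utterance("Do you want to grab lunch together tomorrow?"))
# e.g. "Do you want to grab <mask> together tomorrow?"
```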
You can perform pre-training through the following steps:
Step1: Ensure the required pre-training datasets (i.e., DAP_0.20, DAP_0.40, or LCM3DS.json) are stored in the `datasets` folder.
Step2: Make sure the initial Speaker-BART model is downloaded and placed in the `models/initialized` subfolder.
Step3: Run `training.py`. Below is a domain-aware pre-training example:
```bash
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 \
python -u training.py \
--mode pre-training-dap \
--model_path ../models/initialized/Speaker-BART \
--ckpt_save_path ../models/pre-trained/MP4-DAP-Ours \
--gpus 8 \
--use_ddp \
--max_steps 5000 \
--val_check_interval 0.50 \
--num_sanity_val_steps 100 \
--accumulate_grad_batches 1 \
--progress_bar_refresh_rate 1 \
--lr 3e-05 \
--warmup_steps 500 \
--label_smoothing 0.1 \
--dataset_name DAP_0.20 \
--val_dataset_name SAMSum-DIALOGSUM-TWEETSUMM \
--max_length_src 1024 \
--max_length_tgt 1024 \
--batch_size 16 \
--gen_use_cache \
--gen_max_length 100 \
--gen_min_length 5 \
--gen_beam_size 5 \
--gen_length_penalty 1.0 \
--gen_no_repeat_ngram_size 0
```
NOTE: When conducting task-oriented pre-training, please ensure that the `--max_length_tgt` parameter is set to 256 or 512; a hypothetical TOP example is sketched below.
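A task-oriented pre-training run might then look like the following. The mode name `pre-training-top` and the `LCM3DS` dataset argument are guesses patterned on the DAP example above, so confirm both against `training.py`:

```bash
# Hypothetical TOP example; the --mode value and dataset name are assumptions.
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 \
python -u training.py \
--mode pre-training-top \
--model_path ../models/pre-trained/MP4-DAP \
--ckpt_save_path ../models/pre-trained/MP4-DAP-TOP-Ours \
--gpus 8 \
--use_ddp \
--dataset_name LCM3DS \
--max_length_src 1024 \
--max_length_tgt 256 \
--batch_size 16
```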
`ckpt2bin.py` can convert the `.ckpt` model you've saved after training into the `.bin` format commonly used in Transformers. Also, in `inference.py` and `training.py`, there's a `--resume_ckpt` parameter that can directly load the `.ckpt` model.
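As a rough illustration of such a conversion (not necessarily what `ckpt2bin.py` does internally), assuming the Lightning checkpoint stores the Hugging Face weights under a `model.` prefix:

```python
import torch

# Hypothetical sketch; the actual ckpt2bin.py may differ.
ckpt = torch.load('last.ckpt', map_location='cpu')  # path is an example
state_dict = {k[len('model.'):]: v
              for k, v in ckpt['state_dict'].items()
              if k.startswith('model.')}
torch.save(state_dict, 'pytorch_model.bin')  # standard Transformers weight file
```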
`evaluation.py` directly calculates the ROUGE scores between the model's prediction outputs and the ground truths.
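To compute comparable scores yourself, something like the `rouge-score` package works; this is illustrative only, and `evaluation.py` may use a different ROUGE implementation:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# Score one prediction against one reference (F1 of ROUGE-1/2/L).
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
prediction = 'Amanda baked cookies and will bring Jerry some tomorrow.'
reference = 'Amanda baked cookies and will bring some to Jerry tomorrow.'
scores = scorer.score(reference, prediction)
for name, s in scores.items():
    print(f'{name}: F1 = {s.fmeasure:.4f}')
```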
If you have any questions, please contact wxzhou@buaa.edu.cn or provide feedback in the issues section.
If you find this work useful and have used the code or data, please cite the following paper:
```bibtex
@article{zhou2023multi,
  title={Multi-Stage Pre-training Enhanced by ChatGPT for Multi-Scenario Multi-Domain Dialogue Summarization},
  author={Zhou, Weixiao and Li, Gengyao and Cheng, Xianfu and Liang, Xinnian and Zhu, Junnan and Zhai, Feifei and Li, Zhoujun},
  journal={arXiv preprint arXiv:2310.10285},
  url={https://arxiv.org/pdf/2310.10285.pdf},
  year={2023}
}
```