This repository contains the code and resources for the paper *Terminology-aware Medical Dialogue Generation*.
The project is implemented in PyTorch and built on the pytorch-lightning framework; all pretrained models can be downloaded from Hugging Face.
To run this code you need the following prerequisites:
- Python 3 or Anaconda (we use Python 3.8)
- PyTorch
- transformers (the Hugging Face package)
- pytorch-lightning
To reproduce our work you need to download the following files:

- Processed data (put them in the `datasets/med-dialog` directory): med-dialog
- Medical terminology list (put it in the `resources/med-dialog` directory): `med_term_list.txt`

The raw dialogue corpus comes from the work Medical-Dialogue-System, or you can download it from here. Put it in the `resources/med-dialog` directory.
Unzip these files; your `datasets` and `resources` directories should then look as follows.

The structure of `datasets` should be:
```
├── datasets/med-dialog
│   ├── dialog-with-term                # dialogues with term tags
│   │   ├── train.source.txt
│   │   ├── train.target.txt
│   │   ├── val.source.txt
│   │   ├── val.target.txt
│   │   ├── test.source.txt
│   │   └── test.target.txt
│   └── large-english-dialog-corpus     # dialogue datasets without terms
│       ├── train.source.txt            # input: dialogue history
│       ├── train.target.txt            # reference output: the doctor's response
│       ├── val.source.txt
│       ├── val.target.txt
│       ├── test.source.txt
│       └── test.target.txt
```
The train, val, and test sets are split with the ratio 0.90/0.05/0.05.
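The splitting itself is simple; below is a minimal sketch of how such a 0.90/0.05/0.05 split can be produced. The helper name and the fixed seed are illustrative, not the repository's actual preprocessing code.

```python
import random

def split_corpus(pairs, ratios=(0.90, 0.05, 0.05), seed=42):
    """Shuffle (source, target) pairs and split them into train/val/test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed for a reproducible split
    n = len(pairs)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]

# Example: 100 dummy dialogue pairs -> 90 / 5 / 5
pairs = [(f"history {i}", f"response {i}") for i in range(100)]
train, val, test = split_corpus(pairs)
print(len(train), len(val), len(test))  # 90 5 5
```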
An example from `test.source.txt` (the dialogue history):
patient: i have been [TERM] diagnosed [TERM] with bppv [TERM] and sjogrens syndrome. sometimes i experience [TERM] horizontal [TERM] rolling [TERM] like old tv s [TERM] used to do. this usually occurs when in a semi-recline [TERM] position [TERM] and lasts [TERM] for about 10-15 seconds. i [TERM] can t find [TERM] any information. [TERM] can you help?
An example from `test.target.txt` (the doctor's response):
bppv is due to [TERM] ear problem, [TERM] and there is n [TERM] relation to sjogren's. it [TERM] can be treated by a good [TERM] ent specialist.
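Each line of a `.source.txt` file pairs with the same line number in the corresponding `.target.txt` file, and `[TERM]` markers flag terminology positions. A minimal sketch of loading such a file pair (the helper is illustrative, not the repository's data loader):

```python
def load_pairs(source_path, target_path):
    """Read aligned files: line i of the source pairs with line i of the target."""
    with open(source_path, encoding="utf-8") as f_src, \
         open(target_path, encoding="utf-8") as f_tgt:
        sources = [line.rstrip("\n") for line in f_src]
        targets = [line.rstrip("\n") for line in f_tgt]
    assert len(sources) == len(targets), "source/target files must be aligned"
    return list(zip(sources, targets))

# [TERM] markers can be counted directly on the raw text:
sample = "patient: i have been [TERM] diagnosed [TERM] with bppv"
print(sample.count("[TERM]"))  # 2
```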
The structure of `resources` should be:
```
├── resources/med-dialog
│   ├── large-english-dialog-corpus     # the raw dialogue corpus
│   │   ├── train.source.txt
│   │   ├── train.target.txt
│   │   ├── val.source.txt
│   │   ├── val.target.txt
│   │   ├── test.source.txt
│   │   └── test.target.txt
│   └── med_term_list.txt               # the terminology list
```
The terminology list is acquired from the work wordlist-medicalterms-en.
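The repository's own preprocessing produces the `[TERM]`-tagged dialogues; as a rough illustration only, a terminology list like `med_term_list.txt` could be used to insert tags before matched tokens along these lines (the helper is a sketch, not the repo's code):

```python
def tag_terms(text, term_set):
    """Prefix every whitespace token found in term_set with a [TERM] tag."""
    out = []
    for token in text.split():
        # strip trailing punctuation before matching against the term list
        if token.lower().strip(".,?!") in term_set:
            out.append("[TERM]")
        out.append(token)
    return " ".join(out)

terms = {"bppv", "diagnosed"}
print(tag_terms("i was diagnosed with bppv.", terms))
# i was [TERM] diagnosed with [TERM] bppv.
```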
Install the dependencies mentioned above:

```shell
pip install -r requirements.txt
```
Train BART w/ terms AL:
```shell
python tasks/med-dialog/train.py --model_name terms_bart --experiment_name=term_bart-base-meddialog \
    --learning_rate=2e-5 --train_batch_size=6 --eval_batch_size=6 --model_name_or_path=facebook/bart-base \
    --val_check_interval=0.5 --max_epochs=6 --accum_batches_args=12 --num_sanity_val_steps=1 \
    --save_top_k 3 --eval_beams 2 --data_dir=datasets/med-dialog/dialog-with-term \
    --limit_val_batches=20
```
Test BART w/ terms AL:
```shell
python tasks/med-dialog/test.py \
    --eval_batch_size=32 --model_name_or_path=facebook/bart-base \
    --output_dir=output/med-dialog/ --model_name terms_bart --experiment_name=term_bart-base-meddialog --eval_beams 2 \
    --max_target_length=400
```
If you also want to try the baselines, please read the code in `tasks/med-dialog/train.py` and `tasks/med-dialog/test.py`; it should be clear what to do.
Some notes on the project layout:
```
├── src                # source code
├── tasks              # programs for training, testing, and generating responses
├── datasets           # the datasets
├── output             # created automatically; holds all outputs, including checkpoints and generated text
├── resources          # resources used by the model, e.g. the pretrained models
├── .gitignore         # used by git
├── requirements.txt   # the list of required python packages
```
We provide two scripts to download models from the Hugging Face Hub: `tasks/download_hf_models.sh` and `src/utils/huggingface_helper.py`.
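For reference, the same thing can be done directly with the transformers API; the sketch below downloads a model and tokenizer and saves them locally (the function name and output path are illustrative, and the import is deferred so the snippet loads even without transformers installed):

```python
def download_hf_model(name, out_dir):
    """Download a pretrained model and tokenizer from the Hugging Face Hub
    and save them locally, e.g. under resources/."""
    from transformers import AutoModel, AutoTokenizer  # deferred import
    AutoTokenizer.from_pretrained(name).save_pretrained(out_dir)
    AutoModel.from_pretrained(name).save_pretrained(out_dir)

# e.g. download_hf_model("facebook/bart-base", "resources/bart-base")
```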
If you find this repository or our paper helpful, please cite our paper.
The arXiv citation:
@misc{https://doi.org/10.48550/arxiv.2210.15551,
doi = {10.48550/ARXIV.2210.15551},
url = {https://arxiv.org/abs/2210.15551},
author = {Tang, Chen and Zhang, Hongbo and Loakman, Tyler and Lin, Chenghua and Guerin, Frank},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Terminology-aware Medical Dialogue Generation},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
The ICASSP citation (from Google Scholar):
@inproceedings{tang2023terminology,
title={Terminology-Aware Medical Dialogue Generation},
author={Tang, Chen and Zhang, Hongbo and Loakman, Tyler and Lin, Chenghua and Guerin, Frank},
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}