This repository contains the code and resources for the paper *Terminology-aware Medical Dialogue Generation*.
The project is implemented in PyTorch and built on the pytorch-lightning framework; all pretrained models can be downloaded from Hugging Face.
To run this code you need the following prerequisites:
- Python 3 or Anaconda (we use Python 3.8)
- PyTorch
- transformers (the Hugging Face package)
- pytorch-lightning
To reproduce our work you need to download the following files:

- Processed data (put them in the `datasets/med-dialog` directory): med-dialog
- Medical terminology list (put it in the `resources/med-dialog` directory): `med_term_list.txt`

The raw dialogue corpus comes from the work Medical-Dialogue-System, or you can download it from here. Put it in the `resources/med-dialog` directory.
Unzip these files; your `datasets` and `resources` directories should then look as follows.

The structure of `datasets` should be:
```
├── datasets/med-dialog
│   ├── dialog-with-term                # dialogues with term tags
│   │   ├── train.source.txt
│   │   ├── train.target.txt
│   │   ├── val.source.txt
│   │   ├── val.target.txt
│   │   ├── test.source.txt
│   │   └── test.target.txt
│   └── large-english-dialog-corpus     # dialogue datasets without terms
│       ├── train.source.txt            # input: dialogue history
│       ├── train.target.txt            # reference output: the doctor's response
│       ├── val.source.txt
│       ├── val.target.txt
│       ├── test.source.txt
│       └── test.target.txt
```
The train, val, and test sets are split with the ratio 0.90/0.05/0.05.
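The splitting itself is simple; below is a minimal sketch of how such a 0.90/0.05/0.05 split can be produced. The helper name and the fixed seed are illustrative, not the repository's actual preprocessing code.

```python
import random

def split_corpus(pairs, ratios=(0.90, 0.05, 0.05), seed=42):
    """Shuffle (source, target) pairs and split them into train/val/test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed for a reproducible split
    n = len(pairs)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]

# Example: 100 dummy dialogue pairs -> 90 / 5 / 5
pairs = [(f"history {i}", f"response {i}") for i in range(100)]
train, val, test = split_corpus(pairs)
print(len(train), len(val), len(test))  # 90 5 5
```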
An example from `test.source.txt` (the dialogue history):
patient: i have been [TERM] diagnosed [TERM] with bppv [TERM] and sjogrens syndrome. sometimes i experience [TERM] horizontal [TERM] rolling [TERM] like old tv s [TERM] used to do. this usually occurs when in a semi-recline [TERM] position [TERM] and lasts [TERM] for about 10-15 seconds. i [TERM] can t find [TERM] any information. [TERM] can you help?
An example from `test.target.txt` (the doctor's response):
bppv is due to [TERM] ear problem, [TERM] and there is n [TERM] relation to sjogren's. it [TERM] can be treated by a good [TERM] ent specialist.
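Each line of a `.source.txt` file pairs with the same line number in the corresponding `.target.txt` file, and `[TERM]` markers flag terminology positions. A minimal sketch of loading such a file pair (the helper is illustrative, not the repository's data loader):

```python
def load_pairs(source_path, target_path):
    """Read aligned files: line i of the source pairs with line i of the target."""
    with open(source_path, encoding="utf-8") as f_src, \
         open(target_path, encoding="utf-8") as f_tgt:
        sources = [line.rstrip("\n") for line in f_src]
        targets = [line.rstrip("\n") for line in f_tgt]
    assert len(sources) == len(targets), "source/target files must be aligned"
    return list(zip(sources, targets))

# [TERM] markers can be counted directly on the raw text:
sample = "patient: i have been [TERM] diagnosed [TERM] with bppv"
print(sample.count("[TERM]"))  # 2
```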
The structure of `resources` should be:
```
├── resources/med-dialog
│   ├── large-english-dialog-corpus     # the raw dialogue corpus
│   │   ├── train.source.txt
│   │   ├── train.target.txt
│   │   ├── val.source.txt
│   │   ├── val.target.txt
│   │   ├── test.source.txt
│   │   └── test.target.txt
│   └── med_term_list.txt               # the terminology list
```
The terminology list is acquired from the work wordlist-medicalterms-en.
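The repository's own preprocessing produces the `[TERM]`-tagged dialogues; as a rough illustration only, a terminology list like `med_term_list.txt` could be used to insert tags before matched tokens along these lines (the helper is a sketch, not the repo's code):

```python
def tag_terms(text, term_set):
    """Prefix every whitespace token found in term_set with a [TERM] tag."""
    out = []
    for token in text.split():
        # strip trailing punctuation before matching against the term list
        if token.lower().strip(".,?!") in term_set:
            out.append("[TERM]")
        out.append(token)
    return " ".join(out)

terms = {"bppv", "diagnosed"}
print(tag_terms("i was diagnosed with bppv.", terms))
# i was [TERM] diagnosed with [TERM] bppv.
```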
Install the dependencies mentioned above:

```shell
pip install -r requirements.txt
```
Train BART w/ terms AL:
```shell
python tasks/med-dialog/train.py --model_name terms_bart --experiment_name=term_bart-base-meddialog \
    --learning_rate=2e-5 --train_batch_size=6 --eval_batch_size=6 --model_name_or_path=facebook/bart-base \
    --val_check_interval=0.5 --max_epochs=6 --accum_batches_args=12 --num_sanity_val_steps=1 \
    --save_top_k 3 --eval_beams 2 --data_dir=datasets/med-dialog/dialog-with-term \
    --limit_val_batches=20
```
Test BART w/ terms AL:
```shell
python tasks/med-dialog/test.py \
    --eval_batch_size=32 --model_name_or_path=facebook/bart-base \
    --output_dir=output/med-dialog/ --model_name terms_bart --experiment_name=term_bart-base-meddialog --eval_beams 2 \
    --max_target_length=400
```
If you also want to try the baselines, please read the code in `tasks/med-dialog/train.py` and `tasks/med-dialog/test.py`; it should be clear what to do.
Some notes on the project layout:
```
├── src                # source code
├── tasks              # programs for training, testing, and generating responses
├── datasets           # the datasets
├── output             # created automatically; holds all outputs, including checkpoints and generated text
├── resources          # resources used by the model, e.g. the pretrained models
├── .gitignore         # used by git
├── requirements.txt   # the list of required python packages
```
We provide two scripts to download models from the Hugging Face Hub: `tasks/download_hf_models.sh` and `src/utils/huggingface_helper.py`.
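For reference, the same thing can be done directly with the transformers API; the sketch below downloads a model and tokenizer and saves them locally (the function name and output path are illustrative, and the import is deferred so the snippet loads even without transformers installed):

```python
def download_hf_model(name, out_dir):
    """Download a pretrained model and tokenizer from the Hugging Face Hub
    and save them locally, e.g. under resources/."""
    from transformers import AutoModel, AutoTokenizer  # deferred import
    AutoTokenizer.from_pretrained(name).save_pretrained(out_dir)
    AutoModel.from_pretrained(name).save_pretrained(out_dir)

# e.g. download_hf_model("facebook/bart-base", "resources/bart-base")
```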
If you find this repository or our paper helpful, please cite our paper.
The arXiv citation:
@misc{https://doi.org/10.48550/arxiv.2210.15551,
doi = {10.48550/ARXIV.2210.15551},
url = {https://arxiv.org/abs/2210.15551},
author = {Tang, Chen and Zhang, Hongbo and Loakman, Tyler and Lin, Chenghua and Guerin, Frank},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Terminology-aware Medical Dialogue Generation},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
The ICASSP citation (from Google Scholar):
@inproceedings{tang2023terminology,
title={Terminology-Aware Medical Dialogue Generation},
author={Tang, Chen and Zhang, Hongbo and Loakman, Tyler and Lin, Chenghua and Guerin, Frank},
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}