tangg555/meddialog

Terminology-aware Medical Dialogue Generation

This repository contains the code and resources for the paper Terminology-aware Medical Dialogue Generation.

Instructions

This project is implemented with PyTorch.

It is built on the pytorch-lightning framework, and all pretrained models can be downloaded from Hugging Face.

To run this code, you need the following prerequisites:

Datasets and Resources

Directly Download Dataset and Resources

To reproduce our work, you need to download the following files:

  • Processed data (put it in the datasets/med-dialog directory): med-dialog

  • Medical Terminology List (put it in the resources/med-dialog directory): med_term_list.txt

Preprocess Dataset From Scratch

The raw dialogue corpus comes from the Medical-Dialogue-System project; alternatively, you can download it from here.

Put it in the resources/med-dialog directory.

Put Files To Correct Destinations

Unzip these files. Your datasets and resources directories should then look as follows.

The structure of datasets should be like this:

├── datasets/med-dialog
    └── dialog-with-term		# dialogs with term tags
          └── `train.source.txt`
          └── `train.target.txt`
          └── `val.source.txt`
          └── `val.target.txt`
          └── `test.source.txt`
          └── `test.target.txt`
    └── large-english-dialog-corpus		# dialog datasets without terms
          └── `train.source.txt`    # input: dialogue history
          └── `train.target.txt`    # reference output: response from the doctor
          └── `val.source.txt`
          └── `val.target.txt`
          └── `test.source.txt`
          └── `test.target.txt`

The train, val, and test sets are split in a ratio of 0.90 : 0.05 : 0.05.
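For reference, a 0.90 / 0.05 / 0.05 split can be sketched as below. This is only an illustration: the function name `split_pairs`, the shuffling, and the seed are assumptions, and the preprocessing code in this repository may split the corpus differently.

```python
import random


def split_pairs(pairs, ratios=(0.90, 0.05, 0.05), seed=42):
    """Shuffle (source, target) pairs and cut them into train/val/test.

    Hypothetical helper: the seed and the shuffle are assumptions,
    not the repository's actual preprocessing.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        # The test split takes whatever remains, so no pair is dropped.
        "test": shuffled[n_train + n_val:],
    }


# Made-up pairs, only to show the resulting split sizes.
pairs = [(f"patient: question {i}", f"answer {i}") for i in range(100)]
splits = split_pairs(pairs)
print({name: len(items) for name, items in splits.items()})
```

Each split would then be written to its own `*.source.txt` and `*.target.txt` pair, one example per line.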

An example from test.source.txt (dialogue history):

patient: i have been [TERM] diagnosed [TERM] with bppv [TERM] and sjogrens syndrome. sometimes i experience [TERM] horizontal [TERM] rolling [TERM] like old tv s [TERM] used to do. this usually occurs when in a semi-recline [TERM] position [TERM] and lasts [TERM] for about 10-15 seconds. i [TERM] can t find [TERM] any information. [TERM] can you help?

An example from test.target.txt (response from the doctor):

bppv is due to [TERM] ear problem, [TERM] and there is n [TERM] relation to sjogren's. it [TERM] can be treated by a good [TERM] ent specialist.
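If you need the plain text without the tags (for example, to read or score a generated response), the `[TERM]` markers can be stripped with a few lines of Python. This is a minimal sketch; `strip_term_tags` and `count_term_tags` are hypothetical helpers and are not part of this repository.

```python
import re

TERM_TAG = "[TERM]"


def strip_term_tags(text):
    """Remove every [TERM] marker and collapse the leftover whitespace."""
    without_tags = text.replace(TERM_TAG, " ")
    return re.sub(r"\s+", " ", without_tags).strip()


def count_term_tags(text):
    """Count how many [TERM] markers a line contains."""
    return text.count(TERM_TAG)


sample = (
    "bppv is due to [TERM] ear problem, [TERM] and there is n "
    "[TERM] relation to sjogren's."
)
print(strip_term_tags(sample))
print(count_term_tags(sample))
```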

The structure of resources should be like this:

├── resources/med-dialog
    └── large-english-dialog-corpus		# the raw dialogue corpus
          └── `train.source.txt`
          └── `train.target.txt`
          └── `val.source.txt`
          └── `val.target.txt`
          └── `test.source.txt`
          └── `test.target.txt`
    └── med_term_list.txt		# the terminology list
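Before training, it can help to confirm that every file listed in the directory trees of this section is in place. The script below is a small sketch of such a check; `missing_files` is a hypothetical helper (not shipped with this repository), and it assumes you run it from the project root.

```python
from pathlib import Path

# The six split files that appear in each corpus directory above.
SPLIT_FILES = [
    f"{split}.{side}.txt"
    for split in ("train", "val", "test")
    for side in ("source", "target")
]

# Directory -> expected file names, copied from the trees in this section.
EXPECTED = {
    "datasets/med-dialog/dialog-with-term": SPLIT_FILES,
    "datasets/med-dialog/large-english-dialog-corpus": SPLIT_FILES,
    "resources/med-dialog/large-english-dialog-corpus": SPLIT_FILES,
    "resources/med-dialog": ["med_term_list.txt"],
}


def missing_files(project_root="."):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    missing = []
    for directory, names in EXPECTED.items():
        for name in names:
            path = root / directory / name
            if not path.is_file():
                missing.append(str(path))
    return missing


for path in missing_files():
    print("missing:", path)
```

An empty output means all expected files were found.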

The terminology list is taken from wordlist-medicalterms-en.
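To inspect the list yourself, something like the following may be useful. It is only a sketch: it assumes med_term_list.txt holds one term per line in UTF-8, and `load_terms` and `find_terms` are hypothetical helpers. How this repository actually matches terms against dialogues is not described here.

```python
import re
import tempfile
from pathlib import Path


def load_terms(path):
    """Read a term list, assuming one term per line (an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def find_terms(utterance, terms):
    """Return the tokens of an utterance that appear in the term set."""
    tokens = re.findall(r"[a-z0-9']+", utterance.lower())
    return [token for token in tokens if token in terms]


# A tiny stand-in file with made-up contents; point load_terms at
# resources/med-dialog/med_term_list.txt to use the real list.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "med_term_list.txt"
    demo_path.write_text("syndrome\nent\ndiagnosed\n", encoding="utf-8")
    terms = load_terms(demo_path)

found = find_terms("i have been diagnosed with bppv and sjogrens syndrome.", terms)
print(found)
```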

Quick Start

1. Install packages

pip install -r requirements.txt

2. Collect Datasets and Resources

Follow the Datasets and Resources section above.

3. Run the code for training or testing

Train BART with terms AL:

python tasks/med-dialog/train.py --model_name terms_bart --experiment_name=term_bart-base-meddialog \
 --learning_rate=2e-5 --train_batch_size=6 --eval_batch_size=6 --model_name_or_path=facebook/bart-base \
 --val_check_interval=0.5 --max_epochs=6 --accum_batches_args=12  --num_sanity_val_steps=1 \
 --save_top_k 3 --eval_beams 2 --data_dir=datasets/med-dialog/dialog-with-term \
 --limit_val_batches=20

Test BART with terms AL:

python tasks/med-dialog/test.py \
  --eval_batch_size=32 --model_name_or_path=facebook/bart-base \
  --output_dir=output/med-dialog/ --model_name terms_bart --experiment_name=term_bart-base-meddialog --eval_beams 2 \
  --max_target_length=400

If you also want to try the baselines, read the code in tasks/med-dialog/train.py and tasks/med-dialog/test.py, which shows how to run them.

Notes

Some notes for this project.

1 - Complete Project Structure

├── src # source code
├── tasks # programs to run, e.g. training, testing, and generation
├── datasets
├── output  # created automatically; holds all outputs, including checkpoints and generated text
├── resources # resources used by the model, e.g. the pretrained model
├── .gitignore # used by git
└── requirements.txt # list of required Python packages
2 - Scripts for Downloading huggingface models

There are two scripts for downloading models from the Hugging Face website: tasks/download_hf_models.sh and src/utils/huggingface_helper.py.

Citation

If you find this repository or paper helpful, please cite our paper.

This is the arXiv citation:

@misc{https://doi.org/10.48550/arxiv.2210.15551,
  doi = {10.48550/ARXIV.2210.15551},
  url = {https://arxiv.org/abs/2210.15551},
  author = {Tang, Chen and Zhang, Hongbo and Loakman, Tyler and Lin, Chenghua and Guerin, Frank},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Terminology-aware Medical Dialogue Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

This is the ICASSP citation from Google Scholar:

@inproceedings{tang2023terminology,
  title={Terminology-Aware Medical Dialogue Generation},
  author={Tang, Chen and Zhang, Hongbo and Loakman, Tyler and Lin, Chenghua and Guerin, Frank},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
