
LearnFromHumanEdit

Installation

If you use conda, you can set up the environment as follows:

conda create -n salt python=3.8
conda activate salt

We have tested with CUDA versions 11.7 and 10.2, but this release should work with more recent versions as well.

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=10.2 -c pytorch

or

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia 
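To verify that the install picked up a working CUDA build (a quick sanity check, not part of the original instructions):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"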

Install other packages:

conda install -c conda-forge matplotlib
conda install -c conda-forge spacy
conda install -c conda-forge scipy
python -m spacy download en_core_web_sm
pip install nltk
pip install ipdb
pip install rouge
pip install rouge-score
pip install trl
pip install minineedle

pip install datasets
pip install transformers

If you want to use QLoRA for LLM fine-tuning:

pip install -q -U bitsandbytes 
pip install -q -U git+https://github.com/huggingface/peft.git 
pip install -q -U git+https://github.com/huggingface/accelerate.git
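For reference, here is a minimal sketch of how these packages are typically combined for QLoRA fine-tuning. The base model name and LoRA hyperparameters are illustrative placeholders, not values from this repo:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Hypothetical base model; substitute the checkpoint you actually train
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters; r, alpha, and target_modules are placeholder choices
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()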

Run the trainers

python DPO_trainer.py
python SFT_trainer.py
python SALT_trainer.py
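For orientation, a DPO setup built on trl generally looks like the sketch below (written against a 2023-era trl release; newer versions moved beta into a DPOConfig). The model name, dataset path, and hyperparameters are illustrative, not this repo's defaults:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # placeholder; use your own fine-tuned checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# DPO expects "prompt", "chosen", and "rejected" columns
dataset = load_dataset("json", data_files="preference_pairs.json", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=TrainingArguments(output_dir="dpo_out", per_device_train_batch_size=2),
    beta=0.1,  # strength of the KL penalty toward the reference model
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()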

Run Synthetic Data Generation

python SyntheticData.py

Instructions for Synthetic Data Generation

Use the above script to generate synthetic data of two types (an example record follows this list):

  1. High to Low (H2L): the chosen summary is the reference summary and the rejected summary is the LLM-hallucinated summary.
  2. Low to High (L2H): the rejected summary is the pre-trained model-generated summary and the chosen summary is the factually improved summary.
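
In both settings each example forms a preference pair. A hypothetical record might look like the following; the field names are illustrative, not necessarily the exact schema used by the training scripts:

pair = {
    "prompt": "Summarize: <source document text>",
    "chosen": "reference summary (H2L) or factually improved summary (L2H)",
    "rejected": "LLM-hallucinated summary (H2L) or pre-trained model summary (L2H)",
}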

Make the following changes based on your synthetic data generation setting (see the example configuration after this list):

  1. Add the OpenAI API key in the openai_api_key variable.
  2. Update the pre-trained model checkpoint path in the model_checkpoint variable for Low to High (L2H) synthetic generation.
  3. Update the OpenAI model type in the gpt_model_type variable. This model is used to generate the hallucinated and factually improved summaries.
    • gpt_model_type: gpt-3.5-turbo-0613 for GPT-3.5 Turbo
    • gpt_model_type: gpt-4-0613 for GPT-4
  4. Update the synthetic data generation type in the synthetic_data_type variable.
    • synthetic_data_type: H2L for High to Low synthetic data.
    • synthetic_data_type: L2H for Low to High synthetic data.
  5. Update the data_files variable to set the path to the base dataset.
  6. Use num_samples to control the size of the synthetic dataset.
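
Putting the steps together, the variables you edit in SyntheticData.py would be set along these lines (the values are examples; the exact layout of the script may differ):

openai_api_key = "sk-..."                      # step 1: your OpenAI key
model_checkpoint = "path/to/pretrained_model"  # step 2: needed for L2H only
gpt_model_type = "gpt-3.5-turbo-0613"          # step 3: or "gpt-4-0613"
synthetic_data_type = "H2L"                    # step 4: "H2L" or "L2H"
data_files = "path/to/base_dataset.json"       # step 5: base dataset path
num_samples = 5000                             # step 6: synthetic dataset size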

TODO

  • Add after-visit-summary datasets (L2H, H2L) × (GPT-3.5-Turbo, GPT-4); each dataset has around 5,000 data points
  • Run the synthetic imitation edit generation code on the doctor-patient-conversation-to-note synthetic dataset (https://github.com/believewhat/Dr.NoteAid/tree/main)

Citation

@article{yao2023improving,
  title={Improving Summarization with Human Edits},
  author={Yao, Zonghai and Schloss, Benjamin J and Selvaraj, Sai P},
  journal={arXiv preprint arXiv:2310.05857},
  year={2023}
}

@article{mishra2023synthetic,
  title={Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization},
  author={Mishra, Prakamya and Yao, Zonghai and Chen, Shuwei and Wang, Beining and Mittal, Rohan and Yu, Hong},
  journal={arXiv preprint arXiv:2310.20033},
  year={2023}
}
