This repository contains code for the paper Direct Preference Optimization with an Offset (ODPO).

ODPO: Direct Preference Optimization with an Offset

⚠️ This repo is based on, and is an extension of, the DPO repo. Please refer to the original repo for general installation guidelines. We thank the authors for releasing their code.

What is new in the DPO codebase?

  • ODPO loss: To use this loss, set loss=odpo. You can change the value of alpha by setting loss.alpha=X; the default value is 1. (A sketch of the loss follows this list.)
  • Data (under data/) to reproduce our results on 3 tasks: sentiment control, toxicity control, and summarization.
  • Evaluation pipeline to test KL divergence and rewards.
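
For reference, here is a minimal PyTorch sketch of the ODPO objective as described in the paper: the DPO logits are shifted by an offset that grows with the reward gap between the preferred and dispreferred responses. The function name, arguments, and the choice of log as the monotone function are illustrative assumptions, not the repo's exact implementation.

import torch
import torch.nn.functional as F

def odpo_loss(policy_chosen_logps, policy_rejected_logps,
              reference_chosen_logps, reference_rejected_logps,
              chosen_rewards, rejected_rewards,
              beta=0.1, alpha=1.0):
    # Log-ratios of the policy against the frozen reference model.
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps
    logits = chosen_logratios - rejected_logratios

    # Offset: alpha times a monotonically increasing function of the reward
    # gap (log here), assuming chosen_rewards > rejected_rewards elementwise.
    offset = alpha * torch.log(chosen_rewards - rejected_rewards)

    # alpha=0 removes the offset and recovers the standard DPO loss.
    losses = -F.logsigmoid(beta * logits - offset)
    return losses.mean()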

Sentiment Control Task

To reproduce the results of the paper, the SFT checkpoint should be a gpt2-large model fine-tuned on the IMDB train set (a minimal SFT sketch follows the command below).

python train.py \
model=gpt2-large \
datasets=[imdb] \
loss=odpo \
seed=1 \
loss.beta=0.7 \
loss.alpha=0.5 \
model.name_or_path=[SFT CHECKPOINT] \
exp_name=imdb \
gradient_accumulation_steps=2 \
batch_size=32 \
eval_batch_size=32 \
trainer=FSDPTrainer \
local_run_dir=[LOCAL DIR] \
sample_during_eval=false
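
If you do not already have an IMDB SFT checkpoint, the following is one way to produce one with the Hugging Face Trainer. This is a hedged sketch, not the authors' exact recipe; the hyperparameters and output path are illustrative.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

# Standard language-modeling fine-tuning on IMDB reviews.
dataset = load_dataset("imdb", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="imdb-sft-gpt2-large",  # illustrative path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=100,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=collator)
trainer.train()
trainer.save_model("imdb-sft-gpt2-large")  # pass this path as model.name_or_path

Point model.name_or_path at the resulting checkpoint directory when launching the ODPO command above.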

Toxicity Control Task

python train.py \
model=gpt-neo \
datasets=[toxicity] \
loss=odpo \
seed=1 \
loss.beta=0.05 \
loss.alpha=1. \
model.name_or_path=[SFT CHECKPOINT] \
exp_name=toxicity \
gradient_accumulation_steps=2 \
batch_size=32 \
eval_batch_size=32 \
trainer=FSDPTrainer \
local_run_dir=[LOCAL DIR] \
sample_during_eval=false"
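
Both runs above rely on the evaluation pipeline's KL-divergence and reward measurements. As an illustration of what the KL check involves, here is a minimal sketch that draws a sample from the policy and forms a single-sample Monte Carlo estimate of KL(policy || reference) on the continuation. The model names and the estimator are assumptions for illustration, not the repo's evaluation code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, input_ids, prompt_len):
    # Sum of per-token log-probabilities of the continuation tokens.
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1, :]
    logps = torch.log_softmax(logits, dim=-1)
    targets = input_ids[:, 1:]
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum(dim=-1)

# Placeholder models: substitute the ODPO policy and its SFT reference.
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
policy = AutoModelForCausalLM.from_pretrained("gpt2-large")
reference = AutoModelForCausalLM.from_pretrained("gpt2-large")

prompt_ids = tokenizer("The movie was", return_tensors="pt").input_ids
sample = policy.generate(prompt_ids, do_sample=True, max_new_tokens=64,
                         pad_token_id=tokenizer.eos_token_id)

# Single-sample Monte Carlo estimate of KL(policy || reference):
# E_{y ~ policy}[log policy(y | x) - log reference(y | x)].
kl_estimate = (sequence_logprob(policy, sample, prompt_ids.shape[1])
               - sequence_logprob(reference, sample, prompt_ids.shape[1]))
print(kl_estimate.item())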

Summarization Task

For this task, we first train the model (without the evaluation loop) and then evaluate it by sampling with different temperatures.

For training:

python train.py \
model=gptj \
datasets=[tldr] \
loss=odpo \
seed=0 \
loss.beta=0.5 \
model.name_or_path=[SFT CHECKPOINT] \
exp_name=tldr \
gradient_accumulation_steps=2 \
batch_size=16 \
max_prompt_length=512 \
max_length=1024 \
eval_batch_size=16 \
trainer=FSDPTrainer \
local_run_dir=[LOCAL DIR] \
sample_during_eval=false

Sampling:

python train.py \
function=eval \
model=gptj \
datasets=[tldr] \
loss=odpo \
seed=0 \
loss.beta=0.5 \
temperature=0.25 \
n_eval_examples=100 \
model.name_or_path=[SFT CHECKPOINT] \
model.archive=tldr/odpo/LATEST/policy.pt \
exp_name=tldr_test \
gradient_accumulation_steps=2 \
max_prompt_length=512 \
max_length=1024 \
eval_batch_size=16 \
trainer=FSDPTrainer \
local_run_dir=[LOCAL DIR] \
sample_during_eval=false

Note: model.archive is the path to the checkpoint of the model fine-tuned with ODPO (produced by the training run above).
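
For downstream use outside this codebase, the sketch below loads the saved policy weights into the Hugging Face model and samples a summary at the same temperature as above. It assumes the checkpoint layout of the upstream DPO codebase, where policy.pt is a dict holding the weights under a 'state' key; the paths and the prompt are placeholders, so verify against your local checkpoints.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: substitute your SFT checkpoint and local run directory.
sft_path = "EleutherAI/gpt-j-6b"
archive_path = "tldr/odpo/LATEST/policy.pt"

tokenizer = AutoTokenizer.from_pretrained(sft_path)
model = AutoModelForCausalLM.from_pretrained(sft_path)

# Assumption: policy.pt stores the fine-tuned weights under the 'state' key.
checkpoint = torch.load(archive_path, map_location="cpu")
model.load_state_dict(checkpoint["state"])

prompt = "SUBREDDIT: r/AskReddit\nPOST: <your Reddit post here>\nTL;DR:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, do_sample=True, temperature=0.25,
                        max_new_tokens=128,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))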

Citing ODPO

If ODPO or this repository is useful in your own research, you can use the following BibTeX entry:

@misc{amini2024direct,
  title={Direct Preference Optimization with an Offset},
  author={Afra Amini and Tim Vieira and Ryan Cotterell},
  url={https://arxiv.org/pdf/2402.10571},
  year={2024},
  eprint={2402.10571},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
