TPO (Triple Preference Optimization) is a new preference learning method that aligns an LLM using three preferences (reference, chosen, and rejected responses) without requiring a separate supervised fine-tuning step. This simple one-step combination of SFT and preference optimization outperforms current state-of-the-art alignment methods such as DPO, CPO, KTO, IPO, and ORPO.
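To make the "one-step SFT + preference optimization" idea concrete, here is a minimal sketch. It is not the exact TPO objective (see the paper cited at the end of this README); it only illustrates how a likelihood term on the reference response can be combined with a pairwise term on the chosen/rejected pair, reusing the `alpha` and `beta` hyperparameters exposed by the training script below.

```python
# Illustrative sketch only, NOT the authors' exact TPO loss.
import torch
import torch.nn.functional as F

def tpo_style_loss(logp_reference, logp_chosen, logp_rejected, alpha=0.9, beta=0.2):
    """logp_* are per-example sums of token log-probabilities under the policy."""
    # Pairwise preference term: push the chosen response above the rejected one.
    preference_term = -F.logsigmoid(beta * (logp_chosen - logp_rejected))
    # SFT-style term: increase the likelihood of the highest-rated (reference) response.
    sft_term = -alpha * logp_reference
    return (preference_term + sft_term).mean()

# Toy usage with fake summed log-probs for a batch of two examples.
logp_ref = torch.tensor([-20.0, -15.0])
logp_chosen = torch.tensor([-25.0, -18.0])
logp_rejected = torch.tensor([-40.0, -30.0])
print(tpo_style_loss(logp_ref, logp_chosen, logp_rejected))
```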
- Models: Compatible with various models, including but not limited to alignment-handbook/zephyr-7b-sft-full and mistralai/Mistral-7B-v0.1.
- GPUs: Optimized for Nvidia GPUs.
- Batch Sizes: Configurable batch sizes for training to optimize GPU usage.
- Data Parallelism: Support for data parallelism to speed up training on multiple GPUs.
- TPO script supports LoRA (see the adapter-config sketch after this list)
- TPO script supports DeepSpeed
- TPO script supports Accelerate
- TPO script supports Flash Attention 2
- TPO script supports FSDP
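For the LoRA option, the exact command-line flags expected by run_tpo.py are defined in the script itself; the snippet below only sketches typical adapter settings via the peft library. The rank, scaling, and target-module values are illustrative assumptions, not values from the paper.

```python
from peft import LoraConfig

# Illustrative LoRA adapter settings; these values are assumptions, not the paper's.
lora_config = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Mistral-style attention projections
    task_type="CAUSAL_LM",
)
```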
This is a quick tutorial to set up and train a model with the TPO method.
Create and activate a Conda environment:
conda create -n tpo python=3.10
conda activate tpo
Install PyTorch with CUDA support (for Nvidia GPUs):
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
Install additional requirements:
pip install -r requirements.txt
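A quick way to confirm the CUDA-enabled PyTorch build installed correctly:

```python
# Sanity check: the CUDA build of PyTorch should be visible to Python.
import torch

print(torch.__version__)          # expected: 2.1.2+cu121
print(torch.cuda.is_available())  # should be True on a machine with an Nvidia GPU
```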
We evaluated the models in both single-turn and multi-turn scenarios using the MT-Bench benchmark.
We also assessed all alignment methods using the Open LLM Leaderboard benchmarks (ARC, HellaSwag, MMLU, TruthfulQA, and Winogrande), Big-Bench benchmarks (Causal Judgment for causal reasoning, Sports Understanding for commonsense reasoning, and Formal Fallacies), and OpenBookQA, using this version of the Eleuther AI Harness.
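As an illustration, a subset of these benchmarks can be run through the harness's Python API. The call below assumes lm-eval v0.4+; the specific harness version used in the paper may expose a different interface, so treat this as a sketch rather than the exact evaluation command.

```python
from lm_eval import simple_evaluate

# Evaluate a checkpoint on a few of the reported tasks (task names per lm-eval v0.4).
results = simple_evaluate(
    model="hf",
    model_args="pretrained=alignment-handbook/zephyr-7b-sft-full,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "winogrande", "truthfulqa_mc2", "openbookqa"],
)
print(results["results"])
```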
Run run_tpo.sh to train a model with TPO. This study used the alignment-handbook/zephyr-7b-sft-full, mistralai/Mistral-7B-v0.1, and microsoft/phi-2 models; however, you can also train other models with TPO.
#!/bin/bash
OUTPUT_DIR="/OUTPUT/DIR/PATH"
accelerate launch \
--config_file configs/deepspeed_train_config_bf16.yaml \
run_tpo.py \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--tokenizer_name mistralai/Mistral-7B-v0.1 \
--beta 0.2 \
--alpha 0.9 \
--do_train \
--bf16 \
--attn_implementation flash_attention_2 \
--multi_gpu_one_model True \
--learning_rate 5.0e-7 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--optim adamw_torch \
--warmup_ratio 0.1 \
--save_steps 100 \
--log_level info \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--evaluation_strategy steps \
--save_total_limit 1 \
--logging_strategy steps \
--logging_steps 10 \
--output_dir $OUTPUT_DIR \
--num_train_epochs 1 \
--max_length 1024 \
--max_prompt_length 512 \
--seed 42 \
--overwrite_output_dir \
--report_to none
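After training, the checkpoint written to OUTPUT_DIR should be a standard Hugging Face model directory (assuming full-model rather than LoRA-only saving), so it can be loaded for inference in the usual way:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/OUTPUT/DIR/PATH"  # same path as OUTPUT_DIR in run_tpo.sh
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto", device_map="auto")

prompt = "Explain preference optimization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```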
For training with the other alignment methods, we used the Hugging Face alignment-handbook repository.
To train TPO, which requires three preferences, we created a custom dataset from the UltraFeedback dataset: the response with the highest score serves as the reference response, the response with the second-highest score as the chosen response, and the response with the lowest score as the rejected response. Each example in the dataset is therefore a <prompt, reference, chosen, rejected> tuple.
The data format in the JSON file must be:
{
    "prompt": "PROMPT_SENTENCE",
    "reference": "REFERENCE_SENTENCE",
    "chosen": "CHOSEN_SENTENCE",
    "rejected": "REJECTED_SENTENCE"
}
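As a small illustration of the selection rule above, the hypothetical helper below turns one prompt plus a list of scored responses into an entry in this format. The input structure is an assumption for the sake of the example; only the output keys follow the required format.

```python
import json

def build_tpo_example(prompt, scored_responses):
    """scored_responses: list of (response_text, score) pairs for a single prompt."""
    ranked = sorted(scored_responses, key=lambda pair: pair[1], reverse=True)
    return {
        "prompt": prompt,
        "reference": ranked[0][0],   # highest-scored response
        "chosen": ranked[1][0],      # second-highest-scored response
        "rejected": ranked[-1][0],   # lowest-scored response
    }

example = build_tpo_example(
    "PROMPT_SENTENCE",
    [("answer A", 9.0), ("answer B", 7.5), ("answer C", 3.0)],
)
print(json.dumps(example, indent=2))
```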
@misc{saeidi2024triple,
title={Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization},
author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Chitta Baral},
year={2024},
eprint={2405.16681},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
For more insights about various alignment methods, please check the following paper:
@misc{saeidi2024insights,
title={Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks},
author={Amir Saeidi and Shivanshu Verma and Chitta Baral},
year={2024},
eprint={2404.14723},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
- Thanks to Hugging Face for their Transformer Reinforcement Learning (TRL) library, which greatly assisted in our project.