This notebook fine-tunes an instruct version of Mistral 7B on UltraFeedback with TRL's DPO.

More details in this article: [Fine-tune Your Own Instruct Version of Mistral 7B with Direct Preference Optimization (DPO)](https://kaitchup.substack.com/p/fine-tune-your-own-instruct-version)

First, we need all these dependencies:

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl


Import all the necessary packages.

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import DPOTrainer

Load the tokenizer and configure padding. The UNK token is reused as the pad token, and padding is applied on the left.

In [None]:
model_name = "mistralai/Mistral-7B-v0.1"
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.unk_token_id
tokenizer.padding_side = 'left'

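
To make the effect of `padding_side = 'left'` concrete, here is a minimal pure-Python sketch of how a batch is padded on each side. It does not call the tokenizer: the `pad_batch` helper, the token IDs, and the pad ID `0` are all made up for illustration.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-ID lists to the same length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if side == "left":
            # Pad tokens go in front, real tokens stay at the end
            input_ids.append([pad_id] * n_pad + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:
            # Pad tokens go after the real tokens
            input_ids.append(list(seq) + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs; 0 stands in for the pad (UNK) token ID.
batch = [[11, 12, 13], [21]]
left_ids, left_mask = pad_batch(batch, pad_id=0, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=0, side="right")

print(left_ids)   # [[11, 12, 13], [0, 0, 21]]
print(left_mask)  # [[1, 1, 1], [0, 0, 1]]
print(right_ids)  # [[11, 12, 13], [21, 0, 0]]
```

With left padding, the last position of every row is a real token, and the attention mask marks the pad positions so that they are ignored.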

Load the custom version of UltraFeedback, in which each example is a prompt with a chosen and a rejected response. The article linked above explains how I made it. You can browse some examples here: [kaitchup/UltraFeedback-prompt-chosen-rejected](https://huggingface.co/datasets/kaitchup/UltraFeedback-prompt-chosen-rejected)

The appendix at the end of this notebook contains the code I used to make this dataset.

In [None]:
dataset = load_dataset("kaitchup/UltraFeedback-prompt-chosen-rejected")


Generating train split:   0%|          | 0/17075 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/895 [00:00<?, ? examples/s]
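
Each example in this dataset has three text fields: `prompt`, `chosen`, and `rejected`. The sketch below checks a record against that layout before training. The `check_preference_record` helper and the sample record are hypothetical and not taken from the dataset.

```python
REQUIRED_COLUMNS = ("prompt", "chosen", "rejected")


def check_preference_record(record):
    """Return a list of problems found in one preference record."""
    problems = []
    for column in REQUIRED_COLUMNS:
        value = record.get(column)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{column}'")
    # A pair carries no preference signal if both responses are the same
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


# Made-up record, for illustration only
sample = {
    "prompt": "Name a primary color.",
    "chosen": "Red is a primary color.",
    "rejected": "I like turtles.",
}
print(check_preference_record(sample))  # []
```

With the real data, the same check could be run on a record such as `dataset['train'][0]`.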

Load the model that we will train with DPO and prepare it for QLoRA.

In [None]:
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
# Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id
# Gradient checkpointing is used by default but is not compatible with caching
model.config.use_cache = False



Load the reference model used for DPO. I used Mistral 7B fine-tuned with supervised fine-tuning (SFT) on UltraChat. The base model is loaded first, then the SFT adapter is loaded from the Hugging Face Hub and applied on top of it.

You can find more details on this fine-tuning in this article: [Mistral 7B: Recipes for Fine-tuning and Quantization on Your Computer](https://kaitchup.substack.com/p/mistral-7b-recipes-for-fine-tuning)

In [None]:
model_ref_name = "mistralai/Mistral-7B-v0.1"
bnb_config_ref = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
# Load the quantized base model with the reference config defined above
model_ref = AutoModelForCausalLM.from_pretrained(
    model_ref_name, quantization_config=bnb_config_ref, device_map={"": 0}
)
# Apply the SFT adapter on top of the base model
model_ref = PeftModel.from_pretrained(model_ref, "kaitchup/Mistral-7B-v0.1-SFT-ultrachat")



Define the configuration of LoRA used to train the model with DPO.

In [None]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
)
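
The LoRA settings above determine the number of trainable parameters: each adapted linear layer with input size `d_in` and output size `d_out` gets two low-rank matrices holding `r * (d_in + d_out)` weights in total. Here is a quick count, assuming the Mistral 7B dimensions listed in the code (32 layers, hidden size 4096, MLP size 14336, and 1024-dimensional key/value projections). These values are not stated in this notebook, so check them against the model's `config.json`.

```python
r = 16           # LoRA rank, as in peft_config
num_layers = 32  # assumed number of transformer layers in Mistral 7B

# Assumed (d_in, d_out) of each targeted module in one layer
module_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

# LoRA adds A (r x d_in) and B (d_out x r) to each targeted module
params_per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes.values())
trainable_params = num_layers * params_per_layer

print(f"{params_per_layer:,}")   # 1,310,720
print(f"{trainable_params:,}")   # 41,943,040
```

This matches the "Number of trainable parameters" that the trainer reports when training starts.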

For this tutorial, I trained for only 100 steps. It took about 4.7 hours (16,800 seconds) using the V100 of Google Colab Pro. You can also use the T4, but it would take around 15 hours.

In [None]:
training_arguments = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        per_device_eval_batch_size=2,
        log_level="debug",
        save_steps=10,
        logging_steps=10,
        learning_rate=5e-7,
        eval_steps=20,
        #num_train_epochs=1,
        max_steps=100,
        warmup_steps=20,
        lr_scheduler_type="linear",
)
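
To see what these arguments imply, the sketch below computes the effective batch size and the learning rate at a few steps. It assumes the usual behavior of a linear schedule with warmup (linear increase from 0 during warmup, then linear decay to 0 at `max_steps`). `linear_lr` is a hypothetical helper, not a Transformers function.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
num_devices = 1  # single GPU, as in this notebook

# Number of examples contributing to each optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 32


def linear_lr(step, base_lr=5e-7, warmup_steps=20, total_steps=100):
    """Learning rate at a given optimizer step (assumed linear warmup/decay)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 10, 20, 60, 100):
    print(step, f"{linear_lr(step):.2e}")
```

The learning rate peaks at `5e-7` at step 20, right when warmup ends, and reaches 0 at step 100.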

Start training:

In [None]:
trainer = DPOTrainer(
    model,
    model_ref,
    args=training_arguments,
    beta=0.1,
    peft_config=peft_config,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    tokenizer=tokenizer,
)

trainer.train()

max_steps is given, it will override any value given in num_train_epochs
Currently training with a batch size of: 2
***** Running training *****
  Num examples = 17,075
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 16
  Total optimization steps = 100
  Number of trainable parameters = 41,943,040
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.7961 | 0.756272 | -0.204051 | -0.26559 | 0.522321 | 0.061539 | -171.314545 | -190.165543 | -2.950176 | -2.906446 |
| 40 | 0.7871 | 0.750689 | -0.189419 | -0.259613 | 0.52567 | 0.070194 | -171.254776 | -190.019196 | -2.950228 | -2.906829 |
| 60 | 0.7925 | 0.746549 | -0.177696 | -0.254009 | 0.527902 | 0.076313 | -171.198746 | -189.901993 | -2.950177 | -2.907035 |
| 80 | 0.7533 | 0.744203 | -0.17157 | -0.251619 | 0.527902 | 0.080048 | -171.17485 | -189.840729 | -2.950137 | -2.907137 |
| 100 | 0.789 | 0.743437 | -0.169586 | -0.25086 | 0.527902 | 0.081274 | -171.167236 | -189.820892 | -2.950131 | -2.907176 |
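
The reward columns reported above follow from how DPO scores a response: the implicit reward is `beta` times the difference between the log-probability under the trained model and under the reference model, and the loss is the negative log-sigmoid of the chosen-minus-rejected reward margin. Below is a minimal per-example sketch of the standard sigmoid DPO loss. `dpo_loss` is a hypothetical helper, not TRL's implementation, and the log-probabilities in the example call are made up.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss and implicit rewards for one preference pair."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    loss = math.log1p(math.exp(-margin))
    return loss, chosen_reward, rejected_reward, margin


# Made-up sequence log-probabilities, for illustration only
loss, r_chosen, r_rejected, margin = dpo_loss(-190.0, -171.0, -188.0, -168.5)
print(round(r_chosen, 3), round(r_rejected, 3), round(margin, 3))  # -0.2 -0.25 0.05

# When the policy equals the reference, the margin is 0 and the loss is log(2)
loss_at_start, _, _, _ = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# (step, rewards/chosen, rewards/rejected, rewards/margins) from the table above
eval_rows = [
    (20, -0.204051, -0.26559, 0.061539),
    (40, -0.189419, -0.259613, 0.070194),
    (60, -0.177696, -0.254009, 0.076313),
    (80, -0.17157, -0.251619, 0.080048),
    (100, -0.169586, -0.25086, 0.081274),
]
margins_match = all(
    abs((chosen - rejected) - reported) < 1e-5
    for _, chosen, rejected, reported in eval_rows
)
print(margins_match)  # True
```

The last check confirms that `Rewards/margins` is simply `Rewards/chosen` minus `Rewards/rejected`. The slowly growing margin means the model is gradually ranking chosen responses above rejected ones relative to the reference.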


Saving model checkpoint to ./results/checkpoint-10
tokenizer config file saved in ./results/checkpoint-10/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 895
  Batch size = 2
Saving model checkpoint to ./results/checkpoint-20
tokenizer config file saved in ./results/checkpoint-20/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-20/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-30
tokenizer config file saved in ./results/checkpoint-30/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-30/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 895
  Batch size = 2
Saving model checkpoint to ./results/checkpoint-40
tokenizer config file saved in ./results/checkpoint-40/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-40/special_tokens_map.json
Saving model checkpoint to ./results/check

TrainOutput(global_step=100, training_loss=0.7826796817779541, metrics={'train_runtime': 16800.0722, 'train_samples_per_second': 0.19, 'train_steps_per_second': 0.006, 'total_flos': 0.0, 'train_loss': 0.7826796817779541, 'epoch': 0.19})
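
The metrics in this output can be cross-checked with a little arithmetic. The sketch below takes the numbers from the training log above (100 steps, total batch size 32, 17,075 training examples, about 16,800 seconds) and recomputes the throughput and the fraction of an epoch covered. The epoch estimate is a simplification that divides the examples seen by the dataset size.

```python
train_runtime_s = 16800.0722   # 'train_runtime' from TrainOutput
max_steps = 100
total_batch_size = 32          # 2 per device x 16 accumulation steps
num_train_examples = 17_075

examples_seen = max_steps * total_batch_size
samples_per_second = examples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s
epoch_fraction = examples_seen / num_train_examples
hours = train_runtime_s / 3600

print(examples_seen)                 # 3200
print(round(samples_per_second, 2))  # 0.19
print(round(steps_per_second, 3))    # 0.006
print(round(epoch_fraction, 2))      # 0.19
print(round(hours, 2))               # 4.67
```

At this pace of about 168 seconds per step, a full epoch over the training split would take roughly 25 hours on the same GPU.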

# Appendix
Format UltraFeedback for TRL's DPO.

In [None]:
from datasets import load_dataset

ultrafb = load_dataset('openbmb/UltraFeedback', split='train')

# Hold out 5% of the examples as a test split
ultrafb = ultrafb.train_test_split(test_size=0.05)

format_ultrafb = dict()
for split in ultrafb:
    format_ultrafb[split] = []
    for example in ultrafb[split]:
        prompt = example['instruction']
        chosen = ""
        rejected = ""
        # Chosen = a response rated 5 for instruction following, rejected = one rated 1.
        # If several completions share a rating, the last one wins.
        for completion in example['completions']:
            rating = completion['annotations']['instruction_following']['Rating']
            if rating == '5':
                chosen = completion['response']
            elif rating == '1':
                rejected = completion['response']
        # Keep only prompts that have both a chosen and a rejected response
        if chosen != '' and rejected != '':
            format_ultrafb[split].append({'prompt': prompt, 'chosen': chosen, 'rejected': rejected})
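
To show what this loop produces, here is the same selection rule applied to a single made-up record. The `to_preference_pair` helper and the record contents are hypothetical; only the field names (`instruction`, `completions`, `annotations`, `instruction_following`, `Rating`, `response`) come from the code above.

```python
def to_preference_pair(record):
    """Return a prompt/chosen/rejected dict, or None if no pair can be built."""
    chosen, rejected = "", ""
    for completion in record["completions"]:
        rating = completion["annotations"]["instruction_following"]["Rating"]
        if rating == "5":
            chosen = completion["response"]
        elif rating == "1":
            rejected = completion["response"]
    if chosen and rejected:
        return {"prompt": record["instruction"], "chosen": chosen, "rejected": rejected}
    return None


def make_completion(response, rating):
    """Build a completion in the nested layout used by the loop above."""
    return {
        "response": response,
        "annotations": {"instruction_following": {"Rating": rating}},
    }


# Made-up record, for illustration only
record = {
    "instruction": "List two fruits.",
    "completions": [
        make_completion("Apples and pears.", "5"),
        make_completion("Fruits are tasty.", "3"),
        make_completion("The sky is blue.", "1"),
    ],
}
pair = to_preference_pair(record)
print(pair)
# {'prompt': 'List two fruits.', 'chosen': 'Apples and pears.', 'rejected': 'The sky is blue.'}

# A record without a response rated 1 is skipped
no_pair = to_preference_pair({"instruction": "Hi", "completions": [make_completion("Hello!", "5")]})
print(no_pair)  # None
```

Prompts without both a response rated 5 and a response rated 1 are dropped, which is why the formatted dataset is smaller than the original UltraFeedback.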
