<a href="https://colab.research.google.com/github/shing100/notebooks/blob/main/DPO_Zephyr_Unsloth_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "Runtime" and press "Run all" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a></a> Join our Discord if you need help!
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [DPO data prep](#Data), and how to [train via `DPOTrainer`](#Train).
To learn more about DPO, read TRL's [blog post](https://huggingface.co/blog/dpo-trl). We follow [Huggingface's Alignment Handbook](https://github.com/huggingface/alignment-handbook) to replicate [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).

In [10]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* DPO requires a model already trained by SFT on a similar dataset that is used for DPO. We use `HuggingFaceH4/mistral-7b-sft-beta` as the SFT model. Use this [notebook](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing) first to train a SFT model.
* [**NEW**] We make Gemma 6 trillion tokens **2.5x faster**! See our [Gemma notebook](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)

In [11]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

In [12]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-2b-it", # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.8: Fast Gemma2 patching. Transformers = 4.43.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


<a name="Data"></a>
### Data Prep
We follow Huggingface's [Alignment Handbook](https://github.com/huggingface/alignment-handbook) for [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) and use the [Ultra Feedback dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), and sample 0.5% of it to speed things up. You can sample the full dataset for a full run.

In [28]:
dataset = load_dataset("CarrotAI/Korea-WebText-Dataset")
dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'chosen', 'rejected'],
        num_rows: 1009
    })
})

We shall print a random item from the dataset

In [29]:
import pprint
row = raw_datasets["train"][8]
pprint.pprint(row["instruction"])
pprint.pprint(row["chosen"])
pprint.pprint(row["rejected"])

'ÍπÄÎèÑÌòï Ïî®Ïùò Ìä∏Îü≠Ïª§ Í≤ΩÎ†•ÏóêÏÑú Í∞ÄÏû• ÌÅ∞ ÎèÑÏ†ÑÏù¥ÎÇò ÏàúÍ∞ÑÏùÄ Î¨¥ÏóáÏù¥ÏóàÏùÑÍπåÏöî?'
(' ÍπÄÎèÑÌòï Ïî®Ïùò Ìä∏Îü≠Ïª§ Í≤ΩÎ†•ÏóêÏÑú Í∞ÄÏû• ÌÅ∞ ÎèÑÏ†ÑÏù¥ÎÇò ÏàúÍ∞ÑÏùÄ Î¨¥ÏóáÏù¥ÏóàÎäîÏßÄ ÏïåÎ†§Ï£ºÏÑ∏Ïöî. Í∑∏Îü¨Î©¥ Ï†úÍ∞Ä ÎãµÎ≥ÄÏùÑ ÎìúÎ¶¥ Ïàò ÏûàÏäµÎãàÎã§. \n'
 '\n'
 'ÏòàÎ•º Îì§Ïñ¥, ÍπÄÎèÑÌòï Ïî®Í∞Ä Ïñ¥Î†§Ïö¥ Ïö¥Ï†Ñ ÌôòÍ≤ΩÏù¥ÎÇò ÏúÑÌóòÌïú ÏÉÅÌô©Ïóê Ï≤òÌñàÎçò Í≤ΩÌóò, ÌòπÏùÄ Í∑πÏã¨Ìïú ÌîºÎ°úÏôÄ Ïã∏ÏõåÏïº ÌñàÎçò Í≤ΩÌóò Îì±ÏùÑ ÏïåÎ†§Ï£ºÏãúÎ©¥ Ï¢Ä Îçî '
 'ÏûêÏÑ∏Ìïú ÎãµÎ≥ÄÏùÑ ÎìúÎ¶¥ Ïàò ÏûàÏäµÎãàÎã§. \n'
 '\n'
 'Îçî ÏûêÏÑ∏Ìïú Ï†ïÎ≥¥Î•º ÏïåÎ†§Ï£ºÏãúÎ©¥ ÍπÄÎèÑÌòï Ïî®Ïùò Ìä∏Îü≠Ïª§ Í≤ΩÎ†•Ïóê ÎåÄÌïú ÎãµÎ≥ÄÏùÑ ÎìúÎ¶¥ Ïàò ÏûàÏäµÎãàÎã§. \n')
('ÍπÄÎèÑÌòï Ïî®Ïùò Ìä∏Îü≠Ïª§ Í≤ΩÎ†•ÏóêÏÑú Í∞ÄÏû• ÌÅ∞ ÎèÑÏ†Ñ Ï§ë ÌïòÎÇòÎäî Ï¥àÍ∏∞ Îã®Í≥ÑÏóêÏÑúÏùò Í≤ΩÏ†úÏ†Å Î∂ÄÎã¥Í≥º Í∞ÄÏ°±Í≥ºÏùò Í±∞Î¶¨Î•º Í∑πÎ≥µÌïòÎäî Í≤ÉÏù¥ÏóàÏäµÎãàÎã§. Í∑∏Îäî ÏûêÏã†Ïùò '
 'Í∞ÄÏ°±ÏùÑ Î®πÏó¨ÏÇ¥Î¶¨Í∏∞ ÏúÑÌï¥ Ìä∏Îü≠Ïª§Î°úÏÑúÏùò ÏßÅÏóÖÏùÑ ÏÑ†ÌÉùÌñàÏßÄÎßå, Ïù¥ Í∏∏ÏùÑ ÌÉùÌïòÍ∏∞ ÏúÑÌï¥ ÍµêÏú°ÎπÑÎ•º ÏßÄÎ∂àÌïòÍ≥† ÏûêÍ≤©Ï¶ùÏùÑ Ï∑®ÎìùÌï¥Ïïº ÌñàÏ

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [36]:
def generate_prompt(example):
    prompt = example['instruction']
    rejected = example['rejected']
    chosen = example['chosen']

    example['prompt'] = f"<bos><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
    example['rejected'] = f"{rejected}<end_of_turn>\n<eos>"
    example['chosen'] = f"{chosen}<end_of_turn>\n<eos>"

    return example

In [37]:
transformed_dataset = dataset.map(generate_prompt)

Map:   0%|          | 0/1009 [00:00<?, ? examples/s]

In [38]:
transformed_dataset['train'][0]

{'instruction': 'Í≥†ÌòàÏïïÏù¥ Î¨¥ÏóáÏù¥Î©∞, Ïôú ÏπòÎ£åÎ•º Í≤åÏùÑÎ¶¨ÌïòÎ©¥ Ïïà ÎêòÎäîÏßÄ ÏÑ§Î™ÖÌï¥Ï£ºÏÑ∏Ïöî.',
 'chosen': '## Í≥†ÌòàÏïïÏù¥ÎûÄ Î¨¥ÏóáÏùºÍπåÏöî?\n\nÍ≥†ÌòàÏïïÏùÄ ÌòàÏï°Ïù¥ ÌòàÍ¥Ä Î≤ΩÏóê Í∞ÄÌïòÎäî ÏïïÎ†•Ïù¥ ÏßÄÏÜçÏ†ÅÏúºÎ°ú ÎÜíÏùÄ ÏÉÅÌÉúÎ•º ÎßêÌï©ÎãàÎã§. ÎßàÏπò Î¨ºÏù¥ ÌååÏù¥ÌîÑÎ•º ÌûòÏ∞®Í≤å ÌùêÎ•¥Îäî Í≤ÉÏ≤òÎüº, ÌòàÏï°Ïù¥ ÌòàÍ¥ÄÏùÑ ÏßÄÎÇòÏπòÍ≤å Í∞ïÌïòÍ≤å ÌùêÎ•¥Î©¥ ÌòàÍ¥Ä Î≤ΩÏóê Î∂ÄÎã¥ÏùÑ Ï£ºÏñ¥ Ïó¨Îü¨ ÏßàÎ≥ëÏùÑ Ïú†Î∞úÌï† Ïàò ÏûàÏäµÎãàÎã§.\n\n**Í≥†ÌòàÏïïÏùò ÏúÑÌóòÏÑ±**\n\nÍ≥†ÌòàÏïïÏùÄ **Ï°∞Ïö©Ìïú ÏÇ¥Ïù∏Ïûê**ÎùºÍ≥† Î∂àÎ¶¥ ÎßåÌÅº ÏúÑÌóòÌïú ÏßàÎ≥ëÏûÖÎãàÎã§. Ï¥àÍ∏∞ÏóêÎäî Ï¶ùÏÉÅÏù¥ Í±∞Ïùò ÏóÜÏñ¥ ÏûêÏã†Ïù¥ Í≥†ÌòàÏïïÏù∏ÏßÄ Î™®Î•¥Îäî Í≤ΩÏö∞Í∞Ä ÎßéÏßÄÎßå, Î∞©ÏπòÌïòÎ©¥ Ïã¨Í∞ÅÌïú Ìï©Î≥ëÏ¶ùÏùÑ Ïú†Î∞úÌï† Ïàò ÏûàÏäµÎãàÎã§.\n\n**Í≥†ÌòàÏïïÏúºÎ°ú Ïù∏Ìïú Ìï©Î≥ëÏ¶ù**\n\n* **Ïã¨ÌòàÍ¥Ä ÏßàÌôò:** Ïã¨Ïû•ÎßàÎπÑ, ÎáåÏ°∏Ï§ë, Ïã¨Î∂ÄÏ†Ñ Îì± Ïã¨Í∞ÅÌïú Ïã¨ÌòàÍ¥Ä ÏßàÌôòÏùò ÏúÑÌóòÏùÑ ÎÜíÏûÖÎãàÎã§.\n* **Ïã†Ïû• ÏßàÌôò:** Ïã†Ïû• Í∏∞Îä• Ï†ÄÌïòÎ•º Ï¥àÎûòÌïòÏó¨ ÎßåÏÑ± Ïã†Î∂ÄÏ†ÑÏúºÎ°ú Ïù¥Ïñ¥Ïßà Ïàò ÏûàÏäµÎãàÎã§.\n* 

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<a name="Train"></a>
### Train the DPO model
Now let's use Huggingface TRL's `DPOTrainer`! More docs here: [TRL DPO docs](https://huggingface.co/docs/trl/dpo_trainer). We do 3 epochs on 0.5% of the dataset to speed things up.

In [None]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

In [None]:
from transformers import TrainingArguments
from trl import DPOTrainer
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = raw_datasets["train"],
    # eval_dataset = raw_datasets["test"],
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)



Map:   0%|          | 0/309 [00:00<?, ? examples/s]

In [None]:
dpo_trainer.train()

Unsloth: `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.6931,0.0,0.0,0.0,0.0,-297.338806,-218.968842,-2.758142,-2.924523
2,0.6931,0.0,0.0,0.0,0.0,-237.602417,-217.613892,-2.73179,-2.91361
3,0.6922,0.001937,-8e-06,0.625,0.001945,-172.792877,-202.709259,-2.464616,-2.728198
4,0.6927,0.000855,-0.00013,0.75,0.000985,-117.745728,-170.711212,-2.591983,-2.805891
5,0.6925,0.001272,-4.4e-05,0.5,0.001315,-197.045288,-338.722626,-2.541809,-2.483452
6,0.6937,-0.000155,0.001034,0.5,-0.001189,-279.681396,-188.683563,-2.870259,-2.633069
7,0.6926,-0.001525,-0.002655,0.625,0.00113,-157.678253,-236.215118,-2.510528,-2.838598
8,0.6925,0.003246,0.001937,0.5,0.001309,-451.77655,-493.112579,-2.677159,-2.677977
9,0.6936,-0.003702,-0.002773,0.5,-0.000929,-198.079987,-405.686523,-2.366659,-2.640016
10,0.6859,0.009754,-0.004801,0.75,0.014555,-192.067322,-238.16156,-2.735095,-2.766379


TrainOutput(global_step=114, training_loss=0.397163258179238, metrics={'train_runtime': 4017.5003, 'train_samples_per_second': 0.231, 'train_steps_per_second': 0.028, 'total_flos': 0.0, 'train_loss': 0.397163258179238, 'epoch': 2.94})

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Mistral 7b 2x faster [free Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with ü§ó HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. Gemma 6 trillion tokens is 2.5x faster! [free Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a></a> Support our work if you can! Thanks!
</div>