### DPO Fine-Tuning - Zephyr

> **A100 GPU**

- Time: 29 mins
- GPU RAM: 11.4 GB
- Accuracy: 74.19%

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Pick the Xformers build that matches the installed Torch version (Torch < 2.4 needs xformers 0.0.27)
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* DPO requires a model that has already been trained with SFT on a dataset similar to the one used for DPO. We use `HuggingFaceH4/mistral-7b-sft-beta` as the SFT model. Use this [notebook](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing) first to train an SFT model.
* [**NEW**] We make Gemma (trained on 6 trillion tokens) **2.5x faster**! See our [Gemma notebook](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit", # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

In [3]:
#@title Alignment Handbook utils
import os
import re
from typing import List, Literal, Optional

from datasets import DatasetDict, concatenate_datasets, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError


DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"


def apply_chat_template(
    example, tokenizer, task: Literal["sft", "generation", "rm", "dpo"] = "sft", assistant_prefix="<|assistant|>\n"
):
    def _strip_prefix(s, pattern):
        # Use re.escape to escape any special characters in the pattern
        return re.sub(f"^{re.escape(pattern)}", "", s)

    if task in ["sft", "generation"]:
        messages = example["messages"]
        # We add an empty system message if there is none
        if messages[0]["role"] != "system":
            messages.insert(0, {"role": "system", "content": ""})
        example["text"] = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True if task == "generation" else False
        )
    elif task == "rm":
        if all(k in example.keys() for k in ("chosen", "rejected")):
            chosen_messages = example["chosen"]
            rejected_messages = example["rejected"]
            # We add an empty system message if there is none
            if chosen_messages[0]["role"] != "system":
                chosen_messages.insert(0, {"role": "system", "content": ""})
            if rejected_messages[0]["role"] != "system":
                rejected_messages.insert(0, {"role": "system", "content": ""})
            example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
        else:
            raise ValueError(
                f"Could not format example as dialogue for `rm` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
            )
    elif task == "dpo":
        if all(k in example.keys() for k in ("chosen", "rejected")):
            # Compared to reward modeling, we filter out the prompt, so the text is everything after the last assistant token
            prompt_messages = [[msg for msg in example["chosen"] if msg["role"] == "user"][0]]
            # Insert system message
            if example["chosen"][0]["role"] != "system":
                prompt_messages.insert(0, {"role": "system", "content": ""})
            else:
                prompt_messages.insert(0, example["chosen"][0])
            # TODO: handle case where chosen/rejected also have system messages
            chosen_messages = example["chosen"][1:]
            rejected_messages = example["rejected"][1:]
            example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
            example["text_prompt"] = tokenizer.apply_chat_template(
                prompt_messages, tokenize=False, add_generation_prompt=True
            )
            example["text_chosen"] = _strip_prefix(example["text_chosen"], assistant_prefix)
            example["text_rejected"] = _strip_prefix(example["text_rejected"], assistant_prefix)
        else:
            raise ValueError(
                f"Could not format example as dialogue for `dpo` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
            )
    else:
        raise ValueError(
            f"Task {task} not supported, please ensure that the provided task is one of {['sft', 'generation', 'rm', 'dpo']}"
        )
    return example


def get_datasets(
    data_config: dict,
    splits: List[str] = ["train", "test"],
    shuffle: bool = True,
) -> DatasetDict:
    """
    Loads one or more datasets with varying training set proportions.

    Args:
        data_config (`DataArguments` or `dict`):
            Dataset configuration and split proportions.
        splits (`List[str]`, *optional*, defaults to `['train', 'test']`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.

    Returns:
        [`DatasetDict`]: The dataset dictionary containing the loaded datasets.
    """

    if type(data_config) is dict:
        # Structure of the input is:
        #     dataset_mixer = {
        #             "dataset1": 0.5,
        #             "dataset2": 0.3,
        #             "dataset3": 0.2,
        #         }
        dataset_mixer = data_config
    else:
        raise ValueError(f"Data config {data_config} not recognized.")

    raw_datasets = mix_datasets(dataset_mixer, splits=splits, shuffle=shuffle)
    return raw_datasets


def mix_datasets(dataset_mixer: dict, splits: Optional[List[str]] = None, shuffle=True) -> DatasetDict:
    """
    Loads and mixes datasets according to proportions specified in `dataset_mixer`.

    Args:
        dataset_mixer (`dict`):
            Dictionary containing the dataset names and their training proportions. By default, all test proportions are 1.
        splits (Optional[List[str]], *optional*, defaults to `None`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.
    """
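    # Default to the standard splits; iterating over `None` below would raise a TypeError
    splits = ["train", "test"] if splits is None else splits
    raw_datasets = DatasetDict()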
    raw_train_datasets = []
    raw_val_datasets = []
    fracs = []
    for ds, frac in dataset_mixer.items():
        fracs.append(frac)
        for split in splits:
            try:
                # Try first if dataset on a Hub repo
                dataset = load_dataset(ds, split=split)
            except DatasetGenerationError:
                # If not, check local dataset
                dataset = load_from_disk(os.path.join(ds, split))

            if "train" in split:
                raw_train_datasets.append(dataset)
            elif "test" in split:
                raw_val_datasets.append(dataset)
            else:
                raise ValueError(f"Split type {split} not recognized as one of test or train.")

    if any(frac < 0 for frac in fracs):
        raise ValueError("Dataset fractions cannot be negative.")

    if len(raw_train_datasets) > 0:
        train_subsets = []
        for dataset, frac in zip(raw_train_datasets, fracs):
            train_subset = dataset.select(range(int(frac * len(dataset))))
            train_subsets.append(train_subset)
        if shuffle:
            raw_datasets["train"] = concatenate_datasets(train_subsets).shuffle(seed=42)
        else:
            raw_datasets["train"] = concatenate_datasets(train_subsets)
    # No subsampling for test datasets to enable fair comparison across models
    if len(raw_val_datasets) > 0:
        if shuffle:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets).shuffle(seed=42)
        else:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets)

    if len(raw_datasets) == 0:
        raise ValueError(
            f"Dataset {dataset_mixer} not recognized with split {split}. Check the dataset has been correctly formatted."
        )

    return raw_datasets
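
`mix_datasets` keeps the first `int(frac * len(dataset))` rows of each training set, concatenates the subsets, and shuffles them with seed 42. The following is a minimal pure-Python sketch of that subsampling step, with plain lists standing in for `datasets.Dataset` objects. The helper name `mix_lists` and the toy data are hypothetical and only illustrate the arithmetic.

```python
import random


def mix_lists(datasets, fracs, shuffle=True, seed=42):
    """Keep the leading `frac` share of each list, then concatenate (and shuffle)."""
    if any(frac < 0 for frac in fracs):
        raise ValueError("Dataset fractions cannot be negative.")
    # Same slicing rule as dataset.select(range(int(frac * len(dataset))))
    subsets = [ds[: int(frac * len(ds))] for ds, frac in zip(datasets, fracs)]
    mixed = [row for subset in subsets for row in subset]
    if shuffle:
        random.Random(seed).shuffle(mixed)
    return mixed


# Toy stand-ins for two training sets
ds_a = [f"a{i}" for i in range(100)]
ds_b = [f"b{i}" for i in range(40)]

mixed = mix_lists([ds_a, ds_b], fracs=[0.5, 0.25])
print(len(mixed))  # 50 rows from ds_a + 10 rows from ds_b = 60
```

Because the slice always takes the leading rows, the subset is the same on every run; only the final order depends on the shuffle seed.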

<a name="Data"></a>
### Data Prep
We follow Huggingface's [Alignment Handbook](https://github.com/huggingface/alignment-handbook) recipe for [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) for chat formatting. Instead of the [Ultra Feedback dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) used there, the cells below load `techandy42/debugger_llm_humaneval_dataset_v1` (30 train and 6 test rows) and turn each row's scored analyses into SFT examples and DPO preference pairs.

In [4]:
!pip install datasets



In [5]:
from datasets import load_dataset
import pandas as pd
from datasets import Dataset, DatasetDict

dataset = load_dataset('techandy42/debugger_llm_humaneval_dataset_v1')
df_train = pd.DataFrame(dataset['train'])
df_test = pd.DataFrame(dataset['test'])
df_parts = [df_train, df_test]

df_sft_train = []
df_sft_test = []
df_sft_parts = [df_sft_train, df_sft_test]

df_dpo_train = []
df_dpo_test = []
df_dpo_parts = [df_dpo_train, df_dpo_test]

PREFIXS = ['score_s1_', 'score_s2_', 'score_s3_', 'score_s4_', 'score_s5_', 'score_s6_']
ROUNDS = ['rd1', 'rd2', 'rd3', 'custom']
PAIRS = [('rd1', 'rd2'), ('rd1', 'rd3'), ('rd1', 'custom'), ('rd2', 'rd3'), ('rd2', 'custom'), ('rd3', 'custom')]

def indent_lines(string: str) -> str:
  indented_string = '\n'.join('    ' + line for line in string.splitlines())
  return indented_string

for df, df_sft, df_dpo in zip(df_parts, df_sft_parts, df_dpo_parts):
  for idx, row in df.iterrows():
      prompt = row['prompt']
      result = row['result']
      instruction = f"""<instruction>
  <bullets>
    <bullet>The following buggy code is a wrong implementation that contains one or more bugs.</bullet>
    <bullet>Firstly, find all of the bugs within the buggy code. Make sure to quotate each part of the buggy code that contains a bug.</bullet>
    <bullet>Afterwards, for each of the bugs, describe the issue with each part of the buggy code with the bug, and outline how to fix the issue.</bullet>
    <bullet>Make sure your answer covers (1) all of the existing bugs, (2) do not hallucinate non-existing bugs, and (3) be concise as possible.</bullet>
    <bullet>IMPORTANT!: While abiding by the above instructions, keep your answer as brief as possible.</bullet>
  </bullets>
</instruction>"""
      full_solution = "<buggy_code>\n" + (prompt + indent_lines(result)).strip('\n') + "\n</buggy_code>"
      full_instruction = instruction + "\n" + full_solution
      solutions_info = {}
      for ROUND in ROUNDS:
        solutions_info[ROUND] = {}
        total_score = 0
        for PREFIX in PREFIXS:
          score_col = PREFIX + ROUND
          score = int(row[score_col][0])
          total_score += score
        total_score /= 42
        analysis_col = 'analysis_' + ROUND
        solutions_info[ROUND]['analysis'] = row[analysis_col]
        solutions_info[ROUND]['score'] = total_score
      for ROUND1, ROUND2 in PAIRS:
        round1_score = solutions_info[ROUND1]['score']
        round2_score = solutions_info[ROUND2]['score']
        round1_analysis = solutions_info[ROUND1]['analysis']
        round2_analysis = solutions_info[ROUND2]['analysis']
        if round1_score == round2_score:
          continue
        messages_info = {}
        messages_info['messages'] = [
            {'content': full_instruction, 'role': 'user'},
            {'content': round1_analysis if round1_score > round2_score else round2_analysis, 'role': 'assistant'}
        ]
        pairwise_info = {}
        pairwise_info['prompt'] = full_instruction
        pairwise_info['chosen'] = [
            {'content': full_instruction, 'role': 'user'},
            {'content': round1_analysis if round1_score > round2_score else round2_analysis, 'role': 'assistant'}
        ]
        pairwise_info['rejected'] = [
            {'content': full_instruction, 'role': 'user'},
            {'content': round1_analysis if round1_score < round2_score else round2_analysis, 'role': 'assistant'}
        ]
        df_sft.append(messages_info)
        df_dpo.append(pairwise_info)

df_sft_train = pd.DataFrame(df_sft_train)
df_sft_test = pd.DataFrame(df_sft_test)
dataset_sft_train = Dataset.from_pandas(df_sft_train)
dataset_sft_test = Dataset.from_pandas(df_sft_test)
datasets_sft = DatasetDict({
    'train': dataset_sft_train,
    'test': dataset_sft_test
})
df_dpo_train = pd.DataFrame(df_dpo_train)
df_dpo_test = pd.DataFrame(df_dpo_test)
dataset_dpo_train = Dataset.from_pandas(df_dpo_train)
dataset_dpo_test = Dataset.from_pandas(df_dpo_test)
datasets_dpo = DatasetDict({
    'train': dataset_dpo_train,
    'test': dataset_dpo_test
})

Generating train split:   0%|          | 0/30 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6 [00:00<?, ? examples/s]
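
The pairing logic in the cell above can be sketched in isolation. Each row has four scored analyses (`rd1`, `rd2`, `rd3`, `custom`), which gives 6 unordered pairs; a pair is kept only when the two scores differ, and the higher-scored analysis becomes `chosen`. The scores below are hypothetical and `build_pairs` is an illustrative helper, not part of the notebook.

```python
from itertools import combinations

ROUNDS = ["rd1", "rd2", "rd3", "custom"]


def build_pairs(scores):
    """Return (chosen, rejected) round names for every pair with unequal scores."""
    pairs = []
    for first, second in combinations(ROUNDS, 2):
        if scores[first] == scores[second]:
            continue  # a tie carries no preference signal, so it is skipped
        if scores[first] > scores[second]:
            pairs.append((first, second))
        else:
            pairs.append((second, first))
    return pairs


# Hypothetical normalised scores for one dataset row
example_scores = {"rd1": 0.50, "rd2": 0.75, "rd3": 0.75, "custom": 1.00}
pairs = build_pairs(example_scores)
print(pairs)  # 5 pairs: the rd2/rd3 tie is dropped
```

This also accounts for the dataset sizes in this run: 30 train rows give at most 30 × 6 = 180 pairs and 165 remain, so 15 pairs were ties; 6 test rows give at most 36 pairs and 31 remain.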

In [6]:
datasets_sft

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 165
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 31
    })
})

In [7]:
datasets_dpo

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 165
    })
    test: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 31
    })
})

In [8]:
column_names = list(datasets_sft['train'].features)

sft_datasets = datasets_sft.map(
    apply_chat_template,
    fn_kwargs = {"tokenizer": tokenizer, "task": "sft"},
    num_proc = 12,
    remove_columns = column_names,
    desc = "Formatting comparisons with prompt template",
)

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/165 [00:00<?, ? examples/s]

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/31 [00:00<?, ? examples/s]

In [9]:
sft_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 165
    })
    test: Dataset({
        features: ['text'],
        num_rows: 31
    })
})

In [10]:
print(sft_datasets['train'][0]['text'])

<|system|>
</s>
<|user|>
<instruction>
  <bullets>
    <bullet>The following buggy code is a wrong implementation that contains one or more bugs.</bullet>
    <bullet>Firstly, find all of the bugs within the buggy code. Make sure to quotate each part of the buggy code that contains a bug.</bullet>
    <bullet>Afterwards, for each of the bugs, describe the issue with each part of the buggy code with the bug, and outline how to fix the issue.</bullet>
    <bullet>Make sure your answer covers (1) all of the existing bugs, (2) do not hallucinate non-existing bugs, and (3) be concise as possible.</bullet>
    <bullet>IMPORTANT!: While abiding by the above instructions, keep your answer as brief as possible.</bullet>
  </bullets>
</instruction>
<buggy_code>
from typing import List


def parse_nested_parens(paren_string: str) -> List[int]:
    """ Input to this function is a string represented multiple groups for nested parentheses separated by spaces.
    For each of the group, output the de

In [11]:
column_names = list(datasets_dpo['train'].features)

dpo_datasets = datasets_dpo.map(
    apply_chat_template,
    fn_kwargs = {"tokenizer": tokenizer, "task": "dpo"},
    num_proc = 12,
    remove_columns = column_names,
    desc = "Formatting comparisons with prompt template",
)

dpo_datasets = dpo_datasets.rename_columns(
    {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"}
)

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/165 [00:00<?, ? examples/s]

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/31 [00:00<?, ? examples/s]

In [12]:
dpo_datasets

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'prompt'],
        num_rows: 165
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'prompt'],
        num_rows: 31
    })
})

In [13]:
print("=" * 10 + "PROMPT" + "=" * 10)
print(dpo_datasets['train'][0]['prompt'])
print("=" * 10 + "CHOSEN" + "=" * 10)
print(dpo_datasets['train'][0]['chosen'])
print("=" * 10 + "REJECTED" + "=" * 10)
print(dpo_datasets['train'][0]['rejected'])

<|system|>
</s>
<|user|>
<instruction>
  <bullets>
    <bullet>The following buggy code is a wrong implementation that contains one or more bugs.</bullet>
    <bullet>Firstly, find all of the bugs within the buggy code. Make sure to quotate each part of the buggy code that contains a bug.</bullet>
    <bullet>Afterwards, for each of the bugs, describe the issue with each part of the buggy code with the bug, and outline how to fix the issue.</bullet>
    <bullet>Make sure your answer covers (1) all of the existing bugs, (2) do not hallucinate non-existing bugs, and (3) be concise as possible.</bullet>
    <bullet>IMPORTANT!: While abiding by the above instructions, keep your answer as brief as possible.</bullet>
  </bullets>
</instruction>
<buggy_code>
from typing import List


def parse_nested_parens(paren_string: str) -> List[int]:
    """ Input to this function is a string represented multiple groups for nested parentheses separated by spaces.
    For each of the group, output the de
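
To make the `dpo` branch of `apply_chat_template` easier to follow, here is a pure-Python approximation of how one preference pair is split into `prompt`, `chosen` and `rejected` strings. The `render` helper is a hypothetical stand-in for `tokenizer.apply_chat_template` that mimics the Zephyr layout seen in the printed output; the real tokenizer template may differ in whitespace details, and the example assumes the conversation has no system message.

```python
import re

EOS = "</s>"  # end-of-sequence token, as it appears in the printed examples


def render(messages, add_generation_prompt=False):
    """Hypothetical stand-in for tokenizer.apply_chat_template(..., tokenize=False)."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}{EOS}\n" for m in messages)
    if add_generation_prompt:
        text += "<|assistant|>\n"
    return text


def split_for_dpo(chosen, rejected, assistant_prefix="<|assistant|>\n"):
    """Mirror the `dpo` task: templated prompt plus prefix-stripped completions."""
    def strip_prefix(s):
        return re.sub(f"^{re.escape(assistant_prefix)}", "", s)

    # Prompt = empty system message + the first user turn, ending with the assistant tag
    prompt_messages = [{"role": "system", "content": ""}, chosen[0]]
    return {
        "prompt": render(prompt_messages, add_generation_prompt=True),
        "chosen": strip_prefix(render(chosen[1:])),
        "rejected": strip_prefix(render(rejected[1:])),
    }


chosen = [{"role": "user", "content": "Find the bug."},
          {"role": "assistant", "content": "Off-by-one in the loop."}]
rejected = [{"role": "user", "content": "Find the bug."},
            {"role": "assistant", "content": "No bugs found."}]

out = split_for_dpo(chosen, rejected)
print(repr(out["prompt"]))
print(repr(out["chosen"]))
print(repr(out["rejected"]))
```

The prompt ends with the assistant tag, so the `chosen` and `rejected` strings contain only the completion text that the model is asked to prefer or avoid.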

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [14]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
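
As a sanity check on the adapter size, the trainable parameter count reported during training (167,772,160) can be reproduced by hand. The sketch below assumes the standard Mistral-7B layer shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers); these shapes are not printed in this notebook, so treat them as an assumption.

```python
# Assumed Mistral-7B projection shapes as (in_features, out_features)
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
R = 64           # LoRA rank used in this notebook
NUM_LAYERS = 32  # number of patched decoder layers


def lora_params(in_features, out_features, r):
    # LoRA adds two matrices per module: A (r x in) and B (out x r)
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
# 5,242,880 per layer, 167,772,160 in total
```

For a model of roughly 7 billion parameters this is a little over 2% of the weights, which sits inside the 1 to 10% range mentioned above.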


### Train the SFT model

In [15]:
# Note: running eval is not necessary for this stage
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

sft_trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = sft_datasets['train'],
    # eval_dataset = sft_datasets['test'], # Uncomment to run eval
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        # evaluation_strategy = "steps", # Uncomment to run eval
        # eval_steps = 1, # Uncomment to run eval
    ),
)

Map (num_proc=2):   0%|          | 0/165 [00:00<?, ? examples/s]

In [16]:
sft_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss
1,1.4947
2,1.4678
3,1.2906
4,1.0715
5,0.8044
6,0.6117
7,0.5358
8,0.4744
9,0.4522
10,0.3593


TrainOutput(global_step=20, training_loss=0.5639057993888855, metrics={'train_runtime': 56.4959, 'train_samples_per_second': 2.921, 'train_steps_per_second': 0.354, 'total_flos': 5014170208616448.0, 'train_loss': 0.5639057993888855, 'epoch': 0.963855421686747})
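
The step count and the fractional epoch in the output above follow from the batch settings. A small worked calculation, assuming the trainer drops the leftover micro-batches that do not fill a whole accumulation window (this matches the numbers in this run, though rounding may differ between library versions):

```python
import math

num_examples = 165
per_device_train_batch_size = 2
gradient_accumulation_steps = 4

# 165 examples in batches of 2 -> 83 micro-batches (the last one holds 1 example)
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
# Every 4 micro-batches produce one optimizer step; the remaining 3 are not used
optimizer_steps = batches_per_epoch // gradient_accumulation_steps
# Share of the epoch actually consumed by those steps
epoch_fraction = optimizer_steps * gradient_accumulation_steps / batches_per_epoch

print(batches_per_epoch, optimizer_steps, round(epoch_fraction, 6))  # 83 20 0.963855
```

This is why `global_step` is 20 and `epoch` is reported as 0.9638... rather than 1.0.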

<a name="Train"></a>
### Train the DPO model
Now let's use Huggingface TRL's `DPOTrainer`! More docs here: [TRL DPO docs](https://huggingface.co/docs/trl/dpo_trainer). Each call to `dpo_trainer.train()` runs 1 epoch over the 165 training pairs; we repeat it for up to 10 iterations and measure test accuracy after each one.

In [17]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

In [18]:
# Note: there is an issue running eval during training with trl's DPOTrainer & DPOConfig
from trl import DPOTrainer, DPOConfig
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 1,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = dpo_datasets['train'],
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)

Tokenizing train dataset:   0%|          | 0/165 [00:00<?, ? examples/s]
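
For orientation, the quantity being minimised can be written out for a single preference pair. This is a minimal sketch of the sigmoid DPO loss from the DPO paper (to my knowledge also TRL's default loss type), using plain floats instead of tensors. The function name and the example log-probabilities are illustrative assumptions, not values from this run.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, reward_chosen, reward_rejected, margin) for one pair."""
    # Implicit rewards: how much more likely the policy makes a response than the reference
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable form
    loss = math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
    return loss, reward_chosen, reward_rejected, margin


# Policy identical to the reference: margin 0, loss log(2)
print(dpo_sigmoid_loss(-60.0, -80.0, -60.0, -80.0))

# Policy raises the chosen response and lowers the rejected one: positive margin, lower loss
print(dpo_sigmoid_loss(-50.0, -100.0, -60.0, -80.0))
```

The `rewards / chosen`, `rewards / rejected` and `rewards / margins` columns in the training logs correspond to batch averages of these three quantities, and `rewards / accuracies` is the share of pairs with a positive margin.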

In [19]:
from tqdm import tqdm
import torch

def run_eval(model, tokenizer, no_iter):
  NUM_ITEMS = len(dpo_datasets['test'])
  num_chosen = 0

  for _ in range(no_iter):
    for i in tqdm(range(NUM_ITEMS)):
      # Raw (untemplated) chosen/rejected message lists for this test pair
      example = {
          "chosen": datasets_dpo['test'][i]["chosen"],
          "rejected": datasets_dpo['test'][i]["rejected"]
      }

      # Apply the chat template to format the example
      formatted_input = apply_chat_template(example, tokenizer, task="dpo")

      # Tokenize the inputs
      inputs_chosen = tokenizer(formatted_input["text_chosen"], return_tensors="pt", padding=True, truncation=True)
      inputs_rejected = tokenizer(formatted_input["text_rejected"], return_tensors="pt", padding=True, truncation=True)

      # Score each response by the mean of the model's output logits (a heuristic proxy, not the DPO reward)
      with torch.no_grad():
          reward_chosen = model(**inputs_chosen).logits.mean().item()
          reward_rejected = model(**inputs_rejected).logits.mean().item()
          if reward_chosen > reward_rejected:
              num_chosen += 1

  return num_chosen / (no_iter * NUM_ITEMS)
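
With only 31 test pairs, the accuracy returned by `run_eval` can only move in steps of 1/31 (about 3.2 percentage points). The short check below assumes every one of the `no_iter` passes gives the same result, which should hold when the forward pass is deterministic; under that assumption the accuracies reported in this run correspond to 21, 22 and 23 correctly ranked pairs.

```python
NUM_ITEMS = 31  # size of the test split in this run
NO_ITER = 5     # number of repeated passes used in run_eval


def accuracy(num_correct_per_pass):
    # Same formula as run_eval: wins summed over all passes / total comparisons
    return (num_correct_per_pass * NO_ITER) / (NO_ITER * NUM_ITEMS)


for correct in (21, 22, 23):
    print(f"{correct}/{NUM_ITEMS} -> {accuracy(correct) * 100:.2f}")
# 21/31 -> 67.74
# 22/31 -> 70.97
# 23/31 -> 74.19
```

A difference of one or two pairs between iterations is therefore within what a test set of this size can resolve, so small accuracy changes should be read with care.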

In [20]:
best_iteration = 1
best_eval_result = 0

for i in range(1, 11):
    # Train the model
    training_result = dpo_trainer.train()
    eval_result = run_eval(model, tokenizer, 5)
    # `>=` means that on a tie the most recent iteration is recorded as best
    if eval_result >= best_eval_result:
        best_eval_result = eval_result
        best_iteration = i

    # Create a unique checkpoint directory for each iteration
    checkpoint_dir = f"checkpoint_iteration_{i}"
    os.makedirs(checkpoint_dir, exist_ok=True)

    # Save the model and the trainer state for this iteration
    dpo_trainer.save_model(checkpoint_dir)  # Save model and tokenizer
    dpo_trainer.save_state()  # Write trainer_state.json to args.output_dir ("outputs"); optimizer and scheduler are not saved by this call

    print(f"\nEPOCH NO.{i}")
    print(f"TRAINING RESULT: {training_result}")
    print(f"TEST ACCURACY: {eval_result * 100:.2f}\n")

print(f"BEST ITERATION: {best_iteration}")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.2619,16.942621,12.107467,0.875,4.835154,-108.939117,-104.661591,-2.449407,-2.467142
2,0.7771,14.912493,11.333905,0.75,3.578588,-81.166336,-52.434906,-2.326855,-2.274108
3,0.3039,16.010368,9.309386,0.875,6.700981,-109.945908,-63.962654,-2.357495,-2.382477
4,0.7947,13.795393,11.395256,0.75,2.400137,-75.61544,-65.92321,-2.293685,-2.275281
5,0.1713,17.271122,11.23784,0.875,6.033282,-119.854027,-103.050568,-2.480595,-2.441104
6,0.6662,14.851988,11.532567,0.625,3.319421,-94.244133,-73.491653,-2.341175,-2.302094
7,0.6473,12.960015,11.516346,0.75,1.443669,-79.30748,-40.650066,-2.292185,-2.316608
8,1.6368,12.439628,12.984877,0.25,-0.545248,-89.627625,-77.280945,-2.315696,-2.330642
9,0.3559,13.34855,8.983703,0.75,4.364847,-106.637024,-76.452469,-2.3801,-2.360526
10,0.3057,15.818031,10.860895,0.75,4.957136,-70.203392,-65.85038,-2.271419,-2.306396


100%|██████████| 31/31 [00:07<00:00,  4.41it/s]
100%|██████████| 31/31 [00:06<00:00,  4.45it/s]
100%|██████████| 31/31 [00:06<00:00,  4.46it/s]
100%|██████████| 31/31 [00:07<00:00,  4.39it/s]
100%|██████████| 31/31 [00:06<00:00,  4.50it/s]



EPOCH NO.1
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.48278161641210315, metrics={'train_runtime': 101.3828, 'train_samples_per_second': 1.627, 'train_steps_per_second': 0.197, 'total_flos': 0.0, 'train_loss': 0.48278161641210315, 'epoch': 0.963855421686747})
TEST ACCURACY: 67.74



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0928,16.631414,10.948038,1.0,5.683375,-120.533401,-107.773674,-2.417131,-2.431341
2,0.127,14.918297,9.762361,1.0,5.155936,-96.881783,-52.376873,-2.286054,-2.236933
3,0.0472,16.076952,8.344521,1.0,7.732431,-119.594566,-63.296814,-2.322732,-2.340117
4,0.1138,14.004068,9.965775,1.0,4.038292,-89.91024,-63.836456,-2.243995,-2.227006
5,0.0267,17.11735,10.399025,1.0,6.718324,-128.242188,-104.588287,-2.444749,-2.405279
6,0.549,14.023465,10.104827,0.875,3.918638,-108.52153,-81.776886,-2.285306,-2.241909
7,0.3313,12.84404,10.179047,0.875,2.664993,-92.680466,-41.809826,-2.228684,-2.242725
8,0.7372,12.545463,11.169505,0.625,1.375957,-107.781342,-76.222603,-2.249336,-2.267082
9,0.0322,13.12232,6.996924,1.0,6.125397,-126.504822,-78.71476,-2.321704,-2.299777
10,0.1514,15.10621,9.256987,0.875,5.849224,-86.242485,-72.968597,-2.200549,-2.225765


100%|██████████| 31/31 [00:07<00:00,  4.41it/s]
100%|██████████| 31/31 [00:07<00:00,  4.40it/s]
100%|██████████| 31/31 [00:06<00:00,  4.45it/s]
100%|██████████| 31/31 [00:06<00:00,  4.45it/s]
100%|██████████| 31/31 [00:07<00:00,  4.41it/s]



EPOCH NO.2
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.19922059918753804, metrics={'train_runtime': 100.9394, 'train_samples_per_second': 1.635, 'train_steps_per_second': 0.198, 'total_flos': 0.0, 'train_loss': 0.19922059918753804, 'epoch': 0.963855421686747})
TEST ACCURACY: 70.97



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0161,16.040501,9.365068,1.0,6.675433,-136.363113,-113.6828,-2.35016,-2.355003
2,0.059,14.159886,7.84472,1.0,6.315167,-116.058182,-59.960964,-2.209313,-2.158319
3,0.0108,15.245693,5.905346,1.0,9.340346,-143.986313,-71.609398,-2.240141,-2.250504
4,0.0398,13.161218,7.665529,1.0,5.495688,-112.912697,-72.264961,-2.159056,-2.139261
5,0.0164,15.699163,8.297877,1.0,7.401287,-149.253662,-118.770157,-2.382924,-2.338269
6,0.0892,11.888634,7.155666,1.0,4.732967,-138.013153,-103.125206,-2.199379,-2.154757
7,0.0576,11.749943,7.087271,1.0,4.662671,-123.598221,-52.750801,-2.138051,-2.146965
8,0.1378,11.347159,7.479948,0.875,3.867211,-144.67691,-88.205643,-2.149721,-2.177033
9,0.0057,11.300703,3.87435,1.0,7.426353,-157.73056,-96.930931,-2.242182,-2.219947
10,0.1329,12.970304,6.6637,1.0,6.306604,-112.175339,-94.327644,-2.108666,-2.129205


100%|██████████| 31/31 [00:07<00:00,  4.34it/s]
100%|██████████| 31/31 [00:06<00:00,  4.47it/s]
100%|██████████| 31/31 [00:06<00:00,  4.46it/s]
100%|██████████| 31/31 [00:06<00:00,  4.44it/s]
100%|██████████| 31/31 [00:06<00:00,  4.45it/s]



EPOCH NO.3
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.05512177784112282, metrics={'train_runtime': 100.9688, 'train_samples_per_second': 1.634, 'train_steps_per_second': 0.198, 'total_flos': 0.0, 'train_loss': 0.05512177784112282, 'epoch': 0.963855421686747})
TEST ACCURACY: 74.19



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0016,13.633797,5.129803,1.0,8.503995,-178.715759,-137.749832,-2.267273,-2.263852
2,0.0156,12.24473,4.690529,1.0,7.5542,-147.600098,-79.112541,-2.110789,-2.062069
3,0.0007,12.965636,1.053528,1.0,11.912107,-192.504486,-94.409973,-2.142061,-2.153128
4,0.0181,10.948346,4.15436,1.0,6.793986,-148.024384,-94.393677,-2.070512,-2.04811
5,0.0091,13.514957,5.610741,1.0,7.904217,-176.125031,-140.612213,-2.314624,-2.271456
6,0.0174,8.754224,2.92592,1.0,5.828304,-180.310608,-134.469299,-2.119905,-2.074179
7,0.0066,10.225072,3.107673,1.0,7.117398,-163.394196,-67.999504,-2.063145,-2.069328
8,0.0176,9.3467,3.064096,1.0,6.282603,-188.835434,-108.210236,-2.071257,-2.106279
9,0.0012,9.023167,0.101195,1.0,8.921972,-195.462097,-119.706299,-2.182799,-2.160745
10,0.146,10.973977,3.598511,0.875,7.375467,-142.82724,-114.290916,-2.046877,-2.068064


100%|██████████| 31/31 [00:06<00:00,  4.44it/s]
100%|██████████| 31/31 [00:06<00:00,  4.43it/s]
100%|██████████| 31/31 [00:06<00:00,  4.47it/s]
100%|██████████| 31/31 [00:07<00:00,  4.42it/s]
100%|██████████| 31/31 [00:06<00:00,  4.47it/s]



EPOCH NO.4
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.01641314343214617, metrics={'train_runtime': 102.0688, 'train_samples_per_second': 1.617, 'train_steps_per_second': 0.196, 'total_flos': 0.0, 'train_loss': 0.01641314343214617, 'epoch': 0.963855421686747})
TEST ACCURACY: 74.19



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0012,11.900208,2.434352,1.0,9.465856,-205.670273,-155.085724,-2.226001,-2.219207
2,0.0017,10.824802,1.786696,1.0,9.038106,-176.638428,-93.311813,-2.058651,-2.013796
3,0.0014,10.76777,-1.982838,1.0,12.750607,-222.868149,-116.388626,-2.092275,-2.105936
4,0.0029,9.623441,1.539226,1.0,8.084215,-174.175735,-107.642731,-2.028671,-2.003122
5,0.0039,12.442806,2.594715,1.0,9.848091,-206.285278,-151.33374,-2.283079,-2.240384
6,0.003,7.722754,-0.419453,1.0,8.142207,-213.764343,-144.783997,-2.078251,-2.032277
7,0.001,9.549961,-0.250784,1.0,9.800745,-196.978775,-74.75061,-2.018751,-2.023256
8,0.001,8.556546,-1.168928,1.0,9.725474,-231.16568,-116.111771,-2.019931,-2.058292
9,0.0003,7.146195,-2.861753,1.0,10.007948,-225.091583,-138.476013,-2.137258,-2.115604
10,0.0363,9.447703,1.424605,1.0,8.023099,-164.566299,-129.55365,-1.994027,-2.015691


100%|██████████| 31/31 [00:07<00:00,  4.40it/s]
100%|██████████| 31/31 [00:06<00:00,  4.44it/s]
100%|██████████| 31/31 [00:07<00:00,  4.35it/s]
100%|██████████| 31/31 [00:06<00:00,  4.47it/s]
100%|██████████| 31/31 [00:07<00:00,  4.41it/s]



EPOCH NO.5
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.0035504298790328902, metrics={'train_runtime': 101.074, 'train_samples_per_second': 1.632, 'train_steps_per_second': 0.198, 'total_flos': 0.0, 'train_loss': 0.0035504298790328902, 'epoch': 0.963855421686747})
TEST ACCURACY: 74.19



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0005,10.347842,-0.791463,1.0,11.139305,-237.928406,-170.609375,-2.169862,-2.16147
2,0.0023,9.440406,-0.614506,1.0,10.054911,-200.650436,-107.155777,-1.994754,-1.95513
3,0.0004,9.252112,-5.612936,1.0,14.865047,-259.169128,-131.545197,-2.028315,-2.044998
4,0.0062,8.693589,-1.097447,1.0,9.791036,-200.542465,-116.941246,-1.96966,-1.943558
5,0.0019,10.613017,-0.072015,1.0,10.685032,-232.952576,-169.631607,-2.234678,-2.194607
6,0.0014,6.263693,-3.680555,1.0,9.944248,-246.375351,-159.374619,-2.021632,-1.975858
7,0.0015,8.479084,-2.527774,1.0,11.006859,-219.748672,-85.459381,-1.96973,-1.974851
8,0.0009,6.904212,-4.175495,1.0,11.079705,-261.231323,-132.635101,-1.970495,-2.012662
9,0.0002,5.762551,-5.67148,1.0,11.434031,-253.188858,-152.312454,-2.100768,-2.080048
10,0.0085,8.468018,-1.250053,1.0,9.718069,-191.312866,-139.350525,-1.952802,-1.977661


100%|██████████| 31/31 [00:07<00:00,  4.41it/s]
100%|██████████| 31/31 [00:06<00:00,  4.43it/s]
100%|██████████| 31/31 [00:06<00:00,  4.44it/s]
100%|██████████| 31/31 [00:06<00:00,  4.47it/s]
100%|██████████| 31/31 [00:06<00:00,  4.46it/s]



EPOCH NO.6
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.0014615994459745707, metrics={'train_runtime': 101.2071, 'train_samples_per_second': 1.63, 'train_steps_per_second': 0.198, 'total_flos': 0.0, 'train_loss': 0.0014615994459745707, 'epoch': 0.963855421686747})
TEST ACCURACY: 74.19



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0004,9.339939,-2.610781,1.0,11.95072,-256.121582,-180.688416,-2.138501,-2.132897
2,0.0012,8.753928,-2.65927,1.0,11.413198,-221.098083,-114.020561,-1.96293,-1.926581
3,0.0003,8.290321,-7.825173,1.0,16.115496,-281.291504,-141.163101,-1.996135,-2.013543
4,0.0007,8.086563,-3.518614,1.0,11.605176,-224.75412,-123.011505,-1.940923,-1.913225
5,0.0014,9.682514,-1.481121,1.0,11.163635,-247.04364,-178.936646,-2.212778,-2.174166
6,0.0003,5.200433,-5.870246,1.0,11.070679,-268.272278,-170.007202,-1.995015,-1.947974
7,0.0007,8.060796,-4.146629,1.0,12.207424,-235.937225,-89.642265,-1.945737,-1.950509
8,0.0003,5.915983,-6.342247,1.0,12.25823,-282.898865,-142.517395,-1.943983,-1.98896
9,0.0001,4.818081,-7.752292,1.0,12.570374,-273.996979,-161.757141,-2.079563,-2.059453
10,0.0013,7.897456,-3.419878,1.0,11.317333,-213.011108,-145.056122,-1.928316,-1.95429


100%|██████████| 31/31 [00:07<00:00,  4.39it/s]
100%|██████████| 31/31 [00:06<00:00,  4.46it/s]
100%|██████████| 31/31 [00:06<00:00,  4.44it/s]
100%|██████████| 31/31 [00:07<00:00,  4.40it/s]
100%|██████████| 31/31 [00:06<00:00,  4.47it/s]



EPOCH NO.7
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.00045365643927652854, metrics={'train_runtime': 101.3117, 'train_samples_per_second': 1.629, 'train_steps_per_second': 0.197, 'total_flos': 0.0, 'train_loss': 0.00045365643927652854, 'epoch': 0.963855421686747})
TEST ACCURACY: 74.19



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0001,8.570509,-4.198821,1.0,12.769329,-272.002014,-188.382721,-2.118719,-2.111689
2,0.0004,8.327002,-3.982117,1.0,12.309119,-234.326569,-118.289818,-1.937871,-1.903133
3,0.0001,7.597271,-9.477768,1.0,17.075039,-297.817444,-148.093613,-1.973614,-1.992301
4,0.0002,7.304213,-5.131396,1.0,12.435609,-240.881958,-130.835007,-1.91952,-1.891381
5,0.001,8.892303,-2.719325,1.0,11.611628,-259.42569,-186.838745,-2.194902,-2.156377
6,0.0002,4.307993,-7.051442,1.0,11.359435,-280.084229,-178.93161,-1.97492,-1.927381
7,0.0001,7.627769,-5.30306,1.0,12.930828,-247.501511,-93.972534,-1.927702,-1.933041
8,0.0002,5.207821,-7.809322,1.0,13.017143,-297.569611,-149.59903,-1.925606,-1.971589
9,0.0,4.046773,-9.289204,1.0,13.335978,-289.366089,-169.470245,-2.063568,-2.044886
10,0.0004,7.564031,-4.711267,1.0,12.275297,-225.924988,-148.390381,-1.909024,-1.937269


100%|██████████| 31/31 [00:07<00:00,  4.40it/s]
100%|██████████| 31/31 [00:06<00:00,  4.44it/s]
100%|██████████| 31/31 [00:07<00:00,  4.42it/s]
100%|██████████| 31/31 [00:06<00:00,  4.45it/s]
100%|██████████| 31/31 [00:06<00:00,  4.43it/s]



EPOCH NO.8
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.00022250367264859961, metrics={'train_runtime': 101.0144, 'train_samples_per_second': 1.633, 'train_steps_per_second': 0.198, 'total_flos': 0.0, 'train_loss': 0.00022250367264859961, 'epoch': 0.963855421686747})
TEST ACCURACY: 77.42



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0001,8.085546,-5.126726,1.0,13.212272,-281.281036,-193.23233,-2.106171,-2.098962
2,0.0002,8.033375,-4.72333,1.0,12.756705,-241.738678,-121.226089,-1.924989,-1.891123
3,0.0001,7.201322,-10.459201,1.0,17.660522,-307.631775,-152.053101,-1.959142,-1.978779
4,0.0001,6.898561,-5.934488,1.0,12.83305,-248.912857,-134.89151,-1.907028,-1.878054
5,0.0008,8.472612,-3.460231,1.0,11.932844,-266.834717,-191.03566,-2.184559,-2.147523
6,0.0002,3.774259,-7.841283,1.0,11.615542,-287.982635,-184.268951,-1.964808,-1.917413
7,0.0001,7.360116,-5.966987,1.0,13.327103,-254.140778,-96.649055,-1.918077,-1.923809
8,0.0001,4.812308,-8.6083,1.0,13.420608,-305.559387,-153.554138,-1.914004,-1.961122
9,0.0,3.544423,-9.997321,1.0,13.541742,-296.447266,-174.493729,-2.05531,-2.037546
10,0.0004,7.232944,-5.401536,1.0,12.63448,-232.827698,-151.701248,-1.900339,-1.928145


100%|██████████| 31/31 [00:07<00:00,  4.29it/s]
100%|██████████| 31/31 [00:06<00:00,  4.48it/s]
100%|██████████| 31/31 [00:06<00:00,  4.46it/s]
100%|██████████| 31/31 [00:06<00:00,  4.44it/s]
100%|██████████| 31/31 [00:06<00:00,  4.48it/s]



EPOCH NO.9
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.00016634182011330268, metrics={'train_runtime': 100.8385, 'train_samples_per_second': 1.636, 'train_steps_per_second': 0.198, 'total_flos': 0.0, 'train_loss': 0.00016634182011330268, 'epoch': 0.963855421686747})
TEST ACCURACY: 74.19



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0001,7.786457,-5.802547,1.0,13.589004,-288.039246,-196.223236,-2.097544,-2.09117
2,0.0002,7.818546,-5.280821,1.0,13.099367,-247.313583,-123.374367,-1.916644,-1.882803
3,0.0,7.012627,-11.105567,1.0,18.118195,-314.095428,-153.940063,-1.948887,-1.96924
4,0.0001,6.625473,-6.396912,1.0,13.022386,-253.537109,-137.622406,-1.898469,-1.869811
5,0.0007,8.150713,-4.049095,1.0,12.199808,-272.723389,-194.254654,-2.177611,-2.14045
6,0.0002,3.528074,-8.372128,1.0,11.900201,-293.291077,-186.730804,-1.957107,-1.908975
7,0.0001,7.187729,-6.441859,1.0,13.629588,-258.889526,-98.372932,-1.909994,-1.91683
8,0.0001,4.621793,-9.251152,1.0,13.872944,-311.987885,-155.45929,-1.906971,-1.954468
9,0.0,3.222692,-10.637968,1.0,13.86066,-302.853729,-177.711044,-2.050242,-2.032673
10,0.0003,6.914399,-5.926967,1.0,12.841366,-238.082001,-154.886703,-1.891404,-1.919909


100%|██████████| 31/31 [00:06<00:00,  4.44it/s]
100%|██████████| 31/31 [00:07<00:00,  4.41it/s]
100%|██████████| 31/31 [00:06<00:00,  4.43it/s]
100%|██████████| 31/31 [00:06<00:00,  4.45it/s]
100%|██████████| 31/31 [00:07<00:00,  4.39it/s]



EPOCH NO.10
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.00012994895650990655, metrics={'train_runtime': 100.9846, 'train_samples_per_second': 1.634, 'train_steps_per_second': 0.198, 'total_flos': 0.0, 'train_loss': 0.00012994895650990655, 'epoch': 0.963855421686747})
TEST ACCURACY: 77.42

BEST ITERATION: 10
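
The columns in the step tables above are related in a simple way. In TRL's DPO logging, `rewards / margins` is the mean of the chosen reward minus the rejected reward, and `rewards / accuracies` is the fraction of pairs whose chosen reward exceeds the rejected one. The sketch below checks this against two rows copied from the table logged just before `EPOCH NO.3`. The variable names are illustrative, and reading 0.875 as 7 of 8 pairs assumes each logged step averages over the total batch of 8.

```python
# Rows copied from the table logged before "EPOCH NO.3":
# (step, rewards/chosen, rewards/rejected, rewards/accuracies, rewards/margins)
rows = [
    (1, 16.040501, 9.365068, 1.0, 6.675433),
    (8, 11.347159, 7.479948, 0.875, 3.867211),
]

# 2 per device x 4 gradient accumulation steps, as printed in the banner
TOTAL_BATCH_SIZE = 8

for step, chosen, rejected, accuracy, margin in rows:
    recomputed = chosen - rejected
    # Logged values are rounded to 6 decimals, so allow a small tolerance
    assert abs(recomputed - margin) < 1e-5
    # Assumption: the accuracy is averaged over the total batch of 8 pairs
    correct_pairs = accuracy * TOTAL_BATCH_SIZE
    print(f"step {step}: margin {recomputed:.6f}, "
          f"pairs ranked correctly ~ {correct_pairs:g} of {TOTAL_BATCH_SIZE}")
```

Across the iterations shown, the margins grow and the training loss falls toward zero, while the test accuracy stays between 74.19 and 77.42.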


In [21]:
# Free as much GPU RAM as possible before reloading a checkpoint
import gc
import torch

del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()

In [22]:
# Make sure enough GPU RAM is free before running this cell
from unsloth import FastLanguageModel

# Reload the checkpoint saved for the best-scoring iteration
best_checkpoint_dir = f"checkpoint_iteration_{best_iteration}"

model, tokenizer = FastLanguageModel.from_pretrained(best_checkpoint_dir)

# Re-evaluate the reloaded model on the test set
eval_result = run_eval(model, tokenizer, 5)
print(f"\nTEST ACCURACY: {eval_result * 100:.2f}\n")

==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


100%|██████████| 31/31 [00:06<00:00,  4.46it/s]
100%|██████████| 31/31 [00:06<00:00,  4.61it/s]
100%|██████████| 31/31 [00:06<00:00,  4.59it/s]
100%|██████████| 31/31 [00:06<00:00,  4.62it/s]
100%|██████████| 31/31 [00:06<00:00,  4.62it/s]


TEST ACCURACY: 74.19
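
The reloaded checkpoint scores 74.19, whereas the training loop logged 77.42 for iteration 10. The sketch below tallies the accuracies visible in the log (iterations 3 to 10; earlier ones fall outside this excerpt) and picks the best one with ties going to the latest iteration, an assumption that is consistent with `BEST ITERATION: 10`. It then converts the two distinct scores into counts, assuming the 31 in the progress bars is the number of test examples per pass. `pick_best`, `correct_count` and `NUM_TEST_EXAMPLES` are hypothetical names, not part of the notebook.

```python
# Test accuracies (%) copied from the log above, keyed by iteration
accuracies = {3: 74.19, 4: 74.19, 5: 74.19, 6: 74.19,
              7: 74.19, 8: 77.42, 9: 74.19, 10: 77.42}

NUM_TEST_EXAMPLES = 31  # assumption: taken from the "31/31" progress bars


def pick_best(scores):
    """Return the iteration with the highest score; ties go to the latest one."""
    best_iter, best_score = None, float("-inf")
    for iteration in sorted(scores):
        if scores[iteration] >= best_score:  # ">=" lets later ties win
            best_iter, best_score = iteration, scores[iteration]
    return best_iter


def correct_count(accuracy_percent, total=NUM_TEST_EXAMPLES):
    """Convert a percentage back into a whole number of correct examples."""
    return round(accuracy_percent / 100 * total)


best = pick_best(accuracies)
print("best iteration:", best)
for value in sorted(set(accuracies.values())):
    print(f"{value:.2f}% ~ {correct_count(value)} of {NUM_TEST_EXAMPLES} correct")
```

Under that assumption the gap between 74.19 and 77.42 is a single test example, so a difference of this size between a logged score and a reloaded checkpoint may reflect run-to-run variation rather than a real change in quality.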






In [23]:
!pip install huggingface_hub



In [24]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
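
The interactive prompt above can be awkward when a notebook is re-run unattended. A minimal alternative sketch, assuming the token has been stored in an `HF_TOKEN` environment variable beforehand: it reads the token and passes it to `huggingface_hub.login`. The helper `read_hf_token` is a hypothetical name introduced for illustration.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the environment, or None if it is missing or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = read_hf_token()
if token is not None:
    # Imported lazily so the cell still runs when no token is configured
    from huggingface_hub import login
    login(token=token)
else:
    print("HF_TOKEN is not set; fall back to `huggingface-cli login`.")
```

Keeping the token in the environment also avoids pasting it into a cell whose output is saved with the notebook.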


In [25]:
# Save the LoRA adapter locally, then upload the same adapter to the Hugging Face Hub
model.save_pretrained("model", tokenizer, save_method="default")
model.push_to_hub("techandy42/zephyr-sft-bnb-4bit-fine-tuned", tokenizer, save_method="default")

README.md:   0%|          | 0.00/581 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/671M [00:00<?, ?B/s]

Saved model to https://huggingface.co/techandy42/zephyr-sft-bnb-4bit-fine-tuned


In [26]:
# Free as much GPU RAM as possible before loading the model back from the Hub
import gc
import torch

del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()

In [27]:
from unsloth import FastLanguageModel

model_name = "techandy42/zephyr-sft-bnb-4bit-fine-tuned"
model, tokenizer = FastLanguageModel.from_pretrained(model_name)

eval_result = run_eval(model, tokenizer, 5)
print(f"\nTEST ACCURACY: {eval_result * 100:.2f}\n")

==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


adapter_model.safetensors:   0%|          | 0.00/671M [00:00<?, ?B/s]

100%|██████████| 31/31 [00:06<00:00,  4.51it/s]
100%|██████████| 31/31 [00:06<00:00,  4.54it/s]
100%|██████████| 31/31 [00:06<00:00,  4.48it/s]
100%|██████████| 31/31 [00:06<00:00,  4.48it/s]
100%|██████████| 31/31 [00:06<00:00,  4.59it/s]


TEST ACCURACY: 74.19






Some other useful notebooks from Unsloth:
1. Mistral 7b 2x faster [free Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. Gemma (trained on 6 trillion tokens) is 2.5x faster! [free Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)