### DPO Fine-Tuning - Llama 8B

> **Model Info**

- Model Name: Llama 3.1 8B Instruct (4-bit quantized)
- Accuracy: 80.65%

> **Training Info**

- GPU Type: A100
- Time: 30 mins
- GPU RAM: 18.4 GB

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# We have to check which Torch version for Xformers (2.3 -> 0.0.27)
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* DPO requires a model that has already been trained with SFT on a dataset similar to the one used for DPO. Here the SFT step runs in this notebook, on the same Llama 3.1 8B model, before the DPO step. See this [notebook](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing) for a standalone SFT example.
* [**NEW**] We make Gemma 6 trillion tokens **2.5x faster**! See our [Gemma notebook](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

In [3]:
#@title Alignment Handbook utils
import os
import re
from typing import List, Literal, Optional

from datasets import DatasetDict, concatenate_datasets, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError


DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"


def apply_chat_template(
    example, tokenizer, task: Literal["sft", "generation", "rm", "dpo"] = "sft", assistant_prefix="<|assistant|>\n"
):
    def _strip_prefix(s, pattern):
        # Use re.escape to escape any special characters in the pattern
        return re.sub(f"^{re.escape(pattern)}", "", s)

    if task in ["sft", "generation"]:
        messages = example["messages"]
        # We add an empty system message if there is none
        if messages[0]["role"] != "system":
            messages.insert(0, {"role": "system", "content": ""})
        example["text"] = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=(task == "generation")
        )
    elif task == "rm":
        if all(k in example.keys() for k in ("chosen", "rejected")):
            chosen_messages = example["chosen"]
            rejected_messages = example["rejected"]
            # We add an empty system message if there is none
            if chosen_messages[0]["role"] != "system":
                chosen_messages.insert(0, {"role": "system", "content": ""})
            if rejected_messages[0]["role"] != "system":
                rejected_messages.insert(0, {"role": "system", "content": ""})
            example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
        else:
            raise ValueError(
                f"Could not format example as dialogue for `rm` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
            )
    elif task == "dpo":
        if all(k in example.keys() for k in ("chosen", "rejected")):
            # Compared to reward modeling, we filter out the prompt, so the text is everything after the last assistant token
            prompt_messages = [[msg for msg in example["chosen"] if msg["role"] == "user"][0]]
            # Insert system message
            if example["chosen"][0]["role"] != "system":
                prompt_messages.insert(0, {"role": "system", "content": ""})
            else:
                prompt_messages.insert(0, example["chosen"][0])
            # TODO: handle case where chosen/rejected also have system messages
            chosen_messages = example["chosen"][1:]
            rejected_messages = example["rejected"][1:]
            example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
            example["text_prompt"] = tokenizer.apply_chat_template(
                prompt_messages, tokenize=False, add_generation_prompt=True
            )
            example["text_chosen"] = _strip_prefix(example["text_chosen"], assistant_prefix)
            example["text_rejected"] = _strip_prefix(example["text_rejected"], assistant_prefix)
        else:
            raise ValueError(
                f"Could not format example as dialogue for `dpo` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
            )
    else:
        raise ValueError(
            f"Task {task} not supported, please ensure that the provided task is one of {['sft', 'generation', 'rm', 'dpo']}"
        )
    return example


def get_datasets(
    data_config: dict,
    splits: List[str] = ["train", "test"],
    shuffle: bool = True,
) -> DatasetDict:
    """
    Loads one or more datasets with varying training set proportions.

    Args:
        data_config (`DataArguments` or `dict`):
            Dataset configuration and split proportions.
        splits (`List[str]`, *optional*, defaults to `['train', 'test']`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.

    Returns
        [`DatasetDict`]: The dataset dictionary containing the loaded datasets.
    """

    if type(data_config) is dict:
        # Structure of the input is:
        #     dataset_mixer = {
        #             "dataset1": 0.5,
        #             "dataset2": 0.3,
        #             "dataset3": 0.2,
        #         }
        dataset_mixer = data_config
    else:
        raise ValueError(f"Data config {data_config} not recognized.")

    raw_datasets = mix_datasets(dataset_mixer, splits=splits, shuffle=shuffle)
    return raw_datasets


def mix_datasets(dataset_mixer: dict, splits: Optional[List[str]] = None, shuffle=True) -> DatasetDict:
    """
    Loads and mixes datasets according to proportions specified in `dataset_mixer`.

    Args:
        dataset_mixer (`dict`):
            Dictionary containing the dataset names and their training proportions. By default, all test proportions are 1.
        splits (Optional[List[str]], *optional*, defaults to `None`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.
    """
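    # The signature allows splits=None, so fall back to both splits before iterating
    splits = ["train", "test"] if splits is None else splits
    raw_datasets = DatasetDict()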
    raw_train_datasets = []
    raw_val_datasets = []
    fracs = []
    for ds, frac in dataset_mixer.items():
        fracs.append(frac)
        for split in splits:
            try:
                # Try first if dataset on a Hub repo
                dataset = load_dataset(ds, split=split)
            except DatasetGenerationError:
                # If not, check local dataset
                dataset = load_from_disk(os.path.join(ds, split))

            if "train" in split:
                raw_train_datasets.append(dataset)
            elif "test" in split:
                raw_val_datasets.append(dataset)
            else:
                raise ValueError(f"Split type {split} not recognized as one of test or train.")

    if any(frac < 0 for frac in fracs):
        raise ValueError("Dataset fractions cannot be negative.")

    if len(raw_train_datasets) > 0:
        train_subsets = []
        for dataset, frac in zip(raw_train_datasets, fracs):
            train_subset = dataset.select(range(int(frac * len(dataset))))
            train_subsets.append(train_subset)
        if shuffle:
            raw_datasets["train"] = concatenate_datasets(train_subsets).shuffle(seed=42)
        else:
            raw_datasets["train"] = concatenate_datasets(train_subsets)
    # No subsampling for test datasets to enable fair comparison across models
    if len(raw_val_datasets) > 0:
        if shuffle:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets).shuffle(seed=42)
        else:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets)

    if len(raw_datasets) == 0:
        raise ValueError(
            f"Dataset {dataset_mixer} not recognized with split {split}. Check the dataset has been correctly formatted."
        )

    return raw_datasets
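
To make the `dpo` branch of `apply_chat_template` more concrete, here is a minimal standalone sketch of its prefix-stripping step. The helper mirrors `_strip_prefix` above; the two sample strings are made up for illustration. Note that the default `assistant_prefix` follows the Zephyr-style template, so text rendered with a Llama 3-style header would pass through unchanged.

```python
import re

def strip_prefix(s: str, pattern: str) -> str:
    # Same logic as _strip_prefix: drop `pattern` only at the very start of `s`.
    return re.sub(f"^{re.escape(pattern)}", "", s)

ASSISTANT_PREFIX = "<|assistant|>\n"  # default argument of apply_chat_template

# Hypothetical rendered strings, for illustration only.
zephyr_style = "<|assistant|>\nThe bug is on line 3."
llama3_style = "<|start_header_id|>assistant<|end_header_id|>\n\nThe bug is on line 3."

stripped_zephyr = strip_prefix(zephyr_style, ASSISTANT_PREFIX)
stripped_llama3 = strip_prefix(llama3_style, ASSISTANT_PREFIX)

print(repr(stripped_zephyr))             # prefix removed
print(stripped_llama3 == llama3_style)   # unchanged: the prefix does not match
```

If the prefix should be removed for another chat template, a different `assistant_prefix` would have to be passed in; which value is right depends on the tokenizer's template.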

<a name="Data"></a>
### Data Prep
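We follow Huggingface's [Alignment Handbook](https://github.com/huggingface/alignment-handbook) for [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) for the chat formatting, but instead of the [Ultra Feedback dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) used there, we load `techandy42/debugger_llm_humaneval_dataset_v1`. Each row holds a piece of buggy code and four candidate analyses (`rd1`, `rd2`, `rd3`, `custom`), each scored on six criteria. Every pair of analyses with different total scores becomes one SFT example (the higher-scored analysis) and one DPO preference pair (higher-scored as `chosen`, lower-scored as `rejected`).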

In [4]:
from datasets import load_dataset
import pandas as pd
from datasets import Dataset, DatasetDict

dataset = load_dataset('techandy42/debugger_llm_humaneval_dataset_v1')
df_train = pd.DataFrame(dataset['train'])
df_test = pd.DataFrame(dataset['test'])
df_parts = [df_train, df_test]

df_sft_train = []
df_sft_test = []
df_sft_parts = [df_sft_train, df_sft_test]

df_dpo_train = []
df_dpo_test = []
df_dpo_parts = [df_dpo_train, df_dpo_test]

PREFIXS = ['score_s1_', 'score_s2_', 'score_s3_', 'score_s4_', 'score_s5_', 'score_s6_']
ROUNDS = ['rd1', 'rd2', 'rd3', 'custom']
PAIRS = [('rd1', 'rd2'), ('rd1', 'rd3'), ('rd1', 'custom'), ('rd2', 'rd3'), ('rd2', 'custom'), ('rd3', 'custom')]

def indent_lines(string: str) -> str:
  indented_string = '\n'.join('    ' + line for line in string.splitlines())
  return indented_string

for df, df_sft, df_dpo in zip(df_parts, df_sft_parts, df_dpo_parts):
  for idx, row in df.iterrows():
      prompt = row['prompt']
      result = row['result']
      instruction = f"""<instruction>
  <bullets>
    <bullet>The following buggy code is a wrong implementation that contains one or more bugs.</bullet>
    <bullet>Firstly, find all of the bugs within the buggy code. Make sure to quotate each part of the buggy code that contains a bug.</bullet>
    <bullet>Afterwards, for each of the bugs, describe the issue with each part of the buggy code with the bug, and outline how to fix the issue.</bullet>
    <bullet>Make sure your answer covers (1) all of the existing bugs, (2) do not hallucinate non-existing bugs, and (3) be concise as possible.</bullet>
    <bullet>IMPORTANT!: While abiding by the above instructions, keep your answer as brief as possible.</bullet>
  </bullets>
</instruction>"""
      full_solution = "<buggy_code>\n" + (prompt + indent_lines(result)).strip('\n') + "\n</buggy_code>"
      full_instruction = instruction + "\n" + full_solution
      solutions_info = {}
      for ROUND in ROUNDS:
        solutions_info[ROUND] = {}
        total_score = 0
        for PREFIX in PREFIXS:
          score_col = PREFIX + ROUND
          score = int(row[score_col][0])
          total_score += score
        total_score /= 42
        analysis_col = 'analysis_' + ROUND
        solutions_info[ROUND]['analysis'] = row[analysis_col]
        solutions_info[ROUND]['score'] = total_score
      for ROUND1, ROUND2 in PAIRS:
        round1_score = solutions_info[ROUND1]['score']
        round2_score = solutions_info[ROUND2]['score']
        round1_analysis = solutions_info[ROUND1]['analysis']
        round2_analysis = solutions_info[ROUND2]['analysis']
        if round1_score == round2_score:
          continue
        messages_info = {}
        messages_info['messages'] = [
            {'content': full_instruction, 'role': 'user'},
            {'content': round1_analysis if round1_score > round2_score else round2_analysis, 'role': 'assistant'}
        ]
        pairwise_info = {}
        pairwise_info['prompt'] = full_instruction
        pairwise_info['chosen'] = [
            {'content': full_instruction, 'role': 'user'},
            {'content': round1_analysis if round1_score > round2_score else round2_analysis, 'role': 'assistant'}
        ]
        pairwise_info['rejected'] = [
            {'content': full_instruction, 'role': 'user'},
            {'content': round1_analysis if round1_score < round2_score else round2_analysis, 'role': 'assistant'}
        ]
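        # Not part of training data, only for analysis.
        # NOTE: the trailing comma after the dict wraps it in a 1-tuple,
        # so it is read back later as metadata[0].
        pairwise_info['metadata'] = {
            'chosen': ROUND1 if round1_score > round2_score else ROUND2,
            'rejected': ROUND1 if round1_score < round2_score else ROUND2,
        },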
        df_sft.append(messages_info)
        df_dpo.append(pairwise_info)

df_sft_train = pd.DataFrame(df_sft_train)
df_sft_test = pd.DataFrame(df_sft_test)
dataset_sft_train = Dataset.from_pandas(df_sft_train)
dataset_sft_test = Dataset.from_pandas(df_sft_test)
datasets_sft = DatasetDict({
    'train': dataset_sft_train,
    'test': dataset_sft_test
})
df_dpo_train = pd.DataFrame(df_dpo_train)
df_dpo_test = pd.DataFrame(df_dpo_test)
dataset_dpo_train = Dataset.from_pandas(df_dpo_train)
dataset_dpo_test = Dataset.from_pandas(df_dpo_test)
datasets_dpo = DatasetDict({
    'train': dataset_dpo_train,
    'test': dataset_dpo_test
})

Downloading readme:   0%|          | 0.00/2.26k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/93.8k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/49.6k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/30 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6 [00:00<?, ? examples/s]
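
The pairing logic in the cell above can be sketched in isolation. This is a simplified illustration, not the notebook's exact code: `build_pairs` and the example scores are hypothetical, and `itertools.combinations` is used in place of the hand-written `PAIRS` list (it yields the same six pairs in the same order).

```python
from itertools import combinations

ROUNDS = ["rd1", "rd2", "rd3", "custom"]

def build_pairs(scores: dict) -> list:
    """Return (chosen, rejected) round names for every pair with unequal scores."""
    pairs = []
    for a, b in combinations(ROUNDS, 2):
        if scores[a] == scores[b]:
            continue  # ties carry no preference signal, so they are skipped
        chosen, rejected = (a, b) if scores[a] > scores[b] else (b, a)
        pairs.append((chosen, rejected))
    return pairs

# Hypothetical normalized scores for one problem (not taken from the dataset).
example_scores = {"rd1": 0.50, "rd2": 0.75, "rd3": 0.75, "custom": 1.00}
pairs = build_pairs(example_scores)
print(pairs)
```

Each row can contribute at most 6 pairs, so skipped ties reduce the total below 6 times the number of rows.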

In [5]:
datasets_sft

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 165
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 31
    })
})

In [6]:
datasets_dpo

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'metadata'],
        num_rows: 165
    })
    test: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'metadata'],
        num_rows: 31
    })
})

In [7]:
column_names = list(datasets_sft['train'].features)

sft_datasets = datasets_sft.map(
    apply_chat_template,
    fn_kwargs = {"tokenizer": tokenizer, "task": "sft"},
    num_proc = 12,
    remove_columns = column_names,
    desc = "Formatting comparisons with prompt template",
)

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/165 [00:00<?, ? examples/s]

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/31 [00:00<?, ? examples/s]

In [8]:
sft_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 165
    })
    test: Dataset({
        features: ['text'],
        num_rows: 31
    })
})

In [9]:
print(sft_datasets['train'][0]['text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

<instruction>
  <bullets>
    <bullet>The following buggy code is a wrong implementation that contains one or more bugs.</bullet>
    <bullet>Firstly, find all of the bugs within the buggy code. Make sure to quotate each part of the buggy code that contains a bug.</bullet>
    <bullet>Afterwards, for each of the bugs, describe the issue with each part of the buggy code with the bug, and outline how to fix the issue.</bullet>
    <bullet>Make sure your answer covers (1) all of the existing bugs, (2) do not hallucinate non-existing bugs, and (3) be concise as possible.</bullet>
    <bullet>IMPORTANT!: While abiding by the above instructions, keep your answer as brief as possible.</bullet>
  </bullets>
</instruction>
<buggy_code>
from typing import List


def parse_nested_parens(paren_string: str) -> List[int]:
    

In [10]:
column_names = list(datasets_dpo['train'].features)

dpo_datasets = datasets_dpo.map(
    apply_chat_template,
    fn_kwargs = {"tokenizer": tokenizer, "task": "dpo"},
    num_proc = 12,
    remove_columns = column_names,
    desc = "Formatting comparisons with prompt template",
)

dpo_datasets = dpo_datasets.rename_columns(
    {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"}
)

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/165 [00:00<?, ? examples/s]

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/31 [00:00<?, ? examples/s]

In [11]:
dpo_datasets

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'prompt'],
        num_rows: 165
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'prompt'],
        num_rows: 31
    })
})

In [12]:
print("=" * 10 + "PROMPT" + "=" * 10)
print(dpo_datasets['train'][0]['prompt'])
print("=" * 10 + "CHOSEN" + "=" * 10)
print(dpo_datasets['train'][0]['chosen'])
print("=" * 10 + "REJECTED" + "=" * 10)
print(dpo_datasets['train'][0]['rejected'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

<instruction>
  <bullets>
    <bullet>The following buggy code is a wrong implementation that contains one or more bugs.</bullet>
    <bullet>Firstly, find all of the bugs within the buggy code. Make sure to quotate each part of the buggy code that contains a bug.</bullet>
    <bullet>Afterwards, for each of the bugs, describe the issue with each part of the buggy code with the bug, and outline how to fix the issue.</bullet>
    <bullet>Make sure your answer covers (1) all of the existing bugs, (2) do not hallucinate non-existing bugs, and (3) be concise as possible.</bullet>
    <bullet>IMPORTANT!: While abiding by the above instructions, keep your answer as brief as possible.</bullet>
  </bullets>
</instruction>
<buggy_code>
from typing import List


def parse_nested_parens(paren_string: str) -> List[int]:
    

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [13]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
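
As a sanity check on the adapter size, the number of LoRA parameters can be worked out by hand. The sketch below assumes the standard Llama 3.1 8B layer shapes (hidden size 4096, 8 key/value heads of dimension 128, MLP width 14336, 32 layers); these constants come from the public model config, not from this notebook. `lora_params` is a hypothetical helper.

```python
# Llama 3.1 8B shapes (assumed from the public model config); R matches r = 64 above.
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 head dim
INTERMEDIATE = 14336
NUM_LAYERS = 32
R = 64

# (in_features, out_features) of each targeted linear layer
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    # LoRA adds A (r x in) and B (out x r) next to each frozen weight.
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 167,772,160
```

That is roughly 2% of an 8B-parameter model, and it matches the "Number of trainable parameters" line printed by the trainer.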


### Train the SFT model

In [14]:
# Note: running eval is not necessary for this stage
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

sft_trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = sft_datasets['train'],
    # eval_dataset = sft_datasets['test'], # Uncomment to run eval
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        # evaluation_strategy = "steps", # Uncomment to run eval
        # eval_steps = 1, # Uncomment to run eval
    ),
)

Map (num_proc=2):   0%|          | 0/165 [00:00<?, ? examples/s]

In [15]:
sft_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


| Step | Training Loss |
|---|---|
| 1 | 1.9434 |
| 2 | 1.8787 |
| 3 | 1.7617 |
| 4 | 1.6941 |
| 5 | 1.4124 |
| 6 | 1.2572 |
| 7 | 1.0567 |
| 8 | 0.8528 |
| 9 | 0.6928 |
| 10 | 0.5792 |


TrainOutput(global_step=20, training_loss=0.9286116510629654, metrics={'train_runtime': 53.6651, 'train_samples_per_second': 3.075, 'train_steps_per_second': 0.373, 'total_flos': 4584282903183360.0, 'train_loss': 0.9286116510629654, 'epoch': 0.963855421686747})
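
The step count and the fractional `epoch` value in the log follow from the batch settings. Here is a short worked calculation, assuming a single GPU and that micro-batches which do not fill a complete accumulation group are not counted as a step (which is what the logged numbers suggest).

```python
import math

num_examples = 165
per_device_train_batch_size = 2
gradient_accumulation_steps = 4

# Number of forward/backward passes needed to see every example once
micro_batches = math.ceil(num_examples / per_device_train_batch_size)
# One optimizer step per full group of accumulated micro-batches
optimizer_steps = micro_batches // gradient_accumulation_steps
# Fraction of the epoch actually covered by those steps
epoch_fraction = optimizer_steps * gradient_accumulation_steps / micro_batches

print(micro_batches, optimizer_steps, epoch_fraction)
```

This gives 83 micro-batches and 20 optimizer steps covering 80/83 of the data, in line with `Total steps = 20` and the reported `epoch` value of about 0.964.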

<a name="Train"></a>
### Train the DPO model
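Now let's use Huggingface TRL's `DPOTrainer`! More docs here: [TRL DPO docs](https://huggingface.co/docs/trl/dpo_trainer). Each call to `train()` runs 1 epoch over the 165 training preference pairs; we repeat this for up to 10 rounds and evaluate on the 31 test pairs after each round.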

In [16]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

In [17]:
# Note: there is an issue running eval during training with trl's DPOTrainer & DPOConfig
from trl import DPOTrainer, DPOConfig
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 1,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = dpo_datasets['train'],
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)

Tokenizing train dataset:   0%|          | 0/165 [00:00<?, ? examples/s]
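
For intuition about what `beta = 0.1` controls, here is a minimal sketch of the sigmoid DPO loss for a single preference pair, written with plain floats. It assumes TRL's default loss type; `dpo_loss` and the log-probability values are hypothetical. As far as we know, when `ref_model = None` and the model carries PEFT adapters, the reference log-probabilities are computed with the adapters disabled.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss and implicit rewards for a single preference pair."""
    # Implicit reward: how much more likely the policy makes an answer than the reference
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    loss = math.log1p(math.exp(-margin))
    return loss, reward_chosen, reward_rejected, margin

# Hypothetical summed log-probabilities, for illustration only.
loss, r_c, r_r, margin = dpo_loss(-180.0, -200.0, -190.0, -195.0, beta=0.1)
print(f"loss={loss:.4f} chosen={r_c:.2f} rejected={r_r:.2f} margin={margin:.2f}")
```

The `rewards / chosen`, `rewards / rejected` and `rewards / margins` columns in the training log correspond to these quantities averaged over a batch. A margin of zero gives a loss of `log(2)`, and the loss shrinks as the margin grows.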

In [18]:
from tqdm import tqdm
import torch

def run_eval(model, tokenizer, no_iter, get_stats = False):
  NUM_ITEMS = len(datasets_dpo['test'])  # same split that is indexed in the loop below
  num_chosen = 0
  ROUNDS = ['rd1', 'rd2', 'rd3', 'custom']
  stats = {}
  for ROUND in ROUNDS:
    stats[ROUND] = {'TP': 0, 'TN': 0, 'FP': 0, 'FN': 0}

  for _ in range(no_iter):
    for i in tqdm(range(NUM_ITEMS)):
      # Named `pair` to avoid shadowing the built-in `input`
      pair = {
          "chosen": datasets_dpo['test'][i]["chosen"],
          "rejected": datasets_dpo['test'][i]["rejected"]
      }
      chosen_round = datasets_dpo['test'][i]["metadata"][0]['chosen']
      rejected_round = datasets_dpo['test'][i]["metadata"][0]['rejected']

      # Apply the chat template to format the pair
      formatted_input = apply_chat_template(pair, tokenizer, task="dpo")

      # Tokenize the inputs
      inputs_chosen = tokenizer(formatted_input["text_chosen"], return_tensors="pt", padding=True, truncation=True)
      inputs_rejected = tokenizer(formatted_input["text_rejected"], return_tensors="pt", padding=True, truncation=True)

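      # Score each answer by the mean of all output logits. This is a rough proxy,
      # not the DPO implicit reward (which compares policy and reference log-probabilities).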
      with torch.no_grad():
          reward_chosen = model(**inputs_chosen).logits.mean().item()
          reward_rejected = model(**inputs_rejected).logits.mean().item()
          # Model chose correctly
          if reward_chosen > reward_rejected:
              num_chosen += 1
              stats[chosen_round]['TP'] += 1
              stats[rejected_round]['TN'] += 1
          # Model chose wrongly
          else:
              stats[chosen_round]['FN'] += 1
              stats[rejected_round]['FP'] += 1

  if get_stats:
    return num_chosen / (no_iter * NUM_ITEMS), stats

  return num_chosen / (no_iter * NUM_ITEMS)
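
The bookkeeping in `run_eval` can be tried out without a model. The sketch below reproduces only the counting logic on made-up scores; `tally` and `toy_pairs` are hypothetical names, and the scores stand in for whatever scalar the scorer assigns to each answer.

```python
from collections import defaultdict

def tally(pairs):
    """pairs: list of (chosen_round, rejected_round, score_chosen, score_rejected)."""
    stats = defaultdict(lambda: {"TP": 0, "TN": 0, "FP": 0, "FN": 0})
    correct = 0
    for chosen_round, rejected_round, s_chosen, s_rejected in pairs:
        if s_chosen > s_rejected:      # scorer agrees with the label
            correct += 1
            stats[chosen_round]["TP"] += 1
            stats[rejected_round]["TN"] += 1
        else:                          # scorer prefers the rejected answer, or ties
            stats[chosen_round]["FN"] += 1
            stats[rejected_round]["FP"] += 1
    return correct / len(pairs), dict(stats)

# Hypothetical scores, for illustration only.
toy_pairs = [
    ("custom", "rd1", 0.9, 0.2),
    ("rd2", "rd1", 0.4, 0.6),
    ("custom", "rd3", 0.7, 0.1),
    ("rd3", "rd2", 0.5, 0.5),
]
accuracy, stats = tally(toy_pairs)
print(accuracy, stats)
```

Note that a tie counts as a wrong choice, as in `run_eval`. With 31 test pairs the accuracy moves in increments of 1/31, about 3.2 percentage points, and repeating the pass with `no_iter` does not change the ratio if the scoring is deterministic.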

In [19]:
best_iteration = 1
best_eval_result = 0

for i in range(1, 11):
    # Train the model
    training_result = dpo_trainer.train()
    eval_result = run_eval(model, tokenizer, 5)
    if eval_result >= best_eval_result:
        best_eval_result = eval_result
        best_iteration = i

    # Create a unique checkpoint directory for each iteration
    checkpoint_dir = f"checkpoint_iteration_{i}"
    os.makedirs(checkpoint_dir, exist_ok=True)

    # Save model and trainer states for this iteration
    dpo_trainer.save_model(checkpoint_dir)  # Save model and tokenizer
    dpo_trainer.save_state()  # Writes trainer_state.json to args.output_dir ("outputs"), not to checkpoint_dir

    print(f"\nEPOCH NO.{i}")
    print(f"TRAINING RESULT: {training_result}")
    print(f"TEST ACCURACY: {eval_result * 100:.2f}\n")

print(f"BEST ITERATION: {best_iteration}")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / rejected | logps / chosen | logits / rejected | logits / chosen |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.2666 | 10.036033 | 8.386921 | 1.0 | 1.649112 | -204.292664 | -237.738861 | -0.272538 | -0.240523 |
| 2 | 0.4147 | 10.214041 | 8.130713 | 0.875 | 2.083327 | -181.061188 | -167.657593 | -0.249372 | -0.218047 |
| 3 | 0.2704 | 9.085641 | 6.571654 | 0.875 | 2.513987 | -199.636658 | -198.674927 | -0.209525 | -0.204016 |
| 4 | 0.4812 | 8.826218 | 7.782429 | 0.875 | 1.043789 | -179.347168 | -179.33432 | -0.29253 | -0.31022 |
| 5 | 0.5258 | 10.586798 | 9.17892 | 0.75 | 1.407878 | -212.192001 | -229.905243 | -0.270067 | -0.202502 |
| 6 | 0.2901 | 10.294117 | 8.311008 | 0.875 | 1.983109 | -190.381683 | -190.156952 | -0.254159 | -0.246118 |
| 7 | 1.1074 | 8.537772 | 8.010983 | 0.75 | 0.52679 | -173.108887 | -152.761414 | -0.279681 | -0.265631 |
| 8 | 0.8488 | 8.397289 | 8.415031 | 0.5 | -0.017742 | -195.956451 | -188.871613 | -0.211068 | -0.253205 |
| 9 | 0.5265 | 8.796047 | 7.766211 | 0.625 | 1.029837 | -197.536224 | -185.90564 | -0.281337 | -0.316664 |
| 10 | 0.4314 | 10.399527 | 7.871281 | 0.75 | 2.528245 | -170.221741 | -187.089325 | -0.276053 | -0.217698 |


100%|██████████| 31/31 [00:08<00:00,  3.58it/s]
100%|██████████| 31/31 [00:08<00:00,  3.60it/s]
100%|██████████| 31/31 [00:08<00:00,  3.61it/s]
100%|██████████| 31/31 [00:08<00:00,  3.62it/s]



EPOCH NO.1
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.5346610642969608, metrics={'train_runtime': 99.5626, 'train_samples_per_second': 1.657, 'train_steps_per_second': 0.201, 'total_flos': 0.0, 'train_loss': 0.5346610642969608, 'epoch': 0.963855421686747})
TEST ACCURACY: 74.19



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / rejected | logps / chosen | logits / rejected | logits / chosen |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.1381 | 10.160806 | 7.903973 | 1.0 | 2.256833 | -209.122147 | -236.491119 | -0.27931 | -0.248044 |
| 2 | 0.2397 | 10.214753 | 7.650357 | 0.875 | 2.564396 | -185.864746 | -167.650482 | -0.253461 | -0.222836 |
| 3 | 0.1528 | 9.268312 | 6.258196 | 1.0 | 3.010116 | -202.77124 | -196.848221 | -0.215363 | -0.210671 |
| 4 | 0.3249 | 8.947947 | 7.504571 | 0.875 | 1.443375 | -182.125732 | -178.11705 | -0.298451 | -0.318211 |
| 5 | 0.3317 | 10.767699 | 8.88662 | 0.75 | 1.881079 | -215.114975 | -228.096222 | -0.276341 | -0.208922 |
| 6 | 0.2359 | 10.203085 | 7.954247 | 0.875 | 2.248837 | -193.94928 | -191.067261 | -0.262761 | -0.255194 |
| 7 | 0.974 | 8.602802 | 7.70508 | 0.75 | 0.897723 | -176.167908 | -152.111099 | -0.287973 | -0.273885 |
| 8 | 0.6059 | 8.508978 | 8.004772 | 0.5 | 0.504205 | -200.059036 | -187.754715 | -0.218557 | -0.261486 |
| 9 | 0.4242 | 8.713116 | 7.388798 | 0.75 | 1.324318 | -201.310349 | -186.734955 | -0.289887 | -0.32655 |
| 10 | 0.3228 | 10.439167 | 7.51617 | 0.875 | 2.922998 | -173.772858 | -186.692902 | -0.286683 | -0.228472 |


100%|██████████| 31/31 [00:08<00:00,  3.59it/s]
100%|██████████| 31/31 [00:08<00:00,  3.59it/s]
100%|██████████| 31/31 [00:08<00:00,  3.60it/s]
100%|██████████| 31/31 [00:08<00:00,  3.64it/s]
100%|██████████| 31/31 [00:08<00:00,  3.58it/s]



EPOCH NO.2
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.3926101766526699, metrics={'train_runtime': 99.5022, 'train_samples_per_second': 1.658, 'train_steps_per_second': 0.201, 'total_flos': 0.0, 'train_loss': 0.3926101766526699, 'epoch': 0.963855421686747})
TEST ACCURACY: 74.19



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / rejected | logps / chosen | logits / rejected | logits / chosen |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0887 | 10.279336 | 7.45604 | 1.0 | 2.823295 | -213.601471 | -235.305817 | -0.290689 | -0.260962 |
| 2 | 0.1742 | 10.120728 | 7.273875 | 1.0 | 2.846853 | -189.629578 | -168.590744 | -0.265701 | -0.236036 |
| 3 | 0.1086 | 9.260994 | 5.854968 | 1.0 | 3.406027 | -206.803528 | -196.921402 | -0.229142 | -0.226555 |
| 4 | 0.2481 | 8.998952 | 7.211423 | 1.0 | 1.787529 | -185.05722 | -177.606995 | -0.313245 | -0.334663 |
| 5 | 0.2359 | 10.765994 | 8.489451 | 0.875 | 2.276543 | -219.08667 | -228.113266 | -0.290124 | -0.224568 |
| 6 | 0.1823 | 9.948809 | 7.429451 | 1.0 | 2.519359 | -199.197266 | -193.610016 | -0.279625 | -0.27584 |
| 7 | 0.8617 | 8.514015 | 7.262347 | 0.75 | 1.251668 | -180.595245 | -152.998993 | -0.298272 | -0.285456 |
| 8 | 0.4334 | 8.492427 | 7.492272 | 0.75 | 1.000154 | -205.184036 | -187.920227 | -0.23366 | -0.276803 |
| 9 | 0.3368 | 8.489789 | 6.856879 | 0.875 | 1.63291 | -206.629547 | -188.968231 | -0.302734 | -0.340952 |
| 10 | 0.2677 | 10.26463 | 7.058152 | 0.875 | 3.206479 | -178.353043 | -188.438263 | -0.304511 | -0.248436 |


100%|██████████| 31/31 [00:08<00:00,  3.58it/s]
100%|██████████| 31/31 [00:08<00:00,  3.59it/s]
100%|██████████| 31/31 [00:08<00:00,  3.59it/s]
100%|██████████| 31/31 [00:08<00:00,  3.63it/s]
100%|██████████| 31/31 [00:08<00:00,  3.56it/s]



EPOCH NO.3
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.302729176543653, metrics={'train_runtime': 99.6024, 'train_samples_per_second': 1.657, 'train_steps_per_second': 0.201, 'total_flos': 0.0, 'train_loss': 0.302729176543653, 'epoch': 0.963855421686747})
TEST ACCURACY: 74.19



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0573,10.167374,6.8154,1.0,3.351973,-220.007874,-236.425446,-0.301909,-0.275898
2,0.1269,9.823265,6.669022,1.0,3.154243,-195.678101,-171.565338,-0.282711,-0.255476
3,0.0707,9.072429,5.225324,1.0,3.847105,-213.099976,-198.807068,-0.249412,-0.248725
4,0.1823,8.826136,6.736474,1.0,2.089663,-189.806732,-179.335159,-0.335221,-0.358446
5,0.1605,10.525642,7.869393,1.0,2.656249,-225.287247,-230.516785,-0.305722,-0.245313
6,0.1437,9.451942,6.674438,1.0,2.777504,-206.747391,-198.578674,-0.302134,-0.303324
7,0.6952,8.243609,6.561388,0.75,1.682222,-187.604828,-155.703049,-0.313135,-0.301735
8,0.3058,8.217446,6.678217,0.875,1.539229,-213.324585,-190.670044,-0.254913,-0.296101
9,0.2556,8.075628,6.04795,0.875,2.027679,-214.718826,-193.109818,-0.317802,-0.357599
10,0.2239,9.863302,6.36377,0.875,3.499532,-185.296844,-192.451569,-0.325067,-0.275397


100%|██████████| 31/31 [00:08<00:00,  3.57it/s]
100%|██████████| 31/31 [00:08<00:00,  3.60it/s]
100%|██████████| 31/31 [00:08<00:00,  3.59it/s]
100%|██████████| 31/31 [00:08<00:00,  3.63it/s]
100%|██████████| 31/31 [00:08<00:00,  3.62it/s]



EPOCH NO.4
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.2294985480606556, metrics={'train_runtime': 101.0268, 'train_samples_per_second': 1.633, 'train_steps_per_second': 0.198, 'total_flos': 0.0, 'train_loss': 0.2294985480606556, 'epoch': 0.963855421686747})
TEST ACCURACY: 77.42



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.038,9.822216,5.916945,1.0,3.905271,-228.992432,-239.877014,-0.317707,-0.297747
2,0.0839,9.315927,5.759857,1.0,3.556069,-204.76976,-176.638733,-0.301762,-0.278739
3,0.0445,8.651182,4.294091,1.0,4.357091,-222.412292,-203.019531,-0.274852,-0.278149
4,0.1373,8.384544,6.00879,1.0,2.375754,-197.083557,-183.751068,-0.358777,-0.385094
5,0.113,10.027969,6.987763,1.0,3.040206,-234.103561,-235.49353,-0.322741,-0.271473
6,0.1136,8.646535,5.625582,1.0,3.020953,-217.235947,-206.632767,-0.33112,-0.339061
7,0.5269,7.705906,5.610754,0.75,2.095152,-197.111176,-161.080078,-0.331109,-0.322093
8,0.2157,7.648954,5.580452,1.0,2.068502,-224.302231,-196.35495,-0.279608,-0.316761
9,0.177,7.445572,4.996677,0.875,2.448895,-225.231552,-199.410385,-0.332569,-0.374154
10,0.1883,9.181244,5.374662,1.0,3.806582,-195.187927,-199.272141,-0.346358,-0.306329


100%|██████████| 31/31 [00:08<00:00,  3.55it/s]
100%|██████████| 31/31 [00:08<00:00,  3.56it/s]
100%|██████████| 31/31 [00:08<00:00,  3.58it/s]
100%|██████████| 31/31 [00:08<00:00,  3.59it/s]
100%|██████████| 31/31 [00:08<00:00,  3.62it/s]



EPOCH NO.5
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.16852652840316296, metrics={'train_runtime': 99.6142, 'train_samples_per_second': 1.656, 'train_steps_per_second': 0.201, 'total_flos': 0.0, 'train_loss': 0.16852652840316296, 'epoch': 0.963855421686747})
TEST ACCURACY: 77.42



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0232,9.216618,4.71881,1.0,4.497808,-240.973785,-245.932983,-0.332129,-0.321011
2,0.0515,8.558767,4.550192,1.0,4.008575,-216.866394,-184.210327,-0.320935,-0.303955
3,0.03,7.921791,3.109271,1.0,4.81252,-234.260498,-210.313431,-0.304799,-0.312269
4,0.1015,7.6931,4.981934,1.0,2.711166,-207.352112,-190.665497,-0.385265,-0.414071
5,0.0785,9.299482,5.790985,1.0,3.508498,-246.071335,-242.778381,-0.33756,-0.300165
6,0.0917,7.582856,4.342556,1.0,3.2403,-230.066208,-217.269562,-0.361869,-0.377182
7,0.2997,6.945788,4.325758,0.875,2.620031,-209.961136,-168.681244,-0.347304,-0.340131
8,0.1426,6.909211,4.195903,1.0,2.713308,-238.14772,-203.752396,-0.307286,-0.337125
9,0.1022,6.640337,3.712451,1.0,2.927887,-238.073822,-207.462723,-0.34588,-0.388406
10,0.1549,8.332833,4.171471,1.0,4.161363,-207.219833,-207.756241,-0.367317,-0.339221


100%|██████████| 31/31 [00:08<00:00,  3.61it/s]
100%|██████████| 31/31 [00:08<00:00,  3.60it/s]
100%|██████████| 31/31 [00:08<00:00,  3.58it/s]
100%|██████████| 31/31 [00:08<00:00,  3.58it/s]
100%|██████████| 31/31 [00:08<00:00,  3.61it/s]



EPOCH NO.6
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.11555937328375876, metrics={'train_runtime': 99.4205, 'train_samples_per_second': 1.66, 'train_steps_per_second': 0.201, 'total_flos': 0.0, 'train_loss': 0.11555937328375876, 'epoch': 0.963855421686747})
TEST ACCURACY: 80.65



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0145,8.459227,3.33531,1.0,5.123917,-254.808777,-253.506897,-0.343768,-0.343019
2,0.0307,7.636424,3.138304,1.0,4.498119,-230.985275,-193.433777,-0.336218,-0.325609
3,0.0214,7.028944,1.697227,1.0,5.331717,-248.380936,-219.241913,-0.334707,-0.345693
4,0.0742,6.857978,3.711618,1.0,3.14636,-220.055298,-199.016754,-0.410105,-0.439329
5,0.0545,8.36452,4.396292,1.0,3.968228,-260.01828,-252.128021,-0.349596,-0.327684
6,0.0698,6.368424,2.836159,1.0,3.532265,-245.130173,-229.413864,-0.39255,-0.414344
7,0.1556,6.031879,2.850717,1.0,3.181162,-224.711533,-177.820343,-0.358732,-0.350229
8,0.0909,6.027017,2.617751,1.0,3.409266,-253.929245,-212.57431,-0.333518,-0.353587
9,0.0543,5.593799,2.206232,1.0,3.387567,-253.136017,-217.928101,-0.354798,-0.397245
10,0.125,7.363332,2.77374,1.0,4.589592,-221.197144,-217.451263,-0.385482,-0.365165


100%|██████████| 31/31 [00:08<00:00,  3.57it/s]
100%|██████████| 31/31 [00:08<00:00,  3.58it/s]
100%|██████████| 31/31 [00:08<00:00,  3.55it/s]
100%|██████████| 31/31 [00:08<00:00,  3.58it/s]
100%|██████████| 31/31 [00:08<00:00,  3.59it/s]



EPOCH NO.7
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.07826177226379513, metrics={'train_runtime': 99.316, 'train_samples_per_second': 1.661, 'train_steps_per_second': 0.201, 'total_flos': 0.0, 'train_loss': 0.07826177226379513, 'epoch': 0.963855421686747})
TEST ACCURACY: 80.65



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0094,7.574791,1.892511,1.0,5.68228,-269.236755,-262.351257,-0.356458,-0.364834
2,0.0207,6.646147,1.716602,1.0,4.929544,-245.202301,-203.336548,-0.35205,-0.346232
3,0.0165,6.03449,0.184904,1.0,5.849586,-263.50415,-229.186447,-0.369014,-0.382242
4,0.0457,5.995398,2.310977,1.0,3.684421,-234.061676,-207.642548,-0.435427,-0.463847
5,0.0364,7.408745,2.942909,1.0,4.465836,-274.552094,-261.68576,-0.365614,-0.357604
6,0.0581,5.035767,1.224272,1.0,3.811494,-261.249023,-242.740463,-0.423682,-0.449506
7,0.0792,5.183619,1.269996,1.0,3.913622,-240.518738,-186.302948,-0.373236,-0.363165
8,0.0584,5.105488,0.957071,1.0,4.148417,-270.536041,-221.789612,-0.368385,-0.378901
9,0.0258,4.515929,0.534121,1.0,3.981809,-269.857117,-228.706802,-0.368957,-0.409569
10,0.1014,6.323702,1.261396,1.0,5.062306,-236.320587,-227.847565,-0.407808,-0.396297


100%|██████████| 31/31 [00:08<00:00,  3.60it/s]
100%|██████████| 31/31 [00:08<00:00,  3.56it/s]
100%|██████████| 31/31 [00:08<00:00,  3.61it/s]
100%|██████████| 31/31 [00:08<00:00,  3.61it/s]
100%|██████████| 31/31 [00:08<00:00,  3.55it/s]



EPOCH NO.8
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.05257354283239692, metrics={'train_runtime': 99.218, 'train_samples_per_second': 1.663, 'train_steps_per_second': 0.202, 'total_flos': 0.0, 'train_loss': 0.05257354283239692, 'epoch': 0.963855421686747})
TEST ACCURACY: 80.65



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0067,6.667943,0.49579,1.0,6.172153,-283.203979,-271.419739,-0.376006,-0.393825
2,0.0142,5.775435,0.404108,1.0,5.371327,-258.327209,-212.043655,-0.368218,-0.367425
3,0.0097,5.096433,-1.365273,1.0,6.461706,-279.005951,-238.567017,-0.411348,-0.425165
4,0.0245,5.182413,0.843948,1.0,4.338464,-248.731979,-215.7724,-0.459779,-0.48688
5,0.0251,6.487335,1.506868,1.0,4.980468,-288.912506,-270.899841,-0.379287,-0.383917
6,0.0378,3.909957,-0.328091,1.0,4.238048,-276.772675,-253.998535,-0.455814,-0.481982
7,0.0381,4.402136,-0.318945,1.0,4.721081,-256.408142,-194.117783,-0.387699,-0.373645
8,0.0353,4.308119,-0.611593,1.0,4.919711,-286.222656,-229.763306,-0.405152,-0.405269
9,0.0119,3.420685,-1.289879,1.0,4.710564,-288.097107,-239.659256,-0.377303,-0.416191
10,0.0834,5.30601,-0.253038,1.0,5.559048,-251.46492,-238.024475,-0.431055,-0.425486


100%|██████████| 31/31 [00:08<00:00,  3.53it/s]
100%|██████████| 31/31 [00:08<00:00,  3.60it/s]
100%|██████████| 31/31 [00:08<00:00,  3.63it/s]
100%|██████████| 31/31 [00:08<00:00,  3.61it/s]
100%|██████████| 31/31 [00:08<00:00,  3.59it/s]



EPOCH NO.9
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.03288072601426393, metrics={'train_runtime': 99.2417, 'train_samples_per_second': 1.663, 'train_steps_per_second': 0.202, 'total_flos': 0.0, 'train_loss': 0.03288072601426393, 'epoch': 0.963855421686747})
TEST ACCURACY: 80.65



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0051,5.560723,-1.044103,1.0,6.604826,-298.602905,-282.491943,-0.392664,-0.418446
2,0.0096,4.883613,-0.920449,1.0,5.804061,-271.572784,-220.961884,-0.37957,-0.381686
3,0.0062,4.094967,-3.146424,1.0,7.24139,-296.817444,-248.58168,-0.450995,-0.462835
4,0.0146,4.318766,-0.616424,1.0,4.93519,-263.335693,-224.408844,-0.47654,-0.500897
5,0.019,5.429621,-0.080901,1.0,5.510522,-304.790192,-281.47702,-0.387577,-0.403319
6,0.0197,2.745926,-2.190654,1.0,4.93658,-295.398315,-265.638855,-0.481325,-0.506681
7,0.0188,3.688778,-1.889548,1.0,5.578326,-272.114197,-201.251343,-0.394726,-0.375992
8,0.0189,3.459095,-2.308346,1.0,5.767441,-303.190186,-238.25354,-0.439121,-0.428628
9,0.0058,2.366343,-3.162417,1.0,5.528761,-306.82251,-250.202667,-0.382638,-0.413481
10,0.0594,4.287225,-1.897419,1.0,6.184644,-267.908752,-248.212341,-0.444447,-0.447571


100%|██████████| 31/31 [00:08<00:00,  3.60it/s]
100%|██████████| 31/31 [00:08<00:00,  3.55it/s]
100%|██████████| 31/31 [00:08<00:00,  3.56it/s]
100%|██████████| 31/31 [00:08<00:00,  3.56it/s]
100%|██████████| 31/31 [00:08<00:00,  3.59it/s]



EPOCH NO.10
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.01872755497461185, metrics={'train_runtime': 99.3082, 'train_samples_per_second': 1.661, 'train_steps_per_second': 0.201, 'total_flos': 0.0, 'train_loss': 0.01872755497461185, 'epoch': 0.963855421686747})
TEST ACCURACY: 77.42

BEST ITERATION: 9
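The log above shows several epochs tied at 80.65, and the reported best iteration is 9. A minimal sketch of one selection rule that is consistent with that output is to take the latest epoch among those with the highest test accuracy. This is an assumption about the tie-breaking, not the notebook's confirmed logic, and the function name is hypothetical.

```python
# Test accuracies copied from the epoch 2-10 logs above
accuracy_by_epoch = {
    2: 74.19, 3: 74.19, 4: 77.42, 5: 77.42,
    6: 80.65, 7: 80.65, 8: 80.65, 9: 80.65, 10: 77.42,
}

def latest_best_epoch(acc_by_epoch):
    """Return the last epoch that reaches the maximum accuracy (assumed tie-break)."""
    best = max(acc_by_epoch.values())
    return max(epoch for epoch, acc in acc_by_epoch.items() if acc == best)

print(latest_best_epoch(accuracy_by_epoch))  # → 9
```

Since training loss keeps falling while test accuracy plateaus and then drops at epoch 10, picking the earliest tied epoch (6) would be an equally defensible choice.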


In [20]:
# Free as much GPU RAM as possible before loading the best checkpoint
import gc
import torch

del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()

In [21]:
# Make sure to have enough GPU RAM before running this
from unsloth import FastLanguageModel
from datasets import load_from_disk

best_checkpoint_dir = f"checkpoint_iteration_{best_iteration}"

model, tokenizer = FastLanguageModel.from_pretrained(best_checkpoint_dir)

eval_result = run_eval(model, tokenizer, 5)
print(f"\nTEST ACCURACY: {eval_result * 100:.2f}\n")

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


100%|██████████| 31/31 [00:07<00:00,  3.98it/s]
100%|██████████| 31/31 [00:07<00:00,  4.09it/s]
100%|██████████| 31/31 [00:07<00:00,  4.08it/s]
100%|██████████| 31/31 [00:07<00:00,  4.07it/s]
100%|██████████| 31/31 [00:07<00:00,  4.09it/s]


TEST ACCURACY: 80.65






In [22]:
!pip install huggingface_hub



In [23]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [24]:
model.save_pretrained("model", tokenizer, save_method="default")
model.push_to_hub("techandy42/Meta-Llama-3.1-8B-Instruct-bnb-4bit-fine-tuned", tokenizer, save_method="default")

Saved model to https://huggingface.co/techandy42/Meta-Llama-3.1-8B-Instruct-bnb-4bit-fine-tuned


In [25]:
# Free as much GPU RAM as possible before reloading the model from the Hub
import gc
import torch

del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()

In [26]:
from unsloth import FastLanguageModel

model_name = "techandy42/Meta-Llama-3.1-8B-Instruct-bnb-4bit-fine-tuned"
model, tokenizer = FastLanguageModel.from_pretrained(model_name)

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [27]:
def print_confusion_matrices(confusion_dict):
    """Print one confusion matrix per key, with each cell as a fraction of that key's total."""
    for key, values in confusion_dict.items():
        # Calculate total instances
        total = values['TP'] + values['TN'] + values['FP'] + values['FN']
        if total == 0:
            # Nothing was evaluated for this key, so there is nothing to normalize
            print(f"Confusion Matrix for {key}: no samples\n")
            continue

        # Calculate each cell as a fraction of the total (0 to 1, not a percentage)
        tp_fraction = values['TP'] / total
        tn_fraction = values['TN'] / total
        fp_fraction = values['FP'] / total
        fn_fraction = values['FN'] / total

        # Print the confusion matrix; the "Combined" row sums each predicted column
        print(f"Confusion Matrix for {key}:")
        print("-------------------------------------------------------")
        print("                Predicted Positive   Predicted Negative")
        print(f"Actual Positive           {tp_fraction:>8.2f}             {fn_fraction:>8.2f}")
        print(f"Actual Negative           {fp_fraction:>8.2f}             {tn_fraction:>8.2f}")
        print("-------------------------------------------------------")
        print(f"Combined                  {tp_fraction+fp_fraction:>8.2f}             {tn_fraction+fn_fraction:>8.2f}")
        print("-------------------------------------------------------\n")
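The printed matrix shows cell fractions but not the summary metrics that usually accompany it. The sketch below derives accuracy, precision, and recall from the same `TP`/`TN`/`FP`/`FN` keys. The function name and the example counts are hypothetical; the counts were picked only so that their fractions round to 0.33, 0.00, 0.13, and 0.53, and they are not the notebook's real counts.

```python
def summarize_confusion(counts):
    """Derive accuracy, precision and recall from raw confusion counts."""
    tp, tn, fp, fn = counts['TP'], counts['TN'], counts['FP'], counts['FN']
    total = tp + tn + fp + fn
    if total == 0:
        return None  # no samples, so no metric is defined
    return {
        'accuracy': (tp + tn) / total,
        # Fall back to 0.0 when a denominator is empty to avoid ZeroDivisionError
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts for illustration only
example_counts = {'TP': 5, 'TN': 8, 'FP': 2, 'FN': 0}
summary = summarize_confusion(example_counts)
print({name: round(value, 3) for name, value in summary.items()})
```

For a dict shaped like the `stats` returned by `run_eval(..., get_stats = True)`, calling this per key would give a per-round breakdown, assuming each value holds raw counts under those four keys.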

In [28]:
import pandas as pd

eval_result, stats = run_eval(model, tokenizer, 5, get_stats = True)
print(f"\nTEST ACCURACY: {eval_result * 100:.2f}\n")
print_confusion_matrices(stats)

100%|██████████| 31/31 [00:07<00:00,  4.02it/s]
100%|██████████| 31/31 [00:07<00:00,  3.88it/s]
100%|██████████| 31/31 [00:08<00:00,  3.79it/s]
100%|██████████| 31/31 [00:07<00:00,  4.04it/s]
100%|██████████| 31/31 [00:07<00:00,  4.03it/s]


TEST ACCURACY: 80.65

Confusion Matrix for rd1:
-------------------------------------------------------
                Predicted Positive   Predicted Negative
Actual Positive               0.33                 0.00
Actual Negative               0.13                 0.53
-------------------------------------------------------
Combined                      0.47                 0.53
-------------------------------------------------------

Confusion Matrix for rd2:
-------------------------------------------------------
                Predicted Positive   Predicted Negative
Actual Positive               0.36                 0.21
Actual Negative               0.07                 0.36
-------------------------------------------------------
Combined                      0.43                 0.57
-------------------------------------------------------

Confusion Matrix for rd3:
-------------------------------------------------------
                Predicted Positive   Predicted Negative
A




In [29]:
def preliminary_stats(dataset):
  """Return, per round, the fraction of its appearances in which it was the chosen response."""
  # Uses a single name, `datasets_dpo`, for the DPO dataset dict (the original mixed two names)
  NUM_ITEMS = len(datasets_dpo[dataset])
  ROUNDS = ['rd1', 'rd2', 'rd3', 'custom']
  reward_model_chosen = dict.fromkeys(ROUNDS, 0)
  reward_model_rejected = dict.fromkeys(ROUNDS, 0)
  reward_model_ratio = dict.fromkeys(ROUNDS, 0)
  for i in range(NUM_ITEMS):
    metadata = datasets_dpo[dataset][i]["metadata"][0]
    reward_model_chosen[metadata['chosen']] += 1
    reward_model_rejected[metadata['rejected']] += 1

  for ROUND in ROUNDS:
    appearances = reward_model_chosen[ROUND] + reward_model_rejected[ROUND]
    # Guard against rounds that never appear in this split
    reward_model_ratio[ROUND] = reward_model_chosen[ROUND] / appearances if appearances else 0

  return reward_model_ratio
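The ratio computed here is, for each round, chosen / (chosen + rejected): how often a response from that round wins when it appears in a preference pair. A self-contained sketch of the same calculation with `collections.Counter` follows. The function name and the tiny `records` list are hypothetical; only the `metadata[0]['chosen']` / `metadata[0]['rejected']` layout is taken from the cell above.

```python
from collections import Counter

def chosen_ratio(records, rounds=("rd1", "rd2", "rd3", "custom")):
    """Fraction of a round's appearances in which it was the chosen side."""
    chosen = Counter(r["metadata"][0]["chosen"] for r in records)
    rejected = Counter(r["metadata"][0]["rejected"] for r in records)
    ratios = {}
    for rd in rounds:
        appearances = chosen[rd] + rejected[rd]
        # None marks a round that never appears, instead of dividing by zero
        ratios[rd] = chosen[rd] / appearances if appearances else None
    return ratios

# Hypothetical preference pairs, for illustration only
records = [
    {"metadata": [{"chosen": "custom", "rejected": "rd1"}]},
    {"metadata": [{"chosen": "rd1", "rejected": "rd2"}]},
    {"metadata": [{"chosen": "custom", "rejected": "rd2"}]},
]
print(chosen_ratio(records))
```

A ratio well above 0.5 for one round, as for `custom` in the tables that follow, means the preference labels themselves favor that source, which is worth keeping in mind when reading the per-round confusion matrices.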

In [30]:
import pandas as pd

prelim_ratio_train = preliminary_stats('train')
df_prelim_ratio_train = pd.DataFrame(list(prelim_ratio_train.items()), columns=["Round", "Chosen"])
df_prelim_ratio_train

Unnamed: 0,Round,Chosen
0,rd1,0.367089
1,rd2,0.407407
2,rd3,0.3875
3,custom,0.8


In [31]:
import pandas as pd

prelim_ratio_test = preliminary_stats('test')
df_prelim_ratio_test = pd.DataFrame(list(prelim_ratio_test.items()), columns=["Round", "Chosen"])
df_prelim_ratio_test

Unnamed: 0,Round,Chosen
0,rd1,0.333333
1,rd2,0.571429
2,rd3,0.266667
3,custom,0.777778


Some other useful notebooks from Unsloth:
1. Mistral 7b 2x faster [free Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also wrote a [blog post](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. Gemma (trained on 6 trillion tokens) is 2.5x faster! [free Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)