### DPO Fine-Tuning - Mistral

> **Model Info**

- Model Name: Mistral 7B v3 Instruct (4-bit quantized)
- Accuracy: 61.29%

> **Training Info**

- GPU Type: A100
- Time: 28 mins
- GPU RAM: 11.4 GB

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# We have to check which Torch version for Xformers (2.3 -> 0.0.27)
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* DPO requires a model already trained by SFT on a dataset similar to the one used for DPO. This notebook runs that SFT stage itself on `unsloth/mistral-7b-instruct-v0.3-bnb-4bit` (see "Train the SFT model") before starting DPO. Alternatively, use this [notebook](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing) to train a SFT model separately.

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit", # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [3]:
#@title Alignment Handbook utils
import os
import re
from typing import List, Literal, Optional

from datasets import DatasetDict, concatenate_datasets, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError


DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"


def apply_chat_template(
    example, tokenizer, task: Literal["sft", "generation", "rm", "dpo"] = "sft", assistant_prefix="<|assistant|>\n"
):
    def _strip_prefix(s, pattern):
        # Use re.escape to escape any special characters in the pattern
        return re.sub(f"^{re.escape(pattern)}", "", s)

    if task in ["sft", "generation"]:
        messages = example["messages"]
        # We add an empty system message if there is none
        if messages[0]["role"] != "system":
            messages.insert(0, {"role": "system", "content": ""})
        example["text"] = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True if task == "generation" else False
        )
    elif task == "rm":
        if all(k in example.keys() for k in ("chosen", "rejected")):
            chosen_messages = example["chosen"]
            rejected_messages = example["rejected"]
            # We add an empty system message if there is none
            if chosen_messages[0]["role"] != "system":
                chosen_messages.insert(0, {"role": "system", "content": ""})
            if rejected_messages[0]["role"] != "system":
                rejected_messages.insert(0, {"role": "system", "content": ""})
            example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
        else:
            raise ValueError(
                f"Could not format example as dialogue for `rm` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
            )
    elif task == "dpo":
        if all(k in example.keys() for k in ("chosen", "rejected")):
            # Compared to reward modeling, we filter out the prompt, so the text is everything after the last assistant token
            prompt_messages = [[msg for msg in example["chosen"] if msg["role"] == "user"][0]]
            # Insert system message
            if example["chosen"][0]["role"] != "system":
                prompt_messages.insert(0, {"role": "system", "content": ""})
            else:
                prompt_messages.insert(0, example["chosen"][0])
            # TODO: handle case where chosen/rejected also have system messages
            # Custom templates to override Mistral's default chat template behaviours
            chosen_message_assistant_content = example["chosen"][1]['content']
            rejected_message_assistant_content = example["rejected"][1]['content']
            example["text_chosen"] = f""" {chosen_message_assistant_content} </s>"""
            example["text_rejected"] = f""" {rejected_message_assistant_content} </s>"""
            example["text_prompt"] = tokenizer.apply_chat_template(
                prompt_messages, tokenize=False, add_generation_prompt=True
            )
        else:
            raise ValueError(
                f"Could not format example as dialogue for `dpo` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
            )
    else:
        raise ValueError(
            f"Task {task} not supported, please ensure that the provided task is one of {['sft', 'generation', 'rm', 'dpo']}"
        )
    return example


def get_datasets(
    data_config: dict,
    splits: List[str] = ["train", "test"],
    shuffle: bool = True,
) -> DatasetDict:
    """
    Loads one or more datasets with varying training set proportions.

    Args:
        data_config (`DataArguments` or `dict`):
            Dataset configuration and split proportions.
        splits (`List[str]`, *optional*, defaults to `['train', 'test']`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.

    Returns:
        [`DatasetDict`]: The dataset dictionary containing the loaded datasets.
    """

    if type(data_config) is dict:
        # Structure of the input is:
        #     dataset_mixer = {
        #             "dataset1": 0.5,
        #             "dataset2": 0.3,
        #             "dataset3": 0.2,
        #         }
        dataset_mixer = data_config
    else:
        raise ValueError(f"Data config {data_config} not recognized.")

    raw_datasets = mix_datasets(dataset_mixer, splits=splits, shuffle=shuffle)
    return raw_datasets


def mix_datasets(dataset_mixer: dict, splits: Optional[List[str]] = None, shuffle=True) -> DatasetDict:
    """
    Loads and mixes datasets according to proportions specified in `dataset_mixer`.

    Args:
        dataset_mixer (`dict`):
            Dictionary containing the dataset names and their training proportions. By default, all test proportions are 1.
        splits (Optional[List[str]], *optional*, defaults to `None`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.
    """
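    # Fall back to the standard splits when none are given (iterating over None would fail)
    splits = ["train", "test"] if splits is None else splits
    raw_datasets = DatasetDict()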
    raw_train_datasets = []
    raw_val_datasets = []
    fracs = []
    for ds, frac in dataset_mixer.items():
        fracs.append(frac)
        for split in splits:
            try:
                # Try first if dataset on a Hub repo
                dataset = load_dataset(ds, split=split)
            except DatasetGenerationError:
                # If not, check local dataset
                dataset = load_from_disk(os.path.join(ds, split))

            if "train" in split:
                raw_train_datasets.append(dataset)
            elif "test" in split:
                raw_val_datasets.append(dataset)
            else:
                raise ValueError(f"Split type {split} not recognized as one of test or train.")

    if any(frac < 0 for frac in fracs):
        raise ValueError("Dataset fractions cannot be negative.")

    if len(raw_train_datasets) > 0:
        train_subsets = []
        for dataset, frac in zip(raw_train_datasets, fracs):
            train_subset = dataset.select(range(int(frac * len(dataset))))
            train_subsets.append(train_subset)
        if shuffle:
            raw_datasets["train"] = concatenate_datasets(train_subsets).shuffle(seed=42)
        else:
            raw_datasets["train"] = concatenate_datasets(train_subsets)
    # No subsampling for test datasets to enable fair comparison across models
    if len(raw_val_datasets) > 0:
        if shuffle:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets).shuffle(seed=42)
        else:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets)

    if len(raw_datasets) == 0:
        raise ValueError(
            f"Dataset {dataset_mixer} not recognized with split {split}. Check the dataset has been correctly formatted."
        )

    return raw_datasets

<a name="Data"></a>
### Data Prep
We follow Huggingface's [Alignment Handbook](https://github.com/huggingface/alignment-handbook) recipe for [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), but replace the Ultra Feedback dataset with `techandy42/debugger_llm_humaneval_dataset_v1`. Each row holds a buggy code sample plus four scored bug analyses (`rd1`, `rd2`, `rd3`, `custom`). Every pair of analyses with different scores becomes one chosen/rejected preference pair, and the higher-scored analysis of each pair is also used as an SFT target.

In [4]:
from datasets import load_dataset
import pandas as pd
from datasets import Dataset, DatasetDict

dataset = load_dataset('techandy42/debugger_llm_humaneval_dataset_v1')
df_train = pd.DataFrame(dataset['train'])
df_test = pd.DataFrame(dataset['test'])
df_parts = [df_train, df_test]

df_sft_train = []
df_sft_test = []
df_sft_parts = [df_sft_train, df_sft_test]

df_dpo_train = []
df_dpo_test = []
df_dpo_parts = [df_dpo_train, df_dpo_test]

PREFIXS = ['score_s1_', 'score_s2_', 'score_s3_', 'score_s4_', 'score_s5_', 'score_s6_']
ROUNDS = ['rd1', 'rd2', 'rd3', 'custom']
PAIRS = [('rd1', 'rd2'), ('rd1', 'rd3'), ('rd1', 'custom'), ('rd2', 'rd3'), ('rd2', 'custom'), ('rd3', 'custom')]

def indent_lines(string: str) -> str:
  indented_string = '\n'.join('    ' + line for line in string.splitlines())
  return indented_string

for df, df_sft, df_dpo in zip(df_parts, df_sft_parts, df_dpo_parts):
  for idx, row in df.iterrows():
      prompt = row['prompt']
      result = row['result']
      instruction = f"""### instruction:

  - The following buggy code is a wrong implementation that contains one or more bugs.
  - Firstly, find all of the bugs within the buggy code. Make sure to quotate each part of the buggy code that contains a bug.
  - Afterwards, for each of the bugs, describe the issue with each part of the buggy code with the bug, and outline how to fix the issue.
  - Make sure your answer covers (1) all of the existing bugs, (2) do not hallucinate non-existing bugs, and (3) be concise as possible.
  - IMPORTANT!: While abiding by the above instructions, keep your answer as brief as possible."""
      full_solution = "### buggy code\n\n" + (prompt + indent_lines(result)).strip('\n')
      full_instruction = instruction + "\n\n" + full_solution + "\n"
      solutions_info = {}
      for ROUND in ROUNDS:
        solutions_info[ROUND] = {}
        total_score = 0
        for PREFIX in PREFIXS:
          score_col = PREFIX + ROUND
          score = int(row[score_col][0])
          total_score += score
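        # Normalize the summed score. Assumption: 42 is the maximum total
        # (6 score columns, each presumably capped at 7).
        total_score /= 42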
        analysis_col = 'analysis_' + ROUND
        solutions_info[ROUND]['analysis'] = row[analysis_col]
        solutions_info[ROUND]['score'] = total_score
      for ROUND1, ROUND2 in PAIRS:
        round1_score = solutions_info[ROUND1]['score']
        round2_score = solutions_info[ROUND2]['score']
        round1_analysis = solutions_info[ROUND1]['analysis']
        round2_analysis = solutions_info[ROUND2]['analysis']
        if round1_score == round2_score:
          continue
        messages_info = {}
        messages_info['messages'] = [
            {'content': full_instruction, 'role': 'user'},
            {'content': round1_analysis if round1_score > round2_score else round2_analysis, 'role': 'assistant'}
        ]
        pairwise_info = {}
        pairwise_info['prompt'] = full_instruction
        pairwise_info['chosen'] = [
            {'content': full_instruction, 'role': 'user'},
            {'content': round1_analysis if round1_score > round2_score else round2_analysis, 'role': 'assistant'}
        ]
        pairwise_info['rejected'] = [
            {'content': full_instruction, 'role': 'user'},
            {'content': round1_analysis if round1_score < round2_score else round2_analysis, 'role': 'assistant'}
        ]
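        # Not part of training data, only for analysis.
        # Note: the trailing comma after the dict wraps it in a 1-element tuple,
        # which is why the eval code later reads it as metadata[0].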
        pairwise_info['metadata'] = {
            'chosen': ROUND1 if round1_score > round2_score else ROUND2,
            'rejected': ROUND1 if round1_score < round2_score else ROUND2,
        },
        df_sft.append(messages_info)
        df_dpo.append(pairwise_info)

df_sft_train = pd.DataFrame(df_sft_train)
df_sft_test = pd.DataFrame(df_sft_test)
dataset_sft_train = Dataset.from_pandas(df_sft_train)
dataset_sft_test = Dataset.from_pandas(df_sft_test)
datasets_sft = DatasetDict({
    'train': dataset_sft_train,
    'test': dataset_sft_test
})
df_dpo_train = pd.DataFrame(df_dpo_train)
df_dpo_test = pd.DataFrame(df_dpo_test)
dataset_dpo_train = Dataset.from_pandas(df_dpo_train)
dataset_dpo_test = Dataset.from_pandas(df_dpo_test)
datasets_dpo = DatasetDict({
    'train': dataset_dpo_train,
    'test': dataset_dpo_test
})

Downloading readme:   0%|          | 0.00/2.26k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/93.8k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/49.6k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/30 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6 [00:00<?, ? examples/s]

In [5]:
datasets_sft

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 165
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 31
    })
})

In [6]:
datasets_dpo

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'metadata'],
        num_rows: 165
    })
    test: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'metadata'],
        num_rows: 31
    })
})
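The pair construction in the cell above can be sketched in isolation. The dataset has 30 training and 6 test problems, and four analyses give 6 possible pairs per problem, so at most 180 and 36 pairs; the 165 and 31 rows shown here are what remains after tied pairs are skipped. The scores below are hypothetical values chosen for illustration, and `build_pairs` is a helper name introduced here, not part of the notebook.

```python
from itertools import combinations

ROUNDS = ["rd1", "rd2", "rd3", "custom"]

def build_pairs(scores):
    """Return (chosen, rejected) round names for every pair with unequal scores."""
    pairs = []
    # combinations(ROUNDS, 2) yields the same six pairs, in the same order, as PAIRS
    for first, second in combinations(ROUNDS, 2):
        if scores[first] == scores[second]:
            continue  # a tie carries no preference signal, so it is skipped
        if scores[first] > scores[second]:
            pairs.append((first, second))
        else:
            pairs.append((second, first))
    return pairs

# Hypothetical normalized scores for one problem (not taken from the dataset)
example_scores = {"rd1": 0.50, "rd2": 0.75, "rd3": 0.75, "custom": 1.00}
pairs = build_pairs(example_scores)
print(pairs)
```

With these scores the `rd2`/`rd3` tie is dropped, so the problem contributes five preference pairs instead of six.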

In [7]:
column_names = list(datasets_sft['train'].features)

sft_datasets = datasets_sft.map(
    apply_chat_template,
    fn_kwargs = {"tokenizer": tokenizer, "task": "sft"},
    num_proc = 12,
    remove_columns = column_names,
    desc = "Formatting comparisons with prompt template",
)

In [8]:
sft_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 165
    })
    test: Dataset({
        features: ['text'],
        num_rows: 31
    })
})

In [9]:
print(sft_datasets['train'][0]['text'])

<s>[INST] ### instruction:

  - The following buggy code is a wrong implementation that contains one or more bugs.
  - Firstly, find all of the bugs within the buggy code. Make sure to quotate each part of the buggy code that contains a bug.
  - Afterwards, for each of the bugs, describe the issue with each part of the buggy code with the bug, and outline how to fix the issue.
  - Make sure your answer covers (1) all of the existing bugs, (2) do not hallucinate non-existing bugs, and (3) be concise as possible.
  - IMPORTANT!: While abiding by the above instructions, keep your answer as brief as possible.

### buggy code

from typing import List


def parse_nested_parens(paren_string: str) -> List[int]:
    """ Input to this function is a string represented multiple groups for nested parentheses separated by spaces.
    For each of the group, output the deepest level of nesting of parentheses.
    E.g. (()()) has maximum two levels of nesting while ((())) has three.

    >>> parse_nest

In [10]:
column_names = list(datasets_dpo['train'].features)

dpo_datasets = datasets_dpo.map(
    apply_chat_template,
    fn_kwargs = {"tokenizer": tokenizer, "task": "dpo"},
    num_proc = 12,
    remove_columns = column_names,
    desc = "Formatting comparisons with prompt template",
)

dpo_datasets = dpo_datasets.rename_columns(
    {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"}
)

In [11]:
dpo_datasets

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'prompt'],
        num_rows: 165
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'prompt'],
        num_rows: 31
    })
})
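Each row now holds three plain strings. As a minimal sketch of what the `dpo` branch of `apply_chat_template` produces: the prompt is rendered by the tokenizer's chat template, while the two completions are the raw assistant texts wrapped in a leading space and a trailing `</s>`. The helper name `to_dpo_record` and the placeholder strings are hypothetical.

```python
EOS_TOKEN = "</s>"  # end-of-sequence marker hard-coded in the `dpo` branch

def to_dpo_record(prompt_text, chosen_answer, rejected_answer):
    """Build the three string fields passed to the DPO trainer."""
    return {
        "prompt": prompt_text,
        "chosen": f" {chosen_answer} {EOS_TOKEN}",
        "rejected": f" {rejected_answer} {EOS_TOKEN}",
    }

record = to_dpo_record(
    prompt_text="<prompt already rendered by the chat template>",
    chosen_answer="Higher-scored bug analysis.",
    rejected_answer="Lower-scored bug analysis.",
)
print(record["chosen"])
```

Keeping the prompt separate from the completions lets the trainer score only the response tokens of each candidate.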

In [12]:
print("=" * 10 + "PROMPT" + "=" * 10)
print(dpo_datasets['train'][0]['prompt'])
print("=" * 10 + "CHOSEN" + "=" * 10)
print(dpo_datasets['train'][0]['chosen'])
print("=" * 10 + "REJECTED" + "=" * 10)
print(dpo_datasets['train'][0]['rejected'])

<s>[INST] 

### instruction:

  - The following buggy code is a wrong implementation that contains one or more bugs.
  - Firstly, find all of the bugs within the buggy code. Make sure to quotate each part of the buggy code that contains a bug.
  - Afterwards, for each of the bugs, describe the issue with each part of the buggy code with the bug, and outline how to fix the issue.
  - Make sure your answer covers (1) all of the existing bugs, (2) do not hallucinate non-existing bugs, and (3) be concise as possible.
  - IMPORTANT!: While abiding by the above instructions, keep your answer as brief as possible.

### buggy code

from typing import List


def parse_nested_parens(paren_string: str) -> List[int]:
    """ Input to this function is a string represented multiple groups for nested parentheses separated by spaces.
    For each of the group, output the deepest level of nesting of parentheses.
    E.g. (()()) has maximum two levels of nesting while ((())) has three.

    >>> parse_ne

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [13]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
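The trainable-parameter count that the trainer later reports (167,772,160) can be reproduced by hand. The sketch below assumes the published Mistral 7B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values are not stated in this notebook. A LoRA adapter of rank `r` on a linear layer adds `r * (in_features + out_features)` weights.

```python
HIDDEN = 4096         # model hidden size (assumed from the Mistral 7B config)
INTERMEDIATE = 14336  # MLP inner size (assumed)
KV_DIM = 8 * 128      # 8 key/value heads x head dimension 128 (assumed)
NUM_LAYERS = 32
RANK = 64             # r passed to get_peft_model

# (in_features, out_features) of every targeted projection in one layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # Matrix A has shape (rank, in_features); matrix B has shape (out_features, rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 167,772,160
```

That is roughly 2% of a 7B-parameter model, which sits inside the "1 to 10%" range quoted above.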


### Train the SFT model

In [14]:
# Note: running eval is not necessary for this stage
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

sft_trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = sft_datasets['train'],
    # eval_dataset = sft_datasets['test'], # Uncomment to run eval
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        # evaluation_strategy = "steps", # Uncomment to run eval
        # eval_steps = 1, # Uncomment to run eval
    ),
)

In [15]:
sft_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss
1,1.662
2,1.6077
3,1.389
4,1.2017
5,0.9557
6,0.7509
7,0.6865
8,0.5869
9,0.5542
10,0.436


TrainOutput(global_step=20, training_loss=0.6564417362213135, metrics={'train_runtime': 53.8609, 'train_samples_per_second': 3.063, 'train_steps_per_second': 0.371, 'total_flos': 4547986229870592.0, 'train_loss': 0.6564417362213135, 'epoch': 0.963855421686747})
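The `Total steps = 20` and `epoch: 0.9638...` values follow from the batch settings. The arithmetic below is a sketch that matches the logged numbers; the exact bookkeeping inside the `Trainer` is an assumption here, not something this notebook shows.

```python
import math

num_examples = 165
per_device_batch_size = 2
grad_accum_steps = 4

# 165 examples in batches of 2 gives 83 batches (the last one holds a single example)
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# One optimizer step consumes 4 batches; the 3 leftover batches are not used
optimizer_steps = batches_per_epoch // grad_accum_steps

# Fraction of the epoch actually covered by those steps
reported_epoch = optimizer_steps * grad_accum_steps / batches_per_epoch

print(batches_per_epoch, optimizer_steps, reported_epoch)
```

So a nominal "1 epoch" run here covers 80 of the 83 batches, which is why the reported epoch is slightly below 1.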

<a name="Train"></a>
### Train the DPO model
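Now let's use Huggingface TRL's `DPOTrainer`! More docs here: [TRL DPO docs](https://huggingface.co/docs/trl/dpo_trainer). Each `train()` call below runs 1 epoch over the 165 training pairs. The training loop repeats that call up to 10 times and evaluates on the test pairs after each pass.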

In [16]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
PatchDPOTrainer()
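The quantities that `DPOTrainer` logs (`rewards / chosen`, `rewards / rejected`, `rewards / margins`) and the loss it minimizes can be sketched per example in plain Python. This is the standard sigmoid DPO objective with the `beta = 0.1` used in this notebook, not TRL's implementation, and the log-probabilities below are made-up inputs. With `ref_model = None` and a LoRA model, TRL to our understanding obtains the reference log-probabilities from the same model with its adapters disabled.

```python
import math

def dpo_terms(policy_chosen_logp, policy_rejected_logp,
              ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example implicit rewards, margin and sigmoid DPO loss."""
    # Implicit reward: scaled log-ratio between policy and reference
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically friendlier form
    loss = math.log1p(math.exp(-margin))
    return reward_chosen, reward_rejected, margin, loss

# Hypothetical summed log-probabilities of the two responses
r_c, r_r, margin, loss = dpo_terms(
    policy_chosen_logp=-60.0, policy_rejected_logp=-90.0,
    ref_chosen_logp=-70.0, ref_rejected_logp=-85.0,
)
print(r_c, r_r, margin, round(loss, 4))
```

A positive margin means the policy prefers the chosen response more strongly than the reference does; the loss approaches 0 as the margin grows and equals `log(2)` when the margin is 0.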

In [17]:
# Note: there is an issue running eval during training with trl's DPOTrainer & DPOConfig
from trl import DPOTrainer, DPOConfig
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 1,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = dpo_datasets['train'],
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)

In [18]:
from tqdm import tqdm
import torch

def run_eval(model, tokenizer, no_iter, get_stats = False):
  NUM_ITEMS = len(dpo_datasets['test'])
  num_chosen = 0
  ROUNDS = ['rd1', 'rd2', 'rd3', 'custom']
  stats = {}
  for ROUND in ROUNDS:
    stats[ROUND] = {'TP': 0, 'TN': 0, 'FP': 0, 'FN': 0}

  for _ in range(no_iter):
    for i in tqdm(range(NUM_ITEMS)):
      # Named example_pair to avoid shadowing the built-in input()
      example_pair = {
          "chosen": datasets_dpo['test'][i]["chosen"],
          "rejected": datasets_dpo['test'][i]["rejected"]
      }
      # metadata was stored as a 1-element sequence, hence the [0]
      chosen_round = datasets_dpo['test'][i]["metadata"][0]['chosen']
      rejected_round = datasets_dpo['test'][i]["metadata"][0]['rejected']

      # Apply the chat template to format the input
      formatted_input = apply_chat_template(example_pair, tokenizer, task="dpo")

      # Tokenize the inputs
      inputs_chosen = tokenizer(formatted_input["text_chosen"], return_tensors="pt", padding=True, truncation=True)
      inputs_rejected = tokenizer(formatted_input["text_rejected"], return_tensors="pt", padding=True, truncation=True)

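      # Score each response by the mean of the model's output logits.
      # This is a scalar proxy computed on the response text alone (no prompt),
      # not the implicit reward that DPO optimizes.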
      with torch.no_grad():
          reward_chosen = model(**inputs_chosen).logits.mean().item()
          reward_rejected = model(**inputs_rejected).logits.mean().item()
          # Model chose correctly
          if reward_chosen > reward_rejected:
              num_chosen += 1
              stats[chosen_round]['TP'] += 1
              stats[rejected_round]['TN'] += 1
          # Model chose wrongly
          else:
              stats[chosen_round]['FN'] += 1
              stats[rejected_round]['FP'] += 1

  if get_stats:
    return num_chosen / (no_iter * NUM_ITEMS), stats

  return num_chosen / (no_iter * NUM_ITEMS)
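When `get_stats = True`, `run_eval` also returns per-round counts. One way to read them, sketched below with hypothetical counts: `TP` and `FN` count the pairs where that round's analysis was the chosen one (ranked correctly or not), while `TN` and `FP` count the pairs where it was the rejected one. The `summarize` helper and its field names are introduced here for illustration only.

```python
def summarize(stats):
    """Turn per-round TP/TN/FP/FN counts into simple rates."""
    summary = {}
    for round_name, c in stats.items():
        as_chosen = c["TP"] + c["FN"]    # pairs where this round was preferred
        as_rejected = c["TN"] + c["FP"]  # pairs where this round was dispreferred
        appearances = as_chosen + as_rejected
        summary[round_name] = {
            "appearances": appearances,
            # Share of pairs involving this round that the model ranked correctly
            "pair_accuracy": (c["TP"] + c["TN"]) / appearances if appearances else None,
            # How often the model agreed when this round was the better answer
            "recall_when_chosen": c["TP"] / as_chosen if as_chosen else None,
        }
    return summary

# Hypothetical counts (not results from this notebook)
example_stats = {
    "rd1": {"TP": 2, "TN": 6, "FP": 2, "FN": 0},
    "custom": {"TP": 9, "TN": 0, "FP": 0, "FN": 3},
}
summary = summarize(example_stats)
print(summary["custom"])
```

Every evaluated pair adds one count to its chosen round and one to its rejected round, so the counts across rounds sum to twice the number of comparisons.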

In [19]:
best_iteration = 1
best_eval_result = 0

for i in range(1, 11):
    # Train the model
    training_result = dpo_trainer.train()
    eval_result = run_eval(model, tokenizer, 5)
    if eval_result >= best_eval_result:
        best_eval_result = eval_result
        best_iteration = i

    # Create a unique checkpoint directory for each iteration
    checkpoint_dir = f"checkpoint_iteration_{i}"
    os.makedirs(checkpoint_dir, exist_ok=True)

    # Save model and trainer states for this iteration
    dpo_trainer.save_model(checkpoint_dir)  # Save model and tokenizer
    dpo_trainer.save_state()  # Writes trainer_state.json to args.output_dir ("outputs"), not to checkpoint_dir

    print(f"\nEPOCH NO.{i}")
    print(f"TRAINING RESULT: {training_result}")
    print(f"TEST ACCURACY: {eval_result * 100:.2f}\n")

print(f"BEST ITERATION: {best_iteration}")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.2059,17.486738,11.099583,0.875,6.387156,-114.709145,-103.794418,-2.592376,-2.636911
2,1.4821,13.420164,10.678968,0.375,2.741197,-87.83683,-60.587536,-2.57896,-2.597474
3,0.3191,15.481264,9.783014,0.75,5.69825,-113.087212,-67.584427,-2.749223,-2.756385
4,1.0925,12.51598,10.704499,0.75,1.811479,-88.775757,-72.24305,-2.632526,-2.630468
5,0.3079,16.518633,10.448053,0.875,6.070578,-128.103088,-104.995911,-2.565186,-2.63174
6,0.6272,14.639753,10.499788,0.75,4.139966,-103.265785,-73.582321,-2.648309,-2.66404
7,0.9353,11.242149,10.095436,0.75,1.146713,-83.658531,-48.000607,-2.623477,-2.59552
8,1.1816,12.769236,12.441994,0.5,0.327241,-90.975098,-80.339554,-2.692555,-2.626049
9,0.8205,12.872093,9.438245,0.875,3.433848,-109.962776,-78.398232,-2.559258,-2.586359
10,0.4502,15.999444,10.856851,0.875,5.142592,-74.533989,-67.090225,-2.673793,-2.738875


100%|██████████| 31/31 [00:07<00:00,  4.29it/s]
100%|██████████| 31/31 [00:07<00:00,  4.39it/s]
100%|██████████| 31/31 [00:07<00:00,  4.40it/s]
100%|██████████| 31/31 [00:07<00:00,  4.34it/s]
100%|██████████| 31/31 [00:07<00:00,  4.36it/s]



EPOCH NO.1
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.6427940238267184, metrics={'train_runtime': 94.6682, 'train_samples_per_second': 1.743, 'train_steps_per_second': 0.211, 'total_flos': 0.0, 'train_loss': 0.6427940238267184, 'epoch': 0.963855421686747})
TEST ACCURACY: 64.52



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0339,17.069899,10.513454,1.0,6.556444,-120.570419,-107.962814,-2.614498,-2.659098
2,0.4978,13.889389,9.446057,0.75,4.443332,-100.165939,-55.895294,-2.600188,-2.618802
3,0.0523,15.971531,9.245825,1.0,6.725707,-118.459106,-62.681747,-2.767859,-2.774236
4,0.5107,12.995096,10.002812,0.875,2.992284,-95.792633,-67.451874,-2.638593,-2.632155
5,0.0525,16.912594,9.961214,1.0,6.95138,-132.971497,-101.05629,-2.568256,-2.632895
6,0.5473,14.317147,10.00659,0.75,4.310557,-108.197777,-76.808388,-2.642584,-2.658861
7,0.6018,11.48564,9.451754,0.75,2.033886,-90.095352,-45.565704,-2.611207,-2.581091
8,0.6881,13.047103,11.653055,0.75,1.394047,-98.864479,-77.560883,-2.674887,-2.60983
9,0.3971,12.829313,8.402725,0.875,4.426589,-120.317963,-78.826012,-2.537364,-2.564103
10,0.112,15.908272,10.282786,1.0,5.625485,-80.274635,-68.001945,-2.649931,-2.715069


100%|██████████| 31/31 [00:07<00:00,  4.34it/s]
100%|██████████| 31/31 [00:07<00:00,  4.38it/s]
100%|██████████| 31/31 [00:07<00:00,  4.36it/s]
100%|██████████| 31/31 [00:07<00:00,  4.42it/s]
100%|██████████| 31/31 [00:07<00:00,  4.38it/s]



EPOCH NO.2
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.3229036188684404, metrics={'train_runtime': 94.2645, 'train_samples_per_second': 1.75, 'train_steps_per_second': 0.212, 'total_flos': 0.0, 'train_loss': 0.3229036188684404, 'epoch': 0.963855421686747})
TEST ACCURACY: 61.29



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0115,17.030344,9.781694,1.0,7.248651,-127.888023,-108.358353,-2.576086,-2.621044
2,0.1513,13.7425,8.447684,1.0,5.294816,-110.149658,-57.364182,-2.557425,-2.569
3,0.0126,15.959352,8.100145,1.0,7.859208,-129.915894,-62.803539,-2.724726,-2.729984
4,0.1396,13.042749,9.217752,1.0,3.824998,-103.643242,-66.975342,-2.589444,-2.580452
5,0.0269,16.335495,8.79032,1.0,7.545176,-144.680435,-106.827278,-2.528184,-2.586926
6,0.3423,13.679081,8.588316,0.75,5.090766,-122.380524,-83.189056,-2.590475,-2.607025
7,0.276,11.26055,8.086741,0.75,3.17381,-103.745483,-47.816589,-2.554221,-2.517802
8,0.3232,12.839938,10.218803,0.75,2.621135,-113.207001,-79.632523,-2.614829,-2.55385
9,0.1201,12.160841,6.592023,0.875,5.568818,-138.424988,-85.510742,-2.481318,-2.50731
10,0.0607,15.364609,9.049615,1.0,6.314994,-92.606354,-73.438568,-2.591876,-2.656998


100%|██████████| 31/31 [00:07<00:00,  4.34it/s]
100%|██████████| 31/31 [00:07<00:00,  4.39it/s]
100%|██████████| 31/31 [00:07<00:00,  4.38it/s]
100%|██████████| 31/31 [00:07<00:00,  4.39it/s]
100%|██████████| 31/31 [00:07<00:00,  4.32it/s]



EPOCH NO.3
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.13440065409522503, metrics={'train_runtime': 94.2719, 'train_samples_per_second': 1.75, 'train_steps_per_second': 0.212, 'total_flos': 0.0, 'train_loss': 0.13440065409522503, 'epoch': 0.963855421686747})
TEST ACCURACY: 61.29



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.005,16.393404,8.301545,1.0,8.091858,-142.689514,-114.727753,-2.513676,-2.559609
2,0.0493,13.184301,7.00378,1.0,6.180521,-124.588699,-62.946163,-2.488934,-2.473504
3,0.0029,15.556278,6.233779,1.0,9.322498,-148.579559,-66.834297,-2.661726,-2.663812
4,0.0462,12.752784,7.831614,1.0,4.921171,-117.504623,-69.874992,-2.51491,-2.508444
5,0.0141,15.486708,7.354886,1.0,8.131821,-159.034775,-115.315163,-2.473802,-2.524514
6,0.0944,12.528237,6.329152,1.0,6.199085,-144.972168,-94.697495,-2.519625,-2.533144
7,0.1055,10.636617,5.986039,1.0,4.650578,-124.75251,-54.055923,-2.477791,-2.433579
8,0.0993,12.281914,8.057914,1.0,4.224,-134.815887,-85.212769,-2.540443,-2.485999
9,0.0256,10.965185,4.113669,1.0,6.851516,-163.208511,-97.4673,-2.409968,-2.43292
10,0.0498,14.355927,7.183382,1.0,7.172544,-111.268684,-83.525406,-2.515821,-2.578773


100%|██████████| 31/31 [00:07<00:00,  4.34it/s]
100%|██████████| 31/31 [00:07<00:00,  4.32it/s]
100%|██████████| 31/31 [00:07<00:00,  4.35it/s]
100%|██████████| 31/31 [00:07<00:00,  4.33it/s]
100%|██████████| 31/31 [00:07<00:00,  4.37it/s]



EPOCH NO.4
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.047222642818815073, metrics={'train_runtime': 94.3294, 'train_samples_per_second': 1.749, 'train_steps_per_second': 0.212, 'total_flos': 0.0, 'train_loss': 0.047222642818815073, 'epoch': 0.963855421686747})
TEST ACCURACY: 61.29



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.002,15.208192,6.130359,1.0,9.077833,-164.401367,-126.57988,-2.4479,-2.494377
2,0.0089,12.251267,4.897473,1.0,7.353794,-145.651764,-72.276512,-2.417133,-2.398059
3,0.0012,14.503137,3.750921,1.0,10.752216,-173.408142,-77.365692,-2.596666,-2.593427
4,0.0342,11.854548,5.670554,1.0,6.183993,-139.115204,-78.857361,-2.439608,-2.444109
5,0.0085,13.814229,4.757403,1.0,9.056827,-185.009613,-132.039932,-2.424088,-2.465765
6,0.0219,10.862192,3.445072,1.0,7.417119,-173.812943,-111.357933,-2.455781,-2.467067
7,0.0239,9.714256,3.472084,1.0,6.242172,-149.892059,-63.279533,-2.416323,-2.367369
8,0.0121,11.359886,4.985139,1.0,6.374747,-165.54364,-94.433044,-2.483187,-2.429292
9,0.0027,9.471012,1.410049,1.0,8.060964,-190.24472,-112.409027,-2.358019,-2.380399
10,0.0461,12.9326,4.942703,1.0,7.989897,-133.675476,-97.758667,-2.467826,-2.522833


100%|██████████| 31/31 [00:07<00:00,  4.33it/s]
100%|██████████| 31/31 [00:07<00:00,  4.32it/s]
100%|██████████| 31/31 [00:07<00:00,  4.40it/s]
100%|██████████| 31/31 [00:07<00:00,  4.36it/s]
100%|██████████| 31/31 [00:07<00:00,  4.35it/s]



EPOCH NO.5
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.01414787225658074, metrics={'train_runtime': 94.1492, 'train_samples_per_second': 1.753, 'train_steps_per_second': 0.212, 'total_flos': 0.0, 'train_loss': 0.01414787225658074, 'epoch': 0.963855421686747})
TEST ACCURACY: 61.29



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0012,13.918028,4.001942,1.0,9.916086,-185.685547,-139.481522,-2.409138,-2.454489
2,0.0014,11.464093,2.566092,1.0,8.898001,-168.965576,-80.148254,-2.377783,-2.370968
3,0.0006,13.095785,1.200212,1.0,11.895572,-198.915222,-91.439224,-2.560509,-2.554931
4,0.0113,10.735371,3.671999,1.0,7.063372,-159.100769,-90.049133,-2.401608,-2.41259
5,0.0013,12.98175,2.483486,1.0,10.498264,-207.748779,-140.364731,-2.396644,-2.431998
6,0.0053,9.324689,0.899299,1.0,8.425389,-199.270676,-126.732971,-2.421595,-2.430686
7,0.0077,9.118342,1.001659,1.0,8.116684,-174.596313,-69.23867,-2.37616,-2.327532
8,0.0006,10.714569,1.586738,1.0,9.127831,-199.527649,-100.886215,-2.441279,-2.391162
9,0.0011,8.140311,-1.298807,1.0,9.439117,-217.333282,-125.716049,-2.318836,-2.342012
10,0.0144,11.656889,2.989227,1.0,8.667662,-153.21022,-110.515778,-2.427141,-2.475651


100%|██████████| 31/31 [00:07<00:00,  4.34it/s]
100%|██████████| 31/31 [00:07<00:00,  4.37it/s]
100%|██████████| 31/31 [00:07<00:00,  4.41it/s]
100%|██████████| 31/31 [00:07<00:00,  4.37it/s]
100%|██████████| 31/31 [00:07<00:00,  4.39it/s]



EPOCH NO.6
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.003677869281909807, metrics={'train_runtime': 94.3712, 'train_samples_per_second': 1.748, 'train_steps_per_second': 0.212, 'total_flos': 0.0, 'train_loss': 0.003677869281909807, 'epoch': 0.963855421686747})
TEST ACCURACY: 61.29



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0006,12.730891,1.805838,1.0,10.925054,-207.646591,-151.352875,-2.364157,-2.407014
2,0.001,10.526731,0.82835,1.0,9.698381,-186.343018,-89.521866,-2.330943,-2.322726
3,0.0005,11.921943,-1.408714,1.0,13.330656,-225.004486,-103.177635,-2.513187,-2.507398
4,0.0034,9.962515,1.524406,1.0,8.43811,-180.576691,-97.777679,-2.355676,-2.365759
5,0.0015,11.383943,0.241559,1.0,11.142384,-230.16803,-156.342804,-2.360017,-2.393248
6,0.0007,7.955126,-2.253708,1.0,10.208834,-230.800751,-140.428604,-2.380611,-2.38768
7,0.0017,8.260278,-1.434314,1.0,9.694592,-198.956055,-77.819321,-2.332872,-2.284885
8,0.0003,9.725939,-1.003392,1.0,10.729331,-225.428955,-110.772507,-2.400465,-2.35329
9,0.0002,6.80214,-4.071895,1.0,10.874035,-245.064163,-139.097748,-2.277759,-2.302116
10,0.0027,10.783354,0.59932,1.0,10.184033,-177.109299,-119.251129,-2.390275,-2.432598


100%|██████████| 31/31 [00:07<00:00,  4.39it/s]
100%|██████████| 31/31 [00:07<00:00,  4.33it/s]
100%|██████████| 31/31 [00:07<00:00,  4.36it/s]
100%|██████████| 31/31 [00:07<00:00,  4.36it/s]
100%|██████████| 31/31 [00:07<00:00,  4.32it/s]



EPOCH NO.7
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.0010197769189289828, metrics={'train_runtime': 94.3893, 'train_samples_per_second': 1.748, 'train_steps_per_second': 0.212, 'total_flos': 0.0, 'train_loss': 0.0010197769189289828, 'epoch': 0.963855421686747})
TEST ACCURACY: 61.29



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0003,11.752533,-0.096419,1.0,11.848953,-226.669159,-161.136475,-2.326187,-2.3693
2,0.0006,9.884022,-0.800473,1.0,10.684494,-202.631241,-95.948975,-2.294929,-2.297041
3,0.0003,11.228052,-3.412835,1.0,14.640886,-245.045685,-110.116547,-2.478278,-2.474777
4,0.0011,9.307203,-0.22186,1.0,9.529063,-198.039352,-104.330811,-2.330377,-2.335844
5,0.0008,10.43976,-1.434091,1.0,11.873852,-246.924561,-165.784622,-2.328832,-2.362868
6,0.0004,6.868908,-4.193721,1.0,11.062629,-250.200867,-151.290771,-2.352134,-2.359386
7,0.0009,7.632857,-2.978974,1.0,10.61183,-214.402634,-84.093529,-2.299939,-2.253523
8,0.0002,8.85165,-2.68928,1.0,11.54093,-242.287811,-119.515396,-2.371817,-2.32844
9,0.0001,5.89269,-6.002529,1.0,11.895218,-264.370483,-148.192261,-2.251372,-2.275685
10,0.002,10.0061,-1.173496,1.0,11.179596,-194.837463,-127.023674,-2.366552,-2.406183


100%|██████████| 31/31 [00:07<00:00,  4.33it/s]
100%|██████████| 31/31 [00:07<00:00,  4.35it/s]
100%|██████████| 31/31 [00:07<00:00,  4.36it/s]
100%|██████████| 31/31 [00:07<00:00,  4.36it/s]
100%|██████████| 31/31 [00:07<00:00,  4.34it/s]



EPOCH NO.8
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.0005030055006727708, metrics={'train_runtime': 94.3844, 'train_samples_per_second': 1.748, 'train_steps_per_second': 0.212, 'total_flos': 0.0, 'train_loss': 0.0005030055006727708, 'epoch': 0.963855421686747})
TEST ACCURACY: 61.29



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0002,11.171094,-1.424526,1.0,12.595621,-239.950241,-166.950851,-2.308,-2.349729
2,0.0003,9.410428,-1.771025,1.0,11.181454,-212.336761,-100.684906,-2.276581,-2.282008
3,0.0002,10.786634,-4.689457,1.0,15.476091,-257.81192,-114.530724,-2.460821,-2.458696
4,0.0006,8.932558,-1.329904,1.0,10.262462,-209.119781,-108.077255,-2.318824,-2.320466
5,0.0005,9.938524,-2.463579,1.0,12.402103,-257.219421,-170.796997,-2.313315,-2.348157
6,0.0002,6.269001,-5.341078,1.0,11.610079,-261.674438,-157.289856,-2.337171,-2.343988
7,0.0005,7.325933,-3.957087,1.0,11.28302,-224.183777,-87.162773,-2.283825,-2.237442
8,0.0001,8.32304,-3.703193,1.0,12.026233,-252.426956,-124.801498,-2.358881,-2.315325
9,0.0001,5.343056,-6.89919,1.0,12.242247,-273.337097,-153.688599,-2.238035,-2.262041
10,0.0006,9.677377,-2.263463,1.0,11.940839,-205.737122,-130.310898,-2.355986,-2.393368


100%|██████████| 31/31 [00:07<00:00,  4.32it/s]
100%|██████████| 31/31 [00:07<00:00,  4.34it/s]
100%|██████████| 31/31 [00:07<00:00,  4.40it/s]
100%|██████████| 31/31 [00:07<00:00,  4.33it/s]
100%|██████████| 31/31 [00:07<00:00,  4.38it/s]



EPOCH NO.9
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.0002941340770348688, metrics={'train_runtime': 94.3794, 'train_samples_per_second': 1.748, 'train_steps_per_second': 0.212, 'total_flos': 0.0, 'train_loss': 0.0002941340770348688, 'epoch': 0.963855421686747})
TEST ACCURACY: 61.29



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 165 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 167,772,160


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.0001,10.777631,-2.299117,1.0,13.076747,-248.696121,-170.885498,-2.296215,-2.338429
2,0.0002,9.127869,-2.466324,1.0,11.594193,-219.289749,-103.510498,-2.265999,-2.272505
3,0.0001,10.484029,-5.55275,1.0,16.036779,-266.444855,-117.556778,-2.448926,-2.447422
4,0.0004,8.683245,-2.144832,1.0,10.828077,-217.269073,-110.570389,-2.308971,-2.309021
5,0.0003,9.593196,-3.217783,1.0,12.81098,-264.761475,-174.250275,-2.303347,-2.338207
6,0.0002,5.76773,-6.187394,1.0,11.955124,-270.137604,-162.302551,-2.328284,-2.334558
7,0.0004,7.107495,-4.620927,1.0,11.728422,-230.822144,-89.347137,-2.274072,-2.228734
8,0.0001,7.963809,-4.519857,1.0,12.483665,-260.593597,-128.393829,-2.349713,-2.305894
9,0.0001,4.842663,-7.843145,1.0,12.685808,-282.776672,-158.692535,-2.227715,-2.252648
10,0.0004,9.282381,-3.05716,1.0,12.339542,-213.674103,-134.260849,-2.348466,-2.384178


100%|██████████| 31/31 [00:07<00:00,  4.29it/s]
100%|██████████| 31/31 [00:07<00:00,  4.33it/s]
100%|██████████| 31/31 [00:07<00:00,  4.32it/s]
100%|██████████| 31/31 [00:07<00:00,  4.40it/s]
100%|██████████| 31/31 [00:07<00:00,  4.35it/s]



EPOCH NO.10
TRAINING RESULT: TrainOutput(global_step=20, training_loss=0.0002063493324499177, metrics={'train_runtime': 94.9647, 'train_samples_per_second': 1.737, 'train_steps_per_second': 0.211, 'total_flos': 0.0, 'train_loss': 0.0002063493324499177, 'epoch': 0.963855421686747})
TEST ACCURACY: 61.29

BEST ITERATION: 1
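
Every epoch shown in this log reports the same test accuracy (61.29), and the reported best iteration is 1. One selection rule consistent with that output is to keep the earliest iteration that reaches the highest accuracy. The notebook's actual selection code is not shown in this section, so the helper below (`pick_best_iteration`, a hypothetical name) is only a sketch of that assumed rule, run on made-up accuracies.

```python
def pick_best_iteration(accuracies):
    """Return the 1-based index of the earliest iteration with the highest accuracy."""
    if not accuracies:
        raise ValueError("accuracies must not be empty")
    best_index = 0
    for index, accuracy in enumerate(accuracies):
        # Strict '>' keeps the earliest iteration when several tie.
        if accuracy > accuracies[best_index]:
            best_index = index
    return best_index + 1


# Illustrative values only, not taken from the run above.
example_accuracies = [0.58, 0.61, 0.61]
print(pick_best_iteration(example_accuracies))
```

Under this rule a flat accuracy curve always selects the first iteration, so a tie says nothing about whether later epochs helped or hurt.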


In [20]:
# Free as much GPU RAM as possible before loading another model
import gc
import torch

del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()

In [21]:
# Make sure enough GPU RAM is free before running this cell
from unsloth import FastLanguageModel
from datasets import load_from_disk

best_checkpoint_dir = f"checkpoint_iteration_{best_iteration}"

model, tokenizer = FastLanguageModel.from_pretrained(best_checkpoint_dir)

eval_result = run_eval(model, tokenizer, 5)
print(f"\nTEST ACCURACY: {eval_result * 100:.2f}\n")

==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Will load checkpoint_iteration_1 as a legacy tokenizer.
100%|██████████| 31/31 [00:07<00:00,  4.34it/s]
100%|██████████| 31/31 [00:06<00:00,  4.50it/s]
100%|██████████| 31/31 [00:07<00:00,  4.42it/s]
100%|██████████| 31/31 [00:06<00:00,  4.45it/s]
100%|██████████| 31/31 [00:06<00:00,  4.50it/s]


TEST ACCURACY: 58.06






In [22]:
!pip install huggingface_hub



In [23]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [24]:
model.save_pretrained("model", tokenizer, save_method="default")
model.push_to_hub("techandy42/mistral-7b-instruct-v0.3-bnb-4bit-fine-tuned", tokenizer, save_method="default")

README.md:   0%|          | 0.00/609 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/671M [00:00<?, ?B/s]

Saved model to https://huggingface.co/techandy42/mistral-7b-instruct-v0.3-bnb-4bit-fine-tuned


In [25]:
# Free as much GPU RAM as possible before loading another model
import gc
import torch

del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()

In [26]:
from unsloth import FastLanguageModel

model_name = "techandy42/mistral-7b-instruct-v0.3-bnb-4bit-fine-tuned"
model, tokenizer = FastLanguageModel.from_pretrained(model_name)

==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


adapter_model.safetensors:   0%|          | 0.00/671M [00:00<?, ?B/s]

In [27]:
def print_confusion_matrices(confusion_dict):
    """Print each confusion matrix with cells as fractions of that key's total."""
    for key, values in confusion_dict.items():
        # Total number of evaluated instances for this key
        total = values['TP'] + values['TN'] + values['FP'] + values['FN']

        # Express each cell as a fraction of the total (0.00 to 1.00, not a percentage)
        tp_frac = values['TP'] / total
        tn_frac = values['TN'] / total
        fp_frac = values['FP'] / total
        fn_frac = values['FN'] / total

        # Print the confusion matrix; the "Combined" row holds the column sums
        print(f"Confusion Matrix for {key}:")
        print("-------------------------------------------------------")
        print("                Predicted Positive   Predicted Negative")
        print(f"Actual Positive           {tp_frac:>8.2f}             {fn_frac:>8.2f}")
        print(f"Actual Negative           {fp_frac:>8.2f}             {tn_frac:>8.2f}")
        print("-------------------------------------------------------")
        print(f"Combined                  {tp_frac+fp_frac:>8.2f}             {tn_frac+fn_frac:>8.2f}")
        print("-------------------------------------------------------\n")
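
The same `TP`/`TN`/`FP`/`FN` counts also yield precision, recall, and F1, which can be easier to compare across rounds than the raw matrix. The sketch below assumes the dictionary layout used by `print_confusion_matrices`. The function name `classification_metrics` and the example counts are hypothetical.

```python
def classification_metrics(counts):
    """Compute accuracy, precision, recall and F1 from TP/TN/FP/FN counts."""
    tp, tn, fp, fn = counts["TP"], counts["TN"], counts["FP"], counts["FN"]
    total = tp + tn + fp + fn

    def safe_div(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, total),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for illustration; not read from the evaluation above.
example_counts = {"TP": 3, "TN": 6, "FP": 4, "FN": 2}
metrics = classification_metrics(example_counts)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```
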

In [28]:
import pandas as pd

eval_result, stats = run_eval(model, tokenizer, 5, get_stats = True)
print(f"\nTEST ACCURACY: {eval_result * 100:.2f}\n")
print_confusion_matrices(stats)

100%|██████████| 31/31 [00:07<00:00,  4.33it/s]
100%|██████████| 31/31 [00:06<00:00,  4.43it/s]
100%|██████████| 31/31 [00:06<00:00,  4.51it/s]
100%|██████████| 31/31 [00:06<00:00,  4.46it/s]
100%|██████████| 31/31 [00:06<00:00,  4.48it/s]


TEST ACCURACY: 61.29

Confusion Matrix for rd1:
-------------------------------------------------------
                Predicted Positive   Predicted Negative
Actual Positive               0.20                 0.13
Actual Negative               0.27                 0.40
-------------------------------------------------------
Combined                      0.47                 0.53
-------------------------------------------------------

Confusion Matrix for rd2:
-------------------------------------------------------
                Predicted Positive   Predicted Negative
Actual Positive               0.50                 0.07
Actual Negative               0.21                 0.21
-------------------------------------------------------
Combined                      0.71                 0.29
-------------------------------------------------------

Confusion Matrix for rd3:
-------------------------------------------------------
                Predicted Positive   Predicted Negative
A




In [29]:
def preliminary_stats(dataset):
  """Return, per round, the share of its appearances in which it was the chosen response."""
  NUM_ITEMS = len(datasets_dpo[dataset])
  ROUNDS = ['rd1', 'rd2', 'rd3', 'custom']
  reward_model_chosen = dict.fromkeys(ROUNDS, 0)
  reward_model_rejected = dict.fromkeys(ROUNDS, 0)
  reward_model_ratio = dict.fromkeys(ROUNDS, 0)
  # Count how often each round supplied the chosen and the rejected response
  for i in range(NUM_ITEMS):
    chosen_round = datasets_dpo[dataset][i]["metadata"][0]['chosen']
    rejected_round = datasets_dpo[dataset][i]["metadata"][0]['rejected']

    reward_model_chosen[chosen_round] += 1
    reward_model_rejected[rejected_round] += 1

  for ROUND in ROUNDS:
    appearances = reward_model_chosen[ROUND] + reward_model_rejected[ROUND]
    # Leave the ratio at 0 for a round that never appears, to avoid dividing by zero
    if appearances:
      reward_model_ratio[ROUND] = reward_model_chosen[ROUND] / appearances

  return reward_model_ratio
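
The ratio above is, for each round, chosen / (chosen + rejected). A minimal standalone sketch of the same computation with `collections.Counter` follows. It assumes each record stores its round labels under `["metadata"][0]`, as in the cell above. The function name `chosen_ratio_by_round` and the three example records are hypothetical.

```python
from collections import Counter


def chosen_ratio_by_round(records, rounds=("rd1", "rd2", "rd3", "custom")):
    """For each round, return chosen / (chosen + rejected), or None if it never appears."""
    chosen = Counter(record["metadata"][0]["chosen"] for record in records)
    rejected = Counter(record["metadata"][0]["rejected"] for record in records)
    ratios = {}
    for name in rounds:
        appearances = chosen[name] + rejected[name]
        # None marks a round that is absent from the data.
        ratios[name] = chosen[name] / appearances if appearances else None
    return ratios


# Hypothetical records for illustration; not drawn from the real dataset.
example_records = [
    {"metadata": [{"chosen": "custom", "rejected": "rd1"}]},
    {"metadata": [{"chosen": "rd2", "rejected": "rd1"}]},
    {"metadata": [{"chosen": "rd1", "rejected": "rd2"}]},
]
example_ratios = chosen_ratio_by_round(example_records)
print(example_ratios)
```

Returning `None` for an absent round keeps "never seen" distinct from "seen but never chosen", which a plain 0 would blur.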

In [30]:
import pandas as pd

prelim_ratio_train = preliminary_stats('train')
df_prelim_ratio_train = pd.DataFrame(list(prelim_ratio_train.items()), columns=["Round", "Chosen"])
df_prelim_ratio_train

Unnamed: 0,Round,Chosen
0,rd1,0.367089
1,rd2,0.407407
2,rd3,0.3875
3,custom,0.8


In [31]:
import pandas as pd

prelim_ratio_test = preliminary_stats('test')
df_prelim_ratio_test = pd.DataFrame(list(prelim_ratio_test.items()), columns=["Round", "Chosen"])
df_prelim_ratio_test

Unnamed: 0,Round,Chosen
0,rd1,0.333333
1,rd2,0.571429
2,rd3,0.266667
3,custom,0.777778


Some other useful notebooks from Unsloth:
1. Mistral 7b 2x faster [free Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. Gemma 6 trillion tokens is 2.5x faster! [free Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)