## gemma-2b Fine-Tuning with ORPO (Odds Ratio Preference Optimization)

(Base case: full-precision model training without QLoRA/PEFT, so you will probably need a large GPU.)

Odds Ratio Preference Optimization (ORPO) by Jiwoo Hong, Noah Lee, and James Thorne studies the crucial role of SFT within the context of preference alignment. 

[Paper](https://arxiv.org/abs/2403.07691)

Using preference data, the method posits that a minor penalty for the disfavored generation, together with a strong adaptation signal toward the chosen response, is sufficient for preference-aligned SFT. Both are supplied by a simple log odds ratio term appended to the NLL loss.

ORPO is therefore a reference-model-free preference optimization algorithm: it eliminates the need for a separate preference alignment phase, which saves compute and memory.
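As a rough illustration of this objective, here is a minimal pure-Python sketch for a single preference pair. It assumes `chosen_logp` and `rejected_logp` are length-averaged log-probabilities of the two responses (so both are negative) and that `beta` weights the odds ratio term. The function names are hypothetical, and real implementations operate on batched tensors rather than floats.

```python
import math

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(nll: float, chosen_logp: float, rejected_logp: float,
              beta: float = 0.1) -> float:
    """NLL on the chosen response plus a weighted log odds ratio penalty."""
    log_odds_ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll + beta * odds_ratio_term

# The penalty shrinks when the chosen response is more likely than the rejected one.
print(orpo_loss(nll=1.0, chosen_logp=-0.2, rejected_logp=-1.5))
print(orpo_loss(nll=1.0, chosen_logp=-1.5, rejected_logp=-0.2))
```

No reference model appears anywhere in the computation: the only inputs are quantities from the model being trained.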

The official code can be found at [xfactlab/orpo](https://github.com/xfactlab/orpo).

### Expected dataset format

The ORPO trainer expects the same format as the DPO trainer: each example should include three entries, named as follows:

* prompt
* chosen
* rejected
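To make the three entries concrete, here is a small sketch with a made-up record and a helper that reports which required entries are absent. Both the record and the `missing_keys` helper are hypothetical and serve only as illustration.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

# Hypothetical record, for illustration only.
example = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France is a country in Europe.",
}

def missing_keys(record: dict) -> list:
    """Return the required entries that are absent from a record."""
    return [key for key in REQUIRED_KEYS if key not in record]

print(missing_keys(example))           # []
print(missing_keys({"prompt": "Hi"}))  # ['chosen', 'rejected']
```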

------

Read more [here](https://huggingface.co/docs/trl/main/en/orpo_trainer#expected-dataset-format)

----

ORPO does not require a reference model, unlike RLHF and DPO. In that sense, ORPO is computationally more efficient than RLHF and DPO in two respects:

- Memory allocation

- Fewer FLOPs per batch.
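The memory point can be made concrete with back-of-the-envelope arithmetic. The sketch below counts weights only (no gradients, optimizer state, or activations) and assumes roughly 2.5 billion parameters for gemma-2b stored in bfloat16. Both the parameter count and the helper name are assumptions, so treat the numbers as an order-of-magnitude estimate.

```python
def weights_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30

ASSUMED_PARAMS = 2.5e9  # rough parameter count assumed for gemma-2b
BF16_BYTES = 2          # bfloat16 uses 2 bytes per parameter

policy = weights_gib(ASSUMED_PARAMS, BF16_BYTES)
reference = weights_gib(ASSUMED_PARAMS, BF16_BYTES)  # frozen copy used by reference-based methods

print(f"policy only (ORPO):  {policy:.1f} GiB")
print(f"policy + reference:  {policy + reference:.1f} GiB")
```

A frozen reference model also needs its own forward pass on every batch, which is where the extra FLOPs come from.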


In [None]:
import random
import sys
import os

from config import CUDA_VISIBLE_DEVICES, USE_TORCH, CPU_NUMS  # from config

#  Set env variables that influence how PyTorch interacts with the system's hardware,
# specifically regarding CUDA for GPU usage and multi-threading for CPU operations.

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:3072"
os.environ["CUDA_VISIBLE_DEVICES"] = CUDA_VISIBLE_DEVICES
os.environ["USE_TORCH"] = USE_TORCH
os.environ["OMP_NUM_THREADS"] = CPU_NUMS  # export OMP_NUM_THREADS=1
os.environ["OPENBLAS_NUM_THREADS"] = CPU_NUMS  # export OPENBLAS_NUM_THREADS=1
os.environ["MKL_NUM_THREADS"] = CPU_NUMS  # export MKL_NUM_THREADS=1
os.environ["VECLIB_MAXIMUM_THREADS"] = CPU_NUMS  # export VECLIB_MAXIMUM_THREADS=1
os.environ["NUMEXPR_NUM_THREADS"] = CPU_NUMS  # export NUMEXPR_NUM_THREADS=1

📌 `os.environ["PYTORCH_CUDA_ALLOC_CONF"]` is set to `"max_split_size_mb:3072"`, which tells PyTorch's CUDA caching allocator not to split cached blocks larger than 3072 MB. Keeping large blocks intact can reduce memory fragmentation, which helps workloads that make many allocations of varying size.

📌 `os.environ["CUDA_VISIBLE_DEVICES"]` controls which GPUs PyTorch can see. Setting it to `"-1"` hides all GPUs, so PyTorch runs on the CPU.

📌 The variables `os.environ["OMP_NUM_THREADS"]`, `os.environ["OPENBLAS_NUM_THREADS"]`, `os.environ["MKL_NUM_THREADS"]`, `os.environ["VECLIB_MAXIMUM_THREADS"]`, and `os.environ["NUMEXPR_NUM_THREADS"]` all configure threading for the math libraries that PyTorch relies on (such as Intel MKL, OpenBLAS, and NumExpr). Setting them to the same value gives these libraries a consistent, limited number of CPU threads, which avoids resource contention and keeps performance predictable. `CPU_NUMS` is set to `"8"`, so each library uses up to 8 threads, making use of a multi-core CPU without over-subscription.
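The `config` module imported above is not shown in this notebook. Here is a minimal sketch of what it might contain: the values for `CUDA_VISIBLE_DEVICES` and `USE_TORCH` are assumptions, and `THREAD_VARS` and `thread_env` are hypothetical helpers added for illustration. Every value is a string because `os.environ` only accepts strings.

```python
# config.py (hypothetical contents)
CUDA_VISIBLE_DEVICES = "0"  # assumed: expose only the first GPU
USE_TORCH = "1"             # assumed value
CPU_NUMS = "8"              # thread cap shared by the math libraries

THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)

def thread_env(cpu_nums: str) -> dict:
    """Build the thread-related environment variables for a given cap."""
    return {name: cpu_nums for name in THREAD_VARS}

print(thread_env(CPU_NUMS))
```

These variables are typically read when the libraries are first loaded, so they should be set before `torch` is imported, as the cell above does.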

In [None]:
!pip install git+https://github.com/huggingface/trl.git -q
!pip install bitsandbytes accelerate ninja packaging wandb -q
!MAX_JOBS=6 pip install flash-attn --no-build-isolation --upgrade -q

In [None]:
!nvidia-smi

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import ORPOConfig, ORPOTrainer
from datasets import load_dataset
import multiprocessing

In [None]:
from huggingface_hub import login
login(token="Your HF Token")

In [None]:
# https://huggingface.co/docs/trl/main/en/orpo_trainer#trl.ORPOConfig
# https://www.philschmid.de/fine-tune-google-gemma


# in the following config, we combine the usual HF Trainer args with the ORPOConfig args (beta)

cfg = ORPOConfig(
    output_dir='content/gemma-2b-orpo',     # usual HF Trainer args: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.args
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=2,          # batch size per device during training
    gradient_accumulation_steps=2,          # number of batches to accumulate gradients over before each optimizer update
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=20,                       # log every 20 steps
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32
    learning_rate=5e-5,                     # learning rate
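    warmup_ratio=0.1,                       # ignored here: warmup_steps overrides warmup_ratio when both are set
    warmup_steps=100,                       # number of linear warmup steps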
    lr_scheduler_type="cosine",
    max_prompt_length=512,
    remove_unused_columns=False,
    max_length=1024,
    beta=0.1,                               # ORPO beta
    save_total_limit=3,                     # args related to saving the model...
    save_strategy="epoch",
    push_to_hub=True,
    report_to=['wandb'],                    # report metrics to Weights & Biases
    hub_model_id='anakin87/gemma-2b-orpo',
)
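To get a feel for what these settings imply, the following sketch computes the effective batch size and an approximate number of optimizer updates. The dataset size of 6000 examples and the single-GPU default are hypothetical, and the Trainer's own step counting may round slightly differently.

```python
import math

def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer update."""
    return per_device * grad_accum * num_devices

def optimizer_steps(num_examples: int, epochs: int, per_device: int,
                    grad_accum: int, num_devices: int = 1) -> int:
    """Approximate total number of optimizer updates over all epochs."""
    batch = effective_batch_size(per_device, grad_accum, num_devices)
    return math.ceil(num_examples / batch) * epochs

NUM_EXAMPLES = 6000  # hypothetical dataset size, for illustration only

print(effective_batch_size(per_device=2, grad_accum=2))                   # 4
print(optimizer_steps(NUM_EXAMPLES, epochs=3, per_device=2, grad_accum=2))  # 4500
```

Under these assumptions, the 100 warmup steps cover only a small fraction of training.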

In [None]:
model_id = "google/gemma-2b"

tokenizer_id = "philschmid/gemma-tokenizer-chatml"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",  # optional; requires the flash-attn package
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
tokenizer.padding_side = 'right' # to prevent warnings

In [None]:
# https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py

ds = load_dataset("alvarobartt/dpo-mix-7k-simplified")

def process(row):
    row["prompt"] = tokenizer.apply_chat_template(row["prompt"], tokenize=False)
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row

ds = ds.map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
train_dataset = ds["train"]
eval_dataset = ds["test"]

### How the `apply_chat_template()` method works in this case

After `apply_chat_template()` runs, each message is wrapped in the control tokens `<|im_start|>` and `<|im_end|>\n`. In the sample below, the prompt additionally starts with `<bos>`, and the chosen and rejected responses each end with `<eos>`.

This matters because Gemma models such as google/gemma-2b require inputs to always start with a `<bos>` token.

The inputs now look like this (only the first row is printed after applying `apply_chat_template()`):

```

{'dataset': 'argilla/distilabel-intel-orca-dpo-pairs',


'prompt': '<bos><|im_start|>user\nQ:Question: how old julio cesar chavez when he fought de la hoya I found the following answer on Google: He holds records for most successful consecutive defenses of world titles (27), most title fights (37), most title-fight victories (31) and he is after Joe Louis with (23) for most title defenses won by knockout (21). Is that a correct answer? Yes or no.\nA:<|im_end|>\n',

'chosen': "<|im_start|>assistant\n Yes, the information you found on Google is correct. Julio César Chávez holds several records related to world title defenses and victories, and he is considered one of the greatest boxers in history. Here is a detailed answer to your question:\n\nJulio César Chávez was born on July 12, 1962, in Ciudad Obregón, Sonora, Mexico. He began boxing at a young age and quickly made a name for himself in the sport, winning his first world title in 1984 when he defeated Mario Miranda for the WBC super featherweight championship.\n\nOver the course of his career, Chávez would go on to hold titles in three different weight classes (super featherweight, lightweight, and junior welterweight) and defend his titles a record 27 times, including 21 times by knockout. This record for most successful consecutive defenses of world titles has never been broken, and it is a testament to Chávez's skill, determination, and durability as a fighter.\n\nIn addition to his record for most successful title defenses, Chávez also holds the records for most title fights (37) and most title-fight victories (31). These records are also unbroken and demonstrate Chávez's consistent success and dominance in the ring.\n\nChávez's impressive record and achievements have earned him a place among the greatest boxers of all time, and he is often compared to other boxing legends such as Joe Louis, who holds the record for most title defenses won by knockout (23). While Chávez did not quite match Louis's record for most title defenses won by knockout, he is still widely regarded as one of the greatest boxers in history and a true legend of the sport.\n\nIn conclusion, the information you found on Google is correct: Julio César Chávez holds several records related to world title defenses and victories, and he is considered one of the greatest boxers in history. 
His impressive record and achievements have earned him a place among the greatest boxers of all time, and he will always be remembered as a true legend of the sport.<|im_end|>\n<eos>",

'rejected': '<|im_start|>assistant\nNo, the answer provided does not directly answer the question regarding the age of Julio Cesar Chavez when he fought Oscar De La Hoya. The provided information describes some general records held by Julio Cesar Chavez throughout his career. To answer your original question, let me provide the relevant information:\n\nJulio Cesar Chavez fought Oscar De La Hoya on June 7, 1996, in a match called "Ultimate Glory." Chavez was born on July 12, 1962. To calculate his age at the time of the fight, we need to find the difference between the fight date and his birthdate.\n\nFrom July 12, 1962, to June 7, 1996, there are:\n- 33 years (from 1962 to 1995)\n- An additional year from his birthday in 1995 (July 12, 1995) to the fight date in 1996 (June 7, 1996), which is approximately 10 months and 26 days.\n\nTherefore, Julio Cesar Chavez was about 33 years and 10 months old when he fought Oscar De La Hoya.<|im_end|>\n<eos>'}

```
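The structure of this output can be reproduced with a few lines of plain Python. The sketch below only approximates the formatting: the real template is stored with the tokenizer, and the rules used here for when `<bos>` and `<eos>` appear are inferred from the printed sample, not read from the tokenizer. `format_chatml` is a hypothetical helper.

```python
BOS, EOS = "<bos>", "<eos>"

def format_chatml(messages: list) -> str:
    """Approximate the ChatML formatting seen in the printed sample."""
    text = ""
    # Assumption: <bos> is added when the conversation opens with a user or system turn.
    if messages and messages[0]["role"] in ("user", "system"):
        text += BOS
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    # Assumption: <eos> is added when the conversation ends with an assistant turn.
    if messages and messages[-1]["role"] == "assistant":
        text += EOS
    return text

print(repr(format_chatml([{"role": "user", "content": "Hi"}])))
print(repr(format_chatml([{"role": "assistant", "content": "Hello!"}])))
```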

---------

📌 The `ds.map` method from the Hugging Face `datasets` library applies a specified function to each element of the dataset.

📌 Here, `ds.map` applies the `process` function to each example (row) in the dataset `ds`. Three columns of every row are reformatted, because `process` rewrites each of `row["prompt"]`, `row["chosen"]`, and `row["rejected"]`.

---------

📌 The method iterates over the dataset and calls `process` on each row, which updates the "prompt", "chosen", and "rejected" fields of that example according to the logic defined in `process`.

📌 Setting `num_proc=multiprocessing.cpu_count()` instructs the method to use as many processes as there are CPU cores, which enables parallel processing. This is especially beneficial for large datasets, as each process handles a subset of the dataset independently.

📌 The `load_from_cache_file=False` argument ensures that the processing function is applied to the dataset even if a cached file is available. This is particularly useful during development or when the processing function is updated and you want to make sure that the latest version is applied to the dataset.

📌 Through `ds.map`, `apply_chat_template` formats the specified fields of every row according to the chat template, without tokenizing them (`tokenize=False`). The result is an updated dataset in which every entry is consistently formatted for the later training and evaluation stages.
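After the map step, a quick sanity check on a few rows can catch formatting problems before training starts. The helper below is a hypothetical sketch: it checks that the three fields are strings and, based on the printed sample, that the prompt starts with `<bos>`.

```python
def check_formatted_row(row: dict) -> list:
    """Return a list of problems found in one formatted row (empty if none)."""
    problems = []
    for key in ("prompt", "chosen", "rejected"):
        if not isinstance(row.get(key), str):
            problems.append(f"{key} is not a string")
    prompt = row.get("prompt")
    if isinstance(prompt, str) and not prompt.startswith("<bos>"):
        problems.append("prompt does not start with <bos>")
    return problems

# Hypothetical rows, for illustration only.
good_row = {"prompt": "<bos><|im_start|>user\nHi<|im_end|>\n",
            "chosen": "<|im_start|>assistant\nHello!<|im_end|>\n<eos>",
            "rejected": "<|im_start|>assistant\n...<|im_end|>\n<eos>"}
print(check_formatted_row(good_row))  # []
```

In the notebook, the same helper could be called as `check_formatted_row(train_dataset[0])`.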

In [None]:
train_dataset[0]

In [None]:
import wandb
run = wandb.init(project="YOUR_PROJECT_NAME")

In [None]:
orpo_trainer = ORPOTrainer(
    model=model,
    args=cfg,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)

In [None]:
orpo_trainer.train()

In [None]:
orpo_trainer.push_to_hub()