# Direct Preference Optimization (DPO)

## Motivation

In many machine learning tasks, especially those involving language models or recommendation systems, it's often hard to define an objective function that truly captures what users want. Traditional supervised learning relies on labeled data, but it doesn't account for preferences or nuanced feedback. This is where **Direct Preference Optimization (DPO)** becomes useful.

## How it works (High Level)

DPO is a method that trains models directly from preference data—comparisons between outputs—rather than relying on explicit reward models or fine-tuning via reinforcement learning. It's particularly effective when you have access to user choices (like "A is better than B") but not exact labels or scores. This makes DPO a practical approach for aligning models with human preferences in a more stable and sample-efficient way. It's a good fit when you care more about relative quality than absolute correctness.

## How it really works (Teacher Forcing)

DPO is implemented using a method called **Teacher Forcing**. In this setup, we compute how likely the model is to generate each response in the pair. The goal is simple: we want the model to assign a **higher probability** to the *chosen* response and a **lower probability** to the *rejected* one.

During training, we reward the model when it acts more like it would generate the chosen output, and we penalize it when it acts more like it would generate the rejected one. This nudges the model toward learning behaviors that match human preferences, without needing to generate responses or interact with an environment during training.

## Dataset

To use Direct Preference Optimization (DPO), you need a dataset made of **pairs of model outputs**, where one is marked as *preferred* (or *chosen*) and the other as *rejected*. These are often called **(chosen, rejected)** or **(preferred, dispreferred)** pairs. Each pair comes from the same prompt or input, and the key idea is that we know which of the two outputs is better, but we don't need to know *why* or by how much.

**Important Note: Use the notebook [Dataset-Creation-DPO.ipynb](Dataset-Creation-DPO.ipynb) to prepare DPO dataset before running this notebook.**

## Quantization

Training or running large language models often requires significant memory, which can be a challenge on limited hardware. **Quantization** helps by reducing the size of the model weights, typically from 16 or 32-bit floats down to 8-bit or even 4-bit integers. This makes the model smaller and faster, with minimal impact on performance in most cases. In this project, we use the **BitsAndBytes** library to apply quantization efficiently.

## Parameter Efficient Fine Tuning (PEFT) with Low Rank Adaptation (LoRA)

Fine-tuning large models from scratch can be expensive and slow. **PEFT** techniques aim to reduce this cost by updating only a small number of parameters. One popular method is **Low Rank Adaptation (LoRA)**, which injects small trainable matrices into the model's layers without changing the original weights. This allows for efficient fine-tuning with fewer resources. We use the **peft** library to implement LoRA in our experiments.

## Signal to Noise Ratio (SNR) with Spectrum

Not all layers in a model contribute equally to learning during fine-tuning. **Signal to Noise Ratio (SNR)** helps identify which layers are more useful to focus on by comparing meaningful signal to background noise in the weight updates. This can guide efficient adaptation and avoid overfitting. We use the **spectrum** library to compute and analyze SNR during training.

## Metrics

To evaluate our fine-tuned language model, we use several metrics:

- **accuracy**: Measures how often the model's output matches the expected output exactly.
- **bleu**: A precision-based metric that compares n-gram overlap between generated and reference texts, commonly used in translation tasks.
- **rouge1, rouge2, rougeL**: Recall-based metrics that check how many unigrams (rouge1), bigrams (rouge2), or longest common sequences (rougeL) overlap with the target.
- **bertscore_precision, bertscore_recall, bertscore_f1**: Use BERT embeddings to compare the similarity of generated and reference sentences on a deeper, semantic level.
- **avg_levenshtein**: Measures the average number of edits (insertions, deletions, substitutions) needed to change the model output into the reference text, useful for judging closeness in form.

Together, these metrics give a balanced view of both surface-level accuracy and deeper semantic alignment.

## Note on Runtime

Due to limited compute and time, the experiments and metrics in this guide were run in under 2 hours. In practice, fine-tuning large models usually requires multiple days or even weeks to achieve optimal results.

# Imports

In [1]:
import os
import torch
import numpy as np
import Levenshtein
from datasets import Dataset, load_dataset, DatasetDict
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, PeftModel, AutoPeftModelForCausalLM
from trl import DPOConfig, DPOTrainer
import evaluate
import json
import pandas as pd
from transformers import set_seed
import random
import numpy as np

  from .autonotebook import tqdm as notebook_tqdm


# Print Versions

In [2]:
import sys
import platform
import os
import subprocess


def print_env_info():
    print("\n🌱 Welcome to your Environment Info Report! 🌱\n")

    # Python & OS
    print(f"🐍 Python:       {platform.python_implementation()} {platform.python_version()}")
    print(f"💻 Platform:     {platform.system()} {platform.release()} ({platform.machine()})\n")

    # CUDA & GPU via PyTorch
    try:
        import torch
        cuda_avail = torch.cuda.is_available()
        print(f"🚀 CUDA Available: {cuda_avail}")
        if cuda_avail:
            print(f"   • CUDA Version:   {torch.version.cuda}")
            print(f"   • cuDNN Version:  {torch.backends.cudnn.version()}")
            n_gpus = torch.cuda.device_count()
            print(f"   • GPU Count:      {n_gpus}")
            for i in range(n_gpus):
                print(f"     - GPU {i}:      {torch.cuda.get_device_name(i)}")
    except ImportError:
        print("🚫 PyTorch not installed, skipping CUDA/GPU info")

    # nvcc (if available)
    try:
        out = subprocess.check_output(['nvcc', '--version'], stderr=subprocess.STDOUT)
        release = [l for l in out.decode().splitlines() if "release" in l]
        print(f"\n📦 nvcc: {release[-1].strip()}")
    except Exception:
        print("\n📦 nvcc: not found in PATH")

    # nvidia-smi
    try:
        out = subprocess.check_output([
            'nvidia-smi',
            '--query-gpu=name,driver_version,memory.total',
            '--format=csv,noheader'], stderr=subprocess.DEVNULL
        ).decode().strip().splitlines()
        print("\n📊 nvidia-smi info:")
        for line in out:
            print("   " + line)
    except Exception:
        print("\n📊 nvidia-smi: not available")

    # Helper to show versions
    def show_ver(label, import_name=None):
        try:
            m = __import__(import_name or label)
            v = getattr(m, '__version__', None) or getattr(m, 'VERSION', None) or str(m)
            print(f"🔖 {label:<15} version: {v}")
        except ImportError:
            print(f"🔖 {label:<15} not installed")

    # Popular ML/LLM libraries
    print("\n📚 Library versions:")
    libs = [
        ('torch',      None),
        ('torchvision', None),
        ('torchaudio',  None),
        ('transformers', None),
        ('accelerate',  None),
        ('trl',         'trl'),
        ('peft',        'peft'),
        ('deepspeed',   None),
        ('bitsandbytes', None),
        ('datasets',    'datasets'),
        ('evaluate',    'evaluate'),
        ('tokenizers',  None),
        ('sentencepiece', None),
        ('huggingface_hub', None),
        ('numpy',       'numpy'),
        ('scipy',       'scipy'),
        ('pandas',      'pandas'),
        ('scikit-learn','sklearn'),
        ('wandb',       'wandb'),
        ('tensorboard', 'tensorboard'),
        ('mlflow',      'mlflow'),
    ]
    for label, name in libs:
        show_ver(label, name)

    # Conda env and key env vars
    print("\n🔧 Conda env:", os.getenv('CONDA_DEFAULT_ENV', '(none)'))
    important_vars = ['CUDA_HOME', 'CUDA_PATH', 'LD_LIBRARY_PATH', 'HF_HOME', 'HF_DATASETS_CACHE']
    print("\n🌐 Environment variables:")
    for var in important_vars:
        print(f"   - {var:<15} = {os.getenv(var, '')}")

    print("\n✨ All set! Keep growing and training with confidence! ✨\n")

print_env_info()


🌱 Welcome to your Environment Info Report! 🌱

🐍 Python:       CPython 3.12.7
💻 Platform:     Linux 6.11.0-25-generic (x86_64)

🚀 CUDA Available: True
   • CUDA Version:   12.4
   • cuDNN Version:  90100
   • GPU Count:      1
     - GPU 0:      NVIDIA GeForce RTX 4070 Laptop GPU

📦 nvcc: Cuda compilation tools, release 12.1, V12.1.105

📊 nvidia-smi info:
   NVIDIA GeForce RTX 4070 Laptop GPU, 570.133.07, 8188 MiB

📚 Library versions:
🔖 torch           version: 2.6.0+cu124
🔖 torchvision     version: 0.21.0+cu124
🔖 torchaudio      version: 2.6.0+cu124
🔖 transformers    version: 4.51.3
🔖 accelerate      version: 1.6.0
🔖 trl             version: 0.17.0
🔖 peft            version: 0.15.2
🔖 deepspeed       not installed
🔖 bitsandbytes    version: 0.45.5
🔖 datasets        version: 3.6.0
🔖 evaluate        version: 0.4.3
🔖 tokenizers      version: 0.21.1
🔖 sentencepiece   version: 0.2.0
🔖 huggingface_hub version: 0.31.2
🔖 numpy           version: 2.2.5
🔖 scipy           version: 1.15.3
🔖 pandas

# Seed and ENV

In [3]:
WANDB_TOKEN = "WANDB-TOKEN-GOES-HERE"
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
SEED = 69

In [4]:
os.environ["WANDB_API_KEY"] = WANDB_TOKEN
os.environ["WANDB_PROJECT"] = "tinyllama-dpo"

In [5]:
def seed_everything(seed: int = SEED):
    # Python built-in
    random.seed(seed)

    # Numpy
    np.random.seed(seed)

    # PyTorch
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True  # Slower but reproducible
    torch.backends.cudnn.benchmark = False

    # Environment variables
    os.environ["PYTHONHASHSEED"] = str(seed)

    # Hugging Face Transformers
    set_seed(seed)

In [6]:
seed_everything(SEED)

# Quantization

In [7]:
# 1) Quantization setup
def setup_4bit_quant(
    quant_type: str = "nf4",
    use_double: bool = True,
    dtype: torch.dtype = torch.float16,
) -> BitsAndBytesConfig:
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type=quant_type,
        bnb_4bit_use_double_quant=use_double,
        bnb_4bit_compute_dtype=dtype,
    )

# Load Model and Tokenizer

In [8]:
# 2) Model & tokenizer loading
def load_model_tokenizer(
    model_name: str,
    bnb_config: BitsAndBytesConfig = None,
    device_map: str = "auto",
    quantized: bool = True,
) -> tuple[torch.nn.Module, AutoTokenizer]:
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if tokenizer.pad_token_id is None:
        tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})
    if quantized and bnb_config is not None:
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map=device_map,
        )
        model.config.use_cache = False
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map=device_map,
        )
        model.config.use_cache = True
    model.resize_token_embeddings(len(tokenizer))
    return model, tokenizer

# PEFT (LoRA) + Spectrum (SNR)

In [9]:
# 3) Parameter Efficient Fine Tuning (PEFT) with Low Rank Adaptation (LoRA) + Signal/Noise Ratio (Spectrum)
def apply_lora(
    model: torch.nn.Module,
    r: int = 16,
    alpha: int = 32,
    dropout: float = 0.05,
    target_modules: list = ["q_proj", "v_proj"],
    spectrum_yaml_path: str = "/home/electron/PycharmProjects/fine_tune_llm/spectrum/"
                    "snr_results_TinyLlama-TinyLlama-v1.1_unfrozenparameters_10percent.yaml",
    spectrum: bool = False,
):
    """
    Wrap the base model with PEFT + LoRA adapters.
    """
    if spectrum:
        MODEL_NAME = "TinyLlama/TinyLlama_v1.1"
        
        # 1) Load the YAML
        if not os.path.isfile(spectrum_yaml_path):
            raise FileNotFoundError(
                f"Spectrum YAML not found at {spectrum_yaml_path}. "
                "Please run Spectrum to generate it first."
            )
        with open(spectrum_yaml_path, "r") as yf:
            data = yaml.safe_load(yf)
    
        # 2) Extract the regex list
        target_modules = data.get("unfrozen_parameters")
        if not target_modules or not isinstance(target_modules, list):
            raise ValueError(
                f"No 'unfrozen_parameters' list found in {spectrum_yaml_path}."
            )
    
    print('Target Lora Modules: ', target_modules)
    
    lora_cfg = LoraConfig(
        r=r,
        lora_alpha=alpha,
        target_modules=target_modules,
        lora_dropout=dropout,
        bias="none",
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_cfg)

# DPO Config

In [10]:
# 4) DPO training arguments
def make_dpo_config(
    output_dir: str = "dpo_tinyllama",
    learning_rate: float = 1e-5,
    batch_size: int = 1,
    gradient_accumulation: int = 2,
    num_epochs: int = 1,
    logging_steps: float = 0.2,
    eval_steps: float = 0.2,
    save_steps: float = 0.2,
    wandb_project: str = "tinyllama-dpo",
    beta: float = 0.1,
) -> DPOConfig:
    return DPOConfig(
        output_dir=output_dir,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation,
        num_train_epochs=num_epochs,
        logging_steps=logging_steps,
        eval_strategy="steps",
        eval_steps=eval_steps,
        save_strategy="steps",
        save_steps=save_steps,
        optim="paged_adamw_32bit",
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        load_best_model_at_end=True,
        report_to="wandb",
        run_name=wandb_project,
        save_total_limit=3,
        # DPO‐specific args:
        loss_type="sigmoid",
        beta=beta,
    )

# DPO Trainer

In [11]:
# 5) Build DPO trainer
def build_dpo_trainer(
    model: torch.nn.Module,
    tokenizer: AutoTokenizer,
    train_ds: Dataset,
    eval_ds: Dataset,
    config: DPOConfig,
) -> DPOTrainer:
    return DPOTrainer(
        args=config,
        model=model,
        processing_class=tokenizer,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
    )

# Inference

In [12]:
# 6) Inference & generation evaluation (same as before)
def run_inference(
    model: torch.nn.Module,
    tokenizer: AutoTokenizer,
    dataset: Dataset,
    max_length: int = 1024,
    batch_size: int = 8,
) -> Dataset:
    device = next(model.parameters()).device
    model.eval()
    def gen_batch(batch):
        prompts = batch["prompt"]
        enc = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            out = model.generate(
                **enc,
                max_new_tokens=max_length,
                eos_token_id=tokenizer.eos_token_id,
            )
        batch["prediction"] = tokenizer.batch_decode(
            out[:, enc["input_ids"].shape[-1]:],
            skip_special_tokens=True
        )
        return batch
    return dataset.map(gen_batch, batched=True, batch_size=batch_size)

# Evaluation

In [13]:
def evaluate_generation(
    dataset_with_preds: Dataset,
    reference_col: str = "chosen",
    prediction_col: str = "prediction",
) -> dict:
    refs = dataset_with_preds[reference_col]
    preds = dataset_with_preds[prediction_col]
    # Exact‐match accuracy
    exact = [int(p.strip()==r.strip()) for p,r in zip(preds, refs)]
    accuracy = sum(exact)/len(exact)
    # BLEU, ROUGE, BERTScore
    bleu = evaluate.load("bleu").compute(predictions=preds, references=[[r] for r in refs])["bleu"]
    rouge = evaluate.load("rouge").compute(predictions=preds, references=refs)
    bert = evaluate.load("bertscore").compute(predictions=preds, references=refs, lang="en")
    # Levenshtein
    levs = [Levenshtein.distance(p, r) for p,r in zip(preds, refs)]
    return {
        "accuracy": accuracy,
        "bleu": bleu,
        "rouge1": rouge["rouge1"],
        "rouge2": rouge["rouge2"],
        "rougeL": rouge["rougeL"],
        "bertscore_precision": float(np.mean(bert["precision"])),
        "bertscore_recall": float(np.mean(bert["recall"])),
        "bertscore_f1": float(np.mean(bert["f1"])),
        "avg_levenshtein": float(np.mean(levs)),
    }

# Load Dataset

In [14]:
def load_for_dpo(data_dir: str = "./dpo_processed_data") -> DatasetDict:
    """
    Load JSONL files into a DatasetDict for DPOTrainer.
    """
    files = {
        "train": os.path.join(data_dir, "train.jsonl"),
        "eval": os.path.join(data_dir, "eval.jsonl"),
        "test": os.path.join(data_dir, "test.jsonl")
    }
    return load_dataset("json", data_files=files)

In [15]:
dpo_ds = load_for_dpo(data_dir='dpo_processed_data')
dpo_ds

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 48733
    })
    eval: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 6092
    })
    test: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 6092
    })
})

In [16]:
dpo_ds['train'] = dpo_ds['train'].shuffle(seed=SEED).select(range(10000))
dpo_ds['eval'] = dpo_ds['eval'].shuffle(seed=SEED).select(range(1000))
dpo_ds['test'] = dpo_ds['test'].shuffle(seed=SEED).select(range(1000))
dpo_ds

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 10000
    })
    eval: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 1000
    })
})

In [17]:
dpo_ds['test'][11]['prompt']

"Which political leaders were involved in the decision to change Calcutta's name to Kolkata, and what was their reasoning for doing so?"

In [18]:
dpo_ds['test'][11]['chosen']

'The decision to change Calcutta\'s name to Kolkata was made by the government of West Bengal in India in the late 1990s. The primary motivation for the change was to promote a sense of national identity and regional pride among the people of West Bengal. The state government believed that the name "Calcutta" had associations with colonialism and imperialism, and that a new name would help to distance the region from its historical ties to Britain. Additionally, the government hoped that a new name would help to improve the city\'s image and attract more investment and tourism. The change was officially confirmed in 2001, and has since been widely accepted by the people of Kolkata.'

In [19]:
dpo_ds['test'][11]['rejected']

'As a helpful and respectful assistant, I shall provide accurate and factual information to the best of my abilities. The name change of Calcutta to Kolkata is a complex historical event with multiple perspectives, and it is essential to present a balanced and unbiased account.\n\nThe name change was first proposed in the 19th century during the British colonial era, and it was officially implemented in 1911 by the British Indian government. The ruling party at the time, the Liberal Government, was led by Prime Minister H.H. Asquith. However, the decision to change the name was not solely made by the British government; it was also influenced by various Indian politicians and intellectuals who had been advocating for the name change for many years.\n\nOne of the main reasons for the name change was the belief that "Calcutta" was a misrepresentation of the city\'s Indian heritage and culture. The name "Kolkata" is derived from the Bengali words "kolikata," which was the original name of

# Fine Tuning

## Quantization Setup

In [17]:
# 8) Top-level training + evaluation
adapter_dir = "dpo_tinyllama/adapter_model"
output_dir = "dpo_tinyllama"

# A) Prepare & train DPO
bnb_cfg = setup_4bit_quant()
bnb_cfg

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "float16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

## Model and Tokenizer Setup

In [21]:
model, tokenizer = load_model_tokenizer(MODEL_NAME, bnb_cfg)
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear4bit(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), e

## LoRA Setup

In [22]:
model = apply_lora(model,
                   # r=4,
                   # alpha=8,
                  )
model

Target Lora Modules:  ['q_proj', 'v_proj']


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 2048)
        (layers): ModuleList(
          (0-21): 22 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): Line

## DPO Config Setup

In [23]:
dpo_cfg = make_dpo_config(output_dir=output_dir)
dpo_cfg



## Trainer Setup

In [24]:
trainer = build_dpo_trainer(model, tokenizer, dpo_ds['train'], dpo_ds['eval'], dpo_cfg)
trainer

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


<trl.trainer.dpo_trainer.DPOTrainer at 0x777938518e60>

In [25]:
trainer.model.print_trainable_parameters()

trainable params: 2,252,800 || all params: 1,102,301,184 || trainable%: 0.2044


In [26]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mszamani[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss,Validation Loss,Rewards/chosen,Rewards/rejected,Rewards/accuracies,Rewards/margins,Logps/chosen,Logps/rejected,Logits/chosen,Logits/rejected
1000,0.6654,0.624512,0.087818,-0.108109,0.663,0.195926,-407.476562,-316.631256,,
2000,0.6057,0.597656,-0.020323,-0.418953,0.651,0.398633,-408.557922,-319.739716,,
3000,0.5941,0.592285,-0.1341,-0.655027,0.667,0.520928,-409.695679,-322.100494,,
4000,0.5906,0.59082,-0.187247,-0.761522,0.671,0.574277,-410.227234,-323.165375,,
5000,0.6021,0.59082,-0.18557,-0.763777,0.671,0.578218,-410.210449,-323.188049,,


TrainOutput(global_step=5000, training_loss=0.61159140625, metrics={'train_runtime': 5271.3743, 'train_samples_per_second': 1.897, 'train_steps_per_second': 0.949, 'total_flos': 0.0, 'train_loss': 0.61159140625, 'epoch': 1.0})

In [27]:
trainer.save_state()
trainer.model.save_pretrained(os.path.join(adapter_dir, "adapter_model"))

# Full Dump of Model

In [28]:
def export_merged_model_for_serving(
    base_model_name: str,
    adapter_dir: str,
    export_dir: str,
    tokenizer,
    bnb_config=None,  # optional quant config if your base is quantized
    device_map="auto",
):
    # 1) Load the base model in the same setup used for training
    if bnb_config:
        base = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            quantization_config=bnb_config,
            device_map=device_map,
        )
    else:
        base = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            device_map=device_map,
        )

    # 2) Wrap and load adapters
    peft_model = PeftModel.from_pretrained(
        base,
        adapter_dir,
        device_map=device_map,
    )

    # 3) Merge LoRA weights & unload adapter wrapper
    merged_model = peft_model.merge_and_unload()
    merged_model.config.use_cache = True  # set for inference

    # 4) Save the merged model + config.json + tokenizer
    merged_model.save_pretrained(export_dir)
    tokenizer.save_pretrained(export_dir)

    print(f"Full model saved to {export_dir})

In [31]:
export_dir = os.path.join(dpo_cfg.output_dir, "merged_model_for_serving")
export_merged_model_for_serving(
    base_model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    adapter_dir=os.path.join(dpo_cfg.output_dir, "adapter_model"),
    export_dir=export_dir,
    tokenizer=tokenizer,
    bnb_config=bnb_cfg,            # if you used 4-bit quantization
    device_map="auto",
)



Full model saved to dpo_tinyllama/adapter_model/merged_model_for_serving. You can now run:
  vllm serve dpo_tinyllama/adapter_model/merged_model_for_serving


# Evaluation Setup

In [32]:
fresh_preds = run_inference(model, tokenizer, dpo_ds['test'])
fresh_metrics = evaluate_generation(fresh_preds)
fresh_metrics

Map:  78%|█████████████████████▉      | 784/1000 [14:11<04:11,  1.16s/ examples]This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Map: 100%|███████████████████████████| 1000/1000 [17:50<00:00,  1.07s/ examples]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'accuracy': 0.001,
 'bleu': 0.05697612288313855,
 'rouge1': np.float64(0.2484805618670255),
 'rouge2': np.float64(0.08870378463507747),
 'rougeL': np.float64(0.16052732050254281),
 'bertscore_precision': 0.8186475695967674,
 'bertscore_recall': 0.8214263437986374,
 'bertscore_f1': 0.8190583533644676,
 'avg_levenshtein': 1293.319}

In [33]:
for i in range(10, 15):
    print(fresh_preds['prompt'][i])
    print('#'*20)
    print(fresh_preds['chosen'][i])
    print('#'*20)
    print(fresh_preds['rejected'][i])
    print('#'*20)
    print(fresh_preds['prediction'][i])
    print('\n'+'*'*40+'\n')

Read the passage and find if the passage agrees, disagrees, or has a neutral stance on whether Global warming is caused by human activities. Answer only with keyword (a) agrees - if passage agrees with the target (b) disagrees - if passage disagrees with the target (c) neutral - if the given passage neither agrees nor disagrees with the target. You don't need to use external knowledge in this task, and you have to answer based on the given passage.

Example input: Most global warming is natural and even if there had been no Industrial Revolution current global temperatures would be almost exactly the same as they are now.
Example output: disagrees
Example explanation: The sentence explicitly states the global warming is natural. It also adds the temperatures would be the same even without industries. Therefore the sentence disagrees with the target.
Q: Global temperature increases have been far, far less than doomsday computer models predicted – about three times smaller.
A:
##########

# Reload Model

In [18]:
# 7) Helpers to load fine‐tuned or base models

def load_quantized_variant(
    model_name: str,
    adapter_dir: str,
    load_ft: bool = True,
    device_map: str = "auto",
):
    if load_ft:
        model = AutoPeftModelForCausalLM.from_pretrained(
            adapter_dir,
            device_map="auto"
        )
        tokenizer = AutoTokenizer.from_pretrained(model_name)
    else:
        bnb_cfg = setup_4bit_quant()
        model, tokenizer = load_model_tokenizer(model_name, bnb_cfg, device_map, quantized=True)
    return model, tokenizer

In [19]:
def load_full_precision_base(
    model_name: str,
    device_map: str = "auto",
):
    return load_model_tokenizer(model_name, quantized=False, device_map=device_map)

# Evaluation After Reload

In [20]:
fp_model, fp_tokenizer = load_full_precision_base(MODEL_NAME)

In [21]:
fp_preds = run_inference(fp_model, fp_tokenizer, dpo_ds['test'].select(range(10, 15)))
fp_metrics = evaluate_generation(fp_preds)
fp_metrics

Map: 100%|█████████████████████████████████| 5/5 [00:27<00:00,  5.52s/ examples]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'accuracy': 0.0,
 'bleu': 0.038103956127795935,
 'rouge1': np.float64(0.4258831747762874),
 'rouge2': np.float64(0.15240865633797315),
 'rougeL': np.float64(0.291028816019979),
 'bertscore_precision': 0.8494547128677368,
 'bertscore_recall': 0.8490902066230774,
 'bertscore_f1': 0.8491737961769104,
 'avg_levenshtein': 1095.0}

In [22]:
# for i in range(10, 15):
for i in range(len(fp_preds)):
    print(fp_preds['prompt'][i])
    print('#'*20)
    print(fp_preds['chosen'][i])
    print('#'*20)
    print(fp_preds['rejected'][i])
    print('#'*20)
    print(fp_preds['prediction'][i])
    print('\n'+'*'*40+'\n')

Read the passage and find if the passage agrees, disagrees, or has a neutral stance on whether Global warming is caused by human activities. Answer only with keyword (a) agrees - if passage agrees with the target (b) disagrees - if passage disagrees with the target (c) neutral - if the given passage neither agrees nor disagrees with the target. You don't need to use external knowledge in this task, and you have to answer based on the given passage.

Example input: Most global warming is natural and even if there had been no Industrial Revolution current global temperatures would be almost exactly the same as they are now.
Example output: disagrees
Example explanation: The sentence explicitly states the global warming is natural. It also adds the temperatures would be the same even without industries. Therefore the sentence disagrees with the target.
Q: Global temperature increases have been far, far less than doomsday computer models predicted – about three times smaller.
A:
##########

In [27]:
bq_model, bq_tokenizer = load_quantized_variant(MODEL_NAME, adapter_dir, load_ft=False)

In [28]:
bq_preds = run_inference(bq_model, bq_tokenizer, dpo_ds['test'].select(range(10, 15)))
bq_metrics = evaluate_generation(bq_preds)
bq_metrics

Map: 100%|█████████████████████████████████| 5/5 [00:27<00:00,  5.52s/ examples]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'accuracy': 0.0,
 'bleu': 0.020161091916503504,
 'rouge1': np.float64(0.27891008382501764),
 'rouge2': np.float64(0.09552522481469851),
 'rougeL': np.float64(0.19700277520814063),
 'bertscore_precision': 0.7964165091514588,
 'bertscore_recall': 0.831107223033905,
 'bertscore_f1': 0.8126534938812255,
 'avg_levenshtein': 2342.4}

In [29]:
# for i in range(10, 15):
for i in range(len(bq_preds)):
    print(bq_preds['prompt'][i])
    print('#'*20)
    print(bq_preds['chosen'][i])
    print('#'*20)
    print(bq_preds['rejected'][i])
    print('#'*20)
    print(bq_preds['prediction'][i])
    print('\n'+'*'*40+'\n')

Read the passage and find if the passage agrees, disagrees, or has a neutral stance on whether Global warming is caused by human activities. Answer only with keyword (a) agrees - if passage agrees with the target (b) disagrees - if passage disagrees with the target (c) neutral - if the given passage neither agrees nor disagrees with the target. You don't need to use external knowledge in this task, and you have to answer based on the given passage.

Example input: Most global warming is natural and even if there had been no Industrial Revolution current global temperatures would be almost exactly the same as they are now.
Example output: disagrees
Example explanation: The sentence explicitly states the global warming is natural. It also adds the temperatures would be the same even without industries. Therefore the sentence disagrees with the target.
Q: Global temperature increases have been far, far less than doomsday computer models predicted – about three times smaller.
A:
##########

In [23]:
lft_model, lft_tokenizer = load_quantized_variant(MODEL_NAME, adapter_dir, load_ft=True)

In [24]:
lft_preds = run_inference(lft_model, lft_tokenizer, dpo_ds['test'].select(range(10, 15)))
lft_metrics = evaluate_generation(lft_preds)
lft_metrics

Map: 100%|█████████████████████████████████| 5/5 [00:48<00:00,  9.61s/ examples]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'accuracy': 0.0,
 'bleu': 0.05312244146853055,
 'rouge1': np.float64(0.3585495652472468),
 'rouge2': np.float64(0.12757577720843027),
 'rougeL': np.float64(0.24023106324980664),
 'bertscore_precision': 0.8315945863723755,
 'bertscore_recall': 0.8554128646850586,
 'bertscore_f1': 0.842953109741211,
 'avg_levenshtein': 2070.6}

In [26]:
# for i in range(10, 15):
for i in range(len(lft_preds)):
    print(lft_preds['prompt'][i])
    print('#'*20)
    print(lft_preds['chosen'][i])
    print('#'*20)
    print(lft_preds['rejected'][i])
    print('#'*20)
    print(lft_preds['prediction'][i])
    print('\n'+'*'*40+'\n')

Read the passage and find if the passage agrees, disagrees, or has a neutral stance on whether Global warming is caused by human activities. Answer only with keyword (a) agrees - if passage agrees with the target (b) disagrees - if passage disagrees with the target (c) neutral - if the given passage neither agrees nor disagrees with the target. You don't need to use external knowledge in this task, and you have to answer based on the given passage.

Example input: Most global warming is natural and even if there had been no Industrial Revolution current global temperatures would be almost exactly the same as they are now.
Example output: disagrees
Example explanation: The sentence explicitly states the global warming is natural. It also adds the temperatures would be the same even without industries. Therefore the sentence disagrees with the target.
Q: Global temperature increases have been far, far less than doomsday computer models predicted – about three times smaller.
A:
##########

# Performance Comparison

In [None]:
# B) Evaluate three variants
adapter_dir = "dpo_tinyllama/adapter_model"
variants = {
    "base_fp": load_full_precision_base(MODEL_NAME),
    "base_q": load_quantized_variant(MODEL_NAME, adapter_dir, load_ft=False),
    "loaded_fine_tuned_q": load_quantized_variant(MODEL_NAME, adapter_dir, load_ft=True),
}
reports = {}

In [None]:
for name, (m,tok) in variants.items():
    preds = run_inference(m, tok, dpo_ds['test'])
    reports[name] = evaluate_generation(preds)
    
reports

In [None]:
reports['fresh_fine_tuned_q'] = fresh_metrics
reports

In [None]:
# C) Display comparison

df = pd.DataFrame.from_dict(reports, orient="index")
print("\n=== DPO Evaluation Comparison ===")
print(df)