<a href="https://colab.research.google.com/github/roya90/Fine_tuning_tutorial/blob/main/DPO_for_Gemma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Direct Preference Optimization (DPO) for Fine-Tuning Language Models

Direct Preference Optimization (DPO) is an efficient method for fine-tuning models on human preference data. It is often used as a simpler alternative to the reinforcement learning stage of Reinforcement Learning from Human Feedback (RLHF). Below is a guide to understanding DPO and implementing it.

## What is DPO?
DPO trains a model to prefer certain outputs (chosen) over others (rejected). It optimizes the model directly on preference pairs, so its responses better reflect human preferences. Unlike RL-based approaches, DPO does not require training a separate reward model.
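To make the idea concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the model being trained (the policy) and a frozen reference model. The function name and the example numbers are illustrative assumptions, not part of any library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled gap between the two log-ratios.
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical log-probabilities, for illustration only.
# Policy identical to the reference: the loss equals ln(2), about 0.693.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-8.0, -15.0, -10.0, -12.0))
```

Because the policy starts out identical to the reference model, training begins at a loss of about 0.693 and decreases as the policy learns to favor the chosen responses.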

# Loading the Required Packages

In your DPO implementation, several essential packages are required to handle the fine-tuning of large language models efficiently.



* **bitsandbytes**: Provides efficient memory management for large models by enabling 8-bit optimizers and quantization, crucial for working with large-scale models on limited hardware.

* **transformers**: Core library for handling state-of-the-art transformer models (e.g., BERT, GPT) used in NLP tasks.

* **accelerate**: Simplifies multi-GPU and mixed-precision training, making model training faster and resource-efficient.

* **datasets**: Handles large-scale datasets for machine learning, including efficient loading and processing.

* **trl**: Provides trainers for post-training transformer models, including reinforcement learning methods and the `DPOTrainer` used in this guide.

* **peft**: Parameter-efficient fine-tuning, enabling efficient training of large models with minimal compute overhead.




In [1]:
!pip install -q -i https://pypi.org/simple/ bitsandbytes
!pip install -q transformers accelerate datasets trl peft


In [2]:
import torch
import torch.nn as nn

import transformers
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          AutoConfig,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          )

from datasets import Dataset, load_dataset
import bitsandbytes as bnb

The `notebook_login()` function from `huggingface_hub` is used to authenticate your Hugging Face account within a Jupyter notebook or similar environment. By calling this function, a prompt will appear to log in using your Hugging Face credentials. Once authenticated, it allows you to access private models, datasets, and other resources on the Hugging Face platform that require authentication.

To get a token, log into your Hugging Face account, open your Account Settings, and go to the "Access Tokens" section. Tokens let you authenticate with the Hugging Face Hub for API access, downloading private models, or managing repositories. You can generate new tokens with different permission levels (read, write, etc.) and copy the token for use in your scripts or notebooks.

In [3]:
from huggingface_hub import notebook_login
notebook_login()


# Loading the Model

In the following snippet, we load the model from the Hugging Face Hub with 4-bit quantization, which shrinks the memory footprint of the weights so the model fits on hardware with limited resources. The model is loaded with specific configuration changes, such as disabling the cache and turning off final-logit soft-capping, and it uses bfloat16 for computation. A tokenizer is also prepared with a maximum sequence length of 2304 tokens.

In [5]:
# Define the model name from Hugging Face's repository

model_name = "google/gemma-2-2b"


# Set the compute data type to bfloat16 for the quantized matrix multiplications

compute_dtype = torch.bfloat16


# Configure 4-bit quantization for memory efficiency using BitsAndBytes

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Load model in 4-bit precision
    bnb_4bit_quant_type="nf4", # Use "nf4" quantization type
    bnb_4bit_compute_dtype=compute_dtype # Set computation to bfloat16
)


# Load the model configuration

config = AutoConfig.from_pretrained(model_name)
config.final_logit_softcapping = None  # Disable soft-capping


# Load the model with the specified configuration and quantization settings

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto", # Automatically map model across available devices (e.g., GPUs)
    config=config,
    attn_implementation="eager", # Use eager mode for attention implementation
    quantization_config=bnb_config,  # Apply the quantization settings
)


# Disable caching mechanism and set pretraining tensor parallelism to 1

model.config.use_cache = False
model.config.pretraining_tp = 1


# Set up the tokenizer with a maximum sequence length of 2304
# (`model_max_length` is the tokenizer argument that sets this limit)

max_seq_length = 2304
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=max_seq_length)

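A rough back-of-the-envelope calculation shows why 4-bit loading matters. The sketch below estimates the memory taken by the weights alone at different precisions. The parameter count is an assumed, rounded figure for a model of this size, and the estimate ignores quantization constants, layers that stay in higher precision, activations, and optimizer state, so treat the numbers as a lower bound.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 2.6 billion parameters (a rounded, illustrative figure).
NUM_PARAMS = 2.6e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Going from 32-bit to 4-bit weights cuts the weight memory by a factor of eight, which is what makes fine-tuning feasible on a single consumer or Colab GPU.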

# Creating the Data


The next step in implementing Direct Preference Optimization (DPO) is to create the dataset that will guide the model's fine-tuning process. This dataset should contain three essential components: a prompt, a chosen response, and a rejected response. By structuring the data this way, the model can be trained to understand which responses are preferable based on human feedback, enabling it to produce more aligned, relevant outputs during inference.

## Data Structure for DPO:
The dictionary used for Direct Preference Optimization (DPO) has three keys:

**prompt**: The input question or text for which the model is providing a response.

**chosen**: The preferred or correct response chosen based on human feedback.

**rejected**: The less preferred or incorrect response that should be avoided.

In [6]:
dpo_dataset_dict = {
    "prompt": [
        "What is the capital of France?",
        "Explain quantum mechanics in simple terms.",
        "What is 5 + 7?",
        "Who wrote 'Pride and Prejudice'?",
        "What is the tallest mountain in the world?",
        "What is the boiling point of water?",
        "Who painted the Mona Lisa?",
        "Define artificial intelligence.",
        "What is the speed of light?",
        "What is photosynthesis?"
    ],
    "chosen": [
        "The capital of France is Paris.",
        "Quantum mechanics describes the behavior of particles on a very small scale.",
        "5 + 7 equals 12.",
        "'Pride and Prejudice' was written by Jane Austen.",
        "The tallest mountain in the world is Mount Everest.",
        "The boiling point of water is 100 degrees Celsius.",
        "The Mona Lisa was painted by Leonardo da Vinci.",
        "Artificial intelligence refers to machines designed to mimic human intelligence.",
        "The speed of light is approximately 299,792 kilometers per second.",
        "Photosynthesis is the process by which plants convert sunlight into energy."
    ],
    "rejected": [
        "The capital of France is London.",
        "Quantum mechanics is the study of space travel.",
        "5 + 7 equals 57.",
        "'Pride and Prejudice' was written by Charles Dickens.",
        "The tallest mountain in the world is K2.",
        "The boiling point of water is 50 degrees Celsius.",
        "The Mona Lisa was painted by Vincent van Gogh.",
        "Artificial intelligence refers to machines that are completely autonomous.",
        "The speed of light is 1,000 kilometers per hour.",
        "Photosynthesis is the process by which animals digest food."
    ]
}

dataset = Dataset.from_dict(dpo_dataset_dict)
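Before handing a preference dictionary to `Dataset.from_dict`, it can help to check that it is well formed. The helper below is a hypothetical sketch (its name and checks are not part of any library): it verifies that the three keys exist, that the lists have equal length, and that no row uses the same text as both the chosen and the rejected response.

```python
def validate_preference_data(data):
    """Check a prompt/chosen/rejected dict and return the number of rows."""
    required = ("prompt", "chosen", "rejected")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    # All three columns must line up row by row.
    lengths = {key: len(data[key]) for key in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    # A pair with identical responses carries no preference signal.
    for i, (chosen, rejected) in enumerate(zip(data["chosen"], data["rejected"])):
        if chosen == rejected:
            raise ValueError(f"row {i}: chosen and rejected are identical")
    return lengths["prompt"]


# Small example in the same shape as the dictionary above.
example = {
    "prompt": ["What is the capital of France?", "What is 5 + 7?"],
    "chosen": ["The capital of France is Paris.", "5 + 7 equals 12."],
    "rejected": ["The capital of France is London.", "5 + 7 equals 57."],
}
print(validate_preference_data(example))  # 2
```

Calling `validate_preference_data(dpo_dataset_dict)` before building the dataset would catch a missing or misaligned entry early, rather than partway through training.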

# LoRA Configuration

The following code snippet configures LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning using the `LoraConfig` class from the PEFT library.

`r=8` defines the rank of the low-rank matrices for adaptation.

`target_modules` specifies which layers in the model to apply LoRA to. Here it covers the attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the MLP projections (`gate_proj`, `up_proj`, `down_proj`).

`task_type="CAUSAL_LM"` sets the task type to causal language modeling, indicating the model is being fine-tuned for autoregressive tasks like text generation.

LoRA reduces the number of parameters trained, making fine-tuning more efficient.

In [7]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
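The saving from LoRA can be made concrete with a little arithmetic. For a weight matrix with `d_in` inputs and `d_out` outputs, LoRA freezes the original weights and trains two small matrices of rank `r`, which adds `r * (d_in + d_out)` trainable parameters. The layer sizes below are illustrative assumptions, not the actual dimensions of any particular model.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen weight matrix itself (bias ignored)."""
    return d_in * d_out


# Illustrative layer size (an assumption) with the rank used above.
d_in, d_out, r = 2048, 2048, 8

added = lora_extra_params(d_in, d_out, r)
frozen = full_params(d_in, d_out)
print(f"frozen weights: {frozen:,}")
print(f"LoRA trainable: {added:,} ({added / frozen:.2%} of the layer)")
```

Under these assumed sizes, the adapter trains well under 1% of the parameters of the layer it wraps. Raising `r` increases the adapter's capacity and its parameter count linearly.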

# DPO Configuration

This code snippet sets up Direct Preference Optimization (DPO) training using the Hugging Face TRL library:

`DPOConfig`: Specifies the training configuration. `beta=0.1` controls how far the fine-tuned model may drift from the reference model (a higher value keeps it closer), and `output_dir` is set to the current working directory for saving model outputs.

`DPOTrainer`: Initializes the DPO trainer with:

* The preloaded `model`,
* `args` (here `training_args`) for the training parameters,
* The `train_dataset` for DPO fine-tuning,
* The `tokenizer` for processing text,
* A LoRA (Low-Rank Adaptation) configuration (`peft_config`) to enable parameter-efficient fine-tuning.

In [9]:
import os
from trl import DPOTrainer, DPOConfig

training_args = DPOConfig(
    beta=0.1,
    output_dir=os.getcwd(),
)

dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset,
    # Note: newer TRL releases rename this argument to `processing_class`
    tokenizer=tokenizer,  # for visual language models, use tokenizer=processor instead
    peft_config=lora_config,
)
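To build some intuition for `beta`, the sketch below computes the implicit rewards that DPO assigns to each response, namely `beta` times the log-probability gap between the policy and the reference model. It then reports the mean reward margin and the fraction of pairs where the chosen response wins. The function name and the log-probabilities are hypothetical, chosen only for illustration. The quantities are meant to mirror the reward metrics TRL logs during DPO training, but this is not the library's implementation.

```python
def preference_metrics(batch, beta):
    """Return (mean reward margin, accuracy) for a batch of preference pairs.

    Each item is (policy_chosen, policy_rejected, ref_chosen, ref_rejected),
    all summed sequence log-probabilities.
    """
    margins = []
    for policy_chosen, policy_rejected, ref_chosen, ref_rejected in batch:
        # Implicit reward: scaled log-ratio between policy and reference.
        reward_chosen = beta * (policy_chosen - ref_chosen)
        reward_rejected = beta * (policy_rejected - ref_rejected)
        margins.append(reward_chosen - reward_rejected)
    mean_margin = sum(margins) / len(margins)
    # A pair counts as correct when the chosen response gets the higher reward.
    accuracy = sum(m > 0 for m in margins) / len(margins)
    return mean_margin, accuracy


# Hypothetical log-probabilities for three preference pairs.
batch = [
    (-9.0, -14.0, -10.0, -12.0),
    (-11.0, -11.5, -10.5, -12.0),
    (-8.0, -13.0, -8.5, -12.5),
]

for beta in (0.1, 0.5):
    mean_margin, accuracy = preference_metrics(batch, beta)
    print(f"beta={beta}: mean margin={mean_margin:.3f}, accuracy={accuracy:.2f}")
```

Changing `beta` rescales the margins but leaves the accuracy untouched. In training, a larger `beta` makes the loss react more sharply to the same log-probability gap, which in practice keeps the policy closer to the reference model.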




# Fine-Tuning

The line `dpo_trainer.train()` starts the Direct Preference Optimization (DPO) fine-tuning process by running the training loop. The model learns to prefer chosen responses over rejected ones under the provided configuration, including `beta` and other training parameters such as batch size. The loop iterates over the training data and updates the LoRA adapter weights according to the specified preferences. Any intermediate checkpoints are written to the configured output directory, while the final model is saved explicitly in the step after training.

In [10]:
dpo_trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed




TrainOutput(global_step=6, training_loss=0.527743379275004, metrics={'train_runtime': 8.4032, 'train_samples_per_second': 3.57, 'train_steps_per_second': 0.714, 'total_flos': 0.0, 'train_loss': 0.527743379275004, 'epoch': 3.0})
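The `global_step=6` in the output can be reproduced with simple arithmetic. The sketch below assumes a single device, no gradient accumulation, a per-device batch size of 8, and 3 epochs. The batch size is an assumption, since the configuration above does not set it explicitly, and the epoch count is taken from `'epoch': 3.0` in the output. The function name is illustrative.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single device with no gradient accumulation."""
    # The last batch of each epoch may be smaller, so round up.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 10 preference pairs, assumed batch size of 8, 3 epochs.
print(total_optimizer_steps(num_examples=10, batch_size=8, num_epochs=3))  # 6
```

With only ten examples, each epoch consists of one full batch and one partial batch, so a run this small is a smoke test of the pipeline rather than meaningful alignment training.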


The line `dpo_trainer.save_model(output_dir)` saves the fine-tuned model to the specified directory, preserving the updated weights and configuration after training. Because training used a LoRA `peft_config`, the saved weights should be the trained adapter rather than a full copy of the base model. The line `tokenizer.save_pretrained(output_dir)` saves the tokenizer, which is essential for processing input text in the same format used during training. Together, these commands ensure that the model and tokenizer can be loaded later for inference or further fine-tuning.

In [12]:
import os

# Define the output directory path
output_dir = "updated_model"

# Create the folder if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Save the model and tokenizer to the specified output directory
dpo_trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

('updated_model/tokenizer_config.json',
 'updated_model/special_tokens_map.json',
 'updated_model/tokenizer.model',
 'updated_model/added_tokens.json',
 'updated_model/tokenizer.json')


In summary, fine-tuning large models with Direct Preference Optimization (DPO) and LoRA offers an efficient way to align models with human feedback. Quantization and LoRA cut the memory and compute required, so you can fine-tune on modest hardware. By saving both the model and tokenizer, you're ready to deploy or further refine your model. Stay tuned for more insights and tutorials on fine-tuning techniques, where we’ll continue exploring advanced methods for optimizing AI models effectively!

Special thanks to [David](https://www.linkedin.com/in/davidcardozo/) for his valuable help throughout the process.