<a href="https://colab.research.google.com/github/thibaud-perrin/preference-alignment/blob/main/notebooks/orpo_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preference Alignment with Odds Ratio Preference Optimization (ORPO)

This notebook demonstrates how to fine-tune a language model with Odds Ratio Preference Optimization (ORPO). DPO is normally applied to a model that has already been through Supervised Fine-Tuning (SFT). The base `SmolLM2-135M` model used here has not, which makes it a poor starting point for DPO. ORPO adds a preference term to the SFT loss and needs no reference model, so instruction tuning and preference alignment happen in a single training stage.
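
To make the objective concrete, here is a minimal sketch of the ORPO loss for one preference pair, following the formulation in the ORPO paper: the negative log-likelihood of the chosen response plus a weighted odds-ratio penalty. The function names and the toy log-probabilities are illustrative assumptions, and `beta` plays the role of λ. This is not TRL's implementation.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp), where avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_pair_loss(nll_chosen, logp_chosen, logp_rejected, beta=0.1):
    """NLL of the chosen response plus beta times the odds-ratio penalty."""
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(x)) is written as log(1 + exp(-x))
    odds_ratio_penalty = math.log1p(math.exp(-log_odds_ratio))
    return nll_chosen + beta * odds_ratio_penalty


# Toy average log-probabilities (assumed values, not taken from a real model)
loss_tied = orpo_pair_loss(1.5, logp_chosen=-1.5, logp_rejected=-1.5)
loss_preferred = orpo_pair_loss(1.5, logp_chosen=-1.5, logp_rejected=-2.0)
print(loss_tied > loss_preferred)  # True
```

The penalty shrinks as the chosen response becomes more likely than the rejected one, so the model is pushed to separate the two while still fitting the chosen text.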


## What's Inside

### Fine-Tuning with ORPOTrainer
This notebook provides a detailed walkthrough of aligning the `SmolLM2-135M` model with ORPOTrainer. The process includes:
- Loading the pre-trained `SmolLM2-135M` model.
- Selecting a dataset for alignment:
  - **Basic Example:** Fine-tuning with the `trl-lib/ultrafeedback_binarized` dataset.
  - **Intermediate Example:** Fine-tuning with the `argilla/ultrafeedback-binarized-preferences` dataset.
  - **Advanced Example:** Fine-tuning on a subset of `mlabonne/orpo-dpo-mix-40k` for more complex preference alignment.
- Training the model to align its outputs with human preferences using the ORPO framework.

By the end of the notebook, the model has been trained on the preference pairs of the selected dataset. The training logs and sample generations in each section show how far ORPO moves a 135M-parameter base model that has had no prior fine-tuning.

## Secrets
Load the Hugging Face token from the Colab secrets and log in to Hugging Face.

In [1]:
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')

In [2]:
# Authenticate to Hugging Face
from huggingface_hub import login

login(token=HF_TOKEN)
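
The cells above rely on `google.colab.userdata`, which only exists inside Colab. A minimal sketch of a fallback for other environments, assuming the token is exported as an `HF_TOKEN` environment variable (the helper name `get_hf_token` is hypothetical):

```python
import os


def get_hf_token(env=os.environ):
    """Read HF_TOKEN from Colab secrets if available, else from the environment."""
    try:
        # Only importable when running inside Google Colab
        from google.colab import userdata
        return userdata.get("HF_TOKEN")
    except ImportError:
        # Outside Colab: fall back to an environment variable (assumption)
        return env.get("HF_TOKEN")
```

The result can then be passed to `login(token=...)` exactly as in the cell above.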

## Libraries

In [3]:
# Install the requirements in Google Colab
# transformers
!pip install datasets trl huggingface_hub accelerate bitsandbytes



In [4]:
import torch
import os
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from trl import ORPOConfig, ORPOTrainer, setup_chat_format

## `trl-lib/ultrafeedback_binarized`

### Define the model

In [25]:
model_name = "HuggingFaceTB/SmolLM2-135M"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
model, tokenizer = setup_chat_format(model, tokenizer)

# Name under which the fine-tune is saved and/or uploaded
finetune_name = "SmolLM2-FT-ORPO-trl-ufb"
finetune_tags = ["smol-course", "module_2", "trl-lib/ultrafeedback_binarized"]
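
`setup_chat_format` gives the base model a chat template. The sketch below imitates, in plain Python, the ChatML layout that this notebook assumes is the default format applied by that call. The function `render_chatml` is a hypothetical stand-in for illustration, not part of TRL or transformers.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts in ChatML style."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


prompt = render_chatml(
    [{"role": "user", "content": "Hello"}], add_generation_prompt=True
)
print(prompt)
```

In the notebook itself, `tokenizer.apply_chat_template` does this rendering. With `skip_special_tokens=True` the markers disappear on decoding, which is why the generations further down show bare `user` and `assistant` lines.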

### Format dataset

In [24]:
# Load dataset
dataset = load_dataset(path="trl-lib/ultrafeedback_binarized")

In [7]:
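# Alternative formatting that renders each field to a chat-template string.
# It is not the variant used in the recorded run (the features printed below
# are still message lists). Run either this cell or the next one, not both.
def process_dataset(sample):
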
    sample['prompt'] = tokenizer.apply_chat_template(
        sample['chosen'][0:1],
        tokenize=False,
        add_generation_prompt=False  # Avoid adding duplicate prompts
    )

    # Apply template for `chosen`
    sample['chosen'] = tokenizer.apply_chat_template(
        sample['chosen'][1:],
        tokenize=False,
        add_generation_prompt=False  # Avoid adding duplicate prompts
    )

    # Apply template for `rejected`
    sample['rejected'] = tokenizer.apply_chat_template(
        sample['rejected'][1:],
        tokenize=False,
        add_generation_prompt=False  # Avoid adding duplicate prompts
    )
    return sample
dataset = dataset.map(process_dataset)

In [26]:
def process_dataset(sample):

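    # The first message of `chosen` is the user turn; keep it as the prompt
    sample['prompt'] = sample['chosen'][0:1]

    # Keep the remaining assistant turn as the `chosen` completion
    sample['chosen'] = sample['chosen'][1:]

    # Keep the remaining assistant turn as the `rejected` completion
    sample['rejected'] = sample['rejected'][1:]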
    return sample
dataset = dataset.map(process_dataset)
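
To see what this mapping does, here is the same slicing applied to a single toy record. The record content is made up for illustration, and the helper name `split_prompt_and_completions` is hypothetical.

```python
def split_prompt_and_completions(sample):
    """Move the shared user turn into `prompt`; keep only the replies."""
    return {
        "prompt": sample["chosen"][0:1],
        "chosen": sample["chosen"][1:],
        "rejected": sample["rejected"][1:],
    }


# Toy record in the conversational layout (assumed content)
toy = {
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "5"},
    ],
}

result = split_prompt_and_completions(toy)
print(result["prompt"])    # [{'role': 'user', 'content': 'What is 2 + 2?'}]
print(result["chosen"])    # [{'role': 'assistant', 'content': '2 + 2 = 4.'}]
print(result["rejected"])  # [{'role': 'assistant', 'content': '5'}]
```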

In [27]:
# Inspect the dataset structure and metadata
print(dataset)

# Display dataset features
print(dataset["train"].features)
print(dataset["test"].features)

# Check the number of examples in the train and test splits
print(f"Train split size: {len(dataset['train'])}")
print(f"Test split size: {len(dataset['test'])}")

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected', 'prompt'],
        num_rows: 62135
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected', 'prompt'],
        num_rows: 1000
    })
})
{'chosen': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'rejected': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'score_chosen': Value(dtype='float64', id=None), 'score_rejected': Value(dtype='float64', id=None), 'prompt': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}]}
{'chosen': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'rejected': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'score_chosen': Value(dtype='float64', id=None), 'score_rejected': Value(dtype='float64', id=None), 'prompt': [{'conte

In [28]:
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(10000))  # Shuffle, then keep the first 10,000 examples
dataset["test"] = dataset["test"].shuffle(seed=42).select(range(1000))  # Shuffle, then keep the first 1,000 examples

### Train model with ORPO

In [29]:
orpo_args = ORPOConfig(
    # Small learning rate to prevent catastrophic forgetting
    learning_rate=8e-6,
    # Linear learning rate decay over training
    lr_scheduler_type="linear",
    # Maximum combined length of prompt + completion
    max_length=1024,
    # Maximum length for input prompts
    max_prompt_length=512,
    # Controls weight of the odds ratio loss (λ in paper)
    beta=0.1,
    # Batch size for training
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    # Helps with training stability by accumulating gradients before updating
    gradient_accumulation_steps=4,
    # Memory-efficient optimizer for CUDA, falls back to adamw_torch for CPU/MPS
    optim="paged_adamw_8bit" if device == "cuda" else "adamw_torch",
    # Number of training epochs
    num_train_epochs=1,
    # When to run evaluation
    eval_strategy="steps",
    # Evaluate every 20% of training
    eval_steps=0.2,
    # Log metrics every step
    logging_steps=1,
    # Gradual learning rate warmup
    warmup_steps=10,
    # Disable external logging
    report_to="none",
    # Where to save model/checkpoints
    output_dir="./results/",
    # Enable MPS (Metal Performance Shaders) if available
    use_mps_device=device == "mps",
    hub_model_id=finetune_name,
)
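
The batch settings above determine how many optimizer steps one epoch takes and how often evaluation runs. A minimal sketch of that arithmetic, assuming a single device and the 10,000-example training subset selected earlier (the helper `training_schedule` is hypothetical and only approximates the trainer's bookkeeping):

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      eval_ratio, num_epochs=1, num_devices=1):
    """Return (total optimizer steps, evaluation interval in steps)."""
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices)
    )
    # One optimizer step happens every `grad_accum_steps` batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_steps = math.ceil(num_epochs * updates_per_epoch)
    # A float eval_steps below 1 is read as a fraction of total steps
    eval_every = math.ceil(total_steps * eval_ratio)
    return total_steps, eval_every


total_steps, eval_every = training_schedule(10_000, 2, 4, 0.2)
print(total_steps, eval_every)  # 1250 250
```

The effective batch size is 2 × 4 = 8, which gives 1,250 steps with an evaluation every 250 steps. This is consistent with the step column of the training log in this section.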



In [30]:
trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)



In [31]:
trainer.train()  # Train the model

# Save the model
trainer.save_model(f"./{finetune_name}")

| Step | Training Loss | Validation Loss | Runtime | Samples Per Second | Steps Per Second | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Nll Loss | Log Odds Ratio | Log Odds Chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 250 | 2.0294 | 2.014045 | 36.9034 | 27.098 | 13.549 | -0.17092 | -0.183243 | 0.51 | 0.012323 | -1.83243 | -1.709204 | 8.71949 | 8.056647 | 1.938332 | -0.757127 | 0.141093 |
| 500 | 1.8405 | 1.972015 | 36.8702 | 27.122 | 13.561 | -0.162543 | -0.17635 | 0.518 | 0.013806 | -1.763497 | -1.625433 | 8.518332 | 7.842507 | 1.897809 | -0.742062 | 0.158958 |
| 750 | 1.9743 | 1.954977 | 36.9154 | 27.089 | 13.544 | -0.159376 | -0.173197 | 0.519 | 0.013821 | -1.731967 | -1.593757 | 7.969738 | 7.369678 | 1.881235 | -0.737419 | 0.160685 |
| 1000 | 1.9964 | 1.946934 | 36.8703 | 27.122 | 13.561 | -0.157712 | -0.171483 | 0.521 | 0.013771 | -1.714832 | -1.577119 | 7.636106 | 7.073079 | 1.873473 | -0.734612 | 0.161112 |
| 1250 | 1.7754 | 1.944123 | 36.9926 | 27.032 | 13.516 | -0.157338 | -0.171082 | 0.523 | 0.013743 | -1.710815 | -1.573381 | 7.717488 | 7.15431 | 1.870676 | -0.734476 | 0.161036 |
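
The logged columns are related to each other, and checking those relations is a quick sanity test of a run. The sketch below uses the step-250 evaluation row from the log above with `beta=0.1`. The relations are inferred from the logged numbers (rewards as `beta` times the log-probabilities, validation loss as the NLL minus `beta` times the logged log odds ratio term), so treat them as assumptions rather than a statement of TRL's internals.

```python
BETA = 0.1

# Step-250 evaluation row copied from the training log above
row = {
    "validation_loss": 2.014045,
    "rewards_chosen": -0.17092,
    "rewards_rejected": -0.183243,
    "rewards_margins": 0.012323,
    "logps_chosen": -1.709204,
    "logps_rejected": -1.83243,
    "nll_loss": 1.938332,
    "log_odds_ratio": -0.757127,
}


def check_row(row, beta=BETA, tol=1e-5):
    """Return which assumed relations hold within the log's rounding."""
    def close(a, b):
        return abs(a - b) < tol

    return {
        "rewards_chosen": close(beta * row["logps_chosen"], row["rewards_chosen"]),
        "rewards_rejected": close(beta * row["logps_rejected"], row["rewards_rejected"]),
        "rewards_margins": close(
            row["rewards_chosen"] - row["rewards_rejected"], row["rewards_margins"]
        ),
        "validation_loss": close(
            row["nll_loss"] - beta * row["log_odds_ratio"], row["validation_loss"]
        ),
    }


checks = check_row(row)
print(checks)
```

Note that `Rewards/accuracies` stays close to 0.5 throughout this run, so the chosen response is ranked above the rejected one only slightly more often than chance.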


### Test the model

In [32]:
# Load the fine-tuned model
fine_tuned_model_path = f"./{finetune_name}"
fine_tuned_model = AutoModelForCausalLM.from_pretrained(fine_tuned_model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_path)

In [50]:
print(dataset['train'][2]['prompt'][0]['content'])
print(dataset['train'][2]['chosen'][0]['content'])

What is stationarity of a time series, how is it used for mean reversion trading strategies along with statistical tests, their mathematical equation and related python code
Stationarity is a property of a time series, which means that its statistical properties, such as the mean, variance, and autocorrelation, do not change over time. A stationary time series has a constant mean and a constant variance, and its statistical properties are independent of the time of observation. In other words, the distribution of the data remains the same regardless of the sampling period.

Stationarity is an important requirement in many statistical analyses and modeling techniques, including time series forecasting, hypothesis testing, and signal processing. If a time series is non-stationary, various techniques such as differencing, detrending, and seasonal decomposition can be applied to make the series stationary.

In mean reversion trading strategies, the assumption is made that a time series wil

In [51]:
# Test the fine-tuned model on the same prompt

# Format with template
messages = dataset['train'][2]['prompt']
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = fine_tuned_model.generate(**inputs, max_new_tokens=256)

In [53]:
# Decode and print the response
print("After fine-tuning:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After fine-tuning:
user
What is stationarity of a time series, how is it used for mean reversion trading strategies along with statistical tests, their mathematical equation and related python code
assistant
Stationarity is a statistical concept that describes the tendency of a time series to return to a given value after a specified period of time. It is a fundamental concept in statistics and is used to analyze and understand the behavior of time series data.

There are two main types of stationarity:

1. Stationarity of the mean: This is the condition that the mean of a time series is constant over time. It is a statistical property that can be used to determine if a time series is stationary.
2. Stationarity of the variance: This is the condition that the variance of a time series is constant over time. It is a statistical property that can be used to determine if a time series is stationary.

Stationarity is important for many reasons, including:

1. Understanding the behavior of 
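
The decoded text above repeats the prompt because `generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of keeping only the completion, shown on plain lists with made-up token ids (the helper `completion_only` is hypothetical):

```python
def completion_only(output_ids, prompt_length):
    """Drop the prompt tokens echoed at the start of the generated sequence."""
    return output_ids[prompt_length:]


# Toy token ids (assumed values)
prompt_ids = [1, 42, 43, 2]
generated = prompt_ids + [77, 78, 79]

new_tokens = completion_only(generated, len(prompt_ids))
print(new_tokens)  # [77, 78, 79]
```

In the notebook, the same slice would be `outputs[0][inputs["input_ids"].shape[-1]:]`, passed to `tokenizer.decode(..., skip_special_tokens=True)`.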

## `argilla/ultrafeedback-binarized-preferences`

### Define the model

In [5]:
model_name = "HuggingFaceTB/SmolLM2-135M"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
model, tokenizer = setup_chat_format(model, tokenizer)

# Name under which the fine-tune is saved and/or uploaded
finetune_name = "SmolLM2-FT-ORPO-a-ufb"
finetune_tags = ["smol-course", "module_2", "argilla/ultrafeedback-binarized-preferences"]

### Format dataset

In [6]:
# Load dataset
dataset = load_dataset(path="argilla/ultrafeedback-binarized-preferences")

In [8]:
def process_dataset(sample):
    sample['prompt'] = [
        {"role": "user", "content": sample['instruction']}
    ]

    # Build chosen column
    sample['chosen'] = [
        {"role": "assistant", "content": sample['chosen_response']}
    ]

    # Build rejected column
    sample['rejected'] = [
        {"role": "assistant", "content": sample['rejected_response']}
    ]
    return sample
dataset = dataset.map(process_dataset)

In [9]:
# Split the train dataset into train and test sets (e.g., 80% train, 20% test)
if "test" not in dataset:
  dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)
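
For reference, the split sizes can be worked out by hand. The sketch below assumes the test size is rounded up and the remainder goes to training, which is consistent with the sizes printed further down (the helper `split_sizes` is hypothetical):

```python
import math


def split_sizes(num_rows, test_size):
    """Return (train rows, test rows) for a fractional test_size."""
    n_test = math.ceil(test_size * num_rows)
    return num_rows - n_test, n_test


print(split_sizes(63619, 0.2))  # (50895, 12724)
```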

In [10]:
# Inspect the dataset structure and metadata
print(dataset)

# Display dataset features
print(dataset["train"].features)
print(dataset["test"].features)

# Check the number of examples in the train and test splits
print(f"Train split size: {len(dataset['train'])}")
print(f"Test split size: {len(dataset['test'])}")

DatasetDict({
    train: Dataset({
        features: ['source', 'instruction', 'chosen_response', 'rejected_response', 'chosen_avg_rating', 'rejected_avg_rating', 'chosen_model', 'prompt', 'chosen', 'rejected'],
        num_rows: 50895
    })
    test: Dataset({
        features: ['source', 'instruction', 'chosen_response', 'rejected_response', 'chosen_avg_rating', 'rejected_avg_rating', 'chosen_model', 'prompt', 'chosen', 'rejected'],
        num_rows: 12724
    })
})
{'source': Value(dtype='string', id=None), 'instruction': Value(dtype='string', id=None), 'chosen_response': Value(dtype='string', id=None), 'rejected_response': Value(dtype='string', id=None), 'chosen_avg_rating': Value(dtype='float64', id=None), 'rejected_avg_rating': Value(dtype='float64', id=None), 'chosen_model': Value(dtype='string', id=None), 'prompt': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'chosen': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='s

In [11]:
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(10000))  # Shuffle, then keep the first 10,000 examples
dataset["test"] = dataset["test"].shuffle(seed=42).select(range(1000))  # Shuffle, then keep the first 1,000 examples

### Train model with ORPO

In [12]:
orpo_args = ORPOConfig(
    # Small learning rate to prevent catastrophic forgetting
    learning_rate=8e-6,
    # Linear learning rate decay over training
    lr_scheduler_type="linear",
    # Maximum combined length of prompt + completion
    max_length=1024,
    # Maximum length for input prompts
    max_prompt_length=512,
    # Controls weight of the odds ratio loss (λ in paper)
    beta=0.1,
    # Batch size for training
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    # Helps with training stability by accumulating gradients before updating
    gradient_accumulation_steps=4,
    # Memory-efficient optimizer for CUDA, falls back to adamw_torch for CPU/MPS
    optim="paged_adamw_8bit" if device == "cuda" else "adamw_torch",
    # Number of training epochs
    num_train_epochs=1,
    # When to run evaluation
    eval_strategy="steps",
    # Evaluate every 20% of training
    eval_steps=0.2,
    # Log metrics every step
    logging_steps=1,
    # Gradual learning rate warmup
    warmup_steps=10,
    # Disable external logging
    report_to="none",
    # Where to save model/checkpoints
    output_dir="./results/",
    # Enable MPS (Metal Performance Shaders) if available
    use_mps_device=device == "mps",
    hub_model_id=finetune_name,
)



In [13]:
trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)



In [14]:
trainer.train()  # Train the model

# Save the model
trainer.save_model(f"./{finetune_name}")

| Step | Training Loss | Validation Loss | Runtime | Samples Per Second | Steps Per Second | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Nll Loss | Log Odds Ratio | Log Odds Chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 250 | 1.7833 | 1.969147 | 37.1523 | 26.916 | 13.458 | -0.16802 | -0.187884 | 0.573 | 0.019864 | -1.878838 | -1.680196 | 9.64974 | 8.090158 | 1.897462 | -0.716853 | 0.228714 |
| 500 | 1.7199 | 1.926081 | 37.1275 | 26.934 | 13.467 | -0.159989 | -0.178521 | 0.569 | 0.018533 | -1.785211 | -1.599885 | 8.852596 | 7.427892 | 1.85549 | -0.705906 | 0.21698 |
| 750 | 1.8162 | 1.90833 | 37.1304 | 26.932 | 13.466 | -0.157598 | -0.176501 | 0.579 | 0.018903 | -1.765012 | -1.575979 | 8.253448 | 6.896946 | 1.838129 | -0.702006 | 0.222985 |
| 1000 | 1.6957 | 1.899662 | 37.0672 | 26.978 | 13.489 | -0.155644 | -0.174178 | 0.577 | 0.018534 | -1.741777 | -1.556439 | 8.124045 | 6.817084 | 1.829494 | -0.70168 | 0.220052 |
| 1250 | 1.9799 | 1.89685 | 37.1648 | 26.907 | 13.454 | -0.155663 | -0.174442 | 0.578 | 0.018779 | -1.744416 | -1.556625 | 8.122831 | 6.795854 | 1.826793 | -0.700567 | 0.223013 |


### Test the model

In [15]:
# Load the fine-tuned model
fine_tuned_model_path = f"./{finetune_name}"
fine_tuned_model = AutoModelForCausalLM.from_pretrained(fine_tuned_model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_path)

In [19]:
print(dataset['test'][1]['prompt'][0]['content'])
print(dataset['test'][1]['chosen'][0]['content'])

what is the total addresable market for business intelligence and how is it distributed
Thank you for your question! I'm happy to help you with that.
The total addressable market (TAM) for business intelligence (BI) is a significant and growing market, with various sources estimating its size. According to a report by MarketsandMarkets, the global BI market is projected to reach $27.6 billion by 2024, growing at a Compound Annual Growth Rate (CAGR) of 10.9% from 2020 to 2024.
The distribution of the BI market can vary depending on factors such as the type of BI tool, industry vertical, and geographic region. Here is a rough estimate of the distribution of the BI market based on these factors:
1. Type of BI tool:
a. Predictive analytics tools: 30-40%
b. Prescriptive analytics tools: 20-30%
c. Operational analytics tools: 30-40%
d. Advanced analytics tools: 10-20%
2. Industry vertical:
a. Finance and banking: 20-25%
b. Healthcare and life sciences: 15-20%
c. Retail and e-commerce: 10-15%

In [22]:
# Test the fine-tuned model on the same prompt

# Format with template
messages = dataset['test'][1]['prompt']
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = fine_tuned_model.generate(**inputs, max_new_tokens=128)

In [23]:
# Decode and print the response
print("After fine-tuning:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After fine-tuning:
user
what is the total addresable market for business intelligence and how is it distributed
assistant
The total addressable market for business intelligence and how it is distributed is a complex and multifaceted topic. It is important to understand the different types of business intelligence and how they are distributed across various industries and organizations.

Business intelligence (BI) is a set of tools and techniques used to gather, analyze, and interpret data to improve business decision-making. It can be used for a variety of purposes, including business analysis, marketing, sales, and customer service.

Business intelligence is distributed across a variety of industries and organizations. This means that it can be used by businesses in a variety of ways, from collecting and analyzing data


## `mlabonne/orpo-dpo-mix-40k`

### Define the model

In [5]:
model_name = "HuggingFaceTB/SmolLM2-135M"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
model, tokenizer = setup_chat_format(model, tokenizer)

# Name under which the fine-tune is saved and/or uploaded
finetune_name = "SmolLM2-FT-ORPO-dpo-mix"
finetune_tags = ["smol-course", "module_2", "mlabonne/orpo-dpo-mix-40k"]

### Format dataset

In [17]:
# Load dataset
dataset = load_dataset(path="mlabonne/orpo-dpo-mix-40k")

In [19]:
def process_dataset(sample):
    # Build the prompt column from the question
    sample['prompt'] = [
        {
            "role": "user",
            "content": sample['question']
        },
    ]

    # Drop the leading user message so only the response remains
    sample['chosen'] = sample['chosen'][1:]
    sample['rejected'] = sample['rejected'][1:]

    return sample
dataset = dataset.map(process_dataset)
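
The slicing above keeps everything after the first message as the completion, which is a clean prompt/response split only when each conversation has exactly two messages. If some conversations were multi-turn (an assumption; this notebook does not check), one hypothetical alternative is to split at the final message instead. The helper `split_last_turn` and the toy conversation are illustrative only.

```python
def split_last_turn(messages):
    """Treat the final message as the completion and the rest as the prompt."""
    return messages[:-1], messages[-1:]


# Toy multi-turn conversation (assumed content)
conversation = [
    {"role": "user", "content": "Name a prime number."},
    {"role": "assistant", "content": "7"},
    {"role": "user", "content": "And a larger one?"},
    {"role": "assistant", "content": "11"},
]

prompt, completion = split_last_turn(conversation)
print(len(prompt), completion)  # 3 [{'role': 'assistant', 'content': '11'}]
```

For a two-message conversation this gives the same result as the `[0:1]` and `[1:]` slices used in the first section.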

In [20]:
# Split the train dataset into train and test sets (e.g., 80% train, 20% test)
if "test" not in dataset:
  dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)

In [21]:
# Inspect the dataset structure and metadata
print(dataset)

# Display dataset features
print(dataset["train"].features)
print(dataset["test"].features)

# Check the number of examples in the train and test splits
print(f"Train split size: {len(dataset['train'])}")
print(f"Test split size: {len(dataset['test'])}")

DatasetDict({
    train: Dataset({
        features: ['source', 'chosen', 'rejected', 'prompt', 'question'],
        num_rows: 35396
    })
    test: Dataset({
        features: ['source', 'chosen', 'rejected', 'prompt', 'question'],
        num_rows: 8849
    })
})
{'source': Value(dtype='string', id=None), 'chosen': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'rejected': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'prompt': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'question': Value(dtype='string', id=None)}
{'source': Value(dtype='string', id=None), 'chosen': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'rejected': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'prompt': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'question': Value

In [22]:
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(10000))  # Shuffle, then keep the first 10,000 examples
dataset["test"] = dataset["test"].shuffle(seed=42).select(range(1000))  # Shuffle, then keep the first 1,000 examples

### Train model with ORPO

In [23]:
orpo_args = ORPOConfig(
    # Small learning rate to prevent catastrophic forgetting
    learning_rate=8e-6,
    # Linear learning rate decay over training
    lr_scheduler_type="linear",
    # Maximum combined length of prompt + completion
    max_length=1024,
    # Maximum length for input prompts
    max_prompt_length=512,
    # Controls weight of the odds ratio loss (λ in paper)
    beta=0.1,
    # Batch size for training
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    # Helps with training stability by accumulating gradients before updating
    gradient_accumulation_steps=4,
    # Memory-efficient optimizer for CUDA, falls back to adamw_torch for CPU/MPS
    optim="paged_adamw_8bit" if device == "cuda" else "adamw_torch",
    # Number of training epochs
    num_train_epochs=1,
    # When to run evaluation
    eval_strategy="steps",
    # Evaluate every 20% of training
    eval_steps=0.2,
    # Log metrics every step
    logging_steps=1,
    # Gradual learning rate warmup
    warmup_steps=10,
    # Disable external logging
    report_to="none",
    # Where to save model/checkpoints
    output_dir="./results/",
    # Enable MPS (Metal Performance Shaders) if available
    use_mps_device=device == "mps",
    hub_model_id=finetune_name,
)



In [24]:
trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)



In [25]:
trainer.train()  # Train the model

# Save the model
trainer.save_model(f"./{finetune_name}")

| Step | Training Loss | Validation Loss | Runtime | Samples Per Second | Steps Per Second | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Nll Loss | Log Odds Ratio | Log Odds Chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 250 | 1.9384 | 1.761485 | 36.2625 | 27.577 | 13.788 | -0.154133 | -0.18875 | 0.572 | 0.034617 | -1.887498 | -1.541325 | 7.629021 | 6.244759 | 1.698105 | -0.63379 | 0.418838 |
| 500 | 1.7247 | 1.724739 | 36.2386 | 27.595 | 13.797 | -0.148619 | -0.20187 | 0.585 | 0.053251 | -2.018698 | -1.486188 | 6.526888 | 5.440802 | 1.66387 | -0.608689 | 0.619451 |
| 750 | 1.5799 | 1.710398 | 36.3407 | 27.517 | 13.759 | -0.146914 | -0.207123 | 0.589 | 0.060209 | -2.071231 | -1.469138 | 6.250858 | 5.23993 | 1.650035 | -0.603635 | 0.693434 |
| 1000 | 1.752 | 1.702898 | 36.3735 | 27.493 | 13.746 | -0.146444 | -0.211058 | 0.586 | 0.064614 | -2.110581 | -1.464436 | 5.690287 | 4.884073 | 1.642676 | -0.60222 | 0.73948 |
| 1250 | 1.4678 | 1.700454 | 36.2751 | 27.567 | 13.784 | -0.146234 | -0.212462 | 0.578 | 0.066228 | -2.124621 | -1.462338 | 5.711889 | 4.89821 | 1.640325 | -0.601286 | 0.756371 |


### Test the model

In [26]:
# Load the fine-tuned model
fine_tuned_model_path = f"./{finetune_name}"
fine_tuned_model = AutoModelForCausalLM.from_pretrained(fine_tuned_model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_path)

In [32]:
print(dataset['train'][1]['prompt'][0]['content'])
print('---')
print(dataset['train'][1]['chosen'][0]['content'])
print('----')
print(dataset['train'][1]['rejected'][0]['content'])

TASK DEFINITION: In this task your given a passage and a question in Catalan, you must answer the question based on the passage. The answer to the question can be extracted directly from the passage. The question will have a single correct answer. The answer will be a continuous span of text from the given passage. The correct answer will be short; it will not be more than a few words.
PROBLEM: Passage: Juan de Juni[n 1] (Joigny, 1506 - Valladolid, 1577) fou un escultor francoespanyol. Juntament amb Alonso Berruguete va formar la gran escola d'escultura castellana. Autor d'una extensa obra, feta principalment durant els més de trenta anys que va viure a Valladolid, les seves peces reflecteixen un gran domini dels diversos materials escultòrics com la terra cuita, la pedra i la fusta, i un extraordinari coneixement de l'anatomia humana.[1]
Question: Què fou Juan de Juni?

SOLUTION: escultor

PROBLEM: Passage: Amb aquestes aportacions, es franquejà una nova etapa. Si bé el significat pre

In [29]:
# Test the fine-tuned model on the same prompt

# Format with template
messages = dataset['train'][1]['prompt']
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = fine_tuned_model.generate(**inputs, max_new_tokens=128)

In [30]:
# Decode and print the response
print("After fine-tuning:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After fine-tuning:
user
TASK DEFINITION: In this task your given a passage and a question in Catalan, you must answer the question based on the passage. The answer to the question can be extracted directly from the passage. The question will have a single correct answer. The answer will be a continuous span of text from the given passage. The correct answer will be short; it will not be more than a few words.
PROBLEM: Passage: Juan de Juni[n 1] (Joigny, 1506 - Valladolid, 1577) fou un escultor francoespanyol. Juntament amb Alonso Berruguete va formar la gran escola d'escultura castellana. Autor d'una extensa obra, feta principalment durant els més de trenta anys que va viure a Valladolid, les seves peces reflecteixen un gran domini dels diversos materials escultòrics com la terra cuita, la pedra i la fusta, i un extraordinari coneixement de l'anatomia humana.[1]
Question: Què fou Juan de Juni?

SOLUTION: escultor

PROBLEM: Passage: Amb aquestes aportacions, es franquejà una nova etapa.