<a href="https://colab.research.google.com/github/thibaud-perrin/preference-alignment/blob/main/notebooks/dpo_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preference Alignment with Direct Preference Optimization (DPO)

This notebook demonstrates the process of fine-tuning a language model using Direct Preference Optimization (DPO). The model used in this example, `SmolLM2-135M-Instruct`, has already undergone Supervised Fine-Tuning (SFT) and is therefore compatible with DPO fine-tuning.

Inside this notebook, we showcase how to align the model's preferences using datasets from the Hugging Face Hub. Specifically, we provide examples of how to fine-tune the model on different datasets to better align its responses with desired preferences.


## What's Inside

### Fine-Tuning with DPOTrainer
We use the `DPOTrainer` from the `trl` library to fine-tune the model with carefully chosen datasets. The process involves:
- Loading a pre-trained model (`SmolLM2-135M-Instruct`).
- Selecting a dataset for alignment, such as:
  - **Basic Example:** Fine-tuning with the `trl-lib/ultrafeedback_binarized` dataset.
  - **Intermediate Example:** Fine-tuning with the `argilla/ultrafeedback-binarized-preferences` dataset.
- Training the model to better align its outputs with human preferences.

The goal is for the fine-tuned model to produce responses that are more consistent with the preferences encoded in the dataset. The runs recorded here use small subsamples, so treat them as a walkthrough of the DPO workflow rather than a demonstration of improved model quality.
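To make the training objective concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. The function name `dpo_loss` and the example log-probabilities are illustrative assumptions, not values from this notebook; `DPOTrainer` computes the equivalent quantity over batches of summed token log-probabilities.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair (illustrative helper)."""
    # Log-ratio of the policy vs. the frozen reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # loss = -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))
    logits = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-logits))

# Made-up log-probabilities: policy identical to the reference
loss_at_start = dpo_loss(-100.0, -120.0, -100.0, -120.0)
# Policy has shifted probability towards the chosen response
loss_after_shift = dpo_loss(-95.0, -125.0, -100.0, -120.0)
print(loss_at_start, loss_after_shift)
```

When the policy equals the reference, the loss is ln 2 (about 0.693), which is a useful baseline when reading early validation losses.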

## Secrets
Load the Hugging Face token from Colab secrets and log in to the Hugging Face Hub.

In [1]:
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')

In [2]:
# Authenticate to Hugging Face
from huggingface_hub import login

login(token=HF_TOKEN)

## Libraries

In [3]:
# Install the requirements in Google Colab
# (transformers is installed as a dependency of trl)
!pip install datasets trl huggingface_hub



In [2]:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig

## `trl-lib/ultrafeedback_binarized`

### Select the model
We will use the `SmolLM2-135M-Instruct` model, which has already been through SFT, so it is compatible with DPO.

In [5]:
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Set our name for the finetune to be saved
finetune_name = "SmolLM2-FT-DPO-ufb"
finetune_tags = ["smol-course", "module_2", "trl-lib/ultrafeedback_binarized"]

### Test the model before training

In [6]:
# Let's test the base model before training
prompt = "Use the pygame library to write a version of the classic game Snake, with a unique twist"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print("Before training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Before training:
system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
user
Use the pygame library to write a version of the classic game Snake, with a unique twist
assistant
Here's a Python implementation of the Snake game using the pygame library:

```python
import pygame

class Snake:
    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.board = [[0 for _ in range(width)] for _ in range(height)]

    def move(self, direction):
        if direction == 'right':
            self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.index(self.board[self.board.ind
```

### Format dataset

In [7]:
# Load dataset
dataset = load_dataset(path="trl-lib/ultrafeedback_binarized")

In [8]:
def process_dataset(sample):
    # Apply template for `chosen`
    sample['chosen'] = tokenizer.apply_chat_template(
        sample['chosen'],
        tokenize=False,
        add_generation_prompt=False  # Avoid adding duplicate prompts
    )

    # Apply template for `rejected`
    sample['rejected'] = tokenizer.apply_chat_template(
        sample['rejected'],
        tokenize=False,
        add_generation_prompt=False  # Avoid adding duplicate prompts
    )

    return sample
dataset = dataset.map(process_dataset)

In [9]:
# Pre-process: keep only the `chosen` and `rejected` columns
train_dataset = dataset["train"].map(lambda x: {"chosen": x["chosen"], "rejected": x["rejected"]}, remove_columns=dataset["train"].column_names)
eval_dataset = dataset["test"].map(lambda x: {"chosen": x["chosen"], "rejected": x["rejected"]}, remove_columns=dataset["test"].column_names)

In [10]:
# Inspect the dataset structure and metadata
print(dataset)

# Display dataset features
print(dataset["train"].features)

# Check the number of examples in the train and test splits
print(f"Train split size: {len(dataset['train'])}")
print(f"Test split size: {len(dataset['test'])}")

# Peek at a few examples to understand the data format
print(dataset["train"][0])
print(dataset["test"][0])

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 62135
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 1000
    })
})
{'chosen': Value(dtype='string', id=None), 'rejected': Value(dtype='string', id=None), 'score_chosen': Value(dtype='float64', id=None), 'score_rejected': Value(dtype='float64', id=None)}
Train split size: 62135
Test split size: 1000
{'chosen': "<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n<|im_start|>user\nUse the pygame library to write a version of the classic game Snake, with a unique twist<|im_end|>\n<|im_start|>assistant\nSure, I'd be happy to help you write a version of the classic game Snake using the pygame library! Here's a basic outline of how we can approach this:\n\n1. First, we'll need to set up the game display and create a game object that we can use 

In [11]:
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(1000))  # Shuffle, then keep the first 1,000 examples
dataset["test"] = dataset["test"].shuffle(seed=42).select(range(1000))  # Test split has exactly 1,000 rows, so this only shuffles

### Train model with DPO

In [12]:
num_epochs = 3
train_size = len(dataset["train"])

# Calculate max_steps assuming 4 examples per step (gradient accumulation is not accounted for)
max_steps = train_size // 4 * num_epochs

# Determine eval_steps as a fraction of max_steps (e.g., every 10% of max_steps)
eval_steps = max_steps // 10  # Adjust the divisor for more/less frequent evaluations

# Determine save_steps as a fraction of max_steps (e.g., every 5% of max_steps)
save_steps = max_steps // 20  # Adjust the divisor for more/less frequent saves

# Determine logging_steps as a fraction of max_steps (e.g., every 2% of max_steps)
logging_steps = max_steps // 50  # Adjust the divisor for more/less frequent logs

print(f"Calculated max_steps: {max_steps}")
print(f"Calculated eval_steps: {eval_steps}")
print(f"Calculated save_steps: {save_steps}")
print(f"Calculated logging_steps: {logging_steps}")

Calculated max_steps: 750
Calculated eval_steps: 75
Calculated save_steps: 37
Calculated logging_steps: 15
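The step count above divides by the per-device batch size only. When gradient accumulation is enabled, each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` examples, so the same `max_steps` corresponds to more passes over the data. The sketch below works through that arithmetic; `steps_for_epochs` is a hypothetical helper, and the batch settings are the ones used in the `DPOConfig` cell of this section, assuming a single device.

```python
def steps_for_epochs(train_size, per_device_batch, grad_accum, num_epochs, num_devices=1):
    """Optimizer steps needed for `num_epochs` passes (hypothetical helper)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = max(1, train_size // effective_batch)
    return steps_per_epoch * num_epochs, effective_batch

steps, effective_batch = steps_for_epochs(1000, per_device_batch=4, grad_accum=4, num_epochs=3)
print(f"Effective batch size: {effective_batch}")
print(f"Steps for 3 epochs: {steps}")

# Epochs actually covered by max_steps=750 at this effective batch size
epochs_covered = 750 * effective_batch / 1000
print(f"Epochs covered by 750 steps: {epochs_covered}")
```

Under these assumptions, 750 steps amount to about 12 passes over the 1,000 training examples rather than 3, which is worth keeping in mind when reading the training log.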


In [13]:
# Training arguments
training_args = DPOConfig(
    # Training batch size per GPU
    per_device_train_batch_size=4,
    # Number of batches whose gradients are accumulated before each optimizer update
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    gradient_accumulation_steps=4,
    # Saves memory by not storing activations during forward pass
    # Instead recomputes them during backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training steps
    max_steps=max_steps,
    eval_steps=eval_steps,
    save_steps=save_steps,
    eval_strategy="steps",
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=logging_steps,
    # Directory to save model outputs
    output_dir="smol_dpo_output",
    # Number of steps for learning rate warmup
    warmup_steps=100,
    # Use bfloat16 precision for faster training
    bf16=True,
    # Disable wandb/tensorboard logging
    report_to="none",
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    # Enable MPS (Metal Performance Shaders) for Mac devices
    use_mps_device=device == "mps",
    # Model ID for HuggingFace Hub uploads
    hub_model_id=finetune_name,
    # DPO-specific parameter controlling how far the policy may deviate from the reference model
    # Higher values keep the policy closer to the reference; lower values (like 0.1) allow more deviation
    beta=0.1,
    # Maximum length of the input prompt in tokens
    max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    max_length=1536,
)

In [14]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing chosen/rejected response pairs
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    # Tokenizer for processing inputs
    processing_class=tokenizer,
    # beta, max_prompt_length and max_length are set in DPOConfig above
)

In [15]:
import os

os.environ["WANDB_MODE"] = "disabled"

In [16]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 0.3238 | 0.659508 | -0.42542 | -0.68467 | 0.621 | 0.25925 | -444.678619 | -386.225647 | 3.340478 | 3.379606 |
| 150 | 0.0276 | 1.054754 | -4.162828 | -4.571417 | 0.563 | 0.408589 | -482.052582 | -425.09314 | 0.774694 | 0.713328 |
| 225 | 0.0105 | 1.11997 | -4.888909 | -5.448071 | 0.584 | 0.559162 | -489.313446 | -433.859619 | 0.100465 | -0.029427 |
| 300 | 0.0033 | 1.206077 | -6.477919 | -7.179644 | 0.582 | 0.701725 | -505.203583 | -451.175354 | -0.09514 | -0.2418 |
| 375 | 0.003 | 1.188484 | -5.631783 | -6.257374 | 0.57 | 0.625591 | -496.742157 | -441.952698 | 0.07133 | -0.053966 |
| 450 | 0.0029 | 1.218685 | -5.99618 | -6.651042 | 0.574 | 0.654862 | -500.386169 | -445.889343 | -0.076222 | -0.210845 |
| 525 | 0.0 | 1.236261 | -6.187339 | -6.857077 | 0.574 | 0.669739 | -502.29776 | -447.949707 | -0.163337 | -0.303013 |
| 600 | 0.0 | 1.245811 | -6.284099 | -6.954398 | 0.568 | 0.670299 | -503.265381 | -448.922913 | -0.206039 | -0.348544 |
| 675 | 0.0029 | 1.2467 | -6.315342 | -6.993732 | 0.575 | 0.678391 | -503.577789 | -449.316315 | -0.213677 | -0.356484 |
| 750 | 0.0029 | 1.251297 | -6.329402 | -7.003645 | 0.573 | 0.674243 | -503.718384 | -449.415375 | -0.220758 | -0.364112 |
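One way to read the log above is to check when validation loss stops improving. The sketch below copies three columns from the log (step, training loss, validation loss) and reports the best evaluation step; the variable names and the 0.01 threshold are illustrative choices.

```python
# (step, training_loss, validation_loss) copied from the log above
history = [
    (75, 0.3238, 0.659508), (150, 0.0276, 1.054754), (225, 0.0105, 1.119970),
    (300, 0.0033, 1.206077), (375, 0.0030, 1.188484), (450, 0.0029, 1.218685),
    (525, 0.0000, 1.236261), (600, 0.0000, 1.245811), (675, 0.0029, 1.246700),
    (750, 0.0029, 1.251297),
]

best_step, _, best_val = min(history, key=lambda row: row[2])
last_step, last_train, last_val = history[-1]

print(f"Lowest validation loss {best_val:.3f} at step {best_step}")
print(f"Final step {last_step}: train {last_train:.4f}, validation {last_val:.3f}")

# Training loss near zero while validation loss keeps rising is the
# usual signature of overfitting.
overfit = last_train < 0.01 and last_val > best_val
print("Overfitting suspected:", overfit)
```

Here the lowest validation loss is already reached at the first evaluation, which suggests that far fewer steps (or more data) would have been the better trade-off for this run.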


### Test the model

In [17]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model
fine_tuned_model_path = f"./{finetune_name}"
fine_tuned_model = AutoModelForCausalLM.from_pretrained(fine_tuned_model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_path)

In [24]:
print(dataset['train'][0]['chosen'])

<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
INPUT ARTICLE: Article: Different laptop models have different numeric pad configurations.  If your U, I, and O keys have 4, 5, and 6 printed in the lower corner, you have an older laptop with an alternate numeric pad. See the next section for details on using it. The ThinkPad line of laptops do not use an alternate numeric pad. You'll need to use the method in this section as a workaround. Some larger models have a dedicated numeric pad. Click the "Start" button in the lower-right corner of the desktop. In many versions of Windows, this is just a Windows icon. The Start menu will appear above the button. If you are using Windows 8 and don't see the Start button, press ⊞ Win on the keyboard. This will open the Start screen. You can start typing immediately when the Start menu or screen is open to begin searching. You'll see "On-Screen Keyboard" in the search results. If yo

In [28]:
# Test the fine-tuned model on the same prompt as before training
prompt = "Use the pygame library to write a version of the classic game Snake, with a unique twist"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = fine_tuned_model.generate(**inputs, max_new_tokens=1024)

In [19]:
print("Sure, I'd be happy to help you write a version of the classic game Snake using the pygame library! Here's a basic outline of how we can approach this:\n\n1. First, we'll need to set up the game display and create a game object that we can use to handle the game's state.\n2. Next, we'll create the game's grid, which will be used to represent the game board. We'll need to define the size of the grid and the spaces within it.\n3. After that, we'll create the snake object, which will be used to represent the player's movement. We'll need to define the size of the snake and the speed at which it moves.\n4. We'll also need to create a food object, which will be used to represent the food that the player must collect to score points. We'll need to define the location of the food and the speed at which it moves.\n5. Once we have these objects set up, we can start handling the game's logic. We'll need to define the rules for the player's movements, such as how the player can move the snake and how the snake will grow as the player collects more food.\n6. We'll also need to add collisions detection to the game, so that the snake and the food can collide with each other.\n7. Finally, we'll need to add a user interface to the game, such as a menu and a scoreboard.\n\nNow, as for the unique twist, we could add a few features to make the game more interesting. For example, we could add power-ups that give the player special abilities, such as the ability to grow the snake faster or to change its direction. 
We could also add obstacles, such as walls or pits, that the player must avoid.\n\nHere's some sample code to get us started:\n```\nimport pygame\n\n# Initialize pygame\npygame.init()\n\n# Set up the game display\nwidth = 800\nheight = 600\nscreen = pygame.display.set_mode((width, height))\n\n# Define the colors\nWHITE = (255, 255, 255)\nBLACK = (0, 0, 0)\nGREEN = (0, 255, 0)\n\n# Define the game objects\nsnake = pygame.Rect(50, 50, 10, 10)\nfood = pygame.Rect(100, 100, 10, 10)\n\n# Define the game logic\ndef update_snake():\n # Get the mouse position\n mouse_pos = pygame.mouse.get_pos()\n\n # Move the snake based on the mouse position\n if mouse_pos[0] > snake.x:\n snake.x += 10\n elif mouse_pos[0] < snake.x:\n snake.x -= 10\n elif mouse_pos[1] > snake.y:\n snake.y += 10\n elif mouse_pos[1] < snake.y:\n snake.y -= 10\n\n # Update the snake's size\n if snake.x == food.x and snake.y == food.y:\n snake.width += 10\n snake.height += 10\n\n# Define the game loop\ndef game_loop():\n # Clear the screen\n screen.fill(BLACK)\n\n # Update the game objects\n update_snake()\n\n # Draw the game objects\n screen.fill(WHITE)\n screen.draw.rect(snake, GREEN)\n screen.draw.rect(food, BLACK)\n\n # Update the display\n pygame.display.flip()\n\n# Start the game loop\ngame_loop()\n\n# Run the game\nwhile True:\n for event in pygame.event.get():\n if event.type == pygame.QUIT:\n pygame.quit()\n break\n\n pygame.time.Clock().tick(60)\n```\nThis code sets up a basic game display, defines the snake and food objects, and updates the game state based on the player's movements. We've also added a simple user interface and some basic collision detection.\n\nAs for the unique twist, we could add a few features to make the game")

Sure, I'd be happy to help you write a version of the classic game Snake using the pygame library! Here's a basic outline of how we can approach this:

1. First, we'll need to set up the game display and create a game object that we can use to handle the game's state.
2. Next, we'll create the game's grid, which will be used to represent the game board. We'll need to define the size of the grid and the spaces within it.
3. After that, we'll create the snake object, which will be used to represent the player's movement. We'll need to define the size of the snake and the speed at which it moves.
4. We'll also need to create a food object, which will be used to represent the food that the player must collect to score points. We'll need to define the location of the food and the speed at which it moves.
5. Once we have these objects set up, we can start handling the game's logic. We'll need to define the rules for the player's movements, such as how the player can move the snake and how th

In [29]:
# Decode and print the response
print("After fine-tuning:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After fine-tuning:
system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
user
Use the pygame library to write a version of the classic game Snake, with a unique twist
Professor_Kai_Lupin_1234567890

What is the pygame library used to create the classic game Snake, with a unique twist?
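Note that the output above contains no `assistant` turn. One possible contributor is that the prompt was rendered without a generation prompt, so the model has to open the assistant turn itself. The sketch below imitates the ChatML-style layout visible in the dataset printout earlier in this notebook; `render_chatml` is a hypothetical stand-in for `tokenizer.apply_chat_template`, not the tokenizer's actual template, and it omits the default system message.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Hypothetical, simplified ChatML renderer for illustration only."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open the assistant turn so generation starts inside it
        text += "<|im_start|>assistant\n"
    return text

messages = [{"role": "user", "content": "Hello"}]
without_prompt = render_chatml(messages)
with_prompt = render_chatml(messages, add_generation_prompt=True)
print(repr(without_prompt))
print(repr(with_prompt))
```

With the real tokenizer, the corresponding call would be `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.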


In [21]:
# Deliberate failure to halt execution here before the second experiment
assert "a" == "b", "stop"

AssertionError: stop

## argilla/ultrafeedback-binarized-preferences

### Select the model

In [3]:
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Set the name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-FT-DPO-a-ufbp"
finetune_tags = ["smol-course", "module_2", "argilla/ultrafeedback-binarized-preferences"]

In [4]:
# Let's test the base model before training
prompt = "How can I convert the decimal number 31 to binary format using JavaScript code? Can you provide the code for this conversion?"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print("Before training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Before training:
system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
user
How can I convert the decimal number 31 to binary format using JavaScript code? Can you provide the code for this conversion?
assistant
Sure! Here's a JavaScript code snippet that converts a decimal number to binary format:

```javascript
function decimalToBinary(n) {
    let binary = '';
    for (let i = 0; i < n.length; i++) {
        binary += String.fromCharCode(n.charCodeAt(i) + 2);
    }
    return binary;
}

// Example usage:
let decimal = 31;
let binary = decimalToBinary(decimal);
console.log(binary);
```

In this code, the `decimalToBinary` function takes a decimal number `n` as input and returns the binary representation of `n`. It uses a `for` loop to iterate over the binary digits of the decimal number. For each binary digit, it adds 2 to the corresponding character in the string `String.fromCharCode(n.charCodeAt(i) + 2)`.

The `String.fromCharCode` method is used to convert th

### Format dataset

In [5]:
# Load dataset
dataset = load_dataset(path="argilla/ultrafeedback-binarized-preferences")

In [6]:
def process_dataset(sample):
    instruction = sample['instruction']

    # Build chosen column
    chosen = [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": sample['chosen_response']}
    ]

    # Build rejected column
    rejected = [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": sample['rejected_response']}
    ]

    # Apply template for `chosen`
    sample['chosen'] = tokenizer.apply_chat_template(
        chosen,
        tokenize=False,
        add_generation_prompt=False  # Avoid adding duplicate prompts
    )

    # Apply template for `rejected`
    sample['rejected'] = tokenizer.apply_chat_template(
        rejected,
        tokenize=False,
        add_generation_prompt=False  # Avoid adding duplicate prompts
    )
    return sample
dataset = dataset.map(process_dataset)


In [7]:
# Hold out 20% of the train split as a test set if the dataset has none
if "test" not in dataset:
    dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)

In [8]:
# Pre-process: keep only the `chosen` and `rejected` columns
train_dataset = dataset["train"].map(lambda x: {"chosen": x["chosen"], "rejected": x["rejected"]}, remove_columns=dataset["train"].column_names)
eval_dataset = dataset["test"].map(lambda x: {"chosen": x["chosen"], "rejected": x["rejected"]}, remove_columns=dataset["test"].column_names)

In [9]:

# Inspect the dataset structure and metadata
print(dataset)

# Display dataset features
print(dataset["train"].features)
print(dataset["test"].features)

# Check the number of examples in the train and test splits
print(f"Train split size: {len(dataset['train'])}")
print(f"Test split size: {len(dataset['test'])}")

# Peek at a few examples to understand the data format
print(dataset["train"][0])
print(dataset["test"][0])

DatasetDict({
    train: Dataset({
        features: ['source', 'instruction', 'chosen_response', 'rejected_response', 'chosen_avg_rating', 'rejected_avg_rating', 'chosen_model', 'chosen', 'rejected'],
        num_rows: 50895
    })
    test: Dataset({
        features: ['source', 'instruction', 'chosen_response', 'rejected_response', 'chosen_avg_rating', 'rejected_avg_rating', 'chosen_model', 'chosen', 'rejected'],
        num_rows: 12724
    })
})
{'source': Value(dtype='string', id=None), 'instruction': Value(dtype='string', id=None), 'chosen_response': Value(dtype='string', id=None), 'rejected_response': Value(dtype='string', id=None), 'chosen_avg_rating': Value(dtype='float64', id=None), 'rejected_avg_rating': Value(dtype='float64', id=None), 'chosen_model': Value(dtype='string', id=None), 'chosen': Value(dtype='string', id=None), 'rejected': Value(dtype='string', id=None)}
{'source': Value(dtype='string', id=None), 'instruction': Value(dtype='string', id=None), 'chosen_response':

In [10]:
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(100))  # Shuffle, then keep the first 100 examples
dataset["test"] = dataset["test"].shuffle(seed=42).select(range(100))  # Shuffle, then keep the first 100 examples

### Train model with DPO

In [11]:
num_epochs = 3
train_size = len(dataset["train"])

# Calculate max_steps assuming 4 examples per step (the config below uses a batch size of 6 with gradient accumulation)
max_steps = train_size // 4 * num_epochs

# Determine eval_steps as a fraction of max_steps (e.g., every 10% of max_steps)
eval_steps = max_steps // 10  # Adjust the divisor for more/less frequent evaluations

# Determine save_steps as a fraction of max_steps (e.g., every 5% of max_steps)
save_steps = max_steps // 20  # Adjust the divisor for more/less frequent saves

# Determine logging_steps as a fraction of max_steps (e.g., every 2% of max_steps)
logging_steps = max_steps // 50  # Adjust the divisor for more/less frequent logs

print(f"Calculated max_steps: {max_steps}")
print(f"Calculated eval_steps: {eval_steps}")
print(f"Calculated save_steps: {save_steps}")
print(f"Calculated logging_steps: {logging_steps}")

Calculated max_steps: 75
Calculated eval_steps: 7
Calculated save_steps: 3
Calculated logging_steps: 1


In [12]:
# Training arguments
training_args = DPOConfig(
    # Training batch size per GPU
    per_device_train_batch_size=6,
    # Number of batches whose gradients are accumulated before each optimizer update
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    gradient_accumulation_steps=6,
    # Saves memory by not storing activations during forward pass
    # Instead recomputes them during backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training steps
    max_steps=max_steps,
    eval_steps=eval_steps,
    save_steps=save_steps,
    eval_strategy="steps",
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=logging_steps,
    # Directory to save model outputs
    output_dir="smol_dpo_output",
    # Number of steps for learning rate warmup
    warmup_steps=100,
    # Use bfloat16 precision for faster training
    bf16=True,
    # Disable wandb/tensorboard logging
    report_to="none",
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    # Enable MPS (Metal Performance Shaders) for Mac devices
    use_mps_device=device == "mps",
    # Model ID for HuggingFace Hub uploads
    hub_model_id=finetune_name,
    # DPO-specific parameter controlling how far the policy may deviate from the reference model
    # Higher values keep the policy closer to the reference; lower values (like 0.1) allow more deviation
    beta=0.1,
    # Maximum length of the input prompt in tokens
    max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    max_length=1536,
)
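With the subsample used here, `max_steps` is 75 while `warmup_steps` is 100, so training ends before warmup finishes and the cosine decay phase is never reached. The sketch below assumes a linear warmup to the base rate followed by cosine decay, which is how this kind of schedule is commonly defined; `lr_at` is an illustrative helper, not the trainer's own scheduler.

```python
import math

def lr_at(step, base_lr=5e-5, warmup_steps=100, max_steps=75):
    """Linear warmup then cosine decay (illustrative approximation)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

peak_lr_reached = max(lr_at(s) for s in range(1, 76))
print(f"Highest learning rate in 75 steps: {peak_lr_reached:.2e}")
print(f"Fraction of base rate: {peak_lr_reached / 5e-5:.2f}")
```

Under this approximation the learning rate only climbs for the whole run and tops out at about three quarters of the configured value; a warmup that is a small fraction of `max_steps` would let the cosine schedule take effect.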

In [13]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing chosen/rejected response pairs
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    # Tokenizer for processing inputs
    processing_class=tokenizer,
    # beta, max_prompt_length and max_length are set in DPOConfig above
)

In [14]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 1.1538 | 0.701339 | -0.011172 | 0.003247 | 0.413462 | -0.014418 | -382.152283 | -293.547302 | 3.437001 | 3.762158 |
| 14 | 0.6063 | 0.692862 | -0.001352 | -0.006054 | 0.576923 | 0.004702 | -382.054077 | -293.64032 | 3.44155 | 3.766985 |
| 21 | 0.6831 | 0.690879 | -0.017133 | -0.029862 | 0.461538 | 0.012728 | -382.211884 | -293.878357 | 3.452502 | 3.77547 |
| 28 | 0.1815 | 0.67937 | -0.071977 | -0.125406 | 0.538462 | 0.053429 | -382.760376 | -294.833832 | 3.415394 | 3.734719 |
| 35 | 0.1043 | 0.677602 | -0.258049 | -0.387429 | 0.557692 | 0.12938 | -384.621094 | -297.454041 | 3.216835 | 3.531419 |
| 42 | 0.0074 | 0.719351 | -0.690928 | -0.905617 | 0.557692 | 0.214689 | -388.94986 | -302.635895 | 2.832312 | 3.140878 |
| 49 | 0.001 | 0.779917 | -1.36443 | -1.685675 | 0.528846 | 0.321245 | -395.684845 | -310.436523 | 2.20033 | 2.475456 |
| 56 | 0.0001 | 0.860862 | -2.189919 | -2.627513 | 0.567308 | 0.437594 | -403.939758 | -319.854858 | 1.603184 | 1.836488 |
| 63 | 0.0001 | 0.928634 | -2.772413 | -3.253306 | 0.557692 | 0.480893 | -409.764709 | -326.112823 | 1.255835 | 1.461507 |
| 70 | 0.0 | 0.977576 | -3.141264 | -3.648539 | 0.586538 | 0.507275 | -413.453247 | -330.065155 | 1.040976 | 1.226405 |
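The `Rewards/*` columns are derived from log-probabilities: each implicit reward is `beta` times the gap between the policy and reference log-probability, the margin is chosen minus rejected, and accuracy is the share of pairs where the chosen reward is higher. A minimal sketch with made-up log-probabilities (the function name `reward_metrics` and all numbers are illustrative):

```python
def reward_metrics(pairs, beta=0.1):
    """pairs: (policy_chosen, policy_rejected, ref_chosen, ref_rejected) log-probs."""
    chosen = [beta * (pc - rc) for pc, _, rc, _ in pairs]
    rejected = [beta * (pr - rr) for _, pr, _, rr in pairs]
    margins = [c - r for c, r in zip(chosen, rejected)]
    accuracy = sum(m > 0 for m in margins) / len(pairs)
    return {
        "rewards/chosen": sum(chosen) / len(chosen),
        "rewards/rejected": sum(rejected) / len(rejected),
        "rewards/margins": sum(margins) / len(margins),
        "rewards/accuracies": accuracy,
    }

# Two made-up pairs: the first is ranked correctly, the second is not
pairs = [(-380.0, -300.0, -382.0, -294.0), (-390.0, -290.0, -385.0, -293.0)]
metrics = reward_metrics(pairs)
print(metrics)
```

In the log above, both rewards fall below zero while the margin grows, meaning the policy lowers the likelihood of both responses relative to the reference, with the rejected one dropping faster.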


### Test the model

In [15]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model
fine_tuned_model_path = f"./{finetune_name}"
fine_tuned_model = AutoModelForCausalLM.from_pretrained(fine_tuned_model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_path)

In [16]:
# Test the fine-tuned model on the same prompt as before training
prompt = "How can I convert the decimal number 31 to binary format using JavaScript code? Can you provide the code for this conversion?"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = fine_tuned_model.generate(**inputs, max_new_tokens=600)

In [17]:
print('Expected')
print("Yes, I can provide the JavaScript code for converting the decimal number 31 to binary format. Here's the code: ``` let decimalNumber = 31; let binaryString = decimalNumber.toString(2); console.log(binaryString); ``` In this code, we first declare a variable `decimalNumber` and assign it the value 31. Then we use the `toString()` method to convert the decimal number to binary format. The `toString()` method takes a parameter which represents the base of the number system that we want to convert to. In this case, we pass the value `2` to represent the binary number system. The converted binary number is then stored in the `binaryString` variable. Finally, we use `console.log()` to display the binary string in the console.")

Expected
Yes, I can provide the JavaScript code for converting the decimal number 31 to binary format. Here's the code: ``` let decimalNumber = 31; let binaryString = decimalNumber.toString(2); console.log(binaryString); ``` In this code, we first declare a variable `decimalNumber` and assign it the value 31. Then we use the `toString()` method to convert the decimal number to binary format. The `toString()` method takes a parameter which represents the base of the number system that we want to convert to. In this case, we pass the value `2` to represent the binary number system. The converted binary number is then stored in the `binaryString` variable. Finally, we use `console.log()` to display the binary string in the console.


In [18]:
# Decode and print the response
print("After fine-tuning:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After fine-tuning:
system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
user
How can I convert the decimal number 31 to binary format using JavaScript code? Can you provide the code for this conversion?
assistant
You can convert the decimal number 31 to binary format in JavaScript using the `toBinaryString()` method of the `Number` object. Here's how you can do it:

```javascript
function decimalToBinary(n) {
    return new Array(n).fill(0).map(() => String(n)).join('');
}

// Test the conversion
let decimal = 31;
let binary = decimalToBinary(decimal);
console.log(binary); // Output: 1010
```

The `toBinaryString()` method returns a new array with the binary representation of the input number. The `fill()` method is used to add zeros to the end of each element in the array. The `map()` method is used to iterate over the array and apply the `fill()` method to each element. Finally, the `join()` method is used to convert the array into a binary string.

In this exa

## Conclusion

This notebook walks through fine-tuning a model with Direct Preference Optimization (DPO). Because of limited compute, training used small subsamples (1,000 and 100 examples), and the model overfit in both runs: training loss fell to roughly zero while validation loss rose (from 0.66 to 1.25 in the first run and from 0.70 to 0.98 in the second), and reward accuracy on the evaluation set stayed below 0.63. The fine-tuned outputs correspondingly did not improve on the base model's. Even so, the notebook gives a clear overview of the DPO workflow and its implementation with `DPOTrainer`.
