<a href="https://colab.research.google.com/github/simon-mellergaard/GAI-with-LLMs/blob/main/Other%20material/DPO_example_smolV2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Align SmolLM2-135M-Instruct with DPO

### Load the stuff we need

Install `trl` and load the necessary libraries.

In [None]:
!pip install trl



In [None]:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig

Load `SmolLM2-135M-Instruct` and its tokenizer

In [None]:
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-FT-DPO"
finetune_tags = ["smol-course", "module_1"]

Load the preference dataset: the train split for training and the test split for evaluation.

In [None]:
dataset = load_dataset(path="trl-lib/ultrafeedback_binarized", split="train")
dataset_eval = load_dataset(path="trl-lib/ultrafeedback_binarized", split="test")

Let's check one example. Each training example should provide three things:

* a prompt (human instruction)
* a chosen completion
* a rejected completion.

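The completions come from the UltraFeedback dataset. To our knowledge, the answers there were generated by a range of different language models and rated by GPT-4 rather than by human annotators; in the binarized version, a higher-rated answer becomes the chosen completion and a lower-rated one the rejected completion.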

In [None]:
example = dataset[15]
print(example.keys())

dict_keys(['chosen', 'rejected', 'score_chosen', 'score_rejected'])


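Note that there is no separate `prompt` key: the human instruction is the first (`user`) message in both `chosen` and `rejected`. Let's look at the chosen conversation: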

In [None]:
example["chosen"]

[{'content': 'What unique cultural experiences can visitors have in Okinawa?',
  'role': 'user'},
 {'content': "Okinawa, an island in Japan, offers a variety of unique cultural experiences for visitors. Some of these include:\n\n1. Ryukyu Cuisine: Okinawa has its own distinct culinary style, known as Ryukyu cuisine, which incorporates local ingredients and traditional cooking methods. Visitors can try dishes like Goya Champuru (bitter melon stir-fry), Rafute (slow-cooked pork belly), and Okinawa Soba (a noodle soup).\n\n2. Eisa Dance: Eisa is a traditional Okinawan dance performed during the Obon season to honor the spirits of ancestors. Visitors can attend Eisa festivals and performances to watch the lively, drum-driven dance.\n\n3. Shuri Castle: Shuri Castle was the residence of the Ryukyu Kingdom's kings and a center of diplomacy and culture. Though the castle has been rebuilt several times, visitors can still explore its history and architecture.\n\n4. Bingata Textile Art: Bingata 

Let's take a look at the rejected one:

In [None]:
example["rejected"]

[{'content': 'What unique cultural experiences can visitors have in Okinawa?',
  'role': 'user'},
 {'content': 'As an AI language model, I don\'t have personal experiences, but I can provide some unique cultural experiences that visitors can have in Okinawa:\n\n1. Visit the Shuri Castle: This restored 13th-century royal palace is a UNESCO World Heritage Site and offers visitors a glimpse into Okinawan history and culture.\n\n2. Explore the Ryukyu Islands: Okinawa is home to the Ryukyu Islands, which are known for their stunning beaches, crystal-clear waters, and unique culture. Visitors can take a boat tour or snorkel among the colorful underwater life.\n\n3. Experience the traditional dance: The traditional dance of Okinawa is called the "Danchien" and is a must-see for visitors. It is a lively and colorful performance that showcases the island\'s culture.\n\n4. Try local cuisine: Okinawan cuisine is known for its fresh seafood, unique flavors, and healthy ingredients. Visitors can tr

Well, well, well? Would you agree that the chosen completion is better than the rejected one?

Also notice that the "chosen" and "rejected" entries are both lists of messages. Each message is a dictionary holding the actual "content" and a "role" (either "user" for the human or "assistant" for the chatbot's response). This is the same conversational format that is used for supervised fine-tuning (SFT).
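Because the prompt is embedded in both conversations, it can be recovered as their shared prefix. The snippet below is a minimal, hand-written sketch of that idea on a toy example (the messages are made up, and `split_pair` is a hypothetical helper, not necessarily how `DPOTrainer` handles this internally).

```python
# Toy preference pair in the same message-list layout as the dataset
# (contents invented for illustration, not copied from the dataset).
example = {
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "I am not sure."},
    ],
}


def split_pair(example):
    """Return (prompt, chosen_completion, rejected_completion) as message lists."""
    chosen, rejected = example["chosen"], example["rejected"]
    # The messages both conversations share at the start form the prompt.
    n = 0
    while n < min(len(chosen), len(rejected)) and chosen[n] == rejected[n]:
        n += 1
    return chosen[:n], chosen[n:], rejected[n:]


prompt, chosen_completion, rejected_completion = split_pair(example)
print(prompt[0]["content"])               # What is 2 + 2?
print(chosen_completion[0]["content"])    # 2 + 2 = 4.
print(rejected_completion[0]["content"])  # I am not sure.
```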

### Set training parameters

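Here we set the training parameters. Most of them are general training options; `beta`, `max_prompt_length` and `max_length` are the ones that matter specifically for DPO.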

In [None]:
# Training arguments
training_args = DPOConfig(
    # Training and eval batch size per GPU
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    # Number of micro-batches whose gradients are accumulated before each optimizer update
    # Effective batch size per device = per_device_train_batch_size * gradient_accumulation_steps
    gradient_accumulation_steps=4,
    # Saves memory by not storing activations during forward pass
    # Instead recomputes them during backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training (optimizer) steps
    max_steps=200,
    # Run evaluation on the eval dataset every 20 steps
    eval_steps=20,
    eval_strategy="steps",
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=20,
    # Directory to save model outputs
    output_dir="smol_dpo_output",
    # Number of steps for learning rate warmup
    warmup_steps=100,
    # Use bfloat16 precision for faster training
    bf16=True,
    # Disable wandb/tensorboard logging
    report_to="none",
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    # Enable MPS (Metal Performance Shaders) for Mac devices
    use_mps_device=device == "mps",
    # Model ID for HuggingFace Hub uploads
    hub_model_id=finetune_name,
    # DPO-specific parameter that controls how far the trained model may drift from the reference model
    # Higher values keep it closer to the reference; lower values (like 0.1) let the preferences move it further
    beta=0.1,
    # Maximum length of the input prompt in tokens
    max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    max_length=1536,
)
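To get a feel for what `warmup_steps=100` and `lr_scheduler_type="cosine"` mean together, the sketch below computes the learning rate by hand. It assumes the usual shape of this schedule (linear warmup from zero, then a half-cosine decay to zero at `max_steps`); `lr_at` is a hypothetical helper written for this illustration, not a library function.

```python
import math


def lr_at(step, base_lr=5e-5, warmup_steps=100, max_steps=200):
    """Learning rate at a given step: linear warmup, then half-cosine decay.

    Assumed schedule shape; the values default to the config above.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Fraction of the decay phase that has been completed (0.0 to 1.0)
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 150, 200):
    print(step, f"{lr_at(step):.2e}")
```

With these settings, half of the 200 steps are spent warming up, so the peak learning rate of `5e-5` is only reached at step 100.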

Select only a small subset of data for illustration and train the model.

In [None]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing preferred/rejected response pairs
    train_dataset=dataset.select(range(3000)),
    eval_dataset=dataset_eval.select(range(100)),
    # Tokenizer for processing inputs
    processing_class=tokenizer,
)
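As a quick sanity check on how much data the run actually sees, the arithmetic below combines the batch settings with the subset size. It assumes a single device (as in a default Colab GPU runtime); all other numbers are taken from the config and the `select(range(3000))` call.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1        # assumption: one GPU
max_steps = 200
train_examples = 3000  # size of the training subset

# Examples consumed per optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Examples consumed over the whole run
examples_seen = effective_batch * max_steps
# Approximate number of passes over the subset
epochs = examples_seen / train_examples

print(effective_batch)   # 16
print(examples_seen)     # 3200
print(round(epochs, 3))  # 1.067
```

So 200 steps correspond to slightly more than one pass over the 3,000 examples. The trainer reports its own epoch figure after training, which can differ slightly from this estimate depending on how it rounds the number of steps per epoch.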

In [None]:
# Train the model
trainer.train()


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.6922 | 0.691188 | -0.007762 | -0.014468 | 0.57 | 0.006706 | -463.031708 | -395.147034 | 3.040443 | 3.123636 |
| 40 | 0.6878 | 0.667016 | -0.022354 | -0.087056 | 0.64 | 0.064702 | -463.177643 | -395.872925 | 3.107497 | 3.19069 |
| 60 | 0.676 | 0.637047 | -0.113789 | -0.295764 | 0.63 | 0.181976 | -464.09198 | -397.959991 | 3.200204 | 3.275159 |
| 80 | 0.6196 | 0.610391 | -0.439207 | -0.810305 | 0.67 | 0.371098 | -467.346161 | -403.105438 | 3.137609 | 3.192082 |
| 100 | 0.6237 | 0.612412 | -0.873196 | -1.354887 | 0.69 | 0.48169 | -471.686066 | -408.551239 | 3.04672 | 3.105724 |
| 120 | 0.6533 | 0.62483 | -1.018751 | -1.488724 | 0.66 | 0.469973 | -473.141602 | -409.889618 | 2.902883 | 2.986759 |
| 140 | 0.6852 | 0.641912 | -0.956469 | -1.368035 | 0.66 | 0.411566 | -472.518799 | -408.682739 | 2.880801 | 2.942332 |
| 160 | 0.6674 | 0.637802 | -1.068695 | -1.512829 | 0.7 | 0.444135 | -473.641052 | -410.130707 | 2.707297 | 2.759442 |
| 180 | 0.6259 | 0.619735 | -1.094935 | -1.565114 | 0.67 | 0.470179 | -473.903442 | -410.653503 | 2.699535 | 2.75194 |
| 200 | 0.3507 | 0.617784 | -1.092528 | -1.582901 | 0.7 | 0.490373 | -473.879364 | -410.83136 | 2.71598 | 2.769149 |


TrainOutput(global_step=200, training_loss=0.6281872010231018, metrics={'train_runtime': 790.7345, 'train_samples_per_second': 4.047, 'train_steps_per_second': 0.253, 'total_flos': 0.0, 'train_loss': 0.6281872010231018, 'epoch': 1.064})
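To read the reward columns in the log, it helps to see how the DPO objective is put together. The sketch below computes the loss for a single preference pair in plain Python, assuming the standard sigmoid form of the DPO loss; the log-probabilities in the usage example are invented for illustration, and `dpo_loss` is a hypothetical helper, not the `trl` implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair, plus the implicit rewards."""
    # Implicit reward: beta times the log-probability gain over the reference model
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    loss = math.log1p(math.exp(-margin))
    return loss, reward_chosen, reward_rejected, margin


# Before any training, policy == reference, so the margin is 0 and the loss is ln(2)
start_loss, _, _, _ = dpo_loss(-460.0, -395.0, -460.0, -395.0)
print(round(start_loss, 4))  # 0.6931

# Made-up numbers: both completions became less likely, the rejected one more so
loss, r_chosen, r_rejected, margin = dpo_loss(-460.0, -400.0, -458.0, -395.0)
print(r_chosen, r_rejected, margin, loss)

# The margin column is simply chosen minus rejected, e.g. the step-20 row of the log
assert abs((-0.007762) - (-0.014468) - 0.006706) < 1e-9
```

This matches the pattern in the log: the loss starts close to ln 2 ≈ 0.693, both rewards drift negative, and what improves is the margin between them and the fraction of pairs where the chosen reward is higher (`Rewards/accuracies`).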

Push the trained model to the Hugging Face Hub.

In [None]:
trainer.push_to_hub(tags=finetune_tags)

CommitInfo(commit_url='https://huggingface.co/jnwulff/SmolLM2-FT-DPO/commit/1b7819d52de5bea2655ac7418232b3a17b962b52', commit_message='End of training', commit_description='', oid='1b7819d52de5bea2655ac7418232b3a17b962b52', pr_url=None, repo_url=RepoUrl('https://huggingface.co/jnwulff/SmolLM2-FT-DPO', endpoint='https://huggingface.co', repo_type='model', repo_id='jnwulff/SmolLM2-FT-DPO'), pr_revision=None, pr_num=None)

### Inference

Now fetch the model we just pushed to the Hub and use it for inference.

In [None]:
from transformers import pipeline
model_ckp = "jnwulff/SmolLM2-FT-DPO"
pipe = pipeline("text-generation", model=model_ckp, device=0)

Device set to use cuda:0


Apply the chat template so that the messages are wrapped in the special tokens the model expects.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to sort a list"},
]

# Apply the chat template
formatted_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate a reply; the returned text contains the prompt followed by the completion
sequences = pipe(formatted_chat)
print(sequences[0]["generated_text"])

<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a Python function to sort a list<|im_end|>
<|im_start|>assistant
Here's a Python function that sorts a list using the Bubble Sort algorithm:

```python
def bubble_sort(lst):
    n = len(lst)
    for i in range(n - 1):
        swapped = False
        for j in range(n - i - 1):
            if lst[j] > lst[j + 1]:
                lst[j], lst[j + 1] = lst[j + 1], lst[j]
                swapped = True
        if not swapped:
            break
    return lst

# Example usage:
numbers = [64, 34, 25, 12, 22, 11, 90]
sorted_numbers = bubble_sort(numbers)
print(sorted_numbers)
```

This function works by repeatedly swapping elements if they are in the wrong order. The outer loop controls how many swaps are made, and the inner loop controls how the elements are ordered.

Note that Bubble Sort has a time complexity of O(n^2), making it less efficient for large lists. Other sorting algorithms like QuickSort and 
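The printed text repeats the whole prompt before the model's answer. If only the answer is needed, the prompt prefix can be cut off by hand, as in the small sketch below (`completion_only` is a hypothetical helper and the strings are toy values). The text-generation pipeline should also accept `return_full_text=False` for the same purpose and `max_new_tokens` to allow a longer reply; treat both as options to verify against your installed `transformers` version.

```python
def completion_only(generated_text: str, prompt: str) -> str:
    """Drop the prompt prefix from a generated string, if it is present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


# Toy strings standing in for formatted_chat and the pipeline output
prompt = "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
generated = prompt + "Hello! How can I help?"
print(completion_only(generated, prompt))  # Hello! How can I help?
```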

Here's what the formatted chat string looks like:

In [None]:
formatted_chat

'<|im_start|>system\nYou are a helpful coding assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to sort a list<|im_end|>\n<|im_start|>assistant\n'
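The structure of that string is simple enough to rebuild by hand, which makes it clearer what the chat template does. The function below is a simplified sketch that reproduces the string above for this particular input; `to_chatml` is a hypothetical helper, and the model's real template may handle further cases (for example a default system prompt) that are ignored here.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Wrap each message in <|im_start|>role ... <|im_end|> markers."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to sort a list"},
]
print(repr(to_chatml(messages)))
```

In practice, keep using `tokenizer.apply_chat_template`, since it always follows the template shipped with the model.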