# (c) Reinforcement Learning with Preferences — DPO
**Created:** 2025-11-10 02:42 UTC

We use a tiny *pairwise preference* dataset (`chosen` vs `rejected`) and train via **DPO** (Direct Preference Optimization) using TRL.
For speed, we use **SmolLM2-135M** as both the trainable policy and the frozen reference model.
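
For orientation, the per-pair DPO objective can be sketched in a few lines of plain Python. This is an illustrative sketch, not TRL's implementation: the function name `dpo_loss` and its argument names are made up here, and each input is assumed to be the summed log-probability of a full response under the policy or the reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    # How much more the policy likes `chosen` than `rejected`, in log-prob units.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # The same margin under the frozen reference model.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2 (about 0.693), so a value near that is what to expect at the very start of training. The loss falls as the policy widens its chosen-vs-rejected margin relative to the reference.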

In [17]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0; takes effect only if set before torch initializes CUDA


In [1]:
!pip -q install --upgrade pip
!pip -q install "transformers>=4.44.2" "datasets>=2.19.0" "accelerate>=0.33.0" "trl>=0.9.6" "unsloth>=2024.11.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.12.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
pylibcudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
cudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
bigframes 2.12.0 requires rich<14,>=12.4.4, but you have rich 14.2.0 which is incompatible.
libcugraph-cu12 25.6.0 requires libraft-cu12==25.6.*, but you have libraft-cu12 25.2.0 which is incompatible.
torchaudio 2.6.0+cu124 requires torch==2.6.0, but you have torch 2.8.0 which is incompatible.
cudf-polars-cu

In [2]:
import torch, platform
print("Python:", platform.python_version())
print("Torch:", torch.__version__)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

Python: 3.11.13
Torch: 2.8.0+cu128
GPU: Tesla T4


## Build a micro preference dataset
Each row has a `prompt`, a `chosen` response (preferred), and a `rejected` response.

In [3]:
from datasets import Dataset

prefs = [
    {
        "prompt": "Explain what a hash map is in one sentence.",
        "chosen": "A hash map stores key–value pairs and uses a hash function for near-constant-time lookups.",
        "rejected": "It is a tall tree used in forests to store numbers in leaves."
    },
    {
        "prompt": "Give a clear docstring for a function that computes Fibonacci numbers.",
        "chosen": "Return the n-th Fibonacci number using iterative computation; n>=0 with F0=0, F1=1.",
        "rejected": "Does Fibonacci quickly and magically."
    },
]
dpo_ds = Dataset.from_list(prefs)
dpo_ds

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 2
})
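
With hand-written preference data, a quick schema check can catch malformed rows before they reach the trainer. The sketch below is a hypothetical helper (the name `validate_pref_rows` and its specific checks are assumptions, not part of `datasets` or TRL); it only enforces the three-column layout described above.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_pref_rows(rows):
    """Raise ValueError on the first malformed preference row; return the row count."""
    for i, row in enumerate(rows):
        missing = [k for k in REQUIRED_KEYS if k not in row]
        if missing:
            raise ValueError(f"row {i} is missing keys: {missing}")
        for k in REQUIRED_KEYS:
            if not isinstance(row[k], str) or not row[k].strip():
                raise ValueError(f"row {i}: '{k}' must be a non-empty string")
        # A pair with identical responses carries no preference signal.
        if row["chosen"].strip() == row["rejected"].strip():
            raise ValueError(f"row {i}: chosen and rejected are identical")
    return len(rows)
```

Calling `validate_pref_rows(prefs)` just before `Dataset.from_list(prefs)` would fit the cell above.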

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "HuggingFaceTB/SmolLM2-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Pick the load dtype once. Note that the T4 used here has no native bf16 support.
load_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
policy = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=load_dtype)
reference = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=load_dtype)
# No tokens were added (pad reuses eos), so these resizes are no-ops; the last call echoes the embedding shape.
policy.resize_token_embeddings(len(tokenizer))
reference.resize_token_embeddings(len(tokenizer))

`torch_dtype` is deprecated! Use `dtype` instead!
2025-11-10 03:35:57.048268: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1762745757.281607      48 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1762745757.357384      48 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

Skipping import of cpp extensions due to incompatible torch version 2.8.0+cu128 for torchao version 0.14.1             Please see https://github.com/pytorch/ao/issues/2919 for more info


Embedding(49152, 576)

In [18]:
import torch
from transformers import AutoModelForCausalLM

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BASE = "HuggingFaceTB/SmolLM2-135M"

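# EITHER reload in fp16.
# Caution: fp16 weights on the trainable policy combined with fp16=True in DPOConfig
# lead to the "Attempting to unscale FP16 gradients" error raised by trainer.train().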
policy = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16 if DEVICE=="cuda" else torch.float32
).to(DEVICE)
reference = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16 if DEVICE=="cuda" else torch.float32
).to(DEVICE)

# (If you prefer not to reload, you can cast instead:)
# policy = policy.to(DEVICE, dtype=torch.float16 if DEVICE=="cuda" else torch.float32)
# reference = reference.to(DEVICE, dtype=torch.float16 if DEVICE=="cuda" else torch.float32)


In [19]:
from trl import DPOTrainer, DPOConfig

cfg = DPOConfig(
    output_dir="/kaggle/working/smollm2_dpo_speed",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_steps=20,
    learning_rate=1e-5,
    fp16=True,    # <-- use fp16 on T4
    bf16=False,   # <-- disable bf16
    logging_strategy="steps",
    logging_steps=5,   # keep below max_steps (20) so step losses are actually logged
    eval_strategy="no",
    save_strategy="no",
    report_to="none",
    beta=0.1,
    max_prompt_length=128,
    max_completion_length=64,
    max_length=192,
    remove_unused_columns=False,
    dataloader_num_workers=2,
    disable_tqdm=True,
)

trainer = DPOTrainer(
    model=policy,
    ref_model=reference,
    args=cfg,
    train_dataset=dpo_ds.shuffle(seed=42).select(range(min(16, len(dpo_ds)))),
    processing_class=tokenizer,  # older TRL releases call this argument `tokenizer`
)

trainer.train()


Extracting prompt in train dataset:   0%|          | 0/2 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/2 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/2 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 0}.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


ValueError: Attempting to unscale FP16 gradients.
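
The `ValueError` above comes from PyTorch's gradient scaler, which refuses to unscale gradients that are themselves stored in fp16. With `fp16=True` the trainer runs mixed precision and expects the trainable weights in fp32, but the reload cell created `policy` with `torch.float16`. A minimal sketch of one way around it, assuming the same checkpoint and the `torch_dtype` argument already used in this notebook (the helper names `choose_dtypes` and `load_policy_and_reference` are hypothetical):

```python
BASE = "HuggingFaceTB/SmolLM2-135M"

def choose_dtypes(device: str) -> tuple[str, str]:
    """Return (policy_dtype, reference_dtype) names for fp16 mixed-precision training."""
    # The trainable policy stays in fp32: the gradient scaler cannot unscale fp16 gradients.
    policy_dtype = "float32"
    # Assumption: the frozen reference only runs forward passes, so half precision is acceptable on GPU.
    reference_dtype = "float16" if device == "cuda" else "float32"
    return policy_dtype, reference_dtype

def load_policy_and_reference(base: str = BASE):
    """Load both models with dtypes compatible with DPOConfig(fp16=True)."""
    # Heavy imports are kept local so the dtype helper can be used on its own.
    import torch
    from transformers import AutoModelForCausalLM

    device = "cuda" if torch.cuda.is_available() else "cpu"
    policy_name, reference_name = choose_dtypes(device)
    policy = AutoModelForCausalLM.from_pretrained(
        base, torch_dtype=getattr(torch, policy_name)
    ).to(device)
    reference = AutoModelForCausalLM.from_pretrained(
        base, torch_dtype=getattr(torch, reference_name)
    ).to(device)
    return policy, reference
```

Running `policy, reference = load_policy_and_reference()` before building the `DPOTrainer` leaves `fp16=True` in `DPOConfig` unchanged. Loading the reference in fp32 as well is the more conservative choice, and at 135M parameters the extra memory is small.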

In [20]:
prompt = "Explain what a hash map is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(policy.device)
with torch.no_grad():
    out = policy.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Caching is incompatible with gradient checkpointing in LlamaDecoderLayer. Setting `past_key_values=None`.


Explain what a hash map is in one sentence.
" on our life on the fact.

S. For example: the last years ago, and the first to the universe is going with the next to the same country


A, and the same type of a very unusual. The same, and his own words that, the first year.

