### ⚙️ Setting Up the Training Environment

This section prepares the Google Colab environment for model fine-tuning.  
The install cell below pulls the following packages from PyPI:

- **Unsloth:** the main framework for lightweight fine-tuning. As the install log shows, it brings in **xFormers**, **TRL**, and other dependencies.  
- **Transformers, Accelerate, BitsAndBytes:** model loading, training utilities, and 8-bit/4-bit quantization.  
- **Datasets:** the Hugging Face library for loading and preprocessing data.  
- **wandb, huggingface_hub:** experiment tracking and Hub access.

> 💡 The command runs with full dependency resolution (there is no `--no-deps` flag), so pip may replace packages that Colab already ships. If a later import reports a version mismatch, as xFormers does in the import step, reinstall that package against the runtime's PyTorch build.


In [None]:
!pip install unsloth datasets transformers accelerate bitsandbytes wandb huggingface_hub

Collecting unsloth
  Downloading unsloth-2025.11.2-py3-none-any.whl.metadata (61 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting unsloth_zoo>=2025.11.3 (from unsloth)
  Downloading unsloth_zoo-2025.11.3-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.35-py3-none-any.whl.metadata (12 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.33-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.2 kB)
Collecting datasets
  Downloading datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting trl!=0.19.0,<=0.23.0,>=0.18.2 (from unsloth)
  Downloading trl-0.23.0-py3-none-any.whl.metadata (11 kB)
Collecting pyarrow>=21.0.0 (fr
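
Since the later cells depend on mutually compatible builds of these packages, it can help to record what actually got installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this example and is not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print a small report for the packages this notebook relies on.
report = installed_versions(
    ["unsloth", "trl", "peft", "xformers", "bitsandbytes", "torch"]
)
for name, ver in report.items():
    print(f"{name:>14}: {ver or 'not installed'}")
```

Running this right after the install cell gives a quick record to compare against any version warnings printed during import.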

### 🧩 Importing Libraries and Defining Core Configuration

This step initializes all essential libraries and configurations before training.

1. **Unsloth** is imported first to safely patch dependencies and optimize model performance.  
2. Additional libraries like **Transformers**, **TRL**, **PEFT**, and **Datasets** handle model loading, training, and dataset processing.  
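3. **W&B (Weights & Biases)** is imported for experiment tracking, although the DPO configuration further down sets `report_to=[]`, so nothing is logged to it in this run.  
4. A fixed random **seed** makes runs more repeatable (only Python's `random` and PyTorch are seeded here), and the **device** is set to GPU when one is available.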
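
The seeding step can be wrapped in a small helper. This is a sketch rather than the notebook's own code: `seed_everything` is a hypothetical name, NumPy and PyTorch are seeded only if they are importable, and some GPU kernels can remain nondeterministic even with fixed seeds.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed every random number generator that is importable here."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch is optional for this sketch


# Re-seeding with the same value reproduces the same draw.
seed_everything(3407)
first = random.random()
seed_everything(3407)
second = random.random()
print(first == second)
```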

In [None]:
# Import Unsloth FIRST so it patches dependencies safely
import unsloth
from unsloth import FastLanguageModel

# Then other libs
from datasets import load_dataset
from transformers import TrainingArguments, AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer
from peft import PeftModel
import wandb
import random, torch

SEED = 3407
random.seed(SEED); torch.manual_seed(SEED)
device = "cuda" if torch.cuda.is_available() else "cpu"

# FAST MODE knobs
POLICY_BASE = "HuggingFaceTB/SmolLM2-135M-Instruct"
REF_BASE    = "HuggingFaceTB/SmolLM2-135M-Instruct"
MAX_LEN     = 768      # was 1024
SUBSET      = 1000     # was 3000
MAX_STEPS   = 200      # was 400
MAX_TARGET  = 128      # was 256
DO_MERGE    = False    # keep False for speed; True to create merged fp16 checkpoint

print({"POLICY_BASE": POLICY_BASE, "REF_BASE": REF_BASE, "MAX_LEN": MAX_LEN, "SUBSET": SUBSET, "MAX_STEPS": MAX_STEPS, "MAX_TARGET": MAX_TARGET})


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.9.0+cu130 with CUDA 1300 (you have 2.9.0+cu128)
    Python  3.10.19 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.9.0+cu130 with CUDA 1300 (you have 2.9.0+cu128)
    Python  3.10.19 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
{'POLICY_BASE': 'HuggingFaceTB/SmolLM2-135M-Instruct', 'REF_BASE': 'HuggingFaceTB/SmolLM2-135M-Instruct', 'MAX_LEN': 768, 'SUBSET': 1000, 'MAX_STEPS': 200, 'MAX_TARGET': 128}


### 🧠 Loading and Preparing Models for Fine-Tuning

This section loads two copies of the model, a **trainable policy model** and a **frozen reference model**, both from the same SmolLM2-135M-Instruct checkpoint.

1. **Policy Model:**  
   - Loaded in **4-bit precision** to reduce memory footprint.  
   - Enhanced using **LoRA (Low-Rank Adaptation)** for efficient fine-tuning of selected attention and MLP layers.  
   - Uses **gradient checkpointing** to save GPU memory during backpropagation.  
   - Only a small subset of parameters is trainable (about 3.5% in the run below); the count is printed for verification.

2. **Reference Model:**  
   - Also loaded in **4-bit precision**, but kept **frozen** (no gradient updates).  
   - Supplies the fixed log-probabilities that DPO (Direct Preference Optimization) compares the policy against: the loss rewards the policy for preferring the chosen answer more strongly than the reference does.

> ⚙️ This dual-model setup keeps training anchored: the policy is optimized relative to a fixed reference rather than in isolation, and `beta` scales how strongly the log-ratio differences are weighted.
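
To see why only a small share of parameters is trainable, the LoRA parameter count can be worked out by hand: each adapted linear layer adds `r * (in_features + out_features)` weights. The sketch below assumes layer shapes taken from the public SmolLM2-135M configuration (hidden size 576, MLP size 1536, key/value projection width 192, 30 layers); these numbers are not shown anywhere in this notebook, so treat them as assumptions.

```python
# Layer shapes assumed from the public SmolLM2-135M config (not shown in this notebook).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 576, 1536, 192, 30
RANK = 16  # matches r=16 in get_peft_model

# (in_features, out_features) for each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(targets, rank, num_layers):
    """Each adapter holds A (rank x in) and B (out x rank), with no bias."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in targets.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGETS, RANK, NUM_LAYERS)
print(f"LoRA trainable parameters: {trainable:,}")
```

Under these assumptions the result is 4,884,480, which agrees with the `trainable params` line printed when the policy model is loaded.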


In [None]:
# ---------- Load POLICY (4-bit + LoRA) and REFERENCE (4-bit frozen) ----------
policy, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = POLICY_BASE,
    max_seq_length = MAX_LEN,
    dtype          = None,
    load_in_4bit   = True,
)

policy = FastLanguageModel.get_peft_model(
    policy,
    r=16, lora_alpha=32, lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=SEED,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
)
if hasattr(policy, "print_trainable_parameters"):
    policy.print_trainable_parameters()

reference, _ = FastLanguageModel.from_pretrained(
    model_name     = REF_BASE,
    max_seq_length = MAX_LEN,
    dtype          = None,
    load_in_4bit   = True,
)
for p in reference.parameters():
    p.requires_grad_(False)


==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


HuggingFaceTB/SmolLM2-135M-Instruct does not have a padding token! Will use pad_token = <|endoftext|>.


Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.11.2 patched 30 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


trainable params: 4,884,480 || all params: 139,399,488 || trainable%: 3.5039
==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
HuggingFaceTB/SmolLM2-135M-Instruct does not have a padding token! Will use pad_token = <|endoftext|>.


In [None]:
# ===============================================================
# 📘 Load the raw dataset
# ===============================================================
from datasets import load_dataset

raw = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split=f"train_prefs[:{SUBSET}]")
print("Dataset loaded:", len(raw), "samples")
print("Columns:", raw.column_names)



Dataset loaded: 1000 samples
Columns: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected']
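
In this dataset the `chosen` and `rejected` columns hold lists of chat messages rather than plain strings (an assumption based on the dataset card; it is worth confirming with `raw[0]["chosen"]`). The sketch below shows one way to pull out the assistant's reply. The helper name and the sample record are made up for illustration.

```python
def final_assistant_text(messages):
    """Return the content of the last assistant turn, or "" if there is none."""
    for message in reversed(messages):
        if message.get("role") == "assistant":
            return message.get("content", "")
    return ""


# Made-up record for illustration only, shaped like one `chosen` entry.
example_chosen = [
    {"role": "user", "content": "Name a prime number."},
    {"role": "assistant", "content": "7 is a prime number."},
]

print(final_assistant_text(example_chosen))
```

Extracting the reply first avoids feeding the stringified list, with its brackets and `role` keys, into the tokenizer.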


In [None]:
# ---- Hard length guards for DPO ----
# Ensure tokenizer knows the caps
tokenizer.model_max_length = MAX_LEN          # e.g., 768
tokenizer.padding_side = "right"
tokenizer.truncation_side = "left"            # trim from the left for long prompts

def _truncate_text(txt: str, max_tokens: int) -> str:
    ids = tokenizer(
        txt,
        add_special_tokens=False,
        truncation=True,
        max_length=max_tokens,
        return_attention_mask=False,
        return_token_type_ids=False,
    )["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=False)

def to_chat_prompt(prompt_text: str) -> str:
    # Minimal system to keep prompt short
    messages = [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user",   "content": str(prompt_text).strip()},
    ]
    s = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,   # policy will generate assistant
    )
    # Truncate the prompt to MAX_LEN tokens
    return _truncate_text(s, MAX_LEN)

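def trim_answer(ans) -> str:
    # In ultrafeedback_binarized, `chosen`/`rejected` are lists of chat messages,
    # so take the final (assistant) turn instead of stringifying the whole list.
    if isinstance(ans, list) and ans and isinstance(ans[-1], dict):
        ans = ans[-1].get("content", "")
    # Keep the FIRST MAX_TARGET tokens so the start of the answer survives;
    # the tokenizer-wide left truncation set above is meant for prompts only.
    ids = tokenizer(str(ans), add_special_tokens=False)["input_ids"][:MAX_TARGET]
    return tokenizer.decode(ids, skip_special_tokens=False)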

# Rebuild the mapped dataset with strict truncation
def map_dpo(batch):
    prompts   = [to_chat_prompt(p) for p in batch["prompt"]]
    chosens   = [trim_answer(c)     for c in batch["chosen"]]
    rejecteds = [trim_answer(r)     for r in batch["rejected"]]
    return {"prompt": prompts, "chosen": chosens, "rejected": rejecteds}

dpo_ds = raw.map(
    map_dpo,
    batched=True,
    num_proc=2,
    remove_columns=raw.column_names,
)

# Quick sanity check on the first row only: the prompt should be <= MAX_LEN tokens and each target <= MAX_TARGET
def _count_toks(s): return len(tokenizer(s, add_special_tokens=False)["input_ids"])
print("Sanity (first row):",
      _count_toks(dpo_ds[0]["prompt"]), _count_toks(dpo_ds[0]["chosen"]), _count_toks(dpo_ds[0]["rejected"]))


Sanity (first row): 29 128 128
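
The difference between left and right truncation is easy to see on a plain list of token ids. This is a standalone sketch, independent of the tokenizer; `truncate_ids` is a hypothetical helper written for this example.

```python
def truncate_ids(ids, max_tokens, side="right"):
    """Keep at most max_tokens ids, dropping from the given side's opposite end.

    side="right" keeps the beginning (drops the tail);
    side="left" keeps the end (drops the head).
    """
    if max_tokens <= 0:
        return []
    if side == "right":
        return ids[:max_tokens]
    if side == "left":
        return ids[-max_tokens:]
    raise ValueError(f"unknown side: {side!r}")


token_ids = list(range(10))  # stand-in for real token ids
print("keep start:", truncate_ids(token_ids, 4, side="right"))
print("keep end:  ", truncate_ids(token_ids, 4, side="left"))
```

Keeping the end suits long prompts, where the most recent text matters most, while keeping the start is usually the safer choice for answers, since cutting the head of a reply leaves a fragment that begins mid-sentence.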


In [None]:
# --- DPO setup: config + trainer ---
# With recent TRL releases (the install log above shows trl 0.23.0), `beta` and
# `max_length` are DPOConfig fields and the tokenizer is passed to the trainer
# as `processing_class`.
from trl import DPOConfig, DPOTrainer

dpo_args = DPOConfig(
    output_dir="smollm2_dpo_rl_fast",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    max_steps=MAX_STEPS,               # e.g., 200
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    warmup_steps=25,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=50,
    save_strategy="no",                # fastest: no checkpoints
    report_to=[],                      # avoid wandb
    dataloader_num_workers=2,
    seed=SEED,
    beta=0.1,                          # scales the policy-vs-reference log-ratio
    max_length=MAX_LEN,                # e.g., 768 (prompt + completion)
    # Targets are already capped at MAX_TARGET tokens during preprocessing.
)

trainer = DPOTrainer(
    model=policy,                      # LoRA policy (trainable)
    ref_model=reference,               # frozen 4-bit reference
    args=dpo_args,
    processing_class=tokenizer,
    train_dataset=dpo_ds,              # uses the "prompt", "chosen", "rejected" columns
)

print("DPOTrainer ready (config fixed).")


DPOTrainer ready (config fixed).


### 🚀 Training the Model with DPO

This cell executes the **Direct Preference Optimization (DPO)** training loop.

- We first clear any cached GPU memory to maximize available VRAM.
- `trainer.train()` handles the entire fine-tuning process using the:
  - **trainable LoRA policy model**
  - **frozen reference model**
  - **preprocessed DPO dataset** with `prompt`, `chosen`, and `rejected` pairs.
- The cell also records wall-clock training time as a rough benchmark.

> ⚙️ DPO encourages the model to generate responses closer to the “chosen” answers while diverging from the “rejected” ones, improving alignment without reinforcement rollouts.
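
The objective can be written down for a single preference pair. The sketch below implements the standard sigmoid form of the DPO loss from summed sequence log-probabilities; it is an illustration under that assumption, not the trainer's internal code, and the log-probability values in the example are made up.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sigmoid DPO loss for one pair, given summed sequence log-probabilities."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Made-up log-probabilities. When the policy equals the reference the margin is 0.
at_start = dpo_loss(-300.0, -296.0, -300.0, -296.0)
# The policy now favors the chosen answer a little more than the reference does.
after_update = dpo_loss(-299.0, -297.0, -300.0, -296.0)

print(f"loss at initialization: {at_start:.4f}")
print(f"loss after a helpful update: {after_update:.4f}")
```

With an untouched policy the loss is ln 2 ≈ 0.6931, which is why the first logged training loss (0.6932) sits almost exactly at that value and then drifts down as the reward margin grows.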


In [None]:
import gc, time
# ---------- Train ----------
gc.collect()
if torch.cuda.is_available(): torch.cuda.empty_cache()
start = time.time()
train_out = trainer.train()
elapsed = time.time() - start
print("Train out:", train_out)
print(f"Elapsed: {elapsed/60:.1f} min")



The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 4 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 4,884,480 of 139,399,488 (3.50% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / chosen | logps / rejected | logits / chosen | logits / rejected | eval_logits / chosen | eval_logits / rejected | nll_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 0.6932 | 0.001928 | 0.001912 | 0.47875 | 1.7e-05 | -299.747375 | -295.733765 | 7.339626 | 7.404619 | 0 | 0 | 0 |
| 100 | 0.6893 | 0.019639 | 0.011614 | 0.635101 | 0.008025 | -302.065033 | -299.769104 | 7.219854 | 7.276952 | No Log | No Log | No Log |
| 150 | 0.6848 | 0.037052 | 0.020037 | 0.710859 | 0.017015 | -299.165924 | -299.079254 | 7.377467 | 7.443332 | No Log | No Log | No Log |
| 200 | 0.6811 | 0.048881 | 0.024311 | 0.729798 | 0.024569 | -299.273682 | -295.55191 | 7.35649 | 7.362379 | No Log | No Log | No Log |


Train out: TrainOutput(global_step=200, training_loss=0.6870830535888672, metrics={'train_runtime': 1228.1946, 'train_samples_per_second': 2.605, 'train_steps_per_second': 0.163, 'total_flos': 0.0, 'train_loss': 0.6870830535888672, 'epoch': 3.176})
Elapsed: 20.5 min
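
The numbers in this log can be cross-checked with a little arithmetic. The sketch below uses only values that appear in the configuration and the `TrainOutput` above; the variable names are chosen for this example.

```python
per_device_batch = 2
grad_accum_steps = 8
num_gpus = 1
max_steps = 200
num_examples = 1000
train_runtime_s = 1228.1946  # 'train_runtime' from the TrainOutput above

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum_steps * num_gpus
# Total examples processed over the whole run
examples_seen = effective_batch * max_steps
# Approximate number of passes over the 1,000-example subset
passes = examples_seen / num_examples
# Throughput implied by the reported runtime
throughput = examples_seen / train_runtime_s

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen}")
print(f"approximate passes over the data: {passes:.1f}")
print(f"examples per second: {throughput:.3f}")
```

The implied throughput lines up with the reported `train_samples_per_second`. The trainer's own epoch counter (3.176) differs slightly from the simple ratio, most likely because of how accumulation steps are counted at epoch boundaries.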


### 💾 Saving & (Optionally) Merging the Fine-Tuned Model + Quick Inference

This cell:
1. **Saves LoRA adapters** and the **tokenizer** to disk (fast, storage-friendly).
2. **Optionally merges** the adapters into the base model to produce a **single fp16 checkpoint** (set `DO_MERGE=True` to enable).
3. Provides a small **`chat()`** helper to sanity-check the model with greedy decoding for reproducibility.

**Why two save paths?**
- **Adapters**: lightweight, quick to store, and can be attached to the base model later.
- **Merged fp16**: a single deployable checkpoint that doesn’t require PEFT to load.

> Tip: Keep using **adapters** during experimentation; switch to **merged** only when you need a single file for serving or sharing.
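
Merging works because a LoRA adapter is just an additive low-rank update: the layer output `W x + (alpha / r) * B (A x)` equals `(W + (alpha / r) * B A) x`. The toy sketch below checks this with plain Python lists. The matrices are made-up values, and the `alpha / r` scaling is the standard LoRA convention, assumed here rather than taken from this notebook.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]


R, ALPHA = 2, 4        # toy values; the notebook uses r=16, lora_alpha=32
SCALING = ALPHA / R    # both settings give a scaling factor of 2

W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # frozen base weight, shape (out=2, in=3)
A = [[0.5, 0.0, 1.0], [0.0, 1.0, 0.0]]  # LoRA down-projection, shape (r=2, in=3)
B = [[1.0, 0.0], [0.0, 2.0]]            # LoRA up-projection, shape (out=2, r=2)
x = [[1.0], [2.0], [3.0]]               # input column vector

# Adapter path: base output plus the scaled low-rank correction
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), SCALING))

# Merged path: fold the correction into the weight once, then apply it
W_merged = add(W, scale(matmul(B, A), SCALING))
merged_out = matmul(W_merged, x)

print("adapter path:", adapter_out)
print("merged path: ", merged_out)
```

Both paths give the same output, which is why a merged checkpoint no longer needs PEFT at load time.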


In [None]:
# ===============================================================
# 💾 Save adapters (fast path) and optionally merge to a single fp16 checkpoint
# ===============================================================
import os, torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Where to store outputs
ADAPTER_DIR = "smollm2_dpo_rl_fast/adapters"
TOKEN_DIR   = "smollm2_dpo_rl_fast/tokenizer"
os.makedirs(ADAPTER_DIR, exist_ok=True)

# Save LoRA adapters and tokenizer config
trainer.model.save_pretrained(ADAPTER_DIR)
tokenizer.save_pretrained(TOKEN_DIR)
print(f"✅ Saved adapters to: {ADAPTER_DIR}")
print(f"✅ Saved tokenizer to: {TOKEN_DIR}")

# ---------------------------------------------------------------
# 🔁 Optional: merge LoRA into a single fp16 model (set DO_MERGE=True)
# ---------------------------------------------------------------
MERGED_DIR = None
if DO_MERGE:
    MERGED_DIR = "smollm2_dpo_rl_fast/merged"
    os.makedirs(MERGED_DIR, exist_ok=True)

    # Load the base model in fp16 on available device(s)
    base_fp16 = AutoModelForCausalLM.from_pretrained(
        POLICY_BASE,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    # Attach adapters and merge into the base weights
    peft_model = PeftModel.from_pretrained(base_fp16, ADAPTER_DIR)
    merged = peft_model.merge_and_unload()

    # Persist merged weights + tokenizer
    merged.save_pretrained(MERGED_DIR, safe_serialization=True)
    tokenizer.save_pretrained(MERGED_DIR)
    print(f"✅ Merged model saved to: {MERGED_DIR}")

# ===============================================================
# 🗣️ Tiny chat helper for quick sanity checks
# ===============================================================
def chat(prompt: str, max_new_tokens: int = 128):
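    # Pick the right model for inference: merged (if created) or LoRA policy.
    # `merged` is a module-level name, so check globals(); locals() here would
    # only see this function's own scope and never find it.
    model_for_infer = merged if (DO_MERGE and "merged" in globals()) else policy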
    model_for_infer.eval()

    # Make sure tokenizer has a pad token
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Minimal system + user message
    messages = [
        {"role": "system", "content": "You are a helpful, concise assistant."},
        {"role": "user",   "content": str(prompt).strip()},
    ]

    # Tokenize with the chat template
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(device)

    # Build a simple attention mask (no padding expected for a single example)
    attention_mask = torch.ones_like(inputs)

    # Greedy decoding for reproducible outputs
    with torch.inference_mode():
        outputs = model_for_infer.generate(
            input_ids=inputs,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy for consistency
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True,
        )

    # Return only the newly generated continuation
    return tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

# ---------------------------------------------------------------
# 🔎 Quick smoke tests
# ---------------------------------------------------------------
print("\n=== Inference (FAST MODE) ===")
print(chat("Explain the difference between a shallow copy and a deep copy in Python with a tiny example."))
print("-" * 80)
print(chat("Write a short Python function that checks if a string is a valid palindrome, ignoring non-alphanumerics."))


✅ Saved adapters to: smollm2_dpo_rl_fast/adapters
✅ Saved tokenizer to: smollm2_dpo_rl_fast/tokenizer

=== Inference (FAST MODE) ===
In Python, a shallow copy is a copy of an object that is created from an existing object, but not from an object that is created from a copy of an existing object. This means that if you create a shallow copy of an object from an existing object, you are essentially creating a copy of the object that was created from the original object, but not from the original object itself.

Here's a simple example:

```python
# Create a shallow copy of an object from an existing object
my_object = my_original_object

# Create a copy of an object from an existing object
my
--------------------------------------------------------------------------------
Here's a Python function that checks if a string is a valid palindrome:

```python
def is_palindrome(s):
    if not s:
        return False
    
    s = s.lower()
    return s == s[::-1]

# Example usage:
print(is_palin