<a href="https://colab.research.google.com/github/yanhann10/30_days_of_agentic_ai/blob/main/smolLM_DPO_n_variants.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PLEASE DO NOT DISTRIBUTE!**

        NIMA:
        ⚠️ The normalization layer used in the language model is RMSNorm which differs from the regular LayerNorm.

        ⚠️ The feedforward block used in the language model of this colab differs from a regular MLP block. Here we have 3 lnear layers insead of 2 linear layers. See Fig. 5 in https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/te_llama/tutorial_accelerate_hf_llama_with_te.html.

# **Background**

This exercise is designed to allow you to showcase your engineering and problem solving skills. The Challenge consists of different challenges including:

*   Identifying bugs, and getting the code working. This is designed to test your ability to grapple with real world engineering challenges.
*   Testing your ability to generate code for a specified problem.
*   An opportunity for you to attempt an optional challenge question that extends the original problem set.

Good luck!


# **Coding Challenge Part 2: Teach SmolLM to do grammatical error correction [15 points]**

The goal of this part is to train the SmolLM-135M model to perform grammatical error correction (GEC) using the Grammarly CoEdIT dataset. This [dataset](https://huggingface.co/datasets/grammarly/coedit), derived from the [CoEdIT project](https://arxiv.org/abs/2305.09857), provides a rich collection of text editing instructions and examples. The task involves several key steps that mimic conventional alignment processes:




## **2.1 Supervised Fine-Tuning (SFT) on Training Data [5 points]**

* Fine-tune the [SmolLM-135M model](https://huggingface.co/HuggingFaceTB/SmolLM-135M) using the CoEdIT dataset, which includes input sentences with grammatical errors and their corrected versions. Use the training GEC portion of the CoEdIT dataset to teach the model how to correct grammatical errors effectively.
* Calculate the BLEU score on the validation set to evaluate the model's performance in generating grammatically correct sentences. Ensure that this evaluation process is reusable for later comparisons.
* Search for an optimal set of hyperparameters, such as the learning rate. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters. **Do not train for more than 3 epochs -- we do not expect extensive training time.**
* For Part 2, don't use additional libraries, if an imported library is missing, install it with **pip install**.

      NIMA:
      💡 here’s the deal: There are multiple ways to fine-tune an LLM with Hugging Face, and you should experiment with these approaches:

      1. Padding Approach
      In this method, all samples in a batch are padded to the same length. Simple, right? But there's a catch:

        ⚠️ If you don’t set up the PAD token correctly, your model might struggle to end completions smoothly.

      2. Packing Approach
      This method concatenates multiple samples into one long sequence. You can do this in two ways:

        Dataset level → Set packing=True when preparing your dataset.
        Batch level → Use DataCollatorWithFlattening or DataCollatorForCompletionOnlyLM.

        Batch-level packing requires Flash Attention v2, which means you’ll need a high-end GPU like Ampere (A100, A6000, RTX 4090) or Hopper (H100) series.
      A good resource for the 2 approaches above is Chapter 4 and 5 from this book: https://www.amazon.com/Hands-Fine-Tuning-Language-PyTorch-Hugging-ebook/dp/B0DV3Y1GMP?ref_=ast_author_dp.
      🎯 Optimizing for Best Results
      Regardless of the method you choose, tuning your training recipe is key to achieving a solid BLEU score. Pay close attention to:
        ✅ Learning rate
        ✅ Weight decay
        ✅ Batch size
        ✅ Sequence length
      Tweak these wisely, and you’ll be on your way to LLM fine-tuning success! 🚀

      NIMA:
      💡 You should format your training samples appropriately to fine-tune your model. In this project, tou can either use the model's chat template, which typically involves learnable special tokens, oruse a prompt-completion format, which does not require special tokens. For this project, the prompt-completion format works better.



In [2]:
%%capture
!pip install unsloth trl peft accelerate evaluate

In [3]:
import torch
import torch.nn.functional as F
from torch import nn
import math
from transformers import AutoModelForCausalLM, AutoTokenizer


issue

---

*   data format if wrong result in no loss reduction. chatML format + instruct ver easier to use
*   List item



In [4]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

import torch

full_train_ds = load_dataset("grammarly/coedit", split="train")
full_test_ds  = load_dataset("grammarly/coedit", split="validation")
train_ds_gec = full_train_ds.filter(lambda x: x["task"] == "gec")
test_ds_gec  = full_test_ds.filter(lambda x: x["task"] == "gec")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

train.jsonl:   0%|          | 0.00/19.7M [00:00<?, ?B/s]

validation.jsonl: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/69071 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1712 [00:00<?, ? examples/s]

Filter:   0%|          | 0/69071 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1712 [00:00<?, ? examples/s]

In [5]:
train_ds_gec[0]

{'_id': '1',
 'task': 'gec',
 'src': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.',
 'tgt': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.'}

In [6]:
test_ds_gec[0]

{'_id': '2',
 'task': 'gec',
 'src': 'Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.',
 'tgt': 'First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.'}

In [7]:
from google.colab import userdata
from huggingface_hub import login
import os
login(token=userdata.get('hf')) # export your HF_TOKEN first. You can add this to your ~/.bashrc.

In [8]:
!pip -q install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth==2025.12.5 unsloth-zoo==2025.12.4
!pip -q install --upgrade --no-cache-dir "trl>=0.25.0" "peft>=0.15.0" "accelerate>=0.34.0"


[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m65.9/65.9 kB[0m [31m56.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m373.9/373.9 kB[0m [31m353.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m289.3/289.3 kB[0m [31m377.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m518.9/518.9 kB[0m [31m109.9 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unsloth 2025.12.5 requires trl!=0.19.0,<=0.24.0,>=0.18.2, but you have trl 0.26.2 which is incompatible.
unsloth-zoo 2025.12.4 requires trl!=0.19.0,<=0.24.0,>=0.18.2, but you have trl 0.26.2 which is incompatible.[0m[31m
[0m

In [9]:
import wandb
wandb.init(
    project="ft-SmolLM-gec",
    name='sft_v10',
)

[34m[1mwandb[0m: Currently logged in as: [33mdeep-learning-rabbit[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: Detected [huggingface_hub.inference, openai] in use.
[34m[1mwandb[0m: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
[34m[1mwandb[0m: For more information, check out the docs at: https://weave-docs.wandb.ai/


### packing

In [17]:
import re
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model


mdl_fp = "unsloth/SmolLM-135M-Instruct"
max_len = 512

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(mdl_fp, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    mdl_fp,
    torch_dtype=dtype,
    device_map="auto",
)

tokenizer.padding_side = "right"
if tokenizer.pad_token is None:
    if tokenizer.unk_token is None:
        tokenizer.add_special_tokens({"unk_token": "<unk>"})
        model.resize_token_embeddings(len(tokenizer))
    tokenizer.pad_token = tokenizer.unk_token

model.config.use_cache = False


GEC_PROMPT_STYLE = "Correct the grammar: {}"

def strip_gec_prefix(s: str) -> str:
    s = s.strip()
    s = re.sub(r"^\s*fix grammaticality:\s*", "", s, flags=re.IGNORECASE)
    s = re.sub(r"^\s*fix grammatically:\s*", "", s, flags=re.IGNORECASE)
    s = re.sub(r"^\s*remove all grammatical errors from this text:\s*", "", s, flags=re.IGNORECASE)
    s = re.sub(r"^\s*correct the grammar:\s*", "", s, flags=re.IGNORECASE)
    return s.strip()

def fmt(ex):
    prompt = GEC_PROMPT_STYLE.format(strip_gec_prefix(ex["src"])) + "\n"
    ex["text"] = prompt + ex["tgt"] + (tokenizer.eos_token or "")
    return ex


train_ds_processed = train_ds_gec.map(fmt, remove_columns=train_ds_gec.column_names)
test_ds_processed  = test_ds_gec.map(fmt, remove_columns=test_ds_gec.column_names)


def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_len,
        padding=False,
    )

train_tok = train_ds_processed.map(
    tokenize_fn, batched=True, remove_columns=["text"]
)
test_tok = test_ds_processed.map(
    tokenize_fn, batched=True, remove_columns=["text"]
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)


Map:   0%|          | 0/19823 [00:00<?, ? examples/s]

Map:   0%|          | 0/485 [00:00<?, ? examples/s]

Map:   0%|          | 0/19823 [00:00<?, ? examples/s]

Map:   0%|          | 0/485 [00:00<?, ? examples/s]

In [None]:

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# ----------------------------
# Training
# ----------------------------
training_args = TrainingArguments(
    output_dir="gec_v2",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    warmup_ratio=0.1,
    max_steps=350,
    learning_rate=1e-4,
    fp16=not use_bf16,
    bf16=use_bf16,
    logging_steps=50,
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=3407,
    report_to="wandb",
    optim="adamw_torch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_tok,
    eval_dataset=test_tok,
    data_collator=data_collator,
)

trainer.train()


  trainer = Trainer(
The model is already on multiple devices. Skipping the move to device specified in `args`.


trainable params: 4,884,480 || all params: 139,399,488 || trainable%: 3.5039


Step,Training Loss
50,2.5323
100,1.9027
150,1.7733
200,1.7209
250,1.729
300,1.7077
350,1.7054


TrainOutput(global_step=350, training_loss=1.8673023114885603, metrics={'train_runtime': 198.8009, 'train_samples_per_second': 28.169, 'train_steps_per_second': 1.761, 'total_flos': 320690431051776.0, 'train_loss': 1.8673023114885603, 'epoch': 0.2824858757062147})

## #eval

In [None]:
import torch
import evaluate
from torch.utils.data import DataLoader
from tqdm.auto import tqdm

def predict(tokenizer, model, src_texts, max_new_tokens=128, num_beams=1,temperature=0.0,top_p=1,top_k=1):
    model.eval()

    if isinstance(src_texts, str):
        src_texts = [src_texts]

    tokenizer.padding_side = "left"

    prompts = [GEC_PROMPT_STYLE.format(strip_gec_prefix(src)) + " " for src in src_texts]

    enc = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_len,
    ).to(model.device)

    input_len = enc["input_ids"].shape[1]

    with torch.inference_mode():
        gen_ids = model.generate(
            **enc,
            max_new_tokens=max_new_tokens,
            num_beams=num_beams,

            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            do_sample=None,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    preds = []
    for ids in gen_ids:
        text = tokenizer.decode(ids[input_len:], skip_special_tokens=True).strip()
        preds.append(" ".join(text.split()))

    return preds

def evaluate_bleu(tokenizer, model, dataset, batch_size=16, num_beams=1, max_new_tokens=128):
    bleu = evaluate.load("bleu")

    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    all_preds, all_refs = [], []

    for batch in tqdm(loader):
        preds = predict(
            tokenizer, model, batch["src"],
            max_new_tokens=max_new_tokens,
            num_beams=num_beams,
        )
        all_preds.extend(preds)
        all_refs.extend([[t.strip()] for t in batch["tgt"]])

    return bleu.compute(predictions=all_preds, references=all_refs)["bleu"]

# examples
print("=" * 60)
print("SINGLE EXAMPLE TEST")
print("=" * 60)
print("Input:", test_ds_gec[0]["src"])
print()
pred = predict(tokenizer, model, test_ds_gec[0]["src"])[0]
print("Prediction:", pred)
print("Reference: ", test_ds_gec[0]["tgt"])
print("=" * 60)
print()

#bleu
bleu_score = evaluate_bleu(
    tokenizer,
    model,
    test_ds_gec,
    batch_size=16,
    num_beams=1,
    max_new_tokens=128,
)

print("=" * 60)
print(f"BLEU score: {bleu_score:.4f}")
print("=" * 60)


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


SINGLE EXAMPLE TEST
Input: Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.

Prediction: From you, I found the pleasures of reading something which is expecting to be a new experience to me. I loathed the feeling of being a new experience to me.
Reference:  First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.



Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]

  0%|          | 0/31 [00:00<?, ?it/s]

BLEU score: 0.4392


In [None]:
import os
!zip -r gec_ok.zip gec_v2/checkpoint-350/
from google.colab import files
files.download('gec_ok.zip')

  adding: gec_v2/checkpoint-350/ (stored 0%)
  adding: gec_v2/checkpoint-350/scheduler.pt (deflated 62%)
  adding: gec_v2/checkpoint-350/trainer_state.json (deflated 65%)
  adding: gec_v2/checkpoint-350/rng_state.pth (deflated 26%)
  adding: gec_v2/checkpoint-350/tokenizer.json (deflated 82%)
  adding: gec_v2/checkpoint-350/special_tokens_map.json (deflated 75%)
  adding: gec_v2/checkpoint-350/tokenizer_config.json (deflated 87%)
  adding: gec_v2/checkpoint-350/optimizer.pt (deflated 8%)
  adding: gec_v2/checkpoint-350/adapter_config.json (deflated 60%)
  adding: gec_v2/checkpoint-350/training_args.bin (deflated 53%)
  adding: gec_v2/checkpoint-350/chat_template.jinja (deflated 37%)
  adding: gec_v2/checkpoint-350/adapter_model.safetensors (deflated 7%)
  adding: gec_v2/checkpoint-350/README.md (deflated 66%)
  adding: gec_v2/checkpoint-350/vocab.json (deflated 59%)
  adding: gec_v2/checkpoint-350/merges.txt (deflated 55%)


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## **2.2 Create a preference optimization dataset [5 points]**

* *Generate Output Variants* -- for each input sentence in the training set, use the fine-tuned model to generate two different output variants.
 * Consider using different decoding strategies, such as varying the temperature or beam size, to produce diverse outputs. Select an approach based on the desired balance between diversity and quality.

* *Preference Annotation* -- measure the edit distance between each **generated predicted variant** and **ground truth correction**. Label the variant with the lower edit distance as "chosen" and the one with the higher edit distance as "rejected."
 * Beyond using edit distance, what other metrics or methods could you consider to do preference dataset annotation?


### DPO

    NIMA:
      For DPO, you first need to have an SFTed model.
      💡 Do you think it makes a difference whether the SFTed mode was trained using a padding or pad-free approach?
      👀 Take a look at the default data collator used in the DPOTrainer class and consider the implications.

In [None]:
import wandb
wandb.init(
    project="ft-SmolLM-gec",
    name='dpo_v10',
)

0,1
train/epoch,▁▂▃▄▆▇██
train/global_step,▁▂▃▅▆▇██
train/grad_norm,▁▃█▁▆▂▅
train/learning_rate,█▇▆▄▃▁▁
train/loss,█▃▂▁▁▁▁

0,1
total_flos,320690431051776.0
train/epoch,0.28249
train/global_step,350.0
train/grad_norm,0.43563
train/learning_rate,0.0
train/loss,1.7054
train_loss,1.8673
train_runtime,198.8009
train_samples_per_second,28.169
train_steps_per_second,1.761


In [9]:
!pip install fast_edit_distance

Collecting fast_edit_distance
  Downloading fast_edit_distance-1.2.2-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.4 kB)
Downloading fast_edit_distance-1.2.2-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (134 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/134.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.6/134.6 kB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: fast_edit_distance
Successfully installed fast_edit_distance-1.2.2


## **2.3 Run Direct Preference Optimization (DPO) [5 points]**
* Use the preference optimization dataset to further train the model through DPO, a method that leverages human-like preferences for model training.
* After running DPO, measure the BLEU score on the test set. Compare this performance to the baseline established during the SFT phase.
* Search for an optimal set of hyperparameters, such as the learning rate and number of epochs. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters.

## dpo dataset

In [None]:
from trl import DPOConfig, DPOTrainer

In [None]:
train_ds_gec[0]

{'_id': '1',
 'task': 'gec',
 'src': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.',
 'tgt': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.'}

In [None]:
import torch

def predict(
    tokenizer,
    model,
    src_texts,
    max_new_tokens=128,
    num_beams=1,
    temperature=0.0,
    top_p=1.0,
    top_k=0,
    num_return_sequences=1,
    do_sample=True,
):

    model.eval()

    if isinstance(src_texts, str):
        src_texts = [src_texts]
    elif isinstance(src_texts, dict):
        raise TypeError("predict expects a string or list[str]. You passed a dict. Use ex['src'].")

    tokenizer.padding_side = "left"

    prompts = [GEC_PROMPT_STYLE.format(strip_gec_prefix(src)) + " " for src in src_texts]
    enc = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_len,
    ).to(model.device)

    input_len = enc["input_ids"].shape[1]

    do_sample = temperature is not None and temperature > 0

    gen_kwargs = dict(
        **enc,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False,
    )

    if do_sample:
        gen_kwargs.update(
            do_sample=do_sample,
            temperature=float(temperature),
            top_p=float(top_p),
            top_k=int(top_k),
            num_beams=1,
            num_return_sequences=int(num_return_sequences),
        )
    else:
        # Deterministic mode
        gen_kwargs.update(
            do_sample=False,
            num_beams=int(num_beams),
            num_return_sequences=1,
        )

    with torch.inference_mode():
        gen_ids = model.generate(**gen_kwargs)

    batch_size = len(src_texts)
    n = int(num_return_sequences) if do_sample else 1

    decoded = []
    for ids in gen_ids:
        text = tokenizer.decode(ids[input_len:], skip_special_tokens=True).strip()
        decoded.append(" ".join(text.split()))

    if n == 1:
        return decoded
    else:
        return [decoded[i*n:(i+1)*n] for i in range(batch_size)]


In [None]:
train_ds_gec[:3]["src"]

['Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.',
 'Improve the grammaticality: As the number of people grows, the need of habitable environment is unquestionably essential.',
 'Improve the grammaticality of this sentence: Besides some technologically determinists that allow the development of biometric identification, this technology is also shaped by three social factors, namely, the desire of the society for safety, convenience and economy.']

In [None]:
train_ds_gec[:3]["tgt"]

['For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.',
 'As the number of people grows, the need for a habitable environment is unquestionably increasing.',
 'Besides some technological determinists that allow the development of biometric identification, this technology is also shaped by three social factors, namely, the desire of society for safety, convenience, and economy.']

In [None]:
predict(tokenizer, model,train_ds_gec[:3]["src"], max_new_tokens=128, num_beams=1,do_sample=False, temperature=0)

['For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.',
 'As the number of people grows, the need of habitable environment is unquestionably essential.',
 'Besides some technologically determinists that allow the development of biometric identification, this technology is also shaped by three social factors, namely, the desire of the society for safety, convenience and economy.']

In [None]:
predict(tokenizer, model,train_ds_gec[:3]["src"], max_new_tokens=128, num_beams=1,temperature=0.8,top_p=0.98,top_k=10)

['For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.',
 'As the number of people grows, the need for habitable environment is unquestionably essential.',
 'Besides some technologically determinists that allow the development of biometric identification, this technology is also shaped by three social factors: the desire of the society for safety, convenience and economy.']

In [None]:
from fast_edit_distance import edit_distance

n = 800
subset = train_ds_gec.select(range(min(n, len(train_ds_gec))))
srcs = subset["src"]
golds = subset["tgt"]

dec1s = predict(
    tokenizer,
    model,
    srcs,
    max_new_tokens=128,
    num_beams=1,
    do_sample=False,
    temperature=0.0,
)

dec2s = predict(
    tokenizer,
    model,
    srcs,
    max_new_tokens=128,
    num_beams=1,
    temperature=0.8,
    top_p=0.98,
    top_k=10,
)

pref_data = []
for src, gold, dec1, dec2 in zip(srcs, golds, dec1s, dec2s):
    if dec1 == dec2:
        continue
    d1 = edit_distance(dec1, gold)
    d2 = edit_distance(dec2, gold)
    if d1 == d2:
        continue
    prompt = GEC_PROMPT_STYLE.format(strip_gec_prefix(src)) + " "
    if d1 < d2:
        chosen, rejected = dec1, dec2
    else:
        chosen, rejected = dec2, dec1
    pref_data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

pref_data[:3], len(pref_data)


([{'prompt': 'Correct the grammar: Improve the grammaticality: As the number of people grows, the need of habitable environment is unquestionably essential. ',
   'chosen': 'As the number of people grows, the need for habitable environment is unquestionably essential.',
   'rejected': 'As the number of people grows, the need for habitable environment becomes unquestionably essential.'},
  {'prompt': 'Correct the grammar: Improve the grammaticality of this sentence: Besides some technologically determinists that allow the development of biometric identification, this technology is also shaped by three social factors, namely, the desire of the society for safety, convenience and economy. ',
   'chosen': 'Besides some technologically determinists that allow the development of biometric identification, this technology is also shaped by three social factors, namely, the desire of the society for safety, convenience and economy.',
   'rejected': 'Besides some technologically determinists tha

## dpo trainer

In [19]:
from datasets import Dataset



In [None]:
from datasets import Dataset
from trl import DPOConfig, DPOTrainer
import torch

pref_data_hf = Dataset.from_list(pref_data)
split = pref_data_hf.train_test_split(test_size=0.1, seed=42)

dpo_args = DPOConfig(
    output_dir="./dpo_v10_outputs",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    learning_rate=1e-5,
    warmup_ratio=0.05,
    lr_scheduler_type="constant",
    optim="adamw_torch",
    logging_steps=10,
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
    fp16=not (torch.cuda.is_available() and torch.cuda.is_bf16_supported()),
    max_length=1024,
    max_prompt_length=512,
    beta=0.1,
    report_to="wandb",
    run_name="dpo_gec_v10",
)

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=dpo_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    processing_class=tokenizer,
    peft_config=lora_cfg,
)

dpo_trainer.train()

dpo_trainer.model.save_pretrained("dpo_v10")
tokenizer.save_pretrained("dpo_v10")

bleu_score = evaluate_bleu(
    tokenizer,
    dpo_trainer.model,
    test_ds_gec,
    batch_size=16,
    num_beams=1,
    max_new_tokens=128,
)

print("=" * 60)
print(f"BLEU score after DPO: {bleu_score:.4f}")
print("=" * 60)




Extracting prompt in train dataset:   0%|          | 0/52 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/52 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/52 [00:00<?, ? examples/s]

Extracting prompt in eval dataset:   0%|          | 0/6 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/6 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/6 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.


Step,Training Loss


  0%|          | 0/31 [00:00<?, ?it/s]

BLEU score after DPO: 0.4340


In [None]:
import os
!zip -r dpo_v10.zip dpo_v10/
from google.colab import files
files.download('dpo_v10.zip')

updating: dpo_v10/ (stored 0%)
updating: dpo_v10/tokenizer.json (deflated 82%)
updating: dpo_v10/special_tokens_map.json (deflated 75%)
updating: dpo_v10/tokenizer_config.json (deflated 87%)
updating: dpo_v10/adapter_config.json (deflated 60%)
updating: dpo_v10/chat_template.jinja (deflated 37%)
updating: dpo_v10/adapter_model.safetensors (deflated 46%)
updating: dpo_v10/README.md (deflated 65%)
updating: dpo_v10/vocab.json (deflated 59%)
updating: dpo_v10/merges.txt (deflated 55%)


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# **Coding Challenge Part 3: Explore Alternative DPO Variants for Improved Model Performance [10 points]**

## rl techniques

** DPO**: offline, pairwise annotation of chosen vs reject, suitable when annotation quality is high.   

** GRPO**: online, use a group-based baseline, calculate advntage and optimizes a clipped PPO.  

** GSPO**: sequence-lvl instead of token level

---




## reward calc
**edit distance**: delete/insert/substitute. issue: mechanistic. no semantic correctness check  
**bleu**: avg log n-gram precision with brevity penalty. issue: harsh on short sentences, punish paraphrases   
**gleu**: min(n-gram precision, n-gram recall). issue: while less problemsome on short sentences, still punish paraphrases     
**ERRANT**: grammar-informed spacy-based word-level edit. issue: slower due to parsing, still pubish paraphrases. ERRANT F0.5 is the symmetrical Errant

# GRPO + GLEU

In [3]:
import wandb
wandb.init(
    project="ft-SmolLM-gec",
    name='grpo_v10',
)

In [None]:
import torch
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer
from nltk.translate.gleu_score import sentence_gleu

src_clean = [strip_gec_prefix(s) for s in subset["src"]]
golds = subset["tgt"]
prompts = [GEC_PROMPT_STYLE.format(s) + " " for s in src_clean]

grpo_ds = Dataset.from_dict({"prompt": prompts, "gold": golds})
split = grpo_ds.train_test_split(test_size=0.1, seed=42)

def sentence_gleu_reward(prompts, completions, gold, **kwargs):
    out = []
    for hyp, g in zip(completions, gold):
        hyp_toks = hyp.strip().split()
        ref_toks = g.strip().split()
        if not hyp_toks or not ref_toks:
            out.append(-1.0)
            continue
        out.append(float(sentence_gleu([ref_toks], hyp_toks)))
    return out

def length_sanity_reward(prompts, completions, gold, **kwargs):
    out = []
    for hyp, g in zip(completions, gold):
        hyp_toks = hyp.strip().split()
        ref_toks = g.strip().split()
        if not hyp_toks or not ref_toks:
            out.append(-1.0)
            continue
        ratio = len(hyp_toks) / max(1, len(ref_toks))
        penalty = -0.5 * abs(ratio - 1.0)
        penalty = max(-0.5, penalty)
        out.append(float(penalty))
    return out

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

grpo_args = GRPOConfig(
    output_dir="./grpo_v10_outputs",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    lr_scheduler_type="constant",
    max_steps=100,
    logging_steps=10,
    save_steps=100,
    bf16=use_bf16,
    fp16=not use_bf16,
    report_to="wandb",
    run_name="grpo_v10",
    remove_unused_columns=False,
    max_prompt_length=512,
    max_completion_length=256,
    num_generations=8,
    temperature=1.0,
    top_p=0.98,
    top_k=0,
    beta=0.1,
)

grpo_trainer = GRPOTrainer(
    model=model,
    args=grpo_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    reward_funcs=[sentence_gleu_reward, length_sanity_reward],
    processing_class=tokenizer,
    peft_config=lora_cfg,
)

grpo_trainer.train()

grpo_trainer.model.save_pretrained("grpo_gec_v10")
tokenizer.save_pretrained("grpo_gec_v10")

bleu_score = evaluate_bleu(
    tokenizer,
    grpo_trainer.model,
    test_ds_gec,
    batch_size=16,
    num_beams=1,
    max_new_tokens=128,
)

print("=" * 60)
print(f"BLEU score after GRPO: {bleu_score:.4f}")
print("=" * 60)


The model is already on multiple devices. Skipping the move to device specified in `args`.


Step,Training Loss
10,0.2394
20,0.1911
30,0.1216
40,0.0187
50,0.0276
60,0.0084
70,0.0521
80,0.011
90,0.0015
100,0.0323




  0%|          | 0/31 [00:00<?, ?it/s]

BLEU score after GRPO: 0.4659


In [None]:
import os
!zip -r grpo_v10.zip grpo_v10/
from google.colab import files
files.download('grpo_v10.zip')

### grpo output examples

In [None]:
print("Input:", test_ds_gec[0]["src"])
print()
pred = predict(tokenizer, model, test_ds_gec[0]["src"])[0]
print("Prediction:", pred)
print("Reference: ", test_ds_gec[0]["tgt"])


rejection sampling

# GSPO

In [61]:
import wandb
wandb.init(
    project="ft-SmolLM-gec",
    name='gspo_v10',
)

0,1
profiling/Time taken: GRPOTrainer._calculate_rewards,▂▂▃▄▂▂▂▃▂▄▃▁▃▅█▂▃▄▂▄▅▁▃▅▂▄▄▄▃▄▁▁▃▂▄▄▃▄▅▃
profiling/Time taken: GRPOTrainer._get_per_token_logps_and_entropies,▄▂▁▂▂▃▄▂█▂▄▅▄▂▂▂▄▂▁▅▂█▁▃▂▃▁▃▄▂▃▃▁▄▅▄▂▅▇▄
profiling/Time taken: GRPOTrainer._prepare_inputs,██▁▁██▁█▁███▁██▁█▁█▁▁█▁█▁▁██▁▁▁█▁█▁█▁█▁▁
profiling/Time taken: GRPOTrainer.compute_loss,▁▅▁▁▃▃▃▂▅▃▅▂▇▂▂▃█▄▅▂▃▂▃▂▅▁▆▃▂▄▂▃▃▂▁▃▇▅▂▂
profiling/Time taken: GRPOTrainer.length_sanity_reward,▁▄▂▂▃▁▃▃▄▃▄▂▂▃▄▃▂▂▁▂▃▃▃▃▂▃▃█▄▅▄▂▃▄▃▃▂▄▆▃
profiling/Time taken: GRPOTrainer.sentence_gleu_reward,▃▆▂▂▄▃▆▆▄▁▂▅▅▂▆▅▄▃▆▆▇▄▃▄▃▄▇▅▄▃▃▂▂█▃▅▃▄▄▁
profiling/Time taken: GRPOTrainer.transformers.generate,▃▂▂▄▄▃▃▃▃▄▄▂▂▃▃▂▃▃▃▂▃▄▂▂█▂▁▂▃▃▄▄▄▁▁▄▄▅▅▃
train/clip_ratio/high_max,▁▁▁▁▁▁▁▁▁▁
train/clip_ratio/high_mean,▁▁▁▁▁▁▁▁▁▁
train/clip_ratio/low_mean,▁▁▁▁▁▁▁▁▁▁

0,1
profiling/Time taken: GRPOTrainer._calculate_rewards,0.00907
profiling/Time taken: GRPOTrainer._get_per_token_logps_and_entropies,0.11055
profiling/Time taken: GRPOTrainer._prepare_inputs,1e-05
profiling/Time taken: GRPOTrainer.compute_loss,0.1233
profiling/Time taken: GRPOTrainer.length_sanity_reward,0.00053
profiling/Time taken: GRPOTrainer.sentence_gleu_reward,0.0076
profiling/Time taken: GRPOTrainer.transformers.generate,24.95709
total_flos,0
train/clip_ratio/high_max,0
train/clip_ratio/high_mean,0


In [66]:
from difflib import SequenceMatcher

In [72]:
src_clean = [strip_gec_prefix(s) for s in subset["src"]]
golds = subset["tgt"]
prompts = [GEC_PROMPT_STYLE.format(s) + " " for s in src_clean]

grpo_ds = Dataset.from_dict({
    "prompt": prompts,
    "gold": golds,
    "src": src_clean,   # REQUIRED for rewrite penalty
})

split = grpo_ds.train_test_split(test_size=0.1, seed=42)

def extract_answer(text: str) -> str:
    if not text or not text.strip():
        return ""

    # Safely get the first line
    lines = text.strip().splitlines()
    if not lines:
        return ""
    t = lines[0].strip()

    for prefix in ["Corrected:", "Correction:", "Answer:", "Output:"]:
        if t.lower().startswith(prefix.lower()):
            t = t[len(prefix):].strip()
    return t


def gleu_reward(prompts, completions, gold, **kwargs):
    out = []
    for hyp, g in zip(completions, gold):
        hyp = extract_answer(hyp)
        hyp_toks = hyp.split()
        ref_toks = (g or "").strip().split()
        if not hyp_toks or not ref_toks:
            out.append(-1.0)
            continue
        out.append(float(sentence_gleu([ref_toks], hyp_toks)))
    return out
def length_sanity_reward(prompts, completions, gold, **kwargs):
    out = []
    for hyp, g in zip(completions, gold):
        hyp = extract_answer(hyp)
        hyp_toks = hyp.split()
        ref_toks = (g or "").strip().split()
        if not hyp_toks or not ref_toks:
            out.append(-1.0)
            continue

        ratio = len(hyp_toks) / max(1, len(ref_toks))
        pen = -0.5 * abs(ratio - 1.0)
        pen = max(-0.5, pen)
        out.append(float(pen))
    return out
def rewrite_penalty_reward(prompts, completions, src, **kwargs):
    out = []
    for hyp, s in zip(completions, src):
        hyp = extract_answer(hyp)
        s = (s or "").strip()
        if not hyp or not s:
            out.append(-1.0)
            continue

        sim = SequenceMatcher(None, s, hyp).ratio()  # [0,1]
        penalty = min(0.0, (sim - 0.75))  # penalize if too different
        out.append(float(2.0 * penalty))  # scale
    return out
def format_reward(prompts, completions, **kwargs):
    out = []
    for hyp in completions:
        if hyp is None or not hyp.strip():
            out.append(-1.0)
        elif "\n" in hyp.strip():
            out.append(-0.5)
        else:
            out.append(0.0)
    return out
def gec_total_reward(prompts, completions, gold, src, **kwargs):
    r_gleu = gleu_reward(prompts, completions, gold)
    r_len  = length_sanity_reward(prompts, completions, gold)
    r_rw   = rewrite_penalty_reward(prompts, completions, src)
    r_fmt  = format_reward(prompts, completions)

    rewards = []
    for a, b, c, d in zip(r_gleu, r_len, r_rw, r_fmt):
        r = (
            1.0 * a +      # main signal
            0.2 * b +      # length sanity
            0.8 * c +      # rewrite suppression
            0.2 * d        # format
        )
        r = max(-2.0, min(2.0, r))  # CLIP (stability)
        rewards.append(float(r))
    return rewards
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

gspo_args = GRPOConfig(
    output_dir="./gspo_gec_outputs",

    # GSPO-ish knobs (remove if TRL errors)
    importance_sampling_level="sequence",
    epsilon=3e-4,
    loss_type="grpo",

    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    lr_scheduler_type="constant",
    max_steps=200,

    logging_steps=10,
    save_steps=100,

    bf16=use_bf16,
    fp16=not use_bf16,

    report_to="wandb",
    run_name="gspo_gec_v11",
    remove_unused_columns=False,

    max_prompt_length=512,
    max_completion_length=96,   # IMPORTANT
    num_generations=16,          # IMPORTANT

    temperature=0.5,
    top_p=0.9,
    top_k=30,

    beta=0.08,                  # KL strength
)

gspo_trainer = GRPOTrainer(
    model=model,
    args=gspo_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    reward_funcs=[gec_total_reward],
    processing_class=tokenizer,
    peft_config=lora_cfg,
)
gspo_trainer.train()

gspo_trainer.model.save_pretrained("gspo_gec_v11")
tokenizer.save_pretrained("gspo_gec_v11")

bleu_score = evaluate_bleu(
    tokenizer,
    gspo_trainer.model,
    test_ds_gec,
    batch_size=16,
    num_beams=1,
    max_new_tokens=128,
)

print("=" * 60)
print(f"BLEU score after GSPO (v11): {bleu_score:.4f}")
print("=" * 60)


The model is already on multiple devices. Skipping the move to device specified in `args`.


Step,Training Loss
10,0.0002
20,0.0001
30,0.0001
40,0.0001
50,0.0005
60,0.0004
70,0.0001
80,0.0005
90,0.0004
100,0.0017




  0%|          | 0/31 [00:00<?, ?it/s]

BLEU score after GSPO (v11): 0.0401


In [73]:
import os
!zip -r gspo_gec_v11.zip gspo_gec_v11/
from google.colab import files
files.download('gspo_gec_v11.zip')

  adding: gspo_gec_v11/ (stored 0%)
  adding: gspo_gec_v11/tokenizer_config.json (deflated 87%)
  adding: gspo_gec_v11/tokenizer.json (deflated 82%)
  adding: gspo_gec_v11/adapter_config.json (deflated 58%)
  adding: gspo_gec_v11/merges.txt (deflated 55%)
  adding: gspo_gec_v11/chat_template.jinja (deflated 37%)
  adding: gspo_gec_v11/vocab.json (deflated 59%)
  adding: gspo_gec_v11/special_tokens_map.json (deflated 75%)
  adding: gspo_gec_v11/README.md (deflated 65%)
  adding: gspo_gec_v11/adapter_model.safetensors (deflated 7%)


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# MAPO + GLEU + ERRANT difficulty weighting

In [37]:
%%bash
pip install errant
python3 -m spacy download en_core_web_sm

Collecting errant
  Downloading errant-3.0.0-py3-none-any.whl.metadata (13 kB)
Collecting rapidfuzz>=3.4.0 (from errant)
  Downloading rapidfuzz-3.14.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (12 kB)
Downloading errant-3.0.0-py3-none-any.whl (499 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 499.3/499.3 kB 29.8 MB/s eta 0:00:00
Downloading rapidfuzz-3.14.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (3.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 106.3 MB/s eta 0:00:00
Installing collected packages: rapidfuzz, errant
Successfully installed errant-3.0.0 rapidfuzz-3.14.3
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 155.9 MB/s eta 0:00:00
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core

In [38]:
import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)

1 2 are 1 2 is R:VERB:SVA
2 2  2 3 a M:DET
2 3 gramamtical 3 4 grammatical R:SPELL


In [68]:
import wandb
wandb.init(
    project="ft-SmolLM-gec",
    name='mapo_v10',
)

0,1
profiling/Time taken: GRPOTrainer._calculate_rewards,▁█
profiling/Time taken: GRPOTrainer._get_per_token_logps_and_entropies,▁▅██▂▃▄▁
profiling/Time taken: GRPOTrainer._prepare_inputs,█▁█▁
profiling/Time taken: GRPOTrainer.compute_loss,▇█▁▂
profiling/Time taken: GRPOTrainer.gec_total_reward,▁█
profiling/Time taken: GRPOTrainer.transformers.generate,▁█▃▁
train/global_step,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

0,1
profiling/Time taken: GRPOTrainer._calculate_rewards,0.01406
profiling/Time taken: GRPOTrainer._get_per_token_logps_and_entropies,0.11077
profiling/Time taken: GRPOTrainer._prepare_inputs,1e-05
profiling/Time taken: GRPOTrainer.compute_loss,0.11731
profiling/Time taken: GRPOTrainer.gec_total_reward,0.01327
profiling/Time taken: GRPOTrainer.transformers.generate,9.5462
train/global_step,0.0


In [23]:
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

In [12]:
!unzip gec_ok.zip -d gec_ok


Archive:  gec_ok.zip
   creating: gec_ok/gec_v2/checkpoint-350/
  inflating: gec_ok/gec_v2/checkpoint-350/scheduler.pt  
  inflating: gec_ok/gec_v2/checkpoint-350/trainer_state.json  
  inflating: gec_ok/gec_v2/checkpoint-350/rng_state.pth  
  inflating: gec_ok/gec_v2/checkpoint-350/tokenizer.json  
  inflating: gec_ok/gec_v2/checkpoint-350/special_tokens_map.json  
  inflating: gec_ok/gec_v2/checkpoint-350/tokenizer_config.json  
  inflating: gec_ok/gec_v2/checkpoint-350/optimizer.pt  
  inflating: gec_ok/gec_v2/checkpoint-350/adapter_config.json  
  inflating: gec_ok/gec_v2/checkpoint-350/training_args.bin  
  inflating: gec_ok/gec_v2/checkpoint-350/chat_template.jinja  
  inflating: gec_ok/gec_v2/checkpoint-350/adapter_model.safetensors  
  inflating: gec_ok/gec_v2/checkpoint-350/README.md  
  inflating: gec_ok/gec_v2/checkpoint-350/vocab.json  
  inflating: gec_ok/gec_v2/checkpoint-350/merges.txt  


In [62]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

adapter_dir = "./gec_ok/gec_v2/checkpoint-350"
base_model_id = "unsloth/SmolLM-135M-Instruct"

dtype = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    else torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    use_fast=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=dtype,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base_model,
    adapter_dir,
    is_trainable=True,
)
model



PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(49152, 576, padding_idx=2)
        (layers): ModuleList(
          (0-29): 30 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=576, out_features=576, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=576, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=576, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lo

In [35]:
import math
import re
from fast_edit_distance import edit_distance
from nltk.translate.gleu_score import sentence_gleu

def strip_gec_prefix(s: str) -> str:
    s = s.strip()
    s = re.sub(r"^\s*fix grammaticality:\s*", "", s, flags=re.IGNORECASE)
    s = re.sub(r"^\s*fix grammatically:\s*", "", s, flags=re.IGNORECASE)
    s = re.sub(r"^\s*remove all grammatical errors from this text:\s*", "", s, flags=re.IGNORECASE)
    s = re.sub(r"^\s*correct the grammar:\s*", "", s, flags=re.IGNORECASE)
    return s.strip()


In [28]:
# !pip install trl

In [13]:
from trl import GRPOConfig, GRPOTrainer


In [31]:
n = 800
subset = train_ds_gec.select(range(min(n, len(train_ds_gec))))

src_clean = [strip_gec_prefix(s) for s in subset["src"]]
golds = subset["tgt"]
prompts = [GEC_PROMPT_STYLE.format(s) + " " for s in src_clean]

grpo_ds = Dataset.from_dict({"prompt": prompts, "gold": golds})


In [41]:
def errant_edit_count(src: str, gold: str) -> int:
    s_doc = annotator.parse(src)
    g_doc = annotator.parse(gold)
    edits = annotator.annotate(s_doc, g_doc)
    return len(edits)

def edits_to_weight(n_edits: int, a: float = 0.25, w_min: float = 0.8, w_max: float = 1.5) -> float:
    w = 1.0 + a * math.log1p(n_edits)
    return float(max(w_min, min(w_max, w)))



def add_errant_difficulty(example):
    n_edits = errant_edit_count(example["prompt"], example["gold"])
    example["difficulty"] = edits_to_weight(n_edits)
    return example
grpo_ds = grpo_ds.map(add_errant_difficulty)

split = grpo_ds.train_test_split(test_size=0.1, seed=42)

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

In [None]:
#if use errant f0.5 instead
def errant_f05_dataset(srcs, hyps, golds, beta: float = 0.5):
    """
    srcs, hyps, golds: lists of equal length
    """

    TP = FP = FN = 0

    for src, hyp, gold in zip(srcs, hyps, golds):
        src_doc  = annotator.parse(src)
        hyp_doc  = annotator.parse(hyp)
        gold_doc = annotator.parse(gold)

        hyp_edits  = annotator.annotate(src_doc, hyp_doc)
        gold_edits = annotator.annotate(src_doc, gold_doc)

        tp, fp, fn = compareEdits(hyp_edits, gold_edits)
        TP += tp
        FP += fp
        FN += fn

    precision = TP / (TP + FP + 1e-8)
    recall    = TP / (TP + FN + 1e-8)

    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall + 1e-8)

    return precision, recall, f_beta


In [42]:
from nltk.translate.gleu_score import sentence_gleu

def weighted_gec_reward(prompts, completions, gold, difficulty=None, **kwargs):
    if difficulty is None:
        difficulty = [1.0] * len(completions)

    rewards = []
    for hyp, g, w in zip(completions, gold, difficulty):
        hyp_toks = (hyp or "").strip().split()
        ref_toks = (g or "").strip().split()

        if not hyp_toks or not ref_toks:
            rewards.append(-1.0)
            continue

        gleu = float(sentence_gleu([ref_toks], hyp_toks))

        ratio = len(hyp_toks) / max(1, len(ref_toks))
        len_pen = -0.5 * abs(ratio - 1.0)
        len_pen = max(-0.5, len_pen)

        base = gleu + 0.2 * len_pen

        w = float(max(0.8, min(1.5, w)))

        rewards.append(float(w * base))

    return rewards


## mapo train

In [69]:


use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

grpo_args = GRPOConfig(
    output_dir="./mapo_v10_outputs",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=5e-6,
    lr_scheduler_type="constant",
    max_steps=300,
    logging_steps=10,
    save_steps=100,
    bf16=use_bf16,
    fp16=not use_bf16,
    report_to="wandb",
    run_name="mapo_v10",
    remove_unused_columns=False,
    max_prompt_length=512,
    max_completion_length=256,
    num_generations=16,
    temperature=0.5,
    top_p=0.8,
    top_k=30,
    beta=0.05,
)

grpo_trainer = GRPOTrainer(
    model=model,
    args=grpo_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    reward_funcs=[weighted_gec_reward],
    processing_class=tokenizer,
    peft_config=lora_cfg,
)

grpo_trainer.train()

grpo_trainer.model.save_pretrained("gec_mapo_v10")
tokenizer.save_pretrained("gec_mapo_v10")

bleu_score = evaluate_bleu(
    tokenizer,
    grpo_trainer.model,
    test_ds_gec,
    batch_size=16,
    num_beams=1,
    max_new_tokens=128,
)



The model is already on multiple devices. Skipping the move to device specified in `args`.


Step,Training Loss
10,0.0738
20,0.061
30,0.06
40,0.0538
50,0.0236
60,0.0245
70,0.1015
80,0.0617
90,0.1064
100,0.0527




  0%|          | 0/31 [00:00<?, ?it/s]

##
 mapo eval

In [70]:
import torch
import evaluate
from torch.utils.data import DataLoader
from tqdm.auto import tqdm



print("Input:", test_ds_gec[0]["src"])
print()
pred = predict(tokenizer, model, test_ds_gec[0]["src"])[0]
print("Prediction:", pred)
print("Reference: ", test_ds_gec[0]["tgt"])

#bleu
bleu_score = evaluate_bleu(
    tokenizer,
    model,
    test_ds_gec,
    batch_size=16,
    num_beams=1,
    max_new_tokens=128,
)

print("=" * 60)
print(f"BLEU score after MAPO: {bleu_score:.4f}")
print("=" * 60)

Input: Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.

Prediction: **Themes:** * The importance of reading and the pleasure of reading * The importance of reading for pleasure and for understanding * The importance of reading for the pleasure of others **Style:** * The poem is written in a lyrical, poetic style, with a focus on the sensory details and the emotions of the speaker. * The novel is written in a more formal, academic style, with a focus on the structure and the themes. **Structure:** * The poem is divided into stanzas, with each stanza containing a different type of poem, such as a sonnet,
Reference:  First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.


  0%|          | 0/31 [00:00<?, ?it/s]

BLEU score after MAPO: 0.0406


In [71]:
import os
!zip -r gec_mapo_v10.zip gec_mapo_v10/
from google.colab import files
files.download('gec_mapo_v10.zip')

updating: gec_mapo_v10/ (stored 0%)
updating: gec_mapo_v10/tokenizer_config.json (deflated 87%)
updating: gec_mapo_v10/tokenizer.json (deflated 82%)
updating: gec_mapo_v10/adapter_config.json (deflated 58%)
updating: gec_mapo_v10/merges.txt (deflated 55%)
updating: gec_mapo_v10/chat_template.jinja (deflated 37%)
updating: gec_mapo_v10/vocab.json (deflated 59%)
updating: gec_mapo_v10/special_tokens_map.json (deflated 75%)
updating: gec_mapo_v10/README.md (deflated 65%)
updating: gec_mapo_v10/adapter_model.safetensors (deflated 7%)


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Consider employing a different version or variant of DPO. Your task is to:

* Choose a variant of DPO or another preference-based optimization method that could potentially enhance the model's performance.
* Describe the specific differences in this approach compared to the initial DPO method used.
* Train the model using this alternative DPO method and measure its performance on the test set using the BLEU score.
* Compare these results with the baseline performance achieved during the initial Supervised Fine-Tuning (SFT) and the first DPO implementation.
* Select a few GEC example after SFT, DPO and this DPO variant phases and compare the quality of the corrections, which one you prefer as human?
* You are allowed to make changes in the preference data annotation to improve the score, e.g. apply different metrics or methods beyond edit distance.
* Discuss the role of any changes in achieving these results. Consider potential trade-offs or limitations introduced by the new approach.


Expected BLEU score after 1 epoch SFT + DPO is ~ 0.50.

Expected number of train and test samples are 19823 and 485, respectively.

In [None]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM-135M"

# TODO: Load the model and the tokenizer from huggingface




In [None]:
# TRL - Transformer Reinforcement Learning -- https://huggingface.co/docs/trl/en/index
from trl import SFTConfig, SFTTrainer

# TODO: Run SFT



In [None]:
# Quick test if your model works properly
def format_text(text: str) -> str:
    # here you may have formatting of the input that you adopted for training
    return text


# Example of how to run inference on a single example
text = "Fix grammatically: I likes turtles"
inputs = tokenizer(format_text(text), return_tensors="pt", padding=True, truncation=True, max_length=128)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.0)
print(tokenizer.decode(outputs[0]))

Expected output: I like turtles.

In [None]:
import evaluate

# BLEU Score
def evaluate_model(model, tokenizer, ds):
    # TODO - compute and call preds and targets for the bleu.compute in the following.


    bleu = evaluate.load("bleu")
    results = bleu.compute(predictions=preds, references=targets)
    return results["bleu"]

In [None]:
# TODO: Evaluate model, use the function given above



Expected BLEU score after 1 epoch SFT is ~ 0.48.

In [None]:
from fast_edit_distance import edit_distance

# TODO: Create preference optimization dataset



In [None]:
# TODO: (Load and) Visualize the created dataset -- display at least 5 lines of the dataset.




In [None]:
import os
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM
from datasets import Dataset
import pandas as pd

# TODO: Run Direct Preference Optimization (DPO)



In [None]:
# TODO: Evaluate model, use evaluate_model function



# **issue**


*  loss curve flat, something might be wrong in processing
---

*  for dpo, get both generated ver and use editdist (or other metric) to chose one closer to the ground truth


In [None]:
!pip install nbstripout

!nbstripout your_notebook.ipynb