# Exercise: Teach an LLM to Spell with Group Relative Policy Optimization (GRPO)

Large language models (LLMs) are notoriously bad at spelling. This is partly because tokenizers break words into smaller pieces, so the model learns about sub-word units rather than whole words and their spellings.
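
To make the tokenization point concrete, here is a minimal sketch of a greedy longest-match subword splitter. The vocabulary `TOY_VOCAB` and the helper `toy_tokenize` are hypothetical and exist only for illustration; real tokenizers (including the one used later in this notebook) learn their vocabulary from data and typically use merge rules rather than this simple matching.

```python
# Toy greedy longest-match subword tokenizer (illustration only).
# TOY_VOCAB is a made-up vocabulary; it is NOT the vocabulary of any real model.
TOY_VOCAB = {"sap", "ph", "ire", "ec", "lip", "se"}


def toy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


print(toy_tokenize("sapphire", TOY_VOCAB))  # ['sap', 'ph', 'ire']
print(toy_tokenize("eclipse", TOY_VOCAB))   # ['ec', 'lip', 'se']
```

In this toy setting the model would see three opaque pieces for "sapphire" and never the eight individual letters, which is why letter-by-letter spelling has to be learned separately.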

In this exercise, you'll use Group Relative Policy Optimization (GRPO) and a technique called Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to teach a small LLM how to spell words. This is a classic example of teaching a model a new skill that isn't well-represented in its pre-training data.

## What you'll do in this notebook

1.  **Setup**: Import libraries and configure the environment.
2.  **Load the tokenizer and base model**: Use a small, instruction-tuned model as our starting point.
3.  **Create the dataset**: Generate a simple dataset of words and their correct spellings.
4.  **Evaluate the base model**: Test the model's spelling ability *before* fine-tuning to establish a baseline.
5.  **Configure LoRA and train**: Attach a LoRA adapter to the model, warm it up with a few epochs of supervised fine-tuning (SFT), and then train it with GRPO using custom reward functions.
6.  **Evaluate the fine-tuned model**: Test the model again to see if its spelling has improved.

## Setup

In [None]:
# Setup imports
# No changes needed in this cell

import os
import torch
from datasets import Dataset

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# Use GPU, MPS, or CPU, in that order of preference
if torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon
else:
    device = torch.device("cpu")
torch.set_num_threads(max(1, (os.cpu_count() or 1) // 2))  # cpu_count() can return None
print("Using device:", device)

## Step 1. Load the tokenizer and base model

The model `HuggingFaceTB/SmolLM2-135M-Instruct` is a small, instruction-tuned model that's suitable for this exercise. It has 135 million parameters, making it lightweight and efficient to fine-tune. It's not the most powerful model, but it's a good choice for demonstrating the concepts of SFT, GRPO, and PEFT with LoRA, especially on a CPU or with limited GPU resources.

In [None]:
# Student task: Load the model and tokenizer, and copy the model to the device.
# TODO: Complete the sections with **********

# See: https://huggingface.co/docs/transformers/en/models
# See: https://huggingface.co/docs/transformers/en/fast_tokenizers

# Model ID for SmolLM2-135M-Instruct
model_id = "***********"

# Load the tokenizer
tokenizer = "***********"

# Load the model
model = "***********"

# Copy the model to the device (GPU, MPS, or CPU)
model = "***********"



print("Model parameters (total):", sum(p.numel() for p in model.parameters()))

## Step 2. Create the dataset

In [None]:
# Create a list of words of different lengths
# No changes are needed in this cell.

# fmt: off
ALL_WORDS = [
    "idea", "glow", "rust", "maze", "echo", "wisp", "veto", "lush", "gaze", "knit", "fume", "plow",
    "void", "oath", "grim", "crisp", "lunar", "fable", "quest", "verge", "brawn", "elude", "aisle",
    "ember", "crave", "ivory", "mirth", "knack", "wryly", "onset", "mosaic", "velvet", "sphinx",
    "radius", "summit", "banner", "cipher", "glisten", "mantle", "scarab", "expose", "fathom",
    "tavern", "fusion", "relish", "lantern", "enchant", "torrent", "capture", "orchard", "eclipse",
    "frescos", "triumph", "absolve", "gossipy", "prelude", "whistle", "resolve", "zealous",
    "mirage", "aperture", "sapphire",
]
# fmt: on

In [None]:
# Student Task: Create a Hugging Face Dataset with the prompt that asks the model to spell the word
# with hyphens between the letters.
# TODO: Complete the sections with **********


def generate_records():
    for word in ALL_WORDS:
        yield {
            # We will use the GRPOTrainer, which expects to receive formatted prompts
            # to pass to the LLM
            # https://huggingface.co/docs/trl/main/en/grpo_trainer
            # "**********": f"**********",
            # Before using GRPOTrainer, we will run a few epochs of supervised fine-tuning (SFT),
            # which can be useful to give an initial nudge to the model. Thus we need to provide
            # the gold standard answer.
            # See the documentation for more details:
            # https://huggingface.co/docs/trl/en/sft_trainer#expected-dataset-type-and-format
            # "**********": "-".join(word).upper() + ".",
            # GRPOTrainer does not expect a completion, but we can add extra columns to our dataset
            # that our reward functions will use to grade the completions provided by the LLM.
            "word": word,
            "spelling": "-".join(word).upper(),
        }


ds = Dataset.from_generator(generate_records)

ds = ds.train_test_split(test_size=0.1, seed=42)

# Show the first item of the train split
ds["train"][0]
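
As a quick check of the target format, the sketch below shows what the `"-".join(word).upper()` expression used for the `spelling` column produces. The helper name `spelling_target` is hypothetical and is introduced here only to make the example self-contained.

```python
def spelling_target(word: str) -> str:
    """Return the hyphenated, upper-case spelling used as the reference answer."""
    # "-".join iterates over the characters of the string, so "idea" -> "i-d-e-a".
    return "-".join(word).upper()


print(spelling_target("idea"))       # I-D-E-A
print(len(spelling_target("idea")))  # 7 (4 letters + 3 hyphens)
```

A word with n letters therefore yields a reference string of 2n - 1 characters, which matters later when scores are normalized by the length of the spelling.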

## Step 3. Evaluate the base model

Before we fine-tune the model, let's see how it performs on the spelling task. We'll create a helper function to generate a spelling for a given word and compare it to the correct answer.

In [None]:
# Create a helper function that will help us visualize the performance of the model
# No changes needed in this cell


def check_spelling(
    model, tokenizer, prompt: str, actual_spelling: str, max_new_tokens: int = 20
) -> float:
    """Generate a spelling for `prompt` and return the proportion of letters that match."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    gen = model.generate(
        **inputs, max_new_tokens=max_new_tokens
    )  # No parameters = greedy search
    output = tokenizer.decode(gen[0], skip_special_tokens=True)

    # Extract the generated spelling from the full output string
    proposed_spelling = output.split("Spelling:\n")[-1].strip().split("\n")[0].strip()

    # Strip any whitespace from the actual spelling
    actual_spelling = actual_spelling.strip()

    print(
        f"Proposed: {proposed_spelling} | Actual: {actual_spelling} "
        f"| Matches: {'✅' if proposed_spelling == actual_spelling else '❌'}"
    )

    # Remove hyphens for a character-by-character comparison
    proposed_spelling = proposed_spelling.replace("-", "")
    actual_spelling = actual_spelling.replace("-", "")

    # Calculate the proportion of the spelling that was correct
    num_correct = sum(1 for a, b in zip(actual_spelling, proposed_spelling) if a == b)

    return num_correct / len(actual_spelling)  # Return proportion correct
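
The scoring at the end of `check_spelling` can be tried on its own, without a model. The sketch below isolates that logic in a hypothetical helper, `positional_match_fraction`, so the behavior on a few hand-written strings is easy to inspect; it is an illustration of the comparison, not a replacement for the cell above.

```python
def positional_match_fraction(actual: str, proposed: str) -> float:
    """Fraction of letters in `actual` matched at the same position in `proposed`."""
    # Hyphens are removed first, mirroring the comparison in check_spelling.
    actual_letters = actual.replace("-", "")
    proposed_letters = proposed.replace("-", "")
    # zip stops at the shorter string, so extra trailing letters are ignored.
    num_correct = sum(1 for a, b in zip(actual_letters, proposed_letters) if a == b)
    return num_correct / len(actual_letters)


print(positional_match_fraction("I-D-E-A", "I-D-E-A"))    # 1.0
print(positional_match_fraction("I-D-E-A", "I-D-A"))      # 0.5
print(positional_match_fraction("I-D-E-A", "I-D-E-A-S"))  # 1.0
print(positional_match_fraction("I-D-E-A", "idea"))       # 0.0
```

Two properties are worth noticing: the comparison is case-sensitive, so a model that simply repeats the lowercase word scores zero, and surplus letters after a correct prefix are not penalized by this metric.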


In [None]:
# Student task: Evaluate the base model's spelling ability
# We expect it to perform poorly, as it hasn't been trained for this task.

proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    spelling = example["spelling"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=spelling,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

As expected, the base model is terrible at spelling. It mostly just repeats the word back. Now, let's fine-tune it.

## Step 4. Configure LoRA and train the model

Let’s attach a LoRA adapter to the base model. We use a LoRA config so only a tiny fraction of parameters are trainable. Read more here: [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora).
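
To see why only a tiny fraction of parameters ends up trainable, it helps to count them. LoRA freezes a weight matrix of shape `d_out x d_in` and learns two small matrices of shapes `d_out x r` and `r x d_in` instead. The sketch below does that arithmetic; the helper name `lora_param_counts` and the dimensions are illustrative assumptions, not values read from the model in this notebook.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (frozen full-matrix params, trainable LoRA params) for one linear layer."""
    full = d_in * d_out        # the original weight matrix W (frozen)
    lora = r * (d_in + d_out)  # A is r x d_in, B is d_out x r (both trainable)
    return full, lora


# Illustrative dimensions only (assumed, not taken from the model config).
full, lora = lora_param_counts(d_in=576, d_out=576, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {100 * lora / full:.2f}%")
# full: 331,776  lora: 9,216  ratio: 2.78%
```

Lowering `r` shrinks the trainable count linearly, which is the trade-off the rank parameter in `LoraConfig` controls.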

In [None]:
# Student task: Configure LoRA for a causal LM and wrap the model with get_peft_model
# Complete the sections with **********

# Print how many params are trainable at first
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params BEFORE: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

# See: https://huggingface.co/docs/peft/package_reference/lora
# lora_config = LoraConfig(
#     r=**********,                 # Rank of the update matrices. Lower value = fewer trainable parameters.
#     lora_alpha=**********,        # LoRA scaling factor.
#     lora_dropout=**********,      # Dropout probability for LoRA layers.
#     bias="none",
#     task_type=**********,         # Causal Language Modeling.
# )
# # Wrap the base model with get_peft_model
# model = get_peft_model(**********, **********)


# Print the number of trainable parameters after applying LoRA
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params AFTER: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

Now let’s set the training arguments. We'll use `SFTConfig` from the TRL library, which extends the standard `TrainingArguments` class from Transformers. We keep the number of epochs and the batch size modest so that training finishes quickly.
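
Two of the settings used in this notebook's SFT configuration are easier to reason about with a little arithmetic: the effective batch size and the cosine learning-rate schedule. The sketch below is a simplified model of both, assuming a single device and no warmup; `cosine_lr` is a hypothetical helper and the library's scheduler may differ in details.

```python
import math

# Values taken from the SFT configuration in this notebook.
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
peak_lr = 5e-4

# On a single device, gradients from 2 batches of 4 are accumulated per update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print("Effective batch size:", effective_batch_size)  # 8


def cosine_lr(step: int, total_steps: int, peak: float = peak_lr) -> float:
    """Half-cosine decay from `peak` at step 0 down to 0 at `total_steps`."""
    progress = step / total_steps
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


# total_steps=100 is an arbitrary number chosen for illustration.
for step in (0, 25, 50, 75, 100):
    print(step, f"{cosine_lr(step, 100):.2e}")
```

The schedule starts at the peak rate, passes through half the peak at the midpoint, and decays to zero by the final step.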

In [None]:
# Train the model for a few epochs using SFT before GRPO, as in certain cases
# they can work together synergistically.
# See: https://arxiv.org/html/2507.08267v1
# No changes needed here

output_dir = "data/model"

training_args = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=10,
    learning_rate=5e-4,
    logging_steps=20,
    eval_strategy="steps",
    eval_steps=20,
    save_strategy="no",
    report_to=[],
    fp16=False,
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    args=training_args,
)
trainer.train()

proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    spelling = example["spelling"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=spelling,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

The number of correctly spelled words has increased slightly. Let's try training with GRPO now.

First let's create some reward functions.
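
Before writing them, it is useful to see how GRPO uses rewards. For each prompt, the trainer samples a group of completions, scores each one, and then compares every score to the rest of its group. The sketch below shows that normalization step in plain Python. The helper name `group_relative_advantages`, the epsilon value, and the use of the population standard deviation are assumptions made for illustration; the actual trainer may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from the same prompt.
group = [1.0, 3 / 9, 8 / 9, -1.0]
for reward, advantage in zip(group, group_relative_advantages(group)):
    print(f"reward={reward:+.3f}  advantage={advantage:+.3f}")

# If all completions score the same, no completion is preferred.
print(group_relative_advantages([0.5, 0.5, 0.5]))  # [0.0, 0.0, 0.0]
```

Completions that score above their group's average get a positive advantage and are reinforced, while those below average are discouraged. This is why a reward function needs to produce varied scores within a group: if every completion gets the same reward, there is nothing to learn from.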

In [None]:
# Student Task: Create a helper function proportion_correct that takes a word and
# a proposed spelling from the LLM and returns a score where every matched character
# adds 1 to the reward and every mismatched character subtracts 1, including the
# hyphens. The score is then normalized by the length of the correct spelling.
# TODO: Replace occurrences of **********

import re


def proportion_correct(word, proposed_spelling):
    correct_spelling = "-".join(word).upper()

    score = 0.0

    # Pad to the same length to handle extra characters
    max_len = max(len(correct_spelling), len(proposed_spelling))
    proposed_spelling_padded = proposed_spelling.ljust(max_len, " ")
    correct_spelling_padded = correct_spelling.ljust(max_len, " ")

    for a, b in zip(correct_spelling_padded, proposed_spelling_padded):
        # Add 1 for matched characters, and subtract one for mismatched
        # **********


    return score / (
        len(correct_spelling)
    )  # Normalize by length of spelling, including dashes


assert proportion_correct("hello", "H-E-L-L-O") == 9 / 9
assert proportion_correct("hello", "H-E-L-") == 3 / 9
assert proportion_correct("hello", "H-E-L-L-O!") == 8 / 9


In [None]:
# Create a `reward_spelling` function that receives a batch of completions and the associated word values
# No changes needed here

import numpy as np


def reward_spelling(completions, word, **kwargs):
    """Reward function that scores how closely each completion matches the correct spelling."""

    completion_strings = [
        completion.split("\n")[0].strip() for completion in completions
    ]
    words = [w for w in word]

    rewards = [proportion_correct(w, c) for w, c in zip(words, completion_strings)]

    # When training, GRPO will pass multiple completions and words to this function.
    # We print just the first one to observe what is happening under the hood.
    print("=====")
    print(
        "Completion example first line:",
        words[0],
        "->",
        completion_strings[0],  # Already reduced to the stripped first line above
    )
    print(f"Spelling mean and std: {np.mean(rewards):.3f} +/- {np.std(rewards):.3f}")
    return rewards


assert reward_spelling(
    completions=[
        "H-E-L-L-O",
        "H-E-L-",
        "H-E-L-L-O!",
    ],
    word=[
        "hello",
        "hello",
        "hello",
    ],
) == [1, 3 / 9, 8 / 9]

In [None]:
# Student task: Return a reward of 1.0 for completions that start with a string
# formatted like X-Y-Z; otherwise return 0.0
# TODO: Replace sections marked with **********


def reward_response_in_form_of_letter_dash_letter(completions, word, **kwargs):
    """Reward function that gives a bonus for completions in the form of LETTER-DASH-LETTER."""
    pattern = re.compile(r"^([A-Z]-)+[A-Z]")  # Pattern for LETTER-DASH-LETTER

    words = [w for w in word]

    # Normalize the completions, taking the first line and removing extra whitespace
    completion_strings = [
        completion.split("\n")[0].strip() for completion in completions
    ]

    # Create a list of rewards corresponding to completions
    # Each completion that matches the pattern should receive 1.0,
    # else 0.0
    # rewards = [
    #     **********
    # ]

    # <<< START COMPLETION SECTION
    rewards = [
        1.0 if pattern.match(c) else 0.0 for w, c in zip(words, completion_strings)
    ]
    # >>> END COMPLETION SECTION

    print(
        f"Letter-dash-letter rewards mean and std: {np.mean(rewards):.3f} +/- {np.std(rewards):.3f}"
    )
    return rewards


assert reward_response_in_form_of_letter_dash_letter(
    completions=[
        "H-E-L-L-O",
        "hello",
        "Hi!",
    ],
    word=[
        "hello",
        "hello",
        "hello",
    ],
) == [1, 0, 0]
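
When more than one reward function is used, each completion ends up with one score per function, and those scores have to be combined into a single reward. A common choice, and to the best of our understanding the default behavior in TRL, is a weighted sum with all weights equal to 1.0. The sketch below illustrates that combination with a hypothetical helper, `combine_rewards`; the input values are hand-written examples, not output from the model.

```python
def combine_rewards(per_function_rewards, weights=None):
    """Weighted sum across reward functions, computed per completion."""
    if weights is None:
        # Assumption for this sketch: every reward function counts equally.
        weights = [1.0] * len(per_function_rewards)
    return [
        sum(w * r for w, r in zip(weights, rewards_for_completion))
        # zip(*...) regroups the lists so each tuple holds one completion's scores.
        for rewards_for_completion in zip(*per_function_rewards)
    ]


# Illustrative scores for the completions "H-E-L-L-O", "H-E-L-", and "hello".
spelling_rewards = [1.0, 3 / 9, -1.0]
format_rewards = [1.0, 1.0, 0.0]

combined = combine_rewards([spelling_rewards, format_rewards])
print([round(x, 3) for x in combined])  # [2.0, 1.333, -1.0]
```

The format reward gives partial credit to outputs that have the right shape even when the letters are wrong, which can provide a learning signal early in training when few completions are spelled correctly.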

In [None]:
# Student task: Set the GRPOConfig and initialize the trainer
# See: https://huggingface.co/docs/trl/main/en/grpo_trainer
# TODO: Complete the sections with **********

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="data/spelling-grpo",
    max_completion_length=30,
    logging_steps=5,
    # learning_rate=**********,
    # num_train_epochs=**********,
    # per_device_train_batch_size=**********,
    # num_generations=**********,
    # lr_scheduler_type=**********,
    # beta=**********,
)
trainer = GRPOTrainer(
    model=model,
    # Add the parameter for the reward functions
    # **********
    args=training_args,
    train_dataset=ds["train"],
)
trainer.train()

GRPO training is now complete. Next, let's check whether the model's spelling has improved.

## Step 5. Evaluate the fine-tuned model

In [None]:
# Evaluate the fine-tuned model on the same training examples
# No changes needed in this cell

proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    spelling = example["spelling"]  # Same reference column as the earlier evaluations
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=spelling,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

The model now performs better on the training data it has seen. But has it generalized? Let's check its performance on the unseen test set.

In [None]:
# Evaluate the fine-tuned model on the unseen test set
# No changes needed in this cell

proportion_correct = 0.0
num_examples = len(ds["test"])

for example in ds["test"]:
    prompt = example["prompt"]
    spelling = example["spelling"]  # Same reference column as the earlier evaluations
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=spelling,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/{num_examples}.0 words correct")

It looks like it has improved! Perhaps with a larger dataset and more training, it could get even better.

## Congratulations on completing the exercise! 🎉

✅ You did it! You successfully fine-tuned a small language model using PEFT with LoRA, first with a short SFT warm-up and then with GRPO and custom reward functions, to teach it a new skill: spelling! You saw how the base model struggled with the task, and how a very small amount of data and a short training run were enough for the model to start learning how to spell.
