# Exercise: Teach an LLM to Spell with Group Relative Policy Optimization (GRPO)

Large language models (LLMs) are notoriously bad at spelling. This is partly because tokenizers break words into smaller pieces, so the model learns about sub-word units rather than whole words and their spellings.
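To make the tokenization point concrete, here is a toy sketch of a greedy longest-match sub-word splitter. The vocabulary is made up for illustration and is not SmolLM2's real vocabulary; the point is only that the model receives multi-letter pieces, not individual letters.

```python
# Toy illustration only: this vocabulary is hypothetical, not a real tokenizer's.
TOY_VOCAB = {"sap", "ph", "ire", "tri", "umph"}


def toy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unknown characters become single-letter tokens."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first, then shrink
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece starts here, so fall back to one character
            tokens.append(word[i])
            i += 1
    return tokens


print(toy_tokenize("sapphire", TOY_VOCAB))
print(toy_tokenize("triumph", TOY_VOCAB))
```

With a split like this, "sapphire" arrives as three opaque pieces, so nothing in the input directly tells the model that the word contains two consecutive P's.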

In this exercise, you'll use Group Relative Policy Optimization (GRPO) and a technique called Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to teach a small LLM how to spell words. This is a classic example of teaching a model a new skill that isn't well-represented in its pre-training data.

## What you'll do in this notebook

1.  **Setup**: Import libraries and configure the environment.
2.  **Load the tokenizer and base model**: Use a small, instruction-tuned model as our starting point.
3.  **Create the dataset**: Generate a simple dataset of words and their correct spellings.
4.  **Evaluate the base model**: Test the model's spelling ability *before* fine-tuning to establish a baseline.
5.  **Configure LoRA and train**: Attach a LoRA adapter to the model, warm it up with a few epochs of supervised fine-tuning (SFT), then continue training with GRPO and custom reward functions.
6.  **Evaluate the fine-tuned model**: Test the model again to see if its spelling has improved.

## Setup

In [1]:
# Setup imports
# No changes needed in this cell

import os
import torch
from datasets import Dataset

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# Use GPU, MPS, or CPU, in that order of preference
if torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon
else:
    device = torch.device("cpu")
torch.set_num_threads(max(1, os.cpu_count() // 2))
print("Using device:", device)

Using device: mps


## Step 1. Load the tokenizer and base model

The model `HuggingFaceTB/SmolLM2-135M-Instruct` is a small, instruction-tuned model with 135 million parameters, which makes it lightweight enough to fine-tune on a CPU or a modest GPU. It's not the most powerful model, but it's a good choice for demonstrating SFT, GRPO, and PEFT with LoRA.

In [2]:
# Student task: Load the model and tokenizer, and copy the model to the device.
# TODO: Complete the sections with **********

# See: https://huggingface.co/docs/transformers/en/models
# See: https://huggingface.co/docs/transformers/en/fast_tokenizers

# Model ID for SmolLM2-135M-Instruct
model_id = "***********"

# Load the tokenizer
tokenizer = "***********"

# Load the model
model = "***********"

# Copy the model to the device (GPU, MPS, or CPU)
model = "***********"


# <<< START SOLUTION SECTION
# Model ID for SmolLM2-135M-Instruct
model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
)

# Copy the model to the device (GPU, MPS, or CPU)
model = model.to(device)
# <<< END SOLUTION SECTION

print("Model parameters (total):", sum(p.numel() for p in model.parameters()))

Model parameters (total): 134515008


## Step 2. Create the dataset

In [3]:
# Create a list of words of different lengths
# No changes are needed in this cell.

# fmt: off
ALL_WORDS = [
    "idea", "glow", "rust", "maze", "echo", "wisp", "veto", "lush", "gaze", "knit", "fume", "plow",
    "void", "oath", "grim", "crisp", "lunar", "fable", "quest", "verge", "brawn", "elude", "aisle",
    "ember", "crave", "ivory", "mirth", "knack", "wryly", "onset", "mosaic", "velvet", "sphinx",
    "radius", "summit", "banner", "cipher", "glisten", "mantle", "scarab", "expose", "fathom",
    "tavern", "fusion", "relish", "lantern", "enchant", "torrent", "capture", "orchard", "eclipse",
    "frescos", "triumph", "absolve", "gossipy", "prelude", "whistle", "resolve", "zealous",
    "mirage", "aperture", "sapphire",
]
# fmt: on

In [4]:
# Student Task: Create a Hugging Face Dataset with the prompt that asks the model to spell the word
# with hyphens between the letters.
# TODO: Complete the sections with **********


def generate_records():
    for word in ALL_WORDS:
        yield {
            # We will use the GRPOTrainer, which expects to receive formatted prompts
            # to pass to the LLM
            # https://huggingface.co/docs/trl/main/en/grpo_trainer
            # "**********": f"**********",
            # <<< START SOLUTION SECTION
            "prompt": (
                f"You spell words with hyphens between the letters like this W-O-R-D.\nWord:\n{word}\n\n"
                + "Spelling:\n"
            ),
            # >>> END SOLUTION SECTION
            # Before using GRPOTrainer, we will run a few epochs of supervised fine-tuning (SFT),
            # which can be useful to give an initial nudge to the model. Thus we need to provide
            # the gold standard answer.
            # See the documentation for more details:
            # https://huggingface.co/docs/trl/en/sft_trainer#expected-dataset-type-and-format
            # "**********": "-".join(word).upper() + ".",
            # <<< START SOLUTION SECTION
            "completion": "-".join(word).upper() + ".",
            # >>> END SOLUTION SECTION
            # GRPOTrainer does not expect a completion, but we can add extra columns to our dataset
            # that our reward functions will use to grade the completions provided by the LLM.
            "word": word,
            "spelling": "-".join(word).upper(),
        }


ds = Dataset.from_generator(generate_records)

ds = ds.train_test_split(test_size=0.1, seed=42)

# Show the first item of the train split
ds["train"][0]

{'prompt': 'You spell words with hyphens between the letters like this W-O-R-D.\nWord:\ntriumph\n\nSpelling:\n',
 'completion': 'T-R-I-U-M-P-H.',
 'word': 'triumph',
 'spelling': 'T-R-I-U-M-P-H'}

## Step 3. Evaluate the base model

Before we fine-tune the model, let's see how it performs on the spelling task. We'll create a helper function to generate a spelling for a given word and compare it to the correct answer.

In [5]:
# Create a helper function that will help us visualize the performance of the model
# No changes needed in this cell


def check_spelling(
    model, tokenizer, prompt: str, actual_spelling: str, max_new_tokens: int = 20
) -> float:
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    gen = model.generate(
        **inputs, max_new_tokens=max_new_tokens
    )  # No parameters = greedy search
    output = tokenizer.decode(gen[0], skip_special_tokens=True)

    # Extract the generated spelling from the full output string
    proposed_spelling = output.split("Spelling:\n")[-1].strip().split("\n")[0].strip()

    # Strip any whitespace from the actual spelling
    actual_spelling = actual_spelling.strip()

    print(
        f"Proposed: {proposed_spelling} | Actual: {actual_spelling} "
        f"| Matches: {'✅' if proposed_spelling == actual_spelling else '❌'}"
    )

    # Remove hyphens for a character-by-character comparison
    proposed_spelling = proposed_spelling.replace("-", "")
    actual_spelling = actual_spelling.replace("-", "")

    # Calculate the proportion of the spelling that was correct
    num_correct = sum(1 for a, b in zip(actual_spelling, proposed_spelling) if a == b)

    return num_correct / len(actual_spelling)  # Return proportion correct


In [6]:
# Student task: Evaluate the base model's spelling ability
# We expect it to perform poorly, as it hasn't been trained for this task.

proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    spelling = example["spelling"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=spelling,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

Proposed: trium | Actual: T-R-I-U-M-P-H | Matches: ❌
Proposed: sapp | Actual: S-A-P-P-H-I-R-E | Matches: ❌
Proposed: expose | Actual: E-X-P-O-S-E | Matches: ❌
Proposed: fres | Actual: F-R-E-S-C-O-S | Matches: ❌
Proposed: wisp | Actual: W-I-S-P | Matches: ❌
Proposed: mi-er-ge | Actual: M-I-R-A-G-E | Matches: ❌
Proposed: ivory | Actual: I-V-O-R-Y | Matches: ❌
Proposed: onset | Actual: O-N-S-E-T | Matches: ❌
Proposed: elude | Actual: E-L-U-D-E | Matches: ❌
Proposed: sphinx | Actual: S-P-H-I-N-X | Matches: ❌
Proposed: brawn | Actual: B-R-A-W-N | Matches: ❌
Proposed: goss | Actual: G-O-S-S-I-P-Y | Matches: ❌
Proposed: enchant | Actual: E-N-C-H-A-N-T | Matches: ❌
Proposed: tavern | Actual: T-A-V-E-R-N | Matches: ❌
Proposed: whistle | Actual: W-H-I-S-T-L-E | Matches: ❌
Proposed: W-O-R-D | Actual: C-A-P-T-U-R-E | Matches: ❌
Proposed: echo | Actual: E-C-H-O | Matches: ❌
Proposed: mirth | Actual: M-I-R-T-H | Matches: ❌
Proposed: cris | Actual: C-R-I-S-P | Matches: ❌
Proposed: zeal | Actual: Z-E-

As expected, the base model is terrible at spelling. It mostly repeats the word back in lowercase, sometimes truncated, instead of spelling it out with hyphens. Now, let's fine-tune it.
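The score returned by the helper is case-sensitive and positional, so a lowercase echo of the word earns no credit even when every letter is right. The sketch below re-implements only the scoring step of `check_spelling` (the name `char_score` is introduced here for illustration) and applies it to two pairs copied from the output above.

```python
def char_score(proposed: str, actual: str) -> float:
    """Position-by-position match rate after removing hyphens (case-sensitive)."""
    proposed = proposed.replace("-", "")
    actual = actual.replace("-", "")
    matches = sum(1 for a, b in zip(actual, proposed) if a == b)
    return matches / len(actual)


# The base model echoed the word in lowercase: no character equals its uppercase form
print(char_score("expose", "E-X-P-O-S-E"))

# The base model copied the example from the prompt instead of spelling the word
print(char_score("W-O-R-D", "C-A-P-T-U-R-E"))

# The same letters in uppercase would have received full credit
print(char_score("expose".upper(), "E-X-P-O-S-E"))
```

This is why the baseline total can sit near zero even though the model clearly "knows" which word it was asked about.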

## Step 4. Configure LoRA and train the model

Let’s attach a LoRA adapter to the base model. We use a LoRA config so only a tiny fraction of parameters are trainable. Read more here: [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora).

In [7]:
# Student task: Configure LoRA for a causal LM and wrap the model with get_peft_model
# Complete the sections with **********

# Print how many params are trainable at first
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params BEFORE: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

# See: https://huggingface.co/docs/peft/package_reference/lora
# lora_config = LoraConfig(
#     r=**********,                 # Rank of the update matrices. Lower value = fewer trainable parameters.
#     lora_alpha=**********,        # LoRA scaling factor.
#     lora_dropout=**********,      # Dropout probability for LoRA layers.
#     bias="none",
#     task_type=**********,         # Causal Language Modeling.
# )
# # Wrap the base model with get_peft_model
# model = get_peft_model(**********, **********)

# <<< START SOLUTION SECTION
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# >>> END SOLUTION SECTION

# Print the number of trainable parameters after applying LoRA
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params AFTER: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

Trainable params BEFORE: 134,515,008 / 134,515,008 (100.00%)
Trainable params AFTER: 3,686,400 / 138,201,408 (2.67%)
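The trainable-parameter count printed above can be reproduced with simple arithmetic. The sketch below rests on assumptions that are not stated in this notebook: that the model has a hidden size of 576, 30 decoder layers, and a 192-wide value projection, and that PEFT's default for this architecture adapts only the `q_proj` and `v_proj` matrices. Under those assumptions the numbers line up with the output.

```python
# Assumed model shapes and adapted modules (not stated in the notebook itself)
HIDDEN = 576        # input width of q_proj and v_proj, output width of q_proj
V_OUT = 192         # output width of v_proj
NUM_LAYERS = 30
RANK = 64           # r from the LoraConfig above
BASE_PARAMS = 134_515_008  # printed by the notebook before adding LoRA


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two small matrices, A (r x d_in) and B (d_out x r), per adapted weight."""
    return r * d_in + d_out * r


per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, V_OUT, RANK)
total_lora = per_layer * NUM_LAYERS

print(f"LoRA params: {total_lora:,}")
print(f"Total params with adapter: {BASE_PARAMS + total_lora:,}")
print(f"Trainable share: {100 * total_lora / (BASE_PARAMS + total_lora):.2f}%")
```

The count grows linearly with `r`, so halving the rank would roughly halve the number of trainable parameters.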


Now let’s set the training arguments. We'll use `SFTConfig` from the TRL library, which is a wrapper around the standard `TrainingArguments`. We keep epochs, batch size, and sequence length modest to finish training quickly.
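As a back-of-the-envelope check on how short this run is, the number of optimizer steps can be estimated from the dataset size and the settings used in the next cell. The rounding rules below are assumptions about how the train/test split and the trainer behave and may differ between library versions; with them, the estimate agrees with the progress-bar total of 70 shown in the output.

```python
import math

# Values taken from this notebook
num_words = 62          # number of entries in ALL_WORDS
test_fraction = 0.1     # test_size passed to train_test_split
per_device_batch = 4
grad_accum = 2
epochs = 10

# Assumed rounding: the test split is rounded up, the rest is used for training
num_test = math.ceil(num_words * test_fraction)
num_train = num_words - num_test

# One optimizer update happens every `grad_accum` batches
effective_batch = per_device_batch * grad_accum
batches_per_epoch = math.ceil(num_train / per_device_batch)
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_updates = updates_per_epoch * epochs

print(f"train={num_train}, test={num_test}, effective batch={effective_batch}")
print(f"optimizer steps: {total_updates}")
```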

In [8]:
# Train the model for a few epochs using SFT before GRPO, as in certain cases
# they can work together synergistically.
# See: https://arxiv.org/html/2507.08267v1
# No changes needed here

output_dir = "data/model"

training_args = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=10,
    learning_rate=5 * 1e-4,
    logging_steps=20,
    eval_strategy="steps",
    eval_steps=20,
    save_strategy="no",
    report_to=[],
    fp16=False,
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    args=training_args,
)
trainer.train()

proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    spelling = example["spelling"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=spelling,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

  0%|          | 0/70 [00:00<?, ?it/s]



{'loss': 1.0615, 'grad_norm': 0.28181013464927673, 'learning_rate': 0.0004058724504646834, 'epoch': 2.86}


  0%|          | 0/2 [00:00<?, ?it/s]

{'eval_loss': 0.8536208868026733, 'eval_runtime': 0.1366, 'eval_samples_per_second': 51.232, 'eval_steps_per_second': 14.638, 'eval_num_tokens': 6833.0, 'eval_mean_token_accuracy': 0.6834677457809448, 'epoch': 2.86}
{'loss': 0.5235, 'grad_norm': 0.35914021730422974, 'learning_rate': 0.00019436976651092142, 'epoch': 5.71}


  0%|          | 0/2 [00:00<?, ?it/s]

{'eval_loss': 0.7460055351257324, 'eval_runtime': 0.0794, 'eval_samples_per_second': 88.106, 'eval_steps_per_second': 25.173, 'eval_num_tokens': 13603.0, 'eval_mean_token_accuracy': 0.7412634193897247, 'epoch': 5.71}
{'loss': 0.4247, 'grad_norm': 0.27732419967651367, 'learning_rate': 2.4757783024395242e-05, 'epoch': 8.57}


  0%|          | 0/2 [00:00<?, ?it/s]

{'eval_loss': 0.7350455522537231, 'eval_runtime': 0.0796, 'eval_samples_per_second': 87.914, 'eval_steps_per_second': 25.118, 'eval_num_tokens': 20407.0, 'eval_mean_token_accuracy': 0.7412634193897247, 'epoch': 8.57}
{'train_runtime': 16.5062, 'train_samples_per_second': 33.321, 'train_steps_per_second': 4.241, 'train_loss': 0.6325836181640625, 'num_tokens': 23760.0, 'mean_token_accuracy': 0.7896737671324185, 'epoch': 10.0}
Proposed: T-I-R-U-M-P-H. | Actual: T-R-I-U-M-P-H | Matches: ❌
Proposed: S-A-P-I-C-R-H. | Actual: S-A-P-P-H-I-R-E | Matches: ❌
Proposed: E-X-P-S-E-T. | Actual: E-X-P-O-S-E | Matches: ❌
Proposed: F-S-R-E-C-O-S. | Actual: F-R-E-S-C-O-S | Matches: ❌
Proposed: W-I-P-S. | Actual: W-I-S-P | Matches: ❌
Proposed: M-I-R-E-G. | Actual: M-I-R-A-G-E | Matches: ❌
Proposed: I-V-O-R-Y. | Actual: I-V-O-R-Y | Matches: ❌
Proposed: O-N-S-H-O-R-D. | Actual: O-N-S-E-T | Matches: ❌
Proposed: E-L-E-U-D. | Actual: E-L-U-D-E | Matches: ❌
Proposed: S-P-H-I-N-X. | Actual: S-P-H-I-N-X | Matches

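The character-level score has increased: the model now answers in the hyphenated, uppercase format, although many letters are still wrong or out of order. Note that the `Matches` column stays ❌ even for correct spellings such as `I-V-O-R-Y.`, because the model has learned to end with a period while the `spelling` column has none. Let's try training using GRPO now.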
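The "group relative" part of GRPO can be sketched in a few lines: several completions are sampled for the same prompt, each one is scored, and each score is compared with the rest of its group. The function below is a simplified illustration with hypothetical reward values, not TRL's implementation; details such as the exact standard-deviation formula or whether to scale at all vary between implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Score each completion relative to the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt
rewards = [1.0, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])

# If all completions score the same, no completion stands out from its group
print(group_relative_advantages([0.5, 0.5, 0.5, 0.5]))
```

Completions that beat their group's average get a positive advantage and are reinforced; those below average are discouraged. A group with identical rewards contributes no learning signal, which is one reason the reward functions should distinguish between partially correct spellings.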

First let's create some reward functions.

In [9]:
# Student Task: Create a helper function proportion_correct that takes a word and
# a proposed spelling from the LLM and returns a score where every matched character
# adds 1 and every mismatched character subtracts 1 from the reward, including the
# hyphens. The score is normalized by the length of the correct spelling.
# TODO: Replace occurrences of **********

import re


def proportion_correct(word, proposed_spelling):
    correct_spelling = "-".join(word).upper()

    score = 0.0

    # Pad to the same length to handle extra characters
    max_len = max(len(correct_spelling), len(proposed_spelling))
    proposed_spelling_padded = proposed_spelling.ljust(max_len, " ")
    correct_spelling_padded = correct_spelling.ljust(max_len, " ")

    for a, b in zip(correct_spelling_padded, proposed_spelling_padded):
        # Add 1 for matched characters, and subtract one for mismatched
        # **********

        # <<< START SOLUTION SECTION
        if a == b:
            score += 1
        else:
            score -= 1
        # >>> END SOLUTION SECTION

    return score / (
        len(correct_spelling)
    )  # Normalize by length of spelling, including dashes


assert proportion_correct("hello", "H-E-L-L-O") == 9 / 9
assert proportion_correct("hello", "H-E-L-") == 3 / 9
assert proportion_correct("hello", "H-E-L-L-O!") == 8 / 9


In [10]:
# Create a `reward_spelling` function that receives a batch of completions and the associated word values
# No changes needed here

import numpy as np


def reward_spelling(completions, word, **kwargs):
    """Reward function that scores each completion's first line against the correct hyphenated spelling."""

    completion_strings = [
        completion.split("\n")[0].strip() for completion in completions
    ]
    words = [w for w in word]

    rewards = [proportion_correct(w, c) for w, c in zip(words, completion_strings)]

    # When training, GRPO will pass multiple completions and words to this function.
    # We print just the first one to observe what is happening under the hood.
    print("=====")
    print(
        "Completion example first line:",
        words[0],
        "->",
        completion_strings[0].strip().split("\n")[0].strip(),
    )
    print(f"Spelling mean and std: {np.mean(rewards):.3f} +/- {np.std(rewards):.3f}")
    return rewards


assert reward_spelling(
    completions=[
        "H-E-L-L-O",
        "H-E-L-",
        "H-E-L-L-O!",
    ],
    word=[
        "hello",
        "hello",
        "hello",
    ],
) == [1, 3 / 9, 8 / 9]

=====
Completion example first line: hello -> H-E-L-L-O
Spelling mean and std: 0.741 +/- 0.292


In [11]:
# Student task: Create a reward of 1.0 for completions starting with a string
# formatted like X-Y-Z else return 0.0
# TODO: Replace sections marked with **********


def reward_response_in_form_of_letter_dash_letter(completions, word, **kwargs):
    """Reward function that gives a bonus for completions in the form of LETTER-DASH-LETTER."""
    pattern = re.compile(r"^([A-Z]-)+[A-Z]")  # Pattern for LETTER-DASH-LETTER

    words = [w for w in word]

    # Normalize the completions, taking the first line and removing extra whitespace
    completion_strings = [
        completion.split("\n")[0].strip() for completion in completions
    ]

    # Create a list of rewards corresponding to completions
    # Each completion that matches the pattern should receive 1.0,
    # else 0.0
    # rewards = [
    #     **********
    # ]

    # <<< START COMPLETION SECTION
    rewards = [
        1.0 if pattern.match(c) else 0.0 for w, c in zip(words, completion_strings)
    ]
    # >>> END COMPLETION SECTION

    print(
        f"Letter-dash-letter rewards mean and std: {np.mean(rewards):.3f} +/- {np.std(rewards):.3f}"
    )
    return rewards


assert reward_response_in_form_of_letter_dash_letter(
    completions=[
        "H-E-L-L-O",
        "hello",
        "Hi!",
    ],
    word=[
        "hello",
        "hello",
        "hello",
    ],
) == [1, 0, 0]

Letter-dash-letter rewards mean and std: 0.333 +/- 0.471
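One property of the format reward is worth noting: `pattern.match` anchors only at the start of the string, so a completion earns the full 1.0 as soon as it begins with letter-dash-letter, whatever follows. The sketch below uses the same pattern on a few made-up completions to show this.

```python
import re

# Same pattern as in the cell above
pattern = re.compile(r"^([A-Z]-)+[A-Z]")

# Made-up completions for illustration
candidates = [
    "H-E-L-L-O.",              # trailing period is ignored by the prefix match
    "H-E and then some text",  # only the first three characters fit the format
    "h-e-l-l-o",               # lowercase letters do not match [A-Z]
    "HELLO",                   # no dashes at all
]

for text in candidates:
    print(f"{text!r}: {bool(pattern.match(text))}")
```

This leniency is reasonable here because the spelling reward already penalizes extra or wrong characters. If a stricter format check were wanted, `pattern.fullmatch` would be one option, though that is a design alternative and not what this notebook uses.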


In [12]:
# Student task: Set the GRPOConfig and initialize the trainer
# See: https://huggingface.co/docs/trl/main/en/grpo_trainer
# TODO: Complete the sections with **********

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="data/spelling-grpo",
    max_completion_length=30,
    logging_steps=5,
    # learning_rate=**********,
    # num_train_epochs=**********,
    # per_device_train_batch_size=**********,
    # num_generations=**********,
    # lr_scheduler_type=**********,
    # beta=**********,
    # <<< START SOLUTION SECTION
    learning_rate=5e-5,
    num_train_epochs=10,  # We'll train just for a few epochs
    per_device_train_batch_size=8,  # The batch size for training
    num_generations=4,  # Determines the number of completions to compute for each single prompt
    lr_scheduler_type="cosine",
    beta=0.0,
    # >>> END SOLUTION SECTION
)
trainer = GRPOTrainer(
    model=model,
    # Add the parameter for the reward functions
    # **********
    # <<< START SOLUTION SECTION
    reward_funcs=[
        reward_spelling,
        reward_response_in_form_of_letter_dash_letter,
    ],
    # >>> END SOLUTION SECTION
    args=training_args,
    train_dataset=ds["train"],
)
trainer.train()

  0%|          | 0/280 [00:00<?, ?it/s]

=====
Completion example first line: mirth -> M-I-R-H.
Spelling mean and std: 0.107 +/- 0.263
Letter-dash-letter rewards mean and std: 0.875 +/- 0.331
=====
Completion example first line: mantle -> M-A-T-N-I-R.
Spelling mean and std: 0.468 +/- 0.218
Letter-dash-letter rewards mean and std: 1.000 +/- 0.000
=====
Completion example first line: summit -> S-U-M-T-I-A.
Spelling mean and std: 0.273 +/- 0.328
Letter-dash-letter rewards mean and std: 1.000 +/- 0.000
=====
Completion example first line: absolve -> A-B-U-L-E.
Spelling mean and std: 0.218 +/- 0.236
Letter-dash-letter rewards mean and std: 1.000 +/- 0.000
=====
Completion example first line: ivory -> I-V-O-R-Y.
Spelling mean and std: 0.525 +/- 0.270
Letter-dash-letter rewards mean and std: 1.000 +/- 0.000
{'loss': 0.0006, 'grad_norm': 0.33242887258529663, 'learning_rate': 4.996067037544542e-05, 'num_tokens': 1716.0, 'completions/mean_length': 11.9, 'completions/min_length': 9.4, 'completions/max_length': 15.0, 'completions/clipped

TrainOutput(global_step=270, training_loss=0.0006719554836982516, metrics={'train_runtime': 183.29, 'train_samples_per_second': 3.001, 'train_steps_per_second': 1.528, 'total_flos': 0.0, 'train_loss': 0.0006719554836982516})

With GRPO training complete, let's evaluate the model again.

## Step 5. Evaluate the fine-tuned model

In [13]:
# Evaluate the fine-tuned model on the same training examples
# No changes needed in this cell

proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

Proposed: T-R-I-U-M-P-H. | Actual: T-R-I-U-M-P-H. | Matches: ✅
Proposed: S-A-P-I-C-R-H. | Actual: S-A-P-P-H-I-R-E. | Matches: ❌
Proposed: E-X-P-S-E. | Actual: E-X-P-O-S-E. | Matches: ❌
Proposed: F-R-S-E-C-O-S. | Actual: F-R-E-S-C-O-S. | Matches: ❌
Proposed: W-I-P-S. | Actual: W-I-S-P. | Matches: ❌
Proposed: M-I-R-A-G. | Actual: M-I-R-A-G-E. | Matches: ❌
Proposed: I-V-O-R-Y. | Actual: I-V-O-R-Y. | Matches: ✅
Proposed: O-N-S-H-D. | Actual: O-N-S-E-T. | Matches: ❌
Proposed: E-L-U-D-E. | Actual: E-L-U-D-E. | Matches: ✅
Proposed: S-P-H-I-N-X. | Actual: S-P-H-I-N-X. | Matches: ✅
Proposed: B-R-A-N-Y. | Actual: B-R-A-W-N. | Matches: ❌
Proposed: G-O-S-H-O-P-I. | Actual: G-O-S-S-I-P-Y. | Matches: ❌
Proposed: E-N-C-H-A-N. | Actual: E-N-C-H-A-N-T. | Matches: ❌
Proposed: T-A-A-N-R. | Actual: T-A-V-E-R-N. | Matches: ❌
Proposed: W-H-I-C-E. | Actual: W-H-I-S-T-L-E. | Matches: ❌
Proposed: C-U-P-H-E-R. | Actual: C-A-P-T-U-R-E. | Matches: ❌
Proposed: E-C-H-R-E. | Actual: E-C-H-O. | Matches: ❌
Proposed: M

The model now performs better on the training data it has seen. But has it generalized? Let's check its performance on the unseen test set.

In [14]:
# Evaluate the fine-tuned model on the unseen test set
# No changes needed in this cell

proportion_correct = 0.0
num_examples = len(ds["test"])

for example in ds["test"]:
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/{num_examples}.0 words correct")

Proposed: W-R-Y-I-L-Y. | Actual: W-R-Y-L-Y. | Matches: ❌
Proposed: G-L-I-N-E. | Actual: G-L-I-S-T-E-N. | Matches: ❌
Proposed: C-A-S-E. | Actual: Q-U-E-S-T. | Matches: ❌
Proposed: C-E-R-A-V. | Actual: C-R-A-V-E. | Matches: ❌
Proposed: L-U-S-I-R-O. | Actual: L-U-S-H. | Matches: ❌
Proposed: F-A-L-I-C-E. | Actual: F-A-B-L-E. | Matches: ❌
Proposed: K-N-A-R-C-E. | Actual: K-N-A-C-K. | Matches: ❌
2.6416666666666666/7.0 words correct


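None of the seven unseen words is spelled exactly right, but the model now answers in the requested format and usually gets the first few letters correct, a clear change from the base model's behavior in Step 3. Perhaps with a larger dataset and more training, it could get even better.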
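Keep in mind that the printed total is a sum of per-word character scores, not a count of exactly correct words. The sketch below recomputes both figures from the seven test outputs shown above; `char_score` is a stand-in for the scoring step inside `check_spelling`, introduced here for illustration.

```python
# (proposed, actual) pairs copied from the test-set output above
pairs = [
    ("W-R-Y-I-L-Y.", "W-R-Y-L-Y."),
    ("G-L-I-N-E.", "G-L-I-S-T-E-N."),
    ("C-A-S-E.", "Q-U-E-S-T."),
    ("C-E-R-A-V.", "C-R-A-V-E."),
    ("L-U-S-I-R-O.", "L-U-S-H."),
    ("F-A-L-I-C-E.", "F-A-B-L-E."),
    ("K-N-A-R-C-E.", "K-N-A-C-K."),
]


def char_score(proposed: str, actual: str) -> float:
    """Position-by-position match rate after removing hyphens."""
    proposed = proposed.replace("-", "")
    actual = actual.replace("-", "")
    matches = sum(1 for a, b in zip(actual, proposed) if a == b)
    return matches / len(actual)


exact_matches = sum(p == a for p, a in pairs)
total_score = sum(char_score(p, a) for p, a in pairs)

print(f"Exact matches: {exact_matches}/{len(pairs)}")
print(f"Summed character score: {total_score:.4f}")
```

Reporting exact-match accuracy alongside the character score gives a more honest picture: the model has learned the format and many leading letters, but not yet full spellings of unseen words.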

## Congratulations on completing the exercise! 🎉

✅ You did it! You fine-tuned a small language model using PEFT with LoRA, first with SFT and then with GRPO, to teach it a new skill: spelling! You saw how the base model failed completely at the task, and with a very small amount of data and a short training run, the model started to learn how to spell.