# Exercise: Teach an LLM to Spell with Supervised Fine-Tuning (SFT)

Large language models (LLMs) are notoriously bad at spelling. This is partly because tokenizers break words into smaller pieces, so the model sees sub-word tokens rather than the individual letters that make up each word.
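
To make the tokenization point concrete, here is a toy greedy longest-match tokenizer. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are made up for illustration; a real tokenizer such as the one used later in this notebook has a much larger vocabulary and will split words differently.

```python
# Toy illustration only: TOY_VOCAB is a made-up vocabulary, not a real tokenizer's.
TOY_VOCAB = {"sapp", "hire", "wh", "istle", "s", "a", "p", "h", "i", "r", "e", "w", "t", "l"}


def toy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink until a piece matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Unknown character: keep it as its own piece.
            pieces.append(word[start])
            start += 1
    return pieces


print(toy_tokenize("sapphire", TOY_VOCAB))  # ['sapp', 'hire']
print(toy_tokenize("whistle", TOY_VOCAB))   # ['wh', 'istle']
```

The model receives two opaque pieces for "sapphire" and never sees the eight letters directly, which is why spelling has to be learned as a separate skill.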

In this exercise, you'll use supervised fine-tuning (SFT) and a technique called Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to teach a small LLM how to spell words. This is a classic example of teaching a model a new skill that isn't well-represented in its pre-training data.

## What you'll do in this notebook

1.  **Setup**: Import libraries and configure the environment.
2.  **Load the tokenizer and base model**: Use a small, instruction-tuned model as our starting point.
3.  **Create the dataset**: Generate a simple dataset of words and their correct spellings.
4.  **Evaluate the base model**: Test the model's spelling ability *before* fine-tuning to establish a baseline.
5.  **Configure LoRA and train**: Attach a LoRA adapter to the model and fine-tune it on the spelling dataset.
6.  **Evaluate the fine-tuned model**: Test the model again to see if its spelling has improved.

## Setup

In [1]:
# Setup imports
# No changes needed in this cell

import os
import torch
from datasets import Dataset

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# Use GPU, MPS, or CPU, in that order of preference
if torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon
else:
    device = torch.device("cpu")
torch.set_num_threads(max(1, os.cpu_count() // 2))
print("Using device:", device)

Using device: mps


## Step 1. Load the tokenizer and base model

The model `HuggingFaceTB/SmolLM2-135M-Instruct` is a small, instruction-tuned model that's suitable for this exercise. It has 135 million parameters, making it lightweight and efficient for fine-tuning. It's not the most powerful model, but it's a good choice for demonstrating the concepts of SFT and PEFT with LoRA, especially on a CPU or with limited GPU resources.

In [2]:
# Student task: Load the model and tokenizer, and copy the model to the device.
# TODO: Complete the sections with **********

# See: https://huggingface.co/docs/transformers/en/models
# See: https://huggingface.co/docs/transformers/en/fast_tokenizers

# Model ID for SmolLM2-135M-Instruct
model_id = "***********"

# Load the tokenizer
tokenizer = "***********"

# Load the model
model = "***********"

# Copy the model to the device (GPU, MPS, or CPU)
model = "***********"


# <<< START SOLUTION SECTION
# Model ID for SmolLM2-135M-Instruct
model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
)

# Copy the model to the device (GPU, MPS, or CPU)
model = model.to(device)
# >>> END SOLUTION SECTION

print("Model parameters (total):", sum(p.numel() for p in model.parameters()))

Model parameters (total): 134515008
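
As a rough sense of scale, the parameter count above translates into a memory footprint for the weights alone. This is a back-of-the-envelope sketch that assumes 4 bytes per parameter (fp32); it ignores activations, gradients, and optimizer state.

```python
# Rough memory estimate for the weights alone, assuming 4 bytes per parameter (fp32).
num_params = 134_515_008  # value printed by the cell above
bytes_per_param = 4       # assumption: fp32 weights

weight_bytes = num_params * bytes_per_param
print(f"Weights: {weight_bytes / 1e6:.0f} MB")  # Weights: 538 MB
```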


## Step 2. Create the dataset

In [3]:
# Create a list of words of different lengths
# No changes are needed in this cell.

# fmt: off
ALL_WORDS = [
    "idea", "glow", "rust", "maze", "echo", "wisp", "veto", "lush", "gaze", "knit", "fume", "plow",
    "void", "oath", "grim", "crisp", "lunar", "fable", "quest", "verge", "brawn", "elude", "aisle",
    "ember", "crave", "ivory", "mirth", "knack", "wryly", "onset", "mosaic", "velvet", "sphinx",
    "radius", "summit", "banner", "cipher", "glisten", "mantle", "scarab", "expose", "fathom",
    "tavern", "fusion", "relish", "lantern", "enchant", "torrent", "capture", "orchard", "eclipse",
    "frescos", "triumph", "absolve", "gossipy", "prelude", "whistle", "resolve", "zealous",
    "mirage", "aperture", "sapphire",
]
# fmt: on

In [4]:
# Student Task: Create a Hugging Face Dataset with the prompt that asks the model to spell the word
# with hyphens between the letters.
# TODO: Complete the sections with **********


def generate_records():
    for word in ALL_WORDS:
        yield {
            # We will use the SFTTrainer, which expects prompt-completion pairs in a specific format
            # so that it can tokenize them correctly for training.
            # See the documentation for more details:
            # https://huggingface.co/docs/trl/en/sft_trainer#expected-dataset-type-and-format
            # "**********": f"**********",
            # <<< START SOLUTION SECTION
            "prompt": (
                f"You spell words with hyphens between the letters like this W-O-R-D.\nWord:\n{word}\n\n"
                + "Spelling:\n"
            ),
            # >>> END SOLUTION SECTION
            "completion": "-".join(word).upper() + ".",  # Of the form W-O-R-D.
        }


ds = Dataset.from_generator(generate_records)

# Show the first item
ds[0]

{'prompt': 'You spell words with hyphens between the letters like this W-O-R-D.\nWord:\nidea\n\nSpelling:\n',
 'completion': 'I-D-E-A.'}
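
One reason the prompt and completion are kept in separate fields is that the trainer can then compute the loss on the completion tokens only. The sketch below illustrates that idea under stated assumptions: the token IDs are made up, `build_labels` is a hypothetical helper, and whether the prompt is masked by default depends on your TRL version, so check the documentation for the release you have installed.

```python
# Sketch of completion-only label masking. The token IDs are made up for illustration.
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_labels(prompt_ids: list[int], completion_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and completion; mask the prompt positions in the labels."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels


prompt_ids = [11, 12, 13, 14]  # hypothetical IDs for the prompt text
completion_ids = [21, 22, 23]  # hypothetical IDs for "I-D-E-A."
input_ids, labels = build_labels(prompt_ids, completion_ids)
print(input_ids)  # [11, 12, 13, 14, 21, 22, 23]
print(labels)     # [-100, -100, -100, -100, 21, 22, 23]
```

With labels built this way, the model is only penalized for getting the spelling wrong, not for failing to reproduce the instruction text.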

In [5]:
# Student Task: Split the dataset into training and testing sets
# See: train_test_split
# TODO: Complete the sections with **********

# ds = **********  # Set the test set to be 25% of the dataset, and the rest is training

# <<< START SOLUTION SECTION
ds = ds.train_test_split(test_size=0.25, seed=42)
# >>> END SOLUTION SECTION


In [6]:
# View the training set
# No changes needed in this cell

ds["train"]

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 46
})

## Step 3. Evaluate the base model

Before we fine-tune the model, let's see how it performs on the spelling task. We'll create a helper function to generate a spelling for a given word and compare it to the correct answer.

In [7]:
# Student task: Create a function to check the model's spelling.
# This function will take a model, tokenizer, prompt, and the correct spelling.
# It should generate text from the model and compare the model's proposed spelling
# to the actual spelling, returning the proportion of characters that were correct.
# TODO: Complete the sections with **********


def check_spelling(
    model, tokenizer, prompt: str, actual_spelling: str, max_new_tokens: int = 20
) -> float:
    # Tokenize the prompt
    # inputs = **********

    # Generate text from the model
    # gen = **********

    # Decode the generated tokens to a string
    # output = **********

    # Extract the generated spelling from the full output string
    # proposed_spelling = "**********"

    # Strip any whitespace from the actual spelling
    # actual_spelling = "**********"

    # Remove hyphens for a character-by-character comparison
    # proposed_spelling = "**********"
    # actual_spelling = "**********"

    # Calculate the number of correct characters
    # num_correct = "**********"

    # <<< START SOLUTION SECTION
    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate text from the model
    gen = model.generate(
        **inputs, max_new_tokens=max_new_tokens
    )  # No parameters = greedy search

    # Decode the generated tokens to a string
    output = tokenizer.decode(gen[0], skip_special_tokens=True)

    # Extract the generated spelling from the full output string
    proposed_spelling = output.split("Spelling:")[-1].strip().split("\n")[0].strip()

    # Strip any whitespace from the actual spelling
    actual_spelling = actual_spelling.strip()

    # Remove hyphens for a character-by-character comparison
    proposed_spelling = proposed_spelling.replace("-", "")
    actual_spelling = actual_spelling.replace("-", "")

    # Calculate the number of correct characters
    num_correct = sum(1 for a, b in zip(actual_spelling, proposed_spelling) if a == b)
    # >>> END SOLUTION SECTION

    print(
        f"Proposed: {proposed_spelling} | Actual: {actual_spelling} "
        f"| Matches: {'✅' if proposed_spelling == actual_spelling else '❌'}"
    )

    return num_correct / len(actual_spelling)  # Return proportion correct


check_spelling(
    model=model,
    tokenizer=tokenizer,
    prompt=ds["test"][0]["prompt"],
    actual_spelling=ds["test"][0]["completion"],
)

Proposed: wry | Actual: WRYLY. | Matches: ❌


0.0
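
The parsing and scoring steps inside `check_spelling` can be tried without a model. The following stand-alone sketch mirrors that logic; the helper names `extract_spelling` and `char_accuracy` and the decoded strings are hypothetical, chosen only to show why a lowercase, truncated answer scores zero.

```python
# Stand-alone sketch of the parsing and scoring used by check_spelling.
# The decoded strings below are hypothetical examples, not real model output.


def extract_spelling(decoded: str) -> str:
    """Return the first line that follows the last 'Spelling:' marker."""
    return decoded.split("Spelling:")[-1].strip().split("\n")[0].strip()


def char_accuracy(proposed: str, actual: str) -> float:
    """Fraction of positions in `actual` matched by `proposed`, hyphens removed."""
    proposed = proposed.replace("-", "")
    actual = actual.strip().replace("-", "")
    matches = sum(1 for a, b in zip(actual, proposed) if a == b)
    return matches / len(actual)


base_style = "Word:\nwryly\n\nSpelling:\nwry\nThe word is"
tuned_style = "Word:\nwryly\n\nSpelling:\nW-R-Y-L-Y.\n"

print(char_accuracy(extract_spelling(base_style), "W-R-Y-L-Y."))   # 0.0
print(char_accuracy(extract_spelling(tuned_style), "W-R-Y-L-Y."))  # 1.0
```

The comparison is case-sensitive and position-by-position, and the trailing period counts as a character. A single inserted or dropped letter therefore shifts every later position and costs more than one point.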

In [8]:
# Student task: Evaluate the base model's spelling ability
# We expect it to perform poorly, as it hasn't been trained for this task.

proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

Proposed: sphinx | Actual: SPHINX. | Matches: ❌
Proposed: brawn | Actual: BRAWN. | Matches: ❌
Proposed: goss | Actual: GOSSIPY. | Matches: ❌
Proposed: enchant | Actual: ENCHANT. | Matches: ❌
Proposed: tavern | Actual: TAVERN. | Matches: ❌
Proposed: whistle | Actual: WHISTLE. | Matches: ❌
Proposed: WORD | Actual: CAPTURE. | Matches: ❌
Proposed: echo | Actual: ECHO. | Matches: ❌
Proposed: mirth | Actual: MIRTH. | Matches: ❌
Proposed: cris | Actual: CRISP. | Matches: ❌
Proposed: zeal | Actual: ZEALOUS. | Matches: ❌
Proposed:  | Actual: EMBER. | Matches: ❌
Proposed: scarab | Actual: SCARAB. | Matches: ❌
Proposed:  | Actual: KNIT. | Matches: ❌
Proposed: resolve | Actual: RESOLVE. | Matches: ❌
Proposed: velvet | Actual: VELVET. | Matches: ❌
Proposed:  | Actual: ABSOLVE. | Matches: ❌
Proposed: lunar | Actual: LUNAR. | Matches: ❌
Proposed: maze | Actual: MAZE. | Matches: ❌
Proposed:  | Actual: SUMMIT. | Matches: ❌
0.0/20.0 words correct


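As expected, the base model scores zero on this task. It mostly repeats the word back in lowercase without hyphens, or returns nothing at all, and because the comparison is case-sensitive and position-by-position, none of its characters match. Now, let's fine-tune it.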

## Step 4. Configure LoRA and train the model

Let’s attach a LoRA adapter to the base model. With a LoRA config, the base weights are frozen and only the small adapter matrices are trained, so only a small fraction of the parameters is trainable. Read more here: [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora).
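
As a minimal sketch of the idea, the layer output becomes `W x + (alpha / r) * B (A x)`, where `W` is frozen and only the low-rank matrices `A` and `B` are trained. The code below uses plain Python lists with toy sizes and made-up numbers; `matvec` and `lora_forward` are illustrative helpers, not part of the PEFT API. It assumes `B` starts at zero, as in the original LoRA formulation.

```python
# Minimal LoRA forward pass with plain Python lists (toy sizes, made-up numbers).
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float, r: int) -> list[float]:
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # frozen weight, shape (2, 3)
A = [[0.5, -0.5, 1.0]]                  # trainable, shape (r, 3) with r = 1
B_zero = [[0.0], [0.0]]                 # trainable, shape (2, r), starts at zero
x = [1.0, 1.0, 1.0]

# With B at zero the adapter is a no-op, so the output equals W x.
print(lora_forward(W, A, B_zero, x, alpha=2.0, r=1))  # [3.0, 4.0]

# After training moves B away from zero, the output shifts by the low-rank update.
B_trained = [[1.0], [-1.0]]
print(lora_forward(W, A, B_trained, x, alpha=2.0, r=1))  # [5.0, 2.0]
```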

In [9]:
# Student task: Configure LoRA for a causal LM and wrap the model with get_peft_model
# Complete the sections with **********

# Print how many params are trainable at first
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params BEFORE: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

# See: https://huggingface.co/docs/peft/package_reference/lora
# lora_config = LoraConfig(
#     r=**********,                 # Rank of the update matrices. Lower value = fewer trainable parameters.
#     lora_alpha=**********,        # LoRA scaling factor.
#     lora_dropout=**********,      # Dropout probability for LoRA layers.
#     bias="none",
#     task_type=**********,         # Causal Language Modeling.
# )
# # Wrap the base model with get_peft_model
# model = get_peft_model(**********, **********)

# <<< START SOLUTION SECTION
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# >>> END SOLUTION SECTION

# Print the number of trainable parameters after applying LoRA
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params AFTER: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

Trainable params BEFORE: 134,515,008 / 134,515,008 (100.00%)
Trainable params AFTER: 3,686,400 / 138,201,408 (2.67%)
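
The trainable count above can be reproduced by hand. The sketch below assumes details that this notebook does not state: a hidden size of 576, a key/value projection width of 192, 30 decoder layers, and adapters attached only to the `q_proj` and `v_proj` layers. Treat these as assumptions to verify against the model config and the PEFT defaults for your version.

```python
# Back-of-the-envelope LoRA parameter count.
# Assumptions (not stated in this notebook): hidden size 576, key/value width 192,
# 30 decoder layers, and adapters on q_proj and v_proj only.
r = 64
hidden_size = 576
kv_size = 192
num_layers = 30


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A has shape (rank, in_features) and B has shape (out_features, rank)."""
    return rank * in_features + out_features * rank


per_layer = lora_params(hidden_size, hidden_size, r) + lora_params(hidden_size, kv_size, r)
total = per_layer * num_layers
print(f"{total:,}")  # 3,686,400
```

Under these assumptions the result matches the printed trainable count, and adding it to the 134,515,008 base parameters gives the 138,201,408 total reported after wrapping the model.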


Now let’s set the training arguments. We'll use `SFTConfig` from the TRL library, which extends the standard `TrainingArguments`. The dataset is tiny, so even 20 epochs with a small batch size finish quickly.

In [10]:
# Student task: Fill in the SFTConfig for a quick training run
# Complete the sections with **********

output_dir = "data/model"

# See: https://huggingface.co/docs/trl/en/sft_trainer#trl.SFTConfig
# training_args = SFTConfig(
#     output_dir=output_dir,
#     per_device_train_batch_size=**********,
#     per_device_eval_batch_size=**********,
#     gradient_accumulation_steps=**********,
#     num_train_epochs=**********,
#     learning_rate=**********,
#     logging_steps=**********,
#     eval_strategy="steps",
#     eval_steps=**********,
#     save_strategy="no",
#     report_to=[],                            # disable wandb/tensorboard
#     fp16=False,                              # stay in fp32 for CPU/MPS
#     lr_scheduler_type="cosine",
# )

# <<< START SOLUTION SECTION
training_args = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=20,
    learning_rate=5e-4,
    logging_steps=20,
    eval_strategy="steps",
    eval_steps=20,
    save_strategy="no",
    report_to=[],
    fp16=False,
    lr_scheduler_type="cosine",
)
# >>> END SOLUTION SECTION
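
It can help to work out how many optimizer steps these settings produce before starting the run. The arithmetic below is a sketch that assumes a single device and that partial batches are rounded up rather than dropped.

```python
import math

train_rows = 46   # from ds["train"] above
batch_size = 4    # per_device_train_batch_size
grad_accum = 2    # gradient_accumulation_steps
epochs = 20       # num_train_epochs

batches_per_epoch = math.ceil(train_rows / batch_size)         # 12
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)  # 6
total_steps = updates_per_epoch * epochs                       # 120

print(total_steps)                       # 120
print(total_steps // 20)                 # 6 logging/eval events with logging_steps=20
print(round(20 / updates_per_epoch, 2))  # 3.33 epochs between events
```

The effective batch size is `batch_size * grad_accum = 8` examples per optimizer step.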

Now we define the `SFTTrainer` and run the fine-tuning process.

In [11]:
# Student Task: Create and run the SFTTrainer
# TODO: Complete the sections with **********


# See: https://huggingface.co/docs/trl/en/sft_trainer
# trainer = SFTTrainer(
#     model=**********,
#     train_dataset=**********,
#     eval_dataset=**********,
#     args=**********,
# )
# Now train it:
# trainer.**********

# <<< START SOLUTION SECTION
trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    args=training_args,
)
trainer.train()
# >>> END SOLUTION SECTION

  0%|          | 0/120 [00:00<?, ?it/s]

{'loss': 1.0707, 'grad_norm': 0.2614535689353943, 'learning_rate': 0.00046650635094610973, 'epoch': 3.33}


  0%|          | 0/4 [00:00<?, ?it/s]

{'eval_loss': 0.7199432849884033, 'eval_runtime': 0.2084, 'eval_samples_per_second': 76.768, 'eval_steps_per_second': 19.192, 'eval_num_tokens': 6627.0, 'eval_mean_token_accuracy': 0.7373691648244858, 'epoch': 3.33}
{'loss': 0.5026, 'grad_norm': 0.2868766188621521, 'learning_rate': 0.000375, 'epoch': 6.67}


  0%|          | 0/4 [00:00<?, ?it/s]

{'eval_loss': 0.5938073396682739, 'eval_runtime': 0.1677, 'eval_samples_per_second': 95.415, 'eval_steps_per_second': 23.854, 'eval_num_tokens': 13256.0, 'eval_mean_token_accuracy': 0.7946356534957886, 'epoch': 6.67}
{'loss': 0.3479, 'grad_norm': 0.3226974904537201, 'learning_rate': 0.00025, 'epoch': 10.0}


  0%|          | 0/4 [00:00<?, ?it/s]

{'eval_loss': 0.5652674436569214, 'eval_runtime': 0.1689, 'eval_samples_per_second': 94.753, 'eval_steps_per_second': 23.688, 'eval_num_tokens': 19800.0, 'eval_mean_token_accuracy': 0.8457628786563873, 'epoch': 10.0}
{'loss': 0.2361, 'grad_norm': 0.30054569244384766, 'learning_rate': 0.00012500000000000006, 'epoch': 13.33}


  0%|          | 0/4 [00:00<?, ?it/s]

{'eval_loss': 0.5799983739852905, 'eval_runtime': 0.1684, 'eval_samples_per_second': 94.991, 'eval_steps_per_second': 23.748, 'eval_num_tokens': 26433.0, 'eval_mean_token_accuracy': 0.8403280973434448, 'epoch': 13.33}
{'loss': 0.1911, 'grad_norm': 0.3515804409980774, 'learning_rate': 3.3493649053890325e-05, 'epoch': 16.67}


  0%|          | 0/4 [00:00<?, ?it/s]

{'eval_loss': 0.597210168838501, 'eval_runtime': 0.1981, 'eval_samples_per_second': 80.775, 'eval_steps_per_second': 20.194, 'eval_num_tokens': 33040.0, 'eval_mean_token_accuracy': 0.8403280973434448, 'epoch': 16.67}
{'loss': 0.1647, 'grad_norm': 0.40748950839042664, 'learning_rate': 0.0, 'epoch': 20.0}


  0%|          | 0/4 [00:00<?, ?it/s]

{'eval_loss': 0.5992292761802673, 'eval_runtime': 0.1716, 'eval_samples_per_second': 93.225, 'eval_steps_per_second': 23.306, 'eval_num_tokens': 39600.0, 'eval_mean_token_accuracy': 0.8403280973434448, 'epoch': 20.0}
{'train_runtime': 28.6716, 'train_samples_per_second': 32.087, 'train_steps_per_second': 4.185, 'train_loss': 0.41884424289067584, 'num_tokens': 39600.0, 'mean_token_accuracy': 0.8682633362710476, 'epoch': 20.0}


TrainOutput(global_step=120, training_loss=0.41884424289067584, metrics={'train_runtime': 28.6716, 'train_samples_per_second': 32.087, 'train_steps_per_second': 4.185, 'total_flos': 27776639121408.0, 'train_loss': 0.41884424289067584})
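
The `learning_rate` values in the log follow the cosine schedule requested with `lr_scheduler_type="cosine"`. Here is a minimal sketch of that curve, assuming no warmup and decay to zero over the 120 total steps; `cosine_lr` is an illustrative helper, not the library's implementation.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr at step 0 to zero at total_steps (no warmup)."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at each logging step of the run above.
for step in (20, 40, 60, 80, 100, 120):
    print(step, f"{cosine_lr(step, 120, 5e-4):.6e}")
```

At steps 20, 40, and 60 this gives roughly 4.67e-4, 3.75e-4, and 2.50e-4, in line with the logged values.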

## Step 5. Evaluate the fine-tuned model

In [12]:
# Evaluate the fine-tuned model on the same training examples
# No changes needed in this cell


proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

Proposed: SPHINX. | Actual: SPHINX. | Matches: ✅
Proposed: BRAWN. | Actual: BRAWN. | Matches: ✅
Proposed: GOSSIPY. | Actual: GOSSIPY. | Matches: ✅
Proposed: ENCHANT. | Actual: ENCHANT. | Matches: ✅
Proposed: TAVENR. | Actual: TAVERN. | Matches: ❌
Proposed: WHISTE. | Actual: WHISTLE. | Matches: ❌
Proposed: CUPARE. | Actual: CAPTURE. | Matches: ❌
Proposed: ECHORD. | Actual: ECHO. | Matches: ❌
Proposed: MIRTH. | Actual: MIRTH. | Matches: ✅
Proposed: CRISP. | Actual: CRISP. | Matches: ✅
Proposed: ZEALOUS. | Actual: ZEALOUS. | Matches: ✅
Proposed: EMBEU. | Actual: EMBER. | Matches: ❌
Proposed: SCARAB. | Actual: SCARAB. | Matches: ✅
Proposed: KINT. | Actual: KNIT. | Matches: ❌
Proposed: RESILOS. | Actual: RESOLVE. | Matches: ❌
Proposed: VELVET. | Actual: VELVET. | Matches: ✅
Proposed: ABORE. | Actual: ABSOLVE. | Matches: ❌
Proposed: LUNAR. | Actual: LUNAR. | Matches: ✅
Proposed: MAZE. | Actual: MAZE. | Matches: ✅
Proposed: SUMMTI. | Actual: SUMMIT. | Matches: ❌
16.41190476190476/20.0 words correct

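The model now does much better on training examples it has already seen: 11 of the 20 words match exactly. Note that the printed total is the sum of per-word character accuracies, so misspelled words still earn partial credit. But has it generalized? Let's check its performance on the unseen test set.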

In [13]:
# Evaluate the fine-tuned model on the unseen test set
# No changes needed in this cell


proportion_correct = 0.0
num_examples = len(ds["test"])

for example in ds["test"]:
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/{num_examples}.0 words correct")

Proposed: WRIYLY. | Actual: WRYLY. | Matches: ❌
Proposed: GLINES. | Actual: GLISTEN. | Matches: ❌
Proposed: CASQE. | Actual: QUEST. | Matches: ❌
Proposed: CERAVE. | Actual: CRAVE. | Matches: ❌
Proposed: LUSIO. | Actual: LUSH. | Matches: ❌
Proposed: FALICE. | Actual: FABLE. | Matches: ❌
Proposed: KNARKE. | Actual: KNACK. | Matches: ❌
Proposed: TIRUMPH. | Actual: TRIUMPH. | Matches: ❌
Proposed: SAPICHR. | Actual: SAPPHIRE. | Matches: ❌
Proposed: EXPSENT. | Actual: EXPOSE. | Matches: ❌
Proposed: FSRECOS. | Actual: FRESCOS. | Matches: ❌
Proposed: WIPS. | Actual: WISP. | Matches: ❌
Proposed: MIRGE. | Actual: MIRAGE. | Matches: ❌
Proposed: IVORY. | Actual: IVORY. | Matches: ✅
Proposed: ONSHORD. | Actual: ONSET. | Matches: ❌
Proposed: ELUDE. | Actual: ELUDE. | Matches: ✅
8.418253968253968/16.0 words correct


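It has improved over the base model, but generalization is limited: only 2 of the 16 unseen words are spelled exactly right, compared with 11 of 20 on the training examples. The evaluation loss also stopped improving around epoch 10 while the training loss kept falling, which suggests the model is overfitting to the 46 training words. Perhaps with a larger dataset, it could get even better.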
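
Because the printed total gives partial credit for each word, it is worth separating it from the stricter exact-match count. The sketch below recomputes both from the (proposed, actual) pairs printed in the test-set output above; `positional_accuracy` is an illustrative helper that mirrors the scoring in `check_spelling`.

```python
# (proposed, actual) pairs copied from the test-set output above, hyphens already removed.
TEST_PAIRS = [
    ("WRIYLY.", "WRYLY."), ("GLINES.", "GLISTEN."), ("CASQE.", "QUEST."),
    ("CERAVE.", "CRAVE."), ("LUSIO.", "LUSH."), ("FALICE.", "FABLE."),
    ("KNARKE.", "KNACK."), ("TIRUMPH.", "TRIUMPH."), ("SAPICHR.", "SAPPHIRE."),
    ("EXPSENT.", "EXPOSE."), ("FSRECOS.", "FRESCOS."), ("WIPS.", "WISP."),
    ("MIRGE.", "MIRAGE."), ("IVORY.", "IVORY."), ("ONSHORD.", "ONSET."),
    ("ELUDE.", "ELUDE."),
]


def positional_accuracy(proposed: str, actual: str) -> float:
    """Fraction of positions in `actual` where `proposed` has the same character."""
    return sum(a == b for a, b in zip(actual, proposed)) / len(actual)


partial_credit = sum(positional_accuracy(p, a) for p, a in TEST_PAIRS)
exact_matches = sum(p == a for p, a in TEST_PAIRS)

print(f"Summed character accuracy: {partial_credit:.2f} / {len(TEST_PAIRS)}")  # 8.42 / 16
print(f"Exact matches: {exact_matches} / {len(TEST_PAIRS)}")                   # 2 / 16
```

Reporting both numbers makes it clear that the model often gets the first few letters right but rarely finishes the whole word correctly.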

## Congratulations on completing the exercise! 🎉

✅ You did it! You successfully fine-tuned a small language model using PEFT with LoRA to teach it a new skill: spelling! You saw how the base model failed completely at the task, and with a very small amount of data and a short training run, the model managed to get better at spelling.
