# Exercise: Teach an LLM to Spell with Supervised Fine-Tuning (SFT)

Large language models (LLMs) are notoriously bad at spelling. This is partly because tokenizers break words into smaller pieces, so the model learns about sub-word units rather than whole words and their spellings.
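To make the sub-word idea concrete, here is a toy sketch. The vocabulary is invented for illustration and the greedy longest-match rule is a simplification; real tokenizers (including the one used later in this notebook) rely on learned merge rules and will split words differently.

```python
# Toy illustration only: this vocabulary is made up and does not belong
# to any real tokenizer.
TOY_VOCAB = {"sap", "ph", "ire", "s", "a", "p", "h", "i", "r", "e"}


def greedy_subword_split(word: str, vocab: set[str]) -> list[str]:
    """Split a word by repeatedly taking the longest matching vocabulary piece."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"No vocabulary piece matches {word[start:]!r}")
    return pieces


pieces = greedy_subword_split("sapphire", TOY_VOCAB)
print(pieces)  # ['sap', 'ph', 'ire']
```

The model would see three opaque pieces rather than eight letters, which is why letter-by-letter spelling is not something it gets for free.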

In this exercise, you'll use supervised fine-tuning (SFT) and a technique called Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to teach a small LLM how to spell words. This is a classic example of teaching a model a new skill that isn't well-represented in its pre-training data.

## What you'll do in this notebook

1.  **Setup**: Import libraries and configure the environment.
2.  **Load the tokenizer and base model**: Use a small, instruction-tuned model as our starting point.
3.  **Create the dataset**: Generate a simple dataset of words and their correct spellings.
4.  **Evaluate the base model**: Test the model's spelling ability *before* fine-tuning to establish a baseline.
5.  **Configure LoRA and train**: Attach a LoRA adapter to the model and fine-tune it on the spelling dataset.
6.  **Evaluate the fine-tuned model**: Test the model again to see if its spelling has improved.

## Setup

In [1]:
# Setup imports
# No changes needed in this cell

import os
import torch
from datasets import Dataset

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# Use GPU, MPS, or CPU, in that order of preference
if torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon
else:
    device = torch.device("cpu")
torch.set_num_threads(max(1, os.cpu_count() // 2))
print("Using device:", device)

Using device: mps


## Step 1. Load the tokenizer and base model

The model `HuggingFaceTB/SmolLM2-135M-Instruct` is a small, instruction-tuned model that's suitable for this exercise. It has 135 million parameters, making it lightweight and efficient for fine-tuning. It's not the most powerful model, but it's a good choice for demonstrating the concepts of SFT and PEFT with LoRA, especially on a CPU or with limited GPU resources.

In [2]:
# Student task: Load the model and tokenizer, and copy the model to the device.
# TODO: Complete the sections with **********

# See: https://huggingface.co/docs/transformers/en/models
# See: https://huggingface.co/docs/transformers/en/fast_tokenizers

# Model ID for SmolLM2-135M-Instruct
model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_id)

# Copy the model to the device (GPU, MPS, or CPU)
model = model.to(device)



print("Model parameters (total):", sum(p.numel() for p in model.parameters()))

Model parameters (total): 134515008


## Step 2. Create the dataset

In [3]:
# Create a list of words of different lengths
# No changes are needed in this cell.

# fmt: off
ALL_WORDS = [
    "idea", "glow", "rust", "maze", "echo", "wisp", "veto", "lush", "gaze", "knit", "fume", "plow",
    "void", "oath", "grim", "crisp", "lunar", "fable", "quest", "verge", "brawn", "elude", "aisle",
    "ember", "crave", "ivory", "mirth", "knack", "wryly", "onset", "mosaic", "velvet", "sphinx",
    "radius", "summit", "banner", "cipher", "glisten", "mantle", "scarab", "expose", "fathom",
    "tavern", "fusion", "relish", "lantern", "enchant", "torrent", "capture", "orchard", "eclipse",
    "frescos", "triumph", "absolve", "gossipy", "prelude", "whistle", "resolve", "zealous",
    "mirage", "aperture", "sapphire",
]
# fmt: on

In [4]:
# Student Task: Create a Hugging Face Dataset with the prompt that asks the model to spell the word
# with hyphens between the letters.
# TODO: Complete the sections with **********


def generate_records():
    for word in ALL_WORDS:
        yield {
            # We will use the SFTTrainer, which expects prompt/completion pairs in a specific format
            # so that it can automatically construct the right tokenizations to train the model.
            # See the documentation for more details:
            # https://huggingface.co/docs/trl/en/sft_trainer#expected-dataset-type-and-format
            # "**********": f"**********",
            # <<< START SOLUTION SECTION
            "prompt": (
                f"You spell words with hyphens between the letters like this W-O-R-D.\nWord:\n{word}\n\n"
                + "Spelling:\n"
            ),
            # >>> END SOLUTION SECTION
            "completion": "-".join(word).upper() + ".",  # Of the form W-O-R-D.
        }


ds = Dataset.from_generator(generate_records)

# Show the first item
ds[0]

{'prompt': 'You spell words with hyphens between the letters like this W-O-R-D.\nWord:\nidea\n\nSpelling:\n',
 'completion': 'I-D-E-A.'}
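As a small sketch of the completion format, the snippet below shows what `"-".join(word)` does to a string and how the original word can be recovered from a completion. The helper names `make_completion` and `recover_word` are hypothetical and are not used elsewhere in the notebook.

```python
def make_completion(word: str) -> str:
    # str.join iterates over the characters of the word, so hyphens
    # end up between every pair of letters.
    return "-".join(word).upper() + "."


def recover_word(completion: str) -> str:
    # Undo the formatting: drop the trailing period and the hyphens.
    return completion.rstrip(".").replace("-", "").lower()


completion = make_completion("idea")
print(completion)                # I-D-E-A.
print(recover_word(completion))  # idea
```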

In [5]:
type(ds)

datasets.arrow_dataset.Dataset

In [6]:
# Student Task: Split the dataset into training and testing sets
# See: train_test_split
# TODO: Complete the sections with **********

# ds = **********  # Set the test set to be 25% of the dataset, and the rest is training
ds = ds.train_test_split(test_size=0.25, seed=42)



In [7]:
type(ds)

datasets.dataset_dict.DatasetDict

In [8]:
ds

DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 46
    })
    test: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 16
    })
})
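The 46/16 split can be checked with a little arithmetic. This sketch assumes the test size is rounded up to a whole example, which is consistent with the row counts shown above; it does not call the `datasets` library.

```python
import math

n_examples = 62       # number of words in ALL_WORDS
test_fraction = 0.25  # test_size passed to train_test_split

# 62 * 0.25 = 15.5, which is assumed to round up to 16 test examples.
n_test = math.ceil(n_examples * test_fraction)
n_train = n_examples - n_test
print(n_train, n_test)  # 46 16
```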

In [9]:
# View the training set

# No changes needed in this cell

ds["train"]

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 46
})

## Step 3. Evaluate the base model

Before we fine-tune the model, let's see how it performs on the spelling task. We'll create a helper function to generate a spelling for a given word and compare it to the correct answer.

In [45]:
inputs = tokenizer(ds["test"][0]["prompt"], return_tensors="pt")
gen = model.generate(inputs["input_ids"].to(device), max_new_tokens=20)

In [46]:
ds["test"][0]["prompt"]

'You spell words with hyphens between the letters like this W-O-R-D.\nWord:\nwryly\n\nSpelling:\n'

In [47]:
gen[0]

tensor([ 2683, 15362,  1924,   351,  4015,  3139,   826,   260,  5073,   702,
          451,   408,    29,    63,    29,    66,    29,    52,    30,   198,
        21268,    42,   198,   103,   541,   318,   198,   198,  8103,  2132,
           42,   198,   103,   198,   198,   198,   198,   198,   198,   198,
          198,   198,   198,   198,   198,   198,   198,   198,   198,   198,
          198,   198], device='mps:0')

In [48]:
output = tokenizer.decode(gen[0], skip_special_tokens=True)
output

'You spell words with hyphens between the letters like this W-O-R-D.\nWord:\nwryly\n\nSpelling:\nw\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

In [49]:
output.split("spelling:")[-1].strip().split("\n")[0][:-1] 

'You spell words with hyphens between the letters like this W-O-R-D'

In [10]:
# Student task: Create a function to check the model's spelling.
# This function will take a model, tokenizer, prompt, and the correct spelling.
# It should generate text from the model and compare the model's proposed spelling
# to the actual spelling, returning the proportion of characters that were correct.
# TODO: Complete the sections with **********


def check_spelling(
    model, tokenizer, prompt: str, actual_spelling: str, max_new_tokens: int = 20
) -> float:
    # Tokenize the prompt
    # inputs = **********
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate text from the model
    # gen = **********
    gen = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=False)

    # Decode the generated tokens to a string
    # output = **********
    output = tokenizer.decode(gen[0], skip_special_tokens=True)

    # Extract the generated spelling from the full output string
    # proposed_spelling = "**********"
    proposed_spelling = output.split("Spelling:")[-1].strip().split("\n")[0].strip()

    # Strip any whitespace from the actual spelling
    # actual_spelling = "**********"
    actual_spelling = actual_spelling.strip()

    # Remove hyphens for a character-by-character comparison
    # proposed_spelling = "**********"
    # actual_spelling = "**********"
    proposed_spelling = proposed_spelling.replace("-", "")
    actual_spelling = actual_spelling.replace("-", "")

    # Calculate the number of correct characters
    # num_correct = "**********"
    num_correct = sum(1 for p_char, a_char in zip(proposed_spelling, actual_spelling) if p_char == a_char)


    print(
        f"Proposed: {proposed_spelling} | Actual: {actual_spelling} "
        f"| Matches: {'✅' if proposed_spelling == actual_spelling else '❌'}"
    )

    return num_correct / len(actual_spelling)  # Return proportion correct


check_spelling(
    model=model,
    tokenizer=tokenizer,
    prompt=ds["test"][0]["prompt"],
    actual_spelling=ds["test"][0]["completion"],
)

Proposed: wry | Actual: WRYLY. | Matches: ❌


0.0
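The scoring rule inside `check_spelling` can be isolated from the model to see why this example scores 0.0. The function name `char_match_score` is hypothetical; the logic is a minimal sketch of the same position-by-position, case-sensitive comparison.

```python
def char_match_score(proposed: str, actual: str) -> float:
    """Fraction of positions in `actual` where `proposed` has the same character."""
    proposed = proposed.replace("-", "")
    actual = actual.strip().replace("-", "")
    # zip stops at the shorter string, so missing characters count as wrong.
    matches = sum(1 for p, a in zip(proposed, actual) if p == a)
    return matches / len(actual)


lowercase_prefix = char_match_score("wry", "W-R-Y-L-Y.")
perfect = char_match_score("W-R-Y-L-Y.", "W-R-Y-L-Y.")
wrong_case = char_match_score("w-r-y-l-y.", "W-R-Y-L-Y.")

print(lowercase_prefix)  # 0.0
print(perfect)           # 1.0
print(wrong_case)        # only the final period matches
```

Because the comparison is case-sensitive and includes the trailing period, a model that returns the right letters in lowercase still scores close to zero.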

In [11]:
# Student task: Evaluate the base model's spelling ability
# We expect it to perform poorly, as it hasn't been trained for this task.

proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

Proposed: sphinx | Actual: SPHINX. | Matches: ❌
Proposed: brawn | Actual: BRAWN. | Matches: ❌
Proposed: goss | Actual: GOSSIPY. | Matches: ❌
Proposed: enchant | Actual: ENCHANT. | Matches: ❌
Proposed: tavern | Actual: TAVERN. | Matches: ❌
Proposed: whistle | Actual: WHISTLE. | Matches: ❌
Proposed: WORD | Actual: CAPTURE. | Matches: ❌
Proposed: echo | Actual: ECHO. | Matches: ❌
Proposed: mirth | Actual: MIRTH. | Matches: ❌
Proposed: cris | Actual: CRISP. | Matches: ❌
Proposed: zeal | Actual: ZEALOUS. | Matches: ❌
Proposed:  | Actual: EMBER. | Matches: ❌
Proposed: scarab | Actual: SCARAB. | Matches: ❌
Proposed:  | Actual: KNIT. | Matches: ❌
Proposed: resolve | Actual: RESOLVE. | Matches: ❌
Proposed: velvet | Actual: VELVET. | Matches: ❌
Proposed:  | Actual: ABSOLVE. | Matches: ❌
Proposed: lunar | Actual: LUNAR. | Matches: ❌
Proposed: maze | Actual: MAZE. | Matches: ❌
Proposed:  | Actual: SUMMIT. | Matches: ❌
0.0/20.0 words correct


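As expected, the base model fails the task. It mostly repeats the word back in lowercase (sometimes truncated or empty), and because the comparison is case-sensitive, every example scores 0.0. Now, let's fine-tune it.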

## Step 4. Configure LoRA and train the model

Let's attach a LoRA adapter to the base model. We use a LoRA config so that only a small fraction of the parameters is trainable. Read more here: [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora).
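The idea behind LoRA can be sketched without any libraries. The toy sizes below are chosen for illustration only, and the initialization (small random `A`, all-zero `B`) follows the usual LoRA convention; this is not the PEFT implementation.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


d_out, d_in, r, alpha = 16, 16, 2, 4  # toy sizes, not the notebook's values
random.seed(0)

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r

# Effective weight: W + (alpha / r) * B @ A
scale = alpha / r
delta = [[scale * v for v in row] for row in matmul(B, A)]
W_eff = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(full_params, lora_params)  # 256 64
print(W_eff == W)                # True: B starts at zero, so nothing changes yet
```

Only `A` and `B` receive gradients during training, which is why the trainable parameter count drops sharply while the starting behavior of the model is unchanged.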

In [12]:
# Student task: Configure LoRA for a causal LM and wrap the model with get_peft_model
# Complete the sections with **********

# Print how many params are trainable at first
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params BEFORE: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

# See: https://huggingface.co/docs/peft/package_reference/lora
# lora_config = LoraConfig(
#     r=**********,                 # Rank of the update matrices. Lower value = fewer trainable parameters.
#     lora_alpha=**********,        # LoRA scaling factor.
#     lora_dropout=**********,      # Dropout probability for LoRA layers.
#     bias="none",
#     task_type=**********,         # Causal Language Modeling.
# )
# # Wrap the base model with get_peft_model
# model = get_peft_model(**********, **********)
# An earlier attempt with r=8 and lora_dropout=0.1 was replaced by the settings below.
lora_config = LoraConfig(
    r=64,                   # Rank of the update matrices. Lower value = fewer trainable parameters.
    lora_alpha=16,          # LoRA scaling factor.
    lora_dropout=0.05,      # Dropout probability for LoRA layers.
    bias="none",
    task_type="CAUSAL_LM",  # Causal Language Modeling.
)
model = get_peft_model(model, lora_config)
# Print the number of trainable parameters after applying LoRA
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(
    f"Trainable params AFTER: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)"
)

Trainable params BEFORE: 134,515,008 / 134,515,008 (100.00%)
Trainable params AFTER: 3,686,400 / 138,201,408 (2.67%)
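The trainable count of 3,686,400 can be reproduced by hand. This sketch assumes architecture values that are not stated in the notebook (hidden size 576, key/value projection width 192, 30 layers) and assumes LoRA is attached to the `q_proj` and `v_proj` layers; with rank 64 those assumptions match the printed number.

```python
# Assumed architecture values for SmolLM2-135M (not stated in this notebook).
hidden_size = 576
kv_proj_out = 192  # output width of the value projection
num_layers = 30
r = 64             # LoRA rank in effect when get_peft_model is called

# Each adapted linear layer adds r * (in_features + out_features) parameters.
q_proj_params = r * (hidden_size + hidden_size)
v_proj_params = r * (hidden_size + kv_proj_out)
trainable = num_layers * (q_proj_params + v_proj_params)
print(f"{trainable:,}")  # 3,686,400

base_params = 134_515_008
share = 100 * trainable / (base_params + trainable)
print(f"{share:.2f}%")  # 2.67%
```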


Now let's set the training arguments. We'll use `SFTConfig` from the TRL library, which is a wrapper around the standard `TrainingArguments`. We keep epochs, batch size, and sequence length modest to finish training quickly.

### Training is basically answering 3 questions:

- How much data do I feed at once? → batch sizes

- How aggressively do I change the weights? → learning rate + scheduler

- How long and how often do I train/evaluate? → epochs, steps, logging
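For the learning rate question, a cosine scheduler starts at the peak rate and decays smoothly to zero. The sketch below assumes no warmup and uses the peak rate (5e-4) and step count (120) from this notebook's run; it is a plain formula, not the library's scheduler.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Half-cosine decay from peak_lr at step 0 to zero at total_steps."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 30, 60, 90, 120):
    print(step, f"{cosine_lr(step, 120, 5e-4):.2e}")
```

The rate is halved at the midpoint and falls fastest there, so most of the large weight changes happen in the first half of training.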

In [None]:
# Student task: Fill in the SFTConfig for a quick training run
# Complete the sections with **********

output_dir = "data/model"

# See: https://huggingface.co/docs/trl/en/sft_trainer#trl.SFTConfig
# training_args = SFTConfig(
#     output_dir=output_dir,
#     per_device_train_batch_size=**********,
#     per_device_eval_batch_size=**********,
#     gradient_accumulation_steps=**********,
#     num_train_epochs=**********,
#     learning_rate=**********,
#     logging_steps=**********,
#     eval_strategy="steps",
#     eval_steps=**********,
#     save_strategy="no",
#     report_to=[],                            # disable wandb/tensorboard
#     fp16=False,                              # stay in fp32 for CPU/MPS
#     lr_scheduler_type="cosine",
# )
# An earlier attempt (gradient_accumulation_steps=4, num_train_epochs=3, learning_rate=2e-4,
# logging_steps=10, no eval_strategy) was replaced by the settings below.
training_args = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=20,
    learning_rate=5e-4,
    logging_steps=20,
    eval_strategy="steps",
    eval_steps=20,
    save_strategy="no",
    report_to=[],                            # disable wandb/tensorboard
    fp16=False,                              # stay in fp32 for CPU/MPS
    lr_scheduler_type="cosine",
)

Now we define the `SFTTrainer` and run the fine-tuning process.

In [14]:
# Student Task: Create and run the SFTTrainer
# TODO: Complete the sections with **********


# See: https://huggingface.co/docs/trl/en/sft_trainer
# trainer = SFTTrainer(
#     model=**********,
#     train_dataset=**********,
#     eval_dataset=**********,
#     args=**********,
# )
# Now train it:
# trainer.**********
trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    args=training_args,
)
trainer.train()




| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|---|---|---|---|---|---|
| 20 | 1.0904 | 0.725951 | 2.279726 | 6627.0 | 0.742577 |
| 40 | 0.5088 | 0.594898 | 1.87307 | 13256.0 | 0.789201 |
| 60 | 0.3551 | 0.564892 | 1.688663 | 19800.0 | 0.836504 |
| 80 | 0.2434 | 0.585253 | 1.570856 | 26433.0 | 0.845763 |
| 100 | 0.1988 | 0.597368 | 1.540196 | 33040.0 | 0.856632 |
| 120 | 0.1723 | 0.599792 | 1.52674 | 39600.0 | 0.845763 |


TrainOutput(global_step=120, training_loss=0.42812402844429015, metrics={'train_runtime': 99.966, 'train_samples_per_second': 9.203, 'train_steps_per_second': 1.2, 'total_flos': 27776639121408.0, 'train_loss': 0.42812402844429015, 'epoch': 20.0})
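The `global_step=120` in the output follows from the dataset size and the batch settings. This is a back-of-the-envelope sketch that assumes a single device and the values in effect for this run (batch size 4, gradient accumulation 2, 20 epochs).

```python
import math

train_examples = 46
per_device_batch = 4
grad_accum = 2
epochs = 20

# 46 examples in batches of 4 gives 12 batches (the last one is partial).
batches_per_epoch = math.ceil(train_examples / per_device_batch)
# One optimizer update happens every `grad_accum` batches.
updates_per_epoch = batches_per_epoch // grad_accum
total_steps = updates_per_epoch * epochs
print(batches_per_epoch, updates_per_epoch, total_steps)  # 12 6 120
```

The effective batch size per update is 4 × 2 = 8 examples, so each epoch makes only six weight updates on this small dataset.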

## Step 5. Evaluate the fine-tuned model

In [15]:
for example in ds["train"].select(range(15)):
    prompt = example["prompt"]
    completion = example["completion"]
    print(prompt)
    print(completion)
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    print("After training:")
    print(result)

You spell words with hyphens between the letters like this W-O-R-D.
Word:
sphinx

Spelling:

S-P-H-I-N-X.
Proposed: SPHINX. | Actual: SPHINX. | Matches: ✅
After training:
1.0
You spell words with hyphens between the letters like this W-O-R-D.
Word:
brawn

Spelling:

B-R-A-W-N.
Proposed: BRAWN. | Actual: BRAWN. | Matches: ✅
After training:
1.0
You spell words with hyphens between the letters like this W-O-R-D.
Word:
gossipy

Spelling:

G-O-S-S-I-P-Y.
Proposed: GOSSIPY. | Actual: GOSSIPY. | Matches: ✅
After training:
1.0
You spell words with hyphens between the letters like this W-O-R-D.
Word:
enchant

Spelling:

E-N-C-H-A-N-T.
Proposed: ENCHANT. | Actual: ENCHANT. | Matches: ✅
After training:
1.0
You spell words with hyphens between the letters like this W-O-R-D.
Word:
tavern

Spelling:

T-A-V-E-R-N.
Proposed: TAVENR. | Actual: TAVERN. | Matches: ❌
After training:
0.7142857142857143
You spell words with hyphens between the letters like this W-O-R-D.
Word:
whistle

Spelling:

W

In [16]:
# Evaluate the fine-tuned model on the same training examples
# No changes needed in this cell


proportion_correct = 0.0

for example in ds["train"].select(range(20)):
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/20.0 words correct")

Proposed: SPHINX. | Actual: SPHINX. | Matches: ✅
Proposed: BRAWN. | Actual: BRAWN. | Matches: ✅
Proposed: GOSSIPY. | Actual: GOSSIPY. | Matches: ✅
Proposed: ENCHANT. | Actual: ENCHANT. | Matches: ✅
Proposed: TAVENR. | Actual: TAVERN. | Matches: ❌
Proposed: WHISTE. | Actual: WHISTLE. | Matches: ❌
Proposed: CUPARE. | Actual: CAPTURE. | Matches: ❌
Proposed: ECHORD. | Actual: ECHO. | Matches: ❌
Proposed: MIRTH. | Actual: MIRTH. | Matches: ✅
Proposed: CRISP. | Actual: CRISP. | Matches: ✅
Proposed: ZEALOUS. | Actual: ZEALOUS. | Matches: ✅
Proposed: EMBEU. | Actual: EMBER. | Matches: ❌
Proposed: SCARAB. | Actual: SCARAB. | Matches: ✅
Proposed: KINT. | Actual: KNIT. | Matches: ❌
Proposed: RESILOS. | Actual: RESOLVE. | Matches: ❌
Proposed: VELVET. | Actual: VELVET. | Matches: ✅
Proposed: ABORE. | Actual: ABSOLVE. | Matches: ❌
Proposed: LUNAR. | Actual: LUNAR. | Matches: ✅
Proposed: MAZE. | Actual: MAZE. | Matches: ✅
Proposed: SUMITT. | Actual: SUMMIT. | Mat

The model now performs better on the training data it has seen. But has it generalized? Let's check its performance on the unseen test set.

In [17]:
# Evaluate the fine-tuned model on the unseen test set
# No changes needed in this cell


proportion_correct = 0.0
num_examples = len(ds["test"])

for example in ds["test"]:
    prompt = example["prompt"]
    completion = example["completion"]
    result = check_spelling(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        actual_spelling=completion,
        max_new_tokens=20,
    )
    proportion_correct += result

print(f"{proportion_correct}/{num_examples}.0 words correct")

Proposed: WRIYLY. | Actual: WRYLY. | Matches: ❌
Proposed: GLINES. | Actual: GLISTEN. | Matches: ❌
Proposed: CASQE. | Actual: QUEST. | Matches: ❌
Proposed: CERAVE. | Actual: CRAVE. | Matches: ❌
Proposed: LUSIO. | Actual: LUSH. | Matches: ❌
Proposed: FALICE. | Actual: FABLE. | Matches: ❌
Proposed: KNARKT. | Actual: KNACK. | Matches: ❌
Proposed: TIRUMPH. | Actual: TRIUMPH. | Matches: ❌
Proposed: SAPIZRI. | Actual: SAPPHIRE. | Matches: ❌
Proposed: EXPSET. | Actual: EXPOSE. | Matches: ❌
Proposed: FSRECSON. | Actual: FRESCOS. | Matches: ❌
Proposed: WIPS. | Actual: WISP. | Matches: ❌
Proposed: MIRGE. | Actual: MIRAGE. | Matches: ❌
Proposed: IVORY. | Actual: IVORY. | Matches: ✅
Proposed: ONSHORD. | Actual: ONSET. | Matches: ❌
Proposed: ELUDE. | Actual: ELUDE. | Matches: ✅
8.075/16.0 words correct
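The printed total is a sum of per-word character proportions, not a count of fully correct words. The sketch below recomputes both numbers from the proposed/actual pairs printed above, using the same position-by-position comparison as `check_spelling`; the helper name `char_match_score` is hypothetical.

```python
# (proposed, actual) pairs copied from the test-set output above.
pairs = [
    ("WRIYLY.", "WRYLY."), ("GLINES.", "GLISTEN."), ("CASQE.", "QUEST."),
    ("CERAVE.", "CRAVE."), ("LUSIO.", "LUSH."), ("FALICE.", "FABLE."),
    ("KNARKT.", "KNACK."), ("TIRUMPH.", "TRIUMPH."), ("SAPIZRI.", "SAPPHIRE."),
    ("EXPSET.", "EXPOSE."), ("FSRECSON.", "FRESCOS."), ("WIPS.", "WISP."),
    ("MIRGE.", "MIRAGE."), ("IVORY.", "IVORY."), ("ONSHORD.", "ONSET."),
    ("ELUDE.", "ELUDE."),
]


def char_match_score(proposed: str, actual: str) -> float:
    """Fraction of positions in `actual` matched by `proposed`."""
    matches = sum(1 for p, a in zip(proposed, actual) if p == a)
    return matches / len(actual)


total = sum(char_match_score(p, a) for p, a in pairs)
exact_matches = sum(p == a for p, a in pairs)

print(round(total, 3))               # 8.075
print(exact_matches)                 # 2
print(round(total / len(pairs), 3))  # 0.505
```

So about half of the character positions are right on average, while only 2 of the 16 unseen words are spelled exactly.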


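The model has improved, though generalization is limited: only 2 of the 16 unseen words are spelled exactly right, and the score above sums per-character accuracy rather than counting whole words. With a larger dataset and more training, it could likely get better.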

## Congratulations on completing the exercise! 🎉

✅ You did it! You successfully fine-tuned a small language model using PEFT with LoRA to teach it a new skill: spelling! You saw how the base model failed completely at the task, and with a very small amount of data and a short training run, the model managed to get better at spelling.
