# Fine-Tuning a Decoder Model: Teaching an Old Dog New Tricks

You've learned about pipelines, tokenization, attention, and model architectures. Now, it's time to put it all into practice by performing the most powerful technique in modern NLP: **fine-tuning**. 🧙

We're going to tackle the following:

1.  Take a pre-trained, general-purpose `gpt2-medium` model.
2.  Choose a specific task: creating a **specialized quote generator**.
3.  Prepare a custom dataset of quotes for this task.
4.  **Fine-tune** the model on our dataset, teaching it this new skill.
5.  Compare the fine-tuned model's performance against the original base model to see the dramatic improvement.

This process is a much smaller-scale version of the same techniques used to create powerful models like ChatGPT.

## What is Fine-Tuning?

Fine-tuning is like teaching a knowledgeable student a specific skill. The pre-trained model already understands language patterns, and we're teaching it to apply that knowledge to our specific task.

## Why Fine-Tune?

1. **Task-specific performance**: Pre-trained models are general-purpose; fine-tuning makes them experts at your task.
2. **Domain adaptation**: Adapt models to specific domains (medical, legal, technical).
3. **Data efficiency**: Requires much less data than training from scratch.
4. **Time efficiency**: Much faster than pre-training a model.

# Setup and Configuration

First, we need to install the necessary libraries from the Hugging Face ecosystem and set up our environment.


> 🚨 **Important**: For this notebook to run, you must use a GPU. In Google Colab, go to **Runtime → Change runtime type → Hardware accelerator** and select **GPU**.

And then, make sure to run the code block below in order to install all of our dependencies and get ready to rock and/or roll.

In [1]:
# @title Install the Dependencies and Set Everything Up {"display-mode": "form"}

!pip install transformers datasets accelerate bitsandbytes peft -q

# - transformers: For models and tokenizers
# - datasets: To easily load and process our training data
# - accelerate: A library from Hugging Face to simplify training on any infrastructure (like the Colab GPU)
# - bitsandbytes: For quantization to make training more memory-efficient
# - peft: For parameter-efficient fine-tuning methods such as LoRA

# Import the required libraries
import pprint
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
    pipeline,
    logging
)
from datasets import load_dataset
from google.colab import output
import peft

# Suppress verbose output from transformers
logging.set_verbosity_error()

output.clear()

print("🤘 The setup is complete.")

🤘 The setup is complete.


# Finding and Preparing Our Data

The quality and format of your training data are the most important factors for successful fine-tuning. For our task, we need a dataset of quotes. We'll use the `Abirate/english_quotes` dataset from the Hugging Face Hub.

Our goal is to teach the model a specific structure: given an author's name, it should generate a quote by that author. We will format our data into a consistent string:

```
Quote by {author}: {quote} [EOS]
```

The `[EOS]` (End of Sequence) token is *super important*. It explicitly teaches the model when a quote is finished, so it learns to stop generating at the right time. For GPT-2, this token is written as `<|endoftext|>`, which is the string we append in the code.

In [3]:
# Load the dataset from the Hugging Face Hub
dataset_name = "Abirate/english_quotes"
dataset = load_dataset(dataset_name, split="train")

output.clear()

# Let's look at a few examples to understand the structure
print(dataset, end="\n\n")

dataset[0]

Dataset({
    features: ['quote', 'author', 'tags'],
    num_rows: 2508
})



{'quote': '“Be yourself; everyone else is already taken.”',
 'author': 'Oscar Wilde',
 'tags': ['be-yourself',
  'gilbert-perreira',
  'honesty',
  'inspirational',
  'misattributed-oscar-wilde',
  'quote-investigator']}

# Data Preparation

We need to format the data into a single string for the model. We'll create a new column called "text" with our desired format.

In [4]:
def format_prompt(example):
    quote_text = example['quote']
    author_name = example['author']
    # Return a dictionary with the formatted string under a new key 'text'
    return {"text": f"Quote by {author_name}: {quote_text} <|endoftext|>"}

formatted_dataset = dataset.map(format_prompt)

# This is what the first example will look like.
print(formatted_dataset[0]['text'])

Quote by Oscar Wilde: “Be yourself; everyone else is already taken.” <|endoftext|>


Without this special stop sign, the model wouldn't know when a quote is finished and might keep generating text indefinitely or merge two different quotes together. It's a crucial part of teaching the model the structure of our desired output.

# Loading the Pre-Trained Model and Tokenizer

Now we'll load our base model, `gpt2-medium`. We'll also use a technique called **quantization** to make the model much more memory-efficient. This allows us to fine-tune a larger model on the free GPU without running out of memory, because that wouldn't be very fun.

  * **Model**: `gpt2-medium` (a 355M parameter model).
  * **Tokenizer**: The corresponding tokenizer for `gpt2-medium`.
  * **Quantization**: We'll use `BitsAndBytesConfig` to load the model in 4-bit precision. This drastically reduces memory usage with a minimal impact on performance.

Think of quantization as creating a summary of a very long book. Instead of using a rich, detailed vocabulary (like 32-bit floating point), we're using a more limited, efficient set of words (like 4-bit integers) to capture the main ideas. This makes the book much smaller and easier to carry (i.e., fit into memory) with only a tiny loss in the overall story.
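To make the analogy concrete, here is a minimal sketch of the idea in plain Python. It assumes a simple uniform "absmax" scheme with one shared scale, which is *not* the `nf4` data type we configure for `bitsandbytes`; the function names and sample weights are made up for illustration.

```python
def quantize_4bit(values):
    """Map floats to signed 4-bit codes in [-7, 7] using one shared scale."""
    largest = max(abs(v) for v in values)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Turn the small integer codes back into approximate floats."""
    return [c * scale for c in codes]


# Illustrative weights only; real layers hold millions of values.
weights = [0.12, -0.53, 0.98, -0.05, 0.31]

codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("4-bit codes:", codes)
print("restored:   ", [round(r, 3) for r in restored])
print("max error:  ", round(max_error, 3))
```

Each weight is stored as one of only 15 integer codes plus a single shared scale, and the rounding error stays within half a quantization step. That is the "tiny loss in the overall story" from the book analogy.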

In [5]:
model_name = "gpt2-medium"

# Quantization configuration to load the model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

def get_model():
    # Load the model with our quantization configuration
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        trust_remote_code=True
    )

    # Disable cache to prepare for training
    model.config.use_cache = False

    output.clear()

    return model

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Set the padding token to be the same as the end-of-sequence token.
# This is a common practice for decoder-only models.
tokenizer.pad_token = tokenizer.eos_token

print("[EOS]", tokenizer.pad_token)

[EOS] <|endoftext|>


# Parameter-Efficient Fine-Tuning

**Parameter-Efficient Fine-Tuning (PEFT)** is a family of tricks for customizing very large models without touching—or even storing—all of their original weights.  Instead of re-training hundreds of millions of parameters, PEFT methods freeze the base model and learn only a *tiny* add-on: maybe a set of soft prompt vectors, a handful of low-rank matrices (LoRA), or a slim adapter layer.  Because the tweak is small, you can train it quickly on modest hardware, then ship or swap the add-on file (often just a few megabytes) while everyone shares the same underlying model.


PEFT isn't a single algorithm; it's a design goal.  Techniques like **LoRA, prompt tuning, adapter layers, and QLoRA** all live under the PEFT umbrella, each squeezing storage, memory, or compute in its own way.  The common win is clear: you keep accuracy close to a full fine-tune but cut the cost so sharply that laptops, edge devices, or many parallel experiments suddenly become practical.


# What is LoRA?

Picture a gigantic language model as a skyscraper with millions of steel beams: the full weight matrix. Fine-tuning that model for your niche task usually means nudging every beam, which is expensive and memory-hungry. **Low-Rank Adaptation (LoRA)** sidesteps the heavy lifting by sliding in a few lightweight support columns: it freezes the original weights and learns two tiny matrices whose product has a very low rank (think a thin sheet slipped between existing floors). During training only these add-on matrices are updated, so you store and update far fewer parameters, often **hundreds of times fewer** than a full fine-tune, yet the combination can steer the model toward your new task nearly as effectively.

Because LoRA's extra matrices are so small, you can stack multiple task-specific adapters, swap them on-the-fly, and even share them publicly without redistributing the entire base model. Extensions like **QLoRA** pair the trick with low-bit quantization, squeezing both storage and memory footprints so that a single modest GPU can fine-tune and run models that once demanded data-center-grade hardware.

The key takeaway is that we aren't rebuilding the entire skyscraper. We're just adding some very smart, lightweight scaffolding. This is why fine-tuning with LoRA is so much faster and requires less computational power than training a model from scratch.
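The scaffolding idea can be sketched with a few lines of plain Python. This is a toy illustration, not how the `peft` library is implemented: the matrices, the rank `r = 1`, and the scaling `alpha = 2` are made-up values, and the helpers `matvec` and `matmul` are hypothetical names. It shows the frozen weight `W`, the two small trainable matrices `A` and `B`, and that merging them into `W` gives the same output as keeping them separate.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(P, Q):
    """Multiply matrix P by matrix Q (both lists of rows)."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


# Frozen base weight W: 3 outputs x 4 inputs (illustrative numbers only).
W = [[0.5, -0.2, 0.1, 0.0],
     [0.3, 0.8, -0.5, 0.2],
     [-0.1, 0.4, 0.6, -0.3]]

r, alpha = 1, 2                   # toy rank and scaling factor
A = [[0.1, 0.2, -0.1, 0.3]]       # r x 4, trainable
B = [[0.5], [-0.4], [0.2]]        # 3 x r, trainable
scale = alpha / r

x = [1.0, 2.0, 3.0, 4.0]

# Adapter path: y = W x + scale * B (A x). W itself is never modified.
base_out = matvec(W, x)
delta = matvec(B, matvec(A, x))
y_adapter = [b + scale * d for b, d in zip(base_out, delta)]

# Merged path: fold the low-rank update into the weight, W' = W + scale * B A.
BA = matmul(B, A)
W_merged = [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

full_params = len(W) * len(W[0])           # what a full fine-tune would update
lora_params = r * len(W[0]) + len(W) * r   # what LoRA updates instead

print("adapter output:", [round(v, 4) for v in y_adapter])
print("merged output: ", [round(v, 4) for v in y_merged])
print("full vs LoRA parameters:", full_params, "vs", lora_params)
```

On a 3 x 4 matrix the saving is small, but the LoRA count grows with the sum of the two dimensions while the full count grows with their product, so the gap widens quickly for real layers.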

In [6]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Configure LoRA
lora_config = peft.LoraConfig(
    r=8,  # Rank of the update matrices.
    lora_alpha=32,  # Scaling factor for the LoRA weights.
    lora_dropout=0.05,  # Dropout probability for LoRA layers.
    bias="none",  # Bias type (none, all, or lora_only).
    task_type="CAUSAL_LM",  # Task type (e.g., CAUSAL_LM for language generation).
    fan_in_fan_out=True, # Explicitly set for Conv1D layers
)

# Add LoRA adapters to the model
model = peft.get_peft_model(get_model(), lora_config)

# Print the trainable parameters to see the effect of LoRA
model.print_trainable_parameters()

trainable params: 786,432 || all params: 355,609,600 || trainable%: 0.2212
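The trainable-parameter count in that printout can be reproduced by hand. The sketch below rests on two assumptions that are not stated in this notebook: that `gpt2-medium` has 24 transformer blocks with a hidden size of 1024, and that, because `lora_config` sets no `target_modules`, PEFT falls back to adapting only the fused attention projection `c_attn` (1024 inputs, 3 x 1024 outputs) in each block.

```python
# Assumed architecture facts for gpt2-medium (not printed anywhere in this notebook).
num_layers = 24
hidden_size = 1024
attn_out = 3 * hidden_size   # c_attn produces queries, keys, and values at once

r = 8                        # LoRA rank from lora_config

# Each adapted layer gets A (r x hidden_size) and B (attn_out x r).
per_layer = r * hidden_size + attn_out * r
trainable = num_layers * per_layer

total = 355_609_600          # 'all params' copied from the printout above
percent = round(100 * trainable / total, 4)

print(f"trainable params: {trainable:,} ({percent}%)")
```

Under those assumptions the arithmetic lands on the same 786,432 parameters and 0.2212% that `print_trainable_parameters()` reports.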


Next, we want to tokenize the dataset of our nice, new formatted strings.

In [7]:
def tokenize_function(examples):
    # The tokenizer will convert our formatted text into token IDs
    # Pass the 'text' field to the tokenizer
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)

# Tokenize the entire dataset
tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True)

# No, `pprint` is *not* a typo. It's for printing stuff… but pretty.
pprint.pp(tokenized_dataset[0], compact=True)

{'quote': '“Be yourself; everyone else is already taken.”',
 'author': 'Oscar Wilde',
 'tags': ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational',
          'misattributed-oscar-wilde', 'quote-investigator'],
 'text': 'Quote by Oscar Wilde: “Be yourself; everyone else is already taken.” '
         '<|endoftext|>',
 'input_ids': [25178, 416, 15694, 45622, 25, 564, 250, 3856, 3511, 26, 2506,
               2073, 318, 1541, 2077, 13, 447, 251, 220, 50256, 50256, 50256,
               50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
               50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
               50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
               50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
               50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
               50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
               50256, 50256, 50256, 50256, 50256,
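All of those trailing `50256` values are padding, and they matter for training. Here is a minimal sketch of how the labels are derived from `input_ids`, assuming `DataCollatorForLanguageModeling` with `mlm=False` copies the IDs and replaces every pad position with `-100` (the value the loss function ignores), as we understand recent `transformers` releases to behave. The function `make_labels` is a hypothetical stand-in, not the library's code.

```python
PAD_ID = 50256   # pad token ID, which in this notebook is also the end-of-text ID
IGNORE = -100    # label value that the cross-entropy loss skips


def make_labels(input_ids, pad_id=PAD_ID):
    """Copy input_ids and mask every position that holds the pad ID."""
    return [IGNORE if tok == pad_id else tok for tok in input_ids]


# Tail of the tokenized example above: end of the quote, the real
# end-of-text token, then three padding tokens.
input_ids = [13, 447, 251, 220, 50256, 50256, 50256, 50256]
labels = make_labels(input_ids)

print(labels)  # [13, 447, 251, 220, -100, -100, -100, -100]
```

Because the pad token and the end-of-text token share ID 50256 here, the genuine `<|endoftext|>` at the end of each quote would be masked along with the padding. If the fine-tuned model does not stop cleanly after a quote, this is a likely cause worth checking; giving the tokenizer a dedicated padding token is one possible remedy.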

# The Fine-Tuning Process

We will use the `Trainer` API from Hugging Face, which handles the entire training loop for us. We just need to provide it with our model, dataset, and a set of `TrainingArguments`.

The `TrainingArguments` tell the `Trainer` how to perform the training (e.g., batch size, number of epochs, etc.).

- `learning_rate`: Think of the learning_rate as how big of a step the student takes when correcting a mistake. Too big, and they might overshoot the right answer. Too small, and it will take them forever to learn.
- `per_device_train_batch_size`: This is like showing our student a few examples (batch_size=2) before we ask them to update their understanding. It's more efficient than showing them just one example at a time.
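The step-size intuition behind `learning_rate` can be seen on a toy problem. This sketch has nothing to do with GPT-2: it runs plain gradient descent on the one-variable function f(w) = (w - 3)^2, whose best answer is w = 3, with three made-up learning rates.

```python
def gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimize f(w) = (w - 3)^2 with plain gradient descent."""
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)          # derivative of (w - 3)^2
        w -= learning_rate * grad   # take one correction step
    return w


for lr in (0.001, 0.1, 1.1):
    w = gradient_descent(lr)
    print(f"learning_rate={lr}: w={w:.3f}, distance from 3 = {abs(w - 3):.3f}")
```

With the tiny rate the student has barely moved after 20 steps, with the moderate rate they are close to 3, and with the oversized rate each correction overshoots further than the last.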

## Using the Trainer API

The Trainer API is Hugging Face's high-level training interface that handles:

- Training loop
- Evaluation
- Logging
- Checkpointing
- Mixed precision training
- Distributed training

In [8]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-medium-quotes",      # Directory to save the model
    num_train_epochs=1,                     # We'll train for 1 full pass through the data
    per_device_train_batch_size=2,          # Process 2 examples at a time per GPU
    gradient_accumulation_steps=1,          # Update after every batch; raise this to simulate larger batches
    learning_rate=2e-4,                     # The speed at which the model learns
    fp16=True,                              # Use mixed precision for faster training
    logging_steps=200,                      # Log training loss every 200 steps
    save_total_limit=2,                     # Only keep the last 2 saved models
    report_to="none"                        # Disable reporting to services like Weights & Biases
)

# Create the Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

Fine-tuning with LoRA produces a separate, small set of instructions (the adapters) rather than a new copy of the model. After training, when we test the result, we'll permanently merge these adapters back into the base model. This creates a single, consolidated 'expert' model that is easy to save and use for generating quotes.

Alright, we're ready to kick off the fine-tuning!

In [9]:
# Let's start fine-tuning!
print("🚀 Starting fine-tuning…")
trainer.train()
print("✅ Fine-tuning complete!")

# This saves the fine-tuned LoRA adapter weights (not a full copy of the base model) to a separate directory
final_model_dir = "./gpt2-medium-quotes-final"
trainer.save_model(final_model_dir)
print(f"Model saved to {final_model_dir}")

🚀 Starting fine-tuning…
{'loss': 2.9498, 'grad_norm': 2.0155582427978516, 'learning_rate': 0.00016842105263157895, 'epoch': 0.1594896331738437}
{'loss': 2.6376, 'grad_norm': 1.3902876377105713, 'learning_rate': 0.0001365231259968102, 'epoch': 0.3189792663476874}
{'loss': 2.4954, 'grad_norm': 1.8509743213653564, 'learning_rate': 0.00010462519936204146, 'epoch': 0.4784688995215311}
{'loss': 2.466, 'grad_norm': 2.0513079166412354, 'learning_rate': 7.272727272727273e-05, 'epoch': 0.6379585326953748}
{'loss': 2.5287, 'grad_norm': 1.592956304550171, 'learning_rate': 4.082934609250399e-05, 'epoch': 0.7974481658692185}
{'loss': 2.4914, 'grad_norm': 1.7699286937713623, 'learning_rate': 8.931419457735247e-06, 'epoch': 0.9569377990430622}
{'train_runtime': 151.8252, 'train_samples_per_second': 16.519, 'train_steps_per_second': 8.26, 'train_loss': 2.590744651295542, 'epoch': 1.0}
✅ Fine-tuning complete!
Model saved to ./gpt2-medium-quotes-final
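The `epoch` and `learning_rate` columns in that log can be sanity-checked by hand. The sketch below assumes the `Trainer` uses its default schedule, a linear decay from the peak learning rate to zero with no warmup; `linear_lr` is a hypothetical helper written for this check, not a library function.

```python
import math

num_examples = 2508   # rows in the dataset
batch_size = 2        # per_device_train_batch_size
base_lr = 2e-4        # learning_rate

steps_per_epoch = math.ceil(num_examples / batch_size)


def linear_lr(step, total_steps=steps_per_epoch, peak=base_lr):
    """Learning rate after `step` scheduler updates under linear decay to zero."""
    return peak * max(0.0, (total_steps - step) / total_steps)


print("steps per epoch:       ", steps_per_epoch)
print("epoch at step 200:     ", 200 / steps_per_epoch)
print("lr after 200 updates:  ", linear_lr(200))
print("lr after 198 updates:  ", linear_lr(198))
```

The epoch fraction at step 200 matches the first log line. The logged learning rate corresponds to 198 scheduler updates, not 200, and the second log line likewise matches 398. One possible explanation is that a couple of early optimizer steps were skipped under mixed precision, though that is a guess and not something the log confirms.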


## Step 5: Testing and Evaluating Our Fine-Tuned Model

This is the most exciting part! We will now compare the performance of the original **base model** against our new **fine-tuned model**.

We will give both models the same prompt and see how they respond.

In [11]:
from peft import PeftModel

prompt = "Quote by Jimi Hendrix"

print("--- Testing the Original Base Model ---")

base_generator = pipeline('text-generation', model="gpt2-medium", tokenizer="gpt2-medium")
result = base_generator(prompt, max_length=50, num_return_sequences=1)

print("Base model response:")
print(result[0]['generated_text'])

print("\n--- Testing Our Fine-Tuned Model ---")

# Load the base model first
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, low_cpu_mem_usage=True)

# Load the PEFT model (LoRA adapters) from the saved directory
# Ensure final_model_dir is correctly set to the path where you saved the model
# For example, if you saved it to "./gpt2-medium-quotes-final", that should be the value
fine_tuned_model = PeftModel.from_pretrained(base_model, final_model_dir)

# Merge the LoRA weights into the base model
fine_tuned_model = fine_tuned_model.merge_and_unload()

# Create the pipeline with the merged fine-tuned model
fine_tuned_generator = pipeline('text-generation', model=fine_tuned_model, tokenizer=tokenizer)
result = fine_tuned_generator(prompt, max_length=50, num_return_sequences=1)

print("Fine-tuned model response:")
print(result[0]['generated_text'])

--- Testing the Original Base Model ---
Base model response:
Quote by Jimi Hendrix

I've been a fan of this album since it came out. It's got some great songs and some great guitar work. I was actually a bit disappointed with the "Songs of Experience" on the vinyl version. It was really boring and not that much different from the CD version. I think you're right about that, but I didn't really mind too much. It was a little sad that the "Songs of Experience" was removed from the vinyl version, but I still like it.


So, I'm thinking that this vinyl just sounds different to me. I'm not sure exactly why. I think it may have been the fact that the studio was in a different place - they had different equipment and different recording equipment. I'm definitely not someone who likes to listen to everything all at once, but after listening to this album I'm definitely starting to like this more and more. I think this is a really good album for anyone who is into jazz, and I'm starting to like

### Analysis of the Results

You should see a dramatic difference:

  * The **base model** likely generates something generic, nonsensical, or completely unrelated to a quote. It doesn't understand the specific format or task we want.
  * The **fine-tuned model** should immediately generate a plausible-sounding quote that follows the structure it learned from our dataset. It has become a specialist.

Notice how the base model just tried to complete the text, while our fine-tuned model understood the task—to provide a quote in the format we taught it. That's the power of fine-tuning.

## Conclusion and Next Steps

Congratulations! You have successfully fine-tuned a powerful language model to perform a specific task. You've gone through the entire end-to-end process:

1.  **Prepared a custom dataset** with a specific format.
2.  **Loaded a pre-trained model** efficiently using quantization.
3.  **Fine-tuned the model** with LoRA adapters using the Hugging Face `Trainer`.
4.  **Evaluated and confirmed** that your model learned its new skill.

This is the core loop of applied NLP. From here, you can explore further:

  * **Try a different dataset**: Fine-tune a model to write poetry, code, or even mimic a specific person's writing style.
  * **Experiment with hyperparameters**: Change the `learning_rate` or `num_train_epochs` to see how it affects the final model.
  * **Use a larger model**: If you have access to more powerful hardware, try fine-tuning an even larger model for better results.

You now have the practical skills to adapt and specialize foundation models for your own unique applications.