# Fine-Tune GPT-2 for Instruction Following - Solution

This notebook demonstrates how to fine-tune a GPT-2 model on an instruction-following dataset. Fine-tuning adapts a pre-trained language model so that it responds to instructions, which makes it more useful for downstream applications such as chatbots or prompt-driven text generation.

We will use the `transformers` and `datasets` libraries from Hugging Face, which provide convenient tools for this process.

## Dataset: `hakurei/open-instruct-v1`

For this fine-tuning task, we will use the `hakurei/open-instruct-v1` dataset from the Hugging Face Hub. This dataset is designed for instruction tuning and has a specific structure, which is important to understand for preprocessing.

You can find the dataset card with more details here: [https://huggingface.co/datasets/hakurei/open-instruct-v1](https://huggingface.co/datasets/hakurei/open-instruct-v1)

Key characteristics of this dataset:

-   Each record contains an instruction, an optional input, and the corresponding output.
-   The relevant columns for our task are `instruction`, `input`, and `output`.
-   We will need to combine these columns into a single text format that the GPT-2 model can learn from.

## 1. Installation

First, we need to install the necessary libraries: `datasets`, `transformers`, `torch`, and `accelerate`. `torch` is the deep learning framework that performs the model computations in this notebook, and recent versions of `transformers` require `accelerate` for the `Trainer` class used later.

In [None]:
# Install datasets for loading and processing the dataset
# Install transformers for the model, tokenizer, and training utilities
# Install torch as the backend for transformers
# Install accelerate, which the Trainer in recent transformers versions depends on
!pip install datasets transformers torch accelerate

## 2. Load Dataset, Model, and Tokenizer

We will load the `hakurei/open-instruct-v1` dataset, the GPT-2 tokenizer, and the pre-trained GPT-2 model. We'll use a small subset of the dataset for faster training during this example.

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a small subset of the dataset for demonstration purposes
# We use 'train[:1%]' to load only the first 1% of the training data
print("Loading dataset...")
dataset = load_dataset("hakurei/open-instruct-v1", split="train[:1%]")
print("Dataset loaded.")

# Load the GPT-2 tokenizer
# The tokenizer converts text into token IDs that the model understands
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Set the padding token to the EOS token, as GPT-2 doesn't have a dedicated pad token by default
tokenizer.pad_token = tokenizer.eos_token
print("Tokenizer loaded.")

# Load the pre-trained GPT-2 model for causal language modeling
# This model will be fine-tuned on our instruction dataset
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print("Model loaded.")

## 3. Tokenization and Data Formatting

Language models work with numerical tokens, not raw text. We need to tokenize our dataset and format it into sequences that the model can learn from. For instruction tuning, a common approach is to concatenate the `instruction`, `input`, and `output` fields into a single text string, often using special separators or formatting to distinguish the different parts.

We will create a `tokenize` function that takes a batch of examples, formats the text, and then tokenizes it. We'll use `tokenizer.eos_token` at the end of each combined example to signal the end of a sequence. The `datasets` library's `.map()` method is used to apply this function efficiently across the entire dataset.
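
To make the effect of fixed-length sequences concrete before running the real tokenizer, here is a minimal sketch of what `truncation=True` combined with `padding="max_length"` does to a list of token IDs. The helper `pad_or_truncate` and the small token IDs are hypothetical and only illustrate the idea; the actual work is done by the Hugging Face tokenizer. The sketch assumes right-side padding and reuses GPT-2's end-of-text ID (50256) as the pad ID, as this notebook does.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Toy stand-in for truncation=True with padding="max_length" (right padding)."""
    kept = token_ids[:max_length]          # drop everything past max_length
    n_pad = max_length - len(kept)         # how many pad tokens are needed
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 0 marks padding
    }

EOS_ID = 50256  # GPT-2's <|endoftext|> ID, reused as the pad ID in this notebook

# A short example is padded; a long example is cut and loses its trailing EOS.
short = pad_or_truncate([11, 22, 33, EOS_ID], max_length=6, pad_id=EOS_ID)
long = pad_or_truncate([11, 22, 33, 44, 55, 66, 77, EOS_ID], max_length=6, pad_id=EOS_ID)

print(short)
print(long)
```

Note that the long example no longer ends with the EOS token after truncation, which is worth keeping in mind when choosing `max_length`.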

In [None]:
from transformers import DataCollatorForLanguageModeling

# Define the tokenization function
def tokenize(batch):
    # This function takes a batch of examples from the dataset.
    # We iterate through each example in the batch.
    texts = []
    for instruction, input_text, output_text in zip(batch['instruction'], batch['input'], batch['output']):
        # Format the text for instruction tuning.
        # We use a simple format: Instruction: ...\nInput: ...\nOutput: ...
        # The {tokenizer.eos_token} is added at the end to mark the end of the sequence.
        if input_text: # Check if there is an input field
            text = f"Instruction: {instruction}\nInput: {input_text}\nOutput: {output_text}{tokenizer.eos_token}"
        else: # Handle cases with no input field
            text = f"Instruction: {instruction}\nOutput: {output_text}{tokenizer.eos_token}"
        texts.append(text)

    # Tokenize the list of combined text strings.
    # truncation=True cuts off texts longer than max_length tokens (the trailing EOS token is lost for those examples).
    # padding="max_length" pads shorter texts to max_length.
    # max_length=64 sets the maximum sequence length in tokens.
    return tokenizer(texts, truncation=True, padding="max_length", max_length=64)

# Apply the tokenization function to the dataset
# batched=True processes data in batches, which is faster.
# remove_columns removes the original text columns after tokenization to save memory.
print("Tokenizing dataset...")
tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["instruction", "input", "output"])
print("Dataset tokenized and formatted.")

# A data collator assembles batches during training.
# With mlm=False, DataCollatorForLanguageModeling copies input_ids into labels and sets padding positions to -100 so the loss ignores them.
# The one-token shift for next-token prediction happens inside the model's loss computation, not in the collator.
# Caveat: because pad_token is the EOS token here, the EOS that ends each example is masked out of the loss as well.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
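
The label construction for causal language modeling can be sketched in plain Python. This is a simplified illustration of the behavior as I understand it for `mlm=False`, not the library's implementation: labels are a copy of the input IDs, with every position equal to the pad ID replaced by -100 so that the loss skips it. The function name `make_causal_lm_labels` and the toy token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def make_causal_lm_labels(input_ids, pad_id):
    """Copy input_ids to labels, masking every position that holds the pad ID."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

EOS_ID = 50256  # GPT-2's <|endoftext|> ID, which this notebook also uses for padding

# One real EOS (position 3) followed by two padding positions.
input_ids = [11, 22, 33, EOS_ID, EOS_ID, EOS_ID]
labels = make_causal_lm_labels(input_ids, pad_id=EOS_ID)

print(labels)  # [11, 22, 33, -100, -100, -100]
```

Because the pad ID and the EOS ID are the same, the genuine end-of-sequence token is masked along with the padding. With this setup the model receives little or no training signal for when to stop, so generated text may run on past the end of the answer.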

## 4. Training Setup

We use the `Trainer` class from `transformers` to handle the training loop. We need to define `TrainingArguments` which specify hyperparameters and training configurations like output directory, batch size, learning rate, etc.

In [None]:
from transformers import Trainer, TrainingArguments

# Define the training arguments
args = TrainingArguments(
    output_dir="./gpt2-instruct",  # Directory to save checkpoints and results
    per_device_train_batch_size=4, # Batch size per device (adjust based on your GPU memory)
    num_train_epochs=1,            # Number of training epochs
    logging_steps=10,              # Log training loss every X steps
    save_steps=100,                # Save a checkpoint every X steps
    fp16=torch.cuda.is_available(), # Enable mixed precision only when a CUDA GPU is present (fp16=True fails on CPU)
)

# Initialize the Trainer
# The Trainer brings together the model, training arguments, dataset, and data collator
trainer = Trainer(
    model=model,                  # The model to train
    args=args,                    # The training arguments
    train_dataset=tokenized_dataset, # The tokenized training dataset
    data_collator=data_collator,  # The data collator
)

print("Trainer initialized.")
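
To get a feel for how long training will run and how many checkpoints to expect, the number of optimizer steps can be estimated from the dataset size and batch size. The sketch below is a rough estimate that assumes a single device, no gradient accumulation, and that the last partial batch is kept. The helper `optimizer_steps` and the dataset size of 5,000 examples are hypothetical; substitute `len(tokenized_dataset)` for a real run.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, num_devices=1, num_epochs=1):
    """Estimate total optimizer steps (no gradient accumulation, last partial batch kept)."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch * num_epochs

# Hypothetical dataset size; the other values mirror the TrainingArguments above.
total_steps = optimizer_steps(num_examples=5000, per_device_batch_size=4, num_epochs=1)

# Steps at which save_steps=100 triggers a checkpoint.
checkpoint_steps = list(range(100, total_steps + 1, 100))

print(f"Total steps: {total_steps}")
print(f"Loss logged about {total_steps // 10} times (logging_steps=10)")
print(f"Periodic checkpoints at steps: {checkpoint_steps}")
```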

## 5. Train the Model

Now we can start the training process by calling the `trainer.train()` method. This will run the fine-tuning for the specified number of epochs, logging progress and saving checkpoints as configured in `TrainingArguments`.

In [None]:
# Start the training loop
print("Starting training...")
trainer.train()
print("Training finished.")
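
The loss values logged during training are easier to interpret when converted to perplexity. Assuming the logged loss is the mean per-token cross-entropy in nats, which is the usual case for causal language modeling, perplexity is simply its exponential. The loss values below are made up for illustration and are not results from this notebook.

```python
import math

def perplexity(mean_cross_entropy_loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_loss)

# Hypothetical loss values, not measured results.
for loss in (3.0, 2.5, 2.0):
    print(f"loss={loss:.2f} -> perplexity={perplexity(loss):.1f}")
```

A falling loss therefore corresponds to an exponentially falling perplexity, which is why small improvements in loss late in training can still be meaningful.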

## 6. Save the Fine-Tuned Model and Tokenizer

The `Trainer` saves checkpoints during training based on `save_steps`, each in a `checkpoint-<step>` subfolder of `output_dir`. It does not write the final model to the top level of `output_dir` unless you call `trainer.save_model()`, and it does not save the tokenizer here because no tokenizer was passed to the `Trainer`. The cell below saves both explicitly so that `output_dir` contains everything needed for later loading.

In [None]:
# Save the final model weights and config to the top level of the output directory
trainer.save_model(args.output_dir)
# Save the tokenizer alongside the model so both can be loaded from one place
tokenizer.save_pretrained(args.output_dir)
print(f"Model and tokenizer saved to {args.output_dir}")

## 7. Load and Use the Fine-Tuned Model for Inference

After training, you can load your fine-tuned model and tokenizer from the `output_dir` and use it for text generation. The `transformers` pipeline is a convenient way to do this. Remember to format your input prompt for inference in the same way you formatted the training data.

In [None]:
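from transformers import pipeline
from transformers.trainer_utils import get_last_checkpoint

# The tokenizer was saved to the top level of the output directory in the previous step
output_dir = "./gpt2-instruct"
# Prefer the most recent checkpoint-<step> subfolder written by the Trainer (for example checkpoint-1247);
# fall back to output_dir itself if no checkpoint folder exists
model_dir = get_last_checkpoint(output_dir) or output_dir

# Load the tokenizer and model from the saved directories
print(f"Loading tokenizer from {output_dir} and model from {model_dir} for inference...")
loaded_tokenizer = AutoTokenizer.from_pretrained(output_dir)
loaded_model = AutoModelForCausalLM.from_pretrained(model_dir)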

# Create a text generation pipeline using your fine-tuned model and tokenizer
generator = pipeline("text-generation", model=loaded_model, tokenizer=loaded_tokenizer)

# Now you can use the generator pipeline to generate text
# Format your prompt according to the structure used during fine-tuning
prompt = "Instruction: Write a short poem about nature.\nOutput:"

# Generate text
# max_new_tokens limits only the newly generated tokens; max_length would also count the prompt tokens
print(f"Generating text with prompt:\n{prompt}")
generated_output = generator(prompt, max_new_tokens=100, num_return_sequences=1)[0]['generated_text']

print(f"\nGenerated Text:\n{generated_output}")

# Post-process the output to get just the generated response (optional)
if generated_output.startswith(prompt):
    generated_response = generated_output[len(prompt):].strip()
    print(f"\nGenerated Response (post-processed):\n{generated_response}")
else:
    print("Could not automatically remove prompt from output. Full generated text shown above.")
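
Since the prompt at inference time has to match the training format, it can help to wrap prompt construction and response extraction in small helpers. The sketch below mirrors the `Instruction:` / `Input:` / `Output:` template used in the `tokenize` function. The names `build_prompt` and `extract_response` are hypothetical, and cutting the response at the next `Instruction:` marker is an assumption about how a model that keeps generating might behave, not something this notebook verifies. The "generated" text is made up so the example runs without a model.

```python
EOS_TOKEN = "<|endoftext|>"  # GPT-2's end-of-text token as a string

def build_prompt(instruction, input_text=""):
    """Format a prompt the same way the training examples were formatted."""
    if input_text:
        return f"Instruction: {instruction}\nInput: {input_text}\nOutput:"
    return f"Instruction: {instruction}\nOutput:"

def extract_response(generated_text, prompt):
    """Strip the prompt and cut the text at the first stop marker, if any."""
    if generated_text.startswith(prompt):
        response = generated_text[len(prompt):]
    else:
        response = generated_text
    for stop in (EOS_TOKEN, "\nInstruction:"):
        response = response.split(stop, 1)[0]
    return response.strip()

prompt = build_prompt("Write a short poem about nature.")

# Made-up text standing in for model output, including a run-on second instruction.
fake_generation = prompt + " Green leaves sway.\nInstruction: Another task"

print(extract_response(fake_generation, prompt))
```

In the notebook, `build_prompt(...)` would be passed to `generator`, and `extract_response` would be applied to the returned `generated_text`.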