# Fine-Tune GPT-2 for Instruction Following - Workbook

This workbook guides you through fine-tuning a GPT-2 model on an instruction-following dataset. Instruction tuning continues training a pre-trained language model on instruction-response pairs, so that the model learns to answer an instruction rather than simply continue the text.

You will use the `transformers` and `datasets` libraries from Hugging Face to:

1.  Install necessary libraries.
2.  Load the dataset, model, and tokenizer.
3.  Process and tokenize the data for training.
4.  Set up and run the training process.
5.  Save the fine-tuned model.
6.  Load and use the fine-tuned model for inference.

## Dataset: `hakurei/open-instruct-v1`

We will use the `hakurei/open-instruct-v1` dataset for instruction tuning. Understanding its structure is key to preparing the data correctly.

Explore the dataset card here: [https://huggingface.co/datasets/hakurei/open-instruct-v1](https://huggingface.co/datasets/hakurei/open-instruct-v1)

Note the `instruction`, `input`, and `output` columns. You will need to combine these for training.

## 1. Installation

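Install the required libraries: `datasets`, `transformers`, and `torch`. These provide the tools for data handling, model management, and tensor computation. Recent versions of `transformers` also need the `accelerate` package to run the `Trainer`, so install it as well if it is not already present.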

In [None]:
# TODO: Install datasets, transformers, and torch using pip
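
In a notebook, the install itself is usually a single `%pip install datasets transformers torch` line. The sketch below is a complementary check, not part of the original workbook: it reports which packages are missing and prints a matching pip command. The helper names `missing_packages` and `pip_install_command` are hypothetical, and listing `accelerate` is an assumption based on what recent `transformers` releases need for the `Trainer`.

```python
import importlib.util
import sys

# The three libraries named above, plus `accelerate`.
# Assumption: recent transformers releases need `accelerate` for the Trainer.
REQUIRED = ["datasets", "transformers", "torch", "accelerate"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pip_install_command(names):
    """Build a pip command that targets the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", *names]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages. Install them with:")
    print(" ".join(pip_install_command(missing)))
else:
    print("All required packages are already installed.")
```

Building the command from `sys.executable` keeps the install tied to the same Python environment that the notebook kernel uses.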

## 2. Load Dataset, Model, and Tokenizer

Load the `hakurei/open-instruct-v1` dataset (use a small slice of the training split, such as `train[:1%]`, to keep loading and training fast), the GPT-2 tokenizer, and the pre-trained `gpt2` model.

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM


# TODO: Load a small subset of the 'hakurei/open-instruct-v1' dataset (e.g., train[:1%])
print("Loading dataset...")
dataset = None # Replace with your code
print("Dataset loaded.")

# TODO: Load the GPT-2 tokenizer from 'gpt2'
print("Loading tokenizer...")
tokenizer = None # Replace with your code
# TODO: Set the tokenizer's pad_token to its eos_token

print("Tokenizer loaded.")

# TODO: Load the pre-trained GPT-2 model for causal language modeling from 'gpt2'
print("Loading model...")
model = None # Replace with your code
print("Model loaded.")
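
One possible way to fill in this cell is sketched below. It is a minimal sketch rather than the reference solution: `percent_slice` and `load_components` are hypothetical helper names, and the library imports sit inside the function only so that defining the helpers does not trigger any downloads.

```python
def percent_slice(split="train", percent=1):
    """Build a slice string such as 'train[:1%]' for datasets.load_dataset."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return f"{split}[:{percent}%]"


def load_components(dataset_name="hakurei/open-instruct-v1",
                    model_name="gpt2", percent=1):
    """Load a slice of the dataset plus the GPT-2 tokenizer and model."""
    # Imported here so defining the helper does not trigger any downloads.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    dataset = load_dataset(dataset_name, split=percent_slice("train", percent))

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # GPT-2 ships without a padding token, so reuse the end-of-sequence token.
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(model_name)
    return dataset, tokenizer, model


# Usage (downloads data and weights on first run):
# dataset, tokenizer, model = load_components()
```

Passing the slice through the `split` argument means `load_dataset` returns a single `Dataset` rather than a dictionary of splits, which keeps the later `.map()` call simple.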

## 3. Tokenization and Data Formatting

Create a function to tokenize the dataset. This function should combine the `instruction`, `input`, and `output` columns into a single text string for each example, adding the `tokenizer.eos_token` at the end. Use a format like `Instruction: ...\nInput: ...\nOutput: ...`.

Apply this function to the dataset using `.map()`, ensuring you process in batches (`batched=True`) and remove the original text columns (`remove_columns`). Define a `DataCollatorForLanguageModeling` for preparing training batches.

In [None]:
from transformers import DataCollatorForLanguageModeling

# TODO: Define the tokenize function that combines 'instruction', 'input', and 'output'
#       and tokenizes the result. Remember to add eos_token and handle padding/truncation.
def tokenize(batch):
    pass # Replace with your code

# TODO: Apply the tokenization function to the dataset using .map()
#       Ensure batched=True and remove the original columns.
print("Tokenizing dataset...")
tokenized_dataset = None # Replace with your code
print("Dataset tokenized and formatted.")

# TODO: Define a DataCollatorForLanguageModeling with mlm=False
data_collator = None # Replace with your code
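
The formatting and tokenization steps described above could be sketched as follows. This is one possible approach under stated assumptions: `format_example` and `make_tokenize_fn` are hypothetical names, and `max_length=512` is an arbitrary choice (GPT-2 accepts up to 1024 tokens). Padding is deliberately left to the data collator, which pads each batch to its longest sequence.

```python
def format_example(instruction, input_text, output, eos_token):
    """Join one example into 'Instruction / Input / Output' text ending in EOS."""
    return (
        f"Instruction: {instruction}\n"
        f"Input: {input_text}\n"
        f"Output: {output}{eos_token}"
    )


def make_tokenize_fn(tokenizer, max_length=512):
    """Return a batched function suitable for dataset.map(..., batched=True)."""

    def tokenize(batch):
        # With batched=True, each column arrives as a list of values.
        texts = [
            format_example(instruction, input_text, output, tokenizer.eos_token)
            for instruction, input_text, output in zip(
                batch["instruction"], batch["input"], batch["output"]
            )
        ]
        # Truncate long examples; padding is left to the data collator,
        # which pads each training batch to its longest sequence.
        return tokenizer(texts, truncation=True, max_length=max_length)

    return tokenize


# Usage with the objects from step 2 (variable names assumed from that cell):
# tokenized_dataset = dataset.map(
#     make_tokenize_fn(tokenizer),
#     batched=True,
#     remove_columns=dataset.column_names,
# )
# data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```

Note that truncation can cut off the trailing EOS token on examples longer than `max_length`, so a larger limit may be worth the extra memory if many examples are long.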

## 4. Training Setup

Define the `TrainingArguments` for your fine-tuning run. Specify `output_dir`, `per_device_train_batch_size`, `num_train_epochs`, `logging_steps`, `save_steps`, and `dataloader_num_workers`, and enable `fp16` (mixed-precision training) if you are running on a CUDA GPU.

Then, initialize the `Trainer` with your model, arguments, tokenized dataset, and data collator.
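
When choosing `logging_steps` and `save_steps`, it can help to estimate how many optimizer steps the run will take. The sketch below assumes a single device and no gradient accumulation; the function name and the example numbers are hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical numbers for illustration only; use len(tokenized_dataset) in practice.
steps = total_optimizer_steps(num_examples=5000, batch_size=8, num_epochs=1)
print(steps)  # 625
```

With these numbers, a `save_steps` of 500 reaches only one multiple during the run (step 500), while a value above 625 would never be reached before training ends.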

In [None]:
from transformers import Trainer, TrainingArguments

# TODO: Define the TrainingArguments: output directory, batch size, epochs,
#       logging steps, save steps, fp16, and dataloader_num_workers
args = None # Replace with your code, e.g. TrainingArguments(output_dir=..., ...)

# TODO: Initialize the Trainer
trainer = None # Replace with your code

print("Trainer initialized.")
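
A minimal sketch of this setup is shown below. All hyperparameter values and the `gpt2-instruct` directory name are placeholders rather than recommendations from the workbook, and `training_kwargs` and `build_trainer` are hypothetical helper names. `fp16` is switched on only when a CUDA device is available, because requesting it on a CPU-only machine raises an error.

```python
def training_kwargs(output_dir="gpt2-instruct", use_gpu=False):
    """Collect keyword arguments for TrainingArguments; values are placeholders."""
    return {
        "output_dir": output_dir,  # hypothetical directory name
        "per_device_train_batch_size": 8,
        "num_train_epochs": 1,
        "logging_steps": 50,
        "save_steps": 500,
        "fp16": use_gpu,  # mixed precision needs a CUDA GPU
        "dataloader_num_workers": 2,
    }


def build_trainer(model, tokenized_dataset, data_collator,
                  output_dir="gpt2-instruct"):
    """Create a Trainer from the objects prepared in the earlier steps."""
    # Imported lazily so this helper can be defined before the libraries load.
    import torch
    from transformers import Trainer, TrainingArguments

    kwargs = training_kwargs(output_dir, use_gpu=torch.cuda.is_available())
    args = TrainingArguments(**kwargs)
    return Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_dataset,
        data_collator=data_collator,
    )


# Usage:
# trainer = build_trainer(model, tokenized_dataset, data_collator)
# args = trainer.args  # keeps args.output_dir available for later cells
```

Keeping the hyperparameters in a plain dictionary makes them easy to print or log alongside the results of a run.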

## 5. Train the Model

Start the training process by calling the `trainer.train()` method.

In [None]:
# TODO: Start the training loop
print("Starting training...")
pass # Replace with your code
print("Training finished.")

## 6. Save the Fine-Tuned Model and Tokenizer

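The Trainer writes intermediate checkpoints to `checkpoint-<step>` subdirectories of the output directory, and these may not include the tokenizer files. Explicitly save the final model and the tokenizer so that everything needed for inference is in your output directory.

In [None]:
# TODO: Explicitly save the final model (trainer.save_model) and the tokenizer
#       (tokenizer.save_pretrained) to the output directory (args.output_dir)
pass # Replace with your code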
print(f"Model and tokenizer saved to {args.output_dir}")
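
The saving step might look like the sketch below, which relies only on `Trainer.save_model` and the tokenizer's `save_pretrained`. The helper names are hypothetical. The second function is an optional extra for locating the most recent `checkpoint-<step>` directory written by the Trainer; `transformers.trainer_utils.get_last_checkpoint` offers similar behavior.

```python
import re
from pathlib import Path


def save_for_inference(trainer, tokenizer, output_dir):
    """Write the final model weights and the tokenizer files to output_dir."""
    trainer.save_model(output_dir)  # model weights and config
    tokenizer.save_pretrained(output_dir)  # vocabulary and tokenizer config


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for path in Path(output_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Usage with the objects from the earlier cells:
# save_for_inference(trainer, tokenizer, args.output_dir)
# print(latest_checkpoint(args.output_dir))
```

Comparing the step numbers as integers matters here, because a plain string sort would rank `checkpoint-500` above `checkpoint-1000`.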

## 7. Load and Use the Fine-Tuned Model for Inference

Load your fine-tuned model and tokenizer from the `output_dir`. Create a `transformers` pipeline for text generation. Format your input prompt for inference consistently with your training data format.

In [None]:
from transformers import pipeline

# TODO: Define the directory to load from: the output directory itself if you saved
#       the final model there, or a checkpoint-xxxx subdirectory of it
output_dir = None # Replace with your code

# TODO: Load the tokenizer and model from the output directory
print(f"Loading model and tokenizer from {output_dir} for inference...")
loaded_tokenizer = None # Replace with your code
loaded_model = None # Replace with your code

# TODO: Create a text generation pipeline using the loaded model and tokenizer. Specify the device.
generator = None # Replace with your code

# TODO: Define an input prompt formatted like your training data (e.g., Instruction: ...\nOutput:)
prompt = None # Replace with your code

# TODO: Generate text using the pipeline. Adjust max_length (which also counts
#       the prompt tokens) or max_new_tokens as needed.
print(f"Generating text with prompt:\n{prompt}")
generated_output = None # Replace with your code

print(f"\nGenerated Text:\n{generated_output}")

# Optional: Post-process the output to remove the prompt part
# TODO: Implement optional post-processing to get just the generated response
pass # Replace with your code
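
The inference steps above could be sketched as follows, assuming the model was trained on the `Instruction: ...\nInput: ...\nOutput: ...` format. The helper names and the `max_new_tokens=64` default are hypothetical choices. The prompt stops right after `Output:` with no trailing space, which matches how the training text continues from that point.

```python
def build_prompt(instruction, input_text=""):
    """Format a prompt like the training text, stopping right after 'Output:'."""
    return f"Instruction: {instruction}\nInput: {input_text}\nOutput:"


def extract_response(generated_text, prompt):
    """Drop the echoed prompt so only the model's continuation remains."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


def generate_response(model_dir, instruction, input_text="", max_new_tokens=64):
    """Load a saved model directory and answer one instruction."""
    # Imported lazily so the pure helpers above work without these libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        device=0 if torch.cuda.is_available() else -1,  # first GPU, else CPU
    )
    prompt = build_prompt(instruction, input_text)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
    )
    # The pipeline returns a list of dicts; the text includes the prompt.
    return extract_response(outputs[0]["generated_text"], prompt)


# Usage (the directory must contain both model and tokenizer files):
# print(generate_response(output_dir, "Summarize the text.", "GPT-2 is a language model."))
```

Using `max_new_tokens` bounds only the generated continuation, whereas `max_length` also counts the prompt tokens.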
