[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-09/exercise-04.ipynb)

# Fine-tuning GPT-2 on Poetry Dataset

## Goal
Fine-tune GPT-2 on ~10M+ poetry tokens in Colab with:
- Strong stylistic consistency
- No overfitting
- Stable training
- Clean reproducibility

## Assumptions
- Colab with GPU (T4 / A100)
- Gutenberg Poetry Corpus (`gutenberg-poetry.ndjson.gz`) downloaded from [aparrish/gutenberg-poetry-corpus](https://github.com/aparrish/gutenberg-poetry-corpus)
- The compressed file is available in your Colab filesystem (e.g. `/content/gutenberg-poetry.ndjson.gz`)

## Approach
1. Load + clean data
2. Build tokenizer (GPT-2 base)
3. Proper train/val split
4. Chunk into blocks (256 context)
5. Fine-tune GPT-2 with proper schedule
6. Monitor overfitting
7. Generate samples

This is how I would structure it in production.

In [1]:
%pip install -U -q transformers accelerate bitsandbytes datasets

## 1️⃣ Setup Environment

In [1]:
import torch
import numpy as np
import pandas as pd
import random
import os
import gzip
import json
import requests

from datasets import Dataset
from transformers import (
    GPT2TokenizerFast,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

Device: cuda


## 2️⃣ Load Gutenberg Poetry Corpus

We'll download the compressed NDJSON file directly from the Gutenberg Poetry Corpus and cache it in the Colab filesystem.

In [2]:
# Path where we'll store the Gutenberg Poetry Corpus
corpus_path = "/content/gutenberg-poetry.ndjson.gz"

# Official corpus URL from Allison Parrish's Gutenberg Poetry Corpus
# See: https://github.com/aparrish/gutenberg-poetry-corpus
corpus_url = "https://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz"

# Download once if not already present
if not os.path.exists(corpus_path):
    print("Downloading Gutenberg Poetry Corpus from:", corpus_url)
    resp = requests.get(corpus_url, stream=True)
    resp.raise_for_status()
    with open(corpus_path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)
    print("Download complete.")
else:
    print("Found existing corpus file at", corpus_path)

# Load lines from the compressed NDJSON corpus
lines = []
with gzip.open(corpus_path, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        obj = json.loads(line)
        s = obj.get("s", "").strip()
        if not s:
            continue
        lines.append(s)

print("Total lines loaded:", len(lines))

# We use `texts` to keep the rest of the pipeline unchanged
texts = lines

Found existing corpus file at /content/gutenberg-poetry.ndjson.gz
Total lines loaded: 3085117
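
To make the record layout concrete, here is a minimal sketch that parses the same kind of gzip-compressed NDJSON from an in-memory buffer, so it runs without the download. The sample records are made up, `read_verse_lines` is a hypothetical helper name, and the `gid` field (the source book's Project Gutenberg ID) is an assumption about the corpus layout; the notebook itself only reads `s`.

```python
import gzip
import io
import json

# Hypothetical records: "s" holds one line of verse, as read by the loader above.
sample_records = [
    {"s": "Moonlight spills across the silent river", "gid": "0"},
    {"s": "   ", "gid": "0"},  # whitespace-only line, skipped by the loader
]

# Write the records into an in-memory gzip stream, one JSON object per line.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")
buffer.seek(0)

def read_verse_lines(fileobj):
    """Yield non-empty verse lines from a gzip-compressed NDJSON stream."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as f:
        for raw in f:
            text = json.loads(raw).get("s", "").strip()
            if text:
                yield text

parsed = list(read_verse_lines(buffer))
print(parsed)
```

A generator like this keeps memory flat while reading; the notebook collects everything into a list because the later steps need the full corpus anyway.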


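## 3️⃣ Clean Data (Minimal but Important)

We preserve formatting, because poetry depends on line breaks. Each record is a single line of verse, so cleaning only normalizes line endings and drops lines of 50 characters or fewer.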

In [3]:
def clean_text(t):
    t = t.strip()
    t = t.replace("\r\n", "\n")
    t = t.replace("\r", "\n")
    return t

# Keep only lines longer than 50 characters; each entry is one line of verse, not a whole poem
texts = [clean_text(t) for t in texts if len(t) > 50]

print("Lines kept:", len(texts))

Lines kept: 324589


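## 4️⃣ Combine into Single Corpus

We join the lines with a blank line between them; the tokenizer step below turns each blank line into a `<STANZA>` marker. Because the corpus stores one verse line per record, the marker separates lines rather than true stanzas.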

In [4]:
corpus = "\n\n".join(texts)

print("Total characters:", len(corpus))

# Check token scale later

Total characters: 19417189
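
A small sketch of what the join and the later `<STANZA>` replacement do to the text. The three sample lines are made up; the join and replace calls mirror the ones used in this notebook.

```python
# Three hypothetical verse lines standing in for entries of `texts`.
sample_lines = [
    "Moonlight spills across the silent river",
    "And every reed bends low to hear it pass",
    "The night keeps watch above the sleeping town",
]

# Same join as the notebook: a blank line between consecutive entries.
joined = "\n\n".join(sample_lines)

# Same replacement as the tokenizer step: each blank line becomes a marker.
marked = joined.replace("\n\n", " <STANZA> \n")

print(marked)
print("separators:", marked.count("<STANZA>"))
```

With N entries you get N - 1 markers, one at the end of every line except the last.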


## 5️⃣ Initialize Tokenizer

In [5]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Add stanza token for stronger structure modeling
tokenizer.add_special_tokens({"additional_special_tokens": ["<STANZA>"]})

# Replace double newlines
corpus = corpus.replace("\n\n", " <STANZA> \n")



## 6️⃣ Tokenize Corpus

In [6]:
import os
import torch

def get_or_build_tokens(corpus, tokenizer, save_path="poetry_tokens.pt"):
    """
    Load tokens from disk if available.
    Otherwise tokenize corpus and save.
    """

    if os.path.exists(save_path):
        print("Loading existing token file...")
        tokens = torch.load(save_path)
        print("Loaded tokens:", tokens.shape[0])
    else:
        print("Tokenizing corpus...")
        tokens = tokenizer(corpus, return_tensors="pt")["input_ids"][0]
        print("Total tokens:", tokens.shape[0])

        torch.save(tokens, save_path)
        print(f"Saved tokens to {save_path}")

    return tokens

In [7]:
tokens = get_or_build_tokens(corpus, tokenizer)
print("Total tokens:", tokens.shape[0])

# The goal is 10M+ tokens; this run yields about 6.15M.
# If you need more, consider:
# - relaxing the len(t) > 50 filter in step 3, which drops most corpus lines
# - concatenating additional poetry datasets

Loading existing token file...
Loaded tokens: 6150837
Total tokens: 6150837


## 7️⃣ Train / Validation Split (Critical)

Never train on 100% of the data. We hold out the last 5% of the token stream for validation.

In [9]:
split_idx = int(0.95 * len(tokens))
train_tokens = tokens[:split_idx]
val_tokens = tokens[split_idx:]

## 8️⃣ Chunk Into Fixed-Length Blocks

We use block size 256 for poetry.

In [10]:
block_size = 256

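def chunk_tokens(token_tensor):
    # Split into non-overlapping blocks of block_size tokens.
    # A trailing partial block is dropped; the "+ 1" keeps the final
    # full block when the length is an exact multiple of block_size.
    examples = []
    for i in range(0, len(token_tensor) - block_size + 1, block_size):
        examples.append(token_tensor[i:i+block_size])
    return examples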

train_examples = chunk_tokens(train_tokens)
val_examples = chunk_tokens(val_tokens)

print("Train chunks:", len(train_examples))
print("Val chunks:", len(val_examples))

Train chunks: 22825
Val chunks: 1201
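
As a sanity check, the chunk counts above can be reproduced from the token total alone. This is plain arithmetic on the numbers printed in this notebook; nothing here touches the model or tokenizer.

```python
total_tokens = 6_150_837  # token count reported by the tokenization step
block_size = 256

# Same 95/5 split as the notebook
split_idx = int(0.95 * total_tokens)
train_len = split_idx
val_len = total_tokens - split_idx

# Full blocks and leftover tokens that do not fill a block
train_blocks, train_leftover = divmod(train_len, block_size)
val_blocks, val_leftover = divmod(val_len, block_size)

print("Train chunks:", train_blocks, "| dropped tokens:", train_leftover)  # 22825, 95
print("Val chunks:", val_blocks, "| dropped tokens:", val_leftover)        # 1201, 86
```

Fewer than 200 tokens are discarded in total, so dropping the partial blocks costs essentially nothing.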


## 9️⃣ Convert to HuggingFace Dataset

In [11]:
train_dataset = Dataset.from_dict({"input_ids": train_examples})
val_dataset = Dataset.from_dict({"input_ids": val_examples})

## 🔟 Load GPT-2 Model

In [18]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50258, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

## 1️⃣1️⃣ Data Collator

In [19]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

## 1️⃣2️⃣ Training Arguments (Expert Settings)

These are tuned for stability + minimal overfitting.

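**Why these values:**
- `learning_rate=5e-5` → a conservative rate that is safe for fine-tuning a pretrained model
- `lr_scheduler_type="cosine"` → cosine decay for smoother convergence
- `warmup_ratio` → a short ramp-up (3% of steps) that prevents early divergence
- `weight_decay` → mild regularization that combats overfitting
- `gradient_accumulation_steps` → a larger effective batch (24 × 8 = 192) for more stable updates
- `load_best_model_at_end` → reloads the checkpoint with the lowest validation loss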
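
To see what linear warmup followed by cosine decay does to the learning rate, here is a minimal pure-Python sketch. It is my own restatement of the usual schedule shape, not code taken from `transformers`; `lr_at_step` is a hypothetical helper, and the step counts (1000 total, 30 warmup) are illustrative.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr=5e-5):
    """Learning rate with linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Illustrative values only
total_steps, warmup_steps = 1000, 30
for step in (0, 15, 30, 515, 1000):
    print(step, f"{lr_at_step(step, total_steps, warmup_steps):.2e}")
```

The rate peaks at `base_lr` exactly when warmup ends, falls to half of it midway through the decay phase, and reaches zero on the last step.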

In [20]:
training_args = TrainingArguments(
    output_dir="poetry-gpt2",
    #overwrite_output_dir=True,

    num_train_epochs=8,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=8,

    gradient_accumulation_steps=8,  # effective batch 192

    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
    logging_steps=50,

    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.03,

    lr_scheduler_type="cosine",

    fp16=True,

    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",

    report_to="none"
)

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
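
If you want to act on that warning, you need the step count that `warmup_ratio=0.03` corresponds to. The sketch below estimates it from the numbers in this notebook. It assumes a single GPU, a dataloader that keeps the final partial batch, and ceiling rounding; the `Trainer` may round slightly differently depending on the version, so treat the result as an estimate.

```python
import math

train_chunks = 22_825   # "Train chunks" printed above
per_device_batch = 24
grad_accum = 8
epochs = 8
warmup_ratio = 0.03

# Mini-batches per epoch, keeping the final partial batch
batches_per_epoch = math.ceil(train_chunks / per_device_batch)   # 952

# Optimizer updates per epoch after gradient accumulation
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)    # 119

total_updates = updates_per_epoch * epochs                       # 952
warmup_steps = math.ceil(total_updates * warmup_ratio)           # 29

print("Total optimizer steps:", total_updates)
print("Equivalent warmup_steps:", warmup_steps)
```

Under these assumptions, passing `warmup_steps=29` in place of `warmup_ratio=0.03` should give the same warmup length.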


## 1️⃣3️⃣ Trainer

In [21]:
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator
)

trainer.train()



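## 📉 Monitoring Overfitting

Watch:
- **training loss**
- **validation loss**

If validation loss:
- **decreases, then increases** → the model is starting to overfit; stop early
- **stays flat** → the learning rate is probably too low
- **explodes** → the learning rate is probably too high

For 10M tokens, 3–5 epochs is usually enough. The configuration above sets `num_train_epochs=8` as an upper bound and relies on `load_best_model_at_end` to restore the checkpoint with the lowest validation loss.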
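
The "decreases, then increases" rule can be automated with a patience counter. The sketch below is a plain-Python illustration with made-up loss values; `best_step_with_patience` is a hypothetical helper. In `transformers`, `EarlyStoppingCallback(early_stopping_patience=...)` passed through the `Trainer`'s `callbacks` argument plays a similar role.

```python
import math

def best_step_with_patience(val_losses, patience=3):
    """Return (best_index, stop_index); stop_index is None if training never stops early."""
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i                 # new best evaluation
        elif i - best_idx >= patience:
            return best_idx, i           # no improvement for `patience` evaluations
    return best_idx, None

# Hypothetical validation losses, one per evaluation
history = [3.90, 3.72, 3.65, 3.61, 3.63, 3.66, 3.70, 3.75]
best, stopped = best_step_with_patience(history, patience=3)

print("Best evaluation index:", best)
print("Stopped at evaluation index:", stopped)
print("Perplexity at best:", round(math.exp(history[best]), 2))
```

Perplexity is just `exp(loss)`, which is often easier to compare across runs than the raw loss.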

## 🎭 Generate Poetry

In [None]:
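def generate(prompt, max_length=200):
    # Tokenize the prompt; this returns input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    model.eval()  # disable dropout while sampling
    output = model.generate(
        **inputs,
        max_length=max_length,  # total length in tokens, prompt included
        do_sample=True,
        temperature=0.9,
        top_p=0.92,
        top_k=50,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id
    )

    # skip_special_tokens=True also removes <STANZA> from the decoded text
    return tokenizer.decode(output[0], skip_special_tokens=True)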

print(generate("Moonlight spills across the silent river\n"))
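
For intuition about the sampling knobs, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering on a toy logit vector. It is my own simplified restatement, not the `transformers` implementation; `filter_distribution` is a hypothetical helper, and the logits and `top_k=3` are illustrative.

```python
import math

def filter_distribution(logits, temperature=0.9, top_k=50, top_p=0.92):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]

    # Top-k: keep the k highest-scoring tokens, best first
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the survivors (shifted by the max for numerical stability)
    peak = max(scaled[i] for i in ranked)
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the remaining probabilities sum to 1
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}

toy_logits = [4.0, 3.0, 1.0, 0.5, -2.0]
dist = filter_distribution(toy_logits, temperature=0.9, top_k=3, top_p=0.92)
print(dist)
```

Lowering `top_p` shrinks the candidate set and makes output more conservative; raising it admits rarer tokens.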

## 🎯 Best Practices Summary

### To prevent overfitting:
- ✅ Proper 95/5 split
- ✅ Cosine decay
- ✅ Weight decay 0.01
- ✅ No more than 5 epochs
- ✅ Load best checkpoint
- ✅ Large effective batch
- ✅ Monitor validation loss
- ✅ Do not crank LR above 1e-4

### To improve poetic quality:
- ✅ Preserve line breaks
- ✅ Add `<STANZA>` token
- ✅ Use temperature 0.8–1.0
- ✅ Adjust top-p for creativity

### If You Want Even Better Stability

Upgrade:
- Use `bitsandbytes` 8-bit Adam (mainly to cut optimizer memory)
- Use gradient clipping (1.0); the `Trainer` already applies `max_grad_norm=1.0` by default
- Use Flash Attention (if GPU supports)
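
Gradient clipping by global norm is simple enough to sketch in a few lines. This is a pure-Python illustration on a flat list of numbers, not the PyTorch implementation; `clip_by_global_norm` is a hypothetical helper and the gradient values are made up.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients down together when their L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale factor is capped at 1.0, so small gradients pass through unchanged
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print("Norm before:", norm_before)
print("Norm after:", round(norm_after, 4))
```

Because every gradient is scaled by the same factor, the update direction is preserved; only its length is limited.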

**Now you have:**
- Industrial-grade fine-tuning setup
- Proper data pipeline
- Anti-overfitting strategy
- Stable optimization
- Controlled generation

This is how I would deploy a stylistically consistent GPT-2 fine-tune in production.