# GPT-2 Lyrics Generation: LoRA Fine-Tuning, ONNX Export, and Local Gradio Deployment

This notebook fine-tunes **GPT-2** on a song lyrics dataset, exports a lightweight **ONNX** model, deploys a local **Gradio** app, and evaluates output quality using **perplexity**, **BLEU**, and qualitative review.

**Key outputs (saved to `./artifacts/<run_id>/`):** run metrics, training logs, tuning sweep results, ONNX export info, and BLEU samples.


## 1. Environment setup
Install libraries and set environment variables.


In [1]:
!pip install datasets transformers peft accelerate evaluate nltk onnx onnxruntime gradio optimum[onnxruntime] pandas



In [2]:
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

'1'

## 2. Imports and reproducibility
Imports, random seeds, and artifact/output folders.


In [3]:
import torch
from pathlib import Path
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    pipeline,
)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

### Reproducibility and artifact folders
This section sets seeds and creates an `artifacts/<run_id>/` folder to store logs, metrics, and evaluation outputs.


In [4]:
import json, math, random, time
from datetime import datetime

# Reproducibility
SEED = 42
random.seed(SEED)
try:
    import numpy as np
    np.random.seed(SEED)
except Exception:
    pass
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Artifact folders
ARTIFACT_ROOT = Path("./artifacts")
ARTIFACT_ROOT.mkdir(exist_ok=True)

RUN_ID = datetime.now().strftime("%Y%m%d_%H%M%S")
RUN_DIR = ARTIFACT_ROOT / RUN_ID
RUN_DIR.mkdir(parents=True, exist_ok=True)

ADAPTER_DIR = Path("./gpt2-lyrics-lora-adapter")
MERGED_DIR  = Path("./gpt2-lyrics-merged")
ONNX_DIR    = Path("./gpt2-lyrics-onnx")

print("RUN_DIR:", RUN_DIR.resolve())

RUN_DIR: C:\Users\shweiss\Downloads\artifacts\20260221_113232


## 3. Dataset selection and preprocessing
Load the lyrics dataset, select the text field, clean rows, and tokenize for GPT-2.


In [5]:
dataset = load_dataset("halaction/song-lyrics", split="train[:1000]")
print(dataset.column_names)
print(dataset[0])

['lyrics', 'genre']
{'lyrics': "[Intro: Method Man w/ sample] + (Sunny valentine). We got butter (8X). (The gun'll go the gun'll go.... The gun'll go...). [Raekwon]. Aiyo one thing for sure keep you of all. Keep a nice crib fly away keep to the point. Keep niggaz outta ya face who snakes. Keep bitches in they place keep the mac in a special place. Keep moving for papes keep cool keep doing what you doing. Keep it fly keep me in the crates. Cuz I will erase shit on the real note you'se a waste. It's right here for you I will lace you. Rip you and brace you put a nice W up on ya face. Word to mother you could get chased. It's nothing to taste blood on a thug if he gotta go. All I know is we be giving grace. This is a place from where we make tapes. We make 'em everywhere still in all we be making base. Y'all be making paste these little niggaz they be making shapes. Our shit is art yours is traced. [Chorus: Sunny Valentine]. This is the way that we rolling in the streets. You know when w

In [6]:
candidate_columns = ["lyrics", "text", "song", "content"]
text_col = next((c for c in candidate_columns if c in dataset.column_names), dataset.column_names[0])

dataset = dataset.filter(lambda x: x[text_col] is not None and x[text_col].strip() != "")
dataset = dataset.select_columns([text_col])
print("Using text column:", text_col)
print("Rows:", len(dataset))

Filter:   0%|          | 0/1000 [00:00<?, ? examples/s]

Using text column: lyrics
Rows: 1000


In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 pad token fix
max_length = 128

In [8]:

def tokenize_function(batch):
    # Keep a copy of the original text for later evaluation (BLEU / qualitative checks)
    raw_texts = batch[text_col]
    texts = [t + tokenizer.eos_token for t in raw_texts]
    tokenized = tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
    tokenized["labels"] = tokenized["input_ids"].copy()  # Causal LM labels
    tokenized["raw_text"] = raw_texts
    return tokenized

tokenized_data = dataset.map(tokenize_function, batched=True, remove_columns=[text_col])
tokenized_data.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

splits = tokenized_data.train_test_split(test_size=0.1, seed=42)
train_dataset = splits["train"]
eval_dataset = splits["test"]

print(train_dataset[0].keys())

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

dict_keys(['input_ids', 'attention_mask', 'labels'])


## 4. LoRA fine-tuning and training optimization
Apply LoRA, configure training with weight decay and early stopping, and train/evaluate.


In [9]:

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj"],
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 811,008 || all params: 125,250,816 || trainable%: 0.6475




### Hyperparameter tuning evidence
To satisfy the tuning requirement, I ran a small learning-rate sweep (2 quick trials) and recorded eval loss and perplexity for comparison.


In [10]:
import pandas as pd

RUN_SWEEP = True      # set False to skip
SWEEP_ROWS = 200      # keep small for speed
SWEEP_EPOCHS = 1

# Two simple variants (learning rate). You can add more if you want.
SWEEP_CONFIGS = [
    {"name": "lr_2e-4", "learning_rate": 2e-4},
    {"name": "lr_1e-4", "learning_rate": 1e-4},
]

sweep_results = []

def run_quick_trial(cfg):
    # fresh base model each trial
    base = GPT2LMHeadModel.from_pretrained("gpt2")
    base.resize_token_embeddings(len(tokenizer))
    base.config.pad_token_id = tokenizer.pad_token_id

    # same LoRA config (you can also sweep r/alpha if you want)
    trial_lora = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_config.r,
        lora_alpha=lora_config.lora_alpha,
        lora_dropout=lora_config.lora_dropout,
        target_modules=lora_config.target_modules,
        bias="none",
    )
    trial_model = get_peft_model(base, trial_lora)

    # tiny subset for speed
    tiny_train = train_dataset.select(range(min(SWEEP_ROWS, len(train_dataset))))
    tiny_eval  = eval_dataset.select(range(min(int(SWEEP_ROWS*0.25), len(eval_dataset))))

    trial_args = dict(
        output_dir=str(RUN_DIR / f"sweep_{cfg['name']}"),
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=cfg["learning_rate"],
        num_train_epochs=SWEEP_EPOCHS,
        weight_decay=0.01,
        logging_steps=10,
        save_strategy="no",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )

    try:
        ta = TrainingArguments(evaluation_strategy="epoch", **trial_args)
    except TypeError:
        ta = TrainingArguments(eval_strategy="epoch", **trial_args)

    t = Trainer(
        model=trial_model,
        args=ta,
        train_dataset=tiny_train,
        eval_dataset=tiny_eval,
        tokenizer=tokenizer,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    )

    start = time.time()
    t.train()
    m = t.evaluate()
    secs = time.time() - start

    eval_loss = float(m.get("eval_loss", float("nan")))
    ppl = float(math.exp(eval_loss)) if eval_loss == eval_loss else float("nan")

    return {
        "name": cfg["name"],
        "learning_rate": cfg["learning_rate"],
        "eval_loss": eval_loss,
        "perplexity": ppl,
        "seconds": round(secs, 2),
    }

if RUN_SWEEP:
    for cfg in SWEEP_CONFIGS:
        sweep_results.append(run_quick_trial(cfg))

    # Save sweep results
    with open(RUN_DIR / "sweep_results.json", "w") as f:
        json.dump(sweep_results, f, indent=2)

    df_sweep = pd.DataFrame(sweep_results).sort_values("eval_loss")
    display(df_sweep)

    # Pick best LR for final training (lowest eval_loss)
    BEST_LR = float(df_sweep.iloc[0]["learning_rate"])
else:
    BEST_LR = 2e-4  # your original setting

print("BEST_LR selected for final run:", BEST_LR)

  t = Trainer(
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.
  super().__init__(loader)
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
1,3.8212,3.358577


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.


Epoch,Training Loss,Validation Loss
1,3.8358,3.372452


Unnamed: 0,name,learning_rate,eval_loss,perplexity,seconds
0,lr_2e-4,0.0002,3.358577,28.748253,278.73
1,lr_1e-4,0.0001,3.372452,29.149902,286.1


BEST_LR selected for final run: 0.0002


In [11]:

common_args = dict(
    output_dir="./gpt2-lyrics-lora",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=float(BEST_LR),  # from quick sweep above
    num_train_epochs=5,
    weight_decay=0.01,  # required regularization
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    logging_steps=20,
    fp16=torch.cuda.is_available(),
    report_to="none",
)

# transformers version compatibility:
# - older versions use evaluation_strategy
# - newer versions use eval_strategy
try:
    training_args = TrainingArguments(evaluation_strategy="epoch", **common_args)
except TypeError:
    training_args = TrainingArguments(eval_strategy="epoch", **common_args)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [12]:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

  trainer = Trainer(


In [13]:

train_start = time.time()
train_output = trainer.train()
train_seconds = time.time() - train_start

metrics = trainer.evaluate()
eval_loss = float(metrics.get("eval_loss", float("nan")))
perplexity = float(math.exp(eval_loss)) if eval_loss == eval_loss else float("nan")

print("Eval metrics:", metrics)
print("Perplexity:", perplexity)
print("Training seconds:", round(train_seconds, 2))

# Save logs + metrics for submission
run_summary = {
    "run_id": RUN_ID,
    "timestamp_local": datetime.now().isoformat(timespec="seconds"),
    "base_model": "gpt2",
    "dataset": "halaction/song-lyrics (train[:1000])",
    "seed": SEED,
    "train_rows": len(train_dataset),
    "eval_rows": len(eval_dataset),
    "max_length": max_length,
    "lora_config": {
        "r": lora_config.r,
        "lora_alpha": lora_config.lora_alpha,
        "lora_dropout": lora_config.lora_dropout,
        "target_modules": list(lora_config.target_modules),
    },
    "training_args": training_args.to_dict(),
    "eval_metrics": metrics,
    "perplexity": perplexity,
    "train_seconds": round(train_seconds, 2),
}

with open(RUN_DIR / "run_metrics.json", "w") as f:
    json.dump(run_summary, f, indent=2)

with open(RUN_DIR / "trainer_log_history.json", "w") as f:
    json.dump(trainer.state.log_history, f, indent=2)

print("Saved:", (RUN_DIR / "run_metrics.json").resolve())
print("Saved:", (RUN_DIR / "trainer_log_history.json").resolve())

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.
  super().__init__(loader)


Epoch,Training Loss,Validation Loss
1,3.6836,3.361988
2,3.5959,3.334132
3,3.5443,3.31934
4,3.5417,3.31339
5,3.5285,3.311978


  super().__init__(loader)
  super().__init__(loader)
  super().__init__(loader)
  super().__init__(loader)
  super().__init__(loader)


Eval metrics: {'eval_loss': 3.3119781017303467, 'eval_runtime': 54.1203, 'eval_samples_per_second': 1.848, 'eval_steps_per_second': 0.462, 'epoch': 5.0}
Perplexity: 27.43934964877291
Training seconds: 6075.38
Saved: C:\Users\shweiss\Downloads\artifacts\20260221_113232\run_metrics.json
Saved: C:\Users\shweiss\Downloads\artifacts\20260221_113232\trainer_log_history.json


## 5. Save artifacts and export to ONNX
Save the LoRA adapter + merged model, then export to ONNX for lightweight inference.


In [14]:

ADAPTER_DIR.mkdir(parents=True, exist_ok=True)
MERGED_DIR.mkdir(parents=True, exist_ok=True)

# Save LoRA adapter
trainer.model.save_pretrained(ADAPTER_DIR)
tokenizer.save_pretrained(ADAPTER_DIR)

# Merge LoRA into base weights for easier inference + ONNX export
merged_model = model.merge_and_unload()
merged_model.save_pretrained(MERGED_DIR)
tokenizer.save_pretrained(MERGED_DIR)

# Copy paths into run folder for neat submission packaging
with open(RUN_DIR / "artifact_paths.json", "w") as f:
    json.dump(
        {
            "adapter_dir": str(ADAPTER_DIR.resolve()),
            "merged_dir": str(MERGED_DIR.resolve()),
        },
        f,
        indent=2,
    )

print("Saved adapter:", ADAPTER_DIR.resolve())
print("Saved merged model:", MERGED_DIR.resolve())

Saved adapter: C:\Users\shweiss\Downloads\gpt2-lyrics-lora-adapter
Saved merged model: C:\Users\shweiss\Downloads\gpt2-lyrics-merged


In [15]:

from optimum.onnxruntime import ORTModelForCausalLM

ONNX_DIR.mkdir(parents=True, exist_ok=True)

# Export merged PyTorch model -> ONNX + load as ORT model
ort_model = ORTModelForCausalLM.from_pretrained(str(MERGED_DIR), export=True)
ort_model.save_pretrained(ONNX_DIR)
tokenizer.save_pretrained(ONNX_DIR)

# Record ONNX artifact path
with open(RUN_DIR / "onnx_export.json", "w") as f:
    json.dump({"onnx_dir": str(ONNX_DIR.resolve())}, f, indent=2)

print("ONNX export complete:", ONNX_DIR.resolve())

`torch_dtype` is deprecated! Use `dtype` instead!
  if not self.is_initialized or self.keys.numel() == 0:
  if (padding_length := kv_length + kv_offset - attention_mask.shape[-1]) > 0:
  if padding_mask is not None and padding_mask.shape[-1] > kv_length:
  return opset9.index(g, self, index)
Found different candidate ONNX initializers (likely duplicate) for the tied weights:
	lm_head.weight: {'onnx::MatMul_3300'}
	transformer.wte.weight: {'transformer.wte.weight'}


ONNX export complete: C:\Users\shweiss\Downloads\gpt2-lyrics-onnx


In [16]:

from transformers import AutoModelForCausalLM

# Load PyTorch merged model for fallback
pt_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pt_model = AutoModelForCausalLM.from_pretrained(MERGED_DIR).to(pt_device)
pt_model.eval()

# Load ONNX Runtime model if present
try:
    from optimum.onnxruntime import ORTModelForCausalLM
    ort_model = ORTModelForCausalLM.from_pretrained(ONNX_DIR)
    ONNX_AVAILABLE = True
except Exception as e:
    ort_model = None
    ONNX_AVAILABLE = False
    print("ONNX model not available yet (run export cell first). Details:", e)

def _generate_with_model(model_obj, prompt, max_new_tokens=80, temperature=0.9, top_p=0.95):
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]
    attention_mask = inputs.get("attention_mask", None)

    # ORT models run on CPU by default; PyTorch model uses pt_device
    if isinstance(model_obj, torch.nn.Module):
        input_ids = input_ids.to(pt_device)
        if attention_mask is not None:
            attention_mask = attention_mask.to(pt_device)

    with torch.no_grad():
        out_ids = model_obj.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=int(max_new_tokens),
            do_sample=True,
            temperature=float(temperature),
            top_p=float(top_p),
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)

def generate_lyrics(prompt, backend="onnx", max_new_tokens=80, temperature=0.9, top_p=0.95):
    backend = (backend or "onnx").lower()
    if backend == "onnx" and ONNX_AVAILABLE:
        return _generate_with_model(ort_model, prompt, max_new_tokens, temperature, top_p)
    return _generate_with_model(pt_model, prompt, max_new_tokens, temperature, top_p)

print("ONNX_AVAILABLE:", ONNX_AVAILABLE)

ONNX_AVAILABLE: True


## 6. Local deployment with Gradio
Run a simple local UI to generate lyrics from prompts (PyTorch fallback, ONNX when available).


In [14]:
import os
import gradio as gr

# Ensure localhost bypasses proxies (helps in some environments)
for k in ["NO_PROXY", "no_proxy"]:
    cur = os.environ.get(k, "")
    add = "127.0.0.1,localhost"
    if add not in cur:
        os.environ[k] = (cur + "," if cur else "") + add

def gr_generate(backend, prompt, max_new_tokens, temperature, top_p):
    prompt = (prompt or "").strip()
    if not prompt:
        return "Please enter a prompt (even a short phrase)."
    return generate_lyrics(
        prompt=prompt,
        backend=backend,
        max_new_tokens=int(max_new_tokens),
        temperature=float(temperature),
        top_p=float(top_p),
    )

demo = gr.Interface(
    fn=gr_generate,
    inputs=[
        gr.Dropdown(choices=["onnx", "pytorch"], value="onnx", label="Backend"),
        gr.Textbox(lines=3, label="Prompt", placeholder="Type a verse starter, hook, or first line..."),
        gr.Slider(20, 200, value=80, step=1, label="max_new_tokens"),
        gr.Slider(0.1, 1.5, value=0.9, step=0.05, label="temperature"),
        gr.Slider(0.5, 1.0, value=0.95, step=0.01, label="top_p"),
    ],
    outputs=gr.Textbox(lines=12, label="Generated Lyrics"),
    title="GPT-2 Lyrics Generator (LoRA Fine-Tuned)",
    description="Runs locally. ONNX backend is recommended when available; PyTorch is the fallback.",
    flagging_mode="never",
)

PORT = 54345
print(f"Launching Gradio on: http://127.0.0.1:{PORT}/  (leave this cell running)")
demo.launch(server_name="127.0.0.1", server_port=PORT, share=False, show_error=True)


Launching Gradio on: http://127.0.0.1:54345/  (leave this cell running)
* Running on local URL:  http://127.0.0.1:54345
* To create a public link, set `share=True` in `launch()`.


## 7. Evaluation: BLEU + qualitative review
Compute BLEU on held-out continuations and document qualitative observations.


In [15]:
import json
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def split_prompt_reference(text, prompt_words=18, ref_max_words=80):
    words = (text or "").split()
    if len(words) < prompt_words + 5:
        prompt = " ".join(words[: max(5, len(words)//2)])
        ref = " ".join(words[len(prompt.split()):])
        return prompt, ref
    prompt = " ".join(words[:prompt_words])
    ref = " ".join(words[prompt_words:prompt_words + ref_max_words])
    return prompt, ref

NUM_BLEU_SAMPLES = 10
PROMPT_WORDS = 18

# Robust raw_text retrieval even if you used set_format(type="torch", ...)
try:
    eval_dataset.reset_format()
except Exception:
    pass

if "raw_text" in getattr(eval_dataset, "column_names", []):
    eval_texts = eval_dataset.data.column("raw_text").to_pylist()
else:
    eval_texts = []

samples = []
bleu_scores = []
smooth = SmoothingFunction().method1

for i in range(min(NUM_BLEU_SAMPLES, len(eval_texts))):
    text = eval_texts[i]
    prompt, reference_text = split_prompt_reference(text, prompt_words=PROMPT_WORDS)

    if not reference_text.strip():
        continue

    generated_full = generate_lyrics(prompt, backend="onnx", max_new_tokens=60)

    if generated_full.lower().startswith(prompt.lower()):
        continuation = generated_full[len(prompt):].strip()
    else:
        continuation = generated_full.strip()

    reference_tokens = [reference_text.split()]
    candidate_tokens = continuation.split()

    bleu = sentence_bleu(reference_tokens, candidate_tokens, smoothing_function=smooth)
    bleu_scores.append(float(bleu))

    samples.append(
        {
            "i": i,
            "prompt": prompt,
            "reference_continuation": reference_text,
            "generated_full": generated_full,
            "generated_continuation": continuation,
            "bleu": float(bleu),
        }
    )

avg_bleu = float(sum(bleu_scores) / len(bleu_scores)) if bleu_scores else float("nan")
print("Average BLEU (held-out continuation):", avg_bleu)

with open(RUN_DIR / "bleu_samples.json", "w") as f:
    json.dump({"avg_bleu": avg_bleu, "n": len(samples), "samples": samples}, f, indent=2)

print("Saved:", (RUN_DIR / "bleu_samples.json").resolve())

print("\nQualitative checklist:")
print("- Coherence: does it stay on a consistent theme?")
print("- Relevance: does it continue the prompt naturally?")
print("- Creativity: imagery and phrasing variety?")
print("- Fluency: grammar/readability?")
print("- Repetition: does it loop? If yes, adjust top_p, temperature, repetition_penalty.")


Average BLEU (held-out continuation): 0.004015177173907737
Saved: C:\Users\shweiss\Downloads\artifacts\manual_run\bleu_samples.json

Qualitative checklist:
- Coherence: does it stay on a consistent theme?
- Relevance: does it continue the prompt naturally?
- Creativity: imagery and phrasing variety?
- Fluency: grammar/readability?
- Repetition: does it loop? If yes, adjust top_p, temperature, repetition_penalty.
