#### GPT-2 Large Fine Tuning with LoRA

#### Project Overview:

*  Dataset Loading and Filtering: Load the merve/poetry dataset and filter for age='Renaissance' and type='love'.
*  Data Preparation: Tokenize the filtered poems and prepare them for input to GPT-2. This will involve creating input IDs and attention masks.
*  Model Loading: Load the pre-trained GPT-2 model.
*  LoRA Configuration: Set up LoRA (Low-Rank Adaptation) for efficient fine-tuning.
*  Fine-tuning: Train the GPT-2 model with LoRA on our filtered poetry dataset.
*  Text Generation (Evaluation): After fine-tuning, generate new poems to assess the model's ability to capture the style of Renaissance love poetry.

#### Python Libraries Needed:

*  transformers: For GPT-2 model, tokenizers, and training utilities.
*  datasets: For loading and managing the merve/poetry dataset.
*  peft: For implementing LoRA.
*  torch: The underlying deep learning framework (Hugging Face Transformers is built on PyTorch).

In [1]:
!pip install safetensors

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [3]:
!pip install accelerate>=0.26.0
!pip install transformers[torch]

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from safetensors.torch import load_file
import os




In [7]:
# --- 1. Configuration ---
MODEL_NAME = "gpt2-large"
DATASET_NAME = "merve/poetry"
OUTPUT_DIR = "./gpt2_large_renaissance_love_poems_lora"
LEARNING_RATE = 2e-4
BATCH_SIZE = 4
NUM_EPOCHS = 3
LORA_R = 8  # LoRA attention dimension
LORA_ALPHA = 16 # Alpha parameter for LoRA scaling
LORA_DROPOUT = 0.05 # Dropout probability for LoRA layers

In [9]:
import os
print(f"Current working directory: {os.getcwd()}")
print(f"Expected save path: {os.path.abspath(OUTPUT_DIR)}")

Current working directory: C:\Users\tterr\IE7374 Project
Expected save path: C:\Users\tterr\IE7374 Project\gpt2_large_renaissance_love_poems_lora


In [11]:
# --- 2. Load and Filter Dataset ---
print(f"Loading dataset: {DATASET_NAME}")
dataset = load_dataset(DATASET_NAME)

def filter_renaissance_love(example):
    return example['age'] == 'Renaissance' and example['type'] == 'Love'

filtered_dataset = dataset['train'].filter(filter_renaissance_love)
print(f"Original dataset size: {len(dataset['train'])}")
print(f"Filtered dataset size (Renaissance Love Poems): {len(filtered_dataset)}")

Loading dataset: merve/poetry


Repo card metadata block was not found. Setting CardData to empty.


Original dataset size: 573
Filtered dataset size (Renaissance Love Poems): 243


In [13]:
# --- 3. Data Preparation (Tokenization) ---
print(f"Loading tokenizer: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token 

# --- tokenize_function ---
def tokenize_function(examples):
    return tokenizer(examples['content'], truncation=True, max_length=512)

tokenized_dataset = filtered_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['author', 'content', 'poem name', 'age', 'type'] 
)

# Create data collator for language modeling (will handle padding and masking)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Loading tokenizer: gpt2-large


In [15]:
# --- 4. Load Model and Configure LoRA ---
print(f"Loading model: {MODEL_NAME}")
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Configure LoRA
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Loading model: gpt2-large
trainable params: 1,474,560 || all params: 775,504,640 || trainable%: 0.19014199579772986




In [17]:
# --- 5. Fine-tuning ---
print("Starting fine-tuning...")

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    per_device_train_batch_size=BATCH_SIZE,
    num_train_epochs=NUM_EPOCHS,
    learning_rate=LEARNING_RATE,
    logging_dir=f"{OUTPUT_DIR}/logs",
    logging_steps=50,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=True,
    fp16=torch.cuda.is_available(),
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

trainer.train()

# Save the LoRA adapters
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Fine-tuning complete! Model and tokenizer saved to {OUTPUT_DIR}")

Starting fine-tuning...


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
50,5.2734
100,4.1882
150,3.959


Fine-tuning complete! Model and tokenizer saved to ./gpt2_large_renaissance_love_poems_lora


In [19]:
# --- 6. Text Generation (Evaluation) ---
print("\n--- Generating Sample Text ---")

# Load the fine-tuned model with LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Apply the LoRA config to the base model to create a PeftModel shell
lora_model = get_peft_model(base_model, lora_config)

# --- LOADING LINE ---
adapter_path = os.path.join(OUTPUT_DIR, "adapter_model.safetensors") # Point to the .safetensors file

if not os.path.exists(adapter_path):
    print(f"Error: LoRA adapter file not found at {adapter_path}.")
    print("Please ensure fine-tuning completed successfully and the file was saved.")
    exit()

# Load the state dictionary from the .safetensors file
lora_state_dict = load_file(adapter_path, device="cpu")

# Load the state dictionary into the PEFT model
lora_model.load_state_dict(lora_state_dict, strict=False)

# Make sure the model is in evaluation mode
lora_model.eval()

prompt = "In gardens fair where roses bloom, my heart doth yearn for thee,"

# --- Encode with attention_mask ---
inputs = tokenizer.encode_plus(
    prompt,
    return_tensors='pt',
    padding='longest',
    truncation=True,
    max_length=tokenizer.model_max_length
)
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

#input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Move to GPU if available
if torch.cuda.is_available():
    lora_model.to('cuda')
    input_ids = input_ids.to('cuda')
    attention_mask = attention_mask.to('cuda') 

print(f"Prompt: {prompt}")

# Generate text
#output = lora_model.generate(
#    input_ids=input_ids,
#    attention_mask=attention_mask,
#    max_length=150, # Keep or adjust as desired
#    num_return_sequences=1,
#    no_repeat_ngram_size=4, # Increase this to 4 or even 5
#    repetition_penalty=1.2, # Add this parameter, typically values like 1.0 to 1.5
#    top_k=50, # Keep or adjust
#    top_p=0.95, # Keep or adjust
#    temperature=0.7, # Keep or adjust
#    do_sample=True,
#    pad_token_id=tokenizer.eos_token_id
#)

output = lora_model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=150,
    num_return_sequences=1,
    no_repeat_ngram_size=4, # Keep this to avoid repetition
    repetition_penalty=1.2, # Keep this to avoid repetition
    num_beams=5,          # Try 3, 4, or 5 beams - common values
    do_sample=False,      # Turn off sampling for pure beam search first
    pad_token_id=tokenizer.eos_token_id
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Poem:\n", generated_text)


--- Generating Sample Text ---
Prompt: In gardens fair where roses bloom, my heart doth yearn for thee,
Generated Poem:
 In gardens fair where roses bloom, my heart doth yearn for thee,

Thou art the sweetest flower in the garden,

And I love thee with all my heart.

I love thee, O sweetest flower,

With all my heart."

"O sweetest flower," said the king, "I know not what thou sayest,

But I know that thou art the most beautiful flower

In the garden, and that I love thee."

So saying, he kissed the flower, and said:

"Sweetest flower, I love thee, and I will be thy husband."

The flower answered: "O king, I know not what to say
