<a href="https://colab.research.google.com/github/tinayiluo0322/Computer-Engineering-Machine-Learning-and-Deep-Neural-Nets-Projects/blob/main/RNN%20and%20Transformers/LabLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune GPT-2 on wiki-text

### Luopeiwen Yi

In [None]:
# for google colab
!pip install transformers
!pip install datasets
!pip install peft


In [None]:
import os
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

from datasets import load_dataset

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

from torch.utils.data import DataLoader
import torch.nn as nn

cuda


## Generate text with GPT2

Using the Hugging Face `transformers` API, load the pre-trained GPT-2 model and generate text.
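
The generation cell below samples with `temperature`, `top_k`, and `top_p`. As a rough illustration of what these three settings do to a next-token distribution, here is a minimal pure-Python sketch. The helper `filter_distribution` and the toy logits are hypothetical, and this is not the Hugging Face implementation, only an approximation of the same idea.

```python
import math

def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: dividing by T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability).
    m = max(scaled[i] for i in ranked)
    exps = {i: math.exp(scaled[i] - m) for i in ranked}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1 before sampling.
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}

# Toy example with made-up logits for a 5-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
dist = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
print(dist)
```

With these toy numbers, top-k first drops the two least likely tokens and top-p then drops a third, so sampling happens over only two candidates. Lowering `temperature` concentrates even more probability on the top token.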

In [None]:
# Load the pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set padding token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

def generate_text(model, tokenizer, prompt, max_length=100):
    """
    Generate text using GPT-2.
    :param model: Pretrained GPT-2 model
    :param tokenizer: GPT-2 tokenizer
    :param prompt: Input text prompt
    :param max_length: Maximum length of generated text
    """

    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask

    # Generate text tokens
    gen_tokens = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        num_return_sequences=1,  # Generate one sequence
        temperature=0.7,  # Values below 1.0 sharpen the distribution (less random output)
        top_k=50,  # Use top-k sampling
        top_p=0.9,  # Use nucleus sampling
        do_sample=True  # Enable sampling
    )

    # Decode generated tokens to text
    gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=True)
    print("Generated Text:\n", gen_text)


In [None]:
# Example usage
generate_text(model, tokenizer, "GPT-2 is a language model based on transformers developed by OpenAI.", max_length=100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a language model based on transformers developed by OpenAI. The goal is to combine the basic programming language with a fast, high-level, high-performance language.

The source code of this project is available at https://github.com/openAI/open-AI.

Please note that this is a test suite, and the code may be unstable, but it is a work in progress and needs to be tested.

Requirements

Python


## Prepare dataset for training

Download the WikiText-2 dataset and prepare it for fine-tuning.
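
To make the preparation step concrete, the sketch below mimics in plain Python what padding and label masking do to one short example. The helper `pad_and_mask` and the toy token ids are hypothetical. The value `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, and `50256` is the GPT-2 end-of-text id that this notebook reuses as the pad token.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
PAD_ID = 50256       # GPT-2 end-of-text id, reused as the pad token in this notebook

def pad_and_mask(token_ids, max_length):
    """Right-pad (or truncate) to max_length and build the attention mask and labels."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)

    input_ids = ids + [PAD_ID] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad

    # Labels copy the inputs, but pad positions are replaced by IGNORE_INDEX
    # so that they contribute nothing to the loss.
    labels = [IGNORE_INDEX if t == PAD_ID else t for t in input_ids]
    return input_ids, attention_mask, labels

# Toy example: three arbitrary token ids padded to length 6.
input_ids, attention_mask, labels = pad_and_mask([464, 2068, 7586], max_length=6)
print(input_ids)
print(attention_mask)
print(labels)
```

One consequence of reusing the end-of-text token for padding is that masking by token id also hides any genuine end-of-text tokens from the loss. Many WikiText rows are empty strings, which become rows of pure padding with no scored tokens at all.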


In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Load the WikiText dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")  # Using raw text version

# Select 10% of the dataset for training and validation
dataset_train = dataset["train"].select(range(len(dataset["train"]) // 10))
dataset_valid = dataset["validation"].select(range(len(dataset["validation"]) // 10))

# Function to tokenize dataset and set labels same as input_ids
def tokenize_function(examples):
    tokenized = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
    tokenized["labels"] = tokenized["input_ids"].copy()  # Labels must be the same as input_ids for causal LM
    return tokenized

# Tokenize the dataset
tokenized_datasets_train = dataset_train.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_datasets_valid = dataset_valid.map(tokenize_function, batched=True, remove_columns=["text"])

# Set format for PyTorch
tokenized_datasets_train.set_format("torch")
tokenized_datasets_valid.set_format("torch")

# Create a DataCollator for training and validation
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # mlm=False for causal LM

# Create DataLoaders
train_dataloader = DataLoader(tokenized_datasets_train, shuffle=True, batch_size=4, collate_fn=data_collator)
valid_dataloader = DataLoader(tokenized_datasets_valid, batch_size=4, collate_fn=data_collator)

# Test DataLoader
for batch in train_dataloader:
    print("Input IDs Shape:", batch['input_ids'].shape)
    print("Attention Mask Shape:", batch['attention_mask'].shape)
    print("Labels Shape:", batch['labels'].shape)
    break

print("DataLoader is working correctly!")


Input IDs Shape: torch.Size([4, 512])
Attention Mask Shape: torch.Size([4, 512])
Labels Shape: torch.Size([4, 512])
DataLoader is working correctly!


## Evaluate perplexity on wiki-text

Before fine-tuning, evaluate the pre-trained GPT-2 model on the wiki-text dataset. Perplexity is a common metric for evaluating language models: the lower the perplexity, the better the model predicts the text. In practice, compute it with the following formula, which is a transformation of the formula from class, where $N$ is the number of predicted tokens:
$PP(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i|\text{context})\right)$
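
As a small worked example of this formula, the sketch below computes perplexity directly from a list of per-token log-probabilities. The function name `perplexity` and the toy probabilities are illustrative only and do not come from the notebook.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood over N predicted tokens."""
    n = len(token_log_probs)
    avg_nll = -sum(token_log_probs) / n
    return math.exp(avg_nll)

# A model that assigns probability 1/4 to every correct token behaves like
# a uniform guess among 4 options, so its perplexity is (approximately) 4.
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))

# A model that is more confident about the correct tokens has lower perplexity.
confident = [math.log(0.9)] * 10
print(perplexity(confident))
```

Read this way, a perplexity of about 43 means the model is on average as uncertain as if it were choosing uniformly among roughly 43 tokens at each step.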

In [None]:
def evaluate_perplexity(model, dataloader):
    model.eval()
    total_loss = 0
    total_length = 0
    loss_fn = nn.CrossEntropyLoss(reduction='sum')  # Sum the loss over all tokens

    with torch.no_grad():
        for batch in dataloader:
            # Move batch to device (GPU if available)
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            # Forward pass
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            logits = outputs.logits

            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()

            # Compute loss
            loss = loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

            total_loss += loss.item()
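            # Count tokens for the average. Note: attention_mask also counts the first
            # token of each sequence, which has no prediction target after the shift,
            # so this denominator is slightly larger than the number of scored tokens.
            total_length += attention_mask.sum().item()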

    # Calculate perplexity
    perplexity = torch.exp(torch.tensor(total_loss / total_length))

    return perplexity.item()

In [None]:
# Evaluate initial perplexity before fine-tuning
perplexity = evaluate_perplexity(model, valid_dataloader)
print(f"Initial perplexity: {perplexity}")

Initial perplexity: 42.9958610534668


## Fine-tune GPT2 on wiki-text



In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"  # Disable W&B logging

In [None]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-wikitext-2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
    prediction_loss_only=True,
    evaluation_strategy="epoch",  # Report validation and training loss every epoch
    logging_dir="./logs",  # Directory for logging
    logging_strategy="epoch",  # Log training/validation loss at the end of each epoch
)

# Create a Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.6718 | 3.343298 |
| 2 | 3.1483 | 3.367915 |
| 3 | 2.8819 | 3.398321 |


## Test the fine-tuned model

In [None]:
# Load the fine-tuned model
model_finetuned = AutoModelForCausalLM.from_pretrained("./gpt2-wikitext-2").to(device)

# Evaluate perplexity on the validation dataset
perplexity = evaluate_perplexity(model_finetuned, valid_dataloader)
print(f"Fine-tuned perplexity: {perplexity}")

Fine-tuned perplexity: 27.33751678466797


## Generate some text using the fine-tuned model

In [None]:
# Reload the tokenizer (the fine-tuned model was loaded in the previous cell)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# generate text
generate_text(model_finetuned, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a langugae model based on transformers developed by OpenAI and the International Union for Conservation of Nature ( IUCN ) . It has been developed in collaboration with the IUCN . 

The ITS model , developed by OpenAI and the National Institute for Biotechnology and Health ( IITNB ) , is a phylogenetic analysis of the human gut . It is based on the @-@ derived ITS model . 

The ITS model has been


## Parameter efficient fine-tuning (LoRA)

Fine-tune the base GPT-2 model with LoRA.
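
LoRA freezes each targeted weight matrix $W$ and learns a low-rank update, so the effective weight becomes $W + \frac{\alpha}{r} BA$, where $A$ and $B$ have rank $r$. The sketch below counts the trainable parameters this implies for the settings used in the next cell (`r=64`, `lora_alpha=16`). It assumes the adapters are attached only to the fused attention projection `c_attn` (768 × 2304) in each of the 12 blocks of GPT-2 small. That target choice is an assumption about the PEFT default for GPT-2 and is not shown in the notebook.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters of one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r

# GPT-2 small: hidden size 768, fused q/k/v projection of width 3 * 768, 12 blocks.
D_IN, D_OUT, N_LAYERS = 768, 3 * 768, 12
R, ALPHA = 64, 16  # values from the LoraConfig in this notebook

per_layer = lora_param_count(D_IN, D_OUT, R)
total_lora = per_layer * N_LAYERS
frozen_per_layer = D_IN * D_OUT  # size of the frozen c_attn weight matrix

print("LoRA parameters per layer:", per_layer)
print("LoRA parameters in total:", total_lora)
print("Frozen c_attn weights per layer:", frozen_per_layer)
print("Update scaling alpha / r:", lora_scaling(ALPHA, R))
```

Under these assumptions only a few million parameters receive gradients, compared with every weight in full fine-tuning. On the actual model, `model_lora.print_trainable_parameters()` reports the real count.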

In [None]:
from peft import LoraConfig, get_peft_model

In [None]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# load GPT2 model and add the lora adapter
model_lora = AutoModelForCausalLM.from_pretrained("gpt2")
model_lora = get_peft_model(model_lora, peft_config)
model_lora.to(device)  # Move model to GPU/CPU

training_args = TrainingArguments(
    output_dir="./gpt2-lora-wikitext-2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
    prediction_loss_only=True,
)

# set trainer and train the model
trainer = Trainer(
    model=model_lora,
    args=training_args,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
trainer.save_model("./gpt2-lora-wikitext-2")



| Step | Training Loss |
|------|---------------|
| 500 | 4.1492 |
| 1000 | 3.7978 |

In [None]:
# Reload the LoRA fine-tuned model from the saved directory
model_lora = AutoModelForCausalLM.from_pretrained("./gpt2-lora-wikitext-2").to(device)

In [None]:
ppl = evaluate_perplexity(model_lora, valid_dataloader)
print(f"Perplexity after lora finetuning: {ppl}")

Perplexity after lora finetuning: 30.26320457458496


## Evaluate the LoRA fine-tuned model on wiki-text

Compare the text generated by the pre-trained model, the fully fine-tuned model, and the LoRA fine-tuned model. Do you see any difference in the quality of the generated text? Try to explain why. (Hint: trust your results and report them as they are.)

In [None]:
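# pre-trained
# Note: `model` is the same object that was passed to Trainer above, and
# trainer.train() updates its weights in place. For a true pre-trained baseline,
# reload it with AutoModelForCausalLM.from_pretrained("gpt2") before this cell.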
generate_text(model, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a langugae model based on transformers developed by OpenAI , and the only one that can be used in the current version of the JAR . It is the most flexible and efficient transformation method for the JAR , and can be applied to any type of polygon . It is also the most robust , and does not require any additional tuning . It is also the most stable , and is therefore the most suitable for the application of any given polygon . It is


In [None]:
# fully finetuned
generate_text(model_finetuned, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a langugae model based on transformers developed by OpenAI , using a new generation of photoreceptor cells to produce a fluorescent protein . The newly developed cells are capable of producing a variety of fluorescence ( fluorescence of the light spectrum ) , and are therefore more sensitive to light exposure than the normal cells . The results from the new study are published in the journal Nature . 

The fluorescent protein is a type of protein that is used in many applications .


In [None]:
# LoRA finetuned
generate_text(model_lora, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a langugae model based on transformers developed by OpenAI , and developed by researchers at the University of Illinois at Urbana-Champaign . 

The model is based on the same technology used by the Japanese ichthyosaurs ichthyosaurs , and is based on a different model for the ichthyosaurs of China . 

The model uses a similar technology to that used by the ichthyosaurs ichthyosaurs , but


Compare the perplexity of the fully fine-tuned model and LoRA fine-tuned model. Do you see any difference in the perplexity? Try to explain why.

In [None]:
# Perplexity of the pre-trained baseline
perplexity = evaluate_perplexity(model, valid_dataloader)
print(f"Initial perplexity: {perplexity}")

Initial perplexity: 42.9958610534668


In [None]:
# perplexity of fully finetuned
ppl = evaluate_perplexity(model_finetuned, valid_dataloader)

print(f"Perplexity after fully finetuning: {ppl}")

Perplexity after fully finetuning: 27.33751678466797


In [None]:
# perplexity of LoRA finetuned
ppl = evaluate_perplexity(model_lora, valid_dataloader)

print(f"Perplexity after lora finetuning: {ppl}")

Perplexity after lora finetuning: 30.26320457458496


### **Comparison Table of Fine-Tuning Methods**

| Model | **Generated Text** (First 100 Tokens) | **Perplexity (Lower is Better)** | **Training Time** | **Memory Usage** |
|--------|--------------------------------------|--------------------------------|----------------|----------------|
| **Pre-trained GPT-2** | GPT-2 is a langugae model based on transformers developed by OpenAI , and the only one that can be used in the current version of the JAR . It is the most flexible and efficient transformation method for the JAR , and can be applied to any type of polygon . It is also the most robust , and does not require any additional tuning . It is also the most stable , and is therefore the most suitable for the application of any given polygon . It is | 43.00 (highest) | No training | None |
| **Fully Fine-tuned GPT-2** | GPT-2 is a langugae model based on transformers developed by OpenAI , using a new generation of photoreceptor cells to produce a fluorescent protein . The newly developed cells are capable of producing a variety of fluorescence ( fluorescence of the light spectrum ) , and are therefore more sensitive to light exposure than the normal cells . The results from the new study are published in the journal Nature . The fluorescent protein is a type of protein that is used in many applications . | **27.34 (best performance)** | **Longest (13:06 min with L4 GPU)** | **High (all layers trained)** |
| **LoRA Fine-tuned GPT-2** | GPT-2 is a langugae model based on transformers developed by OpenAI , and developed by researchers at the University of Illinois at Urbana-Champaign . The model is based on the same technology used by the Japanese ichthyosaurs ichthyosaurs , and is based on a different model for the ichthyosaurs of China . The model uses a similar technology to that used by the ichthyosaurs ichthyosaurs , but | 30.26 (improved, but not as good as full fine-tuning) | **Faster Than Fully Finetuned (10:02 min with L4 GPU)** | **Low (only LoRA adapters trained)** |

### **Conclusion**

After comparing the generated text from **pre-trained GPT-2**, **fully fine-tuned GPT-2**, and **LoRA fine-tuned GPT-2**, we observe a surprising result:

- **The pre-trained GPT-2 model actually generates the most reasonable text about GPT-2 and transformers.**  
- **Both the fully fine-tuned and LoRA fine-tuned models produce unrelated, scientific-sounding text, which makes no sense given the prompt.**

---

### **Why Does Pre-trained GPT-2 Make the Most Sense?**
- The **pre-trained model** was trained on **WebText**, a **broad** and **diverse** corpus of web pages collected from outbound Reddit links.  
- It **generalizes well** to various topics, including AI and transformers.  
- Since it **was not fine-tuned on WikiText**, it **retains its original diverse knowledge**, allowing it to produce **reasonable AI-related text** when prompted with "GPT-2 is a language model..."

---

### **Why Do the Fine-Tuned Models Generate Irrelevant Text?**
#### **(a) Fully Fine-Tuned GPT-2**
- Fine-tuning on **WikiText** caused the model to **memorize** patterns from **Wikipedia-like data**.  
- Instead of discussing AI, it **hallucinates** a discussion about **fluorescent proteins and photoreceptor cells**, which has nothing to do with GPT-2.

#### **(b) LoRA Fine-Tuned GPT-2**
- LoRA fine-tuning allows the model to retain more of its **pre-trained knowledge**, but it **still picks up biases from WikiText**.
- It generates **random, out-of-context references** to **ichthyosaurs** (prehistoric marine reptiles), which have no relation to AI.

---

### **Perplexity Comparison: Does Lower Perplexity Mean Better Text?**
| Model | **Perplexity (Lower is Better)** | **Generated Text Quality** |
|--------|-------------------------------|---------------------------|
| **Pre-trained GPT-2** | 43.00 (highest) | **Most relevant text about GPT-2 & transformers** |
| **Fully Fine-Tuned GPT-2** | **27.34 (lowest perplexity)** | **Unrelated scientific text about fluorescent proteins** |
| **LoRA Fine-Tuned GPT-2** | 30.26 (improved perplexity) | **Mentions ichthyosaurs, still irrelevant** |

- **Even though fine-tuning lowers perplexity, it does not necessarily improve the quality of generated text.**  
- **The pre-trained model has the highest perplexity but produces the most reasonable output.**  
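- **Fine-tuned models overfit to WikiText and lose generalization ability: during full fine-tuning, the training loss fell from 3.67 to 2.88 while the validation loss rose from 3.34 to 3.40.**  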
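
The overfitting claim can be checked against the per-epoch losses logged during full fine-tuning. The sketch below copies those three rows from the training log and converts each validation loss to a perplexity with `exp(loss)`. These values are not expected to match `evaluate_perplexity` exactly, since the two may average over tokens differently.

```python
import math

# (epoch, training loss, validation loss) copied from the full fine-tuning log above
log = [
    (1, 3.6718, 3.343298),
    (2, 3.1483, 3.367915),
    (3, 2.8819, 3.398321),
]

for epoch, train_loss, val_loss in log:
    gap = val_loss - train_loss
    print(f"epoch {epoch}: exp(val loss) = {math.exp(val_loss):.2f}, "
          f"val - train loss = {gap:+.4f}")

train_losses = [t for _, t, _ in log]
val_losses = [v for _, _, v in log]

# Overfitting signature: training loss keeps falling while validation loss rises.
train_falling = all(b < a for a, b in zip(train_losses, train_losses[1:]))
val_rising = all(b > a for a, b in zip(val_losses, val_losses[1:]))
print("training loss falling every epoch:", train_falling)
print("validation loss rising every epoch:", val_rising)
```

The validation loss is lowest after the first epoch, which suggests that stopping earlier would have been at least as good on held-out text.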

---

### **Fine-Tuning Time & Resource Usage**
| Model | **Training Time** | **Memory Usage** | **Effect on Output** |
|--------|----------------|----------------|----------------|
| **Fully Fine-Tuned GPT-2** | **Longest (13:06 min with L4 GPU)** | **High** (updates all layers) | **Forgets general knowledge, overfits to WikiText** |
| **LoRA Fine-Tuned GPT-2** | **Faster (10:02 min with L4 GPU)** | **Low** (trains only adapter layers) | **Partially retains knowledge but still drifts off-topic** |

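- **Full fine-tuning is the more expensive option (13:06 min vs. 10:02 min here, with every weight updated), yet it did not improve text relevance.**  
- **LoRA trains only small adapter matrices and finished about 3 minutes sooner, but it still inherits biases from the fine-tuning dataset.**  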

---

### **Final Takeaways**
- **Pre-trained GPT-2 generates the most reasonable text because it retains diverse knowledge.**  
- **Fully fine-tuned GPT-2 loses generalization and produces nonsensical, topic-drifted text.**  
- **LoRA fine-tuned GPT-2 is more efficient but still generates off-topic results.**  
- **Lower perplexity does not guarantee better text quality; it can also reflect overfitting to the fine-tuning data.**  

---

### **Key Lesson**
**Fine-tuning must be done carefully**: if the dataset is **not well-matched** to the intended use case, it can cause the model to **drift away from useful knowledge** instead of improving performance.  