<a href="https://colab.research.google.com/github/tinayiluo0322/Fine_tuning_GPT2_on_WikiText_A_Performance_Evaluation/blob/main/Finetune_GPT2_on_wiki_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune GPT-2 on wiki-text

### Luopeiwen Yi

In [1]:
# for google colab
!pip install transformers
!pip install datasets
!pip install peft


In [2]:
import os
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

from datasets import load_dataset

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

from torch.utils.data import DataLoader
import torch.nn as nn

cuda


## Generate text with GPT2

Using the Hugging Face `transformers` API, load the pre-trained GPT-2 model and generate text.

In [4]:
# Load the pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set padding token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

def generate_text(model, tokenizer, prompt, max_length=100):
    """
    Generate text using GPT-2.
    :param model: Pretrained GPT-2 model
    :param tokenizer: GPT-2 tokenizer
    :param prompt: Input text prompt
    :param max_length: Maximum length of generated text
    """

    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask

    # Generate text tokens
    gen_tokens = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        num_return_sequences=1,  # Generate one sequence
        temperature=0.7,  # Values below 1.0 sharpen the distribution (less random)
        top_k=50,  # Use top-k sampling
        top_p=0.9,  # Use nucleus sampling
        do_sample=True  # Enable sampling
    )

    # Decode generated tokens to text
    gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=True)
    print("Generated Text:\n", gen_text)

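The sampling arguments passed to `model.generate` above (`temperature`, `top_k`, `top_p`) all reshape the next-token distribution before a token is drawn. The following is a minimal pure-Python sketch of that idea on a toy five-token vocabulary. The function name `filter_distribution` and the toy logits are hypothetical, and the order of operations (temperature, then top-k, then top-p) is assumed to mirror what the library does; it is an illustration, not the library's implementation.

```python
import math

def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide logits before the softmax; values below 1.0 sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p (nucleus): keep the shortest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; sampling would draw from this.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores, 5-token vocabulary
result = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
print(result)
```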

In [5]:
# Example usage
generate_text(model, tokenizer, "GPT-2 is a language model based on transformers developed by OpenAI.", max_length=100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a language model based on transformers developed by OpenAI.

A few points to note:

The term "transformers" is not a new concept, as it was used by OpenAI before.

We have already developed a similar model for the language, called a "translate-to-text" (TTF) model.

The TTF model is based on a set of transformers that can be used to transform text and can


## Prepare dataset for training

Download the dataset and prepare it for fine-tuning.


In [6]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Load the WikiText dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")  # Using raw text version

# Select the first 10% of each split (train, validation, test)
dataset_train = dataset["train"].select(range(len(dataset["train"]) // 10))
dataset_valid = dataset["validation"].select(range(len(dataset["validation"]) // 10))
dataset_test = dataset["test"].select(range(len(dataset["test"]) // 10))

# Function to tokenize dataset and set labels same as input_ids
def tokenize_function(examples):
    tokenized = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
    tokenized["labels"] = tokenized["input_ids"].copy()  # Labels must be the same as input_ids for causal LM
    return tokenized

# Tokenize the dataset
tokenized_datasets_train = dataset_train.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_datasets_valid = dataset_valid.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_datasets_test = dataset_test.map(tokenize_function, batched=True, remove_columns=["text"])

# Set format for PyTorch
tokenized_datasets_train.set_format("torch")
tokenized_datasets_valid.set_format("torch")
tokenized_datasets_test.set_format("torch")

# Create a DataCollator for training and validation
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # mlm=False for causal LM

# Create DataLoaders
train_dataloader = DataLoader(tokenized_datasets_train, shuffle=True, batch_size=4, collate_fn=data_collator)
valid_dataloader = DataLoader(tokenized_datasets_valid, batch_size=4, collate_fn=data_collator)
test_dataloader = DataLoader(tokenized_datasets_test, batch_size=4, collate_fn=data_collator)

# Test DataLoader
for batch in train_dataloader:
    print("Input IDs Shape:", batch['input_ids'].shape)
    print("Attention Mask Shape:", batch['attention_mask'].shape)
    print("Labels Shape:", batch['labels'].shape)
    break

print("DataLoader is working correctly!")


Input IDs Shape: torch.Size([4, 512])
Attention Mask Shape: torch.Size([4, 512])
Labels Shape: torch.Size([4, 512])
DataLoader is working correctly!

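One detail of the collator is worth spelling out. With `mlm=False`, `DataCollatorForLanguageModeling` is understood to rebuild `labels` from `input_ids` for each batch and to replace every pad-token position with `-100`, the value that PyTorch's cross-entropy loss ignores by default. Because the pad token here is the EOS token (id 50256, per the generation log), padded positions do not contribute to the loss. The sketch below imitates that masking on plain lists; `mask_padding` and the token ids in `batch` are hypothetical.

```python
PAD_ID = 50256        # GPT-2 EOS id, reused as the pad token in this notebook
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(input_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing pad positions with ignore_index."""
    return [[ignore_index if tok == pad_id else tok for tok in row] for row in input_ids]

batch = [
    [464, 3290, 318, PAD_ID, PAD_ID],  # hypothetical 3-token line padded to length 5
    [PAD_ID] * 5,                      # an empty line would become all padding
]
labels = mask_padding(batch)
print(labels)
```

A row made entirely of padding ends up with no label to predict, so it adds nothing to the loss.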

## Evaluate perplexity on wiki-text

Before fine-tuning, evaluate the pre-trained GPT-2 model on the WikiText dataset. Perplexity is a common metric for evaluating language models: the lower the perplexity, the better the model predicts the text. In practice, compute it with the following formula, which is a transformation of the formula from class:
$PP(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i|\text{context})\right)$

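As a small worked instance of the formula, with $N$ taken as the number of predicted tokens and natural logarithms throughout, the sketch below computes perplexity from a list of per-token log-probabilities. The probabilities are hypothetical; they were chosen so that their geometric mean is 0.25, which makes the perplexity come out at about 4.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability over N predicted tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Hypothetical probabilities a model assigned to four observed tokens.
probs = [0.25, 0.5, 0.125, 0.25]
log_probs = [math.log(p) for p in probs]
print(perplexity(log_probs))
```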
In [7]:
def evaluate_perplexity(model, dataloader):
    model.eval()
    total_loss = 0
    total_length = 0
    loss_fn = nn.CrossEntropyLoss(reduction='sum')  # Sum the loss over all tokens

    with torch.no_grad():
        for batch in dataloader:
            # Move batch to device (GPU if available)
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            # Forward pass
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            logits = outputs.logits

            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()

            # Compute loss
            loss = loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

            total_loss += loss.item()
            total_length += attention_mask.sum().item()  # Count total valid tokens

    # Calculate perplexity
    perplexity = torch.exp(torch.tensor(total_loss / total_length))

    return perplexity.item()

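One caveat about the denominator in `evaluate_perplexity`: `attention_mask.sum()` counts every real token, while the summed loss only covers shifted label positions that are not ignored. The first token of each sequence has no prediction, so the two counts can differ slightly. The sketch below shows the difference on plain lists, assuming ignored labels are marked with `-100`; the helper names and token ids are hypothetical. In the function above, the equivalent count would be something like `(shift_labels != -100).sum().item()`. The perplexities reported in this notebook come from the function as written, so switching the denominator would change them.

```python
IGNORE_INDEX = -100

def count_predicted_tokens(labels, ignore_index=IGNORE_INDEX):
    """Count label positions that contribute to the shifted cross-entropy sum."""
    # Drop the first label of each row: position 0 has no preceding context.
    return sum(1 for row in labels for tok in row[1:] if tok != ignore_index)

def count_attended_tokens(attention_mask):
    """Count real (non-padding) tokens, as attention_mask.sum() does."""
    return sum(sum(row) for row in attention_mask)

# Hypothetical batch of two padded sequences.
labels = [[464, 3290, 318, -100, -100], [1212, 318, 257, 1332, -100]]
attention_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 0]]
print(count_predicted_tokens(labels), count_attended_tokens(attention_mask))
```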
In [8]:
# Evaluate initial perplexity before fine-tuning
perplexity = evaluate_perplexity(model, test_dataloader)
print(f"Initial perplexity: {perplexity}")

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Initial perplexity: 49.59413146972656


## Fine-tune GPT2 on wiki-text



In [9]:
import os
os.environ["WANDB_DISABLED"] = "true"  # Disable W&B logging

In [10]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-wikitext-2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
    prediction_loss_only=True,
    evaluation_strategy="epoch",  # Report validation and training loss every epoch
    logging_dir="./logs",  # Directory for logging
    logging_strategy="epoch",  # Log training/validation loss at the end of each epoch
)

# Create a Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.6718 | 3.343298 |
| 2 | 3.1483 | 3.367917 |
| 3 | 2.8819 | 3.398322 |

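The losses in this log can be read on the perplexity scale by exponentiating them, assuming the reported loss is the mean cross-entropy per predicted token in natural-log units. The sketch below does that for the three epochs. It is a rough reading of the log, not a substitute for `evaluate_perplexity`, which runs on the test split.

```python
import math

# Per-epoch losses copied from the training log above.
train_loss = [3.6718, 3.1483, 2.8819]
valid_loss = [3.343298, 3.367917, 3.398322]

for epoch, (tr, va) in enumerate(zip(train_loss, valid_loss), start=1):
    print(f"epoch {epoch}: train ppl {math.exp(tr):.2f}, valid ppl {math.exp(va):.2f}")

valid_ppl = [math.exp(v) for v in valid_loss]
```

Training perplexity falls every epoch while validation perplexity creeps upward, which is the usual signature of overfitting after the first epoch.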

## Test fine-tuned model

In [11]:
# Load the fine-tuned model
model_finetuned = AutoModelForCausalLM.from_pretrained("./gpt2-wikitext-2").to(device)

# Evaluate perplexity on the test dataset
perplexity = evaluate_perplexity(model_finetuned, test_dataloader)
print(f"Fine-tuned perplexity: {perplexity}")

Fine-tuned perplexity: 32.76613998413086


## Generate some text using the fine-tuned model

In [12]:
# Reload the tokenizer (the fine-tuned model was loaded in the previous cell)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# generate text
generate_text(model_finetuned, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a langugae model based on transformers developed by OpenAI and the International Union for Conservation of Nature ( IUCN ) . It has been developed in collaboration with the IUCN . 

The ITS model , developed by OpenAI and the National Institute for Biotechnology and Health ( IITNB ) , is a phylogenetic analysis of the human gut . It is based on the @-@ derived ITS model . 

The ITS model has been


## Parameter efficient fine-tuning (LoRA)

Fine-tune the base GPT-2 model with LoRA.

In [13]:
from peft import LoraConfig, get_peft_model

In [14]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# load GPT2 model and add the lora adapter
model_lora = AutoModelForCausalLM.from_pretrained("gpt2")
model_lora = get_peft_model(model_lora, peft_config)
model_lora.to(device)  # Move model to GPU/CPU

training_args = TrainingArguments(
    output_dir="./gpt2-lora-wikitext-2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
    prediction_loss_only=True,
)

# set trainer and train the model
trainer = Trainer(
    model=model_lora,
    args=training_args,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
trainer.save_model("./gpt2-lora-wikitext-2")

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


| Step | Training Loss |
|------|---------------|
| 500 | 4.1492 |
| 1000 | 3.7978 |

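To get a feel for how small the LoRA update is, the sketch below estimates the number of trainable parameters implied by `r=64`. LoRA freezes a weight matrix and learns two low-rank factors whose product is scaled by `lora_alpha / r`. The layer shape and layer count are assumptions, not values read from this notebook: the adapter is assumed to attach only to GPT-2's fused attention projection (`c_attn`, 768 inputs and 2304 outputs) in each of 12 transformer blocks.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

r, lora_alpha = 64, 16       # values from the LoraConfig above
d_in, d_out = 768, 2304      # assumed shape of GPT-2 small's c_attn projection
n_layers = 12                # assumed number of transformer blocks in "gpt2"

per_layer = lora_param_count(d_in, d_out, r)
total = per_layer * n_layers
scaling = lora_alpha / r     # factor applied to the low-rank update

print(per_layer, total, scaling)
```

Under these assumptions the adapters add roughly 2.4 million trainable parameters, a small fraction of the full model. Calling `model_lora.print_trainable_parameters()` on the PEFT-wrapped model, before it is reloaded from disk, reports the actual figure.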

In [15]:
# Load the fine-tuned model
model_lora = AutoModelForCausalLM.from_pretrained("./gpt2-lora-wikitext-2").to(device)

In [16]:
ppl = evaluate_perplexity(model_lora, test_dataloader)
print(f"Perplexity after lora finetuning: {ppl}")

Perplexity after lora finetuning: 36.98552322387695


In [29]:
# LoRA finetuned
generate_text(model_lora, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a langugae model based on transformers developed by OpenAI . It is the first integrated platform with a robust and flexible implementation . The platform can be used to simulate different types of translation tasks , and can be used to model the translation of a language using different parameters . The platform is capable of performing multiple translation tasks at the same time , including translating to different languages in different regions , and even performing translation tasks at the same time for different languages . OpenAI has developed


## Evaluate LoRA fine-tuned model on wiki-text

Compare the text generated by the fully fine-tuned model, the LoRA fine-tuned model, and the pre-trained model. Do you see any difference in the quality of the generated text? Try to explain why. (Hint: trust your result and report it as it is.)

In [19]:
# Load the pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

In [28]:
# pre-trained
generate_text(model, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a langugae model based on transformers developed by OpenAI, which allows for rapid and efficient transformation of data sets. OpenAI's transformers are based on the Open Data Framework, which provides a powerful tool to build data sets.

In this article, I will cover the data structures, features, and behavior of OpenAI's transformers and how they are used. I will also explain how they are used in a variety of applications.

To use


In [22]:
# fully finetuned
generate_text(model_finetuned, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a langugae model based on transformers developed by OpenAI , which is designed to classify vertebrate species using the functional traits of the species . The model has been used in many vertebrate taxonomical analyses of vertebrate species , including vertebrate taxonomy , and in the identification of taxonomic units such as phyla . The model has been used to classify vertebrate taxa and to classify vertebrate species by phylogenetic analysis . 

The phylogenetic


In [23]:
# LoRA finetuned
generate_text(model_lora, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 GPT-2 is a langugae model based on transformers developed by OpenAI, a global company dedicated to creating and developing intelligent robots and artificial intelligence. OpenAI has developed a number of different models and systems that interact with a variety of different sensors and processors, including sensors for the detection of drugs, radar for detecting objects, and a sensor for the detection of a body mass index. OpenAI's original model was based on a new approach based on the use of the concept of a


Compare the perplexity of the fully fine-tuned model and LoRA fine-tuned model. Do you see any difference in the perplexity? Try to explain why.

In [24]:
# Evaluate initial perplexity before fine-tuning
perplexity = evaluate_perplexity(model, test_dataloader)
print(f"Initial perplexity: {perplexity}")

Initial perplexity: 49.59413146972656


In [25]:
# perplexity of fully finetuned
ppl = evaluate_perplexity(model_finetuned, test_dataloader)

print(f"Perplexity after fully finetuning: {ppl}")

Perplexity after fully finetuning: 32.76613998413086


In [26]:
# perplexity of LoRA finetuned
ppl = evaluate_perplexity(model_lora, test_dataloader)

print(f"Perplexity after lora finetuning: {ppl}")

Perplexity after lora finetuning: 36.98552322387695

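The three perplexities are easier to compare as relative reductions from the pre-trained baseline. The sketch below does that arithmetic with the values printed above; `relative_reduction` is a hypothetical helper name.

```python
baseline = 49.59413146972656  # pre-trained GPT-2
results = {
    "full fine-tuning": 32.76613998413086,
    "LoRA": 36.98552322387695,
}

def relative_reduction(before, after):
    """Fractional drop in perplexity relative to the starting value."""
    return (before - after) / before

for name, ppl in results.items():
    print(f"{name}: {relative_reduction(baseline, ppl):.1%} lower than pre-trained")
```

Full fine-tuning lowers test perplexity by about a third and LoRA by about a quarter.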

### **Comparison Table of Fine-Tuning Methods**

| Model | **Generated Text** (First 100 Tokens) | **Perplexity (Lower is Better)** | **Training Time** | **Memory Usage** |
|--------|--------------------------------------|--------------------------------|----------------|----------------|
| **Pre-trained GPT-2** | GPT-2 is a langugae model based on transformers developed by OpenAI, which allows for rapid and efficient transformation of data sets. OpenAI's transformers are based on the Open Data Framework, which provides a powerful tool to build data sets. In this article, I will cover the data structures, features, and behavior of OpenAI's transformers and how they are used. I will also explain how they are used in a variety of applications. To use | 49.59 (highest) | No training | None (used as-is) |
| **Fully Fine-tuned GPT-2** | GPT-2 is a langugae model based on transformers developed by OpenAI , which is designed to classify vertebrate species using the functional traits of the species . The model has been used in many vertebrate taxonomical analyses of vertebrate species , including vertebrate taxonomy , and in the identification of taxonomic units such as phyla . The model has been used to classify vertebrate taxa and to classify vertebrate species by phylogenetic analysis . The phylogenetic| **32.77 (best performance)** | **Longest (13:00 min with L4 GPU)** | **High (all layers trained)** |
| **LoRA Fine-tuned GPT-2** | GPT-2 is a langugae model based on transformers developed by OpenAI, a global company dedicated to creating and developing intelligent robots and artificial intelligence. OpenAI has developed a number of different models and systems that interact with a variety of different sensors and processors, including sensors for the detection of drugs, radar for detecting objects, and a sensor for the detection of a body mass index. OpenAI's original model was based on a new approach based on the use of the concept of a| 36.99 (improved, but not as good as full fine-tuning) | **Faster Than Fully Finetuned (09:55 min with L4 GPU)** | **Low (only LoRA adapters trained)** |

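The training-time gap in the table can be put as a percentage. The sketch below converts the two `MM:SS` durations to seconds and computes the saving; `to_seconds` is a hypothetical helper. The two runs used different `TrainingArguments` (the full fine-tuning run also evaluated and logged every epoch), so the comparison is indicative and not strictly like-for-like.

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

full_ft = to_seconds("13:00")  # full fine-tuning, from the table
lora = to_seconds("09:55")     # LoRA fine-tuning, from the table

saving = 1 - lora / full_ft
print(f"LoRA run took {lora}s vs {full_ft}s ({saving:.0%} less wall-clock time)")
```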
### **Conclusion**

After evaluating **pre-trained GPT-2, fully fine-tuned GPT-2, and LoRA fine-tuned GPT-2** on a separate test set, we observe distinct trade-offs in **text generation quality, perplexity, and computational efficiency**.  

---

### **Pre-trained GPT-2: The Most General Model**
- **Generated Text:** The pre-trained model outputs **generic, somewhat repetitive** text that describes GPT-2 in a **structured but vague** way.
- **Perplexity:** **49.59 (highest perplexity)** → Indicates that the model assigns the **lowest probability to the held-out WikiText-2 test text**, which is expected because it has not been adapted to, or memorized, this dataset.
- **Performance:** No training time or memory cost since it is used as-is.
- **Interpretation:**  
  - The pre-trained model maintains **generalization ability**, allowing it to generate text that is **on-topic but lacks specificity**.
  - However, it **does not adapt to new domain-specific data** (e.g., WikiText-2), meaning it may lack depth in certain specialized topics.

---

### **Fully Fine-Tuned GPT-2: Strong Adaptation but Topic Drift**  
- **Generated Text:** The model **drifts off-topic**, discussing **vertebrate taxonomy** instead of AI or transformers. While the text is **grammatically correct**, it **shows clear overfitting to WikiText-2, leading to an unexpected topic shift**.  
- **Perplexity:** **32.77 (lowest perplexity)** → Indicates that the model is **highly confident** in predicting the next word, but at the cost of **topic relevance**.  
- **Training Time:** **13:00 min on L4 GPU** (Longest)  
- **Memory Usage:** **High (all layers trained)**  
- **Interpretation:**  
  - Fully fine-tuned GPT-2 **adapts strongly to the fine-tuning dataset**, but it **loses its ability to generalize across different topics**.  
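  - While it has the **lowest perplexity**, the generated text **is no longer relevant to AI and transformers**, suggesting **catastrophic forgetting**. The training log is consistent with overfitting: validation loss rose from 3.343 to 3.398 over the three epochs while training loss fell from 3.672 to 2.882.  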
  - The **longest training time and highest memory consumption** make this approach computationally expensive.  

---

### **LoRA Fine-Tuned GPT-2: Best Balance Between Adaptation and Generalization**  
- **Generated Text:** The model generates **AI-related content**, mentioning **OpenAI, artificial intelligence, and robotics**. The content is **more relevant than fully fine-tuned GPT-2**, although it still introduces some unrelated information about sensors and processors.  
- **Perplexity:** **36.99 (improved, but not as low as full fine-tuning)** → The model adapts to WikiText-2 while **preserving some generalization ability**.  
- **Training Time:** **09:55 min on L4 GPU** (Faster than full fine-tuning)  
- **Memory Usage:** **Low (only LoRA adapters trained)**  
- **Interpretation:**  
  - LoRA fine-tuning enables the model to **retain more of its pre-trained GPT-2 knowledge**, while still adapting to WikiText-2.  
  - Unlike fully fine-tuned GPT-2, LoRA does **not completely override previous knowledge**, helping **prevent catastrophic forgetting**.  
  - The **higher perplexity compared to fully fine-tuned GPT-2** suggests that LoRA allows for more **flexibility in predictions**, leading to more **diverse but still relevant** text generation.  
  - **LoRA is computationally efficient, achieving strong improvements with significantly reduced training time and memory usage.**  

---

### **Trade-offs Between the Models**  
| Model | **Key Strength** | **Key Weakness** |
|--------|----------------------|---------------------------|
| **Pre-trained GPT-2** | **Generalization ability, retains diverse knowledge** | High perplexity, lacks domain-specific adaptation |
| **Fully Fine-Tuned GPT-2** | **Lowest perplexity, strongly adapted to WikiText-2** | **Severe topic drift (overfits to fine-tuning data)** |
| **LoRA Fine-Tuned GPT-2** | **Best balance between adaptation and generalization** | Slightly higher perplexity than full fine-tuning |

---

### **The Relationship Between Perplexity and Text Quality**  
A key takeaway from this experiment is that **lower perplexity does not always lead to better text generation**:  
- **Fully fine-tuned GPT-2 has the lowest perplexity (32.77), but the text is off-topic.**  
- **LoRA fine-tuned GPT-2 has a slightly higher perplexity (36.99) but generates more relevant and meaningful content.**  
- **Pre-trained GPT-2 has the highest perplexity (49.59), but it remains general and on-topic.**  

**Key Observation:**  
- **Fine-tuning should aim for a balance between perplexity reduction and knowledge retention.**  
- **Overfitting to WikiText-2 (as seen in fully fine-tuned GPT-2) reduces the model’s ability to generate relevant responses.**  

---

### **Conclusion: Which Model is Best?**  
Each model has its own strengths and weaknesses depending on the use case:  

| Model | **Best Use Case** |
|--------|------------------|
| **Pre-trained GPT-2** | General-purpose text generation with broad knowledge retention |
| **Fully Fine-Tuned GPT-2** | Domain-specific adaptation when topic relevance is not a concern |
| **LoRA Fine-Tuned GPT-2** | Best trade-off between efficiency, adaptation, and topic relevance |

- If **generalization is most important**, the **pre-trained GPT-2** is preferable.  
- If **domain-specific adaptation is needed**, but **topic relevance must be preserved**, **LoRA fine-tuning** is the best option.  
- If **full adaptation to new data** is required (even at the cost of knowledge loss), then **fully fine-tuned GPT-2** is the most powerful choice.  

---

### **Final Thought: Fine-Tuning Must Be Done Carefully**  
**Fine-tuning is not always beneficial**: without proper dataset selection, it can cause **knowledge loss and topic drift rather than meaningful improvements**.  

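In this experiment, **LoRA fine-tuning proves to be the most effective approach**, offering a balance between **efficient learning, topic relevance, and knowledge retention** while avoiding the computational cost of full fine-tuning. This conclusion rests on a single sampled generation per model, so it is best read as indicative.  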
