# Fine-Tuning LLMs for Sentiment Analysis with LoRA

This work explores efficient fine-tuning of language models for sentiment analysis using Low-Rank Adaptation (LoRA). The model used here is:

- **distilgpt2**: A lightweight transformer model (82M parameters)

The model is fine-tuned on the Rotten Tomatoes movie review dataset for binary sentiment classification on an NVIDIA RTX 3070 GPU.

## Approach
1. Load and prepare the Rotten Tomatoes dataset
2. Configure and apply LoRA for parameter-efficient fine-tuning
3. Fine-tune the model on sentiment classification
4. Evaluate the base and fine-tuned models and compare performance

## Data Preparation

* Rotten Tomatoes dataset: movie reviews labeled with positive or negative sentiment.
* Initial exploratory data analysis: display dataset size, label distribution, and example reviews.

In [9]:
from datasets import load_dataset
from collections import Counter
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import get_peft_model, LoraConfig, TaskType, PeftModel
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import random

# Set global constants
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")

# Load Rotten Tomatoes dataset
ds = load_dataset("rotten_tomatoes", split="train")
ds_test = load_dataset("rotten_tomatoes", split="test")

# Get test data
test_texts = ds_test["text"]
test_labels = ds_test["label"]

# Display dataset statistics
print(f"Training examples: {len(ds)}")
print(f"Test examples: {len(ds_test)}")

# Show label distribution
train_label_counts = Counter(ds['label'])
test_label_counts = Counter(ds_test['label'])

print("\nLabel distribution:")
print(f"Training set: {dict(train_label_counts)}")
print(f"Test set: {dict(test_label_counts)}")

# Show a few examples
print("\nExample reviews (randomly selected):")
random_indices = random.sample(range(len(ds)), 3)
for i in random_indices:
    sentiment = "Positive" if ds[i]["label"] == 1 else "Negative"
    print(f"Review: {ds[i]['text']}")
    print(f"Sentiment: {sentiment}")
    print("-" * 50)

Using device: cuda
Training examples: 8530
Test examples: 1066

Label distribution:
Training set: {1: 4265, 0: 4265}
Test set: {1: 533, 0: 533}

Example reviews (randomly selected):
Review: more vaudeville show than well-constructed narrative , but on those terms it's inoffensive and actually rather sweet .
Sentiment: Positive
--------------------------------------------------
Review: what makes the film special is the refreshingly unhibited enthusiasm that the people , in spite of clearly evident poverty and hardship , bring to their music .
Sentiment: Positive
--------------------------------------------------
Review: ozpetek's effort has the scope and shape of an especially well-executed television movie .
Sentiment: Positive
--------------------------------------------------


## Utility Functions

The following helper functions handle prediction and evaluation:
* `gen` prompts the model with a single review and generates exactly one new token (expected to be "Positive" or "Negative"), returning the decoded prompt together with that token.
* `predict_sentiment` applies the same single-token generation to a list of reviews and maps each result to positive (1), negative (0), or unknown (-1). With `baseline=True`, it prepends an explicit instruction to the prompt, which is how the base model is queried.
* `display_prediction_distribution` summarizes these predictions, printing the count and percentage for each category (positive, negative, and unknown).
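
The parsing rule inside `predict_sentiment` (take the last whitespace-separated word of the decoded text, lowercase it, and compare it with the two expected labels) can be tried without a model. The sketch below is only illustrative: `parse_label` is a hypothetical helper name that does not appear in the notebook, and the guard for an empty string is an addition.

```python
def parse_label(completion: str) -> int:
    """Map a decoded completion to 1 (positive), 0 (negative), or -1 (unknown).

    Hypothetical helper for illustration; it mirrors the last-word rule
    used inside predict_sentiment.
    """
    words = completion.strip().split()
    if not words:
        return -1  # Empty completion: nothing to parse (guard added for this sketch)
    last_word = words[-1].lower()
    if last_word == "positive":
        return 1
    if last_word == "negative":
        return 0
    return -1  # Any other word counts as an unknown sentiment


examples = [
    "Review: great fun . Sentiment: Positive",
    "Review: a dull mess . Sentiment: Negative",
    "Review: a dull mess . Sentiment: I",
]
labels = [parse_label(text) for text in examples]
print(labels)
```

Because the rule looks at the whole decoded string, a completion whose single new token is not a full label word falls into the unknown bucket.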

In [10]:
# Function to generate sentiment predictions for a single review
def gen(model, tokenizer, review):
    prompt = f"Review: {review} Sentiment:"
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    out = model.generate(
        **inputs,
        max_new_tokens=1,  # Only generate one token after the prompt
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Function to predict sentiment on a dataset
def predict_sentiment(model, tokenizer, reviews, baseline=False):
    preds = []
    for review in tqdm(reviews, desc="Predicting sentiment"):
        prompt = f"Review: {review} Sentiment:"
        if baseline:
            prompt = f'Instruction: Analyze the following movie review and determine its sentiment. If the review expresses a positive sentiment, reply with "positive". If the review expresses a negative sentiment, reply with "negative". Only reply with "positive" or "negative", and do not provide any additional explanation.\n {prompt}'
        inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
        out = model.generate(
            **inputs,
            max_new_tokens=1,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
        completion = tokenizer.decode(out[0], skip_special_tokens=True)
        last_word = completion.strip().split()[-1].lower()
        if last_word == "positive":
            preds.append(1)
        elif last_word == "negative":
            preds.append(0)
        else:
            preds.append(-1)  # Unknown sentiment
    return preds

# Function to display prediction distribution
def display_prediction_distribution(y_pred, name="Model"):
    """
    Displays the distribution of prediction values with counts and percentages.
    
    Args:
        y_pred: List or array of prediction values
        name: Name of the model for display purposes
    """
    # Count how often each prediction value occurs
    # (Counter is already imported at the top of the notebook)
    pred_counts = Counter(y_pred)
    
    print(f"\n📊 Prediction Distribution for {name}:")
    for value, count in sorted(pred_counts.items()):
        if value == 1:
            label = "Positive"
        elif value == 0:
            label = "Negative"
        else:
            label = "Unknown"
        print(f"{label} (value={value}): {count} samples ({count/len(y_pred):.1%})")
    
    return pred_counts

## Fine-Tuning distilgpt2

We fine-tune the smaller distilgpt2 model, which has approximately 82M parameters, using LoRA.
* *Data Preparation*: Each review and its label are formatted as a "Review: [text] Sentiment: [Positive/Negative]" string, the prompt-completion format the language model learns to generate.
* *Model Initialization (LoRA)*: The pre-trained distilgpt2 model is loaded and wrapped for Low-Rank Adaptation (LoRA). LoRA is a parameter-efficient fine-tuning technique that freezes the pre-trained weights and adds small, trainable low-rank matrices, reducing memory usage and training time while often reaching performance close to full fine-tuning.
* *Training Setup*: A `Trainer` is configured with the output directory, batch size, number of training epochs, and a data collator. With `mlm=False`, `DataCollatorForLanguageModeling` prepares batches for causal language modeling, the training objective of GPT-style models, by using the input tokens themselves as labels.
* *Fine-tuning Execution*: `trainer.train()` runs the fine-tuning. During this phase the model learns to produce the sentiment word ("Positive" or "Negative") after the review text, adapting its pre-trained knowledge to sentiment analysis on the Rotten Tomatoes dataset.
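
To make the "small, trainable matrices" concrete, here is a minimal NumPy sketch of a LoRA-style update for a single weight matrix. It is not the PEFT implementation. The 768 by 2304 shape is an assumption about one attention projection in distilgpt2, and the names `W`, `A`, `B`, and `lora_forward` are illustrative. Only `r=8` and `lora_alpha=32` are taken from the configuration used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 768, 2304   # Assumed shape of one attention projection
r, lora_alpha = 8, 32     # Rank and alpha from the notebook's LoraConfig

W = rng.normal(size=(d_in, d_out))      # Frozen pre-trained weight
A = rng.normal(size=(d_in, r)) * 0.01   # Trainable factor, small random init
B = np.zeros((r, d_out))                # Trainable factor, zero init

scaling = lora_alpha / r


def lora_forward(x):
    # Base projection plus the scaled low-rank correction
    return x @ W + scaling * (x @ A @ B)


x = rng.normal(size=(4, d_in))
# B starts at zero, so the adapted layer initially matches the base layer
assert np.allclose(lora_forward(x), x @ W)

full_params = W.size
lora_params = A.size + B.size
print(f"Full matrix: {full_params:,} parameters")
print(f"LoRA factors: {lora_params:,} parameters ({lora_params / full_params:.2%})")
```

With these shapes the two factors hold 24,576 values against 1,769,472 in the full matrix, roughly 1.4%, which is where the memory savings come from.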

In [11]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import get_peft_model, LoraConfig, TaskType
import torch

BASE = "distilgpt2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Format text + sentiment prompt completion
def fmt(ex):
    sentiment = "Positive" if ex["label"] == 1 else "Negative"
    prompt = f"Review: {ex['text']} Sentiment:"
    target = f" {sentiment}"
    return {"text": prompt + target}

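# Append the sentiment word to each review so the model learns to complete the prompt
ds = ds.map(fmt)
# Drop unusually long examples (by word count) so that fewer inputs reach the
# 128-token truncation below, which would cut off the sentiment word at the end
ds = ds.filter(lambda x: len(x["text"].split()) < 120)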

# Tokenize
tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
def tok(ex):
    return tokenizer(ex["text"], truncation=True, padding="max_length", max_length=128)
ds = ds.map(tok, batched=False)

# Load model + prepare LoRA
model = AutoModelForCausalLM.from_pretrained(BASE).to(DEVICE)
peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, peft_config)

# Train
args = TrainingArguments(
    output_dir="./tmp_distilgpt2",  # Temporary dir just for logs
    per_device_train_batch_size=4,
    num_train_epochs=3,
    #logging_steps=10,
    save_total_limit=1,
    fp16=False,
    label_names=[]
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()
fine_tuned_model_distilgpt2 = model  # Keep in memory instead of saving
print("✅ Fine-tuning complete.")



| Step | Training Loss |
|------|---------------|
| 500  | 4.4556 |
| 1000 | 3.9426 |
| 1500 | 3.8952 |
| 2000 | 3.8267 |
| 2500 | 3.7862 |
| 3000 | 3.7685 |
| 3500 | 3.7383 |
| 4000 | 3.7471 |
| 4500 | 3.7501 |
| 5000 | 3.7255 |


✅ Fine-tuning complete.
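
A note on reading the loss values: with `DataCollatorForLanguageModeling(tokenizer, mlm=False)`, the labels are a copy of `input_ids` with padding positions set to `-100`, so the loss is averaged over every token of the review text as well as the sentiment word. That may be why it levels off near 3.7 rather than approaching zero. The sketch below contrasts this with an alternative the notebook does not use, where prompt tokens are masked as well so that only the sentiment token is scored. `build_labels` and the toy token ids are hypothetical.

```python
IGNORE_INDEX = -100  # Label value that PyTorch cross-entropy skips by default


def build_labels(input_ids, prompt_len, pad_id, mask_prompt):
    """Return causal-LM labels for one tokenized example.

    Hypothetical helper: mask_prompt=False mimics copying input_ids and
    masking only padding; mask_prompt=True additionally masks the prompt.
    """
    labels = []
    for position, token_id in enumerate(input_ids):
        is_pad = token_id == pad_id
        is_prompt = position < prompt_len
        if is_pad or (mask_prompt and is_prompt):
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id)
    return labels


# Toy ids: five prompt tokens, one sentiment token (id 9), two pad tokens (id 0)
input_ids = [11, 12, 13, 14, 15, 9, 0, 0]

all_tokens = build_labels(input_ids, prompt_len=5, pad_id=0, mask_prompt=False)
answer_only = build_labels(input_ids, prompt_len=5, pad_id=0, mask_prompt=True)
print(all_tokens)
print(answer_only)
```

Scoring only the answer token would make the logged loss track classification quality more directly, at the cost of a custom collator or preprocessing step.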


### Testing Fine-tuned distilgpt2

Let's evaluate our fine-tuned distilgpt2 model on a few examples.

In [15]:
# Load base and fine-tuned models
base = AutoModelForCausalLM.from_pretrained(BASE).to(DEVICE).eval()
ft = fine_tuned_model_distilgpt2.eval()  # Use in-memory model directly

test_reviews = [
    "This movie was a thrilling rollercoaster from start to finish.",
    "An absolute disaster performances.",
    "It's a bad film, and that's about it."
]

print("\n📊 Sentiment Comparison: Base vs Fine-Tuned distilgpt2\n")
for rev in test_reviews:
    print(f"🎬 REVIEW:\n\"{rev}\"\n")
    print("🔹 BASE MODEL:", gen(base, tokenizer, rev))
    print("🔸 FINE-TUNED MODEL:", gen(ft, tokenizer, rev))
    print("-" * 50)


📊 Sentiment Comparison: Base vs Fine-Tuned distilgpt2

🎬 REVIEW:
"This movie was a thrilling rollercoaster from start to finish."

🔹 BASE MODEL: Review: This movie was a thrilling rollercoaster from start to finish. Sentiment: I
🔸 FINE-TUNED MODEL: Review: This movie was a thrilling rollercoaster from start to finish. Sentiment: Positive
--------------------------------------------------
🎬 REVIEW:
"An absolute disaster performances."

🔹 BASE MODEL: Review: An absolute disaster performances. Sentiment: A
🔸 FINE-TUNED MODEL: Review: An absolute disaster performances. Sentiment: Negative
--------------------------------------------------
🎬 REVIEW:
"It's a bad film, and that's about it."

🔹 BASE MODEL: Review: It's a bad film, and that's about it. Sentiment: It
🔸 FINE-TUNED MODEL: Review: It's a bad film, and that's about it. Sentiment: Negative
--------------------------------------------------


These examples show the improvement of the fine-tuned distilgpt2 model over its base version for sentiment analysis.
* Fine-tuning a small model like DistilGPT2 on a consumer-grade GPU took only 3 minutes and 29 seconds, which lowers the barrier to entry for task-specific NLP.
* The fine-tuned model completes each prompt with "Positive" or "Negative" and labels all three reviews correctly, whereas the base model produces unhelpful completions ("I", "A", "It"). Even a small model can be effective at a specific task once it is fine-tuned on the prompt format.

**Calculate distilgpt2 predictions**

This step runs both the base and the fine-tuned DistilGPT2 models on the test set. It quantifies the improvement from fine-tuning by showing how often each model produces a usable sentiment label.

In [16]:
# Evaluate distilGPT2 models
import time
print("\n--- Evaluating distilGPT2 Models ---")

print("Predicting with base distilGPT2 model...")
start = time.time()
y_pred_base = predict_sentiment(base, tokenizer, test_texts, baseline=True)
elapsed_base = time.time() - start
display_prediction_distribution(y_pred_base, name="Base distilGPT2 model")
print(f"Time taken: {elapsed_base:.2f} seconds")

print("\nPredicting with fine-tuned distilGPT2 model...")
start = time.time()
y_pred = predict_sentiment(ft, tokenizer, test_texts, baseline=False)
elapsed_ft = time.time() - start
display_prediction_distribution(y_pred, name="Fine-tuned distilGPT2 model")
print(f"Time taken: {elapsed_ft:.2f} seconds")



--- Evaluating distilGPT2 Models ---
Predicting with base distilGPT2 model...


Predicting sentiment: 100%|██████████| 1066/1066 [00:08<00:00, 121.85it/s]



📊 Prediction Distribution for Base distilGPT2 model:
Unknown (value=-1): 1052 samples (98.7%)
Negative (value=0): 9 samples (0.8%)
Positive (value=1): 5 samples (0.5%)
Time taken: 8.75 seconds

Predicting with fine-tuned distilGPT2 model...


Predicting sentiment: 100%|██████████| 1066/1066 [00:09<00:00, 107.29it/s]


📊 Prediction Distribution for Fine-tuned distilGPT2 model:
Negative (value=0): 622 samples (58.3%)
Positive (value=1): 444 samples (41.7%)
Time taken: 9.94 seconds





**Analysis**

* The fine-tuned model returns a valid positive or negative label for every test review, in stark contrast to the base model, whose output was "Unknown" for 98.7% of reviews even with an explicit instruction in the prompt.
* Fine-tuning turned a general-purpose language model into a task-specific sentiment classifier. Whether its labels are correct is measured separately with accuracy, precision, recall, and F1.
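
When a model can answer "Unknown", it helps to report coverage (the share of reviews that received a usable label) next to accuracy. The helper below is a small sketch under the assumption that an unknown prediction counts as an error. `coverage_and_accuracy` is a hypothetical name, and the toy lists are not taken from the notebook's results.

```python
def coverage_and_accuracy(y_true, y_pred, unknown=-1):
    """Return (coverage, accuracy), counting unknown predictions as errors."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = len(y_true)
    # Coverage: fraction of predictions that are a real label
    answered = sum(1 for p in y_pred if p != unknown)
    # Accuracy: unknown (-1) never equals a true label of 0 or 1
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return answered / total, correct / total


# Toy data for illustration only
y_true = [1, 0, 1, 0, 1]
y_pred = [1, -1, 0, 0, -1]
coverage, accuracy = coverage_and_accuracy(y_true, y_pred)
print(f"Coverage: {coverage:.1%}, accuracy: {accuracy:.1%}")
```

Applied to `y_pred_base`, coverage would be 14 of 1066 reviews (about 1.3%), which also caps the base model's accuracy under this convention.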

In [17]:
# Calculate distilGPT2 metrics for fine-tuned model


acc = accuracy_score(test_labels, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(test_labels, y_pred, average="binary")



print("\nFine-tuned distilGPT2:")
print(f"Accuracy: {acc:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")


Fine-tuned distilGPT2:
Accuracy: 0.7570
Precision: 0.8086
Recall: 0.6735
F1 Score: 0.7349
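
These scores can be cross-checked against the prediction distribution reported earlier. The confusion counts below are not printed by the notebook. They are inferred from the 444 positive predictions, the 533 reviews per class, and the reported precision, so treat them as a reconstruction rather than measured output.

```python
# Counts taken from the notebook output
n_pos_true, n_neg_true = 533, 533   # Test set label distribution
n_pos_pred = 444                    # Reviews the fine-tuned model labeled Positive

# Inferred, not printed by the notebook: 0.8086 * 444 is about 359
tp = 359
fp = n_pos_pred - tp   # Predicted Positive, actually Negative
fn = n_pos_true - tp   # Predicted Negative, actually Positive
tn = n_neg_true - fp   # Predicted Negative, actually Negative

accuracy = (tp + tn) / (n_pos_true + n_neg_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

Under this reconstruction, 174 positive reviews were labeled Negative while 85 negative reviews were labeled Positive, so most errors are missed positives.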


**Key Takeaways**
* The fine-tuned DistilGPT2 model achieved 75.70% accuracy on the balanced test set, with precision (80.86%) noticeably higher than recall (67.35%) and an F1-score of 73.49%. The gap reflects a model that predicts "Negative" more often than "Positive" (58.3% vs. 41.7%).
* The outcome suggests that fine-tuning smaller models with techniques like LoRA can be effective even with minimal resources and consumer-grade hardware.