# Benchmark Trained Gemma Models Against Off-the-Shelf Gemma on the Test Set Using Win Rate

This notebook benchmarks the trained Gemma models against an off-the-shelf Gemma model, using win rate on the test set.
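
The cells in this notebook generate summaries but do not show the win-rate calculation itself. As a rough sketch only: assuming each test prompt receives one pairwise judgment, win rate could be computed as below. The labels `"trained"`, `"base"`, and `"tie"`, the function name `win_rate`, and the half-credit convention for ties are all assumptions for illustration, not something defined elsewhere in this notebook.

```python
from collections import Counter

def win_rate(judgments, ties_count_half=True):
    """Fraction of pairwise comparisons won by the trained model.

    judgments: iterable of "trained", "base", or "tie" (hypothetical labels).
    Any other label is ignored.
    """
    counts = Counter(judgments)
    total = counts["trained"] + counts["base"] + counts["tie"]
    if total == 0:
        raise ValueError("no judgments to score")
    # Assumed convention: a tie is worth half a win when ties_count_half is True
    wins = counts["trained"] + (0.5 * counts["tie"] if ties_count_half else 0.0)
    return wins / total

example = ["trained", "base", "trained", "tie"]
print(win_rate(example))
print(win_rate(example, ties_count_half=False))
```

Whether ties earn half credit or are dropped changes the number noticeably on small samples, so it is worth stating the convention next to any reported win rate.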

## 1. Import Dependencies

First, import the packages this notebook relies on.

In [1]:
from unsloth import FastModel
import torch
import torch.nn as nn
from datasets import load_dataset
import re
from trl import GRPOConfig, GRPOTrainer
from transformers import (
    GPT2Model,
    GPT2Tokenizer,
    GPT2PreTrainedModel,
    GPT2Config,
    Trainer,
    TrainingArguments,
    AutoModelForCausalLM,
    TextStreamer,
    AutoTokenizer,
)
from typing import Dict, List
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import os
from tqdm import tqdm
from datasets import Dataset as HFDataset
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import datetime
import time
from sklearn.preprocessing import StandardScaler
import math

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 06-02 18:57:47 [__init__.py:256] Automatically detected platform cuda.


## 2. Configure Benchmark Parameters

Set the parameters for your benchmark run.

In [2]:
# --- Configuration ---
# Ensure this path points to where your fine-tuned model was saved
# It should contain 'adapter_config.json', 'adapter_model.safetensors', etc.
MODEL_PATH = "gemma_glicko_pess" 
# The base model used for fine-tuning
BASE_MODEL = "unsloth/gemma-3-1b-it" 
MAX_SEQ_LENGTH = 600
DATASET_NAME = "Columbia-NLP/DPO-tldr-summarisation-preferences"
NUM_SAMPLES_TO_GENERATE = 10 # Adjust as needed, use -1 for the whole test set
OUTPUT_CSV = "generated_summaries_gemma_grpo.csv"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
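
`NUM_SAMPLES_TO_GENERATE` documents a "-1 means the whole test set" convention, but the cells shown here never apply it. A minimal sketch of how that convention might be honoured follows; the helper name `limit_samples` is hypothetical.

```python
def limit_samples(items, num_samples):
    """Return the first num_samples items, or every item when num_samples is -1."""
    if num_samples == -1:
        return list(items)
    if num_samples < 0:
        raise ValueError("num_samples must be -1 or a non-negative integer")
    return list(items[:num_samples])

all_prompts = ["p0", "p1", "p2", "p3"]
print(limit_samples(all_prompts, 2))
print(limit_samples(all_prompts, -1))
```

Calling something like `limit_samples(prompts, NUM_SAMPLES_TO_GENERATE)` before generation would make a quick smoke test on 10 prompts possible without editing the later cells.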

In [3]:
# # --- Load Model and Tokenizer ---
# model_base, tokenizer_base = FastModel.from_pretrained(
#         model_name = "gemma_glicko_base", # Load the adapter
#         max_seq_length = MAX_SEQ_LENGTH,
#         load_in_4bit = False,
#         load_in_8bit = False,
#     )
# model_pess, tokenizer_pess = FastModel.from_pretrained(
#         model_name = "gemma_glicko_pess", # Load the adapter
#         max_seq_length = MAX_SEQ_LENGTH,
#         load_in_4bit = False,
#         load_in_8bit = False,
#     )

In [4]:
# # --- Setup Device and Tokenizer ---
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model_base.to(device)
# model_base.eval() # Set model to evaluation mode
# model_pess.to(device)
# model_pess.eval() # Set model to evaluation mode

In [3]:
# --- Load Dataset ---
test_sr = ['running','Cooking', 'books', 'jobs', 'cats', 'travel', 'Pets', 'dogs', 'offmychest', 'self', 'college', 'personalfinance']
print(f"Loading dataset: {DATASET_NAME}...")
dataset = load_dataset(DATASET_NAME)
test_set = dataset['test']
dataset = test_set.add_column("sub_reddit", [x['subreddit'] for x in test_set['other_info']])
df = dataset.to_pandas()
test_df = df.loc[ df['sub_reddit'].isin(test_sr)]
dataset = HFDataset.from_pandas(test_df)
print("Dataset loaded.")

Loading dataset: Columbia-NLP/DPO-tldr-summarisation-preferences...
Dataset loaded.
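
To make the filtering step easier to follow, here is the same pattern on a toy frame: pull the subreddit out of a nested `other_info` dict, then keep only the rows whose subreddit is in an allow-list. The toy rows and the `keep` list are made up for illustration. Resetting the index is an optional extra that the original cell does not do; a filtered, non-default pandas index is typically carried into a Hugging Face dataset as an extra `__index_level_0__` column.

```python
import pandas as pd

# Toy stand-in for the test split; values are invented for illustration
toy = pd.DataFrame({
    "prompt": ["post a", "post b", "post c"],
    "other_info": [
        {"subreddit": "cats"},
        {"subreddit": "AskReddit"},
        {"subreddit": "dogs"},
    ],
})
keep = ["cats", "dogs"]

# Extract the nested field into its own column, then filter on it
toy["sub_reddit"] = toy["other_info"].map(lambda info: info["subreddit"])
filtered = toy.loc[toy["sub_reddit"].isin(keep)].reset_index(drop=True)
print(filtered[["prompt", "sub_reddit"]])
```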


In [4]:
test_df.describe()

       score_chosen  score_rejected
count        3555.0          3555.0
mean           10.0             1.0
std             0.0             0.0
min            10.0             1.0
25%            10.0             1.0
50%            10.0             1.0
75%            10.0             1.0
max            10.0             1.0


In [15]:
def generate_summaries(prompt_texts, model, tokenizer, max_new_tokens=53):
    """
    Generates summaries for a batch of prompts using the loaded model.
    Args:
        prompt_texts (list of str): A list of prompts to summarize.
        model: The loaded Hugging Face model.
        tokenizer: The loaded Hugging Face tokenizer.
        max_new_tokens (int): Maximum number of new tokens to generate for each summary.
    Returns:
        list of str: A list of generated summaries.
    """
    inputs = tokenizer(
        prompt_texts, # Process a list of prompts
        return_tensors="pt",
        padding=True, # Pad to the longest sequence in the batch
        padding_side = 'left',
        truncation=True,
        max_length=MAX_SEQ_LENGTH - max_new_tokens # Make space for generated text
    ).to(device)

    with torch.no_grad(): # Disable gradient calculation for inference
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=max_new_tokens,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
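            do_sample=True,      # Sample instead of greedy decoding
            temperature=0.1,     # Low temperature keeps sampling close to greedy
            top_p=1,             # top_p=1 leaves the distribution untruncated (nucleus filtering off)
            num_return_sequences=1,
            min_length = 53,     # Note: min_length counts prompt + generated tokens, not only new tokens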
        )

    generated_summaries = []
    # Decode each summary in the batch
    # outputs contains the full sequence (prompt + summary)
    # We need to slice off the prompt part for each generated summary
    input_ids_length = inputs.input_ids.shape[1] # Length of the tokenized input prompts (padded)
    for i in range(outputs.shape[0]): # Iterate through each item in the batch
        summary_ids = outputs[i][input_ids_length:]
        summary = tokenizer.decode(summary_ids, skip_special_tokens=True)
        generated_summaries.append(summary.strip())
    return generated_summaries

    
def responses(path, prompts): 
    model, tokenizer = FastModel.from_pretrained(
        model_name = path, # Load the adapter
        max_seq_length = MAX_SEQ_LENGTH,
        load_in_4bit = False,
        load_in_8bit = False,
    )
    model.to(device)
    model.eval() 
    answers = []
    num = 100
    for i in range(math.ceil(len(prompts)/num)):
        answers+=generate_summaries(prompts[i*num:(i+1)*num], model, tokenizer)
    return answers
    #  answers = []
    # num = 10
    # l = len(prompts)//num
    # for i in range(num):
    #     answers += generate_summaries(prompts[i*l:(i+1*l)], model, tokenizer)
    # return answers
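
`generate_summaries` slices every output row at the same offset, `input_ids_length`. That only works because the batch is left-padded: every prompt then ends at the same column, so the generated tokens start at the same position in each row. The pure-Python sketch below imitates this with token-id lists. `PAD`, `left_pad`, and `fake_generate` are hypothetical stand-ins, not tokenizer or model APIs.

```python
PAD = 0  # hypothetical pad token id

def left_pad(batch, pad_id=PAD):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]

def fake_generate(padded, new_tokens):
    """Stand-in for model.generate: append new tokens to each padded prompt."""
    return [row + new for row, new in zip(padded, new_tokens)]

prompt_ids = [[11, 12, 13], [21]]
padded = left_pad(prompt_ids)
outputs = fake_generate(padded, [[101, 102], [201, 202]])

# One shared offset removes the prompt from every row
prompt_width = len(padded[0])
summaries = [row[prompt_width:] for row in outputs]
print(padded)
print(summaries)
```

With right padding, the pad tokens would sit between the shorter prompt and its continuation, and a single shared offset would no longer separate prompt from summary.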

In [16]:
# Drop the last 8 characters of each prompt, append a two-sentence instruction, and de-duplicate
prompts = list(set([t[:len(t)-8] + 'Summarize the post in two sentences:' for t in dataset['prompt']]))

In [17]:
len(prompts)

307
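
One caveat with de-duplicating through `set`: the iteration order of a set of strings can differ between interpreter sessions, because Python randomizes string hashing by default. Within a single run both models see the same list, so rows stay aligned, but the row order of the saved file may change from run to run. If a stable order matters, an order-preserving de-duplication is a possible alternative. The helper name `dedupe_keep_order` is hypothetical.

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict preserves insertion order, so the keys come back in first-seen order
    return list(dict.fromkeys(items))

raw = ["b", "a", "b", "c", "a"]
unique_prompts = dedupe_keep_order(raw)
print(unique_prompts)
```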

In [18]:
import time 
import warnings
warnings.filterwarnings('ignore') # To ignore all warnings
start = time.time()
pess = responses("gemma_glicko_pess", prompts)
base = responses("gemma_glicko_base", prompts)
print(time.time() - start)

==((====))==  Unsloth 2025.4.7: Fast Gemma3 patching. Transformers: 4.51.3. vLLM: 0.8.1.
   \\   /|    NVIDIA GeForce RTX 3060. Num GPUs = 1. Max memory: 11.999 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
==((====))==  Unsloth 2025.4.7: Fast Gemma3 patching. Transformers: 4.51.3. vLLM: 0.8.1.
   \\   /|    NVIDIA GeForce RTX 3060. Num GPUs = 1. Max memory: 11.999 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - i

In [19]:
len('.\n\n**Response:**\n\n')

18

In [12]:
def strip_response_header(text):
    """Keep only the text after the last '**' marker (if any) and drop newlines."""
    i = text.rfind('**')
    if i != -1:
        text = text[i+2:]
    return text.replace("\n", "")

# Cleaned outputs: t for the base model, u for the pessimism model
t = [strip_response_header(b) for b in base]
u = [strip_response_header(b) for b in pess]

In [20]:
for a in pess[:10]:
    print(a)
    print('____________')

The fiancé signed up for a credit card through the commissary cashier, a point card, which he initially believed was a secure way to avoid affecting his credit. However, he later discovered the card was a credit card, leading to a potential problem with his credit report
____________
The poster has experienced a significant life change, transitioning from a sedentary lifestyle to a demanding schedule of childcare and retail work. This shift has led to a lack of motivation and a struggle to adjust to a new routine, with a body that is struggling to release tension
____________
The author initially received a promising interview for a non-advertised role at a dream company, which was quickly resolved with a decision to offer them a position. However, the author's subsequent lack of communication has led them to question the company's responsiveness
____________
The author initially struggled with paying bills due to a lack of organization. By opening a checking account for recurring expe

In [21]:
df = pd.DataFrame({'Prompt' : prompts, 'Base':base, 'Pessimism' : pess}) 

In [22]:
df.to_csv('summaries.csv', index=False)