# Benchmark Trained Gemma Models vs. Off-the-Shelf Gemma on the Test Set with Win Rate

This notebook benchmarks the trained Gemma models against an off-the-shelf Gemma model, using win rate on the test set.
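The cells in this notebook generate and save the summaries; the win-rate calculation itself is not shown. As a minimal sketch, assuming some judge (human or model) has already labelled each test prompt with the preferred system, a win rate could be computed as follows. The label strings and the rule that a tie counts as half a win are illustrative assumptions, not taken from the notebook.

```python
from collections import Counter

def win_rate(judgments, model="Pessimism", tie_label="tie"):
    """Fraction of comparisons won by `model`; a tie counts as half a win.

    `judgments` is a list with one label per prompt. The labels used here
    ("Pessimism", "Base", "tie") are hypothetical placeholders.
    """
    if not judgments:
        raise ValueError("no judgments to score")
    counts = Counter(judgments)
    return (counts[model] + 0.5 * counts[tie_label]) / len(judgments)

# Four hypothetical judgments: two wins, one loss, one tie
example = ["Pessimism", "Base", "Pessimism", "tie"]
rate = win_rate(example)
print(rate)
```

Counting ties as half a win keeps the two systems' win rates summing to 1; dropping ties from the denominator is an equally common choice.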

## 1. Import Dependencies

First, import the packages the notebook uses.

In [1]:
from unsloth import FastModel
import torch
import torch.nn as nn
from datasets import load_dataset
import re
from trl import GRPOConfig, GRPOTrainer
from transformers import (
    GPT2Model,
    GPT2Tokenizer,
    GPT2PreTrainedModel,
    GPT2Config,
    Trainer,
    TrainingArguments,
    AutoModelForCausalLM,
    TextStreamer,
    AutoTokenizer,
)
from typing import Dict, List
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import os
from tqdm import tqdm
from datasets import Dataset as HFDataset
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import datetime
import time
from sklearn.preprocessing import StandardScaler
import math

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-03 00:11:12 [__init__.py:256] Automatically detected platform cuda.


## 2. Configure Benchmark Parameters

Set the parameters for your benchmark run.

In [2]:
# --- Configuration ---
# Ensure this path points to where your fine-tuned model was saved
# It should contain 'adapter_config.json', 'adapter_model.safetensors', etc.
MODEL_PATH = "gemma_glicko_pess" 
# The base model used for fine-tuning
BASE_MODEL = "unsloth/gemma-3-1b-it" 
MAX_SEQ_LENGTH = 600
DATASET_NAME = "Columbia-NLP/DPO-tldr-summarisation-preferences"
NUM_SAMPLES_TO_GENERATE = 10 # Adjust as needed, use -1 for the whole test set
OUTPUT_CSV = "generated_summaries_gemma_grpo.csv"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [3]:
# # --- Load Model and Tokenizer ---
# model_base, tokenizer_base = FastModel.from_pretrained(
#         model_name = "gemma_glicko_base", # Load the adapter
#         max_seq_length = MAX_SEQ_LENGTH,
#         load_in_4bit = False,
#         load_in_8bit = False,
#     )
# model_pess, tokenizer_pess = FastModel.from_pretrained(
#         model_name = "gemma_glicko_pess", # Load the adapter
#         max_seq_length = MAX_SEQ_LENGTH,
#         load_in_4bit = False,
#         load_in_8bit = False,
#     )

In [4]:
# # --- Setup Device and Tokenizer ---
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model_base.to(device)
# model_base.eval() # Set model to evaluation mode
# model_pess.to(device)
# model_pess.eval() # Set model to evaluation mode

In [5]:
# --- Load Dataset ---
# Subreddits that make up the evaluation subset
test_sr = ['running', 'Cooking', 'books', 'jobs', 'cats', 'travel', 'Pets', 'dogs', 'offmychest', 'self', 'college', 'personalfinance']
print(f"Loading dataset: {DATASET_NAME}...")
dataset = load_dataset(DATASET_NAME)
test_set = dataset['test']
# Lift the subreddit name out of `other_info` into its own column
dataset = test_set.add_column("sub_reddit", [x['subreddit'] for x in test_set['other_info']])
df = dataset.to_pandas()
# Keep only the rows from the selected subreddits
test_df = df.loc[df['sub_reddit'].isin(test_sr)]
dataset = HFDataset.from_pandas(test_df)
print("Dataset loaded.")

Loading dataset: Columbia-NLP/DPO-tldr-summarisation-preferences...
Dataset loaded.


In [6]:
test_df.head()

| | prompt | prompt_id | chosen | rejected | messages | score_chosen | score_rejected | other_info | sub_reddit |
|---|---|---|---|---|---|---|---|---|---|
| 29 | You are an AI assistant good at summarizing re... | 17044c46e73997247c4780d0784be3ddaeca39552f81fe... | [{'content': 'You are an AI assistant good at ... | [{'content': 'You are an AI assistant good at ... | [{'content': 'You are an AI assistant good at ... | 10.0 | 1.0 | {'chosen_note': 'clear. ', 'id': 't3_1g... | dogs |
| 30 | You are an AI assistant good at summarizing re... | 17044c46e73997247c4780d0784be3ddaeca39552f81fe... | [{'content': 'You are an AI assistant good at ... | [{'content': 'You are an AI assistant good at ... | [{'content': 'You are an AI assistant good at ... | 10.0 | 1.0 | {'chosen_note': 'clear. ', 'id': 't3_1gyf... | dogs |
| 31 | You are an AI assistant good at summarizing re... | 17044c46e73997247c4780d0784be3ddaeca39552f81fe... | [{'content': 'You are an AI assistant good at ... | [{'content': 'You are an AI assistant good at ... | [{'content': 'You are an AI assistant good at ... | 10.0 | 1.0 | {'chosen_note': 'clear. ', 'id': ... | dogs |
| 32 | You are an AI assistant good at summarizing re... | 17044c46e73997247c4780d0784be3ddaeca39552f81fe... | [{'content': 'You are an AI assistant good at ... | [{'content': 'You are an AI assistant good at ... | [{'content': 'You are an AI assistant good at ... | 10.0 | 1.0 | {'chosen_note': 'clear. ', 'id': 't3_1g... | dogs |
| 33 | You are an AI assistant good at summarizing re... | 17044c46e73997247c4780d0784be3ddaeca39552f81fe... | [{'content': 'You are an AI assistant good at ... | [{'content': 'You are an AI assistant good at ... | [{'content': 'You are an AI assistant good at ... | 10.0 | 1.0 | {'chosen_note': 'clear.', 'id': 't3_1gyf5t', '... | dogs |


In [7]:
def generate_summaries(prompt_texts, model, tokenizer, max_new_tokens=53):
    """
    Generates summaries for a batch of prompts using the loaded model.
    Args:
        prompt_texts (list of str): A list of prompts to summarize.
        model: The loaded Hugging Face model.
        tokenizer: The loaded Hugging Face tokenizer.
        max_new_tokens (int): Maximum number of new tokens to generate for each summary.
    Returns:
        list of str: A list of generated summaries.
    """
    inputs = tokenizer(
        prompt_texts, # Process a list of prompts
        return_tensors="pt",
        padding=True, # Pad to the longest sequence in the batch
        padding_side = 'left', # Left-pad so every row's generated tokens start at the same index
        truncation=True,
        max_length=MAX_SEQ_LENGTH - max_new_tokens # Make space for generated text
    ).to(device)

    with torch.no_grad(): # Disable gradient calculation for inference
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=max_new_tokens,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
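            do_sample=True,      # Sample instead of greedy decoding
            temperature=0.1,     # Low temperature keeps sampling close to greedy
            top_p=1,             # 1 keeps the full distribution (no nucleus filtering)
            num_return_sequences=1,
            min_length = 53,     # Counts prompt tokens too, so it has no effect once the prompt exceeds 53 tokens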
        )

    generated_summaries = []
    # Decode each summary in the batch
    # outputs contains the full sequence (prompt + summary)
    # We need to slice off the prompt part for each generated summary
    input_ids_length = inputs.input_ids.shape[1] # Length of the tokenized input prompts (padded)
    for i in range(outputs.shape[0]): # Iterate through each item in the batch
        summary_ids = outputs[i][input_ids_length:]
        summary = tokenizer.decode(summary_ids, skip_special_tokens=True)
        generated_summaries.append(summary.strip())
    return generated_summaries

    
def responses(path, prompts):
    """Load the model saved at `path` and summarize `prompts` in batches."""
    model, tokenizer = FastModel.from_pretrained(
        model_name = path, # Load the adapter
        max_seq_length = MAX_SEQ_LENGTH,
        load_in_4bit = False,
        load_in_8bit = False,
    )
    model.to(device)
    model.eval() # Set model to evaluation mode
    answers = []
    batch_size = 100
    for i in range(math.ceil(len(prompts) / batch_size)):
        batch = prompts[i * batch_size:(i + 1) * batch_size]
        answers += generate_summaries(batch, model, tokenizer)
    return answers
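`generate_summaries` slices every output row at the same index (`input_ids_length`) to drop the prompt. That only works because the batch is left-padded, so all prompts end at the same column. The sketch below illustrates the idea with plain Python lists instead of tensors; the token IDs, the `PAD` value, and the `left_pad` helper are made up for illustration and are not part of the notebook.

```python
PAD = 0  # hypothetical pad token ID

def left_pad(batch, pad_id=PAD):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]

# Two toy "prompts" of different lengths
prompt_ids = [[11, 12, 13], [21]]
padded = left_pad(prompt_ids)

# Pretend the model appended two new tokens to each row
outputs = [row + [91, 92] for row in padded]

# Because padding is on the left, one shared offset removes every prompt
prompt_width = len(padded[0])
new_tokens = [row[prompt_width:] for row in outputs]
print(padded)
print(new_tokens)
```

With right padding the pad tokens would sit between each shorter prompt and its generated text, and a single shared offset would no longer separate the two cleanly.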

In [8]:
# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
test = test_df[['prompt', 'chosen']].copy()
# Swap the last 8 characters of each prompt for an explicit two-sentence instruction
test['prompt'] = [t[:len(t)-8] + 'Summarize the post in two sentences:' for t in test['prompt']]
# Keep only the text of the second message in `chosen` (the reference summary)
test['chosen'] = [x[1]['content'] for x in test['chosen']]
test = test.drop_duplicates('prompt')
test.reset_index()


| | index | prompt | chosen |
|---|---|---|---|
| 0 | 29 | You are an AI assistant good at summarizing re... | My dogs get bored of things very easily, so I ... |
| 1 | 59 | You are an AI assistant good at summarizing re... | Just broke up with wife of 9 years. Slight imp... |
| 2 | 201 | You are an AI assistant good at summarizing re... | First time cooking a ribeye steak, looking for... |
| 3 | 253 | You are an AI assistant good at summarizing re... | Called probation office on sister for meth, sh... |
| 4 | 259 | You are an AI assistant good at summarizing re... | Teenage genius is depressed having fallen shor... |
| ... | ... | ... | ... |
| 302 | 45972 | You are an AI assistant good at summarizing re... | I'm at my first job and I don't know how to mo... |
| 303 | 48297 | You are an AI assistant good at summarizing re... | Keep putting myself out there and caring for p... |
| 304 | 49609 | You are an AI assistant good at summarizing re... | Knee pain due to poor balance, Orthopedist pre... |
| 305 | 50132 | You are an AI assistant good at summarizing re... | Ex-wife owes me $2,000. I want to know how I ... |


In [19]:
prompts = test['prompt'].to_list()

In [10]:
import time
import warnings
warnings.filterwarnings('ignore') # Silence all warnings during generation
start = time.time()
pess = responses("gemma_glicko_pess", prompts)
base = responses("gemma_glicko_base", prompts)
print(time.time() - start)

==((====))==  Unsloth 2025.4.7: Fast Gemma3 patching. Transformers: 4.51.3. vLLM: 0.8.1.
   \\   /|    NVIDIA GeForce RTX 3060. Num GPUs = 1. Max memory: 11.999 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
==((====))==  Unsloth 2025.4.7: Fast Gemma3 patching. Transformers: 4.51.3. vLLM: 0.8.1.
   \\   /|    NVIDIA GeForce RTX 3060. Num GPUs = 1. Max memory: 11.999 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - i

In [11]:
len('.\n\n**Response:**\n\n')

18

In [12]:
def strip_header(text):
    """Drop everything up to the last '**' marker and remove newlines."""
    i = text.rfind('**')
    if i != -1:
        text = text[i+2:]
    return text.replace("\n", "")

t = [strip_header(b) for b in base] # cleaned base-model summaries
u = [strip_header(b) for b in pess] # cleaned pessimism-model summaries
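Cutting at the last `**` also truncates any summary that uses bold text in its body, and deleting newlines glues sentences together. A stricter variant, sketched below, removes only a leading `**Response:**` header and turns newlines into single spaces. The regular expression and the `strip_response_header` name are assumptions for illustration; the only detail taken from the notebook is the `**Response:**` header string.

```python
import re

# Hypothetical pattern: any lead-in text that ends in a "**Response:**" header
HEADER = re.compile(r"^.*?\*\*Response:\*\*\s*", flags=re.DOTALL)

def strip_response_header(text):
    """Remove a leading '**Response:**' header, then collapse whitespace."""
    body = HEADER.sub("", text, count=1)
    return " ".join(body.split())

raw = "Sure.\n\n**Response:**\n\nThe dog is **very** bored.\nIt sleeps."
cleaned = strip_response_header(raw)

# For comparison: cutting at the last '**' loses the start of the summary
i = raw.rfind("**")
cut_at_last_marker = raw[i + 2:].replace("\n", "")

print(cleaned)
print(cut_at_last_marker)
```

Text without the header passes through unchanged apart from whitespace normalization, so the function is safe to apply to every generated summary.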

In [13]:
for a in pess[:10]:
    print(a)
    print('____________')

The author is experiencing a decline in their dogs' enthusiasm and happiness.  Initially, the dogs enjoyed the author's presence and games, but as the author's visits became less frequent, their excitement faded, leading to boredom and depression.  The author
____________
The author is experiencing a significant shift in their life, having recently experienced a breakup and a subsequent period of emotional recovery.  While initially apprehensive about the separation, the author now feels content and optimistic about the future, with plans to pursue personal goals and a renewed
____________
The author is a first-time cook and is hesitant to start with a 1.25" ribeye. They've researched the process extensively but are unsure about the best method for achieving a perfect sear, and are opting for a more frequent flipping
____________
The author is deeply troubled by their sister's relapse into drug use and the subsequent consequences, including her arrest and the loss of custody of her nie

In [14]:
human = test['chosen'].to_list()

In [21]:
# Remove the appended instruction (36 characters) plus the 2 characters before it
prompts = [x[:len(x)-38] for x in prompts]
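The hard-coded `38` depends on the exact length of the instruction appended earlier. A less brittle alternative, sketched here under the assumption that the goal is to recover the prompt without the instruction and without trailing whitespace, is `str.removesuffix` (Python 3.9+). The `strip_instruction` helper is hypothetical, and its result can differ from the fixed-width slice when the two characters before the instruction are not whitespace.

```python
INSTRUCTION = "Summarize the post in two sentences:"

def strip_instruction(prompt, suffix=INSTRUCTION):
    """Remove the trailing instruction if present, then trailing whitespace."""
    return prompt.removesuffix(suffix).rstrip()

# Toy prompt standing in for a real test-set prompt
example_prompt = "POST: my dog is bored\n\n" + INSTRUCTION
stripped = strip_instruction(example_prompt)
print(len(INSTRUCTION))
print(repr(stripped))
```

Unlike the slice, this leaves a prompt untouched when the instruction is missing, which makes the cell safe to re-run.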

In [23]:
df = pd.DataFrame({'Prompt' : prompts, 'Base':base, 'Pessimism' : pess, 'Human' : human}) 

In [24]:
df.to_csv('summaries.csv', index=False)