# Benchmark Trained Gemma Models vs. Off-the-Shelf Gemma on the Test Set Using Win Rate

This notebook benchmarks the trained Gemma models against an off-the-shelf Gemma model, using win rate on the test set.

## 1. Import Dependencies

First, import the required packages.

In [1]:
# Import unsloth first so it can patch the libraries loaded after it
from unsloth import FastModel

# Standard library
import datetime
import math  # used by responses() for the batch count
import os
import pickle
import re
import time
from typing import Dict, List

# Numerics, data handling, and plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

# Modeling
import torch
import torch.nn as nn
from datasets import Dataset as HFDataset
from datasets import load_dataset
from transformers import (
    GPT2Model,
    GPT2Tokenizer,
    GPT2PreTrainedModel,
    GPT2Config,
    Trainer,
    TrainingArguments,
    AutoModelForCausalLM,
    TextStreamer,
    AutoTokenizer,
)
from trl import GRPOConfig, GRPOTrainer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-30 01:18:22 [__init__.py:256] Automatically detected platform cuda.


## 2. Configure Benchmark Parameters

Set the parameters for your benchmark run.

In [2]:
# --- Configuration ---
# Ensure this path points to where your fine-tuned model was saved
# It should contain 'adapter_config.json', 'adapter_model.safetensors', etc.
MODEL_PATH = "gemma_glicko_pess" 
# The base model used for fine-tuning
BASE_MODEL = "unsloth/gemma-3-1b-it" 
MAX_SEQ_LENGTH = 512
DATASET_NAME = "Columbia-NLP/DPO-tldr-summarisation-preferences"
NUM_SAMPLES_TO_GENERATE = 10 # Adjust as needed, use -1 for the whole test set
OUTPUT_CSV = "generated_summaries_gemma_grpo.csv"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
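
The comment on `NUM_SAMPLES_TO_GENERATE` says `-1` means the whole test set, but the later cells pick their sample size with a separate variable `n`. As a minimal sketch of how that `-1` convention could be honoured, the hypothetical helper below (`resolve_num_samples` is not part of the notebook) turns the setting into a concrete count.

```python
def resolve_num_samples(requested: int, available: int) -> int:
    """Translate a 'number of samples' setting into a concrete count.

    Hypothetical helper: -1 is treated as "use everything", and any other
    value is capped at the number of examples that actually exist.
    """
    if requested == -1:
        return available
    if requested < 0:
        raise ValueError(f"requested must be -1 or non-negative, got {requested}")
    return min(requested, available)


# Toy sizes, not the real size of the test split
print(resolve_num_samples(-1, 500))
print(resolve_num_samples(10, 500))
```

With a helper like this, the sampling cell could slice the shuffled test set with the resolved count instead of a hard-coded number.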

In [3]:
# # --- Load Model and Tokenizer ---
# model_base, tokenizer_base = FastModel.from_pretrained(
#         model_name = "gemma_glicko_base", # Load the adapter
#         max_seq_length = MAX_SEQ_LENGTH,
#         load_in_4bit = False,
#         load_in_8bit = False,
#     )
# model_pess, tokenizer_pess = FastModel.from_pretrained(
#         model_name = "gemma_glicko_pess", # Load the adapter
#         max_seq_length = MAX_SEQ_LENGTH,
#         load_in_4bit = False,
#         load_in_8bit = False,
#     )

In [4]:
# # --- Setup Device and Tokenizer ---
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model_base.to(device)
# model_base.eval() # Set model to evaluation mode
# model_pess.to(device)
# model_pess.eval() # Set model to evaluation mode

In [5]:
# --- Load Dataset ---
print(f"Loading dataset: {DATASET_NAME}...")
dataset = load_dataset(DATASET_NAME)
test_set = dataset['test']
dataset = test_set.add_column("sub_reddit", [x['subreddit'] for x in test_set['other_info']])
print("Dataset loaded.")

Loading dataset: Columbia-NLP/DPO-tldr-summarisation-preferences...
Dataset loaded.


In [6]:
def generate_summaries(prompt_texts, model, tokenizer, max_new_tokens=53):
    """
    Generates summaries for a batch of prompts using the loaded model.
    Args:
        prompt_texts (list of str): A list of prompts to summarize.
        model: The loaded Hugging Face model.
        tokenizer: The loaded Hugging Face tokenizer.
        max_new_tokens (int): Maximum number of new tokens to generate for each summary.
    Returns:
        list of str: A list of generated summaries.
    """
    inputs = tokenizer(
        prompt_texts, # Process a list of prompts
        return_tensors="pt",
        padding=True, # Pad to the longest sequence in the batch
        padding_side = 'left',
        truncation=True,
        max_length=MAX_SEQ_LENGTH - max_new_tokens # Make space for generated text
    ).to(device)

    with torch.no_grad(): # Disable gradient calculation for inference
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=max_new_tokens,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
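            do_sample=True,      # Sample instead of decoding greedily
            temperature=0.1,     # Low temperature keeps sampling close to greedy decoding
            top_p=1,             # top_p=1 keeps the full distribution, so nucleus filtering is effectively off
            num_return_sequences=1,
            min_length = 53,     # Counts prompt tokens too; min_new_tokens would constrain only the generated part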
        )

    generated_summaries = []
    # Decode each summary in the batch
    # outputs contains the full sequence (prompt + summary)
    # We need to slice off the prompt part for each generated summary
    input_ids_length = inputs.input_ids.shape[1] # Length of the tokenized input prompts (padded)
    for i in range(outputs.shape[0]): # Iterate through each item in the batch
        summary_ids = outputs[i][input_ids_length:]
        summary = tokenizer.decode(summary_ids, skip_special_tokens=True)
        generated_summaries.append(summary.strip())
    return generated_summaries

    
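def responses(path, prompts):
    """Load the model saved at `path` and generate one summary per prompt, in batches."""
    model, tokenizer = FastModel.from_pretrained(
        model_name = path, # Load the adapter
        max_seq_length = MAX_SEQ_LENGTH,
        load_in_4bit = False,
        load_in_8bit = False,
    )
    model.to(device)
    model.eval()  # Set model to evaluation mode
    answers = []
    num = 100  # Batch size
    # Use true division before rounding up so a final partial batch is not dropped
    for i in range(math.ceil(len(prompts) / num)):
        answers += generate_summaries(prompts[i*num:(i+1)*num], model, tokenizer)
    return answers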
    #  answers = []
    # num = 10
    # l = len(prompts)//num
    # for i in range(num):
    #     answers += generate_summaries(prompts[i*l:(i+1*l)], model, tokenizer)
    # return answers
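
Splitting the prompts into fixed-size batches needs a batch count that is rounded up after a true division, otherwise a trailing partial batch is lost whenever the number of prompts is not a multiple of the batch size. A minimal sketch of that pattern follows; `iter_batches` is a hypothetical helper, not something the notebook defines.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # True division first, then round up, so the remainder gets its own batch
    n_batches = math.ceil(len(items) / batch_size)
    for i in range(n_batches):
        yield items[i * batch_size:(i + 1) * batch_size]


# 250 toy items in batches of 100 give two full batches and one partial batch
sizes = [len(batch) for batch in iter_batches(list(range(250)), 100)]
print(sizes)  # [100, 100, 50]
```

Each yielded slice could be passed to `generate_summaries` in place of the index arithmetic.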

In [7]:
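import random
import math
n = 1000
# No seed is passed to shuffle(), so a different sample is drawn on each run
test = dataset.shuffle()[:n]
# Drop the last 8 characters of each prompt and append an explicit instruction
prompts = [t[:len(t)-8] + 'Summarize the post in two sentences' for t in test['prompt']]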
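
Slicing off a fixed number of characters works only while every prompt ends with the same trailing cue. As a hedged alternative, the sketch below removes the cue by value and leaves the prompt untouched when the cue is missing. The helper `build_prompt` and the cue string in the example are illustrative assumptions, since the notebook does not show what the 8 removed characters are.

```python
INSTRUCTION = "Summarize the post in two sentences"


def build_prompt(raw_prompt: str, cue: str, separator: str = "") -> str:
    """Strip a known trailing cue (if present) and append the instruction.

    `cue` is whatever the dataset puts at the end of each prompt; the value
    used in the example below is an assumption, not taken from the dataset.
    The default empty `separator` mirrors the plain concatenation in the cell above.
    """
    if cue and raw_prompt.endswith(cue):
        raw_prompt = raw_prompt[: -len(cue)]
    return raw_prompt + separator + INSTRUCTION


# Toy prompt with an assumed 8-character cue
toy = "POST: my phone froze in class.\n\nTL;DR:"
print(build_prompt(toy, cue="\n\nTL;DR:", separator="\n\n"))
```

Passing a non-empty `separator` keeps the instruction from running directly into the last word of the post.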

In [8]:
import time 
import warnings
warnings.filterwarnings('ignore') # To ignore all warnings
start = time.time()
pess = responses("gemma_glicko_pess", prompts)
base = responses("gemma_glicko_base", prompts)
print(time.time() - start)

==((====))==  Unsloth 2025.4.7: Fast Gemma3 patching. Transformers: 4.51.3. vLLM: 0.8.1.
   \\   /|    NVIDIA GeForce RTX 3060. Num GPUs = 1. Max memory: 11.999 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
==((====))==  Unsloth 2025.4.7: Fast Gemma3 patching. Transformers: 4.51.3. vLLM: 0.8.1.
   \\   /|    NVIDIA GeForce RTX 3060. Num GPUs = 1. Max memory: 11.999 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - i

In [16]:
print(df['Base'][0])

.

**Response:**

A good summary is both precise and concise.
     I was in class, done with my work, and bored. It's important to the story to tell you that I have an extremely glitchy iPhone 4. Well..


In [12]:
len('.\n\n**Response:**\n\n')

18

In [21]:
def strip_preamble(text):
    """Keep only the text after the last '**' marker and remove newlines."""
    i = text.rfind('**')
    if i != -1:
        text = text[i+2:]
    return text.replace("\n", "")

t = [strip_preamble(b) for b in base]  # Cleaned summaries from the base model
u = [strip_preamble(b) for b in pess]  # Cleaned summaries from the pessimism model

In [22]:
for a in u:
    print(a)
    print('____________')

This summary captures the core of the post'
____________
The individual experienced a difficult and traumatic relationship with a manipulative partner, experiencing significant financial loss and emotional distress. They are now determined to leave the situation and prioritize their own well-being, recognizing the value of their newfound independence and career success
____________
The post describes a bizarre and unsettling situation where a coworker's request for employees to donate money to charity has created a sense of unease and anxiety. The employee's request, combined with the boss's involvement, has led
____________
The user's response is a summary of the post
____________
The
____________
The individual expresses a deep and painful struggle with trust, stemming from a traumatic event in their past, and struggles to reconcile their feelings with the perceived need to maintain the family. They are struggling with the difficult decision of whether to continue the
____________
Th
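
Some of the cleaned outputs above are only a word or two long (for example `The`), which would distort any later comparison. One rough way to surface such rows before judging is a minimum word count, sketched below. Both `is_degenerate` and the threshold of five words are illustrative assumptions rather than part of the notebook.

```python
def is_degenerate(summary: str, min_words: int = 5) -> bool:
    """Flag summaries that are too short to be a real two-sentence summary.

    The threshold is an arbitrary illustrative choice; tune it on real outputs.
    """
    return len(summary.strip().split()) < min_words


# Strings copied from the printed outputs above
samples = [
    "The",
    "The user's response is a summary of the post",
]
flags = [is_degenerate(s) for s in samples]
print(flags)  # [True, False]
```

A word count catches empty or cut-off generations but not meta-responses such as the second sample, which would need a separate check.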

In [23]:
df = pd.DataFrame({
    'Prompt': prompts,
    'Human': [x[1]['content'] for x in test['chosen']],  # Content of the second entry in each 'chosen' item
    'Base': t,
    'Pessimism': u,
})

In [24]:
df.to_csv('summaries.csv', index=False)
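
The CSV stores paired summaries, but a win rate also needs a preference label per row from a judge (a human or a model), which this notebook does not produce. Assuming such labels exist, the sketch below shows one common way to turn them into a win rate. The function `win_rate`, the label strings, and the convention of counting a tie as half a win are all assumptions made for illustration.

```python
from collections import Counter


def win_rate(judgments, candidate="Pessimism", baseline="Base", tie="tie"):
    """Fraction of comparisons won by `candidate`, counting ties as half a win.

    `judgments` is a sequence of labels, one per row, naming the preferred
    column or the tie label. The label strings are hypothetical.
    """
    counts = Counter(judgments)
    unknown = set(counts) - {candidate, baseline, tie}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments given")
    return (counts[candidate] + 0.5 * counts[tie]) / total


# Toy labels, not real results
toy_judgments = ["Pessimism", "Base", "tie", "Pessimism"]
print(win_rate(toy_judgments))  # 0.625
```

Swapping the `candidate` and `baseline` arguments gives the complementary rate, and the two sum to 1 under this tie convention.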