# Sandbox: generate ratings 


The idea is to have a minimal modeler instantiation that takes prompts and generates output, so that I can do some test runs with my formative models. This was prompted by analysing the items on which the reviewers disagreed strongly (a difference of 3, meaning one reviewer rated the answer bad and the other excellent). I found that such disagreement often comes from a difference in interpretation. For example, in "1480e9a9053f9af7" one reviewer judged the answer to be superficially sound and gave it a 4 ("An extremely detailed response targeting exactly what the person is asking. Even the format is pleasing to read with bold headers that go into greater detail about all aspects about what the workers should and shouldn't be doing"), while the other reviewer saw that it did not really answer the question at all ("This does not answer the question. Rather than providing information for the employers, this answer provides guidelines for the mail and parcel delivery drivers.").

Q:
- What should employers of mail and parcel delivery drivers do to protect their employees from COVID-19?
A:
- As a mail and parcel delivery driver, potential sources of exposure include
having close contact with co-workers or delivery recipients, ..... (very long answer)

The question clearly asks about __employers__, but the answer addresses __employees__. Importantly, the models did not catch this inconsistency either.

NOTE: this problem _might_ be isolated to the QA evaluation, since one could consider answers to be right or wrong based on objective criteria, whereas that is more difficult for marketing copy evaluation. 
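
The disagreement filter described above can be sketched in a few lines. This is only an illustration: the records are toy data (only the id "1480e9a9053f9af7" and its scores of 4 and 1 come from this notebook), and the helper names `add_disagreement` and `large_disagreements` are hypothetical, not part of `src`.

```python
# Toy records standing in for the QA feedback table.
# Only the first row mirrors a real item; the others are made up.
records = [
    {"input_id": "1480e9a9053f9af7", "score_1": 4, "score_2": 1},
    {"input_id": "example-a", "score_1": 3, "score_2": 4},
    {"input_id": "example-b", "score_1": 2, "score_2": 2},
]


def add_disagreement(rows):
    """Return copies of the rows with the absolute score difference attached."""
    return [{**r, "disagreement": abs(r["score_1"] - r["score_2"])} for r in rows]


def large_disagreements(rows, threshold=3):
    """Keep rows whose reviewers differ by at least `threshold` points."""
    return [r for r in add_disagreement(rows) if r["disagreement"] >= threshold]


flagged = large_disagreements(records)
print([r["input_id"] for r in flagged])
```

On a 1 to 4 scale a threshold of 3 only keeps the items where one reviewer chose the lowest score and the other the highest.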



Goal: 
- See whether specific prompting can get a correct evaluation out of the models (within the latent variable modelling framework I am considering)


# INIT

In [1]:
import gc
from pathlib import Path

import pandas as pd
import torch
import yaml

from src import data_processing, paths
from src.data_manager import DataManager
from src.modeler import Modeler
from src.prompt_manager import (
    PromptManager,
    PromptTemplate,
    PreparedPrompt,
    create_chat_prompt_dict,
    create_empty_prompt_template,
)

------

# Basic Modeler MVP

In [2]:
model_name = "mistralai/Ministral-8B-Instruct-2410"

# model_name = 'google/gemma-3-4b-it'
modeler = Modeler(model_name)

Using device: cuda



You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


Padding side: left


In [3]:
# Free-form prompt: no token constraints and no assistant prefix.
pp = PreparedPrompt(
    conversation=[{"role": "user", "content": "Why is the sky blue?"}],
    id="1",
    input_id="1",
    dimension_name="asd",
    assistant_prefix="",
)

# Rating prompt: the reply starts with "RATING: " and the digits 1-6 are declared as token constraints.
pp2 = PreparedPrompt(
    conversation=[{"role": "user", "content": "Provide a rating out of 6 for so-so"}],
    id="1",
    input_id="1",
    token_constraints=list("123456"),
    dimension_name="asd",
    assistant_prefix="RATING: ",
)

output = modeler.generate_chat([pp, pp2], max_new_tokens=20, top_k=100)

modeler.decode(output)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




['The sky appears blue due to a process called Rayleigh scattering. Rayleigh scattering is the scattering of',
 '3/6\n\nSo-so is a subjective term that can vary based on personal preferences and expectations.']

In [4]:
output

[ModelOutput(prompt=PreparedPrompt(conversation=[{'role': 'user', 'content': 'Why is the sky blue?'}], id='1', dimension_name='asd', assistant_prefix='', input_id='1', token_constraints=None, constraint_ids=None, metadata={}), sequence=tensor([    2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     1,     1,   733, 16289, 28793,  4315,   349,   272,  7212,
          5045, 28804,   733, 28748, 16289, 28793,   415,  7212,  8045,  5045,
          2940,   298,   264, 20757,  1987,  8952,   291,   956, 21635, 28723,
          1684,   272,  4376, 28742, 28713,  2061], device='cuda:0'), input_length=26, logits=LogitsContainer(values=tensor([[24.5895, 16.2988, 15.1364,  ...,  4.1043,  4.1017,  4.0821],
         [17.1203, 14.6359, 14.2582,  ...,  5.2037,  5.2031,  5.1926],
         [23.7761, 18.7153, 17.3523,  ...,  5.6797,  5.6058,  5.5777],
         ...,
         [22.3593, 14.8893, 12.5386,  ...,  4.3191,  4.2971,  4.2947],
         [27.5123, 13.3961, 10.2917

In [9]:
inp_text = modeler.tokenizer.apply_chat_template(pp2.conversation, tokenize = False)
model_inputs = modeler.tokenizer(
    inp_text,
    return_tensors="pt",
    padding=True,
    # padding='max_length',
    truncation=True,
    max_length=10
).to(modeler.device)

In [11]:
modeler.model.generate(**model_inputs)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


tensor([[    1,     1,   733, 16289, 28793,  7133,   547,   264, 15470,   575,
           302, 28705, 28740, 28734,   354,   272,  2296,  6251, 28747,   345,
          1014,  8599,   349,  7007,   611,    13,    13, 28737,   682,  4338]],
       device='cuda:0')

`constrained_tokens` are not part of this output because they require the `constraint_ids`, which are pre-computed in the `PromptSuite` for efficiency (here `constraint_ids=None`).
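
To make the role of the pre-computed ids concrete, here is a minimal sketch of the idea. It does not use the real tokenizer or the `PromptSuite` API: `TOY_VOCAB`, `precompute_constraint_ids` and `constrained_logits` are hypothetical names, and the logits are invented.

```python
# Hypothetical toy vocabulary; a real tokenizer would supply these ids.
TOY_VOCAB = {"1": 11, "2": 12, "3": 13, "4": 14, "5": 15, "6": 16, "Okay": 20}


def precompute_constraint_ids(tokens, vocab):
    """Look up each allowed token once, so generation does not repeat the work."""
    return [vocab[t] for t in tokens]


def constrained_logits(logits_by_id, constraint_ids):
    """Keep only the logits of the allowed token ids, in constraint order."""
    return [logits_by_id[i] for i in constraint_ids]


allowed_tokens = list("123456")
constraint_ids = precompute_constraint_ids(allowed_tokens, TOY_VOCAB)

# Made-up next-token logits: "Okay" is the overall favourite,
# but it is ignored once we restrict attention to the rating tokens.
step_logits = {11: 0.5, 12: 1.0, 13: 4.0, 14: 2.0, 15: 0.1, 16: -1.0, 20: 6.0}
scores = constrained_logits(step_logits, constraint_ids)
best = allowed_tokens[scores.index(max(scores))]
print(constraint_ids, best)
```

Without the ids there is nothing to index the logits with, which is why a prompt built outside a suite carries no constrained output.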

----------

# Generation pipeline MVP

In [1]:
from src.pipeline import run_experiment
from src.modeler import Modeler
from pathlib import Path
import pandas as pd
from src.prompt_manager import PromptSuite, PromptTemplate, create_empty_prompt_template
from src.results import ResultsContainer
import torch

## Config

In [2]:
df_deals = pd.read_parquet('../data/raw/BARTER_DEALS.parquet')

In [11]:
# model_name = 'google/gemma-3-1b-it'
# model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
model_name = 'google/gemma-3-4b-it'
output_dir = Path('../results_test')
file_stem = "barter_test_2"
variable_names = ['deal_text']
batch_size = 2
id_col = 'deal_id'
top_k = 10
assistant_prefix = ["", "RATING: "][1]
shards_per_save = 10  # rows per saved checkpoint = shards_per_save * batch_size
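
As a quick sanity check on the checkpoint size, the arithmetic in the comment above can be written out. This sketch assumes that one shard corresponds to one batch and that each input row yields one prompt; both are assumptions about the pipeline, and the function names are hypothetical.

```python
import math


def rows_per_checkpoint(shards_per_save, batch_size):
    """Rows written to a single checkpoint file (assumes one shard == one batch)."""
    return shards_per_save * batch_size


def checkpoints_needed(n_rows, shards_per_save, batch_size):
    """Number of checkpoint files needed to cover n_rows."""
    return math.ceil(n_rows / rows_per_checkpoint(shards_per_save, batch_size))


print(rows_per_checkpoint(shards_per_save=10, batch_size=2))
print(checkpoints_needed(n_rows=20, shards_per_save=10, batch_size=2))
```

With the settings in this cell a 20-row test run fits into a single checkpoint.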

In [12]:
modeler = Modeler(model_name)

Using device: cuda



Padding side: left


In [13]:
pt = create_empty_prompt_template()

USER_PROMPT_TEMPLATE = """Deal text:
{deal_text}

Rate the quality of the marketing copy.
"""

SYSTEM_PROMPT = """
You will be given a deal text aka marketing copy. Rate the quality of the copy on a scale of 5. 
"""

pt.user_message_template = USER_PROMPT_TEMPLATE
pt.system_message = SYSTEM_PROMPT
pt.assistant_prefix = "RATING: "
pt.token_constraints = list("12345")

ps = PromptSuite(id= "test", templates = {"abc": pt})

ps.precompute_constraints(modeler.tokenizer)

Pre-computing constraint IDs for Suite test...


## Generation

In [14]:
run_experiment(df=df_deals[:20],
               modeler=modeler,
               suite=ps,
               output_dir=output_dir,
               file_stem=file_stem,
               model_name=model_name,
               batch_size=batch_size,
               id_col=id_col,
               top_k=top_k,
               assistant_prefix=assistant_prefix,
               shards_per_save=shards_per_save,
               max_new_tokens=1)

running generation function...
Processing 20 new IDs...
Resuming at Shard 0. Starting streaming inference...
Starting streaming inference...


  0%|          | 0/11 [00:00<?, ?it/s]



 91%|█████████ | 10/11 [00:04<00:00,  2.34it/s]

 Saved checkpoint: barter_test_2_part_0000.pt





## Load results (automated in generation pipeline)


In [15]:
output_dir

WindowsPath('../results_test')

In [16]:
rc = ResultsContainer.get_experiment_state(output_dir)
print(rc)

Scanning 1 existing shards for state recovery...
(1, {'019689e2-efa8-00c8-0aff-9b643cb6337b', '0196cebe-6b09-00c8-c158-41d4ad0c8a0e', '019689e3-acb5-00c8-c791-fe6473f13056', '019689e2-f8cf-00c8-f98b-2dd657eec2a8', '019689e5-f489-00c8-8c26-06b3ea0961a5', '019689e4-bce7-00c8-0815-d6a659be9456', '019689e4-1ac5-00c8-aac6-bcb8c2a3b55a', '019689e4-306b-00c8-d98c-70210470c288', '019689e3-d7a5-00c8-f5a8-6bfe51a21cd6', '019689e2-bfb5-00c8-6485-23352d64b0e5', '01985fd5-7790-00c8-de87-3d581ece0461', '019689e5-7dd0-00c8-615b-7750a8fec2f3', '01973621-62d0-00c8-ed32-a769ec1384af', '01985615-068a-00c8-502e-689ad4f46f4f', '019689e2-c086-00c8-88ca-99f346918665', '0196edc7-e6a8-00c8-6ff5-ea60fa3f2895', '019689e4-93d1-00c8-615c-b8f3e2c25079', '019689e3-aec2-00c8-9617-e72df87220e5', '019689e4-9433-00c8-67d6-281a9789eb34', '019689e4-1af2-00c8-70dc-2d4567f53721'})
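
The recovered state (a shard counter plus the set of already processed ids) is what makes resuming possible. A rough sketch of that logic follows, assuming the `<file_stem>_part_NNNN.pt` naming seen in the checkpoint log; `next_shard_index` and `pending_ids` are hypothetical helpers, not the actual `ResultsContainer` implementation.

```python
import re

# Matches names like "barter_test_2_part_0000.pt" (pattern inferred from the log).
SHARD_PATTERN = re.compile(r"^(?P<stem>.+)_part_(?P<index>\d{4})\.pt$")


def next_shard_index(filenames, file_stem):
    """Return the index the next shard should get, or 0 if none exist yet."""
    indices = [
        int(m.group("index"))
        for name in filenames
        if (m := SHARD_PATTERN.match(name)) and m.group("stem") == file_stem
    ]
    return max(indices) + 1 if indices else 0


def pending_ids(all_ids, processed_ids):
    """Keep only the ids that have not been processed, preserving order."""
    return [i for i in all_ids if i not in processed_ids]


print(next_shard_index(["barter_test_2_part_0000.pt"], "barter_test_2"))
print(pending_ids(["a", "b", "c"], {"b"}))
```

Filtering by id rather than by row position keeps a rerun safe even if the input dataframe is reordered between runs.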


In [17]:
from src.results import ResultsContainer
from pathlib import Path
save_dir = Path('../results_test')
file_stem = "barter_test_2"
res = ResultsContainer.load_from_shards(save_dir)
res

ResultsContainer(data={'sequences': [tensor([     2,      2,    105,   2364,    108,   3048,    795,    577,   2238,
           496,   3772,   1816,  44485,   7009,   4865, 236761,  25870,    506,
          3325,    529,    506,   4865,    580,    496,   5559,    529, 236743,
        236810, 236761, 236743,    109, 134478,   1816, 236787,    107, 236792,
          8714,  11109,  28007,    563,  38564,    496,   1944,   2299,    529,
         44100,    531,   3517,    672,   7672, 236764,   3353, 236743, 236778,
        236800, 236764,    573,    496,   6382,   3004,   8363,   4888,    506,
         14871,    532,    528,    506,   3207,    529,  28007, 236743, 246272,
           107,   3048, 236858,    859,    577,  28172,    684,    587,   8714,
        236858, 236751,   9578,   2434,    532,   5908,    496,   9813,  62974,
           528,  10764, 236761,    669,   5671, 236787,   2619,  10103,  58442,
           506,  14871, 236764,    506,   1610, 236764,    532,    506,   6782,
   

In [24]:
res.metadata

Unnamed: 0,prompt_id,input_id,dimension_name,assistant_prefix,input_length,model_name,top_k,constrained_token_ids
0,test,0196edc7-e6a8-00c8-6ff5-ea60fa3f2895,abc,RATING:,203,google/gemma-3-4b-it,10,"[236770, 236778, 236800, 236810, 236812]"
1,test,019689e5-f489-00c8-8c26-06b3ea0961a5,abc,RATING:,203,google/gemma-3-4b-it,10,"[236770, 236778, 236800, 236810, 236812]"
2,test,019689e2-bfb5-00c8-6485-23352d64b0e5,abc,RATING:,149,google/gemma-3-4b-it,10,"[236770, 236778, 236800, 236810, 236812]"
3,test,019689e3-d7a5-00c8-f5a8-6bfe51a21cd6,abc,RATING:,149,google/gemma-3-4b-it,10,"[236770, 236778, 236800, 236810, 236812]"
4,test,019689e5-7dd0-00c8-615b-7750a8fec2f3,abc,RATING:,178,google/gemma-3-4b-it,10,"[236770, 236778, 236800, 236810, 236812]"
5,test,019689e2-efa8-00c8-0aff-9b643cb6337b,abc,RATING:,178,google/gemma-3-4b-it,10,"[236770, 236778, 236800, 236810, 236812]"
6,test,0196cebe-6b09-00c8-c158-41d4ad0c8a0e,abc,RATING:,192,google/gemma-3-4b-it,10,"[236770, 236778, 236800, 236810, 236812]"
7,test,019689e4-306b-00c8-d98c-70210470c288,abc,RATING:,192,google/gemma-3-4b-it,10,"[236770, 236778, 236800, 236810, 236812]"
8,test,019689e4-bce7-00c8-0815-d6a659be9456,abc,RATING:,254,google/gemma-3-4b-it,10,"[236770, 236778, 236800, 236810, 236812]"
9,test,019689e2-f8cf-00c8-f98b-2dd657eec2a8,abc,RATING:,254,google/gemma-3-4b-it,10,"[236770, 236778, 236800, 236810, 236812]"


In [25]:
modeler.tokenizer.convert_ids_to_tokens(res.metadata['constrained_token_ids'].iloc[0])

['1', '2', '3', '5', '4']
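
Note that the decoded tokens come back as 1, 2, 3, 5, 4: the stored ids are in ascending id order, which is not the same as rating order. Any statistic computed over the constrained logits therefore needs the tokens and logits reordered together first. A minimal sketch, with invented logit values and a hypothetical helper name `sort_by_rating`:

```python
def sort_by_rating(tokens, logits):
    """Reorder tokens and their logits together by the numeric rating value."""
    order = sorted(range(len(tokens)), key=lambda i: int(tokens[i]))
    return [tokens[i] for i in order], [logits[i] for i in order]


tokens = ["1", "2", "3", "5", "4"]   # order as decoded above
logits = [0.1, 0.2, 0.3, 0.5, 0.4]   # made-up values, aligned with `tokens`

sorted_tokens, sorted_logits = sort_by_rating(tokens, logits)
print(sorted_tokens)
print(sorted_logits)
```

Sorting an index list rather than the tokens themselves keeps each logit attached to its token.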

-------

# DataManager: create analysis dataframe

In [1]:
from src.data_manager import DataManager
from pathlib import Path
import pandas as pd

In [2]:
results_dir = Path("../results_test")
dm = DataManager(results_dir)

In [3]:
dm.load_all()
df = dm.create_analysis_dataframe(tokenize = True)

In [4]:
df

Unnamed: 0,prompt_id,input_id,dimension_name,assistant_prefix,input_length,model_name,top_k,constrained_token_ids,sequences,top_k_logits,constrained_logits,top_k_tokens,constrained_tokens
0,3d33389357,0196edc7-e6a8-00c8-6ff5-ea60fa3f2895,holistic,,214,google/gemma-3-1b-it,1000,"[236770, 236778, 236800, 236812]","[2, 2, 105, 2364, 107, 3048, 795, 577, 2238, 4...","[[31.875, 31.875, 30.125, 28.875, 26.875, 25.6...","[[19.625, 18.125, 25.375, 30.125]]","[[Okay, I, 4, **, Here, ★★, 3, Overall, Let, Y...","[1, 2, 3, 4]"
1,3d33389357,019689e5-f489-00c8-8c26-06b3ea0961a5,holistic,,214,google/gemma-3-1b-it,1000,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[44.25, 35.75, 35.0, 33.5, 30.0, 29.625, 29.1...","[[24.25, 22.0, 24.375, 29.625]]","[[Okay, Let, I, Here, **, 4, Alright, My, Rati...","[1, 2, 3, 4]"
2,3d33389357,019689e2-bfb5-00c8-6485-23352d64b0e5,holistic,,160,google/gemma-3-1b-it,1000,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 2, 2, 105, 2364, 107, 3048, 79...","[[37.5, 35.5, 34.0, 30.625, 30.25, 29.0, 26.25...","[[22.25, 20.5, 26.0, 34.0]]","[[Okay, I, 4, Let, **, Here, ★★, Rating, 3, On...","[1, 2, 3, 4]"
3,3d33389357,019689e3-d7a5-00c8-f5a8-6bfe51a21cd6,holistic,,160,google/gemma-3-1b-it,1000,"[236770, 236778, 236800, 236812]","[2, 2, 105, 2364, 107, 3048, 795, 577, 2238, 4...","[[37.75, 35.75, 31.375, 30.375, 30.375, 29.75,...","[[22.0, 21.625, 28.0, 30.375]]","[[Okay, I, Let, **, 4, Here, 3, My, Overall, O...","[1, 2, 3, 4]"
4,3d33389357,019689e5-7dd0-00c8-615b-7750a8fec2f3,holistic,,189,google/gemma-3-1b-it,1000,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[40.5, 33.25, 32.5, 31.875, 31.625, 29.25, 26...","[[21.75, 21.625, 24.25, 31.625]]","[[Okay, I, Here, Let, 4, **, Alright, Rating, ...","[1, 2, 3, 4]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,test,019689e7-3192-00c8-78be-4ebef15ec543,abc,RATING:,464,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[38.25, 36.75, 36.25, 34.0, 33.75, 33.5, 31.2...","[[27.0, 29.875, 33.5, 36.25]]","[[9, 8, 4, 5, 7, 3, 6, 2, 1, \n\n]]","[1, 2, 3, 4]"
396,test,019689e4-3abc-00c8-7d0d-e43b4c5f011b,abc,RATING:,167,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[2, 2, 105, 2364, 108, 3048, 795, 577, 2238, 4...","[[37.25, 37.25, 36.75, 36.75, 34.25, 33.5, 33....","[[23.5, 29.0, 34.25, 37.25]]","[[4, 9, 8, 7, 3, 6, 5, 2, \n\n, ▁⭐]]","[1, 2, 3, 4]"
397,test,019689e6-1423-00c8-afd0-83173a951191,abc,RATING:,167,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, ...","[[37.75, 37.0, 36.5, 35.25, 34.0, 33.25, 32.5,...","[[24.75, 32.5, 37.0, 36.5]]","[[7, 3, 4, 6, 5, 8, 2, 9, **, \n\n]]","[1, 2, 3, 4]"
398,test,0197a231-f94b-00c8-4fd3-17b8fc086649,abc,RATING:,224,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[37.0, 36.5, 35.5, 35.5, 34.75, 34.5, 33.25, ...","[[24.75, 32.0, 34.75, 36.5]]","[[9, 4, 8, 7, 3, 5, 6, 2, \n\n, ▁⭐]]","[1, 2, 3, 4]"


In [5]:
df['top_k_tokens'].apply(lambda x: tuple(x[0]))

0      (Okay, I, 4, **, Here, ★★, 3, Overall, Let, Ye...
1      (Okay, Let, I, Here, **, 4, Alright, My, Ratin...
2      (Okay, I, 4, Let, **, Here, ★★, Rating, 3, On,...
3      (Okay, I, Let, **, 4, Here, 3, My, Overall, On...
4      (Okay, I, Here, Let, 4, **, Alright, Rating, O...
                             ...                        
395                    (9, 8, 4, 5, 7, 3, 6, 2, 1, \n\n)
396                   (4, 9, 8, 7, 3, 6, 5, 2, \n\n, ▁⭐)
397                   (7, 3, 4, 6, 5, 8, 2, 9, **, \n\n)
398                   (9, 4, 8, 7, 3, 5, 6, 2, \n\n, ▁⭐)
399                   (7, 8, 6, 4, 9, 3, 5, 2, \n\n, **)
Name: top_k_tokens, Length: 420, dtype: object

## Data Processing

In [6]:
from src import data_processing
import yaml

In [7]:
with open('../src/configs/config.yaml', 'r') as f:
    full_config = yaml.safe_load(f)

final_df, dirty_df = data_processing.get_analysis_ready_df(full_config=full_config, 
                                                 use_cache = False, 
                                                 force_refresh = False,
                                                 return_dirty_df=True)


Loading files for analysis barter_deals
🐢 Running full processing pipeline...
Finished loading experiment data
Found 98 experimental trials contaminated by garbage output.
torch.Size([322, 1, 4])
torch.Size([4])




In [8]:
final_df

Unnamed: 0,applicants_applications_count,content_types,deal_id,main_image,min_social_media_followers,deal_tags,live_since,created_at,updated_at,deleted_at,...,constrained_token_ids,sequences,top_k_logits,constrained_logits,top_k_tokens,constrained_tokens,sorted_tokens,sorted_logits,mean_rating,mode_rating
0,39,"[{'id': 41, 'name': 'Activities', 'slug': 'per...",0196edc7-e6a8-00c8-6ff5-ea60fa3f2895,uploads/deals/0196edc7-5969-ffff-6700-7ae699b8...,2500,,2025-05-20 15:21:28.249567,2025-05-20 13:00:23.080212,2025-10-28 10:55:34.719187,NaT,...,"[236770, 236778, 236800, 236812]","[2, 2, 105, 2364, 107, 3048, 795, 577, 2238, 4...","[[31.875, 31.875, 30.125, 28.875, 26.875, 25.6...","[[19.625, 18.125, 25.375, 30.125]]","[[Okay, I, 4, **, Here, ★★, 3, Overall, Let, Y...","[1, 2, 3, 4]","[1, 2, 3, 4]","[[19.625, 18.125, 25.375, 30.125]]",[3.9913289546966553],[4.0]
2,0,"[{'id': 21, 'name': 'Fashion', 'slug': 'Dress'}]",019689e5-f489-00c8-8c26-06b3ea0961a5,uploads/deals/019689e5-f54b-ffff-dbed-637e9d2a...,5000,,2023-11-28 14:59:51.626780,2023-11-28 14:59:51.626780,2025-05-01 03:31:11.600095,NaT,...,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[44.25, 35.75, 35.0, 33.5, 30.0, 29.625, 29.1...","[[24.25, 22.0, 24.375, 29.625]]","[[Okay, Let, I, Here, **, 4, Alright, My, Rati...","[1, 2, 3, 4]","[1, 2, 3, 4]","[[24.25, 22.0, 24.375, 29.625]]",[3.9800896644592285],[4.0]
3,0,"[{'id': 21, 'name': 'Fashion', 'slug': 'Dress'}]",019689e5-f489-00c8-8c26-06b3ea0961a5,uploads/deals/019689e5-f54b-ffff-dbed-637e9d2a...,5000,,2023-11-28 14:59:51.626780,2023-11-28 14:59:51.626780,2025-05-01 03:31:11.600095,NaT,...,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[36.25, 35.25, 32.5, 32.25, 31.625, 30.625, 3...","[[23.125, 29.75, 32.5, 31.625]]","[[6, 7, 3, 5, 4, 8, 9, 2, \n\n, **]]","[1, 2, 3, 4]","[1, 2, 3, 4]","[[23.125, 29.75, 32.5, 31.625]]",[3.238213539123535],[3.0]
4,23,"[{'id': 29, 'name': 'UGC', 'slug': 'Film-Strip'}]",019689e2-bfb5-00c8-6485-23352d64b0e5,uploads/deals/019689e2-c06c-ffff-baf0-c7e9decd...,5000,,2025-03-04 11:46:18.176627,2025-03-04 11:46:18.176627,2025-05-01 03:27:41.316646,NaT,...,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 2, 2, 105, 2364, 107, 3048, 79...","[[37.5, 35.5, 34.0, 30.625, 30.25, 29.0, 26.25...","[[22.25, 20.5, 26.0, 34.0]]","[[Okay, I, 4, Let, **, Here, ★★, Rating, 3, On...","[1, 2, 3, 4]","[1, 2, 3, 4]","[[22.25, 20.5, 26.0, 34.0]]",[3.999638080596924],[4.0]
5,23,"[{'id': 29, 'name': 'UGC', 'slug': 'Film-Strip'}]",019689e2-bfb5-00c8-6485-23352d64b0e5,uploads/deals/019689e2-c06c-ffff-baf0-c7e9decd...,5000,,2025-03-04 11:46:18.176627,2025-03-04 11:46:18.176627,2025-05-01 03:27:41.316646,NaT,...,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 2, 2, 105, 2364, 108, 3048, 79...","[[36.25, 35.75, 34.75, 34.25, 33.5, 33.0, 32.2...","[[24.5, 29.625, 34.75, 36.25]]","[[4, 7, 3, 6, 5, 9, 8, 2, **, \n\n]]","[1, 2, 3, 4]","[1, 2, 3, 4]","[[24.5, 29.625, 34.75, 36.25]]",[3.815586805343628],[4.0]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
411,0,"[{'id': 29, 'name': 'UGC', 'slug': 'Film-Strip...",019689e6-867f-00c8-d20a-86c0d7790f4d,uploads/deals/019689e6-87ce-ffff-9002-8bcb1f6b...,5000,,2025-02-17 08:58:07.666936,2025-02-17 08:58:07.666936,2025-05-01 03:31:49.080046,NaT,...,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[38.25, 38.0, 36.75, 36.0, 35.25, 35.0, 34.5,...","[[27.375, 36.75, 38.0, 38.25]]","[[4, 3, 2, 5, 6, 7, 9, 8, 1, ▁**]]","[1, 2, 3, 4]","[1, 2, 3, 4]","[[27.375, 36.75, 38.0, 38.25]]",[3.38803768157959],[4.0]
412,82,"[{'id': 32, 'name': 'Nightlife', 'slug': 'Conf...",01973ac9-9eb0-00c8-66d5-7dda7ae04a9d,uploads/deals/01973add-b289-ffff-7133-bb62b737...,1500,,2025-06-04 12:40:31.415601,2025-06-04 11:53:01.360696,2025-10-28 10:49:08.991448,NaT,...,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[38.0, 36.5, 36.25, 36.0, 35.75, 33.5, 33.25,...","[[27.375, 33.25, 36.25, 36.5]]","[[6, 4, 3, 7, 5, 8, 2, 9, \n\n, 1]]","[1, 2, 3, 4]","[1, 2, 3, 4]","[[27.375, 33.25, 36.25, 36.5]]",[3.5286989212036133],[4.0]
416,0,"[{'id': 33, 'name': 'Entertainment', 'slug': '...",019689e4-3abc-00c8-7d0d-e43b4c5f011b,uploads/deals/019689e4-3bcf-ffff-ae4a-57765c9f...,5000,,2024-10-05 09:41:18.807216,2024-10-05 09:41:18.807216,2025-10-28 10:53:02.522947,NaT,...,"[236770, 236778, 236800, 236812]","[2, 2, 105, 2364, 108, 3048, 795, 577, 2238, 4...","[[37.25, 37.25, 36.75, 36.75, 34.25, 33.5, 33....","[[23.5, 29.0, 34.25, 37.25]]","[[4, 9, 8, 7, 3, 6, 5, 2, \n\n, ▁⭐]]","[1, 2, 3, 4]","[1, 2, 3, 4]","[[23.5, 29.0, 34.25, 37.25]]",[3.952085256576538],[4.0]
417,3,"[{'id': 21, 'name': 'Fashion', 'slug': 'Dress'}]",019689e6-1423-00c8-afd0-83173a951191,uploads/deals/019689e6-14d3-ffff-8b4a-c3ccef4e...,5000,,NaT,2023-11-29 15:08:06.755650,2025-05-01 03:31:19.683789,2024-02-03 11:15:38.532865+00:00,...,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, ...","[[37.75, 37.0, 36.5, 35.25, 34.0, 33.25, 32.5,...","[[24.75, 32.5, 37.0, 36.5]]","[[7, 3, 4, 6, 5, 8, 2, 9, **, \n\n]]","[1, 2, 3, 4]","[1, 2, 3, 4]","[[24.75, 32.5, 37.0, 36.5]]",[3.3680737018585205],[3.0]
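
The `mean_rating` and `mode_rating` values in this table are consistent with taking a softmax over the constrained logits and then computing the expected rating and the most probable rating. The sketch below reproduces the first row that way; it is a stdlib re-derivation for checking the numbers, not the code in `data_processing`, and the function names are hypothetical.

```python
import math


def rating_distribution(logits):
    """Softmax over the constrained logits (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_and_mode(ratings, logits):
    """Expected rating and most probable rating under the softmax distribution."""
    probs = rating_distribution(logits)
    mean = sum(r * p for r, p in zip(ratings, probs))
    mode = ratings[probs.index(max(probs))]
    return mean, mode


# Constrained logits of the first row of final_df, ratings 1 to 4.
mean, mode = mean_and_mode([1, 2, 3, 4], [19.625, 18.125, 25.375, 30.125])
print(mean, mode)
```

For this row the result is a mean of roughly 3.9913 and a mode of 4, in line with the table. The mean stays close to the mode because the top logit is almost five points above the next one.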


In [11]:
dirty_df

Unnamed: 0,applicants_applications_count,content_types,deal_id,main_image,min_social_media_followers,deal_tags,live_since,created_at,updated_at,deleted_at,...,assistant_prefix,input_length,model_name,top_k,constrained_token_ids,sequences,top_k_logits,constrained_logits,top_k_tokens,constrained_tokens
1,39,"[{'id': 41, 'name': 'Activities', 'slug': 'per...",0196edc7-e6a8-00c8-6ff5-ea60fa3f2895,uploads/deals/0196edc7-5969-ffff-6700-7ae699b8...,2500,,2025-05-20 15:21:28.249567,2025-05-20 13:00:23.080212,2025-10-28 10:55:34.719187,NaT,...,RATING:,205,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[2, 2, 105, 2364, 108, 3048, 795, 577, 2238, 4...","[[40.25, 37.0, 34.25, 33.25, 32.75, 31.625, 30...","[[25.875, 27.625, 31.625, 34.25]]","[[9, 8, 4, 7, 5, 3, 6, 2, 1, \n\n]]","[1, 2, 3, 4]"
29,21,"[{'id': 25, 'name': 'Cooking', 'slug': 'Cookin...",01973621-62d0-00c8-ed32-a769ec1384af,uploads/deals/0197361e-2798-ffff-875a-c043f8cd...,5000,,2025-06-03 14:10:47.277079,2025-06-03 14:10:47.120435,2025-06-30 22:02:33.403814,NaT,...,RATING:,220,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[36.75, 35.5, 35.25, 34.75, 34.5, 33.75, 32.5...","[[24.0, 31.375, 34.75, 35.25]]","[[7, 8, 4, 3, 9, 6, 5, 2, **, \n\n]]","[1, 2, 3, 4]"
31,95,"[{'id': 31, 'name': 'Lifestyle', 'slug': 'Mart...",019689e2-c086-00c8-88ca-99f346918665,uploads/deals/019689e2-c1c9-ffff-de4e-ac0eded8...,5000,,2025-01-22 11:29:14.945329,2025-01-22 11:29:14.945329,2025-05-22 04:00:15.905047,NaT,...,RATING:,220,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[2, 2, 105, 2364, 108, 3048, 795, 577, 2238, 4...","[[38.25, 35.0, 35.0, 33.75, 33.5, 33.0, 30.625...","[[23.875, 28.875, 33.5, 38.25]]","[[4, 8, 9, 7, 3, 5, 6, 2, \n\n, **]]","[1, 2, 3, 4]"
42,0,"[{'id': 41, 'name': 'Activities', 'slug': 'per...",019689e2-4ed5-00c8-4120-a08641f9f6f7,uploads/deals/019689e2-4ff4-ffff-5f8b-033c32b8...,5000,,2024-05-02 09:27:34.909380,2024-05-02 09:27:34.909380,2025-05-01 03:27:12.709921,NaT,...,RATING:,888,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[2, 2, 105, 2364, 108, 3048, 795, 577, 2238, 4...","[[34.0, 33.0, 32.0, 29.625, 28.625, 27.125, 27...","[[20.75, 27.125, 32.0, 34.0]]","[[4, 7, 3, 9, 8, 5, 6, 2, ▁⭐, **]]","[1, 2, 3, 4]"
48,0,"[{'id': 23, 'name': 'Food', 'slug': 'Pizza'}, ...",019689e1-f847-00c8-ac8b-c85ce3fe3c65,uploads/deals/019689e1-f981-ffff-c78a-e00a60ca...,2500,,2025-03-26 15:19:29.546211,2025-03-26 15:19:29.546211,2025-05-01 03:26:50.412468,NaT,...,RATING:,141,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[36.0, 36.0, 35.5, 35.25, 33.25, 32.5, 32.5, ...","[[23.625, 28.125, 33.25, 36.0]]","[[8, 4, 7, 9, 3, 5, 6, 2, \n\n, 1]]","[1, 2, 3, 4]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,16,"[{'id': 39, 'name': 'Experiences', 'slug': 'Te...",019689e4-d049-00c8-dd1d-a6d642e8eea5,uploads/deals/019689e4-d16b-ffff-4079-52f9a29f...,2500,,2025-02-06 10:27:05.026439,2025-02-06 10:27:05.026439,2025-05-01 03:29:56.694681,NaT,...,RATING:,481,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[37.5, 37.0, 36.25, 35.75, 34.5, 33.25, 31.62...","[[25.0, 30.125, 34.5, 37.5]]","[[4, 9, 8, 7, 3, 5, 6, 2, 1, **]]","[1, 2, 3, 4]"
413,2,"[{'id': 29, 'name': 'UGC', 'slug': 'Film-Strip'}]",0197d0b9-e42c-00c8-988d-662a095cc262,uploads/deals/0197d0b9-7c2f-ffff-1850-bd2ad4f7...,1500,,2025-07-03 14:38:53.080613,2025-07-03 14:38:52.972526,2025-08-10 04:00:00.967033,NaT,...,RATING:,171,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[2, 2, 105, 2364, 108, 3048, 795, 577, 2238, 4...","[[40.75, 37.0, 35.5, 34.5, 33.25, 32.25, 30.37...","[[27.5, 29.25, 32.25, 35.5]]","[[9, 8, 4, 5, 7, 3, 6, 2, 1, \n\n]]","[1, 2, 3, 4]"
414,5,"[{'id': 33, 'name': 'Entertainment', 'slug': '...",019689e3-4495-00c8-97c7-c042725aa882,uploads/deals/019689e3-4575-ffff-c9a1-a8b25c3f...,25000,,2025-03-10 21:36:41.721936,2025-03-10 21:36:41.721936,2025-05-01 03:28:15.417398,NaT,...,RATING:,464,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[2, 2, 105, 2364, 108, 3048, 795, 577, 2238, 4...","[[37.25, 34.5, 34.0, 34.0, 32.5, 32.25, 30.125...","[[23.125, 29.375, 34.0, 37.25]]","[[4, 7, 3, 9, 8, 5, 6, 2, ▁⭐, **]]","[1, 2, 3, 4]"
415,6,"[{'id': 41, 'name': 'Activities', 'slug': 'per...",019689e7-3192-00c8-78be-4ebef15ec543,uploads/deals/019689e7-32ba-ffff-e885-81eb9c69...,5000,,NaT,2024-08-23 05:18:58.896398,2025-05-01 03:32:32.722153,2024-10-28 17:13:03.153237+00:00,...,RATING:,464,google/gemma-3-1b-it,10,"[236770, 236778, 236800, 236812]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[38.25, 36.75, 36.25, 34.0, 33.75, 33.5, 31.2...","[[27.0, 29.875, 33.5, 36.25]]","[[9, 8, 4, 5, 7, 3, 6, 2, 1, \n\n]]","[1, 2, 3, 4]"


------

In [9]:
# Re-import after the kernel restart; without this the call raises a NameError.
from src.prompt_manager import create_empty_prompt_template

pt = create_empty_prompt_template()

In [None]:
# Re-import after the kernel restart; without these the cell raises a NameError.
import pandas as pd
from src import paths

df = pd.read_parquet(paths.RAW_DATA_DIR / 'MCGILL_QA_FEEDBACK.parquet')
# Absolute difference between the two reviewer scores
df['disagreement'] = (df['score_1'] - df['score_2']).abs()

df_errors = pd.read_parquet('../data/processed/McGill_QA_blatant_model_eval_errors.parquet')

relevant_columns = ['question', 'answer', 'rating', 'explanation_1', 'explanation_2', 'disagreement']
df_errors_clean = df_errors[relevant_columns + ['llm_disagreement_avg', 'model_name', 'rating_mode_rescaled']]

In [None]:
df_qwen_errors = df_errors[df_errors['model_name'].str.contains("Qwen", regex=False, na=False)]
df_qwen_errors

Unnamed: 0.1,Unnamed: 0,question,passage,feedback,rating,domain,review_1,explanation_1,review_2,explanation_2,...,entropy,relative_entropy,disagreement,human_disagreement,llm_disagreement_1,llm_disagreement_2,rating_mean_rescaled,rating_mode_rescaled,llm_disagreement_avg,token_count
1253,41,Can I be granted a COVID-19 pandemic event vis...,"{'passage_id': 563, 'reference': {'page_title'...",['Answer details how only people with visas al...,"['Excellent', 'Excellent']",Australia,Excellent,Answer details how only people with visas alre...,Excellent,It answered the question. It added more info t...,...,0.004336,0.002694,0,0,3.0,3.0,1.000379,1.0,3.0,61
1258,41,Can I be granted a COVID-19 pandemic event vis...,"{'passage_id': 563, 'reference': {'page_title'...",['Answer details how only people with visas al...,"['Excellent', 'Excellent']",Australia,Excellent,Answer details how only people with visas alre...,Excellent,It answered the question. It added more info t...,...,0.068273,0.042420,0,0,3.0,3.0,1.018659,1.0,3.0,61
1259,41,Can I be granted a COVID-19 pandemic event vis...,"{'passage_id': 563, 'reference': {'page_title'...",['Answer details how only people with visas al...,"['Excellent', 'Excellent']",Australia,Excellent,Answer details how only people with visas alre...,Excellent,It answered the question. It added more info t...,...,0.705146,0.438132,0,0,3.0,3.0,1.404514,1.0,3.0,61
1702,56,What do Family Day Care service providers need...,"{'passage_id': 498, 'reference': {'page_title'...",['Discusses storing records of attendance. Mak...,"['Bad', 'Bad']",Australia,Bad,Discusses storing records of attendance. Makes...,Bad,"This is about attendance, not about educators ...",...,0.587647,0.365126,0,0,3.0,3.0,3.795623,4.0,3.0,58
1703,56,What do Family Day Care service providers need...,"{'passage_id': 498, 'reference': {'page_title'...",['Discusses storing records of attendance. Mak...,"['Bad', 'Bad']",Australia,Bad,Discusses storing records of attendance. Makes...,Bad,"This is about attendance, not about educators ...",...,0.000006,0.000004,0,0,3.0,3.0,4.000000,4.0,3.0,58
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54598,1819,What is the length of closure before workers c...,"{'passage_id': 504, 'reference': {'page_title'...","['The answer is for school dismissal, not abou...","['Bad', 'Bad']",CDC,Bad,"The answer is for school dismissal, not about ...",Bad,The question was about when workers would be a...,...,0.075871,0.047142,0,0,3.0,3.0,3.989111,4.0,3.0,126
55522,1850,What can bus drivers or rail transit operators...,"{'passage_id': 145, 'reference': {'page_title'...",['Detailed guidelines aimed at mass transit wo...,"['Excellent', 'Excellent']",CDC,Excellent,Detailed guidelines aimed at mass transit work...,Excellent,The answer is directed toward people operating...,...,0.585737,0.363939,0,0,3.0,3.0,1.183354,1.0,3.0,206
55523,1850,What can bus drivers or rail transit operators...,"{'passage_id': 145, 'reference': {'page_title'...",['Detailed guidelines aimed at mass transit wo...,"['Excellent', 'Excellent']",CDC,Excellent,Detailed guidelines aimed at mass transit work...,Excellent,The answer is directed toward people operating...,...,0.659266,0.409625,0,0,3.0,3.0,1.277689,1.0,3.0,206
55528,1850,What can bus drivers or rail transit operators...,"{'passage_id': 145, 'reference': {'page_title'...",['Detailed guidelines aimed at mass transit wo...,"['Excellent', 'Excellent']",CDC,Excellent,Detailed guidelines aimed at mass transit work...,Excellent,The answer is directed toward people operating...,...,0.011305,0.007024,0,0,3.0,3.0,1.001862,1.0,3.0,206


-------

# Model instantiation


In [5]:
# Instantiate model
model_name = ['google/gemma-3-1b-it', 
              'Qwen/Qwen3-4B-Instruct-2507', 
              'google/gemma-3-4b-it',
              'mistralai/Ministral-8B-Instruct-2410'][1]
print(f"Using model: {model_name}")
modeler = Modeler(model_name)
# modeler.set_token_constraints(list("1234"))

Using model: Qwen/Qwen3-4B-Instruct-2507
Using device: cuda


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Padding side: left


# General experimentation

## Prompts & Generation

In [37]:
sample = df[df.input_id == '1480e9a9053f9af7'].iloc[0]
sample

Unnamed: 0                                                    1758
question         What should employers of mail and parcel deliv...
passage          {'passage_id': 152, 'reference': {'page_title'...
feedback         ["An extremely detailed response targeting exa...
rating                                        ['Excellent', 'Bad']
domain                                                         CDC
review_1                                                 Excellent
explanation_1    An extremely detailed response targeting exact...
review_2                                                       Bad
explanation_2    This does not answer the question. Rather than...
score_1                                                          4
score_2                                                          1
answer           As a mail and parcel delivery driver, potentia...
input_id                                          1480e9a9053f9af7
disagreement                                                  

In [6]:
sample = df[df['input_id'] == 'd8b74fbe497d358c'].iloc[0]
sample

Unnamed: 0                                                      41
question         Can I be granted a COVID-19 pandemic event vis...
passage          {'passage_id': 563, 'reference': {'page_title'...
feedback         ['Answer details how only people with visas al...
rating                                  ['Excellent', 'Excellent']
domain                                                   Australia
review_1                                                 Excellent
explanation_1    Answer details how only people with visas alre...
review_2                                                 Excellent
explanation_2    It answered the question. It added more info t...
score_1                                                          4
score_2                                                          4
answer           The COVID-19 pandemic event visa can only be g...
input_id                                          d8b74fbe497d358c
disagreement                                                  

In [38]:
USER_PROMPT_TEMPLATE = """Question:
"{question}"

Answer:
"{answer}"

Rate the relevance of this answer from 1 (highly relevant) to 4 (highly irrelevant): 
"""

SYSTEM_PROMPT = """
You will be given a question and answer couple. Your task is to determine whether the answer is irrelevant to the question and provide a score.
1. Highly relevant
2. somewhat relevant
3. somewhat irrelevant
4. highly irrelevant 
"""

pt.user_message_template = USER_PROMPT_TEMPLATE
pt.system_message = SYSTEM_PROMPT
pt.assistant_prefix = "RATING: "

output = modeler.generate_chat([pt.render(sample)])
modeler.decode(output)



['3']

In [None]:
USER_PROMPT_TEMPLATE = """Question:
"{question}"

Answer:
"{answer}"

Rate the relevance of this answer from 1 (highly irrelevant) to 4 (highly relevant): 
"""

SYSTEM_PROMPT = """
You will be given a question and answer couple. Your task is to determine whether the answer is relevant to the question and provide a score.
1. Highly irrelevant
2. somewhat irrelevant
3. somewhat relevant
4. highly relevant 
"""


pt.user_message_template = USER_PROMPT_TEMPLATE
pt.system_message = SYSTEM_PROMPT

# modeler.set_token_constraints(list("1234"))
# modeler.clear_token_constraints()

output = modeler.generate_chat([pt.render(sample)])
modeler.decode(output)


['1']
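
The two relevance prompts above use opposite scale orientations, so their raw outputs cannot be compared directly. A minimal sketch of mapping both onto one orientation (highest value = highly relevant) is shown here. The helper name `to_relevance_scale` is hypothetical and not part of `Modeler`.

```python
def to_relevance_scale(raw: str, reversed_scale: bool, n_levels: int = 4) -> int:
    """Map a raw rating string onto a scale where n_levels = highly relevant.

    reversed_scale=True means the prompt used 1 = highly relevant.
    """
    score = int(raw.strip())
    if not 1 <= score <= n_levels:
        raise ValueError(f"rating {score} outside 1..{n_levels}")
    return n_levels + 1 - score if reversed_scale else score


# Prompt with 1 = highly relevant, 4 = highly irrelevant
print(to_relevance_scale("3", reversed_scale=True))   # 2
# Prompt with 1 = highly irrelevant, 4 = highly relevant
print(to_relevance_scale("1", reversed_scale=False))  # 1
```

When both prompts are run on the same sample, a rater that is insensitive to the scale orientation should produce the same normalized value under either prompt.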

In [45]:
USER_PROMPT_TEMPLATE = """Question:
"{question}"
Answer:
"{answer}"
"""

SYSTEM_PROMPT = """ You will be given a user_question and system_answer couple.
Your task is to provide a ''total rating'' scoring how well the system_answer
answers the user concerns expressed in the user_question.
Give your answer as an integer on a scale of 1 to 5, where 1 means that the system_answer
is not helpful at all, and 5 means that the answer completely and helpfully addresses
the question. Also provide an explanation for your rating"""

pt.user_message_template = USER_PROMPT_TEMPLATE
pt.system_message = SYSTEM_PROMPT

modeler.clear_token_constraints()

output = modeler.generate_chat([pt.render(sample)], max_new_tokens= 200)
modeler.decode(output)





["1 (highly irrelevant)\n\nEXPLANATION: The answer is highly irrelevant to the user's question. The user asked whether they can be granted a COVID-19 pandemic event visa if they are *currently not in Australia*. The provided answer states that the visa is only available to people *already in Australia*, which directly contradicts the user's scenario. It does not address the possibility of applying from abroad, nor does it provide any information about eligibility for non-residents, travel requirements, or alternative visa options. The answer is factually incorrect in the context of the question and fails to address the core concern — whether someone outside Australia can obtain such a visa. Therefore, it is not only irrelevant but also misleading. A relevant answer would clarify that the visa is not available to those not in Australia, or provide information on alternative pathways or visa types for international applicants. Hence, the rating is 1."]
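
Free-form generations like the one above start with the rating and continue with an explanation, so the score has to be separated out before it can be compared with the human labels. A minimal sketch, assuming the rating is the first integer in the output (optionally preceded by `RATING:`). The function `parse_rating` is a hypothetical helper, not part of `Modeler`.

```python
import re
from typing import Optional, Tuple

# Leading integer, optionally preceded by a "RATING:" prefix
_RATING_RE = re.compile(r"^\s*(?:RATING:\s*)?(\d+)")


def parse_rating(text: str, low: int = 1, high: int = 5) -> Tuple[Optional[int], str]:
    """Split a free-form generation into (rating, remaining text).

    Returns (None, text) when no leading integer in [low, high] is found.
    """
    match = _RATING_RE.match(text)
    if match is None:
        return None, text.strip()
    rating = int(match.group(1))
    if not low <= rating <= high:
        return None, text.strip()
    return rating, text[match.end():].strip()


rating, rest = parse_rating("1 (highly irrelevant)\n\nEXPLANATION: The answer is ...")
print(rating)                # 1
print(rest.splitlines()[0])  # (highly irrelevant)
```

Returning `None` instead of raising keeps a batch run going when a model ignores the requested format.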

In [61]:
SYSTEM_PROMPT = """
You are a QA answer rater following the 4-level FEEDBACKQA rubric.

Rate how well an ANSWER addresses a QUESTION.

Use this scale:

4 = EXCELLENT
    - Direct, clear, complete. Fully answers the question with no key missing info.
    - No major irrelevant content. User needs no further lookup.

3 = GOOD / ACCEPTABLE
    - Relevant and basically correct.
    - May be brief, mildly unclear, or missing minor clarifications.
    - User can still confidently infer the correct conclusion.

2 = COULD BE IMPROVED
    - Partially answers the question but misses important details.
    - May include irrelevant parts or leave user uncertain.

1 = BAD
    - Off-topic, irrelevant, or does not answer the question.

General rules:
- Assume facts are correct.
- Penalize only clarity, relevance, completeness.
- “No”-type rulings are fine; do NOT penalize simply because an answer denies eligibility.
- Small inferences are fine for 3–4; unclear or ambiguous ones lower the score.
"""

USER_PROMPT_TEMPLATE = """
You will be given a QUESTION and an ANSWER.

Rate how well the ANSWER addresses the QUESTION using the 4-point scale:
1 = Bad, 2 = Could be improved, 3 = Good/Acceptable, 4 = Excellent.

QUESTION:
{question}

ANSWER:
{answer}
"""


pt.user_message_template = USER_PROMPT_TEMPLATE
pt.system_message = SYSTEM_PROMPT
pt.assistant_prefix = "RATING: "
# pt.assistant_prefix = ""

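# modeler.set_token_constraints(list("1234"))  # would restrict the output to a single rating token
modeler.clear_token_constraints()  # free-form output, so the model can explain its rating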
output = modeler.generate_chat([pt.render(sample)], max_new_tokens= 400)
modeler.decode(output)



['2 = Could be improved\n\nReasoning:\nThe answer correctly identifies that the visa is only available to people already in Australia and provides a brief explanation of eligibility criteria. However, it does **not directly and clearly address the core question**: whether someone *currently not in Australia* can be granted the visa.\n\nThe answer implies that the visa is not available to those outside Australia, which is correct, but it fails to explicitly state this in a clear, direct way. It also misses the opportunity to directly answer the question with a "no" or "only if..." clarification, which would make the response more complete and user-friendly.\n\nWhile the information is factually accurate, the answer is **missing a direct, clear "no" or "only for those already in Australia" response** to the specific question. This makes it incomplete in terms of addressing the user\'s intent.\n\nTherefore, it is not excellent (4), nor is it fully acceptable (3) — it is **partially releva

In [34]:
pt.assistant_prefix = "RATING: "
pt.render(sample).conversation

[{'role': 'system',
  'content': '\nYou are a QA answer rater following the 4-level FEEDBACKQA rubric.\n\nRate how well an ANSWER addresses a QUESTION.\n\nUse this scale:\n\n4 = EXCELLENT\n    - Direct, clear, complete. Fully answers the question with no key missing info.\n    - No major irrelevant content. User needs no further lookup.\n\n3 = GOOD / ACCEPTABLE\n    - Relevant and basically correct.\n    - May be brief, mildly unclear, or missing minor clarifications.\n    - User can still confidently infer the correct conclusion.\n\n2 = COULD BE IMPROVED\n    - Partially answers the question but misses important details.\n    - May include irrelevant parts or leave user uncertain.\n\n1 = BAD\n    - Off-topic, irrelevant, or does not answer the question.\n\nGeneral rules:\n- Assume facts are correct.\n- Penalize only clarity, relevance, completeness.\n- “No”-type rulings are fine; do NOT penalize simply because an answer denies eligibility.\n- Small inferences are fine for 3–4; unclear
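
The rendered conversation above is a list of role/content messages. As an illustration of how a template and one data row could be combined into that shape, here is a minimal stand-in. The function `render_conversation` is hypothetical, and the actual `pt.render` implementation may differ, in particular in how it handles the assistant prefix.

```python
from typing import Dict, List, Mapping


def render_conversation(system_message: str, user_template: str,
                        sample: Mapping[str, str],
                        assistant_prefix: str = "") -> List[Dict[str, str]]:
    """Build a chat-style message list from a template and one data row."""
    messages = [
        {"role": "system", "content": system_message},
        # str.format ignores keys in `sample` that the template does not use
        {"role": "user", "content": user_template.format(**sample)},
    ]
    if assistant_prefix:
        # Assumption: the prefix is passed as a partial assistant turn
        messages.append({"role": "assistant", "content": assistant_prefix})
    return messages


row = {"question": "Who is addressed?", "answer": "The drivers.", "domain": "CDC"}
for message in render_conversation("You are a rater.",
                                   'QUESTION:\n{question}\n\nANSWER:\n{answer}',
                                   row, assistant_prefix="RATING: "):
    print(message["role"])
```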

---------
# Specific examples

## Init

In [9]:
# Instantiate model
model_name = ['google/gemma-3-1b-it', 
              'Qwen/Qwen3-4B-Instruct-2507', 
              'google/gemma-3-4b-it',
              'mistralai/Ministral-8B-Instruct-2410'][2]
print(model_name)
modeler = Modeler(model_name)
# modeler.set_token_constraints(list("1234"))

google/gemma-3-4b-it
Using device: cuda


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Padding side: left


## Generation


In [4]:
sample = df[df['input_id'] == 'd8b74fbe497d358c']
input_text = sample.to_dict(orient='records')[0]
sample.iloc[0]

Unnamed: 0                                                      41
question         Can I be granted a COVID-19 pandemic event vis...
passage          {'passage_id': 563, 'reference': {'page_title'...
feedback         ['Answer details how only people with visas al...
rating                                  ['Excellent', 'Excellent']
domain                                                   Australia
review_1                                                 Excellent
explanation_1    Answer details how only people with visas alre...
review_2                                                 Excellent
explanation_2    It answered the question. It added more info t...
score_1                                                          4
score_2                                                          4
answer           The COVID-19 pandemic event visa can only be g...
input_id                                          d8b74fbe497d358c
disagreement                                                  

In [5]:
pm = PromptManager(folder = Path('../prompts/MCGILL_QA_FEEDBACK'))
pm.load_all()

pt = pm.prompt_templates['7785d73041']

PromptManager initialized with folder: ..\prompts\MCGILL_QA_FEEDBACK


In [8]:
print(pt.system_message)

You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer as an integer on a scale of 1 to 5, where 1 means that the system_answer is not helpful at all, and 5 means that the answer completely and helpfully addresses the question.


In [14]:
output = modeler.generate_chat([pt.render(input_text)], max_new_tokens = 100)
modeler.decode(output)



["3. Somewhat irrelevant\n\n**Reasoning:**\n\nThe answer describes eligibility criteria *for people already in Australia*. The question asks about being granted a visa *if you are not currently in Australia*. While it touches on the general topic of the pandemic event visa, it doesn't address the core question of eligibility for someone outside of Australia. It's related, but not directly answering the specific inquiry."]

In [None]:
USER_PROMPT_TEMPLATE = """Question:
"{question}"

Answer:
"{answer}"

Rate the relevance of this answer from 1 (highly relevant) to 4 (highly irrelevant): 
"""

SYSTEM_PROMPT = """
You will be given a question and answer couple. Your task is to determine whether the answer is irrelevant to the question and provide a score.
1. Highly relevant
2. somewhat relevant
3. somewhat irrelevant
4. highly irrelevant 
"""

pt.system_message = SYSTEM_PROMPT
pt.user_message_template = USER_PROMPT_TEMPLATE

### Example of entity mismatch (employees vs. employers)

In [None]:
sample = df[df.input_id == '1480e9a9053f9af7'] 
inp = sample.to_dict(orient='records')[0]
inp

------

# Entity Mismatch & Contextual Contamination

Findings
- We analyse $input\_id = '1480e9a9053f9af7'$, where the question is about employers but the answer is about employees (entity mismatch)
    - Analysis:
        - All models are tripped up by the entity mismatch: they rate the answer 4/4 for its completeness, even though it addresses the wrong entity
        - One of the human annotators also rated it highly despite the entity mismatch (or they apply a different policy; heuristic processing)
    - Methods:
        - Prompting for answer relevance:
            - Adversarial prompting (flipping the rating scale: $1 = irrelevant$ in one prompt, $1 = relevant$ in the other)
                - The models still got it wrong because of the entity mismatch
        - Prompting for which entity is addressed in the question
            - Models get it wrong when both question and answer are in the input
            - Models get it __right__ when the answer is removed
    - Finding:
        - There is a specific failure mode in transformer-based evaluators: Contextual Contamination. In psychometric terms, this is a failure of Discriminant Validity due to Method Variance (specifically, the method of presenting the full context simultaneously). When asked about the Question, the model's self-attention mechanism still attends to the tokens in the Answer. The $length$ of the answer appears to affect the reliability of the LLM output.
        - "Salience Bias" in attention mechanisms: the model isn't just "confused"; it seems to prioritize the frequency of the token "Driver" in the Answer over the syntactic position of "Employer" in the Question. (TODO: relate this to the contextual contamination argument and explain the two together.)
    - Solution:
        - This problem is very unlikely to occur in my dataset, since all outputs there are capped at ~400 tokens. I use the FeedbackQA dataset only as a proxy for the Barter dataset, so text length is an irrelevant confounding variable here. To transfer from Q&A to the target domain of marketing copy, both datasets need to be aligned on length, which avoids this specific failure mode.
    - Actionable steps:
        - Analyse the text lengths in the Barter and McGill datasets and verify whether the Barter samples are shorter
            - Filter out the long answers from the McGill dataset, using as cutoff the mean + 2 standard deviations of the token counts in the target dataset (Barter/marketing)
            - Check whether Cohen's $\kappa$ / Spearman's $\rho$ are affected (LLM <-> human, human <-> human)
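
The actionable steps above can be sketched as follows. This is a minimal, standard-library-only illustration under stated assumptions: token counts are approximated by whitespace splitting (a real run would use the model tokenizer), the toy texts and ratings are made up, and the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev
from typing import List, Sequence


def token_count(text: str) -> int:
    # Assumption: whitespace tokens as a cheap proxy for model tokens
    return len(text.split())


def length_cutoff(target_texts: Sequence[str], n_sd: float = 2.0) -> float:
    """Mean + n_sd sample standard deviations of the target token counts."""
    counts = [token_count(t) for t in target_texts]
    return mean(counts) + n_sd * stdev(counts)


def filter_by_length(source_texts: Sequence[str], cutoff: float) -> List[str]:
    """Keep only source texts whose token count does not exceed the cutoff."""
    return [t for t in source_texts if token_count(t) <= cutoff]


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from the two raters' marginal label frequencies
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data, made up for illustration only
target = ["short marketing copy " * 3, "another brief text " * 4, "tiny"]
source = ["word " * 5, "word " * 200]
cutoff = length_cutoff(target)
kept = filter_by_length(source, cutoff)
print(f"cutoff={cutoff:.1f} tokens, kept {len(kept)} of {len(source)} answers")
print(f"kappa={cohen_kappa([4, 1, 3, 4], [4, 1, 2, 4]):.3f}")
```

Computing $\kappa$ once on the full McGill set and once on the filtered set would show whether the long answers drive the disagreement. Since the ratings are ordinal, a weighted $\kappa$ may be the better choice in the actual analysis.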

        


In [4]:
sample = df[df.input_id == '1480e9a9053f9af7']
input_text = sample.to_dict(orient='records')[0]
sample.iloc[0]

Unnamed: 0                                                    1758
question         What should employers of mail and parcel deliv...
passage          {'passage_id': 152, 'reference': {'page_title'...
feedback         ["An extremely detailed response targeting exa...
rating                                        ['Excellent', 'Bad']
domain                                                         CDC
review_1                                                 Excellent
explanation_1    An extremely detailed response targeting exact...
review_2                                                       Bad
explanation_2    This does not answer the question. Rather than...
score_1                                                          4
score_2                                                          1
answer           As a mail and parcel delivery driver, potentia...
input_id                                          1480e9a9053f9af7
disagreement                                                  

In [31]:
try:
    # Attempt to delete the variable
    del modeler
    print("🧹 'modeler' object deleted successfully.")
except NameError:
    # This block executes if 'modeler' was never defined
    print("✨ 'modeler' was not found (not instantiated).")

torch.cuda.empty_cache()
gc.collect()


✨ 'modeler' was not found (not instantiated).


9929

In [None]:
# Instantiate model
model_name = ['google/gemma-3-1b-it', 
              'Qwen/Qwen3-4B-Instruct-2507', 
              'google/gemma-3-4b-it',
              'mistralai/Ministral-8B-Instruct-2410'][2]
print(f'Using model: {model_name}')
modeler = Modeler(model_name)
# modeler.set_token_constraints(list("1234"))

google/gemma-3-4b-it
Using device: cuda


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Padding side: left


In [7]:
CONSTRAINED_OUTPUT = ["A", "B"]

SYSTEM_PROMPT = """
Task: Identify the target audience addressed in the Answer text provided in the user prompt. 
Options:
A) Employers
B) Drivers

Provide the final answer as a single option letter (A or B) only.
"""

USER_PROMPT_TEMPLATE = """
Question:
"{question}"

Answer:
"{answer}"

Task: Identify the target audience addressed in the Answer text above.
Is the Answer text strictly addressed to Employers (A) or Drivers (B)?
"""

In [8]:
SYSTEM_PROMPT = """
Task: Analyze the Answer provided in the user prompt. 
Determine which entity is explicitly asked to perform actions or take measures.

Options:
A) Employers
B) Drivers

Provide the final answer as a single option letter (A or B) only.
"""

USER_PROMPT_TEMPLATE = """
Question:
"{question}"

Answer:
"{answer}"

Task: Analyze the Answer text above.
Who is the specific entity expected to TAKE ACTION according to the Answer?
(A) Employers
(B) Drivers
"""

In [None]:
prompt_template, prepared_prompt = generate_prepared_prompt(SYSTEM_PROMPT=SYSTEM_PROMPT, 
                                                            USER_PROMPT_TEMPLATE=USER_PROMPT_TEMPLATE, 
                                                            input_data = input_text,
                                                            constrained_output=CONSTRAINED_OUTPUT)

modeler.set_token_constraints(prompt_template.constrained_output)
output = modeler.generate_chat([prepared_prompt], max_new_tokens = 50)
modeler.get_relevant_logits_dict(output.model_output)
modeler.decode(output)

The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


[{'A': 33.5, 'B': 43.25}]

In [51]:
SYSTEM_PROMPT = """
Task: Identify the subject in the Question text provided in the user prompt. 
Options:
A) Employers
B) Drivers

Provide the final answer as a single option letter (A or B) only.
"""

USER_PROMPT_TEMPLATE = """
Question:
"{question}"

Answer:
"{answer}"

Task: Identify the subject in the Question text above.
Is the Question text strictly addressed to Employers (A) or Drivers (B)?
"""

In [17]:
SYSTEM_PROMPT = """
Task: Analyze the Question provided in the user prompt. 
Determine which entity is explicitly asked to perform actions or take measures.

Options:
A) Employers
B) Drivers

Provide the final answer as a single option letter (A or B) only.
"""

USER_PROMPT_TEMPLATE = """
Question:
"{question}"



Task: Analyze the Question text above.
Who is the specific entity expected to TAKE ACTION according to the question?
(A) Employers
(B) Drivers
"""

In [18]:
prompt_template, prepared_prompt = generate_prepared_prompt(SYSTEM_PROMPT=SYSTEM_PROMPT, 
                                                            USER_PROMPT_TEMPLATE=USER_PROMPT_TEMPLATE, 
                                                            input_data = input_text,
                                                            constrained_output=CONSTRAINED_OUTPUT)

modeler.set_token_constraints(prompt_template.constrained_output)
output = modeler.generate_chat([prepared_prompt], max_new_tokens = 50)
print(modeler.get_relevant_logits_dict(output.model_output))
print(modeler.decode(output))

[{'A': 39.75, 'B': 27.125}]
['AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA']
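
The logit gaps in the two recorded runs are easier to interpret as probabilities over the two allowed options. A minimal sketch, assuming the values returned by `get_relevant_logits_dict` are raw (pre-softmax) logits for the constrained tokens:

```python
import math
from typing import Dict


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Turn option logits into probabilities restricted to those options."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


first_run = softmax({"A": 33.5, "B": 43.25})     # earlier recorded logits
second_run = softmax({"A": 39.75, "B": 27.125})  # logits from the run with the answer removed
print(f"first run:  P(A)={first_run['A']:.6f}, P(B)={first_run['B']:.6f}")
print(f"second run: P(A)={second_run['A']:.6f}, P(B)={second_run['B']:.6f}")
```

A gap of roughly 10 logits or more means the model puts almost all of its probability mass on one option, so neither of these two decisions is a borderline case.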


-------