# LLM Evaluation Judge

This notebook shows how the following models:

1. Meta meta-llama/Llama-3.1-8B-Instruct
2. Alibaba Cloud Qwen/Qwen2.5-7B-Instruct
3. Mistral mistralai/Mistral-7B-Instruct-v0.3

are used as judges to decide whether the output of the models fine-tuned with SFT is correct.

In [3]:
#####################################
# LIBRARY IMPORTS
#####################################
import os
import gc
import random
import torch
import time
import json
import pandas as pd
import numpy as np
from IPython.display import Markdown
from datasets import load_dataset, Dataset, concatenate_datasets
from tqdm import tqdm

In [4]:
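def init_seed(seed):
    """Seed the Python, NumPy, and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    # Only affects subprocesses: the running interpreter fixed its hash seed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # seed every visible GPU, not only the current one
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # autotuning may select non-deterministic kernels

init_seed(33)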

In [5]:
DATASET = "stefra/llama_judge_2"
dataset = load_dataset(DATASET, split="train").to_pandas()
# Judge the full dataset; swap in the commented call to evaluate a random subset instead.
final_sample = dataset  # dataset.sample(n=1000, random_state=42).reset_index(drop=True)
dataset = Dataset.from_pandas(final_sample)

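
The judging loop later in this notebook reads the `NL`, `FOL`, and `output` columns of every row, so a schema mismatch would only surface once generation has started. A quick check right after loading can catch it earlier. This is a minimal sketch: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here for illustration, not part of `datasets` or pandas.

```python
# Columns the judging loop reads from each row (taken from the loop in this notebook).
REQUIRED_COLUMNS = ("NL", "FOL", "output")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Plain lists are used here; in the notebook one would pass final_sample.columns.
print(missing_columns(["NL", "FOL", "output", "id"]))  # []
print(missing_columns(["NL", "FOL"]))                  # ['output']
```

An empty list means the frame has everything the loop needs.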

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Judge under test; the alternatives are "Qwen/Qwen2.5-7B-Instruct" and "meta-llama/Llama-3.1-8B-Instruct".
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
# Assumption: the Hugging Face access token is exported as HF_TOKEN rather than hard-coded.
# None falls back to the locally cached login, if any.
token = os.environ.get("HF_TOKEN")

# Load the weights in 4-bit precision to reduce GPU memory use.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

print("Loading the model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    token=token,
    quantization_config=quantization_config
)

2025-07-15 17:35:29.216814: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752600929.408888      19 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752600929.464447      19 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Loading the model and tokenizer...



In [7]:
def build_judge_prompt(nl_text, true_fol, candidate_fol):
    """Build the chat-formatted judge prompt; relies on the global `tokenizer`."""
    messages = [
        {
            "role": "user",
            "content": (
                "You are a logic expert. Your task is to determine whether a candidate First-Order Logic (FOL) formula "
                "is a correct formalization of a natural language sentence.\n"
                "Answer only with 'Yes' or 'No'.\n\n"

                "Example:\n"
                "Sentence: All humans are mortal.\n"
                "Candidate FOL: ∃x Human(x) ∧ Mortal(x)\n"
                "Answer: No\n\n"

                "Important note: In some cases, the candidate FOL might not be identical to the true FOL, "
                "but if they are logically equivalent (i.e., they express the same meaning), answer 'Yes'.\n\n"

                f"Sentence: {nl_text.strip()}\n"
                f"True FOL: {true_fol.strip()}\n"
                f"Candidate FOL: {candidate_fol.strip()}\n"
                "Answer:"
            )
        }
    ]

    chat_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    return chat_input
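
To inspect what the judge receives for one example, the message can be built without loading a tokenizer. The sketch below uses a shortened version of the template above (the in-prompt example and the equivalence note are left out), and `build_judge_message` is a hypothetical helper introduced only for illustration.

```python
def build_judge_message(nl_text, true_fol, candidate_fol):
    """Return a single-turn chat message asking for a Yes/No verdict."""
    content = (
        "Answer only with 'Yes' or 'No'.\n\n"
        f"Sentence: {nl_text.strip()}\n"
        f"True FOL: {true_fol.strip()}\n"
        f"Candidate FOL: {candidate_fol.strip()}\n"
        "Answer:"
    )
    return [{"role": "user", "content": content}]


# Surrounding whitespace in the sentence is stripped before formatting.
messages = build_judge_message(
    "  All humans are mortal. ",
    "∀x (Human(x) → Mortal(x))",
    "∃x (Human(x) ∧ Mortal(x))",
)
print(messages[0]["content"])
```

The returned list has the shape that `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` expects, which is how the cell above turns it into the final prompt string.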


In [None]:
def generator(prompt, model, tokenizer, max_new_tokens=1024):
    # The chat template already inserted the special tokens (e.g. BOS),
    # so the tokenizer must not add them a second time.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id
        )

    # Keep only the newly generated tokens. Slicing by token count is more robust
    # than cutting the decoded string at the prompt's character length.
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return generated_text.strip()

In [9]:
fol_results = []

for i, (_, row) in enumerate(tqdm(final_sample.iterrows(), total=len(final_sample))):
    nl_text = row["NL"]
    fol_true = row['FOL']
    fol_candidate = row['output']
    
    prompt = build_judge_prompt(nl_text, fol_true, fol_candidate)

    output = generator(prompt, model, tokenizer, max_new_tokens=512)

    # Keep only the text after the last "Answer:" in case the model echoes the prompt format.
    fol_response = output.split("Answer:")[-1].strip()

    if i < 5:
        print(f"[{i}] NL: {nl_text}")
        print(f"[{i}] FOL OG: {fol_true}")
        #print(f"[{i}] Prompt:\n{prompt}")
        print(f"[{i}] FOL candidate: {fol_candidate}")
        print(f"[{i}] Output (raw):\n{output}")
        print(f"[{i}] Output:\n{fol_response}")
        print("=" * 60)

    fol_results.append(fol_response)

final_sample["judge_mistral"] = fol_results
final_sample.to_csv("nl_to_fol_with_llama.csv", index=False)

  0%|          | 1/1000 [00:02<40:24,  2.43s/it]

[0] NL: Mercury is closer to the sun than Venus, and Venus is closer to the sun than Earth.
[0] FOL OG: ∀x∀y∀z (Mercury(x) ∧ Venus(y) ∧ Earth(z) → (CloserToSun(x, y) ∧ CloserToSun(y, z)))
[0] FOL candidate: ∀x ∀y ∀z (Mercury(x) ∧ Venus(y) ∧ Earth(z) → (CloserToSun(x, sun) ∧ CloserToSun(y, sun) ∧ CloserToSun(z, sun)))
[0] Output (raw):
Yes
[0] Output:
Yes


  0%|          | 2/1000 [00:03<31:57,  1.92s/it]

[1] NL: A celestial event that occurs when the Moon passes between the Earth and the Sun, blocking the Sun's light, is a solar eclipse.
[1] FOL OG: ∀x (CelestialEvent(x) ∧ MoonPassesBetween(x) ∧ EarthAndSun(x) ∧ BlocksSunlight(x) → SolarEclipse(x))
[1] FOL candidate: ∀x (CelestialEvent(x) ∧ MoonPassesBetweenEarthAndSun(x) ∧ BlocksSunlight(x) → SolarEclipse(x))
[1] Output (raw):
Yes
[1] Output:
Yes


  0%|          | 3/1000 [00:05<29:04,  1.75s/it]

[2] NL: A book that is a fictional narrative, has well-developed characters, and contains a plot is typically a novel.
[2] FOL OG: ∀x (Book(x) ∧ FictionalNarrative(x) ∧ WellDevelopedCharacters(x) ∧ ContainsPlot(x) → Novel(x))
[2] FOL candidate: ∀x (FictionalNarrative(x) ∧ WellDevelopedCharacters(x) ∧ ContainsPlot(x) → Novel(x))
[2] Output (raw):
Yes
[2] Output:
Yes


  0%|          | 4/1000 [00:07<27:36,  1.66s/it]

[3] NL: Astronauts explore space, while scuba divers explore the depths of the ocean.
[3] FOL OG: ∀x ((Astronaut(x) → ExploresSpace(x)) ∧ (ScubaDiver(x) → ExploresOceanDepths(x)))
[3] FOL candidate: ∀x (Astronaut(x) → ExploresSpace(x)) ∧ ∀y (ScubaDiver(y) → ExploresOceanDepth(y))
[3] Output (raw):
Yes
[3] Output:
Yes


  0%|          | 5/1000 [00:08<25:30,  1.54s/it]

[4] NL: A camera captures images, has a lens, and uses a sensor.
[4] FOL OG: ∀x (Camera(x) → (CapturesImages(x) ∧ HasLens(x) ∧ UsesSensor(x)))
[4] FOL candidate: ∀x (Camera(x) → (CapturesImages(x) ∧ HasLens(x) ∧ UsesSensor(x)))
[4] Output (raw):
Yes
[4] Output:
Yes


100%|██████████| 1000/1000 [32:31<00:00,  1.95s/it]
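
The loop stores each reply as raw text. The five replies printed above are a bare `Yes`, but with `max_new_tokens=512` nothing prevents a judge from adding punctuation or an explanation. One possible way to normalise the replies before counting them is sketched below; `normalize_verdict` is a hypothetical helper, the sample replies are invented for illustration, and taking the first `yes`/`no` word is a heuristic rather than the notebook's method.

```python
import re
from collections import Counter

# Matches "yes" or "no" as a whole word, ignoring case.
_VERDICT_RE = re.compile(r"\b(yes|no)\b", re.IGNORECASE)


def normalize_verdict(raw_answer):
    """Map a raw judge reply to 'Yes', 'No', or None when no verdict is found."""
    match = _VERDICT_RE.search(raw_answer)
    if match is None:
        return None
    return match.group(1).capitalize()


# Invented replies, for illustration only.
raw_replies = ["Yes", "no.", "Answer: Yes, they are equivalent.", "Cannot tell"]
verdicts = [normalize_verdict(reply) for reply in raw_replies]
print(verdicts)           # ['Yes', 'No', 'Yes', None]
print(Counter(verdicts))
```

Replies that map to `None` are worth inspecting by hand, since they indicate that the judge did not follow the answer format.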


In [10]:
dataset_to_push = Dataset.from_pandas(final_sample)
dataset_to_push.push_to_hub("stefra/llama_judge_final")


CommitInfo(commit_url='https://huggingface.co/datasets/stefra/llama_judge_final/commit/c0a4c19343f39c6be6eb54910aaedb413060f72f', commit_message='Upload dataset', commit_description='', oid='c0a4c19343f39c6be6eb54910aaedb413060f72f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/stefra/llama_judge_final', endpoint='https://huggingface.co', repo_type='dataset', repo_id='stefra/llama_judge_final'), pr_revision=None, pr_num=None)
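
This run adds one column of verdicts (`judge_mistral`). The notebook does not say how the verdicts of the three judges are combined afterwards; one common option is a majority vote per row. The sketch below assumes normalised `Yes`/`No` verdicts and hypothetical column names `judge_llama` and `judge_qwen` for the other two judges.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Return the verdict given by more than half of the judges, else None."""
    counts = Counter(v for v in verdicts if v is not None)
    if not counts:
        return None
    verdict, votes = counts.most_common(1)[0]
    # Missing verdicts still count as judges, so a strict majority is required.
    return verdict if votes > len(verdicts) / 2 else None


# Hypothetical verdicts for one row; only "judge_mistral" appears in this notebook.
row = {"judge_llama": "Yes", "judge_qwen": "No", "judge_mistral": "Yes"}
print(majority_verdict(list(row.values())))  # Yes
```

With three judges, a row receives a verdict only when at least two of them agree; a split with one missing reply yields `None`.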