<a href="https://colab.research.google.com/github/samarreguigui/AI_Ethik/blob/main/Roscoe_drop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We study the second point: identifying specific cases where the LLM's judgments
differ from human judgments and analyzing the underlying reasons for these differences,
including systematic biases such as a preference for fluent or verbose outputs over
task-specific criteria.

Human annotators were given the full task instance while evaluating individual reasoning steps. To disentangle step-level judgment behavior from task solving, we evaluate LLM judges under two conditions: a context-free, step-only condition and a context-aware condition that includes the full instance. This lets us isolate stylistic judgment biases and assess the effect of contextual information.
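The two conditions can be sketched as a single prompt builder that differs only in whether the instance text is included. This is a minimal illustration under stated assumptions: `build_judge_prompt` and its prompt wording are hypothetical and are not the prompt used in the cells below.

```python
from typing import Optional


def build_judge_prompt(step_text: str, context: Optional[str] = None) -> str:
    """Build a user prompt for one reasoning step.

    context=None gives the context-free (step-only) condition; passing the
    instance text gives the context-aware condition. The wording here is a
    hypothetical placeholder.
    """
    parts = []
    if context is not None:
        parts.append("CONTEXT AND QUESTION:\n" + context.strip())
    parts.append("STEP TO EVALUATE:\n" + step_text.strip())
    parts.append("Does this step contain any reasoning or factual problems? "
                 "Answer 'yes' or 'no'.")
    return "\n\n".join(parts)


# Toy inputs for illustration only.
step = "The answer is 3."
context_free = build_judge_prompt(step)
context_aware = build_judge_prompt(step, context="Situation: a toy premise. Claim: a toy claim.")
```

Keeping the question and the step text identical across both variants means any change in the judge's verdict can be attributed to the added context rather than to prompt wording.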

**Without the instance context**

In [None]:
!pip install -U openai


In [None]:
# ===== STEP 1: LOAD ROSCOE-DROP-STEPWISE =====

import json
import pandas as pd

# 1. Path to your JSON file (adjust if needed)
JSON_PATH = "roscoe-drop-stepwise.json"

# 2. Load the JSON
with open(JSON_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

# 3. Basic checks
print("Top-level keys:", list(data.keys()))
print("Dataset name:", data.get("dataset"))
print("Number of instances:", len(data["instances"]))

# 4. Flatten instances into a DataFrame
rows = []
for inst in data["instances"]:
    inst_id = inst["id"]
    inst_text = inst["instance"]
    ann = inst["annotations"]

    row = {
        "id": inst_id,
        "instance_text": inst_text,
    }

    # Add majority human label for each metric (yes/no)
    for metric_name, metric_info in ann.items():
        row[f"{metric_name}_majority"] = metric_info["majority_human"]

    rows.append(row)

df = pd.DataFrame(rows)

print("\nColumns in DataFrame:")
print(df.columns.tolist())

print("\nFirst 3 rows:")
print(df.head(3))

print("\nLabel distribution for some metrics:")
for m in ["Grammar", "Factuality", "Coherency and Logic",
          "Final Answer", "Hallucination"]:
    col = f"{m}_majority"
    if col in df.columns:
        print(f"\nMetric: {m}")
        print(df[col].value_counts())


In [None]:
import re

# ===== helper: parse one instance into steps =====

STEP_PATTERN = re.compile(r"(Step\s+\d+\s*-\s*)(.*)")

def split_instance_into_steps(instance_text):
    """
    Returns:
      context_block: everything before 'GENERATED RESPONSE:'
      steps: list of (step_index, step_text)
    """
    # Separate instruction/context from generated response
    parts = instance_text.split("GENERATED RESPONSE:")
    if len(parts) != 2:
        return instance_text, []  # fallback

    context_block = parts[0]
    gen_resp = parts[1]

    steps = []
    # Split by lines and collect those starting with 'Step'
    for line in gen_resp.splitlines():
        line = line.strip()
        if line.startswith("JUDGE:"):
            break  # ignore judge segment
        m = STEP_PATTERN.match(line)
        if m:
            prefix, text = m.groups()
            # step number from prefix, e.g. "Step 1 - "
            try:
                step_num = int(re.findall(r"\d+", prefix)[0])
            except Exception:
                step_num = None
            steps.append((step_num, text.strip()))
    return context_block, steps

# ===== apply to all instances =====

step_rows = []

for _, row in df.iterrows():
    inst_id = row["id"]
    inst_text = row["instance_text"]

    context_block, steps = split_instance_into_steps(inst_text)

    # copy human majority labels for this instance
    meta = {
        "id": inst_id,
        "context_block": context_block,
    }
    for col in df.columns:
        if col.endswith("_majority"):
            meta[col] = row[col]

    for step_num, step_text in steps:
        step_row = meta.copy()
        step_row["step_num"] = step_num
        step_row["step_text"] = step_text
        step_rows.append(step_row)

steps_df = pd.DataFrame(step_rows)

print("Num step-level rows:", len(steps_df))
print("\nColumns:")
print(steps_df.columns.tolist())

print("\nSample steps:")
print(steps_df[["id", "step_num", "step_text"]].head(5))


In [None]:
from openai import OpenAI
import os, time
from tqdm import tqdm


os.environ["OPENAI_API_KEY"] = "-"  # placeholder; replace with your own API key

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
JUDGE_MODEL = "gpt-4o"

We use the following prompt to check whether a step contains a problem. In the later cells, we take the human annotations and derive a single `human_problem` column from them (via `human_has_problem`).

This gives one simple "human says this step is OK / not OK" label that can be compared directly with the LLM's single yes/no judgment.

**Why combine metrics into one label**

ROSCOE provides many labels per step (Factuality, Coherency, Final Answer, Hallucination, Commonsense, Arithmetic, etc.).

The LLM judge, however, is asked a single question: "Does this step contain any reasoning or factual problem? yes/no."

To compare the two, we need one human label of the same kind: a step counts as problematic if any of the error-related metrics has a majority "yes".



In [None]:
JUDGE_SYSTEM_PROMPT = """You are an expert reasoning evaluator.
You see a context, a question, and ONE step from a model's explanation.

Your task: decide if this step contains ANY problem that makes the reasoning incorrect or untrustworthy, such as:
- factual errors
- logical/coherence errors
- incorrect final answer in this step
- hallucinations
- arithmetic or commonsense errors

Ignore minor style issues like wording or politeness.
Answer with exactly one word: 'yes' if there IS some problem in the step, 'no' if the step is fine.
"""

def build_step_prompt(row):
    ctx = row["context_block"]
    step_num = row["step_num"]
    step_text = row["step_text"]

    user = f"""CONTEXT AND QUESTION:
{ctx}

STEP {step_num} TO EVALUATE:
{step_text}

Question for you:
Does this step contain any reasoning or factual problems?
Answer 'yes' or 'no'."""
    return user


In [None]:
# Test the prompt on one row, e.g. the first judged step.
# Note: analysis_df is created in a later cell; run that cell first.
example_row = analysis_df.iloc[0]

print("=== CONTEXT BLOCK ===")
print(example_row["context_block"])
print("\n=== STEP TEXT ===")
print(f"Step {example_row['step_num']}: {example_row['step_text']}")

print("\n=== FULL USER PROMPT SENT TO LLM ===")
print(build_step_prompt(example_row))


=== CONTEXT BLOCK ===
For this task, you will be shown a CONTEXT with a "Situation" and a "Claim" about that "Situation". The "Claim" may or may not be supported by the "Situation". The Correct Relationship between the "Claim" and the "Situation" is provided.

You will be shown a GENERATED RESPONSE generated from a bot, asked the question

Is the Claim supported by the Situation?

You will be asked to judge the individual STEPS within the GENERATED RESPONSE. Interpret the questions to the best of your ability. Sometimes the generated response will refer to the "Situation" as a "Premise" and the "Claim" as a "Hypothesis". It will oftentimes be faster to read the "Claim" before the "Situation".

CONTEXT:
Situation (Premise): Hoping to rebound from their home loss to the Vikings, the Cardinals flew to Gillette Stadium for a Week 16 interconference duel with the New England Patriots. Arizona would trail early in the first quarter as Patriots running back LaMont Jordan got a one-yard and a 

In [None]:
def judge_step(row, retries=3, sleep_sec=2):
    prompt = build_step_prompt(row)
    for _ in range(retries):
        try:
            resp = client.chat.completions.create(
                model=JUDGE_MODEL,
                messages=[
                    {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
                    {"role": "user", "content": prompt},
                ],
                temperature=0.0,
            )
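            txt = resp.choices[0].message.content.strip().lower()
            time.sleep(0.5)  # brief pause between requests

            # Match whole words only, so that "not" or "know" is not read as "no".
            # An ambiguous or empty answer falls through and triggers a retry.
            verdicts = set(re.findall(r"\b(yes|no)\b", txt))
            if len(verdicts) == 1:
                return verdicts.pop()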
        except Exception as e:
            print("Error, retrying:", e)
            time.sleep(sleep_sec)
    return None


In [None]:
MAX_STEPS_TO_JUDGE = 200   # increase later if needed

subset = steps_df.sample(n=min(MAX_STEPS_TO_JUDGE, len(steps_df)), random_state=0).copy()

tqdm.pandas()
subset["llm_problem"] = subset.progress_apply(judge_step, axis=1)

print("\nValue counts of LLM judgments:")
print(subset["llm_problem"].value_counts(dropna=False))

# Keep only rows where the LLM gave a clear yes/no
subset = subset.dropna(subset=["llm_problem"])


In [None]:
subset

Unnamed: 0,id,context_block,Grammar_majority,Factuality_majority,Coherency and Logic_majority,Final Answer_majority,Hallucination_majority,Redundancy_majority,Repetition_majority,Commonsense_majority,Arithmetic_majority,step_num,step_text,llm_problem
805,305,"For this task, you will be shown a CONTEXT wit...",no,no,no,no,no,no,no,no,no,1,Warner was the starting quarterback for the Ca...,no
646,239,"For this task, you will be shown a CONTEXT wit...",no,no,no,no,no,no,no,no,no,1,On 10 August 1919 a cease-fire was signed.,no
500,172,"For this task, you will be shown a CONTEXT wit...",no,no,no,no,no,no,no,no,no,5,12.73675 people- 5.3% Danish people= 12.177 pe...,yes
204,85,"For this task, you will be shown a CONTEXT wit...",no,no,no,no,no,no,no,no,no,1,Bills went home for their last home game of th...,no
1208,417,"For this task, you will be shown a CONTEXT wit...",no,no,no,no,no,no,no,no,no,2,The answer is 3.,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
467,168,"For this task, you will be shown a CONTEXT wit...",no,no,no,no,no,no,no,no,no,4,14.225 people- 10.7% United States= 12.73675 p...,yes
425,153,"For this task, you will be shown a CONTEXT wit...",no,no,no,no,no,no,no,no,no,1,It took the Singapore National Olympic Council...,no
459,167,"For this task, you will be shown a CONTEXT wit...",no,no,no,no,no,no,no,no,no,4,14.225 people- 10.7% United States= 12.73675 p...,yes
513,174,"For this task, you will be shown a CONTEXT wit...",no,no,no,no,no,no,no,no,no,2,19.357 people- 16.5% Germans= 16.1185 people.,yes


In [None]:
# Build human_problem and llm_problem_bin.
# Collapse the error metrics into one binary label: human_problem = 1 if any of
# them has a majority "yes" (humans saw some issue in the step), else 0.
ERROR_METRICS = [
    "Factuality_majority",
    "Coherency and Logic_majority",
    "Final Answer_majority",
    "Hallucination_majority",
    "Commonsense_majority",
    "Arithmetic_majority",
]

def human_has_problem(row):
    for col in ERROR_METRICS:
        if col in row and row[col] == "yes":
            return 1
    return 0

subset["human_problem"] = subset.apply(human_has_problem, axis=1)
subset["llm_problem_bin"] = subset["llm_problem"].map({"yes": 1, "no": 0})

print(pd.crosstab(subset["human_problem"], subset["llm_problem_bin"],
                  rownames=["human"], colnames=["llm"]))


llm      0   1
human         
0      130  46
1       12  12
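Raw agreement can hide how the two label sources relate when one class dominates. The sketch below derives accuracy, precision, recall, and Cohen's kappa from the four counts in the crosstab above; the variable names are illustrative, and treating the human label as the reference is an assumption of this sketch.

```python
# Counts copied from the crosstab above (rows = human, columns = LLM).
tn, fp = 130, 46   # human says "no problem": LLM agrees / LLM flags anyway
fn, tp = 12, 12    # human says "problem":    LLM misses / LLM also flags

n = tn + fp + fn + tp
accuracy = (tn + tp) / n
precision = tp / (tp + fp)   # share of LLM "yes" verdicts that humans confirm
recall = tp / (tp + fn)      # share of human-flagged steps the LLM catches

# Cohen's kappa: observed agreement corrected for chance agreement,
# where chance agreement is computed from the two marginal "yes" rates.
p_human_yes = (fn + tp) / n
p_llm_yes = (fp + tp) / n
p_chance = p_human_yes * p_llm_yes + (1 - p_human_yes) * (1 - p_llm_yes)
kappa = (accuracy - p_chance) / (1 - p_chance)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} kappa={kappa:.3f}")
```

Because only 24 of the 200 steps carry a human-flagged problem, raw agreement is dominated by the majority class, which is why a chance-corrected measure is worth reporting alongside it.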


In [None]:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

analysis_df = subset.copy()
analysis_df["agree"] = (analysis_df["human_problem"] == analysis_df["llm_problem_bin"]).astype(int)

def num_tokens(text):
    return len(enc.encode(text or ""))

analysis_df["step_len"] = analysis_df["step_text"].apply(lambda x: num_tokens(str(x)))

print("\nAgreement rate:", analysis_df["agree"].mean())
print("\nAvg length by agreement:")
print(analysis_df.groupby("agree")["step_len"].mean())



Agreement rate: 0.71

Avg length by agreement:
agree
0    16.362069
1    18.366197
Name: step_len, dtype: float64


So, on average, the steps where the LLM and humans disagree are slightly shorter than those where they agree. This suggests that the model might be somewhat more reliable on longer steps (perhaps because it has more information to work with), while shorter steps leave more room for it to misjudge whether there is a problem.
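Whether a two-token gap in mean length is more than noise can be checked with a permutation test. The following is a minimal sketch, not part of the original analysis: `permutation_test_mean_diff` is a hypothetical helper, and the token counts are toy values rather than the notebook's data.

```python
import numpy as np


def permutation_test_mean_diff(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference mean(a) - mean(b) and a p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        # Shuffle group membership and recompute the difference in means.
        perm = rng.permutation(pooled)
        diff = perm[:len(a)].mean() - perm[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)


# Toy token counts, NOT the notebook's data.
agree_len = [18, 22, 15, 20, 19, 17, 21, 16]
disagree_len = [16, 14, 19, 15, 17, 18, 13, 16]
diff, p_value = permutation_test_mean_diff(agree_len, disagree_len)
print(f"mean difference = {diff:.2f}, p = {p_value:.3f}")
```

On the real data, the two inputs would be the `step_len` values of `analysis_df` for `agree == 1` and `agree == 0`.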

# **Test whether the LLM judge is biased toward style instead of correctness**

In [None]:
!pip install transformers


In [None]:
#1. Compare average fluency in agree vs disagree by computing a fluency score
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import numpy as np

FLU_MODEL_NAME = "gpt2"   # small, just for a rough fluency proxy

flu_tokenizer = AutoTokenizer.from_pretrained(FLU_MODEL_NAME)
flu_model = AutoModelForCausalLM.from_pretrained(FLU_MODEL_NAME)
flu_model.eval()
if torch.cuda.is_available():
    flu_model.to("cuda")

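def avg_nll(text):
    """Average per-token negative log-likelihood of `text` under GPT-2."""
    text = str(text)
    if not text.strip():
        return float("nan")  # nothing to score; NaN is ignored by .mean()
    # Truncate to GPT-2's 1024-token context window
    inputs = flu_tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
    with torch.no_grad():
        out = flu_model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()   # average per token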

analysis_df["nll"] = analysis_df["step_text"].apply(avg_nll)
analysis_df["fluency"] = -analysis_df["nll"]   # higher = more fluent

print("\nFluency stats by agreement:")
print(analysis_df.groupby("agree")["fluency"].mean())



In [None]:
analysis_df

Disagreeing steps are less fluent on average than agreeing steps.

The main hypothesis was that the judge might **over-trust very fluent, verbose steps even when they are wrong**.

If that were happening strongly, we would expect many human-marked error steps that are highly fluent and that the judge nevertheless labels "no problem."
The aggregate statistics do not show higher fluency among disagreements overall; instead, disagreements are less fluent on average.

In [None]:
# Only steps where humans say there IS a problem
err_df = analysis_df[analysis_df["human_problem"] == 1].copy()

print("\nFluency for human-error steps:")
print(err_df.groupby("llm_problem_bin")["fluency"].mean())
print("\nCounts:")
print(err_df["llm_problem_bin"].value_counts())



Fluency for human-error steps:
llm_problem_bin
0   -3.992727
1   -3.833590
Name: fluency, dtype: float64

Counts:
llm_problem_bin
1    12
0    12
Name: count, dtype: int64


For steps where humans say "there is a problem," the two groups have similar fluency, and the LLM is slightly more likely to agree with humans when the step is more fluent.

**What these numbers mean**

We restricted the analysis to steps with `human_problem = 1` (humans see an error) and split them by the LLM's decision:

- `llm_problem_bin = 1` (LLM also sees a problem): fluency ≈ -3.83.
- `llm_problem_bin = 0` (LLM misses the problem): fluency ≈ -3.99.

Higher (less negative) values mean more fluent text, so error steps where the LLM agrees with humans are slightly more fluent than those where it misses the error.

**Interpretation for the bias question**

Among human-flagged error steps, the LLM does not preferentially excuse the most fluent ones; if anything, it detects problems slightly better on more fluent text.

With such a small sample (12 vs. 12), the difference is small and no statistical conclusion can be drawn from it, but it does not support the story "the LLM over-trusts very fluent wrong steps" on this slice.

Combined with the earlier result (disagreements are less fluent overall), the current evidence suggests that the judge struggles more with rough, awkward steps than with overly fluent ones.
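The fluency scores are negated average negative log-likelihoods (in nats), which are hard to read directly. A small sketch of converting them to GPT-2 perplexity, using the two group means printed above; the variable names are illustrative.

```python
import math

# Mean fluency values copied from the output above (fluency = -average NLL).
fluency_llm_flags = -3.833590    # human-error steps the LLM also flags
fluency_llm_misses = -3.992727   # human-error steps the LLM misses

# Perplexity is exp(average NLL), so negate the fluency score first.
ppl_flags = math.exp(-fluency_llm_flags)
ppl_misses = math.exp(-fluency_llm_misses)

print(f"perplexity (LLM flags):  {ppl_flags:.1f}")
print(f"perplexity (LLM misses): {ppl_misses:.1f}")
```

Lower perplexity means the text is more predictable to GPT-2, so the ordering matches the fluency comparison: the steps the LLM flags are slightly more predictable than the ones it misses.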

In [None]:
# Save the main results
subset.to_csv("roscoe_step_judgments_raw.csv", index=False)
subset.to_json("roscoe_step_judgments_raw.jsonl",
               orient="records", lines=True)
print("Saved roscoe_step_judgments_raw.*")
analysis_df.to_csv("roscoe_analysis_features.csv", index=False)
analysis_df.to_json("roscoe_analysis_features.jsonl",
                    orient="records", lines=True)
print("Saved roscoe_analysis_features.*")


import json

summary = {
    "n_steps": int(len(analysis_df)),
    "agreement_rate": float(analysis_df["agree"].mean()),
    "avg_len_agree": float(analysis_df.groupby("agree")["step_len"].mean().get(1, 0.0)),
    "avg_len_disagree": float(analysis_df.groupby("agree")["step_len"].mean().get(0, 0.0)),
    "avg_flu_agree": float(analysis_df.groupby("agree")["fluency"].mean().get(1, 0.0)),
    "avg_flu_disagree": float(analysis_df.groupby("agree")["fluency"].mean().get(0, 0.0)),
}

with open("roscoe_summary.json", "w") as f:
    json.dump(summary, f, indent=2)

print("Saved roscoe_summary.json")



Saved roscoe_step_judgments_raw.*
Saved roscoe_analysis_features.*
Saved roscoe_summary.json


***Look directly at disagreement examples and label why the judge disagrees***

In [None]:
disagree_df = analysis_df[analysis_df["agree"] == 0].copy()
print("Num disagreements:", len(disagree_df))

# Save them to inspect comfortably
disagree_df.to_csv("roscoe_disagreements_for_manual_analysis.csv", index=False)
for i in range(min(5, len(disagree_df))):  # guard against fewer than 5 rows
    row = disagree_df.iloc[i]
    print("="*80)
    print("ID:", row["id"], "step", row["step_num"])
    print("Human_problem:", row["human_problem"],
          "LLM_problem:", row["llm_problem_bin"])
    print("\nCONTEXT:\n", row["context_block"])
    print("\nSTEP:\n", row["step_text"])


Num disagreements: 58
ID: 172 step 5
Human_problem: 0 LLM_problem: 1

CONTEXT:
 For this task, you will be shown a CONTEXT with a "Situation" and a "Claim" about that "Situation". The "Claim" may or may not be supported by the "Situation". The Correct Relationship between the "Claim" and the "Situation" is provided.

You will be shown a GENERATED RESPONSE generated from a bot, asked the question

Is the Claim supported by the Situation?

You will be asked to judge the individual STEPS within the GENERATED RESPONSE. Interpret the questions to the best of your ability. Sometimes the generated response will refer to the "Situation" as a "Premise" and the "Claim" as a "Hypothesis". It will oftentimes be faster to read the "Claim" before the "Situation".

CONTEXT:
Situation (Premise): As of the census of 2000, there were 24,621 people, 9,029 households, and 6,284 families residing in the county.  The population density was 73 people per square mile (28/km²).  There were 12,064 housing units

On the ROSCOE-DROP stepwise sample, the LLM judge agreed with the collapsed human label on 71% of the 200 evaluated steps, showing reasonably high but imperfect alignment. Of the 58 disagreements, 46 were steps the LLM flagged but humans accepted, and 12 were human-flagged steps the LLM let through. Disagreements were not concentrated on especially long or highly fluent steps: in aggregate, disagreeing steps were slightly shorter and somewhat less fluent than agreeing ones, and among human-error steps, the ones the LLM flagged were slightly more fluent than the ones it missed. Qualitative inspection of disagreement cases suggested that most mismatches arose from differences in task interpretation and strictness rather than from a clear bias toward fluent or verbose language. For example, the LLM re-solved math questions and flagged numerically wrong answers, or flagged irrelevant but factual sentences, as "problems" where human annotators had not.
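With only 200 judged steps, the 71% agreement rate carries sampling uncertainty. The sketch below computes a Wilson score interval for it; `wilson_interval` is a hypothetical helper, and the 95% level (z = 1.96) is an assumption of this sketch rather than a choice made in the original analysis.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 142 agreements out of 200 judged steps (130 + 12 from the crosstab).
low, high = wilson_interval(142, 200)
print(f"agreement = {142 / 200:.2f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The interval treats steps as independent draws, which is optimistic here because several steps can come from the same instance.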


In [None]:
import json, os

# 1. Notebook filename (adjust to the exact path in your repo / Drive)
NB_NAME = "Roscoe_drop.ipynb"

# 2. Load, remove metadata.widgets, save back
if not os.path.exists(NB_NAME):
    # The notebook is not in the working directory; point NB_NAME at its full path.
    print("Notebook not found:", NB_NAME)
else:
    with open(NB_NAME, "r", encoding="utf-8") as f:
        nb = json.load(f)

    if "metadata" in nb and "widgets" in nb["metadata"]:
        print("Removing metadata.widgets...")
        del nb["metadata"]["widgets"]
    else:
        print("No metadata.widgets found.")

    with open(NB_NAME, "w", encoding="utf-8") as f:
        json.dump(nb, f)

    print("Cleaned notebook saved:", NB_NAME)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# List to find exact path/name
!ls "/content/drive/MyDrive"



MessageError: Error: credential propagation was unsuccessful