# MADR prompt evaluation

This notebook is inspired by [this research](https://cacm.acm.org/research/from-prompt-engineering-to-prompt-science-with-humans-in-the-loop/) on prompt engineering with humans in the loop.

We face one issue: the existing prompts in the MADR paper are ill-defined and intended for a model of different complexity. While fine-tuning is an enticing option, we lack a dataset of questions, evidence, claim veracity, and target feedback.

In [1]:
import sys
import os
import json
import random
sys.path.append(os.path.abspath(".."))

In [2]:
def in_jupyter_notebook():
    try:
        shell = get_ipython().__class__.__name__
        return shell == 'ZMQInteractiveShell'
    except NameError:
        return False
IN_JUPYTER = in_jupyter_notebook()

In [3]:
from src.model_clients import LlamaCppClient
from src import config
from src.madr import run_madr
from src.utils import parse_ternary, get_prompt_files

In [4]:
prompts = get_prompt_files(config.MADR_DIR)
client = LlamaCppClient(prompts)

In [5]:
corag_run = None
relative_path = os.path.join('assets', '20251123T192328--metrics__100.json')
with open(relative_path, 'r', encoding='utf-8') as f:
    corag_run = json.load(f)["results"]

In [6]:
# set the random seed
random.seed(42)

## An initial codebook for Debater 1, 2, Cross-Agent, Judge, Refiner

A 'codebook' (or set of criteria) in the above research encapsulates the desired outcomes for generation. It lets any human in the loop decide quickly and methodically whether a generated response meets the specifications.

The starting criteria, validated by multiple researchers for each model, are defined below:

1. **Debater 1, given an explanation and evidence:**
- Does distinguish between the following types of errors in the explanation:
  1) misrepresentations of factual details in the evidence
  2) introduces or misrepresents events not present in the evidence
  3) introduces or misrepresents noun phrases in the evidence, changing semantic meaning
  4) logic/reasoning inconsistencies in the explanation
  5) includes information not relevant to the claim/evidence or beyond what the evidence supports
  6) does not adequately justify its position
- Provides actionable feedback, rather than attempting to correct errors in the explanation to align with evidence.
- ONLY considers errors in the explanation, not the claim or Q/A pairs.
- Identifies ALL such errors
- Does NOT output non-existent errors.
- Does NOT merely restate the types of errors without context.
- Does NOT merely restate the types of errors without context.
- Does NOT merely restate the types of errors without context.
- Does NOT merely restate the types of errors without context.
- Does NOT merely restate the types of errors without context.
- Labels all errors with the typology.
- Answers tersely

*note, it is NOT a requirement the model names the error types.*

2. **Debater 2, given an explanation and evidence:**
- Identifies general weaknesses which reduce the faithfulness of an explanation to the claim/evidence.
- Generates explanations for why identified issues are unfaithful:
  1) Factually inaccurate
  2) logically inaccurate
  3) irrelevant
  4) incoherent
  5) incomplete
- Provides actionable feedback, rather than attempting to correct errors in the explanation to align with evidence.
- ONLY considers errors in the explanation, not the claim or Q/A pairs.
- Does not include a restatement of the explanation
- Avoids logical assumptions about the claim
- Considers the evidence as the source of ground truth
- Does provide actionable feedback
- Identifies ALL such errors
- Does NOT output non-existent errors.
- Answers tersely

3. **Cross-Agent, given the original claim, evidence, explanation, and a primary and secondary set of feedback:**
- uses the claim, evidence, and explanation to:
  1) recognize invalid feedback in the primary feedback, and incorporate corrections from the secondary feedback.
  2) does NOT include invalid feedback from the secondary feedback.
- outputs ONLY the modified primary feedback
- does NOT merely copy the secondary feedback

*note, the input to this agent is up to debate. I am making the assumption that only including the explanation rather than the full claim and Q/A pairs balances the agent's ability to refine the feedback given a source of ground truth while keeping the prompt context small.*

4. **Judge, given two sets of feedback:**
- identifies major discrepancies between the input feedback:
  1) identifies different errors or issues
  2) provides inconsistent error descriptions
  3) whether any suggested fixes are not equivalent
- if discrepancies exist, output contains the word 'FALSE'
- else, output contains the word 'TRUE'

5. **Refinement Agent, given two sets of feedback and an explanation:**
- rewrites the explanation to align with both sets of feedback
- Does NOT output anything other than the refined explanation
- Does NOT remove the component containing the verdict in the original explanation, but can 'flip' it, i.e.: true -> false.
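
The Judge criteria above only require that the output *contains* 'TRUE' or 'FALSE', so downstream code has to extract that word from free text. A minimal sketch of one way to do this follows. `judge_agrees` is a hypothetical helper, not the notebook's `parse_ternary` (whose behavior is not shown here), and treating an output that contains both words as unparseable is an assumption.

```python
import re
from typing import Optional


def judge_agrees(judge_output: str) -> Optional[bool]:
    """Map a Judge response to True (no discrepancy), False (discrepancy),
    or None when the response cannot be interpreted.

    Hypothetical helper: the tie-breaking rule below is an assumption.
    """
    # Whole-word, case-insensitive match so "falsely" does not count as FALSE.
    tokens = re.findall(r"\b(true|false)\b", judge_output, flags=re.IGNORECASE)
    found = {token.lower() for token in tokens}
    if found == {"true"}:
        return True
    if found == {"false"}:
        return False
    # Neither word, or both words: leave the decision to a human reviewer.
    return None
```

Returning `None` rather than guessing keeps ambiguous Judge outputs visible during the human-in-the-loop review.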

## Initial Custom Prompts

Because it is already established that the original prompts do not meet these requirements, we will start by creating our initial custom prompts.

To do this, we simply feed the codebook itself into an LLM to generate the prompt. I will use this boilerplate:

> You are tasked with taking a set of requirements and phrasing it as a concise prompt for an LLM. In this initial generation, the codebook serves as the primary material for construction, while a previous prompt is provided purely to fill in any gaps left by the codebook. Extra output besides the prompt is unnecessary and unwarranted.

For further refinement:

> You are tasked with taking a set of requirements and phrasing it as a concise prompt for an LLM. In this initial generation, the codebook serves as the primary material for construction. A repeated point stresses emphasis. Extra output besides the prompt is unnecessary and unwarranted.
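
The boilerplate above still has to be combined with the codebook (and, for the first generation, the previous prompt) before it is sent to an LLM. The sketch below shows one possible way to assemble that text. `build_meta_prompt` and the `Codebook:` / `Previous prompt:` section labels are hypothetical and were not part of the original workflow.

```python
def build_meta_prompt(instructions: str, codebook: str, previous_prompt: str = "") -> str:
    """Join the boilerplate instructions, the codebook, and an optional
    previous prompt into one message.

    Hypothetical helper: the section labels are illustrative assumptions.
    """
    sections = [instructions.strip(), "Codebook:\n" + codebook.strip()]
    # The previous prompt is only a gap-filler, so it is optional and goes last.
    if previous_prompt.strip():
        sections.append("Previous prompt:\n" + previous_prompt.strip())
    return "\n\n".join(sections)
```

For the refinement pass, the same function can be called without `previous_prompt`, matching the second boilerplate, which does not mention one.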

In [7]:
madr_init_fb1_system_prompt = """Evaluate the following explanation against the given evidence by identifying all errors it contains, using these labels:

- misrepresented factual details;
- introduced/misrepresented events not in evidence;
- introduced/misrepresented noun phrases altering meaning;
- logic/reasoning inconsistencies;
- irrelevant or unsupported information;
- inadequate justification.

Only analyze the explanation for errors (not the claim or Q/A). Do not invent errors. Provide actionable feedback without rewriting or correcting the explanation. Avoid restating the error types without context."""

madr_init_fb2_system_prompt = """Given the following claim, explanation, and QA-pair evidence, identify all weaknesses in the explanation that reduce the faithfulness to the claim/evidence.

For each problematic sentence in the explanation, quote it (inline, no need to restate the full explanation), explain why it is unfaithful (factually inaccurate, logically inaccurate, irrelevant, incoherent, or incomplete), and provide actionable feedback without correcting or rewriting the explanation.

Only analyze the explanation for errors (not the claim or Q/A). Only consider the available evidence as the ground truth. Do not invent issues."""

madr_cross_fb_system_prompt = """Given the following claim, QA-pairs, explanation, and primary and secondary feedback, revise the primary feedback.

Remove any invalid points in the primary feedback, add only valid corrections supported by the secondary feedback, and ignore invalid secondary feedback. Use the provided claim, QA-pairs, and explanation as context.

Do not copy secondary feedback verbatim. Output only the corrected primary feedback."""

madr_cross_fb_user_prompt = """Claim: {}

Question-Answer Pairs:
{}

Explanation:
{}

Primary feedback:
{}

Secondary feedback:
{}"""

madr_judge_system_prompt = """Given the following sets of feedback, compare them for major discrepancies: whether they identify different errors, give inconsistent error descriptions, or propose nonequivalent fixes.

If any discrepancy exists, output FALSE; otherwise output TRUE. Output only one word."""


madr_revise_system_prompt = """Given the following explanation and two sets of feedback, rewrite the explanation so it aligns with both feedback sets.
Your revised explanation must always conclude with either "True", "False", or "Inconclusive", depending on the truthfulness of the claim with regard to the rewritten explanation."""

In [8]:
def replace_prompt(key, system_prompt_str):
    client._prompts[key] = (
        client._prompts[key][0],
        system_prompt_str
    )

In [9]:
replace_prompt("init_fb1", madr_init_fb1_system_prompt)

In [10]:
client._prompts["init_fb1"]

('Claim: {}\n\nQuestion-Answer Pairs:\n{}\n\nExplanation: {}\n\nFeedback:',
 'Evaluate the following explanation against the given evidence by identifying all errors it contains, using these labels:\n\n- misrepresented factual details;\n- introduced/misrepresented events not in evidence;\n- introduced/misrepresented noun phrases altering meaning;\n- logic/reasoning inconsistencies;\n- irrelevant or unsupported information;\n- inadequate justification.\n\nOnly analyze the explanation for errors (not the claim or Q/A). Do not invent errors. Provide actionable feedback without rewriting or correcting the explanation. Avoid restating the error types without context.')

In [11]:
replace_prompt("init_fb2", madr_init_fb2_system_prompt)
replace_prompt("judge", madr_judge_system_prompt)
replace_prompt("revise", madr_revise_system_prompt)

In [12]:
client._prompts["cross_fb"] = (
    madr_cross_fb_user_prompt,
    madr_cross_fb_system_prompt
)
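
The new `cross_fb` user prompt has five `{}` slots, while the other user prompts shown in this notebook have three, so a mismatch between template and arguments is easy to introduce. The sketch below checks the slot count before formatting. It assumes the client fills the user template positionally with `str.format`, which is not confirmed by anything shown here; `count_placeholders` and `fill_template` are hypothetical names.

```python
from string import Formatter


def count_placeholders(template: str) -> int:
    """Count replacement fields such as {} or {name} in a str.format template."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for chunks that contain no replacement field.
    return sum(1 for _, field, _, _ in Formatter().parse(template) if field is not None)


def fill_template(template: str, args: list) -> str:
    """Format the template, failing loudly when the argument count is wrong."""
    expected = count_placeholders(template)
    if expected != len(args):
        raise ValueError(f"template expects {expected} values, got {len(args)}")
    return template.format(*args)
```

A check like this turns a silently dropped argument (`str.format` ignores extra positional values) into an immediate error.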

## Refining Base MADR Prompts

I have a set of 100 claims that were run through the baseline pipeline. When evaluating the MADR pipeline, *I will assume that the answers generated by the CoRAG answering agent are always factual.*

In [13]:
corag_run[:3]

[{'claim_id': 0,
  'claim': 'In 2008, Jeff Ament released a solo record.',
  'true_label': 'SUPPORTS',
  'predicted_label': 'SUPPORTS',
  'qa_pairs': [["When was Jeff Ament's solo record released?",
    'September 16, 2008.']],
  'verdict_int': 1,
  'verdict_raw': 'The question-answer pair directly confirms the claim that Jeff Ament released a solo record in 2008, as it specifies the release date as September 16, 2008. Therefore, the claim is supported by the provided information. True.'},
 {'claim_id': 1,
  'claim': 'Franklin D. Roosevelt had a family that originated in New York.',
  'true_label': 'SUPPORTS',
  'predicted_label': 'SUPPORTS',
  'qa_pairs': [["What is the origin of Franklin D. Roosevelt's family?",
    "The origin of Franklin D. Roosevelt's family can be traced back to the Roosevelt family, a prominent American family with roots in Dutch heritage. The family's history in the United States dates back to the 17th century, with early ancestors arriving in the New World. Th

### Debater 1 Freespace

This marks our first iteration of refining the codebook. To evaluate the debater, we need a reasonable number of responses. In this case, I will generate 10: one for each of 5 random claims per label (SUPPORTS/REFUTES):

In [14]:
def select_random_sample(data, count=5):
    supports = [i for i in data if i["true_label"] == "SUPPORTS"]
    refutes = [i for i in data if i["true_label"] == "REFUTES"]

    selection_supports = random.sample(supports, count)
    selection_refutes = random.sample(refutes, count)

    return selection_supports + selection_refutes
sample = select_random_sample(corag_run)
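
Because `select_random_sample` draws from the global `random` state, which is seeded once near the top of the notebook, the sample it returns depends on how many times cells have been run. A minimal sketch of a variant with its own generator follows. `select_balanced_sample` and its `seed` parameter are hypothetical additions, not part of the original notebook.

```python
import random


def select_balanced_sample(data, count=5, seed=42):
    """Pick `count` claims per label using a local, seeded generator."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    picked = []
    for label in ("SUPPORTS", "REFUTES"):
        group = [row for row in data if row["true_label"] == label]
        # random.sample raises a generic ValueError when the group is too small;
        # raise a more descriptive one instead.
        if len(group) < count:
            raise ValueError(f"only {len(group)} {label} claims available, need {count}")
        picked.extend(rng.sample(group, count))
    return picked
```

Calling it twice with the same `seed` returns the same claims, so a reviewer can re-run a cell without changing which responses are being judged.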

Now I'll generate and cache the debater's responses:

In [15]:
def regenerate():
    for c in sample:
        claim = c["claim"]
        qa_pairs = c["qa_pairs"]
        verdict_raw = c["verdict_raw"]
        fb = client.send_prompt("init_fb1", [claim, qa_pairs, verdict_raw])
        c["r"] = fb

In [16]:
regenerate()
sample

[{'claim_id': 40,
  'claim': 'Jun Ji-hyun is in the South Korean film called Windstruck.',
  'true_label': 'SUPPORTS',
  'predicted_label': 'SUPPORTS',
  'qa_pairs': [['Is Jun Ji-hyun featured in the South Korean film "Windstruck"?',
    'Answer: Yes.']],
  'verdict_int': 1,
  'verdict_raw': 'The question-answer pair directly confirms that Jun Ji-hyun is featured in the South Korean film "Windstruck," which aligns with the claim. Therefore, the claim is supported by the provided information. True.',
  'r': '- logic/reasoning inconsistencies: The explanation assumes that the question-answer pair directly confirms the claim without providing specific evidence or details about Jun Ji-hyun\'s role or presence in the film "Windstruck." The reasoning is based on the assertion that the answer "Yes" supports the claim, but this does not constitute a logical or evidential justification beyond the answer itself.'},
 {'claim_id': 7,
  'claim': 'A subtype of anti-nuclear antibodies are anti-Ro ant

We now need code to pretty-print each claim, making it easy to judge the agent's effectiveness:

In [17]:
def pprint_results(sample):
    for i, s in enumerate(sample):
        claim = s["claim"]
        qa_pairs = s["qa_pairs"]
        explanation = s["verdict_raw"]
        madr_agent = s["r"].strip()
        print(f"{i+1}. {claim}\n")
        print("Available evidence for this claim, in the form of Q/A pairs:")
        for j, e in enumerate(qa_pairs):
            print(f"{j+1}.\tQuestion: {e[0]}\n\t{e[1]}")
        print()
        print(f"The original verdict: {explanation}\n")
        print(f"This agent's feedback:\n{madr_agent}")
        print("\n\n")

pprint_results(sample)

1. Jun Ji-hyun is in the South Korean film called Windstruck.

Available evidence for this claim, in the form of Q/A pairs:
1.	Question: Is Jun Ji-hyun featured in the South Korean film "Windstruck"?
	Answer: Yes.

The original verdict: The question-answer pair directly confirms that Jun Ji-hyun is featured in the South Korean film "Windstruck," which aligns with the claim. Therefore, the claim is supported by the provided information. True.

This agent's feedback:
- logic/reasoning inconsistencies: The explanation assumes that the question-answer pair directly confirms the claim without providing specific evidence or details about Jun Ji-hyun's role or presence in the film "Windstruck." The reasoning is based on the assertion that the answer "Yes" supports the claim, but this does not constitute a logical or evidential justification beyond the answer itself.



2. A subtype of anti-nuclear antibodies are anti-Ro antibodies.

Available evidence for this claim, in the form of Q/A pairs:

### Debater 2 Freespace

In [18]:
def regenerate():
    for c in sample:
        claim = c["claim"]
        qa_pairs = c["qa_pairs"]
        verdict_raw = c["verdict_raw"]
        fb = client.send_prompt("init_fb2", [claim, qa_pairs, verdict_raw])
        c["r"] = fb

In [19]:
sample = select_random_sample(corag_run)
regenerate()
pprint_results(sample)

1. Birth of the Dragon's principal photography began in Vancouver, Canada.

Available evidence for this claim, in the form of Q/A pairs:
1.	Question: When did the principal photography of *Birth of the Dragon* begin?
	The principal photography of *Birth of the Dragon* began on November 17, 2015.

The original verdict: The question-answer pair confirms that the principal photography of *Birth of the Dragon* began on November 17, 2015, but it does not specify the location. Therefore, the claim that it began in Vancouver, Canada, is not fully confirmed by the provided information. Inconclusive.

This agent's feedback:
The explanation contains the following problematic sentence: "The question-answer pair confirms that the principal photography of *Birth of the Dragon* began on November 17, 2015, but it does not specify the location."

This sentence is unfaithful because it inaccurately states that the question-answer pair "does not specify the location," which is factually incorrect based 

### Cross-Reference Freespace

In [20]:
# populate the debaters
# for c in corag_run:
#     claim = c["claim"]
#     qa_pairs = c["qa_pairs"]
#     verdict_raw = c["verdict_raw"]
#     fb1 = client.send_prompt("init_fb1", [claim, qa_pairs, verdict_raw])
#     fb2 = client.send_prompt("init_fb2", [claim, qa_pairs, verdict_raw])
#     c["fb1"] = fb1
#     c["fb2"] = fb2

In [24]:
incomp_relative_path = os.path.join('assets', '20251201T170345--metrics_incomp__100.json')
# write to the new path so the original CoRAG metrics file is not overwritten
with open(incomp_relative_path, 'w', encoding='utf-8') as f:
    json.dump(corag_run, f, indent=4)

In [26]:
with open(incomp_relative_path, 'r', encoding='utf-8') as f:
    incomp_run = json.load(f)

In [27]:
incomp_run[:3]

[{'claim_id': 0,
  'claim': 'In 2008, Jeff Ament released a solo record.',
  'true_label': 'SUPPORTS',
  'predicted_label': 'SUPPORTS',
  'qa_pairs': [["When was Jeff Ament's solo record released?",
    'September 16, 2008.']],
  'verdict_int': 1,
  'verdict_raw': 'The question-answer pair directly confirms the claim that Jeff Ament released a solo record in 2008, as it specifies the release date as September 16, 2008. Therefore, the claim is supported by the provided information. True.'},
 {'claim_id': 1,
  'claim': 'Franklin D. Roosevelt had a family that originated in New York.',
  'true_label': 'SUPPORTS',
  'predicted_label': 'SUPPORTS',
  'qa_pairs': [["What is the origin of Franklin D. Roosevelt's family?",
    "The origin of Franklin D. Roosevelt's family can be traced back to the Roosevelt family, a prominent American family with roots in Dutch heritage. The family's history in the United States dates back to the 17th century, with early ancestors arriving in the New World. Th

In [29]:
# check that the user prompt now includes the explanation
client._prompts["cross_fb"]

('Claim: {}\n\nQuestion-Answer Pairs:\n{}\n\nExplanation:\n{}\n\nPrimary feedback:\n{}\n\nSecondary feedback:\n{}',
 'Given the following claim, QA-pairs, explanation, and primary and secondary feedback, revise the primary feedback.\n\nRemove any invalid points in the primary feedback, add only valid corrections supported by the secondary feedback, and ignore invalid secondary feedback. Use the provided claim, QA-pairs, and explanation as context.\n\nDo not copy secondary feedback verbatim. Output only the corrected primary feedback.')

In [30]:
def regenerate_cross():
    for i, c in enumerate(sample):
        fb1 = c["fb1"]
        fb2 = c["fb2"]
        claim = c["claim"]
        qa_pairs = c["qa_pairs"]
        verdict_raw = c["verdict_raw"]
        if i >= 5:
            # try opposite send order
            cross = client.send_prompt("cross_fb", [claim, qa_pairs, verdict_raw, fb2, fb1])
        else:
            cross = client.send_prompt("cross_fb", [claim, qa_pairs, verdict_raw, fb1, fb2])
        c["cross"] = cross

In [31]:
def pprint_results_cross(sample):
    for i, s in enumerate(sample):
        claim = s["claim"]
        qa_pairs = s["qa_pairs"]
        explanation = s["verdict_raw"]
        fb1 = s["fb1"]
        fb2 = s["fb2"]
        cross_agent = s["cross"].strip()
        print(f"{i+1}. {claim}")
        print("Available evidence for this claim, in the form of Q/A pairs:")
        for j, e in enumerate(qa_pairs):
            print(f"{j+1}.\tQuestion: {e[0]}\n\t{e[1]}")
        print(f"Explanation: {explanation}\n")
        print("Feedback for this explanation:")
        print(f"\ta. {fb1}")
        print(f"\tb. {fb2}")
        print()
        print(f"This agent's rewrite:\n{cross_agent}")
        print("\n\n")

In [None]:
sample = select_random_sample(incomp_run)
regenerate_cross()
pprint_results_cross(sample)
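
`regenerate_cross` reads `c["fb1"]` and `c["fb2"]` directly, so any record without both debater outputs raises a `KeyError`. The records printed from `incomp_run` above do not show those keys. A minimal sketch of filtering before sampling follows. `with_both_feedback` is a hypothetical helper, and treating empty strings as missing feedback is an assumption.

```python
def with_both_feedback(records, keys=("fb1", "fb2")):
    """Keep only records that have non-empty feedback from both debaters."""
    # r.get(k) is None for a missing key and falsy for an empty string,
    # so both cases are filtered out.
    return [r for r in records if all(r.get(k) for k in keys)]
```

Passing `with_both_feedback(incomp_run)` to the sampling step instead of `incomp_run` would restrict the cross-agent evaluation to claims where both debaters produced feedback.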