# MADR prompt evaluation

This notebook is inspired by [this research](https://cacm.acm.org/research/from-prompt-engineering-to-prompt-science-with-humans-in-the-loop/) on prompt engineering with humans in the loop.

The issue at hand: the existing prompts in the MADR paper are ill-defined and were written for a model of different complexity. Fine-tuning is an enticing alternative, but we lack a dataset of questions, evidence, veracity claims, and target feedback.

In [1]:
import sys
import os
import json
import random
sys.path.append(os.path.abspath(".."))

In [2]:
def in_jupyter_notebook():
    try:
        shell = get_ipython().__class__.__name__
        return shell == 'ZMQInteractiveShell'
    except NameError:
        return False
IN_JUPYTER = in_jupyter_notebook()

In [3]:
from src.model_clients import LlamaCppClient
from src import config
from src.madr import run_madr
from src.parsers import parse_ternary

In [4]:
prompts_dir = config.PROMPTS_DIR / "custom"
client = LlamaCppClient(prompts_dir)

In [5]:
corag_run = None
relative_path = os.path.join('assets', '20251123T192328--metrics__100.json')
with open(relative_path, 'r', encoding='utf-8') as f:
    corag_run = json.load(f)["results"]

In [6]:
# set the random seed
random.seed(42)

## An initial codebook for Debater 1, 2, Cross-Agent, Judge, Refiner

A 'codebook' (or set of criteria) in the above research encapsulates the desired outcomes for generation. It lets any human in the loop decide quickly and methodically whether a generated response meets the specification.
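
One way to make a codebook operational is to store each criterion as a named yes/no check and record a reviewer's verdicts per response. The sketch below is only an illustration: `Criterion`, `Judgment`, and `failure_counts` are hypothetical names that do not exist in this repository, and the two criteria are an excerpt of the Debater 1 list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    """One line of the codebook, phrased so a reviewer can answer yes or no."""
    key: str
    description: str


@dataclass
class Judgment:
    """A reviewer's verdicts on a single generated response."""
    response_id: int
    verdicts: dict = field(default_factory=dict)  # criterion key -> bool

    def passes(self, codebook):
        # A response passes only if every criterion was judged and met.
        return all(self.verdicts.get(c.key, False) for c in codebook)


def failure_counts(codebook, judgments):
    """Count, per criterion, how many responses failed (or were not judged)."""
    return {
        c.key: sum(1 for j in judgments if not j.verdicts.get(c.key, False))
        for c in codebook
    }


# Hypothetical excerpt of the Debater 1 criteria.
codebook = [
    Criterion("no_corrections", "Does NOT attempt to correct errors"),
    Criterion("no_invented_errors", "Does NOT output non-existent errors"),
]
judgments = [
    Judgment(0, {"no_corrections": True, "no_invented_errors": True}),
    Judgment(1, {"no_corrections": True, "no_invented_errors": False}),
]
print(failure_counts(codebook, judgments))
```

Tallying failures per criterion makes it easier to see which line of the codebook, or which part of a prompt, needs revision in the next iteration.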

The starting criteria, validated by multiple researchers for each model, are defined below:

1. **Debater 1, given an explanation and evidence:**
- Does distinguish between the following types of errors in the explanation:
  1) misrepresentations of factual details in the evidence
  2) introduces or misrepresents events not present in the evidence
  3) introduces or misrepresents noun phrases in the evidence, changing semantic meaning
  4) logic/reasoning inconsistencies in the explanation
  5) includes information not relevant to the claim/evidence or beyond what the evidence supports
- Does NOT attempt to correct errors in the explanation to align with evidence.
- Identifies ALL such errors
- Does NOT output non-existent errors.
- Answers tersely

*Note: it is NOT a requirement that the model name the error types.*

2. **Debater 2, given an explanation and evidence:**
- Identifies general weaknesses which reduce the faithfulness of an explanation to the claim/evidence.
- Generates explanations for why identified issues are unfaithful:
  1) Factually inaccurate
  2) logically inaccurate
  3) irrelevant
  4) incomplete
  5) incoherent
- Does NOT attempt to correct errors in the explanation to align with evidence.
- Does provide actionable feedback
- Identifies ALL such errors
- Does NOT output non-existent errors.
- Answers tersely

3. **Cross-Agent, given the original explanation, and a primary and secondary set of feedback:**
- uses the original explanation to:
  1) recognize invalid feedback in the primary feedback, and incorporate corrections from the secondary feedback.
  2) does NOT include invalid feedback from the secondary feedback.
- outputs ONLY the modified primary feedback
- makes MINIMAL changes, rather than copying the secondary feedback

*Note: the input to this agent is up for debate. I am assuming that including only the explanation, rather than the full claim and Q/A pairs, balances the agent's ability to refine the feedback against a source of ground truth while keeping the prompt context small.*

4. **Judge, given two sets of feedback:**
- identifies major discrepancies between the input feedback:
  1) identifies different errors or issues
  2) provides inconsistent error descriptions
  3) whether any suggested fixes are not equivalent
- if discrepancies exist, output contains the word 'FALSE'
- else, output contains the word 'TRUE'

5. **Refinement Agent, given two sets of feedback and an explanation:**
- rewrites the explanation to align with both sets of feedback
- Does NOT output anything other than the refined explanation
- Does NOT remove the component containing the verdict in the original explanation, but can 'flip' it, i.e.: true -> false.
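
The Judge criterion above reduces to checking which of the two words appears in the output. The notebook imports `parse_ternary`, but its behavior is not shown here, so the function below is a hypothetical stand-in rather than that parser. It assumes matching should be case-insensitive and that an output containing both words, or neither, should be flagged for manual review.

```python
import re


def parse_judge_verdict(output):
    """Return True if the feedback sets agree, False if they differ, None if unclear."""
    # Whole-word match, so "UNTRUE" does not count as "TRUE".
    words = set(re.findall(r"\b(TRUE|FALSE)\b", output.upper()))
    if words == {"TRUE"}:
        return True
    if words == {"FALSE"}:
        return False
    return None  # neither word, or both: flag for manual review


print(parse_judge_verdict("FALSE. The two sets name different errors."))
print(parse_judge_verdict("I cannot decide."))
```

Returning `None` instead of guessing keeps ambiguous judge outputs visible to the reviewer.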

## Initial Custom Prompts

Because the original prompts are already known not to meet these requirements, we will start by creating our initial custom prompts.

To do this, we simply feed the codebook itself into an LLM to generate the prompt. I will use this boilerplate:

> You are tasked with taking a set of requirements and phrasing it as a concise prompt for an LLM. In this initial generation, the codebook serves as the primary material for construction, while a previous prompt is provided purely to fill in any gaps left by the codebook. Extra output besides the prompt is unnecessary and unwarranted.


In [21]:
madr_init_fb1_system_prompt = """Given the following claim, explanation, and QA-pair evidence, identify all errors in the explanation using this error typology:

1. misrepresented factual details;
2. introduced or misrepresented events;
3. introduced or misrepresented noun phrases;
4. reasoning/logic inconsistencies;
5. irrelevant or unsupported information.

Do not attempt to correct the explanation. Do not invent errors.
Output a terse list of all real errors, each labeled with its category."""

madr_init_fb2_system_prompt = """Given the following claim, explanation, and QA-pair evidence, identify all weaknesses in the explanation that reduce the faithfulness to the claim/evidence.

For each problematic sentence in the explanation, quote it, explain why it is unfaithful (factually inaccurate, logically inaccurate, irrelevant, incomplete, or incoherent), and provide actionable feedback without correcting or rewriting the explanation.

Do not invent issues. Keep responses terse."""

madr_cross_fb_system_prompt = """Given the following explanation plus primary and secondary feedback, revise the primary feedback using the explanation as ground truth.

Remove any invalid points in the primary feedback, add only valid corrections supported by the secondary feedback, and ignore invalid secondary feedback.

Make minimal edits; do not copy secondary feedback verbatim. Output only the corrected primary feedback."""

madr_cross_fb_user_prompt = """Primary feedback:
{}

Secondary feedback:
{}"""

madr_judge_system_prompt = """Given the following sets of feedback, compare them for major discrepancies: whether they identify different errors, give inconsistent error descriptions, or propose nonequivalent fixes.

If any discrepancy exists, output FALSE; otherwise output TRUE. Output only this word."""


madr_revise_system_prompt = """Given the following explanation and two sets of feedback, rewrite the explanation so it aligns with both feedback sets.

Keep the verdict component but allow flipping its value. Output only the refined explanation and nothing else."""

In [19]:
def replace_prompt(key, system_prompt_str):
    client._prompts[key] = (
        client._prompts[key][0],
        system_prompt_str
    )

In [22]:
replace_prompt("madr_init_fb1", madr_init_fb1_system_prompt)

In [23]:
client._prompts["madr_init_fb1"]

('Claim: {}\n\nQuestion-Answer Pairs:\n{}\n\nExplanation: {}\n\nFeedback:',
 'Given the following claim, explanation, and QA-pair evidence, identify all errors in the explanation using this error typology:\n\n1. misrepresented factual details;\n2. introduced or misrepresented events;\n3. introduced or misrepresented noun phrases;\n4. reasoning/logic inconsistencies;\n5. irrelevant or unsupported information.\n\nDo not attempt to correct the explanation. Do not invent errors.\nOutput a terse list of all real errors, each labeled with its category.')

In [25]:
replace_prompt("madr_init_fb2", madr_init_fb2_system_prompt)
replace_prompt("madr_judge", madr_judge_system_prompt)
replace_prompt("madr_revise", madr_revise_system_prompt)

In [26]:
client._prompts["madr_cross_fb"] = (
    madr_cross_fb_user_prompt,
    madr_cross_fb_system_prompt
)
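
Each entry in `client._prompts` is a `(user_template, system_prompt)` tuple, so replacing the user template also changes how many arguments a call has to supply. The sketch below assumes the client fills templates with `str.format`, which this notebook does not confirm. `count_slots` and `check_args` are hypothetical helpers. They guard against one quirk of `str.format`: surplus positional arguments are silently ignored.

```python
from string import Formatter


def count_slots(template):
    """Count replacement fields such as `{}` in a str.format-style template."""
    return sum(
        1 for _literal, field_name, _spec, _conv in Formatter().parse(template)
        if field_name is not None
    )


def check_args(template, args):
    """Format the template, raising if the argument count does not match the slots."""
    expected = count_slots(template)
    if len(args) != expected:
        raise ValueError(f"template has {expected} slots, got {len(args)} arguments")
    return template.format(*args)


# Same text as the cross-agent user prompt defined earlier.
cross_fb_user = "Primary feedback:\n{}\n\nSecondary feedback:\n{}"
print(count_slots(cross_fb_user))
print(check_args(cross_fb_user, ["primary text", "secondary text"]))
```

The cross-agent user template has two slots (primary and secondary feedback), so a caller that also passes the explanation would need a third slot for it to reach the model.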

## Refining Base MADR Prompts

I have a set of 100 claims run through the baseline pipeline. When evaluating the MADR pipeline *I will assume that the answers generated by the CoRAG answering agent are always factual.*

In [7]:
corag_run[:3]

[{'claim_id': 0,
  'claim': 'In 2008, Jeff Ament released a solo record.',
  'true_label': 'SUPPORTS',
  'predicted_label': 'SUPPORTS',
  'qa_pairs': [["When was Jeff Ament's solo record released?",
    'September 16, 2008.']],
  'verdict_int': 1,
  'verdict_raw': 'True.  \nInconclusive'},
 {'claim_id': 1,
  'claim': 'Franklin D. Roosevelt had a family that originated in New York.',
  'true_label': 'SUPPORTS',
  'predicted_label': 'SUPPORTS',
  'qa_pairs': [["What is the origin of Franklin D. Roosevelt's family?",
    "Franklin D. Roosevelt's family originated from Dutch settlers in New York, with his ancestors tracing back to the 17th century. His direct lineage includes prominent figures such as his father, James Roosevelt, and his mother, Eleanor Roosevelt, who were part of influential American families."]],
  'verdict_int': 1,
  'verdict_raw': 'True. The question-answer pairs fully confirm the claim that Franklin D. Roosevelt had a family that originated in New York.'},
 {'claim_id

### Debater 1

This marks our first iteration of refining the codebook. To evaluate the debater, we need a reasonable number of responses. Here I will generate 10: one for each of 5 random claims per true label (SUPPORTS/REFUTES):

In [8]:
def select_random_sample(data, count=5):
    """Pick `count` random claims for each true label (SUPPORTS, REFUTES)."""
    supports = [i for i in data if i["true_label"] == "SUPPORTS"]
    refutes = [i for i in data if i["true_label"] == "REFUTES"]

    selection_supports = random.sample(supports, count)
    selection_refutes = random.sample(refutes, count)

    return selection_supports + selection_refutes


sample = select_random_sample(corag_run)
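
Because `random.seed(42)` sets global state, re-running only the sampling cell draws a different sample each time. The variant below is a sketch, not the function used in this notebook: `stratified_sample` is a hypothetical name, and it takes the seed as a parameter and uses a local `random.Random` instance so the same call always returns the same claims. It also reports which label has too few rows instead of relying on the generic `ValueError` from `random.sample`.

```python
import random


def stratified_sample(data, count=5, seed=42, labels=("SUPPORTS", "REFUTES")):
    """Draw `count` rows per label, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, independent of the global seed
    picked = []
    for label in labels:
        pool = [row for row in data if row["true_label"] == label]
        if len(pool) < count:
            raise ValueError(f"only {len(pool)} rows labeled {label!r}, need {count}")
        picked.extend(rng.sample(pool, count))
    return picked


# Toy data standing in for corag_run.
toy = [{"claim_id": i, "true_label": "SUPPORTS" if i % 2 else "REFUTES"}
       for i in range(12)]
print([row["claim_id"] for row in stratified_sample(toy, count=2)])
```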

Now I'll generate and cache the debater's responses:

In [27]:
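for c in sample:
    claim = c["claim"]
    qa_pairs = c["qa_pairs"]
    verdict_raw = c["verdict_raw"]
    # The CoRAG verdict text stands in for the explanation under review.
    fb = client.send_prompt("madr_init_fb1", [claim, qa_pairs, verdict_raw])
    # Store under "fb1", the key that pprint_results reads.
    c["fb1"] = fb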
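
The loop above keeps the responses only in memory, so they are lost when the kernel restarts. The sketch below shows one possible way to persist them to a JSON file keyed by `claim_id`. `cached_feedback` and the cache file name in the usage note are hypothetical, and `generate` is any callable that maps a claim row to a response string.

```python
import json
from pathlib import Path


def cached_feedback(sample, generate, cache_path, key="fb1"):
    """Fill row[key] for each row, reusing responses already saved in cache_path."""
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for row in sample:
        cid = str(row["claim_id"])  # JSON object keys are always strings
        if cid not in cache:
            cache[cid] = generate(row)  # only call the model on a cache miss
        row[key] = cache[cid]
    path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return sample
```

With the client from this notebook, a call could look like `cached_feedback(sample, lambda c: client.send_prompt("madr_init_fb1", [c["claim"], c["qa_pairs"], c["verdict_raw"]]), "fb1_cache.json")`. The cache should be deleted whenever the prompt changes, since stale responses would otherwise be reused.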

In [28]:
sample

[{'claim_id': 40,
  'claim': 'Jun Ji-hyun is in the South Korean film called Windstruck.',
  'true_label': 'SUPPORTS',
  'predicted_label': 'SUPPORTS',
  'qa_pairs': [['Is Jun Ji-hyun featured in the South Korean film "Windstruck"?',
    'Answer: Yes.']],
  'verdict_int': 1,
  'verdict_raw': 'True.',
  'fb1': '\n\n[]'},
 {'claim_id': 7,
  'claim': 'A subtype of anti-nuclear antibodies are anti-Ro antibodies.',
  'true_label': 'SUPPORTS',
  'predicted_label': 'SUPPORTS',
  'qa_pairs': [['Is the statement accurate that a subtype of anti-nuclear antibodies specifically includes anti-Ro antibodies?',
    'Answer: Yes.']],
  'verdict_int': 1,
  'verdict_raw': 'True.  \nInconclusive',
  'fb1': '\n\n- **Introduced or misrepresented noun phrases**: "Inconclusive" misrepresents the explanation\'s adequacy.'},
 {'claim_id': 1,
  'claim': 'Franklin D. Roosevelt had a family that originated in New York.',
  'true_label': 'SUPPORTS',
  'predicted_label': 'SUPPORTS',
  'qa_pairs': [["What is the ori

We now need a helper that pretty-prints each claim, so the feedback is easy to judge:

In [41]:
def pprint_results(sample):
    for i, s in enumerate(sample):
        claim = s["claim"]
        qa_pairs = s["qa_pairs"]
        explanation = s["verdict_raw"]
        madr_agent = s["fb1"].strip()
        print(f"{i+1}. {claim}\n")
        print("Available evidence for this claim, in the form of Q/A pairs:")
        for j, e in enumerate(qa_pairs):
            print(f"{j+1}.\tQuestion: {e[0]}\n\t{e[1]}")
        print()
        print(f"The original verdict: {explanation}\n")
        print(f"This agent's feedback:\n{madr_agent}")
        print("\n\n")

pprint_results(sample)

1. Jun Ji-hyun is in the South Korean film called Windstruck.

Available evidence for this claim, in the form of Q/A pairs:
1.	Question: Is Jun Ji-hyun featured in the South Korean film "Windstruck"?
	Answer: Yes.

The original verdict: True.

This agent's feedback:
[]



2. A subtype of anti-nuclear antibodies are anti-Ro antibodies.

Available evidence for this claim, in the form of Q/A pairs:
1.	Question: Is the statement accurate that a subtype of anti-nuclear antibodies specifically includes anti-Ro antibodies?
	Answer: Yes.

The original verdict: True.  
Inconclusive

This agent's feedback:
- **Introduced or misrepresented noun phrases**: "Inconclusive" misrepresents the explanation's adequacy.



3. Franklin D. Roosevelt had a family that originated in New York.

Available evidence for this claim, in the form of Q/A pairs:
1.	Question: What is the origin of Franklin D. Roosevelt's family?
	Franklin D. Roosevelt's family originated from Dutch settlers in New York, with his ancest

As a reviewer, I note:
1. The debater's responses to claims 5 and 6 point out a noun-phrase inconsistency in the Q/A pairs, not in the explanation. The codebook should be modified to ignore errors in the Q/A pairs.
2. The debater fails to identify cases where the explanation introduces evidence.
3. The debater fails in cases where the CoRAG pipeline also clearly failed (what to do about this?).