# Evaluating Entailment

How do we know whether one text response reflects the opinions expressed in another? This is a tricky question with many possible approaches. This notebook runs through a series of approaches and models as a meta-analysis of which one works best.

It will depend on the `habermas_machine_questions_with_responses.csv` file, which contains the questions and responses generated by the LLMs, as well as the human responses.

In [None]:
import pandas as pd
import numpy as np
import os
import json
from tqdm import tqdm
import ast
from dotenv import load_dotenv
load_dotenv()

DATA_PATH = os.getenv('DATA_PATH')
TEMP_PATH = os.getenv('TEMP_PATH')

from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

In [None]:
df_questions = pd.read_csv(DATA_PATH+'questions_and_human_perspectives_with_responses.csv')
df_questions['perspectives'] = df_questions['perspectives'].apply(ast.literal_eval)
df_questions.head()

In [None]:
# Let's take a look at a random question to test the different methods
example = df_questions.sample(1).iloc[0]
print("Question: ", example['question'], "\n", "Response: \n", example['gpt-3.5-turbo'], "\n\n", "Opinion: \n", example['perspectives'][0])

In [None]:
# This is a constructed example to test the different methods.
strawman_question = "Should we ban or limit access to guns?"
strawman_response_easy = "No, we should not ban guns. Guns are a necessary tool for self-defense and hunting. We should also make sure that we have a strong police force to protect people and their property."
strawman_response_hard = "No, we should not ban guns completely, but we should make it harder for people to get them with strong laws limiting who can buy them."
strawman_response_mixed = "While banning guns would result in fewer gun related crimes and accidents, it would also make it harder for people to defend themselves and their property."
strawman_opinion = "We should ban guns to reduce crime."
strawman_opinion_2 = "Guns are a fundamental right!!"

## Finding Representation

These functions determine whether the LLM responses represent the human opinions (this is representation, not entailment).

#### First, a dead simple prompt

In [None]:
# Now we want to see if the opinions are all represented in the generated responses.
def is_my_opinion_represented(question, response, opinion, model='gpt-4o-mini'):
    """
    Determine if the following opinion is represented in the response to a question. This is the simplest approach I can think of.
    """
    prompt = f"""Task: Determine if the following opinion is represented in the response to a question.

Question: {question}
Response: {response}
Opinion to check for: {opinion}

Instructions:
- Answer ONLY with 'yes' or 'no'
- Answer 'yes' if the opinion is clearly represented
- Answer 'no' if the opinion is absent
- Do not explain your reasoning
- Do not add any other text

Answer:"""

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a precise evaluator that only responds with 'yes' or 'no'."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # Use 0 for maximum consistency
        max_tokens=1, # We only need one token for 'yes' or 'no'
    )
    
    return completion.choices[0].message.content.strip().lower()
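
The raw answer can vary in casing or carry stray punctuation, so it may help to normalize it before comparing it to `'yes'`. The helper below is a minimal sketch under that assumption; `parse_yes_no` is a hypothetical name and not part of the notebook.

```python
def parse_yes_no(answer):
    """Map a raw model answer to True/False, or None if it is neither.

    Hypothetical helper: tolerates casing, whitespace, and surrounding punctuation.
    """
    if answer is None:
        return None
    cleaned = answer.strip().lower().strip(".,!\"'")
    if cleaned == "yes":
        return True
    if cleaned == "no":
        return False
    return None


print(parse_yes_no(" Yes. "))
print(parse_yes_no("maybe"))
```

Returning `None` for anything unexpected keeps malformed answers visible instead of silently counting them as "no".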

In [None]:
is_my_opinion_represented(example['question'], example['gpt-4o-mini'], example['perspectives'][0])

#### Next, let's use a structured CoT to evaluate the representation

In [None]:
from pydantic import BaseModel

class OpinionPoint(BaseModel):
    claim: str
    explicit_matches: list[str]
    implicit_matches: list[str]
    contradictions: list[str]
    coverage_score: int  # 0-10 for this specific point

class EvaluationStep(BaseModel):
    step_number: int
    analysis: str
    findings: list[str]

class RepresentationAnalysis(BaseModel):
    opinion_points: list[OpinionPoint]
    evaluation_steps: list[EvaluationStep]
    final_score: int  # 0-10 overall score
    reasoning: str  # Brief explanation of final score

def is_my_opinion_represented_structured_cot(question, response, opinion, model='gpt-4o-mini'):
    """
    Determine if the opinion is represented in the response to a question, using structured CoT generation.
    """
    system_prompt = f"""Task: Evaluate how well an opinion is represented in a response through careful step-by-step analysis.

Follow these specific steps in your evaluation:
1. First, break down the core claims/points in the opinion
2. For each point in the opinion:
   - Search for explicit mentions in the response
   - Look for implicit/paraphrased representations
   - Note any contradictions
3. Consider the overall alignment:
   - How many points are covered?
   - How directly are they addressed?
   - Are there any misalignments?
4. Score the representation from 0-10 where:
   - 0: Complete contradiction or no representation
   - 1-3: Minimal/weak representation of few points
   - 4-6: Partial representation of main points
   - 7-9: Strong representation of most points
   - 10: Complete and explicit representation of all points
"""
    
    prompt = f"""Question: {question}
Response: {response}
Opinion to check for: {opinion}

Analyze step-by-step following the instructions, then provide your structured evaluation."""

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # Use 0 for maximum consistency
        response_format={
            'type': 'json_schema',
            'json_schema': 
                {
                "name": "RepresentationChain", 
                "schema": RepresentationAnalysis.model_json_schema()
                }
            } 
    )
    
    result_object = json.loads(completion.choices[0].message.content)
    return result_object

def process_representation_result(result_object):
    try:
        return result_object['final_score']
    except Exception as e:
        print(e)
        return None
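
The 0-10 rubric in the system prompt can be mirrored in code when a categorical label is easier to work with than a raw score. This is a sketch only: `score_band` and the band names are hypothetical labels chosen here, while the cut-offs follow the rubric above.

```python
def score_band(score):
    """Return a rubric band for a 0-10 representation score (hypothetical helper)."""
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError(f"score must be an integer from 0 to 10, got {score!r}")
    if score == 0:
        return "none"      # complete contradiction or no representation
    if score <= 3:
        return "weak"      # minimal representation of few points
    if score <= 6:
        return "partial"   # partial representation of main points
    if score <= 9:
        return "strong"    # strong representation of most points
    return "complete"      # complete and explicit representation


print(score_band(5))
```

Raising on out-of-range values is a design choice that surfaces model outputs which ignore the rubric.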


Let's inspect how well the LLMs are doing at finding the opinion in the response.

In [None]:
is_my_opinion_represented_structured_cot(example['question'], example['gpt-4o-mini'], example['perspectives'][0])

## Entailment

Entailment is a lot trickier than representation. Our approach is to have the model quote the exact text in the response that matches the opinion, and then verify that the quoted text actually appears in the response.
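
The verification half of this approach needs no model call. The sketch below assumes the quoted text differs from the response at most in casing; `find_quote_span` is a hypothetical name used only for illustration.

```python
def find_quote_span(response, quote):
    """Return (start, end) of the first case-insensitive occurrence of quote, or None."""
    if not quote:
        return None
    start = response.lower().find(quote.lower())
    if start == -1:
        return None
    return (start, start + len(quote))


response = "No, we should not ban guns completely."
span = find_quote_span(response, "Should not ban guns")
print(span, response[span[0]:span[1]])
```

Lowercasing preserves string length for ASCII text, which is what makes the returned offsets valid for the original response; some non-ASCII characters change length when lowercased and would need extra care.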

#### Helper functions for visualizing the results

In [None]:
# from IPython.display import Markdown, display
# import matplotlib.pyplot as plt
# import matplotlib as mpl

# # Generate colors using matplotlib's color map
# n_colors = len(spans)
# color_map = plt.cm.rainbow  # You can also try: viridis, plasma, magma, etc.
# colors = [mpl.colors.rgb2hex(color_map(i / n_colors)) for i in range(n_colors)]


# # Generate a style sheet for the spans
# style_sheet = "<style>\n"
# for perspective_id, _ in enumerate(spans):
#     style_sheet += f""".highlight-{perspective_id} {{ position: relative; }}
#     .highlight-{perspective_id}::after {{
#         content: "";
#         position: absolute;
#         left: 0; right: 0;
#         bottom: -{2*perspective_id}px; /* offset slightly below baseline */
#         border-bottom: 2px solid {colors[perspective_id]};
#     }}
#     """
# style_sheet += "</style>"

# # Convert dict to list of (start, end, color) and sort by start position
# ordered_span = [(start, end, perspective_id) 
#                  for perspective_id, span_list in enumerate(spans)
#                  for start, end in span_list]
# ordered_span.sort(key=lambda x: x[0])

In [None]:
# result = []
# last_idx = 0

# for span_idx, (start, end, perspective_id) in enumerate(ordered_span):
#     # Add text before the span
#     result.append(text[last_idx:start])
#     # Add the highlighted span
#     result.append(f"<span class='highlight-{perspective_id}'>")
#     if span_idx+1 < len(ordered_span) and ordered_span[span_idx+1][0] < end:
#         result.append(text[start:end])
#     else:
#         result.append(text[start:end])
#     result.append("</span>")
#     last_idx = end


In [None]:
def highlight_spans(text: str, spans: list[list[tuple[int, int]]]):
    """
    Highlight the spans in the text, using one color per group of spans.
    Args:
        text: The text to highlight
        spans: List of span groups; each group is a list of (start, end) tuples
    Example:
        spans = [
            [(0, 4), (10, 16)],
            [(28, 37)]
        ]
    """
    from IPython.display import Markdown, display
    import matplotlib.pyplot as plt
    import matplotlib as mpl

    # Generate colors using matplotlib's color map
    n_colors = len(spans)
    color_map = plt.cm.rainbow  # You can also try: viridis, plasma, magma, etc.
    colors = [mpl.colors.rgb2hex(color_map(i / n_colors)) for i in range(n_colors)]

    # Create the spans dictionary using these colors
    spans_dict = {color: spans[i] for i, color in enumerate(colors)}
    # highlight_spans(text, spans_dict)

    # Generate a style sheet for the spans
    style_sheet = "<style>"
    for span_id, span in enumerate(spans):
        style_sheet += f""".highlight-{span_id} {{ background-color: {colors[span_id]}; }}
        .highlight-{span_id}::after {{
            content: "";
            position: absolute;
            left: 0; right: 0;
            bottom: -{2*span_id}px; /* offset slightly below baseline */
            border-bottom: 2px solid {colors[span_id]};
        }}
        """
    style_sheet += "</style>"
    
    # Convert dict to list of (start, end, color) and sort by start position
    all_spans = [(start, end, color) 
                 for color, spans in spans_dict.items() 
                 for start, end in spans]
    all_spans.sort(key=lambda x: x[0])
    
    # Build the marked up text piece by piece
    result = []
    last_idx = 0
    
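    for start, end, color in all_spans:
        # Clamp overlapping spans so that no text is emitted twice
        start = max(start, last_idx)
        if start >= end:
            continue
        # Add text before the span
        result.append(text[last_idx:start])
        # Add the highlighted span
        result.append(f"<span style='color: {color};'>{text[start:end]}</span>")
        last_idx = end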
    
    # Add any remaining text after the last span
    result.append(text[last_idx:])
    
    # Add color swatches at the end
    for color in spans_dict:
        swatch = f"<span style='color: {color};'>{chr(9608) * 6}</span>"
        result.append(f" {swatch}")
    
    # Join all pieces and display
    marked_text = ''.join(result)
    display(Markdown(marked_text))
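
Spans within one group can overlap when the model quotes nested or repeated passages. One option is to merge them before highlighting. This is a minimal sketch; `merge_spans` is a hypothetical helper that the function above does not call.

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) spans into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of starting a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_spans([(10, 16), (0, 4), (3, 8)]))  # [(0, 8), (10, 16)]
```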

#### Structured CoT (OpenAI) for entailment

In [None]:
# We use the writefile magic to save the entailment code to a file for use downstream and for version control.
# %%writefile src/entailment.py

from pydantic import BaseModel

import openai, os, json
client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

class EntailmentMatch(BaseModel):
    text: str
    match_type: str  # "direct", "paraphrase", or "contextual"
    confidence: int  # 0-10 score
    explanation: str  # Why this is a match

class EntailmentStep(BaseModel):
    step_number: int
    concept: str  # The concept from the opinion being analyzed
    analysis: str  # The reasoning process
    matches: list[EntailmentMatch]

class EntailmentAnalysis(BaseModel):
    steps: list[EntailmentStep]
    final_matches: list[str]  # The best, most confident matches
    coverage_score: int  # 0-10 how well the opinion is covered

def entailment_from_gpt_json(question: str, response: str, opinion: str, model='gpt-4o-mini'):
    """
    Find the exact text in the response that matches the opinion, using an OpenAI chat model (default: gpt-4o-mini).
    """
    system_prompt = f"""Task: Precise Text Entailment Analysis. Find and evaluate text in the Response that represents concepts from the Opinion.

Follow these specific steps:
1. Break down the Opinion into key concepts
2. For each concept:
   - Search for direct text matches, this includes single words like "yes" or "no"
   - Identify paraphrased representations
   - Look for contextual/implicit matches
   - Copy the **exact text** in the Response that matches the concept in the Opinion. Copy the text from the response, not the opinion.

3. Evaluate matches by:
   - Precision: How exactly does it match?
   - Context: Is the meaning preserved?
   - Completeness: Is the full concept captured?

4. Score coverage from 0-10 where:
   - 0: No valid matches found
   - 1-3: Few weak/partial matches
   - 4-6: Some good matches but incomplete
   - 7-9: Strong matches for most concepts
   - 10: Complete, precise matches for all concepts

Important:
- Prioritize precision over quantity
- Consider context to avoid false matches
- Explain reasoning for each match
- Always copy the exact text from the Response that matches the concept
"""

    prompt = f"""Context question: {question}
Opinion: {opinion}
Response: {response}

Analyze step-by-step following the instructions to find and evaluate all relevant matches."""
    
    chat_response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # Use 0 for maximum consistency
        response_format={
            'type': 'json_schema',
            'json_schema': 
                {
                "name": "EntailmentAnalysis", 
                "schema": EntailmentAnalysis.model_json_schema()
                }
            } 
    )

    result_object = json.loads(chat_response.choices[0].message.content)
    return result_object

def process_entailment_result(result_object, response):
    matches = []
    for match in result_object['final_matches']:
        start_index = response.lower().find(match.lower())
        if start_index == -1:
            print("Warning: match was not found in response text.")
            continue
        end_index = start_index + len(match)
        matches.append((start_index, end_index))
    return matches
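
Once matches are converted to character spans, one possible summary is the share of the response they cover. This metric is an illustration and not something the notebook computes; `span_coverage` is a hypothetical name.

```python
def span_coverage(text_length, spans):
    """Fraction of characters covered by at least one (start, end) span."""
    if text_length <= 0:
        return 0.0
    covered = set()
    for start, end in spans:
        # Clip each span to the text and count every character once
        covered.update(range(max(start, 0), min(end, text_length)))
    return len(covered) / text_length


print(span_coverage(10, [(0, 4), (2, 6)]))  # 0.6
```

Counting covered positions in a set means overlapping matches are not double counted.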


In [None]:
entailment_results = entailment_from_gpt_json(example['question'], example['gpt-3.5-turbo'], example['perspectives'][0])
entailment_results

In [None]:
entailment_matches = process_entailment_result(entailment_results, example['gpt-3.5-turbo'])
print(entailment_matches)

In [None]:
# strawman_entailment_results = entailment_from_gpt_json(strawman_question, strawman_response_mixed, strawman_opinion)
# strawman_entailment_matches = process_entailment_result(strawman_entailment_results, strawman_response_mixed)
# print(strawman_entailment_matches)

strawman_entailment_results_2 = entailment_from_gpt_json(strawman_question, strawman_response_mixed, strawman_opinion_2)
strawman_entailment_matches_2 = process_entailment_result(strawman_entailment_results_2, strawman_response_mixed)
print(strawman_entailment_matches_2)

In [None]:
highlight_spans(example['gpt-3.5-turbo'], [entailment_matches])


In [None]:
# Now we're going to run this across all the opinions and highlight each match in the response.

opinion_entailments = []
for opinion_idx, opinion in enumerate(tqdm(example['perspectives'])):
    opinion_entailments.append({})
    opinion_entailments[opinion_idx]['full_result'] = entailment_from_gpt_json(example['question'], example['gpt-4o-mini'], opinion)
    opinion_entailments[opinion_idx]['matches'] = process_entailment_result(opinion_entailments[opinion_idx]['full_result'], example['gpt-4o-mini'])

In [None]:
# Highlight the matches for the example
import matplotlib.pyplot as plt
import matplotlib as mpl

# Generate colors using matplotlib's color map
n_colors = len(opinion_entailments)
color_map = plt.cm.rainbow  # You can also try: viridis, plasma, magma, etc.
colors = [mpl.colors.rgb2hex(color_map(i / n_colors)) for i in range(n_colors)]

# Create the spans dictionary using these colors
spans_dict = {color: opinion_entailments[i]['matches'] for i, color in enumerate(colors)}
highlight_spans(example['gpt-4o-mini'], list(spans_dict.values()))

## Using Meta AI's SONAR

In [None]:
!pip install sonar-space
!pip install fairseq2

In [None]:
# from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
# t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
#                                            tokenizer="text_sonar_basic_encoder")
# sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
# embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")
# print(embeddings.shape)
# # torch.Size([2, 1024])

## Using real NLI methods

In [None]:
# !pip install transformers torch
# !pip install sentence-splitter
# !pip install sentence-transformers
# !pip install tiktoken
# !pip install sentencepiece


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

print(strawman_response_mixed)
print(strawman_opinion_2)
inputs = tokenizer(strawman_response_mixed, strawman_opinion_2, truncation=True, return_tensors="pt")
output = model(inputs["input_ids"].to(device))  # model and inputs must be on the same device
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)
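
The step from logits to labeled probabilities can be reproduced in plain Python to check the label ordering without loading a model. The logits below are made-up values, not real model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_names = ["entailment", "neutral", "contradiction"]
example_logits = [2.0, 0.5, -1.0]  # made-up values for illustration only
probs = softmax(example_logits)
best_label = label_names[probs.index(max(probs))]
print(best_label)  # entailment
```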

In [None]:
from sentence_splitter import SentenceSplitter, split_text_into_sentences

text = "This is a paragraph. It contains several sentences. \"But why,\" you ask?"
sentences = split_text_into_sentences(text=text, language='en')
print(sentences)
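
One way sentence splitting could feed into NLI is to score every sentence against the opinion and keep the best one. The sketch below assumes that design. `best_sentence` is a hypothetical helper, and `toy_overlap_score` is a word-overlap stand-in, not a real NLI model; in practice an entailment probability would take its place.

```python
def best_sentence(sentences, hypothesis, score_fn):
    """Return (sentence, score) with the highest score_fn(sentence, hypothesis)."""
    if not sentences:
        return None, 0.0
    scored = [(s, score_fn(s, hypothesis)) for s in sentences]
    return max(scored, key=lambda pair: pair[1])


def toy_overlap_score(premise, hypothesis):
    """Stand-in scorer based on word overlap, NOT a real NLI model."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    if not hypothesis_words:
        return 0.0
    return len(premise_words & hypothesis_words) / len(hypothesis_words)


toy_sentences = ["Guns are dangerous.", "We should ban guns to be safe."]
top_sentence, top_score = best_sentence(toy_sentences, "We should ban guns", toy_overlap_score)
print(top_sentence, top_score)
```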

In [None]:
#  We could also validate with facebook/xnli

In [None]:
import torch
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge-mnli")
# Use the first GPU when available, otherwise fall back to CPU
NLI_model = transformers.pipeline("text-classification", model="microsoft/deberta-v2-xlarge-mnli", tokenizer=tokenizer, top_k=None, device=0 if torch.cuda.is_available() else -1)

In [None]:
strawman_response_mixed

In [None]:
{"text": example['gpt-3.5-turbo'], "text_pair": example['perspectives'][0]}

## Cross-comparing the different methods

In [None]:
# I apologize for how chaotic and unreadable this cell is, but we want to run all the same functions and save all intermediate results for debugging.

entailment_models = ['gpt-4o-mini']
response_models = ['gpt-3.5-turbo']

overton_results = []
sample_size = 3
for _, row in tqdm(df_questions.sample(sample_size).iterrows(), total=sample_size, desc="Questions", leave=True):
    # Column names match the ones used when df_questions was loaded above
    question = row['question']
    opinions = row['perspectives']
    question_id = row['question_id']
    with tqdm(total=len(opinions), desc="Opinions", leave=False) as opinion_bar:
        for opinion_idx, opinion in enumerate(opinions):
            with tqdm(total=len(response_models), desc="Response Models", leave=False) as response_bar:
                for response_model in response_models:
                    with tqdm(total=len(entailment_models), desc="Entailment Models", leave=False) as entailment_bar:
                        for entailment_model in entailment_models:
                            response = row[response_model]
                            # This is the simple prompt check (which should be the same for all models)
                            result_opinion_represented = is_my_opinion_represented(question, response, opinion, model=entailment_model)
                            # This is the structured cot check, which should be the same for all models
                            result_opinion_represented_structured_cot = is_my_opinion_represented_structured_cot(question, response, opinion, model=entailment_model)
                            result_opinion_represented_structured_cot_score = process_representation_result(result_opinion_represented_structured_cot)


                            # Now we turn to entailment, which we're going to save in a bunch of different formats for debugging
                            entailment_result = entailment_from_gpt_json(question, response, opinion, model=entailment_model)
                            entailment_matches = process_entailment_result(entailment_result, response)

                            overton_results.append({
                                'question_id': question_id,
                                'opinion_idx': opinion_idx,
                                'response_model': response_model,
                                'entailment_model': entailment_model,
                                'is_represented_simple_prompt': result_opinion_represented == 'yes',
                                'is_represented_structured_cot': result_opinion_represented_structured_cot,
                                'is_represented_structured_cot_score': result_opinion_represented_structured_cot_score,
                                'entailment_result': entailment_result,
                                'entailment_matches': entailment_matches
                            })
                            entailment_bar.update(1)
                    response_bar.update(1)
            opinion_bar.update(1)

df_overton_results = pd.DataFrame(overton_results)

In [None]:
df_overton_results.head()

In [None]:
df_overton_results.to_csv('data/entailment_ablation_results.csv', index=False)
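
Saving to CSV stores list-valued columns such as `entailment_matches` as their string representation, so they come back as plain strings. The sketch below shows the round trip with the standard-library `csv` module on a made-up row; pandas writes object columns the same way, and `ast.literal_eval` restores the original structure.

```python
import ast
import csv
import io

rows = [{"question_id": 1, "entailment_matches": [(0, 4), (10, 16)]}]  # made-up row

# Write to an in-memory CSV; the list is stored as its string repr
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question_id", "entailment_matches"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every field is now a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["entailment_matches"]
restored = ast.literal_eval(raw)

print(type(raw).__name__)       # str
print(type(restored).__name__)  # list
```

Without this conversion, `len()` on the column counts characters in the string instead of the number of matches.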

# Analysis

In [None]:
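# List-valued columns are stored as strings in the CSV, so parse them back into Python objects
df_overton_results = pd.read_csv('data/entailment_ablation_results.csv', converters={'entailment_matches': ast.literal_eval})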

In [None]:
df_overton_results.groupby(['response_model', 'question_id'])['is_represented_simple_prompt'].mean().reset_index().groupby('response_model')['is_represented_simple_prompt'].mean()

In [None]:
df_overton_results['is_represented_structured_cot_score.bool'] = df_overton_results['is_represented_structured_cot_score'] > 5
df_overton_results['meta_analysis.simpleXstructured'] = (df_overton_results['is_represented_simple_prompt'] == df_overton_results['is_represented_structured_cot_score.bool']).astype(int)
df_overton_results['meta_analysis.entailmentLength'] = df_overton_results['entailment_matches'].apply(lambda x: len(x))
df_overton_results['meta_analysis.entailmentXstructured'] = ((df_overton_results['meta_analysis.entailmentLength'] > 1) == df_overton_results['is_represented_structured_cot_score.bool'])
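
Raw agreement between two methods can be inflated by chance when most labels are the same. Cohen's kappa is one common correction. The sketch below is not part of the notebook's analysis, and the label lists are made up for illustration.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of boolean labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n  # share of True labels from method A
    p_b = sum(labels_b) / n  # share of True labels from method B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        # Degenerate case: both methods gave one identical label throughout.
        # Kappa is undefined here; this sketch returns 1.0.
        return 1.0
    return (observed - expected) / (1 - expected)


simple_labels = [True, True, False, False]       # made-up labels
structured_labels = [True, False, False, False]  # made-up labels
kappa = cohens_kappa(simple_labels, structured_labels)
print(kappa)
```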


In [None]:
print(f"""Meta analysis:
- Percent represented (structured cot) > 5: {df_overton_results['is_represented_structured_cot_score.bool'].mean()}
- Percent represented (simple prompt) == Percent represented (structured cot): {df_overton_results['meta_analysis.simpleXstructured'].mean()}
- Percent entailment length > 1 == represented (structured cot): {df_overton_results['meta_analysis.entailmentXstructured'].mean()}
- Mean Entailment Length: {df_overton_results['meta_analysis.entailmentLength'].mean()}
- Max Entailment Length: {df_overton_results['meta_analysis.entailmentLength'].max()}
""")


# Visual example

In [None]:
example = df_overton_results.sample(1).iloc[0]
example_question = df_questions.loc[df_questions['question_id'] == example['question_id']].iloc[0]
opinion = example_question['perspectives'][example['opinion_idx']]
response = example_question[example['response_model']]


In [None]:
print(f"""Question: {example_question['question']}
      
Response: {response}

From model: {example['response_model']}

Opinion: {opinion}

ANALYSIS:

Is represented (simple prompt): {example['is_represented_simple_prompt']}

Is represented (structured cot): {example['is_represented_structured_cot']}

Entailment result: {example['entailment_matches']}

Entailment Reasoning: {example['entailment_result']}

META ANALYSIS:

Simple prompt == Structured cot: {example['meta_analysis.simpleXstructured'] == 1}

(Entailment length > 1) == (structured cot score > 5): {example['meta_analysis.entailmentXstructured']}

""")
