# Evaluating Entailment

How do we know if one text response matches the opinions in another response? This is a tricky question, and there are many ways to approach it. This notebook is going to run through a series of approaches and models, to do some meta-analysis on the best way to do this.

It will depend on the `habermas_machine_questions_with_responses.csv` file, which contains the questions and responses generated by the LLMs, as well as the human responses.

In [1]:
import pandas as pd
import numpy as np
import os
import json
from tqdm import tqdm
import ast
from dotenv import load_dotenv
load_dotenv()

from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

In [2]:
df_questions = pd.read_csv('data/habermas_machine_questions_with_responses.csv')
df_questions['own_opinion.text'] = df_questions['own_opinion.text'].apply(ast.literal_eval)
df_questions.head()

Unnamed: 0,question.text,own_opinion.text,question_topic,question_id,gemma-2-2b-it,gemma-2-2b,gpt-4o-mini,gpt-3.5-turbo,gemini-1.5-flash-002,mistral-7b-instruct
0,Are the NHS and the UK welfare state fit for p...,[The NHS and Welfare are not working how they ...,51,19,"[{'generated_text': ""Are the NHS and the UK we...",[{'generated_text': 'Are the NHS and the UK we...,The question of whether the NHS and the UK wel...,This is a highly debated and subjective issue....,"Whether the NHS and the UK welfare state are ""...",The NHS and the UK welfare state have been in...
1,Are the long term risks from radioactivity fro...,"[Nuclear power is inherently dangerous, as sho...",4,25,"[{'generated_text': ""Are the long term risks f...",[{'generated_text': 'Are the long term risks f...,The question of whether the long-term risks fr...,This is a complex and debated topic that requi...,The question of whether the long-term risks fr...,This is a complex issue with no easy answer. ...
2,Are the rules about acceptable content in onli...,[In all honest I don't know what the rules are...,68,29,"[{'generated_text': ""Are the rules about accep...",[{'generated_text': 'Are the rules about accep...,The question of whether rules about acceptable...,This is a subjective question and opinions may...,Whether the rules about acceptable content in ...,The rules about acceptable content in online ...
3,Are the wealthy paying enough tax?,"[In my opinion, no they are not paying enough ...",2,30,"[{'generated_text': ""Are the wealthy paying en...",[{'generated_text': 'Are the wealthy paying en...,The question of whether the wealthy are paying...,This is a subjective question that can vary ba...,"Whether wealthy individuals are paying ""enough...",The question of whether the wealthy are payin...
4,Are there any circumstances where the governme...,"[None whatsoever, a ban infringes on people's ...",10,33,"[{'generated_text': ""Are there any circumstanc...","[{'generated_text': ""Are there any circumstanc...",The question of whether a government should ha...,Some argue that in situations where public pro...,The question of whether a government should be...,"In a democratic society, the government shoul..."


In [3]:
# Let's take a look at a random question to test the different methods
example = df_questions.sample(1).iloc[0]
print("Question: ", example['question.text'], "\n", "Response: \n", example['gpt-3.5-turbo'], "\n\n", "Opinion: \n", example['own_opinion.text'][0])

Question:  Should all adults be given a monthly allowance by the government, to be used for anything they like? 
 Response: 
 This is a complex and debated topic. While some argue that a universal basic income or monthly allowance from the government could provide financial security to all individuals and reduce inequality, others worry about the potential impact on work incentives and the economy as a whole. Ultimately, whether or not all adults should be given a monthly allowance by the government is a decision that must be carefully weighed and considered in the context of broader economic and social policies. 

 Opinion: 
 Yes because if we were given an allowance it would encourage people to spend which would aid the economy and lead to better investment within the UK.


In [4]:
# This is a constructed example to test the different methods.
strawman_question = "Should we ban or limit access to guns?"
strawman_response_easy = "No, we should not ban guns. Guns are a necessary tool for self-defense and hunting. We should also make sure that we have a strong police force to protect people and their property."
strawman_response_hard = "No, we should not ban guns completely, but we should make it harder for people to get them with strong laws limiting who can buy them."
strawman_response_mixed = "While banning guns would result in fewer gun related crimes and accidents, it would also make it harder for people to defend themselves and their property."
strawman_opinion = "We should ban guns to reduce crime."
strawman_opinion_2 = "Guns are a fundimental right!!"

## Finding Representation

These functions are used to determine if the LLM responses match the human opinions (not entailment).

#### First, a dead simple prompt

In [7]:
# Now we want to see if the opinions are all represented in the generated responses.
def is_my_opinion_represented(question, response, opinion, model='gpt-4o-mini'):
    """
    Determine if the following opinion is represented in the response to a question. This is the simplest approach I can think of.
    """
    prompt = f"""Task: Determine if the following opinion is represented in the response to a question.

Question: {question}
Response: {response}
Opinion to check for: {opinion}

Instructions:
- Answer ONLY with 'yes' or 'no'
- Answer 'yes' if the opinion is clearly represented
- Answer 'no' if the opinion is absent
- Do not explain your reasoning
- Do not add any other text

Answer:"""

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a precise evaluator that only responds with 'yes' or 'no'."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # Use 0 for maximum consistency
        max_tokens=1, # We only need one token for 'yes' or 'no'
    )
    
    return completion.choices[0].message.content.strip().lower()

In [8]:
is_my_opinion_represented(example['question.text'], example['gpt-4o'], example['own_opinion.text'][0])

'no'

#### Next, let's use a structured CoT to evaluate the representation

In [9]:
from pydantic import BaseModel

class OpinionPoint(BaseModel):
    claim: str
    explicit_matches: list[str]
    implicit_matches: list[str]
    contradictions: list[str]
    coverage_score: int  # 0-10 for this specific point

class EvaluationStep(BaseModel):
    step_number: int
    analysis: str
    findings: list[str]

class RepresentationAnalysis(BaseModel):
    opinion_points: list[OpinionPoint]
    evaluation_steps: list[EvaluationStep]
    final_score: int  # 0-10 overall score
    reasoning: str  # Brief explanation of final score

def is_my_opinion_represented_structured_cot(question, response, opinion, model='gpt-4o-mini'):
    """
    Determine if the opinion is represented in the response to a question, using structured CoT generation.
    """
    system_prompt = f"""Task: Evaluate how well an opinion is represented in a response through careful step-by-step analysis.

Follow these specific steps in your evaluation:
1. First, break down the core claims/points in the opinion
2. For each point in the opinion:
   - Search for explicit mentions in the response
   - Look for implicit/paraphrased representations
   - Note any contradictions
3. Consider the overall alignment:
   - How many points are covered?
   - How directly are they addressed?
   - Are there any misalignments?
4. Score the representation from 0-10 where:
   - 0: Complete contradiction or no representation
   - 1-3: Minimal/weak representation of few points
   - 4-6: Partial representation of main points
   - 7-9: Strong representation of most points
   - 10: Complete and explicit representation of all points
"""
    
    prompt = f"""Question: {question}
Response: {response}
Opinion to check for: {opinion}

Analyze step-by-step following the instructions, then provide your structured evaluation."""

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # Use 0 for maximum consistency
        response_format={
            'type': 'json_schema',
            'json_schema': 
                {
                "name": "RepresentationChain", 
                "schema": RepresentationAnalysis.model_json_schema()
                }
            } 
    )
    
    result_object = json.loads(completion.choices[0].message.content)
    return result_object

def process_representation_result(result_object):
    try:
        return result_object['final_score']
    except Exception as e:
        print(e)
        return None


Let's inspect how well the LLMs are doing at finding the opinion in the response.

In [10]:
is_my_opinion_represented_structured_cot(example['question.text'], example['gpt-4o'], example['own_opinion.text'][0])

{'opinion_points': [{'claim': 'All religious schools should receive public funding because they teach the national curriculum like normal state schools.',
   'explicit_matches': [],
   'implicit_matches': ['teaching national curriculum',
    'normal state schools'],
   'contradictions': [],
   'coverage_score': 3},
  {'claim': 'Every child should be entitled to public funding no matter what school it is taught in.',
   'explicit_matches': ['educational choice',
    'parents should have the option'],
   'implicit_matches': [],
   'contradictions': [],
   'coverage_score': 4},
  {'claim': 'Public funding should only be strictly used for the national curriculum and not for advancement of the religion.',
   'explicit_matches': ['accountability and standards'],
   'implicit_matches': [],
   'contradictions': [],
   'coverage_score': 5}],
 'evaluation_steps': [{'step_number': 1,
   'analysis': 'Identified the core claims in the opinion regarding public funding for religious schools.',
   'fi

## Entailment

Entailment is a lot trickier than representation. Our approach is going to be to use the model to find the exact text that matches the opinion, and then we'll see if that text is in the response.

#### Helper functions for visualizing the results

In [6]:
text = strawman_response_mixed
spans = [strawman_entailment_matches, strawman_entailment_matches_2]

NameError: name 'strawman_entailment_matches' is not defined

In [74]:
from IPython.display import Markdown, display
import matplotlib.pyplot as plt
import matplotlib as mpl

# Generate colors using matplotlib's color map
n_colors = len(spans)
color_map = plt.cm.rainbow  # You can also try: viridis, plasma, magma, etc.
colors = [mpl.colors.rgb2hex(color_map(i / n_colors)) for i in range(n_colors)]


# Generate a style sheet for the spans
style_sheet = "<style>\n"
for perspective_id, _ in enumerate(spans):
    style_sheet += f""".highlight-{perspective_id} {{ position: relative; }}
    .highlight-{perspective_id}::after {{
        content: "";
        position: absolute;
        left: 0; right: 0;
        bottom: -{2*perspective_id}px; /* offset slightly below baseline */
        border-bottom: 2px solid {colors[perspective_id]};
    }}
    """
style_sheet += "</style>"

# Convert dict to list of (start, end, color) and sort by start position
ordered_span = [(start, end, perspective_id) 
                 for perspective_id, span_list in enumerate(spans)
                 for start, end in span_list]
ordered_span.sort(key=lambda x: x[0])

In [None]:
result = []
last_idx = 0

for span_idx, (start, end, perspective_id) in enumerate(ordered_span):
    # Add text before the span
    result.append(text[last_idx:start])
    # Add the highlighted span
    result.append(f"<span class='highlight-{perspective_id}'>")
    if span_idx+1 < len(ordered_span) and ordered_span[span_idx+1][0] < end:
        result.append(text[start:end])
    else:
        result.append(text[start:end])
    result.append("</span>")
    last_idx = end


In [75]:
ordered_span

[(0, 73, '#80ffb4'),
 (6, 18, '#8000ff'),
 (35, 59, '#8000ff'),
 (75, 154, '#80ffb4'),
 (89, 135, '#8000ff')]

In [None]:
def highlight_spans(text: str, spans: list[list[tuple[int, int]]]):
    """
    Highlight the spans in the text based on the spans dictionary. 
    Args:
        text: The text to highlight
        spans_dict: Dictionary of 'color': list of spans, where each span is (start, end)
    Example:
        spans = [
            [(0, 4), (10, 16)],
            [(28, 37)]
        }
    """
    from IPython.display import Markdown, display
    import matplotlib.pyplot as plt
    import matplotlib as mpl

    # Generate colors using matplotlib's color map
    n_colors = len(spans)
    color_map = plt.cm.rainbow  # You can also try: viridis, plasma, magma, etc.
    colors = [mpl.colors.rgb2hex(color_map(i / n_colors)) for i in range(n_colors)]

    # Create the spans dictionary using these colors
    spans_dict = {color: spans[i] for i, color in enumerate(colors)}
    highlight_spans(text, spans_dict)

    # Generate a style sheet for the spans
    style_sheet = "<style>"
    for span_id, span in enumerate(spans):
        style_sheet += f""".highlight-{span_id} {{ background-color: {colors[span_id]}; }}
        .highlight-{span_id}::after {{
            content: "";
            position: absolute;
            left: 0; right: 0;
            bottom: -{2*span_id}px; /* offset slightly below baseline */
            border-bottom: 2px solid {colors[span_id]};
        }}
        """
    style_sheet += "</style>"
    
    # Convert dict to list of (start, end, color) and sort by start position
    all_spans = [(start, end, color) 
                 for color, spans in spans_dict.items() 
                 for start, end in spans]
    all_spans.sort(key=lambda x: x[0])
    
    # Build the marked up text piece by piece
    result = []
    last_idx = 0
    
    for start, end, color in all_spans:
        # Add text before the span
        result.append(text[last_idx:start])
        # Add the highlighted span
        result.append(f"<span style='color: {color};'>{text[start:end]}</span>")
        last_idx = end
    
    # Add any remaining text after the last span
    result.append(text[last_idx:])
    
    # Add color swatches at the end
    for color in spans_dict:
        swatch = f"<span style='color: {color};'>{chr(9608) * 6}</span>"
        result.append(f" {swatch}")
    
    # Join all pieces and display
    marked_text = ''.join(result)
    display(Markdown(marked_text))

#### Structured CoT (OpenAI) for entailment

In [39]:
# We're going to use the writefile magic to save the entailment code to a file for use downsteam and for version control.
# %%writefile src/entailment.py

from pydantic import BaseModel

import openai, os, json
client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

class EntailmentMatch(BaseModel):
    text: str
    match_type: str  # "direct", "paraphrase", or "contextual"
    confidence: int  # 0-10 score
    explanation: str  # Why this is a match

class EntailmentStep(BaseModel):
    step_number: int
    concept: str  # The concept from the opinion being analyzed
    analysis: str  # The reasoning process
    matches: list[EntailmentMatch]

class EntailmentAnalysis(BaseModel):
    steps: list[EntailmentStep]
    final_matches: list[str]  # The best, most confident matches
    coverage_score: int  # 0-10 how well the opinion is covered

def entailment_from_gpt_json(question: str, response: str, opinion: str, model='gpt-4o-mini'):
    """
    Find exact text matches between rich text and opinion using GPT-4.
    """
    system_prompt = f"""Task: Precise Text Entailment Analysis. Find and evaluate text in the Response that represents concepts from the Opinion.

Follow these specific steps:
1. Break down the Opinion into key concepts
2. For each concept:
   - Search for direct text matches, this includes single words like "yes" or "no"
   - Identify paraphrased representations
   - Look for contextual/implicit matches
   - Copy the **exact text** in the Response that matches the concept in the Opinion. Copy the text from the response, not the opinion.

3. Evaluate matches by:
   - Precision: How exactly does it match?
   - Context: Is the meaning preserved?
   - Completeness: Is the full concept captured?

4. Score coverage from 0-10 where:
   - 0: No valid matches found
   - 1-3: Few weak/partial matches
   - 4-6: Some good matches but incomplete
   - 7-9: Strong matches for most concepts
   - 10: Complete, precise matches for all concepts

Important:
- Prioritize precision over quantity
- Consider context to avoid false matches
- Explain reasoning for each match
- Always copy the exact text from the response that matches the concept
"""

    prompt = f"""Context question: {question}
Opinion: {opinion}
Response: {response}

Analyze step-by-step following the instructions to find and evaluate all relevant matches."""
    
    chat_response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # Use 0 for maximum consistency
        response_format={
            'type': 'json_schema',
            'json_schema': 
                {
                "name": "EntailmentAnalysis", 
                "schema": EntailmentAnalysis.model_json_schema()
                }
            } 
    )

    result_object = json.loads(chat_response.choices[0].message.content)
    return result_object

def process_entailment_result(result_object, response):
    matches = []
    for match in result_object['final_matches']:
        start_index = response.lower().find(match.lower())
        if start_index == -1:
            print("Warning: match was not found in response text.")
            continue
        end_index = start_index + len(match)
        matches.append((start_index, end_index))
    return matches


In [40]:
entailment_results = entailment_from_gpt_json(example['question.text'], example['gpt-3.5-turbo'], example['own_opinion.text'][0])
entailment_results

{'steps': [{'step_number': 1,
   'concept': 'Many ethnicity groups',
   'analysis': 'The opinion mentions that there are many ethnicity groups, which implies a concern about the granularity of reporting. The response does not address this concept directly, as it advocates for reporting without acknowledging the complexity of multiple ethnic groups.',
   'matches': []},
  {'step_number': 2,
   'concept': 'Grouping ethnicity groups together defeats the object',
   'analysis': 'The opinion suggests that grouping ethnicity groups together undermines the purpose of reporting. The response does not acknowledge this concern and instead focuses on the benefits of reporting without addressing the potential drawbacks of oversimplification.',
   'matches': []},
  {'step_number': 3,
   'concept': 'Reporting data and doing something about it is a problem',
   'analysis': 'The opinion states that companies are not willing to address the issues after reporting. The response emphasizes the importance 

In [41]:
entailment_matches = process_entailment_result(entailment_results, example['gpt-3.5-turbo'])
print(entailment_matches)

[]


In [71]:
# strawman_entailment_results = entailment_from_gpt_json(strawman_question, strawman_response_mixed, strawman_opinion)
# strawman_entailment_matches = process_entailment_result(strawman_entailment_results, strawman_response_mixed)
# print(strawman_entailment_matches)

strawman_entailment_results_2 = entailment_from_gpt_json(strawman_question, strawman_response_mixed, strawman_opinion_2)
strawman_entailment_matches_2 = process_entailment_result(strawman_entailment_results_2, strawman_response_mixed)
print(strawman_entailment_matches_2)

[(75, 154), (0, 73)]


In [42]:
highlight_spans(example['gpt-4o'], {'green': entailment_matches})


KeyError: 'gpt-4o'

In [16]:
# Now we're going to run this across all the opinions and highlight each match in the response.

opinion_entailments = []
for opinion_idx, opinion in enumerate(tqdm(example['own_opinion.text'])):
    opinion_entailments.append({})
    opinion_entailments[opinion_idx]['full_result'] = entailment_from_gpt_json(example['question.text'], example['gpt-4o'], opinion)
    opinion_entailments[opinion_idx]['matches'] = process_entailment_result(opinion_entailments[opinion_idx]['full_result'], example['gpt-4o'])

100%|██████████| 10/10 [00:43<00:00,  4.31s/it]


In [17]:
# Highlight the matches for the example
import matplotlib.pyplot as plt
import matplotlib as mpl

# Generate colors using matplotlib's color map
n_colors = len(opinion_entailments)
color_map = plt.cm.rainbow  # You can also try: viridis, plasma, magma, etc.
colors = [mpl.colors.rgb2hex(color_map(i / n_colors)) for i in range(n_colors)]

# Create the spans dictionary using these colors
spans_dict = {color: opinion_entailments[i]['matches'] for i, color in enumerate(colors)}
highlight_spans(example['gpt-4o'], spans_dict)

<span style='color: #b2f396;'>The question of whether religious schools should receive public funding is a complex and debated issue</span><span style='color: #ff964f;'>The question of whether religious schools should receive public funding is a complex and debated issue</span><span style='color: #ff4d27;'>The question of whether religious schools should receive public funding is a complex and debated issue</span><span style='color: #4df3ce;'>whether religious schools should receive public funding is a complex and debated issue</span>, and views on it can vary significantly based on <span style='color: #18cde4;'>legal, ethical, and educational considerations</span>, as well as differing national contexts. Here are some key points and perspectives that can inform the discussion:

1. **Separation of Church and State**: In many countries, particularly those with a strong emphasis on the separation of church and state, <span style='color: #1996f3;'>using public funds for religious schools can be controversial</span><span style='color: #18cde4;'>using public funds for religious schools can be controversial</span><span style='color: #e6cd73;'>using public funds for religious schools can be controversial</span><span style='color: #ff964f;'>using public funds for religious schools can be controversial</span><span style='color: #ff4d27;'>using public funds for religious schools can be controversial</span>. <span style='color: #b2f396;'>Opponents argue that it violates the principle of maintaining a secular state by indirectly supporting religious activities.</span><span style='color: #e6cd73;'>Opponents argue that it violates the principle of maintaining a secular state by indirectly supporting religious activities</span><span style='color: #ff964f;'>Opponents argue that it violates the principle of maintaining a secular state by indirectly supporting religious activities.</span>

2. **Educational Choice**: Proponents of public funding for religious schools often argue that it <span style='color: #1996f3;'>promotes educational choice</span>, allowing parents to select schools that align with their values and beliefs. They may contend that parents who pay taxes should have the option to use public funds to support a portion of their child's tuition at religious schools if they choose.

3. **Equality and Access**: <span style='color: #b2f396;'>Advocates for funding often point to issues of equality and access, arguing that without financial assistance, only wealthier families can afford religious education, which can lead to inequality.</span><span style='color: #1996f3;'>issues of equality and access</span>, arguing that <span style='color: #1996f3;'>without financial assistance, only wealthier families can afford religious education</span>, which <span style='color: #1996f3;'>can lead to inequality</span>.

4. **Accountability and Standards**: If public funding is provided, questions about accountability and standards arise. <span style='color: #4e4dfc;'>There needs to be a framework to ensure that religious schools meet certain educational standards and that public funds are used appropriately.</span><span style='color: #ff4d27;'>There needs to be a framework to ensure that religious schools meet certain educational standards</span> and that public funds are used appropriately.

5. **Diversity and Pluralism**: Supporters might also argue that <span style='color: #18cde4;'>funding religious schools encourages diversity and pluralism</span><span style='color: #1996f3;'>encourages diversity and pluralism</span>, allowing a range of educational philosophies to coexist and potentially enriching the educational landscape.

6. **Impact on Public Schools**: <span style='color: #e6cd73;'>Critics often express concern that diverting funds to religious schools might weaken public school systems</span> by reducing the resources available to them.

7. **Legal Framework**: The legal context can significantly influence this issue. In countries where the constitution or prevailing laws provide for some level of support for religious institutions, this might be more acceptable, whereas in others, it could be legally challenged.

Ultimately, whether religious schools should receive public funding depends on a variety of factors, including national laws, societal values, and the educational and financial implications for both religious and public schools. Each country or region may approach this issue differently based on its own legal framework and societal context. <span style='color: #8000ff;'>██████</span> <span style='color: #4e4dfc;'>██████</span> <span style='color: #1996f3;'>██████</span> <span style='color: #18cde4;'>██████</span> <span style='color: #4df3ce;'>██████</span> <span style='color: #80ffb4;'>██████</span> <span style='color: #b2f396;'>██████</span> <span style='color: #e6cd73;'>██████</span> <span style='color: #ff964f;'>██████</span> <span style='color: #ff4d27;'>██████</span>

## Using real NLI methods

In [5]:
# !pip install transformers torch
# !pip install sentence-splitter
# !pip install sentence-transformers
# !pip install tiktoken
# !pip install sentencepiece


In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

print(strawman_response_mixed)
print(strawman_opinion)
input = tokenizer(strawman_response_mixed, strawman_opinion_2, truncation=True, return_tensors="pt")
output = model(input["input_ids"].to(device))  # device = "cuda:0" or "cpu"
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)

  from .autonotebook import tqdm as notebook_tqdm


While banning guns would result in fewer gun related crimes and accidents, it would also make it harder for people to defend themselves and their property.
We should ban guns to reduce crime.
{'entailment': 0.7, 'neutral': 9.8, 'contradiction': 89.6}


In [7]:
from sentence_splitter import SentenceSplitter, split_text_into_sentences

text = "This is a paragraph. It contains several sentences. \"But why,\" you ask?"
sentences = split_text_into_sentences(text=text, language='en')
print(sentences)

['This is a paragraph.', 'It contains several sentences.', '"But why," you ask?']


In [None]:
#  We could also validate with facebook/xnli

In [9]:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge-mnli")
NLI_model = transformers.pipeline("text-classification", model="microsoft/deberta-v2-xlarge-mnli", tokenizer=tokenizer, top_k=None, device=0)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Device set to use mps:0


In [34]:
strawman_response_mixed

'While banning guns would result in fewer gun related crimes and accidents, it would also make it harder for people to defend themselves and their property.'

In [33]:
{"text": example['gpt-3.5-turbo'], "text_pair": example['own_opinion.text'][0]}

{'text': 'This is a complex and debated topic. While some argue that a universal basic income or monthly allowance from the government could provide financial security to all individuals and reduce inequality, others worry about the potential impact on work incentives and the economy as a whole. Ultimately, whether or not all adults should be given a monthly allowance by the government is a decision that must be carefully weighed and considered in the context of broader economic and social policies.',
 'text_pair': 'Yes because if we were given an allowance it would encourage people to spend which would aid the economy and lead to better investment within the UK.'}

## Cross-comparing the different methods

In [18]:
# I apologize for how chaotic and unreadable this cell is, but we want to run all the same functions and save all intermediate results for debugging.

entailment_models = ['gpt-4o-mini']
response_models = ['gpt-3.5-turbo']

overton_results = []
sample_size = 3
for _, row in tqdm(df_questions.sample(sample_size).iterrows(), total=sample_size, desc="Questions", leave=True):
    question = row['question.text']
    opinions = row['own_opinion.text']
    question_id = row['question_id']
    with tqdm(total=len(opinions), desc="Opinions", leave=False) as opinion_bar:
        for opinion_idx, opinion in enumerate(opinions):
            with tqdm(total=len(response_models), desc="Response Models", leave=False) as response_bar:
                for response_model in response_models:
                    with tqdm(total=len(entailment_models), desc="Entailment Models", leave=False) as entailment_bar:
                        for entailment_model in entailment_models:
                            response = row[response_model]
                            # This is the simple prompt check (which should be the same for all models)
                            result_opinion_represented = is_my_opinion_represented(question, response, opinion, model=entailment_model)
                            # This is the structured cot check, which should be the same for all models
                            result_opinion_represented_structured_cot = is_my_opinion_represented_structured_cot(question, response, opinion, model=entailment_model)
                            result_opinion_represented_structured_cot_score = process_representation_result(result_opinion_represented_structured_cot)


                            # Now we turn to entailment, which we're going to save in a bunch of different formats for debugging
                            entailment_result = entailment_from_gpt_json(question, response, opinion, model=entailment_model)
                            entailment_matches = process_entailment_result(entailment_result, response)

                            overton_results.append({
                                'question_id': question_id,
                                'opinion_idx': opinion_idx,
                                'response_model': response_model,
                                'entailment_model': entailment_model,
                                'is_represented_simple_prompt': result_opinion_represented == 'yes',
                                'is_represented_structured_cot': result_opinion_represented_structured_cot,
                                'is_represented_structured_cot_score': result_opinion_represented_structured_cot_score,
                                'entailment_result': entailment_result,
                                'entailment_matches': entailment_matches
                            })
                            entailment_bar.update(1)
                    response_bar.update(1)
            opinion_bar.update(1)

df_overton_results = pd.DataFrame(overton_results)

Questions:   0%|          | 0/3 [00:00<?, ?it/s]
[A

[A[A

[A[A

[A[A
[A
[A
[A

[A[A

[A[A

[A[A
[A
[A
[A

[A[A

[A[A

[A[A
[A
[A
[A

[A[A

[A[A

[A[A
[A
[A
[A

[A[A

[A[A

[A[A
[A
Questions:  33%|███▎      | 1/3 [00:48<01:36, 48.26s/it]
[A

[A[A

'final_score'




[A[A

[A[A
[A
[A




[A

[A[A

[A[A

[A[A
[A
Questions:  67%|██████▋   | 2/3 [01:09<00:32, 32.22s/it]




[A

[A[A

[A[A

[A[A
[A
[A
[A

[A[A

[A[A

[A[A
[A
[A
[A

[A[A

[A[A

[A[A
[A
[A
[A

[A[A

[A[A

[A[A
[A
[A
[A

[A[A

[A[A

[A[A
[A
Questions: 100%|██████████| 3/3 [01:56<00:00, 38.97s/it]


In [20]:
df_overton_results = pd.DataFrame(overton_results)
df_overton_results.head()

Unnamed: 0,question_id,opinion_idx,response_model,entailment_model,is_represented_simple_prompt,is_represented_structured_cot,is_represented_structured_cot_score,entailment_result,entailment_matches
0,34,0,gpt-3.5-turbo,gpt-4o-mini,True,"{'evaluation_steps': [{'step_number': 1, 'anal...",6.0,"{'steps': [{'step_number': 1, 'concept': 'dece...","[(15, 137), (347, 435), (175, 267), (459, 542)]"
1,34,1,gpt-3.5-turbo,gpt-4o-mini,False,"{'evaluation_steps': [{'step_number': 1, 'anal...",4.0,"{'steps': [{'step_number': 1, 'concept': 'limi...","[(0, 138), (139, 268)]"
2,34,2,gpt-3.5-turbo,gpt-4o-mini,True,"{'evaluation_steps': [{'step_number': 1, 'anal...",9.0,"{'steps': [{'step_number': 1, 'concept': 'limi...","[(0, 138), (283, 436), (139, 268)]"
3,34,3,gpt-3.5-turbo,gpt-4o-mini,True,"{'evaluation_steps': [{'step_number': 1, 'anal...",4.0,"{'steps': [{'step_number': 1, 'concept': 'Limi...","[(5, 137), (175, 267), (175, 267), (437, 542)]"
4,34,4,gpt-3.5-turbo,gpt-4o-mini,True,"{'evaluation_steps': [{'step_number': 1, 'anal...",6.0,"{'steps': [{'step_number': 1, 'concept': 'limi...","[(0, 138), (139, 268), (175, 268), (269, 436),..."


In [124]:
df_overton_results.to_csv('data/entailment_ablation_results.csv', index=False)

# Analysis

In [198]:
df_overton_results = pd.read_csv('data/entailment_ablation_results.csv')

In [135]:
df_overton_results.groupby(['model', 'question_id'])['is_represented_simple_prompt'].mean().reset_index().groupby('model')['is_represented_simple_prompt'].mean()

model
gpt-3.5-turbo    0.066667
gpt-4o           0.200000
gpt-4o-mini      0.266667
Name: is_represented_simple_prompt, dtype: float64

In [21]:
df_overton_results['is_represented_structured_cot_score.bool'] = df_overton_results['is_represented_structured_cot_score'] > 5
df_overton_results['meta_analysis.simpleXstructured'] = (df_overton_results['is_represented_simple_prompt'] == df_overton_results['is_represented_structured_cot_score.bool']).astype(int)
df_overton_results['meta_analysis.entailmentLength'] = df_overton_results['entailment_matches'].apply(lambda x: len(x))
df_overton_results['meta_analysis.entailmentXstructured'] = ((df_overton_results['meta_analysis.entailmentLength'] > 1) == df_overton_results['is_represented_structured_cot_score.bool'])


In [22]:
print(f"""Meta analysis:
- Percent represented (structured cot) > 5: {df_overton_results['is_represented_structured_cot_score.bool'].mean()}
- Percent represented (simple prompt) == Percent represented (structured cot): {df_overton_results['meta_analysis.simpleXstructured'].mean()}
- Percent entailment length > 1 == represented (structured cot): {df_overton_results['meta_analysis.entailmentXstructured'].mean()}
- Mean Entailment Length: {df_overton_results['meta_analysis.entailmentLength'].mean()}
- Max Entailment Length: {df_overton_results['meta_analysis.entailmentLength'].max()}
""")


Meta analysis:
- Percent represented (structured cot) > 5: 0.3333333333333333
- Percent represented (simple prompt) == Percent represented (structured cot): 0.8333333333333334
- Percent entailment length > 1 == represented (structured cot): 0.4166666666666667
- Mean Entailment Length: 3.4166666666666665
- Max Entailment Length: 5



# Visual example

In [24]:
example =df_overton_results.sample(1).iloc[0]
example_question = df_questions.loc[df_questions['question_id'] == example['question_id']].iloc[0]
opinion = example_question['own_opinion.text'][example['opinion_idx']]
response =example_question[example['response_model']]


In [26]:
print(f"""Question: {example_question['question.text']}
      
Response: {response}

From model: {example['response_model']}

Opinion: {opinion}

ANALYSIS:

Is represented (simple prompt): {example['is_represented_simple_prompt']}

Is represented (structured cot): {example['is_represented_structured_cot']}

Entailment result: {example['entailment_matches']}

Entailment Reasoning: {example['entailment_result']}

META ANALYSIS:

Simple prompt == Structured cot: {example['meta_analysis.simpleXstructured'] == 1}

Entailment length > 1 AND simple prompt == structured cot: {example['meta_analysis.entailmentXstructured']}

""")


Question: Are there any limits on what can be allowed to be broadcast on television?
      
Response: Yes, there are regulations and guidelines set by government agencies and industry groups that dictate what can be broadcast on television. These regulations typically include restrictions on airing explicit content such as nudity, violence, profanity, and hate speech. Additionally, there are rules on the timing of certain types of content, with more stringent regulations in place during times when children are likely to be watching. Networks and channels must comply with these regulations in order to maintain their broadcasting licenses.

From model: gpt-3.5-turbo

Opinion: I think there are absolutely limits to what can be broadcast on television. I do like the current system we have in my country which has rules for pre and post watershed, so that the rules are stricter for times when children may be watching television. I do not consider myself a prude and think adults can make up t