# Evaluating Entailment

How do we know whether one text response reflects the opinions expressed in another? This is a tricky question with many possible approaches. This notebook runs through a series of approaches and models as a meta-analysis of which works best.

It depends on the `habermas_machine_questions_with_responses.csv` file, which contains the questions, the human opinions, and the responses generated by the LLMs.

In [103]:
import pandas as pd
import numpy as np
import os
import json
from tqdm.auto import tqdm
import ast
from dotenv import load_dotenv
load_dotenv()


True

In [25]:
df_questions = pd.read_csv('data/habermas_machine_questions_with_responses.csv')
df_questions['own_opinion.text'] = df_questions['own_opinion.text'].apply(ast.literal_eval)
df_questions.head()

| Unnamed: 0 | question.text | own_opinion.text | question_topic | question_id | gpt-3.5-turbo | gpt-4o | gemini-1.5-flash-002 | llama-3.1-8B | gemma-2b | mistral-7B |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Are people who hold high political office and ... | [One minute I think they should disclose and t... | 74 | 15 | There is no universal ethical requirement for ... | The ethical requirement for individuals in hig... | There's no universally agreed-upon ethical sta... | The requirement for public officials to disclo... | \n\nThis question is complex and there is no e... | \n\n[INST] Are people who hold high political ... |
| 1 | Are the NHS and the UK welfare state fit for p... | [The NHS and Welfare are not working how they ... | 51 | 19 | This is a complex question that can have diffe... | The National Health Service (NHS) and the wide... | Whether the NHS and the UK welfare state are "... | The NHS and the UK welfare state are complex s... | The UK's National Health Service (NHS) and its... | \n\nThe NHS is a national treasure, but it is ... |
| 2 | Are there any limits on what can be allowed to... | [We have to keep in mind some kind of decency ... | 52 | 34 | Yes, there are regulations and guidelines set ... | Yes, there are several limits and regulations ... | Yes, there are many limits on what can be broa... | Yes, there are limits on what can be allowed t... | Yes, there are many limits on what can be allo... | \n\n[INST] What are the limits on what can be ... |
| 3 | Are there any questions that we should never a... | [I cannot think of a topic which would fall in... | 14 | 36 | There are certain topics that can be harmful, ... | The idea of restricting discussion or voting o... | There's no universally agreed-upon list of top... | While it's essential to maintain an open and i... | Sure, there are certain questions that should ... | \n\n[INST] Are there any questions that we sho... |
| 4 | Are there limits to how much tax people should... | [I think there are limits for poorer people to... | 52 | 37 | There is debate and varying opinions on what c... | The question of whether there should be limits... | There's no universally agreed-upon limit to ho... | The concept of tax limits is a complex and deb... | Sure, there are limits to how much tax people ... | \n\n[INST] What is the best way to raise money... |


In [26]:
# Let's take a look at a random question to test the different methods
example = df_questions.sample(1).iloc[0]

In [31]:
print("Question: ", example['question.text'], "\n", "Response: ", example['gpt-4o'], "\n\n", "Opinion: ", example['own_opinion.text'][0])

Question:  Should we be encouraging more immigration to the UK? 
 Response:  The issue of whether to encourage more immigration to the UK is complex and multifaceted, involving economic, social, and political considerations. Here are some arguments for and against encouraging more immigration:

### Arguments for Encouraging More Immigration:

1. **Economic Growth**: Immigrants can contribute to economic growth by filling labor shortages, starting businesses, and paying taxes. They often fill essential roles in healthcare, agriculture, and technology.

2. **Demographic Challenges**: The UK, like many developed countries, faces an aging population. Immigration can help balance the demographic scales by increasing the proportion of working-age people, which supports the pension system and public services.

3. **Cultural Diversity**: Immigration contributes to cultural diversity, bringing new perspectives, skills, and ideas that can enrich society culturally and economically.

4. **Innovat

## Finding Representation

These functions determine whether an LLM response represents a human opinion. This is a looser check than entailment, which gets its own section further down.

### First, a dead simple prompt

In [123]:
# Now we want to see if the opinions are all represented in the generated responses.
from openai import OpenAI
# The OpenAI client takes keyword-only arguments, so the key must be passed as api_key=
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

def is_my_opinion_represented(question, response, opinion, model='gpt-4o-mini'):
    """
    Determine if the following opinion is represented in the response to a question. This is the simplest approach I can think of.
    """
    prompt = f"""Task: Determine if the following opinion is represented in the response to a question.

Question: {question}
Response: {response}
Opinion to check for: {opinion}

Instructions:
- Answer ONLY with 'yes' or 'no'
- Answer 'yes' if the opinion is clearly represented
- Answer 'no' if the opinion is absent
- Do not explain your reasoning
- Do not add any other text

Answer:"""

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a precise evaluator that only responds with 'yes' or 'no'."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # Use 0 for maximum consistency
        max_tokens=1, # We only need one token for 'yes' or 'no'
    )
    
    return completion.choices[0].message.content.strip().lower()

In [32]:
is_my_opinion_represented(example['question.text'], example['gpt-4o'], example['own_opinion.text'][0])

'yes'
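
The cross-comparison loop later compares this string to `'yes'`, so anything unexpected (a stray `'Yes.'`, an empty reply) silently counts as "not represented". Here is a minimal sketch of a stricter parser, assuming we would rather get `None` for replies we cannot interpret. `parse_yes_no` is a hypothetical helper, not something the notebook defines.

```python
def parse_yes_no(raw):
    """Map a raw model reply to True/False, or None if it is neither."""
    if raw is None:
        return None
    # Drop surrounding whitespace, quotes, and punctuation such as "Yes."
    token = raw.strip().strip("'\".,!").lower()
    if token == "yes":
        return True
    if token == "no":
        return False
    return None  # unexpected reply: let the caller decide what to do


for reply in ["yes", " No\n", "Yes.", "maybe", None]:
    print(repr(reply), "->", parse_yes_no(reply))
```

Keeping `None` separate from `False` makes it possible to count unparseable replies instead of folding them into the "no" bucket.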

### Next, let's use a structured CoT to evaluate the representation

In [107]:
from pydantic import BaseModel

class OpinionPoint(BaseModel):
    claim: str
    explicit_matches: list[str]
    implicit_matches: list[str]
    contradictions: list[str]
    coverage_score: int  # 0-10 for this specific point

class EvaluationStep(BaseModel):
    step_number: int
    analysis: str
    findings: list[str]

class RepresentationAnalysis(BaseModel):
    opinion_points: list[OpinionPoint]
    evaluation_steps: list[EvaluationStep]
    final_score: int  # 0-10 overall score
    reasoning: str  # Brief explanation of final score

def is_my_opinion_represented_structured_cot(question, response, opinion, model='gpt-4o-mini'):
    """
    Determine if the opinion is represented in the response to a question, using structured CoT generation.
    """
    system_prompt = f"""Task: Evaluate how well an opinion is represented in a response through careful step-by-step analysis.

Follow these specific steps in your evaluation:
1. First, break down the core claims/points in the opinion
2. For each point in the opinion:
   - Search for explicit mentions in the response
   - Look for implicit/paraphrased representations
   - Note any contradictions
3. Consider the overall alignment:
   - How many points are covered?
   - How directly are they addressed?
   - Are there any misalignments?
4. Score the representation from 0-10 where:
   - 0: Complete contradiction or no representation
   - 1-3: Minimal/weak representation of few points
   - 4-6: Partial representation of main points
   - 7-9: Strong representation of most points
   - 10: Complete and explicit representation of all points
"""
    
    prompt = f"""Question: {question}
Response: {response}
Opinion to check for: {opinion}

Analyze step-by-step following the instructions, then provide your structured evaluation."""

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # Use 0 for maximum consistency
        response_format={
            'type': 'json_schema',
            'json_schema': 
                {
                "name": "RepresentationChain", 
                "schema": RepresentationAnalysis.model_json_schema()
                }
            } 
    )
    
    result_object = json.loads(completion.choices[0].message.content)
    return result_object

def process_representation_result(result_object):
    try:
        return result_object['final_score']
    except Exception as e:
        print(e)
        return None
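
The JSON that comes back is not guaranteed to contain every field, and the `KeyError: 'final_score'` recorded in the cross-comparison run suggests that it sometimes does not. Below is a minimal sketch of a stricter extractor, assuming we only want integer scores inside the 0-10 range that the prompt asks for. `extract_final_score` is a hypothetical name, not part of the notebook.

```python
def extract_final_score(result_object, low=0, high=10):
    """Return final_score as an int within [low, high], otherwise None."""
    if not isinstance(result_object, dict):
        return None
    score = result_object.get("final_score")
    # bool is a subclass of int in Python, so rule it out explicitly
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    return score if low <= score <= high else None


print(extract_final_score({"final_score": 7}))
print(extract_final_score({"reasoning": "no score given"}))
print(extract_final_score({"final_score": 42}))
```

Returning `None` for out-of-range values keeps a malformed reply from skewing the later `> 5` threshold.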


Let's inspect how well the LLMs are doing at finding the opinion in the response.

In [95]:
is_my_opinion_represented_structured_cot(example['question.text'], example['gpt-4o'], example['own_opinion.text'][0])


{'evaluation_steps': [{'step_number': 1,
   'analysis': 'The core claims in the opinion are: 1) We should encourage more immigration to fill job vacancies, 2) A larger working population benefits the economy, 3) The UK’s population growth is stagnating, which could lead to a crisis, and 4) Proactive measures are necessary to address future challenges.',
   'findings': ['Encouraging more immigration is beneficial for filling job vacancies.',
    'A larger working population is economically advantageous.',
    'The UK faces stagnating population growth, which could lead to a crisis.',
    'Proactive thinking is necessary for future challenges.']},
  {'step_number': 2,
   'analysis': '1. **Job Vacancies**: The response mentions that immigrants can fill labor shortages, which aligns with the opinion\'s point about job vacancies. \n   - **Explicit Matches**: "filling labor shortages"\n   - **Implicit Matches**: The mention of immigrants filling essential roles can be seen as a paraphrase of

## Entailment

Entailment is a lot trickier than representation. Our approach is to ask the model to copy the exact text from the response that matches each concept in the opinion, and then check that the copied text actually appears in the response.

In [78]:
from pydantic import BaseModel
class EntailmentMatch(BaseModel):
    text: str
    match_type: str  # "direct", "paraphrase", or "contextual"
    confidence: int  # 0-10 score
    explanation: str  # Why this is a match

class EntailmentStep(BaseModel):
    step_number: int
    concept: str  # The concept from the opinion being analyzed
    analysis: str  # The reasoning process
    matches: list[EntailmentMatch]

class EntailmentAnalysis(BaseModel):
    steps: list[EntailmentStep]
    final_matches: list[str]  # The best, most confident matches
    coverage_score: int  # 0-10 how well the opinion is covered

def entailment_from_gpt_json(question: str, response: str, opinion: str, model='gpt-4o-mini'):
    """
    Find exact spans in the response text that match concepts in the opinion, using an OpenAI chat model (default: gpt-4o-mini).
    """
    system_prompt = f"""Task: Precise Text Entailment Analysis. Find and evaluate text in the Response that represents concepts from the Opinion.

Follow these specific steps:
1. Break down the Opinion into key concepts
2. For each concept:
   - Search for direct text matches, this includes single words like "yes" or "no"
   - Identify paraphrased representations
   - Look for contextual/implicit matches
   - Copy the **exact text** in the Response that matches the concept in the Opinion. Copy the text from the response, not the opinion.

3. Evaluate matches by:
   - Precision: How exactly does it match?
   - Context: Is the meaning preserved?
   - Completeness: Is the full concept captured?

4. Score coverage from 0-10 where:
   - 0: No valid matches found
   - 1-3: Few weak/partial matches
   - 4-6: Some good matches but incomplete
   - 7-9: Strong matches for most concepts
   - 10: Complete, precise matches for all concepts

Important:
- Prioritize precision over quantity
- Consider context to avoid false matches
- Explain reasoning for each match
- Always copy the exact text from the response that matches the concept
"""

    prompt = f"""Context question: {question}
Opinion: {opinion}
Response: {response}

Analyze step-by-step following the instructions to find and evaluate all relevant matches."""
    
    chat_response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # Use 0 for maximum consistency
        response_format={
            'type': 'json_schema',
            'json_schema': 
                {
                "name": "EntailmentAnalysis", 
                "schema": EntailmentAnalysis.model_json_schema()
                }
            } 
    )

    result_object = json.loads(chat_response.choices[0].message.content)
    return result_object

def process_entailment_result(result_object, response):
    matches = []
    for match in result_object['final_matches']:
        start_index = response.lower().find(match.lower())
        if start_index == -1:
            print("Warning: match was not found in response text.")
            continue
        end_index = start_index + len(match)
        matches.append((start_index, end_index))
    return matches
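
The lookup above only succeeds when the model copies the text character for character (apart from case). Models often normalize whitespace when copying, so here is a minimal sketch of a more tolerant lookup, assuming that differences in spaces and line breaks should not count as a miss. `find_span_loose` is a hypothetical helper. It uses a regular expression rather than `str.find`, so it is a swapped-in technique, not what the notebook does.

```python
import re


def find_span_loose(response, match):
    """Locate `match` in `response`, ignoring case and whitespace differences."""
    tokens = match.split()
    if not tokens:
        return None
    # Escape each token literally, then allow any run of whitespace between them
    pattern = r"\s+".join(re.escape(token) for token in tokens)
    found = re.search(pattern, response, flags=re.IGNORECASE)
    return found.span() if found else None


text = "Immigrants can contribute\nto  economic growth."
print(find_span_loose(text, "immigrants can contribute to economic growth"))
print(find_span_loose(text, "text that is not there"))
```

The returned indices refer to the original response, so they can be passed to the highlighting code unchanged.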


In [79]:
entailment_results = entailment_from_gpt_json(example['question.text'], example['gpt-4o'], example['own_opinion.text'][0])
entailment_results

{'steps': [{'step_number': 1,
   'concept': 'Encouraging more immigration to the UK',
   'analysis': 'The opinion states that we should encourage more immigration to the UK, while the response discusses the complexity of the issue and presents arguments for and against immigration.',
   'matches': [{'text': 'The issue of whether to encourage more immigration to the UK is complex and multifaceted',
     'match_type': 'Paraphrased representation',
     'confidence': 7,
     'explanation': "The response acknowledges the topic of encouraging immigration, indicating that it is a complex issue, which aligns with the opinion's stance."}]},
  {'step_number': 2,
   'concept': 'High amount of job vacancies',
   'analysis': 'The opinion mentions a high number of job vacancies that need to be filled, while the response discusses how immigrants can fill labor shortages.',
   'matches': [{'text': 'Immigrants can contribute to economic growth by filling labor shortages',
     'match_type': 'Paraphras

In [81]:
entailment_matches = process_entailment_result(entailment_results, example['gpt-4o'])
print(entailment_matches)

[(0, 88), (292, 363), (292, 336), (512, 576), (578, 629)]


In [90]:
def highlight_spans(text: str, spans_dict: dict[str, list[tuple[int, int]]]):
    """
    Highlight the spans in the text based on the spans dictionary. 
    Args:
        text: The text to highlight
        spans_dict: Dictionary of 'color': list of spans, where each span is (start, end)
    Example:
        spans_dict = {
            '#00acc6': [(0, 4), (10, 16)],
            '#e6a500': [(28, 37)]
        }
    """
    from IPython.display import Markdown, display
    
    # Convert dict to list of (start, end, color) and sort by start position
    all_spans = [(start, end, color) 
                 for color, spans in spans_dict.items() 
                 for start, end in spans]
    all_spans.sort(key=lambda x: x[0])
    
    # Build the marked up text piece by piece
    result = []
    last_idx = 0
    
    for start, end, color in all_spans:
        # Add text before the span
        result.append(text[last_idx:start])
        # Add the highlighted span
        result.append(f"<span style='color: {color};'>{text[start:end]}</span>")
        last_idx = end
    
    # Add any remaining text after the last span
    result.append(text[last_idx:])
    
    # Add color swatches at the end
    for color in spans_dict:
        swatch = f"<span style='color: {color};'>{chr(9608) * 6}</span>"
        result.append(f" {swatch}")
    
    # Join all pieces and display
    marked_text = ''.join(result)
    display(Markdown(marked_text))

In [89]:
highlight_spans(example['gpt-4o'], {'green': entailment_matches})


<span style='color: green;'>The issue of whether to encourage more immigration to the UK is complex and multifaceted</span>, involving economic, social, and political considerations. Here are some arguments for and against encouraging more immigration:

### Arguments for Encouraging More Immigration:

1. **Economic Growth**: <span style='color: green;'>Immigrants can contribute to economic growth by filling labor shortages</span><span style='color: green;'>Immigrants can contribute to economic growth</span> by filling labor shortages, starting businesses, and paying taxes. They often fill essential roles in healthcare, agriculture, and technology.

2. **Demographic Challenges**: <span style='color: green;'>The UK, like many developed countries, faces an aging population</span>. <span style='color: green;'>Immigration can help balance the demographic scales</span> by increasing the proportion of working-age people, which supports the pension system and public services.

3. **Cultural Diversity**: Immigration contributes to cultural diversity, bringing new perspectives, skills, and ideas that can enrich society culturally and economically.

4. **Innovation and Skills**: Immigrants often bring unique skills and knowledge, driving innovation and contributing to sectors like technology, science, and the arts.

5. **Humanitarian Obligations**: As a signatory to various international treaties, the UK has commitments to offer asylum to refugees and those fleeing persecution, aligning immigration policies with humanitarian values.

### Arguments Against Encouraging More Immigration:

1. **Pressure on Public Services**: Increased immigration can strain public services like healthcare, education, and housing, although this is often due to inadequate government investment rather than immigration levels per se.

2. **Social Integration**: Rapid immigration can sometimes lead to challenges in integration, potentially causing social tensions if not managed properly.

3. **Economic Displacement**: Some argue that immigration can lead to lower wages or job displacement for local workers, especially in lower-skilled sectors, although evidence on this point is mixed.

4. **Infrastructure Challenges**: Higher population density can lead to congestion and pressure on infrastructure, necessitating significant investment to accommodate new arrivals.

5. **Political Considerations**: Immigration is a contentious political issue and encouraging more of it can polarize public opinion and impact social cohesion if not carefully managed.

Ultimately, the decision to encourage more immigration should be based on careful consideration of these factors, as well as the specific needs and capacities of the UK. It's important for policies to be balanced, informed by evidence, and adaptable to changing circumstances, ensuring that the benefits of immigration can be maximized while addressing potential challenges. <span style='color: green;'>██████</span>
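
Notice that "Immigrants can contribute to economic growth" shows up twice in the rendered text. The spans `(292, 363)` and `(292, 336)` overlap, and `highlight_spans` emits the full text of each span, so the shared part is repeated. One way around this is to merge overlapping spans before highlighting. A minimal sketch, assuming all spans share a single color (`merge_spans` is a hypothetical helper):

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) spans into disjoint ones."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


spans = [(0, 88), (292, 363), (292, 336), (512, 576), (578, 629)]
print(merge_spans(spans))
# [(0, 88), (292, 363), (512, 576), (578, 629)]
```

With several colors, merging within each color still leaves overlaps between colors, so that case would need a different rendering strategy.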

In [85]:
# Now we're going to run this across all the opinions and highlight each match in the response.

opinion_entailments = []
for opinion_idx, opinion in enumerate(tqdm(example['own_opinion.text'])):
    opinion_entailments.append({})
    opinion_entailments[opinion_idx]['full_result'] = entailment_from_gpt_json(example['question.text'], example['gpt-4o'], opinion)
    opinion_entailments[opinion_idx]['matches'] = process_entailment_result(opinion_entailments[opinion_idx]['full_result'], example['gpt-4o'])


In [91]:
# Highlight the matches for the example
import matplotlib.pyplot as plt
import matplotlib as mpl

# Generate colors using matplotlib's color map
n_colors = len(opinion_entailments)
color_map = plt.cm.rainbow  # You can also try: viridis, plasma, magma, etc.
colors = [mpl.colors.rgb2hex(color_map(i / n_colors)) for i in range(n_colors)]

# Create the spans dictionary using these colors
spans_dict = {color: opinion_entailments[i]['matches'] for i, color in enumerate(colors)}
highlight_spans(example['gpt-4o'], spans_dict)

<span style='color: #8000ff;'>The issue of whether to encourage more immigration to the UK is complex and multifaceted</span><span style='color: #80ffb4;'>The issue of whether to encourage more immigration to the UK is complex and multifaceted</span><span style='color: #e6cd73;'>The issue of whether to encourage more immigration to the UK is complex and multifaceted</span><span style='color: #ff964f;'>The issue of whether to encourage more immigration to the UK is complex and multifaceted</span>, involving economic, social, and political considerations. Here are some arguments for and against encouraging more immigration:

### Arguments for Encouraging More Immigration:

1. **Economic Growth**: <span style='color: #8000ff;'>Immigrants can contribute to economic growth by filling labor shortages</span><span style='color: #8000ff;'>Immigrants can contribute to economic growth</span><span style='color: #4e4dfc;'>Immigrants can contribute to economic growth by filling labor shortages, starting businesses, and paying taxes.</span><span style='color: #4df3ce;'>Immigrants can contribute to economic growth by filling labor shortages, starting businesses, and paying taxes.</span><span style='color: #e6cd73;'>Immigrants can contribute to economic growth by filling labor shortages, starting businesses, and paying taxes.</span><span style='color: #ff964f;'>Immigrants can contribute to economic growth by filling labor shortages</span><span style='color: #ff4d27;'>Immigrants can contribute to economic growth by filling labor shortages, starting businesses, and paying taxes.</span> <span style='color: #e6cd73;'>They often fill essential roles in healthcare, agriculture, and technology.</span><span style='color: #ff964f;'>They often fill essential roles in healthcare, agriculture, and technology</span>.

2. **Demographic Challenges**: <span style='color: #8000ff;'>The UK, like many developed countries, faces an aging population</span>. <span style='color: #8000ff;'>Immigration can help balance the demographic scales</span> by increasing the proportion of working-age people, which supports the pension system and public services.

3. **Cultural Diversity**: <span style='color: #4df3ce;'>Immigration contributes to cultural diversity, bringing new perspectives, skills, and ideas that can enrich society culturally and economically.</span><span style='color: #e6cd73;'>Immigration contributes to cultural diversity, bringing new perspectives, skills, and ideas that can enrich society culturally and economically.</span><span style='color: #ff4d27;'>Immigration contributes to cultural diversity, bringing new perspectives, skills, and ideas that can enrich society culturally and economically.</span>

4. **Innovation and Skills**: Immigrants often bring unique skills and knowledge, driving innovation and contributing to sectors like technology, science, and the arts.

5. **Humanitarian Obligations**: As a signatory to various international treaties, <span style='color: #80ffb4;'>the UK has commitments to offer asylum to refugees and those fleeing persecution</span>, aligning immigration policies with humanitarian values.

### <span style='color: #18cde4;'>Arguments Against Encouraging More Immigration</span>:

1. **<span style='color: #18cde4;'>Pressure on Public Services</span>**: <span style='color: #1996f3;'>Increased immigration can strain public services like healthcare, education, and housing</span><span style='color: #80ffb4;'>Increased immigration can strain public services like healthcare, education, and housing</span><span style='color: #b2f396;'>Increased immigration can strain public services like healthcare, education, and housing</span>, although this is often due to inadequate government investment rather than immigration levels per se.

2. **Social Integration**: <span style='color: #b2f396;'>Rapid immigration can sometimes lead to challenges in integration, potentially causing social tensions</span> if not managed properly.

3. **Economic Displacement**: Some argue that <span style='color: #b2f396;'>immigration can lead to lower wages or job displacement for local workers</span>, especially in lower-skilled sectors, although evidence on this point is mixed.

4. **<span style='color: #18cde4;'>Infrastructure Challenges</span>**: <span style='color: #80ffb4;'>Higher population density can lead to congestion and pressure on infrastructure</span>, necessitating significant investment to accommodate new arrivals.

5. **Political Considerations**: Immigration is a contentious political issue and encouraging more of it can polarize public opinion and impact social cohesion if not carefully managed.

Ultimately, <span style='color: #1996f3;'>the decision to encourage more immigration should be based on careful consideration of these factors</span><span style='color: #b2f396;'>the decision to encourage more immigration should be based on careful consideration of these factors</span>, as well as the specific needs and capacities of the UK. It's important for policies to be balanced, informed by evidence, and adaptable to changing circumstances, ensuring that the benefits of immigration can be maximized while addressing potential challenges. <span style='color: #8000ff;'>██████</span> <span style='color: #4e4dfc;'>██████</span> <span style='color: #1996f3;'>██████</span> <span style='color: #18cde4;'>██████</span> <span style='color: #4df3ce;'>██████</span> <span style='color: #80ffb4;'>██████</span> <span style='color: #b2f396;'>██████</span> <span style='color: #e6cd73;'>██████</span> <span style='color: #ff964f;'>██████</span> <span style='color: #ff4d27;'>██████</span>

## Cross-comparing the different methods

In [99]:
# I apologize for how chaotic and unreadable this cell is, but we want to run all the same functions and save all intermediate results for debugging.

entailment_models = ['gpt-4o-mini', 'gpt-4o']
response_models = ['gemini-1.5-flash-002', 'gpt-4o', 'gemma-2b']

overton_results = []
sample_size = 3
for _, row in tqdm(df_questions.sample(sample_size).iterrows(), total=sample_size, desc="Questions", leave=True):
    question = row['question.text']
    opinions = row['own_opinion.text']
    question_id = row['question_id']
    with tqdm(total=len(opinions), desc="Opinions", leave=False) as opinion_bar:
        for opinion_idx, opinion in enumerate(opinions):
            with tqdm(total=len(response_models), desc="Response Models", leave=False) as response_bar:
                for response_model in response_models:
                    with tqdm(total=len(entailment_models), desc="Entailment Models", leave=False) as entailment_bar:
                        for entailment_model in entailment_models:
                            response = row[response_model]
                            # This is the simple prompt check (which should be the same for all models)
                            result_opinion_represented = is_my_opinion_represented(question, response, opinion, model=entailment_model)
                            # This is the structured cot check, which should be the same for all models
                            result_opinion_represented_structured_cot = is_my_opinion_represented_structured_cot(question, response, opinion, model=entailment_model)
                            result_opinion_represented_structured_cot_score = process_representation_result(result_opinion_represented_structured_cot)


                            # Now we turn to entailment, which we're going to save in a bunch of different formats for debugging
                            entailment_result = entailment_from_gpt_json(question, response, opinion, model=entailment_model)
                            entailment_matches = process_entailment_result(entailment_result, response)

                            overton_results.append({
                                'question_id': question_id,
                                'opinion_idx': opinion_idx,
                                'response_model': response_model,
                                'entailment_model': entailment_model,
                                'is_represented_simple_prompt': result_opinion_represented == 'yes',
                                'is_represented_structured_cot': result_opinion_represented_structured_cot,
                                'is_represented_structured_cot_score': result_opinion_represented_structured_cot_score,
                                'entailment_result': entailment_result,
                                'entailment_matches': entailment_matches
                            })
                            entailment_bar.update(1)
                    response_bar.update(1)
            opinion_bar.update(1)

df_overton_results = pd.DataFrame(overton_results)

KeyError: 'final_score'
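
A single bad reply aborted the whole run here. The nested loops amount to a Cartesian product over opinions, response models, and entailment models, so one option is to iterate over that product directly and record failures instead of raising. A minimal sketch under those assumptions: `run_grid` and the `evaluate` callable are hypothetical stand-ins for the per-combination work in the cell above, and no API calls are made.

```python
from itertools import product


def run_grid(opinions, response_models, entailment_models, evaluate):
    """Call evaluate(...) for every combination; collect rows and failures."""
    rows, failures = [], []
    combos = product(enumerate(opinions), response_models, entailment_models)
    for (opinion_idx, opinion), response_model, entailment_model in combos:
        key = {
            "opinion_idx": opinion_idx,
            "response_model": response_model,
            "entailment_model": entailment_model,
        }
        try:
            result = evaluate(opinion, response_model, entailment_model)
            rows.append({**key, **result})
        except Exception as exc:  # keep going, but remember what failed
            failures.append({**key, "error": repr(exc)})
    return rows, failures


def fake_evaluate(opinion, response_model, entailment_model):
    # Stand-in for the real checks: fails for one model pairing on purpose
    if response_model == "gemma-2b" and entailment_model == "gpt-4o":
        raise KeyError("final_score")
    return {"score": len(opinion)}


rows, failures = run_grid(
    ["opinion one", "opinion two"],
    ["gpt-4o", "gemma-2b"],
    ["gpt-4o-mini", "gpt-4o"],
    fake_evaluate,
)
print(len(rows), "rows,", len(failures), "failures")
```

The `failures` list can be retried afterwards without rerunning the combinations that already succeeded.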

In [105]:
df_overton_results = pd.DataFrame(overton_results)

In [124]:
df_overton_results.to_csv('data/overton_results.csv', index=False)

In [135]:
df_overton_results.groupby(['model', 'question_id'])['is_represented_simple_prompt'].mean().reset_index().groupby('model')['is_represented_simple_prompt'].mean()

model
gpt-3.5-turbo    0.066667
gpt-4o           0.200000
gpt-4o-mini      0.266667
Name: is_represented_simple_prompt, dtype: float64

## Analysis

In [198]:
df_overton_results = pd.read_csv('data/overton_results.csv')

In [199]:
df_overton_results['is_represented_structured_cot_score.bool'] = df_overton_results['is_represented_structured_cot_score'] > 5
df_overton_results['meta_analysis.simpleXstructured'] = (df_overton_results['is_represented_simple_prompt'] == df_overton_results['is_represented_structured_cot_score.bool']).astype(int)
df_overton_results['meta_analysis.entailmentLength'] = df_overton_results['entailment_matches'].apply(lambda x: len(x))
df_overton_results['meta_analysis.entailmentXstructured'] = ((df_overton_results['meta_analysis.entailmentLength'] > 1) == df_overton_results['is_represented_structured_cot_score.bool'])


In [200]:
print(f"""Meta analysis:
- Percent represented (structured cot) > 5: {df_overton_results['is_represented_structured_cot_score.bool'].mean()}
- Percent represented (simple prompt) == Percent represented (structured cot): {df_overton_results['meta_analysis.simpleXstructured'].mean()}
- Percent entailment length > 1 == represented (structured cot): {df_overton_results['meta_analysis.entailmentXstructured'].mean()}
- Mean Entailment Length: {df_overton_results['meta_analysis.entailmentLength'].mean()}
- Max Entailment Length: {df_overton_results['meta_analysis.entailmentLength'].max()}
""")


Meta analysis:
- Percent represented (structured cot) > 5: 0.4
- Percent represented (simple prompt) == Percent represented (structured cot): 0.7142857142857143
- Percent entailment length > 1 == represented (structured cot): 0.4
- Mean Entailment Length: 24.97142857142857
- Max Entailment Length: 44
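
A word of caution on the entailment numbers. List-valued columns such as `entailment_matches` are written to CSV as strings and come back as strings, so `len` counts characters rather than matches unless the column is parsed first. Even an empty list reads back as `'[]'`, which has length 2, so a `> 1` test on the raw string is always true. That would be consistent with the entailment agreement (0.4) being identical to the plain structured-CoT rate (0.4). The sketch below reproduces the effect with the standard library's `csv` module, as an assumption-light stand-in for the pandas round trip:

```python
import ast
import csv
import io

matches = [(0, 88), (292, 363)]

# Write one row: the list is stored as its string representation
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["entailment_matches"])
writer.writerow([matches])

# Read it back: the cell is now a str, not a list
buffer.seek(0)
rows = list(csv.reader(buffer))
cell = rows[1][0]

parsed = ast.literal_eval(cell)  # safely rebuild the list of tuples
print(type(cell).__name__, len(cell))      # length in characters
print(type(parsed).__name__, len(parsed))  # number of matches
```

In the notebook, the equivalent step would be applying `ast.literal_eval` to the column after `pd.read_csv`, the same way `own_opinion.text` is parsed at the top.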



## Visual example

In [201]:
example = df_overton_results.sample(1).iloc[0]
example_question = df_questions.loc[df_questions['question_id'] == example['question_id']].iloc[0]
opinion = example_question['own_opinion.text'][example['opinion_idx']]
response = example_question[example['model']]


In [202]:
print(f"""Question: {example_question['question.text']}
      
Response: {response}

From model: {example['model']}

Opinion: {opinion}

ANALYSIS:

Is represented (simple prompt): {example['is_represented_simple_prompt']}

Is represented (structured cot): {example['is_represented_structured_cot']}

Entailment result: {example['entailment_matches']}

Entailment Reasoning: {example['entailment_result']}

META ANALYSIS:

Simple prompt == Structured cot: {example['meta_analysis.simpleXstructured'] == 1}

(Entailment length > 1) == structured cot: {example['meta_analysis.entailmentXstructured']}

""")


Question: If we could extend human life by 50 years, should we do it?
      
Response: This question is highly subjective and there are various ethical, social, and scientific considerations to take into account when discussing the idea of extending human life by 50 years. Some arguments in favor of extending human life may include advancements in medical technology and the potential for individuals to experience more of their lives, achieve personal goals, and contribute to society. On the other hand, concerns may be raised about overpopulation, resource depletion, inequality in access to life-extending technologies, and the potential negative impacts on the environment.

Ultimately, the decision to extend human life by 50 years would require careful consideration of these factors and a thorough examination of the possible consequences. It is important to engage in open and transparent discussions with experts, policymakers, ethicists, and members of the public to ensure that any deci