# ImpPres with LLM

In this notebook, implement an improved ImpPres classifier using an LLM.
The classifier must be implemented with DSPy.


In [1]:
import os

import dspy
from dotenv import load_dotenv

# Load XAI_API_KEY from a local env-style file, then register the LM globally.
load_dotenv("grok_key.ini")
lm = dspy.LM('xai/grok-3-mini', api_key=os.environ['XAI_API_KEY'])
dspy.configure(lm=lm)

In [2]:
from typing import List, Literal, Tuple

class TransformationSignature(dspy.Signature):
    premise: str = dspy.InputField()
    hypothesis: str = dspy.InputField()
    label: Literal['entailment', 'neutral', 'contradiction'] = dspy.OutputField()

class ParadigmModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.Predict(TransformationSignature)
    
    def forward(self, transformations: List[Tuple[str, str]]) -> dspy.Prediction:
        results = []
        for premise, hypothesis in transformations:
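            try:
                prediction = self.predict(premise=premise, hypothesis=hypothesis)
                results.append(prediction.label)
            except Exception as e:
                # NOTE: a skipped pair shortens `results`, so its positions no longer
                # line up with the gold labels of this paradigm.
                print(f"Error in prediction: {e}")
                continue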
        
        return dspy.Prediction(results=results)


## Load ImpPres Dataset

In [3]:
from datasets import load_dataset

sections = ['presupposition_all_n_presupposition', 
            'presupposition_both_presupposition', 
            'presupposition_change_of_state', 
            'presupposition_cleft_existence', 
            'presupposition_cleft_uniqueness', 
            'presupposition_only_presupposition', 
            'presupposition_possessed_definites_existence', 
            'presupposition_possessed_definites_uniqueness', 
            'presupposition_question_presupposition']

dataset = {}
for section in sections:
    print(f"Loading dataset for section: {section}")
    dataset[section] = load_dataset("facebook/imppres", section)

Loading dataset for section: presupposition_all_n_presupposition
Loading dataset for section: presupposition_both_presupposition
Loading dataset for section: presupposition_change_of_state
Loading dataset for section: presupposition_cleft_existence
Loading dataset for section: presupposition_cleft_uniqueness
Loading dataset for section: presupposition_only_presupposition
Loading dataset for section: presupposition_possessed_definites_existence
Loading dataset for section: presupposition_possessed_definites_uniqueness
Loading dataset for section: presupposition_question_presupposition


In [4]:
import random
section_to_split = {
    'presupposition_all_n_presupposition': 'all_n_presupposition',
    'presupposition_both_presupposition': 'both_presupposition',
    'presupposition_change_of_state': 'change_of_state',
    'presupposition_cleft_existence': 'cleft_existence',
    'presupposition_cleft_uniqueness': 'cleft_uniqueness',
    'presupposition_only_presupposition': 'only_presupposition',
    'presupposition_possessed_definites_existence': 'possessed_definites_existence',
    'presupposition_possessed_definites_uniqueness': 'possessed_definites_uniqueness',
    'presupposition_question_presupposition': 'question_presupposition',
}


def prepare_paradigm_examples(dataset, sections):
    paradigm_examples = {}
    
    for section in sections:
        paradigm_examples[section] = {}
        split_name = section_to_split[section]
        data = dataset[section][split_name]
        
        # Group by paradigmID
        paradigm_groups = {}
        for item in data:
            paradigm_id = item['paradigmID']
            if paradigm_id not in paradigm_groups:
                paradigm_groups[paradigm_id] = []
            paradigm_groups[paradigm_id].append(item)
        
        for paradigm_id, items in paradigm_groups.items():
            if len(items) == 19:  # Only use complete paradigms
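                # Shuffle to avoid position bias. Note that after shuffling,
                # position i no longer identifies a fixed transformation type,
                # which affects any metric computed per position.
                random.shuffle(items)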
                transformations = [(item['premise'], item['hypothesis']) for item in items]
                true_labels = [item['gold_label'] for item in items]
                paradigm_examples[section][paradigm_id] = {
                    'transformations': transformations,
                    'true_labels': true_labels,
                    'items': items
                }
    
    return paradigm_examples
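
The grouping step in `prepare_paradigm_examples` can be illustrated on its own. The following is a minimal sketch using toy rows; the row values and the helper name `group_by_paradigm` are hypothetical and only mirror the field names used above.

```python
from collections import defaultdict

# Toy rows standing in for ImpPres items (hypothetical values, not real data).
rows = [
    {"paradigmID": 0, "premise": "p0a", "hypothesis": "h0a", "gold_label": 0},
    {"paradigmID": 1, "premise": "p1a", "hypothesis": "h1a", "gold_label": 2},
    {"paradigmID": 0, "premise": "p0b", "hypothesis": "h0b", "gold_label": 1},
]


def group_by_paradigm(items):
    """Collect items that share a paradigmID, preserving their original order."""
    groups = defaultdict(list)
    for item in items:
        groups[item["paradigmID"]].append(item)
    return dict(groups)


groups = group_by_paradigm(rows)
sizes = {pid: len(items) for pid, items in groups.items()}
print(sizes)
```

In the notebook, only groups with exactly 19 items are kept, so a size check like `sizes` above is a quick way to see how many paradigms would be dropped.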


## Evaluate Metrics

Let's use the Hugging Face `evaluate` package to compute the performance of the baseline.


In [5]:
from evaluate import load

accuracy = load("accuracy")
precision = load("precision")
recall = load("recall")
f1 = load("f1")


In [6]:
import evaluate
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
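
The evaluation cells further down request precision, recall, and F1 with `average="macro"`. As a rough illustration of what macro averaging means, here is a pure-Python sketch. The function name `macro_scores` is hypothetical, and it averages over a fixed set of label ids, which is an assumption: library implementations may instead average over the labels that actually occur in the data.

```python
def macro_scores(predictions, references, labels=(0, 1, 2)):
    """Return macro-averaged (precision, recall, f1) over the given label ids."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(predictions, references))
    for label in labels:
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        # Undefined ratios are treated as 0.0 (similar in spirit to zero_division=0).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy label ids: 0=entailment, 1=neutral, 2=contradiction.
macro_p, macro_r, macro_f1 = macro_scores([0, 1, 2, 2], [0, 1, 1, 2])
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

Each class contributes equally to the average regardless of how many samples it has, which is why macro scores can differ noticeably from accuracy on imbalanced sections.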

## Your Turn

Compute the classification metrics for the baseline LLM on each test section of the ANLI dataset, using only samples that have a non-empty 'reason' field.

You must also compare the DeBERTa baseline model with this LLM baseline. The comparison metric should measure the agreement between the two models:
* Number of samples on which both models are correct [Correct]
* Number of samples on which Model1 is correct and Model2 is incorrect [Correct1]
* Number of samples on which Model1 is incorrect and Model2 is correct [Correct2]
* Number of samples on which both models are incorrect [Incorrect]
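
The four agreement buckets listed above can be counted in a single pass over aligned predictions. This is a minimal sketch; the function name `agreement_counts` and the toy label ids are hypothetical, and it assumes both models' predictions and the gold labels use the same encoding and order.

```python
def agreement_counts(preds1, preds2, references):
    """Count samples per agreement bucket for two models against gold labels."""
    counts = {"Correct": 0, "Correct1": 0, "Correct2": 0, "Incorrect": 0}
    for p1, p2, gold in zip(preds1, preds2, references):
        ok1, ok2 = p1 == gold, p2 == gold
        if ok1 and ok2:
            counts["Correct"] += 1
        elif ok1:
            counts["Correct1"] += 1
        elif ok2:
            counts["Correct2"] += 1
        else:
            counts["Incorrect"] += 1
    return counts


gold = [0, 1, 2, 1]
model1 = [0, 1, 0, 2]  # e.g. predictions of Model1
model2 = [0, 2, 2, 2]  # e.g. predictions of Model2
table = agreement_counts(model1, model2, gold)
print(table)
```

The four counts always sum to the number of compared samples, which makes a convenient sanity check.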

In [7]:
from scipy.stats import entropy
from collections import Counter

def calculate_consistency_metric(predictions):
    if not predictions:
        return 0.0
    
    label_counts = Counter(predictions)
    total = len(predictions)
    
    probabilities = [count / total for count in label_counts.values()]
    
    ent = entropy(probabilities, base=len(label_counts)) if len(label_counts) > 1 else 0
    
    # Return consistency (1 - normalized entropy)
    return 1 - ent

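def calculate_accuracy_metric(predictions, true_labels):
    if not true_labels or len(predictions) != len(true_labels):
        return 0.0

    # Gold labels are compared as integer ids elsewhere in this notebook
    # (0=entailment, 1=neutral, 2=contradiction), while predictions are strings,
    # so map the predictions to ids before comparing.
    name_to_id = {"entailment": 0, "neutral": 1, "contradiction": 2}
    correct = sum(
        1 for pred, true in zip(predictions, true_labels)
        if name_to_id.get(pred.strip().lower()) == true
    )
    return correct / len(true_labels)

def combined_metric(example, pred, trace=None, alpha=0.5):
    # `trace` is accepted so the function matches the DSPy metric signature.
    acc = calculate_accuracy_metric(pred.results, example['true_labels'])
    consistency = calculate_consistency_metric(pred.results)

    return alpha * acc + (1 - alpha) * consistency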
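
To make the consistency score easier to interpret, here is a worked example. It is a standard-library sketch that follows the same formula as `calculate_consistency_metric` (entropy normalized by the number of distinct predicted labels); the function name `consistency` and the example label lists are hypothetical.

```python
import math
from collections import Counter


def consistency(predictions):
    """1 - entropy of the predicted-label distribution (log base = number of distinct labels)."""
    counts = Counter(predictions)
    if not counts:
        return 0.0
    if len(counts) == 1:
        return 1.0  # a single label has zero entropy
    total = len(predictions)
    ent = -sum((c / total) * math.log(c / total, len(counts)) for c in counts.values())
    return 1 - ent


uniform = ["entailment"] * 19
mixed = ["entailment"] * 8 + ["neutral"] * 6 + ["contradiction"] * 5
score_uniform = consistency(uniform)
score_mixed = consistency(mixed)
print(score_uniform, round(score_mixed, 4))
```

An 8/6/5 split over 19 predictions scores about 0.0174, which is consistent with the value that dominates the `presupposition_both_presupposition` log below. Since the gold labels within a paradigm are themselves mixed, an accurate classifier will tend to score low on this metric, which may explain why that section shows high accuracy alongside near-zero consistency.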


In [8]:
def create_training_examples(paradigm_examples, sections, max_examples_per_section=10):
    training_examples = []
    
    for section in sections:
        paradigms = list(paradigm_examples[section].items())
        # Limit number of examples per section for training
        selected_paradigms = paradigms[:max_examples_per_section]
        
        for paradigm_id, paradigm_data in selected_paradigms:
            example = dspy.Example(
                transformations=paradigm_data['transformations'],
                true_labels=paradigm_data['true_labels']
            ).with_inputs('transformations')
            
            training_examples.append(example)
    
    return training_examples

In [9]:
import numpy as np
from tqdm import tqdm

label_names = ["entailment", "neutral", "contradiction"]
label2id = {label: i for i, label in enumerate(label_names)}

def evaluate_section(module, paradigm_examples, section):
    paradigms = paradigm_examples[section]
    
    all_predictions = []
    all_true_labels = []
    consistency_scores = []
    transformation_results = {i: {'predictions': [], 'true_labels': []} for i in range(19)}
    
    for paradigm_id, paradigm_data in tqdm(paradigms.items()):
        pred = module(paradigm_data['transformations'])
        
        all_predictions.extend(pred.results)
        all_true_labels.extend(paradigm_data['true_labels'])
        consistency = calculate_consistency_metric(pred.results)
        print(f"Paradigm {paradigm_id} consistency: {consistency}")
        consistency_scores.append(consistency)
        
        for i, (prediction, true_label) in enumerate(zip(pred.results, paradigm_data['true_labels'])):
            transformation_results[i]['predictions'].append(label2id[prediction.strip().lower()])
            transformation_results[i]['true_labels'].append(true_label)
    
    numbered_all_predictions = [label2id[pred.strip().lower()] for pred in all_predictions]
    
    total_accuracy = accuracy.compute(predictions=numbered_all_predictions, references=all_true_labels)["accuracy"]
    total_precision = precision.compute(predictions=numbered_all_predictions, references=all_true_labels, average="macro")["precision"]
    total_recall = recall.compute(predictions=numbered_all_predictions, references=all_true_labels, average="macro")["recall"]
    total_f1 = f1.compute(predictions=numbered_all_predictions, references=all_true_labels, average="macro")["f1"]
    avg_consistency = np.mean(consistency_scores) if consistency_scores else 0.0
    
    print("Calculate metrics by transformation")
    transformation_metrics = {}
    for i in tqdm(range(19)):
        if transformation_results[i]['predictions']:
            trans_acc = accuracy.compute(
                predictions=transformation_results[i]['predictions'], 
                references=transformation_results[i]['true_labels']
            )["accuracy"]
            trans_p = precision.compute(
                predictions=transformation_results[i]['predictions'], 
                references=transformation_results[i]['true_labels'], 
                average='macro', zero_division=0
            )["precision"]
            trans_r = recall.compute(
                predictions=transformation_results[i]['predictions'], 
                references=transformation_results[i]['true_labels'], 
                average='macro', zero_division=0
            )["recall"]
            trans_f1 = f1.compute(
                predictions=transformation_results[i]['predictions'], 
                references=transformation_results[i]['true_labels'], 
                average='macro'
            )["f1"]
            transformation_metrics[i] = {
                'accuracy': trans_acc,
                'precision': trans_p,
                'recall': trans_r,
                'f1': trans_f1
            }
    
    return {
        'accuracy': total_accuracy,
        'precision': total_precision,
        'recall': total_recall,
        'f1': total_f1,
        'consistency': avg_consistency,
        'transformation_metrics': transformation_metrics
    }

In [10]:
from dspy import MIPROv2
label_names = ["entailment", "neutral", "contradiction"]
label2id = {label: i for i, label in enumerate(label_names)}

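def exact_match_metric(example, pred, trace=None):
    # True only if every transformation in the paradigm is labelled correctly.
    # A length mismatch (e.g. a skipped prediction) counts as a failure, and
    # unknown labels map to None instead of raising KeyError.
    if len(pred.results) != len(example['true_labels']):
        return False
    return all(
        label2id.get(p.strip().lower()) == t
        for p, t in zip(pred.results, example['true_labels'])
    )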

optimizer = MIPROv2(metric=exact_match_metric)

In [11]:
import pandas as pd
from tqdm import tqdm

def run_evaluation():
    print("Preparing paradigm examples...")
    paradigm_examples = prepare_paradigm_examples(dataset, sections)
    
    print("Creating training examples...")
    training_examples = create_training_examples(paradigm_examples, sections)
    
    print("Initializing module...")
    module = ParadigmModule()
    
    print("Optimizing module...")
    compiled_module = optimizer.compile(
        module, 
        trainset=training_examples, 
        requires_permission_to_run=False
    )
    return compiled_module, paradigm_examples

def evaluate_all_sections(module, paradigm_examples):
    print("Running evaluation...")
    results = {}
    
    for section in tqdm(sections):
        if section == 'presupposition_all_n_presupposition':
            continue
        print(f"Evaluating section: {section}")
        section_results = evaluate_section(module, paradigm_examples, section)
        results[section] = section_results
        
        print(f"Section {section}:")
        print(f"  Accuracy: {section_results['accuracy']:.4f}")
        print(f"  Precision: {section_results['precision']:.4f}")
        print(f"  Recall: {section_results['recall']:.4f}")
        print(f"  F1: {section_results['f1']:.4f}")
        print(f"  Consistency: {section_results['consistency']:.4f}")
        print()
    
    return results


def create_results_table(results, sections):
    # Only tabulate sections that were actually evaluated; evaluate_all_sections
    # skips some, and indexing results[...] for them would raise KeyError.
    evaluated = [section for section in sections if section in results]

    section_data = []
    for section in evaluated:
        section_data.append({
            'Section': section,
            'Accuracy': results[section]['accuracy'],
            'Precision': results[section]['precision'],
            'Recall': results[section]['recall'],
            'F1': results[section]['f1'],
            'Consistency': results[section]['consistency']
        })
    section_df = pd.DataFrame(section_data)
    
    # Transformation-level results (averaged across sections)
    transformation_data = []
    for i in range(19):
        trans_metrics = {'transformation': i}
        for metric in ['accuracy', 'precision', 'recall', 'f1']:
            values = []
            for section in evaluated:
                if i in results[section]['transformation_metrics']:
                    values.append(results[section]['transformation_metrics'][i][metric])
            trans_metrics[metric] = np.mean(values) if values else 0.0
        transformation_data.append(trans_metrics)
    transformation_df = pd.DataFrame(transformation_data)
    
    return section_df, transformation_df


In [12]:
trained_module, paradigm_examples = run_evaluation()

Preparing paradigm examples...


2025/08/03 13:35:08 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 10
minibatch: True
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 72

2025/08/03 13:35:08 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/08/03 13:35:08 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/08/03 13:35:08 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Creating training examples...
Initializing module...
Optimizing module...
Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6


 33%|███▎      | 6/18 [00:00<00:00, 44.09it/s]


Bootstrapped 4 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Bootstrapping set 4/6


 39%|███▉      | 7/18 [00:00<00:00, 59.91it/s]


Bootstrapped 3 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
Bootstrapping set 5/6


 33%|███▎      | 6/18 [00:00<00:00, 70.88it/s]


Bootstrapped 4 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Bootstrapping set 6/6


  6%|▌         | 1/18 [00:00<00:00, 53.60it/s]
2025/08/03 13:35:08 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/08/03 13:35:08 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.


2025/08/03 13:35:26 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...



KeyboardInterrupt: 

In [None]:
results = evaluate_all_sections(trained_module, paradigm_examples)

Running evaluation...


  0%|          | 0/9 [00:00<?, ?it/s]

Evaluating section: presupposition_both_presupposition




Paradigm 0 consistency: 0.01736983605538045
Paradigm 1 consistency: 0.01736983605538034
Paradigm 2 consistency: 0.01736983605538045
Paradigm 3 consistency: 0.01736983605538034
Paradigm 4 consistency: 0.01736983605538034
Paradigm 5 consistency: 0.01736983605538045
Paradigm 6 consistency: 0.01736983605538034
Paradigm 7 consistency: 0.01736983605538034
Paradigm 8 consistency: 0.01736983605538045
Paradigm 9 consistency: 0.03826325692095478





Paradigm 10 consistency: 0.047909643385225076


 11%|█         | 11/100 [00:00<00:01, 53.13it/s]

Paradigm 11 consistency: 0.01736983605538034
Paradigm 12 consistency: 0.01736983605538045
Paradigm 13 consistency: 0.01736983605538034
Paradigm 14 consistency: 0.01736983605538034
Paradigm 15 consistency: 0.03826325692095478
Paradigm 16 consistency: 0.047909643385224854
Paradigm 17 consistency: 0.03826325692095478
Paradigm 18 consistency: 0.01736983605538034
Paradigm 19 consistency: 0.01736983605538045
Paradigm 20 consistency: 0.01736983605538034
Paradigm 21 consistency: 0.01736983605538034




Paradigm 22 consistency: 0.01736983605538045
Paradigm 23 consistency: 0.0024805242792796944
Paradigm 24 consistency: 0.01736983605538034
Paradigm 25 consistency: 0.07413710786179006
Paradigm 26 consistency: 0.01736983605538034
Paradigm 27 consistency: 0.01736983605538045
Paradigm 28 consistency: 0.01736983605538045
Paradigm 29 consistency: 0.03826325692095478
Paradigm 30 consistency: 0.0024805242792796944
Paradigm 31 consistency: 0.01736983605538045
Paradigm 32 consistency: 0.01736983605538034
Paradigm 33 consistency: 0.03826325692095478




Paradigm 34 consistency: 0.01736983605538034
Paradigm 35 consistency: 0.047909643385225076
Paradigm 36 consistency: 0.01736983605538045
Paradigm 37 consistency: 0.07413710786179006
Paradigm 38 consistency: 0.01736983605538045
Paradigm 39 consistency: 0.010502461377285166
Paradigm 40 consistency: 0.01736983605538045
Paradigm 41 consistency: 0.01736983605538045
Paradigm 42 consistency: 0.01736983605538045
Paradigm 43 consistency: 0.01736983605538034
Paradigm 44 consistency: 0.01736983605538034
Paradigm 45 consistency: 0.01736983605538034





Paradigm 46 consistency: 0.047909643385225076
Paradigm 47 consistency: 0.01736983605538034
Paradigm 48 consistency: 0.01736983605538034
Paradigm 49 consistency: 0.01736983605538045
Paradigm 50 consistency: 0.03826325692095478
Paradigm 51 consistency: 0.01736983605538045
Paradigm 52 consistency: 0.01736983605538045
Paradigm 53 consistency: 0.01736983605538045
Paradigm 54 consistency: 0.01736983605538045
Paradigm 55 consistency: 0.01736983605538045
Paradigm 56 consistency: 0.01736983605538045
Paradigm 57 consistency: 0.01736983605538045
Paradigm 58 consistency: 0.01736983605538034
Paradigm 59 consistency: 0.01736983605538045
Paradigm 60 consistency: 0.010502461377285166


 61%|██████    | 61/100 [00:01<00:00, 56.14it/s]


Paradigm 61 consistency: 0.01736983605538045
Paradigm 62 consistency: 0.01736983605538034
Paradigm 63 consistency: 0.01736983605538045
Paradigm 64 consistency: 0.01736983605538034
Paradigm 65 consistency: 0.01736983605538034
Paradigm 66 consistency: 0.01736983605538045
Paradigm 67 consistency: 0.01736983605538045
Paradigm 68 consistency: 0.01736983605538034
Paradigm 69 consistency: 0.0024805242792796944
Paradigm 70 consistency: 0.010502461377285166
Paradigm 71 consistency: 0.03826325692095478
Paradigm 72 consistency: 0.01736983605538034
Paradigm 73 consistency: 0.01736983605538045


 74%|███████▍  | 74/100 [00:01<00:00, 57.77it/s]

Paradigm 74 consistency: 0.03826325692095478
Paradigm 75 consistency: 0.0958883125600768
Paradigm 76 consistency: 0.01736983605538034
Paradigm 77 consistency: 0.01736983605538034
Paradigm 78 consistency: 0.01736983605538045
Paradigm 79 consistency: 0.010502461377285166
Paradigm 80 consistency: 0.01736983605538045
Paradigm 81 consistency: 0.01736983605538045
Paradigm 82 consistency: 0.01736983605538034
Paradigm 83 consistency: 0.01736983605538045
Paradigm 84 consistency: 0.01736983605538045
Paradigm 85 consistency: 0.01736983605538034




Paradigm 86 consistency: 0.01736983605538034
Paradigm 87 consistency: 0.07413710786179006
Paradigm 88 consistency: 0.047909643385224854
Paradigm 89 consistency: 0.01736983605538045
Paradigm 90 consistency: 0.047909643385224854
Paradigm 91 consistency: 0.01736983605538045
Paradigm 92 consistency: 0.01736983605538045
Paradigm 93 consistency: 0.03826325692095478
Paradigm 94 consistency: 0.01736983605538034
Paradigm 95 consistency: 0.01736983605538045
Paradigm 96 consistency: 0.01736983605538045
Paradigm 97 consistency: 0.03826325692095478


100%|██████████| 100/100 [00:01<00:00, 53.10it/s]


Paradigm 98 consistency: 0.047909643385224854
Paradigm 99 consistency: 0.01736983605538045
1900
Calculate metrics by transformation


100%|██████████| 19/19 [00:01<00:00, 10.84it/s]
 22%|██▏       | 2/9 [00:03<00:13,  1.95s/it]

Section presupposition_both_presupposition:
  Accuracy: 0.9763
  Precision: 0.9767
  Recall: 0.9773
  F1: 0.9769
  Consistency: 0.0234

Evaluating section: presupposition_change_of_state




Paradigm 0 consistency: 0.5145392392540866
Paradigm 1 consistency: 0.39871518417755825
Paradigm 2 consistency: 0.3141871168274789
Paradigm 3 consistency: 0.7025277510807103
Paradigm 4 consistency: 0.39871518417755825
Paradigm 5 consistency: 0.5115068636860558
Paradigm 6 consistency: 0.5115068636860558
Paradigm 7 consistency: 0.5115068636860558
Paradigm 8 consistency: 0.3028137911081479
Paradigm 9 consistency: 0.5115068636860558




Paradigm 10 consistency: 0.6272947035832889
Paradigm 11 consistency: 0.5115068636860558
Paradigm 12 consistency: 0.314187116827479
Paradigm 13 consistency: 0.3141871168274789
Paradigm 14 consistency: 0.314187116827479
Paradigm 15 consistency: 0.37075077614396534
Paradigm 16 consistency: 0.24936273490419736




Paradigm 17 consistency: 0.5115068636860558
Paradigm 18 consistency: 0.314187116827479
Paradigm 19 consistency: 0.5115068636860558
Paradigm 20 consistency: 0.6272947035832889
Paradigm 21 consistency: 0.6272947035832889
Paradigm 22 consistency: 0.37075077614396534
Paradigm 23 consistency: 0.5115068636860558
Paradigm 24 consistency: 0.20033199319823147
Paradigm 25 consistency: 0.3141871168274789




Paradigm 26 consistency: 0.3555333547974129
Paradigm 27 consistency: 0.6272947035832889
Paradigm 28 consistency: 0.5115068636860558
Paradigm 29 consistency: 0.5145392392540866
Paradigm 30 consistency: 0.5115068636860558
Paradigm 31 consistency: 0.3555333547974129
Paradigm 32 consistency: 0.5145392392540866
Paradigm 33 consistency: 0.37075077614396534
Paradigm 34 consistency: 0.5145392392540866
Paradigm 35 consistency: 0.5145392392540866
Paradigm 36 consistency: 0.314187116827479
Paradigm 37 consistency: 0.6272947035832889




Paradigm 38 consistency: 0.37075077614396534
Paradigm 39 consistency: 0.6272947035832889
Paradigm 40 consistency: 0.42378260447830407
Paradigm 41 consistency: 0.5145392392540866
Paradigm 42 consistency: 0.7025277510807103
Paradigm 43 consistency: 0.5145392392540866
Paradigm 44 consistency: 0.7025277510807103
Paradigm 45 consistency: 0.5115068636860558
Paradigm 46 consistency: 0.7025277510807103
Paradigm 47 consistency: 0.5115068636860558
Paradigm 48 consistency: 0.5115068636860558
Paradigm 49 consistency: 0.7025277510807103
Paradigm 50 consistency: 0.3555333547974129
Paradigm 51 consistency: 0.39871518417755825
Paradigm 52 consistency: 0.314187116827479
Paradigm 53 consistency: 0.6272947035832889
Paradigm 54 consistency: 0.16494454884084464
Paradigm 55 consistency: 0.17194835729502667
Paradigm 56 consistency: 0.7025277510807103




Paradigm 57 consistency: 0.3141871168274789
Paradigm 58 consistency: 0.3028137911081479
Paradigm 59 consistency: 0.3028137911081479
Paradigm 60 consistency: 0.2575124304578764
Paradigm 61 consistency: 0.37075077614396534
Paradigm 62 consistency: 0.6272947035832889
Paradigm 63 consistency: 0.5145392392540866
Paradigm 64 consistency: 0.5115068636860558
Paradigm 65 consistency: 0.16494454884084464




Paradigm 66 consistency: 0.7025277510807103
Paradigm 67 consistency: 0.07413710786179006
Paradigm 68 consistency: 0.7025277510807103
Paradigm 69 consistency: 0.7025277510807103
Paradigm 70 consistency: 0.5145392392540866
Paradigm 71 consistency: 0.5145392392540866
Paradigm 72 consistency: 0.12691452647336965
Paradigm 73 consistency: 0.6272947035832889
Paradigm 74 consistency: 0.17194835729502667
Paradigm 75 consistency: 0.5115068636860558
Paradigm 76 consistency: 0.7025277510807103




Paradigm 77 consistency: 0.5115068636860558
Paradigm 78 consistency: 0.314187116827479
Paradigm 79 consistency: 0.39871518417755825
Paradigm 80 consistency: 0.5115068636860558
Paradigm 81 consistency: 0.1148097082393531
Paradigm 82 consistency: 0.3028137911081479
Paradigm 83 consistency: 0.37075077614396534
Paradigm 84 consistency: 0.20033199319823147




Paradigm 85 consistency: 0.3555333547974129
Paradigm 86 consistency: 0.3028137911081479
Paradigm 87 consistency: 0.314187116827479
Paradigm 88 consistency: 0.42378260447830407
Paradigm 89 consistency: 0.314187116827479
Paradigm 90 consistency: 0.7025277510807103
Paradigm 91 consistency: 0.314187116827479
Paradigm 92 consistency: 0.314187116827479
Paradigm 93 consistency: 0.7025277510807103
Paradigm 94 consistency: 0.42378260447830407


100%|██████████| 100/100 [00:02<00:00, 44.65it/s]


Paradigm 95 consistency: 0.5115068636860558
Paradigm 96 consistency: 0.2330839172350092
Paradigm 97 consistency: 0.42378260447830407
Paradigm 98 consistency: 0.314187116827479
Paradigm 99 consistency: 0.39871518417755825
1900
Calculate metrics by transformation


100%|██████████| 19/19 [00:01<00:00, 11.03it/s]
 33%|███▎      | 3/9 [00:08<00:17,  2.88s/it]

Section presupposition_change_of_state:
  Accuracy: 0.5679
  Precision: 0.6619
  Recall: 0.4970
  F1: 0.4706
  Consistency: 0.4412

Evaluating section: presupposition_cleft_existence




Paradigm 0 consistency: 0.3555333547974129
Paradigm 1 consistency: 0.3555333547974129
Paradigm 2 consistency: 0.3555333547974129
Paradigm 3 consistency: 0.3555333547974129
Paradigm 4 consistency: 0.3555333547974129
Paradigm 5 consistency: 0.26342943586645207
Paradigm 6 consistency: 0.24936273490419736
Paradigm 7 consistency: 0.42378260447830407
Paradigm 8 consistency: 0.42378260447830407
Paradigm 9 consistency: 0.5115068636860558
Paradigm 10 consistency: 0.3555333547974129




Paradigm 11 consistency: 0.42378260447830407
Paradigm 12 consistency: 0.3555333547974129
Paradigm 13 consistency: 0.24936273490419736
Paradigm 14 consistency: 0.3555333547974129
Paradigm 15 consistency: 0.24936273490419725
Paradigm 16 consistency: 0.20033199319823147
Paradigm 17 consistency: 0.39871518417755825
Paradigm 18 consistency: 0.3555333547974129




Paradigm 19 consistency: 0.2330839172350092
Paradigm 20 consistency: 0.24936273490419725
Paradigm 21 consistency: 0.24936273490419736
Paradigm 22 consistency: 0.3555333547974129
Paradigm 23 consistency: 0.3555333547974129
Paradigm 24 consistency: 0.3555333547974129
Paradigm 25 consistency: 0.5115068636860558
Paradigm 26 consistency: 0.42378260447830407
Paradigm 27 consistency: 0.3555333547974129
Paradigm 28 consistency: 0.3028137911081479
Paradigm 29 consistency: 0.3555333547974129
Paradigm 30 consistency: 0.17194835729502667




Paradigm 31 consistency: 0.3555333547974129
Paradigm 32 consistency: 0.20033199319823147
Paradigm 33 consistency: 0.3555333547974129
Paradigm 34 consistency: 0.42378260447830407
Paradigm 35 consistency: 0.24936273490419736
Paradigm 36 consistency: 0.42378260447830407
Paradigm 37 consistency: 0.3555333547974129
Paradigm 38 consistency: 0.3141871168274789
Paradigm 39 consistency: 0.24936273490419736
Paradigm 40 consistency: 0.24936273490419736
Paradigm 41 consistency: 0.42378260447830407
Paradigm 42 consistency: 0.24936273490419736




Paradigm 43 consistency: 0.3555333547974129
Paradigm 44 consistency: 0.3555333547974129
Paradigm 45 consistency: 0.42378260447830407
Paradigm 46 consistency: 0.3555333547974129
Paradigm 47 consistency: 0.3141871168274789
Paradigm 48 consistency: 0.2330839172350092
Paradigm 49 consistency: 0.12691452647336965
Paradigm 50 consistency: 0.20033199319823147
Paradigm 51 consistency: 0.3555333547974129
Paradigm 52 consistency: 0.3028137911081479





Paradigm 53 consistency: 0.24936273490419736
Paradigm 54 consistency: 0.2330839172350092
Paradigm 55 consistency: 0.20033199319823147
Paradigm 56 consistency: 0.3555333547974129
Paradigm 57 consistency: 0.3555333547974129
Paradigm 58 consistency: 0.42378260447830407
Paradigm 59 consistency: 0.3555333547974129
Paradigm 60 consistency: 0.42378260447830407
Paradigm 61 consistency: 0.3555333547974129
Paradigm 62 consistency: 0.3555333547974129
Paradigm 63 consistency: 0.314187116827479
Paradigm 64 consistency: 0.3555333547974129
Paradigm 65 consistency: 0.42378260447830407


 66%|██████▌   | 66/100 [00:01<00:00, 53.27it/s]

Paradigm 66 consistency: 0.42378260447830407
Paradigm 67 consistency: 0.20033199319823147
Paradigm 68 consistency: 0.3555333547974129
Paradigm 69 consistency: 0.42378260447830407
Paradigm 70 consistency: 0.17194835729502667
Paradigm 71 consistency: 0.42378260447830407
Paradigm 72 consistency: 0.24936273490419736
Paradigm 73 consistency: 0.24936273490419736
Paradigm 74 consistency: 0.42378260447830407




Paradigm 75 consistency: 0.3555333547974129
Paradigm 76 consistency: 0.3555333547974129
Paradigm 77 consistency: 0.42378260447830407
Paradigm 78 consistency: 0.24936273490419725
Paradigm 79 consistency: 0.3555333547974129
Paradigm 80 consistency: 0.42378260447830407
Paradigm 81 consistency: 0.24936273490419736
Paradigm 82 consistency: 0.3555333547974129
Paradigm 83 consistency: 0.24936273490419736
Paradigm 84 consistency: 0.3555333547974129




Paradigm 85 consistency: 0.24936273490419736
Paradigm 86 consistency: 0.24936273490419736
Paradigm 87 consistency: 0.42378260447830407
Paradigm 88 consistency: 0.3555333547974129
Paradigm 89 consistency: 0.24936273490419736
Paradigm 90 consistency: 0.3555333547974129
Paradigm 91 consistency: 0.3555333547974129
Paradigm 92 consistency: 0.42378260447830407
Paradigm 93 consistency: 0.3555333547974129
Paradigm 94 consistency: 0.42378260447830407


100%|██████████| 100/100 [00:02<00:00, 44.37it/s]


Paradigm 95 consistency: 0.3028137911081479
Paradigm 96 consistency: 0.3555333547974129
Paradigm 97 consistency: 0.314187116827479
Paradigm 98 consistency: 0.42378260447830407
Paradigm 99 consistency: 0.42378260447830407
1900
Calculate metrics by transformation


100%|██████████| 19/19 [00:01<00:00, 11.83it/s]
 44%|████▍     | 4/9 [00:12<00:16,  3.33s/it]

Section presupposition_cleft_existence:
  Accuracy: 0.6884
  Precision: 0.8489
  Recall: 0.6312
  F1: 0.6415
  Consistency: 0.3334

Evaluating section: presupposition_cleft_uniqueness




Paradigm 0 consistency: 0.7025277510807103
Paradigm 1 consistency: 0.6272947035832889
Paradigm 2 consistency: 0.7025277510807103
Paradigm 3 consistency: 0.7025277510807103
Paradigm 4 consistency: 0.7025277510807103
Paradigm 5 consistency: 0.7025277510807103
Paradigm 6 consistency: 0.7025277510807103
Paradigm 7 consistency: 0.7025277510807103
Paradigm 8 consistency: 0.7025277510807103
Paradigm 9 consistency: 0.7025277510807103
Paradigm 10 consistency: 0.7025277510807103




Paradigm 11 consistency: 0.7025277510807103
Paradigm 12 consistency: 0.7025277510807103
Paradigm 13 consistency: 0.7025277510807103
Paradigm 14 consistency: 0.7025277510807103
Paradigm 15 consistency: 0.7025277510807103
Paradigm 16 consistency: 0.7025277510807103
Paradigm 17 consistency: 0.7025277510807103
Paradigm 18 consistency: 0.7025277510807103
Paradigm 19 consistency: 0.5145392392540866
Paradigm 20 consistency: 0.7025277510807103
Paradigm 21 consistency: 0.7025277510807103
Paradigm 22 consistency: 0.7025277510807103





Paradigm 23 consistency: 0.7025277510807103


 24%|██▍       | 24/100 [00:00<00:01, 52.86it/s]


Paradigm 24 consistency: 0.7025277510807103
Paradigm 25 consistency: 0.7025277510807103
Paradigm 26 consistency: 0.7025277510807103
Paradigm 27 consistency: 0.7025277510807103
Paradigm 28 consistency: 0.5145392392540866
Paradigm 29 consistency: 0.7025277510807103
Paradigm 30 consistency: 0.7025277510807103
Paradigm 31 consistency: 0.7025277510807103
Paradigm 32 consistency: 0.7025277510807103
Paradigm 33 consistency: 0.7025277510807103
Paradigm 34 consistency: 0.7025277510807103
Paradigm 35 consistency: 0.7025277510807103
Paradigm 36 consistency: 0.7025277510807103
Paradigm 37 consistency: 0.6272947035832889
Paradigm 38 consistency: 0.7025277510807103


 39%|███▉      | 39/100 [00:00<00:00, 63.60it/s][A

Paradigm 39 consistency: 0.7025277510807103
Paradigm 40 consistency: 0.7025277510807103
Paradigm 41 consistency: 0.7025277510807103
Paradigm 42 consistency: 0.7025277510807103
Paradigm 43 consistency: 0.7025277510807103
Paradigm 44 consistency: 0.7025277510807103
Paradigm 45 consistency: 0.7025277510807103
Paradigm 46 consistency: 0.7025277510807103




Paradigm 47 consistency: 0.6272947035832889
Paradigm 48 consistency: 0.7025277510807103
Paradigm 49 consistency: 0.7025277510807103
Paradigm 50 consistency: 0.7025277510807103
Paradigm 51 consistency: 0.7025277510807103
Paradigm 52 consistency: 0.7025277510807103
Paradigm 53 consistency: 0.7025277510807103
Paradigm 54 consistency: 0.7025277510807103
Paradigm 55 consistency: 0.7025277510807103
Paradigm 56 consistency: 0.7025277510807103
Paradigm 57 consistency: 0.7025277510807103
Paradigm 58 consistency: 0.7025277510807103




Paradigm 59 consistency: 0.5145392392540866
Paradigm 60 consistency: 0.6272947035832889
Paradigm 61 consistency: 0.7025277510807103
Paradigm 62 consistency: 0.7025277510807103
Paradigm 63 consistency: 0.7025277510807103
Paradigm 64 consistency: 0.7025277510807103
Paradigm 65 consistency: 0.7025277510807103
Paradigm 66 consistency: 0.7025277510807103
Paradigm 67 consistency: 0.7025277510807103
Paradigm 68 consistency: 0.7025277510807103
Paradigm 69 consistency: 0.7025277510807103
Paradigm 70 consistency: 0.42378260447830407




Paradigm 71 consistency: 0.7025277510807103
Paradigm 72 consistency: 0.7025277510807103
Paradigm 73 consistency: 0.7025277510807103
Paradigm 74 consistency: 0.7025277510807103
Paradigm 75 consistency: 0.7025277510807103
Paradigm 76 consistency: 0.7025277510807103
Paradigm 77 consistency: 0.7025277510807103
Paradigm 78 consistency: 0.7025277510807103
Paradigm 79 consistency: 0.7025277510807103
Paradigm 80 consistency: 0.5145392392540866
Paradigm 81 consistency: 0.7025277510807103
Paradigm 82 consistency: 0.7025277510807103
Paradigm 83 consistency: 0.7025277510807103
Paradigm 84 consistency: 0.7025277510807103
Paradigm 85 consistency: 0.7025277510807103
Paradigm 86 consistency: 0.5145392392540866
Paradigm 87 consistency: 0.7025277510807103
Paradigm 88 consistency: 0.7025277510807103


100%|██████████| 100/100 [00:01<00:00, 53.51it/s]

Paradigm 89 consistency: 0.7025277510807103
Paradigm 90 consistency: 0.7025277510807103
Paradigm 91 consistency: 0.7025277510807103
Paradigm 92 consistency: 0.7025277510807103
Paradigm 93 consistency: 0.7025277510807103
Paradigm 94 consistency: 0.7025277510807103
Paradigm 95 consistency: 0.7025277510807103
Paradigm 96 consistency: 0.7025277510807103
Paradigm 97 consistency: 0.7025277510807103
Paradigm 98 consistency: 0.7025277510807103
Paradigm 99 consistency: 0.7025277510807103





1900
Calculate metrics by transformation


100%|██████████| 19/19 [00:01<00:00, 10.34it/s]
 56%|█████▌    | 5/9 [00:16<00:14,  3.54s/it]

Section presupposition_cleft_uniqueness:
  Accuracy: 0.4789
  Precision: 0.7490
  Recall: 0.3950
  F1: 0.3120
  Consistency: 0.6873

Evaluating section: presupposition_only_presupposition




Paradigm 0 consistency: 0.314187116827479
Paradigm 1 consistency: 0.16494454884084464
Paradigm 2 consistency: 0.2330839172350092
Paradigm 3 consistency: 0.20033199319823147
Paradigm 4 consistency: 0.24936273490419736
Paradigm 5 consistency: 0.07768278518151717
Paradigm 6 consistency: 0.3141871168274789
Paradigm 7 consistency: 0.314187116827479
Paradigm 8 consistency: 0.16494454884084464




Paradigm 9 consistency: 0.20033199319823147




Paradigm 10 consistency: 0.20033199319823147
Paradigm 11 consistency: 0.39871518417755825
Paradigm 12 consistency: 0.3555333547974129
Paradigm 13 consistency: 0.16494454884084464
Paradigm 14 consistency: 0.24936273490419736
Paradigm 15 consistency: 0.24936273490419736
Paradigm 16 consistency: 0.20033199319823147
Paradigm 17 consistency: 0.20033199319823147
Paradigm 18 consistency: 0.20033199319823147




Paradigm 19 consistency: 0.20033199319823147
Paradigm 20 consistency: 0.20033199319823147
Paradigm 21 consistency: 0.20033199319823147
Paradigm 22 consistency: 0.12691452647336965
Paradigm 23 consistency: 0.16494454884084464




Paradigm 24 consistency: 0.39871518417755825
Paradigm 25 consistency: 0.20033199319823147
Paradigm 26 consistency: 0.17194835729502667
Paradigm 27 consistency: 0.20033199319823147
Paradigm 28 consistency: 0.24936273490419736
Paradigm 29 consistency: 0.39871518417755825
Paradigm 30 consistency: 0.17194835729502667
Paradigm 31 consistency: 0.24936273490419725
Paradigm 32 consistency: 0.24936273490419725




Paradigm 33 consistency: 0.314187116827479
Paradigm 34 consistency: 0.314187116827479
Paradigm 35 consistency: 0.314187116827479
Paradigm 36 consistency: 0.17194835729502667
Paradigm 37 consistency: 0.314187116827479
Paradigm 38 consistency: 0.20033199319823147
Paradigm 39 consistency: 0.17194835729502667
Paradigm 40 consistency: 0.3141871168274789




Paradigm 41 consistency: 0.3141871168274789
Paradigm 42 consistency: 0.24936273490419736
Paradigm 43 consistency: 0.3141871168274789
Paradigm 44 consistency: 0.24936273490419736
Paradigm 45 consistency: 0.24936273490419725
Paradigm 46 consistency: 0.14194027202555726
Paradigm 47 consistency: 0.24936273490419725





Paradigm 48 consistency: 0.24936273490419725
Paradigm 49 consistency: 0.24936273490419725
Paradigm 50 consistency: 0.24936273490419725
Paradigm 51 consistency: 0.39871518417755825
Paradigm 52 consistency: 0.24936273490419725
Paradigm 53 consistency: 0.314187116827479


 54%|█████▍    | 54/100 [00:00<00:00, 59.56it/s][A


Paradigm 54 consistency: 0.24936273490419736
Paradigm 55 consistency: 0.314187116827479
Paradigm 56 consistency: 0.314187116827479
Paradigm 57 consistency: 0.20033199319823147
Paradigm 58 consistency: 0.24936273490419736
Paradigm 59 consistency: 0.314187116827479


 60%|██████    | 60/100 [00:01<00:00, 58.65it/s][A


Paradigm 60 consistency: 0.20033199319823147
Paradigm 61 consistency: 0.314187116827479
Paradigm 62 consistency: 0.20033199319823147
Paradigm 63 consistency: 0.20033199319823147
Paradigm 64 consistency: 0.3141871168274789
Paradigm 65 consistency: 0.24936273490419725


 66%|██████▌   | 66/100 [00:01<00:00, 57.54it/s][A

Paradigm 66 consistency: 0.314187116827479
Paradigm 67 consistency: 0.17194835729502667
Paradigm 68 consistency: 0.24936273490419725
Paradigm 69 consistency: 0.20033199319823147




Paradigm 70 consistency: 0.24936273490419736
Paradigm 71 consistency: 0.3141871168274789
Paradigm 72 consistency: 0.314187116827479
Paradigm 73 consistency: 0.39871518417755825
Paradigm 74 consistency: 0.20033199319823147





Paradigm 75 consistency: 0.24936273490419725
Paradigm 76 consistency: 0.314187116827479
Paradigm 77 consistency: 0.07413710786179006
Paradigm 78 consistency: 0.3141871168274789
Paradigm 79 consistency: 0.20033199319823147
Paradigm 80 consistency: 0.20033199319823147
Paradigm 81 consistency: 0.17194835729502667
Paradigm 82 consistency: 0.20033199319823147
Paradigm 83 consistency: 0.3141871168274789


 84%|████████▍ | 84/100 [00:01<00:00, 49.13it/s][A

Paradigm 84 consistency: 0.314187116827479
Paradigm 85 consistency: 0.2330839172350092
Paradigm 86 consistency: 0.3141871168274789
Paradigm 87 consistency: 0.314187116827479




Paradigm 88 consistency: 0.24936273490419736
Paradigm 89 consistency: 0.24936273490419736
Paradigm 90 consistency: 0.17194835729502667
Paradigm 91 consistency: 0.314187116827479
Paradigm 92 consistency: 0.42378260447830407
Paradigm 93 consistency: 0.24936273490419736
Paradigm 94 consistency: 0.3141871168274789




Paradigm 95 consistency: 0.314187116827479
Paradigm 96 consistency: 0.314187116827479
Paradigm 97 consistency: 0.314187116827479


100%|██████████| 100/100 [00:02<00:00, 49.61it/s]


Paradigm 98 consistency: 0.314187116827479
Paradigm 99 consistency: 0.20033199319823147
1900
Calculate metrics by transformation


100%|██████████| 19/19 [00:01<00:00, 11.95it/s]
 67%|██████▋   | 6/9 [00:19<00:10,  3.65s/it]

Section presupposition_only_presupposition:
  Accuracy: 0.6995
  Precision: 0.8292
  Recall: 0.6557
  F1: 0.6786
  Consistency: 0.2541

Evaluating section: presupposition_possessed_definites_existence




Paradigm 0 consistency: 0.047909643385224854
Paradigm 1 consistency: 0.01736983605538045
Paradigm 2 consistency: 0.01736983605538034
Paradigm 3 consistency: 0.01736983605538045
Paradigm 4 consistency: 0.0958883125600768
Paradigm 5 consistency: 0.01736983605538045




Paradigm 6 consistency: 0.01736983605538034
Paradigm 7 consistency: 0.01736983605538045
Paradigm 8 consistency: 0.01736983605538034





Paradigm 9 consistency: 0.16494454884084464
Paradigm 10 consistency: 0.047909643385224854
Paradigm 11 consistency: 0.01736983605538034
Paradigm 12 consistency: 0.01736983605538034
Paradigm 13 consistency: 0.3141871168274789
Paradigm 14 consistency: 0.01736983605538045
Paradigm 15 consistency: 0.01736983605538045
Paradigm 16 consistency: 0.0024805242792796944
Paradigm 17 consistency: 0.047909643385225076


 18%|█▊        | 18/100 [00:00<00:01, 45.53it/s][A

Paradigm 18 consistency: 0.01736983605538034
Paradigm 19 consistency: 0.01736983605538034
Paradigm 20 consistency: 0.047909643385224854
Paradigm 21 consistency: 0.047909643385224854





Paradigm 22 consistency: 0.047909643385224854
Paradigm 23 consistency: 0.01736983605538045
Paradigm 24 consistency: 0.01736983605538045
Paradigm 25 consistency: 0.01736983605538034
Paradigm 26 consistency: 0.01736983605538034
Paradigm 27 consistency: 0.047909643385224854
Paradigm 28 consistency: 0.01736983605538045


 29%|██▉       | 29/100 [00:00<00:01, 47.55it/s][A

Paradigm 29 consistency: 0.047909643385225076
Paradigm 30 consistency: 0.047909643385225076
Paradigm 31 consistency: 0.2330839172350092
Paradigm 32 consistency: 0.17194835729502667




Paradigm 33 consistency: 0.01736983605538034
Paradigm 34 consistency: 0.0958883125600768
Paradigm 35 consistency: 0.01736983605538045
Paradigm 36 consistency: 0.01736983605538034
Paradigm 37 consistency: 0.01736983605538045
Paradigm 38 consistency: 0.01736983605538034
Paradigm 39 consistency: 0.37075077614396534




Paradigm 40 consistency: 0.01736983605538034
Paradigm 41 consistency: 0.24936273490419736
Paradigm 42 consistency: 0.5115068636860558
Paradigm 43 consistency: 0.01736983605538034
Paradigm 44 consistency: 0.01736983605538045




Paradigm 45 consistency: 0.01736983605538034
Paradigm 46 consistency: 0.03826325692095478
Paradigm 47 consistency: 0.01736983605538045
Paradigm 48 consistency: 0.01736983605538034
Paradigm 49 consistency: 0.047909643385224854




Paradigm 50 consistency: 0.01736983605538045
Paradigm 51 consistency: 0.01736983605538045
Paradigm 52 consistency: 0.0958883125600769





Paradigm 53 consistency: 0.0024805242792796944
Paradigm 54 consistency: 0.01736983605538045
Paradigm 55 consistency: 0.07768278518151717
Paradigm 56 consistency: 0.047909643385224854
Paradigm 57 consistency: 0.01736983605538045
Paradigm 58 consistency: 0.047909643385225076
Paradigm 59 consistency: 0.047909643385225076
Paradigm 60 consistency: 0.01736983605538045
Paradigm 61 consistency: 0.047909643385224854
Paradigm 62 consistency: 0.01736983605538045


 63%|██████▎   | 63/100 [00:01<00:00, 50.09it/s][A

Paradigm 63 consistency: 0.01736983605538045
Paradigm 64 consistency: 0.01736983605538045
Paradigm 65 consistency: 0.01736983605538034
Paradigm 66 consistency: 0.01736983605538034
Paradigm 67 consistency: 0.01736983605538034
Paradigm 68 consistency: 0.01736983605538034





Paradigm 69 consistency: 0.01736983605538034
Paradigm 70 consistency: 0.01736983605538045
Paradigm 71 consistency: 0.01736983605538045
Paradigm 72 consistency: 0.01736983605538045
Paradigm 73 consistency: 0.047909643385224854
Paradigm 74 consistency: 0.0958883125600769
Paradigm 75 consistency: 0.01736983605538045
Paradigm 76 consistency: 0.07413710786179006
Paradigm 77 consistency: 0.0958883125600768
Paradigm 78 consistency: 0.01736983605538045


 79%|███████▉  | 79/100 [00:01<00:00, 61.62it/s][A


Paradigm 79 consistency: 0.047909643385225076
Paradigm 80 consistency: 0.01736983605538045
Paradigm 81 consistency: 0.01736983605538045
Paradigm 82 consistency: 0.0958883125600768
Paradigm 83 consistency: 0.01736983605538034
Paradigm 84 consistency: 0.12691452647336965
Paradigm 85 consistency: 0.01736983605538045


 86%|████████▌ | 86/100 [00:01<00:00, 60.82it/s][A

Paradigm 86 consistency: 0.314187116827479
Paradigm 87 consistency: 0.047909643385224854
Paradigm 88 consistency: 0.047909643385225076
Paradigm 89 consistency: 0.01736983605538045
Paradigm 90 consistency: 0.047909643385224854
Paradigm 91 consistency: 0.01736983605538034


100%|██████████| 100/100 [00:01<00:00, 51.70it/s]

Paradigm 92 consistency: 0.047909643385225076
Paradigm 93 consistency: 0.047909643385225076
Paradigm 94 consistency: 0.07413710786178995
Paradigm 95 consistency: 0.047909643385225076
Paradigm 96 consistency: 0.01736983605538034
Paradigm 97 consistency: 0.047909643385224854
Paradigm 98 consistency: 0.12691452647336965
Paradigm 99 consistency: 0.01736983605538045
1900





Calculate metrics by transformation


100%|██████████| 19/19 [00:01<00:00, 11.21it/s]
 78%|███████▊  | 7/9 [00:23<00:07,  3.75s/it]

Section presupposition_possessed_definites_existence:
  Accuracy: 0.9374
  Precision: 0.9449
  Recall: 0.9289
  F1: 0.9355
  Consistency: 0.0549

Evaluating section: presupposition_possessed_definites_uniqueness




Paradigm 0 consistency: 0.37075077614396534
Paradigm 1 consistency: 0.7025277510807103
Paradigm 2 consistency: 0.7025277510807103
Paradigm 3 consistency: 0.7025277510807103
Paradigm 4 consistency: 0.7025277510807103
Paradigm 5 consistency: 0.7025277510807103
Paradigm 6 consistency: 0.7025277510807103
Paradigm 7 consistency: 0.7025277510807103
Paradigm 8 consistency: 0.7025277510807103
Paradigm 9 consistency: 0.7025277510807103
Paradigm 10 consistency: 0.7025277510807103
Paradigm 11 consistency: 0.7025277510807103




Paradigm 12 consistency: 0.7025277510807103
Paradigm 13 consistency: 0.7025277510807103
Paradigm 14 consistency: 0.7025277510807103
Paradigm 15 consistency: 0.7025277510807103




Paradigm 16 consistency: 0.5145392392540866
Paradigm 17 consistency: 0.7025277510807103
Paradigm 18 consistency: 0.5115068636860558
Paradigm 19 consistency: 0.5145392392540866
Paradigm 20 consistency: 0.7025277510807103
Paradigm 21 consistency: 0.7025277510807103
Paradigm 22 consistency: 0.5145392392540866
Paradigm 23 consistency: 0.7025277510807103




Paradigm 24 consistency: 0.7025277510807103
Paradigm 25 consistency: 0.5145392392540866
Paradigm 26 consistency: 0.7025277510807103
Paradigm 27 consistency: 0.7025277510807103
Paradigm 28 consistency: 0.7025277510807103




Paradigm 29 consistency: 0.20033199319823147
Paradigm 30 consistency: 0.5145392392540866
Paradigm 31 consistency: 0.7025277510807103
Paradigm 32 consistency: 0.7025277510807103
Paradigm 33 consistency: 0.7025277510807103
Paradigm 34 consistency: 0.7025277510807103




Paradigm 35 consistency: 0.7025277510807103
Paradigm 36 consistency: 0.7025277510807103
Paradigm 37 consistency: 0.7025277510807103
Paradigm 38 consistency: 0.7025277510807103
Paradigm 39 consistency: 0.7025277510807103
Paradigm 40 consistency: 0.7025277510807103
Paradigm 41 consistency: 0.7025277510807103
Paradigm 42 consistency: 0.7025277510807103
Paradigm 43 consistency: 0.7025277510807103
Paradigm 44 consistency: 0.7025277510807103
Paradigm 45 consistency: 0.7025277510807103




Paradigm 46 consistency: 0.7025277510807103
Paradigm 47 consistency: 0.7025277510807103
Paradigm 48 consistency: 0.3141871168274789
Paradigm 49 consistency: 0.7025277510807103
Paradigm 50 consistency: 0.7025277510807103
Paradigm 51 consistency: 0.7025277510807103




Paradigm 52 consistency: 0.7025277510807103
Paradigm 53 consistency: 0.7025277510807103
Paradigm 54 consistency: 0.7025277510807103
Paradigm 55 consistency: 0.7025277510807103
Paradigm 56 consistency: 0.7025277510807103
Paradigm 57 consistency: 0.7025277510807103
Paradigm 58 consistency: 0.7025277510807103
Paradigm 59 consistency: 0.7025277510807103
Paradigm 60 consistency: 0.7025277510807103
Paradigm 61 consistency: 0.7025277510807103
Paradigm 62 consistency: 0.7025277510807103
Paradigm 63 consistency: 0.7025277510807103
Paradigm 64 consistency: 0.7025277510807103
Paradigm 65 consistency: 0.7025277510807103
Paradigm 66 consistency: 0.7025277510807103
Paradigm 67 consistency: 0.7025277510807103
Paradigm 68 consistency: 0.7025277510807103
Paradigm 69 consistency: 0.7025277510807103
Paradigm 70 consistency: 0.7025277510807103
Paradigm 71 consistency: 0.7025277510807103
Paradigm 72 consistency: 0.7025277510807103
Paradigm 73 consistency: 0.7025277510807103
Paradigm 74 consistency: 0.7025277510807103
Paradigm 75 consistency: 0.7025277510807103
Paradigm 76 consistency: 0.7025277510807103
Paradigm 77 consistency: 0.7025277510807103
Paradigm 78 consistency: 0.7025277510807103
Paradigm 79 consistency: 0.7025277510807103
Paradigm 80 consistency: 0.7025277510807103
Paradigm 81 consistency: 0.7025277510807103
Paradigm 82 consistency: 0.7025277510807103
Paradigm 83 consistency: 0.7025277510807103
Paradigm 84 consistency: 0.7025277510807103
Paradigm 85 consistency: 0.7025277510807103
Paradigm 86 consistency: 0.7025277510807103
Paradigm 87 consistency: 0.7025277510807103
Paradigm 88 consistency: 0.7025277510807103
Paradigm 89 consistency: 0.7025277510807103
Paradigm 90 consistency: 0.5145392392540866
Paradigm 91 consistency: 0.7025277510807103
Paradigm 92 consistency: 0.7025277510807103
Paradigm 93 consistency: 0.7025277510807103
Paradigm 94 consistency: 0.7025277510807103
Paradigm 95 consistency: 0.7025277510807103
Paradigm 96 consistency: 0.7025277510807103
Paradigm 97 consistency: 0.7025277510807103
Paradigm 98 consistency: 0.7025277510807103


100%|██████████| 100/100 [1:04:47<00:00, 38.87s/it]

Paradigm 99 consistency: 0.7025277510807103
1900





Calculate metrics by transformation


100%|██████████| 19/19 [00:01<00:00, 11.17it/s]
 89%|████████▉ | 8/9 [1:05:12<20:26, 1226.90s/it]

Section presupposition_possessed_definites_uniqueness:
  Accuracy: 0.4805
  Precision: 0.7378
  Recall: 0.3968
  F1: 0.3150
  Consistency: 0.6771

Evaluating section: presupposition_question_presupposition




Paradigm 0 consistency: 0.0958883125600768
Paradigm 1 consistency: 0.047909643385224854
Paradigm 2 consistency: 0.16494454884084464
Paradigm 3 consistency: 0.16494454884084464
Paradigm 4 consistency: 0.12691452647336965
Paradigm 5 consistency: 0.01736983605538045
Paradigm 6 consistency: 0.20033199319823147





Paradigm 7 consistency: 0.01736983605538034


  8%|▊         | 8/100 [00:00<00:02, 37.31it/s][A

Paradigm 8 consistency: 0.035038159617655995
Paradigm 9 consistency: 0.17194835729502667
Paradigm 10 consistency: 0.047909643385225076
Paradigm 11 consistency: 0.3555333547974129
Paradigm 12 consistency: 0.3028137911081479
Paradigm 13 consistency: 0.01736983605538045
Paradigm 14 consistency: 0.26342943586645207
Paradigm 15 consistency: 0.0958883125600768
Paradigm 16 consistency: 0.03826325692095478
Paradigm 17 consistency: 0.0958883125600769
Paradigm 18 consistency: 0.20033199319823147
Paradigm 19 consistency: 0.047909643385224854
Paradigm 20 consistency: 0.1306021193250927
Paradigm 21 consistency: 0.047909643385224854
Paradigm 22 consistency: 0.035038159617655995
Paradigm 23 consistency: 0.6272947035832889
Paradigm 24 consistency: 0.0024805242792796944
Paradigm 25 consistency: 0.20033199319823147
Paradigm 26 consistency: 0.047909643385225076
Paradigm 27 consistency: 0.047909643385224854
Paradigm 28 consistency: 0.26342943586645207
Paradigm 29 consistency: 0.07413710786179006
Paradigm 30 consistency: 0.20033199319823147
Paradigm 31 consistency: 0.01736983605538034
Paradigm 32 consistency: 0.0958883125600769
Paradigm 33 consistency: 0.0958883125600769
Paradigm 34 consistency: 0.16494454884084464
Paradigm 35 consistency: 0.16494454884084464
Paradigm 36 consistency: 0.0958883125600768
Paradigm 37 consistency: 0.07768278518151717
Paradigm 38 consistency: 0.5115068636860558
Paradigm 39 consistency: 0.1148097082393531
Paradigm 40 consistency: 0.0958883125600768
Paradigm 41 consistency: 0.16494454884084464
Paradigm 42 consistency: 0.12691452647336965
Paradigm 43 consistency: 0.5145392392540866
Paradigm 44 consistency: 0.0958883125600768
Paradigm 45 consistency: 0.12691452647336965
Paradigm 46 consistency: 0.2575124304578764
Paradigm 47 consistency: 0.07768278518151717
Paradigm 48 consistency: 0.047909643385224854
Paradigm 49 consistency: 0.12691452647336965
Paradigm 50 consistency: 0.047909643385224854
Paradigm 51 consistency: 0.010502461377285166
Paradigm 52 consistency: 0.42378260447830407
Paradigm 53 consistency: 0.1148097082393531
Paradigm 54 consistency: 0.01736983605538034
Paradigm 55 consistency: 0.0958883125600768
Paradigm 56 consistency: 0.16494454884084464
Paradigm 57 consistency: 0.2634294358664523
Paradigm 58 consistency: 0.01736983605538045
Paradigm 59 consistency: 0.01736983605538045
Paradigm 60 consistency: 0.047909643385224854
Paradigm 61 consistency: 0.01736983605538045
Paradigm 62 consistency: 0.3028137911081479
Paradigm 63 consistency: 0.24936273490419736
Paradigm 64 consistency: 0.1306021193250927
Paradigm 65 consistency: 0.07413710786178995
Paradigm 66 consistency: 0.16852561199027072
Paradigm 67 consistency: 0.01736983605538045
Paradigm 68 consistency: 0.0958883125600768
Paradigm 69 consistency: 0.0958883125600768
Paradigm 70 consistency: 0.047909643385225076
Paradigm 71 consistency: 0.2634294358664523
Paradigm 72 consistency: 0.16494454884084464
Paradigm 73 consistency: 0.01736983605538034
Paradigm 74 consistency: 0.0958883125600768
Paradigm 75 consistency: 0.01736983605538034
Paradigm 76 consistency: 0.047909643385224854
Paradigm 77 consistency: 0.01736983605538045
Paradigm 78 consistency: 0.01736983605538045
Paradigm 79 consistency: 0.16494454884084464
Paradigm 80 consistency: 0.047909643385224854
Paradigm 81 consistency: 0.1148097082393531
Paradigm 82 consistency: 0.01736983605538045
Paradigm 83 consistency: 0.16494454884084464
Paradigm 84 consistency: 0.16852561199027072
Paradigm 85 consistency: 0.16494454884084464
Paradigm 86 consistency: 0.01736983605538045
Paradigm 87 consistency: 0.07768278518151717
Paradigm 88 consistency: 0.12691452647336965
Paradigm 89 consistency: 0.01736983605538045
Paradigm 90 consistency: 0.01736983605538045
Paradigm 91 consistency: 0.1148097082393531
Paradigm 92 consistency: 0.03826325692095478
Paradigm 93 consistency: 0.047909643385224854
Paradigm 94 consistency: 0.10025624130173738
Paradigm 95 consistency: 0.047909643385224854
Paradigm 96 consistency: 0.047909643385224854
Paradigm 97 consistency: 0.01736983605538034
Paradigm 98 consistency: 0.26342943586645207


100%|██████████| 100/100 [1:45:44<00:00, 63.44s/it]

Paradigm 99 consistency: 0.16494454884084464
1900





Calculate metrics by transformation


100%|██████████| 19/19 [00:02<00:00,  8.80it/s]
100%|██████████| 9/9 [2:50:59<00:00, 1140.00s/it]

Section presupposition_question_presupposition:
  Accuracy: 0.8421
  Precision: 0.8741
  Recall: 0.8216
  F1: 0.8340
  Consistency: 0.1217






In [None]:
# Build the section-level and transformation-level summary tables
# for the sections listed below.
section_df, transformation_df = create_results_table(
    results,
    [
        'presupposition_both_presupposition',
        'presupposition_change_of_state',
        'presupposition_cleft_existence',
        'presupposition_cleft_uniqueness',
        'presupposition_only_presupposition',
        'presupposition_possessed_definites_existence',
        'presupposition_possessed_definites_uniqueness',
        'presupposition_question_presupposition',
    ],
)

print("\n" + "="*80)
print("SECTION-LEVEL RESULTS")
print("="*80)
print(section_df.to_string(index=False))

print("\n" + "="*80)
print("TRANSFORMATION-LEVEL RESULTS")
print("="*80)
print(transformation_df.to_string(index=False))

# Overall metrics are macro-averages: the unweighted mean of each
# metric over the sections in section_df.
overall_accuracy = section_df['Accuracy'].mean()
overall_precision = section_df['Precision'].mean()
overall_recall = section_df['Recall'].mean()
overall_f1 = section_df['F1'].mean()
overall_consistency = section_df['Consistency'].mean()

print("\n" + "="*80)
print("OVERALL RESULTS")
print("="*80)
print(f"Overall Accuracy: {overall_accuracy:.4f}")
print(f"Overall Precision: {overall_precision:.4f}")
print(f"Overall Recall: {overall_recall:.4f}")
print(f"Overall F1: {overall_f1:.4f}")
print(f"Overall Consistency: {overall_consistency:.4f}")


SECTION-LEVEL RESULTS
                                      Section  Accuracy  Precision   Recall       F1  Consistency
           presupposition_both_presupposition  0.976316   0.976741 0.977250 0.976923     0.023364
               presupposition_change_of_state  0.567895   0.661949 0.496972 0.470625     0.441210
               presupposition_cleft_existence  0.688421   0.848858 0.631222 0.641477     0.333441
              presupposition_cleft_uniqueness  0.478947   0.748956 0.395028 0.311998     0.687332
           presupposition_only_presupposition  0.699474   0.829189 0.655722 0.678558     0.254056
 presupposition_possessed_definites_existence  0.937368   0.944866 0.928861 0.935482     0.054856
presupposition_possessed_definites_uniqueness  0.480526   0.737844 0.396833 0.314979     0.677115
       presupposition_question_presupposition  0.842105   0.874063 0.821556 0.833981     0.121652

TRANSFORMATION-LEVEL RESULTS
 transformation  accuracy  precision   recall       f1
          

## Results Analysis

This experiment evaluates whether explanation-based prompting can improve presupposition identification by exploiting the structured paradigms encoded in the ImpPres dataset. 

Results show that while certain presupposition types (e.g., possessed definites (existence) and both presupposition) achieve high accuracy, they often lack consistency across transformations, indicating limited generalization from the paradigm structure. In contrast, other types such as cleft uniqueness and possessed definites (uniqueness) show high consistency but low accuracy, suggesting a systematic model error across the whole section, possibly caused by a specific pattern in those sections that the transformations do not change.

Transformation-level scores are relatively uniform, implying the model captures some general patterns across variants. However, the overall consistency score (0.32) remains low, limiting the effectiveness of paradigm-aware reasoning. 
The overall accuracy is about 0.71, which is lower than in the previous task, suggesting that the added paradigm computation hurts the model's performance.
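
The overall figures and the accuracy/consistency contrast described above can be re-derived from the section-level table alone. The sketch below is an illustration only: it hard-codes the Accuracy and Consistency columns from that table, and the names `SECTION_RESULTS`, `mean`, and `pearson` are hypothetical helpers introduced here, not part of the notebook's evaluation code.

```python
import math

# (accuracy, consistency) per section, copied from the SECTION-LEVEL RESULTS table.
# The dictionary and helper names are illustrative, not from the notebook.
SECTION_RESULTS = {
    "both_presupposition": (0.976316, 0.023364),
    "change_of_state": (0.567895, 0.441210),
    "cleft_existence": (0.688421, 0.333441),
    "cleft_uniqueness": (0.478947, 0.687332),
    "only_presupposition": (0.699474, 0.254056),
    "possessed_definites_existence": (0.937368, 0.054856),
    "possessed_definites_uniqueness": (0.480526, 0.677115),
    "question_presupposition": (0.842105, 0.121652),
}


def mean(values):
    """Unweighted arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


accuracies = [acc for acc, _ in SECTION_RESULTS.values()]
consistencies = [cons for _, cons in SECTION_RESULTS.values()]

# Macro-averages: every section counts equally.
macro_accuracy = mean(accuracies)
macro_consistency = mean(consistencies)

# How strongly section accuracy and section consistency move together.
acc_cons_correlation = pearson(accuracies, consistencies)

print(f"Macro accuracy:    {macro_accuracy:.4f}")
print(f"Macro consistency: {macro_consistency:.4f}")
print(f"Pearson r (accuracy vs. consistency): {acc_cons_correlation:.3f}")
```

With these eight sections, the macro-averages agree with the figures quoted in the analysis, and the correlation between accuracy and consistency comes out strongly negative, which is the pattern the second paragraph describes. With only eight data points this is a descriptive check, not a statistical test.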

These results suggest that while the model benefits from the structure of paradigms to some extent, further improvements are needed, particularly through training or prompting strategies that explicitly reinforce consistency across transformations.

