# Exploring the ImpPres Dataset

The ImpPres dataset (https://huggingface.co/datasets/facebook/imppres) was introduced in *"Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition"* (Jeretic et al., ACL 2020, https://www.aclweb.org/anthology/2020.acl-main.768) to investigate the pragmatic inference capabilities of NLI models.

It was created by synthesizing (premise, hypothesis) pairs from templates derived from pragmatic analysis, covering presuppositions triggered by different linguistic forms and implicatures of different kinds. Samples are grouped into "paradigms" (groups of related pairs) that test the predicted relation between premise and hypothesis under linguistic transformations. For example, given a pair (premise, presupposition), the paradigm also includes (negated-premise, presupposition), (question-premise, presupposition), (conditional-premise, presupposition), (premise, negated-presupposition), etc. If a model detects that the relation (premise, presupposition) is a form of "presupposition entailment", then it should label the other members of the group consistently with the linguistic predictions.
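
To make the idea of paradigm-level consistency concrete, here is a minimal sketch. The three rows mirror the first members of paradigm 0 in `presupposition_change_of_state` shown later in this notebook; the helper `paradigm_is_consistent` and the two prediction dictionaries are hypothetical and only illustrate the check.

```python
# Gold labels follow the notebook's convention:
# 0 = entailment, 1 = neutral, 2 = contradiction.
paradigm = [
    {"pairID": "0e", "premise": "The guest had found John.",
     "hypothesis": "John used to be in an unknown location.", "gold_label": 0},
    {"pairID": "1c", "premise": "The guest had found John.",
     "hypothesis": "John didn't used to be in an unknown location.", "gold_label": 2},
    {"pairID": "2n", "premise": "The guest had found John.",
     "hypothesis": "Peter used to be in an unknown location.", "gold_label": 1},
]


def paradigm_is_consistent(paradigm, predictions):
    """Return True only if every member of the paradigm gets its expected label."""
    return all(predictions.get(row["pairID"]) == row["gold_label"] for row in paradigm)


# Hypothetical model outputs, keyed by pairID.
good_predictions = {"0e": 0, "1c": 2, "2n": 1}
bad_predictions = {"0e": 0, "1c": 1, "2n": 1}  # misses the contradiction

print(paradigm_is_consistent(paradigm, good_predictions))  # True
print(paradigm_is_consistent(paradigm, bad_predictions))   # False
```

A model can be right on the unembedded pair and still fail the paradigm, which is the behaviour the dataset is designed to expose.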





In [15]:
from datasets import load_dataset
sections = ['implicature_connectives', 'implicature_gradable_adjective', 'implicature_gradable_verb', 'implicature_modals', 'implicature_numerals_10_100', 'implicature_numerals_2_3', 'implicature_quantifiers', 'presupposition_all_n_presupposition', 'presupposition_both_presupposition', 'presupposition_change_of_state', 'presupposition_cleft_existence', 'presupposition_cleft_uniqueness', 'presupposition_only_presupposition', 'presupposition_possessed_definites_existence', 'presupposition_possessed_definites_uniqueness', 'presupposition_question_presupposition']


imp_connectives = load_dataset("facebook/imppres", sections[0])


In [16]:
imp_connectives

DatasetDict({
    connectives: Dataset({
        features: ['premise', 'hypothesis', 'gold_label_log', 'gold_label_prag', 'spec_relation', 'item_type', 'trigger', 'lexemes'],
        num_rows: 1200
    })
})

In [17]:
imp_connectives['connectives'][0]

{'premise': 'These computers or dresses would irritate Veronica.',
 'hypothesis': "These computers and dresses wouldn't both irritate Veronica.",
 'gold_label_log': 1,
 'gold_label_prag': 0,
 'spec_relation': 'implicature_PtoN',
 'item_type': 'target',
 'trigger': 'connective',
 'lexemes': 'or - and'}

In [18]:
pcos = load_dataset("facebook/imppres", "presupposition_change_of_state")

In [19]:
pcos

DatasetDict({
    change_of_state: Dataset({
        features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
        num_rows: 1900
    })
})

In [20]:
pcos['change_of_state'][0]

{'premise': 'The guest had found John.',
 'hypothesis': 'John used to be in an unknown location.',
 'trigger': 'unembedded',
 'trigger1': 'Not_In_Example',
 'trigger2': 'Not_In_Example',
 'presupposition': 'positive',
 'gold_label': 0,
 'UID': 'change_of_state',
 'pairID': '0e',
 'paradigmID': 0}

In [21]:
print(sorted({s['paradigmID'] for s in pcos['change_of_state']}))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


In [22]:
def get_paradigm(dataset, paradigm_id):
    return [s for s in dataset if s['paradigmID'] == paradigm_id]
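
`get_paradigm` rescans the whole split for every ID it is asked about. When many paradigms are needed, a one-pass index is an alternative. The sketch below is an assumption-laden illustration: `index_by_paradigm` is a hypothetical helper, and the rows are made-up stand-ins for the dictionaries obtained by iterating over a split.

```python
from collections import defaultdict


def index_by_paradigm(rows):
    """Group rows by their 'paradigmID' in a single pass over the data."""
    index = defaultdict(list)
    for row in rows:
        index[row["paradigmID"]].append(row)
    return dict(index)


# Made-up rows; real rows would come from iterating over a dataset split.
rows = [
    {"paradigmID": 0, "pairID": "a"},
    {"paradigmID": 0, "pairID": "b"},
    {"paradigmID": 1, "pairID": "c"},
]

index = index_by_paradigm(rows)
print(sorted(index))   # [0, 1]
print(len(index[0]))   # 2
```

This only works within a single section, where `paradigmID` values are unique.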

In [23]:
get_paradigm(pcos['change_of_state'], 0)

[{'premise': 'The guest had found John.',
  'hypothesis': 'John used to be in an unknown location.',
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'positive',
  'gold_label': 0,
  'UID': 'change_of_state',
  'pairID': '0e',
  'paradigmID': 0},
 {'premise': 'The guest had found John.',
  'hypothesis': "John didn't used to be in an unknown location.",
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'negated',
  'gold_label': 2,
  'UID': 'change_of_state',
  'pairID': '1c',
  'paradigmID': 0},
 {'premise': 'The guest had found John.',
  'hypothesis': 'Peter used to be in an unknown location.',
  'trigger': 'unembedded',
  'trigger1': 'Not_In_Example',
  'trigger2': 'Not_In_Example',
  'presupposition': 'neutral',
  'gold_label': 1,
  'UID': 'change_of_state',
  'pairID': '2n',
  'paradigmID': 0},
 {'premise': "The guest hadn't found John.",
  'hypothesis': 'John 

In [10]:
pop = load_dataset("facebook/imppres", "presupposition_only_presupposition")


In [11]:
pop

DatasetDict({
    only_presupposition: Dataset({
        features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
        num_rows: 1900
    })
})

In [12]:
pcos

DatasetDict({
    change_of_state: Dataset({
        features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
        num_rows: 1900
    })
})

## Unify the Datasets

Your task is to create a new dataset that 
* Has all the rows from the presupposition sections of ImpPres:
    * ['presupposition_all_n_presupposition', 'presupposition_both_presupposition', 'presupposition_change_of_state', 'presupposition_cleft_existence', 'presupposition_cleft_uniqueness', 'presupposition_only_presupposition', 'presupposition_possessed_definites_existence', 'presupposition_possessed_definites_uniqueness', 'presupposition_question_presupposition']
* Has one more column which is the name of the section:
    * ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID', 'section']

In [13]:
from datasets import load_dataset, concatenate_datasets
import pandas as pd

presupposition_sections = [
    'presupposition_all_n_presupposition',
    'presupposition_both_presupposition', 
    'presupposition_change_of_state',
    'presupposition_cleft_existence',
    'presupposition_cleft_uniqueness',
    'presupposition_only_presupposition',
    'presupposition_possessed_definites_existence',
    'presupposition_possessed_definites_uniqueness',
    'presupposition_question_presupposition'
]

# Load and combine all sections
combined_data = []

for section in presupposition_sections:
    # Load dataset
    dataset = load_dataset("facebook/imppres", section)
    dataset_key = list(dataset.keys())[0]
    
    # Convert to pandas and add section column
    df = dataset[dataset_key].to_pandas()
    df['section'] = section
    
    combined_data.append(df)
    print(f"Loaded {len(df)} rows from {section}")

# Combine all dataframes
unified_df = pd.concat(combined_data, ignore_index=True)

# Convert back to Hugging Face dataset
from datasets import Dataset
unified_dataset = Dataset.from_pandas(unified_df)

print(f"\nUnified dataset created:")
print(f"Total rows: {len(unified_dataset)}")
print(f"Columns: {unified_dataset.column_names}")

# Save as CSV
unified_df.to_csv('unified_presupposition_dataset.csv', index=False)
print("Saved as 'unified_presupposition_dataset.csv'")

# Verify the expected columns are present
expected_columns = ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 
                   'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID', 'section']
print(f"Expected columns present: {set(expected_columns).issubset(set(unified_dataset.column_names))}")

Loaded 1900 rows from presupposition_all_n_presupposition
Loaded 1900 rows from presupposition_both_presupposition
Loaded 1900 rows from presupposition_change_of_state
Loaded 1900 rows from presupposition_cleft_existence
Loaded 1900 rows from presupposition_cleft_uniqueness
Loaded 1900 rows from presupposition_only_presupposition
Loaded 1900 rows from presupposition_possessed_definites_existence
Loaded 1900 rows from presupposition_possessed_definites_uniqueness
Loaded 1900 rows from presupposition_question_presupposition

Unified dataset created:
Total rows: 17100
Columns: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID', 'section']
Saved as 'unified_presupposition_dataset.csv'
Expected columns present: True
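
The row counts above can be checked by hand. The sketch below assumes every presupposition section follows the layout seen for the sections inspected in this notebook (100 paradigm IDs, 19 pairs per paradigm). It also illustrates, on made-up rows with hypothetical section names, why a paradigm in the unified table has to be identified by the pair (section, paradigmID) rather than by `paradigmID` alone, since the IDs restart at 0 in each section.

```python
# Row-count arithmetic (layout assumed identical across sections).
num_sections = 9
paradigms_per_section = 100
pairs_per_paradigm = 19

rows_per_section = paradigms_per_section * pairs_per_paradigm
total_rows = num_sections * rows_per_section
print(rows_per_section, total_rows)  # 1900 17100

# Made-up rows: two hypothetical sections, three paradigms each, two pairs each.
rows = [
    {"section": section, "paradigmID": pid}
    for section in ("section_a", "section_b")
    for pid in range(3)
    for _ in range(2)
]

ids_only = {row["paradigmID"] for row in rows}
composite_keys = {(row["section"], row["paradigmID"]) for row in rows}
print(len(ids_only), len(composite_keys))  # 3 6
```

Grouping the unified data by `paradigmID` alone would merge paradigms from different sections that happen to share an ID.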


In [None]:
import os
import dspy
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
from typing import Literal, List, Dict, Any
import random
from datasets import load_dataset, Dataset
import evaluate
from evaluate import load

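# Keep an XAI_API_KEY that is already set in the environment; otherwise fill in your own key here.
os.environ.setdefault("XAI_API_KEY", "")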

# Configure DSPy environment
lm = dspy.LM('xai/grok-3-mini', api_key=os.environ['XAI_API_KEY'])
dspy.configure(lm=lm)

print("DSPy configured with Grok-3-mini")

# Load the unified dataset (assuming you've already created it)
print("Loading unified presupposition dataset...")
try:
    unified_df = pd.read_csv('unified_presupposition_dataset.csv')
    print(f"Loaded unified dataset with {len(unified_df)} rows")
except FileNotFoundError:
    print("unified_presupposition_dataset.csv not found. Loading from individual sections...")
    # Fallback: load from individual sections as done in 2.1
    presupposition_sections = [
        'presupposition_all_n_presupposition',
        'presupposition_both_presupposition', 
        'presupposition_change_of_state',
        'presupposition_cleft_existence',
        'presupposition_cleft_uniqueness',
        'presupposition_only_presupposition',
        'presupposition_possessed_definites_existence',
        'presupposition_possessed_definites_uniqueness',
        'presupposition_question_presupposition'
    ]
    
    combined_data = []
    for section in presupposition_sections:
        dataset = load_dataset("facebook/imppres", section)
        dataset_key = list(dataset.keys())[0]
        df = dataset[dataset_key].to_pandas()
        df['section'] = section
        combined_data.append(df)
        print(f"Loaded {len(df)} rows from {section}")
    
    unified_df = pd.concat(combined_data, ignore_index=True)
    print(f"Created unified dataset with {len(unified_df)} rows")

# ============================================================================
# STEP 1: SYSTEMATIC PARADIGM STRUCTURE EXPLORATION
# ============================================================================

print("\n" + "="*60)
print("STEP 1: EXPLORING PARADIGM STRUCTURE")
print("="*60)

def explore_paradigm_structure(df: pd.DataFrame, sample_paradigms: int = 3):
    """Systematically explore the structure of paradigms"""
    
    print(f"Dataset overview:")
    print(f"- Total samples: {len(df)}")
    # paradigmID restarts in every section, so count (section, paradigmID) pairs
    print(f"- Unique paradigms: {df.groupby(['section', 'paradigmID']).ngroups}")
    print(f"- Sections: {df['section'].unique()}")
    
    # Check paradigm sizes
    paradigm_sizes = df.groupby('paradigmID').size()
    print(f"\nParadigm size distribution:")
    print(paradigm_sizes.value_counts().sort_index())
    
    # Explore transformation types by examining trigger fields
    print(f"\nTransformation analysis:")
    print(f"- Unique 'trigger' values: {sorted(df['trigger'].unique())}")
    print(f"- Unique 'trigger1' values: {sorted(df['trigger1'].unique())}")
    print(f"- Unique 'trigger2' values: {sorted(df['trigger2'].unique())}")
    print(f"- Unique 'presupposition' values: {sorted(df['presupposition'].unique())}")
    
    # Label distribution
    print(f"\nLabel distribution:")
    print(df['gold_label'].value_counts().sort_index())
    
    # Sample a few paradigms for detailed analysis
    sample_paradigm_ids = df['paradigmID'].unique()[:sample_paradigms]
    
    for i, paradigm_id in enumerate(sample_paradigm_ids):
        paradigm_samples = df[df['paradigmID'] == paradigm_id].sort_values('pairID')
        print(f"\n--- PARADIGM {paradigm_id} (Section: {paradigm_samples['section'].iloc[0]}) ---")
        print(f"Size: {len(paradigm_samples)} samples")
        
        # Show the transformation pattern
        for idx, row in paradigm_samples.iterrows():
            trigger_info = f"trigger='{row['trigger']}'"
            if row['trigger1'] != 'Not_In_Example':
                trigger_info += f", trigger1='{row['trigger1']}'"
            if row['trigger2'] != 'Not_In_Example':
                trigger_info += f", trigger2='{row['trigger2']}'"
            
            presup_info = f"presupposition='{row['presupposition']}'"
            
            print(f"  {row['pairID']}: {trigger_info}, {presup_info}, label={row['gold_label']}")
            
            if i == 0:  # Show full text for first paradigm only
                print(f"    P: {row['premise']}")
                print(f"    H: {row['hypothesis']}")
                print()
        
        if i > 0:  # Add spacing for subsequent paradigms
            print()
    
    return paradigm_sizes

# Explore with a small subset first for faster iteration
print("Working with subset for initial exploration...")
subset_df = unified_df.head(1000)  # first 1000 rows; cuts the last paradigm short (1000 = 52 * 19 + 12)
explore_paradigm_structure(subset_df)


DSPy configured with Grok-3-mini
Loading unified presupposition dataset...
Loaded unified dataset with 17100 rows

STEP 1: EXPLORING PARADIGM STRUCTURE
Working with subset for initial exploration...
Dataset overview:
- Total samples: 1000
- Unique paradigms: 53
- Sections: ['presupposition_all_n_presupposition']

Paradigm size distribution:
12     1
19    52
Name: count, dtype: int64

Transformation analysis:
- Unique 'trigger' values: ['Not_In_Example', 'conditional', 'interrogative', 'modal', 'negated', 'unembedded']
- Unique 'trigger1' values: ['Not_In_Example', 'conditional', 'interrogative', 'modal', 'negated']
- Unique 'trigger2' values: ['Not_In_Example', 'unembedded']
- Unique 'presupposition' values: ['Not_In_Example', 'negated', 'neutral', 'positive']

Label distribution:
gold_label
0    264
1    420
2    316
Name: count, dtype: int64

--- PARADIGM 0 (Section: presupposition_all_n_presupposition) ---
Size: 19 samples
  0e: trigger='unembedded', presupposition='positive', labe

paradigmID
0     19
1     19
2     19
3     19
4     19
5     19
6     19
7     19
8     19
9     19
10    19
11    19
12    19
13    19
14    19
15    19
16    19
17    19
18    19
19    19
20    19
21    19
22    19
23    19
24    19
25    19
26    19
27    19
28    19
29    19
30    19
31    19
32    19
33    19
34    19
35    19
36    19
37    19
38    19
39    19
40    19
41    19
42    19
43    19
44    19
45    19
46    19
47    19
48    19
49    19
50    19
51    19
52    12
dtype: int64
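
The single 12-row paradigm in the output is an artifact of `head(1000)`: 1000 is not a multiple of 19, so the last paradigm is truncated. A minimal sketch of the arithmetic, plus one way to cut a subset on paradigm boundaries. `take_whole_paradigms` is a hypothetical helper, the rows are synthetic, and the approach assumes rows of one paradigm are stored contiguously.

```python
from itertools import groupby

# 1000 rows hold 52 complete 19-row paradigms and 12 leftover rows.
full_paradigms, leftover_rows = divmod(1000, 19)
print(full_paradigms, leftover_rows)  # 52 12


def take_whole_paradigms(rows, max_rows):
    """Keep leading paradigms in full; stop before one that would exceed max_rows."""
    kept = []
    for _, group in groupby(rows, key=lambda row: row["paradigmID"]):
        members = list(group)
        if len(kept) + len(members) > max_rows:
            break
        kept.extend(members)
    return kept


# Synthetic stand-in for one section: 60 paradigms of 19 rows each.
rows = [{"paradigmID": pid, "member": m} for pid in range(60) for m in range(19)]

subset = take_whole_paradigms(rows, 1000)
print(len(subset))                                   # 988
print(len({row["paradigmID"] for row in subset}))    # 52
```

Cutting on paradigm boundaries keeps later paradigm-level statistics from being skewed by a partial group.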

In [25]:
# ============================================================================
# STEP 2: IDENTIFY THE 19 TRANSFORMATION TYPES
# ============================================================================

print("\n" + "="*60)
print("STEP 2: IDENTIFYING THE 19 TRANSFORMATION TYPES")
print("="*60)

def analyze_transformation_types(df: pd.DataFrame):
    """Identify and categorize the 19 transformation types"""
    
    # Create a combined transformation identifier
    def get_transformation_signature(row):
        """Create a unique signature for each transformation type"""
        sig_parts = []
        
        # Primary trigger
        if row['trigger'] != 'Not_In_Example':
            sig_parts.append(f"trigger:{row['trigger']}")
        
        # Secondary triggers
        if row['trigger1'] != 'Not_In_Example':
            sig_parts.append(f"trigger1:{row['trigger1']}")
        if row['trigger2'] != 'Not_In_Example':
            sig_parts.append(f"trigger2:{row['trigger2']}")
            
        # Presupposition status
        if row['presupposition'] != 'Not_In_Example':
            sig_parts.append(f"presup:{row['presupposition']}")
            
        return " | ".join(sig_parts) if sig_parts else "base"
    
    # Apply transformation signature to subset
    df_sample = df.copy()
    df_sample['transformation_signature'] = df_sample.apply(get_transformation_signature, axis=1)
    
    # Count transformation types
    transformation_counts = df_sample['transformation_signature'].value_counts()
    print(f"Found {len(transformation_counts)} unique transformation types:")
    
    for i, (transformation, count) in enumerate(transformation_counts.items(), 1):
        print(f"{i:2d}. {transformation} ({count} samples)")
    
    # Verify we have 19 transformations per paradigm
    sample_paradigm = df_sample[df_sample['paradigmID'] == df_sample['paradigmID'].iloc[0]]
    sample_transformations = sample_paradigm['transformation_signature'].unique()
    print(f"\nTransformations in sample paradigm: {len(sample_transformations)}")
    
    return transformation_counts, df_sample

transformation_counts, subset_with_signatures = analyze_transformation_types(subset_df)



STEP 2: IDENTIFYING THE 19 TRANSFORMATION TYPES
Found 19 unique transformation types:
 1. trigger:unembedded | presup:positive (53 samples)
 2. trigger:interrogative | presup:negated (53 samples)
 3. trigger:modal | presup:neutral (53 samples)
 4. trigger:modal | presup:negated (53 samples)
 5. trigger:unembedded | presup:negated (53 samples)
 6. trigger:interrogative | presup:neutral (53 samples)
 7. trigger:modal | presup:positive (53 samples)
 8. trigger:interrogative | presup:positive (53 samples)
 9. trigger:negated | presup:neutral (53 samples)
10. trigger:negated | presup:negated (53 samples)
11. trigger:negated | presup:positive (53 samples)
12. trigger:unembedded | presup:neutral (53 samples)
13. trigger:conditional | presup:positive (52 samples)
14. trigger:conditional | presup:negated (52 samples)
15. trigger:conditional | presup:neutral (52 samples)
16. trigger1:negated | trigger2:unembedded (52 samples)
17. trigger1:interrogative | trigger2:unembedded (52 samples)
18. tri
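
The count of 19 can be reconstructed from the field values listed in Step 1. The sketch below is an assumption: the output above is truncated after the 17th entry, so pairing the remaining `trigger1` values (`modal`, `conditional`) with `trigger2:unembedded` is inferred from the value lists, not read from the output.

```python
from itertools import product

# Values taken from the 'trigger' and 'presupposition' listings in Step 1.
embeddings = ["unembedded", "negated", "interrogative", "modal", "conditional"]
presupposition_states = ["positive", "negated", "neutral"]

# Pairs that test a presupposition-related hypothesis: 5 embeddings x 3 states.
single_trigger = [
    f"trigger:{t} | presup:{p}" for t, p in product(embeddings, presupposition_states)
]

# Pairs that relate an embedded premise to its unembedded counterpart
# (assumed to exist for every embedded form).
paired_trigger = [
    f"trigger1:{t} | trigger2:unembedded" for t in embeddings if t != "unembedded"
]

signatures = single_trigger + paired_trigger
print(len(single_trigger), len(paired_trigger), len(signatures))  # 15 4 19
```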

In [None]:
# ============================================================================
# STEP 3: DSPy CLASSIFIER IMPLEMENTATION
# ============================================================================

print("\n" + "="*60)
print("STEP 3: IMPLEMENTING DSPy CLASSIFIER")
print("="*60)

class NLIClassification(dspy.Signature):
    """
    Perform Natural Language Inference to determine the relationship between premise and hypothesis.
    Focus on presupposition entailment patterns.
    """
    premise: str = dspy.InputField(desc="The premise statement")
    hypothesis: str = dspy.InputField(desc="The hypothesis statement") 
    reasoning: str = dspy.OutputField(desc="Step-by-step reasoning about the entailment relationship, focusing on presuppositions")
    label: Literal['entailment', 'neutral', 'contradiction'] = dspy.OutputField(desc="The entailment relationship: entailment (0), neutral (1), or contradiction (2)")

class PresuppositionNLI(dspy.Module):
    """DSPy module for presupposition-aware NLI classification"""
    
    def __init__(self):
        super().__init__()
        self.classify = dspy.ChainOfThought(NLIClassification)
    
    def forward(self, premise: str, hypothesis: str):
        result = self.classify(premise=premise, hypothesis=hypothesis)
        return result

# Initialize classifier
print("Initializing PresuppositionNLI classifier...")
classifier = PresuppositionNLI()

# Test on a few samples
print("\nTesting classifier on sample data...")
test_samples = subset_df.head(3)

for i, row in test_samples.iterrows():
    print(f"\n--- Test Sample {i+1} ---")
    print(f"Premise: {row['premise']}")
    print(f"Hypothesis: {row['hypothesis']}")
    print(f"Gold label: {row['gold_label']}")
    
    try:
        result = classifier(premise=row['premise'], hypothesis=row['hypothesis'])
        print(f"Predicted: {result.label}")
        print(f"Reasoning: {result.reasoning}")
        
        # Convert label to numeric for comparison
        label_map = {'entailment': 0, 'neutral': 1, 'contradiction': 2}
        pred_numeric = label_map.get(result.label, -1)
        print(f"Numeric prediction: {pred_numeric} (Gold: {row['gold_label']})")
        
    except Exception as e:
        print(f"Error during classification: {e}")

print("\nClassifier testing complete!")




STEP 3: IMPLEMENTING DSPy CLASSIFIER
Initializing PresuppositionNLI classifier...

Testing classifier on sample data...

--- Test Sample 1 ---
Premise: All ten guys that proved to boast were divorcing.
Hypothesis: There are exactly ten guys that proved to boast.
Gold label: 0
Predicted: entailment
Reasoning: The premise states "All ten guys that proved to boast were divorcing," which presupposes the existence of exactly ten guys who proved to boast. This is evident from the phrase "all ten," as it quantifies the group precisely as ten, implying that the set of guys who proved to boast is exactly that number. The hypothesis directly asserts "There are exactly ten guys that proved to boast," which aligns with this presupposition. Therefore, the premise entails the hypothesis because the presupposition in the premise logically implies the statement in the hypothesis.
Numeric prediction: 0 (Gold: 0)

--- Test Sample 2 ---
Premise: All ten guys that proved to boast were divorcing.
Hypothes
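
The label-to-number mapping in the test loop falls back to -1 for any string it does not recognize. A defensive sketch, assuming the model could return a label with stray whitespace or different casing (the typed `Literal` output field may already prevent this). `label_to_numeric` is a hypothetical helper, not part of the notebook.

```python
LABEL_MAP = {"entailment": 0, "neutral": 1, "contradiction": 2}


def label_to_numeric(raw_label):
    """Map a predicted label string to the dataset's numeric code, or -1 if unrecognized."""
    if not isinstance(raw_label, str):
        return -1
    return LABEL_MAP.get(raw_label.strip().lower(), -1)


print(label_to_numeric("entailment"))  # 0
print(label_to_numeric(" Neutral "))   # 1
print(label_to_numeric("unknown"))     # -1
print(label_to_numeric(None))          # -1
```

Keeping -1 as an explicit "unparsed" code makes it possible to count such cases separately from genuine misclassifications.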

In [27]:
# ============================================================================
# STEP 4: INDIVIDUAL INFERENCE PIPELINE
# ============================================================================

print("\n" + "="*60)
print("STEP 4: INDIVIDUAL INFERENCE PIPELINE")
print("="*60)

def prepare_paradigm_data(df: pd.DataFrame, shuffle_within_paradigms: bool = True):
    """Prepare data for individual inference with paradigm shuffling"""
    
    print(f"Preparing paradigm data...")
    print(f"- Total samples: {len(df)}")
    print(f"- Unique paradigms: {df['paradigmID'].nunique()}")
    
    # Group by paradigmID
    paradigms = {}
    for _, row in df.iterrows():
        paradigm_id = row['paradigmID']
        if paradigm_id not in paradigms:
            paradigms[paradigm_id] = []
        paradigms[paradigm_id].append(row.to_dict())
    
    # Verify paradigm sizes and shuffle
    paradigm_stats = []
    for paradigm_id, samples in paradigms.items():
        paradigm_stats.append(len(samples))
        if shuffle_within_paradigms:
            random.shuffle(samples)
    
    print(f"- Paradigm sizes: {Counter(paradigm_stats)}")
    print(f"- Shuffling within paradigms: {shuffle_within_paradigms}")
    
    # Flatten back to individual samples
    all_samples = []
    for paradigm_samples in paradigms.values():
        all_samples.extend(paradigm_samples)
    
    print(f"- Total samples after processing: {len(all_samples)}")
    
    return all_samples, paradigms

def run_individual_inference(samples: List[Dict], classifier, max_samples: int = None):
    """Run individual inference on each sample"""
    
    if max_samples:
        samples = samples[:max_samples]
        print(f"Processing {len(samples)} samples (limited for testing)")
    else:
        print(f"Processing {len(samples)} samples")
    
    results = []
    errors = []
    
    for i, sample in enumerate(samples):
        if i % 50 == 0:  # Progress indicator
            print(f"  Progress: {i}/{len(samples)} samples processed")
        
        try:
            # Run inference
            prediction = classifier(
                premise=sample['premise'], 
                hypothesis=sample['hypothesis']
            )
            
            # Convert label to numeric 
            label_map = {'entailment': 0, 'neutral': 1, 'contradiction': 2}
            pred_numeric = label_map.get(prediction.label, -1)
            
            # Store result with paradigm metadata
            result = {
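                # UID holds the section-level name (e.g. 'change_of_state'), so build a per-sample ID
                'sample_id': f"{sample['section']}/{sample['paradigmID']}/{sample['pairID']}",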
                'paradigm_id': sample['paradigmID'],
                'pair_id': sample['pairID'], 
                'section': sample['section'],
                'transformation_signature': get_transformation_signature(sample),
                'gold_label': sample['gold_label'],
                'predicted_label': prediction.label,
                'predicted_numeric': pred_numeric,
                'reasoning': prediction.reasoning,
                'premise': sample['premise'],
                'hypothesis': sample['hypothesis'],
                'trigger': sample['trigger'],
                'trigger1': sample['trigger1'],
                'trigger2': sample['trigger2'],
                'presupposition': sample['presupposition']
            }
            results.append(result)
            
        except Exception as e:
            error_info = {
                'sample_index': i,
                'paradigm_id': sample['paradigmID'],
                'error': str(e)
            }
            errors.append(error_info)
            print(f"  Error on sample {i}: {e}")
    
    print(f"\nInference complete!")
    print(f"- Successful predictions: {len(results)}")
    print(f"- Errors: {len(errors)}")
    
    if len(results) > 0:
        # Calculate basic accuracy
        correct = sum(1 for r in results if r['predicted_numeric'] == r['gold_label'])
        accuracy = correct / len(results)
        print(f"- Overall accuracy: {accuracy:.3f}")
    
    return results, errors

# Module-level copy of the signature helper (the Step 2 version is nested inside analyze_transformation_types)
def get_transformation_signature(sample):
    """Create a unique signature for each transformation type"""
    sig_parts = []
    
    # Primary trigger
    if sample['trigger'] != 'Not_In_Example':
        sig_parts.append(f"trigger:{sample['trigger']}")
    
    # Secondary triggers
    if sample['trigger1'] != 'Not_In_Example':
        sig_parts.append(f"trigger1:{sample['trigger1']}")
    if sample['trigger2'] != 'Not_In_Example':
        sig_parts.append(f"trigger2:{sample['trigger2']}")
        
    # Presupposition status
    if sample['presupposition'] != 'Not_In_Example':
        sig_parts.append(f"presup:{sample['presupposition']}")
        
    return " | ".join(sig_parts) if sig_parts else "base"

# Prepare data for inference (start with subset for testing)
print("Preparing data for individual inference...")
all_samples, paradigms_dict = prepare_paradigm_data(subset_df, shuffle_within_paradigms=True)

# Run individual inference on a smaller batch first
print("\nRunning individual inference (limited batch for testing)...")
results, errors = run_individual_inference(all_samples, classifier, max_samples=100)



STEP 4: INDIVIDUAL INFERENCE PIPELINE
Preparing data for individual inference...
Preparing paradigm data...
- Total samples: 1000
- Unique paradigms: 53
- Paradigm sizes: Counter({19: 52, 12: 1})
- Shuffling within paradigms: True
- Total samples after processing: 1000

Running individual inference (limited batch for testing)...
Processing 100 samples (limited for testing)
  Progress: 0/100 samples processed
  Progress: 50/100 samples processed

Inference complete!
- Successful predictions: 100
- Errors: 0
- Overall accuracy: 0.980
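
Step 5 below scores each paradigm with a majority-vote share and a normalized-entropy consistency. The following standalone sketch works those two quantities out on a hypothetical list of four predictions. `consistency_scores` is an illustrative helper that follows the same formulas, using `math.log2` instead of NumPy.

```python
from collections import Counter
from math import log2


def consistency_scores(predictions, num_labels=3):
    """Return (majority share, entropy in bits, 1 - normalized entropy)."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    counts = Counter(predictions)
    total = len(predictions)
    majority_share = counts.most_common(1)[0][1] / total
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    # The entropy ceiling depends on how many distinct labels could appear.
    max_entropy = log2(min(num_labels, total))
    normalized = entropy / max_entropy if max_entropy > 0 else 0.0
    return majority_share, entropy, 1 - normalized


# Hypothetical predictions for a four-member group: two 0s, one 1, one 2.
majority, entropy, consistency = consistency_scores([0, 0, 1, 2])
print(majority)                # 0.5
print(entropy)                 # 1.5
print(round(consistency, 3))   # 0.054
```

Gold labels inside a paradigm are themselves mixed (paradigm 0 of `change_of_state` above carries labels 0, 2 and 1), so a fully correct model would not reach 1.0 on such a paradigm. These scores describe how concentrated the predictions are, not whether they match the expected pattern.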


In [28]:
# ============================================================================
# STEP 5: PARADIGM-LEVEL CONSISTENCY ANALYSIS FRAMEWORK
# ============================================================================

print("\n" + "="*60)
print("STEP 5: PARADIGM-LEVEL CONSISTENCY ANALYSIS")
print("="*60)

def group_results_by_paradigm(results: List[Dict]):
    """Group inference results back by paradigmID"""
    
    paradigm_results = defaultdict(list)
    
    for result in results:
        paradigm_id = result['paradigm_id']
        paradigm_results[paradigm_id].append(result)
    
    print(f"Grouped results into {len(paradigm_results)} paradigms")
    
    # Verify paradigm sizes
    paradigm_sizes = [len(samples) for samples in paradigm_results.values()]
    print(f"Paradigm size distribution: {Counter(paradigm_sizes)}")
    
    return dict(paradigm_results)

def calculate_paradigm_consistency_metrics(paradigm_samples: List[Dict]):
    """Calculate multiple consistency metrics for a single paradigm"""
    
    if len(paradigm_samples) == 0:
        return None
    
    predictions = [s['predicted_numeric'] for s in paradigm_samples]
    gold_labels = [s['gold_label'] for s in paradigm_samples]
    
    # Basic info
    paradigm_id = paradigm_samples[0]['paradigm_id']
    section = paradigm_samples[0]['section']
    paradigm_size = len(paradigm_samples)
    
    # Accuracy metrics
    correct_predictions = sum(1 for p, g in zip(predictions, gold_labels) if p == g)
    paradigm_accuracy = correct_predictions / len(predictions)
    
    # Consistency metrics
    pred_counts = Counter(predictions)
    gold_counts = Counter(gold_labels)
    
    # 1. Majority vote consistency (what % agree with majority prediction)
    majority_pred = pred_counts.most_common(1)[0][0]
    majority_consistency = pred_counts[majority_pred] / len(predictions)
    
    # 2. Perfect consistency (all predictions identical)
    perfect_consistency = len(set(predictions)) == 1
    
    # 3. Entropy-based consistency (lower entropy = higher consistency)
    prediction_entropy = 0
    total = len(predictions)
    for count in pred_counts.values():
        prob = count / total
        if prob > 0:
            prediction_entropy -= prob * np.log2(prob)
    
    # Normalize entropy (0 = perfect consistency, 1 = maximum inconsistency)
    max_entropy = np.log2(min(3, len(predictions)))  # 3 possible labels
    normalized_entropy = prediction_entropy / max_entropy if max_entropy > 0 else 0
    entropy_consistency = 1 - normalized_entropy
    
    # 4. Agreement with gold label pattern
    # Check if prediction pattern matches gold pattern structure
    pred_pattern = tuple(sorted(pred_counts.items()))
    gold_pattern = tuple(sorted(gold_counts.items()))
    pattern_match = pred_pattern == gold_pattern
    
    # 5. Transformation-specific analysis
    transformation_analysis = {}
    for sample in paradigm_samples:
        trans_sig = sample['transformation_signature']
        if trans_sig not in transformation_analysis:
            transformation_analysis[trans_sig] = {
                'correct': 0, 'total': 0, 'predicted_labels': []
            }
        
        is_correct = sample['predicted_numeric'] == sample['gold_label']
        transformation_analysis[trans_sig]['correct'] += int(is_correct)
        transformation_analysis[trans_sig]['total'] += 1
        transformation_analysis[trans_sig]['predicted_labels'].append(sample['predicted_numeric'])
    
    return {
        'paradigm_id': paradigm_id,
        'section': section,
        'paradigm_size': paradigm_size,
        'accuracy': paradigm_accuracy,
        'correct_count': correct_predictions,
        'majority_pred': majority_pred,
        'majority_consistency': majority_consistency,
        'perfect_consistency': perfect_consistency,
        'entropy_consistency': entropy_consistency,
        'pattern_match': pattern_match,
        'prediction_distribution': dict(pred_counts),
        'gold_distribution': dict(gold_counts),
        'transformation_analysis': transformation_analysis,
        'samples': paradigm_samples
    }

def calculate_combined_consistency_score(accuracy: float, consistency: float, alpha: float = 0.7) -> float:
    """
    Combine accuracy and consistency into a single score as a weighted average.
    alpha: weight for accuracy, expected in [0, 1] (higher alpha = more weight on accuracy);
           consistency receives the remaining weight 1 - alpha.
    """
    return alpha * accuracy + (1 - alpha) * consistency

def analyze_all_paradigms(results: List[Dict]):
    """Analyze consistency across all paradigms"""
    
    # Group by paradigm
    paradigm_results = group_results_by_paradigm(results)
    
    # Calculate metrics for each paradigm
    paradigm_analyses = []
    
    for paradigm_id, samples in paradigm_results.items():
        analysis = calculate_paradigm_consistency_metrics(samples)
        if analysis:
            paradigm_analyses.append(analysis)
    
    print(f"\nCompleted consistency analysis for {len(paradigm_analyses)} paradigms")
    
    # Overall statistics
    if len(paradigm_analyses) > 0:
        overall_accuracy = np.mean([p['accuracy'] for p in paradigm_analyses])
        overall_majority_consistency = np.mean([p['majority_consistency'] for p in paradigm_analyses])
        overall_entropy_consistency = np.mean([p['entropy_consistency'] for p in paradigm_analyses])
        perfect_consistency_rate = np.mean([p['perfect_consistency'] for p in paradigm_analyses])
        
        print(f"\nOverall Paradigm Statistics:")
        print(f"- Average accuracy: {overall_accuracy:.3f}")
        print(f"- Average majority consistency: {overall_majority_consistency:.3f}")
        print(f"- Average entropy consistency: {overall_entropy_consistency:.3f}")
        print(f"- Perfect consistency rate: {perfect_consistency_rate:.3f}")
        
        # Combined scores with different weightings
        combined_score_balanced = calculate_combined_consistency_score(overall_accuracy, overall_majority_consistency, alpha=0.5)
        combined_score_accuracy_focused = calculate_combined_consistency_score(overall_accuracy, overall_majority_consistency, alpha=0.7)
        combined_score_consistency_focused = calculate_combined_consistency_score(overall_accuracy, overall_majority_consistency, alpha=0.3)
        
        print(f"\nCombined Scores:")
        print(f"- Balanced (α=0.5): {combined_score_balanced:.3f}")
        print(f"- Accuracy-focused (α=0.7): {combined_score_accuracy_focused:.3f}")
        print(f"- Consistency-focused (α=0.3): {combined_score_consistency_focused:.3f}")
    
    return paradigm_analyses

# Run consistency analysis on current results
if len(results) > 0:
    print("Running paradigm consistency analysis...")
    paradigm_analyses = analyze_all_paradigms(results)
    
    # Show detailed analysis for first few paradigms
    print(f"\nDetailed Analysis for First 3 Paradigms:")
    for analysis in paradigm_analyses[:3]:
        print(f"\n--- Paradigm {analysis['paradigm_id']} (Section: {analysis['section']}) ---")
        print(f"Size: {analysis['paradigm_size']}")
        print(f"Accuracy: {analysis['accuracy']:.3f}")
        print(f"Majority consistency: {analysis['majority_consistency']:.3f}")
        print(f"Entropy consistency: {analysis['entropy_consistency']:.3f}")
        print(f"Perfect consistency: {analysis['perfect_consistency']}")
        print(f"Prediction distribution: {analysis['prediction_distribution']}")
        print(f"Gold distribution: {analysis['gold_distribution']}")
        
        # Show transformation-specific accuracy
        print(f"Transformation accuracy:")
        for trans, stats in analysis['transformation_analysis'].items():
            acc = stats['correct'] / stats['total'] if stats['total'] > 0 else 0
            print(f"  {trans}: {acc:.3f} ({stats['correct']}/{stats['total']})")

else:
    print("No results available for consistency analysis")

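To put the consistency figures in context, the sketch below recomputes the majority and entropy consistency for Paradigm 0 from the label counts printed in this cell's output, once for the predictions and once for the gold labels. `label_uniformity` is a hypothetical helper written for this illustration and is not part of the notebook's pipeline; it assumes three possible labels, as the metric code does.

```python
import math
from collections import Counter

def label_uniformity(labels, num_classes=3):
    """Hypothetical helper: majority share and entropy consistency of a label list."""
    counts = Counter(labels)
    total = len(labels)
    # Share of items carrying the most frequent label
    majority_share = counts.most_common(1)[0][1] / total
    # Shannon entropy of the label distribution, in bits
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Same normalization as the metric code: log2 of min(num_classes, sample count)
    max_entropy = math.log2(min(num_classes, total))
    entropy_consistency = 1 - entropy / max_entropy if max_entropy > 0 else 1.0
    return majority_share, entropy_consistency

# Label counts copied from the Paradigm 0 output of this run
pred_labels = [0] * 4 + [2] * 6 + [1] * 9   # Prediction distribution: {0: 4, 2: 6, 1: 9}
gold_labels = [0] * 5 + [2] * 6 + [1] * 8   # Gold distribution: {0: 5, 2: 6, 1: 8}

pred_majority, pred_entropy_cons = label_uniformity(pred_labels)
gold_majority, gold_entropy_cons = label_uniformity(gold_labels)

print(f"predictions: majority={pred_majority:.3f}, entropy consistency={pred_entropy_cons:.3f}")
print(f"gold labels: majority={gold_majority:.3f}, entropy consistency={gold_entropy_cons:.3f}")
```

Under these assumptions the gold labels themselves score about 0.421 and 0.017, so a model that labeled every member of this paradigm correctly would still report low consistency. The low values in this run therefore seem to reflect the mix of gold labels within a paradigm more than erratic predictions.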


STEP 5: PARADIGM-LEVEL CONSISTENCY ANALYSIS
Running paradigm consistency analysis...
Grouped results into 6 paradigms
Paradigm size distribution: Counter({19: 5, 5: 1})

Completed consistency analysis for 6 paradigms

Overall Paradigm Statistics:
- Average accuracy: 0.982
- Average majority consistency: 0.435
- Average entropy consistency: 0.031
- Perfect consistency rate: 0.000

Combined Scores:
- Balanced (α=0.5): 0.709
- Accuracy-focused (α=0.7): 0.818
- Consistency-focused (α=0.3): 0.599

Detailed Analysis for First 3 Paradigms:

--- Paradigm 0 (Section: presupposition_all_n_presupposition) ---
Size: 19
Accuracy: 0.947
Majority consistency: 0.474
Entropy consistency: 0.048
Perfect consistency: False
Prediction distribution: {0: 4, 2: 6, 1: 9}
Gold distribution: {0: 5, 2: 6, 1: 8}
Transformation accuracy:
  trigger:interrogative | presup:positive: 1.000 (1/1)
  trigger:negated | presup:negated: 1.000 (1/1)
  trigger:conditional | presup:positive: 0.000 (0/1)
  trigger:conditional |