# ANLI Baseline

This notebook illustrates how to use the DeBERTa-v3-base-mnli-fever-anli model to run inference on the ANLI dataset.
The model has 184M parameters. It was trained in 2021 using a BERT-like encoding approach:
* The premise and the hypothesis are encoded using the DeBERTa-v3-base contextual encoder
* A fine-tuned classification head then uses the encodings to predict a distribution over the classification labels (entailment, neutral, contradiction)

Reported accuracy on the ANLI round 3 test set is 0.495 (see https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli)



In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(device)  # keep the model on the same device as the inputs

In [2]:
premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
hypothesis = "The movie was good."

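# Tokenize the pair and move all tensors to the device the model lives on
inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(model.device)
with torch.no_grad():  # inference only, no gradients needed
    output = model(**inputs)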
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'entailment': 6.6, 'neutral': 17.3, 'contradiction': 76.1}
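The percentages above come from applying a softmax to the three logits. As a minimal sketch of that step in plain Python (the helper name `softmax_percent` and the logit values are hypothetical, chosen only for illustration and not taken from the model):

```python
import math

def softmax_percent(logits, names):
    """Convert raw logits into percentages that sum to roughly 100."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return {name: round(100 * e / total, 1) for name, e in zip(names, exps)}

# Hypothetical logits, not produced by the model
example = softmax_percent([0.0, 1.0, 2.5], ["entailment", "neutral", "contradiction"])
print(example)
```

The largest logit always receives the largest share, so the predicted label is unchanged by the softmax; it only makes the scores easier to read.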


In [3]:
def evaluate(premise, hypothesis):
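    # Tokenize the pair and move all tensors to the device the model lives on
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**inputs)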
    prediction = torch.softmax(output["logits"][0], -1).tolist()
    prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
    return prediction

In [4]:
evaluate("The weather is nice today.", "It is sunny outside.")

{'entailment': 0.1, 'neutral': 99.8, 'contradiction': 0.0}

In [5]:
def get_prediction(pred_dict):
    if pred_dict["entailment"] > pred_dict["contradiction"]  and pred_dict["entailment"] > pred_dict["neutral"]:
        return "entailment"
    elif pred_dict["contradiction"] > pred_dict["entailment"] and pred_dict["contradiction"] > pred_dict["neutral"]:
        return "contradiction"
    else:
        return "neutral"
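The same selection can be written more compactly as an argmax over the dictionary. This is only a sketch of an alternative, and `get_prediction_argmax` is a hypothetical name that is not used elsewhere in this notebook:

```python
def get_prediction_argmax(pred_dict):
    """Return the label with the highest score (hypothetical alternative helper)."""
    return max(pred_dict, key=pred_dict.get)

scores = {'entailment': 6.6, 'neutral': 17.3, 'contradiction': 76.1}
print(get_prediction_argmax(scores))  # prints: contradiction
```

The two versions differ only on exact ties: `get_prediction` falls back to "neutral", whereas `max` returns the first of the tied keys in dictionary order.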

In [6]:
get_prediction(evaluate("The weather is nice today.", "It is sunny outside."))

'neutral'

In [7]:
get_prediction(evaluate("It is sunny outside.", "The weather is nice today."))

'entailment'

In [8]:
get_prediction(evaluate("It is sunny outside.", "The weather is terrible today."))

'contradiction'

## Load ANLI dataset

In [9]:
from datasets import load_dataset

dataset = load_dataset("facebook/anli")
# Keep only the examples that come with a human-written reason
dataset = dataset.filter(lambda x: x['reason'] is not None and x['reason'] != "")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [10]:
dataset

DatasetDict({
    train_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 2923
    })
    dev_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 4861
    })
    dev_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 13375
    })
    dev_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1200


In [11]:
# Evaluate the model on the ANLI dataset
from tqdm import tqdm
def evaluate_on_dataset(dataset):
    results = []
    label_names = ["entailment", "neutral", "contradiction"]
    for example in tqdm(dataset):
        premise = example['premise']
        hypothesis = example['hypothesis']
        prediction = evaluate(premise, hypothesis)
        results.append({
            'premise': premise,
            'hypothesis': hypothesis,
            'prediction': prediction,
            'pred_label': get_prediction(prediction),
            'gold_label': label_names[example['label']],
            'reason': example['reason']
        })
    return results

In [12]:
pred_test_r3 = evaluate_on_dataset(dataset['test_r3'])

100%|██████████| 1200/1200 [03:06<00:00,  6.44it/s]


In [13]:
pred_test_r3[:5]  # Display the first 5 predictions

[{'premise': "It is Sunday today, let's take a look at the most popular posts of the last couple of days. Most of the articles this week deal with the iPhone, its future version called the iPhone 8 or iPhone Edition, and new builds of iOS and macOS. There are also some posts that deal with the iPhone rival called the Galaxy S8 and some other interesting stories. The list of the most interesting articles is available below. Stay tuned for more rumors and don't forget to follow us on Twitter.",
  'hypothesis': 'The day of the passage is usually when Christians praise the lord together',
  'prediction': {'entailment': 2.4, 'neutral': 97.4, 'contradiction': 0.2},
  'pred_label': 'neutral',
  'gold_label': 'entailment',
  'reason': "Sunday is considered Lord's Day"},
 {'premise': 'By The Associated Press WELLINGTON, New Zealand (AP) — All passengers and crew have survived a crash-landing of a plane in a lagoon in the Federated States of Micronesia. WELLINGTON, New Zealand (AP) — All passeng

## Evaluate Metrics

Let's use the Hugging Face `evaluate` package to compute the performance of the baseline.


In [14]:
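# Aliased because `evaluate` is already the name of our helper function above.
# Note that this alias shadows Python's built-in `eval` in this notebook.
import evaluate as eval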

accuracy = eval.load("accuracy")
precision = eval.load("precision")
recall = eval.load("recall")
f1 = eval.load("f1")


In [15]:
clf_metrics = eval.combine(["accuracy", "f1", "precision", "recall"])

In [16]:
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])

{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}
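To see where these numbers come from, the same toy example can be worked out by hand. The sketch below treats label 1 as the positive class, which is the usual default for binary metrics; the variable names are chosen so that they do not overwrite the metric objects loaded above.

```python
predictions = [0, 1, 0]
references = [0, 1, 1]

pairs = list(zip(predictions, references))
tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives

acc = sum(1 for p, r in pairs if p == r) / len(pairs)
prec = tp / (tp + fp)
rec = tp / (tp + fn)
f1_score = 2 * prec * rec / (prec + rec)
print(acc, prec, rec, f1_score)
```

There is one true positive, no false positive and one false negative, which gives precision 1.0, recall 0.5 and an F1 of 2/3.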

## Your Turn

Compute the classification metrics of the baseline model on each test section of the ANLI dataset.

https://www.kaggle.com/code/faijanahamadkhan/llm-evaluation-framework-hugging-face provides good documentation on how to use the Hugging Face `evaluate` library.
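Because ANLI has three labels, precision, recall and F1 need an averaging strategy. A minimal plain-Python sketch of macro averaging follows: compute F1 per label, then take the unweighted mean. The function `macro_f1` and the toy labels are hypothetical, and treating an undefined ratio as 0 is an assumption of this sketch.

```python
def macro_f1(preds, golds, labels):
    """Average the per-label F1 scores, giving each label equal weight."""
    per_label = []
    for label in labels:
        tp = sum(1 for p, g in zip(preds, golds) if p == label and g == label)
        fp = sum(1 for p, g in zip(preds, golds) if p == label and g != label)
        fn = sum(1 for p, g in zip(preds, golds) if p != label and g == label)
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: undefined -> 0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label.append(f1)
    return sum(per_label) / len(per_label)

# Hypothetical toy labels: 0 = entailment, 1 = neutral, 2 = contradiction
toy_preds = [0, 1, 2, 1, 0, 2]
toy_golds = [0, 1, 2, 0, 1, 2]
print(macro_f1(toy_preds, toy_golds, labels=[0, 1, 2]))
```

In this toy case the per-label F1 scores are 0.5, 0.5 and 1.0, so the macro average is 2/3.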

In [17]:
# Task 1.1: evaluation on the ANLI test samples

print("Setting up evaluation metrics...")

def compute_classification_metrics(predictions_list):
    """
    Compute classification metrics for ANLI predictions using individual metrics
    
    Args:
        predictions_list: List of dictionaries with prediction results
    
    Returns:
        Dictionary with computed metrics
    """
    # Extract predicted and gold labels
    pred_labels = [pred['pred_label'] for pred in predictions_list]
    gold_labels = [pred['gold_label'] for pred in predictions_list]
    
    # Convert string labels to integers
    label_to_int = {"entailment": 0, "neutral": 1, "contradiction": 2}
    pred_ints = [label_to_int[label] for label in pred_labels]
    gold_ints = [label_to_int[label] for label in gold_labels]
    
    # Load the metrics individually so each one can be called with its own arguments
    accuracy = eval.load("accuracy")
    precision = eval.load("precision")
    recall = eval.load("recall")
    f1 = eval.load("f1")
    
    # Compute metrics with macro averaging for multiclass classification
    results = {
        'accuracy': accuracy.compute(predictions=pred_ints, references=gold_ints)['accuracy'],
        'precision': precision.compute(predictions=pred_ints, references=gold_ints, average='macro')['precision'],
        'recall': recall.compute(predictions=pred_ints, references=gold_ints, average='macro')['recall'],
        'f1': f1.compute(predictions=pred_ints, references=gold_ints, average='macro')['f1']
    }
    return results

print("✓ Evaluation metrics setup complete")

print("\n" + "="*60)
print("EVALUATING BASELINE MODEL ON ALL TEST SECTIONS")
print("="*60)

# Evaluate on all three test sections
test_sections = ['test_r1', 'test_r2', 'test_r3']
all_results = {}

for section in test_sections:
    print(f"\n📊 Evaluating {section}...")
    print(f"Dataset size: {len(dataset[section])} examples")
    
    # Run evaluation on this section
    predictions = evaluate_on_dataset(dataset[section])
    
    # Compute metrics
    metrics = compute_classification_metrics(predictions)
    all_results[section] = {
        'metrics': metrics,
        'predictions': predictions,
        'num_examples': len(predictions)
    }
    
    # Print results for this section
    print(f"✓ {section} completed - {len(predictions)} examples evaluated")
    print(f"  Accuracy: {metrics['accuracy']:.4f}")
    print(f"  F1: {metrics['f1']:.4f}")
    print(f"  Precision: {metrics['precision']:.4f}")
    print(f"  Recall: {metrics['recall']:.4f}")

print("\n" + "="*60)
print("SUMMARY OF RESULTS")
print("="*60)

# Create summary table
print(f"{'Section':<10} {'Examples':<10} {'Accuracy':<10} {'F1':<10} {'Precision':<10} {'Recall':<10}")
print("-" * 65)

for section in test_sections:
    metrics = all_results[section]['metrics']
    num_examples = all_results[section]['num_examples']
    print(f"{section:<10} {num_examples:<10} {metrics['accuracy']:<10.4f} {metrics['f1']:<10.4f} {metrics['precision']:<10.4f} {metrics['recall']:<10.4f}")

print("-" * 65)

# Analysis
print("\n" + "="*60)
print("ANALYSIS")
print("="*60)

# Overall statistics
total_examples = sum(all_results[section]['num_examples'] for section in test_sections)
weighted_accuracy = sum(all_results[section]['metrics']['accuracy'] * all_results[section]['num_examples'] 
                       for section in test_sections) / total_examples

print(f"📈 Total examples: {total_examples}")
print(f"📈 Weighted accuracy: {weighted_accuracy:.4f}")

# Performance across rounds
print(f"\n🔍 Performance by round:")
for i, section in enumerate(test_sections, 1):
    metrics = all_results[section]['metrics']
    print(f"   Round {i}: Accuracy = {metrics['accuracy']:.4f}, F1 = {metrics['f1']:.4f}")

# Best/worst sections
accuracies = {section: all_results[section]['metrics']['accuracy'] for section in test_sections}
best_section = max(accuracies, key=accuracies.get)
worst_section = min(accuracies, key=accuracies.get)

print(f"\n🎯 Best: {best_section} ({accuracies[best_section]:.4f})")
print(f"⚠️  Worst: {worst_section} ({accuracies[worst_section]:.4f})")

print(f"\n💾 Results saved in 'all_results' variable")

print("\n" + "="*60)
print("TASK 1.1 COMPLETED! ✓")
print("="*60)

Setting up evaluation metrics...
✓ Evaluation metrics setup complete

EVALUATING BASELINE MODEL ON ALL TEST SECTIONS

📊 Evaluating test_r1...
Dataset size: 1000 examples


100%|██████████| 1000/1000 [02:30<00:00,  6.66it/s]


✓ test_r1 completed - 1000 examples evaluated
  Accuracy: 0.7120
  F1: 0.7119
  Precision: 0.7135
  Recall: 0.7120

📊 Evaluating test_r2...
Dataset size: 1000 examples


100%|██████████| 1000/1000 [02:25<00:00,  6.85it/s]


✓ test_r2 completed - 1000 examples evaluated
  Accuracy: 0.5470
  F1: 0.5465
  Precision: 0.5472
  Recall: 0.5470

📊 Evaluating test_r3...
Dataset size: 1200 examples


100%|██████████| 1200/1200 [02:58<00:00,  6.71it/s]


✓ test_r3 completed - 1200 examples evaluated
  Accuracy: 0.4950
  F1: 0.4943
  Precision: 0.4985
  Recall: 0.4946

SUMMARY OF RESULTS
Section    Examples   Accuracy   F1         Precision  Recall    
-----------------------------------------------------------------
test_r1    1000       0.7120     0.7119     0.7135     0.7120    
test_r2    1000       0.5470     0.5465     0.5472     0.5470    
test_r3    1200       0.4950     0.4943     0.4985     0.4946    
-----------------------------------------------------------------

ANALYSIS
📈 Total examples: 3200
📈 Weighted accuracy: 0.5791

🔍 Performance by round:
   Round 1: Accuracy = 0.7120, F1 = 0.7119
   Round 2: Accuracy = 0.5470, F1 = 0.5465
   Round 3: Accuracy = 0.4950, F1 = 0.4943

🎯 Best: test_r1 (0.7120)
⚠️  Worst: test_r3 (0.4950)

💾 Results saved in 'all_results' variable

TASK 1.1 COMPLETED! ✓
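As a sanity check, the weighted accuracy can be recomputed directly from the summary table. This sketch only reuses the accuracies and example counts printed above:

```python
# Accuracy and example counts copied from the summary table
section_results = {
    "test_r1": (0.7120, 1000),
    "test_r2": (0.5470, 1000),
    "test_r3": (0.4950, 1200),
}

total = sum(n for _, n in section_results.values())
correct = sum(acc * n for acc, n in section_results.values())
weighted = correct / total      # accuracy over all 3200 examples pooled together
errors = round(total - correct)  # number of misclassified examples

print(f"{weighted:.4f}")  # 0.5791
print(errors)             # 1347
```

Weighting by section size matters here because test_r3 is both the largest and the hardest section, so the pooled accuracy is slightly lower than the plain mean of the three accuracies.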


In [18]:
# Task 1.2: Sample 20 Errors for Investigation

import random

print("="*60)
print("TASK 1.2: SAMPLING ERRORS FOR ANALYSIS")
print("="*60)

# Collect all errors from all test sections
all_errors = []

for section_name, section_data in all_results.items():
    predictions = section_data['predictions']
    for pred in predictions:
        if pred['pred_label'] != pred['gold_label']:  # This is an error
            error_info = {
                'section': section_name,
                'premise': pred['premise'],
                'hypothesis': pred['hypothesis'],
                'predicted': pred['pred_label'],
                'gold': pred['gold_label'],
                'reason': pred['reason'],
                'prediction_scores': pred['prediction']
            }
            all_errors.append(error_info)

print(f"📊 Total errors found: {len(all_errors)}")
for section in ['test_r1', 'test_r2', 'test_r3']:
    section_errors = [e for e in all_errors if e['section'] == section]
    print(f"   {section}: {len(section_errors)} errors")

# Sample 20 errors for analysis
random.seed(42)  # For reproducible results
sampled_errors = random.sample(all_errors, min(20, len(all_errors)))

print(f"\n🔍 Here are {len(sampled_errors)} randomly sampled errors for investigation:")
print("="*80)

# Display each error clearly
for i, error in enumerate(sampled_errors, 1):
    print(f"\n--- ERROR {i} ---")
    print(f"Section: {error['section']}")
    print(f"Premise: {error['premise']}")
    print(f"Hypothesis: {error['hypothesis']}")
    print(f"Model Predicted: {error['predicted']}")
    print(f"Correct Answer: {error['gold']}")
    print(f"Prediction Confidence: {error['prediction_scores']}")
    print(f"Human Reason: {error['reason']}")
    print("-" * 80)

print(f"\n✅ {len(sampled_errors)} error samples displayed!")
print("📝 Copy these examples and send them for detailed analysis.")

TASK 1.2: SAMPLING ERRORS FOR ANALYSIS
📊 Total errors found: 1347
   test_r1: 288 errors
   test_r2: 453 errors
   test_r3: 606 errors

🔍 Here are 20 randomly sampled errors for investigation:

--- ERROR 1 ---
Section: test_r3
Premise: The National Park Trust identified 20 high-priority sites - including the Blue Ridge Parkway in North Carolina and Virginia and Everglades National Park in Florida - as areas with private property that could be sold.
Hypothesis: There are areas with property that could be sold in many states
Model Predicted: neutral
Correct Answer: entailment
Prediction Confidence: {'entailment': 12.7, 'neutral': 86.9, 'contradiction': 0.4}
Human Reason: There were at least 20 states, because the 20 is only considered high priority, there are probably more, which means the word "many" is probably applicable
--------------------------------------------------------------------------------

--- ERROR 2 ---
Section: test_r1
Premise: Gustave Marie Maurice Mesny (28 March 1886

### Task 1.2: Reasons the model made a mistake  


#### 1. **Mathematical/Numerical Reasoning Failures** (25%)
**Errors: 2, 3, 8, 14, 18**

- **Error 2**: Failed age calculation (1990→2014 = 24 years, not 18)
- **Error 3**: Failed age calculation (1973→2019 = 45-46 years) 
- **Error 8**: Failed duration calculation (March 1990→Dec 1992 ≈ 2.75 years, not 3)
- **Error 14**: Failed reverse calculation (8th year in 1938 → started 1930)
- **Error 18**: Failed counting (8 named actors = 8 protagonists)

**Pattern**: Model struggles with basic arithmetic, date calculations, and counting tasks.

---

#### 2. **Missing Information → False Contradiction** (40%)
**Errors: 1, 6, 10, 11, 12, 13, 15, 17**

- **Error 1**: "Only 3 countries" → Model sees contradiction instead of neutral (no exclusivity stated)
- **Error 6**: Extension name → Model assumes contradiction from missing info
- **Error 10**: "Only in branch" → Model contradicts unstated restriction  
- **Error 11**: Start date claim → Model treats missing date as contradiction
- **Error 12**: Future preferences → Model assumes contradiction from unknown future
- **Error 13**: Profession claim → Model contradicts unstated profession
- **Error 15**: Meeting location → Model contradicts unstated location
- **Error 17**: Career continuation → Model contradicts missing timeline info

**Pattern**: Model is overly aggressive in predicting contradictions when information is simply absent. Should predict "neutral" but defaults to "contradiction."

---

#### 3. **Temporal Reasoning Errors** (15%)
**Errors: 5, 16, 19**

- **Error 5**: Marriage timeline vs TV show timing → Can't handle missing temporal overlap
- **Error 16**: Past tense description → Incorrectly infers current non-existence  
- **Error 19**: "Found as kitten" → Incorrectly assumes current state

**Pattern**: Model struggles with temporal states, timeline inference, and distinguishing past vs present states.

---

#### 4. **Basic Reading Comprehension** (10%)
**Errors: 4, 20**

- **Error 4**: Clear text states "Van Morrison song on album" → Model predicts contradiction
- **Error 20**: Text confusion between "large part" vs "tinny population"

**Pattern**: Fundamental misreading of clear, direct statements.

---

#### 5. **Complex Linguistic Reasoning** (5%)
**Error: 9**

- **Error 9**: Wordplay with "sovereignty" as publication name → Model missed that core claim (Britain refused to address sovereignty) remains true despite confusing wording

**Pattern**: Difficulty with complex sentence structures and embedded meaning.

---

#### 6. **Opinion vs Fact Confusion** (5%) 
**Error: 7**

- **Error 7**: "Should be called" (opinion) → Model treats normative statement as factual contradiction

**Pattern**: Cannot distinguish between factual claims and opinion statements.
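The categories above come from reading a sample of 20 errors by hand. One way to check a pattern such as "neutral predicted as contradiction" on the full set of predictions is to count (gold, predicted) pairs. The sketch below assumes records in the format returned by `evaluate_on_dataset`; the helper `confusion_counts` and the toy records are hypothetical.

```python
from collections import Counter

def confusion_counts(predictions_list):
    """Count (gold_label, pred_label) pairs among the misclassified examples."""
    return Counter(
        (p["gold_label"], p["pred_label"])
        for p in predictions_list
        if p["gold_label"] != p["pred_label"]
    )

# Hypothetical mini-sample, not real model output
toy = [
    {"gold_label": "neutral", "pred_label": "contradiction"},
    {"gold_label": "neutral", "pred_label": "contradiction"},
    {"gold_label": "entailment", "pred_label": "neutral"},
    {"gold_label": "entailment", "pred_label": "entailment"},
]

for (gold, pred), count in confusion_counts(toy).most_common():
    print(f"{gold} -> {pred}: {count}")
```

Applied to the predictions of a whole test section, the most frequent pairs indicate which kinds of confusion dominate, and whether the hand-labelled sample is representative.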

In [19]:
import pickle

key = "test_r3"
# Filter out missing reasons
filtered_data = dataset[key].filter(lambda x: x["reason"] is not None and x["reason"] != "")

results = evaluate_on_dataset(filtered_data)

# Save baseline model predictions
with open("baseline_preds.pkl", "wb") as f:
    pickle.dump(results, f)


100%|██████████| 1200/1200 [03:04<00:00,  6.52it/s]
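The saved predictions can later be read back with `pickle.load`. The sketch below demonstrates the round trip on a small stand-in list written to a temporary file, so it does not touch `baseline_preds.pkl`; the sample record and the demo file name are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the list returned by evaluate_on_dataset
sample_results = [{"pred_label": "neutral", "gold_label": "entailment"}]

# Demo path in a temporary directory, not the real predictions file
path = os.path.join(tempfile.mkdtemp(), "baseline_preds_demo.pkl")

with open(path, "wb") as f:
    pickle.dump(sample_results, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == sample_results)  # True
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.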
