# Draft / Ideas: Multilingual Evaluation

## Overview
This notebook evaluates the fine-tuned XLM-RoBERTa model on multiple languages to test **cross-lingual transfer**.

**Tests**
1. **English test set** (in-domain, literary text, full schema)
2. **German literary text** (zero-shot; placeholder examples until manual annotations are added)
3. **French news text** (zero-shot, out-of-domain, overlapping schema only: PER, ORG, LOC)


In [1]:
# Import required libraries
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from typing import List, Dict, Tuple, Optional
from collections import defaultdict
import pandas as pd
from tqdm.auto import tqdm

# Hugging Face libraries
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    pipeline
)
import torch

# Evaluation metrics
from seqeval.metrics import (
    classification_report,
    f1_score,
    precision_score,
    recall_score
)

import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported")

✓ Libraries imported


In [2]:
# Configure paths
MODEL_PATH = Path("../models/litbank-xlm-roberta")
PROCESSED_DATA_PATH = Path("../data/processed")
RESULTS_PATH = Path("../results")
RESULTS_PATH.mkdir(parents=True, exist_ok=True)

# Check if model exists
if not MODEL_PATH.exists():
    print(f"⚠️  Model not found at {MODEL_PATH}")
    print("Please run Notebook 2 (Model Training) first to train the model.")
else:
    print(f"✓ Found trained model at: {MODEL_PATH.absolute()}")

✓ Found trained model at: /storage/homefs/nw03x063/CAS_Mod4_NER/notebooks/../models/litbank-xlm-roberta


## 1. Load Fine-tuned Model

We'll load the XLM-RoBERTa model trained on English LitBank data.

In [3]:
# Load model and tokenizer
print("Loading fine-tuned model...")

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForTokenClassification.from_pretrained(MODEL_PATH)

# Load label mapping
with open(PROCESSED_DATA_PATH / "label_mapping.json", 'r') as f:
    label_mapping = json.load(f)

label2id = label_mapping["label2id"]
id2label = {int(k): v for k, v in label_mapping["id2label"].items()}

print(f"\n✓ Model loaded successfully")
print(f"  Model: XLM-RoBERTa fine-tuned on LitBank")
print(f"  Parameters: {model.num_parameters():,}")
print(f"  Entity types: {len([l for l in label2id if l.startswith('B-')])}")
print(f"\nSupported entity types:")
entity_types = sorted(set([l[2:] for l in label2id.keys() if l.startswith('B-')]))
for entity_type in entity_types:
    print(f"  - {entity_type}")

Loading fine-tuned model...


The tokenizer you are loading from '../models/litbank-xlm-roberta' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.



✓ Model loaded successfully
  Model: XLM-RoBERTa fine-tuned on LitBank
  Parameters: 277,464,591
  Entity types: 7

Supported entity types:
  - FAC
  - GPE
  - LOC
  - ORG
  - PER
  - TIME
  - VEH


## 2. Create NER Pipeline

- Simplifies inference (handles tokenization, prediction, decoding)
- Groups adjacent subword predictions into entity spans (`aggregation_strategy="simple"`)
- Provides confidence scores
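
As a rough illustration of what grouping subword predictions means, the sketch below merges BIO-tagged word pieces into entity spans and averages their scores. It is a simplified stand-in, not the library's implementation; the function name `group_subwords`, the subword split of "Scrooge", and the scores are all hypothetical.

```python
from statistics import mean


def group_subwords(pieces):
    """Merge BIO-tagged word pieces into entity spans (simplified illustration).

    pieces: list of dicts with keys 'word', 'tag' and 'score'.
    """
    groups, current = [], []
    for piece in pieces:
        tag = piece["tag"]
        if tag == "O":
            # An 'O' piece closes any open entity.
            if current:
                groups.append(current)
                current = []
            continue
        prefix, etype = tag.split("-", 1)
        same_type = bool(current) and current[-1]["tag"].split("-", 1)[1] == etype
        if prefix == "I" and same_type:
            current.append(piece)  # continue the open entity
        else:
            if current:
                groups.append(current)
            current = [piece]  # 'B-' or a type change starts a new entity
    if current:
        groups.append(current)

    return [
        {
            "entity_group": g[0]["tag"].split("-", 1)[1],
            # '▁' marks a word start in SentencePiece vocabularies.
            "word": "".join(p["word"] for p in g).replace("▁", " ").strip(),
            "score": mean(p["score"] for p in g),
        }
        for g in groups
    ]


# Hypothetical word pieces and scores, for illustration only.
pieces = [
    {"word": "▁asked", "tag": "O", "score": 0.99},
    {"word": "▁Scr", "tag": "B-PER", "score": 0.9},
    {"word": "oo", "tag": "I-PER", "score": 0.8},
    {"word": "ge", "tag": "I-PER", "score": 0.7},
]
spans = group_subwords(pieces)
print(spans)
```
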

The example test passage is from Charles Dickens' *A Christmas Carol*.

In [4]:
# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # Aggregate subwords into entities
    device=0 if torch.cuda.is_available() else -1  # Use GPU if available
)

print("✓ NER pipeline created")
print(f"  Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

# Test on example sentence
test_sentence = "Although they had but that moment left the school behind them, they were now in the busy thoroughfares of a city, where shadowy passengers passed and re-passed; where shadowy carts and coaches battled for the way, and all the strife and tumult of a real city were. It was made plain enough, by the dressing of the shops, that here, too, it was Christmas-time again; but it was evening, and the streets were lighted up. The Ghost stopped at a certain warehouse door, and asked Scrooge if he knew it."
print(f"\nTest prediction:")
print(f"  Input: {test_sentence}")
predictions = ner_pipeline(test_sentence)
print(f"\n  Detected entities:")
for entity in predictions:
    print(f"    - {entity['word']:20s} → {entity['entity_group']:6s} (confidence: {entity['score']:.3f})")

Device set to use cuda:0


✓ NER pipeline created
  Device: GPU

Test prediction:
  Input: Although they had but that moment left the school behind them, they were now in the busy thoroughfares of a city, where shadowy passengers passed and re-passed; where shadowy carts and coaches battled for the way, and all the strife and tumult of a real city were. It was made plain enough, by the dressing of the shops, that here, too, it was Christmas-time again; but it was evening, and the streets were lighted up. The Ghost stopped at a certain warehouse door, and asked Scrooge if he knew it.

  Detected entities:
    - the school           → FAC    (confidence: 0.936)
    - the busy thoroughfares of → FAC    (confidence: 0.825)
    - a city               → GPE    (confidence: 0.868)
    - shadowy passengers   → PER    (confidence: 0.972)
    - shadowy carts        → VEH    (confidence: 0.635)
    - coaches              → VEH    (confidence: 0.511)
    - way                  → FAC    (confidence: 0.711)
    - a real city 

## 3. Section 1: English Test Set Evaluation

- Domain: Literary texts (in-domain)
- Language: English (training language)
- Schema: Full 6 entity types (PER, LOC, GPE, ORG, FAC, VEH); the label mapping also lists TIME, which does not appear in the test set report below

In [6]:
def load_test_data(data_path: Path) -> Tuple[List[List[str]], List[List[str]]]:
    """
    Load test data with tokens and labels.
    
    Returns:
        all_tokens: List of token sequences
        all_labels: List of label sequences (BIO tags)
    """
    with open(data_path, 'r', encoding='utf-8') as f:
        test_data = json.load(f)
    
    all_tokens = [example['tokens'] for example in test_data]
    
    # Convert label IDs back to strings
    all_labels = []
    for example in test_data:
        labels = [id2label[label_id] for label_id in example['ner_tags']]
        all_labels.append(labels)
    
    return all_tokens, all_labels


# Load English test data
print("Loading English test set...")
en_tokens, en_labels = load_test_data(PROCESSED_DATA_PATH / "english_test.json")
print(f"✓ Loaded {len(en_tokens)} English test examples")

Loading English test set...
✓ Loaded 693 English test examples


In [7]:
def predict_sequences(tokens_list: List[List[str]], ner) -> List[List[str]]:
    """
    Generate predictions for token sequences.
    
    WHY: We need to convert token lists back to text, run inference,
    then align predictions with original tokens.
    
    Args:
        tokens_list: List of token sequences
        ner: Hugging Face NER pipeline (named `ner` so it does not shadow
            the imported `pipeline` function)
        
    Returns:
        List of predicted label sequences (aligned with input tokens)
    """
    all_predictions = []
    
    for tokens in tqdm(tokens_list, desc="Predicting"):
        # Join tokens to text (approximate reconstruction)
        text = " ".join(tokens)
        
        # Get predictions from pipeline
        entities = ner(text)
        
        # Align predictions with original tokens
        predictions = align_predictions_with_tokens(tokens, entities, text)
        all_predictions.append(predictions)
    
    return all_predictions


def align_predictions_with_tokens(tokens: List[str], 
                                 entities: List[Dict], 
                                 text: str) -> List[str]:
    """
    Align entity predictions with original token positions.
    
    WHY: Pipeline returns character-based spans, we need token-level labels.
    """
    # Initialize all as 'O' (outside entity)
    labels = ['O'] * len(tokens)
    
    # Build character-to-token mapping
    char_to_token = {}
    char_pos = 0
    for token_idx, token in enumerate(tokens):
        token_len = len(token)
        for i in range(token_len):
            char_to_token[char_pos + i] = token_idx
        char_pos += token_len + 1  # +1 for space
    
    # Map entities to tokens
    for entity in entities:
        start_char = entity['start']
        end_char = entity['end']
        entity_type = entity['entity_group']
        
        # Find which tokens this entity spans
        start_token = char_to_token.get(start_char)
        end_token = char_to_token.get(end_char - 1)
        
        if start_token is not None and end_token is not None:
            # First token gets B- label
            labels[start_token] = f"B-{entity_type}"
            
            # Remaining tokens get I- label
            for token_idx in range(start_token + 1, end_token + 1):
                labels[token_idx] = f"I-{entity_type}"
    
    return labels


print("✓ Prediction functions defined")

✓ Prediction functions defined
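
To make the alignment step concrete, the following standalone sketch maps character spans back to whitespace-joined tokens using span overlap instead of exact character lookups, so a span that starts or ends on a space is still assigned to tokens. It is a variant written for illustration, not the function used above; `align_by_overlap` and the hand-written entity dictionaries are assumptions, although the `start`, `end`, and `entity_group` keys match the pipeline output used in this notebook.

```python
from typing import Dict, List


def align_by_overlap(tokens: List[str], entities: List[Dict]) -> List[str]:
    """Assign BIO labels to tokens that overlap each entity's character span.

    Assumes the text was built with " ".join(tokens).
    """
    # Character span [start, end) of every token in the joined text.
    spans, pos = [], 0
    for token in tokens:
        spans.append((pos, pos + len(token)))
        pos += len(token) + 1  # +1 for the joining space

    labels = ["O"] * len(tokens)
    for entity in entities:
        covered = [
            i for i, (start, end) in enumerate(spans)
            if start < entity["end"] and end > entity["start"]
        ]
        for rank, i in enumerate(covered):
            prefix = "B" if rank == 0 else "I"
            labels[i] = f"{prefix}-{entity['entity_group']}"
    return labels


tokens = ["The", "Ghost", "asked", "Scrooge", "."]
# Joined text: "The Ghost asked Scrooge ."
# "The Ghost" covers characters 0-9 and "Scrooge" covers 16-23.
# The second span deliberately starts on the preceding space (index 15).
entities = [
    {"start": 0, "end": 9, "entity_group": "PER"},
    {"start": 15, "end": 23, "entity_group": "PER"},
]
labels = align_by_overlap(tokens, entities)
print(labels)
```

With exact character lookups, the second span would be dropped because index 15 is a space; the overlap test still assigns it to "Scrooge".
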


In [8]:
# Generate predictions for English test set
print("Generating predictions for English test set...")
en_predictions = predict_sequences(en_tokens, ner_pipeline)

# Compute metrics
en_results = {
    'precision': precision_score(en_labels, en_predictions),
    'recall': recall_score(en_labels, en_predictions),
    'f1': f1_score(en_labels, en_predictions)
}

print("\n" + "="*60)
print("ENGLISH TEST SET RESULTS (In-Domain)")
print("="*60)
print(f"  Precision: {en_results['precision']:.4f}")
print(f"  Recall:    {en_results['recall']:.4f}")
print(f"  F1 Score:  {en_results['f1']:.4f}")

# Detailed report
print("\n" + classification_report(en_labels, en_predictions, digits=4))

Generating predictions for English test set...



You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



ENGLISH TEST SET RESULTS (In-Domain)
  Precision: 0.6462
  Recall:    0.6952
  F1 Score:  0.6698

              precision    recall  f1-score   support

         FAC     0.4925    0.5078    0.5000       193
         GPE     0.6000    0.8500    0.7034        60
         LOC     0.5042    0.5505    0.5263       109
         ORG     0.0833    0.1667    0.1111         6
         PER     0.7125    0.7500    0.7308       856
         VEH     0.5556    0.6250    0.5882        16

   micro avg     0.6462    0.6952    0.6698      1240
   macro avg     0.4913    0.5750    0.5267      1240
weighted avg     0.6495    0.6952    0.6707      1240
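
The warning in the output above concerns calling the pipeline one text at a time. One possible remedy is to pass texts in chunks, since a Hugging Face pipeline also accepts a list of strings and returns one result per input. The helper below is only a sketch: `predict_in_batches` is a hypothetical name, and `fake_ner` is a stand-in so the example runs without a model.

```python
from typing import Callable, List


def predict_in_batches(texts: List[str], ner: Callable, batch_size: int = 16) -> List:
    """Call `ner` on chunks of texts and return one result per input text."""
    results = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        results.extend(ner(chunk))  # a list in, a list of per-text results out
    return results


def fake_ner(batch: List[str]) -> List[List[dict]]:
    """Stand-in for the real pipeline: tags capitalized words as PER."""
    return [
        [{"word": w, "entity_group": "PER"} for w in text.split() if w.istitle()]
        for text in batch
    ]


texts = ["asked Scrooge", "no names here", "the Ghost stopped"]
outputs = predict_in_batches(texts, fake_ner, batch_size=2)
print(outputs)
```

With the real pipeline, `ner_pipeline(texts, batch_size=16)` would be an alternative worth trying; whether it is faster depends on the hardware and the sequence lengths.
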



## 4. Section 2: Cross-Lingual Evaluation

### 4.1 German Literary Text (Placeholder)

**Setup:**
- Domain: Literary texts (in-domain)
- Language: German (zero-shot)
- Schema: Full 6 entity types

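**Note:** This section uses placeholder data; add manually annotated German literary excerpts here. The placeholder gold labels tag cities (Frankfurt, Wahlheim, Berlin) as LOC, whereas the model predicts GPE for them, so the scores below partly reflect an annotation mismatch and should not be read as a measure of cross-lingual transfer.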

In [11]:
# German literary text examples (manually annotated - TO BE ADDED)
german_examples = [
    {
        "text": "Der junge Werther reiste von Frankfurt nach Wahlheim.",
        "tokens": ["Der", "junge", "Werther", "reiste", "von", "Frankfurt", "nach", "Wahlheim", "."],
        "labels": ["O", "O", "B-PER", "O", "O", "B-LOC", "O", "B-LOC", "O"]
    },
    {
        "text": "In Berlin traf Faust den Gelehrten Wagner.",
        "tokens": ["In", "Berlin", "traf", "Faust", "den", "Gelehrten", "Wagner", "."],
        "labels": ["O", "B-LOC", "O", "B-PER", "O", "O", "B-PER", "O"]
    },
    # ADD MORE EXAMPLES HERE
]

print(f"German examples: {len(german_examples)}")
print("\n⚠️  Note: Currently using minimal placeholder data.")
print("   To fully test German, add more annotated examples above.")

German examples: 2

⚠️  Note: Currently using minimal placeholder data.
   To fully test German, add more annotated examples above.


In [12]:
if german_examples:
    # Extract tokens and labels
    de_tokens = [ex['tokens'] for ex in german_examples]
    de_labels = [ex['labels'] for ex in german_examples]
    
    # Generate predictions
    print("Generating predictions for German literary texts...")
    de_predictions = predict_sequences(de_tokens, ner_pipeline)
    
    # Compute metrics
    de_results = {
        'precision': precision_score(de_labels, de_predictions),
        'recall': recall_score(de_labels, de_predictions),
        'f1': f1_score(de_labels, de_predictions)
    }
    
    print("\n" + "="*60)
    print("GERMAN LITERARY TEXT RESULTS (Zero-Shot)")
    print("="*60)
    print(f"  Examples:  {len(german_examples)}")
    print(f"  Precision: {de_results['precision']:.4f}")
    print(f"  Recall:    {de_results['recall']:.4f}")
    print(f"  F1 Score:  {de_results['f1']:.4f}")
    
    print("\n" + classification_report(de_labels, de_predictions, digits=4))
    
    # Show example predictions
    print("\nExample predictions:")
    for i, example in enumerate(german_examples[:2]):
        print(f"\n  Example {i+1}: {example['text']}")
        for token, true_label, pred_label in zip(de_tokens[i], de_labels[i], de_predictions[i]):
            if true_label != 'O' or pred_label != 'O':
                match = "✓" if true_label == pred_label else "✗"
                print(f"    {token:15s} | True: {true_label:8s} | Pred: {pred_label:8s} {match}")
else:
    print("⚠️  No German examples provided. Skipping German evaluation.")
    de_results = None

Generating predictions for German literary texts...




GERMAN LITERARY TEXT RESULTS (Zero-Shot)
  Examples:  2
  Precision: 0.1667
  Recall:    0.1667
  F1 Score:  0.1667

              precision    recall  f1-score   support

         GPE     0.0000    0.0000    0.0000         0
         LOC     0.0000    0.0000    0.0000         3
         PER     0.3333    0.3333    0.3333         3

   micro avg     0.1667    0.1667    0.1667         6
   macro avg     0.1111    0.1111    0.1111         6
weighted avg     0.1667    0.1667    0.1667         6


Example predictions:

  Example 1: Der junge Werther reiste von Frankfurt nach Wahlheim.
    Der             | True: O        | Pred: B-PER    ✗
    junge           | True: O        | Pred: I-PER    ✗
    Werther         | True: B-PER    | Pred: I-PER    ✗
    Frankfurt       | True: B-LOC    | Pred: B-GPE    ✗
    Wahlheim        | True: B-LOC    | Pred: B-GPE    ✗

  Example 2: In Berlin traf Faust den Gelehrten Wagner.
    Berlin          | True: B-LOC    | Pred: B-GPE    ✗
    Faust         
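
The scores above are entity-level: a prediction counts only if both its span and its type match a gold entity. The sketch below reproduces that idea for Example 1 using the labels printed above; `extract_spans` and `span_prf` are hypothetical helpers and a simplification of what `seqeval` does, not its implementation.

```python
from typing import List, Set, Tuple


def extract_spans(labels: List[str]) -> Set[Tuple[str, int, int]]:
    """Return (type, start, end) for each entity in a BIO sequence."""
    spans, start, etype = set(), None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes the last span
        prefix, _, current = label.partition("-")
        continues = prefix == "I" and current == etype
        if start is not None and not continues:
            spans.add((etype, start, i - 1))
            start, etype = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, current
    return spans


def span_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching (type, start, end) spans."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Example 1: "Der junge Werther reiste von Frankfurt nach Wahlheim ."
gold = ["O", "O", "B-PER", "O", "O", "B-LOC", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "I-PER", "O", "O", "B-GPE", "O", "B-GPE", "O"]
print(extract_spans(gold))
print(extract_spans(pred))
print(span_prf(gold, pred))
```

No predicted span matches a gold span exactly, so this sentence contributes no correct entities even though "Werther" lies inside a predicted PER span.
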

### 4.2 French News Dataset (Out-of-Domain)

**Setup:**
- Domain: News articles (out-of-domain from literary training data)
- Language: French (zero-shot)
- Schema: **Overlapping tags only** (PER, ORG, LOC)

**Why different schema?**
- News datasets typically don't annotate FAC/VEH
- GPE often merged with LOC in news
- Tests domain adaptation: literary → news
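
Because GPE is often folded into LOC in news annotation, one option is to map predicted GPE tags to LOC before scoring instead of discarding them. The sketch below shows that idea; it is an assumption about how the two schemas could be reconciled, not what the cells in this section do, and `map_labels`, `TYPE_MAP`, and `ALLOWED` are hypothetical names.

```python
from typing import Dict, List, Set

# Hypothetical mapping from LitBank-style types to a news-style schema.
TYPE_MAP: Dict[str, str] = {"GPE": "LOC"}
ALLOWED: Set[str] = {"PER", "ORG", "LOC"}


def map_labels(labels: List[str],
               type_map: Dict[str, str] = TYPE_MAP,
               allowed: Set[str] = ALLOWED) -> List[str]:
    """Rename entity types via type_map, then drop types outside `allowed`."""
    mapped = []
    for label in labels:
        if label == "O":
            mapped.append("O")
            continue
        prefix, etype = label.split("-", 1)
        etype = type_map.get(etype, etype)
        # Types with no counterpart in the target schema become 'O'.
        mapped.append(f"{prefix}-{etype}" if etype in allowed else "O")
    return mapped


pred = ["O", "B-PER", "I-PER", "O", "B-GPE", "B-FAC", "O"]
print(map_labels(pred))
```

Under this mapping a city predicted as GPE would count as a correct LOC, whereas filtering alone turns it into a missed entity.
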

In [None]:
# French news examples (overlapping schema: PER, ORG, LOC)
french_examples = [
    {
        "text": "Le président Emmanuel Macron a visité Paris hier.",
        "tokens": ["Le", "président", "Emmanuel", "Macron", "a", "visité", "Paris", "hier", "."],
        "labels": ["O", "O", "B-PER", "I-PER", "O", "O", "B-LOC", "O", "O"]
    },
    {
        "text": "L'ONU a organisé une conférence à Genève.",
        "tokens": ["L'", "ONU", "a", "organisé", "une", "conférence", "à", "Genève", "."],
        "labels": ["O", "B-ORG", "O", "O", "O", "O", "O", "B-LOC", "O"]
    },
    {
        "text": "La ministre Sophie Dupont représente la France.",
        "tokens": ["La", "ministre", "Sophie", "Dupont", "représente", "la", "France", "."],
        "labels": ["O", "O", "B-PER", "I-PER", "O", "O", "B-LOC", "O"]
    },
    # ADD MORE FRENCH NEWS EXAMPLES HERE
]

print(f"French news examples: {len(french_examples)}")
print("\n⚠️  Note: Currently using minimal placeholder data.")
print("   To fully test French, add more annotated news examples above.")

In [None]:
def filter_overlapping_entities(labels: List[List[str]], 
                               predictions: List[List[str]], 
                               allowed_types: List[str]) -> Tuple[List[List[str]], List[List[str]]]:
    """
    Filter labels to only include overlapping entity types.
    
    WHY: French news uses different schema than LitBank.
    We only evaluate on entity types present in both.
    
    Args:
        labels: True labels
        predictions: Predicted labels
        allowed_types: Entity types to keep (e.g., ["PER", "ORG", "LOC"])
        
    Returns:
        Filtered labels and predictions
    """
    filtered_labels = []
    filtered_predictions = []
    
    for label_seq, pred_seq in zip(labels, predictions):
        filtered_label = []
        filtered_pred = []
        
        for label, pred in zip(label_seq, pred_seq):
            # Extract entity type (remove B-/I- prefix; maxsplit keeps hyphenated types intact)
            label_type = label.split('-', 1)[1] if '-' in label else None
            pred_type = pred.split('-', 1)[1] if '-' in pred else None
            
            # Keep only allowed types
            if label_type in allowed_types or label == 'O':
                filtered_label.append(label)
            else:
                filtered_label.append('O')  # Convert non-overlapping to O
            
            if pred_type in allowed_types or pred == 'O':
                filtered_pred.append(pred)
            else:
                filtered_pred.append('O')
        
        filtered_labels.append(filtered_label)
        filtered_predictions.append(filtered_pred)
    
    return filtered_labels, filtered_predictions


print("✓ Filtering function defined")

In [None]:
if french_examples:
    # Extract tokens and labels
    fr_tokens = [ex['tokens'] for ex in french_examples]
    fr_labels = [ex['labels'] for ex in french_examples]
    
    # Generate predictions
    print("Generating predictions for French news texts...")
    fr_predictions_raw = predict_sequences(fr_tokens, ner_pipeline)
    
    # Filter to overlapping schema (PER, ORG, LOC only)
    OVERLAPPING_TYPES = ['PER', 'ORG', 'LOC']
    fr_labels_filtered, fr_predictions = filter_overlapping_entities(
        fr_labels, 
        fr_predictions_raw, 
        OVERLAPPING_TYPES
    )
    
    # Compute metrics
    fr_results = {
        'precision': precision_score(fr_labels_filtered, fr_predictions),
        'recall': recall_score(fr_labels_filtered, fr_predictions),
        'f1': f1_score(fr_labels_filtered, fr_predictions)
    }
    
    print("\n" + "="*60)
    print("FRENCH NEWS RESULTS (Zero-Shot, Out-of-Domain)")
    print("="*60)
    print(f"  Examples:  {len(french_examples)}")
    print(f"  Schema:    Overlapping tags only (PER, ORG, LOC)")
    print(f"  Precision: {fr_results['precision']:.4f}")
    print(f"  Recall:    {fr_results['recall']:.4f}")
    print(f"  F1 Score:  {fr_results['f1']:.4f}")
    
    print("\n" + classification_report(fr_labels_filtered, fr_predictions, digits=4))
    
    # Show example predictions
    print("\nExample predictions:")
    for i, example in enumerate(french_examples[:2]):
        print(f"\n  Example {i+1}: {example['text']}")
        for token, true_label, pred_label in zip(fr_tokens[i], fr_labels_filtered[i], fr_predictions[i]):
            if true_label != 'O' or pred_label != 'O':
                match = "✓" if true_label == pred_label else "✗"
                print(f"    {token:15s} | True: {true_label:8s} | Pred: {pred_label:8s} {match}")
else:
    print("⚠️  No French examples provided. Skipping French evaluation.")
    fr_results = None

## 5. Cross-Lingual Comparison Table

Summary of performance across all languages.

In [None]:
# Create comparison table
comparison_data = [
    {
        'Language': 'English',
        'Domain': 'Literary (in-domain)',
        'Schema': 'Full (6 types)',
        'Examples': len(en_tokens),
        'Precision': en_results['precision'],
        'Recall': en_results['recall'],
        'F1': en_results['f1']
    }
]

if de_results:
    comparison_data.append({
        'Language': 'German',
        'Domain': 'Literary (in-domain)',
        'Schema': 'Full (6 types)',
        'Examples': len(german_examples),
        'Precision': de_results['precision'],
        'Recall': de_results['recall'],
        'F1': de_results['f1']
    })

if fr_results:
    comparison_data.append({
        'Language': 'French',
        'Domain': 'News (out-of-domain)',
        'Schema': 'Overlapping (PER/ORG/LOC)',
        'Examples': len(french_examples),
        'Precision': fr_results['precision'],
        'Recall': fr_results['recall'],
        'F1': fr_results['f1']
    })

# Create DataFrame
df_comparison = pd.DataFrame(comparison_data)

print("\n" + "="*80)
print("MULTILINGUAL PERFORMANCE COMPARISON")
print("="*80)
print(df_comparison.to_string(index=False))

# Save to file
df_comparison.to_csv(RESULTS_PATH / "multilingual_comparison.csv", index=False)
print(f"\n✓ Saved comparison table to {RESULTS_PATH / 'multilingual_comparison.csv'}")

In [None]:
# Visualize comparison
fig, ax = plt.subplots(figsize=(10, 6))

languages = df_comparison['Language'].tolist()
x = np.arange(len(languages))
width = 0.25

ax.bar(x - width, df_comparison['Precision'], width, label='Precision', alpha=0.8)
ax.bar(x, df_comparison['Recall'], width, label='Recall', alpha=0.8)
ax.bar(x + width, df_comparison['F1'], width, label='F1 Score', alpha=0.8)

ax.set_xlabel('Language', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Cross-Lingual NER Performance', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([f"{lang}\n({domain})" for lang, domain in zip(df_comparison['Language'], df_comparison['Domain'])], fontsize=10)
ax.legend()
ax.set_ylim([0, 1])
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig(RESULTS_PATH / "multilingual_comparison.png", dpi=300, bbox_inches='tight')
print(f"✓ Saved visualization to {RESULTS_PATH / 'multilingual_comparison.png'}")
plt.show()

## 6. Attention Weight Visualization

**Why visualize attention?**
- Shows which tokens the model focuses on when making predictions
- Reveals if model attends to relevant context (nearby names, titles, etc.)
- Helps understand cross-lingual transfer mechanisms

**What to look for:**
- Strong attention to capitalized words (likely entities)
- Attention to title words ("Dr.", "President", etc.)
- Context dependencies (city names after "in", "from", etc.)
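
One way to read an attention matrix is to rank, for a given query token, the key tokens it attends to most. The sketch below does this on a small hand-written matrix; `top_attended` is a hypothetical helper, and the tokens and weights are made up for illustration rather than taken from the model.

```python
from typing import List, Sequence, Tuple


def top_attended(attention: Sequence[Sequence[float]],
                 tokens: List[str],
                 query_index: int,
                 k: int = 2) -> List[Tuple[str, float]]:
    """Return the k key tokens with the highest weight for one query token."""
    row = attention[query_index]  # weights from this query to every key
    ranked = sorted(zip(tokens, row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


tokens = ["in", "Geneva", "today"]
# Made-up weights; each row sums to 1, as softmax-normalized attention does.
attention = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.3, 0.5],
]
result = top_attended(attention, tokens, query_index=1, k=2)
print(result)
```

In this toy matrix the query "Geneva" puts most of its weight on the preceding "in", which is the kind of context dependency listed above.
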

In [None]:
def visualize_attention(text: str, model, tokenizer, layer: int = -1, head: int = 0):
    """
    Visualize attention weights for a given text.
    
    Args:
        text: Input text
        model: Transformer model
        tokenizer: Tokenizer
        layer: Which layer to visualize (-1 = last layer)
        head: Which attention head to visualize
    """
    # Tokenize and move the inputs to the model's device
    # (creating the pipeline moved the model to the GPU when one is available)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    # Get model outputs with attention weights
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    
    # Extract attention weights for one head
    # outputs.attentions[layer] has shape (batch_size, num_heads, seq_len, seq_len)
    attention = outputs.attentions[layer][0, head].cpu().numpy()
    
    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
    
    # Plot attention heatmap
    fig, ax = plt.subplots(figsize=(12, 10))
    
    im = ax.imshow(attention, cmap='viridis', aspect='auto')
    
    # Set ticks
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90, fontsize=9)
    ax.set_yticklabels(tokens, fontsize=9)
    
    ax.set_xlabel('Key Tokens (attending to)', fontsize=11)
    ax.set_ylabel('Query Tokens (attending from)', fontsize=11)
    ax.set_title(f'Attention Weights (Layer {layer}, Head {head})\n"{text}"', 
                fontsize=12, fontweight='bold')
    
    # Add colorbar
    cbar = plt.colorbar(im, ax=ax)
    cbar.set_label('Attention Weight', fontsize=10)
    
    plt.tight_layout()
    return fig


print("✓ Attention visualization function defined")

In [None]:
# Visualize attention for example sentences
example_sentences = [
    ("English", "Dr. Frankenstein traveled from Geneva to the Arctic."),
]

if de_results:
    example_sentences.append(("German", "Der junge Werther reiste von Frankfurt nach Wahlheim."))

if fr_results:
    example_sentences.append(("French", "Le président Emmanuel Macron a visité Paris hier."))

print("\n" + "="*60)
print("ATTENTION WEIGHT VISUALIZATIONS")
print("="*60)

for lang, text in example_sentences:
    print(f"\nVisualizing: {lang} - \"{text}\"")
    fig = visualize_attention(text, model, tokenizer, layer=-1, head=0)
    
    # Save figure
    filename = f"attention_{lang.lower()}.png"
    fig.savefig(RESULTS_PATH / filename, dpi=300, bbox_inches='tight')
    print(f"✓ Saved to {RESULTS_PATH / filename}")
    
    plt.show()
    plt.close()

print("\n✓ All attention visualizations complete")

## 7. Qualitative Analysis

This section counts token-level error patterns (false positives, false negatives, and entity-type confusions) for each language to understand model behavior.

In [None]:
def analyze_errors(tokens: List[List[str]], 
                   labels: List[List[str]], 
                   predictions: List[List[str]], 
                   language: str):
    """
    Identify and display common error patterns.
    """
    print(f"\n{'='*60}")
    print(f"ERROR ANALYSIS: {language.upper()}")
    print(f"{'='*60}")
    
    # Track error types
    false_positives = defaultdict(int)  # Predicted entity, but was O
    false_negatives = defaultdict(int)  # Missed entity (predicted O, but was entity)
    boundary_errors = defaultdict(int)  # Right entity type, wrong B-/I- prefix
    misclassifications = defaultdict(lambda: defaultdict(int))  # Wrong entity type
    
    for token_seq, label_seq, pred_seq in zip(tokens, labels, predictions):
        for token, label, pred in zip(token_seq, label_seq, pred_seq):
            if label == pred:
                continue
            
            # Strip the B-/I- prefix to get the entity type
            label_type = label.split('-', 1)[1] if '-' in label else label
            pred_type = pred.split('-', 1)[1] if '-' in pred else pred
            
            if label == 'O':
                # False positive
                false_positives[pred_type] += 1
            elif pred == 'O':
                # False negative
                false_negatives[label_type] += 1
            elif label_type == pred_type:
                # Boundary error, e.g. B-PER vs. I-PER
                boundary_errors[label_type] += 1
            else:
                # Misclassification
                misclassifications[label_type][pred_type] += 1
    
    # Display results
    print(f"\nFalse Positives (predicted entity, but was O):")
    for entity_type, count in sorted(false_positives.items(), key=lambda x: -x[1]):
        print(f"  {entity_type:6s}: {count:3d} tokens incorrectly marked")
    
    print(f"\nFalse Negatives (missed entities):")
    for entity_type, count in sorted(false_negatives.items(), key=lambda x: -x[1]):
        print(f"  {entity_type:6s}: {count:3d} tokens missed")
    
    print(f"\nBoundary Errors (right entity type, wrong B-/I- prefix):")
    for entity_type, count in sorted(boundary_errors.items(), key=lambda x: -x[1]):
        print(f"  {entity_type:6s}: {count:3d} tokens")
    
    print(f"\nMisclassifications (wrong entity type):")
    for true_type, pred_counts in sorted(misclassifications.items()):
        for pred_type, count in sorted(pred_counts.items(), key=lambda x: -x[1]):
            print(f"  {true_type:6s} → {pred_type:6s}: {count:3d} times")


# Analyze errors for each language
analyze_errors(en_tokens[:100], en_labels[:100], en_predictions[:100], "English")  # Limit to first 100 for readability

if de_results:
    analyze_errors(de_tokens, de_labels, de_predictions, "German")

if fr_results:
    analyze_errors(fr_tokens, fr_labels_filtered, fr_predictions, "French")