## Prerequisites

Before running this notebook, ensure:
1. Notebook 03: Complete data annotation and splits → `data/ner_training/{train, val, test}.json`
2. Notebook 04: Train NER model → `models/ner_model/` or `models/experiments/ner_model/`

Run the cells in order: Steps 1–3 set up paths, imports, and data, and each later step builds on the cells before it.

## Run Order and Requirements
1. Complete annotation + split in Notebook 03 (produces train.json, val.json, test.json).
2. Train the NER model in Notebook 04 (saves to models/ner_model).
3. Run this notebook to evaluate on val/test and save metrics.

Requirements:
- Files: `data/ner_training/label_mapping.json`, plus `val.json` and/or `test.json` in the same directory
- Model directory: `models/ner_model` (or fallback: `models/experiments/ner_model`)
- Packages: transformers, datasets, seqeval, numpy, pandas, torch
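
The requirements above can be checked before any model code runs. The following is a minimal sketch, assuming the directory layout listed in this section; `check_requirements` is a hypothetical helper and not part of the notebook.

```python
from pathlib import Path


def check_requirements(root: Path) -> dict:
    """Return {description: found} for the inputs this notebook expects (hypothetical helper)."""
    ner_dir = root / "data" / "ner_training"
    model_dirs = [root / "models" / "ner_model", root / "models" / "experiments" / "ner_model"]
    return {
        "label_mapping.json": (ner_dir / "label_mapping.json").is_file(),
        # Evaluation needs at least one of the two splits
        "val.json or test.json": any((ner_dir / n).is_file() for n in ("val.json", "test.json")),
        # Preferred model directory first, then the fallback
        "model directory": any(d.is_dir() for d in model_dirs),
    }


for name, found in check_requirements(Path.cwd()).items():
    print(f"{'OK     ' if found else 'MISSING'}  {name}")
```

Run from the workspace root, this prints one line per requirement, so a missing file shows up before the model is loaded.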

# Notebook 05: NER Model Evaluation

Load a trained NER model and evaluate on validation/test splits.

**Outputs:**
- Overall micro metrics: precision, recall, F1
- Per-label classification report (seqeval)
- Qualitative examples: gold vs. predicted entities
- Metrics JSON saved to results/model_metrics/

In [None]:
# Step 1: Environment and paths setup
import sys, json, random, os
from pathlib import Path

# Detect workspace root
ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
SRC = ROOT / 'src'

# Add src to path for local imports
if str(SRC) not in sys.path:
    sys.path.append(str(SRC))
    print(f"Added to sys.path: {SRC}")

# Define expected paths
NER_TRAINING = ROOT / 'data' / 'ner_training'
MODELS_PREFERRED = ROOT / 'models' / 'ner_model'
MODELS_FALLBACK = ROOT / 'models' / 'experiments' / 'ner_model'
RESULTS_DIR = ROOT / 'results' / 'model_metrics'
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

val_path = NER_TRAINING / 'val.json'
test_path = NER_TRAINING / 'test.json'
label_map_path = NER_TRAINING / 'label_mapping.json'

print("="*80)
print("PATHS DETECTION")
print("="*80)
print(f"ROOT:           {ROOT}")
print(f"NER_TRAINING:   {NER_TRAINING.exists()}")
print(f"val.json:       {val_path.exists()}")
print(f"test.json:      {test_path.exists()}")
print(f"label_mapping:  {label_map_path.exists()}")
print(f"RESULTS_DIR:    {RESULTS_DIR.exists()}")
print()

In [None]:
# Step 2: Import libraries
import warnings
warnings.filterwarnings('ignore')

try:
    import json
    import random
    from typing import List, Tuple, Dict, Any
    import numpy as np
    import pandas as pd

    import transformers
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    from datasets import Dataset
    import seqeval
    from seqeval.metrics import classification_report, f1_score, precision_score, recall_score

    print("Libraries imported OK")
    print(f"  transformers v{transformers.__version__}")
    print(f"  seqeval v{seqeval.__version__}")
    print()
except ImportError as e:
    print("ERROR: Missing package.")
    print(f"  {e}")
    print("Install with: pip install transformers datasets seqeval numpy pandas torch")
    raise  # Stop here: later cells cannot run without these imports

In [None]:
# Step 3: Load label mapping and data splits
def load_json(path: Path):
    if path.exists():
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    return None

print("Loading label mapping and data splits...")
label_map_path = NER_TRAINING / 'label_mapping.json'
mapping = load_json(label_map_path)

if not mapping:
    print("WARNING: label_mapping.json not found", label_map_path)
    labels = []
else:
    labels = mapping.get('labels', [])
    print(f"Loaded {len(labels)} allergen labels")

val_data = load_json(val_path)
test_data = load_json(test_path)

print(f"val.json:  {len(val_data) if val_data else 0} samples")
print(f"test.json: {len(test_data) if test_data else 0} samples")
print()

In [None]:
# Step 4: Utility functions for tokenization and evaluation

def tokenize_and_align_labels(samples, tokenizer, label2id):
    """Align word-level labels to token-level (BIO tags)."""
    tokenized = tokenizer(samples['tokens'], is_split_into_words=True, truncation=True)
    labels = []
    for i, label_seq in enumerate(samples['labels']):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        prev_word = None
        for wid in word_ids:
            if wid is None:
                label_ids.append(-100)  # Skip special tokens
            elif wid != prev_word:
                # Unknown labels fall back to 'O' (index 0 is not guaranteed to be 'O')
                label_ids.append(label2id.get(label_seq[wid], label2id.get('O', 0)))
            else:
                label_ids.append(-100)  # Skip sub-tokens
            prev_word = wid
        labels.append(label_ids)
    tokenized['labels'] = labels
    return tokenized


def decode_predictions(logits, label_list):
    """Decode model logits to label sequences."""
    preds = np.argmax(logits, axis=-1)
    return [[label_list[p] if p < len(label_list) else 'O' for p in seq] for seq in preds]


def group_tokens_to_entities(tokens, labels):
    """Convert BIO tags to (start_idx, end_idx, label) tuples over token indices (end exclusive)."""
    entities = []
    current_label = None
    start_idx = None
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab.startswith('B-'):
            if current_label is not None:
                entities.append((start_idx, i, current_label))
            current_label = lab[2:]
            start_idx = i
        elif lab.startswith('I-') and current_label == lab[2:]:
            continue
        else:
            if current_label is not None:
                entities.append((start_idx, i, current_label))
                current_label = None
                start_idx = None
    if current_label is not None:
        entities.append((start_idx, len(tokens), current_label))
    return entities


def evaluate_model(model, tokenizer, dataset, label_list):
    """Compute seqeval metrics on a dataset."""
    label2id = {l: i for i, l in enumerate(label_list)}
    tokenized = dataset.map(lambda x: tokenize_and_align_labels(x, tokenizer, label2id), batched=True)
    tokenized = tokenized.remove_columns([c for c in tokenized.column_names if c not in ['input_ids', 'attention_mask', 'labels']])
    
    import torch
    model.eval()
    all_preds, all_refs = [], []
    
    with torch.no_grad():
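        for i in range(0, len(tokenized), 8):
            batch = tokenized[i:i+8]
            # Sequences differ in length: right-pad each batch so it can be stacked into a tensor
            max_len = max(len(ids) for ids in batch['input_ids'])
            pad_id = tokenizer.pad_token_id or 0
            inputs = {
                'input_ids': torch.tensor([ids + [pad_id] * (max_len - len(ids)) for ids in batch['input_ids']]),
                'attention_mask': torch.tensor([m + [0] * (max_len - len(m)) for m in batch['attention_mask']])
            }
            outputs = model(**inputs)
            pred_labels = decode_predictions(outputs.logits.cpu().numpy(), label_list)
            # Score only labelled positions: drop -100 (special tokens, sub-tokens) and padding from both sides
            for pred_seq, labs in zip(pred_labels, batch['labels']):
                keep = [j for j, l in enumerate(labs) if l != -100]
                all_refs.append([label_list[labs[j]] for j in keep])
                all_preds.append([pred_seq[j] for j in keep])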
    
    # Compute metrics
    micro_f1 = f1_score(all_refs, all_preds)
    micro_p = precision_score(all_refs, all_preds)
    micro_r = recall_score(all_refs, all_preds)
    report = classification_report(all_refs, all_preds, digits=3)
    
    return {
        'f1': micro_f1,
        'precision': micro_p,
        'recall': micro_r,
        'report': report
    }


print("Utility functions defined.")
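
The alignment rule in `tokenize_and_align_labels` can be seen on a small hand-written example. The sketch below replaces the tokenizer with a fixed `word_ids` list, so the sub-token split, the label names, and the helper `align_labels` are all illustrative assumptions.

```python
IGNORE = -100  # index skipped by the loss, and meant to be skipped when scoring


def align_labels(word_ids, word_labels, label2id):
    """Label the first sub-token of each word; mask special tokens and later sub-tokens."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(IGNORE)
        else:
            aligned.append(label2id[word_labels[wid]])
        prev = wid
    return aligned


# Hand-written stand-in for tokenizer output: [CLS] con ##tains pea ##nut ##s [SEP]
word_ids = [None, 0, 0, 1, 1, 1, None]
word_labels = ["O", "B-NUT"]                 # one label per whitespace word
label2id = {"O": 0, "B-NUT": 1, "I-NUT": 2}  # illustrative mapping

print(align_labels(word_ids, word_labels, label2id))
# [-100, 0, -100, 1, -100, -100, -100]
```

Only two of the seven positions carry a real label, so any metric computed over the full sequence needs to filter out the `-100` positions first.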

In [None]:
# Step 5: Load trained model
model_dir = MODELS_PREFERRED if MODELS_PREFERRED.exists() else MODELS_FALLBACK

if not model_dir.exists():
    print(f"WARNING: No model found at {MODELS_PREFERRED} or {MODELS_FALLBACK}")
    print("Train the model in Notebook 04 first.")
    model = None
    tokenizer = None
else:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
    print(f"Loaded model from: {model_dir}")
    print(f"  Tokenizer vocab size: {tokenizer.vocab_size}")
    print(f"  Model num labels: {model.config.num_labels}")
    print()

In [None]:
# Step 6: Convert JSON data to BIO samples

def to_bio_samples(split_data):
    """Convert character-span annotations to BIO token labels."""
    samples = []
    for s in split_data:
        text = s['text']
        tokens = text.split()
        tags = ['O'] * len(tokens)
        
        # Map token positions
        spans = []
        idx = 0
        for tok in tokens:
            start = text.find(tok, idx)
            end = start + len(tok)
            spans.append((start, end))
            idx = end
        
        # Tag entities
        for ent_start, ent_end, label in s.get('entities', []):
            first = True
            for ti, (ts, te) in enumerate(spans):
                if ts < ent_end and te > ent_start:
                    tags[ti] = ('B-' if first else 'I-') + label
                    first = False
        
        samples.append({'tokens': tokens, 'labels': tags, 'text': text})
    return samples


if not (val_data or test_data):
    print("No data available to evaluate.")
    val_ds = None
    test_ds = None
else:
    val_ds = None
    test_ds = None
    
    if val_data:
        val_ds = Dataset.from_list(to_bio_samples(val_data))
        print(f"Created val dataset: {len(val_ds)} samples")
    
    if test_data:
        test_ds = Dataset.from_list(to_bio_samples(test_data))
        print(f"Created test dataset: {len(test_ds)} samples")
    
    if val_ds or test_ds:
        peek = (val_ds or test_ds)[0]
        print(f"Example tokens ({len(peek['tokens'])}): {peek['tokens'][:15]}...")
        print(f"Example labels ({len(peek['labels'])}): {peek['labels'][:15]}...")
    print()
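
A worked example helps show what the character-span to BIO conversion produces. The sketch below follows the same overlap rule as `to_bio_samples` but takes token offsets from `re.finditer` rather than `str.find`; the function name, the sentence, and the annotation are made up for illustration.

```python
import re


def spans_to_bio(text, entities):
    """Tag whitespace tokens with BIO labels from (start, end, label) character spans."""
    matches = list(re.finditer(r"\S+", text))   # each match carries its own character offsets
    tokens = [m.group() for m in matches]
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        first = True
        for i, m in enumerate(matches):
            if m.start() < ent_end and m.end() > ent_start:   # token overlaps the entity span
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tokens, tags


text = "Contains tree nuts and milk."
entities = [(9, 18, "NUT"), (23, 27, "MILK")]   # "tree nuts" and "milk"
tokens, tags = spans_to_bio(text, entities)
for tok, tag in zip(tokens, tags):
    print(f"{tok:10s} {tag}")
```

Note that the token `milk.` keeps its trailing period: whitespace splitting attaches punctuation to the neighbouring word, so a token can be tagged even though the gold span covers only part of it.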

In [None]:
# Step 7: Run evaluation and save metrics
from datetime import datetime

def save_metrics(metrics: Dict[str, Any], split_name: str):
    """Save metrics to results/model_metrics."""
    ts = datetime.now().strftime('%Y%m%d_%H%M%S')
    out_path = RESULTS_DIR / f'ner_eval_{split_name}_{ts}.json'
    with open(out_path, 'w', encoding='utf-8') as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)
    print(f"  Saved to: {out_path}")


if model is None or not (val_ds or test_ds):
    print("Skipping evaluation: missing model or data.")
else:
    print("="*80)
    print("EVALUATION RESULTS")
    print("="*80)
    
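    # Decode with the label order stored in the model config; a re-sorted list can disagree with the trained ids
    expected = set(['O'] + [f'B-{l}' for l in labels] + [f'I-{l}' for l in labels])
    label_list = [model.config.id2label.get(i, 'O') for i in range(model.config.num_labels)]
    if set(label_list) != expected:
        print("WARNING: model labels do not match label_mapping.json; using sorted mapping labels instead.")
        label_list = sorted(expected)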
    
    if val_ds:
        print("\nValidation Set:")
        m_val = evaluate_model(model, tokenizer, val_ds, label_list)
        print(m_val['report'])
        print(f"Micro F1: {m_val['f1']:.3f}  P: {m_val['precision']:.3f}  R: {m_val['recall']:.3f}")
        save_metrics(m_val, 'val')
    
    if test_ds:
        print("\nTest Set:")
        m_test = evaluate_model(model, tokenizer, test_ds, label_list)
        print(m_test['report'])
        print(f"Micro F1: {m_test['f1']:.3f}  P: {m_test['precision']:.3f}  R: {m_test['recall']:.3f}")
        save_metrics(m_test, 'test')
    
    print("="*80)
    print()
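
Predicted ids are decoded through `label_list`, so its order has to agree with the ids the model was trained on. A check along the following lines can confirm that before the metrics are trusted. This is a sketch: `label_order_mismatches` is a hypothetical helper, and the `id2label` dictionary stands in for `model.config.id2label` with an assumed training order.

```python
def label_order_mismatches(label_list, id2label):
    """Return (index, evaluation label, model label) for every index where the two disagree."""
    mismatches = []
    for idx in range(max(len(label_list), len(id2label))):
        ours = label_list[idx] if idx < len(label_list) else None
        theirs = id2label.get(idx)
        if ours != theirs:
            mismatches.append((idx, ours, theirs))
    return mismatches


labels = ["NUT", "MILK"]  # made-up label names
label_list = sorted(set(["O"] + [f"B-{l}" for l in labels] + [f"I-{l}" for l in labels]))
id2label = {0: "O", 1: "B-MILK", 2: "I-MILK", 3: "B-NUT", 4: "I-NUT"}  # assumed training order

mismatches = label_order_mismatches(label_list, id2label)
for idx, ours, theirs in mismatches:
    print(f"id {idx}: evaluation uses {ours!r}, model uses {theirs!r}")
```

In the notebook the second argument would be `model.config.id2label`. An empty result means both sides agree; any mismatch means predictions are being decoded to the wrong tags.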

In [None]:
# Step 8: Qualitative examples

if model is None or tokenizer is None:
    print("Skipping examples: model/tokenizer not loaded.")
else:
    def predict_entities(text: str):
        """Predict entities in text using the model."""
        toks = text.split()
        enc = tokenizer(toks, is_split_into_words=True, return_tensors='pt', truncation=True)
        import torch
        model.eval()
        with torch.no_grad():
            out = model(**{k: v for k, v in enc.items() if k in ['input_ids', 'attention_mask']})
        pred_ids = out.logits.argmax(-1).cpu().numpy()[0]
        word_ids = enc.word_ids(0)
        word_labels = []
        used = set()
        for wi, pid in zip(word_ids, pred_ids):
            if wi is None or wi in used:
                continue
            used.add(wi)
            word_labels.append(model.config.id2label.get(int(pid), 'O'))
        return toks, word_labels
    
    dataset_for_examples = val_data or test_data or []
    if not dataset_for_examples:
        print("No examples available.")
    else:
        print("="*80)
        print("QUALITATIVE EXAMPLES (up to 3 random samples)")
        print("="*80)
        for idx, ex in enumerate(random.sample(dataset_for_examples, min(3, len(dataset_for_examples))), 1):
            toks, pred_tags = predict_entities(ex['text'])
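            gold_ents = ex.get('entities', [])
            pred_ents = group_tokens_to_entities(toks, pred_tags)
            # Gold spans are character offsets and predictions are token indices, so show both as (text, label)
            gold_view = [(ex['text'][s:e], lab) for s, e, lab in gold_ents]
            pred_view = [(' '.join(toks[s:e]), lab) for s, e, lab in pred_ents]
            text_preview = ex['text'][:150].replace('\n', ' ') + ('...' if len(ex['text']) > 150 else '')
            print(f"\nExample {idx}:")
            print(f"  TEXT: {text_preview}")
            print(f"  GOLD: {gold_view}")
            print(f"  PRED: {pred_view}")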
        print("="*80)
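
Gold entities are stored as character offsets, while `group_tokens_to_entities` returns token indices. One way to put both on the same scale is to map token spans back to character spans. The helper below is a hypothetical sketch that assumes the same whitespace tokenization as `predict_entities`.

```python
def token_spans_to_char_spans(text, entities):
    """Map (start_token, end_token, label) to (start_char, end_char, label); end indices exclusive."""
    offsets, idx = [], 0
    for tok in text.split():
        start = text.find(tok, idx)   # locate each whitespace token in the original string
        offsets.append((start, start + len(tok)))
        idx = start + len(tok)
    # Skip spans that are empty or point past the last token
    return [(offsets[s][0], offsets[e - 1][1], label)
            for s, e, label in entities if s < e <= len(offsets)]


text = "Contains tree nuts and milk."
pred_ents = [(1, 3, "NUT"), (4, 5, "MILK")]   # made-up token-level predictions
print(token_spans_to_char_spans(text, pred_ents))
# [(9, 18, 'NUT'), (23, 28, 'MILK')]
```

The second span ends at 28 because the token `milk.` includes the period. A gold annotation of `(23, 27, 'MILK')` would therefore not match exactly, which is worth keeping in mind when reading the examples.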

## Summary

If Step 7 ran, the metrics for each evaluated split are saved to `results/model_metrics/ner_eval_*.json`.

### Next Steps
- Review metrics to assess model performance.
- Compare gold vs. predicted entities in qualitative examples.
- If F1 is low: expand annotations and retrain in Notebook 04.
- If F1 is good: integrate the model into the detection pipeline.
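
To review metrics later without rerunning the evaluation, the saved files can be read back. The sketch below assumes the file naming used by `save_metrics` (`ner_eval_<split>_<timestamp>.json`) and the keys it writes; `latest_metrics` is a hypothetical helper.

```python
import json
from pathlib import Path


def latest_metrics(results_dir, split_name):
    """Return (path, metrics) for the newest ner_eval_<split>_<timestamp>.json, or None."""
    results_dir = Path(results_dir)
    if not results_dir.is_dir():
        return None
    # Timestamps use %Y%m%d_%H%M%S, so lexicographic order is chronological order
    files = sorted(results_dir.glob(f"ner_eval_{split_name}_*.json"))
    if not files:
        return None
    with open(files[-1], encoding="utf-8") as f:
        return files[-1], json.load(f)


found = latest_metrics("results/model_metrics", "test")
if found:
    path, metrics = found
    print(f"{path.name}: F1={metrics['f1']:.3f} P={metrics['precision']:.3f} R={metrics['recall']:.3f}")
else:
    print("No saved test metrics found.")
```

Calling it once per split gives a quick side-by-side of validation and test scores from the most recent run.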