# Notebook 04: NER Model Training

Train a BERT-based Named Entity Recognition model on the annotated allergen samples. This notebook focuses on model training only. Evaluation is handled separately in Notebook 05.

This notebook trains a Named Entity Recognition (NER) model to detect allergen mentions in OCR-extracted ingredient text. Compared to rule-based keyword matching, NER can better handle synonyms, context, multi-word entities, and negations.

## Objectives
- Load and prepare annotated data (train.json, val.json, test.json)
- Initialize and fine-tune a BERT-based token classification model
- Train with a fixed set of hyperparameters (3 epochs, batch size 4, AdamW at a learning rate of 2e-5) and monitor seqeval metrics on the validation split
- Save the trained model for integration into the detection pipeline

## Expected Improvements
- Rule-based baseline: ~39% accuracy (current)
- NER model target: 50–55% accuracy (about +10–15 percentage points over the baseline)
- More robust handling of variations and context

Proceed to the next cells to configure the environment, load data, and train the model.

## Section A: Setup and Environment Configuration

In [19]:
%pip install -q transformers torch datasets seqeval scikit-learn pandas numpy

import os
import sys
import json
from pathlib import Path
import pandas as pd
import numpy as np
import torch

# Project paths
ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
DATA = ROOT / 'data'
NER_TRAINING = DATA / 'ner_training'
MODEL_DIR = ROOT / 'models' / 'ner_model'

sys.path.insert(0, str(ROOT / 'src'))

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Training data directory: {NER_TRAINING}")
print(f"Model output directory: {MODEL_DIR}")

MODEL_DIR.mkdir(parents=True, exist_ok=True)

Note: you may need to restart the kernel to use updated packages.
PyTorch version: 2.9.1+cpu
CUDA available: False
Training data directory: d:\APU Materials\Year 3 Semester 2\FYP\allergen-detection-fyp\data\ner_training
Model output directory: d:\APU Materials\Year 3 Semester 2\FYP\allergen-detection-fyp\models\ner_model


## Step 1: Load Training Data and Label Mapping

In [20]:
# Load label mapping
with open(NER_TRAINING / 'label_mapping.json', 'r') as f:
    label_mapping = json.load(f)

labels = label_mapping['labels']
label2id = label_mapping['label_to_id']
id2label = {int(k): v for k, v in label_mapping['id_to_label'].items()}

print(f"Allergen labels ({len(labels)}): {labels}")

# Check for split datasets
train_file = NER_TRAINING / 'train.json'
val_file = NER_TRAINING / 'val.json'
test_file = NER_TRAINING / 'test.json'

if not train_file.exists():
    print("\n⚠️  Training data not yet split!")
    print(f"Please complete annotation in: {NER_TRAINING / 'annotation_template.json'}")
    print("Then split into train/val/test (70/15/15) and save as train.json, val.json, test.json")
    print("\nProceeding with demo on annotation_template for now...")
    
    # Load template for demonstration
    with open(NER_TRAINING / 'annotation_template.json', 'r') as f:
        template_data = json.load(f)
    
    print(f"\nLoaded {len(template_data)} samples from annotation template")
    print(f"Sample: {template_data[0]['text'][:100]}...")
    
    # For demo purposes, create a minimal split
    n = len(template_data)
    train_split = int(0.7 * n)
    val_split = int(0.85 * n)
    
    train_data = template_data[:train_split]
    val_data = template_data[train_split:val_split]
    test_data = template_data[val_split:]
    
    print(f"\nDemo split: {len(train_data)} train, {len(val_data)} val, {len(test_data)} test")
else:
    with open(train_file, 'r') as f:
        train_data = json.load(f)
    with open(val_file, 'r') as f:
        val_data = json.load(f)
    with open(test_file, 'r') as f:
        test_data = json.load(f)
    
    print(f"\n✓ Loaded annotated datasets:")
    print(f"  Train: {len(train_data)} samples")
    print(f"  Val: {len(val_data)} samples")
    print(f"  Test: {len(test_data)} samples")

Allergen labels (12): ['GLUTEN', 'MILK', 'EGG', 'PEANUT', 'TREE_NUT', 'SOY', 'FISH', 'SHELLFISH', 'SESAME', 'MUSTARD', 'CELERY', 'SULFITES']

✓ Loaded annotated datasets:
  Train: 9 samples
  Val: 2 samples
  Test: 3 samples
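
The demo branch above slices `annotation_template.json` in file order, so any ordering in the template carries over into the splits. A minimal sketch of a shuffled 70/15/15 split follows; the helper name `split_samples`, the seed value, and the synthetic `demo` records are illustrative assumptions, not part of the project code.

```python
import random

def split_samples(samples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a copy of `samples` and cut it into train/val/test lists."""
    shuffled = list(samples)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n = len(shuffled)
    train_end = int(train_frac * n)
    val_end = int((train_frac + val_frac) * n)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

# Synthetic stand-in for the annotated samples (hypothetical data)
demo = [{"text": f"sample {i}", "entities": []} for i in range(14)]
train_part, val_part, test_part = split_samples(demo)
print(len(train_part), len(val_part), len(test_part))  # 9 2 3
```

With 14 samples this yields the same 9/2/3 sizes reported above, but the membership of each split no longer depends on the order of the file.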


## Step 2: Tokenization and Data Preparation

In [21]:
from transformers import AutoTokenizer
from datasets import Dataset, Features, Sequence, Value

# Initialize tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def normalize_entities(entity_lists):
    """Ensure entities are structured dicts with start/end/label."""
    normalized = []
    for ent_list in entity_lists:
        norm = []
        for ent in ent_list:
            if isinstance(ent, list):
                start, end, label = ent[0], ent[1], ent[2]
            else:
                start, end, label = ent.get("start"), ent.get("end"), ent.get("label")
            norm.append({"start": int(start), "end": int(end), "label": str(label)})
        normalized.append(norm)
    return normalized

def tokenize_and_align_labels(examples):
    """Tokenize text and align character-level entity labels to subword tokens."""
    tokenized_inputs = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
        is_split_into_words=False,
    )

    labels_local = []
    for i, (text, entities) in enumerate(zip(examples["text"], examples["entities"])):
        # Paint every character with its entity label ('O' = outside any entity)
        char_labels = ['O'] * len(text)
        for ent in entities:
            start, end, label = ent["start"], ent["end"], ent["label"]
            # Skip spans that fall outside the text or have a negative start
            if 0 <= start < end <= len(text):
                for j in range(start, end):
                    char_labels[j] = label

        word_ids = tokenized_inputs.word_ids(batch_index=i)
        offsets = tokenized_inputs.encodings[i].offsets  # (char_start, char_end) per token
        label_ids = []
        for word_idx, (tok_start, _) in zip(word_ids, offsets):
            if word_idx is None or tok_start >= len(char_labels):
                # Special tokens ([CLS], [SEP], [PAD]) are ignored by the loss
                label_ids.append(-100)
                continue
            # Each token takes the label of its first character
            label = char_labels[tok_start]
            # NOTE: 'O' tokens are masked with -100 as well, so the model is
            # never trained to predict "no entity"
            label_ids.append(-100 if label == 'O' else label2id[label])
        labels_local.append(label_ids)

    tokenized_inputs["labels"] = labels_local
    return tokenized_inputs

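# Schema kept for reference only. It is not passed to Dataset.from_dict below:
# a Sequence of dicts is stored as a dict of lists, which would break the
# per-entity iteration in tokenize_and_align_labels.
features = Features({
    "text": Value("string"),
    "entities": Sequence({
        "start": Value("int64"),
        "end": Value("int64"),
        "label": Value("string")
    })
})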

# Convert to HF datasets with normalized entities
train_dataset = Dataset.from_dict({
    "text": [d["text"] for d in train_data],
    "entities": normalize_entities([d.get("entities", []) for d in train_data])
})

val_dataset = Dataset.from_dict({
    "text": [d["text"] for d in val_data],
    "entities": normalize_entities([d.get("entities", []) for d in val_data])
})

test_dataset = Dataset.from_dict({
    "text": [d["text"] for d in test_data],
    "entities": normalize_entities([d.get("entities", []) for d in test_data])
})

# Tokenize
train_tokenized = train_dataset.map(tokenize_and_align_labels, batched=True)
val_tokenized = val_dataset.map(tokenize_and_align_labels, batched=True)
test_tokenized = test_dataset.map(tokenize_and_align_labels, batched=True)

print(f"✓ Datasets tokenized and ready for training")

Map: 100%|██████████| 9/9 [00:00<00:00, 260.65 examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 129.74 examples/s]
Map: 100%|██████████| 3/3 [00:00<00:00, 134.29 examples/s]

✓ Datasets tokenized and ready for training
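
The alignment step above maps character-level entity spans onto tokens through their character offsets. The sketch below shows the same idea without any library dependency. Two things in it are assumptions rather than what the cell above does: a hypothetical whitespace tokenizer stands in for the BERT tokenizer, and the output uses BIO tags (a tagging convention seqeval supports) instead of bare label names. The example sentence and spans are made up.

```python
import re

def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-delimited token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def align_bio(entities, offsets):
    """Assign a BIO tag to each token from character-level entity spans."""
    tags = ["O"] * len(offsets)
    for ent in entities:
        first = True
        for idx, (tok_start, tok_end) in enumerate(offsets):
            # A token belongs to the entity if the two character ranges overlap
            if tok_start < ent["end"] and tok_end > ent["start"]:
                tags[idx] = ("B-" if first else "I-") + ent["label"]
                first = False
    return tags

text = "contains skimmed milk powder and soy lecithin"
entities = [
    {"start": 9, "end": 28, "label": "MILK"},   # "skimmed milk powder"
    {"start": 33, "end": 36, "label": "SOY"},   # "soy"
]
tags = align_bio(entities, whitespace_offsets(text))
print(tags)
# ['O', 'B-MILK', 'I-MILK', 'I-MILK', 'O', 'B-SOY', 'O']
```

Keeping an explicit `O` tag, rather than masking those tokens, gives the model a "no entity" class to learn.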





## Section B: Model Training

In [8]:
# Quick Sanity Checks (no metrics)
print("Sanity checks only — formal evaluation is in Notebook 05.")

# 1) Show model basics if available
try:
    from datetime import datetime
    print("Timestamp:", datetime.now().isoformat(timespec='seconds'))
    print("Model name:", getattr(model.config, 'name_or_path', 'unknown'))
    print("Num labels:", getattr(model.config, 'num_labels', 'unknown'))
except Exception as e:
    print("Model not yet defined in this session:", e)

# 2) Peek at one training example if dataset objects exist
try:
    example = train_tokenized[0] if 'train_tokenized' in globals() else None
    if example:
        print("Example input_ids length:", len(example['input_ids']))
        print("Example labels length:", len(example.get('labels', [])))
    else:
        print("No tokenized training data in current kernel session.")
except Exception as e:
    print("Could not peek dataset:", e)

# 3) Optional: make a single forward pass to verify shape
try:
    import torch
    if 'model' in globals() and 'train_tokenized' in globals() and len(train_tokenized) > 0:
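        batch = train_tokenized[:2]
        # Build the inputs on the model's device so the check also works on GPU
        inputs = {
            'input_ids': torch.tensor(batch['input_ids']).to(model.device),
            'attention_mask': torch.tensor(batch['attention_mask']).to(model.device)
        }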
        with torch.no_grad():
            out = model(**inputs)
        print("Forward pass OK. Logits shape:", tuple(out.logits.shape))
    else:
        print("Forward pass skipped (no model or tokenized data present).")
except Exception as e:
    print("Forward pass failed:", e)

Sanity checks only — formal evaluation is in Notebook 05.
Timestamp: 2025-12-20T15:28:25
Model not yet defined in this session: name 'model' is not defined
Example input_ids length: 512
Example labels length: 512
Forward pass skipped (no model or tokenized data present).
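
Each example is padded to 512 positions, and both padding and non-entity tokens carry the label -100, which PyTorch's cross-entropy loss ignores by default. The sketch below shows one way to check what share of positions actually contributes to the loss. The helper name `supervised_fraction` and the label list are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def supervised_fraction(label_ids, ignore_index=IGNORE_INDEX):
    """Return the share of positions whose label is not the ignore index."""
    if not label_ids:
        return 0.0
    kept = sum(1 for label in label_ids if label != ignore_index)
    return kept / len(label_ids)

# Made-up example: 3 labelled tokens in a sequence padded to 512 positions
example_labels = [-100, 1, 1, -100, 5] + [-100] * 507
print(f"{supervised_fraction(example_labels):.4f}")  # 0.0059
```

In the notebook, the same helper could be applied to `train_tokenized[0]["labels"]` to see how sparse the training signal is for a given sample.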


In [24]:
import torch
from torch.utils.data import DataLoader
from transformers import BertForTokenClassification, DataCollatorForTokenClassification
from seqeval.metrics import precision_score, recall_score, f1_score, accuracy_score
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Model and collator
model = BertForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
).to(device)

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, padding=True, max_length=512)

# Strip non-tensor columns and set torch format
keep_cols = ["input_ids", "attention_mask", "labels"]
train_ds = train_tokenized.remove_columns([c for c in train_tokenized.column_names if c not in keep_cols]).with_format("torch")
val_ds = val_tokenized.remove_columns([c for c in val_tokenized.column_names if c not in keep_cols]).with_format("torch")
test_ds = test_tokenized.remove_columns([c for c in test_tokenized.column_names if c not in keep_cols]).with_format("torch")

# DataLoaders
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True, collate_fn=data_collator)
val_loader = DataLoader(val_ds, batch_size=4, shuffle=False, collate_fn=data_collator)
test_loader = DataLoader(test_ds, batch_size=4, shuffle=False, collate_fn=data_collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)


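def evaluate(loader):
    """Return the mean loss and seqeval metrics for the batches in `loader`.

    Positions labelled -100 are skipped. Here that includes every non-entity
    token, so the scores only reflect tokens inside annotated entities.
    """
    model.eval()
    losses = []
    all_preds, all_labels = [], []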
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            labels_batch = batch.pop("labels")
            outputs = model(**batch, labels=labels_batch)
            losses.append(outputs.loss.item())
            logits = outputs.logits.detach().cpu().numpy()
            labels_np = labels_batch.cpu().numpy()
            for logit_seq, label_seq in zip(logits, labels_np):
                pred_seq = logit_seq.argmax(-1)
                seq_preds, seq_labels = [], []
                for p, l in zip(pred_seq, label_seq):
                    if l == -100:
                        continue
                    seq_preds.append(id2label[int(p)])
                    seq_labels.append(id2label[int(l)])
                if seq_labels:
                    all_preds.append(seq_preds)
                    all_labels.append(seq_labels)
    if not all_labels:
        return {"loss": np.mean(losses) if losses else None, "precision": 0.0, "recall": 0.0, "f1": 0.0, "accuracy": 0.0}
    return {
        "loss": np.mean(losses) if losses else None,
        "precision": precision_score(all_labels, all_preds),
        "recall": recall_score(all_labels, all_preds),
        "f1": f1_score(all_labels, all_preds),
        "accuracy": accuracy_score(all_labels, all_preds),
    }

# Training loop
num_epochs = 3
print("\n" + "="*60)
print("TRAINING NER MODEL (manual loop)")
print("="*60)
for epoch in range(1, num_epochs + 1):
    model.train()
    total_loss = 0.0
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        labels_batch = batch.pop("labels")
        outputs = model(**batch, labels=labels_batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    avg_train_loss = total_loss / max(1, len(train_loader))
    val_metrics = evaluate(val_loader)
    print(f"Epoch {epoch}: train_loss={avg_train_loss:.4f} val_loss={val_metrics['loss']:.4f} f1={val_metrics['f1']:.3f} prec={val_metrics['precision']:.3f} rec={val_metrics['recall']:.3f}")

# Final evaluation
print("\nEvaluating on test set...")
test_metrics = evaluate(test_loader)
print(test_metrics)

# Save trained model and label mapping
model.save_pretrained(MODEL_DIR)
tokenizer.save_pretrained(MODEL_DIR)
with open(MODEL_DIR / 'label_mapping.json', 'w') as f:
    json.dump({
        "labels": labels,
        "label_to_id": label2id,
        "id_to_label": id2label
    }, f, indent=2)

print(f"\n✓ Trained NER model saved to: {MODEL_DIR}")

Using device: cpu


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



TRAINING NER MODEL (manual loop)




Epoch 1: train_loss=2.2549 val_loss=1.6467 f1=0.286 prec=0.250 rec=0.333
Epoch 2: train_loss=1.4494 val_loss=1.2942 f1=0.000 prec=0.000 rec=0.000
Epoch 3: train_loss=1.7536 val_loss=1.1839 f1=0.000 prec=0.000 rec=0.000

Evaluating on test set...




{'loss': np.float64(1.8011715412139893), 'precision': np.float64(0.2), 'recall': np.float64(0.3333333333333333), 'f1': np.float64(0.25), 'accuracy': 0.4}

✓ Trained NER model saved to: d:\APU Materials\Year 3 Semester 2\FYP\allergen-detection-fyp\models\ner_model
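
The precision, recall, and F1 values printed above are computed over entity spans rather than individual tokens. The following is a simplified pure-Python sketch of that kind of scoring on BIO-tagged sequences. It is not seqeval's implementation; `extract_spans`, `entity_prf`, and the example tags are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) spans in a BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues_span = tag.startswith("I-") and label == tag[2:]
        if continues_span:
            continue
        if label is not None:
            spans.add((label, start, i))          # close the open span
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]             # open a new span
        else:
            start, label = None, None
    return spans

def entity_prf(true_seqs, pred_seqs):
    """Compute span-level precision, recall and F1 over paired sequences."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)        # exact label and boundary match
        n_pred += len(pred_spans)
        n_true += len(true_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [["B-MILK", "O", "B-SOY", "O", "B-EGG"]]
y_pred = [["B-MILK", "B-EGG", "O", "B-SOY", "B-FISH"]]
precision, recall, f1 = entity_prf(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.250 recall=0.333 f1=0.286
```

One correct span out of four predicted and three annotated gives 0.250 / 0.333 / 0.286. The epoch 1 line above shows the same values, although the actual counts behind that line are not shown in the output.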


## Step 3: Save Trained Model and Checkpoint

Note: Formal evaluation of the trained model is performed in Notebook 05.

In [9]:
# Evaluate on the test set with the evaluate() helper and test_loader defined in the training cell
test_results = evaluate(test_loader)

print("\n" + "="*60)
print("TEST SET RESULTS")
print("="*60)
print(f"  Precision: {test_results['precision']:.3f}")
print(f"  Recall:    {test_results['recall']:.3f}")
print(f"  F1 Score:  {test_results['f1']:.3f}")
print(f"  Accuracy:  {test_results['accuracy']:.3f}")
print("="*60)

# Save model and tokenizer
model.save_pretrained(MODEL_DIR)
tokenizer.save_pretrained(MODEL_DIR)

# Save label mapping
with open(MODEL_DIR / 'label_mapping.json', 'w') as f:
    json.dump(label_mapping, f, indent=2)

print(f"\n✓ Model saved to: {MODEL_DIR}")
print("\nNext steps:")
print("  1. Integrate NER model into allergen detection pipeline")
print("  2. Run comprehensive evaluation (Notebook 05)")
print("  3. Compare NER vs rule-based accuracy")
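
The label mapping is stored as JSON, and JSON object keys are always strings. An `id2label` dictionary with integer keys therefore comes back with string keys after a save/load round trip, which is why Step 1 rebuilds it with `int(k)`. A small stdlib-only sketch with a made-up three-label mapping:

```python
import json

id2label = {0: "GLUTEN", 1: "MILK", 2: "EGG"}  # illustrative subset of the labels

# Serialize and parse again, as happens when the mapping is saved and reloaded
restored_raw = json.loads(json.dumps({"id_to_label": id2label}))["id_to_label"]
print(list(restored_raw.keys()))  # ['0', '1', '2']

# Cast the keys back to int before using the mapping with model predictions
restored = {int(k): v for k, v in restored_raw.items()}
print(restored == id2label)  # True
```

Any code that loads `label_mapping.json` from the model directory needs the same cast before indexing with predicted class ids.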