# Notebook 03: Data Annotation for NER

Complete workflow for annotating allergen entities in product text samples to create training data for the NER model. Follow the steps sequentially to load, annotate, validate, and split data.

## Overview
This notebook supports the manual annotation of Named Entity Recognition (NER) training data. The annotation template is generated by prepare_ner_sample.py and contains OCR-extracted text in which the allergen entities still need to be marked.

### Purpose
- Load the annotation template created by data preparation scripts
- Review OCR text quality and declared allergens
- Provide annotation guidelines and examples
- Validate completed annotations
- Split annotated data into train/val/test sets

### Annotation Format
Each sample lists its entities as [start, end, label] triples of character positions (start inclusive, end exclusive):
{
    "text": "Contains wheat flour, milk, and soy lecithin",
    "entities": [
        [9, 14, "GLUTEN"],
        [22, 26, "MILK"],
        [32, 35, "SOY"]
    ]
}
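
Counting characters by hand is error-prone, so offsets are easier to derive programmatically and then confirm by slicing. The following is a minimal sketch; `find_span` is a hypothetical helper introduced for illustration and is not part of the project scripts.

```python
def find_span(text, surface, label, start_at=0):
    """Return [start, end, label] for the first occurrence of `surface`.

    Hypothetical helper. Matching is case-insensitive and `end` is
    exclusive, so text[start:end] recovers the mention. Assumes that
    lowercasing does not change the string length (true for ASCII text).
    """
    start = text.lower().find(surface.lower(), start_at)
    if start == -1:
        raise ValueError(f"{surface!r} not found in text")
    return [start, start + len(surface), label]


text = "Contains wheat flour, milk, and soy lecithin"
entities = [
    find_span(text, "wheat", "GLUTEN"),
    find_span(text, "milk", "MILK"),
    find_span(text, "soy", "SOY"),
]

# Slicing with the computed offsets should give back each mention.
for start, end, label in entities:
    print(f"{label}: {text[start:end]!r} at [{start}, {end}]")
```

If the printed slice does not match the intended mention, the offsets are wrong, whichever way they were produced.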

Note: This notebook is for reviewing and validating annotations. The annotation itself can be done in any of three ways:
1. Editing the JSON file by hand
2. Using an annotation tool (Label Studio, Doccano)
3. Working interactively in this notebook

## Step 1: Setup and Load Annotation Template

In [5]:
import json
import pandas as pd
from pathlib import Path

# Setup paths
ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
NER_TRAINING = ROOT / 'data' / 'ner_training'

# Check for sample annotation template
sample_template = NER_TRAINING / 'annotation_template_SAMPLE.json'
full_template = NER_TRAINING / 'annotation_template.json'

if sample_template.exists():
    annotation_file = sample_template
    print("✓ Loading SAMPLE annotation template")
elif full_template.exists():
    annotation_file = full_template
    print("✓ Loading FULL annotation template")
else:
    print("❌ No annotation template found!")
    print("Run prepare_ner_sample.py or prepare_ner_training_data.py first")
    annotation_file = None

if annotation_file:
    with open(annotation_file, 'r', encoding='utf-8') as f:
        annotation_data = json.load(f)
    
    print(f"Loaded {len(annotation_data)} samples for annotation")
    print(f"File: {annotation_file}")
    
    # Load label mapping
    with open(NER_TRAINING / 'label_mapping.json', 'r', encoding='utf-8') as f:
        label_mapping = json.load(f)
    
    allergen_labels = label_mapping['labels']
    print(f"\nAllergen labels ({len(allergen_labels)}): {', '.join(allergen_labels)}")
else:
    annotation_data = []
    allergen_labels = []

✓ Loading SAMPLE annotation template
Loaded 30 samples for annotation
File: d:\APU Materials\Year 3 Semester 2\FYP\allergen-detection-fyp\data\ner_training\annotation_template_SAMPLE.json

Allergen labels (12): GLUTEN, MILK, EGG, PEANUT, TREE_NUT, SOY, FISH, SHELLFISH, SESAME, MUSTARD, CELERY, SULFITES
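
The later cells read `image_name`, `product_name`, `declared_allergens`, `text`, and `entities` from every sample, so a quick structural check right after loading can catch a malformed template early. This is a sketch only: `REQUIRED_KEYS` is inferred from the fields those cells access, and `find_schema_problems` is a hypothetical helper.

```python
# Keys inferred from the fields the review and validation cells access.
REQUIRED_KEYS = {"image_name", "product_name", "declared_allergens", "text", "entities"}


def find_schema_problems(samples):
    """Return readable problems for samples with missing keys or malformed entities."""
    problems = []
    for i, sample in enumerate(samples):
        missing = REQUIRED_KEYS - set(sample)
        if missing:
            problems.append(f"Sample {i}: missing keys {sorted(missing)}")
        for entity in sample.get("entities", []):
            # Each entity should be a [start, end, label] triple.
            if not (isinstance(entity, (list, tuple)) and len(entity) == 3):
                problems.append(f"Sample {i}: malformed entity {entity!r}")
    return problems


# Made-up samples: the first is well-formed, the second is not.
demo = [
    {"image_name": "a", "product_name": "p", "declared_allergens": ["milk"],
     "text": "milk powder", "entities": [[0, 4, "MILK"]]},
    {"image_name": "b", "text": "wheat", "entities": [[0, 5]]},
]
for problem in find_schema_problems(demo):
    print(problem)
```

On the real data the call would be `find_schema_problems(annotation_data)`; an empty list means every sample has the expected shape.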


## Step 2: Review Sample Annotations

In [6]:
# Display first few samples
if annotation_data:
    print("="*80)
    print("SAMPLE REVIEW - First 3 Samples")
    print("="*80)
    
    for i, sample in enumerate(annotation_data[:3], 1):
        print(f"\n[Sample {i}]")
        print(f"Image: {sample['image_name']}")
        print(f"Product: {sample['product_name']}")
        print(f"Declared Allergens: {', '.join(sample['declared_allergens'])}")
        print(f"\nOCR Text ({len(sample['text'])} chars):")
        print(f"  {sample['text'][:150]}{'...' if len(sample['text']) > 150 else ''}")
        print(f"\nEntities annotated: {len(sample['entities'])}")
        if sample['entities']:
            for start, end, label in sample['entities']:
                entity_text = sample['text'][start:end]
                print(f"  - {label}: '{entity_text}' (pos {start}-{end})")
        else:
            print("  (No entities annotated yet)")
        print("-"*80)
else:
    print("No data to review. Run data preparation script first.")

SAMPLE REVIEW - First 3 Samples

[Sample 1]
Image: 0888849000463_en
Product: Chocolate peanut butter protein bar, chocolate peanut butter
Declared Allergens: milk, peanuts

OCR Text (311 chars):
  QuzsTBAR Yout 01/11/16 0r0i041cp01501} Giddn They Saio ThaT This 'Protlin BAR COULDN T MADL But We FinlLy did it"s Delicious Food PaCKED With CPOTcN T...

Entities annotated: 0
  (No entities annotated yet)
--------------------------------------------------------------------------------

[Sample 2]
Image: 20742690_en
Product: Peanut Butter Wafers
Declared Allergens: gluten, milk, peanuts, soybeans

OCR Text (80 chars):
  Peanut Butter WAFERS Joigs sGeisiot ~6-2 02 (57g) [ ) PACKAGES NETWT i202 (340g)

Entities annotated: 2
  - PEANUT: 'Peanut' (pos 0-6)
  - TREE_NUT: 'nut' (pos 3-6)
--------------------------------------------------------------------------------

[Sample 3]
Image: 0099482497989_en
Product: Kettle Cooked Texas Style Barbecue Chips
Declared Allergens: mustard

OCR Text (276 cha
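
The review output prints the declared allergens next to the annotated entities, which makes mismatches visible: Sample 2 carries a TREE_NUT span although tree nuts are not among its declared allergens. A small cross-check can surface such cases automatically. This is a sketch under assumptions: `DECLARED_TO_LABEL` is a hypothetical mapping that covers only the declared names visible in the output above, and the sample text is abridged.

```python
# Assumed mapping from declared-allergen names to NER labels. Only names
# that appear in the review output are listed; extend it for the real data.
DECLARED_TO_LABEL = {
    "gluten": "GLUTEN",
    "milk": "MILK",
    "peanuts": "PEANUT",
    "soybeans": "SOY",
    "mustard": "MUSTARD",
}


def undeclared_labels(sample):
    """Return annotated labels that have no matching declared allergen."""
    declared = {DECLARED_TO_LABEL.get(name.lower()) for name in sample["declared_allergens"]}
    annotated = {label for _, _, label in sample["entities"]}
    return sorted(annotated - declared)


# Sample 2 from the review output, with the OCR text abridged.
sample = {
    "text": "Peanut Butter WAFERS",
    "declared_allergens": ["gluten", "milk", "peanuts", "soybeans"],
    "entities": [[0, 6, "PEANUT"], [3, 6, "TREE_NUT"]],
}
print(undeclared_labels(sample))
```

A flagged label is not necessarily wrong, since packaging text can mention allergens that the product record does not declare, but it is a good candidate for a second look.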

## Annotation Guidelines

### What to Annotate
Mark **allergen mentions** in the OCR text with their character positions and allergen type.

### Allergen Label Mapping
Use the label names exactly as they are loaded from label_mapping.json. The plural forms EGGS, PEANUTS, and SOYBEANS are not in the label set and fail the validation check, so use EGG, PEANUT, and SOY.

- **GLUTEN**: wheat, barley, rye, oats, spelt
- **MILK**: milk, dairy, lactose, cream, butter, cheese, whey, casein
- **EGG**: egg, eggs, albumin
- **PEANUT**: peanut, groundnut
- **TREE_NUT**: almond, cashew, walnut, hazelnut, pecan, pistachio, macadamia
- **SOY**: soy, soya, soybean
- **FISH**: fish, anchovy, salmon, tuna, cod, etc.
- **SHELLFISH**: shellfish, shrimp, crab, lobster, prawns
- **SESAME**: sesame, sesame seeds
- **MUSTARD**: mustard
- **CELERY**: celery
- **SULFITES**: sulfites, sulphites, sulfur dioxide
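
The keyword lists above can seed candidate spans for the annotator to confirm or reject. The sketch below is one possible approach, not the project's method: `KEYWORDS` is an abridged copy of the list, and `suggest_entities` is a hypothetical helper. It matches whole words only, so a fragment such as "nut" inside "Peanut" is not proposed.

```python
import re

# Abridged keyword table taken from the list above; extend as needed.
KEYWORDS = {
    "GLUTEN": ["wheat", "barley", "rye", "oats", "spelt"],
    "MILK": ["milk", "butter", "cheese", "whey", "casein"],
    "PEANUT": ["peanut", "groundnut"],
    "SOY": ["soybean", "soya", "soy"],
}


def suggest_entities(text):
    """Suggest [start, end, label] spans via whole-word, case-insensitive matching."""
    suggestions = []
    for label, words in KEYWORDS.items():
        # Longest alternatives first so "soybean" is preferred over "soy".
        alternatives = "|".join(sorted(map(re.escape, words), key=len, reverse=True))
        # Optional trailing "s" also catches simple plurals such as "peanuts".
        pattern = re.compile(rf"\b(?:{alternatives})s?\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            suggestions.append([match.start(), match.end(), label])
    return sorted(suggestions)


text = "INGREDIENTS: Wheat flour, sugar, milk powder, soy lecithin"
for start, end, label in suggest_entities(text):
    print(f"{label}: {text[start:end]!r} at ({start}, {end})")
```

The suggestions are candidates only. On "Peanut Butter WAFERS", for example, the keyword "butter" would propose a MILK span for a product name that refers to peanut butter, so a human still has to review every span.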

### Annotation Format
```python
{
    "text": "INGREDIENTS: Wheat flour, sugar, milk powder, soy lecithin",
    "entities": [
        (13, 18, "GLUTEN"),      # "Wheat"
        (33, 37, "MILK"),         # "milk"
        (46, 49, "SOY")           # "soy"
    ]
}
```

### Tips
- Match **exact character positions** (start is inclusive, end is exclusive)
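- Write labels in uppercase exactly as they appear in the label set; the keywords above are listed in lowercase, but OCR text mixes cases, so match them case-insensitively
- Mark every mention, including those in "may contain" statements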
- Focus on **ingredient names**, not just "allergens:" labels

## Step 3: Validate Annotations and Track Progress

In [7]:
if annotation_data:
    # Count annotation status
    total = len(annotation_data)
    annotated = sum(1 for s in annotation_data if s['entities'])
    not_annotated = total - annotated
    
    print("="*80)
    print("ANNOTATION PROGRESS")
    print("="*80)
    print(f"\nTotal samples: {total}")
    print(f"Annotated: {annotated} ({annotated/total*100:.1f}%)")
    print(f"Not annotated: {not_annotated} ({not_annotated/total*100:.1f}%)")
    
    if annotated > 0:
        # Validate annotations
        print(f"\n{'='*80}")
        print("VALIDATION CHECK")
        print("="*80)
        
        errors = []
        for i, sample in enumerate(annotation_data):
            text = sample['text']
            for start, end, label in sample['entities']:
                # Check bounds
                if start < 0 or end > len(text) or start >= end:
                    errors.append(f"Sample {i}: Invalid bounds ({start}, {end}) for text length {len(text)}")
                # Check label
                if label not in allergen_labels:
                    errors.append(f"Sample {i}: Invalid label '{label}' (not in label set)")
                # Check text extraction
                entity_text = text[start:end]
                if not entity_text.strip():
                    errors.append(f"Sample {i}: Empty entity text at ({start}, {end})")
        
        if errors:
            print(f"❌ Found {len(errors)} validation errors:")
            for err in errors[:10]:  # Show first 10
                print(f"  - {err}")
        else:
            print("✓ All annotations are valid!")
            
            # Statistics
            entity_counts = {}
            for sample in annotation_data:
                for _, _, label in sample['entities']:
                    entity_counts[label] = entity_counts.get(label, 0) + 1
            
            print(f"\nEntity distribution:")
            for label, count in sorted(entity_counts.items(), key=lambda x: x[1], reverse=True):
                print(f"  {label:15s}: {count:3d}")
    
    else:
        print("\n⚠️  No samples have been annotated yet")
        print("\nTo annotate:")
        print(f"1. Edit file: {annotation_file}")
        print('2. Add entities: [[start, end, "LABEL"], ...]')
        print("3. Re-run this notebook to validate")
else:
    print("No annotation data loaded")

ANNOTATION PROGRESS

Total samples: 30
Annotated: 14 (46.7%)
Not annotated: 16 (53.3%)

VALIDATION CHECK
✓ All annotations are valid!

Entity distribution:
  TREE_NUT       :  14
  GLUTEN         :   5
  PEANUT         :   1
  SESAME         :   1
  MILK           :   1


## Step 4: Split and Save Train/Val/Test Sets

In [8]:
import random
import json

if annotation_data:
    # Check if we have annotated data
    annotated_samples = [s for s in annotation_data if s['entities']]
    
    if len(annotated_samples) < 10:
        print("⚠️  Need at least 10 annotated samples to split")
        print(f"Currently have: {len(annotated_samples)} annotated samples")
    else:
        # Shuffle
        random.seed(42)
        random.shuffle(annotated_samples)
        
        # Split: 70% train, 15% val, 15% test
        n = len(annotated_samples)
        train_split = int(0.7 * n)
        val_split = int(0.85 * n)
        
        train_data = annotated_samples[:train_split]
        val_data = annotated_samples[train_split:val_split]
        test_data = annotated_samples[val_split:]
        
        print("="*80)
        print("DATA SPLIT")
        print("="*80)
        print(f"\nTotal annotated samples: {n}")
        print(f"Train: {len(train_data)} ({len(train_data)/n*100:.1f}%)")
        print(f"Val:   {len(val_data)} ({len(val_data)/n*100:.1f}%)")
        print(f"Test:  {len(test_data)} ({len(test_data)/n*100:.1f}%)")
        
        # Save splits
        with open(NER_TRAINING / 'train.json', 'w', encoding='utf-8') as f:
            json.dump(train_data, f, indent=2, ensure_ascii=False)
        
        with open(NER_TRAINING / 'val.json', 'w', encoding='utf-8') as f:
            json.dump(val_data, f, indent=2, ensure_ascii=False)
        
        with open(NER_TRAINING / 'test.json', 'w', encoding='utf-8') as f:
            json.dump(test_data, f, indent=2, ensure_ascii=False)
        
        print(f"\n✓ Saved splits to:")
        print(f"  {NER_TRAINING / 'train.json'}")
        print(f"  {NER_TRAINING / 'val.json'}")
        print(f"  {NER_TRAINING / 'test.json'}")
        print(f"\n✓ Ready for NER model training (Notebook 04)!")
else:
    print("No annotation data to split")

DATA SPLIT

Total annotated samples: 14
Train: 9 (64.3%)
Val:   2 (14.3%)
Test:  3 (21.4%)

✓ Saved splits to:
  d:\APU Materials\Year 3 Semester 2\FYP\allergen-detection-fyp\data\ner_training\train.json
  d:\APU Materials\Year 3 Semester 2\FYP\allergen-detection-fyp\data\ner_training\val.json
  d:\APU Materials\Year 3 Semester 2\FYP\allergen-detection-fyp\data\ner_training\test.json

✓ Ready for NER model training (Notebook 04)!
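
The reported split of 9/2/3 differs from an exact 70/15/15 because the cut points are computed with `int()`, which truncates toward zero. The sketch below mirrors that arithmetic so the split sizes can be previewed for any sample count; `split_sizes` is a hypothetical helper that is not part of the notebook.

```python
def split_sizes(n, train_cut=0.7, val_cut=0.85):
    """Mirror the split cell's cut points and return (train, val, test) sizes."""
    train_end = int(train_cut * n)  # int() truncates, e.g. int(9.8) == 9
    val_end = int(val_cut * n)      # e.g. int(11.9) == 11
    return train_end, val_end - train_end, n - val_end


print(split_sizes(14))  # (9, 2, 3), matching the output above
```

With so few samples, the validation and test sets hold only two or three examples each, so metrics computed on them will be noisy until more samples are annotated.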
