[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.4/token-classification.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic1.4/token-classification.ipynb)

# Token Classification with BERT

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- What token classification is and its applications (NER, POS tagging)
- How to use BERT for token classification tasks
- The structure of token classification outputs and label formats
- How transformer models assign labels to individual tokens
- Common use cases and practical applications

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python programming
- Basic knowledge of Natural Language Processing (NLP) concepts
- Understanding of tokenization (refer to [Tokenizers Notebook](../02_tokenizers.ipynb))

## 📚 What We'll Cover
1. **Introduction**: What is token classification?
2. **Setup**: Installing and importing required libraries
3. **Core Implementation**: Using BERT for Named Entity Recognition
4. **Understanding Output**: Interpreting token labels and confidence scores
5. **Advanced Examples**: Testing different text inputs and entity types
6. **Visualization**: Displaying results in a readable format
7. **Summary**: Key takeaways and next steps

## 1. Introduction to Token Classification

**Token Classification** is a fundamental NLP task that assigns a label to each token (word or sub-word) in a sequence. Unlike sequence classification, which assigns one label to the entire text, token classification provides granular, token-level predictions.

### Common Applications:

- **Named Entity Recognition (NER)**: Identifying entities like persons, locations, organizations
- **Part-of-Speech (POS) Tagging**: Identifying grammatical roles (noun, verb, adjective)
- **Chunking (Shallow Parsing)**: Identifying phrases and syntactic structures
- **Information Extraction**: Extracting specific data fields from documents

### How Token Classification Works:

1. **Tokenize** the input text into individual tokens
2. **Encode** each token into numerical representations
3. **Process** the sequence using transformer attention mechanisms
4. **Classify** each token with a specific label and confidence score
5. **Aggregate** sub-word tokens back into complete entities
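
The classification step (step 4) can be sketched in plain Python. This is a minimal illustration, not the model's actual code: the label set, the `classify_tokens` helper, and the logits are all made up for the example. It shows how a row of raw scores per token becomes a label plus a confidence via softmax and argmax.

```python
import math

# Illustrative label set (an assumption, not the model's real mapping)
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify_tokens(tokens, logits):
    """Pick the most probable label for each token."""
    results = []
    for token, token_logits in zip(tokens, logits):
        probs = softmax(token_logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((token, LABELS[best], probs[best]))
    return results

# Made-up logits: one row per token, one column per label
tokens = ["John", "lives", "in", "Paris"]
logits = [
    [0.1, 4.0, 0.3, 0.2, 0.1],
    [5.0, 0.1, 0.2, 0.1, 0.1],
    [5.0, 0.1, 0.1, 0.3, 0.1],
    [0.2, 0.3, 0.1, 4.5, 0.2],
]

predictions = classify_tokens(tokens, logits)
for token, label, confidence in predictions:
    print(f"{token:<6} {label:<6} {confidence:.3f}")
```

In a real model the logits come from a linear layer applied to each token's hidden state; the softmax probability of the winning label is what gets reported as the confidence score.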

### Label Formats:
- **BIO Format**: B (Beginning), I (Inside), O (Outside)
  - Example: `John Smith lives in New York` → `B-PER I-PER O O B-LOC I-LOC`
- **Entity Types**: Person (`PER`), Location (`LOC`), Organization (`ORG`), Miscellaneous (`MISC`), etc.
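
To make the BIO scheme concrete, here is a small sketch that decodes a tag sequence back into entities. The helper name `bio_to_spans` is hypothetical and the logic assumes one tag per whitespace-separated word; it is meant to illustrate the format, not to reproduce any library's decoder.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # "I-X" continues the open entity of the same type
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # "B-X" starts a new entity; a stray "I-X" is treated as a start too
        if prefix in ("B", "I"):
            current_tokens, current_type = [token], entity_type

    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = "John Smith lives in New York".split()
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# → [('John Smith', 'PER'), ('New York', 'LOC')]
```

The `B-` prefix is what lets two adjacent entities of the same type stay separate: a new `B-PER` right after an `I-PER` closes the first person and opens a second one.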

## 2. Setup and Installation

Let's start by importing the necessary libraries and setting up our environment.

In [None]:
# Install required packages (run only once in Colab)
# !pip install transformers torch numpy pandas

# Import essential libraries
import torch
import numpy as np
import pandas as pd
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForTokenClassification
)
import warnings
warnings.filterwarnings('ignore')

# Device detection for optimal performance
def get_device():
    """Get the best available device for PyTorch operations."""
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS for Apple Silicon optimization")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU - consider GPU for better performance")
    return device

device = get_device()
print("📚 Libraries loaded successfully!")

## 3. Core Implementation with BERT

We'll use a pre-trained BERT model fine-tuned for Named Entity Recognition. This model can identify various types of entities in text.

In [None]:
# Load pre-trained BERT model for token classification (NER)
# Using a reliable model with proper entity label mappings
model_name = "dslim/bert-base-NER"

print(f"📥 Loading model: {model_name}")
print("   This model is optimized for Named Entity Recognition...")

# Create token classification pipeline
# Pipeline automatically handles tokenization, prediction, and post-processing
ner_pipeline = pipeline(
    "ner",  # Named Entity Recognition task
    model=model_name,
    tokenizer=model_name,
    aggregation_strategy="simple",  # Group adjacent tokens with the same entity type into one entity
    device=device  # Run on the device detected above (CUDA, MPS, or CPU)
)

print("✅ Model loaded successfully!")
print(f"🔧 Model details:")
print(f"   Task: Named Entity Recognition (NER)")
print(f"   Base Model: BERT Base")
print(f"   Training Data: CoNLL-2003 NER dataset")
print(f"   Supported Entities: PERSON, LOCATION, ORGANIZATION, MISCELLANEOUS")

## 4. Basic Token Classification Example

Let's start with a simple example to understand how token classification works.

In [None]:
# Example text with various named entities
sample_text = "My name is John Smith and I work at Microsoft in Seattle, Washington."

print(f"📝 Input Text: '{sample_text}'")
print("\n🔍 Performing token classification...")

# Perform named entity recognition
entities = ner_pipeline(sample_text)

print(f"\n✨ Found {len(entities)} entities:")
print("=" * 50)

# Display results in a readable format
for i, entity in enumerate(entities, 1):
    print(f"{i}. Entity: '{entity['word']}'")
    print(f"   Type: {entity['entity_group']}")
    print(f"   Confidence: {entity['score']:.4f} ({entity['score']*100:.1f}%)")
    print(f"   Position: characters {entity['start']}-{entity['end']}")
    print()

## 5. Understanding the Output

Let's break down what each part of the output means and explore the model's confidence scores.
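
The grouped output fields used in this notebook (`entity_group`, `score`, `word`, `start`, `end`) are derived from token-level predictions. The sketch below approximates that grouping under simplifying assumptions: the `group_entities` helper is hypothetical, the token-level scores are invented for illustration, and the group score is taken as the mean of its token scores. It is not the pipeline's actual implementation.

```python
text = "John Smith works at Microsoft"

# Hypothetical token-level predictions (scores are invented for illustration).
# Tokens labelled "O" are omitted from the list.
token_predictions = [
    {"entity": "B-PER", "score": 0.99, "start": 0, "end": 4},
    {"entity": "I-PER", "score": 0.97, "start": 5, "end": 10},
    {"entity": "B-ORG", "score": 0.98, "start": 20, "end": 29},
]

def group_entities(text, predictions):
    """Merge consecutive tokens of the same entity type into one group."""
    groups = []
    for pred in predictions:
        prefix, _, entity_type = pred["entity"].partition("-")
        last = groups[-1] if groups else None
        if last and prefix == "I" and last["entity_group"] == entity_type:
            # Extend the open group to cover this token
            last["scores"].append(pred["score"])
            last["end"] = pred["end"]
        else:
            groups.append({
                "entity_group": entity_type,
                "scores": [pred["score"]],
                "start": pred["start"],
                "end": pred["end"],
            })
    # Average the token scores and recover the surface text from the offsets
    return [
        {
            "entity_group": g["entity_group"],
            "score": sum(g["scores"]) / len(g["scores"]),
            "word": text[g["start"]:g["end"]],
            "start": g["start"],
            "end": g["end"],
        }
        for g in groups
    ]

grouped = group_entities(text, token_predictions)
for entity in grouped:
    print(entity)
```

Note that `start` and `end` are character offsets into the input string, with `end` exclusive, so `text[start:end]` returns the entity exactly as it appears in the input.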

In [None]:
# Create a more detailed analysis function
def analyze_entities(text):
    """Analyze entities with detailed breakdown."""
    
    print(f"📊 Detailed Analysis of: '{text}'")
    print("=" * 60)
    
    # Get entities
    entities = ner_pipeline(text)
    
    if not entities:
        print("❌ No entities found in the text.")
        return
    
    # Entity type summary
    entity_types = {}
    for entity in entities:
        entity_type = entity['entity_group']
        if entity_type not in entity_types:
            entity_types[entity_type] = []
        entity_types[entity_type].append(entity['word'])
    
    print(f"📈 Entity Summary:")
    for entity_type, words in entity_types.items():
        print(f"   {entity_type}: {len(words)} entities → {', '.join(words)}")
    
    print(f"\n🔍 Detailed Results:")
    
    # Detailed entity information
    for i, entity in enumerate(entities, 1):
        confidence_level = "High" if entity['score'] > 0.9 else "Medium" if entity['score'] > 0.7 else "Low"
        
        print(f"\n{i}. '{entity['word']}' → {entity['entity_group']}")
        print(f"   📍 Position: {entity['start']}-{entity['end']} | Confidence: {entity['score']:.4f} ({confidence_level})")
        
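        # Extract context (up to 20 characters on each side of the entity)
        start_context = max(0, entity['start'] - 20)
        end_context = min(len(text), entity['end'] + 20)
        context = text[start_context:end_context]
        # Show an ellipsis only where the text was actually cut off
        prefix = "..." if start_context > 0 else ""
        suffix = "..." if end_context < len(text) else ""
        print(f"   🔤 Context: '{prefix}{context}{suffix}'")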

# Test with our sample
analyze_entities(sample_text)

## 6. Advanced Examples with Different Entity Types

Let's test the model with various types of texts to see how it handles different scenarios.

In [None]:
# Test cases with different types of entities
test_cases = [
    {
        "name": "Business Context",
        "text": "Apple Inc. CEO Tim Cook announced new products at the Apple Park campus in Cupertino, California."
    },
    {
        "name": "News Article",
        "text": "President Biden met with European leaders in Brussels to discuss NATO policies."
    },
    {
        "name": "Academic Context",
        "text": "The research was conducted by Dr. Sarah Johnson at Stanford University in collaboration with MIT."
    },
    {
        "name": "Mixed Entities",
        "text": "Google's headquarters in Mountain View, California, hosts thousands of engineers working on AI projects."
    }
]

print("🧪 Testing Multiple Scenarios:")
print("=" * 80)

for i, case in enumerate(test_cases, 1):
    print(f"\n📋 Test Case {i}: {case['name']}")
    print(f"📝 Text: '{case['text']}'")
    
    entities = ner_pipeline(case['text'])
    
    if entities:
        print(f"✅ Found {len(entities)} entities:")
        for entity in entities:
            print(f"   • '{entity['word']}' → {entity['entity_group']} (confidence: {entity['score']:.3f})")
    else:
        print("❌ No entities detected")
    
    print("-" * 40)

## 7. Visualization of Results

Let's create a visual representation of the token classification results.

In [None]:
def visualize_entities(text, title="Entity Visualization"):
    """Create a visual representation of entities in text."""
    
    entities = ner_pipeline(text)
    
    if not entities:
        print(f"📊 {title}")
        print("❌ No entities found to visualize.")
        return
    
    # Color coding for different entity types
    colors = {
        'PER': '🟦',  # Person - Blue
        'ORG': '🟩',  # Organization - Green
        'LOC': '🟨',  # Location - Yellow
        'MISC': '🟪'  # Miscellaneous - Purple
    }
    
    print(f"📊 {title}")
    print("=" * len(title))
    print(f"📝 Original: {text}")
    print()
    
    # Create annotated version
    annotated_text = text
    
    # Insert annotations from right to left so earlier character offsets stay valid
    entities_sorted = sorted(entities, key=lambda x: x['start'], reverse=True)
    
    for entity in entities_sorted:
        entity_type = entity['entity_group']
        color = colors.get(entity_type, '🟫')
        
        start_pos = entity['start']
        end_pos = entity['end']
        
        # Annotate the original span so casing and spacing match the input text
        annotation = f"{color}[{text[start_pos:end_pos]}({entity_type})]"
        
        # Replace the span in the text
        annotated_text = annotated_text[:start_pos] + annotation + annotated_text[end_pos:]
    
    print(f"🎯 Annotated: {annotated_text}")
    print()
    
    # Legend
    print("📋 Legend:")
    legend_items = []
    for entity in entities:
        entity_type = entity['entity_group']
        if entity_type not in [item['type'] for item in legend_items]:
            legend_items.append({
                'type': entity_type,
                'color': colors.get(entity_type, '🟫'),
                'full_name': {
                    'PER': 'Person',
                    'ORG': 'Organization', 
                    'LOC': 'Location',
                    'MISC': 'Miscellaneous'
                }.get(entity_type, entity_type)
            })
    
    for item in legend_items:
        print(f"   {item['color']} {item['type']} = {item['full_name']}")
    
    print()

# Visualize our examples
visualize_entities(sample_text, "Sample Text Visualization")

print("\n" + "="*80)

# Visualize a complex example
complex_text = "Barack Obama visited Google headquarters in Mountain View and met with Sundar Pichai to discuss AI innovations."
visualize_entities(complex_text, "Complex Example Visualization")

## 8. Understanding the Model Architecture

Let's explore the technical details of our BERT-based token classification model.
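
One of the figures reported in the model details is the memory footprint, which is simple arithmetic: parameter count times bytes per parameter. The sketch below works through it for a few precisions. The parameter count is an assumed round number for a BERT-base-sized model, not a measurement of this checkpoint, and the `weight_memory_gb` helper is hypothetical.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="fp32"):
    """Estimate the memory needed to hold the model weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)

# Assumed, approximate size for a BERT-base-sized model
assumed_params = 110_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(assumed_params, dtype):.2f} GB")
```

This covers the weights only. Activations, framework overhead, and (during training) gradients and optimizer state add to the total.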

In [None]:
# Load the tokenizer and model separately for detailed analysis
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

print("🔧 Model Technical Details:")
print("=" * 40)

# Model configuration
config = model.config
print(f"📋 Model Configuration:")
print(f"   Model Name: {config.name_or_path}")
print(f"   Model Type: {config.model_type.upper()}")
print(f"   Number of Labels: {config.num_labels}")
print(f"   Vocabulary Size: {config.vocab_size:,}")
print(f"   Max Sequence Length: {config.max_position_embeddings}")
print(f"   Hidden Size: {config.hidden_size}")
print(f"   Number of Attention Heads: {config.num_attention_heads}")
print(f"   Number of Hidden Layers: {config.num_hidden_layers}")

# Label mapping
print(f"\n🏷️  Label Mapping:")
for label_id, label_name in config.id2label.items():
    print(f"   {label_id}: {label_name}")

# Model size information
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n📊 Model Size Information:")
print(f"   Total Parameters: {total_params:,}")
print(f"   Trainable Parameters: {trainable_params:,}")
print(f"   Model Size: ~{total_params/1_000_000:.1f}M parameters")
print(f"   Memory Footprint: ~{total_params * 4 / (1024**3):.2f} GB (FP32)")

## 9. Performance Analysis

Let's analyze the performance characteristics of token classification.
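
When timing inference, two details tend to matter: using a monotonic high-resolution clock, and running an untimed warm-up pass so one-off setup costs do not skew the first measurement. The sketch below shows the pattern with a stand-in function (`fake_model`, hypothetical) instead of the NER pipeline so that it runs without downloading a model; `time_calls` is likewise an illustrative helper.

```python
import statistics
import time

def time_calls(fn, inputs, num_runs=3, warmup=1):
    """Return the wall-clock time of each timed run over all inputs."""
    for _ in range(warmup):  # warm-up passes are executed but not timed
        for item in inputs:
            fn(item)

    run_times = []
    for _ in range(num_runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        for item in inputs:
            fn(item)
        run_times.append(time.perf_counter() - start)
    return run_times

def fake_model(text):
    """Stand-in workload; replace with a real inference call."""
    return text.upper().split()

texts = ["John works at Google.", "Tim Cook visited Cupertino."]
run_times = time_calls(fake_model, texts)
print(f"mean: {statistics.mean(run_times):.6f}s  min: {min(run_times):.6f}s")
```

Reporting the minimum alongside the mean helps separate the code's steady-state cost from noise caused by other processes on the machine.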

In [None]:
import time

def benchmark_performance(texts, num_runs=3):
    """Benchmark the performance of token classification."""
    
    times = []
    total_entities = 0
    
    # Count tokens once, outside the timed loop, so tokenization does not inflate the timings
    total_tokens = num_runs * sum(len(tokenizer.tokenize(text)) for text in texts)
    
    print(f"⏱️ Performance Benchmark ({num_runs} runs):")
    print("=" * 50)
    
    for run in range(num_runs):
        # perf_counter is a monotonic, high-resolution clock suited to benchmarking
        start_time = time.perf_counter()
        
        for text in texts:
            entities = ner_pipeline(text)
            total_entities += len(entities)
        
        end_time = time.perf_counter()
        times.append(end_time - start_time)
        
        print(f"   Run {run + 1}: {times[-1]:.3f}s")
    
    # Calculate statistics
    avg_time = sum(times) / len(times)
    avg_tokens = total_tokens / (len(texts) * num_runs)
    avg_entities = total_entities / (len(texts) * num_runs)
    
    print(f"\n📊 Performance Statistics:")
    print(f"   Average Time per Batch: {avg_time:.3f}s")
    print(f"   Average Time per Text: {avg_time/len(texts):.3f}s")
    print(f"   Tokens per Second: {(total_tokens/num_runs)/avg_time:.1f}")
    print(f"   Average Tokens per Text: {avg_tokens:.1f}")
    print(f"   Average Entities per Text: {avg_entities:.1f}")
    print(f"   Processing Speed: {len(texts)/avg_time:.1f} texts/second")

# Performance test with various text lengths
perf_texts = [
    "John works at Google.",  # Short
    "Microsoft CEO Satya Nadella visited the Seattle office to discuss Azure innovations with the engineering team.",  # Medium
    "The meeting between Apple's Tim Cook and Google's Sundar Pichai in Mountain View, California, focused on artificial intelligence collaboration between the two tech giants, with discussions covering machine learning, privacy policies, and future technological partnerships that could benefit consumers worldwide."  # Long
]

benchmark_performance(perf_texts)

# Performance tips
print("\n💡 Performance Tips:")
print("   • Batch processing multiple texts together is more efficient")
print("   • GPU acceleration significantly improves speed for large batches")
print("   • Consider using smaller models (e.g., DistilBERT) for faster inference")
print("   • Use half precision (FP16) on GPU to reduce memory usage")

---

## 📋 Summary

### 🔑 Key Concepts Mastered
- **Token Classification**: Understanding how to assign labels to individual tokens in text
- **BERT for NER**: Using pre-trained BERT models for Named Entity Recognition tasks
- **Entity Types**: Recognizing different types of entities (Person, Organization, Location, Miscellaneous)
- **BIO Labeling**: Understanding the Begin-Inside-Outside labeling scheme
- **Confidence Scores**: Interpreting model confidence and prediction reliability
- **Aggregation Strategy**: How sub-word tokens are combined into complete entities

### 📈 Best Practices Learned
- Use aggregation strategies to combine sub-word tokens into meaningful entities
- Monitor confidence scores to assess prediction reliability
- Consider context when interpreting entity predictions
- Batch multiple texts together to improve throughput
- Validate results especially for domain-specific or uncommon entity types

### 🚀 Next Steps
- **Fine-tuning**: Learn to fine-tune BERT for custom entity types in [Fine-tuning Notebook](../05_fine_tuning_trainer.ipynb)
- **Custom Datasets**: Explore training on domain-specific NER datasets
- **Other Models**: Try different model architectures (RoBERTa, DistilBERT) or other NLP libraries such as spaCy
- **Documentation**: Review [Hugging Face Token Classification Guide](https://huggingface.co/docs/transformers/tasks/token_classification)
- **Model Hub**: Explore specialized [NER models](https://huggingface.co/models?pipeline_tag=token-classification) for different domains

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*