<a href="https://colab.research.google.com/github/dslmllab/dSL-Lab-Coding-Challenge/blob/main/3_named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition (NER)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the fundamentals of Named Entity Recognition
2. Implement rule-based NER approaches
3. Build machine learning models for NER
4. Use pre-trained models for entity extraction
5. Evaluate NER system performance
6. Handle multi-class entity classification
7. Build custom entity recognition systems

## Introduction to Named Entity Recognition

Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as:

- **PERSON**: Names of people
- **ORGANIZATION**: Companies, agencies, institutions
- **LOCATION**: Countries, cities, addresses
- **DATE**: Absolute or relative dates or periods
- **TIME**: Times smaller than a day
- **MONEY**: Monetary values
- **PERCENT**: Percentage values
- **FACILITY**: Buildings, airports, highways, bridges

### Why is NER Important?

1. **Information Extraction**: Extract structured information from unstructured text
2. **Search Enhancement**: Improve search results by understanding entity types
3. **Knowledge Graphs**: Build relationships between entities
4. **Question Answering**: Identify entities relevant to questions
5. **Content Classification**: Categorize documents based on entities

### Approaches to NER

1. **Rule-based**: Using patterns, dictionaries, and linguistic rules
2. **Statistical**: Using machine learning models (CRF, SVM)
3. **Deep Learning**: Using neural networks (BiLSTM-CRF, BERT)
4. **Hybrid**: Combining multiple approaches
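
The statistical and deep learning approaches commonly treat NER as token-level sequence labeling with BIO tags: `B-` begins an entity, `I-` continues it, and `O` marks tokens outside any entity. The training data later in this notebook uses that scheme. A minimal sketch of converting entity spans to BIO tags, assuming token-index spans with an exclusive end; the helper name `spans_to_bio` is illustrative and not taken from any library:

```python
from typing import List, Tuple

def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

tokens = ["John", "Smith", "works", "at", "Google"]
spans = [(0, 2, "PER"), (4, 5, "ORG")]
print(list(zip(tokens, spans_to_bio(tokens, spans))))
```

The sketch assumes spans do not overlap; nested entities would need a different encoding.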

In [1]:
# Install required packages
!pip install numpy pandas matplotlib seaborn nltk spacy scikit-learn sklearn-crfsuite tqdm


Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn-crfsuite)
  Downloading python_crfsuite-0.9.11-cp39-cp39-macosx_11_0_arm64.whl.metadata (4.3 kB)
Collecting tabulate>=0.4.2 (from sklearn-crfsuite)
  Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.11-cp39-cp39-macosx_11_0_arm64.whl (319 kB)
Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Installing collected packages: tabulate, python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.11 sklearn-crfsuite-0.5.0 tabulate-0.9.0

In [2]:

# Download spaCy English model if not present

import spacy

try:
    spacy.load("en_core_web_sm")
except OSError:
    !python -m spacy download en_core_web_sm

# Download required NLTK data

import nltk

nltk_downloads = [
    'punkt', 'punkt_tab', 'stopwords', 'wordnet', 'words',
    'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng',
    'maxent_ne_chunker', 'maxent_ne_chunker_tab',
]
for item in nltk_downloads:
    nltk.download(item, quiet=True)

In [3]:
import re
import nltk
import spacy
import pandas as pd
import numpy as np
from collections import defaultdict, Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Tuple, Dict, Any
import warnings
warnings.filterwarnings('ignore')

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)
nltk.download('conll2002', quiet=True)

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Rule-Based Named Entity Recognition

Rule-based NER uses patterns, regular expressions, and dictionaries to identify entities.

In [16]:
class RuleBasedNER:
    def __init__(self):
        # Pattern definitions
        self.patterns = {
            'EMAIL': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
            # Limitation: no whitespace is allowed after the area code, so "(555) 123-4567" is missed
            'PHONE': r'\b(?:\+?1[-.]?)?\(?[0-9]{3}\)?[-.]?[0-9]{3}[-.]?[0-9]{4}\b',
            'DATE': r'\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}[/-]\d{1,2}[/-]\d{1,2})\b',
            'TIME': r'\b(?:[01]?\d|2[0-3]):[0-5]\d(?:\s?[AaPp][Mm])?\b',
            'MONEY': r'\$\s?\d+(?:,\d{3})*(?:\.\d{2})?\b',
            # Limitation: the trailing \b cannot match between '%' and a space, so "15% " is missed
            'PERCENT': r'\b\d+(?:\.\d+)?%\b',
            'URL': r'https?://(?:[-\w.])+(?:[:\d]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:#(?:[\w.])*)?)?'
        }

        # Entity dictionaries
        self.person_titles = {'mr', 'mrs', 'ms', 'dr', 'prof', 'sir', 'madam'}
        self.organizations = {'google', 'microsoft', 'apple', 'amazon', 'facebook', 'netflix'}
        self.locations = {'new york', 'london', 'paris', 'tokyo', 'beijing', 'delhi'}

    def extract_entities(self, text: str) -> Dict[str, List[str]]:
        entities = {}

        # Pattern-based extraction
        for entity_type, pattern in self.patterns.items():
            matches = re.findall(pattern, text, re.IGNORECASE)
            if matches:
                entities[entity_type] = matches

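        # Dictionary-based extraction on lowercased whitespace tokens.
        # Limitation: punctuation stays attached ('dr.' is not 'dr') and
        # multi-word names ('new york') can never equal a single token.
        words = text.lower().split()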

        # Person detection (simple heuristic)
        persons = []
        for i, word in enumerate(words):
            if word in self.person_titles and i + 1 < len(words):
                persons.append(f"{word} {words[i+1]}")
        if persons:
            entities['PERSON'] = persons

        # Organization detection
        orgs = [word for word in words if word in self.organizations]
        if orgs:
            entities['ORGANIZATION'] = orgs

        # Location detection
        locs = [word for word in words if word in self.locations]
        if locs:
            entities['LOCATION'] = locs

        return entities

    def annotate_text(self, text: str) -> str:
        """Annotate text with entity tags"""
        annotated = text
        entities = self.extract_entities(text)

        for entity_type, entity_list in entities.items():
            for entity in entity_list:
                annotated = annotated.replace(entity, f"[{entity}]_{entity_type}")

        return annotated

# Test the rule-based NER
ner = RuleBasedNER()

sample_text = """
Dr. Smith from Google will meet with Ms. Johnson at 3:30 PM on 12/25/2023 in New York.
Please contact him at john.smith@gmail.com or call (555) 123-4567.
The project budget is $50,000 with a 15% contingency.
Visit our website at https://www.example.com for more details.
"""

print("Original text:")
print(sample_text)
print("\nExtracted entities:")
entities = ner.extract_entities(sample_text)
for ent_type, ent_list in entities.items():
    print(f"{ent_type}: {ent_list}")

print("\nAnnotated text:")
print(ner.annotate_text(sample_text))

Original text:

Dr. Smith from Google will meet with Ms. Johnson at 3:30 PM on 12/25/2023 in New York.
Please contact him at john.smith@gmail.com or call (555) 123-4567.
The project budget is $50,000 with a 15% contingency.
Visit our website at https://www.example.com for more details.


Extracted entities:
EMAIL: ['john.smith@gmail.com']
DATE: ['12/25/2023']
TIME: ['3:30 PM']
MONEY: ['$50,000']
URL: ['https://www.example.com']
ORGANIZATION: ['google']

Annotated text:

Dr. Smith from Google will meet with Ms. Johnson at [3:30 PM]_TIME on [12/25/2023]_DATE in New York.
Please contact him at [john.smith@gmail.com]_EMAIL or call (555) 123-4567.
The project budget is [$50,000]_MONEY with a 15% contingency.
Visit our website at [https://www.example.com]_URL for more details.
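
The run above misses the phone number, the percentage, both people, and New York. The causes are visible in the code: `PHONE` does not allow a space after the area code, the trailing `\b` in `PERCENT` cannot match between `%` and a space, and the dictionary lookups split on whitespace, so `dr.` and the two-word `new york` never match. A minimal sketch of one way to close those gaps, assuming the patterns are matched against the raw text rather than against whitespace tokens; the `PATTERNS` table and the `find_spans` helper are illustrative and not part of the class above:

```python
import re

# Hypothetical replacement patterns; each is matched against the raw text.
PATTERNS = {
    # Optional +1 prefix, area code with or without parentheses, flexible separators.
    "PHONE": r"(?<!\w)(?:\+?1[-.\s]?)?(?:\(\d{3}\)|\d{3})[-.\s]?\d{3}[-.\s]?\d{4}\b",
    # No trailing \b: '%' followed by a space is not a word boundary.
    "PERCENT": r"\b\d+(?:\.\d+)?%",
    # Title, optional period, then one capitalized word.
    "PERSON": r"\b(?:Mr|Mrs|Ms|Dr|Prof)\.?\s+[A-Z][a-z]+",
    # Multi-word gazetteer entries work because we search the text, not tokens.
    "LOCATION": r"\b(?:New York|London|Paris|Tokyo|Beijing|Delhi)\b",
}

def find_spans(text):
    """Return (label, start, end, matched_text) tuples sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            spans.append((label, m.start(), m.end(), m.group()))
    return sorted(spans, key=lambda s: s[1])

text = ("Dr. Smith from Google will meet with Ms. Johnson at 3:30 PM "
        "in New York. Call (555) 123-4567 about the 15% contingency.")
for label, start, end, value in find_spans(text):
    print(f"{label:<9} [{start}:{end}] {value}")
```

Returning character offsets also sidesteps a weakness of `annotate_text`: it relies on `str.replace`, which cannot find the lowercased dictionary hits (`google`) in the original-case text and would tag every occurrence of a repeated string.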



## 2. Feature-Based NER with Machine Learning

We build a feature dictionary for each token (its spelling, shape, part of speech, and neighbors) and train a classifier, logistic regression here, to assign each token a BIO tag.

In [5]:
class FeatureBasedNER:
    def __init__(self):
        self.vectorizer = DictVectorizer()
        self.model = LogisticRegression(max_iter=1000)
        self.label_to_id = {}
        self.id_to_label = {}

    def extract_features(self, tokens: List[str], pos_tags: List[str], index: int) -> Dict[str, Any]:
        """Extract features for word at given index"""
        word = tokens[index]
        pos = pos_tags[index]

        features = {
            # Word features
            'word': word.lower(),
            'word_length': len(word),
            'is_capitalized': word[:1].isupper(),  # slice avoids IndexError on an empty token
            'is_all_caps': word.isupper(),
            'is_title_case': word.istitle(),
            'is_numeric': word.isdigit(),
            'has_digit': any(c.isdigit() for c in word),
            'has_hyphen': '-' in word,
            'has_dot': '.' in word,

            # POS features
            'pos': pos,
            'is_noun': pos.startswith('N'),
            'is_proper_noun': pos == 'NNP',

            # Shape features
            'word_shape': self.get_word_shape(word),

            # Prefix/Suffix features
            'prefix_2': word[:2].lower() if len(word) >= 2 else '',
            'prefix_3': word[:3].lower() if len(word) >= 3 else '',
            'suffix_2': word[-2:].lower() if len(word) >= 2 else '',
            'suffix_3': word[-3:].lower() if len(word) >= 3 else '',
        }

        # Context features
        if index > 0:
            features['prev_word'] = tokens[index-1].lower()
            features['prev_pos'] = pos_tags[index-1]
        else:
            features['prev_word'] = 'BOS'
            features['prev_pos'] = 'BOS'

        if index < len(tokens) - 1:
            features['next_word'] = tokens[index+1].lower()
            features['next_pos'] = pos_tags[index+1]
        else:
            features['next_word'] = 'EOS'
            features['next_pos'] = 'EOS'

        return features

    def get_word_shape(self, word: str) -> str:
        """Get word shape (X=uppercase, x=lowercase, d=digit, p=punctuation)"""
        shape = ''
        for char in word:
            if char.isupper():
                shape += 'X'
            elif char.islower():
                shape += 'x'
            elif char.isdigit():
                shape += 'd'
            else:
                shape += 'p'
        return shape

    def prepare_data(self, sentences: List[List[Tuple[str, str]]]) -> Tuple[List[Dict], List[str]]:
        """Prepare features and labels from annotated sentences"""
        features = []
        labels = []

        for sentence in sentences:
            tokens = [token for token, _ in sentence]
            tags = [tag for _, tag in sentence]

            # Get POS tags
            pos_tags = [pos for _, pos in nltk.pos_tag(tokens)]

            for i in range(len(tokens)):
                feat = self.extract_features(tokens, pos_tags, i)
                features.append(feat)
                labels.append(tags[i])

        return features, labels

    def train(self, train_sentences: List[List[Tuple[str, str]]]):
        """Train the NER model"""
        print("Preparing training data...")
        features, labels = self.prepare_data(train_sentences)

        # Create label mappings
        unique_labels = sorted(set(labels))  # sorted so label ids are reproducible across runs
        self.label_to_id = {label: i for i, label in enumerate(unique_labels)}
        self.id_to_label = {i: label for label, i in self.label_to_id.items()}

        # Vectorize features and encode labels
        X = self.vectorizer.fit_transform(features)
        y = [self.label_to_id[label] for label in labels]

        print(f"Training on {len(features)} examples with {len(unique_labels)} labels...")
        self.model.fit(X, y)
        print("Training completed!")

    def predict(self, sentence: List[str]) -> List[str]:
        """Predict entity labels for a sentence"""
        pos_tags = [pos for _, pos in nltk.pos_tag(sentence)]
        features = []

        for i in range(len(sentence)):
            feat = self.extract_features(sentence, pos_tags, i)
            features.append(feat)

        X = self.vectorizer.transform(features)
        predictions = self.model.predict(X)

        return [self.id_to_label[pred] for pred in predictions]

# Create sample training data (in real scenarios, you'd use CoNLL format data)
sample_training_data = [
    [('John', 'B-PER'), ('Smith', 'I-PER'), ('works', 'O'), ('at', 'O'), ('Google', 'B-ORG'), ('in', 'O'), ('California', 'B-LOC')],
    [('Apple', 'B-ORG'), ('Inc', 'I-ORG'), ('is', 'O'), ('located', 'O'), ('in', 'O'), ('Cupertino', 'B-LOC')],
    [('Barack', 'B-PER'), ('Obama', 'I-PER'), ('was', 'O'), ('born', 'O'), ('in', 'O'), ('Hawaii', 'B-LOC')],
    [('Microsoft', 'B-ORG'), ('Corporation', 'I-ORG'), ('headquarters', 'O'), ('in', 'O'), ('Seattle', 'B-LOC')],
    [('The', 'O'), ('meeting', 'O'), ('is', 'O'), ('scheduled', 'O'), ('for', 'O'), ('Monday', 'O')]
]

# Train the model
ml_ner = FeatureBasedNER()
ml_ner.train(sample_training_data)

# Test prediction
test_sentence = ['Elon', 'Musk', 'founded', 'Tesla', 'Motors']
predictions = ml_ner.predict(test_sentence)

print("\nTest sentence with predictions:")
for word, pred in zip(test_sentence, predictions):
    print(f"{word}: {pred}")

Preparing training data...
Training on 30 examples with 6 labels...
Training completed!

Test sentence with predictions:
Elon: B-PER
Musk: O
founded: O
Tesla: O
Motors: O
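
With only five training sentences, misses such as `O` for `Musk` and `Tesla` are expected. A second, structural weakness is that the classifier labels each token independently, so nothing stops it from emitting an `I-` tag with no preceding `B-` tag of the same type. A minimal sketch of a post-processing step that repairs such sequences, assuming the common convention of promoting an orphaned `I-X` to `B-X`; the function name `repair_bio` is illustrative:

```python
from typing import List

def repair_bio(tags: List[str]) -> List[str]:
    """Promote any I-X that does not continue an entity of type X to B-X."""
    repaired = []
    prev_type = None  # entity type of the previous token, None if it was 'O'
    for tag in tags:
        if tag.startswith("I-") and prev_type != tag[2:]:
            tag = "B-" + tag[2:]
        repaired.append(tag)
        prev_type = tag[2:] if tag != "O" else None
    return repaired

print(repair_bio(["O", "I-PER", "I-PER", "O", "I-ORG", "I-LOC"]))
```

This only makes the output well-formed; it does not recover entities the classifier missed.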


## 3. Conditional Random Fields (CRF) for NER

CRF is particularly effective for sequence labeling tasks like NER.
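
A linear-chain CRF scores whole tag sequences: it combines per-token scores with scores for transitions between adjacent tags and decodes the best sequence with the Viterbi algorithm. A minimal sketch of that decoding step, using made-up scores rather than a trained model; all numbers and the names `emissions` and `transitions` are illustrative assumptions:

```python
from typing import Dict, List, Tuple

def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence under additive scores."""
    # best[t][tag] = (score of the best path ending in tag at position t, previous tag)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for t in range(1, len(emissions)):
        column = {}
        for tag in tags:
            prev_tag, score = max(
                ((p, best[t - 1][p][0] + transitions.get((p, tag), 0.0)) for p in tags),
                key=lambda item: item[1],
            )
            column[tag] = (score + emissions[t][tag], prev_tag)
        best.append(column)
    # Trace back from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for t in range(len(emissions) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]

tags = ["O", "B-PER", "I-PER"]
# Token scores for "Elon Musk": picking the best tag per token gives B-PER, O.
emissions = [
    {"O": 0.1, "B-PER": 1.0, "I-PER": 0.0},
    {"O": 0.6, "B-PER": 0.1, "I-PER": 0.5},
]
# Transition scores reward B-PER -> I-PER and penalize O -> I-PER.
transitions = {("B-PER", "I-PER"): 1.0, ("O", "I-PER"): -10.0}
print(viterbi(emissions, transitions, tags))
```

In this toy example the transition bonus outweighs the second token's slight preference for `O`, so the decoded sequence keeps the two tokens together as one entity.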

In [6]:
class CRFBasedNER:
    def __init__(self):
        self.crf = CRF(
            algorithm='lbfgs',
            c1=0.1,
            c2=0.1,
            max_iterations=100,
            all_possible_transitions=True
        )

    def word_features(self, sentence: List[str], pos_tags: List[str], i: int) -> Dict[str, Any]:
        """Extract features for word at position i, given the sentence's POS tags"""
        word = sentence[i]

        features = {
            'bias': 1.0,
            'word.lower()': word.lower(),
            'word[-3:]': word[-3:],
            'word[-2:]': word[-2:],
            'word.isupper()': word.isupper(),
            'word.istitle()': word.istitle(),
            'word.isdigit()': word.isdigit(),
            'postag': pos_tags[i],
            'postag[:2]': pos_tags[i][:2],
        }

        if i > 0:
            word1 = sentence[i-1]
            postag1 = pos_tags[i-1]
            features.update({
                '-1:word.lower()': word1.lower(),
                '-1:word.istitle()': word1.istitle(),
                '-1:word.isupper()': word1.isupper(),
                '-1:postag': postag1,
                '-1:postag[:2]': postag1[:2],
            })
        else:
            features['BOS'] = True

        if i < len(sentence) - 1:
            word1 = sentence[i+1]
            postag1 = pos_tags[i+1]
            features.update({
                '+1:word.lower()': word1.lower(),
                '+1:word.istitle()': word1.istitle(),
                '+1:word.isupper()': word1.isupper(),
                '+1:postag': postag1,
                '+1:postag[:2]': postag1[:2],
            })
        else:
            features['EOS'] = True

        return features

    def sentence_features(self, sentence: List[str]) -> List[Dict[str, Any]]:
        """Extract features for entire sentence"""
        # Tag the sentence once; tagging inside word_features would repeat the work for every token
        pos_tags = [pos for _, pos in nltk.pos_tag(sentence)]
        return [self.word_features(sentence, pos_tags, i) for i in range(len(sentence))]

    def prepare_data(self, sentences: List[List[Tuple[str, str]]]) -> Tuple[List[List[Dict]], List[List[str]]]:
        """Prepare CRF training data"""
        X = []
        y = []

        for sentence in sentences:
            words = [word for word, _ in sentence]
            labels = [label for _, label in sentence]

            X.append(self.sentence_features(words))
            y.append(labels)

        return X, y

    def train(self, train_sentences: List[List[Tuple[str, str]]]):
        """Train CRF model"""
        print("Preparing CRF training data...")
        X_train, y_train = self.prepare_data(train_sentences)

        print(f"Training CRF on {len(X_train)} sentences...")
        self.crf.fit(X_train, y_train)
        print("CRF training completed!")

    def predict(self, sentence: List[str]) -> List[str]:
        """Predict labels for sentence"""
        features = self.sentence_features(sentence)
        return self.crf.predict([features])[0]

    def evaluate(self, test_sentences: List[List[Tuple[str, str]]]) -> str:
        """Evaluate model performance"""
        X_test, y_test = self.prepare_data(test_sentences)
        y_pred = self.crf.predict(X_test)

        return flat_classification_report(y_test, y_pred)

# Extended training data for CRF
extended_training_data = sample_training_data + [
    [('Amazon', 'B-ORG'), ('Web', 'I-ORG'), ('Services', 'I-ORG'), ('hosts', 'O'), ('in', 'O'), ('Virginia', 'B-LOC')],
    [('Netflix', 'B-ORG'), ('streams', 'O'), ('globally', 'O'), ('from', 'O'), ('Los', 'B-LOC'), ('Angeles', 'I-LOC')],
    [('Tim', 'B-PER'), ('Cook', 'I-PER'), ('leads', 'O'), ('Apple', 'B-ORG'), ('today', 'O')],
    [('Facebook', 'B-ORG'), ('changed', 'O'), ('to', 'O'), ('Meta', 'B-ORG'), ('recently', 'O')],
    [('London', 'B-LOC'), ('is', 'O'), ('the', 'O'), ('capital', 'O'), ('of', 'O'), ('England', 'B-LOC')]
]

# Train CRF model
crf_ner = CRFBasedNER()
crf_ner.train(extended_training_data)

# Test CRF model
test_sentences_crf = [
    ['Mark', 'Zuckerberg', 'founded', 'Facebook', 'in', 'California'],
    ['IBM', 'has', 'offices', 'in', 'New', 'York'],
    ['Toyota', 'manufactures', 'cars', 'in', 'Japan']
]

print("\nCRF Model Predictions:")
for sentence in test_sentences_crf:
    predictions = crf_ner.predict(sentence)
    print(f"Sentence: {' '.join(sentence)}")
    for word, pred in zip(sentence, predictions):
        print(f"  {word}: {pred}")
    print()

Preparing CRF training data...
Training CRF on 10 sentences...
CRF training completed!

CRF Model Predictions:
Sentence: Mark Zuckerberg founded Facebook in California
  Mark: B-PER
  Zuckerberg: I-PER
  founded: O
  Facebook: B-ORG
  in: O
  California: B-LOC

Sentence: IBM has offices in New York
  IBM: B-ORG
  has: O
  offices: O
  in: O
  New: B-LOC
  York: I-LOC

Sentence: Toyota manufactures cars in Japan
  Toyota: B-ORG
  manufactures: O
  cars: O
  in: O
  Japan: B-LOC



## 4. Using Pre-trained NER Models

We compare the rule-based extractor from Section 1 with two pre-trained systems: NLTK's `ne_chunk` and spaCy's `en_core_web_sm` pipeline.

In [8]:
# Note: This requires spaCy model installation
# Run: python -m spacy download en_core_web_sm

def compare_ner_models(text: str):
    """Compare different NER approaches on the same text"""
    print(f"Text: {text}\n")

    # Rule-based NER
    print("1. Rule-based NER:")
    rule_entities = ner.extract_entities(text)
    for ent_type, entities in rule_entities.items():
        print(f"   {ent_type}: {entities}")

    # NLTK NER
    print("\n2. NLTK NER:")
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    chunks = nltk.ne_chunk(pos_tags)

    nltk_entities = []
    for chunk in chunks:
        if hasattr(chunk, 'label'):
            entity_name = ' '.join([token for token, pos in chunk.leaves()])
            entity_type = chunk.label()
            nltk_entities.append((entity_name, entity_type))

    for entity_name, entity_type in nltk_entities:
        print(f"   {entity_type}: {entity_name}")

    # SpaCy NER (if available)
    try:
        nlp = spacy.load("en_core_web_sm")
        doc = nlp(text)
        print("\n3. SpaCy NER:")
        for ent in doc.ents:
            print(f"   {ent.label_}: {ent.text}")
    except OSError:
        print("\n3. SpaCy NER: (Model not installed)")
        print("   Run: python -m spacy download en_core_web_sm")

# Test comparison
comparison_text = "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976."
compare_ner_models(comparison_text)

print("\n" + "="*60)

comparison_text2 = "Elon Musk, CEO of Tesla and SpaceX, was born in South Africa and now lives in Texas."
compare_ner_models(comparison_text2)

Text: Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976.

1. Rule-based NER:


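NameError: name 'ner' is not defined

**Note:** This recorded run failed because `ner` did not exist when the cell executed. Run the Section 1 cell that creates `ner = RuleBasedNER()` first, then re-run this cell.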

## 5. NER Evaluation Metrics

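NER is usually scored at the entity level: a prediction counts as correct only if both its span and its type exactly match a gold entity. The class below computes precision, recall, and F1 under that exact-match criterion.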

In [10]:
class NERMetrics:
    def __init__(self):
        self.tp = defaultdict(int)  # True positives
        self.fp = defaultdict(int)  # False positives
        self.fn = defaultdict(int)  # False negatives

    def extract_entities_from_bio(self, tokens: List[str], bio_tags: List[str]) -> List[Tuple[str, int, int, str]]:
        """Extract (type, start, end, text) tuples from a BIO-tagged sequence; an I- tag that does not continue the current entity is dropped"""
        entities = []
        current_entity = None

        for i, (token, tag) in enumerate(zip(tokens, bio_tags)):
            if tag.startswith('B-'):
                # Begin new entity
                if current_entity:
                    entities.append(current_entity)
                current_entity = (tag[2:], i, i, token)
            elif tag.startswith('I-') and current_entity and tag[2:] == current_entity[0]:
                # Continue current entity
                current_entity = (current_entity[0], current_entity[1], i,
                                current_entity[3] + ' ' + token)
            else:
                # End current entity
                if current_entity:
                    entities.append(current_entity)
                    current_entity = None

        if current_entity:
            entities.append(current_entity)

        return entities

    def evaluate_sequence(self, tokens: List[str], true_tags: List[str], pred_tags: List[str]):
        """Evaluate a single sequence"""
        true_entities = set(self.extract_entities_from_bio(tokens, true_tags))
        pred_entities = set(self.extract_entities_from_bio(tokens, pred_tags))

        # Count TP, FP, FN for each entity type
        for entity in true_entities:
            entity_type = entity[0]
            if entity in pred_entities:
                self.tp[entity_type] += 1
            else:
                self.fn[entity_type] += 1

        for entity in pred_entities:
            entity_type = entity[0]
            if entity not in true_entities:
                self.fp[entity_type] += 1

    def compute_metrics(self) -> Dict[str, Dict[str, float]]:
        """Compute precision, recall, and F1 for each entity type"""
        metrics = {}
        all_types = set(self.tp.keys()) | set(self.fp.keys()) | set(self.fn.keys())

        total_tp = total_fp = total_fn = 0

        for entity_type in all_types:
            tp = self.tp[entity_type]
            fp = self.fp[entity_type]
            fn = self.fn[entity_type]

            precision = tp / (tp + fp) if (tp + fp) > 0 else 0
            recall = tp / (tp + fn) if (tp + fn) > 0 else 0
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

            metrics[entity_type] = {
                'precision': precision,
                'recall': recall,
                'f1': f1,
                'support': tp + fn
            }

            total_tp += tp
            total_fp += fp
            total_fn += fn

        # Overall metrics
        overall_precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
        overall_recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
        overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall) if (overall_precision + overall_recall) > 0 else 0

        metrics['overall'] = {
            'precision': overall_precision,
            'recall': overall_recall,
            'f1': overall_f1,
            'support': total_tp + total_fn
        }

        return metrics

    def print_report(self):
        """Print evaluation report"""
        metrics = self.compute_metrics()

        print(f"{'Entity Type':<15} {'Precision':<10} {'Recall':<10} {'F1-Score':<10} {'Support':<10}")
        print("-" * 65)

        for entity_type, scores in metrics.items():
            if entity_type != 'overall':
                print(f"{entity_type:<15} {scores['precision']:<10.3f} {scores['recall']:<10.3f} {scores['f1']:<10.3f} {scores['support']:<10}")

        print("-" * 65)
        overall = metrics['overall']
        print(f"{'Overall':<15} {overall['precision']:<10.3f} {overall['recall']:<10.3f} {overall['f1']:<10.3f} {overall['support']:<10}")

# Example evaluation
evaluator = NERMetrics()

# Sample data for evaluation
test_cases = [
    (
        ['John', 'Smith', 'works', 'at', 'Google', 'in', 'California'],
        ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC'],
        ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC']  # Perfect prediction
    ),
    (
        ['Apple', 'Inc', 'was', 'founded', 'by', 'Steve', 'Jobs'],
        ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-PER', 'I-PER'],
        ['B-ORG', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER']  # Missed I-ORG
    ),
    (
        ['Microsoft', 'is', 'in', 'Seattle', 'Washington'],
        ['B-ORG', 'O', 'O', 'B-LOC', 'I-LOC'],
        ['B-ORG', 'O', 'O', 'B-LOC', 'B-LOC']  # Wrong tag for Washington
    )
]

for tokens, true_tags, pred_tags in test_cases:
    evaluator.evaluate_sequence(tokens, true_tags, pred_tags)

print("NER Evaluation Report:")
evaluator.print_report()

NER Evaluation Report:
Entity Type     Precision  Recall     F1-Score   Support   
-----------------------------------------------------------------
PER             1.000      1.000      1.000      2         
LOC             0.333      0.500      0.400      2         
ORG             0.667      0.667      0.667      3         
-----------------------------------------------------------------
Overall         0.625      0.714      0.667      7         
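
The `Overall` row is a micro-average: counts are pooled across entity types before the ratios are computed, so frequent types weigh more. A macro-average instead averages the per-type scores. A short sketch of both, using the TP/FP/FN counts that the three test cases above produce under exact-match scoring; the `counts` dictionary is written out by hand here for illustration:

```python
# (tp, fp, fn) per entity type, tallied from the three test cases above.
counts = {"PER": (2, 0, 0), "ORG": (2, 1, 1), "LOC": (1, 2, 1)}

def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from raw counts, guarding against zero division."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Micro: pool the counts, then compute the metrics once.
micro = prf(*(sum(c[i] for c in counts.values()) for i in range(3)))
# Macro: compute F1 per type, then take the unweighted mean.
macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)

print("micro P/R/F1: %.3f %.3f %.3f" % micro)
print("macro F1:     %.3f" % macro_f1)
```

Which average to report depends on whether rare entity types should count as much as common ones.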


## 6. Custom Entity Types and Domain Adaptation

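General-purpose models typically do not cover domain-specific types such as gene names or stock symbols. The two classes below sketch rule-based extractors with custom entity types for the biomedical and financial domains.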

In [21]:
class BiomedicalNER:
    """NER system specifically for biomedical domain"""

    def __init__(self):
        # Biomedical entity patterns
        self.patterns = {
            'GENE': r'\b[A-Z][A-Z0-9]+\b',  # Simple gene pattern
            'PROTEIN': r'\b[A-Z][a-z]+[0-9]*\b',  # Simple protein pattern
            'DISEASE': r'\b(?:cancer|diabetes|hypertension|asthma|pneumonia)\b',
            'DRUG': r'\b(?:aspirin|ibuprofen|penicillin|insulin|morphine)\b',
            'DOSAGE': r'\b\d+\s*(?:mg|g|ml|cc|units?)\b'
        }

        # Domain-specific dictionaries
        self.gene_dict = {'BRCA1', 'BRCA2', 'TP53', 'EGFR', 'KRAS'}
        self.protein_dict = {'insulin', 'hemoglobin', 'collagen', 'keratin'}
        self.disease_dict = {'alzheimer', 'parkinson', 'huntington', 'diabetes'}
        self.drug_dict = {'aspirin', 'metformin', 'lisinopril', 'atorvastatin'}

    def extract_biomedical_entities(self, text: str) -> Dict[str, List[str]]:
        """Extract biomedical entities from text"""
        entities = {}
        text_lower = text.lower()

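        # Pattern-based extraction
        # NOTE: re.IGNORECASE also applies to the case-dependent GENE and PROTEIN
        # patterns, so they match ordinary words (visible in the output below)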
        for entity_type, pattern in self.patterns.items():
            matches = re.findall(pattern, text, re.IGNORECASE)
            if matches:
                entities[entity_type] = list(set(matches))  # Remove duplicates

        # Dictionary-based extraction
        words = text_lower.split()

        genes = [word for word in words if word.upper() in self.gene_dict]
        if genes:
            entities['GENE'] = entities.get('GENE', []) + genes

        proteins = [word for word in words if word in self.protein_dict]
        if proteins:
            entities['PROTEIN'] = entities.get('PROTEIN', []) + proteins

        diseases = [word for word in words if word in self.disease_dict]
        if diseases:
            entities['DISEASE'] = entities.get('DISEASE', []) + diseases

        drugs = [word for word in words if word in self.drug_dict]
        if drugs:
            entities['DRUG'] = entities.get('DRUG', []) + drugs

        return entities

# Financial NER example
class FinancialNER:
    """NER system for financial domain"""

    def __init__(self):
        self.patterns = {
            'STOCK_SYMBOL': r'\b[A-Z]{2,5}\b',
            'CURRENCY': r'\$[0-9,]+(?:\.[0-9]{2})?|USD|EUR|GBP',
            'PERCENTAGE': r'\b\d+(?:\.\d+)?%\b',
            'FINANCIAL_TERM': r'\b(?:IPO|merger|acquisition|dividend|earnings|revenue)\b'
        }

        self.companies = {'apple', 'google', 'microsoft', 'amazon', 'tesla'}
        self.financial_instruments = {'stock', 'bond', 'option', 'future', 'etf'}

    def extract_financial_entities(self, text: str) -> Dict[str, List[str]]:
        """Extract financial entities from text"""
        entities = {}
        text_lower = text.lower()

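        # NOTE: re.IGNORECASE lets the uppercase-only STOCK_SYMBOL pattern match
        # ordinary 2-5 letter words (visible in the output below)
        for entity_type, pattern in self.patterns.items():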
            matches = re.findall(pattern, text, re.IGNORECASE)
            if matches:
                entities[entity_type] = list(set(matches))

        words = text_lower.split()

        companies = [word for word in words if word in self.companies]
        if companies:
            entities['COMPANY'] = companies

        instruments = [word for word in words if word in self.financial_instruments]
        if instruments:
            entities['INSTRUMENT'] = instruments

        return entities

# Test domain-specific NER
bio_ner = BiomedicalNER()
fin_ner = FinancialNER()

bio_text = "The patient was diagnosed with diabetes and prescribed 500mg metformin. BRCA1 gene mutation increases cancer risk."
fin_text = "Apple stock (AAPL) rose 5.2% to $150.00 after strong earnings report. The company announced a dividend increase."

print("Biomedical NER Results:")
bio_entities = bio_ner.extract_biomedical_entities(bio_text)
for ent_type, entities in bio_entities.items():
    print(f"  {ent_type}: {entities}")

print("\nFinancial NER Results:")
fin_entities = fin_ner.extract_financial_entities(fin_text)
for ent_type, entities in fin_entities.items():
    print(f"  {ent_type}: {entities}")

Biomedical NER Results:
  GENE: ['The', 'with', 'diagnosed', 'risk', 'and', 'was', 'gene', 'increases', 'mutation', 'prescribed', 'BRCA1', 'cancer', 'patient', 'metformin', 'diabetes', 'brca1']
  PROTEIN: ['The', 'with', 'diagnosed', 'risk', 'and', 'was', 'gene', 'increases', 'mutation', 'prescribed', 'BRCA1', 'cancer', 'patient', 'metformin', 'diabetes']
  DISEASE: ['cancer', 'diabetes', 'diabetes']
  DOSAGE: ['500mg']

Financial NER Results:
  STOCK_SYMBOL: ['after', 'The', 'Apple', 'stock', 'rose', 'AAPL', 'to']
  CURRENCY: ['$150.00']
  FINANCIAL_TERM: ['dividend', 'earnings']
  COMPANY: ['apple']
  INSTRUMENT: ['stock']
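
Most of the noise in the `GENE`, `PROTEIN`, and `STOCK_SYMBOL` results comes from `re.IGNORECASE`: with case ignored, `[A-Z][A-Z0-9]+` and `[A-Z]{2,5}` match ordinary lowercase words. A minimal sketch of a stricter variant, assuming case-sensitive matching, validation of gene candidates against a known-symbol list, and the heuristic that ticker symbols appear in parentheses. These choices and the function names are illustrative assumptions, not the notebook's method:

```python
import re

# Candidate gene symbols: an uppercase letter followed by uppercase letters or
# digits, matched case-sensitively, then validated against a known-symbol list.
GENE_CANDIDATE = re.compile(r"\b[A-Z][A-Z0-9]{2,}\b")
KNOWN_GENES = {"BRCA1", "BRCA2", "TP53", "EGFR", "KRAS"}

# Heuristic: treat 1-5 uppercase letters inside parentheses as a ticker symbol.
TICKER_IN_PARENS = re.compile(r"\(([A-Z]{1,5})\)")

def extract_genes(text: str):
    """Return known gene symbols in order of first appearance, without duplicates."""
    seen = []
    for match in GENE_CANDIDATE.findall(text):
        if match in KNOWN_GENES and match not in seen:
            seen.append(match)
    return seen

def extract_tickers(text: str):
    """Return the symbols found inside parentheses."""
    return TICKER_IN_PARENS.findall(text)

bio_text = ("The patient was diagnosed with diabetes and prescribed 500mg metformin. "
            "BRCA1 gene mutation increases cancer risk.")
fin_text = "Apple stock (AAPL) rose 5.2% to $150.00 after strong earnings report."

print(extract_genes(bio_text))
print(extract_tickers(fin_text))
```

Dictionary validation trades recall for precision: symbols missing from the list are skipped, so the list has to be maintained for the target domain.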


---

# NER Challenges

Test your understanding with these progressive challenges!


### Challenge 1: Pattern Enhancement
Enhance the `RuleBasedNER` class with better patterns for:
- Social Security Numbers (XXX-XX-XXXX)
- IP Addresses (XXX.XXX.XXX.XXX)
- Credit Card Numbers (XXXX-XXXX-XXXX-XXXX)

**Success Criteria:**
- Add at least 3 new entity patterns
- Test with sample text containing these entities
- Achieve 90%+ precision on test cases

In [17]:
# Your solution for Challenge 1
import re

# Minimal pattern-based NER for this challenge.
# Named PatternNER so it does not shadow the RuleBasedNER class from Section 1.
class PatternNER:
    def __init__(self):
        self.patterns = []

    def add_entity(self, label, pattern):
        self.patterns.append((label, re.compile(pattern)))

    def extract(self, text):
        entities = []
        for label, pattern in self.patterns:
            for match in pattern.finditer(text):
                entities.append((label, match.group()))
        return entities

# Enhanced rule-based NER with 3+ new patterns
class EnhancedRuleBasedNER(PatternNER):
    def __init__(self):
        super().__init__()
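        # US Social Security Number: XXX-XX-XXXX
        self.add_entity("SSN", r"\b\d{3}-\d{2}-\d{4}\b")
        # IPv4 address; each octet is restricted to 0-255
        self.add_entity("IP_ADDRESS", r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b")
        # 16-digit card number, optionally grouped by hyphens or spaces
        self.add_entity("CREDIT_CARD", r"\b(?:\d{4}[- ]?){3}\d{4}\b")
        # Optional 4th pattern: two-digit pairs such as a card expiry (MM/YY)
        self.add_entity("DATE", r"\b\d{2}/\d{2}\b")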

# Sample test text
test_text = """
Contact info: SSN 123-45-6789, IP address 192.168.1.1,
Credit card 1234-5678-9012-3456 expires 12/25. Another SSN 999-99-9999 is fake.
"""

# Ground truth: what we expect the model to correctly extract
ground_truth = [
    ("SSN", "123-45-6789"),
    ("SSN", "999-99-9999"),
    ("IP_ADDRESS", "192.168.1.1"),
    ("CREDIT_CARD", "1234-5678-9012-3456"),
    ("DATE", "12/25")
]

# Run the model
ner = EnhancedRuleBasedNER()
predicted = ner.extract(test_text)

# Calculate precision
true_positives = [ent for ent in predicted if ent in ground_truth]
TP = len(true_positives)
FP = len(predicted) - TP

precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0

# Print results
print("Predicted Entities:")
for label, value in predicted:
    print(f"  {label}: {value}")

print("\nTrue Positives:", TP)
print("False Positives:", FP)
print(f"Precision: {precision:.2%}")



Predicted Entities:
  SSN: 123-45-6789
  SSN: 999-99-9999
  IP_ADDRESS: 192.168.1.1
  CREDIT_CARD: 1234-5678-9012-3456
  DATE: 12/25

True Positives: 5
False Positives: 0
Precision: 100.00%
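
A regex can only check the shape of a credit card number. One common way to raise precision is to post-filter matches with the Luhn checksum. The sketch below is a minimal illustration, not part of the `RuleBasedNER` class; the helper names `luhn_valid` and `extract_valid_cards` and the 13-digit minimum length are assumptions made for this example.

```python
import re

def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # assumed minimum card length for this sketch
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Same shape as the CREDIT_CARD pattern used in the challenge solution.
CARD = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")

def extract_valid_cards(text):
    """Keep only regex matches whose digits pass the Luhn check."""
    return [m.group() for m in CARD.finditer(text) if luhn_valid(m.group())]

sample = "Test card 4111-1111-1111-1111 and placeholder 1234-5678-9012-3456."
print(extract_valid_cards(sample))
```

Note that the placeholder number `1234-5678-9012-3456` used in the test text does not pass the checksum, so a filter like this trades recall on synthetic data for precision on real data.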


### Challenge 2: Entity Linking
Create a simple entity linking system that maps recognized entities to knowledge base entries.

**Requirements:**
- Create a knowledge base with entity information
- Link recognized entities to KB entries
- Handle entity disambiguation

**Success Criteria:**
- Successfully link at least 80% of entities
- Handle ambiguous entities (e.g., "Apple" company vs fruit)
- Provide confidence scores for links

In [27]:
# Your solution for Challenge 2
class EntityLinker:
    def __init__(self):
        # Step 1: Create a simple knowledge base with aliases and types
        self.knowledge_base = {
            "Apple_Inc": {
                "aliases": ["Apple", "Apple Inc.", "AAPL"],
                "type": "ORG",
                "summary": "A technology company based in Cupertino.",
            },
            "apple_fruit": {
                "aliases": ["apple", "fruit"],
                "type": "FOOD",
                "summary": "A sweet fruit commonly eaten raw.",
            },
            "Atlanta": {
                "aliases": ["Atlanta", "ATL", "Atlanta City"],
                "type": "LOC",
                "summary": "A city in the United States.",
            },
            "John_Smith": {
                "aliases": ["Smith", "John Smith"],
                "type": "PER",
                "summary": "A common name, may refer to multiple people.",
            },
        }

    def link_entities(self, entities, context):
        linked_entities = []

        for mention, entity_type in entities:
            best_match = None
            best_score = 0.0

            for kb_id, entry in self.knowledge_base.items():
                # Type check
                if entry["type"] != entity_type:
                    continue

                # Match alias
                for alias in entry["aliases"]:
                    if alias.lower() == mention.lower():
                        score = 0.5  # Base score for alias match

                        # Bonus if the context contains the KB summary verbatim
                        # (a strict check that rarely fires on real text)
                        if entry["summary"].lower() in context.lower():
                            score += 0.3

                        # Bonus if the mention itself appears in the context
                        if mention.lower() in context.lower():
                            score += 0.2

                        if score > best_score:
                            best_score = score
                            best_match = kb_id

            if best_match:
                linked_entities.append({
                    "mention": mention,
                    "entity_id": best_match,
                    "confidence": round(best_score, 2)
                })
            else:
                linked_entities.append({
                    "mention": mention,
                    "entity_id": None,
                    "confidence": 0.0
                })

        return linked_entities

test_entities = [('Apple', 'ORG'), ('ATL', 'LOC'), ('Smith', 'PER')]
context = "Apple Inc. announced new products in ATL where John Smith presented."

linker = EntityLinker()
results = linker.link_entities(test_entities, context)

for result in results:
    print(result)



{'mention': 'Apple', 'entity_id': 'Apple_Inc', 'confidence': 0.7}
{'mention': 'ATL', 'entity_id': 'Atlanta', 'confidence': 0.7}
{'mention': 'Smith', 'entity_id': 'John_Smith', 'confidence': 0.7}
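
All three links above receive the same confidence because the score depends only on an exact alias match and on the mention appearing in the context. A softer context signal could help when the entity type is not known in advance, for example "Apple" the company versus the fruit. The sketch below scores each candidate by word overlap between the context and the KB summary. The function names, the stopword list, and the overlap measure are assumptions for illustration, not part of `EntityLinker`.

```python
import re

# Small assumed stopword list, just enough for this example.
STOPWORDS = {"a", "an", "the", "in", "of", "is", "to", "and"}

def tokens(text):
    """Lowercased alphabetic tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def context_overlap(context, summary):
    """Fraction of summary content words that also occur in the context."""
    summary_words = tokens(summary) - STOPWORDS
    if not summary_words:
        return 0.0
    return len(summary_words & tokens(context)) / len(summary_words)

# Candidate entries for the ambiguous mention "Apple" (summaries from the KB above).
candidates = {
    "Apple_Inc": "A technology company based in Cupertino.",
    "apple_fruit": "A sweet fruit commonly eaten raw.",
}

def disambiguate(context):
    """Return the best-scoring candidate id and all overlap scores."""
    scores = {kb_id: context_overlap(context, summary)
              for kb_id, summary in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

print(disambiguate("Apple is a technology company that sells phones."))
print(disambiguate("She ate a sweet apple, a fruit she likes raw."))
```

The overlap value could replace the fixed 0.3 bonus as a graded confidence component; how to weight it against the alias match is a design choice left open here.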




### Challenge 3: Multi-language NER
Extend the NER system to handle multiple languages (English, Spanish, French).

**Requirements:**
- Detect language of input text
- Use language-specific patterns and dictionaries
- Handle code-switching (mixed languages)

**Success Criteria:**
- Support at least 3 languages
- Achieve 75%+ F1 score on multilingual test set
- Handle mixed-language sentences

In [28]:
# Your solution for Challenge 3
import re
class MultilingualNER:
    def __init__(self):
        # Initialize language-specific named entity patterns
        self.language_patterns = {
            'en': {
                'ORG': r'\b[A-Z][a-z]+ Inc\.|\b[A-Z][a-z]+ Corp\.',
                'LOC': r'\b(California|New York|Texas|USA)\b',
            },
            'es': {
                'PER': r'\bJuan García\b',
                'LOC': r'\b(Madrid|España)\b',
            },
            'fr': {
                'PER': r'\bMarie Dupont\b',
                'LOC': r'\b(Paris|France)\b',
            }
        }

    def detect_language(self, text):
        # Naive language detection (only for 3 demo sentences)
        if "España" in text or "Madrid" in text:
            return 'es'
        elif "France" in text or "Paris" in text:
            return 'fr'
        else:
            return 'en'

    def extract_entities_multilingual(self, text):
        lang = self.detect_language(text)
        patterns = self.language_patterns.get(lang, {})
        entities = []

        for label, pattern in patterns.items():
            # finditer + group(0) returns the full match whether or not
            # the pattern contains capturing groups
            for match in re.finditer(pattern, text):
                entities.append({'entity': match.group(0), 'label': label, 'language': lang})

        return entities


# Test multilingual NER
ner = MultilingualNER()
test_texts = [
    "Apple Inc. is located in California.",       # English
    "Juan García trabaja en Madrid, España.",     # Spanish
    "Marie Dupont vit à Paris, France.",          # French
]

for text in test_texts:
    print(f"Text: {text}")
    print("Entities:", ner.extract_entities_multilingual(text))
    print("-" * 40)


Text: Apple Inc. is located in California.
Entities: [{'entity': 'Apple Inc.', 'label': 'ORG', 'language': 'en'}, {'entity': 'California', 'label': 'LOC', 'language': 'en'}]
----------------------------------------
Text: Juan García trabaja en Madrid, España.
Entities: [{'entity': 'Juan García', 'label': 'PER', 'language': 'es'}, {'entity': 'Madrid', 'label': 'LOC', 'language': 'es'}, {'entity': 'España', 'label': 'LOC', 'language': 'es'}]
----------------------------------------
Text: Marie Dupont vit à Paris, France.
Entities: [{'entity': 'Marie Dupont', 'label': 'PER', 'language': 'fr'}, {'entity': 'Paris', 'label': 'LOC', 'language': 'fr'}, {'entity': 'France', 'label': 'LOC', 'language': 'fr'}]
----------------------------------------
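
The solution above picks one language per text, so a code-switched sentence only gets the patterns of the detected language. One simple way to approach the mixed-language requirement is to apply every language's patterns and tag each match with the language whose pattern fired. The sketch below assumes a reduced pattern table (`LANGUAGE_PATTERNS`) and a hypothetical helper `extract_mixed`; it is not a replacement for real language identification.

```python
import re

# Reduced, assumed pattern table: locations only, one entry per language.
LANGUAGE_PATTERNS = {
    "en": {"LOC": r"\b(?:California|New York|Texas|USA)\b"},
    "es": {"LOC": r"\b(?:Madrid|España)\b"},
    "fr": {"LOC": r"\b(?:Paris|France)\b"},
}

def extract_mixed(text):
    """Apply all language pattern sets and return matches in text order."""
    entities = []
    for lang, patterns in LANGUAGE_PATTERNS.items():
        for label, pattern in patterns.items():
            for m in re.finditer(pattern, text):
                entities.append({
                    "entity": m.group(0),
                    "label": label,
                    "language": lang,   # language of the pattern that matched
                    "start": m.start(),
                })
    return sorted(entities, key=lambda e: e["start"])

mixed = "Marie Dupont flew from Paris to Madrid and then to California."
for ent in extract_mixed(mixed):
    print(ent)
```

Here the `language` field describes the pattern source, not the language of the surrounding words, which is a deliberate simplification.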


### Challenge 4: Active Learning for NER
Implement an active learning system that selects the most informative examples for annotation.

**Requirements:**
- Implement uncertainty sampling
- Simulate an annotation interface
- Update model with new annotations

**Success Criteria:**
- Reduce annotation effort by 40% compared to random sampling
- Achieve target performance with fewer labeled examples
- Implement at least 2 query strategies

In [25]:
# Your solution for Challenge 4
class ActiveLearningNER:
    def __init__(self, base_model):
        self.model = base_model
        self.labeled_data = []
        self.unlabeled_data = []

    def uncertainty_sampling(self, candidates, n_samples=5):
        # TODO: Select most uncertain examples
        pass

    def diversity_sampling(self, candidates, n_samples=5):
        # TODO: Select diverse examples
        pass

    def update_model(self, new_annotations):
        # TODO: Retrain model with new data
        pass

# TODO: Implement and test active learning
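
As a starting point for the uncertainty-sampling requirement, the sketch below ranks unlabeled sentences with two standard query strategies: least confidence and margin. It assumes each candidate already comes with a list of label probabilities from some model; the `pool` values are made up for illustration, and `select_uncertain` is a hypothetical helper, not the intended `ActiveLearningNER` solution.

```python
def least_confidence(probs):
    """1 - max probability: higher means the model is less sure."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely labels: smaller means less sure."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]

def select_uncertain(candidates, n_samples=2, strategy="least_confidence"):
    """candidates: list of (sentence, label_probabilities) pairs."""
    if strategy == "least_confidence":
        ranked = sorted(candidates, key=lambda c: least_confidence(c[1]), reverse=True)
    elif strategy == "margin":
        ranked = sorted(candidates, key=lambda c: margin(c[1]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [sentence for sentence, _ in ranked[:n_samples]]

# Illustrative pool with invented sentence-level probabilities.
pool = [
    ("Jordan visited Jordan.", [0.40, 0.35, 0.25]),
    ("Paris is in France.", [0.95, 0.03, 0.02]),
    ("Amazon raised prices.", [0.50, 0.49, 0.01]),
]

print(select_uncertain(pool, 1, "least_confidence"))
print(select_uncertain(pool, 1, "margin"))
```

The two strategies can disagree: least confidence looks only at the top label, while margin looks at how close the runner-up is. For token-level NER models, per-token scores would first need to be aggregated per sentence (for example by averaging), which this sketch leaves out.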



### Challenge 5: Nested Entity Recognition
Build a system that can recognize nested entities (entities within entities).

**Example:**
- "University of California, Berkeley" contains:
  - ORGANIZATION: "University of California, Berkeley"
  - LOCATION: "California"
  - LOCATION: "Berkeley"

**Requirements:**
- Handle overlapping entity spans
- Maintain entity hierarchy
- Resolve entity boundary conflicts

**Success Criteria:**
- Successfully identify nested entities in 85% of cases
- Maintain hierarchy relationships
- Handle at least 3 levels of nesting

In [26]:
# Your solution for Challenge 5
from typing import List, Tuple, Dict, Set

class NestedNER:
    def __init__(self):
        # TODO: Initialize nested entity recognition system
        pass

    def find_nested_entities(self, text: str) -> List[Dict]:
        """
        Return format: [
            {
                'text': 'entity text',
                'start': start_pos,
                'end': end_pos,
                'type': 'ENTITY_TYPE',
                'children': [nested_entities],
                'parent': parent_entity_id
            }
        ]
        """
        # TODO: Implement nested entity extraction
        pass

    def resolve_conflicts(self, entities: List[Dict]) -> List[Dict]:
        # TODO: Resolve overlapping entity conflicts
        pass

# Test nested NER
test_cases = [
    "University of California, Berkeley",
    "New York City, New York, United States",
    "Apple Inc. CEO Tim Cook from California"
]

# TODO: Test your implementation
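
One possible way to get started on nested entities is to collect all candidate spans first and then derive the hierarchy from span containment. The sketch below uses a tiny hypothetical gazetteer (`GAZETTEER`) and links each span to the smallest span that strictly contains it. It stores parents and children as list indices, which is an assumption of this example rather than a required format.

```python
import re

# Hypothetical gazetteer for illustration only.
GAZETTEER = {
    "University of California, Berkeley": "ORGANIZATION",
    "California": "LOCATION",
    "Berkeley": "LOCATION",
}

def find_spans(text):
    """Find every gazetteer phrase in the text, including overlapping ones."""
    spans = []
    for phrase, label in GAZETTEER.items():
        for m in re.finditer(re.escape(phrase), text):
            spans.append({"text": phrase, "start": m.start(), "end": m.end(),
                          "type": label, "parent": None, "children": []})
    # Sort by start offset, longer spans first, so parents precede children.
    spans.sort(key=lambda s: (s["start"], -(s["end"] - s["start"])))
    return spans

def link_nesting(spans):
    """Set each span's parent to the smallest span that strictly contains it."""
    def length(s):
        return s["end"] - s["start"]
    for i, child in enumerate(spans):
        containers = [
            j for j, other in enumerate(spans)
            if j != i
            and other["start"] <= child["start"]
            and child["end"] <= other["end"]
            and length(other) > length(child)
        ]
        if containers:
            parent = min(containers, key=lambda j: length(spans[j]))
            child["parent"] = parent
            spans[parent]["children"].append(i)
    return spans

for span in link_nesting(find_spans("University of California, Berkeley")):
    print(span)
```

Choosing the smallest container as the parent is what allows more than two levels: a span nested three deep is attached to its immediate container, not to the outermost one.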

### Challenge 6: Few-Shot NER with Meta-Learning
Implement a few-shot learning system that can quickly adapt to new entity types with minimal examples.

**Requirements:**
- Implement meta-learning algorithm (MAML or similar)
- Support new entity types with 5-10 examples
- Fast adaptation mechanism

**Success Criteria:**
- Achieve 70%+ F1 on new entity types with 5 examples
- Adaptation time < 1 minute
- Support at least 5 different domain adaptations

In [27]:
# Your solution for Challenge 6
import torch
import torch.nn as nn
from torch.optim import Adam

class FewShotNER:
    def __init__(self, embedding_dim=128, hidden_dim=256):
        # TODO: Initialize meta-learning NER model
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.model = None  # TODO: Define model architecture

    def meta_train(self, support_sets, query_sets, n_epochs=100):
        # TODO: Implement MAML training
        pass

    def fast_adapt(self, support_examples, n_steps=5):
        # TODO: Fast adaptation to new entity type
        pass

    def predict_new_domain(self, text, entity_type):
        # TODO: Predict entities of new type
        pass

# TODO: Implement and test few-shot NER
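
Before building a MAML model, it can help to have a simple few-shot baseline to compare against. The sketch below is not MAML and involves no meta-training: it swaps in nearest-prototype classification (the decision rule used by prototypical networks) over hand-crafted character trigram features. The feature function, the support examples, and all function names are assumptions chosen for illustration.

```python
from collections import Counter
import math

def featurize(token):
    """Character trigrams plus two coarse shape flags (stand-in for learned embeddings)."""
    padded = f"^{token.lower()}$"
    feats = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    if any(ch.isdigit() for ch in token):
        feats["<has_digit>"] += 1
    if token.isupper():
        feats["<is_upper>"] += 1
    return feats

def prototype(examples):
    """Average feature vector of the support examples for one entity type."""
    total = Counter()
    for tok in examples:
        total.update(featurize(tok))
    return {k: v / len(examples) for k, v in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(token, prototypes):
    """Assign the entity type whose prototype is most similar to the token."""
    feats = featurize(token)
    return max(prototypes, key=lambda label: cosine(feats, prototypes[label]))

# Five support examples per new entity type (illustrative).
support = {
    "DRUG": ["metformin", "ibuprofen", "atorvastatin", "amoxicillin", "lisinopril"],
    "GENE": ["BRCA1", "TP53", "EGFR", "KRAS", "BRCA2"],
}
prototypes = {label: prototype(examples) for label, examples in support.items()}

print(classify("simvastatin", prototypes))
print(classify("TP63", prototypes))
```

"Adaptation" here is just averaging five feature vectors, so it is effectively instant. A learned encoder trained across many entity types would replace `featurize` in a full meta-learning solution.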