# Named Entity Recognition

Named Entity Recognition (NER) extracts entities like names, locations, organizations, and dates from text.

## Methods

Method|Speed|Accuracy|Best Use Case|Confidence Score
---|:---:|:---:|---|:---:
NLTK|🐢 Slow|🔹 Decent|Classic statistical NER, offline use|❌ Not available
spaCy|⚡ Fast|🔹 Good|General NER tasks|🔹 Approximate
Transformers (BERT, RoBERTa)|🐌 Slower|🔥 Best|High-accuracy, domain-specific tasks|✅ Yes

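## 1. NLTK (classic chunker approach)

NLTK's `ne_chunk` is not purely rule-based: it applies a pre-trained maximum-entropy classifier to POS-tagged tokens.

Pros:
1. Works offline once the resources are downloaded
2. Easy to combine with hand-written rules

Cons:
1. Slower than spaCy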

In [None]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

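# Set to False on the first run to download the required resources
downloaded = True
if not downloaded:
    nltk.download('punkt_tab')                       # tokenizer ('punkt' on NLTK < 3.9)
    nltk.download('averaged_perceptron_tagger_eng')  # POS tagger (no '_eng' suffix on NLTK < 3.9)
    nltk.download('maxent_ne_chunker_tab')           # named-entity chunker
    nltk.download('words')                           # word list used by the chunker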


# Sample text
text = "Elon Musk is the CEO of Tesla and SpaceX. He was born in Pretoria, South Africa, on June 28, 1971."

# Tokenize and extract entities
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
tree = ne_chunk(pos_tags)

# Extract named entities
for subtree in tree:
    if hasattr(subtree, "label"):
        entity_name = " ".join([token for token, pos in subtree.leaves()])
        entity_type = subtree.label()
        print(f"{entity_name} -> ({entity_type})")

Elon -> (PERSON)
Musk -> (ORGANIZATION)
CEO of Tesla -> (ORGANIZATION)
SpaceX -> (ORGANIZATION)
Pretoria -> (GPE)
South Africa -> (GPE)


## 2. spaCy with workaround

spaCy does not expose per-entity confidence scores. As a rough stand-in, the cell below prints `token.prob`, the log-probability of each word in spaCy's vocabulary, which is not a true NER confidence.

Pros:
1. Fast and efficient
2. Pre-trained models available
3. Supports visualization

Cons:
1. No built-in confidence score

In [20]:
import spacy
from spacy import displacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Elon Musk is the CEO of Tesla and SpaceX. He was born in Pretoria, South Africa, on June 28, 1971."

# Process the text
doc = nlp(text)

# Print each entity with a rough score; token.prob is a word log-probability, not an NER confidence
for ent in doc.ents:
    confidence = max(token.prob for token in ent)  # highest log-probability among the entity's tokens
    print(f"{ent.text} -> {ent.label_} (Confidence: {confidence:.2f})")

# Render named entities
displacy.render(doc, style="ent", jupyter=True)

Elon Musk -> PERSON (Confidence: -20.00)
Tesla -> ORG (Confidence: -20.00)
Pretoria -> GPE (Confidence: -20.00)
South Africa -> GPE (Confidence: -20.00)
June 28, 1971 -> DATE (Confidence: -20.00)


💡 Note: spaCy’s score here is only a rough proxy, since `token.prob` is a word's log-probability in the vocabulary, not an explicit NER confidence.

- Probabilities are in log-space (negative values are normal).
- Every entity above prints -20.00. This is spaCy's default value when no word-probability table is loaded, which appears to be the case for `en_core_web_sm`, so the number says nothing about the entity itself.
- When a probability table is available, common words have values closer to 0 (e.g., "the" ≈ -3).
- Extremely rare words can score lower still.
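
To get a feel for what these log-space values mean, the sketch below converts a few of them back to plain probabilities. It assumes the values are natural logarithms; the three inputs are the illustrative numbers from the note, not fresh model output.

```python
import math

def to_probability(log_prob: float) -> float:
    """Convert a natural-log probability back to a plain probability."""
    return math.exp(log_prob)

# Illustrative log-probabilities taken from the note above
for log_prob in (-3.0, -20.0, -30.0):
    print(f"log-prob {log_prob:6.1f} -> probability {to_probability(log_prob):.2e}")
```

A value of -20 corresponds to a probability of roughly two in a billion, which is why it is not useful as a confidence score.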

## 3. Hugging Face Transformers (BERT-based NER)

In [17]:
from transformers import pipeline

# Load pre-trained NER model
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Sample text
text = "Elon Musk is the CEO of Tesla and SpaceX. He was born in Pretoria, South Africa, on June 28, 1971."

# Extract named entities
entities = ner(text)

# Print result with confidence scores
for entity in entities:
    entity_name = entity["word"]
    entity_type = entity["entity"]
    confidence = entity["score"]
    print(f"{entity_name} -> {entity_type} (Confidence: {confidence:.2f})")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


El -> I-PER (Confidence: 1.00)
##on -> I-PER (Confidence: 1.00)
Mu -> I-PER (Confidence: 1.00)
##sk -> I-PER (Confidence: 1.00)
Te -> I-ORG (Confidence: 1.00)
##sla -> I-ORG (Confidence: 1.00)
Space -> I-ORG (Confidence: 1.00)
##X -> I-ORG (Confidence: 1.00)
Pre -> I-LOC (Confidence: 1.00)
##toria -> I-LOC (Confidence: 1.00)
South -> I-LOC (Confidence: 1.00)
Africa -> I-LOC (Confidence: 1.00)


#### Output warning

The cell above warns that two checkpoint weights, `bert.pooler.dense.bias` and `bert.pooler.dense.weight`, were not used when initializing `BertForTokenClassification`.

\
\
**Why Are Some Weights Unused?**

BERT is a multi-purpose model with different components for different NLP tasks:

Component|Purpose|Used in NER?
---|---|---
Embedding Layer|Converts words into vector representations|✅ Yes
Transformer Layers|Contextualizes words|✅ Yes
Token Classification Layer|Classifies each word/token into an entity|✅ Yes
Pooler Layer (bert.pooler.dense weights)|Used for tasks like text classification (summarizes sentence into one vector)|❌ No

In NER, we classify each token separately, so we don’t need the pooler.dense weights, which are used to represent entire sentences in classification tasks (e.g., sentiment analysis).

This warning is expected because `bert.pooler` is for text classification, not token classification.
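
To make "classify each token separately" concrete, here is a minimal sketch of what a token classification head does with its output: each token gets its own vector of raw scores (logits), and a softmax turns them into probabilities. To our understanding, the `score` field printed by the pipeline is this softmax probability for the predicted label. The label list and logits below are made-up numbers for illustration only.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label set and per-token logits (not real model output)
labels = ["O", "I-PER", "I-ORG", "I-LOC"]
token_logits = {
    "Tesla": [0.1, 0.3, 6.2, 0.4],
    "is": [7.0, 0.2, 0.1, 0.3],
}

predictions = {}
for token, logits in token_logits.items():
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)  # index of the most likely label
    predictions[token] = (labels[best], probs[best])
    print(f"{token} -> {labels[best]} (Confidence: {probs[best]:.2f})")
```

No sentence-level summary vector is involved, which is why the pooler weights go unused.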

\
\
**How to suppress the warning?**

`import transformers`\
`transformers.logging.set_verbosity_error()`

### 3.1. Fix: Merge subword tokens

📌 Why are words split (e.g., "El", "##on") in Hugging Face NER results?

BERT models use WordPiece tokenization, which breaks words into subword units so that rare or unknown words can still be represented. Tokens prefixed with `##` continue the previous piece. Other models use different subword schemes (RoBERTa, for instance, uses byte-level BPE and does not emit `##`), so the fix below applies to BERT-style output.

For example:
- "Elon" → Split into ["El", "##on"]
- "Musk" → Split into ["Mu", "##sk"]
- "Tesla" → Split into ["Te", "##sla"]
- "SpaceX" → Split into ["Space", "##X"]

This ensures better handling of uncommon words, but it makes the output look fragmented.
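
The splitting itself can be sketched as a greedy longest-match-first search over a vocabulary. This is a simplified illustration rather than the tokenizer Hugging Face ships, and `toy_vocab` is a hypothetical vocabulary chosen only to reproduce the splits listed above.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (a real BERT vocabulary has about 30,000 entries)
toy_vocab = {"El", "##on", "Mu", "##sk", "Te", "##sla", "Space", "##X", "the"}

for word in ["Elon", "Musk", "Tesla", "SpaceX", "the"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Words that are in the vocabulary as a whole, such as "the" here, stay in one piece; only words missing from it get fragmented.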

In [None]:
def merge_subwords(entities):
    """Merge WordPiece pieces (e.g., "El", "##on") back into whole words.

    Returns a list of (word, label, score) tuples. The label comes from the
    first piece of each word and the score is the highest among its pieces.
    """
    merged_entities = []
    current_word = ""
    current_label = None
    current_score = 0

    for entity in entities:
        word = entity["word"]
        label = entity["entity"]
        score = entity["score"]

        # If the word starts with "##", it's a continuation of the previous token
        if word.startswith("##"):
            current_word += word[2:]  # Remove "##" and append
            current_score = max(current_score, score)  # Keep the highest score
        else:
            # Save previous word before starting a new one
            if current_word:
                merged_entities.append((current_word, current_label, current_score))

            current_word = word
            current_label = label
            current_score = score

    # Save the last word
    if current_word:
        merged_entities.append((current_word, current_label, current_score))

    return merged_entities

In [25]:
merged_entities = merge_subwords(entities)

# Print result with confidence scores
for entity in merged_entities:
    entity_name = entity[0]
    entity_type = entity[1]
    confidence = entity[2]
    print(f"{entity_name} -> {entity_type} (Confidence: {confidence:.2f})")

Elon -> B-PER (Confidence: 0.98)
Musk -> I-PER (Confidence: 1.00)
Tesla -> B-ORG (Confidence: 1.00)
and -> I-ORG (Confidence: 0.61)
SpaceX -> B-ORG (Confidence: 1.00)
Pretoria -> B-LOC (Confidence: 1.00)
South -> B-LOC (Confidence: 1.00)
Africa -> I-LOC (Confidence: 1.00)
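
The merged output still lists "Elon" and "Musk" as separate words. A possible next step, sketched below, is to join consecutive `B-`/`I-` words into whole entities. `group_entities` and its `min_score` parameter are hypothetical helpers, not part of any library. The sketch assumes every entity starts with a `B-` tag, as in the output above, and it keeps the lowest word score as a conservative estimate. The sample tuples are copied from the printed output, so the scores are rounded.

```python
def group_entities(merged, min_score=0.0):
    """Combine consecutive B-/I- tagged words into whole entities.

    `merged` is a list of (word, label, score) tuples, as returned by merge_subwords.
    """
    groups = []
    for word, label, score in merged:
        if score < min_score:
            continue  # drop low-confidence words before grouping
        prefix, _, entity_type = label.partition("-")
        # An I- tag extends the previous group only if the entity type matches
        if prefix == "I" and groups and groups[-1][1] == entity_type:
            prev_text, _, prev_score = groups[-1]
            groups[-1] = (f"{prev_text} {word}", entity_type, min(prev_score, score))
        else:
            groups.append((word, entity_type, score))
    return groups

# Values copied from the printed output above (rounded to two decimals)
sample = [
    ("Elon", "B-PER", 0.98), ("Musk", "I-PER", 1.00),
    ("Tesla", "B-ORG", 1.00), ("and", "I-ORG", 0.61),
    ("SpaceX", "B-ORG", 1.00), ("Pretoria", "B-LOC", 1.00),
    ("South", "B-LOC", 1.00), ("Africa", "I-LOC", 1.00),
]

for text, entity_type, score in group_entities(sample, min_score=0.85):
    print(f"{text} -> {entity_type} (Confidence: {score:.2f})")
```

Without the threshold, the low-confidence "and" would be attached to "Tesla". The `pipeline` call also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a similar grouping inside the library.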


### 3.2. Fix: Date recognition *(not really)*

The Hugging Face model used above (`dbmdz/bert-large-cased-finetuned-conll03-english`) was fine-tuned on the CoNLL-2003 dataset, which has only four entity types and no date label. The `dslim/bert-base-NER` model tried next is fine-tuned on the same dataset, so it shares these types:

Entity Type|Meaning
---|---
PER|Person (e.g., "Elon Musk")
ORG|Organization (e.g., "Tesla")
LOC|Location (e.g., "South Africa")
MISC|Miscellaneous (e.g., nationalities or events)
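
Before swapping models, it can help to check which entity types a model knows at all. The sketch below strips the `B-`/`I-` prefixes from a label map and tests for a date type. The `conll_id2label` dictionary is an illustrative CoNLL-2003 style map written by hand, and `entity_types` is a hypothetical helper.

```python
def entity_types(id2label):
    """Return the set of entity types in a label map, without B-/I- prefixes."""
    return {label.split("-", 1)[1] for label in id2label.values() if "-" in label}

# Illustrative CoNLL-2003 style label map (an assumption; check your model's own config)
conll_id2label = {
    0: "O",
    1: "B-MISC", 2: "I-MISC",
    3: "B-PER", 4: "I-PER",
    5: "B-ORG", 6: "I-ORG",
    7: "B-LOC", 8: "I-LOC",
}

types = entity_types(conll_id2label)
print(sorted(types))
print("Supports dates:", "DATE" in types)
```

With a loaded pipeline, the real label map is usually available as `ner.model.config.id2label`.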

In [28]:
from transformers import pipeline
import transformers
transformers.logging.set_verbosity_error() # Disable warnings

# Load pre-trained NER model
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER")

# Sample text
text = "Elon Musk is the CEO of Tesla and SpaceX. He was born in Pretoria, South Africa, on June 28, 1971."

# Extract named entities
entities = ner_pipeline(text)

# Merge subwords
merged_entities = merge_subwords(entities)

# Print result with confidence scores
for entity in merged_entities:
    entity_name = entity[0]
    entity_type = entity[1]
    confidence = entity[2]
    if confidence > 0.85:
        print(f"{entity_name} -> {entity_type} (Confidence: {confidence:.2f})")

Elon -> B-PER (Confidence: 0.98)
Musk -> I-PER (Confidence: 1.00)
Tesla -> B-ORG (Confidence: 1.00)
SpaceX -> B-ORG (Confidence: 1.00)
Pretoria -> B-LOC (Confidence: 1.00)
South -> B-LOC (Confidence: 1.00)
Africa -> I-LOC (Confidence: 1.00)


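#### Note
The date is still missing. Consider a model trained on a dataset that has a DATE label, such as OntoNotes (for example Flair's `flair/ner-english-ontonotes-large`). As far as we can tell, `flair/ner-english-large` is itself a 4-class CoNLL-03 model and would not help here.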

### Legend

Many NER models, including the ones above, use the IOB tagging scheme to mark entities in text.

Tag|Meaning
:--:|---
B-XXX|Begin: Start of an entity (e.g., first word of a location name)
I-XXX|Inside: Continuation of an entity (e.g., second word in a multi-word location)
O|Outside: Not part of any entity

💡 The B- tag helps the model differentiate between separate entities.

**Example**: "I visited New York and Los Angeles"

Word|Correct Tag|Incorrect Tag (Without B-LOC)
---|:---:|---
New|B-LOC|I-LOC
York|I-LOC|I-LOC
and|O|I-LOC ❌ (Wrongly included in New York)
Los|B-LOC|I-LOC ❌ (Still part of New York)
Angeles|I-LOC|I-LOC

✅ Using B-LOC ensures that "Los Angeles" is recognized as a separate entity.
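
The table can be checked with a small decoder that turns a tag sequence back into entities. This is a minimal sketch, with `iob_to_entities` as a hypothetical helper, run on both tag columns from the example.

```python
def iob_to_entities(words, tags):
    """Decode parallel lists of words and IOB tags into (text, type) entities."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        prefix, _, entity_type = tag.partition("-")
        # A B- tag, or an I- tag of a different type, starts a new entity
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if prefix == "O" or starts_new:
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix in ("B", "I"):
            current_words.append(word)
            current_type = entity_type
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

words = ["New", "York", "and", "Los", "Angeles"]
correct_tags = ["B-LOC", "I-LOC", "O", "B-LOC", "I-LOC"]
incorrect_tags = ["I-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC"]

print(iob_to_entities(words, correct_tags))
print(iob_to_entities(words, incorrect_tags))
```

The correct tags decode to two locations, while the all-`I-LOC` column collapses into a single span covering the whole phrase.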