# Named Entity Recognition (NER)
***
## Table of Contents
1. [Introduction](#1-introduction)
    - [Types of NER](#types-of-ner)
    - [Common Steps](#common-steps)
2. [NER with NLTK](#2-ner-with-nltk)
3. [NER with spaCy](#3-ner-with-spacy)
4. [NER with Hugging Face Transformers](#4-ner-with-hugging-face-transformers)
***


## 1. Introduction
Named Entity Recognition (NER) is a key task in Natural Language Processing (NLP): it identifies entities in unstructured text and classifies them into predefined categories such as person names, organisations, locations, dates, and monetary values. For example, in the sentence:

> Apple Inc. is planning to open a new office in San Francisco in March 2025.

NER should recognise 'Apple Inc.' as an organisation, 'San Francisco' as a location, and 'March 2025' as a date.

NER is foundational for applications such as information extraction, search engines, document classification, and chatbots. It converts raw, unstructured text into structured data, making natural-language content easier for machines to understand, search, and analyse.

### Types of NER
NER models can be:
- **Rule-based**: Use hand-crafted patterns and dictionaries.
- **Machine Learning-based**: Use algorithms like Conditional Random Fields (CRFs).
- **Deep Learning-based**: Use neural-network-based models (BiLSTM-CRF, BERT, etc.).
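
As a minimal sketch of the rule-based approach, the snippet below tags the example sentence from the introduction with a tiny hand-written dictionary and a single date pattern. The `GAZETTEER` entries, the `DATE_PATTERN` regex, and the `rule_based_ner` helper are illustrative assumptions, not part of any library.

```python
import re

# Hypothetical gazetteer (dictionary of known entities): surface form -> label.
GAZETTEER = {
    "Apple Inc.": "ORG",
    "San Francisco": "LOC",
}

# Hand-crafted pattern for dates written as "<Month> <4-digit year>".
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{4}}\b")


def rule_based_ner(text):
    """Return (start, end, label, surface) tuples sorted by position."""
    entities = []
    # Dictionary lookup: every literal occurrence of a known surface form.
    for surface, label in GAZETTEER.items():
        for match in re.finditer(re.escape(surface), text):
            entities.append((match.start(), match.end(), label, surface))
    # Pattern matching: dates that fit the regular expression.
    for match in DATE_PATTERN.finditer(text):
        entities.append((match.start(), match.end(), "DATE", match.group()))
    return sorted(entities)


text = "Apple Inc. is planning to open a new office in San Francisco in March 2025."
entities = rule_based_ner(text)
for start, end, label, surface in entities:
    print(f"{label}: {surface}")
```

Rules like these are easy to inspect but only find what the dictionary and patterns anticipate, which is the usual motivation for the learned approaches listed above.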

### Common Steps
1. Text Input
    - Raw text is provided as input.
2. Text Preprocessing
    - Tokenisation
    - POS Tagging
3. Feature Extraction
    - Extracting features such as capitalisation, context, and known entity lists.
4. Model Application
    - Applying a trained model to classify tokens.
5. Entity Classification
    - Assigning labels to detected entities.
6. Post-Processing
    - Refining results, handling ambiguities or nested entities.
7. Output Generation
    - Producing annotated or structured output.
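
Step 3 (feature extraction) can be sketched as follows for a classical, feature-based tagger. This is only an illustration under stated assumptions: the `token_features` helper, the feature names, and the `known_entities` set are hypothetical, and whitespace splitting stands in for a real tokeniser.

```python
def token_features(tokens, i, gazetteer):
    """Build a feature dict for tokens[i], of the kind a CRF-style tagger consumes."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        # Capitalisation features.
        "is_capitalised": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        # Known entity list (gazetteer) lookup.
        "in_gazetteer": word in gazetteer,
        # Context features: neighbouring words, with sentence-boundary markers.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = "Apple Inc. is planning to open a new office in San Francisco".split()
known_entities = {"Apple", "San", "Francisco"}  # hypothetical entity list
features = [token_features(tokens, i, known_entities) for i in range(len(tokens))]
print(features[0])
```

Deep-learning models largely replace this manual step with learned representations, but the same kinds of cues (casing, context, known names) are what they end up relying on.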

## 2. NER with NLTK
The Natural Language Toolkit (NLTK) library supports both rule-based chunking and a pre-trained statistical model for NER. The example below uses the following functions:

`nltk.sent_tokenize` -> Splits text into individual sentences.

`nltk.word_tokenize` -> Splits a sentence into individual word and punctuation tokens.

`nltk.pos_tag` -> Assigns a part-of-speech (POS) tag to each token.

`nltk.ne_chunk` -> Performs named entity recognition (NER) by grouping POS-tagged tokens into 'chunks' that represent entities.

The POS tags follow the Penn Treebank tag set:

| POS Tag | Meaning                                   |
|---------|-------------------------------------------|
| CC      | Coordinating conjunction                  |
| CD      | Cardinal number                           |
| DT      | Determiner                                |
| EX      | Existential there                         |
| FW      | Foreign word                              |
| IN      | Preposition or subordinating conjunction  |
| JJ      | Adjective                                 |
| JJR     | Adjective, comparative                    |
| JJS     | Adjective, superlative                    |
| LS      | List item marker                          |
| MD      | Modal                                     |
| NN      | Noun, singular or mass                    |
| NNS     | Noun, plural                              |
| NNP     | Proper noun, singular                     |
| NNPS    | Proper noun, plural                       |
| PDT     | Predeterminer                             |
| POS     | Possessive ending                         |
| PRP     | Personal pronoun                          |
| PRP$    | Possessive pronoun                        |
| RB      | Adverb                                    |
| RBR     | Adverb, comparative                       |
| RBS     | Adverb, superlative                       |
| RP      | Particle                                  |
| SYM     | Symbol                                    |
| TO      | to                                        |
| UH      | Interjection                              |
| VB      | Verb, base form                           |
| VBD     | Verb, past tense                          |
| VBG     | Verb, gerund or present participle        |
| VBN     | Verb, past participle                     |
| VBP     | Verb, non-3rd person singular present     |
| VBZ     | Verb, 3rd person singular present         |
| WDT     | Wh-determiner                             |
| WP      | Wh-pronoun                                |
| WP$     | Possessive wh-pronoun                     |
| WRB     | Wh-adverb                                 |




In [1]:
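import nltk

# Download the resources used below: tokeniser, POS tagger, NE chunker, and word list.
# Resource names vary between NLTK versions; recent releases may additionally
# ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')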

[nltk_data] Downloading package punkt to /Users/tsu76i/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/tsu76i/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /Users/tsu76i/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to /Users/tsu76i/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [2]:
# Adjacent string literals are joined, so no indentation leaks into the text.
sentence = (
    "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, "
    "Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans "
    "felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
)

In [3]:
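# 1. Split the text into sentences.
# 2. Split each sentence into word tokens.
# 3. Assign a POS tag to each token.
# 4. Group the tagged tokens into chunks with ne_chunk, which performs the NER.
# 5. Print only the chunks that carry a 'label' (PERSON, GPE, etc.), i.e. the named entities.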

for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(f"{chunk.label()}: {' '.join(c[0] for c in chunk)}")

GPE: WASHINGTON
GPE: New York
PERSON: Loretta E. Lynch
GPE: Brooklyn
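
The chunker returns a tree in which each entity is a labelled subtree and every other token is a plain `(word, tag)` pair, which is why the loop above checks for a `label` attribute. A common flat encoding of the same information is IOB tagging (`B-` for the first token of an entity, `I-` for the following ones, `O` for everything else). The sketch below shows the idea on a plain-Python stand-in for the tree; the `to_iob` helper and the nested-list layout are assumptions made for illustration, and the POS tags in the sample are only indicative. For real NLTK trees, `nltk.chunk.tree2conlltags` performs a comparable conversion.

```python
def to_iob(chunked):
    """Flatten a chunked sentence into (word, pos, iob_tag) triples.

    `chunked` is a simplified stand-in for the chunker's tree: each item is
    either a (word, pos) pair or a (label, [(word, pos), ...]) entity chunk.
    """
    triples = []
    for item in chunked:
        if isinstance(item[1], list):  # entity chunk
            label, leaves = item
            for position, (word, pos) in enumerate(leaves):
                prefix = "B" if position == 0 else "I"
                triples.append((word, pos, f"{prefix}-{label}"))
        else:  # token outside any entity
            word, pos = item
            triples.append((word, pos, "O"))
    return triples


# Hand-written fragment of the sentence above.
chunked = [
    ("by", "IN"),
    ("GPE", [("New", "NNP"), ("York", "NNP")]),
    ("police", "NN"),
    ("officers", "NNS"),
]
iob = to_iob(chunked)
for triple in iob:
    print(triple)
```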


## 3. NER with spaCy
The `spaCy` library ships pre-trained pipelines that include a deep-learning-based NER component.

> `sm`/`md`/`lg` refer to the sizes of the models (small, medium, large respectively).
[Reference](https://stackoverflow.com/questions/50487495/what-is-difference-between-en-core-web-sm-en-core-web-mdand-en-core-web-lg-mod)

In [4]:
import spacy

In [5]:
# Load spaCy's pre-trained small English pipeline, which includes an NER component.
# The model has to be downloaded first, e.g. with uv:
# 1. uv pip install pip
# 2. uv run -- spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# Adjacent string literals are joined, so no indentation leaks into the text.
text = (
    "On September 14, 2023, Elon Musk announced at a press conference in Berlin that Tesla would be investing over $5 billion to build a new Gigafactory in the outskirts of Paris, France. The event was attended by French President Emmanuel Macron, "
    "representatives from the European Union, and executives from several major automotive companies, including Volkswagen and BMW."
)

doc = nlp(text)

for ent in doc.ents:
    print(f'{ent.label_}: {ent.text}')

DATE: September 14, 2023
PERSON: Elon Musk
GPE: Berlin
ORG: Tesla
MONEY: over $5 billion
GPE: Paris
GPE: France
NORP: French
PERSON: Emmanuel Macron
ORG: the European Union
ORG: Volkswagen
ORG: BMW
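
For the final "output generation" step, the flat list of entities is often regrouped into a structured record. The following sketch groups the `(label, text)` pairs printed above by label and serialises them as JSON. The `group_by_label` helper is a hypothetical name, and the pairs are copied by hand from the output; with a live `doc` they could be built from `ent.label_` and `ent.text` instead.

```python
import json
from collections import defaultdict

# (label, text) pairs copied from the printed output above.
pairs = [
    ("DATE", "September 14, 2023"),
    ("PERSON", "Elon Musk"),
    ("GPE", "Berlin"),
    ("ORG", "Tesla"),
    ("MONEY", "over $5 billion"),
    ("GPE", "Paris"),
    ("GPE", "France"),
    ("NORP", "French"),
    ("PERSON", "Emmanuel Macron"),
    ("ORG", "the European Union"),
    ("ORG", "Volkswagen"),
    ("ORG", "BMW"),
]


def group_by_label(pairs):
    """Group entity texts by label, keeping first-seen order and dropping duplicates."""
    grouped = defaultdict(list)
    for label, text in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(pairs)
print(json.dumps(grouped, indent=2))
```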


## 4. NER with Hugging Face Transformers
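Hugging Face Transformers provides NER through transformer-based models such as BERT. These models build a contextual representation of every token, so the same word can be labelled differently depending on its surroundings, and they generally reach high accuracy on standard benchmarks. The checkpoint used below is a BERT-large model fine-tuned on the CoNLL-2003 English dataset.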

In [6]:
from transformers import pipeline

In [7]:
# Load pre-trained NER model with entity aggregation
ner = pipeline(
    task='ner',
    model='dbmdz/bert-large-cased-finetuned-conll03-english',
    aggregation_strategy='simple'
)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0
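
The `aggregation_strategy='simple'` argument asks the pipeline to merge token-level predictions into whole entities, which is why the results later expose an `entity_group` key. The sketch below illustrates the general idea on made-up sub-word predictions; it is a simplified approximation written for this tutorial, not the library's actual implementation, and the `merge_tokens` helper, the sample pieces, and their scores are all hypothetical.

```python
def merge_tokens(token_preds):
    """Merge (piece, tag, score) predictions into entity groups.

    Tags follow the B-/I-/O scheme; pieces starting with '##' continue
    the previous word (WordPiece-style sub-words).
    """
    groups = []
    current = None
    for piece, tag, score in token_preds:
        if tag == "O":
            current = None  # a non-entity token closes any open group
            continue
        prefix, label = tag.split("-", 1)
        # Start a new group on B-, after a gap, or when the entity type changes.
        if current is None or prefix == "B" or current["entity_group"] != label:
            current = {"entity_group": label, "pieces": [], "scores": []}
            groups.append(current)
        current["pieces"].append(piece)
        current["scores"].append(score)

    merged = []
    for group in groups:
        merged.append({
            "entity_group": group["entity_group"],
            # Re-attach sub-word pieces to the word they continue.
            "word": " ".join(group["pieces"]).replace(" ##", ""),
            # Report the mean of the token scores as the group score.
            "score": sum(group["scores"]) / len(group["scores"]),
        })
    return merged


# Made-up token-level predictions, for illustration only.
token_preds = [
    ("NVIDIA", "B-ORG", 0.99),
    ("CEO", "O", 0.99),
    ("Jensen", "B-PER", 0.99),
    ("Hu", "I-PER", 0.98),
    ("##ang", "I-PER", 0.97),
]
merged = merge_tokens(token_preds)
for group in merged:
    print(group["entity_group"], group["word"])
```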


In [8]:
text = """In a landmark announcement at CES 2026 in Las Vegas, NVIDIA CEO Jensen Huang revealed a $500 million partnership with Siemens AG to build AI-powered smart factories across Europe. 
The collaboration will establish three flagship facilities in Munich (Germany), Zurich (Switzerland), and Barcelona (Spain), with construction beginning Q2 2027. 
Accompanying Huang were Siemens CEO Roland Busch and European Commission President Ursula von der Leyen, who called it "a pivotal moment for Industry 5.0".
The initiative aims to create 15,000 jobs while reducing industrial carbon emissions by 45% by 2030, aligning with the Paris Climate Accord. 
Technical development will be led by Dr. Maria Schmidt from NVIDIA's Berlin R&D Center and Prof. Luca Bianchi of Politecnico di Milano."""

# Process text with detailed output
results = ner(text)

# Display results
print('{:<30} {:<15} {:10}'.format('Entity', 'Type', 'Confidence'))
print('-' * 60)
for entity in results:
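    # Double quotes outside, single quotes inside: reusing the same quote type within an f-string needs Python 3.12+.
    print(f"{entity['word']:<30} {entity['entity_group']:<15} {entity['score']:.4f}")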

Entity                         Type            Confidence
------------------------------------------------------------
CES                            MISC            0.9805
Las Vegas                      LOC             0.9983
NVIDIA                         ORG             0.9974
Jensen Huang                   PER             0.9996
Siemens AG                     ORG             0.9994
AI                             MISC            0.9774
Europe                         LOC             0.9998
Munich                         LOC             0.9993
Germany                        LOC             0.9998
Zurich                         LOC             0.9995
Switzerland                    LOC             0.9999
Barcelona                      LOC             0.9995
Spain                          LOC             0.9999
Huang                          PER             0.9994
Siemens                        ORG             0.9991
Roland Busch                   PER             0.9997
European Commissi