# **Named Entity Recognition**


NER is a **token-level classification** task: every token in the input receives its own label.

In NLP, Named Entity Recognition (NER) identifies named entities such as people, organizations, locations, and dates in unstructured text and classifies them into predefined categories.
The process detects the words or phrases that refer to these entities and assigns each one a category. This gives the text a structure that applications such as information extraction, question answering, and chatbots can use.

Approach followed:
* Use a pretrained BERT model that is already fine-tuned for NER: `dslim/bert-base-NER`.
* Run inference on a few sample sentences to show how the model performs.

| Abbreviation | Description |
|---------------|-------------|
| O       | Outside of a named entity |
| B-MISC  | Beginning of a miscellaneous entity right after another miscellaneous entity |
| I-MISC  | Miscellaneous entity |
| B-PER   | Beginning of a person’s name right after another person’s name |
| I-PER   | Person’s name |
| B-ORG   | Beginning of an organization right after another organization |
| I-ORG   | Organization |
| B-LOC   | Beginning of a location right after another location |
| I-LOC   | Location |
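To make the tag scheme in the table concrete, here is a minimal sketch of how token-level tags can be grouped back into entities. The function name `group_entities` and the hand-labeled tokens are assumptions for illustration only; they are not model output. The logic starts a new entity on any `B-` tag or whenever the entity type changes, so it also copes with the variant of the scheme where every entity begins with `B-`.

```python
def group_entities(tokens, tags):
    """Group token-level IOB tags into (text, type) entity spans."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        # "I-PER" -> prefix "I", type "PER"; "O" -> prefix "O", type ""
        prefix, _, ent_type = tag.partition("-")
        starts_new = tag == "O" or prefix == "B" or ent_type != current_type
        if starts_new and current_tokens:
            # Close the entity collected so far
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if tag != "O":
            current_tokens.append(token)
            current_type = ent_type
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hand-labeled example (hypothetical labels, not produced by the model)
tokens = ["Elon", "Musk", "visited", "New", "York", "."]
tags = ["I-PER", "I-PER", "O", "I-LOC", "I-LOC", "O"]
example_entities = group_entities(tokens, tags)
print(example_entities)
```

A `B-` tag matters when two entities of the same type are adjacent: with tags `["I-LOC", "B-LOC"]` the two tokens are kept as separate locations instead of being merged into one.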

In [1]:
# loading the necessary libraries

import pandas as pd
from transformers import pipeline

In [2]:
# sample sentences

sentences = [
    "I book a flight to Paris with Air France.",
    "She read a book about the history of Google and Sundar Pichai.",
    "Elon Musk, the CEO of Tesla, attended the conference in New York.",
    "The 2024 Olympics will be held in France.",
]

In [8]:
!huggingface-cli login

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `major project (advanced rag)` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active 

In [None]:
# loading the access key

from google.colab import userdata

token = userdata.get('HF_TOKEN')

In [30]:
# Load a BERT checkpoint that is already fine-tuned for NER

MODEL_NAME = "dslim/bert-base-NER"
ner_pipeline = pipeline(
    "token-classification",
    model=MODEL_NAME,
    # "simple" groups adjacent tokens with the same entity type into one entity
    aggregation_strategy="simple",
    token=token,
)

print(f"Loaded BERT-based NER Tagger: {MODEL_NAME}\n")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Loaded BERT-based NER Tagger: dslim/bert-base-NER



In [17]:
# ---Process Sentences and Format Output ---
results_list = []

for sentence in sentences:
    print(f"Processing Sentence: '{sentence}'")
    # Get the entity tags
    tags = ner_pipeline(sentence)

    # Store results for nice tabular printing
    word_data = {
        'Sentence': sentence,
        'Entity': [item['word'] for item in tags],
        'NER Tag': [item['entity_group'] for item in tags]
    }
    results_list.append(word_data)
    print("-" * 60)

Processing Sentence: 'I book a flight to Paris with Air France.'
------------------------------------------------------------
Processing Sentence: 'She read a book about the history of Google and Sundar Pichai.'
------------------------------------------------------------
Processing Sentence: 'Elon Musk, the CEO of Tesla, attended the conference in New York.'
------------------------------------------------------------
Processing Sentence: 'The 2024 Olympics will be held in France.'
------------------------------------------------------------


In [18]:
final_output = []
for result in results_list:
    df_temp = pd.DataFrame({'Entity': result['Entity'], 'NER Tag': result['NER Tag']})

    # Append this sentence's entities, then an empty row as a visual separator
    final_output.append(df_temp)
    final_output.append(pd.DataFrame([{'Entity': '', 'NER Tag': ''}]))

# Concatenate all results and drop the trailing separator row
df_final = pd.concat(final_output, ignore_index=True).iloc[:-1]
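The empty separator rows make the table easy to read, but they are awkward to filter or group on. One possible alternative, sketched below, keeps one row per entity and carries the sentence along in its own column. The helper name `flatten_results` and the `sample_results` stand-in are assumptions for illustration; the stand-in only mimics the shape of `results_list`.

```python
def flatten_results(results):
    """Return one row (dict) per detected entity, keeping the source sentence."""
    rows = []
    for result in results:
        for entity, tag in zip(result["Entity"], result["NER Tag"]):
            rows.append(
                {"Sentence": result["Sentence"], "Entity": entity, "NER Tag": tag}
            )
    return rows


# Hand-written stand-in with the same keys as the entries in `results_list`
sample_results = [
    {
        "Sentence": "I book a flight to Paris with Air France.",
        "Entity": ["Paris", "Air France"],
        "NER Tag": ["LOC", "ORG"],
    },
    # A sentence with no detected entities contributes no rows
    {"Sentence": "Nothing to tag here.", "Entity": [], "NER Tag": []},
]
rows = flatten_results(sample_results)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` should give a table that can be grouped by `Sentence` or filtered by `NER Tag` without special handling for blank rows.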

In [19]:
df_final

|   | Entity | NER Tag |
|---|--------|---------|
| 0 | Paris | LOC |
| 1 | Air France | ORG |
| 2 | | |
| 3 | Google | ORG |
| 4 | Sun | PER |
| 5 | ##dar Picha | PER |
| 6 | | |
| 7 | El | PER |
| 8 | ##on Musk | PER |
| 9 | Tesla | ORG |
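The output above contains WordPiece fragments such as `Sun` / `##dar Picha` and `El` / `##on Musk`: a single name was returned as two entity groups, and the `##` prefix marks a piece that continues the previous word. A minimal post-processing sketch is shown below. The helper name `merge_wordpieces` is an assumption, and the sample list is copied by hand from the rows above rather than produced by a fresh pipeline run. It should be applied to one sentence's pipeline output at a time so that fragments are never merged across sentences.

```python
def merge_wordpieces(entities):
    """Glue '##' continuation pieces onto the preceding entity of the same type."""
    merged = []
    for ent in entities:
        word = ent["word"]
        continues_previous = (
            word.startswith("##")
            and merged
            and merged[-1]["entity_group"] == ent["entity_group"]
        )
        if continues_previous:
            # Drop the '##' marker and append directly, without a space
            merged[-1]["word"] += word[2:]
        else:
            merged.append({"entity_group": ent["entity_group"], "word": word})
    return merged


# Entities for the second sentence, copied from the table above
sample = [
    {"entity_group": "ORG", "word": "Google"},
    {"entity_group": "PER", "word": "Sun"},
    {"entity_group": "PER", "word": "##dar Picha"},
]
merged_sample = merge_wordpieces(sample)
print(merged_sample)
```

This only repairs the split; it cannot restore characters the model did not tag, so `Picha` stays as it is. Another option that may avoid the split in the first place is a word-level setting such as `aggregation_strategy="first"` when building the pipeline, though that was not tried here.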
