# 🏷️ Named Entity Recognition (NER)
*(Personal Practice Notes)*

This notebook contains my **own breakdown and understanding** of Named Entity Recognition (NER) using spaCy.

The goal is not just to make the code work, but to understand:
- what spaCy returns,
- how named entities are stored and accessed,
- how preprocessing choices affect NER results.

These notes are meant for **self-learning and experimentation**.


## 1️⃣ What is Named Entity Recognition (NER)?

Named Entity Recognition is an NLP task that identifies and classifies **real-world entities** in text, such as:
- people
- organizations
- locations
- dates
- products
- geopolitical entities

NER helps convert **unstructured text** into **structured information** and is commonly used in:
- information extraction
- search systems
- knowledge graphs
- conversational agents


## 2️⃣ Imports and Model Setup

In [None]:
import spacy
from spacy import displacy  # built-in visualizer; highlights entities in the text
from IPython.display import HTML, display  # render HTML strings (e.g. displaCy output) in the notebook
import re  # used later to strip punctuation

In [None]:
# Load spaCy's small English pipeline, en_core_web_sm, which includes
# vocabulary, syntax, and a named entity recognizer.
nlp = spacy.load("en_core_web_sm")  # nlp is now a callable pipeline for processing text

In [None]:
google_text = "Google was founded on September 4, 1998, by computer scientists Larry Page and Sergey Brin while they were PhD students at Stanford University in California. Together they own about 14% of its publicly listed shares and control 56% of its stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly owned subsidiary of Alphabet Inc. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry Page, who became the CEO of Alphabet. On December 3, 2019, Pichai also became the CEO of Alphabet."
print(google_text)

## 3️⃣ Creating a spaCy Document

Processing text with spaCy creates a **Doc object**, which stores:
- tokens
- named entities
- linguistic annotations


In [None]:
spacy_doc = nlp(google_text)

In [None]:

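# Each item in doc.ents is a Span (one or more tokens), not a single word.
# label_ is the readable label (e.g. "ORG"); label is its integer hash ID.
for ent in spacy_doc.ents:
    print(ent.text, ent.label_, ent.label)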


## 4️⃣ Visualizing Named Entities with displaCy

spaCy provides **displaCy**, a built-in visualizer that highlights
named entities directly in the text.


In [25]:

# style="ent" selects the entity visualizer.
# With jupyter=True, displaCy renders directly in the notebook; with jupyter=False
# it returns the markup as an HTML string, so we decide how to display it.
html = displacy.render(spacy_doc, style="ent", jupyter=False)
display(HTML(html))  # render the HTML string in the notebook


## 5️⃣ Effect of Text Preprocessing on NER

NER is highly sensitive to text structure.
Removing punctuation and lowercasing can reduce accuracy because:
- capitalization cues are lost
- sentence boundaries disappear
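
To make the two points above concrete, here is a minimal sketch that counts these surface cues before and after the same regex cleaning used in this notebook. It uses only the standard library; the helper name `surface_cues` and the short `sample` string are my own illustrative choices, not part of spaCy.

```python
import re

# Short sample built from sentences in the Google paragraph.
sample = (
    "Sundar Pichai was appointed CEO of Google on October 24, 2015, "
    "replacing Larry Page. Google is Alphabet's largest subsidiary."
)

def surface_cues(text: str) -> dict:
    """Count simple surface cues: capitalized tokens, sentence-ending marks, commas."""
    tokens = text.split()
    return {
        "capitalized_tokens": sum(1 for t in tokens if t[:1].isupper()),
        "sentence_end_marks": len(re.findall(r"[.!?](?=\s|$)", text)),
        "commas": text.count(","),
    }

# Same cleaning as in the notebook: strip punctuation, then lowercase.
cleaned = re.sub(r"[^\w\s]", "", sample).lower()

before = surface_cues(sample)
after = surface_cues(cleaned)
print(before)
print(after)
print(cleaned)
```

Note that the cleaning also turns the possessive "Alphabet's" into "alphabets", so some tokens change shape, not just case.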


In [None]:
# Let's see what happens once we strip punctuation and lowercase the text.
google_text_clean = re.sub(r'[^\w\s]', '', google_text).lower()  # drop every character that is not a word character or whitespace, then lowercase
print(google_text_clean)

In [None]:
spacy_doc_clean = nlp(google_text_clean)  # process the cleaned text with the same pipeline
for ent in spacy_doc_clean.ents:
    print(ent.text, ent.label_, ent.label)

## 6️⃣ Visualizing Entities After Text Cleaning

We now visualize the entities detected **after punctuation removal and lowercasing**
to compare results with the original text.


In [None]:
html = displacy.render(spacy_doc_clean, style="ent", jupyter=False)  

display(HTML(html)) 

## 7️⃣ Observations

After aggressive preprocessing (punctuation removal plus lowercasing), spaCy's `en_core_web_sm` model struggles to identify entities correctly.

This single experiment suggests that:
- NER relies heavily on original casing
- punctuation and structure provide important context
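
One way to make this comparison less eyeball-based is to diff the two entity lists. The sketch below is a hypothetical helper (`lost_entities` is my own name, not a spaCy function). It assumes each list was built as `[(ent.text, ent.label_) for ent in doc.ents]`. The two lists in the example are hand-written placeholders, not real model output.

```python
import re

def normalize(text: str) -> str:
    """Apply the same cleaning as the notebook so entity texts are comparable."""
    return re.sub(r"[^\w\s]", "", text).lower()

def lost_entities(original, cleaned):
    """Return (text, label) pairs from `original` that have no match in `cleaned`.

    A match requires the same normalized text AND the same label, so an
    entity that survives with a different label still counts as lost.
    """
    found = {(normalize(text), label) for text, label in cleaned}
    return [
        (text, label)
        for text, label in original
        if (normalize(text), label) not in found
    ]

# Hand-written placeholder lists, NOT actual spaCy output.
original_ents = [("Google", "ORG"), ("September 4, 1998", "DATE"), ("Larry Page", "PERSON")]
cleaned_ents = [("september 4 1998", "DATE")]

missing = lost_entities(original_ents, cleaned_ents)
print(missing)
```

In the notebook, the placeholders would be replaced with the lists built from `spacy_doc.ents` and `spacy_doc_clean.ents`.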


## 8️⃣ Best Practices for NER Preprocessing

- ✅ Run NER on raw or lightly cleaned text
- ❌ Avoid aggressive preprocessing
- ⚠️ Preprocessing should be task-specific

A common strategy is:
- Run NER first
- Clean text later for other NLP tasks
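
The ordering above can be sketched as a small function. This is only an illustration under stated assumptions: `ner_then_clean`, `spacy_extractor`, and `toy_extractor` are hypothetical names of my own. The toy extractor is a regex stand-in so the example runs without a model. With spaCy, you would pass `spacy_extractor(nlp)` instead.

```python
import re
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (entity text, label)

def ner_then_clean(
    text: str, extract_entities: Callable[[str], List[Entity]]
) -> Tuple[List[Entity], str]:
    """Run entity extraction on the raw text first, then clean the text."""
    entities = extract_entities(text)  # sees original casing and punctuation
    cleaned = re.sub(r"[^\w\s]", "", text).lower()  # for other NLP tasks
    return entities, cleaned

def spacy_extractor(nlp) -> Callable[[str], List[Entity]]:
    """Wrap a loaded spaCy pipeline (e.g. the notebook's `nlp`) as an extractor."""
    def extract(text: str) -> List[Entity]:
        return [(ent.text, ent.label_) for ent in nlp(text).ents]
    return extract

def toy_extractor(text: str) -> List[Entity]:
    """Stand-in for a real model: tags capitalized words. Not real NER."""
    return [(m, "CAPITALIZED") for m in re.findall(r"\b[A-Z][a-z]+\b", text)]

entities, cleaned = ner_then_clean(
    "Google is Alphabet's largest subsidiary.", toy_extractor
)
print(entities)
print(cleaned)
```

Running the toy extractor on the already-cleaned text returns nothing, because no capitals are left. This mirrors, in a crude way, why the order of the two steps matters.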

## ✅ Final Takeaways

- NER extracts real-world entities from text
- spaCy provides strong tools for entity detection and visualization
- displaCy helps with quick inspection and debugging
- Preprocessing choices can strongly impact NER accuracy
- Preserving original text structure is crucial
