# 📰 POS and NER Practical Example on Realistic Text Data (NLP)

This notebook demonstrates a **complete practical workflow** applying:

- Text preprocessing
- Part-of-Speech (POS) tagging
- Named Entity Recognition (NER)

on **real-world BBC news data** using:
- spaCy
- pandas
- NLTK
- regular expressions

The goal is to move beyond toy examples and apply NLP techniques
to **realistic, noisy text data**.


## 1️⃣ Overview of the Workflow

In this notebook, we:

1. Load real BBC news data
2. Focus on the **news titles** only
3. Clean and preprocess text:
   - lowercasing
   - stopword removal
   - punctuation removal
   - tokenization
   - lemmatization
4. Apply **POS tagging** using spaCy
5. Analyze most frequent nouns, verbs, and adjectives
6. Apply **Named Entity Recognition (NER)**
7. Visualize detected entities using displaCy


## 2️⃣ Importing Libraries

We use a combination of NLP and data analysis libraries:

- **NLTK** for tokenization, stopwords, and lemmatization
- **spaCy** for POS tagging and NER
- **pandas** for structured data manipulation
- **re** (Python's built-in regular expression module) for text cleaning
- **matplotlib** for optional visualization


In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import spacy
import re
import pandas as pd
import matplotlib.pyplot as plt
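from spacy import displacy
from IPython.display import HTML, display

# One-time download of the NLTK resources used below (skipped if already present).
# Recent NLTK releases load the tokenizer from 'punkt_tab' rather than 'punkt'.
for resource in ('stopwords', 'punkt', 'punkt_tab', 'wordnet'):
    nltk.download(resource, quiet=True)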

## 3️⃣ Loading the BBC News Dataset

We load BBC news data from a CSV file.
Each row represents a news article, including a title and content.

For this exercise, we focus only on the **title column**,
which is short but rich in linguistic information.


# Load Data

In [None]:
bbc_data = pd.read_csv('../../data/bbc_news.csv')

In [None]:
bbc_data.head()

In [None]:
bbc_data.info()  

## 4️⃣ Selecting the Text for Analysis

To simplify the analysis and reduce noise, we extract
only the **news titles** into a separate DataFrame.


In [None]:

titles = pd.DataFrame(bbc_data['title'])
titles.head()
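
The later cleaning steps call string methods on every title, so a missing or empty title would either propagate as `NaN` or raise an error inside the `lambda`. Whether the BBC file actually contains such rows is not known here; the sketch below uses a small made-up DataFrame (`sample`, a hypothetical stand-in for `bbc_data`) to show one way of guarding against them.

```python
import pandas as pd

# Hypothetical stand-in for bbc_data; the headlines are invented for illustration.
sample = pd.DataFrame(
    {"title": ["Budget 2023: key points", None, "   ", "Storm warning for the north"]}
)

# Drop missing titles, trim surrounding whitespace, then drop titles that are empty.
titles_checked = sample[["title"]].dropna(subset=["title"]).copy()
titles_checked["title"] = titles_checked["title"].str.strip()
titles_checked = titles_checked[titles_checked["title"] != ""].reset_index(drop=True)

print(titles_checked)
```

If `bbc_data.info()` already reports no null values in `title`, this step is unnecessary.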

## 5️⃣ Text Cleaning and Preprocessing

Before applying POS tagging and NER, we clean the text to improve consistency.

Steps applied:
- convert text to lowercase
- remove stopwords
- remove punctuation
- tokenize text
- lemmatize tokens

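This preprocessing reduces noise and normalizes words,
which mainly benefits the frequency analysis later on.
Keep in mind that spaCy's models are trained on ordinary cased text,
so lowercasing and stopword removal can lower POS and NER accuracy.
The `tokens_raw` column keeps the original tokens for comparison.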
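
To make the order of the steps concrete, here is a minimal sketch that runs them on a single headline using only the standard library. The headline and the tiny `STOPWORDS` set are invented for illustration (the notebook itself uses NLTK's full English list), and lemmatization is left out.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
STOPWORDS = {"the", "in", "of", "a", "to", "is", "it"}

def clean_title(title):
    """Lowercase, drop stopwords, strip punctuation, then split into tokens."""
    lowered = title.lower()
    no_stop = " ".join(w for w in lowered.split() if w not in STOPWORDS)
    no_punct = re.sub(r"[^\w\s]", " ", no_stop)
    return no_punct.split()

print(clean_title("The cost of living: What is changing in April?"))
# ['cost', 'living', 'what', 'changing', 'april']

print(clean_title("Who is it?"))
# ['who', 'it']
```

The second call shows a side effect of removing stopwords before punctuation: `it?` does not match the stopword `it`, so it survives and only loses its question mark afterwards. Swapping the two steps would avoid this.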


In [None]:
titles['lowercase'] = titles['title'].str.lower()
titles.head()

In [None]:
en_stopwords = set(stopwords.words('english'))  # a set gives fast membership checks
titles['no_stopwords'] = titles['lowercase'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in en_stopwords)
)
titles.head()

In [None]:
# punctuation removal: replace anything that is not a word character or whitespace with a space
titles['no_stopwords_nopunct'] = titles['no_stopwords'].str.replace(r"[^\w\s]", ' ', regex=True)
titles.head()

In [None]:
# tokenize both the original titles and the cleaned titles
titles['tokens_raw'] = titles['title'].apply(word_tokenize)
titles['tokens_clean'] = titles['no_stopwords_nopunct'].apply(word_tokenize)

In [None]:
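# lemmatization
# Note: without a `pos` argument, WordNetLemmatizer treats every token as a noun,
# so verb forms such as "running" are left unchanged.
lemmatizer = WordNetLemmatizer()
titles['tokens_clean_lemmatized'] = titles['tokens_clean'].apply(
    lambda tokens: [lemmatizer.lemmatize(token) for token in tokens]
)
titles.head()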

## 6️⃣ Preparing Tokens for spaCy

spaCy's `nlp()` call expects a single string, not a list of token lists.

After tokenizing and lemmatizing each title, we:
- flatten all token lists into one list
- join that list into a single string (done when the pipeline is called in the next section)

This allows spaCy to process the entire dataset in one call.


In [None]:
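# Flatten the per-title token lists (linear time, unlike sum(lists, []))
tokens_raw_list = [token for tokens in titles['tokens_raw'] for token in tokens]
tokens_clean_list = [token for tokens in titles['tokens_clean_lemmatized'] for token in tokens]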
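
As a small illustration of what the flatten-and-join step produces, the sketch below applies it to two made-up token lists (`token_lists` is hypothetical, not taken from the dataset). It uses `itertools.chain.from_iterable`, one of several equivalent ways to flatten.

```python
from itertools import chain

# Hypothetical cleaned token lists for two headlines.
token_lists = [["storm", "warning", "north"], ["budget", "key", "point"]]

flat_tokens = list(chain.from_iterable(token_lists))  # one flat list of tokens
joined_text = " ".join(flat_tokens)                   # the single string spaCy will receive

print(flat_tokens)
print(joined_text)
```

Note that the joined string no longer records where one headline ends and the next begins.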

## 7️⃣ Part-of-Speech (POS) Tagging

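Using spaCy, we assign a part-of-speech category
(noun, verb, adjective, etc.) to each token.
The `pos_` attribute used here holds the coarse-grained
Universal POS tag, for example `NOUN`, `VERB`, or `ADJ`.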

The result is stored in a pandas DataFrame with:
- the token
- its POS tag


In [None]:

# Requires the model to be installed once: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

In [None]:
spacy_doc = nlp(' '.join(tokens_clean_list))  

In [None]:
records = [(t.text, t.pos_) for t in spacy_doc]
pos_df = pd.DataFrame(records, columns=['token', 'pos_tag'])
pos_df.head()
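
Passing the whole corpus as one string works for small datasets, but spaCy then treats every headline as part of one long document, and inputs longer than `nlp.max_length` (1,000,000 characters by default) raise an error. A possible alternative, sketched here and not part of the original workflow, is to stream the titles through `nlp.pipe` so that each title becomes its own `Doc`. The helper name `pos_table_per_title` is hypothetical, and the demo uses `spacy.blank("en")` so that it runs without a model download.

```python
import pandas as pd
import spacy

def pos_table_per_title(texts, nlp):
    """Tag each text as its own Doc and return one row per token."""
    rows = []
    for title_id, doc in enumerate(nlp.pipe(texts)):
        for token in doc:
            rows.append((title_id, token.text, token.pos_))
    return pd.DataFrame(rows, columns=["title_id", "token", "pos_tag"])

# Demo with a blank pipeline: tokenizer only, so pos_tag stays empty.
# With nlp = spacy.load("en_core_web_sm") the pos_tag column is filled in.
nlp_blank = spacy.blank("en")
demo = pos_table_per_title(["Hello world", "Good day"], nlp_blank)
print(demo)
```

The extra `title_id` column makes it possible to trace every token back to its headline.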

## 8️⃣ Analyzing POS Frequencies

We group tokens by:
- word
- POS tag

and count how often each combination appears.

This allows us to:
- identify common nouns
- identify frequent verbs
- identify descriptive adjectives


In [None]:
pos_df_counts = pos_df.value_counts(['token', 'pos_tag']).reset_index(name='counts')


In [None]:
pos_df_counts.head(10)

## 9️⃣ Extracting Top POS Categories

From the POS frequency table, we extract:
- the most common nouns
- the most common verbs
- the most common adjectives

This provides insight into the **language style**
of BBC news headlines.


In [None]:
nouns = pos_df_counts[pos_df_counts['pos_tag'] == 'NOUN'][0:10]
nouns

In [None]:
verbs = pos_df_counts[pos_df_counts['pos_tag'] == 'VERB'][0:10]
verbs

In [None]:
adjectives = pos_df_counts[pos_df_counts['pos_tag'] == 'ADJ'][0:10]
adjectives
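
The three cells above repeat the same filter with a different tag. As a sketch, the pattern can be wrapped in a small helper; `top_tokens` is a hypothetical name, and the `counts_demo` table below is invented data in the same shape as `pos_df_counts`.

```python
import pandas as pd

# Invented counts in the same shape as pos_df_counts.
counts_demo = pd.DataFrame(
    {
        "token": ["year", "say", "new", "police", "win", "first"],
        "pos_tag": ["NOUN", "VERB", "ADJ", "NOUN", "VERB", "ADJ"],
        "counts": [50, 40, 30, 20, 10, 5],
    }
)

def top_tokens(df, pos_tag, n=10):
    """Return the n most frequent tokens carrying the given POS tag."""
    subset = df[df["pos_tag"] == pos_tag]
    return subset.nlargest(n, "counts").reset_index(drop=True)

print(top_tokens(counts_demo, "NOUN", n=1))
```

Using `nlargest` sorts explicitly, so the helper does not rely on the input already being ordered by count.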

## 🔟 Named Entity Recognition (NER)

Next, we apply **Named Entity Recognition** to detect:
- people
- organizations
- locations
- dates
- geopolitical entities

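spaCy's pretrained pipeline predicts these entities from the surrounding context
and exposes them as spans in `doc.ents`, each with a label
such as `PERSON`, `ORG`, `LOC`, `DATE`, or `GPE`.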


In [None]:
records = []
for ent in spacy_doc.ents:
    records.append((ent.text, ent.label_))
ner_df = pd.DataFrame(records, columns=['token', 'ner_tag'])
ner_df.head()

## 1️⃣1️⃣ Counting Named Entities

We count how often each named entity appears
and group them by:
- entity text
- entity label

This helps identify:
- frequently mentioned organizations
- recurring locations
- prominent people in the news


In [None]:
ner_df_counts = ner_df.value_counts(['token', 'ner_tag']).reset_index(name='counts')
ner_df_counts.head(10)
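
Once the counts exist, two follow-up questions come up naturally: how many mentions does each label account for, and which entities carry a given label? The sketch below answers both with pandas on an invented table (`ner_counts_demo`, with made-up counts) that has the same columns as `ner_df_counts`.

```python
import pandas as pd

# Invented entity counts in the same shape as ner_df_counts.
ner_counts_demo = pd.DataFrame(
    {
        "token": ["london", "bbc", "uk", "nhs"],
        "ner_tag": ["GPE", "ORG", "GPE", "ORG"],
        "counts": [12, 9, 7, 3],
    }
)

# Total mentions per entity label, largest first.
mentions_by_label = (
    ner_counts_demo.groupby("ner_tag")["counts"].sum().sort_values(ascending=False)
)

# Entities that carry one specific label.
organizations = ner_counts_demo[ner_counts_demo["ner_tag"] == "ORG"]

print(mentions_by_label)
print(organizations)
```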


## 1️⃣2️⃣ Visualizing Named Entities with displaCy

Finally, we use **displaCy** to visually inspect
the named entities detected by spaCy.

Entities are highlighted directly in the text
using different colors for each label.


In [None]:


html = displacy.render(spacy_doc, style="ent", jupyter=False)  
display(HTML(html))  
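
Rendering the full corpus produces one very long block of HTML. To see what the `ent` style looks like on a single line, displaCy can also render hand-specified entities through its `manual=True` mode, which needs no trained model. The headline and its character offsets below are invented for illustration and are not model output.

```python
from spacy import displacy

# Hand-labelled example: character offsets mark where each entity starts and ends.
example = {
    "text": "BBC reports from London",
    "ents": [
        {"start": 0, "end": 3, "label": "ORG"},
        {"start": 17, "end": 23, "label": "GPE"},
    ],
    "title": None,
}

# jupyter=False returns the markup as a string instead of displaying it.
example_html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(example_html[:200])
```

In a notebook, the returned string can be shown with `display(HTML(example_html))`, as in the cell above.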

## ✅ Final Takeaways

- Real-world text requires careful preprocessing
- POS tagging reveals grammatical structure
- NER extracts meaningful real-world information
- spaCy makes advanced NLP tasks accessible
- Visualization helps validate model behavior
- Understanding the pipeline is more important than copying code
