# POS Tagging with spaCy (Personal Practice Notes)

This notebook contains my **own breakdown and understanding** of Part-of-Speech (POS) tagging using spaCy and pandas.

The goal is not just to make the code work, but to understand:
- what spaCy returns,
- how pandas is used to organize the data,
- and why certain approaches are better than others.

These notes are meant for **self-learning and helping fellow learners**.


## 1️⃣ Introduction: Text Tagging in NLP

After pre-processing text (for example using n-grams), we can extend our analysis by **tagging** the text.

There are two common types of tagging:

### 1) Part-of-Speech (POS) Tagging
- Labels each word with its grammatical role
- Examples: NOUN, VERB, ADJ, DET, PRON

### 2) Named Entity Recognition (NER)
- Identifies real-world entities such as:
  - people
  - locations
  - organizations
  - works of art

Tagging helps us:
- understand what's inside a text
- explore language patterns
- create features for machine learning models
- perform standalone linguistic analysis


In [None]:
import spacy
import pandas as pd

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")


## 2️⃣ Libraries and Model Setup

We use:
- **spaCy** for NLP processing (tokenization, POS tagging, etc.)
- **pandas** for organizing and analyzing structured data

The `en_core_web_sm` model is:
- lightweight
- fast
- suitable for learning and small projects

In [None]:

emma_ja = """emma woodhouse handsome clever and rich with a comfortable home and happy disposition seemed to unite some of the best blessings of existence and had lived nearly twentyone years in the world with very little to distress or vex her she was the youngest of the two daughters of a most affectionate indulgent father and had in consequence of her sisters marriage been mistress of his house from a very early period her mother had died too long ago for her to have more than an indistinct remembrance of her caresses and her place had been supplied by an excellent woman as governess who had fallen little short of a mother in affection sixteen years had miss taylor been in mr woodhouses family less as a governess than a friend very fond of both daughters but particularly of emma between them it was more the intimacy of sisters even before miss taylor had ceased to hold the nominal office of governess the mildness of her temper had hardly allowed her to impose any restraint and the shadow of authority being now long passed away they had been living together as friend and friend very mutually attached and emma doing just what she liked highly esteeming miss taylors judgment but directed chiefly by her own"""
print(emma_ja)

In [None]:
spacy_doc = nlp(emma_ja)


## 3️⃣ Creating a spaCy Doc

A **spaCy Doc** is a **structured object**, not just a string.

It contains:
- the original text
- tokens (words)
- linguistic annotations such as POS tags

Each token already has:
- `token.text` → the word
- `token.pos_` → its part-of-speech tag
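
To make the "structured object" idea concrete without loading a model, here is a minimal sketch that builds a tiny Doc by hand. The words, spacing flags, and POS tags are my own assumptions for illustration and are not output from `en_core_web_sm`. It assumes spaCy v3, where the `Doc` constructor accepts a `pos` argument.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-written words and tags (assumptions for illustration, not model output)
words = ["Emma", "liked", "the", "house"]
spaces = [True, True, True, False]  # whether each token is followed by a space
pos_tags = ["PROPN", "VERB", "DET", "NOUN"]

# Build a Doc directly from the lists; no trained pipeline is involved
toy_doc = Doc(Vocab(), words=words, spaces=spaces, pos=pos_tags)

print(toy_doc.text)   # the text reassembled from tokens and spacing
print(len(toy_doc))   # number of tokens

for token in toy_doc:
    print(token.text, token.pos_)
```

In real use the tags come from the loaded pipeline, but the attributes you read from each token are the same.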

In [None]:
print("=== Full Text ===")
print(spacy_doc.text)

print(spacy_doc[0].text)   # first token
print(spacy_doc[0].pos_)   # POS tag of first token


## 4️⃣ Extracting Tokens and POS Tags into a DataFrame

We want a table with:
- one row per word
- the word itself
- its POS tag

In [None]:
# Method 1: Row-by-row DataFrame concatenation (Instructor's approach)
pos_df = pd.DataFrame(columns=["token", "pos_tag"])

for token in spacy_doc:
    pos_df = pd.concat(
        [
            pos_df,
            pd.DataFrame.from_records([
                {"token": token.text, "pos_tag": token.pos_}
            ])
        ],
        ignore_index=True
    )
print(pos_df)

### ❌ Method 1 Notes

This approach works, but it is inefficient because:
- a new DataFrame is created on every loop iteration
- all existing rows are copied each time, so total work grows roughly quadratically with the number of tokens
- the slowdown becomes noticeable on large texts

In [None]:
# Method 2: Collect rows in a list, then build the DataFrame once
records = []

for token in spacy_doc:
    records.append((token.text, token.pos_))

pos_df = pd.DataFrame(records, columns=["token", "pos_tag"])
print(pos_df)



### ✅ Method 2 Notes: Why Method 2 Is Better

This approach is:
- simpler
- faster
- more memory-efficient
- standard practice when working with pandas
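
To check that the two methods build the same table, here is a minimal sketch that uses a short hand-written list of `(token, pos_tag)` pairs in place of a real spaCy Doc. The `sample_pairs` data and the names `slow_df` and `fast_df` are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for (token.text, token.pos_) pairs from a spaCy Doc
sample_pairs = [("emma", "PROPN"), ("liked", "VERB"), ("the", "DET"), ("house", "NOUN")]

# Method 1: concatenate one single-row DataFrame per token
# (every pd.concat call copies all rows collected so far)
slow_df = pd.DataFrame(columns=["token", "pos_tag"])
for text, tag in sample_pairs:
    row = pd.DataFrame.from_records([{"token": text, "pos_tag": tag}])
    slow_df = pd.concat([slow_df, row], ignore_index=True)

# Method 2: collect plain tuples first, then build the DataFrame once
fast_df = pd.DataFrame(sample_pairs, columns=["token", "pos_tag"])

# Both tables hold the same rows in the same order
print(slow_df.values.tolist() == fast_df.values.tolist())
```

The result is identical, so the only difference is how much copying happens along the way.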


In [None]:
# Method 3: List comprehension (same idea as Method 2, written in one line)
records = [(t.text, t.pos_) for t in spacy_doc]
pos_df = pd.DataFrame(records, columns=["token", "pos_tag"])
print(pos_df)

## 5️⃣ Counting Word Occurrences by POS Tag

Next, we want to:
- group similar words together
- count how many times they appear
- produce a clean summary table

In [None]:
# Method 1: groupby() + size()
pos_df_counts = pos_df.groupby(['token', 'pos_tag']).size().reset_index(name='counts').sort_values(by='counts', ascending=False)

In [None]:
pos_df_counts

### Conceptual Explanation (Bucket Model)

- `groupby()` puts identical `(token, POS)` pairs into the same bucket
- `.size()` counts how many items are inside each bucket
- `reset_index()` converts bucket labels into normal columns
- `sort_values()` shows the most frequent pairs first
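
The four steps above can be followed one at a time on a tiny table. This is a sketch on made-up data: `toy_df` and the intermediate names are assumptions for illustration, not part of the Emma example.

```python
import pandas as pd

# Hypothetical toy data: "her" appears twice as PRON, "home" once as NOUN
toy_df = pd.DataFrame(
    [("her", "PRON"), ("home", "NOUN"), ("her", "PRON")],
    columns=["token", "pos_tag"],
)

# Steps 1 and 2: bucket identical (token, pos_tag) pairs and count each bucket
bucket_sizes = toy_df.groupby(["token", "pos_tag"]).size()

# Step 3: turn the bucket labels (a MultiIndex) back into ordinary columns
counts_df = bucket_sizes.reset_index(name="counts")

# Step 4: most frequent pairs first
counts_df = counts_df.sort_values(by="counts", ascending=False)

print(counts_df)
```

Printing `bucket_sizes` before the `reset_index()` step is a good way to see what the "bucket labels" look like as an index.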

In [None]:
# Method 2: Using value_counts() (simpler)
pos_df_counts_2 = (
    pos_df
    .value_counts(['token', 'pos_tag'])
    .reset_index(name='counts')
    .sort_values(by='counts', ascending=False)
)

In [None]:
pos_df_counts_2.head(10)
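
To confirm that both counting methods agree, here is a small sketch on made-up data (`toy_df` and the `as_mapping` helper are hypothetical names for this illustration). It also shows that `value_counts()` returns counts in descending order by default, so the extra `sort_values()` call in Method 2 is harmless but not strictly needed.

```python
import pandas as pd

# Hypothetical toy data with three distinct (token, pos_tag) pairs
toy_df = pd.DataFrame(
    [
        ("her", "PRON"), ("home", "NOUN"), ("her", "PRON"),
        ("her", "PRON"), ("home", "NOUN"), ("rich", "ADJ"),
    ],
    columns=["token", "pos_tag"],
)

via_groupby = toy_df.groupby(["token", "pos_tag"]).size().reset_index(name="counts")
via_value_counts = toy_df.value_counts(["token", "pos_tag"]).reset_index(name="counts")


def as_mapping(df):
    """Convert a counts table into {(token, pos_tag): count} so row order is ignored."""
    return {(row.token, row.pos_tag): int(row.counts) for row in df.itertuples(index=False)}


# Same counts from both methods
print(as_mapping(via_groupby) == as_mapping(via_value_counts))

# value_counts() output is already sorted from most to least frequent
print(via_value_counts["counts"].tolist())
```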

### Counting how many words belong to each POS tag

At this stage, `pos_df_counts` contains:
- one row per unique `(token, pos_tag)` pair
- a `counts` column showing how often each pair appears

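To understand the **distribution of parts of speech**, we now group the data **only by POS tag** and count how many distinct tokens fall under each tag. Because `pos_df_counts` has one row per unique pair, this measures vocabulary variety per tag, not the total number of occurrences.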

```python
pos_df_poscounts = (
    pos_df_counts
    .groupby(['pos_tag'])['token']
    .count()
    .sort_values(ascending=False)
)
```


In [None]:
pos_df_poscounts = pos_df_counts.groupby(['pos_tag'])['token'].count().sort_values(ascending=False)

In [None]:
pos_df_poscounts.head(10)
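
It is easy to mix up "distinct tokens per tag" with "total occurrences per tag". The sketch below contrasts the two on made-up data. `toy_df`, `toy_counts`, and the result names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: PRON has two distinct tokens but three occurrences
toy_df = pd.DataFrame(
    [("her", "PRON"), ("she", "PRON"), ("her", "PRON"), ("home", "NOUN")],
    columns=["token", "pos_tag"],
)

# One row per unique (token, pos_tag) pair, like pos_df_counts
toy_counts = toy_df.value_counts(["token", "pos_tag"]).reset_index(name="counts")

# Distinct tokens per tag: count the rows of the summary table
unique_per_tag = toy_counts.groupby("pos_tag")["token"].count()

# Total occurrences per tag: sum the counts column instead
total_per_tag = toy_counts.groupby("pos_tag")["counts"].sum()

print(unique_per_tag["PRON"], total_per_tag["PRON"])
```

Which one to use depends on the question: variety of vocabulary per tag, or how much of the text each tag accounts for.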

### 🔍 Filtering the most common nouns

To focus specifically on **nouns**, we filter the DataFrame to keep only rows where the POS tag is `NOUN`.

```python
nouns = pos_df_counts[pos_df_counts['pos_tag'] == 'NOUN']
nouns.head(10)
```


In [None]:
nouns = pos_df_counts[pos_df_counts['pos_tag'] == 'NOUN'][:10]
nouns


### 🔍 Filtering the most common verbs

To focus specifically on **verbs**, we filter the DataFrame to keep only rows where the POS tag is `VERB`.

```python
verbs = pos_df_counts[pos_df_counts['pos_tag'] == 'VERB']
verbs.head(10)
```


In [None]:
verbs = pos_df_counts[pos_df_counts['pos_tag'] == 'VERB'][:10]
verbs
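
Since the noun and verb filters differ only in the tag, the pattern can be wrapped in a small function. This is a sketch: `top_tokens` is a hypothetical helper I made up, and `toy_counts` is invented data shaped like `pos_df_counts`.

```python
import pandas as pd


def top_tokens(counts_df, pos_tag, n=10):
    """Return the n most frequent rows for one POS tag (hypothetical helper)."""
    subset = counts_df[counts_df["pos_tag"] == pos_tag]
    # Sort explicitly so the helper does not rely on the input order
    return subset.sort_values(by="counts", ascending=False).head(n)


# Made-up counts table with the same columns as pos_df_counts
toy_counts = pd.DataFrame({
    "token": ["her", "governess", "seemed", "mother", "lived"],
    "pos_tag": ["PRON", "NOUN", "VERB", "NOUN", "VERB"],
    "counts": [9, 3, 2, 2, 1],
})

print(top_tokens(toy_counts, "NOUN", n=2))
print(top_tokens(toy_counts, "VERB", n=1))
```

With the real data, the call would look like `top_tokens(pos_df_counts, "NOUN")`.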

## ✅ Final Takeaways

- spaCy Docs are already structured, so take advantage of the token-level information they provide
- Collect data first, then create DataFrames in a single step
- Avoid building DataFrames row by row inside loops
- Use `value_counts()` when you only need frequency counts
- Understanding *why* things work is more important than copying syntax

