# Part-of-Speech Tagging
---

This notebook introduces part-of-speech (POS) tagging: what it is, how to run it with NLTK and spaCy, and how Hidden Markov Models and the Viterbi algorithm can be used to tag a sentence.


---

## 📑 Contents

1. Definition, common tags, and trade-offs
2. Tagging with NLTK
3. Tagging with spaCy
4. Hidden Markov Model (HMM)
5. Viterbi Algorithm
6. Summary

## Definition:

POS (Part-of-Speech) Tagging is the process of assigning a part of speech (noun, verb, adjective, etc.) to each word in a sentence based on its meaning and context.

Words can take on different roles in different sentences. POS tagging uses linguistic rules or machine learning models to determine the correct part of speech for each word.

| Tag | Meaning   | Example         |
| --- | --------- | --------------- |
| NN  | Noun      | `dog`, `car`    |
| VB  | Verb      | `run`, `eat`    |
| JJ  | Adjective | `happy`, `blue` |
| RB  | Adverb    | `quickly`       |
| PRP | Pronoun   | `he`, `they`    |


| Pros                                       | Description                                              |
| ------------------------------------------ | -------------------------------------------------------- |
| ✅ **Helps grammatical analysis**           | Important for parsing, translation, summarization.       |
| ✅ **Improves downstream NLP tasks**        | Like named entity recognition, sentiment analysis.       |
| ✅ **Provides context-aware understanding** | Especially useful in disambiguating word meanings.       |
| ✅ **Widely available tools**               | NLTK, spaCy, Stanford NLP, etc. offer ready POS taggers. |


| Cons                                  | Description                                                                      |
| ------------------------------------- | -------------------------------------------------------------------------------- |
| ❌ **Context sensitivity limitations** | Some models can misclassify without enough context.                              |
| ❌ **Ambiguity**                       | A word can have multiple tags; hard to resolve perfectly.                        |
| ❌ **Language and domain dependency**  | May need retraining on informal or domain-specific text (e.g., tweets, medical). |
| ❌ **Errors propagate**                | Mistakes in tagging can affect downstream tasks.                                 |


| Application                          | Role of POS Tagging                       |
| ------------------------------------ | ----------------------------------------- |
|  **Parsing**                        | Helps build syntax trees.                 |
|  **Named Entity Recognition (NER)** | Identifies proper nouns.                  |
|  **Question answering**             | Understands sentence structure.           |
|  **Machine translation**            | Maintains grammatical correctness.        |
|  **Speech recognition**             | Improves understanding of sentence roles. |


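```python
from nltk import word_tokenize, pos_tag

# Requires NLTK's tokenizer and tagger data, installed once via nltk.download();
# the exact resource names depend on the NLTK version.
text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)   # split the sentence into word tokens
tags = pos_tag(tokens)         # assign a Penn Treebank tag to each token

print(tags)
```

```text
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
```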
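The fine-grained tags in this output can be grouped into the broader classes listed in the tag table earlier. The sketch below does that in plain Python on a copy of the output, so it runs without NLTK. The prefix mapping `COARSE_BY_PREFIX` and both helper functions are hypothetical names made up for this illustration, and the mapping is deliberately incomplete.

```python
# Tagged tokens copied from the NLTK output above.
tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
        ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
        ('dog', 'NN'), ('.', '.')]

# Hypothetical, incomplete mapping from Penn Treebank tag prefixes to
# coarse word classes (an assumption for illustration only).
COARSE_BY_PREFIX = {
    'NN': 'noun',
    'VB': 'verb',
    'JJ': 'adjective',
    'RB': 'adverb',
    'PRP': 'pronoun',
    'DT': 'determiner',
    'IN': 'preposition/conjunction',
}


def coarse_class(tag):
    """Return the class of the first prefix that matches the tag, else 'other'."""
    for prefix, name in COARSE_BY_PREFIX.items():
        if tag.startswith(prefix):
            return name
    return 'other'


def group_by_class(tagged_tokens):
    """Collect the words of each coarse class, keeping sentence order."""
    groups = {}
    for word, tag in tagged_tokens:
        groups.setdefault(coarse_class(tag), []).append(word)
    return groups


groups = group_by_class(tags)
for name, words in groups.items():
    print(f'{name:<25} {words}')
```

Note that "brown" lands in the noun group because the tagger labelled it `NN` in this sentence, a small reminder that tagger output can contain errors.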


```python
import spacy

# Assumes the small English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
```

```python
doc = nlp('I will google about facebook')

doc.text
```

```text
'I will google about facebook'
```

```python
doc[2]
```

```text
google
```

```python
doc[2].pos_     # Coarse-grained part of speech
```

```text
'VERB'
```

```python
doc[2].tag_     # Fine-grained part of speech
```

```text
'VB'
```

```python
spacy.explain('VB')
```

```text
'verb, base form'
```

```python
# Print the coarse tag, the fine tag, and a description of the fine tag for each token
for word in doc:
    print(word.text, '--------->', word.pos_, word.tag_, spacy.explain(word.tag_))
```

```text
I ---------> PRON PRP pronoun, personal
will ---------> AUX MD verb, modal auxiliary
google ---------> VERB VB verb, base form
about ---------> ADP IN conjunction, subordinating or preposition
facebook ---------> PROPN NNP noun, proper singular
```


```python
# The word "left" used in two different senses

doc2 = nlp('I left the room')
for word in doc2:
    print(word.text, '--------->', word.pos_, word.tag_, spacy.explain(word.tag_))

print('\n')

doc3 = nlp('to the left of the room')
for word in doc3:
    print(word.text, '--------->', word.pos_, word.tag_, spacy.explain(word.tag_))
```

```text
I ---------> PRON PRP pronoun, personal
left ---------> VERB VBD verb, past tense
the ---------> DET DT determiner
room ---------> NOUN NN noun, singular or mass


to ---------> ADP IN conjunction, subordinating or preposition
the ---------> DET DT determiner
left ---------> NOUN NN noun, singular or mass
of ---------> ADP IN conjunction, subordinating or preposition
the ---------> DET DT determiner
room ---------> NOUN NN noun, singular or mass
```
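The two outputs above give "left" a different tag in each sentence. To see why context is needed for that, it may help to compare with a tagger that ignores context entirely. The sketch below is a most-frequent-tag baseline in plain Python. The training pairs are invented for this illustration, and `most_frequent_tag` is a hypothetical helper, not part of spaCy or NLTK.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (word, tag) pairs invented for this sketch.
training = [
    ('i', 'PRP'), ('left', 'VBD'), ('the', 'DT'), ('room', 'NN'),
    ('she', 'PRP'), ('left', 'VBD'), ('the', 'DT'), ('house', 'NN'),
    ('to', 'IN'), ('the', 'DT'), ('left', 'NN'), ('of', 'IN'),
    ('the', 'DT'), ('room', 'NN'),
]

# Count how often each word appears with each tag.
tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1


def most_frequent_tag(word, default='NN'):
    """Return the tag seen most often with this word; unknown words get `default`."""
    counts = tag_counts.get(word.lower())
    if not counts:
        return default
    return counts.most_common(1)[0][0]


for sentence in ('I left the room', 'to the left of the room'):
    print([(w, most_frequent_tag(w)) for w in sentence.split()])
```

Because the baseline looks at one word at a time, it assigns "left" the same tag in both sentences, so at least one of them is tagged incorrectly. Sequence models such as the HMM described next address this by also scoring how likely one tag is to follow another.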


### Hidden Markov Model (HMM)

A Hidden Markov Model (HMM) is a statistical model that represents systems with hidden (unobserved) states. In NLP, it’s often used to model sequences of words and their parts of speech (POS).

Words (like “run”, “cat”, “quickly”) are the observable outputs.

POS tags (like noun, verb, adverb) are the hidden states we want to infer.

An HMM makes two assumptions:

- Each word is generated by its own POS tag and depends on nothing else.
- The current tag depends only on the previous tag (this is the Markov property).

It learns two sets of probabilities:

- Transition probabilities: the probability of tagₜ given tagₜ₋₁
  👉 P(tagₜ | tagₜ₋₁)
- Emission probabilities: the probability of wordₜ given tagₜ
  👉 P(wordₜ | tagₜ)
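One common way to obtain these two tables is to count occurrences in tagged text and normalize the counts. The sketch below does this for a tiny corpus that is invented for illustration. The `<s>` start marker and the `normalize` helper are assumptions of this example, and no smoothing is applied, so unseen pairs simply get no entry.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of tagged sentences (invented for this sketch).
corpus = [
    [('time', 'NN'), ('flies', 'VB')],
    [('fruit', 'NN'), ('flies', 'NN'), ('like', 'VB'), ('fruit', 'NN')],
    [('birds', 'NN'), ('like', 'VB'), ('time', 'NN')],
]

START = '<s>'  # assumed start-of-sentence marker

transition_counts = defaultdict(Counter)  # previous tag -> counts of next tag
emission_counts = defaultdict(Counter)    # tag -> counts of emitted word

for sentence in corpus:
    prev = START
    for word, tag in sentence:
        transition_counts[prev][tag] += 1
        emission_counts[tag][word] += 1
        prev = tag


def normalize(counts):
    """Turn nested counts into conditional probabilities (no smoothing)."""
    return {
        given: {key: c / sum(counter.values()) for key, c in counter.items()}
        for given, counter in counts.items()
    }


transition = normalize(transition_counts)  # P(tag_t | tag_{t-1})
emission = normalize(emission_counts)      # P(word_t | tag_t)

print('P(next tag | NN):', transition['NN'])
print('P(word | VB):    ', emission['VB'])
```

In this toy corpus a noun is followed by a verb three times out of four, so `transition['NN']['VB']` is 0.75. Real taggers estimate the same quantities from much larger annotated corpora and usually smooth them so that unseen words do not get zero probability.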

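Example sentence: "Time flies like an arrow"

We want to find the best POS tag sequence:

- “Time” could be a noun or a verb
- “flies” could be a noun or a verb
- and so on for the remaining words.

Conceptually, the HMM assigns a probability to every possible tag sequence, and the tagger chooses the most likely one. Scoring every sequence one by one quickly becomes impractical, because the number of sequences grows exponentially with the length of the sentence.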
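To make the idea of scoring every tag sequence concrete, the sketch below enumerates all sequences for the example sentence and multiplies transition and emission probabilities along each one. It uses four simplified tags, and every number in `START`, `TRANSITION`, and `EMISSION` is invented for illustration rather than learned from data. Words missing from a tag's emission table are assumed to have probability 0.

```python
from itertools import product

# Hypothetical HMM parameters with simplified tags; all numbers are invented.
TAGS = ['NN', 'VB', 'IN', 'DT']
START = {'NN': 0.6, 'VB': 0.1, 'IN': 0.1, 'DT': 0.2}   # P(first tag)
TRANSITION = {                                         # P(tag_t | tag_{t-1})
    'NN': {'NN': 0.3, 'VB': 0.5, 'IN': 0.15, 'DT': 0.05},
    'VB': {'NN': 0.3, 'VB': 0.05, 'IN': 0.35, 'DT': 0.3},
    'IN': {'NN': 0.3, 'VB': 0.05, 'IN': 0.05, 'DT': 0.6},
    'DT': {'NN': 0.9, 'VB': 0.03, 'IN': 0.02, 'DT': 0.05},
}
EMISSION = {                                           # P(word_t | tag_t)
    'NN': {'time': 0.4, 'flies': 0.2, 'arrow': 0.4},
    'VB': {'time': 0.1, 'flies': 0.5, 'like': 0.4},
    'IN': {'like': 1.0},
    'DT': {'an': 1.0},
}


def sequence_probability(words, tags):
    """Joint probability of the words and one candidate tag sequence."""
    prob = 1.0
    prev = None
    for word, tag in zip(words, tags):
        step = START[tag] if prev is None else TRANSITION[prev][tag]
        prob *= step * EMISSION[tag].get(word, 0.0)
        prev = tag
    return prob


words = 'time flies like an arrow'.split()
candidates = list(product(TAGS, repeat=len(words)))  # every possible tag sequence
best = max(candidates, key=lambda tags: sequence_probability(words, tags))

print(len(candidates), 'candidate sequences')
print(best, sequence_probability(words, best))
```

With 4 tags and 5 words there are already 4⁵ = 1024 candidates, and with these made-up numbers the winner is `('NN', 'VB', 'IN', 'DT', 'NN')`. The count multiplies by the number of tags for every extra word, which is why exhaustive enumeration is only useful as a teaching device.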

### Viterbi Algorithm

The Viterbi algorithm is a dynamic programming technique that finds the most likely sequence of hidden states (POS tags) in an HMM without enumerating every possible sequence.

Input:

- A sentence such as "Time flies like an arrow" (the observed words)
- HMM parameters (transition and emission probabilities)

Output:

- The best POS tag sequence, e.g., [NN, VB, IN, DT, NN]

Backtrace: for each word and each candidate tag, the algorithm records which previous tag gave the maximum probability. Following these records back from the last word recovers the best tag sequence.
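The procedure can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production tagger: the four simplified tags and every probability below are invented, words missing from a tag's emission table are assumed to have probability 0, and the function and variable names are this example's own.

```python
# Hypothetical HMM parameters with simplified tags; all numbers are invented.
TAGS = ['NN', 'VB', 'IN', 'DT']
START = {'NN': 0.6, 'VB': 0.1, 'IN': 0.1, 'DT': 0.2}
TRANSITION = {
    'NN': {'NN': 0.3, 'VB': 0.5, 'IN': 0.15, 'DT': 0.05},
    'VB': {'NN': 0.3, 'VB': 0.05, 'IN': 0.35, 'DT': 0.3},
    'IN': {'NN': 0.3, 'VB': 0.05, 'IN': 0.05, 'DT': 0.6},
    'DT': {'NN': 0.9, 'VB': 0.03, 'IN': 0.02, 'DT': 0.05},
}
EMISSION = {
    'NN': {'time': 0.4, 'flies': 0.2, 'arrow': 0.4},
    'VB': {'time': 0.1, 'flies': 0.5, 'like': 0.4},
    'IN': {'like': 1.0},
    'DT': {'an': 1.0},
}


def viterbi(words, tags, start, transition, emission):
    """Return (best tag sequence, its probability) for a non-empty word list."""
    # best[i][t]: probability of the best tag sequence for words[:i + 1] ending in t
    best = [{t: start[t] * emission[t].get(words[0], 0.0) for t in tags}]
    # backpointer[i][t]: the previous tag on that best sequence
    backpointer = [{}]

    for i in range(1, len(words)):
        best.append({})
        backpointer.append({})
        for t in tags:
            # Pick the previous tag that maximizes the probability of reaching t.
            prev = max(tags, key=lambda p: best[i - 1][p] * transition[p][t])
            best[i][t] = (best[i - 1][prev] * transition[prev][t]
                          * emission[t].get(words[i], 0.0))
            backpointer[i][t] = prev

    # Backtrace: start from the best final tag and follow the pointers back.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(backpointer[i][path[-1]])
    path.reverse()
    return path, best[-1][last]


words = 'time flies like an arrow'.split()
path, prob = viterbi(words, TAGS, START, TRANSITION, EMISSION)
print(path, prob)
```

With these made-up parameters the result is `['NN', 'VB', 'IN', 'DT', 'NN']`. The work grows with the number of words times the square of the number of tags, instead of exponentially. For long sentences, implementations typically add log-probabilities rather than multiply raw probabilities, to avoid numerical underflow.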

#### Summary
| Concept     | Purpose                                         |
| ----------- | ----------------------------------------------- |
| **HMM**     | Models sequence of hidden POS tags behind words |
| **Viterbi** | Efficiently finds the best tag sequence in HMM  |
