<a href="https://colab.research.google.com/github/vvrgit/NLP-LAB/blob/main/PoS_Tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Part-of-speech (POS) tagging** is the process of assigning a grammatical category (such as noun, verb, or adjective) to each word (token) in a sentence based on its definition and context.

**It answers the question:** What role does each word play in this sentence?

| Category          | Description     | Example     |
| ----------------- | --------------- | ----------- |
| Noun (NN)         | Name of entity  | student, AI |
| Verb (VB)         | Action or state | run, learn  |
| Adjective (JJ)    | Describes noun  | smart, new  |
| Adverb (RB)       | Describes verb  | quickly     |
| Pronoun (PRP)     | Replaces noun   | he, they    |
| Preposition (IN)  | Shows relation  | in, on      |
| Determiner (DT)   | Limits noun     | the, a      |
| Conjunction (CC)  | Connects words  | and, but    |
| Interjection (UH) | Emotion         | wow, omg    |


POS tagging is useful because it:

*   Helps reveal sentence structure
*   Is essential for Named Entity Recognition (NER)
*   Is used in information extraction, sentiment analysis, and machine translation





**POS Tagging Example**

Sentence: "Students are learning Natural Language Processing."

| Word       | POS |
| ---------- | --- |
| Students   | NNS |
| are        | VBP |
| learning   | VBG |
| Natural    | JJ  |
| Language   | NN  |
| Processing | NN  |


NLTK's default English tagger (`nltk.pos_tag`) uses the Penn Treebank tag set.

In [None]:
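import nltk

# Download the tokenizer and tagger resources (skipped if already present).
# Note: recent NLTK releases may also need nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

sentence = "Students are learning Natural Language Processing"
tokens = nltk.word_tokenize(sentence)  # split the sentence into word tokens
pos_tags = nltk.pos_tag(tokens)        # list of (word, Penn Treebank tag) pairs

print(pos_tags)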

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


[('Students', 'NNS'), ('are', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP')]
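
The hand-tagged example table and the NLTK output disagree on the last three tokens: the table labels them JJ/NN/NN, while the tagger returns NNP, possibly because the words are capitalized mid-sentence. A minimal sketch of how such disagreements could be listed, using the two tag sequences copied from above (the variable names are illustrative, and treating the table as the reference is an assumption):

```python
# Tags from the hand-annotated example table (treated here as the reference).
expected = [("Students", "NNS"), ("are", "VBP"), ("learning", "VBG"),
            ("Natural", "JJ"), ("Language", "NN"), ("Processing", "NN")]

# Tags copied from the nltk.pos_tag output shown above.
predicted = [("Students", "NNS"), ("are", "VBP"), ("learning", "VBG"),
             ("Natural", "NNP"), ("Language", "NNP"), ("Processing", "NNP")]

# Collect (word, expected_tag, predicted_tag) for every token where the two differ.
mismatches = [(word, exp, pred)
              for (word, exp), (_, pred) in zip(expected, predicted)
              if exp != pred]

# Share of tokens on which the table and the tagger agree.
agreement = 1 - len(mismatches) / len(expected)

print(f"Agreement: {agreement:.0%}")
for word, exp, pred in mismatches:
    print(f"{word}: table says {exp}, tagger says {pred}")
```

Here the two sources agree on three of the six tokens. The same comparison can be run on any pair of equal-length tag sequences.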


**en_core_web_sm** is a spaCy English language model.

The model **en_core_web_sm** includes:

*   Part-of-Speech (POS) tagging
*   Lemmatization
*   Named Entity Recognition (NER)
*   Tokenization

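It does NOT include word vectors (embeddings). For vectors, you need **en_core_web_md** or **en_core_web_lg**.

spaCy exposes two tag sets: `token.pos_` gives the coarse **Universal POS tags** (a smaller, language-independent set), and `token.tag_` gives the fine-grained Penn Treebank tags.

In the social media examples later in this notebook, spaCy's tags fit tweets, captions, and other informal text better than NLTK's.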

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("Students are learning Natural Language Processing")
for token in doc:
    print(token.text, token.pos_)

Students NOUN
are AUX
learning VERB
Natural PROPN
Language PROPN
Processing NOUN


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a startup in India.")

for token in doc:
    print(token.text, token.pos_, token.tag_)

Apple PROPN NNP
is AUX VBZ
looking VERB VBG
at ADP IN
buying VERB VBG
a DET DT
startup NOUN NN
in ADP IN
India PROPN NNP
. PUNCT .
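
The output above pairs each coarse Universal POS tag (`token.pos_`) with a fine-grained Penn Treebank tag (`token.tag_`). The relationship between the two can be approximated with a lookup table. The sketch below is a simplified, hand-written mapping that covers only the tags appearing in this notebook; it is not spaCy's internal mapping, and the names `PENN_TO_UPOS` and `to_upos` are hypothetical.

```python
# Simplified, hand-written Penn Treebank -> Universal POS lookup (an assumption
# for illustration; it covers only the tags that appear in this notebook).
PENN_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON",
    "IN": "ADP", "DT": "DET", "CC": "CCONJ",
    "UH": "INTJ", ".": "PUNCT",
}

def to_upos(penn_tag):
    """Return the coarse tag for a Penn tag, or 'X' when the tag is not in the table."""
    return PENN_TO_UPOS.get(penn_tag, "X")

# (text, pos_, tag_) triples copied from the spaCy output above.
spacy_output = [
    ("Apple", "PROPN", "NNP"), ("is", "AUX", "VBZ"), ("looking", "VERB", "VBG"),
    ("at", "ADP", "IN"), ("buying", "VERB", "VBG"), ("a", "DET", "DT"),
    ("startup", "NOUN", "NN"), ("in", "ADP", "IN"), ("India", "PROPN", "NNP"),
    (".", "PUNCT", "."),
]

# Print each token and flag rows where the lookup disagrees with spaCy.
for text, pos, tag in spacy_output:
    guess = to_upos(tag)
    note = "" if guess == pos else f"  <- lookup gives {guess}"
    print(f"{text:8} {tag:4} {pos:6}{note}")
```

Only `is` is flagged: its fine-grained tag is VBZ, but spaCy labels it AUX because it acts as an auxiliary in this sentence, which a tag-only lookup cannot see.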


**POS Tagging for Social Media Text**

| Token | NLTK Tag | spaCy Tag |
| ----- | -------- | --------- |
| OMG   | NNP      | INTJ      |
| lol   | NN       | INTJ      |
| 😂    | NN       | SYM       |
| #AI   | NN       | PROPN     |


| Token | Meaning       | NLTK Tag | spaCy Tag |
| ----- | ------------- | -------- | --------- |
| lol   | laughing      | NN       | INTJ      |
| omg   | surprise      | NNP      | INTJ      |
| rn    | right now     | NN       | ADV       |
| idk   | I don't know  | NN       | VERB      |
| lit   | excellent     | JJ       | ADJ       |
| 😂    | emotion       | NN       | SYM       |
| #AI   | hashtag/topic | NN       | PROPN     |
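
A possible preprocessing step for text like this is to expand slang before tagging, so the tagger sees ordinary words. The sketch below is a minimal illustration, assuming a tiny hand-built lexicon taken from the Meaning column above; `SLANG` and `normalize` are hypothetical names, and a real pipeline would need a much larger lexicon and a proper tokenizer instead of whitespace splitting.

```python
# Tiny hand-built lexicon taken from the "Meaning" column above (illustrative only).
SLANG = {
    "rn": "right now",
    "idk": "I don't know",
}

def normalize(text):
    """Replace known slang tokens with their expansions, using whitespace tokenization."""
    words = []
    for word in text.split():
        # Look up the lowercase form so "IDK" and "idk" are treated alike.
        words.append(SLANG.get(word.lower(), word))
    return " ".join(words)

print(normalize("idk what to watch rn"))
# prints: I don't know what to watch right now
```

The normalized string can then be passed to `nltk.word_tokenize` or `nlp(...)` as in the other cells. Ambiguous entries such as `lit` are left out of the lexicon because a blind replacement would also rewrite the ordinary verb "lit".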


In [None]:
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

text = "Loving the new AI features 😍 #AI #MachineLearning"
doc = nlp(text)

nouns = []
verbs = []

for token in doc:
    if token.pos_ in ["NOUN", "PROPN"]:
        nouns.append(token.text)
    elif token.pos_ == "VERB":
        verbs.append(token.text)

noun_freq = Counter(nouns)
verb_freq = Counter(verbs)

print("Noun Frequency:", noun_freq)
print("Verb Frequency:", verb_freq)


Noun Frequency: Counter({'AI': 2, '😍': 1, 'MachineLearning': 1})
Verb Frequency: Counter({'Loving': 1, 'features': 1})
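
The noun counts above include the emoji. If only word-like tokens are wanted, a simple filter on the token text can be applied before counting. A minimal sketch, using the noun list implied by the output above rather than re-running spaCy (the `word_nouns` name is illustrative):

```python
from collections import Counter

# Noun tokens as reported in the output above (hard-coded so the sketch runs without spaCy).
nouns = ["AI", "😍", "AI", "MachineLearning"]

# str.isalpha() is False for emoji and punctuation, so those tokens are dropped.
# Caveat: it is also False for words that contain combining marks (common in
# Devanagari, for example), so this filter suits mostly-English text.
word_nouns = [token for token in nouns if token.isalpha()]

print("Filtered Noun Frequency:", Counter(word_nouns))
```

When iterating over a spaCy `Doc` directly, the `token.is_alpha` attribute offers a comparable check.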


In [13]:
import nltk
from collections import Counter

# Assuming 'tokens' is already available from previous steps and contains all words
# If not, it would need to be created from 'texts' like this:
# all_tokens = []
# for text_item in texts:
#     all_tokens.extend(nltk.word_tokenize(text_item))
# tokens = all_tokens

pos_tags_nltk = nltk.pos_tag(tokens)

nouns_nltk = []
verbs_nltk = []

for word, tag in pos_tags_nltk:
    if tag.startswith('NN'): # NLTK tags for nouns (NN, NNS, NNP, NNPS)
        nouns_nltk.append(word)
    elif tag.startswith('VB'): # NLTK tags for verbs (VB, VBD, VBG, VBN, VBP, VBZ)
        verbs_nltk.append(word)

noun_freq_nltk = Counter(nouns_nltk)
verb_freq_nltk = Counter(verbs_nltk)

print("NLTK Noun Frequency:", noun_freq_nltk)
print("NLTK Verb Frequency:", verb_freq_nltk)

NLTK Noun Frequency: Counter({'t': 13, '>': 11, '’': 10, 'time': 9, 'है': 9, 'job': 8, 'people': 8, 'AI': 8, 'manager': 7, 'क्या': 7, 'la': 7, 'para': 7, 'في': 7, 'LinkedIn': 6, 'someone': 6, 'parents': 6, 'में': 6, 's': 5, 'hai': 5, 'growth': 5, 'العمل': 5, 'के': 5, 'और': 5, 'worth': 4, 'end': 4, 'rejection': 4, 'way': 4, 'things': 4, 'company': 4, 'companies': 4, 'day': 4, 'work': 4, 'LPA': 4, 'strategy': 4, 'balance': 4, 'है।': 4, 'que': 4, 'des': 4, 'এবং': 4, 'A': 4, 'é': 4, 'rahe': 4, 'hain': 4, '%': 3, 'breathe': 3, 'jobs': 3, 'ke': 3, 'something': 3, 'brand': 3, 'son': 3, '–': 3, 'success': 3, 'Lots': 3, "'well": 3, 'issue': 3, 'ways': 3, 'clients': 3, 'quality': 3, 'problems': 3, 'से': 3, 'की': 3, 'हैं': 3, 'सकता': 3, 'La': 3, 'Desde': 3, 'hasta': 3, 'está': 3, 'dans': 3, 'بيئة': 3, 'من': 3, 'Como': 3, 'в': 3, 'и': 3, 'Hinglish': 3, 'को': 3, 'कर': 3, 'कार्यस्थल': 3, 'भ