# **Part-of-Speech Tagging**


**POS (Part-of-Speech) tagging** is the process of labeling each word in a text with its corresponding part of speech, such as **noun, verb, or adjective**.  
This is a fundamental task in Natural Language Processing (NLP) that helps computers understand sentence structure and meaning by identifying the grammatical role of each word in a sentence.  
For example, it distinguishes between the word "book" used as a verb in "I will book a flight" and as a noun in "I am reading a book".

It is a **token-level classification** task: the model assigns one label to every token in the input.

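Token-level classification produces one label per input token, unlike sentence-level classification, which produces a single label for the whole input. A minimal sketch of that output shape, reusing the "book" example; the tags are hand-assigned for illustration and are not model output:

```python
# Hand-labeled illustration (not model output): one UD tag per token.
tokens = ["I", "will", "book", "a", "flight"]
tags = ["PRON", "AUX", "VERB", "DET", "NOUN"]

# A token-level task yields exactly one label for each token.
assert len(tokens) == len(tags)

# Print the tokens and their tags side by side.
for token, tag in zip(tokens, tags):
    print(f"{token:<8}{tag}")
```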
Approach followed:
* Use a pretrained BERT model fine-tuned for POS tagging: `vblagoje/bert-english-uncased-finetuned-pos`
* Run inference on a few sample sentences to show the model's predictions.

| Tag   | Meaning                                                  | Examples                     |
|--------|----------------------------------------------------------|-------------------------------|
| ADP    | Adposition (prepositions or postpositions)               | in, on, by                    |
| ADJ    | Adjective                                                | significant, global           |
| ADV    | Adverb                                                   | quickly, often                |
| AUX    | Auxiliary verb                                           | is, was                       |
| CCONJ  | Coordinating conjunction                                 | and, but                      |
| DET    | Determiner                                               | the, a                        |
| INTJ   | Interjection                                             | oh, wow                       |
| NOUN   | Noun                                                     | man, city                     |
| NUM    | Numeral                                                  | one, 2022                     |
| PART   | Particle                                                 | 's, to                        |
| PRON   | Pronoun                                                  | he, which                     |
| PROPN  | Proper noun                                              | Neil Armstrong, Paris         |
| PUNCT  | Punctuation mark                                         | ,, .                          |
| SCONJ  | Subordinating conjunction                                | because, although             |
| SYM    | Symbol                                                   | $, %                          |
| VERB   | Verb                                                     | run, eat                      |
| X      | Other (generally words that do not fit into other categories) | [not defined]             |

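When reading model output, it can help to have the table above available as a lookup. The sketch below is one possible way to do that; the names `UPOS_MEANINGS` and `describe_tag` are hypothetical helpers introduced here for illustration, not part of any library.

```python
# UD tags from the table above, mapped to their meanings.
# UPOS_MEANINGS and describe_tag are illustrative names, not library APIs.
UPOS_MEANINGS = {
    "ADP": "Adposition",
    "ADJ": "Adjective",
    "ADV": "Adverb",
    "AUX": "Auxiliary verb",
    "CCONJ": "Coordinating conjunction",
    "DET": "Determiner",
    "INTJ": "Interjection",
    "NOUN": "Noun",
    "NUM": "Numeral",
    "PART": "Particle",
    "PRON": "Pronoun",
    "PROPN": "Proper noun",
    "PUNCT": "Punctuation mark",
    "SCONJ": "Subordinating conjunction",
    "SYM": "Symbol",
    "VERB": "Verb",
    "X": "Other",
}


def describe_tag(tag: str) -> str:
    """Return a readable meaning for a UD tag, or a fallback for unknown tags."""
    return UPOS_MEANINGS.get(tag.upper(), f"Unknown tag: {tag}")


print(describe_tag("PROPN"))
```

The lookup is case-insensitive on the tag, so `describe_tag("propn")` and `describe_tag("PROPN")` behave the same.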

In [None]:
# loading the necessary libraries

import pandas as pd
from transformers import pipeline
from dotenv import load_dotenv
import os

In [None]:
# sample sentences

sentences = [
    "I book a flight.",
    "She read a book.",
    "The quick brown fox jumps over the lazy dog.",
    "They are quickly running to the store.",
]

In [None]:
# Load a BERT model fine-tuned for part-of-speech (POS) tagging

MODEL_NAME = "vblagoje/bert-english-uncased-finetuned-pos"
tagger_pipeline = pipeline("token-classification",
                           model=MODEL_NAME,
                           aggregation_strategy="simple")

print(f"Loaded BERT-based POS Tagger: {MODEL_NAME}\n")

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Loaded BERT-based POS Tagger: vblagoje/bert-english-uncased-finetuned-pos



In [None]:
# --- Tag each sentence and collect the words with their predicted tags ---
results_list = []

for sentence in sentences:
    print(f"Processing Sentence: '{sentence}'")
    tags = tagger_pipeline(sentence)

    # Store results
    word_data = {
        'Sentence': sentence,
        'Word': [item['word'] for item in tags],
        'POS Tag (UD)': [item['entity_group'] for item in tags]
    }
    results_list.append(word_data)
    print("-" * 50)

Processing Sentence: 'I book a flight.'
--------------------------------------------------
Processing Sentence: 'She read a book.'
--------------------------------------------------
Processing Sentence: 'The quick brown fox jumps over the lazy dog.'
--------------------------------------------------
Processing Sentence: 'They are quickly running to the store.'
--------------------------------------------------
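One caveat with `aggregation_strategy="simple"`: for label sets without `B-`/`I-` prefixes, such as these UD tags, the pipeline's grouping step may merge neighboring words that share a tag (for example "quick" and "brown", if both are tagged ADJ) into a single entry. If one row per word is needed, an alternative is to run the pipeline with `aggregation_strategy="none"` and rejoin WordPiece pieces manually. The sketch below assumes the un-aggregated output is a list of dicts with `word` and `entity` keys and that continuation pieces start with `##`. The helper name `merge_wordpieces` is hypothetical, and the sample input is hand-written, not real model output.

```python
def merge_wordpieces(token_preds):
    """Rejoin WordPiece pieces into words, keeping the tag of each word's first piece."""
    words = []
    for pred in token_preds:
        piece, tag = pred["word"], pred["entity"]
        if piece.startswith("##") and words:
            # Continuation piece: glue it onto the previous word and keep that word's tag.
            prev_word, prev_tag = words[-1]
            words[-1] = (prev_word + piece[2:], prev_tag)
        else:
            # First piece of a new word.
            words.append((piece, tag))
    return words


# Hand-written sample shaped like un-aggregated pipeline output (not real predictions).
sample = [
    {"word": "quick", "entity": "ADJ"},
    {"word": "brown", "entity": "ADJ"},
    {"word": "fox", "entity": "NOUN"},
    {"word": "jump", "entity": "VERB"},
    {"word": "##s", "entity": "VERB"},
]
print(merge_wordpieces(sample))
```

Taking the tag of the first piece is a design choice that mirrors the idea behind the pipeline's `"first"` strategy; other choices, such as taking the highest-scoring piece, are equally possible.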


In [None]:
final_output = []
for result in results_list:
    # One small frame per sentence: a row per word with its predicted tag
    df_temp = pd.DataFrame({'Word': result['Word'], 'POS Tag': result['POS Tag (UD)']})
    final_output.append(df_temp)

    # Add an empty row after each sentence for visual separation
    final_output.append(pd.DataFrame([{'Word': '', 'POS Tag': ''}]))

# Concatenate all results and drop the trailing separator row
df_final = pd.concat(final_output, ignore_index=True).iloc[:-1]
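Blank separator rows make the table easy to scan but awkward to filter or group. An alternative sketch keeps one row per word and adds a sentence index instead. The helper name `to_long_rows` and the `Sentence ID` column are assumptions introduced for illustration; the sample input mirrors the structure of `results_list`, with values copied from the first sentence of the output shown in this notebook.

```python
def to_long_rows(results):
    """Flatten per-sentence results into one dict per word, tagged with a sentence index."""
    rows = []
    for sentence_id, result in enumerate(results):
        for word, tag in zip(result["Word"], result["POS Tag (UD)"]):
            rows.append({"Sentence ID": sentence_id, "Word": word, "POS Tag": tag})
    return rows


# Sample input with the same keys as the entries in results_list.
sample_results = [
    {
        "Sentence": "I book a flight.",
        "Word": ["i", "book", "a", "flight", "."],
        "POS Tag (UD)": ["PRON", "VERB", "DET", "NOUN", "PUNCT"],
    },
]
rows = to_long_rows(sample_results)
print(rows[1])
```

Passing `rows` to `pd.DataFrame(rows)` should give a frame that can be grouped or filtered by `Sentence ID`.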

In [None]:
df_final

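|    | Word   | POS Tag |
|----|--------|---------|
| 0  | i      | PRON    |
| 1  | book   | VERB    |
| 2  | a      | DET     |
| 3  | flight | NOUN    |
| 4  | .      | PUNCT   |
| 5  |        |         |
| 6  | she    | PRON    |
| 7  | read   | VERB    |
| 8  | a      | DET     |
| 9  | book   | NOUN    |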
