# POS Tagger


A part-of-speech (POS) tagger is a natural language processing component that assigns a part-of-speech label to each word in a text based on its grammatical context within the sentence. Parts of speech are the linguistic categories into which words are classified, such as nouns, verbs, adjectives, adverbs, pronouns, conjunctions, and prepositions. POS tagging underpins many NLP applications because it exposes the grammatical roles of words and the relationships between them, which helps downstream components recover the structure and meaning of the text.

For example, consider the sentence "The cat is sleeping peacefully." A POS tagger using the Penn Treebank tagset would assign the following labels to the words:

    "The": Determiner (DT)
    "cat": Noun (NN)
    "is": Verb (VBZ)
    "sleeping": Verb (VBG)
    "peacefully": Adverb (RB)


Approaches to POS tagging range from hand-written rule-based methods to statistical and machine-learning methods such as Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), and neural network models. The statistical and learned methods estimate patterns from labeled training data and use them to predict the most likely POS tag for each word in a given sentence.
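As a point of comparison for the data-driven approaches, here is a minimal sketch of the simplest one: a most-frequent-tag baseline that ignores context entirely. The training sentences, the function names `train_most_frequent_tag` and `tag`, and the `NN` fallback are all invented for illustration; they are not part of NLTK.

```python
from collections import Counter, defaultdict

# Toy training data (invented for illustration): sentences of (word, tag) pairs.
train_sents = [
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("dogs", "NNS"), ("run", "VBP"), ("fast", "RB")],
    [("the", "DT"), ("run", "NN"), ("ends", "VBZ")],
    [("cats", "NNS"), ("run", "VBP")],
]


def train_most_frequent_tag(sents):
    """Map each lowercased word to the tag it carries most often in the data."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag_ in sent:
            counts[word.lower()][tag_] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NN"):
    """Tag each word independently; unseen words fall back to `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train_most_frequent_tag(train_sents)
print(tag(["The", "cat", "runs", "home"], model))
```

Because each word is tagged in isolation, "run" always receives its majority tag (`VBP` in this toy data) even after a determiner. Sequence models such as HMMs and CRFs exist to resolve exactly that kind of ambiguity.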


## Train and use an HMM-based tagger


The `nltk.tag.hmm` module in NLTK provides classes and functions for working with Hidden Markov Models (HMMs) for part-of-speech tagging and other sequence labeling tasks. An HMM is a statistical model of sequential data in which each hidden state depends on the previous state and emits an observation according to an emission probability. In POS tagging, the hidden states are the tags and the observations are the words.
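For a first-order HMM, the tagging objective can be written as follows. The notation is an assumption introduced here, not taken from NLTK: $w_i$ is the $i$-th word, $t_i$ its tag, and $t_0$ a fixed start state.

```latex
\hat{t}_{1:n} \;=\; \arg\max_{t_{1:n}} \; \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
```

The first factor is the transition probability between tags and the second is the emission probability of a word given its tag.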

Within NLTK, the module is used mainly to train and apply HMM-based taggers. Here is a brief overview of the main functionality of the `nltk.tag.hmm` module:

- **Training an HMM Tagger:**
The module provides a `HiddenMarkovModelTrainer` class that allows you to train an HMM-based tagger using a tagged corpus. You can use this trainer to estimate the transition probabilities between different states (tags) and the emission probabilities (observations given states) from your training data.

- **Tagging with an HMM Tagger:**
Once you have trained an HMM tagger using the `HiddenMarkovModelTrainer`, you can use the trained tagger to tag new sequences of observations (usually words). The tagger assigns the most likely sequence of states (tags) to the given sequence of observations based on the learned probabilities.

- **Customizing HMM Parameters:**
The `HiddenMarkovModelTagger` class allows you to create a custom HMM tagger by specifying your own symbols (observations), states, transition and emission distributions, and initial-state priors. This can be useful if you want to set the model parameters yourself or incorporate additional information.

- **Using Word Distributions:**
The HMM tagger works with probability distributions for transitions and emissions. You can provide your own probability distributions or let the trainer estimate them from the training data.
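The tagging step described above, finding the most likely tag sequence for a sentence, is usually done with the Viterbi algorithm. The following is a minimal pure-Python sketch, not NLTK's implementation. The probability tables are invented toy values, and the `floor` parameter is an assumption added here to avoid taking the logarithm of zero.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable (word, tag) sequence under a first-order HMM."""
    if not words:
        return []

    def lp(p):
        # Log-probability with a small floor so unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at position i, backpointer)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit_p[t].get(words[i], 0)), prev)
        best.append(column)

    # Follow the backpointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(zip(words, reversed(path)))


# Toy parameters (invented for illustration).
tags = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.8, "NN": 0.2}
trans_p = {
    "DT": {"NN": 0.9, "VBZ": 0.1},
    "NN": {"VBZ": 0.7, "NN": 0.3},
    "VBZ": {"DT": 0.6, "NN": 0.4},
}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"cat": 0.5, "dog": 0.4, "sleeps": 0.1},
    "VBZ": {"sleeps": 0.8, "barks": 0.2},
}

print(viterbi(["the", "cat", "sleeps"], tags, start_p, trans_p, emit_p))
```

Working in log space keeps long sentences from underflowing, and the backpointers make the search linear in sentence length rather than exponential in the number of tag sequences.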

In [1]:
import nltk
from nltk.tag import hmm

In [2]:
nltk.download('treebank')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
from nltk.corpus import treebank, wordnet

In [4]:
# Load the tagged corpus (for training and evaluation)
corpus = treebank.tagged_sents()
corpus[0]

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

In [5]:
# Create a basic HMM tagger
trainer = hmm.HiddenMarkovModelTrainer()
basic_tagger = trainer.train(corpus)

In [6]:
# Tag a test sentence with the basic HMM tagger
test_sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(test_sentence)
tags = basic_tagger.tag(tokens)
print(tags)

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NNP'), ('fox', 'NNP'), ('jumps', 'NNP'), ('over', 'NNP'), ('the', 'NNP'), ('lazy', 'NNP'), ('dog', 'NNP')]
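In the output above, every token after "quick" is tagged `NNP`. A likely cause is that `HiddenMarkovModelTrainer` defaults to unsmoothed maximum-likelihood estimates, so a word never seen in training gets zero emission probability under every tag and the rest of the path becomes arbitrary. A common remedy is to pass a smoothing estimator to `train_supervised`. The sketch below uses `LidstoneProbDist` on a tiny invented corpus so that it runs without downloading anything; the smoothing value `0.1` is an arbitrary choice, not a recommendation.

```python
from nltk.probability import LidstoneProbDist
from nltk.tag import hmm

# Tiny invented training set so the example runs without any corpus download.
toy_train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]


def lidstone(fd, bins):
    """Lidstone (add-gamma) smoothing reserves probability mass for unseen events."""
    return LidstoneProbDist(fd, 0.1, bins)


smoothed_tagger = hmm.HiddenMarkovModelTrainer().train_supervised(
    toy_train, estimator=lidstone
)

# "fox" never occurs in toy_train, yet the tags around it stay sensible.
print(smoothed_tagger.tag(["the", "fox", "barks"]))
```

The same `estimator` argument should work with the Treebank `corpus` used above, although the resulting tags and accuracy were not measured here.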


## TnT (Trigrams'n'Tags)

TnT (Trigrams'n'Tags) is a statistical part-of-speech tagger introduced by Thorsten Brants. It is a second-order Markov model tagger: the probability of each tag is conditioned on the two preceding tags, estimated from the frequencies of tag unigrams, bigrams, and trigrams, and combined with the probability of each word given its tag.

TnT is not tied to a particular corpus or tagset. It can be trained on any tagged corpus in which each word is paired with its part-of-speech tag, and that tagged data is the basis for the probability estimates. NLTK's implementation lives in `nltk.tag.tnt`. The example below trains it on the first 1,000 sentences of the Penn Treebank sample bundled with NLTK and evaluates it on the next 250.
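TnT smooths its trigram estimates by linearly interpolating unigram, bigram, and trigram tag probabilities. The sketch below shows that idea in pure Python on invented tag sequences. The fixed weights `(0.1, 0.3, 0.6)` are an assumption for illustration only; the actual tagger estimates its weights from the training data.

```python
from collections import Counter

# Toy tag sequences (invented for illustration).
tag_seqs = [
    ["DT", "NN", "VBZ", "RB"],
    ["DT", "JJ", "NN", "VBZ"],
    ["DT", "NN", "VBZ", "DT", "NN"],
]

uni, bi, tri = Counter(), Counter(), Counter()
for seq in tag_seqs:
    for i, t in enumerate(seq):
        uni[t] += 1
        if i >= 1:
            bi[(seq[i - 1], t)] += 1
        if i >= 2:
            tri[(seq[i - 2], seq[i - 1], t)] += 1

total = sum(uni.values())


def ratio(num, den):
    """Relative frequency that returns 0.0 when the context was never seen."""
    return num / den if den else 0.0


def interpolated_prob(t1, t2, t3, lambdas=(0.1, 0.3, 0.6)):
    """P(t3 | t1, t2) as a weighted mix of unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas
    p_uni = ratio(uni[t3], total)
    p_bi = ratio(bi[(t2, t3)], uni[t2])
    p_tri = ratio(tri[(t1, t2, t3)], bi[(t1, t2)])
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


seen = interpolated_prob("DT", "NN", "VBZ")    # trigram observed in the toy data
unseen = interpolated_prob("RB", "RB", "NN")   # trigram and bigram never observed
print(seen, unseen)
```

The unseen trigram still receives a non-zero probability through the unigram term, which is the point of the interpolation.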


In [7]:
from nltk.tag import tnt
from nltk.corpus import treebank

train = treebank.tagged_sents()[:1000]
test = treebank.tagged_sents()[1000:1250]
tagg = tnt.TnT()
tagg.train(train)
print("Accuracy of TnT Tagging : ", tagg.accuracy(test))

Accuracy of TnT Tagging :  0.7964917727413847


In [8]:
print(len(treebank.tagged_sents()))

3914


In [9]:
# print(train[:2])
sent = "Lets us try to tag this sentence"
tokens = sent.split(' ')
tagg.tag(tokens)

[('Lets', 'Unk'),
 ('us', 'PRP'),
 ('try', 'VB'),
 ('to', 'TO'),
 ('tag', 'Unk'),
 ('this', 'DT'),
 ('sentence', 'Unk')]

## spaCy POS tagger

spaCy is an open-source natural language processing (NLP) library for Python, designed to be fast and easy to use in both research and production applications. It ships pre-trained pipelines for many languages and supports tasks such as tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and text classification. In the example below, `token.tag_` is the fine-grained tag (Penn Treebank style for the English models) and `token.pos_` is the coarse-grained Universal POS tag.

In [12]:
import spacy

In [None]:
!python -m spacy download en_core_web_lg

In [15]:
nlp = spacy.load('en_core_web_lg')

In [16]:
sentence = "He was being opposed by her without any reason. A plan is being prepared by charles for next project"
for token in nlp(sentence):
    print(f'{token.text:{10}} {token.tag_:>{10}}\t{spacy.explain(token.tag_):<{50}} {token.pos_:>{5}}')

He                PRP	pronoun, personal                                   PRON
was               VBD	verb, past tense                                     AUX
being             VBG	verb, gerund or present participle                   AUX
opposed           VBN	verb, past participle                               VERB
by                 IN	conjunction, subordinating or preposition            ADP
her               PRP	pronoun, personal                                   PRON
without            IN	conjunction, subordinating or preposition            ADP
any                DT	determiner                                           DET
reason             NN	noun, singular or mass                              NOUN
.                   .	punctuation mark, sentence closer                  PUNCT
A                  DT	determiner                                           DET
plan               NN	noun, singular or mass                              NOUN
is                VBZ	verb, 3rd person singular pres

## Brill Tagger

The Brill tagger is a transformation-based, rule-driven tagger. It uses an initial tagger (such as `nltk.DefaultTagger`) to assign an initial tag sequence to a text, and then applies an ordered list of transformation rules to correct the tags of individual tokens. These transformation rules are specified by the `TagRule` interface.

A Brill tagger can be created directly from an initial tagger and a list of transformation rules, but more often it is created by learning rules from a training corpus with a trainer such as `BrillTaggerTrainer`.
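To make the "initial tags, then ordered corrections" idea concrete, here is a small pure-Python sketch of rule application. It does not use NLTK's classes: `apply_rules` and the rule tuples are invented for illustration, and the two rules are modelled on ones that appear in the training trace further down (`NN->IN if Word:the@[1]` and `NN->AT if Pos:IN@[-1]`).

```python
def apply_rules(words, rules, initial_tag="NN"):
    """Tag every word with `initial_tag`, then apply each rule in order."""
    tags = [initial_tag] * len(words)
    for from_tag, to_tag, applies in rules:
        # Collect all matching positions first, then rewrite them together,
        # so one rule's changes do not feed back into its own conditions.
        hits = [i for i in range(len(words))
                if tags[i] == from_tag and applies(words, tags, i)]
        for i in hits:
            tags[i] = to_tag
    return list(zip(words, tags))


rules = [
    # NN -> IN if the next word is "the"
    ("NN", "IN", lambda w, t, i: i + 1 < len(w) and w[i + 1].lower() == "the"),
    # NN -> AT if the previous tag is IN
    ("NN", "AT", lambda w, t, i: i >= 1 and t[i - 1] == "IN"),
]

tagged = apply_rules(["He", "sat", "on", "the", "mat"], rules)
print(tagged)
```

Order matters: the second rule can only fire because the first one has already turned "on" into `IN`.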

In [18]:
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [19]:
import nltk
from nltk.corpus import brown
from nltk.tag import brill, brill_trainer
from nltk.tag.util import untag

# Load the Brown corpus and divide it into sentences
corpus = brown.tagged_sents(categories='news')

# Split the data into training and testing sets
train_data = corpus[:3000]
test_data = corpus[3000:]

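# Define the baseline tagger: a default tagger that labels every token 'NN'.
# This is a weak starting point, so a handful of learned rules cannot lift accuracy far.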
baseline_tagger = nltk.DefaultTagger('NN')

# Create templates for Brill Tagger rules
templates = [
    brill.Template(brill.Pos([-1])),
    brill.Template(brill.Pos([1])),
    brill.Template(brill.Pos([-2])),
    brill.Template(brill.Pos([2])),
    brill.Template(brill.Pos([-2, -1])),
    brill.Template(brill.Pos([1, 2])),
    brill.Template(brill.Pos([-3, -2, -1])),
    brill.Template(brill.Pos([1, 2, 3])),
    brill.Template(brill.Word([-1])),
    brill.Template(brill.Word([1])),
    brill.Template(brill.Word([-2])),
    brill.Template(brill.Word([2])),
    brill.Template(brill.Word([-2, -1])),
    brill.Template(brill.Word([1, 2])),
    brill.Template(brill.Word([-3, -2, -1])),
    brill.Template(brill.Word([1, 2, 3])),
]

# Train a Brill Tagger
trainer = brill_trainer.BrillTaggerTrainer(baseline_tagger, templates, trace=3)
brill_tagger = trainer.train(train_data, max_rules=10)

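# Evaluate the tagger on the test data.
# Note: recent NLTK releases deprecate evaluate() in favour of accuracy(gold),
# which is why this run prints a deprecation warning.
accuracy = brill_tagger.evaluate(test_data)
print("Accuracy:", accuracy)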

# Tag a sample sentence
sample_sentence = "NLTK is a great library for natural language processing."
words = nltk.word_tokenize(sample_sentence)
tags = brill_tagger.tag(words)
print(tags)

TBL train (fast) (seqs: 3000; tokens: 64638; tpls: 16; min score: 2; min acc: None)
Finding initial useful rules...
    Found 280917 useful rules.

Score = Fixed - Broken
Fixed = num tags changed incorrect -> correct
Broken = num tags changed correct -> incorrect
Other = num tags changed incorrect -> incorrect

 Score  Fixed  Broken  Other  | Rule
------------------------------+--------------------------------
  1971   2010      39   1457  | NN->IN if Word:the@[1]
  3493   3493       0     13  | NN->AT if Pos:IN@[-1]
   549   1001     452   4534  | NN->NP if Word:,@[-2,-1]
  2664   2664       0     45  | NN->, if Pos:NP@[1]
   478    517      39    443  | NN->VB if Word:to@[-1]
   603    603       0    357  | NN->TO if Pos:VB@[1]
   420    429       9     96  | NN->NP if Word:Mrs.@[-3,-2,-1]
   368    387      19    559  | NN->IN if Word:a@[1]
   963    963       0      2  | NN->AT if Pos:IN@[-1]
   211    211       0      0  | IN->, if Pos:NP@[2]


  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  accuracy = brill_tagger.evaluate(test_data)


Accuracy: 0.300256153246464
[('NLTK', 'NN'), ('is', 'IN'), ('a', 'AT'), ('great', 'NN'), ('library', 'NN'), ('for', 'NN'), ('natural', 'NN'), ('language', 'NN'), ('processing', 'NN'), ('.', 'NN')]


For other POS taggers, see this [reference](https://medium.com/@yashj302/pos-tagging-nlp-python-41df5243da78).