# Hindi POS Tagger using NLTK

## Introduction

In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.

## Code

In [15]:
# Importing NLTK library, corpus and tagger
import nltk
from nltk.corpus import indian
from nltk.tag import tnt
import string

In [16]:
# Downloading the Indian Languages corpora, which include Hindi, Bangla, Marathi and Telugu
nltk.download("indian")

# Downloading the Punkt tokenizer models used by nltk.word_tokenize
nltk.download("punkt")

[nltk_data] Downloading package indian to
[nltk_data]     C:\Users\Pranshu\AppData\Roaming\nltk_data...
[nltk_data]   Package indian is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Pranshu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [17]:
# Training the POS Tagger Model using Hindi dataset
def train():
    taggedSet = "hindi.pos"
    wordSet = indian.sents(taggedSet)
    count = 0
    
    # Joining dataset words to form a sentence 
    for sen in wordSet:
        count = count + 1
        sen = "".join(
            [
                " " + i if not i.startswith("'") and i not in string.punctuation else i
                for i in sen
            ]
        ).strip()
        # Viewing individual corpus sentence
        # print(count, sen, "sentences")
        
    # Total Sentence Count
    print("Total sentences in the tagged file are", count)
    
    # Splitting the dataset into training data and test data
    trainPerc = 0.9

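    # Last index (exclusive) of the training data and first index of the test data.
    # Note: starting the test slice at trainRows + 1 leaves the sentence at
    # index trainRows out of both sets.
    trainRows = int(trainPerc * count)
    testRows = trainRows + 1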

    # Slicing the corpus
    data = indian.tagged_sents(taggedSet)
    train_data = data[:trainRows]
    test_data = data[testRows:]

    # Stats
    print("Training dataset length: ", len(train_data))
    print("Testing dataset length: ", len(test_data))
    pos_tagger = tnt.TnT()
    pos_tagger.train(train_data)
    # Note: newer NLTK releases deprecate evaluate() in favour of accuracy()
    print("Accuracy: ", pos_tagger.evaluate(test_data))
    
    return pos_tagger
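
The split arithmetic in `train()` can be traced without loading the corpus. The sketch below is only an illustration: it uses plain integers as stand-ins for the 540 sentences reported in the output, and the helper name `split_indices` is hypothetical, not part of NLTK or the original notebook.

```python
def split_indices(total, train_fraction=0.9):
    """Return (train_rows, test_start) computed the same way as train()."""
    train_rows = int(train_fraction * total)
    test_start = train_rows + 1  # same offset as in train()
    return train_rows, test_start


total = 540  # sentence count reported for hindi.pos
train_rows, test_start = split_indices(total)

# Integers stand in for sentences so the slices are easy to inspect.
sentences = list(range(total))
train_part = sentences[:train_rows]
test_part = sentences[test_start:]
skipped = sentences[train_rows:test_start]

print("train:", len(train_part))
print("test:", len(test_part))
print("in neither set:", skipped)
```

With these numbers the training slice holds 486 items and the test slice 53, and index 486 falls in neither. Slicing the test data from `trainRows` instead would use all 540 sentences, at the cost of a slightly different accuracy figure.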

In [18]:
# Tagging function to tag all words in a sentence
def tagger(pos_tagger, sentenceToBeTagged):
    tokenized = nltk.word_tokenize(sentenceToBeTagged)
    return pos_tagger.tag(tokenized)
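
When the input is already space separated, with the danda (`।`) written as its own token as in the example sentence used in this notebook, a plain whitespace split yields the token list directly. The sketch below assumes that input convention; `whitespace_tokenize` is a hypothetical helper, not an NLTK function, and it is not a general replacement for `nltk.word_tokenize`.

```python
def whitespace_tokenize(sentence):
    """Split a sentence on runs of whitespace (assumes pre-separated tokens)."""
    return sentence.split()


sentence = "प्रधानमंत्री की सिफारिश पर मंत्रियों के एक पैनल का गठन किया गया था ।"
tokens = whitespace_tokenize(sentence)

print(len(tokens))
print(tokens[-1])
```

The resulting list can be passed to `pos_tagger.tag(...)` in place of the `tokenized` variable. Text where punctuation is attached to the preceding word would need a real tokenizer.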

In [19]:
# Main Driving Module
if __name__ == "__main__":
    pos_tagger = train()
    sentence_to_be_tagged = "प्रधानमंत्री की सिफारिश पर मंत्रियों के एक पैनल का गठन किया गया था ।"
    print(tagger(pos_tagger, sentence_to_be_tagged))

Total sentences in the tagged file are 540
Training dataset length:  486
Testing dataset length:  53
Accuracy:  0.8111964873765093
[('प्रधानमंत्री', 'NN'), ('की', 'PREP'), ('सिफारिश', 'NN'), ('पर', 'PREP'), ('मंत्रियों', 'NN'), ('के', 'PREP'), ('एक', 'QFNUM'), ('पैनल', 'NN'), ('का', 'PREP'), ('गठन', 'NVB'), ('किया', 'VFM'), ('गया', 'VAUX'), ('था', 'VAUX'), ('।', 'PUNC')]
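
The tagger returns a list of `(word, tag)` tuples, so standard Python tools can summarise it. As a minimal sketch, the snippet below copies the tagged output shown above into a literal and tallies the tags with `collections.Counter`; the variable names are illustrative only.

```python
from collections import Counter

# Tagged output copied from the run above.
tagged = [
    ("प्रधानमंत्री", "NN"), ("की", "PREP"), ("सिफारिश", "NN"), ("पर", "PREP"),
    ("मंत्रियों", "NN"), ("के", "PREP"), ("एक", "QFNUM"), ("पैनल", "NN"),
    ("का", "PREP"), ("गठन", "NVB"), ("किया", "VFM"), ("गया", "VAUX"),
    ("था", "VAUX"), ("।", "PUNC"),
]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

for tag, n in tag_counts.most_common():
    print(tag, n)
```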
