# Problem Statement: 
- Modify the Viterbi algorithm to handle unknown words using at least two techniques.

# Goals:
- Write the vanilla Viterbi algorithm for assigning POS tags (i.e. without dealing with unknown words) 
- Solve the problem of unknown words using at least two techniques. These techniques can use any of the approaches discussed in the class - lexicon, rule-based, probabilistic etc. Note that to implement these techniques, you can either write separate functions and call them from the main Viterbi algorithm, or modify the Viterbi algorithm, or both.
- Compare the tagging accuracy after making these modifications with the vanilla Viterbi algorithm.
- List at least three cases from the sample test file (i.e. unknown word-tag pairs) that were incorrectly tagged by the original Viterbi POS tagger and were corrected after your modifications.

# Approach:
- Part 1: Reading & Understanding The Dataset
- Part 2: Performing Exploratory Data Analysis
- Part 3: Build The Vanilla Viterbi Based POS Tagger
    - 3A. Perform Training - Validation Split
    - 3B. Determine Emission Probabilities
    - 3C. Determine Transition Probabilities
    - 3D. Generate The Viterbi Algorithm
- Part 4: Improve Results Using Alternative Models
    - 4A. Using Combination Tagger (Lexicon & Rule Based)
    - 4B. Using Conditional Random Fields (CRF)
- Part 5: Load The Test Data File
- Part 6: Demonstrate Improvement In the Two Models Over the Vanilla Viterbi Model
    - 6A. Improvement From The Combination Tagger Model
    - 6B. Improvement From The CRF Model

### Dependent Files: "Test_sentences.txt" (File provided for test data to evaluate model built).
    

In [1]:
# Import required libraries
import nltk
import numpy as np
import pandas as pd
import re
import requests
import matplotlib.pyplot as plt
import seaborn
import pprint
import time
import random

from sklearn.model_selection import train_test_split
from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics
from collections import Counter

## Part 1: Reading & Understanding The Dataset

In [2]:
nltk_data = list(nltk.corpus.treebank.tagged_sents(tagset='universal'))
print('Length of dataset: ', len(nltk_data), '\n')
print(nltk_data)

Length of dataset:  3914 





In [3]:
# samples: Each sentence is a list of (word, pos) tuples. Print first 3 sentences
print(nltk_data[:3])

[[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')], [('Mr.', 'NOUN'), ('Vinken', 'NOUN'), ('is', 'VERB'), ('chairman', 'NOUN'), ('of', 'ADP'), ('Elsevier', 'NOUN'), ('N.V.', 'NOUN'), (',', '.'), ('the', 'DET'), ('Dutch', 'NOUN'), ('publishing', 'VERB'), ('group', 'NOUN'), ('.', '.')], [('Rudolph', 'NOUN'), ('Agnew', 'NOUN'), (',', '.'), ('55', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), ('and', 'CONJ'), ('former', 'ADJ'), ('chairman', 'NOUN'), ('of', 'ADP'), ('Consolidated', 'NOUN'), ('Gold', 'NOUN'), ('Fields', 'NOUN'), ('PLC', 'NOUN'), (',', '.'), ('was', 'VERB'), ('named', 'VERB'), ('*-1', 'X'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('of', 'ADP'), ('this', 'DET'), ('British', 'ADJ'), ('industrial', 'ADJ'), ('

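Each element of the list above is one sentence, itself a list of (word, tag) pairs. Sentences typically end with a full stop '.', whose POS tag is also '.'. Note that the '.' tag covers other punctuation as well (the commas above are also tagged '.'), so it is only an approximate marker for the end of a sentence.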

In [4]:
# Print the first sentence
print(nltk_data[:1])

[[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]]


In [5]:
# Print the first sentence without POS tags
first_sentence = ' '.join(word for word, tag in nltk_data[0])
print(first_sentence)

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .


We do not need the corpus to be segmented into sentences; a flat list of (word, tag) tuples is enough. Let's flatten the list of sentences into a list of (word, tag) tuples.

In [6]:
# Converting the list of sents to a list of (word, pos tag) tuples

tagged_words = [tup for sent in nltk_data for tup in sent]
print(len(tagged_words))
print(tagged_words[:10])

100676
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET')]


Build the list of words

In [7]:
# Flatten the corpus into a plain list of words (tags dropped)
nltk_words_list = [word for sent in nltk_data for word, tag in sent]
print(nltk_words_list)






## Part 2: Performing Exploratory Data Analysis

###### 1. Words in the corpus

In [8]:
words = []
words = [word[0] for sent in nltk_data for word in sent]
unique_words = set(words)
print('Total words in the corpus : ', len(words))
print('Unique words in the corpus: ', len(unique_words))

Total words in the corpus :  100676
Unique words in the corpus:  12408


###### 2. Tags in the corpus

In [9]:
tags = [tag for sent in nltk_data for word, tag in sent]
unique_tags = set(tags)

# Human-readable names for the tags in the universal tagset
tag_names = {
    'NOUN': 'Noun',
    'VERB': 'Verb',
    'ADP': 'Adposition [Preposition, Postposition]',
    'CONJ': 'Conjunction',
    'ADV': 'Adverb',
    'NUM': 'Cardinal Numbers',
    'PRON': 'Pronouns',
    'DET': 'Determiner',
    'ADJ': 'Adjective',
    'PRT': 'Particles or other function words',
    '.': '.',
}
print("Unique tags are: ")
for tag in unique_tags:
    # Any tag without an entry (here 'X') is reported as "Unknown"
    print(tag, '(' + tag_names.get(tag, 'Unknown') + ')')


Unique tags are: 
CONJ (Conjunction)
X (Unknown)
PRON (Pronouns)
. (.)
ADP (Adposition [Preposition, Postposition])
PRT (Particles or other function words)
VERB (Verb)
DET (Determiner)
NUM (Cardinal Numbers)
NOUN (Noun)
ADV (Adverb)
ADJ (Adjective)


In [10]:
print('Total tags in the corpus : ', len(tags))
print('Unique tags in the corpus: ', len(unique_tags))

Total tags in the corpus :  100676
Unique tags in the corpus:  12


###### 3. Most common tags

In [11]:
c = Counter(tags)
tag_counts = c.most_common()
print('Most common tags are: ', '\n', tag_counts)

Most common tags are:  
 [('NOUN', 28867), ('VERB', 13564), ('.', 11715), ('ADP', 9857), ('DET', 8725), ('X', 6613), ('ADJ', 6397), ('NUM', 3546), ('PRT', 3219), ('ADV', 3171), ('PRON', 2737), ('CONJ', 2265)]


The most common tags are nouns, verbs, punctuation ('.') and adpositions (prepositions and postpositions).

###### 4. Most common words

In [12]:
c = Counter(words)
word_counts = c.most_common(30)
print('Most common words are: ', '\n', word_counts)

Most common words are:  
 [(',', 4885), ('the', 4045), ('.', 3828), ('of', 2319), ('to', 2164), ('a', 1878), ('in', 1572), ('and', 1511), ('*-1', 1123), ('0', 1099), ('*', 965), ("'s", 864), ('for', 817), ('that', 807), ('*T*-1', 806), ('*U*', 744), ('$', 718), ('The', 717), ('``', 702), ("''", 684), ('is', 671), ('said', 628), ('on', 490), ('it', 476), ('%', 446), ('by', 429), ('at', 402), ('with', 387), ('from', 386), ('as', 385)]


In [13]:
c_words = len(words)

verbs = [pair for pair in tagged_words if pair[1] == 'VERB'] 
c_verbs = len(verbs)

nouns = [pair for pair in tagged_words if pair[1] == 'NOUN'] 
c_nouns = len(nouns)

adps = [pair for pair in tagged_words if pair[1] == 'ADP'] 
c_adps = len(adps)


print('Percentage of words that are nouns       : ', "{:.2%}".format(c_nouns / c_words))
print('Percentage of words that are verbs       : ', "{:.2%}".format(c_verbs / c_words))
print('Percentage of words that are adpositions : ', "{:.2%}".format(c_adps / c_words))

Percentage of words that are nouns       :  28.67%
Percentage of words that are verbs       :  13.47%
Percentage of words that are adpositions :  9.79%


###### 5. Tag "Y" following tag "X". Write a function to determine how often one tag is immediately followed by another (focus on nouns, verbs, and words tagged X, reported as UNKNOWN)


In [14]:
def tag_followed_by_tag(tag_1, tag_2):
    """Print how often tag_1 is immediately followed by tag_2 in the corpus."""
    tags = [pair[1] for pair in tagged_words]
    tags_count = len(tags)

    t1_tags_count = tags.count(tag_1)

    # Pair each tag with its successor; zip() stops before the end,
    # so the last tag never indexes past the list
    t1_t2_tags_count = sum(1 for t, nxt in zip(tags, tags[1:]) if t == tag_1 and nxt == tag_2)
    
    # The 'X' tag is reported as UNKNOWN in the printed summary
    if tag_1 == 'X':
        tag_1 = 'UNKNOWN'
    
    if tag_2 == 'X':
        tag_2 = 'UNKNOWN'
    
    print('Total Tags:', tags_count)
    print('Total', tag_1, 'Tags:', t1_tags_count)
    print('Total', tag_1, 'followed by', tag_2, 'Tags:', t1_t2_tags_count)
    print('Percentage of', tag_1, 'Tags that are followed by', tag_2, 'Tags:', "{:.2%}".format(t1_t2_tags_count/t1_tags_count))
    print('\n')


In [15]:
# Determine what fraction of nouns are followed by adjectives
tag_followed_by_tag('NOUN', 'ADJ')
# Determine what fraction of nouns are followed by verbs
tag_followed_by_tag('NOUN', 'VERB')
# Determine what fraction of nouns are followed by periods
tag_followed_by_tag('NOUN', '.')
# Determine what fraction of nouns are followed by determiners
tag_followed_by_tag('NOUN', 'DET')
# Determine what fraction of nouns are followed by adpositions
tag_followed_by_tag('NOUN', 'ADP')
# Determine what fraction of nouns are followed by adverbs
tag_followed_by_tag('NOUN', 'ADV')
# Determine what fraction of nouns are followed by cardinal numbers
tag_followed_by_tag('NOUN', 'NUM')
# Determine what fraction of nouns are followed by unknowns
tag_followed_by_tag('NOUN', 'X')


Total Tags: 100676
Total NOUN Tags: 28867
Total NOUN followed by ADJ Tags: 355
Percentage of NOUN Tags that are followed by ADJ Tags: 1.23%


Total Tags: 100676
Total NOUN Tags: 28867
Total NOUN followed by VERB Tags: 4240
Percentage of NOUN Tags that are followed by VERB Tags: 14.69%


Total Tags: 100676
Total NOUN Tags: 28867
Total NOUN followed by . Tags: 6927
Percentage of NOUN Tags that are followed by . Tags: 24.00%


Total Tags: 100676
Total NOUN Tags: 28867
Total NOUN followed by DET Tags: 380
Percentage of NOUN Tags that are followed by DET Tags: 1.32%


Total Tags: 100676
Total NOUN Tags: 28867
Total NOUN followed by ADP Tags: 5102
Percentage of NOUN Tags that are followed by ADP Tags: 17.67%


Total Tags: 100676
Total NOUN Tags: 28867
Total NOUN followed by ADV Tags: 491
Percentage of NOUN Tags that are followed by ADV Tags: 1.70%


Total Tags: 100676
Total NOUN Tags: 28867
Total NOUN followed by NUM Tags: 273
Percentage of NOUN Tags that are followed by NUM Tags: 0.95%


To summarize:

- 24% of nouns are followed by periods (.)
- 18% of nouns are followed by adpositions
- 15% of nouns are followed by verbs
- 3% of nouns are followed by UNKNOWN tagged words


In [16]:
# Determine what fraction of verbs are followed by nouns
tag_followed_by_tag('VERB', 'NOUN')
# Determine what fraction of verbs are followed by adjectives
tag_followed_by_tag('VERB', 'ADJ')
# Determine what fraction of verbs are followed by periods
tag_followed_by_tag('VERB', '.')
# Determine what fraction of verbs are followed by determiners
tag_followed_by_tag('VERB', 'DET')
# Determine what fraction of verbs are followed by adpositions
tag_followed_by_tag('VERB', 'ADP')
# Determine what fraction of verbs are followed by adverbs
tag_followed_by_tag('VERB', 'ADV')
# Determine what fraction of verbs are followed by cardinal numbers
tag_followed_by_tag('VERB', 'NUM')
# Determine what fraction of verbs are followed by unknowns
tag_followed_by_tag('VERB', 'X')

Total Tags: 100676
Total VERB Tags: 13564
Total VERB followed by NOUN Tags: 1497
Percentage of VERB Tags that are followed by NOUN Tags: 11.04%


Total Tags: 100676
Total VERB Tags: 13564
Total VERB followed by ADJ Tags: 884
Percentage of VERB Tags that are followed by ADJ Tags: 6.52%


Total Tags: 100676
Total VERB Tags: 13564
Total VERB followed by . Tags: 475
Percentage of VERB Tags that are followed by . Tags: 3.50%


Total Tags: 100676
Total VERB Tags: 13564
Total VERB followed by DET Tags: 1822
Percentage of VERB Tags that are followed by DET Tags: 13.43%


Total Tags: 100676
Total VERB Tags: 13564
Total VERB followed by ADP Tags: 1239
Percentage of VERB Tags that are followed by ADP Tags: 9.13%


Total Tags: 100676
Total VERB Tags: 13564
Total VERB followed by ADV Tags: 1110
Percentage of VERB Tags that are followed by ADV Tags: 8.18%


Total Tags: 100676
Total VERB Tags: 13564
Total VERB followed by NUM Tags: 310
Percentage of VERB Tags that are followed by NUM Tags: 2.29%


To summarize:

- 22% of verbs are followed by unknown tagged words
- 13% of verbs are followed by determiner tagged words
- 11% of verbs are followed by noun tagged words

In [17]:
# Determine what fraction of noun tags are followed by unknowns
tag_followed_by_tag('NOUN', 'X')
# Determine what fraction of verb tags are followed by unknowns
tag_followed_by_tag('VERB', 'X')
# Determine what fraction of adverb tags are followed by unknowns
tag_followed_by_tag('ADV', 'X')
# Determine what fraction of adjective tags are followed by unknowns
tag_followed_by_tag('ADJ', 'X')
# Determine what fraction of adposition tags are followed by unknowns
tag_followed_by_tag('ADP', 'X')
# Determine what fraction of particle tags are followed by unknowns
tag_followed_by_tag('PRT', 'X')
# Determine what fraction of cardinal number tags are followed by unknowns
tag_followed_by_tag('NUM', 'X')
# Determine what fraction of conjunction tags are followed by unknowns
tag_followed_by_tag('CONJ', 'X')
# Determine what fraction of determiner tags are followed by unknowns
tag_followed_by_tag('DET', 'X')
# Determine what fraction of pronoun tags are followed by unknowns
tag_followed_by_tag('PRON', 'X')
# Determine what fraction of unknown tags are followed by unknowns
tag_followed_by_tag('X', 'X')


Total Tags: 100676
Total NOUN Tags: 28867
Total NOUN followed by UNKNOWN Tags: 839
Percentage of NOUN Tags that are followed by UNKNOWN Tags: 2.91%


Total Tags: 100676
Total VERB Tags: 13564
Total VERB followed by UNKNOWN Tags: 2954
Percentage of VERB Tags that are followed by UNKNOWN Tags: 21.78%


Total Tags: 100676
Total ADV Tags: 3171
Total ADV followed by UNKNOWN Tags: 73
Percentage of ADV Tags that are followed by UNKNOWN Tags: 2.30%


Total Tags: 100676
Total ADJ Tags: 6397
Total ADJ followed by UNKNOWN Tags: 134
Percentage of ADJ Tags that are followed by UNKNOWN Tags: 2.09%


Total Tags: 100676
Total ADP Tags: 9857
Total ADP followed by UNKNOWN Tags: 343
Percentage of ADP Tags that are followed by UNKNOWN Tags: 3.48%


Total Tags: 100676
Total PRT Tags: 3219
Total PRT followed by UNKNOWN Tags: 43
Percentage of PRT Tags that are followed by UNKNOWN Tags: 1.34%


Total Tags: 100676
Total NUM Tags: 3546
Total NUM followed by UNKNOWN Tags: 746
Percentage of NUM Tags that are foll

- 22% of verbs are followed by unknown tagged words
- 21% of cardinal numbers are followed by unknown tagged words
-  9% of pronouns are followed by unknown tagged words
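
The per-pair calls above rescan the corpus once for every tag pair. As a rough sketch, all follower shares for a given tag can be collected in a single pass with `collections.Counter`. The helper name `follower_shares` and the tiny `toy_tags` list are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def follower_shares(tags, tag_1):
    """Return {next_tag: share} for every tag that directly follows tag_1."""
    # Count every adjacent (current, next) tag pair in a single pass.
    bigrams = Counter(zip(tags, tags[1:]))
    # Same denominator as the notebook: all occurrences of tag_1.
    total = tags.count(tag_1)
    if total == 0:
        return {}
    return {t2: n / total for (t1, t2), n in bigrams.items() if t1 == tag_1}

# Hypothetical toy sequence standing in for the flattened corpus tags.
toy_tags = ['DET', 'NOUN', 'VERB', 'DET', 'NOUN', '.', 'NOUN', 'VERB', 'X', '.']
shares = follower_shares(toy_tags, 'NOUN')
for tag, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(tag, "{:.2%}".format(share))
```

Sorting the result by share gives the ranked list of followers directly, instead of reading it off many separate printouts.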

## Part 3: Build The Vanilla Viterbi Based POS Tagger
- 3A. Perform Training - Validation Split
- 3B. Determine Emission Probabilities
- 3C. Determine Transition Probabilities
- 3D. Generate The Viterbi Algorithm

### 3A. Perform Training - Validation Split (95% and 5% respectively)

In [18]:
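# Randomize and split (validation sample size: 5%)
# Note: random.seed() only seeds Python's `random` module, which train_test_split does not use.
# Pass random_state=<int> to train_test_split if the split must be reproducible.

random.seed(1234)
train_nltk_data, validation_nltk_data = train_test_split(nltk_data, test_size = 0.05)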

# Check the size of training and validation data, print the words & tag pairs in the first 5 sentences

print('Training Set Size  : ', len(train_nltk_data))
print('Validation Set Size: ', len(validation_nltk_data))
print('\n')
print('Training Sample: ', '\n')
print(train_nltk_data[:5])
print('\n')
print('Validation Sample : ', '\n')
print(validation_nltk_data[:5])


Training Set Size  :  3718
Validation Set Size:  196


Training Sample:  

[[('IBM', 'NOUN'), (',', '.'), ('the', 'DET'), ('giant', 'ADJ'), ('computer', 'NOUN'), ('maker', 'NOUN'), (',', '.'), ('offered', 'VERB'), ('$', '.'), ('750', 'NUM'), ('million', 'NUM'), ('*U*', 'X'), ('of', 'ADP'), ('non-callable', 'ADJ'), ('30-year', 'ADJ'), ('debentures', 'NOUN'), ('priced', 'VERB'), ('*', 'X'), ('*-1', 'X'), ('to', 'PRT'), ('yield', 'VERB'), ('8.47', 'NUM'), ('%', 'NOUN'), (',', '.'), ('or', 'CONJ'), ('about', 'ADP'), ('1\\/2', 'NUM'), ('percentage', 'NOUN'), ('point', 'NOUN'), ('higher', 'ADJ'), ('than', 'ADP'), ('the', 'DET'), ('yield', 'NOUN'), ('on', 'ADP'), ('30-year', 'ADJ'), ('Treasury', 'NOUN'), ('bonds', 'NOUN'), ('.', '.')], [('This', 'DET'), ('being', 'VERB'), ('Britain', 'NOUN'), (',', '.'), ('no', 'DET'), ('woman', 'NOUN'), ('has', 'VERB'), ('filed', 'VERB'), ('an', 'DET'), ('equal-opportunity', 'NOUN'), ('suit', 'NOUN'), (',', '.'), ('but', 'CONJ'), ('the', 'DET'), ('extent', '
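
A reproducible split needs the shuffle itself to be seeded. The standard-library sketch below illustrates the idea of a seeded shuffle-and-cut with a private generator; `seeded_split` and the toy sentences are hypothetical and only stand in for the notebook's data and for `train_test_split`.

```python
import random

def seeded_split(sentences, test_size=0.05, seed=1234):
    """Shuffle a copy of `sentences` with a private RNG and cut off a validation slice."""
    rng = random.Random(seed)      # private generator: the global random state is untouched
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for the list of tagged sentences.
toy_sentences = [[('w%d' % i, 'NOUN')] for i in range(40)]
train_a, val_a = seeded_split(toy_sentences)
train_b, val_b = seeded_split(toy_sentences)
print(len(train_a), len(val_a))
print(val_a == val_b)   # the same seed yields the same split
```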

In [19]:
# Flatten the training and validation sentences into lists of (word, tag) pairs

train_tagged_words = [tup for sent in train_nltk_data for tup in sent]
print('Sample Training Data Word Tags Pairs:')
print(train_tagged_words[:10], '\n')
print('Training Data Word Tag Pair Count:', len(train_tagged_words), '\n')

print('------'*10, '\n')
validation_tagged_words = [tup for sent in validation_nltk_data for tup in sent]
print('Sample Validation Data Word Tags Pairs:', '\n')
print(validation_tagged_words[:10], '\n')
print('Validation Data Word Tag Pair Count:', len(validation_tagged_words), '\n')

Sample Training Data Word Tags Pairs:
[('IBM', 'NOUN'), (',', '.'), ('the', 'DET'), ('giant', 'ADJ'), ('computer', 'NOUN'), ('maker', 'NOUN'), (',', '.'), ('offered', 'VERB'), ('$', '.'), ('750', 'NUM')] 

Training Data Word Tag Pair Count: 95390 

------------------------------------------------------------ 

Sample Validation Data Word Tags Pairs: 

[('Wedtech', 'NOUN'), ('management', 'NOUN'), ('used', 'VERB'), ('the', 'DET'), ('merit', 'NOUN'), ('system', 'NOUN'), ('.', '.'), ('An', 'DET'), ('index-arbitrage', 'ADJ'), ('trade', 'NOUN')] 

Validation Data Word Tag Pair Count: 5286 



In [20]:
train_words = [word[0] for sent in train_nltk_data for word in sent]
train_tags = [word[1] for sent in train_nltk_data for word in sent]
print(train_words[:10])
print(train_tags[:10], '\n')

validation_words = [word[0] for sent in validation_nltk_data for word in sent]
validation_tags = [word[1] for sent in validation_nltk_data for word in sent]
print(validation_words[:10])
print(validation_tags[:10], '\n')

['IBM', ',', 'the', 'giant', 'computer', 'maker', ',', 'offered', '$', '750']
['NOUN', '.', 'DET', 'ADJ', 'NOUN', 'NOUN', '.', 'VERB', '.', 'NUM'] 

['Wedtech', 'management', 'used', 'the', 'merit', 'system', '.', 'An', 'index-arbitrage', 'trade']
['NOUN', 'NOUN', 'VERB', 'DET', 'NOUN', 'NOUN', '.', 'DET', 'ADJ', 'NOUN'] 



In [21]:
# Generate the tokens, tags (T) and vocabulary (V)
tokens = [pair[0] for pair in train_tagged_words]
print('Tokens:', tokens[:10], '\n')

T = set([pair[1] for pair in train_tagged_words])
print('Tags:', T)
print('Tags Size:', len(T), '\n')

V = set(tokens)
#print('Vocabulary:', V)
print('Vocabulary Size:', len(V), '\n')


Tokens: ['IBM', ',', 'the', 'giant', 'computer', 'maker', ',', 'offered', '$', '750'] 

Tags: {'CONJ', 'X', 'PRON', '.', 'ADP', 'PRT', 'VERB', 'DET', 'NUM', 'NOUN', 'ADV', 'ADJ'}
Tags Size: 12 

Vocabulary Size: 12060 



### 3B. Determine Emission Probabilities

In [22]:
# Allocate a T x V matrix (tags x vocabulary) for P(w|t)
t = len(T)
v = len(V)
w_given_t = np.zeros((t, v))


In [23]:
# compute word given tag: Emission Probability
def word_given_tag(word, tag, train_bag = train_tagged_words):
    """Return (count of `word` tagged `tag`, count of `tag`); their ratio is P(word|tag)."""
    tag_list = [pair for pair in train_bag if pair[1]==tag]
    count_tag = len(tag_list)
    w_given_tag_list = [pair[0] for pair in tag_list if pair[0]==word]
    count_w_given_tag = len(w_given_tag_list)
    
    return (count_w_given_tag, count_tag)

In [24]:
# Some examples:

# generally
print("generally:")
print(word_given_tag('generally', 'ADJ'))
print(word_given_tag('generally', 'VERB'))
print(word_given_tag('generally', 'NOUN'), "\n")

# will
print("will:")
print(word_given_tag('will', 'ADJ'))
print(word_given_tag('will', 'NOUN'))
print(word_given_tag('will', 'VERB'))

# book
print("book:")
print(word_given_tag('book', 'NOUN'))
print(word_given_tag('book', 'VERB'))

# courts

generally:
(0, 6063)
(0, 12845)
(0, 27333) 

will:
(0, 6063)
(1, 27333)
(260, 12845)
book:
(6, 27333)
(1, 12845)


In [25]:
x, y = word_given_tag('including', 'NOUN')
print('including as noun: ', x, 'out of ', y, 'nouns')

x, y = word_given_tag('including', 'ADJ')
print('including as adjectives: ', x, 'out of ', y, 'adjectives')

x, y = word_given_tag('including', 'VERB')
print('including as verb: ', x, 'out of ', y, 'verbs')

print('\n')

x, y = word_given_tag('normal', 'NOUN')
print('normal as noun: ', x, 'out of ', y, 'nouns')
x, y = word_given_tag('normal', 'ADJ')
print('normal as adjectives: ', x, 'out of ', y, 'adjectives')
x, y = word_given_tag('normal', 'VERB')
print('normal as verb: ', x, 'out of ', y, 'verbs')

including as noun:  0 out of  27333 nouns
including as adjectives:  0 out of  6063 adjectives
including as verb:  36 out of  12845 verbs


normal as noun:  0 out of  27333 nouns
normal as adjectives:  5 out of  6063 adjectives
normal as verb:  0 out of  12845 verbs
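
`word_given_tag` scans the whole training bag on every call, which makes tagging slow once it is called for every word and every tag. One possible alternative, sketched here on the assumption that the training data is a flat list of `(word, tag)` pairs, is to count once and look the values up afterwards. `build_emission_counts`, `emission_probability`, and `toy_bag` are illustrative names.

```python
from collections import Counter

def build_emission_counts(train_bag):
    """Count (word, tag) pairs and tags once, so later lookups are constant time."""
    pair_counts = Counter(train_bag)
    tag_counts = Counter(tag for _, tag in train_bag)
    return pair_counts, tag_counts

def emission_probability(word, tag, pair_counts, tag_counts):
    """Maximum-likelihood estimate of P(word | tag); 0.0 for a tag never seen."""
    if tag_counts[tag] == 0:
        return 0.0
    return pair_counts[(word, tag)] / tag_counts[tag]

# Toy stand-in for train_tagged_words.
toy_bag = [('will', 'VERB'), ('join', 'VERB'), ('will', 'NOUN'),
           ('board', 'NOUN'), ('the', 'DET'), ('will', 'VERB')]
pair_counts, tag_counts = build_emission_counts(toy_bag)
print(emission_probability('will', 'VERB', pair_counts, tag_counts))
print(emission_probability('will', 'NOUN', pair_counts, tag_counts))
print(emission_probability('zebra', 'NOUN', pair_counts, tag_counts))
```

The last call shows the unknown-word problem in miniature: a word absent from the training data has an emission probability of zero under every tag.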


### 3C. Determine Transition Probabilities

In [26]:
# compute tag given tag: tag2(t2) given tag1 (t1), i.e. Transition Probability

def t2_given_t1(t2, t1, train_bag = train_tagged_words):
    tags = [pair[1] for pair in train_bag]
    count_t1 = len([t for t in tags if t==t1])
    count_t2_t1 = 0
    for index in range(len(tags)-1):
        if tags[index]==t1 and tags[index+1] == t2:
            count_t2_t1 += 1
    return (count_t2_t1, count_t1)

In [27]:
# examples
print(t2_given_t1(t2='NOUN', t1='ADJ'))
print(t2_given_t1('NOUN', 'ADV'))
print(t2_given_t1('NOUN', 'DET'))
print(t2_given_t1('NOUN', 'VERB'))
print(t2_given_t1('.', 'NOUN'))
print(t2_given_t1('NOUN', 'NOUN'))
print(t2_given_t1('VERB', 'NOUN'))

(4248, 6063)
(96, 3008)
(5289, 8288)
(1415, 12845)
(6530, 27333)
(7194, 27333)
(4032, 27333)


In [28]:
# Note: P(tag | start of sentence) is approximated by P(tag | '.'), since a new sentence usually follows a '.' tag
print(t2_given_t1('DET', '.'))
print(t2_given_t1('VERB', '.'))
print(t2_given_t1('NOUN', '.'))


(1924, 11069)
(967, 11069)
(2447, 11069)


In [29]:
print(t2_given_t1('.', 'X'))
print(t2_given_t1('DET', 'X'))
print(t2_given_t1('VERB', 'X'))
print(t2_given_t1('NOUN', 'X'))
print(t2_given_t1('NUM', 'X'))

(1030, 6257)
(344, 6257)
(1281, 6257)
(388, 6257)
(9, 6257)


In [30]:
# creating t x t transition matrix of tags
# each column is t2, each row is t1
# thus M(i, j) represents P(tj given ti)

tags_matrix = np.zeros((len(T), len(T)), dtype='float32')
for i, t1 in enumerate(list(T)):
    for j, t2 in enumerate(list(T)): 
        # scan the corpus once per cell and reuse both counts
        count_t2_t1, count_t1 = t2_given_t1(t2, t1)
        tags_matrix[i, j] = count_t2_t1/count_t1

In [31]:
tags_matrix

array([[4.64468176e-04, 8.36042687e-03, 5.99163957e-02, 3.43706459e-02,
        5.29493727e-02, 5.10914996e-03, 1.56061307e-01, 1.17974922e-01,
        4.22666036e-02, 3.49280089e-01, 5.57361804e-02, 1.17510453e-01],
       [9.90890246e-03, 7.46364072e-02, 5.57775274e-02, 1.64615631e-01,
        1.44638002e-01, 1.84113786e-01, 2.04730704e-01, 5.49784228e-02,
        1.43838895e-03, 6.20105490e-02, 2.62106434e-02, 1.69410259e-02],
       [5.02706878e-03, 9.31941196e-02, 8.12064949e-03, 3.90564576e-02,
        2.24284604e-02, 1.31477183e-02, 4.83372003e-01, 1.00541376e-02,
        7.73395225e-03, 2.08043307e-01, 3.51894833e-02, 7.46326372e-02],
       [5.83611876e-02, 2.71027200e-02, 6.60402924e-02, 9.36850682e-02,
        9.14265066e-02, 2.43924465e-03, 8.73610973e-02, 1.73818767e-01,
        8.16695243e-02, 2.21067846e-01, 5.29406443e-02, 4.39967476e-02],
       [8.53970938e-04, 3.44790779e-02, 6.87446594e-02, 4.00298908e-02,
        1.72929112e-02, 1.38770277e-03, 8.64645559e-03, 3.23

In [32]:
# convert the matrix to a df for better readability
tags_df = pd.DataFrame(tags_matrix, columns = list(T), index=list(T))

tags_df

| t1 \ t2 | CONJ | X | PRON | . | ADP | PRT | VERB | DET | NUM | NOUN | ADV | ADJ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CONJ | 0.000464 | 0.00836 | 0.059916 | 0.034371 | 0.052949 | 0.005109 | 0.156061 | 0.117975 | 0.042267 | 0.34928 | 0.055736 | 0.11751 |
| X | 0.009909 | 0.074636 | 0.055778 | 0.164616 | 0.144638 | 0.184114 | 0.204731 | 0.054978 | 0.001438 | 0.062011 | 0.026211 | 0.016941 |
| PRON | 0.005027 | 0.093194 | 0.008121 | 0.039056 | 0.022428 | 0.013148 | 0.483372 | 0.010054 | 0.007734 | 0.208043 | 0.035189 | 0.074633 |
| . | 0.058361 | 0.027103 | 0.06604 | 0.093685 | 0.091427 | 0.002439 | 0.087361 | 0.173819 | 0.08167 | 0.221068 | 0.052941 | 0.043997 |
| ADP | 0.000854 | 0.034479 | 0.068745 | 0.04003 | 0.017293 | 0.001388 | 0.008646 | 0.323228 | 0.063834 | 0.32152 | 0.013877 | 0.106106 |
| PRT | 0.002291 | 0.013089 | 0.017343 | 0.042866 | 0.019961 | 0.001963 | 0.400851 | 0.102094 | 0.054319 | 0.249673 | 0.010144 | 0.085406 |
| VERB | 0.005294 | 0.217361 | 0.035189 | 0.035111 | 0.091865 | 0.031452 | 0.168548 | 0.135228 | 0.022733 | 0.11016 | 0.081199 | 0.065862 |
| DET | 0.000483 | 0.046211 | 0.00362 | 0.017616 | 0.009653 | 0.000241 | 0.039696 | 0.00555 | 0.022925 | 0.638152 | 0.012066 | 0.203789 |
| NUM | 0.014269 | 0.212247 | 0.001189 | 0.117122 | 0.035375 | 0.027348 | 0.018133 | 0.002378 | 0.182818 | 0.353746 | 0.002675 | 0.032699 |
| NOUN | 0.042879 | 0.028683 | 0.004646 | 0.238905 | 0.177588 | 0.044086 | 0.147514 | 0.013647 | 0.009549 | 0.263198 | 0.017049 | 0.012256 |
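
Each row of the transition table is a conditional distribution over the next tag, so it should sum to roughly 1. The sketch below builds the rows from bigram counts in one pass and checks that property on toy data; `transition_rows` and `toy_tags` are assumptions made for illustration. Because the denominator counts every occurrence of `t1`, including a final one with no successor, the row for the last tag in the sequence sums to slightly less than 1.

```python
from collections import Counter

def transition_rows(tags):
    """Return {t1: {t2: P(t2 | t1)}} computed as count(t1, t2) / count(t1)."""
    tag_counts = Counter(tags)
    bigram_counts = Counter(zip(tags, tags[1:]))
    tagset = sorted(tag_counts)
    return {
        t1: {t2: bigram_counts[(t1, t2)] / tag_counts[t1] for t2 in tagset}
        for t1 in tagset
    }

# Hypothetical toy tag sequence; it ends in '.', which therefore has one
# occurrence without a successor.
toy_tags = ['DET', 'NOUN', 'VERB', '.', 'DET', 'NOUN', '.', 'NOUN', 'VERB', '.']
rows = transition_rows(toy_tags)
for t1, row in rows.items():
    print(t1, round(sum(row.values()), 3))
```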


### 3D. Generate The Viterbi Algorithm

In [33]:
len(train_tagged_words)

95390

In [34]:
# Viterbi Heuristic (greedy: keeps only the best tag at each step)
def Viterbi(words, train_bag = train_tagged_words):
    state = []
    T = list(set([pair[1] for pair in train_bag]))
    
    for key, word in enumerate(words):
        # probability of each candidate tag for the current word
        p = [] 
        for tag in T:
            if key == 0:
                # treat the start of the input like the position after a '.'
                transition_p = tags_df.loc['.', tag]
            else:
                transition_p = tags_df.loc[state[-1], tag]
                
            # compute emission and state probabilities (one corpus scan per tag)
            count_w_given_tag, count_tag = word_given_tag(word, tag)
            emission_p = count_w_given_tag/count_tag
            state_probability = emission_p * transition_p    
            p.append(state_probability)
            
        pmax = max(p)
        # getting state for which probability is maximum
        # (for an unseen word every entry is 0, so the first tag in T is chosen)
        state_max = T[p.index(pmax)] 
        state.append(state_max)
    return list(zip(words, state))
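
The function above commits to the best tag one word at a time, so an early choice is never revised. For comparison, here is a minimal sketch of the dynamic-programming form of Viterbi, which scores every tag at every position and recovers the best complete path through backpointers. The toy probability tables and the name `viterbi_decode` are hypothetical; this is not the notebook's implementation.

```python
def viterbi_decode(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a toy HMM."""
    # best[i][t] = probability of the best path that ends in tag t at position i
    best = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Choose the previous tag that maximises the path probability into t.
            prev = max(tags, key=lambda p: best[i - 1][p] * trans_p[p][t])
            best[i][t] = best[i - 1][prev] * trans_p[prev][t] * emit_p[t].get(words[i], 0.0)
            back[i][t] = prev
    # Trace the backpointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(zip(words, reversed(path)))

# Hypothetical toy model with two tags.
tags = ['NOUN', 'VERB']
start_p = {'NOUN': 0.7, 'VERB': 0.3}
trans_p = {'NOUN': {'NOUN': 0.3, 'VERB': 0.7},
           'VERB': {'NOUN': 0.8, 'VERB': 0.2}}
emit_p = {'NOUN': {'traders': 0.5, 'trade': 0.1, 'stock': 0.4},
          'VERB': {'trade': 0.8, 'trades': 0.2}}
result = viterbi_decode(['traders', 'trade', 'stock'], tags, start_p, trans_p, emit_p)
print(result)
```

On long sentences the products become very small, so implementations of this kind typically work with sums of log-probabilities instead.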



In [35]:
# Running on the entire validation dataset would take more than 3-4 hrs. 
# Let's test our Viterbi algorithm on a few sample sentences of the validation dataset

random.seed(1234)

# choose 10 random sentences; randint is inclusive at both ends, so stop at len - 1
rndom = [random.randint(1, len(validation_nltk_data) - 1) for x in range(10)]

# list of sents
validation_run = [validation_nltk_data[i] for i in rndom]

# list of tagged words
validation_run_base = [tup for sent in validation_run for tup in sent]

# list of untagged words (note: this overwrites the earlier validation_tagged_words)
validation_tagged_words = [tup[0] for sent in validation_run for tup in sent]
validation_run

[[('Now', 'ADV'),
  (',', '.'),
  ('on', 'ADP'),
  ('a', 'DET'),
  ('good', 'ADJ'),
  ('day', 'NOUN'),
  (',', '.'),
  ('Chicago', 'NOUN'),
  ("'s", 'PRT'),
  ('stock-index', 'NOUN'),
  ('traders', 'NOUN'),
  ('trade', 'VERB'),
  ('more', 'ADJ'),
  ('dollars', 'NOUN'),
  ('worth', 'NOUN'),
  ('of', 'ADP'),
  ('stock', 'NOUN'),
  ('futures', 'NOUN'),
  ('than', 'ADP'),
  ('the', 'DET'),
  ('Big', 'NOUN'),
  ('Board', 'NOUN'),
  ('trades', 'VERB'),
  ('in', 'ADP'),
  ('stock', 'NOUN'),
  ('.', '.')],
 [('The', 'DET'),
  ('proposal', 'NOUN'),
  ('comes', 'VERB'),
  ('as', 'ADP'),
  ('a', 'DET'),
  ('surprise', 'NOUN'),
  ('even', 'ADV'),
  ('to', 'PRT'),
  ('administration', 'NOUN'),
  ('officials', 'NOUN'),
  ('and', 'CONJ'),
  ('temporarily', 'ADV'),
  ('throws', 'VERB'),
  ('into', 'ADP'),
  ('chaos', 'NOUN'),
  ('the', 'DET'),
  ('House', 'NOUN'),
  ("'s", 'PRT'),
  ('work', 'NOUN'),
  ('on', 'ADP'),
  ('clean-air', 'ADJ'),
  ('legislation', 'NOUN'),
  ('.', '.')],
 [('Unitholders', '

In [36]:
# tagging the sampled validation sentences
start = time.time()
tagged_seq = Viterbi(validation_tagged_words)
end = time.time()
difference = end-start

In [37]:
print("Time taken in seconds: ", difference)
print(tagged_seq)
#print(test_run_base)

Time taken in seconds:  37.18284225463867
[('Now', 'ADV'), (',', '.'), ('on', 'ADP'), ('a', 'DET'), ('good', 'ADJ'), ('day', 'NOUN'), (',', '.'), ('Chicago', 'NOUN'), ("'s", 'PRT'), ('stock-index', 'ADJ'), ('traders', 'NOUN'), ('trade', 'NOUN'), ('more', 'ADV'), ('dollars', 'NOUN'), ('worth', 'ADP'), ('of', 'ADP'), ('stock', 'NOUN'), ('futures', 'NOUN'), ('than', 'ADP'), ('the', 'DET'), ('Big', 'NOUN'), ('Board', 'NOUN'), ('trades', 'NOUN'), ('in', 'ADP'), ('stock', 'NOUN'), ('.', '.'), ('The', 'DET'), ('proposal', 'NOUN'), ('comes', 'VERB'), ('as', 'ADP'), ('a', 'DET'), ('surprise', 'NOUN'), ('even', 'ADV'), ('to', 'PRT'), ('administration', 'NOUN'), ('officials', 'NOUN'), ('and', 'CONJ'), ('temporarily', 'ADV'), ('throws', 'VERB'), ('into', 'ADP'), ('chaos', 'CONJ'), ('the', 'DET'), ('House', 'NOUN'), ("'s", 'PRT'), ('work', 'VERB'), ('on', 'ADP'), ('clean-air', 'ADJ'), ('legislation', 'NOUN'), ('.', '.'), ('Unitholders', 'CONJ'), ('will', 'VERB'), ('receive', 'VERB'), ('two', 'NUM')

In [38]:
# accuracy
check = [i for i, j in zip(tagged_seq, validation_run_base) if i == j] 
accuracy = len(check)/len(tagged_seq)
print('Accuracy of the Vanilla Viterbi Model is:', "{:.2%}".format(accuracy))

Accuracy of the Vanilla Viterbi Model is: 93.42%
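
To see which words account for the errors, the predicted and gold pairs can be compared position by position. A small sketch follows, with hypothetical toy lists standing in for `tagged_seq` and `validation_run_base`; `tagging_errors` is an illustrative helper.

```python
def tagging_errors(predicted, gold):
    """Return (word, predicted_tag, gold_tag) for every position where the tags differ."""
    return [(word, p_tag, g_tag)
            for (word, p_tag), (_, g_tag) in zip(predicted, gold)
            if p_tag != g_tag]

# Toy stand-ins for the tagger output and the gold-standard pairs.
toy_predicted = [('traders', 'NOUN'), ('trade', 'NOUN'), ('more', 'ADV'), ('dollars', 'NOUN')]
toy_gold = [('traders', 'NOUN'), ('trade', 'VERB'), ('more', 'ADJ'), ('dollars', 'NOUN')]

errors = tagging_errors(toy_predicted, toy_gold)
accuracy = 1 - len(errors) / len(toy_gold)
print(errors)
print("{:.2%}".format(accuracy))
```

A list like this is also a convenient starting point for picking the cases that a modified tagger corrects.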


In [39]:
# Determine count of words, where tag is unknown (X)
ctr = 0
for seq in tagged_seq:
    if seq[1] == 'X':
        ctr += 1
        print(seq[0])
print('Total number of words tagged unknown:', ctr, 'out of: ', len(tagged_seq))

*-1
0
*T*-2
*T*-1
0
*T*-2
*T*-1
0
*
*
*T*-1
0
*T*-1
*U*
Total number of words tagged unknown: 14 out of:  243
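
The words printed above are tokens such as `*T*-1` and `*U*` that carry the `X` tag; that tag is a label in the tagset, not a sign that a word is missing from the training vocabulary. The words the vanilla tagger has no evidence for are those never seen in training. A sketch for listing them, assuming a training vocabulary set like `V` and a list of words to tag (the toy values and `out_of_vocabulary` are illustrative):

```python
def out_of_vocabulary(words, vocabulary):
    """Return the words (in order, without repeats) that are absent from `vocabulary`."""
    seen = set()
    unknown = []
    for word in words:
        if word not in vocabulary and word not in seen:
            seen.add(word)
            unknown.append(word)
    return unknown

# Toy stand-ins for the training vocabulary and the words to be tagged.
toy_vocabulary = {'the', 'House', "'s", 'work', 'on', 'legislation', '.'}
toy_words = ['Unitholders', 'work', 'on', 'chaos', 'legislation', 'chaos', '.']
unknown = out_of_vocabulary(toy_words, toy_vocabulary)
print(unknown)
```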


In [40]:
unknown_words = [word[0] for sent in validation_nltk_data for word in sent if word[1] == 'X']
print(unknown_words)

['0', '*T*-1', '*-1', '0', '*T*-2', '*-1', '*EXP*-1', '*', '*T*-2', '*U*', '*U*', '*T*-235', '0', '*T*-116', '0', '*U*', '*-2', '*-1', '*T*-156', '*-1', '0', '*-3', '*T*-2', '*U*', '*-1', '*T*-75', '*-1', '*T*-2', '*U*', '*U*', '*-46', '0', '*T*-1', '*T*-1', '0', '*T*-2', '*T*-1', '*-24', '0', '*T*-2', '*U*', '*-53', '*-54', '*T*-1', '0', '*', '*', '*', '0', '*', '*T*-1', '*T*-2', '*-1', '*-1', '*T*-1', '0', '*T*-2', '*-3', '*T*-4', '0', '*U*', '*', '*T*-1', '*-1', '*T*-2', '*-1', '0', '*T*-3', '0', '*T*-1', '*', '*U*', '*', '0', '*-33', '*-1', '*', '*', '*', '*-1', '*T*-2', '0', '*', '0', '*-1', '*', '*T*-1', '*U*', '*ICH*-1', '*U*', '*-4', '*U*', '*U*', '*-3', '*U*', '*-3', '*-2', '*U*', '*-3', '*-5', '*U*', '*-5', '*-1', '*U*', '*-5', '*T*-1', '*T*-1', '0', '*', '*T*-2', '*T*-1', '*-35', '*', '*T*-2', '*', '0', '*ICH*-1', '*-2', '*-1', '*-1', '*', '0', '*-1', '0', '*', '*T*-1', '*-1', '*-19', '*T*-15', '*-1', '*T*-71', '*-1', '0', '*-2', '*T*-3', '*', '0', '*T*-1', '*', '*', '*T*-1'

## Part 4: Improve Results Using Alternative Models
- 4A. Using Combination Tagger (Lexicon & Rule Based)
- 4B. Using Conditional Random Fields (CRF)

### 4A: Using A Combination Tagger (Lexicon and Rule based)

In [41]:
# Lexicon (or unigram tagger)
unigram_tagger = nltk.UnigramTagger(train_nltk_data)
patterns = [
(r'^-?[0-9]+(.[0-9]+)?$', 'NUM'),   # cardinal numbers
#('^(?=.*[0-9]$)(?=.*[a-zA-Z])', 'NUM'),
#('.*\d', 'NUM'),
#('.*\*', 'NUM'),
(r'(The|the|A|a|An|an)$', 'DET'),   # determinants
(r'(And|and|For|for|But|but|Or|or|Yet|yet|So|so|Because|because)$', 'CONJ'),   # conjuctions
(r'(She|she|Him|him|Her|her|He|he|I|i|Me|me|You|you|We|we|Us|us|They|they|Them|them)$', 'PRON'),   # Pronouns
(r'.*able$', 'ADJ'),                # adjectives
(r'.*ness$', 'NOUN'),               # nouns formed from adjectives
(r'.*ly$', 'ADV'),                  # adverbs
(r'.*s$', 'NOUN'),                  # plural nouns
(r'.*ing$', 'VERB'),                # gerunds
(r'.*ed$', 'VERB'),                 # past tense verbs
(r'.*\d', 'NUM'),                   # cardinal numbers (default)
(r'.\*', 'NUM'),
(r'.?0', 'NUM'),
(r'.*', 'NOUN')
]
unigram_tagger.evaluate(validation_nltk_data)

# rule based tagger
rule_based_tagger = nltk.RegexpTagger(patterns)

# lexicon backed up by the rule-based tagger
combination_tagger = nltk.UnigramTagger(train_nltk_data, backoff=rule_based_tagger)

# Evaluate against the validation data and store the accuracy
acc = combination_tagger.evaluate(validation_nltk_data)

In [42]:
print('Accuracy with combination Tagger:', "{:.2%}".format(acc))

Accuracy with combination Tagger: 94.53%


Accuracy on the validation set improves from ~93% with the vanilla Viterbi tagger to ~95% (94.53%) with the combination tagger.
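
The backoff idea behind the combination tagger can be sketched without NLTK: look each word up in a lexicon first, and only fall back to ordered regular-expression rules when the word was never seen. The lexicon entries, the shortened rule list, and the helper names `rule_tag` and `backoff_tag` below are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical toy lexicon: word -> most frequent tag seen in training.
LEXICON = {"the": "DET", "flights": "NOUN", "is": "VERB"}

# Ordered rules; the first pattern that matches wins, so specific rules
# must come before the catch-all default.
RULES = [
    (r"^-?[0-9]+(\.[0-9]+)?$", "NUM"),  # cardinal numbers
    (r".*ing$", "VERB"),                # gerunds
    (r".*ed$", "VERB"),                 # past tense verbs
    (r".*", "NOUN"),                    # default for everything else
]

def rule_tag(word):
    """Return the tag of the first rule whose pattern matches the word."""
    for pattern, tag in RULES:
        if re.match(pattern, word):
            return tag
    return None

def backoff_tag(words):
    """Tag each word from the lexicon, falling back to the rules if unseen."""
    return [(w, LEXICON.get(w) or rule_tag(w)) for w in words]

print(backoff_tag(["the", "flights", "arriving", "2015", "Twitter"]))
```

Known words never reach the rules, so the rule list only has to be good at guessing tags for unknown words, which is the case the vanilla Viterbi tagger handles poorly.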

### 4B: Using Conditional Random Fields (CRF)

Creating the Feature Function

In [43]:
def features(sentence,index):
    ### sentence is of the form [w1,w2,w3,..], index is the position of the word in the sentence
    return {
        'is_first_capital':int(sentence[index][0].isupper()),
        'is_first_word': int(index==0),
        'is_last_word':int(index==len(sentence)-1),
        'is_complete_capital': int(sentence[index].upper()==sentence[index]),
        'prev_word':'' if index==0 else sentence[index-1],
        'next_word':'' if index==len(sentence)-1 else sentence[index+1],
        'is_numeric':int(sentence[index].isdigit()),
        'is_alphanumeric': int(bool((re.match('^(?=.*[0-9]$)(?=.*[a-zA-Z])',sentence[index])))),
        'prefix_1':sentence[index][0],
        'prefix_2': sentence[index][:2],
        'prefix_3':sentence[index][:3],
        'prefix_4':sentence[index][:4],
        'suffix_1':sentence[index][-1],
        'suffix_2':sentence[index][-2:],
        'suffix_3':sentence[index][-3:],
        'suffix_4':sentence[index][-4:],
        'word_has_hyphen': 1 if '-' in sentence[index] else 0  
         }
def untag(sentence):
    return [word for word,tag in sentence]


def prepareData(tagged_sentences):
    X,y=[],[]
    for sentences in tagged_sentences:
        X.append([features(untag(sentences), index) for index in range(len(sentences))])
        y.append([tag for word,tag in sentences])
    return X,y
X_train,y_train=prepareData(train_nltk_data)
X_test,y_test=prepareData(validation_nltk_data)
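
To see what the CRF actually receives, it can help to print the feature dictionary for each token of a short sentence. The sketch below uses a trimmed-down, hypothetical variant of the feature function above (`word_features` is an illustrative name, and it keeps only a subset of the features) so that it runs on its own.

```python
def word_features(sentence, index):
    """Trimmed-down, hypothetical variant of the notebook's feature function."""
    word = sentence[index]
    return {
        "is_first_word": int(index == 0),
        "is_first_capital": int(word[0].isupper()),
        "is_numeric": int(word.isdigit()),
        "prev_word": "" if index == 0 else sentence[index - 1],
        "next_word": "" if index == len(sentence) - 1 else sentence[index + 1],
        "suffix_3": word[-3:],
        "word_has_hyphen": int("-" in word),
    }

sentence = ["Android", "is", "the", "best-selling", "OS"]
for i, word in enumerate(sentence):
    print(word, word_features(sentence, i))
```

Because features such as capitalisation, suffixes and neighbouring words are available even for words that never occurred in training, the CRF has evidence to work with where the HMM's emission probabilities are zero.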

In [44]:
crf = CRF(
    algorithm='lbfgs',
    c1=0.01,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=True,
    averaging=None, c=None, c1=0.01, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

In [45]:
y_pred=crf.predict(X_test)
print("F1 score on Test Data ")
print(metrics.flat_f1_score(y_test, y_pred,average='weighted',labels=crf.classes_))
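print("F score on Training Data ")
y_pred_train=crf.predict(X_train)
# This is not the last expression in the cell, so the value is computed but never displayed;
# wrap the call in print() to show the training score.
metrics.flat_f1_score(y_train, y_pred_train,average='weighted',labels=crf.classes_)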

### Look at class wise score
print(metrics.flat_classification_report(
    y_test, y_pred, labels=crf.classes_, digits=3
))

F1 score on Test Data 
0.9698436007288416
F score on Training Data 
              precision    recall  f1-score   support

        NOUN      0.960     0.981     0.971      1534
           .      1.000     1.000     1.000       646
         DET      0.995     0.995     0.995       437
         ADJ      0.897     0.859     0.878       334
        VERB      0.965     0.947     0.956       719
         NUM      1.000     0.989     0.994       182
           X      1.000     1.000     1.000       356
         ADP      0.968     0.988     0.978       489
         PRT      0.969     0.969     0.969       163
        CONJ      0.991     0.991     0.991       112
         ADV      0.907     0.834     0.869       163
        PRON      1.000     0.993     0.997       151

    accuracy                          0.970      5286
   macro avg      0.971     0.962     0.966      5286
weighted avg      0.970     0.970     0.970      5286



Accuracy on the validation set has now improved to about 97% (vanilla Viterbi model: ~93%; combination tagger: ~95%).
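
The scores above are token-level: the per-sentence tag sequences are flattened into one long list before predictions and reference tags are compared. A minimal sketch of that idea, using a hypothetical helper `flat_accuracy` and toy data rather than the notebook's predictions:

```python
from itertools import chain

def flat_accuracy(y_true, y_pred):
    """Token-level accuracy over a list of tag sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    if len(true_flat) != len(pred_flat):
        raise ValueError("sequences must have the same total length")
    correct = sum(t == p for t, p in zip(true_flat, pred_flat))
    return correct / len(true_flat)

# Toy data: two sentences, five tokens, one tagging error.
y_true_toy = [["DET", "NOUN", "VERB"], ["NOUN", "VERB"]]
y_pred_toy = [["DET", "NOUN", "NOUN"], ["NOUN", "VERB"]]
print(flat_accuracy(y_true_toy, y_pred_toy))  # 4 of 5 tokens correct
```

Every token carries the same weight regardless of sentence length, so long sentences contribute more to the score than short ones.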

## Part 5: Load The Test Data File

In [46]:
filename = 'Test_sentences.txt'
# Use a context manager so the file is closed even if reading fails
with open(filename, 'r') as f:
    test_data = f.read()
test_data = test_data.strip() # Remove leading and trailing whitespace from the test data
print(test_data)

Android is a mobile operating system developed by Google.
Android has been the best-selling OS worldwide on smartphones since 2011 and on tablets since 2013.
Google and Twitter made a deal in 2015 that gave Google access to Twitter's firehose.
Twitter is an online news and social networking service on which users post and interact with messages known as tweets.
Before entering politics, Donald Trump was a domineering businessman and a television personality.
The 2018 FIFA World Cup is the 21st FIFA World Cup, an international football tournament contested once every four years.
This is the first World Cup to be held in Eastern Europe and the 11th time that it has been held in Europe.
Show me the cheapest round trips from Dallas to Atlanta
I would like to see flights from Denver to Philadelphia.
Show me the price of the flights leaving Atlanta at about 3 in the afternoon and arriving in San Francisco.
NASA invited social media users to experience the launch of ICESAT-2 Satellite.


## Part 6: Demonstrate Improvement In the Two Models Over the Vanilla Viterbi Model

### Part 6A: Improvement From The Combination Tagger Model

Run the Vanilla Viterbi Model on Test Data

In [47]:
test_data = re.sub('\\n', ' ', test_data)
print(test_data)

Android is a mobile operating system developed by Google. Android has been the best-selling OS worldwide on smartphones since 2011 and on tablets since 2013. Google and Twitter made a deal in 2015 that gave Google access to Twitter's firehose. Twitter is an online news and social networking service on which users post and interact with messages known as tweets. Before entering politics, Donald Trump was a domineering businessman and a television personality. The 2018 FIFA World Cup is the 21st FIFA World Cup, an international football tournament contested once every four years. This is the first World Cup to be held in Eastern Europe and the 11th time that it has been held in Europe. Show me the cheapest round trips from Dallas to Atlanta I would like to see flights from Denver to Philadelphia. Show me the price of the flights leaving Atlanta at about 3 in the afternoon and arriving in San Francisco. NASA invited social media users to experience the launch of ICESAT-2 Satellite.


In [48]:
test_list = list(test_data.split(' '))
test_viterbi_tags = Viterbi(test_list)
print(test_viterbi_tags)

[('Android', 'CONJ'), ('is', 'VERB'), ('a', 'DET'), ('mobile', 'ADJ'), ('operating', 'NOUN'), ('system', 'NOUN'), ('developed', 'VERB'), ('by', 'ADP'), ('Google.', 'CONJ'), ('Android', 'CONJ'), ('has', 'VERB'), ('been', 'VERB'), ('the', 'DET'), ('best-selling', 'ADJ'), ('OS', 'CONJ'), ('worldwide', 'CONJ'), ('on', 'ADP'), ('smartphones', 'CONJ'), ('since', 'ADP'), ('2011', 'CONJ'), ('and', 'CONJ'), ('on', 'ADP'), ('tablets', 'NOUN'), ('since', 'ADP'), ('2013.', 'CONJ'), ('Google', 'CONJ'), ('and', 'CONJ'), ('Twitter', 'CONJ'), ('made', 'VERB'), ('a', 'DET'), ('deal', 'NOUN'), ('in', 'ADP'), ('2015', 'CONJ'), ('that', 'DET'), ('gave', 'VERB'), ('Google', 'CONJ'), ('access', 'NOUN'), ('to', 'PRT'), ("Twitter's", 'CONJ'), ('firehose.', 'CONJ'), ('Twitter', 'CONJ'), ('is', 'VERB'), ('an', 'DET'), ('online', 'CONJ'), ('news', 'NOUN'), ('and', 'CONJ'), ('social', 'ADJ'), ('networking', 'NOUN'), ('service', 'NOUN'), ('on', 'ADP'), ('which', 'DET'), ('users', 'NOUN'), ('post', 'NOUN'), ('and',
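
Splitting on single spaces leaves punctuation attached to tokens such as 'Google.' and '2013.', which then look like unknown words to the tagger. One possible alternative, sketched here with a hypothetical helper `simple_tokenize`, is a regular expression that keeps hyphenated and apostrophised words together and emits each punctuation mark as its own token. This is a simplification and does not follow the Penn Treebank convention of splitting off "'s".

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and separate punctuation marks.

    Keeps internal hyphens and apostrophes (e.g. 'best-selling', "Twitter's")
    together with the word; any other non-space character becomes its own token.
    """
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

sample = "Android is a mobile operating system developed by Google."
print(simple_tokenize(sample))
```

With this tokenisation the final full stop becomes a separate '.' token, a tag that the tagset used here already contains.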

Words incorrectly tagged in the vanilla Viterbi model and correctly tagged in the combination model

In [49]:
print(Viterbi(['Android']))
print(Viterbi(['os']))
print(Viterbi(['Twitter']))
print(Viterbi(['FIFA']))
print(Viterbi(['2015']))

[('Android', 'CONJ')]
[('os', 'CONJ')]
[('Twitter', 'CONJ')]
[('FIFA', 'CONJ')]
[('2015', 'CONJ')]
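
Every unknown word receives the same tag, CONJ. The Viterbi function is not shown in this part of the notebook, so the following is only a plausible explanation under stated assumptions: if each tag is scored as a probability multiplied by the emission probability, an unseen word has emission probability 0 for every tag, all scores tie at 0, and the first tag in the tag list wins. The tag order and all probabilities below are made up for illustration.

```python
# Hypothetical tag order and probabilities, for illustration only.
tags = ["CONJ", "NOUN", "VERB"]
start_prob = {"CONJ": 0.05, "NOUN": 0.30, "VERB": 0.10}
emission = {"NOUN": {"deal": 0.01}, "VERB": {"made": 0.02}, "CONJ": {"and": 0.4}}

def best_tag(word):
    """Pick the tag that maximises start probability x emission probability."""
    scores = [start_prob[t] * emission[t].get(word, 0.0) for t in tags]
    # list.index returns the first position of the maximum, so ties go to tags[0].
    return tags[scores.index(max(scores))], scores

print(best_tag("deal"))     # known word: NOUN wins on a non-zero score
print(best_tag("Android"))  # unknown word: every score is 0.0, so tags[0] wins
```

If this is what happens, the tag assigned to unknown words carries no information about the word itself, which is why a backoff rule or a feature-based model helps.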


In [50]:
print(combination_tagger.tag(['Android']))
print(combination_tagger.tag(['os']))
print(combination_tagger.tag(['Twitter']))
print(combination_tagger.tag(['FIFA']))
print(combination_tagger.tag(['2015']))

[('Android', 'NOUN')]
[('os', 'NOUN')]
[('Twitter', 'NOUN')]
[('FIFA', 'NOUN')]
[('2015', 'NUM')]


The vanilla Viterbi model tags all five words as CONJ, while the combination tagger tags 'Android', 'os', 'Twitter' and 'FIFA' as NOUN and '2015' as NUM.
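
The comparison above can also be made programmatically. The sketch below lists the words that a baseline tagger got wrong and an improved tagger got right; the helper name `corrected_words` is hypothetical, the tag pairs are copied from the outputs above, and the reference tags are hand-assigned assumptions rather than gold annotations.

```python
def corrected_words(baseline_tags, improved_tags, expected):
    """List words the baseline tagged wrongly and the improved tagger got right.

    All three arguments are lists of (word, tag) pairs over the same tokens;
    `expected` holds hand-checked reference tags.
    """
    fixes = []
    for (word, base), (_, new), (_, gold) in zip(baseline_tags, improved_tags, expected):
        if base != gold and new == gold:
            fixes.append((word, base, new))
    return fixes

# Tags copied from the outputs above; the reference tags are assumptions.
viterbi_out = [("Android", "CONJ"), ("Twitter", "CONJ"), ("2015", "CONJ")]
combo_out = [("Android", "NOUN"), ("Twitter", "NOUN"), ("2015", "NUM")]
reference = [("Android", "NOUN"), ("Twitter", "NOUN"), ("2015", "NUM")]

for word, old, new in corrected_words(viterbi_out, combo_out, reference):
    print(f"{word}: {old} -> {new}")
```

The same helper works for any pair of taggers, as long as both are run on the same token list.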

### Part 6B: Improvement From The CRF Model

In [51]:
print(Viterbi(['Android']))
print(Viterbi(['Twitter']))
print(Viterbi(['FIFA']))
print(Viterbi(['2015']))

[('Android', 'CONJ')]
[('Twitter', 'CONJ')]
[('FIFA', 'CONJ')]
[('2015', 'CONJ')]


In [52]:
# Extract the Viterbi determined tags 

list_tags = []
list_tags.append(Viterbi(['Android']))
list_tags.append(Viterbi(['Twitter']))
list_tags.append(Viterbi(['FIFA']))
list_tags.append(Viterbi(['2015']))

tag_Android = list_tags[0][0][1]
tag_Twitter = list_tags[1][0][1]
tag_FIFA = list_tags[2][0][1]
tag_2015 = list_tags[3][0][1]

In [53]:
print(tag_Android)
print(tag_Twitter)
print(tag_FIFA)
print(tag_2015)

CONJ
CONJ
CONJ
CONJ


In [54]:
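# Only the words are used to build the features; the tags passed here are placeholders
X_sample, _ = prepareData([[('Android', tag_Android)], [('Twitter', tag_Twitter)], [('FIFA', tag_FIFA)], [('2015', tag_2015)]])
y_pred=crf.predict(X_sample)
print(y_pred)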

[['NOUN'], ['NOUN'], ['NOUN'], ['NUM']]


In [55]:
print(test_list)

['Android', 'is', 'a', 'mobile', 'operating', 'system', 'developed', 'by', 'Google.', 'Android', 'has', 'been', 'the', 'best-selling', 'OS', 'worldwide', 'on', 'smartphones', 'since', '2011', 'and', 'on', 'tablets', 'since', '2013.', 'Google', 'and', 'Twitter', 'made', 'a', 'deal', 'in', '2015', 'that', 'gave', 'Google', 'access', 'to', "Twitter's", 'firehose.', 'Twitter', 'is', 'an', 'online', 'news', 'and', 'social', 'networking', 'service', 'on', 'which', 'users', 'post', 'and', 'interact', 'with', 'messages', 'known', 'as', 'tweets.', 'Before', 'entering', 'politics,', 'Donald', 'Trump', 'was', 'a', 'domineering', 'businessman', 'and', 'a', 'television', 'personality.', 'The', '2018', 'FIFA', 'World', 'Cup', 'is', 'the', '21st', 'FIFA', 'World', 'Cup,', 'an', 'international', 'football', 'tournament', 'contested', 'once', 'every', 'four', 'years.', 'This', 'is', 'the', 'first', 'World', 'Cup', 'to', 'be', 'held', 'in', 'Eastern', 'Europe', 'and', 'the', '11th', 'time', 'that', 'it'

The four sample words that the vanilla Viterbi model tags as CONJ are tagged by the CRF model as NOUN, NOUN, NOUN and NUM respectively.

###### For these previously unseen words from the test dataset, the CRF model produces correct tags where the vanilla Viterbi model does not.

# End of Exercise