# 2020AIML544 - TEXT MINING - MINI PROJECT 1

## Overall Approach

### TASKS PERFORMED:
1. OPEN THE TEXT FILE IN READ MODE
2. SPLIT THE TEXT FILE INTO 8 DOCUMENTS
3. REMOVE SPECIAL CHARACTERS FROM THE TEXT DATA
4. CONVERT THE TEXT TO LOWER CASE
5. REMOVE STOP WORDS IN THE DATA
6. IMPLEMENT TF-IDF ALGORITHM FROM SCRATCH
7. LIST DOWN TOP 10 WORDS WITH HIGHEST TF-IDF VALUE
8. LABEL THE TF-IDF DATASET USING POS TAGGING
9. SPLIT THE TRAIN AND TEST DATASET
10. IMPLEMENT VITERBI ALGORITHM TO GET POS TAGGING
11. CALCULATE ACCURACY AND F1 SCORE
12. USING LDA, CREATE 10 TOPICS AND LIST 10 WORDS FROM EACH TOPIC


#### Import all needed libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import gensim
import gensim.corpora as corpora

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

#### Open the input text file in read mode

In [2]:
with open('TF-IDF_dataset.txt', 'r', encoding='UTF-8') as file:
    data = file.read().replace('\n', ' ')

#### Split the text file into 8 documents based on Chapter

In [3]:
text_file = data.split("Chapter ", 8)
text_file

['',
 '1  I am by birth a Genevese, and my family is one of the most distinguished of that republic.  My ancestors had been for many years counsellors and syndics, and my father had filled several public situations with honour and reputation.  He was respected by all who knew him for his integrity and indefatigable attention to public business.  He passed his younger days perpetually occupied by the affairs of his country; a variety of circumstances had prevented his marrying early, nor was it until the decline of life that he became a husband and the father of a family.  As the circumstances of his marriage illustrate his character, I cannot refrain from relating them.  One of his most intimate friends was a merchant who, from a flourishing state, fell, through numerous mischances, into poverty.  This man, whose name was Beaufort, was of a proud and unbending disposition and could not bear to live in poverty and oblivion in the same country where he had formerly been distinguished for

#### Remove the 1st empty string

In [4]:
text_file.pop(0)
text_file

['1  I am by birth a Genevese, and my family is one of the most distinguished of that republic.  My ancestors had been for many years counsellors and syndics, and my father had filled several public situations with honour and reputation.  He was respected by all who knew him for his integrity and indefatigable attention to public business.  He passed his younger days perpetually occupied by the affairs of his country; a variety of circumstances had prevented his marrying early, nor was it until the decline of life that he became a husband and the father of a family.  As the circumstances of his marriage illustrate his character, I cannot refrain from relating them.  One of his most intimate friends was a merchant who, from a flourishing state, fell, through numerous mischances, into poverty.  This man, whose name was Beaufort, was of a proud and unbending disposition and could not bear to live in poverty and oblivion in the same country where he had formerly been distinguished for his 

### <span style='background : yellow' > **TASK 2:**</span> Remove punctuation and special characters, and convert the text to lower case

In [5]:
for i in range(0,8):
    text_file[i] = re.sub('[^a-zA-Z \n]', '', text_file[i])
text_file

['  I am by birth a Genevese and my family is one of the most distinguished of that republic  My ancestors had been for many years counsellors and syndics and my father had filled several public situations with honour and reputation  He was respected by all who knew him for his integrity and indefatigable attention to public business  He passed his younger days perpetually occupied by the affairs of his country a variety of circumstances had prevented his marrying early nor was it until the decline of life that he became a husband and the father of a family  As the circumstances of his marriage illustrate his character I cannot refrain from relating them  One of his most intimate friends was a merchant who from a flourishing state fell through numerous mischances into poverty  This man whose name was Beaufort was of a proud and unbending disposition and could not bear to live in poverty and oblivion in the same country where he had formerly been distinguished for his rank and magnifice

In [6]:
for i in range(0,8):
    text_file[i] = text_file[i].lower()
text_file

['  i am by birth a genevese and my family is one of the most distinguished of that republic  my ancestors had been for many years counsellors and syndics and my father had filled several public situations with honour and reputation  he was respected by all who knew him for his integrity and indefatigable attention to public business  he passed his younger days perpetually occupied by the affairs of his country a variety of circumstances had prevented his marrying early nor was it until the decline of life that he became a husband and the father of a family  as the circumstances of his marriage illustrate his character i cannot refrain from relating them  one of his most intimate friends was a merchant who from a flourishing state fell through numerous mischances into poverty  this man whose name was beaufort was of a proud and unbending disposition and could not bear to live in poverty and oblivion in the same country where he had formerly been distinguished for his rank and magnifice

### <span style='background : yellow' > **TASK 1:**</span> Remove Stopwords

In [7]:
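# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
for i in range(0,8):
    text_file[i] = ' '.join([word for word in text_file[i].split() if word not in stop_words])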
text_file

['birth genevese family one distinguished republic ancestors many years counsellors syndics father filled several public situations honour reputation respected knew integrity indefatigable attention public business passed younger days perpetually occupied affairs country variety circumstances prevented marrying early decline life became husband father family circumstances marriage illustrate character cannot refrain relating one intimate friends merchant flourishing state fell numerous mischances poverty man whose name beaufort proud unbending disposition could bear live poverty oblivion country formerly distinguished rank magnificence paid debts therefore honourable manner retreated daughter town lucerne lived unknown wretchedness father loved beaufort truest friendship deeply grieved retreat unfortunate circumstances bitterly deplored false pride led friend conduct little worthy affection united lost time endeavouring seek hope persuading begin world credit assistance',
 'brought tog

### <span style='background : yellow' > **TASK 3:**</span> Create bigrams for the entire dataset and list down 20 most frequent bigrams

In [8]:
def bigram_sequence(docs):
    # Pair each token with its successor, one document at a time,
    # so that no bigram spans a chapter boundary
    result = [a for ls in docs for a in zip(ls.split(" ")[:-1], ls.split(" ")[1:])]
    return result

t_bigrams = bigram_sequence(text_file)
print("\nBigram sequence of the said list:")
print(t_bigrams)


Bigram sequence of the said list:
[('birth', 'genevese'), ('genevese', 'family'), ('family', 'one'), ('one', 'distinguished'), ('distinguished', 'republic'), ('republic', 'ancestors'), ('ancestors', 'many'), ('many', 'years'), ('years', 'counsellors'), ('counsellors', 'syndics'), ('syndics', 'father'), ('father', 'filled'), ('filled', 'several'), ('several', 'public'), ('public', 'situations'), ('situations', 'honour'), ('honour', 'reputation'), ('reputation', 'respected'), ('respected', 'knew'), ('knew', 'integrity'), ('integrity', 'indefatigable'), ('indefatigable', 'attention'), ('attention', 'public'), ('public', 'business'), ('business', 'passed'), ('passed', 'younger'), ('younger', 'days'), ('days', 'perpetually'), ('perpetually', 'occupied'), ('occupied', 'affairs'), ('affairs', 'country'), ('country', 'variety'), ('variety', 'circumstances'), ('circumstances', 'prevented'), ('prevented', 'marrying'), ('marrying', 'early'), ('early', 'decline'), ('decline', 'life'), ('life', 'b

In [9]:
from collections import Counter
counts = Counter(t_bigrams)
print(counts.most_common(20))

[(('native', 'country'), 2), (('natural', 'philosophy'), 2), (('two', 'years'), 2), (('thought', 'returning'), 2), (('long', 'time'), 2), (('return', 'us'), 2), (('birth', 'genevese'), 1), (('genevese', 'family'), 1), (('family', 'one'), 1), (('one', 'distinguished'), 1), (('distinguished', 'republic'), 1), (('republic', 'ancestors'), 1), (('ancestors', 'many'), 1), (('many', 'years'), 1), (('years', 'counsellors'), 1), (('counsellors', 'syndics'), 1), (('syndics', 'father'), 1), (('father', 'filled'), 1), (('filled', 'several'), 1), (('several', 'public'), 1)]


#### TOKENIZATION

In [10]:
# Tokenizing strings in list of strings
res = list(map(str.split, text_file))
print(res)

[['birth', 'genevese', 'family', 'one', 'distinguished', 'republic', 'ancestors', 'many', 'years', 'counsellors', 'syndics', 'father', 'filled', 'several', 'public', 'situations', 'honour', 'reputation', 'respected', 'knew', 'integrity', 'indefatigable', 'attention', 'public', 'business', 'passed', 'younger', 'days', 'perpetually', 'occupied', 'affairs', 'country', 'variety', 'circumstances', 'prevented', 'marrying', 'early', 'decline', 'life', 'became', 'husband', 'father', 'family', 'circumstances', 'marriage', 'illustrate', 'character', 'cannot', 'refrain', 'relating', 'one', 'intimate', 'friends', 'merchant', 'flourishing', 'state', 'fell', 'numerous', 'mischances', 'poverty', 'man', 'whose', 'name', 'beaufort', 'proud', 'unbending', 'disposition', 'could', 'bear', 'live', 'poverty', 'oblivion', 'country', 'formerly', 'distinguished', 'rank', 'magnificence', 'paid', 'debts', 'therefore', 'honourable', 'manner', 'retreated', 'daughter', 'town', 'lucerne', 'lived', 'unknown', 'wretch

### <span style='background : yellow' > **TASK 4:**</span> TF-IDF IMPLEMENTATION FROM SCRATCH

In [11]:
# Create vocabulary from the corpus
sentences = []          # one token list per document
word_set = set()        # unique tokens across all documents
total_documents = len(text_file)

for sent in text_file:
    tokens = [token.lower() for token in word_tokenize(sent)]
    sentences.append(tokens)
    word_set.update(tokens)
word_set

{'able',
 'absence',
 'absent',
 'abstruse',
 'accompanied',
 'accomplishment',
 'accordingly',
 'account',
 'acquaintance',
 'acquainted',
 'act',
 'activity',
 'adduced',
 'admiration',
 'adventure',
 'aerial',
 'affairs',
 'affection',
 'afterwards',
 'age',
 'ages',
 'aggravation',
 'agitated',
 'agony',
 'agrippa',
 'air',
 'akin',
 'alarming',
 'alas',
 'almost',
 'alpine',
 'already',
 'also',
 'although',
 'always',
 'among',
 'amounted',
 'ample',
 'ancestors',
 'another',
 'anxiety',
 'appearance',
 'appearances',
 'appeared',
 'application',
 'applied',
 'apprehension',
 'apprehensions',
 'ardent',
 'ardour',
 'arguments',
 'around',
 'arrive',
 'arrived',
 'arteries',
 'arthur',
 'ascribed',
 'asked',
 'asks',
 'assistance',
 'assured',
 'astonishment',
 'attach',
 'attained',
 'attainment',
 'attend',
 'attendants',
 'attended',
 'attending',
 'attention',
 'attentions',
 'attest',
 'avoid',
 'away',
 'babe',
 'banished',
 'bear',
 'beaufort',
 'beautiful',
 'beauty',
 'be

In [12]:
#Creating an index for each word in our vocab.
index_dict = {} #Dictionary to store index for each word
i = 0
for word in word_set:
    index_dict[word] = i
    i += 1
index_dict

{'assistance': 0,
 'acquaintance': 1,
 'innocence': 2,
 'encountering': 3,
 'seclusion': 4,
 'go': 5,
 'plays': 6,
 'spent': 7,
 'deeply': 8,
 'parents': 9,
 'earliest': 10,
 'romance': 11,
 'brother': 12,
 'gazed': 13,
 'accompanied': 14,
 'applied': 15,
 'union': 16,
 'wondrous': 17,
 'desert': 18,
 'given': 19,
 'professor': 20,
 'happy': 21,
 'saw': 22,
 'supply': 23,
 'sickbed': 24,
 'integrity': 25,
 'look': 26,
 'tear': 27,
 'women': 28,
 'calmer': 29,
 'elder': 30,
 'debts': 31,
 'beautiful': 32,
 'reputation': 33,
 'returns': 34,
 'drawn': 35,
 'sad': 36,
 'tempted': 37,
 'arthur': 38,
 'indifferent': 39,
 'gave': 40,
 'distance': 41,
 'exquisitely': 42,
 'playing': 43,
 'grave': 44,
 'feelings': 45,
 'among': 46,
 'therefore': 47,
 'candle': 48,
 'ingolstadt': 49,
 'greatest': 50,
 'procured': 51,
 'lake': 52,
 'seven': 53,
 'nearer': 54,
 'country': 55,
 'conducive': 56,
 'form': 57,
 'rank': 58,
 'train': 59,
 'happened': 60,
 'protracted': 61,
 'letters': 62,
 'aerial': 63

In [13]:
#Create a count dictionary
 
def count_dict(sentences):
    word_count = {}
    for word in word_set:
        word_count[word] = 0
        for sent in sentences:
            if word in sent:
                word_count[word] += 1
    return word_count
 
word_count = count_dict(sentences)

In [14]:
#Term Frequency (TF): share of the document's tokens that equal the word

def termfreq(document, word):
    N = len(document)
    occurrence = len([token for token in document if token == word])
    return occurrence/N

In [15]:
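#Inverse Document Frequency (IDF), with 1 added to the document count
 
def inverse_doc_freq(word):
    # word_count maps each vocabulary word to the number of documents containing it;
    # a word outside the vocabulary is treated as appearing in no document
    word_occurrence = word_count.get(word, 0) + 1
    return np.log(total_documents/word_occurrence)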

In [16]:
#Term Frequency * Inverse Document Frequency (TF-IDF)

def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    
    word_value,  word_val = [], []

    for word in sentence:
        tf = termfreq(sentence,word)
        idf = inverse_doc_freq(word)
        value = tf*idf      

        word_value = [word, value]
        word_val.append(word_value)
        tf_idf_vec[index_dict[word]] = value

    return tf_idf_vec, word_val
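
To make the arithmetic in the three functions above concrete, here is a minimal standalone sketch on a toy corpus. The four short documents and the helper names `term_freq` and `idf` are made up for illustration; only the formulas (count divided by document length, and the log of the document total over the document frequency plus one) follow the cells above.

```python
import math

# Toy corpus (hypothetical): each document is already a list of tokens
docs = [
    ["father", "family", "father"],
    ["family", "country"],
    ["family", "life"],
    ["life", "country"],
]

def term_freq(tokens, word):
    # Share of the document's tokens that equal `word`
    return tokens.count(word) / len(tokens)

def idf(docs, word):
    # Number of documents containing `word`, plus 1 as in the notebook
    doc_freq = sum(1 for tokens in docs if word in tokens)
    return math.log(len(docs) / (doc_freq + 1))

# "father": tf = 2/3 in the first document, found in 1 of 4 documents
score = term_freq(docs[0], "father") * idf(docs, "father")
print("father:", score)            # (2/3) * ln(4/2)

# "family" is found in 3 of 4 documents, so its IDF is ln(4/4) = 0
print("family idf:", idf(docs, "family"))
```

One consequence of adding 1 to the document frequency: a word that occurs in every document gets a slightly negative IDF, since the ratio inside the logarithm drops below 1.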

In [17]:
#TF-IDF Encoded text corpus - for each document individually

vectors, word_val, wordlist = [], [], []
total_documents = len(text_file)

for sent in sentences:
    vec, word_val = tf_idf(sent)
    
    vectors.append(vec)
    wordlist.append(word_val)

print(wordlist)

[[['birth', 0.008106026884394432], ['genevese', 0.011456978191073476], ['family', 0.016212053768788863], ['one', 0.0022071304566036788], ['distinguished', 0.022913956382146952], ['republic', 0.011456978191073476], ['ancestors', 0.011456978191073476], ['many', 0.005728489095536738], ['years', 0.005728489095536738], ['counsellors', 0.011456978191073476], ['syndics', 0.011456978191073476], ['father', 0.017185467286610214], ['filled', 0.011456978191073476], ['several', 0.011456978191073476], ['public', 0.022913956382146952], ['situations', 0.011456978191073476], ['honour', 0.011456978191073476], ['reputation', 0.011456978191073476], ['respected', 0.011456978191073476], ['knew', 0.011456978191073476], ['integrity', 0.011456978191073476], ['indefatigable', 0.011456978191073476], ['attention', 0.011456978191073476], ['public', 0.022913956382146952], ['business', 0.011456978191073476], ['passed', 0.0038843275144275673], ['younger', 0.008106026884394432], ['days', 0.011456978191073476], ['perpe

### <span style='background : yellow' > **TASK 5:**</span> Calculate TF-IDF on the preprocessed data for unigrams and list down the top 10 words which have the highest TF-IDF Value

In [18]:
#TF-IDF Encoded text corpus - for the list of unigrams
#Note: tf_idf is called once, on the last document only,
#so the scores printed below cover that document alone

vectors, word_val, wordlist = [], [], []
sentence = word_tokenize(text_file[-1])

vec, word_val = tf_idf(sentence)

vectors.append(vec)
wordlist.append(word_val)

print(wordlist)

[[['passed', 0.003051971618478803], ['sad', 0.009001911435843446], ['hours', 0.009001911435843446], ['eleven', 0.009001911435843446], ['oclock', 0.009001911435843446], ['trial', 0.009001911435843446], ['commence', 0.009001911435843446], ['father', 0.004500955717921723], ['rest', 0.009001911435843446], ['family', 0.006369021123452768], ['obliged', 0.009001911435843446], ['attend', 0.009001911435843446], ['witnesses', 0.009001911435843446], ['accompanied', 0.006369021123452768], ['court', 0.018003822871686892], ['whole', 0.009001911435843446], ['wretched', 0.009001911435843446], ['mockery', 0.009001911435843446], ['justice', 0.009001911435843446], ['suffered', 0.018003822871686892], ['living', 0.009001911435843446], ['torture', 0.009001911435843446], ['decided', 0.009001911435843446], ['whether', 0.009001911435843446], ['result', 0.009001911435843446], ['curiosity', 0.006369021123452768], ['lawless', 0.009001911435843446], ['devices', 0.009001911435843446], ['would', 0.018003822871686892

In [19]:
from operator import itemgetter
word_sort = sorted(word_val, key=itemgetter(1), reverse=True)
# print(word_sort[:10])
output = []
for x in word_sort:
    if x not in output:
        output.append(x)
print(output[:10])

[['justine', 0.027005734307530335], ['court', 0.018003822871686892], ['suffered', 0.018003822871686892], ['would', 0.018003822871686892], ['cause', 0.018003822871686892], ['innocence', 0.018003822871686892], ['obliterated', 0.018003822871686892], ['committed', 0.018003822871686892], ['appearance', 0.018003822871686892], ['quickly', 0.018003822871686892]]


### <span style='background : yellow' > **TASK 6:**</span> Label the cleaned TF-IDF dataset

In [20]:
sent_tagged = []
for sent in text_file:
    sentence = word_tokenize(sent)
    tagged = nltk.pos_tag(sentence)
    sent_tagged.append(tagged)
print(sent_tagged)

[[('birth', 'NN'), ('genevese', 'JJ'), ('family', 'NN'), ('one', 'CD'), ('distinguished', 'VBN'), ('republic', 'JJ'), ('ancestors', 'NNS'), ('many', 'JJ'), ('years', 'NNS'), ('counsellors', 'NNS'), ('syndics', 'VBP'), ('father', 'RB'), ('filled', 'VBN'), ('several', 'JJ'), ('public', 'JJ'), ('situations', 'NNS'), ('honour', 'VBP'), ('reputation', 'NN'), ('respected', 'VBD'), ('knew', 'JJ'), ('integrity', 'NN'), ('indefatigable', 'JJ'), ('attention', 'NN'), ('public', 'NN'), ('business', 'NN'), ('passed', 'VBD'), ('younger', 'JJR'), ('days', 'NNS'), ('perpetually', 'RB'), ('occupied', 'JJ'), ('affairs', 'NNS'), ('country', 'NN'), ('variety', 'NN'), ('circumstances', 'NNS'), ('prevented', 'VBD'), ('marrying', 'VBG'), ('early', 'JJ'), ('decline', 'JJ'), ('life', 'NN'), ('became', 'VBD'), ('husband', 'NN'), ('father', 'NN'), ('family', 'NN'), ('circumstances', 'NNS'), ('marriage', 'NN'), ('illustrate', 'VBP'), ('character', 'NN'), ('can', 'MD'), ('not', 'RB'), ('refrain', 'VB'), ('relating

### <span style='background : yellow' > **TASK 7:**</span> Split the Train and the Test Dataset  

In [21]:
#Using first 6 documents as train data and last 2 as test data
#Each document first gets an end marker ('&&','&&'), which the Viterbi backtrace starts from

for i in range(0,8):
    sent_tagged[i].append(('&&','&&'))
    
train_data = sent_tagged[0:6]
test_data = sent_tagged[6:8]

### <span style='background : yellow' > **TASK 8:**</span> Implement the Viterbi Algorithm to get the Part of Speech Tagging

In [22]:
#Creating a dictionary whose keys are tags and whose values count the words assigned the corresponding tag

train_word_tag = {}
for s in train_data:
  for (w,t) in s:
    w=w.lower()
    try:
      try:
        train_word_tag[t][w]+=1
      except:
        train_word_tag[t][w]=1
    except:
      train_word_tag[t]={w:1}
    
train_word_tag

{'NN': {'birth': 1,
  'family': 2,
  'reputation': 1,
  'integrity': 1,
  'attention': 1,
  'public': 1,
  'business': 1,
  'country': 5,
  'variety': 1,
  'life': 6,
  'husband': 1,
  'father': 3,
  'marriage': 1,
  'character': 1,
  'intimate': 1,
  'state': 1,
  'man': 1,
  'name': 1,
  'beaufort': 2,
  'proud': 1,
  'disposition': 2,
  'poverty': 1,
  'oblivion': 1,
  'magnificence': 1,
  'manner': 2,
  'town': 2,
  'lucerne': 1,
  'wretchedness': 1,
  'friendship': 2,
  'retreat': 1,
  'pride': 1,
  'conduct': 1,
  'affection': 2,
  'time': 3,
  'hope': 1,
  'world': 3,
  'credit': 1,
  'assistance': 1,
  'year': 1,
  'difference': 1,
  'dispute': 1,
  'harmony': 1,
  'soul': 2,
  'companionship': 1,
  'diversity': 1,
  'contrast': 2,
  'calmer': 1,
  'ardour': 2,
  'application': 2,
  'knowledge': 2,
  'home': 2,
  'sublime': 1,
  'silence': 1,
  'winter': 1,
  'turbulence': 1,
  'alpine': 1,
  'summersshe': 1,
  'scope': 1,
  'admiration': 2,
  'companion': 1,
  'spirit': 1,
  '

In [23]:
#Calculating the emission probabilities using train_word_tag

train_emission_prob={}
for k in train_word_tag.keys():
  train_emission_prob[k]={}
  count = sum(train_word_tag[k].values())
  for k2 in train_word_tag[k].keys():
    train_emission_prob[k][k2]=train_word_tag[k][k2]/count
train_emission_prob

{'NN': {'birth': 0.003355704697986577,
  'family': 0.006711409395973154,
  'reputation': 0.003355704697986577,
  'integrity': 0.003355704697986577,
  'attention': 0.003355704697986577,
  'public': 0.003355704697986577,
  'business': 0.003355704697986577,
  'country': 0.016778523489932886,
  'variety': 0.003355704697986577,
  'life': 0.020134228187919462,
  'husband': 0.003355704697986577,
  'father': 0.010067114093959731,
  'marriage': 0.003355704697986577,
  'character': 0.003355704697986577,
  'intimate': 0.003355704697986577,
  'state': 0.003355704697986577,
  'man': 0.003355704697986577,
  'name': 0.003355704697986577,
  'beaufort': 0.006711409395973154,
  'proud': 0.003355704697986577,
  'disposition': 0.006711409395973154,
  'poverty': 0.003355704697986577,
  'oblivion': 0.003355704697986577,
  'magnificence': 0.003355704697986577,
  'manner': 0.006711409395973154,
  'town': 0.006711409395973154,
  'lucerne': 0.003355704697986577,
  'wretchedness': 0.003355704697986577,
  'friend

In [24]:
#Counting the tag bigrams to be used for the transition probabilities

bigram_tag_data = {}
for s in train_data:
  bi=list(nltk.bigrams(s))
  for b1,b2 in bi:
    try:
      try:
        bigram_tag_data[b1[1]][b2[1]]+=1
      except:
        bigram_tag_data[b1[1]][b2[1]]=1
    except:
      bigram_tag_data[b1[1]]={b2[1]:1}
bigram_tag_data

{'NN': {'JJ': 37,
  'CD': 3,
  'VBD': 48,
  'NN': 110,
  'NNS': 21,
  'VBP': 6,
  'MD': 6,
  'WP$': 1,
  'RB': 25,
  'JJS': 6,
  'VBG': 5,
  '&&': 3,
  'VBN': 16,
  'VBZ': 2,
  'IN': 4,
  'JJR': 1,
  'VB': 2,
  'RBR': 1,
  'NNP': 1},
 'JJ': {'NN': 106,
  'NNS': 31,
  'JJ': 33,
  'CD': 3,
  'VBZ': 1,
  'RB': 2,
  'VBN': 5,
  'JJS': 1,
  'VBG': 1,
  'VB': 1},
 'CD': {'VBN': 1, 'NN': 5, 'NNS': 3, 'IN': 1},
 'VBN': {'JJ': 15,
  'RB': 7,
  'NN': 9,
  'VBP': 1,
  'NNS': 6,
  'VBG': 1,
  'DT': 1,
  'IN': 1},
 'NNS': {'JJ': 13,
  'NNS': 10,
  'VBP': 22,
  'RB': 10,
  'NN': 13,
  'VBD': 21,
  'JJS': 1,
  'VBN': 8,
  '&&': 2,
  'FW': 1,
  'VBZ': 1,
  'VBG': 2,
  'VB': 1,
  'PRP': 1},
 'VBP': {'RB': 6,
  'NN': 8,
  'VBG': 2,
  'VBP': 1,
  'NNS': 3,
  'JJ': 9,
  'PRP': 1,
  'RBR': 1,
  'VBN': 1,
  'VBZ': 1},
 'RB': {'VBN': 5,
  'JJ': 24,
  'VB': 6,
  'VBD': 12,
  'RB': 2,
  'NN': 4,
  'VBG': 4,
  'VBZ': 3,
  'RBR': 1,
  'MD': 4,
  'NNS': 5,
  'CD': 2,
  'VBP': 1},
 'VBD': {'JJ': 31,
  'JJR': 1,
  

In [25]:
#Calculating the probabilities of tag bigrams for transition probability  

bigram_tag_prob={}
for k in bigram_tag_data.keys():
  bigram_tag_prob[k]={}
  count=sum(bigram_tag_data[k].values())
  for k2 in bigram_tag_data[k].keys():
    bigram_tag_prob[k][k2]=bigram_tag_data[k][k2]/count
bigram_tag_prob

{'NN': {'JJ': 0.12416107382550336,
  'CD': 0.010067114093959731,
  'VBD': 0.1610738255033557,
  'NN': 0.3691275167785235,
  'NNS': 0.07046979865771812,
  'VBP': 0.020134228187919462,
  'MD': 0.020134228187919462,
  'WP$': 0.003355704697986577,
  'RB': 0.08389261744966443,
  'JJS': 0.020134228187919462,
  'VBG': 0.016778523489932886,
  '&&': 0.010067114093959731,
  'VBN': 0.053691275167785234,
  'VBZ': 0.006711409395973154,
  'IN': 0.013422818791946308,
  'JJR': 0.003355704697986577,
  'VB': 0.006711409395973154,
  'RBR': 0.003355704697986577,
  'NNP': 0.003355704697986577},
 'JJ': {'NN': 0.5760869565217391,
  'NNS': 0.16847826086956522,
  'JJ': 0.1793478260869565,
  'CD': 0.016304347826086956,
  'VBZ': 0.005434782608695652,
  'RB': 0.010869565217391304,
  'VBN': 0.02717391304347826,
  'JJS': 0.005434782608695652,
  'VBG': 0.005434782608695652,
  'VB': 0.005434782608695652},
 'CD': {'VBN': 0.1, 'NN': 0.5, 'NNS': 0.3, 'IN': 0.1},
 'VBN': {'JJ': 0.36585365853658536,
  'RB': 0.170731707317

In [26]:
#Calculating the possible tags for each word
#Note: Here we have used the whole data(Train+Test)
#Reason: There may be some words which are not present in train data but are present in test data 
tags_of_tokens = {}
count=0
for s in train_data:
  for (w,t) in s:
    w=w.lower()
    try:
      if t not in tags_of_tokens[w]:
        tags_of_tokens[w].append(t)
    except:
      l = []
      l.append(t)
      tags_of_tokens[w] = l
        
for s in test_data:
  for (w,t) in s:
    w=w.lower()
    try:
      if t not in tags_of_tokens[w]:
        tags_of_tokens[w].append(t)
    except:
      l = []
      l.append(t)
      tags_of_tokens[w] = l

In [27]:
#Dividing the test data into test words and test tags

test_words=[]
test_tags=[]
for s in test_data:
  temp_word=[]
  temp_tag=[]
  for (w,t) in s:
    temp_word.append(w.lower())
    temp_tag.append(t)
  test_words.append(temp_word)
  test_tags.append(temp_tag)

In [28]:
#Executing the Viterbi Algorithm

predicted_tags = []                #initializing the predicted tags
for x in range(len(test_words)):   # for each tokenized sentence in the test data
  s = test_words[x]
  #storing_values is a dictionary which stores the required values
  #ex: storing_values = {step_no.:{state1:[previous_best_state,value_of_the_state]}}                
  storing_values = {}              
  for q in range(len(s)):
    step = s[q]
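    #first scored position: index 1, because index 0 is assumed to hold a '##' start marker
    #(no such marker is added when the data is split, so the word at index 0 is never scored)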
    if q == 1:                
      storing_values[q] = {}
      tags = tags_of_tokens[step]
      for t in tags:
        #the lookup fails when the tag pair or the word was never seen in the train data
        #(here it always fails, because '##' never occurs as a tag in the train data)
        try:
          storing_values[q][t] = ['##',bigram_tag_prob['##'][t]*train_emission_prob[t][step]]
        #on failure, assign a very low probability of 0.0001
        except KeyError:
          storing_values[q][t] = ['##',0.0001]
    
    #if the word is not at the start of the sentence
    if q>1:
      storing_values[q] = {}
      previous_states = list(storing_values[q-1].keys())   # loading the previous states
      current_states  = tags_of_tokens[step]               # loading the current states
      #calculation of the best previous state for each current state and then storing
      #it in storing_values
      for t in current_states:                             
        temp = []
        for pt in previous_states:                         
          try:
            temp.append(storing_values[q-1][pt][1]*bigram_tag_prob[pt][t]*train_emission_prob[t][step])
          except:
            temp.append(storing_values[q-1][pt][1]*0.0001)
        max_temp_index = temp.index(max(temp))
        best_pt = previous_states[max_temp_index]
        storing_values[q][t]=[best_pt,max(temp)]

  #Backtracing to extract the best possible tags for the sentence
  pred_tags = []
  total_steps_num = storing_values.keys()
  last_step_num = max(total_steps_num)
  for bs in range(len(total_steps_num)):
    step_num = last_step_num - bs
    if step_num == last_step_num:
          pred_tags.append('&&')
          pred_tags.append(storing_values[step_num]['&&'][0])
    if step_num<last_step_num and step_num>0:
          pred_tags.append(storing_values[step_num][pred_tags[len(pred_tags)-1]][0])
  predicted_tags.append(list(reversed(pred_tags)))
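
For comparison, the same dynamic-programming idea can be written as a compact standalone function. This is a sketch under stated assumptions, not the cell above: it works in log space, considers every tag at every position instead of the per-word candidates in `tags_of_tokens`, takes an explicit start distribution `start_p` in place of a `##` marker, and floors each missing probability separately at 0.0001. The toy probability tables are invented for illustration.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-4):
    """Most likely tag sequence for `words` under a bigram HMM (log space)."""
    def logp(table, row, col):
        # Unseen transitions or emissions fall back to a small floor value
        return math.log(table.get(row, {}).get(col, floor))

    # best[i][t] = (log-probability of the best path ending in t at i, previous tag)
    best = [{t: (math.log(start_p.get(t, floor)) + logp(emit_p, t, words[0]), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            # Pick the predecessor that maximises path score plus transition
            score, prev = max((best[i - 1][pt][0] + logp(trans_p, pt, t), pt)
                              for pt in tags)
            column[t] = (score + logp(emit_p, t, words[i]), prev)
        best.append(column)

    # Backtrace from the best final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy model (hypothetical numbers)
tags = ["NN", "JJ", "VBD"]
start_p = {"JJ": 0.5, "NN": 0.4, "VBD": 0.1}
trans_p = {"JJ": {"NN": 0.8, "JJ": 0.2},
           "NN": {"VBD": 0.6, "NN": 0.4},
           "VBD": {"JJ": 0.5, "NN": 0.5}}
emit_p = {"JJ": {"public": 0.6, "proud": 0.4},
          "NN": {"public": 0.1, "business": 0.5, "father": 0.4},
          "VBD": {"passed": 1.0}}

result = viterbi(["public", "business", "passed"], tags, start_p, trans_p, emit_p)
print(result)
```

Summing logarithms instead of multiplying raw probabilities keeps long documents from underflowing to zero, which matters here because each chapter is tagged as a single sequence.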

### <span style='background : yellow' > **TASK 9:**</span> Calculate the Accuracy and F1 score

In [29]:
#Calculating the accuracy based on tagging each word in the test data.

right = 0 
wrong = 0
for i in range(len(test_tags)):
  gt = test_tags[i]
  pred = predicted_tags[i]
  for h in range(len(gt)):
    if gt[h] == pred[h]:
      right = right+1
    else:
      wrong = wrong +1 

print('Accuracy on the test data is: ',right/(right+wrong))
print('Loss on the test data is: ',wrong/(right+wrong))

Accuracy on the test data is:  0.9138576779026217
Loss on the test data is:  0.08614232209737828


In [30]:
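#Calculating the f1 score
#Note: gt and pred still hold the last test document from the accuracy loop,
#so this score covers that document only

from sklearn.metrics import f1_score
f1 = f1_score(gt, pred, average='weighted')  # separate name, so the imported function is not overwritten
f1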

0.9188985081927122
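
The accuracy loop pools both test documents, while `gt` and `pred` after that loop refer to the last document only. A minimal sketch of scoring both metrics over the pooled tags follows; it uses only the standard library, the function name `accuracy_and_weighted_f1` is hypothetical, and the toy tag lists are invented. The weighting is assumed to match the support-weighted average that `average='weighted'` denotes.

```python
from collections import Counter

def accuracy_and_weighted_f1(gold_docs, pred_docs):
    # Flatten the per-document tag lists into one sequence each
    gold = [t for doc in gold_docs for t in doc]
    pred = [t for doc in pred_docs for t in doc]
    assert len(gold) == len(pred), "gold and predicted tags must align"

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was missed

    # Per-tag F1 = 2TP / (2TP + FP + FN), weighted by how often the tag occurs in gold
    weighted_f1 = 0.0
    for tag, support in Counter(gold).items():
        f1 = 2 * tp[tag] / (2 * tp[tag] + fp[tag] + fn[tag])
        weighted_f1 += f1 * support / len(gold)

    accuracy = sum(tp.values()) / len(gold)
    return accuracy, weighted_f1

# Toy example (hypothetical tags for two short documents)
gold_docs = [["NN", "JJ", "NN"], ["VBD", "NN"]]
pred_docs = [["NN", "NN", "NN"], ["VBD", "JJ"]]
acc, f1 = accuracy_and_weighted_f1(gold_docs, pred_docs)
print(acc, f1)
```

With the notebook's variables, the call would be `accuracy_and_weighted_f1(test_tags, predicted_tags)`.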

### <span style='background : yellow' > **TASK 10:**</span> Using the LDA algorithm create the Topics (10) for the Corpus

In [31]:
data_words = []
for sent in text_file:
    sentence = word_tokenize(sent)
    data_words.append(sentence)

print(data_words[:1][0][:30])

# Create Dictionary
id2word = corpora.Dictionary(data_words)
# Create Corpus
texts = data_words
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1][0][:30])

['birth', 'genevese', 'family', 'one', 'distinguished', 'republic', 'ancestors', 'many', 'years', 'counsellors', 'syndics', 'father', 'filled', 'several', 'public', 'situations', 'honour', 'reputation', 'respected', 'knew', 'integrity', 'indefatigable', 'attention', 'public', 'business', 'passed', 'younger', 'days', 'perpetually', 'occupied']
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 3), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 2), (28, 1), (29, 1)]


### <span style='background : yellow' > **TASK 11:**</span> List down the 10 words in each of the Topics Extracted. 

In [32]:
from pprint import pprint
# number of topics
num_topics = 10
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)
# Print 10 Keywords in the 10 topics
pprint(lda_model.print_topics(num_words=10))
doc_lda = lda_model[corpus]
doc_lda

[(0,
  '0.006*"limbs" + 0.006*"black" + 0.005*"yellow" + 0.005*"almost" + '
  '0.004*"beautiful" + 0.004*"already" + 0.004*"life" + 0.004*"hard" + '
  '0.004*"god" + 0.003*"formed"'),
 (1,
  '0.006*"us" + 0.005*"one" + 0.005*"yet" + 0.004*"would" + 0.004*"time" + '
  '0.004*"country" + 0.003*"friends" + 0.003*"father" + 0.003*"could" + '
  '0.003*"never"'),
 (2,
  '0.006*"son" + 0.005*"found" + 0.004*"even" + 0.004*"life" + 0.004*"us" + '
  '0.004*"return" + 0.004*"years" + 0.004*"long" + 0.004*"country" + '
  '0.004*"would"'),
 (3,
  '0.007*"one" + 0.006*"father" + 0.005*"country" + 0.004*"life" + '
  '0.004*"circumstances" + 0.004*"yet" + 0.004*"years" + 0.003*"became" + '
  '0.003*"us" + 0.003*"would"'),
 (4,
  '0.006*"life" + 0.004*"beautiful" + 0.004*"one" + 0.004*"great" + '
  '0.004*"could" + 0.004*"elizabeth" + 0.003*"seemed" + 0.003*"almost" + '
  '0.003*"geneva" + 0.003*"yellow"'),
 (5,
  '0.009*"would" + 0.005*"yet" + 0.004*"us" + 0.004*"justine" + 0.004*"seemed" '
  '+ 0.00

<gensim.interfaces.TransformedCorpus at 0x1ab0d2b8340>

In [33]:
#Printing the topic association with the documents
count=1
for i in lda_model[corpus]:
    print("Chapter : ",count,i)
    count+=1

Chapter :  1 [(3, 0.9926204)]
Chapter :  2 [(2, 0.08842799), (8, 0.9075295)]
Chapter :  3 [(6, 0.9936161)]
Chapter :  4 [(2, 0.5264588), (6, 0.011185666), (7, 0.45914102)]
Chapter :  5 [(0, 0.9907194)]
Chapter :  6 [(1, 0.9943011)]
Chapter :  7 [(2, 0.5202191), (9, 0.47263572)]
Chapter :  8 [(5, 0.37596744), (7, 0.61886865)]
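
To read the listing above at a glance, each chapter can be reduced to its single most probable topic. The sketch below is standalone: it copies four of the printed rows into a plain dictionary rather than querying the model, and `dominant_topic` is a hypothetical helper name.

```python
# (topic id, probability) pairs copied from the output above
doc_topics = {
    1: [(3, 0.9926204)],
    2: [(2, 0.08842799), (8, 0.9075295)],
    4: [(2, 0.5264588), (6, 0.011185666), (7, 0.45914102)],
    8: [(5, 0.37596744), (7, 0.61886865)],
}

def dominant_topic(topic_probs):
    # Return the (topic id, probability) pair with the highest probability
    return max(topic_probs, key=lambda pair: pair[1])

for chapter, probs in doc_topics.items():
    topic, prob = dominant_topic(probs)
    print("Chapter", chapter, "-> topic", topic, "with probability", round(prob, 3))
```

Chapters such as 4 split their weight almost evenly between two topics, so the dominant topic alone hides part of the picture for them.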
