Your task is to train *character-level* language models. 
You will train unigram, bigram, and trigram character level models on a collection of books from Project Gutenberg. You will then use these trained English language models to distinguish English documents from Brazilian Portuguese documents in the test set.

In [0]:
import re
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
np.random.seed(101)

In [0]:
import pandas as pd
import httpimport

with httpimport.remote_repo(['lm_helper'], 'https://raw.githubusercontent.com/jasoriya/CS6120-PS2-support/master/utils/'):
  from lm_helper import get_train_data, get_test_data

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sayan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\sayan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\sayan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.
[nltk_data] Downloading package mac_morpho to
[nltk_data]     C:\Users\sayan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\mac_morpho.zip.


This code loads the training and test data. Each dataset is a list of books. Each book contains a list of sentences, and each sentence contains a list of words. For building a character language model, you should join the words of a sentence together with a space character.

In [0]:
# get the train and test data
train = get_train_data()
test, test_files = get_test_data()

## 1.1
Collect statistics on the unigram, bigram, and trigram character counts.

If your machine takes a long time to perform this computation, you may save these counts to files in your github repository and load them on request. This is not necessary, however.

In [0]:
TRAIN,DEV=train_test_split(train,test_size=0.2, random_state=42)
corpus=[]
for i in range(len(TRAIN)):
  for j in range(len(TRAIN[i])):
    corpus.append((re.sub("[\x1a]+", ' '," ".join(TRAIN[i][j])).lstrip().rstrip()))

c1,c2,c3=dict(),dict(),dict()

cv1=CountVectorizer(ngram_range=(1,1),analyzer='char')
x1=cv1.fit_transform(corpus)

t1=np.array(np.sum(x1,axis=0))[0]
for i,name in enumerate(cv1.get_feature_names()):
  c1[name]=t1[i] # word and count
 
#removing unigrams that have a count <100
rep=[]
tot=0
for a in c1:
  if c1[a]<100:
    tot +=c1[a]
    rep.append(a)

c1['µ']=tot # unknown token since it doesnt appear int the docs

for i in rep:
  del c1[i]

corpus2=[]
for sent in corpus:
  corpus2.append(re.sub('[/$/&>/`/æ]',"µ" ,sent))

# removed tokens with values less than 100 and replaced with special token
cv1=CountVectorizer(ngram_range=(1,1),analyzer='char')
cv2=CountVectorizer(ngram_range=(2,2),analyzer='char')
cv3=CountVectorizer(ngram_range=(3,3),analyzer='char')
x1=cv1.fit_transform(corpus2)
x2=cv2.fit_transform(corpus2)
x3=cv3.fit_transform(corpus2)

t1=np.array(np.sum(x1,axis=0))[0]
for i,name in enumerate(cv1.get_feature_names()):
  c1[name]=t1[i] 

t2=np.array(np.sum(x2,axis=0))[0]
for i,name in enumerate(cv2.get_feature_names()):
  c2[name]=t2[i]

t3=np.array(np.sum(x3,axis=0))[0]
for i,name in enumerate(cv3.get_feature_names()):
  c3[name]=t3[i]

print("Number of Unigrams {}".format(len(cv1.get_feature_names())))
print("Number of Bigrams {}".format(len(cv2.get_feature_names())))
print("Number of Trigrams {}".format(len(cv3.get_feature_names())))

Number of Unigrams 54
Number of Bigrams 884
Number of Trigrams 8618


In [0]:
def generate_trigram_sent(sent):
    tri=[]
    for j in range(len(sent)-3):
      tri.append(sent[j:j+3])
    return tri


# trying a different variant
def uni_prob(token,a1):
  if token not in set(a1.keys()) :
    return a1['µ']/np.array(list(a1.values())).sum() # denominator represnts the total number of characters in the training set
  else :
    return a1[token]/np.array(list(a1.values())).sum()

def bi_prob(token,a2,a1):
  if token not in set(a2.keys()):
    return 0
  else:
    return a2[token]/a1[token[:1]]

def tri_prob(token,a3,a2,a1):
  if token not in set(a3.keys()):
    return 0
  else:
    return a3[token]/a2[token[:2]]

In [0]:
def perplexity(corp,la1,la2,la3,a1,a2,a3): # corp can be the whole test set or dev set  
    perplexity=[]
    for doc in corp: # picking single documents
        #print("doc")
    
        dd=CountVectorizer(ngram_range=(1,1),analyzer='char')
        gg=dd.fit_transform(doc)
        tt=dict()
        tp1=np.array(np.sum(gg,axis=0))[0]
        for i,name in enumerate(dd.get_feature_names()):
          tt[name]=tp1[i] # word and count
        char=np.array(list(tt.values())).sum() # refers to M # total number of characters in a document

        # real calculations begin

        log_prob=0
        for sent in doc: # refers to each document fro the dev set or test set # loops over all thesentences in the document
          trigrams=generate_trigram_sent(sent) # generates trigrams for the given sentence
          for tri_gram in trigrams: # loops over all the characters in a sentence
              words = tri_gram     # trigram
              bi_word = words[-2:] # bigram
              uniword = words[-1]  # unigram

              prob=float(la1) * uni_prob(uniword,a1) + float(la2) * bi_prob(bi_word,a2,a1) + float(la3) * tri_prob(words,a3,a2,a1)
              log_prob += np.log2(prob)

        l=log_prob/char
        perp=pow(2,-l)
        print(perp)
        perplexity.append(perp)
    return(perplexity)



In [0]:
# preparing test set
dev=[]
for i in range(len(DEV)):
  k=[]
  for j in range(len(DEV[i])):
    k.append((" ".join(DEV[i][j])).lstrip().rstrip())
  dev.append(k)

In [0]:
vals=[(0.3,0.3,0.4),(0.1,0.1,0.8),(0.05,0.05,0.9)]
gs=pd.DataFrame(index=np.arange(10),columns=['l1','l2','l3','perplexity on Dev1','perplexity on Dev2','perplexity on Dev3','perplexity on Dev4'])
for index,(a,b,c) in enumerate(vals):
  print(index)
  pp=perplexity(dev,a,b,c,c1,c2,c3)
  gs.iloc[index,0]=a
  gs.iloc[index,1]=b
  gs.iloc[index,2]=c
  gs.iloc[index,3]=pp[0]
  gs.iloc[index,4]=pp[1]
  gs.iloc[index,5]=pp[2]
  gs.iloc[index,6]=pp[3]

  

0
9.813588164958933
9.406214996393329
9.725194606093474
9.260701066184927
1
8.716136989029756
8.285508611518384
8.705132744577154
8.259456370655682
2
8.76291047726098
8.289438931279633
8.750331381329111
8.31421348518957


In [0]:

vals=[(0.9,0.05,0.05),(0.05,0.9,0.05),(0.15,0.15,0.7)]
for index,(a,b,c) in enumerate(vals):
  print(index)
  pp=perplexity(dev,a,b,c,c1,c2,c3)
  gs.iloc[index+3,0]=a
  gs.iloc[index+3,1]=b
  gs.iloc[index+3,2]=c
  gs.iloc[index+3,3]=pp[0]
  gs.iloc[index+3,4]=pp[1]
  gs.iloc[index+3,5]=pp[2]
  gs.iloc[index+3,6]=pp[3]



0
16.166323735910773
15.584765299412037
15.75528288514541
15.277191065067907
1
11.687808575041446
11.101164440327587
11.364067567752691
10.825074660256876
2
8.84456655594152
8.432693964934181
8.823073546257332
8.373265237943288


In [0]:
vals=[(0.2,0.2,0.6),(0.1,0.2,0.7),(0.2,0.1,0.7)]
for index,(a,b,c) in enumerate(vals):
  print(index)
  pp=perplexity(dev,a,b,c,c1,c2,c3)
  gs.iloc[index+6,0]=a
  gs.iloc[index+6,1]=b
  gs.iloc[index+6,2]=c
  gs.iloc[index+6,3]=pp[0]
  gs.iloc[index+6,4]=pp[1]
  gs.iloc[index+6,5]=pp[2]
  gs.iloc[index+6,6]=pp[3]


0
9.07585218782757
8.672069124613337
9.037754981041642
8.58426766942614
1
8.83179665800465
8.40320982132402
8.795475806790627
8.347725160525366
2
8.902203822526706
8.500849406990639
8.89360758701261
8.442758289241285


In [0]:
vals=[(0.3,0.4,0.3)]
for index,(a,b,c) in enumerate(vals):
  print(index)
  pp=perplexity(dev,a,b,c,c1,c2,c3)
  gs.iloc[index+9,0]=a
  gs.iloc[index+9,1]=b
  gs.iloc[index+9,2]=c
  gs.iloc[index+9,3]=pp[0]
  gs.iloc[index+9,4]=pp[1]
  gs.iloc[index+9,5]=pp[2]
  gs.iloc[index+9,6]=pp[3]

0
10.220304348961951
9.798488483148885
10.09251311269769
9.61725547882338


In [0]:
gs

Unnamed: 0,l1,l2,l3,perplexity on Dev1,perplexity on Dev2,perplexity on Dev3,perplexity on Dev4
0,0.3,0.3,0.4,9.81359,9.40621,9.72519,9.2607
1,0.1,0.1,0.8,8.71614,8.28551,8.70513,8.25946
2,0.05,0.05,0.9,8.76291,8.28944,8.75033,8.31421
3,0.9,0.05,0.05,16.1663,15.5848,15.7553,15.2772
4,0.05,0.9,0.05,11.6878,11.1012,11.3641,10.8251
5,0.15,0.15,0.7,8.84457,8.43269,8.82307,8.37327
6,0.2,0.2,0.6,9.07585,8.67207,9.03775,8.58427
7,0.1,0.2,0.7,8.8318,8.40321,8.79548,8.34773
8,0.2,0.1,0.7,8.9022,8.50085,8.89361,8.44276
9,0.3,0.4,0.3,10.2203,9.79849,10.0925,9.61726


# l1=0.1, l2=0.1,l3=0.8 performs the best

In [0]:
# preparing test set
TEST=[]
for i in range(len(test)):
  k=[]
  for j in range(len(test[i])):
    k.append((" ".join(test[i][j])).lstrip().rstrip())
  TEST.append(k)
pp=perplexity(TEST,0.1,0.1,0.8,c1,c2,c3)

12.602390110472822
12.346587207317121
10.813909384232538
12.301324449065973
10.334058173770245
11.203919545977598
10.69027776474952
8.427348238623338
8.316009691123016
10.382608688528812
13.2842284896863
7.878219910596223
31.078739827539465
11.995724482059607
11.766587432913054
9.008044532271953
9.475038225336784
11.093115147087302
29.275315310487283
13.528705359679043
9.43866002812946
11.029165067546833
9.848851805500207
11.265702239026407
9.580269702757967
9.873322938600513
12.59538793889052
8.887262568883836
14.079374442628644
30.605784881330095
12.043241208918033
8.292701831575126
11.441162855580028
8.585146806628495
31.190553557149396
12.22428948834306
18.949264958845284
14.313950153699587
9.046061131551417
12.348565818537603
21.43623604356789
8.216636249350882
10.004584595202408
13.462176885826853
17.994696558180927
32.97223405368126
11.736038500366522
11.442336978084544
10.280458930969614
9.711636335097205
31.344721364705983
12.587931952082073
7.921767505840446
9.251379955404877

## 1.2
Calculate the perplexity for each document in the test set using linear interpolation smoothing method. For determining λs for linear interpolation, you can divide the training data into a new training set (80%) and a held-out set (20%), then using grid search method:
Choose ~10 values of λ to test using grid search on held-out data.

Some documents in the test set are in Brazilian Portuguese. Identify them as follows: 
  - Sort by perplexity and set a cut-off threshold. All the documents above this threshold score should be categorized as Brazilian Portuguese. 
  - Print the file names (from `test_files`) and perplexities of the documents above the threshold

    ```
        file name, score
        file name, score
        . . .
        file name, score
    ```

  - Copy this list of filenames and manually annotate them as being correctly or incorrectly labeled as Portuguese.




## **Please double click on the tab below to have an expanded view in order**


67       cg04           7.607644        English 
209      cf13           7.762838        English 
132      cf26           7.846112        English 
11       ce21           7.878220        English 
117      ce24           7.904342        English 
52       ce17           7.921768        English 
216      cf27           7.936450        English 
205      cd13           8.137768        English 
124      cg11           8.174361        English 
160      cd01           8.203411        English 
41       cf06           8.216636        English 
118      ce08           8.276184        English 
31       ce19           8.292702        English 
8        cf07           8.316010        English 
147      ce02           8.320410        English 
81       ce28           8.365333        English 
122      cg05           8.395723        English 
7        cg09           8.427348        English 
101      cf36           8.532034        English 
33       cf43           8.585147        English 
191      cd11           8.587983        English 
76       ce20           8.674718        English 
151      cd07           8.727451        English 
93       cd04           8.736486        English 
181      cb10           8.846665        English 
27       cf37           8.887263        English 
104      ce29           8.888606        English 
90       cb08           8.928633        English 
185      cf42           8.968367        English 
15       ce15           9.008045        English 
183      cf44           9.030851        English 
92       ce16           9.040102        English 
91       cf48           9.045551        English 
38       cf12           9.046061        English 
125      cf33           9.099066        English 
150      cf04           9.133517        English 
102      cf03           9.140057        English 
99       cd17           9.216462        English 
73       cf11           9.226338        English 
88       cf31           9.239302        English 
53       cf39           9.251380        English 
97       ce06           9.252108        English 
112      cb09           9.301730        English 
20       ce30           9.438660        English 
198      ce11           9.446291        English 
16       cd02           9.475038        English 
178      ca35           9.476882        English 
94       cg01           9.528356        English 
24       cf34           9.580270        English 
95       cf01           9.582123        English 
184      cg02           9.602060        English 
110      cf21           9.641537        English 
167      cd09           9.666186        English 
108      ce32           9.684026        English 
206      cf08           9.685151        English 
149      ce03           9.701251        English 
49       cd12           9.711636        English 
219      ca44           9.711984        English 
156      cf45           9.722043        English 
70       cd10           9.723633        English 
212      cf38           9.778379        English 
144      cd05           9.785918        English 
75       cf19           9.817012        English 
22       ce33           9.848852        English 
201      cb17           9.857195        English 
25       ce23           9.873323        English 
169      cg10           9.883288        English 
157      ce14           9.889469        English 
140      ca38           9.921090        English 
116      ce36           9.944183        English 
131      cd15           9.949579        English 
42       ce27          10.004585        English 
138      cf15          10.007910        English 
109      ce34          10.093416        English 
85       cb13          10.120943        English 
54       cf47          10.123208        English 
196      cd16          10.155332        English 
114      cb03          10.174852        English 
171      cc13          10.182947        English 
145      ce25          10.218851        English 
120      cb04          10.272351        English 
48       cd06          10.280459        English 
105      cb22          10.288331        English 
148      ce26          10.289340        English 
189      cf23          10.312102        English 
98       ce35          10.321920        English 
4        cb21          10.334058        English 
165      ca36          10.346784        English 
9        cb14          10.382609        English 
77       ca08          10.457917        English 
65       ce04          10.471021        English 
115      cb19          10.532577        English 
176      cf28          10.547691        English 
143      cc06          10.574119        English 
162      cf02          10.587063        English 
159      ce31          10.588922        English 
166      cf40          10.606168        English 
199      cf16          10.614876        English 
113      cb23          10.619955        English 
72       cf24          10.621813        English 
6        cd08          10.690278        English 
163      cc16          10.716301        English 
74       cd14          10.716817        English 
154      cf25          10.731328        English 
96       cb01          10.731821        English 
62       ce07          10.738781        English 
197      cb15          10.777816        English 
57       cg03          10.794390        English 
2        cf14          10.813909        English 
217      cb07          10.847927        English 
111      cb16          10.885428        English 
82       cb20          10.892710        English 
195      cf09          10.897506        English 
59       cf22          10.963097        English 
89       cb27          11.023990        English 
21       ce22          11.029165        English 
190      ce13          11.029813        English 
71       cc03          11.051312        English 
200      cf35          11.060439        English 
17       cb18          11.093115        English 
210      ca05          11.149384        English 
66       cf10          11.159860        English 
207      cf18          11.190478        English 
218      cf30          11.196598        English 
192      cc17          11.202068        English 
5        cb05          11.203920        English 
161      cc02          11.207150        English 
137      ca43          11.227175        English 
121      cc05          11.241235        English 
23       ca28          11.265702        English 
142      ca03          11.413440        English 
32       ce18          11.441163        English 
47       cg08          11.442337        English 
127      ca41          11.481596        English 
172      cg07          11.625483        English 
155      ce01          11.690457        English 
46       cf20          11.736039        English 
14       cf32          11.766587        English 
214      cc08          11.785824        English 
86       ca34          11.854519        English 
80       ce10          11.970081        English 
13       ca04          11.995724        English 
30       cc12          12.043241        English 
129      cb02          12.083724        English 
35       cc14          12.224289        English 
87       cb24          12.225326        English 
136      cb12          12.239529        English 
139      cf46          12.244886        English 
3        cc07          12.301324        English 
164      cf05          12.306229        English 
204      cc01          12.325659        English 
1        cf41          12.346587        English 
39       cb25          12.348566        English 
84       cg06          12.349902        English 
134      cc10          12.530201        English 
63       cc04          12.533545        English 
51       ca20          12.587932        English 
26       cb06          12.595388        English 
0        cd03          12.602390        English 
187      ca19          12.689184        English 
194      ca01          12.731162        English 
123      cb26          12.835588        English 
202      ca26          12.934313        English 
60       ca32          12.937487        English 
64       cc11          13.017470        English 
128      ca14          13.044402        English 
215      ca33          13.088101        English 
141      ca10          13.099486        English 
177      ca06          13.105373        English 
173      ce05          13.164509        English 
126      ca09          13.232818        English 
10       ca30          13.284228        English 
208      ca39          13.412261        English 
43       ca22          13.462177        English 
19       ca12          13.528705        English 
58       ca21          13.529430        English 
56       ca27          13.561162        English 
174      ca07          13.601775        English 
133      ce12          13.638902        English 
193      ca15          13.831988        English 
55       ca37          13.880419        English 
213      ca29          13.936849        English 
28       ca42          14.079374        English 
186      ca13          14.212800        English 
78       cc09          14.215237        English 
153      cb11          14.260719        English 
37       cf17          14.313950        English 
188      ca02          14.333278        English 
119      ca24          14.337879        English 
211      ca25          14.372320        English 
106      cf29          14.503452        English 
135      ca40          14.697287        English 
180      cc15          15.348686        English 
79       ca11          15.492125        English 
146      ca23          16.583751        English 
68       ca31          16.708870        English 
44       ca16          17.994697        English 
36       ca17          18.949265        English 
61       ca18          21.102725        English 
40       ce09          21.436236        English 
203   ag94fe1.txt          27.078045       Portugese
18   ag94ja11.txt          29.275315       Portugese
107  br94ab02.txt          30.196547       Portugese
69    ag94mr1.txt          30.415082       Portugese
29   br94ju01.txt          30.605785       Portugese
12   ag94ab12.txt          31.078740       Portugese
103  ag94no01.txt          31.179321       Portugese
34   br94jl01.txt          31.190554       Portugese
50   ag94jl12.txt          31.344721       Portugese
170  br94de01.txt          31.424423       Portugese
152  br94ja04.txt          31.668146       Portugese
158  ag94ma03.txt          31.704502       Portugese
175  ag94ou04.txt          31.788427       Portugese
83   ag94de06.txt          32.058848       Portugese
130  ag94ju07.txt          32.076658       Portugese
168  ag94ag02.txt          32.241798       Portugese
182  br94ag01.txt          32.757263       Portugese
45    br94fe1.txt          32.972234       Portugese
179  br94ma01.txt          33.606311       Portugese
100  ag94se06.txt          34.076098       Portugese

In [0]:
df=pd.DataFrame()
df["file_name"]=test_files
df["Perplexity Values"]=pp
df=df.sort_values(by='Perplexity Values',axis=0)
threshold=22
df["Predicted Label"]=df["Perplexity Values"].apply(lambda x: "English " if x<threshold else "Portugese")
print("Given below are the non english files")
print(df[df["Perplexity Values"]>threshold])# Your code here

Given below are the non english files
        file_name  Perplexity Values Predicted Label
203   ag94fe1.txt          27.078045       Portugese
18   ag94ja11.txt          29.275315       Portugese
107  br94ab02.txt          30.196547       Portugese
69    ag94mr1.txt          30.415082       Portugese
29   br94ju01.txt          30.605785       Portugese
12   ag94ab12.txt          31.078740       Portugese
103  ag94no01.txt          31.179321       Portugese
34   br94jl01.txt          31.190554       Portugese
50   ag94jl12.txt          31.344721       Portugese
170  br94de01.txt          31.424423       Portugese
152  br94ja04.txt          31.668146       Portugese
158  ag94ma03.txt          31.704502       Portugese
175  ag94ou04.txt          31.788427       Portugese
83   ag94de06.txt          32.058848       Portugese
130  ag94ju07.txt          32.076658       Portugese
168  ag94ag02.txt          32.241798       Portugese
182  br94ag01.txt          32.757263       Portugese
45    br

file name            perplexity      Predicted label       True Label

ag94fe1.txt          27.078045       Portugese              Portugese

ag94ja11.txt          29.275315       Portugese             Portugese

br94ab02.txt          30.196547       Portugese            Portugese

ag94mr1.txt          30.415082       Portugese               Portugese

br94ju01.txt          30.605785       Portugese             Portugese

ag94ab12.txt          31.078740       Portugese              Portugese

ag94no01.txt          31.179321       Portugese              Portugese

br94jl01.txt          31.190554       Portugese               Portugese

ag94jl12.txt          31.344721       Portugese              Portugese

br94de01.txt          31.424423       Portugese               Portugese

br94ja04.txt          31.668146       Portugese              Portugese

ag94ma03.txt          31.704502       Portugese               Portugese

ag94ou04.txt          31.788427       Portugese               Portugese

ag94de06.txt          32.058848       Portugese               Portugese

ag94ju07.txt          32.076658       Portugese              Portugese

ag94ag02.txt          32.241798       Portugese                Portugese

br94ag01.txt          32.757263       Portugese               Portugese

br94fe1.txt          32.972234       Portugese                Portugese

br94ma01.txt          33.606311       Portugese              Portugese

ag94se06.txt          34.076098       Portugese                 Portugese

## English has perplexity between 7-21 while Portugese has perplexity between 27-34

## 1.3
Build a trigram language model with add-λ smoothing (use λ = 0.1).

Sort the test documents by perplexity and perform a check for Brazilian Portuguese documents as above:

  - Observe the perplexity scores and set a cut-off threshold. All the documents above this threshold score should be categorized as Brazilian Portuguese. 
  - Print the file names and perplexities of the documents above the threshold

  ```
      file name, score
      file name, score
      . . .
      file name, score
  ```

  - Copy this list of filenames and manually annotate them for correctness.

file_name  Perplexity Values Predicted Label
67       cg04           7.304036        English 
209      cf13           7.449559        English 
132      cf26           7.494317        English 
11       ce21           7.581537        English 
216      cf27           7.602170        English 
52       ce17           7.604204        English 
117      ce24           7.644647        English 
160      cd01           7.746349        English 
124      cg11           7.846200        English 
205      cd13           7.855948        English 
41       cf06           7.878799        English 
8        cf07           7.983433        English 
31       ce19           8.018265        English 
118      ce08           8.033742        English 
7        cg09           8.045040        English 
191      cd11           8.078555        English 
147      ce02           8.083724        English 
81       ce28           8.159405        English 
101      cf36           8.162105        English 
122      cg05           8.172458        English 
151      cd07           8.184022        English 
33       cf43           8.241441        English 
93       cd04           8.265286        English 
90       cb08           8.334354        English 
76       ce20           8.336455        English 
27       cf37           8.422766        English 
104      ce29           8.444962        English 
38       cf12           8.459403        English 
181      cb10           8.474488        English 
185      cf42           8.482337        English 
15       ce15           8.601110        English 
91       cf48           8.621589        English 
88       cf31           8.654008        English 
183      cf44           8.715900        English 
150      cf04           8.729199        English 
99       cd17           8.756755        English 
102      cf03           8.764841        English 
125      cf33           8.775444        English 
112      cb09           8.810848        English 
92       ce16           8.823718        English 
97       ce06           8.827657        English 
73       cf11           8.939385        English 
167      cd09           9.002029        English 
53       cf39           9.019933        English 
16       cd02           9.070546        English 
94       cg01           9.074175        English 
49       cd12           9.155629        English 
178      ca35           9.157226        English 
144      cd05           9.203840        English 
95       cf01           9.213021        English 
70       cd10           9.214313        English 
184      cg02           9.239399        English 
198      ce11           9.257273        English 
206      cf08           9.267367        English 
149      ce03           9.277927        English 
108      ce32           9.283809        English 
156      cf45           9.306165        English 
219      ca44           9.311409        English 
212      cf38           9.333878        English 
22       ce33           9.339820        English 
196      cd16           9.368256        English 
201      cb17           9.375684        English 
116      ce36           9.395724        English 
169      cg10           9.410946        English 
25       ce23           9.426030        English 
20       ce30           9.445501        English 
131      cd15           9.526335        English 
24       cf34           9.579629        English 
138      cf15           9.583606        English 
48       cd06           9.588310        English 
98       ce35           9.627246        English 
54       cf47           9.666189        English 
189      cf23           9.675814        English 
171      cc13           9.676550        English 
114      cb03           9.683189        English 
105      cb22           9.692627        English 
75       cf19           9.695514        English 
110      cf21           9.714171        English 
85       cb13           9.746956        English 
166      cf40           9.772725        English 
157      ce14           9.786911        English 
9        cb14           9.796737        English 
4        cb21           9.812612        English 
159      ce31           9.835854        English 
120      cb04           9.845955        English 
165      ca36           9.867804        English 
176      cf28           9.943745        English 
42       ce27           9.946855        English 
148      ce26           9.957617        English 
140      ca38           9.989771        English 
57       cg03          10.000188        English 
143      cc06          10.006471        English 
65       ce04          10.018981        English 
77       ca08          10.031330        English 
145      ce25          10.057300        English 
115      cb19          10.061790        English 
162      cf02          10.073499        English 
199      cf16          10.073657        English 
109      ce34          10.074691        English 
6        cd08          10.108209        English 
113      cb23          10.110773        English 
72       cf24          10.127622        English 
62       ce07          10.132447        English 
74       cd14          10.146853        English 
2        cf14          10.155125        English 
96       cb01          10.160223        English 
195      cf09          10.161904        English 
197      cb15          10.194039        English 
89       cb27          10.237407        English 
163      cc16          10.266433        English 
154      cf25          10.294652        English 
217      cb07          10.317523        English 
210      ca05          10.320732        English 
111      cb16          10.334316        English 
207      cf18          10.391376        English 
161      cc02          10.413781        English 
17       cb18          10.418634        English 
59       cf22          10.456223        English 
71       cc03          10.470589        English 
190      ce13          10.472309        English 
82       cb20          10.501107        English 
66       cf10          10.574416        English 
121      cc05          10.610752        English 
23       ca28          10.635858        English 
5        cb05          10.641327        English 
32       ce18          10.697165        English 
21       ce22          10.713839        English 
137      ca43          10.734768        English 
192      cc17          10.741346        English 
218      cf30          10.805893        English 
46       cf20          10.819290        English 
14       cf32          10.945491        English 
172      cg07          10.994494        English 
127      ca41          11.087999        English 
47       cg08          11.116685        English 
13       ca04          11.116924        English 
214      cc08          11.119927        English 
142      ca03          11.135558        English 
155      ce01          11.148113        English 
200      cf35          11.260124        English 
129      cb02          11.381932        English 
86       ca34          11.419067        English 
30       cc12          11.419384        English 
35       cc14          11.517960        English 
136      cb12          11.523518        English 
3        cc07          11.621524        English 
80       ce10          11.677504        English 
0        cd03          11.701323        English 
134      cc10          11.707887        English 
63       cc04          11.718321        English 
164      cf05          11.761819        English 
87       cb24          11.776969        English 
1        cf41          11.865708        English 
26       cb06          11.869949        English 
139      cf46          11.872384        English 
187      ca19          11.969772        English 
39       cb25          11.983505        English 
84       cg06          11.984293        English 
204      cc01          11.987018        English 
51       ca20          12.104867        English 
123      cb26          12.150470        English 
194      ca01          12.168731        English 
177      ca06          12.182307        English 
64       cc11          12.233150        English 
141      ca10          12.245598        English 
60       ca32          12.311206        English 
202      ca26          12.385267        English 
126      ca09          12.391217        English 
173      ce05          12.422467        English 
215      ca33          12.445762        English 
43       ca22          12.489832        English 
10       ca30          12.513109        English 
56       ca27          12.560387        English 
128      ca14          12.658197        English 
174      ca07          12.776310        English 
78       cc09          12.825969        English 
19       ca12          13.053244        English 
208      ca39          13.084837        English 
58       ca21          13.154931        English 
55       ca37          13.161555        English 
153      cb11          13.170134        English 
119      ca24          13.196285        English 
188      ca02          13.214838        English 
213      ca29          13.282081        English 
28       ca42          13.288785        English 
133      ce12          13.290805        English 
193      ca15          13.484348        English 
37       cf17          13.622221        English 
211      ca25          13.761926        English 
106      cf29          13.765441        English 
186      ca13          13.797593        English 
135      ca40          13.854586        English 
79       ca11          14.935484        English 
180      cc15          15.065617        English 
146      ca23          15.456654        English 
68       ca31          15.697082        English 
44       ca16          17.139938        English 
36       ca17          17.880277        English 
40       ce09          19.294780        English 
61       ca18          19.738884        English 
203   ag94fe1.txt          28.606582       Portugese
107  br94ab02.txt          30.531663       Portugese
34   br94jl01.txt          30.885890       Portugese
29   br94ju01.txt          30.921611       Portugese
18   ag94ja11.txt          31.043050       Portugese
170  br94de01.txt          31.227055       Portugese
152  br94ja04.txt          31.281177       Portugese
12   ag94ab12.txt          31.295896       Portugese
103  ag94no01.txt          31.400614       Portugese
45    br94fe1.txt          31.615863       Portugese
69    ag94mr1.txt          31.810722       Portugese
175  ag94ou04.txt          31.839155       Portugese
50   ag94jl12.txt          31.850135       Portugese
130  ag94ju07.txt          31.901756       Portugese
182  br94ag01.txt          32.037548       Portugese
158  ag94ma03.txt          32.210307       Portugese
168  ag94ag02.txt          32.494577       Portugese
83   ag94de06.txt          32.653609       Portugese
100  ag94se06.txt          33.076175       Portugese
179  br94ma01.txt          33.539113       Portugese

In [0]:
corpus=[]
for i in range(len(train)):
  for j in range(len(train[i])):
    corpus.append((re.sub("[\x1a]+", ' '," ".join(train[i][j])).lstrip().rstrip()))

c1,c2,c3=dict(),dict(),dict()

cv1=CountVectorizer(ngram_range=(1,1),analyzer='char')
x1=cv1.fit_transform(corpus)

t1=np.array(np.sum(x1,axis=0))[0]
for i,name in enumerate(cv1.get_feature_names()):
  c1[name]=t1[i] # word and count
 
#removing unigrams that have a count <100
rep=[]
tot=0
for a in c1:
  if c1[a]<100:
    tot +=c1[a]
    rep.append(a)

c1['µ']=tot # unknown token since it doesnt appear int the docs

for i in rep:
  del c1[i]

corpus2=[]
for sent in corpus:
  corpus2.append(re.sub('[/$/&>/`/æ/è/î/~/@/=/</+/%]',"µ" ,sent))

# removed tokens with values less than 100 and replaced with special token
cv1=CountVectorizer(ngram_range=(1,1),analyzer='char')
cv2=CountVectorizer(ngram_range=(2,2),analyzer='char')
cv3=CountVectorizer(ngram_range=(3,3),analyzer='char')
x1=cv1.fit_transform(corpus2)
x2=cv2.fit_transform(corpus2)
x3=cv3.fit_transform(corpus2)

t1=np.array(np.sum(x1,axis=0))[0]
for i,name in enumerate(cv1.get_feature_names()):
  c1[name]=t1[i] 

t2=np.array(np.sum(x2,axis=0))[0]
for i,name in enumerate(cv2.get_feature_names()):
  c2[name]=t2[i]

t3=np.array(np.sum(x3,axis=0))[0]
for i,name in enumerate(cv3.get_feature_names()):
  c3[name]=t3[i]



In [0]:
def generate_trigram_sent(sent):
    tri=[]
    for j in range(len(sent)-3):
      tri.append(sent[j:j+3])
    return tri



def tri_smoothing(token,a3,a2,a1,lam):
  if token not in set(a3.keys()):
    num=lam
  elif token in set(a3.keys()) :
    num=lam+a3[token]
  if token[:2] not in set(a2.keys()):
    den=lam*len(a1)  # len(a1) represents vocabulary
  elif token[:2] in set(a2.keys()):
    den=(lam*len(a1)) + a2[token[:2]]
  return(num/den)

In [0]:
def perplexity(corp,a1,a2,a3,lam): # corp can be the whole test set or dev set  
    perplexity=[]
    for doc in corp: # picking single documents
        #print("doc")
    
        dd=CountVectorizer(ngram_range=(1,1),analyzer='char')
        gg=dd.fit_transform(doc)
        tt=dict()
        tp1=np.array(np.sum(gg,axis=0))[0]
        for i,name in enumerate(dd.get_feature_names()):
          tt[name]=tp1[i] # word and count
        char=np.array(list(tt.values())).sum() # refers to M # total number of characters in a document

        # real calculations begin

        log_prob=0
        for sent in doc: # refers to each document fro the dev set or test set # loops over all thesentences in the document
          trigrams=generate_trigram_sent(sent) # generates trigrams for the given sentence
          for tri_gram in trigrams: # loops over all the characters in a sentence
              

              prob=tri_smoothing(tri_gram,a3,a2,a1,lam)
              log_prob += np.log2(prob)

        l=log_prob/char
        perp=pow(2,-l)
        print(perp)
        perplexity.append(perp)
    return(perplexity)



In [0]:
# preparing test set
TEST=[]
for i in range(len(test)):
  k=[]
  for j in range(len(test[i])):
    k.append((" ".join(test[i][j])).lstrip().rstrip())
  TEST.append(k)

In [0]:
perp=perplexity(TEST,c1,c2,c3,0.1)

11.701323018776531
11.865708032142411
10.155124841480799
11.62152445257162
9.812612435628473
10.641326910534529
10.108208651716497
8.045040098759397
7.98343304549849
9.796736749676292
12.513108799810345
7.581537011881531
31.295896069427805
11.116923763996232
10.945491470446333
8.601109788178563
9.070545837868497
10.418634167058752
31.04304956905218
13.053243765548137
9.44550073309761
10.71383871484805
9.339820493970603
10.635858162697854
9.57962916769351
9.42602952082602
11.869948967428959
8.422765558190747
13.288785457815964
30.92161106154445
11.419384013153264
8.01826538392935
10.697164966162148
8.24144097261492
30.88588998096397
11.517959661315333
17.88027687330908
13.622221091257767
8.459403146755344
11.983504933450977
19.294780139202846
7.878798964939564
9.94685473443481
12.48983165525897
17.139938149166717
31.61586305567981
10.819290233381475
11.116685028217018
9.588310345720345
9.155629273611313
31.85013491129407
12.1048672342839
7.60420351188095
9.019932535533114
9.666189090494

In [0]:
df1=pd.DataFrame()
df1["file_name"]=test_files
df1["Perplexity Values"]=perp

In [0]:
df1=df1.sort_values(by='Perplexity Values',axis=0)

In [0]:
threshold=22
df1["Predicted Label"]=df1["Perplexity Values"].apply(lambda x: "English " if x<threshold else "Portugese")
print("Given below are thenon english files")
print(df1[df1["Perplexity Values"]>threshold])


Given below are thenon english files
        file_name  Perplexity Values Predicted Label
203   ag94fe1.txt          28.606582       Portugese
107  br94ab02.txt          30.531663       Portugese
34   br94jl01.txt          30.885890       Portugese
29   br94ju01.txt          30.921611       Portugese
18   ag94ja11.txt          31.043050       Portugese
170  br94de01.txt          31.227055       Portugese
152  br94ja04.txt          31.281177       Portugese
12   ag94ab12.txt          31.295896       Portugese
103  ag94no01.txt          31.400614       Portugese
45    br94fe1.txt          31.615863       Portugese
69    ag94mr1.txt          31.810722       Portugese
175  ag94ou04.txt          31.839155       Portugese
50   ag94jl12.txt          31.850135       Portugese
130  ag94ju07.txt          31.901756       Portugese
182  br94ag01.txt          32.037548       Portugese
158  ag94ma03.txt          32.210307       Portugese
168  ag94ag02.txt          32.494577       Portugese
83   ag94

Index file_name  Perplexity Values Predicted Label true label

203   ag94fe1.txt          28.606582       Portugese Portugese

107  br94ab02.txt          30.531663       Portugese Portugese

34   br94jl01.txt          30.885890       Portugese Portugese

29   br94ju01.txt          30.921611       Portugese Portugese

18   ag94ja11.txt          31.043050       Portugese Portugese

170  br94de01.txt          31.227055       Portugese Portugese

152  br94ja04.txt          31.281177       Portugese Portugese

12   ag94ab12.txt          31.295896       Portugese Portugese

103  ag94no01.txt          31.400614       Portugese Portugese

45    br94fe1.txt          31.615863       Portugese Portugese

69    ag94mr1.txt          31.810722       Portugese Portugese

175  ag94ou04.txt          31.839155       Portugese Portugese

50   ag94jl12.txt          31.850135       Portugese Portugese

130  ag94ju07.txt          31.901756       Portugese Portugese

182  br94ag01.txt          32.037548       Portugese Portugese

158  ag94ma03.txt          32.210307       Portugese Portugese

168  ag94ag02.txt          32.494577       Portugese Portugese

83   ag94de06.txt          32.653609       Portugese Portugese

100  ag94se06.txt          33.076175       Portugese Portugese

179  br94ma01.txt          33.539113       Portugese Portugese

## English has a perplexity between 7.3-19.73 while Portugese has perplexity between 28-33

## 1.4
Based on your observation from above questions, compare linear interpolation and add-λ smoothing by listing out their pros and cons.

Linear Interpolation and Lambda smoothing has provided similar perplexity values but one can notice that both models perform equally well
Advantage of Linear interpolation is that it given weightage to bigrams and unigrams when trigram is missing
Disadvantage is Linear interpolation is a time consuming process which takes a lon time to execute
Advantage of Lambda smoothing is that its faster.