**Assignment 4**

The following file contains a number of sentence-aligned parallel texts taken from the European Parliament proceedings. The file contains Swedish-English, German-English, and French-English sentence pairs. For instance, the file europarl-v7.sv-en.lc.sv contains the Swedish part of the Swedish-English dataset. The texts have been preprocessed to be easier to work with: all words are in lowercase, and punctuation has been separated from the words. This means that you can split each sentence into separate words simply by splitting on whitespace.

In [26]:
import pandas as pd
import numpy as np

from collections import Counter

en_de = open("europarl-v7.de-en.lc.en", "r", encoding="utf-8").read()
en_fr = open("europarl-v7.fr-en.lc.en", "r", encoding="utf-8").read()
en_sv = open("europarl-v7.sv-en.lc.en", "r", encoding="utf-8").read()
en = en_de + " " + en_fr + " " + en_sv

de = open("europarl-v7.de-en.lc.de", "r", encoding="utf-8").read()
fr = open("europarl-v7.fr-en.lc.fr", "r", encoding="utf-8").read()
sv = open("europarl-v7.sv-en.lc.sv", "r", encoding="utf-8").read()

all_ = en + " " + de + " " + fr + " " + sv

**a) Print the 10 most frequent words in each language.**

In [27]:
Counter(en.split()).most_common(15)

[('the', 58790),
 (',', 42043),
 ('.', 29542),
 ('of', 28406),
 ('to', 26842),
 ('and', 21459),
 ('in', 18485),
 ('is', 13331),
 ('that', 13219),
 ('a', 13090),
 ('we', 9936),
 ('this', 9916),
 ('for', 8973),
 ('i', 8896),
 ('be', 7842)]

In [28]:
Counter(de.split()).most_common(15)

[(',', 18549),
 ('die', 10521),
 ('.', 9733),
 ('der', 9374),
 ('und', 7028),
 ('in', 4175),
 ('zu', 3168),
 ('den', 2976),
 ('wir', 2863),
 ('daß', 2738),
 ('ich', 2670),
 ('das', 2669),
 ('für', 2483),
 ('von', 2476),
 ('ist', 2277)]

In [29]:
Counter(fr.split()).most_common(15)

[('&apos;', 16729),
 (',', 15402),
 ('de', 14520),
 ('la', 9746),
 ('.', 9734),
 ('et', 6619),
 ('l', 6536),
 ('le', 6174),
 ('les', 5585),
 ('à', 5500),
 ('des', 5232),
 ('que', 4797),
 ('d', 4555),
 ('en', 4018),
 ('nous', 3437)]

In [30]:
Counter(sv.split()).most_common(15)

[('.', 9648),
 ('att', 9181),
 (',', 8876),
 ('och', 7038),
 ('i', 5949),
 ('det', 5687),
 ('som', 5028),
 ('för', 4959),
 ('av', 4013),
 ('är', 3840),
 ('en', 3724),
 ('vi', 3211),
 ('jag', 3093),
 ('den', 2953),
 ('de', 2930)]

In [31]:
all_counter = Counter(all_.split())


The probability of a word occurring anywhere in the text data is estimated as the number of times the word occurs divided by the total number of tokens.

In [32]:
all_counter["speaker"]/sum(all_counter.values())

1.9327394942430718e-05

In [33]:
all_counter["zebra"]/sum(all_counter.values())

0.0


**b) Language modeling**

Implement a bigram language model and use it to compute the probability of a short sentence.

In [34]:
import math
import re
from collections import Counter, defaultdict

import nltk
import numpy as np
import pandas as pd

# Shorthand for building a DataFrame from data, row labels, and column labels
pdf = lambda data, index=None, columns=None: pd.DataFrame(data, index, columns)

In [35]:
en.splitlines() #data overview

['i declare resumed the session of the european parliament adjourned on friday 17 december 1999 , and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .',
 'although , as you will have seen , the dreaded &apos; millennium bug &apos; failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .',
 'you have requested a debate on this subject in the course of the next few days , during this part-session .',
 'in the meantime , i should like to observe a minute &apos; s silence , as a number of members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the european union .',
 'you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka .',
 'one of the people assassinated very recently in sri lanka was mr kumar ponnambalam

In [36]:

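# Replace every non-word character with a space, but keep the newlines:
# r'[^\w]' would also erase '\n', so splitlines() would return one giant "sentence"
clean_en = re.sub(r'[^\w\n]', ' ', en)  # removed the symbols
the_list = clean_en.splitlines()  # split the data into sentences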

In [37]:
corpus = the_list
counter = Counter()  # word frequencies over the whole corpus
for sentence in corpus:
    counter.update(sentence.split())
# most_common() without an argument returns every (word, count) pair, sorted by count
counter = counter.most_common()
words = [w for w, _ in counter]  # vocabulary, most frequent word first
lec = len(counter)  # vocabulary size
word2id = {w: i for i, w in enumerate(words)}
id2word = {i: w for w, i in word2id.items()}
pdf(counter, None, ['word', 'freq'])

Unnamed: 0,word,freq
0,the,58807
1,of,28511
2,to,26875
3,and,21470
4,in,18599
...,...,...
11367,ponder,1
11368,gruesome,1
11369,reassess,1
11370,filtering,1


In [38]:
unigram = np.array([i[1] for i in counter])  # raw counts, most frequent word first

unigram_freq = unigram / unigram.sum()  # relative frequencies

unigram_df = pdf(unigram_freq.reshape(1, lec), ['prob'], words)

In [39]:
bigram = np.zeros((lec, lec)) + 1e-8  # smoothing: unseen bigrams get a tiny non-zero count
for sentence in corpus:
    temp = [word2id[w] for w in sentence.split()]
    for j in range(1, len(temp)):
        # row = previous word, column = current word
        bigram[temp[j - 1], temp[j]] += 1

In [40]:
# Frequency
bigram_df=pd.DataFrame(bigram, words, words, int)

In [41]:
# Frequency --> prob: normalize each row so that it sums to 1
bigram_freq = bigram / bigram.sum(axis=1, keepdims=True)
pdf(bigram_freq, words, words)

Unnamed: 0,the,of,to,and,in,is,that,a,we,this,i,for,be,it,on,which,are,have,as,not,with,will,european,commission,by,apos,has,mr,would,an,at,s,but,should,must,all,also,you,our,there,...,afflict,burst,aquifers,historically,whims,unpredictable,terrain,adversity,hinder,disservice,lucky,lake,tightly,surround,intimated,realism,diversions,achievable,arid,meaningless,58,extraction,deletion,scottish,fathomless,wounded,unusable,annihilation,prohibitive,irrigation,rainwater,soil,marset,campos,chicanery,ponder,gruesome,reassess,filtering,endanger
the,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-05,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.020287e-04,1.700478e-13,3.400956e-04,4.455252e-02,5.839441e-02,1.700478e-13,5.271481e-04,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,5.101434e-05,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,...,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-05,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-05,1.700478e-13,1.700478e-05,1.700478e-13,1.700478e-13,1.700478e-05,1.700478e-05,1.700478e-13,1.700478e-05,1.700478e-05,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13,1.700478e-13
of,3.183333e-01,3.507418e-13,3.507418e-13,7.014836e-04,1.753709e-04,2.104451e-04,2.735786e-03,2.911157e-02,3.507418e-13,3.538985e-02,1.052225e-04,1.402967e-04,3.507418e-13,2.700712e-03,3.507418e-13,4.454421e-03,3.507418e-13,3.507418e-13,2.104451e-04,2.805935e-04,3.507418e-13,1.052225e-04,7.155133e-03,8.768545e-04,3.507418e-13,9.470029e-04,3.507418e-13,1.543264e-03,3.507418e-13,7.400652e-03,3.858160e-04,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,1.153941e-02,3.507418e-13,2.525341e-03,1.252148e-02,3.507418e-13,...,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-13,3.507418e-05,3.507418e-13
to,1.301581e-01,3.720930e-13,3.720930e-13,1.079070e-03,1.525581e-03,2.232558e-04,2.902326e-03,1.600000e-02,3.348837e-04,1.506977e-02,1.116279e-04,1.116279e-04,7.851163e-02,2.493023e-03,7.441860e-05,4.167442e-03,3.720930e-13,1.350698e-02,7.069767e-04,1.116279e-04,1.116279e-04,3.720930e-13,1.116279e-03,2.232558e-04,1.860465e-04,3.720930e-13,3.720930e-13,3.051163e-03,3.720930e-13,3.534884e-03,5.581395e-04,3.720930e-13,1.116279e-04,3.720930e-13,3.720930e-13,3.906977e-03,3.720930e-04,3.646512e-03,4.651163e-03,1.116279e-04,...,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-13,3.720930e-05,3.720930e-13,3.720930e-05
and,1.000000e-01,8.523521e-03,2.412669e-02,4.657662e-13,1.867722e-02,4.517932e-03,2.538426e-02,1.429902e-02,2.310200e-02,1.234280e-02,3.632976e-02,1.187704e-02,9.315324e-04,1.201677e-02,4.471355e-03,7.265952e-03,3.446670e-03,1.071262e-03,4.797392e-03,8.849557e-03,3.353517e-03,4.657662e-03,2.002795e-03,6.520727e-04,2.049371e-03,2.794597e-04,2.515137e-03,5.309734e-03,1.071262e-03,1.956218e-03,2.561714e-03,4.657662e-13,4.657662e-13,1.210992e-03,2.095948e-03,1.583605e-03,6.707033e-03,1.909641e-03,3.912436e-03,3.959013e-03,...,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-05,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13,4.657662e-13
in,2.703371e-01,5.376633e-13,1.612990e-04,1.021560e-03,5.376633e-13,1.612990e-04,6.667025e-03,3.672240e-02,5.376633e-13,7.064896e-02,5.376633e-13,2.688317e-04,5.376633e-13,1.666756e-03,1.612990e-04,2.026991e-02,5.376633e-13,5.376633e-13,1.021560e-03,5.376633e-13,1.021560e-03,5.376633e-13,1.935588e-03,5.376633e-13,3.225980e-04,5.914296e-04,5.376633e-13,2.688317e-04,5.376633e-13,5.860530e-03,4.301307e-04,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,8.172482e-03,5.376633e-13,1.612990e-04,1.365665e-02,1.612990e-04,...,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13,5.376633e-13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ponder,9.998863e-01,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,...,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09
gruesome,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,...,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09
reassess,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,...,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09
filtering,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-01,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,...,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09,9.998863e-09


In [42]:
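def prob(sentence):
    # Map words to ids; a word outside the vocabulary raises a KeyError here
    s = [word2id[w] for w in sentence.split()]
    les = len(s)
    if les < 1:
        return 0
    # P(w1) comes from the unigram model
    p = unigram_freq[s[0]]
    # Multiply by P(w_i | w_{i-1}) for every following word
    for i in range(1, les):
        p *= bigram_freq[s[i - 1], s[i]]
    return p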

In [43]:
print(prob("in the"))

0.006409520929440792


**What happens if you try to compute the probability of a sentence that contains a word that did not appear in the training texts?**
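
If a word did not appear in the training text, its estimated probability is zero, so the probability of the whole sentence is zero as well. (With the implementation above, the lookup in `word2id` raises a `KeyError` before any probability is computed.) We can fix this with Laplace (add-one) smoothing, combined with a placeholder entry for unknown words.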
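
As a minimal sketch of this idea, the cell below applies add-one smoothing to a bigram model built from nested counters. The toy sentences, the `<unk>` placeholder, and the function name `laplace_bigram_prob` are assumptions made for illustration; they are not part of the assignment data or of the model above.

```python
from collections import Counter, defaultdict

# Toy training sentences (illustrative only, not taken from the Europarl files)
toy_corpus = [
    "we support the report",
    "we support the commission",
    "the commission welcomes the report",
]

UNK = "<unk>"  # hypothetical placeholder for words unseen in training

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sentence in toy_corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1

vocab = set(unigram_counts) | {UNK}
V = len(vocab)  # vocabulary size including the placeholder


def laplace_bigram_prob(prev, cur):
    """P(cur | prev) with add-one smoothing; unseen words are mapped to UNK."""
    prev = prev if prev in unigram_counts else UNK
    cur = cur if cur in unigram_counts else UNK
    # Number of training bigrams that start with prev
    context_total = sum(bigram_counts[prev].values())
    return (bigram_counts[prev][cur] + 1) / (context_total + V)


print(laplace_bigram_prob("the", "report"))  # seen bigram
print(laplace_bigram_prob("the", "zebra"))   # unseen word, still non-zero
```

Every row still sums to 1 over the vocabulary, and no sentence receives probability zero just because it contains a new word.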

**And what happens if your sentence is very long (e.g. 100 words or more)?**

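If the sentence is long, its probability is a product of many factors smaller than 1, so it becomes extremely small and can eventually underflow to 0.0 in floating-point arithmetic. The usual remedy is to add log-probabilities instead of multiplying probabilities. A larger n in the n-gram model (like a 50-gram) does not solve this, since the product still has one factor per word and the counts become far sparser. However, I do not think the raw probability of a single long sentence gives us much useful information on its own.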
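
The effect can be illustrated with a short sketch. The per-word probability of 1e-4 is an invented, illustrative value and not an estimate from the data; the point is only to compare a direct product with a sum of logarithms.

```python
import math

# Hypothetical conditional probabilities for a 100-word sentence
# (1e-4 per word is an invented value, used only for illustration)
word_probs = [1e-4] * 100

# Multiplying directly: the true value 1e-400 is below the smallest
# representable double, so the product collapses to 0.0
product = 1.0
for p in word_probs:
    product *= p

# Summing logarithms keeps the result in a comfortable numeric range
log_prob = sum(math.log(p) for p in word_probs)

print(product)
print(log_prob)
```

Log-probabilities of two sentences can still be compared directly, because the logarithm preserves the ordering of the probabilities.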



**c) Translation modeling**


We will now estimate the parameters of the translation model P(f|e).


We then print, for either Swedish, German, or French, the 10 words that the English word *european* is most likely to be translated into, according to our estimate.
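
Before running on the full data, the estimation idea can be checked on a tiny corpus. The sketch below runs the expectation-maximization updates of IBM Model 1 (without a NULL word, for brevity) on three invented sentence pairs; the corpus and the helper name `best_translation` are assumptions for illustration only.

```python
from collections import defaultdict

# Toy parallel corpus (illustrative, not from the Europarl files):
# each pair is (English words, foreign words)
pairs = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]

f_vocab = {f for _, fs in pairs for f in fs}
uniform = 1.0 / len(f_vocab)
t = defaultdict(lambda: uniform)  # t[(e, f)] approximates P(f | e)

for _ in range(20):
    count_ef = defaultdict(float)
    count_e = defaultdict(float)
    # E-step: distribute each foreign word over the English words of its sentence
    for es, fs in pairs:
        for f in fs:
            norm = sum(t[(e, f)] for e in es)
            for e in es:
                delta = t[(e, f)] / norm
                count_ef[(e, f)] += delta
                count_e[e] += delta
    # M-step: re-estimate P(f | e) from the expected counts
    for (e, f), c in count_ef.items():
        t[(e, f)] = c / count_e[e]


def best_translation(e):
    """Return the foreign word with the highest estimated P(f | e)."""
    candidates = {f: p for (e2, f), p in t.items() if e2 == e}
    return max(candidates, key=candidates.get)


print(best_translation("the"), best_translation("book"))
```

Although "the" co-occurs with "das", "haus", and "buch", the repeated updates concentrate the probability mass on "das", because "haus" and "buch" are better explained by "house" and "book".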

In [44]:
import collections
lang1 = fr  # input (foreign side)
lang2 = en_fr  # output (English side)
#lang1 = en_sv
#lang2 = sv
n = 100

n_words = len(set(lang2.split()))
lang1 = lang1.splitlines()
lang2 = lang2.splitlines()

t = defaultdict(lambda: defaultdict(lambda: 1/n_words))


In [45]:
def get_unique(en, sv):
  # Order-preserving de-duplication; dict.fromkeys avoids the quadratic list scan
  unique_sv = list(dict.fromkeys(sv))
  unique_en = list(dict.fromkeys(en))
  unique_en.append('NULL')  # the NULL word stands for "aligned to nothing"
  return unique_en, unique_sv


In [46]:
def ibmModell(en, sv):
  # Unique English words (plus NULL) and unique foreign words
  en_unique, sv_unique = get_unique(en.split(), sv.split())
  # Use the texts passed as arguments, one sentence per line
  en_sentences = en.splitlines()
  sv_sentences = sv.splitlines()
  sentence_num = min(len(en_sentences), len(sv_sentences))
  # Initializing t(f|e) uniformly
  uniform = 1.0 / len(en_unique)
  t_fe = defaultdict(lambda: uniform)

  # For each EM iteration
  Iteration_num = 10
  for it in range(Iteration_num):
    print("Iteration:", it + 1)
    count_fe = defaultdict(float)
    count_e = defaultdict(float)
    # E-step: collect expected counts over every sentence pair
    for stc in range(sentence_num):
      # split() gives the words; splitlines() would return the whole sentence as one item
      sv_word_insentence = sv_sentences[stc].split()
      en_word_insentence = en_sentences[stc].split()
      en_word_insentence.append('NULL')  # including the NULL word
      # For each foreign word
      for sv_word in sv_word_insentence:
        # Normalizer: sum of t(f|e) over all English words in this sentence
        norm = sum(t_fe[(enw, sv_word)] for enw in en_word_insentence)
        # For each English word
        for en_word in en_word_insentence:
          delta = t_fe[(en_word, sv_word)] / norm
          count_fe[(en_word, sv_word)] += delta
          count_e[en_word] += delta
    # M-step: re-estimate t(f|e) once per iteration, after all sentence pairs
    for (en_word, sv_word) in count_fe:
      t_fe[(en_word, sv_word)] = count_fe[(en_word, sv_word)] / count_e[en_word]
  return t_fe
         
          

In [48]:
t_sv_en=ibmModell(en_sv,sv)

Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 6
Iteration: 7
Iteration: 8
Iteration: 9
Iteration: 10


In [None]:
# The top 10 Swedish words for "european"
# Keys of t_sv_en are (english_word, swedish_word) pairs; keep those whose
# English side is exactly "european"
possible_dict = {sv_word: p
                 for (en_word, sv_word), p in t_sv_en.items()
                 if en_word == 'european'}
sorted(possible_dict.items(), key=lambda x: x[1], reverse=True)[:10]

**d) Decoding**

We want to find E* = argmax_E P(E|F).

F is the given source-language sentence, and we want to find the sentence E that has the highest probability given F. The task is to explain how such a decoding algorithm works.
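
By Bayes' rule, argmax_E P(E|F) = argmax_E P(F|E) P(E), so a decoder can combine a translation model with a language model. The cell below is only a sketch under strong assumptions: translation is word-for-word and monotone (no reordering, insertion, or deletion), and the two probability tables are invented toy values rather than estimates from the data. The names `translation`, `bigram_lm`, `FLOOR`, and `decode` are hypothetical.

```python
import math

# Hypothetical toy parameters, invented for illustration only.
# translation[f][e] stands in for P(f | e); bigram_lm[(prev, cur)] for P(cur | prev).
translation = {
    "das": {"the": 0.7, "that": 0.3},
    "haus": {"house": 0.8, "home": 0.2},
}
bigram_lm = {
    ("<s>", "the"): 0.5, ("<s>", "that"): 0.5,
    ("the", "house"): 0.4, ("the", "home"): 0.1,
    ("that", "house"): 0.1, ("that", "home"): 0.1,
}
FLOOR = 1e-8  # fallback probability for bigrams missing from the toy table


def decode(source_words):
    """Viterbi search for argmax_E P(F|E) P(E) over one-to-one, monotone translations."""
    # best[e] = (log score, best English prefix ending in e); "<s>" marks the start
    best = {"<s>": (0.0, [])}
    for f in source_words:
        new_best = {}
        for e, p_fe in translation[f].items():
            for prev, (score, prefix) in best.items():
                lm = bigram_lm.get((prev, e), FLOOR)
                cand = score + math.log(p_fe) + math.log(lm)
                if e not in new_best or cand > new_best[e][0]:
                    new_best[e] = (cand, prefix + [e])
        best = new_best
    # Pick the highest-scoring complete hypothesis
    return max(best.values(), key=lambda x: x[0])[1]


print(decode(["das", "haus"]))
```

Working in log space keeps the scores stable for longer sentences, and keeping only the best hypothesis per final word makes the search linear in the sentence length.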


In [50]:
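import random
def random_pick(some_list, probabilities):
    # Draw one item from some_list, where item i is chosen with probability probabilities[i]
    x = random.uniform(0, 1)
    cumulative_probability = 0.0
    for item, item_probability in zip(some_list, probabilities):
        cumulative_probability += item_probability
        # Stop at the first item whose cumulative probability exceeds the draw
        if x < cumulative_probability:
            break
    return item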
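
For comparison, the standard library offers weighted sampling through `random.choices`. The sketch below uses invented candidate words and weights (they are assumptions, not model output) and checks that the empirical frequencies come out close to the requested probabilities.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so that repeated runs give the same draws

words = ["the", "a", "this"]      # hypothetical candidates
probabilities = [0.7, 0.2, 0.1]   # hypothetical weights that sum to 1

# Draw 10,000 items with replacement, proportionally to the weights
draws = random.choices(words, weights=probabilities, k=10_000)
freqs = {w: c / len(draws) for w, c in Counter(draws).items()}

print(freqs)
```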