# Machine Translation #

Earlier in class we covered two machine translation models: IBM Model 1 and IBM Model 2.  

Given a foreign sentence $f = (f_1,f_2,...,f_{l_f})$ of length $l_f$ and an English sentence  $e = (e_1,e_2,...,e_{l_e})$ of length $l_e$ with an alignment of each English word $e_j$ to a foreign word $f_i$ according to the alignment function $a: j \rightarrow i$, IBM Model 1 defines the translation probability as follows:
$$ \Large p(e,a|f)= \frac{\epsilon}{(l_f+1)^{l_e}}\prod_{j=1}^{l_e}t(e_j|f_{a(j)}) $$

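Since we include the special NULL token, there are actually $l_f+1$ input words, which is why $l_f+1$ appears in the denominator. Here $t(e|f)$ is the lexical translation probability and $\epsilon$ is a normalization constant.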
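
To make the Model 1 formula concrete, here is a minimal sketch that evaluates $p(e,a|f)$ for one toy sentence pair. The translation table values, the helper name `model1_probability`, and the choice $\epsilon = 1$ are all made up for illustration; they are not estimated from the lab data.

```python
# Toy lexical translation table t(e | f); the values are invented for illustration.
t = {
    ("the", "la"): 0.7,
    ("house", "casa"): 0.8,
    ("the", "NULL"): 0.1,
}

def model1_probability(english, foreign, alignment, t, epsilon=1.0):
    """Return p(e, a | f) under IBM Model 1 (hypothetical helper).

    alignment[j] is an index into [NULL] + foreign for the j-th English word,
    so index 0 means "aligned to the NULL token".
    """
    foreign_with_null = ["NULL"] + foreign
    l_e, l_f = len(english), len(foreign)
    # Uniform alignment term: epsilon / (l_f + 1)^l_e
    prob = epsilon / (l_f + 1) ** l_e
    # Product of lexical translation probabilities t(e_j | f_a(j))
    for j, e_word in enumerate(english):
        f_word = foreign_with_null[alignment[j]]
        prob *= t.get((e_word, f_word), 0.0)
    return prob

english = ["the", "house"]
foreign = ["la", "casa"]
alignment = [1, 2]  # the -> la, house -> casa
p = model1_probability(english, foreign, alignment, t)
print(p)
```

With two foreign words plus NULL and two English words, the uniform term is $1/3^2$, which is then multiplied by the two lexical probabilities.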

In IBM Model 2 we add an explicit model for alignment. The alignment of the English word in position $j$ to the foreign input word in position $i$ is modeled by an alignment probability distribution:
$$ \Large a(i|j, l_e, l_f) $$

We combine this probability distribution with the lexical translation probability $t(e|f)$ to come up with the final probability:
$$ \Large p(e,a|f)= \epsilon \prod_{j=1}^{l_e}t(e_j|f_{a(j)}) a(a(j)|j,l_e,l_f) $$
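
The Model 2 formula can be evaluated in the same way as a small sketch. The tables `t` and `a`, their values, and the helper name `model2_probability` are assumptions made up for this illustration, with $\epsilon = 1$.

```python
# Toy lexical translation table t(e | f); values are invented for illustration.
t = {("the", "la"): 0.7, ("house", "casa"): 0.8}

# Toy alignment table a(i | j, l_e, l_f), keyed as (i, j, l_e, l_f).
# Position i = 0 would denote the NULL token; positions are 1-based otherwise.
a = {(1, 1, 2, 2): 0.6, (2, 2, 2, 2): 0.5}

def model2_probability(english, foreign, alignment, t, a, epsilon=1.0):
    """Return p(e, a | f) under IBM Model 2 (hypothetical helper)."""
    foreign_with_null = ["NULL"] + foreign
    l_e, l_f = len(english), len(foreign)
    prob = epsilon
    for j, e_word in enumerate(english, start=1):
        i = alignment[j - 1]  # foreign position aligned to English position j
        prob *= t.get((e_word, foreign_with_null[i]), 0.0)  # t(e_j | f_a(j))
        prob *= a.get((i, j, l_e, l_f), 0.0)                # a(a(j) | j, l_e, l_f)
    return prob

p2 = model2_probability(["the", "house"], ["la", "casa"], [1, 2], t, a)
print(p2)
```

Compared with Model 1, the uniform factor $1/(l_f+1)^{l_e}$ is replaced by one learned alignment probability per English position.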

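The translation probabilities of both models can be combined with a language model $p(e)$ through the noisy-channel model. Since $p(f)$ does not depend on $e$, it can be dropped from the maximization:
$$ \Large \operatorname*{argmax}_e p(e|f) = \operatorname*{argmax}_e \frac{p(f|e)p(e)}{p(f)} = \operatorname*{argmax}_e p(f|e)p(e) $$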
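
As a rough illustration of the noisy-channel decision rule, the sketch below picks the best English candidate from a tiny, hand-written candidate list. The candidate sentences and both score tables are invented; a real decoder searches a much larger space.

```python
# Hypothetical English candidates for one foreign sentence.
candidates = ["the house", "house the", "a home"]

# Invented translation model scores p(f | e) and language model scores p(e).
translation_model = {"the house": 0.20, "house the": 0.20, "a home": 0.05}
language_model = {"the house": 0.010, "house the": 0.0001, "a home": 0.008}

def decode(candidates, translation_model, language_model):
    """Return the candidate e that maximizes p(f | e) * p(e)."""
    return max(candidates, key=lambda e: translation_model[e] * language_model[e])

best = decode(candidates, translation_model, language_model)
print(best)
```

The translation model alone cannot separate "the house" from "house the" here; the language model breaks the tie in favor of the fluent word order.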

In this lab session we will use IBM Models 1 and 2 to estimate lexical translation models from a set of parallel sentences and examine the alignments these models produce. We will also explore the BLEU score, which is the most widely used measure for evaluating machine translation quality.  

We will be using two datasets that were used as official evaluation collections for the WMT 2009 and 2011 conferences.  

The first collection consists of 2000 English-Spanish parallel sentences from the European parliament proceedings.  

The second collection consists of 3003 English-German parallel sentences from various news articles. We will be using a subset of these sentence pairs to train IBM Models 1 and 2.  

In addition to this collection we will also be using English sentences generated from various MT systems that took part in the 2011 WMT evaluations. These systems were evaluated on the news test set.  

We'll be using the __nltk__ implementation of the two IBM models, more specifically the __nltk.translate__ package.

## Loading Collections ##

Both collections are sentence aligned and consist of two files - one for each language. Each line in a file contains a single sentence. Let's first load both collections:

In [None]:
import numpy as np
import nltk
import nltk.translate
stopwords_list = nltk.corpus.stopwords.words('english')

In [None]:
def rank_words_based_on_fc(all_words, top_n=1000):
    # Count how often each word occurs in the collection.
    frequency_count = nltk.FreqDist(all_words)
    # Print the top_n most frequent words together with their counts.
    for word, count in frequency_count.most_common(top_n):
        print(word + "\t" + str(count))

In [None]:
def load_sentences(sent_file):
    # Read one sentence per line and split it into whitespace-separated tokens.
    coll_sent = list()
    for sent in sent_file:
        # split() without arguments also drops the trailing newline.
        coll_sent.append(sent.split())
    return coll_sent

In [None]:
sent_limit = 500

en_euro = list()
all_en_euro_words = list()
r_count = 0
with open("../../../data/mt/europarl/test2008.true.en",'r') as en_euro_file:
    for sent in en_euro_file:
        r_count += 1
        # Stop once sent_limit sentences have been loaded.
        if (r_count > sent_limit):
            break
        if (r_count % 100 == 0):
            print(r_count)
        # Tokenize, lowercase, and drop English stop words.
        sent_words = nltk.word_tokenize(sent)
        sent_words_tok = [word.lower() for word in sent_words if word.lower() not in stopwords_list]
        en_euro.append(sent_words_tok)
        all_en_euro_words.extend(sent_words_tok)

sp_euro = list()
all_sp_euro_words = list()
r_count = 0
with open("../../../data/mt/europarl/test2008.true.es",'r') as sp_euro_file:
    for sent in sp_euro_file:
        r_count += 1
        if (r_count > sent_limit):
            break
        if (r_count % 100 == 0):
            print(r_count)
        # Note: the English stop word list is applied to the Spanish side as well.
        sent_words = nltk.word_tokenize(sent)
        sent_words_tok = [word.lower() for word in sent_words if word.lower() not in stopwords_list]
        sp_euro.append(sent_words_tok)
        all_sp_euro_words.extend(sent_words_tok)

In [None]:
sent_limit = 500

en_news = list()
all_en_news_words = list()
r_count = 0
with open("../../../data/mt/news/newstest2011-ref.en",'r') as en_news_file:
    for sent in en_news_file:
        r_count += 1
        # Stop once sent_limit sentences have been loaded.
        if (r_count > sent_limit):
            break
        if (r_count % 100 == 0):
            print(r_count)
        # Tokenize, lowercase, and drop English stop words.
        sent_words = nltk.word_tokenize(sent)
        sent_words_tok = [word.lower() for word in sent_words if word.lower() not in stopwords_list]
        en_news.append(sent_words_tok)
        all_en_news_words.extend(sent_words_tok)

de_news = list()
all_de_news_words = list()
r_count = 0
with open("../../../data/mt/news/newstest2011-ref.de",'r') as de_news_file:
    for sent in de_news_file:
        r_count += 1
        if (r_count > sent_limit):
            break
        if (r_count % 100 == 0):
            print(r_count)
        # Note: the English stop word list is applied to the German side as well.
        sent_words = nltk.word_tokenize(sent)
        sent_words_tok = [word.lower() for word in sent_words if word.lower() not in stopwords_list]
        de_news.append(sent_words_tok)
        all_de_news_words.extend(sent_words_tok)


Now that both parallel collections are loaded into separate lists, we can build an MT model, starting with IBM Model 1. To train this model we first need to store the parallel sentences in alignment objects. Before we proceed, let's get a sense of the words that occur in both collections and both languages:

In [None]:
rank_words_based_on_fc(all_en_euro_words)

In [None]:
rank_words_based_on_fc(all_sp_euro_words)

In [None]:
rank_words_based_on_fc(all_en_news_words)

In [None]:
rank_words_based_on_fc(all_de_news_words)

Let's now store these two parallel collections into alignment objects:

In [None]:
europarl_corpus = list()
for en, sp in zip(en_euro, sp_euro):
    europarl_corpus.append(nltk.translate.AlignedSent(en,sp))

In [None]:
news_corpus = list()
for en, de in zip(en_news, de_news):
    news_corpus.append(nltk.translate.AlignedSent(en,de))

## IBM Model 1 ##

Now let's train IBM Model 1 using the Europarl bitext. We do this with the __nltk.translate.IBMModel1__ class, whose second argument is the number of EM iterations. Both IBM models use the EM algorithm to train the lexical translation models. Depending on the number of sentences in the collection, running the EM algorithm may take some time. 

In [None]:
ibm1_europarl = nltk.translate.IBMModel1(europarl_corpus, 5)
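
To give a feel for what the EM training does, here is a simplified pure-Python sketch of EM for the lexical translation table on a three-sentence toy bitext. It is not the nltk implementation: it omits the NULL token, and the function name `train_model1` and the toy sentences are made up for this illustration.

```python
from collections import defaultdict

def train_model1(bitext, iterations=5):
    """Estimate t[(e, f)] = t(e | f) with EM (simplified sketch, no NULL token).

    bitext is a list of (english_words, foreign_words) pairs.
    """
    english_vocab = {e for english, _ in bitext for e in english}
    # Start from a uniform distribution over the English vocabulary.
    t = defaultdict(lambda: 1.0 / len(english_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of the link (e, f)
        total = defaultdict(float)  # expected count of f being linked at all
        for english, foreign in bitext:
            for e in english:
                # E-step: spread one unit of count for e over the foreign words,
                # proportionally to the current t(e | f).
                norm = sum(t[(e, f)] for f in foreign)
                for f in foreign:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: re-normalize the expected counts into probabilities.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

toy_bitext = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]
t = train_model1(toy_bitext, iterations=5)
print(round(t[("the", "das")], 3), round(t[("house", "das")], 3))
```

Because "das" co-occurs with "the" in two sentence pairs but with "house" in only one, the probability mass for "das" shifts toward "the" as the iterations proceed.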

**[Assignment 1]**  
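The __translation_table__ attribute holds the estimated lexical translation probabilities, indexed as `translation_table[target_word][source_word]`. Because each `AlignedSent` was built with the English sentence first, English is the target side here. For example, the code below gives the probability of the English word "together" given the Spanish word "todos":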

In [None]:
ibm1_europarl.translation_table["together"]["todos"]

Use this table to explore the trained translation model. 
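
One possible way to explore such a table is to list the most probable target words for a given source word. The helper below is a sketch that assumes a nested mapping indexed as `table[target_word][source_word]`; the name `top_translations` and the toy table values are made up, and the same function should also work on `ibm1_europarl.translation_table` if it follows that indexing.

```python
def top_translations(table, source_word, k=5):
    """Return the k most probable (target_word, probability) pairs
    for source_word, given table[target_word][source_word] (hypothetical helper)."""
    scored = [
        (target, sources[source_word])
        for target, sources in table.items()
        if source_word in sources
    ]
    # Highest probability first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy table with invented probabilities.
toy_table = {
    "together": {"todos": 0.30, "juntos": 0.60},
    "all": {"todos": 0.65},
    "house": {"casa": 0.90},
}
print(top_translations(toy_table, "todos"))
```

The membership test `source_word in sources` avoids creating new entries when the inner mappings are `defaultdict` objects.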

**[Assignment 2]**  
Train IBM Model 1 using the news collection and explore the translation probabilities.

**[Solution 2]**

## IBM Model 2 ##
So far we used IBM Model 1. In this part of the lab session we'll be using IBM Model 2, which is implemented in the __nltk.translate.IBMModel2__ class.

In [None]:
ibm2_europarl = nltk.translate.IBMModel2(europarl_corpus, 5)
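
After training, nltk stores the most probable alignment on each `AlignedSent` in its `alignment` attribute. The sketch below trains IBM Model 2 on a tiny German-English toy bitext (not the lab data) and prints the word links for one sentence pair; the names `toy_bitext`, `toy_ibm2`, and `pair` are made up for this illustration.

```python
from nltk.translate import AlignedSent, IBMModel2

# Tiny toy bitext: the first list is the target side, the second the source side.
toy_bitext = [
    AlignedSent(["klein", "ist", "das", "haus"], ["the", "house", "is", "small"]),
    AlignedSent(["das", "buch", "ist", "ja", "klein"], ["the", "book", "is", "small"]),
    AlignedSent(["das", "haus"], ["the", "house"]),
    AlignedSent(["das", "buch"], ["the", "book"]),
    AlignedSent(["ein", "buch"], ["a", "book"]),
]
toy_ibm2 = IBMModel2(toy_bitext, 5)

# Each alignment entry is a (target_index, source_index) pair;
# a source index of None means the word is aligned to NULL.
pair = toy_bitext[2]
for j, i in sorted(pair.alignment, key=lambda link: link[0]):
    source = pair.mots[i] if i is not None else "NULL"
    print(pair.words[j], "->", source)
```

The same loop can be applied to any sentence pair in `europarl_corpus` once `ibm2_europarl` has been trained.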

**[Assignment 3]**  
Let's use IBM Model 2 to repeat the above assignments. 

**[Solution 3]**  

## Computing BLEU Score ##
In this part of the lab session we are going to analyze the performance of various MT systems using the BLEU score. The nltk.translate package implements this measure in the __nltk.translate.bleu_score__ module. The mt folder contains the translation outputs of various MT systems that participated in the WMT 2011 evaluations. Let's load the generated sentences of one such MT system. 

In [None]:
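from nltk.translate.bleu_score import SmoothingFunction

# Load the German reference translations and the output of one MT system.
with open("../../../data/mt/news/newstest2011-ref.de",'r') as ref_trans_file:
    ref_trans = load_sentences(ref_trans_file)
with open("../../../data/mt/news/newstest2011.en-de.jhu",'r') as mt1_file:
    mt1_output = load_sentences(mt1_file)

# corpus_bleu expects a list of reference translations for each hypothesis,
# so wrap every single reference in its own list.
list_of_references = [[ref] for ref in ref_trans]

bleu_mt1 = nltk.translate.bleu_score.corpus_bleu(list_of_references,mt1_output,smoothing_function=SmoothingFunction().method4)
print(bleu_mt1)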
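
The input layout of `corpus_bleu` is easy to get wrong, so here is a small self-contained sketch on two invented German sentences. Each hypothesis is a list of tokens, and each entry of the reference list is itself a list of one or more tokenized references for the corresponding hypothesis.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of references per hypothesis; here each hypothesis has a single reference.
references = [
    [["das", "haus", "ist", "sehr", "klein"]],
    [["ich", "lese", "heute", "ein", "buch"]],
]
# Invented system output: the first sentence matches, the second is reordered.
hypotheses = [
    ["das", "haus", "ist", "sehr", "klein"],
    ["ich", "lese", "ein", "buch", "heute"],
]

smooth = SmoothingFunction().method4
# Scoring the references against themselves gives the maximum score.
perfect = corpus_bleu(references, [refs[0] for refs in references], smoothing_function=smooth)
partial = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(perfect, partial)
```

The reordered second hypothesis keeps all unigrams but loses higher-order n-gram matches, so its corpus score falls below that of the exact copy.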

**[Assignment 4]**  
Compare the BLEU score of this system with those of the other participating MT systems.

**[Solution 4]**

**[Assignment 5 (Optional)]**  
So far we've only used the first 500 sentences from each bitext to train our models. Re-train the two models using a larger collection ($> 500$) of sentences and observe the translation probabilities of these models. __Note__: Training with a larger number of sentences may take a longer period of time.

**[Solution 5 (Optional)]**