Machine translation is a challenge for computers not to only understand human languages but also to generate languages. A machine translation can be viewed as a conditional language model, given a source sentence $x_i$, we needed to calculate the probability of generated sentence $p(y_i|x_i)$. In early years, statistical machine translation(SMT) was a focus, amongst which IBM models were basis, if you are interested, please visit Michael Collins' [webpage](http://www.cs.columbia.edu/~mcollins/), there he provided many useful and explicit lecture notes to illustrate basis terms and models of SMT.

In recent years, with the development of artificial neural networks as well as deep learning applications, neural translation models were explored, especially [seq2seq](https://arxiv.org/pdf/1406.1078v3.pdf) model as well as later models has improved performances of machine translation.

In the last part, we used a simple encoder-decoder model to translate English to German. In fact, since 2015, more efficient models such as attention models have been proposed and tested effectively in machine translation. In an attention model, the decoder doesn't rely on a fixed vector output by the encoder, but on a context vector varies according to the alignments of source and target sentences, and the context vector is a sum of product of attention weights and source sentence vectors. It is reasonable because different parts source sentences have different influence on target sentences. For more information, please read Dzmitry Bahdanau's paper [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) .

Actually, there are some [excellent blogs](http://blog.csdn.net/u011414416/article/details/51057789) in Chinese, which introduced the development and theoretic models of neural machine translation explicitly and systemamtically.

Specify the paths of the original dataset.

In [1]:
# Data Parameters
data_dir = 'temp'
data_file = 'eng_ger.txt'

In [2]:
# Make data directory
import os
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

## Data Acquiry

Download the data from website if it does not exist.

In [3]:
class dataReader:
    '''
    Read text files from local drive.
    If not exists, download it.
    '''
    def __init__(self, file_path):
        self.file_path = file_path
        if not self.__checkExists():
            self.__download()
        
            
    def loadData(self):
        #print('Data Exists!')
        eng_ger_data = []
        with open(self.file_path, 'r') as in_conn:
            for row in in_conn:
                eng_ger_data.append(row[:-1])
        return eng_ger_data

    def __download(self):
        '''Download text files which contain translation pairs'''
        print('Data not found, downloading Eng-Ger sentences from www.manythings.org')
        sentence_url = 'http://www.manythings.org/anki/deu-eng.zip'
        r = urllib.request.urlopen(sentence_url)
        z = ZipFile(io.BytesIO(r.read()))
        file = z.read('deu.txt')
        # Format Data
        eng_ger_data = file.decode()
        eng_ger_data = eng_ger_data.encode('ascii',errors='ignore')
        eng_ger_data = eng_ger_data.decode().split('\n')
        # Write to file
        with open(self.file_path, 'w') as out_conn:
            for sentence in eng_ger_data:
                out_conn.write(sentence + '\n')
                
    def __checkExists(self):
        '''Check the file'''
        return os.path.isfile(self.file_path)

In [4]:
dr = dataReader('eng_ger.txt')
eng_ger_data = dr.loadData()

In [5]:
eng_ger_data[:10]

['Hi.\tHallo!',
 'Hi.\tGr Gott!',
 'Run!\tLauf!',
 'Wow!\tPotzdonner!',
 'Wow!\tDonnerwetter!',
 'Fire!\tFeuer!',
 'Help!\tHilfe!',
 'Help!\tZu Hlf!',
 'Stop!\tStopp!',
 'Wait!\tWarte!']

## Data Processig

Preprocess the original data. We can define a class to remove punctuation, split the original sentences, then build corresponding vocabulary for source language and target language.

In [6]:
import string
from collections import Counter
vocab_size = 10000
class textHandler:
    '''Split sentences into pairs of Source-Target language'''
    def __init__(self, data, vocab_size):
        self.data = data
        self.vocab_size = vocab_size
        self.__sentSplit()
        
    def __removePunctuation(self):
        '''Remove punctuation'''
        # Remove punctuation
        punct = string.punctuation
        pair_data = [''.join(char for char in sent if char not in punct) for sent in self.data]
        return pair_data
    
    def __sentSplit(self):
        # Break each sentence pair by tabs, one part is English, the other is German. 
        pair_data = self.__removePunctuation()
        s_t_data = [x.split('\t') for x in pair_data if len(x)>=1]
        [source_sentence, target_sentence] = [list(x) for x in zip(*s_t_data)]
        #Split each sentence into words
        self.source_sentence = [x.lower().split() for x in source_sentence]
        self.target_sentence = [x.lower().split() for x in target_sentence]
        #return source_sentence, target_sentence
    
    def __buildVocab(self, sents):
        '''Build Vocabulary for both languages'''
        # Process the English Vocabulary
        all_words = [word for sent in sents for word in sent]
        #Count the frequency of English words
        all_words_counts = Counter(all_words)
        #Get the most frequent vocab_size words, left regarded as unknow
        word_keys = [x[0] for x in all_words_counts.most_common(self.vocab_size-3)] 
        #Word to ID, set Starting token as 'SOS', ending token as 'EOS'
        vocab2ix = dict(zip(word_keys, range(2,self.vocab_size)))
        vocab2ix['SOS'] = 0
        vocab2ix['EOS'] = 1
        vocab2ix['UNK'] = self.vocab_size - 1
        #ID to Word
        ix2vocab = {val:key for key, val in vocab2ix.items()}
        return vocab2ix, ix2vocab
    
    def getSents(self):
        '''Get preprocessed sentences'''
        return self.source_sentence, self.target_sentence
    
    def sent2vec(self, sents, vocab2ix):
        '''Transform sentences into Ids'''
        processed = []
        for sent in sents:
            temp_sentence = []
            for word in sent:
                try:
                    temp_sentence.append(vocab2ix[word])
                except:
                    #Unknown words
                    temp_sentence.append(self.vocab_size-1)
            processed.append(temp_sentence)
        return processed
    
    def generateVocab(self):
        '''Generate Vocabulary'''
        #source_sentence, target_sentence = self.__sentSplit()
        source_vocab2ix, source_ix2vocab = self.__buildVocab(self.source_sentence)
        target_vocab2ix, target_ix2vocab = self.__buildVocab(self.target_sentence)
        return source_vocab2ix, source_ix2vocab, target_vocab2ix, target_ix2vocab
        


In [7]:
th = textHandler(data=eng_ger_data, vocab_size=vocab_size)

In [8]:
#Split sentences into tokens
english_sentence, german_sentence = th.getSents()

In [9]:
#Get vocabulary
eng_vocab2ix, eng_ix2vocab, ger_vocab2ix, ger_ix2vocab = th.generateVocab()

Now that we have encoded each word(including starting, ending tokens and unknown ones) as an ID. We can transform texts into sequences of numbers before we feed them into algorithms.

In [10]:
#Transform tokens into IDs
english_processed = th.sent2vec(english_sentence, eng_vocab2ix)
german_processed = th.sent2vec(german_sentence, ger_vocab2ix)

In [11]:
test_data = ['I love this dog', 'What a nice day', 'This is a book']
test_data = [x.lower().split() for x in test_data]
test_data = th.sent2vec(test_data, eng_vocab2ix)

In [14]:
test_data

[[5, 168, 17, 191], [24, 7, 392, 117], [17, 8, 7, 123]]

# Build an encoder-decoder architecture

In this demo, we use a simple encoder-decoder architecture to train and infer translations. Unlike a simple encoder-decoder system, we do not compress all the source sentence into a fixed vector as an input for the decoder. Instead, we use an attention mechanism to represent source sentences dynamically for different source-target pairs.

In [12]:
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F

use_cuda = torch.cuda.is_available()

In [54]:
gru = nn.GRU(hidden_size, hidden_size, bidirectional=True)


In [55]:
help(gru)

Help on GRU in module torch.nn.modules.rnn object:

class GRU(RNNBase)
 |  Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
 |  
 |  
 |  For each element in the input sequence, each layer computes the following
 |  function:
 |  
 |  .. math::
 |  
 |          \begin{array}{ll}
 |          r_t = sigmoid(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\
 |          i_t = sigmoid(W_{ii} x_t + b_{ii} + W_hi h_{(t-1)} + b_{hi}) \\
 |          n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)}+ b_{hn})) \\
 |          h_t = (1 - i_t) * n_t + i_t * h_{(t-1)} \\
 |          \end{array}
 |  
 |  where :math:`h_t` is the hidden state at time `t`, :math:`x_t` is the hidden
 |  state of the previous layer at time `t` or :math:`input_t` for the first layer,
 |  and :math:`r_t`, :math:`i_t`, :math:`n_t` are the reset, input, and new gates, respectively.
 |  
 |  Args:
 |      input_size: The number of expected features in the input x
 |      hidden_size: The numb

In [63]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        #Embedding for souce words
        self.embedding = nn.Embedding(input_size, hidden_size)
        #Bidirectional RNN
        self.gru = nn.GRU(hidden_size, hidden_size, bidirectional=True)

    def forward(self, input, hidden):
        #Get embedding series of input words
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        #Compress the input vectors into RNN
        for i in range(self.n_layers):
            output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        #Create a initial zero hidden state
        result = Variable(torch.zeros(2, 1, self.hidden_size))
        return result

In [88]:
#For more information, please refer to https://arxiv.org/pdf/1409.0473.pdf
class AttnDecoderRNN(nn.Module):
    '''Use attention model to decode target sentences'''
    def __init__(self, hidden_size, output_size, n_layers=1, dropout_p=0.1, max_length=10):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        #Use dropout to prevent against overfitting
        self.dropout_p = dropout_p
        self.max_length = max_length
        #Embedding for target words
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        #Define functions to calculate attentions
        self.attn = nn.Linear(self.hidden_size * 3, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 3, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_output, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)
        #Alignment factors, concatenate last hidden state of target
        #And current source word vector
        e = self.attn(torch.cat((embedded[0], hidden[0], hidden[0]), 1))
        #e = self.attn(torch.cat((encoder_hidden[0], hidden[0]), 1))
        #Calculate Attention weights based on alignment factors
        attn_weights = F.softmax(e)
        #Conext vectors derived by attention weights and 
        context_vec = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        #Concatenate the embedded and context vector
        output = torch.cat((embedded[0], context_vec[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)
        #Decode target sentence
        for i in range(self.n_layers):
            output = F.relu(output)
            output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]))
        return output, hidden, attn_weights

    def initHidden(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        return result

**Note, this attention model is a little bit different from that proposed by Dzmitry Bahdanau, in his paper the attention weights were calculated by the target hidden state and source annotation sequences represented by bidirectional RNNs, whereas in this project the source annotation sequences were simply represented by word vectors.**

WE denote the ith word of the source sentence as $h_i$, there are many ways to represent $h_i$, for example, we can simply use the embedding of ith word as $h_i$, or we can use RNN hidden state of the ith word as $h_i$, even a concatenation of bidirectional RNN hidden states to represent $h_i$.

In an attention model, the decoder part's input consists of previous hidden state, previous target word and the context vector of the source sentence, we denote it as $c_j$(suppose the index of current target is j), and the attention weights as $\alpha$. And $\alpha$ can be derived from previous target hidden state and the source sentence.

$$c_j = \sum_{i=1} \alpha_{ji} h_i$$

## Training Data
In order to understand the mechanism of neural machine translation, we wrap a pair of translation sentences each time instead of a batch of them.

In [65]:
hidden_size = 256
max_length = 10
encoder1 = EncoderRNN(vocab_size, hidden_size)

In [66]:
index = 100
input_variable, target_variable = english_processed[index], german_processed[index]

In [67]:
#Transform the input data and target into Variable vectors
input_variable = Variable(torch.LongTensor(input_variable).view(-1, 1))
target_variable = Variable(torch.LongTensor(target_variable).view(-1, 1))
input_length = input_variable.size()[0]
target_length = target_variable.size()[0]

### Encoder
We can compress a sereis of word embeddings into a final hidden state and output through RNN. Note we use Bidirectional-RNN here, so the output should contain two hidden states.
$$h_t = f(h_{t-1}, x_t)$$

In [69]:
encoder_hidden = encoder1.initHidden()
encoder_outputs = Variable(torch.zeros(max_length, 2*encoder1.hidden_size))
#Calculate the final state of input words
for ei in range(input_length):
    encoder_output, encoder_hidden = encoder1(
        input_variable[ei], encoder_hidden)
    encoder_outputs[ei] = encoder_output[0][0]

### Decoder with Attention

In the decoder part,for training, we only take two inputs into consideration: The first is the target variables provided, and the second is the previous hidden state initialized by the final state($C_T$) of the encoder.
$$h_t = f(h_{t-1}, y_{t-1}), h_0=C_T$$

The architecture is as below(quoted from http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html):
![encoder-decoder](https://i.imgur.com/1152PYf.png)

In [89]:
#Create an instance for decoder
decoder1 = AttnDecoderRNN(hidden_size, vocab_size)

In [90]:
learning_rate = 0.001
encoder_optimizer = optim.SGD(encoder1.parameters(), lr=learning_rate)
decoder_optimizer = optim.SGD(decoder1.parameters(), lr=learning_rate)

In [91]:
encoder_outputs.size()

torch.Size([10, 512])

In [92]:
loss = 0
criterion = nn.NLLLoss()
decoder_input = Variable(torch.LongTensor([[0]]))
#Set the beginning hidden state of decoder as the final state of encoder
decoder_hidden = encoder_hidden
for di in range(target_length):
    decoder_output, decoder_hidden, decoder_attention = decoder1(
        decoder_input, decoder_hidden, encoder_output, encoder_outputs)
    loss += criterion(decoder_output, target_variable[di])
    decoder_input = target_variable[di]

In [93]:
print(loss)

Variable containing:
 18.3926
[torch.FloatTensor of size 1]



## Wrap it up

Now, we can put the training procedures in one function. Note, in the paper [Sequence to Sequence Learning with Neural Networks
](https://arxiv.org/pdf/1409.3215.pdf), they mentioned that it would be more efficient if we reversed the order of words in source sentences because the encoder could retain more information of the last few words.

In [94]:
import copy
compressed = list(zip(english_processed, german_processed))

In [95]:
sent_pairs = copy.deepcopy(compressed)

In [96]:
#Filter those long sentences
pairs_filtered = []
#Because we need to add one ending tokens later, so substract 1 here
for item in sent_pairs:
    if len(item[0]) <= (max_length-1) and len(item[0]) > 3:
        pairs_filtered.append(item)

In [97]:
import numpy as np
criterion = nn.NLLLoss()
def training(encoder, decoder, encoder_optimizer, decoder_optimizer, epochs=1):
    for e in range(epochs):
        np.random.shuffle(pairs_filtered)
        for c, pair in enumerate(pairs_filtered):
            #Add ending tokens for each pair
            input_data, target_data = pair[0], pair[1]
            input_data.append(1)
            target_data.append(1)
            #Transform the input data and target into Variable vectors
            input_variable = Variable(torch.LongTensor(input_data).view(-1, 1))
            target_variable = Variable(torch.LongTensor(target_data).view(-1, 1))
            input_length = input_variable.size()[0]
            target_length = target_variable.size()[0]
            encoder_hidden = encoder.initHidden()
            encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
            #Calculate the final state of input words
            for i in range(input_length):
                encoder_output, encoder_hidden = encoder(
                    input_variable[i], encoder_hidden)
                encoder_outputs[i] = encoder_output[0][0]
            #Clear grads
            encoder_optimizer.zero_grad()
            decoder_optimizer.zero_grad()
            loss = 0
            decoder_input = Variable(torch.LongTensor([[0]]))
            #Set the beginning hidden state of decoder as the final state of encoder
            decoder_hidden = encoder_hidden
            for di in range(target_length):
                decoder_output, decoder_hidden, decoder_attention = decoder(
                    decoder_input, decoder_hidden, encoder_output, encoder_outputs)
                loss += criterion(decoder_output[0], target_variable[di])
                decoder_input = target_variable[di]
                #print(decoder_output[0].size())
                #print('*'*20)
                #print(target_variable[di])
                #loss += criterion(decoder_output[0], target_variable[di])
                #Set the target as input
                #decoder_input = target_variable[di]  # Teacher forcing
            loss.backward()
            encoder_optimizer.step()
            decoder_optimizer.step()
            if c%200 == 0:
                print(loss.data[0] / target_length)

In [98]:
encoder1 = EncoderRNN(vocab_size, hidden_size)
decoder1 = AttnDecoderRNN(hidden_size, vocab_size)
encoder1_optimizer = optim.SGD(encoder1.parameters(), lr=learning_rate)
decoder1_optimizer = optim.SGD(decoder1.parameters(), lr=learning_rate)
training(encoder1, decoder1, encoder1_optimizer, decoder1_optimizer)

RuntimeError: inconsistent tensor size at d:\downloads\pytorch-master-1\torch\lib\th\generic/THTensorCopy.c:51

And next, we use a greedy method to generate a target sentence based on source sentence. Each time, we selected the word wich has the maximum probability untile the decoder generate an ending token 'EOS'. However, a best choice each time does not guarantte the most-likely sentence in the end.

In [None]:
def evaluate_greedy(encoder, decoder, sentence, max_length=10):
    input_variable = Variable(torch.LongTensor(sentence).view(-1, 1))
    input_length = input_variable.size()[0]
    encoder_hidden = encoder.initHidden()

    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    #encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_variable[ei],
                                                 encoder_hidden)
        #encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]
    
    #Set the inital value as SOS token
    decoder_input = Variable(torch.LongTensor([[0]]))  # SOS

    decoder_hidden = encoder_hidden
    decoded_words = []

    #Greedy method
    for di in range(max_length):
        decoder_output, decoder_hidden, decoder_attention = decoder(
            decoder_input, decoder_hidden, encoder_output, encoder_outputs)
        top_value, top_index = decoder_output.data.topk(1)
        #Get the index of the word
        ni = top_index[0][0]
        if ni == 1:
            decoded_words.append('<EOS>')
            break
        else:
            decoded_words.append(ger_ix2vocab[ni])

        decoder_input = Variable(torch.LongTensor([[ni]]))
        #decoder_input = decoder_input.cuda() if use_cuda else decoder_input


    print(decoded_words)

In [None]:
test_data[1]

In [None]:
evaluate_greedy(encoder1, decoder1, test_data[2])