Machine translation challenges computers not only to understand human languages but also to generate them. A machine translation system can be viewed as a conditional language model: given a source sentence $x_i$, we need to calculate the probability of the generated sentence, $p(y_i|x_i)$. In earlier years, the focus was on statistical machine translation (SMT), for which the IBM models were the basis. If you are interested, please visit Michael Collins' [webpage](http://www.cs.columbia.edu/~mcollins/), where he provides many useful and clear lecture notes that explain the basic terms and models of SMT.
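
To make the conditional language model view concrete, the sentence probability is usually factorized token by token with the chain rule. The notation $y_{i,t}$ for the $t$-th token of $y_i$ and $T$ for its length is introduced here only for illustration:

$$p(y_i|x_i) = \prod_{t=1}^{T} p(y_{i,t} \mid y_{i,1}, \dots, y_{i,t-1}, x_i)$$

Each factor is the probability of the next target token given the source sentence and the target tokens generated so far.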

In recent years, with the development of artificial neural networks and deep learning applications, neural translation models have been explored. In particular, the [seq2seq](https://arxiv.org/pdf/1406.1078v3.pdf) model and its successors have improved the performance of machine translation.

This project aims to implement a neural machine translation model based on the seq2seq concept.

In [87]:
#
# Creating Sequence to Sequence Models
#-------------------------------------
#  Here we show how to implement sequence to sequence models.
#  Specifically, we will build an English to German translation model.
#

import os
import re
import string
import urllib.request  # urlopen lives in the urllib.request submodule
import io
import numpy as np
import collections
import random
import pickle
import matplotlib.pyplot as plt
#import tensorflow as tf
from zipfile import ZipFile
from collections import Counter
import data_utils
#import seq2seq_model
#from tensorflow.python.framework import ops
#ops.reset_default_graph()

Define the variables used during training.

In [88]:

# Model Parameters
learning_rate = 0.1
lr_decay_rate = 0.99
lr_decay_every = 100
max_gradient = 5.0
batch_size = 50
num_layers = 3
rnn_size = 500
layer_size = 512
generations = 100
vocab_size = 10000
save_every = 1000
eval_every = 500
output_every = 50
punct = string.punctuation

Specify the paths of the original dataset.

In [89]:
# Data Parameters
data_dir = 'temp'
data_file = 'eng_ger.txt'
model_path = 'seq2seq_model'
full_model_dir = os.path.join(data_dir, model_path)

In [90]:
# Test Translation from English (lowercase, no punct)
test_english = ['hello where is my computer',
                'the quick brown fox jumped over the lazy dog',
                'is it going to rain tomorrow']

# Make Model Directory
if not os.path.exists(full_model_dir):
    os.makedirs(full_model_dir)

# Make data directory
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

## Data Acquisition

Download the data from the website if it does not already exist locally.

In [91]:
z = ZipFile('temp/deu-eng.zip', 'r')
file = z.read('deu.txt')
# Format Data
eng_ger_data = file.decode()
eng_ger_data = eng_ger_data.encode('ascii',errors='ignore')

In [92]:
eng_ger_data = eng_ger_data.decode().split('\n')
# Write to file
with open(os.path.join(data_dir, data_file), 'w') as out_conn:
    for sentence in eng_ger_data:
        out_conn.write(sentence + '\n')

In [93]:
print('Loading English-German Data')
# Check for data, if it doesn't exist, download it and save it
if not os.path.isfile(os.path.join(data_dir, data_file)):
    print('Data not found, downloading Eng-Ger sentences from www.manythings.org')
    sentence_url = 'http://www.manythings.org/anki/deu-eng.zip'
    r = urllib.request.urlopen(sentence_url)
    z = ZipFile(io.BytesIO(r.read()))
    file = z.read('deu.txt')
    # Format Data
    eng_ger_data = file.decode()
    eng_ger_data = eng_ger_data.encode('ascii',errors='ignore')
    eng_ger_data = eng_ger_data.decode().split('\n')
    # Write to file
    with open(os.path.join(data_dir, data_file), 'w') as out_conn:
        for sentence in eng_ger_data:
            out_conn.write(sentence + '\n')
else:
    print('Data Exists!')
    eng_ger_data = []
    with open(os.path.join(data_dir, data_file), 'r') as in_conn:
        for row in in_conn:
            eng_ger_data.append(row[:-1])

Loading English-German Data
Data Exists!


## Data Processing

Remove punctuation.

In [94]:
# Remove punctuation
eng_ger_data = [''.join(char for char in sent if char not in punct) for sent in eng_ger_data]
eng_ger_data[0]

'Hi\tHallo'
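
As a small self-contained illustration of the filter above, the sketch below applies the same character-level test to a made-up sample string. The helper name `strip_punct` and the sample are assumptions for this example, not part of the notebook.

```python
import string

punct = string.punctuation

def strip_punct(line):
    # Keep every character that is not ASCII punctuation.
    # Tabs and spaces are not punctuation, so the pair separator survives.
    return ''.join(ch for ch in line if ch not in punct)

sample = "Hi.\tHallo!"
cleaned = strip_punct(sample)
print(repr(cleaned))
```

Note that apostrophes are also punctuation, so a contraction such as `I'm` becomes `Im` after this step.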

Break each translation pair into an English sentence and a German sentence, then split each sentence into words.

In [9]:
# Break each sentence pair by tabs, one part is English, the other is German.
eng_ger_data = [x.split('\t') for x in eng_ger_data if len(x)>=1]
[english_sentence, german_sentence] = [list(x) for x in zip(*eng_ger_data)]
#Split each sentence into words
english_sentence = [x.lower().split() for x in english_sentence]
german_sentence = [x.lower().split() for x in german_sentence]

In [10]:
print(english_sentence[7], german_sentence[7])

['help'] ['zu', 'hlf']


Map each unique word to a unique ID. Note that we need to reserve IDs for the start and end tokens.

In [11]:
SOS = 0
EOS = 1

In [203]:
print('Processing the vocabularies.')
# Process the English Vocabulary
all_english_words = [word for sentence in english_sentence for word in sentence]
#Count the frequency of English words
all_english_counts = Counter(all_english_words)
#Keep the vocab_size-3 most frequent words; all remaining words are treated as unknown
eng_word_keys = [x[0] for x in all_english_counts.most_common(vocab_size-3)]
#Word to ID
eng_vocab2ix = dict(zip(eng_word_keys, range(2,vocab_size-1)))
eng_vocab2ix['SOS'] = 0
eng_vocab2ix['EOS'] = 1
#ID to Word
eng_ix2vocab = {val:key for key, val in eng_vocab2ix.items()}
english_processed = []
for sent in english_sentence:
    temp_sentence = []
    for word in sent:
        try:
            temp_sentence.append(eng_vocab2ix[word])
        except KeyError:
            #Unknown words map to the last ID
            temp_sentence.append(vocab_size-1)
    english_processed.append(temp_sentence)

Processing the vocabularies.


Now the sentences have been transformed into sequences of word IDs.
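
The mapping can be checked on a toy corpus. The sketch below follows the same scheme as the cell above (IDs 0 and 1 reserved for the start and end tokens, `vocab_size-1` for unknown words); the helper names `build_vocab` and `encode` and the toy sentences are assumptions made for this example.

```python
from collections import Counter

SOS, EOS = 0, 1  # reserved IDs, as in the notebook

def build_vocab(sentences, vocab_size):
    counts = Counter(word for sent in sentences for word in sent)
    # Three IDs are reserved: SOS, EOS and the unknown-word ID (vocab_size - 1).
    keys = [word for word, _ in counts.most_common(vocab_size - 3)]
    vocab2ix = {word: ix for ix, word in enumerate(keys, start=2)}
    vocab2ix['SOS'] = SOS
    vocab2ix['EOS'] = EOS
    return vocab2ix

def encode(sentence, vocab2ix, vocab_size):
    unk = vocab_size - 1  # every out-of-vocabulary word shares this ID
    return [vocab2ix.get(word, unk) for word in sentence]

toy = [['i', 'am', 'here'], ['i', 'am', 'tom'], ['i', 'run']]
toy_vocab = build_vocab(toy, vocab_size=5)
encoded = encode(['i', 'am', 'lost'], toy_vocab, vocab_size=5)
print(toy_vocab)
print(encoded)
```

With `vocab_size=5` only the two most frequent words receive their own IDs, so `lost` falls back to the unknown-word ID 4.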

In [204]:
# Process the German Vocabulary
all_german_words = [word for sentence in german_sentence for word in sentence]
all_german_counts = Counter(all_german_words)
#Leave three ids for start, ending tokens and unknown words
ger_word_keys = [x[0] for x in all_german_counts.most_common(vocab_size-3)]
#Set 0, 1 as the ID of starting and ending tokens
ger_vocab2ix = dict(zip(ger_word_keys, range(2,vocab_size-1)))
ger_vocab2ix['SOS'] = 0
ger_vocab2ix['EOS'] = 1
ger_ix2vocab = {val:key for key, val in ger_vocab2ix.items()}
german_processed = []
for sent in german_sentence:
    temp_sentence = []
    for word in sent:
        try:
            temp_sentence.append(ger_vocab2ix[word])
        except KeyError:
            #Unknown words map to the last ID
            temp_sentence.append(vocab_size-1)
    german_processed.append(temp_sentence)

Within each English-German pair, the English sentence and the German sentence do not necessarily have the same length.

In [205]:
# Process the test English sentences; use the unknown-word ID (vocab_size-1) if a word is not in our vocab
test_data = []
for sentence in test_english:
    temp_sentence = []
    for word in sentence.split(' '):
        try:
            temp_sentence.append(eng_vocab2ix[word])
        except KeyError:
            # Use the unknown-word ID (vocab_size-1) if the word isn't in our vocabulary
            temp_sentence.append(vocab_size-1)
    test_data.append(temp_sentence)

In [None]:
len(eng_vocab2ix.keys())

## Build encoder-decoder architecture

In [207]:
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F

use_cuda = torch.cuda.is_available()

In [246]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        #Look up the embedding of the input word and reshape it to (seq_len=1, batch=1, hidden_size)
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        #Feed the embedding through the GRU, updating the hidden state
        for i in range(self.n_layers):
            output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        #Create an initial zero hidden state
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        #result = result.cuda() if use_cuda else result
        return result

In [247]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, n_layers=1):
        super(DecoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        #Create embedding
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax()

    def forward(self, input, hidden):
        #Transform input word id into embedding
        output = self.embedding(input).view(1, 1, -1)
        for i in range(self.n_layers):
            output = F.relu(output)
            output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        #Create an initial zero hidden state
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        #result = result.cuda() if use_cuda else result
        return result

## Training Data
To understand the mechanism of neural machine translation, we process one pair of translated sentences at a time instead of a batch of them.

In [262]:
hidden_size = 256
max_length = 10
encoder1 = EncoderRNN(vocab_size, hidden_size)

In [263]:
index = 100
input_variable, target_variable = english_processed[index], german_processed[index]

In [264]:
#Transform the input data and target into Variable vectors
input_variable = Variable(torch.LongTensor(input_variable).view(-1, 1))
target_variable = Variable(torch.LongTensor(target_variable).view(-1, 1))
input_length = input_variable.size()[0]
target_length = target_variable.size()[0]

### Encoder
Using an RNN, we can compress a series of word embeddings into a final hidden state and output.
$$h_t = f(h_{t-1}, x_t)$$
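
The recurrence can be traced by hand on a tiny example. The sketch below is a simplification: it uses a plain tanh RNN cell as $f$ rather than the GRU used in this notebook, and the weights `W_h`, `W_x` and the 2-dimensional "embeddings" are hand-picked illustrative values, not learned ones.

```python
import math

def rnn_step(h_prev, x, W_h, W_x):
    """One recurrence step: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    size = len(h_prev)
    return [
        math.tanh(
            sum(W_h[i][j] * h_prev[j] for j in range(size))
            + sum(W_x[i][k] * x[k] for k in range(len(x)))
        )
        for i in range(size)
    ]

def encode_sequence(xs, W_h, W_x, hidden_size):
    h = [0.0] * hidden_size      # zero initial state, like initHidden()
    for x in xs:                 # one token embedding per step
        h = rnn_step(h, x, W_h, W_x)
    return h                     # the final state summarizes the sequence

W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
final_state = encode_sequence(xs, W_h, W_x, hidden_size=2)
print(final_state)
```

Because each step depends on the previous state, feeding the same embeddings in a different order generally yields a different final state, which is what lets the encoder capture word order.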

In [265]:
encoder_hidden = encoder1.initHidden()
encoder_outputs = Variable(torch.zeros(max_length, encoder1.hidden_size))
#Calculate the final state of input words
for ei in range(input_length):
    encoder_output, encoder_hidden = encoder1(
        input_variable[ei], encoder_hidden)
    encoder_outputs[ei] = encoder_output[0][0]

### Decoder without Attention

In the decoder, during training, we take only two inputs into consideration. The first is the provided target token from the previous step, and the second is the previous hidden state, which is initialized with the final state ($C_T$) of the encoder.
$$h_t = f(h_{t-1}, y_{t-1}), h_0=C_T$$
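
Feeding the ground-truth previous token to the decoder is commonly called teacher forcing. The sketch below only shows how the decoder inputs line up with the expected outputs at each step; the helper `teacher_forcing_pairs` and the word IDs are made up for illustration.

```python
SOS, EOS = 0, 1  # reserved IDs, as in the notebook

def teacher_forcing_pairs(target_ids):
    """Return (decoder_input, expected_output) for each decoding step."""
    # The first input is SOS; afterwards the input is the previous target token.
    inputs = [SOS] + target_ids[:-1]
    return list(zip(inputs, target_ids))

target = [57, 12, 908, EOS]  # made-up word IDs followed by EOS
steps = teacher_forcing_pairs(target)
for decoder_in, expected in steps:
    print(decoder_in, '->', expected)
```

At step $t$ the decoder therefore sees $y_{t-1}$ from the reference translation, not its own earlier prediction.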

In [266]:
#Create an instance for decoder
decoder1 = DecoderRNN(hidden_size, vocab_size)

In [267]:
encoder_optimizer = optim.SGD(encoder1.parameters(), lr=learning_rate)
decoder_optimizer = optim.SGD(decoder1.parameters(), lr=learning_rate)

In [268]:
loss = 0
criterion = nn.NLLLoss()
decoder_input = Variable(torch.LongTensor([[0]]))
#Set the beginning hidden state of decoder as the final state of encoder
decoder_hidden = encoder_hidden
for di in range(target_length):
    decoder_output, decoder_hidden = decoder1(
        decoder_input, decoder_hidden)
    loss += criterion(decoder_output[0], target_variable[di])
    #Set the target as input
    decoder_input = target_variable[di]  # Teacher forcing

In [269]:
print(loss)

Variable containing:
 18.6042
[torch.FloatTensor of size 1]



## Wrap it up

Now, we can put the training procedures in one function.

In [303]:
compressed = list(zip(english_processed, german_processed))

In [304]:
#Filter those long sentences
pairs_filtered = []
#An ending token is appended later, so subtract 1 from max_length here
for item in compressed:
    if len(item[0]) <= (max_length-1) and len(item[0]) > 2:
        pairs_filtered.append(item)

In [305]:
len(pairs_filtered)

131431

In [306]:
import numpy as np
criterion = nn.NLLLoss()
def training(encoder, decoder, encoder_optimizer, decoder_optimizer, epochs=1):
    for e in range(epochs):
        np.random.shuffle(pairs_filtered)
        for c, pair in enumerate(pairs_filtered):
            #Add ending tokens for each pair (build new lists so the stored pairs are not modified)
            input_data = pair[0] + [EOS]
            target_data = pair[1] + [EOS]
            #Transform the input data and target into Variable vectors
            input_variable = Variable(torch.LongTensor(input_data).view(-1, 1))
            target_variable = Variable(torch.LongTensor(target_data).view(-1, 1))
            input_length = input_variable.size()[0]
            target_length = target_variable.size()[0]
            encoder_hidden = encoder.initHidden()
            encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
            #Calculate the final state of input words
            for i in range(input_length):
                encoder_output, encoder_hidden = encoder(
                    input_variable[i], encoder_hidden)
                if i == max_length:
                    print(c, pair)
                encoder_outputs[i] = encoder_output[0][0]
            #Clear grads
            encoder_optimizer.zero_grad()
            decoder_optimizer.zero_grad()
            loss = 0
            decoder_input = Variable(torch.LongTensor([[0]]))
            #Set the beginning hidden state of decoder as the final state of encoder
            decoder_hidden = encoder_hidden
            for di in range(target_length):
                decoder_output, decoder_hidden = decoder(
                    decoder_input, decoder_hidden)
                #print(decoder_output[0].size())
                #print('*'*20)
                #print(target_variable[di])
                loss += criterion(decoder_output[0], target_variable[di])
                #Set the target as input
                decoder_input = target_variable[di]  # Teacher forcing
            loss.backward()
            encoder_optimizer.step()
            decoder_optimizer.step()
            if c%200 == 0:
                print(loss.data[0] / target_length)

In [307]:
encoder1 = EncoderRNN(vocab_size, hidden_size)
decoder1 = DecoderRNN(hidden_size, vocab_size)
encoder1_optimizer = optim.SGD(encoder1.parameters(), lr=learning_rate)
decoder1_optimizer = optim.SGD(decoder1.parameters(), lr=learning_rate)
training(encoder1, decoder1, encoder1_optimizer, decoder1_optimizer)

9.21161346435547
18.286376953125
43.86820329938616
33.79951477050781
44.640017700195315
54.71051788330078
26.475701904296876
46.159088134765625
49.15302022298177
66.8128182547433
84.60193379720052
38.160868326822914
50.66724141438802
52.046244303385414
37.329995727539064
20.53452410016741
31.838956197102863
42.17073059082031
63.11253051757812
69.06246185302734
42.57438659667969
41.650033133370535
42.26680162217882
24.653512137276785
40.868829345703126
31.17901865641276
37.6739247639974
46.94205583844866
55.58522687639509
30.570780436197918
44.40029035295759
49.02313741048177
50.93433634440104
38.45894077845982
33.55299072265625
62.91929626464844
54.59217529296875
62.50486450195312
22.900371551513672
34.34137725830078
19.435302734375
59.3594482421875
55.129996163504465
26.114536830357142
80.97391764322917
45.019345092773435
23.85517272949219
46.259591238839285
45.58848353794643
56.709316677517364
56.16846172626202
70.88126220703126
45.00146484375
25.09646987915039
37.136749267578125
42.

KeyboardInterrupt: 

In [308]:
def evaluate(encoder, decoder, sentence, max_length=10):
    input_variable = Variable(torch.LongTensor(sentence).view(-1, 1))
    input_length = input_variable.size()[0]
    encoder_hidden = encoder.initHidden()

    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    #encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_variable[ei],
                                                 encoder_hidden)
        #encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]

    decoder_input = Variable(torch.LongTensor([[0]]))  # SOS
    #decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    decoder_hidden = encoder_hidden

    decoded_words = []
    #decoder_attentions = torch.zeros(max_length, max_length)

    for di in range(max_length):
        decoder_output, decoder_hidden = decoder(
            decoder_input, decoder_hidden)
        topv, topi = decoder_output.data.topk(1)
        ni = topi[0][0]
        if ni == 1:
            decoded_words.append('<EOS>')
            break
        else:
            #The unknown-word ID (vocab_size-1) has no entry in ger_ix2vocab
            decoded_words.append(ger_ix2vocab.get(ni, '<UNK>'))

        decoder_input = Variable(torch.LongTensor([[ni]]))
        #decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    print(decoded_words)

In [309]:
test_data[1]

[2, 1696, 1488, 2510, 1367, 161, 2, 1544, 191]

In [310]:
test_english[1]

'the quick brown fox jumped over the lazy dog'

In [311]:
evaluate(encoder1, decoder1, test_data[1])

['tom', 'frau', 'nicht', 'nicht', 'nicht', 'nicht', 'nicht', 'nicht', 'nicht', 'nicht']
