# Machine Translation with LSTM (seq2seq)

This problem is from the last assignment of the [Udacity Deep Learning course](https://www.udacity.com/course/deep-learning--ud730). The idea is to build a sequence-to-sequence (seq2seq) model using LSTMs that converts sequences of words of the form:

    the quick brown fox
    
to this form:

    eht kciuq nworb xof

i.e., the characters of each word are reversed while the word order stays the same. This is similar to (although much simpler than) machine translation, where the input words are in one language and the output words are in another. The advantage of this setup is that training data is trivial to create: the expected output for any input can be computed directly.
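
As a minimal sketch of the target transformation, the expected output for any input can be computed in a line of plain Python. The helper name `reverse_each_word` is an illustrative assumption and is not part of the assignment or the notebook:

```python
def reverse_each_word(text):
    # Reverse the characters inside each space-separated word,
    # leaving the order of the words unchanged.
    return " ".join(word[::-1] for word in text.split(" "))


print(reverse_each_word("the quick brown fox"))  # eht kciuq nworb xof
```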

One caveat is that a word-based seq2seq model would not be interesting here: each word maps to exactly one reversed word, so there is a 1-to-1 correspondence between the two "languages" and the model could simply memorize a lookup table. Instead, we build a character-based seq2seq model, so that it has to learn the reversal itself rather than rely on that regularity.

The file name for the notebook is a reference to [The Shining](http://www.imdb.com/title/tt0081505/) in case you were wondering.

## Setup Imports

In [1]:
from __future__ import division, print_function
from keras.layers import Activation, Dense, LSTM, RepeatVector, TimeDistributed
from keras.models import Sequential
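# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module has been removed from scikit-learn.
from sklearn.model_selection import train_test_split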
import nltk
import numpy as np

Using Theano backend.


## Extract Text from file

We read the text and split it into sentences, storing each one as a list of lowercased word tokens for later use. We also collect the character vocabulary as we read.

In [2]:
import io

# Start with the space character, which joins words and pads sequences.
char_vocab = set(" ")
sentences = []
# Read as text (assumes the file is UTF-8 encoded) so NLTK gets strings, not bytes.
with io.open("../data/alice_in_wonderland.txt", encoding="utf-8") as fin:
    for line in fin:
        line = line.strip()
        if len(line) == 0:
            continue
        for sentence in nltk.sent_tokenize(line):
            words = [word.lower() for word in nltk.word_tokenize(sentence)]
            for word in words:
                char_vocab.update(word)
            sentences.append(words)

vocab_size = len(char_vocab)
print("vocab size: %d" % (vocab_size))

vocab size: 45


## Create text sequences

Our input sequences are 4 words long. Here we build the input and output training sequences from every 4-word window (4-gram) of each sentence, and compute the maximum sequence length in characters.

In [3]:
def reverse_words(words):
    # Reverse the characters of each word; the word order is unchanged.
    return [w[::-1] for w in words]

nb_words_in_seq = 4
input_texts = []
output_texts = []
for sentence in sentences:
    ngrams = nltk.ngrams(sentence, nb_words_in_seq)
    for ngram in ngrams:
        input_texts.append(" ".join(ngram))
        output_texts.append(" ".join(reverse_words(ngram)))

# Reversing characters preserves length, so the inputs alone determine maxlen.
maxlen = max(len(x) for x in input_texts)
print("maximum length of sequence: %d chars" % (maxlen))

maximum length of sequence: 36 chars
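
To make the windowing concrete, here is a small sketch that builds the same kind of 4-word windows by hand instead of calling `nltk.ngrams`. The helper `word_windows` and the sample sentence are illustrative assumptions, not part of the notebook:

```python
def word_windows(words, n):
    # Every run of n consecutive words; a sentence shorter than n yields nothing.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = ["alice", "was", "beginning", "to", "get"]
for window in word_windows(sample, 4):
    source = " ".join(window)
    target = " ".join(w[::-1] for w in window)
    print("%s -> %s" % (source, target))
# alice was beginning to -> ecila saw gninnigeb ot
# was beginning to get -> saw gninnigeb ot teg
```

A sentence with k words contributes k - 3 windows when k is at least 4, and sentences with fewer than four words contribute none.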


## Create Lookup tables

As mentioned earlier, we are going to build a character-based seq2seq model. We use the vocabulary collected earlier to build lookup tables that map each character to an integer index and back.

In [4]:
# Sort the vocabulary so the character-to-index mapping is the same on every run.
char2idx = {c: i for i, c in enumerate(sorted(char_vocab))}
idx2char = {i: c for c, i in char2idx.items()}
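
As a rough illustration of how the two tables work together, the sketch below encodes a string to indices and decodes it back. The toy vocabulary and the `toy_` names are assumptions made for this example only:

```python
toy_vocab = sorted(set("abc "))  # [' ', 'a', 'b', 'c']
toy_char2idx = {c: i for i, c in enumerate(toy_vocab)}
toy_idx2char = {i: c for c, i in toy_char2idx.items()}

encoded = [toy_char2idx[c] for c in "cab a"]
decoded = "".join(toy_idx2char[i] for i in encoded)
print(encoded)  # [3, 1, 2, 0, 1]
print(decoded)  # cab a
```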

## Vectorize sequences

In [5]:
# One-hot encode each character; shape is (num_sequences, maxlen, vocab_size).
# The np.bool alias was removed from NumPy; the builtin bool is the same dtype.
X = np.zeros((len(input_texts), maxlen, vocab_size), dtype=bool)
Y = np.zeros((len(output_texts), maxlen, vocab_size), dtype=bool)
for i, input_text in enumerate(input_texts):
    # Right-pad with spaces so every sequence is exactly maxlen characters.
    for j, ch in enumerate(input_text.ljust(maxlen)):
        X[i, j, char2idx[ch]] = 1
for i, output_text in enumerate(output_texts):
    for j, ch in enumerate(output_text.ljust(maxlen)):
        Y[i, j, char2idx[ch]] = 1
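
The vectorization step can be sketched on a single short string. Each character position becomes a row with exactly one `True` entry, and taking the argmax of each row recovers the text. The toy vocabulary, `toy_maxlen`, and the `one_hot` helper are assumptions for illustration, not names from the notebook:

```python
import numpy as np

toy_vocab = sorted(set("abc "))
toy_char2idx = {c: i for i, c in enumerate(toy_vocab)}
toy_idx2char = {i: c for c, i in toy_char2idx.items()}
toy_maxlen = 6


def one_hot(text):
    # One row per character position, one column per vocabulary entry.
    encoded = np.zeros((toy_maxlen, len(toy_vocab)), dtype=bool)
    for j, ch in enumerate(text.ljust(toy_maxlen)):
        encoded[j, toy_char2idx[ch]] = True
    return encoded


x = one_hot("cab")
print(x.shape)  # (6, 4)
# Decode by taking the active column in each row, then strip the space padding.
recovered = "".join(toy_idx2char[i] for i in x.argmax(axis=1)).strip()
print(recovered)  # cab
```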

## Split data into training and test

In [6]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=0)
print(Xtrain.shape, Xtest.shape, Ytrain.shape, Ytest.shape)

(16621, 36, 45) (7124, 36, 45) (16621, 36, 45) (7124, 36, 45)
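
As a quick sanity check on the printed shapes, the arithmetic below confirms the 70/30 split and estimates the size of each one-hot array. The numbers are copied from the output above; the one-byte-per-element figure is NumPy's storage size for `bool`:

```python
n_train, n_test = 16621, 7124  # first dimensions printed above
n_total = n_train + n_test
print(n_total)  # 23745
print(round(n_test / n_total, 4))  # 0.3

# Elements in the full X (or Y) array: sequences * maxlen * vocab_size.
n_elements = n_total * 36 * 45
print(n_elements)  # 38466900, about 38 MB at one byte per bool
```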


## Build Model

In [7]:
model = Sequential()
model.add(LSTM(512, input_shape=(maxlen, vocab_size), return_sequences=False))
model.add(RepeatVector(maxlen))
model.add(LSTM(512, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation("softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
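
To follow how data flows through this encoder-decoder stack, the shape-only sketch below mimics each stage with NumPy arrays of zeros. It is not Keras code and learns nothing; the batch size of 2 and all variable names are assumptions chosen for illustration:

```python
import numpy as np

batch, maxlen, vocab_size, hidden = 2, 36, 45, 512

# Stand-in for the encoder LSTM output (return_sequences=False):
# one fixed-size vector per input sequence.
encoded = np.zeros((batch, hidden))

# RepeatVector(maxlen): copy that vector once per output timestep.
repeated = np.repeat(encoded[:, np.newaxis, :], maxlen, axis=1)

# The decoder LSTM (return_sequences=True) keeps the (batch, maxlen, hidden)
# shape, so `repeated` also stands in for its output here.
# TimeDistributed(Dense(vocab_size)) applies the same weights at every timestep.
dense_weights = np.zeros((hidden, vocab_size))
logits = repeated @ dense_weights

print(encoded.shape)   # (2, 512)
print(repeated.shape)  # (2, 36, 512)
print(logits.shape)    # (2, 36, 45)
```

The final softmax is applied over the last axis, so each of the 36 output positions gets its own distribution over the 45 characters.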

## Evaluate Model

In [8]:
def decode_text(probas):
    # Pick the most likely character at each timestep, then strip
    # the surrounding spaces (the padding).
    indices = np.argmax(probas, axis=-1)
    return "".join(idx2char[idx] for idx in indices).strip()

for iteration in range(51):
    print("=" * 50)
    print("Iteration-#: %d" % (iteration))
    # Train for one more epoch (this argument was named `nb_epoch` in Keras 1.x).
    model.fit(Xtrain, Ytrain, batch_size=128, epochs=1,
              verbose=0, validation_data=(Xtest, Ytest))
    for i in range(10):
        test_idx = np.random.randint(Xtest.shape[0])
        # Slice rather than index to keep a leading batch dimension of 1.
        x_test = Xtest[test_idx:test_idx + 1]
        y_test = Ytest[test_idx:test_idx + 1]
        y_pred = model.predict(x_test, verbose=0)
        xtest_text = decode_text(x_test[0])
        ytest_text = decode_text(y_test[0])
        ypred_text = decode_text(y_pred[0])
        print("input: [%s], expected: [%s], got: [%s]" % 
              (xtest_text, ytest_text, ypred_text))

Iteration-#: 0
input: [and legs in all], expected: [dna sgel ni lla], got: [e]
input: [sea as you can], expected: [aes sa uoy nac], got: [e]
input: [of the conversation .], expected: [fo eht noitasrevnoc .], got: [ee]
input: ['oh , hush !], expected: [ho' , hsuh !], got: []
input: [of voices asked .], expected: [fo seciov deksa .], got: [e]
input: [very turn-up nose ,], expected: [yrev pu-nrut eson ,], got: [ee]
input: [as he shook his], expected: [sa eh koohs sih], got: [e]
input: [what they will do], expected: [tahw yeht lliw od], got: [e]
input: [very uncomfortable , and], expected: [yrev elbatrofmocnu , dna], got: [eee]
input: [grow smaller , i], expected: [worg rellams , i], got: [e]
Iteration-#: 1
input: [a duck and a], expected: [a kcud dna a], got: [e]
input: [the dormouse slowly opened], expected: [eht esuomrod ylwols denepo], got: [eeee                  eeee]
input: [rustled at her feet], expected: [deltsur ta reh teef], got: [ee               e]
input: [was quite silent for]