### Sahitya Chalamchala (Venkata Sahitya) 
### 801019292

## Neural Machine Translation (German-English)

In [1]:
import string
import re
from numpy import array, argmax, random, take
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, RepeatVector
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from keras import optimizers
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_colwidth', 200)

Using TensorFlow backend.


### Reading a raw text file and splitting the file into English and German Sentences:
Each line of the file holds an English sentence and its German translation separated by a tab, so splitting on newlines and then on tabs yields aligned English-German pairs.

In [2]:
def read_file(fname):
    # The context manager closes the file even if reading fails
    with open(fname, mode='rt', encoding='utf-8') as file:
        return file.read()

In [3]:
def split_text(text):
    sentences = text.strip().split('\n')
    sentences = [i.split('\t') for i in sentences]
    return sentences

In [4]:
data = read_file("deu.txt")
deu_eng = split_text(data)
deu_eng = array(deu_eng)
#print(deu_eng)
deu_eng = deu_eng[:50000,:] #reducing the sample size due to computational limitations.
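
As a quick illustration of the layout the parsing step expects, the sketch below splits a small in-memory sample the same way and checks that every row has exactly two fields, which `array(...)` needs to build a clean two-column array. The sample string and the helper name `split_pairs` are made up for illustration and are not taken from `deu.txt`.

```python
# Hypothetical two-line sample in the assumed layout:
# English sentence, a tab, then the German sentence.
sample = "Hi.\tHallo!\nRun!\tLauf!\n"

def split_pairs(text):
    """Split raw text into [english, german] pairs, one pair per line."""
    lines = text.strip().split('\n')
    return [line.split('\t') for line in lines]

pairs = split_pairs(sample)

# Rows that do not have exactly two fields would give a ragged array later on.
bad_rows = [p for p in pairs if len(p) != 2]

print(pairs)      # [['Hi.', 'Hallo!'], ['Run!', 'Lauf!']]
print(bad_rows)   # []
```

If a copy of the file carries extra tab-separated fields on each line, the same check would flag it before the column slicing fails.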

### Removing punctuation and converting the text to lowercase:

In [5]:
deu_eng[:,0] = [sen.translate(str.maketrans('', '', string.punctuation)) for sen in deu_eng[:,0]]
deu_eng[:,1] = [sen.translate(str.maketrans('', '', string.punctuation)) for sen in deu_eng[:,1]]

for i in range(len(deu_eng)):
    deu_eng[i,0] = deu_eng[i,0].lower()
    
    deu_eng[i,1] = deu_eng[i,1].lower()
# deu_eng
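
The effect of the clean-up on a single sentence can be checked in isolation. This is a minimal sketch; the helper name `clean` and the sample sentences are assumptions made for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean(sentence):
    """Remove ASCII punctuation and lowercase the sentence."""
    return sentence.translate(PUNCT_TABLE).lower()

print(clean("I'm sorry, Tom!"))   # im sorry tom
print(clean("Was ist los?"))      # was ist los
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotation marks that can appear in German text would pass through unchanged.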

##### The model cannot work with raw text, so the sentences must first be converted into a format the machine understands, i.e., numbers.

### Converting the sentences into vectors (sequence of integers):

In [12]:
def tokenize(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

In [13]:
eng_tokenize = tokenize(deu_eng[:, 0])
eng_vocab = len(eng_tokenize.word_index) + 1
eng_length = 8
print('English Vocabulary Size:','', eng_vocab)
deu_tokenize = tokenize(deu_eng[:, 1])
deu_vocab = len(deu_tokenize.word_index) + 1
deu_length = 8
print('German Vocabulary size:','', deu_vocab)

English Vocabulary Size:  6361
German Vocabulary size:  10597
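
To make the `+ 1` in the vocabulary size concrete, here is a dependency-free sketch of the idea behind the word index: words are numbered from 1 in order of decreasing frequency, and index 0 stays free for padding. It approximates what the Keras `Tokenizer` does and is not its actual implementation; the function names and sample sentences are assumptions.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(word_index, line):
    """Encode a sentence as a list of integers, skipping unknown words."""
    return [word_index[w] for w in line.split() if w in word_index]

lines = ["tom is here", "tom is tired", "i am here"]
word_index = build_word_index(lines)
vocab_size = len(word_index) + 1   # +1 because index 0 is reserved for padding

print(word_index)                             # {'tom': 1, 'is': 2, 'here': 3, 'tired': 4, 'i': 5, 'am': 6}
print(vocab_size)                             # 7
print(to_sequence(word_index, "i am tired"))  # [5, 6, 4]
```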


### Padding the sentences of both the languages to an equal length by adding 0s at the end

In [18]:
def pad_encoder(tokenizer, length, lines):
    padd= tokenizer.texts_to_sequences(lines)
    padd= pad_sequences(padd, maxlen=length, padding='post')
    return padd
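
The padding step can be sketched without Keras for a single sequence. The helper `pad_post` is hypothetical; it mimics `padding='post'` together with the Keras default `truncating='pre'`, which `pad_encoder` does not override.

```python
def pad_post(seq, length):
    """Mimic pad_sequences(..., maxlen=length, padding='post') for one sequence.

    Sequences that are too long keep their LAST `length` items,
    matching the Keras default truncating='pre'.
    """
    seq = seq[-length:]
    return seq + [0] * (length - len(seq))

print(pad_post([5, 6, 4], 8))            # [5, 6, 4, 0, 0, 0, 0, 0]
print(pad_post(list(range(1, 11)), 8))   # [3, 4, 5, 6, 7, 8, 9, 10]
```

With both lengths set to 8 in this notebook, any sentence longer than 8 tokens would lose its first words under that default.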

### Building the model:


Splitting the data into train and test sets, with 80% of the pairs used for training and 20% held out for testing

In [19]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(deu_eng, test_size=0.2, random_state = 12)

In [20]:
# Train data preparation
train_X = pad_encoder(deu_tokenize, deu_length, train[:, 1])
train_Y = pad_encoder(eng_tokenize, eng_length, train[:, 0])

In [21]:
# Test data preparation
test_X = pad_encoder(deu_tokenize, deu_length, test[:, 1])
test_Y = pad_encoder(eng_tokenize, eng_length, test[:, 0])

The Neural Machine Translation model follows a Seq2Seq architecture and is built with the Sequential model from the Keras library. The encoder consists of an Embedding layer and an LSTM layer; RepeatVector then repeats the encoder output once per output timestep. The decoder consists of an LSTM layer and a Dense layer with "softmax" as the activation function. The optimizer used here is RMSprop, a common choice for recurrent networks.

In [22]:
# NMT model with encoder and decoder 
def build_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
    model = Sequential() 
    model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
    model.add(LSTM(units))
    model.add(RepeatVector(out_timesteps))
    model.add(LSTM(units, return_sequences=True))
    model.add(Dense(out_vocab, activation='softmax'))
    return model
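
To get a feel for the size of this network, the trainable parameter count can be worked out by hand. The sketch below assumes 512 units (the value this notebook passes to `build_model`) and the vocabulary sizes printed earlier; the helper `lstm_params` is introduced here for illustration only.

```python
def lstm_params(input_dim, units):
    """Trainable weights in one LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)

in_vocab, out_vocab, units = 10597, 6361, 512   # German vocab, English vocab, layer width

embedding = in_vocab * units                # one vector of size `units` per German word index
encoder_lstm = lstm_params(units, units)
decoder_lstm = lstm_params(units, units)    # RepeatVector in between has no weights
dense = units * out_vocab + out_vocab       # weights plus biases

total = embedding + encoder_lstm + decoder_lstm + dense
print(embedding, encoder_lstm, decoder_lstm, dense, total)
# 5425664 2099200 2099200 3263193 12887257
```

Most of the roughly 12.9 million parameters sit in the Embedding and Dense layers, both of which scale with vocabulary size.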

The 'sparse_categorical_crossentropy' loss function is used here instead of one-hot encoding the targets because of memory and computational limitations: on a data set this large, one-hot encoded targets would consume a lot of space.
A further advantage of 'sparse_categorical_crossentropy' is that the targets keep their original format, sequences of integer word indices.
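
A rough back-of-the-envelope calculation shows the scale of the saving. The sketch assumes 40,000 training rows (80% of the 50,000 pairs), 4-byte values for both formats, and the English vocabulary size printed earlier; the function `sparse_cross_entropy` is a hypothetical single-timestep illustration of the loss, not the Keras implementation.

```python
import math

samples, timesteps, vocab = 40000, 8, 6361   # training rows, eng_length, eng_vocab

# One-hot targets: one 4-byte value per (sample, timestep, vocabulary entry).
one_hot_bytes = samples * timesteps * vocab * 4
# Integer targets: one 4-byte value per (sample, timestep).
sparse_bytes = samples * timesteps * 4

print(one_hot_bytes / 1e9, "GB")   # 8.14208 GB
print(sparse_bytes / 1e6, "MB")    # 1.28 MB

def sparse_cross_entropy(probs, target):
    """Loss for one timestep: negative log of the probability given to the true word index."""
    return -math.log(probs[target])

loss = sparse_cross_entropy([0.1, 0.7, 0.2], target=1)
```

Both formats describe the same targets; the sparse loss simply looks up the predicted probability at the integer index instead of multiplying by a one-hot vector.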

In [24]:
model = build_model(deu_vocab, eng_vocab, deu_length, eng_length, 512)
rms = optimizers.RMSprop(lr=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')

Instructions for updating:
Colocations handled automatically by placer.


## Training the model:
I trained the model for 30 epochs with a batch size of 512; these parameters can be changed for other runs. A validation_split of 0.2 holds out 8,000 of the 40,000 training pairs for validation.
To keep the best version of the model I used ModelCheckpoint(), which checks the validation loss after every epoch and saves the model to a file ('Nlp.nmt.oct.2019') only when that loss reaches a new minimum, so the file holds the model with the lowest validation loss. It also logs the improvement of the model in every epoch.
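
The save-only-on-improvement behaviour can be sketched in a few lines of plain Python. This illustrates the logic only and is not the Keras callback itself; the function name and the loss values are hypothetical.

```python
def best_checkpoints(val_losses):
    """Return (epoch, loss) for each epoch where the validation loss set a new minimum."""
    best = float('inf')
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:          # mode='min': lower is better
            best = loss
            saved.append((epoch, loss))
    return saved

# Hypothetical validation losses in which epochs 3 and 5 do not improve.
losses = [2.9, 2.8, 2.85, 2.7, 2.75]
saved = best_checkpoints(losses)
print(saved)   # [(1, 2.9), (2, 2.8), (4, 2.7)]
```

In this made-up run the file on disk after epoch 5 would still be the model from epoch 4.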

In [25]:
filename = 'Nlp.nmt.oct.2019'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
history = model.fit(train_X, train_Y.reshape(train_Y.shape[0], train_Y.shape[1], 1), 
          epochs=30, batch_size=512, 
          validation_split = 0.2,
          callbacks=[checkpoint], verbose=1)

Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Train on 32000 samples, validate on 8000 samples
Epoch 1/30

Epoch 00001: val_loss improved from inf to 2.93516, saving model to Nlp.nmt.oct.2019
Epoch 2/30

Epoch 00002: val_loss improved from 2.93516 to 2.83750, saving model to Nlp.nmt.oct.2019
Epoch 3/30

Epoch 00003: val_loss improved from 2.83750 to 2.71759, saving model to Nlp.nmt.oct.2019
Epoch 4/30

Epoch 00004: val_loss improved from 2.71759 to 2.56973, saving model to Nlp.nmt.oct.2019
Epoch 5/30

Epoch 00005: val_loss improved from 2.56973 to 2.45075, saving model to Nlp.nmt.oct.2019
Epoch 6/30

Epoch 00006: val_loss improved from 2.45075 to 2.33836, saving model to Nlp.nmt.oct.2019
Epoch 7/30

Epoch 00007: val_loss improved from 2.33836 to 2.27733, saving model to Nlp.nmt.oct.2019
Epoch 8/30

Epoch 00008: val_loss improved from 2.27733 to 2.18789, saving model to Nlp.nmt.oct.2019
Epoch 9/30

Epoch 000

### Predictions on the test data using the trained model:

In [26]:
model = load_model('Nlp.nmt.oct.2019') # best-saved model obtained from training 
# Most probable word index at each timestep; equivalent to predict_classes, which newer Keras releases no longer provide
pclass = argmax(model.predict(test_X), axis=-1)


In [27]:
def word_vector(size, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == size:
            return word
    return None
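
Because this lookup scans the whole vocabulary for every predicted index, it can be slow on a large test set. One alternative, sketched here with a hypothetical three-word index, is to invert the mapping once and then use dictionary lookups.

```python
def build_index_word(word_index):
    """Invert a word -> index mapping into index -> word."""
    return {index: word for word, index in word_index.items()}

word_index = {'tom': 1, 'is': 2, 'here': 3}   # hypothetical stand-in for tokenizer.word_index
index_word = build_index_word(word_index)

print(index_word.get(2))   # is
print(index_word.get(0))   # None (index 0 is padding and maps to no word)
```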



In [33]:
arr = []
for i in pclass:
    temp = []
    for j in range(len(i)):
        k = word_vector(i[j], eng_tokenize)
        # Blank out padding (index 0 maps to None) and immediate repeats of the previous word
        if k is None or (j > 0 and k == word_vector(i[j-1], eng_tokenize)):
            temp.append('')
        else:
            temp.append(k)
    arr.append(' '.join(temp))
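
The same decoding rule can be written as a small function that skips padding and repeated words instead of inserting empty strings, which avoids stray spaces in the joined sentence. This is a sketch under the assumption that an index-to-word dictionary is available; `decode` and the sample mapping are hypothetical.

```python
def decode(indices, index_word):
    """Turn predicted indices into a sentence, skipping padding and immediate repeats."""
    words = []
    previous = None
    for idx in indices:
        word = index_word.get(idx)      # None for the padding index 0
        if word is not None and word != previous:
            words.append(word)
        previous = word
    return ' '.join(words)

index_word = {1: 'tom', 2: 'is', 3: 'here'}   # hypothetical mapping
print(decode([1, 1, 2, 3, 0, 0, 0, 0], index_word))   # tom is here
```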

    


In [34]:
df = pd.DataFrame({'actual' : test[:,0], 'predicted' : arr})
pd.set_option('display.max_colwidth', 200)
df.head(15)



| | actual | predicted |
|---|---|---|
| 0 | you need to wake up | you must be awake |
| 1 | i feel old | i feel bad |
| 2 | im sorry tom | im sorry for tom |
| 3 | hide the money | is the money |
| 4 | is someone calling me | help me |
| 5 | you should work hard | you should work hard |
| 6 | tom has a big problem | tom has a big hero |
| 7 | he lives with his mom | he lives his cat |
| 8 | whats toms job | what is toms |
| 9 | i dont want to drive | i dont want to go |


In [35]:
df.sample(15)

| | actual | predicted |
|---|---|---|
| 958 | let me see that list | let me do it |
| 207 | tom hated mary | tom ignored mary |
| 7878 | tom dialed 911 | tom checked his |
| 8609 | throw it to me | give it to me |
| 1760 | enough is enough | is it enough |
| 6985 | i got it for free | i will it |
| 316 | im going downtown | im am in the |
| 2163 | how did this all start | how did it begin |
| 7640 | its made of leather | this is for |
| 4396 | come back soon | come back |


In [36]:
df.tail(15)

| | actual | predicted |
|---|---|---|
| 9985 | do you need a ride | can i sit you |
| 9986 | it is very small | its very small |
| 9987 | we can handle that | we can do it |
| 9988 | nobody believes you | no one believes you |
| 9989 | tom never changes | tom never complains |
| 9990 | i just emailed you | i just as a you |
| 9991 | tom is so stupid | tom is so stupid |
| 9992 | i found tom | i found tom |
| 9993 | i met mary yesterday | i met yesterday |
| 9994 | try and stop me | try to me |


The predictions aren't accurate yet. They could likely be improved by training on the entire data set instead of a reduced sample and by tuning the parameters used while training the model.
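
One simple way to put a number on the quality of the predictions is the share of sentences reproduced word for word. The sketch below uses four actual/predicted pairs copied from the tables above; the function name is an assumption, and exact match is a strict measure that gives no credit for close paraphrases such as "its very small".

```python
def exact_match_rate(actual, predicted):
    """Fraction of sentence pairs whose words match exactly (whitespace-insensitive)."""
    pairs = list(zip(actual, predicted))
    matches = sum(a.split() == p.split() for a, p in pairs)
    return matches / len(pairs)

actual = ["you should work hard", "i feel old", "tom is so stupid", "try and stop me"]
predicted = ["you should work hard", "i feel bad", "tom is so stupid", "try to me"]

rate = exact_match_rate(actual, predicted)
print(rate)   # 0.5
```

Running the same function on the full `df['actual']` and `df['predicted']` columns would give a baseline to compare against after retraining.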