### **Introduction**

**Artificial Intelligence - Internship Project:**

Language Translation (German to English)

**Problem Statement:**

The objective of the project is to implement a language
translation model (i.e. machine translation) for converting
German to English (and vice versa).

**Dataset Description:**

For this project, the data is a text file (.txt) of English-German sentence
pairs. The full dataset contains over 150,000 sentence pairs.
However, only the first 50,000 sentence pairs will be used
to reduce the training time of the model.


### 1.Import the required libraries

In [None]:
import string
import re
from numpy import array, argmax, random, take
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Bidirectional, RepeatVector, TimeDistributed
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from tensorflow.keras import optimizers
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_colwidth',200)

### 2.Function to read the data file
The data is a text file of English-German sentence pairs.

In [None]:
#function to read raw text file
def read_text(filename):
  #the with-block closes the file even if reading fails
  with open(filename, mode='rt', encoding='utf-8') as file:
    text = file.read()
  return text

### 3.Function to split the text
The text data is split into English-German pairs separated by '\n'.
These pairs are then split into English and German sentences.

In [None]:
#split the text into sentences
def to_lines(text):
  sent = text.strip().split('\n')
  sent = [i.split('\t') for i in sent]
  return sent
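
As a quick check of the splitting step, the sketch below applies the same logic to a small in-memory string. The function name `split_pairs` and the sample sentences are made up for illustration and are not taken from the dataset.

```python
# Same splitting logic as to_lines(), applied to a made-up sample
def split_pairs(text):
    # one sentence pair per line, fields separated by a tab
    lines = text.strip().split('\n')
    return [line.split('\t') for line in lines]

sample = "Hi.\tHallo!\nRun!\tLauf!\n"  # hypothetical sample, not from deu.txt
pairs = split_pairs(sample)
print(pairs)
```

Each inner list holds the English sentence first and the German sentence second, which is why column 0 is used for English and column 1 for German in the later cells.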

### 4.Reading & preparing the data

In [None]:
data = read_text("/content/Deu_dataset/deu.txt")
deu_eng = to_lines(data)
deu_eng = array(deu_eng)

In [None]:
print(deu_eng)

The full dataset contains over 150,000 sentence pairs. However, only the first 50,000 sentence pairs will be used in order to reduce the training time of the model.

In [None]:
deu_eng = deu_eng[:50000,:]

In [None]:
deu_eng

### 5.Text to Sequence Conversion
The input and output sentences need to be converted into integer sequences of fixed length so that they can be fed to the Seq2Seq model. To choose that length, the word counts of all the sentences are collected in two separate lists, one for English and one for German, and visualized with two histograms.

In [None]:
#empty lists
eng_l = []
deu_l = []

#populate the lists with sentence lengths
for i in deu_eng[:,0]:
  eng_l.append(len(i.split()))

for i in deu_eng[:,1]:
  deu_l.append(len(i.split()))

In [None]:
length_df = pd.DataFrame({'eng':eng_l, 'deu':deu_l})

In [None]:
length_df.hist(bins = 30)
plt.show()

From the visualization, it can be seen that the maximum length of German sentences is 11 and that of English sentences is 8.
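
The maxima can also be read directly from the word counts instead of from the histogram. A minimal sketch on made-up sentences follows; the names `sample_eng`, `sample_deu` and `max_words` are illustrative only and do not appear elsewhere in this notebook.

```python
# Made-up sentences standing in for deu_eng[:,0] and deu_eng[:,1]
sample_eng = ["Go.", "I am here.", "She reads a book."]
sample_deu = ["Geh.", "Ich bin hier.", "Sie liest gerade ein Buch."]

def max_words(sentences):
    # longest sentence, measured in whitespace-separated words
    return max(len(s.split()) for s in sentences)

eng_max = max_words(sample_eng)
deu_max = max_words(sample_deu)
print(eng_max, deu_max)
```

On the real data, the same function could be applied to the two columns of `deu_eng` to confirm the values seen in the plot.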

#### 5.1.Function to vectorize the text data 
The text data is vectorized using the Keras Tokenizer class, which turns each sentence into a sequence of integers. Both the English and German sentences are vectorized.

In [None]:
#function to build a tokenizer
def tokenization(lines):
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(lines)
  return tokenizer
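
To make the idea of a word index concrete, here is a simplified pure-Python stand-in for what fitting a tokenizer does: lowercase the text, drop punctuation, and number the words by frequency starting at 1. It is an illustration only, not the Keras implementation (for example, it strips every character in `string.punctuation`, which may differ from the Keras default filters), and the names `build_word_index` and `to_sequence` are hypothetical.

```python
import string
from collections import Counter

STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def clean_words(line):
    # lowercase, remove punctuation, split on whitespace
    return line.lower().translate(STRIP_PUNCT).split()

def build_word_index(lines):
    counts = Counter()
    for line in lines:
        counts.update(clean_words(line))
    # most frequent word gets index 1; index 0 stays free for padding
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def to_sequence(line, word_index):
    # words that were never seen during fitting are skipped
    return [word_index[w] for w in clean_words(line) if w in word_index]

sample = ["I am here.", "I am not here.", "I see."]  # made-up sentences
word_index = build_word_index(sample)
print(word_index)
print(to_sequence("I am here.", word_index))
```

Because index 0 is reserved for padding, the vocabulary size used for the model is the number of indexed words plus one, which matches the `+ 1` in the cells that follow.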

In [None]:
#prepare english tokenizer
eng_tokenizer = tokenization(deu_eng[:,0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = 8
print('English Vocabulary Size: %d' % eng_vocab_size)

In [None]:
#prepare German tokenizer
deu_tokenizer = tokenization(deu_eng[:,1])
deu_vocab_size = len(deu_tokenizer.word_index) + 1
deu_length = 8  #German sequences longer than 8 tokens are truncated when padded
print('German Vocabulary Size: %d' % deu_vocab_size)

#### 5.2.Function to pad the sequences
The vectorized sequences are padded with zeros up to the chosen sequence length so that all the sequences have the same length.

In [None]:
#encode and pad sequences
def encode_sequences(tokenizer, length, lines):
  seq = tokenizer.texts_to_sequences(lines)
  seq = pad_sequences(seq, maxlen = length, padding = 'post')
  return seq
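
The effect of `padding = 'post'` on a single sequence can be sketched in plain Python. The helper `pad_post` below is hypothetical and only mimics the behaviour used here: zeros are appended at the end, and a sequence that is too long loses its leading items, which corresponds to the default `truncating='pre'` of `pad_sequences`.

```python
def pad_post(seq, length):
    """Pad with zeros at the end; keep only the last `length` items if too long."""
    seq = list(seq)
    if len(seq) > length:
        seq = seq[len(seq) - length:]   # drop items from the front
    return seq + [0] * (length - len(seq))

short = pad_post([5, 2, 9], 5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], 5)
print(short)
print(long)
```

If the beginning of a long sentence matters more than its end, `truncating='post'` can be passed to `pad_sequences` instead.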

### 6.Building the model
The data is split into training and test sets for model training and evaluation.

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(deu_eng, test_size = 0.2, random_state = 12)

#### 6.1.Encoding the sentences
The German sentences are encoded as the input and English sentences as the target sequences.

In [None]:
#prepare training data
train_x = encode_sequences(deu_tokenizer, deu_length, train[:,1])
train_y = encode_sequences(eng_tokenizer, eng_length, train[:,0])

In [None]:
#prepare validation data
test_x = encode_sequences(deu_tokenizer, deu_length, test[:,1])
test_y = encode_sequences(eng_tokenizer, eng_length, test[:,0])

#### 6.2.Seq2Seq Model
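An Embedding layer and an LSTM layer are used as the encoder. The encoder output is repeated once per output timestep and passed to the decoder, which consists of another LSTM layer followed by a Dense layer with a softmax over the output vocabulary.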

In [None]:
#build NMT model
def build_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
  model = Sequential()
  model.add(Embedding(in_vocab, units, input_length= in_timesteps, mask_zero = True))
  model.add(LSTM(units))
  model.add(RepeatVector(out_timesteps))
  model.add(LSTM(units, return_sequences= True))
  model.add(Dense(out_vocab, activation = 'softmax'))
  return model
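
To get a feel for the size of this architecture, the trainable parameters of each layer can be estimated by hand with the standard formulas for Embedding, LSTM and Dense layers (RepeatVector has none). The vocabulary sizes in the sketch are placeholders, not the values from this dataset, and the function names are hypothetical; `model.summary()` reports the actual numbers.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def seq2seq_params(in_vocab, out_vocab, units):
    counts = {
        'embedding': in_vocab * units,             # one vector per input word
        'encoder_lstm': lstm_params(units, units),
        'decoder_lstm': lstm_params(units, units),
        'dense': units * out_vocab + out_vocab,    # weights plus biases
    }
    counts['total'] = sum(counts.values())
    return counts

# placeholder vocabulary sizes, chosen only for illustration
counts = seq2seq_params(in_vocab=10000, out_vocab=6000, units=512)
for name, value in counts.items():
    print(name, value)
```

With large vocabularies, the Embedding and Dense layers tend to hold most of the parameters, so the vocabulary size affects training time as much as the number of LSTM units does.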

The RMSprop optimizer is used in this model, as it is often considered a good choice for recurrent neural networks.

In [None]:
model = build_model(deu_vocab_size, eng_vocab_size, deu_length, eng_length, 512)
rms = optimizers.RMSprop(learning_rate=0.001)
model.compile(optimizer = rms, loss = 'sparse_categorical_crossentropy')
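
The sparse variant of the loss is used because the targets are integer word indices rather than one-hot vectors. For a single timestep, the loss is the negative log-probability that the model assigns to the correct word. The sketch below illustrates this on a made-up three-word distribution; it is not the Keras implementation, which also averages over timesteps and samples.

```python
import math

def sparse_categorical_crossentropy(target_index, probs):
    # negative log-probability of the correct class
    return -math.log(probs[target_index])

probs = [0.1, 0.7, 0.2]  # made-up softmax output over a 3-word vocabulary
loss_good = sparse_categorical_crossentropy(1, probs)  # correct word is likely
loss_bad = sparse_categorical_crossentropy(0, probs)   # correct word is unlikely
print(round(loss_good, 4), round(loss_bad, 4))
```

The loss shrinks towards zero as the probability of the correct word approaches one.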

### 7.Training the model
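The model is trained for 5 epochs with a batch size of 512, and 20% of the training data is held out for validation. The ModelCheckpoint callback saves the model only when the validation loss improves, so the saved model is the one with the lowest validation loss.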

In [None]:
filename = 'model1'
checkpoint = ModelCheckpoint(filename, monitor = 'val_loss', verbose=1, save_best_only = True, mode='min')
history = model.fit(train_x, train_y.reshape(train_y.shape[0], train_y.shape[1], 1),
                    epochs=5, batch_size =512, 
                    validation_split=0.2, 
                    callbacks = [checkpoint], verbose=1)
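
The behaviour of `save_best_only = True` with `mode='min'` can be illustrated without Keras: a checkpoint is written only in epochs where the monitored value is lower than every value seen before. The function name and the loss values in this sketch are made up.

```python
def checkpoint_epochs(val_losses):
    """Return the 1-based epochs at which a best-only checkpoint would be saved."""
    best = float('inf')
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:      # mode='min': lower is better
            best = loss
            saved.append(epoch)
    return saved

saved = checkpoint_epochs([3.1, 2.8, 2.9, 2.6, 2.7])  # made-up validation losses
print(saved)
```

In this example the file on disk after training would hold the weights from epoch 4, not from the final epoch.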

The training loss and validation loss are compared using a visualization.

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train','validation'])
plt.show()

### 8.Testing the model

In [None]:
model = load_model('model1')
pred = model.predict(test_x.reshape((test_x.shape[0],test_x.shape[1])))

In [None]:
#pick the most probable word index at each timestep (last axis = vocabulary)
pred = argmax(pred, axis=-1)

In [None]:
pred
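
The model output has shape (samples, timesteps, vocabulary size), so the predicted word index for each timestep is the position of the largest probability along the last axis. A small pure-Python sketch on a made-up prediction shows the idea; with NumPy the equivalent call is `argmax(pred, axis=-1)`.

```python
def argmax_last_axis(batch):
    """batch[sample][timestep] is a list of per-word probabilities."""
    return [[max(range(len(step)), key=step.__getitem__) for step in sample]
            for sample in batch]

# one sample, two timesteps, three-word vocabulary (made-up numbers)
toy_pred = [[[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]]
toy_indices = argmax_last_axis(toy_pred)
print(toy_indices)
```

The result has one integer per timestep and per sample, which is the form the decoding loop further down expects.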

In [None]:
def get_word(n, tokenizer):
  for word, index in tokenizer.word_index.items():
    if index == n:
      return word
  return None
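
Since `get_word` scans the whole vocabulary for every lookup, decoding many predictions can be slow. One possible alternative, sketched here with a made-up dictionary in place of `tokenizer.word_index`, is to invert the mapping once and then look words up directly. The names `index_word` and `get_word_fast` are illustrative.

```python
word_index = {'i': 1, 'am': 2, 'here': 3}  # made-up stand-in for tokenizer.word_index

# build the reverse mapping once
index_word = {index: word for word, index in word_index.items()}

def get_word_fast(n, index_word):
    # returns None for indices without a word, such as the padding index 0
    return index_word.get(n)

print(get_word_fast(2, index_word))
print(get_word_fast(0, index_word))
```

Like `get_word`, this returns `None` for the padding index, so it could be swapped in without changing the decoding logic.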

In [None]:
#convert predictions into text in English
pred_text = []
for i in pred:
  temp = []
  for j in range(len(i)):
    t = get_word(i[j], eng_tokenizer)
    if j>0:
      if(t == get_word(i[j-1], eng_tokenizer)) or (t == None):
        temp.append('')
      else:
        temp.append(t)
    else:
      if(t == None):
        temp.append('')
      else:
        temp.append(t)
  #inside the outer loop: one space-separated sentence per prediction
  pred_text.append(' '.join(temp))

In [None]:
pred_df = pd.DataFrame({'actual':test[:,0],'predicted':pred_text})

In [None]:
pd.set_option('display.max_colwidth',200)

In [None]:
pred_df.head(15)

In [None]:
pred_df.tail(15)