# Bi-directional RNN Translator (encoder-decoder architecture)

We will create a language translator using bidirectional LSTMs, a type of recurrent neural network (RNN), with the Keras API of the TensorFlow library in Python. We choose an RNN because RNNs are well suited to sequential data and are widely used for natural language processing. Bidirectional LSTMs are an extension of LSTMs that can improve a model's performance on sequence classification problems. An LSTM preserves 'past' information from earlier steps of a sequence and uses it when processing the current step. A bidirectional LSTM processes the sequence in both the forward and backward directions, so it stores both 'past' and 'future' information for each step, which makes it easier for the model to detect patterns in either direction. This is especially useful for language, since it is easier to understand the meaning of a word when given the previous and next few words in the sentence.
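
To make the idea of reading a sequence in both directions concrete, here is a minimal sketch in plain Python. It is not an LSTM: it uses a one-unit tanh recurrence as a stand-in, and the function names and the weights `w_in` and `w_rec` are arbitrary values chosen for illustration.

```python
import math

def simple_rnn_pass(inputs, w_in=0.5, w_rec=0.8):
    """Run a toy one-unit recurrence and return the state at every step."""
    state = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

def bidirectional_pass(inputs):
    """Pair each step's forward state with its backward state."""
    forward = simple_rnn_pass(inputs)
    # Process the reversed sequence, then flip the states back into input order
    backward = simple_rnn_pass(inputs[::-1])[::-1]
    return list(zip(forward, backward))

sequence = [1.0, -2.0, 0.5, 3.0]
for step, (fwd, bwd) in enumerate(bidirectional_pass(sequence)):
    print(f"step {step}: forward={fwd:+.3f} backward={bwd:+.3f}")
```

At every step the forward state has only seen the inputs up to that step, while the backward state has only seen the inputs from that step onward. Keeping both gives each position a view of its 'past' and its 'future'.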

## Import libraries & datasets

First, we will import the necessary libraries and load our data.

In [1]:
import pandas as pd
import numpy as np
from collections import Counter
import math

from keras.preprocessing.text import Tokenizer
from keras import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import LSTM, Bidirectional, Dense, Embedding, Input, TimeDistributed

We will load the data stored in Google Drive. We will then convert each dataframe into a list of strings, where each string is one sentence in that dataframe's language.

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [3]:
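# The separator 'delimiter' never occurs in the data, so each line is read as one sentence in a single column.
# Multi-character separators need the Python parsing engine; naming it explicitly avoids a ParserWarning.
df_eng = pd.read_csv('/content/drive/MyDrive/Data-files/small_vocab_en.csv', sep='delimiter', header=None, engine='python')
df_frn = pd.read_csv('/content/drive/MyDrive/Data-files/small_vocab_fr.csv', sep='delimiter', header=None, engine='python')
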
In [4]:
from itertools import chain
english_sentences = df_eng.astype(str).values.tolist()
english_sentences = list(chain.from_iterable(english_sentences))

french_sentences = df_frn.astype(str).values.tolist()
french_sentences = list(chain.from_iterable(french_sentences))

In order to feed our data into the neural network, we must convert the input from strings into numbers by tokenizing our list. The tokenize function takes in a list of sentences and returns a 2D list, where each sentence is converted into a list of numbers and each number represents a word, together with the fitted tokenizer. The smaller the number, the more frequent the word. For example, the sentence 'The man walks his dog in the park' will be converted into [1,2,3,4,5,6,1,7], where 'the' is represented by 1 since it is the most frequent word.
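
The frequency-based numbering can be sketched in plain Python. This is a simplified stand-in for the Keras Tokenizer, not its implementation: the helper names `build_word_index` and `to_sequences` are hypothetical, and the sketch only lowercases and splits on whitespace.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to a rank, with 1 for the most frequent word."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(sentences, word_index):
    """Replace every word with its rank."""
    return [[word_index[w] for w in s.lower().split()] for s in sentences]

sentences = ['The man walks his dog in the park']
word_index = build_word_index(sentences)
sequences = to_sequences(sentences, word_index)
print(sequences[0])  # [1, 2, 3, 4, 5, 6, 1, 7]
```

Numbering starts at 1, which leaves 0 free to be used as a padding value later on.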

## Preprocess Data

In [5]:
def tokenize(x):

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    return tokenizer.texts_to_sequences(x), tokenizer

In [7]:
english_numbers, english_tokenizer = tokenize(english_sentences)
english_numbers[0]

[17, 23, 1, 8, 67, 4, 39, 7, 3, 1, 55, 2, 44]

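The sentence 'new jersey is sometimes quiet during autumn , and it is snowy in april .' is converted into the number list [17, 23, 1, 8, 67, 4, 39, 7, 3, 1, 55, 2, 44]. The list has 13 entries rather than 15 because the Tokenizer's default filters strip punctuation such as ',' and '.', and the repeated word 'is' maps to 1 in both positions.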

Next, we will create a function that pads our data. Our neural network expects an input array in which all the elements (number lists) have equal length. Since the sentences in our dataset vary in length, we pass our list through the 'pad_sequences' function. With padding='post' it appends zeros to the end of each shorter list, and when no size is given it pads every list to the length of the longest one.

In [8]:
def pad(x, size=None):
  return pad_sequences(x, maxlen = size, padding='post')
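
The effect of post-padding can be illustrated without Keras. The sketch below is a hypothetical helper, `pad_post`, that works on plain lists and only pads. Unlike `pad_sequences`, it does not truncate and it returns lists rather than a NumPy array.

```python
def pad_post(sequences, size=None, value=0):
    """Append `value` to each list until all lists share the same length."""
    longest = max(len(seq) for seq in sequences)
    if size is None:
        # Default: pad everything to the longest sequence
        size = longest
    if size < longest:
        raise ValueError("this sketch only pads; it does not truncate")
    return [seq + [value] * (size - len(seq)) for seq in sequences]

print(pad_post([[17, 23, 1], [5, 2], [9]]))
# [[17, 23, 1], [5, 2, 0], [9, 0, 0]]
```

Because the tokenizer numbers words from 1, the padding value 0 never collides with a real word.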

Now that we've created the padding & tokenizing functions, we will preprocess our data and move on to splitting our data into training & testing sets before we begin to build our model.

In [9]:
french_numbers, french_tokenizer = tokenize(french_sentences)

# english_wordlist and french_wordlist are our English and French dictionaries:
# the keys are words and the values are the numbers that represent them.
english_wordlist = english_tokenizer.word_index
french_wordlist = french_tokenizer.word_index

# Pad every sentence to the length of the longest sentence in its language
english_numbers = pad(english_numbers)
french_numbers = pad(french_numbers)

# Add a trailing axis so the targets have shape (sentences, sentence length, 1)
french_numbers = french_numbers.reshape(*french_numbers.shape, 1)

The decode function takes an array of integers and the corresponding language's dictionary, and returns the sentence converted back to words.

In [10]:
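def decode(x, vocab):
  # Invert the word -> number dictionary once instead of searching it for every token
  number_to_word = {number: word for word, number in vocab.items()}
  translated_sentence = ''
  # np.ravel flattens the input, so targets with a trailing axis of size 1 also work
  for num in np.ravel(x):
    # Padding (0) has no dictionary entry and is skipped
    if num in number_to_word:
      translated_sentence += number_to_word[num] + ' '
  return translated_sentence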

Now we will split our input data (English sentences) and our target data (French translations) into training and testing sets. We will use a train-test split of 80:20, and during training we will hold out 20% of the training data as a validation set.

In [11]:
# Both datasets contain the same number of sentences
size = len(english_numbers)

# We use a ratio of 0.8:0.2 for the train-test split
split = math.ceil(size*0.8)
train_x = english_numbers[:split]
test_x = english_numbers[split:]
train_y = french_numbers[:split]
test_y = french_numbers[split:]
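
As a worked example of how the two splits combine, the sketch below computes the resulting set sizes. The sentence count of 1000 is a made-up figure, and `split_sizes` is a hypothetical helper. It assumes that the validation set is carved out of the training part and that its boundary is rounded down.

```python
import math

def split_sizes(total, train_fraction=0.8, validation_fraction=0.2):
    """Return (train, validation, test) counts for a two-stage split."""
    # Stage 1: the same ceil-based boundary used for train_x / test_x
    train_and_val = math.ceil(total * train_fraction)
    test = total - train_and_val
    # Stage 2 (assumption): validation comes out of the training part,
    # with the boundary rounded down
    train = math.floor(train_and_val * (1 - validation_fraction))
    validation = train_and_val - train
    return train, validation, test

print(split_sizes(1000))  # (640, 160, 200)
```

So an 80:20 split followed by a 0.2 validation split leaves roughly 64% of the sentences for fitting the weights, 16% for validation and 20% for testing.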

## Building our Model

In [125]:
# Length of each padded English sentence, which is the input shape of the model
x_shape = english_numbers.shape
x_shape[1:]

(15,)

In [13]:
from keras.layers import RepeatVector, TimeDistributed

Now we will build our sequential model. The structure of this model is based on the encoder-decoder model for neural machine translation; the Resources section links to an article that discusses the encoder-decoder model in depth. We first create an input embedding layer, which takes the size of our input vocabulary (the English vocabulary size plus one, since index 0 is reserved for padding) and the length of our padded English sentences. Next, we build the encoder with a bidirectional LSTM, which reads each sentence in both the forward and backward directions and summarizes it as a single vector. A RepeatVector layer then copies that vector once for every time step of the French output. Lastly, we build the decoder with another bidirectional LSTM, using 'return_sequences=True' so that it emits one vector per time step, which is the 3D input the time distributed layer expects. Our output layer is a time distributed dense layer whose size is the output vocabulary (the French vocabulary size plus one) and which produces a softmax distribution over French words at each time step. Because the time distributed wrapper applies the same dense layer at every time step, its weights are shared across all output positions, which keeps the number of weights small.

In [15]:
model = Sequential()

# Embedding input layer
model.add(Embedding(len(english_wordlist)+1, 128, input_length=english_numbers.shape[1],
                         input_shape=english_numbers.shape[1:]))
# Encoder
model.add(Bidirectional(LSTM(128)))
model.add(RepeatVector(french_numbers.shape[1]))

# Decoder
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(TimeDistributed(Dense(len(french_wordlist)+1, activation='softmax')))

In [16]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 15, 128)           25600     
_________________________________________________________________
bidirectional_3 (Bidirection (None, 256)               263168    
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 21, 256)           0         
_________________________________________________________________
bidirectional_4 (Bidirection (None, 21, 256)           394240    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 21, 345)           88665     
Total params: 771,673
Trainable params: 771,673
Non-trainable params: 0
_________________________________________________________________
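
The parameter counts in the summary can be checked by hand with the standard formulas for embedding, LSTM and dense layers. In the sketch below the helper names are hypothetical, the English vocabulary size of 200 is inferred from 25,600 / 128, and the French vocabulary size of 345 is read from the output shape of the last layer.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim

embedding = embedding_params(200, 128)   # 25,600
encoder = 2 * lstm_params(128, 128)      # 263,168 (forward + backward LSTM)
decoder = 2 * lstm_params(256, 128)      # 394,240 (input is the 256-wide encoder vector)
output = dense_params(256, 345)          # 88,665 (shared across all 21 time steps)

total = embedding + encoder + decoder + output
print(total)  # 771673
```

The RepeatVector layer only copies its input, so it contributes no parameters. The dense layer's count does not grow with the 21 time steps because the same weights are reused at each one.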


In [17]:
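# train_y needs no reshape here: french_numbers was already given its trailing axis above
# validation_split=0.2 holds out 20% of the training data for validation
model.fit(train_x, train_y, batch_size=1024, epochs=25, validation_split=0.2)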

Now that our model is trained, we can check its output by predicting the French translation of an English sentence from our test set.

In [None]:
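eng_sentence = decode(test_x[0], english_wordlist)
frn_sentence = decode(test_y[0], french_wordlist)
# predict expects a batch, so slice to keep the leading batch dimension
result = model.predict(test_x[0:1])
# result[0] holds one probability distribution per time step: pick the most probable word index at each step
predicted_sentence = decode(np.argmax(result[0], axis=-1), french_wordlist)
print('English Sentence: ', eng_sentence)
print('French Translation prediction: ', predicted_sentence)
print('Actual French Translation: ', frn_sentence)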
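
The step from the model's softmax output to a sentence can be sketched in plain Python. The probabilities and the three-word vocabulary below are made up for illustration, and `greedy_decode` is a hypothetical helper. It takes the most probable index at each time step and skips index 0, which stands for padding.

```python
def argmax(values):
    """Return the position of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def greedy_decode(step_probabilities, index_to_word):
    """Pick the most probable word at every time step, skipping padding."""
    words = []
    for probs in step_probabilities:
        index = argmax(probs)
        # Index 0 is padding and has no entry in the dictionary
        if index in index_to_word:
            words.append(index_to_word[index])
    return ' '.join(words)

index_to_word = {1: 'le', 2: 'chat', 3: 'dort'}
step_probabilities = [
    [0.05, 0.80, 0.10, 0.05],  # step 0: index 1 is most probable
    [0.10, 0.10, 0.70, 0.10],  # step 1: index 2
    [0.05, 0.05, 0.10, 0.80],  # step 2: index 3
    [0.90, 0.04, 0.03, 0.03],  # step 3: index 0 (padding)
]
print(greedy_decode(step_probabilities, index_to_word))  # le chat dort
```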

## Resources

Here are some resources that helped me with this project:

https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/
https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/