## Possible Implementation of Attention Mechanism for English to Bangla Translation
MD Muhaimin Rahman
contact: sezan92[at]gmail[dot]com

In this project I have implemented, or at least tried to implement, an attention mechanism for an encoder-decoder deep learning network for English-to-Bangla translation in Keras. Neural machine translation is a typical use case for encoder-decoder networks; an example is given in Jason Brownlee's [blog](https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/). However, this architecture struggles with long sentences, because the whole source sentence has to be compressed into a single fixed-length vector. Bahdanau et al. addressed this by using an attention mechanism for neural machine translation in this [paper](https://arxiv.org/abs/1409.0473).


Importing libraries.

In [None]:
import keras
from keras.layers import Input,Permute,Dense,LSTM,Embedding,Bidirectional,multiply,RepeatVector,Flatten,Activation
from keras.models import Model
from keras.activations import softmax
from keras.utils import plot_model
import text_preprocess_utils_fh as tpu
from keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in newer scikit-learn releases
from keras.optimizers import Adam
from keras.regularizers import l2
import numpy as np
import pickle
import os

Preprocessing text. I have written a class ```text_prep``` in the file ```text_preprocess_utils_fh.py```. Please have a look at the file for more details.

In [None]:
tp = tpu.text_prep('/data/ben.txt',limit=20000)
tp.show()

#### Notice
The file is saved in the ```data``` folder. If you want to run this on your own computer, write ```tp = tpu.text_prep('data/ben.txt',limit=20000)``` instead. The leading ```/``` is for the FloydHub cloud GPU environment.

In [None]:
max_words = 12
source_input,input_starter,output_encoded = tp.preprocess(max_words)
# Get the vocabulary for both languages
source_words,target= tp.get_vocab()
source_vocab = source_words[0] #first one is vocabulary list
source_i2w = source_words[1] # second one is vocabulary index to word 
source_w2i = source_words[2] # third one is vocabulary word to index
target_vocab = target[0] #first one is vocabulary list
target_i2w = target[1] # second one is vocabulary index to word
target_w2i = target[2] # third one is vocabulary word to index

Vocabulary size for both languages. Notice that I have added 2 to both sizes.

In [None]:
source_vocab_size = len(source_vocab)+2
target_vocab_size = len(target_vocab)+2
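
The notebook does not spell out what the two extra slots are for. One grounded reason for at least one of them is that `mask_zero=True` in the Keras `Embedding` layer reserves index 0 for padding. The second reserved id in the sketch below (an unknown-word id) is purely an assumption, as are the names `PAD_ID`, `UNK_ID` and `encode`; the real logic lives in `text_prep`.

```python
# Hypothetical toy vocabulary; the real one comes from tp.get_vocab().
vocab = ["i", "eat", "rice"]

PAD_ID = 0  # reserved for padding (masked when mask_zero=True)
UNK_ID = 1  # assumed id for out-of-vocabulary words

# Real words start after the two reserved ids.
w2i = {word: idx + 2 for idx, word in enumerate(vocab)}
vocab_size = len(vocab) + 2  # the largest index in use is vocab_size - 1


def encode(sentence, max_words):
    """Map words to ids, truncate to max_words, then pad with PAD_ID."""
    ids = [w2i.get(w, UNK_ID) for w in sentence.split()][:max_words]
    return ids + [PAD_ID] * (max_words - len(ids))


encoded = encode("i eat fish", max_words=5)
```

Under this convention, every id that `encode` can produce is strictly smaller than `vocab_size`, which is what the `input_dim` of an `Embedding` layer needs.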

Timesteps, i.e. words per sentence.

In [None]:

source_timesteps = source_input.shape[1]
target_timesteps = output_encoded.shape[1]

LSTM units. Notice that the first is half of the second. This is because I have used a Bidirectional LSTM for the first one, which doubles the given units at the output.

In [None]:
n_a= 32
n_s =64
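
The doubling can be illustrated without Keras. The sketch below uses NumPy arrays as stand-ins for the forward and backward LSTM outputs (the values and the sentence length are made up) and concatenates them along the feature axis, which is what the default `merge_mode='concat'` of `Bidirectional` does.

```python
import numpy as np

n_a = 32        # units per direction, as in the notebook
timesteps = 12  # hypothetical sentence length

# Stand-ins for the per-timestep outputs of the forward and backward LSTMs.
forward_states = np.zeros((timesteps, n_a))
backward_states = np.ones((timesteps, n_a))

# Concatenating along the last axis doubles the feature size.
h_source = np.concatenate([forward_states, backward_states], axis=-1)
n_s = h_source.shape[-1]
```

Here `n_s` comes out as `2 * n_a`, so the decoder state and the encoder states have the same width and can be multiplied element-wise.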

Hyperparameters

In [None]:
em_shape=100
batch_size=64
epochs = 100

Saving the vocabulary for later use.

In [None]:
with open('/output/ben_eng_source_vocab_fh.pkl','wb') as f: # with-block closes the file after writing
    pickle.dump(source_words,f)
with open('/output/ben_eng_target_vocab_fh.pkl','wb') as f:
    pickle.dump(target,f)
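
For completeness, here is a small sketch of the round trip: writing a vocabulary to disk and reading it back later. The toy `source_words` tuple, the temporary path, and the assumption that the two lookup tables are dicts are all illustrative; only the three-element layout (vocabulary list, index to word, word to index) is taken from the notebook.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for source_words: (vocab list, index->word, word->index).
source_words = (["i", "eat"], {2: "i", 3: "eat"}, {"i": 2, "eat": 3})

# Temporary location used only for this example.
path = os.path.join(tempfile.mkdtemp(), "source_vocab.pkl")

# Save: the with-block closes the file even if pickling fails.
with open(path, "wb") as f:
    pickle.dump(source_words, f)

# Load: unpack in the same order the notebook indexes the structure.
with open(path, "rb") as f:
    vocab, i2w, w2i = pickle.load(f)
```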


In [None]:
source = Input(shape=(source_timesteps,),name='source') # Source sequence of word indices
source_emb = Embedding(input_dim=source_vocab_size,output_dim=em_shape,name='source_embedding',mask_zero=True)(source) # Embedding for the source sequence; index 0 is treated as padding
h_source = Bidirectional(LSTM(n_a,return_sequences=True,name='h_s'))(source_emb) # Encoder hidden states, 2*n_a = n_s features per timestep
initial_hidden = Input(shape=(n_s,),name='hidden_target') # Initial hidden state of the target decoder
init_state_att = initial_hidden # Running hidden state, updated at every target timestep
init_hid = Input(shape=(n_s,),name='cell_target') # Initial cell state of the target decoder
init_hid_att = init_hid # Running cell state, updated at every target timestep
init_state_att_repeat = RepeatVector(source_timesteps)(init_state_att) # Repeat to (source_timesteps, n_s) to match h_source
output = [] # One prediction per target timestep is appended here

## Attention!
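Now comes the real stuff. The following is closer to pseudocode than a polished implementation: the decoder is unrolled manually, one target word per iteration.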

In [None]:
for _ in range(target_timesteps): # Manually unroll the decoder, one target word per iteration
    # Note: new Dense/LSTM layers are created on every iteration, so weights are not shared across timesteps
    merged = multiply([init_state_att_repeat,h_source]) # Element-wise product of h_t (repeated) and h_s
    score = Dense(1,activation='tanh')(merged) # score = tanh(W(h_t * h_s)), shape (batch, source_timesteps, 1)
    attention_prob = keras.layers.Softmax(axis=1)(score) # Softmax over source timesteps; on the size-1 last axis it would always return 1
    context = multiply([h_source,attention_prob]) # context = prob * h_source, still one vector per source timestep
    init_state_att,_,init_hid_att = LSTM(n_s,return_state=True)(context,initial_state=[init_state_att,init_hid_att]) # Next hidden and cell state of the target
    init_state_att_repeat = RepeatVector(source_timesteps)(init_state_att) # Repeat to 3D so it can be multiplied with h_source
    prediction = Dense(target_vocab_size,activation='softmax',kernel_regularizer=l2())(init_state_att) # Predict the next word
    output.append(prediction) # Collect one prediction per target timestep
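
To make the shapes in one attention step concrete, here is a NumPy-only sketch of the same computation. It is an illustration under assumptions, not the trained model: the random arrays stand in for the encoder states and the decoder state, and `w` stands in for the weights of the `Dense(1)` scoring layer. It also shows why the attention weights have to be normalised over the source timesteps: a softmax over a last axis of size 1 returns 1.0 everywhere.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
batch, source_timesteps, n_s = 2, 5, 64

h_source = rng.normal(size=(batch, source_timesteps, n_s))  # encoder states
h_target = rng.normal(size=(batch, n_s))                    # decoder state
w = rng.normal(size=(n_s, 1)) * 0.1                         # stand-in for Dense(1) weights

# Repeat the decoder state over the source timesteps and multiply element-wise.
merged = h_target[:, None, :] * h_source   # (batch, T, n_s)
score = np.tanh(merged @ w)                # (batch, T, 1)

# Normalising over axis 1 (time) gives one weight per source word.
attention_prob = softmax(score, axis=1)    # (batch, T, 1)
weighted = h_source * attention_prob       # (batch, T, n_s)

# Softmax over the last axis (size 1) degenerates to all ones.
degenerate = softmax(score, axis=-1)
```

In `attention_prob` the weights of each sentence sum to 1 across the source words, whereas `degenerate` contains only ones and therefore carries no information about which source word matters.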

In [None]:
model =Model(inputs=[source,initial_hidden,init_hid],outputs=output)

In [None]:
model.compile(optimizer=Adam(0.008),loss='sparse_categorical_crossentropy')
model.summary()

In [None]:
filepath="/output/attention_ben_%d_words_eng.best.hdf5"%max_words
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
callbacks_list = [checkpoint]

In [None]:
model_json = model.to_json()
with open('/output/ben_model_%d_words.json'%max_words,'w') as f:
    f.write(model_json)

In [None]:
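# Split source sentences and encoded targets into training and validation sets
x_train1,x_test1,y_train,y_test= train_test_split(
        source_input,
        output_encoded)
# Zero vectors for the initial hidden and cell state of the decoder
x_train2 = np.zeros((x_train1.shape[0],n_s))
x_train3 = np.zeros((x_train1.shape[0],n_s))

x_test2 = np.zeros((x_test1.shape[0],n_s))
x_test3 = np.zeros((x_test1.shape[0],n_s))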


In [None]:
model.fit(x=[np.array(x_train1),
             x_train2,x_train3],
    y=list(y_train.swapaxes(0,1)),
    validation_data=([np.array(x_test1),
             x_test2,x_test3],
    list(y_test.swapaxes(0,1))),
    batch_size=batch_size,epochs=epochs,
    callbacks=callbacks_list)
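
The `swapaxes(0,1)` call is there because the model has one output per target timestep, so Keras expects a list with one target array per output. The sketch below shows the reshaping on a toy array. The layout `(samples, timesteps, 1)` for `output_encoded` is an assumption, and the sizes are made up.

```python
import numpy as np

n_samples, target_timesteps = 4, 3  # hypothetical sizes

# Assumed layout of output_encoded: (samples, timesteps, 1) integer word ids.
y = np.arange(n_samples * target_timesteps).reshape(n_samples, target_timesteps, 1)

# Move the time axis to the front, then split it into one array per output head.
y_per_step = list(y.swapaxes(0, 1))
```

After the swap, `y_per_step[t]` holds the word id at position `t` for every sample, which lines up with the `t`-th prediction appended to `output`.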

In [None]:
model.save_weights("/output/attention_ben_%d_words_eng.final.hdf5"%max_words)
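
Once the model is trained, its predictions still have to be turned back into words. The following is a minimal greedy-decoding sketch, not part of the original notebook. It assumes the model returns one `(batch, vocab)` array per target timestep and that `target_i2w` behaves like a dict; the helper `greedy_decode`, the toy probabilities and the toy vocabulary are hypothetical.

```python
import numpy as np

# Hypothetical index-to-word map; the real one is target_i2w.
target_i2w = {0: "", 2: "আমি", 3: "ভাত", 4: "খাই"}

# Stand-in for model.predict(...): one (batch, vocab) array per target timestep.
predictions = [
    np.array([[0.0, 0.1, 0.7, 0.1, 0.1]]),
    np.array([[0.0, 0.1, 0.1, 0.7, 0.1]]),
    np.array([[0.0, 0.1, 0.1, 0.1, 0.7]]),
    np.array([[0.9, 0.0, 0.1, 0.0, 0.0]]),  # padding predicted at the end
]


def greedy_decode(predictions, i2w, sample=0):
    """Pick the most probable word at each timestep and join the non-empty ones."""
    ids = [int(step[sample].argmax()) for step in predictions]
    words = [i2w.get(i, "") for i in ids]
    return " ".join(w for w in words if w)


sentence = greedy_decode(predictions, target_i2w)
```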