<pre>
1. Download the Italian to English translation dataset from <a href="http://www.manythings.org/anki/ita-eng.zip">here</a>
2. You will find the ita.txt file in that ZIP. Read it with Python and preprocess the data.
3. You have to implement an Encoder and Decoder architecture with Luong attention.

Encoder - 1 layer LSTM 
Decoder - 1 layer LSTM 
Attention - Luong attention. 

You can read about Luong attention in <a href="https://arxiv.org/pdf/1508.04025.pdf">this</a> paper. <a href="https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/">This</a> is one of the best resources you can find on Google; please go through it.

You can see a high-level overview in the images below. You have to use only Global attention. In Global attention there are 3 types of scoring functions. Please create one model for each scoring function. 




<pre><font size=5><b>Luong Attention (Multiplicative Attention)</b></font>
<img src="https://lilianweng.github.io/lil-log/assets/images/luong2015-fig2-3.png">

<img src="https://miro.medium.com/max/1400/0*4y96boGNMiNVHNo8.">
<img src="https://i.stack.imgur.com/RaTOU.png"></pre>



4. Using the attention weights, you can draw attention plots. Please plot them for 2-3 examples. You can read about them in <a href="https://www.tensorflow.org/tutorials/text/nmt_with_attention#translate">this</a> tutorial.

5. You must write the attention layer yourself. The main objective of this assignment is to read and implement a paper on your own, so please do it yourself.

6. You can use any tf.keras high-level APIs to build and train the models. 

7. Use BLEU score as metric to evaluate your model. You can use any loss function you need.

8. You have to use TensorBoard to plot the graph, scores, and histograms of gradients. 

</pre>

# 1. Writing a custom layer

Before we write custom layers in TensorFlow, let's look at the definition of the <b>Layer</b> class:

<a href='https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer'> tf.keras.layers.Layer</a>

From the TF documentation:
<pre>
This is the class from which all layers inherit.

A layer is a class implementing common neural networks operations, such as convolution, batch norm, etc. These operations require managing weights, losses, updates, and inter-layer connectivity.

Users will just instantiate a layer and then treat it as a callable.

We recommend that descendants of Layer implement the following methods:

+-------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   |
|<strong> <font color='green'>def __init__(self, trainable=True, name=None, dtype=None, dynamic=False, **kwargs):</font></strong>                               |
+-------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   |
|* the properties should be set by the user via keyword arguments.                                                  |
|                                                                                                                   |
|* note that 'dtype', 'input_shape' and 'batch_input_shape' are only applicable to input layers, do not pass these  |
|  keywords to non-input layers.                                                                                    |
+-------------------------------------------------------------------------------------------------------------------+
|* allowed_kwargs = {'input_shape', 'batch_input_shape', 'batch_size', 'weights', 'activity_regularizer','autocast'}|
+-------------------------------------------------------------------------------------------------------------------+


+-------------------------------------------------------------------------------------------------------------------+
|<strong> <font color='green'>def build(self, input_shape)</font></strong>:                                                                                     |
+-------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   |
| * Creates the variables of the layer (optional, for subclass implementers). This is a method that implementers of |
|   subclasses of `Layer` or `Model`                                                                                |
|                                                                                                                   |
| * You can override if you need a state-creation step in-between <em><font color='blue'>layer instantiation</font></em> and <em><font color='blue'>layer call</font></em>.               |
|                                                                                                                   |
| * This is typically used to create the weights of `Layer` subclasses.                                             |
+-------------------------------------------------------------------------------------------------------------------+
| Arguments:                                                                                                        |
|    input_shape:                                                                                                   |
|    Instance of `TensorShape`, or list of instances of `TensorShape` if the layer expects a list of inputs         |
+-------------------------------------------------------------------------------------------------------------------+

+-------------------------------------------------------------------------------------------------------------------+
| <strong> <font color='green'>def call(self, inputs, **kwargs)</font></strong>:                                                                                |
+-------------------------------------------------------------------------------------------------------------------+
| * This is where the layer's logic lives.                                                                          |
+-------------------------------------------------------------------------------------------------------------------+
|* Arguments:                                                                                                       |
|        inputs: Input tensor, or list/tuple of input tensors.                                                      |
|        **kwargs: Additional keyword arguments.                                                                    |
+-------------------------------------------------------------------------------------------------------------------+
|* Returns:                                                                                                         |
|        A tensor or list/tuple of tensors.                                                                         |
+-------------------------------------------------------------------------------------------------------------------+
    
<a href='https://github.com/tensorflow/tensorflow/blob/r2.1/tensorflow/python/keras/engine/base_layer.py#L310'>check for more arguments</a>                               
+-------------------------------------------------------------------------------------------------------------------+
|<strong> <font color='green'>def add_weight(self,name=None, shape=None, ..., **kwargs)</font></strong>:                                                        |
+-------------------------------------------------------------------------------------------------------------------+
|* Adds a new variable to the layer.                                                                                |
+-------------------------------------------------------------------------------------------------------------------+
|* Arguments:                                                                                                       |
|        name : Variable name.                                                                                      |
|        shape: Variable shape. Defaults to scalar if unspecified.                                                  |
|        dtype: The type of the variable. Defaults to `self.dtype` or `float32`.                                    |
|        ...                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------+
|* Returns:                                                                                                         |
|        The created variable. Usually either a `Variable` or `ResourceVariable` instance.                          |
+-------------------------------------------------------------------------------------------------------------------+
...
There are other methods available as well; please check this link for a better understanding:
<a href='https://github.com/tensorflow/tensorflow/blob/r2.1/tensorflow/python/keras/engine/base_layer.py'>base_layer.py</a>

</pre>

## 1.1 Example
super(): https://stackoverflow.com/a/27134600/4084039
<img src='https://i.imgur.com/1a8N7gH.png' width=600>

## 1.2 Resources
Do read this guide for more information: https://www.tensorflow.org/guide/keras/custom_layers_and_models
A few screenshots from the above guide:

1.
<img src='https://i.imgur.com/SDNQgos.png' width=600>
2.
<img src='https://i.imgur.com/syqjpux.png' width=600>
3. 
<img src='https://i.imgur.com/PfmYWno.png' width=600>

# 2. Writing a custom Model

There are three ways to implement a model architecture in TF
<img src='https://i.imgur.com/n7DBcoo.png' width=400>
The third and final method to implement a model architecture using Keras and TensorFlow 2.0 is called model subclassing.

Inside of tf.keras the `Model` class is the root class used to define a model architecture. Since tf.keras utilizes object-oriented programming, we can actually `subclass` the Model class and then insert our architecture definition.

<pre>
    The `Model` class has the same API as `Layer`, with the following differences:
        It exposes built-in training, evaluation, and prediction loops (model.fit(), model.evaluate(), model.predict()).
        It exposes the list of its inner layers, via the `model.layers` property.
        It exposes saving and serialization APIs.
    
    <font color='blue'>Effectively, the "Layer" class corresponds to what we refer to in the literature as a "layer" (as in "convolution layer" or "recurrent layer") or as a "block" (as in "ResNet block" or "Inception block").

    Meanwhile, the "Model" class corresponds to what is referred to in the literature as a "model" (as in "deep learning model") or as a "network" (as in "deep neural network").
    </font>
</pre>
## 2.1 Example

In [None]:
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Softmax


class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, num_outputs, **kwargs):
        super().__init__(**kwargs) #https://stackoverflow.com/a/27134600/4084039
        self.num_outputs = num_outputs
        
    def build(self, input_shape):
        # Weights are created lazily here, once the input shape is known
        self.kernel = self.add_weight(name="kernel", shape=[int(input_shape[-1]), self.num_outputs])
        
    def call(self, inputs):
        print(inputs.shape, self.kernel.shape)
        return tf.matmul(inputs, self.kernel)


class MyModel(Model):
    def __init__(self, num_inputs, num_outputs, rnn_units):
        super().__init__() # https://stackoverflow.com/a/27134600/4084039
        self.dense = MyDenseLayer(num_outputs, name='myDenseLayer')
#         self.lstmcell = tf.keras.layers.LSTMCell(rnn_units)
#         self.rnn = RNN(self.lstmcell)
        self.softmax = Softmax()
        
    def call(self, inputs):
#         output = self.rnn(inputs)
        output = self.dense(inputs)
        output = self.softmax(output)
        return output

import numpy as np
data = np.zeros([10, 5])
y = np.zeros([10,2])

model  = MyModel(num_inputs=5, num_outputs=2, rnn_units=32)

loss_object = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()

model.compile(optimizer=optimizer,loss=loss_object)
model.fit(data,y, steps_per_epoch=1)

model.summary()

Source : https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f

# 3. Encoder-decoder Architecture

<pre>
First, let's talk about why we need attention models, and before that let's review how basic seq2seq models work. A normal seq2seq model first processes the whole input sequence and generates a context vector. This context vector is then forwarded to the decoder, which starts producing the output sequence. This architecture works fine with short input sequences but not with long ones, because a single context vector cannot preserve the dependencies on words near the start of the input. Refer to the image below to visualize how a normal seq2seq model works.
<img src="./Attention_1.jpeg" style="width: 600px;">

Performance of a simple Seq2Seq model:

<img src="./Attention_2.jpeg" style="width: 600px;">

Attention models address the long-term dependency problem by performing the task the way humans translate. For example, suppose we have to translate a long English sentence to Hindi. We read the first few words, translate them, and then move on to the next few words. Most of the time, while translating, we give more importance to some words than to others. This is exactly how attention models work.
In a normal Seq2Seq model, the decoder only uses the context vector generated at the end of the encoder and discards the rest of the hidden states, whereas an attention model uses all of the hidden states.

Architecture of Attention Models:

<img src="./Attention_3.jpeg" style="width: 600px;">

In the above image, c1, c2, c3, ... are the context vectors that are fed to the decoder at each time step. Each context vector is the sum of h_i * alpha_i (i = 1 to 3 in the above case). We constrain the alphas so that all the alphas contributing to one context vector sum to 1. This makes sense because, under that constraint, the model can learn to give the highest alpha to the word that is most relevant at that time step.
<img src="./Attention_4.jpeg" style="width: 400px;">


There are two types of attention:
1) Local Attention
2) Global Attention

The attention architecture explained above is the basis for local attention, but in this article we will use global attention. Global attention means we consider all the encoder hidden states when calculating the context vector for one output word (Tx in the above image will be length(input_sequence)).

</pre>
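
The weighted sum described above can be sketched in a few lines of NumPy. This is only an illustration: the array values and the names `hidden_states`, `scores`, and `softmax` are made up for the example and are not part of the model trained below. It turns raw alignment scores into alphas with a softmax (so they sum to 1) and combines three encoder hidden states into one context vector.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Three encoder hidden states h1, h2, h3, each of size 4 (illustrative values).
hidden_states = np.array([[1.0, 0.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 0.0]])

# Unnormalised alignment scores for one decoder time step (illustrative values).
scores = np.array([2.0, 0.5, 0.1])

# The alphas are non-negative and sum to 1.
alphas = softmax(scores)

# Context vector = sum over i of alpha_i * h_i
context = (alphas[:, None] * hidden_states).sum(axis=0)

print(alphas, alphas.sum())
print(context)
```

Because the first score is the largest, the first hidden state contributes most to the context vector.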

In [3]:
import tensorflow as tf

# TF 1.x only: eager execution is already the default in TF 2.x, where this call does not exist
tf.enable_eager_execution()

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import time
import string

In [4]:
with open('./ita.txt', encoding='UTF-8') as f:
    lines = f.read().strip().split('\n')
lines[0:10]

['Hi.\tCiao!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #607364 (Cero)',
 'Run!\tCorri!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906347 (Guybrush88)',
 'Run!\tCorra!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906348 (Guybrush88)',
 'Run!\tCorrete!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906350 (Guybrush88)',
 'Who?\tChi?\tCC-BY 2.0 (France) Attribution: tatoeba.org #2083030 (CK) & #2126402 (Guybrush88)',
 'Wow!\tWow!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #1922050 (Guybrush88)',
 'Jump!\tSalta!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1102981 (jamessilver) & #1543215 (Guybrush88)',
 'Jump!\tSalti!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1102981 (jamessilver) & #4356755 (Guybrush88)',
 'Jump!\tSaltate!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1102981 (jamessilver) & #4356756 (Guybrush88)',
 'Jump.\tSalta.\tCC-BY 2.0 (France) Attribution: tatoeba.org #63103

In [5]:
len(lines)

335031

## Preprocessing text data

In [6]:
# remove special characters
exclude = set(string.punctuation)
# remove numbers
remove_digits = str.maketrans('', '', string.digits)

In [7]:
def preprocess_eng_sentence(sent):
    sent = sent.lower()
    sent = re.sub("'", '', sent)
    sent = ''.join(ch for ch in sent if ch not in exclude)
    sent = sent.translate(remove_digits)
    sent = sent.strip()
    sent = re.sub(" +", " ", sent)
    sent = '<start> ' + sent + ' <end>'
    return sent

In [8]:
def preprocess_ita_sentence(sent):
    sent = re.sub("'", '', sent)
    sent = ''.join(ch for ch in sent if ch not in exclude)
    sent = sent.strip()
    sent = re.sub(" +", " ", sent) 
    sent = '<start> ' + sent + ' <end>'
    return sent
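
To see what the cleaning steps do to a single sentence, here is a condensed restatement of the English preprocessing function, repeated so that the snippet runs on its own. The sample sentence is made up for illustration.

```python
import re
import string

exclude = set(string.punctuation)
remove_digits = str.maketrans('', '', string.digits)

def preprocess_eng_sentence(sent):
    sent = sent.lower()                                     # lowercase
    sent = ''.join(ch for ch in sent if ch not in exclude)  # drop punctuation, including apostrophes
    sent = sent.translate(remove_digits)                    # drop digits
    sent = re.sub(" +", " ", sent.strip())                  # collapse repeated spaces
    return '<start> ' + sent + ' <end>'

cleaned = preprocess_eng_sentence("I'm 25 years old!")
print(cleaned)
```

Note that removing apostrophes merges contractions ("I'm" becomes "im"), which is why inputs such as "wont" and "dont" show up in the translations later on.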

In [9]:
from tqdm import tqdm_notebook as tqdm
sent_pairs = []
for line in tqdm(lines):
    sent_pair = []
    #print(line.split('\t'))
    eng, ita = line.split('\t')[0:2]
    eng = preprocess_eng_sentence(eng)
    sent_pair.append(eng)
    ita = preprocess_ita_sentence(ita)
    sent_pair.append(ita)
    sent_pairs.append(sent_pair)
sent_pairs[5000:5010]

[['<start> i want that <end>', '<start> La voglio <end>'],
 ['<start> i want that <end>', '<start> Lo voglio <end>'],
 ['<start> i want that <end>', '<start> Io lo voglio <end>'],
 ['<start> i want that <end>', '<start> Io la voglio <end>'],
 ['<start> i want them <end>', '<start> Voglio loro <end>'],
 ['<start> i want them <end>', '<start> Io voglio loro <end>'],
 ['<start> i want them <end>', '<start> Li voglio <end>'],
 ['<start> i want them <end>', '<start> Io li voglio <end>'],
 ['<start> i want them <end>', '<start> Le voglio <end>'],
 ['<start> i want them <end>', '<start> Io le voglio <end>']]

In [10]:
# Indexing words to numbers and vice versa
class LanguageIndexing():
    def __init__(self, lang):
        self.lang = lang
        self.word2idx = {}
        self.idx2word = {}
        self.vocab = set()

        self.create_index()

    def create_index(self):
        for phrase in self.lang:
            self.vocab.update(phrase.split(' '))

        self.vocab = sorted(self.vocab)

        self.word2idx['<pad>'] = 0
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1

        for word, index in self.word2idx.items():
            self.idx2word[index] = word

In [11]:
def max_length(tensor):
    return max(len(t) for t in tensor)

In [12]:
# Build the vocabularies, convert sentences to index sequences, and pad them to a common length
def load_dataset(pairs, num_examples):  
    inp_lang = LanguageIndexing(en for en, ma in pairs)
    targ_lang = LanguageIndexing(ma for en, ma in pairs)
    input_tensor = [[inp_lang.word2idx[s] for s in en.split(' ')] for en, ma in pairs]
    target_tensor = [[targ_lang.word2idx[s] for s in ma.split(' ')] for en, ma in pairs]
    max_length_inp, max_length_tar = max_length(input_tensor), max_length(target_tensor)
    input_tensor = tf.keras.preprocessing.sequence.pad_sequences(input_tensor, 
                                                                 maxlen=max_length_inp,
                                                                 padding='post')
    
    target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor, 
                                                                  maxlen=max_length_tar, 
                                                                  padding='post')
    
    return input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_tar
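
To make the effect of `padding='post'` concrete, here is a NumPy-only sketch of the same idea. `pad_post` is a hypothetical helper written for this illustration, not a Keras function, and the index sequences are made up. Zeros can serve as padding because `LanguageIndexing` reserves index 0 for `<pad>`.

```python
import numpy as np

def pad_post(sequences, pad_value=0):
    # Pad every sequence on the right with pad_value up to the longest length.
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

# Toy index sequences of different lengths (illustrative values).
toy = [[5, 2, 9], [7, 1], [3, 8, 4, 6]]
padded = pad_post(toy)
print(padded)
```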

In [13]:
input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_targ = load_dataset(sent_pairs, len(lines))

In [14]:
input_tensor_train, input_tensor_test, target_tensor_train, target_tensor_test = train_test_split(input_tensor, target_tensor, test_size=0.1, random_state = 101)
len(input_tensor_train), len(target_tensor_train), len(input_tensor_test), len(target_tensor_test)

(301527, 301527, 33504, 33504)

In [39]:
BUFFER_SIZE = len(input_tensor_train)
# If you have spare GPU memory you can increase the batch size
BATCH_SIZE = 16
N_BATCH = BUFFER_SIZE//BATCH_SIZE
embedding_dim = 128
# If your GPU utilization is low, increase the number of units
units = 512
vocab_inp_size = len(inp_lang.word2idx)
vocab_tar_size = len(targ_lang.word2idx)
dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [40]:
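# We are using CuDNNLSTM because it is the fastest GPU implementation of LSTM in TF 1.x
# (in TF 2.x, tf.keras.layers.LSTM with its default arguments picks the cuDNN kernel automatically on GPU)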
from tensorflow.keras.layers import CuDNNLSTM

In [41]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = CuDNNLSTM(self.enc_units, return_state=True, return_sequences=True, name="Encoder_LSTM")
        
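    def call(self, x, hidden):
        # Note: `hidden` is accepted but not used; the LSTM starts from its default zero state
        x = self.embedding(x)
        # output: hidden states for every time step; state: final hidden state (the cell state is discarded)
        output, state,_ = self.lstm(x)        
        return output, state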
    
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

<pre>
As mentioned above, we calculate the alphas by applying a scoring function to the encoder hidden states h_i and the decoder output from the previous step. Luong attention provides 3 such functions to calculate the alphas.
<img src="https://i.stack.imgur.com/RaTOU.png">
In this article we are using the general scoring method.
</pre>
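
As a rough sketch of the three scoring functions from the Luong paper (dot, general, concat), the NumPy snippet below computes each score for one decoder state against every encoder state. The sizes, the random stand-ins for the trainable parameters, and all function names are assumptions made for illustration; this is not the attention layer used in the decoder below.

```python
import numpy as np

rng = np.random.default_rng(0)
units = 4      # hidden size (illustrative)
src_len = 3    # number of encoder time steps (illustrative)

h_t = rng.normal(size=(units,))          # current decoder hidden state
h_s = rng.normal(size=(src_len, units))  # all encoder hidden states

# Randomly initialised stand-ins for the trainable parameters.
W_general = rng.normal(size=(units, units))
W_concat = rng.normal(size=(units, 2 * units))
v_concat = rng.normal(size=(units,))

def score_dot(h_t, h_s):
    # score(h_t, h_s) = h_t^T h_s
    return h_s @ h_t

def score_general(h_t, h_s, W):
    # score(h_t, h_s) = h_t^T W h_s
    return h_s @ (W.T @ h_t)

def score_concat(h_t, h_s, W, v):
    # score(h_t, h_s) = v^T tanh(W [h_t; h_s])
    tiled = np.repeat(h_t[None, :], h_s.shape[0], axis=0)
    concat = np.concatenate([tiled, h_s], axis=-1)  # (src_len, 2 * units)
    return np.tanh(concat @ W.T) @ v                # (src_len,)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for name, scores in [("dot", score_dot(h_t, h_s)),
                     ("general", score_general(h_t, h_s, W_general)),
                     ("concat", score_concat(h_t, h_s, W_concat, v_concat))]:
    print(name, softmax(scores))
```

Each function returns one score per encoder time step; the softmax then turns those scores into attention weights. With an identity matrix in place of `W_general`, the general score reduces to the dot score.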

In [42]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = CuDNNLSTM(self.dec_units, return_state=True, return_sequences=True, name="Decoder_LSTM")
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.W1 = tf.keras.layers.Dense(self.dec_units)
        self.W2 = tf.keras.layers.Dense(self.dec_units)
        self.V = tf.keras.layers.Dense(1)
        
    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        
        # General scoring method: dot product of each encoder state with W2(decoder hidden state)
        # new_scoring shape == (batch_size, max_length, 1)
        new_scoring = tf.einsum('bnm,bkm->bnk', enc_output, self.W2(hidden_with_time_axis))
        
        # Softmax over the time axis so that the attention weights sum to 1
        attention_weights = tf.nn.softmax(new_scoring, axis=1)
        context_vector = attention_weights * enc_output
        context_vector = tf.reduce_sum(context_vector, axis=1)
        x = self.embedding(x)
        
        # Concatenate the context vector with the embedded decoder input (the previous target word).
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state,_= self.lstm(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        # The dense layer produces logits over the target vocabulary
        x = self.fc(output)
        
        return x, state, attention_weights
        
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.dec_units))

In [43]:
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

In [44]:
optimizer = tf.train.AdamOptimizer()


def loss_function(real, pred):
    # mask is 0 where the target is padding (index 0) and 1 elsewhere
    mask = 1 - np.equal(real, 0)
    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask
    return tf.reduce_mean(loss_)
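
The masking idea in `loss_function` can be checked by hand with a small NumPy sketch. `masked_cross_entropy` is a hypothetical re-implementation written for illustration (it is not the TensorFlow op used above), and the labels and logits are toy values. Like `loss_function`, it averages over the whole batch, padded positions included.

```python
import numpy as np

def masked_cross_entropy(labels, logits, pad_id=0):
    # Log-softmax over the vocabulary axis, computed stably.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the correct word for each example.
    nll = -log_probs[np.arange(len(labels)), labels]
    # Zero out the loss wherever the target is padding.
    mask = (labels != pad_id).astype(nll.dtype)
    # Mean over the whole batch, padded positions included.
    return (nll * mask).mean()

labels = np.array([2, 0, 1])   # the second example is padding
logits = np.zeros((3, 4))      # uniform prediction over a 4-word vocabulary
loss = masked_cross_entropy(labels, logits)
print(loss)
```

With uniform logits each unmasked example costs ln(4), so the result is 2 * ln(4) / 3: the padded example adds nothing to the sum but still counts in the denominator.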

In [45]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [46]:
EPOCHS = 8

for epoch in range(EPOCHS):
    start = time.time()
    
    hidden = encoder.initialize_hidden_state()
    total_loss = 0
    
    for (batch, (inp, targ)) in enumerate(dataset):
        loss = 0
        
        with tf.GradientTape() as tape:
            enc_output, enc_hidden = encoder(inp, hidden)
            
            dec_hidden = enc_hidden
            
            dec_input = tf.expand_dims([targ_lang.word2idx['<start>']] * BATCH_SIZE, 1)       
            
            # Teacher forcing - feeding the target as the next input
            for t in range(1, targ.shape[1]):
                # passing enc_output to the decoder
                predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
                
                loss += loss_function(targ[:, t], predictions)
                
                # Teacher forcing: the ground-truth token becomes the next decoder input.
                dec_input = tf.expand_dims(targ[:, t], 1)
        batch_loss = (loss / int(targ.shape[1]))
        
        total_loss += batch_loss
        
        variables = encoder.variables + decoder.variables
        
        gradients = tape.gradient(loss, variables)
        
        optimizer.apply_gradients(zip(gradients, variables))
        
        if batch%500==0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every epoch
    checkpoint.save(file_prefix = checkpoint_prefix)
    
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / N_BATCH))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 1.3853
Epoch 1 Batch 500 Loss 0.7859
Epoch 1 Batch 1000 Loss 0.6754
Epoch 1 Batch 1500 Loss 0.8138
Epoch 1 Batch 2000 Loss 0.6890
Epoch 1 Batch 2500 Loss 0.6978
Epoch 1 Batch 3000 Loss 0.5077
Epoch 1 Batch 3500 Loss 0.6819
Epoch 1 Batch 4000 Loss 0.5178
Epoch 1 Batch 4500 Loss 0.4762
Epoch 1 Batch 5000 Loss 0.5440
Epoch 1 Batch 5500 Loss 0.5431
Epoch 1 Batch 6000 Loss 0.4669
Epoch 1 Batch 6500 Loss 0.4762
Epoch 1 Batch 7000 Loss 0.4787
Epoch 1 Batch 7500 Loss 0.4245
Epoch 1 Batch 8000 Loss 0.4659
Epoch 1 Batch 8500 Loss 0.5099
Epoch 1 Batch 9000 Loss 0.4846
Epoch 1 Batch 9500 Loss 0.4509
Epoch 1 Batch 10000 Loss 0.4363
Epoch 1 Batch 10500 Loss 0.4757
Epoch 1 Batch 11000 Loss 0.3527
Epoch 1 Batch 11500 Loss 0.4643
Epoch 1 Batch 12000 Loss 0.4128
Epoch 1 Batch 12500 Loss 0.4944
Epoch 1 Batch 13000 Loss 0.5252
Epoch 1 Batch 13500 Loss 0.4846
Epoch 1 Batch 14000 Loss 0.3975
Epoch 1 Batch 14500 Loss 0.4267
Epoch 1 Batch 15000 Loss 0.3565
Epoch 1 Batch 15500 Loss 0.4216


KeyboardInterrupt: 

In [47]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x17bccc30808>

## Inference

<pre>
For each sentence we translate from English to Italian, we plot the attention weights of each Italian word against all the English words. By looking at the attention weights we can visualize how well the model is performing.

We use the BLEU score as the metric to test our model. BLEU is a modified form of precision that corrects for how badly plain precision can overrate machine translations.

Let's take the example below:
<img src="./Attention_5.png" style="width: 400px;">

1) In the above image, the candidate is the machine-translated sentence and the references are the reference translations. All seven words in the candidate appear in the references, so the plain precision is p = 7/7 = 1, even though the candidate is a poor translation.
2) BLEU rectifies this problem. For each word in the candidate translation, the algorithm takes its maximum count, m_max, in any single reference translation and clips the candidate's count of that word to m_max. In the example above, the word "the" appears twice in reference 1 and once in reference 2, so m_max = 2.
3) BLEU (candidate, references) = 2/7
4) The definition above is the vanilla BLEU score. We use nltk.translate.bleu, which compares n-grams between the references and the candidate; if there is no n-gram overlap for any order of n-grams, BLEU returns the value 0.
5) To avoid this harsh behaviour when no n-gram overlaps are found, we use a smoothing function.
</pre>
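
The clipping step can be reproduced with a short pure-Python sketch. The sentences below are an assumption: the figure is not reproduced in the text, so the snippet uses a candidate of seven repetitions of "the" and two references that match the counts described above ("the" twice in reference 1, once in reference 2). `modified_unigram_precision` is a hypothetical helper written for this illustration, not the NLTK implementation.

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    cand_counts = Counter(candidate)
    clipped = 0
    for word, count in cand_counts.items():
        # m_max: the largest count of this word in any single reference.
        m_max = max(ref.count(word) for ref in references)
        clipped += min(count, m_max)
    return clipped / len(candidate)

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

# Plain precision: fraction of candidate words that occur in any reference.
plain_precision = sum(any(w in ref for ref in references) for w in candidate) / len(candidate)
modified = modified_unigram_precision(candidate, references)

print(plain_precision)
print(modified)
```

Plain precision gives this degenerate candidate a perfect score, while the clipped count credits only two of the seven words.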

In [48]:
import plotly
import chart_studio.plotly as py
from plotly.offline import init_notebook_mode, iplot
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go

In [49]:
def evaluate(inputs, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    sentence = ''
    for i in inputs[0]:
        if i == 0:
            break
        sentence = sentence + inp_lang.idx2word[i] + ' '
    sentence = sentence[:-1]
    
    inputs = tf.convert_to_tensor(inputs)
    
    result = ''

    hidden = [tf.zeros((1, units))]
    # Passing the whole input sequence to Encoder and getting all the hidden states
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word2idx['<start>']], 0)
    
    # Predicting 1 word per iteration.
    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)
        
        # storing the attention weights to visualize how much each output is dependent on every input word.
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.idx2word[predicted_id] + ' '

        if targ_lang.idx2word[predicted_id] == '<end>':
            return result, sentence, attention_plot
        
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot

In [56]:
import nltk
from tqdm import tqdm_notebook as tqdm
from nltk.translate.bleu_score import SmoothingFunction
from nltk.translate import bleu
import random
def predict_random_val_sentence(num,is_viz):
    if num > len(input_tensor_test):
        print('test length exceeded')
        return
    bleu_score=0
    smoothie = SmoothingFunction().method4
    for i in tqdm(range(num)):
        random_input = input_tensor_test[i]
        random_output = target_tensor_test[i]
        random_input = np.expand_dims(random_input,0)
        result, sentence, attention_plot = evaluate(random_input, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ)
        actual_sent = ''
        for idx in random_output:
            if idx == 0:
                break
            actual_sent = actual_sent + targ_lang.idx2word[idx] + ' '
        actual_sent = actual_sent[8:-7]
        hypothesis =  result[:-6].split(" ")
        reference = actual_sent.split(" ")
        bleu_score += bleu([reference], hypothesis,smoothing_function=smoothie)
        if is_viz==True:
            # 8, -6 because of the <start> and <end> tags
            print('Input sentence: {}'.format(sentence[8:-6]))
            print('Translated sentnce: {}'.format(result[:-6]))
            print('Actual translation: {}'.format(actual_sent))
            attention_plot = attention_plot[:len(result.split(' '))-2, 1:len(sentence.split(' '))-1]
            sentence, result = sentence.split(' '), result.split(' ')
            sentence = sentence[1:-1]
            result = result[:-2]

            trace = go.Heatmap(z = attention_plot, x = sentence, y = result, colorscale='Reds')
            data=[trace]
            iplot(data)
    
    print("Bleu score on the text corpus is "+str(bleu_score/num))

In [57]:
predict_random_val_sentence(20,True)

Input sentence: were not guilty
Translated sentnce: Non siamo più 
Actual translation: Non siamo colpevoli


Input sentence: they must die
Translated sentnce: Dobbiamo essere 
Actual translation: Loro devono morire


Input sentence: tom is puzzled
Translated sentnce: Tom è spietato 
Actual translation: Tom è confuso


Input sentence: i could tell from his accent that he was a frenchman
Translated sentnce: Io potrei essere in prestito il suo aiuto 
Actual translation: Potevo dire dal suo accento che lui era un francese


Input sentence: i saw him enter the house
Translated sentnce: Io non mi sento di andare a Boston 
Actual translation: Lho visto entrare in casa


Input sentence: you let me down tom
Translated sentnce: Mi ha messo a Tom 
Actual translation: Mi hai deluso Tom


Input sentence: i wont help you
Translated sentnce: Non sarò voluto 
Actual translation: Io non laiuterò


Input sentence: i used to live close to tom
Translated sentnce: Io apprezzo la porta domani 
Actual translation: Io abitavo vicino a Tom


Input sentence: i dont want to walk
Translated sentnce: Non voglio andare 
Actual translation: Non voglio camminare


Input sentence: tom isnt studying french anymore
Translated sentnce: Tom non è in francese 
Actual translation: Tom non sta più studiando il francese


Input sentence: id buy one of those
Translated sentnce: Io ti ha fatto un po 
Actual translation: Comprerei uno di quelli


Input sentence: where are you studying french
Translated sentnce: Dove stai facendo il francese 
Actual translation: Dove state studiando francese


Input sentence: im canadian
Translated sentnce: Sono più 
Actual translation: Io sono canadese


Input sentence: i want to be here
Translated sentnce: Voglio essere qui 
Actual translation: Voglio essere qui


Input sentence: have you seen that movie yet
Translated sentnce: Lei ha visto qualcuno che è successo 
Actual translation: Avete già visto quel film


Input sentence: i can probably do that
Translated sentnce: Mi chiedo che voi 
Actual translation: Probabilmente lo riesco a fare


Input sentence: i go to boston once a month
Translated sentnce: Io ho dato a Boston un po di Tom 
Actual translation: Io vado a Boston una volta al mese


Input sentence: dont risk insulting your boss
Translated sentnce: Non mi ha giocato a scuola 
Actual translation: Non rischiare insultando il tuo capo


Input sentence: why is exercise important
Translated sentnce: Perché è il suo lavoro 
Actual translation: Perché lattività fisica è importante


Input sentence: i wish tom had done it
Translated sentnce: Io vorrei Tom sia successo 
Actual translation: Vorrei che Tom la avesse fatta



Bleu score on the text corpus is 0.16540752364626482


In [58]:
predict_random_val_sentence(len(input_tensor_test),False)

Bleu score on the text corpus is 0.1695228541679428
