<a href="https://colab.research.google.com/github/znawfar/NMT-with-Attention/blob/master/English_French_NMT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Machine Translation
Machine translation is the task of automatically converting text in a source language into text in another language. The input is a sequence of symbols in some language, and the program must convert it into a sequence of symbols in another language. Because human language is naturally ambiguous and flexible, there is no single best translation of a given sequence. This makes automatic machine translation difficult, perhaps one of the most difficult problems in artificial intelligence.

Machine translation can be treated as a sequence-to-sequence (Seq2Seq) learning problem. Recurrent neural networks (RNNs) are the incumbent technology for this problem. A typical RNN model for a Seq2Seq problem consists of an encoder and a decoder, two separate neural networks combined into a single larger network. Both the encoder and the decoder are typically LSTM or GRU models.

# Neural Machine Translation with Attention
To translate a sentence from one language into another, a human translator reads the sentence part by part and produces the translation piece by piece. A neural machine translation model with attention works similarly: to generate each part of the translation, the attention mechanism tells the model which parts of the source sentence it should attend to.

### Project
I implement encoder-decoder seq2seq models with attention. The encoder can be a bidirectional LSTM, a simple LSTM or a GRU, and the decoder can be an LSTM or a GRU. An argument selects the encoder type (the RNN model used in the encoder); it can be 'bidirectional', 'lstm' or 'gru'. When it is set to 'bidirectional', the model uses a bidirectional LSTM as the encoder and a simple LSTM as the decoder. When it is set to 'lstm', the encoder and decoder are both simple LSTMs, and for 'gru' they are both GRUs. This gives three different models.

To evaluate the models, I use the English-French dataset provided by http://www.manythings.org/anki/.

I run these models and save the results. The experiments show that the model with a Bidirectional LSTM as the encoder outperforms the rest.


### Google Colab
Training an NMT model on real-world translation data requires GPUs, and Google Colab provides them. Even on GPUs, training can take days. If the Colab runtime restarts during training, the trained model is lost and you have to start again from scratch. To avoid this, save model checkpoints to Google Drive and reload them the next time you start. This requires mounting your Google Drive, giving Google Colab permission to access it, and creating a directory in your Drive for this project.


The code below mounts Google Drive and creates the folder 'NMT_with_attention' for this project in your Google Drive. 

In [1]:
from google.colab import drive
drive.mount('/content/drive')
working_dir = './drive/MyDrive/NMT_with_attention'
import os
if not os.path.exists(working_dir):
      os.makedirs(working_dir)

Mounted at /content/drive
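
Once the Drive is mounted, resuming after a runtime restart mostly comes down to finding the newest checkpoint in the project folder. The following is only a sketch of that lookup. It assumes checkpoints are stored as files named `epoch_<N>.h5`; both this naming scheme and the helper `latest_checkpoint` are hypothetical and not part of this notebook.

```python
import os
import re
import tempfile


def latest_checkpoint(checkpoint_dir):
    """Return (epoch, path) of the newest 'epoch_<N>.h5' file, or (0, None)."""
    # Hypothetical naming scheme: one file per epoch, e.g. 'epoch_7.h5'.
    pattern = re.compile(r"^epoch_(\d+)\.h5$")
    best_epoch, best_path = 0, None
    if not os.path.isdir(checkpoint_dir):
        return best_epoch, best_path
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_epoch, best_path


# Demonstration in a temporary directory standing in for the Drive folder.
demo_dir = tempfile.mkdtemp()
for epoch in (1, 2, 10):
    open(os.path.join(demo_dir, "epoch_{}.h5".format(epoch)), "w").close()

start_epoch, path = latest_checkpoint(demo_dir)
print(start_epoch, os.path.basename(path))
```

Comparing the epoch numbers as integers matters here: a plain string sort would rank `epoch_2.h5` above `epoch_10.h5`.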


Import basic libraries.

In [2]:
import os
import pickle
import re
from collections import Counter, defaultdict
from unicodedata import normalize

import numpy as np
import tensorflow as tf
import keras.backend as K
from keras.layers import (Activation, Bidirectional, Concatenate, Dense, Dot, Embedding,
                          GRU, Input, Lambda, LSTM, Multiply, Permute, RepeatVector)
from keras.models import Model, load_model
from keras.optimizers import Adam
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

## Download The **Dataset** 

In [2]:
!mkdir ./drive/MyDrive/NMT_with_attention/data
!wget -O ./drive/MyDrive/NMT_with_attention/fra-eng.zip http://www.manythings.org/anki/fra-eng.zip
!unzip  './drive/MyDrive/NMT_with_attention/fra-eng.zip' -d './drive/MyDrive/NMT_with_attention/data'
!rm ./drive/MyDrive/NMT_with_attention/fra-eng.zip 

--2021-04-15 21:50:14--  http://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 172.67.173.198, 104.21.55.222, 2606:4700:3036::ac43:adc6, ...
Connecting to www.manythings.org (www.manythings.org)|172.67.173.198|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6281268 (6.0M) [application/zip]
Saving to: ‘./drive/MyDrive/NMT_with_attention/fra-eng.zip’


2021-04-15 21:50:15 (5.32 MB/s) - ‘./drive/MyDrive/NMT_with_attention/fra-eng.zip’ saved [6281268/6281268]

Archive:  ./drive/MyDrive/NMT_with_attention/fra-eng.zip
  inflating: ./drive/MyDrive/NMT_with_attention/data/_about.txt  
  inflating: ./drive/MyDrive/NMT_with_attention/data/fra.txt  


The archive contains the dataset file used below (alongside `_about.txt`):
* fra.txt


## Download the GloVe Word Embeddings
I use the pre-trained GloVe word vectors from [*nlp.stanford.edu*](https://nlp.stanford.edu/data/glove.6B.zip) to initialize the encoder's embedding layer.

In [4]:
!mkdir ./drive/MyDrive/NMT_with_attention/embedding
!wget -O ./drive/MyDrive/NMT_with_attention/glove.6B.zip https://nlp.stanford.edu/data/glove.6B.zip
!unzip  './drive/MyDrive/NMT_with_attention/glove.6B.zip' -d './drive/MyDrive/NMT_with_attention/embedding'
!rm ./drive/MyDrive/NMT_with_attention/glove.6B.zip

--2021-04-15 21:59:03--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-04-15 21:59:04--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘./drive/MyDrive/NMT_with_attention/glove.6B.zip’


2021-04-15 22:01:48 (5.04 MB/s) - ‘./drive/MyDrive/NMT_with_attention/glove.6B.zip’ saved [862182613/862182613]

Archive:  ./drive/MyDrive/NMT_with_attention/glove.6B.zip
  inflating: ./drive/MyDrive/NMT_with_attention/embedding/glove.

Set basic parameters.

In [3]:
import easydict
args = easydict.EasyDict({
    'rnn_arch': 'bidirectional',   # one of 'gru', 'lstm', 'bidirectional'
    'epochs': 20,
    'embedding_dim': 200,
    'hidden': 1024,
    'checkpoint': 'checkpoint',
    'dataset_file': 'fra.txt',
    'result': 'result',
    'glove': True,
    'data_dir': 'data',
    'batch_size': 128,
    'lr': 0.1,
    'momentum': 0.9,
    'weight_decay': 5e-4,
    'embedding_dir': 'embedding',
})


rnn_arch = ['gru', 'lstm', 'bidirectional']
embed_dim = ['50','100','200','300']
data_dir = os.path.join(working_dir, args.data_dir )
checkpoint_dir = os.path.join(working_dir, args.checkpoint +"_"+args.rnn_arch )
if not os.path.exists(checkpoint_dir):
      os.makedirs(checkpoint_dir)
result_dir = os.path.join(working_dir, args.result+"_"+args.rnn_arch )
if not os.path.exists(result_dir):
      os.makedirs(result_dir)
embedding_dir = os.path.join(working_dir,args.embedding_dir)

embedding = 'glove.6B.'+str(args.embedding_dim)+'d.txt'

## Data Preprocessing

The data needs some cleaning before it can be used to train the translation model:
1. Normalize case to lowercase.
2. Separate punctuation (`?`, `.`, `!`, `,`) from the words it is attached to.
3. Remove non-printable characters.
4. Convert accented French characters to their ASCII equivalents.
5. Replace the remaining non-alphabetic characters with spaces.
6. Add a special token `<eos>` at the end of target sentences.
7. Create two dictionaries: one mapping each vocabulary word to an id, the other mapping each id back to its word.
8. Mark all out-of-vocabulary (OOV) words with a special token `<unk>`.
9. Pad each sentence to the maximum length by appending the special token `<pad>`.
10. Convert each sentence to its feature vector by mapping each token to a one-hot encoding based on its id.
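
The cleaning and encoding steps listed above can be tried on a single sentence before running them on the whole dataset. The snippet below is a minimal, self-contained sketch: the helper names `clean` and `encode` and the tiny `word_idx` dictionary are made up for illustration, and the real functions in the next cell differ in detail.

```python
import re
from unicodedata import normalize


def clean(sentence):
    # Lowercase, decompose accented letters, then drop the non-ASCII accent marks.
    s = normalize("NFD", sentence.lower().strip())
    s = s.encode("ascii", "ignore").decode("ascii")
    # Put spaces around punctuation, then replace every other symbol by a space.
    s = re.sub(r"([?.!,])", r" \1 ", s)
    s = re.sub(r"[^a-z?.!,]+", " ", s)
    return s.strip()


def encode(sentence, word_idx, max_len):
    # Unknown words map to <unk>; the tail is filled with <pad> (id 0).
    tokens = [t if t in word_idx else "<unk>" for t in sentence.split()]
    tokens += ["<pad>"] * (max_len - len(tokens))
    return [word_idx[t] for t in tokens]


cleaned = clean("Où est-il allé ?")
# Illustrative vocabulary; 'alle' is deliberately missing to show the <unk> case.
word_idx = {"<pad>": 0, "?": 1, "est": 2, "il": 3, "ou": 4, "<unk>": 5}
ids = encode(cleaned, word_idx, max_len=7)
print(cleaned)  # ou est il alle ?
print(ids)      # [4, 2, 3, 5, 1, 0, 0]
```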

In [4]:
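def load_data(file):
    # Each line holds tab-separated fields; the first two are the English sentence
    # and its French translation. Any further fields are ignored.
    with open(file, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')
    sentence_pairs = []
    for line in lines:
        if '\t' not in line:
            continue

        s1, s2 = line.rstrip().split('\t')[:2]
        sentence_pairs.append([s1, s2])
    return sentence_pairs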

def filter(sentence_pairs, Tx, Ty):
  # Keep only pairs whose source has at most Tx tokens and whose target has at most Ty tokens.
  # Note: this function shadows Python's built-in `filter`.
  return [pair for pair in sentence_pairs
          if len(pair[0].split()) <= Tx and len(pair[1].split()) <= Ty]


def unicode_to_ascii(s):
    s = normalize('NFD', s).encode('ascii', 'ignore')
    return s.decode('UTF-8')


def clean_sentence(sentence):
    sentence = unicode_to_ascii(sentence.lower().strip())

    # creating a space between a word and the punctuation following it. Ex: "he is a boy." => "he is a boy ."
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)

    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)

    sentence = sentence.rstrip().strip()
    return sentence


class LanguageVocab:
    def __init__(self, sentences):
        self.vocab = self.make_vocab(sentences)
        self.vocab.update({'<eos>', '<sos>'})
        self.word_idx = self.word_index()
        self.idx_word = self.reverse_word_index()

    def make_vocab(self, sentences, min_occurance=3):
        token_count = Counter()
        for sentence in sentences:
            tokens = sentence.split()
            token_count.update(tokens)
        print("total vocab-before triming:", len(token_count))
        vocab = [k for k, c in token_count.items() if c >= min_occurance]
        print("total vocab-after triming:", len(vocab))
        return set(vocab)

    def word_index(self):
        vocab = sorted(self.vocab)
        return dict(zip(['<pad>'] + vocab + ['<unk>'], list(range(len(vocab) + 2))))

    def reverse_word_index(self):
        return {v: k for k, v in self.word_idx.items()}


def max_length(sentences):
    lengths = [len(s.split()) for s in sentences]
    return max(lengths)


def features(sentence, language_vocab, max_length):
    tokens = sentence.split()

    tokens = [token if token in language_vocab.vocab else '<unk>' for token in tokens]

    tokens.extend(['<pad>'] * (max_length - len(tokens)))
    rep = list(map(lambda x: language_vocab.word_idx[x], tokens))
    return rep
def preprocess_sentences(sentences):

  language_vocab = LanguageVocab(sentences)
  lang_max_length = max_length(sentences)
  X = np.array([features(s,language_vocab, lang_max_length) for s in sentences])
  return X, language_vocab, lang_max_length

def save_pairs_dict(sentence_pairs):
  # Group all reference translations by their source sentence.
  inp_ref_dict = defaultdict(list)
  for s1,s2 in sentence_pairs:
    inp_ref_dict[s1].append(s2)
  return inp_ref_dict

def prepare_data(sentence_pairs, num_examples=0, Tx = 15, Ty=18):
    clean_sentence_pairs = [[clean_sentence(s1),clean_sentence(s2)]  for s1,s2 in sentence_pairs]
    clean_sentence_pairs = filter(clean_sentence_pairs, Tx, Ty)
    if num_examples > 0:
        clean_sentence_pairs = clean_sentence_pairs[0:num_examples]
    input_sentences = [s1 for s1, s2 in clean_sentence_pairs]
    target_sentences = [s2 for s1, s2 in clean_sentence_pairs]
    X, inp_vocab, inp_length = preprocess_sentences(input_sentences)
    Y, targ_vocab, targ_length = preprocess_sentences(target_sentences)
    return X, Y, inp_vocab, targ_vocab, inp_length, targ_length

In [5]:
sentence_pairs = load_data(os.path.join(data_dir, args.dataset_file))
X, Y, inp_vocab, targ_vocab, Tx, Ty = prepare_data(sentence_pairs)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
del X
del Y
inp_vocab_size = len(inp_vocab.word_idx)
targ_vocab_size = len(targ_vocab.word_idx)

total vocab-before triming: 14081
total vocab-after triming: 7885
total vocab-before triming: 23464
total vocab-after triming: 11641
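
The drop in vocabulary size reported above comes from discarding words that occur fewer than three times (the `min_occurance` default in `make_vocab`). A toy version of that trimming, using made-up sentences and leaving out the `<sos>`/`<eos>` tokens for brevity, might look like this:

```python
from collections import Counter

# Made-up sentences; the notebook uses the cleaned dataset instead.
sentences = ["i am here", "i am cold", "i am tired", "you are here"]
counts = Counter(token for s in sentences for token in s.split())

MIN_COUNT = 3  # same threshold as the default in make_vocab
vocab = sorted(w for w, c in counts.items() if c >= MIN_COUNT)

# Id 0 is reserved for <pad>; <unk> takes the last id, as in LanguageVocab.word_index.
word_idx = {w: i for i, w in enumerate(["<pad>"] + vocab + ["<unk>"])}

print(len(counts), len(vocab))  # 7 2
print(word_idx)                 # {'<pad>': 0, 'am': 1, 'i': 2, '<unk>': 3}
```

Every word below the threshold is later encoded as `<unk>`, which keeps the output softmax smaller at the cost of losing rare words.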


# Define Model 
Here are some properties of the model that you may notice: 
- There are two separate RNNs in this model, one on each side of the attention mechanism: the pre-attention RNN is the encoder and the post-attention RNN is the decoder.
- Encoder:
     - The encoder goes through $T_x$ time steps.
     - The output sequence (hidden states) of the encoder is the input to the attention mechanism.

- Decoder:
     - The decoder goes through $T_y$ time steps.
- The attention mechanism computes the context vector $context^{\langle t \rangle}$ for each time step of the output ($t=1, \ldots, T_y$).

#### one_step_attention
* The inputs to `one_step_attention` at time step $t$ are:
    - $[a^{\langle 1 \rangle}, a^{\langle 2 \rangle}, \ldots, a^{\langle T_x \rangle}]$: all hidden states of the pre-attention Bi-LSTM.
    - $s^{\langle t-1 \rangle}$: the previous hidden state of the post-attention LSTM.
* `one_step_attention` computes:
    - $[\alpha^{\langle t,1 \rangle}, \alpha^{\langle t,2 \rangle}, \ldots, \alpha^{\langle t,T_x \rangle}]$: the attention weights.
    - $context^{\langle t \rangle}$: the context vector:

$$context^{\langle t \rangle} = \sum_{t' = 1}^{T_x} \alpha^{\langle t,t' \rangle} a^{\langle t' \rangle} \tag{1}$$
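
Equation (1) can be checked numerically on a toy example. The NumPy sketch below uses random numbers in place of the encoder states and attention scores; the sizes and variable names are chosen for illustration only and do not come from the model defined later.

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, n_a = 4, 3                       # toy sizes: 4 encoder steps, 3 hidden units
a = rng.normal(size=(Tx, n_a))       # stand-in for the encoder hidden states
energies = rng.normal(size=(Tx,))    # stand-in for the unnormalized attention scores

# A softmax over the Tx axis turns the scores into attention weights.
exp = np.exp(energies - energies.max())
alphas = exp / exp.sum()

# Equation (1): the context is the attention-weighted sum of the hidden states.
context = (alphas[:, None] * a).sum(axis=0)

print(alphas.sum())      # the weights sum to 1 (up to rounding)
print(context.shape)     # (3,)
```

The context vector has the same size as one encoder hidden state, regardless of the sentence length $T_x$.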

In [6]:
# setting
HIDDEN_UNITS = args.hidden
EMBEDDING_DIM = args.embedding_dim
encoder_units = HIDDEN_UNITS
decoder_units = HIDDEN_UNITS
print(HIDDEN_UNITS)
print(EMBEDDING_DIM)

1024
200


In [7]:
# load in word vectors in a dict
word_embedding = np.zeros((inp_vocab_size, EMBEDDING_DIM))
if args.glove:
    wordVec = {}

    print('Loading wordVec')
    with open(os.path.join(embedding_dir, embedding), encoding='utf-8') as f:
        for line in f:
            data = line.split()
            word = data[0]
            vec = np.asarray(data[1:], dtype='float32')
            wordVec[word] = vec

    print('Finished loading wordVec.')


    # create word embedding by fetching each word vector
    for tok, idx in inp_vocab.word_idx.items():
        if idx < inp_vocab_size:
            word_vector = wordVec.get(tok)
            if word_vector is not None:
                word_embedding[idx] = word_vector

Loading wordVec
Finished loading wordVec.


In [8]:
def lstm(units,return_sequences=False, return_state=False):
  # If a GPU is available, use CuDNNLSTM (a faster, GPU-only implementation of LSTM);
  # the code below selects it automatically.
  if tf.test.is_gpu_available():
    return tf.compat.v1.keras.layers.CuDNNLSTM(units,
                                    return_sequences=return_sequences,
                                    return_state=return_state,
                                    recurrent_initializer='glorot_uniform')
  else:
    return LSTM(units, return_sequences=return_sequences, return_state=return_state,  recurrent_activation='sigmoid', recurrent_initializer='glorot_uniform')


def bidirectional(units,return_sequences=False, return_state=False):
    return  Bidirectional(lstm(units,return_sequences, return_state))


def gru(units, return_sequences=False, return_state=False):
  # If a GPU is available, use CuDNNGRU (roughly a 3x speedup over GRU);
  # the code below selects it automatically.
  if tf.test.is_gpu_available():
    return tf.compat.v1.keras.layers.CuDNNGRU(units,
                                    return_sequences=return_sequences,
                                    return_state=return_state,
                                    recurrent_initializer='glorot_uniform')
  else:
    return tf.keras.layers.GRU(units,
                               return_sequences=return_sequences,
                               return_state=return_state,
                               recurrent_activation='sigmoid',
                               recurrent_initializer='glorot_uniform')


rnn_archs = {'lstm': lstm, 'gru':gru, 'bidirectional':bidirectional}

In [9]:
def softmax(x, axis=1):
    """Softmax activation function.
    # Arguments
        x : Tensor.
        axis: Integer, axis along which the softmax normalization is applied.
    # Returns
        Tensor, output of softmax transformation.
    # Raises
        ValueError: In case `dim(x) == 1`.
    """
    ndim = K.ndim(x)
    if ndim == 2:
        return K.softmax(x)
    elif ndim > 2:
        e = K.exp(x - K.max(x, axis=axis, keepdims=True))
        s = K.sum(e, axis=axis, keepdims=True)
        return e / s
    else:
        raise ValueError('Cannot apply softmax to a tensor that is 1D')
class AttentionLayer:
    def __init__(self, Tx):
        self.repeator = RepeatVector(Tx)
        self.concatenator = Concatenate(axis=-1)
        self.densor1 = Dense(1024, activation="tanh")
        self.densor2 = Dense(1)
        self.activator = Activation(softmax,
                               name='attention_weights')  # We are using a custom softmax(axis = 1) loaded in this notebook
        self.dotor = Dot(axes=1)

    def one_step_attention(self,a, s_prev):
        """
        Performs one step of attention: Outputs a context vector computed as a dot product of the attention weights
        "alphas" and the hidden states "a" of the Bi-LSTM.

        Arguments:
        a -- hidden state output of the Bi-LSTM, numpy-array of shape (m, Tx, 2*n_a)
        s_prev -- previous hidden state of the (post-attention) LSTM, numpy-array of shape (m, n_s)

        Returns:
        context -- context vector, input of the next (post-attention) LSTM cell
        """
        # Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a"
        s_prev = self.repeator(s_prev)
        # Use concatenator to concatenate a and s_prev on the last axis
        concat = self.concatenator([a, s_prev])
        e = self.densor1(concat)
        energies = self.densor2(e)
        alphas = self.activator(energies)
        context = self.dotor([alphas, a])

        return context

In [17]:



class NMT_Model:

    @staticmethod
    def stack(outputs):
        outputs = K.stack(outputs)
        return K.permute_dimensions(outputs, pattern=(1, 0, 2))

    def __init__(self, rnn_arch, Tx, Ty, encoder_units, decoder_units, embedding_dim, input_vocab_size,
                 target_vocab_size, word_embedding):
        # import pdb; pdb.set_trace()
        self.rnn_arch = rnn_arch
        self.decoder_units = decoder_units

        # attention
        self.attentionLayer = AttentionLayer(Tx)

        # encoder
        self.input = Input(shape=(Tx,))
        encoder_embedding = Embedding(input_vocab_size, embedding_dim, weights=[word_embedding], input_length=Tx)
        encoder = rnn_archs[rnn_arch](encoder_units, return_sequences=True)
        encoder_inp_embedded = encoder_embedding(self.input)
        self.encoder_out = encoder(encoder_inp_embedded)

        # decoder
        self.decoder_embedding = Embedding(target_vocab_size, embedding_dim)
        if rnn_arch == 'gru':
            self.decoder = gru(units=decoder_units, return_state=True)
        else:
            self.decoder = lstm(units=decoder_units, return_state=True)
        # self.decoder = rnn_archs[rnn_arch](units=decoder_units, return_state=True)
        self.dense_decode = Dense(target_vocab_size, activation='softmax')

        # concat
        self.concat2 = Concatenate(axis=2)

        self.decoder_state_0 = Input(shape=(decoder_units,))
        if rnn_arch != 'gru':
            self.decoder_cell_0 = Input(shape=(decoder_units,))
        self.train_model = self.get_train_model( Ty)
        print('train model was built')
        self.inference_model = self.get_inference_model(Ty)
        print('inference model was built')

    def get_train_model(self, Ty):
        decoder_inp = Input(shape=(Ty,))
        decoder_inp_embedded = self.decoder_embedding(decoder_inp)

        decoder_state = self.decoder_state_0
        if self.rnn_arch != 'gru':
            decoder_cell = self.decoder_cell_0

        # Iterate attention Ty times
        outputs = []
        for t in range(Ty):

            # Get context vector with encoder and attention
            context = self.attentionLayer.one_step_attention(self.encoder_out, decoder_state)

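            # For teacher forcing, get the previous word. Binding t as a default argument
            # freezes its value for this step; a plain closure would see the final t if the
            # function is invoked again after the loop has finished.
            select_layer = Lambda(lambda x, t=t: x[:, t:t + 1])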
            prevWord = select_layer(decoder_inp_embedded)

            # Concat context and previous word as decoder input

            decoder_in_concat = self.concat2([context, prevWord])

            # pass into decoder, inference output
            if self.rnn_arch == 'gru':
                pred, decoder_state = self.decoder(decoder_in_concat, initial_state=decoder_state)
            else:
                pred, decoder_state, decoder_cell = self.decoder(decoder_in_concat,
                                                                 initial_state=[decoder_state, decoder_cell])
            pred = self.dense_decode(pred)
            outputs.append(pred)

        stack_layer = Lambda(NMT_Model.stack)
        outputs = stack_layer(outputs)
        if self.rnn_arch == 'gru':
            return Model(inputs=[self.input, decoder_inp, self.decoder_state_0], outputs=outputs)
        else:
            return Model(inputs=[self.input, decoder_inp, self.decoder_state_0, self.decoder_cell_0], outputs=outputs)

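    # In the inference model teacher forcing is not available: the decoder receives one
    # token at a time. The argument is unused and kept only to match the call in __init__.
    def get_inference_model(self, Ty):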
        decoder_inp = Input(shape=(1,))

        decoder_state = self.decoder_state_0

        if self.rnn_arch != 'gru':
            decoder_cell = self.decoder_cell_0

        decoder_inp_embedded = self.decoder_embedding(decoder_inp)
        # Get context vector with encoder and attention
        context = self.attentionLayer.one_step_attention(self.encoder_out, decoder_state)

        # Concat context and previous word as decoder input

        decoder_in_concat = self.concat2([context, decoder_inp_embedded])

        # pass into decoder, inference output
        if self.rnn_arch == 'gru':
            pred, decoder_state = self.decoder(decoder_in_concat, initial_state=decoder_state)
        else:
            pred, decoder_state, decoder_cell = self.decoder(decoder_in_concat,
                                                             initial_state=[decoder_state, decoder_cell])

        pred = self.dense_decode(pred)

        if self.rnn_arch == 'gru':
            return Model(inputs=[self.input, decoder_inp, self.decoder_state_0], outputs=pred)
        else:
            return Model(inputs=[self.input, decoder_inp, self.decoder_state_0, self.decoder_cell_0], outputs=pred)

    def fit(self, enc_inp, decoder_inp, targ, batch_size = 64, verbose = 0):

        s_0 = np.zeros((len(enc_inp), self.decoder_units))
        if self.rnn_arch != 'gru':
            c_0 = np.zeros((len(enc_inp), self.decoder_units))
            history = self.train_model.fit([enc_inp, decoder_inp, s_0, c_0], targ, batch_size=batch_size, verbose=verbose)
        else:
            history = self.train_model.fit([enc_inp, decoder_inp, s_0], targ, batch_size=batch_size, verbose=verbose)
        # The inference model is built from the same layer objects as the training model,
        # so its weights stay in sync without an explicit copy.
        # Return the Keras History object so callers can read the loss and metrics.
        return history

    def compile(self, opt, loss, metrics):
        self.train_model.compile(optimizer=opt, loss=loss, metrics=metrics)
        self.inference_model.compile(optimizer=opt, loss=loss, metrics=metrics)
    
    def evaluate(self, enc_inp, decoder_inp, targ, batch_size = 64, verbose = 0):
        s_0 = np.zeros((len(enc_inp), self.decoder_units))
        if self.rnn_arch != 'gru':
            c_0 = np.zeros((len(enc_inp), self.decoder_units))
            return self.train_model.evaluate([enc_inp, decoder_inp, s_0, c_0], targ, batch_size=batch_size, verbose=verbose)

        else:

            return self.train_model.evaluate([enc_inp, decoder_inp, s_0], targ, batch_size=batch_size, verbose=verbose)

    def inference_evaluate(self, enc_inp, decoder_inp, targ, batch_size = 64 , verbose = 0):
        s_0 = np.zeros((len(enc_inp), self.decoder_units))
        if self.rnn_arch != 'gru':
            c_0 = np.zeros((len(enc_inp), self.decoder_units))

        Ty = targ.shape[1]
        loss = 0.0
        acc = 0.0
        preds = []

        for t in range(Ty):

            if self.rnn_arch == 'gru':
                pred = self.inference_model.predict([enc_inp, decoder_inp, s_0])
                print('prediction is done')

                loss_b, acc_b = self.inference_model.evaluate([enc_inp, decoder_inp, s_0], targ[:, t], batch_size=batch_size,
                                                              verbose=verbose)
            else:
                pred = self.inference_model.predict([enc_inp, decoder_inp, s_0, c_0])
                
                loss_b, acc_b = self.inference_model.evaluate([enc_inp, decoder_inp, s_0, c_0], targ[:, t],
                                                              batch_size=batch_size, verbose=verbose)
            if math.isnan(loss_b):
                # A NaN loss usually signals a numerical problem; report it instead of
                # dropping into the debugger.
                print('warning: NaN loss at decoding step', t)
       
            pred = np.argmax(pred, axis=-1)
            decoder_inp = np.expand_dims(pred, axis=1)
            preds.append(decoder_inp)
            loss += loss_b
            acc += acc_b
     

        return loss / Ty, acc / Ty, NMT_Model.stack(preds).numpy()

    def save(self, train_file, infer_file):
        self.train_model.save(train_file)
        self.inference_model.save(infer_file)

      



In [18]:
def myLoss(y_train, pred):
    
    mask = K.cast(y_train > 0, dtype='float32')
    mask2 = tf.greater(y_train, 0)
    non_zero_y = tf.boolean_mask(pred, mask2)
    val = K.log(non_zero_y)
    # import pdb; pdb.set_trace()
    return 0.0 if K.sum(mask)== 0 else -K.sum(val) / K.sum(mask)


def my_acc(y_train, pred):
    # import pdb; pdb.set_trace()
    targ = K.argmax(y_train, axis=-1)
    pred = K.argmax(pred, axis=-1)
    correct = K.cast(K.equal(targ, pred), dtype='float32')

    mask = K.cast(K.greater(targ, 0), dtype='float32')  # filter out padding value 0.
    correctCount = K.sum(mask * correct)
    totalCount = K.sum(mask)
    return 1.0 if totalCount==0 else correctCount / totalCount
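
The masking idea behind `my_acc` can be checked on plain integer ids with NumPy. This is a minimal sketch: `masked_accuracy` and its `pad_id` parameter are illustrative names rather than functions from this notebook, and it works on token ids instead of one-hot tensors.

```python
import numpy as np


def masked_accuracy(target_ids, predicted_ids, pad_id=0):
    """Fraction of non-padding positions where the prediction matches the target."""
    target_ids = np.asarray(target_ids)
    predicted_ids = np.asarray(predicted_ids)
    mask = target_ids != pad_id
    if mask.sum() == 0:
        return 1.0  # mirrors my_acc: with nothing to score, report full accuracy
    return float((target_ids[mask] == predicted_ids[mask]).mean())


target = [[5, 2, 9, 0, 0]]      # the last two positions are padding
predicted = [[5, 7, 9, 0, 3]]   # predictions at padded positions are ignored
print(masked_accuracy(target, predicted))
```

Two of the three real tokens match, so the score is about 0.67. Without the mask, the padded positions would also be counted and would distort the figure.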

In [19]:
import math


def make_batch(X, Y, shuffle=True, batch_size=64):
    idx = np.arange(len(X))
    if shuffle:
        np.random.shuffle(idx)
    dataset = []
    batchs = math.ceil(len(X) / batch_size)
    for b in range(batchs):
        s = b * batch_size
        e = min(s + batch_size, len(X))
        # Index through the (possibly shuffled) permutation so that shuffling takes effect.
        batch_idx = idx[s:e]
        dataset.append([b, X[batch_idx], Y[batch_idx]])
    return dataset


def save_result(loss,acc,loss_test,acc_test,epochs, dir):
  f ={}
  f['loss'] = loss
  f['acc'] = acc
  f['loss_test'] = loss_test
  f['acc_test'] = acc_test
  with open(os.path.join(dir,'result_'+str(epochs)+'.pkl'),'wb') as out_file:
    pickle.dump(f, out_file)


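from keras.models import load_model as keras_load_model


def save_model(model, main_file, train_file, infer_file):
  f = {}
  f['model'] = model
  with open(main_file,'wb') as out_file:
    pickle.dump(f, out_file)
  model.save(train_file, infer_file)


def load_model(main_file, train_file, infer_file):
  with open(main_file, 'rb') as pkl_file:
    f = pickle.load(pkl_file)
  model = f['model']
  # Use the aliased Keras loader: calling load_model here would recurse into this function.
  # Depending on the Keras version, the custom loss, metric and softmax may have to be
  # passed through the custom_objects argument.
  model.train_model = keras_load_model(train_file)
  model.inference_model  = keras_load_model(infer_file)
  return model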


In [40]:
 
def evaluate(model, dataset, batch_size =64, verbose = 0):
  loss, acc, data_count =0.0, 0.0, 0
  for batch, inp, targ in dataset:
  
    data_count += len(inp)

    decoder_inp = np.zeros((len(targ), Ty))
    decoder_inp[:, 1:] = targ[:, :-1]
    decoder_inp[:, 0] = targ_vocab.word_idx['<sos>']
    targ_one_hot = np.zeros((len(targ), Ty, targ_vocab_size), dtype='float32')
    for idx, tokVec in enumerate(targ):
      
        for tok_idx, tok in enumerate(tokVec):
          if (tok > 0):
            targ_one_hot[idx, tok_idx, tok] = 1
                
    
    loss_b, acc_b = model.evaluate(inp, decoder_inp, targ_one_hot, batch_size=batch_size, verbose=verbose)
    loss += loss_b* len(inp) 
    acc += acc_b * len(inp)
  return loss/ data_count, acc/ data_count
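
The decoder input built inside `evaluate` is the target sequence shifted one step to the right with `<sos>` in front, which is what teacher forcing needs. A small NumPy illustration, where the id `1` for `<sos>` is made up (the notebook looks the real id up in `targ_vocab.word_idx`):

```python
import numpy as np

SOS_ID = 1  # hypothetical id for <sos>, for illustration only
targ = np.array([[7, 4, 9, 0],
                 [3, 8, 0, 0]])      # two padded target sentences (0 is <pad>)

decoder_inp = np.zeros_like(targ)
decoder_inp[:, 1:] = targ[:, :-1]   # shift every target one step to the right
decoder_inp[:, 0] = SOS_ID          # the first decoder input is always <sos>

print(decoder_inp)
# [[1 7 4 9]
#  [1 3 8 0]]
```

At step `t` the decoder therefore sees the correct word from step `t-1` and is trained to predict the word at step `t`.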



In [21]:
test_dataset = make_batch(X_test, Y_test, shuffle= False, batch_size= args.batch_size )

In [44]:
model = NMT_Model(args.rnn_arch, Tx, Ty, encoder_units, decoder_units, EMBEDDING_DIM, inp_vocab_size, targ_vocab_size, word_embedding)


model.compile(opt='adam', loss=myLoss, metrics=[my_acc])


EPOCHS = args.epochs 
import time
loss_t, acc_t = [], []
loss_e = []
acc_e = []
best_acc = 0
best_model = None

train model was built
inference model was built


In [None]:

for epoch in range(EPOCHS):
    start = time.time()
    print("epoch:", epoch+1)
    loss, acc, data_count = 0.0, 0.0, 0
    dataset = make_batch(X_train, Y_train, batch_size=args.batch_size)
    for batch, inp, targ in dataset:
        
        data_count += len(inp)

        

        decoder_inp = np.zeros((len(targ), Ty))
        decoder_inp[:, 1:] = targ[:, :-1]
        decoder_inp[:, 0] = targ_vocab.word_idx['<sos>']
        targ_one_hot = np.zeros((len(targ), Ty, targ_vocab_size), dtype='float32')
        for idx, tokVec in enumerate(targ):
          
            for tok_idx, tok in enumerate(tokVec):
              if (tok > 0):
                targ_one_hot[idx, tok_idx, tok] = 1
                    
       
        history = model.fit(inp, decoder_inp, targ_one_hot, batch_size=args.batch_size, verbose=0)
      
        loss_b, acc_b =  history.history['loss'][0], history.history['my_acc'][0]
        
        loss += (loss_b * len(inp))
        acc += (acc_b * len(inp))
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1,
                                                                          batch,
                                                                          loss / data_count, acc / data_count))
    
    loss, acc= evaluate(model, dataset,batch_size=args.batch_size)
    loss_test, acc_test = evaluate(model, test_dataset, batch_size=args.batch_size)
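    if acc_test > best_acc:
      # Track the best test accuracy so that only improved models are checkpointed.
      best_acc = acc_test
      # The checkpoint file names below are illustrative.
      save_model(model,
                 os.path.join(checkpoint_dir, 'model.pkl'),
                 os.path.join(checkpoint_dir, 'train_model.h5'),
                 os.path.join(checkpoint_dir, 'inference_model.h5'))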
     
      
    loss_t.append(loss_test)
    acc_t.append(acc_test)
    loss_e.append(loss)
    acc_e.append(acc)
    save_result(loss_e, acc_e,loss_t,acc_t,EPOCHS, result_dir)
    print('Epoch {}  Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1, loss, acc))
    print('Epoch {}  Loss on test {:.4f} Accuracy on test {:.4f}'.format(epoch + 1, loss_test, acc_test))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))


epoch: 1
Epoch 1 Batch 0 Loss 9.3639 Accuracy 0.0000
Epoch 1 Batch 100 Loss 6.0147 Accuracy 0.1169
Epoch 1 Batch 200 Loss 5.8025 Accuracy 0.1337
Epoch 1 Batch 300 Loss 5.6491 Accuracy 0.1495
Epoch 1 Batch 400 Loss 5.4951 Accuracy 0.1702
Epoch 1 Batch 500 Loss 5.3458 Accuracy 0.1874
Epoch 1 Batch 600 Loss 5.2021 Accuracy 0.2038
Epoch 1 Batch 700 Loss 5.0573 Accuracy 0.2194
Epoch 1 Batch 800 Loss 4.9165 Accuracy 0.2341
Epoch 1 Batch 900 Loss 4.7817 Accuracy 0.2480
Epoch 1 Batch 1000 Loss 4.6529 Accuracy 0.2610
Epoch 1 Batch 1100 Loss 4.5309 Accuracy 0.2734
Saved model to disk
Epoch 1  Loss 3.0641 Accuracy 0.4179
Epoch 1  Loss on test 3.1708 Accuracy on test 0.4115
Time taken for 1 epoch 570.528392791748 sec

epoch: 2
Epoch 2 Batch 0 Loss 3.4238 Accuracy 0.3790
Epoch 2 Batch 100 Loss 3.1165 Accuracy 0.4163
Epoch 2 Batch 200 Loss 3.0457 Accuracy 0.4227
Epoch 2 Batch 300 Loss 2.9824 Accuracy 0.4296
Epoch 2 Batch 400 Loss 2.9322 Accuracy 0.4340
Epoch 2 Batch 500 Loss 2.8811 Accuracy 0.4391
E