
##Neural Machine Translation With Attention Mechanism

Today, join me in building a neural machine translation model with an attention mechanism, using the much-talked-about TensorFlow 2.0.

With that being said, our objective is pretty simple: we will use a very small dataset (only 20 examples) and try to overfit the training data with the renowned Seq2Seq model. For the attention mechanism, we’re gonna use Luong attention, which I personally prefer over Bahdanau’s.

Without talking too much about theory today, let’s jump right into the implementation. As usual, we will go through the steps below:

* Data Preparation
* Seq2Seq without Attention
* Seq2Seq with Luong Attention


Reference:

[Neural Machine Translation With Attention Mechanism](https://blog.erico.vn/posts/neural-machine-translation-with-attention-mechanism)

##Setup

In [1]:
import tensorflow as tf
import numpy as np
import unicodedata
import re

##Data Preparation

Let’s talk about the data. We’re gonna use 20 English – French pairs (which I extracted from the original dataset).

In [2]:
raw_data = (
    ('What a ridiculous concept!', 'Quel concept ridicule !'),
    ('Your idea is not entirely crazy.', "Votre idée n'est pas complètement folle."),
    ("A man's worth lies in what he is.", "La valeur d'un homme réside dans ce qu'il est."),
    ('What he did is very wrong.', "Ce qu'il a fait est très mal."),
    ("All three of you need to do that.", "Vous avez besoin de faire cela, tous les trois."),
    ("Are you giving me another chance?", "Me donnez-vous une autre chance ?"),
    ("Both Tom and Mary work as models.", "Tom et Mary travaillent tous les deux comme mannequins."),
    ("Can I have a few minutes, please?", "Puis-je avoir quelques minutes, je vous prie ?"),
    ("Could you close the door, please?", "Pourriez-vous fermer la porte, s'il vous plaît ?"),
    ("Did you plant pumpkins this year?", "Cette année, avez-vous planté des citrouilles ?"),
    ("Do you ever study in the library?", "Est-ce que vous étudiez à la bibliothèque des fois ?"),
    ("Don't be deceived by appearances.", "Ne vous laissez pas abuser par les apparences."),
    ("Excuse me. Can you speak English?", "Je vous prie de m'excuser ! Savez-vous parler anglais ?"),
    ("Few people know the true meaning.", "Peu de gens savent ce que cela veut réellement dire."),
    ("Germany produced many scientists.", "L'Allemagne a produit beaucoup de scientifiques."),
    ("Guess whose birthday it is today.", "Devine de qui c'est l'anniversaire, aujourd'hui !"),
    ("He acted like he owned the place.", "Il s'est comporté comme s'il possédait l'endroit."),
    ("Honesty will pay in the long run.", "L'honnêteté paye à la longue."),
    ("How do we know this isn't a trap?", "Comment savez-vous qu'il ne s'agit pas d'un piège ?"),
    ("I can't believe you're giving up.", "Je n'arrive pas à croire que vous abandonniez."),
)

Next, we need to clean up the raw data a little bit. This kind of task usually involves normalizing strings, filtering out unwanted tokens, adding a space before punctuation, etc.

In [3]:
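def unicode_to_ascii(sent):
  # Decompose accented characters (NFD) and drop the combining marks (category "Mn")
  return "".join(char for char in unicodedata.normalize("NFD", sent) if unicodedata.category(char) != "Mn")

def normalize_string(sent):
  sent = unicode_to_ascii(sent)
  # Note: as written, this replaces each of . ! ? with itself, so no space is inserted;
  # a replacement of r" \1" would add a space before punctuation
  sent = re.sub(r"([!.?])", r"\1", sent)
  # Replace every run of characters other than letters and . ! ? with a single space
  sent = re.sub(r"[^a-zA-Z.!?]+", r" ", sent)
  # Collapse repeated whitespace
  sent = re.sub(r"\s+", r" ", sent)
  return sent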
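
The text above mentions adding a space before punctuation. As a rough sketch of what that step could look like, the hypothetical `normalize_with_spacing` below uses `r" \1"` as the replacement string and strips the result. It is only an illustrative variant (the name and the final `strip()` are my own assumptions); the outputs shown in the rest of this notebook come from `normalize_string` as defined above.

```python
import re
import unicodedata


def unicode_to_ascii(sent):
    # Decompose accented characters and drop the combining marks (category "Mn").
    return "".join(
        ch for ch in unicodedata.normalize("NFD", sent)
        if unicodedata.category(ch) != "Mn"
    )


def normalize_with_spacing(sent):
    # Hypothetical variant: note the leading space in the replacement r" \1".
    sent = unicode_to_ascii(sent)
    sent = re.sub(r"([!.?])", r" \1", sent)
    # Anything that is not a letter or one of . ! ? becomes a single space.
    sent = re.sub(r"[^a-zA-Z.!?]+", " ", sent)
    sent = re.sub(r"\s+", " ", sent)
    return sent.strip()


print(normalize_with_spacing("What a ridiculous concept!"))
print(normalize_with_spacing("Votre idée n'est pas complètement folle."))
```

With this variant, punctuation becomes its own whitespace-separated token, so a tokenizer that splits on spaces would see `concept` and `!` rather than `concept!`.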

We will now split the data into two separate lists, each containing its own sentences. 

Then we will apply the functions above and add two special tokens, `<start>` and `<end>`:

In [4]:
raw_data_en, raw_data_fr = list(zip(*raw_data))
raw_data_en, raw_data_fr = list(raw_data_en), list(raw_data_fr)

raw_data_en = [normalize_string(data) for data in raw_data_en]

raw_data_fr_in = ["<start> " + normalize_string(data) for data in raw_data_fr]
raw_data_fr_out = [normalize_string(data) + " <end>" for data in raw_data_fr]

I need to elaborate a little bit here. First off, let’s take a look at the figure below:

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/attention-and-transformers-mechanism/neural-machine-translation/images/input_roea0w.webp?raw=1' width='400'/>

The Seq2Seq model consists of two networks: an encoder and a decoder. The encoder, which is on the left-hand side, requires only sequences from the source language as inputs.

In [5]:
raw_data_en[0]

'What a ridiculous concept!'

The decoder, on the other hand, requires two versions of the target language’s sequences: one for inputs and one for targets (loss computation). The decoder itself is usually called a language model (we used it a lot for text generation, remember?).

In [6]:
raw_data_fr_in[0]

'<start> Quel concept ridicule !'

In [7]:
raw_data_fr_out[0]

'Quel concept ridicule ! <end>'
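
To make the relationship between the two decoder sequences concrete, here is a small sketch in plain Python. The helper `make_decoder_pair` is a hypothetical name introduced only for this illustration; it builds both versions from one sentence and shows that the target is simply the input shifted left by one token.

```python
def make_decoder_pair(sentence):
    # Hypothetical helper: build the decoder input and target from one sentence.
    tokens = sentence.split()
    decoder_input = ["<start>"] + tokens
    decoder_target = tokens + ["<end>"]
    return decoder_input, decoder_target


dec_in, dec_out = make_decoder_pair("Quel concept ridicule !")

# At time step t the decoder reads dec_in[t] and is trained to predict dec_out[t].
for step, (seen, expected) in enumerate(zip(dec_in, dec_out)):
    print(step, seen, "->", expected)
```

At every step the decoder is fed the token it should have produced at the previous step, and the loss compares its prediction with the next token.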

From personal experiments, I also found that it would be better not to add `<start>` and `<end>` tokens to source sequences. Doing so would confuse the model, especially the attention mechanism later on, since all sequences start with the same token.

Next, let’s see how to tokenize the data, i.e. convert the raw strings into integer sequences. 

We’re gonna use the text tokenization utility class from Keras:



In [8]:
en_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="")

By default, Keras’ `Tokenizer` strips out all punctuation, which is not what we want. Since we have already filtered out punctuation ourselves (except for `.!?`), we can just set `filters` to an empty string here.

The crucial part of tokenization is the vocabulary. Keras’ `Tokenizer` class comes with a few methods for building it. Since our data contains raw strings, we will use the one called `fit_on_texts`.

In [9]:
en_tokenizer.fit_on_texts(raw_data_en)

The tokenizer will create its own vocabulary as well as conversion dictionaries.

In [10]:
print(en_tokenizer.word_index)

{'you': 1, 'the': 2, 'a': 3, 'he': 4, 'what': 5, 'is': 6, 'in': 7, 'do': 8, 'can': 9, 't': 10, 'did': 11, 'giving': 12, 'i': 13, 'few': 14, 'please?': 15, 'this': 16, 'know': 17, 'ridiculous': 18, 'concept!': 19, 'your': 20, 'idea': 21, 'not': 22, 'entirely': 23, 'crazy.': 24, 'man': 25, 's': 26, 'worth': 27, 'lies': 28, 'is.': 29, 'very': 30, 'wrong.': 31, 'all': 32, 'three': 33, 'of': 34, 'need': 35, 'to': 36, 'that.': 37, 'are': 38, 'me': 39, 'another': 40, 'chance?': 41, 'both': 42, 'tom': 43, 'and': 44, 'mary': 45, 'work': 46, 'as': 47, 'models.': 48, 'have': 49, 'minutes': 50, 'could': 51, 'close': 52, 'door': 53, 'plant': 54, 'pumpkins': 55, 'year?': 56, 'ever': 57, 'study': 58, 'library?': 59, 'don': 60, 'be': 61, 'deceived': 62, 'by': 63, 'appearances.': 64, 'excuse': 65, 'me.': 66, 'speak': 67, 'english?': 68, 'people': 69, 'true': 70, 'meaning.': 71, 'germany': 72, 'produced': 73, 'many': 74, 'scientists.': 75, 'guess': 76, 'whose': 77, 'birthday': 78, 'it': 79, 'today.': 80

We can now have the raw English sentences converted to integer sequences:

In [11]:
data_en = en_tokenizer.texts_to_sequences(raw_data_en)
data_en[0]

[5, 3, 18, 19]
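
If you are curious what happens under the hood, the sketch below mimics the two steps in plain Python. It is a simplified stand-in and not Keras’ actual implementation: the function names are my own, and I am assuming lowercasing, whitespace splitting, ids that start at 1, and more frequent words getting smaller ids, which matches the `word_index` printed above.

```python
from collections import Counter


def build_word_index(texts):
    # Count lowercased, whitespace-separated tokens across all texts.
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = counts.most_common()
    # Ids start at 1, which leaves 0 free for padding.
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index):
    # Replace each known token with its integer id; unknown tokens are skipped.
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


corpus = ["What a ridiculous concept!", "What he did is very wrong."]
word_index = build_word_index(corpus)
sequences = texts_to_sequences(corpus, word_index)
print(word_index)
print(sequences)
```

In this toy corpus `what` appears twice, so it gets id 1, and the remaining words are numbered in the order they were first seen.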

Last but not least, we need to pad the sequences with zeros so that they all have the same length. Otherwise, we won’t be able to create a `tf.data.Dataset` object later on.

In [12]:
data_en = tf.keras.preprocessing.sequence.pad_sequences(data_en, padding="post")
data_en[0]

array([ 5,  3, 18, 19,  0,  0,  0,  0,  0], dtype=int32)
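
For intuition, post-padding can be sketched in a few lines of plain Python. The helper `pad_post` is a hypothetical name and only an illustration of the idea, not how `pad_sequences` is implemented. Using 0 as the pad value is safe here because the tokenizer never assigns id 0 to a word.

```python
def pad_post(sequences, value=0):
    # Pad every sequence on the right up to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


padded = pad_post([[5, 3, 18, 19], [20, 21, 6, 22, 23, 24]])
print(padded)
```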

Let’s check if everything is okay:

In [13]:
data_en[:5]

array([[ 5,  3, 18, 19,  0,  0,  0,  0,  0],
       [20, 21,  6, 22, 23, 24,  0,  0,  0],
       [ 3, 25, 26, 27, 28,  7,  5,  4, 29],
       [ 5,  4, 11,  6, 30, 31,  0,  0,  0],
       [32, 33, 34,  1, 35, 36,  8, 37,  0]], dtype=int32)

Everything is perfect. 

Let's go ahead and do exactly the same with French sentences:

In [14]:
fr_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="")

# build the vocabulary from both the decoder inputs and the targets
fr_tokenizer.fit_on_texts(raw_data_fr_in)
fr_tokenizer.fit_on_texts(raw_data_fr_out)

# convert the decoder inputs to integer sequences, then pad with zeros to a common length
data_fr_in = fr_tokenizer.texts_to_sequences(raw_data_fr_in)
data_fr_in = tf.keras.preprocessing.sequence.pad_sequences(data_fr_in, padding="post")

# do the same for target sequences
data_fr_out = fr_tokenizer.texts_to_sequences(raw_data_fr_out)
data_fr_out = tf.keras.preprocessing.sequence.pad_sequences(data_fr_out, padding="post")

data_fr_in[:2]

array([[ 2, 30, 31, 32, 15,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 2, 33, 34, 19,  6,  9, 35, 36,  0,  0,  0,  0,  0,  0]],
      dtype=int32)

In [15]:
data_fr_out[:2]

array([[30, 31, 32, 15,  3,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [33, 34, 19,  6,  9, 35, 36,  3,  0,  0,  0,  0,  0,  0]],
      dtype=int32)

One note along the way: we can call `fit_on_texts` multiple times on different corpora, and it will update the vocabulary automatically. Always remember to finish all `fit_on_texts` calls before using `texts_to_sequences`.

The last step is easy: we only need to create an instance of `tf.data.Dataset`:
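
Here is a rough illustration of why the order matters. The `TinyVocab` class below is a hypothetical stand-in written in plain Python, not the Keras class. It assumes, as I understand Keras’ `Tokenizer` to behave, that every `fit_on_texts` call rebuilds the index from the cumulative word counts, so ids handed out earlier can change after a later fit.

```python
from collections import Counter


class TinyVocab:
    """Hypothetical stand-in that rebuilds its index from cumulative counts."""

    def __init__(self):
        self.counts = Counter()
        self.word_index = {}

    def fit_on_texts(self, texts):
        for text in texts:
            self.counts.update(text.lower().split())
        # Re-rank all words seen so far: most frequent first, ids start at 1.
        ranked = self.counts.most_common()
        self.word_index = {w: i for i, (w, _) in enumerate(ranked, start=1)}

    def texts_to_sequences(self, texts):
        return [
            [self.word_index[w] for w in text.lower().split() if w in self.word_index]
            for text in texts
        ]


vocab = TinyVocab()
vocab.fit_on_texts(["<start> il est"])
early = vocab.texts_to_sequences(["<start> il est"])

# A second fit changes the word frequencies, and with them the ids.
vocab.fit_on_texts(["il est la <end>", "il part <end>"])
late = vocab.texts_to_sequences(["<start> il est"])

print(early, late)
```

The same sentence maps to different integers before and after the second fit, so sequences converted too early would no longer match the final vocabulary.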

In [16]:
dataset = tf.data.Dataset.from_tensor_slices((data_en, data_fr_in, data_fr_out))
dataset = dataset.shuffle(20).batch(5)
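
As a plain-Python analogue of `shuffle(20).batch(5)`, the sketch below shuffles 20 stand-in examples and cuts them into batches of 5. The function name and the fixed seed are assumptions for illustration only; `tf.data` does the equivalent lazily and, by default, reshuffles on every pass over the dataset.

```python
import random


def shuffle_and_batch(examples, batch_size, seed=0):
    # Copy first so the caller's list is left untouched.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Slice the shuffled list into consecutive chunks of batch_size.
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


examples = list(range(20))  # stand-ins for our 20 sentence pairs
batches = shuffle_and_batch(examples, batch_size=5)
print(len(batches), [len(batch) for batch in batches])
```

Because the shuffle buffer (20) is as large as the dataset, every example can land in any batch, and we end up with four batches of five pairs each.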

And that’s it. We are done preparing the data!

##Seq2Seq model without Attention

By now, we probably know that the attention mechanism is the new standard in machine translation tasks. But I think there are good reasons to create the vanilla `Seq2Seq` first:

* Pretty simple and easy with `tf.keras`
* No headache to debug when things go wrong
* It lets us answer: why do we need attention at all?

Okay, let’s assume that you are all convinced. We will start off with the encoder. Inside the encoder, there is an embedding layer and an RNN layer (which can be a vanilla RNN, an LSTM, or a GRU).

At every forward pass, it takes in a batch of sequences and initial states and returns output sequences as well as final states:

In [17]:
class Encoder(tf.keras.Model):

  def __init__(self, vocab_size, embedding_size, lstm_size):
    super(Encoder, self).__init__()

    self.lstm_size = lstm_size 
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_size)
    self.lstm = tf.keras.layers.LSTM(lstm_size, return_sequences=True, return_state=True)

  def call(self, sequence, states):
    embed = self.embedding(sequence)
    output, state_hidden, state_context = self.lstm(embed, initial_state=states)
    return output, state_hidden, state_context

  def init_states(self, batch_size):
    return (tf.zeros([batch_size, self.lstm_size]), tf.zeros([batch_size, self.lstm_size]))

And here is how the data’s shape changes at each layer. I find that keeping track of the data’s shape is extremely helpful for avoiding silly mistakes, just like stacking up Lego pieces:

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/attention-and-transformers-mechanism/neural-machine-translation/images/data_shapes-1_l7luwu.webp?raw=1' width='600'/>
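
The same bookkeeping can be written down as a tiny helper. This is only a sketch in plain Python (no TensorFlow involved): the function name is hypothetical, and `embedding_size=32` and `lstm_size=64` are example values I picked, while the batch size of 5 and the sequence length of 9 come from the data prepared above.

```python
def encoder_shapes(batch_size, seq_len, embedding_size, lstm_size):
    # Shape of the data after each stage of the encoder's forward pass.
    return {
        "input": (batch_size, seq_len),
        "embedding": (batch_size, seq_len, embedding_size),
        # return_sequences=True keeps one output vector per time step.
        "lstm_output": (batch_size, seq_len, lstm_size),
        # return_state=True also returns the final hidden and cell states.
        "state_hidden": (batch_size, lstm_size),
        "state_context": (batch_size, lstm_size),
    }


shapes = encoder_shapes(batch_size=5, seq_len=9, embedding_size=32, lstm_size=64)
for name, shape in shapes.items():
    print(name, shape)
```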

We are done with the encoder. Next, let’s create the decoder.

Without the attention mechanism, the decoder is basically the same as the encoder, except that it has a Dense layer to map the RNN’s outputs into the vocabulary space:



In [18]:
class Decoder(tf.keras.Model):

  def __init__(self, vocab_size, embedding_size, lstm_size):
    super(Decoder, self).__init__()

    self.lstm_size = lstm_size 
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_size)
    self.lstm = tf.keras.layers.LSTM(lstm_size, return_sequences=True, return_state=True)
    # Dense lives in tf.keras.layers; it projects each LSTM output to vocabulary logits
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, sequence, state):
    embed = self.embedding(sequence)
    lstm_out, state_hidden, state_context = self.lstm(embed, initial_state=state)
    logits = self.dense(lstm_out)
    return logits, state_hidden, state_context

Similarly, here’s the data’s shape at each layer:

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/attention-and-transformers-mechanism/neural-machine-translation/images/data_shapes-2_w7unlz.webp?raw=1' width='600'/>

As you might have noticed, the final states of the encoder will act as the initial states of the decoder. That’s the difference between a plain language model and the decoder of a `Seq2Seq` model.
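
One practical consequence of this handoff is that the encoder and the decoder must use the same LSTM size, otherwise the states will not fit. The sketch below checks that in plain Python. The function name is hypothetical, and the sizes (`embedding_size=32`, `lstm_size=64`, `vocab_size=100`) are placeholder values for illustration; the batch size of 5 and the sequence length of 14 come from the French data above.

```python
def decoder_shapes(batch_size, seq_len, embedding_size, lstm_size, vocab_size, initial_states):
    # The LSTM expects a (hidden, cell) pair, each shaped (batch_size, lstm_size).
    expected = (batch_size, lstm_size)
    if any(shape != expected for shape in initial_states):
        raise ValueError(f"initial states must have shape {expected}, got {initial_states}")
    return {
        "input": (batch_size, seq_len),
        "embedding": (batch_size, seq_len, embedding_size),
        "lstm_output": (batch_size, seq_len, lstm_size),
        # The Dense layer maps every time step to one score per vocabulary word.
        "logits": (batch_size, seq_len, vocab_size),
    }


# Shapes of the encoder's final (hidden, cell) states for batch_size=5, lstm_size=64.
encoder_final_states = [(5, 64), (5, 64)]
shapes = decoder_shapes(5, 14, 32, 64, 100, encoder_final_states)
print(shapes["logits"])
```

If the decoder were built with a different `lstm_size` than the encoder, the check raises a `ValueError`, which mirrors the shape error you would hit when running the real layers.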

And that is the decoder we need to create. 

Before moving on, let’s check that we didn’t make any mistakes along the way: