#**Lab 5 and 6: Neural Machine Translation (Extra Guide)**

This week and the next, we'll build a neural machine translation model based on the sequence-to-sequence (seq2seq) models proposed by Sutskever et al. (2014) and Cho et al. (2014). The seq2seq model is widely used in machine translation systems such as Google’s neural machine translation system (GNMT) (Wu et al., 2016).

A folder, **nmt_lab_files**, has been provided for you. This folder contains 3 files:
1. **data.30.vi** - each line of this file contains a Vietnamese sentence to be translated (i.e. the source sentences).
2. **data.30.en** - each line of this file contains the English sentence corresponding to the Vietnamese sentence on the same line of **data.30.vi** (i.e. the target sentences).
3. **nmt_model_keras.py** - the incomplete code for this lab.

The doc file provided contains an explanation of the code file and a guide on how to complete the code (by doing 3 tasks). Read the doc file and if you can, complete the code as instructed. When the code is completed, skip to section xx of this notebook. 

This notebook (prior to section xx) merely contains further explanation of sections of the code.

##**LanguageDict**

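`LanguageDict` is a class for creating language dictionary objects, which map the words of a language to integer ids (see the `word2ids` examples in the next section).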

##**The `load_dataset()` Method**

This helper method reads from the source and target files to
- load `max_num_examples` sentences,
- split the sentences into train, development and test sets, and
- return the relevant data.

The code for this is fully commented.
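
The loading and splitting steps can be sketched in plain Python as follows. This is only an illustration and not the code in `nmt_model_keras.py`: the function name `split_examples` and the 80/10/10 proportions are assumptions made for the example.

```python
def split_examples(source_sents, target_sents, max_num_examples,
                   dev_fraction=0.1, test_fraction=0.1):
    """Cap the parallel data at max_num_examples and split it three ways.

    Hypothetical helper: the name and the default fractions are assumptions.
    """
    # Keep source and target aligned by pairing them line by line.
    pairs = list(zip(source_sents, target_sents))[:max_num_examples]

    num_dev = int(len(pairs) * dev_fraction)
    num_test = int(len(pairs) * test_fraction)
    num_train = len(pairs) - num_dev - num_test

    train = pairs[:num_train]
    dev = pairs[num_train:num_train + num_dev]
    test = pairs[num_train + num_dev:]
    return train, dev, test


# Toy parallel corpus of 12 "sentences"; only the first 10 are loaded.
source = [f"src {i}" for i in range(12)]
target = [f"tgt {i}" for i in range(12)]
train, dev, test = split_examples(source, target, max_num_examples=10)
print(len(train), len(dev), len(test))
```

Pairing the sentences before slicing ensures that a source sentence and its translation always end up in the same split.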

<br>

As an example of the kind of output returned by this method, let's assume we are translating the sentence 'I like rabbits' from English to English (this of course is never the case), such that the tokenized and case-normalized source and target sentence lists are as follows:


```
# In our case this would actually be [['tôi', 'thích', 'thỏ']], i.e. the Vietnamese equivalent of the English sentence.
# We've used English to English here so we can follow along with the code.
source_words = [['i', 'like', 'rabbits']] 
target_words = [['i', 'like', 'rabbits']]
```
The word2ids for the source and target language dictionaries would look something like:
```
source_dict.word2ids = {'<PAD>': 0, '<UNK>': 1, 'i': 2, 'like': 3, 'rabbits':4}

# end and start tokens are added for the target words
target_dict.word2ids = {'<PAD>': 0, '<UNK>': 1, '<start>': 2, 'i': 3, 'like': 4, 'rabbits':5, '<end>':6}

```
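
A minimal sketch of how such mappings could be built is shown here. The helper `build_word2ids` and its `is_target` flag are hypothetical names used for illustration; the actual `LanguageDict` class may differ in detail.

```python
def build_word2ids(sentences, is_target=False):
    """Map every word to an integer id, reserving ids for special tokens.

    Hypothetical helper for illustration; LanguageDict may differ in detail.
    """
    word2ids = {'<PAD>': 0, '<UNK>': 1}
    if is_target:
        # The target side also needs a token marking the sentence start.
        word2ids['<start>'] = len(word2ids)
    for sentence in sentences:
        for word in sentence:
            if word not in word2ids:
                word2ids[word] = len(word2ids)
    if is_target:
        # ... and a token marking the sentence end.
        word2ids['<end>'] = len(word2ids)
    return word2ids


sentences = [['i', 'like', 'rabbits']]
source_word2ids = build_word2ids(sentences)
target_word2ids = build_word2ids(sentences, is_target=True)
print(source_word2ids)
print(target_word2ids)
```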
Let's also assume that we are training and testing on this same dataset of one sentence.
The **source words** for train/dev/test will be given as
```
# a batch_size X max_sent_length array.
source_words_train = [[2,3,4]] # corresponding to ['i', 'like', 'rabbits']
source_words_dev = [[2,3,4]]  # corresponding to ['i', 'like', 'rabbits']
source_words_test = [[2,3,4]] # corresponding to ['i', 'like', 'rabbits']
```

The **target words** for the train data will be given as follows (dev/test don't need target words, as the model will generate these):
```
target_words_train = [[2,3,4,5]] # corresponding to ['<start>', 'i', 'like', 'rabbits']
```

The **target word label** for each target word is the word that follows it. The target word labels for the train/dev/test data will be given as follows:
```
target_words_train_labels = [[3,4,5,6]] # corresponding to ['i', 'like', 'rabbits', '<end>']
target_words_dev_labels = [[3,4,5,6]] # corresponding to ['i', 'like', 'rabbits', '<end>']
target_words_test_labels = [[3,4,5,6]] # corresponding to ['i', 'like', 'rabbits', '<end>']
```
The train target word labels are then expanded by one dimension, so that each label sits in its own length-1 list. For our single sentence this gives:
`[[3], [4], [5], [6]]`
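
The relationship between the decoder inputs and the labels can be checked with a few lines of NumPy. This is a sketch under assumptions: `encode` is a hypothetical helper, and the dimension expansion is assumed to add a trailing axis of size 1 (as `np.expand_dims(..., axis=-1)` does).

```python
import numpy as np

target_word2ids = {'<PAD>': 0, '<UNK>': 1, '<start>': 2,
                   'i': 3, 'like': 4, 'rabbits': 5, '<end>': 6}

def encode(words, word2ids):
    """Look up each word's id, falling back to <UNK> for unseen words."""
    return [word2ids.get(word, word2ids['<UNK>']) for word in words]

sentence = ['i', 'like', 'rabbits']

# Decoder input: the sentence with '<start>' prepended.
target_words = np.array([encode(['<start>'] + sentence, target_word2ids)])
# Labels: the same sentence shifted left by one step, ending in '<end>'.
target_labels = np.array([encode(sentence + ['<end>'], target_word2ids)])
# Assumed expansion: add a trailing axis so each label has its own slot.
target_labels_expanded = np.expand_dims(target_labels, axis=-1)

print(target_words.tolist())          # [[2, 3, 4, 5]]
print(target_labels.tolist())         # [[3, 4, 5, 6]]
print(target_labels_expanded.shape)   # (1, 4, 1)
```

At every position the label is the id of the word that follows the input word, which is exactly what the decoder is trained to predict.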






##**The Neural Translation Model (NMT)**

For the NMT, the network (a system of connected layers/models) used for training differs slightly from the network used for inference. Both use the seq2seq encoder-decoder architecture.




###**The Training Mode**

**Encoder**

Given:
- `source_words`: a `batch_size(num_sents) x max_sentence_length` array representing the source words. In our mini example, this would be the Vietnamese equivalent of `['i', 'like', 'rabbits']`; `[['tôi', 'thích', 'thỏ']]`

The following steps comprise the encoding network:

1. Transform `source_words` into `source_words_embeddings` using a randomly initialized embedding lookup. `source_words_embeddings` is thus a `batch_size(num_sents) x max_sentence_length x embedding_dim` array.
2. Apply embedding dropout of `embedding_dropout_rate`.
3. Use a single `LSTM` with `hidden_size` units to learn a representation for the source words i.e. to encode the input. 

    (a.) The hidden and cell states for this `LSTM` are initialized to zeros (i.e. we leave the `initial_states = None` default as is).

    (b.) We save the `encoder_output` (the full sequence, not just the last state) and the encoder (hidden and cell) states.

This way, the model encodes a representation for the source words. Task 1 guides you to complete the encoder part of the training model.
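
To make the array shapes in these steps concrete, here is a NumPy-only sketch of the encoder forward pass. It is not the Keras code that Task 1 asks for: the function `lstm_forward`, the toy sizes and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(inputs, W, U, b):
    """Run a single-layer LSTM over inputs (batch x time x input_dim).

    The hidden and cell states start at zero. Returns the full output
    sequence plus the final hidden and cell states.
    """
    batch_size, num_steps, _ = inputs.shape
    hidden_size = U.shape[0]
    h = np.zeros((batch_size, hidden_size))
    c = np.zeros((batch_size, hidden_size))
    outputs = []
    for t in range(num_steps):
        gates = inputs[:, t, :] @ W + h @ U + b
        i, f, g, o = np.split(gates, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs, axis=1), h, c

rng = np.random.default_rng(0)
source_vocab_size, embedding_dim, hidden_size = 5, 8, 6   # toy sizes
embedding_dropout_rate = 0.2

source_words = np.array([[2, 3, 4]])                      # batch_size x max_sentence_length

# Step 1: randomly initialized embedding lookup.
embedding_table = rng.normal(size=(source_vocab_size, embedding_dim))
source_words_embeddings = embedding_table[source_words]   # (1, 3, 8)

# Step 2: dropout (zero a fraction of the values, rescale the rest).
keep_mask = rng.random(source_words_embeddings.shape) >= embedding_dropout_rate
dropped = source_words_embeddings * keep_mask / (1.0 - embedding_dropout_rate)

# Step 3: LSTM with zero initial states, keeping the sequence and final states.
W = rng.normal(scale=0.1, size=(embedding_dim, 4 * hidden_size))
U = rng.normal(scale=0.1, size=(hidden_size, 4 * hidden_size))
b = np.zeros(4 * hidden_size)
encoder_outputs, state_h, state_c = lstm_forward(dropped, W, U, b)

print(source_words_embeddings.shape)  # (1, 3, 8)
print(encoder_outputs.shape)          # (1, 3, 6)
print(state_h.shape, state_c.shape)   # (1, 6) (1, 6)
```

Note that the last vector of `encoder_outputs` equals the final hidden state `state_h`; the sequence output simply keeps the hidden state of every time step as well.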

<br>

**Decoder (No Attention)**

Given:
- `target_words`: a `batch_size(num_sents) x max_sentence_length+1` array representing the target words. This is the translation of the source words, shifted by one time step by prepending a `<start>` token: `['<start>', 'i', 'like', 'rabbits']`.

The decoding is in the following steps:

1. Transform `target_words` into `target_words_embeddings` using a randomly initialized embedding lookup. `target_words_embeddings` is thus a `batch_size x max_sentence_length+1 x embedding_dim` array.

2.  Apply embedding dropout of `embedding_dropout_rate`.

3. Use a single `LSTM` with `hidden_size` units to learn a representation for the target words. Context is given to this model by using the encoder states to initialize the decoder LSTM. This way, the representation of each target token (and its next-word prediction, see step 4), starting from the `'<start>'` token, is conditioned on the encoded source sentence.

4. For each token representation, use a dense layer to predict a `target_vocab_size` vector giving the probability of each word in the target vocabulary being the next word after the represented token. The output `decoder_outputs_train` is thus a `batch_size x max_sentence_length+1 x target_vocab_size` array.
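
The final projection in step 4 can be illustrated with NumPy. This sketch assumes the dense layer ends in a softmax (so that each output vector is a probability distribution) and uses random stand-ins for the decoder LSTM outputs and the dense weights; none of these values come from the lab code.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_steps, hidden_size, target_vocab_size = 1, 4, 6, 7  # toy sizes

# Stand-in for the decoder LSTM output: one vector per target token.
decoder_lstm_outputs = rng.normal(size=(batch_size, num_steps, hidden_size))

# Dense layer parameters (random here; learned during training).
dense_kernel = rng.normal(scale=0.1, size=(hidden_size, target_vocab_size))
dense_bias = np.zeros(target_vocab_size)

def softmax(logits):
    """Numerically stable softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# The same dense layer is applied independently at every time step.
logits = decoder_lstm_outputs @ dense_kernel + dense_bias
decoder_outputs_train = softmax(logits)

# The most probable next word at each position.
predicted_ids = decoder_outputs_train.argmax(axis=-1)

print(decoder_outputs_train.shape)   # (1, 4, 7)
print(predicted_ids.shape)           # (1, 4)
```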


###**The Inference Mode**

**Encoder**

The inference time encoding follows the same steps as training time encoding.

<br>

**Decoder (No attention)**

During training, we passed a `batch_size(num_sents) x max_sentence_length+1` array representing the target words into the decoder LSTM. The `decoder_lstm` learns to represent a given target sentence using the context from the encoder LSTM (which learns to represent a source sentence).

At test time, several things are different:

1. We no longer have access to a complete translation of the source sentence (recall that no `target_words` array exists for the dev and test sets). Instead, we initialize the target words array as follows:

    Each expected sentence contains only a single token index, the index of the `'<start>'` token. So, the target words array for dev/test is a `batch_size x 1` array (see the `nmt.eval()` function for this).

2. This `batch_size x 1` array is fed to the trained `decoder_lstm`, and the predicted array is a `batch_size x 1 x target_vocab_size` array, such that taking the argmax of this array across dimension 2 gives the most probable next word.

    For example, at time step `0`, the first time step, the `step_target_words` given is the `batch_size x 1` array containing the `'<start>'` token, so the decoder's next-word prediction is, for each sentence in the batch, the first word of the translation.

3. At the first time step, the `decoder_lstm` still uses the encoder states as its initial states. At subsequent time steps, it uses its own states from the previous time step. This is also what the `decoder_lstm` does at training time, but it is made more explicit here as we loop over time steps using a for loop (see `nmt.eval()`).
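
The loop described in these three points is a greedy decoding loop. A minimal sketch follows; it is not the code in `nmt.eval()`. The function `toy_decoder_step` is a hypothetical stand-in for the trained decoder that always predicts the id after the one it was given, so the example stays runnable without a trained model.

```python
import numpy as np

START_ID, END_ID, VOCAB_SIZE = 2, 6, 7   # ids taken from the mini example

def toy_decoder_step(step_target_words, states):
    """Stand-in for the trained decoder: always predicts (previous id + 1)."""
    batch_size = step_target_words.shape[0]
    probs = np.zeros((batch_size, 1, VOCAB_SIZE))
    next_ids = np.minimum(step_target_words[:, 0] + 1, END_ID)
    probs[np.arange(batch_size), 0, next_ids] = 1.0
    # A real decoder would also return its updated hidden and cell states.
    return probs, states

def greedy_decode(encoder_states, batch_size, max_steps):
    # Point 1: every sentence starts as just the '<start>' token.
    step_target_words = np.full((batch_size, 1), START_ID)
    # Point 3: the first step starts from the encoder states.
    states = encoder_states
    predictions = []
    for _ in range(max_steps):
        probs, states = toy_decoder_step(step_target_words, states)
        # Point 2: argmax over dimension 2 gives the most probable next word,
        # which becomes the input of the next time step.
        step_target_words = np.argmax(probs, axis=2)      # batch_size x 1
        predictions.append(step_target_words)
    return np.concatenate(predictions, axis=1)

decoded = greedy_decode(encoder_states=None, batch_size=1, max_steps=4)
print(decoded.tolist())   # [[3, 4, 5, 6]] i.e. ['i', 'like', 'rabbits', '<end>']
```

In practice the loop would also stop producing words for a sentence once `'<end>'` has been predicted; that bookkeeping is left out here for brevity.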





##**Training Without Attention**

If you've completed Tasks 1 and 2, you are ready to train the NMT model without attention.

Run the following cells to train the model for 10 epochs. The cells also print the model summary of each model you encapsulated.

If you're using a GPU, training will take no more than 10 minutes and you will get a BLEU score between 4 and 5.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
# change this to the path to your folder. Remember to start from the home directory
PATH = 'My Drive/GTA1/' 

In [None]:
PATH_TO_FOLDER = "/content/drive/" + PATH

In [None]:
import sys
sys.path.append(PATH_TO_FOLDER)

In [None]:
SOURCE_PATH = PATH_TO_FOLDER + 'data.30.vi'
TARGET_PATH = PATH_TO_FOLDER + 'data.30.en'

In [None]:
!pip install --upgrade tensorflow
!pip install --upgrade tensorflow-gpu

Collecting tf-estimator-nightly==2.8.0.dev2021122109
  Downloading tf_estimator_nightly-2.8.0.dev2021122109-py2.py3-none-any.whl (462 kB)
Installing collected packages: tf-estimator-nightly
Successfully installed tf-estimator-nightly-2.8.0.dev2021122109
Collecting tensorflow-gpu
  Downloading tensorflow_gpu-2.8.0-cp37-cp37m-manylinux2010_x86_64.whl (497.5 MB)
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-2.8.0


In [None]:
import nmt_model_keras as nmt 

In [None]:
# Train and evaluate the model without attention.
nmt.main(SOURCE_PATH, TARGET_PATH, use_attention=False)

loading dictionaries
read 24000/3000/3000 train/dev/test batches
number of tokens in source: 2034, number of tokens in target:2506
Task 1(a): Creating the embedding lookups...

Task 1(b): Looking up source and target words...

Task 1(c): Creating an encoder
						 Train Model Summary.
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 source_embed_layer (Embedding)  (None, None, 100)   203400      ['input_1[0][0]']                
      

  super(Adam, self).__init__(name, **kwargs)


encoder_outputs shape: KerasTensor(type_spec=TensorSpec(shape=(None, None, 200), dtype=tf.float32, name=None), name='encoder/transpose_2:0', description="created by layer 'encoder'")

 Putting together the decoder states
						 Decoder Inference Model summary
Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 target_embed_layer (Embedding)  (None, None, 100)   250600      ['input_2[0][0]']                
                                                                                                  
 dropout_1 (Dropout)            (None, None, 100)    0           ['target_embed_layer[0][0]']     
                              

##**Decoding with Attention**

The inputs to the attention layer are the encoder and decoder outputs. The attention mechanism:
1. Computes a score (a Luong score) for each source word.
2. Weights the encoder representations of the source words by their Luong scores.
3. Concatenates the weighted encoder representation with the decoder output.

This new decoder output is then the input to the `decoder_dense` layer.
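
The three steps can be traced in NumPy to see how the shapes evolve. This is a sketch under assumptions rather than the Task 3 solution: it assumes the dot-product variant of the Luong score, and the function name `luong_dot_attention`, the toy sizes and the random inputs are made up for the example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def luong_dot_attention(encoder_outputs, decoder_outputs):
    """encoder_outputs: batch x src_len x hidden,
    decoder_outputs: batch x tgt_len x hidden."""
    # Step 1: one score per (source word, target step) pair,
    # normalized over the source words.
    scores = encoder_outputs @ decoder_outputs.transpose(0, 2, 1)  # batch x src x tgt
    weights = softmax(scores, axis=1)

    # Step 2: weight every encoder vector by its score and sum over the source.
    weighted = weights[..., None] * encoder_outputs[:, :, None, :]  # batch x src x tgt x hidden
    encoder_vector = weighted.sum(axis=1)                            # batch x tgt x hidden

    # Step 3: concatenate with the decoder output.
    new_decoder_outputs = np.concatenate([encoder_vector, decoder_outputs], axis=-1)
    return new_decoder_outputs, weights

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(1, 3, 6))   # 3 source words, hidden size 6
decoder_outputs = rng.normal(size=(1, 4, 6))   # 4 target steps
attended, weights = luong_dot_attention(encoder_outputs, decoder_outputs)

print(weights.shape)    # (1, 3, 4)
print(attended.shape)   # (1, 4, 12)
```

Because of the concatenation, the vector passed to the dense layer is twice as wide as the plain decoder output.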

The Task 3 description in the doc file outlines these steps in detail. Once you have completed this task, you are ready to train with attention. Training will take no more than 10 minutes using a GPU, and you should get a BLEU score of about 15.

In [None]:
nmt.main(SOURCE_PATH, TARGET_PATH, use_attention=True)

loading dictionaries
read 24000/3000/3000 train/dev/test batches
number of tokens in source: 2034, number of tokens in target:2506
Task 1(a): Creating the embedding lookups...

Task 1(b): Looking up source and target words...

Task 1(c): Creating an encoder
decoder_outputs_transposed: (None, 200, None)
luong_score(before softmax): (None, None, None)
luong_score(add dimension): (None, None, None, 1)
encoder_outputs: (None, None, 1, 200)
encoder_vector: (None, None, None, 200)
encoder_vector: (None, None, 200)
decoder_outputs shape: Tensor("Placeholder_1:0", shape=(None, None, 200), dtype=float32)
						 Train Model Summary.
Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_6 (InputLayer)           [(None, None)]       0           []                               
                                                       

  super(Adam, self).__init__(name, **kwargs)


decoder_outputs_transposed: (None, 200, None)
luong_score(before softmax): (None, None, None)
luong_score(add dimension): (None, None, None, 1)
encoder_outputs: (None, None, 1, 200)
encoder_vector: (None, None, None, 200)
encoder_vector: (None, None, 200)
decoder_outputs shape: Tensor("Placeholder_1:0", shape=(None, None, 200), dtype=float32)
						 Decoder Inference Model summary
Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_7 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 target_embed_layer (Embedding)  (None, None, 100)   250600      ['input_7[0][0]']                
                                                                                                  
 drop