<font color="#483D8B">
<h1 align="center">Language Translation</h1>
<div align="center">
<font size=3><b>
<br>Ruobing Wang
<br>April 20, 2019
<br></b></font></div>


---------------

## Overview

In this project, I will use TensorFlow to build a language translation model with a sequence-to-sequence (seq2seq), or encoder-decoder, architecture. As the name suggests, an encoder-decoder model has two parts: an encoder and a decoder. The goal is to translate English sentences into French using data from the WMT10 French-English corpus, which consists of a source file (English) and a target file (French).

Natural language processing is becoming more and more important in everyday life. Machine translation, the task addressed here, takes a piece of text in one language and translates it into another. The ability to translate text quickly and automatically has considerable real-world value. For example, with machine translation you can sit in America and invest in the French market, as long as you can follow what is happening there in real time.

This project covers how to preprocess the dataset, define the inputs, define the encoder model, define the decoder model, build the entire seq2seq model, calculate the loss and clip gradients, and train the model and get predictions.

### References:

https://github.com/angelmtenor/data-science-keras

https://github.com/deep-diver/EN-FR-MLT-tensorflow

https://github.com/udacity/deep-learning

https://deepnotes.io/softmax-crossentropy

[Why special tokens?](https://datascience.stackexchange.com/questions/26947/why-do-we-need-to-add-start-s-end-s-symbols-when-using-recurrent-neural-n)

[Python enumerate](https://docs.python.org/3/library/functions.html#enumerate)






-------------

## Data

The original dataset is the [WMT10 French-English corpus](http://www.statmt.org/wmt10/training-giga-fren.tar).

Since I only use a laptop for this project, I chose a relatively small dataset.

These small files store English (source file) and French (target file) sentences, but they also contained some lines in another language, possibly Russian, which I deleted manually. The two files contain exactly the same number of lines and are aligned: the same line number holds the same sentence in both languages.

Here is a brief summary of the datasets: 

In [1]:
import os
import pickle
import copy
import numpy as np

def load_data(path):
    input_file = os.path.join(path)
    with open(input_file, 'r', encoding='utf-8') as f:
        data = f.read()

    return data

In [2]:
source_path = 'data/small_vocab_en'
target_path = 'data/small_vocab_fr'
source_text = load_data(source_path)
target_text = load_data(target_path)

In [3]:
import numpy as np
from collections import Counter

print('Dataset Brief Stats')
print('* number of unique words in English sample sentences: {}\
        [this is roughly measured/without any preprocessing]'.format(len(Counter(source_text.split()))))
print()

english_sentences = source_text.split('\n')
print('* English sentences')
print('\t- number of sentences: {}'.format(len(english_sentences)))
print('\t- avg. number of words in a sentence: {}'.format(np.average([len(sentence.split()) for sentence in english_sentences])))

french_sentences = target_text.split('\n')
print('* French sentences')
print('\t- number of sentences: {} [data integrity check / should have the same number]'.format(len(french_sentences)))
print('\t- avg. number of words in a sentence: {}'.format(np.average([len(sentence.split()) for sentence in french_sentences])))
print()

sample_sentence_range = (0, 5)
side_by_side_sentences = list(zip(english_sentences, french_sentences))[sample_sentence_range[0]:sample_sentence_range[1]]
print('* Sample sentences range from {} to {}'.format(sample_sentence_range[0], sample_sentence_range[1]))

for index, sentence in enumerate(side_by_side_sentences):
    en_sent, fr_sent = sentence
    print('[{}-th] sentence'.format(index+1))
    print('\tEN: {}'.format(en_sent))
    print('\tFR: {}'.format(fr_sent))
    print()

Dataset Brief Stats
* number of unique words in English sample sentences: 227        [this is roughly measured/without any preprocessing]

* English sentences
	- number of sentences: 137861
	- avg. number of words in a sentence: 13.225277634719028
* French sentences
	- number of sentences: 137861 [data integrity check / should have the same number]
	- avg. number of words in a sentence: 14.226612312401622

* Sample sentences range from 0 to 5
[1-th] sentence
	EN: new jersey is sometimes quiet during autumn , and it is snowy in april .
	FR: new jersey est parfois calme pendant l' automne , et il est neigeux en avril .

[2-th] sentence
	EN: the united states is usually chilly during july , and it is usually freezing in november .
	FR: les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .

[3-th] sentence
	EN: california is usually quiet during march , and it is usually hot in june .
	FR: california est généralement calme en mars , et il est généraleme

---------------


## Exploratory Data Analysis

From the summary, we can see that the two files have exactly the same number of sentences and are aligned (the i-th sentence in English has the same meaning as the i-th sentence in French).

Here is a brief overview of the steps in this section:

- **create lookup tables** 
  - create two mapping tables 
      - (key, value) == (unique word string, its unique index)     - `(1)`
      - (key, value) == (its unique index, unique word string)     - `(2)`
      - `(1)` is used in the next step, and `(2)` is used later in the prediction step
      
      
- **text to word ids**
  - convert each string word in the list of sentences to the index
  - `(1)` is used for converting process
  
  
- **save the pre-processed data**
  - create two `(1)` mapping tables for English and French
  - using the mapping tables, replace strings in the original source and target dataset with indices



## Preprocessing


### Create Lookup Tables

As mentioned briefly, I am going to implement a function that creates lookup tables. Since every model is represented mathematically, the input and the output (prediction) must also be represented as numbers. This step is necessary for NLP problems because human-readable text is not machine readable. The function takes the raw text and returns two mapping tables (dictionaries). Along with the words from the sentences, four special tokens are added to the mapping tables: `<PAD>` pads sentences so that the sentences in a batch have the same length, `<EOS>` tells the model when to stop producing output, `<UNK>` stands for words that do not appear in the vocabulary, and `<GO>` marks the start of the output sentence.

- (key, value) == (unique word string, its unique index)     - `(1)`
- (key, value) == (its unique index, unique word string)     - `(2)`

`(1)` will be used in the next step, `text to word ids`, to find the index of each word. `(2)` is not used in the preprocessing step but will be used later. After a prediction is made, the words of the output sentence are represented by their indices. The predicted output is machine readable but not human readable, which is why we need `(2)` to convert each index back into a human-readable word.

<br/>
<img src='https://i.ibb.co/BPjPwrt/lookup.png' alt='Drawing' width='70%'>

In [4]:
CODES = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3 }

def create_lookup_tables(text):
    # make a list of unique words
    vocab = set(text.split())

    # (1)
    # starts with the special tokens
    vocab_to_int = copy.copy(CODES)

    # the index (v_i) will starts from 4 (the 2nd arg in enumerate() specifies the starting index)
    # since vocab_to_int already contains special tokens
    for v_i, v in enumerate(vocab, len(CODES)):
        vocab_to_int[v] = v_i

    # (2)
    int_to_vocab = {v_i: v for v, v_i in vocab_to_int.items()}

    return vocab_to_int, int_to_vocab

Now we have two tables:

- one maps each vocabulary word to its ID
- the other maps each ID back to its vocabulary word
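
As a quick illustration, the sketch below builds both tables for a toy sentence. It re-declares a simplified helper, `build_tables` (a hypothetical name, not part of the notebook), and sorts the vocabulary so that the ids are reproducible. That sorting is an assumption added here; the `set`-based version above does not guarantee a fixed order between runs.

```python
SPECIAL_TOKENS = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3}

def build_tables(text):
    """Return (word -> id, id -> word) dictionaries for the given text."""
    # sorted() is an assumption made for this sketch so the ids are deterministic
    vocab = sorted(set(text.split()))
    word_to_id = dict(SPECIAL_TOKENS)
    # regular words get ids after the special tokens
    for idx, word in enumerate(vocab, len(SPECIAL_TOKENS)):
        word_to_id[word] = idx
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    return word_to_id, id_to_word

word_to_id, id_to_word = build_tables('new jersey is quiet')
print(word_to_id['is'])   # 4, the first id after the special tokens
print(id_to_word[0])      # <PAD>
```

Looking a word up in the first table and then looking the resulting id up in the second table returns the original word, which is what the prediction step relies on.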

### Text to Word Ids

Two `(1)` lookup tables (one for English sentences and the other for French) are passed to the `text_to_ids` function as arguments. They are used to convert the English (source) and French (target) text respectively. This part is mostly programming, so there is not much to explain. I will just go over a few things to remember before jumping in.

- the original (raw) source and target data contain a list of sentences
  - each is represented as a single string

- the number of sentences is the same for English and French
 
- for each sentence, every word needs to be converted into its corresponding index. For example, as the figure shows, "This is a short sentence" is converted to the list of numbers [18, 19, 3, 20, 21]
  - the indices of each sentence are stored in a list
  - this makes the resulting list a 2-D array (row: sentence, column: word index)
  
- for every target sentence, the special token `<EOS>` is appended at the end
  - this token indicates when to stop generating a sequence
  
<br/>
<img src='https://i.ibb.co/2Zn4kx1/conversion.png' alt='Drawing' width='100%'>
<br/>

In [5]:
def text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int):
    """
        1st, 2nd args: raw string text to be converted
        3rd, 4th args: lookup tables for 1st and 2nd args respectively
    
        return: A tuple of lists (source_id_text, target_id_text) converted
    """
    # empty list of converted sentences
    source_text_id = []
    target_text_id = []
    
    # make a list of sentences (extraction)
    source_sentences = source_text.split("\n")
    target_sentences = target_text.split("\n")
    
    max_source_sentence_length = max([len(sentence.split(" ")) for sentence in source_sentences])
    max_target_sentence_length = max([len(sentence.split(" ")) for sentence in target_sentences])
    
    # iterating through each sentences (# of sentences in source&target is the same)
    for i in range(len(source_sentences)):
        # extract sentences one by one
        source_sentence = source_sentences[i]
        target_sentence = target_sentences[i]
        
        # make a list of tokens/words (extraction) from the chosen sentence
        source_tokens = source_sentence.split(" ")
        target_tokens = target_sentence.split(" ")
        
        # empty list of converted words to index in the chosen sentence
        source_token_id = []
        target_token_id = []
        
        for index, token in enumerate(source_tokens):
            if (token != ""):
                source_token_id.append(source_vocab_to_int[token])
        
        for index, token in enumerate(target_tokens):
            if (token != ""):
                target_token_id.append(target_vocab_to_int[token])
                
        # put <EOS> token at the end of the chosen target sentence
        # this token suggests when to stop creating a sequence
        target_token_id.append(target_vocab_to_int['<EOS>'])
            
        # add each converted sentences in the final list
        source_text_id.append(source_token_id)
        target_text_id.append(target_token_id)
    
    return source_text_id, target_text_id

We now have two lists: one holds the word ids of the English text and the other holds the word ids of the French text.
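
To make the conversion concrete, here is a minimal sketch on a two-sentence toy text. The lookup table and the helper name `sentences_to_ids` are hypothetical and exist only for this illustration; in the notebook the ids come from `create_lookup_tables`.

```python
# Hypothetical lookup table for illustration; real ids come from create_lookup_tables
target_vocab = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3,
                'this': 4, 'is': 5, 'a': 6, 'short': 7, 'sentence': 8}

def sentences_to_ids(text, vocab_to_int, append_eos=False):
    """Convert newline-separated sentences into lists of word ids."""
    id_sentences = []
    for sentence in text.split('\n'):
        # split() without arguments also skips repeated spaces
        ids = [vocab_to_int[token] for token in sentence.split()]
        if append_eos:
            # target sentences end with <EOS> so the decoder learns when to stop
            ids.append(vocab_to_int['<EOS>'])
        id_sentences.append(ids)
    return id_sentences

ids = sentences_to_ids('this is a short sentence\nthis is short',
                       target_vocab, append_eos=True)
print(ids)  # [[4, 5, 6, 7, 8, 1], [4, 5, 7, 1]]
```

Each row is one sentence and each entry is one word id, so rows can have different lengths until they are padded at batching time.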

### Preprocess and Save Data

Now we have the functions `create_lookup_tables` and `text_to_ids`, and we will use them in this step.

Both are general functions that can be used for other languages too. In this project, the languages are English and French, so those texts have to be fed into `create_lookup_tables` and `text_to_ids` to generate the preprocessed dataset. The steps below are basically a summary of what we did before:

- Load the data (text) from the original files for English and French
- Convert the text to lower case
- Create lookup tables for both English and French
- Convert the original data into lists of sentences whose words are represented by their indices
- Finally, combine these steps and save the preprocessed data to an external file (checkpoint)

In [6]:
def preprocess_and_save_data(source_path, target_path, text_to_ids):
    # Preprocess
    
    # load original data (English, French)
    source_text = load_data(source_path)
    target_text = load_data(target_path)

    # to the lower case
    source_text = source_text.lower()
    target_text = target_text.lower()

    # create lookup tables for English and French data
    source_vocab_to_int, source_int_to_vocab = create_lookup_tables(source_text)
    target_vocab_to_int, target_int_to_vocab = create_lookup_tables(target_text)

    # create list of sentences whose words are represented in index
    source_text, target_text = text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int)

    # Save data for later use (the with-block closes the file after writing)
    with open('preprocess.p', 'wb') as out_file:
        pickle.dump((
            (source_text, target_text),
            (source_vocab_to_int, target_vocab_to_int),
            (source_int_to_vocab, target_int_to_vocab)), out_file)

In [7]:
preprocess_and_save_data(source_path, target_path, text_to_ids)

### Checkpoint
This project uses a small set of sentences. In general, however, NLP requires a huge amount of raw text data, which takes quite a long time to preprocess, so it is best to avoid repeating that work whenever possible (for example, after an unexpected shutdown). Saving the preprocessed data to an external file speeds up your work and lets you focus on building the model.

In [8]:
import pickle

def load_preprocess():
    with open('preprocess.p', mode='rb') as in_file:
        return pickle.load(in_file)

In [9]:
import numpy as np

(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = load_preprocess()
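
The save-and-reload round trip can be checked in isolation. The sketch below is a toy version under stated assumptions: the payload values and the file name `preprocess_demo.p` are made up, and a temporary directory is used so that the real `preprocess.p` is not touched.

```python
import os
import pickle
import tempfile

# toy stand-in for the preprocessed data (values are illustrative only)
payload = (([[4, 5]], [[6, 7, 1]]),
           ({'hello': 4}, {'bonjour': 6}),
           ({4: 'hello'}, {6: 'bonjour'}))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocess_demo.p')
    # write the nested tuple to disk
    with open(path, 'wb') as out_file:
        pickle.dump(payload, out_file)
    # read it back as a new object
    with open(path, 'rb') as in_file:
        restored = pickle.load(in_file)

# same unpacking pattern as in the cell above
(source_ids, target_ids), (src_v2i, tgt_v2i), _ = restored
print(restored == payload)  # True
```

Because the structure is a plain nested tuple, it can be unpacked in one statement after loading, exactly as the notebook does.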

### Check the Version of TensorFlow and Access to GPU
Since a recurrent neural network is a fairly heavy model to train, it is recommended to train it in a GPU environment. I used AWS for this step since my laptop does not have a GPU.

In this step, you can also check your version of TensorFlow to avoid compatibility problems; mine is 1.13.1.

In [10]:
from distutils.version import LooseVersion
import warnings
import tensorflow as tf
from tensorflow.python.layers.core import Dense

# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.1'), 'Please use TensorFlow version 1.1 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.13.1
Default GPU Device: /device:GPU:0


## Models

For machine translation, I will build a model called sequence to sequence (seq2seq for short).

It has two sub-models, an encoder and a decoder, and each of them uses an RNN. The encoder takes the raw input text and outputs a neural representation, and that output is the input for the decoder.

Let's call the encoder's output C. We do not know what C really represents, but the decoder is able to look inside C and create a different output from it (here, a French sentence).

To build the model, I will follow the steps below:

- Define and process the input parameters
    - for the encoder model
        - enc_dec_model_inputs
    - for the decoder model
        - enc_dec_model_inputs, process_decoder_input, decoding_layer
- Build the encoder model
    - encoding_layer
- Build the decoder model for training (decoding, training process)
    - decoding_layer_train
- Build the decoder model for inference
    - decoding_layer_infer
- Build the seq2seq model (connect the encoder and decoder models)
    - seq2seq_model
- Train and estimate loss and accuracy

### Input Parameters

The enc_dec_model_inputs function creates and returns the parameters needed to build the model.

The inputs placeholder will be fed with English sentence data, and its shape is [None, None]. The first None is the batch size, which the user can choose, and the second None is the sentence length. The maximum sentence length differs from batch to batch, so it cannot be set to a fixed number. Instead, every sentence in a batch is padded to the length of the longest sentence in that batch, with `<PAD>` placed in the empty positions. We will deal with this later.

The targets placeholder is similar to the inputs placeholder, except that it is fed with French sentence data.

The target_sequence_length placeholder holds the length of each target sentence. Its shape is [None] because it has one entry per sentence, so its size equals the batch size, which is not fixed. We need these values later as an argument to TrainingHelper when building the decoder model.

max_target_len is the maximum over the lengths of all target sentences in the batch. Since the lengths are stored in target_sequence_length, we can use tf.reduce_max, which computes the maximum of elements across dimensions of a tensor.

tf.placeholder takes the dtype of the elements, the shape, and a name as arguments, and inserts a placeholder for a tensor that will be fed later (here, with English or French data).

In [11]:
def enc_dec_model_inputs():
    inputs = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets') 
    
    target_sequence_length = tf.placeholder(tf.int32, [None], name='target_sequence_length')
    max_target_len = tf.reduce_max(target_sequence_length)    
    
    return inputs, targets, target_sequence_length, max_target_len
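
The padding described above can be sketched in plain Python. This is only an illustration of the idea, not the notebook's batching code; `pad_batch` is a hypothetical helper name and the ids in the batch are made up.

```python
PAD_ID = 0  # id of '<PAD>' in the CODES table

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sentence in the batch to the length of the longest one."""
    max_len = max(len(sentence) for sentence in batch)
    return [sentence + [pad_id] * (max_len - len(sentence)) for sentence in batch]

target_batch = [[12, 8, 1], [5, 9, 14, 7, 1]]
# the true lengths are what would be fed to target_sequence_length
target_lengths = [len(sentence) for sentence in target_batch]
padded = pad_batch(target_batch)
print(padded)               # [[12, 8, 1, 0, 0], [5, 9, 14, 7, 1]]
print(max(target_lengths))  # 5, the value reduce_max would compute
```

After padding, the batch is rectangular and matches the [None, None] placeholder shape, while the unpadded lengths remain available separately.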

The hyperparam_inputs function creates and returns the TF placeholders for the model's hyperparameters.
- lr_rate is the learning rate
- keep_prob is the keep probability for dropout


In [12]:
def hyperparam_inputs():
    lr_rate = tf.placeholder(tf.float32, name='lr_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    
    return lr_rate, keep_prob

### Process Decoder Input
<br/>
<img src="https://i.ibb.co/gPrMr27/go-insert.png" style="width:600px;"/>
<div style="text-align:center;">Fig 2. &lt;GO&gt; insertion</div>
<br/>

For the decoder, we need two different inputs, one for training and one for prediction. In the training phase, the input is the target label, which still needs to be embedded. In the inference phase, the output of each time step becomes the input for the next time step. These inputs also need to be embedded, and the embedding vectors must be shared between the two phases.

To start the translation in the decoder (in other words, to preprocess the target label data for the training phase), we need to add the `<GO>` token in front of every target sentence to tell the model where to start. I use three TensorFlow functions for this:

- tf.strided_slice(input, begin, end, strides)
    - extracts a strided slice of a tensor, generalizing Python array indexing
    - selects the elements from begin to end, stepping by the given strides
- tf.fill(dims, value)
    - creates a tensor filled with a scalar value
- tf.concat(values, axis)
    - concatenates tensors along one dimension (here, the tf.fill result and the sliced targets)

After processing the target label data, we will embed it later when implementing the decoding_layer function.

In [13]:
def process_decoder_input(target_data, target_vocab_to_int, batch_size):
    """
    Preprocess target data for decoding: drop the last token of each
    sentence and put the <GO> id in front
    :return: Preprocessed target data
    """
    # get '<GO>' id
    go_id = target_vocab_to_int['<GO>']
    
    after_slice = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    after_concat = tf.concat( [tf.fill([batch_size, 1], go_id), after_slice], 1)
    
    return after_concat
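
The effect of the slice-and-concatenate step is easier to see on plain lists. The sketch below mimics it without TensorFlow; `shift_targets` is a hypothetical name and the ids are made up.

```python
GO_ID = 3  # id of '<GO>' in the CODES table

def shift_targets(target_batch, go_id=GO_ID):
    """Drop the last id of each row and prepend the <GO> id."""
    # row[:-1] plays the role of the strided slice,
    # [go_id] + ... plays the role of fill followed by concat
    return [[go_id] + row[:-1] for row in target_batch]

batch = [[10, 11, 12, 1], [20, 21, 1, 0]]   # 1 = <EOS>, 0 = <PAD>
print(shift_targets(batch))  # [[3, 10, 11, 12], [3, 20, 21, 1]]
```

Every row keeps its length, so the decoder input at step t is the target word from step t-1, with `<GO>` standing in at the first step.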


### Encoding

<br/>
<img src="https://i.ibb.co/8x4jpTC/fg3.png" style="width:600px;"/>
<div style="text-align:center;">Fig 3. Encoding model highlighted - Embedding/RNN layers</div>
<br/>
    


The encoder consists of two parts: an embedding layer and RNN layers. Each word in a sentence is represented by a vector with encoding_embedding_size features. The reasons for using tf.contrib.layers.embed_sequence are:
- It reduces the number of parameters in the network while preserving depth.
- It allows for arbitrary input shapes, which keeps the implementation simple and flexible.

For the RNN layers, we use LSTM, which can process not only single data points (such as images) but also entire sequences of data (such as speech or video). Dropout is applied to each LSTM cell, and multiple cells are then stacked together.

Embedding layer:

- tf.contrib.layers.embed_sequence

RNN layers:

- tf.nn.rnn_cell.LSTMCell (the code below uses tf.contrib.rnn.LSTMCell)
    - simply specifies how many internal units it has
- tf.contrib.rnn.DropoutWrapper
    - wraps a cell and applies dropout with the given keep probability
- tf.contrib.rnn.MultiRNNCell
    - stacks multiple RNN (type) cells
    - To give the model more expressive power, we can add multiple layers of LSTMs to process the data. The output of the first layer becomes the input of the second, and so on.

- tf.nn.dynamic_rnn
    - runs the stacked RNN cells over the embedded input, connecting the embedding layer and the RNN layers

In [14]:
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob, 
                   source_vocab_size, 
                   encoding_embedding_size):
    """
    :return: tuple (RNN output, RNN state)
    """
    embed = tf.contrib.layers.embed_sequence(rnn_inputs, 
                                             vocab_size=source_vocab_size, 
                                             embed_dim=encoding_embedding_size)
    
    stacked_cells = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.LSTMCell(rnn_size), keep_prob) for _ in range(num_layers)])
    
    outputs, state = tf.nn.dynamic_rnn(stacked_cells, 
                                       embed, 
                                       dtype=tf.float32)
    return outputs, state

### Decoding - Training process 

The decoding model can be thought of as two separate processes, training and inference. They do not have different architectures; they share the same architecture and parameters. What differs is the strategy used to feed the shared model.

For this section (training) and the next (inference), Fig 4 shows clearly what they are.

<img src="https://i.ibb.co/XkrwRy4/decoder-shift.png" style="width:700px;"/>
<div style="text-align:center;">Fig 4. Decoder shifted inputs</div>
<br/>

From the figure, we can see that in the decoder the output of the current time step becomes the input of the next time step. In the encoder, we could embed the whole input before running the RNN by using tf.contrib.layers.embed_sequence. In the decoder, that is not possible during inference, so we need a dynamic embedding capability instead. The figure also shows that the training and inference processes share the same embedding parameters. For training, the embedded input is passed in. For inference, only the embedding parameters used in training are passed in.

Let's do the training part first. The functions I use are as follows:

- tf.contrib.seq2seq.TrainingHelper
    - TrainingHelper is where we pass the embedded input. It is not a decoder model, just a helper instance. We need to pass it into BasicDecoder, which actually builds the decoder model.

- tf.contrib.seq2seq.BasicDecoder
    - BasicDecoder builds the decoder model. That is, it connects the RNN layer(s) on the decoder side with the helper and the output layer.

- tf.contrib.seq2seq.dynamic_decode
    - dynamic_decode unrolls the decoder model so that the actual prediction can be retrieved from BasicDecoder at each time step.

In [15]:
def decoding_layer_train(encoder_state, dec_cell, dec_embed_input, 
                         target_sequence_length, max_summary_length, 
                         output_layer, keep_prob):
    """
    Create a training process in decoding layer 
    :return: BasicDecoderOutput containing training logits and sample_id
    """
    dec_cell = tf.contrib.rnn.DropoutWrapper(dec_cell, 
                                             output_keep_prob=keep_prob)
    
    # for only input layer
    helper = tf.contrib.seq2seq.TrainingHelper(dec_embed_input, 
                                               target_sequence_length)
    
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, 
                                              helper, 
                                              encoder_state, 
                                              output_layer)

    # unrolling the decoder layer
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, 
                                                      impute_finished=True, 
                                                      maximum_iterations=max_summary_length)
    return outputs

### Decoding - Inference process

For inference, we take the output of the first step and feed it in as the input for the next word, and so on. That is why we need GreedyEmbeddingHelper here.

- tf.contrib.seq2seq.GreedyEmbeddingHelper
  - GreedyEmbeddingHelper dynamically takes the output of the current step and gives it to the next time step as input. To embed each output dynamically, the embedding parameters (just a matrix of weight values) must be provided. In addition, GreedyEmbeddingHelper requires the start_of_sequence_id repeated once per sentence in the batch (a vector whose length is the batch size) and a single end_of_sequence_id.
- tf.contrib.seq2seq.BasicDecoder
  - same as described in the training process section
- tf.contrib.seq2seq.dynamic_decode
  - same as described in the training process section
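
The greedy feedback loop can be illustrated without TensorFlow. The sketch below is a toy under loud assumptions: `TOY_SCORES` is a made-up table that stands in for the decoder (a real decoder also carries an RNN state from step to step), and `greedy_decode` is a hypothetical name.

```python
GO_ID, EOS_ID = 3, 1

# Hypothetical "model": maps the previous token id to scores over next ids.
TOY_SCORES = {
    3: {10: 0.9, 11: 0.1},
    10: {11: 0.7, 1: 0.3},
    11: {1: 0.8, 10: 0.2},
}

def greedy_decode(scores, max_len=10):
    """Feed each step's best-scoring token back in as the next input."""
    token = GO_ID          # decoding always starts from <GO>
    output = []
    for _ in range(max_len):
        next_scores = scores[token]
        # greedy choice: take the id with the highest score
        token = max(next_scores, key=next_scores.get)
        if token == EOS_ID:
            break          # <EOS> ends the sentence
        output.append(token)
    return output

print(greedy_decode(TOY_SCORES))  # [10, 11]
```

The `max_len` bound plays the same role as `maximum_iterations` in dynamic_decode: it stops the loop even if `<EOS>` is never produced.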

In [16]:
def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                         end_of_sequence_id, max_target_sequence_length,
                         vocab_size, output_layer, batch_size, keep_prob):
    """
    Create an inference process in decoding layer 
    :return: BasicDecoderOutput containing inference logits and sample_id
    """
    dec_cell = tf.contrib.rnn.DropoutWrapper(dec_cell, 
                                             output_keep_prob=keep_prob)
    
    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings, 
                                                      tf.fill([batch_size], start_of_sequence_id), 
                                                      end_of_sequence_id)
    
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, 
                                              helper, 
                                              encoder_state, 
                                              output_layer)
    
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, 
                                                      impute_finished=True, 
                                                      maximum_iterations=max_target_sequence_length)
    return outputs

### Build the Decoding Layer 

__Embed the target sequences__

- tf.contrib.layers.embed_sequence creates its embedding parameter internally, so we cannot look into or retrieve it. For the decoder, we therefore create the embedding parameter manually.

- The manually created embedding parameter is used in the training phase to convert the provided target data (sequences of word ids) with tf.nn.embedding_lookup before training is run.

tf.nn.embedding_lookup with a manually created embedding parameter returns a result similar to tf.contrib.layers.embed_sequence. In the inference process, whenever the decoder calculates the output of the current time step, that output is embedded with the shared embedding parameter and becomes the input for the next time step. We only need to pass the embedding parameter to the helper, which then handles this process.

The embedding_lookup function retrieves rows of the params tensor, similar to indexing an array in NumPy. In short, it selects the specified rows.
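
Row selection is all that happens in a lookup, which a few lines of plain Python can show. The matrix values below are made up, and this `embedding_lookup` is a stand-in written for illustration, not the TensorFlow function.

```python
# Toy embedding matrix: 5 vocabulary ids, 3 features each (values are made up)
embeddings = [
    [0.0, 0.0, 0.0],   # id 0
    [0.1, 0.2, 0.3],   # id 1
    [0.4, 0.5, 0.6],   # id 2
    [0.7, 0.8, 0.9],   # id 3
    [1.0, 1.1, 1.2],   # id 4
]

def embedding_lookup(params, ids):
    """Replace every id in a batch of sentences with its row of params."""
    return [[params[i] for i in sentence] for sentence in ids]

batch = [[3, 4], [2, 1]]   # 2 sentences, 2 word ids each
embedded = embedding_lookup(embeddings, batch)
print(embedded[0][0])  # [0.7, 0.8, 0.9], the row for id 3
```

The result has one more dimension than the input: a batch of shape [2, 2] becomes [2, 2, 3], with the last dimension equal to the embedding size.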

__Construct the decoder RNN layer(s)__
- As Fig 3 and Fig 4 show, the number of RNN layers in the decoder model has to equal the number of RNN layers in the encoder model, because the encoder's final state is used as the decoder's initial state.

Finally, we create an output layer that maps the outputs of the decoder to the elements of our vocabulary, and connect all the layers to get the probability of occurrence of each word.

In [17]:
def decoding_layer(dec_input, encoder_state,
                   target_sequence_length, max_target_sequence_length,
                   rnn_size,
                   num_layers, target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    """
    Create decoding layer
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    target_vocab_size = len(target_vocab_to_int)
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
    
    cells = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(num_layers)])
    
    with tf.variable_scope("decode"):
        output_layer = tf.layers.Dense(target_vocab_size)
        train_output = decoding_layer_train(encoder_state, 
                                            cells, 
                                            dec_embed_input, 
                                            target_sequence_length, 
                                            max_target_sequence_length, 
                                            output_layer, 
                                            keep_prob)

    with tf.variable_scope("decode", reuse=True):
        infer_output = decoding_layer_infer(encoder_state, 
                                            cells, 
                                            dec_embeddings, 
                                            target_vocab_to_int['<GO>'], 
                                            target_vocab_to_int['<EOS>'], 
                                            max_target_sequence_length, 
                                            target_vocab_size, 
                                            output_layer,
                                            batch_size,
                                            keep_prob)

    return (train_output, infer_output)

### Build the Seq2Seq model
Now we can put the previously defined functions encoding_layer, process_decoder_input, and decoding_layer together to build the seq2seq model.

In [18]:
def seq2seq_model(input_data, target_data, keep_prob, batch_size,
                  target_sequence_length,
                  max_target_sentence_length,
                  source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size,
                  rnn_size, num_layers, target_vocab_to_int):
    """
    Build the Sequence-to-Sequence model
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    enc_outputs, enc_states = encoding_layer(input_data, 
                                             rnn_size, 
                                             num_layers, 
                                             keep_prob, 
                                             source_vocab_size, 
                                             enc_embedding_size)
    
    dec_input = process_decoder_input(target_data, 
                                      target_vocab_to_int, 
                                      batch_size)
    
    train_output, infer_output = decoding_layer(dec_input,
                                                enc_states,
                                                target_sequence_length,
                                                max_target_sentence_length,
                                                rnn_size,
                                                num_layers,
                                                target_vocab_to_int,
                                                target_vocab_size,
                                                batch_size,
                                                keep_prob,
                                                dec_embedding_size)
    
    return train_output, infer_output

## Neural Network Training
### Hyperparameters

In [19]:
display_step = 300

epochs = 13
batch_size = 128

#epochs = 6
#batch_size = 256

rnn_size = 128
num_layers = 3

encoding_embedding_size = 200
decoding_embedding_size = 200

learning_rate = 0.001
keep_probability = 0.5

`display_step = 300` means that progress is printed once every 300 batches.
`epochs = 13` means that the entire dataset (the English and French sentence pairs) passes forward and backward through the neural network 13 times.
`batch_size = 128` is the number of training examples in a single batch. While testing, I also tried a batch size of 256, which speeds up training.
The learning rate comes from gradient descent, an iterative optimization algorithm that repeatedly adjusts the weights to reduce the loss. The learning rate determines how far each update moves the weights, that is, to what extent newly acquired information overrides old information. The lower the value, the more slowly we travel down the slope of the loss.
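
To make the effect of the learning rate concrete, here is a minimal sketch of plain gradient descent on the toy function f(x) = x². This is only an illustration: the notebook itself uses the Adam optimizer, and the function, starting point, and step count below are assumptions chosen for the example.

```python
def gradient_descent(lr, steps=50, start=5.0):
    """Minimise f(x) = x**2 with plain gradient descent (toy example)."""
    x = start
    for _ in range(steps):
        grad = 2.0 * x        # derivative of x**2
        x = x - lr * grad     # step against the gradient, scaled by the learning rate
    return x

slow = gradient_descent(lr=0.001)  # small steps: still far from the minimum at 0
fast = gradient_descent(lr=0.1)    # larger steps: very close to 0 after 50 updates
print(slow, fast)
```

With the same number of updates, the smaller learning rate leaves `x` much further from the minimum, which is the "slower travel along the downward slope" described above.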

### Build the Graph
The word "graph" here does not mean a chart. It is a structure that describes a computation.

TensorFlow (version 1) uses a dataflow graph to represent computations in terms of the dependencies between individual operations. This leads to a low-level programming model in which we first define the dataflow graph and then create a TensorFlow session to run parts of the graph across a set of local and remote devices. The graph is made up of operations and tensors.

A `tf.Graph` contains two relevant kinds of information:
- Graph structure
- Graph collections: for example, the variables that a `tf.train.Optimizer` optimizes by default

The `seq2seq_model` function creates the model. It defines how the forward pass and backpropagation flow. The last step before the model can be trained is choosing and applying an optimization algorithm. In this section, [TF contrib.seq2seq.sequence_loss](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/sequence_loss) is used to calculate the loss, then [TF train.AdamOptimizer](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) is applied to minimize that loss. Let's go over each step in the code cell below.

__load data from the checkpoint__
- (source_int_text, target_int_text) are the input data, and (source_vocab_to_int, target_vocab_to_int) are the dictionaries used to look up the index number of each word.
- max_target_sentence_length is the length of the longest sentence in the source input data. It is meant for the GreedyEmbeddingHelper when building the inference process in the decoder.

__create inputs__
- inputs (input_data, targets, target_sequence_length, max_target_sequence_length) from enc_dec_model_inputs function
- inputs (lr, keep_prob) from hyperparam_inputs function

__build seq2seq model__
- Build the model with the seq2seq_model function. It returns train_logits (the logits used to calculate the loss) and inference_logits (the output used for prediction).

__cost function__
- [TF contrib.seq2seq.sequence_loss](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/sequence_loss) is used. This loss function is a weighted softmax cross-entropy loss, designed to be applied to sequence models (RNNs).

So, what is the softmax function? It takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range (0, 1) that sum to 1.

And what is cross-entropy loss? Cross entropy measures the distance between what the model believes the output distribution should be and what the true distribution really is.
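
The two definitions above can be sketched in a few lines of NumPy. This is a minimal illustration for a single prediction, not the TensorFlow implementation; the score vector is made up for the example.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of real-valued scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy(probs, true_index):
    """Cross entropy against a one-hot target: -log of the probability of the true class."""
    return -np.log(probs[true_index])

scores = np.array([2.0, 1.0, 0.1])      # hypothetical scores for a 3-word vocabulary
probs = softmax(scores)

loss_if_first_is_true = cross_entropy(probs, 0)   # model favours the true word: low loss
loss_if_last_is_true = cross_entropy(probs, 2)    # model disfavours the true word: high loss
print(probs, loss_if_first_is_true, loss_if_last_is_true)
```

The loss is small when the model puts high probability on the correct word and grows as that probability shrinks.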

__Optimizer__
- TF train.AdamOptimizer is used, and this is where the learning rate is specified. Other optimization algorithms would work as well; Adam is just one choice.

__Gradient Clipping__
- Since recurrent neural networks are notorious for vanishing/exploding gradients, gradient clipping is a common technique to keep exploding gradients in check.
- The concept is simple. You choose thresholds that keep the gradient within a certain range. In this project, each gradient value is clipped to the range between -1 and 1.
- To apply this in TensorFlow, in brief: get the gradient values from the optimizer manually by calling compute_gradients, clip them with clip_by_value, and then hand the modified gradients back to the optimizer by calling apply_gradients.

<img src="https://i.ibb.co/DRDXyCx/gradient-clipping.png" style="width:700px;"/>
<div style="text-align:center;">Fig 4. Gradient Clipping</div>
<br/>
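
To see what clipping by value does to the numbers, here is a small NumPy sketch. `np.clip` stands in for `tf.clip_by_value`, and the gradient values are invented for the example.

```python
import numpy as np

# Hypothetical gradients for two variables (values made up for illustration).
grads = [np.array([0.3, -2.5, 1.7]), np.array([[-0.2, 4.0]])]

# Clip every element independently to the range [-1, 1].
capped = [np.clip(g, -1.0, 1.0) for g in grads]

for before, after in zip(grads, capped):
    print(before, '->', after)
```

Values already inside the range are left untouched, while larger ones are cut to the boundary. Because each element is clipped on its own, the direction of the gradient can change; clipping by norm (for example with `tf.clip_by_norm`) is an alternative that rescales the whole gradient instead.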

In [34]:
save_path = 'checkpoints/dev'
(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = load_preprocess()
max_target_sentence_length = max([len(sentence) for sentence in source_int_text])

train_graph = tf.Graph()
with train_graph.as_default():
    input_data, targets, target_sequence_length, max_target_sequence_length = enc_dec_model_inputs()
    lr, keep_prob = hyperparam_inputs()
    
    train_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                   targets,
                                                   keep_prob,
                                                   batch_size,
                                                   target_sequence_length,
                                                   max_target_sequence_length,
                                                   len(source_vocab_to_int),
                                                   len(target_vocab_to_int),
                                                   encoding_embedding_size,
                                                   decoding_embedding_size,
                                                   rnn_size,
                                                   num_layers,
                                                   target_vocab_to_int)
    
    training_logits = tf.identity(train_logits.rnn_output, name='logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')

    # https://www.tensorflow.org/api_docs/python/tf/sequence_mask
    # - Returns a mask tensor representing the first N positions of each cell.
    masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Loss function - weighted softmax cross entropy
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
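
The role of the mask in the loss can be sketched in NumPy. This is an approximation of what `tf.sequence_mask` and `sequence_loss` compute with their default settings as I understand them (cross entropy per token, weighted by the mask, divided by the sum of the weights), not the library code. The helper names and the random inputs are assumptions made for the example.

```python
import numpy as np

def sequence_mask(lengths, max_len):
    """1.0 for the first `length` positions of each row, 0.0 for the rest."""
    positions = np.arange(max_len)
    return (positions[None, :] < np.asarray(lengths)[:, None]).astype(np.float32)

def masked_sequence_loss(logits, targets, mask):
    """Weighted softmax cross entropy; logits: (batch, time, vocab), targets: (batch, time)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    batch_idx, time_idx = np.indices(targets.shape)
    token_loss = -log_probs[batch_idx, time_idx, targets]   # loss per token
    return (token_loss * mask).sum() / mask.sum()           # ignore masked positions

mask = sequence_mask([3, 1], max_len=4)          # [[1,1,1,0],[1,0,0,0]]
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 5))              # batch of 2, 4 time steps, vocabulary of 5
targets = np.array([[1, 2, 3, 0], [4, 0, 0, 0]])

loss = masked_sequence_loss(logits, targets, mask)

# Changing the logits at a masked (padded) position does not change the loss.
changed = logits.copy()
changed[0, 3] += 10.0
loss_changed = masked_sequence_loss(changed, targets, mask)
print(loss, loss_changed)
```

Positions where the mask is 0 contribute nothing, so padding does not influence training as long as the sequence lengths reflect the real sentence lengths.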

### Get Batches and Pad the source and target sequences
<br/>
<img src="https://i.ibb.co/wBZgXM6/pad-insert.png" style="width:300px;"/>
<div style="text-align:center;">Fig 5. Padding character in empty space of sentences shorter than the longest one in a batch</div>
<br/>

In [21]:
def pad_sentence_batch(sentence_batch, pad_int):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [pad_int] * (max_sentence - len(sentence)) for sentence in sentence_batch]


def get_batches(sources, targets, batch_size, source_pad_int, target_pad_int):
    """Batch targets, sources, and the lengths of their sentences together"""
    for batch_i in range(0, len(sources)//batch_size):
        start_i = batch_i * batch_size

        # Slice the right amount for the batch
        sources_batch = sources[start_i:start_i + batch_size]
        targets_batch = targets[start_i:start_i + batch_size]

        # Pad
        pad_sources_batch = np.array(pad_sentence_batch(sources_batch, source_pad_int))
        pad_targets_batch = np.array(pad_sentence_batch(targets_batch, target_pad_int))

        # Need the lengths for the _lengths parameters
        pad_targets_lengths = []
        for target in pad_targets_batch:
            pad_targets_lengths.append(len(target))

        pad_source_lengths = []
        for source in pad_sources_batch:
            pad_source_lengths.append(len(source))

        yield pad_sources_batch, pad_targets_batch, pad_source_lengths, pad_targets_lengths
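
As a quick check of the padding step, here is the same padding logic applied to a tiny batch. The function is repeated so the snippet runs on its own; the word ids and the pad id `0` are made up for the example.

```python
def pad_sentence_batch(sentence_batch, pad_int):
    """Pad every sentence to the length of the longest sentence in the batch."""
    max_sentence = max(len(sentence) for sentence in sentence_batch)
    return [sentence + [pad_int] * (max_sentence - len(sentence))
            for sentence in sentence_batch]

batch = [[5, 6, 7], [8], [9, 10]]            # three sentences of different lengths
padded = pad_sentence_batch(batch, pad_int=0)

lengths_before = [len(s) for s in batch]     # real sentence lengths
lengths_after = [len(s) for s in padded]     # lengths measured after padding
print(padded)          # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(lengths_before)  # [3, 1, 2]
print(lengths_after)   # [3, 3, 3]
```

Note that `get_batches` measures the lengths on the already padded arrays, so every length it yields for a batch equals the padded length, as in `lengths_after` here.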

### Train

`get_accuracy`
- Compare the lengths of the target (labels) and the logits (predictions).
- Pad the shorter of the two with 0s at the end.
  - `[(0,0),(0,max_seq - target.shape[1])]` is the pad width for a 2D array. The first pair `(0,0)` means no padding along the first dimension. The second pair `(0, ...)` means no padding before the second dimension but padding after it, with as many 0s as needed for both arrays to have the same shape (length).
- Finally, return the mean of the element-wise comparison, that is, the fraction of positions where the target and the logits have the same value. This works because both arrays now have the same shape.

[numpy pad function](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.pad.html)
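
The padding and comparison described above can be tried on a tiny example. The token ids are invented, and the snippet assumes a single sentence where the prediction is one token shorter than the target.

```python
import numpy as np

target = np.array([[4, 7, 9, 1]])     # 1 sentence, 4 tokens
prediction = np.array([[4, 7, 1]])    # 1 sentence, only 3 tokens

# No padding on the batch dimension, one 0 appended along the sequence dimension.
pad_width = [(0, 0), (0, target.shape[1] - prediction.shape[1])]
prediction = np.pad(prediction, pad_width, 'constant')

accuracy = np.mean(np.equal(target, prediction))
print(prediction)  # [[4 7 1 0]]
print(accuracy)    # 0.5
```

Two of the four positions match, so the accuracy is 0.5. The padded positions take part in the comparison, and `np.pad` fills with 0 in `'constant'` mode, which only lines up with the padding token if `<PAD>` maps to id 0 (an assumption here).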

In [22]:
def get_accuracy(target, logits):
    """
    Calculate accuracy
    """
    max_seq = max(target.shape[1], logits.shape[1])
    if max_seq - target.shape[1]:
        target = np.pad(
            target,
            [(0,0),(0,max_seq - target.shape[1])],
            'constant')
    if max_seq - logits.shape[1]:
        logits = np.pad(
            logits,
            [(0,0),(0,max_seq - logits.shape[1])],
            'constant')

    return np.mean(np.equal(target, logits))

# Split data to training and validation sets
train_source = source_int_text[batch_size:]
train_target = target_int_text[batch_size:]
valid_source = source_int_text[:batch_size]
valid_target = target_int_text[:batch_size]
(valid_sources_batch, valid_targets_batch, valid_sources_lengths, valid_targets_lengths ) = next(get_batches(valid_source,
                                                                                                             valid_target,
                                                                                                             batch_size,
                                                                                                             source_vocab_to_int['<PAD>'],
                                                                                                             target_vocab_to_int['<PAD>']))                                                                                                  
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(epochs):
        for batch_i, (source_batch, target_batch, sources_lengths, targets_lengths) in enumerate(
                get_batches(train_source, train_target, batch_size,
                            source_vocab_to_int['<PAD>'],
                            target_vocab_to_int['<PAD>'])):

            _, loss = sess.run(
                [train_op, cost],
                {input_data: source_batch,
                 targets: target_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 keep_prob: keep_probability})


            if batch_i % display_step == 0 and batch_i > 0:
                batch_train_logits = sess.run(
                    inference_logits,
                    {input_data: source_batch,
                     target_sequence_length: targets_lengths,
                     keep_prob: 1.0})

                batch_valid_logits = sess.run(
                    inference_logits,
                    {input_data: valid_sources_batch,
                     target_sequence_length: valid_targets_lengths,
                     keep_prob: 1.0})

                train_acc = get_accuracy(target_batch, batch_train_logits)
                valid_acc = get_accuracy(valid_targets_batch, batch_valid_logits)

                print('Epoch {:>3} Batch {:>4}/{} - Train Accuracy: {:>6.4f}, Validation Accuracy: {:>6.4f}, Loss: {:>6.4f}'
                      .format(epoch_i, batch_i, len(source_int_text) // batch_size, train_acc, valid_acc, loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_path)
    print('Model Trained and Saved')

Epoch   0 Batch  300/1077 - Train Accuracy: 0.3993, Validation Accuracy: 0.4876, Loss: 1.9738
Epoch   0 Batch  600/1077 - Train Accuracy: 0.4747, Validation Accuracy: 0.4911, Loss: 1.0928
Epoch   0 Batch  900/1077 - Train Accuracy: 0.5473, Validation Accuracy: 0.5987, Loss: 0.8654
Epoch   1 Batch  300/1077 - Train Accuracy: 0.6110, Validation Accuracy: 0.6275, Loss: 0.6749
Epoch   1 Batch  600/1077 - Train Accuracy: 0.6656, Validation Accuracy: 0.6449, Loss: 0.5348
Epoch   1 Batch  900/1077 - Train Accuracy: 0.6617, Validation Accuracy: 0.6669, Loss: 0.5025
Epoch   2 Batch  300/1077 - Train Accuracy: 0.7126, Validation Accuracy: 0.7060, Loss: 0.3989
Epoch   2 Batch  600/1077 - Train Accuracy: 0.7891, Validation Accuracy: 0.7525, Loss: 0.3397
Epoch   2 Batch  900/1077 - Train Accuracy: 0.8094, Validation Accuracy: 0.7919, Loss: 0.3237
Epoch   3 Batch  300/1077 - Train Accuracy: 0.8491, Validation Accuracy: 0.8256, Loss: 0.2686
Epoch   3 Batch  600/1077 - Train Accuracy: 0.8315, Validati

When we train the model, we want to avoid overfitting. Both overfitting and underfitting reduce performance. Overfitting happens when a model learns the detail and noise in the training data, picking up random fluctuations that do not generalize to new data. To find out whether a model is overfitting, we can use cross-validation.
At the moment this model has an accuracy of ~98% on the training set and ~96% on the validation set. This suggests that we can expect roughly ~96% accuracy on similar new data, although the validation set here is only a single batch of `batch_size` sentences, so this is a rough estimate.

### Save Parameters
Save the `save_path` parameter for inference.

In [23]:
def save_params(params):
    with open('params.p', 'wb') as out_file:
        pickle.dump(params, out_file)


def load_params():
    with open('params.p', mode='rb') as in_file:
        return pickle.load(in_file)

In [24]:
# Save parameters for checkpoint
save_params(save_path)

## Checkpoint

In [27]:
import tensorflow as tf
import numpy as np
import problem_unittests as tests

_, (source_vocab_to_int, target_vocab_to_int), (source_int_to_vocab, target_int_to_vocab) = load_preprocess()
load_path = load_params()

## Translate
This will translate `translate_sentence` from English to French.
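
Before a sentence can be fed to the model, each word has to be mapped to its integer id, with unknown words falling back to `<UNK>`. Here is a minimal sketch of that lookup on a toy vocabulary. The helper name `words_to_ids` and all ids are hypothetical, not the ones used in the notebook.

```python
# Hypothetical toy vocabulary; the real one comes from load_preprocess().
toy_vocab = {'<UNK>': 2, 'he': 10, 'saw': 11, 'a': 12, 'truck': 13, '.': 14}

def words_to_ids(sentence, vocab_to_int):
    """Map each whitespace-separated word to its id, using <UNK> for unknown words."""
    unk = vocab_to_int['<UNK>']
    return [vocab_to_int.get(word, unk) for word in sentence.split()]

ids = words_to_ids('he saw a purple truck .', toy_vocab)
print(ids)  # [10, 11, 12, 2, 13, 14]
```

The word `purple` is not in the toy vocabulary, so it is replaced by the id of `<UNK>`.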

In [33]:
def sentence_to_seq(sentence, vocab_to_int):
    results = []
    for word in sentence.split(" "):
        if word in vocab_to_int:
            results.append(vocab_to_int[word])
        else:
            results.append(vocab_to_int['<UNK>'])
            
    return results

translate_sentence = 'he saw a old yellow truck .'

translate_sentence = sentence_to_seq(translate_sentence, source_vocab_to_int)

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(load_path + '.meta')
    loader.restore(sess, load_path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')

    translate_logits = sess.run(logits, {input_data: [translate_sentence]*batch_size,
                                         target_sequence_length: [len(translate_sentence)*2]*batch_size,
                                         keep_prob: 1.0})[0]

print('Input')
print('  Word Ids:      {}'.format([i for i in translate_sentence]))
print('  English Words: {}'.format([source_int_to_vocab[i] for i in translate_sentence]))

print('\nPrediction')
print('  Word Ids:      {}'.format([i for i in translate_logits]))
print('  French Words: {}'.format(" ".join([target_int_to_vocab[i] for i in translate_logits])))

INFO:tensorflow:Restoring parameters from checkpoints/dev
Input
  Word Ids:      [59, 112, 228, 30, 102, 161, 65]
  English Words: ['he', 'saw', 'a', 'old', 'yellow', 'truck', '.']

Prediction
  Word Ids:      [135, 357, 22, 103, 236, 33, 138, 125, 1]
  French Words: il a pas un vieux camion jaune . <EOS>


Here, the predicted French sentence means roughly "He doesn't have an old yellow truck", which differs from the input: a single wrong word changes the whole meaning. Word for word, though, the prediction is not bad. Since we only used two small datasets and trained for only about 10 minutes, one way to improve the result is to train on the original, larger dataset.

# Conclusion

To build an encoder-decoder, we use an RNN in both the encoder and the decoder. Each word of the input sequence is associated with a vector through a lookup table. We run an LSTM over the sequence of vectors and keep the last hidden state. This gives us a vector that captures the meaning of the input sequence, which we use to generate the target sequence word by word: it is fed to another LSTM as its hidden state. At each step, the decoder output is transformed into a vector with the same size as the vocabulary, and softmax turns that vector into probabilities that determine the output word.
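
The last step, choosing each output word from a probability vector, can be sketched as a greedy decoding loop. In this toy version a fixed score table stands in for the trained LSTM decoder, so the vocabulary, the ids, and the scores are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary; the ids are not the ones used in the notebook.
vocab = ['<GO>', '<EOS>', 'il', 'a', 'vu']

# Stand-in for the decoder: row = previous token id, column = score of the next token id.
next_token_scores = np.array([
    [0.0, 0.0, 3.0, 0.0, 0.0],   # after <GO>  -> 'il'
    [0.0, 0.0, 0.0, 0.0, 0.0],   # after <EOS> (never used)
    [0.0, 0.0, 0.0, 3.0, 0.0],   # after 'il'  -> 'a'
    [0.0, 0.0, 0.0, 0.0, 3.0],   # after 'a'   -> 'vu'
    [0.0, 3.0, 0.0, 0.0, 0.0],   # after 'vu'  -> <EOS>
])

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

def greedy_decode(scores, go_id, eos_id, max_len):
    """Start from <GO>, always pick the most probable next token, stop at <EOS>."""
    token, output = go_id, []
    for _ in range(max_len):
        probs = softmax(scores[token])   # probability vector over the vocabulary
        token = int(np.argmax(probs))    # greedy choice
        output.append(token)
        if token == eos_id:
            break
    return output

decoded = greedy_decode(next_token_scores, go_id=0, eos_id=1, max_len=10)
print(' '.join(vocab[i] for i in decoded))  # il a vu <EOS>
```

Greedy decoding commits to the single best word at every step. Beam search, by contrast, keeps several candidate sequences alive at once.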

There is also an interesting issue: for example, two English words can be equivalent to one French word. Attention and beam search are interesting topics I found that may help with this problem.

I really learned a lot from this project. It is not super difficult, even though I am still thinking through some parts. It is a good start for NLP: how to preprocess data, why we should tokenize sentences, how an RNN works, which models make up the seq2seq model, how to train the model, and so on. I think I have figured these out. It is a very interesting topic.

We tried long short-term memory (LSTM) for the RNN. LSTM has feedback connections that make it a "general purpose computer" (that is, it can compute anything that a Turing machine can). It can process not only single data points (such as images) but also entire sequences of data (such as speech or video). LSTM and GRU are two widely used gated RNN cells. We could also try GRU and then compare and evaluate the two models.

For future work, I will try to translate Chinese to English, or more generally take on the task of automatically converting one natural language into another while producing fluent text in the output language. Also, since we have stored the data as a checkpoint, we can simply reuse it in the future.