## Lab: Transformer Network

In this notebook you'll explore the Transformer architecture, a neural network that processes all positions of a sequence in parallel, which speeds up training.

**After this assignment you'll be able to**:

* Create positional encodings to capture sequential relationships in data
* Calculate scaled dot-product self-attention with word embeddings
* Implement masked multi-head attention
* Build a Transformer model

In [1]:
#pip install transformers !!!

import tensorflow as tf
import pandas as pd
import time
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization
from transformers import DistilBertTokenizerFast #, TFDistilBertModel
from transformers import TFDistilBertForTokenClassification
from tqdm.notebook import tqdm

### 1 - Positional Encoding

In sequence-to-sequence tasks, the relative order of the data is essential to its meaning. When you train sequential neural networks such as RNNs, you feed the inputs into the network in order, so information about the order of the data is automatically available to the model. However, when you train a Transformer network, you feed the data into the model all at once. While this reduces training time, the model receives no information about the order of the data. This is where positional encoding is useful: you can explicitly encode the positions of the inputs and pass them into the network using these sine and cosine formulas:
    
$$
PE_{(pos, 2i)}= \sin\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
\tag{1}$$
<br>
$$
PE_{(pos, 2i+1)}= \cos\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
\tag{2}$$

* $d$ is the dimension of the word embedding and positional encoding
* $pos$ is the position of the word.
* $i$ refers to each of the different dimensions of the positional encoding.

The values of the sine and cosine equations are small enough (between -1 and 1) that when you add the positional encoding to a word embedding, the word embedding is not significantly distorted. The sum of the positional encoding and the word embedding is ultimately what is fed into the model. Using a combination of these two equations helps the Transformer network attend to the relative positions of the input data. In this assignment, all vectors are horizontal (row vectors), and all matrix multiplications are adjusted accordingly.

### 1.1 - Get Sine and Cosine Angles

Get the possible angles used to compute the positional encodings by calculating the inner term of the sine and cosine equations: 

$$\frac{pos}{10000^{\frac{2i}{d}}} \tag{3}$$

Complete `get_angles()` to get the possible angles for sine and cosine positional encodings.
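
As a quick sanity check of equation (3), the sketch below evaluates the angle for a single position and dimension index by hand, then shows how a column vector of positions and a row vector of indices broadcast to a full grid of angles. It plugs `i` straight into equation (3) as the raw dimension index; that reading is an assumption, and the variable names are illustrative only.

```python
import numpy as np

d = 5                       # toy encoding size (assumption for illustration)
pos, i = 1, 1               # one position and one dimension index

# Equation (3) for a single (pos, i) pair
angle = pos / 10000 ** (2 * i / d)

# Broadcasting: a (3, 1) column of positions against a (1, 5) row of indices
pos_col = np.arange(3)[:, np.newaxis]
i_row = np.arange(d)[np.newaxis, :]
angle_grid = pos_col / np.power(10000, 2 * i_row / d)

print(angle)                # roughly 0.0251
print(angle_grid.shape)     # (3, 5)
```

The grid has one row per position and one column per dimension index, which is the shape the positional-encoding step expects.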

In [2]:
def get_angles(pos, i, d):
    """
    Get the angles for the positional encoding
    
    Arguments:
        pos -- Column vector containing the positions [[0], [1], ..., [N-1]]
        i -- Row vector containing the dimension span [[0, 1, 2, ..., M-1]]
        d (int) -- Encoding/embedding size
    
    Returns:
        angles -- (pos, d) numpy array
    """
    angles = ?
    
    return angles

### 1.2 - Sine and Cosine Positional Encodings

The computed angles are used to calculate the sine and cosine positional encodings.

$$
PE_{(pos, 2i)}= \sin\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
$$
<br>
$$
PE_{(pos, 2i+1)}= \cos\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
$$

The function `positional_encoding()` calculates the sine and cosine positional encodings.

**NOTE:** The sine equation is applied at the even dimension indices ($2i$) and the cosine equation at the odd dimension indices ($2i+1$).

In [3]:
def positional_encoding(positions, d):
    """
    Precomputes a matrix with all the positional encodings 
    
    Arguments:
        positions (int) -- Maximum number of positions to be encoded 
        d (int) -- Encoding/embedding size 
    
    Returns:
        pos_encoding -- matrix with the positional encodings, shape (1, positions, d)
    """
    # initialize a matrix angle_rads of all the angles
    # Transform from row to column vector (np.newaxis adds a new dimension)
    angle_rads = get_angles(np.arange(positions)[:, np.newaxis],
                            np.arange(d)[np.newaxis, :], d)
  
    # -> angle_rads has dim (positions, d)
    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
  
    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
     
    pos_encoding = angle_rads[np.newaxis, ...]
    
    # tf.cast changes the type of the tensor to float32
    return tf.cast(pos_encoding, dtype=tf.float32)

In [18]:
#Test function positional_encoding for some small values of the arguments

?

(1, 3, 5)


<tf.Tensor: shape=(1, 3, 5), dtype=float32, numpy=
array([[[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.0000000e+00,
         0.0000000e+00],
        [8.4147096e-01, 9.9968451e-01, 6.3095731e-04, 1.0000000e+00,
         3.9810718e-07],
        [9.0929741e-01, 9.9873835e-01, 1.2619144e-03, 1.0000000e+00,
         7.9621435e-07]]], dtype=float32)>
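
The printed values above can be reproduced with a few lines of NumPy. This is only a sketch: it assumes that the angle for column `k` is `pos / 10000**(2k/d)`, that is, equation (3) applied with `i` equal to the column index, which is what the sample output suggests. The original paper pairs columns instead (exponent `2 * (k // 2) / d`), so treat `pe_sketch` as a hypothetical helper and not as the reference solution.

```python
import numpy as np

def pe_sketch(positions, d):
    """Hypothetical NumPy helper mirroring positional_encoding()."""
    pos = np.arange(positions)[:, np.newaxis]      # (positions, 1)
    k = np.arange(d)[np.newaxis, :]                # (1, d)
    angles = pos / np.power(10000.0, 2 * k / d)    # (positions, d)
    angles[:, 0::2] = np.sin(angles[:, 0::2])      # even columns -> sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])      # odd columns -> cosine
    return angles[np.newaxis, ...].astype(np.float32)

pe = pe_sketch(3, 5)
print(pe.shape)   # (1, 3, 5)
print(pe[0, 1])   # encoding of position 1
```

Position 0 always encodes to alternating 0 and 1, because $\sin(0) = 0$ and $\cos(0) = 1$.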

### 2 - Masking

There are two types of masks used when building a Transformer network: the *padding mask* and the *look-ahead mask*. Both help the softmax computation give the appropriate weights to the words in the input sentence. 

### 2.1 - Padding Mask

An input sequence may exceed the maximum sequence length the network can process. Suppose the maximum length of the model is 5 and it is fed the following sequences:

    [["Do","you","know","when","Jane","is","going","to", "visit", "Africa"], 
     ["Jane", "visits", "Africa", "in", "September" ],
     ["Exciting", "!"]
    ]

which is vectorized as:

    [[ 71, 121, 4, 56, 99, 2344, 345, 1284, 15],
     [ 56, 1285, 15, 181, 545],
     [ 87, 600]
    ]
    
When passing sequences into a transformer model, it is important that they are of uniform length. You can achieve this by padding the sequence with zeros, and truncating sentences that exceed the maximum length of the model:

    [[ 71, 121, 4, 56, 99],
     [ 2344, 345, 1284, 15, 0],
     [ 56, 1285, 15, 181, 545],
     [ 87, 600, 0, 0, 0],
    ]
    
Sequences longer than the maximum length of 5 will be truncated, and zeros will be added to the truncated sequence to achieve uniform length. Similarly, for sequences shorter than the maximum length, zeros will be added for padding. However, these zeros will affect the softmax calculation - this is when a padding mask comes in handy! By multiplying a padding mask by -1e9 and adding it to the sequence, you mask out the zeros by setting them close to negative infinity. 

After masking, the input should go from `[87, 600, 0, 0, 0]` to `[87, 600, -1e9, -1e9, -1e9]`, so that when you take the softmax, the zeros don't affect the score.
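
To see why the padding positions must be pushed towards negative infinity, the following NumPy sketch compares a softmax over raw scores with a softmax over masked scores. The scores and the `softmax` helper are made up for illustration; the mask uses 1 for padding, as described above.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.0, 0.0, 0.0])    # last three slots are padding
pad_mask = np.array([0.0, 0.0, 1.0, 1.0, 1.0])  # 1 marks a padded position

unmasked = softmax(scores)
masked = softmax(scores + pad_mask * -1e9)

print(unmasked.round(3))  # padding still receives some weight
print(masked.round(3))    # padding weight is driven to zero
```

Without the mask, the three padded slots together absorb a noticeable share of the probability mass; with it, the weights are shared only among the real tokens.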

In [5]:
def create_padding_mask(seq):
    """
    Creates a matrix mask for the padding cells
    
    Arguments:
        seq -- (n, m) matrix
    
    Returns:
        mask -- (n, 1, 1, m) binary tensor
    """
    # tf.math.equal(seq, 0) returns a boolean tensor that is True where seq == 0;
    # tf.cast converts it to float32, so padding cells become 1 and all others 0
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)  # (n, m)
  
    # add extra dimensions to add the padding to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :] #(n, 1, 1, m)

In [6]:
x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])

#Apply mask to x and check what is the result ?

?

tf.Tensor(
[[7. 6. 0. 0. 1.]
 [1. 2. 3. 0. 0.]
 [0. 0. 0. 4. 5.]], shape=(3, 5), dtype=float32)

tf.Tensor(
[[[[0. 0. 1. 1. 0.]]]

 [[[0. 0. 0. 1. 1.]]]

 [[[1. 1. 1. 0. 0.]]]], shape=(3, 1, 1, 5), dtype=float32)


If we multiply this mask by -1e9 and add it to the sample input sequences, the zeros are essentially set to negative infinity.
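
The two extra axes in the mask exist so that it broadcasts against attention logits. The sketch below rebuilds the same mask in NumPy and adds it to a dummy logits tensor; the number of heads and the logits themselves are arbitrary assumptions chosen only to show the shapes.

```python
import numpy as np

x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])

# 1 where the token id is 0 (padding), 0 elsewhere; shape (3, 1, 1, 5)
mask = (x == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

# Dummy attention logits: (batch=3, heads=2, seq_len_q=5, seq_len_k=5)
logits = np.zeros((3, 2, 5, 5), dtype=np.float32)
masked_logits = logits + mask * -1e9

print(mask.shape)              # (3, 1, 1, 5)
print(masked_logits.shape)     # (3, 2, 5, 5)
print(masked_logits[0, 0, 0])  # key positions 2 and 3 are pushed to -1e9
```

Every head and every query position in a sentence receives the same key-side mask, which is why a single row per sentence is enough.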

### 2.2 - Look-ahead Mask

The look-ahead mask follows similar intuition. In training, you will have access to the complete correct output of the training example. The look-ahead mask helps your model pretend that it correctly predicted a part of the output and see if, *without looking ahead*, it can correctly predict the next output. 

For example, suppose the expected correct output is `[1, 2, 3]` and you want to check whether the model, having correctly predicted the first value, can predict the second. You would mask out the second and third values, feed in the masked sequence `[1, -1e9, -1e9]`, and see whether the model generates `[1, 2, -1e9]`.

In [8]:
def create_look_ahead_mask(size):
    """
    Returns a lower triangular matrix filled with ones
    
    Arguments:
        size -- matrix size
    
    Returns:
        mask -- (size, size) tensor
    """
    # Transform square (size, size) tensor into: 
    # lower triangular part =1
    # Upper triangular part = 0
    mask = tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask 

In [23]:
# Generate a row vector with 5 random values with uniform distribution 
#  in the range (0,1)
x = ?

#Apply Look-ahead Mask and observe the result

?


[0.90884209 0.37910562 0.05240325 0.69914559 0.9440902 ]

(5,)




<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[1., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0.],
       [1., 1., 1., 0., 0.],
       [1., 1., 1., 1., 0.],
       [1., 1., 1., 1., 1.]], dtype=float32)>
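
A small NumPy sketch of how such a matrix restricts attention: row `t` may only look at columns `0..t`. Since the matrix has ones at the visible positions, the sketch hides the complementary entries by adding `(1 - mask) * -1e9`; the uniform scores are an assumption made so the result is easy to read.

```python
import numpy as np

size = 4
mask = np.tril(np.ones((size, size)))        # ones on and below the diagonal

scores = np.zeros((size, size))              # dummy, uniform attention scores
hidden = scores + (1.0 - mask) * -1e9        # future positions -> about -inf

weights = np.exp(hidden - hidden.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.round(2))
```

The first row puts all of its weight on position 0, and the last row spreads its weight evenly over all four positions.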

### 3 - Self-Attention

As the authors of the Transformer paper state, "Attention is All You Need". 

<img src="images/self-attention.png" alt="Encoder" width="600"/>
<caption><center><font color='purple'><b>Figure 1: Self-Attention calculation visualization</b></font></center></caption>
    
The use of self-attention paired with traditional networks allows for parallelization, which speeds up training. The function **scaled dot product attention** takes a query, a key, a value, and an optional mask as inputs and returns rich, attention-based vector representations of the words in a sequence. This type of self-attention can be mathematically expressed as:
$$
\text{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}+M\right) V \tag{4}
$$

* $Q$ is the matrix of queries 
* $K$ is the matrix of keys
* $V$ is the matrix of values
* $M$ is the optional mask to apply 
* ${d_k}$ is the dimension of the keys, which is used to scale the dot products down so that the softmax does not saturate
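
The role of $\sqrt{d_k}$ can be checked numerically. Assuming the query and key entries are independent with zero mean and unit variance (an idealisation, not something the notebook states), their dot product has a standard deviation of about $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to about 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))

dots = np.sum(q * k, axis=-1)      # one dot product per row
scaled = dots / np.sqrt(d_k)

print(dots.std())     # close to sqrt(64) = 8
print(scaled.std())   # close to 1
```

Large unscaled logits would push the softmax towards a one-hot output with very small gradients, which is what the scaling is meant to avoid.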


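The function **scaled_dot_product_attention()** creates attention-based representations. 
    
**Note**: The `mask` parameter can be `None`, a padding mask, or a look-ahead mask. Because the implementation below adds `mask * -1e9` to the logits, a 1 in the mask marks a position to hide. The matrix returned by `create_look_ahead_mask()` uses the opposite convention (1 marks a visible position), so invert it with `1 - mask` before passing it to this function. 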

In [10]:
def scaled_dot_product_attention(q, k, v, mask=None):
    """
    Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension,
    i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type (padding / look-ahead).

    Arguments:
        q -- query shape == (..., seq_len_q, depth)
        k -- key shape == (..., seq_len_k, depth)
        v -- value shape == (..., seq_len_v, depth_v)
        mask -- Float tensor with shape broadcastable 
                to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
        output -- attention output, shape (..., seq_len_q, depth_v)
        attention_weights -- shape (..., seq_len_q, seq_len_k)
    """
    # Q*K'
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add mask to the scaled tensor
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # softmax is normalized on the last axis (seq_len_k) so that the
    # scores add up to 1
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    
    # attention_weights * V
    output = tf.matmul(attention_weights, v) # (..., seq_len_q, depth_v)

    return output, attention_weights
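
For intuition, here is a NumPy mirror of the function above applied to a tiny example. It follows the same convention (a 1 in the mask hides a key); `attention_np` and the toy matrices are illustrative names and values, not part of the assignment.

```python
import numpy as np

def attention_np(q, k, v, mask=None):
    """NumPy sketch of scaled dot-product attention."""
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9          # 1 in the mask hides that key
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

q = np.array([[1.0, 0.0]])                     # one query
k = np.array([[1.0, 0.0], [0.0, 1.0]])         # two keys
v = np.array([[10.0, 0.0], [0.0, 10.0]])       # two values

out, w = attention_np(q, k, v)
out_masked, w_masked = attention_np(q, k, v, mask=np.array([[0.0, 1.0]]))

print(w.round(3))          # the first key gets the larger weight
print(w_masked.round(3))   # the second key is hidden
```

The query is most similar to the first key, so the output leans towards the first value; once the second key is masked, the output equals the first value exactly.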

### 4 - Encoder

Here, the Transformer Encoder is implemented by pairing multi-head attention with a feed-forward neural network, FFNN (Figure 2a). 
<img src="images/encoder_layer.png" alt="Encoder" width="250"/>
<caption><center><font color='purple'><b>Figure 2a: Transformer encoder layer</b></font></center></caption>

* `MultiHeadAttention (MHA)` can be thought of as computing self-attention several times in parallel to detect different features. 
* The FFNN contains two Dense layers and is implemented as `FullyConnected`.

The input sentence first passes through a *multi-head attention layer*, where the encoder looks at the other words in the input sentence as it encodes a specific word. The outputs of the multi-head attention layer are then fed to an FFNN. The same FFNN is applied independently to each position.
   
* For the `MultiHeadAttention (MHA)` layer, you will use the [Keras implementation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention). If you're curious about how to split the query matrix Q, key matrix K, and value matrix V into different heads, you can look through the implementation. 
* You will also use the [Sequential API](https://keras.io/api/models/sequential/) with two dense layers to build the FFNN layers.
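
The head split mentioned above boils down to a reshape and a transpose. The sketch below shows the idea in NumPy; it assumes the model dimension is divided evenly across the heads, which is the scheme from the original paper, whereas the Keras layer takes a per-head `key_dim` and learns its own projections. `split_heads` is a hypothetical helper.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)."""
    batch, seq_len, d_model = x.shape
    depth = d_model // num_heads
    x = x.reshape(batch, seq_len, num_heads, depth)
    return x.transpose(0, 2, 1, 3)

x = np.arange(2 * 3 * 8, dtype=np.float32).reshape(2, 3, 8)
heads = split_heads(x, num_heads=4)

print(heads.shape)      # (2, 4, 3, 2)
print(heads[0, 0])      # first two features of every position in sentence 0
```

Each head then runs scaled dot-product attention on its own slice of features, and the results are concatenated back to the model dimension.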

In [11]:
def FullyConnected(embedding_dim, fully_connected_dim):
    return tf.keras.Sequential([
        # (batch_size, seq_len, fully_connected_dim)
        tf.keras.layers.Dense(fully_connected_dim, activation='relu'),
        # (batch_size, seq_len, embedding_dim)
        tf.keras.layers.Dense(embedding_dim)
    ])
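
That the same FFNN is applied independently to each position can be verified with a NumPy stand-in for the two dense layers. The weights here are random placeholders (an assumption for the demo), not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, fully_connected_dim = 4, 8

# Random stand-ins for the kernels and biases of the two Dense layers
w1 = rng.standard_normal((embedding_dim, fully_connected_dim))
b1 = np.zeros(fully_connected_dim)
w2 = rng.standard_normal((fully_connected_dim, embedding_dim))
b2 = np.zeros(embedding_dim)

def ffn(x):
    """relu(x @ w1 + b1) @ w2 + b2, applied along the last axis."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

x = rng.standard_normal((2, 3, embedding_dim))   # (batch, seq_len, embedding_dim)
full = ffn(x)                                    # all positions at once
single = ffn(x[:, 1, :])                         # position 1 on its own

print(full.shape)                           # (2, 3, 4)
print(np.allclose(full[:, 1, :], single))   # True
```

Because the layers only act on the last axis, processing one position in isolation gives the same result as processing the whole sequence.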

### 4.1 - Encoder Layer

Here `EncoderLayer()` (Fig. 2a) is implemented using the `call()` method. Residual connections and layer normalization are used to speed up training. The method performs the following steps: 
1. Pass the Q, V, K matrices and a boolean mask to a multi-head attention layer. 
2. Pass the output of the multi-head attention layer to a dropout layer. The `training` parameter sets the mode of the model. 
3. Add a skip connection by adding the original input `x` and the output of the dropout layer. 
4. Pass the output through the first layer normalization.
5. Pass the output through the FFNN.  
6. Add a dropout layer, add a skip connection, apply layer normalization. 

**Additional Hints**:
* The `__init__` method creates all the layers that will be accessed by the `call` method. Wherever you want to use a layer defined inside the `__init__` method, you will have to use the syntax `self.[insert layer name]`. 
* You may find the documentation of [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention) helpful.
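
Steps 3 and 4 (and their repetition in step 6) are often called "Add & Norm". A NumPy sketch of that operation follows; it omits the learnable scale and offset that `LayerNormalization` also has, so it is a simplification and not a drop-in replacement. `add_and_norm` is a made-up name.

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual connection followed by layer normalization over the last axis."""
    y = x + sublayer_out
    mean = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 3, 8))              # (batch, seq_len, embedding_dim)
sublayer_out = rng.standard_normal((2, 3, 8))   # e.g. attention output after dropout

out = add_and_norm(x, sublayer_out)
print(out.shape)                       # (2, 3, 8)
print(np.abs(out.mean(axis=-1)).max()) # very close to 0
```

Each position is normalized on its own across the feature axis, so the statistics do not depend on the batch or on other positions.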

In [12]:
class EncoderLayer(tf.keras.layers.Layer):
    """
    The encoder layer is composed of a multi-head attention (MHA) mechanism,
    followed by a fully connected feed-forward network (FFNN). 
    This architecture includes a residual connection around each of the two 
    sub-layers, followed by layer normalization.
    """
    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1, layernorm_eps=1e-6):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(num_heads=num_heads,
                                      key_dim=embedding_dim)

        self.ffn = FullyConnected(embedding_dim=embedding_dim,
                            fully_connected_dim=fully_connected_dim)

        self.layernorm1 = LayerNormalization(epsilon=layernorm_eps)
        self.layernorm2 = LayerNormalization(epsilon=layernorm_eps)

        self.dropout1 = Dropout(dropout_rate)
        self.dropout2 = Dropout(dropout_rate)
    
    def call(self, x, training, mask):
        """
        Forward pass for the Encoder Layer
        
        Arguments:
            x -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            mask -- Boolean mask to ensure that the padding is not 
                    treated as part of the input
        Returns:
            encoder_layer_out -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
        """
        # calculate self-attention using MHA
        # To compute self-attention, Q, V and K are all initialized with x
        # Keras' attention_mask uses 1 for positions that may be attended to
        # and 0 for positions to ignore
        # Output shape: (batch_size, input_seq_len, embedding_dim)
        self_attn_output = self.mha(x, x, x, attention_mask=mask)
        
        # apply dropout layer to the self-attention output
        self_attn_output = self.dropout1(self_attn_output, training=training)
        
        # apply layer normalization on sum of the input and the attention 
        # output to get the output of MHA layer 
        # Output shape: (batch_size, input_seq_len, embedding_dim)
        mult_attn_out = self.layernorm1(x + self_attn_output)

        # pass the output of MHA layer through a ffn (FFNN) 
        # Output shape: (batch_size, input_seq_len, embedding_dim)
        ffn_output = self.ffn(mult_attn_out)
        
        # apply dropout layer to ffn output 
        ffn_output = self.dropout2(ffn_output, training=training)
        
        # apply normalization on sum of the output from MHA and ffn output
        # to get the output of the encoder layer 
        # Output shape: (batch_size, input_seq_len, embedding_dim)
        encoder_layer_out = self.layernorm2(ffn_output + mult_attn_out)
        
        return encoder_layer_out

### 4.2 - Full Encoder

Now we will build the full Transformer Encoder (Fig. 2b). 
First, the input is embedded and the positional encodings are added. The encoded embeddings are then fed to a stack of Encoder layers. 

<img src="images/encoder.png" alt="Encoder" width="330"/>
<caption><center><font color='purple'><b>Figure 2b: Transformer Encoder</b></font></center></caption>

The `Encoder()` class is implemented with a `call()` method. The Encoder is initialized with an Embedding layer, positional encoding, and multiple EncoderLayers. The `call()` method performs the following steps: 
1. Pass the input through the Embedding layer.
2. Scale the embedding by multiplying it by the square root of the embedding dimension. Cast the embedding dimension to data type `tf.float32` before computing the square root.
3. Add the positional encoding `self.pos_encoding[:, :seq_len, :]` to the embedding.
4. Pass the encoded embedding through a dropout layer. Use the `training` parameter to set the training mode. 
5. Pass the output of the dropout layer through the stack of encoding layers using a for loop.
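
Steps 1 to 3 can be traced in NumPy with a toy embedding table. Everything numeric here (the vocabulary size, the random table, the all-zero stand-in for the positional encoding) is an assumption chosen to keep the shapes visible; dropout and the encoder layers are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, max_positions = 10, 4, 6

embedding_table = rng.standard_normal((vocab_size, embedding_dim))
# Stand-in for self.pos_encoding, shape (1, max_positions, embedding_dim)
pos_encoding = np.zeros((1, max_positions, embedding_dim))

tokens = np.array([[3, 7, 0], [1, 2, 5]])        # (batch_size, seq_len)
seq_len = tokens.shape[1]

x = embedding_table[tokens]                      # step 1: embedding lookup
x = x * np.sqrt(np.float32(embedding_dim))       # step 2: scale by sqrt(d)
x = x + pos_encoding[:, :seq_len, :]             # step 3: add positions

print(x.shape)   # (2, 3, 4)
```

Slicing the precomputed encoding to `seq_len` lets one table serve every batch whose sequences are no longer than `max_positions`.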

In [13]:
class Encoder(tf.keras.layers.Layer):
    """
    The entire Encoder starts by passing the input to an embedding layer 
    and using positional encoding to then pass the output through a stack 
    of encoder layers
    """   
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size,
        maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Encoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = Embedding(input_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding, 
                                                self.embedding_dim)


        self.enc_layers = [EncoderLayer(embedding_dim=self.embedding_dim,
                            num_heads=num_heads,
                            fully_connected_dim=fully_connected_dim,
                            dropout_rate=dropout_rate,
                            layernorm_eps=layernorm_eps) 
                           for _ in range(self.num_layers)]

        self.dropout = Dropout(dropout_rate)
        
    def call(self, x, training, mask):
        """
        Forward pass for the Encoder
        
        Arguments:
            x -- Tensor of shape (batch_size, input_seq_len)
            training -- Boolean, set to true to activate
                        training mode for dropout layers
            mask -- Boolean mask to ensure that the padding is not 
                    treated as part of the input
        Returns:
            x -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
        """

        seq_len = tf.shape(x)[1]
        
        # Pass input through Embedding layer
        # Output shape: (batch_size, input_seq_len, embedding_dim)
        x = self.embedding(x) 
        
        # Scale the embedding by multiplying it by the square root of the embedding dimension
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
        
        # Add position encoding to embedding
        x += self.pos_encoding[:, :seq_len, :]
        
        # Pass the encoded embedding through a dropout layer
        x = self.dropout(x, training=training)
        
        # Pass the output through the stack of encoding layers 
        for i in range(self.num_layers):
            x = self.enc_layers[i](x,training, mask)

        return x  # Shape: (batch_size, input_seq_len, embedding_dim)

### 5 - Decoder

The Transformer Decoder layer takes the K and V matrices generated by the Encoder and uses them in its second MHA layer, together with a Q matrix derived from the output sequence (Fig. 3a).

<img src="images/decoder_layer.png" alt="Encoder" width="250"/>
<caption><center><font color='purple'><b>Figure 3a: Transformer Decoder layer</b></font></center></caption>
 
### 5.1 - Decoder Layer
Again, the MHA is paired with an FFNN, but this time two MHA layers are implemented. Residual connections and layer normalization are again used (Fig. 3a).
    
`DecoderLayer()` is implemented using the `call()` method. 
    
1. Block 1 is an MHA layer with a look-ahead mask, followed by a dropout layer, a residual connection, and layer normalization.
2. Block 2 takes the output of the Encoder into account: its MHA layer receives K and V from the encoder and Q from Block 1. A dropout layer, layer normalization, and a residual connection are applied, as before. 
3. Block 3 is an FFNN with dropout and normalization layers and a residual connection.
    
**Additional Hints:**
* The first two blocks are fairly similar to the EncoderLayer except you will return `attention_scores` when computing self-attention. 
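
The difference between Block 1 and Block 2 shows up in the shape of the attention weights. In the single-head NumPy sketch below (random inputs and no learned projections, both of which are simplifications), self-attention yields a `(target_len, target_len)` matrix, while attention over the encoder output yields `(target_len, input_len)`.

```python
import numpy as np

def attention_weights(q, k):
    """Softmax of scaled dot products between queries and keys."""
    logits = q @ k.T / np.sqrt(k.shape[-1])
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
target_len, input_len, dim = 3, 5, 4
dec_x = rng.standard_normal((target_len, dim))     # decoder-side sequence
enc_out = rng.standard_normal((input_len, dim))    # encoder output

w_self = attention_weights(dec_x, dec_x)           # Block 1: Q and K from the decoder
w_cross = attention_weights(dec_x, enc_out)        # Block 2: K from the encoder
out2 = w_cross @ enc_out                           # Block 2: V also from the encoder

print(w_self.shape, w_cross.shape, out2.shape)     # (3, 3) (3, 5) (3, 4)
```

Whatever the input length, the output of Block 2 keeps one row per target position, which is why the residual connection with the Block 1 output is well defined.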

In [14]:
class DecoderLayer(tf.keras.layers.Layer):
    """
    The decoder layer is composed of two MHA blocks, 
    one that takes the new input and uses self-attention, and the other 
    one that combines it with the output of the encoder, followed by a
    fully connected block. 
    """
    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1, layernorm_eps=1e-6):
        super(DecoderLayer, self).__init__()

        self.mha1 = ?
        
        self.mha2 = ?

        self.ffn =  ?

        self.layernorm1 = ?
        self.layernorm2 = ?
        self.layernorm3 = ?

        self.dropout1 = ?
        self.dropout2 = ?
        self.dropout3 = ?
    
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Forward pass for the Decoder Layer
        
        Arguments:
            x -- Tensor of shape (batch_size, target_seq_len, embedding_dim)
            enc_output -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
            training -- Boolean, set to true to activate training mode for dropout layers
            look_ahead_mask -- Boolean mask for the target_input
            padding_mask -- Boolean mask for the 2nd MHA layer

        Returns:
            out3 -- Tensor of shape (batch_size, target_seq_len, embedding_dim)
            attn_weights_block1 -- Tensor of shape (batch_size, num_heads, 
                                   target_seq_len, target_seq_len)
            attn_weights_block2 -- Tensor of shape (batch_size, num_heads, 
                                   target_seq_len, input_seq_len)
        """
        
        # BLOCK 1
        # calculate self-attention and return attention scores as 
        # attn_weights_block1
        # attn1 shape: (batch_size, target_seq_len, embedding_dim)
        attn1, attn_weights_block1 = self.mha1(x, x, x, attention_mask=look_ahead_mask,
                                               return_attention_scores=True)
        
        # apply 1st dropout layer on the attn1 output 
        attn1 = ?
        
        # apply 1st layer normalization to the sum of attn1 output & input x
        out1 = ?

        # BLOCK 2
        # calculate attention using Q from the 1st block and K and V
        # from the encoder output.
        # MultiHeadAttention's call takes (query, value, key, ...);
        # the mask and return_attention_scores are passed by keyword
        # Return attention scores as attn_weights_block2 
        # attn2 shape: (batch_size, target_seq_len, embedding_dim)
        attn2, attn_weights_block2 = self.mha2(out1, enc_output, enc_output, attention_mask=padding_mask,
                                               return_attention_scores=True)
        
        # apply 2nd dropout layer on the attn2 output
        attn2 = ?
        
        # apply 2nd layer normalization to the sum of attn2 output & output of Block 1
        # out2 shape: (batch_size, target_seq_len, embedding_dim)
        out2 = ? 
        
        # BLOCK 3
        # pass the output of the 2nd block through a ffn
        # ffn_output shape: (batch_size, target_seq_len, embedding_dim)
        ffn_output = ? 
        
        # apply 3rd dropout layer to ffn output
        ffn_output = ?
        
        # apply 3rd layer normalization to the sum of ffn output & output of Block 2
        # Output shape: (batch_size, target_seq_len, embedding_dim)
        out3 = ?

        return out3, attn_weights_block1, attn_weights_block2
    

### 5.2 - Full Decoder

<img src="images/decoder.png" alt="Encoder" width="300"/>
<caption><center><font color='purple'><b>Figure 3b: Transformer Decoder</b></font></center></caption>

Implement `Decoder()` using the `call()` method to embed the output, add positional encoding, and pass the result through multiple decoder layers. 
 
The Decoder is initialized with an Embedding layer, positional encoding, and multiple DecoderLayers. The `call()` method performs the following steps: 
1. Pass the generated output through the Embedding layer.
2. Scale the embedding by multiplying it by the square root of the embedding dimension. Cast the embedding dimension to data type `tf.float32` before computing the square root.
3. Add the positional encoding `self.pos_encoding[:, :seq_len, :]` to the embedding.
4. Pass the encoded embedding through a dropout layer. Use the `training` parameter to set the model training mode. 
5. Pass the output of the dropout layer through the stack of Decoding layers using a for loop.

In [15]:
class Decoder(tf.keras.layers.Layer):
    """
    The entire Decoder starts by passing the target input to an embedding
    layer and using positional encoding to then pass the output through a 
    stack of decoder layers
    """ 
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, target_vocab_size,
               maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Decoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = Embedding(target_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.embedding_dim)

        self.dec_layers = [DecoderLayer(embedding_dim=self.embedding_dim,
                                        num_heads=num_heads,
                                        fully_connected_dim=fully_connected_dim,
                                        dropout_rate=dropout_rate,
                                        layernorm_eps=layernorm_eps) 
                           for _ in range(self.num_layers)]
        self.dropout = Dropout(dropout_rate)
    
    def call(self, x, enc_output, training, 
           look_ahead_mask, padding_mask):
        """
        Forward pass for the Decoder
        
        Arguments:
            x -- Tensor of shape (batch_size, target_seq_len)
            enc_output -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
            training -- Boolean, set to true to activate the training mode 
                        for dropout layers
            look_ahead_mask -- Boolean mask for the target_input
            padding_mask -- Boolean mask for the second multihead attention layer

        Returns:
            x -- Tensor of shape (batch_size, target_seq_len, embedding_dim)
            attention_weights -- Dictionary of tensors containing all attention 
                weights; block 1 entries have shape 
                (batch_size, num_heads, target_seq_len, target_seq_len) and 
                block 2 entries have shape 
                (batch_size, num_heads, target_seq_len, input_seq_len)
        """

        seq_len = tf.shape(x)[1]
        attention_weights = {}
        
        # Pass input through the Embedding layer 
        # x shape: (batch_size, target_seq_len, embedding_dim)
        x = ?
        
        # Scale the embeddings by multiplying by the square root of the embedding dimension
        x *= ?
        
        # Add positional encodings to word embedding
        x += ?
        
        # apply a dropout layer to x
        x = ?

        # Pass x and the encoder output through a stack of decoder layers and 
        # save the attention weights of block1 and block2
        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)

            # update attention_weights dictionary with the attention weights of 
            # block 1 and block 2
            attention_weights['decoder_layer{}_block1_self_att'.format(i+1)] = ?
            attention_weights['decoder_layer{}_block2_decenc_att'.format(i+1)] = ?
        
        # x.shape => (batch_size, target_seq_len, embedding_dim)
        return x, attention_weights

### 6 - Transformer
<img src="images/transformer.png" alt="Transformer" width="550"/>
<caption><center><font color='purple'><b>Figure 4: Transformer</b></font></center></caption>
    
The flow of data through the Transformer architecture is as follows:
* The input passes through the Encoder implemented above:
    - embedding and positional encoding of the input
    - MHA on the input
    - FFNN to detect features
* The encoder output passes through the Decoder implemented above:
    - embedding and positional encoding of the output
    - MHA on the generated output
    - MHA with Q from the 1st MHA layer and K and V from the Encoder
    - FFNN to help detect features
* Finally, after the Nth Decoder layer, a dense layer with a softmax activation is applied to generate the prediction for the next output in the sequence.

`Transformer()` is implemented using the `call()` method:
1. Pass the input through the Encoder with the appropriate mask.
2. Pass the encoder output and the target through the Decoder with the appropriate masks.
3. Apply a linear transformation and softmax to get the prediction.
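
Step 3 maps each decoder output vector to a probability distribution over the target vocabulary. A NumPy sketch with random stand-in weights (not trained values) is shown below; taking the `argmax` to pick a token is one common decoding choice and is an assumption here, since the notebook stops at the probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, tar_len, embedding_dim, target_vocab_size = 2, 3, 4, 7

dec_output = rng.standard_normal((batch, tar_len, embedding_dim))
w = rng.standard_normal((embedding_dim, target_vocab_size))   # dense kernel
b = np.zeros(target_vocab_size)                               # dense bias

logits = dec_output @ w + b                                   # linear map
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)                     # softmax

next_tokens = probs.argmax(axis=-1)                           # greedy pick
print(probs.shape, next_tokens.shape)   # (2, 3, 7) (2, 3)
```

Each target position gets its own distribution, so all positions can be scored in a single forward pass during training.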

In [16]:
class Transformer(tf.keras.Model):
    """
    Complete transformer with an Encoder and a Decoder
    """
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, 
               target_vocab_size, max_positional_encoding_input,
               max_positional_encoding_target, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers=num_layers,
                               embedding_dim=embedding_dim,
                               num_heads=num_heads,
                               fully_connected_dim=fully_connected_dim,
                               input_vocab_size=input_vocab_size,
                               maximum_position_encoding=max_positional_encoding_input,
                               dropout_rate=dropout_rate,
                               layernorm_eps=layernorm_eps)

        self.decoder = Decoder(num_layers=num_layers,
                               embedding_dim=embedding_dim,
                               num_heads=num_heads,
                               fully_connected_dim=fully_connected_dim,
                               target_vocab_size=target_vocab_size,
                               maximum_position_encoding=max_positional_encoding_target,
                               dropout_rate=dropout_rate,
                               layernorm_eps=layernorm_eps)

        self.final_layer = Dense(target_vocab_size, activation='softmax')
    
    def call(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        """
        Forward pass for the entire Transformer

        Arguments:
            inp -- Tensor of shape (batch_size, input_seq_len)
            tar -- Tensor of shape (batch_size, target_seq_len)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            enc_padding_mask -- Boolean mask to ensure that the padding is not 
                                treated as part of the input
            look_ahead_mask -- Boolean mask for the target_input
            dec_padding_mask -- Boolean mask for the 2nd MHA layer

        Returns:
            final_output -- Tensor of shape (batch_size, tar_seq_len, target_vocab_size)
            attention_weights -- Dictionary of tensors containing all attention 
                weights for the decoder; block 1 entries have shape 
                (batch_size, num_heads, target_seq_len, target_seq_len) and 
                block 2 entries have shape 
                (batch_size, num_heads, target_seq_len, input_seq_len)
        """
        # call self.encoder with appropriate arguments to get encoder output
        # Output shape: (batch_size, inp_seq_len, embedding_dim)
        enc_output = self.encoder(inp, training, enc_padding_mask)
        
        # call self.decoder with appropriate arguments to get decoder output
        # Output shape: (batch_size, tar_seq_len, embedding_dim)
        dec_output, attention_weights = self.decoder(tar, enc_output, training, look_ahead_mask, dec_padding_mask)
        
        # pass decoder output through a linear layer and softmax
        # Output shape: (batch_size, tar_seq_len, target_vocab_size)
        final_output = self.final_layer(dec_output)

        return final_output, attention_weights

<font color='blue'>
    <b>What you should remember</b>:

- The combination of self-attention and FFNN layers allows for parallelization of training and *faster training*.
- Self-attention is calculated using the generated query Q, key K, and value V matrices.
- Adding positional encoding to word embeddings is an effective way to include sequence information in self-attention calculations. 
- Multi-head attention (MHA) helps to detect multiple features in the sentence.
- Masking stops the model from 'looking ahead' during training, or weighting zeroes too much when processing cropped sentences. 

</font>

### 7 - References

The Transformer architecture was introduced by Vaswani et al. (2017). 

- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762) 