# <center> Attention Model </center>
### <center> August 10 2024 </center>
The goal of this project is to train a neural network that translates any human-readable date into a machine-readable format (yyyy-mm-dd). To generate arbitrary training data, we use the file 'nmt_utils.py'. 

The structure of the network consists of an encoder and a decoder network. 
* The encoder is a bi-directional LSTM with 32 hidden units in each direction. The input length is fixed at 30: longer inputs are truncated, and shorter inputs are padded at the end. 

* The attention model is a fully connected network with two layers. The first layer has 10 neurons with a tanh activation. The second layer has 1 neuron with a ReLU activation. Its output is run through a softmax so that the attention weights lie between 0 and 1 and sum to 1 across the input positions. 

* The decoder is another LSTM with 64 hidden units; its output is run through a Dense layer with len_machine_vocab neurons and a tanh activation. A softmax is then applied to the output of the Dense layer so that the result is a probability distribution. The length of the output sequence is fixed at 10. 

**Note** This project is part of the Coursera course Sequence Models offered by DeepLearning.AI. 

In [1]:
#Loading the packages required: 
from tensorflow.keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply, Softmax
from tensorflow.keras.layers import RepeatVector, Dense, Activation, Lambda
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.activations import softmax
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model, Model
import tensorflow.keras.backend as K
import tensorflow as tf

import numpy as np
from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
from nmt_utils import *
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#Loading the dataset 
m = 10000 #number of training samples 
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)

100%|█████████████████████████████████| 10000/10000 [00:00<00:00, 122101.72it/s]


Dataset is a list of tuples. In each tuple we have the human-readable and machine-readable dates. Note the different formats. 

In [3]:
dataset[0:10] 

[('1 oct 1992', '1992-10-01'),
 ('22.07.70', '1970-07-22'),
 ('1/24/15', '2015-01-24'),
 ('wednesday april 23 1986', '1986-04-23'),
 ('tuesday february 13 1990', '1990-02-13'),
 ('tuesday july 29 1980', '1980-07-29'),
 ('tuesday november 28 2000', '2000-11-28'),
 ('31 oct 1978', '1978-10-31'),
 ('14 oct 1976', '1976-10-14'),
 ('monday august 23 1993', '1993-08-23')]

In [4]:
def preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty):

    # Unpack the tuples; separate the human-readable and machine-readable dates into X and Y respectively. 
    X, Y = zip(*dataset)

    # Convert each date into a vector of integers giving each character's index in human_vocab (for X) or machine_vocab (for Y): 
    X = np.array([string_to_int(i, Tx, human_vocab) for i in X])
    Y = [string_to_int(t, Ty, machine_vocab) for t in Y]

    #Create the one-hot vectors (will be used as input) 
    Xoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), X))) #one-hot vector of each X element
    Yoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), Y)))

    return X, np.array(Y), Xoh, Yoh

In [5]:
Tx = 30 
Ty = 10 
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)
print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)

X.shape: (10000, 30)
Y.shape: (10000, 10)
Xoh.shape: (10000, 30, 37)
Yoh.shape: (10000, 10, 11)


In [6]:
print(f"First element of X is :\n{X[0]}")
print(f"First element of Y is :\n{Y[0]}")
print(f"First one-hot vector encoding for the first element of X is: \n{Xoh[0][0]}")

First element of X is :
[ 4  0 26 15 30  0  4 12 12  5 36 36 36 36 36 36 36 36 36 36 36 36 36 36
 36 36 36 36 36 36]
First element of Y is :
[ 2 10 10  3  0  2  1  0  1  2]
First one-hot vector encoding for the first element of X is: 
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
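
The index vector printed above is consistent with a character-level encoding followed by padding at the end. The sketch below is a hypothetical stand-in for `string_to_int` from `nmt_utils.py`, whose source is not shown here. `encode_date` and `toy_vocab` are illustrative names, and the lowercasing, the `<unk>` fallback, and the index values are assumptions chosen to match the printed output for `'1 oct 1992'`.

```python
# Hypothetical stand-in for nmt_utils.string_to_int; the real helper may differ.
def encode_date(text, length, vocab):
    """Map each character to its vocabulary index, then truncate or pad to `length`."""
    text = text.lower()
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text]
    ids = ids[:length]                               # truncate long inputs
    ids += [vocab["<pad>"]] * (length - len(ids))    # pad short inputs at the end
    return ids

# Toy vocabulary: only the characters needed for this example, with index
# values copied from the printed X[0] (assumed, not read from human_vocab).
toy_vocab = {" ": 0, "1": 4, "2": 5, "9": 12, "c": 15, "o": 26, "t": 30,
             "<unk>": 35, "<pad>": 36}

encoded = encode_date("1 oct 1992", 30, toy_vocab)
print(encoded[:10])   # the ten character indices
print(len(encoded))   # always equal to the requested length
```

Each index is then turned into a one-hot row, which is why `Xoh` gains a third axis equal to the vocabulary size.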


In [7]:
index = 0
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()
print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()
print("Source after preprocessing (one-hot):", Xoh[index])
print("Target after preprocessing (one-hot):", Yoh[index])


Source date: 1 oct 1992
Target date: 1992-10-01

Source after preprocessing (indices): [ 4  0 26 15 30  0  4 12 12  5 36 36 36 36 36 36 36 36 36 36 36 36 36 36
 36 36 36 36 36 36]
Target after preprocessing (indices): [ 2 10 10  3  0  2  1  0  1  2]

Source after preprocessing (one-hot): [[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
Target after preprocessing (one-hot): [[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]


### Neural Attention
In this step, we define a function that takes as inputs the previous hidden state of the post-attention LSTM, $s^{(t-1)}$, and the hidden states of the pre-attention bidirectional LSTM, $a^{(t')}$ for $t' = 1, \dots, T_x$. The function runs these inputs through a fully connected network to calculate the energies. The energies are then run through a softmax layer to get the attention weights $\alpha^{(t')}$, which are multiplied with their respective $a^{(t')}$ and summed to produce the context vector $C^{(t)}$. 
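
The computation described above can be sketched in plain NumPy. This is only an illustration of the arithmetic, not the Keras implementation used in this notebook: the weights `W1`, `b1`, `W2`, `b2` are random placeholders, and `attention_step` is a hypothetical name.

```python
import numpy as np

def attention_step(a, s_prev, W1, b1, W2, b2):
    """One attention step: a is (m, Tx, 2*n_a), s_prev is (m, n_s)."""
    Tx = a.shape[1]
    # Repeat s_prev along the time axis so it can be paired with every a<t'>
    s_rep = np.repeat(s_prev[:, None, :], Tx, axis=1)      # (m, Tx, n_s)
    concat = np.concatenate([a, s_rep], axis=-1)           # (m, Tx, 2*n_a + n_s)
    hidden = np.tanh(concat @ W1 + b1)                     # (m, Tx, 10)
    energies = np.maximum(hidden @ W2 + b2, 0.0)           # (m, Tx, 1), ReLU
    # Softmax over the Tx axis so the weights for each sample sum to 1
    exp = np.exp(energies - energies.max(axis=1, keepdims=True))
    alphas = exp / exp.sum(axis=1, keepdims=True)          # (m, Tx, 1)
    # Weighted sum of the encoder states
    context = np.sum(alphas * a, axis=1, keepdims=True)    # (m, 1, 2*n_a)
    return alphas, context

rng = np.random.default_rng(0)
m, Tx, n_a, n_s = 4, 30, 32, 64
a = rng.normal(size=(m, Tx, 2 * n_a))
s_prev = rng.normal(size=(m, n_s))
W1, b1 = rng.normal(size=(2 * n_a + n_s, 10)), np.zeros(10)
W2, b2 = rng.normal(size=(10, 1)), np.zeros(1)

alphas, context = attention_step(a, s_prev, W1, b1, W2, b2)
print(alphas.shape, context.shape)
```

The softmax is taken over the time axis, so each sample gets one weight per input position and the context vector keeps the feature size of the encoder output.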

In [8]:
def NeuralAttention(a, s_prev):
    """
    Implements one step of the attention mechanism.

    Arguments:
    a -- output of the Bi-LSTM of shape (m, Tx, 2 * n_a)
    s_prev -- previous hidden state of the post-attention LSTM of shape (m, n_s)
    Tx -- length of the input sequence (global variable)

    Returns:
    context -- context vector of shape (m, 1, 2 * n_a), input of the next LSTM cell
    """
    # Repeat s_prev Tx times so it can be paired with every a<t'>: (m, n_s) -> (m, Tx, n_s)
    s_prev = RepeatVector(Tx)(s_prev)

    # Concatenate a and s_prev along the feature axis: (m, Tx, 2 * n_a + n_s)
    concat = Concatenate(axis=-1)([a, s_prev])

    # First layer of the FFN, 10 neurons with tanh activation: (m, Tx, 10)
    dense1 = Dense(10, activation="tanh")(concat)

    # Final layer of the FFN, 1 neuron with ReLU activation: (m, Tx, 1)
    energies = Dense(1, activation="relu")(dense1)

    # Softmax over the Tx axis to get the attention weights: (m, Tx, 1)
    alphas = Softmax(axis=1)(energies)

    # Weighted sum of the a<t'> using the alphas: (m, 1, 2 * n_a)
    context = Dot(axes=1)([alphas, a])

    return context



In [9]:
#Example: 
np.random.seed(10)
tf.random.set_seed(10)
m = 10 #samples 
Tx = 30 #length of seq
n_a = 32 #neurons 
n_s = 64
a = np.random.uniform(0, 1, (m, Tx, 2 * n_a)).astype(np.float32)
s_prev = np.random.uniform(0, 1, (m, n_s)).astype(np.float32)
Context = NeuralAttention(a,s_prev)

#### Encoder Bi-directional LSTM

The pre-attention bi-directional LSTM has 32 hidden units in each direction, so the hidden state at each time step $t'$ (for $t' = 1, \dots, T_x$) has size 64 after the forward and backward outputs are concatenated. Since the sequence length is fixed at 30, the output of the bi-directional LSTM is a tensor of shape (m, 30, 64). 

The context vector, on the other hand, is the attention-weighted sum of the encoder's hidden states. So, when predicting the $t^{th}$ character in the decoder network, the decoder receives the pre-attention Bi-LSTM outputs weighted by the attention scores, together with the previous hidden state and cell state of the decoder LSTM. The decoder LSTM has 64 units, which means the hidden state and the cell state each have size 64 at every time step. 

Therefore: 
* a at each decoder step t: (None, 30, 64) --> 2 * n_a = 64
* alphas at each decoder step t: (None, 30, 1) --> T_x = 30: one attention score for every encoder time step. 
* context vector at step t: (None, 1, 64) --> n_a + n_a = 32 + 32 = 64 (bi-directional)

#### Decoder LSTM: 

We'll have an LSTM with n_s = 64 units in the hidden state, which equals the number of units in the cell state. So, in the decoder, we'll have an LSTM layer followed by a Dense layer with len_machine_vocab neurons and a tanh activation. Finally, the prediction is calculated by running the output of the Dense layer through a softmax activation. 

Questions: 
1. What does the LSTM cell output once we feed it the initial state and the context vector? 

Because `return_sequences` is left at its default of False, the LSTM returns only its last hidden state, with shape (m, 64). With `return_state = True` it also returns the final hidden state and cell state, each of shape (m, 64). 

2. What is the purpose of the Dense layer? 

To convert a tensor of shape (m, 64) into shape (m, len_machine_vocab). The output is then run through a softmax so that it forms probabilities for each character in machine_vocab. 



In [10]:
n_a = 32 # number of units for the pre-attention, bi-directional LSTM's hidden state 'a'
n_s = 64 # number of units for the post-attention LSTM's hidden state "s"

post_activation_LSTM_cell = LSTM(n_s, return_state = True) 
output_layer = Dense(len(machine_vocab), activation=softmax)  # alternative output layer; modelf below builds its own Dense layers instead

In [11]:
def modelf(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """

    
    # Define the inputs of your model with a shape (Tx, human_vocab_size)
    # Define s0 (initial hidden state) and c0 (initial cell state)
    # for the decoder LSTM with shape (n_s,)
    X = Input(shape=(Tx, human_vocab_size))
    # initial hidden state
    s0 = Input(shape=(n_s,), name='s0')
    # initial cell state
    c0 = Input(shape=(n_s,), name='c0')
    # hidden state
    s = s0
    # cell state
    c = c0
    
    # Initialize empty list of outputs
    outputs = []
    
    # Define your pre-attention Bi-LSTM. a is a list of all the hidden states. 
    a = Bidirectional(LSTM(units=n_a, return_sequences=True))(X)

    
    # Iterate for Ty steps
    for t in range(Ty):
    
        # Perform one step of the attention mechanism to get back the context vector at step t 
        context = NeuralAttention(a,s)
        
        # Apply the post-attention LSTM cell to the "context" vector while also inputting the previous hidden state and cell state. 
        s, _, c = post_activation_LSTM_cell(context, initial_state=[s, c])
       
        # Apply a Dense layer with tanh activation to the hidden state output of the post-attention LSTM 
        out = Dense(machine_vocab_size, activation="tanh")(s)
        #out = output_layer(s)
        # Run through a Softmax function over the vocabulary axis: 
        res = Softmax(axis=1)(out)
        # Append "res" to the "outputs" list 
        outputs.append(res)
    
    # Create model instance taking three inputs and returning the list of outputs.
    model = Model(inputs = [X,s0,c0], outputs = outputs)
    
    return model

In [12]:
Tx = 30
n_a = 32
n_s = 64
len_human_vocab = len(human_vocab)
len_machine_vocab = len(machine_vocab)

In [13]:
model = modelf(Tx, Ty, n_a, n_s, len_human_vocab, len_machine_vocab)
model.summary()
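
The output of `model.summary()` is not reproduced here. As a rough cross-check, the trainable parameter counts can be estimated by hand with the standard formulas for LSTM and Dense layers. The helper names below are hypothetical, and the vocabulary sizes 37 and 11 are taken from the shapes of `Xoh` and `Yoh` printed earlier.

```python
def lstm_params(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (units + input_dim) + units)

def dense_params(in_dim, out_dim):
    # Weight matrix plus bias vector
    return in_dim * out_dim + out_dim

n_a, n_s = 32, 64
human_vocab_size, machine_vocab_size = 37, 11

encoder = 2 * lstm_params(n_a, human_vocab_size)    # forward + backward LSTM
decoder = lstm_params(n_s, 2 * n_a)                 # input is the context vector
attention = dense_params(2 * n_a + n_s, 10) + dense_params(10, 1)
output = dense_params(n_s, machine_vocab_size)

print("encoder Bi-LSTM:", encoder)
print("decoder LSTM:", decoder)
print("attention layers (one copy):", attention)
print("output Dense (one copy):", output)
```

Because `NeuralAttention` and the output `Dense` layer are instantiated inside the loop over `Ty`, the model built by `modelf` holds one copy of those layers per output step rather than a single shared copy, while the decoder LSTM is defined once and reused.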


Define the loss function, learning rate, and the optimizer: 

A few tasks to do: 

* Answer the question Xin Gao had. 
* Look up what an optimizer is. 

Question: if the output of the model is in the form of probabilities, how does the model know how to map each probability to its corresponding character in the machine_vocab dictionary? 

Answer: the model itself does not know the mapping. The loss is purely numerical and is calculated against the one-hot targets passed in the `outputs` variable. In the examples, we map each predicted index back to its character using inv_machine_vocab. 
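
To make the mapping concrete, here is a small sketch of the decoding step. The `inv_vocab` dictionary is an assumption inferred from the example printed earlier (target indices `[2 10 10 3 0 2 1 0 1 2]` for `1992-10-01`); the real `inv_machine_vocab` comes from `load_dataset`. The probabilities are made up for illustration.

```python
import numpy as np

# Assumed index-to-character table, inferred from the printed Y[0] example
inv_vocab = {0: "-", **{i + 1: str(i) for i in range(10)}}

target_indices = [2, 10, 10, 3, 0, 2, 1, 0, 1, 2]

# Fake model output: a list of Ty arrays, each of shape (1, vocab_size),
# with most of the probability mass on the target index.
probs = []
for idx in target_indices:
    p = np.full((1, len(inv_vocab)), 0.05)
    p[0, idx] = 0.5
    probs.append(p)

# The model only emits numbers; argmax picks an index at each step and
# the lookup table turns that index into a character.
pred_indices = np.argmax(np.array(probs), axis=-1).ravel()
decoded = "".join(inv_vocab[int(i)] for i in pred_indices)
print(decoded)
```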


In [45]:
opt = Adam(0.005,beta_1 = 0.9, beta_2 = 0.999, decay = 0.01) 
model.compile(loss = "categorical_crossentropy", optimizer = opt, metrics = ["accuracy"]*10)

In [46]:
m = 10000
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
outputs = list(Yoh.swapaxes(0,1))

In [47]:
model.fit([Xoh, s0, c0], outputs, epochs=100, batch_size=100)

Epoch 1/100
100/100 ━━━━━━━━━━━━━━━━━━━━ 6s 18ms/step - loss: 8.6387 - softmax_10_accuracy: 1.0000 - softmax_12_accuracy: 1.0000 - softmax_14_accuracy: 1.0000 - softmax_16_accuracy: 1.0000 - softmax_18_accuracy: 0.9474 - softmax_20_accuracy: 1.0000 - softmax_2_accuracy: 0.9999 - softmax_4_accuracy: 0.9999 - softmax_6_accuracy: 0.8971 - softmax_8_accuracy: 0.9975
Epoch 2/100
100/100 ━━━━━━━━━━━━━━━━━━━━ 2s 25ms/step - loss: 8.6387 - softmax_10_accuracy: 1.0000 - softmax_12_accuracy: 1.0000 - softmax_14_accuracy: 1.0000 - softmax_16_accuracy: 1.0000 - softmax_18_accuracy: 0.9473 - softmax_20_accuracy: 1.0000 - softmax_2_accuracy: 0.9999 - softmax_4_accuracy: 0.9999 - softmax_6_accuracy: 0.9879 - softmax_8_accuracy: 0.9975
Epoch 3/100
100/100 ━━━━━━━━━━━━━━━━━━━━ 2s 24ms/step - loss: 8.6416 - softmax_10_accuracy: 1.0000 - softmax_12_accuracy: 0.9997 - softmax_14_accuracy: 0.9999 - softmax

KeyboardInterrupt: 
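
The loss stays near 8.64 even though the per-step accuracies are close to 1. One likely explanation, offered here as an assumption rather than a verified diagnosis, is the `tanh` activation in front of the softmax: since `tanh` outputs lie in [-1, 1], the softmax can never assign a probability close to 1 to any class. The sketch below computes the resulting lower bound on the cross-entropy, assuming 11 output classes and that the reported loss is the sum of the ten per-step losses.

```python
import math

num_classes = 11   # size of machine_vocab, from Yoh.shape
Ty = 10            # output sequence length

# Best case: the correct class has logit +1 and every other class has logit -1
p_max = math.exp(1) / (math.exp(1) + (num_classes - 1) * math.exp(-1))
min_step_loss = -math.log(p_max)

print(f"largest possible probability: {p_max:.4f}")
print(f"smallest per-step cross-entropy: {min_step_loss:.4f}")
print(f"smallest summed loss over {Ty} steps: {Ty * min_step_loss:.4f}")
```

Under these assumptions the summed loss cannot fall below roughly 8.56, which is consistent with the values around 8.64 reported above. Accuracy is unaffected because it only depends on which class has the largest probability.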

In [50]:
EXAMPLES = ['3 May 1979', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']
s00 = np.zeros((1, n_s))
c00 = np.zeros((1, n_s))
for example in EXAMPLES:
    source = string_to_int(example, Tx, human_vocab)
    #print(source)
    source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source)))  # (Tx, vocab size)
    source = np.expand_dims(source, axis=0)  # add the batch axis: (1, Tx, vocab size)
    prediction = model.predict([source, s00, c00])
    # prediction is a list of Ty arrays of shape (1, vocab size); take the most likely index at each step
    prediction = np.argmax(prediction, axis=-1).ravel()
    output = [inv_machine_vocab[int(i)] for i in prediction]
    print("source:", example)
    print("output:", ''.join(output),"\n")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
source: 3 May 1979
output: 1979-05-03 

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
source: 5 April 09
output: 2009-04-05 

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
source: 21th of August 2016
output: 2016-08-21 

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
source: Tue 10 Jul 2007
output: 2007-07-10 

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
source: Saturday May 9 2018
output: 2018-05-09 

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
source: March 3 2001
output: 2001-03-03 

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
source: March 3rd 2001
output: 2001-03-02 

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
source: 1 March 2001
output: 2001-03-01 





In [56]:
# Create a test dataset: 
m = 10000 #number of testing samples 
test_dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)

100%|██████████████████████████████████| 10000/10000 [00:00<00:00, 87386.43it/s]


In [60]:
Tx = 30 
Ty = 10
m = 10000
s00 = np.zeros((m, n_s))
c00 = np.zeros((m, n_s))
test_X, test_Y, test_Xoh, test_Yoh = preprocess_data(test_dataset, human_vocab, machine_vocab, Tx, Ty)

In [61]:
outputs = list(test_Yoh.swapaxes(0,1))

In [62]:
model.evaluate([test_Xoh,s00,c00], outputs)

313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 8.6351 - softmax_10_accuracy: 1.0000 - softmax_12_accuracy: 1.0000 - softmax_14_accuracy: 1.0000 - softmax_16_accuracy: 1.0000 - softmax_18_accuracy: 0.9507 - softmax_20_accuracy: 1.0000 - softmax_2_accuracy: 1.0000 - softmax_4_accuracy: 1.0000 - softmax_6_accuracy: 0.9998 - softmax_8_accuracy: 0.9977


[8.634012222290039,
 1.0,
 1.0,
 1.0,
 1.0,
 0.9520000219345093,
 1.0,
 1.0,
 1.0,
 0.9997000098228455,
 0.9984999895095825]

Since the length of the output sequence is 10, accuracy is measured at each time step, and hence 10 accuracy scores are printed. As shown, the model translates the input dates into machine-readable dates with high accuracy on the test set: nine of the ten scores are above 99.8%, and the remaining one is about 95%. 
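
Per-step accuracy can hide sequence-level errors, because a date is only correct if all ten characters are right. The sketch below shows how both quantities could be computed from index arrays. The arrays are small made-up examples, and `per_step_accuracy` and `exact_match_accuracy` are hypothetical helper names, not part of this notebook or of Keras.

```python
import numpy as np

def per_step_accuracy(y_true, y_pred):
    """Fraction of correct predictions at each output position; inputs are (m, Ty)."""
    return (y_true == y_pred).mean(axis=0)

def exact_match_accuracy(y_true, y_pred):
    """Fraction of samples whose whole output sequence is correct."""
    return (y_true == y_pred).all(axis=1).mean()

# Made-up targets and predictions for 4 samples and Ty = 10
y_true = np.array([[2, 10, 10, 3, 0, 2, 1, 0, 1, 2]] * 4)
y_pred = y_true.copy()
y_pred[0, 9] = 3   # one wrong character in the first sample

print(per_step_accuracy(y_true, y_pred))
print(exact_match_accuracy(y_true, y_pred))
```

With real predictions, `y_pred` could be obtained by taking the argmax of each of the ten model outputs and stacking the results along the second axis.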

In [55]:
# Save the weights of the model for future use 
model.save_weights('/Users/apple/model_weights.weights.h5')

In [29]:
# Load the model 
model.load_weights('/Users/apple/model_weights.weights.h5')