The objective of this project is to: 
* Create positional encodings to capture sequential relationships in data
* Calculate scaled dot-product self-attention with word embeddings
* Implement masked multi-head attention
* Build and train a Transformer model

In [92]:
# Loading the required packages: 
import tensorflow as tf
import time
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization, Layer
from tensorflow.keras.models import Sequential
from tensorflow import  reshape, shape, transpose, ones, linalg

from sklearn.model_selection import train_test_split 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer,LancasterStemmer
import re

In [2]:
def pred(y): 
    '''
    Map the probabilities output by the model back to the rankings list
    and return the ranking with the highest probability. 
    
    inputs: 
    y  (m,)      : 1-D array with the probability output of the model, one entry per ranking 
    
    outputs: 
    res (string) : The ranking corresponding to the most probable outcome. 
    
    Note: relies on the global list `ranking`, which is defined in a later cell. 
    '''
    y = y.tolist()
    #ranking = ['Below Average' , 'Average' , 'Above Average']
    res = ranking[y.index(max(y))]
    return(res)


In [3]:
# Map a ranking label to its one-hot vector (the inverse of pred); relies on the global list `ranking`. 
def vec_output(y): 
    m = len(ranking)
    txt = y
    v = np.zeros(m) 
    j = ranking.index(txt)
    v[j] = 1
    return v 
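
The two helpers above are inverses of each other: one turns a label into a one-hot vector, the other turns a probability vector back into a label. The sketch below shows the round trip in isolation. The names `to_one_hot`, `to_label`, and `labels` are hypothetical stand-ins that take the label list as an argument instead of reading the global `ranking`.

```python
import numpy as np

# Hypothetical label list; in the notebook this comes from np.unique(y).
labels = ["Negative", "Positive"]

def to_one_hot(label, labels):
    """Return a one-hot vector with a 1 at the position of `label`."""
    vec = np.zeros(len(labels))
    vec[labels.index(label)] = 1.0
    return vec

def to_label(probs, labels):
    """Return the label whose probability is the largest."""
    return labels[int(np.argmax(probs))]

encoded = to_one_hot("Positive", labels)
decoded = to_label(np.array([0.2, 0.8]), labels)
print(encoded, decoded)
```

Passing the label list explicitly avoids depending on the order in which the notebook cells are run.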


In [4]:
#Loading the data: 
CustomerFeed = 'Canva_reviews.xlsx'
df = pd.read_excel(CustomerFeed)

print(df)

                                               reviewId            userName  \
0     gp:AOqpTOFxf3fttcT5DSvFIn9KPp5FErgH9yC533Fmoxv...      Donna Caritero   
1     gp:AOqpTOEq6rNIWLnPV4KFTctWvm0mpGEQljtD6mvy1H-...  Soumi Mukhopadhyay   
2     gp:AOqpTOE86hSyPRHZgYt28Uk5zGe4FZGb1hkmtFDiYJ2...   Theknown _unknown   
3     gp:AOqpTOHSuKkVTcM3QgCCKysHQlxEnk2ocOKsUMiMIJy...        Anthony Dean   
4     gp:AOqpTOEOrZt5H6jXPiplJyffCd5ZBnVXACTWgwNsF1R...   Neha Diana Wesley   
...                                                 ...                 ...   
1495  gp:AOqpTOHhnXMpylU3f-1V1KbR2hwWArOilxPlKI6K4xY...            Reen Ali   
1496  gp:AOqpTOEcz62DHS-amqTB5xGMhM4_R0UJpcv_HDNny9i...     Shaurya Chilwal   
1497  gp:AOqpTOFMqEqa_kpp29Q8wjcBmKUCAvOQGQx4KZQ8b83...           GK Gaming   
1498  gp:AOqpTOGY4z3pUxeiqGzn2ad3Noxqlbm-9DZ3ksHqD1_...    1203_Vani Sharma   
1499  gp:AOqpTOFVGZ0MXyR-Gv_d2cYf2KD709Hwple_u7OZE4y...           MeLLy EcK   

                                              userI

In [5]:
df = df[["review", "Sentiment"]]
df.head()

                                              review  Sentiment
0  Overall it's really an amazing app. I've been ...   Negative
1  Hey! Yes I gave a 5 star rating... coz I belie...   Positive
2  Canva used to be a good app! But recently I've...   Negative
3  It's a brilliant app, but I have just one prob...   Negative
4  This was such a great app. I used to make BTS ...   Negative


In [6]:
def edit_txt(review):
    """
    Clean a review and return it as a list of tokens: 
    1. convert all words to lower case 
    2. remove digits
    3. tokenize the text into words 
    4. remove punctuation 
    5. (currently disabled) remove common stop words 
    """
    
    review_edited = []

    #Converting to lower case: 
    review_edited = review.lower() 
    
    #Removing integers: 
    pattern = r'[0-9]'
    # Match all digits in the string and replace them with an empty string
    review_edited = re.sub(pattern, '', review_edited) 

    #Tokenize the comment: 
    review_edited = word_tokenize(review_edited) 

    #Removing punctuation 
    tokenizer = RegexpTokenizer(r'\w+')
    review_edited = [''.join(tokenizer.tokenize(word)) for word in review_edited if len(tokenizer.tokenize(word))>0]

    #Removing common words: 
    #remove_list = stopwords.words('english') 
    #to_remove = [ "not",'don',"don't",'should',"should've", 'ain','aren',"aren't",'couldn',"couldn't",'didn',"didn't",'doesn',"doesn't",'hadn',"hadn't",'hasn',"hasn't",'haven',"haven't",'isn',"isn't",'mightn',"mightn't",'mustn',"mustn't",'needn',"needn't",'shan',"shan't",'shouldn',"shouldn't",'wasn',"wasn't",'weren',"weren't",'won',"won't",'wouldn', "wouldn't"]
 
    #review_edited = [word for word in review_edited if not word in remove_list]
    return(review_edited) 
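
For a quick check of the cleaning steps without the NLTK data files, the same idea can be approximated with the standard library alone. This is only a sketch: `clean_review` is a hypothetical helper, and its output may differ from the NLTK-based function above on contractions such as "don't", which the two tokenizers split differently.

```python
import re

def clean_review(text):
    """Lower-case the text, drop digits, and split into word tokens."""
    text = text.lower()
    text = re.sub(r"[0-9]", "", text)   # remove digits
    return re.findall(r"\w+", text)     # keep runs of word characters, dropping punctuation

tokens = clean_review("Unable to save my work. Nothing works :(")
print(tokens)
```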



In [7]:
# Defining the review dataset as x: 
x = df["review"] 
dfrank = df.iloc[:,1]

print(x[10])

y = df["Sentiment"].tolist()
ranking = np.unique(y)
ranking = ranking.tolist()
print(f"\nCorresponding ranking: {y[10]}\n")
print(f"Rankings include {ranking}")


Really great editing app, its all around which makes it great. Has everything I need for basic editing. It makes editing easier because of premade tools and stickers, designs, etc. I gave it four stars only because of how slow it loads, especially at starting the app. It is pretty stressful, so you really gotta have patience at waiting for stuff to load.

Corresponding ranking: Positive

Rankings include ['Negative', 'Positive']


In [8]:
#creating the dictionary: 
reviews_edited = [edit_txt(review) for review in x]
print(f"Comment before editing: {x[13]}")
print(f"Comment after editing: {reviews_edited[13]}")

Split = [] 
Dic = []
dictionary = np.unique([word for review in reviews_edited for word in review]).tolist()
print(dictionary[0:30])
len(dictionary)

Comment before editing: Unable to save my work. Nothing works :(
Comment after editing: ['unable', 'to', 'save', 'my', 'work', 'nothing', 'works']
['_', 'a', 'aa', 'aap', 'ability', 'able', 'about', 'above', 'absolutely', 'acc', 'accepted', 'access', 'accessibilities', 'accessible', 'accidentally', 'accoding', 'according', 'account', 'across', 'action', 'activity', 'actual', 'actually', 'ad', 'adaptable', 'add', 'added', 'adding', 'addition', 'address']


2317

In [9]:
# Load the word embeddings:
embeddings_dict = {}
with open("glove.6B.50d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector

words =  list(embeddings_dict.keys())
vectors = [embeddings_dict[word] for word in words]

In [10]:
#dividing the dataset into 75% training set and 25% test set: 
x = x.to_list()
X_train, X_test, y_train, y_test = train_test_split(x,y, 
                                   random_state=104,  
                                   test_size=0.25,  
                                   shuffle=True) 


In [11]:
# Edit the text in the training and testing datasets: 
X_train = [edit_txt(comment) for comment in X_train]
X_test = [edit_txt(comment) for comment in X_test]

In [12]:
X_train[0]

['i',
 'spend',
 'so',
 'much',
 'time',
 'in',
 'working',
 'a',
 'poster',
 'and',
 'the',
 'app',
 'is',
 'not',
 'allowing',
 'to',
 'download',
 'simply',
 'wasted',
 'my',
 'hard',
 'work']

## Creating the word embeddings based on GloVe embeddings. 

In [13]:
# Encoding the input with GloVe word embeddings: 
def gvec_input(x,m,e): 
    """
    
    Convert tokenized comments into GloVe vectors, using the pretrained vocabulary (400,000 words). 
    Every comment is truncated or padded to m tokens, and every token is replaced by its e-dimensional 
    embedding, where each entry describes a feature of the word. 
    
    inputs: 
    
    x (list)   : tokenized customer comments, each a list of words. 
    m (int)    : size of the sequence 
    e (int)    : size of the embeddings 
    outputs: 
    gv (n,m,e) : where n is the number of comments, m is the sequence length, and e = 50 is the number of features describing a word. 

    """
    
    n = len(x)
    gv = np.zeros((n,m, e))
    
    # Average of all embeddings, computed once and used for out-of-vocabulary words: 
    mean_vector = np.mean(vectors, axis = 0)
    
    for i in range(0, n): #looping over each comment 
        txt = x[i] #select the ith comment  
        txt = (txt[:m] if len(txt) > m else txt + ['<pad>'] * (m - len(txt))) #shorten or add extra padding
        for l in range(m): #looping over each word 
            
            # use an all-zero embedding for pads
            if txt[l] == "<pad>": 
                gv[i,l,:] = np.zeros(e) 
                
            # if a word is not in the GloVe vocabulary, assign the average of all embeddings
            # (a dictionary lookup is much faster than searching the `words` list):    
            elif txt[l] not in embeddings_dict: 
                gv[i,l,:] = mean_vector
            # add the word embeddings: 
            else: 
                gv[i,l,:] = embeddings_dict[txt[l]]
    return(gv)
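
To see the three cases (known word, unknown word, padding) on something small enough to read, here is a minimal sketch with a toy embedding table. The vectors in `toy_embeddings` are made up for illustration and are not real GloVe values; `embed` is a hypothetical single-comment version of the function above.

```python
import numpy as np

# Toy 3-dimensional embedding table (hypothetical values, not real GloVe vectors).
toy_embeddings = {
    "great": np.array([1.0, 0.0, 0.0]),
    "app": np.array([0.0, 1.0, 0.0]),
}
# Fallback for out-of-vocabulary words: the mean of all known vectors, computed once.
unknown_vec = np.mean(list(toy_embeddings.values()), axis=0)

def embed(tokens, seq_len, emb_dim):
    """Truncate or zero-pad one tokenized comment and look up each token."""
    out = np.zeros((seq_len, emb_dim))           # rows left untouched act as padding
    for pos, tok in enumerate(tokens[:seq_len]):
        out[pos] = toy_embeddings.get(tok, unknown_vec)
    return out

matrix = embed(["great", "canva", "app"], seq_len=5, emb_dim=3)
print(matrix)
```

Row 1 ("canva") receives the mean vector, and rows 3 and 4 stay at zero because the comment is shorter than the sequence length.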

In [193]:
# Limit the length of the sequence: 
m = 30 
# The length of the embeddings: 
e = 50
X_trainmod = gvec_input(X_train,m,e) #X_train is training dataset modified (edited and tokenized);
                                     # shape of X_trainmod (#samples, len_seq, len_emb) 

X_testmod = gvec_input(X_test,m,e)   #X_test will be the testing dataset modified (edited and tokenized) 
                                     # shape of X_testmod (#samples, len_seq, len_emb) 

In [15]:
print(X_trainmod[0])
print(X_trainmod.shape)
print(X_testmod.shape)

[[ 0.11891     0.15255    -0.082073   ... -0.57511997 -0.26671001
   0.92120999]
 [ 0.79238999  0.21864     0.68711001 ... -0.066753   -0.39660001
   0.74818999]
 [ 0.60307997 -0.32023999  0.088857   ... -0.25187001 -0.26879001
   0.36657   ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]
(1125, 30, 50)
(375, 30, 50)


In [16]:
# Map the y_train and y_test labels to one-hot vectors: 
y_trainmod = (np.array([vec_output(y) for y in y_train])).reshape(len(y_train), len(ranking))
y_testmod = (np.array([vec_output(y) for y in y_test])).reshape(len(y_test),len(ranking))
y_trainmod[0:5]

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])

## Define the self attention: 

In [207]:
def self_attention(q,k,v, masking):
    """
    This function applies scaled dot-product attention to the given query, key, and value tensors. 
    
    masking is an optional tensor holding 1 at the positions to hide and 0 elsewhere; 
    it must broadcast against the score tensor. 
    Returns the attention scores and the attention-weighted sum of the values. 
    """
    
    # Perform matrix multiplication on the last two dimensions
    dotqk = tf.matmul(q, k, transpose_b = True) #for 4-D inputs, must be of size (batch_size, heads, seq_len, seq_len) 

    dim_k = tf.cast(k.shape[-1],tf.float32) 
    normalized_dotqk = dotqk/tf.math.sqrt(dim_k)
    
    # then add the masking, if a mask is given: 
    if masking is not None: 
        normalized_dotqk += masking* -1e9
    
    attention_scores =  tf.nn.softmax(tf.cast(normalized_dotqk, dtype=tf.float32),axis = -1)
    res = tf.matmul(attention_scores,v) 
    
    return(attention_scores, res)
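
The same computation, softmax(QK^T / sqrt(d_k)) V with an additive mask, can be reproduced in plain NumPy on a tiny example. This is a sketch for checking the arithmetic rather than a replacement for the TensorFlow function; the shapes and the random inputs are arbitrary choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = k.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # (seq_len, seq_len)
    if mask is not None:
        scores = scores + mask * -1e9                   # 1 in the mask means "hide this key"
    weights = softmax(scores, axis=-1)
    return weights, weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 6))
mask = np.array([[0.0, 0.0, 1.0, 1.0]])  # hide the last two positions

weights, output = scaled_dot_product_attention(q, k, v, mask)
print(weights.round(3))
```

Each row of `weights` sums to 1, and the two masked columns are zero.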
    

## Define the Padding Mask

In [194]:
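def create_padding_mask(matrix,num_heads):
    """
    Creates a mask for the padding positions
    
    Arguments:
        matrix -- (n, m, e) array of embedded sequences; padded positions are all-zero vectors
        num_heads -- number of attention heads
    
    Returns:
        mask -- (n, num_heads, 1, m) binary tensor, 1 at padded positions and 0 elsewhere
    """
    # Check if each embedding vector is all zeros
    zero_rows = np.all(matrix == 0, axis=2)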
    
    # Convert boolean array to integer array (0s and 1s)
    padded_mask = zero_rows.astype(int)
    # Expand to make 4D: 
    expanded_padding_mask_init = tf.expand_dims(padded_mask, axis=1)
    expanded_padding_mask_final = tf.expand_dims(expanded_padding_mask_init, axis=1)
    # Repeat for each head: 
    final_mask = tf.cast(tf.tile(expanded_padding_mask_final, [1, num_heads, 1, 1]),tf.float32)  # (batch_size, num_heads, 1, seq_len)

    return final_mask
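
A minimal NumPy sketch of the same idea follows. It keeps a head axis of size 1 instead of tiling the mask, on the assumption that broadcasting takes care of the heads when the mask is added to a (batch, heads, seq_len, seq_len) score tensor. `padding_mask_np` is a hypothetical name.

```python
import numpy as np

def padding_mask_np(embedded):
    """embedded: (batch, seq_len, emb_dim); padded positions are all-zero vectors."""
    is_pad = np.all(embedded == 0, axis=-1).astype(np.float32)  # (batch, seq_len)
    return is_pad[:, np.newaxis, np.newaxis, :]                 # (batch, 1, 1, seq_len)

batch = np.zeros((2, 4, 3))
batch[0, :2] = 1.0   # first sample: 2 real tokens, 2 pads
batch[1, :3] = 0.5   # second sample: 3 real tokens, 1 pad

mask = padding_mask_np(batch)
scores = np.zeros((2, 3, 4, 4))          # (batch, heads, seq_len, seq_len)
masked_scores = scores + mask * -1e9     # the mask broadcasts over heads and query positions
print(mask[:, 0, 0, :])
```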

In [203]:
# Create the padding mask depending on the number of heads in the multi-head attention: 
padding_mask= create_padding_mask(X_trainmod,2) 
padding_mask.shape

TensorShape([1125, 2, 1, 30])

In [204]:
# Print the padding mask corresponding to the first sample in the dataset: 
print(padding_mask[0])
print(f"\nTwo matrices are printed, one for each head. \n"
       "Note that the same padding mask is applied to all heads since the source sample is the same. "
        "Indices 0 to 21 are zero, meaning that no masking is required for these words; the remaining positions must be masked.")

tf.Tensor(
[[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
   1. 1. 1. 1. 1. 1. 1.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
   1. 1. 1. 1. 1. 1. 1.]]], shape=(2, 1, 30), dtype=float32)

Two matrices are printed, one for each head. 
Note that the same padding mask is applied to all heads since the source sample is the same. Indices 0 to 21 are zero, meaning that no masking is required for these words; the remaining positions must be masked.


#### Try how it works: 

In [209]:
# Example: 
# Note that the first sample in the dataset has a length of 22 tokens (indices 0 to 21) 
print(X_trainmod[0])
print(f"\nHere is the word at index 21 (the last real word) of the first sample feedback: {X_trainmod[0][21]}\n")
print(f"Zero paddings start from position 22 in the sequence: {X_trainmod[0][22]}")

[[ 0.11891     0.15255    -0.082073   ... -0.57511997 -0.26671001
   0.92120999]
 [ 0.79238999  0.21864     0.68711001 ... -0.066753   -0.39660001
   0.74818999]
 [ 0.60307997 -0.32023999  0.088857   ... -0.25187001 -0.26879001
   0.36657   ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]

Here is the word at index 21 (the last real word) of the first sample feedback: [ 5.13589978e-01  1.96950004e-01 -5.19439995e-01 -8.62179995e-01
  1.54940002e-02  1.09729998e-01 -8.02929997e-01 -3.33609998e-01
 -1.61189993e-04  1.01889996e-02  4.67340015e-02  4.67510015e-01
 -4.74750012e-01  1.10380001e-01  3.93269986e-01 -4.36520010e-01
  3.99839997e-01  2.71090001e-01  4.26499993e-01 -6.06400013e-01
  8.11450005e-01  4.56299990e-01 -1.27260000e-01 -2.24739999e-01
  6.40709996e-01 -1.27670002e+00 -7.22310007e-01 -6.95900023e-01
  2.80450005e-02 -2.3071

In [205]:
padding_mask[0] * -1e9

<tf.Tensor: shape=(2, 1, 30), dtype=float32, numpy=
array([[[-0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00,
         -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00,
         -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00,
         -0.e+00, -1.e+09, -1.e+09, -1.e+09, -1.e+09, -1.e+09, -1.e+09,
         -1.e+09, -1.e+09]],

       [[-0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00,
         -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00,
         -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00, -0.e+00,
         -0.e+00, -1.e+09, -1.e+09, -1.e+09, -1.e+09, -1.e+09, -1.e+09,
         -1.e+09, -1.e+09]]], dtype=float32)>

In [208]:
# Try the masking: 
# (reshape_tensor is defined further below, in the multi-head attention section, and must be run first.) 

# Define the query, key, and value matrices: 
dense_q = Dense(units = 20)(X_trainmod) # shape = (#samples, len_seq, dim_q)
dense_k = Dense(units = 20)(X_trainmod) # shape = (#samples, len_seq, dim_k) 
dense_v = Dense(units = 30)(X_trainmod) # shape = (#samples, len_seq, dim_v) 

# Reshape the query, key, and value matrices: 
dense_qre = reshape_tensor(dense_q,2, pre_attention = True) #shape = (#samples, #heads, len_seq, dim_q/heads)
dense_kre = reshape_tensor(dense_k, 2, pre_attention = True) #shape = (#samples, #heads, len_seq, dim_k/heads)
dense_vre = reshape_tensor(dense_v, 2, pre_attention = True) #shape = (#samples, #heads, len_seq, dim_v/heads)

attention_scores, res = self_attention(dense_qre,dense_kre,dense_vre, padding_mask)

In [201]:
print(normalized_dotqk[0][0][0])#this is the masked, scaled dot-product for the first sample, first head, first word. 
print(f"\nThe padded positions hold -1e9, so the Softmax assigns them an attention weight of (numerically) zero. "
      "Without the mask, the padded vectors would receive a non-zero amount of attention.")

tf.Tensor(
[-2.4714701e+00 -1.6462295e+00 -2.0647089e+00 -2.1044219e+00
 -1.0884269e+00 -1.3197348e+00 -5.0339353e-01 -1.6755505e+00
 -1.3996890e+00 -1.2149099e+00 -1.1414751e+00  6.1669850e-01
 -1.9416343e+00 -1.2651926e+00  3.4386593e-01 -7.9303038e-01
  1.0365025e+00 -8.2757193e-01  8.4825709e-02 -2.2144122e+00
 -1.3065406e+00 -5.2829480e-01 -1.0000000e+09 -1.0000000e+09
 -1.0000000e+09 -1.0000000e+09 -1.0000000e+09 -1.0000000e+09
 -1.0000000e+09 -1.0000000e+09], shape=(30,), dtype=float32)

The padded positions hold -1e9, so the Softmax assigns them an attention weight of (numerically) zero. Without the mask, the padded vectors would receive a non-zero amount of attention.


In [202]:
print(attention_scores.shape) #(num_samples, #heads, len_seq,len_seq) 
print(attention_scores[0][0][0]) #this is the attention scores for the first sample, first head, for the first word:

(1125, 2, 30, 30)
tf.Tensor(
[0.00690022 0.01574927 0.01036376 0.00996025 0.0275113  0.02183008
 0.04938419 0.0152942  0.02015263 0.02424266 0.02608991 0.15136927
 0.01172109 0.02305381 0.11522534 0.03696581 0.23033306 0.03571076
 0.08893    0.00892282 0.02212003 0.04816965 0.         0.
 0.         0.         0.         0.         0.         0.        ], shape=(30,), dtype=float32)


##### Try the code in more detail: 

In [73]:
dotqk = tf.matmul(dense_qre, dense_kre, transpose_b = True) #must be of size (batch_size, heads, seq_len, seq_len) 
dotqk[0][0][0]

<tf.Tensor: shape=(30,), dtype=float32, numpy=
array([ 9.854767 ,  8.496491 ,  7.7601767,  6.7471952,  4.0660534,
        1.5270729,  1.5375006,  1.9661689,  0.4747281,  2.5537195,
        0.4514052,  1.0212986,  1.6682776,  4.2590213,  3.5995393,
        5.3422823,  5.162285 ,  5.785897 ,  5.343223 , 10.098306 ,
        9.047625 ,  8.059312 ,  4.5651455,  4.44317  ,  4.018174 ,
        3.7627876,  3.8784006,  4.133665 ,  4.086391 ,  3.4858587],
      dtype=float32)>

In [75]:
#Normalize: 
dim_k = tf.cast(dense_kre.shape[-1],tf.float32) 
normalized_dotqk = dotqk/tf.math.sqrt(dim_k)
print(tf.math.sqrt(dim_k))
print(normalized_dotqk[0][0][0])

tf.Tensor(3.1622777, shape=(), dtype=float32)
tf.Tensor(
[3.116351   2.6868265  2.4539833  2.1336505  1.2857989  0.48290285
 0.4862004  0.6217572  0.15012221 0.807557   0.14274685 0.322963
 0.5275557  1.3468207  1.1382742  1.689378   1.6324577  1.8296611
 1.6896755  3.1933646  2.86111    2.5485783  1.4436257  1.4050537
 1.2706583  1.1898979  1.226458   1.3071797  1.2922302  1.1023253 ], shape=(30,), dtype=float32)


In [76]:
normalized_dotqk += padding_mask* -1e9
normalized_dotqk[0][0][0]

<tf.Tensor: shape=(30,), dtype=float32, numpy=
array([ 3.1163509e+00,  2.6868265e+00,  2.4539833e+00,  2.1336505e+00,
        1.2857989e+00,  4.8290285e-01,  4.8620039e-01,  6.2175721e-01,
        1.5012221e-01,  8.0755699e-01,  1.4274685e-01,  3.2296300e-01,
        5.2755570e-01,  1.3468207e+00,  1.1382742e+00,  1.6893780e+00,
        1.6324577e+00,  1.8296611e+00,  1.6896755e+00,  3.1933646e+00,
        2.8611100e+00,  2.5485783e+00, -1.0000000e+09, -1.0000000e+09,
       -1.0000000e+09, -1.0000000e+09, -1.0000000e+09, -1.0000000e+09,
       -1.0000000e+09, -1.0000000e+09], dtype=float32)>

In [77]:
attention_scores =  tf.nn.softmax(tf.cast(normalized_dotqk, dtype=tf.float32),axis = -1)
attention_scores[0][0][0]

<tf.Tensor: shape=(30,), dtype=float32, numpy=
array([0.14328252, 0.09325092, 0.07388063, 0.0536305 , 0.02297178,
       0.01029203, 0.01032603, 0.0118251 , 0.00737864, 0.01423957,
       0.00732442, 0.00877082, 0.01076202, 0.02441721, 0.01982099,
       0.03439274, 0.03248977, 0.03957227, 0.03440297, 0.15475327,
       0.11100524, 0.0812106 , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ],
      dtype=float32)>

## Look-Ahead Mask: 

In [210]:
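def look_ahead_mask(dim): 
    
    """
    At each iteration of the decoder making predictions, pass the length of the input (dim) to this function to mask the subsequent words. 
    
    Returns a (1, 1, dim, dim) tensor with 1 at the positions to hide (above the main diagonal) and 0 elsewhere. 
    """
    # band_part keeps the main diagonal and all sub-diagonals; subtracting it from 1 leaves ones strictly above the diagonal: 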
    mask = 1 - linalg.band_part(ones((dim, dim)), -1, 0) 
    expanded_mask_init = tf.expand_dims(mask, axis = 0) #(1,len_seq, len_seq) 
    expanded_mask_final = tf.expand_dims(expanded_mask_init, axis = 0)
 
    return expanded_mask_final

#### Try the code to see how the look-ahead mask works: 

In [211]:
look_ahead_mask1 = look_ahead_mask(5)#try a smaller example than the dataset 
print(look_ahead_mask1)


tf.Tensor(
[[[[0. 1. 1. 1. 1.]
   [0. 0. 1. 1. 1.]
   [0. 0. 0. 1. 1.]
   [0. 0. 0. 0. 1.]
   [0. 0. 0. 0. 0.]]]], shape=(1, 1, 5, 5), dtype=float32)
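
The same pattern can be built in NumPy with `np.triu`, which is a convenient way to check the expected layout independently of TensorFlow. `look_ahead_mask_np` is a hypothetical name used for this sketch only.

```python
import numpy as np

def look_ahead_mask_np(size):
    """Return a (size, size) mask with 1 above the main diagonal (future positions)."""
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

mask = look_ahead_mask_np(5)
print(mask)
```

Row i of the mask hides every position after i, so the first word sees only itself and the last word sees the whole sequence.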


In [217]:
#Try the masking: 

# Define the query, key, and value matrices: 
dense_q = reshape(Dense(units = 20)(X_trainmod[0:2,0:5]), (2, 5, 20))# shape = (#samples, len_seq, dim_q)
dense_k = reshape(Dense(units = 20)(X_trainmod[0:2,0:5]),(2,5,20)) # shape = (#samples, len_seq, dim_k) 
dense_v = reshape(Dense(units = 20)(X_trainmod[0:2,0:5]),(2,5,20)) # shape = (#samples, len_seq, dim_v) 


In [218]:
# Reshape the query, key, and value matrices: 
dense_qre = reshape_tensor(dense_q,2, pre_attention = True) #shape = (#samples, #heads, len_seq, dim_q/heads)
dense_kre = reshape_tensor(dense_k, 2, pre_attention = True) #shape = (#samples, #heads, len_seq, dim_k/heads)
dense_vre = reshape_tensor(dense_v, 2, pre_attention = True)

dense_qre.shape #2 samples, 2 heads, len_seq = 5, and 10 features per head (2 heads * 10 = 20 = dim_q)

TensorShape([2, 2, 5, 10])

In [219]:
dotqk = tf.matmul(dense_qre, dense_kre, transpose_b = True) #must be of size (batch_size, heads, seq_len, seq_len) 
dim_k = tf.cast(dense_kre.shape[-1],tf.float32) 
normalized_dotqk = dotqk/tf.math.sqrt(dim_k)
normalized_dotqk #not yet masked 

<tf.Tensor: shape=(2, 2, 5, 5), dtype=float32, numpy=
array([[[[-8.43732730e-02, -7.93995082e-01, -4.75165278e-01,
          -3.78226489e-01,  4.82216001e-01],
         [ 1.42866743e+00,  2.02169847e+00,  1.53103852e+00,
           1.46419132e+00,  1.43432593e+00],
         [ 3.60690475e-01,  3.96175802e-01, -8.98327008e-02,
          -1.11132786e-01,  6.80906415e-01],
         [ 1.79796472e-01,  2.40602046e-01,  3.25290933e-02,
          -8.80601332e-02,  3.28476518e-01],
         [-2.47988656e-01, -7.86857754e-02, -2.97271460e-01,
          -2.34151274e-01,  2.09704652e-01]],

        [[-5.26181340e-01,  6.88323304e-02, -3.39775950e-01,
          -1.80493101e-01,  3.51474524e-01],
         [-7.38709748e-01, -3.05430740e-01, -9.91906375e-02,
          -7.81071559e-02,  7.52550006e-01],
         [-1.39334106e+00, -4.78385419e-01, -7.98267722e-01,
          -7.59888589e-01, -6.86964672e-03],
         [-1.05567908e+00, -4.79450911e-01, -6.87709332e-01,
          -7.32642293e-01,  2.09290

In [220]:
normalized_dotqk += look_ahead_mask1*-1e9
attention_scores =  tf.nn.softmax(tf.cast(normalized_dotqk, dtype=tf.float32),axis = -1)
print(f"Attention scores for the first sample, first head (row i holds the scores of word i over all words):\n {attention_scores[0][0]}")

#print(f"\n\nAttention scores given for the 10th word relative to the rest of the words: {attention_scores[0][10]}")

Attention scores for the first sample, first head (row i holds the scores of word i over all words):
 [[1.         0.         0.         0.         0.        ]
 [0.3559397  0.6440603  0.         0.         0.        ]
 [0.37405312 0.38756484 0.23838204 0.         0.        ]
 [0.27094594 0.28793216 0.23384346 0.20727839 0.        ]
 [0.17449728 0.20668833 0.16610603 0.17692864 0.27577972]]


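Row i holds the attention distribution of word i. Assuming that the rows (words) are processed one at a time, every entry to the right of the diagonal is zero, so each word attends only to itself and to the words before it. 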
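
The effect of the look-ahead mask is easiest to see when all raw scores are identical: each word then spreads its attention evenly over the positions it is allowed to see. The following is a small NumPy sketch under that assumption (all-zero scores, 4 positions).

```python
import numpy as np

size = 4
scores = np.zeros((size, size))              # identical raw scores for every query/key pair
mask = np.triu(np.ones((size, size)), k=1)   # 1 marks future positions
masked = scores + mask * -1e9

# Row-wise softmax
exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
print(weights)
```

The first row is [1, 0, 0, 0], the second is [0.5, 0.5, 0, 0], and so on down to a uniform 0.25 in the last row.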

## Creating the positional encodings: 

In [None]:
# Calculate the angles for positional embeddings: 

def get_angles(pos, k, d):
    """
    Get the angles for the positional encoding
    
    Arguments:
        pos -- Column vector containing the positions [[0], [1], ...,[N-1]]
        k --   Row vector containing the dimension span [[0, 1, 2, ..., d-1]]
        d(integer) -- Encoding size
    
    Returns:
        angles -- (pos, d) numpy array 
    """
    
    # Get i from dimension span k
    i = k//2
    # Calculate the angles using pos, i and d
    angles = pos/ (10000)**(2*i/d)

    
    return angles
    
def pos_emb(len_seq,len_emb): 
    
    """
    This function creates the positional encodings for all the words in the sequence based on the sinusoidal scheme: 
    
        PE(pos, 2i)   = sin(pos / 10000^(2i/len_emb))
        PE(pos, 2i+1) = cos(pos / 10000^(2i/len_emb))
    
    Input: 
    len_seq (int) : The length of the sequences input into the model. 
    len_emb (int) : The length of the word embeddings for every word in the sequence (assumed to be even). 

    Note: the size of the positional encoding and the word embeddings must match in order to add them in the next step. 

    Output: 
    res (tf.Tensor(1, len_seq, len_emb)) : row pos of res[0] holds the positional encoding for position pos in the sequence. 

    """

    len_i = int(len_emb/2)

    # Initialize the matrices that hold the positional encodings and the angles: 
    res = np.zeros((len_seq,len_emb))
    angles = np.zeros((len_seq,len_emb))
    
    #for each position in the sequence 
    for pos in range(len_seq): #e.g. with 30 words, pos ranges over 0-29
        
        #calculate the angles: 
        for i in range(len_i): #e.g. with len_emb = 50, i ranges over 0-24
            angles[pos,2*i] = pos/(10000**(2*i/len_emb))
            angles[pos, 2*i +1] = pos/(10000**(2*i/len_emb)) 
        
        # Even columns take the sine and odd columns the cosine of the same angle: 
        res[pos, 0::2] = np.sin(angles[pos,0::2])
        res[pos,1::2] = np.cos(angles[pos,0::2])
            
    return(tf.cast(res.reshape(1,len_seq,len_emb), dtype=tf.float32))
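
The nested loops can also be written as a single broadcasted expression, in the style of `get_angles`. The sketch below is a NumPy-only variant that assumes an even embedding size; `positional_encoding` is a hypothetical name, and the function returns a plain (len_seq, len_emb) array without the leading batch axis.

```python
import numpy as np

def positional_encoding(seq_len, emb_dim):
    pos = np.arange(seq_len)[:, np.newaxis]   # column vector of positions
    k = np.arange(emb_dim)[np.newaxis, :]     # row vector of embedding indices
    angles = pos / np.power(10000.0, 2 * (k // 2) / emb_dim)
    enc = np.zeros((seq_len, emb_dim))
    enc[:, 0::2] = np.sin(angles[:, 0::2])    # even indices: sine
    enc[:, 1::2] = np.cos(angles[:, 1::2])    # odd indices: cosine
    return enc

enc = positional_encoding(30, 50)
print(enc.shape)
```

At position 0 every angle is zero, so the sine entries are 0 and the cosine entries are 1, which gives a quick sanity check.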


In [64]:
# Create the positional embeddings: 
position_enc = pos_emb(X_trainmod.shape[1],X_trainmod.shape[2])
position_enc.shape

TensorShape([1, 30, 50])

In [48]:
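# Add the positional encoding to the word embeddings: 
# (after this step the padded rows are no longer all zeros, so create_padding_mask must be called on the embeddings before the addition) 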
X_trainmod = X_trainmod + position_enc 
print(X_trainmod.shape)

X_testmod = X_testmod + position_enc 
X_testmod.shape

(1125, 30, 50)


TensorShape([375, 30, 50])

## Defining the feed forward neural network: 
This will be used as a part of the encoder and decoder structures. 

In [324]:
def FullFeedForward(n_1, emb_size):
    # The model must return vectors of the same size as the input embeddings,
    # so that its output can be added back to its input and combined with the decoder.
    model = Sequential([
        Dense(n_1, activation='tanh', name="dense1"),      # (#samples, len_seq, n_1); the original Transformer uses ReLU here
        Dense(emb_size, activation='tanh', name="dense2")  # (#samples, len_seq, emb_size); the original Transformer uses a linear layer here
    ])
    return(model)
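
A dense layer applied to a 3-D tensor acts on the last axis only, so the feed-forward network treats every position in the sequence independently and with the same weights. The NumPy sketch below illustrates this with random, untrained weights; the sizes and the names `w1`, `w2`, and `feed_forward` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hidden = 6, 4
w1, b1 = rng.normal(size=(emb_dim, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, emb_dim)), np.zeros(emb_dim)

def feed_forward(x):
    """Two tanh layers applied to the last axis, mirroring the Sequential model."""
    return np.tanh(np.tanh(x @ w1 + b1) @ w2 + b2)

x = rng.normal(size=(2, 5, emb_dim))   # (#samples, len_seq, emb_dim)
y = feed_forward(x)                    # whole batch at once
single = feed_forward(x[0, 3])         # one position on its own
print(y.shape)
```

The output for position 3 of sample 0 is the same whether it is computed alone or as part of the batch.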
    

In [46]:
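# Define reshape_tensor, which will be used later for the multi-head attention: 

def reshape_tensor(q_matrix, heads, pre_attention): 
    """
    Split the last dimension across the heads before attention, or merge the heads back afterwards. 
    
    pre_attention = True  : (#samples, len_seq, dim) -> (#samples, heads, len_seq, dim/heads)
    pre_attention = False : (#samples, heads, len_seq, dim/heads) -> (#samples, len_seq, dim)
    """
    
    #pre_attention, we'll need to reform into 4d 
    if pre_attention:

        dense_qre = reshape(q_matrix, (shape(q_matrix)[0], shape(q_matrix)[1], heads, -1))
        dense_qre = transpose(dense_qre, ([0, 2, 1, 3]))
        
        
    #post_attention, we'll need to revert back to 3d, e.g. [1125, 2, 30, 15] -> [1125, 30, 30]
    else: 
        q_matrix_transpose = transpose(q_matrix, ([0,2,1,3]))
        dense_qre = reshape(q_matrix_transpose, (shape(q_matrix_transpose)[0], shape(q_matrix_transpose)[1], -1)) 
        
        
    return(dense_qre)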
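
Splitting and merging the heads are exact inverses, which can be verified on a small array. The sketch below uses NumPy and two hypothetical helpers, `split_heads` and `merge_heads`, that mirror the two branches of the function above.

```python
import numpy as np

def split_heads(x, heads):
    """(batch, seq_len, dim) -> (batch, heads, seq_len, dim // heads)"""
    batch, seq_len, dim = x.shape
    return x.reshape(batch, seq_len, heads, dim // heads).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(batch, heads, seq_len, depth) -> (batch, seq_len, heads * depth)"""
    batch, heads, seq_len, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, heads * depth)

x = np.arange(2 * 5 * 20, dtype=np.float32).reshape(2, 5, 20)
split = split_heads(x, heads=2)
restored = merge_heads(split)
print(split.shape, restored.shape)
```

Head 1 receives the second half of each feature vector, and merging restores the original tensor exactly.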
        

## Define the class for multi-head attention: 

In [344]:
class MultiHeadAttention(Layer): # note: this class shadows the Keras MultiHeadAttention imported at the top

    def __init__(self, dim_kv, dim_q, len_emb, heads, **kwargs):
        
        super(MultiHeadAttention, self).__init__(**kwargs) 
        self.heads = heads
        self.denseq = Dense(units = dim_q)
        self.densek = Dense(units = dim_kv)
        self.densev = Dense(units = dim_kv) 
        self.dense = Dense(units = len_emb)
    
    def call(self,q,k,v, masking, **kwargs): # self gives access to all the attributes defined in __init__. 
       
        # Define the query, key, and value matrices: 
        dense_q = self.denseq(q) # shape = (#samples, len_seq, dim_q)
        dense_k = self.densek(k) # shape = (#samples, len_seq, dim_k) 
        dense_v = self.densev(v) # shape = (#samples, len_seq, dim_v) 
        
        # Reshape: 
        dense_qre = reshape_tensor(dense_q, self.heads, pre_attention = True) #shape = (#samples, #heads, len_seq, dim_q/heads)
        dense_kre = reshape_tensor(dense_k, self.heads, pre_attention = True) #shape = (#samples, #heads, len_seq, dim_k/heads)
        dense_vre = reshape_tensor(dense_v, self.heads, pre_attention = True) #shape = (#samples, #heads, len_seq, dim_v/heads)
        
        # Calculate the attention scores: 
        # attention_scores: (#samples, #heads, len_seq, len_seq); res: (#samples, #heads, len_seq, dim_v/heads)
        attention_scores, res = self_attention(dense_qre, dense_kre,dense_vre,masking)
        
        # Revert the shape:
        attention_with_v = reshape_tensor(res, self.heads, pre_attention = False) #shape = (#samples, len_seq, dim_v)
        
        # Run through a final dense layer to project back to the embedding size: 
        res = self.dense(attention_with_v)  # shape = (#samples, len_seq, len_emb) 
        
        return(res)
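
To make the data flow of the layer concrete, here is a NumPy sketch of the forward pass: project, split into heads, attend, merge, and project back. It uses random, untrained weight matrices without biases, so it is an illustration of the shapes and of the masking behaviour under those assumptions, not an equivalent of the Keras layer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, heads, mask=None):
    batch, seq_len, _ = x.shape

    def split(t):  # (batch, seq_len, dim) -> (batch, heads, seq_len, dim / heads)
        return t.reshape(batch, seq_len, heads, -1).transpose(0, 2, 1, 3)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(k.shape[-1])
    if mask is not None:
        scores = scores + mask * -1e9                 # 1 in the mask hides a key position
    context = softmax(scores) @ v                     # (batch, heads, seq_len, dim_v / heads)
    merged = context.transpose(0, 2, 1, 3).reshape(batch, seq_len, -1)
    return merged @ wo                                # back to the embedding size

rng = np.random.default_rng(0)
emb, dim_qk, dim_v, heads = 50, 30, 30, 2
x = rng.normal(size=(3, 7, emb))
wq = 0.1 * rng.normal(size=(emb, dim_qk))
wk = 0.1 * rng.normal(size=(emb, dim_qk))
wv = 0.1 * rng.normal(size=(emb, dim_v))
wo = 0.1 * rng.normal(size=(dim_v, emb))

out = multi_head_attention(x, wq, wk, wv, wo, heads)
print(out.shape)
```

With a padding mask on the last positions, changing the masked inputs leaves the outputs of the unmasked positions unchanged, which is the property the mask is meant to provide.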


In [345]:
# Check if it works: 
dim_kv = 30 # keys and queries must have the same dimension for the dot product to work; the values share the key dimension here
dim_q = 30 
len_emb = 50
heads = 2 


function = MultiHeadAttention(dim_kv, dim_q, len_emb, heads)
function(X_trainmod, X_trainmod,X_trainmod, padding_mask).shape

TensorShape([1125, 30, 50])

## Define the Encoder layer: 

In [346]:
class Encoder(Layer):
    
    def __init__(self, dim_kv, dim_q, heads, fnn_neurons, len_emb, iter, drop_rate):
        
        super(Encoder,self).__init__()
        self.mha     = MultiHeadAttention(dim_kv, dim_q, len_emb, heads)
        self.norm1    = LayerNormalization(epsilon = 1e-6)
        self.norm2    = LayerNormalization(epsilon = 1e-6)
        self.drop    = Dropout(rate = drop_rate)
        self.fnn     = FullFeedForward(fnn_neurons, len_emb)
        self.iter    = iter

        
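    def call(self,x,training, masking): 
        """
        Apply the encoder block self.iter times (reusing the same weights on every pass) 
        and return a tensor of the same shape as x. 
        """
        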
        for _ in range(self.iter): 

            # Add dropout layer: 
            drop_x = self.drop(x, training = training) 
            
            # Calculate the attention scores: 
            mha_scores = self.mha(drop_x, drop_x, drop_x, masking = masking)
        
            # Add dropout and normalize: 
            dropout_1 = self.drop(mha_scores, training = training)
            norm_1  = self.norm1(dropout_1 + x )
        
            #Run through a fully connected neural network: 
            fnn_output = self.fnn(norm_1) 
            
            # Add dropout: 
            dropout_2 = self.drop(fnn_output, training = training)
        
            # Normalize: 
            x = self.norm2(dropout_2 + norm_1)
            
        return x
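
Each sub-layer in the encoder ends with an "add and normalize" step: the sub-layer output is added to its input, and the sum is normalized across the embedding axis. The NumPy sketch below shows that step without the learnable scale and offset that Keras's `LayerNormalization` also has; the inputs are random placeholders.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalize each vector along the last axis to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))              # input to the sub-layer
sublayer_out = rng.normal(size=(2, 5, 8))   # stand-in for the attention or feed-forward output

normed = layer_norm(x + sublayer_out)       # residual connection, then normalization
print(normed.mean(axis=-1).round(6))
```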
            
        

In [352]:
# Check if it works: 
dim_kv = 40 
dim_q = 40 
len_emb = 50
heads = 4 
masking = None # no mask is used in this quick check
fnn_neurons = 20
drop_rate = 0.1
function = Encoder(dim_kv, dim_q, heads, fnn_neurons, len_emb, iter = 10, drop_rate = drop_rate)
output_encoder = function(X_trainmod, training = True, masking = masking)
output_encoder.shape

TensorShape([1125, 30, 50])

## Define the Decoder layer: 

In [66]:
class Decoder(tf.keras.layers.Layer): 

    def __init__(self, len_emb, dim_kv, dim_q, heads, 
                dd_model, iter, 
                drop_rate = 0.1, epsilon = 1e-6):  # dd_model is the number of neurons in the hidden layer of the decoder's feed-forward network 
        super(Decoder, self).__init__()
        self.len_emb = len_emb
        self.mha1 = MultiHeadAttention(dim_kv, dim_q, len_emb, heads) # masked self-attention (the mask is passed in call) 
        self.mha2 = MultiHeadAttention(dim_kv, dim_q, len_emb, heads) # encoder-decoder attention (the mask is passed in call) 
        self.drop = Dropout(rate = drop_rate)
        self.layernorm1 = LayerNormalization(epsilon = epsilon)
        self.layernorm2 = LayerNormalization(epsilon = epsilon)
        self.layernorm3 = LayerNormalization(epsilon = epsilon)
        self.dense =  FullFeedForward(dd_model, len_emb) 
        self.iter = iter


#question! how does the built-in mha receive the q, k, v dims to map and create the q, k, v matrices? are they the defaults? 
#question! during training will the layer normalization parameters also train? if so, we need to define separate layer norms for each. 
#question! there are some dense layers in mha; how is the number of neurons in them defined here? 


    def call(self, x, enc_output, training, dec_pad_mask): 
        """
        The look-ahead mask is built inside the layer when training is True; otherwise it is None.
        """
        
        len_seq = x.shape[1]
        
        # Create the look-ahead mask: 
        if training:
            look_ahead_mask1 = look_ahead_mask(len_seq)
        else:
            look_ahead_mask1 = None 
    
        for _ in range(self.iter):
            
            # Positional encodings are added in the Transformer, not here.
        
            # Add a dropout layer: 
            x = self.drop(x, training = training) 
           
            # Run through an MHA with the look-ahead mask: 
            attn_mat1 = self.mha1(x, x, x, masking = look_ahead_mask1)
            
            # Add dropout here during training:  
            attn_mat1 = self.drop(attn_mat1, training = training)
            
            # Add and Normalize: 
            attn_mat1_x = self.layernorm1(attn_mat1 + x)
            
            # Run through the next MHA: query from the decoder, key and value from the encoder 
            attn_mat2 = self.mha2(attn_mat1_x, enc_output, enc_output, masking = dec_pad_mask)
            
            # Add dropout during training: 
            attn_mat2 = self.drop(attn_mat2, training = training) 
            
            # Add and Normalize: 
            attn_mat2_x = self.layernorm2(attn_mat2 +  attn_mat1_x) 
            
            # Run through a dense layer: 
            dense_output = self.dense(attn_mat2_x)
            
            # Add Dropout: 
            dense_drop = self.drop(dense_output, training = training)
            
            # Add and Normalize: 
            x = self.layernorm3(dense_drop + attn_mat2_x) # x is the result; it keeps the name x because it feeds the next loop iteration 
            
        return x
            
        

In [68]:
#Check if it works after you've defined your output sequence (decoder input):  
len_emb = 50 
dim_kv = 30 
dim_q = 50 
heads = 3 
dd_model = 20 
iter = 3 
drop_rate = 0.1
function_decoder = Decoder(len_emb, dim_kv, dim_q, heads, 
                           dd_model, iter, drop_rate = 0.1, epsilon = 1e-6)

function_decoder(y, output_encoder, training = True, dec_pad_mask = None).shape

## Define the Transformer architecture: 

In [77]:
class Transformer(tf.keras.layers.Layer): 

    def __init__(self, len_emb, dim_kv, dim_q, heads, d_model,
                dd_model, iterEnc, iterDec, df_model, len_seq_out,
                drop_rate = 0.1, epsilon = 1e-6):
        
        super(Transformer, self).__init__()
        self.len_emb = len_emb
        self.len_seq_out = len_seq_out
        
        self.encoder = Encoder(dim_kv, dim_q, heads, d_model, len_emb, iterEnc, drop_rate = drop_rate)
        
        self.decoder = Decoder(len_emb, dim_kv, dim_q, heads, dd_model, iterDec, drop_rate = drop_rate, epsilon = epsilon)
        
        self.dense =  Dense(units = df_model,activation = 'softmax') 
        
    def call(self, input_seqs, output_seqs, training, enc_pad_mask, dec_pad_mask, look_ahead_mask):
    
        """
        The input and output sequences must already be word embeddings.
        The output sequence also needs two extra tokens, <sos> and <eos>.
        The input and output sequence lengths may differ.
        Note: look_ahead_mask is currently unused, because the Decoder builds its own mask.
        """
        
        # Scale the embeddings by sqrt(len_emb) before adding the positional encodings
        # (the order used in the original Transformer): 
        len_seq_in = input_seqs.shape[1]
        input_seqs *= tf.math.sqrt(tf.cast(self.len_emb,tf.float32))
        
        # Add the positional encodings; no dropout is needed here as the encoder already has it: 
        input_seqs += pos_enc(len_seq_in, self.len_emb) 
        
        # Run through the encoder part: 
        enc_output = self.encoder(input_seqs, training = training, masking = enc_pad_mask)
        
        # Scale and add positional encoding for the output sequence: 
        output_seqs *= tf.math.sqrt(tf.cast(self.len_emb,tf.float32))
        output_seqs += pos_enc(self.len_seq_out, self.len_emb)
        
        # Run through the decoder part: 
        dec_output = self.decoder(output_seqs, enc_output, training = training, dec_pad_mask = dec_pad_mask)
        
        # Run through a linear layer with activation function softmax 
        res = self.dense(dec_output) 
        return res


Before running through the final linear layer, do we add dropout to the model? 

For the word embeddings, if we are to use the decoder structure, we need to extend them to include two extra tokens: $<sos>$ (start of the sentence) and $<eos>$ (end of the sentence). 

We want the softmax that produces the attention weights to assign no weight to the padded parts of the sequence. One option would be to replace the all-zero padded vectors with a large negative number (about -1e9, standing in for negative infinity) when creating the padded embeddings for each input. But if the padding value is set before the dot-product attention (before the softmax), multiplication with the q, k, and v matrices can change those vectors, and the softmax might then still assign nonzero attention to the padded positions. Therefore, the padding mask must be added after the dot product, to the scores themselves. Then apply softmax, then multiply with the V matrix. Where to scale? We scale the scores by $\sqrt{dim_k}$ after the dot product and before the mask is applied. 
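
The ordering described above (dot product, scale, add mask, softmax) can be sketched in NumPy. This is a minimal illustration and not the notebook's MultiHeadAttention class: the function name `masked_attention_weights` is hypothetical, and the mask convention (1 for real tokens, 0 for padding) is an assumption.

```python
import numpy as np

def masked_attention_weights(q, k, pad_mask):
    """q: (len_q, d), k: (len_k, d), pad_mask: (len_k,) with 1 = real token, 0 = padding."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                # scaled dot product, (len_q, len_k)
    scores = scores + (1.0 - pad_mask) * -1e9      # large negative value on padded key columns
    scores = scores - scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
pad_mask = np.array([1.0, 1.0, 1.0, 0.0])          # the last token is padding
weights = masked_attention_weights(q, k, pad_mask)
```

Because the mask is added to the scores rather than to the embeddings, the padded column receives zero weight regardless of what the projections did to the padded vectors.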

Preferably, the input to the Encoder already contains the word embeddings and the positional encodings. Inside the Encoder, we will have the multi-head attention (think of it as running self-attention several times in parallel) and a fully connected neural network, which will be called FullFeedForward. 

My intuition is that when the output is not normalized, the algorithm gets caught in many local minima and cannot converge easily or quickly. 

Use a separate layer norm for each add-and-normalize step, as the layer norms are also trainable. 

# Questions
Why is the embedding size also taken as an argument in MHA? We get matrices q, k, and v. The product $qk^T$ is (len_seq, len_seq), and multiplying by v gives (len_seq, dim_v). The final result of the attention mechanism must be a matrix of shape (len_seq, emb_size) so that it can be added to x, so the MHA needs emb_size for its last linear map. 

* Look into the arguments of the built-in MHA.
* Look into LayerNormalization.

### Multi-head attention? 
We input three x's (for self-attention they are the same tensor; in the decoder's second MHA the query comes from a different source than the key and value). The inputs are then mapped linearly to give us the matrices Query, Key and Value. 
* dimension of x: (#batches, len_seq, len_emb)
* dim of q: $Q = xW_Q$; if $W_Q$ is (len_emb, dim_q), then q is (len_seq, dim_q) for each example in the batch.
* dim of k: $K = xW_K$; if $W_K$ is (len_emb, dim_k), then k is (len_seq, dim_k). dim_k must equal dim_q for the dot product below.
* Similarly, $V = xW_V$ gives a value matrix of dimension (len_seq, dim_v).
  * if it is attention with only one head, then $QK^T$ has dim (len_seq, len_seq): scale by $\sqrt{dim_k}$, add the mask, apply softmax over the last axis, and apply dropout if given.
  * if it has n heads, then we split the query, key and value matrices into n pieces of depth dim_q/n, dim_k/n and dim_v/n. Each head gives (len_seq, len_seq) attention weights and a (len_seq, dim_v/n) output. We then concatenate the head outputs along the last axis to get the desired dim of (len_seq, dim_v). **Make sure you understand the concatenation.**
* product with v: softmax($QK^T$) is (len_seq, len_seq) and v is (len_seq, dim_v), so the result is (len_seq, dim_v). Note that the number of keys must be the same as the number of values for this product to work.
* the attention output then goes through a last dense layer (this is where len_emb comes in), and the result is a matrix of (len_seq, len_emb).
* then we add our initial x and normalize. In order to add x to the attention output, the output needs to have the same dim as x, which is why the last dense layer maps back to len_emb.
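
A minimal NumPy sketch of the shape bookkeeping for multi-head attention on a single example (no batch axis, no mask, no dropout). The weight names `w_q`, `w_k`, `w_v`, `w_o` and the function itself are illustrative assumptions; the notebook's own MultiHeadAttention class may organize this differently.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, heads):
    len_seq = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v                 # each (len_seq, dim)

    def split(m):
        # (len_seq, dim) -> (heads, len_seq, dim // heads)
        return m.reshape(len_seq, heads, -1).transpose(1, 0, 2)

    qh, kh, vh = split(q), split(k), split(v)
    depth = qh.shape[-1]
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(depth)       # (heads, len_seq, len_seq)
    context = softmax(scores) @ vh                             # (heads, len_seq, dim // heads)
    concat = context.transpose(1, 0, 2).reshape(len_seq, -1)   # heads concatenated: (len_seq, dim)
    return concat @ w_o                                        # mapped back to (len_seq, len_emb)

rng = np.random.default_rng(0)
len_seq, len_emb, dim_qkv, heads = 30, 50, 40, 4
x = rng.normal(size=(len_seq, len_emb))
w_q, w_k, w_v = (rng.normal(size=(len_emb, dim_qkv)) for _ in range(3))
w_o = rng.normal(size=(dim_qkv, len_emb))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, heads)
```

With these sizes the output has the same shape as `x`, which is what makes the residual addition possible.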

### Fully Connected Neural Network: 

We feed the matrix coming out of the attention mechanism into the fully connected neural network. How many neurons? What matters is that the output layer has len_emb neurons in order to match the dim of x. Why do they need to match? Because we again add the input of this block to its output, followed by another layer of normalization. 

Then copy the encoder output and pass it as key and value to the decoder network. 
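
A small NumPy sketch of why the last layer needs len_emb neurons. The ReLU hidden layer and the parameter-free `layer_norm` are assumptions made for illustration; the notebook's FullFeedForward and Keras LayerNormalization (which also has trainable scale and offset) may differ.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    # Normalizes each position over the embedding axis (no trainable gamma/beta here)
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def feed_forward_block(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)   # (len_seq, fnn_neurons), ReLU assumed
    out = hidden @ w2 + b2                  # (len_seq, len_emb), must match x
    return layer_norm(out + x)              # residual add, then normalize

rng = np.random.default_rng(0)
len_seq, len_emb, fnn_neurons = 30, 50, 20
x = rng.normal(size=(len_seq, len_emb))
w1, b1 = rng.normal(size=(len_emb, fnn_neurons)), np.zeros(fnn_neurons)
w2, b2 = rng.normal(size=(fnn_neurons, len_emb)), np.zeros(len_emb)
ffn_out = feed_forward_block(x, w1, b1, w2, b2)
```

If `w2` mapped to any width other than len_emb, the `out + x` addition would fail with a shape error.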

# Question: isn't the dot product we are talking about here actually a cross product?!
No. $QK^T$ is an ordinary matrix product: entry (i, j) is the dot product of query row i with key row j.

Do we want to define another function that takes the desired dims and delivers the query, key and value matrices? 
Then the MHA no longer needs dim_kv and dim_q as inputs. Would we still need the masking? Yes, in self-attention. 
We need the MHA to take three arguments: q, k, v. 

* How do we initialize the q, k, and v matrices?

    A multi-head attention class is defined that, based on the training x, creates the q, k, and v matrices by applying a separate dense layer to the input sequence for each of them. 


* How is this model trained?
  Still a question.

* For the encoder layer, what attributes do we need?
   * A better question to ask is: what do we want the Encoder layer to do?
     When running the encoder layer, we pass in the input sequence; it first gets word embeddings, then positional encodings. We then run the attention model on it to get the attention output. We apply dropout, add x, and normalize. Then we run through a fully connected neural network, apply another dropout layer, add the input of that network, and normalize.

* What is the purpose of the Dropout function and what are its arguments?
 
  Let's assume the dropout rate is 0.1. During training, the dropout layer randomly selects 10% of the input units and replaces them with zeros (Keras also scales the remaining units by 1/(1 - rate) so the expected sum is unchanged). This prevents the model from overfitting the training set and from becoming too reliant on particular units. In the call function, make sure you pass the training argument through (training = training) so that the model applies dropout only during training and does nothing in inference mode (making predictions). 

* As an alternative to defining our own multi-head attention, we could use the one built into TensorFlow. Check whether the calculations are all the same and what the arguments to this layer are. 
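
For the dropout question above, here is a minimal NumPy sketch of what a dropout layer does in training versus inference mode. The `dropout` function is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def dropout(x, rate, training, rng):
    if not training or rate == 0.0:
        return x                                # inference: input passes through unchanged
    keep = rng.random(x.shape) >= rate          # each unit is dropped with probability `rate`
    return x * keep / (1.0 - rate)              # rescale survivors to keep the expected sum

rng = np.random.default_rng(0)
x = np.ones((1000, 50))
train_out = dropout(x, rate=0.1, training=True, rng=rng)
infer_out = dropout(x, rate=0.1, training=False, rng=rng)
zero_fraction = np.mean(train_out == 0.0)       # close to the dropout rate
```

In training mode roughly 10% of the entries become zero and the rest become 1/0.9; in inference mode the output equals the input.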

The next task is to have an encoder layer, then a decoder, and then the transformer. To the transformer, we would like to pass only x, without adding embeddings or positional encodings by hand. For the encoder part, we would like to repeat the encoder block multiple times, so essentially we want to add a loop to the encoder section. How to do that? 
What is going to be on repeat? The full encoder layer.
So what would be the input to the encoder? x. 
At first, x is the training set, but in the following iterations of the loop we take the output of the encoder and feed it back in as the next input. In that sense it is sequential, but the number of steps is much smaller than the length of the sequence. Note that looping over the same layer objects reuses the same weights in every iteration, whereas the original Transformer stacks separate layers that each have their own weights. I would like to see how repeating the loop actually benefits training. 
* Try adding multiple iterations of the encoder, then try with only one encoder layer, and see if there is a difference in model performance. 

Cool thing to know: you can use the underscore for any variable that is not going to be used later. For example, if you know a function returns 3 values and you only need the first two, you can assign the third to an underscore. Or in a for loop, you can write for _ in range(), which means the loop counter is not used inside the loop, so you don't bother naming it. 

* Note that we must make sure that in the Bahdanau attention notebook we defined the correct variables to be saved and discarded in the post-attention LSTM. 
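
A tiny pure-Python illustration of both uses of the underscore; `min_max_mean` is a made-up helper for the example.

```python
def min_max_mean(values):
    return min(values), max(values), sum(values) / len(values)

# Only the first two return values are needed, so the third goes to "_"
lowest, highest, _ = min_max_mean([3, 1, 4, 1, 5])

repeats = []
for _ in range(3):          # the loop counter is never read inside the body
    repeats.append("step")
```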

So what does the decoder do? 
Before starting the decoder, the encoder code from the Coursera assignment should be complete. For now, the goal is to have at least an understanding of the decoder before going through it; we do not necessarily start the code right away. 

So what does a decoder do? The decoder also has an input, which is encoded with embeddings and positional encodings. The decoder must then go through yet another MHA. This MHA takes three inputs: the query comes from the decoder input, and the key and value are the output of the encoder. Why? The query is where the model currently is in its prediction, so it carries information about what has already been predicted. The encoded input is passed as the key so the model learns which parts of the input to focus on most when predicting the next step. The attention weights are then multiplied with the value matrix, which is again the encoded input. So essentially, the decoder takes the information on what has already been predicted together with the full key matrix (the encoded input), decides which parts of the input to pay the most attention to, and once the attention weights are calculated, they weigh the encoded input. This is beautiful! The MHA might repeat for several iterations, and then its output is added to the initial input of the decoder and normalized. 

* The input of the decoder goes through a masked multi-head attention, which might repeat multiple times. After a dropout layer, its output is added to the initial input (embeddings plus positional encodings) and normalized. The result is fed into another MHA as the query, with the key and the value taken from the output of the encoder (another MHA in a loop). Then apply dropout, add the result to the query of this MHA, normalize, and run through an FFN. Then again apply dropout and add the input of the FFN to its output.

There might be another linear map, and then the result is run through the softmax. And voilà! 
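
The encoder-decoder attention step described above can be sketched in NumPy as follows. To keep it short, the learned q, k, v projections are left out (an assumption of this sketch), so the query is the decoder state itself and the key and value are the encoder output; `cross_attention` is a hypothetical name.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_x, enc_out):
    d = dec_x.shape[-1]
    scores = dec_x @ enc_out.T / np.sqrt(d)    # (len_out, len_in): one row per decoder position
    weights = softmax(scores)                  # how much each decoder step attends to each input token
    return weights @ enc_out, weights          # context is (len_out, len_emb)

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(30, 50))            # encoder output: len_in = 30, len_emb = 50
dec_x = rng.normal(size=(12, 50))              # decoder state: len_out = 12
context, weights = cross_attention(dec_x, enc_out)
```

The context has one row per decoder position, so it can be added back to the decoder state even though the input and output sequences have different lengths.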

OK, so the first step is to modify our MHA function. How? It should take the query, key and value as inputs. Previously, we passed in only the input x and calculated the query, key and value from it inside the MHA. Now take this calculation out: the query, key and value are defined outside the MHA and passed in to be reshaped and used to calculate the attention scores. But note that this change must be made after the loop in the encoder is introduced. 

We might also need to define a masked MHA. 

If needed, we can run our x matrix through the Coursera Jupyter notebook and check whether the inputs and outputs are the same and whether one model performs differently from the other. 
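
A minimal NumPy sketch of the mask a masked MHA would use. The name `causal_mask` and the convention (1 = allowed, 0 = blocked) are assumptions for illustration and may not match the notebook's own `look_ahead_mask` function.

```python
import numpy as np

def causal_mask(len_seq):
    # Row i may attend to columns 0..i (itself and earlier positions) only
    return np.tril(np.ones((len_seq, len_seq)))

def apply_causal_softmax(scores):
    mask = causal_mask(scores.shape[-1])
    scores = scores + (1.0 - mask) * -1e9              # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
causal_weights = apply_causal_softmax(scores)
```

The first row attends only to the first token, and every entry above the diagonal is zero.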

# ? would this be helpful for the task of sentiment analysis? I believe it should be. 

#change the padding of all 1s to a padding of all zeros and see how the performance of the model might change. 
# you might also be interested in applying a padding mask to the model to examine the improvement in performance. 
#need to add training = training to every dropout call so that dropout only occurs in training mode. Note that right now, the model is 
#always in training mode (there is no inference mode), so the dropout layer is also applied during inference. 



There are multiple tasks to complete: 
1. Build the decoder network from scratch. (today) 
2. Build the Transformer's architecture. (tomorrow)
3. Learn about the dropouts. (tomorrow)
4. Learn about the masks. (tomorrow) 
5. Apply the transformer to a task (2 days each), 2 tasks: start on Friday and finish the first task on Saturday; finish the other task Saturday to Monday. 

Transformer: 
Embeddings for the encoder and decoder should occur here, but the positional encodings inside the encoder and decoder. 

Check the performance of the model both with and without editing the text. 

try other datasets to train and see if it outperforms. 

Question: we apply masking, but before that the positional encoding changes the inputs, so the padded rows are no longer all zeros and will affect the dot product. Once the dot product is calculated, we mask those positions so that no attention is given to them. Couldn't it be even more informative for the model if the positional encoding were simply zero for the padded tokens? No matter how small the positional encodings are, they still affect the dot product. 

Write down why the model performance did not improve with the GloVe word embeddings compared to one-hot vectors once the additive attention model is applied. 
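
One way to try the idea in the question above is to multiply the positional encoding by the padding mask before adding it. This is a sketch under assumptions: `sinusoidal_pos_enc` is a standard sinusoidal encoding written here for illustration (it may differ from the notebook's `pos_enc`), and the mask uses 1 for real tokens and 0 for padding.

```python
import numpy as np

def sinusoidal_pos_enc(len_seq, len_emb):
    pos = np.arange(len_seq)[:, None]
    i = np.arange(len_emb)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / len_emb)
    pe = np.zeros((len_seq, len_emb))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even embedding indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd embedding indices
    return pe

def add_pos_enc_ignoring_padding(x, pad_mask):
    pe = sinusoidal_pos_enc(x.shape[0], x.shape[1])
    return x + pe * pad_mask[:, None]         # padded rows receive no positional encoding

rng = np.random.default_rng(0)
pad_mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
x = rng.normal(size=(5, 50)) * pad_mask[:, None]   # padded rows are all zeros
encoded = add_pos_enc_ignoring_padding(x, pad_mask)
```

Note that once the padding mask gives padded keys zero attention weight, their values cannot reach the real tokens' outputs, so this change may matter less than it first appears; comparing both variants experimentally would settle it.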


Explore why we divide by the constant $\sqrt{dim_k}$ in the dot-product attention and why skipping it could become problematic. 
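
A quick NumPy experiment for this question, assuming the entries of q and k are independent with zero mean and unit variance (an idealization). Under that assumption, the variance of the raw dot product grows with dim_k, and dividing by $\sqrt{dim_k}$ brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_variance(dim_k, n_samples=20000):
    q = rng.normal(size=(n_samples, dim_k))
    k = rng.normal(size=(n_samples, dim_k))
    raw = np.sum(q * k, axis=-1)                       # one dot product per sample
    return raw.var(), (raw / np.sqrt(dim_k)).var()

raw_var, scaled_var = dot_product_variance(64)         # roughly 64 and roughly 1
```

Large unscaled scores push the softmax toward a near one-hot output, where its gradients become very small, which is the usual argument for the scaling.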

#****the question is when would the decoder be stopped? would we run it for a certain number of iterations? no, we will generate tokens until the 
#end-of-sentence token is generated. but right now, as a simplification, we will only run a loop in 
#which the decoder makes predictions. 
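
A minimal pure-Python sketch of the stopping rule described above: keep generating until the end-of-sentence token appears, with a maximum length as a safety net. `greedy_decode`, `next_token_fn` and `toy_next_token` are hypothetical names; in the real model, `next_token_fn` would run the decoder and pick the most probable next token.

```python
def greedy_decode(next_token_fn, sos_id, eos_id, max_len):
    tokens = [sos_id]                      # start from the <sos> token
    for _ in range(max_len):               # safety limit in case <eos> never appears
        next_id = next_token_fn(tokens)
        tokens.append(next_id)
        if next_id == eos_id:              # stop as soon as <eos> is generated
            break
    return tokens

# Toy stand-in for the model: always predicts "previous token id + 1"
def toy_next_token(tokens):
    return tokens[-1] + 1

finished = greedy_decode(toy_next_token, sos_id=0, eos_id=3, max_len=10)
cut_off = greedy_decode(toy_next_token, sos_id=0, eos_id=99, max_len=2)
```

The first call stops at the `<eos>` id; the second never reaches it and stops at the length limit instead.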