# Lab:  Language model - Emojify 

In this assignment you will use word vector representations to build an Emojifier. 
🤩 💫 🔥

You'll implement a model which inputs a sentence (such as "Let's go see the baseball game tonight!") and finds the most appropriate emoji to be used with this sentence (⚾️).

1. You'll start with a baseline model (Emojifier-V1) using word embeddings.
2. Then you will build a more sophisticated model (Emojifier-V2) that further incorporates an LSTM. 

In [1]:
# pip install emoji 
import numpy as np
import emoji
import matplotlib.pyplot as plt
import csv
import pandas as pd
from termcolor import colored

from sklearn.metrics import confusion_matrix

from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Dropout 
from tensorflow.keras.layers import Conv2DTranspose
from tensorflow.keras.layers import concatenate
from tensorflow.keras.layers import ZeroPadding2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import RepeatVector

#from emo_utils import *
#from test_utils import *

%matplotlib inline



In [12]:
def read_glove_vecs(glove_file):
    with open(glove_file, encoding="utf8") as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map

def softmax(x):
    """Compute softmax values for a vector of scores x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()


def read_csv(filename = 'data/emojify_data.csv'):
    phrase = []
    emoji = []

    with open (filename) as csvDataFile:
        csvReader = csv.reader(csvDataFile)

        for row in csvReader:
            phrase.append(row[0])
            emoji.append(row[1])

    X = np.asarray(phrase)
    Y = np.asarray(emoji, dtype=int)

    return X, Y

def convert_to_one_hot(Y, C):
    Y = np.eye(C)[Y.reshape(-1)]
    return Y


emoji_dictionary = {#"0": ":red_heart:",    # :heart: prints a black instead of red heart depending on the font
                    "0": "\u2764\ufe0f",
                    "1": ":baseball:",
                    "2": ":smile:",
                    "3": ":disappointed:",
                    "4": ":fork_and_knife:"}

def label_to_emoji(label):
    """
    Converts a label (int or string) into the corresponding emoji code (string) ready to be printed
    """ 
    #return emoji.emojize(emoji_dictionary[str(label)], use_aliases=True)
    return emoji.emojize(emoji_dictionary[str(label)], language='alias')
              
    
def print_predictions(X, pred):
    print()
    for i in range(X.shape[0]):
        print(X[i], label_to_emoji(int(pred[i])))
        
        
def plot_confusion_matrix(y_actu, y_pred, title='Confusion matrix', cmap=plt.cm.gray_r):
    
    df_confusion = pd.crosstab(y_actu, y_pred.reshape(y_pred.shape[0],), rownames=['Actual'], colnames=['Predicted'], margins=True)
    
    df_conf_norm = df_confusion / df_confusion.sum(axis=1)
    
    plt.matshow(df_confusion, cmap=cmap) # imshow
    #plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(df_confusion.columns))
    plt.xticks(tick_marks, df_confusion.columns, rotation=45)
    plt.yticks(tick_marks, df_confusion.index)
    #plt.tight_layout()
    plt.ylabel(df_confusion.index.name)
    plt.xlabel(df_confusion.columns.name)
    
    
    
def predict(X, Y, W, b, word_to_vec_map):
    """
    Given X (sentences) and Y (emoji indices), predict emojis and compute the accuracy of your model over the given set.
    
    Arguments:
    X -- input data containing sentences, numpy array of shape (m, None)
    Y -- labels, containing index of the label emoji, numpy array of shape (m, 1)
    
    Returns:
    pred -- numpy array of shape (m, 1) with your predictions
    """
    m = X.shape[0]
    pred = np.zeros((m, 1))
    any_word = list(word_to_vec_map.keys())[0]
    # dimensionality of the GloVe vectors
    n_h = word_to_vec_map[any_word].shape[0] 
    
    for j in range(m):                       # Loop over training examples
        
        # Split jth test example (sentence) into list of lower case words
        words = X[j].lower().split()
        
        # Average words' vectors
        avg = np.zeros((n_h,))
        count = 0
        for w in words:
            if w in word_to_vec_map:
                avg += word_to_vec_map[w]
                count += 1
        
        if count > 0:
            avg = avg / count

        # Forward propagation
        Z = np.dot(W, avg) + b
        A = softmax(Z)
        pred[j] = np.argmax(A)
        
    print("Accuracy: "  + str(np.mean((pred[:] == Y.reshape(Y.shape[0],1)[:]))))
    
    return pred

In [4]:
# Compare the two inputs
def comparator(learner, instructor):
    for a, b in zip(learner, instructor):
        if tuple(a) != tuple(b):
            print(colored("Test failed", attrs=['bold']),
                  "\n Expected value \n\n", colored(f"{b}", "green"), 
                  "\n\n does not match the input value: \n\n", 
                  colored(f"{a}", "red"))
            raise AssertionError("Error in test") 
    print(colored("All tests passed!", "green"))

# extracts the description of a given model
def summary(model):
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    result = []
    for layer in model.layers:
        descriptors = [layer.__class__.__name__, layer.output_shape, layer.count_params()]
        if (type(layer) == Conv2D):
            descriptors.append(layer.padding)
            descriptors.append(layer.activation.__name__)
            descriptors.append(layer.kernel_initializer.__class__.__name__)
        if (type(layer) == MaxPooling2D):
            descriptors.append(layer.pool_size)
            descriptors.append(layer.strides)
            descriptors.append(layer.padding)
        if (type(layer) == Dropout):
            descriptors.append(layer.rate)
        if (type(layer) == ZeroPadding2D):
            descriptors.append(layer.padding)
        if (type(layer) == Dense):
            descriptors.append(layer.activation.__name__)
        if (type(layer) == LSTM):
            descriptors.append(layer.input_shape)
            descriptors.append(layer.activation.__name__)
            descriptors.append(layer.return_sequences)
        if (type(layer) == RepeatVector):
            descriptors.append(layer.n)
        result.append(descriptors)
    return result

## PART 1 - Baseline Model: Emojifier-V1

### 1.1 - Dataset EMOJISET

You have a tiny dataset (X, Y) where:
- X contains sentences (strings), one sentence per row.
- Y contains an integer label between 0 and 4 corresponding to an emoji for each sentence.

Load the dataset. It is split into training (*train_emoji.csv*) and test (*test_emoji.csv*) sets.

<img src="https://github.com/renanaferreira/CAA-repository/blob/main/Lab10_emojify/images/data_set.png?raw=true" style="width:700px;height:300px;">
<caption><center><font color='purple'><b>Figure 1</b>: EMOJISET - a classification problem with 5 classes. A few examples of sentences. </center></caption>

In [9]:
# function read_csv is part of emo_utils.py

X_train, Y_train = read_csv('/kaggle/input/emojify/train_emoji.csv')

X_test, Y_test = read_csv('/kaggle/input/emojify/test_emoji.csv')

#Check the dimension of each variable. 
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
print()

# How many train examples (sentences) ? Print some of them.
print(len(X_train))
print()
print(X_train[1])
print()

# How many test examples (sentences) ? 
print(len(X_test))


(132,)
(132,)
(56,)
(56,)

132

I am proud of your achievements

56


In [10]:
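# maxLen: number of words in the longest training sentence
# (max(..., key=len) picks the sentence with the most characters)
maxLen = len(max(X_train, key=len).split())

# How many words does the longest sentence contain?
print(maxLen)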


10
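
Note that `max(X_train, key=len)` selects the longest sentence by character count, which is not necessarily the sentence with the most words. The sketch below contrasts the two approaches on a made-up array `sentences` (hypothetical, not the lab data); whether they differ on EMOJISET is not checked here.

```python
import numpy as np

# Hypothetical toy sentences, not taken from the dataset
sentences = np.array(["extraordinarily disappointing", "I am so happy for you"])

# Longest sentence by characters, then count its words (the approach used above)
by_chars = len(max(sentences, key=len).split())

# Largest word count over all sentences
by_words = max(len(s.split()) for s in sentences)

print(by_chars, by_words)
```

If the two values differ on real data, the word-based maximum is the safer choice for sizing padded index arrays.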


In [14]:
# Print the sentence at index 1 of X_train and its label from Y_train. 
idx = 1
print(X_train[idx], label_to_emoji(Y_train[idx]))

# Loop over the first 10 sentences. 
for idx in range(10):
    print(X_train[idx], label_to_emoji(Y_train[idx]))

I am proud of your achievements 😄
never talk to me again 😞
I am proud of your achievements 😄
It is the worst day in my life 😞
Miss you so much ❤️
food is life 🍴
I love you mum ❤️
Stop saying bullshit 😞
congratulations on your acceptance 😄
The assignment is too long  😞
I want to go play ⚾


### 1.2 - Overview of the baseline model Emojifier-V1 

<center>
<img src="https://github.com/renanaferreira/CAA-repository/blob/main/Lab10_emojify/images/image_1.png?raw=true" style="width:900px;height:300px;">
    <caption><center><font color='purple'><b>Figure 2</b>: Baseline model (Emojifier-V1).</center></caption>
</center></font>


#### Inputs and Outputs
* Input of the model is a string corresponding to a sentence (e.g. "I love you"). 
* Output is a probability vector of shape (1,5), giving the probability of each emoji.
* The (1,5) probability vector is passed to an argmax layer, which extracts the index of the emoji with the highest probability.

#### One-hot Encoding
* To get the labels into a format suitable for training a softmax classifier, convert $Y$ from its current shape $(m,)$ into a "one-hot representation" of shape $(m, 5)$.
    * Each row is a one-hot vector giving the label of one example.
    * `Y_oh` stands for "Y-one-hot" in the variable names `Y_oh_train` and `Y_oh_test`: 

In [65]:
# Function convert_to_one_hot is part of emo_utils.py

Y_oh_train = convert_to_one_hot(Y_train, C = 5)

Y_oh_test = convert_to_one_hot(Y_test, C = 5)

See what `convert_to_one_hot()` did. Change `idx` to print different values. 

In [18]:
idx = 1

print(X_train[idx])

print(Y_train[idx])

print(label_to_emoji(Y_train[idx]))

print(Y_oh_train[idx])
      

I am proud of your achievements
2
😄
[0. 0. 1. 0. 0.]
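
To see why `np.eye(C)[Y.reshape(-1)]` produces one-hot rows, here is a minimal sketch with made-up labels (the array `labels` is hypothetical, not part of the dataset).

```python
import numpy as np

C = 5                          # number of emoji classes
labels = np.array([2, 0, 4])   # hypothetical labels, for illustration only

# np.eye(C) is the C x C identity matrix; indexing it with the labels
# selects row `label` for each example, which is that label's one-hot vector
one_hot = np.eye(C)[labels.reshape(-1)]
print(one_hot)

# np.argmax along each row recovers the original integer labels
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```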


<a name='1-3'></a>
### 1.3 - Implementing Emojifier-V1

As shown in Figure 2 (above), the first steps are to:
* Convert each word in the input sentence into its word vector representation.
* Take the average of the word vectors. 

You will use pre-trained 50-dimensional GloVe embeddings. 

In [19]:
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('/kaggle/input/glovedata/glove.6B.50d.txt')

You've loaded:
- `word_to_index`: dictionary mapping from words to their indices in the vocabulary 
    - (400,000 words, with valid indices ranging from 1 to 400,000; `read_glove_vecs` starts counting at 1, which leaves index 0 free for padding)
- `index_to_word`: dictionary mapping from indices to their corresponding words in the vocabulary
- `word_to_vec_map`: dictionary mapping words to their GloVe vector representation. (50-dimensional)

In [20]:
# What is the index of the word "cucumber" in the vocabulary ?
print(word_to_index['cucumber'])

#What is the word in the vocabulary with index 289846 ?
print(index_to_word[289846])

113317
potatos


In [22]:
print(word_to_vec_map['cucumber'].shape)

(50,)


###  Implement `sentence_to_avg()` 

Here, two steps are carried out:

1. Convert every sentence to lower-case, then split the sentence into a list of words. 
    * `X.lower()` and `X.split()` are useful.
2. For each word in the sentence, access its GloVe representation.
    * Then take the average of all of these word vectors.

In [23]:
def sentence_to_avg(sentence, word_to_vec_map):
    """
    Converts a sentence into a list of words. 
    Extracts the GloVe representation of each word,
    averages its value into a single vector encoding the meaning 
    of the complete sentence.
    
    Arguments:
    sentence -- string, one training example from X
    word_to_vec_map -- dictionary mapping every word in the vocabulary 
    into its 50-dimensional vector representation
    
    Returns:
    avg -- average vector encoding information about the sentence, 
    numpy-array of shape (50,)
    """
    # Get just one word contained in the word_to_vec_map. 
    any_word = list(word_to_vec_map.keys())[0]
    
    # STEP 1: Split sentence into list of lower case words
    words = sentence.lower().split()

    # Initialize the average word vector;
    # it should have the same shape as the word vectors.
    avg = np.zeros(word_to_vec_map[any_word].shape)
    
    # Initialize count to 0
    count = 0
    
    # STEP 2: Average the word vectors. 
    # Loop over all words in the list "words"
    for w in words:
        # Check that the word exists in word_to_vec_map
        # (a dictionary membership test; no need to build a list of keys)
        if w in word_to_vec_map:
            avg = avg + word_to_vec_map[w]
            # Increment count
            count += 1
          
    if count > 0:
        # Get the average, only if count > 0
        avg = avg / count
    
  
    return avg
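
The behaviour of the averaging step can be checked by hand on a tiny example. The sketch below uses a made-up 3-dimensional dictionary `toy_vecs` and a compact helper `average_embedding` (both hypothetical names, written only for this illustration) that follows the same two steps as `sentence_to_avg()`.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (the GloVe vectors in this lab have 50 dimensions)
toy_vecs = {
    "i":    np.array([1.0, 0.0, 0.0]),
    "love": np.array([0.0, 2.0, 0.0]),
    "you":  np.array([0.0, 0.0, 3.0]),
}

def average_embedding(sentence, vecs):
    """Average the vectors of the known words in `sentence`."""
    dim = next(iter(vecs.values())).shape
    known = [vecs[w] for w in sentence.lower().split() if w in vecs]
    # If no word is known, fall back to the zero vector
    return np.mean(known, axis=0) if known else np.zeros(dim)

avg_known = average_embedding("I love you", toy_vecs)   # all three words known
avg_mixed = average_embedding("I love zzz", toy_vecs)   # "zzz" is skipped
avg_empty = average_embedding("zzz qqq", toy_vecs)      # no known words

print(avg_known, avg_mixed, avg_empty)
```

Unknown words are skipped rather than counted, so they do not pull the average towards zero.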

### 1.4 Implement the baseline model (Fig.2)

After using `sentence_to_avg()` you need to:
* Pass the average through forward propagation
* Compute the cost
* Backpropagate to update the softmax parameters

* Equations to implement the forward pass & compute cross-entropy cost are below:
* Variable $Y_{oh}$ is one-hot encoding of output labels. 

$$ z^{(i)} = W \cdot avg^{(i)} + b$$

$$ a^{(i)} = \mathrm{softmax}(z^{(i)})$$

$$ \mathcal{L}^{(i)} = - \sum_{k = 0}^{n_y - 1} Y_{oh,k}^{(i)} \log(a^{(i)}_k)$$
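
For a softmax output with cross-entropy loss, the gradients take a simple form: $dz = a - Y_{oh}$, $dW = dz \cdot avg^T$ and $db = dz$. The sketch below checks these expressions against central finite differences on random toy data. The sizes, the random seed and the helper `loss` are assumptions made for this illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(W, b, avg, y_oh):
    """Cross-entropy loss of one example for a single softmax layer."""
    return -np.sum(y_oh * np.log(softmax(W @ avg + b)))

rng = np.random.default_rng(0)
n_y, n_h = 5, 4                      # toy sizes; the lab uses n_h = 50
W = rng.normal(size=(n_y, n_h))
b = np.zeros(n_y)
avg = rng.normal(size=n_h)           # stands in for an averaged sentence vector
y_oh = np.eye(n_y)[2]                # one-hot label for class 2

# Analytic gradients
a = softmax(W @ avg + b)
dz = a - y_oh
dW = np.outer(dz, avg)
db = dz

# Central finite differences on one weight and one bias
eps = 1e-6
E = np.zeros_like(W)
E[1, 3] = eps
num_dW = (loss(W + E, b, avg, y_oh) - loss(W - E, b, avg, y_oh)) / (2 * eps)
e_b = np.zeros_like(b)
e_b[1] = eps
num_db = (loss(W, b + e_b, avg, y_oh) - loss(W, b - e_b, avg, y_oh)) / (2 * eps)

print(abs(num_dW - dW[1, 3]), abs(num_db - db[1]))
```

Both printed differences should be close to zero, which supports the gradient expressions.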

In [31]:
def model(X, Y, word_to_vec_map, learning_rate=0.01, num_iterations=100):
    """
    Trains a softmax classifier on averaged word vectors, in numpy.
    
    Arguments:
    X -- input data, numpy array of sentences (strings) of shape (m,)
    Y -- labels, numpy array of integers between 0 and 4 of shape (m,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary 
                       into its 50-dimensional vector representation  
    learning_rate -- learning rate for the stochastic gradient descent algorithm
    num_iterations -- number of iterations (passes over the training set)
    
    Returns:
    pred -- vector of predictions, numpy-array of shape (m, 1)
    W -- weight matrix of the softmax layer, of shape (n_y, n_h)
    b -- bias of the softmax layer, of shape (n_y,)
    """
    
    # Get just one word contained in the word_to_vec_map. 
    any_word = list(word_to_vec_map.keys())[0]
        
    # Initialize cost to 0
    cost = 0
    
    # Number of training examples ?
    m = X.shape[0] 
    
    # Number of classes 
    n_y = len(np.unique(Y))
    
    # dimensions of GloVe vectors 
    n_h = word_to_vec_map[any_word].shape[0] 
    
    # Initialize parameters using Xavier initialization
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))
    
    # Convert Y to one hot encoding with n_y classes
    Y_oh = convert_to_one_hot(Y=Y, C=n_y)
    
    # Optimization loop
    # Loop over the number of iterations
    for t in range(num_iterations): 
        # Loop over the training examples
        for i in range(m):        

            # Apply function sentence_to_avg to average the word vectors 
            # from the i'th training example

            avg = sentence_to_avg(sentence=X[i], word_to_vec_map=word_to_vec_map)

            # Forward propagate the avg through the softmax layer
            z = np.add(np.dot(W,avg),b)
            
            # function softmax() is part of emo_utils.py 
            a = softmax(z) 

            # Compute cost using one hot representation of i'th training label
            # and the output of the softmax (a)
            cost = -np.sum(np.dot(Y_oh[i], np.log(a)))
            
            # Compute gradients 
            dz = a - Y_oh[i]
            dW = np.dot(dz.reshape(n_y,1), avg.reshape(1, n_h))
            db = dz

            # Update parameters with Stochastic Gradient Descent
            # (after processing a single training examples)
            W = W - learning_rate * dW
            b = b - learning_rate * db
        
        if t % 10 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            
            #Function predict is part of emo_utils.py
            pred = predict(X, Y, W, b, word_to_vec_map) 

    return pred, W, b

Train the model and learn the softmax parameters (W, b). **Training may take some time!**

In [32]:
np.random.seed(1)
pred, W, b = model(X_train, Y_train, word_to_vec_map)

Epoch: 0 --- cost = 1.9520498812810076
Accuracy: 0.3484848484848485
Epoch: 10 --- cost = 1.0040987758894053
Accuracy: 0.7272727272727273
Epoch: 20 --- cost = 0.5388772571119417
Accuracy: 0.803030303030303
Epoch: 30 --- cost = 0.3331218997365079
Accuracy: 0.803030303030303
Epoch: 40 --- cost = 0.23144766289423163
Accuracy: 0.8257575757575758
Epoch: 50 --- cost = 0.1747265584802322
Accuracy: 0.8560606060606061
Epoch: 60 --- cost = 0.1398575258401195
Accuracy: 0.8787878787878788
Epoch: 70 --- cost = 0.1167706122397682
Accuracy: 0.8939393939393939
Epoch: 80 --- cost = 0.10058743375666801
Accuracy: 0.9090909090909091
Epoch: 90 --- cost = 0.08872634092095644
Accuracy: 0.9242424242424242


### 1.5 Model Performance 


In [34]:
# Apply function predict to get train & test accuracy of the trained model

# Training set: accuracy around 0.93 for 100 iterations
pred_train = predict(
    X=X_train,
    Y=Y_train,
    W=W,
    b=b,
    word_to_vec_map=word_to_vec_map)

# Test set: accuracy around 0.59 in this run
pred_test = predict(
    X=X_test,
    Y=Y_test,
    W=W,
    b=b,
    word_to_vec_map=word_to_vec_map)

Accuracy: 0.9318181818181818
Accuracy: 0.5892857142857143


In [35]:
def predict_single(sentence, W=W, b=b, word_to_vec_map=word_to_vec_map):
    """
    Given a sentence predict emojis.
    
    Arguments:
    sentence -- input data containing a sentence
    
    Returns:
    pred -- model predictions
    """

    # Get just one word contained in the word_to_vec_map. 
    any_word = list(word_to_vec_map.keys())[0]
    
    # Dimension of GloVe vectors   
    n_h = word_to_vec_map[any_word].shape[0] 
        
    # Split the sentence into a list of lower-case words
    words = sentence.lower().split()

    # Initialize the average word vector to zeros
    avg = np.zeros((n_h,))
    count = 0
    for w in words:
        if w in word_to_vec_map:
            avg = avg + word_to_vec_map[w]
            #Increment count
            count += 1

    if count > 0:
        # Get the average, only if count > 0
        avg = avg / count

    # Forward propagation
    Z = np.dot(W, avg) + b
    A = softmax(Z)
    pred = np.argmax(A)
            
    return pred

In [36]:
label_to_emoji(int(predict_single("I love you")))

'❤️'

#### The Model Matches Emojis to Relevant Words
In the training set, the algorithm saw the sentence 
>"I love you." with the label ❤️. 
* The word "adore" does not appear in the training set. Let's see what happens with the sentence "I adore you."

In [37]:
X_my_sentences = np.array(["i adore you", "i love you", "funny lol", 
"lets play with a ball", "food is ready", "not feeling happy", 
"This movie is not good and not enjoyable"])

# One label per sentence; the last sentence should map to label 3 (disappointed)
Y_my_labels = np.array([[0], [0], [2], [1], [4], [3], [3]])

pred = predict(X_my_sentences, Y_my_labels , W, b, word_to_vec_map)
print_predictions(X_my_sentences, pred)

Accuracy: 0.7142857142857143

i adore you ❤️
i love you ❤️
funny lol 😄
lets play with a ball ⚾
food is ready 🍴
not feeling happy 😄
This movie is not good and not enjoyable 😄


Because *adore* has an embedding similar to that of *love*, the algorithm generalizes correctly even though it has never seen *adore* before. 
Words such as *heart*, *dear*, *beloved* or *adore* have embedding vectors similar to *love*. 

**This algorithm ignores word ordering**, so it may misinterpret phrases like "not happy" or "This movie is not good and not enjoyable". 
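
The order-insensitivity follows directly from averaging: any two sentences built from the same words get the same vector. A minimal sketch, assuming a made-up 2-dimensional dictionary `toy_vecs` and a hypothetical helper `average_embedding`:

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, for illustration only
toy_vecs = {
    "not":  np.array([-1.0, 0.5]),
    "good": np.array([1.0, 1.0]),
    "bad":  np.array([-1.0, -1.0]),
}

def average_embedding(sentence, vecs):
    """Average the vectors of the known words in `sentence`."""
    words = [w for w in sentence.lower().split() if w in vecs]
    return np.mean([vecs[w] for w in words], axis=0)

v1 = average_embedding("good not bad", toy_vecs)
v2 = average_embedding("bad not good", toy_vecs)

# Same bag of words, so the averaged vectors coincide
print(np.allclose(v1, v2))
```

Since the softmax layer only sees the average, it cannot assign different emojis to these two sentences.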

**Confusion Matrix** can help understand which classes are more difficult for the model. 

In [38]:
# Print the confusion matrix for the test data 
confusion_matrix(Y_test, pred_test)

array([[ 5,  1,  4,  2,  0],
       [ 0,  5,  0,  0,  0],
       [ 2,  2, 11,  3,  0],
       [ 2,  1,  3,  8,  1],
       [ 0,  0,  1,  1,  4]])
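
Summary metrics can be read off the confusion matrix. The sketch below copies the matrix printed above and derives accuracy, per-class recall and per-class precision. It assumes the `sklearn` convention that rows are actual labels and columns are predicted labels.

```python
import numpy as np

# Test-set confusion matrix printed above (rows: actual, columns: predicted)
cm = np.array([[ 5, 1,  4, 2, 0],
               [ 0, 5,  0, 0, 0],
               [ 2, 2, 11, 3, 0],
               [ 2, 1,  3, 8, 1],
               [ 0, 0,  1, 1, 4]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all examples
recall = np.diag(cm) / cm.sum(axis=1)       # per actual class
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class

print(f"accuracy  = {accuracy:.4f}")
print("recall    =", np.round(recall, 2))
print("precision =", np.round(precision, 2))
```

The accuracy is 33/56, which matches the test accuracy reported in section 1.5. Class 1 (baseball) has a recall of 1.0, while class 0 (heart) has the lowest recall at 5/12.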

<font color='blue'><b>What you should remember:</b>
- Even with only 132 training examples, the model fits the training set well (about 93% accuracy) and reaches about 59% accuracy on the test set, well above the 20% expected from uniform random guessing among 5 classes. This is due to the generalization power of the word vectors. 
- Emojifier-V1 will perform poorly on sentences such as "This movie is not good and not enjoyable". It doesn't understand combinations of words.
It just averages all the words' embedding vectors together, without considering the ordering of words. 
</font>
    
The next algorithm considers the ordering of words.

## PART 2 - Emojifier-V2: Using LSTMs

Here, the LSTM model takes word **sequences** as input, so it is able to account for word ordering. 
Emojifier-V2 will continue to use pre-trained word embeddings to represent words. You'll feed word embeddings into an LSTM, and it will learn to predict the most appropriate emoji. 

In [40]:
import tensorflow
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, Dropout, LSTM, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.initializers import glorot_uniform


### 2.1 - Model Overview (Emojifier-v2)

<img src="https://github.com/renanaferreira/CAA-repository/blob/main/Lab10_emojify/images/emojifier-v2.png?raw=true" style="width:700px;height:400px;"> <br>
<caption><center><font color='purple'><b>Figure 3</b>: Emojifier-V2. A 2-layer LSTM sequence classifier. </center></caption>

### 2.2  Padding Sequences of Varying Length

Here we want to train the Keras model using mini-batches. Most deep learning frameworks require that all sequences in the same mini-batch have the **same length**. A common solution for handling sequences of **different lengths** is padding. Specifically:

* Set a maximum sequence length.
* Pad all sequences to have the same length. 
    
#### Example of Padding:
* Given a maximum sequence length of 20, you could pad every sentence with "0"s so that each input sentence is of length 20. 
* Thus, the sentence "I love you" would be represented as $(e_{I}, e_{love}, e_{you}, \vec{0}, \vec{0}, \ldots, \vec{0})$. 
* In this example, any sentences longer than 20 words would have to be truncated. 
* One way to choose the max sequence length is to pick the length of the longest sentence in the training set. 
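
The padding and truncation rules above can be sketched in a few lines of numpy. The helper name `pad_or_truncate` and the index values are hypothetical; index 0 is assumed to be the padding value.

```python
import numpy as np

def pad_or_truncate(seqs, max_len):
    """Return an (m, max_len) integer array; index 0 is used as padding."""
    out = np.zeros((len(seqs), max_len), dtype=np.int32)
    for i, seq in enumerate(seqs):
        trimmed = seq[:max_len]            # drop anything beyond max_len
        out[i, :len(trimmed)] = trimmed    # remaining positions stay 0
    return out

# Hypothetical word-index sequences of different lengths
seqs = [[7, 3], [5, 9, 2, 8, 1, 4], [6]]
padded = pad_or_truncate(seqs, max_len=4)
print(padded)
```

Short sequences are filled with zeros on the right, and the 6-word sequence loses its last two indices.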

### 2.3  Embedding Layer

In Keras, the embedding matrix is represented as a "layer."

* The embedding matrix maps word indices to embedding vectors.
    * The word indices are positive integers.
    * The embedding vectors are dense vectors of fixed size.
    * A "dense" vector is the opposite of a sparse vector. It means that most of its values are non-zero.  As a counter-example, a one-hot encoded vector is not "dense."
* The embedding matrix can be derived in two ways:
    * Training a model to derive the embeddings from scratch. 
    * Using a pretrained embedding.
    
#### Using and Updating Pre-trained Embeddings 

Use the [Embedding()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer in Keras as follows:

* Initialize the Embedding layer with GloVe 50-dimensional vectors. 
* Keras allows you to either train or leave this layer fixed. Because the training set is small, you'll leave GloVe embeddings fixed instead of updating them.

#### Inputs and Outputs to the Embedding Layer

* The `Embedding()` layer's input is an integer matrix of size **(batch size, max input length)**. 
    * This input corresponds to sentences converted into lists of indices (integers).
    * The largest integer (the highest word index) in the input should be no larger than the vocabulary size.
* The embedding layer outputs an array of shape (batch size, max input length, dimension of word vectors).

* The figure shows the propagation of two example sentences through the embedding layer. 
    * Both examples have been zero-padded to a length of `max_len=5`.
    * The word embeddings are 50 units in length.
    * The final dimension of the representation is  `(2,max_len,50)`. 
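
Conceptually, the embedding lookup is row indexing into the embedding matrix. The numpy sketch below mimics the shapes described above with a made-up vocabulary of 10 entries and random vectors. It is an illustration of the shape arithmetic, not the Keras implementation.

```python
import numpy as np

vocab_size, emb_dim, max_len = 10, 50, 5     # toy vocabulary; 50-dim vectors as in the lab

rng = np.random.default_rng(0)
emb_matrix = rng.normal(size=(vocab_size, emb_dim))
emb_matrix[0] = 0.0                          # row 0 is assumed to be the padding row

# Two zero-padded sentences as word indices, shape (batch size, max_len)
indices = np.array([[3, 7, 1, 0, 0],
                    [4, 2, 0, 0, 0]])

# Each index is replaced by the corresponding row of the matrix
embedded = emb_matrix[indices]
print(embedded.shape)    # (2, 5, 50)
```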

<img src="https://github.com/renanaferreira/CAA-repository/blob/main/Lab10_emojify/images/embedding1.png?raw=true" style="width:700px;height:250px;">
<caption><center><font color='purple'><b>Figure 4</b>: Embedding layer</center></caption>

#### Prepare the Input Sentences =>  Implement `sentences_to_indices`

This function processes an array of sentences X and returns inputs to the embedding layer:

* Convert each training sentence into a list of indices (one index per word in the sentence).
* Zero-pad all these lists so that their length is the length of the longest sentence.

In [41]:
for idx, val in enumerate(["I", "like", "learning"]):
    print(idx, val)

0 I
1 like
2 learning


In [42]:
def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices 
    corresponding to words in the sentences.
    The output shape should be such that it can be given to
    `Embedding()` 
    
    Arguments:
    X -- array of sentences (strings), of shape (m,)
    word_to_index -- dictionary containing each word mapped to its index
    max_len -- maximum number of words in a sentence.
    Every sentence in X is no longer than this. 
    
    Returns:
    X_indices -- array of indices corresponding to words in the 
    sentences from X, of shape (m, max_len)
    """
    
    # number of training examples
    m = X.shape[0]
    
    # Initialize X_indices as a numpy matrix of zeros and the correct shape
    X_indices = np.zeros(shape=(m,max_len))
    
     # loop over training examples
    for i in range(m):        
        
        # Convert the ith training sentence to lower case and split it into words. 
        # You should get a list of words.
        sentence_words = X[i].lower().split()
        
        # Initialize j 
        j = 0
        
        # Loop over the words of sentence_words
        for w in sentence_words:
            # if w exists in the word_to_index dictionary
            if w in word_to_index:
                # Set the (i,j)th entry of X_indices to the index of the correct word.
                X_indices[i, j] = word_to_index[w]
                # Increment j
                j += 1

    
    return X_indices

In [44]:
# Check what function `sentences_to_indices()` does for the 
# sentences in X1. Assume max number of words in a sentence =5

X1 = np.array(["funny lol", "lets play baseball", "food is ready for you"])

X1_indices = sentences_to_indices(X=X1,word_to_index=word_to_index,max_len=5)

#### Build Embedding Layer

Now you'll build the `Embedding()` layer in Keras, using pre-trained word vectors. 

* The embedding layer takes as input a list of word indices.
    * `sentences_to_indices()` creates these word indices.
* The embedding layer will return the word embeddings for a sentence. 

###  Implement `pretrained_embedding_layer()`:

1. Initialize the embedding matrix as a numpy array of zeros.
    * The embedding matrix has a row for each unique word in the vocabulary.
        * There is one additional row to handle "unknown" words.
        * So vocab_size is the number of unique words plus one.
    * Each row will store the vector representation of one word. 
        * For example, one row may be 50 positions long if using GloVe word vectors.
    * In the code below, `emb_dim` represents the length of a word embedding.
2. Fill in each row of the embedding matrix with the vector representation of a word
    * Each word in `word_to_index` is a string.
    * `word_to_vec_map` is a dictionary where the keys are strings and the values are the word vectors.
3. Define the Keras embedding layer. 
    * Use [Embedding()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding). 
    * The input dimension is equal to the vocabulary length (number of unique words plus one).
    * The output dimension is equal to the number of positions in a word embedding.
    * Make this layer's embeddings fixed.
        * If you were to set `trainable = True`, then it will allow the optimization algorithm to modify the values of the word embeddings.
        * In this case, you don't want the model to modify the word embeddings.
4. Set the embedding weights to be equal to the embedding matrix.

In [52]:
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained 
    GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector 
    representation.
    word_to_index -- dictionary mapping from words to their indices in
    the vocabulary (400,000 words)

    Returns:
    embedding_layer -- pretrained layer
    """
    # add 1 to handle "unknown" words.
    vocab_size = len(word_to_index) + 1  
    
    # Get just one word contained in the word_to_vec_map. 
    any_word = list(word_to_vec_map.keys())[0]
    
    # Dimensionality of the GloVe word vectors (= 50)
    emb_dim = len(word_to_vec_map[any_word])
      
    # Step 1
    # Initialize the embedding matrix as a numpy array of zeros
    # with shape [vocab_size,emb_dim]
    emb_matrix = np.zeros(shape=(vocab_size, emb_dim))
    
    # Step 2
    # Set each row "idx" of the embedding matrix to be 
    # the word vector representation of the idx'th word of the vocabulary
    for word, idx in word_to_index.items():
        emb_matrix[idx, :] = word_to_vec_map[word]

    # Step 3
    # Define the Keras embedding layer with the correct input and output 
    # sizes. Make it non-trainable.
    embedding_layer = Embedding(vocab_size, emb_dim ,trainable = False)

    # Step 4
    # Build the embedding layer, it is required before setting the weights 
    # of the embedding layer. 
    embedding_layer.build((None,)) 
    
    # Set the weights of the embedding layer to the embedding matrix. 
    # The layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer

In [51]:
embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
print("weights[0][1][1] =", embedding_layer.get_weights()[0][1][1])
print("Input_dim", embedding_layer.input_dim)
print("Output_dim",embedding_layer.output_dim)

weights[0][1][1] = 0.39031
Input_dim 400001
Output_dim 50


### 2.4 - Implement `Emojify_V2()`

Now you will build the Emojifier-V2 model, in which you feed the embedding layer's output to an LSTM network. The `Emojify_V2()`
function builds a Keras graph of the architecture shown in Fig. 3.

* The model takes as input an array of sentences of shape (`m`, `max_len`), where `input_shape` defines the per-example shape (`max_len`,). 
* The model outputs a softmax probability vector of shape (`m`, `C = 5`). 

* Keras layers:
    * [Input()](https://www.tensorflow.org/api_docs/python/tf/keras/Input)
        * Set the `shape` and `dtype` parameters.
        * The inputs are integers, so you can specify the data type as a string, 'int32'.
    * [LSTM()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)
        * Set the `units` and `return_sequences` parameters.
    * [Dropout()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout)
        * Set the `rate` parameter.
    * [Dense()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)
        * Set the `units` parameter.
    * [Activation()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Activation)
        * You can pass in the activation of your choice as a lowercase string.
    * [Model()](https://www.tensorflow.org/api_docs/python/tf/keras/Model)
        * Set `inputs` and `outputs`.

* Here is some sample code: 
```Python
raw_inputs = Input(shape=(maxLen,), dtype='int32')
preprocessed_inputs = ... # some pre-processing
X = LSTM(units = ..., return_sequences= ...)(preprocessed_inputs)
X = Dropout(rate = ..., )(X)
...
X = Dense(units = ...)(X)
X = Activation(...)(X)
model = Model(inputs=..., outputs=...)
...
```
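
The last two steps of the sample code, a `Dense` layer followed by a softmax activation, can be sketched in plain NumPy. This is only an illustration: the weights `W` and `b` and the stand-in for the final LSTM output are random values invented for this example, not anything produced by the real model.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
m, hidden, C = 3, 128, 5                      # batch size, LSTM units, classes
last_hidden = rng.normal(size=(m, hidden))    # stand-in for the final LSTM output
W = rng.normal(size=(hidden, C)) * 0.01       # hypothetical Dense weights
b = np.zeros(C)                               # hypothetical Dense bias

probs = softmax(last_hidden @ W + b)
print(probs.shape)        # (3, 5)
print(probs.sum(axis=1))  # each row sums to 1, up to floating-point error
```

Each row of `probs` is a probability distribution over the 5 emoji classes, which is why the model output has shape (`m`, `C = 5`).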

In [54]:
def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the Emojify-v2 model's graph.
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary 
    into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in 
    the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """
    
    # Define sentence_indices as the input of the graph.
    # It should be of shape input_shape and dtype 'int32' 
    # (as it contains indices, which are integers).
    sentence_indices = Input(shape=input_shape,dtype='int32')
    
    # Create the embedding layer pretrained with GloVe Vectors 
    embedding_layer = pretrained_embedding_layer(word_to_vec_map=word_to_vec_map, word_to_index=word_to_index)
    
    # Propagate sentence_indices through the embedding layer
    embeddings = embedding_layer(sentence_indices)   
    
    # Propagate embeddings through LSTM layer with 128-dim. hidden state
    # Returned output is a batch of sequences, set return_sequences = True
    # If return_sequences = False, LSTM returns only last output in the sequence
    
    X = LSTM(units=128,return_sequences = True)(embeddings)
    
    # Add dropout with a probability of 0.5
    X = Dropout(rate=0.5)(X)
    
    # Propagate X through another LSTM layer with 128-dim. hidden state
    # Returned output should be single hidden state, not batch of sequences.
    X = LSTM(units=128,return_sequences = False)(X)
    
    # Add dropout with a probability of 0.5
    X = Dropout(rate=0.5)(X)
    
    # Propagate X through a Dense layer with 5 units
    X = Dense(units=5)(X)
    
    # Add a softmax activation
    X = Activation("softmax")(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices,outputs=X)
    
    return model

Run the following cell to create your model and check its summary. 

* Because all sentences in the dataset are shorter than 10 words, `max_len = 10` was chosen.  
* You should see that your architecture uses 20,223,927 parameters, of which 20,000,050 (the word embeddings) are non-trainable, with the remaining 223,877 being trainable. 
* Because the vocabulary has 400,001 words (with valid indices from 0 to 400,000), there are 400,001\*50 = 20,000,050 non-trainable parameters. 
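
The parameter counts above can be checked by hand. The sketch below uses the standard parameter formula for an LSTM layer (four gates, each with an input kernel, a recurrent kernel, and a bias) and for a dense layer; the helper names `lstm_params` and `dense_params` are made up for this example.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return input_dim * units + units

embedding = 400_001 * 50          # frozen GloVe embeddings
lstm_1 = lstm_params(50, 128)     # first LSTM reads 50-dim embeddings
lstm_2 = lstm_params(128, 128)    # second LSTM reads the first LSTM's output
dense = dense_params(128, 5)      # 5 output classes

trainable = lstm_1 + lstm_2 + dense
print(embedding, lstm_1, lstm_2, dense)  # 20000050 91648 131584 645
print(trainable, embedding + trainable)  # 223877 20223927
```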

In [59]:
model = Emojify_V2(input_shape=(maxLen,),
                   word_to_vec_map=word_to_vec_map, 
                   word_to_index=word_to_index)
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 10)]              0         
                                                                 
 embedding_5 (Embedding)     (None, 10, 50)            20000050  
                                                                 
 lstm_5 (LSTM)               (None, 10, 128)           91648     
                                                                 
 dropout_2 (Dropout)         (None, 10, 128)           0         
                                                                 
 lstm_6 (LSTM)               (None, 128)               131584    
                                                                 
 dropout_3 (Dropout)         (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 5)                 645 

#### Compile the Model 

Use `categorical_crossentropy` loss, `adam` optimizer and `['accuracy']` metrics:

In [60]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
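
As a rough illustration of what the `categorical_crossentropy` loss measures, here is a small NumPy sketch for one-hot labels. It is a simplified stand-in rather than the Keras implementation; the function name, the `eps` clipping value, and the toy predictions are assumptions made for this example.

```python
import numpy as np

def categorical_crossentropy(y_true_oh, y_pred, eps=1e-7):
    # Clip to avoid log(0); eps is an arbitrary small constant for this sketch.
    y_pred = np.clip(y_pred, eps, 1.0)
    # With one-hot labels, only the true class's log-probability survives the sum.
    return -np.sum(y_true_oh * np.log(y_pred), axis=1)

y_true = np.array([0, 3])        # toy integer labels
y_true_oh = np.eye(5)[y_true]    # one-hot encoding with C = 5 classes
y_pred = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                   [0.2, 0.2, 0.2, 0.2, 0.2]])

losses = categorical_crossentropy(y_true_oh, y_pred)
print(losses)         # per-example loss: -log(0.7) and -log(0.2)
print(losses.mean())  # average loss over the batch
```

A confident correct prediction gives a loss near 0, while a uniform guess over 5 classes gives -log(0.2), about 1.61.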

### 2.5 - Train the Model 

The Emojifier-V2 `model` takes as input an array of shape (`m`, `max_len`) and outputs probability vectors of shape (`m`, `number of classes`). 

Use the `sentences_to_indices` function to convert `X_train` (an array of sentences as strings) to `X_train_indices` (an array of sentences as lists of word indices). Do the same for `X_test`.
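
To show the shape of the conversion on a tiny case, here is a toy sketch with a made-up five-word vocabulary. The function `toy_sentences_to_indices` is a hypothetical stand-in, not the notebook's own `sentences_to_indices`; in particular, lowercasing the input and mapping unknown words to 0 are assumptions of this sketch.

```python
import numpy as np

def toy_sentences_to_indices(sentences, word_to_index, max_len):
    # Start from all zeros so that short sentences are zero-padded on the right.
    indices = np.zeros((len(sentences), max_len), dtype="int32")
    for i, sentence in enumerate(sentences):
        words = sentence.lower().split()
        for j, word in enumerate(words[:max_len]):
            # Unknown words are left as 0 here; an assumption of this sketch.
            indices[i, j] = word_to_index.get(word, 0)
    return indices

toy_word_to_index = {"i": 1, "love": 2, "you": 3, "funny": 4, "lol": 5}
toy_X = np.array(["I love you", "funny lol"])
toy_indices = toy_sentences_to_indices(toy_X, toy_word_to_index, max_len=5)
print(toy_indices)
# [[1 2 3 0 0]
#  [4 5 0 0 0]]
```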

In [61]:
X_train_indices = sentences_to_indices(X=X_train,word_to_index=word_to_index, max_len=maxLen)
X_test_indices = sentences_to_indices(X=X_test,word_to_index=word_to_index, max_len=maxLen)

Fit the Keras model on `X_train_indices` and `Y_oh_train`, with `epochs = 50`, `batch_size = 32`, and `shuffle=True`. 

In [63]:
history = model.fit(x=X_train_indices, y=Y_oh_train, epochs=50, batch_size=32, shuffle=True)

Epoch 1/50
...
Epoch 50/50


The model should reach roughly **90% to 100% accuracy** on the training set. Exact accuracy may vary from run to run. 
Evaluate the model on the test set: 

In [67]:
loss, acc = model.evaluate(X_test_indices, Y_oh_test)

print(f"Test accuracy = {acc}")

Test accuracy = 0.6071428656578064


You should get a test accuracy between 80% and 95%. Run the cell below to see the mislabelled examples: 

In [68]:
C = 5
y_test_oh = np.eye(C)[Y_test.reshape(-1)]
X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)

# Print every test sentence whose predicted label differs from the true label.
for i in range(len(X_test)):
    num = np.argmax(pred[i])  # index of the most probable class
    if num != Y_test[i]:
        print('Expected emoji:' + label_to_emoji(Y_test[i]) + ' prediction: ' + X_test[i] + label_to_emoji(num).strip())

Expected emoji:❤️ prediction: she got me a present	😄
Expected emoji:❤️ prediction: he is a good friend	😄
Expected emoji:❤️ prediction: I am upset	😞
Expected emoji:❤️ prediction: We had such a lovely dinner tonight	😄
Expected emoji:😞 prediction: work is hard	😄
Expected emoji:😞 prediction: This girl is messing with me	❤️
Expected emoji:😄 prediction: are you serious ha ha	😞
Expected emoji:😞 prediction: work is horrible	😄
Expected emoji:😄 prediction: you brighten my day	❤️
Expected emoji:😞 prediction: she is a bully	😄
Expected emoji:😞 prediction: I worked during my birthday	😄
Expected emoji:😄 prediction: enjoy your break	⚾
Expected emoji:❤️ prediction: valentine day is near	😄
Expected emoji:😞 prediction: My life is so boring	❤️
Expected emoji:🍴 prediction: I am starving	😞
Expected emoji:😄 prediction: I will go dance⚾
Expected emoji:😄 prediction: I like your jacket 	❤️
Expected emoji:❤️ prediction: I love to the stars and back	😄
Expected emoji:😄 prediction: I want to joke	😞
Expected emoji:😞
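
A list of individual mistakes can be hard to scan. One possible way to summarize it is to count how often each true label is predicted as each other label. The sketch below does this with NumPy on invented toy labels; `confusion_counts` is a hypothetical helper, and the label arrays are not taken from the actual test set.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    # counts[t, p] is the number of examples with true label t predicted as p.
    counts = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    return counts

toy_y_true = np.array([0, 0, 2, 3, 3, 4])  # invented labels for illustration
toy_y_pred = np.array([0, 2, 2, 3, 2, 4])

cm = confusion_counts(toy_y_true, toy_y_pred, num_classes=5)
print(cm)
print(np.trace(cm) / cm.sum())  # accuracy: correct predictions lie on the diagonal
```

Off-diagonal entries show which pairs of classes the model confuses most often.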

In [None]:
#Test the model with the sentence "not feeling happy" and other examples

?


#### LSTM Version Accounts for Word Order
The Emojifier-V1 model mislabeled "not feeling happy", whereas Emojifier-V2 probably gets it right. The Emojifier-V2 model still isn't very robust at understanding negation (such as "not happy"), because the training set is small and contains few examples of negation.
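
The effect of word order can be illustrated with a small NumPy sketch. It assumes the baseline combines word vectors by averaging, and it uses a plain tanh recurrence as a simplified stand-in for an LSTM; the word vectors and weights are random values invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 4-dimensional word vectors (random, not GloVe).
toy_vecs = {w: rng.normal(size=4) for w in ["not", "feeling", "happy"]}

def average_embedding(words):
    # Averaging discards word order entirely.
    return np.mean([toy_vecs[w] for w in words], axis=0)

# Random weights for a plain recurrent update (a simplification of an LSTM).
W_x = 0.5 * rng.normal(size=(4, 4))
W_h = 0.5 * rng.normal(size=(4, 4))

def recurrent_state(words):
    # The hidden state is updated word by word, so order matters.
    h = np.zeros(4)
    for w in words:
        h = np.tanh(W_x @ toy_vecs[w] + W_h @ h)
    return h

a = ["not", "feeling", "happy"]
b = ["happy", "feeling", "not"]
print(np.allclose(average_embedding(a), average_embedding(b)))  # True
print(np.allclose(recurrent_state(a), recurrent_state(b)))      # False
```

The averaged representation is identical for both word orders, while the recurrent state differs, which is what gives a sequence model the chance to treat "not happy" differently from "happy".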

<font color='blue'><b>What you should remember</b>:
- For an NLP task where the training set is small, word embeddings may improve the algorithm. 
- Word embeddings allow the model to work on words in the test set that may not appear in the training set. 
- Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
    - To use mini-batches, the sequences need to be **padded** so that all the examples in a mini-batch have the **same length**. 
    - An `Embedding()` layer can be initialized with pretrained values. 
        - These values can be either fixed or trained further on your dataset. 
        - If however your labeled dataset is small, it's usually not worth trying to train a large pre-trained set of embeddings.   
    - `Dropout()` right after `LSTM()` regularizes the network. 