# NLP

Build an emojifier (outputs an emoji for an input sentence) using word-vector embeddings:
* **Emojifier-V1**: a baseline model that averages word embeddings (implemented in NumPy)
* **Emojifier-V2**: an LSTM model (implemented in TensorFlow) that reads the words in order, which lets it take context into account
---

## Dataset EMOJISET

We have a tiny dataset (X, Y) where:
- X contains 132 training sentences (strings); a further 56 sentences are held out for testing.
- Y contains, for each sentence, an integer label between 0 and 4 that corresponds to an emoji.

<img src="images/data_set.png" style="width:700px;height:300px;">
<caption><center><font color='purple'>EMOJISET - a classification problem with 5 classes</font></center></caption>

In [1]:
import numpy as np
import csv
import emoji
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

In [2]:
def softmax(x):
    """Compute the softmax of a 1-D array of scores x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
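
Subtracting the maximum before exponentiating does not change the result, because the common factor cancels in the ratio, but it keeps every exponent at or below zero so the exponential cannot overflow. The sketch below illustrates this with the standard library only; `naive_softmax` and `stable_softmax` are illustrative names, not part of the notebook.

```python
import math

def naive_softmax(scores):
    """Softmax without the max shift; overflows for large scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(scores):
    """Softmax with the max shift, mirroring softmax() above."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]   # same gaps as `small`, shifted by 999

# On moderate scores both versions agree.
agree = all(math.isclose(a, b)
            for a, b in zip(naive_softmax(small), stable_softmax(small)))

# On large scores math.exp(1000.0) raises OverflowError in the naive version.
try:
    naive_softmax(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

# The shifted version still returns a valid probability vector.
probs_large = stable_softmax(large)
```

Because `small` and `large` have the same gaps between scores, the shifted version returns the same probabilities for both.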

In [3]:
def convert_to_one_hot(Y, C):
    Y = np.eye(C)[Y.reshape(-1)]
    return Y

In [4]:
def read_csv(filename):
    phrase = []
    emoji = []

    with open (filename) as f:
        csvReader = csv.reader(f)

        for row in csvReader:
            phrase.append(row[0])
            emoji.append(row[1])

    X = np.asarray(phrase)
    Y = np.asarray(emoji, dtype=int)

    return X, Y

In [5]:
X_train, Y_train = read_csv('data/nlp_emoji/train_emoji.csv')
X_test, Y_test = read_csv('data/nlp_emoji/tesss.csv')

In [6]:
X_train.shape, Y_train.shape, X_test.shape, Y_test.shape

((132,), (132,), (56,), (56,))

In [7]:
max(X_train, key=len)

'I am so impressed by your dedication to this project'

In [8]:
emoji_dictionary = {"0": "\u2764\uFE0F",
                    "1": ":baseball:",
                    "2": ":smile:",
                    "3": ":disappointed:",
                    "4": ":fork_and_knife:"}

def label_to_emoji(label):
    """
    Converts a label (int or string) into the corresponding emoji code (string) ready to be printed
    """
    # use_aliases=True works with emoji < 2.0; newer releases expect language='alias' instead
    return emoji.emojize(emoji_dictionary[str(label)], use_aliases=True)

In [9]:
for idx in range(10):
    print(X_train[idx], label_to_emoji(Y_train[idx]))

never talk to me again 😞
I am proud of your achievements 😄
It is the worst day in my life 😞
Miss you so much ❤️
food is life 🍴
I love you mum ❤️
Stop saying bullshit 😞
congratulations on your acceptance 😄
The assignment is too long  😞
I want to go play ⚾


In [10]:
print(f"Sentence '{X_train[0]}' has label index {Y_train[0]}, which is emoji {label_to_emoji(Y_train[0])}", )
print(f"Label index {Y_train[0]} in one-hot encoding format is {convert_to_one_hot(Y_train[0], C = 5)}")

Sentence 'never talk to me again' has label index 3, which is emoji 😞
Label index 3 in one-hot encoding format is [[0. 0. 0. 1. 0.]]
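
`convert_to_one_hot` works by selecting rows of an identity matrix: row `y` of a `C x C` identity matrix is exactly the one-hot vector for label `y`. A minimal sketch of the same idea with plain lists follows; `one_hot` is an illustrative helper, not the notebook's function.

```python
def one_hot(labels, num_classes):
    """Pick row y of a num_classes x num_classes identity matrix for every label y."""
    identity = [[1.0 if r == c else 0.0 for c in range(num_classes)]
                for r in range(num_classes)]
    return [identity[y] for y in labels]

# Three example labels drawn from the five emoji classes.
encoded = one_hot([3, 0, 4], num_classes=5)
# encoded[0] is [0.0, 0.0, 0.0, 1.0, 0.0], matching the output for label 3 above
```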


In [11]:
def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map

In [12]:
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('pretrainedmodel/nlp_glovevec/glove.6B.50d.txt')

In [13]:
word = "cucumber"
idx = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(idx) + "th word in the vocabulary is", index_to_word[idx])

the index of cucumber in the vocabulary is 113317
the 289846th word in the vocabulary is potatos
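
To make the file layout that `read_glove_vecs` expects concrete, here is a small sketch that parses two made-up lines in the same "token followed by its components" format. The words, the 3-dimensional vectors, and the variable names are assumptions for illustration; the real file has 50 numbers per line.

```python
import io

# Two made-up lines in the GloVe text layout: a token, then its components.
toy_glove = io.StringIO(
    "cat 0.1 0.2 0.3\n"
    "apple 0.4 0.5 0.6\n"
)

word_to_vec = {}
for line in toy_glove:
    parts = line.strip().split()
    word_to_vec[parts[0]] = [float(v) for v in parts[1:]]

# Indices follow alphabetical order and start at 1, which leaves 0 unused.
word_to_idx = {w: i for i, w in enumerate(sorted(word_to_vec), start=1)}
idx_to_word = {i: w for w, i in word_to_idx.items()}
```

Starting the indices at 1 matters later: index 0 stays free to mark padding positions.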


---
## Emojifier-V1: baseline model

<center><img src="images/image_1.png" style="width:900px;height:300px;">
    <caption><center><font color='purple'>Baseline model (Emojifier-V1).</font></center></caption></center>


* The input of the model is a string corresponding to a sentence (e.g. "I love you"). 
* Each word is converted into its embedding vector, and the vectors are averaged into a single vector for the sentence.
* The averaged vector is passed through a softmax layer, which outputs a probability vector of shape (1,5), one entry per emoji.
* The (1,5) probability vector is passed to an argmax layer, which extracts the index of the emoji with the highest probability.
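
The averaging step can be sketched as follows. The 2-dimensional vectors are made up for illustration (the notebook uses 50-dimensional GloVe vectors), and `sentence_average` is a hypothetical helper name.

```python
# Hypothetical 2-dimensional embeddings, made up for illustration only.
toy_vectors = {
    "i": [1.0, 0.0],
    "love": [0.0, 2.0],
    "you": [2.0, 1.0],
}

def sentence_average(sentence, vectors, dim):
    """Average the vectors of the known words; unknown words are skipped."""
    total = [0.0] * dim
    count = 0
    for word in sentence.lower().split():
        if word in vectors:
            total = [t + v for t, v in zip(total, vectors[word])]
            count += 1
    if count == 0:
        return total          # no known word: the average stays all zeros
    return [t / count for t in total]

avg = sentence_average("I love you", toy_vectors, dim=2)        # [1.0, 1.0]
avg_oov = sentence_average("I love pizza", toy_vectors, dim=2)  # [0.5, 1.0]
```

Word order is discarded by the average, so "you love I" would produce the same vector as "I love you".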


**Equations needed to implement the forward pass and compute cross-entropy cost:**
$$ z^{(i)} = W \, avg^{(i)} + b $$
$$ a^{(i)} = \text{softmax}(z^{(i)}) $$
$$ \mathcal{L}^{(i)} = - \sum_{k = 0}^{n_y - 1} Y_{oh,k}^{(i)} \log(a^{(i)}_k) $$
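
For reference, differentiating the loss through the softmax gives the gradients that the training loop further down computes as `dz`, `dW`, and `db`. The first line uses the fact that the entries of $Y_{oh}^{(i)}$ sum to 1:

$$ \frac{\partial \mathcal{L}^{(i)}}{\partial z^{(i)}} = a^{(i)} - Y_{oh}^{(i)} $$
$$ \frac{\partial \mathcal{L}^{(i)}}{\partial W} = \left(a^{(i)} - Y_{oh}^{(i)}\right) \left(avg^{(i)}\right)^{T} $$
$$ \frac{\partial \mathcal{L}^{(i)}}{\partial b} = a^{(i)} - Y_{oh}^{(i)} $$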

In [14]:
def predictions(X, Y, W, b, word_to_vec_map):
    """
    Calculates predictions, for given parameters W, b
    X (m, None) -> words -> avg. vector -> softmax -> max value = prediction (m,1) vs. y_true
    """
    m = X.shape[0]
    pred = np.zeros((m, 1))
    any_word = list(word_to_vec_map.keys())[0]         # Get a sample word
    n_h = word_to_vec_map[any_word].shape[0]           # Dimension of GloveVec embedding
                              
    for j in range(m):                                 # Loop over training examples               
        
        # Calculate average vector for sentence
        words = X[j].lower().split()
        avg = np.zeros((n_h,))                         # Initialize the average word vector
        count = 0
        for w in words:                                # Calculate average vector
            if w in word_to_vec_map:
                avg += word_to_vec_map[w]
                count +=1

        if count > 0:
            avg = avg/count
        
        # Forward propagation
        Z = np.dot(W, avg) + b
        A = softmax(Z)
        pred[j] = np.argmax(A)
        
    print("Accuracy: "  + str(np.mean((pred[:] == Y.reshape(Y.shape[0],1)[:]))))
    return pred

In [15]:
def model(X, Y, word_to_vec_map, learning_rate = 0.01, num_iterations = 400):
    """
    Trains a softmax classifier on averaged GloVe word vectors, in numpy.
    X (m,): array of sentences; Y (m,): array of integer labels between 0 and 4
    pred -- vector of predictions, numpy-array of shape (m, 1)
    W / b -- weight matrix / bias of the softmax layer, of shape (n_y, n_h), (n_y,)
    """
    any_word = list(word_to_vec_map.keys())[0] # sample word
    n_h = word_to_vec_map[any_word].shape[0]   # dimensions of the GloVe vectors 
    m = X.shape[0]                             # number of training examples
    n_y = len(np.unique(Y))                    # number of classes  
    
    # Initialize parameters using Xavier initialization
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))
    
    # Convert Y to Y_onehot with n_y classes
    Y_oh = convert_to_one_hot(Y, C = n_y) 
    
    # Optimization loop
    for t in range(num_iterations):
        cost, dW, db = 0, 0, 0
        
        for i in range(m):          # Loop over the training examples

            # Avg vector + Forward propagate through the softmax layer. 
            words = X[i].lower().split()
            avg = np.zeros((n_h,))                         # Initialize the average word vector
            count = 0
            for w in words:                                # Calculate average vector
                if w in word_to_vec_map:
                    avg += word_to_vec_map[w]
                    count +=1
            
            if count > 0:
                avg = avg/count
            
            # Forward prop
            z = np.dot(W, avg) + b
            a = softmax(z)
            cost += -np.sum(Y_oh[i] * np.log(a))
                        
            # Compute gradients 
            dz = a - Y_oh[i]
            dW += np.dot(dz.reshape(n_y,1), avg.reshape(1, n_h))
            db += dz

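            # Update parameters after every example. Note that dW and db are running sums
            # over the current epoch (+= above), so this is not plain per-example SGD.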
            W = W - learning_rate * dW
            b = b - learning_rate * db
        
        if t % 100 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            pred = predictions(X, Y, W, b, word_to_vec_map)
    return pred, W, b

In [16]:
np.random.seed(2)
pred, W, b = model(X_train, Y_train, word_to_vec_map)
print(pred[:5])

Epoch: 0 --- cost = 382.7257158439735
Accuracy: 0.553030303030303
Epoch: 100 --- cost = 47.05028762790248
Accuracy: 0.9393939393939394
Epoch: 200 --- cost = 55.46399565420228
Accuracy: 0.9621212121212122
Epoch: 300 --- cost = 0.37151123004985004
Accuracy: 1.0
[[3.]
 [2.]
 [3.]
 [0.]
 [4.]]


### Test Set Performance

In [17]:
print("Training set:")
pred_train = predictions(X_train, Y_train, W, b, word_to_vec_map)
print('Test set:')
pred_test = predictions(X_test, Y_test, W, b, word_to_vec_map)

Training set:
Accuracy: 1.0
Test set:
Accuracy: 0.8928571428571429


In [18]:
print(Y_test.shape)
print('           '+ label_to_emoji(0)+ '    ' + label_to_emoji(1) + '    ' +  label_to_emoji(2)+ '    ' + label_to_emoji(3)+'   ' + label_to_emoji(4))
print(pd.crosstab(Y_test, pred_test.reshape(56,), rownames=['Actual'], colnames=['Predicted'], margins=True))

(56,)
           ❤️    ⚾    😄    😞   🍴
Predicted  0.0  1.0  2.0  3.0  4.0  All
Actual                                 
0            6    0    0    1    0    7
1            0    8    0    0    0    8
2            1    0   17    0    0   18
3            1    1    2   12    0   16
4            0    0    0    0    7    7
All          8    9   19   13    7   56
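
The crosstab can be turned back into summary numbers. The sketch below copies the 5x5 counts from the table above (margins left out) and recomputes overall accuracy and per-class recall and precision; the variable names are illustrative.

```python
# Rows are actual labels 0-4, columns are predicted labels 0-4,
# copied from the crosstab above (margins left out).
confusion = [
    [6, 0, 0, 1, 0],
    [0, 8, 0, 0, 0],
    [1, 0, 17, 0, 0],
    [1, 1, 2, 12, 0],
    [0, 0, 0, 0, 7],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[k][k] for k in range(n_classes))
accuracy = correct / total

# Recall: correct predictions divided by the number of actual examples of the class.
recall = [confusion[k][k] / sum(confusion[k]) for k in range(n_classes)]

# Precision: correct predictions divided by how often the class was predicted.
precision = [
    confusion[k][k] / sum(confusion[r][k] for r in range(n_classes))
    for k in range(n_classes)
]
```

Here `correct / total` is 50 / 56, which agrees with the test accuracy printed earlier, and class 3 (the disappointed emoji) has the lowest recall at 12 / 16.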


---
## Emojifier-V2: LSTM based model using TensorFlow
* Emojifier-V2 will continue to use pre-trained word embeddings to represent words. 
* We'll feed word embeddings into an LSTM, and the LSTM will learn to predict the most appropriate emoji. 

<img src="images/emojifier-v2.png" style="width:700px;height:400px;"> <br>
<caption><center><font color='purple'>Emojifier-V2. A 2-layer LSTM sequence classifier.</font></center></caption>

#### Padding (and/or truncation)
We will train using mini-batches. Every sequence in a mini-batch must have the **same length**, which we achieve through **padding**:
* Given a maximum sequence length of 20, we pad every sentence with zeros so that each sentence has length 20.
* Thus, the sentence "I love you" would be represented as $(e_{I}, e_{love}, e_{you}, \vec{0}, \vec{0}, \ldots, \vec{0})$. 
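
At the level of word indices, padding and truncation can be sketched like this. The three-word vocabulary and the helper name `pad_indices` are assumptions for illustration; a maximum length of 5 is used instead of 20 to keep the output short.

```python
# Hypothetical vocabulary; real indices come from word_to_index and start at 1.
toy_index = {"i": 1, "love": 2, "you": 3}

def pad_indices(sentence, vocab, max_len):
    """Map words to indices, then right-pad with 0 (or truncate) to max_len."""
    indices = [vocab[w] for w in sentence.lower().split() if w in vocab]
    indices = indices[:max_len]                       # truncate long sentences
    return indices + [0] * (max_len - len(indices))   # pad short ones

padded = pad_indices("I love you", toy_index, max_len=5)
# padded is [1, 2, 3, 0, 0]
truncated = pad_indices("I love you I love you", toy_index, max_len=5)
# truncated is [1, 2, 3, 1, 2]
```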

#### Embedding
We will initialize an Embedding layer with GloVe 50D vectors (which won't be trained further)

<img src="images/embedding1.png" style="width:700px;height:250px;">
<caption><center><font color='purple'>Embedding layer</font></center></caption>
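
Conceptually, a frozen embedding layer is a row lookup in a fixed matrix. The sketch below uses a made-up 4x2 matrix and a hypothetical `embed` helper; it is not the Keras layer itself, only an illustration of what the lookup does to a padded index sequence.

```python
# Toy embedding matrix for 3 words and 2 dimensions (values are made up).
# Row 0 is left as zeros because word indices start at 1 and 0 marks padding.
emb_matrix = [
    [0.0, 0.0],   # index 0: padding
    [1.0, 0.0],   # index 1
    [0.0, 2.0],   # index 2
    [2.0, 1.0],   # index 3
]

def embed(indices, matrix):
    """Replace every index by its row of the embedding matrix."""
    return [matrix[i] for i in indices]

sequence = embed([1, 2, 3, 0, 0], emb_matrix)
# sequence is [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
```

The matrix has one more row than there are words, which is why the vocabulary size passed to the layer is the number of words plus one.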

In [19]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, Dropout, LSTM, Activation, Embedding

In [20]:
def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (m,) into an array of indices (m, max_len)
    """
    m = X.shape[0]                                   # number of training examples
    X_indices = np.zeros((m, max_len))               # Initialize as a matrix of zeros
    
    for i in range(m):                               # loop over training examples        
        sentence_words = X[i].lower().split()[:max_len]   # truncate so j never exceeds max_len - 1
        j = 0

        for w in sentence_words:
            if w in word_to_index:
                X_indices[i, j] = word_to_index[w]
                j =  j+1
    return X_indices

### Creating Pre-trained Embedding Layer

In [21]:
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    """
    
    vocab_size = len(word_to_index) + 1             # +1: indices run from 1 to len(word_to_index), and Keras needs input_dim = max index + 1
    any_word = list(word_to_vec_map.keys())[0]
    emb_dim = word_to_vec_map[any_word].shape[0]    # dimensions of the GloVe vectors 

    # Initialize the embedding matrix and populate it
    emb_matrix = np.zeros((vocab_size, emb_dim))
    for word, idx in word_to_index.items():
        emb_matrix[idx, :] = word_to_vec_map[word]

    # Define Keras embedding layer
    embedding_layer = Embedding(input_dim = vocab_size, output_dim=emb_dim, 
                                trainable=False
                               )
    embedding_layer.build((None,)) # Build the embedding layer, required before setting weights 
    embedding_layer.set_weights([emb_matrix]) # Set the weights of the embedding
    return embedding_layer

In [22]:
def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Creating the Emojify-v2 model's graph.
    """
    # Create the embedding layer pretrained with GloVe Vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Shape input_shape and dtype 'int32' (as it contains indices, which are integers).
    sentence_indices = Input(shape=input_shape, dtype='int32')
    
    # Model Graph
    embeddings = embedding_layer(sentence_indices)          # word embeddings
    X = LSTM(units=128, return_sequences=True)(embeddings)  # output is the full sequence of hidden states
    X = Dropout(0.5)(X)                                     # dropout with a rate of 0.5
    X = LSTM(units=128, return_sequences=False)(X)          # output is only the last hidden state
    X = Dropout(0.5)(X)
    X = Dense(5)(X)
    X = Activation('softmax')(X)
    
    # Create Model
    model = Model(inputs = sentence_indices, outputs = X)
    
    return model

In [23]:
maxLen = len(max(X_train, key=len).split())
maxLen

10

In [24]:
model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 10)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 10, 50)            20000050  
_________________________________________________________________
lstm (LSTM)                  (None, 10, 128)           91648     
_________________________________________________________________
dropout (Dropout)            (None, 10, 128)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 5)                
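
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias; an embedding layer has one row per index. In the sketch below the helper names are illustrative, and the vocabulary size of 400,001 is inferred from the summary itself (20,000,050 / 50).

```python
def lstm_param_count(input_dim, units):
    """4 gates, each with an input kernel, a recurrent kernel and a bias."""
    return 4 * (units * input_dim + units * units + units)

def embedding_param_count(vocab_size, emb_dim):
    """One emb_dim-sized row per index, including the reserved index 0."""
    return vocab_size * emb_dim

# vocab_size is an inferred value: 20,000,050 parameters / 50 dimensions.
emb_params = embedding_param_count(vocab_size=400_001, emb_dim=50)
lstm_1_params = lstm_param_count(input_dim=50, units=128)    # fed by 50-d embeddings
lstm_2_params = lstm_param_count(input_dim=128, units=128)   # fed by the first LSTM
```

These evaluate to 20,000,050, 91,648, and 131,584, matching the `embedding`, `lstm`, and `lstm_1` rows.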

In [25]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [26]:
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)

In [27]:
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x2763696ad08>

In [28]:
X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print("Test accuracy = ", acc)


Test accuracy =  0.875


In [29]:
# Mislabelled examples
C = 5
X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)
for i in range(len(pred)):
    num = np.argmax(pred[i])
    if(num != Y_test[i]):
        print('Expected emoji:'+ label_to_emoji(Y_test[i]) +\
              ' prediction: '+ X_test[i] + label_to_emoji(num).strip())

Expected emoji:😞 prediction: work is hard	😄
Expected emoji:😞 prediction: This girl is messing with me	❤️
Expected emoji:😞 prediction: work is horrible	😄
Expected emoji:😞 prediction: she is a bully	❤️
Expected emoji:😞 prediction: My life is so boring	❤️
Expected emoji:😞 prediction: go away	⚾
Expected emoji:😞 prediction: yesterday we lost again	⚾


In [30]:
_test = np.array(['do you want to play'])
_test_indices = sentences_to_indices(_test, word_to_index, maxLen)
print(_test[0] +' '+  label_to_emoji(np.argmax(model.predict(_test_indices))))

do you want to play ⚾
