### 1.1 - Dataset EMOJISET

Let's start by building a simple baseline classifier. 

You have a tiny dataset (X, Y) where:
- X contains 127 sentences (strings)
- Y contains an integer label between 0 and 4 for each sentence, corresponding to an emoji

<img src="data_set.png" style="width:700px;height:300px;">
<caption><center> *Figure 1*: EMOJISET - a classification problem with 5 classes. A few examples of sentences are given here. </center></caption>

Let's load the dataset using the code below. We split the dataset between training (127 examples) and testing (56 examples).

In [6]:
import numpy as np
from emo_utils import *
import emoji
import matplotlib.pyplot as plt
np.random.seed(0)
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation, Lambda
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from pickle import load, dump
#import pandas as pd
from unicodedata import normalize
from numpy import array
import string
import re


np.random.seed(1)
%matplotlib inline

In [7]:
def read_glove_vecs(glove_file):
    with open(glove_file, 'r',encoding='UTF-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map
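
To make the expected file format concrete, here is a small self-contained sketch that runs the same parsing logic on a two-word, three-dimensional stand-in for a GloVe file. The toy words and vector values are made up for illustration; the real `glove.6B.50d.txt` has 50 values per line.

```python
import os
import tempfile

import numpy as np

def read_glove_vecs(glove_file):
    """Same logic as the function above, repeated so this sketch runs on its own."""
    words = set()
    word_to_vec_map = {}
    with open(glove_file, 'r', encoding='UTF-8') as f:
        for line in f:
            parts = line.strip().split()
            words.add(parts[0])
            word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)
    # Indices start at 1 so that 0 stays free for padding
    words_to_index = {w: i for i, w in enumerate(sorted(words), start=1)}
    index_to_words = {i: w for w, i in words_to_index.items()}
    return words_to_index, index_to_words, word_to_vec_map

# Toy file: one word per line followed by its vector components (made-up values)
toy_lines = "love 0.1 0.2 0.3\nbaseball 0.4 0.5 0.6\n"
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_glove.txt")
    with open(toy_path, "w", encoding="UTF-8") as f:
        f.write(toy_lines)
    toy_w2i, toy_i2w, toy_w2v = read_glove_vecs(toy_path)

print(toy_w2i)                 # words are indexed in sorted order, starting at 1
print(toy_w2v["love"].shape)   # one entry per vector component
```

Because indexing starts at 1, index 0 never refers to a real word and can be used as padding later on.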

#### Read GloVe for synonyms

In [8]:
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('glove.6B.50d.txt')


In [9]:
X_train, Y_train= read_csv('train_emoji_rework.csv')
X_test, Y_test = read_csv('tesss.csv')

In [10]:
# Word count of the longest sentence (longest by character count), plus a margin of 3
maxLen = len(max(X_train, key=len).split())+3
print(maxLen)

13
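
One subtlety in the `maxLen` computation: `max(X_train, key=len)` picks the sentence with the most characters, which is not necessarily the one with the most words. The sketch below uses two made-up sentences to show the difference, along with a word-count-based alternative. Whether the two approaches differ on the actual training set cannot be told from the output above.

```python
# Made-up sentences for illustration only
sentences = ["a b c d e", "extraordinarily complicated"]

# Approach used above: longest sentence by character count, then count its words
by_chars = len(max(sentences, key=len).split())

# Alternative: take the maximum word count directly
by_words = max(len(s.split()) for s in sentences)

print(by_chars)  # word count of the sentence with the most characters
print(by_words)  # largest word count over all sentences
```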


In [11]:
# Set the output path
path = 'C:/Users/viret/OneDrive/IE/Third_Term/NLP/Application/'

# Save word_to_index
with open(path + 'word_to_index_final.pkl', 'wb') as f:
    dump(word_to_index, f)
    print("saved")

# Save maxLen
with open(path + 'maxLen_final.pkl', 'wb') as f:
    dump(maxLen, f)
    print("saved")

saved
saved
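
The saved objects can be read back with `pickle.load`. A minimal round-trip sketch, using a temporary directory and a made-up two-word dictionary in place of the real `path` and `word_to_index`:

```python
import os
import tempfile
from pickle import dump, load

# Hypothetical stand-ins for word_to_index and maxLen
toy_word_to_index = {"baseball": 1, "love": 2}
toy_max_len = 13

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "word_to_index_demo.pkl")

    # Save: pickle files must be opened in binary mode
    with open(target, "wb") as f:
        dump((toy_word_to_index, toy_max_len), f)

    # Load: returns an equal copy of the object that was saved
    with open(target, "rb") as f:
        restored_index, restored_max_len = load(f)

print(restored_index == toy_word_to_index, restored_max_len)
```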


### 2.1 - Overview of the model


<img src="emojifier-v2.png" style="width:700px;height:400px;"> <br>
<caption><center> **Figure 3**: Emojifier-V2. A 2-layer LSTM sequence classifier. </center></caption>



In [12]:
def decontracted(phrase):  # expand common English contractions
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase





def sentences_to_indices(X, word_to_index, max_len):
    """Convert an array of sentences into a zero-padded array of word indices for the LSTM.

    Returns an array of shape (m, max_len). Every word must be present in word_to_index.
    """
    m = X.shape[0]                                   # number of examples

    # Initialize X_indices as a matrix of zeros; index 0 acts as padding
    X_indices = np.zeros((m, max_len))

    for i in range(m):                               # loop over examples
        # Lowercase the sentence, expand contractions, and split it into words
        sentence_words = decontracted(X[i].lower()).split()

        # Set the (i, j)th entry of X_indices to the index of the j-th word
        for j, w in enumerate(sentence_words):
            X_indices[i, j] = word_to_index[w]

    return X_indices
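
A short sketch of what these two steps do to a sentence. The vocabulary below is a made-up stand-in for `word_to_index`, and `to_indices_safe` is a hypothetical variant, not the notebook's function: it skips words missing from the vocabulary and truncates sentences longer than `max_len`, two cases in which the function above would raise an error.

```python
import re

import numpy as np

def decontracted(phrase):
    """Copy of the contraction-expansion rules above."""
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

def to_indices_safe(X, word_to_index, max_len):
    """Hypothetical variant: unknown words are skipped, long sentences truncated."""
    X_indices = np.zeros((len(X), max_len))
    for i, sentence in enumerate(X):
        words = decontracted(sentence.lower()).split()
        known = [word_to_index[w] for w in words if w in word_to_index]
        for j, idx in enumerate(known[:max_len]):
            X_indices[i, j] = idx
    return X_indices

# Made-up vocabulary; index 0 is left for padding
toy_vocab = {"am": 1, "happy": 2, "i": 3, "not": 4, "will": 5}

print(decontracted("i won't"))   # "won't" is expanded before indexing
toy_indices = to_indices_safe(np.array(["I'm happy", "I won't xyzzy"]), toy_vocab, 4)
print(toy_indices)               # one zero-padded row of indices per sentence
```

Silently dropping unknown words is only one possible policy; raising an error, as the original function does, makes missing vocabulary easier to notice.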

In [13]:
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
   
    
    
    vocab_len = len(word_to_index) + 1                  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # define dimensionality of your GloVe word vectors (= 50)
    
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len,emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define the Keras embedding layer with the correct input/output sizes. It is non-trainable (trainable=False), so the GloVe vectors stay fixed during training.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)

    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer
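
The core of this function is the embedding matrix: row `index` holds the GloVe vector of the word with that index, and row 0 stays all zeros for padding. A NumPy-only sketch with a made-up two-word vocabulary and 3-dimensional vectors (the Keras layer itself is left out so the example runs without Keras):

```python
import numpy as np

# Made-up stand-ins for word_to_index and word_to_vec_map
toy_word_to_index = {"baseball": 1, "love": 2}
toy_word_to_vec = {
    "baseball": np.array([0.4, 0.5, 0.6]),
    "love": np.array([0.1, 0.2, 0.3]),
}

vocab_len = len(toy_word_to_index) + 1           # +1 for the padding row at index 0
emb_dim = toy_word_to_vec["love"].shape[0]       # dimensionality of the toy vectors

emb_matrix = np.zeros((vocab_len, emb_dim))
for word, index in toy_word_to_index.items():
    emb_matrix[index, :] = toy_word_to_vec[word]

# Looking up a padded sentence is plain row indexing into the matrix
sentence = np.array([2, 1, 0])                   # "love baseball <pad>"
looked_up = emb_matrix[sentence]
print(emb_matrix.shape, looked_up.shape)
```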

In [14]:
from keras.activations import softmax

def softMaxAxis1(x):
    return softmax(x)

In [15]:
def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """Build the Emojify-V2 model: a GloVe embedding followed by three stacked LSTM layers."""

    # Define sentence_indices as the input of the graph; it has shape input_shape and holds word indices.
    sentence_indices = Input(input_shape)

    # Create the embedding layer pretrained with GloVe vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)

    # Propagate sentence_indices through the embedding layer to get the embeddings
    embeddings = embedding_layer(sentence_indices)

    # First LSTM layer with a 128-dimensional hidden state; returns a batch of sequences
    X = LSTM(128, return_sequences=True)(embeddings)
    # Add dropout with a rate of 0.5
    X = Dropout(0.5)(X)
    # Second LSTM layer with a 128-dimensional hidden state; also returns a batch of sequences
    X = LSTM(128, return_sequences=True)(X)
    # Add dropout with a rate of 0.5
    X = Dropout(0.5)(X)
    # Third LSTM layer; returns a single hidden state, not a batch of sequences
    X = LSTM(128)(X)
    # Propagate X through a Dense layer to get back a batch of 5-dimensional vectors
    X = Dense(5)(X)
    # Add a softmax activation to turn the scores into class probabilities
    X = Activation('softmax')(X)

    # Create the Model instance that converts sentence_indices into X
    model = Model(inputs=sentence_indices, outputs=X)

    return model

#### Compile model

In [16]:
model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 13)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 13, 50)            20000050  
_________________________________________________________________
lstm_1 (LSTM)                (None, 13, 128)           91648     
_________________________________________________________________
dropout_1 (Dropout)          (None, 13, 128)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 13, 128)           131584    
_________________________________________________________________
dropout_2 (Dropout)          (None, 13, 128)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 128)               131584    
__________
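
The parameter counts in the summary can be checked by hand. A standard LSTM layer with input size `n_x` and hidden size `n_h` has `4 * (n_h * (n_x + n_h) + n_h)` parameters: four gates, each with input weights, recurrent weights, and a bias. The embedding layer has one row of 50 values per vocabulary index. A quick arithmetic sketch:

```python
def lstm_params(n_x, n_h):
    """Parameters of a standard LSTM layer: 4 gates, each with
    input weights (n_x * n_h), recurrent weights (n_h * n_h) and a bias (n_h)."""
    return 4 * (n_h * (n_x + n_h) + n_h)

emb_dim, hidden = 50, 128

lstm_1 = lstm_params(emb_dim, hidden)    # first LSTM reads 50-d embeddings
lstm_2 = lstm_params(hidden, hidden)     # later LSTMs read 128-d hidden states
embedding = 20000050                     # value reported in the summary
vocab_len = embedding // emb_dim         # implied number of embedding rows

print(lstm_1, lstm_2, vocab_len)
```

The implied 400,001 embedding rows match `len(word_to_index) + 1`, that is, 400,000 GloVe words plus the padding row.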

In [17]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [18]:
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)
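
`convert_to_one_hot` comes from `emo_utils`, and its source is not shown here. A plausible equivalent, assuming it matches the `np.eye(C)[Y.reshape(-1)]` idiom used in the "View results" cell of this notebook, turns each integer label into a row with a single 1:

```python
import numpy as np

def one_hot(Y, C):
    """Assumed equivalent of convert_to_one_hot: row k of the identity
    matrix is the one-hot vector for class k."""
    return np.eye(C)[Y.reshape(-1)]

toy_labels = np.array([0, 3, 4])     # made-up labels in the range 0..4
toy_oh = one_hot(toy_labels, C=5)
print(toy_oh)
```

This is the label format that `categorical_crossentropy` expects: one row per example, one column per class.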

In [22]:
filename = path + 'emojis_final2.h5'
checkpoint = ModelCheckpoint(filename, monitor='loss', verbose=1, save_best_only=True, mode='min')
#model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True, callbacks=[checkpoint], verbose=2)

Epoch 1/50
 - 0s - loss: 0.2911 - acc: 0.8784

Epoch 00001: loss improved from inf to 0.29111, saving model to C:/Users/viret/OneDrive/IE/Third_Term/NLP/Application/emojis_final2.h5
Epoch 2/50
 - 0s - loss: 0.2508 - acc: 0.9099

Epoch 00002: loss improved from 0.29111 to 0.25082, saving model to C:/Users/viret/OneDrive/IE/Third_Term/NLP/Application/emojis_final2.h5
Epoch 3/50
 - 0s - loss: 0.2729 - acc: 0.9054

Epoch 00003: loss did not improve
Epoch 4/50
 - 0s - loss: 0.2208 - acc: 0.9144

Epoch 00004: loss improved from 0.25082 to 0.22080, saving model to C:/Users/viret/OneDrive/IE/Third_Term/NLP/Application/emojis_final2.h5
Epoch 5/50
 - 0s - loss: 0.1918 - acc: 0.9324

Epoch 00005: loss improved from 0.22080 to 0.19176, saving model to C:/Users/viret/OneDrive/IE/Third_Term/NLP/Application/emojis_final2.h5
Epoch 6/50
 - 1s - loss: 0.1494 - acc: 0.9414

Epoch 00006: loss improved from 0.19176 to 0.14944, saving model to C:/Users/viret/OneDrive/IE/Third_Term/NLP/Application/emojis_fin

<keras.callbacks.History at 0x27bd53c30b8>
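
With `save_best_only=True` and `mode='min'`, the checkpoint file is only overwritten when the monitored loss drops below the best value seen so far, which is why nothing is saved at epoch 3 in the log. A plain-Python sketch of that rule, replayed on the six losses printed above (it mimics the behavior and is not Keras's implementation):

```python
# Losses from the first six epochs of the log (epoch 3 as printed, to 4 decimals)
losses = [0.29111, 0.25082, 0.2729, 0.22080, 0.19176, 0.14944]

best = float("inf")
saved_epochs = []
for epoch, loss in enumerate(losses, start=1):
    if loss < best:              # mode='min': lower is better
        best = loss
        saved_epochs.append(epoch)

print(saved_epochs)  # epochs at which the model file would be rewritten
```

Note that `monitor='loss'` tracks the training loss, so the saved model is the one that fits the training set best, not necessarily the one that generalizes best.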

In [24]:
# Load the saved model
model = load_model(path + 'emojis_final2.h5')

#### Model evaluation

In [25]:
X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print("Test accuracy = ", acc)


Test accuracy =  0.767857151372


#### View results

In [26]:
# Print the mislabelled test examples: the expected emoji, then the sentence followed by the predicted emoji
C = 5
y_test_oh = np.eye(C)[Y_test.reshape(-1)]
X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)
for i in range(len(X_test)):
    num = np.argmax(pred[i])
    if num != Y_test[i]:
        print('Expected emoji:'+ label_to_emoji(Y_test[i]) + ' prediction: '+ X_test[i] + label_to_emoji(num).strip())

Expected emoji:😄 prediction: We had such a lovely dinner tonight	❤️
Expected emoji:😞 prediction: work is hard	😄
Expected emoji:😞 prediction: This girl is messing with me	❤️
Expected emoji:😞 prediction: are you serious❤️
Expected emoji:😞 prediction: work is horrible	❤️
Expected emoji:😞 prediction: stop pissing me off⚾
Expected emoji:😄 prediction: you brighten my day	❤️
Expected emoji:😞 prediction: she is a bully	❤️
Expected emoji:😞 prediction: My life is so boring	❤️
Expected emoji:😄 prediction: will you be my valentine	❤️
Expected emoji:😄 prediction: dance with me	⚾
Expected emoji:😄 prediction: What you did was awesome	😞
Expected emoji:😞 prediction: go away	⚾
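
The list above is consistent with the reported test accuracy: 13 mislabelled sentences out of 56 test examples leave 43 correct predictions. A quick check:

```python
n_test = 56          # test examples, as stated in the dataset section
n_wrong = 13         # mislabelled sentences printed above

accuracy = (n_test - n_wrong) / n_test
print(round(accuracy, 6))
```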


#### Single Emoji

In [27]:
# Change the sentence below to see your prediction. Make sure all the words are in the GloVe embeddings.
x_test = np.array(['i love baseball'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(x_test[0] +' '+  label_to_emoji(np.argmax(model.predict(X_test_indices))))

i love baseball ⚾


#### Multiple Emojis

The top three emojis are shown, ranked by the scores the classifier assigns to each class.

In [28]:
# Multiple emojis: rank the classes by predicted probability and show the top three

X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
p = model.predict(X_test_indices).reshape(5)
L = np.argsort(-p)
print(x_test[0] +' '+  label_to_emoji(L[0]) + ' ' + label_to_emoji(L[1]) + ' ' + label_to_emoji(L[2]))


i love baseball ⚾ 😞 ❤️
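
The ranking relies on `np.argsort` applied to the negated probabilities, which lists class indices from most to least likely. A sketch with a made-up probability vector (not actual model output):

```python
import numpy as np

# Hypothetical softmax output for one sentence over the 5 classes
probs = np.array([0.10, 0.60, 0.05, 0.20, 0.05])

ranking = np.argsort(-probs)   # class indices, highest probability first
top3 = ranking[:3]
print(top3)
```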


**This is a really small dataset, and the accuracy is decent but nothing amazing. The aim of the project was not so much to work on the emoji classification as to work on the NLP seq2seq translation and to create a prototype.**