## Model Training

The approach in this notebook is same as Andrew Ng's Sequence Modelling class (Deep Learning Specialization) on [coursera](https://www.coursera.org/learn/nlp-sequence-models) where he taught us to emojify a sentence. 

I am using this approach in slight different way to classify a sentence as spam or not-spam.

Credit for most of the code goes to [Andrew Ng](http://www.andrewng.org/)

In [37]:
import pandas as pd
import numpy as np
import keras
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from collections import Counter
from sklearn.metrics import confusion_matrix, classification_report
import re

## Constants

In [47]:
processedFilename = "../data/processedFile.csv"
uniformProcessedFilename = "../data/uniformDataProcessedFile.csv"
gloveVecFile = "../data/glove.6B.50d.txt"
# gloveVecFile = "../data/glove.twitter.27B.50d.txt"   # New twitter glove file
maxLen = 30 # Computed in the previous notebook : 2_ExploratoryDataAnalysis.ipynb
classes = 2 # equals to number of classes of spam. Here we have 2. Spam and Not-Spam
seed = 0
np.random.seed(seed)
TrainTestPartition = 0.80

## Split data into Training and Testing

In [48]:
def partition(filename, TrainTestPartition):
    """
    Function that partitions the data in 'filename' into training and testing based on the ratio in 
    TrainTestPartition
    :param filname : String. e.g. "../data/processedFile.csv"
    :param TrainTestPartition : Integer . e.g. 0.80
    :return : Tuple(DataFrame, DataFrame)
    """
    df = pd.read_csv(filename, header=None)
    msk = np.random.rand(len(df)) < TrainTestPartition
    
    return df[msk], df[~msk]

trainDf, testDf = partition(uniformProcessedFilename, TrainTestPartition)
print ("Length of training set : {0}, and testing set : {1}".format(len(trainDf), len(testDf)))
trainDf.head()

Length of training set : 1024, and testing set : 269


Unnamed: 0,0,1
0,hi its lucy hubby at meetins all day fri i wil...,1
1,think ur smart win 200 this week in our weekly...,1
2,guess what somebody you know secretly fancies ...,1
3,sorry for the delay yes masters,0
4,you ve won tkts to the euro2004 cup final or 8...,1


In [49]:
Counter(trainDf[1].values)

Counter({0: 523, 1: 501})

## Building X_train, Y_train, X_test and Y_test

In [50]:
X_train, Y_train = trainDf[0].values, trainDf[1].values
X_test, Y_test = testDf[0].values, testDf[1].values
X_train[0:5]

array([ 'hi its lucy hubby at meetins all day fri i will b alone at hotel u fancy cumin over pls leave msg 2day 09099726395 lucy x calls 1 minmobsmorelkpobox177hp51fl',
       'think ur smart win 200 this week in our weekly quiz text play to 85222 now t cs winnersclub po box 84 m26 3uz 16 gbp1 50 week',
       'guess what somebody you know secretly fancies you wanna find out who it is give us a call on 09065394514 from landline datebox1282essexcm61xn 150p min 18',
       'sorry for the delay yes masters',
       'you ve won tkts to the euro2004 cup final or 800 cash to collect call 09058099801 b4190604 pobox 7876150ppm'], dtype=object)

## Converting Y to one-hot

Getting data ready for LSTM

In [51]:
def convert_to_one_hot(Y, classes=2):
    """
    Function to convert Y into one-hot depending on the classes
    :param Y : numpy_array(Integer)
    :param classes : Integer. e.g 2
    :return : numpy_array(List(Integers))
    """
    Y = np.eye(classes)[Y.reshape(-1)]
    return Y

Y_OneHot_Train = convert_to_one_hot(Y_train, classes)
Y_OneHot_Test = convert_to_one_hot(Y_test, classes)
Y_OneHot_Train

array([[ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.],
       ..., 
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.]])

## Reading Glove Vector

Glove vector would help us convert each word into array of Integers. These glove vectors has already been [downloaded](https://nlp.stanford.edu/projects/glove/) into `../data/` folder.

If you do not have this file. Download these from the above link. I will be using 50-Dimension vector.

In [52]:
def readGlove(filename):
    """
    Function to read glove vector from the disk and compute word_to_index, index_to_word, and word_to_vec map.    
    Below code is taken from Deep Leaning class of Andrew Ng on coursera.
    
    :param filename : String. e.g. "../data/gloveVec.txt"
    :return : Tuple(dict, dict, dict) .
    """        
    with open(filename, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map

words_to_index, index_to_words, word_to_vec_map = readGlove(gloveVecFile)

In [53]:
word_to_vec_map['cucumber']

array([ 0.68224 , -0.31608 , -0.95201 ,  0.47108 ,  0.56571 ,  0.13151 ,
        0.22457 ,  0.094995, -1.3237  , -0.51545 , -0.39337 ,  0.88488 ,
        0.93826 ,  0.22931 ,  0.088624, -0.53908 ,  0.23396 ,  0.73245 ,
       -0.019123, -0.26552 , -0.40433 , -1.5832  ,  1.1316  ,  0.4419  ,
       -0.48218 ,  0.4828  ,  0.14938 ,  1.1245  ,  1.0159  , -0.50213 ,
        0.83831 , -0.31303 ,  0.083242,  1.7161  ,  0.15024 ,  1.0324  ,
       -1.5005  ,  0.62348 ,  0.54508 , -0.88484 ,  0.53279 , -0.085119,
        0.02141 , -0.56629 ,  1.1463  ,  0.6464  ,  0.78318 , -0.067662,
        0.22884 , -0.042453])

## Converting sentences from X_train/X_test into indices vector

In [54]:
def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()`
    
    Arguments:
    X -- array of sentences (strings), of shape (m, 1)
    word_to_index -- a dictionary containing the each word mapped to its index
    max_len -- maximum number of words in a sentence. One can assume every sentence in X is no longer than this. 
    
    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    
    m = X.shape[0]                                   # number of training examples
    
    # Initialize X_indices as a numpy matrix of zeros and the correct shape
    X_indices = np.zeros(shape=(m, max_len))
    
    for i in range(m):                               
        
        # Convert the ith training sentence in lower case and split is into words. You should get a list of words.
        sentence_words =X[i].lower().strip().split()
        
        # Initialize j to 0
        j = 0
        
        # Loop over the words of sentence_words
        for w in sentence_words:
            # Set the (i,j)th entry of X_indices to the index of the correct word.
            try:
                X_indices[i, j] = word_to_index[w]
            except KeyError as e:
                X_indices[i, j] = word_to_index["unk"]
            # Increment j to j + 1
            j = j + 1
            if j>=max_len:
                break
    
    return X_indices

## Defining pre-training embedding layer to be used in out LSTM Model

In [55]:
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    
    vocab_len = len(word_to_index) + 1                  # adding 1 to fit Keras embedding (requirement)
#     vocab_len = len(words_to_index)                     #using this for twitter glove vectors
    emb_dim = word_to_vec_map["cucumber"].shape[0]      
    
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros(shape=(vocab_len, emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        try:
            emb_matrix[index, :] = word_to_vec_map[word]
        except Exception as e:
            print ("Exception {0} occured for word {1}".format(e, word))

    # Define Keras embedding layer with the correct output/input sizes, make it trainable.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)

    # Build the embedding layer, it is required before setting the weights of the embedding layer.
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer

## Creating Model

In [56]:
def spamify(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the spamify model's graph.
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """
    
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape=input_shape, dtype='int32')
    
    # Create the embedding layer pretrained with GloVe Vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)
    
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a batch of sequences.
    X = LSTM(128, return_sequences = True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X trough another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(rate = 0.5)(X)
    # Propagate X through a Dense layer with softmax activation to get back a batch of 5-dimensional vectors.
    X = Dense(classes, activation='softmax')(X)
    # Add a softmax activation
    X = Activation('softmax')(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(sentence_indices, X)
    
    
    return model

In [57]:
model = spamify((maxLen,), word_to_vec_map, words_to_index)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 30)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 30, 50)            20000050  
_________________________________________________________________
lstm_3 (LSTM)                (None, 30, 128)           91648     
_________________________________________________________________
dropout_3 (Dropout)          (None, 30, 128)           0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 258       
__________

## Compiling the model 

As usual, after creating your model in Keras, we need to compile it and define what loss, optimizer and metrics we want to use. Compiling my model using categorical_crossentropy loss, adam optimizer and ['accuracy'] metrics:

In [58]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

It's time to train our model. Our spamify `model` takes as input an array of shape (`m`, `max_len`) and outputs probability vectors of shape (`m`, `number of classes`). We thus have to convert X_train (array of sentences as strings) to X_train_indices (array of sentences as list of word indices), and Y_train (labels as indices) to Y_train_oh (labels as one-hot vectors).

In [59]:
X_train_indices = sentences_to_indices(X_train, words_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, classes=classes)

## Fitting the keras model

In [60]:
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 50, shuffle=True)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7f975982a048>

## Testing Accuracy

In [61]:
X_test_indices = sentences_to_indices(X_test, words_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, classes = classes)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print("Test accuracy = ", acc)

Test accuracy =  0.944237918216


## Other Metrics

In [66]:
def predValues (X_test_indices):
    """
    Function to predict whether text is spam or not
    :param X_test_indices : np.array(list(integers))
    :return : np.array(Integers)
    """
    pred = model.predict(X_test_indices, verbose=1)
    pred_array_values = []
    for prediction in pred:
        pred_array_values.append(np.argmax(prediction))
        
    return pred_array_values
    
pred_array_values = predValues(X_test_indices)    
pred_array_values[0:10] 



[1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

In [67]:
print ("Below is the confusion matrix : ")
confusion_matrix(Y_test, pred_array_values)

Below is the confusion matrix : 


array([[125,   6],
       [  9, 129]])

In [68]:
target_names = ['Non-Spam', 'Spam']
print (classification_report(Y_test, pred_array_values, target_names=target_names))

             precision    recall  f1-score   support

   Non-Spam       0.93      0.95      0.94       131
       Spam       0.96      0.93      0.95       138

avg / total       0.94      0.94      0.94       269



## Test on random sentence

Use the below code to experiment with you own sentence. Type your sentence and see whether the model is able to classify it corretly or not

In [80]:
user_sms = "t mobile customer you may now claim your... free camera phone upgrade a pay go sim card for your loyalty call on 0845 021 3680 offer ends 28thfeb t c s apply"
# user_sms = "Hi Saurabh. How are you"
# user_sms = "Hey saurabh, ..../././././ get our free bill"
user_sms = user_sms.strip().lower()
pattern = re.compile('[^A-Za-z0-9]+')
user_sms = re.sub(pattern, " ", user_sms)
print ("user_sms preprocessed : \t{0}\n".format(user_sms))

user_sms_indices = sentences_to_indices(np.array([user_sms]), words_to_index, maxLen) 
pred_values = predValues(user_sms_indices)
print ("\nGiven text is : {0}".format("SPAM" if pred_values[0]==1 else "Not a Spam"))

user_sms preprocessed : 	t mobile customer you may now claim your free camera phone upgrade a pay go sim card for your loyalty call on 0845 021 3680 offer ends 28thfeb t c s apply


Given text is : SPAM
