## Model Training

The approach in this notebook is the same as the one in Andrew Ng's Sequence Models course (Deep Learning Specialization) on [Coursera](https://www.coursera.org/learn/nlp-sequence-models), where he shows how to emojify a sentence.

I use this approach in a slightly different way: to classify a sentence as spam or not-spam.

Credit for most of the code goes to [Andrew Ng](http://www.andrewng.org/).

In [5]:
import pandas as pd
import numpy as np
import keras
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation, Embedding

## Constants

In [58]:
processedFilename = "../data/processedFile.csv"
gloveVecFile = "../data/glove.6B.50d.txt"
maxLen = 20 # Computed in the previous notebook : 2_ExploratoryDataAnalysis.ipynb
classes = 2 # Number of output classes: spam and not-spam
seed = 0
np.random.seed(seed)
TrainTestPartition = 0.80

## Split data into Training and Testing

In [9]:
def partition(filename, TrainTestPartition):
    """
    Function that partitions the data in 'filename' into training and testing sets. Each row goes to the
    training set with probability TrainTestPartition, so the resulting split ratio is approximate.
    :param filename : String. e.g. "../data/processedFile.csv"
    :param TrainTestPartition : Float. e.g. 0.80
    :return : Tuple(DataFrame, DataFrame)
    """
    df = pd.read_csv(filename, header=None)
    msk = np.random.rand(len(df)) < TrainTestPartition
    
    return df[msk], df[~msk]

trainDf, testDf = partition(processedFilename, TrainTestPartition)
print ("Length of training set : {0}, and testing set : {1}".format(len(trainDf), len(testDf)))
trainDf.head()

Length of training set : 4121, and testing set : 1018


|   | 0 | 1 |
|---|---|---|
| 1 | ok lar joking wif u oni | 0 |
| 3 | u dun say so early hor u c already then say | 0 |
| 4 | nah i don t think he goes to usf he lives arou... | 0 |
| 6 | even my brother is not like to speak with me t... | 0 |
| 7 | as per your request melle melle oru minnaminun... | 0 |
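
The random-mask split used by `partition` can be sketched on toy data. This is only an illustration: the toy DataFrame and the names `toy_df`, `toy_train` and `toy_test` are made up here and are not part of the notebook. Because each row is assigned independently, the split lands near the requested ratio rather than exactly on it (the run above put 4121 of 5139 rows, about 80.2%, in the training set).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the processed file: column 0 is the text, column 1 is the label.
toy_df = pd.DataFrame({0: ["msg {}".format(i) for i in range(1000)],
                       1: [i % 2 for i in range(1000)]})

rng = np.random.RandomState(0)           # fixed seed so the split is repeatable
mask = rng.rand(len(toy_df)) < 0.80      # True for roughly 80% of the rows

toy_train, toy_test = toy_df[mask], toy_df[~mask]

# Every row ends up in exactly one of the two sets.
print(len(toy_train), len(toy_test))
# The observed ratio is close to 0.80 but usually not exactly 0.80.
print(round(len(toy_train) / len(toy_df), 3))
```

If an exact split size matters, shuffling the rows and slicing at a fixed position is a common alternative.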


## Building X_train, Y_train, X_test and Y_test

In [12]:
X_train, Y_train = trainDf[0].values, trainDf[1].values
X_test, Y_test = testDf[0].values, testDf[1].values
X_train[0:5]

array(['ok lar joking wif u oni',
       'u dun say so early hor u c already then say',
       'nah i don t think he goes to usf he lives around here though',
       'even my brother is not like to speak with me they treat me like aids patent',
       'as per your request melle melle oru minnaminunginte nurungu vettam has been set as your callertune for all callers press 9 to copy your friends callertune'], dtype=object)

## Converting Y to one-hot

Getting data ready for LSTM

In [43]:
def convert_to_one_hot(Y, classes=2):
    """
    Function to convert Y into one-hot vectors, one row per label
    :param Y : numpy_array(Integer)
    :param classes : Integer. e.g. 2
    :return : numpy_array of floats, of shape (len(Y), classes)
    """
    Y = np.eye(classes)[Y.reshape(-1)]
    return Y

Y_OneHot_Train = convert_to_one_hot(Y_train, classes)
Y_OneHot_Test = convert_to_one_hot(Y_test, classes)
Y_OneHot_Train

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       ..., 
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.]])
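
The `np.eye(classes)[Y]` trick is easier to see on a tiny example. The labels below are hypothetical (0 = not-spam, 1 = spam): row `k` of the identity matrix is the one-hot vector for class `k`, so indexing the identity matrix with the label array picks one row per label.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])      # hypothetical labels: 0 = not-spam, 1 = spam
identity = np.eye(2)                 # row k is the one-hot vector for class k

# Fancy indexing selects one identity row per label.
one_hot = identity[labels.reshape(-1)]
print(one_hot)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]

# argmax along each row undoes the encoding.
recovered = one_hot.argmax(axis=1)
print(recovered)                     # [0 1 1 0]
```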

## Reading GloVe Vectors

GloVe vectors let us convert each word into a fixed-length array of real numbers (a word embedding). These GloVe vectors have already been [downloaded](https://nlp.stanford.edu/projects/glove/) into the `../data/` folder.

If you do not have this file, download it from the link above. I will be using the 50-dimensional vectors.

In [23]:
def readGlove(filename):
    """
    Function to read glove vector from the disk and compute word_to_index, index_to_word, and word_to_vec map.    
    The code below is taken from the Deep Learning class of Andrew Ng on Coursera.
    
    :param filename : String. e.g. "../data/gloveVec.txt"
    :return : Tuple(dict, dict, dict) .
    """        
    with open(filename, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map

words_to_index, index_to_words, word_to_vec_map = readGlove(gloveVecFile)

In [98]:
word_to_vec_map['lar']

array([-0.68095 , -1.5925  , -0.28595 , -0.148   , -0.85375 , -0.3206  ,
        0.56763 , -0.55571 ,  0.029921, -0.19592 , -0.43244 ,  0.077961,
        0.3094  , -0.51442 , -1.3036  , -0.54708 ,  0.81773 , -0.24174 ,
       -0.018879,  0.46546 , -0.79265 , -0.28118 , -0.25071 , -0.60221 ,
        0.53435 ,  0.60365 , -1.435   , -1.0093  , -0.63034 , -0.50105 ,
       -0.66707 , -0.13999 ,  0.23964 ,  1.0034  ,  0.5395  ,  0.33595 ,
        0.59106 , -0.48929 , -0.02608 ,  0.44607 , -0.2265  ,  0.45986 ,
       -0.45285 , -0.59165 ,  0.23867 , -0.46094 ,  0.46202 , -0.3993  ,
        0.52779 ,  0.9966  ])
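
To make the file layout concrete, here is a small sketch that parses three made-up lines in the same "word followed by its vector components" layout that `readGlove` expects. The words, the 3-dimensional vectors and the `toy_` names are invented for illustration; the real file has 50 components per line.

```python
import io
import numpy as np

# Three made-up lines in the GloVe text layout: a word, then its vector components.
fake_glove = io.StringIO(
    "spam 0.1 -0.2 0.3\n"
    "ham 0.4 0.5 -0.6\n"
    "unk 0.0 0.1 0.0\n"
)

toy_vec_map = {}
for line in fake_glove:
    parts = line.strip().split()
    # First token is the word, the rest are the vector components as strings.
    toy_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)

# Indices start at 1 (in sorted word order) so that 0 stays free for padding.
toy_word_to_index = {w: i for i, w in enumerate(sorted(toy_vec_map), start=1)}
toy_index_to_word = {i: w for w, i in toy_word_to_index.items()}

print(toy_word_to_index)        # {'ham': 1, 'spam': 2, 'unk': 3}
print(toy_vec_map["spam"])      # [ 0.1 -0.2  0.3]
```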

## Converting sentences from X_train/X_test into indices vector

In [41]:
def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()`
    
    Arguments:
    X -- array of sentences (strings), of shape (m,)
    word_to_index -- a dictionary mapping each word to its index
    max_len -- maximum number of words in a sentence. Longer sentences are truncated to max_len words.
    
    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    
    m = X.shape[0]                                   # number of training examples
    
    # Initialize X_indices as a numpy matrix of zeros of shape (m, max_len); 0 doubles as the padding index
    X_indices = np.zeros(shape=(m, max_len))
    
    for i in range(m):                               
        
        # Convert the ith training sentence to lower case and split it into words. This gives a list of words.
        sentence_words = X[i].lower().strip().split()
        
        # Initialize j to 0
        j = 0
        
        # Loop over the words of sentence_words
        for w in sentence_words:
            # Set the (i,j)th entry of X_indices to the index of the correct word.
            try:
                X_indices[i, j] = word_to_index[w]
            except KeyError:
                # Word not in the GloVe vocabulary: fall back to the index of "unk"
                X_indices[i, j] = word_to_index["unk"]
            # Increment j to j + 1
            j = j + 1
            if j>=max_len:
                break
    
    return X_indices
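
A compact sketch of the same idea on a hypothetical four-word vocabulary shows the three behaviours of `sentences_to_indices`: unknown words map to `"unk"`, short sentences are padded with zeros, and long sentences are cut off at `max_len`. The helper `to_indices` and the vocabulary are illustrative only.

```python
import numpy as np

toy_word_to_index = {"free": 1, "prize": 2, "unk": 3, "win": 4}  # hypothetical vocabulary
max_len = 5

def to_indices(sentence, word_to_index, max_len):
    """Map one sentence to a fixed-length row of word indices (0 = padding)."""
    row = np.zeros(max_len)
    words = sentence.lower().strip().split()[:max_len]   # truncate long sentences
    for j, w in enumerate(words):
        # Out-of-vocabulary words fall back to the "unk" index.
        row[j] = word_to_index.get(w, word_to_index["unk"])
    return row

short_row = to_indices("Win a FREE prize", toy_word_to_index, max_len)
long_row = to_indices("win win win win win win win", toy_word_to_index, max_len)

print(short_row)   # [4. 3. 1. 2. 0.]
print(long_row)    # [4. 4. 4. 4. 4.]
```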

## Defining the pre-trained embedding layer to be used in our LSTM model

In [26]:
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    
    vocab_len = len(word_to_index) + 1                  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]      
    
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros(shape=(vocab_len, emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define Keras embedding layer with the correct output/input sizes, make it non-trainable.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)

    # Build the embedding layer, it is required before setting the weights of the embedding layer.
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer
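
What the frozen embedding layer does at lookup time can be sketched with NumPy alone. The two-dimensional vectors and the `toy_` names below are invented for illustration; the point is that row 0 of the matrix stays all zeros (padding) and each index simply selects a row.

```python
import numpy as np

# Hypothetical 2-dimensional "GloVe" vectors for a three-word vocabulary.
toy_word_to_index = {"ham": 1, "spam": 2, "unk": 3}
toy_vec_map = {
    "ham": np.array([0.4, 0.5]),
    "spam": np.array([0.1, -0.2]),
    "unk": np.array([0.9, 0.9]),
}

vocab_len = len(toy_word_to_index) + 1        # +1 because index 0 is reserved for padding
emb_dim = 2
emb_matrix = np.zeros((vocab_len, emb_dim))
for word, index in toy_word_to_index.items():
    emb_matrix[index, :] = toy_vec_map[word]

# "spam ham" padded to length 4: a non-trainable embedding lookup is just row selection.
indices = np.array([2, 1, 0, 0])
embedded = emb_matrix[indices]

print(embedded.shape)    # (4, 2)
print(embedded[0])       # [ 0.1 -0.2]
```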

## Creating Model

In [52]:
def spamify(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the spamify model's graph.
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """
    
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape=input_shape, dtype='int32')
    
    # Create the embedding layer pretrained with GloVe Vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)
    
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a batch of sequences.
    X = LSTM(128, return_sequences = True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(rate = 0.5)(X)
    # Propagate X through a Dense layer to get back a batch of `classes`-dimensional vectors of raw scores.
    X = Dense(classes)(X)
    # Add a softmax activation
    X = Activation('softmax')(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(sentence_indices, X)
    
    
    return model
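
One detail worth checking in the output layers of a model like this is that softmax is applied only once to the raw scores. A NumPy sketch (the `softmax` helper and the example scores are illustrative, not part of the notebook) shows why: softmax outputs lie between 0 and 1, so pushing them through a second softmax squeezes a confident prediction towards 0.5.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([4.0, 0.0])    # hypothetical raw scores from the Dense layer

once = softmax(logits)           # roughly [0.98, 0.02]
twice = softmax(once)            # roughly [0.72, 0.28]: much less confident

print(once)
print(twice)
```

With two classes, the largest probability after a second softmax can never exceed about 0.73, which weakens the gradient signal during training.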

In [59]:
model = spamify((maxLen,), word_to_vec_map, words_to_index)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, 20)                0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 20, 50)            20000050  
_________________________________________________________________
lstm_9 (LSTM)                (None, 20, 128)           91648     
_________________________________________________________________
dropout_9 (Dropout)          (None, 20, 128)           0         
_________________________________________________________________
lstm_10 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dropout_10 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 258       
__________
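
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for a Keras LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a Dense layer; the helper names are made up here, and the 400001 embedding rows are inferred from the summary (20000050 / 50).

```python
def lstm_params(input_dim, units):
    # 4 gates, each with (input_dim x units) input weights,
    # (units x units) recurrent weights and a bias of size units.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

embedding = 400001 * 50               # vocab_len rows of 50-dimensional vectors
lstm_1 = lstm_params(50, 128)         # input is the 50-dimensional embedding
lstm_2 = lstm_params(128, 128)        # input is the first LSTM's 128-dimensional output
dense = dense_params(128, 2)          # 2 output classes

print(embedding, lstm_1, lstm_2, dense)   # 20000050 91648 131584 258
```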

## Compiling the model 

After creating the model in Keras, we need to compile it and define the loss, optimizer and metrics to use. Here the model is compiled with the `categorical_crossentropy` loss, the `adam` optimizer and the `['accuracy']` metric:

In [60]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
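
For intuition, the loss and the metric chosen above can be sketched in NumPy on two hypothetical predictions. This is a simplified illustration of categorical cross-entropy and accuracy, not the exact Keras implementation; the arrays and the `eps` clipping value are assumptions made for the example.

```python
import numpy as np

# Hypothetical one-hot labels and softmax outputs for two messages.
y_true = np.array([[1.0, 0.0],     # not-spam
                   [0.0, 1.0]])    # spam
y_pred = np.array([[0.9, 0.1],
                   [0.2, 0.8]])

eps = 1e-7                         # clip to avoid log(0)
per_example = -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=1)
loss = per_example.mean()          # about 0.164 = (-ln 0.9 - ln 0.8) / 2

# Accuracy compares the most probable class with the true class.
accuracy = np.mean(y_pred.argmax(axis=1) == y_true.argmax(axis=1))

print(loss, accuracy)
```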

It's time to train our model. Our spamify `model` takes as input an array of shape (`m`, `max_len`) and outputs probability vectors of shape (`m`, `number of classes`). We thus have to convert X_train (array of sentences as strings) to X_train_indices (array of sentences as list of word indices), and Y_train (labels as indices) to Y_train_oh (labels as one-hot vectors).

In [61]:
X_train_indices = sentences_to_indices(X_train, words_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, classes=classes)

## Fitting the Keras model

In [None]:
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 50, shuffle=True)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50

In [63]:
X_train_indices

array([[ 268745.,  217389.,  198528., ...,       0.,       0.,       0.],
       [ 369052.,  130722.,  319691., ...,       0.,       0.,       0.],
       [ 255601.,  185457.,  127406., ...,       0.,       0.,       0.],
       ..., 
       [ 285736.,  383514.,  188481., ...,       0.,       0.,       0.],
       [ 357266.,  169725.,  123517., ...,  260309.,  384714.,   54718.],
       [ 372306.,  193919.,  366081., ...,       0.,       0.,       0.]])