In [1]:
from IPython.display import Image
from IPython.display import HTML
import pandas as pd
import plotly.plotly 
from plotly.offline import *

<img src= "spam-email.jpg">
<br>
This study designs a spam filter that separates ham and spam emails using several machine learning techniques. The subject lines of the emails are analyzed using the NLP technique of tokenization. Document (text) classification is a typical and important task in supervised machine learning (ML): assigning categories to documents, emails in this study, has many applications, spam filtering among them. In this article, I demonstrate and compare three neural network architectures for classifying emails and building a spam filter. A detailed hyperparameter analysis is performed on each of the three architectures. **Validation loss** is the metric used to select the best model: models are ranked from lowest to highest validation loss.
<br>
The dataset is based on the cleaned Enron corpus, which contains a total of 92188 messages belonging to 158 users, with an average of 757 messages per user. The dataset has an almost equal distribution of ham and spam emails. A total of 19997 emails, both ham and spam, are used for the training and validation sets, with a validation split of 20%. After a series of hyperparameter analyses, the three best models for each architecture are chosen for evaluation on a test dataset of 17088 emails (9025 spam and 8063 ham).
<img src = 'sshot.png'>
### Neural network for spam detection
<img src= "Capture.PNG">

Neural networks are powerful machine learning algorithms. They transform the input features so as to form fairly complex nonlinear decision boundaries, and they are widely used for classification problems. The fully connected layer takes the deep representation from the RNN/LSTM and transforms it into the final output classes or class scores. This component consists of fully connected layers along with batch normalization and, optionally, dropout layers for regularization.
### Word Embedding
A word is a basic unit of language that conveys a meaning of its own. With the help of words and language rules, an infinite set of concepts can be expressed. Machine learning approaches to NLP require words to be expressed in vector form. Word embedding, proposed in 1986 [4], is a feature engineering technique in which words are represented as vectors.
Word embedding is a technique for representing the meaning of a word in terms of other words, as exemplified by the Word2vec approach. Embeddings are designed for specific tasks. Consider first the simplest way to represent a word in vector space: each word is uniquely mapped onto a series of zeros and a single one, with the location of the one corresponding to the index of the word in the vocabulary. This technique is referred to as one-hot encoding. As an example, take the sentence "Go Intelligent Bot Service Artificial Intelligence, Oracle":
<img src= "1.png">
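The one-hot scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the preprocessing used in this study: the helper names `build_vocab` and `one_hot` are hypothetical, and the vocabulary order (order of first appearance) is an assumption.

```python
def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def one_hot(word, vocab):
    """Return a list of zeros with a single one at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


sentence = "Go Intelligent Bot Service Artificial Intelligence, Oracle"
tokens = sentence.replace(",", "").split()  # naive whitespace tokenization
vocab = build_vocab(tokens)

for tok in tokens:
    print(f"{tok:<13}", one_hot(tok, vocab))
```

Each vector has as many entries as the vocabulary has words, which is why one-hot vectors become very sparse for realistic vocabularies and carry no notion of similarity between words.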
Embedding words as vectors makes it possible to identify words that are used in contexts similar to those of a given word. In a word embedding, words with similar meanings have similar representations.
<br>
In this study, GloVe (Global Vectors) is used, an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
GloVe is essentially a log-bilinear model with a weighted least-squares objective. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. For example, consider the co-occurrence probabilities for target words ice and steam with various probe words from the vocabulary. Here are some actual probabilities from a 6 billion word corpus: 
<img src= "table.png">
<br>
The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. Owing to the fact that the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. Because these ratios can encode some form of meaning, this information gets encoded as vector differences as well. For this reason, the resulting word vectors perform very well on word analogy tasks, such as those examined in the word2vec package. 
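$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where $V$ is the size of the vocabulary and $X$ is the matrix of word-word co-occurrence counts, whose entries $X_{ij}$ tabulate the number of times word $j$ occurs in the context of word $i$. Here $w_i$ and $\tilde{w}_j$ are the word and context-word vectors, $b_i$ and $\tilde{b}_j$ are the corresponding bias terms, and $f$ is a weighting function.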
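To make the objective concrete, here is a minimal NumPy sketch that evaluates $J$ for given vectors and counts. It is an illustration only, not the training code behind the pretrained vectors. The function names `glove_weight` and `glove_loss` are hypothetical, and the weighting function $f(x) = (x / x_{max})^{\alpha}$ for $x < x_{max}$ (1 otherwise), with $x_{max} = 100$ and $\alpha = 3/4$, is taken from the original GloVe paper rather than from this article. Since $f(0) = 0$, the sum is restricted to nonzero counts.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x); defaults assumed from the GloVe paper."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective J over the nonzero entries of X."""
    i, j = np.nonzero(X)                  # index pairs with X_ij > 0
    x = X[i, j].astype(float)
    # w_i^T w~_j + b_i + b~_j for every nonzero pair
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    return float(np.sum(glove_weight(x) * (pred - np.log(x)) ** 2))


# Toy usage: 4 words, 3-dimensional vectors, random counts (all made up).
rng = np.random.default_rng(0)
V, d = 4, 3
X = rng.integers(0, 50, size=(V, V))
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```

Training would minimize this quantity with respect to `W`, `W_tilde`, `b` and `b_tilde`; the sketch only evaluates it.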
### Visualizing hyperparameters
#### CNN
<img src= "CNN.png">
#### LSTM
<img src= "LSTM.png">

<img src= "CNN3d.png">
<img src= "LSTM3d.png">

## CNN model
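
The model functions in this notebook call an `embedding_layer` that is defined outside this section. As a hedged sketch of how pretrained GloVe vectors might be turned into the weight matrix of such a layer, the snippet below parses GloVe-style text lines and fills one row per vocabulary word. The helper names `parse_glove_lines` and `build_embedding_matrix`, the toy vectors, and the `word_index` mapping (word to integer index, with index 0 reserved for padding) are all assumptions for illustration.

```python
import numpy as np


def parse_glove_lines(lines):
    """Parse GloVe text lines ('word v1 v2 ...') into a {word: vector} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")  # row 0: padding
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix


# Toy data standing in for a real GloVe file and a tokenizer's word index.
toy_lines = ["spam 1.0 2.0", "ham 3.0 4.0"]
toy_word_index = {"spam": 1, "ham": 2, "offer": 3}

matrix = build_embedding_matrix(toy_word_index, parse_glove_lines(toy_lines), dim=2)
print(matrix.shape)
```

A matrix built this way could then be used to initialize a frozen Keras `Embedding` layer, so that the pretrained vectors are not updated during training.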

In [2]:
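# Keras building blocks used by the model functions in this notebook
# (assuming standalone Keras; with TensorFlow, import from tensorflow.keras instead).
from keras.layers import (Input, Conv1D, MaxPooling1D, GlobalMaxPooling1D,
                          GlobalMaxPool1D, Bidirectional, LSTM, Dense, Dropout)
from keras.models import Model

def embeddings(fl1=32, fl2=32, fl3=64, dl=16, optimizer='RMSprop', kl=5, layer=1):
    # MAX_SEQUENCE_LENGTH and embedding_layer are assumed to be defined earlier in the notebook.
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)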
    if (layer == 1):
        x = Conv1D(filters = fl1, kernel_size = kl, activation='relu')(embedded_sequences)
        x = MaxPooling1D(pool_size = kl)(x)
    elif (layer == 2):
        x = Conv1D(filters = fl1, kernel_size = kl, activation='relu')(embedded_sequences)
        x = MaxPooling1D(pool_size = kl)(x)
        x = Conv1D(filters = fl2, kernel_size = kl, activation='relu')(x)
        x = MaxPooling1D(pool_size = kl)(x)
        
    else:
        x = Conv1D(filters = fl1, kernel_size = kl, activation='relu')(embedded_sequences)
        x = MaxPooling1D(pool_size = kl)(x)
        x = Conv1D(filters = fl2, kernel_size = kl, activation='relu')(x)
        x = MaxPooling1D(pool_size = kl)(x)
        x = Conv1D(filters = fl3, kernel_size = kl, activation='relu')(x)
    x = GlobalMaxPooling1D()(x)
    x = Dense(units = dl, activation='relu')(x)
    # softmax yields class probabilities, which categorical_crossentropy expects
    preds = Dense(2, activation='softmax')(x)
    model = Model(sequence_input, preds)
    model.compile(loss= 'categorical_crossentropy',
              optimizer= optimizer,
              metrics=['acc'])
   
    return model
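
To see how quickly the stacked `Conv1D` and `MaxPooling1D` layers shrink the sequence, the following sketch reproduces the length arithmetic of the three branches. It assumes the Keras defaults (`padding='valid'`, convolution stride 1, pooling stride equal to `pool_size`) and an input length of 1000, the value hard-coded in `no_embeddings`; the helper names are hypothetical.

```python
def conv1d_len(n, kernel_size):
    """Output length of a 'valid' Conv1D with stride 1."""
    return n - kernel_size + 1


def maxpool1d_len(n, pool_size):
    """Output length of a 'valid' MaxPooling1D with stride equal to pool_size."""
    return (n - pool_size) // pool_size + 1


def cnn_lengths(seq_len, kl=5, layer=1):
    """Sequence length after each Conv1D/MaxPooling1D step of the chosen branch."""
    lengths = [seq_len]
    n_conv = layer if layer in (1, 2) else 3
    for block in range(n_conv):
        lengths.append(conv1d_len(lengths[-1], kl))
        if block < 2:  # the third convolution is not followed by pooling
            lengths.append(maxpool1d_len(lengths[-1], kl))
    return lengths


for layer in (1, 2, 3):
    print(layer, cnn_lengths(1000, kl=5, layer=layer))
```

With `kl = 5`, a 1000-step input is reduced to 199 steps after one block, 39 after two, and 35 after the third convolution, before `GlobalMaxPooling1D` collapses the remaining steps to one value per filter.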

## LSTM model

In [3]:
def embedding_LSTM(fl1=16, fl2=16, fl3=16, dl=16, optimizer='RMSprop', kl=5, layer=1):
    # fl2, fl3, kl and layer are not used here; only fl1 (LSTM units) and dl (dense units) matter.
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
    x = Bidirectional(LSTM(units = fl1, return_sequences=True))(embedded_sequences)
    x = GlobalMaxPool1D()(x)
    x = Dense(units=dl, activation="relu")(x)
    x = Dropout(0.1)(x)
    preds = Dense(2, activation='softmax')(x)

    model = Model(sequence_input, preds)
    model.compile(loss= 'categorical_crossentropy',
              optimizer= optimizer,
              metrics=['acc'])
    return model

## CNN model without embeddings

In [None]:
def no_embeddings(fl1=32, fl2=32, fl3=64, dl=16, optimizer= 'Nadam', kl = 5, layer =1 ):
    inp =  Input(shape=(1000, 1))
    if (layer == 1):
        x = Conv1D(filters = fl1, kernel_size = kl, activation='relu')(inp)
        x = MaxPooling1D(pool_size = kl)(x)
    elif (layer == 2):
        x = Conv1D(filters = fl1, kernel_size = kl, activation='relu')(inp)
        x = MaxPooling1D(pool_size = kl)(x)
        x = Conv1D(filters = fl2, kernel_size = kl, activation='relu')(x)
        x = MaxPooling1D(pool_size = kl)(x)
    else:
        x = Conv1D(filters = fl1, kernel_size = kl, activation='relu')(inp)
        x = MaxPooling1D(pool_size = kl)(x)
        x = Conv1D(filters = fl2, kernel_size = kl, activation='relu')(x)
        x = MaxPooling1D(pool_size = kl)(x)
        x = Conv1D(filters = fl3, kernel_size = kl, activation='relu')(x)
    x = GlobalMaxPooling1D()(x)
    x = Dense(units = dl, activation='relu')(x)
    # sigmoid keeps the output in (0, 1), as binary_crossentropy requires
    preds = Dense(1, activation='sigmoid')(x)
    model = Model(inp, preds)
    model.compile(loss= 'binary_crossentropy',
              optimizer= optimizer,
              metrics=['acc'])
   
    return model

In [20]:
df = pd.read_csv('test_loss.csv')
df

Unnamed: 0,Neurons Dense layer,Filter 1st layer,Filter 2nd layer,Filter 3rd layer,kernel,layer,trainable_params,optimizer,ANN,train_acc,train_loss,val_acc,val_loss,TP,TN,FP,FN
0,16,128,16,32,5,1,66226,Adam,CNN,0.92,0.18,0.9,0.23,4688,3904,4159,4337
1,64,128,16,64,5,2,75602,Adam,CNN,0.92,0.19,0.9,0.23,4538,4032,4031,4487
2,32,128,16,16,5,3,76290,Nadam,CNN,0.92,0.19,0.9,0.23,6152,2621,5442,2873
3,64,64,64,32,5,2,56898,Nadam,CNN,0.92,0.2,0.9,0.23,4423,4192,3871,4602
4,128,128,16,16,5,2,76818,Nadam,CNN,0.92,0.19,0.9,0.23,4294,4323,3740,4731
5,64,128,0,0,0,1,251074,Nadam,LSTM,0.93,0.16,0.93,0.18,9025,0,8063,0
6,64,32,0,0,0,1,38338,Nadam,LSTM,0.93,0.18,0.92,0.18,4940,3722,4341,4085
7,32,128,0,0,0,1,242786,Nadam,LSTM,0.93,0.18,0.92,0.19,4876,3780,4283,4149
8,64,64,0,0,0,1,92866,Nadam,LSTM,0.93,0.16,0.93,0.19,4978,3599,4464,4047
9,128,32,0,0,0,1,42626,Nadam,LSTM,0.93,0.18,0.93,0.19,4413,4159,3904,4612


In [21]:
print('Spam emails: 9025')
print('Ham emails: 8063')
df_sort = df.sort_values('val_loss')
df_sort

Spam emails: 9025
Ham emails: 8063


Unnamed: 0,Neurons Dense layer,Filter 1st layer,Filter 2nd layer,Filter 3rd layer,kernel,layer,trainable_params,optimizer,ANN,train_acc,train_loss,val_acc,val_loss,TP,TN,FP,FN
5,64,128,0,0,0,1,251074,Nadam,LSTM,0.93,0.16,0.93,0.18,9025,0,8063,0
6,64,32,0,0,0,1,38338,Nadam,LSTM,0.93,0.18,0.92,0.18,4940,3722,4341,4085
7,32,128,0,0,0,1,242786,Nadam,LSTM,0.93,0.18,0.92,0.19,4876,3780,4283,4149
8,64,64,0,0,0,1,92866,Nadam,LSTM,0.93,0.16,0.93,0.19,4978,3599,4464,4047
9,128,32,0,0,0,1,42626,Nadam,LSTM,0.93,0.18,0.93,0.19,4413,4159,3904,4612
0,16,128,16,32,5,1,66226,Adam,CNN,0.92,0.18,0.9,0.23,4688,3904,4159,4337
1,64,128,16,64,5,2,75602,Adam,CNN,0.92,0.19,0.9,0.23,4538,4032,4031,4487
2,32,128,16,16,5,3,76290,Nadam,CNN,0.92,0.19,0.9,0.23,6152,2621,5442,2873
3,64,64,64,32,5,2,56898,Nadam,CNN,0.92,0.2,0.9,0.23,4423,4192,3871,4602
4,128,128,16,16,5,2,76818,Nadam,CNN,0.92,0.19,0.9,0.23,4294,4323,3740,4731
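
The TP, TN, FP and FN columns can be turned into test-set accuracy, precision and recall. The sketch below does this in plain Python for two rows copied from the table above, treating spam as the positive class (TP + FN equals the 9025 spam emails in every row). The function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion-matrix counts (spam = positive)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# (TP, TN, FP, FN) copied from rows 0 and 5 of the table above.
rows = {
    "CNN, row 0": (4688, 3904, 4159, 4337),
    "LSTM, row 5": (9025, 0, 8063, 0),
}

for name, counts in rows.items():
    metrics = confusion_metrics(*counts)
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

For row 5, TN = 0 and FN = 0 mean that every test email was predicted as spam, so recall is 1.0 while accuracy equals the spam share of the test set, 9025/17088. The same function could be applied to every row with `df.apply` to compare test performance with the validation metrics.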
