<h1 style="font-size: 30px; margin-left:50px">SPAM detector</h1>

<img src="https://gifimage.net/wp-content/uploads/2018/05/spam-gif-6.gif" style="width:20%; float:center;">


In [None]:
# import libraries
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import string
from nltk.tokenize import word_tokenize

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier,RandomForestClassifier
import xgboost
from sklearn import svm,tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score


import tensorflow as tf
# Use the Keras that ships with TensorFlow throughout, rather than mixing
# the standalone `keras` package with `tensorflow.keras`
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding,Input,LSTM,Dense,Bidirectional,Dropout, Activation
from tensorflow.keras.models import Model, Sequential
print(tf.__version__)
import warnings
warnings.filterwarnings("ignore")


In [None]:
# Load the data
df = pd.read_csv('../input/spam-filter/emails.csv')
df.head()

In [None]:
# checking the number of duplicate rows
print('Number of duplicate rows in the data:', df.duplicated().sum(), '\nSo we drop them')

# dropping the duplicate rows
df.drop_duplicates(inplace = True)

In [None]:
# Describing the values in the Spam column
df.groupby('spam').describe()

We can see that the 

In [None]:
# creating a column with the length of each message
df['mail_len'] = df.text.apply(len)

## Exploratory Data Analysis

Plotting the histogram of data for the count of Ham and Spam mails

In [None]:
plt.figure(figsize=(6,6))

df.spam[df.spam==1].plot(bins=4, kind='hist', color='blue', 
                                       label='Spam Mails', alpha=0.6)

df.spam[df.spam==0].plot(bins=4, kind='hist', color='red', 
                                       label='Ham Mails', alpha=0.6)
plt.legend()
plt.xlabel("Ham/Spam")

We can see that ham mails are almost 4 times as numerous as spam mails.
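
This imbalance matters when reading the accuracy scores later on. As a quick sketch, taking the roughly 4:1 ham-to-spam ratio at face value (the counts below are illustrative, not taken from the dataset), a classifier that always predicts ham already reaches about 80% accuracy:

```python
# Illustrative counts only: a hypothetical 4:1 ham-to-spam split.
n_ham, n_spam = 4000, 1000

# A "classifier" that labels every mail as ham gets all ham right
# and all spam wrong.
baseline_accuracy = n_ham / (n_ham + n_spam)
print(f"Always-ham baseline accuracy: {baseline_accuracy:.2f}")
```

So any model worth keeping has to beat that baseline by a clear margin.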

Plotting the distribution of mail length. The taller the plot at a given length, the more mails have that length.

In [None]:
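# sns.set_style replaces plt.style.use('seaborn-darkgrid'), a style name deprecated in newer matplotlib versions
sns.set_style('darkgrid')
plt.figure(figsize=(10,5))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df['mail_len'],kde=True,color='red')
plt.xlabel("Message Length",size=15)
plt.ylabel("Frequency",size=15)
plt.title("Length Histogram",size=15)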

Note that the x-axis here is message length, not row position: the plot shows that most mails are short, with a long tail of much longer ones.

Let's look at the histograms of mail length for both labels, overlaid in the same plot.

In [None]:
plt.figure(figsize=(12, 8))
df[df.spam==1].mail_len.plot( kind='hist', color='blue',label='Spam Mails', alpha=0.6)
df[df.spam== 0].mail_len.plot(kind='hist', color='red',label='Ham Mails', alpha=0.6)
plt.legend()
plt.xlabel("Mail Length")

## Preprocessing of the text data
During preprocessing we remove punctuation and stopwords and lowercase all the mail text.

In [None]:
#1.Punctuations are [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
#2.Stop words are very common words (e.g. "the", "is") that carry little information for classification.

# Build the stop word set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

def process_text(text):
    
    #1 Remove punctuation
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    
    #2 Lowercase and remove stop words
    clean_words = [word.lower() for word in nopunc.split() if word.lower() not in stop_words]
    
    #3 Return a list of clean words
    return clean_words
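
To make the effect of these steps concrete, here is a small self-contained sketch. It uses a tiny hand-picked stop word set (`SAMPLE_STOP_WORDS`, an assumption for illustration) in place of NLTK's English list, so it runs without downloading anything. The function name `strip_punct_and_stopwords` is likewise hypothetical. It also lowercases the kept words, as the preprocessing description calls for.

```python
import string

# Hypothetical, tiny stop word set; the notebook uses NLTK's full English list.
SAMPLE_STOP_WORDS = {"the", "a", "is", "to", "you", "your"}

def strip_punct_and_stopwords(text, stop_words=SAMPLE_STOP_WORDS):
    # Drop every punctuation character
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # Lowercase each word and keep it only if it is not a stop word
    return [w.lower() for w in no_punct.split() if w.lower() not in stop_words]

sample = "Congratulations!!! You have WON a prize, claim your reward now."
cleaned = strip_punct_and_stopwords(sample)
print(cleaned)
```

The punctuation disappears, "You", "a" and "your" are dropped, and everything that remains is lowercase.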

In [None]:
#Show the processed data
df.text = df.text.apply(process_text)
df.text.head()


## Vectorization of the text data 

We cannot feed text data directly to the models. So we vectorize the mails: we tokenize each one, convert its tokens into a sequence of integers, and finally pad every sequence to the same length, which gives a matrix of numbers with one row per mail.

In [None]:
vocab_size = 10000
max_len = 250

# Tokenize the mails
tok = Tokenizer(num_words=vocab_size)
tok.fit_on_texts(df.text)

# Use texts_to_sequences to convert each mail into a sequence of integers
sequences = tok.texts_to_sequences(df.text)

# Pad the sequences to create a matrix of equal-length mails
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)
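
To show what these three steps do to the data, here is a minimal pure-Python sketch of the same idea. It is not the Keras implementation: the helpers `build_word_index`, `to_sequences`, and `pad_left` are hypothetical names, and the sketch only mimics the default behaviour as we understand it (more frequent words get smaller indices starting at 1, 0 is reserved for padding, and padding and truncation happen at the front).

```python
from collections import Counter

def build_word_index(docs):
    # Rank words by frequency: the most frequent word gets index 1
    counts = Counter(word for doc in docs for word in doc)
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(docs, word_index):
    # Replace each known word by its integer index; skip unknown words
    return [[word_index[w] for w in doc if w in word_index] for doc in docs]

def pad_left(seq, max_len):
    # Keep the last max_len items, then left-pad with zeros
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq

# Two toy "mails", already split into clean words
docs = [["free", "prize", "claim", "free"], ["meeting", "agenda"]]
word_index = build_word_index(docs)
matrix = [pad_left(s, 5) for s in to_sequences(docs, word_index)]
print(word_index)
print(matrix)
```

Every row of `matrix` has the same length, which is what lets the mails be stacked into a single array for the models.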

Let's see what a mail looks like now.

In [None]:
sequences_matrix[0]

Okay, so the data is finally ready for training.

### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(sequences_matrix, df.spam, test_size = 0.2, random_state = 1)

We will now train the ML models below on this data.
Let's make a list of the classification models, fit them on the training data and check their respective accuracies.

In [None]:
models=[RandomForestClassifier(),
        GaussianNB(),
        AdaBoostClassifier(),
        xgboost.XGBClassifier(),
        svm.SVC(),
        tree.DecisionTreeClassifier(),
        KNeighborsClassifier()]

model_names=['Random Forest Classifier',
             'Gaussian Naive Bayes Classifier',
             'Adaboost Classifier',
             'XGBoost Classifier',
             'Support Vector Classifier',
             'Decision Tree Classifier',
             'K Nearest Neighbour Classifier']
accuracy=[]
for clf in models:
    # fit each model on the training data and score it on the test data
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    accuracy.append(accuracy_score(y_test,y_pred))
d={'Modelling Algo':model_names,'Accuracy':accuracy} 

Let's put all the models and their accuracies side by side to see which one has the highest score.

In [None]:
accuracy_frame=pd.DataFrame.from_dict(d, orient='index').transpose()
accuracy_frame

The XGBoost Classifier has the highest score, so we will tune its hyperparameters and see how much the accuracy improves.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
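# Note: this grid was written for a random forest. Of these names, only
# n_estimators and max_depth are XGBClassifier parameters; max_features,
# min_samples_split, min_samples_leaf and bootstrap are RandomForestClassifier
# parameters, so they should not be expected to tune the XGBoost model.
# Number of trees (boosting rounds for XGBoost)
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid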

In [None]:
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [None]:
print(random_grid)

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = xgboost.XGBClassifier()
# Random search of parameters, using 2 fold cross validation, 
# search across 5 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 5, cv = 2, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
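
To give a sense of scale, the sketch below counts how many combinations a grid of this shape contains and draws a handful at random, which is the idea behind randomized search. It is a plain-Python illustration, not scikit-learn's sampler: the `sample_combinations` helper is hypothetical, it samples with replacement for simplicity, and the grid values are written out by hand to match the lists defined earlier.

```python
import math
import random

grid = {
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    "max_features": ["auto", "sqrt"],
    "max_depth": list(range(10, 111, 10)) + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Size of the full grid: the product of the number of options per parameter
total = math.prod(len(values) for values in grid.values())

def sample_combinations(grid, n_iter, seed=42):
    # Draw one value per parameter, n_iter times (hypothetical helper)
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in grid.items()} for _ in range(n_iter)]

candidates = sample_combinations(grid, n_iter=5)
n_fits = len(candidates) * 2  # 5 candidates x 2 cross-validation folds
print(f"{total} possible combinations, {len(candidates)} sampled, {n_fits} fits")
```

With only 5 of the possible combinations tried, the search explores a very small part of the grid, so a modest improvement is not surprising.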

In [None]:
rf_random.best_params_

In [None]:
rf_random.best_estimator_

Evaluating the score with best parameters

In [None]:
rfc = rf_random.best_estimator_
rfc.fit(X_train, y_train)
y_pred1 = rfc.predict(X_test) 
print(confusion_matrix(y_test,y_pred1))
print(accuracy_score(y_test,y_pred1))
print(classification_report(y_test,y_pred1))
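
Because ham mails outnumber spam mails, the precision and recall in the classification report are worth reading alongside accuracy. As a sketch of how they follow from a binary confusion matrix laid out the way scikit-learn prints it (rows are true labels, columns are predicted labels), using made-up counts rather than this notebook's results:

```python
# Made-up counts for illustration; not the notebook's actual output.
# Layout matches sklearn: [[TN, FP], [FN, TP]] with spam as the positive class.
cm = [[850, 30],
      [60, 160]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the mails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam mails, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In this made-up case the accuracy looks healthy while more than a quarter of the spam slips through, which is why accuracy alone can mislead on imbalanced data.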

### Using XGBoost with the best parameters does improve the accuracy score, but it is still not satisfactory


### Now let's use a simple single-layer LSTM model

https://en.wikipedia.org/wiki/Long_short-term_memory
## What is LSTM and why is it used?

Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the field of deep learning.

In theory, classic (or "vanilla") RNNs can keep track of arbitrary long-term dependencies in the input sequences. The problem with vanilla RNNs is computational (or practical) in nature: when training a vanilla RNN using back-propagation, the gradients which are back-propagated can "vanish" (that is, they can tend to zero) or "explode" (that is, they can tend to infinity), because of the computations involved in the process, which use finite-precision numbers. RNNs using LSTM units partially solve the vanishing gradient problem, because LSTM units allow gradients to also flow unchanged.
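There is also the bidirectional LSTM, which reads the sequence in both directions and so addresses a drawback of the plain LSTM, where each step only sees the words that came before it.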
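
Going back to the vanishing and exploding gradients mentioned above, the behaviour can be seen with simple arithmetic. As a simplified sketch (a scalar stand-in for the repeated multiplications in back-propagation through time, with factors chosen purely for illustration), multiplying by a number slightly below or slightly above 1 over 100 steps gives:

```python
steps = 100

# Gradient scaled by the same factor at every time step
vanishing = 0.9 ** steps   # a factor below 1 shrinks towards zero
exploding = 1.1 ** steps   # a factor above 1 keeps growing

print(f"0.9^{steps} = {vanishing:.2e}")
print(f"1.1^{steps} = {exploding:.2e}")
```

With mails padded to 250 steps, the same effect is even stronger, which is the motivation for using LSTM units instead of a vanilla RNN.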

[https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)
## Embedding layer
We also use an embedding layer before passing the data to the LSTM layer.

The Embedding layer is defined as the first hidden layer of a network. It must specify 3 arguments: 
1. input_dim: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
2. output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word.
3. input_length: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.
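
Conceptually, an embedding layer is a lookup table: row i of a matrix of shape (input_dim, output_dim) is the vector for word index i. A minimal sketch with tiny, made-up sizes and random values standing in for learned weights (this is not Keras code):

```python
import random

input_dim, output_dim, input_length = 11, 4, 6  # toy sizes for illustration

rng = random.Random(0)
# One row of output_dim numbers per word index (random stand-ins for trained weights)
table = [[rng.uniform(-1, 1) for _ in range(output_dim)] for _ in range(input_dim)]

padded_mail = [0, 0, 3, 7, 3, 10]           # one integer-encoded, padded mail
embedded = [table[i] for i in padded_mail]  # look up a vector for every position

print(len(embedded), len(embedded[0]))      # input_length rows, output_dim columns
```

The same word index always maps to the same vector, so the two occurrences of index 3 above get identical rows.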



In [None]:
model = Sequential()
model.add(Embedding(vocab_size, 200, input_length=max_len))
model.add(LSTM(32))
model.add(Dense(1,activation='sigmoid'))
model.summary()
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
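
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below uses the standard formulas for these layer types (four gates per LSTM unit, each with input weights, recurrent weights, and a bias) with the sizes used in this model. The variable names are ours, chosen for illustration.

```python
vocab_size, embed_dim, lstm_units = 10000, 200, 32

# One embed_dim-sized vector per word in the vocabulary
embedding_params = vocab_size * embed_dim
# 4 gates, each with weights for the input and the previous hidden state, plus a bias
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# One weight per LSTM unit plus a single bias for the sigmoid output
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.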

In [None]:
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=5,batch_size=64)

## Model Evaluation

In [None]:
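scores = model.evaluate(X_test, y_test, verbose=0)
# predict_classes was removed in newer TensorFlow versions;
# threshold the sigmoid output at 0.5 instead
y_pred = (model.predict(X_test) > 0.5).astype("int32").ravel()

print('Test loss:', scores[0])
print('Test accuracy:', scores[1])
# confusion_matrix expects the true labels first
print('confusion matrix:\n', confusion_matrix(y_test,y_pred))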
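
The sigmoid output layer produces a probability per mail rather than a label. A minimal sketch of turning such probabilities into SPAM/HAM labels with a 0.5 cut-off (the probabilities here are made up, and 0.5 is the conventional default rather than something tuned for this data):

```python
probs = [0.02, 0.97, 0.51, 0.49]  # made-up sigmoid outputs, one per mail
threshold = 0.5

# Anything above the threshold is treated as spam
labels = ["SPAM" if p > threshold else "HAM" for p in probs]
print(labels)
```

Raising the threshold would flag fewer mails as spam, trading missed spam for fewer ham mails wrongly blocked.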

## So we can see that a simple LSTM model gives an accuracy of 0.98, whereas the best ML model reached just 0.89.

## So finally we have our machine ready. You feed it a message and it tells you whether it's SPAM or HAM

<img src="https://digitalmarketingbypsk.files.wordpress.com/2017/05/21.gif" style="width:30%; float:center;">


## Thank You