# Spam ham detection on UCI Spambase dataset
Author: Swetha Sivakumar \\
USC ID: 5978959727

In this notebook, I tune and train 5 classification models:


*   Model 1 : Random Forest Classifier - an ensemble learning method
*   Model 2 : XGBoost Classifier - an ensemble learning method
*   Model 3 : Neural Network in Tensorflow and Keras
*   Model 4 : Logistic Regression
*   Model 5 : Support Vector Machine

A summary of the results follows (the Support Vector Machine is not included in this table):




In [25]:
t = PrettyTable(['Model Name', 'False Positive Rate', 'False Negative Rate', 'Overall Error Rate', 'Accuracy Score'])
t.add_row(['Random Forest Classifier', np.mean(rf_false_positive_rates), np.mean(rf_false_negative_rates), np.mean(rf_overall_error_rates), np.mean(rf_accuracy_scores)])
t.add_row(['XGBoost Classifier', np.mean(xgb_false_positive_rates), np.mean(xgb_false_negative_rates), np.mean(xgb_overall_error_rates), np.mean(xgb_accuracy_scores)])
t.add_row(['Tensorflow Neural Network', np.mean(nn_false_positive_rates), np.mean(nn_false_negative_rates), np.mean(nn_overall_error_rates), np.mean(nn_accuracy_scores)])
t.add_row(['Logistic Regression', np.mean(logistic_regression_false_positive_rates), np.mean(logistic_regression_false_negative_rates), np.mean(logistic_regression_overall_error_rates), np.mean(logistic_regression_accuracy_scores)])
print(t)

+---------------------------+----------------------+---------------------+----------------------+--------------------+
|         Model Name        | False Positive Rate  | False Negative Rate |  Overall Error Rate  |   Accuracy Score   |
+---------------------------+----------------------+---------------------+----------------------+--------------------+
|  Random Forest Classifier | 0.02869575307495939  | 0.06786776759152449 | 0.04413043478260869  | 0.9558695652173913 |
|     XGBoost Classifier    | 0.03300069621721977  |  0.0789083844332463 | 0.051086956521739134 | 0.9489130434782609 |
| Tensorflow Neural Network | 0.025107655810835204 | 0.06732438831886348 |  0.0417391304347826  | 0.9830369560122492 |
|    Logistic Regression    | 0.04806864186070499  | 0.10871531783133995 | 0.07195652173913045  | 0.9280434782608695 |
+---------------------------+----------------------+---------------------+----------------------+--------------------+


# Conclusion: 
From the results summary, the best models are the Neural Network and the Random Forest Classifier. However, since the neural network was trained for 100 epochs on each fold and the dataset has only 4600 emails, there is a risk of overfitting. Note also that the neural network's accuracy score is the mean validation accuracy over the training epochs, so unlike the other models it is not simply 1 minus the overall error rate. Comparing overall error rates, the Random Forest Classifier (0.0441) is very close to the Neural Network (0.0417) despite its lower reported accuracy.

# Mounting Google Drive to colab

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Changing Current Working Directory to Spambase Dataset

In [2]:
cd drive/My Drive/spambase

/content/drive/My Drive/spambase


# Importing required Python libraries 

In [3]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
import xgboost as xgb 
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential 
from keras.layers import InputLayer 
from keras.layers import Dense 
from keras.layers import Dropout
# plot_confusion_matrix is unused here and was removed in recent scikit-learn releases
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import StratifiedKFold
from collections import Counter
from sklearn.metrics import accuracy_score
from prettytable import PrettyTable

# Read the input .data file into a pandas dataframe

In [4]:
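# Note: spambase.data has no header row, so read_csv uses the first record as
# column names (hence names like '0.64' and '1' below) and 4600 rows remain.
# The rest of the notebook relies on these inferred column names.
df = pd.read_csv("spambase.data")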
df.head()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,0.7,0.64.2,0.8,0.9,0.10,0.32.1,0.11,1.29,1.93,0.12,0.96,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31,0.32.2,0.33,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


# Checking for imbalanced data


> We check whether one class has significantly more data points than the other. A model trained on imbalanced data tends to be biased towards the class with more data points. In such cases, we can generate duplicate or new data points for the class with fewer data points in order to bridge the gap.
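The oversampling idea described above can be sketched as follows. This is a minimal illustration on a hypothetical toy DataFrame (the `toy`, `feature`, and `label` names are assumptions, not part of the Spambase data), and it is not applied anywhere in this notebook because the classes turn out to be reasonably balanced.

```python
import pandas as pd

# Hypothetical toy frame: 6 ham (0) rows and 2 spam (1) rows.
toy = pd.DataFrame({
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "label":   [0, 0, 0, 0, 0, 0, 1, 1],
})

counts = toy["label"].value_counts()
minority_label = counts.idxmin()
minority = toy[toy["label"] == minority_label]

# Sample minority rows with replacement until both classes have the same size.
extra = minority.sample(n=counts.max() - counts.min(), replace=True, random_state=42)
balanced = pd.concat([toy, extra], ignore_index=True)

balanced_counts = balanced["label"].value_counts().to_dict()
print(balanced_counts)
```

If oversampling were used together with cross-validation, it would usually be applied to the training portion of each fold only, so that duplicated rows do not leak into the test portion.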









In [5]:
Counter(df['1'])
## From the result below, we have 2788 non-spam mails and 1812 spam mails, so the distribution is not heavily skewed towards either class.

Counter({0: 2788, 1: 1812})

# Create feature vector and target vector as dataframes 

In [6]:
X = df.iloc[:, :-1]
Y = df.iloc[:,-1:]

# Feature Selection


# [Feature Selection] Test 1: Checking for highly correlated features


> If a feature is highly correlated with another feature, it is unlikely to help improve the model. Such features add dimensionality (contributing to the curse of dimensionality) without adding any new knowledge, and they 'might' bias a classifier towards a certain class. Thus, it is better to keep only one feature from each group of highly correlated features.



In [7]:
## Generating correlation matrix of Spambase dataset 
correlation_matrix = df.corr().abs()
print(correlation_matrix)

               0      0.64    0.64.1  ...        61       278         1
0       1.000000  0.016735  0.065684  ...  0.061387  0.089165  0.126323
0.64    0.016735  1.000000  0.033579  ...  0.000268  0.022679  0.030318
0.64.1  0.065684  0.033579  1.000000  ...  0.107462  0.070119  0.196840
0.1     0.013270  0.006920  0.020240  ...  0.022081  0.021369  0.057394
0.32    0.023120  0.023761  0.077737  ...  0.052290  0.002492  0.241958
0.2     0.059650  0.024815  0.087624  ...  0.090177  0.082089  0.232741
0.3     0.007647  0.003939  0.036725  ...  0.059680  0.008344  0.332255
0.4     0.003970  0.016261  0.012044  ...  0.037578  0.040252  0.206915
0.5     0.106241  0.003803  0.093843  ...  0.189252  0.248726  0.231680
0.6     0.041171  0.032989  0.032135  ...  0.103314  0.087274  0.139088
0.7     0.188441  0.006843  0.048304  ...  0.086795  0.115056  0.234651
0.64.2  0.105811  0.040406  0.083197  ...  0.021773  0.020076  0.007711
0.8     0.066416  0.018836  0.047644  ...  0.041965  0.105150  0

# Removing redundancy causing highly correlated features


> Since the correlation matrix contains the correlation between every possible pair of features, it is symmetric. Thus, to save computation, we can process just the upper or lower triangle.




> In our case, we consider the upper triangle of the correlation matrix and collect the features whose correlation with another feature is greater than 0.95.
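To make the upper-triangle filter concrete, here is a small sketch on synthetic data. The column names `a`, `a_copy`, and `b` are hypothetical; `a_copy` is constructed to be almost perfectly correlated with `a`, so it is the only column expected to exceed the 0.95 threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.001 * rng.normal(size=200),  # nearly identical to "a"
    "b": rng.normal(size=200),                          # independent of "a"
})

corr = toy.corr().abs()
# Keep only entries strictly above the diagonal so each pair is checked once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it is highly correlated with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Because only the upper triangle is inspected, the first feature of each correlated pair is kept and the later one is dropped.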






In [None]:
## np.bool was removed in recent NumPy releases, so the mask below uses the built-in bool
'''print(df.shape)
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
print(to_drop)
df.drop(to_drop, axis=1, inplace=True)
print(df.shape)'''
## One feature, df['0.25'], was found to be highly correlated with another feature.
## This means almost all the variables in this dataset are not highly correlated, and each could 'potentially' contribute to the target pattern.
## After multiple runs with all features, with the correlated feature removed, and with the feature set selected through the Chi-Square test, I noticed no significant performance improvement for any of the models in this specific case.

(4600, 58)
['0.25']
(4600, 57)


# [Feature Selection] Test 2: Chi-Square test


> A chi-square test is used to test the independence of two events. 


> When a feature is independent of the target, it has a small Chi-Square value. A high Chi-Square value means the feature is more dependent on the target and can potentially impart additional knowledge during model training.
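As a rough illustration of how the Chi-Square score separates informative features from uninformative ones, the sketch below builds a hypothetical three-column matrix in which only the middle column depends on the label. The data and variable names are assumptions made for this example; note that `chi2` requires non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_toy = np.array([0] * 100 + [1] * 100)

# Hypothetical non-negative features: one depends on the label, two do not.
informative = np.where(y_toy == 1, 5.0, 0.0) + rng.random(200)
noise = rng.random((200, 2))
X_toy = np.column_stack([noise[:, 0], informative, noise[:, 1]])

selector = SelectKBest(chi2, k=1)
X_best = selector.fit_transform(X_toy, y_toy)
selected = selector.get_support(indices=True)

print(selector.scores_)  # the middle column gets by far the largest score
print(selected)
```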





In [None]:
## After multiple runs with all features, with the correlated feature removed, and with the feature set selected through the Chi-Square test, I noticed no significant performance improvement for any of the models in this specific case.
'''chi2_selector = SelectKBest(chi2, k=10)
X_kbest = chi2_selector.fit_transform(X, Y)
print(df.shape)
cols = chi2_selector.get_support(indices=True)
X = df.iloc[:,cols]
print(df.columns)
print(X.columns)'''

# Splitting train and test

> Usually, the data is split into a train set and a test set: the model is trained on the train set and evaluated on the test set. Since we run K-fold cross-validation on the entire dataset, a separate train_test_split is not needed.



In [None]:
# x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)

# Convert Target variable into categorical variable
This step is required because sklearn classifiers such as SVM and RandomForestClassifier expect a categorical target, i.e. discrete class labels such as 0/1 or 0 to n-1 for n classes, rather than continuous values.

In our case, the target variable is binary, i.e. 0/1, where 1 stands for spam and 0 stands for ham.
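A minimal sketch of what label encoding does, using hypothetical string labels (the Spambase target is already stored as 0/1, so this is only illustrative). `LabelEncoder` assigns integers in sorted order of the class names, which is why `ham` maps to 0 and `spam` maps to 1 here.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw labels; LabelEncoder expects a 1-D sequence.
raw_labels = ["ham", "spam", "spam", "ham"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)

print(list(encoder.classes_))  # class names in the order of their integer codes
print(list(encoded))
```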

In [8]:
label_encoder = preprocessing.LabelEncoder()
# LabelEncoder expects a 1-D array; ravel() flattens the single-column DataFrame
Y = label_encoder.fit_transform(Y.values.ravel())
#y_test = label_encoder.fit_transform(y_test)


# Model 1: Random Forest Classifier 

## Model performance with fixed hyperparameters (without hyperparameter tuning)

In [9]:
rf_clf = RandomForestClassifier(criterion="gini", max_features="log2", n_estimators=73, random_state = 42)
rf_scores = cross_validate(rf_clf, X, Y, cv=10)
np.mean(rf_scores['test_score'])

0.9395652173913044


# What is Hyperparameter Tuning ?


> Hyperparameters are settings that the model cannot learn during the training process. Instead, they are parameters that in fact describe the model itself. Hyperparameters are usually chosen by humans before training starts, based on the prior experience of domain experts or by 'trial and error'.


### Hyperparameter tuning can be done in different ways. Two of the most prominent methods are Grid Search and Randomized Search:


*   Grid Search 
*   Randomized Search
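The difference between the two methods can be sketched on a small synthetic problem. The dataset, the `param_space` dictionary, and the `cv=3` setting below are assumptions chosen to keep the example fast; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Hypothetical toy problem and search space.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
param_space = {"n_estimators": [10, 20, 30], "max_features": ["sqrt", "log2"]}

# Grid search evaluates every combination: 3 * 2 = 6 candidates.
grid_search = GridSearchCV(RandomForestClassifier(random_state=0), param_space, cv=3)
grid_search.fit(X_toy, y_toy)

# Randomized search evaluates only n_iter randomly drawn combinations.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), param_space,
    n_iter=3, cv=3, random_state=0,
)
random_search.fit(X_toy, y_toy)

n_grid = len(grid_search.cv_results_["params"])
n_random = len(random_search.cv_results_["params"])
print(n_grid, n_random)
print(grid_search.best_params_, random_search.best_params_)
```

Randomized search trades exhaustiveness for speed, which matters when the search space is large, as with the 50 candidate values of `n_estimators` used for the random forest in this notebook.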




# Implementing Hyperparameter Tuning for RandomForestClassifier

In [10]:
rf_clf = RandomForestClassifier()
param_grid = [{'n_estimators': list(range(50, 100)),
                    'criterion': ['gini', 'entropy'],
                    'max_features': ['sqrt', 'log2', None]}]
# RandomizedSearchCV samples n_iter (default 10) combinations from param_grid
grid = RandomizedSearchCV(rf_clf, param_grid, n_jobs=-1, refit=True)
grid.fit(X, Y)
rf_clf = grid.best_estimator_       # rf_clf now holds the random forest with the best hyperparameters found by the search

# Doing a 10-fold cross validation 


> We could use sklearn's built-in cross-validation helpers such as **cross_validate(), cross_val_score() or cross_val_predict()**, which are sufficient when only the accuracy of every fold is needed and not the confusion matrix.

> But since we need to print the confusion matrix values for each fold, we instead implement k-fold explicitly in the following cells.


> We use **Stratified K-Fold cross validation** in order to ensure that every split has approximately the same distribution of the two classes as the full dataset. Stratified K-Fold is a variant of K-fold cross validation that preserves the class proportions in each fold.
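The effect of stratification can be checked with a small sketch. The label vector below is hypothetical (60 ham, 40 spam), chosen so that the class counts divide evenly across 10 folds; the splitter settings mirror the ones used in the next cell.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 60% ham (0) and 40% spam (1); features are placeholders.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((100, 1))

skf_toy = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Fraction of spam in the test portion of every fold.
spam_fractions = [y_toy[test_idx].mean() for _, test_idx in skf_toy.split(X_toy, y_toy)]
print(spam_fractions)
```

Every test fold keeps the same 40% spam share as the full label vector, which a plain K-fold split does not guarantee.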



In [11]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

In [12]:
Y = pd.DataFrame(Y)
X = pd.DataFrame(X)
print(type(Y))

<class 'pandas.core.frame.DataFrame'>


In [13]:
### Utility function used by all models to compute the FP rate, FN rate and overall error rate, and to add a row to the result table
def report_model_result(t, my_confusion_matrix, accuracy, accuracy_scores, false_positive_rates, false_negative_rates, overall_error_rates):
    # For labels 0 (ham) and 1 (spam), sklearn lays the matrix out as [[TN, FP], [FN, TP]]
    tn, fp = my_confusion_matrix[0][0], my_confusion_matrix[0][1]
    fn, tp = my_confusion_matrix[1][0], my_confusion_matrix[1][1]
    false_positive_rate = fp / (fp + tn)
    false_negative_rate = fn / (fn + tp)
    overall_error_rate = (fn + fp) / (tn + fp + fn + tp)
    accuracy_scores.append(accuracy)
    false_positive_rates.append(false_positive_rate)
    false_negative_rates.append(false_negative_rate)
    overall_error_rates.append(overall_error_rate)
    # fold_counter is a global variable maintained by the calling cross-validation loop
    t.add_row([fold_counter, false_positive_rate, false_negative_rate, overall_error_rate, accuracy])
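As a worked example of the three rates computed by the utility function, the sketch below uses ten hypothetical emails (5 ham, 5 spam) with made-up predictions. The labels are assumptions for illustration only.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions: 0 = ham, 1 = spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat  = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

false_positive_rate = fp / (fp + tn)                 # ham flagged as spam
false_negative_rate = fn / (fn + tp)                 # spam that slipped through
overall_error_rate = (fp + fn) / (tn + fp + fn + tp)

print(tn, fp, fn, tp)
print(false_positive_rate, false_negative_rate, overall_error_rate)
```

Here one of the five ham emails is flagged as spam and two of the five spam emails are missed, so three of the ten predictions are wrong overall.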
    

In [14]:

### Code to cross-validate using sklearn's built-in method
'''rf_scores = cross_validate(rf_clf, X, Y, cv=10) 
np.mean(rf_scores['test_score'])''' 
### Code for implementing k-fold cross validation manually
t = PrettyTable(['K-th fold number', 'False Positive Rate', 'False Negative Rate', 'Overall Error Rate', 'Accuracy Score'])
rf_accuracy_scores = []
rf_false_positive_rates = []
rf_false_negative_rates = []
rf_overall_error_rates = []
fold_counter = 1
for train_index, test_index in skf.split(X, Y): 
    x_train, x_test = X[X.index.isin(train_index)], X[X.index.isin(test_index)]
    y_train, y_test = Y[Y.index.isin(train_index)], Y[Y.index.isin(test_index)] 
    rf_clf.fit(x_train, y_train.values.ravel()) 
    y_pred = rf_clf.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    my_confusion_matrix = confusion_matrix(y_test, y_pred)
    report_model_result(t, my_confusion_matrix, accuracy, rf_accuracy_scores, rf_false_positive_rates, rf_false_negative_rates, rf_overall_error_rates)
    #confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=my_confusion_matrix, display_labels=rf_clf.classes_).plot()
    fold_counter += 1
print(t)
t = PrettyTable(['Avg. False Positive Rate', 'Avg. False Negative Rate', 'Avg. Overall Error Rate', 'Avg. Accuracy Score'])
t.add_row([np.mean(rf_false_positive_rates), np.mean(rf_false_negative_rates), np.mean(rf_overall_error_rates), np.mean(rf_accuracy_scores)])
print(t)
### Description of False Positives and False negatives to understand the results 
### False Positive (FP) - Mail that is not spam, but is incorrectly being classified as spam. 
### False Negative (FN) - Mail that is spam, but is incorrectly seen as a non-spam email. 

+------------------+----------------------+----------------------+----------------------+--------------------+
| K-th fold number | False Positive Rate  | False Negative Rate  |  Overall Error Rate  |   Accuracy Score   |
+------------------+----------------------+----------------------+----------------------+--------------------+
|        1         | 0.039568345323741004 | 0.08791208791208792  | 0.058695652173913045 | 0.941304347826087  |
|        2         | 0.02158273381294964  | 0.07142857142857142  | 0.041304347826086954 | 0.9586956521739131 |
|        3         | 0.02867383512544803  | 0.049723756906077346 | 0.03695652173913044  | 0.9630434782608696 |
|        4         | 0.025089605734767026 | 0.055248618784530384 | 0.03695652173913044  | 0.9630434782608696 |
|        5         | 0.03942652329749104  | 0.055248618784530384 | 0.04565217391304348  | 0.9543478260869566 |
|        6         | 0.02867383512544803  | 0.055248618784530384 |  0.0391304347826087  | 0.9608695652173913 |
|

**NOTE:** In the above result, we can see an increase in model performance through hyperparameter tuning: the mean accuracy was 0.9396 for the earlier untuned model and is 0.9559 after tuning. Hence, it is essential to choose our model hyperparameters to fit the problem statement as closely as possible.




# Model 2 : XGBoost

In [15]:
xgb_clf = xgb.XGBClassifier()
param_grid = {"learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ] }
grid = RandomizedSearchCV(xgb_clf, param_grid, n_jobs=-1, refit=True)
grid.fit(X, Y.values.ravel())
xgb_clf = grid.best_estimator_

In [16]:
### Code for implementing k-fold cross validation manually
t = PrettyTable(['K-th fold number', 'False Positive Rate', 'False Negative Rate', 'Overall Error Rate', 'Accuracy Score'])
xgb_accuracy_scores = []
xgb_false_positive_rates = []
xgb_false_negative_rates = []
xgb_overall_error_rates = []
fold_counter = 1
for train_index, test_index in skf.split(X, Y): 
    x_train, x_test = X[X.index.isin(train_index)], X[X.index.isin(test_index)]
    y_train, y_test = Y[Y.index.isin(train_index)], Y[Y.index.isin(test_index)] 
    xgb_clf.fit(x_train, y_train.values.ravel()) 
    y_pred = xgb_clf.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    my_confusion_matrix = confusion_matrix(y_test, y_pred)
    report_model_result(t, my_confusion_matrix, accuracy, xgb_accuracy_scores, xgb_false_positive_rates, xgb_false_negative_rates, xgb_overall_error_rates)
    #confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=my_confusion_matrix, display_labels=xgb_clf.classes_).plot()
    fold_counter += 1
print(t)
t = PrettyTable(['Avg. False Positive Rate', 'Avg. False Negative Rate', 'Avg. Overall Error Rate', 'Avg. Accuracy Score'])
t.add_row([np.mean(xgb_false_positive_rates), np.mean(xgb_false_negative_rates), np.mean(xgb_overall_error_rates), np.mean(xgb_accuracy_scores)])
print(t)
### Description of False Positives and False negatives to understand the results 
### False Positive (FP) - Mail that is not spam, but is incorrectly being classified as spam. 
### False Negative (FN) - Mail that is spam, but is incorrectly seen as a non-spam email. 

+------------------+----------------------+----------------------+----------------------+--------------------+
| K-th fold number | False Positive Rate  | False Negative Rate  |  Overall Error Rate  |   Accuracy Score   |
+------------------+----------------------+----------------------+----------------------+--------------------+
|        1         | 0.04316546762589928  | 0.08791208791208792  | 0.06086956521739131  | 0.9391304347826087 |
|        2         | 0.02877697841726619  | 0.08791208791208792  | 0.05217391304347826  | 0.9478260869565217 |
|        3         | 0.03225806451612903  |  0.0718232044198895  | 0.04782608695652174  | 0.9521739130434783 |
|        4         | 0.02867383512544803  | 0.08839779005524862  | 0.05217391304347826  | 0.9478260869565217 |
|        5         | 0.04659498207885305  | 0.055248618784530384 |         0.05         |        0.95        |
|        6         | 0.02867383512544803  | 0.06629834254143646  | 0.043478260869565216 | 0.9565217391304348 |
|

# Model 3 : Neural Network using Tensorflow and Keras 


In [17]:
nn_clf = Sequential()
nn_clf.add(Dense(57, input_dim=57, activation='relu'))
nn_clf.add(Dense(units = 128, activation='relu'))
nn_clf.add(Dense(1, activation='sigmoid'))
# Compile model
nn_clf.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [20]:
### Code for implementing k-fold cross validation manually
t = PrettyTable(['K-th fold number', 'False Positive Rate', 'False Negative Rate', 'Overall Error Rate', 'Accuracy Score'])
nn_accuracy_scores = []
nn_false_positive_rates = []
nn_false_negative_rates = []
nn_overall_error_rates = []
fold_counter = 1
for train_index, test_index in skf.split(X, Y): 
    x_train, x_test = X[X.index.isin(train_index)], X[X.index.isin(test_index)]
    y_train, y_test = Y[Y.index.isin(train_index)], Y[Y.index.isin(test_index)] 
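    # Note: nn_clf is not re-initialized here, so weights learned in earlier folds carry over to later folds
    history = nn_clf.fit(x_train, y_train, batch_size=100, epochs=100, verbose = 0, validation_data=(x_test, y_test))
    # Accuracy is the mean validation accuracy over all epochs of this fold
    accuracy = np.mean(history.history['val_accuracy'])
    # Predicted spam probabilities from the sigmoid output for this fold's test set
    y_pred = nn_clf.predict(x_test)
    # Label an email as spam (1) only if the predicted probability exceeds 0.75
    mod_y_pred = (y_pred.ravel() > 0.75).astype(int)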
    my_confusion_matrix = confusion_matrix(y_test, mod_y_pred)
    report_model_result(t, my_confusion_matrix, accuracy, nn_accuracy_scores, nn_false_positive_rates, nn_false_negative_rates, nn_overall_error_rates)
    #confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=my_confusion_matrix, display_labels=xgb_clf.classes_).plot()
    fold_counter += 1
print(t)
t = PrettyTable(['Avg. False Positive Rate', 'Avg. False Negative Rate', 'Avg. Overall Error Rate', 'Avg. Accuracy Score'])
t.add_row([np.mean(nn_false_positive_rates), np.mean(nn_false_negative_rates), np.mean(nn_overall_error_rates), np.mean(nn_accuracy_scores)])
print(t)
### Description of False Positives and False negatives to understand the results 
### False Positive (FP) - Mail that is not spam, but is incorrectly being classified as spam. 
### False Negative (FN) - Mail that is spam, but is incorrectly seen as a non-spam email. 

+------------------+----------------------+---------------------+----------------------+--------------------+
| K-th fold number | False Positive Rate  | False Negative Rate |  Overall Error Rate  |   Accuracy Score   |
+------------------+----------------------+---------------------+----------------------+--------------------+
|        1         | 0.025179856115107913 | 0.07142857142857142 | 0.043478260869565216 | 0.9574999982118606 |
|        2         | 0.025179856115107913 | 0.07142857142857142 | 0.043478260869565216 | 0.9679347771406174 |
|        3         | 0.025089605734767026 | 0.06629834254143646 | 0.041304347826086954 | 0.9777826076745987 |
|        4         | 0.025089605734767026 | 0.06629834254143646 | 0.041304347826086954 | 0.9846086972951889 |
|        5         | 0.025089605734767026 | 0.06629834254143646 | 0.041304347826086954 | 0.9874782621860504 |
|        6         | 0.025089605734767026 | 0.06629834254143646 | 0.041304347826086954 | 0.985956524014473  |
|        7

# Model 4 : Logistic Regression 

In [21]:
logistic_regression_clf = LogisticRegression()
param_grid = {'solver': ['newton-cg', 'lbfgs', 'liblinear'],  
              'penalty': ['l2'], 
              'C': [100, 10, 1.0, 0.1, 0.01], 
              'max_iter': [100, 1000, 10000, 5000]}
grid = RandomizedSearchCV(logistic_regression_clf, param_grid, n_jobs=-1, refit=True)
grid.fit(X, Y.values.ravel())
logistic_regression_clf = grid.best_estimator_


In [22]:
### Code for implementing k-fold cross validation manually
t = PrettyTable(['K-th fold number', 'False Positive Rate', 'False Negative Rate', 'Overall Error Rate', 'Accuracy Score'])
logistic_regression_accuracy_scores = []
logistic_regression_false_positive_rates = []
logistic_regression_false_negative_rates = []
logistic_regression_overall_error_rates = []
fold_counter = 1
for train_index, test_index in skf.split(X, Y): 
    x_train, x_test = X[X.index.isin(train_index)], X[X.index.isin(test_index)]
    y_train, y_test = Y[Y.index.isin(train_index)], Y[Y.index.isin(test_index)] 
    logistic_regression_clf.fit(x_train, y_train.values.ravel()) 
    y_pred = logistic_regression_clf.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    my_confusion_matrix = confusion_matrix(y_test, y_pred)
    report_model_result(t, my_confusion_matrix, accuracy, logistic_regression_accuracy_scores, logistic_regression_false_positive_rates, logistic_regression_false_negative_rates, logistic_regression_overall_error_rates)
    #confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=my_confusion_matrix, display_labels=xgb_clf.classes_).plot()
    fold_counter += 1
print(t)
t = PrettyTable(['Avg. False Positive Rate', 'Avg. False Negative Rate', 'Avg. Overall Error Rate', 'Avg. Accuracy Score'])
t.add_row([np.mean(logistic_regression_false_positive_rates), np.mean(logistic_regression_false_negative_rates), np.mean(logistic_regression_overall_error_rates), np.mean(logistic_regression_accuracy_scores)])
print(t)
### Description of False Positives and False negatives to understand the results 
### False Positive (FP) - Mail that is not spam, but is incorrectly being classified as spam. 
### False Negative (FN) - Mail that is spam, but is incorrectly seen as a non-spam email. 

+------------------+----------------------+---------------------+---------------------+--------------------+
| K-th fold number | False Positive Rate  | False Negative Rate |  Overall Error Rate |   Accuracy Score   |
+------------------+----------------------+---------------------+---------------------+--------------------+
|        1         | 0.06115107913669065  | 0.10989010989010989 | 0.08043478260869565 | 0.9195652173913044 |
|        2         | 0.050359712230215826 | 0.11538461538461539 | 0.07608695652173914 | 0.9239130434782609 |
|        3         | 0.043010752688172046 | 0.12154696132596685 | 0.07391304347826087 | 0.9260869565217391 |
|        4         | 0.043010752688172046 | 0.11602209944751381 | 0.07173913043478261 | 0.9282608695652174 |
|        5         | 0.07168458781362007  | 0.08287292817679558 | 0.07608695652173914 | 0.9239130434782609 |
|        6         | 0.05017921146953405  | 0.07734806629834254 | 0.06086956521739131 | 0.9391304347826087 |
|        7         

# Model 5 : Support Vector Machines

In [None]:
svm_clf = svm.SVC() 
param_grid = {'C': [0.1, 1, 10, 100, 1000],  
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'kernel': ['linear']}
grid = RandomizedSearchCV(svm_clf, param_grid, n_jobs=-1, refit=True)
grid.fit(X, Y.values.ravel())
svm_clf = grid.best_estimator_

In [None]:
### Code for implementing k-fold cross validation manually
svm_clf = svm.SVC(kernel='linear') ## Note: this replaces the tuned estimator from the previous cell with a linear-kernel SVC that uses default settings
t = PrettyTable(['K-th fold number', 'False Positive Rate', 'False Negative Rate', 'Overall Error Rate', 'Accuracy Score'])
svm_accuracy_scores = []
svm_false_positive_rates = []
svm_false_negative_rates = []
svm_overall_error_rates = []
fold_counter = 1
for train_index, test_index in skf.split(X, Y): 
    x_train, x_test = X[X.index.isin(train_index)], X[X.index.isin(test_index)]
    y_train, y_test = Y[Y.index.isin(train_index)], Y[Y.index.isin(test_index)] 
    svm_clf.fit(x_train, y_train.values.ravel()) 
    y_pred = svm_clf.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    my_confusion_matrix = confusion_matrix(y_test, y_pred)
    report_model_result(t, my_confusion_matrix, accuracy, svm_accuracy_scores, svm_false_positive_rates,svm_false_negative_rates, svm_overall_error_rates)
    #confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=my_confusion_matrix, display_labels=xgb_clf.classes_).plot()
    fold_counter += 1
print(t)
t = PrettyTable(['Avg. False Positive Rate', 'Avg. False Negative Rate', 'Avg. Overall Error Rate', 'Avg. Accuracy Score'])
t.add_row([np.mean(svm_false_positive_rates), np.mean(svm_false_negative_rates), np.mean(svm_overall_error_rates), np.mean(svm_accuracy_scores)])
print(t)
### Description of False Positives and False negatives to understand the results 
### False Positive (FP) - Mail that is not spam, but is incorrectly being classified as spam. 
### False Negative (FN) - Mail that is spam, but is incorrectly seen as a non-spam email. 

**NOTE:** Both Colab and Jupyter Notebook produce .ipynb notebooks. Since Colab uses GPUs provided by Google, it is much faster than running Jupyter Notebook locally, which uses the local CPU. My system isn't equipped with a good GPU, so I used Colab for faster processing. I am capable of doing the same work in both Colab and Jupyter Notebook.

# Results Summary Across all models

In [24]:
t = PrettyTable(['Model Name', 'False Positive Rate', 'False Negative Rate', 'Overall Error Rate', 'Accuracy Score'])
t.add_row(['Random Forest Classifier', np.mean(rf_false_positive_rates), np.mean(rf_false_negative_rates), np.mean(rf_overall_error_rates), np.mean(rf_accuracy_scores)])
t.add_row(['XGBoost Classifier', np.mean(xgb_false_positive_rates), np.mean(xgb_false_negative_rates), np.mean(xgb_overall_error_rates), np.mean(xgb_accuracy_scores)])
t.add_row(['Tensorflow Neural Network', np.mean(nn_false_positive_rates), np.mean(nn_false_negative_rates), np.mean(nn_overall_error_rates), np.mean(nn_accuracy_scores)])
t.add_row(['Logistic Regression', np.mean(logistic_regression_false_positive_rates), np.mean(logistic_regression_false_negative_rates), np.mean(logistic_regression_overall_error_rates), np.mean(logistic_regression_accuracy_scores)])
print(t)

+---------------------------+----------------------+---------------------+----------------------+--------------------+
|         Model Name        | False Positive Rate  | False Negative Rate |  Overall Error Rate  |   Accuracy Score   |
+---------------------------+----------------------+---------------------+----------------------+--------------------+
|  Random Forest Classifier | 0.02869575307495939  | 0.06786776759152449 | 0.04413043478260869  | 0.9558695652173913 |
|     XGBoost Classifier    | 0.03300069621721977  |  0.0789083844332463 | 0.051086956521739134 | 0.9489130434782609 |
| Tensorflow Neural Network | 0.025107655810835204 | 0.06732438831886348 |  0.0417391304347826  | 0.9830369560122492 |
|    Logistic Regression    | 0.04806864186070499  | 0.10871531783133995 | 0.07195652173913045  | 0.9280434782608695 |
+---------------------------+----------------------+---------------------+----------------------+--------------------+
