In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,confusion_matrix

### Spam Classification
The goal is to predict whether an email is spam or not (ham) using `MultinomialNB`.

The provided dataset has
* 5572 rows (number of emails)
* 6217 columns

The last column is the class variable, which takes one of two values: "spam" or "ham".
The columns 0-6215 represent words (see the header of the dataset).

This is an example of the first 5 rows and first 30 columns of the dataset:

<img src="dtataset_example_email.png"></img>

Each row is a different email. A zero value means that the word is not present in that email; a non-zero value
is the number of times the word appears in that email.
The words are stems: a stem is the form of a word before any inflectional affixes are added.
Example: *cost* is the stem of the words *cost*, *costs*, *costly*.
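
The count representation described above can be illustrated with a small sketch. It uses scikit-learn's `CountVectorizer` on three made-up emails that are assumed to be already stemmed. How the provided dataset was actually built is not stated here, so treat this only as an illustration of what the rows and columns mean.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy emails, assumed to be already reduced to stems.
emails = [
    "cost cost offer",
    "meet tomorrow",
    "offer cost free free free",
]

vectorizer = CountVectorizer()
# One row per email, one column per word, entries are word counts.
counts = vectorizer.fit_transform(emails).toarray()

# Column order of the count matrix (words sorted alphabetically).
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts)
```

Each row of `counts` plays the role of one row of X in the dataset: mostly zeros, with positive integers where a word occurs.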

Your task is to:
1. load the dataset using pandas
2. count the number of rows and the number of columns
3. skip data preprocessing: the input is already in the right format for `MultinomialNB`
4. split the dataset into inputs (denoted as X, columns 0-6215) and output (denoted as y, the last column)
5. randomly split the rows of (X, y) into a training set (50% of the rows) and a testing set (50% of the rows); you can use the function `train_test_split` from `sklearn`
6. train `MultinomialNB` on the training set and predict the class variable for both the training and the testing set
7. compute the accuracy for the two predictions, and the confusion matrix for the prediction on the test set
8. implement your own prediction by thresholding the predicted probabilities of the classes ham and spam.

More precisely, if $[p_1, p_2]$ are the predicted probabilities of ham and spam for an instance in the test set, the decision is

```
if p1 > threshold:
    prediction = "ham"
elif p2 > threshold:
    prediction = "spam"
else:
    prediction = "none"  # "none" means that neither of the cases above applies
```
You then need to compute the resulting accuracy over all the cases where the classifier made a decision (that is, where it returns something different from "none").



# Solution

## Loading the dataset

In [5]:
# load the dataset
dataset = pd.read_csv("email_clean.csv")
print("number of rows ", dataset.shape[0])
print("number of cols ", dataset.shape[1])

# splitting input and output
X=dataset.iloc[:,0:-1].values
y=dataset.iloc[:,-1].values
y

number of rows  5572
number of cols  6217


array(['ham', 'ham', 'ham', ..., 'ham', 'spam', 'ham'], dtype=object)

In [6]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [7]:
#split in training and testing
x_train, x_test, y_train, y_test=train_test_split(X, y, test_size=0.5, random_state=7)
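
When one class is much rarer than the other, as spam typically is, a purely random split can leave the two halves with slightly different class proportions. The sketch below shows the optional `stratify` argument of `train_test_split`, which keeps the proportions equal in both halves. The toy arrays `X_toy` and `y_toy` are hypothetical, and stratification is a possible variation, not something the exercise requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham and 20 spam.
y_toy = np.array(["ham"] * 80 + ["spam"] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the ham/spam ratio in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=7, stratify=y_toy
)

print("spam fraction in train:", (y_tr == "spam").mean())
print("spam fraction in test: ", (y_te == "spam").mean())
```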



## Training the classifier and making predictions for the training and testing sets

In [8]:
#MultinomialNB
clf=MultinomialNB()
#training
clf.fit(x_train,y_train)
#prediction
prediction_train=clf.predict(x_train)
prediction_test=clf.predict(x_test)

## Computing accuracy and confusion matrix

In [9]:
# accuracy on the training set
print("Accuracy:"+str(accuracy_score(y_train,prediction_train)))
print()

Accuracy:0.9931801866475233



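The training accuracy is optimistic, because the classifier has already seen these emails. What we care about is the generalisation error, that is, the performance on unseen data, which we estimate on the test set.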

In [10]:

# accuracy on the test set
print("Accuracy:"+str(accuracy_score(y_test,prediction_test)))
print()
#confusion matrix
conf_mat=confusion_matrix(y_test, prediction_test)
print("Confusion Matrix")
print(conf_mat)


Accuracy:0.9745154343144293

Confusion Matrix
[[2352   42]
 [  29  363]]
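
To read the confusion matrix: `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, both in sorted label order (here "ham", then "spam"). So 42 ham emails were predicted as spam and 29 spam emails were predicted as ham. The following sketch recomputes the accuracy from the matrix printed above and also derives precision and recall for the spam class, which the exercise does not ask for but which follow from the same four numbers.

```python
import numpy as np

# Confusion matrix printed above: rows = true labels, columns = predictions,
# both in the order ("ham", "spam").
conf_mat = np.array([[2352, 42],
                     [29, 363]])

# Accuracy: correctly classified emails (the diagonal) over all emails.
accuracy = np.trace(conf_mat) / conf_mat.sum()

# Precision for spam: of the emails predicted as spam, how many are spam.
spam_precision = conf_mat[1, 1] / conf_mat[:, 1].sum()

# Recall for spam: of the true spam emails, how many were detected.
spam_recall = conf_mat[1, 1] / conf_mat[1, :].sum()

print("accuracy      ", accuracy)
print("spam precision", spam_precision)
print("spam recall   ", spam_recall)
```

The recomputed accuracy is 2715/2786, which agrees with the value returned by `accuracy_score`.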


In [12]:
# suppress scientific notation when displaying the probabilities
np.set_printoptions(suppress=True)
proba = clf.predict_proba(x_test)
proba

array([[0.99952141, 0.00047859],
       [0.99974511, 0.00025489],
       [0.90815815, 0.09184185],
       ...,
       [0.99990251, 0.00009749],
       [0.99967452, 0.00032548],
       [0.99999987, 0.00000013]])
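
The columns of `predict_proba` follow the order of the fitted classifier's `classes_` attribute, which holds the labels in sorted order, so here the first column is the probability of ham and the second the probability of spam. A minimal sketch on hypothetical toy data (`X_toy`, `y_toy` and `clf_toy` are illustration names, not part of the exercise) makes the correspondence explicit:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy word counts (2 words) and labels.
X_toy = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
y_toy = np.array(["spam", "spam", "ham", "ham"])

clf_toy = MultinomialNB().fit(X_toy, y_toy)

# Labels in sorted order; this is also the column order of predict_proba.
print(clf_toy.classes_)

proba_toy = clf_toy.predict_proba(X_toy)

# Look up the spam column by name instead of hard-coding its index.
spam_col = list(clf_toy.classes_).index("spam")
print(proba_toy[:, spam_col])
```

Looking up the column through `classes_` avoids relying on a hard-coded label order.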

In [44]:
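def new_predict(proba, threshold, classes=("ham", "spam")):
    # classes must follow the column order of proba (clf.classes_ for predict_proba).
    # Taking the most probable class and comparing it with the threshold is
    # equivalent to the rule in the task whenever threshold >= 0.5.
    output = []
    for i in range(proba.shape[0]):
        # index of the most probable class for instance i
        indmax = np.argmax(proba[i, :])
        if proba[i, indmax] > threshold:
            output.append(classes[indmax])
        else:
            # not confident enough: no decision
            output.append("none")
    return output

ypred_new = new_predict(proba, 0.9999)
# keep only the instances where the classifier made a decision
ind = np.where(np.array(ypred_new) != "none")
# accuracy restricted to those instances
accuracy_score(y_test[ind], np.array(ypred_new)[ind])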

1.0
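
The same thresholding rule can also be written without an explicit loop. The sketch below is one possible vectorised variant, shown on a hypothetical 3-row probability array; the function name `threshold_predict` and its arguments are made up for this illustration.

```python
import numpy as np

def threshold_predict(proba, classes, threshold):
    """Return the most probable class per row, or "none" if it is not above threshold."""
    proba = np.asarray(proba)
    best = proba.argmax(axis=1)                # index of the most probable class
    confident = proba.max(axis=1) > threshold  # True where a decision is made
    return np.where(confident, np.asarray(classes)[best], "none")

# Hypothetical probabilities for three emails, columns ordered as (ham, spam).
proba_toy = np.array([[0.99995, 0.00005],
                      [0.90000, 0.10000],
                      [0.00001, 0.99999]])

pred_toy = threshold_predict(proba_toy, ["ham", "spam"], 0.9999)
print(pred_toy)
```

Here the second email gets "none" because its largest probability, 0.9, does not exceed the threshold.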

In [45]:
# with three labels, rows (true) and columns (predicted) are ordered: ham, none, spam
conf_mat=confusion_matrix(y_test, ypred_new)
print("Confusion Matrix")
print(conf_mat)

Confusion Matrix
[[1005 1389    0]
 [   0    0    0]
 [   0   90  302]]
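
With the extra label "none", the rows (true labels) and columns (predicted labels) of this matrix are ordered "ham", "none", "spam". The middle row is all zeros because no email is truly "none", and the middle column counts the emails for which no decision was made. A perfect accuracy on the decided emails therefore comes at a price: the sketch below, which only does arithmetic on the matrix printed above, computes the fraction of test emails that actually received a decision (called `coverage` here, a name chosen for this illustration).

```python
import numpy as np

# Confusion matrix printed above: rows = true labels, columns = predictions,
# both in the order ("ham", "none", "spam").
conf_mat = np.array([[1005, 1389, 0],
                     [0, 0, 0],
                     [0, 90, 302]])
labels = ["ham", "none", "spam"]
none_col = labels.index("none")

total = conf_mat.sum()                   # all test emails
abstained = conf_mat[:, none_col].sum()  # emails with no decision
decided = total - abstained              # emails with a ham/spam decision

coverage = decided / total
# Correct decisions: true ham predicted ham plus true spam predicted spam.
correct = conf_mat[0, 0] + conf_mat[2, 2]
accuracy_on_decided = correct / decided

print("decided:", decided, "of", total)
print("coverage:", coverage)
print("accuracy on decided emails:", accuracy_on_decided)
```

With the threshold 0.9999, a decision is made for 1307 of the 2786 test emails, and all of those decisions are correct.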
