In [None]:
!pip install pandas seaborn xlsxwriter scikit-learn==1.0.2

## Import libraries
We are using a library that is new to us: scikit-learn, a free, open-source machine learning library developed in large part by researchers at the French national research institute INRIA.

In [None]:
import numpy as np
import pandas as pd

import sklearn.pipeline
import sklearn.feature_extraction.text
import sklearn.naive_bayes
import sklearn.model_selection
import sklearn.metrics

## Load data
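We have a dataset of spam (junk email) and ham (good email). It also comes with a `label_num` column where 0 is ham and 1 is spam. The code below reads the email text from the `input` column and the label from the `output` column.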

In [None]:
data = pd.read_csv("spam_ham_dataset.csv")

In [None]:
data

In [None]:
data['output'].value_counts()

## Split into training and testing datasets
We want to randomly select 80% of this data to use to train the model, then use the remaining 20% to test how good the model is on examples it has not seen before.

In [None]:
input_train, input_test, output_train, output_test = sklearn.model_selection.train_test_split(
    data['input'],
    data['output'],
    test_size=0.2,
)
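
The split above is random, so it changes every time the cell runs. As a minimal sketch (on a made-up toy dataset, not the real emails), the optional `random_state` and `stratify` parameters of `train_test_split` make a split repeatable and keep the ham/spam ratio similar in both parts. The names `toy` and `split` are hypothetical and only used for this illustration.

```python
import pandas as pd
import sklearn.model_selection

# Toy stand-in for the real dataset: 10 messages, 7 ham and 3 spam.
toy = pd.DataFrame({
    'input': [f'message {i}' for i in range(10)],
    'output': ['ham'] * 7 + ['spam'] * 3,
})

def split(seed):
    # random_state fixes the shuffle; stratify keeps the ham/spam ratio
    # roughly equal in the training and testing parts.
    return sklearn.model_selection.train_test_split(
        toy['input'], toy['output'],
        test_size=0.2, random_state=seed, stratify=toy['output'],
    )

first = split(seed=0)
second = split(seed=0)

# Same seed, same split: the test inputs are identical both times.
print(first[1].tolist() == second[1].tolist())
# 80% of 10 rows go to training, 20% to testing.
print(len(first[0]), len(first[1]))
```

Whether to fix the seed is a judgment call: it helps when comparing runs, but the reported metrics then describe one particular split.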

In [None]:
input_train, output_train

In [None]:
input_test, output_test

## Create a blank model from a three-part pipeline and then train it


In [None]:
model = sklearn.pipeline.Pipeline([
    ('vect', sklearn.feature_extraction.text.CountVectorizer()),
    ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()),
    ('clf', sklearn.naive_bayes.MultinomialNB()),
])


In [None]:
model.fit(input_train, output_train)

## Test it on a few sentences in a list

In [None]:
text_list = ['Subject: I need this report by 9am!',
             'Subject: Buy peni$ pill$ 4 CHEAP',
             'Subject: YOUR ACCOUNT HAS BEEN HACKED',
             'Subject: Hi',
             'Subject: Hi sexy']

There are two functions you can use to get a prediction. If you use `model.predict()`, then you just get the text of the predicted output label.

In [None]:
# predict if each line of text is ham or spam
model.predict(text_list)

If you use `model.predict_proba()` you get the probability (a number between 0 and 1) for each output label. We only have two (ham and spam), but if you had more labels, it would show the probability for all of them. You can use `model.classes_` to get the labels in the order that `predict_proba()` displays them.

In [None]:
scores = model.predict_proba(text_list)
scores

In [None]:
model.classes_

In [None]:
scores_dataframe = pd.DataFrame(scores, columns=model.classes_)
scores_dataframe
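
The two prediction functions are closely related. As a rough sketch on a tiny made-up training set (`toy_model` and `toy_texts` are hypothetical names, and four examples are far too few for a real model), picking the most probable column of `predict_proba()` for each row gives the same labels as `predict()`.

```python
import numpy as np
import sklearn.pipeline
import sklearn.feature_extraction.text
import sklearn.naive_bayes

toy_model = sklearn.pipeline.Pipeline([
    ('vect', sklearn.feature_extraction.text.CountVectorizer()),
    ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()),
    ('clf', sklearn.naive_bayes.MultinomialNB()),
])
toy_model.fit(
    ['cheap pills now', 'win cheap prize', 'meeting at noon', 'see report attached'],
    ['spam', 'spam', 'ham', 'ham'],
)

toy_texts = ['cheap prize', 'report for the meeting']
toy_scores = toy_model.predict_proba(toy_texts)

# Each row sums to 1; the columns follow the order of toy_model.classes_.
print(toy_scores.sum(axis=1))

# Picking the most probable column per row reproduces predict().
by_hand = toy_model.classes_[np.argmax(toy_scores, axis=1)]
print(by_hand, toy_model.predict(toy_texts))
```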

## Testing the model on the other 20% of data
Remember the `input_test` dataset? We will use the `model.predict()` function to score those 1035 emails.

In [None]:
input_test

In [None]:
output_predicted = model.predict(input_test)

In [None]:
output_predicted

Remember that the 'true' labels for these are in `output_test`. They are in a slightly different format (a NumPy array displayed horizontally versus a pandas Series displayed as a vertical column), but they are easy for the computer to compare.

In [None]:
output_test

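We can use `sklearn.metrics.accuracy_score()` by first inputting the 'true' labels, then the predictions. We get the model's __accuracy__ as a fraction between 0 and 1:

$accuracy = \frac{\mbox{number of correct predictions}}{\mbox{total number of items predicted}}$

In [None]:
# model.score() expects (inputs, true labels), so model.score(input_test, output_test) gives the same number
print(sklearn.metrics.accuracy_score(output_test, output_predicted))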
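
The accuracy formula can also be worked out by hand. A minimal sketch, using five made-up labels (`true_labels` and `predicted_labels` are hypothetical stand-ins for `output_test` and `output_predicted`) and `sklearn.metrics.accuracy_score` as a cross-check:

```python
import numpy as np
import pandas as pd
import sklearn.metrics

# Hypothetical labels: a pandas Series of true labels and a NumPy array
# of predictions, the same two formats used in this notebook.
true_labels = pd.Series(['ham', 'spam', 'ham', 'ham', 'spam'])
predicted_labels = np.array(['ham', 'ham', 'ham', 'ham', 'spam'])

# Element-wise comparison gives True/False; the mean of those is the
# fraction of correct predictions (4 out of 5 here).
by_hand = np.mean(true_labels.to_numpy() == predicted_labels)
from_sklearn = sklearn.metrics.accuracy_score(true_labels, predicted_labels)
print(by_hand, from_sklearn)
```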

## The Confusion Matrix

But accuracy alone doesn't tell us everything: there could be far more false positives than false negatives. So we use the ___confusion matrix___, which is a 2x2 table of what kinds of correct vs incorrect predictions were made:

![confusion matrix](https://indhumathychelliahcom.files.wordpress.com/2020/12/f653c-1x6gcmh3jedj_quso8pvl6q.png)

[image from Indhumathy Chelliah](https://indhumathychelliah.com/2020/12/23/confusion-matrix%E2%80%8A-%E2%80%8Aclearly-explained/)

In [None]:
confusion_matrix = sklearn.metrics.confusion_matrix(output_test, output_predicted, labels=model.classes_)
confusion_matrix

In [None]:
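# Rows are the true labels and columns are the predictions, both in model.classes_ order.
# This unpacking treats the first class (ham) as "positive" and the second (spam) as "negative".
tp, fn, fp, tn = confusion_matrix.flatten()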
tp, fn, fp, tn
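
The order of the four numbers is easy to mix up, so here is a small sketch of how `sklearn.metrics.confusion_matrix` lays out its result, using five made-up labels (all names below are hypothetical and chosen for this illustration only).

```python
import sklearn.metrics

true_labels = ['ham', 'ham', 'ham', 'spam', 'spam']
predicted_labels = ['ham', 'ham', 'spam', 'ham', 'spam']

toy_matrix = sklearn.metrics.confusion_matrix(
    true_labels, predicted_labels, labels=['ham', 'spam'])
print(toy_matrix)
# [[2 1]
#  [1 1]]

# Rows are the true labels, columns are the predictions, so flattening
# reads the table row by row:
ham_as_ham, ham_as_spam, spam_as_ham, spam_as_spam = toy_matrix.flatten()
print(ham_as_ham, ham_as_spam, spam_as_ham, spam_as_spam)
```

Which of these counts as a "true positive" depends on which class is called positive. With ham as positive, `ham_as_ham` is the true positives and `spam_as_ham` is the false positives.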

In [None]:
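# named matrix_display so it does not shadow the notebook's built-in display() function
matrix_display = sklearn.metrics.ConfusionMatrixDisplay(confusion_matrix, display_labels=model.classes_)
matrix_display.plot()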


We have no false negatives! Here ham is treated as the positive class, so a false negative is a ham email incorrectly predicted to be spam, and we didn't have any. But we have a lot of false positives, where spam email was incorrectly predicted to be ham: about 40% of the time, it let spam through as ham.

# Metrics of success and failure

Also look at the chart at https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context)

## ___Accuracy___:

$\frac{\mbox{# correct predictions}}{\mbox{# items predicted}}$

## ___Precision___ or positive predictive value (PPV): 

$\frac{\mbox{# true positives}}{\mbox{# true positives + # false positives}}$

## ___Recall___, sensitivity, or  true positive rate (TPR):

$\frac{\mbox{# true positives }}{\mbox{# true positives + # false negatives}}$

## ___Specificity___ or true negative rate (TNR):

$\frac{\mbox{# true negatives }}{\mbox{# true negatives + # false positives }}$

# Another visualization
![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/422px-Precisionrecall.svg.png)
![](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/Sensitivity_and_specificity_1.01.svg/422px-Sensitivity_and_specificity_1.01.svg.png)

See https://enwp.org/Precision_and_recall and https://enwp.org/Sensitivity_and_specificity

In [None]:
accuracy = (tp + tn) / (tp + tn + fp + fn)
accuracy

In [None]:
precision = tp / (tp + fp)
precision

In [None]:
recall = tp / (tp + fn)
recall

In [None]:
specificity = tn / (tn + fp)
specificity
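
scikit-learn can also compute these metrics directly, which is a useful cross-check on the hand calculations. A sketch on five made-up labels, assuming ham is the positive class as in the confusion matrix unpacking above (the `toy_` names are hypothetical). The `pos_label` parameter tells `precision_score` and `recall_score` which label counts as positive.

```python
import sklearn.metrics

true_labels = ['ham', 'ham', 'ham', 'spam', 'spam']
predicted_labels = ['ham', 'ham', 'spam', 'ham', 'spam']

# With ham as positive: tp = 2, fn = 1, fp = 1, tn = 1.
toy_precision = sklearn.metrics.precision_score(
    true_labels, predicted_labels, pos_label='ham')
toy_recall = sklearn.metrics.recall_score(
    true_labels, predicted_labels, pos_label='ham')

# Specificity is the recall of the other ("negative") class, so it can be
# computed by asking for recall with spam as the positive label.
toy_specificity = sklearn.metrics.recall_score(
    true_labels, predicted_labels, pos_label='spam')

print(toy_precision, toy_recall, toy_specificity)
```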