In [None]:
!pip install pandas seaborn openpyxl xlsxwriter scikit-learn==1.0.2

## Import libraries
We are using a new library called scikit-learn, a free, open-source machine learning library whose first public release came from researchers at the French national research institute Inria.

In [None]:
import numpy as np
import pandas as pd

import sklearn.pipeline
import sklearn.feature_extraction.text
import sklearn.naive_bayes
import sklearn.model_selection
import sklearn.metrics

## Load data
We are going to be adding rows to this spreadsheet: https://docs.google.com/spreadsheets/d/1xN2E9ZSRp0k1Z90-wDLwYrFj1m5_P-Mdgkj5PMvm46M/edit?usp=sharing

In [None]:
data = pd.read_excel("COMM106E_happysad.xlsx")

In [None]:
data

In [None]:
data['output'].value_counts()

## Split into training and testing datasets
We want to randomly select 80% of this data to use to train the model, then use the remaining 20% to test how good the model is on examples it has not seen before.

In [None]:
input_train, input_test, output_train, output_test = sklearn.model_selection.train_test_split(
    data['input'],
    data['output'],
    test_size=0.2,
)
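
Because the split is random, the rows that land in the test set change every time the cell runs. Here is a minimal sketch of a repeatable split, assuming a tiny made-up dataset (`toy_data` and its rows are hypothetical, not from the spreadsheet). Setting `random_state` fixes the shuffle, and `stratify` keeps the same ratio of categories in the training and testing parts.

```python
import pandas as pd
import sklearn.model_selection

# Hypothetical toy data: 5 "happy" rows and 5 "sad" rows
toy_data = pd.DataFrame({
    'input': ["great day", "love this", "so fun", "best ever", "feeling good",
              "bad day", "hate this", "so boring", "worst ever", "feeling down"],
    'output': ['happy'] * 5 + ['sad'] * 5,
})

toy_in_train, toy_in_test, toy_out_train, toy_out_test = sklearn.model_selection.train_test_split(
    toy_data['input'],
    toy_data['output'],
    test_size=0.2,
    random_state=42,               # same shuffle on every run
    stratify=toy_data['output'],   # keep the happy/sad ratio in both parts
)

print(len(toy_in_train), len(toy_in_test))
print(toy_out_test.value_counts())
```

With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing, one from each category.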

In [None]:
input_train, output_train

In [None]:
input_test, output_test

## Create a blank model from a three-part pipeline and then train it


In [None]:
model = sklearn.pipeline.Pipeline([
    ('vect', sklearn.feature_extraction.text.CountVectorizer()),
    ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()),
    ('clf', sklearn.naive_bayes.MultinomialNB()),
])


In [None]:
model.fit(input_train, output_train)

## Testing the model on the other 20% of data
Remember the `input_test` dataset? We will use the `model.predict()` function to predict a label for each of those held-out examples.

In [None]:
input_test

In [None]:
output_predicted = model.predict(input_test)

In [None]:
output_predicted

Remember that the 'true' labels for these are in `output_test`. They are in a slightly different format (an array displayed horizontally versus a column displayed vertically), but they are easy for the computer to compare.

In [None]:
output_test

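We can use `sklearn.metrics.accuracy_score()` by first inputting the 'true' labels, then the predictions. We get the model's __accuracy__ as a fraction between 0 and 1:

$accuracy = \frac{\mbox{number of correct predictions}}{\mbox{total number of items predicted}}$

In [None]:
# model.score() expects the raw inputs and the true labels, i.e. model.score(input_test, output_test),
# so to compare true labels against predictions we use accuracy_score instead.
print(sklearn.metrics.accuracy_score(output_test, output_predicted))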
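
The accuracy formula above can be checked by hand. This is a minimal sketch on five made-up labels (`toy_true` and `toy_pred` are hypothetical): count the matching positions, divide by the total, and compare with `sklearn.metrics.accuracy_score`.

```python
import numpy as np
import sklearn.metrics

# Hypothetical true labels and predictions for five items
toy_true = np.array(['happy', 'sad', 'happy', 'sad', 'happy'])
toy_pred = np.array(['happy', 'sad', 'sad', 'sad', 'happy'])

# Number of correct predictions divided by total number of items predicted
toy_correct = (toy_true == toy_pred).sum()
toy_accuracy_by_hand = toy_correct / len(toy_true)

# The same calculation done by scikit-learn: true labels first, then predictions
toy_accuracy_sklearn = sklearn.metrics.accuracy_score(toy_true, toy_pred)

print(toy_accuracy_by_hand, toy_accuracy_sklearn)
```

Four of the five predictions match, so both numbers come out to 0.8.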

## The Confusion Matrix

But accuracy alone doesn't tell us everything: there could be way more false positives than false negatives. So we use the ___confusion matrix___, which for two categories is a 2x2 table of what kinds of correct vs incorrect predictions were made:

![confusion matrix](https://indhumathychelliahcom.files.wordpress.com/2020/12/f653c-1x6gcmh3jedj_quso8pvl6q.png)

[image from Indhumathy Chelliah](https://indhumathychelliah.com/2020/12/23/confusion-matrix%E2%80%8A-%E2%80%8Aclearly-explained/)

In [None]:
confusion_matrix = sklearn.metrics.confusion_matrix(output_test, output_predicted, labels=model.classes_)
confusion_matrix
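
To make the layout of the matrix concrete, here is a small sketch on six made-up labels (the `toy_` names are hypothetical). In scikit-learn's `confusion_matrix`, each row is a true label and each column is a predicted label, in the order given by `labels`.

```python
import sklearn.metrics

# Hypothetical true labels and predictions
toy_true = ['happy', 'happy', 'happy', 'sad', 'sad', 'sad']
toy_pred = ['happy', 'happy', 'sad', 'sad', 'sad', 'happy']
toy_labels = ['happy', 'sad']

toy_cm = sklearn.metrics.confusion_matrix(toy_true, toy_pred, labels=toy_labels)
print(toy_cm)

# Rows are true labels, columns are predicted labels
toy_happy_as_sad = toy_cm[0, 1]   # truly happy, predicted sad
toy_sad_as_happy = toy_cm[1, 0]   # truly sad, predicted happy
toy_total_correct = toy_cm.trace()  # the diagonal holds the correct predictions

print(toy_happy_as_sad, toy_sad_as_happy, toy_total_correct)
```

Here the diagonal adds up to 4 correct predictions, and there is one mistake of each kind.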

In [None]:
# Named cm_display so it does not overwrite the notebook's built-in display() function
cm_display = sklearn.metrics.ConfusionMatrixDisplay(confusion_matrix, display_labels=model.classes_)
cm_display.plot()


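In this plot, each row is a true label and each column is a predicted label. The diagonal cells count the correct predictions, and the off-diagonal cells count the mistakes, where an item from one category was predicted to be the other. If one off-diagonal cell is much larger than the other, the model is making one kind of mistake far more often than the other, which accuracy alone would hide. Because the train/test split is random, the exact counts will change from run to run.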

## The Classification Report

If you have more than 2 categories, it might be more useful to look at the classification report:

In [None]:
print(sklearn.metrics.classification_report(output_test, output_predicted))
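
The report lists precision and recall for each category. As a rough sketch of where those numbers come from, the made-up example below (all `toy_` names are hypothetical) computes both by hand for the "sad" category and compares them with `sklearn.metrics.precision_score` and `sklearn.metrics.recall_score`.

```python
import sklearn.metrics

# Hypothetical true labels and predictions
toy_true = ['happy'] * 4 + ['sad'] * 4
toy_pred = ['happy', 'happy', 'happy', 'sad', 'sad', 'sad', 'happy', 'happy']

# Treat "sad" as the category of interest
toy_pairs = list(zip(toy_true, toy_pred))
toy_tp = sum(t == 'sad' and p == 'sad' for t, p in toy_pairs)   # sad, predicted sad
toy_fp = sum(t != 'sad' and p == 'sad' for t, p in toy_pairs)   # not sad, predicted sad
toy_fn = sum(t == 'sad' and p != 'sad' for t, p in toy_pairs)   # sad, predicted something else

# Precision: of everything predicted "sad", how much really was sad?
toy_precision_by_hand = toy_tp / (toy_tp + toy_fp)
# Recall: of everything that really was sad, how much did the model find?
toy_recall_by_hand = toy_tp / (toy_tp + toy_fn)

toy_precision_sklearn = sklearn.metrics.precision_score(toy_true, toy_pred, pos_label='sad')
toy_recall_sklearn = sklearn.metrics.recall_score(toy_true, toy_pred, pos_label='sad')

print(toy_precision_by_hand, toy_precision_sklearn)
print(toy_recall_by_hand, toy_recall_sklearn)
```

Here precision for "sad" is 2 out of 3 and recall is 2 out of 4, so the two measures can disagree even on the same predictions.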

## Output to Excel

In [None]:
data_predicted = data[['input','output']].copy()
data_predicted.columns = ['input', 'output_ground_truth']
data_predicted

In [None]:
data_predicted['output_predicted'] = model.predict(data_predicted['input'])
data_predicted

In [None]:
confidence = pd.DataFrame(model.predict_proba(data_predicted['input']), columns=model.classes_)
confidence

In [None]:
confidence_max = confidence.max(axis=1)
confidence_max

In [None]:
data_predicted['confidence'] = confidence_max
data_predicted
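
One possible use of the confidence column is to flag predictions the model is unsure about so a person can review them. The sketch below assumes a tiny made-up dataset and a cutoff of 0.6; `toy_threshold` and all other `toy_` names are hypothetical, and the right cutoff depends on your data.

```python
import pandas as pd
import sklearn.pipeline
import sklearn.feature_extraction.text
import sklearn.naive_bayes

# Hypothetical training data
toy_inputs = pd.Series(["so happy today", "what a happy day", "feeling sad", "a sad sad day"])
toy_outputs = pd.Series(["happy", "happy", "sad", "sad"])

# Same three-part pipeline as in the notebook
toy_model = sklearn.pipeline.Pipeline([
    ('vect', sklearn.feature_extraction.text.CountVectorizer()),
    ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()),
    ('clf', sklearn.naive_bayes.MultinomialNB()),
])
toy_model.fit(toy_inputs, toy_outputs)

# One probability column per category; each row adds up to 1
toy_probs = pd.DataFrame(toy_model.predict_proba(toy_inputs), columns=toy_model.classes_)

toy_result = pd.DataFrame({
    'input': toy_inputs,
    'predicted': toy_model.predict(toy_inputs),
    'confidence': toy_probs.max(axis=1),   # probability of the winning category
})

# Hypothetical cutoff: rows below it are worth checking by hand
toy_threshold = 0.6
toy_uncertain = toy_result[toy_result['confidence'] < toy_threshold]

print(toy_result)
print(toy_uncertain)
```

With only two categories, the confidence can never drop below 0.5, since the larger of two probabilities that add up to 1 is always at least one half.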

__For lab: change the filename to one that makes sense based on your datasets__

In [None]:
data_predicted.to_excel("COMM106E_happysad_predicted.xlsx", engine='xlsxwriter')