# Final Report

Our Natural Language Processing (NLP) work focuses on the Enron email data set, which contains both spam and genuine emails. Our research aim was: *Analyse the Enron spam vs normal data set to create and visualise a topic model and understand the features of spam emails.* We did this by pre-processing the data set, building a model, and then analysing the resulting model.

Below is a summary of the results of our research into NLP on the Enron email set. For each model we give the classification report, followed by a data frame of the perplexity and coherence scores for each optimal model. Further results and model visualisations can be found in our individual folders; they are worth consulting, since the summary below cannot convey the full detail of the analysis.

In [1]:
import numpy as np
import pandas as pd
import math

import pickle

### Classification Reports

In [2]:
from sklearn.metrics import confusion_matrix, classification_report

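def classification_eval(y_true, y_pred):
    """Print the classification report for Normal vs Spam predictions."""
    print("Confusion Matrix")
    # The matrix is computed but not printed, so only the header line
    # appears in the outputs below; call print(C) to display it.
    C = confusion_matrix(y_true, y_pred)

    print('Classification report')
    print(classification_report(y_true, y_pred, target_names=['Normal', 'Spam'], digits=3))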
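The confusion matrix behind a report like this is a table of counts of (actual, predicted) pairs. The sketch below reproduces that counting in plain Python on toy data, as an illustration rather than the notebook's own code. The helper names `confusion_counts` and `print_confusion` are hypothetical, and the encoding 0 = Normal, 1 = Spam is an assumption that the source does not state.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Return a nested list: rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(actual, predicted)] for predicted in labels] for actual in labels]


def print_confusion(matrix, names=("Normal", "Spam")):
    """Print the counts with row and column labels."""
    print(f"{'':>14}" + "".join(f"{'pred ' + n:>14}" for n in names))
    for name, row in zip(names, matrix):
        print(f"{'actual ' + name:>14}" + "".join(f"{count:>14}" for count in row))


# Toy labels only (assumed encoding: 0 = Normal, 1 = Spam), not the Enron predictions.
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 0]
toy_matrix = confusion_counts(toy_true, toy_pred)
print_confusion(toy_matrix)
```

Reading along a row shows how the emails of one actual class were split between the two predicted classes, which is the information that per-class recall summarises.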

We import the predictions here.

In [3]:
# Matt's predictions
Matt_actual = pickle.load(open('../Data/Actual.p','rb'))
Matt_pred = pickle.load(open('../Data/Matt_pred.p','rb'))

# Alex's predictions
Alex_actual = pickle.load(open('../Data/Alex_y_actual.p','rb'))
Alex_pred = pickle.load(open('../Data/Alex_y_pred.p','rb'))

# Xiao's predictions (not loaded here)
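Each `pickle.load(open(...))` call above leaves the file handle to be closed by the garbage collector. A small helper with a context manager closes it deterministically. This is a sketch only: `load_pickle` is a hypothetical name, and the demo writes a temporary file rather than touching the project's `../Data` folder.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object and close the file as soon as it has been read."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round-trip demo on a temporary file holding made-up labels.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_pred.p"
    with open(demo_path, "wb") as handle:
        pickle.dump([0, 1, 1, 0], handle)
    demo_pred = load_pickle(demo_path)

print(demo_pred)
```

With such a helper, a cell like the one above could read `Matt_pred = load_pickle('../Data/Matt_pred.p')`.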

In [4]:
classification_eval(Matt_actual,Matt_pred)

Confusion Matrix
Classification report
              precision    recall  f1-score   support

      Normal      0.844     0.859     0.851     15046
        Spam      0.829     0.811     0.820     12670

    accuracy                          0.837     27716
   macro avg      0.836     0.835     0.835     27716
weighted avg      0.837     0.837     0.837     27716
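The columns of this report are linked by simple formulas: F1 is the harmonic mean of precision and recall, and the weighted average weights each class by its support. The sketch below recomputes those values from the rounded per-class figures printed above. The function names are hypothetical, and because the inputs are already rounded to three digits, small last-digit differences are possible in general.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_average(values, supports):
    """Average of per-class values, weighted by class support."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# Per-class figures copied from the report above.
report = {
    "Normal": {"precision": 0.844, "recall": 0.859, "support": 15046},
    "Spam": {"precision": 0.829, "recall": 0.811, "support": 12670},
}

for name, row in report.items():
    print(name, round(f1_from(row["precision"], row["recall"]), 3))

supports = [row["support"] for row in report.values()]
weighted_precision = weighted_average(
    [row["precision"] for row in report.values()], supports
)
print("weighted precision", round(weighted_precision, 3))
```

For this report the recomputed values agree with the printed F1 scores (0.851 and 0.820) and the weighted-average precision (0.837).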



In [5]:
classification_eval(Alex_actual,Alex_pred)

Confusion Matrix
Classification report
              precision    recall  f1-score   support

      Normal      0.608     0.709     0.655     15046
        Spam      0.570     0.457     0.507     12670

    accuracy                          0.594     27716
   macro avg      0.589     0.583     0.581     27716
weighted avg      0.590     0.594     0.587     27716



In [6]:
#classification_eval(Xiao_actual,Xiao_pred)

### Perplexity and Coherence

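We load the scores for our optimal models here and compare them. The perplexity values in the table below are negative, so they are presumably log-scale scores rather than raw perplexities, which are always positive.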

In [7]:
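# Matt's scores: prepend the model name, then drop the element at index 2
# so that three values remain, one per column of the table built below.
Matt_values = pickle.load(open('../Data/Matt_opt_values.p','rb'))
Matt_values = ['LDA Matt'] + Matt_values
Matt_values.pop(2)

# Alex's scores: reverse the stored order, then prepend the model name.
Alex_values = pickle.load(open('../Data/Alex_optimal_values.p','rb'))
Alex_values.reverse()
Alex_values = ['LDA_td-idf Alex'] + Alex_values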

In [8]:
co_df = pd.DataFrame(columns=['Model','Perplexity','c_v coherence'])
co_df.loc[0] = Matt_values
co_df.loc[1] = Alex_values
# co_df.loc[2] = Xiao_values
# Xiao ended up using an NMF implementation for topic modelling and is thus unable to
# produce perplexity/coherence scores for comparison with mine and Alex's models.

In [9]:
co_df

|   | Model | Perplexity | c_v coherence |
|---|---|---|---|
| 0 | LDA Matt | -8.149243 | 0.561903 |
| 1 | LDA_td-idf Alex | -21.449396 | 0.480509 |
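A true perplexity cannot be negative, so the values in this table look like per-word log-likelihood bounds. If they came from gensim's `LdaModel.log_perplexity` (an assumption; the source does not name the library or call), gensim's own perplexity estimate is 2 raised to the negated bound. The sketch below applies that conversion; `perplexity_from_bound` is a hypothetical helper name.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound to a perplexity estimate.

    Assumes the base-2 convention used by gensim's LdaModel.log_perplexity.
    """
    return 2.0 ** (-per_word_bound)


# Bounds copied from the table above.
bounds = {"LDA Matt": -8.149243, "LDA_td-idf Alex": -21.449396}

for model, bound in bounds.items():
    print(f"{model}: perplexity estimate {perplexity_from_bound(bound):.1f}")
```

Under this reading, a bound closer to zero corresponds to a lower perplexity.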


### Comparison with removal of re/fw

In [10]:
Alex_rf_pred = pickle.load(open('../Data/y_pred_re.p','rb'))
Alex_rf_actual = pickle.load(open('../Data/y_actual_re.p','rb'))

In [11]:
classification_eval(Alex_rf_actual,Alex_rf_pred)

Confusion Matrix
Classification report
              precision    recall  f1-score   support

      Normal      0.830     0.800     0.815      4736
        Spam      0.325     0.371     0.346      1229

    accuracy                          0.711      5965
   macro avg      0.578     0.585     0.581      5965
weighted avg      0.726     0.711     0.718      5965
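Because the class balance differs between the reports, accuracy is easier to read next to a majority-class baseline, meaning the accuracy obtained by always predicting the larger class. The sketch below computes that baseline from the support counts in the three reports above. The dictionary layout and the name `majority_baseline` are illustrative choices, not part of the original notebook.

```python
def majority_baseline(class_counts):
    """Accuracy of always predicting the most frequent class."""
    return max(class_counts) / sum(class_counts)


# Support counts (Normal, Spam) and accuracy copied from the reports above.
reports = {
    "Matt": {"counts": (15046, 12670), "accuracy": 0.837},
    "Alex": {"counts": (15046, 12670), "accuracy": 0.594},
    "Alex, re/fw removed": {"counts": (4736, 1229), "accuracy": 0.711},
}

for name, info in reports.items():
    baseline = majority_baseline(info["counts"])
    margin = info["accuracy"] - baseline
    print(f"{name}: baseline {baseline:.3f}, accuracy {info['accuracy']:.3f}, margin {margin:+.3f}")
```

On these figures the baseline is about 0.543 for the first two reports and about 0.794 for the re/fw report, so the 0.711 accuracy of the last report sits below what always predicting Normal would achieve. This suggests the per-class Spam scores are a better guide than accuracy in that comparison.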

