# Assessment 3

Compare without reply/forward

In [1]:
import glob
import os
import numpy as np
import pandas as pd
import math

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
np.random.seed(2021)

import nltk
import re

First we need to import the data. References 1 & 2 introduced me to glob which then allows us to import all the text documents in one smooth process. We do this below.

In [2]:
from pathlib import Path

corpus = []

for path in Path(r'C:\Users\corri\OneDrive\Documents\GitHub\DST-Assessment-3\Data\All enron unzip').rglob('*.txt'):
        with open(path, errors='ignore') as f_input:
            corpus.append(f_input.read())

In [3]:
df = pd.DataFrame(corpus)

In [4]:
df.head()

Unnamed: 0,0
0,Legitimate\n----------\n- Owner: farmer-d\n- T...
1,Subject: christmas tree farm pictures\n
2,"Subject: vastar resources , inc .\ngary , prod..."
3,Subject: calpine daily gas nomination\n- calpi...
4,Subject: re : issue\nfyi - see note below - al...


In [5]:
print(len(df))

27721


While importing the files, we've also imported the summary sheets along with them. These need to be removed as they wont contribute to our overall corpus and we do this below along with reindexing the data set (we could just remove them from the data but this is easy).

In [6]:
df = df.drop([0,5173,11031,16544,21720]) # removing unnecessary entries
nlist = np.arange(0,27717,1) # reindexing
df = df.reindex(nlist)
df = df.drop(0)
df.columns = ['E-mails']

In [7]:
df.head()

Unnamed: 0,E-mails
1,Subject: christmas tree farm pictures\n
2,"Subject: vastar resources , inc .\ngary , prod..."
3,Subject: calpine daily gas nomination\n- calpi...
4,Subject: re : issue\nfyi - see note below - al...
5,Subject: meter 7268 nov allocation\nfyi .\n- -...


We're going to assign a binary variable for spam vs normal data to our data set to allow us to perform performance analysis later on. This is done simply since we know where the splits will be for spam/normal data will be in our data frame and we do this below by first creating a list of the positions where spam emails are found and then assigning the binary values to our data frame.

In [8]:
z1 = np.arange(3673,5173)
z2 = np.arange(9534,11031)
z3 = np.arange(15043,16544)
z4 = np.arange(18043,21719)
z5 = np.arange(23220,27716)
ze = np.concatenate((z1,z2,z3,z4,z5))

In [9]:
z = np.zeros(27716)
df['spam'] = z
for i in ze:
    df.iloc[i,1] = 1

In [10]:
df.head()

Unnamed: 0,E-mails,spam
1,Subject: christmas tree farm pictures\n,0.0
2,"Subject: vastar resources , inc .\ngary , prod...",0.0
3,Subject: calpine daily gas nomination\n- calpi...,0.0
4,Subject: re : issue\nfyi - see note below - al...,0.0
5,Subject: meter 7268 nov allocation\nfyi .\n- -...,0.0


In [11]:
df_temp = df

In [12]:
for i in range(0,9):
    if i == 8:
        a = df_temp
        locals()['df_{}'.format(i)] = a
    else:
        a = df_temp.sample(n = 2771)
        locals()['df_{}'.format(i)] = a
        l = list(a.index.values)
        df_temp = df_temp.drop(l)

In [13]:
df_test = df_temp

frames=[df_0, df_1, df_2, df_3, df_4, df_5, df_6, df_7, df_8]
df_train = pd.concat(frames)

### Pre-Processing for LDA Model

In this section we perform pre-processing for our model. This involves techniques such as removing common words, symbols and punctuation. For sake of comparison, we print the first entry to determine the impact the techniques have.

In [14]:
email_sample = df_train.iloc[170].values[0] # some of the emails are less tasteful than other emails sorry!
print(email_sample)

Subject: paypal notification ( your account is suspended )
dear paypal member ,
paypal is committed to maintaining a safe environment
for its community of
buyers and sellers . to protect the security of your
account , paypal employs
some of the most advanced security systems in the
world and our anti - fraud
teams regularly screen the paypal system for unusual
activity .
recently , our account review team identified some
unusual activity in your
account . in accordance with paypal ' s user agreement
and to ensure that your
account has not been compromised , access to it was
flagged . you
will remain flagged until this issue has been
resolved .
this is a fraud prevention measure meant to ensure
that your account is not
compromised .
in order to secure your account and quickly restore
full access , we may
require some specific information from you for the
following reason :
our system requires further account verification .
case id number : pp - 056 - 245 - 481
we encourage you to log in

In [15]:
df_train.head()

Unnamed: 0,E-mails,spam
15773,Subject: january giveaway\nspecial offfer for ...,1.0
25178,Subject: unfaithful bitches\nwives looking for...,1.0
3695,Subject: re : 1 / 2 price sildenafil citrate t...,1.0
4894,Subject: can we go over guillermo ' s budget t...,1.0
13018,Subject: competitive analysis update # 24 - us...,0.0


We use stop words from nltk to remove the majority of the stop words in the data. Some of the words we need to remove are case specific and therefore aren't removed by the stop word processing, such as 'Subject:' which appears in every email and therefore provides no insight for topic modelling.

In [16]:
en_stop = nltk.corpus.stopwords.words('english')
en_stop.append('Subject')

In [17]:
import nltk
from nltk.corpus import wordnet as wn

# Function to lemmatize a word
def get_lemma(word):
   
    lemma = wn.morphy(word) 
    if lemma is None:
        return word
    else:
        return lemma

def prepare_text_for_lda(text):
    
    tokens = nltk.word_tokenize(text) # splits text block into sole words
    tokens = [token for token in tokens if len(token) > 4] # removes words length less than 4, likely to be unimportant
    tokens = [token for token in tokens if token not in en_stop] # removes stop words
    tokens = [get_lemma(token) for token in tokens] # lemmatize (standardise) words
    return tokens

In [18]:
print('original document: ')
words = []
for word in email_sample.split(' '):
    words.append(word)
print(words)

print('\n\n method removing stop words: ')
print(prepare_text_for_lda(email_sample))

original document: 
['Subject:', 'paypal', 'notification', '(', 'your', 'account', 'is', 'suspended', ')\ndear', 'paypal', 'member', ',\npaypal', 'is', 'committed', 'to', 'maintaining', 'a', 'safe', 'environment\nfor', 'its', 'community', 'of\nbuyers', 'and', 'sellers', '.', 'to', 'protect', 'the', 'security', 'of', 'your\naccount', ',', 'paypal', 'employs\nsome', 'of', 'the', 'most', 'advanced', 'security', 'systems', 'in', 'the\nworld', 'and', 'our', 'anti', '-', 'fraud\nteams', 'regularly', 'screen', 'the', 'paypal', 'system', 'for', 'unusual\nactivity', '.\nrecently', ',', 'our', 'account', 'review', 'team', 'identified', 'some\nunusual', 'activity', 'in', 'your\naccount', '.', 'in', 'accordance', 'with', 'paypal', "'", 's', 'user', 'agreement\nand', 'to', 'ensure', 'that', 'your\naccount', 'has', 'not', 'been', 'compromised', ',', 'access', 'to', 'it', 'was\nflagged', '.', 'you\nwill', 'remain', 'flagged', 'until', 'this', 'issue', 'has', 'been\nresolved', '.\nthis', 'is', 'a', 'f

In [19]:
#processed_text = df_train['E-mails'].map(prepare_text_for_lda)
#processed_text.head()

When running the above code, we identify entries in the code that are classified as NaN. The pre-processor cannot handle these exceptions so we remove them as they are immaterial to the final result anyway.

In [20]:
place = []

for j in range(1,len(df_train)):
    try:
        a = df_train.iloc[j,0]
        prepare_text_for_lda(a)
    except Exception:
        place.append(j)

for j in place:
    df_train.iloc[j,0] = ""

In [21]:
processed_text = df_train['E-mails'].map(prepare_text_for_lda) 
processed_text.head()

15773    [january, giveaway, special, offfer, wlndows, ...
25178    [unfaithful, bitch, wife, looking, search, com...
3695                   [price, sildenafil, citrate, today]
4894     [guillermo, budget, today, glory, become, eigh...
13018    [competitive, analysis, update, response, terr...
Name: E-mails, dtype: object

### Creating a Dictionary and a Corpus

In [22]:
dictionary = gensim.corpora.Dictionary(processed_text)

count = 0
for k,v  in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 20251
1 ashman
2 attention
3 attorney
4 cherie
5 complementary
6 confirmation
7 contents
8 downcast
9 dragnet
10 email


In [23]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_text]

### LDA Model

In [24]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics = 2, id2word=dictionary, passes=2, workers = 2)

In [25]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [26]:
lda_display = pyLDAvis.gensim.prepare(lda_model, bow_corpus, 
                                          dictionary, mds='mmds')

  and should_run_async(code)


In [27]:
pyLDAvis.display(lda_display, template_type='notebook')

  and should_run_async(code)


### Classification

Here we look at classification using our LDA model and we want to see how well the model identifies spam vs non-spam emails. We need to generate a classification output from the Bag of Words Corpus for each document. We do this by using only 2 topics and creating a list of the assignment of a document to a topic. In the case where the LDA model produces probablities for both topics, we take the largest one to be the topic assignment.

In [28]:
a = np.zeros(len(df_train))

  and should_run_async(code)


In [29]:
y_pred = []
y_actual = list(df_train['spam'])

for i in range(0,len(df_train)):
    for index, score in sorted(lda_model[bow_corpus[i]], key=lambda tup: -1*tup[1]):
        if score >= 0.5 and index == 0:
            y_pred.append(0)
            break
        elif score > 0.5 and index == 1:
            y_pred.append(1)
            break

  and should_run_async(code)


As a check to ensure the above code has worked and will be compatible with our confusion matrix, we check the length of both the lists.

In [30]:
print(len(y_actual), len(y_pred))

27716 27716


  and should_run_async(code)


Next we define our confusion matrix and classification report. We use sklearn metrics to generate this, in a similar way to project 1 and evaluate.

In [32]:
from sklearn.metrics import confusion_matrix, classification_report

def classification_eval(y_true,y_pred):
    
    print("Confusion Matrix")
    C = confusion_matrix(y_true,y_pred)
    
    print('Classification report')
    print(classification_report(y_true, y_pred, target_names = ['Normal', 'Spam'], digits=3))

  and should_run_async(code)


In [33]:
classification_eval(y_actual,y_pred)

Confusion Matrix
Classification report
              precision    recall  f1-score   support

      Normal      0.290     0.113     0.163     15046
        Spam      0.389     0.672     0.493     12670

    accuracy                          0.369     27716
   macro avg      0.340     0.392     0.328     27716
weighted avg      0.336     0.369     0.314     27716



  and should_run_async(code)


### Perplexity and Coherence

In [34]:
from gensim.models.coherencemodel import CoherenceModel
def getCoherence(m,c,d):
    coherence_model_lda = CoherenceModel(model=m,corpus=c, dictionary=d, coherence='u_mass')
    coherence_lda = coherence_model_lda.get_coherence()
    return(coherence_lda)

  and should_run_async(code)


In [35]:
print("lda_model perplexity: {}".format(lda_model.log_perplexity(bow_corpus[0:1000])))
print("lda_model coherence: {}".format(getCoherence(lda_model,bow_corpus[0:1000],dictionary)))

  and should_run_async(code)


lda_model perplexity: -10.20536763618399
lda_model coherence: -1.4227806482839416


### References

1. [Loading .txt files](https://stackoverflow.com/questions/42407976/loading-multiple-text-files-from-a-folder-into-a-python-list-variable)
2. [Using Glob to retrieve the data](https://stackoverflow.com/questions/2186525/how-to-use-glob-to-find-files-recursively)
3. [WordNet guidance](https://www.nltk.org/howto/wordnet.html)
4. [Lemmatisation](https://www.geeksforgeeks.org/python-lemmatization-with-nltk/)
5. [Stop Words](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/)
6. [Adding Stop Words](https://stackoverflow.com/questions/5511708/adding-words-to-nltk-stoplist)
7. [Classificaiton Report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
8. [Filtering Extereme values from a Dictionary](https://stackoverflow.com/questions/51634656/filtering-tokens-by-frequency-using-filter-extremes-in-gensim)
9. [General LDA - used for index, score function in Classification](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)