# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 12/02/2020

### Description

This script cleans the feature column (`X`) of the `emails.csv` file.

### Steps

1. Import the data generated from `data-process.ipynb`

2. Randomly shuffle the rows and reset the row index so that ham and spam emails are mixed

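3. Preprocess the text:
   - remove punctuation
   - remove stop words, including a custom list of corpus-specific words
   - remove tokens that are not purely alphabetic

4. Drop rows left empty by the cleaning.

5. Save the result as a new CSV file.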

In [1]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

print("SUCCESS. All modules have been imported.")

SUCCESS. All modules have been imported.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
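The shuffle above draws a different order on every run. If a repeatable order is wanted, one option is to pass a seed to `sample`. The sketch below uses a hypothetical four-row frame and an arbitrary seed of 42; neither comes from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for emails.csv
toy = pd.DataFrame({"X": ["ham a", "ham b", "spam a", "spam b"],
                    "y": [0, 0, 1, 1]})

# frac=1 returns every row in random order; random_state fixes that order
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled.equals(again))    # True: same seed, same order
print(shuffled.index.tolist())   # [0, 1, 2, 3]: index rebuilt after shuffling
```

`reset_index(drop=True)` discards the old row labels, so the shuffled frame is numbered from 0 again.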

In [4]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

Unnamed: 0,X,y
0,"debt rankings finally fizzle , but the deal fi...",0
1,people gas and nipsco . it also provides acces...,0
2,joannie,0
3,issued super standard,1
4,""" yann d ' halluin "" on 11 / 16 / 2000 01 : 18...",0
5,Subject: rush to buy on final count down to li...,1
6,if ubsw energy is willing to relocate to the n...,0
7,"skeptical at first , but my best friend & busi...",1
8,legal department .,0
9,necessary to make the statements therein not m...,1


Unnamed: 0,X,y
829200,cc :,0
829201,"? ? ? � ? ? ??? � ? � ? ? ?f? ? ? "" ? ? phishi...",1
829202,2 ) type : http : / / gasmsgboard . corp . enr...,0
829203,>,0
829204,justification : jeff ' s capabilities in these...,0
829205,movie collection anywhere !,1
829206,anjam,0
829207,assumptions or,1
829208,Subject: [ ilug ] stop the mlm insanity,1
829209,creating your own profitable online business a...,1


In [5]:
# Create a stop list
self_define = ['enron','subject','ect','hou','e','http'] # Manual list of words to be removed
stoplist = stopwords.words('english') + list(string.punctuation) + self_define
stoplist = set(stoplist)
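To illustrate how the combined stop list behaves, here is a minimal sketch that avoids the NLTK download. The `english_subset` list is a hypothetical stand-in for `stopwords.words('english')`, which is much longer; the custom words are the ones from the cell above.

```python
import string

# Hypothetical stand-in for NLTK's English stop words (the real list is far longer)
english_subset = ["the", "to", "on", "but", "and"]
# Corpus-specific words taken from the notebook
custom_words = ["enron", "subject", "ect", "hou", "e", "http"]

# A set gives constant-time membership checks, which matters when every token is tested
stoplist = set(english_subset + list(string.punctuation) + custom_words)

for token in ["Subject", "the", ":", "deal"]:
    # Tokens are lowercased before the lookup, as in trim_word
    print(token, token.lower() in stoplist)
```

Only `deal` survives here: `Subject` matches the custom list after lowercasing, `the` is a stop word, and `:` is in `string.punctuation`.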

In [6]:
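def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from an email.

    Param: text: email content as a string
    Return: the remaining tokens joined by single spaces
    '''
    # isalpha() already rejects digits and punctuation, so no separate isdigit() check is needed
    tokens = [word for word in word_tokenize(text)
              if word.isalpha() and word.lower() not in stoplist]
    return " ".join(tokens)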
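The token filter in `trim_word` depends on `str.isalpha()`, which returns `True` only when every character in the token is a letter. The short sketch below, using made-up tokens, shows which kinds of tokens that test keeps and drops.

```python
# Hypothetical tokens of the kind a tokenizer might produce from an email
tokens = ["rush", "2000", "11th", ":", "e-mail", "Deal", "café"]

# Keep only tokens made up entirely of letters
kept = [t for t in tokens if t.isalpha()]
print(kept)  # ['rush', 'Deal', 'café']

# Any all-digit token fails isalpha(), so a separate isdigit() test adds nothing
print(all(not t.isalpha() for t in tokens if t.isdigit()))  # True
```

Mixed tokens such as `11th` and hyphenated words such as `e-mail` are dropped whole, while accented letters count as alphabetic because `isalpha()` is Unicode-aware.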

In [7]:
df['X'] = df['X'].astype(str)

In [8]:
# Remove punctuation, stop words, and self-defined words in X
df['X'] = df['X'].apply(trim_word)

In [9]:
df2 = df.copy()

In [10]:
# Check missing values: rows whose text is empty after cleaning
nan_value = float('NaN')
df2.replace('', nan_value, inplace=True)
n_missing_values = df2['X'].isnull().sum()
print('Total missing values (NaN) in the feature column:', n_missing_values)
print('\nMissing values (NaN) make up {:.2%} of our data.'.format(n_missing_values/len(df.index)))

Total missing values (NaN) in the feature column: 43469

Missing values (NaN) make up 5.24% of our data.


In [11]:
# Remove rows containing missing values
df2.dropna(subset=['X'], inplace=True)
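The count-then-drop pattern from the two cells above can be tried on a small frame. This is a sketch on hypothetical data, written without `inplace=True` so that each intermediate result stays visible.

```python
import pandas as pd

# Hypothetical toy data: two rows became empty strings after cleaning
toy = pd.DataFrame({"X": ["legal department", "", "joannie", ""],
                    "y": [0, 0, 0, 1]})

# Turn empty strings into NaN so pandas treats them as missing
toy["X"] = toy["X"].replace("", float("nan"))

n_missing = int(toy["X"].isna().sum())
share = n_missing / len(toy)

# Drop only the rows whose feature text is missing
cleaned = toy.dropna(subset=["X"])

print(f"{n_missing} empty rows ({share:.2%}); {len(cleaned)} rows kept")
print(cleaned.index.tolist())  # [0, 2]: the original row labels are preserved
```

Because `dropna` keeps the original row labels, the index of the result has gaps where rows were removed.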

In [12]:
display(df2.shape)
display(df2.head(10))
display(df2.tail(10))

(785741, 2)

Unnamed: 0,X,y
0,debt rankings finally fizzle deal fizzled first,0
1,people gas nipsco also provides access anr pip...,0
2,joannie,0
3,issued super standard,1
4,yann halluin pm,0
5,rush buy final count lift,1
6,ubsw energy willing relocate north tower,0
7,skeptical first best friend business associate,1
8,legal department,0
9,necessary make statements therein misleading i...,1


Unnamed: 0,X,y
829199,calger,0
829200,cc,0
829201,f phishing r f,1
829202,type gasmsgboard corp com msgframe asp click,0
829204,justification jeff capabilities areas make ext...,0
829205,movie collection anywhere,1
829206,anjam,0
829207,assumptions,1
829208,ilug stop mlm insanity,1
829209,creating profitable online business making lif...,1


In [13]:
# Save the cleaned DataFrame
df2.to_csv('../data/emails_cleaned.csv', index=False)
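Passing `index=False` keeps the row index out of the saved file. The sketch below shows the difference on reload, using a hypothetical two-row frame and an in-memory buffer in place of `emails_cleaned.csv`.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for the cleaned emails
toy = pd.DataFrame({"X": ["legal department", "joannie"], "y": [0, 0]})

# to_csv() with no path returns the CSV text, which is read back from memory
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'X', 'y']
print(list(without_index.columns))  # ['X', 'y']
```

Without `index=False`, the index is written as an unnamed first column and comes back as `Unnamed: 0` when the file is read again.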