# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 12/02/2020

### Description

This script cleans the feature column `X` in the `emails.csv` file.

### Steps

1. Import the data generated from `data-process.ipynb`

2. Randomly shuffle rows and reset row index to mix hams with spams

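3. Text pre-process:
  - remove punctuation
  - remove stop words and a manual list of corpus-specific words
  - remove numbers and other non-alphabetic tokens

4. Drop rows whose text is empty after cleaning

5. Save as a new csv file.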
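
The steps above can be sketched end to end on a toy DataFrame. This is a minimal sketch under stated assumptions, not the notebook's code: `TOY_STOPLIST` and `clean_text` are hypothetical names, the stop list is a tiny stand-in for the NLTK English list, and `str.split` replaces `word_tokenize` so the example runs without downloading NLTK data. It also drops rows whose text becomes empty, as the notebook does further down.

```python
import string

import pandas as pd

# Hypothetical stand-in for the NLTK English stop words plus the custom words
TOY_STOPLIST = {"the", "is", "to", "you", "subject", "enron"} | set(string.punctuation)


def clean_text(text):
    """Keep alphabetic tokens that are not in the toy stop list."""
    tokens = text.split()  # stand-in for nltk.word_tokenize
    return " ".join(t for t in tokens if t.lower() not in TOY_STOPLIST and t.isalpha())


toy = pd.DataFrame({
    "X": ["Subject: kill your chronic pain", "4 , if possible .", "! ! !", "renewal price ."],
    "y": [1, 0, 0, 0],
})

# Shuffle rows and reset the index (seed fixed only for this example)
toy = toy.sample(frac=1, random_state=0).reset_index(drop=True)

# Remove punctuation, stop words, and non-alphabetic tokens
toy["X"] = toy["X"].astype(str).apply(clean_text)

# Drop rows whose text became empty after cleaning
toy = toy[toy["X"] != ""].reset_index(drop=True)
print(toy)
```

The row `"! ! !"` contains only punctuation, so it is empty after cleaning and is removed.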

In [1]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

print("SUCCESS! All modules have been imported.")

SUCCESS! All modules have been imported.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac = 1).reset_index(drop=True)
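
The shuffle above gives a different row order on every run. If a reproducible order is wanted, `DataFrame.sample` accepts a `random_state` seed. The sketch below uses a toy frame, and the seed value 42 is an arbitrary assumption, not something the notebook sets.

```python
import pandas as pd

toy = pd.DataFrame({"X": ["a", "b", "c", "d", "e"], "y": [0, 0, 1, 1, 0]})

# random_state=42 is an arbitrary seed chosen for this illustration
shuffled_1 = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_2 = toy.sample(frac=1, random_state=42).reset_index(drop=True)

same_order = shuffled_1.equals(shuffled_2)                 # same seed, same order
new_index = list(shuffled_1.index)                         # renumbered from 0
no_rows_lost = sorted(shuffled_1["X"]) == sorted(toy["X"])
print(same_order, new_index, no_rows_lost)
```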

In [4]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

|   | X | y |
|---|---|---|
| 0 | what we would like you to do is populate this ... | 0 |
| 1 | when : tuesday , nov . 6 , 2001 , 8 a . m . - ... | 0 |
| 2 | by $ 47 million and ` ` other investments ' ' ... | 0 |
| 3 | to : lee l papayoti / hou / ect @ ect | 0 |
| 4 | gg | 0 |
| 5 | we are your online source for obtaining origin... | 1 |
| 6 | northeast development corporation negotiating ... | 1 |
| 7 | 4 , if possible . | 0 |
| 8 | $ 50 , 000 worth of options to vest 1 / 3 1 / ... | 0 |
| 9 | please let me know asap who gets approved and ... | 0 |


|   | X | y |
|---|---|---|
| 829200 | Subject: kill your chronic pain | 1 |
| 829201 | meta content = 3 dmshtml 5 . 00 . 2920 . 0 nam... | 1 |
| 829202 | manufacturing company in the world . mars is n... | 0 |
| 829203 | $ 45 adobe premiere elements | 1 |
| 829204 | and | 1 |
| 829205 | renewal price . | 0 |
| 829206 | ! ! ! unknown database . | 0 |
| 829207 | Subject: site status report and qbr | 0 |
| 829208 | but i am assuring you that all will be well at... | 1 |
| 829209 | fuentes | 1 |


In [5]:
# Create a stop list
self_define = ['enron','subject','ect','hou','e','http'] # Manual list of words to be removed
stoplist = stopwords.words('english') + list(string.punctuation) + self_define
stoplist = set(stoplist)

In [6]:
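def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from an email.
    Param: text: email content as a string
    Return: the remaining tokens joined by single spaces
    '''
    # isalpha() already rejects digit-only tokens, so no separate isdigit() check is needed
    kept = [word for word in word_tokenize(text)
            if word.lower() not in stoplist and word.isalpha()]
    return " ".join(kept)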
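
To see which condition removes which token, the filter can be traced token by token. This is an illustrative sketch: `toy_stoplist` and the token list are hypothetical, and the tokens are written out by hand rather than produced by `word_tokenize`.

```python
import string

# Hypothetical mini stop list; the notebook uses NLTK's English list plus custom words
toy_stoplist = set(["when", "a", "enron"] + list(string.punctuation))

tokens = ["When", ":", "tuesday", "2001", "``", "e-mail", "Enron", "nov"]

for tok in tokens:
    in_stoplist = tok.lower() in toy_stoplist  # lowercased so "When" matches "when"
    kept = (not in_stoplist) and tok.isalpha()
    print(tok, in_stoplist, tok.isalpha(), kept)

kept_tokens = [t for t in tokens if t.lower() not in toy_stoplist and t.isalpha()]
print(kept_tokens)
```

The stop list holds only single punctuation characters, so a multi-character token such as `` `` `` is caught by `isalpha()` instead. The same check discards any token with a non-letter in it, including hyphenated words such as `e-mail`.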

In [7]:
df['X'] = df['X'].astype(str)

In [8]:
# Remove punctuation, stop words, and self-defined words in X
df['X'] = df['X'].apply(trim_word)

In [9]:
df2 = df.copy()

In [10]:
# Check missing values
nan_value = float('NaN')
df2.replace('', nan_value, inplace=True)
n_missing_values = df2['X'].isnull().sum()  # select the feature column by name rather than by position
print('Total missing values (NaN) in the feature column:', n_missing_values)
print('\nTotal missing values (NaN) takes up {:.2%} of our data.'.format(n_missing_values/len(df.index)))

Total missing values (NaN) in the feature column: 43469

Total missing values (NaN) takes up 5.24% of our data.
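
The same check can be sketched on a toy frame. The data here is made up for illustration. Empty strings can be counted directly, or converted to NaN first (as the cell above does) and then counted on the column by name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"X": ["kill chronic pain", "", "renewal price", ""], "y": [1, 1, 0, 0]})

# Option 1: count empty strings directly, without modifying the frame
n_empty = (toy["X"] == "").sum()

# Option 2: convert empty strings to NaN in the text column, then count NaN
toy_nan = toy.assign(X=toy["X"].replace("", np.nan))
n_missing = toy_nan["X"].isna().sum()

share = n_missing / len(toy_nan)
print(n_empty, n_missing, "{:.2%}".format(share))
```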


In [11]:
# Remove rows containing missing values
df2.dropna(subset=['X'], inplace=True)
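
`dropna` keeps the original row labels, so the index has gaps afterwards: the cleaned frame has 785741 rows but its labels still run up to 829209. Because the file is later written with `index=False`, the gaps do not reach the CSV. If a contiguous index is wanted inside the notebook, one option is to reset it after dropping, as in this toy sketch (the data is made up for illustration).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"X": ["houston texas", np.nan, "renewal price"], "y": [0, 1, 0]})

dropped = toy.dropna(subset=["X"])
labels_after_drop = list(dropped.index)      # original labels are kept, label 1 is gone

renumbered = dropped.reset_index(drop=True)
labels_after_reset = list(renumbered.index)  # labels are contiguous again

print(labels_after_drop, labels_after_reset)
```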

In [12]:
display(df2.shape)
display(df2.head(10))
display(df2.tail(10))

(785741, 2)

|   | X | y |
|---|---|---|
| 0 | would like populate format numbers believe tru... | 0 |
| 1 | tuesday nov p | 0 |
| 2 | million investments rose | 0 |
| 3 | lee l papayoti | 0 |
| 4 | gg | 0 |
| 5 | online source obtaining original branded presc... | 1 |
| 6 | northeast development corporation negotiating ... | 1 |
| 7 | possible | 0 |
| 8 | worth options vest | 0 |
| 9 | please let know asap gets approved eol grant t... | 0 |


|   | X | y |
|---|---|---|
| 829199 | houston texas | 0 |
| 829200 | kill chronic pain | 1 |
| 829201 | meta content dmshtml name dgenerator | 1 |
| 829202 | manufacturing company world mars habit disclosing | 0 |
| 829203 | adobe premiere elements | 1 |
| 829205 | renewal price | 0 |
| 829206 | unknown database | 0 |
| 829207 | site status report qbr | 0 |
| 829208 | assuring well end day | 1 |
| 829209 | fuentes | 1 |


In [13]:
# Confirm that there are no missing values
df2.isnull().sum()

X    0
y    0
dtype: int64

In [14]:
# Save the cleaned DataFrame
df2.to_csv('../data/emails_cleaned.csv', index=False)
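
Passing `index=False` keeps the row labels out of the file. The sketch below shows the difference on a toy frame, writing to in-memory buffers rather than to `../data/` so it has no side effects. The frame and variable names are hypothetical.

```python
import io

import pandas as pd

toy = pd.DataFrame({"X": ["kill chronic pain", "renewal price"], "y": [1, 0]})

with_index = io.StringIO()
without_index = io.StringIO()
toy.to_csv(with_index)                   # default index=True also writes the row labels
toy.to_csv(without_index, index=False)   # only the X and y columns are written

with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```

When the index is written, reading the file back produces an extra column that pandas names `Unnamed: 0`. With `index=False`, the file round-trips with only `X` and `y`.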