# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 12/02/2020

### Description

This script cleans the feature column (`X`) in the `emails.csv` file.

### Steps

1. Import the data generated from `data-process.ipynb`

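2. Randomly shuffle the rows and reset the row index so that ham and spam emails are mixed

3. Pre-process the text:
  - remove punctuation
  - remove stop words
  - remove digits and other non-alphabetic tokens
  - remove a manual list of corpus-specific words (e.g. `enron`, `subject`)

4. Drop rows left empty after cleaning, and rows with non-English characters

5. Save the result as a new CSV file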

In [1]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

print("SUCCESS! All modules have been imported.")

SUCCESS! All modules have been imported.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
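
The shuffle above draws a different permutation on every run. If a repeatable row order is wanted, one option is to pass a seed. The following is a minimal sketch on a toy frame; the rows and the seed value `42` are assumptions for illustration, not values used in this notebook.

```python
import pandas as pd

# Toy frame standing in for emails.csv (hypothetical rows)
toy = pd.DataFrame({"X": ["a", "b", "c", "d", "e"], "y": [0, 0, 1, 1, 0]})

# random_state fixes the permutation, so re-running gives the same order
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
reshuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled.equals(reshuffled))                # same seed, same order
print(sorted(shuffled["X"]) == sorted(toy["X"]))  # no rows lost or duplicated
```

Each label stays attached to its text because `sample` permutes whole rows; `reset_index(drop=True)` only renumbers them.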

In [4]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

index,X,y
0,"the wall street journal , 11 / 06 / 01",0
1,http : / / m 3 b . info . abrwallk . info /,1
2,no appointments,1
3,thanks,0
4,simply disolve half a plll under your tongue 1...,1
5,"prabhu had pointed out that "" there is no ques...",0
6,- - - - - original message - - - - -,0
7,"earlier wednesday , dynegy inc . ( dyn ) termi...",0
8,largest competitor in the business of trading ...,0
9,"billion at john hancock advisers inc . , and a...",0


index,X,y
829200,"sent : tuesday , september 25 , 2001 1 : 37 pm",0
829201,laurent jacque,0
829202,Subject: profitable business relationship,1
829203,remover contumacy koenig keynesian floorings,1
829204,regard to the individual support accountabilit...,0
829205,pancakesrevulsion,1
829206,please respond to gappy @ stanford . edu,0
829207,the 4 th annual national gas conference,0
829208,"chemicals . biofiavonoids such as quercetin , ...",1
829209,thanks for reading this long email .,0


In [5]:
# Create a stop list
self_define = ['enron','subject','ect','hou','e','http'] # Manual list of words to be removed
stoplist = stopwords.words('english') + list(string.punctuation) + self_define
stoplist = set(stoplist)

In [6]:
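def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from an email line
    Param: text: email content as a string
    Return: the remaining tokens joined by single spaces
    '''
    # isalpha() already rejects digits and punctuation, so no separate isdigit() check is needed
    tokens = [word for word in word_tokenize(text)
              if word.lower() not in stoplist and word.isalpha()]
    return " ".join(tokens)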
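
To see what this filter does to individual lines, here is a rough, dependency-free sketch. It is not the notebook's function: `simple_tokenize` is a hypothetical regex stand-in for `word_tokenize`, and `MINI_STOPWORDS` is a small made-up subset of NLTK's English stop list, so results on real data may differ.

```python
import re
import string

# Hypothetical mini stop list; the notebook uses NLTK's full English list
MINI_STOPWORDS = {"the", "a", "to", "of", "this", "for"}
CUSTOM_WORDS = {"enron", "subject", "ect", "hou", "e", "http"}
stoplist = MINI_STOPWORDS | set(string.punctuation) | CUSTOM_WORDS

def simple_tokenize(text):
    """Rough stand-in for word_tokenize: runs of word characters, or single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

def trim_word_sketch(text):
    """Keep only alphabetic tokens that are not in the stop list."""
    kept = [w for w in simple_tokenize(text)
            if w.lower() not in stoplist and w.isalpha()]
    return " ".join(kept)

print(trim_word_sketch("Subject: profitable business relationship"))
# -> profitable business relationship
print(repr(trim_word_sketch("11 / 06 / 01")))
# -> ''
print(trim_word_sketch("http : / / m 3 b . info . abrwallk . info /"))
# -> m b info abrwallk info
```

The second call shows how a line made only of digits and punctuation collapses to an empty string, which is why empty rows have to be handled after cleaning.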

In [7]:
df['X'] = df['X'].astype(str)

In [8]:
# Remove punctuation, stop words, and self-defined words in X
df['X'] = df['X'].apply(trim_word)

In [24]:
df2 = df.copy()

In [25]:
# Count rows whose text became empty after cleaning, treating '' as missing (NaN)
nan_value = float('NaN')
df2.replace('', nan_value, inplace=True)
n_missing_values = df2['X'].isnull().sum()
print('Total missing values (NaN) in the feature column:', n_missing_values)
print('\nMissing values (NaN) make up {:.2%} of our data.'.format(n_missing_values/len(df2.index)))

Total missing values (NaN) in the feature column: 43469

Missing values (NaN) make up 5.24% of our data.


In [26]:
# Remove rows containing missing values
df2.dropna(subset=['X'], inplace=True)
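
The count-and-drop step can also be written without the NaN conversion, by testing for empty strings directly. This is a minimal sketch on a toy frame with made-up rows, assuming empty strings are the only "missing" values the cleaning step produces.

```python
import pandas as pd

# Toy frame standing in for the cleaned emails (hypothetical rows)
toy = pd.DataFrame({"X": ["original message", "", "thanks", ""],
                    "y": [0, 0, 0, 1]})

# Boolean mask marking rows whose text is empty after cleaning
is_empty = toy["X"].eq("")

n_missing = int(is_empty.sum())
share = n_missing / len(toy)
print(f"{n_missing} empty rows ({share:.2%})")  # -> 2 empty rows (50.00%)

# Keep only the rows that still contain text
cleaned = toy[~is_empty]
print(cleaned["X"].tolist())  # -> ['original message', 'thanks']
```

Unlike `replace('', NaN)` on the whole frame, the mask touches only the `X` column and leaves the label column as it is.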

In [34]:
df2 = df2.sort_values(by='X')

In [44]:
# Drop rows that start with non-English characters: after sorting by X they come last.
# The count of 93 is hard-coded for this dataset.
df2.drop(df2.tail(93).index, inplace=True)
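
Dropping the tail of the sorted frame depends on a hand-counted number and only catches rows that sort last. A possible alternative, sketched below with plain Python strings, is to test each row with `str.isascii()`. The sample strings are made up, and treating "non-English" as "non-ASCII" is an assumption that may be too strict for some data.

```python
def is_ascii_text(text):
    """True when every character in the string is plain ASCII."""
    return text.isascii()

samples = ["zzzz hello", "café menu", "thanks", "日本語"]

# isalpha() accepts non-ASCII letters, so such words can pass an isalpha() filter
print("é".isalpha())  # -> True

# Sorting puts a row last only if it *starts* with a non-ASCII character
print(sorted(samples))
# -> ['café menu', 'thanks', 'zzzz hello', '日本語']

# Filtering by content catches both cases, wherever the row sorts
kept = [s for s in samples if is_ascii_text(s)]
print(kept)  # -> ['zzzz hello', 'thanks']
```

On a DataFrame, the same check could be applied with something like `df2[df2['X'].map(str.isascii)]`, which needs neither the sort nor the fixed row count.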

In [50]:
display(df2.shape)
display(df2.head(10))
display(df2.tail(10))

(785648, 2)

index,X,y
724078,aa,0
32027,aa,0
177740,aa,0
160110,aa,0
41728,aa,0
121819,aa,0
538249,aa exec lead congrats,0
171893,aa houston office interestingly enough wes col...,0
683940,aa indicated proposal regard transactions done,0
676235,aa informed hedges new power company warrants,0


index,X,y
692015,zzeeghh dzmwdqa ytdloexru idhuoo sjkvxwu sbaim,1
417408,zzlm,1
350610,zzmacmac aol com,0
443017,zzsobajqqskityegwqj iumtjhydfqshpmvs roflffqg ...,1
412113,zzw afeet com,1
537732,zzw afeet com,1
767352,zzzz,1
647555,zzzz example com,1
400488,zzzz hello,1
582301,zzzz web site making money pm,1


In [52]:
# Save the cleaned DataFrame
df2.to_csv('../data/emails_cleaned.csv', index=False)