# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 12/02/2020

### Description

This script cleans the feature column (`X`) of `emails.csv`.

### Steps

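1. Import the data generated from `data-process.ipynb`.

2. Randomly shuffle the rows and reset the row index to mix ham and spam emails.

3. Pre-process the text:
  - remove punctuation
  - remove stop words and a manual list of corpus-specific words
  - remove numbers and other non-alphabetic tokens
  
4. Drop rows left empty by the cleaning, as well as rows containing non-English characters.

5. Save the result as a new CSV file.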

In [1]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

print("SUCCESS! All modules have been imported.")

SUCCESS! All modules have been imported.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac = 1).reset_index(drop=True)
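
The shuffle above draws a different row order on every run. If a repeatable order is wanted, a minimal sketch is to pass a fixed seed through `random_state`. The toy frame and the `SEED` value below are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for emails.csv (hypothetical rows, same X/y layout)
toy = pd.DataFrame({"X": ["a", "b", "c", "d", "e"], "y": [0, 1, 0, 1, 0]})

SEED = 42  # assumed value; any fixed integer gives a repeatable order

# Shuffle all rows and rebuild a clean 0..n-1 index
shuffled = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)

# The same seed yields the same row order on every run
print(shuffled.equals(shuffled_again))
```

Fixing the seed keeps the recorded `head()` and `tail()` outputs stable across reruns, at the cost of always using one particular shuffle.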

In [4]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

Unnamed: 0,X,y
0,( d ) contract to take the swing ? please let ...,0
1,enron bonds fall as company taps $ 3 bln credi...,0
2,"matter how you calculate it , you will still",1
3,http : / / hrqmkritq . grave 5217 pinn . com / 15,1
4,"company would violate , and the auditors would...",0
5,"copyright ? 2000 dow jones & company , inc . a...",0
6,ferc to examine more price mitigation,0
7,"houston , nov . 30 ( bloomberg ) - - dynegy in...",0
8,excellent . my interactions with krishnar have...,0
9,5 . geec is expected to fiie for a higher exch...,1


Unnamed: 0,X,y
829200,massachusetts,1
829201,emerging biotech company soon to release groun...,1
829202,cc : vkaminski @ aol . com,0
829203,take 1 minute to fill out our short form,1
829204,reuters english news service - 10 / 24 / 01,0
829205,"prices on earnings , energy prices have alread...",0
829206,the technology - laced nasdaq composite index ...,0
829207,"identification , eliminating the need for stan...",1
829208,"l \ ock on the 3 . 39 % , even",1
829209,> > > rcpt to :,1


In [5]:
# Create a stop list
self_define = ['enron','subject','ect','hou','e','http'] # Manual list of words to be removed
stoplist = stopwords.words('english') + list(string.punctuation) + self_define
stoplist = set(stoplist)

In [6]:
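def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from an email.

    Param: text: email content as a string
    Returns: the remaining tokens joined by single spaces
    '''
    # isalpha() already rejects digits and punctuation, so no separate isdigit() check is needed
    tokens = [word for word in word_tokenize(text)
              if word.lower() not in stoplist and word.isalpha()]
    return " ".join(tokens)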
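
To make the filter's effect concrete, here is a small self-contained sketch of the same idea. It is not the notebook's function: `trim_word_demo` and `demo_stoplist` are hypothetical names, the stop list is a tiny hand-picked stand-in for the NLTK one, and a plain whitespace split replaces `word_tokenize` so the example runs without NLTK data.

```python
import string

# Hypothetical miniature stop list; the notebook builds the real one from NLTK
demo_stoplist = {"the", "to", "is", "as", "subject", "enron", "http"} | set(string.punctuation)

def trim_word_demo(text):
    """Keep purely alphabetic tokens that are not in demo_stoplist."""
    tokens = text.split()  # whitespace split stands in for word_tokenize
    kept = [t for t in tokens if t.lower() not in demo_stoplist and t.isalpha()]
    return " ".join(kept)

sample = "subject : enron bonds fall as company taps $ 3 bln credit"
cleaned = trim_word_demo(sample)
print(cleaned)  # bonds fall company taps bln credit

# A line made only of symbols and numbers is reduced to an empty string
print(repr(trim_word_demo("> > > 10 / 24 / 01")))  # ''
```

The second call shows why some rows end up empty after cleaning: a line that holds only punctuation and digits has no tokens left to keep.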

In [7]:
df['X'] = df['X'].astype(str)

In [8]:
# Remove punctuation, stop words, and self-defined words in X
df['X'] = df['X'].apply(trim_word)

In [9]:
df2 = df.copy()

In [10]:
# Check missing values: rows emptied by trim_word are converted to NaN
nan_value = float('NaN')
df2.replace('', nan_value, inplace=True)
n_missing_values = df2['X'].isnull().sum()
print('Total missing values (NaN) in the feature column:', n_missing_values)
print('\nTotal missing values (NaN) take up {:.2%} of our data.'.format(n_missing_values/len(df2.index)))

Total missing values (NaN) in the feature column: 43469

Total missing values (NaN) take up 5.24% of our data.


In [11]:
# Remove rows containing missing values
df2.dropna(subset=['X'], inplace=True)

In [12]:
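# Sort by text so that rows starting with non-ASCII characters end up at the bottom
df2 = df2.sort_values(by='X')

In [13]:
# Drop the last 93 rows, which contain non-English characters
# (93 is hard-coded for this dataset; re-check the tail if the input changes)
df2.drop(df2.tail(93).index, inplace=True)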
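
Dropping a fixed number of rows from the sorted tail only works while the row count stays at exactly 93. A possible alternative, sketched below on a toy frame, filters on content instead. It assumes that "non-English" can be approximated as "contains a non-ASCII character", which may not match the original intent exactly; the toy rows are invented for illustration.

```python
import pandas as pd

# Toy frame with two ASCII rows and two non-ASCII rows (hypothetical data)
toy = pd.DataFrame({
    "X": ["aa houston office", "zzzz hello", "précis énergie", "новости"],
    "y": [0, 1, 1, 1],
})

# str.isascii() (Python 3.7+) is True only when every character is ASCII
is_ascii = toy["X"].map(str.isascii)
ascii_only = toy[is_ascii]

print(int((~is_ascii).sum()), "rows dropped")
print(ascii_only)
```

Unlike the tail-based drop, this mask does not depend on sort order, and it also catches rows where the non-ASCII character appears after the first word.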

In [14]:
display(df2.shape)
display(df2.head(10))
display(df2.tail(10))

(785648, 2)

Unnamed: 0,X,y
538039,aa,0
608892,aa,0
308753,aa,0
555396,aa,0
340409,aa,0
368798,aa,0
206682,aa exec lead congrats,0
357048,aa houston office interestingly enough wes col...,0
199624,aa indicated proposal regard transactions done,0
419181,aa informed hedges new power company warrants,0


Unnamed: 0,X,y
124530,zzeeghh dzmwdqa ytdloexru idhuoo sjkvxwu sbaim,1
759828,zzlm,1
789233,zzmacmac aol com,0
518114,zzsobajqqskityegwqj iumtjhydfqshpmvs roflffqg ...,1
765014,zzw afeet com,1
71491,zzw afeet com,1
670201,zzzz,1
766841,zzzz example com,1
356386,zzzz hello,1
494973,zzzz web site making money pm,1


In [15]:
# Save the cleaned DataFrame
df2.to_csv('../data/emails_cleaned.csv', index=False)
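
One thing to watch when the saved file is read back: by default `pd.read_csv` treats strings such as `null` and `nan` as missing values, so a cleaned row consisting of only such a word would come back as NaN. The sketch below illustrates this on an in-memory buffer with invented rows. Whether such rows actually occur in `emails_cleaned.csv` is not confirmed by the notebook.

```python
import io
import pandas as pd

# Hypothetical cleaned rows, two of which are literal words that pandas
# interprets as missing-value markers when reading a CSV
toy = pd.DataFrame({"X": ["zzzz hello", "null", "nan"], "y": [1, 0, 1]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns "null" and "nan" into NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps every cell as the literal text that was saved
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["X"].isnull().sum())  # 2
print(literal_read["X"].isnull().sum())  # 0
```

Passing `keep_default_na=False` when loading the cleaned file is one way to keep the row count identical to what was saved.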

In [16]:
print('The model-ready dataset contains {} rows.'.format(df2.shape[0]))

The model-ready dataset contains 785648 rows.
