# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 12/05/2020

### Description

This script cleans the feature column (`X`) of the `emails.csv` file.

### Steps

1. Import the data generated from `data-process.ipynb`

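2. Randomly shuffle rows and reset the row index to mix ham emails with spam

3. Pre-process the text:
  - remove punctuation
  - remove stop words
  - drop rows left empty after cleaning
  - drop rows containing non-English characters

4. Save the result as a new CSV file.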

In [1]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

print("SUCCESS! All modules have been imported.")

SUCCESS! All modules have been imported.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
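
The shuffle above draws a different row order on every run, so the `head` and `tail` previews will not repeat exactly. A minimal sketch of a repeatable shuffle, assuming a fixed seed is acceptable for the project; the toy rows and the `SEED` value are illustrative and not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for emails.csv (hypothetical rows; y = 1 for spam, 0 for ham)
toy = pd.DataFrame({
    "X": ["meeting at noon", "win cash now", "see attached file", "cheap meds online"],
    "y": [0, 1, 0, 1],
})

SEED = 42  # assumed value; any fixed integer works

# frac=1 samples every row exactly once (a full shuffle);
# random_state makes the resulting order repeatable across runs
shuffled = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)

print(shuffled.equals(shuffled_again))  # True: same seed, same order
```

Leaving `random_state` unset, as the notebook does, is also valid; the seed only matters when the exact row order has to be reproduced.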

In [4]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

| | X | y |
|---|---|---|
| 0 | tri - state capital feature stock reports are ... | 1 |
| 1 | click here to save thousands on your mortgage | 1 |
| 2 | and singapore ) . | 0 |
| 3 | thanks , | 0 |
| 4 | humor is a drug which it ' s the fashion to ab... | 1 |
| 5 | we are planning an ena offsite on may 3 rd and... | 0 |
| 6 | i felt i was able to communicate the philosoph... | 0 |
| 7 | Subject: elizabeth 25 has invited you to open ... | 1 |
| 8 | ? how do houston pipe line employees rsvp for ... | 0 |
| 9 | risk books | 0 |


| | X | y |
|---|---|---|
| 829200 | divfont face = arial size = 2 = = = = = = = = ... | 1 |
| 829201 | subject : storage book . . . | 0 |
| 829202 | go through the signature shop on - line . for ... | 0 |
| 829203 | epowe | 1 |
| 829204 | thanks for your presentation today . | 0 |
| 829205 | our e - mail distribution list : | 0 |
| 829206 | ( 212 ) 458 - 2995 | 0 |
| 829207 | vvorld . | 1 |
| 829208 | in 1998 the gas distribution venture with sk w... | 0 |
| 829209 | > ( see attached file : proposal for sps trans... | 0 |


In [5]:
# Create a stop list
self_define = ['enron','subject','ect','hou','e','http'] # Manual list of words to be removed
stoplist = stopwords.words('english') + list(string.punctuation) + self_define
stoplist = set(stoplist)

In [6]:
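def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from an email
    Param: text: email content as a string
    Return: the remaining tokens joined by single spaces
    '''
    # isalpha() already rejects digits and punctuation, so no separate isdigit() check is needed
    tokens = [word for word in word_tokenize(text)
              if word.lower() not in stoplist and word.isalpha()]
    return " ".join(tokens)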
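
To see what this filter does to individual rows, here is a small self-contained sketch. It is not the notebook's function: `simple_tokenize` is a regex stand-in for NLTK's `word_tokenize` (so the example runs without downloading NLTK data), and `toy_stoplist` is a short hypothetical stop list rather than the full English one.

```python
import re
import string

# Short illustrative stop list (assumption); the notebook uses NLTK's full English list
toy_stoplist = ({"the", "is", "a", "to", "on", "your"}
                | set(string.punctuation)
                | {"enron", "subject"})

def simple_tokenize(text):
    """Regex stand-in for word_tokenize: runs of word characters or single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

def trim_word_sketch(text):
    """Keep alphabetic tokens whose lowercase form is not in the stop list."""
    kept = [tok for tok in simple_tokenize(text)
            if tok.lower() not in toy_stoplist and tok.isalpha()]
    return " ".join(kept)

print(trim_word_sketch("Subject: click here to save 1000 on your mortgage !"))
# click here save mortgage
print(repr(trim_word_sketch("( 212 ) 458 - 2995")))
# ''
```

The second call shows why some rows end up empty after cleaning: a line made only of digits and punctuation has no alphabetic tokens left.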

In [7]:
df['X'] = df['X'].astype(str)

In [8]:
# Remove punctuation, stop words, and self-defined words in X
df['X'] = df['X'].apply(trim_word)

In [9]:
df2 = df.copy()

In [10]:
# Check missing values: rows whose text is empty after cleaning become NaN
nan_value = float('NaN')
df2.replace('', nan_value, inplace=True)
n_missing_values = df2['X'].isnull().sum()
print('Total missing values (NaN) in the feature column:', n_missing_values)
print('\nMissing values (NaN) make up {:.2%} of our data.'.format(n_missing_values / len(df2.index)))

Total missing values (NaN) in the feature column: 43469

Missing values (NaN) make up 5.24% of our data.


In [11]:
# Remove rows containing missing values
df2.dropna(subset=['X'], inplace=True)
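
The empty-string handling above can be checked on a toy frame. This is a minimal sketch with hypothetical rows, not the project data; it mirrors the replace, count, and drop sequence on four rows, two of which were emptied by cleaning:

```python
import pandas as pd

# Hypothetical cleaned rows; two of them have no text left
toy = pd.DataFrame({
    "X": ["click save mortgage", "", "thanks", ""],
    "y": [1, 0, 0, 1],
})

# Turn empty strings into NaN so pandas treats them as missing
toy["X"] = toy["X"].replace("", float("nan"))

n_missing = toy["X"].isnull().sum()   # count missing values in the feature column
share = n_missing / len(toy)          # fraction of rows affected

# Keep only rows that still have text
cleaned = toy.dropna(subset=["X"])

print(n_missing, "{:.2%}".format(share), cleaned.shape)
# 2 50.00% (2, 2)
```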

In [12]:
df2 = df2.sort_values(by='X')

In [13]:
# Drop rows that contain non-English characters: after sorting by X,
# these 93 rows fall at the end of the DataFrame
df2.drop(df2.tail(93).index, inplace=True)
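
Dropping a fixed number of tail rows depends on the sort order: text that starts with a non-ASCII character sorts after `z`, but a row where such a character appears later in the text would sort elsewhere. As an alternative, and not the approach used in this notebook, the rows can be filtered by content. A minimal sketch on hypothetical rows, assuming "non-English" can be approximated as "contains a non-ASCII character":

```python
import pandas as pd

# Hypothetical rows: two ASCII-only, two containing non-ASCII characters
toy = pd.DataFrame({
    "X": ["aa exec lead congrats", "zzzz hello", "überweisung konto", "你好 朋友"],
    "y": [0, 1, 1, 1],
})

# str.isascii() (Python 3.7+) is True only when every character is ASCII
is_ascii = toy["X"].map(str.isascii)

# Keep the ASCII-only rows, regardless of where they fall in the sort order
english_only = toy[is_ascii].reset_index(drop=True)

print(english_only["X"].tolist())
# ['aa exec lead congrats', 'zzzz hello']
```

This check does not need the row count to be known in advance, though it would also remove English text containing accented characters.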

In [14]:
display(df2.shape)
display(df2.head(10))
display(df2.tail(10))

(785648, 2)

| | X | y |
|---|---|---|
| 592405 | aa | 0 |
| 791213 | aa | 0 |
| 444298 | aa | 0 |
| 510369 | aa | 0 |
| 539804 | aa | 0 |
| 82322 | aa | 0 |
| 44328 | aa exec lead congrats | 0 |
| 775471 | aa houston office interestingly enough wes col... | 0 |
| 560863 | aa indicated proposal regard transactions done | 0 |
| 79581 | aa informed hedges new power company warrants | 0 |


| | X | y |
|---|---|---|
| 372601 | zzeeghh dzmwdqa ytdloexru idhuoo sjkvxwu sbaim | 1 |
| 423207 | zzlm | 1 |
| 145219 | zzmacmac aol com | 0 |
| 693340 | zzsobajqqskityegwqj iumtjhydfqshpmvs roflffqg ... | 1 |
| 357040 | zzw afeet com | 1 |
| 373816 | zzw afeet com | 1 |
| 706230 | zzzz | 1 |
| 602456 | zzzz example com | 1 |
| 580183 | zzzz hello | 1 |
| 118769 | zzzz web site making money pm | 1 |


In [15]:
# Save the cleaned DataFrame
df2.to_csv('../data/emails_cleaned.csv', index=False)
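
One thing to watch when the saved file is loaded again: by default `pd.read_csv` parses certain strings, including `null`, `nan`, and `NA`, as missing values, so a cleaned row consisting of just such a word would come back as NaN. The sketch below demonstrates this on hypothetical rows with an in-memory buffer instead of the real file; whether the project data contains such rows is not known from this notebook.

```python
import io
import pandas as pd

# Hypothetical cleaned rows; the first one is the literal word "null"
toy = pd.DataFrame({"X": ["null", "thanks presentation today"], "y": [0, 0]})

# Write to an in-memory buffer instead of ../data/emails_cleaned.csv
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing treats "null" as a missing value
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps such strings as ordinary text
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["X"].isnull().sum(), literal_read["X"].isnull().sum())
# 1 0
```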

In [16]:
print('The model-ready dataset contains {} rows.'.format(df2.shape[0]))

The model-ready dataset contains 785648 rows.
