# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 11/30/2020

### Description

This script cleans the text feature column `X` in the `emails.csv` file.

### Steps

1. Import the data generated from `data-process.ipynb`

2. Randomly shuffle rows and reset row index to mix hams with spams

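3. Preprocess the text:
  - remove punctuation
  - remove stop words, including corpus-specific tokens such as `enron` and `subject`
  - remove numbers and other non-alphabetic tokens

4. Save the result as a new CSV file.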

In [22]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jiejie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jiejie/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [23]:
df = pd.read_csv("../data/emails.csv")

In [24]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
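
The shuffle above draws a new permutation on every run, so the exact row order shown in the outputs below will differ between executions. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for the project; the toy DataFrame and the value `random_state=42` are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for emails.csv (hypothetical rows, same column names X and y)
toy = pd.DataFrame({"X": ["a", "b", "c", "d"], "y": [0, 0, 1, 1]})

# Passing random_state makes the shuffle repeatable across runs
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Both shuffles produce the same order, and no rows are lost or duplicated
print(shuffled_a.equals(shuffled_b))
print(sorted(shuffled_a["X"]) == ["a", "b", "c", "d"])
```

Both checks should print `True`. Leaving `random_state` unset, as the notebook does, is also a valid choice when reproducibility of the row order does not matter.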

In [25]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

|  | X | y |
|---|---|---|
| 0 | 11 / 07 / 2001 | 0 |
| 1 | hot springs , ark . , - - tradestar ( otc : ti... | 1 |
| 2 | peopie to know about your website and boost yo... | 1 |
| 3 | ms exchange 2003 enterprise server | 1 |
| 4 | subject : downtime request | 0 |
| 5 | beverly | 0 |
| 6 | > | 0 |
| 7 | in order to avert this negative development , ... | 1 |
| 8 | > good luck . | 0 |
| 9 | apc | 1 |


|  | X | y |
|---|---|---|
| 829200 | Subject: w ! ! ndows critical service pack 2 u... | 1 |
| 829201 | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | 1 |
| 829202 | cartee . info | 1 |
| 829203 | to nearly 6 . your product has saved my sex li... | 1 |
| 829204 | to commit to the kind of budgets which produce... | 1 |
| 829205 | on indicating your interest , all documents an... | 1 |
| 829206 | yowman / corp / enron @ enron , bob sparger / ... | 0 |
| 829207 | association that will bring ex high school , c... | 1 |
| 829208 | soweto national stadium for 2006 bid that we l... | 1 |
| 829209 | spruced raoul reconsider | 1 |


In [37]:
# Create a stop list
self_define=['enron','subject','ect','hou','e','http']
stoplist = stopwords.words('english') + list(string.punctuation) + self_define
stoplist = set(stoplist)
stoplist

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'e',
 'each',
 'ect',
 'enron',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'hou',
 'how',
 'http',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mus

In [38]:
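def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from an email.
    Param: text: email content as a string
    Returns: the remaining tokens joined by single spaces
    '''
    # isalpha() already rejects purely numeric tokens, so no separate isdigit() check is needed
    tokens = [word for word in word_tokenize(text)
              if word.lower() not in stoplist and word.isalpha()]
    return " ".join(tokens)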
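
To see what this filter does on individual lines, here is a dependency-free sketch. It is an approximation, not the notebook's function: it swaps `word_tokenize` for a plain whitespace split (the Enron text is already space-separated) and uses a small hypothetical stop list, `SAMPLE_STOPLIST`, in place of NLTK's full English list. The name `trim_word_sketch` is also an assumption introduced for illustration.

```python
import string

# Hypothetical, abbreviated stop list; the notebook combines NLTK's English
# stop words, string.punctuation, and a few corpus-specific tokens.
SAMPLE_STOPLIST = {"to", "about", "your", "and", "in", "this",
                   "subject", "enron"} | set(string.punctuation)

def trim_word_sketch(text):
    """Keep purely alphabetic tokens that are not in the stop list."""
    tokens = text.split()  # stand-in for nltk.word_tokenize
    kept = [t for t in tokens
            if t.lower() not in SAMPLE_STOPLIST and t.isalpha()]
    return " ".join(kept)

print(trim_word_sketch("subject : downtime request"))
print(trim_word_sketch("ms exchange 2003 enterprise server"))
print(trim_word_sketch("> good luck ."))
```

The three calls print `downtime request`, `ms exchange enterprise server`, and `good luck`. A line made up only of punctuation or digits reduces to an empty string.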

In [39]:
df['X'] = df['X'].astype(str)

In [40]:
# Remove punctuation and stop words in X
df['X'] = df['X'].apply(trim_word)

In [44]:
# Save the cleaned DataFrame
df.to_csv('../data/emails_cleaned.csv', index=False)

In [45]:
df2 = pd.read_csv("../data/emails_cleaned.csv")

In [46]:
display(df2.shape)
display(df2.head(10))
display(df2.tail(10))

(829210, 2)

|  | X | y |
|---|---|---|
| 0 |  | 0 |
| 1 | hot springs ark tradestar otc tirr today annou... | 1 |
| 2 | peopie know website boost revenues way | 1 |
| 3 | ms exchange enterprise server | 1 |
| 4 | downtime request | 0 |
| 5 | beverly | 0 |
| 6 |  | 0 |
| 7 | order avert negative development | 1 |
| 8 | good luck | 0 |
| 9 | apc | 1 |


|  | X | y |
|---|---|---|
| 829200 | w ndows critical service pack update january th | 1 |
| 829201 |  | 1 |
| 829202 | cartee info | 1 |
| 829203 | nearly product saved sex life matt fl | 1 |
| 829204 | commit kind budgets producers caiiber | 1 |
| 829205 | indicating interest documents proofs enable get | 1 |
| 829206 | yowman corp bob sparger corp tim | 0 |
| 829207 | association bring ex high school college | 1 |
| 829208 | soweto national stadium bid lost germany | 1 |
| 829209 | spruced raoul reconsider | 1 |
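
Several rows above have an empty `X` after cleaning, because every token in them was punctuation, a digit, or a stop word. When such a file is read back with `pd.read_csv`, empty fields are parsed as missing values by default. A possible follow-up, not part of the original notebook, is to count and drop those rows before modeling. The sketch below assumes a toy DataFrame and an in-memory buffer in place of `emails_cleaned.csv`:

```python
import io
import pandas as pd

# Toy stand-in for the cleaned data (hypothetical rows)
toy = pd.DataFrame({"X": ["", "good luck", "apc"], "y": [0, 0, 1]})

# Round-trip through CSV, as the notebook does with emails_cleaned.csv
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Empty strings come back as NaN under read_csv's default settings
n_missing = int(reloaded["X"].isna().sum())

# Drop rows whose text was emptied by the cleaning step
non_empty = reloaded.dropna(subset=["X"]).reset_index(drop=True)
print(n_missing, len(non_empty))
```

Whether to drop these rows or keep them as empty documents depends on how the downstream vectorizer and classifier should treat emails with no informative tokens.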
