# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 11/30/2020

### Description

This script cleans the feature column (`X`) of `emails.csv`.

### Steps

1. Import the data generated by `data-process.ipynb`

2. Randomly shuffle the rows and reset the row index to mix ham and spam emails

3. Pre-process the text:
  - remove punctuation
  - remove stop words
  - remove numbers and any other non-alphabetic tokens

4. Save the result as a new CSV file

In [15]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jiejie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jiejie/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
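
The shuffle above draws a different row order on every run. If a repeatable order is wanted, one option is to pass a seed through the `random_state` parameter of `DataFrame.sample`. The following is a minimal sketch on a made-up toy frame; the rows and the seed value `42` are illustrative assumptions, not taken from the project.

```python
import pandas as pd

# Toy stand-in for emails.csv (hypothetical rows, same X / y layout)
toy = pd.DataFrame({
    "X": ["ham line a", "ham line b", "ham line c", "spam line a", "spam line b", "spam line c"],
    "y": [0, 0, 0, 1, 1, 1],
})

# frac=1 samples every row exactly once, i.e. a full shuffle;
# random_state fixes the permutation so reruns give the same order
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled.equals(again))  # same seed, same order
```

Without `random_state`, downstream results that depend on row order (for example a positional train/test split) may differ between runs.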

In [8]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

|  | X | y |
|---|---|---|
| 0 | provided by investext . the reports , which ar... | 0 |
| 1 | $ 75 adobe audition 1 . 5 | 1 |
| 2 | > | 0 |
| 3 | cc : heather choate / hou / ect @ ect , irena ... | 0 |
| 4 | suspending all others . it also resumed limite... | 0 |
| 5 | of ferc rejection of southern company ? s setrans | 0 |
| 6 | way through the class to make sure that we don... | 0 |
| 7 | government wants to give away this money . . .... | 1 |
| 8 | middle office functional groups have been intr... | 0 |
| 9 | error : dbcaps 97 data : cannot perform this o... | 0 |


|  | X | y |
|---|---|---|
| 829200 | to invest this fund abroad in a confidential m... | 1 |
| 829201 | appropriate . | 0 |
| 829202 | ls hpl katy 30 . 000 / enron | 0 |
| 829203 | from : oxley , david | 0 |
| 829204 | Subject: perfect logo charset = koi 8 - r " > | 1 |
| 829205 | all prices in u . s . dollars , ex - works , | 1 |
| 829206 | the only solution to penis growth | 1 |
| 829207 | day to settle positions , said kilduff , whose... | 0 |
| 829208 | thanks , | 0 |
| 829209 | cc : hasan kedwaii / et & s / enron @ enron | 0 |


In [11]:
# Create a stop list
stoplist = stopwords.words('english') + list(string.punctuation)
stoplist = set(stoplist)

In [12]:
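def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from an email line.
    Param: text: email content as a string
    Returns: the remaining tokens joined by single spaces
    '''
    # isalpha() already rejects digit-only tokens, so no separate isdigit() check is needed
    tokens = [
        word for word in word_tokenize(text)
        if word.lower() not in stoplist and word.isalpha()
    ]
    return " ".join(tokens)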
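
To see what this filter does on individual lines, here is a small dependency-free sketch. It is not the notebook's function: `trim_word_demo` and `DEMO_STOPLIST` are hypothetical names, the stop list is a short hand-picked stand-in for the NLTK English list, and a plain whitespace split stands in for `word_tokenize` (which works here only because the sample lines are already space-separated).

```python
import string

# Stand-in stop list (assumption): a few English stop words plus punctuation.
# The notebook uses the full NLTK English stop word list instead.
DEMO_STOPLIST = {"from", "the", "to", "this", "in", "a", "by", "of", "all"} | set(string.punctuation)

def trim_word_demo(text, stoplist=DEMO_STOPLIST):
    """Keep purely alphabetic tokens that are not in the stop list."""
    tokens = text.split()  # whitespace split, standing in for word_tokenize
    kept = [t for t in tokens if t.lower() not in stoplist and t.isalpha()]
    return " ".join(kept)

# Sample lines copied from the preview of the raw data
samples = ["$ 75 adobe audition 1 . 5", "from : oxley , david", ">"]
cleaned = [trim_word_demo(s) for s in samples]
print(cleaned)  # ['adobe audition', 'oxley david', '']
```

The stop word test is applied to the lowercased token, but the token itself is kept in its original case. A line that consists only of symbols, such as `>`, is reduced to an empty string.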

In [13]:
df['X'] = df['X'].astype(str)

In [16]:
# Remove punctuation and stop words in X
df['X'] = df['X'].apply(trim_word)

In [17]:
# Save the cleaned DataFrame
df.to_csv('../data/emails_cleaned.csv', index=False)

In [19]:
df2 = pd.read_csv("../data/emails_cleaned.csv")

In [21]:
display(df2.shape)
display(df2.head(10))
display(df2.tail(10))

(829210, 2)

|  | X | y |
|---|---|---|
| 0 | provided investext reports updated continuously | 0 |
| 1 | adobe audition | 1 |
| 2 |  | 0 |
| 3 | cc heather choate hou ect ect irena hogan hou ... | 0 |
| 4 | suspending others also resumed limited trading | 0 |
| 5 | ferc rejection southern company setrans | 0 |
| 6 | way class make sure miss anyone pam | 0 |
| 7 | government wants give away money congressional | 1 |
| 8 | middle office functional groups introduced dua... | 0 |
| 9 | error dbcaps data perform operation closed dat... | 0 |


|  | X | y |
|---|---|---|
| 829200 | invest fund abroad confidential manner came | 1 |
| 829201 | appropriate | 0 |
| 829202 | ls hpl katy enron | 0 |
| 829203 | oxley david | 0 |
| 829204 | Subject perfect logo charset koi r | 1 |
| 829205 | prices u dollars ex works | 1 |
| 829206 | solution penis growth | 1 |
| 829207 | day settle positions said kilduff whose compan... | 0 |
| 829208 | thanks | 0 |
| 829209 | cc hasan kedwaii et enron enron | 0 |
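
Row 2 in the cleaned preview is empty: its original content was just `>`, so nothing survived the filter. With default settings, `pandas.read_csv` parses such empty fields as `NaN`, which a later vectorization step may not accept. The sketch below reproduces the round trip on a made-up three-row frame held in an in-memory buffer; the frame, the variable names, and the choice to drop the empty rows are assumptions for illustration, not something the notebook does.

```python
import io
import pandas as pd

# Toy cleaned frame with one row that became empty after filtering
cleaned = pd.DataFrame({"X": ["adobe audition", "", "oxley david"], "y": [1, 0, 0]})

# Round trip through CSV, using an in-memory buffer instead of a file
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string comes back as NaN
n_empty = int(reloaded["X"].isna().sum())
print(n_empty)

# One possible handling: drop rows with no remaining text
reloaded_nonempty = reloaded.dropna(subset=["X"]).reset_index(drop=True)
print(reloaded_nonempty.shape)
```

An alternative that keeps every row is to reload with `keep_default_na=False`, so empty fields stay as empty strings.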
