# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 11/30/2020

### Description

This script cleans the feature column `X` of the `emails.csv` file.

### Steps

1. Import the data generated from `data-process.ipynb`

2. Randomly shuffle the rows and reset the row index to mix ham emails with spam

3. Pre-process the text:
  - remove punctuation
  - remove stop words
  - remove numbers and other non-alphabetic tokens

4. Save the result as a new CSV file.

In [7]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jiejie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jiejie/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
df = pd.read_csv("../data/emails.csv")

In [9]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
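The shuffle above draws a new order on every run, so the head and tail previews will differ between executions. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for this project; the `SEED` value and the toy frame are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the emails frame; the column names X and y match the notebook.
toy = pd.DataFrame({"X": ["a", "b", "c", "d"], "y": [0, 0, 1, 1]})

SEED = 42  # hypothetical seed, chosen only for illustration

# Passing random_state makes the permutation repeatable across runs.
shuffled = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)
reshuffled = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)

print(shuffled.equals(reshuffled))
```

Each row keeps its own label because `sample` permutes whole rows, and `reset_index(drop=True)` discards the old index instead of adding it as a column.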

In [10]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

|   | X | y |
|---|---|---|
| 0 | > > | 0 |
| 1 | there are 2 larger file cabinets that are loca... | 0 |
| 2 | x - iris - host : 3666990549 / [ 218 . 145 . 2... | 1 |
| 3 | muschar | 0 |
| 4 | fax : ( 402 ) 691 - 9552 | 0 |
| 5 | good one for us non - mba ' s . | 0 |
| 6 | in january , enron told investors its ' ' recu... | 0 |
| 7 | et c ? est ce partenaire qui pourra m ' aider . | 1 |
| 8 | department | 0 |
| 9 | registration see www . rogroup . com . | 0 |


|   | X | y |
|---|---|---|
| 829200 | hi . real women in a city by city database of ... | 1 |
| 829201 | acquisition by dynegy inc . | 0 |
| 829202 | - - - - - original message - - - - - | 0 |
| 829203 | - - - - - - - - - - - - - - - - - - - - - - fo... | 0 |
| 829204 | 9 / 7 / 00 | 0 |
| 829205 | subject : re : cairn gas purchase bid | 0 |
| 829206 | save up to 50 % order with | 1 |
| 829207 | ready to boost your sex life ? positive ? | 1 |
| 829208 | thesecu rities exch ange act of 1934 . any sta... | 1 |
| 829209 | equipment to detect possibie problems before t... | 1 |


In [12]:
# Create a stop list
self_define=['enron','subject','ect','hou','e','http']
stoplist = stopwords.words('english') + list(string.punctuation) + self_define
stoplist = set(stoplist)

In [13]:
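def trim_word(text):
    '''Remove stop words, punctuation, and other non-alphabetic tokens from an email.
    Param: text: email content as a string
    Return: the kept tokens joined by single spaces
    '''
    # isalpha() already rejects digit-only tokens, so a separate isdigit() check is redundant
    tokens = [word for word in word_tokenize(text)
              if word.isalpha() and word.lower() not in stoplist]
    return " ".join(tokens)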
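To see what this filter does to individual lines, here is a small sketch that mirrors the same logic without NLTK. It is an approximation, not the notebook's function: `str.split()` stands in for `word_tokenize` (reasonable here only because the sample rows are already space-separated), and `demo_stoplist` is a tiny hypothetical subset of the real stop list.

```python
import string

# Hypothetical, much smaller stop list; the notebook uses NLTK's English stop words.
demo_stoplist = (
    {"up", "to", "with", "the", "your"}
    | set(string.punctuation)
    | {"enron", "subject", "ect", "hou", "e", "http"}
)

def trim_word_demo(text):
    """Keep alphabetic tokens that are not in the demo stop list."""
    tokens = text.split()  # stand-in for word_tokenize
    kept = [t for t in tokens if t.isalpha() and t.lower() not in demo_stoplist]
    return " ".join(kept)

print(trim_word_demo("save up to 50 % order with"))  # save order
print(trim_word_demo("fax : ( 402 ) 691 - 9552"))    # fax
print(repr(trim_word_demo("9 / 7 / 00")))            # ''
```

The last call shows that a line made up only of numbers and punctuation collapses to an empty string, which is worth keeping in mind when the cleaned column is saved and reloaded.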

In [14]:
df['X'] = df['X'].astype(str)
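If `X` contains missing values (an assumption; the notebook does not show whether it does), casting with `astype(str)` can turn them into the literal text `nan` in many pandas versions, and that token would pass the `isalpha()` filter as an ordinary word. A minimal sketch of one way to guard against this is to fill missing values with an empty string before casting; the toy series below is illustrative.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value, standing in for df['X'].
raw = pd.Series([np.nan, "fax : ( 402 )", "good luck"])

# Replace missing values first so they cannot become the word "nan".
safe = raw.fillna("").astype(str)

print(safe.tolist())
```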

In [15]:
# Remove punctuation and stop words in X
df['X'] = df['X'].apply(trim_word)

In [44]:
# Save the cleaned df
df.to_csv('../data/emails_cleaned.csv', index=False)

In [45]:
df2 = pd.read_csv("../data/emails_cleaned.csv")

In [46]:
display(df2.shape)
display(df2.head(10))
display(df2.tail(10))

(829210, 2)

|   | X | y |
|---|---|---|
| 0 |  | 0 |
| 1 | hot springs ark tradestar otc tirr today annou... | 1 |
| 2 | peopie know website boost revenues way | 1 |
| 3 | ms exchange enterprise server | 1 |
| 4 | downtime request | 0 |
| 5 | beverly | 0 |
| 6 |  | 0 |
| 7 | order avert negative development | 1 |
| 8 | good luck | 0 |
| 9 | apc | 1 |


|   | X | y |
|---|---|---|
| 829200 | w ndows critical service pack update january th | 1 |
| 829201 |  | 1 |
| 829202 | cartee info | 1 |
| 829203 | nearly product saved sex life matt fl | 1 |
| 829204 | commit kind budgets producers caiiber | 1 |
| 829205 | indicating interest documents proofs enable get | 1 |
| 829206 | yowman corp bob sparger corp tim | 0 |
| 829207 | association bring ex high school college | 1 |
| 829208 | soweto national stadium bid lost germany | 1 |
| 829209 | spruced raoul reconsider | 1 |
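
Some rows in the reloaded frame have no text left in `X`: when every token of a line is filtered out, the cleaned value is an empty string, and `pd.read_csv` parses an empty field as a missing value by default. The sketch below shows two possible ways to handle this on a toy frame, either keeping the empty strings on reload or dropping the empty rows. Whether such rows should be dropped is a modelling decision the notebook does not state, so treat both options as assumptions.

```python
import io
import pandas as pd

# Toy cleaned frame with one row whose text was filtered away entirely.
cleaned = pd.DataFrame({"X": ["good luck", "", "apc"], "y": [0, 0, 1]})

# Round-trip through CSV in memory instead of writing to ../data/.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)  # empty field is parsed as a missing value

buffer.seek(0)
kept_empty = pd.read_csv(buffer, keep_default_na=False)  # empty field stays ""

# Option: drop rows that have no text left after cleaning.
non_empty = default_read.dropna(subset=["X"]).reset_index(drop=True)

print(default_read["X"].isna().tolist())
print(kept_empty["X"].tolist())
print(non_empty.shape)
```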
