# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 12/02/2020

### Description

This script cleans the text feature column `X` in `emails.csv`.

### Steps

1. Import the data generated from `data-process.ipynb`

2. Randomly shuffle rows and reset row index to mix hams with spams

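3. Pre-process the text:
  - remove punctuation
  - remove stop words (the NLTK English list plus a manual list such as `enron` and `subject`)
  - remove numeric and other non-alphabetic tokens

4. Drop rows whose text is empty after cleaning

5. Save the result as a new CSV file (`emails_cleaned.csv`)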

In [1]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

print("SUCCESS. All modules have been imported.")

SUCCESS. All modules have been imported.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
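
The shuffle above draws a different row order on every run. If a repeatable order is wanted, one option is to pass `random_state` to `sample`. The sketch below uses a toy DataFrame rather than the real `emails.csv`, and the seed value 42 is an arbitrary assumption, not something taken from this project.

```python
import pandas as pd

# Toy stand-in for the emails table (the real data is read from ../data/emails.csv)
toy = pd.DataFrame({
    "X": ["meeting at noon", "cheap pills now", "see attached report", "win a free prize"],
    "y": [0, 1, 0, 1],
})

# Assumption: 42 is an arbitrary seed; any fixed integer makes the shuffle repeatable
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# The two shuffles agree row for row, and no rows are lost or duplicated
print(shuffled_a["X"].tolist() == shuffled_b["X"].tolist())
print(sorted(shuffled_a["X"]) == sorted(toy["X"]))
```

`reset_index(drop=True)` discards the old row labels, so the shuffled frame is labelled 0 to n-1 again.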

In [4]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

Unnamed: 0,X,y
0,04 / 16 / 2001 12 : 43 pm,0
1,Subject: t . v .,0
2,"profitable transaction , 70 % for me and 2 of my",1
3,legal,1
4,m 11 - 12,0
5,author : refined products team,0
6,utilities fortnightly ( puf ) .,0
7,paper said . an investment by a private - equi...,0
8,the favored route in my advise to customers is...,1
9,from : i . q . software - bucharest,1


Unnamed: 0,X,y
829200,> shirley crenshaw,0
829201,"hello , welcome to the medzonli midwinter ne",1
829202,subject : integration meeting,0
829203,e - mail - info @ pro - techonline . com,1
829204,call for details and discount pricing !,1
829205,the daily jump in energy prices .,1
829206,"flexible discounts : ioqo improvement , additi...",1
829207,500 100 ural and de eff - in con,1
829208,"we c 2 s ( c 2 s recruitment service co . , lt...",1
829209,time : 6 : 30 p . m . - ?,0


In [5]:
# Create a stop list
self_define = ['enron','subject','ect','hou','e','http'] # Manual list of words to be removed
stoplist = stopwords.words('english') + list(string.punctuation) + self_define
stoplist = set(stoplist)

In [6]:
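def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from email text
    Param: text: email content as a string
    Return: the remaining tokens joined by single spaces
    '''
    # isalpha() already rejects digit-only tokens, so no separate isdigit() check is needed
    tokens = [word for word in word_tokenize(text)
              if word.lower() not in stoplist and word.isalpha()]
    return " ".join(tokens)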
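
To see what the filter keeps and drops without downloading any NLTK data, here is a minimal standard-library sketch of the same idea. It is not the notebook's function: `demo_tokenize` is a hypothetical regex stand-in for `word_tokenize`, and `demo_stoplist` is a tiny hand-written stand-in for the NLTK English stop list, so results on other inputs may differ from `trim_word`.

```python
import re
import string

# Assumption: a few common words stand in for NLTK's English stop list
demo_stoplist = (
    {"the", "is", "a", "to", "for", "of", "and"}
    | set(string.punctuation)
    | {"enron", "subject", "ect", "hou", "e", "http"}  # manual list from the notebook
)

def demo_tokenize(text):
    """Hypothetical stand-in for word_tokenize: word runs and single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

def demo_trim_word(text):
    """Keep purely alphabetic tokens that are not in the stop list."""
    kept = [w for w in demo_tokenize(text)
            if w.lower() not in demo_stoplist and w.isalpha()]
    return " ".join(kept)

print(demo_trim_word("Subject: the call for details and discount pricing !"))
print(demo_trim_word("04 / 16 / 2001 12 : 43 pm"))
print(repr(demo_trim_word("6 : 30 - ?")))  # nothing survives, so the result is ''
```

The last call shows why some rows end up as empty strings after cleaning: a line made only of numbers and punctuation has no alphabetic tokens left.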

In [7]:
df['X'] = df['X'].astype(str)

In [8]:
# Remove punctuation, stop words, and self-defined words in X
df['X'] = df['X'].apply(trim_word)

In [15]:
df2 = df.copy()

In [20]:
# Remove rows whose text became an empty string after cleaning
nan_value = float('NaN')
df2.replace('', nan_value, inplace=True)
df2.dropna(subset=['X'], inplace=True)
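
The cell above turns empty strings into NaN and then drops them. A boolean mask is another way to express the same filter; the sketch below compares the two routes on a toy DataFrame (hypothetical data, not rows from `emails.csv`).

```python
import pandas as pd

# Toy frame where two rows were reduced to empty strings by the cleaning step
toy = pd.DataFrame({
    "X": ["profitable transaction", "", "legal", ""],
    "y": [1, 0, 1, 0],
})

# Route 1: boolean mask that keeps rows with non-empty text
kept = toy[toy["X"] != ""]

# Route 2: replace '' with NaN, then drop rows where X is missing
via_nan = toy.replace("", float("nan")).dropna(subset=["X"])

print(kept["X"].tolist())
print(kept.index.tolist() == via_nan.index.tolist())
```

Neither route resets the index, so the surviving rows keep their original labels. That is why the cleaned output further down skips some row numbers.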

In [21]:
display(df2.shape)
display(df2.head(10))
display(df2.tail(10))

(785741, 2)

Unnamed: 0,X,y
0,pm,0
1,v,0
2,profitable transaction,1
3,legal,1
5,author refined products team,0
6,utilities fortnightly puf,0
7,paper said investment private equity firm coul...,0
8,favored route advise customers start assessing,1
9,q software bucharest,1
10,outstanding shares mil website www genethera net,1


Unnamed: 0,X,y
829200,shirley crenshaw,0
829201,hello welcome medzonli midwinter ne,1
829202,integration meeting,0
829203,mail info pro techonline com,1
829204,call details discount pricing,1
829205,daily jump energy prices,1
829206,flexible discounts ioqo improvement additional,1
829207,ural de eff con,1
829208,c c recruitment service co ltd specialized hum...,1
829209,time p,0


In [22]:
# Save the cleaned DataFrame
df2.to_csv('../data/emails_cleaned.csv', index=False)