# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 12/02/2020

### Description

This script cleans the feature column (`X`) in `emails.csv`.

### Steps

1. Import the data generated from `data-process.ipynb`

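2. Randomly shuffle the rows and reset the row index to mix ham and spam emails

3. Pre-process the text:
  - remove punctuation
  - remove stop words, including a manual list of corpus-specific words (e.g. `enron`, `subject`)
  - remove digits and other non-alphabetic tokens
  
4. Save the result as a new CSV file.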

In [1]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

print("SUCCESS. All modules have been imported.")

SUCCESS. All modules have been imported.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Tong\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
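
The shuffle above produces a different row order on every run. If a repeatable order is wanted, `DataFrame.sample` accepts a `random_state` seed. A minimal sketch on a toy frame standing in for `emails.csv`; the name `toy`, its rows, and the seed value 42 are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for emails.csv (hypothetical rows)
toy = pd.DataFrame({
    "X": ["free application", "record :", "discount drugs", "tel + 971 4"],
    "y": [1, 0, 1, 1],
})

# Fixing random_state makes the shuffled order repeatable across runs
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Same seed, same order
print(shuffled_a.equals(shuffled_b))
```

Whether a fixed seed is desirable depends on whether downstream notebooks need to reproduce the same row order.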

In [4]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

Unnamed: 0,X,y
0,vi @ gra,1
1,to be removed from our database please send a ...,1
2,for processing and remittance of your prize fu...,1
3,"assets or product inventory , whether in trans...",1
4,in this action may be identified through the u...,1
5,in the location box type : ftp : / / energy @ ...,0
6,the,1
7,record :,0
8,we are pleased to announce another addition ( ...,0
9,can you please reconsider your decision to dis...,0


Unnamed: 0,X,y
829200,discount drugs . . . save over 70 %,1
829201,no birds were flying overhead -,1
829202,tel . : + + 49 / 621 / 181 - 1487,0
829203,agave energy co . - met with agave this week t...,0
829204,"these are real , genuine degrees that include ...",1
829205,tel + 971 4,1
829206,free application,1
829207,west of hatway - rating decrease from 2800 to ...,0
829208,filenamee ffbfd fbntohl mzo 8 darueber flip - ...,1
829209,chevron and texaco agree to $ 100 billion merger,0


In [5]:
# Create a stop list
self_define = ['enron','subject','ect','hou','e','http'] # Manual list of words to be removed
stoplist = stopwords.words('english') + list(string.punctuation) + self_define
stoplist = set(stoplist)

In [6]:
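def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from an email.

    Param: text: email content as a string
    Return: the remaining tokens joined by single spaces
    '''
    # str.isalpha() already rejects digits and punctuation, so no separate isdigit() check is needed
    tokens = [word for word in word_tokenize(text)
              if word.lower() not in stoplist and word.isalpha()]
    return " ".join(tokens)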
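
To see what the filter in `trim_word` keeps, here is a minimal stand-alone sketch. It is not the notebook's function: it swaps `word_tokenize` for `str.split` (the emails shown above already appear whitespace-tokenized) and uses a tiny hand-written stop list instead of NLTK's, so it runs without downloading any NLTK data. The names `demo_stoplist` and `trim_word_demo` are hypothetical.

```python
import string

# Hypothetical miniature stop list; the notebook uses NLTK's English list
# plus punctuation and a manual list of corpus-specific words
demo_stoplist = (
    {"over", "the", "to", "be", "from", "our", "a"}
    | set(string.punctuation)
    | {"enron", "subject"}
)

def trim_word_demo(text):
    """Keep only alphabetic tokens that are not in the stop list."""
    # str.split() stands in for nltk.word_tokenize in this sketch
    tokens = [w for w in text.split()
              if w.lower() not in demo_stoplist and w.isalpha()]
    return " ".join(tokens)

print(trim_word_demo("discount drugs . . . save over 70 %"))
print(repr(trim_word_demo("the")))
```

The second call returns an empty string: an email line made up entirely of stop words, digits, or punctuation has nothing left after filtering.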

In [7]:
df['X'] = df['X'].astype(str)

In [8]:
# Remove punctuation, stop words, and self-defined words in X
df['X'] = df['X'].apply(trim_word)

In [9]:
# Remove rows containing missing values (NaN); rows reduced to an empty string are not NaN and are kept
df = df.dropna()
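
Because `X` was cast to `str` earlier, rows that `trim_word` emptied hold `''` rather than `NaN`, so `dropna()` leaves them in place. If those rows should be dropped as well, one option is to filter on the stripped string. A minimal sketch on a toy frame; the names `toy`, `has_text`, and `toy_nonempty` and the sample rows are illustrative assumptions, not the real data:

```python
import pandas as pd

# Toy frame with one empty and one whitespace-only entry (hypothetical rows)
toy = pd.DataFrame({
    "X": ["vi gra", "", "record", "   "],
    "y": [1, 1, 0, 0],
})

# dropna() keeps all rows because empty strings are not missing values
print(len(toy.dropna()))

# Keep only rows whose text is non-empty after stripping whitespace
has_text = toy["X"].str.strip().ne("")
toy_nonempty = toy[has_text].reset_index(drop=True)
print(toy_nonempty)
```

Whether to drop such rows is a modelling decision; the original notebook keeps them.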

In [10]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

Unnamed: 0,X,y
0,vi gra,1
1,removed database please send fax,1
2,processing remittance prize funds designated bank,1
3,assets product inventory whether transit wareh...,1
4,action may identified use words,1
5,location box type ftp energy ftp fea com,0
6,,1
7,record,0
8,pleased announce another addition transfer por...,0
9,please reconsider decision discontinue technic...,0


Unnamed: 0,X,y
829200,discount drugs save,1
829201,birds flying overhead,1
829202,tel,0
829203,agave energy co met agave week discuss potenti...,0
829204,real genuine degrees include bachelors masters...,1
829205,tel,1
829206,free application,1
829207,west hatway rating decrease reasons remidies a...,0
829208,filenamee ffbfd fbntohl mzo darueber flip mout...,1
829209,chevron texaco agree billion merger,0


In [11]:
# Save the cleaned df
df.to_csv('../data/emails_cleaned.csv', index=False)
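
One thing to keep in mind when loading the saved file later: with its default settings, `pd.read_csv` reads empty fields back as `NaN`, so rows whose text was emptied during cleaning reappear as missing values. A minimal sketch of that round trip, using an in-memory buffer instead of `../data/emails_cleaned.csv`; the toy rows and the names `toy`, `buffer`, and `reloaded` are assumptions for illustration:

```python
import io
import pandas as pd

# Toy frame with one empty text entry (hypothetical rows)
toy = pd.DataFrame({"X": ["vi gra", "", "record"], "y": [1, 1, 0]})

# Write to an in-memory buffer in place of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# With default settings, the empty field is parsed as a missing value
reloaded = pd.read_csv(buffer)
print(reloaded["X"].isna().sum())

# A dropna() at load time then removes those rows
print(len(reloaded.dropna()))
```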