# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 11/30/2020

### Description

This script cleans the text feature column (`X`) of the `emails.csv` file.

### Steps

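1. Import `emails.csv`, the data generated from `data-process.ipynb`

2. Randomly shuffle the rows and reset the row index to mix ham emails with spam emails

3. Pre-process the text:
  - remove punctuation
  - remove stop words
  - remove numbers and any other non-alphabetic tokens

4. Save the result as a new CSV file (`emails_cleaned.csv`).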

In [1]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
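# stopwords and word_tokenize rely on NLTK's stop word corpus and Punkt
# tokenizer data, which must be installed once via nltk.download()
import nltk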
from nltk.corpus import stopwords
from nltk import word_tokenize

In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
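
The shuffle above draws a new random order on every run, so the rows printed below will differ between runs. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for this project; the toy frame and the `SEED` value are illustrative and not part of the original notebook:

```python
import pandas as pd

# Toy frame standing in for emails.csv (hypothetical rows)
toy = pd.DataFrame({"X": ["a", "b", "c", "d", "e"], "y": [0, 0, 1, 1, 0]})

SEED = 42  # assumption: any fixed integer gives a repeatable order

# Same call as in the notebook, plus random_state for reproducibility
shuffled_1 = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_2 = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)

same_order = shuffled_1.equals(shuffled_2)          # identical order on both runs
same_rows = sorted(shuffled_1["X"]) == sorted(toy["X"])  # no rows lost or duplicated
index_reset = list(shuffled_1.index) == list(range(len(toy)))  # index is 0..n-1 again
```

With `random_state` set, later notebooks that read the cleaned file see the same row order each time the pipeline is rerun.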

In [4]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829210, 2)

,X,y
0,my email remains : vngo @ rice . edu and i wil...,0
1,cc : daren . j . farmer @ enron . com,0
2,914 253 4187,0
3,giant drew down about $ 3 billion from a credi...,0
4,"orders for report # 5 . that ' s 100 , 000 ord...",1
5,immediately known .,1
6,matters concerning this issue . once,1
7,state :,1
8,Subject: alternative names,0
9,"from cambridge energy research associates , in...",0


,X,y
829200,"improve sleep , vision and memory",1
829201,- - - - - original message - - - - -,0
829202,forgot your username and password ? go to :,0
829203,staff writer,0
829204,pittsburgh media list ( 350 + contacts ) $ 79,1
829205,is one of the core enron values and i believe ...,0
829206,Subject: qf ' s,0
829207,please see the attached spreadsheet .,0
829208,and park by the barn . you ' ll see us .,0
829209,company ceo interviewsemerging technologies ne...,1


In [5]:
# Create a stop list
stoplist = stopwords.words('english') + list(string.punctuation)
stoplist = set(stoplist)

In [6]:
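def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from an email.
    Param: text: email content as a string
    Return: the remaining tokens joined by single spaces
    '''
    # isalpha() already rejects digits, so no separate isdigit() check is needed
    tokens = [word for word in word_tokenize(text)
              if word.lower() not in stoplist and word.isalpha()]
    return " ".join(tokens)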
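
To see what this filter does without downloading any NLTK data, here is a self-contained sketch of the same idea. It swaps in a regular-expression tokenizer for `word_tokenize` and a tiny hand-picked stop list for `stopwords.words('english')`; `TOY_STOPLIST`, `toy_tokenize`, and `toy_trim_word` are hypothetical names used only for illustration, so results on real emails may differ from the notebook's:

```python
import re
import string

# Assumption: a small stand-in stop list, not NLTK's English stop words
TOY_STOPLIST = {"my", "and", "i", "the", "a", "to", "this", "that"} | set(string.punctuation)


def toy_tokenize(text):
    """Split text into runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_trim_word(text):
    """Keep only alphabetic tokens that are not in the stop list."""
    kept = [tok for tok in toy_tokenize(text)
            if tok.lower() not in TOY_STOPLIST and tok.isalpha()]
    return " ".join(kept)


header_line = toy_trim_word("cc : daren . j . farmer @ enron . com")
phone_line = toy_trim_word("914 253 4187")
```

Here `header_line` keeps only the alphabetic tokens, while `phone_line` becomes an empty string because every token is numeric. Rows like this end up with an empty `X` after cleaning.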

In [7]:
df['X'] = df['X'].astype(str)
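
One caveat, in case `X` contains missing values: in pandas versions from around the time this notebook was written, `astype(str)` turns a missing value into the literal string `nan`, which would then pass the alphabetic filter as an ordinary word. Newer releases may behave differently. A hedged sketch of an alternative that fills missing values with an empty string first; the toy series is illustrative only:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data)
toy = pd.Series(["hello world", np.nan], name="X")

# Replace missing values with an empty string before casting to str
filled = toy.fillna("").astype(str)

filled_values = filled.tolist()
```

Whether this matters depends on whether `emails.csv` actually has missing entries in `X`, which the notebook does not show.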

In [8]:
# Remove punctuation, stop words, and non-alphabetic tokens in X
df['X'] = df['X'].apply(trim_word)

In [9]:
# Save as cleaned df
df.to_csv('../data/emails_cleaned.csv', index=False)

In [10]:
display(df.shape)
display(df.head())
display(df.tail())

(829210, 2)

,X,y
0,email remains vngo rice edu leave shirley phon...,0
1,cc daren j farmer enron com,0
2,,0
3,giant drew billion credit line,0
4,orders report orders times,1


,X,y
829205,one core enron values believe great way improve,0
829206,Subject qf,0
829207,please see attached spreadsheet,0
829208,park barn see us,0
829209,company ceo interviewsemerging technologies ne...,1
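
Row 2 in the head output above is empty after cleaning. With default settings, `pd.read_csv` parses an empty field as a missing value, so such rows come back as `NaN` when `emails_cleaned.csv` is loaded later. A minimal sketch of the round trip and of one possible remedy (dropping empty rows before saving), using a toy frame and an in-memory buffer rather than the real file:

```python
import io

import pandas as pd

# Toy frame mimicking the cleaned output, including one emptied row
cleaned = pd.DataFrame({
    "X": ["cc daren j farmer enron com", "", "giant drew billion credit line"],
    "y": [0, 0, 0],
})

# Round trip through CSV in memory instead of writing to disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string is read back as a missing value
n_missing = int(reloaded["X"].isna().sum())

# One option (an assumption, not the notebook's method): drop empty rows first
nonempty = cleaned[cleaned["X"].str.strip() != ""].reset_index(drop=True)
```

Dropping the rows changes the dataset size, so whether to drop them or keep them as empty documents is a modeling choice. Passing `keep_default_na=False` to `pd.read_csv` is another way to read the empty fields back as empty strings.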
