# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Data Process

Created On: 11/15/2020

Modified On: 11/30/2020

### Description

This script cleans the feature column (`X`) of the `emails.csv` file.

### Steps

1. Import the data generated from `data-process.ipynb`

2. Randomly shuffle the rows and reset the row index to mix ham emails with spam emails

3. Text pre-processing:
  - remove punctuation
  - remove stop words
  - remove purely numeric tokens
  
4. Vectorization:
  - TF-IDF via `TfidfVectorizer` (imported in this notebook but not yet applied)
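
The vectorization step is not carried out in this notebook, although `TfidfVectorizer` is imported in the first code cell. As a minimal sketch of what that step might look like, the example below fits a TF-IDF model on a tiny hypothetical corpus (`docs` is made up for illustration and is not taken from `emails.csv`). It assumes default vectorizer settings, which may differ from what the project eventually uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the cleaned `X` column
docs = [
    "rates starting low apply today",
    "meeting agenda attached review",
    "low rates apply now",
]

# Default settings: lowercasing on, tokens of two or more word characters
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

print(tfidf.shape)                     # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # learned terms in alphabetical order
```

Because the result is a sparse matrix, it can usually be passed straight to scikit-learn classifiers without converting it to a dense array.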

In [18]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize

In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
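
The shuffle above draws a different row order on every run. If a repeatable order is wanted, one option is to pass `random_state` to `sample`. The sketch below uses a hypothetical four-row frame (`toy`) and an arbitrary seed of 42; neither comes from the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for emails.csv
toy = pd.DataFrame({"X": ["a", "b", "c", "d"], "y": [0, 0, 1, 1]})

# random_state=42 is an arbitrary seed chosen for illustration
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# The same seed gives the same row order on both calls
print(shuffled.equals(again))
```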

In [4]:
display(df.shape)
display(df.head(10))

(829210, 2)

| | X | y |
|---|---|---|
| 0 | products or undertake transactions that need i... | 1 |
| 1 | the examples above show the awesome , earning ... | 1 |
| 2 | presenters should attend if at all possible an... | 0 |
| 3 | http : / / m . r . splendidspring . com / oe / | 1 |
| 4 | / b / p | 1 |
| 5 | request create date : 11 / 2 / 00 1 : 12 : 58 pm | 0 |
| 6 | vita : http : / / garven . lsu . edu / dossier... | 0 |
| 7 | oi \| and gas ' shares ? weil we do know this -... | 1 |
| 8 | further drop ! rates starting at 3 . 25 % | 1 |
| 9 | bruce francis , kathleen hays | 0 |
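
The preview shows only ten rows, so it says little about how the two classes are balanced overall. A quick check could look like the sketch below, which runs on a hypothetical four-row frame (`toy`). It assumes `y = 1` marks spam and `y = 0` marks ham, which the sample rows suggest but the notebook does not state.

```python
import pandas as pd

# Hypothetical toy frame; assumes y = 1 is spam and y = 0 is ham
toy = pd.DataFrame({
    "X": ["win cash now", "meeting at noon", "cheap rates", "see attached"],
    "y": [1, 0, 1, 0],
})

counts = toy["y"].value_counts()                # rows per label
shares = toy["y"].value_counts(normalize=True)  # fraction of rows per label

print(counts)
print(shares)
```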


In [17]:
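# Create a stop list of English stop words plus punctuation marks
# (requires the NLTK stopwords corpus, e.g. via nltk.download('stopwords'))
stoplist = set(stopwords.words('english') + list(string.punctuation))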

In [20]:
def trim_word(text):
    '''Tokenize an email and drop stop words, punctuation, and numeric tokens
    Param: text: email content as a string
    Returns: list of the remaining tokens (original casing is preserved)
    '''
    return [
        word for word in word_tokenize(text)
        if word.lower() not in stoplist and not word.isdigit()
    ]
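
To see the filtering logic in isolation, here is a self-contained sketch that needs no NLTK data. It swaps `word_tokenize` for a simple regular-expression tokenizer and uses a tiny hypothetical stop list (`toy_stoplist`), so its results can differ from `trim_word` on real emails, for example around contractions.

```python
import re
import string

# Tiny hypothetical stop list; the notebook uses NLTK's English list instead
toy_stoplist = {"the", "a", "or", "that", "at", "if"} | set(string.punctuation)

def trim_word_sketch(text):
    """Apply the same filter as trim_word, using a regex tokenizer."""
    # \w+ matches runs of word characters; [^\w\s] matches one punctuation mark
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [
        token for token in tokens
        if token.lower() not in toy_stoplist and not token.isdigit()
    ]

result = trim_word_sketch("further drop ! rates starting at 3 . 25 %")
print(result)  # ['further', 'drop', 'rates', 'starting']
```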

In [24]:
df['X'] = df['X'].astype(str)

In [25]:
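# Remove punctuation, stop words, and numeric tokens in X,
# assigning the result back so the cleaned column is what gets saved
df['X'] = df['X'].apply(trim_word)
df['X']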

0         [products, undertake, transactions, need, imme...
1         [examples, show, awesome, earning, potential, ...
2             [presenters, attend, possible, event, unable]
3                        [http, r, splendidspring, com, oe]
4                                                    [b, p]
                                ...                        
829205                     [non, hormonal, herbal, therapy]
829206    [enron, makes, markets, variety, commodities, ...
829207                                        [lauer, kara]
829208                                 [Subject, news, use]
829209                       [site, goes, live, next, week]
Name: X, Length: 829210, dtype: object

In [26]:
# Save the cleaned DataFrame
df.to_csv('../data/emails_cleaned.csv', index=False)
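
If the `X` column holds token lists when it is saved, `to_csv` writes each list as its string representation, and reading the file back yields strings rather than lists. One possible way around this is to join the tokens into a single space-separated string before saving. The sketch below shows the idea on a hypothetical two-row frame (`toy`), using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

# Hypothetical toy frame whose X column holds token lists
toy = pd.DataFrame({
    "X": [["site", "goes", "live"], ["lauer", "kara"]],
    "y": [0, 0],
})

# Join each token list into one space-separated string
toy["X"] = toy["X"].apply(" ".join)

buffer = io.StringIO()  # in-memory stand-in for a CSV file
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored["X"].tolist())  # ['site goes live', 'lauer kara']
```

Space-separated strings are also the input format that `TfidfVectorizer` expects by default, so this choice keeps the saved file usable for a later vectorization step.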