# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Exploratory Data Analysis

Created On: 11/25/2020

Modified On: 11/30/2020

### Description

This script performs exploratory data analysis (EDA) on the `emails_cleaned.csv` dataset.

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
df = pd.read_csv('../data/emails_cleaned.csv')

### Missing Values

In [5]:
print('\nTotal missing values (NaN) in each column: \n\n', df.isnull().sum())


Total missing values (NaN) in each column: 

 X    41998
y        0
dtype: int64


In [6]:
# Drop rows containing missing values
df = df.dropna(axis=0, how='any')
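
As a minimal illustration of the two cells above, the sketch below runs the same calls on a tiny made-up frame. The `toy` data and variable names are hypothetical and are not taken from `emails_cleaned.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for emails_cleaned.csv
toy = pd.DataFrame({
    'X': ['meeting at noon', None, 'win a prize now', None],
    'y': [0, 0, 1, 1],
})

# Count missing values (NaN/None) in each column
missing = toy.isnull().sum()

# Drop every row that contains at least one missing value
toy_clean = toy.dropna(axis=0, how='any')
```

Note that `dropna` keeps the original index labels rather than renumbering the rows, so gaps in the row labels of later output are expected.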

### Dimensions

After dropping the 41,998 rows with missing text, the cleaned email dataset has 2 columns and 787,212 records.

In [7]:
df.shape

(787212, 2)

### Head and Tail

Below are the first and last 5 rows of the dataset.

In [8]:
display(df.head())
display(df.tail())

|  | X | y |
|---|---|---|
| 0 | email remains vngo rice edu leave shirley phon... | 0 |
| 1 | cc daren j farmer enron com | 0 |
| 3 | giant drew billion credit line | 0 |
| 4 | orders report orders times | 1 |
| 5 | immediately known | 1 |


|  | X | y |
|---|---|---|
| 829205 | one core enron values believe great way improve | 0 |
| 829206 | Subject qf | 0 |
| 829207 | please see attached spreadsheet | 0 |
| 829208 | park barn see us | 0 |
| 829209 | company ceo interviewsemerging technologies ne... | 1 |


### Spam Ratio

In our dataset, 379,336 records (48.19%) are marked as **spam** and 407,876 records (51.81%) are not spam.

In [38]:
display(df['y'].value_counts())
display((df['y'].value_counts() / len(df['y'])) * 100)

0    407876
1    379336
Name: y, dtype: int64

0    51.812726
1    48.187274
Name: y, dtype: float64
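
The same class percentages can also be obtained in one step with the `normalize=True` option of `value_counts`. The sketch below shows this on a small hypothetical label series (the `labels` values are made up for illustration).

```python
import pandas as pd

# Hypothetical labels: 1 = spam, 0 = not spam
labels = pd.Series([0, 1, 0, 0, 1], name='y')

# Absolute count of each class
counts = labels.value_counts()

# Share of each class as a percentage of all records
percentages = labels.value_counts(normalize=True) * 100
```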

### Vectorization

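Next, we convert each email into a vector of TF-IDF weights, one weight per word in the vocabulary. Unlike a raw word count, a TF-IDF weight is scaled down for words that appear in many emails.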

In [9]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['X'])

In [10]:
display(X.shape)

(787212, 143176)

Across the 787,212 email records, the vocabulary contains 143,176 unique words (features).
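
To make concrete what `TfidfVectorizer` computes for each word in each email, here is a rough pure-Python sketch of the weighting. It assumes scikit-learn's documented defaults (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization of each row). The helper `tfidf_sketch` and the three short documents are hypothetical, and this is an illustration rather than the library's actual implementation.

```python
import math
import re
from collections import Counter


def tfidf_sketch(docs):
    """Return (vocabulary, dense TF-IDF rows) for a list of strings."""
    # Lowercase and keep tokens of two or more word characters (assumed default)
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)

    # Document frequency: number of documents containing each token
    doc_freq = Counter(t for toks in tokenized for t in set(toks))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so every non-empty document has unit length
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


# Hypothetical short documents, one per "email"
docs = ['please see attached spreadsheet', 'orders report orders times', 'see us']
vocab, matrix = tfidf_sketch(docs)
```

The result has one row per document and one column per vocabulary word, which mirrors the `(787212, 143176)` shape above. In the third toy document, `us` receives a higher weight than `see` because `see` also occurs in another document.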

In [27]:
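# Print the first 10 features (the vocabulary is sorted alphabetically).
# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
display(vectorizer.get_feature_names()[:10])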

['aa',
 'aaa',
 'aaaa',
 'aaaacy',
 'aaaahhhhhh',
 'aaadrizzle',
 'aaaenerfax',
 'aaagrp',
 'aaal',
 'aaaplusdirect']