# Predicting Enron Spam Emails using Supervised Learning

## DS-GA 1001: Introduction to Data Science Final Project

### Scripts

## Supervised Learning Models

Created On: 11/30/2020

Modified On: 12/01/2020

### Description

This script fits supervised learning models to the `emails_cleaned.csv` dataset.

### Methods



In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the ROC plot below
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve

print("SUCCESS. All modules have been imported.")

SUCCESS. All modules have been imported.


In [2]:
# Load data
df = pd.read_csv('../data/emails_cleaned.csv')

In [6]:
df = df.dropna()

In [7]:
display(df.shape)
display(df.head())

(787212, 2)

,X,y
0,email remains vngo rice edu leave shirley phon...,0
1,cc daren j farmer enron com,0
3,giant drew billion credit line,0
4,orders report orders times,1
5,immediately known,1


In [9]:
# Build a TF-IDF document-term matrix from the cleaned email text
vectorizer = TfidfVectorizer()
vectorized_emails = vectorizer.fit_transform(df.X)
vectorized_emails

<787212x143176 sparse matrix of type '<class 'numpy.float64'>'
	with 4418276 stored elements in Compressed Sparse Row format>
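
To make the weighting concrete, here is a minimal pure-Python sketch of the TF-IDF scheme that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency, and L2-normalized rows. The toy emails, the whitespace tokenizer, and the names `toy_emails`, `doc_freq`, and `tfidf_row` are assumptions made for illustration. scikit-learn's own tokenizer differs (for example, it drops single-character tokens), so treat this as a sketch rather than a reimplementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus, loosely modelled on the rows shown above
toy_emails = [
    "orders report orders times",
    "giant drew billion credit line",
    "credit report orders",
]

# Assumption: simple whitespace tokenization
docs = [email.split() for email in toy_emails]
n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: the number of emails containing each term
doc_freq = Counter(tok for doc in docs for tok in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector for one tokenized email."""
    counts = Counter(doc)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

toy_matrix = [tfidf_row(doc) for doc in docs]
print(len(toy_matrix), len(vocab))  # rows = emails, columns = distinct terms
```

Terms that appear in fewer emails receive a larger IDF, so a rare word contributes more to an email's vector than a word shared across many emails.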

In [10]:
# Hold out 20% of the rows as the test set (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(vectorized_emails, df.y, test_size=0.2, random_state=88)
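
As a rough sanity check on the split sizes, the sketch below works through the arithmetic for the 787,212 rows reported above. It assumes that `train_test_split` rounds the test set size up when given a fractional `test_size` and assigns the remaining rows to training. The variable names are illustrative only.

```python
import math

n_samples = 787212  # rows remaining after dropna(), from df.shape above
test_size = 0.2

# Assumption: the test size is the ceiling of test_size * n_samples
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Comparing these numbers with `X_train.shape[0]` and `X_test.shape[0]` is a quick way to confirm the split behaved as expected.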

### Baseline Model

We used logistic regression as our baseline model, regularized with an elastic net penalty. Elastic net combines the L1 and L2 penalties on the coefficients; we weight the two equally (`l1_ratio = 0.5`).
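
For reference, the elastic net objective for logistic regression can be written as follows. This follows the form given in the scikit-learn user guide as we understand it, with labels recoded as $y_i \in \{-1, 1\}$, coefficient vector $w$, intercept $c$, inverse regularization strength $C$, and mixing parameter $\rho$ (`l1_ratio`). Treat the exact scaling as an assumption, since it can differ between library versions.

$$
\min_{w,\,c} \; \frac{1 - \rho}{2}\, w^{\top} w \;+\; \rho\, \lVert w \rVert_1 \;+\; C \sum_{i=1}^{n} \log\!\left( \exp\!\left( -y_i \left( x_i^{\top} w + c \right) \right) + 1 \right)
$$

Setting $\rho = 1$ recovers a pure L1 (lasso) penalty and $\rho = 0$ a pure L2 (ridge) penalty. The baseline uses $\rho = 0.5$ and $C = 1$.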

In [None]:
# Fit a logistic regression with an elastic net penalty (supported by the saga solver)
logreg = LogisticRegression(C=1, solver='saga', penalty='elasticnet', l1_ratio=0.5)
logreg.fit(X_train, y_train)
# Hard class predictions and predicted probabilities for class 1
y_pred = logreg.predict(X_test)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

In [None]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

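# ROC curve from the predicted class-1 probabilities
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# Dashed diagonal: performance of a random classifier
plt.plot([0, 1], [0, 1], 'k--')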
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
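
The ROC curve can be summarized by the area under it (AUC). The sketch below shows, in plain Python, how the curve's points arise from sweeping a threshold over the predicted probabilities and how the area follows from the trapezoid rule. The toy labels and scores and the helper names `roc_points` and `auc_trapezoid` are assumptions for illustration. The quadratic loop is only suitable for tiny inputs; on the real test set, `sklearn.metrics.roc_auc_score(y_test, y_pred_prob)` would be the practical choice.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists from sweeping the threshold over each distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    # Lower the threshold step by step; everything scored >= t is predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear ROC curve via the trapezoid rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )

# Hypothetical toy data: two negatives, two positives
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_fpr, toy_tpr = roc_points(toy_y, toy_scores)
toy_auc = auc_trapezoid(toy_fpr, toy_tpr)
print(toy_auc)
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot above, and values closer to 1 indicate better separation of the two classes.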