# Enron Email Fraud Detection Machine Learning 

## Project Summary

Enron was an American energy, commodities, and services company that went bankrupt in 2001 due to widespread financial fraud and corruption. As a result of Federal investigations, tens of thousands of emails and detailed financial data were released to the public. This project investigated the Enron Email dataset and developed a 'person of interest' (POI) prediction model. Financial and email features were selected via SelectKBest, decision tree feature importances, and lasso regularization. Nine machine learning algorithms were investigated: Logistic Regression, Naive-Bayes, K-means Clustering, K-nearest Neighbors, Random Forest, Extra Tree, Gradient Boosting, Decision Tree, and Adaboost. Adaboost yielded the best results. The final tuned Adaboost model exhibited 85% accuracy, 43% precision, and 34% recall. Validation was performed by 1000-fold stratified shuffle-split cross-validation.

## Data Exploration 

Dataset Source: https://github.com/udacity/ud120-projects.git

This dataset was initially stored in a dictionary with one key-value pair per observation (person). The dictionary key is the person's name, and the value is another dictionary, which contains the names of all the features and their values for that person. 

The dataset contained 146 observations (people) consisting of 14 financial features ('salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'), 6 email features ('to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'), and 1 label/outcome ('poi'). Of the 146 observations, 18 were labeled as a 'person of interest' (12%) and 128 were labeled as 'not a person of interest' (88%).

In [None]:
> data_dict['LAY KENNETH L']
{'bonus': 7000000,
 'deferral_payments': 202911,
 'deferred_income': -300000,
 'director_fees': 'NaN',
 'email_address': 'kenneth.lay@enron.com',
 'exercised_stock_options': 34348384,
 'expenses': 99832,
 'from_messages': 36,
 'from_poi_to_this_person': 123,
 'from_this_person_to_poi': 16,
 'loan_advances': 81525000,
 'long_term_incentive': 3600000,
 'other': 10359729,
 'poi': True,
 'restricted_stock': 14761694,
 'restricted_stock_deferred': 'NaN',
 'salary': 1072321,
 'shared_receipt_with_poi': 2411,
 'to_messages': 4273,
 'total_payments': 103559793,
 'total_stock_value': 49110078}

## Missing Values 

The number and fraction of missing values for each feature were investigated. The features with more than 50% missing values were 'long_term_incentive', 'deferred_income', 'deferral_payments', 'restricted_stock_deferred', 'director_fees', and 'loan_advances'. All missing values were converted to 0 in order to perform machine learning operations. 

In [None]:
'poi' 0, 0.0
'total_stock_value' 20, 0.14
'total_payments' 21, 0.14
'email_address' 35, 0.24
'restricted_stock' 36, 0.25
'exercised_stock_options' 44, 0.3
'expenses' 51, 0.35
'salary' 51, 0.35
'other' 53, 0.36
'from_messages' 60, 0.41
'from_poi_to_this_person' 60, 0.41
'from_this_person_to_poi' 60, 0.41
'shared_receipt_with_poi' 60, 0.41
'to_messages' 60, 0.41
'bonus' 64, 0.44
'long_term_incentive' 80, 0.55
'deferred_income' 97, 0.66
'deferral_payments' 107, 0.73
'restricted_stock_deferred' 128, 0.88
'director_fees' 129, 0.88
'loan_advances' 142, 0.97
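
The counts above can be reproduced with a short loop over the nested dictionary. The following is a minimal Python 3 sketch on a toy stand-in for `data_dict`; the names `toy_data` and `missing_value_summary` and all values are hypothetical and only illustrate the 'NaN'-string convention used by the dataset.

```python
# Toy stand-in for data_dict; the real dataset has 146 people and 21 fields.
toy_data = {
    "PERSON A": {"poi": True, "salary": 100000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "PERSON C": {"poi": False, "salary": 250000, "bonus": 50000},
}


def missing_value_summary(data):
    """Return {feature: (missing_count, missing_fraction)} for 'NaN' strings."""
    n_people = len(data)
    counts = {}
    for record in data.values():
        for feature, value in record.items():
            counts.setdefault(feature, 0)
            if value == "NaN":
                counts[feature] += 1
    return {f: (c, round(c / n_people, 2)) for f, c in counts.items()}


summary = missing_value_summary(toy_data)
# Print features from least to most missing, as in the listing above.
for feature, (count, fraction) in sorted(summary.items(), key=lambda kv: kv[1]):
    print(feature, count, fraction)
```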

## Outliers

Three observations were removed from the dataset: 'THE TRAVEL AGENCY IN THE PARK', 'TOTAL', and 'LOCKHART EUGENE E'. The 'THE TRAVEL AGENCY IN THE PARK' and 'TOTAL' observations do not correspond to individual people. The 'LOCKHART EUGENE E' observation was removed because it contained only missing values. 
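
One way to spot such entries programmatically is sketched below in Python 3. It assumes a toy dictionary (`toy_data`, hypothetical values) and a hypothetical helper `is_empty_record`; the two non-person keys are the ones named above, and the "all features are 'NaN'" rule is an illustration rather than the check used in this project.

```python
# Toy stand-in for data_dict (hypothetical values).
toy_data = {
    "PERSON A": {"poi": False, "salary": 100000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "TOTAL": {"poi": False, "salary": 350000, "bonus": 50000},
}

# Keys that are known not to be individual people.
NON_PERSON_KEYS = {"TOTAL", "THE TRAVEL AGENCY IN THE PARK"}


def is_empty_record(record, label="poi"):
    """True if every field except the label is the string 'NaN'."""
    return all(value == "NaN" for key, value in record.items() if key != label)


# Keep only real people that have at least one non-missing feature.
cleaned = {
    name: record
    for name, record in toy_data.items()
    if name not in NON_PERSON_KEYS and not is_empty_record(record)
}
print(sorted(cleaned))
```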

## New Feature Engineering  

Two new features were created: 'from_poi_percent' and 'to_poi_percent'. The feature 'from_poi_percent' was calculated by dividing 'from_poi_to_this_person' by 'to_messages' (the fraction of a person's received messages that came from a POI). Similarly, 'to_poi_percent' was calculated by dividing 'from_this_person_to_poi' by 'from_messages' (the fraction of a person's sent messages that went to a POI). Raw message counts vary significantly between people, so I hypothesized that these ratios indicate the degree of contact with persons of interest better than the raw counts do. 
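
As a worked example, the sketch below (Python 3) computes both ratios from the email counts shown earlier for 'LAY KENNETH L'. The helper name `poi_ratio` is hypothetical, and the guard against a zero denominator is an added assumption that is not part of the project code.

```python
def poi_ratio(poi_messages, all_messages):
    """Fraction of messages involving a POI, or 'NaN' if it cannot be computed."""
    if poi_messages == "NaN" or all_messages == "NaN" or all_messages == 0:
        return "NaN"
    return float(poi_messages) / float(all_messages)


# Email counts taken from the 'LAY KENNETH L' record shown above.
record = {
    "from_poi_to_this_person": 123,
    "to_messages": 4273,
    "from_this_person_to_poi": 16,
    "from_messages": 36,
}

record["from_poi_percent"] = poi_ratio(
    record["from_poi_to_this_person"], record["to_messages"]
)
record["to_poi_percent"] = poi_ratio(
    record["from_this_person_to_poi"], record["from_messages"]
)

print(round(record["from_poi_percent"], 3), round(record["to_poi_percent"], 3))
```

In this record, roughly 3% of received messages came from a POI, while about 44% of sent messages went to a POI, a contrast that the raw counts (123 versus 16) do not make obvious.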

## Feature Selection 

Features were selected by 3 processes: SelectKBest (chi-squared test), decision tree, and lasso regularization algorithms. The SelectKBest features were those with p-values less than 0.15. The decision tree features were those with feature importance scores greater than 0. The lasso regularization features were those with RandomizedLasso scores greater than 0. The best features across all of these processes (a total of 13 unique features) were collected and used for machine learning. 

All features were min-max scaled prior to feature selection and machine learning. Scaling was conducted so that all features are weighted equally in algorithms that are sensitive to feature magnitude, such as K-means. 
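
For intuition, min-max scaling maps each feature column onto the range [0, 1]. The sketch below is a hand-rolled Python 3 illustration with hypothetical salary values; the project itself relies on `preprocessing.MinMaxScaler`, and the function name `min_max_scale` is an assumption made for this example.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical salaries: after scaling they are directly comparable with
# features measured on very different scales (e.g. message counts).
salaries = [0, 250000, 1000000]
print(min_max_scale(salaries))
```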

In [None]:
> kbest_features:
'bonus', 0.02
'exercised_stock_options', 0.01
'loan_advances', 0.01
'long_term_incentive', 0.11
'salary', 0.08
'shared_receipt_with_poi', 0.12
'total_payments', 0.1
'total_stock_value', 0.02
'to_poi_percent', 0.03

In [None]:
> decision_best: 
'exercised_stock_options', 0.2
'expenses', 0.12
'long_term_incentive', 0.04
'other', 0.13
'restricted_stock', 0.03
'shared_receipt_with_poi', 0.18
'total_payments', 0.07
'total_stock_value', 0.1
'to_poi_percent', 0.14

In [None]:
> lasso_best:
'bonus', 0.28
'deferred_income', 0.14
'exercised_stock_options', 0.26
'expenses', 0.01
'loan_advances', 0.01
'long_term_incentive', 0.01
'restricted_stock', 0.01
'salary', 0.16
'total_payments', 0.01
'total_stock_value', 0.2
'to_poi_percent', 0.28
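
The final feature set is the union of the three lists above. The Python 3 sketch below copies the feature names from those listings (the variable names are hypothetical) and confirms that the union contains 13 unique features.

```python
# Feature names copied from the three listings above.
kbest = {
    "bonus", "exercised_stock_options", "loan_advances", "long_term_incentive",
    "salary", "shared_receipt_with_poi", "total_payments", "total_stock_value",
    "to_poi_percent",
}
decision_tree = {
    "exercised_stock_options", "expenses", "long_term_incentive", "other",
    "restricted_stock", "shared_receipt_with_poi", "total_payments",
    "total_stock_value", "to_poi_percent",
}
lasso = {
    "bonus", "deferred_income", "exercised_stock_options", "expenses",
    "loan_advances", "long_term_incentive", "restricted_stock", "salary",
    "total_payments", "total_stock_value", "to_poi_percent",
}

# A feature is kept if at least one of the three methods selected it.
selected = sorted(kbest | decision_tree | lasso)
print(len(selected))
print(selected)
```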

## Machine Learning 

The following algorithms were tested: Logistic Regression, Naive-Bayes, K-means Clustering, K-nearest Neighbors, Random Forest, Extra Tree, Gradient Boosting, Decision Tree, and Adaboost. These algorithms were compared via accuracy, precision, recall, and F1 score. The algorithm that was selected was Adaboost, which had the highest F1 score (0.364) while keeping both precision and recall above 0.3. 

In [None]:
Logistic Regression:  Accuracy: 0.80073,  Precision: 0.20791,  Recall: 0.17600,  F1: 0.19063
Naive-Bayes:          Accuracy: 0.81293,  Precision: 0.30303,  Recall: 0.31000,  F1: 0.30648
K-means Clustering:   Accuracy: 0.83433,  Precision: 0.22783,  Recall: 0.10150,  F1: 0.14044
K-nearest Neighbors:  Accuracy: 0.86147,  Precision: 0.38596,  Recall: 0.06600,  F1: 0.11272
Random Forest:        Accuracy: 0.86140,  Precision: 0.43577,  Recall: 0.13400,  F1: 0.20497
Extra Tree:           Accuracy: 0.85867,  Precision: 0.41254,  Recall: 0.14150,  F1: 0.21072
Gradient Boosting:    Accuracy: 0.85947,  Precision: 0.45408,  Recall: 0.26700,  F1: 0.33627
Decision Tree:        Accuracy: 0.82407,  Precision: 0.33823,  Recall: 0.33400,  F1: 0.33610
Adaboost:             Accuracy: 0.84687,  Precision: 0.40782,  Recall: 0.32850,  F1: 0.36389
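
The selection rule can be checked directly against the reported numbers. The Python 3 sketch below copies the metrics from the table above; the helper `passes_threshold` is hypothetical. It shows that three algorithms clear the 0.3 threshold on precision, recall, and F1, and that Adaboost has the highest F1 among them.

```python
# (accuracy, precision, recall, F1) copied from the results reported above.
results = {
    "Logistic Regression": (0.80073, 0.20791, 0.17600, 0.19063),
    "Naive-Bayes": (0.81293, 0.30303, 0.31000, 0.30648),
    "K-means Clustering": (0.83433, 0.22783, 0.10150, 0.14044),
    "K-nearest Neighbors": (0.86147, 0.38596, 0.06600, 0.11272),
    "Random Forest": (0.86140, 0.43577, 0.13400, 0.20497),
    "Extra Tree": (0.85867, 0.41254, 0.14150, 0.21072),
    "Gradient Boosting": (0.85947, 0.45408, 0.26700, 0.33627),
    "Decision Tree": (0.82407, 0.33823, 0.33400, 0.33610),
    "Adaboost": (0.84687, 0.40782, 0.32850, 0.36389),
}


def passes_threshold(metrics, threshold=0.3):
    """True if precision, recall, and F1 all exceed the threshold."""
    _, precision, recall, f1 = metrics
    return min(precision, recall, f1) > threshold


qualifying = sorted(name for name, m in results.items() if passes_threshold(m))
best_by_f1 = max(results, key=lambda name: results[name][3])
print(qualifying)
print(best_by_f1)
```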

## Tuned Algorithm 

Algorithm tuning is the process of selecting model hyper-parameters for a specific problem/dataset. Hyper-parameters are algorithm-specific settings such as alphas, gammas, number of estimators, learning rate, max_depth, and kernels. An algorithm's hyper-parameters are initially set to default values, which may or may not be appropriate for the current problem. Tuning is important in order to improve performance (e.g. accuracy, precision, and recall) by using hyper-parameters that suit the underlying problem/dataset. Tuning is a search process, carried out by either grid search (exhaustive trial and error over a predefined grid) or random search. If a model is not tuned well, the accuracy, precision, and/or recall will not be optimal. Moreover, there is typically a trade-off between recall and precision (improving one tends to reduce the other), and where a model lands on that trade-off depends on the tuning process. 

The AdaBoost algorithm was tuned to improve precision and recall. It was tuned by conducting a cross-validated grid search of the model hyper-parameters (n_estimators, learning_rate, and algorithm). In simple terms, an exhaustive trial-and-error search was conducted to find the hyper-parameters that yield the best results. In this case, the F1 score was maximized in order to stress both precision and recall. 

The best AdaBoost parameters were found to be n_estimators = 100, learning_rate = 0.9, and algorithm = SAMME.R. 

In [None]:
Adaboost (Tuned):    Accuracy: 0.85167,  Precision: 0.42866,  Recall: 0.33800,  F1: 0.37797
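
For readers on a current scikit-learn release, the grid search described above might look roughly like the following Python 3 sketch. Everything here is an assumption for illustration: the data is synthetic (generated with `make_classification`, not the Enron dataset), the grid is reduced, the `algorithm` option is left at its default because support for it differs between scikit-learn versions, and imports come from `sklearn.model_selection` rather than the older modules used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic, imbalanced stand-in data (about 12% positives), NOT the Enron data.
X, y = make_classification(
    n_samples=150, n_features=13, n_informative=5,
    weights=[0.88, 0.12], random_state=42,
)

# Stratified splits keep the class ratio similar in every train/validation pair.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Reduced grid; every combination is fitted and scored on each split.
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.7, 0.9, 1.0]}

search = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",  # maximize F1 to stress both precision and recall
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the data is synthetic, the selected parameters and score say nothing about the Enron results; the sketch only shows the mechanics of the search.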

The previously engineered feature 'to_poi_percent' tied with 'exercised_stock_options' as the second most important feature (importance 0.13), behind 'expenses' (0.22).

In [None]:
> ada_best      
'bonus', 0.09
'deferred_income', 0.05
'exercised_stock_options', 0.13
'expenses', 0.22
'long_term_incentive', 0.06
'other', 0.07
'restricted_stock', 0.03
'salary', 0.1
'shared_receipt_with_poi', 0.05
'total_payments', 0.04
'total_stock_value', 0.03
'to_poi_percent', 0.13

## Performance 

The main evaluation metrics for the final tuned Adaboost model were precision (43%) and recall (34%). 

Precision is the ratio of true positives over the sum of true positives and false positives (positive predictive value). In simple terms, this means that 43% of the predicted persons of interest were actually persons of interest. 

Recall is the ratio of true positives over the sum of true positives and false negatives (true positive rate). In simple terms, this means that 34% of persons of interest were predicted to be persons of interest.

Note: Accuracy was not a major consideration because of the small and unbalanced dataset (146 observations) with only 18 persons of interest. The accuracy of the baseline model (all considered not a person of interest) was 88%.
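
The definitions above can be written out in a few lines of Python 3. In the sketch below, the confusion-matrix counts passed to the hypothetical helper `precision_recall_f1` are made up for illustration; the second part uses only numbers reported in this document to check that the reported F1 follows from the reported precision and recall, and to recompute the baseline accuracy.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 3 true positives, 4 false positives, 6 false negatives.
p, r, f1 = precision_recall_f1(tp=3, fp=4, fn=6)
print(round(p, 3), round(r, 3), round(f1, 3))

# Consistency check using the tuned Adaboost metrics reported above.
reported_precision, reported_recall = 0.42866, 0.33800
reported_f1 = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
print(round(reported_f1, 5))

# Baseline: predicting "not a POI" for all 146 people gets 128 of them right.
baseline_accuracy = 128 / 146
print(round(baseline_accuracy, 2))
```

The baseline shows why accuracy is a weak metric here: a model that never flags anyone already scores about 88%, while its recall is 0.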

## Validation

Validation is a testing process used to assess model performance. Various validation methods exist, such as hold-out data, cross-validation, and bootstrapping. Validation is typically conducted by separating data into training and testing datasets (e.g. 80% training and 20% testing). In order to accurately determine model performance, validation tests are carried out on data that is independent of the data that was used during training. The classic mistake is over-fitting to the training dataset. It is important to use validation in order to detect over-fitting and to estimate model performance on non-training data. If validation is not implemented, over-fitting can go unnoticed: the model achieves excellent results on the training dataset but poor results (e.g. accuracy) on the testing dataset. Moreover, the testing dataset should not be utilized until the final model has been selected/tuned. The testing dataset should not be manipulated or run multiple times.

This analysis was validated by the following: 1.) simple data splitting into training and test datasets (1:1 ratio due to small and unbalanced dataset) via sklearn.cross_validation.train_test_split, and 2.) stratified shuffle-split cross-validation (1000 folds) via sklearn.cross_validation.StratifiedShuffleSplit (as developed by Udacity in the tester.py file). The stratified shuffle-split cross-validation is the best solution for this project, due to the small and unbalanced dataset. 
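
A stratified shuffle-split loop in a current scikit-learn release might look like the Python 3 sketch below. It is not the Udacity tester.py: the data is synthetic, only 20 splits are used to keep it fast, and averaging per-split precision and recall is an assumption about how to aggregate. Note that `sklearn.cross_validation` no longer exists in recent releases; the equivalent classes live in `sklearn.model_selection`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced stand-in data (about 12% positives), NOT the Enron data.
X, y = make_classification(
    n_samples=150, n_features=13, n_informative=5,
    weights=[0.88, 0.12], random_state=42,
)

# Each split reshuffles the data and holds out 10% while preserving class ratio.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.1, random_state=42)

precisions, recalls = [], []
for train_idx, test_idx in splitter.split(X, y):
    clf = AdaBoostClassifier(n_estimators=50, random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    # zero_division=0 scores a split as 0 when no positives are predicted.
    precisions.append(precision_score(y[test_idx], pred, zero_division=0))
    recalls.append(recall_score(y[test_idx], pred, zero_division=0))

print(round(float(np.mean(precisions)), 3), round(float(np.mean(recalls)), 3))
```

Stratification matters with so few positives: a plain random split of a dataset with 18 POIs could easily produce a test set with no POIs at all, which makes precision and recall undefined for that split.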

## Source Code

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import sys
import pickle
sys.path.append("./tools/")

In [None]:
# Task 1: Import Data -------------------------------------------------------

# features_list is a list of strings, each of which is a feature name
features_list = ['poi', 
    'bonus', 'deferral_payments', 'deferred_income', 'director_fees',
    'exercised_stock_options', 'expenses', 'from_messages',
    'from_poi_to_this_person', 'from_this_person_to_poi', 'loan_advances',
    'long_term_incentive', 'other', 'restricted_stock',
    'restricted_stock_deferred', 'salary', 'shared_receipt_with_poi',
    'to_messages', 'total_payments', 'total_stock_value' ]
feature_names = features_list[1:]

# load the dictionary containing the dataset
# (pickle files are binary, so open in "rb" mode for portability)
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

# allocation of labels     
labels_poi = []
for x in data_dict:
    labels_poi.append( data_dict[x]['poi'] )
print labels_poi.count(0)
print labels_poi.count(1)

# missing values counts
values = data_dict.values()[0].keys()
values_dict = {}
for x in values:
    values_dict[x] = 0
for x in range( 0,len( data_dict.values() ) ):
    for y in values:
        if data_dict.values()[x][y] == 'NaN':
            values_dict[y] = values_dict[y] + 1
print values_dict
for key, value in sorted(values_dict.iteritems(), key=lambda (k,v): (v,k)):
    print key, value, round( float(value)/len( data_dict.values() ), 2 )

In [None]:
# Task 2: Remove Outliers ---------------------------------------------------

# remove outlier entries 
data_dict_out = data_dict.copy()
del data_dict_out['THE TRAVEL AGENCY IN THE PARK']
del data_dict_out['TOTAL']
del data_dict_out['LOCKHART EUGENE E']

In [None]:
# Task 3-1: Create New Features ---------------------------------------------

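# create features for percent of to/from messages 
# (deep copy so the new keys are not also written into data_dict_out,
#  whose inner dictionaries a shallow .copy() would share)
import copy
data_dict_new = copy.deepcopy(data_dict_out)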
for each in data_dict_new:
    from_poi = float( data_dict_new[each]['from_poi_to_this_person'] )
    to_poi = float( data_dict_new[each]['from_this_person_to_poi'] ) 
    from_messages = float( data_dict_new[each]['from_messages'] )
    to_messages = float( data_dict_new[each]['to_messages'] )
    # add new features 
    data_dict_new[each]['from_poi_percent'] = from_poi / to_messages
    data_dict_new[each]['to_poi_percent'] = to_poi / from_messages
    # fix nan values that will cause errors, convert to string 'NaN'
    if math.isnan( data_dict_new[each]['from_poi_percent'] ):
        data_dict_new[each]['from_poi_percent'] = 'NaN'
    if math.isnan( data_dict_new[each]['to_poi_percent'] ):
        data_dict_new[each]['to_poi_percent'] = 'NaN'
        
# add new features to features_list
features_list.append('from_poi_percent')
features_list.append('to_poi_percent')
feature_names = features_list[1:]

# store to my_dataset for export 
my_dataset = data_dict_new

In [None]:
# Task 3-2: Feature Selection -----------------------------------------------

# extract all features and labels from dataset
from feature_format import featureFormat, targetFeatureSplit
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

# preprocess all features 
from sklearn import preprocessing
scale = preprocessing.MinMaxScaler()
features = scale.fit_transform(features)

# kbest feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
featureSelector = SelectKBest(chi2, k='all')
featureSelector.fit(features, labels)
featureSelector.pvalues_
kbest_features = []
for x in range(0,len(feature_names)):
    if featureSelector.pvalues_[x] < 0.15:
        kbest_features.append([feature_names[x],
                    round(featureSelector.pvalues_[x],2)])
print kbest_features        
        
# decision tree variable importance 
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(features, labels)
clf.feature_importances_
decision_best = []
for x in range(0,len(feature_names)):
    if clf.feature_importances_[x] > 0:
        decision_best.append([feature_names[x],
                    round(clf.feature_importances_[x],2)])
print decision_best       

# Lasso regularization feature selection 
from sklearn.linear_model import RandomizedLasso
rlasso = RandomizedLasso(alpha=0.0085)
rlasso.fit(features, labels)
lasso_best = []
for x in range(0,len(feature_names)):
    if rlasso.scores_[x] > 0:
        lasso_best.append([feature_names[x],
                            round(rlasso.scores_[x],2)])
print lasso_best

# top features selected
features_list = ['poi', 
        'bonus', 'deferred_income', 'exercised_stock_options',
        'expenses', 'loan_advances', 'long_term_incentive',
        'other', 'restricted_stock', 'salary',
        'shared_receipt_with_poi', 'total_payments', 
        'total_stock_value', 'to_poi_percent'] 
feature_names = features_list[1:]

# extract only selected features and labels from dataset
from feature_format import featureFormat, targetFeatureSplit
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

# preprocess selected features 
from sklearn import preprocessing
scale = preprocessing.MinMaxScaler()
features = scale.fit_transform(features)

# split data into training and testing sets
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.5, random_state=42)

In [None]:
# Task 4: Test Classifiers --------------------------------------------------

# logistic regression
# Accuracy: 0.80073	Precision: 0.20791	Recall: 0.17600	
# F1: 0.19063	F2: 0.18157
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

# naive-bayes 
# Accuracy: 0.81293	Precision: 0.30303	Recall: 0.31000	
# F1: 0.30648	F2: 0.30858
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()

# k-means clustering 
# Accuracy: 0.83433	Precision: 0.22783	Recall: 0.10150	
# F1: 0.14044	F2: 0.11416
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=2)

# k-nearest neighbors 
# Accuracy: 0.86147	Precision: 0.38596	Recall: 0.06600	
# F1: 0.11272	F2: 0.07912
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=2)

# random forest
# Accuracy: 0.86140	Precision: 0.43577	Recall: 0.13400	
# F1: 0.20497	F2: 0.15554
from sklearn.ensemble import RandomForestClassifier 
clf = RandomForestClassifier()

# extra tree
# Accuracy: 0.85867	Precision: 0.41254	Recall: 0.14150	
# F1: 0.21072	F2: 0.16291
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier()

# gradient-boosting 
# Accuracy: 0.85947	Precision: 0.45408	Recall: 0.26700	
# F1: 0.33627	F2: 0.29098
from sklearn import ensemble
clf = ensemble.GradientBoostingClassifier()

# decision tree
# Accuracy: 0.82407	Precision: 0.33823	Recall: 0.33400	
# F1: 0.33610	F2: 0.33484
from sklearn import tree
clf = tree.DecisionTreeClassifier()

# ada-boosting 
# Accuracy: 0.84687	Precision: 0.40782	Recall: 0.32850	
# F1: 0.36389	F2: 0.34180
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier()

In [None]:
# Task 5-1: Tune Classifier -------------------------------------------------

# cross-validation of training data
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(features_train.shape[0], 
                    n_iter=5, test_size=0.2, random_state=42)

# tune decision tree classifier via grid search cv
from sklearn import tree
clf = tree.DecisionTreeClassifier()
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score
param_grid = { 'criterion': ['gini','entropy'],
               'max_features': [None, 'auto', 'sqrt', 'log2'],
               'splitter' : ['best','random'],
               'max_depth' : [None,1,2,4,8,10,15],
               'min_samples_split' : [1,2,3,5,10],
               'min_samples_leaf' : [1,2,3],
               'class_weight' : [None, 'balanced']}
cv_clf = GridSearchCV(estimator=clf, 
                      cv=cv, 
                      param_grid=param_grid,
                      scoring='f1')
cv_clf.fit(features_train, labels_train)
print cv_clf.best_params_

# tuned decision tree model (best parameters from the grid search above)
# Accuracy: 0.81427	Precision: 0.33827	Recall: 0.41100	
# F1: 0.37111	F2: 0.39406
clf = tree.DecisionTreeClassifier(splitter = 'best', 
            min_samples_leaf = 2,
            min_samples_split = 2, 
            criterion = 'entropy', 
            max_features = 'log2', 
            max_depth = 8, 
            class_weight = 'balanced')

# tune ada-boosting classifier via grid search cv
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier()
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score
param_grid = { 'n_estimators': [50,100,200],
               'learning_rate': [0.7,0.9,1,1.1],
               'algorithm' : ['SAMME', 'SAMME.R']}
cv_clf = GridSearchCV(estimator=clf, 
                      cv=cv, 
                      param_grid=param_grid,
                      scoring = 'f1')
cv_clf.fit(features_train, labels_train)
print cv_clf.best_params_

# tuned ada-boosting model via grid search cv
# 1st run: Accuracy: 0.85167  Precision: 0.42866  Recall: 0.33800	
# F1: 0.37797	F2: 0.35293
# 2nd run: Accuracy: 0.85153  Precision: 0.42803  Recall: 0.33750	
# F1: 0.37741	F2: 0.35241
clf = AdaBoostClassifier(n_estimators = 100, 
                         learning_rate = 0.9,
                         algorithm = 'SAMME.R')

In [None]:
# Task 5-2: Evaluation Metrics ----------------------------------------------

# predictions
clf.fit(features_train, labels_train)
test_predictions = clf.predict(features_test)

# feature importance 
clf.feature_importances_
ada_best = []
for x in range(0,len(feature_names)):
    if clf.feature_importances_[x] > 0:
        ada_best.append([feature_names[x],
                    round(clf.feature_importances_[x],2)])
print ada_best      

# confusion matrix and errors
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
print confusion_matrix(labels_test, test_predictions) 
print classification_report(labels_test, test_predictions)
print roc_auc_score(labels_test, test_predictions)

In [None]:
# Task 6: Dump Classifier, Dataset, and Features --------------------------

from tester import dump_classifier_and_data
dump_classifier_and_data(clf, my_dataset, features_list)