## Identify Fraud from Enron Email

Yuchen Yeh, November 2016

## Background

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. 

The aim of this project is to use machine learning classification techniques to identify persons of interest (POIs) based on the financial and email data made public as a result of the Enron scandal.

## Understanding the Dataset and Question

#### Data exploration

The features in the data fall into three major types, namely financial features, email features and POI labels:

* financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] 

* email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] 

* POI label: ['poi']

Important characteristics of this Enron dataset are:

* 146 data points (i.e. people)
* 21 available features for each person
* of 146 people, 18 are identified as POIs
* Missing values across all features except the POI label.
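
The counts above can be reproduced with a few lines of Python. This is a minimal sketch on a made-up three-person dictionary, not the project's data. It assumes the data is a dictionary of per-person feature dictionaries and that missing values are stored as the string "NaN"; the helper name `summarize` is hypothetical.

```python
# Toy stand-in for the data dictionary: {person: {feature: value}}.
# Names and values are invented; "NaN" as the missing-value marker is an assumption.
data_dict = {
    "PERSON A": {"poi": True, "salary": 250000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": 100000},
    "PERSON C": {"poi": False, "salary": 90000, "bonus": "NaN"},
}


def summarize(data):
    """Return (number of people, number of POIs, missing-value count per feature)."""
    n_people = len(data)
    n_poi = sum(1 for record in data.values() if record["poi"] is True)
    missing = {}
    for record in data.values():
        for feature, value in record.items():
            if value == "NaN":
                missing[feature] = missing.get(feature, 0) + 1
    return n_people, n_poi, missing


n_people, n_poi, missing = summarize(data_dict)
print(n_people, n_poi, missing)
```

Run against the real dictionary, the same function would be expected to report the number of people, the number of POIs, and which features have the most missing values.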

With only 18 of 146 people labelled as POIs, the classes are imbalanced. A cross-validation method such as Stratified Shuffle Split is therefore important, because it keeps the ratio of POIs to non-POIs the same in the training and test sets. The dataset is also small, so combining Stratified Shuffle Split with Grid Search CV still gives a reasonable training time.



#### Outlier identification

By plotting two features ("salary" and "bonus") in a scatterplot, I identified an outlier named TOTAL. This is a spreadsheet artifact, so I removed it. Two other data points look like potential outliers because of their much larger values, but they are valid data: Enron founder LAY KENNETH L and former CEO SKILLING JEFFREY K each received a bonus of at least 5 million dollars and a salary of over 1 million dollars.

One more outlier, "THE TRAVEL AGENCY IN THE PARK", was found by manually scanning enron61702insiderpay.pdf.
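
Dropping such entries from a dictionary-shaped dataset could look like the sketch below. The two keys are the ones named above. The toy records, their values, and the helper `remove_outliers` are assumptions for illustration only.

```python
# Toy dictionary with invented values; only the two outlier keys come from the text.
data_dict = {
    "TOTAL": {"salary": 999999999, "bonus": 999999999},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "bonus": "NaN"},
    "PERSON A": {"salary": 250000, "bonus": 100000},
}

OUTLIER_KEYS = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"]


def remove_outliers(data, keys):
    """Return a copy of data without the given keys; keys that are absent are ignored."""
    return {name: record for name, record in data.items() if name not in keys}


cleaned = remove_outliers(data_dict, OUTLIER_KEYS)
print(sorted(cleaned))
```

Returning a new dictionary rather than deleting in place keeps the raw data available for comparison.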


## Feature selection

#### Create new features 

I created two features to capture what fraction of a person's received and sent emails involve POIs:

* fraction_from_poi: Fraction of emails received from POIs.
* fraction_to_poi: Fraction of emails sent to POIs.

In general, POIs send emails to other POIs at a higher rate than the general population does. These two features are added to help improve the classifier's performance, since a high fraction of POI-related emails suggests that the person may also be a POI.
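
One way to compute the two fractions is sketched here. The denominators are an assumption: 'to_messages' is taken as all emails received and 'from_messages' as all emails sent. Treating a missing value ("NaN") or a zero total as a fraction of 0.0 is also an assumption, and the helper names are hypothetical.

```python
def compute_fraction(poi_messages, all_messages):
    """Return poi_messages / all_messages, or 0.0 if a value is missing or the total is zero."""
    if poi_messages == "NaN" or all_messages == "NaN" or all_messages == 0:
        return 0.0
    return poi_messages / all_messages


def add_fraction_features(record):
    """Return a copy of one person's record with the two fraction features added."""
    enriched = dict(record)
    enriched["fraction_from_poi"] = compute_fraction(
        record["from_poi_to_this_person"], record["to_messages"]
    )
    enriched["fraction_to_poi"] = compute_fraction(
        record["from_this_person_to_poi"], record["from_messages"]
    )
    return enriched


# Invented example: 20 of 200 received emails came from POIs, 10 of 50 sent went to POIs.
person = {
    "to_messages": 200,
    "from_messages": 50,
    "from_poi_to_this_person": 20,
    "from_this_person_to_poi": 10,
}
enriched = add_fraction_features(person)
print(enriched["fraction_from_poi"], enriched["fraction_to_poi"])
```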


#### Intelligently select features 

To decide which features to include, I considered the whole feature list plus the two new features. I used SelectKBest for feature selection; the table below shows the scores for all features (k = 'all') in descending order.

<table>
 <tr>
    <td><b>Feature</b></td>
    <td><b>Score</b></td>
  </tr>
  <tr>
    <td>exercised_stock_options</td>
    <td>24.82</td>
  </tr>
   <tr>
    <td>total_stock_value</td>
    <td>24.18</td>
  </tr>
   <tr>
    <td>bonus</td>
    <td>20.79</td>
  </tr>
   <tr>
    <td>salary</td>
    <td>18.29</td>
  </tr>
    <tr>
    <td>fraction_to_poi</td>
    <td>16.41</td>
  </tr>
   <tr>
    <td>deferred_income</td>
    <td>11.46</td>
  </tr>
   <tr>
    <td>long_term_incentive</td>
    <td>9.92</td>
  </tr>
   <tr>
    <td>restricted_stock</td>
    <td>9.21</td>
  </tr>
    <tr>
    <td>total_payments</td>
    <td>8.77</td>
  </tr>
   <tr>
    <td>shared_receipt_with_poi</td>
    <td>8.59</td>
  </tr>
   <tr>
    <td>loan_advances</td>
    <td>7.18</td>
  </tr>
   <tr>
    <td>expenses</td>
    <td>6.09</td>
  </tr>
    <tr>
    <td>from_poi_to_this_person</td>
    <td>5.24</td>
  </tr>
   <tr>
    <td>other</td>
    <td>4.19</td>
  </tr>
    <tr>
    <td>fraction_from_poi</td>
    <td>3.13</td>
  </tr>
   <tr>
    <td>from_this_person_to_poi</td>
    <td>2.38</td>
  </tr>
   <tr>
    <td>director_fees</td>
    <td>2.13</td>
  </tr>
   <tr>
    <td>to_messages</td>
    <td>1.65</td>
  </tr>
    <tr>
    <td>deferral_payments</td>
    <td>0.22</td>
  </tr>
   <tr>
    <td>from_messages</td>
    <td>0.17</td>
  </tr>
    <tr>
    <td>restricted_stock_deferred</td>
    <td>0.07</td>
  </tr>
 </table>

I see a cutoff where the score drops from 16.41 ('fraction_to_poi') to 11.46 ('deferred_income'). Therefore, I chose the 5 features with a score above 15. Together with the 'poi' label, the feature list is:

* 'poi'
* 'salary'
* 'bonus'
* 'total_stock_value'
* 'exercised_stock_options'
* 'fraction_to_poi'

The new feature 'fraction_to_poi' has a high score, while the other new feature, 'fraction_from_poi', is less relevant.
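
A ranking like the one in the table can be produced roughly as follows. This sketch uses synthetic data with invented feature names and assumes the default SelectKBest score function, `f_classif`; the report does not state which score function was used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 50 samples, 3 features, 10 positives. Not the Enron data.
rng = np.random.RandomState(0)
feature_names = ["informative", "noise_a", "noise_b"]
labels = np.array([0] * 40 + [1] * 10)
features = rng.normal(size=(50, 3))
features[:, 0] += 3.0 * labels  # shift the first column for the positive class

# k="all" keeps every feature, so scores_ can be inspected for the full list.
selector = SelectKBest(score_func=f_classif, k="all").fit(features, labels)
ranked = sorted(
    zip(feature_names, selector.scores_), key=lambda pair: pair[1], reverse=True
)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Once a cutoff has been chosen from the printed scores, the kept features can be listed explicitly, which is what the selection above does.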


#### Properly scale features 
I scaled all features using the scikit-learn MinMaxScaler to avoid problems caused by different units in the dataset. However, the algorithms I used (decision tree and naive Bayes classifiers) do not need feature scaling.
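
For reference, min-max scaling maps each feature to the range 0 to 1 by subtracting the column minimum and dividing by the column range. The function below is a hand-written sketch of that formula for a single column with invented values. It is not the scikit-learn implementation, and mapping a constant column to zeros is a choice made here for the sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


salaries = [50000, 100000, 250000]  # invented values
scaled = min_max_scale(salaries)
print(scaled)
```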

## Pick and Tune an Algorithm

I tested 2 algorithms to see which one performs best:

#### GaussianNB:

The classification report for GaussianNB is provided below.

GaussianNB(priors=None)
	Accuracy: 0.85629	<b> Precision: 0.49545	Recall: 0.32650	 </b> F1: 0.39361	F2: 0.35040
	Total predictions: 14000	True positives:  653	False positives:  665	False negatives: 1347	True negatives: 11335

#### Decision tree:
To get better performance from the decision tree algorithm, I performed parameter tuning. Out of all the available parameters, I decided to tune min_samples_split. min_samples_split is the minimum number of samples required to split an internal node, and it affects how prone the decision tree classifier is to overfitting. In general, a larger value of min_samples_split produces a simpler decision boundary, which can reduce overfitting, although a value that is too large can underfit.

I used GridSearchCV to identify the best value for min_samples_split in a range between 2 (the default) and 50. The best value was 14, so I used it to build the decision tree classifier.
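
A search of this kind might be set up as in the sketch below. It runs on synthetic data and uses the current `sklearn.model_selection` API. The grid step of 4, the 20 splits, and the F1 scoring are assumptions chosen to keep the example fast, not the settings used in this project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data: 100 samples, 15 positives. Not the Enron data.
rng = np.random.RandomState(0)
labels = np.array([0] * 85 + [1] * 15)
features = rng.normal(size=(100, 5))
features[:, 0] += 1.5 * labels

# Candidate values from 2 to 50; the step of 4 is an arbitrary choice for speed.
param_grid = {"min_samples_split": list(range(2, 51, 4))}

# Stratified splits keep the positive/negative ratio similar in every fold.
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",  # assumed metric; accuracy is misleading on imbalanced classes
    cv=cv,
)
search.fit(features, labels)
best_split = search.best_params_["min_samples_split"]
print(best_split)
```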

The classification report for decision tree is provided below.
 
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=14, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.80650	<b> Precision: 0.29960	Recall: 0.26500	</b> F1: 0.28124	F2: 0.27127
	Total predictions: 14000	True positives:  530	False positives: 1239	False negatives: 1470	True negatives: 10761
 

Feature importance:
* 'salary' importance is 0.168042010503
* 'bonus' importance is 0.130932733183
* 'total_stock_value' importance is 0.421015598727
* 'exercised_stock_options' importance is 0.132033008252
* 'fraction_to_poi' importance is 0.147976649335


In the end, I selected GaussianNB as the algorithm because it performed better in terms of both precision and recall.


## Validate and Evaluate

In this Enron dataset, only 18 data points are POIs. Such imbalanced classes make accuracy misleading. Accuracy is the fraction of all predictions that are correct, so a classifier that labels everyone as a non-POI would still score about 88% (128 of 146) without finding a single POI. Therefore, accuracy is not a good evaluation metric.

I used precision and recall to evaluate algorithm performance. When tester.py is used to evaluate the chosen algorithm, GaussianNB, precision is 0.49545 and recall is 0.32650; both are at least 0.3. The precision means that when a person is flagged as a POI in the test set, that person is a real POI about half of the time. The recall is lower: the classifier finds about a third of the real POIs and misses the rest.
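
The reported metrics follow directly from the confusion counts in the GaussianNB report. The short calculation below uses the standard definitions of accuracy, precision, recall, and the F-beta score. The function name is hypothetical, and the sketch assumes the denominators are non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and F2 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    def f_beta(beta):
        # F-beta weights recall beta times as heavily as precision.
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f_beta(1),
        "f2": f_beta(2),
    }


# Counts taken from the GaussianNB report above.
metrics = classification_metrics(tp=653, fp=665, fn=1347, tn=11335)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```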



Cross-validation checks that the model generalizes beyond the data used to train it, which guards against overfitting. A model that just repeats the labels of the samples it has already seen would have a perfect score on those samples but would fail to predict anything useful on yet-unseen data.

I used the provided StratifiedShuffleSplit method (n_iter=10000, test_size=0.3, random_state=0) to validate the chosen algorithm, GaussianNB. Notably, the returned evaluation metrics are the same: precision is 0.49545 and recall is 0.32650. This confirms the precision and recall reported for the chosen algorithm.
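
The validation loop can be sketched as follows, on synthetic data with a class balance similar to the cleaned dataset. It uses the current `sklearn.model_selection` API, where the number of repetitions is called `n_splits` (older releases named it `n_iter`), and only 100 splits to keep the run short. Pooling the confusion counts over all splits is an assumption, suggested by the single "Total predictions" line in the reports above.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB

# Synthetic data: 144 samples, 18 positives, 5 features. Not the Enron data.
rng = np.random.RandomState(0)
labels = np.array([0] * 126 + [1] * 18)
features = rng.normal(size=(144, 5))
features[labels == 1] += 1.0

n_splits = 100
splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=0)

# Accumulate confusion counts over every split, then compute the metrics once.
tp = fp = fn = tn = 0
for train_idx, test_idx in splitter.split(features, labels):
    clf = GaussianNB().fit(features[train_idx], labels[train_idx])
    predicted = clf.predict(features[test_idx])
    actual = labels[test_idx]
    tp += int(np.sum((predicted == 1) & (actual == 1)))
    fp += int(np.sum((predicted == 1) & (actual == 0)))
    fn += int(np.sum((predicted == 0) & (actual == 1)))
    tn += int(np.sum((predicted == 0) & (actual == 0)))

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Because every split is stratified, each test fold contains both POIs and non-POIs, so precision and recall stay well defined despite the small number of positives.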

