# Documentation On Identifying Fraud from Enron Email

## Yong Yu
---

## Overview

Overall, the model-building process in this project can be broken into two parts. The first part is choosing the best combination of feature scaling method, feature selection method, and classification method. In particular, it compares different feature scaling methods and feature selection methods, with and without PCA, each paired with well-tuned classifiers. The second part is further tuning the feature selection and PCA parameters to improve the performance of the estimator.

The final model uses a pipeline made of MinMaxScaler, SelectKBest, PCA, and LinearSVC, validated with StratifiedShuffleSplit cross-validation. The details are as follows:

<pre>[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
 ('k_best', SelectKBest(k=5, score_func=&lt;function f_classif at 0x106b0d8c0&gt;)),
 ('pca', PCA(copy=True, n_components=1, whiten=False)),
 ('linear_svc', LinearSVC(C=0.1, class_weight='auto', dual=False, fit_intercept=True,
                          intercept_scaling=1, loss='squared_hinge', max_iter=1000.0,
                          multi_class='ovr', penalty='l1', random_state=42, tol=0.0001,
                          verbose=0))]</pre>
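
As a rough illustration, the steps listed above could be assembled with a current scikit-learn release as sketched below. This is not the project's original code: `build_pipeline` is a hypothetical helper, the data is synthetic output from `make_classification`, `class_weight="balanced"` stands in for the older `'auto'` value, and `max_iter` is passed as an integer.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC


def build_pipeline():
    """Hypothetical helper mirroring the step names and settings listed above."""
    return Pipeline([
        ("minmaxscaler", MinMaxScaler(feature_range=(0, 1))),
        ("k_best", SelectKBest(score_func=f_classif, k=5)),
        ("pca", PCA(n_components=1, whiten=False)),
        # Assumption: "balanced" replaces the removed class_weight="auto" option.
        ("linear_svc", LinearSVC(C=0.1, class_weight="balanced", dual=False,
                                 penalty="l1", loss="squared_hinge",
                                 tol=1e-4, max_iter=1000, random_state=42)),
    ])


# Synthetic, imbalanced stand-in for the real data (not the Enron dataset).
X, y = make_classification(n_samples=143, n_features=19, n_informative=5,
                           weights=[0.87], random_state=42)

model = build_pipeline().fit(X, y)
predictions = model.predict(X)
```

Keeping every step inside one `Pipeline` means the scaler, selector, and PCA are refitted on each training fold during cross-validation, so nothing leaks from the test fold.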

## About This Project
The goal of this project is to identify whether a person is guilty in the notorious Enron fraud, using publicly available information such as financial records and emails.

From oil trading, bandwidth trading, and weather trading with mark-to-market accounting, Enron extended its reach to a variety of commodities, built ties with politicians like George W. Bush, and caused great losses to the public, including the California electricity shortage. All of this information could be useful if text learning were applied, and certain patterns could be found: for example, a pattern indicating that a decision-maker is very likely to be a person of interest. However, text learning is not applied in this analysis since it is a more advanced topic.

This analysis uses a financial dataset containing people's salaries, stock information, and so on. During the Enron fraud, people like Jeffrey Skilling, Ken Lay, and Andrew Fastow all dumped large amounts of stock options, and all were found guilty. This information can be very helpful for checking other persons of interest, and it is easily reflected in the dataset. This is also where machine learning comes into play. By building models that capture relationships between the person-of-interest label and the available quantitative data, machine learning tries to learn a pattern that helps us identify a guilty person in the future.

## About This Dataset

The following is a list of findings:
- there are 146 data points with 21 features each, for a total of 3,066 observations.
- there are 18 people who are persons of interest (POIs).
- 1,358 of the observations are missing values.
- the top 3 features with the most missing values are "loan_advances", "director_fees", and "restricted_stock_deferred".
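
Counts like these can be derived with pandas, as in the minimal sketch below. It assumes the data is a dictionary keyed by person name and that missing values are stored as the string `"NaN"`. The three-person `data_dict` is a made-up excerpt, not real Enron data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-person excerpt; the real dataset has 146 entries.
data_dict = {
    "PERSON A": {"poi": True, "salary": 1000, "bonus": "NaN", "loan_advances": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": 500, "loan_advances": "NaN"},
    "PERSON C": {"poi": False, "salary": 800, "bonus": 200, "loan_advances": 50},
}

# Assumption: missing values are encoded as the string "NaN"; convert them to real NaNs.
rows = {
    name: {key: (np.nan if value == "NaN" else value) for key, value in features.items()}
    for name, features in data_dict.items()
}
df = pd.DataFrame.from_dict(rows, orient="index")

n_people, n_features = df.shape          # data points and features
n_observations = df.size                 # data points times features
n_poi = int(df["poi"].sum())             # number of persons of interest
missing_per_feature = df.isna().sum().sort_values(ascending=False)
n_missing = int(missing_per_feature.sum())
top_missing = missing_per_feature.head(3)
```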

## Outliers Cleaning

Every feature contains some outliers. One big outlier turned out to be the "TOTAL" entry, which was removed. To find other outliers, a multivariable linear regression model was fitted. By removing the data points with the top 10% of errors between predicted and actual values, an outlier-cleaned dataset could be created for the analysis. However, 13 of the 18 persons of interest appeared among those outliers, so removing them before splitting the data into training and testing sets would weaken the model: few POIs would be left for it to identify. For this reason, outlier cleaning was performed only on the training set.

The fraction of outliers removed also affects the performance of the final model. Hence, different ratios were inspected, and the best-performing ratio for outlier cleaning was 0.02.
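
The residual-based cleaning described above might look roughly like the sketch below. Several things are assumptions: `drop_largest_residuals` is a hypothetical helper, the regression target is synthetic (the report does not say which variable was regressed on which), and squared error is used as the residual measure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def drop_largest_residuals(features, target, ratio=0.02):
    """Return a boolean mask that is False for the `ratio` share of rows
    with the largest squared residuals under a linear regression fit."""
    model = LinearRegression().fit(features, target)
    squared_error = (model.predict(features) - target) ** 2
    n_drop = int(round(ratio * len(target)))
    keep = np.ones(len(target), dtype=bool)
    if n_drop > 0:
        keep[np.argsort(squared_error)[-n_drop:]] = False
    return keep


# Synthetic training data with two injected outliers (rows 7 and 31).
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 3))
target = features @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
target[7] += 50.0
target[31] -= 50.0

# Apply the mask to the training split only, leaving the test split untouched.
keep = drop_largest_residuals(features, target, ratio=0.02)
clean_features, clean_target = features[keep], target[keep]
```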

## Feature Scaling

Three feature scaling methods are explored in this report: MinMaxScaler, StandardScaler, and Normalizer. MinMaxScaler is chosen as the final feature scaling method.

Generally speaking, it is crucial to scale the features when the training algorithm relies on a distance function, while it is not strictly necessary for a plain linear function,
<a href='http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html' target='_blank'>although scaling would make it faster.</a>
If one feature ranges from 0 to 100 and a second ranges from 0 to 100,000, the contribution of the first would be totally swamped by the second. It is therefore essential to bring them to the same scale before training a distance-based algorithm. LinearSVC, chosen for the final model, is treated here as distance-based because it maximizes a margin, that is, a distance to the decision boundary. The standard deviation across features ranges from as high as 8,846,594 (total_payments) to as low as 74 (from_poi_to_this_person), so scaling the dataset is necessary.
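
The effect of scaling in the 0 to 100 versus 0 to 100,000 example can be checked with a small sketch. The two columns are synthetic uniform samples chosen only to mimic those ranges.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two synthetic features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 100, size=200),
    rng.uniform(0, 100_000, size=200),
])

X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Before scaling, the second feature's spread dwarfs the first;
# afterwards both columns lie in [0, 1] with comparable spread.
std_before = X.std(axis=0)
std_after = X_scaled.std(axis=0)
```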

## Feature Selection and PCA

The feature selection method for the final model is SelectKBest. After tuning, the optimal value is k = 5, so 5 features are chosen. Their feature scores are listed below:

- exercised_stock_options: 27.80
- total_stock_value: 26.67
- salary: 18.22
- bonus: 18.16
- shared_receipt_with_poi: 9.89

After tuning, PCA uses n_components = 1 and whiten = False, with an explained variance ratio of 0.605.
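
Feature scores and the explained variance ratio can be read off the fitted objects, as in the sketch below. The data and the `feature_names` list are synthetic placeholders, so the numbers it produces will not match the values reported above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(n_samples=143, n_features=10, n_informative=5,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_scaled = MinMaxScaler().fit_transform(X)

# Score every feature with the ANOVA F-test and keep the best five.
selector = SelectKBest(score_func=f_classif, k=5).fit(X_scaled, y)
selected = sorted(
    ((selector.scores_[i], feature_names[i])
     for i in selector.get_support(indices=True)),
    reverse=True,
)

# Project the selected features onto a single principal component.
pca = PCA(n_components=1, whiten=False).fit(selector.transform(X_scaled))
explained = pca.explained_variance_ratio_[0]
```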

## Feature Engineering

Three new features, "stock_salary_ratio", "poi_from_ratio", and "poi_to_ratio", were created while analyzing the data.

stock_salary_ratio is total_stock_value divided by salary. This feature rests on the assumption that a person of interest usually holds an unusually large stock value, since it is kept under the table, while salary information is more easily known to the public; the ratio could therefore help identify a POI. The bigger the ratio, the more likely the person is a POI.

poi_from_ratio is from_poi_to_this_person divided by from_messages. This feature assumes that a POI tends to have more contact with other POIs, so the ratio would be bigger. The same applies to poi_to_ratio.

Before any final tuning of the feature scaling methods, their feature scores are as follows:

- poi_from_ratio: 2.727
- poi_to_ratio: 0.117
- stock_salary_ratio: 0.075

The model performance after adding these new features is

$$\text{Accuracy}: 0.7037,\quad \text{Precision}: 0.2889,\quad \text{Recall}: 0.8365,\quad \text{F1}: 0.4295$$

Its precision and recall scores are lower than those obtained without the new features:

$$\text{Accuracy}: 0.7115,\quad \text{Precision}: 0.2954,\quad \text{Recall}: 0.8405,\quad \text{F1}: 0.4372$$

Therefore, these features are not used in the final model.
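
For reference, the three ratio features described in this section could be computed along the lines of the sketch below. The helper names are hypothetical, and treating missing or zero denominators as a ratio of 0.0 is an assumption. The definition of `poi_to_ratio` (from_this_person_to_poi divided by to_messages) is likewise assumed by symmetry, since the text only says it is built the same way.

```python
import math


def is_missing(value):
    """Treat None, the string "NaN", and float NaN as missing."""
    return (value is None or value == "NaN"
            or (isinstance(value, float) and math.isnan(value)))


def safe_ratio(numerator, denominator):
    """Assumption: a missing operand or zero denominator yields 0.0."""
    if is_missing(numerator) or is_missing(denominator) or denominator == 0:
        return 0.0
    return numerator / denominator


def add_ratio_features(person):
    """Return a copy of one person's feature dict with the ratio features added."""
    enriched = dict(person)
    enriched["stock_salary_ratio"] = safe_ratio(
        person.get("total_stock_value"), person.get("salary"))
    enriched["poi_from_ratio"] = safe_ratio(
        person.get("from_poi_to_this_person"), person.get("from_messages"))
    # Assumed definition, mirroring poi_from_ratio.
    enriched["poi_to_ratio"] = safe_ratio(
        person.get("from_this_person_to_poi"), person.get("to_messages"))
    return enriched


# Made-up example record, not taken from the dataset.
example = add_ratio_features({
    "total_stock_value": 1_000_000, "salary": 250_000,
    "from_poi_to_this_person": 30, "from_messages": 120,
    "from_this_person_to_poi": 10, "to_messages": "NaN",
})
```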

## Classifiers

The chosen classifier is LinearSVC, applied after scaling with MinMaxScaler and using SelectKBest and PCA. LinearSVC is compared with KNeighborsClassifier. In the tuning results, KNeighborsClassifier tends to give a higher precision score, while LinearSVC gives a higher recall score, and KNeighborsClassifier always gives a higher accuracy score. However, LinearSVC tends to give a higher F1 score, as the list below shows, so LinearSVC was chosen: it gave the highest F1 score when combined with feature scaling and feature selection.

<pre>import pandas as pd
# Rank the candidate models by F1 score (descending), breaking ties by runtime (ascending).
# DataFrame.sort was removed from pandas; sort_values is its replacement.
pd.read_csv("model_metrix.csv").sort_values(['f1_score', 'time_used'], ascending=[False, True])</pre>

| Unnamed: 0 | model | scaler | pca | feature_selection_method | classification_method | accuracy_score | f1_score | precision_score | recall_score | time_used |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21 | minmaxscaler | pca | k_best | linear_svc | 0.7115 | 0.4372 | 0.2954 | 0.8405 | 37.162 |
| 3 | 13 | standardscaler | pca | k_best | linear_svc | 0.719 | 0.4222 | 0.2908 | 0.77 | 50.131 |
| 6 | 23 | minmaxscaler | pca | extra_tree | linear_svc | 0.7152 | 0.4052 | 0.2808 | 0.7275 | 119.631 |
| 7 | 15 | standardscaler | pca | extra_tree | linear_svc | 0.7112 | 0.395 | 0.274 | 0.707 | 120.961 |
| 10 | 5 | none | pca | k_best | linear_svc | 0.7147 | 0.3682 | 0.2612 | 0.6235 | 211.157 |
| 8 | 7 | none | pca | extra_tree | linear_svc | 0.6715 | 0.3133 | 0.2172 | 0.562 | 123.089 |
| 1 | 4 | none | none | extra_tree | k_neighbors | 0.8556 | 0.2833 | 0.4188 | 0.214 | 40.113 |
| 2 | 8 | none | pca | extra_tree | k_neighbors | 0.8556 | 0.2833 | 0.4188 | 0.214 | 41.69 |
| 4 | 9 | standardscaler | none | k_best | linear_svc | 0.7282 | 0.2823 | 0.2179 | 0.401 | 57.096 |
| 11 | 11 | standardscaler | none | extra_tree | linear_svc | 0.7281 | 0.2787 | 0.2156 | 0.394 | 1047.162 |


## Tuning

Tuning an algorithm means finding the parameter settings that work best for the specific problem at hand. If an algorithm is not well tuned, its accuracy, precision, and recall scores can drop, which means a better outcome is missed. Even worse, poor tuning can increase the runtime.

For the final model, three algorithms are carefully tuned. For LinearSVC, the parameters tuned are 'C', 'tol', and 'max_iter'. For SelectKBest, the parameter tuned is 'k'. For PCA, the parameter n_components is tuned.

When tuning LinearSVC, C alone dominated the performance. The algorithm performed best at C = 0.1. At C = 1.0, tol started to have a positive effect. At C = 3.0, model performance decreased when tol changed from 0.1 to 1.0. At C = 10.0, model performance decreased when tol changed from 0.01 to 0.1, and increased when tol reached 1.0. Across all fits, max_iter had no effect on performance.

When tuning SelectKBest, the performance generally went up when k increased, and was found to be the best when k = 5.

When tuning PCA, performance was best at n_components = 1 and whiten = False.
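
A grid search over the parameters named in this section could be set up roughly as follows. The grid values, the F1 scoring choice, the synthetic data, and the small number of splits are all assumptions made to keep the sketch quick to run. They do not reproduce the grids actually searched in the project.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in data (not the Enron dataset).
X, y = make_classification(n_samples=143, n_features=19, n_informative=5,
                           weights=[0.87], random_state=42)

pipeline = Pipeline([
    ("minmaxscaler", MinMaxScaler()),
    ("k_best", SelectKBest(score_func=f_classif)),
    ("pca", PCA(whiten=False)),
    ("linear_svc", LinearSVC(class_weight="balanced", dual=False,
                             penalty="l1", random_state=42)),
])

# Illustrative grid: step-name prefixes route each value to the right estimator.
param_grid = {
    "k_best__k": [3, 5],
    "pca__n_components": [1, 2],
    "linear_svc__C": [0.1, 1.0, 3.0, 10.0],
    "linear_svc__tol": [0.01, 0.1, 1.0],
    "linear_svc__max_iter": [100, 1000],
}

# Assumption: 10 stratified shuffles of 10% test data, to keep the run short.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_
```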

## Validation
Validation means holding out part of a dataset, fitting the model on the rest of the data, and predicting on the held-out part. Without cross-validation, the classic mistake of overfitting would go undetected.

In this model, StratifiedShuffleSplit is used as the cross-validation method. It is chosen because the data is heavily imbalanced, with only 18 POIs and 125 non-POIs. Stratifying ensures that, when the dataset is split, the class proportions in the training and testing sets match those of the full dataset.

To limit time consumption, n_iter = 100 was used when tuning algorithms and n_iter = 1000 when evaluating models.
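
The stratification property can be checked with a small sketch on labels that mirror the 18 POI and 125 non-POI counts. The `test_size` of 0.3 is an assumed value, and current scikit-learn releases name the iteration count `n_splits` rather than `n_iter`.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Labels mirroring the class counts above: 18 POIs (1) and 125 non-POIs (0).
y = np.array([1] * 18 + [0] * 125)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

# Assumption: 30% of the data is held out in each of the 100 shuffles.
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=42)

overall_poi_share = y.mean()
test_poi_shares = [y[test_idx].mean() for _, test_idx in splitter.split(X, y)]
```

Every test fold keeps a POI share close to the overall share, which a plain random split on so few positives would not guarantee.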

## Evaluation
The evaluation metrics used in this analysis are accuracy, F1 score, precision, and recall.

The final model has an accuracy score of 0.7186, which means 71.86% of its predictions are correct. However, this metric is not very informative here. The dataset is heavily imbalanced, and guessing every outcome to be non-POI would already give an accuracy of 86.2%, which is why precision and recall are used instead.

A precision score of 0.31 tells us that if this model flags 100 people as POIs, about 31 of them are truly POIs and the remaining 69 are innocent. On the other hand, a recall score of 0.81 means the model finds 81% of all real POIs. This model is good at finding bad guys, at the price of a high probability (0.69) of wrongly accusing someone.

Finally, there is always a tradeoff between precision and recall, and the F1 score measures how well that tradeoff is balanced. With an F1 score of 0.45, this model is acceptable, though it could be improved further, for example through better feature engineering.
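
The relationship between these metrics can be made concrete with a short sketch. The helper functions are illustrative, and the only inputs are the rounded precision and recall values quoted above, so the result is approximate.

```python
def precision(true_positives, false_positives):
    """Share of predicted POIs that really are POIs."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of real POIs that the model finds."""
    return true_positives / (true_positives + false_negatives)


def f1_score_from(precision_value, recall_value):
    """Harmonic mean of precision and recall."""
    return 2 * precision_value * recall_value / (precision_value + recall_value)


# 100 predicted POIs of which 31 are real corresponds to precision 0.31.
example_precision = precision(true_positives=31, false_positives=69)

# Combining the rounded scores reported above gives an F1 of about 0.45.
f1 = f1_score_from(0.31, 0.81)
```

Because F1 is a harmonic mean, it stays close to the smaller of the two scores, which is why the low precision holds it at about 0.45 despite the high recall.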