# Identifying Fraud from Enron Email

This is a report on the process of building estimators for fraud detection using machine learning.

A more compact, summarized report can be found as 
<a href='https://raw.githubusercontent.com/yyforyongyu/nanodegree-machine-learning/master/final_project/documentation.html' target='_blank'>Documentation (html)</a>, or 
<a href='https://github.com/yyforyongyu/nanodegree-machine-learning/blob/master/final_project/documentation.ipynb' target='_blank'> Documentation (ipynb) </a>.

## Overview
This report walks through a series of investigations aimed at building a robust final estimator that predicts whether a person is a person of interest (POI). These include:
- an overview of the dataset.
- outlier cleaning.
- a performance comparison between scaled features and non-scaled features.
- creating three features, "stock_salary_ratio", "poi_from_ratio", "poi_to_ratio", and evaluating them.
- a performance comparison between two different feature selection methods, SelectKBest and ExtraTreesClassifier.
- a performance comparison between including PCA and excluding PCA.
- a performance comparison between different classifiers, LinearSVC and KNeighborsClassifier.
- tuning algorithms using F1 score as evaluation metric.
- cross-validation on the final estimator.

Several helper functions are built for this project in poi_helper.py. Since this report focuses only on machine learning methodology, they are not covered here. For more details, the report 
<a href='https://raw.githubusercontent.com/yyforyongyu/nanodegree-machine-learning/master/final_project/poi_id.html' target='_blank'>poi_id.html</a>
documents the reasoning and steps behind these functions.

## Methodology
When searching for the best combination among several groups of factors, there are usually two ways to approach it. The first is to find the best option within each group, then chain those options together into the final combination. The assumption is that the best of each independent part adds up to the best whole. In practice this is rarely true, since the best option from one group can have a negative effect on the best option from another. Applied to this analysis, the method would mean first finding the best feature selection method, then finding the best classifier, and finally combining the two into the final estimator. However, according to the report 
<a href='https://raw.githubusercontent.com/yyforyongyu/nanodegree-machine-learning/master/final_project/poi_id.html' target='_blank'>poi_id.html</a>,
SelectKBest + LinearSVC reached an accuracy score of 0.94, as did RandomizedLogisticRegression + KNeighbors, although their runtimes differed. When RandomizedLogisticRegression + LinearSVC or SelectKBest + KNeighbors was applied instead, the accuracy scores dropped. This suggests that each classifier has a feature selection method that fits it best, so choosing the best classifier and the best feature selection method separately and chaining them won't necessarily produce the best result. The effect became even clearer when all the algorithms were applied to both the outlier-cleaned and the full datasets: the best estimator for one dataset did not carry over to the other.

So machine learning is really about finding a specific, nearly unique solution to each question. This leads to the second approach: exhaustively trying out combinations of all factors, rather than finding a best answer group by group. It assumes that a small change in one component can change the overall result. Unfortunately, this method has its own problem: it is very time-consuming.

For this analysis, the dataset will be tested with and without PCA, with and without feature scaling, with two feature selection methods, and with two classifiers. Considering only 20 values to tune for each classifier, 20 values for each feature selection method, and 10 values for PCA, a rough total number of combinations is
$$ 2 \times 2 \times 2 \times 2 \times 20 \times 20 \times 10 = 64{,}000 $$
Running all of these possibilities at once is not practical, so we have to make a tradeoff.

For this analysis, the parameters of the feature selection methods won't be tuned until one best feature selection method is found; the cross validation settings will be tuned after that. This brings the number of possible combinations down dramatically before we start tuning feature selection and cross validation.
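
To make the tradeoff concrete, the sketch below recomputes the exhaustive count from the numbers quoted above and compares it with one possible staged search. The staging shown here (tune only the classifier while comparing structural choices, then tune feature selection and PCA for the single winner) is an assumption made for illustration, not the exact schedule used in this report.

```python
# Option counts quoted in the text above.
structural_choices = [2, 2, 2, 2]  # PCA on/off, scaling on/off, 2 selectors, 2 classifiers
tuned_values = [20, 20, 10]        # values for classifier, feature selection, PCA


def product_of(sizes):
    """Multiply a list of option counts together."""
    total = 1
    for size in sizes:
        total *= size
    return total


# Exhaustive search: every option combined with every other option.
exhaustive = product_of(structural_choices + tuned_values)

# Hypothetical staging (illustrative assumption):
# stage 1 compares all structural choices while tuning only the classifier,
# stage 2 tunes feature selection for the single winning combination,
# stage 3 tunes PCA for that same winner.
stage_1 = product_of(structural_choices) * 20
stage_2 = 20
stage_3 = 10
staged = stage_1 + stage_2 + stage_3

print(exhaustive)  # 64000
print(staged)      # 350
```

The staged search is far cheaper, at the cost of possibly missing a combination that only works well when several options change together.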

# Summary of Dataset
A summary of findings,
- there are 146 data points with 21 features each, for a total of 3,066 values.
- there are 18 people who are persons of interest.
- 1,358 values are missing.
- the top 3 features with most missing values are "loan_advances", "director_fees", and "restricted_stock_deferred".

In [1]:
#!/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")

from tester import dump_classifier_and_data, test_classifier
from poi_helper import *

### Load the dictionary containing the dataset
data_dict = pickle.load(open("final_project_dataset.pkl", "r") )

In [2]:
# number of data points
len(data_dict.keys())

146

In [3]:
# number of features available
len(data_dict['METTS MARK'])

21

In [4]:
# available features
data_dict["METTS MARK"].keys()

['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'email_address',
 'from_poi_to_this_person']

In [5]:
# people of interest
count = 0
for key, item in data_dict.iteritems():
    if item["poi"]:
        print key
        count += 1
count

HANNON KEVIN P
COLWELL WESLEY
RIEKER PAULA H
KOPPER MICHAEL J
SHELBY REX
DELAINEY DAVID W
LAY KENNETH L
BOWEN JR RAYMOND M
BELDEN TIMOTHY N
FASTOW ANDREW S
CALGER CHRISTOPHER F
RICE KENNETH D
SKILLING JEFFREY K
YEAGER F SCOTT
HIRKO JOSEPH
KOENIG MARK E
CAUSEY RICHARD A
GLISAN JR BEN F


18

In [6]:
# create a dictionary for all missing values
missing = {}
for key, item in data_dict.iteritems():
    for elem, value in item.iteritems():
        if value == "NaN":
            if elem not in missing:
                missing[elem] = 1
            else:
                missing[elem] += 1

In [7]:
# number of missing values
number_of_missing = 0
for key, item in missing.iteritems():
    number_of_missing += item
number_of_missing

1358

In [8]:
missing

{'bonus': 64,
 'deferral_payments': 107,
 'deferred_income': 97,
 'director_fees': 129,
 'email_address': 35,
 'exercised_stock_options': 44,
 'expenses': 51,
 'from_messages': 60,
 'from_poi_to_this_person': 60,
 'from_this_person_to_poi': 60,
 'loan_advances': 142,
 'long_term_incentive': 80,
 'other': 53,
 'restricted_stock': 36,
 'restricted_stock_deferred': 128,
 'salary': 51,
 'shared_receipt_with_poi': 60,
 'to_messages': 60,
 'total_payments': 21,
 'total_stock_value': 20}
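
The counting done in the cells above can also be written more compactly with `collections.Counter`. The sketch below runs on a tiny made-up dictionary (`toy_data` and its values are hypothetical) that only mimics the shape of `data_dict`, where missing entries are stored as the string "NaN".

```python
from collections import Counter

# Toy stand-in for data_dict (hypothetical people and values).
toy_data = {
    "PERSON A": {"salary": 100, "bonus": "NaN", "loan_advances": "NaN"},
    "PERSON B": {"salary": "NaN", "bonus": 50, "loan_advances": "NaN"},
    "PERSON C": {"salary": 300, "bonus": "NaN", "loan_advances": "NaN"},
}

# Count, per feature, how many records hold the "NaN" placeholder.
missing_counts = Counter(
    feature
    for record in toy_data.values()
    for feature, value in record.items()
    if value == "NaN"
)

total_missing = sum(missing_counts.values())
top_two = missing_counts.most_common(2)

print(total_missing)  # 6
print(top_two)        # [('loan_advances', 3), ('bonus', 2)]
```

`most_common(n)` returns the features sorted by missing count, which is how a "top 3" list like the one in the summary can be read off directly.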

# Outlier Investigation

As we already know from the mini projects, there is an outlier named "TOTAL" in this dataset. We need to remove it before any further analysis.

In [9]:
# remove the outlier 'TOTAL'
data_dict.pop("TOTAL")

{'bonus': 97343619,
 'deferral_payments': 32083396,
 'deferred_income': -27992891,
 'director_fees': 1398517,
 'email_address': 'NaN',
 'exercised_stock_options': 311764000,
 'expenses': 5235198,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 83925000,
 'long_term_incentive': 48521928,
 'other': 42667589,
 'poi': False,
 'restricted_stock': 130322299,
 'restricted_stock_deferred': -7576788,
 'salary': 26704229,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 309886585,
 'total_stock_value': 434509511}

## Plots of the Outliers

To understand the outliers in this dataset, salary is plotted against every other feature except poi, which is used to color the data points in each plot. As a starting point, all the available features are selected and put into the model. Later in this report, some features will be removed based on their feature selection score.

In [10]:
# create features for plots
# features_list is a list of strings, each of which is a feature name.
# The first feature must be "poi".
features_list = ['poi',
                 'salary',
                 'to_messages',
                 'deferral_payments',
                 'total_payments',
                 'exercised_stock_options',
                 'bonus',
                 'restricted_stock',
                 'shared_receipt_with_poi',
                 'restricted_stock_deferred',
                 'total_stock_value',
                 'expenses',
                 'loan_advances',
                 'from_messages',
                 'other',
                 'from_this_person_to_poi',
                 'director_fees',
                 'deferred_income',
                 'long_term_incentive',
                 'from_poi_to_this_person']

# format the dataset
data = featureFormat(data_dict, features_list)

# create a pandas dataframe
df = pd.DataFrame(data, columns = features_list)

In the following plots, blue color stands for poi, red color stands for non-poi.

In [11]:
# %matplotlib inline
# from ggplot import *

# # iter through all features
# # x axis will always be salary
# # poi is represented by colors of points
# # the rest of features are put in y axis
# for feature in features_list:
#     if feature != "poi" and feature != "salary":
#         print ggplot(aes(x = 'salary', y = feature, color = 'poi'),
#                data = df) +\
#         geom_point() +\
#         ggtitle("salary against " + feature)

## Outlier Removal

The purpose of removing outliers is to prevent the model from being distorted by extreme cases. This rests on the assumption that the extreme cases either rarely happen or don't carry enough valuable information to be kept in the model. That can be true for some of the features, but it is debatable for the "total_payments" feature, and it shouldn't be applied to "exercised_stock_options", since the top four outliers there are all persons of interest. On the other hand, if we treated the top 10% of each feature as outliers, the final dataset would likely retain much less than 90% of the original data points, because different features flag different people. Such a large reduction of the original dataset would weaken the model.

Given all these thoughts, we will start the cleaning experiment with a simple method. After fitting a linear regression model, we calculate the error between the predicted values and the true values, then treat the data points whose predictions have the top 10% of errors as outliers. Based on that, we can then decide what to do with the outliers.
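
The idea described above can be sketched with NumPy as follows. This is not the `outlierCleaner` helper from poi_helper.py, whose implementation is not shown in this report. `residual_outlier_split` is a hypothetical function written for Python 3, and the toy data is made up.

```python
import numpy as np


def residual_outlier_split(X, y, fraction=0.1):
    """Fit ordinary least squares and split rows by squared residual.

    Returns (keep_idx, outlier_idx): the `fraction` of rows with the
    largest squared residuals are reported as outliers.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    squared_error = (design @ coef - y) ** 2
    n_outliers = int(round(fraction * len(y)))
    order = np.argsort(squared_error)               # ascending error
    keep_idx = np.sort(order[: len(y) - n_outliers])
    outlier_idx = np.sort(order[len(y) - n_outliers:])
    return keep_idx, outlier_idx


# Toy data: y = 2x + 1, with one point pushed far off the line.
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * x.ravel() + 1
y[7] += 50

keep_idx, outlier_idx = residual_outlier_split(x, y, fraction=0.1)
print(outlier_idx)  # [7]
```

Refitting the regression on `keep_idx` only is what would produce a higher score after cleaning, since the rows the model explains worst are no longer counted against it.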


In [12]:
### check on the score before outlier cleaning
features, labels = featureLabelSplit(data_dict, features_list)
buildRegression(features, labels)[1]

0.35600464626660311

In [13]:
### clean the outliers
### extract normal data points and outliers
cleaned_data, outliers = outlierCleaner(features, labels)

In [14]:
### extract labels and features from cleaned_data
cleaned_labels, cleaned_features = targetFeatureSplit(cleaned_data)

# fit the model again and check the score
buildRegression(cleaned_features, cleaned_labels)[1]

0.82103305215836841

Removing the outliers improved the score of the linear model dramatically, from 0.36 to 0.82. Although it's good to see the score improve, it's always necessary to take a look at the removed outliers.

## Check on Outliers

In [15]:
### change the data format from numpy array to python dictionary
outliers_dataset = personMapping(featureReformat(outliers, features_list), data_dict, features_list)

In [16]:
### number of outliers
len(outliers_dataset)

14

In [17]:
### names of outliers who are not POIs
for key, item in outliers_dataset.iteritems():
    if item['poi'] == 0.0:
        print key

LAVORATO JOHN J


## Strategy on Outlier Removal

As mentioned above, simply removing the outliers might cause issues for later analysis. While the improvement in the score of the linear model is tempting, note that this is not the model we will use for machine learning on this dataset.

On the other hand, it is no surprise that most of the persons of interest (13 out of 18) are flagged as outliers, given the background of the Enron fraud. In this case, the outliers are the targets we want to find. According to <a href='https://discussions.udacity.com/t/outlier-removal/7446' target='_blank'>this post in the discussion forum</a>, we can manually decide whether to include or exclude the outliers in the training set. This strategy will be applied when processing the dataset.

Given the thoughts above, when fitting the datasets later, 5% of the training set is cleaned as outliers.

# Preprocessing Features

## Feature Creation
To dig more patterns out of the dataset, three new features, "stock_salary_ratio", "poi_from_ratio", and "poi_to_ratio", are created as follows.

- stock_salary_ratio: total_stock_value divided by salary. This feature is based on the assumption that a person of interest usually has an unusually large stock value, since it is kept under the table, while salary information is more easily known to the public. The ratio could therefore help identify a POI: the bigger the ratio, the more likely the person is a POI.
- poi_from_ratio: from_poi_to_this_person divided by from_messages. This feature assumes that a POI tends to have more contact with other POIs, so the ratio would be bigger.
- poi_to_ratio: from_this_person_to_poi divided by to_messages. The same reasoning applies.

In [18]:
### add new features to dataset
for key, item in data_dict.iteritems():
    ### add stock_salary_ratio
    if item['salary'] != "NaN" and item['total_stock_value'] != "NaN":
        item['stock_salary_ratio'] = float(item['total_stock_value']) / item['salary']
    else:
        item['stock_salary_ratio'] = "NaN"
    
    ### add poi_from_ratio
    if item['from_messages'] != "NaN" and item['from_poi_to_this_person'] != "NaN":
        item['poi_from_ratio'] = float(item['from_poi_to_this_person']) / item['from_messages']
    else:
        item['poi_from_ratio'] = "NaN"
        
    ### add poi_to_ratio
    if item["to_messages"] != "NaN" and item["from_this_person_to_poi"] != "NaN":
        item["poi_to_ratio"] = float(item["from_this_person_to_poi"]) / item["to_messages"]
    else:
        item["poi_to_ratio"] = "NaN"
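
The three blocks above follow the same pattern, so they could be folded into one small helper. The sketch below is one way to do that. `safe_ratio` is a hypothetical name, the sample record is made up, and the zero-denominator guard is an extra check that the cell above does not perform.

```python
def safe_ratio(record, numerator, denominator):
    """Return numerator / denominator as a float, or "NaN" if unavailable.

    Treats the dataset's "NaN" string as missing and also guards against
    a zero denominator (an extra check not present in the notebook cell).
    """
    top = record.get(numerator, "NaN")
    bottom = record.get(denominator, "NaN")
    if top == "NaN" or bottom == "NaN" or bottom == 0:
        return "NaN"
    return float(top) / bottom


# Hypothetical record, for illustration only.
person = {
    "total_stock_value": 500000,
    "salary": 100000,
    "from_poi_to_this_person": 10,
    "from_messages": "NaN",
}

stock_salary_ratio = safe_ratio(person, "total_stock_value", "salary")
poi_from_ratio = safe_ratio(person, "from_poi_to_this_person", "from_messages")

print(stock_salary_ratio)  # 5.0
print(poi_from_ratio)      # NaN
```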

In [19]:
### update features list
new_features_list = features_list + ["stock_salary_ratio", "poi_from_ratio", "poi_to_ratio"]

## Feature Scaling

Depending on the algorithms chosen, feature scaling may be necessary. In this report, three feature scaling methods are compared, including MinMaxScaler, StandardScaler, and Normalizer.

In [31]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

In [74]:
stdMeanReader(features, features_list)

| index | feature | mean | std |
|---|---|---|---|
| 3 | total_payments | 2259057.125 | 8846594.3829 |
| 11 | loan_advances | 582812.5 | 6794471.7789 |
| 9 | total_stock_value | 2909785.6111 | 6189018.075 |
| 4 | exercised_stock_options | 2075801.9792 | 4795513.1452 |
| 6 | restricted_stock | 868536.2917 | 2016572.3887 |
| 8 | restricted_stock_deferred | 73417.9028 | 1301983.3904 |
| 5 | bonus | 675997.3542 | 1233155.2559 |
| 13 | other | 297260.0903 | 1131068.131 |
| 2 | deferral_payments | 222089.5556 | 754101.3026 |
| 17 | long_term_incentive | 336957.8333 | 687182.5677 |


In [76]:
### check results from MinMaxScaler
stdMeanReader(features, features_list, MinMaxScaler())

| index | feature | mean | std |
|---|---|---|---|
| 15 | director_fees | 0.0724 | 0.227 |
| 10 | expenses | 0.1546 | 0.1981 |
| 7 | shared_receipt_with_poi | 0.1273 | 0.1951 |
| 0 | salary | 0.1669 | 0.1773 |
| 16 | deferred_income | 0.9447 | 0.1729 |
| 5 | bonus | 0.0845 | 0.1541 |
| 1 | to_messages | 0.0818 | 0.1477 |
| 18 | from_poi_to_this_person | 0.0734 | 0.1407 |
| 4 | exercised_stock_options | 0.0604 | 0.1396 |
| 17 | long_term_incentive | 0.0655 | 0.1336 |


In [77]:
### check results from StandardScaler
stdMeanReader(features, features_list, StandardScaler())

| index | feature | mean | std |
|---|---|---|---|
| 0 | salary | 0 | 1.0035 |
| 10 | expenses | 0 | 1.0035 |
| 17 | long_term_incentive | 0 | 1.0035 |
| 16 | deferred_income | 0 | 1.0035 |
| 15 | director_fees | 0 | 1.0035 |
| 14 | from_this_person_to_poi | 0 | 1.0035 |
| 13 | other | 0 | 1.0035 |
| 12 | from_messages | 0 | 1.0035 |
| 11 | loan_advances | 0 | 1.0035 |
| 9 | total_stock_value | 0 | 1.0035 |


In [78]:
### check results from Normalizer
stdMeanReader(features, features_list, Normalizer())

| index | feature | mean | std |
|---|---|---|---|
| 3 | total_payments | 0.42 | 0.3071 |
| 9 | total_stock_value | 0.4611 | 0.28 |
| 4 | exercised_stock_options | 0.2979 | 0.2644 |
| 5 | bonus | 0.1587 | 0.1891 |
| 6 | restricted_stock | 0.1812 | 0.189 |
| 15 | director_fees | 0.052 | 0.176 |
| 16 | deferred_income | -0.069 | 0.1638 |
| 2 | deferral_payments | 0.0466 | 0.1395 |
| 13 | other | 0.0506 | 0.1201 |
| 17 | long_term_incentive | 0.0693 | 0.1118 |
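
To see what each of the three scalers does to the data, the sketch below reproduces their arithmetic with plain NumPy on a made-up matrix. MinMaxScaler and StandardScaler work column by column, while Normalizer rescales each row. The sketch also shows one likely reason the StandardScaler table reports a std of 1.0035 rather than exactly 1: StandardScaler divides by the population std (ddof=0), while pandas reports the sample std (ddof=1) by default. The assumption that `stdMeanReader` uses pandas defaults is mine, since that helper is not shown here.

```python
import numpy as np

# Toy matrix (hypothetical): rows are people, columns are features
# measured on very different scales.
X = np.array([[100000.0, 10.0],
              [200000.0, 30.0],
              [400000.0, 20.0],
              [300000.0, 40.0]])

# MinMaxScaler equivalent: map each column onto [0, 1].
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# StandardScaler equivalent: zero mean, unit population std per column.
standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalizer equivalent: rescale each row to unit Euclidean length.
normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

# The sample std (ddof=1) of a column standardized with ddof=0 is
# sqrt(n / (n - 1)) rather than exactly 1.
n = X.shape[0]
sample_std = standard.std(axis=0, ddof=1)
print(sample_std, np.sqrt(n / (n - 1.0)))
```

For a sample of roughly 144 or 145 rows, the factor sqrt(n / (n - 1)) is about 1.0035, which is consistent with the table above.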


In [96]:
### create scalers
scalers = [('none', None),
           ('standardscaler', StandardScaler()),
           ('minmaxscaler', MinMaxScaler()),
           ('normalizer', Normalizer())]

## Feature Selection

Before fitting any model, two feature selection methods for classification listed in the 
<a href='http://scikit-learn.org/stable/modules/feature_selection.html' target='_blank'>sklearn documentation</a> 
are explored: SelectKBest and ExtraTreesClassifier.

In [20]:
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import ExtraTreesClassifier

feature_selection = [('k_best', SelectKBest(k = 5)),
                     ('extra_tree', ExtraTreesClassifier(max_features=5, class_weight='auto', random_state=42))]

## PCA
PCA is imported to perform dimensionality reduction. Whether to include PCA will be decided later, based on performance.

In [21]:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)

When calling PCA, the features are scaled before further processing.

## Feature Union
The FeatureUnion class in sklearn is handy when we want to combine PCA with another feature selection method. 
<a href='http://scikit-learn.org/stable/auto_examples/feature_stacker.html' target='_blank'>One example from sklearn</a>
shows how a feature union is used. Note that a feature union does not chain its steps one after another: each transformer is fitted on the same input, and their outputs are concatenated column-wise.

In [22]:
from sklearn.pipeline import FeatureUnion

### combine pca with each feature selection method
combined_feature = []
for method in feature_selection:
    new_method = FeatureUnion([('pca', PCA(n_components=5)), method])
    name = method[0] + "_with_pca"
    combined_feature.append((name, new_method))

In [23]:
### update feature selection list
feature_selection += combined_feature
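
A small sketch of what a feature union produces, assuming a recent scikit-learn and synthetic data (the array shapes and parameter values below are made up). Both transformers see the same input, and the result has as many columns as the two outputs put side by side.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

# Synthetic data (hypothetical): 40 samples, 6 features, binary labels.
rng = np.random.RandomState(42)
X = rng.normal(size=(40, 6))
y = np.array([0, 1] * 20)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),             # 2 projected components
    ("k_best", SelectKBest(f_classif, k=3)),  # 3 original columns
])
combined = union.fit_transform(X, y)

print(combined.shape)  # (40, 5): 2 PCA components + 3 selected columns
```

Because the selected original columns are kept next to the PCA components, a downstream classifier can use both views of the data at once.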

# Pick and Prepare Classifiers

According to 
<a href='http://scikit-learn.org/stable/tutorial/machine_learning_map/' target='_blank'>this cheat sheet in sklearn</a>, 
at least four classification methods can be used:
- LinearSVC
- KNeighbors Classifier
- SVC
- Ensemble Classifiers

In this report, we will check on LinearSVC and KNeighborsClassifier.

In [24]:
from sklearn.svm import LinearSVC
linear_svc = LinearSVC(class_weight='auto', penalty='l1', dual=False, random_state=42)

params_svc = {'linear_svc__C':[0.1, 0.3, 1, 3, 10],
              'linear_svc__tol': [1e-4, 1e-3, 1e-2, 1],
              'linear_svc__max_iter': [1000, 10000]}

In [25]:
from sklearn.neighbors import KNeighborsClassifier
k_neighbors = KNeighborsClassifier(weights='distance', algorithm='auto')

params_kneighbors = {'k_neighbors__n_neighbors': [1, 3, 10],
                     'k_neighbors__leaf_size': [2, 5, 10, 30, 50, 100]}

In [26]:
### put all classifiers together
classifiers = [('linear_svc', linear_svc, params_svc),
               ('k_neighbors', k_neighbors, params_kneighbors)]

# Validation and Evaluation


## Validation
To prevent overfitting, cross validation is needed to split the dataset into training and testing sets. StratifiedShuffleSplit is used across all tuning processes with its default test_size of 0.1. The n_iter parameter used for cross validation varies by step, as listed below:

- when finding best combination of feature selection and classification method, n_iter = 100;
- when tuning on the chosen estimator, n_iter = 1000;
- when tuning on the chosen feature selection, n_iter = 1000.
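
The point of a stratified split is that every test fold keeps roughly the same share of POIs as the full dataset, which matters when positives are rare. The sketch below illustrates this with made-up labels. It uses the newer `sklearn.model_selection` API, where the number of splits is passed as `n_splits`, rather than the older `n_iter` form used in this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels: 20 positives out of 200 samples (10% positive).
labels = np.array([1] * 20 + [0] * 180)
features = np.zeros((len(labels), 1))  # placeholder; only labels drive the split

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=42)
positives_per_test_fold = [
    int(labels[test_idx].sum())
    for _, test_idx in sss.split(features, labels)
]

print(positives_per_test_fold)  # [2, 2, 2, 2, 2]
```

Each test fold holds 20 samples with 2 positives, matching the 10% rate of the full label vector.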


## Evaluation
For evaluation, we use the accuracy score, F1 score, precision score, recall score, and time consumption when deciding on the best estimator. When performing grid search, the F1 score is used as the scoring parameter.
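
A short sketch of why F1 is a more informative target than accuracy on this dataset. With 18 POIs among the 145 people left after removing "TOTAL", a classifier that never flags anyone is right most of the time and still useless. `precision_recall_f1` is a hypothetical helper written for this illustration.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# 18 POIs among 145 people.
truth = [1] * 18 + [0] * 127

# A classifier that never flags anyone still looks accurate...
never_flag = [0] * 145
accuracy = sum(t == p for t, p in zip(truth, never_flag)) / float(len(truth))

# ...but its F1 score is zero because it finds no POI at all.
_, _, f1_never = precision_recall_f1(truth, never_flag)

print(round(accuracy, 3))  # 0.876
print(f1_never)            # 0.0
```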

# Exploring Algorithms

When running the estimators on the dataset, 8 models are generated separately for scaled and non-scaled features. A comparison among the best choices follows, listing the model number, feature selection method, classification method, accuracy score, F1 score, precision score, recall score, and time consumption.

In [97]:
### scaled results
model_sets_scaled, score_scaled = trainModel(data_dict, features_list, 
                                             feature_selection=feature_selection,
                                             classifiers=classifiers,
                                             scalers = scalers)

TypeError: trainModel() got an unexpected keyword argument 'scalers'

Check the results.

In [30]:
pd.read_csv("result.csv").sort(['f1_score','time_used'], ascending= [0, 1])

| index | model | scaled | feature_selection_method | classification_method | accuracy_score | f1_score | precision_score | recall_score | time_used |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 15 | none | extra_tree_with_pca | linear_svc | 0.7633 | 0.4036 | 0.3039 | 0.6005 | 101.619 |
| 2 | 13 | none | k_best_with_pca | linear_svc | 0.7575 | 0.4022 | 0.2996 | 0.612 | 34.494 |
| 3 | 5 | none | k_best_with_pca | linear_svc | 0.7558 | 0.3312 | 0.2609 | 0.4535 | 41.274 |
| 1 | 9 | none | k_best | linear_svc | 0.8013 | 0.3044 | 0.2855 | 0.326 | 25.712 |
| 0 | 1 | none | k_best | linear_svc | 0.7833 | 0.2903 | 0.2577 | 0.3325 | 23.758 |
| 7 | 7 | none | extra_tree_with_pca | linear_svc | 0.7442 | 0.2808 | 0.2246 | 0.3745 | 107.585 |
| 5 | 11 | none | extra_tree | linear_svc | 0.7757 | 0.2603 | 0.2323 | 0.296 | 92.791 |
| 4 | 3 | none | extra_tree | linear_svc | 0.746 | 0.2561 | 0.2101 | 0.328 | 84.217 |


# Final Tuning on the Estimator

In [30]:
### extract the pipeline
pipeline = model_sets[1][1]
tuning_score_kneighbors = score[1]

In [13]:
### get the training and testing set
features_train, features_test, labels_train, labels_test = trainTestSplit(data_dict, features_list)

### prepare the cross validation
sss = StratifiedShuffleSplit(labels_train, n_iter=1000, random_state=42)

In [14]:
feature, label = featureLabelSplit(data_dict, features_list)

In [96]:
from sklearn.preprocessing import MinMaxScaler
new_feature = MinMaxScaler().fit_transform(feature)

In [97]:
dff = pd.DataFrame(new_feature, columns = features_list[1:])

In [98]:
dff.mean()

salary                       0.166879
to_messages                  0.081758
deferral_payments            0.049711
total_payments               0.021814
exercised_stock_options      0.060434
bonus                        0.084500
restricted_stock             0.199988
shared_receipt_with_poi      0.127262
restricted_stock_deferred    0.107912
total_stock_value            0.060094
expenses                     0.154638
loan_advances                0.007149
from_messages                0.025305
other                        0.028694
from_this_person_to_poi      0.040435
director_fees                0.072392
deferred_income              0.944731
long_term_incentive          0.065487
from_poi_to_this_person      0.073403
dtype: float64

In [100]:
dff.std() * 100000 *(197042.123807/17731.447045)

salary                       197042.123811
to_messages                  164137.025681
deferral_payments            128340.973846
total_payments                94929.204628
exercised_stock_options      155147.105228
bonus                        171294.205429
restricted_stock             129039.989417
shared_receipt_with_poi      216835.346660
restricted_stock_deferred     83905.540900
total_stock_value            139918.865648
expenses                     220098.204152
loan_advances                 92614.671822
from_messages                112198.946636
other                        121326.388857
from_this_person_to_poi      145573.458109
director_fees                252299.472917
deferred_income              192169.105231
long_term_incentive          148410.634706
from_poi_to_this_person      156326.995769
dtype: float64

In [30]:
df = pd.DataFrame(feature, columns = features_list[1:])

In [78]:
197042/8660.

22.753117782909932

In [44]:
df.std()

salary                        197042.123807
to_messages                     2237.564816
deferral_payments             754101.302578
total_payments               8846594.382873
exercised_stock_options      4795513.145239
bonus                        1233155.255938
restricted_stock             2016572.388715
shared_receipt_with_poi         1077.290736
restricted_stock_deferred    1301983.390377
total_stock_value            6189018.075043
expenses                       45309.303038
loan_advances                6794471.778940
from_messages                   1450.675239
other                        1131068.130972
from_this_person_to_poi           79.778266
director_fees                  31300.575144
deferred_income               606011.135120
long_term_incentive           687182.567651
from_poi_to_this_person           74.276769
dtype: float64

In [37]:
df.mean()

salary                        185446.034722
to_messages                     1238.555556
deferral_payments             222089.555556
total_payments               2259057.125000
exercised_stock_options      2075801.979167
bonus                         675997.354167
restricted_stock              868536.291667
shared_receipt_with_poi          702.611111
restricted_stock_deferred      73417.902778
total_stock_value            2909785.611111
expenses                       35375.340278
loan_advances                 582812.500000
from_messages                    363.583333
other                         297260.090278
from_this_person_to_poi           24.625000
director_fees                   9980.319444
deferred_income              -193683.270833
long_term_incentive           336957.833333
from_poi_to_this_person           38.756944
dtype: float64

In [32]:
### check on grid score
gridScoreReader(tuning_score_kneighbors)

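| index | leaf_size | n_neighbors | mean | std |
|---|---|---|---|---|
| 0 | 2 | 1 | 0.19 | 0.3749 |
| 1 | 2 | 3 | 0.16 | 0.3574 |
| 2 | 2 | 10 | 0.0 | 0.0 |
| 3 | 5 | 1 | 0.19 | 0.3749 |
| 4 | 5 | 3 | 0.16 | 0.3574 |
| 5 | 5 | 10 | 0.0 | 0.0 |
| 6 | 10 | 1 | 0.19 | 0.3749 |
| 7 | 10 | 3 | 0.16 | 0.3574 |
| 8 | 10 | 10 | 0.0 | 0.0 |
| 9 | 30 | 1 | 0.19 | 0.3749 |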


## Tuning on SelectKBest

In [33]:
### set the parameters
params_kbest = {"k_best__k": [1,2,3,4,5,6,7,8,9,10]}

### fit and search
estimator = GridSearchCV(pipeline, params_kbest, scoring='f1', cv=sss)
estimator.fit(features_train, labels_train)

### extract scores
tuning_score_kbest = estimator.grid_scores_

### get the best estimator
clf = estimator.best_estimator_

### check the model performance
evaluateModel(clf.predict(features_test), labels_test)

    Accuracy score: 0.931
    F1 score: 0.5
    Precision score: 0.5
    Recall score: 0.5


(0.931, 0.5, 0.5, 0.5)

In [34]:
### extract algorithm
k_best = clf.steps[0][1]

### get the score
k_best_result = zip(features_list[1:], k_best.scores_, k_best.get_support())
k_best_result.sort(key=lambda value:value[1], reverse=True)

In [35]:
k_best_result

[('bonus', 51.120605903941218, True),
 ('salary', 24.378837544422296, True),
 ('exercised_stock_options', 21.111725009619182, True),
 ('total_stock_value', 20.87860909083037, True),
 ('long_term_incentive', 20.516440353312195, False),
 ('deferred_income', 18.499021817455969, False),
 ('shared_receipt_with_poi', 18.477569072515099, False),
 ('total_payments', 12.938116760117209, False),
 ('restricted_stock', 11.515986746342593, False),
 ('from_poi_to_this_person', 9.8344055580294469, False),
 ('loan_advances', 9.6499650560932189, False),
 ('from_this_person_to_poi', 6.8731239136060633, False),
 ('to_messages', 5.8210008299837321, False),
 ('other', 3.9924191837478231, False),
 ('expenses', 2.5639196153701413, False),
 ('director_fees', 1.2871143874353277, False),
 ('restricted_stock_deferred', 0.56816385208855003, False),
 ('deferral_payments', 0.0013739864310871243, False),
 ('from_messages', 1.6042692443149132e-05, False)]

In [36]:
gridScoreReader(tuning_score_kbest)

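| index | k | mean | std |
|---|---|---|---|
| 0 | 1 | 0.4187 | 0.4331 |
| 1 | 2 | 0.4103 | 0.4239 |
| 2 | 3 | 0.3983 | 0.4284 |
| 3 | 4 | 0.5031 | 0.439 |
| 4 | 5 | 0.4613 | 0.4504 |
| 5 | 6 | 0.2693 | 0.3975 |
| 6 | 7 | 0.2462 | 0.3895 |
| 7 | 8 | 0.2116 | 0.3675 |
| 8 | 9 | 0.24 | 0.3833 |
| 9 | 10 | 0.2025 | 0.3581 |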
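
For readers on a recent scikit-learn, the same kind of search over `k` could be sketched as below. This is not the notebook's exact pipeline: the data is synthetic, the fold count is reduced to keep it fast, and it relies on the newer `sklearn.model_selection` module, where `cv_results_` takes the place of the `grid_scores_` attribute used above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the Enron features (hypothetical data).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           weights=[0.85, 0.15], random_state=42)

pipeline = Pipeline([
    ("k_best", SelectKBest(f_classif)),
    ("k_neighbors", KNeighborsClassifier(n_neighbors=1, weights="distance")),
])

# Stratified shuffle splits with 10% of the samples held out per fold.
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.1, random_state=42)
search = GridSearchCV(pipeline, {"k_best__k": [1, 2, 3, 4, 5]},
                      scoring="f1", cv=cv)
search.fit(X, y)

# Mean and std of the per-fold F1 score for every value of k.
for k, mean, std in zip(search.cv_results_["param_k_best__k"],
                        search.cv_results_["mean_test_score"],
                        search.cv_results_["std_test_score"]):
    print(k, round(mean, 4), round(std, 4))
```

When each test fold contains only a handful of positives, the per-fold F1 score tends to land near 0 or 1, which may explain why the std column in the table above is almost as large as the mean.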


## Check on New Features

In [37]:
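### extract new training and testing sets for checking
features_train_new, features_test_new, labels_train_new, labels_test_new = trainTestSplit(data_dict, new_features_list)

### copy estimator (clone returns an unfitted copy, so refitting it leaves clf untouched)
from sklearn.base import clone
new_clf = clone(clf)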

### fit the model
new_clf.fit(features_train_new, labels_train_new)

### check the new features's impact on the model
evaluateModel(new_clf.predict(features_test_new), labels_test_new)

    Accuracy score: 0.931
    F1 score: 0.5
    Precision score: 0.5
    Recall score: 0.5


(0.931, 0.5, 0.5, 0.5)

In [38]:
### get all the features
new_kbest = new_clf.steps[0][1]
new_kbest_result = zip(new_features_list[1:], new_kbest.scores_, new_kbest.get_support())
new_kbest_result.sort(key=lambda value:value[1], reverse=True)
new_kbest_result

[('bonus', 51.120605903941218, True),
 ('salary', 24.378837544422296, True),
 ('exercised_stock_options', 21.111725009619182, True),
 ('total_stock_value', 20.87860909083037, True),
 ('long_term_incentive', 20.516440353312195, False),
 ('deferred_income', 18.499021817455969, False),
 ('shared_receipt_with_poi', 18.477569072515099, False),
 ('total_payments', 12.938116760117209, False),
 ('restricted_stock', 11.515986746342593, False),
 ('from_poi_to_this_person', 9.8344055580294469, False),
 ('loan_advances', 9.6499650560932189, False),
 ('from_this_person_to_poi', 6.8731239136060633, False),
 ('to_messages', 5.8210008299837321, False),
 ('other', 3.9924191837478231, False),
 ('poi_to_ratio', 3.353610953858202, False),
 ('poi_from_ratio', 2.6016157011385932, False),
 ('expenses', 2.5639196153701413, False),
 ('director_fees', 1.2871143874353277, False),
 ('restricted_stock_deferred', 0.56816385208855003, False),
 ('stock_salary_ratio', 0.092387770980377898, False),
 ('deferral_payments

# Final Solution

After comparing feature selection methods and classification methods and carefully tuning their parameters, the best model turned out to use SelectKBest for feature selection and KNeighbors for classification. The parameters are as follows:
- SelectKBest, k = 4.
- KNeighbors, leaf_size = 2, n_neighbors = 1.


In [39]:
### check on the parameters
clf.steps

[('k_best', SelectKBest(k=4, score_func=<function f_classif at 0x106be5050>)),
 ('k_neighbors',
  KNeighborsClassifier(algorithm='auto', leaf_size=2, metric='minkowski',
             metric_params=None, n_neighbors=1, p=2, weights='distance'))]

In [40]:
### prepare for the test
clf = clf
my_dataset = data_dict
features_list = features_list

### dump for testing
dump_classifier_and_data(clf, my_dataset, features_list)