<center> <h1> Enron Email Person of Interest Identification </h1> </center>

Enron was an energy company and one of the largest companies in the US when it filed for bankruptcy in December 2001 amid widespread corporate fraud.  Emails and financial data entered the public record after the federal investigation.  I will use this data to see whether I can build an algorithm that helps identify persons of interest (POIs) in the Enron scandal.

(The Enron email + financial dataset, along with several provisional functions used in this report, is available on [Udacity's GitHub](https://github.com/udacity/ud120-projects).)

In [1]:
import sys
import pickle
sys.path.append("../tools/")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from time import time

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale
import tester

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest



<h2> Data Investigation and Cleaning </h2>

The first step is to look over the data, figure out what it contains, and check it for errors.  The data is provided as a Python dictionary, which I will convert to a pandas DataFrame for easier manipulation.

In [2]:
### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
payment_columns = ['salary',
                   'bonus',
                   'long_term_incentive',
                   'deferred_income',
                   'deferral_payments',
                   'loan_advances',
                   'other',
                   'expenses',
                   'director_fees',
                   'total_payments']

stock_columns = ['exercised_stock_options',
                 'restricted_stock',
                 'restricted_stock_deferred',
                 'total_stock_value']

email_columns = ['to_messages',
                 'from_messages',
                 'from_poi_to_this_person',
                 'from_this_person_to_poi',
                 'shared_receipt_with_poi']             
              
features_list = ['poi'] + payment_columns + stock_columns + email_columns

In [3]:
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:  # pickles must be read in binary mode
    data_dict = pickle.load(data_file)

# transfer dictionary to dataframe for easier data manipulation
df = pd.DataFrame.from_dict(data_dict, orient='index')
# replace all 'NaN' with numpy 'nan'
df = df.replace('NaN', np.nan)
# reorder dataframe columns to match features_list
df = df[features_list]

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 20 columns):
poi                          146 non-null bool
salary                       95 non-null float64
bonus                        82 non-null float64
long_term_incentive          66 non-null float64
deferred_income              49 non-null float64
deferral_payments            39 non-null float64
loan_advances                4 non-null float64
other                        93 non-null float64
expenses                     95 non-null float64
director_fees                17 non-null float64
total_payments               125 non-null float64
exercised_stock_options      102 non-null float64
restricted_stock             110 non-null float64
restricted_stock_deferred    18 non-null float64
total_stock_value            126 non-null float64
to_messages                  86 non-null float64
from_messages                86 non-null float64
from_poi_to_this_person      86 non-null float64
from_this_person_to_poi      86 non-null float64
shared_receipt_with_poi      86 non-null float64

In [5]:
df.head()

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_poi_to_this_person,from_this_person_to_poi,shared_receipt_with_poi
ALLEN PHILLIP K,False,201955.0,4175000.0,304805.0,-3081055.0,2869717.0,,152.0,13868.0,,4484442.0,1729541.0,126027.0,-126027.0,1729541.0,2902.0,2195.0,47.0,65.0,1407.0
BADUM JAMES P,False,,,,,178980.0,,,3486.0,,182466.0,257817.0,,,257817.0,,,,,
BANNANTINE JAMES M,False,477.0,,,-5104.0,,,864523.0,56301.0,,916197.0,4046157.0,1757552.0,-560222.0,5243487.0,566.0,29.0,39.0,0.0,465.0
BAXTER JOHN C,False,267102.0,1200000.0,1586055.0,-1386055.0,1295738.0,,2660303.0,11200.0,,5634343.0,6680544.0,3942714.0,,10623258.0,,,,,
BAY FRANKLIN R,False,239671.0,400000.0,,-201641.0,260455.0,,69.0,129142.0,,827696.0,,145796.0,-82782.0,63014.0,,,,,


In [6]:
df.describe()

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_poi_to_this_person,from_this_person_to_poi,shared_receipt_with_poi
count,95.0,82.0,66.0,49.0,39.0,4.0,93.0,95.0,17.0,125.0,102.0,110.0,18.0,126.0,86.0,86.0,86.0,86.0,86.0
mean,562194.3,2374235.0,1470361.0,-1140475.0,1642674.0,41962500.0,919065.0,108728.9,166804.9,5081526.0,5987054.0,2321741.0,166410.6,6773957.0,2073.860465,608.790698,64.895349,41.232558,1176.465116
std,2716369.0,10713330.0,5942759.0,4025406.0,5161930.0,47083210.0,4589253.0,533534.8,319891.4,29061720.0,31062010.0,12518280.0,4201494.0,38957770.0,2582.700981,1841.033949,86.979244,100.073111,1178.317641
min,477.0,70000.0,69223.0,-27992890.0,-102500.0,400000.0,2.0,148.0,3285.0,148.0,3285.0,-2604490.0,-7576788.0,-44093.0,57.0,12.0,0.0,0.0,2.0
25%,211816.0,431250.0,281250.0,-694862.0,81573.0,1600000.0,1215.0,22614.0,98784.0,394475.0,527886.2,254018.0,-389621.8,494510.2,541.25,22.75,10.0,1.0,249.75
50%,259996.0,769375.0,442035.0,-159792.0,227449.0,41762500.0,52382.0,46950.0,108579.0,1101393.0,1310814.0,451740.0,-146975.0,1102872.0,1211.0,41.0,35.0,8.0,740.5
75%,312117.0,1200000.0,938672.0,-38346.0,1002672.0,82125000.0,362096.0,79952.5,113784.0,2093263.0,2547724.0,1002370.0,-75009.75,2949847.0,2634.75,145.5,72.25,24.75,1888.25
max,26704230.0,97343620.0,48521930.0,-833.0,32083400.0,83925000.0,42667590.0,5235198.0,1398517.0,309886600.0,311764000.0,130322300.0,15456290.0,434509500.0,15149.0,14368.0,528.0,609.0,5521.0


From the information gleaned above we can see that the poi field is a boolean (True or False) and all the other values are floats.  There are 146 rows, each representing a different person.

According to the [official pdf documentation](https://github.com/udacity/ud120-projects/blob/master/final_project/enron61702insiderpay.pdf), the NaNs in the financial fields are actually 0, not unknown quantities.  The NaNs in the email data, however, are unknown information.  I will replace the NaNs in the financial data with 0 and fill in the NaNs in the email data with the column mean, computed separately for POIs and non-POIs.

In [7]:
# Fill in the NaN payment and stock values with zero 
df[payment_columns] = df[payment_columns].fillna(0)
df[stock_columns] = df[stock_columns].fillna(0)

# Split into POI and non-POI dataframes (copies, so the assignments below
# do not trigger a SettingWithCopyWarning)
df_poi = df[df["poi"]].copy()
df_nonpoi = df[~df["poi"]].copy()
# Fill in the NaN email values with the column mean of each group
df_poi[email_columns] = df_poi[email_columns].fillna(df_poi[email_columns].mean())
df_nonpoi[email_columns] = df_nonpoi[email_columns].fillna(df_nonpoi[email_columns].mean())
# Recombine the POI and non-POI dataframes
df = pd.concat([df_poi, df_nonpoi])


We can check for financial errors easily by testing whether the individual financial columns add up to the corresponding total column, either total payments or total stock value.  If we find any mismatches we will enter the correct data from the [official pdf documentation](https://github.com/udacity/ud120-projects/blob/master/final_project/enron61702insiderpay.pdf).
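Since the check compares floating-point sums, a tolerance-based comparison can be a safer variant than exact equality. The following is a minimal sketch of that idea in plain Python. The helper name `find_total_mismatches` and the two records are made up for illustration and are not taken from the Enron data.

```python
import math


def find_total_mismatches(records, component_keys, total_key, tol=1e-6):
    """Return the names whose components do not sum to the stated total."""
    mismatched = []
    for name, row in records.items():
        component_sum = sum(row[key] for key in component_keys)
        # Compare with a small absolute tolerance instead of exact equality
        if not math.isclose(component_sum, row[total_key], abs_tol=tol):
            mismatched.append(name)
    return sorted(mismatched)


# Made-up records for illustration only (not rows from the Enron data)
toy_records = {
    "PERSON A": {"salary": 100.0, "bonus": 50.0, "total_payments": 150.0},
    "PERSON B": {"salary": 100.0, "bonus": 50.0, "total_payments": 500.0},
}
mismatches = find_total_mismatches(toy_records, ["salary", "bonus"], "total_payments")
print(mismatches)
```

Only PERSON B is reported here, because its components sum to 150 while the stated total is 500.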

In [8]:
# Find any rows that don't add up to the total_payments or total_stock_value.
# These will be errors.
errors_payment_columns = (df[df[payment_columns[:-1]].sum(axis='columns') != df['total_payments']])
errors_stock_columns = (df[df[stock_columns[:-1]].sum(axis='columns') != df['total_stock_value']])

In [9]:
errors_payment_columns

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_poi_to_this_person,from_this_person_to_poi,shared_receipt_with_poi
BELFER ROBERT,False,0.0,0.0,0.0,0.0,-102500.0,0.0,0.0,0.0,3285.0,102500.0,3285.0,0.0,44093.0,-44093.0,2007.111111,668.763889,58.5,36.277778,1058.527778
BHATNAGAR SANJAY,False,0.0,0.0,0.0,0.0,0.0,0.0,137864.0,0.0,137864.0,15456290.0,2604490.0,-2604490.0,15456290.0,0.0,523.0,29.0,0.0,1.0,463.0


In [10]:
errors_stock_columns

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_poi_to_this_person,from_this_person_to_poi,shared_receipt_with_poi
BELFER ROBERT,False,0.0,0.0,0.0,0.0,-102500.0,0.0,0.0,0.0,3285.0,102500.0,3285.0,0.0,44093.0,-44093.0,2007.111111,668.763889,58.5,36.277778,1058.527778
BHATNAGAR SANJAY,False,0.0,0.0,0.0,0.0,0.0,0.0,137864.0,0.0,137864.0,15456290.0,2604490.0,-2604490.0,15456290.0,0.0,523.0,29.0,0.0,1.0,463.0


In [11]:
# Correct the 2 erroneous rows with values taken from the official financial PDF,
# listed in the order of payment_columns followed by stock_columns
fixed_bel_rob = [0, 0, 0, -102500, 0, 0, 0, 3285, 102500, 3285, 0, 44093, -44093, 0]
fixed_bha_san = [0, 0, 0, 0, 0, 0, 0, 137864, 0, 137864, 15456290, 2604490, -2604490, 15456290]
# Put the fixed values into the correct rows, selecting the columns by name
financial_columns = payment_columns + stock_columns
df.loc["BELFER ROBERT", financial_columns] = fixed_bel_rob
df.loc["BHATNAGAR SANJAY", financial_columns] = fixed_bha_san

In [12]:
# Check if there are any more errors in payment_columns
len(df[df[payment_columns[:-1]].sum(axis='columns') != df['total_payments']])

0

In [13]:
# Check if there are any more errors in the stock_columns
len(df[df[stock_columns[:-1]].sum(axis='columns') != df['total_stock_value']])

0

<h2> Remove Outliers </h2>

In looking for and removing outliers I will focus mainly on non-POIs, because I do not want to remove any persons of interest from the dataset.  I will use the interquartile range (IQR): a value counts as an outlier if it is lower than the first quartile minus 1.5 times the IQR or higher than the third quartile plus 1.5 times the IQR.  I will count the outlying values for each non-POI and see whether that person needs to be removed.
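To make the rule concrete, here is a small sketch on a single made-up column, written in plain Python 3 rather than pandas. The helper name `iqr_fences` and the values are assumptions for illustration. The `inclusive` method interpolates linearly between data points, which is the same interpolation `DataFrame.quantile` uses by default.

```python
from statistics import quantiles


def iqr_fences(values):
    """Return the (lower, upper) fences at 1.5 * IQR beyond the quartiles."""
    # method="inclusive" treats the sample minimum and maximum as the
    # 0th and 100th percentiles and interpolates linearly in between
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Made-up values for illustration; 100 is the obvious outlier
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = iqr_fences(toy_values)
flagged = [v for v in toy_values if v < lower or v > upper]
print(lower, upper, flagged)
```

For these values the quartiles are 3 and 7, so the fences sit at -3 and 13 and only the value 100 is flagged.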

In [14]:
### Task 2: Remove outliers
IQR = df.quantile(q=0.75) - df.quantile(q=0.25)
first_quartile = df.quantile(q=0.25)
third_quartile = df.quantile(q=0.75)
outliers = df[(df>(third_quartile + 1.5*IQR) ) | (df<(first_quartile - 1.5*IQR) )].count(axis=1)

# Keep only non-POIs in the outlier counts
poi = df[df["poi"]].index
outliers = outliers.drop(poi)

outliers.sort_values(axis=0, ascending=False, inplace=True)
outliers.head(7)

TOTAL                 14
FREVERT MARK A        12
BAXTER JOHN C          8
LAVORATO JOHN J        8
KEAN STEVEN J          7
WHALLEY LAWRENCE G     7
HAEDICKE MARK E        7
dtype: int64

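TOTAL has the most outliers, and since it is an aggregate row rather than a person I will remove it.  THE TRAVEL AGENCY IN THE PARK is not a person either, so I will remove it as well.  I will also remove the four non-POIs with the most outliers: FREVERT MARK A, BAXTER JOHN C, LAVORATO JOHN J, and KEAN STEVEN J.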

In [15]:
# Remove TOTAL and THE TRAVEL AGENCY IN THE PARK (not people) and the four non-POIs with the most outliers
df.drop(axis=0, labels=['TOTAL', 'THE TRAVEL AGENCY IN THE PARK','FREVERT MARK A', 'BAXTER JOHN C', 'LAVORATO JOHN J', 'KEAN STEVEN J'], inplace=True)

In [16]:
len(df)

140

In [17]:
df["poi"].value_counts()

False    122
True      18
Name: poi, dtype: int64

In [18]:
df.isnull().sum().sum()

0L

In [19]:
df[df==0].count().sum()

1150L

This leaves me with 140 individuals (122 non-POIs and 18 POIs).  There are now no nulls in my data, and there are 1150 zeros in the dataset I will be using.

I will then scale the data using scikit-learn's scale function, which standardizes each feature to zero mean and unit variance.  Scaling puts the features on a common, dimensionless footing so that features with larger units do not have an undue influence on the classifier, as they would if the classifier uses some sort of distance measurement as a similarity metric.
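As a rough illustration of what standardization does to a single column, the sketch below subtracts the mean and divides by the population standard deviation. The helper name `standardize` and both columns are made up for illustration. It is a plain-Python stand-in, not the scikit-learn implementation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift to zero mean and rescale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up columns on very different scales
toy_salary = [200000.0, 250000.0, 300000.0]
toy_messages = [10.0, 20.0, 30.0]

scaled_salary = standardize(toy_salary)
scaled_messages = standardize(toy_messages)
print(scaled_salary)
print(scaled_messages)
```

Although the raw columns differ by four orders of magnitude, both end up with the same standardized values, so neither dominates a distance calculation.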

I will first train on the initial features of the dataset to get a baseline and observe the performance of each algorithm before I start to tune.  I selected DecisionTreeClassifier, GaussianNB, KNeighborsClassifier, and the Support Vector Classifier (SVC).  I will run all of them with their default parameters.

I will be looking at precision, recall, and F1 score metrics to determine the best algorithm that will find the person of interest. 

Precision is the fraction of the people the algorithm predicts to be persons of interest who truly are persons of interest.  Mathematically, precision is defined as

$$ \text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} $$

Recall is the fraction of the actual persons of interest that the algorithm identifies.  Mathematically, recall is defined as

$$ \text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} $$

Precision is also known as the positive predictive value, while recall is called the sensitivity of the classifier. A combined measure of precision and recall is the F1 score, which is the harmonic mean of the two. Mathematically, the F1 score is defined as

$$ \text{F1 score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} $$
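As a worked example of these three formulas, the sketch below recomputes the metrics from the confusion counts that the tester reports for the default decision tree in this notebook (724 true positives, 1071 false positives, 1276 false negatives). The helper name `precision_recall_f1` is my own and is not part of tester.py.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = true_pos / float(true_pos + false_pos)
    recall = true_pos / float(true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported by the tester for the default DecisionTreeClassifier
precision, recall, f1 = precision_recall_f1(724, 1071, 1276)
print(round(precision, 5), round(recall, 5), round(f1, 5))
```

The rounded values, 0.40334, 0.362 and 0.38155, agree with the figures in the tester output. The sketch does not guard against zero denominators, which would need handling for a classifier that makes no positive predictions.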

In [20]:
# Print a dictionary of classifier metrics as a DataFrame
def print_precision(dic):
    metrics_df = pd.DataFrame.from_dict(dic, orient='columns')
    print(metrics_df)

In [21]:
### Store to my_dataset for easy export below.
# my_dataset = data_dict
# Scale the dataset and send it back to a dictionary
scaled_df = df.copy()
scaled_df.iloc[:,1:] = scale(scaled_df.iloc[:,1:])
my_dataset = scaled_df.to_dict(orient='index')

df_dtc = {}
clf = DecisionTreeClassifier()
tester.dump_classifier_and_data(clf, my_dataset, features_list)
df_dtc["dtc_1"] = tester.main()

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
	Accuracy: 0.83236	Precision: 0.40334	Recall: 0.36200	F1: 0.38155	F2: 0.36958
	Total predictions: 14000	True positives:  724	False positives: 1071	False negatives: 1276	True negatives: 10929



In [22]:
# Create the classifier, GaussianNB has no parameters to tune
df_gnb = {}
clf = GaussianNB()
tester.dump_classifier_and_data(clf, my_dataset, features_list)
df_gnb['gnb_1'] = tester.main()

GaussianNB(priors=None)
	Accuracy: 0.66936	Precision: 0.25653	Recall: 0.69250	F1: 0.37437	F2: 0.51683
	Total predictions: 14000	True positives: 1385	False positives: 4014	False negatives:  615	True negatives: 7986



In [23]:
df_knc = {}
clf = KNeighborsClassifier()
tester.dump_classifier_and_data(clf, my_dataset, features_list)
df_knc['knc_1'] = tester.main()

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
	Accuracy: 0.83514	Precision: 0.01875	Recall: 0.00300	F1: 0.00517	F2: 0.00361
	Total predictions: 14000	True positives:    6	False positives:  314	False negatives: 1994	True negatives: 11686



In [24]:
df_svc = {}
clf = SVC()
tester.dump_classifier_and_data(clf, my_dataset, features_list)
df_svc['svc_1'] = tester.main()

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.85714	Precision: 0.50000	Recall: 0.00150	F1: 0.00299	F2: 0.00187
	Total predictions: 14000	True positives:    3	False positives:    3	False negatives: 1997	True negatives: 11997



In [25]:
print_precision(df_dtc)
print_precision(df_gnb)
print_precision(df_knc)
print_precision(df_svc)

              dtc_1
Accuracy   0.832357
F1         0.381555
Precision  0.403343
Recall     0.362000
              gnb_1
Accuracy   0.669357
F1         0.374375
Precision  0.256529
Recall     0.692500
              knc_1
Accuracy   0.835143
F1         0.005172
Precision  0.018750
Recall     0.003000
              svc_1
Accuracy   0.857143
F1         0.002991
Precision  0.500000
Recall     0.001500


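Looking at the different algorithms' initial results, the decision tree classifier performed best overall: it has the highest F1 score (0.382) and is the only algorithm with both precision and recall above 0.3.  GaussianNB has higher recall (0.693) but low precision (0.257), while KNeighborsClassifier and SVC identify almost no POIs (recall below 0.01).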

<h2> Create new features </h2>

I will create four new features that should help performance.  Three are email ratios: from_poi_ratio (from_poi_to_this_person divided by from_messages), to_poi_ratio (from_this_person_to_poi divided by to_messages), and shared_poi_ratio (shared_receipt_with_poi divided by to_messages).  I believe that a person's POI-related email expressed as a ratio will be more useful than the raw counts.  The fourth, salary_ratio, is the individual's salary as a share of the total payments received from the company.  I will fill all NaNs in these new columns with 0 and then see whether the new features improve the metrics of the four algorithms.
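A ratio feature is undefined when its denominator is zero, which is why the new columns need a fill value. The sketch below shows one way to build that rule into the division itself. The helper name `safe_ratio` and the two people are made up for illustration.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    if denominator == 0:
        return 0.0
    return numerator / float(denominator)


# Made-up people for illustration only (not rows from the Enron data)
toy_people = {
    "PERSON A": {"salary": 250000.0, "total_payments": 1000000.0},
    "PERSON B": {"salary": 0.0, "total_payments": 0.0},
}
salary_ratio = {
    name: safe_ratio(row["salary"], row["total_payments"])
    for name, row in toy_people.items()
}
print(salary_ratio)
```

PERSON B has no payments at all, so the ratio falls back to 0 instead of producing a NaN.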

In [26]:
### Task 3: Create new feature(s)
df["from_poi_ratio"] = df["from_poi_to_this_person"] / df["from_messages"]
df["to_poi_ratio"] = df["from_this_person_to_poi"] / df["to_messages"]
df["shared_poi_ratio"] = df["shared_receipt_with_poi"] / df["to_messages"]
df["salary_ratio"] = df["salary"] / df["total_payments"]

features_list.append('to_poi_ratio')
features_list.append('from_poi_ratio')
features_list.append('shared_poi_ratio')
features_list.append('salary_ratio')

df.fillna(value=0, inplace=True)
# Replace any infinities produced by division by zero
df = df.replace([np.inf, -np.inf], 0)

# Scale the dataset and send it back to a dictionary
scaled_df = df.copy()
scaled_df.iloc[:,1:] = scale(scaled_df.iloc[:,1:])
my_dataset = scaled_df.to_dict(orient='index')

<h3> Validation </h3>

To validate the algorithms I will use the cross-validation in the tester.py script.  Cross-validation splits the data multiple times, each time into a different training set and testing set.  For each split, the classifier is fit on the training set and evaluated on the testing set, and the results are aggregated over all the splits.  This keeps the classifier from being tested on the same data it was trained on and, on a dataset this small, gives a more stable estimate than a single split would.
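The splitting step can be sketched in plain Python as follows. This is a simplified stand-in written for illustration, not scikit-learn's StratifiedShuffleSplit and not the code in tester.py. The helper name, the seed, and the 10% test fraction are assumptions. The labels mirror the class balance of the cleaned dataset (18 POIs, 122 non-POIs).

```python
import random


def stratified_shuffle_split(labels, n_splits, test_fraction, seed=42):
    """Yield (train_idx, test_idx) pairs that keep the class ratio in each test set."""
    rng = random.Random(seed)
    # Group the row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Take the same fraction of every class for the test set
            n_test = max(1, int(round(len(shuffled) * test_fraction)))
            test_idx.extend(shuffled[:n_test])
            train_idx.extend(shuffled[n_test:])
        yield sorted(train_idx), sorted(test_idx)


# Same class balance as the cleaned dataset: 18 POIs and 122 non-POIs
toy_labels = [1] * 18 + [0] * 122
splits = list(stratified_shuffle_split(toy_labels, n_splits=5, test_fraction=0.1))
for train_idx, test_idx in splits:
    n_poi_in_test = sum(toy_labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), n_poi_in_test)
```

Every split here holds out 14 people, 2 of them POIs. The tester's 14000 total predictions and 2000 actual positives would be consistent with 1000 such splits, although that fold count is an inference from the output rather than something stated in this notebook.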

In [27]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

# Create and test the Decision Tree Classifier
clf = DecisionTreeClassifier()
tester.dump_classifier_and_data(clf, my_dataset, features_list)
df_dtc['dtc_2'] = tester.main()

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
	Accuracy: 0.83693	Precision: 0.41073	Recall: 0.32550	F1: 0.36318	F2: 0.33959
	Total predictions: 14000	True positives:  651	False positives:  934	False negatives: 1349	True negatives: 11066



In [28]:
# Create and test the Gaussian Naive Bayes Classifier
clf = GaussianNB()
tester.dump_classifier_and_data(clf, my_dataset, features_list)
df_gnb['gnb_2'] = tester.main();

GaussianNB(priors=None)
	Accuracy: 0.68293	Precision: 0.26398	Recall: 0.68200	F1: 0.38063	F2: 0.51796
	Total predictions: 14000	True positives: 1364	False positives: 3803	False negatives:  636	True negatives: 8197



In [29]:
# Create and test the K-Nearest Neighbors Classifier
clf = KNeighborsClassifier()
tester.dump_classifier_and_data(clf, my_dataset, features_list)
df_knc['knc_2'] = tester.main()

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
	Accuracy: 0.85786	Precision: 0.51397	Recall: 0.09200	F1: 0.15606	F2: 0.11007
	Total predictions: 14000	True positives:  184	False positives:  174	False negatives: 1816	True negatives: 11826



In [30]:
# Create and test the Support Vector Classifier
clf = SVC(kernel="linear")
tester.dump_classifier_and_data(clf, my_dataset, features_list)
df_svc['svc_2'] = tester.main()

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.85657	Precision: 0.49625	Recall: 0.26450	F1: 0.34508	F2: 0.29175
	Total predictions: 14000	True positives:  529	False positives:  537	False negatives: 1471	True negatives: 11463



In [31]:
print_precision(df_dtc)
print_precision(df_gnb)
print_precision(df_knc)
print_precision(df_svc)

              dtc_1     dtc_2
Accuracy   0.832357  0.836929
F1         0.381555  0.363180
Precision  0.403343  0.410726
Recall     0.362000  0.325500
              gnb_1     gnb_2
Accuracy   0.669357  0.682929
F1         0.374375  0.380633
Precision  0.256529  0.263983
Recall     0.692500  0.682000
              knc_1     knc_2
Accuracy   0.835143  0.857857
F1         0.005172  0.156064
Precision  0.018750  0.513966
Recall     0.003000  0.092000
              svc_1     svc_2
Accuracy   0.857143  0.856571
F1         0.002991  0.345075
Precision  0.500000  0.496248
Recall     0.001500  0.264500


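Looking at the results after the new features were added, GaussianNB has the highest F1 score (0.381), but its precision (0.264) is below the 0.3 target.  I will carry forward the decision tree classifier and SVC, which have much higher precision (0.411 and 0.496).  Note that the second SVC run also switched from the default rbf kernel to a linear kernel, so its improvement cannot be attributed to the new features alone.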

<h2> Tune Algorithms </h2>

Tuning an algorithm means optimizing its settings (hyperparameters) to achieve the best performance on the given data.  An algorithm is a set of rules that produces a result, and tuning alters how those rules are learned so that they produce better classifications.  You can tune manually by selecting different configurations, cross-validating each one, and keeping the settings that give the highest performance, or you can automate the search with GridSearchCV.  GridSearchCV cross-validates every combination in a user-supplied grid of parameter values and returns the combination with the best score.
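The exhaustive search idea can be sketched without scikit-learn. In the sketch below, `grid_search` and `toy_score` are hypothetical names. `toy_score` is a made-up stand-in for a cross-validated F1 score, so the result says nothing about the Enron data.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for a cross-validated F1 score;
    # it peaks at max_depth=5 and min_samples_split=4
    depth_penalty = 0.01 * abs(params["max_depth"] - 5)
    split_penalty = 0.02 * abs(params["min_samples_split"] - 4)
    return 1.0 - depth_penalty - split_penalty


toy_grid = {"max_depth": [5, 10, 15], "min_samples_split": [2, 4, 6]}
best_params, best_score = grid_search(toy_grid, toy_score)
print(best_params, best_score)
```

The number of combinations is the product of the list lengths (9 here), so the cost grows quickly as parameters are added to the grid.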

I will tune and compare the DecisionTreeClassifier and SVC.  The first thing I will do is find the feature importances of the decision tree.  The higher the score, the more important the feature.  The score is computed as the normalized total reduction of the criterion brought by that feature.
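To show what a "reduction of the criterion" looks like for one split, the sketch below computes the drop in Gini impurity on a made-up node. The helper names and the labels are assumptions for illustration. As I understand it, a feature's importance adds up such reductions over every node that splits on the feature, each weighted by the share of samples reaching that node, and the totals are then normalized to sum to 1.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(parent, left, right):
    """Reduction in Gini impurity achieved by splitting parent into left and right."""
    n = float(len(parent))
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Made-up node with 4 POIs (1) and 4 non-POIs (0)
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]
decrease = impurity_decrease(parent, left, right)
print(decrease)
```

The parent has impurity 0.5 and each child 0.375, so this split reduces the impurity by 0.125.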

In [32]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall 
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info: 
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

# Example starting point. Try investigating other evaluation techniques!
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

In [33]:
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)

tree_importance = clf.feature_importances_
# Pair each importance with its feature name, highest importance first
tree_features = sorted(zip(tree_importance, features_list[1:]), reverse=True)
tree_features

[(0.22631578947368439, 'total_payments'),
 (0.17026292251200423, 'shared_poi_ratio'),
 (0.113953488372093, 'to_poi_ratio'),
 (0.11078811369509041, 'bonus'),
 (0.10631358305776914, 'from_poi_ratio'),
 (0.087519136837501288, 'exercised_stock_options'),
 (0.071220930232558127, 'to_messages'),
 (0.067303479570480967, 'other'),
 (0.04632255624881846, 'restricted_stock'),
 (0.0, 'total_stock_value'),
 (0.0, 'shared_receipt_with_poi'),
 (0.0, 'salary_ratio'),
 (0.0, 'salary'),
 (0.0, 'restricted_stock_deferred'),
 (0.0, 'long_term_incentive'),
 (0.0, 'loan_advances'),
 (0.0, 'from_this_person_to_poi'),
 (0.0, 'from_poi_to_this_person'),
 (0.0, 'from_messages'),
 (0.0, 'expenses'),
 (0.0, 'director_fees'),
 (0.0, 'deferred_income'),
 (0.0, 'deferral_payments')]

In [34]:
n_features = np.arange(1, len(features_list))

# Create a pipeline with feature selection and classification
pipe = Pipeline([('select_features', SelectKBest()),
                 ('classify', DecisionTreeClassifier())
                ])
param = [{'select_features__k': n_features}]

# Use GridSearchCV to find the optimal number of features
clf = GridSearchCV(pipe, param_grid=param, scoring='f1', cv = 10)
clf.fit(features, labels)

# Number of features selected by GridSearchCV
best_params = clf.best_params_
best_k = best_params['select_features__k']
# Fix k at 11, the value used in the rest of this report
best_k = 11



In [35]:
# Create a pipeline with feature selection and classification
pipe = Pipeline([('select_features', SelectKBest(k=best_k)),
                 ('classify', DecisionTreeClassifier())
                ])

# Create and test the Decision Tree Classifier
tester.dump_classifier_and_data(pipe, my_dataset, features_list)
df_dtc['dtc_3'] = tester.main()

Pipeline(memory=None,
     steps=[('select_features', SelectKBest(k=11, score_func=<function f_classif at 0x0855BE70>)), ('classify', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])
	Accuracy: 0.83071	Precision: 0.39118	Recall: 0.33250	F1: 0.35946	F2: 0.34278
	Total predictions: 14000	True positives:  665	False positives: 1035	False negatives: 1335	True negatives: 10965



In [36]:
# Create a pipeline with feature selection and classifier
pipe = Pipeline([('select_features', SelectKBest(k=best_k)),
                 ('classify', DecisionTreeClassifier()),
                ])

# Define the configuration of parameters to test with the Decision Tree Classifier
param = dict(classify__criterion = ['entropy', 'gini'],
             classify__max_depth = [None, 5, 10, 15, 20],
             classify__min_samples_split = [2, 4, 6, 8, 10, 20],
             classify__min_samples_leaf = [1, 2, 3],
             classify__max_features = [None, 'sqrt', 'log2', 'auto'])

# Use GridSearchCV to find the optimal hyperparameters for the classifier
clf = GridSearchCV(pipe, param_grid = param, scoring='f1', cv=10)
clf.fit(features, labels)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('select_features', SelectKBest(k=11, score_func=<function f_classif at 0x0855BE70>)), ('classify', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'classify__min_samples_split': [2, 4, 6, 8, 10, 20], 'classify__max_depth': [None, 5, 10, 15, 20], 'classify__max_features': [None, 'sqrt', 'log2', 'auto'], 'classify__criterion': ['entropy', 'gini'], 'classify__min_samples_leaf': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1', verbose=0)

In [37]:
clf.best_score_

0.5680952380952381

In [38]:
# Get the best algorithm hyperparameters for the Decision Tree
best_params =clf.best_params_
best_params

{'classify__criterion': 'entropy',
 'classify__max_depth': None,
 'classify__max_features': 'auto',
 'classify__min_samples_leaf': 1,
 'classify__min_samples_split': 2}

In [39]:
# Create the classifier with the optimal hyperparameters as found by GridSearchCV
clf = Pipeline([
    ('select_features', SelectKBest(k=11)),
    ('classify', DecisionTreeClassifier(criterion=best_params['classify__criterion'], 
                                        max_depth=best_params['classify__max_depth'], 
                                        max_features=best_params['classify__max_features'], 
                                        min_samples_leaf=best_params['classify__min_samples_leaf'], 
                                        min_samples_split=best_params['classify__min_samples_split']))
])

# Test the Decision Tree Classifier with best parameters using tester.py
tester.dump_classifier_and_data(clf, my_dataset, features_list)
df_dtc['dtc_4'] = tester.main()

Pipeline(memory=None,
     steps=[('select_features', SelectKBest(k=11, score_func=<function f_classif at 0x0855BE70>)), ('classify', DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])
	Accuracy: 0.83900	Precision: 0.42913	Recall: 0.38450	F1: 0.40559	F2: 0.39267
	Total predictions: 14000	True positives:  769	False positives: 1023	False negatives: 1231	True negatives: 10977



In [40]:
print_precision(df_dtc)

              dtc_1     dtc_2     dtc_3     dtc_4
Accuracy   0.832357  0.836929  0.830714  0.839000
F1         0.381555  0.363180  0.359459  0.405591
Precision  0.403343  0.410726  0.391176  0.429129
Recall     0.362000  0.325500  0.332500  0.384500


Using GridSearchCV and SelectKBest, I settled on the 11 best features for the DecisionTreeClassifier and then tuned its hyperparameters.  The tuned classifier (dtc_4) produces an F1 score of 0.4056, a small improvement over the initial F1 score of 0.3816 (dtc_1).

In [41]:
# Create a pipeline containing only the classifier
pipe = Pipeline([('classify', SVC())])

# Define the configuration of parameters to test with the Support Vector Classifier
param = dict(classify__C=[10, 100, 1000, 10000], 
             classify__kernel=['linear', 'rbf', 'poly'])

# Use GridSearchCV to find the optimal hyperparameters for the classifier
clf = GridSearchCV(pipe, param_grid = param, scoring='f1', cv=10)
clf.fit(features, labels)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('classify', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'classify__C': [10, 100, 1000, 10000], 'classify__kernel': ['linear', 'rbf', 'poly']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1', verbose=0)

In [42]:
best_params = clf.best_params_
best_params

{'classify__C': 100, 'classify__kernel': 'linear'}

In [43]:
clf = Pipeline([('classify', SVC(C=best_params['classify__C'],
                                 kernel=best_params['classify__kernel']))])

# Test the Support Vector Classifier with best parameters using tester.py
tester.dump_classifier_and_data(clf, my_dataset, features_list)
df_svc['svc_3'] = tester.main()

Pipeline(memory=None,
     steps=[('classify', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
	Accuracy: 0.85407	Precision: 0.48900	Recall: 0.47800	F1: 0.48344	F2: 0.48016
	Total predictions: 14000	True positives:  956	False positives:  999	False negatives: 1044	True negatives: 11001



Using GridSearchCV for the SVC algorithm, I was able to raise the F1 score from 0.3451 to 0.4834.  I then compared the tuning results for the decision tree and the SVC and concluded that the SVC was superior to the Decision Tree Classifier in every metric, so I will use the SVC as the final algorithm.

In [44]:
print_precision(df_svc)
print_precision(df_dtc)

              svc_1     svc_2     svc_3
Accuracy   0.857143  0.856571  0.854071
F1         0.002991  0.345075  0.483439
Precision  0.500000  0.496248  0.489003
Recall     0.001500  0.264500  0.478000
              dtc_1     dtc_2     dtc_3     dtc_4
Accuracy   0.832357  0.836929  0.830714  0.839000
F1         0.381555  0.363180  0.359459  0.405591
Precision  0.403343  0.410726  0.391176  0.429129
Recall     0.362000  0.325500  0.332500  0.384500


In [45]:
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

tester.dump_classifier_and_data(clf, my_dataset, features_list)