# Project Notes - Machine Learning

The following notebook is a summary of the steps taken in the final project for Intro to Machine Learning.  

### Tools Used
For this project I used the following:

* Jupyter Notebook - for notekeeping and data exploration
* GitHub - project submission
* PyCharm - for Python script coding


### Data Exploration

Data was loaded into the notebook and read into a dataframe to get an initial statistical summary and to spot potential data errors.

The dataset has 21 features, grouped as follows:

financial features in US dollars: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] 

email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (units are generally number of email messages; the notable exception is 'email_address', which is a text string)

POI label: ['poi'] (boolean, represented as integer)


### [Project Questions and Summary](https://github.com/troberts777/C753-Identify-Fraud-from-Enron-Email/blob/main/Project%20Questions.pdf)


In [1]:
# import some libraries
import pickle as pkl
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.lines as mlines



In [2]:
# Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pkl.load(data_file)

# make a pandas dataframe to explore
enron_df = pd.DataFrame.from_dict(data_dict,orient='index')
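
The cell above opens the pickle in text mode (`"r"`), which works under Python 2. Under Python 3, `pickle.load` needs a file opened in binary mode (`"rb"`). The sketch below illustrates the binary-mode pattern with a made-up one-person record written to a temporary file; the name `DOE JANE` and the file name are placeholders, not part of the project data.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real dataset: a dict of people, each a dict of features.
sample = {"DOE JANE": {"salary": 100000, "bonus": "NaN", "poi": False}}

# Write the toy dataset to a temporary file, in binary mode.
path = os.path.join(tempfile.mkdtemp(), "sample_dataset.pkl")
with open(path, "wb") as out_file:
    pickle.dump(sample, out_file)

# Read it back, again in binary mode ("rb"), as Python 3 requires.
with open(path, "rb") as data_file:
    loaded = pickle.load(data_file)

print(loaded["DOE JANE"])
```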

In [3]:
# check the data structure
enron_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 21 columns):
salary                       146 non-null object
to_messages                  146 non-null object
deferral_payments            146 non-null object
total_payments               146 non-null object
exercised_stock_options      146 non-null object
bonus                        146 non-null object
restricted_stock             146 non-null object
shared_receipt_with_poi      146 non-null object
restricted_stock_deferred    146 non-null object
total_stock_value            146 non-null object
expenses                     146 non-null object
loan_advances                146 non-null object
from_messages                146 non-null object
other                        146 non-null object
from_this_person_to_poi      146 non-null object
poi                          146 non-null bool
director_fees                146 non-null object
deferred_income              146 non-null object


**Dataset: 21 columns, 146 rows**

In [4]:
# Checking the dataset
enron_df.head()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,exercised_stock_options,bonus,restricted_stock,shared_receipt_with_poi,restricted_stock_deferred,total_stock_value,...,loan_advances,from_messages,other,from_this_person_to_poi,poi,director_fees,deferred_income,long_term_incentive,email_address,from_poi_to_this_person
ALLEN PHILLIP K,201955.0,2902.0,2869717.0,4484442,1729541.0,4175000.0,126027.0,1407.0,-126027.0,1729541,...,,2195.0,152.0,65.0,False,,-3081055.0,304805.0,phillip.allen@enron.com,47.0
BADUM JAMES P,,,178980.0,182466,257817.0,,,,,257817,...,,,,,False,,,,,
BANNANTINE JAMES M,477.0,566.0,,916197,4046157.0,,1757552.0,465.0,-560222.0,5243487,...,,29.0,864523.0,0.0,False,,-5104.0,,james.bannantine@enron.com,39.0
BAXTER JOHN C,267102.0,,1295738.0,5634343,6680544.0,1200000.0,3942714.0,,,10623258,...,,,2660303.0,,False,,-1386055.0,1586055.0,,
BAY FRANKLIN R,239671.0,,260455.0,827696,,400000.0,145796.0,,-82782.0,63014,...,,,69.0,,False,,-201641.0,,frank.bay@enron.com,


Let's reorganize the data

In [5]:
# Formatting for the output:
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Convert dictionary to DataFrame
enron_df = pd.DataFrame.from_dict(data_dict, orient = 'index', dtype = float)

# reorganize columns
enron_df = enron_df[
['salary',
'bonus',
'long_term_incentive',
'deferred_income',
'deferral_payments',
'loan_advances',
'other',
'expenses',
'director_fees',
'total_payments',
 'exercised_stock_options',
'restricted_stock',
 'restricted_stock_deferred',
 'total_stock_value',
 'email_address',
 'to_messages',
 'shared_receipt_with_poi',
 'from_messages',
 'from_this_person_to_poi',
 'poi',
 'from_poi_to_this_person']]
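
Missing values in `data_dict` are stored as the string `'NaN'`, which is why the columns first load as `object` and why the conversion above asks for `dtype = float`. As a rough illustration of that conversion on a single record, here is a sketch in plain Python 3; the helper name `to_float` is hypothetical, and the three values are taken from the ALLEN PHILLIP K entry.

```python
import math

def to_float(value):
    """Return value as a float, mapping the string 'NaN' to float('nan')."""
    if value == "NaN":
        return float("nan")
    return float(value)

# Three fields from one record; 'NaN' marks a missing value.
record = {"salary": 201955, "loan_advances": "NaN", "bonus": 4175000}
numeric = {name: to_float(value) for name, value in record.items()}

print(numeric["salary"], math.isnan(numeric["loan_advances"]))
```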

In [6]:
# Check the dataset
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

enron_df

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,email_address,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person
ALLEN PHILLIP K,201955.0,4175000.0,304805.0,-3081055.0,2869717.0,,152.0,13868.0,,4484442.0,1729541.0,126027.0,-126027.0,1729541.0,phillip.allen@enron.com,2902.0,1407.0,2195.0,65.0,0.0,47.0
BADUM JAMES P,,,,,178980.0,,,3486.0,,182466.0,257817.0,,,257817.0,,,,,,0.0,
BANNANTINE JAMES M,477.0,,,-5104.0,,,864523.0,56301.0,,916197.0,4046157.0,1757552.0,-560222.0,5243487.0,james.bannantine@enron.com,566.0,465.0,29.0,0.0,0.0,39.0
BAXTER JOHN C,267102.0,1200000.0,1586055.0,-1386055.0,1295738.0,,2660303.0,11200.0,,5634343.0,6680544.0,3942714.0,,10623258.0,,,,,,0.0,
BAY FRANKLIN R,239671.0,400000.0,,-201641.0,260455.0,,69.0,129142.0,,827696.0,,145796.0,-82782.0,63014.0,frank.bay@enron.com,,,,,0.0,
BAZELIDES PHILIP J,80818.0,,93750.0,,684694.0,,874.0,,,860136.0,1599641.0,,,1599641.0,,,,,,0.0,
BECK SALLY W,231330.0,700000.0,,,,,566.0,37172.0,,969068.0,,126027.0,,126027.0,sally.beck@enron.com,7315.0,2639.0,4343.0,386.0,0.0,144.0
BELDEN TIMOTHY N,213999.0,5249999.0,,-2334434.0,2144013.0,,210698.0,17355.0,,5501630.0,953136.0,157569.0,,1110705.0,tim.belden@enron.com,7991.0,5521.0,484.0,108.0,1.0,228.0
BELFER ROBERT,,,,,-102500.0,,,,3285.0,102500.0,3285.0,,44093.0,-44093.0,,,,,,0.0,
BERBERIAN DAVID,216582.0,,,,,,,11892.0,,228474.0,1624396.0,869220.0,,2493616.0,david.berberian@enron.com,,,,,0.0,


In [7]:
# Remove the 2 entries that are not people (a totals row and a business)
data_dict.pop('TOTAL',0)
data_dict.pop('THE TRAVEL AGENCY IN THE PARK',0)

# Remove entry with all NaN data
data_dict.pop('LOCKHART EUGENE E',0)

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}
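
An entry like the one above, where every feature except `poi` is `'NaN'`, can also be found programmatically instead of by eye. The following is a minimal Python 3 sketch on made-up records; `all_missing` and the `PERSON A`/`PERSON B` names are hypothetical.

```python
def all_missing(record, label_key="poi"):
    """True if every feature other than the label is the string 'NaN'."""
    return all(value == "NaN"
               for key, value in record.items()
               if key != label_key)

# Made-up records in the same shape as data_dict.
toy = {
    "PERSON A": {"salary": 201955, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": False},
}

empty_people = [name for name, record in toy.items() if all_missing(record)]
print(empty_people)
```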

In [8]:
# update the dataframe

# Formatting for the output:
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Convert dictionary to DataFrame
enron_df = pd.DataFrame.from_dict(data_dict, orient = 'index', dtype = float)

# reorganize columns
enron_df = enron_df[
['salary',
'bonus',
'long_term_incentive',
'deferred_income',
'deferral_payments',
'loan_advances',
'other',
'expenses',
'director_fees',
'total_payments',
 'exercised_stock_options',
'restricted_stock',
 'restricted_stock_deferred',
 'total_stock_value',
 'email_address',
 'to_messages',
 'shared_receipt_with_poi',
 'from_messages',
 'from_this_person_to_poi',
 'poi',
 'from_poi_to_this_person']]

In [9]:
# second check
enron_df

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,email_address,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person
ALLEN PHILLIP K,201955.0,4175000.0,304805.0,-3081055.0,2869717.0,,152.0,13868.0,,4484442.0,1729541.0,126027.0,-126027.0,1729541.0,phillip.allen@enron.com,2902.0,1407.0,2195.0,65.0,0.0,47.0
BADUM JAMES P,,,,,178980.0,,,3486.0,,182466.0,257817.0,,,257817.0,,,,,,0.0,
BANNANTINE JAMES M,477.0,,,-5104.0,,,864523.0,56301.0,,916197.0,4046157.0,1757552.0,-560222.0,5243487.0,james.bannantine@enron.com,566.0,465.0,29.0,0.0,0.0,39.0
BAXTER JOHN C,267102.0,1200000.0,1586055.0,-1386055.0,1295738.0,,2660303.0,11200.0,,5634343.0,6680544.0,3942714.0,,10623258.0,,,,,,0.0,
BAY FRANKLIN R,239671.0,400000.0,,-201641.0,260455.0,,69.0,129142.0,,827696.0,,145796.0,-82782.0,63014.0,frank.bay@enron.com,,,,,0.0,
BAZELIDES PHILIP J,80818.0,,93750.0,,684694.0,,874.0,,,860136.0,1599641.0,,,1599641.0,,,,,,0.0,
BECK SALLY W,231330.0,700000.0,,,,,566.0,37172.0,,969068.0,,126027.0,,126027.0,sally.beck@enron.com,7315.0,2639.0,4343.0,386.0,0.0,144.0
BELDEN TIMOTHY N,213999.0,5249999.0,,-2334434.0,2144013.0,,210698.0,17355.0,,5501630.0,953136.0,157569.0,,1110705.0,tim.belden@enron.com,7991.0,5521.0,484.0,108.0,1.0,228.0
BELFER ROBERT,,,,,-102500.0,,,,3285.0,102500.0,3285.0,,44093.0,-44093.0,,,,,,0.0,
BERBERIAN DAVID,216582.0,,,,,,,11892.0,,228474.0,1624396.0,869220.0,,2493616.0,david.berberian@enron.com,,,,,0.0,


In [10]:
# Counting the POIs
print("number of poi: {}".format(enron_df[enron_df['poi']==True]['poi'].count()))

number of poi: 18


In [11]:
# Counting the NON-POIs
print("number of non_poi: {}".format(enron_df[enron_df['poi']==False]['poi'].count()))

number of non_poi: 125
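
With 18 POIs against 125 non-POIs the classes are quite imbalanced. The short Python 3 calculation below, using only the two counts printed above, shows the POI share and the accuracy a classifier would get by always predicting "non-POI". It is meant only as a reminder that accuracy alone can look good on this dataset without finding any POI.

```python
n_poi = 18
n_non_poi = 125
total = n_poi + n_non_poi

# Share of POIs, and the accuracy of always answering "non-POI".
poi_share = n_poi / total
baseline_accuracy = n_non_poi / total

print("POI share: {:.1%}".format(poi_share))
print("Always-non-POI accuracy: {:.1%}".format(baseline_accuracy))
```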


In [12]:
# print the POIs
index = enron_df.index

condition_poi = enron_df["poi"] == True


person_poi = index[condition_poi]


person_poi_list = person_poi.tolist()


print ('\n'.join(map(str, person_poi_list)))

BELDEN TIMOTHY N
BOWEN JR RAYMOND M
CALGER CHRISTOPHER F
CAUSEY RICHARD A
COLWELL WESLEY
DELAINEY DAVID W
FASTOW ANDREW S
GLISAN JR BEN F
HANNON KEVIN P
HIRKO JOSEPH
KOENIG MARK E
KOPPER MICHAEL J
LAY KENNETH L
RICE KENNETH D
RIEKER PAULA H
SHELBY REX
SKILLING JEFFREY K
YEAGER F SCOTT


In [13]:
# Change the NaN to 0

enron_df = enron_df.fillna(0)
enron_df.head()

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,email_address,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person
ALLEN PHILLIP K,201955.0,4175000.0,304805.0,-3081055.0,2869717.0,0.0,152.0,13868.0,0.0,4484442.0,1729541.0,126027.0,-126027.0,1729541.0,phillip.allen@enron.com,2902.0,1407.0,2195.0,65.0,0.0,47.0
BADUM JAMES P,0.0,0.0,0.0,0.0,178980.0,0.0,0.0,3486.0,0.0,182466.0,257817.0,0.0,0.0,257817.0,,0.0,0.0,0.0,0.0,0.0,0.0
BANNANTINE JAMES M,477.0,0.0,0.0,-5104.0,0.0,0.0,864523.0,56301.0,0.0,916197.0,4046157.0,1757552.0,-560222.0,5243487.0,james.bannantine@enron.com,566.0,465.0,29.0,0.0,0.0,39.0
BAXTER JOHN C,267102.0,1200000.0,1586055.0,-1386055.0,1295738.0,0.0,2660303.0,11200.0,0.0,5634343.0,6680544.0,3942714.0,0.0,10623258.0,,0.0,0.0,0.0,0.0,0.0,0.0
BAY FRANKLIN R,239671.0,400000.0,0.0,-201641.0,260455.0,0.0,69.0,129142.0,0.0,827696.0,0.0,145796.0,-82782.0,63014.0,frank.bay@enron.com,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# dataframe stats

enron_df.describe()

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person
count,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0
mean,186742.86,680724.608,339314.182,-195037.699,223642.629,586888.112,296806.692,35622.72,10050.112,2272322.587,2090318.077,874609.972,73931.315,2930133.762,1247.217,707.524,366.126,24.797,0.126,39.028
std,197117.072,1236179.688,689013.933,607922.475,756520.789,6818177.362,1135030.646,45370.87,31399.349,8876252.372,4809193.249,2022338.365,1306545.168,6205936.524,2243.006,1079.457,1455.452,80.032,0.333,74.466
min,0.0,0.0,0.0,-3504386.0,-102500.0,0.0,0.0,0.0,0.0,0.0,0.0,-2604490.0,-1787380.0,-44093.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,-37506.0,0.0,0.0,0.0,0.0,0.0,96796.5,0.0,38276.5,0.0,254936.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,210692.0,300000.0,0.0,0.0,0.0,0.0,947.0,21530.0,0.0,966522.0,608750.0,360528.0,0.0,976037.0,383.0,114.0,18.0,0.0,0.0,4.0
75%,270259.0,800000.0,374825.5,0.0,9110.0,0.0,149204.0,53534.5,0.0,1956977.5,1698900.5,775992.0,0.0,2307583.5,1639.0,967.5,54.0,14.0,0.0,41.5
max,1111258.0,8000000.0,5145434.0,0.0,6426990.0,81525000.0,10359729.0,228763.0,137864.0,103559793.0,34348384.0,14761694.0,15456290.0,49110078.0,15149.0,5521.0,14368.0,609.0,1.0,528.0


Why is deferral_payments negative, and which entry is responsible?

In [15]:
# print rows with deferral_payments < 0
index = enron_df.index

condition = enron_df["deferral_payments"]<0

person_ind = index[condition]

person_ind_list = person_ind.tolist()

print (person_ind_list)

['BELFER ROBERT']


In [16]:
# let's see more about this person

enron_df.loc['BELFER ROBERT']

salary                            0.000
bonus                             0.000
long_term_incentive               0.000
deferred_income                   0.000
deferral_payments           -102500.000
loan_advances                     0.000
other                             0.000
expenses                          0.000
director_fees                  3285.000
total_payments               102500.000
exercised_stock_options        3285.000
restricted_stock                  0.000
restricted_stock_deferred     44093.000
total_stock_value            -44093.000
email_address                       NaN
to_messages                       0.000
shared_receipt_with_poi           0.000
from_messages                     0.000
from_this_person_to_poi           0.000
poi                               0.000
from_poi_to_this_person           0.000
Name: BELFER ROBERT, dtype: object

It looks like this record was not imported properly and its values shifted by one column: the -102500.00 belongs to deferred_income, not deferral_payments.
Let's see if this happened anywhere else.

In [17]:
# add columns to check that values are correct for 
# total_payments and total_stock

enron_df['add_total_payments'] = \
enron_df['salary'] + \
enron_df['bonus'] + \
enron_df['long_term_incentive'] + \
enron_df['deferred_income'] + \
enron_df['deferral_payments'] + \
enron_df['loan_advances'] + \
enron_df['other'] +  \
enron_df['expenses'] + \
enron_df['director_fees'] 

enron_df['add_total_stock'] = \
enron_df['exercised_stock_options'] + \
enron_df['restricted_stock'] + \
enron_df['restricted_stock_deferred'] 


enron_df['check_total_payments'] = enron_df['add_total_payments'] == enron_df['total_payments']
enron_df['check_total_stock'] = enron_df['add_total_stock'] == enron_df['total_stock_value']


In [18]:
# Check total_payments
enron_df[enron_df['check_total_payments']==0]

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,email_address,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person,add_total_payments,add_total_stock,check_total_payments,check_total_stock
BELFER ROBERT,0.0,0.0,0.0,0.0,-102500.0,0.0,0.0,0.0,3285.0,102500.0,3285.0,0.0,44093.0,-44093.0,,0.0,0.0,0.0,0.0,0.0,0.0,-99215.0,47378.0,False,False
BHATNAGAR SANJAY,0.0,0.0,0.0,0.0,0.0,0.0,137864.0,0.0,137864.0,15456290.0,2604490.0,-2604490.0,15456290.0,0.0,sanjay.bhatnagar@enron.com,523.0,463.0,29.0,1.0,0.0,0.0,275728.0,15456290.0,False,False


In [19]:
# Check total stock
enron_df[enron_df['check_total_stock']==0]

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,email_address,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person,add_total_payments,add_total_stock,check_total_payments,check_total_stock
BELFER ROBERT,0.0,0.0,0.0,0.0,-102500.0,0.0,0.0,0.0,3285.0,102500.0,3285.0,0.0,44093.0,-44093.0,,0.0,0.0,0.0,0.0,0.0,0.0,-99215.0,47378.0,False,False
BHATNAGAR SANJAY,0.0,0.0,0.0,0.0,0.0,0.0,137864.0,0.0,137864.0,15456290.0,2604490.0,-2604490.0,15456290.0,0.0,sanjay.bhatnagar@enron.com,523.0,463.0,29.0,1.0,0.0,0.0,275728.0,15456290.0,False,False
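
The same consistency check can be written for a single record without pandas. The sketch below is Python 3 and uses the non-zero values of the BHATNAGAR SANJAY row shown above; `totals_consistent` is a hypothetical helper, and `math.isclose` is used in place of `==` as a precaution for float comparison (an assumption of this sketch, since the dollar amounts here are whole numbers).

```python
import math

PAYMENT_PARTS = ["salary", "bonus", "long_term_incentive", "deferred_income",
                 "deferral_payments", "loan_advances", "other", "expenses",
                 "director_fees"]
STOCK_PARTS = ["exercised_stock_options", "restricted_stock",
               "restricted_stock_deferred"]

def totals_consistent(row):
    """Return (payments_ok, stock_ok) for one record; missing parts count as 0."""
    payments = sum(row.get(name, 0.0) for name in PAYMENT_PARTS)
    stock = sum(row.get(name, 0.0) for name in STOCK_PARTS)
    return (math.isclose(payments, row["total_payments"]),
            math.isclose(stock, row["total_stock_value"]))

# Non-zero values of the BHATNAGAR SANJAY row as displayed in this notebook.
bhatnagar = {
    "other": 137864.0,
    "director_fees": 137864.0,
    "total_payments": 15456290.0,
    "exercised_stock_options": 2604490.0,
    "restricted_stock": -2604490.0,
    "restricted_stock_deferred": 15456290.0,
    "total_stock_value": 0.0,
}

payments_ok, stock_ok = totals_consistent(bhatnagar)
print(payments_ok, stock_ok)
```

Both checks fail for this row, which matches the `False` values in the `check_total_payments` and `check_total_stock` columns.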


In [20]:
# Let's look at the other person's data
enron_df.loc['BHATNAGAR SANJAY']

salary                                            0.000
bonus                                             0.000
long_term_incentive                               0.000
deferred_income                                   0.000
deferral_payments                                 0.000
loan_advances                                     0.000
other                                        137864.000
expenses                                          0.000
director_fees                                137864.000
total_payments                             15456290.000
exercised_stock_options                     2604490.000
restricted_stock                           -2604490.000
restricted_stock_deferred                  15456290.000
total_stock_value                                 0.000
email_address                sanjay.bhatnagar@enron.com
to_messages                                     523.000
shared_receipt_with_poi                         463.000
from_messages                                   

In [21]:
# Fix the 2 shifted person entries
from support_functions  import *

# Fix shifted data function
fix_shifted_data(data_dict)

{'ALLEN PHILLIP K': {'bonus': 4175000,
  'deferral_payments': 2869717,
  'deferred_income': -3081055,
  'director_fees': 'NaN',
  'email_address': 'phillip.allen@enron.com',
  'exercised_stock_options': 1729541,
  'expenses': 13868,
  'from_messages': 2195,
  'from_poi_to_this_person': 47,
  'from_this_person_to_poi': 65,
  'loan_advances': 'NaN',
  'long_term_incentive': 304805,
  'other': 152,
  'poi': False,
  'restricted_stock': 126027,
  'restricted_stock_deferred': -126027,
  'salary': 201955,
  'shared_receipt_with_poi': 1407,
  'to_messages': 2902,
  'total_payments': 4484442,
  'total_stock_value': 1729541},
 'BADUM JAMES P': {'bonus': 'NaN',
  'deferral_payments': 178980,
  'deferred_income': 'NaN',
  'director_fees': 'NaN',
  'email_address': 'NaN',
  'exercised_stock_options': 257817,
  'expenses': 3486,
  'from_messages': 'NaN',
  'from_poi_to_this_person': 'NaN',
  'from_this_person_to_poi': 'NaN',
  'loan_advances': 'NaN',
  'long_term_incentive': 'NaN',
  'other': 'NaN'
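
The implementation of `fix_shifted_data` lives in `support_functions` and is not shown in this notebook. Purely as a hypothetical illustration of what such a repair could look like, the Python 3 sketch below assumes the BELFER ROBERT values sit one column to the right of where they belong (consistent with -102500 belonging to `deferred_income`) and moves them one column to the left. `FINANCIAL_ORDER` and `shift_left` are names invented for this sketch.

```python
# Column order of the financial features, as used throughout this notebook.
FINANCIAL_ORDER = [
    "salary", "bonus", "long_term_incentive", "deferred_income",
    "deferral_payments", "loan_advances", "other", "expenses",
    "director_fees", "total_payments", "exercised_stock_options",
    "restricted_stock", "restricted_stock_deferred", "total_stock_value",
]

def shift_left(row, order=FINANCIAL_ORDER, fill=0.0):
    """Move every value one column to the left; the last column gets `fill`."""
    values = [row[name] for name in order]
    return dict(zip(order, values[1:] + [fill]))

# BELFER ROBERT as displayed earlier (all other financial fields are 0).
belfer = dict.fromkeys(FINANCIAL_ORDER, 0.0)
belfer.update({
    "deferral_payments": -102500.0,
    "director_fees": 3285.0,
    "total_payments": 102500.0,
    "exercised_stock_options": 3285.0,
    "restricted_stock_deferred": 44093.0,
    "total_stock_value": -44093.0,
})

fixed = shift_left(belfer)

# Re-run the two total checks on the repaired row.
payments = sum(fixed[name] for name in FINANCIAL_ORDER[:9])
stock = sum(fixed[name] for name in FINANCIAL_ORDER[10:13])
print(payments == fixed["total_payments"], stock == fixed["total_stock_value"])
```

After the shift both totals add up for this row. The BHATNAGAR SANJAY row appears to need the opposite shift (one column to the right), which this sketch does not cover.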

In [22]:
# update the dataframe

# Formatting for the output:
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Convert dictionary to DataFrame
enron_df = pd.DataFrame.from_dict(data_dict, orient = 'index', dtype = float)

# reorganize columns
enron_df = enron_df[
['salary',
'bonus',
'long_term_incentive',
'deferred_income',
'deferral_payments',
'loan_advances',
'other',
'expenses',
'director_fees',
'total_payments',
 'exercised_stock_options',
'restricted_stock',
 'restricted_stock_deferred',
 'total_stock_value',
 'email_address',
 'to_messages',
 'shared_receipt_with_poi',
 'from_messages',
 'from_this_person_to_poi',
 'poi',
 'from_poi_to_this_person']]

# Change the NaN to 0

enron_df = enron_df.fillna(0)
enron_df.head()

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,email_address,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person
ALLEN PHILLIP K,201955.0,4175000.0,304805.0,-3081055.0,2869717.0,0.0,152.0,13868.0,0.0,4484442.0,1729541.0,126027.0,-126027.0,1729541.0,phillip.allen@enron.com,2902.0,1407.0,2195.0,65.0,0.0,47.0
BADUM JAMES P,0.0,0.0,0.0,0.0,178980.0,0.0,0.0,3486.0,0.0,182466.0,257817.0,0.0,0.0,257817.0,,0.0,0.0,0.0,0.0,0.0,0.0
BANNANTINE JAMES M,477.0,0.0,0.0,-5104.0,0.0,0.0,864523.0,56301.0,0.0,916197.0,4046157.0,1757552.0,-560222.0,5243487.0,james.bannantine@enron.com,566.0,465.0,29.0,0.0,0.0,39.0
BAXTER JOHN C,267102.0,1200000.0,1586055.0,-1386055.0,1295738.0,0.0,2660303.0,11200.0,0.0,5634343.0,6680544.0,3942714.0,0.0,10623258.0,,0.0,0.0,0.0,0.0,0.0,0.0
BAY FRANKLIN R,239671.0,400000.0,0.0,-201641.0,260455.0,0.0,69.0,129142.0,0.0,827696.0,0.0,145796.0,-82782.0,63014.0,frank.bay@enron.com,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
# double check for column shift
enron_df['add_total_payments'] = \
enron_df['salary'] + \
enron_df['bonus'] + \
enron_df['long_term_incentive'] + \
enron_df['deferred_income'] + \
enron_df['deferral_payments'] + \
enron_df['loan_advances'] + \
enron_df['other'] +  \
enron_df['expenses'] + \
enron_df['director_fees'] 

enron_df['add_total_stock'] = \
enron_df['exercised_stock_options'] + \
enron_df['restricted_stock'] + \
enron_df['restricted_stock_deferred'] 


enron_df['check_total_payments'] = enron_df['add_total_payments'] == enron_df['total_payments']
enron_df['check_total_stock'] = enron_df['add_total_stock'] == enron_df['total_stock_value']


In [24]:
enron_df[enron_df['check_total_payments']==0]

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,email_address,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person,add_total_payments,add_total_stock,check_total_payments,check_total_stock


In [25]:
enron_df[enron_df['check_total_stock']==0]

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,email_address,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person,add_total_payments,add_total_stock,check_total_payments,check_total_stock


In [26]:
# recheck df stats
enron_df.describe()

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person,add_total_payments,add_total_stock
count,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0
mean,186742.86,680724.608,339314.182,-195754.483,224359.413,586888.112,295842.608,36609.776,9779.839,2164506.916,2180167.832,911344.748,-52984.531,3038528.049,1247.217,707.524,366.126,24.797,0.126,39.028,2164506.916,3038528.049
std,197117.072,1236179.688,689013.933,607751.295,756258.114,6818177.362,1135225.135,46050.414,30518.512,8808363.755,4937260.144,2005940.55,274107.89,6288440.349,2243.006,1079.457,1455.452,80.032,0.333,74.466,8808363.755,6288440.349
min,0.0,0.0,0.0,-3504386.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2604490.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,-39798.0,0.0,0.0,0.0,0.0,0.0,89292.5,0.0,45698.5,0.0,288212.0,0.0,0.0,0.0,0.0,0.0,0.0,89292.5,288212.0
50%,210692.0,300000.0,0.0,0.0,0.0,0.0,891.0,22344.0,0.0,916197.0,608750.0,363428.0,0.0,985032.0,383.0,114.0,18.0,0.0,0.0,4.0,916197.0,985032.0
75%,270259.0,800000.0,374825.5,0.0,9110.0,0.0,149204.0,54522.0,0.0,1901558.5,1698900.5,861142.0,0.0,2413007.5,1639.0,967.5,54.0,14.0,0.0,41.5,1901558.5,2413007.5
max,1111258.0,8000000.0,5145434.0,0.0,6426990.0,81525000.0,10359729.0,228763.0,125034.0,103559793.0,34348384.0,14761694.0,0.0,49110078.0,15149.0,5521.0,14368.0,609.0,1.0,528.0,103559793.0,49110078.0


In [27]:
# let's look for any correlations in the data
enron_df.corr(method ='pearson') 

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,shared_receipt_with_poi,from_messages,from_this_person_to_poi,poi,from_poi_to_this_person,add_total_payments,add_total_stock,check_total_payments,check_total_stock
salary,1.0,0.649,0.559,-0.328,0.242,0.388,0.546,0.332,-0.306,0.527,0.42,0.525,0.017,0.498,0.395,0.51,0.143,0.206,0.339,0.406,0.527,0.498,,
bonus,0.649,1.0,0.497,-0.33,0.173,0.433,0.384,0.228,-0.178,0.571,0.395,0.416,0.05,0.445,0.519,0.664,0.174,0.448,0.358,0.64,0.571,0.445,,
long_term_incentive,0.559,0.497,1.0,-0.295,0.119,0.402,0.535,0.083,-0.159,0.53,0.379,0.336,0.044,0.407,0.191,0.278,0.055,0.156,0.256,0.266,0.53,0.407,,
deferred_income,-0.328,-0.33,-0.295,1.0,-0.543,-0.025,-0.265,-0.032,0.072,-0.108,-0.255,-0.12,0.079,-0.235,-0.12,-0.235,-0.014,-0.004,-0.274,-0.194,-0.108,-0.235,,
deferral_payments,0.242,0.173,0.119,-0.543,1.0,0.014,0.369,-0.029,-0.096,0.146,0.108,0.083,-0.007,0.111,0.123,0.209,0.028,0.001,-0.04,0.216,0.146,0.111,,
loan_advances,0.388,0.433,0.402,-0.025,0.014,1.0,0.759,0.118,-0.028,0.973,0.552,0.585,0.017,0.621,0.115,0.137,-0.02,-0.01,0.22,0.1,0.973,0.621,,
other,0.546,0.384,0.535,-0.265,0.369,0.759,1.0,0.131,-0.084,0.838,0.531,0.637,0.038,0.622,0.106,0.18,-0.055,-0.05,0.17,0.16,0.838,0.622,,
expenses,0.332,0.228,0.083,-0.032,-0.029,0.118,0.131,1.0,-0.14,0.154,0.158,0.182,-0.155,0.175,0.231,0.276,0.148,0.125,0.192,0.132,0.154,0.175,,
director_fees,-0.306,-0.178,-0.159,0.072,-0.096,-0.028,-0.084,-0.14,1.0,-0.077,-0.136,-0.145,0.052,-0.151,-0.178,-0.211,-0.081,-0.1,-0.122,-0.169,-0.077,-0.151,,
total_payments,0.527,0.571,0.53,-0.108,0.146,0.973,0.838,0.154,-0.077,1.0,0.582,0.63,0.033,0.66,0.202,0.258,0.012,0.066,0.249,0.223,1.0,0.66,,
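
For reference, the Pearson coefficient reported by `corr(method='pearson')` is the covariance of two columns divided by the product of their standard deviations. The sketch below computes it by hand in Python 3 on made-up numbers; `pearson` is a hypothetical helper. Note that in this notebook NaN values were replaced with 0 before calling `corr`, so those zeros take part in the coefficients shown above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: the second list is exactly twice the first.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
r = pearson(a, b)
print(round(r, 3))
```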


In [28]:
# let's see how many people are outliers

outlier_POIs = dict()

cases = 0
for column_name in list(enron_df):
    if enron_df[column_name].dtype == 'float' and column_name != 'poi':
        cases +=1
        test_data = enron_df[enron_df[column_name]!=0]
        Q1 = test_data[column_name].quantile(0.25)
        Q3 = test_data[column_name].quantile(0.75)
        IQR = Q3 - Q1

        # Select values at or beyond Q1 - 2*IQR or Q3 + 2*IQR (the outliers)
        query_str = '(@Q1 - 2 * @IQR) >= ' + column_name + ' or  (@Q3 + 2 * @IQR) <=' + column_name
        filtered = test_data.query(query_str)

        print column_name + ": " + str(len(filtered))
        filtered = filtered.sort_values(column_name)
        print filtered[column_name].head(10)
        print ""
        
        for person in filtered.index.tolist():
            outlier_POIs[person] = outlier_POIs.get(person, 0) + 1

print "Total Cases:" + str(cases)

salary: 7
BANNANTINE JAMES M       477.000
GRAY RODNEY             6615.000
WHALLEY LAWRENCE G    510364.000
PICKERING MARK R      655037.000
FREVERT MARK A       1060932.000
LAY KENNETH L        1072321.000
SKILLING JEFFREY K   1111258.000
Name: salary, dtype: float64

bonus: 8
DELAINEY DAVID W     3000000.000
WHALLEY LAWRENCE G   3000000.000
KITCHEN LOUISE       3100000.000
ALLEN PHILLIP K      4175000.000
BELDEN TIMOTHY N     5249999.000
SKILLING JEFFREY K   5600000.000
LAY KENNETH L        7000000.000
LAVORATO JOHN J      8000000.000
Name: bonus, dtype: float64

long_term_incentive: 4
LAVORATO JOHN J   2035380.000
ECHOLS JOHN B     2234774.000
LAY KENNETH L     3600000.000
MARTIN AMANDA K   5145434.000
Name: long_term_incentive, dtype: float64

deferred_income: 5
RICE KENNETH D     -3504386.000
FREVERT MARK A     -3367011.000
HANNON KEVIN P     -3117011.000
ALLEN PHILLIP K    -3081055.000
BELDEN TIMOTHY N   -2334434.000
Name: deferred_income, dtype: float64

deferral_payments: 4
AL
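
The outlier rule used in the cell above (ignore zeros, then flag values at or beyond Q1 - 2*IQR or Q3 + 2*IQR) can be sketched for a single column in plain Python 3. The helper name `iqr_outliers` and the toy numbers are made up; `statistics.quantiles` with `method="inclusive"` is used here on the assumption that it mirrors the linear interpolation pandas applies by default.

```python
import statistics

def iqr_outliers(values, k=2.0):
    """Return non-zero values at or beyond Q1 - k*IQR or Q3 + k*IQR."""
    nonzero = sorted(v for v in values if v != 0)
    q1, _, q3 = statistics.quantiles(nonzero, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in nonzero if v <= low or v >= high]

# Made-up column: zeros stand for missing values, 100 is the obvious outlier.
toy = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]
print(iqr_outliers(toy))
```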

In [29]:
# let's count them up

outlier_POIs_df = pd.DataFrame.from_dict(outlier_POIs, orient = 'index', dtype = int)
outlier_POIs_df.columns = ['Count']
outlier_POIs_df = outlier_POIs_df.sort_values('Count', ascending = False)
outlier_POIs_df.head(10)

Unnamed: 0,Count
FREVERT MARK A,11
LAY KENNETH L,10
BELDEN TIMOTHY N,8
SKILLING JEFFREY K,8
LAVORATO JOHN J,8
BAXTER JOHN C,7
DERRICK JR. JAMES V,5
KITCHEN LOUISE,5
PAI LOU L,5
BHATNAGAR SANJAY,5


LAY KENNETH L and SKILLING JEFFREY K are near the top of the list, as was expected. Other names there are FREVERT MARK A, BELDEN TIMOTHY N, and LAVORATO JOHN J.

FREVERT MARK A is a strong outlier since he shows up 11 times. These seem like genuine values rather than data errors, so they will be kept.

**There are many features with NaN values.  Let's look at the proportion of valid (non-NaN) values for each feature.**

In [30]:
# Count and ratio of non-zero values per column (NaN was replaced with 0 above)
for col in enron_df.columns:
    if str(enron_df[col].dtype) != 'bool':
        print col,(enron_df[enron_df[col] != 0][col].count()) \
        ,round((enron_df[enron_df[col] != 0][col].count())/ \
        (float(enron_df.shape[0])),2)

salary 94 0.66
bonus 81 0.57
long_term_incentive 65 0.45
deferred_income 49 0.34
deferral_payments 37 0.26
loan_advances 3 0.02
other 90 0.63
expenses 96 0.67
director_fees 15 0.1
total_payments 123 0.86
exercised_stock_options 100 0.7
restricted_stock 110 0.77
restricted_stock_deferred 17 0.12
total_stock_value 125 0.87
email_address 143 1.0
to_messages 86 0.6
shared_receipt_with_poi 86 0.6
from_messages 86 0.6
from_this_person_to_poi 66 0.46
poi 18 0.13
from_poi_to_this_person 74 0.52
add_total_payments 123 0.86
add_total_stock 125 0.87
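
One caveat with counting non-zero values after `fillna(0)`: a value that was recorded as a genuine 0 is counted as if it were missing. This probably explains why `from_this_person_to_poi` shows 66 while `from_messages` shows 86. The Python 3 sketch below separates the two counts on made-up records in the shape of `data_dict`; `valid_counts` and the `PERSON` names are hypothetical.

```python
def valid_counts(data, feature):
    """Return (non-missing count, non-zero count) for one feature."""
    values = [person[feature] for person in data.values()]
    present = [v for v in values if v != "NaN"]
    nonzero = [v for v in present if v != 0]
    return len(present), len(nonzero)

toy = {
    "PERSON A": {"from_this_person_to_poi": 65},
    "PERSON B": {"from_this_person_to_poi": 0},      # recorded, genuinely zero
    "PERSON C": {"from_this_person_to_poi": "NaN"},  # missing
}

present, nonzero = valid_counts(toy, "from_this_person_to_poi")
print(present, nonzero)
```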


In [31]:
# creating the first feature list from non-NaN data ratios of > 50%

POI = 18
non_POI = 125

first_feature_list = ['poi']
all_features_list = ['poi']

for feature in data_dict[data_dict.keys()[0]].keys():
    nan_poi = 0
    nan_non_poi = 0
    valid_data = 0 
    print "\n"+feature
    for person in data_dict.keys():
        if data_dict[person][feature] =='NaN':
            if data_dict[person]['poi']:
                nan_poi +=1
            else:
                nan_non_poi +=1
        else:
            valid_data +=1
    print "NaN in POIs: " + str(nan_poi) + " (" + str(round(float(nan_poi)/float(POI)*100,2))+"%)"
    print "NaN in non-POIs: " + str(nan_non_poi) + " (" + str(round(float(nan_non_poi)/float(non_POI)*100,2))+"%)"
    print "Valid: " + str(valid_data)  +" (" + str(round(float(valid_data)/float(POI+non_POI)*100,2))+"%)"
 
    
    # Keep feature only if more than 50% is valid:
    if round(float(valid_data)/float(POI+non_POI)*100,2)>50:
        # ignore email_address, since it is just text, and poi, which must be first
        if feature !="email_address" and feature !="poi":
            first_feature_list.append(feature)
            
    # save all features to allow for comparison:
    # ignore email_address, since it is just text, and poi, which must be first
    if feature !="email_address" and feature !="poi":
        all_features_list.append(feature)
            
            



salary
NaN in POIs: 1 (5.56%)
NaN in non-POIs: 48 (38.4%)
Valid: 94 (65.73%)

to_messages
NaN in POIs: 4 (22.22%)
NaN in non-POIs: 53 (42.4%)
Valid: 86 (60.14%)

deferral_payments
NaN in POIs: 13 (72.22%)
NaN in non-POIs: 93 (74.4%)
Valid: 37 (25.87%)

total_payments
NaN in POIs: 0 (0.0%)
NaN in non-POIs: 20 (16.0%)
Valid: 123 (86.01%)

exercised_stock_options
NaN in POIs: 6 (33.33%)
NaN in non-POIs: 37 (29.6%)
Valid: 100 (69.93%)

bonus
NaN in POIs: 2 (11.11%)
NaN in non-POIs: 60 (48.0%)
Valid: 81 (56.64%)

restricted_stock
NaN in POIs: 1 (5.56%)
NaN in non-POIs: 32 (25.6%)
Valid: 110 (76.92%)

shared_receipt_with_poi
NaN in POIs: 4 (22.22%)
NaN in non-POIs: 53 (42.4%)
Valid: 86 (60.14%)

restricted_stock_deferred
NaN in POIs: 18 (100.0%)
NaN in non-POIs: 108 (86.4%)
Valid: 17 (11.89%)

total_stock_value
NaN in POIs: 0 (0.0%)
NaN in non-POIs: 18 (14.4%)
Valid: 125 (87.41%)

expenses
NaN in POIs: 0 (0.0%)
NaN in non-POIs: 47 (37.6%)
Valid: 96 (67.13%)

loan_advances
NaN in POIs: 17 (9

Let's create 5 new features:

* nf_bonus: the ratio of bonus to total payments

* nf_salary: the ratio of salary to total payments

* nf_exercised_stock: the ratio of exercised stock options to total payments

* nf_poi_to: the ratio of emails a person received from a POI to all emails they received

* nf_to_poi: the ratio of emails a person sent to a POI to all emails they sent

Feature Reasoning: The financial features used here had high correlations with total payments. I suspect that a POI conducting fraudulent activity will show high ratios in these new features. The new features quantify this and will hopefully increase the predictive power of the algorithm.
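The `create_feature` helper used in the next cell lives in `support_functions`, which is not shown in this notebook. The following is only a minimal sketch of what such a ratio helper could look like, assuming missing values are stored as the string `'NaN'` (as in the NaN counts above) and that a missing or zero denominator should also yield `'NaN'`. The name `add_ratio_feature` and the toy records are hypothetical and are not taken from the project.

```python
def add_ratio_feature(data_dict, numerator, denominator, new_name):
    """Add new_name = numerator / denominator for every person.

    Hypothetical stand-in for support_functions.create_feature.
    Assumption: missing values are the string 'NaN'.
    """
    for person, values in data_dict.items():
        num = values.get(numerator, 'NaN')
        den = values.get(denominator, 'NaN')
        if num == 'NaN' or den == 'NaN' or den == 0:
            # no meaningful ratio can be formed
            values[new_name] = 'NaN'
        else:
            values[new_name] = float(num) / float(den)
    return data_dict


# Toy records for illustration only (not Enron data)
toy = {
    'PERSON A': {'bonus': 50, 'total_payments': 200},
    'PERSON B': {'bonus': 'NaN', 'total_payments': 300},
    'PERSON C': {'bonus': 10, 'total_payments': 0},
}
toy = add_ratio_feature(toy, 'bonus', 'total_payments', 'nf_bonus')
print(toy['PERSON A']['nf_bonus'])
```

Keeping the missing-value marker as `'NaN'` means the new feature can be handled the same way as the original ones when the data is formatted later; whether the real helper does this is an assumption.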


In [32]:
import support_functions


# nf_bonus
data_dict = support_functions.create_feature(data_dict,
                                      "bonus",
                                      "total_payments",
                                      "nf_bonus")

# nf_salary
data_dict = support_functions.create_feature(data_dict,
                                      "salary",
                                      "total_payments",
                                      "nf_salary")


# nf_exercised_stock
data_dict = support_functions.create_feature(data_dict,
                                      "exercised_stock_options",
                                      "total_payments",
                                      "nf_exercised_stock")


# nf_to_poi
data_dict = support_functions.create_feature(data_dict,
                                      "from_this_person_to_poi",
                                      "from_messages",
                                      "nf_to_poi")

# nf_poi_to
data_dict = support_functions.create_feature(data_dict,
                                      "from_poi_to_this_person",
                                      "to_messages",
                                      "nf_poi_to")




# Add the  new features
first_feature_list = first_feature_list + ['nf_bonus',
                                 'nf_salary', 'nf_exercised_stock', 'nf_to_poi', 'nf_poi_to']
  
all_features_list = all_features_list + ['nf_bonus',
                                 'nf_salary', 'nf_exercised_stock', 'nf_to_poi', 'nf_poi_to']  
                

In [33]:
# setup 3 algo tests
from tester import test_classifier

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier


# Naive Bayes
clf_NB = GaussianNB()

# Decision Tree
clf_TREE = DecisionTreeClassifier()

# AdaBoost:
clf_ADA = AdaBoostClassifier()



In [34]:
# Print the feature list
print  first_feature_list
print "\n Numbers of features: " + str(len(first_feature_list))

['poi', 'salary', 'to_messages', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'total_stock_value', 'expenses', 'from_messages', 'other', 'from_this_person_to_poi', 'from_poi_to_this_person', 'nf_bonus', 'nf_salary', 'nf_exercised_stock', 'nf_to_poi', 'nf_poi_to']

 Numbers of features: 19


In [35]:
# import feature formatter
import sys
sys.path.append("./tools/")  # make the tools directory importable before importing from it

from feature_format import featureFormat, targetFeatureSplit
from sklearn.model_selection import train_test_split

In [36]:
# format the feature list
data = featureFormat(data_dict, all_features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)


**Let's see what the best features are from the first set**
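When importances are printed by zipping a list of names with `feature_importances_`, the pairing is only meaningful if the names are the same list, in the same order, that was used to build the feature matrix. `zip` silently stops at the shorter input, so a length check is a cheap guard. A minimal sketch, where `pair_importances` and the toy values are hypothetical:

```python
def pair_importances(feature_names, importances):
    """Pair names with importances, refusing mismatched lengths."""
    names = list(feature_names)
    scores = list(importances)
    if len(names) != len(scores):
        raise ValueError(
            "got %d names but %d importances" % (len(names), len(scores)))
    # most important first
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Toy values for illustration only
pairs = pair_importances(['salary', 'bonus', 'expenses'], [0.2, 0.5, 0.3])
for name, score in pairs:
    print(name + ": " + str(round(score, 2)))
```

In the notebook this would be called with the name list (minus `'poi'`) that was passed to `featureFormat` and with the fitted classifier's `feature_importances_`.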

In [37]:
# Print importance of each feature

clf_ADA.fit(features_train,labels_train)
for feature, imp in zip(first_feature_list[1:],clf_ADA.feature_importances_):
    print feature +": " + str(round(imp,2))

salary: 0.08
to_messages: 0.0
total_payments: 0.02
exercised_stock_options: 0.08
bonus: 0.02
restricted_stock: 0.06
shared_receipt_with_poi: 0.08
total_stock_value: 0.1
expenses: 0.0
from_messages: 0.0
other: 0.06
from_this_person_to_poi: 0.0
from_poi_to_this_person: 0.04
nf_bonus: 0.2
nf_salary: 0.0
nf_exercised_stock: 0.0
nf_to_poi: 0.0
nf_poi_to: 0.02


In [38]:
# Remove some features 

first_feature_list.remove('salary')
first_feature_list.remove('to_messages')
first_feature_list.remove('from_messages')
first_feature_list.remove('from_this_person_to_poi')
first_feature_list.remove('from_poi_to_this_person')

In [39]:
# Print the feature list
#first_feature_list = feature_list_one
print  first_feature_list
print "\n Numbers of features: " + str(len(first_feature_list))

['poi', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'total_stock_value', 'expenses', 'other', 'nf_bonus', 'nf_salary', 'nf_exercised_stock', 'nf_to_poi', 'nf_poi_to']

 Numbers of features: 14


Redo the feature formatting.

In [40]:
# format the feature list
data = featureFormat(data_dict, all_features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)


features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)


**Let's pick an algo**

In [41]:
# let's test 3 algos and choose one
test_classifier(clf_NB, data_dict, first_feature_list)
test_classifier(clf_TREE, data_dict, first_feature_list)
test_classifier(clf_ADA, data_dict, first_feature_list)

GaussianNB(priors=None)
	Accuracy: 0.84573	Precision: 0.37173	Recall: 0.22750	F1: 0.28226	F2: 0.24664
	Total predictions: 15000	True positives:  455	False positives:  769	False negatives: 1545	True negatives: 12231

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
	Accuracy: 0.81847	Precision: 0.31678	Recall: 0.31250	F1: 0.31462	F2: 0.31335
	Total predictions: 15000	True positives:  625	False positives: 1348	False negatives: 1375	True negatives: 11652

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
	Accuracy: 0.84087	Precision: 0.38078	Recall: 0.30900	F1: 0.34115	F2: 0.32111
	Total predictions: 15000	True positiv

**Going to choose AdaBoost since it has the highest precision (0.381) and the highest F1 (0.341) of the three**
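The metrics printed by `test_classifier` can be recomputed from the four counts on the "Total predictions" line. The sketch below uses the standard definitions of accuracy, precision, recall, F1 and F2 and plugs in the GaussianNB counts from the output above; the function name `classification_metrics` is hypothetical, and no zero-division handling is included.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts."""
    tp, fp, fn, tn = float(tp), float(fp), float(fn), float(tn)
    precision = tp / (tp + fp)   # share of predicted POIs that are real POIs
    recall = tp / (tp + fn)      # share of real POIs that were found
    return {
        'accuracy': (tp + tn) / (tp + fp + fn + tn),
        'precision': precision,
        'recall': recall,
        'f1': 2 * precision * recall / (precision + recall),
        # F2 weights recall more heavily than precision
        'f2': 5 * precision * recall / (4 * precision + recall),
    }


# GaussianNB counts copied from the output above
nb = classification_metrics(tp=455, fp=769, fn=1545, tn=12231)
for key in ['accuracy', 'precision', 'recall', 'f1', 'f2']:
    print(key + ": " + str(round(nb[key], 5)))
```

Accuracy is high for all three classifiers mainly because non-POIs dominate the data, which is why precision, recall and F1 are the more useful numbers to compare here.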

**Let's see what the best features are from the reduced first set**

In [42]:
# Print importance of each feature

clf_ADA.fit(features_train,labels_train)
for feature, imp in zip(first_feature_list[1:],clf_ADA.feature_importances_):
    print feature +": " + str(round(imp,2))

total_payments: 0.08
exercised_stock_options: 0.0
bonus: 0.02
restricted_stock: 0.08
shared_receipt_with_poi: 0.02
total_stock_value: 0.06
expenses: 0.08
other: 0.1
nf_bonus: 0.0
nf_salary: 0.0
nf_exercised_stock: 0.06
nf_to_poi: 0.0
nf_poi_to: 0.04


**Let's try GridSearch**

This gives starting values for n_estimators and learning_rate that can be fine-tuned later.
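A grid search tries every combination of the listed values, so its cost grows with the product of the list lengths. The sketch below only enumerates the grid used in the following cells and counts the work; it does not train anything. The variable names `candidates`, `n_candidates` and `n_cv_fits` are illustrative.

```python
from itertools import product

parameters = {
    'learning_rate': [0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1, 2, 3, 5, 10],
    'n_estimators': [1, 5, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18,
                     50, 100, 1000, 2000],
}

# Every (learning_rate, n_estimators) pair the search will evaluate
keys = sorted(parameters)
candidates = [dict(zip(keys, values))
              for values in product(*(parameters[k] for k in keys))]

n_candidates = len(candidates)
n_cv_fits = n_candidates * 10   # cv = 10 folds per candidate

print(n_candidates)
print(n_cv_fits)
```

With 12 learning rates and 16 estimator counts there are 192 candidates, and with `cv = 10` that is 1920 cross-validation fits, some of them with 1000 or 2000 estimators, which explains why these cells take a while to run.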

In [43]:
from sklearn.model_selection import GridSearchCV

In [44]:
# Gridsearch for learning rate and estimator hyperparameters
from sklearn.model_selection import GridSearchCV
parameters = {'learning_rate':[0.1, 0.2, 0.3, 0.5, 0.7,0.8, 0.9, 1, 2,3, 5, 10],
              'n_estimators':[1,5,8,10,11, 12, 13, 14, 15, 16, 17, 18,50,100,1000, 2000] }
clf_ADA = AdaBoostClassifier()
clf = GridSearchCV(clf_ADA, parameters, scoring = 'f1', cv = 10 )
clf.fit(features, labels)

print clf.best_params_
print clf.best_score_



{'n_estimators': 13, 'learning_rate': 0.9}
0.49941724941724935


# Classifier Tuning - First Feature List 

In [45]:
lr = clf.best_params_['learning_rate']
n_est = clf.best_params_['n_estimators']
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, first_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.9, n_estimators=13, random_state=None)
	Accuracy: 0.85180	Precision: 0.42098	Recall: 0.29700	F1: 0.34828	F2: 0.31559
	Total predictions: 15000	True positives:  594	False positives:  817	False negatives: 1406	True negatives: 12183



**trying some other hyperparameters**

In [46]:
lr = 0.9
n_est = 14
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, first_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.9, n_estimators=14, random_state=None)
	Accuracy: 0.85220	Precision: 0.42365	Recall: 0.30100	F1: 0.35194	F2: 0.31950
	Total predictions: 15000	True positives:  602	False positives:  819	False negatives: 1398	True negatives: 12181



Got it (precision > 0.30 and recall > 0.30), but not good enough. Let's keep going.
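The target of precision and recall both above 0.30 can be written as a small check. The sketch below applies it to the two runs above; `meets_target` is a hypothetical helper name, and the numbers are copied from the printed results.

```python
def meets_target(precision, recall, threshold=0.30):
    """True only when both metrics are strictly above the threshold."""
    return precision > threshold and recall > threshold


# (label, precision, recall) copied from the two runs above
runs = [
    ('n_estimators=13', 0.42098, 0.29700),
    ('n_estimators=14', 0.42365, 0.30100),
]
results = dict((label, meets_target(p, r)) for label, p, r in runs)
print(results['n_estimators=13'])
print(results['n_estimators=14'])
```

The grid search result (13 estimators) narrowly misses on recall, while 14 estimators clears both thresholds.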

# Classifier Tuning - Second Feature List 

In [47]:
# creating a new list from the important features
feature_list_two = ['poi']
for feature, imp in zip(first_feature_list[1:],clf_ADA.feature_importances_):
    if imp > 0.00:
        feature_list_two.append(feature)

In [48]:
# Take the best importance from the first list
second_feature_list = feature_list_two

# print the feature list
print  second_feature_list
print "\n Numbers of features: " + str(len(second_feature_list))

['poi', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'expenses', 'other', 'nf_to_poi']

 Numbers of features: 7


**Let's see what the best features are from the second set**

In [49]:
# Print importance of each feature
clf_ADA.fit(features_train,labels_train)
for feature, imp in zip(second_feature_list[1:],clf_ADA.feature_importances_):
    print feature +": " + str(round(imp,2))

exercised_stock_options: 0.07
restricted_stock: 0.0
shared_receipt_with_poi: 0.0
expenses: 0.14
other: 0.07
nf_to_poi: 0.21


In [50]:
# format the feature list
data = featureFormat(data_dict, all_features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)


features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

In [51]:
# Gridsearch for learning rate and estimator hyperparameters
parameters = {'learning_rate':[0.1, 0.2, 0.3, 0.5, 0.7,0.8, 0.9, 1, 2,3, 5, 10],'n_estimators':[1,5,8,10,11, 12, 13, 14, 15, 16, 17, 18,50,100,1000, 2000] }
clf_ADA = AdaBoostClassifier()
clf = GridSearchCV(clf_ADA, parameters, scoring = 'f1', cv = 10 )
clf.fit(features, labels)



print clf.best_params_
print clf.best_score_

{'n_estimators': 13, 'learning_rate': 0.9}
0.49941724941724935


In [52]:
lr = clf.best_params_['learning_rate']
n_est = clf.best_params_['n_estimators']
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, second_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.9, n_estimators=13, random_state=None)
	Accuracy: 0.89267	Precision: 0.63430	Recall: 0.46050	F1: 0.53360	F2: 0.48720
	Total predictions: 15000	True positives:  921	False positives:  531	False negatives: 1079	True negatives: 12469



**performance tests**

**trying some other hyperparameters**

In [53]:
lr = 0.9
n_est = 14
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, second_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.9, n_estimators=14, random_state=None)
	Accuracy: 0.89413	Precision: 0.63807	Recall: 0.47600	F1: 0.54525	F2: 0.50147
	Total predictions: 15000	True positives:  952	False positives:  540	False negatives: 1048	True negatives: 12460



Best so far, but let's keep trying.

# Classifier Tuning - Third Feature List 

In [54]:
# creating a new list from the important features
feature_list_three = ['poi']
for feature, imp in zip(first_feature_list[1:],clf_ADA.feature_importances_):
    if imp > 0.00:
        feature_list_three.append(feature)

In [55]:
# Take the best importance from the second list
third_feature_list = feature_list_three



# print the feature list
print  third_feature_list
print "\n Numbers of features: " + str(len(third_feature_list))

['poi', 'total_payments', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'total_stock_value']

 Numbers of features: 6


In [56]:
# Print importance of each feature
clf_ADA.fit(features_train,labels_train)
for feature, imp in zip(third_feature_list[1:],clf_ADA.feature_importances_):
    print feature +": " + str(round(imp,2))

total_payments: 0.07
exercised_stock_options: 0.0
restricted_stock: 0.0
shared_receipt_with_poi: 0.14
total_stock_value: 0.07


In [57]:
# format the feature list
data = featureFormat(data_dict, all_features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)


features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

In [58]:
# Gridsearch for learning rate and estimator hyperparameters
parameters = {'learning_rate':[0.1, 0.2, 0.3, 0.5, 0.7,0.8, 0.9, 1, 2,3, 5, 10],'n_estimators':[1,5,8,10,11, 12, 13, 14, 15, 16, 17, 18,50,100,1000, 2000] }
clf_ADA = AdaBoostClassifier()
clf = GridSearchCV(clf_ADA, parameters, scoring = 'f1', cv = 10 )
clf.fit(features, labels)



print clf.best_params_
print clf.best_score_

{'n_estimators': 13, 'learning_rate': 0.9}
0.49941724941724935


In [59]:
lr = clf.best_params_['learning_rate']
n_est = clf.best_params_['n_estimators']
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, third_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.9, n_estimators=13, random_state=None)
	Accuracy: 0.84600	Precision: 0.36953	Recall: 0.21950	F1: 0.27541	F2: 0.23890
	Total predictions: 15000	True positives:  439	False positives:  749	False negatives: 1561	True negatives: 12251



Performance was worse, so stopping here.

# Recursive Feature Selection and Tuning

Let's try recursive feature elimination to make the list.
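Recursive feature elimination repeatedly scores the remaining features, drops the weakest one, and stops when the requested number is left. Kept features get rank 1, and dropped features get higher ranks the earlier they were removed. The sketch below shows only that loop; `rfe_ranking`, `toy_importance` and the scores are hypothetical, and unlike scikit-learn's `RFE` (which refits the estimator each round) the toy scores never change.

```python
def rfe_ranking(feature_names, importance_fn, n_keep):
    """Rank features by eliminating the least important one per round."""
    remaining = list(feature_names)
    ranking = {}
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)   # re-score what is left
        weakest = min(remaining, key=lambda name: scores[name])
        # the first feature dropped gets the largest rank
        ranking[weakest] = len(remaining) - n_keep + 1
        remaining.remove(weakest)
    for name in remaining:
        ranking[name] = 1                   # survivors all share rank 1
    return ranking


# Toy importances for illustration only
TOY_SCORES = {'a': 0.4, 'b': 0.1, 'c': 0.3, 'd': 0.2}


def toy_importance(names):
    return dict((name, TOY_SCORES[name]) for name in names)


ranking = rfe_ranking(['a', 'b', 'c', 'd'], toy_importance, n_keep=2)
print(sorted(ranking.items()))
```

With one feature removed per round, 13 candidate features and 8 kept give a largest rank of 6, which is consistent with the ranks printed below.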

In [61]:
from sklearn.feature_selection import RFE

In [62]:
# format the feature list
data = featureFormat(data_dict, first_feature_list, sort_keys = True)
labels, features = targetFeatureSplit(data)


In [63]:
# rank the features
rfe = RFE(clf_ADA,8)
selector = rfe.fit(features,labels)

In [64]:
# print the feature ranks
for feature, rank in zip(first_feature_list[1:], selector.ranking_ ):
    print feature +" (" + str(rank) +")"

total_payments (6)
exercised_stock_options (1)
bonus (5)
restricted_stock (4)
shared_receipt_with_poi (3)
total_stock_value (2)
expenses (1)
other (1)
nf_bonus (1)
nf_salary (1)
nf_exercised_stock (1)
nf_to_poi (1)
nf_poi_to (1)


In [65]:
# creating a new list from the ranked features
alt_feature_list_one = ['poi']
for feature, rank in zip(first_feature_list[1:], selector.ranking_ ):
    if rank < 6:
        alt_feature_list_one.append(feature)

In [66]:
# print the feature list
print  alt_feature_list_one
print "\n Numbers of features: " + str(len(alt_feature_list_one))

['poi', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'total_stock_value', 'expenses', 'other', 'nf_bonus', 'nf_salary', 'nf_exercised_stock', 'nf_to_poi', 'nf_poi_to']

 Numbers of features: 13


In [67]:
# format the feature list
data = featureFormat(data_dict,first_feature_list, sort_keys = True)
labels, features = targetFeatureSplit(data)


features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

In [68]:
# Gridsearch for learning rate and estimator hyperparameters
parameters = {'learning_rate':[0.1, 0.2, 0.3, 0.5, 0.7,0.8, 0.9, 1, 2,3, 5, 10],'n_estimators':[1,5,8,10,11, 12, 13, 14, 15, 16, 17, 18,50,100,1000, 2000] }
clf_ADA = AdaBoostClassifier()
clf = GridSearchCV(clf_ADA, parameters, scoring = 'f1', cv = 10 )
clf.fit(features, labels)



print clf.best_params_
print clf.best_score_

{'n_estimators': 17, 'learning_rate': 0.8}
0.4848484848484848


In [69]:
lr = clf.best_params_['learning_rate']
n_est = clf.best_params_['n_estimators']
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, alt_feature_list_one)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.8, n_estimators=17, random_state=None)
	Accuracy: 0.85953	Precision: 0.46235	Recall: 0.32850	F1: 0.38410	F2: 0.34869
	Total predictions: 15000	True positives:  657	False positives:  764	False negatives: 1343	True negatives: 12236



Better, but let's keep going.

**Let's see what the best features are from the first alt feature list**

In [70]:
# Print importance of each feature
clf_ADA.fit(features_train,labels_train)
for feature, imp in zip(alt_feature_list_one[1:],clf_ADA.feature_importances_):
    print feature +": " + str(round(imp,2))

exercised_stock_options: 0.18
bonus: 0.12
restricted_stock: 0.12
shared_receipt_with_poi: 0.0
total_stock_value: 0.06
expenses: 0.0
other: 0.18
nf_bonus: 0.12
nf_salary: 0.0
nf_exercised_stock: 0.06
nf_to_poi: 0.06
nf_poi_to: 0.12


In [71]:
# format the feature list
data = featureFormat(data_dict, first_feature_list, sort_keys = True)
labels, features = targetFeatureSplit(data)


rfe = RFE(clf_ADA,6)
selector = rfe.fit(features,labels)

In [72]:
# print the feature ranks
for feature, rank in zip(first_feature_list[1:], selector.ranking_ ):
    print feature +" (" + str(rank) +")"

total_payments (8)
exercised_stock_options (1)
bonus (7)
restricted_stock (1)
shared_receipt_with_poi (2)
total_stock_value (6)
expenses (1)
other (1)
nf_bonus (5)
nf_salary (4)
nf_exercised_stock (3)
nf_to_poi (1)
nf_poi_to (1)


**Alt Feature list 2**

In [73]:
# creating a new list from the ranked features
alt_feature_list_two = ['poi']
for feature, rank in zip(first_feature_list[1:], selector.ranking_ ):
    if rank < 4:
        alt_feature_list_two.append(feature)

In [74]:
# print the feature list
print  alt_feature_list_two
print "\n Numbers of features: " + str(len(alt_feature_list_two))

['poi', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'expenses', 'other', 'nf_exercised_stock', 'nf_to_poi', 'nf_poi_to']

 Numbers of features: 9


In [75]:
# reformat the feature list
data = featureFormat(data_dict, alt_feature_list_two, sort_keys = True)
labels, features = targetFeatureSplit(data)


features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

In [76]:
# Gridsearch for learning rate and estimator hyperparameters
parameters = {'learning_rate':[0.1, 0.2, 0.3, 0.5, 0.7,0.8, 0.9, 1, 2,3, 5, 10],'n_estimators':[1,5,8,10,11, 12, 13, 14, 15, 16, 17, 18,50,100,1000, 2000] }
clf_ADA = AdaBoostClassifier()
clf = GridSearchCV(clf_ADA, parameters, scoring = 'f1', cv = 10 )
clf.fit(features, labels)

print clf.best_params_
print clf.best_score_

{'n_estimators': 15, 'learning_rate': 0.8}
0.5378250591016548


In [77]:
# test performance
lr = clf.best_params_['learning_rate']
n_est = clf.best_params_['n_estimators']
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, alt_feature_list_two)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.8, n_estimators=15, random_state=None)
	Accuracy: 0.88000	Precision: 0.56757	Recall: 0.42000	F1: 0.48276	F2: 0.44304
	Total predictions: 15000	True positives:  840	False positives:  640	False negatives: 1160	True negatives: 12360



In [78]:
# Print importance of each feature
clf_ADA.fit(features_train,labels_train)
for feature, imp in zip(alt_feature_list_two[1:],clf_ADA.feature_importances_):
    print feature +": " + str(round(imp,2))

exercised_stock_options: 0.07
restricted_stock: 0.13
shared_receipt_with_poi: 0.13
expenses: 0.07
other: 0.13
nf_exercised_stock: 0.27
nf_to_poi: 0.13
nf_poi_to: 0.07


In [79]:
# format the feature list
data = featureFormat(data_dict, first_feature_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

rfe = RFE(clf_ADA,5)
selector = rfe.fit(features,labels)

In [80]:
# print the feature ranks
for feature, rank in zip(first_feature_list[1:], selector.ranking_ ):
    print feature +" (" + str(rank) +")"

total_payments (9)
exercised_stock_options (1)
bonus (8)
restricted_stock (1)
shared_receipt_with_poi (2)
total_stock_value (7)
expenses (1)
other (1)
nf_bonus (6)
nf_salary (5)
nf_exercised_stock (4)
nf_to_poi (1)
nf_poi_to (3)


**Alt Feature list 3**

In [81]:
# creating a new list from the ranked features
alt_feature_list_three = ['poi']
for feature, rank in zip(first_feature_list[1:], selector.ranking_ ):
    if rank < 2:
        alt_feature_list_three.append(feature)

In [82]:
# print the feature list
print  alt_feature_list_three
print "\n Numbers of features: " + str(len(alt_feature_list_three))

['poi', 'exercised_stock_options', 'restricted_stock', 'expenses', 'other', 'nf_to_poi']

 Numbers of features: 6


In [83]:
# reformat the feature list
data = featureFormat(data_dict, alt_feature_list_three, sort_keys = True)
labels, features = targetFeatureSplit(data)


features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

In [84]:
# Gridsearch for learning rate and estimator hyperparameters
parameters = {'learning_rate':[0.1, 0.2, 0.3, 0.5, 0.7,0.8, 0.9, 1, 2,3, 5, 10],'n_estimators':[1,5,8,10,11, 12, 13, 14, 15, 16, 17, 18,50,100,1000, 2000] }
clf_ADA = AdaBoostClassifier()
clf = GridSearchCV(clf_ADA, parameters, scoring = 'f1', cv = 10 )
clf.fit(features, labels)

print clf.best_params_
print clf.best_score_

{'n_estimators': 16, 'learning_rate': 0.8}
0.6847619047619047


In [85]:
# test performance
lr = clf.best_params_['learning_rate']
n_est = clf.best_params_['n_estimators']
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, alt_feature_list_three)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.8, n_estimators=16, random_state=None)
	Accuracy: 0.89157	Precision: 0.66110	Recall: 0.49450	F1: 0.56579	F2: 0.52075
	Total predictions: 14000	True positives:  989	False positives:  507	False negatives: 1011	True negatives: 11493



**Looking better; let's see how far it can go**

In [86]:
# Print importance of each feature
clf_ADA.fit(features_train,labels_train)
for feature, imp in zip(alt_feature_list_three[1:],clf_ADA.feature_importances_):
    print feature +": " + str(round(imp,2))

exercised_stock_options: 0.19
restricted_stock: 0.13
expenses: 0.06
other: 0.38
nf_to_poi: 0.25


In [87]:
# print the feature ranks
for feature, rank in zip(first_feature_list[1:], selector.ranking_ ):
    print feature +" (" + str(rank) +")"

total_payments (9)
exercised_stock_options (1)
bonus (8)
restricted_stock (1)
shared_receipt_with_poi (2)
total_stock_value (7)
expenses (1)
other (1)
nf_bonus (6)
nf_salary (5)
nf_exercised_stock (4)
nf_to_poi (1)
nf_poi_to (3)


**Alt Feature list 4**
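Alt feature lists 3 and 4 come from the same RFE ranking and differ only in the rank cutoff. The sketch below reproduces both selections from the ranks printed above; `select_by_rank` is a hypothetical helper name.

```python
# (feature, rank) pairs copied from the RFE output above
ranks = [
    ('total_payments', 9), ('exercised_stock_options', 1), ('bonus', 8),
    ('restricted_stock', 1), ('shared_receipt_with_poi', 2),
    ('total_stock_value', 7), ('expenses', 1), ('other', 1),
    ('nf_bonus', 6), ('nf_salary', 5), ('nf_exercised_stock', 4),
    ('nf_to_poi', 1), ('nf_poi_to', 3),
]


def select_by_rank(ranks, cutoff):
    """Keep 'poi' first, then every feature ranked below the cutoff."""
    return ['poi'] + [name for name, rank in ranks if rank < cutoff]


alt_three = select_by_rank(ranks, 2)   # rank 1 only
alt_four = select_by_rank(ranks, 3)    # ranks 1 and 2
print(alt_three)
print(alt_four)
```

Raising the cutoff from 2 to 3 adds exactly one feature, `shared_receipt_with_poi`.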

In [88]:
# creating a new list from the ranked features
alt_feature_list_four = ['poi']
for feature, rank in zip(first_feature_list[1:], selector.ranking_ ):
    if rank < 3:
        alt_feature_list_four.append(feature)

In [89]:
# print the feature list
print  alt_feature_list_four
print "\n Numbers of features: " + str(len(alt_feature_list_four))

['poi', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'expenses', 'other', 'nf_to_poi']

 Numbers of features: 7


In [90]:
# reformat the feature list
data = featureFormat(data_dict, alt_feature_list_four, sort_keys = True)
labels, features = targetFeatureSplit(data)


features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

In [91]:
# Gridsearch for learning rate and estimator hyperparameters
parameters = {'learning_rate':[0.1, 0.2, 0.3, 0.5, 0.7,0.8, 0.9, 1, 2,3, 5, 10],
              'n_estimators':[1,5,8,10,11, 12, 13, 14, 15, 16, 17, 18,50,100,1000, 2000] }
clf_ADA = AdaBoostClassifier()
clf = GridSearchCV(clf_ADA, parameters, scoring = 'f1', cv = 10 )
clf.fit(features, labels)

print clf.best_params_
print clf.best_score_

{'n_estimators': 17, 'learning_rate': 1}
0.6323877068557919


In [92]:
# test performance
lr = clf.best_params_['learning_rate']
n_est = clf.best_params_['n_estimators']
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, alt_feature_list_four)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,
          n_estimators=17, random_state=None)
	Accuracy: 0.89460	Precision: 0.63335	Recall: 0.49750	F1: 0.55727	F2: 0.51980
	Total predictions: 15000	True positives:  995	False positives:  576	False negatives: 1005	True negatives: 12424



**some fine tuning**

In [93]:
# test performance
lr = 1.0
n_est = 18
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, alt_feature_list_four)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=18, random_state=None)
	Accuracy: 0.89640	Precision: 0.64277	Recall: 0.50200	F1: 0.56373	F2: 0.52499
	Total predictions: 15000	True positives: 1004	False positives:  558	False negatives:  996	True negatives: 12442



In [94]:
# test performance
lr = 1.0
n_est = 19
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, alt_feature_list_four)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=19, random_state=None)
	Accuracy: 0.89573	Precision: 0.63956	Recall: 0.49950	F1: 0.56092	F2: 0.52238
	Total predictions: 15000	True positives:  999	False positives:  563	False negatives: 1001	True negatives: 12437



In [95]:
# test performance
lr = 1.0
n_est = 20
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, alt_feature_list_four)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=20, random_state=None)
	Accuracy: 0.89560	Precision: 0.63839	Recall: 0.50050	F1: 0.56110	F2: 0.52310
	Total predictions: 15000	True positives: 1001	False positives:  567	False negatives:  999	True negatives: 12433



In [96]:
# test performance
lr = 1.0
n_est = 16
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, alt_feature_list_four)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=16, random_state=None)
	Accuracy: 0.89407	Precision: 0.63148	Recall: 0.49350	F1: 0.55403	F2: 0.51605
	Total predictions: 15000	True positives:  987	False positives:  576	False negatives: 1013	True negatives: 12424



In [97]:
# test performance
lr = 1.0
n_est = 15
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, alt_feature_list_four)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=15, random_state=None)
	Accuracy: 0.89440	Precision: 0.63506	Recall: 0.48900	F1: 0.55254	F2: 0.51258
	Total predictions: 15000	True positives:  978	False positives:  562	False negatives: 1022	True negatives: 12438



In [99]:
# test performance
lr = 0.9
n_est = 18
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, alt_feature_list_four)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.9, n_estimators=18, random_state=None)
	Accuracy: 0.89540	Precision: 0.64150	Recall: 0.48850	F1: 0.55464	F2: 0.51297
	Total predictions: 15000	True positives:  977	False positives:  546	False negatives: 1023	True negatives: 12454



**I think this might be as good as it gets**
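The manual tuning runs on alt feature list 4 can be compared side by side instead of by eye. The sketch below collects the results printed above and picks the run with the highest F1; the tuple layout and the name `best` are illustrative choices.

```python
# (learning_rate, n_estimators, precision, recall, f1) copied from the runs above
runs = [
    (1.0, 17, 0.63335, 0.49750, 0.55727),
    (1.0, 18, 0.64277, 0.50200, 0.56373),
    (1.0, 19, 0.63956, 0.49950, 0.56092),
    (1.0, 20, 0.63839, 0.50050, 0.56110),
    (1.0, 16, 0.63148, 0.49350, 0.55403),
    (1.0, 15, 0.63506, 0.48900, 0.55254),
    (0.9, 18, 0.64150, 0.48850, 0.55464),
]

# Highest F1 wins; F1 is the last element of each tuple
best = max(runs, key=lambda run: run[-1])
print("learning_rate=" + str(best[0]) + ", n_estimators=" + str(best[1]))
```

The run with `learning_rate = 1.0` and `n_estimators = 18` has the highest F1 and also the highest precision and recall among these runs, although the gaps between runs are small.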

# Best Feature List

In [100]:
final_feature_list = alt_feature_list_four

# print the feature list
print  final_feature_list
print "\n Numbers of features: " + str(len(final_feature_list))

['poi', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'expenses', 'other', 'nf_to_poi']

 Numbers of features: 7


In [101]:
# format the feature list
data = featureFormat(data_dict, final_feature_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)


# Gridsearch for learning rate and estimator hyperparameters
parameters = {'learning_rate':[0.1, 0.2, 0.3, 0.5, 0.7,0.8, 0.9, 1, 2,3, 5, 10],
              'n_estimators':[1,5,8,10,11, 12, 13, 14, 15, 16, 17, 18,50,100,1000, 2000] }
clf_ADA = AdaBoostClassifier()

clf = GridSearchCV(clf_ADA, parameters, scoring = 'f1', cv = 10 )
clf.fit(features, labels)

print clf.best_params_
print clf.best_score_

{'n_estimators': 17, 'learning_rate': 1}
0.6323877068557919


In [102]:
# test performance
lr = 1.0
n_est = 18
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, final_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=18, random_state=None)
	Accuracy: 0.89647	Precision: 0.64318	Recall: 0.50200	F1: 0.56389	F2: 0.52505
	Total predictions: 15000	True positives: 1004	False positives:  557	False negatives:  996	True negatives: 12443



**final tuning**

In [103]:
# test performance
lr = 1.0
n_est = 18
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, final_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=18, random_state=None)
	Accuracy: 0.89647	Precision: 0.64318	Recall: 0.50200	F1: 0.56389	F2: 0.52505
	Total predictions: 15000	True positives: 1004	False positives:  557	False negatives:  996	True negatives: 12443



In [104]:
# test performance
lr = 1.05
n_est = 18
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, final_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.05, n_estimators=18, random_state=None)
	Accuracy: 0.89380	Precision: 0.63020	Recall: 0.49250	F1: 0.55290	F2: 0.51501
	Total predictions: 15000	True positives:  985	False positives:  578	False negatives: 1015	True negatives: 12422



In [105]:
# test performance
lr = 1.0
n_est = 19
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, final_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=19, random_state=None)
	Accuracy: 0.89573	Precision: 0.63956	Recall: 0.49950	F1: 0.56092	F2: 0.52238
	Total predictions: 15000	True positives:  999	False positives:  563	False negatives: 1001	True negatives: 12437



In [106]:
# test performance
lr = 0.95
n_est = 18
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, final_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.95, n_estimators=18, random_state=None)
	Accuracy: 0.89480	Precision: 0.63863	Recall: 0.48600	F1: 0.55196	F2: 0.51040
	Total predictions: 15000	True positives:  972	False positives:  550	False negatives: 1028	True negatives: 12450



In [107]:
# test performance
lr = 0.9
n_est = 18
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, final_feature_list)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.9, n_estimators=18, random_state=None)
	Accuracy: 0.89533	Precision: 0.64126	Recall: 0.48800	F1: 0.55423	F2: 0.51250
	Total predictions: 15000	True positives:  976	False positives:  546	False negatives: 1024	True negatives: 12454
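
To compare the five tuning runs side by side, the recorded scores can be collected in one place and ranked. This is a minimal sketch, not the notebook's own method: `runs` and `best_run` are hypothetical names, the numbers are copied from the outputs of cells 103 to 107, and ranking by F1 is an assumption since the notebook does not state a selection criterion.

```python
from __future__ import print_function

# (learning_rate, n_estimators) -> scores copied from the outputs of cells 103-107
runs = {
    (1.0, 18): {"precision": 0.64318, "recall": 0.50200, "f1": 0.56389},
    (1.05, 18): {"precision": 0.63020, "recall": 0.49250, "f1": 0.55290},
    (1.0, 19): {"precision": 0.63956, "recall": 0.49950, "f1": 0.56092},
    (0.95, 18): {"precision": 0.63863, "recall": 0.48600, "f1": 0.55196},
    (0.9, 18): {"precision": 0.64126, "recall": 0.48800, "f1": 0.55423},
}


def best_run(runs, metric="f1"):
    """Return the (params, scores) pair with the highest value of `metric`."""
    return max(runs.items(), key=lambda item: item[1][metric])


# print the runs from best to worst F1
ranked = sorted(runs.items(), key=lambda item: item[1]["f1"], reverse=True)
for (lr, n_est), scores in ranked:
    print("lr=%.2f  n_est=%d  precision=%.5f  recall=%.5f  f1=%.5f"
          % (lr, n_est, scores["precision"], scores["recall"], scores["f1"]))

best_params, best_scores = best_run(runs)
print("best by F1:", best_params)
```

Among these five runs, `learning_rate = 1.0` with `n_estimators = 18` has the highest precision, recall and F1, so the choice does not depend on which of the three metrics is used for ranking.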



# Created Features Performance

In [108]:
# final feature selection, including the created feature 'nf_to_poi'
final_feature_list = ['poi', 'exercised_stock_options', 'restricted_stock',
                      'shared_receipt_with_poi', 'expenses', 'other',
                      'nf_to_poi']

# format the data using the feature list defined above
data = featureFormat(data_dict, final_feature_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

# split into training and test sets
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

# print the features
print(final_feature_list)
print("\n Numbers of features: " + str(len(final_feature_list)))

# test performance with the tuned parameters
lr = 1.0
n_est = 18
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, final_feature_list)

['poi', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'expenses', 'other', 'nf_to_poi']

 Numbers of features: 7
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=18, random_state=None)
	Accuracy: 0.89647	Precision: 0.64318	Recall: 0.50200	F1: 0.56389	F2: 0.52505
	Total predictions: 15000	True positives: 1004	False positives:  557	False negatives:  996	True negatives: 12443



**Only one created feature lasted this far**

In [109]:
# feature list with the created feature 'nf_to_poi' removed
final_feature_list = ['poi', 'exercised_stock_options', 'restricted_stock',
                      'shared_receipt_with_poi', 'expenses', 'other']

# format the data using the reduced feature list
data = featureFormat(data_dict, final_feature_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

# split into training and test sets
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

# print the feature list
print(final_feature_list)
print("\n Numbers of features: " + str(len(final_feature_list)))

# test performance with the same tuned parameters
lr = 1.0
n_est = 18
clf_ADA = AdaBoostClassifier(learning_rate = lr, n_estimators = n_est)
test_classifier(clf_ADA, data_dict, final_feature_list)

['poi', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'expenses', 'other']

 Numbers of features: 6
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=18, random_state=None)
	Accuracy: 0.86820	Precision: 0.50805	Recall: 0.36300	F1: 0.42345	F2: 0.38498
	Total predictions: 15000	True positives:  726	False positives:  703	False negatives: 1274	True negatives: 12297



**Without the created feature `nf_to_poi`, performance drops across the board: accuracy falls from 0.896 to 0.868, precision from 0.643 to 0.508, and recall from 0.502 to 0.363.**