# Introduction

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I'm putting my new skills to use by building a person of interest identifier based on financial and email data made public as a result of the Enron scandal.

# Understanding the Dataset and Question

The goal of the project is to identify Enron employees who may have committed fraud, i.e., persons of interest, based on the public Enron financial and email dataset. We define a person of interest (POI) as an individual who was indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.

Machine learning algorithms are useful for goals like this one because they can process datasets far faster than humans and can spot relevant trends that humans would have a hard time noticing manually. Here is some background on the Enron financial and email dataset.

* Employees

There are 146 Enron employees in the dataset. 18 of them are POIs.

* Features

There are fourteen (14) financial features. All units are US dollars.

* salary
* deferral_payments
* total_payments
* loan_advances
* bonus
* restricted_stock_deferred
* deferred_income
* total_stock_value
* expenses
* exercised_stock_options
* other
* long_term_incentive
* restricted_stock
* director_fees

There are six (6) email features. All units are numbers of email messages, except for ‘email_address’, which is a text string.

* to_messages
* email_address
* from_poi_to_this_person
* from_messages
* from_this_person_to_poi
* shared_receipt_with_poi

There is one (1) other feature, which is a boolean, indicating whether or not the employee is a person of interest.
* poi



In [145]:
def poiEmails():
    """Return known email addresses of persons of interest."""
    email_list = ["kenneth_lay@enron.net",
            "kenneth_lay@enron.com",
            "klay.enron@enron.com",
            "kenneth.lay@enron.com",
            "klay@enron.com",
            "layk@enron.com",
            "chairman.ken@enron.com",
            "jeffreyskilling@yahoo.com",
            "jeff_skilling@enron.com",
            "jskilling@enron.com",
            "effrey.skilling@enron.com",
            "skilling@enron.com",
            "jeffrey.k.skilling@enron.com",
            "jeff.skilling@enron.com",
            "kevin_a_howard.enronxgate.enron@enron.net",
            "kevin.howard@enron.com",
            "kevin.howard@enron.net",
            "kevin.howard@gcm.com",
            "michael.krautz@enron.com",
            "scott.yeager@enron.com",
            "syeager@fyi-net.com",
            "scott_yeager@enron.net",
            "syeager@flash.net",
            "joe'.'hirko@enron.com", 
            "joe.hirko@enron.com", 
            "rex.shelby@enron.com", 
            "rex.shelby@enron.nt", 
            "rex_shelby@enron.net",
            "jbrown@enron.com",
            "james.brown@enron.com", 
            "rick.causey@enron.com", 
            "richard.causey@enron.com", 
            "rcausey@enron.com",
            "calger@enron.com",
            "chris.calger@enron.com", 
            "christopher.calger@enron.com", 
            "ccalger@enron.com",
            "tim_despain.enronxgate.enron@enron.net", 
            "tim.despain@enron.com",
            "kevin_hannon@enron.com", 
            "kevin'.'hannon@enron.com", 
            "kevin_hannon@enron.net", 
            "kevin.hannon@enron.com",
            "mkoenig@enron.com", 
            "mark.koenig@enron.com",
            "m..forney@enron.com",
            "ken'.'rice@enron.com", 
            "ken.rice@enron.com",
            "ken_rice@enron.com", 
            "ken_rice@enron.net",
            "paula.rieker@enron.com",
            "prieker@enron.com", 
            "andrew.fastow@enron.com", 
            "lfastow@pdq.net", 
            "andrew.s.fastow@enron.com", 
            "lfastow@pop.pdq.net", 
            "andy.fastow@enron.com",
            "david.w.delainey@enron.com", 
            "delainey.dave@enron.com", 
            "'delainey@enron.com", 
            "david.delainey@enron.com", 
            "'david.delainey'@enron.com", 
            "dave.delainey@enron.com", 
            "delainey'.'david@enron.com",
            "ben.glisan@enron.com", 
            "bglisan@enron.com", 
            "ben_f_glisan@enron.com", 
            "ben'.'glisan@enron.com",
            "jeff.richter@enron.com", 
            "jrichter@nwlink.com",
            "lawrencelawyer@aol.com", 
            "lawyer'.'larry@enron.com", 
            "larry_lawyer@enron.com", 
            "llawyer@enron.com", 
            "larry.lawyer@enron.com", 
            "lawrence.lawyer@enron.com",
            "tbelden@enron.com", 
            "tim.belden@enron.com", 
            "tim_belden@pgn.com", 
            "tbelden@ect.enron.com",
            "michael.kopper@enron.com",
            "dave.duncan@enron.com", 
            "dave.duncan@cipco.org", 
            "duncan.dave@enron.com",
            "ray.bowen@enron.com", 
            "raymond.bowen@enron.com", 
            "'bowen@enron.com",
            "wes.colwell@enron.com",
            "dan.boyle@enron.com",
            "cloehr@enron.com", 
            "chris.loehr@enron.com"
        ]
    return email_list


In [146]:
#!/usr/bin/python

""" a basic script for importing student's POI identifier,
    and checking the results that they get from it 
 
    requires that the algorithm, dataset, and features list
    be written to my_classifier.pkl, my_dataset.pkl, and
    my_feature_list.pkl, respectively

    that process should happen at the end of poi_id.py
"""

import pickle
import sys
from sklearn.cross_validation import StratifiedShuffleSplit
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

PERF_FORMAT_STRING = "\
\tAccuracy: {:>0.{display_precision}f}\tPrecision: {:>0.{display_precision}f}\t\
Recall: {:>0.{display_precision}f}\tF1: {:>0.{display_precision}f}\tF2: {:>0.{display_precision}f}"
RESULTS_FORMAT_STRING = "\tTotal predictions: {:4d}\tTrue positives: {:4d}\tFalse positives: {:4d}\
\tFalse negatives: {:4d}\tTrue negatives: {:4d}"

def test_classifier(clf, dataset, feature_list, folds = 1000):
    data = featureFormat(dataset, feature_list, sort_keys = True)
    labels, features = targetFeatureSplit(data)
    cv = StratifiedShuffleSplit(labels, folds, random_state = 42)
    true_negatives = 0
    false_negatives = 0
    true_positives = 0
    false_positives = 0
    for train_idx, test_idx in cv: 
        features_train = []
        features_test  = []
        labels_train   = []
        labels_test    = []
        for ii in train_idx:
            features_train.append( features[ii] )
            labels_train.append( labels[ii] )
        for jj in test_idx:
            features_test.append( features[jj] )
            labels_test.append( labels[jj] )
        
        ### fit the classifier using training set, and test on test set
        clf.fit(features_train, labels_train)
        predictions = clf.predict(features_test)
        for prediction, truth in zip(predictions, labels_test):
            if prediction == 0 and truth == 0:
                true_negatives += 1
            elif prediction == 0 and truth == 1:
                false_negatives += 1
            elif prediction == 1 and truth == 0:
                false_positives += 1
            elif prediction == 1 and truth == 1:
                true_positives += 1
            else:
                print "Warning: Found a predicted label not == 0 or 1."
                print "All predictions should take value 0 or 1."
                print "Evaluating performance for processed predictions:"
                break
    try:
        total_predictions = true_negatives + false_negatives + false_positives + true_positives
        accuracy = 1.0*(true_positives + true_negatives)/total_predictions
        precision = 1.0*true_positives/(true_positives+false_positives)
        recall = 1.0*true_positives/(true_positives+false_negatives)
        f1 = 2.0 * true_positives/(2*true_positives + false_positives+false_negatives)
        f2 = (1+2.0*2.0) * precision*recall/(4*precision + recall)
        print clf
        print PERF_FORMAT_STRING.format(accuracy, precision, recall, f1, f2, display_precision = 5)
        print RESULTS_FORMAT_STRING.format(total_predictions, true_positives, false_positives, false_negatives, true_negatives)
        print ""
    except ZeroDivisionError:
        print "Got a divide by zero when trying out:", clf
        print "Precision or recall may be undefined due to a lack of true positive predictions."

CLF_PICKLE_FILENAME = "my_classifier.pkl"
DATASET_PICKLE_FILENAME = "my_dataset.pkl"
FEATURE_LIST_FILENAME = "my_feature_list.pkl"

def dump_classifier_and_data(clf, dataset, feature_list):
    with open(CLF_PICKLE_FILENAME, "w") as clf_outfile:
        pickle.dump(clf, clf_outfile)
    with open(DATASET_PICKLE_FILENAME, "w") as dataset_outfile:
        pickle.dump(dataset, dataset_outfile)
    with open(FEATURE_LIST_FILENAME, "w") as featurelist_outfile:
        pickle.dump(feature_list, featurelist_outfile)

def load_classifier_and_data():
    with open(CLF_PICKLE_FILENAME, "r") as clf_infile:
        clf = pickle.load(clf_infile)
    with open(DATASET_PICKLE_FILENAME, "r") as dataset_infile:
        dataset = pickle.load(dataset_infile)
    with open(FEATURE_LIST_FILENAME, "r") as featurelist_infile:
        feature_list = pickle.load(featurelist_infile)
    return clf, dataset, feature_list

def main():
    ### load up student's classifier, dataset, and feature_list
    clf, dataset, feature_list = load_classifier_and_data()
    ### Run testing script
    test_classifier(clf, dataset, feature_list)

if __name__ == '__main__':
    main()


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=75, splitter='best')
	Accuracy: 0.90880	Precision: 0.66255	Recall: 0.64400	F1: 0.65314	F2: 0.64763
	Total predictions: 15000	True positives: 1288	False positives:  656	False negatives:  712	True negatives: 12344



# Data Exploration


20 of the 21 features have missing values (represented as "NaN"), with the exception being the "poi" feature.

The missing financial features are imputed by featureFormat to zero (0). Imputing to zero makes sense for these features because we have a reasonably complete financial picture through the FindLaw "Payments to Insiders" document. I am assuming that if a feature has a dash ('-'), like several 'bonus' rows do, that means it is zero. That isn't an unreasonable assumption since there are no actual zeros in that document and the dashes take their place.

I imputed the missing email features to each feature's median. Imputing to zero doesn't make sense in this case because the email data appears incomplete. 60 of the 146 employees in the dataset have "NaN" for all of their email features. A missing feature likely means we couldn't find the data, rather than that the value is zero. Though this introduces some bias, we are at the whim of the dataset and imputing to the median is a fine option.

In [147]:
#!/usr/bin/python

import sys
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.decomposition import PCA


In [148]:
import pickle
sys.path.append("../tools")
from feature_format import featureFormat, targetFeatureSplit
import tester

In [149]:
features_list = ['poi',
                'salary',
                'bonus', 
                'long_term_incentive', 
                'deferred_income', 
                'deferral_payments',
                'loan_advances', 
                'other',
                'expenses', 
                'director_fees',
                'total_payments',
                'exercised_stock_options',
                'restricted_stock',
                'restricted_stock_deferred',
                'total_stock_value',
                'to_messages',
                'from_messages',
                'from_this_person_to_poi',
                'from_poi_to_this_person']

 # features_list 

In [150]:
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r")  as data_file:
    data_dict = pickle.load(data_file)

# Transform data from dictionary to the Pandas DataFrame
df = pd.DataFrame.from_dict(data_dict, orient = "index")

#Order columns in DataFrame, exclude email column
df = df[features_list]
df = df.replace('NaN', np.nan)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 19 columns):
poi                          146 non-null bool
salary                       95 non-null float64
bonus                        82 non-null float64
long_term_incentive          66 non-null float64
deferred_income              49 non-null float64
deferral_payments            39 non-null float64
loan_advances                4 non-null float64
other                        93 non-null float64
expenses                     95 non-null float64
director_fees                17 non-null float64
total_payments               125 non-null float64
exercised_stock_options      102 non-null float64
restricted_stock             110 non-null float64
restricted_stock_deferred    18 non-null float64
total_stock_value            126 non-null float64
to_messages                  86 non-null float64
from_messages                86 non-null float64
from_this_person_to_poi      86 non-null float

In [151]:
#split of POI and non-POI in the dataset
poi_non_poi = df.poi.value_counts()
poi_non_poi.index = ["non-POI", "POI"]
print "poi / non_poi split"
poi_non_poi

poi / non_poi split


non-POI    128
POI         18
Name: poi, dtype: int64
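
With only 18 POIs among 146 employees, the classes are heavily imbalanced. A quick back-of-the-envelope sketch, using only the counts above, shows what a trivial classifier that labels everyone as non-POI would score. It is meant as an illustration of why accuracy alone can mislead here, not as one of the models in this project:

```python
# Class counts taken from the value_counts() output above
n_non_poi = 128
n_poi = 18
n_total = n_non_poi + n_poi

# A trivial classifier that labels everyone as non-POI
baseline_accuracy = n_non_poi / float(n_total)
baseline_recall = 0 / float(n_poi)  # it never finds a single POI

print("POI share: {:.3f}".format(n_poi / float(n_total)))
print("Baseline accuracy: {:.3f}".format(baseline_accuracy))
print("Baseline recall: {:.3f}".format(baseline_recall))
```

Such a baseline would look accurate while being useless for finding POIs, which is one reason to look at precision and recall as well.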

# Data Cleaning 

In [152]:
## Amount of NA's in the dataset

print "Amount of NaN values in the dataset: ", df.isnull().sum().sum()

Amount of NaN values in the dataset:  1263


According to the financial data from FindLaw, NaNs represent values of 0 rather than missing values. That's why I am going to replace all NaNs with 0.

In [153]:
# Replacing "NaN" in financial features with 0

df.ix[:,:15] = df.ix[:,:15].fillna(0)

NaN values in the email features mean the information is missing. Thus, I will split the data into 2 classes, POI and non-POI, for the sake of missing value imputation, and use the median of each class to impute the NaNs.

In [154]:
### Replacing NAN's
email_features = ["to_messages", "from_messages", "from_this_person_to_poi", "from_poi_to_this_person"]

imp = Imputer(missing_values = "NaN", strategy = "median", axis = 0)


### Impute missing values of email features 
df.loc[df[df.poi == 1].index, email_features] = imp.fit_transform(df[email_features][df.poi == 1])
df.loc[df[df.poi == 0].index, email_features] = imp.fit_transform(df[email_features][df.poi == 0])
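
The per-class median imputation can also be sketched without pandas or scikit-learn. This is a minimal illustration on made-up records using only the Python 3 standard library; the helper `impute_by_class` and the toy values are hypothetical and not part of the project code:

```python
from statistics import median

# Hypothetical toy records: None marks a missing email count
records = [
    {"poi": True,  "to_messages": 900},
    {"poi": True,  "to_messages": None},
    {"poi": True,  "to_messages": 1100},
    {"poi": False, "to_messages": 300},
    {"poi": False, "to_messages": None},
    {"poi": False, "to_messages": 500},
    {"poi": False, "to_messages": 700},
]

def impute_by_class(rows, feature, label="poi"):
    """Fill missing values of `feature` with the median of the row's class."""
    medians = {}
    for cls in set(row[label] for row in rows):
        observed = [row[feature] for row in rows
                    if row[label] == cls and row[feature] is not None]
        medians[cls] = median(observed)
    # Build new dicts so the input rows are left untouched
    return [dict(row, **{feature: medians[row[label]]})
            if row[feature] is None else dict(row)
            for row in rows]

imputed = impute_by_class(records, "to_messages")
print([row["to_messages"] for row in imputed])
```

One thing to keep in mind with this design: the class-specific medians use the POI label itself, so the imputed values may carry label information into the features and could make validation scores look better than they would on unlabeled data.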

Next, I will check the accuracy of the financial data by summing up the payment features and comparing the sum with the total_payments feature. I will also compare the sum of the stock features with total_stock_value.

In [155]:
### check data: summing payments features and comparing them with total_payments
payments = ["salary",
            "bonus", 
            "long_term_incentive", 
            "deferred_income", 
            "deferral_payments",
            "loan_advances", 
            "other",
            "expenses", 
            "director_fees"]

df[df[payments].sum(axis = "columns") != df.total_payments]

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_this_person_to_poi,from_poi_to_this_person
BELFER ROBERT,False,0.0,0.0,0.0,0.0,-102500.0,0.0,0.0,0.0,3285.0,102500.0,3285.0,0.0,44093.0,-44093.0,944.0,41.0,6.0,26.5
BHATNAGAR SANJAY,False,0.0,0.0,0.0,0.0,0.0,0.0,137864.0,0.0,137864.0,15456290.0,2604490.0,-2604490.0,15456290.0,0.0,523.0,29.0,1.0,0.0


In [156]:
stock_value = ["exercised_stock_options",
              "restricted_stock",
              "restricted_stock_deferred"]

df[df[stock_value].sum(axis = "columns") != df.total_stock_value]

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_this_person_to_poi,from_poi_to_this_person
BELFER ROBERT,False,0.0,0.0,0.0,0.0,-102500.0,0.0,0.0,0.0,3285.0,102500.0,3285.0,0.0,44093.0,-44093.0,944.0,41.0,6.0,26.5
BHATNAGAR SANJAY,False,0.0,0.0,0.0,0.0,0.0,0.0,137864.0,0.0,137864.0,15456290.0,2604490.0,-2604490.0,15456290.0,0.0,523.0,29.0,1.0,0.0


It seems two samples have data entry mistakes. I am going to rectify the situation by correcting the wrongly entered data. Afterwards, both checks are rerun to cross-check the data set (an empty DataFrame means no samples with mistakes remain).

In [157]:
df.ix['BELFER ROBERT','total_payments'] = 3285
df.ix['BELFER ROBERT','deferral_payments'] = 0
df.ix['BELFER ROBERT','restricted_stock'] = 44093
df.ix['BELFER ROBERT','restricted_stock_deferred'] = -44093
df.ix['BELFER ROBERT','total_stock_value'] = 0
df.ix['BELFER ROBERT','director_fees'] = 102500
df.ix['BELFER ROBERT','deferred_income'] = -102500
df.ix['BELFER ROBERT','exercised_stock_options'] = 0
df.ix['BELFER ROBERT','expenses'] = 3285
df.ix['BELFER ROBERT',]
df.ix['BHATNAGAR SANJAY','expenses'] = 137864
df.ix['BHATNAGAR SANJAY','total_payments'] = 137864
df.ix['BHATNAGAR SANJAY','exercised_stock_options'] = 1.54563e+07
df.ix['BHATNAGAR SANJAY','restricted_stock'] = 2.60449e+06
df.ix['BHATNAGAR SANJAY','restricted_stock_deferred'] = -2.60449e+06
df.ix['BHATNAGAR SANJAY','other'] = 0
df.ix['BHATNAGAR SANJAY','director_fees'] = 0
df.ix['BHATNAGAR SANJAY','total_stock_value'] = 1.54563e+07
df.ix['BHATNAGAR SANJAY']

df[df[payments].sum(axis='columns') != df.total_payments]

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_this_person_to_poi,from_poi_to_this_person


In [158]:
df[df[stock_value].sum(axis='columns') != df.total_stock_value]

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_this_person_to_poi,from_poi_to_this_person
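
The consistency check used in this section boils down to comparing a sum of components with a reported total. Here is a minimal sketch of that idea in plain Python; the rows, the shortened component list, and the helper `inconsistent_rows` are hypothetical and only illustrate the logic:

```python
# Hypothetical rows; the component names mirror some of the payment features
PAYMENT_PARTS = ["salary", "bonus", "expenses"]

rows = {
    "PERSON A": {"salary": 200000, "bonus": 50000, "expenses": 1000,
                 "total_payments": 251000},
    "PERSON B": {"salary": 0, "bonus": 0, "expenses": 3285,
                 "total_payments": 102500},
}

def inconsistent_rows(table, parts, total_key):
    """Return the names whose components do not add up to the reported total."""
    return [name for name, row in sorted(table.items())
            if sum(row[p] for p in parts) != row[total_key]]

print(inconsistent_rows(rows, PAYMENT_PARTS, "total_payments"))
```

Exact equality is reasonable here because the amounts are whole dollars. For fractional amounts, a small tolerance would probably be safer.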


# Outlier Investigation 

In the previous step, the dataset was cleaned of missing values as well as typos. Next, I explore the data for plausible outliers using descriptive statistics. I treat as outliers the values that are higher than **Q2 + 1.5IQR** or lower than **Q2 - 1.5IQR**, where Q2 is the median of the distribution and IQR is the interquartile range. This is stricter than the more common rule, which measures from the quartiles Q1 and Q3 rather than from the median. In the code below I only check the upper bound. I will count the outlier variables for each person and sort the counts in descending order.

In [159]:
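outliers = df.quantile(.5) + 1.5 * (df.quantile(.75) - df.quantile(.25))

# Compare every feature column except 'poi' against its threshold
# (df[1:] would slice rows and silently skip the first person)
feature_cols = df.columns[1:]
pd.DataFrame((df[feature_cols] > outliers[feature_cols]).sum(axis = 1), columns = ["# of outliers"]).\
    sort_values("# of outliers", ascending = False).head(7)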

Unnamed: 0,# of outliers
TOTAL,12
LAY KENNETH L,12
FREVERT MARK A,12
WHALLEY LAWRENCE G,11
SKILLING JEFFREY K,11
LAVORATO JOHN J,9
MCMAHON JEFFREY,8
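
To make the thresholding concrete, here is a small sketch of the same "median plus 1.5 IQR" rule on made-up numbers, using only the Python 3.8+ standard library. The people, features, and values are hypothetical. I'm assuming the `inclusive` method of `statistics.quantiles` matches the default linear interpolation in pandas:

```python
from statistics import quantiles

# Hypothetical toy data: three features for five people
people = {
    "A": {"salary": 100, "bonus": 10, "expenses": 5},
    "B": {"salary": 110, "bonus": 20, "expenses": 6},
    "C": {"salary": 120, "bonus": 30, "expenses": 7},
    "D": {"salary": 130, "bonus": 40, "expenses": 90},
    "E": {"salary": 1000, "bonus": 500, "expenses": 8},
}
features = ["salary", "bonus", "expenses"]

def upper_threshold(values):
    """Median plus 1.5 times the interquartile range."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q2 + 1.5 * (q3 - q1)

thresholds = {f: upper_threshold([p[f] for p in people.values()])
              for f in features}

# Count, per person, how many features exceed their threshold
outlier_counts = {name: sum(p[f] > thresholds[f] for f in features)
                  for name, p in people.items()}
ranking = sorted(outlier_counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranking[:2])
```

In this toy data, person E exceeds the threshold on two features and person D on one, so they end up at the top of the ranking.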


Our data set is really small, so I'm going to consider only the roughly 5% of samples with the largest number of outlier variables:

* The first entry is 'TOTAL', which is the column total of the financial payments in the FindLaw data. As it doesn't represent a person, I'm going to exclude it from the data set.
* Kenneth Lay and Jeffrey Skilling are very well-known figures from Enron. I should keep them, as they represent genuine anomalies rather than erroneous outliers.
* Mark Frevert and Lawrence Whalley are not as well known, but they were top managers at Enron who also represent valuable examples for the model. I'm also going to keep them in the data set.
* John Lavorato is not a well-known person as far as I could find on the internet. I don't think he represents a valid point, so I exclude him.
* Jeffrey McMahon is the former treasurer who preceded Ben Glisan, the guilty treasurer. I would exclude him from the data set, as his tenure came before Glisan's and might add some confusion to the model.

Of the 7 persons considered, I ended up excluding 3 entries (the 'TOTAL' row and 2 persons).

In [160]:
scaler = StandardScaler()
df_norm = df[features_list]
df_norm = scaler.fit_transform(df_norm.ix[:,1:])

clf = GaussianNB()

features_list2 = ['poi'] + range(8)

my_dataset = pd.DataFrame(SelectKBest(f_classif, k = 8).fit_transform(df_norm, df.poi), index = df.index)
my_dataset.insert(0, "poi", df.poi)
my_dataset = my_dataset.to_dict(orient = "index")



In [161]:
dump_classifier_and_data(clf, my_dataset, features_list2)

tester.main()

GaussianNB(priors=None)
	Accuracy: 0.33760	Precision: 0.14848	Recall: 0.83800	F1: 0.25226	F2: 0.43447
	Total predictions: 15000	True positives: 1676	False positives: 9612	False negatives:  324	True negatives: 3388



In [162]:
### exclude 3 outliers from the data set

df = df.drop(['TOTAL', 'LAVORATO JOHN J', 'MCMAHON JEFFREY'],0)

# Optimise Feature Selection/Engineering

Machine learning uses so-called features (i.e. variables or attributes) to generate predictive models. Using a suitable combination of features is essential for obtaining high precision and accuracy. Because too many (unspecific) features pose the problem of overfitting the model, we generally want to restrict the features in our models to those that are most relevant for the response variable we want to predict. Using as few features as possible also reduces the complexity of our models, which means they need less time and computing power to run and are easier to understand.

There are several ways to identify how much each feature contributes to the model and to restrict the number of selected features. Here, I am going to examine the effect of feature selection via

  * Decision Tree Classifier, including choosing the features with the feature importances attribute and tuning the model,

or Random Forest models.



## Feature Engineering

For both strategies I tried to create new features as fractions of almost all financial variables (e.g. fractional bonus as the ratio of bonus to total_payments). The logic behind the email feature creation was to check the fraction of emails sent to POIs out of all sent emails, and the fraction of emails received from POIs out of all received emails.
I ended up using one new feature, fraction_to_poi:

In [163]:
### Create additional feature : fraction of person's email to POI to all sent messages

df["fraction_to_poi"] = df["from_this_person_to_poi"]/df["from_messages"]

### Clean all infinite values, which we get if the person's from_messages = 0
df = df.replace([np.inf, -np.inf], 0)
#df.head(7)
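
An alternative to cleaning up infinities afterwards is to guard the division itself. Here is a minimal sketch in plain Python; the helper `fraction` and the message counts are hypothetical, and returning 0.0 for an undefined ratio is an assumption that mirrors the replacement with 0 above:

```python
def fraction(part, whole):
    """Return part / whole, or 0.0 when the ratio is undefined."""
    if part is None or whole is None or whole == 0:
        return 0.0
    return float(part) / whole

# Hypothetical message counts: (from_this_person_to_poi, from_messages)
examples = {"PERSON A": (30, 120), "PERSON B": (0, 0), "PERSON C": (None, 45)}
fraction_to_poi = {name: fraction(to_poi, sent)
                   for name, (to_poi, sent) in sorted(examples.items())}
print(fraction_to_poi)
```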

The second strategy showed significantly better results, so I'm going to concentrate on it for the rest of the project. A decision tree doesn't require any feature scaling, so I've skipped this step.

## Feature Selection 

In the feature selection step, I fitted my DecisionTreeClassifier with all features and, as a result, obtained the features with non-zero feature importance, sorted by importance.

In [164]:
### Decision tree using features with non-null importance
clf = DecisionTreeClassifier(random_state = 75)
clf.fit(df.ix[:, 1:], df.ix[:, :1])

### Show the feature with non-null importance
### Sort, create feature list of the feature model 
feature_importance = []
for i in range (len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0:
        feature_importance.append(([df.columns[i + 1], clf.feature_importances_[i]]))
feature_importance.sort(key = lambda x: x[1], reverse = True)

for f_i in feature_importance:
    print f_i

features_list = [x[0] for x in feature_importance]
features_list.insert(0, "poi")

['fraction_to_poi', 0.35824390243902443]
['expenses', 0.26431889023871075]
['to_messages', 0.16306330961503368]
['other', 0.084740740740740714]
['deferred_income', 0.070617283950617254]
['from_poi_to_this_person', 0.059015873015873015]
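
The filter-and-sort step can be shown in isolation. In this sketch the non-zero importances are rounded from the output above, while the two zero entries (`salary`, `bonus`) are hypothetical and only there to show that unused features get dropped:

```python
# Column names and importances; the zero entries are hypothetical additions
columns = ["salary", "expenses", "other", "fraction_to_poi", "bonus"]
importances = [0.0, 0.26, 0.08, 0.36, 0.0]

# Keep the features the tree actually used, most important first
ranked = sorted(((name, imp) for name, imp in zip(columns, importances) if imp > 0),
                key=lambda pair: pair[1], reverse=True)

# 'poi' goes first, as in the feature lists used throughout this project
features_list = ["poi"] + [name for name, _ in ranked]
print(features_list)
```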


According to the feature_importances_ attribute of the classifier, the newly created fraction_to_poi feature has the highest importance for the model. The number of features used for the model may cause different results. Later, in the algorithm tuning step, I'm going to repeat the step of choosing features with non-zero importance, so the number of features will change.

I'm using a random state equal to 75 in the decision tree and random forest to be able to reproduce the results. The exact value was manually chosen for better performance of the decision tree classifier.

# Model Building 

I focused on four algorithms, with parameter tuning incorporated into algorithm selection (i.e. parameters tuned for more than one algorithm, and best algorithm-tune combination selected for final analysis). These algorithms were:

* Decision Tree Classifier
* AdaBoost Classifier
* Random Forest
* GaussianNB

For the decision tree and random forest I've selected only features with non-zero importance, based on clf.feature_importances_.

### Model No. 1 -  Decision Tree Classifier

In [165]:
### Decision Tree Classifier with standard parameters

clf = DecisionTreeClassifier (random_state = 75)
my_dataset = df[features_list].to_dict(orient = "index")

In [166]:
tester.dump_classifier_and_data(clf, my_dataset, features_list)
tester.main()

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=75, splitter='best')
	Accuracy: 0.90880	Precision: 0.66255	Recall: 0.64400	F1: 0.65314	F2: 0.64763
	Total predictions: 15000	True positives: 1288	False positives:  656	False negatives:  712	True negatives: 12344



### Model No. 2 -  AdaBoost Classifier

In [168]:
### AdaBoost Classifier with standard parameters

from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(random_state = 75)

my_dataset = df[features_list].to_dict(orient = "index")

In [169]:
tester.dump_classifier_and_data(clf, my_dataset, features_list)
tester.main()

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=75)
	Accuracy: 0.91680	Precision: 0.69936	Recall: 0.65950	F1: 0.67885	F2: 0.66710
	Total predictions: 15000	True positives: 1319	False positives:  567	False negatives:  681	True negatives: 12433



### Model No. 3 -  Random Forest 

In [170]:
### Random Forest with standard parameters
clf = RandomForestClassifier(random_state = 75)
clf.fit(df.ix[:,1:], np.ravel(df.ix[:,:1]))

# selecting the features with non null importance, sorting and creating features_list for the model
features_importance = []
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0:
        features_importance.append([df.columns[i+1], clf.feature_importances_[i]])
features_importance.sort(key=lambda x: x[1], reverse = True)
features_list = [x[0] for x in features_importance]
features_list.insert(0, 'poi')

# number of features for best result was found iteratively
features_list2 = features_list[:11]
my_dataset = df[features_list2].to_dict(orient = 'index')


In [171]:
tester.dump_classifier_and_data(clf, my_dataset, features_list2)
tester.main()

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=75,
            verbose=0, warm_start=False)
	Accuracy: 0.89780	Precision: 0.70322	Recall: 0.40400	F1: 0.51318	F2: 0.44158
	Total predictions: 15000	True positives:  808	False positives:  341	False negatives: 1192	True negatives: 12659



### Model No. 4 -  GaussianNB  

In [172]:
# GaussianNB with feature standardization, selection, PCA

clf = GaussianNB()

# data set standardization
scaler = StandardScaler()
df_norm = df[features_list]
df_norm = scaler.fit_transform(df_norm.ix[:,1:])

# feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
features_list2 = ['poi']+range(3)
my_dataset = pd.DataFrame(SelectKBest(f_classif, k=8).fit_transform(df_norm, df.poi), index = df.index)

#PCA
pca = PCA(n_components=3)
my_dataset2 = pd.DataFrame(pca.fit_transform(my_dataset),  index=df.index)
my_dataset2.insert(0, "poi", df.poi)
my_dataset2 = my_dataset2.to_dict(orient = 'index')  



In [173]:
dump_classifier_and_data(clf, my_dataset2, features_list2)
tester.main()

GaussianNB(priors=None)
	Accuracy: 0.86447	Precision: 0.49065	Recall: 0.43300	F1: 0.46003	F2: 0.44342
	Total predictions: 15000	True positives:  866	False positives:  899	False negatives: 1134	True negatives: 12101



Though I used the classification report for quick checks, I used tester.py's evaluation metrics to make sure I would get precision and recall above 0.3 for the Udacity grading system. Here is how each performed:

In [174]:
pd.DataFrame([[0.90880, 0.66255, 0.64400, 0.65314],
              [0.89120, 0.60372, 0.53550, 0.56757],
              [0.89780, 0.70322, 0.40400, 0.51318],
              [0.86447, 0.49065, 0.43300, 0.46003]],
             columns = ['Accuracy','Precision', 'Recall', 'F1'], 
             index = ['Decision Tree Classifier', 'AdaBoost', 'Random Forest', 'Gaussian Naive Bayes'])

Unnamed: 0,Accuracy,Precision,Recall,F1
Decision Tree Classifier,0.9088,0.66255,0.644,0.65314
AdaBoost,0.8912,0.60372,0.5355,0.56757
Random Forest,0.8978,0.70322,0.404,0.51318
Gaussian Naive Bayes,0.86447,0.49065,0.433,0.46003
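
To show where these numbers come from, the Decision Tree row can be recomputed by hand from the confusion counts printed by tester.py earlier (1288 true positives, 656 false positives, 712 false negatives, 12344 true negatives). The formulas follow the ones in `test_classifier`:

```python
# Confusion counts from the Decision Tree tester output
tp, fp, fn, tn = 1288, 656, 712, 12344

total = tp + fp + fn + tn
accuracy = (tp + tn) / float(total)
precision = tp / float(tp + fp)   # share of flagged people who are POIs
recall = tp / float(tp + fn)      # share of POIs that were flagged
f1 = 2.0 * tp / (2 * tp + fp + fn)
f2 = 5.0 * precision * recall / (4 * precision + recall)

print("Accuracy {:.5f}  Precision {:.5f}  Recall {:.5f}  F1 {:.5f}  F2 {:.5f}".format(
    accuracy, precision, recall, f1, f2))
```

Precision answers "of the people flagged as POI, how many really are?", while recall answers "of the real POIs, how many did the model flag?".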


# Algorithm Tuning 

Tuning the parameters of an algorithm means adjusting the parameters in a certain way to achieve optimal algorithm performance. There are a variety of "certain ways" (e.g. a manual guess-and-check method or automatically with GridSearchCV) and "algorithm performance" can be measured in a variety of ways (e.g. accuracy, precision, or recall). If you don't tune the algorithm well, performance may suffer. The data won't be "learned" well and you won't be able to successfully make predictions on new data.
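
The automatic approach amounts to trying every combination in a parameter grid and keeping the best-scoring one. Here is a rough sketch of that loop in plain Python. The grid values are hypothetical, and `evaluate` is a made-up placeholder. In practice it would be a cross-validated metric, which is what GridSearchCV takes care of:

```python
from itertools import product

# Hypothetical grid; the parameter names follow DecisionTreeClassifier
param_grid = {
    "max_depth": [2, 3, 5],
    "min_samples_leaf": [1, 6],
}

def evaluate(params):
    """Placeholder score. A real run would return a cross-validated
    metric such as F1; this formula is made up for illustration."""
    return (1.0
            - abs(params["max_depth"] - 3) * 0.1
            - abs(params["min_samples_leaf"] - 6) * 0.01)

names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]
best_params = max(candidates, key=evaluate)
print(len(candidates), best_params)
```

The number of candidates grows multiplicatively with each added parameter, which is one reason to keep grids small on a data set this size.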

The bias-variance tradeoff is one of the key dilemmas in machine learning. High-bias algorithms have little capacity to learn, while high-variance algorithms react poorly to data they haven't seen before. A predictive model should be tuned to achieve a compromise between the two. The process of changing the parameters of an algorithm is algorithm tuning, and it lets us find the golden mean and the best result. For the chosen decision tree classifier, for example, I tried out multiple different parameter values for each of the following parameters (with the optimal combination bolded). I used Stratified Shuffle Split cross validation to guard against bias introduced by the potential underrepresentation of classes (i.e. POIs).
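
The idea behind a stratified shuffle split is to shuffle each class separately and take the same share of each for the test set, so that the rare POI class is never left out. The sketch below is a simplified stand-in written with the standard library, not scikit-learn's actual implementation. The function `stratified_split` and the label vector (same 128/18 class mix as the full data set) are assumptions for illustration:

```python
import random

def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one sample of every class in the test set
        n_test = max(1, int(round(len(indices) * test_fraction)))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical label vector: 128 non-POIs (0) and 18 POIs (1)
labels = [0] * 128 + [1] * 18
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Repeating such a split many times with different shuffles, as the tester does with 1000 folds, averages out the luck of any single small test set.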


###### Decision Tree
Accuracy: 0.90880	
Precision: 0.66255	
Recall: 0.64400	
F1: 0.65314

* Parameters
  + class_weight = None 
  + criterion = 'gini'
  + max_depth = None
  + max_features = None 
  + max_leaf_nodes = None
  + min_impurity_split = 1e-07
  + min_samples_leaf = 1
  + min_samples_split = 2
  + min_weight_fraction_leaf = 0.0
  + presort = False
  + random_state = 75
  + splitter = 'best'

######  AdaBoost 
Accuracy: 0.89120	
Precision: 0.60372	
Recall: 0.53550	
F1: 0.56757

* Parameters
  + algorithm = 'SAMME.R'
  + base_estimator = None
  + learning_rate = 1.0 
  + n_estimators = 50 
  + random_state = 75

###### Random Forest 
Accuracy: 0.89780
Precision: 0.70322
Recall: 0.40400
F1: 0.51318

* Parameters
  + bootstrap = True 
  + class_weight = None
  + criterion = 'gini'
  + max_depth = None
  + max_features = 'auto'
  + max_leaf_nodes = None
  + min_impurity_split = 1e-07
  + min_samples_leaf = 1
  + min_samples_split = 2
  + min_weight_fraction_leaf = 0.0
  + n_estimators = 10
  + n_jobs = 1
  + oob_score = False
  + random_state = 75
  + verbose = 0
  + warm_start = False

In [175]:
clf = DecisionTreeClassifier(criterion = "entropy",
                            min_samples_split = 19,
                            random_state = 75,
                            min_samples_leaf = 6,
                            max_depth = 3)

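# Fit on every column after the first (assumes the first column of df is the 'poi' label).
# .iloc replaces the deprecated .ix indexer, which newer pandas versions no longer provide.
clf.fit(df.iloc[:, 1:], df.poi)


# keep the features with non-zero importance, sorted in descending order, and build features_list for the model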
features_importance = []
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0:
        features_importance.append([df.columns[i+1], clf.feature_importances_[i]])
features_importance.sort(key=lambda x: x[1], reverse = True)

features_list = [x[0] for x in features_importance]
features_list.insert(0, 'poi')

In [176]:
my_dataset = df[features_list].to_dict(orient = 'index')
tester.dump_classifier_and_data(clf, my_dataset, features_list)
tester.main() 

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=6,
            min_samples_split=19, min_weight_fraction_leaf=0.0,
            presort=False, random_state=75, splitter='best')
	Accuracy: 0.93673	Precision: 0.83238	Recall: 0.65800	F1: 0.73499	F2: 0.68678
	Total predictions: 15000	True positives: 1316	False positives:  265	False negatives:  684	True negatives: 12735



# Model Evaluation and Validation
### Usage of Evaluation Metrics

In this project I used the F1 score as the key measure of each algorithm's performance. It takes both the precision and the recall of the test into account.
* Precision is the ability of the classifier not to label a negative sample as positive.
* Recall is the ability of the classifier to find all positive samples.
* The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.
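To make these definitions concrete, the sketch below recomputes the metrics from the confusion counts printed in the tester.py output above for the tuned decision tree. The helper name `f_beta` is my own and is not part of tester.py.

```python
# Counts copied from the tester.py output for the tuned decision tree
tp, fp, fn, tn = 1316, 265, 684, 12735


def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision = tp / (tp + fp)                   # share of predicted POIs that are real POIs
recall = tp / (tp + fn)                      # share of real POIs that were found
accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
f1 = f_beta(precision, recall, 1.0)
f2 = f_beta(precision, recall, 2.0)          # weights recall more heavily than precision

print(f"accuracy={accuracy:.5f} precision={precision:.5f} "
      f"recall={recall:.5f} f1={f1:.5f} f2={f2:.5f}")
```

Note that accuracy is high mainly because true negatives dominate the counts, which is why precision and recall are the more informative measures here.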

My tuned decision tree classifier showed precision 0.83238 and recall 0.65800, with a resulting F1 score of 0.73499. In other words, 83.24% of the employees flagged as POIs are actual POIs, and 65.80% of the actual POIs are identified.

### Validation Strategy
Validation is a way to substantiate your machine learning algorithm's performance, i.e., to test how well your model has been trained. A classic validation mistake is testing your algorithm on the same data that it was trained on. Without separating the training set from the testing set, it is difficult to determine how well your algorithm generalises to new data. Another way to validate a model is cross-validation: the data is split into k bins of equal size, the model is trained on k-1 bins and tested on the remaining one, this is repeated so that each bin serves as the test set once, and the test results are averaged.
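The k-fold procedure described above can be sketched in plain Python. This is an illustration only, not the validation code used in the project: the function names are hypothetical, the labels are synthetic (18 positives among 146 samples, matching the class counts reported for the dataset), and the "model" is a toy baseline that always predicts the majority class.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists so each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k bins of near-equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def majority_baseline_accuracy(labels, train, test):
    """Toy 'model': predict the most common training label for every test sample."""
    train_labels = [labels[i] for i in train]
    prediction = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == prediction for i in test) / len(test)


# Synthetic labels: 18 POIs (1) among 146 employees
labels = [1] * 18 + [0] * 128

scores = [majority_baseline_accuracy(labels, train, test)
          for train, test in k_fold_indices(len(labels), k=5)]
mean_score = sum(scores) / len(scores)
print(f"fold scores: {[round(s, 3) for s in scores]}, mean: {mean_score:.3f}")
```

The baseline reaches an accuracy of roughly 0.88 without identifying a single POI, which illustrates why accuracy alone is a poor yardstick on data this imbalanced.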

### Algorithm Performance

The two notable evaluation metrics for this POI identifier are precision and recall. The average precision for my decision tree classifier was 0.83238 and the average recall was 0.658.

> *What does each of these mean?*

*Precision* is how often our class prediction (POI vs. non-POI) is right when we guess that class. *Recall* is how often we guess the class (POI vs. non-POI) when the class actually occurred. In the context of our POI identifier, it is arguably more important to make sure we don't miss any POIs, so precision matters less. Imagine we are law enforcement using this algorithm as a screener to determine who to prosecute. When we guess POI, it is not the end of the world if the person isn't one, because we'll do due diligence and won't put anyone in jail right away. For our case, we want high recall: when someone is a POI, we need to make sure we catch them in our net. The decision tree had the best recall of the algorithms I tried, which is why I chose it for the final analysis.


Stratification is a way of controlling for confounders. Confounding is a major concern in causal studies because it results in biased estimates of exposure effects. In the extreme, this can mean that a causal effect is suggested where none exists, or that a true effect is hidden. With stratification, we form strata within which the confounding variables are approximately constant, estimate the stratum-specific effects of exposure, check whether these stratum-specific effects are roughly the same, and finally combine them into a single estimate of their common value.

In this project, I used Stratified Shuffle Split with 1000 runs, which makes sure that the training and testing sets have about the same ratio of POIs to non-POIs. It also shuffles the data to remove any bias that the ordering of employees in the dataset might introduce. Overfitting is also a strong possibility. The number of training instances needed to classify or predict a target label accurately grows exponentially as the number of features increases. This is known as the curse of dimensionality (see https://en.wikipedia.org/wiki/Curse_of_dimensionality). We have only a relatively small number of training instances to consider. More data is generally better than less data, and we don’t have that advantage here.
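The idea behind a stratified shuffle split can be sketched in plain Python. This is a simplified illustration, not scikit-learn's StratifiedShuffleSplit: the function name is hypothetical, the labels are synthetic (18 POIs among 146 employees), and the 10% test size is an assumption.

```python
import random


def simple_stratified_shuffle_split(labels, n_splits, test_size, seed=0):
    """Yield (train, test) index lists that keep each class's share roughly constant."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for _ in range(n_splits):
        train, test = [], []
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)  # reshuffle every run to remove ordering effects
            n_test = max(1, round(len(shuffled) * test_size))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield train, test


# Synthetic labels: 18 POIs (1) among 146 employees
labels = [1] * 18 + [0] * 128

for train, test in simple_stratified_shuffle_split(labels, n_splits=3, test_size=0.1):
    poi_share = sum(labels[i] for i in test) / len(test)
    print(len(train), len(test), round(poi_share, 3))
```

Sampling each class separately guarantees that every test set contains some POIs, which a plain random split of so few positives could not promise.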

The tuned decision tree used the following parameters:

* criterion = 'entropy'
* min_samples_split = 19
* random_state = 75
* min_samples_leaf = 6
* max_depth = 3

In [177]:
pd.DataFrame([[0.93673, 0.83238, 0.65800, 0.73499]],
             columns = ['Accuracy','Precision', 'Recall', 'F1'], 
             index = ['Decision Tree Classifier'])

| Algorithm | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Decision Tree Classifier | 0.93673 | 0.83238 | 0.65800 | 0.73499 |


# Reflection 

Before starting this project, I was completely sure that building a machine learning model was about choosing the right algorithm from a black box plus some magic. Working on the person of interest identifier, I went repeatedly through the cycle of data exploration, outlier detection and algorithm tuning, and spent most of my time on data preparation. Model performance rose significantly after missing value imputation, extra feature creation and feature selection, and less so after algorithm tuning, which showed me once again how important it is to fit the model with good data.

As argued under Algorithm Performance, missing a POI is more costly for this identifier than flagging a non-POI, so recall matters more than precision. The decision tree had the best recall (0.65) of the algorithms I tried, which is why I chose it for the final analysis.

This experience could be applied to other fraud detection tasks. I think the model could be improved further by using and tuning alternative algorithms such as Random Forest.

### Limitations of the study

It’s important to identify and acknowledge the limitations of the study. My conclusions are based only on the provided dataset, which represents just 145 persons. To establish real causation, I would need to gather all financial and email information about all Enron employees, which is most probably not possible. Missing email values were imputed with the median, so the mode of each email feature's distribution shifted to its median. Algorithms were tuned sequentially: I changed one parameter to achieve better performance and then switched to another, so there is a chance that other parameter combinations might give better model accuracy.