# Using Machine Learning to Identify Fraud in Enron Emails

By Trevor Cook

This project uses machine learning to identify Enron employees who may have been involved in the company's fraud. The dataset contains financial and email features for Enron employees. Using it, I built a classifier that predicts whether a given employee is a POI (person of interest) with precision and recall both above 0.3.

In [1]:
# %load poi_id.py
#!/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")

import numpy as np
import matplotlib.pyplot as plt
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from pprint import pprint

# Select what features to use.
# The first feature must be "poi".
features_list = ['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'from_poi_to_this_person']

# Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:  # pickle files should be opened in binary mode
    data_dict = pickle.load(data_file)


In [2]:
# Explore the dataset

# Count number of POIs and non-POIs
poi = 0
non_poi = 0
for v in data_dict.itervalues():
    if v['poi']:
        poi += 1
    else:
        non_poi += 1
        
# Count number of NaN values for each feature        
nan_features = {'bonus': 0,
 'deferral_payments': 0,
 'deferred_income': 0,
 'director_fees': 0,
 'email_address': 0,
 'exercised_stock_options': 0,
 'expenses': 0,
 'from_messages': 0,
 'from_poi_to_this_person': 0,
 'from_this_person_to_poi': 0,
 'loan_advances': 0,
 'long_term_incentive': 0,
 'other': 0,
 'restricted_stock': 0,
 'restricted_stock_deferred': 0,
 'salary': 0,
 'shared_receipt_with_poi': 0,
 'to_messages': 0,
 'total_payments': 0,
 'total_stock_value': 0}

for i in data_dict.values():
    for k in i.items():
        if k[1] == 'NaN':
            nan_features[k[0]] += 1       

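# Print out characteristics of the dataset
# Note: the first count sums the fields of every record (146 records x 21 fields),
# so it is the number of stored values rather than the number of records.
print 'Number of data points: ', sum(len(v) for v in data_dict.itervalues())
# features_list includes the 'poi' label, so this count is 19 features plus the label
print 'Number of features: ', len(features_list)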
print 'Number of POIs: ', poi
print 'Number of non POIs: ', non_poi
print 'Number of employees: ', poi + non_poi
print 'Number of NaN values per feature: '
pprint(sorted(nan_features.items(), key = lambda x: x[1], reverse=True))

Number of data points:  3066
Number of features:  20
Number of POIs:  18
Number of non POIs:  128
Number of employees:  146
Number of NaN values per feature: 
[('loan_advances', 142),
 ('director_fees', 129),
 ('restricted_stock_deferred', 128),
 ('deferral_payments', 107),
 ('deferred_income', 97),
 ('long_term_incentive', 80),
 ('bonus', 64),
 ('to_messages', 60),
 ('from_poi_to_this_person', 60),
 ('from_messages', 60),
 ('from_this_person_to_poi', 60),
 ('shared_receipt_with_poi', 60),
 ('other', 53),
 ('salary', 51),
 ('expenses', 51),
 ('exercised_stock_options', 44),
 ('restricted_stock', 36),
 ('email_address', 35),
 ('total_payments', 21),
 ('total_stock_value', 20)]


In [3]:
# Remove outliers
def remove_outliers(data, outliers):
    '''Removes outliers from the data'''
    for name in outliers:
        data.pop(name, 0)

outliers_list = ['TOTAL', 'THE TRAVEL AGENCY IN THE PARK', 'LOCKHART EUGENE E']      
remove_outliers(data_dict, outliers_list)


# Create new features
import math
# Add new features to data_dict
for name in data_dict:
    from_ratio = float(data_dict[name]['from_poi_to_this_person']) / float(data_dict[name]['to_messages']) 
    # Corrects for non-string NaN values resulting from new features
    if math.isnan(from_ratio):
        data_dict[name]['percent_from_poi'] = 0
    else:
        data_dict[name]['percent_from_poi'] = from_ratio

for name in data_dict:
    to_ratio = float(data_dict[name]['from_this_person_to_poi']) / float(data_dict[name]['from_messages'])
    if math.isnan(to_ratio):
        data_dict[name]['percent_to_poi'] = 0
    else:
        data_dict[name]['percent_to_poi'] = to_ratio

# Append new features to features_list
features_list.append('percent_from_poi')
features_list.append('percent_to_poi')


# Store to my_dataset for easy export below.
my_dataset = data_dict

# Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
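
The ratio features above can also be computed with a small helper. This is a minimal sketch rather than the notebook's code: `poi_ratio` and the three records are hypothetical names and values, and treating a zero denominator as 0.0 is an assumption added here (the cell above only handles the 'NaN' case). It is written to run under Python 3, unlike the Python 2 cells.

```python
import math


def poi_ratio(poi_messages, all_messages):
    """Fraction of messages that involve a POI, or 0.0 when it is undefined."""
    numerator = float(poi_messages)      # float('NaN') is valid and returns nan
    denominator = float(all_messages)
    if math.isnan(numerator) or math.isnan(denominator) or denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical records, not taken from the Enron data.
records = {
    "EMPLOYEE A": {"from_poi_to_this_person": 20, "to_messages": 400},
    "EMPLOYEE B": {"from_poi_to_this_person": "NaN", "to_messages": "NaN"},
    "EMPLOYEE C": {"from_poi_to_this_person": 0, "to_messages": 0},
}

for name, record in records.items():
    record["percent_from_poi"] = poi_ratio(
        record["from_poi_to_this_person"], record["to_messages"]
    )
```

Putting the 'NaN' and zero checks in one function keeps both new features on the same code path, so 'percent_to_poi' could reuse it with 'from_this_person_to_poi' and 'from_messages'.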

In [4]:
# Print SelectKBest scores
from sklearn.feature_selection import SelectKBest

def k_best_features(k):
    '''Selects k best features. Returns a dictionary where keys, values = feature name, score'''
    selector = SelectKBest(k=k)
    selector.fit(features, labels)
    scores = selector.scores_
    indices = selector.get_support(indices=True)
    best_scores = [scores[i] for i in indices]
    best_features = [features_list[i+1] for i in indices]
    feature_scores = zip(best_features, best_scores)
    sorted_features = (sorted(feature_scores, key=lambda x: x[1], reverse=True))
    pprint(sorted_features)
    return dict(sorted_features)
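
For context on the scores that SelectKBest prints: its default scoring function, f_classif, computes a one-way ANOVA F-value for each feature. The sketch below re-implements that statistic in plain Python 3 for a single feature. The function name `anova_f_score` and the sample values are illustrative assumptions, not part of the project code.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of one feature across the label groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n_total = len(values)
    grand_mean = sum(values) / float(n_total)

    between = 0.0  # variation of the group means around the grand mean
    within = 0.0   # variation of the values around their own group mean
    for members in groups.values():
        mean = sum(members) / float(len(members))
        between += len(members) * (mean - grand_mean) ** 2
        within += sum((v - mean) ** 2 for v in members)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (between / df_between) / (within / df_within)


# Hypothetical feature values for three non-POIs (label 0) and three POIs (label 1).
separated = anova_f_score([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
overlapping = anova_f_score([1, 4, 5, 2, 3, 6], [0, 0, 0, 1, 1, 1])
print(separated, overlapping)
```

A feature whose values separate the two classes cleanly gets a high score, while one whose class distributions overlap gets a score near zero, which is why the stock and bonus features rank at the top of the printed lists.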


In [5]:
# Try a variety of classifiers

# Split data into training and testing sets
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
train_test_split(features, labels, test_size=0.3, random_state=42)

from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Create classifiers
nb_clf = GaussianNB()
tree_clf = tree.DecisionTreeClassifier()
knn_clf = KNeighborsClassifier()
adb_clf = AdaBoostClassifier()

# Set range of parameters for classifiers
nb_params = {'feature_selection__k': range(1, 15)}
tree_params = {
    'algorithm__criterion': ['gini', 'entropy'],
    'algorithm__max_depth': range(2, 8, 2), 
    'algorithm__splitter':('best','random'),
    'algorithm__min_samples_split':[3,4,5],
    'algorithm__max_leaf_nodes':[5,10]
}
knn_params = {
    'feature_selection__k': range(3, 6),
    'algorithm__n_neighbors': range(2, 6)
}
adb_params = {
    'feature_selection__k': range(3, 8),
    'algorithm__n_estimators': [50, 60],
    'algorithm__algorithm': ['SAMME.R']
}

from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.grid_search import GridSearchCV


def test_algorithm(classifier, params):
    '''
    Function that takes in an algorithm classifier and its respective parameters as inputs.
    Performs SelectKBest feature selection and GridSearchCV for parameter selection into a pipeline. 
    Prints parameter options, pipeline steps, f1 score, best parameters, and precision and recall. 
    Returns pipeline classifier.
    '''
    
    select = SelectKBest()
    
    # Steps to be fed into pipeline
    steps = [('feature_selection', select),
             ('algorithm', classifier)]
    
    pipeline = Pipeline(steps)
    
    folds = 100
    # StratifiedShuffleSplit returns stratified randomized folds
    sss = StratifiedShuffleSplit(labels_train, n_iter=folds, random_state=42)
    gs = GridSearchCV(pipeline, param_grid = params, cv=sss, scoring = 'f1')
    
    print 'Parameters:'
    pprint(params)
    print ""
    
    # Print out pipeline steps
    print "Pipeline: \n", [step for step, _ in pipeline.steps], '\n'
    
    # Fit training data to GridSearchCV
    gs.fit(features_train, labels_train)
    
    # Print f1 score
    score = gs.best_score_
    print 'f1 score: \n', score, '\n'
    
    # Fetch optimal parameters found
    best_params = gs.best_estimator_.get_params()
    print 'Best Parameters: '
    for name in params.keys():
        print name, ': ', best_params[name]
        if name == 'feature_selection__k':
            k_best = k_best_features(best_params[name])
    
    pred = gs.predict(features_test)
    
    # Calculate and print precision and recall evaluation metrics    
    true_negatives = 0
    false_negatives = 0
    true_positives = 0
    false_positives = 0
    
    for prediction, truth in zip(pred, labels_test):
        if prediction == 0 and truth == 0:
            true_negatives += 1
        elif prediction == 0 and truth == 1:
            false_negatives += 1
        elif prediction == 1 and truth == 0:
            false_positives += 1
        elif prediction == 1 and truth == 1:
            true_positives += 1
    
    total_predictions = true_negatives + false_negatives + false_positives + true_positives
    accuracy = 1.0*(true_positives + true_negatives)/total_predictions
    # Guard against division by zero when there are no predicted or no actual POIs
    predicted_pois = true_positives + false_positives
    actual_pois = true_positives + false_negatives
    precision = 1.0*true_positives/predicted_pois if predicted_pois else 0.0
    recall = 1.0*true_positives/actual_pois if actual_pois else 0.0
    f1 = 2.0*true_positives/(predicted_pois + actual_pois) if (predicted_pois + actual_pois) else 0.0
    
    print ''
    print 'Evaluation metrics:'
    print 'Precision: ', precision
    print 'Recall: ', recall
    
    clf = gs.best_estimator_
    return clf
        

In [20]:
print 'Naive Bayes Algorithm: \n'
test_algorithm(nb_clf, nb_params)

Naive Bayes Algorithm: 

Parameters:
{'feature_selection__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]}

Pipeline: 
['feature_selection', 'algorithm'] 

f1 score: 
0.149666666667 

Best Parameters: 
feature_selection__k :  14
[('exercised_stock_options', 24.815079733218194),
 ('total_stock_value', 24.182898678566879),
 ('bonus', 20.792252047181535),
 ('salary', 18.289684043404513),
 ('percent_to_poi', 16.409712548035792),
 ('deferred_income', 11.458476579280369),
 ('long_term_incentive', 9.9221860131898225),
 ('restricted_stock', 9.2128106219771002),
 ('total_payments', 8.7727777300916756),
 ('shared_receipt_with_poi', 8.589420731682381),
 ('loan_advances', 7.1840556582887247),
 ('expenses', 6.0941733106389453),
 ('from_poi_to_this_person', 5.2434497133749582),
 ('other', 4.1874775069953749)]

Evaluation metrics:
Precision:  0.4
Recall:  0.4


Pipeline(steps=[('feature_selection', SelectKBest(k=14, score_func=<function f_classif at 0x10e026b90>)), ('algorithm', GaussianNB())])

In [21]:
print 'Decision Tree Algorithm: \n'
test_algorithm(tree_clf, tree_params)

Decision Tree Algorithm: 

Parameters:
{'algorithm__criterion': ['gini', 'entropy'],
 'algorithm__max_depth': [2, 4, 6],
 'algorithm__max_leaf_nodes': [5, 10],
 'algorithm__min_samples_split': [3, 4, 5],
 'algorithm__splitter': ('best', 'random')}

Pipeline: 
['feature_selection', 'algorithm'] 

f1 score: 
0.327333333333 

Best Parameters: 
algorithm__max_leaf_nodes :  10
algorithm__splitter :  best
algorithm__min_samples_split :  4
algorithm__criterion :  gini
algorithm__max_depth :  6

Evaluation metrics:
Precision:  0.333333333333
Recall:  0.2


Pipeline(steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10e026b90>)), ('algorithm', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=10, min_samples_leaf=1,
            min_samples_split=4, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])

In [24]:
print 'K Nearest Neighbor Algorithm: \n'
test_algorithm(knn_clf, knn_params)

K Nearest Neighbor Algorithm: 

Parameters:
{'algorithm__n_neighbors': [2, 3, 4, 5], 'feature_selection__k': [3, 4, 5]}

Pipeline: 
['feature_selection', 'algorithm'] 

f1 score: 
0.296666666667 

Best Parameters: 
feature_selection__k :  5
[('exercised_stock_options', 24.815079733218194),
 ('total_stock_value', 24.182898678566879),
 ('bonus', 20.792252047181535),
 ('salary', 18.289684043404513),
 ('percent_to_poi', 16.409712548035792)]
algorithm__n_neighbors :  3

Evaluation metrics:
Precision:  0.0
Recall:  0.0


Pipeline(steps=[('feature_selection', SelectKBest(k=5, score_func=<function f_classif at 0x10e026b90>)), ('algorithm', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform'))])

In [25]:
print 'AdaBoost Algorithm: \n'
test_algorithm(adb_clf, adb_params)

AdaBoost Algorithm: 

Parameters:
{'algorithm__algorithm': ['SAMME.R'],
 'algorithm__n_estimators': [50, 60],
 'feature_selection__k': [3, 4, 5, 6, 7]}

Pipeline: 
['feature_selection', 'algorithm'] 

f1 score: 
0.284 

Best Parameters: 
feature_selection__k :  7
[('exercised_stock_options', 24.815079733218194),
 ('total_stock_value', 24.182898678566879),
 ('bonus', 20.792252047181535),
 ('salary', 18.289684043404513),
 ('percent_to_poi', 16.409712548035792),
 ('deferred_income', 11.458476579280369),
 ('long_term_incentive', 9.9221860131898225)]
algorithm__algorithm :  SAMME.R
algorithm__n_estimators :  60

Evaluation metrics:
Precision:  0.166666666667
Recall:  0.2


Pipeline(steps=[('feature_selection', SelectKBest(k=7, score_func=<function f_classif at 0x10e026b90>)), ('algorithm', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=60, random_state=None))])

In [6]:
def choose_algorithm(classifier, params):
    '''
    Function that takes in an algorithm classifier and its respective parameters as inputs.
    Performs GridSearchCV for parameter selection into a pipeline. Prints parameter options, 
    pipeline steps, f1 score, best parameters, and precision and recall. Selects features based 
    on feature importances. Returns pipeline classifier.
    '''

    # Steps to be fed into pipeline
    steps = [('algorithm', classifier)]
    
    pipeline = Pipeline(steps)
    
    folds = 50
    # StratifiedShuffleSplit returns stratified randomized folds
    sss = StratifiedShuffleSplit(labels_train, n_iter=folds, random_state=42)
    gs = GridSearchCV(pipeline, param_grid = params, cv=sss, scoring = 'f1')
    
    print 'Parameters:'
    pprint(params)
    print ""
    
    # Print out pipeline steps
    print "Pipeline: \n", [step for step, _ in pipeline.steps], '\n'
    
    # Fit training data to GridSearchCV
    gs.fit(features_train, labels_train)
    
    # Print f1 score
    score = gs.best_score_
    print 'f1 score: \n', score, '\n'
    
    # Fetch optimal parameters found
    best_params = gs.best_estimator_.get_params()
    print 'Best Parameters: '
    for name in params.keys():
        print name, ': ', best_params[name]
    
    pred = gs.predict(features_test)
    
    # Calculate and print precision and recall evaluation metrics    
    true_negatives = 0
    false_negatives = 0
    true_positives = 0
    false_positives = 0
    
    for prediction, truth in zip(pred, labels_test):
        if prediction == 0 and truth == 0:
            true_negatives += 1
        elif prediction == 0 and truth == 1:
            false_negatives += 1
        elif prediction == 1 and truth == 0:
            false_positives += 1
        elif prediction == 1 and truth == 1:
            true_positives += 1
    
    total_predictions = true_negatives + false_negatives + false_positives + true_positives
    accuracy = 1.0*(true_positives + true_negatives)/total_predictions
    # Guard against division by zero when there are no predicted or no actual POIs
    predicted_pois = true_positives + false_positives
    actual_pois = true_positives + false_negatives
    precision = 1.0*true_positives/predicted_pois if predicted_pois else 0.0
    recall = 1.0*true_positives/actual_pois if actual_pois else 0.0
    f1 = 2.0*true_positives/(predicted_pois + actual_pois) if (predicted_pois + actual_pois) else 0.0
    
    print ''
    print 'Evaluation metrics:'
    print 'Precision: ', precision
    print 'Recall: ', recall
    print 'f1:', f1
    
    
    clf = gs.best_estimator_
    
    # Select features based on feature importances
    importances = clf.named_steps['algorithm'].feature_importances_
    indices = np.argsort(importances)[::-1]

    print ''
    print 'Feature Ranking: '
    for i in range(3):
        print "feature number {}: {} ({})".format(i+1,features_list[indices[i]+1],importances[indices[i]])
    
    
    return clf

In [7]:
choose_algorithm(tree_clf, tree_params)

Parameters:
{'algorithm__criterion': ['gini', 'entropy'],
 'algorithm__max_depth': [2, 4, 6],
 'algorithm__max_leaf_nodes': [5, 10],
 'algorithm__min_samples_split': [3, 4, 5],
 'algorithm__splitter': ('best', 'random')}

Pipeline: 
['algorithm'] 

f1 score: 
0.326666666667 

Best Parameters: 
algorithm__max_leaf_nodes :  10
algorithm__splitter :  random
algorithm__min_samples_split :  4
algorithm__criterion :  gini
algorithm__max_depth :  6

Evaluation metrics:
Precision:  0.5
Recall:  0.4
f1: 0.444444444444

Feature Ranking: 
feature number 1: expenses (0.3908498596)
feature number 2: percent_to_poi (0.325984720516)
feature number 3: restricted_stock (0.101351351351)


Pipeline(steps=[('algorithm', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=10, min_samples_leaf=1,
            min_samples_split=4, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='random'))])

In [9]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall 
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info: 
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Assign the Decision Tree algorithm to clf
clf = choose_algorithm(tree_clf, tree_params)

### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(clf, my_dataset, features_list)

Parameters:
{'algorithm__criterion': ['gini', 'entropy'],
 'algorithm__max_depth': [2, 4, 6],
 'algorithm__max_leaf_nodes': [5, 10],
 'algorithm__min_samples_split': [3, 4, 5],
 'algorithm__splitter': ('best', 'random')}

Pipeline: 
['algorithm'] 

f1 score: 
0.321333333333 

Best Parameters: 
algorithm__max_leaf_nodes :  5
algorithm__splitter :  best
algorithm__min_samples_split :  4
algorithm__criterion :  entropy
algorithm__max_depth :  4

Evaluation metrics:
Precision:  0.444444444444
Recall:  0.8
f1: 0.571428571429

Feature Ranking: 
feature number 1: bonus (0.428698703468)
feature number 2: exercised_stock_options (0.36037706334)
feature number 3: percent_to_poi (0.210924233192)


# Response Questions

1: Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]

The goal of this project is to use machine learning to build a classifier that identifies Enron employees who may have been involved in the company's fraud. The model is built from a dataset of financial and email features for Enron employees, such as salary, bonus, and number of emails sent, together with a label that marks each person as a POI (person of interest) or not. As loaded, the dataset has 146 records, of which 18 are POIs and 128 are not, and many features have missing ('NaN') values; 'loan_advances', for example, is missing for 142 records. While exploring the data for outliers, I found a 'TOTAL' entry and an entry called 'THE TRAVEL AGENCY IN THE PARK'. Neither is an Enron employee, so both are irrelevant to the investigation and I removed them; the code above also removes 'LOCKHART EUGENE E'. With the outliers removed, I used the remaining records to predict whether a given employee is a POI.
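
One way to surface candidate records for removal is to measure how much of each record is missing. The following is a sketch under stated assumptions, not the method used in the notebook (where the outlier list is written by hand): the helper names, the 0.9 threshold, and the toy records are all hypothetical. It runs under Python 3.

```python
def missing_fraction(record, label_key="poi"):
    """Share of a record's fields (excluding the label) that are the string 'NaN'."""
    fields = [value for key, value in record.items() if key != label_key]
    missing = sum(1 for value in fields if value == "NaN")
    return missing / float(len(fields))


def flag_sparse_records(data, threshold=0.9):
    """Names of records whose missing fraction is at or above the threshold."""
    return sorted(
        name for name, record in data.items()
        if missing_fraction(record) >= threshold
    )


# Hypothetical toy data in the same shape as data_dict; not real Enron values.
toy_data = {
    "PERSON ONE": {"poi": False, "salary": 100000, "bonus": 5000, "expenses": "NaN"},
    "PERSON TWO": {"poi": False, "salary": "NaN", "bonus": "NaN", "expenses": "NaN"},
    "PERSON THREE": {"poi": True, "salary": 250000, "bonus": "NaN", "expenses": 1200},
}

print(flag_sparse_records(toy_data))
```

A check like this only flags sparse records. Entries that are not people at all, such as 'TOTAL', still have to be found by reading the record names or plotting the financial features.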

2: What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “properly scale features”, “intelligently select feature”]

My initial features list contained the 'poi' label plus the 19 numeric features in the dataset (every feature except 'email_address'). I also created two features, 'percent_from_poi' and 'percent_to_poi', from the original data. 'percent_from_poi' is the fraction of the emails an employee received that came from a POI, and 'percent_to_poi' is the fraction of the emails an employee sent that went to a POI. The idea is to test whether employees who communicate frequently with POIs are more likely to be POIs themselves. For feature selection I used SelectKBest, which keeps the K highest-scoring features. 'percent_to_poi' ranked fifth with a score of 16.41, behind 'exercised_stock_options' (24.82), 'total_stock_value' (24.18), 'bonus' (20.79), and 'salary' (18.29). I also tried MinMaxScaler, which rescales each feature to the range 0 to 1 so that features measured on large scales do not dominate the others. Preprocessing is an important step before picking an algorithm because it can reduce processing time: MinMaxScaler reduced the completion time of the support vector machine from several hours to several minutes. Another reason is that some features may not be relevant when making predictions and can therefore be ignored. The final Decision Tree pipeline shown above uses neither SelectKBest nor MinMaxScaler, since a tree splits on one feature at a time and is not affected by feature scale, and relies on the tree's own feature importances instead. In the final run the three most important features were 'bonus' (0.429), 'exercised_stock_options' (0.360), and 'percent_to_poi' (0.211).
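
To make the scaling step concrete, here is a minimal plain-Python sketch of the min-max formula, (x - min) / (max - min), which is what MinMaxScaler applies with its default range. The function `min_max_scale` and the sample values are illustrative assumptions, and mapping a constant feature to 0.0 is a choice made for this sketch. It runs under Python 3.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # A constant feature carries no information; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / float(span) for v in values]


# Hypothetical values on very different scales (dollars versus message counts).
salaries = [50000, 100000, 200000]
messages = [10, 60, 110]

print(min_max_scale(salaries))
print(min_max_scale(messages))
```

After scaling, both features lie between 0 and 1, so a distance-based algorithm such as K Nearest Neighbor no longer treats a dollar difference as thousands of times more important than a difference in message counts.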

3: What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]

I tested several algorithms. The notebook above shows Naive Bayes, Decision Tree, K Nearest Neighbor, and AdaBoost; I also tried a Support Vector Machine, whose run is not shown. The algorithms differed in both run time and predictions. The support vector machine took the longest to run, and K Nearest Neighbor returned the poorest classification results (a precision and recall of 0.0 on the test split). Based on the evaluation metrics, I chose the Decision Tree because it returned the highest cross-validated f1 score: 0.327, compared with 0.297 for K Nearest Neighbor, 0.284 for AdaBoost, and 0.150 for Naive Bayes.

4: What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric item: “tune the algorithm”]

Many machine learning algorithms take parameters that control how the model is fit. Choosing them is an important step because they determine how the model behaves. If tuning is not done properly, the model may over-fit or under-fit the data, so that its performance on the training data differs from its performance on the testing data. Fortunately, scikit-learn provides a class called GridSearchCV to help find the best parameters. For the Decision Tree, I searched over 'criterion' ('gini' and 'entropy'), 'max_depth' (2, 4, and 6), 'splitter' ('best' and 'random'), 'min_samples_split' (3, 4, and 5), and 'max_leaf_nodes' (5 and 10). GridSearchCV evaluates every combination by brute force and returns the one with the best cross-validated f1 score. In one run this was 'criterion': 'gini' and 'max_depth': 6. Because the tree has no fixed random_state, the result varies between runs; the final run shown above selected 'criterion': 'entropy' and 'max_depth': 4.
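
The brute-force nature of the search is easy to see by expanding the grid by hand. The sketch below is an illustration in Python 3, not how GridSearchCV is implemented: `expand_grid` is a hypothetical helper, while the grid values and the fold count of 50 are copied from the notebook above.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = sorted(param_grid)
    value_lists = [list(param_grid[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# The same grid as tree_params in the notebook above.
tree_params = {
    "algorithm__criterion": ["gini", "entropy"],
    "algorithm__max_depth": range(2, 8, 2),
    "algorithm__splitter": ("best", "random"),
    "algorithm__min_samples_split": [3, 4, 5],
    "algorithm__max_leaf_nodes": [5, 10],
}

candidates = expand_grid(tree_params)
folds = 50  # n_iter used in choose_algorithm
print(len(candidates), len(candidates) * folds)
```

The grid holds 2 x 3 x 2 x 3 x 2 = 72 combinations, so with 50 shuffle splits the search performs 3,600 cross-validation fits. Adding one more value to any parameter multiplies that total, which is worth keeping in mind before widening the grid.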

5: What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric item: “validation strategy”]

In machine learning, data is split into training and testing sets. The training set is used to fit the model, and the held-out test set is used to evaluate its predictions: we 'pretend' not to know the labels of the test set, then compare the predictions to the actual labels. A classic mistake is to evaluate or tune the model on the same data it was trained on. Knowledge about the test set can then 'leak' into the model, which over-fits and appears to perform better than it really does. Cross-validation addresses this by repeatedly splitting the training data into smaller training and validation subsets and averaging the results over all rounds. For this project, I used Stratified Shuffle Split, which draws a random split of the data many times while keeping the proportion of POIs and non-POIs the same in every split. This matters here because the dataset is small and only 18 of its 146 records are POIs.
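
The idea behind a stratified shuffle split can be sketched in a few lines of plain Python 3. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function below, its defaults, and the 10-to-90 class balance are all hypothetical.

```python
import random


def stratified_shuffle_split(labels, n_splits=3, test_size=0.3, seed=42):
    """Yield (train_indices, test_indices) pairs that preserve class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    for _ in range(n_splits):
        train, test = [], []
        for label in sorted(by_class):
            indices = by_class[label][:]   # copy so each round reshuffles afresh
            rng.shuffle(indices)
            n_test = int(round(len(indices) * test_size))
            test.extend(indices[:n_test])
            train.extend(indices[n_test:])
        yield sorted(train), sorted(test)


# Hypothetical labels: 10 POIs (1) and 90 non-POIs (0).
labels = [1] * 10 + [0] * 90

for train, test in stratified_shuffle_split(labels):
    poi_in_test = sum(labels[i] for i in test)
    print(len(train), len(test), poi_in_test)
```

Because each class is shuffled and cut separately, every test set here contains the same number of POIs. A plain random split of so few positives could easily produce a test set with no POIs at all, which would make precision and recall meaningless for that round.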

6: Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]

Evaluation metrics such as precision and recall measure the performance of an algorithm. I chose them because they are built from the counts of true positives, false positives, and false negatives, which makes them more informative than accuracy on an imbalanced dataset like this one. Precision is the ratio of true positives to all POIs predicted by the algorithm (true positives + false positives). Recall is the ratio of true positives to all actual POIs (true positives + false negatives). When run through the test_classifier function, the Decision Tree scored a precision of 0.44978 and a recall of 0.36500. In other words, about 45% of the employees the model flags as POIs really are POIs, and the model finds about 37% of the actual POIs.
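
As a worked example of these definitions, the Python 3 sketch below computes precision, recall, and f1 from raw counts. The function name is hypothetical, and the counts (4 true positives, 5 false positives, 1 false negative) are an assumption chosen to be consistent with the metrics printed for the final notebook run; the notebook itself does not print its confusion matrix.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1); a metric with a zero denominator is 0.0."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / float(predicted) if predicted else 0.0
    recall = true_positives / float(actual) if actual else 0.0
    denominator = 2 * true_positives + false_positives + false_negatives
    f1 = 2.0 * true_positives / denominator if denominator else 0.0
    return precision, recall, f1


# Assumed counts, chosen to be consistent with the final run's printed metrics.
precision, recall, f1 = precision_recall_f1(
    true_positives=4, false_positives=5, false_negatives=1
)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

With these counts the model would flag 9 people and be right about 4 of them (precision 4/9), while missing 1 of the 5 actual POIs (recall 4/5). True negatives appear in neither formula.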

<h3>References:</h3>

https://civisanalytics.com/blog/data-science/2016/01/06/workflows-python-using-pipeline-gridsearchcv-for-compact-code/
<br>http://abshinn.github.io/python/sklearn/2014/06/08/grid-searching-in-all-the-right-places/
<br>https://github.com/amueller/scipy_2015_sklearn_tutorial/tree/master/notebooks
<br>https://www.youtube.com/watch?v=80fZrVMurPM
<br>https://www.youtube.com/watch?v=Ud-FsEWegmA&t=8173s
<br>https://discussions.udacity.com/t/feature-importances-/173319/10