# Identify Fraud from Enron Email #
## Data Analyst Nanodegree -- Project 5 ##
### Richard Lorenzo ###


## Overview ##
This project studies financial and email data from the public Enron dataset. Using machine learning, I predict "Persons of Interest" (POIs) from this data.

1. I begin with Exploratory Data Analysis (EDA) to validate the data, look for general trends, and identify outliers.

2. After cleaning the dataset, I summarize the data and provide overall metrics.

3. I describe three machine learning analyses, their methods, and their results.

4. An appendix contains the EDA plots used in step 1.

## Code: 'poi_id.py' ##
The following Python program is the basis of the analysis. It is shown below with its output. I also review and restate each result in this report.

The 'poi_id.py' program uses the following helper code and data, which were provided by Udacity for this project:
- 'feature_format.py'
- 'tester.py'
- 'final_project_dataset.pkl'

In [1]:
#!/usr/bin/python
import pandas as pd
import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data, load_classifier_and_data, test_classifier
from sklearn.cross_validation import train_test_split

from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import StratifiedShuffleSplit
from itertools import compress
from pprint import pprint
from IPython.display import display
from time import time

### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = ['poi','salary'] # You will need to use more features

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

### Convert to Pandas
df = pd.DataFrame.from_records(list(data_dict.values()))
employees = pd.Series(list(data_dict.keys()))
# set the index of df to be the employees series:
df.set_index(employees, inplace=True)
# Convert numeric column types; non-numeric entries are coerced to NaN
# Convert financial columns (each column listed once)
financial_cols = ['bonus', 'deferral_payments', 'deferred_income',
    'director_fees', 'exercised_stock_options', 'expenses', 'loan_advances',
    'long_term_incentive', 'other', 'restricted_stock',
    'restricted_stock_deferred', 'salary', 'total_payments',
    'total_stock_value']
df[financial_cols] = df[financial_cols].apply(pd.to_numeric, errors='coerce')
# Convert email columns
email_cols = ['from_messages', 'from_poi_to_this_person',
    'from_this_person_to_poi', 'shared_receipt_with_poi', 'to_messages']
df[email_cols] = df[email_cols].apply(pd.to_numeric, errors='coerce')

### Task 2: Remove outliers
df.drop('TOTAL', inplace = True)    # Remove Outliers
df.drop('THE TRAVEL AGENCY IN THE PARK', inplace = True)    # Remove Outliers
print "Removed Outliers for 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK'\n"


### Task 3: Create new feature(s)
df['from_poi_to_this_person_pct'] = \
    df['from_poi_to_this_person'] / df['to_messages']
df['from_this_person_to_poi_pct'] = \
    df['from_this_person_to_poi'] / df['from_messages']

nan_observations = {}
for column in df:
    nan_observations[column] = df[column].isnull().sum()

df.fillna(value=0,inplace = True)

### Data Exploration
print "Total number of data points (observations) :", len(df.index)
print "Numer of POI observations :", len(df['poi'][df['poi']])
print "Number of non-POI obseervations :", len(df['poi'][df['poi'] == False])


### Store to my_dataset for easy export below.
df.to_csv('enron_for_eda.txt')
df_dict = df.to_dict('index')
my_dataset = df_dict

features_list = ['poi',
 'bonus',
 'deferral_payments',
 'deferred_income',
 'director_fees',
 'exercised_stock_options',
 'expenses',
 'from_messages',
 'from_poi_to_this_person',
 'from_this_person_to_poi',
 'loan_advances',
 'long_term_incentive',
 'other',
 'restricted_stock',
 'restricted_stock_deferred',
 'salary',
 'shared_receipt_with_poi',
 'to_messages',
 'total_payments',
 'total_stock_value',
 'from_poi_to_this_person_pct',
 'from_this_person_to_poi_pct'
]

# create a list of features without 'poi' which is a label
features_no_poi = list(features_list) # copy the feature list
features_no_poi.pop(0)  # remove 'poi' from the list
print "\nTotal features available : ", len(features_no_poi)
print
print "Available Feature sorted by NaNs"
df_nans = pd.DataFrame.from_dict(nan_observations, orient = 'index')
pprint(df_nans.sort_values(by = 0, ascending=False))


####################################################################
# Implement Modeling Pipeline with "Select K Best" and "Naive Bayes"
####################################################################
print "\n******************\n Select K Best + Gaussian NB Pipeline\n"
t0 = time()
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

k_range = range(2,10)
params = {'SKB__k' : k_range }
pipeline = Pipeline([('SKB', SelectKBest()), ('classifier', GaussianNB())])
cv = StratifiedShuffleSplit(labels, 100, test_size=0.2, random_state=60)
gs = GridSearchCV(pipeline, params, cv=cv, scoring="f1_weighted")
gs.fit(features, labels)
clf = gs.best_estimator_

# Print the selected features and pvalues
print "Processing time:", round(time()-t0, 3), "s"
k_best_support = clf.named_steps['SKB'].get_support(False).tolist()
df_selected_features1 = pd.DataFrame(
    {'Feature': list(compress(features_no_poi, k_best_support)),
     'p value': list(compress(clf.named_steps['SKB'].pvalues_,k_best_support))
    })
pprint(df_selected_features1)
print

# Test the results
dump_classifier_and_data(clf, my_dataset, features_list)
clf, dataset, feature_list = load_classifier_and_data()
test_classifier(clf, dataset, feature_list)

#####################################################################
# Implement Modeling Pipeline with "Select K Best" and "DecisionTree"
####################################################################
print "\n******************\n Select K Best + DecisionTree Pipeline\n"
t0 = time()
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

k_range = range(2,8)
params = {'SKB__k' : k_range,
          "dt__min_samples_leaf": [2, 4, 6],
          "dt__min_samples_split": [8, 10, 12],
          "dt__min_weight_fraction_leaf": [0, 0.1],
          "dt__criterion": ["gini", "entropy"],
          "dt__random_state": [42, 46]}
          
pipeline = Pipeline([('SKB', SelectKBest()),('dt', DecisionTreeClassifier())])
cv = StratifiedShuffleSplit(labels, 100, test_size=0.2, random_state=60)
gs = GridSearchCV(pipeline, params, cv=cv, scoring="f1_weighted")
gs.fit(features, labels)
clf = gs.best_estimator_

# Print the selected features, pvalues, and DT Importances
print "Processing time:", round(time()-t0, 3), "s"
k_best_support = clf.named_steps['SKB'].get_support(False).tolist()
df_selected_features2 = pd.DataFrame(
    {'Feature': list(compress(features_no_poi, k_best_support)),
    'p value': list(compress(clf.named_steps['SKB'].pvalues_,k_best_support)),
    'Importance' : clf.named_steps['dt'].feature_importances_.tolist()
    })
pprint(df_selected_features2)
print

# Test the results
dump_classifier_and_data(clf, my_dataset, features_list)
clf, dataset, feature_list = load_classifier_and_data()
test_classifier(clf, dataset, feature_list)

#####################################################################
# Implement "Naive Bayes" with Manually Selected Features
#####################################################################
print "\n******************\n Gaussian NB w/ manual Features\n"
features_list_manual = ['poi',
 'from_poi_to_this_person_pct',
 'salary',  'deferred_income', 
 'exercised_stock_options',  'expenses', 
 'total_stock_value']

print "Manually Selected Features : ", features_list_manual[1:]

t0 = time()
data = featureFormat(my_dataset, features_list_manual, sort_keys = True)
labels, features = targetFeatureSplit(data)
features_train, features_test, labels_train, labels_test\
    = train_test_split(features, labels, test_size=0.2, random_state=42)
clf = GaussianNB()
print "Processing time:", round(time()-t0, 3), "s"

# Test the results
dump_classifier_and_data(clf, my_dataset, features_list_manual)
clf, dataset, features_list_manual = load_classifier_and_data()
test_classifier(clf, dataset, features_list_manual)



Removed Outliers for 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK'

Total number of data points (observations) : 144
Numer of POI observations : 18
Number of non-POI obseervations : 126

Total features available :  21

Available Feature sorted by NaNs
                               0
loan_advances                141
director_fees                128
restricted_stock_deferred    127
deferral_payments            106
deferred_income               96
long_term_incentive           79
bonus                         63
to_messages                   58
from_this_person_to_poi       58
from_messages                 58
shared_receipt_with_poi       58
from_poi_to_this_person       58
from_this_person_to_poi_pct   58
from_poi_to_this_person_pct   58
other                         53
expenses                      50
salary                        50
exercised_stock_options       43
restricted_stock              35
total_payments                21
total_stock_value             19
email_address           



## Exploratory Data Analysis ##

I immediately converted the dataset to a Pandas DataFrame to simplify analysis and manipulation.  I exported it as a '.csv' file to import it to RStudio for EDA.  The plots are included in the appendix.  I prepared several 'ggpairs' plots to look for trends, correlations, and get an overall feel for the dataset.  Next, I plotted several promising features against each other to look for outliers.

The 'TOTAL' observation was obvious in these plots. I removed 'TOTAL' because it is not a valid observation. 'THE TRAVEL AGENCY IN THE PARK' is also invalid since it cannot be considered a POI. There were several other extreme values, but since they appear to be valid amounts, I did not remove them.

### Data Characteristics: ###

In [2]:
print "Total number of data points (observations) :", len(df.index)
print "Numer of POI observations :", len(df['poi'][df['poi']])
print "Number of non-POI obseervations :", len(df['poi'][df['poi'] == False])

Total number of data points (observations) : 144
Numer of POI observations : 18
Number of non-POI obseervations : 126


Many features contained NaN values, which I filled with zeros. The following table lists the features, sorted from most to fewest NaNs.

In [3]:
display(df_nans.sort_values(by = 0, ascending=False))

Feature,NaN count
loan_advances,141
director_fees,128
restricted_stock_deferred,127
deferral_payments,106
deferred_income,96
long_term_incentive,79
bonus,63
to_messages,58
from_this_person_to_poi,58
from_messages,58
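
The per-feature counts above come from `isnull().sum()` after the numeric conversion. As a rough illustration, the same tally can be sketched without pandas. This assumes that missing values are stored as the string `'NaN'` in the raw dictionary; the records below are invented toy data, not rows from the Enron dataset, and `count_missing` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical records shaped like the project dictionary: name -> feature dict.
toy_data = {
    'PERSON A': {'salary': 250000, 'bonus': 'NaN', 'loan_advances': 'NaN'},
    'PERSON B': {'salary': 'NaN', 'bonus': 50000, 'loan_advances': 'NaN'},
    'PERSON C': {'salary': 180000, 'bonus': 'NaN', 'loan_advances': 'NaN'},
}

def count_missing(records, missing='NaN'):
    """Return a Counter mapping each feature name to its number of missing values."""
    counts = Counter()
    for features in records.values():
        for name, value in features.items():
            counts[name] += 1 if value == missing else 0
    return counts

missing_counts = count_missing(toy_data)

# List the features with the most missing values first, as in the table.
for name, n_missing in missing_counts.most_common():
    print("%-15s %d" % (name, n_missing))
```

A feature such as `loan_advances`, which is missing for almost everyone, carries little information once the gaps are filled with zeros.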


## Adding New Features ##

I added the following two new features:

- from_poi_to_this_person_pct
- from_this_person_to_poi_pct

The original counts of emails to and from POIs can be improved by expressing them as a fraction of each person's total emails. For example, a person who sends many emails, including a few to POIs, is less interesting than a person who sends fewer emails overall but the same number to POIs.
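
A small sketch of the idea, using invented message counts: two senders with the same number of emails to POIs end up with very different ratios. The helper name `poi_ratio` is hypothetical, and returning 0.0 for missing data is an assumption that mirrors filling NaNs with zeros.

```python
def poi_ratio(poi_messages, all_messages):
    """Fraction of a person's messages that involve a POI; 0.0 when data is missing."""
    if not all_messages or poi_messages is None:
        return 0.0
    return float(poi_messages) / all_messages

# Two hypothetical senders who each sent 10 emails to POIs.
heavy_sender = poi_ratio(10, 1000)     # 10 of 1000 outgoing emails
light_sender = poi_ratio(10, 40)       # 10 of 40 outgoing emails
no_email_data = poi_ratio(None, None)  # person with no email records

print("heavy sender: %.2f" % heavy_sender)
print("light sender: %.2f" % light_sender)
print("no email data: %.2f" % no_email_data)
```

The raw count is identical for both senders, while the ratio separates them by a factor of 25.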

## Coding the Classifiers ##

### Evaluation Metrics ###

Because only 18 of the 144 observations are POIs, precision and recall are the most appropriate metrics. Accuracy is less informative because an algorithm that always guesses 'not a POI' would have 87.5% accuracy.
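
The 87.5% figure follows from the class counts printed by the data exploration step (18 POIs among 144 observations). A short check of that arithmetic for a classifier that always answers 'not a POI':

```python
n_total = 144   # observations after removing the two outliers
n_poi = 18      # POI observations reported by the data exploration
n_non_poi = n_total - n_poi

# Always guessing "not a POI" is correct for every non-POI and wrong for every POI.
baseline_accuracy = n_non_poi / float(n_total)

# The same classifier never flags anyone, so it recovers none of the POIs.
baseline_recall = 0 / float(n_poi)

print("baseline accuracy: %.3f" % baseline_accuracy)
print("baseline recall:   %.3f" % baseline_recall)
```

High accuracy combined with zero recall is why accuracy alone says little about this classifier.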

Precision is the fraction of predicted POIs that are actual POIs. Recall is the fraction of actual POIs that the classifier identifies. High precision corresponds to few false positives, and high recall corresponds to few false negatives.

For example, a precision of 0.22604 and a recall of 0.39500 means the classifier has too many false positives, but it has an acceptably low number of false negatives.

The final analysis has a precision of 0.53031 and a recall of 0.39800, which means that about 53% of the people it flags are actual POIs and that it identifies about 40% of the actual POIs. Both values exceed the 0.3 required for this project.
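
These metrics can be recomputed from the confusion counts that the tester prints for the final classifier (796 true positives, 705 false positives, 1204 false negatives, 12295 true negatives). The sketch below uses the standard definitions of precision, recall, F1, and F2; the function name `classification_metrics` is hypothetical and is not part of 'tester.py'.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts."""
    precision = tp / float(tp + fp)   # share of predicted POIs that are real POIs
    recall = tp / float(tp + fn)      # share of real POIs that were found
    f1 = 2 * precision * recall / (precision + recall)
    f2 = 5 * precision * recall / (4 * precision + recall)  # weights recall higher
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1, 'f2': f2}

# Confusion counts reported for the final GaussianNB classifier.
final = classification_metrics(tp=796, fp=705, fn=1204, tn=12295)

for name in ['accuracy', 'precision', 'recall', 'f1', 'f2']:
    print("%-9s %.5f" % (name, final[name]))
```

The values agree with the reported accuracy of 0.87273, precision of 0.53031, recall of 0.39800, F1 of 0.45473, and F2 of 0.41890.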


### Validation ###

Validation is important to ensure the classifier both learns from the training data and does not over-fit it. A classifier that does not learn from the training data will not adapt to the test data. A classifier that over-fits the training data will not perform well on different test data.

With only 144 observations, the dataset is small, and splitting it into training and testing sets causes large variations in the results depending on which observations land in each set. I used the StratifiedShuffleSplit method in scikit-learn's cross-validation package to create multiple overlapping training/testing splits and to average the results.
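
To illustrate what stratification does, here is a minimal sketch using the current `sklearn.model_selection.StratifiedShuffleSplit` API. The program in this report uses the older `sklearn.cross_validation` version, which takes the labels in its constructor. The labels below are synthetic and only mimic the 18/126 class balance, and the five splits are an arbitrary choice for brevity.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels with the same class balance as the cleaned dataset.
labels = np.array([1] * 18 + [0] * 126)
features = np.zeros((len(labels), 1))  # placeholder; only the labels drive the split

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=60)

poi_counts = []
split_sizes = []
for train_idx, test_idx in splitter.split(features, labels):
    poi_counts.append(int(labels[test_idx].sum()))
    split_sizes.append((len(train_idx), len(test_idx)))
    print("train=%d test=%d POIs in test=%d"
          % (len(train_idx), len(test_idx), poi_counts[-1]))
```

Each test set keeps roughly the same share of POIs as the full dataset, so no split ends up without any POIs to evaluate against.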

### Steps taken for coding the classifiers ###
The first machine learning classifier I used is presented last because it was the most accurate.

I started by simply coding several classifiers with default tuning and guessed at several features. I tried GaussianNB, DecisionTree, K Nearest Neighbors, and Support Vector Machines. GaussianNB clearly produced the best results. (The results from K Nearest Neighbors and Support Vector Machines are not included in poi_id.py or discussed further because they were poor.) I then manually removed features until the metrics dropped and added new features in, keeping them only if they improved the metrics. The results of this manual GaussianNB algorithm are presented later in the report.

### Scaling ###
Since I only used the GaussianNB and Decision Tree algorithms, I did not need to scale my features. However, scaling could easily be added to the pipelines presented if someone wants to try a different algorithm.
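
As a sketch of how that could look, the pipeline below adds a `MinMaxScaler` step in front of Select K-Best. The choice of `MinMaxScaler`, the step name `'scaler'`, and the synthetic two-feature data are assumptions for illustration; they are not part of 'poi_id.py'.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
labels = np.array([1] * 18 + [0] * 126)

# Two synthetic features on very different scales (dollar-like and ratio-like).
features = np.column_stack([
    rng.normal(loc=1e6 * (1 + labels), scale=2e5),
    rng.normal(loc=0.1 * (1 + labels), scale=0.02),
])

# The scaler runs first, so later steps see every feature in the range [0, 1].
scaled_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('SKB', SelectKBest(k=2)),
    ('classifier', GaussianNB()),
])
scaled_pipeline.fit(features, labels)

scaled = scaled_pipeline.named_steps['scaler'].transform(features)
predictions = scaled_pipeline.predict(features)
print("scaled range: %.1f to %.1f" % (scaled.min(), scaled.max()))
```

Because the scaler is a pipeline step, it is fitted only on the training portion of each cross-validation split, which avoids leaking information from the test portion.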

### Performance ###
The processing time to run the Select K-Best and GaussianNB classifier was 3.454 seconds. This was slower than the final analysis.

## Select K-Best and GaussianNB ##
Hoping to improve on my manual feature selection while keeping the promising GaussianNB classifier, I coded a pipeline using Select K-Best and GaussianNB and tuned it with GridSearchCV. The GridSearchCV also implemented the StratifiedShuffleSplit cross validation. GaussianNB does not have any parameters to tune here, but I hoped Select K-Best could identify better features.
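
A minimal sketch of this pipeline pattern with the current scikit-learn API (`sklearn.model_selection`), run on synthetic data rather than the Enron features. The data, the random seeds, the range of k, and the reduced number of splits are assumptions chosen so that the example runs quickly; the program in this report uses 100 splits and the older `sklearn.cross_validation` splitter.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
labels = np.array([1] * 18 + [0] * 126)

# Eight synthetic features; only the first two carry signal about the label.
features = rng.normal(size=(len(labels), 8))
features[:, 0] += 2.0 * labels
features[:, 1] += 1.5 * labels

pipeline = Pipeline([('SKB', SelectKBest()), ('classifier', GaussianNB())])
params = {'SKB__k': list(range(2, 6))}   # candidate numbers of features to keep
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=60)

gs = GridSearchCV(pipeline, params, cv=cv, scoring='f1_weighted')
gs.fit(features, labels)

best_k = gs.best_params_['SKB__k']
# Column indices kept by Select K-Best in the refitted best pipeline.
selected = gs.best_estimator_.named_steps['SKB'].get_support(indices=True)
print("best k: %d" % best_k)
print("selected columns: %s" % selected.tolist())
```

`get_support(indices=True)` returns column positions directly, which is an alternative to combining the boolean mask with `itertools.compress`.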

The pipeline identified the following features with their associated p-values.

In [4]:
display(df_selected_features1)

Unnamed: 0,Feature,p value
0,bonus,1.1e-05
1,exercised_stock_options,2e-06
2,salary,3.5e-05
3,total_stock_value,2e-06
4,from_this_person_to_poi_pct,8.4e-05


The test results / metrics are:
```
Pipeline(steps=[('SKB', SelectKBest(k=5, score_func=<function f_classif at 0x0000000008AD8EB8>)), 
            ('classifier', GaussianNB(priors=None))])

    Accuracy: 0.84893   Precision: 0.41910	  Recall: 0.34450	F1: 0.37816	  F2: 0.35722

    Total predictions: 15000	
    True positives:  689	False positives:  955	False negatives: 1311	True negatives: 12045
```

These results still meet the specified 0.3 for precision and recall, but they are lower than those of the manual analysis presented later.

## Select K-Best and the Decision Tree Classifier ##

Even though the untuned Decision Tree classifier initially performed worse than GaussianNB, I coded a pipeline for Select K-Best and Decision Tree to try to improve the result through tuning. The GridSearchCV also implemented the StratifiedShuffleSplit cross validation.

The DT tunable parameters and the values tried are:
```
    min_samples_leaf         : 2, 4, 6
    min_samples_split        : 8, 10, 12
    min_weight_fraction_leaf : 0, 0.1
    criterion                : gini, entropy
    random_state             : 42, 46
```

Parameter tuning with StratifiedShuffleSplit cross validation is important because it seeks the best performance while checking that the classifier is not over-fit. GridSearchCV executes the pipeline for each combination of the above parameters and each candidate k for Select K-Best. The fitted classifier with the best score and the best combination of tunable parameters is assigned to the "clf" object as the result of the GridSearchCV. The best parameters are:

- min_samples_leaf=6
- min_samples_split=8
- min_weight_fraction_leaf=0
- criterion='entropy'
- random_state=42

The processing time for the Select K-Best and DecisionTree classifier with parameter tuning and StratifiedShuffleSplit cross validation was 274.249 s. This method had by far the longest processing time.
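
The long run time is easier to understand after counting how many pipelines the grid search has to fit. The sketch below multiplies out the parameter values listed above with the six candidate values of k and the 100 cross-validation splits used in 'poi_id.py'; the variable names are illustrative.

```python
from itertools import product

# Parameter values tried for the Select K-Best + DecisionTree pipeline.
param_grid = {
    'SKB__k': list(range(2, 8)),
    'dt__min_samples_leaf': [2, 4, 6],
    'dt__min_samples_split': [8, 10, 12],
    'dt__min_weight_fraction_leaf': [0, 0.1],
    'dt__criterion': ['gini', 'entropy'],
    'dt__random_state': [42, 46],
}
n_splits = 100  # StratifiedShuffleSplit iterations per combination

# Every combination is fitted once per cross-validation split.
n_combinations = len(list(product(*param_grid.values())))
n_fits = n_combinations * n_splits

# For comparison: the GaussianNB pipeline only varies k over range(2, 10).
nb_combinations = len(range(2, 10))
nb_fits = nb_combinations * n_splits

print("DecisionTree grid: %d combinations, %d fits" % (n_combinations, n_fits))
print("GaussianNB grid:   %d combinations, %d fits" % (nb_combinations, nb_fits))
```

The DecisionTree search fits 54 times as many pipelines as the GaussianNB search, which accounts for much of the gap between the two processing times.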

This time, Select K-Best picked the following features. Since DecisionTree also outputs feature importance values, these are presented as well:

In [5]:
display(df_selected_features2)

Unnamed: 0,Feature,Importance,p value
0,bonus,0.138586,1.1e-05
1,deferred_income,0.0,0.000922
2,exercised_stock_options,0.124407,2e-06
3,long_term_incentive,0.0,0.001994
4,salary,0.068504,3.5e-05
5,total_stock_value,0.262293,2e-06
6,from_this_person_to_poi_pct,0.40621,8.4e-05


The test results / metrics are:
```
Pipeline(steps=[('SKB', SelectKBest(k=7, score_func=<function f_classif at 0x0000000008AD8EB8>)), 
            ('dt', DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=6,
            min_samples_split=8, min_weight_fraction_leaf=0, presort=False,
            random_state=42, splitter='best'))])
            
	Accuracy: 0.84867	Precision: 0.37546	 Recall: 0.20350	F1: 0.26394	  F2: 0.22402
	Total predictions: 15000	
    True positives:  407	False positives:  677	False negatives: 1593	True negatives: 12323
```

The recall does not meet the required 0.3, and the metrics are lower than those of the GaussianNB algorithm.


## Final Analysis ##
### GaussianNB with Manual Feature Selection ###

I manually coded a GaussianNB classifier and guessed at likely features from my EDA plots. Then, I removed features until the metrics dropped and added new features in, keeping them only if they improved the metrics.

Since I determined the features using trial and error, I can only speculate why these features produced the best results:

- "from_poi_to_this_person_pct" is likely a strong feature because, by definition, it is linked to known POIs. It likely helps the classifier identify individuals for whom a higher percentage of received emails come from POIs.

- "salary" is likely a strong feature because many POIs have high salaries and most low-salary employees are not POIs.

- "deferred_income" and "expenses" are likely strong features because employees who did not have access to these "perks" were likely not tied to the fraud, and this assists the classifier.

- "total_stock_value" is likely a strong feature because employees with large stock amounts have an incentive to "bend" the rules, and conversely, employees without large stock holdings do not.

- "exercised_stock_options" is likely a strong feature because these people may have known about the fraud and were cashing out their options. Employees without knowledge of the fraud would be more likely to have kept their options.

The manually selected features with GaussianNB had the best processing time: 0.001 seconds.

My final features were:


In [6]:
display(pd.DataFrame(features_list_manual[1:]))

Unnamed: 0,Feature
0,from_poi_to_this_person_pct
1,salary
2,deferred_income
3,exercised_stock_options
4,expenses
5,total_stock_value


The test results / metrics are:
```
GaussianNB(priors=None)

	Accuracy: 0.87273	Precision: 0.53031	Recall: 0.39800	F1: 0.45473	F2: 0.41890
	Total predictions: 15000	
    True positives:  796	False positives:  705	False negatives: 1204	True negatives: 12295
```

These are the best results achieved.

## APPENDIX ##
The EDA plots are included in the file enron_eda.pdf and shown below.


In [7]:
from IPython.display import HTML
HTML('<iframe src=./enron_eda.pdf width=950 height=1500></iframe>')