- Data Analyst Nanodegree
- Project 5 - Identify Fraud from Enron Email
- Ricardo Yoshitomi

## Introduction

The Enron Corporation was an American energy, commodities, and services company founded in 1985 by Kenneth Lay. Around the year 2000, Enron was one of the largest companies in the United States. At the end of 2001, however, it went bankrupt as a result of systematic accounting fraud, known as the Enron scandal. At the time, it was the biggest bankruptcy in U.S. history. In total, 20,000 employees lost their jobs, and many investors and other companies were also affected. <br />
In this project, machine learning algorithms are used to identify Enron employees who may have committed fraud, based on the Enron financial and email dataset. The dataset was made public and contains tens of thousands of emails, including confidential information, as well as detailed financial data for top executives. The code used in this project is based on examples found on the [scikit-learn](http://scikit-learn.org/), [numpy](http://www.numpy.org/) and [stackoverflow](https://stackoverflow.com/) pages.

## Questions

1) Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those? [relevant rubric items: “data exploration”, “outlier investigation”]

### Objective
The goal of this project is to build a person of interest (POI) identifier using machine learning techniques, based on the financial and email datasets made public as a result of the Enron fraud case. The dataset contains 146 data points (i.e. people), with 21 features for each person. The features are split into 14 financial features, 6 email features and 1 POI label. Financial features include benefits, payments, bonuses, incentives, etc., all in US dollars (USD). Most email features are counts of exchanged emails. The POI label is a Boolean feature that indicates whether the employee is a person of interest. There are 18 POIs in the dataset. The POI identifier works as a predictive model that takes the financial and email features of each person as input and the POI label as its target.

### Missingness
In total the dataset contains 3066 values (the number of features times the number of observations), of which 1358 are missing. Missing values therefore represent 44% (1358/3066) of the total, which is a very high rate. Looking at the missing values per observation, many people lack several values, and for some of them only 2 out of 21 features are filled. The dictionary below shows the number of missing values for each feature. The features with the most missing values are <strong>loan_advances</strong>, <strong>director_fees</strong>, <strong>restricted_stock_deferred</strong> and <strong>deferral_payments</strong>. Because the dataset is so small, removing features or observations to deal with these gaps is not recommended; in this case the better strategy is data imputation.

<pre>
{'salary': 51, 'to_messages': 60, 'deferral_payments': 107, 'total_payments': 21, 'long_term_incentive': 80, 'loan_advances': 142, 'bonus': 64, 'restricted_stock': 36, 'restricted_stock_deferred': 128, 'total_stock_value': 20, 'shared_receipt_with_poi': 60, 'from_poi_to_this_person': 60, 'exercised_stock_options': 44, 'from_messages': 60, 'other': 53, 'from_this_person_to_poi': 60, 'deferred_income': 97, 'expenses': 51, 'email_address': 35, 'director_fees': 129}
</pre>
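
A per-feature count like the one above can be sketched as follows. This is a minimal illustration on a made-up dictionary, not the project's actual code; it assumes the data is a dictionary of per-person feature dictionaries and that missing entries are stored as the string "NaN". The names `data_dict` and `count_missing` are hypothetical.

```python
from collections import Counter

# Toy stand-in for the project's data dictionary (names and values are made up).
# Assumption: missing entries are stored as the string "NaN".
data_dict = {
    "PERSON A": {"salary": 100000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
    "PERSON C": {"salary": 250000, "bonus": 50000, "poi": False},
}


def count_missing(data, missing="NaN"):
    """Return {feature: number of people whose value equals `missing`}."""
    counts = Counter()
    for features in data.values():
        for name, value in features.items():
            if value == missing:
                counts[name] += 1
    return dict(counts)


missing_per_feature = count_missing(data_dict)

# Overall missing rate: missing values divided by all stored values.
total_values = sum(len(features) for features in data_dict.values())
missing_rate = sum(missing_per_feature.values()) / total_values

print(missing_per_feature)
print(round(missing_rate, 2))
```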

### Allocation Across Classes
The classes in this dataset are <strong>unbalanced</strong>, meaning that they are not represented equally. The dataset contains many more non-POI examples than POI examples, which is characteristic of fraud-detection datasets. The ratio of POI to non-POI instances is 18:128, or roughly 1:7. To deal with this, I tried different types of algorithms: the Naive Bayes classifier, and classifiers that accept a class weight parameter, such as the Decision Tree, Random Forest and Support Vector classifiers. The validation process I used was StratifiedShuffleSplit. The reasons for these choices are discussed later.

### Outliers
Yes, there was an outlier called <strong>TOTAL</strong>, which is the sum of the values of all people for each feature. It was picked up when the data was transferred from the spreadsheet to the program. Because it is an import artifact rather than a real observation, and it can distort the final result, it was removed from the dataset. <br />
A second entry, <strong>LOCKHART EUGENE E</strong>, was removed because it has no values for any feature. An observation without values is not useful for the classifier. <br />
A third entry, <strong>THE TRAVEL AGENCY IN THE PARK</strong>, is an agency co-owned by the sister of Enron's former Chairman. Since it is not a person to be identified, it was removed from the dataset.
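
The removal step can be sketched as below. It is a minimal example on a toy dictionary with made-up values, assuming the same dictionary-of-dictionaries layout and the "NaN" convention for missing entries; `has_no_values` is a hypothetical helper that shows one way an empty record could be detected.

```python
# Toy stand-in for the data dictionary; all feature values are made up.
data_dict = {
    "PERSON A": {"salary": 100000, "bonus": 20000, "poi": False},
    "TOTAL": {"salary": 999999999, "bonus": 999999999, "poi": False},
    "LOCKHART EUGENE E": {"salary": "NaN", "bonus": "NaN", "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "bonus": 5000, "poi": False},
}


def has_no_values(features, missing="NaN", label="poi"):
    """True if every feature except the label is missing."""
    return all(value == missing for name, value in features.items() if name != label)


# Records that carry no information at all (found programmatically).
empty_records = [name for name, features in data_dict.items() if has_no_values(features)]

# The three entries named in the text are dropped by key.
outlier_keys = ["TOTAL", "LOCKHART EUGENE E", "THE TRAVEL AGENCY IN THE PARK"]
for key in outlier_keys:
    data_dict.pop(key, None)  # pop with a default does not fail if the key is absent

print(empty_records)
print(sorted(data_dict))
```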

2) What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset; explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values. [relevant rubric items: “create new features”, “properly scale features”, “intelligently select feature”]

### Select Best Features

The features that I used were "salary", "deferred_income", "total_stock_value", "expenses" and "exercised_stock_options", with "poi" as the label. I selected these features using <strong>SelectKBest</strong>, a univariate feature selection method that takes two arrays as input (the features and the target), scores each feature against the target, and keeps the k features with the highest scores. The following table shows the score for each feature; the features with the highest scores are highlighted.

Features | Score
------------ | -------------
**salary** | **3.3796**
deferral_payments | 0.8018
total_payments | 0.0158
loan_advances | 0.1704
bonus | 0.6052
restricted_stock_deferred | 0.0904
**deferred_income** | **3.275**
**total_stock_value** | **11.3589**
**expenses** | **3.9754**
**exercised_stock_options** | **11.9885**
other | 0.0679
long_term_incentive | 0.9383
restricted_stock | 0.8884
director_fees | 1.8093
to_messages | 0.2286
from_poi_to_this_person | 0.2478
from_messages | 0.7503
from_this_person_to_poi | 0.4012
shared_receipt_with_poi | 0.1931
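
The scoring step can be illustrated with a short sketch on synthetic data. The score function used for the table above is not stated, so `f_classif` (the scikit-learn default for SelectKBest) is an assumption here, and the feature names and data are made up.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one column shifts with the label, two are pure noise.
rng = np.random.RandomState(42)
n_samples = 100
y = np.array([0] * 80 + [1] * 20)
informative = 3.0 * y + rng.normal(size=n_samples)
noise_a = rng.normal(size=n_samples)
noise_b = rng.normal(size=n_samples)
X = np.column_stack([noise_a, informative, noise_b])
feature_names = ["noise_a", "informative", "noise_b"]

# Score every feature against the target and keep the k best.
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, y)

scores = dict(zip(feature_names, selector.scores_))
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(float(score), 4))
print("selected:", selected)
```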

The value of k for SelectKBest was chosen through a combined search. A <strong>Pipeline</strong> chains preprocessing steps, such as a selector or a feature extractor, with a final estimator into a single object. <strong>GridSearchCV</strong> is used to optimize the parameters of an estimator. Combining a Pipeline with GridSearchCV makes it possible to grid search over the parameters of all steps in the pipeline at once. For SelectKBest, I tested different values of the parameter "k", the number of top features to select. For the RandomForestClassifier, I tested several values of the parameters "n_estimators" (the number of trees in the forest) and "min_samples_split" (the minimum number of samples required to split an internal node). The table below summarizes the pipeline steps and the parameter values evaluated for each.

Parameters | SelectKBest | RandomForestClassifier
------------ | ------------- | -------------
k | 2, 3, 5, 8 | N/A
n_estimators | N/A | 25, 50, 100, 200
min_samples_split | N/A | 2, 3, 4, 5, 10

The grid search reports the best combination of parameters. As the results below show, the row labeled SelectKBest scores higher on every metric than the row labeled RandomForest. The grid search also returned the best value of the SelectKBest parameter, k = 2.

<pre>
              precision    recall  f1-score   support

 SelectKBest       0.87      0.94      0.91        36
RandomForest       0.50      0.29      0.36         7

 avg / total       0.81      0.84      0.82        43

{'feature_selection__k': 2, 'random_forest__n_estimators': 25, 'random_forest__min_samples_split': 4}
</pre>
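
A combined search of this kind could look like the sketch below. The step names `feature_selection` and `random_forest` follow the parameter keys shown in the output above; the synthetic data, the reduced grid, the `f_classif` score function and `cv=3` are assumptions made to keep the example small and fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with six features, only the first of which is informative.
rng = np.random.RandomState(0)
y = np.array([0] * 60 + [1] * 20)
X = rng.normal(size=(80, 6))
X[:, 0] += 2.0 * y

# Selector and classifier chained into one estimator.
pipeline = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_classif)),
    ("random_forest", RandomForestClassifier(random_state=0)),
])

# Keys are "<step name>__<parameter name>".
param_grid = {
    "feature_selection__k": [2, 3, 5],
    "random_forest__n_estimators": [25, 50],
    "random_forest__min_samples_split": [2, 4],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
best_params = search.best_params_
print(best_params)
```

Because the selector sits inside the pipeline, it is refitted on the training part of every fold, so the feature scores never see the held-out data.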

### Feature Scaling
Feature scaling is a method used to standardize the range of the features of a dataset. Some machine learning algorithms do not work properly without it. Algorithms based on distance require feature scaling: if one feature has a range that is orders of magnitude larger than the others, it can dominate the distance calculation. Min-max scaling rescales each feature to the range [0, 1] or [−1, 1]. The min-max formula for the range [0, 1] is:

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$

where x is the original value and x' is the normalized value. One disadvantage of this formula is that an outlier that ends up as x<sub>min</sub> or x<sub>max</sub> for a particular feature distorts the rescaling of that whole feature. For this project, a scaler based on the z-score (StandardScaler) is recommended, since min-max scaling is sensitive to the presence of extreme values.
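
The effect of an extreme value on min-max scaling can be seen in a small pure-Python sketch. The numbers are made up, and the z-score function is a hand-written equivalent of what a standard scaler computes (mean 0, population standard deviation 1), included only for comparison.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values to [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Four ordinary values and one extreme value (all made up).
values = [50_000, 60_000, 70_000, 80_000, 10_000_000]

scaled_min_max = min_max_scale(values)
scaled_z = z_score_scale(values)

# With min-max scaling the extreme value becomes 1.0 and the
# other four are compressed into a tiny interval near 0.
print([round(v, 4) for v in scaled_min_max])
print([round(v, 4) for v in scaled_z])
```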

### New Feature
I created a new feature called <strong>salary_in_proportion_to_bonus</strong>, which expresses each employee's salary in proportion to the bonus received. The rationale was that POIs could have bonuses much higher than their salaries. However, when I tested it against the other features, it did not show any importance.
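
One way such a feature could be computed is sketched below. The exact formula is not given in the text, so salary divided by bonus is an assumption, as are the "NaN" convention for missing values and the helper name `add_salary_bonus_ratio`; the data is made up.

```python
# Toy records with made-up values.
data_dict = {
    "PERSON A": {"salary": 200000, "bonus": 1000000},
    "PERSON B": {"salary": 150000, "bonus": "NaN"},
    "PERSON C": {"salary": 120000, "bonus": 0},
}


def add_salary_bonus_ratio(data, missing="NaN"):
    """Add salary / bonus to every record, or `missing` when it cannot be computed."""
    for features in data.values():
        salary = features.get("salary", missing)
        bonus = features.get("bonus", missing)
        if salary == missing or bonus == missing or bonus == 0:
            # Keep the dataset's missing-value marker instead of dividing.
            features["salary_in_proportion_to_bonus"] = missing
        else:
            features["salary_in_proportion_to_bonus"] = salary / bonus
    return data


add_salary_bonus_ratio(data_dict)
for name, features in data_dict.items():
    print(name, features["salary_in_proportion_to_bonus"])
```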
***

3) What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms? [relevant rubric item: “pick an algorithm”]

Because the dataset is highly unbalanced, accuracy is not an informative metric in this project and is not used to compare performance. The most relevant metric here is recall; the reason for this choice is discussed in more detail later. I tested several classifiers: Naive Bayes, Decision Tree, Random Forest and Support Vector Machine. The table below shows the results obtained from the tests. The one that showed the best overall performance was the <strong>Naive Bayes Classifier</strong>, with a recall of 0.39800 and a precision of 0.53031. 
The other classifiers showed a higher recall, but their precision was below the required limit of 0.3. The Support Vector Classifier did not give a meaningful result. Its recall of 1.0 is not a reliable value: it arises because the number of false negatives was zero, and the fact that its accuracy and precision are both 0.13333 suggests that it simply labeled every sample as a POI.

Classifier | Accuracy | Precision | Recall
------------ | ------------- | ------------- | -------------
Naive Bayes | 0.87273 | 0.53031 | 0.39800
Decision Tree | 0.59133 | 0.23914 | 0.94650
Random Forest | 0.53573 | 0.20290 | 0.84750
SVC | 0.13333 | 0.13333 | 1.00000
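
The SVC row has the signature of a classifier that flags everyone as a POI. The sketch below reproduces that pattern on made-up labels (2 POIs among 15 people, chosen only so that the base rate is 2/15); it is an illustration, not the project's test output.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = POI.

    Assumes at least one predicted positive and one actual positive,
    so neither division is by zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Made-up test set: 2 POIs and 13 non-POIs.
y_true = [1, 1] + [0] * 13
# Degenerate classifier: every person is predicted to be a POI.
y_pred = [1] * 15

accuracy, precision, recall = evaluate(y_true, y_pred)
print(round(accuracy, 5), round(precision, 5), recall)
```

With no false negatives and no true negatives, recall is 1.0 while accuracy and precision both collapse to the share of POIs in the test set.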

***

4) What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune; if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier). [relevant rubric item: “tune the algorithm”]

The purpose of tuning the parameters of an algorithm is to adapt the model to the characteristics of the dataset in order to get the best performance from it. A well-tuned model is also more robust, meaning that it generalizes better to new training and testing data; a poorly tuned one can overfit or underfit. Furthermore, parameter choices can also affect processing time and memory usage. <br />
I tuned the parameters using the <strong>GridSearchCV</strong> function. GridSearchCV searches over multiple combinations of parameter values and uses cross-validation to determine which combination gives the best performance. I set the scoring parameter to "recall" so that the function returns the parameters that give the best performance on this metric. <br />
The Naive Bayes algorithm doesn't have parameters to tune, but I used GridSearchCV on the Decision Tree, Random Forest and Support Vector classifiers. The following table summarizes the tested parameters and their values for each classifier:

Classifier | kernel | C | min_samples_split | n_estimators | class_weight
------------ | ------------- | ------------- | ------------- | ------------- | -------------
Naive Bayes | N/A | N/A | N/A | N/A | N/A
Decision Tree | N/A | N/A | 20, 30, 40, 50, 60, 80, 90 | N/A | {1:2}, {1:5}, {1:10}, {1:20}
Random Forest | N/A | N/A | 20, 30, 40, 50, 60, 80, 90 | 2, 5, 10, 25, 50 | {1:2}, {1:5}, {1:10}, {1:20}
SVC | rbf | 1e-8, 1e-6, 1e-2, 1 | N/A | N/A | {1:5}, {1:10}, {1:20}, {1:30}

The <strong>kernel</strong> parameter of the Support Vector Classifier specifies the kernel function used by the algorithm. In this case, ‘rbf’ is used. A kernel function lets the classifier operate in a high-dimensional feature space while working with the original low-dimensional features. <br />
The <strong>C</strong> parameter of the SVC controls the tradeoff between a smooth decision boundary and one that classifies all the training points correctly. GridSearchCV returned the value 1e-8. <br />
The <strong>min_samples_split</strong> parameter, used in both the Decision Tree and Random Forest classifiers, defines the minimum number of samples required to split an internal node. GridSearchCV returned the values 90 and 80 for the Decision Tree and Random Forest, respectively. <br />
The <strong>n_estimators</strong> parameter of the Random Forest Classifier specifies the number of trees in the forest. GridSearchCV returned the value 5. <br />
The <strong>class_weight</strong> parameter is useful for datasets with unbalanced classes. It rebalances the training set by giving more emphasis to the samples in the minority class. For all classifiers, GridSearchCV returned the value 1:10. <br />
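
A recall-oriented search over a Decision Tree could look like the sketch below. The data is synthetic, the grid is shortened, the class weights are written with both classes explicit, and the cross-validation settings (`n_splits=10`, `test_size=0.3`) are assumptions chosen to keep the example quick; it uses the `sklearn.model_selection` API, which may differ from the scikit-learn version used in the project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic unbalanced data: 20 positives out of 150, one informative feature.
rng = np.random.RandomState(1)
y = np.array([0] * 130 + [1] * 20)
X = rng.normal(size=(150, 4))
X[:, 0] += 1.5 * y

param_grid = {
    "min_samples_split": [20, 40, 80],
    # Heavier weights on class 1 penalize missed positives more strongly.
    "class_weight": [{0: 1, 1: 2}, {0: 1, 1: 5}, {0: 1, 1: 10}, {0: 1, 1: 20}],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="recall",  # rank parameter combinations by recall
    cv=StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42),
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Optimizing recall alone tends to favor the heaviest class weight, which is consistent with the high-recall, low-precision rows in the results table.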
***


5) What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis? [relevant rubric item: “validation strategy”]

Validation is a technique to estimate how well a predictive model will perform on an independent dataset. A classic mistake is to use the same data both to learn the parameters of a prediction function and to test it. Such a model could simply repeat the labels of the samples it has seen and obtain a perfect score, yet fail to predict anything useful on unseen data. This is called <strong>overfitting</strong>: it occurs when a model begins to "memorize" the training data rather than "learning" to generalize from the trend. An overfitted model has poor predictive performance. <br />
Validation consists of splitting the dataset into training and testing data. There is a trade-off: we want as many data points as possible in the training set to get the best learning results, and as many as possible in the testing set to get the best validation. The <strong>cross validation</strong> strategy avoids setting aside a fixed validation set, which would reduce the number of samples available for learning the model. Cross validation splits the training dataset into k parts, for example 3 parts A, B and C. The model is then trained on 2 of these parts and validated on the remaining one. This is repeated for all combinations (in this case, train on AB and test on C, train on AC and test on B, and train on BC and test on A). The final performance is the average of the performance obtained in each validation. <br />
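
The three-part rotation can be written out in a few lines of plain Python. The per-fold scores are hypothetical numbers used only to show the averaging step, and `leave_one_part_out` is an illustrative helper, not a library function.

```python
from statistics import mean


def leave_one_part_out(part_names):
    """Return a list of (training parts, held-out part), one entry per part."""
    splits = []
    for held_out in reversed(part_names):
        training = [name for name in part_names if name != held_out]
        splits.append((training, held_out))
    return splits


splits = leave_one_part_out(["A", "B", "C"])
for training, held_out in splits:
    print("train on", "".join(training), "-> test on", held_out)

# Hypothetical score obtained on each held-out part.
fold_scores = [0.40, 0.35, 0.45]
cv_score = mean(fold_scores)
print(round(cv_score, 2))
```
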
To validate the analysis, I used the <strong>StratifiedShuffleSplit</strong> cross-validation iterator used in the tester.py file. This cross-validator is suitable for <strong>unbalanced</strong> datasets because it returns stratified randomized folds, which preserve the percentage of samples of each class. I tried different values for the test_size parameter. The one that gave the best performance was test_size=0.1, which is the default value for the parameter.
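
The stratification property can be checked on made-up labels. The sketch uses the `sklearn.model_selection.StratifiedShuffleSplit` API, which is an assumption (tester.py may rely on an older scikit-learn interface), with 10 positives among 100 samples and only 5 splits to keep it short.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up labels: 10 POIs and 90 non-POIs. Features are irrelevant for splitting.
labels = np.array([1] * 10 + [0] * 90)
features = np.zeros((len(labels), 1))

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=42)

test_sizes = []
poi_per_test_fold = []
for train_idx, test_idx in splitter.split(features, labels):
    test_sizes.append(len(test_idx))
    poi_per_test_fold.append(int(labels[test_idx].sum()))

# Every randomized test fold keeps the 10% POI share of the full dataset.
print(test_sizes)
print(poi_per_test_fold)
```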
***

6) Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]

Accuracy is not an ideal metric to evaluate the performance of the POI identifier because the dataset is highly unbalanced. With so few POIs, the number of true positives is necessarily small, so the true negatives dominate the numerator and the true positives carry almost no weight in the accuracy formula, defined as:

$$Accuracy = \frac{\sum True\;positive + \sum True\;negative}{\sum Total\;population}$$

Suppose all cases are identified as non-POIs: the model will have an accuracy of 87.7% (128/146), greater than that of any of the models reported above. <br />
Another problem with using accuracy as the evaluation metric is that it does not distinguish between the two kinds of error: wrongly flagging an innocent person and missing a guilty one. In this case, we want to be certain before incriminating a person. A better approach is to evaluate with <strong>precision</strong> and <strong>recall</strong>. <br />
Precision is defined as:

$$Precision = \frac{\sum True\;positive}{\sum(True\;positive + False\;positive)}$$

And recall is defined as:

$$Recall = \frac{\sum True\;positive}{\sum(True\;positive + False\;negative)}$$

Precision is also referred to as positive predictive value (PPV): it is the proportion of people flagged as POIs who really are POIs. Recall is also referred to as the true positive rate or sensitivity: it is the proportion of actual POIs that are correctly identified. Both metrics are relevant to evaluate the performance of the POI identifier, but a high <strong>recall</strong> is especially desirable, since the objective is to not let any suspect slip through; flagged people can afterwards go through a more complete investigation to determine their guilt or innocence. <br />
The precision and recall reported by the testing script were 0.53031 and 0.39800, respectively. Both values are above the suggested limit of 0.3. In other words, about 53% of the people the identifier flags are actual POIs, and it finds about 40% of the POIs in the dataset.
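
The three formulas can be applied directly to confusion-matrix counts. The sketch below evaluates the "everyone is a non-POI" baseline mentioned in this answer, using the class counts from the dataset (18 POIs, 128 non-POIs); the function names are illustrative.

```python
def accuracy(tp, fp, fn, tn):
    """(TP + TN) / total population."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """TP / (TP + FP); undefined (None) when nobody is predicted positive."""
    return tp / (tp + fp) if (tp + fp) else None


def recall(tp, fn):
    """TP / (TP + FN); undefined (None) when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else None


# Baseline: all 146 people are labeled non-POI, so all 18 POIs are missed.
tp, fp, fn, tn = 0, 0, 18, 128

baseline_accuracy = accuracy(tp, fp, fn, tn)
baseline_precision = precision(tp, fp)
baseline_recall = recall(tp, fn)

print(round(baseline_accuracy, 3), baseline_precision, baseline_recall)
```

The baseline reaches high accuracy while identifying no POI at all, which is why precision and recall are the metrics reported here.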
***

## Conclusion

In this project I used machine learning techniques to investigate which Enron employees may have committed fraud, based on the Enron financial and email dataset. The dataset contains 146 instances with 21 features. It is considered unbalanced because the classes are not represented equally: 128 instances are labeled non-POI and the remaining 18 are labeled POI. A dataset with this structure requires specific approaches. To build the person of interest (POI) identifier, it was necessary to apply different machine learning algorithms, compare them and select the one that gave the best performance. Quantitative metrics were used to evaluate the performance of the algorithms. Cross validation was used to tune the parameters of the algorithms to obtain the best performance. <br />
This project was very challenging and very difficult to complete. I struggled a lot to understand the machine learning concepts and how to implement them. The project reviewer gave me good explanations of many topics that were not clear to me. The difficulties included choosing the classifiers and the relevant parameters to use, tuning the classifiers to deliver the best performance, choosing the validation process for this particular case and determining the appropriate metric to evaluate the performance. This analysis is at an initial stage. It should be possible to achieve better precision and recall than I obtained.