### Data set characteristics:
* Total number of data points: `144`
* Data set persons of interest (POI) / non-POI: `18/126`
* Number of features used: `6`
* Features with missing values: `0` (all missing values were filled with 0)

### Summary

1.) **Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?**

The goal of this project is to implement a machine learning (ML) algorithm that can accurately predict whether a given Enron employee is a person of interest (POI) in the company's fraud scandal. Enron was an "energy services company" that was involved in "one of the biggest bankruptcy filings in the history of the United States" which became apparent "as a number of analysts began to dig into the details of Enron’s publicly released financial statements" (Enron scandal, 2019). Today, machine learning can do most of the heavy lifting by learning which patterns of fraud might have quickly shown up in these financial statements.

This dataset contains financial information (including stock information) about various employees of the organization. Using this data, a supervised classification algorithm can be trained to find the patterns that separate POIs from non-POIs, and its accuracy shows how well the data can be modeled. To do this accurately, a few rows must be removed. In this case, there are a couple of non-POI outliers that skew the summary statistics and obscure the picture of the data set:
* `TOTAL` is an aggregate statistic that needs to be removed from the data frame
* `THE TRAVEL AGENCY IN THE PARK` was a "company" owned by the sister of Enron ringleader Kenneth Lay and is not a POI from Enron.
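
A minimal sketch of this clean-up step, assuming the records are held in a dictionary keyed by name. The function name `drop_non_person_rows` and the toy records are hypothetical and are not taken from the project code:

```python
# Keys named in the text above as rows that are not individual employees.
NON_PERSON_KEYS = ("TOTAL", "THE TRAVEL AGENCY IN THE PARK")


def drop_non_person_rows(records, keys=NON_PERSON_KEYS):
    """Return a copy of `records` without the listed keys."""
    return {name: row for name, row in records.items() if name not in keys}


# Toy records with made-up values (assumption: dict of dicts keyed by name).
toy_records = {
    "EMPLOYEE A": {"salary": 100, "poi": False},
    "EMPLOYEE B": {"salary": 200, "poi": True},
    "TOTAL": {"salary": 300, "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": 0, "poi": False},
}

cleaned = drop_non_person_rows(toy_records)
print(sorted(cleaned))  # ['EMPLOYEE A', 'EMPLOYEE B']
```

Returning a filtered copy rather than deleting in place keeps the raw data available for comparison.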

Furthermore, two features were created to better represent the data in the set:
* `total_compensation` is the sum of `total_stock_value` and `total_payments`, so that gross compensation could be added to the model; some POIs appeared to be paid disproportionately in stock while others received some compensation through salary and other direct payments. This looks at compensation solely according to the dollar value of all compensation given.
* `comp_minus_sal` is a feature created from `total_compensation` by subtracting the person's salary, so it looks only at money paid outside of salary, such as stock. This gives a better picture of some of the money that would have been flowing behind the scenes in the form of insider stock trading and the like.
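
The two engineered features could be computed roughly as follows. This is a sketch under stated assumptions: each person is a dictionary of raw fields, missing values appear as `None` or the string `"NaN"` and count as 0, and `comp_minus_sal` is total compensation minus salary. The helper names `_num` and `add_compensation_features` are hypothetical.

```python
def _num(value):
    """Treat missing entries (None or the string 'NaN') as 0."""
    return 0 if value in (None, "NaN") else value


def add_compensation_features(row):
    """Return a copy of `row` with the two engineered features added."""
    out = dict(row)
    total = _num(row.get("total_stock_value")) + _num(row.get("total_payments"))
    out["total_compensation"] = total
    # Assumption: "minus sal" means subtracting the salary field.
    out["comp_minus_sal"] = total - _num(row.get("salary"))
    return out


# Toy row with made-up values; the stock value is missing.
toy_row = {"salary": 200_000, "total_payments": 300_000, "total_stock_value": "NaN"}
enriched = add_compensation_features(toy_row)
print(enriched["total_compensation"], enriched["comp_minus_sal"])  # 300000 100000
```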

Omitting these two features from the model produced a lower accuracy but a _marginally_ better recall. Ultimately, they were kept in the final model because the F1 score _with_ the engineered features was better than the score without them.

### Features

2.) **What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not?**

The features selected were:
* `exercised_stock_options`, `bonus`, `salary`, `long_term_incentive` - due to their correlation with POIs
* `total_compensation` - a computed measure of payments and stock awards together
* `comp_minus_sal` - total compensation, minus the person's salary

In this problem, none of the features were scaled. Some of the higher magnitude values were removed during the data preprocessing, and for many of the measures (at least, those missing any data), the minimum was 0 and the maximum could be in the tens of millions of dollars. Not all of the features would have scaled equally well. `comp_minus_sal` was an engineered feature used to look at total compensation packages without looking at salary, on the assumption that dirty money would be "hidden" in less obvious places than a worker's salary.

```
Feature Importances

Random Forest Classifier
 [('exercised_stock_options', 0.20342257063088942), ('bonus', 0.17437672556999195), ('salary', 0.11651845758656504), ('long_term_incentive', 0.08606113227789798), ('total_compensation', 0.2106706664293011), ('comp_minus_sal', 0.20895044750535458)] 

Decision Tree
 [('exercised_stock_options', 0.08417508417508415), ('bonus', 0.34325544634822974), ('salary', 0.1388888888888888), ('long_term_incentive', 0.0), ('total_compensation', 0.3877889721961889), ('comp_minus_sal', 0.04589160839160842)]
 ```
 
Much of the decision not to scale has to do with the number of missing values in the data, which were assumed to be 0.

### Algorithm analysis

3.) **What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?**

I chose a random forest classifier (RFC) because of its accuracy when working with a large set of features. I also tried Gaussian naive Bayes and a decision tree classifier, but the RFC performed better on nearly all measurements, and its F1 scores were consistently higher than those of the other models tested.

```
RandomForestClassifier(max_depth=5, random_state=42)
Accuracy:           0.8409
Precision:          0.8604
Recall:             0.9736
F1:                 0.9135
F2:                 0.9487

Total predictions:      44
True positives:         37
False positives:         6
False negatives:         1
True negatives:          0
---

RandomForestClassifier(max_depth=10, min_samples_split=4,
                       min_weight_fraction_leaf=0.1, n_estimators=1000,
                       random_state=42)        
Accuracy:            0.8636
Precision:           0.8636
Recall:                 1.0
F1 :                 0.9268
F2 :                 0.9693

Total predictions:       44
True positives:          38
False positives:          6
False negatives:          0
True negatives:           0
---

GaussianNB()
Accuracy:            0.8181
Precision:           0.8750
Recall:              0.9210
F1 :                 0.8974
F2 :                 0.9114

Total predictions:       44
True positives:          35
False positives:          5
False negatives:          3
True negatives:           1
---

DecisionTreeClassifier(random_state=42)
Accuracy:             0.7954
Precision:            0.8717
Recall:               0.8947
F1 :                  0.8831
F2 :                  0.8900

Total predictions:        44
True positives:           34
False positives:           5
False negatives:           4
True negatives:            1
```

4.) **What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).**

Hyperparameter tuning is the process of adjusting the parameters of an ML model that cannot be learned directly from the training process. Hyperparameters can include the penalty, rate of convergence, choice of kernel function, number of estimators, gamma, and more. If they are tuned poorly, the model can underfit or overfit the training data and generalize badly to new data. In this case, there was no parameter tuning of the algorithms beyond the two random forest configurations listed above; all other parameters were kept at their defaults.
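
To illustrate how a tuning pass could work, here is a minimal exhaustive grid search written with the standard library only. It is a sketch, not the method used in this project: `grid_search` and `toy_validation_score` are hypothetical names, and the scoring formula is made up so the example runs without a model. In practice the scoring function would fit a classifier with the given parameters and return a validation metric such as F1.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in `param_grid`; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Stand-in for 'fit the model, return a validation score' (made-up formula)."""
    depth_penalty = abs(params["max_depth"] - 5) * 0.05
    size_penalty = abs(params["n_estimators"] - 100) / 1000
    return 1.0 - depth_penalty - size_penalty


grid = {"max_depth": [3, 5, 10], "n_estimators": [10, 100, 1000]}
best_params, best_score = grid_search(grid, toy_validation_score)
print(best_params)  # {'max_depth': 5, 'n_estimators': 100}
```

The grid grows multiplicatively with each added parameter, so a small data set like this one keeps an exhaustive search affordable.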

5.) **What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?**

Validation data is a data set that sits between the training and testing data and is used to compare performance during hyperparameter optimization. In this case, the test data and validation data were one and the same; with larger data sets, a separate test set would be used to gauge the performance characteristics of the algorithm, such as specificity, sensitivity, and F-measure. The validation data set works as an intermediary to prevent overfitting of the model. A classic mistake is to evaluate the model on the same data it was trained on, which makes an overfit model look better than it really is.
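
With only 18 POIs among 144 records, a hold-out set is more trustworthy when it preserves the class ratio. The following is a minimal stratified split using only the standard library. The function name `stratified_split`, the 30% test fraction, and the seed are assumptions chosen for illustration and are not taken from the project code.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Label layout matching the class counts at the top of this report.
labels = [1] * 18 + [0] * 126
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 101 43
```

Because each class is sampled separately, the test set always contains POIs, which a plain random split on such a small minority class cannot guarantee.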

6.) **Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.**

Precision is the fraction of selected items that are relevant, while recall is the fraction of relevant items that are selected. For the RFC, the algorithm chosen for this problem, precision was 0.8718 and recall was 0.8947. Precision is the number of true positives divided by the total number of predicted positive outcomes. Recall, also known as the probability of detection, is the number of true positives divided by the number of actual positives in the sample. In binary classification problems with imbalanced classes, the F-score is often a more informative measure than plain accuracy; in this case, it was 0.8831, making the model moderately accurate.
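
These definitions can be checked directly against confusion-matrix counts. The sketch below uses the counts from the first random forest listing in the algorithm analysis section (37 true positives, 6 false positives, 1 false negative, 0 true negatives); the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and F2 from confusion counts.

    Assumes tp + fp > 0 and tp + fn > 0; otherwise the divisions fail.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    def f_beta(beta):
        # F-beta weights recall beta times as heavily as precision.
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f_beta(1),
        "f2": f_beta(2),
    }


metrics = classification_metrics(tp=37, fp=6, fn=1, tn=0)
for name, value in metrics.items():
    print(f"{name:<10} {value:.4f}")
```

The results agree with that listing to within rounding in the last digit.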

### References



"Enron scandal". October 7, 2019. In *Encyclopaedia Britannica online*. Retrieved from https://www.britannica.com/event/Enron-scandal
