##Summarize the goal of this project and how machine learning is useful in trying to accomplish it.

The goal of this project is to develop a model capable of identifying people of interest (POIs) in the Enron scandal using only the financial and e-mail data.  Machine learning suits this task because it is a classification problem, and many classification algorithms are available for it.  A classifier predicts categorical outcomes by defining decision surfaces that separate the outcomes.  These decision surfaces are learned by training the classifier on past data so that the model can accurately predict unseen data points.

Identifying people of interest is a good fit for a supervised classification algorithm.  The dataset given to me contains 145 people, of whom only 18 are labeled as people of interest.  Each person has 19 features.  Below is a table of the number of empty entries for each feature in the financial data:

<table>
<tr>
<th>Feature name</th>
<th># of empty values</th>
</tr>
<tr>
<td>salary</td>
<td>51</td>
</tr>
<tr>
<td>deferral_payments</td>
<td>107</td>
</tr>
<tr>
<td>total_payments</td>
<td>21</td>
</tr>
<tr>
<td>loan_advances</td>
<td>142</td>
</tr>
<tr>
<td>bonus</td>
<td>64</td>
</tr>
<tr>
<td>restricted_stock_deferred</td>
<td>128</td>
</tr>
<tr>
<td>deferred_income</td>
<td>97</td>
</tr>
<tr>
<td>total_stock_value</td>
<td>20</td>
</tr>
<tr>
<td>expenses</td>
<td>51</td>
</tr>
<tr>
<td>exercised_stock_options</td>
<td>44</td>
</tr>
<tr>
<td>other</td>
<td>53</td>
</tr>
<tr>
<td>long_term_incentive</td>
<td>80</td>
</tr>
<tr>
<td>restricted_stock</td>
<td>36</td>
</tr>
<tr>
<td>director_fees</td>
<td>129</td>
</tr>
</table>
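The counts in the table above can be reproduced with a short loop. This is a minimal sketch, assuming the dataset is a dictionary of the form `{person: {feature: value}}` with missing values stored as the string `"NaN"`; the people and values below are made up for illustration.

```python
from collections import Counter

# Hypothetical miniature of the dataset. The real dictionary is assumed to
# have the same shape, with missing values stored as the string "NaN".
data_dict = {
    "PERSON A": {"salary": 200000, "bonus": "NaN", "loan_advances": "NaN"},
    "PERSON B": {"salary": "NaN", "bonus": 500000, "loan_advances": "NaN"},
    "PERSON C": {"salary": 150000, "bonus": 100000, "loan_advances": "NaN"},
}


def count_missing(data, missing="NaN"):
    """Return a Counter mapping each feature name to its number of missing entries."""
    counts = Counter()
    for features in data.values():
        for name, value in features.items():
            if value == missing:
                counts[name] += 1
    return counts


missing_counts = count_missing(data_dict)
for name, n in sorted(missing_counts.items()):
    print(name, n)
```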


Initially I visualized the data by plotting various features to see whether any of them separated people of interest from non-POIs.  I found that a few features had a small number of entries (deferral_payments, deferred_income) and some had fairly good predictive power (exercised_stock_options).  I found one outlier, the entry for TOTAL, which I removed from the dataset dictionary using the **pop()** function.  The TOTAL entry stands out as an outlier in the salary vs. bonus plot.
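The outlier removal described above can be sketched as follows. The dictionary contents are hypothetical; only the `TOTAL` key and the use of `pop()` come from the analysis itself.

```python
# Hypothetical miniature of the dataset dictionary, including the
# spreadsheet "TOTAL" row that was read in as if it were a person.
data_dict = {
    "PERSON A": {"salary": 200000, "bonus": 300000},
    "PERSON B": {"salary": 150000, "bonus": 100000},
    "TOTAL": {"salary": 350000, "bonus": 400000},
}

# The entry with the largest salary is the one that stands out in a
# salary vs. bonus scatter plot.
largest = max(data_dict, key=lambda name: data_dict[name]["salary"])
print("largest salary entry:", largest)

# pop() removes the key and returns its value; the None default avoids a
# KeyError if the entry has already been removed.
removed = data_dict.pop("TOTAL", None)
print("remaining entries:", sorted(data_dict))
```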

##What features did you end up using in your POI identifier and what selection process did you use to pick them?

The features that I ended up using are the following:
- stock_pay_ratio
- bonus
- salary
- shared_receipt_with_poi
- exercised_stock_options
- total_stock_value

I engineered 3 different features. The first was whether restricted_stock_deferred was empty (NaN) or not.  This seemed like a good feature to engineer because many POIs in the financial data from *FindLaw.com* did not have a value for it.  My conclusion was that POIs, who were most likely in executive positions, knew that something was going to happen and wanted their money before everything went bust, so they had no need to defer any stock.  The second feature was the ratio between a person's total stock value and their total payments value.  The reasoning was that many POIs have high stock/total_pay ratios because they wanted most of their money to be in stock options, which is a good tax shelter for the wealthy and a common practice among executives. The last feature was the ratio of director fees to expenses.  It seemed that many POIs masked payments to themselves through the more generic expenses category rather than the more explicit "Director Fees" category.
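The three engineered features can be sketched like this. Apart from `stock_pay_ratio`, the new feature names are hypothetical, and two choices are assumptions rather than part of the original analysis: missing values are stored as the string `"NaN"`, and a ratio falls back to 0.0 when either input is missing or the denominator is zero.

```python
import math

MISSING = "NaN"  # assumption: how missing values are encoded in the dataset


def is_missing(value):
    """True for None, the "NaN" string, or a float nan."""
    if value is None or value == MISSING:
        return True
    return isinstance(value, float) and math.isnan(value)


def safe_ratio(numerator, denominator):
    """Ratio that falls back to 0.0 when an input is missing or the denominator is 0."""
    if is_missing(numerator) or is_missing(denominator) or denominator == 0:
        return 0.0
    return float(numerator) / float(denominator)


def add_engineered_features(person):
    """Return a copy of one person's feature dict with the three new features added."""
    result = dict(person)
    # 1 if restricted_stock_deferred is empty, else 0 (hypothetical feature name)
    result["restricted_stock_deferred_missing"] = int(
        is_missing(person.get("restricted_stock_deferred"))
    )
    result["stock_pay_ratio"] = safe_ratio(
        person.get("total_stock_value"), person.get("total_payments")
    )
    # hypothetical feature name for the director fees / expenses ratio
    result["fees_expenses_ratio"] = safe_ratio(
        person.get("director_fees"), person.get("expenses")
    )
    return result


example = add_engineered_features({
    "total_stock_value": 1000,
    "total_payments": 250,
    "director_fees": "NaN",
    "expenses": 50,
    "restricted_stock_deferred": "NaN",
})
print(example)
```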

I used the **SelectPercentile** function to extract the most important features for POI classification.  Initially I set the percentile to 5%, but when I checked the result, it had chosen only a single feature.  To avoid a high-bias model, I thought it would be better to increase the percentile so that I would have more features to train my model on.  After bumping that number up to 50%, I got the list of features above.  A small percentile parameter is only beneficial if you have a large number of features.  In my case, with only 12 features, it is best to use a higher percentile to balance the bias-variance trade-off.  The feature scores are as follows:
- stock_pay_ratio - 8.0163
- bonus - 23.8236
- salary - 11.7762
- shared_receipt_with_poi - 7.9910
- exercised_stock_options - 18.0632
- total_stock_value - 17.3184
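The selection step can be sketched with scikit-learn as follows. The data here is synthetic, the feature names are hypothetical, and `f_classif` (the default scoring function for `SelectPercentile`) is an assumption about the scoring that was used.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in for the real data: 100 people, 20 of them POIs,
# with two informative features and two pure-noise features.
rng = np.random.RandomState(0)
y = np.array([0] * 80 + [1] * 20)
X = rng.normal(size=(100, 4))
X[:, 0] += 3 * y  # strongly shifted for POIs
X[:, 1] += 2 * y  # moderately shifted for POIs
feature_names = ["informative_a", "informative_b", "noise_a", "noise_b"]

# Keep the top 50% of features ranked by their ANOVA F-score.
selector = SelectPercentile(score_func=f_classif, percentile=50)
selector.fit(X, y)

selected = [
    name for name, keep in zip(feature_names, selector.get_support()) if keep
]
for name, score in zip(feature_names, selector.scores_):
    print(name, round(float(score), 4))
print("selected:", selected)
```

With only a handful of features, a small percentile keeps very few of them, which matches the single-feature result seen at 5%.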

##What algorithm did you end up using? What other ones did you try?

The algorithm that I ended up using was the **Decision Tree Classifier**.  It gave me the best balance of precision and recall of all the classifiers that I experimented with.  I initially started with the **Gaussian Naive Bayes classifier** to see how it performed.  When running this classifier on the large dataset run by tester.py, I got excellent recall (0.94) but poor precision (0.24).  The other algorithm I tried, which I thought would give much better performance, was the **RandomForest Classifier**.  I expected it to do better than a single decision tree since it uses an ensemble of decision trees.  It did give me good precision (0.54) but poor recall (0.17).  With the Decision Tree Classifier, I got on average a precision score of 0.33 and a recall score of 0.34.

##What does it mean to tune the parameters of an algorithm, and what can happen if you don't do this well?  How did you tune the parameters of your algorithm?

To tune the parameters of an algorithm means to fit your model with a range of values for each of several parameters and keep the combination of values that performs best.  I tuned my classifiers using the **GridSearchCV** function, running it on both my RandomForest Classifier and my DecisionTree Classifier.  This was fairly easy since the two share a good number of parameters.  What I found is that it is probably best to fix a **random_state** value for the *train_test_split* function, because GridSearchCV will otherwise produce a different "best estimator" for each different training/test split.  I also noticed that when passing the GridSearchCV object itself into tester.py, the run takes a long time to complete (I'm not sure that it even finishes, because I stopped it).  To work around this, I printed out the *best_estimator_* found by GridSearchCV and used those parameters for my classifier. Below is a subset of the parameter combinations tried, with the mean and standard deviation of their scores as given by the **grid_scores_** attribute.

<table>
<tr>
<th>Parameters</th>
<th>mean</th>
<th>std. dev</th>
</tr>
<tr>
<td> 'max_features': 'auto', 'min_samples_split': 1, 'criterion': 'gini'</td>
<td> 0.86139</td>
<td>0.03862 </td>
</tr>
<tr>
<td>'max_features': 'auto', 'min_samples_split': 5, 'criterion': 'gini' </td>
<td>0.79208</td>
<td>0.04039</td>
</tr>
<tr>
<td>'max_features': 'auto', 'min_samples_split': 10, 'criterion': 'gini' </td>
<td>0.84158 </td>
<td>0.01294</td>
</tr>
<tr>
<td>'max_features': 'auto', 'min_samples_split': 15, 'criterion': 'gini' </td>
<td>0.81188 </td>
<td>0.01279 </td>
</tr>
<tr>
<td>'max_features': 'log2', 'min_samples_split': 1, 'criterion': 'entropy' </td>
<td>0.79208 </td>
<td>0.04622 </td>
</tr>
<tr>
<td>'max_features': 'log2', 'min_samples_split': 5, 'criterion': 'entropy' </td>
<td>0.81188 </td>
<td>0.03512 </td>
</tr>
<tr>
<td>'max_features': 'log2', 'min_samples_split': 10, 'criterion': 'entropy' </td>
<td>0.80198 </td>
<td>0.03470 </td>
</tr>
<tr>
<td>'max_features': 'log2', 'min_samples_split': 15, 'criterion': 'entropy' </td>
<td>0.84158 </td>
<td>0.03543 </td>
</tr>
</table>
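The tuning workflow described above can be sketched as follows, on synthetic data. Several details are assumptions: this targets a recent scikit-learn, where `GridSearchCV` lives in `sklearn.model_selection`, results are read from `cv_results_` (which replaced `grid_scores_`), `min_samples_split` must be at least 2, and `'sqrt'` stands in for the older `'auto'` option.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and POI labels.
rng = np.random.RandomState(42)
y = np.array([0] * 80 + [1] * 20)
X = rng.normal(size=(100, 4))
X[:, 0] += 2 * y

# Fixing random_state makes the split, and so the chosen estimator, repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10, 15],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)

# Mean and standard deviation of the cross-validated score per combination.
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(params, round(float(mean), 5), round(float(std), 5))

# Rebuild a plain classifier from the best parameters so that later
# evaluation does not repeat the whole grid search on every fit.
clf = DecisionTreeClassifier(random_state=42, **search.best_params_)
clf.fit(X_train, y_train)
print(search.best_params_)
```

Handing a plain classifier built from `best_params_` to an evaluation script avoids re-running the full search on every fold, which is one plausible reason the tester run was so slow.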



##What is validation, and what's a classic mistake you can make if you do it wrong?  How did you validate your analysis?

Validation is the practice of checking the performance of your machine learning model by running it against test data that was held out from training.  The classic mistake is to end up with a high-bias algorithm (one that underfits and is too simple to have good predictive power) or a high-variance algorithm (one that overfits the training data and so generalizes poorly to new data).  If your training and testing data are not split correctly, either scenario can arise.  Thus it may be necessary to train and test on multiple folds of the dataset using the **K-fold** cross validation technique.  I validated my analysis by using the **train_test_split** function to split my data 70/30 and calculating the accuracy_score, precision_score and recall_score on my testing data.  As these metrics improved, I felt more confident about the predictive accuracy the model would have on the larger dataset.  I also used the GridSearchCV function to determine which configuration was best for my decision tree classifier.
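The idea behind K-fold cross validation can be sketched in plain Python. This is an illustrative helper, not the scikit-learn implementation: `k_fold_indices` is a hypothetical name, and it does no shuffling or stratification, both of which matter in practice for a dataset with so few POIs.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split.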

##Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm's performance.

My Decision Tree Classifier had the following scores on average on the large test dataset run by tester.py:
- accuracy_score: 0.82
- precision_score: 0.33
- recall_score: 0.34

My accuracy score means that my classifier predicted the correct outcome for 82% of the 15000 test predictions.  The precision score means that when my algorithm flags someone as a person of interest, there is a 33% chance that the person actually is one.  This value is the number of people the algorithm correctly predicted as people of interest (true positives) divided by the sum of that number and the number of people it incorrectly predicted as people of interest (true positives + false positives).  The recall score means that my algorithm identifies 34% of the people who actually are people of interest.  This value is the number of true positives divided by the sum of that number and the number of people the algorithm predicted were not people of interest but who actually were (true positives + false negatives).
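The three metrics can be computed directly from the confusion counts. This is a small sketch on made-up labels (1 = POI, 0 = non-POI), not the tester.py output.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks a POI."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up example: 3 real POIs among 10 people.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of those flagged as POI, how many really are
recall = tp / (tp + fn)     # of the real POIs, how many were flagged

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
```

Because POIs are rare, accuracy stays fairly high even when most POIs are missed, which is why precision and recall are the more informative numbers here.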