### Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]

The goal of this project is to identify whether a person is likely guilty in the notorious Enron fraud, using publicly available information such as financial data and emails.

Enron traded a wide variety of commodities, from oil and bandwidth to weather derivatives, using mark-to-market accounting. It cultivated ties with politicians such as George W. Bush and caused great losses to the public, including the California electricity shortage. All of this information could be useful if text learning were applied to the emails; for example, a pattern indicating that someone was a decision maker could also suggest that they are a person of interest. However, text learning is not applied in this analysis since it is a more advanced topic.

This analysis uses a financial dataset containing each person's salary, stock information, and so on. During the Enron fraud, people like Jeffrey Skilling, Ken Lay, and Andrew Fastow all dumped large amounts of stock options, and they were all found guilty. This information is reflected directly in the dataset and can help identify other persons of interest. This is also where machine learning comes into play. By building models that capture the relationship between a person's quantitative data and whether that person is a person of interest, machine learning tries to learn a pattern that helps us identify a guilty person in the future.

Each feature contains some outliers. To find them, a multivariable linear regression model was fitted, and the 10% of data points with the largest error between predicted and actual values were removed to build an outlier-cleaned dataset. However, because many persons of interest appeared among the outliers, the original dataset was kept as well. The rest of the analysis investigates both datasets and trains estimators on each.
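The residual-based cleaning described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code used in the analysis: the helper name `drop_largest_residuals`, the choice of squared residuals as the error measure, and the regression target are all assumptions.

```python
import numpy as np


def drop_largest_residuals(X, y, fraction=0.10):
    """Fit ordinary least squares y ~ X and drop the rows with the largest squared residuals.

    Hypothetical helper: the original analysis does not say which target or
    error measure was used, so squared residuals are an assumption.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Add an intercept column so the regression is not forced through the origin.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    squared_residuals = (y - design @ coef) ** 2
    n_keep = len(y) - int(round(fraction * len(y)))
    # Indices of the rows with the smallest residuals, restored to original order.
    keep = np.sort(np.argsort(squared_residuals)[:n_keep])
    return X[keep], y[keep], keep


# Synthetic data with one planted outlier at row 7.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)
y[7] += 100.0

X_clean, y_clean, kept = drop_largest_residuals(X, y)
print(len(y_clean), 7 in kept)
```

Because a fixed fraction is always removed, this approach discards points even when nothing is anomalous, which is one reason to keep the original dataset alongside the cleaned one.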

### What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores.  [relevant rubric items: “create new features”, “properly scale features”, “intelligently select feature”]

The final analysis uses five features, chosen with SelectKBest:
- exercised_stock_options, with a score of 319.54
- total_stock_value, with a score of 211.70
- bonus, with a score of 63.12
- total_payments, with a score of 53.61
- loan_advances, with a score of 46.03

This feature selection method was chosen after comparing it with three other feature selection methods, each run with and without PCA. Feature scaling is used because it gave the fastest run time at the highest scores, although results without scaling were also good. The reason for scaling is the huge difference in magnitude between features such as 'total_payments' and 'from_messages' or 'to_messages'. Because the features are not on the same scale, it is better to scale them before further analysis.
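As a rough illustration of scaling followed by SelectKBest, the sketch below runs on synthetic data. The scaler (`MinMaxScaler`), the score function (`f_classif`, the scikit-learn default for SelectKBest), and the feature names are assumptions; the write-up does not state which ones were used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the dataset: one informative feature on a huge
# scale (like a payment amount) and two noise features on smaller scales.
rng = np.random.default_rng(42)
n_samples = 100
labels = np.array([0] * 85 + [1] * 15)
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names
informative = labels * 5.0 + rng.normal(size=n_samples)
X = np.column_stack([
    informative * 1e6,
    rng.normal(size=n_samples),
    rng.normal(size=n_samples) * 1e3,
])

# Rescale every feature to [0, 1] so no feature dominates by magnitude.
X_scaled = MinMaxScaler().fit_transform(X)

# Keep the single best feature according to the ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=1).fit(X_scaled, labels)
scores = dict(zip(feature_names, selector.scores_))
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
print("selected:", selected)
```

The F-score of a single feature does not change when that feature is linearly rescaled, so in a setup like this scaling mainly benefits the downstream classifier (for example LinearSVC or k-nearest neighbors) rather than the selection step itself.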

Three new features, "stock_salary_ratio", "poi_from_ratio", and "poi_to_ratio", were created while analyzing the data.

stock_salary_ratio is total_stock_value divided by salary. It rests on the assumption that a person of interest usually has an unusually large stock value, since that part of the compensation is under the table, while salary information is more easily known by the public. The ratio can therefore help identify a poi: the bigger the ratio, the more likely the person is a poi.

poi_from_ratio is from_poi_to_this_person divided by from_messages. This feature assumes that a poi tends to have more contact with other pois, so the ratio would be bigger. The same reasoning applies to poi_to_ratio.
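The three ratios could be computed along the following lines. This is a sketch under stated assumptions: the helper names and the example record are hypothetical, missing values are assumed to appear as `None` or the string `"NaN"`, and the definition of poi_to_ratio (from_this_person_to_poi divided by to_messages) is a guess that mirrors poi_from_ratio, since the text does not spell it out.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 if a value is missing or the denominator is zero."""
    missing = (None, "NaN")  # assumed markers for missing values
    if numerator in missing or denominator in missing or denominator == 0:
        return 0.0
    return float(numerator) / float(denominator)


def add_ratio_features(person):
    """Return a copy of one person's record with the three engineered ratios added."""
    enriched = dict(person)
    enriched["stock_salary_ratio"] = safe_ratio(
        person.get("total_stock_value"), person.get("salary"))
    enriched["poi_from_ratio"] = safe_ratio(
        person.get("from_poi_to_this_person"), person.get("from_messages"))
    # Assumed definition, mirroring poi_from_ratio.
    enriched["poi_to_ratio"] = safe_ratio(
        person.get("from_this_person_to_poi"), person.get("to_messages"))
    return enriched


# Hypothetical record, not taken from the real dataset.
example = {
    "salary": 250000,
    "total_stock_value": 1000000,
    "from_messages": 120,
    "to_messages": 200,
    "from_poi_to_this_person": 30,
    "from_this_person_to_poi": 10,
}
enriched = add_ratio_features(example)
no_salary = add_ratio_features({"salary": "NaN", "total_stock_value": 500000})
print(enriched["stock_salary_ratio"], enriched["poi_from_ratio"], enriched["poi_to_ratio"])
```

Returning 0.0 for missing or zero denominators keeps the feature numeric, at the cost of making "unknown" indistinguishable from "no stock relative to salary".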

### What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]

The final algorithm is LinearSVC. Three algorithms were evaluated, and LinearSVC reached the same scores in the shortest time. Part of the results is shown below.

```python
import pandas as pd
pd.read_csv("result.csv")
```

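| # | cleaned | model | scaled | feature_selection_method | classification_method | accuracy_score | f1_score | precision_score | recall_score | time_used |
|---|---------|-------|--------|--------------------------|-----------------------|----------------|----------|-----------------|--------------|-----------|
| 0 | False | 1 | True | k_best | linear_svc | 0.944 | 0.5 | 1 | 0.333 | 1.424 |
| 1 | False | 1 | False | k_best | linear_svc | 0.944 | 0.5 | 1 | 0.333 | 1.474 |
| 2 | False | 14 | False | k_best_with_pca | k_neighbors | 0.944 | 0.5 | 1 | 0.333 | 5.054 |
| 3 | False | 17 | False | linear_svc_l1_with_pca | k_neighbors | 0.944 | 0.5 | 1 | 0.333 | 10.154 |
| 4 | False | 11 | False | extra_tree | k_neighbors | 0.944 | 0.5 | 1 | 0.333 | 15.624 |
| 5 | False | 23 | False | extra_tree_with_pca | k_neighbors | 0.944 | 0.5 | 1 | 0.333 | 16.92 |
| 6 | False | 20 | False | logistic_reg_with_pca | k_neighbors | 0.944 | 0.5 | 1 | 0.333 | 109.164 |
| 7 | False | 8 | False | logistic_reg | k_neighbors | 0.944 | 0.5 | 1 | 0.333 | 443.944 |
| 8 | True | 1 | True | k_best | linear_svc | 0.97 | 0.667 | 1 | 0.5 | 1.506 |
| 9 | True | 13 | True | k_best_with_pca | linear_svc | 0.97 | 0.667 | 1 | 0.5 | 1.969 |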


### What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric item: “tune the algorithm”]

Tuning an algorithm means searching for the parameter values that work best for the specific problem at hand. If an algorithm is not tuned well, its accuracy, precision, and recall can drop, and its run time can increase.

To tune the algorithm, the available parameters were put into a dictionary as keys, each with a list of candidate values. GridSearchCV then searched for the best combination among the given parameters. Because the tuning result is limited by the range of values supplied, it is important to provide a reasonable number of values while keeping the time cost in mind.
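A parameter grid of this kind might look like the sketch below. The data is synthetic, and the specific parameters (`C`, `tol`), their candidate values, the f1 scoring, and the pipeline layout are illustrative assumptions rather than the grid used in the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in for the POI data.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3,
    weights=[0.85, 0.15], random_state=0,
)

# Scaling lives inside the pipeline so it is refitted on each training fold.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", LinearSVC(max_iter=10000, random_state=0)),
])

# Parameter names are keys, candidate values are lists (assumed values).
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__tol": [1e-4, 1e-3],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

The grid above has 4 x 2 = 8 combinations, each fitted 5 times, so the cost grows multiplicatively with every parameter added. That is the time tradeoff mentioned above.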

### What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric item: “validation strategy”]

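Validation means evaluating a model on data that was not used to train it. Without validation, the model may overfit the training data without this being noticed. To avoid that, train_test_split was used to split the dataset into a training set and a testing set. This is a quick and easy hold-out form of validation.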
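A hold-out split with train_test_split can be sketched as follows. The sample counts, the 25% test size, and the use of `stratify` are assumptions for illustration; the write-up only states that train_test_split was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 16 persons of interest among 120 people.
labels = np.array([1] * 16 + [0] * 104)
features = np.arange(len(labels) * 2, dtype=float).reshape(-1, 2)

# stratify=labels keeps the share of POIs equal in both splits, which
# matters when the positive class is this rare (an optional choice).
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42,
)

print(len(y_train), int(y_train.sum()), len(y_test), int(y_test.sum()))
```

With so few positive examples, a single split yields noisy scores; repeating the split (for example with StratifiedShuffleSplit) is a common way to get more stable estimates.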

### Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]

The evaluation metrics used in this analysis are accuracy, f1 score, precision, and recall.
The final model has an accuracy score of 0.97, which means 97% of its predictions on the test data are correct.
A precision score of 1.0 tells us that whenever this model predicted a person to be a poi, that person truly was a poi. On the other hand, with a recall score of 0.5, the model found only half of all the pois.
Finally, there is always a tradeoff between precision and recall, and the f1 score measures how well the two are balanced. With an f1 score of 0.67, the model performs reasonably well, though there is room for improvement in the future.
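To make the relationship between these four metrics concrete, the sketch below computes them from confusion-matrix counts. The counts are hypothetical: they were picked only because they reproduce the rounded scores reported above, and the actual test-set counts are not given in this write-up.

```python
def classification_scores(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and f1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 2 POIs in a test set of 33 people, 1 of them found,
# and no non-POI flagged by mistake.
scores = classification_scores(tp=1, fp=0, fn=1, tn=31)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, a model that never flagged anyone would still reach an accuracy of 31/33, which is why precision, recall, and f1 say more about POI detection than accuracy does.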