# Identifying Persons of Interest in the Enron Fraud
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives.

This project builds a person of interest (POI) identifier based on financial and email data made public as a result of the Enron scandal. To assist with this, I have created a hand-generated list of persons of interest in the fraud case: individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for immunity from prosecution.

The Enron email and financial data have been combined into a dictionary, where each key-value pair in the dictionary corresponds to one person. The dictionary key is the person's name, and the value is another dictionary, which contains the names of all the features and their values for that person. The features in the data fall into three major types, namely financial features, email features and POI labels.

**financial features:** ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)

**email features:** ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (units are generally number of email messages; the notable exception is 'email_address', which is a text string)

**POI label:** ['poi'] (boolean, represented as integer)
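
The nested-dictionary layout described above can be sketched as follows. This is a minimal illustration, not the code in poi_id.py: the two records are hypothetical placeholders rather than real Enron entries, the helper name `to_rows` is invented for this example, and the use of the string `'NaN'` as the missing-value marker (replaced here by 0.0) is an assumption.

```python
# Hypothetical sample in the layout described above: person name -> feature dict.
# Names and values are placeholders, NOT real records from the dataset.
sample_data = {
    "EXAMPLE PERSON A": {
        "salary": 100000,
        "bonus": "NaN",  # assumed marker for a missing value
        "to_messages": 250,
        "email_address": "person.a@example.com",
        "poi": False,
    },
    "EXAMPLE PERSON B": {
        "salary": 200000,
        "bonus": 50000,
        "to_messages": "NaN",
        "email_address": "person.b@example.com",
        "poi": True,
    },
}


def to_rows(data, feature_names, missing="NaN"):
    """Flatten the nested dictionary into parallel lists.

    Returns (names, labels, rows), where labels holds the 'poi' flag as an
    integer and each row lists the requested numeric features, with missing
    values replaced by 0.0 (an assumption made for this sketch).
    """
    names, labels, rows = [], [], []
    for name in sorted(data):
        record = data[name]
        names.append(name)
        labels.append(int(record["poi"]))
        rows.append([
            0.0 if record.get(feature, missing) == missing else float(record[feature])
            for feature in feature_names
        ])
    return names, labels, rows


names, labels, rows = to_rows(sample_data, ["salary", "bonus", "to_messages"])
for name, label, row in zip(names, labels, rows):
    print(name, label, row)
```

The text feature 'email_address' is left out of the requested feature list because it cannot be converted to a number.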

## Questions

**1. Summarize the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question.**

**Were there any outliers in the data when you got it, and how did you handle those?**  


**2. What features did you end up using in your POI identifier, and what selection process did you use to pick them?** 

**Did you have to do any scaling? Why or why not?**

**As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.)**

**In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.**


**3. What algorithm did you use?** 

**What other one(s) did you try? How did model performance differ between algorithms?** 


**4. What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?**  

**How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).**  


**5. What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?**

**6. Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.**
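
Two metrics often reported for a classifier like this one are precision and recall. The question does not name them; they are chosen here only for illustration. The sketch below computes both from scratch. The label lists are made up for the example and are not results from this project.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 marks a POI."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Made-up labels for illustration only.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In plain terms, precision is the share of people flagged as POIs who really are POIs, and recall is the share of actual POIs that the identifier manages to flag.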

## Files included in this project:
* **poi_id.py:** Code for the POI identifier.
* **final_project_dataset.pkl:** The dataset for the project.
* **tester.py:** Used to test the functionality of the poi_id.py file.
* **emails_by_address:** This directory contains many text files, each of which contains all the messages to or from a particular email address.