# Analysis of Enron Data
## Based on the Udacity Intro to Machine Learning course

### Goal

The goal of this analysis is to use publicly available data from the investigation into the [2001 Enron scandal](https://en.wikipedia.org/wiki/Enron_scandal) to develop a machine learning algorithm that identifies persons of interest (referred to as "pois") with both precision and recall above 0.3.

### Data sources
- Raw email text data can be found at: https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz
and a breakdown of emails by sender (provided by the course instructors) can be found [here](data/emails_by_address/)
- The financial data was compiled from [this file](data/financial_data.pdf) and organized by the course instructors into [this dictionary](data/final_project_dataset.pkl)
- The pois were manually compiled by the course instructors and come from [this file](data/poi_names.txt)

- The intermediate data files for the analysis steps described below can be found [here](https://www.dropbox.com/sh/iyd7j82lsghxtgr/AADFlMHuNZdeq5dHCJ7ykIppa?dl=0)
- The final dataset, feature names, and estimator can be found [here](https://www.dropbox.com/sh/bk6amqwcx133rhp/AACOLu6NQRIAhwz6H4JgO3nPa?dl=0)

### Feature generation 

I generated two sets of features from the data:

1) Polynomial features derived from the financial data: The generation, scaling, and testing of these features can be seen in the [polynomial_features](polynomial_features.ipynb) notebook

2) Word features derived from raw emails sent by Enron employees: 
- The extraction and cleanup of the text data can be seen in the [email_features](email_features.ipynb) notebook
- The vectorization of the email words using term frequency-inverse document frequency (TF-IDF) weighting can be seen in the [vectorize_email_features](vectorize_email_features.ipynb) notebook
- Finally, the conversion of the generated features into a usable dictionary can be seen in the [save_email_features](save_email_features.ipynb) notebook
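
The two feature families above can be illustrated with a small, standard-library-only sketch. It is not the code from the notebooks (which may well rely on a library such as scikit-learn); the function names, the sample numbers, and the plain textbook TF-IDF formula (term frequency times `log(N / document frequency)`, with no smoothing or normalization) are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement


def polynomial_features(row, degree=2):
    """Return every product of 1..degree entries of `row` (no constant term)."""
    features = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(row, d):
            features.append(math.prod(combo))
    return features


def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def tfidf(documents):
    """Map each tokenized document to a {word: tf-idf weight} dictionary."""
    n_docs = len(documents)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Made-up values standing in for two financial fields of one person.
print(polynomial_features([2.0, 3.0]))  # [2.0, 3.0, 4.0, 6.0, 9.0]

# Made-up, already-tokenized emails.
emails = [["meeting", "schedule", "meeting"], ["schedule", "lunch"]]
for weights in tfidf(emails):
    print(weights)
```

Words that occur in every document (here `schedule`) receive a weight of zero, which is why TF-IDF tends to emphasize vocabulary that distinguishes one sender from another.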

### Parameter optimization

Once a set of features had been generated, I tested five machine learning classification algorithms to see which could identify pois most effectively:

1) Gaussian Naive Bayes

2) Support Vector Machines

3) AdaBoost

4) Random Forests

5) Logistic Regression

Of these five, AdaBoost consistently provided the best results.
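
A comparison of this kind can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed; the synthetic, imbalanced dataset from `make_classification`, the default classifier settings, the 5-fold split, and the F1 scoring are all stand-ins chosen for the example, not the setup used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in data: roughly 15% positives, mimicking a rare "poi" class.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=42,
)

# The five algorithm families listed above, with default settings (an assumption).
classifiers = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

mean_f1 = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, scoring="f1", cv=cv)
    mean_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {mean_f1[name]:.3f}")
```

Rankings on synthetic data say nothing about the Enron features; the sketch only shows the shape of the comparison loop.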

The process of identifying the best classification algorithm and tuning its hyperparameters to maximize performance can be seen in the [parameter_optimization](parameter_optimization.ipynb) notebook.
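
Hyperparameter tuning for an AdaBoost classifier might look like the sketch below. It assumes scikit-learn and uses synthetic data. The tree settings (`max_depth=3`, `class_weight='balanced'`, `random_state=42`) mirror the final estimator printed in the results section, while the grid values, the cross-validation scheme, and the F1 scoring are assumptions rather than the notebook's actual search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real feature matrix and poi labels.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=42,
)

# Shallow, class-balanced trees as the weak learners.
base_tree = DecisionTreeClassifier(
    max_depth=3, class_weight="balanced", random_state=42
)

# The base estimator is passed positionally because its keyword name differs
# between scikit-learn versions (base_estimator vs. estimator).
booster = AdaBoostClassifier(base_tree, random_state=42)

# Hypothetical search grid; the values actually explored may differ.
param_grid = {
    "n_estimators": [10, 25, 50],
    "learning_rate": [0.01, 0.1, 1.0],
}

# Repeated stratified splits suit small, imbalanced datasets.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

search = GridSearchCV(booster, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean F1: {search.best_score_:.3f}")
```

Scoring on F1 is one reasonable choice when the target is a joint threshold on precision and recall, since F1 is the harmonic mean of the two.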

### Results of the analysis

The final dataset, feature names, and estimator can be found [here](https://www.dropbox.com/sh/bk6amqwcx133rhp/AACOLu6NQRIAhwz6H4JgO3nPa?dl=0)

The performance of the final estimator, as reported by `tools/tester.py`, is shown below:

    run tools/tester.py

    Loading data
    Done loading
    Start testing
    Testing splits: 
    . 5.0 % . 10.0 % . 15.0 % . 20.0 % . 25.0 % . 30.0 % . 35.0 % . 40.0 % . 45.0 % . 50.0 % . 60.0 % . 65.0 % . 70.0 % . 75.0 % . 80.0 % . 85.0 % . 90.0 % . 95.0 % . 100.0 % 
    ESTIMATOR:
    AdaBoostClassifier(algorithm='SAMME.R',
              base_estimator=DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=3,
                max_features=None, max_leaf_nodes=None,
                min_impurity_split=1e-07, min_samples_leaf=1,
                min_samples_split=2, min_weight_fraction_leaf=0.0,
                presort=False, random_state=42, splitter='best'),
              learning_rate=0.01, n_estimators=25, random_state=42)
    RESULTS:
    Total predictions: 15000
    True positives:  722
    False positives: 1013
    False negatives: 1278
    True negatives: 11987
    PERFORMANCE:
    Accuracy: 0.84727
    Precision: 0.41614
    Recall: 0.36100
    F1: 0.38661
    Done testing
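
The performance figures follow directly from the four confusion-matrix counts reported above. The short check below recomputes them; the function name is chosen for illustration and is not part of `tools/tester.py`.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # share of predicted pois that are real pois
    recall = tp / (tp + fn)     # share of real pois that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts taken from the RESULTS output above.
metrics = classification_metrics(tp=722, fp=1013, fn=1278, tn=11987)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
# accuracy: 0.84727
# precision: 0.41614
# recall: 0.36100
# f1: 0.38661
```

Both precision and recall clear the 0.3 target. Accuracy is high mostly because non-pois dominate the predictions, which is why it is not the headline metric here.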
