# Analysis of Enron Data
## Based on the Udacity Intro to Machine Learning course

### Data sources:
- The raw email text is available at https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz
  and a breakdown of emails by sender is [here](data/emails_by_address/)
- The financial data was compiled from [this file](data/financial_data.pdf)
- The persons of interest (POIs) come from [this file](data/poi_names.txt)

### Load the preprocessed data

- `all_keys` from [outlier_removal](outlier_removal.ipynb)
- `mean_data` from [imputing_data](imputing_data.ipynb)

In [1]:
import os
import pickle

HOME_PATH = os.path.expanduser('~')
DATA_PATH = os.path.join(HOME_PATH, 'Desktop', 'raw_data', 'ml')
mean_path = os.path.join(DATA_PATH, 'mean_data.pkl')
keys_path = os.path.join(DATA_PATH, 'all_keys.pkl')

with open(mean_path, 'rb') as f:
    mean_data = pickle.load(f)

with open(keys_path, 'rb') as f:
    all_keys = pickle.load(f)
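
The two `.pkl` files read above are produced by the other notebooks, which are not shown here. As a minimal sketch of the round trip, assuming those notebooks wrote the objects with `pickle.dump`, the following saves and reloads a small stand-in object. The names `example_data` and `example_path` are hypothetical, and a temporary directory is used so nothing on disk is overwritten.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real imputed data set.
example_data = {'PERSON A': {'salary': 1000.0, 'poi': False}}

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'example_data.pkl')

    # Write in binary mode, mirroring the 'rb' mode used when loading.
    with open(example_path, 'wb') as f:
        pickle.dump(example_data, f)

    # Read it back the same way the cell above reads mean_data.pkl.
    with open(example_path, 'rb') as f:
        restored = pickle.load(f)

print(restored == example_data)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.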

### Make a pipeline

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import GaussianNB

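# Template pipeline: the step names are what matter here, because the grid
# search in parameter_optimization swaps in other steps and parameters.
# Note that chi2 expects non-negative features, so this default selector
# cannot be fit directly on StandardScaler output.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('polynomials', PolynomialFeatures(2)),
    ('feature_selr', SelectPercentile(chi2)),
    ('estimator', GaussianNB()),
])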

Load the pipeline parameters found in [parameter_optimization](parameter_optimization.ipynb):

In [13]:
with open('testing_params.pickle', 'rb') as f:
    testing_params = pickle.load(f)

with open('large_grid.pkl', 'rb') as f:
    large_grid = pickle.load(f)
    
# Alternative: apply the saved parameters to the pipeline template instead.
# clf = pipe.set_params(**testing_params)
clf = large_grid.best_estimator_
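
The cell above offers two routes to a configured classifier: applying saved parameters with `set_params`, or taking `best_estimator_` from a fitted grid search. The sketch below illustrates the difference on synthetic data. It is not the search from parameter_optimization: the data, the grid, and the names `demo_pipe` and `demo_grid` are assumptions made for illustration, and `f_classif` is swapped in for `chi2` because `chi2` rejects the negative values that `StandardScaler` produces.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook uses the Enron features.
X, y = make_classification(n_samples=120, n_features=8, random_state=42)

# Simplified pipeline with the same step names as the template above.
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selr', SelectPercentile(f_classif)),
    ('estimator', GaussianNB()),
])

# Hypothetical grid; keys follow the '<step>__<parameter>' convention.
demo_grid = GridSearchCV(
    demo_pipe,
    param_grid={'feature_selr__percentile': [25, 50, 100]},
    scoring='f1',
    cv=3,
)
demo_grid.fit(X, y)

# Route 1: the winning pipeline, already refitted on all of X, y.
fitted_clf = demo_grid.best_estimator_

# Route 2: the same settings applied to a fresh, unfitted copy.
unfitted_clf = clone(demo_pipe).set_params(**demo_grid.best_params_)

same_percentile = (
    fitted_clf.get_params()['feature_selr__percentile']
    == unfitted_clf.get_params()['feature_selr__percentile']
)
print(demo_grid.best_params_, same_percentile)
```

Both routes end with the same hyperparameters. The practical difference is that `best_estimator_` arrives already fitted, while a pipeline configured through `set_params` still has to be fit before it can predict.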

In [14]:
clf

Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomials', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('feature_selr', SelectFromModel(estimator=ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini'...random_state=42, splitter='best'),
          learning_rate=0.1, n_estimators=250, random_state=42))])

In [15]:
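# Dump classifier, dataset, and features_list so anyone can check the results.
# NOTE: my_dataset and features_list are not defined in the cells shown here;
# they must already exist in the session before this cell runs.
from tools.tester import dump_classifier_and_data
dump_classifier_and_data(clf, my_dataset, features_list)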

In [16]:
%run tools/tester.py

Loading data
Done loading
Start testing
Testing splits: 
. 5 % . 10 % . 20 % . 40 % . 80 % Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomials', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('feature_selr', SelectFromModel(estimator=ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini'...random_state=42, splitter='best'),
          learning_rate=0.1, n_estimators=250, random_state=42))])
	Accuracy: 0.81947	Precision: 0.27132	Recall: 0.21000	F1: 0.23675	F2: 0.21994
	Total predictions: 15000	True positives:  420	False positives: 1128	False negatives: 1580	True negatives: 11872

Done testing
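
The scores reported by the tester follow directly from the four confusion-matrix counts printed with them. As a check, the sketch below recomputes each score from those counts using the standard definitions. It assumes Python 3 division, and `f_beta` is a helper written for this example, not part of `tools/tester.py`.

```python
# Counts copied from the tester output above.
tp, fp, fn, tn = 420, 1128, 1580, 11872

total = tp + fp + fn + tn
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted POIs that are real POIs
recall = tp / (tp + fn)        # share of real POIs that were found


def f_beta(precision, recall, beta):
    """Weighted harmonic mean; beta > 1 weights recall more heavily."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1 = f_beta(precision, recall, 1)
f2 = f_beta(precision, recall, 2)

print(total)
print(f"{accuracy:.5f} {precision:.5f} {recall:.5f} {f1:.5f} {f2:.5f}")
# prints:
# 15000
# 0.81947 0.27132 0.21000 0.23675 0.21994
```

The accuracy is high mainly because 13000 of the 15000 test predictions concern non-POIs. Precision and recall show that the classifier finds only 21% of the actual POIs and that about 27% of its POI predictions are correct.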
