# Enron Dataset: Predicting POI

This project uses the infamous Enron email dataset to predict whether an employee was a 'Person of Interest' (POI) in the Enron fraud investigation.

The dataset is available at https://www.cs.cmu.edu/~./enron/ .


### Importing necessary files

In [1]:
import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
import matplotlib.pyplot as plt
from tester import dump_classifier_and_data, test_classifier
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler

import pandas as pd
import numpy as np
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing




### Processing the data

The function 'get_nan_counts' converts the string 'NaN' to np.nan and returns a pandas DataFrame listing each feature and its percentage of null (NaN) values.


In [2]:
def get_nan_counts(dictionary):
    
    my_df = pd.DataFrame(dictionary).transpose()
    nan_counts_dict = {}
    for column in my_df.columns:
        my_df[column] = my_df[column].replace('NaN',np.nan)
        nan_counts = my_df[column].isnull().sum()
        nan_counts_dict[column] = round(float(nan_counts)/float(len(my_df[column])) * 100,1)
    df = pd.DataFrame(nan_counts_dict,index = ['percent_nan']).transpose()
    df.reset_index(level=0,inplace=True)
    df = df.rename(columns = {'index':'feature'})
    return df
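
As a quick check of what the function above computes, the same percentage can be sketched in plain Python on a toy dictionary. The names `percent_nan` and `toy_data` are hypothetical and not part of the project; the only assumption carried over from the dataset is that missing values are stored as the string 'NaN'.

```python
def percent_nan(dictionary):
    """Return {feature: percent of records whose value is the string 'NaN'}."""
    features = sorted({f for record in dictionary.values() for f in record})
    n_records = len(dictionary)
    result = {}
    for feature in features:
        # A feature that is absent from a record is counted as missing too.
        missing = sum(1 for record in dictionary.values()
                      if record.get(feature, 'NaN') == 'NaN')
        result[feature] = round(missing / n_records * 100, 1)
    return result


# Hypothetical toy records, shaped like the entries of the Enron dictionary.
toy_data = {
    'A': {'salary': 100, 'bonus': 'NaN'},
    'B': {'salary': 'NaN', 'bonus': 'NaN'},
    'C': {'salary': 300, 'bonus': 50},
    'D': {'salary': 400, 'bonus': 'NaN'},
}

nan_report = percent_nan(toy_data)
print(nan_report)
```

Features with a very high percentage of missing values are natural candidates to drop before feature selection.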

### Feature Selection

Select the features to use. Initially we include as many candidate features as possible.
We then use k-best to score how useful each feature is for training. Including many unnecessary features can lead to overfitting.

In [3]:
features_list = ['poi', 'salary', 'deferral_payments','total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'other', 'exercised_stock_options', 'long_term_incentive', 'restricted_stock', 'director_fees', 'shared_receipt_with_poi']

### Load Dictionary

Load the dictionary containing the dataset

In [4]:
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

### Outlier Removal

In [5]:
outliers = []
for key in data_dict:
    val = data_dict[key]['salary']
    if val == 'NaN':
        continue
    outliers.append((key,int(val)))

outliers = sorted(outliers,key=lambda x:x[1],reverse=True)[:4]
outliers.append(('BHATNAGAR SANJAY', 888))
for x in outliers:
    print(x)
    data_dict.pop(x[0],0)

('TOTAL', 26704229)
('SKILLING JEFFREY K', 1111258)
('LAY KENNETH L', 1072321)
('FREVERT MARK A', 1060932)
('BHATNAGAR SANJAY', 888)


### Create new features

Visualizing the dataset shows that some features become more informative when combined with others.
The ratio of emails a person sent to POIs to all emails that person sent tells us how often the person communicates with POIs. Similarly, the ratio of emails received from POIs to all emails received is a potentially interesting feature. As we will see later, the fraction of emails received from POIs turns out not to be much of a deciding feature.

We create two new features, 'fraction_from_poi_email' and 'fraction_to_poi_email', and add them to the dataset.


In [6]:
my_dataset = data_dict

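def dict_to_list(key,normalizer):
    # Return one ratio key/normalizer per person, in data_dict iteration order.
    new_list=[]

    for i in data_dict:
        if data_dict[i][key]=="NaN" or data_dict[i][normalizer]=="NaN" or data_dict[i][normalizer]==0:
            # Missing values or a zero denominator yield a ratio of 0.
            new_list.append(0.)
        else:
            new_list.append(float(data_dict[i][key])/float(data_dict[i][normalizer]))
    return new_list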

fraction_from_poi_email=dict_to_list("from_poi_to_this_person","to_messages")
fraction_to_poi_email=dict_to_list("from_this_person_to_poi","from_messages")


count = 0
for i in data_dict:
    data_dict[i]["fraction_from_poi_email"] = fraction_from_poi_email[count]
    data_dict[i]["fraction_to_poi_email"] = fraction_to_poi_email[count]
    count += 1

my_dataset = data_dict
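
The ratio features above can be illustrated on a single made-up record. This is a minimal sketch: `safe_fraction` and `toy_person` are hypothetical names, the message counts are invented, and the guard against a zero denominator is an added assumption rather than something taken from the dataset.

```python
def safe_fraction(part, total):
    """Return part / total, or 0.0 when a value is missing or total is 0."""
    if part == 'NaN' or total == 'NaN' or total == 0:
        return 0.0
    return float(part) / float(total)


# Hypothetical record using the same key names as the Enron dictionary.
toy_person = {
    'from_this_person_to_poi': 30,
    'from_messages': 120,
    'from_poi_to_this_person': 'NaN',
    'to_messages': 800,
}

fraction_to_poi = safe_fraction(toy_person['from_this_person_to_poi'],
                                toy_person['from_messages'])
fraction_from_poi = safe_fraction(toy_person['from_poi_to_this_person'],
                                  toy_person['to_messages'])

print(fraction_to_poi)    # 30 of 120 sent emails went to a POI
print(fraction_from_poi)  # missing count, so the ratio falls back to 0.0
```

Mapping missing counts to 0.0 keeps every person in the dataset, at the cost of treating "unknown" the same as "never emailed a POI".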


### Selecting the best features

We score the candidate features with SelectKBest (k = 4). After analysing the importance scores, we define a new features_list to use for our predictions.

In [7]:
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

k=4
k_best = SelectKBest(k=k)
k_best.fit(features, labels)
scores = k_best.scores_
print(scores)

features_list = ['poi','total_stock_value','fraction_to_poi_email','expenses','shared_receipt_with_poi']


[  9.28161841   0.05748658   2.55178656   0.128223     8.56202764
   0.74093048  18.01617966  10.92757014   5.42330897   0.51551682
  11.73424577   3.37327549   1.14246906   1.76254073   6.33964132]
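
The printed array is easier to read when each score is paired with a feature name. The sketch below assumes that the columns returned by featureFormat follow the order of the initial features_list with 'poi' removed; `candidate_features` and `ranked` are illustrative names, and the scores are copied from the output above.

```python
# Candidate features in the order of the initial features_list (without 'poi').
candidate_features = [
    'salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus',
    'restricted_stock_deferred', 'deferred_income', 'total_stock_value',
    'expenses', 'other', 'exercised_stock_options', 'long_term_incentive',
    'restricted_stock', 'director_fees', 'shared_receipt_with_poi',
]

# Scores copied from the printed k_best.scores_ array.
scores = [
    9.28161841, 0.05748658, 2.55178656, 0.128223, 8.56202764,
    0.74093048, 18.01617966, 10.92757014, 5.42330897, 0.51551682,
    11.73424577, 3.37327549, 1.14246906, 1.76254073, 6.33964132,
]

# Sort (feature, score) pairs from highest to lowest score.
ranked = sorted(zip(candidate_features, scores),
                key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print('{:<28s}{:8.3f}'.format(name, score))
```

If the column-order assumption holds, this ranking makes it easy to compare the scored features against the hand-picked features_list.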


### Read data

featureFormat takes the dataset dictionary and returns a numpy array of the formatted data.
targetFeatureSplit splits that array into labels and features for training.

Both functions are imported from feature_format.py.

In [8]:
data = featureFormat(my_dataset, features_list, sort_keys = True)

labels, features = targetFeatureSplit(data)

### Feature Scaling

Use StandardScaler to scale every feature in features_list (all entries except the 'poi' label) to zero mean and unit variance.

In [9]:
# StandardScaler expects a 2D array (samples x features) and scales each column independently.
features = StandardScaler().fit_transform(np.array(features))
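
To make the scaling step concrete, here is a plain-Python sketch of the same standardization applied to one feature column. `standardize` and `toy_expenses` are hypothetical names with invented values; the sketch uses the population standard deviation, which is what StandardScaler uses.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale a list of numbers to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical values for a single feature column.
toy_expenses = [10.0, 20.0, 30.0, 40.0]
scaled = standardize(toy_expenses)
print(scaled)
```

Note that a decision tree splits on thresholds, so its predictions do not depend on this kind of per-feature rescaling; scaling matters more for distance-based or gradient-based classifiers.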
        



### Select best classifier

The modified dataset is now used to train a Decision Tree classifier whose parameters were tuned for performance.
The classifier is evaluated with the test_classifier function defined in tester.py.

In [10]:
# Only the non-default parameters are set; recent scikit-learn releases
# no longer accept min_impurity_split or presort.
clf = DecisionTreeClassifier(criterion='gini', max_depth=2,
            max_leaf_nodes=20, min_samples_split=5)

test_classifier(clf,my_dataset,features_list)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=20, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=5,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
	Accuracy: 0.88343	Precision: 0.60575	Recall: 0.52700	F1: 0.56364	F2: 0.54107
	Total predictions: 14000	True positives: 1054	False positives:  686	False negatives:  946	True negatives: 11314
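
The reported metrics follow directly from the four confusion-matrix counts in the output above. The sketch below recomputes them with the standard definitions; the variable names and the `f_beta` helper are illustrative and not taken from tester.py.

```python
# Counts copied from the tester output above.
true_positives = 1054
false_positives = 686
false_negatives = 946
true_negatives = 11314

total = true_positives + false_positives + false_negatives + true_negatives

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)


def f_beta(precision, recall, beta):
    """F-beta score: weights recall beta times as much as precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1 = f_beta(precision, recall, 1)
f2 = f_beta(precision, recall, 2)

print(round(accuracy, 5), round(precision, 5), round(recall, 5),
      round(f1, 5), round(f2, 5))
```

Because POIs are a small minority of the records, accuracy alone is a weak indicator here; precision and recall describe how well the POI class itself is identified.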



### Validation

Use train_test_split to hold out 30% of the data as a validation set.

In [11]:
# sklearn.cross_validation was removed from scikit-learn; model_selection provides the same function.
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

### Train and Predict

In [12]:
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)

### Get Recall

Recall is one of several metrics for evaluating the classifier. It measures the fraction of actual POIs in the validation set that the classifier identifies.

In [13]:
print(recall_score(labels_test, pred))

0.75


### Dump the data

In [14]:
dump_classifier_and_data(clf, my_dataset, features_list)