# Identifying Fraud From Enron Email

A nanodegree project for Intro to Machine Learning.

In [None]:
#!/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from pprint import pprint

import numpy as np

In [None]:
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

# Part One - Understanding the Dataset and Question

## Data Exploration

To better understand the dataset, an exploration is performed here, and the results are summarized as follows:
- there are 146 data points with 21 features each, for a total of 3066 observations.
- 18 people are persons of interest (POIs).
- 1,358 values are missing.
- the three features with the most missing values are "loan_advances", "director_fees", and "restricted_stock_deferred".

A more detailed exploration and analysis follow below.

In [None]:
# number of data points
len(data_dict.keys())

In [None]:
# number of features available
len(data_dict['METTS MARK'])

In [None]:
# available features
data_dict["METTS MARK"].keys()

In [None]:
# people of interest
count = 0
for key, item in data_dict.iteritems():
    if item["poi"]:
        print key
        count += 1
count

In [None]:
# create a dictionary for all missing values
missing = {}
for key, item in data_dict.iteritems():
    for elem, value in item.iteritems():
        if value == "NaN":
            if elem not in missing:
                missing[elem] = 1
            else:
                missing[elem] += 1

In [None]:
# number of missing values
number_of_missing = 0
for key, item in missing.iteritems():
    number_of_missing += item
number_of_missing

In [None]:
missing

In [None]:
# check who isn't missing the feature 'loan_advances'
# outputs the person's name and a boolean value indicating whether the person is a poi.
for key, item in data_dict.iteritems():
    if item["loan_advances"] != "NaN":
        print "name: ", key, ",poi:", item["poi"]

In [None]:
# check who isn't missing the feature 'director_fees'
# outputs the person's name and a boolean value indicating whether the person is a poi.
for key, item in data_dict.iteritems():
    if item["director_fees"] != "NaN":
        print "name:", key, ",poi:", item["poi"]

In [None]:
# check who isn't missing the feature 'restricted_stock_deferred'
# outputs the person's name and a boolean value indicating whether the person is a poi.
for key, item in data_dict.iteritems():
    if item["restricted_stock_deferred"] != "NaN":
        print "name:", key, ",poi:", item["poi"]

In [None]:
# check who isn't missing the feature 'deferral_payments'
# outputs the person's name and a boolean value indicating whether the person is a poi.
for key, item in data_dict.iteritems():
    if item["deferral_payments"] != "NaN":
        print "name:", key, ",poi:", item["poi"]

As shown above, there is no clear pattern linking missing values to POI status. The investigation of missing values ends here; the missing values are replaced with 0 when the features are formatted.
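The counting and zero-filling described above can be illustrated on toy data. This is a minimal sketch, not the project's `featureFormat` helper: the names `toy_data`, `count_missing`, and `fill_missing` are hypothetical, and the records are made up.

```python
from collections import Counter

# Toy records shaped like data_dict; the names and values are invented.
toy_data = {
    "PERSON A": {"salary": 100000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
    "PERSON C": {"salary": 250000, "bonus": 50000, "poi": False},
}


def count_missing(dataset):
    """Count, per feature, how many records hold the string "NaN"."""
    counts = Counter()
    for record in dataset.values():
        for feature, value in record.items():
            if value == "NaN":
                counts[feature] += 1
    return counts


def fill_missing(record, features, fill_value=0.0):
    """Return the selected values as floats, with "NaN" replaced by fill_value."""
    return [fill_value if record[f] == "NaN" else float(record[f])
            for f in features]


missing_counts = count_missing(toy_data)
row_b = fill_missing(toy_data["PERSON B"], ["salary", "bonus"])
```

One consequence of zero-filling is that a missing salary and a salary of 0 become indistinguishable once the data is in array form.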

# Part Two - Outlier Investigation


## Remove the TOTAL Data Point
As already seen in the mini-projects, this dataset contains an outlier named "TOTAL". It needs to be removed before any further analysis.

In [None]:
# remove the outlier 'TOTAL'
data_dict.pop("TOTAL")

## Plots of the Outliers

To understand the outliers in this dataset, salary is plotted against every other feature except poi, which is used to color the data points in each plot. As a starting point, all available features are selected and put into the model. Later in this report, some features will be removed based on their PCA importance score.

In [None]:
import pandas as pd

# create features for plots
# features_list is a list of strings, each of which is a feature name.
# The first feature must be "poi".
features_list = ['poi',
                 'salary',
                 'to_messages',
                 'deferral_payments',
                 'total_payments',
                 'exercised_stock_options',
                 'bonus',
                 'restricted_stock',
                 'shared_receipt_with_poi',
                 'restricted_stock_deferred',
                 'total_stock_value',
                 'expenses',
                 'loan_advances',
                 'from_messages',
                 'other',
                 'from_this_person_to_poi',
                 'director_fees',
                 'deferred_income',
                 'long_term_incentive',
                 'from_poi_to_this_person']

# format the dataset
data = featureFormat(data_dict, features_list)

# create a pandas dataframe
df = pd.DataFrame(data, columns = features_list)

In the following plots, blue color stands for poi, red color stands for non-poi.

In [None]:
%matplotlib inline
from ggplot import *

# iter through all features
# x axis will always be salary
# poi is represented by colors of points
# the rest of features are put in y axis
for feature in features_list:
    if feature != "poi" and feature != "salary":
        print ggplot(aes(x = 'salary', y = feature, color = 'poi'),
               data = df) +\
        geom_point() +\
        ggtitle("salary against " + feature)

The purpose of removing outliers is to prevent extreme cases from distorting the model, which assumes that the extreme cases either rarely happen or do not carry enough valuable information to be kept. This may be true for some features, but it is debatable for the "total_payments" feature and should not be applied to "exercised_stock_options", as the top four outliers are all persons of interest. On the other hand, if we were to treat the top 10% of each feature as outliers, the final dataset would likely retain far less than 90% of the data points. Such a large reduction of the original dataset would weaken the model.

Given these considerations, a detailed outlier removal is performed below.
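A rough calculation shows the concern about trimming the top 10% of every feature. The sketch below assumes, purely for illustration, that the 19 non-poi features flag data points independently of one another. Real features are correlated, so the actual loss would be smaller, but the direction of the effect is the same. The function name is hypothetical.

```python
def expected_kept_fraction(n_features, trim_fraction=0.1):
    """Fraction of points surviving if each feature independently drops
    its top trim_fraction of points (an idealized assumption)."""
    return (1.0 - trim_fraction) ** n_features


# 19 features remain in features_list once 'poi' is excluded.
kept = expected_kept_fraction(19)
```

Under this idealized assumption only about 13.5% of the data points would survive, which is why per-feature trimming is not used here.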

## Remove Outliers

Instead of directly removing a fixed percentage of data points per feature, we start by fitting a simple linear regression model, calculating the residuals, and treating the data points with the largest residuals as outliers.

First of all, prepare features and labels.

In [None]:
### Store to my_dataset for easy export below.
my_dataset = data_dict

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

Since we only want to remove outliers based on the residuals of a linear model, there is no need to split the dataset for now. The model is built as follows:

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(features, labels)
predictions = reg.predict(features)

In [None]:
### check the score
reg.score(features, labels)

Create the cleaner function.

In [None]:
def outlierCleaner(predictions, features, labels):
    """
        clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual label value)

        return two lists - normals and outliers. Outliers
        are data points with the top 10% largest residual
        errors, the rest are in normals. Both of the lists
        are formatted as numpy array, and exactly like the
        formats after calling featureFormat.
    """

    normals = []
    outliers = []
    data = []
    length = int(len(predictions) * 0.9) + 1 # define the number of data points to be kept in normals

    ### create a dataset with a format:
    ### tuple(feature, label, residual errors)
    for i in range(len(predictions)):
        result = features[i], labels[i], (labels[i] - predictions[i]) ** 2
        data.append(tuple(result))
        
    ### sort dataset by deviations
    data.sort(key=lambda value: value[2])

    ### access dataset and create normals and outliers
    count = 0
    for values in data:
        count += 1
        if count <= length:
            normals.append(np.append([values[1]],values[0]))
        else:
            outliers.append(np.append([values[1]],values[0]))
    
    return normals, outliers

In [None]:
### extract normal data points and outliers
cleaned_data, outliers = outlierCleaner(predictions, features, labels)

In [None]:
### extract labels and features from cleaned_data
cleaned_labels, cleaned_features = targetFeatureSplit(cleaned_data)

In [None]:
# fit the model again
reg.fit(cleaned_features, cleaned_labels)
# check the score
reg.score(cleaned_features, cleaned_labels)

Removing the outliers improved the score of the linear model dramatically, from 0.35 to 0.82. Although the improvement is encouraging, it is still necessary to look at the removed outliers.
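The residual-based trimming rule can be tried on a toy one-feature problem. The sketch below is not the notebook's `outlierCleaner`. It uses a hand-written least-squares line instead of sklearn, and the helper names and data are hypothetical. It keeps the same cut-off, `int(n * 0.9) + 1` points.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def split_by_residual(xs, ys, keep_fraction=0.9):
    """Sort points by squared residual and split them into normals and outliers."""
    slope, intercept = fit_line(xs, ys)
    scored = [((y - (slope * x + intercept)) ** 2, x, y)
              for x, y in zip(xs, ys)]
    scored.sort(key=lambda item: item[0])
    n_keep = int(len(scored) * keep_fraction) + 1  # same rule as in the notebook
    normals = [(x, y) for _, x, y in scored[:n_keep]]
    outliers = [(x, y) for _, x, y in scored[n_keep:]]
    return normals, outliers


# Twenty points on the line y = 2x + 1, with one planted extreme value.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
ys[7] = 500.0
normals, outliers = split_by_residual(xs, ys)
```

With 20 points the rule keeps 19, so only the planted point is removed. Note that the `+ 1` means fewer than 10% of the points are discarded on small datasets.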

## Find the Outliers

Although the featureFormat function is convenient because it turns a Python dictionary into a numpy array, it also makes it difficult to check who has been removed, since the dictionary keys are lost in the process. Moreover, the final test for this project takes my_dataset, a Python dictionary, as input. If the outliers are not removed from my_dataset, the test will not reflect the cleaning effort. Therefore, it is necessary to convert the numpy array back into a Python dictionary.
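The idea of recovering names from bare rows can be sketched on toy data. This is a simplified illustration with hypothetical names (`rows_to_dicts`, `match_names`, `toy_dataset`), not the functions defined in this notebook. It compares values as floats and treats "NaN" as 0.

```python
def rows_to_dicts(rows, feature_names):
    """Pair each row's values with the feature names, in order."""
    return [dict(zip(feature_names, row)) for row in rows]


def match_names(row_dicts, dataset, feature_names):
    """Return the dataset entries whose values equal one of the rows."""
    def as_number(value):
        return 0.0 if value == "NaN" else float(value)

    matched = {}
    for name, record in dataset.items():
        original = [as_number(record[f]) for f in feature_names]
        for row in row_dicts:
            if original == [float(row[f]) for f in feature_names]:
                matched[name] = record
                break
    return matched


feature_names = ["poi", "salary", "bonus"]
toy_dataset = {
    "PERSON A": {"poi": False, "salary": 100000, "bonus": "NaN"},
    "PERSON B": {"poi": True, "salary": "NaN", "bonus": 70000},
}
rows = [[1.0, 0.0, 70000.0]]  # one formatted row without its name

row_dicts = rows_to_dicts(rows, feature_names)
found = match_names(row_dicts, toy_dataset, feature_names)
```

Matching on values has a known weakness: two people with identical feature values cannot be told apart.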


In [None]:
def featureReformat(numpy_array, features):
    """
        Format a numpy array object into a python
        dictionary object.
        
        Take a numpy array and features as inputs and
        return a python dictionary using features as
        keys and numpy array as values.
    """
    
    result = []
    
    for array in numpy_array:
        data_point = {}
        for i in range(len(features)):
            value = array[i]
            key = features[i]
            data_point[key] = value
        result.append(data_point)

    return result
            

In [None]:
def personMapping(dict_list, dataset):
    """
        Mapping a person's name based on the values of
        features.
        
        Take a list of dictionaries that has all the values
        of person's features, and map it with a dataset
        which has a person's name as a key, and its features
        and values as the key's item.
        
        Return a dictionary with a person's name as its key,
        and another dictionary as its value, which has features
        as its key, and values of features as its values,
        {name_of_person_1:
            {feature_1: value,
             feature_2: value,
             feature_3: value,
             ...},
         name_of_person_2:
             {...}}
    """
    
    my_dataset = {}
    
    ### iter through the dataset
    for key, item in dataset.iteritems():
        
        ### open the dictionary list
        for data in dict_list:
            
            ### open the features list
            for feature in features_list:
                
                ### filter out 'NaN' in the dataset
                ### check all the '0' values
                if item[feature] == "NaN":
                    if int(data[feature]) == 0:
                        find = True
                    else:
                        find = False
                        break
                
                else:
                    ### check every other feature between dictionary list and dataset
                    ### using a logical value 'find' to determine if a match is found
                    if int(data[feature]) == item[feature]:
                        find = True
                    else:
                        find = False
                        break
            
            ### iter through all features once
            ### if found, map the data to my_dataset
            if find:
                my_dataset[key] = item
                
    return my_dataset

## Summary of Outliers

There are 15 people identified as outliers, of whom only 2 are not persons of interest. Given that there are 18 persons of interest and 14 of them appear among the outliers, removing these outliers might cause problems.

In [None]:
outliers_dataset = personMapping(featureReformat(outliers, features_list), data_dict)

In [None]:
len(outliers_dataset)

In [None]:
for key, item in outliers_dataset.iteritems():
    if item['poi'] == 0.0:
        print key

## Create New Datasets

As mentioned above, simply removing the outliers might cause problems for later analysis. While the improvement in the linear model's score is tempting, note that this is not the model we will use for machine learning on this dataset. However, since we do not want to miss any possible improvement in our future models, we will use both datasets in the later analysis.

In [None]:
my_full_dataset = data_dict
my_cleaned_dataset = personMapping(featureReformat(cleaned_data, features_list), data_dict)

In [None]:
def featureLabelSplit(my_dataset, features_list):
    """
        A simple function creates features and labels
        
        Return features and labels
    """
    data = featureFormat(my_dataset, features_list, sort_keys = True)
    labels, features = targetFeatureSplit(data)
    return features, labels

On the other hand, it is not surprising that most persons of interest are flagged as outliers, given the background of the Enron fraud. In this case, the outliers are the very targets we want to find. According to <a href='https://discussions.udacity.com/t/outlier-removal/7446' target='_blank'>this post in the discussion forum</a>, we can manually decide whether to include or exclude the outliers in the training set. This strategy will be applied when processing the dataset.

# Part Three - Optimize Feature Selection

## Create New Features

To dig out more patterns from the dataset, three new features, "stock_salary_ratio", "poi_from_ratio", and "poi_to_ratio", are created as follows.

### Feature stock_salary_ratio

stock_salary_ratio is total_stock_value divided by salary. This feature rests on the assumption that a person of interest usually has an unusually large stock value, since such holdings are kept under the table, while salary information is more easily known to the public; the ratio could therefore help identify a POI. The larger the ratio, the more likely the person is a POI.

### Feature poi_from_ratio and poi_to_ratio
poi_from_ratio is from_poi_to_this_person divided by from_messages. This feature assumes that a POI tends to have more contact with other POIs, so the ratio would be larger. The same applies to poi_to_ratio, which is from_this_person_to_poi divided by to_messages.
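A small helper can make the ratio features easier to read. The sketch below is hypothetical and not part of the notebook: `safe_ratio` returns "NaN" when either input is missing, and it also guards against a zero denominator, which is an added assumption. The person record is invented.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or "NaN" when it cannot be computed."""
    if numerator == "NaN" or denominator == "NaN" or denominator == 0:
        return "NaN"
    return float(numerator) / denominator


# Invented record for illustration only.
toy_person = {
    "total_stock_value": 600000,
    "salary": 200000,
    "from_poi_to_this_person": 10,
    "from_messages": 0,
}

stock_salary_ratio = safe_ratio(toy_person["total_stock_value"],
                                toy_person["salary"])
poi_from_ratio = safe_ratio(toy_person["from_poi_to_this_person"],
                            toy_person["from_messages"])
```

Returning "NaN" keeps the new features in the same convention as the rest of the dataset, so they are zero-filled along with the other missing values.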

In [None]:
### add new features to dataset
for key, item in data_dict.iteritems():
    ### add stock_salary_ratio
    if item['salary'] != "NaN" and item['total_stock_value'] != "NaN":
        item['stock_salary_ratio'] = float(item['total_stock_value']) / item['salary']
    else:
        item['stock_salary_ratio'] = "NaN"
    
    ### add poi_from_ratio
    if item['from_messages'] != "NaN" and item['from_poi_to_this_person'] != "NaN":
        item['poi_from_ratio'] = float(item['from_poi_to_this_person']) / item['from_messages']
    else:
        item['poi_from_ratio'] = "NaN"
        
    ### add poi_to_ratio
    if item["to_messages"] != "NaN" and item["from_this_person_to_poi"] != "NaN":
        item["poi_to_ratio"] = float(item["from_this_person_to_poi"]) / item["to_messages"]
    else:
        item["poi_to_ratio"] = "NaN"

In [None]:
### update dataset
my_full_dataset = data_dict
my_cleaned_dataset = personMapping(featureReformat(cleaned_data, features_list), data_dict)

In [None]:
### update features_list
features_list += ["stock_salary_ratio", "poi_from_ratio", "poi_to_ratio"]

## Feature Scaling

Depending on the algorithms chosen, feature scaling may be necessary. We will perform feature scaling anyway in case it is needed for later algorithms.
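Min-max scaling maps each feature to the range between 0 and 1 using the minimum and maximum seen when the scaler is fitted. The plain-Python sketch below illustrates the formula `(x - min) / (max - min)` for one column. It is not sklearn's implementation, and the helper names and values are hypothetical.

```python
def fit_min_max(column):
    """Learn the scaling bounds from one column of values."""
    return min(column), max(column)


def apply_min_max(column, low, high):
    """Rescale values with previously learned bounds."""
    span = high - low
    if span == 0:
        # A constant column carries no spread; map it to zeros in this sketch.
        return [0.0 for _ in column]
    return [(value - low) / float(span) for value in column]


train_salary = [0.0, 50000.0, 200000.0]
test_salary = [100000.0, 300000.0]

low, high = fit_min_max(train_salary)
scaled_train = apply_min_max(train_salary, low, high)
scaled_test = apply_min_max(test_salary, low, high)
```

When the bounds learned on one set are applied to another, values outside the fitted range fall outside [0, 1], as the last test value does here.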

In [None]:
### Extract features and labels from dataset for local testing
data = featureFormat(data_dict, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [None]:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
features = min_max_scaler.fit_transform(features)

## Feature Selection

To be thorough before fitting any models, a variety of feature selection methods for classification listed in the 
<a href='http://scikit-learn.org/stable/modules/feature_selection.html' target='_blank'>sklearn documentation</a> 
are explored.

### K-Best

In [None]:
from sklearn.feature_selection import SelectKBest
k_best = SelectKBest(k = 5)
k_best_features = k_best.fit_transform(features, labels)

In [None]:
k_best_features.shape

In [None]:
k_best_result = zip(features_list[1:], k_best.scores_, k_best.get_support())
k_best_result.sort(key=lambda value:value[1], reverse=True)

### LinearSVC

In [None]:
from sklearn.svm import LinearSVC
svc = LinearSVC(penalty="l1", dual=False, random_state=31)
svc_features = svc.fit_transform(features, labels)

In [None]:
svc_features.shape

In [None]:
svc_result = zip(features_list[1:], svc.coef_[0])
svc_result.sort(key=lambda value:value[1], reverse=True)

### Randomized Logistic Regression

In [None]:
from sklearn.linear_model import RandomizedLogisticRegression
randomized_logistic = RandomizedLogisticRegression(C=1, selection_threshold=0.01, random_state=31)
randomized_features = randomized_logistic.fit_transform(features, labels)

In [None]:
randomized_features.shape

In [None]:
randomized_result = zip(features_list[1:], randomized_logistic.scores_, randomized_logistic.get_support())
randomized_result.sort(key=lambda value:value[1], reverse=True)

### Extra Tree Classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
tree = ExtraTreesClassifier(max_features=5, random_state=31)
tree_features = tree.fit_transform(features, labels)

In [None]:
tree_features.shape

In [None]:
tree_result = zip(features_list[1:], tree.feature_importances_)
tree_result.sort(key=lambda value: value[1], reverse=True)

### PCA

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca_features = pca.fit_transform(features, labels)

In [None]:
pca_result = pca.explained_variance_ratio_

### Comparison Among Feature Selection Methods

In [None]:
k_best_result

In [None]:
svc_result

In [None]:
randomized_result

In [None]:
tree_result

In [None]:
pca_result

### Finalize Feature Selection and More

Up to this point, no models have been fitted, so no results are available to determine which feature selection method works best. To find the best method, all four feature selection algorithms will be tested, as will a combination of the four.

Also keep in mind that the above analysis was done on the whole dataset, not on the cleaned dataset without outliers. Therefore, to automate the pipeline later in the report, a list of feature selection functions and a function to combine all the features are created.

In [None]:
feature_selection = [('k_best', SelectKBest(k = 5)),
                     ('linear_svc_l1', LinearSVC(C=0.01, penalty="l1", dual=False, random_state=31)),
                     ('logistic_reg', RandomizedLogisticRegression(C=1, selection_threshold=0.01, random_state=31)),
                     ('extra_tree', ExtraTreesClassifier(max_features=5, random_state=31))]

In [None]:
from sklearn.pipeline import FeatureUnion, Pipeline

### combine pca to feature selection
combined_feature = []
for method in feature_selection:
    new_method = FeatureUnion([('pca', PCA(n_components=5)), method])
    name = method[0] + " with pca"
    combined_feature.append((name, new_method))

In [None]:
### update feature selection list
feature_selection += combined_feature

Although four different feature selection methods are chosen, their parameters could be tuned further, and a better result might be found; this is left for future work. On the other hand, combining the feature selection results is a rather conservative approach, and it is doubtful whether it will work. In any case, we will let the scores speak for themselves.

# Part Four - Pick and Tune an Algorithm

According to 
<a href='http://scikit-learn.org/stable/tutorial/machine_learning_map/' target='_blank'>this cheat sheet in sklearn</a>, 
there are at least four classification methods that can be used:
- LinearSVC
- KNeighbors Classifier
- SVC
- Ensemble Classifiers

In this report, we will check on LinearSVC, KNeighborsClassifier, and AdaBoostClassifier.

## LinearSVC

Parameters to be tuned:
- C: float, default = 1.0
- loss: "hinge" or "squared_hinge", default = "squared_hinge"
- penalty: "l1" or "l2", default = "l2"
- dual: boolean, prefer False when n_samples > n_features, default = True
- tol: float, default = 0.0001
- max_iter: integer, default = 1000

In [None]:
linear_svc = LinearSVC(random_state=31, dual=False)

params_svc = {'linear_svc__C':[1e-2, 1e-1, 1, 1e2, 1e3],
              'linear_svc__penalty': ["l1", "l2"],
              'linear_svc__tol': [1e-4, 1e-3, 1e-2, 1],
              'linear_svc__max_iter': [100, 1000, 10000, 100000]}

## KNeighborsClassifier

Parameters to be tuned:
- n_neighbors: integer, default = 5
- weights: "uniform" or "distance", default = "uniform"
- algorithm: "auto", "ball_tree", "kd_tree" or "brute", default = "auto"
- leaf_size: integer, default = 30
- p, integer, 1 for manhattan_distance, 2 for euclidean_distance, default = 2

In [None]:
from sklearn.neighbors import KNeighborsClassifier

k_neighbors = KNeighborsClassifier()

params_kneighbors = {'k_neighbors__n_neighbors': [1, 5, 10, 20, 50],
                     'k_neighbors__weights': ["uniform", "distance"],
                     'k_neighbors__algorithm':["ball_tree", "kd_tree", "brute"],
                     'k_neighbors__leaf_size': [2, 5, 10, 30, 50, 100],
                     'k_neighbors__p': [1,2]}

## AdaBoostClassifier

Parameters to be tuned:
- n_estimators: integer, default = 50
- learning_rate: float, default = 1
- algorithm: "SAMME" or "SAMME.R", default = "SAMME.R"


In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_boost = AdaBoostClassifier(random_state=31)

params_adaboost = {'ada_boost__base_estimator': [DecisionTreeClassifier(), None],
                   'ada_boost__n_estimators': [1, 5, 10, 20, 50, 100],
                   'ada_boost__algorithm': ['SAMME', 'SAMME.R'],
                   'ada_boost__learning_rate': [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10],
                   }

In [None]:
classifiers = [('linear_svc', linear_svc, params_svc),
               ('k_neighbors', k_neighbors, params_kneighbors),
               ('ada_boost', ada_boost, params_adaboost)]
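The size of a parameter grid grows multiplicatively, which explains why some searches take much longer than others. The sketch below expands the k-neighbors grid with `itertools.product` and shows how the `step__parameter` naming convention of sklearn pipelines splits into its two parts. The helper names are hypothetical, and the code only counts candidates without fitting anything.

```python
from itertools import product

# Same values as params_kneighbors, copied here so the sketch is self-contained.
toy_grid = {
    "k_neighbors__n_neighbors": [1, 5, 10, 20, 50],
    "k_neighbors__weights": ["uniform", "distance"],
    "k_neighbors__algorithm": ["ball_tree", "kd_tree", "brute"],
    "k_neighbors__leaf_size": [2, 5, 10, 30, 50, 100],
    "k_neighbors__p": [1, 2],
}


def expand_grid(grid):
    """List every parameter combination contained in the grid."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def split_param_name(name):
    """Split 'step__parameter' into the pipeline step and the parameter."""
    step, _, parameter = name.partition("__")
    return step, parameter


candidates = expand_grid(toy_grid)
step, parameter = split_param_name("k_neighbors__leaf_size")
```

Each candidate is fitted once per cross-validation fold during a grid search, so the total number of fits is the candidate count multiplied by the number of folds.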

# Part Five - Validate and Evaluate

## Cross Validation

To guard against overfitting, the dataset is split into training and testing sets so that models are evaluated on data they were not trained on. We will use the train_test_split method with its default test_size of 0.25.
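A single hold-out split leaves very few persons of interest in the test set, which is worth keeping in mind when reading the scores. The sketch below is a plain-Python stand-in for a shuffled split, not sklearn's implementation. The label counts (18 POIs and 126 others) and the function name are assumptions for illustration.

```python
import random


def holdout_split(items, test_size=0.25, seed=42):
    """Shuffle a copy of the items and cut off a test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical label vector: 1 marks a POI, 0 marks everyone else.
toy_labels = [1] * 18 + [0] * 126
train_labels, test_labels = holdout_split(toy_labels)

pois_in_test = sum(test_labels)
```

With so few positives in the test portion, one additional correct or missed POI changes recall by a large step, so small score differences between models should be read with caution.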

In [None]:
from sklearn.cross_validation import train_test_split

def trainTestSplit(my_dataset, features_list):
    """
        A training and testing set split function.
        
        Take my_dataset and features_list as input, call on
        featureLabelSplit to create features and labels. Then
        use train_test_split to split datasets.
        
        Return training and testing datasets.
    """
    
    features, labels = featureLabelSplit(my_dataset, features_list)
    
    features_train, features_test, labels_train, labels_test = train_test_split(
        features, labels, test_size=0.25, random_state=42)
    
    return features_train, features_test, labels_train, labels_test

## Evaluation

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluateModel(y_true, y_pred):
    """
        A model evaluator.
        Calculate the model's accuracy score, f1 score,
        precision score, and recall score. 
        
        Return nothing. Print out the scores as side effects.
    """
    
    accuracy = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    
    print """    Accuracy score: {}
    F1 score: {}
    Precision score: {}
    Recall score: {}""".format(accuracy, f1, precision, recall)
    

# Part Six - Find the Best Estimators

In [None]:
from sklearn.grid_search import GridSearchCV

def tuneEstimator(pipeline, param, features_train, features_test, labels_train):
    """
        Tune the classifiers to find the best estimator.
        
        Return the best estimator, predictions and scores.
    """
    
    clf = GridSearchCV(pipeline, param)
    
    ### train the model
    clf.fit(features_train, labels_train)
                    
    ### store the tuning results
    tuned_scores = clf.grid_scores_
                    
    ### use the best estimator
    best_clf = clf.best_estimator_
    labels_pred = best_clf.predict(features_test)
    
    return best_clf, labels_pred, tuned_scores

In [None]:
def trainModel(my_dataset, features_list, feature_selection=feature_selection, classifiers=classifiers, scaling=False):
    """
        A model training function.
        
        Take a dataset in python dictionary format, a list of
        features, a list of feature selection methods, and a
        list of classification methods. Iter through each list
        and make combinations of different feature selection
        method with different classification method. Then use 
        tuneEstimator to tune the model. Finally, it evaluates
        the model based on accuracy score, precision score,
        recall score, and f1 score.
        
        If scaling is True, the features are min-max scaled
        before the models are trained.
        
        Return a list of models and tuned scores.
    """
    
    ### split the training and testing sets
    features_train, features_test, labels_train, labels_test = trainTestSplit(my_dataset, features_list)
    
    trained_model = []
    count = 0
    tuned_score = []
    
    ### iter through feature selection and classification methods
    for selection_method in feature_selection:
        for item in classifiers:
            
            count += 1
            print "Model {} \n-working on classifier {}, using selection method {}".format(count, item[0], selection_method[0])
            
            ### add a time function to calculate time used by each model
            from time import time
            t0 = time()
            
            ### unpack name, function and parameters
            classifier = item[:2]
            param = item[2]
                
            ### scale the features before training:
            ### fit the scaler on the training set only, then apply it to the test set
            if scaling:
                scaler = MinMaxScaler().fit(features_train)
                features_train = scaler.transform(features_train)
                features_test = scaler.transform(features_test)
                
            try:
                
                ### build pipeline
                pipeline = Pipeline([selection_method, classifier[:2]])
                
                ### tune the model
                try:

                    print "--start tuning..."
                    clf, labels_pred, grid_scores = tuneEstimator(pipeline, param, features_train, features_test, labels_train)
                    
                    ### store the tuning results
                    tuned_score.append(grid_scores)
                    
                    ### store model's information, including name, function, and parameters
                    model_name = item[0] + " with " + selection_method[0]
                    model_info = (model_name, clf)
                    trained_model.append(model_info)

                    print "--training on {} complete, time used {}".format(model_name, time() - t0)

                    ### print out evaluation scores
                    evaluateModel(labels_test, labels_pred)

                    print ""
                
                except Exception, e:
                    print "--error on tuning: \n", e, "\n"
                                
            except Exception, e:
                print "-error on classifying: \n", e, "\n"
            
    return trained_model, tuned_score

## Run Estimators on Full Dataset

When running the estimators on the full dataset, 18 models are generated. The best choices are compared below, listing the model number, feature selection method, classification method, accuracy score, F1 score, precision score, recall score, and time used.
<table>
    <tr>
        <td>Model No.</td>
        <td>Feature Selection Method</td>
        <td>Classification Method</td>
        <td>Accuracy Score</td>
        <td>F1 Score</td>
        <td>Precision Score</td>
        <td>Recall Score</td>
        <td>Time Used (s)</td>
    </tr>
    <tr>
        <td>1</td>
        <td>k_best</td>
        <td>linear_svc</td>
        <td>0.94</td>
        <td>0.5</td>
        <td>1.0</td>
        <td>0.33</td>
        <td>1.64</td>
    </tr>
    <tr>
        <td>8</td>
        <td>logistic_reg</td>
        <td>k_neighbors</td>
        <td>0.94</td>
        <td>0.5</td>
        <td>1.0</td>
        <td>0.33</td>
        <td>115.33</td>
    </tr>
    <tr>
        <td>11</td>
        <td>extra_tree</td>
        <td>k_neighbors</td>
        <td>0.94</td>
        <td>0.5</td>
        <td>1.0</td>
        <td>0.33</td>
        <td>22.03</td>
    </tr>
    <tr>
        <td>14</td>
        <td>pca</td>
        <td>k_neighbors</td>
        <td>0.94</td>
        <td>0.5</td>
        <td>1.0</td>
        <td>0.33</td>
        <td>3.59</td>
    </tr>
</table>

As shown in the table, all four estimators achieve the same scores, while their time consumption varies. Based on these results, LinearSVC with SelectKBest performs best, as it takes only 1.64 seconds; KNeighborsClassifier with PCA comes second. These two models will be used as candidates for the final model.
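The four reported scores are linked through the confusion counts, which can be checked by hand. The sketch below uses hypothetical counts chosen to be consistent with the table (one POI found, two missed, no false alarms, 33 non-POIs). The real counts are not given in this report, and the function name is an assumption.

```python
def classification_scores(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts that reproduce the rounded scores in the table.
accuracy, precision, recall, f1 = classification_scores(tp=1, fp=0, fn=2, tn=33)
```

These counts give an accuracy of about 0.94 even though two of three POIs are missed, which shows why recall and F1 are more informative than accuracy on such an imbalanced dataset.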

In [None]:
### scaled results
full_model_sets, tuned_score_1 = trainModel(my_full_dataset, features_list, scaling=True)

In [None]:
### non-scaled results
#full_model_sets, tuned_score_1 = trainModel(my_full_dataset, features_list)

### Further Improvement Among the Best Estimators
Although the classifier parameters have been tuned, the parameters of the feature selection methods could be tuned as well to find a better outcome. We will tune them here and see whether anything better can be found.

In [None]:
pipeline_kbest_linearsvc = full_model_sets[0][1]
pipeline_pca_kneighbors = full_model_sets[13][1]

In [None]:
param_k_best = {'k_best__k': [1,2,3,4,5,6,7,8,9,10,11,12]}

param_pca = {'pca__n_components': [1,2,3,4,5,6,7,8,9,10,11,12]}

In [None]:
features_train, features_test, labels_train, labels_test = trainTestSplit(my_full_dataset, features_list)

In [None]:
kbest_linearsvc = GridSearchCV(pipeline_kbest_linearsvc, param_k_best)
kbest_linearsvc.fit(features_train, labels_train)
labels_pred = kbest_linearsvc.predict(features_test)
evaluateModel(labels_test, labels_pred)

In [None]:
pca_kneighbors = GridSearchCV(pipeline_pca_kneighbors, param_pca)
pca_kneighbors.fit(features_train, labels_train)
labels_pred = pca_kneighbors.predict(features_test)
evaluateModel(labels_test, labels_pred)

It turned out that no better solution was found for the full dataset. The final estimator for the full dataset is shown below.

In [None]:
pipeline_kbest_linearsvc

## Run Estimators on Cleaned Dataset

Similar to running the estimators on the full dataset, an attempt is made to generate 18 models. However, as shown below, using combined_feature as the feature selection method kept raising an error about missing classes; further investigation is needed to figure this out. For now, a summary table of potential choices for the final model is given below.

<table>
    <tr>
        <td>Model No.</td>
        <td>Feature Selection Method</td>
        <td>Classification Method</td>
        <td>Accuracy Score</td>
        <td>F1 Score</td>
        <td>Precision Score</td>
        <td>Recall Score</td>
        <td>Time Used (s)</td>
    </tr>
    <tr>
        <td>3</td>
        <td>k_best</td>
        <td>ada_boost</td>
        <td>0.97</td>
        <td>0.67</td>
        <td>1.0</td>
        <td>0.5</td>
        <td>2.52</td>
    </tr>
    <tr>
        <td>10</td>
        <td>extra_tree</td>
        <td>linear_svc</td>
        <td>0.97</td>
        <td>0.67</td>
        <td>1.0</td>
        <td>0.5</td>
        <td>6.38</td>
    </tr>
    <tr>
        <td>12</td>
        <td>extra_tree</td>
        <td>ada_boost</td>
        <td>0.97</td>
        <td>0.67</td>
        <td>1.0</td>
        <td>0.5</td>
        <td>10.56</td>
    </tr>
    <tr>
        <td>13</td>
        <td>pca</td>
        <td>linear_svc</td>
        <td>0.97</td>
        <td>0.67</td>
        <td>1.0</td>
        <td>0.5</td>
        <td>1.25</td>
    </tr>
</table>

Interestingly, performance is generally much better when running on the cleaned dataset. For the final analysis, models 3 and 13 are chosen as candidates.

In [None]:
### scaled results
cleaned_model_sets, tuned_score_2 = trainModel(my_cleaned_dataset, features_list, scaling=True)

In [None]:
### non-scaled results
cleaned_model_sets, tuned_score_2 = trainModel(my_cleaned_dataset, features_list)

In [None]:
pipeline_kbest_adaboost = cleaned_model_sets[2][1]
pipeline_pca_linearsvc = cleaned_model_sets[9][1]

In [None]:
features_train, features_test, labels_train, labels_test = trainTestSplit(my_cleaned_dataset, features_list)

In [None]:
kbest_adaboost = GridSearchCV(pipeline_kbest_adaboost, param_k_best)
kbest_adaboost.fit(features_train, labels_train)
clf = kbest_adaboost.best_estimator_
labels_pred = clf.predict(features_test)
evaluateModel(labels_test, labels_pred)

In [None]:
pca_linearsvc = GridSearchCV(pipeline_pca_linearsvc, param_pca)
pca_linearsvc.fit(features_train, labels_train)
clf = pca_linearsvc.best_estimator_
labels_pred = clf.predict(features_test)
evaluateModel(labels_test, labels_pred)

It turned out that the tuned results became even worse. We will use LinearSVC with PCA as the final estimator for the cleaned dataset. The parameters are as follows:

In [None]:
pipeline_pca_linearsvc

# Part Seven - The Final Solution
After comparing feature selection methods and classification methods, carefully tuning their parameters, and working on both the full dataset and the outlier-cleaned dataset, the best and fastest model turned out to be PCA as the feature selection step and LinearSVC as the classifier, trained on the outlier-cleaned dataset. The parameters are as follows:
- PCA, n_components = 5.
- LinearSVC, C = 0.1, dual = False, penalty = 'l1', tol = 1, max_iter = 100.


In [None]:
### prepare for the test
clf = pipeline_pca_linearsvc
my_dataset = my_cleaned_dataset

In [None]:
### find the explained variance ratio by PCA
pca = clf.steps[0][1]
pca.explained_variance_ratio_

In [None]:
### dump for testing
dump_classifier_and_data(clf, my_dataset, features_list)

# Limitations

- A different number of parameters was tuned for each algorithm, so the comparison among the classifiers is not fully even.