In [1]:
"""
This project has two parts. In the first part, you will run a regression and identify the 10% of points that have 
the largest residual errors. Then you'll remove those outliers from the dataset and refit the regression, just like the strategy 
that Sebastian suggested in the lesson videos.

In the second part, you will get acquainted with some of the outliers in the Enron finance data, and learn if/how to remove them.

Sebastian described to us an algorithm for improving a regression, which you will implement in this project. You will work through
it in the next few quizzes. To summarize, what you'll do is: (1) fit the regression on all training points, (2) discard the 10% of 
points that have the largest errors between the actual y values and the regression-predicted y values, and (3) refit on the 
remaining points.

Start by running the starter code (outliers/outlier_removal_regression.py) and visualizing the points. A few outliers should 
clearly pop out. Deploy a linear regression, where net worth is the target and the feature being used to predict it is a person's 
age (remember to train on the training data!).

The "correct" slope for the main body of data points is 6.25 (we know this because we used this value to generate the data); 
what slope does your regression have?
"""

"""
outlier_cleaner.py
"""
#!/usr/bin/python

def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where 
        each tuple is of the form (age, net_worth, error).
    """
    
    cleaned_data = []

    ### your code goes here

    
    return cleaned_data
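
The fit, discard, refit procedure summarized in the project introduction can be sketched on synthetic data. This is only an illustration under stated assumptions: the data are generated here (they are not the practice pickle files), the ten oldest points are zeroed out to act as outliers, and numpy.polyfit stands in for the sklearn regression used in the rest of the project.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data (an assumption for illustration): true slope 6.25 plus noise.
ages = np.linspace(20, 65, 100)
net_worths = 6.25 * ages + rng.normal(0, 10, size=100)
# Corrupt the ten oldest points so they act as outliers.
net_worths[-10:] = 0.0


def fit_line(x, y):
    """Least-squares slope and intercept of y against x."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept


# Step 1: fit on all points.
slope_before, intercept_before = fit_line(ages, net_worths)

# Step 2: discard the 10% of points with the largest absolute residuals.
residuals = np.abs(net_worths - (slope_before * ages + intercept_before))
n_keep = int(round(len(ages) * 0.9))
keep = np.argsort(residuals)[:n_keep]

# Step 3: refit on the remaining points.
slope_after, intercept_after = fit_line(ages[keep], net_worths[keep])

print("slope before cleaning: %.2f" % slope_before)
print("slope after cleaning:  %.2f" % slope_after)
```

In this toy setup the outliers drag the first fit well below the generating slope, and the refit on the kept points lands much closer to 6.25.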

In [2]:
"""
outlier_removal_regression.py
"""

#!/usr/bin/python

import random
import numpy
import matplotlib.pyplot as plt
import pickle

#from outlier_cleaner import outlierCleaner

### load up some practice data with outliers in it
ages = pickle.load( open("practice_outliers_ages.pkl", "r") )
net_worths = pickle.load( open("practice_outliers_net_worths.pkl", "r") )

### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))
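try:
    from sklearn.model_selection import train_test_split   # scikit-learn 0.18 and later
except ImportError:
    from sklearn.cross_validation import train_test_split  # older scikit-learn releases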
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)


### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like


try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()


### identify and remove the most outlier-y points
cleaned_data = []
try:
    predictions = reg.predict(ages_train)
    cleaned_data = outlierCleaner( predictions, ages_train, net_worths_train )
except NameError:
    print "your regression object doesn't exist, or isn't named reg"
    print "can't make predictions to use in identifying outliers"







### only run this code if cleaned_data is returning data
if len(cleaned_data) > 0:
    ages, net_worths, errors = zip(*cleaned_data)
    ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    ### refit your cleaned data!
    try:
        reg.fit(ages, net_worths)
        plt.plot(ages, reg.predict(ages), color="blue")
    except NameError:
        print "you don't seem to have regression imported/created,"
        print "   or else your regression object isn't named reg"
        print "   either way, only draw the scatter plot of the cleaned data"
    plt.scatter(ages, net_worths)
    plt.xlabel("ages")
    plt.ylabel("net worths")
    plt.show()

else:
    print "outlierCleaner() is returning an empty list, no refitting to be done"



your regression object doesn't exist, or isn't named reg
can't make predictions to use in identifying outliers
outlierCleaner() is returning an empty list, no refitting to be done


In [3]:
"""
what slope does your regression have?
"""
# ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

print 'Slope: ', reg.coef_[0]
print 'Intercept: ', reg.intercept_


Slope:  [ 5.07793064]
Intercept:  [ 25.21002327]


In [4]:
"""
What is the score you get when using your regression to make predictions with the test data?
"""
print 'r-squared score: ', reg.score(ages_test, net_worths_test)

r-squared score:  0.878262478835
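
The score printed above is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A minimal sketch of that formula with plain numpy follows; the helper name r_squared and the numbers are made up for illustration and are not part of the starter code.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Made-up numbers, purely for illustration.
actual = [100.0, 150.0, 200.0, 250.0]
perfect = [100.0, 150.0, 200.0, 250.0]
mean_only = [175.0, 175.0, 175.0, 175.0]

print(r_squared(actual, perfect))    # perfect predictions
print(r_squared(actual, mean_only))  # always predicting the mean
```

Perfect predictions give 1.0 and always predicting the mean gives 0.0, so a test score near 0.88 means the fit explains most, but not all, of the variation in the test targets.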


In [59]:
"""
In outliers/outlier_cleaner.py, you will find the skeleton for a function called outlierCleaner() that you will fill in with a 
cleaning algorithm. It takes three arguments: predictions is a list of predicted targets that come from your regression, ages is 
the list of ages in the training set, and net_worths is the actual value of the net worths in the training set. There should be 
90 elements in each of these lists (because the training set has 90 points in it). Your job is to return a list called 
cleaned_data that has only 81 elements in it, which are the 81 training points where the predictions and the actual values 
(net_worths) have the smallest errors (90 * 0.9 = 81). The format of cleaned_data should be a list of tuples, where each tuple 
has the form (age, net_worth, error). 

Once this cleaning function is working, you should see the regression result change. What is the new slope? Is it closer to the 
"correct" result of 6.25?
"""
def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where 
        each tuple is of the form (age, net_worth, error).
    """
    cleaned_data = []
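    # signed residuals, and their absolute values used for ranking
    error = predictions - net_worths
    diff = numpy.absolute(error)
    elem = []
    for d, a, n, e in zip(diff, ages, net_worths, error):
        # the inputs are (n_points, 1) arrays, so take the single column entry
        elem.append((d[0], a[0], n[0], e[0]))

    # number of points to drop: 10% of the training set
    maxlen = len(ages)*0.1

    # sort by absolute error, largest first, and skip the first maxlen points
    for i, k in enumerate(sorted(elem, reverse=True)):
        if i >= maxlen:
            d, a, n, e = k
            cleaned_data.append((a, n, e))
    print 'returning %d elements' % len(cleaned_data)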
    return cleaned_data

### identify and remove the most outlier-y points
cleaned_data = []
try:
    predictions = reg.predict(ages_train)
    cleaned_data = outlierCleaner( predictions, ages_train, net_worths_train )
except NameError:
    print "your regression object doesn't exist, or isn't named reg"
    print "can't make predictions to use in identifying outliers"

### only run this code if cleaned_data is returning data
if len(cleaned_data) > 0:
    ages, net_worths, errors = zip(*cleaned_data)
    ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    ### refit your cleaned data!
    try:
        reg.fit(ages, net_worths)
        plt.plot(ages, reg.predict(ages), color="blue")
    except NameError:
        print "you don't seem to have regression imported/created,"
        print "   or else your regression object isn't named reg"
        print "   either way, only draw the scatter plot of the cleaned data"
    plt.scatter(ages, net_worths)
    plt.xlabel("ages")
    plt.ylabel("net worths")
    plt.show()

else:
    print "outlierCleaner() is returning an empty list, no refitting to be done"

print 'Slope: ', reg.coef_[0]
print 'Intercept: ', reg.intercept_

returning 81 elements
Slope:  [ 6.36859481]
Intercept:  [-6.91861159]
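
The same cleaning step can also be written without an explicit sort over tuples. The sketch below is a hypothetical variant, not the solution used above: the name outlier_cleaner_argsort and the fraction parameter are assumptions, the toy inputs are made up, and it returns points ordered from smallest to largest error rather than the reverse.

```python
import numpy as np


def outlier_cleaner_argsort(predictions, ages, net_worths, fraction=0.1):
    """Hypothetical variant of outlierCleaner built on numpy.argsort."""
    predictions = np.asarray(predictions, dtype=float).ravel()
    ages = np.asarray(ages, dtype=float).ravel()
    net_worths = np.asarray(net_worths, dtype=float).ravel()

    errors = predictions - net_worths
    # Keep the (1 - fraction) share of points with the smallest absolute error.
    n_keep = int(round(len(ages) * (1 - fraction)))
    order = np.argsort(np.abs(errors))[:n_keep]
    return [(ages[i], net_worths[i], errors[i]) for i in order]


# Toy inputs (made up): the third point is far from its prediction.
preds = np.arange(10.0, 110.0, 10.0).reshape(-1, 1)
offsets = np.array([1.0, -2.0, 50.0, 0.5, -1.5, 2.5, -0.5, 1.5, -3.0, 0.0])
worths = preds + offsets.reshape(-1, 1)
toy_ages = np.arange(20, 30).reshape(-1, 1)

cleaned = outlier_cleaner_argsort(preds, toy_ages, worths)
print(len(cleaned))
```

With ten toy points and fraction=0.1, one point is dropped, and it is the one whose net worth sits 50 units away from its prediction.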


In [60]:
"""
What's the new score when you use the regression to make predictions on the test set?
"""
print 'r-squared score: ', reg.score(ages_test, net_worths_test)

r-squared score:  0.983189455686


In [61]:
"""
In the mini-project for the regressions lesson, you used a regression to predict the bonuses for Enron employees. As you saw, 
even a single outlier can make a big difference on the regression result. There was something we didn't tell you, though, which 
was that the dataset we had you use in that project had already been cleaned of some significant outliers. Identifying and 
cleaning away outliers is something you should always think about when looking at a dataset for the first time, and now you'll 
get some hands-on experience with the Enron data.

You can find the starter code in outliers/enron_outliers.py, which reads in the data (in dictionary form) and converts it into a 
sklearn-ready numpy array. Since there are two features being extracted from the dictionary ("salary" and "bonus"), the resulting 
numpy array will be of dimension N x 2, where N is the number of data points and 2 is the number of features. This is perfect input 
for a scatterplot; we'll use the matplotlib.pyplot module to make that plot. (We've been using pyplot for all the visualizations 
in this course.) Add these lines to the bottom of the script to make your scatterplot:

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()

As you can see, visualization is one of the most powerful tools for finding outliers!
"""

"""
feature_format.py
"""
#!/usr/bin/python

""" 
    A general tool for converting data from the
    dictionary format to an (n x k) python list that's 
    ready for training an sklearn algorithm

    n--no. of key-value pairs in dictionary
    k--no. of features being extracted

    dictionary keys are names of persons in dataset
    dictionary values are dictionaries, where each
        key-value pair in the dict is the name
        of a feature, and its value for that person

    In addition to converting a dictionary to a numpy 
    array, you may want to separate the labels from the
    features--this is what targetFeatureSplit is for

    so, if you want to have the poi label as the target,
    and the features you want to use are the person's
    salary and bonus, here's what you would do:

    feature_list = ["poi", "salary", "bonus"] 
    data_array = featureFormat( data_dictionary, feature_list )
    label, features = targetFeatureSplit(data_array)

    the line above (targetFeatureSplit) assumes that the
    label is the _first_ item in feature_list--very important
    that poi is listed first!
"""
import numpy as np

def featureFormat( dictionary, features, remove_NaN=True, remove_all_zeroes=True, remove_any_zeroes=False, sort_keys = False):
    """ convert dictionary to numpy array of features
        remove_NaN = True will convert "NaN" string to 0.0
        remove_all_zeroes = True will omit any data points for which
            all the features you seek are 0.0
        remove_any_zeroes = True will omit any data points for which
            any of the features you seek are 0.0
        sort_keys = True sorts keys by alphabetical order. Setting the value as
            a string opens the corresponding pickle file with a preset key
            order (this is used for Python 3 compatibility, and sort_keys
            should be left as False for the course mini-projects).
        NOTE: first feature is assumed to be 'poi' and is not checked for
            removal for zero or missing values.
    """

    return_list = []

    # Key order - first branch is for Python 3 compatibility on mini-projects,
    # second branch is for compatibility on final project.
    if isinstance(sort_keys, str):
        import pickle
        keys = pickle.load(open(sort_keys, "rb"))
    elif sort_keys:
        keys = sorted(dictionary.keys())
    else:
        keys = dictionary.keys()

    for key in keys:
        tmp_list = []
        for feature in features:
            try:
                dictionary[key][feature]
            except KeyError:
                print "error: key ", feature, " not present"
                return
            value = dictionary[key][feature]
            if value=="NaN" and remove_NaN:
                value = 0
            tmp_list.append( float(value) )

        # Logic for deciding whether or not to add the data point.
        append = True
        # exclude 'poi' class as criteria.
        if features[0] == 'poi':
            test_list = tmp_list[1:]
        else:
            test_list = tmp_list
        ### if all features are zero and you want to remove
        ### data points that are all zero, do that here
        if remove_all_zeroes:
            append = False
            for item in test_list:
                if item != 0 and item != "NaN":
                    append = True
                    break
        ### if any features for a given data point are zero
        ### and you want to remove data points with any zeroes,
        ### handle that here
        if remove_any_zeroes:
            if 0 in test_list or "NaN" in test_list:
                append = False
        ### Append the data point if flagged for addition.
        if append:
            return_list.append( np.array(tmp_list) )

    return np.array(return_list)


def targetFeatureSplit( data ):
    """ 
        given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the 
        quantity you want to predict)

        return targets and features as separate lists

        (sklearn can generally handle both lists and numpy arrays as 
        input formats when training/predicting)
    """
    target = []
    features = []
    for item in data:
        target.append( item[0] )
        features.append( item[1:] )

    return target, features
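
To make the default behaviour of featureFormat concrete, here is a simplified stand-in applied to a tiny dictionary. Everything in it is an assumption for illustration: dict_to_array is a hypothetical helper that mirrors only remove_NaN=True and remove_all_zeroes=True, it ignores the special handling of 'poi', it sorts keys just to keep the output deterministic, and the names and numbers are made up.

```python
import numpy as np


def dict_to_array(dictionary, features):
    """Simplified, hypothetical stand-in for featureFormat (default flags only)."""
    rows = []
    for key in sorted(dictionary):  # sorted only to make this sketch deterministic
        values = [dictionary[key][f] for f in features]
        # "NaN" strings become 0.0, mirroring remove_NaN=True
        row = [0.0 if v == "NaN" else float(v) for v in values]
        # skip rows where every feature is zero, mirroring remove_all_zeroes=True
        if any(v != 0 for v in row):
            rows.append(row)
    return np.array(rows)


# Made-up people and numbers, not taken from the Enron data.
toy = {
    "DOE JANE": {"salary": 100000, "bonus": 20000},
    "ROE RICHARD": {"salary": "NaN", "bonus": 5000},
    "POE EDGAR": {"salary": "NaN", "bonus": "NaN"},
}
toy_data = dict_to_array(toy, ["salary", "bonus"])
print(toy_data.shape)
```

The entry with both features missing is dropped, while the entry with only a missing salary is kept with that salary recorded as 0.0. This is worth remembering when reading the scatterplots: a point on an axis may be a missing value rather than a true zero.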

In [63]:
"""
enron_outliers.py
"""
#!/usr/bin/python

import pickle
import sys
import matplotlib.pyplot
#sys.path.append("../tools/")
#from feature_format import featureFormat, targetFeatureSplit

### read in data dictionary, convert to numpy array
data_dict = pickle.load( open("final_project_dataset.pkl", "r") )
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()

In [75]:
"""
There's one outlier that should pop out to you immediately. Now the question is to identify the source. We found the original 
data source to be very helpful for this identification; you can find that PDF in final_project/enron61702insiderpay.pdf. 
What's the name of the dictionary key of this data point? (e.g. if this is Ken Lay, the answer would be "LAY KENNETH L").
"""
maxsalary, maxbonus = numpy.nanmax(data, axis=0)
for name in data_dict:
    if data_dict[name]['salary'] == maxsalary and data_dict[name]['bonus'] == maxbonus:
        print 'Name: ', name 
        print 'Salary: ', maxsalary
        print 'Bonus: ', maxbonus
# Example entry in data_dict (keys are names, values are feature dictionaries):
# 'METTS MARK': {
#                'salary': 365788, 
#                'to_messages': 807, 
#                'deferral_payments': 'NaN', 
#                'total_payments': 1061827, 
#                'exercised_stock_options': 'NaN', 
#                'bonus': 600000, 
#                'restricted_stock': 585062, 
#                'shared_receipt_with_poi': 702, 
#                'restricted_stock_deferred': 'NaN', 
#                'total_stock_value': 585062, 
#                'expenses': 94299, 
#                'loan_advances': 'NaN', 
#                'from_messages': 29, 
#                'other': 1740, 
#                'from_this_person_to_poi': 1, 
#                'poi': False, 
#                'director_fees': 'NaN', 
#                'deferred_income': 'NaN', 
#                'long_term_incentive': 'NaN', 
#                'email_address': 'mark.metts@enron.com', 
#                'from_poi_to_this_person': 38
#               }

Name:  TOTAL
Salary:  26704229.0
Bonus:  97343619.0
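
An alternative way to locate the offending key is to search the dictionary directly instead of matching against the array maxima. The sketch below assumes a hypothetical helper key_of_max and a made-up dictionary in which the "TOTAL" entry is constructed as the sum of the other rows; it is not the approach used in the cell above.

```python
def key_of_max(data_dict, feature):
    """Return the key whose numeric value for `feature` is largest, skipping "NaN"."""
    candidates = [k for k in data_dict if data_dict[k][feature] != "NaN"]
    return max(candidates, key=lambda k: data_dict[k][feature])


# Made-up entries; "TOTAL" is built here as the sum of the other salaries.
toy = {
    "DOE JANE": {"salary": 100000},
    "ROE RICHARD": {"salary": "NaN"},
    "POE EDGAR": {"salary": 250000},
    "TOTAL": {"salary": 100000 + 250000},
}
print(key_of_max(toy, "salary"))
```

Filtering out the "NaN" strings before calling max() keeps the comparison purely numeric, which also makes the sketch behave the same way under Python 2 and Python 3.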


In [76]:
"""
A quick way to remove a key-value pair from a dictionary is the following line: dictionary.pop( key, 0 ). Write a line like this 
(you'll have to modify the dictionary and key names, of course) and remove the outlier before calling featureFormat(). Now rerun 
the code, so your scatterplot doesn't have this outlier anymore. Are all the outliers gone?
"""
data_dict.pop('TOTAL', 0)
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
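
The second argument to pop() is what makes the removal safe to rerun. A short sketch on a made-up dictionary shows the two cases:

```python
# Made-up dictionary, purely for illustration.
toy = {"DOE JANE": 1, "TOTAL": 2}

removed = toy.pop("TOTAL", 0)  # key present: returns its value and deletes the entry
missing = toy.pop("TOTAL", 0)  # key now absent: returns the default, no KeyError

print(removed)
print(missing)
print(sorted(toy))
```

Without the default, the second call would raise a KeyError, which matters in a notebook where the same cell may be executed more than once.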

In [82]:
"""
We would argue that there are 4 more outliers to investigate; let's look at a couple of them. Two people made bonuses of at least 
5 million dollars, and a salary of over 1 million dollars; in other words, they made out like bandits. What are the names 
associated with those points?
"""
for name in data_dict:
    salary = data_dict[name]['salary']
    bonus = data_dict[name]['bonus']
    if salary != 'NaN' and bonus != 'NaN' and (salary > 1000000 and bonus >= 5000000):
        print 'Name: %s :: Salary: %f :: Bonus: %f' % (name, salary, bonus)

Name: LAY KENNETH L :: Salary: 1072321.000000 :: Bonus: 7000000.000000
Name: SKILLING JEFFREY K :: Salary: 1111258.000000 :: Bonus: 5600000.000000
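
The 'NaN' checks in the filter above are doing real work. In CPython 2 a string compares as greater than any number, so 'NaN' > 1000000 evaluates to True and missing values would pass the threshold; in Python 3 the same comparison raises a TypeError. The sketch below is a type-checked version that behaves the same under both; the helper names and the toy data are assumptions for illustration.

```python
def is_number(value):
    """True for ints and floats, False for the "NaN" placeholder string."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def high_earners(data_dict, min_salary, min_bonus):
    """Names with salary over min_salary and bonus of at least min_bonus."""
    names = []
    for name in sorted(data_dict):
        salary = data_dict[name]["salary"]
        bonus = data_dict[name]["bonus"]
        if is_number(salary) and is_number(bonus):
            if salary > min_salary and bonus >= min_bonus:
                names.append(name)
    return names


# Made-up entries, not taken from the Enron data.
toy = {
    "DOE JANE": {"salary": 1200000, "bonus": 5000000},
    "ROE RICHARD": {"salary": "NaN", "bonus": 9000000},
    "POE EDGAR": {"salary": 300000, "bonus": 6000000},
}
print(high_earners(toy, 1000000, 5000000))
```

Only the entry with both a numeric salary over the threshold and a bonus of at least the threshold is returned; the entry with a missing salary is skipped rather than compared.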
