In [1]:
# You can play with k-means clustering yourself here: http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
"""
In this project, we'll apply k-means clustering to our Enron financial data. Our final goal, of course, is to identify persons 
of interest; since we have labeled data, this is not a question that particularly calls for an unsupervised approach like 
k-means clustering.

Nonetheless, you'll get some hands-on practice with k-means in this project, and play around with feature scaling, which will 
give you a sneak preview of the next lesson's material.

The starter code can be found in k_means/k_means_cluster.py, which reads in the email + financial (E+F) dataset and gets us 
ready for clustering. You'll start by performing k-means based on just two financial features--take a look at the code, and 
determine which features the code uses for clustering.

Run the code, which will create a scatterplot of the data. Think a little bit about what clusters you would expect to arise if 
2 clusters are created.
"""
"""
feature_format.py
""" 
#!/usr/bin/python

""" 
    A general tool for converting data from the
    dictionary format to an (n x k) numpy array that's 
    ready for training an sklearn algorithm

    n--no. of key-value pairs in dictionary
    k--no. of features being extracted

    dictionary keys are names of persons in dataset
    dictionary values are dictionaries, where each
        key-value pair in the dict is the name
        of a feature, and its value for that person

    In addition to converting a dictionary to a numpy 
    array, you may want to separate the labels from the
    features--this is what targetFeatureSplit is for

    so, if you want to have the poi label as the target,
    and the features you want to use are the person's
    salary and bonus, here's what you would do:

    feature_list = ["poi", "salary", "bonus"] 
    data_array = featureFormat( data_dictionary, feature_list )
    label, features = targetFeatureSplit(data_array)

    the line above (targetFeatureSplit) assumes that the
    label is the _first_ item in feature_list--very important
    that poi is listed first!
"""
import numpy as np

def featureFormat( dictionary, features, remove_NaN=True, remove_all_zeroes=True, remove_any_zeroes=False, sort_keys = False):
    """ convert dictionary to numpy array of features
        remove_NaN = True will convert "NaN" string to 0.0
        remove_all_zeroes = True will omit any data points for which
            all the features you seek are 0.0
        remove_any_zeroes = True will omit any data points for which
            any of the features you seek are 0.0
        sort_keys = True sorts keys by alphabetical order. Setting the value as
            a string opens the corresponding pickle file with a preset key
            order (this is used for Python 3 compatibility, and sort_keys
            should be left as False for the course mini-projects).
        NOTE: first feature is assumed to be 'poi' and is not checked for
            removal for zero or missing values.
    """
    return_list = []

    # Key order - first branch is for Python 3 compatibility on mini-projects,
    # second branch is for compatibility on final project.
    if isinstance(sort_keys, str):
        import pickle
        keys = pickle.load(open(sort_keys, "rb"))
    elif sort_keys:
        keys = sorted(dictionary.keys())
    else:
        keys = dictionary.keys()

    for key in keys:
        tmp_list = []
        for feature in features:
            try:
                value = dictionary[key][feature]
            except KeyError:
                # single-argument print(...) runs on both Python 2 and 3
                print("error: key {} not present".format(feature))
                return
            if value == "NaN" and remove_NaN:
                value = 0
            tmp_list.append( float(value) )

        # Logic for deciding whether or not to add the data point.
        append = True
        # exclude 'poi' class as criteria.
        if features[0] == 'poi':
            test_list = tmp_list[1:]
        else:
            test_list = tmp_list
        ### if all features are zero and you want to remove
        ### data points that are all zero, do that here
        if remove_all_zeroes:
            append = False
            for item in test_list:
                if item != 0 and item != "NaN":
                    append = True
                    break
        ### if any features for a given data point are zero
        ### and you want to remove data points with any zeroes,
        ### handle that here
        if remove_any_zeroes:
            if 0 in test_list or "NaN" in test_list:
                append = False
        ### Append the data point if flagged for addition.
        if append:
            return_list.append( np.array(tmp_list) )

    return np.array(return_list)


def targetFeatureSplit( data ):
    """ 
        given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the 
        quantity you want to predict)

        return targets and features as separate lists

        (sklearn can generally handle both lists and numpy arrays as 
        input formats when training/predicting)
    """

    target = []
    features = []
    for item in data:
        target.append( item[0] )
        features.append( item[1:] )

    return target, features
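
To make the input and output shapes concrete, here is a minimal sketch on a made-up two-person dictionary. The names and values are hypothetical, and `to_rows` is a simplified stand-in for featureFormat: it only does the "NaN" to 0.0 conversion and skips the zero-filtering options.

```python
# Hypothetical toy data in the same dict-of-dicts layout as the Enron dataset.
toy_dict = {
    "PERSON A": {"poi": True, "salary": 100000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": 50000, "bonus": 20000},
}
feature_list = ["poi", "salary", "bonus"]  # label first, then the features


def to_rows(dictionary, features):
    """Simplified stand-in for featureFormat: "NaN" -> 0.0, all values as floats."""
    rows = []
    for key in sorted(dictionary):
        row = []
        for feature in features:
            value = dictionary[key][feature]
            row.append(0.0 if value == "NaN" else float(value))
        rows.append(row)
    return rows


rows = to_rows(toy_dict, feature_list)
labels = [row[0] for row in rows]      # first column: the poi label
features = [row[1:] for row in rows]   # remaining columns: the features

print(labels)
print(features)
```

Because the label is split off by position, listing "poi" anywhere but first would silently put a financial feature in the label column.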

In [2]:
"""
k_means_clustering.py
"""

#!/usr/bin/python 

""" 
    Skeleton code for k-means clustering mini-project.
"""
import pickle
import numpy
import matplotlib.pyplot as plt
import sys
#sys.path.append("../tools/")
#from feature_format import featureFormat, targetFeatureSplit

def Draw(pred, features, poi, mark_poi=False, name="image.png", f1_name="feature 1", f2_name="feature 2"):
    """ some plotting code designed to help you visualize your clusters """

    ### plot each cluster with a different color--add more colors for
    ### drawing more than five clusters
    colors = ["b", "c", "k", "m", "g"]
    for ii, pp in enumerate(pred):
        plt.scatter(features[ii][0], features[ii][1], color = colors[pred[ii]])

    ### if you like, place red stars over points that are POIs (just for funsies)
    if mark_poi:
        for ii, pp in enumerate(pred):
            if poi[ii]:
                plt.scatter(features[ii][0], features[ii][1], color="r", marker="*")
    plt.xlabel(f1_name)
    plt.ylabel(f2_name)
    plt.savefig(name)
    plt.show()


### load in the dict of dicts containing all the data on each person in the dataset
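### open in binary mode so the pickle loads the same way on every platform
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)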
### there's an outlier--remove it! 
data_dict.pop("TOTAL", 0)


### the input features we want to use 
### can be any key in the person-level dictionary (salary, director_fees, etc.) 
feature_1 = "salary"
feature_2 = "exercised_stock_options"
poi  = "poi"
features_list = [poi, feature_1, feature_2]
data = featureFormat(data_dict, features_list )
poi, finance_features = targetFeatureSplit( data )


### in the "clustering with 3 features" part of the mini-project,
### you'll want to change this line to 
### for f1, f2, _ in finance_features:
### (as it's currently written, the line below assumes 2 features)
for f1, f2 in finance_features:
    plt.scatter( f1, f2 )
plt.show()

### cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred




### rename the "name" parameter when you change the number of features
### so that the figure gets saved to a different file
try:
    Draw(pred, finance_features, poi, mark_poi=False, name="clusters.pdf", f1_name=feature_1, f2_name=feature_2)
except NameError:
    print("no predictions object named pred found, no clusters to plot")


no predictions object named pred found, no clusters to plot


In [3]:
"""
Cluster using K-means
"""
from sklearn.cluster import KMeans
clu = KMeans(n_clusters=2)
# cluster on the features only; `data` still carries the poi label in column 0
pred = clu.fit_predict(finance_features)

### rename the "name" parameter when you change the number of features
### so that the figure gets saved to a different file
try:
    Draw(pred, finance_features, poi, mark_poi=False, name="2clusters.pdf", f1_name=feature_1, f2_name=feature_2)
except NameError:
    print("no predictions object named pred found, no clusters to plot")
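
If you're curious what happens inside the clustering call, here is a minimal pure-Python sketch of the basic k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not sklearn's implementation, which by default uses a smarter initialization (k-means++) and other refinements. The function names and the toy points are made up for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, n_iter=100, seed=0):
    """Basic k-means loop; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(dim) / float(len(members)) for dim in zip(*members)]
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(toy_points, k=2)
print(labels)
```

Which group ends up as cluster 0 depends on the random start, so only the grouping is meaningful, not the label numbers.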

In [4]:
"""
Add a third feature to features_list, "total_payments". Now rerun clustering, using 3 input features instead of 2 (obviously we 
can still only visualize the original 2 dimensions). Compare the plot with the clusterings to the one you obtained with 2 input 
features. Do any points switch clusters? How many? This new clustering, using 3 features, couldn't have been guessed by eye--it 
was the k-means algorithm that identified it.

(You'll need to change the code that makes the scatterplot to accommodate 3 features instead of 2, see the comments in the 
starter code for instructions on how to do this.)
"""
### the input features we want to use 
### can be any key in the person-level dictionary (salary, director_fees, etc.) 
feature_1 = "salary"
feature_2 = "exercised_stock_options"
feature_3 = "total_payments"
poi  = "poi"
features_list = [poi, feature_1, feature_2, feature_3]
data = featureFormat(data_dict, features_list )
poi, finance_features = targetFeatureSplit( data )

### with 3 features, each row unpacks into three values;
### only the first two are plotted
for f1, f2, _ in finance_features:
    plt.scatter( f1, f2 )
plt.show()

### cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred
from sklearn.cluster import KMeans
clu = KMeans(n_clusters=2)
# cluster on the features only; `data` still carries the poi label in column 0
pred = clu.fit_predict(finance_features)

### rename the "name" parameter when you change the number of features
### so that the figure gets saved to a different file
try:
    Draw(pred, finance_features, poi, mark_poi=False, name="3clusters.pdf", f1_name=feature_1, f2_name=feature_2)
except NameError:
    print("no predictions object named pred found, no clusters to plot")
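
Rather than counting switched points by eye, you could compare the two label lists directly. The sketch below uses a hypothetical helper, `count_switched`, and assumes exactly 2 clusters and that both runs cover the same people in the same order. That second assumption may not hold here: featureFormat drops rows whose features are all zero, so adding a third feature can change which rows survive.

```python
def count_switched(pred_a, pred_b):
    """Count points whose cluster changed between two 2-cluster labelings."""
    if len(pred_a) != len(pred_b):
        raise ValueError("both labelings must cover the same points")
    mismatches = sum(1 for a, b in zip(pred_a, pred_b) if a != b)
    # Cluster numbers are arbitrary, so 0 and 1 may simply be swapped
    # between runs; take whichever matching gives fewer changes.
    return min(mismatches, len(pred_a) - mismatches)


# Hypothetical label lists standing in for the 2-feature and 3-feature runs.
pred_two_features = [0, 0, 1, 1, 0]
pred_three_features = [1, 1, 0, 1, 1]
print(count_switched(pred_two_features, pred_three_features))
```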

In [7]:
"""
In the next lesson, we'll talk about feature scaling. It's a type of feature preprocessing that you should perform before some 
classification and regression tasks. Here's a sneak preview that should call your attention to the general outline of what 
feature scaling does.

What are the maximum and minimum values taken by the "exercised_stock_options" feature used in this example?

(NB: if you look at finance_features, there are some "NaN" values that have been cleaned away and replaced with zeroes--so while 
those might look like the minima, it's a bit deceptive because they're more like points for which we don't have information, and 
just have to put in a number. So for this question, go back to data_dict and look for the maximum and minimum numbers that show 
up there, ignoring all the "NaN" entries.)
"""

def get_min_max(feature):
    min_feat = 'NaN'
    max_feat = 'NaN'
    for name in data_dict:
        feat = data_dict[name][feature]
        if feat != 'NaN':
            if min_feat == 'NaN':
                min_feat = feat
                max_feat = feat
            if feat < min_feat:
                min_feat = feat
            if feat > max_feat:
                max_feat = feat
    return min_feat, max_feat

min_stock, max_stock = get_min_max("exercised_stock_options")    
print('Minimum "exercised_stock_options":  {}'.format(min_stock))
print('Maximum "exercised_stock_options":  {}'.format(max_stock))

Minimum "exercised_stock_options":  3285
Maximum "exercised_stock_options":  34348384
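
The same lookup can be written more compactly with the built-in min and max once the "NaN" entries are filtered out. This is a sketch on a made-up dictionary; `get_min_max_builtin` and the values are hypothetical.

```python
# Hypothetical toy data in the same layout as data_dict.
toy_dict = {
    "PERSON A": {"salary": 250000},
    "PERSON B": {"salary": "NaN"},
    "PERSON C": {"salary": 80000},
}


def get_min_max_builtin(data, feature):
    """Return (min, max) of a feature, ignoring "NaN" placeholder strings."""
    values = [
        person[feature] for person in data.values() if person[feature] != "NaN"
    ]
    if not values:
        # Mirror the loop version: no usable values means "NaN" for both.
        return "NaN", "NaN"
    return min(values), max(values)


print(get_min_max_builtin(toy_dict, "salary"))
```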


In [8]:
"""
What are the maximum and minimum values taken by "salary"?

(NB: same caveat as in the last quiz. If you look at finance_features, there are some "NaN" values that have been cleaned away 
and replaced with zeroes--so while those might look like the minima, it's a bit deceptive because they're more like points for 
which we don't have information, and just have to put in a number. So for this question, go back to data_dict and look for the 
maximum and minimum numbers that show up there, ignoring all the "NaN" entries.)
"""
min_salary, max_salary = get_min_max("salary")    
print('Minimum "salary":  {}'.format(min_salary))
print('Maximum "salary":  {}'.format(max_salary))

Minimum "salary":  477
Maximum "salary":  1111258
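
The minimum and maximum are exactly what a min-max rescaling needs: each value x becomes (x - min) / (max - min), which lands in the range [0.0, 1.0]. The sketch below assumes that this is the kind of scaling being previewed; the project text doesn't name the method. The function name and the middle salary value are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    span = float(hi - lo)
    return [(v - lo) / span for v in values]


# 477 and 1111258 are the salary extremes found above; 200000 is made up.
salaries = [477, 200000, 1111258]
scaled = min_max_scale(salaries)
print(scaled)
```

Without scaling, a feature with values in the tens of millions dominates the distance calculation, which is why points can change cluster once both features share the same range.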


In [None]:
"""
The plot on the next slide shows the exact same clustering code that you just wrote, but in this example we applied feature 
scaling before performing the clustering.

We want you to compare the clustering with scaling (on the next slide) with the first clustering visualization you produced, 
when you used two features in your clustering algorithm.

Notice that now the range of the features has changed to [0.0, 1.0]. That's the only change we've made.

In the next lesson you'll learn a lot more about what feature scaling means, but for now, just look at the effect on the 
clusters--which point(s) switch their associated cluster?
"""