## Lesson 06: Unsupervised Learning - Clustering and Principal Component Analysis

### Part 01: Clustering with K-Means


#### Clustering Mini Project 

In this project, we’ll apply k-means clustering to our Enron financial data available in the [ud120-projects repo](https://www.github.com/udacity/ud120-projects). It is best if you save a copy on your local machine (using git clone or download and extract the zip archive). You will need to specify the location of the folder where you placed the files from the repository. 

Our final goal, of course, is to identify persons of interest; since we have labeled data, this is not a question that particularly calls for an unsupervised approach like k-means clustering. 

Nonetheless, it is interesting to attempt a clustering analysis to see how well this kind of approach can distinguish between POIs and non-POIs. You’ll also get some hands-on practice with k-means in this project and experiment with feature scaling, which will give you a sneak preview of the next lesson’s material.


#### Preliminaries 
As usual, we start by loading the packages we are going to use. These should already be installed in your environment; no new packages need to be installed for this session.

 - numpy
 - pandas
 - sklearn
 - matplotlib
 
In addition, we will be using data and some code from the ud120-projects repo, so those should be available on your local machine as well.

Run the cell below to load them.

In [None]:
#!/usr/bin/python 

import pickle

try:
    import numpy as np
    print("Successfully imported numpy! (Version {})".format(np.version.version))
except ImportError:
    print("Could not import numpy!")
    
try:
    import matplotlib
    import matplotlib.pyplot as plt
    print("Successfully imported matplotlib! (Version {})".format(matplotlib.__version__))
except ImportError:
    print("Could not import matplotlib!")

try:
    import pandas as pd
    print("Successfully imported pandas! (Version {})".format(pd.__version__))
    pd.options.display.max_rows = 10
except ImportError:
    print("Could not import pandas!")

import os, sys

try:
    from IPython.display import display
    from IPython.display import Image
    print("Successfully imported display from IPython.display and Image!")
except ImportError:
    print("Could not import display from IPython.display")

%matplotlib inline

In [None]:
## TODO - adjust the value of the PATH_TO_MINI variable so it points to the top level folder of the Udacity projects
##     For example, here is the structure on my machine:
###         parent dir (projects)
###               -  ZKConnect (folder for ConnectIntensive notebooks)
###                     - lesson-06-part-01.ipynb (this file)
###               -  projects  (root folder for all Udacity projects - github ud120) 
###                     - final_project
###                     - k_means
###                     - naive_bayes
###                     - tools
###
### So, I would set PATH_TO_MINI = '../projects'
###
### Once this is correctly set using either a relative path or an absolute path, the rest of the code should work correctly

PATH_TO_MINI = "../projects"

### Once you have the path set, the output from this cell should say so

try:
    # Check that the project data file exists at the expected location
    path_ok = os.path.isfile(os.path.join(PATH_TO_MINI,"final_project","final_project_dataset.pkl"))
    if not path_ok:
        raise Exception("Path is not set correctly")
    print("PATH_TO_MINI appears correct (found project data file)")
except Exception as e:
    print(e)
    

In [None]:
sys.path.append(PATH_TO_MINI)
sys.path.append(os.path.join(PATH_TO_MINI,"tools"))
sys.path.append(os.path.join(PATH_TO_MINI,"k_means"))

## Adapted from ud120-projects/k_means/k_means_cluster.py
from feature_format import featureFormat, targetFeatureSplit


def Draw(pred, features, poi, mark_poi=False, name="image.png", f1_name="feature 1", f2_name="feature 2"):
    """ some plotting code designed to help you visualize your clusters """

    ### plot each cluster with a different color--add more colors for
    ### drawing more than five clusters
    colors = ["b", "c", "k", "m", "g"]
    for ii, pp in enumerate(pred):
        plt.scatter(features[ii][0], features[ii][1], color = colors[pred[ii]])

    ### if you like, place red stars over points that are POIs (just for funsies)
    if mark_poi:
        for ii, pp in enumerate(pred):
            if poi[ii]:
                plt.scatter(features[ii][0], features[ii][1], color="r", marker="*")
    plt.xlabel(f1_name)
    plt.ylabel(f2_name)
    #plt.savefig(name) # We will not save the figures as images -- can see them in the notebook
    plt.show()

In [None]:
### load in the dict of dicts containing all the data on each person in the dataset
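# Pickle files should be opened in binary mode; the with-block closes the file afterwards
with open(os.path.join(PATH_TO_MINI,"final_project","final_project_dataset.pkl"), "rb") as data_file:
    data_dict = pickle.load(data_file)
print("Loaded dataset - has {} keys (or rows)".format( len(data_dict.keys()) ))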

#### Data Exploration

We want to get a sense of what is in the data set, whether there are any missing values that need to be cleaned up. 

##### Question 1 - How many different persons are there in this dataset? Are there any keys that should be excluded? (HINT: There is at least one). Use the `pop` method of a Python `dict` object to remove any "rows" that need to be removed.
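
As a reminder of how `pop` behaves, here is a minimal sketch on a made-up dictionary. The keys and values are hypothetical and are not taken from the Enron data. Passing a default as the second argument avoids a `KeyError` when the key is absent.

```python
# Toy dictionary standing in for data_dict; keys and values are invented for illustration.
toy_dict = {
    "PERSON A": {"salary": 100},
    "PERSON B": {"salary": 200},
    "NOT A PERSON": {"salary": 300},
}

# pop removes the key and returns its value
removed = toy_dict.pop("NOT A PERSON", None)

# the key is absent here, so the default (None) is returned instead of raising KeyError
missing = toy_dict.pop("NO SUCH KEY", None)

print(sorted(toy_dict.keys()))
print(removed)
print(missing)
```

Because a default is supplied, the same call is safe to run more than once on the same dictionary.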

In [None]:
# There's at least one row that doesn't belong in the dataset (an outlier?) --remove it! 
# You can print out the keys to see if there is one that is not a person
# Should you be concerned about duplicates?

### YOUR CODE CAN GO HERE

#   Add the key of the "row" you want to remove as a string literal element in the KEYS_TO_REMOVE LIST
KEYS_TO_REMOVE = [""]

for k in KEYS_TO_REMOVE:
    # pop returns the removed entry, or the default (0) if the key is not present
    if data_dict.pop(k, 0):
        print("Removed item {} from data set".format(k))

print("Dataset with outliers removed has {} keys (or rows)".format( len(data_dict.keys()) ))

As we've seen before, `pandas` DataFrames provide a convenient way of managing data. We will use the `DataFrame.from_dict` method to load our Python `data_dict`. However, when the dictionary is loaded this way, the resulting DataFrame has the dictionary's keys (the people) as its columns. We can "flip" this using the transpose (`.T`) so that each row contains the data for one person.
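
To see the orientation issue on a small scale, here is a sketch using a made-up dict of dicts (the names and numbers are hypothetical). It also shows `orient="index"`, an alternative to transposing that this notebook does not otherwise use.

```python
import pandas as pd

# Toy dict of dicts; names and values are invented for illustration.
toy = {
    "PERSON A": {"salary": 100, "bonus": 10, "total_payments": 110},
    "PERSON B": {"salary": 200, "bonus": 20, "total_payments": 220},
}

# Default orientation: the outer keys (people) become the columns
by_column = pd.DataFrame.from_dict(toy)

# Transposing puts one person on each row
transposed = by_column.T

# orient="index" builds the person-per-row layout directly
by_row = pd.DataFrame.from_dict(toy, orient="index")

print(by_column.shape)   # 3 features down, 2 people across
print(transposed.shape)  # 2 people down, 3 features across
print(by_row.shape)
```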

In [None]:
# Create a DataFrame object from the Enron data dictionary
enron_df = pd.DataFrame.from_dict(data_dict)

# Take the transpose (.T) of the Enron DataFrame so that each row holds one person
enron_df = enron_df.T

# Display summary statistics for the transposed DataFrame
display(enron_df.describe())

As you can see, several of the features have missing values, and we will need to fill them in using imputation.

In [None]:
# Change all entries in the DataFrame with "NaN" to zeroes.
enron_df[enron_df == "NaN"] = 0

#### Question 2 - Why is it ok to replace the missing values with 0? Write your thoughts in the cell below and we will discuss as a group. Think about the other ways you can impute missing data, e.g., mean or median values.
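
For comparison while you think about this, here is a minimal sketch of the three imputation options on a made-up column. It assumes the missing entry is a real `np.nan` rather than the string `"NaN"` used in the Enron dictionary, and the numbers are invented.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry; values are invented for illustration.
salary = pd.Series([100.0, 200.0, np.nan, 900.0])

# Option 1: replace missing values with zero
zero_filled = salary.fillna(0)

# Option 2: replace with the mean of the observed values (mean skips NaN by default)
mean_filled = salary.fillna(salary.mean())

# Option 3: replace with the median of the observed values
median_filled = salary.fillna(salary.median())

print(zero_filled.tolist())
print(mean_filled.tolist())
print(median_filled.tolist())
```

Note how the single large value pulls the mean well above the median, which is one thing to weigh when choosing between them.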

#### Question 3 - What are the features (i.e., columns or variables) available in this dataset? Which ones might be useful for identifying a poi? Please write down your top 5 choices along with your reasons for choosing them 

While we can use the DataFrame we created with the sklearn routines, we will convert back to the python dictionary format so we can use the python code included with the projects for self-consistency. Also, we will save a separate copy of the data so we can switch between looking at the pre-processed data as well as the raw data if needed.

In [None]:
data_dict_new = enron_df.T.to_dict()

#### The next several cells are ones you can run repeatedly, varying the inputs to just play around.

##### The online version of the mini project has six quizzes. Some of the questions in this section are aimed at those quizzes. You can use this notebook to complete the online version if you wish to do so.

Select two features from the set you identified in Question 3 to use in the exploration below. Enter them as `feature_1` and `feature_2`.

#### Clustering with two features

In [None]:
### the input features we want to use 
### can be any key in the person-level dictionary (salary, director_fees, etc.) 

### 
feature_1 = "salary"
feature_2 = "exercised_stock_options"
poi  = "poi"
features_list = [poi, feature_1, feature_2]
data = featureFormat(data_dict_new, features_list )
poi, finance_features = targetFeatureSplit( data )


### in the "clustering with 3 features" part of the mini-project,
### you'll want to change this line to 
### for f1, f2, _ in finance_features:
### (as it's currently written, the line below assumes 2 features)
for f1, f2 in finance_features:
    plt.scatter( f1, f2 )
plt.show()

In [None]:
### cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred
from sklearn.cluster import KMeans

clf = KMeans(n_clusters=2, n_init=10, init="k-means++")
clf.fit(finance_features) # k-means is unsupervised, so the poi labels are not needed for fitting
pred = clf.predict(finance_features) # The cluster assignments will be used to visualize them


### We left out the "name" parameter as we don't want to save the images 

try:
    Draw(pred, finance_features, poi, mark_poi=False,  f1_name=feature_1, f2_name=feature_2)
except NameError:
    print("no predictions object named pred found, no clusters to plot")

#### In the scatterplot that pops up, are the clusters what you expected?

#### Three features clustering

In this section we will add a third feature to `features_list` and rerun the clustering. The quiz specifies which feature to use (`total_payments`), but feel free to experiment with others.

You can copy and modify the cells from the two features groups above to complete this.

#### Question 4 - Compare the plot with the clusterings to the one you obtained with 2 input features. Do any points switch clusters? How many? This new clustering, using 3 features, couldn’t have been guessed by eye--it was the k-means algorithm that identified it.
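
One way to count switches without relying on the plot is to compare the two label lists directly. The sketch below uses a hypothetical helper, `count_switched`, and made-up labels. It assumes exactly two clusters and the same ordering of points in both runs. Because k-means numbers its clusters arbitrarily, the helper also considers the case where the 0 and 1 labels are simply swapped between runs.

```python
def count_switched(labels_a, labels_b):
    """Count points assigned to different clusters in two 2-cluster labelings.

    Cluster numbering is arbitrary, so the smaller of the direct and
    label-swapped disagreement counts is returned.
    """
    disagreements = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return min(disagreements, len(labels_a) - disagreements)


# Made-up labels standing in for the predictions from the two runs
labels_two_features = [0, 0, 0, 1, 1]
labels_three_features = [1, 1, 0, 0, 0]

print(count_switched(labels_two_features, labels_three_features))
```

Here the numbering is flipped between the two runs, and once that is accounted for only the third point has changed cluster.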

#### Feature Scaling 

As you can see, if you used one of the finance features as a variable, most of the "points" are close together, but several are fairly far away from the rest. Feature scaling is a way to "normalize" the data so we don't end up emphasizing any one dimension too much.

The next few questions (aligned with the online mini-project quizzes) lead you through some of the calculations that are done for feature scaling.
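
As an illustration of the kind of calculation involved, here is a sketch of min-max rescaling, one common scaling choice that maps a feature onto the range 0 to 1 using its minimum and maximum. The function name `min_max_rescale` and the numbers are hypothetical, and other scaling methods exist.

```python
import numpy as np

def min_max_rescale(values):
    """Rescale values to [0, 1] via (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All values identical: avoid dividing by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)


# Made-up values, not taken from the Enron data
toy_salaries = [100.0, 200.0, 900.0]
rescaled = min_max_rescale(toy_salaries)
print(rescaled)
```

After rescaling, the smallest value maps to 0 and the largest to 1, so features measured on very different scales contribute comparably to the distances k-means computes.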

#### What are the maximum and minimum values taken by the “exercised_stock_options” feature used in this example?

(NB: if you look at finance_features, there are some "NaN" values that have been cleaned away and replaced with zeroes--so while those might look like the minima, it's a bit deceptive because they're more like points for which we don't have information, and just have to put in a number. So for this question, go back to `data_dict` and look for the maximum and minimum numbers that show up there, ignoring all the "NaN" entries.)
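
A minimal sketch of that lookup, using a hypothetical helper `feature_range` and a made-up dict of dicts in the same shape as `data_dict` (string `"NaN"` for missing entries):

```python
def feature_range(records, feature):
    """Return (min, max) of a feature across all records, skipping "NaN" entries."""
    values = [
        person[feature]
        for person in records.values()
        if person[feature] != "NaN"
    ]
    return min(values), max(values)


# Toy records; names and numbers are invented for illustration.
toy_records = {
    "PERSON A": {"salary": 100, "exercised_stock_options": "NaN"},
    "PERSON B": {"salary": "NaN", "exercised_stock_options": 5000},
    "PERSON C": {"salary": 900, "exercised_stock_options": 300},
}

print(feature_range(toy_records, "salary"))
print(feature_range(toy_records, "exercised_stock_options"))
```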


#### What are the maximum and minimum values taken by “salary”?

This feature also had NaNs in the original data, so ignore them here as well.

#### Clustering with feature scaling

The figure below shows the clustering obtained with feature scaling. What is your best guess as to what transformation was used to scale the features? If you have a guess you want to test, transform `finance_features` accordingly, run k-means to create the new clusters, and then plot the clusters using the `Draw` function.


In [None]:
Image(filename="kmclusters.png")