# Introduction to Clustering


<p><div style="float:right;margin-left:5px;"><img src="https://cambridge-intelligence.com/wp-content/uploads/2016/01/clustering-animated.gif" width="500"/></div></p>
This notebook is short and sweet and covers both hierarchical clustering and k-means clustering.

Let's start with the example we did in class, the dog breed dataset.

In [2]:
import pandas as pd
dog_data = pd.read_csv('https://raw.githubusercontent.com/zacharski/machine-learning/master/data/dogbreeds.csv')
dog_data = dog_data.set_index('breed')

In [3]:
dog_data

| breed | height (inches) | weight (pounds) |
|---|---|---|
| Border Collie | 20 | 45 |
| Boston Terrier | 16 | 20 |
| Brittany Spaniel | 18 | 35 |
| Bullmastiff | 27 | 120 |
| Chihuahua | 8 | 8 |
| German Shepherd | 25 | 78 |
| Golden Retreiver | 23 | 70 |
| Great Dane | 32 | 160 |
| Portuguese Water Dog | 21 | 50 |
| Standard Poodle | 19 | 65 |


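Looking at the values in the height and weight columns, it looks like we should normalize the data: in the rows above, weight ranges from 8 to 160 while height only ranges from 8 to 32, so weight would dominate any distance calculation.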

<img src="http://animalfair.com/wp-content/uploads/2014/08/chihuahua-and-great-dane.jpg" width="700"/>
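One common way to normalize is to convert each column to z-scores (subtract the column mean, divide by the column standard deviation). The following is only a minimal sketch of that idea, written in plain Python so the arithmetic is visible; the `standardize` helper is a made-up name for illustration, and the lists simply copy the ten rows displayed in the table above.

```python
from statistics import mean, pstdev

# Heights (inches) and weights (pounds) copied from the ten rows shown above
heights = [20, 16, 18, 27, 8, 25, 23, 32, 21, 19]
weights = [45, 20, 35, 120, 8, 78, 70, 160, 50, 65]


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


height_z = standardize(heights)
weight_z = standardize(weights)

# After standardizing, both columns are on the same scale
for h, w in zip(height_z, weight_z):
    print("%6.2f  %6.2f" % (h, w))
```

With the DataFrame itself, an expression along the lines of `(dog_data - dog_data.mean()) / dog_data.std()` does much the same thing, although pandas' `std` uses the sample standard deviation by default, so the numbers differ slightly from this sketch.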


In [5]:
## TODO



## k-means Clustering
Let's divide our dog dataset into 3 clusters:


In [6]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3).fit(dog_data)
labels = kmeans.labels_

The variable `labels` is an array that specifies which group each dog belongs to:

In [7]:
labels

array([2, 2, 2, 0, 1, 2, 2, 0, 2, 2, 1], dtype=int32)

So the first dog is in group 2. That may be helpful for future computational tasks, but it is not that helpful if we are trying to visualize the data. Let me munge that a bit into a more useful form:

In [9]:
groups = {0: [], 1: [], 2: []}
# Pair each breed (the DataFrame index) with its cluster label
for breed, label in zip(dog_data.index, labels):
    groups[label].append(breed)

# Now print the groups in a readable way:
for key, value in groups.items():
    print('CLUSTER %i' % key)
    for breed in value:
        print("    %s" % breed)
    print('\n')

CLUSTER 0
    Bullmastiff
    Great Dane


CLUSTER 1
    Chihuahua
    Yorkshire Terrier


CLUSTER 2
    Border Collie
    Boston Terrier
    Brittany Spaniel
    German Shepherd
    Golden Retreiver
    Portuguese Water Dog
    Standard Poodle




Keep in mind that since the initial centroids are selected somewhat randomly, you may not get exactly the same answer as I did.
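To see where the randomness comes in, here is a minimal pure-Python sketch of the basic k-means loop. It is not scikit-learn's implementation (which by default uses the smarter k-means++ seeding); the `kmeans` function and its `seed` argument are assumptions made for illustration, and the points are the ten (height, weight) rows displayed earlier.

```python
import random

# (height, weight) pairs copied from the ten rows shown in the table above
points = [(20, 45), (16, 20), (18, 35), (27, 120), (8, 8),
          (25, 78), (23, 70), (32, 160), (21, 50), (19, 65)]


def kmeans(points, k, seed, max_iter=100):
    """Basic k-means (Lloyd's algorithm); returns one cluster label per point."""
    rng = random.Random(seed)
    # This is the random step: pick k of the points as initial centroids
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance)
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it
        for c in range(k):
            assigned = [p for p, lab in zip(points, labels) if lab == c]
            if assigned:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(assigned)
                                     for col in zip(*assigned))
    return labels


labels_a = kmeans(points, k=3, seed=0)
labels_b = kmeans(points, k=3, seed=1)
print(labels_a)
print(labels_b)  # a different seed may number, or even form, the clusters differently
```

Fixing the seed makes a run repeatable. scikit-learn's `KMeans` offers a `random_state` argument for the same purpose.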

## Hierarchical Clustering

In [14]:
from sklearn.cluster import AgglomerativeClustering
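# Ward linkage always uses Euclidean distance, so it need not be passed explicitly
# (the `affinity` argument was renamed `metric` in newer scikit-learn releases)
clusterer = AgglomerativeClustering(linkage='ward')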
clusterer.fit(dog_data)

AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='ward',
            memory=Memory(cachedir=None), n_clusters=2,
            pooling_func=<function mean at 0x115798d90>)

We can get the highest-level division by viewing `.labels_`:


In [18]:
clusterer.labels_

array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

So here the first dog breed, Border Collie, belongs to cluster 0. But the clustering algorithm actually constructs a tree, specifically a dendrogram, and viewing it requires some imagination. I can print a representation of the tree with:

In [19]:
import itertools
# Leaves are numbered 0..n_samples-1, so merged nodes are numbered from n_samples upward
ii = itertools.count(dog_data.shape[0])
[{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in clusterer.children_]

[{'left': 0, 'node_id': 11, 'right': 8},
 {'left': 4, 'node_id': 12, 'right': 10},
 {'left': 5, 'node_id': 13, 'right': 6},
 {'left': 1, 'node_id': 14, 'right': 2},
 {'left': 9, 'node_id': 15, 'right': 11},
 {'left': 14, 'node_id': 16, 'right': 15},
 {'left': 3, 'node_id': 17, 'right': 7},
 {'left': 13, 'node_id': 18, 'right': 16},
 {'left': 12, 'node_id': 19, 'right': 18},
 {'left': 17, 'node_id': 20, 'right': 19}]

The first line, `{'left': 0, 'node_id': 11, 'right': 8}`, says that we combine cluster 0 (*Border Collie*) with cluster 8 (*Portuguese Water Dog*) to create cluster 11. The next line says we combine cluster 4 (*Chihuahua*) with cluster 10 (*Yorkshire Terrier*) to create cluster 12.

So when I draw this out I get:

<img src="http://zacharski.org/files/courses/cs419/dendro.png" width="700"/>
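The merge list can also be expanded into breed names programmatically, which makes the tree easier to check against the drawing. The sketch below hard-codes the `left`/`right` pairs printed above and assumes the breeds are indexed in the order they appear in the dataset (0 is Border Collie, 10 is Yorkshire Terrier); `merge_members` is a hypothetical helper, not part of scikit-learn.

```python
breeds = ['Border Collie', 'Boston Terrier', 'Brittany Spaniel', 'Bullmastiff',
          'Chihuahua', 'German Shepherd', 'Golden Retreiver', 'Great Dane',
          'Portuguese Water Dog', 'Standard Poodle', 'Yorkshire Terrier']

# (left, right) pairs copied from the printed tree, in merge order
children = [(0, 8), (4, 10), (5, 6), (1, 2), (9, 11),
            (14, 15), (3, 7), (13, 16), (12, 18), (17, 19)]


def merge_members(children, leaf_names):
    """Map every node id (leaves and merged nodes) to the leaf names beneath it."""
    n_leaves = len(leaf_names)
    members = {i: [name] for i, name in enumerate(leaf_names)}
    for step, (left, right) in enumerate(children):
        # The node created at merge number `step` gets id n_leaves + step
        members[n_leaves + step] = members[left] + members[right]
    return members


members = merge_members(children, breeds)
for node_id in range(len(breeds), len(members)):
    print(node_id, members[node_id])
```

The last node printed (20) is the root and contains every breed; its two children, nodes 17 and 19, correspond to the two top-level clusters reported by `.labels_`.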

<h1 style="color:red">Tasks</h1>

<h2 style="color:red">Task 1: Breakfast Cereals</h2>
I would like you to create 4 clusters of the data in:

    https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch8/cereal.csv
    
For clustering, use the features calories, sugar, protein, and fiber.

Print out the results as we did for the dog breed data:


    CLUSTER 0
        Bullmastiff
        Great Dane
    

    CLUSTER 1
        Chihuahua
        Yorkshire Terrier
    

    CLUSTER 2
        Border Collie
        Boston Terrier
        Brittany Spaniel
        German Shepherd
        Golden Retreiver
        Portuguese Water Dog
        Standard Poodle
        
Because the initial centroids are random, the sklearn k-means algorithm can be run several times, keeping the best result (based on the sum of squared errors). The default was 10 runs in the scikit-learn version used here; newer releases may use a different default. I would like you to change that parameter so it runs the algorithm 100 times.
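For reference, the sum of squares error of a clustering is the total squared distance from each point to the centroid of its cluster; lower is better. The sketch below is a plain-Python illustration with made-up points and a hypothetical `sum_of_squares_error` helper, not the code scikit-learn runs internally.

```python
def sum_of_squares_error(points, labels):
    """Total squared distance from each point to the centroid of its cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        # The centroid is the per-feature mean of the cluster's points
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum((a - c) ** 2 for p in members for a, c in zip(p, centroid))
    return total


# Four made-up points forming two obvious pairs
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

good = sum_of_squares_error(points, [0, 0, 1, 1])  # pairs nearby points
bad = sum_of_squares_error(points, [0, 1, 0, 1])   # pairs distant points
print(good, bad)  # 4.0 100.0
```

When k-means is run several times from different starting centroids, the run with the smallest such error is the one that is kept.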


<h2 style="color:red">Task 2: TBD</h2>
I would like you to create 4 clusters of the data in:
