# Cluster
Unsupervised learning example: hierarchical clustering with [SciPy] and [pandas].

[SciPy]: https://docs.scipy.org/doc/scipy/reference/cluster.html
[pandas]: https://pandas.pydata.org/

In [1]:
%load_ext autoreload
%autoreload all

from cluster import Hierarchy
from tools import *

## get example data

Normalize each feature to z-scores so that small-magnitude features are not drowned out by larger ones.

In [2]:
data = irisdata()
data = zscores(data)
data.tail()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 145 | 1.034539 | -0.131539 | 0.816859 | 1.443994 | virginica |
| 146 | 0.551486 | -1.27868 | 0.703564 | 0.919223 | virginica |
| 147 | 0.793012 | -0.131539 | 0.816859 | 1.050416 | virginica |
| 148 | 0.430722 | 0.786174 | 0.930154 | 1.443994 | virginica |
| 149 | 0.068433 | -0.131539 | 0.760211 | 0.788031 | virginica |
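The normalization step can be sketched with the standard library alone. This is only an illustration of a z-score, not the notebook's `zscores` helper, whose implementation is not shown. The function name `zscore_column` and the sample values are hypothetical, and the sketch assumes the sample standard deviation (n - 1 in the denominator), which is also the pandas default.

```python
from statistics import mean, stdev


def zscore_column(values):
    """Standardize one column: subtract the mean, divide by the sample std."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in values]


# Two hypothetical features on very different scales.
lengths = [5100.0, 4900.0, 6300.0, 5800.0]
widths = [0.2, 0.4, 1.8, 2.5]

scaled_lengths = zscore_column(lengths)
scaled_widths = zscore_column(widths)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates a distance calculation purely because of its units.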


## build a Hierarchy
Pass a DataFrame to `Hierarchy` to calculate a [linkage matrix].

[linkage matrix]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

In [3]:
tree = Hierarchy(data)
tree

Hierarchy with 149 links
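A linkage matrix for n observations has n - 1 rows, one per merge, which is consistent with 149 links for the 150 iris rows. The sketch below calls SciPy's `linkage` directly on a small hypothetical array. That `Hierarchy` wraps this call is an assumption, since the class internals are not shown here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical observations: 6 rows, 2 features.
points = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
    [-1.0, 0.1],
    [-0.9, -0.2],
])

# Average linkage on cosine distance, one possible choice of settings.
Z = linkage(points, method="average", metric="cosine")

# Each row describes one merge: [left, right, distance, count].
n_links, n_columns = Z.shape
```

The last row always has a count equal to the number of observations, because the final merge joins everything into one cluster.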

In [7]:
tree.features

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [4]:
tree.leaves

RangeIndex(start=0, stop=150, step=1)

In [5]:
tree.links

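| | left | right | distance | count |
|---|---|---|---|---|
| 0 | 101 | 142 | 0.000000 | 2 |
| 1 | 13 | 38 | 0.000074 | 2 |
| 2 | 82 | 92 | 0.000612 | 2 |
| 3 | 29 | 47 | 0.000818 | 2 |
| 4 | 17 | 40 | 0.001022 | 2 |
| ... | ... | ... | ... | ... |
| 144 | 283 | 291 | 0.248652 | 51 |
| 145 | 293 | 294 | 0.418069 | 74 |
| 146 | 288 | 292 | 0.529738 | 27 |
| 147 | 295 | 296 | 0.936590 | 101 |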
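In SciPy's linkage convention, an index below the number of leaves refers to an original row, and an index of `n + i` refers to the cluster created by link `i`. The indices above 149 in this table suggest it follows the same convention, though that is an assumption about `Hierarchy`. A minimal sketch of resolving a cluster index back to its rows, using a hypothetical helper `members` and made-up links:

```python
def members(index, links, n_leaves):
    """Return the original row numbers inside a leaf or a merged cluster."""
    if index < n_leaves:
        return [index]  # an original observation
    # Index n_leaves + i refers to the cluster formed by link i.
    left, right = links[index - n_leaves][:2]
    return members(left, links, n_leaves) + members(right, links, n_leaves)


# Hypothetical links for 4 leaves: (left, right, distance, count).
links = [
    (0, 1, 0.1, 2),  # link 0 creates cluster 4
    (2, 3, 0.2, 2),  # link 1 creates cluster 5
    (4, 5, 0.9, 4),  # link 2 merges clusters 4 and 5 into cluster 6
]

root = members(6, links, n_leaves=4)
```

The length of each resolved list equals the `count` column of the corresponding link.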


In [6]:
tree.params

{'method': 'average', 'metric': 'cosine', 'optimal_ordering': False}
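The `cosine` metric compares the direction of two rows rather than their magnitudes: the distance is 1 minus the cosine of the angle between them. A small pure-Python sketch of that formula follows. The helper name `cosine_distance` is illustrative and not part of the `cluster` module.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v); undefined for all-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])      # 1
opposite = cosine_distance([1.0, 1.0], [-1.0, -1.0])      # close to 2
```

Rows that point the same way have a distance near 0 regardless of their length, which is why two proportional rows can merge at a distance of essentially zero.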

## choose clusters
Call the tree with the desired number of clusters to assign each row a cluster number, starting at 0.

In [8]:
clusters = tree(3)
clustered = data.join(clusters)
clustered.tail()

| | sepal_length | sepal_width | petal_length | petal_width | species | cluster |
|---|---|---|---|---|---|---|
| 145 | 1.034539 | -0.131539 | 0.816859 | 1.443994 | virginica | 1 |
| 146 | 0.551486 | -1.27868 | 0.703564 | 0.919223 | virginica | 1 |
| 147 | 0.793012 | -0.131539 | 0.816859 | 1.050416 | virginica | 1 |
| 148 | 0.430722 | 0.786174 | 0.930154 | 1.443994 | virginica | 1 |
| 149 | 0.068433 | -0.131539 | 0.760211 | 0.788031 | virginica | 1 |
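Cutting a tree into a fixed number of clusters can be sketched with SciPy's `fcluster`. SciPy numbers clusters from 1, so subtracting 1 gives zero-based labels like those shown here. Whether `Hierarchy` does exactly this when called is an assumption, and the points below are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical data: three well-separated pairs of points.
points = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 10.0], [10.0, 11.0],
    [20.0, 0.0], [21.0, 0.0],
])

Z = linkage(points, method="average", metric="euclidean")

# Ask for at most 3 flat clusters, then shift labels to start at 0.
labels = fcluster(Z, t=3, criterion="maxclust") - 1
```

The cluster numbers themselves carry no meaning beyond identifying which rows belong together.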


In [9]:
clusters.value_counts()

cluster
1    74
0    49
2    27
Name: count, dtype: int64

# UNDER CONSTRUCTION

## check results
How closely do the clusters agree with actual species?

In [None]:
tree = Hierarchy(data)
confusion = crosstab(tree(3), data['species'])  # rows: cluster numbers, columns: species
plot.heat(confusion, cmap='Greys')
confusion
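A cluster-versus-species table can be built with `pandas.crosstab`. The sketch below uses a few made-up labels rather than the iris results, and assumes the notebook's `crosstab` behaves like the pandas function.

```python
import pandas as pd

# Hypothetical labels for six rows.
species = pd.Series(
    ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    name="species",
)
cluster = pd.Series([0, 0, 1, 1, 1, 2], name="cluster")

# Rows are cluster numbers, columns are species, cells are row counts.
confusion = pd.crosstab(cluster, species)
```

Because cluster numbers are arbitrary, good agreement shows up as one dominant cell per row, not necessarily as a diagonal.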

## show the tree
Limit the number of branches to make the tree easier to read and faster to draw.

In [None]:
axes = tree.plot()  # assumes Hierarchy provides the plot() method that self.plot() referred to
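One way to limit the branches is SciPy's `dendrogram` with `truncate_mode='lastp'`, which collapses everything below the last `p` merged clusters into single leaves. The sketch uses random hypothetical data and `no_plot=True`, so it computes the layout without drawing. How this notebook's plot method limits branches is not shown, so treat the correspondence as an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: 40 random rows with 3 features.
rng = np.random.default_rng(0)
points = rng.normal(size=(40, 3))

Z = linkage(points, method="average", metric="euclidean")

# Keep only the top of the tree: 5 leaves, each standing for a collapsed subtree.
layout = dendrogram(Z, truncate_mode="lastp", p=5, no_plot=True)
shown_labels = layout["ivl"]
```

Dropping `no_plot=True` draws the truncated tree with matplotlib instead of only returning the layout.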

## cluster with another method

In [None]:
ward = Hierarchy(data, method='ward', metric='euclidean')  # assumes Hierarchy accepts the keys shown in tree.params
plot.linkage(ward.links, 50, color_threshold=5)

In [None]:
confusion = crosstab(ward(3), data['species'])
plot.heat(confusion, cmap='Greys')
confusion

## help

In [None]:
help(Hierarchy)