# Introduction to Clustering  

### Overview

This session introduces the concept of clustering through an example.

Let's create a simple, easy-to-follow, "by hand" realization of k means clustering. We are going to simulate some dummy data. Suppose we have subjects/individuals and, for each person, a measurement for feature A and feature B.

In [2]:
import pandas as pd  # bring in the pandas module and give it an alias

data=pd.DataFrame({"individual":[1,2,3, 4, 5, 6, 7],
                   "A" : [1.0, 1.5, 3.0, 5.0, 3.5, 4.5, 3.5],
                   "B" : [1.0, 2.0, 4.0, 7.0, 5.0, 5.0, 4.5]})  # use a dictionary to define the columns and their entries

data

Unnamed: 0,A,B,individual
0,1.0,1.0,1
1,1.5,2.0,2
2,3.0,4.0,3
3,5.0,7.0,4
4,3.5,5.0,5
5,4.5,5.0,6
6,3.5,4.5,7


Suppose we want to group the dataset into two clusters. A reasonable way to pick starting points is to find the two individuals whose A and B measurements are farthest apart. Strictly, that calls for a Euclidean distance between every pair of individuals. As a quick shortcut, we instead compute the simple difference B - A for each individual and treat it as a length D. The individuals with the smallest and largest D become the seeds of our two clusters.

In [4]:
data["distance"]=data["B"]-data["A"]

data.head(7)

Unnamed: 0,A,B,individual,distance
0,1.0,1.0,1,0.0
1,1.5,2.0,2,0.5
2,3.0,4.0,3,1.0
3,5.0,7.0,4,2.0
4,3.5,5.0,5,1.5
5,4.5,5.0,6,0.5
6,3.5,4.5,7,1.0
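As a cross-check on the B - A shortcut, here is a minimal sketch that compares every pair of individuals using the true Euclidean distance. It is plain Python rather than pandas, and the names `points`, `farthest_pair`, and `farthest_distance` are our own, not part of the original notebook.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+

# Same (A, B) measurements as the `data` frame above, keyed by individual.
points = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}

# Compare every pair and keep the one with the largest Euclidean distance.
farthest_pair = max(
    combinations(points, 2),
    key=lambda pair: dist(points[pair[0]], points[pair[1]]),
)
farthest_distance = dist(points[farthest_pair[0]], points[farthest_pair[1]])

print(farthest_pair, round(farthest_distance, 2))
```

For this data the pair-by-pair check and the B - A shortcut pick the same two individuals, but that agreement is not guaranteed in general.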


We identified individuals 1 and 4 as being the furthest away from each other. These individuals will serve as our base clusters. We take their respective A and B values as each cluster's "mean vector centroid". Don't worry about what that means right now; we will go over the theory in more detail later.

In [5]:
data2=pd.DataFrame({"cluster":["cluster 1", "cluster 2"],
                   "individual" : [1, 4],
                  "mean vector centroid" : ["(1.0, 1.0)", "(5.0, 7.0)"]}) # use a dictionary to define the columns and their entries

data2.head()

Unnamed: 0,cluster,individual,mean vector centroid
0,cluster 1,1,"(1.0, 1.0)"
1,cluster 2,4,"(5.0, 7.0)"


What do we do now?
We need a way to use the two starting clusters to assign the remaining individuals/subjects in our data. To do that, we examine the other individuals one at a time and allocate each of them to a cluster.

For each individual, we check whether they are closer to the centroid of cluster 1 or cluster 2. Let's take individual 2. If we eyeball the data, individual 2 looks closer to individual 1 than to individual 4, so we assign individual 2 to the cluster built on individual 1 (cluster 1). We then average the members of that cluster to get a new mean vector centroid. Individual 1 has the vector (1.0, 1.0) and individual 2 has the vector (1.5, 2.0), so the new mean vector centroid is $(\frac{1.0+1.5}{2}, \frac{1.0+2.0}{2})=(1.25, 1.5)$, which the tables in this example show rounded to one decimal place. Cluster 1 now consists of individuals 1 and 2. We repeat the process for individuals 3, 5, 6, and 7: we measure the distance from each individual to both centroids, assign the individual to the nearer cluster, and recalculate that cluster's mean vector with every new assignment.

In [12]:
print("Cluster 1 Assignments")

data2=pd.DataFrame({"individual" : ["1", "1, 2", "1,2,3", "1,2,3","1,2,3","1,2,3"],
                  "mean vector centroid" : ["(1.0, 1.0)", "(1.2, 1.5)", "(1.8, 2.3)","(1.8, 2.3)","(1.8, 2.3)","(1.8, 2.3)"]}) # use a dictionary to define the columns and their entries

data2.head(7)

Cluster 1 Assignments


Unnamed: 0,individual,mean vector centroid
0,1,"(1.0, 1.0)"
1,"1, 2","(1.2, 1.5)"
2,"1,2,3","(1.8, 2.3)"
3,"1,2,3","(1.8, 2.3)"
4,"1,2,3","(1.8, 2.3)"
5,"1,2,3","(1.8, 2.3)"


In [13]:
print("Cluster 2 Assignments")

data2=pd.DataFrame({"individual" : ["4", "4", "4", "4,5","4,5,6", "4,5,6,7"],
                  "mean vector centroid" : ["(5.0, 7.0)", "(5.0, 7.0)", "(5.0, 7.0)","(4.2, 6.0)","(4.3, 5.7)", "(4.1, 5.4)"]}) # use a dictionary to define the columns and their entries

data2.head(7)

Cluster 2 Assignments


Unnamed: 0,individual,mean vector centroid
0,4,"(5.0, 7.0)"
1,4,"(5.0, 7.0)"
2,4,"(5.0, 7.0)"
3,"4,5","(4.2, 6.0)"
4,"4,5,6","(4.3, 5.7)"
5,"4,5,6,7","(4.1, 5.4)"


For each individual, we allocated them to the cluster whose mean was closest. Each time an individual was added to a cluster, that cluster's mean vector was recalculated. Now that we have gone through all of the individuals, we have two clusters with the following mean vector centroids.

In [14]:
data2=pd.DataFrame({"cluster":["cluster 1", "cluster 2"],
                   "individual" : ["1, 2, 3", "4, 5, 6, 7"],
                  "mean vector centroid" : ["(1.8, 2.3)", "(4.1, 5.4)"]}) # use a dictionary to define the columns and their entries

data2.head()

Unnamed: 0,cluster,individual,mean vector centroid
0,cluster 1,"1, 2, 3","(1.8, 2.3)"
1,cluster 2,"4, 5, 6, 7","(4.1, 5.4)"
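The sequential pass described above can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: the helper `centroid` and the names `points` and `clusters` are hypothetical, and individuals are visited in the order 2, 3, 5, 6, 7 as in the walkthrough.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Same (A, B) measurements as the `data` frame, keyed by individual.
points = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}

# Seed each cluster with one individual, as in the walkthrough.
clusters = {"cluster 1": [1], "cluster 2": [4]}


def centroid(members):
    """Mean of the (A, B) vectors of the given individuals."""
    return tuple(
        sum(points[m][axis] for m in members) / len(members)
        for axis in range(2)
    )


# Visit the remaining individuals in order. Each one joins the cluster with
# the nearer centroid, and that centroid is recomputed from the members so far.
for person in (2, 3, 5, 6, 7):
    nearest = min(
        clusters,
        key=lambda name: dist(points[person], centroid(clusters[name])),
    )
    clusters[nearest].append(person)

centroids = {name: centroid(members) for name, members in clusters.items()}
for name, members in clusters.items():
    print(name, members, tuple(round(c, 2) for c in centroids[name]))
```

Because the centroids move after every assignment, the result of a pass like this can depend on the order in which individuals are visited.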


We can check the distance between each individual and the mean vector centroid of each cluster, including the one they were assigned to.

In [16]:
data2=pd.DataFrame({"individual":[1, 2, 3, 4, 5, 6, 7],
                   "Distance to mean Centroid of Cluster 1" : [1.5, 0.4, 2.1, 5.7, 3.2, 3.8, 2.8],
                  "Distance to mean Centroid of Cluster 2" : [5.4, 4.3, 1.8, 1.8, 0.7, 0.6, 1.1]}) # use a dictionary to define the columns and their entries

data2.head(7)

Unnamed: 0,Distance to mean Centroid of Cluster 1,Distance to mean Centroid of Cluster 2,individual
0,1.5,5.4,1
1,0.4,4.3,2
2,2.1,1.8,3
3,5.7,1.8,4
4,3.2,0.7,5
5,3.8,0.6,6
6,2.8,1.1,7
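The distances in the table above were entered by hand. As a rough check, the sketch below recomputes them from the unrounded centroids of {1, 2, 3} and {4, 5, 6, 7}, so the figures may differ a little from the table. The names `points`, `centroids`, and `nearest` are our own.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Same (A, B) measurements as the `data` frame, keyed by individual.
points = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}

# Unrounded centroids: the means of individuals {1, 2, 3} and {4, 5, 6, 7}.
centroids = {1: (5.5 / 3, 7.0 / 3), 2: (16.5 / 4, 21.5 / 4)}

nearest = {}
for person, xy in points.items():
    # Distance from this individual to each cluster centroid.
    distances = {c: dist(xy, centre) for c, centre in centroids.items()}
    nearest[person] = min(distances, key=distances.get)
    print(person,
          {c: round(d, 2) for c, d in distances.items()},
          "-> cluster", nearest[person])
```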


For the most part, we assigned individuals to the right clusters, with the exception of individual 3. Individual 3 was placed in cluster 1 but is actually closer to the mean vector centroid of cluster 2. We can account for this by moving individual 3 to cluster 2 and recalculating the mean vector centroids.

In [17]:
data2=pd.DataFrame({"cluster":["cluster 1", "cluster 2"],
                   "individual" : ["1, 2", "3, 4, 5, 6, 7"],
                  "mean vector centroid" : ["(1.3, 1.5)", "(3.9, 5.1)"]}) # use a dictionary to define the columns and their entries

data2.head()

Unnamed: 0,cluster,individual,mean vector centroid
0,cluster 1,"1, 2","(1.3, 1.5)"
1,cluster 2,"3, 4, 5, 6, 7","(3.9, 5.1)"
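The updated centroids can be recomputed directly, and a second distance check suggests that no individual would move again. This is a minimal sketch in plain Python; `centroid`, `clusters`, and `stable` are hypothetical names introduced for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Same (A, B) measurements as the `data` frame, keyed by individual.
points = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}

# Assignments after moving individual 3 into cluster 2.
clusters = {"cluster 1": [1, 2], "cluster 2": [3, 4, 5, 6, 7]}


def centroid(members):
    """Mean of the (A, B) vectors of the given individuals."""
    return tuple(
        sum(points[m][axis] for m in members) / len(members)
        for axis in range(2)
    )


centroids = {name: centroid(members) for name, members in clusters.items()}

# The assignment is stable when every individual already sits in the
# cluster whose centroid is nearest to them.
stable = all(
    min(centroids, key=lambda name: dist(points[p], centroids[name])) == own
    for own, members in clusters.items()
    for p in members
)

print(centroids)
print("stable:", stable)
```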


At this point, we would repeat the process and form new partitions until no more relocations are possible. In principle we could keep refining and still not be totally satisfied with the assignments, but in our simple case the assignments should be pretty good. Don't be alarmed if this example was difficult to follow. There was a lot of hand waving, but in the end this process is essentially how clustering works. Let's examine clustering conceptually. (Original example: http://mnemstudio.org/clustering-k-means-example-1.htm)

## What is clustering? 

* Clustering is an analytical method under the unsupervised learning umbrella 

* Unsupervised learning is typically a grouping analysis performed on data without a class label (ex. trying to group transaction data into shopping habits) 

* Supervised learning is analysis performed on data with a class label (ex. given potential subscribers to some service, we have a label that identifies individuals as subscribed or not subscribed) 

* Clustering tries to group data into categories based on the nature of the features 

* Grouping structure is decided automatically 

## Why do we do Clustering? 

* Assigning a label to large data is costly 

* The data contents may not be known to its fullest extent 

* Clustering can identify features that are significant for future classification problems 

* Clustering allows one to examine the nature of the data 

* Clustering can uncover subclasses and additional similarities among subclasses 

## Clustering Type

* Hard clustering - Each data point either belongs to a cluster completely or not at all

* Soft clustering - Each data point is assigned a probability or likelihood of belonging to each cluster 
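To make the difference concrete, here is a small sketch that gives one data point both a hard and a soft assignment. The weighting used for the soft case (an exponential of the negative squared distance, normalized to sum to 1) is just one possible choice that we picked for illustration, and the point and centroids are borrowed from the opening example.

```python
from math import dist, exp

point = (3.0, 4.0)  # individual 3 from the opening example
centroids = {"cluster 1": (1.8, 2.3), "cluster 2": (4.1, 5.4)}

# Hard clustering: the point belongs entirely to its nearest centroid.
hard = min(centroids, key=lambda name: dist(point, centroids[name]))

# Soft clustering (illustrative assumption): closer centroids get larger
# weights, and the weights are normalized so they sum to 1.
weights = {name: exp(-dist(point, c) ** 2) for name, c in centroids.items()}
total = sum(weights.values())
soft = {name: w / total for name, w in weights.items()}

print("hard:", hard)
print("soft:", {name: round(p, 2) for name, p in soft.items()})
```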

There are numerous algorithms under the cluster analysis umbrella, each with its own applications. The hand-waving example we looked at in the start of the session was a type of clustering called k means, one of the most commonly used clustering methods. Let's take a deeper dive into k means. (Source: https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/)

## K Means Clustering 

K means clustering is an iterative approach that splits a data set into a predefined number of non-overlapping clusters. Each data point belongs to exactly one cluster, and the goal is to make the data points within a cluster as similar as possible while keeping the clusters as far apart from each other as possible. This is what we did in our initial example by using distances. Data points are assigned to clusters based on the squared distance between each data point and the mean vector centroid of each cluster, and we want the sum of these squared distances to be as small as possible. A desirable outcome is one where the data points in each cluster are homogeneous and lie as close as possible to the centroid of their assigned cluster.
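One common way to write the sum of squared distances described above is shown below. The symbols are our own shorthand rather than notation from this session: $k$ is the number of clusters, $C_j$ is the set of data points in cluster $j$, $x_i$ is a data point, and $\mu_j$ is the mean vector centroid of cluster $j$.

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$

Smaller values of $J$ mean the data points sit closer to the centroids of their assigned clusters.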

Now that we have some more context about the mechanics behind clustering, we can work on another example where we do the calculations ourselves. Let's say we have a dataset of students with their test scores.

In [26]:
data2=pd.DataFrame({"individual":["A","B" , "C", "D", "E", "F", "G", "H", "I"],
                   "score" : [5, 20, 11, 5, 3, 19, 30, 3, 15]}) # use a dictionary to define the columns and their entries

data2.head(10)

Unnamed: 0,individual,score
0,A,5
1,B,20
2,C,11
3,D,5
4,E,3
5,F,19
6,G,30
7,H,3
8,I,15


### Step 1: Split students into k clusters at random 

Since there are 9 students, let's pick three clusters, assign the students to them at random, and compute the mean of each group.

In [5]:
print("Group 1")

g1=pd.DataFrame({"individual":["A", "B", "C"],
                   "score" : [5, 20, 11]}) # use a dictionary to define the columns and their entries

print("Group 1 Mean Score= 12 ")

g1

Group 1
Group 1 Mean Score= 12 


Unnamed: 0,individual,score
0,A,5
1,B,20
2,C,11


In [6]:
print("Group 2")

g2=pd.DataFrame({"individual":["D", "E", "F"],
                   "score" : [5, 3, 19]}) # use a dictionary to define the columns and their entries

print("Group 2 Mean Score= 9 ")

g2

Group 2
Group 2 Mean Score= 9 


Unnamed: 0,individual,score
0,D,5
1,E,3
2,F,19


In [7]:
print("Group 3")

g3=pd.DataFrame({"individual":["G", "H", "I"],
                   "score" : [30, 3, 15]}) # use a dictionary to define the columns and their entries

print("Group 3 Mean Score= 16 ")

g3

Group 3
Group 3 Mean Score= 16 


Unnamed: 0,individual,score
0,G,30
1,H,3
2,I,15
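The group means printed in the cells above were typed in by hand. A minimal sketch like the following would compute them instead; `scores`, `groups`, and `group_means` are our own names, and the split is the one used in Step 1 rather than a fresh random draw.

```python
# Test scores for the nine students, as in the dataset above.
scores = {"A": 5, "B": 20, "C": 11, "D": 5, "E": 3,
          "F": 19, "G": 30, "H": 3, "I": 15}

# The Step 1 split: three groups of three students each.
groups = {1: ["A", "B", "C"], 2: ["D", "E", "F"], 3: ["G", "H", "I"]}

# Mean score of each group: total of the members' scores over the group size.
group_means = {
    g: sum(scores[s] for s in members) / len(members)
    for g, members in groups.items()
}

print(group_means)
```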


### Step 2: Reassign each student into a group with the closest mean

In our case, we want to reassign student A, for example, to group 2, because A's score of 5 is closest to the mean of group 2 (9). Applying the same check to every student gives the new groups below.

In [13]:
print("New Group 1")

g1=pd.DataFrame({"individual":["C"],
                   "score" : [11]}) # use a dictionary to define the columns and their entries

print("New Group 1 Mean Score= 11 ")

g1

New Group 1
New Group 1 Mean Score= 11 


Unnamed: 0,individual,score
0,C,11


In [14]:
print("New Group 2")

g2=pd.DataFrame({"individual":["A", "D", "E", "H"],
                   "score" : [5, 5, 3, 3]}) # use a dictionary to define the columns and their entries

print("New Group 2 Mean Score= 4 ")

g2

New Group 2
New Group 2 Mean Score= 4 


Unnamed: 0,individual,score
0,A,5
1,D,5
2,E,3
3,H,3


In [15]:
print("New Group 3")

g3=pd.DataFrame({"individual":["G", "I", "B", "F"],
                   "score" : [30, 15, 20, 19]}) # use a dictionary to define the columns and their entries

print("New Group 3 Mean Score= 21 ")

g3

New Group 3
New Group 3 Mean Score= 21 


Unnamed: 0,individual,score
0,G,30
1,I,15
2,B,20
3,F,19
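Step 2 can be sketched as a single pass over the students, sending each one to the group with the closest mean. This is an illustration in plain Python with our own names (`scores`, `means`, `new_groups`, `new_means`), starting from the Step 1 means of 12, 9, and 16.

```python
# Test scores for the nine students, as in the dataset above.
scores = {"A": 5, "B": 20, "C": 11, "D": 5, "E": 3,
          "F": 19, "G": 30, "H": 3, "I": 15}

# Group means from Step 1.
means = {1: 12, 2: 9, 3: 16}

# Send each student to the group whose mean is closest to their score.
new_groups = {g: [] for g in means}
for student, score in scores.items():
    closest = min(means, key=lambda g: abs(score - means[g]))
    new_groups[closest].append(student)

# Recompute each group's mean from its new members.
# (This sketch assumes no group ends up empty, which holds for this data.)
new_means = {
    g: sum(scores[s] for s in members) / len(members)
    for g, members in new_groups.items()
}

print(new_groups)
print(new_means)
```

With these inputs the mean of the new group 3 works out to (30 + 15 + 20 + 19) / 4 = 21.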


### Step 3: Repeat Step 2 until the group mean no longer changes

In [16]:
print("New New Group 1")

g1=pd.DataFrame({"individual":["C", "I"],
                   "score" : [11, 15]}) # use a dictionary to define the columns and their entries

print("New New Group 1 Mean Score= 13 ")

g1

New New Group 1
New New Group 1 Mean Score= 13 


Unnamed: 0,individual,score
0,C,11
1,I,15


In [17]:
print("New New Group 2")

g2=pd.DataFrame({"individual":["A", "D", "E", "H"],
                   "score" : [5, 5, 3, 3]}) # use a dictionary to define the columns and their entries

print("New New Group 2 Mean Score= 4 ")

g2

New New Group 2
New New Group 2 Mean Score= 4 


Unnamed: 0,individual,score
0,A,5
1,D,5
2,E,3
3,H,3


In [18]:
print("New New Group 3")

g3=pd.DataFrame({"individual":["G", "B", "F"],
                   "score" : [30, 20, 19]}) # use a dictionary to define the columns and their entries

print("New New Group 3 Mean Score= 23 ")

g3

New New Group 3
New New Group 3 Mean Score= 23 


Unnamed: 0,individual,score
0,G,30
1,B,20
2,F,19
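Step 3 amounts to repeating the reassignment until the groups stop changing. The loop below is a rough sketch of that idea, starting from the Step 1 split. The helpers `group_means` and `reassign` and the cap of 100 passes are our own assumptions, and the sketch relies on no group ever becoming empty, which holds for this data.

```python
# Test scores for the nine students, as in the dataset above.
scores = {"A": 5, "B": 20, "C": 11, "D": 5, "E": 3,
          "F": 19, "G": 30, "H": 3, "I": 15}

# Start from the Step 1 split.
groups = {1: ["A", "B", "C"], 2: ["D", "E", "F"], 3: ["G", "H", "I"]}


def group_means(groups):
    """Mean score of each group (assumes every group has a member)."""
    return {g: sum(scores[s] for s in members) / len(members)
            for g, members in groups.items()}


def reassign(means):
    """Place every student in the group whose mean is closest."""
    new_groups = {g: [] for g in means}
    for student, score in scores.items():
        closest = min(means, key=lambda g: abs(score - means[g]))
        new_groups[closest].append(student)
    return new_groups


passes = 0
for _ in range(100):  # safety cap on the number of passes
    new_groups = reassign(group_means(groups))
    if new_groups == groups:  # nobody moved, so the means no longer change
        break
    groups = new_groups
    passes += 1

final_means = group_means(groups)
print(passes, groups, final_means)
```

For this data the groups settle after two reassignment passes, matching the "New New" groups shown above.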


## Summary 

We introduced the idea of clustering through examples that highlighted the iteration required to perform basic clustering. Let's summarize our basic k means clustering.

* Choose the number of groups (clusters) k 

* Randomly assign data points to a cluster and find the mean centroid of each cluster 

* Compute the squared distance between each data point and each centroid 

* Assign each data point to the closest cluster 

* Recalculate the centroid mean of each cluster 

* Repeat until the clusters stop changing or a maximum number of iterations is reached 
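To see why these steps help, we can track the sum of squared distances between the students' scores and their group means across the three stages of the test score example. This is a sketch under our own naming (`sum_squared_distance`, `stages`, `totals`); the groupings are copied from Steps 1 to 3 above.

```python
# Test scores for the nine students, as in the test score example.
scores = {"A": 5, "B": 20, "C": 11, "D": 5, "E": 3,
          "F": 19, "G": 30, "H": 3, "I": 15}


def sum_squared_distance(groups):
    """Total squared distance between each score and its group's mean."""
    total = 0.0
    for members in groups:
        centroid = sum(scores[s] for s in members) / len(members)
        total += sum((scores[s] - centroid) ** 2 for s in members)
    return total


# Groupings copied from the three steps of the example.
stages = {
    "Step 1 (initial split)": [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]],
    "Step 2 (first reassignment)": [["C"], ["A", "D", "E", "H"], ["G", "I", "B", "F"]],
    "Step 3 (second reassignment)": [["C", "I"], ["A", "D", "E", "H"], ["G", "B", "F"]],
}

totals = {label: sum_squared_distance(groups) for label, groups in stages.items()}
for label, total in totals.items():
    print(label, total)
```

The total shrinks at every stage, which is the behavior we want from the reassignment and recalculation steps.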

## For Next Time...

We will look at an example on how to do k means in SQL. 

## Homework 

Using a pen and paper, perform k means on the dataset below following the steps shown in the summary. Iterate at most two times. Recall that clusters do not have to have the same number of data points. 

In [4]:
data2=pd.DataFrame({"individual":["A","B" , "C", "D", "E", "F", "G", "H"],
                   "score" : [1, 4, 11, 5, 2, 18, 12, 3]}) # use a dictionary to define the columns and their entries

data2.head(10)

Unnamed: 0,individual,score
0,A,1
1,B,4
2,C,11
3,D,5
4,E,2
5,F,18
6,G,12
7,H,3
