# K-Means Clustering 
## Introduction
K-Means clustering is an algorithm for partitioning a data set into k groups (clusters), where k is specified by the user. When the algorithm finishes, each of the n data points belongs to exactly one of the k groups.

## Highlight of the Algorithm
The algorithm takes a data set and the number of groups ($k$) that the data will be split into. We start by initializing $k$ means randomly, either by picking random locations on the graph or by randomly picking $k$ points from the data set. Next, we go through each point in the data set and assign it to the mean it is closest to. Once every point is assigned, we iterate through each mean, collect all the points assigned to it, and compute the new mean as the average of those points. After recomputing all $k$ means, we check whether they are the same as before. If they are not, we repeat the process, starting from the assignment of points to means, and continue until the means no longer change.

Waiting until every mean is completely unchanged may take too long, so you can bound the run time with a stopping condition. The simplest one is a maximum number of iterations, after which the algorithm stops. We can also adjust how the recomputed means are checked for changes (explained more later).

The following picture shows how the K-Means clustering algorithm works over 2 iterations. The points A1-A8 are the data points from the data set, while each red "X" represents the mean of a cluster. Since there are only 3 red "X"s, the data can be split into only 3 groups, so $k = 3$ in this situation.

![title](k-means-4.gif)

From the picture above, we can see that the assignment of data points to means changed after one main iteration. Initially, A4 was part of the top cluster, but after the means were recomputed, it was clustered with the mean to the right. It is common for points to change clusters between iterations, since the means keep moving until they settle in a stable spot. Notice also how the means (the red X's) moved from one iteration to the next. This comes from the recompute-means step of the algorithm.


In [203]:
import pandas as pd
import numpy as np
import random
import math

To recap, our code should have this general format:

    Pick k random means (can be from the original data set)
    For each point in the data set
        Get the min distance between the point and each mean
        Mark the point with the mean it is closest to
    For each mean
        Compute the new mean using the points that belong to the cluster
    If new means have changed/number of max iterations not met run the first for loop again
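
The outline above can be sketched compactly in plain Python. This is only a minimal illustration under stated assumptions, not the implementation built in the rest of this lesson: the function name `k_means_sketch`, the `max_iters` and `seed` parameters, and the toy points are all made up for the example.

```python
import math
import random


def k_means_sketch(points, k, max_iters=50, seed=0):
    """Cluster 2D points into k groups; returns (means, clusters)."""
    rng = random.Random(seed)
    # Pick k distinct data points as the initial means
    means = rng.sample(sorted(set(points)), k)
    clusters = []
    for _ in range(max_iters):
        # Assignment step: attach every point to its closest mean
        clusters = [[] for _ in range(k)]
        for (x, y) in points:
            dists = [math.hypot(x - mx, y - my) for (mx, my) in means]
            clusters[dists.index(min(dists))].append((x, y))
        # Update step: move each mean to the average of its points
        new_means = []
        for mean, members in zip(means, clusters):
            if members:
                n = float(len(members))
                new_means.append((sum(p[0] for p in members) / n,
                                  sum(p[1] for p in members) / n))
            else:
                new_means.append(mean)  # an empty cluster keeps its old mean
        if new_means == means:  # stop once no mean has moved
            break
        means = new_means
    return means, clusters


# Hypothetical toy data: two well-separated groups of three points
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
means, clusters = k_means_sketch(points, k=2)
print(sorted(means))
```

On this toy data the sketch separates the three points near the origin from the three points near (10, 10). On less clearly separated data the result can depend on the random initial means.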
    

# K-Means on 2D Plane

We will now break the algorithm into parts to make it easier to understand. For this lesson, we will perform K-Means on a 2D plane (points with only x and y coordinates). On a 2D plane, we can measure how close a point is to a mean using the Euclidean distance (explained below).

## Finding Best Mean
We will first look at a helper function that finds the best mean for a given point from the data we would like to cluster. The function iterates through all the current means and computes the Euclidean distance from the given point to each one. It then returns the mean that has the shortest distance to the point.
The code below reflects this description.

Recall that the Euclidean distance between two points ($x_1$,$y_1$) and ($x_2$,$y_2$) is:
$$\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$$
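
As a quick check of the distance computation, here is a small sketch that measures the Euclidean distance between two points and then picks the nearest mean from a list. The helper names `euclidean` and `nearest_mean` are made up for this illustration and are not part of the lesson's code.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two 2D points p and q."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


def nearest_mean(point, means):
    """Return the mean with the smallest Euclidean distance to point."""
    return min(means, key=lambda m: euclidean(point, m))


means = [(0, 0), (2, 3), (5, 5), (60, 20)]
print(euclidean((3, 4), (2, 3)))      # sqrt(2), about 1.414
print(nearest_mean((3, 4), means))    # (2, 3)
print(nearest_mean((7, 8), means))    # (5, 5)
```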

In [204]:
def computeBestMean(x, y, means):
    """
    Computes the closest point to the coordinate passed in from the list of means.
    x : (int) x-coordinate
    y: (int) y-coordinate
    means: (pandas.Series) Series of all the current computed means
    Output: 
        (int,int) The closest mean to the point
    """
    (initx, inity) = means[0]
    xchange = (x-float(initx))**2
    ychange = (y-float(inity))**2
    bestDist = math.sqrt(xchange + ychange)
    bestM = (initx,inity)
    for it in means:
        (z1,z2) = it
        xch = (x-float(z1))**2
        ych = (y-float(z2))**2
        cur = math.sqrt(xch + ych)
        if cur < bestDist:
            bestM = it
            bestDist = cur
    return bestM

In [205]:
#Test computeBestMean
initM = [(0,0), (2,3), (5,5), (60,20)]
cluster = pd.DataFrame({'mean': initM})
cluster['points'] = initM
print(computeBestMean(3,4, cluster['mean']))
print(computeBestMean(7,8, cluster['mean']))

(2, 3)
(5, 5)


![title](figure_1.png) 

In this image, the blue points are the means and the red points are the points whose closest mean we are computing. You can clearly see which mean each red point is closest to.

## Add Point to Cluster
We will now create a function called $update$ that adds a point to the mean it is closest to (as computed by $computeBestMean$) in our cluster representation. The function takes 3 parameters: the data frame representing the cluster structure, the mean the given point is closest to, and the point itself. The cluster data frame has 2 columns: $mean$, which represents the middle of the cluster, and $points$, which contains all the points that are assigned to that particular cluster.

In [206]:
def update(clust, me, pt):
    """
    Adds the given point to the cluster at the specified mean
    clust: (pandas.DataFrame) cluster structure that has two columns: mean and points
    me: (int, int) The mean that the point is associated with
    pt: (int, int) The point we need to add to the cluster structure
    Output:
        (pandas.DataFrame) cluster with point added to its particular spot
    """
    for i, row in clust.iterrows():
        if (row['mean'] == me):
            # Write the cell with .at; chained indexing such as
            # clust['points'][i] = ... may modify a temporary copy instead
            clust.at[i, 'points'] = row['points'] + [pt]
    return clust

In [208]:
#Test update
cluster = pd.DataFrame({'mean': initM})
cluster['points'] = initM
temp = cluster['points'].apply(lambda x: [])
cluster['points'] = temp
output = computeBestMean(3,4, cluster['mean'])
print("original cluster:")
print(cluster)
clust = update(cluster, output, (3,4))
clust = update(cluster, computeBestMean(8,2, cluster['mean']), (8,2))
clust = update(cluster, computeBestMean(6,4, cluster['mean']), (6,4))
clust = update(cluster, computeBestMean(20,2, cluster['mean']), (20,2))
clust = update(cluster, computeBestMean(50,5, cluster['mean']), (50,5))
print("final updated cluster")
print(clust)

original cluster:
       mean points
0    (0, 0)     []
1    (2, 3)     []
2    (5, 5)     []
3  (60, 20)     []
final updated cluster
       mean                     points
0    (0, 0)                         []
1    (2, 3)                   [(3, 4)]
2    (5, 5)  [(8, 2), (6, 4), (20, 2)]
3  (60, 20)                  [(50, 5)]


## Recompute Means
Recall in the K Means algorithm that at the end of each iteration, we recompute the means after assigning all points to a mean. 

The following function, $recomputeMeans$, takes a cluster represented as a data frame with a $mean$ column and a $points$ column. Each entry in the $points$ column is a list of tuples, where each tuple is a point assigned to that mean. For each row, the function computes the average of all the points in the list and stores it as the new mean in the cluster data frame. Rows whose list is empty keep their old mean. This way, the next time we use the cluster representation, it will have the updated means in the $mean$ column.
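
To make the averaging step concrete, here is a small standalone sketch that averages a list of (x, y) tuples coordinate by coordinate. The helper name `centroid` is made up for this illustration.

```python
def centroid(points):
    """Average a non-empty list of (x, y) tuples coordinate by coordinate."""
    n = float(len(points))
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# x average: (8 + 6 + 20) / 3, y average: (2 + 4 + 2) / 3
print(centroid([(8, 2), (6, 4), (20, 2)]))  # roughly (11.33, 2.67)
```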


In [209]:
def recomputeMeans(clust):
    """
    Recalculates the means for the cluster structure depending on the 
    points that were grouped to that mean
    clust: (pd.DataFrame) Cluster data frame with 2 columns: mean, points
    Output:
        (pd.DataFrame) cluster data frame with recalculated means
    """
    for i,row in clust.iterrows():
        cur = row['points']
        if (len(cur) > 0):
            # Average the x and y coordinates separately
            x = [float(p[0]) for p in cur]
            y = [float(p[1]) for p in cur]
            newX = sum(x)/len(x)
            newY = sum(y)/len(y)
            newM = (newX,newY)
            # Write the cell with .at rather than chained indexing
            clust.at[i, 'mean'] = newM
    return clust

In [210]:
#Test recomputeMeans
recom = recomputeMeans(clust)
print(recom)

                             mean                     points
0                          (0, 0)                         []
1                      (3.0, 4.0)                   [(3, 4)]
2  (11.3333333333, 2.66666666667)  [(8, 2), (6, 4), (20, 2)]
3                     (50.0, 5.0)                  [(50, 5)]


## Clear Points
This function resets every entry in the points column of the data frame to an empty list. The only significant values left are those in the mean column of the cluster data frame.

This function will be used later in the main algorithm when we have updated the means and want to recategorize the points with the new means that we calculated.

In [211]:
def clearVals(clust):
    """
    Clears all the points in the points column of the cluster data frame
    clust: (pandas.DataFrame) Cluster data frame with mean and points columns
    Output:
        (pandas.DataFrame) Cluster data frame with mean and points column
    """
    temp = clust['points'].apply(lambda x: [])
    clust['points'] = temp
    return clust

In [212]:
#Test clearVals
print(clust)
newclust = clearVals(clust)
print(newclust)

                             mean                     points
0                          (0, 0)                         []
1                      (3.0, 4.0)                   [(3, 4)]
2  (11.3333333333, 2.66666666667)  [(8, 2), (6, 4), (20, 2)]
3                     (50.0, 5.0)                  [(50, 5)]
                             mean points
0                          (0, 0)     []
1                      (3.0, 4.0)     []
2  (11.3333333333, 2.66666666667)     []
3                     (50.0, 5.0)     []


## Changing Means
The function $changingMeans$ is used to see if there is a difference between the old means and the recomputed means. The function returns 0 if there is no difference; otherwise it returns 1.

In [213]:
def changingMeans(old, cur):
    """
    Checks to see if the points in old are different from the points in cur
    old: (pandas.Series) tuples of ints
    cur: (pandas.Series) tuples of ints
    Output:
        0 if all respective points are the same between old and cur, else 1
    """
    for i in range (old.size):
        if ((old[i][0] != cur[i][0]) or (old[i][1] != cur[i][1])):
            return 1
    return 0
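
Since recomputed means are floating-point values, an exact equality test like the one in $changingMeans$ can keep reporting a change when a mean moves by only a tiny amount. One possible alternative, shown here only as a sketch, is to compare within a tolerance. The name `means_changed` and the `tol` parameter are assumptions made for this example.

```python
def means_changed(old, cur, tol=1e-9):
    """Return True if any mean moved by more than tol in x or y."""
    for (ox, oy), (cx, cy) in zip(old, cur):
        if abs(ox - cx) > tol or abs(oy - cy) > tol:
            return True
    return False


print(means_changed([(0.0, 0.0)], [(0.0, 1e-12)]))  # False
print(means_changed([(0.0, 0.0)], [(0.5, 0.0)]))    # True
```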

## Combine All Parts
We can now combine all of the helper functions we created to be used in the K Means algorithm. Recall that the algorithm follows this format: 

    Pick k random means (can be from the original data set)
    For each point in the data set
        Get the min distance between the point and each mean
        Mark the point with the mean it is closest to
    For each mean
        Compute the new mean using the points that belong to the cluster
    If new means have changed/number of max iterations not met run the first for loop again

In [214]:
def kMeansOn2D(data, k, numiters):
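    """
    Runs K-Means on 2D data and returns the resulting clusters.
    data: (pandas.DataFrame) Data frame with "X-coord" and "Y-coord" columns
    k: (int) Number of clusters needed
    numiters: (int) Maximum number of iterations to run
    Output:
        (pandas.DataFrame) cluster data frame with 2 columns: mean, points
    """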
    initMeans = []
    it = 0
    df = data.copy()
    
    #Initialize the k means by randomly picking k values from the data points
    while (it < k):
        s = random.randint(0, len(data)-1)
        x = data["X-coord"][s]
        y = data["Y-coord"][s]
        if (not (x,y) in initMeans):
            initMeans.append((x,y))
            it = it+1
            
    #Create the initial cluster data frame
    curIter = 0
    cluster = pd.DataFrame({'mean': initMeans})
    cluster['points'] = initMeans
    temp = cluster['points'].apply(lambda x: [])
    cluster['points'] = temp
    prevM = cluster['mean']
    stable = 0
    stableprev = 0

    #True K-Means Algorithm
    while ((curIter < numiters) ):
        cluster = clearVals(cluster)
        
        #Compute closest mean for all points in the data file and add it to the cluster data frame
        for i in range(0,len(data)):
            x1 = data["X-coord"][i]
            y1 = data["Y-coord"][i]
            me = computeBestMean(float(x1),float(y1), cluster['mean'])
            cluster = update(cluster, me, (x1,y1))
        cluster1 = cluster.copy()
        prevM = cluster1['mean']
        
        #Recompute the means
        cluster = recomputeMeans(cluster)
        
        #Check if means have changed
        if (changingMeans(prevM, cluster['mean']) == 0):
            if (stableprev == 0):
                stableprev = 1
            else:
                curIter = numiters + 1
        else:
            stableprev = 0
            curIter = curIter + 1
    return cluster

In [215]:
#Test K-Means
d = pd.read_csv("coordinates.csv", na_filter = False)
kMeansOn2D(d, 6, 50)

                               mean                                             points
0  (-21.8142414861, -207.941176471)  [(-59, -148), (-159, -307), (-55, -58), (20, -...
1           (312.292857143, 20.825)  [(372, -137), (89, 135), (109, -38), (351, -11...
2   (312.729508197, -342.053278689)  [(377, -185), (284, -204), (386, -368), (390, ...
3    (-248.203557312, 278.90513834)  [(-144, 478), (-128, 279), (-441, 47), (-116, ...
4    (245.459283388, 346.179153094)  [(381, 364), (426, 236), (278, 426), (46, 438)...
5  (-338.463126844, -245.247787611)  [(-257, -328), (-326, -172), (-464, -139), (-4...


Notice in the code above that I introduced a $stableprev$ variable. I used this to ensure that the means do not change for at least 2 iterations of the K-Means algorithm. This way we know that the means we have calculated are truly stable.

# Conclusion
K-Means does not have to be written by hand in Python: the scikit-learn library (a third-party package, not part of the standard library) provides a ready-made implementation. It is easier to use if you would like a quick implementation of K-Means. You can read more about it here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
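
As a rough illustration of the scikit-learn route, the sketch below clusters a small made-up set of 2D points. It assumes scikit-learn and NumPy are installed. The toy data and the choice of `n_clusters=2` are assumptions for the example, not values from this lesson.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated groups of 2D points (made up for illustration)
X = np.array([[1, 1], [1, 2], [2, 1], [10, 10], [10, 11], [11, 10]])

# n_init and random_state are set explicitly so the run is repeatable
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(model.cluster_centers_)  # one row per cluster mean
print(model.labels_)           # cluster index assigned to each point
```

Here `cluster_centers_` plays the role of the $mean$ column in our data frame, and `labels_` records which cluster each point was assigned to.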

Recall that the above implementation focuses on using K-Means on a 2D plane. However, the algorithm can be used on all kinds of data sets depending on how you represent the data and the distance technique you would like to use. For example, if you would like to use K-Means clustering on categorical data, you can use the K-modes algorithm. You can find further information on K-Modes here: https://arxiv.org/ftp/cs/papers/0603/0603120.pdf
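
To give a feel for how the two K-Means ingredients change for categorical data, the sketch below shows the pieces that K-modes swaps in: a simple matching dissimilarity instead of Euclidean distance, and a per-attribute mode instead of an average. This is not a full K-modes implementation. The function names and sample records are made up for illustration.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def mode_of(records):
    """Build a cluster 'center' from the most common value of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


records = [("red", "small"), ("red", "large"), ("blue", "large")]
print(matching_dissimilarity(records[0], records[2]))  # 2
print(mode_of(records))                                # ('red', 'large')
```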
