## Introduction

In this tutorial, we're going to discuss some popular clustering techniques in data mining, particularly focusing on the most popular iterative centroid-based divisive algorithm--k-means clustering algorithmfor(for grouping unlabeled items). Furthermore, we'll introduce a more efficient version of k-means clustering called bisecting k-Means clustering. Finally, we'll compare the k-menas clustering with the bisecting k-means clustering.


### Tutorial content

We will cover the following topics in this tutorial:
- [Installing the libraries](#Installing-the-libraries)
- [Introduction of k-means Clustering](#Introduction-of-k-means-Clustering) 
- [k-means algorithm support functions](#k-means-algorithm-support-functions)
- [k-mean Clustering Algorithm](#k-mean-Clustering-Algorithm)
- [Bisecting k-Means Clustering](#Bisecting-k-Means-Clustering)
- [Bisecting k-Means Clustering Algorithm](#Bisecting-k-Means-Clustering-Algorithm)
- [Plot Test Results & Comparison](#Plot-Test-Results-&-Comparison)

Note: In this tutorial, we will introduce clustering analysis in Python

## Installing the libraries

Before getting started, you'll need to install the various libraries that we will use.

In [25]:
import numpy as np
from numpy import *
import math
from random import sample 
import matplotlib.pyplot as plt 

## Introduction of k-means Clustering 
Firstly, we are going to study one type of the most wildely used clustering technique, which called k-means, but what's clustring? 

### Cluster identification

Before we get into k-means clustering, we are going to brief introduce cluster analysis at first. Basically, cluster analysis divides data set into groups(clusters) for the further meaningful data summarization.

Clustering (also called unsupervised classification) is a type of unsupervised learning that forms cluster of similar things automatically. It's like an automatic classification, but in classification we know what we're looking for, however in clustering we don't. To say it more precisly, the difference between clustering and classification is that without having the predefined classes, clustering can also produce the same result as classfication. 


### General idea of k-means clustering algorithm

k-means is an algorithm that will find k clusters for a given dataset. We need to define the number of clusters k, and each cluster is described by a single point known as the Center (or called Centroid) of all the points belongs to that clusters.

Next, we find the closest center for each point, and assign the points to their corresponding cluster. After that, the centers are all updated by taking the mean value of all the points in that cluster. We iterate until none of the data points changes its cluster.


#### The goal of k-means clustering 
Goal:  Find a set of centers $\{c_1...c_n\}\in(\mathbb{R})^d$ that minimize the k-mean objective:

$\sum_{j=1}^{n} \displaystyle\min_{\forall i \in \{1,...,k\}} \|x_j - c_i\|^2$ 

#### Input 
The dataset, and the number of clusters to generate are the only two parameters required.

1. Given a data set $S$ = $\{x_1...x_n\}\subset(\mathbb{R})^d$ of n points in d-dimensional space. 

2. k, the numbers of clusters we want to find.


### k-means algorithm support functions

Now, let's set some helper functions that we'll need to use for k-means algorithm.
For ease of use in later function, we  the first functionm load_DataSet( ), which helps us to load text file of tab-delimited floats lines into a list of floats.
The next function dist_Euclidian( ), calculates the Euclidean distance of points and cluster centers.
Finally, the last function is random_Center( ), it helps us to create a set of k centers randomly for a given dataset.


In [16]:
def load_DataSet(filename):
    """
    loads a text file containing tab-delimited floats lines into floats
    Each lists is appended to a list called data

    Args:
        text: string containing space-separated words, on which to compute

    Returns: 
        data: a list of lists of floats

    """
    data = []              
    with open(filename) as f:
        for line in f.readlines():
            sp_Line = line.strip().split('\t')
            fltLine = map(float, sp_Line)   #map all elements to float()
            data.append(fltLine)
        return data
    
#class k_Means():
def dist_Euclidian (A, B):
    """
    Calculate Euclidean distance between A, B
    
    Args: A (vector of points), B (vector of centers)

    Return: the Euclidean distance between A, B
    """
    total = 0.0
    for i in range(len(B)):
        total += (A[i] - B[i]) ** 2
    return sqrt(total)


def random_Center(dataSet, k):
    """
    Create a set of random k centers with given dataset. 

    Args: data (list of lists of floats) n by p 

    Return: centers (the positions of centers)
    """
    centers = []
    n = len(data)
    rand_indexes = sample(range(0, n), k)
    for rand_index in rand_indexes:
            centers.append(data[rand_index])
    return centers

In [17]:
# First, we create data from a test_data.txt file.
data = load_DataSet('test_data.txt')
#print data
#data_matrix = np.mat(data_matrix)
# Let's see what the min and max values in our matrix
# print  min(data_matrix[:,0])
# print  min(data_matrix[:,1])
# print  max(data_matrix[:,0])
# print  max(data_matrix[:,1])
# Now, let's check if random_Center() works well, which can give us a value between min and max
print random_Center(data, 2)


[[2.15241297194665, 3.1385639614095], [0.851774597928675, 3.21423861525527]]


## k-mean Clustering Algorithm

Now, we're ready to implement the full k-means algorithm. The algorithm will create k centers of clusters to us, then we assign each point to their closest center, and recalculate the centers. We'll iterate this process until none of the data points changing its cluster, in other words, the centers of clusters stop changing. 

By checking this stop condition, we set up a map called Membership, whcih is a dictionary maps index of cluster (key) to list of index of points that belongs to this cluster (value). Then, we create a function called $equalMembership()$ for comparing membership and the next membership for each cluster center k. If $equalMembership()$ returns False, which meas the cluster center changed, we update the membership by the next membership and continue iterating. After that, we use function getCenterFromMembership() to get new cluster centers, then use function $getMembershipFromCenter()$ to get the next membership by the latest new cluster centers. We iterate until the function $equalMembership()$ returns True, then the iteration stops, and we find the optimal center for points in each cluster. 

#### Steps of k-mean Clustering Algorithm
    Step 1: Generated a set of random k centers as the initial centers for given dataset.
    
    Step 2: Created k clusters by assign each point a center index of their nearest center
    
    Step 3: Updated the new center of each cluster with recomputing the mean of all points in each cluster.
    
    Step 4: Repeated Step 2 and Step 3 until convergence has been reached, which the centers of each clusters stop changed.


In [18]:
def getCenterIndex(point, center):
    """
    Assign each point a center index of their nearest center, which has the minimum distance to each point. 
    Args: point, center(list of ints)
    return: center_index(list of ints)
       
    """
    min_center_index = 0;
    min_dist = dist_Euclidian(point, center[0])
    for i in range(len(center)):
        dist = dist_Euclidian(point, center[i])
        if dist < min_dist:
            min_center_index = i
    return min_center_index


def getMembershipFromCenter(center, data):
    """
    membership: map index of cluster to list of index of points that belongs to this cluster
    Args: 
        data(list of lists of floats): input data 
        center(list of ints)
    return: membership: {index of cluster : [indexes of points]}
            e.g. {0: [4], 1: [0, 1, 2, 6, 7, 8], 2: [3, 5]}
    """ 
    membership = {} 
    for i in range(len(data)):
        point = data[i]
        cluster_index = getCenterIndex(point, center)
        if not cluster_index in membership:
            membership[cluster_index] = []
        membership[cluster_index].append(i)
    return membership

def getCenterFromMembership(data, k, membership):
    """
    From memembership to get centers
    Args:
        data(list of lists of floats): input data
        k: number of centers(int)
        membership: {center_index : [member_index]}
    return: center(list of ints)
    """ 
    p = len(data[0])
    center = [0] * k
    for center_index in membership:
        members = membership[center_index]
        point = [0.0] * p
        for member_index in members:
            for i in range(p):
                point[i] += data[member_index][i]
        for i in range(p):
            point[i] /= len(members)
        center[center_index] = point
    for i in range(len(center)):
        if center[i] == 0:
            center[i] = random_Center(data, 1)[0]
    return center

def equalMembership(k, membership1, membership2):
    """
    Comparing the equivalent of membership1 and membership2 
    Args:
        k: number of centers(int)
        membership1/membership1 : {center_index : [member_index]}
    return: True or False
    """       
    
    for i in range(k):
        if i in membership1 and i in membership2:
            set1 = set(membership1[i])
            set2 = set(membership2[i])
            if not set1 == set2:
                return False
        elif (i not in membership1) and (i not in membership2):
            continue
        else:
            return False
    return True
    
def kMeans(data, k):
    """
    Get k cluster centers
    Args: 
        data(list of lists of float): input data
        k(int): number of centers
        error(float): stop condition: stop if adjacent loop 
    return: 
        center (list): a list of the center points for each cluster
        membership (dictionary): the key of dictionary is the index of centers
                                 the value of the dictionary is the index of 
                                 sample points in that corresponding cluster
                                 
    """
    center = random_Center(data, k)
    membership = getMembershipFromCenter(center, data)
    center = getCenterFromMembership(data, k, membership)
    next_membership = getMembershipFromCenter(center, data)
    while not equalMembership(k, membership, next_membership):
        membership = next_membership
        center = getCenterFromMembership(data, k, membership)
        next_membership = getMembershipFromCenter(center, data)
    return center, membership

### Notes: Counter Cases:
There are some counter cases we may want to be careful.
1. If the initial cluster center (since we assigned them randomly at the beginning) is empty, or the cluster center has no points associated with it, we replace those centers with random data point.

2. In equalMembership$()$ function, if the cluster center is not in the keys of membership1 and membership2, we just continue our loop.

In [19]:
# Let's first test our code with a small sample data below
data_sample = [[1,2],[2,1],[1,1],[40,60], [45,55], [50, 55],[100, 1], [99, 2], [101,2]]
print kMeans(data_sample, 3)

([[1.5, 1.5], [1.0, 1.0], [72.5, 29.166666666666668]], {0: [0, 1], 1: [2], 2: [3, 4, 5, 6, 7, 8]})


For our implementation, it returns a random output like this:

[45.0, 56.666666666666664], [1.3333333333333333, 1.3333333333333333], [100.0, 1.6666666666666667]]

In [27]:
# Now, un our codes with a bigger dataset (900*2)
centers, membership = kMeans(data, 9)
print centers

[[3.3204363663366423, 0.9146584246494122], [3.249033097660742, 1.1100367428845483], [2.98946384182756, 2.96177942022594], [2.01068650551852, 2.98071350711951], [3.09795819535077, 0.547562132211542], [2.996460448222953, 0.989851938205289], [2.3919482029177717, 0.9658627114439073], [3.057193952447126, 1.9048015458061252], [1.7573315142176673, 2.204610363255312]]


In [None]:
The output of center  ([[3.052589063796065, 0.5874453529235966], [2.09104375631811, 2.85262197068138], [2.92587771564402, 1.20352759495764], [0.795375667956239, 3.12770461334811], [3.31454020304876, 0.768878195112492], [2.915282063214967, 0.7569860378230464], [3.0394499547178193, 1.0200372273592586], [2.333689669568666, 1.1182835132461506], [1.8283460189205516, 2.256770827356669]],

For our implementation, this returns the following $center$ points (k = 9):  

$[[3.052589063796065, 0.5874453529235966], [2.09104375631811, 2.85262197068138], [2.92587771564402, 1.20352759495764], [0.795375667956239, 3.12770461334811], [3.31454020304876, 0.768878195112492], [2.915282063214967, 0.7569860378230464], [3.0394499547178193, 1.0200372273592586], [2.333689669568666, 1.1182835132461506], [1.8283460189205516, 2.256770827356669]]$

The result of $membership$ should be like this format:

$\{0: [601, 630, 669, 699], 1:[504, 533, 543, 553, 557, 571, 598], 2: [556, 51, ],..., ]\}$


## Bisecting k-Means Clustering 

Next, we'll discuss a more efficient version of k-means called bisecting k-means. 

For K-means clustering algorithm, we know that it may converge to a local minimum sometimes, to avoid this problem well another algorithm has been developed, which by minimizing the Sum of Square Error (SSE) to evaluate the performance of clustering. The method of clustering called as Bisecting-K-means.

### Basic Idea:

In the beginning, we have only one initial cluster, which is our whole data set. To obtain K clusters, we firstly split the initial cluster (set of all data points) into two clusters, then select one of the cluster to split, and keep repeating this procedure so on until we have created K clusters as we desired. There are many different ways to select which cluster to split, here we compare the total Sum of Square Error (SSE) for each splitting sub-cluster, and select the largest reduced squared error, which is the least total SSE for all the clustering.


### Steps of Algorithm

Step 1: Split the dataset(the whole input data set) into two clusters by using k-means (k=2 for kMean(dataset, k)), then we have two centers of the whole data set as our initial two clusters, and initialize a center list to contain all the centers index(now we only have 2 centers). 

Step 2: To select the best cluster to split: we go over all the clusters in the cluster list, we create a new_dataSet, which contains the sample points of their corresponding cluster (i). Before we try to split the two clusters ($C_{1}$, $C_{2}$), we calculate the sum squared error (SSE) named as "error_before" for them by using the following formula:
       
SSE = $\sum_{i=1}^{K} \sum_{\displaystyle x\in C_{i}} dist(c_{i}, x)^2$
       
Then, we splits both clusters(eg. $C_{1}$, $C_{2}$) into 2 sub-clusters(eg. for $C_{1}$, we have $S_{1L}$ and $S_{1R}$; for $C_{2}$, we have its sub-cluster $S_{2L}$ and $S_{2R}$ by using k-means algorithm(gives us the two new centers) simutaneously. After that we compute both the sum of squared error(SSE) for each points with their corresponding sub-cluster center, then sum up those two SSE together, and we named them as "error_after" (eg. "error_after" for $C_1$ is SSE($S_{1L}$) + SSE ($S_{1R}$), "error_after" for $C_{2}$ is SSE($S_{2L}$) + SSE ($S_{2R}$)).   

Step 3: Calculate the Reduced Sum of Square Error $(SSE)$ by subtracting error_before and their corresponding error_after (eg. reduced_error = $SSE(C_{1})-(SSE(S_{1L} + SSE (S_{1R}))$). Then, we compare the reduce SSE (reduced_error) for both clusters and select the cluster from them, which has the maximum reduced SSE (The least total SSE for all the clusters). 
        
Step 4: Modify the data index according, and append the optimal centers index of new two sub-cluster to center list


##  Bisecting K-means & Initialization Problem

We have a sense of that the bisecting k-means clustering has less trouble with initialization problems, because it has tried several bisections and select the lowest total Sum of Square Error (SSE), and there are only two centers in each iteration step. We can see the simple iteration procedure of how the sequence of clusterings produced by bisecting K-means algorithm finds four clusters visually in the following Figure below,
<img src="bisect_process.pgn">


## Bisecting k-Means Clustering Algorithm

In [21]:
def getCenterError(dataSet, center, data_ids):
    error = 0;
    for data_id in data_ids:
        data = dataSet[data_id]
        error += dist_Euclidian(data, center) ** 2
    return error

def biKmeans(dataSet, k):
    """
    Assign each point a center index of their nearest center, which has the minimum distance to each point. 
    Args: Given data set
          k: the number of clusters you want
    return: centers, cluster_assessment ()
       
    """
    centers, membership = kMeans(dataSet, 2)
    
    while len(centers) < k:
        opt_center_id = -1;
        max_reduced_error = 0;
        
        #find the optimum center id to split
        for center_id in range(len(centers)):
            center = centers[center_id]
            error_before = getCenterError(dataSet, center, membership[center_id])
            new_dataSet = []
            for data_id in membership[center_id]:
                data = dataSet[data_id]
                new_dataSet.append(data)
            if len(new_dataSet) <= 0:
                continue
            new_centers, new_membership = kMeans(new_dataSet, 2)
            
            error_after = 0
            if 0 in new_membership:
                error_after += getCenterError(new_dataSet, new_centers[0], new_membership[0])
                
            if 1 in new_membership:
                error_after += getCenterError(new_dataSet, new_centers[1], new_membership[1])
                
        # get the maximum reduced error between error_before and error_after  
            reduced_error = error_before - error_after
            if max_reduced_error < reduced_error:
                max_reduced_error = reduced_error
                opt_center_id = center_id
         
        
        #split the center
        new2old = {}
        new_dataSet = []
        for data_id in membership[opt_center_id]:
            data = dataSet[data_id]
            new2old[len(new_dataSet)] = data_id
            new_dataSet.append(data)
        new_centers, new_membership = kMeans(new_dataSet, 2)
        centers[opt_center_id] = new_centers[0]
        centers.append(new_centers[1])
        data_id_list0 = []
        if 0 in new_membership:
            for new_id in new_membership[0]:
                data_id_list0.append(new2old[new_id])
        membership[opt_center_id] = data_id_list0;
        data_id_list1 = []
        if 1 in new_membership:
            for new_id in new_membership[1]:
                data_id_list1.append(new2old[new_id])
        membership[len(membership)] = data_id_list1;
    return centers, membership


In [26]:
# Now, test our codes with our test dataset (900*2)
centers, membership = biKmeans(data, 9)
print centers

[[1.0120343002076153, 2.9301196586981817], [1.0461508383513176, 1.0182152406613805], [2.855959182095021, 2.033993697431677], [2.98646029508743, 1.0358255156991776], [1.9732198287393015, 2.757198893487244], [3.013304863796108, 2.983728792635465], [1.165232206441763, 1.964712989461876], [2.0496492952705645, 1.1232806299321896], [2.83742486145063, 1.547424953487]]


For our implementation, this returns the following center points (k = 9) as:
[[1.9701987757228576, 1.2043994932104172], [2.046155217339732, 2.8181666319642873], [1.047979225240116, 2.9686805104564997], [2.851167728307258, 1.9682201863958153], [1.05395676586136, 0.848181586076502], [3.0317238289558497, 2.945712218285016], [1.010059046402461, 1.0343947158934823], [2.953318285421977, 1.0182129504112625], [1.1590569071191617, 2.0459069304875412]]

The result of $membership$ should be like this format:

$\{0: [300, 301, 303, 304, 305, 306,...], 1: [100, 101, 102, 103, 104, 105, 106, 108,...], 2: [400, 401, 402, 403, 404,...],...,8: [200, 201, 202, 203,...] \}$




# Plot Test Results & Comparison

The loading test data has 900 sample points (2D a), and we set the number of clusters as 9 (k = 9)



In [None]:
# Show the plot the cluster assignment after running the K-means algorithm

centers, membership = kMeans(data, 9)

import matplotlib.pyplot as plt
color = ['pink', 'purple','r', 'b', 'g', 'k', 'y', 'c', 'm']
for i in xrange(9):
    plt.plot([centers[i][0]], [centers[i][1]], color = color[i], marker = 'o', ls = 'None')
    datax = []
    datay = []
    if not i in membership:
        continue
    for data_id in membership[i]:
        datax.append(data[data_id][0])
        datay.append(data[data_id][1])
    plt.plot(datax, datay, color = color[i], marker = '+', ls = 'None')
plt.show()

The plot of cluster resulting from k-means algorithm (different colors representing different clusters):

Note: the cluster centers are marked with a circle, and the data points are marked with a cross

<img src="kmeans_plot.png">

In [None]:
# Show the plot of the cluster assignment after running the Bisecting K-means algorithm 
centers, membership = biKmeans(data, 9)

import matplotlib.pyplot as plt
color = ['pink','purple', 'r', 'b', 'g', 'm', 'k', 'y', 'c']
for i in xrange(9):
    plt.plot([centers[i][0]], [centers[i][1]], color = color[i], marker = 'o', ls = 'None')
    datax = []
    datay = []
    for data_id in membership[i]:
        datax.append(data[data_id][0])
        datay.append(data[data_id][1])
    plt.plot(datax, datay, color = color[i], marker = '+', ls = 'None')
plt.show()

The plot of cluster assignment after running the bisecting k-means algorithm showed as following. 
<img src="biKmeans_plot.png">

### Comparing 

Implementing the k-means and bisecting k-means clustering algorithms in the same test data set, and set the same number of clusters (k = 9), we can see from the results and the plot sections obviously that the the bisecting k-means clustering algorithm can produce much better clusters of our test data than k-means clustering algorithm. The plot result also shows that the k-means get troubled with local minimum to produce poor cluster, but bisecting k-means can avoid that case.

## References & Resources

This tutorial highlighted just a few analysis of k-means clustering and bisecting k-means clustering with small scale  dataset in Python. Much more detail about k-means algorithms are available from the following links.

1. Wikipedia : https://en.wikipedia.org/wiki/K-means_clustering
2. An Efficient k-Means Clustering Algorithm: Analysis and Implementation: https://www.cs.umd.edu/~mount/Projects/KMeans/pami02.pdf
3. "A comparison of document clustering techniques", M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
4. Introduction to Data Mining, Chapter 8. Cluster Analysis: Basic Concepts and Algorithm http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
5. Machine Learning in Action https://www.manning.com/books/machine-learning-in-action