## Introduction

This tutorial will introduce you to K-means clustering including utilizing different similarity metrics, and useful initialization techniques. In particular we will be focusing on how to apply this algorithm to text classification. This algorithm is a simple and powerful algorthim in machine learning that has improtant implications in text mining and data science.

Below is an example of clustering documents with k=3: 
[<img src="http://www.codeproject.com/KB/recipes/439890/clustering-process.png">](http://www.codeproject.com/KB/recipes/439890/clustering-process.png)
(click for full-size version).  This is an example of the first iteration of the k-means clustering algorithm where we begin by selecting k=3 random 'centroids'. After selecting the centroids we must then assign the rest of the documents based upon their 'similarity' to the current centroids. Here we assign colors, red, green, or blue depending upon whether a given document is in a certain centroid's 'cluster'. The algorithm will then update the centroids, and recalculate the clusters for a given number of iterations or after a certain stopping criterion is met. For the rest of this tutorial we will assume that the stopping criterion is simply the number of update iterations.


### Tutorial content

In this tutorial, we will show how to do K-means clustering in Python, using only native python libraries like [sys](https://docs.python.org/2/library/sys.html), [math](https://docs.python.org/2/library/math.html), and [random](https://docs.python.org/2/library/random.html).

We will use a small subset of a data set containing about 300 documents. We will not be providing the raw data but instead will present the data as two document vector text files. These files are in the form where each line corresponds to a new document and each line has word indices ("index" referring to a specific word in a document) and the frequency or how often that word appears in the given document. This is presented in the form 

$wordIndex_{0}$:$frequency_{0}$ $wordIndex_{1}$:$frequency_{1}$ ... $wordIndex_{n}$:$frequency_{n}$

We will cover the following topics in this tutorial:
- [Extracting Word Vectors](#Extracting-Word-Vectors)
- [Data Structure Initialization](#Data-Structure-Initialization)
- [Baseline K-Means Algorithm](#Baseline-K-Means-Algorithm)
- [Diagnostics](#Diagnostics)
- [Improving The Model: Similarity Metrics](#Improving The Model: Similarity Metrics)
- [Improving The Model: K-Means++ Initialization](#Improving-The-Model:-K-Means++-Initialization)
- [Improving The Model: Miscellaneous Approaches](#Improving-The-Model:-Miscellaneous Approaches)

## Extracting Word Vectors

In [57]:
import sys
import math
import random

We first begin by extracting the data we want from our document text file. Note, as mentioned above that we will not be calculating the word frequency itself from the raw text but instead assume that the data is already in the form mentioned above.

Therefore the first step is to simply extract the data from our text file, "train-document-vectors.txt".

In [58]:
doc_file = "train-document-vectors.txt"
with open(doc_file, "r") as f:
    content = f.readlines()

The above code simply reads every line of the file as a string, where content is an array of strings corresponding to a given document.

## Data Structure Initialization

Now that we have this raw vector, can we clean it up for use in our clustering algorithm? In fact we can, we will do this using two data structures. 

The first being similar to our "content" array, is an array of arrays of wordIndex/frequency tuples we will call our "vectors". We can convert this easy enough from the strings by using the "split" function to seperate the words in each line and then splitting the wordIndex from its frequency.

In [59]:
# Initializes vectors in the form of (wordIdx,frequency) array.
def initVectors ():
    vectors = [[]] * len(content)
    for i in range(0,len(content)):
        words = content[i].split(" ")
        words = words[0:(len(words)-1)]
        vector = [0] * len(words)
        for j in range(0,len(words)):
            wFreq = words[j].split(":")
            wFreq = (int(wFreq[0]),float(wFreq[1]))
            vector[j] = wFreq
        vectors[i] = map(tuple,sorted(vector))
    return(vectors)

Next we will define a more efficient structure that will help speed up our clustering algorithm. We will define a python dictionary of dictionaries where instead of each element being the document, we will have each key be a wordIndex that will refer to a dictionary of documents (where said word appears) as keys with the frequency in which it appears in said document as its value.

In [60]:
# Initializes dictionary for use cosine similarity calculation;
# Of the form Dictionary {wordIdx : Dictionary (Doc : Frequency}}.
def initDict ():
    D = {}
    for i in range(0,len(content)):
        words = content[i].split(" ")
        words = words[0:(len(words)-1)]
        for j in range(0,len(words)):
            word = words[j].split(":")
            word = (int(word[0]),float(word[1]))
            if(word[0] not in D):
                D[word[0]] = {i:word[1]}
            else:
                D[word[0]][i] = word[1]
    return(D)

Now with these two we can now initialize the data structures we will need to implement our K-means clustering algorithm.

In [61]:
vectors = initVectors()
dictionary = initDict()

## Baseline K-Means Algorithm

With our data initialized we can now begin to think about K-Means Clustering. First of all, what is K-means clustering?

In K-means we will define a number, K, of clusters or categories we reasonably believe the documents can be clasified into. Knowing this number, K, we then have a simple set of tasks to perform.

1. We must initialize K centroids. We begin by choosing K random documents as our centroids, later we will discuss a more efficient means of initialization.
2. Next, we must initialize the cluster membership of every document. We do this by determing the "similarity" each document has to a given centroids and then assign membership to the centroid that has the higher "similarity". We first define similarity as the centroid that maximizes the cosine similarity, however we will later show other methods for determining similarity.
3. Next, we must update K new centroids for our cluster by finding centroids that represent the mean word frequency for all words present in the current cluster.
4. We then, with our new clusters calculate the new cluster membership of all documents.
5. We continue steps 3-4 until a certain "stopping criterion" is met. For our purposes we will simply define a set number of iterations we wish for our algorithm to run.

Let's begin by defining the methods that we will need for our algorithm. Because we are dealing with a sparse matrix in the form of a dictionary and an array we have to think of different ways to calculate things like mangitude and the cosine similarity.

We begin by calculating the magnitude of a document vector and dot product of two vectors below. The dot product notice uses the dictionary data structure we defined above with indices to denote a specific vector.

In [62]:
# Calculates magnitude of a document vector.
def magnitude (doc):
    return(math.sqrt(sum(map(lambda x: x[1]*x[1],doc))))

# Calculates the dot prodict of two vectors given by their docIdxs, L0 and L1. 
def dotProduct (L0,L1,V,D):
    return(sum(map(lambda x: D[x[0]][L0]*D[x[0]][L1] if L1 in D[x[0]] else 0,V[L0])))

Next we will define the similarity metric we will use for this implementation. This will involve defining two cosine similarity functions, one for the centroid initilization step (where we define a specific document as our centroid) and one for the update step (where we may not have a specific document centroid, i.e. our centroid is the mean of all words in the cluster).

In [63]:
# Given 2 docIdx's doc0 and doc1 calculates Cosine Similarity.
def cosineSim (doc0,doc1,V,D):
    num = dotProduct(doc0,doc1,V,D)
    denom = magnitude(V[doc0]) * magnitude(V[doc1])
    return(num/denom)

# Given a docIdx doc0 and a centroid D1, calculates Cosine Similarity.
def cosineSim2 (doc0,D1,V,D):
    num = sum(map(lambda x: D1[x]*D[x][doc0] if doc0 in D[x] else 0,D1))
    denom = magnitude(V[doc0]) * math.sqrt(sum(map(lambda x: D1[x]*D1[x],D1)))
    return(num/denom)

Knowing all this we can now implement our K-means algorithm for clustering our documents as noted above. This will return both an array of the final K centroids, and an array determing the cluster membership of every document in the corpus.

We also included print statements (verbose) for debugging purposes.

In [65]:
# Calculates K-Means Clusters and centroids.

def baselineKMeans(D,V,k,iters,sim,sim2,init,verbose=False):
    
    # Step 1: Random initial centroids
    centroidIdxs = init(D,V,k)
    centroids = map(lambda vi: dict(V[vi]),centroidIdxs)
    
    # Step 2: Initialize clusters
    clusters = [0] * len(V)
    for i in range(0,len(V)):
        cs = map(lambda c: sim (i,c,V,D),centroidIdxs)
        clusters[i] = cs.index(max(cs))
        
    # Step 3-5: Update Step
    for i in range(0,iters):
        if (verbose):
            print "Iteration: ", i
        
        # Step 3: Update Centroids
        for c in range(0,k):
            cluster = [index for index,value in enumerate(clusters) if value == c]
            for doc in cluster:
                for word in V[doc]:
                    if(word[0] in centroids[c]):
                        centroids[c][word[0]] += D[word[0]][doc]
                    else:
                        centroids[c][word[0]] = D[word[0]][doc]
            centroids[c] = dict(map(lambda (k,v): (k, v/len(clusters)), centroids[c].iteritems()))           
        if (verbose):
            print "Centroids Updated"
        
        # Step 4: Update Clusters
        def f (j):
            cs = map(lambda C: sim2 (j,C,V,D),centroids)
            return(cs.index(max(cs)))
        clusters = map(lambda j: f(j),range(0,len(V)))
        if (verbose):
            print "Cluster Membership Updated"
        
    return(centroids,clusters)

Finally, with all this we can finally begin classify our corpus of documents. We will begin by defining K=30 and have our stopping criterion be 10 iterations of the algorithm. (Notice we input a function for randomly initializing our centroids, later we will show a different way to initialize our centroids).

In [67]:
K = 30
stopping_criterion = 10
def randomInit (D,V,k):
    return(random.sample(range(0,len(V)),k))
(centers,clusters) = baselineKMeans(dictionary,vectors,K,stopping_criterion,cosineSim,cosineSim2,randomInit,True)

Iteration:  0
Centroids Updated
Cluster Membership Updated
Iteration:  1
Centroids Updated
Cluster Membership Updated
Iteration:  2
Centroids Updated
Cluster Membership Updated
Iteration:  3
Centroids Updated
Cluster Membership Updated
Iteration:  4
Centroids Updated
Cluster Membership Updated
Iteration:  5
Centroids Updated
Cluster Membership Updated
Iteration:  6
Centroids Updated
Cluster Membership Updated
Iteration:  7
Centroids Updated
Cluster Membership Updated
Iteration:  8
Centroids Updated
Cluster Membership Updated
Iteration:  9
Centroids Updated
Cluster Membership Updated


With our centers and cluster memberships defined we can now perform diagnostics to assess the accuracy of our classification algorithm by comparing to our text file, and assesing the macro F1 scores of our model.

## Diagnostics

Now that we have run our algorithm, how can we assess the accuracy of our model? One way we can do this is by using a F1 score compared to file with the correct clusters (https://en.wikipedia.org/wiki/F1_score).

We will do this by first reading in our text file "verify-document-vectors.txt" file and then calculating the F1 macro score as described above.

In [83]:
verify_file = "verify-document-vectors.txt"
with open(verify_file, "r") as f:
    content2 = f.readlines()
    
# Initialize cluster membership list and verify .txt file
# to be used in calculating F1 score.
def initializeComparison(estimated,actual):
    verifyDict = dict()
    eventCounter = 0
    docCounter = 0
    verifyClusters = []
    trainClusters = []
    
    doc = 0
    for cluster in estimated:
        cluster = cluster
        while len(trainClusters) <= cluster:
            trainClusters.append([])
        trainClusters[cluster].append(doc)
        doc += 1
    
    for line in actual:
        cluster = line.strip()

        if cluster != "unlabeled":
            if cluster not in verifyDict:
                verifyDict[cluster] = eventCounter
                eventCounter += 1
            clusterID = verifyDict[cluster]
            while len(verifyClusters) <= clusterID:
                verifyClusters.append([])
            verifyClusters[clusterID].append(docCounter)

        docCounter += 1
    return(trainClusters,verifyClusters)

# Calculate F1 Score between Training clusters and Verified clusters
def findMacroF1 (trainClusters,verifyClusters):
    clusterF1s = []
    
    for verifyCluster in verifyClusters:
        bestF1 = -1

        for trainCluster in trainClusters:
            tp = 0
            fp = 0
            fn = 0
            for item in verifyCluster:
                if item in trainCluster:
                    tp += 1.0
                else:
                    fn += 1.0
            for item in trainCluster:
                if item not in verifyCluster:
                    fp += 1.0
            # if none match, just ignore
            if tp == 0:
                continue
            precision = tp / (tp+fp)
            recall = tp / (tp+fn)
            f1 = 2*precision*recall/(precision+recall)
            if f1 > bestF1:
                bestF1 = f1
        clusterF1s.append(bestF1)
        
    macroF1 = 0
    for item in clusterF1s:
        macroF1 += item
    macroF1 = macroF1 / len(clusterF1s)
    return(macroF1)


Now that we have read the actual membership, initialized the clusters, and defined a function for finding the macro F1 score we can now test the accuracy of our model.

In [69]:
(train,verify) = initializeComparison(clusters,content2)
findMacroF1 (train,verify)

0.3714560011648377

Now doing this we get a decent F1 Macro score, but can we do better? There are many different methods, including adjusting the cluster size, K, and other techniques. Below we will go through some simple ways to improve upon our model.

## Improving The Model: Similarity Metrics

Above we utilized the cosine similarity between two documents to determine the similarity, but this need not be our only means. Below we will look at using the dot product similarity and the more typical euclidean distance comparison.

First we start with the dot product similarity. This, and the euclidean distance metric, are both fairly similar to calculate as the cosine similarity so we will use a similar functions as we defined above.

In [70]:
# Given 2 docIdx's doc0 and doc1 and a boolean determining whether to
# calculate the dot product or euclidean distance, calculates similarity.
def dotSim (doc0,doc1,V,D):
    sim = dotProduct(doc0,doc1,V,D)
    return(sim)

def euclideanSim (doc0,doc1,V,D):
    sim = math.sqrt((sum(map(lambda x: (D[x[0]][doc0]-D[x[0]][doc1])**2 if doc1 in D[x[0]] else 0,V[doc0]))))
    return(sim)

# Given a docIdx doc0 and a centroid D1, and a boolean determining whether to
# calculate the dot product or euclidean distance, calculates similarity.
def dotSim2 (doc0,D1,V,D):
    sim = sum(map(lambda x: D1[x]*D[x][doc0] if doc0 in D[x] else 0,D1))
    return(sim)

def euclideanSim2 (doc0,D1,V,D):
    sim = math.sqrt(sum(map(lambda x: (D1[x]-D[x][doc0])**2 if doc0 in D[x] else 0,D1)))
    return(sim)

Now with these two different metrics defined we can now test the accuracy of the model with these two metrics below.

In [71]:
(centers,clusters) = baselineKMeans(dictionary,vectors,K,stopping_criterion,dotSim,dotSim2,randomInit)
(train,verify) = initializeComparison(clusters,content2)
findMacroF1 (train,verify)

0.03330337240988559

In [55]:
(centers,clusters) = baselineKMeans(dictionary,vectors,K,stopping_criterion,euclideanSim,euclideanSim2,randomInit)
(train,verify) = initializeComparison(clusters,content2)
findMacroF1 (train,verify)

0.2665283175465976

Using these two methods, we see that we have not significantly improved our performance, in fact in both cases we performed significantly worse, is there any other hope to improve our model?

## Improving The Model: K-Means++ Initialization

Before we mentioned that in setting our initial clusters we simply randomly assign documents to be a centroid. However, if we wish to have initial centroids that are significantly different from each other at the beginning, it could go a ways to improving the accuracy.

This is the intuition behind the K-Means++ initialization approach (https://en.wikipedia.org/wiki/K-means%2B%2B). Here we modify Step 1 of our previous baseline approach using this new initialization technique as outlined below.

1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat Steps 2 and 3 until k centers have been chosen.

After these steps we then work through the standard baseline k-means algorithm outlined previously. Below we define a function for the initialization step. Notice because of the relative poor performance of our other similarity metrics, we will be using cosine similarity as the "distance" or similarity metric.

In [76]:
# Performs K-Means++ initialization step. Calculates next centroid by
# choosing from a random distribution where each centroid has probability
# (1-CosineSimilarity)^2 of being chosen.
def initKPlus (D,V,k):
    centers = []
    nearestDists = [0] * len(V)
    centers = centers + [random.randrange(0,len(V))]
    for i in range(1,k):
        #print "Calculating Centroid: ", i
        for j in range(0,len(V)):
            cs = map(lambda c: 1 - cosineSim (j,c,V,D),centers)
            nearestDists[j] = max(cs)
        r = random.uniform(0.0,sum(map(lambda x: x*x,nearestDists)))
        p = 0
        #print nearestDists
        for d in range(0,len(nearestDists)):
            #print r, p, nearestDists[d]
            if r>= p and r <= p+(nearestDists[d]*nearestDists[d]):
                centers = centers + [d]
            p += nearestDists[d]*nearestDists[d]
    return(centers)

Now with the above funtion we simply need to call our original algorithm with the new initialization method.

In [77]:
(centers,clusters) = baselineKMeans(dictionary,vectors,K,stopping_criterion,cosineSim,cosineSim2,initKPlus)
(train,verify) = initializeComparison(clusters,content2)
findMacroF1 (train,verify)

0.4669066843784131

This method thus, while adding some time for initialization as shown above, greatly increases the accuracy of our model at K=30 clusters.

## Improving the Model: Miscellaneous Approaches

We have now seen a few main approaches to increase accuracy in k-means clustering. Lastly we will look at some general miscellaneous approaches.

One approach to improving accuracy can be to instead of looking at the raw frequency of words, as we have been doing in this text classification approach, we instead look at the term frequency–inverse document frequency, or TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Using this approach, we hoped to normalize the data and thus have features that more accurately reflect the ratio and quality of different documents.

We can do this by representing the new weights using the formula below,

$d_{tf-idf} = (tf_{1}  log(N/df_{1}), tf_{2}  log(N/df_{2}), ..., tf_{n}  log(N/df_{n})$

Where $tf_{i}$ represents the current weights, $N$ represents the number of total documents, and df_{i} represents the document frequency of a certain word.

Knowing this we can begin by replacing the weights in the dictionary and vector data structures we defined previously to this tf-idf representation.

In [79]:
# Calculates new TF-IDF weights to replace raw frequency weights in the
# dictionary D and array of vectors, V.
def newWeights(D,V):
    (nuV,nuD) = (initVectors (),initDict ())
    for w in D:
        for d in D[w]:
            weight = D[w][d] * math.log(len(V) / len(D[w]))
            nuD[w][d] = weight
            i = [index for index,value in enumerate(V[d]) if value[0] == w][0]
            nuV[d][i] = (nuV[d][i][0],weight)
    return(nuV,nuD)

(weightedVectors,weightedDictionary) = newWeights(dictionary,vectors)

After replacing the weights as defined above we can now assess the accuracy of our new model.

In [82]:
(centers,clusters) = baselineKMeans(weightedDictionary,weightedVectors,K,stopping_criterion,cosineSim,cosineSim2,randomInit)
(train,verify) = initializeComparison(clusters,content2)
findMacroF1 (train,verify)

0.4422692040188081

This is an improvement to our original model and when coupled with our initialization technique can be very useful.

Some other methods we could use to improve accuracy can be to increase the amount of iterations we perform in our clustering algorithm, or better yet, change the stopping criterion from being an arbitrary amount of iterations to be instead reflecting something we may want to minimize (such as the distance centroids move, for example).

Another obvious technique is to experiment with the size of K, clusters. Here we have been using the same K=30 clusters for all methods. We will end off with the results we get from modifying the size of K using the baseline approach

![title](Kclusters.png)

## Summary and references

This tutorial highlighted just a few methods for K-Means clustering in python for text classification, for further resources on the material above check the following links!

1. KMeans Clustering: 
https://en.wikipedia.org/wiki/K-means_clustering,
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
2. Diagnostics: https://en.wikipedia.org/wiki/F1_score
3. Similarity Metrics:
https://en.wikipedia.org/wiki/Euclidean_distance
https://en.wikipedia.org/wiki/Cosine_similarity
4. KMeans++: https://en.wikipedia.org/wiki/K-means%2B%2B
5. Tf-Idf Representation:
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.7900&rep=rep1&type=pdf
6. Clustering Algorithms for Text Mining: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.7900&rep=rep1&type=pdf