# K Means Algorithm Implementation

After loading and separating the dataset into ground truths, gene IDs and gene ID number (row numbers), for the initial centroids we choose top 5 k values. These centroid-mappings are considered in order to recalculate the new centroids. For recalculating the centroids, we iterate the cluster mappings and append a data point’s index to the “points” list if that data point belongs to that cluster. These recalculated centroids are set as the current centroids (“new_centroid” list). 
The breaking condition happens when the clusters have not changed in consecutive iterations of the algorithm. In such a case, the function returns it’s most recent centroid values and exits. Otherwise, it will set the value of the function-wide “centroid” variable equal to the value of the new_centroid variable.


### Importing Dependencies

In [123]:
import numpy as np
import collections
import pickle
import sys
import time
from plotly.offline import init_notebook_mode, iplot
from plotly.graph_objs import *
from sklearn.decomposition import PCA as sklearnPCA
init_notebook_mode(connected=True)  #Set jupter notebook mode to true for running pyplot

### Preprocess the input data

**Input Parameters** : file name


**returns**: output matrix X, disease id and the ground truth

In [124]:
def preprocess(filename):
    input_data = np.genfromtxt(filename,delimiter = '\t')
    X = np.loadtxt(filename,delimiter = '\t', usecols = range(2, input_data.shape[1]), dtype = 'S15')
    gen_id = np.loadtxt(filename,delimiter = '\t', usecols = 0, dtype = 'S15')
    ground_truth = np.loadtxt(filename,delimiter = '\t', usecols = 1, dtype = 'S15')
    return X, gen_id, ground_truth

## Function to run the K Means Algorithm

**Input Parameters** : input matrix X, disease id and the maximum iteration to which k means should run.


**returns**: clusters

In [125]:
def kmeans(X, gen_id, iterationNo, initial_centroids):
    X = X.astype(np.float)
    clusters = gen_id
    centroid = []
    
    #Initialize the starting centroids
    for m in initial_centroids:
        centroid.append(X[m-1])
        
    #Keep running the algorithm until it converges or the iteration count is reached
    while(True):
        iterationNo -= 1
        new_centroid = np.empty_like(centroid)
        for i in range(0, X.shape[0]):
            clostestTo = -1
            current_minimum_distance = sys.maxsize
            for j in range(0, len(centroid)):
                euc_distance = calculate_euclidean_distance(centroid[j], X[i])
                if(euc_distance < current_minimum_distance):
                    current_minimum_distance = euc_distance
                    clostestTo = j
            clusters[i] = clostestTo
        for m in range(0, len(centroid)):
            points = []
            for i in range(0, len(clusters)):
                if(m == int(clusters[i])):
                    points.append(X[i])
            points = np.array(points)
            new_centroid[m] = np.mean(points, axis = 0)
            
        #If the centroid does not change or iteration is reached break out of the loop
        
        if((centroid == new_centroid).all() or iterationNo == 0):
            return clusters
        else:
            centroid[:] = new_centroid
    return

### Calculate the euclidean distance between two points A and B
**Input Parameters** : point A coordinate as a np array, point B coordinate as a np array


**returns**: Distance in float format

In [126]:
def calculate_euclidean_distance(point_A, point_B):
    distance = 0
    for i in range(len(point_A)):
        distance = distance + np.sqrt(np.square(np.subtract(float(point_A[i]), float(point_B[i]))))
    return distance

### Run principle component analysis to convert n-dimensional data to 2 dimensions in order to visualize
**Input Parameters** : input matrix X


**returns**: Eigen Vector Y

In [127]:
def runPCA(X):
    sklearn_pca = sklearnPCA(n_components=2)
    Y_sklearn = sklearn_pca.fit_transform(X)
    return Y_sklearn

### Calculate the Jaccard value of predicted data

**Input Parameters** : actual ground truth matrix and predicted ground truth


**returns**: None


**prints**: Jaccard and Rand value on console

In [128]:
def calculateJackard(actual_ground_truth, predicted_ground_truth):
    m00, m01, m10, m11 = 0, 0, 0, 0
    for i in range(0, len(actual_ground_truth)):
        for j in range(0, len(actual_ground_truth)):
            if((actual_ground_truth[i] != actual_ground_truth[j]) and (predicted_ground_truth[i] != predicted_ground_truth[j])):
                m00 += 1
            elif((actual_ground_truth[i] == actual_ground_truth[j]) and (predicted_ground_truth[i] != predicted_ground_truth[j])):
                m01 += 1
            elif((actual_ground_truth[i] != actual_ground_truth[j]) and (predicted_ground_truth[i] == predicted_ground_truth[j])):
                m10 += 1
            elif((actual_ground_truth[i] == actual_ground_truth[j]) and (predicted_ground_truth[i] == predicted_ground_truth[j])):
                m11 += 1
    jaccard = m11 / float(m11 + m10 + m01)
    rand = (m11 + m00) / float(m11 + m10 + m01 + m00)
    print(" Jaccard is : " + str(jaccard)),
    print(" Rand is : " + str(rand))

### Draw Scatter Plot using plotly

**Input Parameters** : 2 dimensional data and its label


**prints**: visualized clusters

In [134]:
def draw_scatter_plot(Y, labels):
    unique_labels = set(labels)
    points = []
    for name in unique_labels:
        x = []
        y = []
        for i in range(0, len(labels)):
            if(labels[i] == name):
                x.append(Y[i,0])
                y.append(Y[i,1])
        x = np.array(x)
        y = np.array(y)
        point = Scatter(
            x = x,
            y = y,
            mode='markers',
            name = int(name),
            marker=Marker(size=12, line=Line(color='rgba(217, 154, 217, 123)',width=0.5),opacity=0.9))
        points.append(point)
    data = Data(points)
    layout = Layout(xaxis=XAxis(title='Principle Component 1', showline=True),
                    yaxis=YAxis(title='Principle Component 2', showline=True))
    fig = Figure(data=data, layout=layout)
    iplot(fig)

## Driver program to run the above code

**Input Parameters** : file name, iteration count and initial centroids


**prints**: Jackard value and scatter plot

In [130]:
def driver(file_name, iteration_count, initial_centroids):
    X, gen_id, ground_truth = preprocess(file_name)
    start = time.time()
    clusters = kmeans(X, gen_id, iteration_count, initial_centroids)
    print("Time to run is : "),
    print("--- %s seconds ---" % (time.time() - start))
    Y_pca = runPCA(X)
    calculateJackard(ground_truth, clusters)
    draw_scatter_plot(Y_pca, clusters)

## To run this algorithm for a different file, iteration count or the intial centroids, please change parameters here

In [136]:
file_name = "data/iyer.txt"
iteration_count = 10
initial_centroids = [3, 20, 9]
driver(file_name, iteration_count, initial_centroids)

Time to run is : 
--- 1.172619104385376 seconds ---
 Jaccard is : 0.19457365531146215
 Rand is : 0.41114673630415016


In [135]:
file_name = "data/cho.txt"
iteration_count = 10
initial_centroids = [3, 20, 9]
driver(file_name, iteration_count, initial_centroids)

Time to run is : 
--- 0.9986391067504883 seconds ---
 Jaccard is : 0.4155703907323778
 Rand is : 0.7602754436360708


In [137]:
file_name = "data/new_dataset_1.txt"
iteration_count = 10
initial_centroids = [3, 20, 9]
driver(file_name, iteration_count, initial_centroids)

Time to run is : 
--- 0.06544327735900879 seconds ---
 Jaccard is : 0.6892070484581497
 Rand is : 0.8745777777777778
