# K-Means Clustering with scikit-learn

We use scikit-learn's implementation of k-means; see the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit) for details.

In [1]:
from sklearn.cluster import KMeans

When using k-means from scikit-learn, we recommend storing your data as a NumPy array. Create one directly, or convert existing data, as follows.

In [2]:
import numpy as np

# Create a NumPy array directly
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Convert a list of points to a NumPy array
a = []
for i in range(10):
    a.append([i, 2 * i])

Y = np.array(a, dtype='float32')
X


array([[1, 2],
       [1, 4],
       [1, 0],
       [4, 2],
       [4, 4],
       [4, 0]])

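The following cell runs the k-means algorithm on the points in X. Make sure you understand the parameters (see the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit)): `init='random'` picks the initial centroids at random from the data, `n_clusters` is the number of clusters, `max_iter` caps the iterations of a single run, and `n_init` is the number of runs with different initializations, of which the one with the lowest SSE is kept.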

In [3]:
kmeans = KMeans(init='random', n_clusters=2, max_iter=10000, n_init=100).fit(X)

The following code shows the cluster (0 or 1) assigned to each data point.

In [4]:
kmeans.labels_

array([0, 0, 0, 1, 1, 1])

The following code computes the clusters for the points [0,0] and [4,4]. In this case, [0,0] is placed in the cluster labeled 0 and [4,4] in the cluster labeled 1.

In [5]:
kmeans.predict([[0, 0], [4, 4]])

array([0, 1])
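As a rough check of what `predict` does, the sketch below recomputes the assignment by hand: each new point goes to the cluster whose centroid is closest in squared Euclidean distance. The names `km`, `new_points`, `d2` and `nearest` are illustrative and not part of the notebook, and `random_state=0` is only there to make the sketch reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_points = np.array([[0, 0], [4, 4]], dtype=float)

# Squared distance from every new point to every centroid, shape (2, 2)
d2 = ((new_points[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Index of the closest centroid for each new point
nearest = d2.argmin(axis=1)

# True when the manual assignment agrees with predict
same_as_predict = np.array_equal(nearest, km.predict(new_points))
```

The label numbers themselves (0 or 1) depend on the initialization, so only the agreement with `predict` is meaningful here, not the specific values.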

The following code shows the centroids (called centers in scikit-learn) of the two clusters.

In [6]:
kmeans.cluster_centers_

array([[1., 2.],
       [4., 2.]])
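To make the notion of a centroid concrete, the following sketch recomputes each center as the mean of the points assigned to that cluster. It assumes the run has converged, which should hold for this tiny data set; `km` and `centers_by_hand` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Mean of the points carrying each label, one row per cluster
centers_by_hand = np.array(
    [X[km.labels_ == c].mean(axis=0) for c in range(km.n_clusters)]
)

# True when the manual means match the fitted centers
centers_match = np.allclose(centers_by_hand, km.cluster_centers_)
```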

# Testing on our own data

Importing the data as a NumPy array

In [7]:
import csv
imu_path = 'C:\\Users\\samai\\Dev\\SD201\\TP_clustering\\data.csv'

tab_imu = []

with open(imu_path, 'r', newline='') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        tab_imu.append(row)

tab_imu = np.array(tab_imu)
# Drop the header row and the first column
tab_imu = tab_imu[1:, 1:]
data = np.array(tab_imu, dtype='float32')
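The same loading step could probably be done with `np.loadtxt` alone. The sketch below uses a small in-memory stand-in for the file, since the real `data.csv` is not shown; its contents, the column names and the variables `sample_csv`, `n_cols` and `sample` are all made up for illustration. It assumes a single header row and a first column to be discarded.

```python
import io
import numpy as np

# Hypothetical stand-in for data.csv: one header row, then a label column
# followed by numeric columns.
sample_csv = io.StringIO(
    "id,f1,f2\n"
    "a,0.5,1.5\n"
    "b,2.0,-1.0\n"
    "c,3.25,0.0\n"
)

# Read the header to learn how many columns there are
header = sample_csv.readline()
n_cols = len(header.split(','))

# usecols skips column 0, so text labels there are never parsed as numbers
sample = np.loadtxt(
    sample_csv, delimiter=',', usecols=range(1, n_cols), dtype='float32'
)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle.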


In [8]:
kmeans = KMeans(init='random', n_clusters=8, max_iter=40000, n_init=10).fit(data)


In [9]:
(kmeans.labels_[7],kmeans.labels_[25],kmeans.labels_[9],kmeans.labels_[15])

(4, 1, 4, 1)

Squared Euclidean distance

In [10]:
def dist_square(x,y):
    res=0
    for i in range(len(x)):
        res+= (x[i]-y[i])**2
    return res
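For comparison, the same squared distance can be written without an explicit loop. This is a minimal sketch; `dist_square_np` is a hypothetical name chosen so as not to overwrite the function above.

```python
import numpy as np

def dist_square_np(x, y):
    """Squared Euclidean distance between two vectors of equal length."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # The dot product of the difference with itself is the sum of squares
    return float(np.dot(diff, diff))

# (1 - 4)^2 + (2 - 6)^2 = 9 + 16 = 25
example = dist_square_np([1, 2], [4, 6])
```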


Computing the SSE (sum of squared errors)

In [11]:
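def SSE(x,centre,appart):
    # x: data points, centre: cluster centers, appart: cluster label of each point
    sse=0
    # Loop over every point instead of a hard-coded 30
    for i in range(len(x)):
        sse += dist_square(x[i], centre[appart[i]])
    return sse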


In [12]:
centres =  kmeans.cluster_centers_
SSE(data, centres, kmeans.labels_)

1838.585247338966
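scikit-learn stores this quantity on the fitted estimator as `inertia_`, so the hand-written SSE can be cross-checked against it. The sketch below does so on the small toy array from the first part, because the CSV data is not available here; `km` and `sse_by_hand` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# km.cluster_centers_[km.labels_] gives, for every point, the center of its cluster
sse_by_hand = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()

# True when the manual SSE agrees with the value stored by scikit-learn
sse_matches_inertia = np.isclose(sse_by_hand, km.inertia_)
```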

In [13]:
centres[0][24] + data[0][24]
dist_square(data[0], centres[0])

import random
np.ones((10,2))/10
random.random()

u=20.3
20.248842<= u <= 20.3001

True

k-means++ initialization

In [14]:
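def f(x,tab):
    # tab holds cumulative probabilities; find the interval that contains x
    for i in range(len(tab)-1):
        if tab[i]<= x <= tab[i+1]:
            return i+1
    # x is below tab[0] (or no interval matched): select the first point
    return 0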

In [15]:
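def k_plus(x):
    n = len(x)
    Set =[]
    # First center: drawn uniformly from x (randint excludes the upper bound)
    Set.append(x[np.random.randint(0,n)])
    k=1
    while k<8:
        res=np.zeros(n)
        total = 0
        for i in range(n):
            # Index of the already chosen center closest to x[i]
            closest = 0
            for j in range(len(Set)):
                if dist_square(x[i],Set[j])< dist_square(x[i],Set[closest]):
                    closest = j
            res[i]= dist_square(x[i],Set[closest])
            total += res[i]
        # Normalize to probabilities proportional to the squared distances
        res= res/total

        # Cumulative sums, so that a uniform draw z selects a point through f
        for i in range(n-1):
            res[i+1]+= res[i]
        z=random.random()
        Set.append(x[f(z, res)])
        k+=1
    return Set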

np.shape(k_plus(data))

(8, 25)
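The seeding above can also be sketched in vectorized form. This is not the notebook's implementation and not scikit-learn's internal one, only an illustration of the same idea: each new center is drawn with probability proportional to the squared distance to the nearest center already chosen. The function name `kmeanspp_seeds`, its `k` and `rng` parameters, and the toy array `X_demo` are assumptions made for this example; it also assumes the points are not all identical, otherwise the probabilities are undefined.

```python
import numpy as np

def kmeanspp_seeds(x, k, rng=None):
    """Pick k initial centers from the rows of x, k-means++ style."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)

    # First center: a row of x chosen uniformly at random
    seeds = [x[rng.integers(n)]]
    # Squared distance of every point to its nearest chosen center
    d2 = ((x - seeds[0]) ** 2).sum(axis=1)

    for _ in range(k - 1):
        # Sampling weights proportional to the squared distances
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        seeds.append(x[idx])
        # Update the nearest-center distances with the new center
        d2 = np.minimum(d2, ((x - x[idx]) ** 2).sum(axis=1))

    return np.array(seeds)

X_demo = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
seeds = kmeanspp_seeds(X_demo, 3, rng=0)
```

A point that is already a center has distance zero and therefore probability zero, so with distinct data points the same point is never picked twice.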

In [73]:
kmeans = KMeans(init='k-means++', n_clusters=8, max_iter=40000, n_init=10).fit(data)
SSE(data,kmeans.cluster_centers_,kmeans.labels_)

1542.491791536703

In [62]:
SSE(data,kmeans.cluster_centers_,kmeans.labels_)

1565.5129471490498

In [74]:
# Keep the best SSE over 10 runs seeded with our own k-means++ centers
best_sse = float('inf')
for i in range(10):
    kmeans2 = KMeans(init=np.array(k_plus(data)), n_clusters=8, max_iter=40000, n_init=1).fit(data)
    sse = SSE(data, kmeans2.cluster_centers_, kmeans2.labels_)
    if sse < best_sse:
        best_sse = sse
best_sse

1642.8645229903177

In [77]:
(k_plus(data),k_plus(data))

([array([-0.963597,  1.40845 , -1.88301 , -0.701481,  2.27179 ,  1.97562 ,
         -3.13097 , -0.514706, -2.70694 , -1.72216 ,  0.537857, -1.97842 ,
         -0.83682 , -0.269854, -4.24679 ,  0.35461 ,  2.43615 ,  1.67331 ,
         -2.77257 ,  1.74742 ,  0.      ,  1.96078 , -0.85885 , -2.9845  ,
         -0.497159], dtype=float32),
  array([-1.48197 ,  2.19485 ,  0.372353,  0.542299,  0.919811, -1.18673 ,
         -4.48936 ,  4.34783 ,  0.704535,  0.263043, -2.49383 ,  4.06704 ,
         -0.121655, -1.95908 , -3.86216 ,  3.79015 ,  1.56114 ,  0.926135,
         -2.75582 , -0.907519,  0.405314,  1.48837 , -1.10169 , -3.57542 ,
          0.599733], dtype=float32),
  array([ 0.658256 , -1.22308  ,  0.981194 ,  0.266951 ,  0.       ,
          3.19785  ,  1.92256  ,  2.69549  , -2.49187  , -0.322061 ,
          2.30388  ,  1.84783  , -3.06979  , -1.20192  , -4.28413  ,
          5.0644   , -0.0533618,  0.265252 ,  0.188425 ,  3.11381  ,
         -0.888769 , -2.32955  , -5.64885  ,  0.21