#  K-Means Clustering 

### Introduction 

K-means clustering is a type of unsupervised learning, used when you have unlabeled data (data without defined categories or groups). The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable K. The K-means clustering algorithm uses iterative refinement to produce a final result. Its inputs are the number of clusters K and the data set.

There are 3 steps:
* Initialization - K initial centroids are generated at random
* Assignment - K clusters are created by associating each observation with the nearest centroid
* Update - the centroid of each cluster is recomputed as the mean of the observations assigned to it

Repeat the assignment and update steps until the labels no longer change or the maximum number of iterations is reached.
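The two repeated steps can also be written in symbols. The notation here is introduced only for illustration: $x_i$ is the $i$-th observation, $\mu_j$ the centroid of cluster $j$, and $c_i$ the label given to $x_i$.

```latex
\text{Assignment:}\quad c_i \leftarrow \arg\min_{j \in \{1,\dots,K\}} \lVert x_i - \mu_j \rVert^2
\qquad
\text{Update:}\quad \mu_j \leftarrow \frac{1}{\lvert \{ i : c_i = j \} \rvert} \sum_{i \,:\, c_i = j} x_i
```

Minimizing the squared distance picks the same centroid as minimizing the Euclidean distance itself, since the square root does not change the ordering.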

In [185]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

Step 1: Initialization - K initial centroids are generated at random

In [37]:
def initialization(data, k):
    """Given the data and the number of clusters, create the initial centroids."""
    n_col = data.shape[1]
    # Draw each coordinate uniformly between that feature's minimum and maximum.
    # Sampling real values avoids truncating to integers, which fails when a
    # feature spans less than one unit.
    low, high = data.min(axis=0), data.max(axis=0)
    centroids = np.random.uniform(low, high, size=(k, n_col))
    return centroids
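Centroids drawn at random within the range of each feature can land far from every observation, which may leave some clusters without any points. A common alternative, sketched here as an assumption rather than as part of the implementation above, is to pick K distinct observations as the starting centroids. The function name `init_from_data` and the `seed` parameter are hypothetical.

```python
import numpy as np

def init_from_data(data, k, seed=None):
    """Hypothetical alternative: use k distinct observations as initial centroids."""
    rng = np.random.default_rng(seed)
    # Sample k row indices without replacement so no row is picked twice
    idx = rng.choice(data.shape[0], size=k, replace=False)
    return data[idx].copy()

# Toy data, made up for illustration
demo = np.array([[0.1, 0.2], [0.3, 0.9], [0.5, 0.4], [0.8, 0.7]])
start = init_from_data(demo, k=2, seed=0)
```

Every starting centroid then coincides with a real data point, so each cluster is guaranteed at least one member after the first assignment (provided the chosen rows are not duplicates of each other).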

Step 2: Assignment - K clusters are created by associating each observation with the nearest centroid

In [69]:
def distance(point, centroid):
    """Given one point and one centroid vector, calculate the Euclidean distance."""
    return np.sqrt(np.sum((point - centroid) ** 2))

def assign_label_cluster(data, centroids):
    """Given the data and a list of centroids, return the index of the
    nearest centroid for every data point."""
    n_row = data.shape[0]
    lab = []
    for i in range(n_row):
        # Distance from observation i to every centroid
        dist = [distance(data[i, :], cent) for cent in centroids]
        lab.append(dist.index(min(dist)))
    return lab
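The nested loops above compute one distance at a time. As a sketch of the same assignment step using NumPy broadcasting, the hypothetical function `assign_labels_vectorized` below builds the full distance matrix in one expression. It assumes `data` is a 2-D array and that all centroids have the same length.

```python
import numpy as np

def assign_labels_vectorized(data, centroids):
    """Hypothetical vectorized counterpart of assign_label_cluster."""
    centroids = np.asarray(centroids, dtype=float)
    # diff[i, j, :] = data[i] - centroids[j], shape (n_row, k, n_col)
    diff = data[:, None, :] - centroids[None, :, :]
    # dist[i, j] = Euclidean distance from observation i to centroid j
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Index of the nearest centroid for each observation
    return dist.argmin(axis=1)

# Toy data, made up for illustration
points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.1]])
cents = [[0.0, 0.0], [5.0, 5.0]]
labels_demo = assign_labels_vectorized(points, cents)
```

The result is a NumPy array rather than a list, so it would need `.tolist()` before being compared with `==` the way plain Python lists are.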

Step 3: Update - the centroid of each cluster becomes the mean of the observations assigned to it

In [103]:
def update_centroids(data, labels):
    """Given the data and the label of each data point, return the new centroids."""
    labels = np.asarray(labels)
    new = []
    # Visit the labels in sorted order so the centroid order follows the label order.
    # A cluster that received no points has no mean and is left out of the result.
    for lab in np.unique(labels):
        new.append(data[labels == lab].mean(axis=0))
    return new
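When a cluster receives no observations, it has no mean to compute, and `update_centroids` returns fewer than K centroids. One possible way to keep K fixed, shown here only as a sketch, is to pass in the previous centroids and reuse the old position of any empty cluster. The function name `update_centroids_keep_k` and its extra `centroids` argument are hypothetical.

```python
import numpy as np

def update_centroids_keep_k(data, labels, centroids):
    """Hypothetical variant: recompute means, keeping the old centroid of an empty cluster."""
    labels = np.asarray(labels)
    new = []
    for j, old in enumerate(centroids):
        members = data[labels == j]
        if len(members):
            new.append(members.mean(axis=0))
        else:
            # An empty cluster has no mean, so fall back to its previous centroid
            new.append(np.asarray(old, dtype=float))
    return new

# Toy data, made up for illustration: cluster 2 receives no points
pts = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
old_cents = [[1.0, 1.0], [9.0, 9.0], [50.0, 50.0]]
updated = update_centroids_keep_k(pts, [0, 0, 1], old_cents)
```

Other strategies exist, such as re-seeding the empty cluster at a random observation; which one is appropriate depends on the data.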

The Final Algorithm: Wrap up all the functions above. 

In [148]:
def k_means(data, k, iteration):
    """Given the data and the number of clusters, return the label of each data point
    and the centroids of the clusters."""
    center = initialization(data, k)
    labels = assign_label_cluster(data, center)
    for i in range(iteration):
        center = update_centroids(data, labels)
        new_labels = assign_label_cluster(data, center)
        # Stop once an iteration leaves every label unchanged.
        # Comparing against the previous labels directly avoids indexing a
        # history list at i-1, which at i = 0 compares the entry with itself.
        if new_labels == labels:
            break
        labels = new_labels
    return (labels, center)
    

### K-Means Implementation -- Scratch

In [181]:
from sklearn.datasets import load_iris 
data = load_iris() 
data = data.data 
label, centroids = k_means(data, 5, 1000)

In [182]:
np.array(label)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 2, 2, 2, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3, 4, 3, 2,
       3, 3, 4, 3, 3, 3, 4, 4, 3, 2, 2, 2, 3, 3, 3, 3, 3, 4, 3, 3, 2, 4,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 4, 2, 2, 2, 2, 3, 2, 2, 2,
       2, 2, 2, 4, 4, 2, 2, 2, 2, 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 4, 2, 2, 2, 3, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4])

In [184]:
centroids

[array([5.01020408, 3.42857143, 1.45306122, 0.24693878]),
 array([4.8, 3.4, 1.9, 0.2]),
 array([6.52028986, 2.8884058 , 5.0942029 , 1.73188406]),
 array([5.64444444, 2.87407407, 4.38888889, 1.56296296]),
 array([5.975, 2.575, 5.15 , 1.475])]
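One way to judge a clustering like this is the within-cluster sum of squares: the total squared distance from each observation to the centroid it was assigned to. Lower values mean tighter clusters for a given K. The helper `within_cluster_ss` below is a hypothetical sketch, not part of the implementation above; it assumes label `j` refers to row `j` of `centroids`.

```python
import numpy as np

def within_cluster_ss(data, labels, centroids):
    """Hypothetical helper: sum of squared distances to the assigned centroids."""
    labels = np.asarray(labels)
    centroids = np.asarray(centroids, dtype=float)
    # centroids[labels] lines up each row of data with its own centroid
    return float(((data - centroids[labels]) ** 2).sum())

# Toy data, made up for illustration
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 10.0]])
wcss = within_cluster_ss(pts, [0, 0, 1, 1], [[0.0, 1.0], [10.0, 10.0]])
```

A fitted scikit-learn `KMeans` object reports the same quantity as its `inertia_` attribute, which makes this a convenient number for comparing runs.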

### K-Means Implementation -- sklearn

In [173]:
from sklearn.cluster import KMeans

In [176]:
kmeans = KMeans(n_clusters = 5, max_iter = 1000).fit(data)
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 4, 4, 4, 3, 4, 4, 4, 3, 4, 3, 3, 4, 3, 4, 3, 4,
       4, 3, 4, 3, 4, 3, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 4, 3, 4, 4, 4,
       3, 3, 3, 4, 3, 3, 3, 3, 3, 4, 3, 3, 1, 4, 2, 1, 1, 2, 3, 2, 1, 2,
       1, 1, 1, 4, 1, 1, 1, 2, 2, 4, 1, 4, 2, 4, 1, 2, 4, 4, 1, 2, 2, 2,
       1, 4, 4, 2, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 4], dtype=int32)

In [183]:
kmeans.cluster_centers_

array([[5.006     , 3.428     , 1.462     , 0.246     ],
       [6.52916667, 3.05833333, 5.50833333, 2.1625    ],
       [7.475     , 3.125     , 6.3       , 2.05      ],
       [5.508     , 2.6       , 3.908     , 1.204     ],
       [6.20769231, 2.85384615, 4.74615385, 1.56410256]])
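The label numbers from the two implementations cannot be compared directly, because cluster numbering is arbitrary: cluster 2 in one run may be cluster 4 in another. A contingency table, which counts how often each pair of labels occurs together, makes the correspondence visible. The function `contingency` below is a hypothetical sketch that assumes both labelings are non-negative integer arrays of equal length.

```python
import numpy as np

def contingency(a, b):
    """Hypothetical helper: table[i, j] counts points labeled i in a and j in b."""
    a, b = np.asarray(a), np.asarray(b)
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=int)
    # Add 1 at position (a[n], b[n]) for every point n, including repeated pairs
    np.add.at(table, (a, b), 1)
    return table

# Toy labelings, made up for illustration: the same partition under different names
a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([2, 2, 0, 0, 1, 1])
table = contingency(a, b)
```

When every row and every column of the table has exactly one non-zero entry, the two labelings describe the same partition. For a single permutation-invariant score, `sklearn.metrics.adjusted_rand_score` can be applied to the two label arrays.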