<img src = https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png width = 200>
<h1 align=center> k-means Clustering using NumPy </h1>
<hr>

#### Initialize the environment and import the required modules

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

#### Define `shouldStop` as a function that returns:
- `True`, if there have been more than 10 iterations
- `True`, if the newly assigned centroids are in the same position as the old centroids
- otherwise it returns `False`

In [None]:
def shouldStop(oldCentroids, centroids, iterations):
    if iterations > 10: return True
    return (centroids==oldCentroids).all()
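
The exact equality test above works here, but floating-point centroids can keep moving by tiny amounts. As a minimal sketch, a tolerance-based variant could look like the following; the name `should_stop_tol` and the parameters `max_iter` and `tol` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def should_stop_tol(old_centroids, centroids, iterations, max_iter=10, tol=1e-6):
    """Hypothetical variant of shouldStop that tolerates tiny float differences."""
    if iterations > max_iter:
        return True
    # Converged when every coordinate moved by at most `tol` (absolute difference)
    return bool(np.allclose(centroids, old_centroids, rtol=0.0, atol=tol))

demo = np.array([[1.0, 2.0], [3.0, 4.0]])
print(should_stop_tol(demo, demo + 1e-9, 1))   # movement below tol
print(should_stop_tol(demo, demo + 0.5, 1))    # centroids still moving
print(should_stop_tol(demo, demo + 0.5, 11))   # iteration cap reached
```

A larger `tol` stops earlier at the cost of slightly less settled centroids.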

#### Define `distEuclid` as a function that returns:
- the Euclidean distance between two vectors, `vecA` and `vecB`

In [None]:
def distEuclid(vecA, vecB):
    # np.sum (rather than the built-in sum) adds up all squared differences
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))
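
For comparison, the same distance can be computed with `np.linalg.norm`. This is only a sketch of an equivalent formulation; the name `dist_euclid_norm` and the two sample points are made up for illustration.

```python
import numpy as np

def dist_euclid_norm(vec_a, vec_b):
    """Euclidean distance via the 2-norm of the difference vector."""
    return np.linalg.norm(vec_a - vec_b)

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])
print(dist_euclid_norm(p, q))  # 5.0 (a 3-4-5 right triangle)
```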

#### Define `getLabels` as a function that assigns a label to each datapoint based on its nearest centroid

In [None]:
def getLabels(dataSet, centroids):
    m = dataSet.shape[0]    # number of data points
    k = centroids.shape[0]  # number of clusters
    # Column 0: index of the nearest centroid; column 1: squared distance to it
    clusterAssment = np.empty([m, 2])
    for i in range(m):
        minDist = float('inf')
        minIndex = -1
        for j in range(k):
            distJI = distEuclid(centroids[j, :], dataSet[i, :])
            if distJI < minDist:
                minDist = distJI
                minIndex = j
        clusterAssment[i, :] = minIndex, minDist**2
    return clusterAssment
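
The double loop above is easy to follow but slow on large datasets. A possible vectorized sketch using NumPy broadcasting is shown below; `get_labels_vectorized` and the small sample arrays are assumptions for illustration. It returns the same two columns (nearest-centroid index and squared distance).

```python
import numpy as np

def get_labels_vectorized(data, centroids):
    """Assign each row of `data` to its nearest centroid without Python loops."""
    # diffs has shape (m, k, n): every point minus every centroid
    diffs = data[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)      # shape (m, k)
    nearest = np.argmin(sq_dists, axis=1)      # first minimum wins on ties
    min_sq = sq_dists[np.arange(data.shape[0]), nearest]
    return np.column_stack((nearest, min_sq))

pts = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 1.0]])
cents = np.array([[0.0, 0.0], [10.0, 10.0]])
print(get_labels_vectorized(pts, cents))
```

The intermediate `diffs` array needs memory proportional to m × k × n, so this trade-off suits small and medium datasets best.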

#### Define `getCentroids` as a function that recalculates (moves) centroids' positions

In [None]:
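def getCentroids(dataSet, labels, k):
    # Build a fresh array instead of relying on (and mutating) a global `centroids`
    centroids = np.empty([k, dataSet.shape[1]])
    for cent in range(k):
        ptsInClust = dataSet[labels[:, 0] == cent]  # all points assigned to this cluster
        centroids[cent, :] = np.mean(ptsInClust, axis=0)  # move centroid to the cluster mean
    return centroids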
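
If a cluster ends up with no points, `np.mean` over the empty selection yields NaN for that centroid. One possible way to guard against this, sketched under the assumption that keeping the previous position is acceptable, is shown below; `get_centroids_safe` and the sample arrays are hypothetical.

```python
import numpy as np

def get_centroids_safe(data, labels, old_centroids):
    """Recompute centroids, keeping the old position for any empty cluster."""
    centroids = old_centroids.copy()
    for cent in range(old_centroids.shape[0]):
        pts_in_clust = data[labels[:, 0] == cent]
        if len(pts_in_clust) > 0:              # skip empty clusters
            centroids[cent] = pts_in_clust.mean(axis=0)
    return centroids

sample_data = np.array([[0.0, 0.0], [2.0, 2.0]])
sample_labels = np.array([[0.0, 0.0], [0.0, 0.0]])   # both points in cluster 0
previous = np.array([[5.0, 5.0], [9.0, 9.0]])
print(get_centroids_safe(sample_data, sample_labels, previous))
```

Other strategies exist, such as re-seeding the empty cluster at a random data point; which one fits depends on the data.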

#### Read in the data and convert it into a NumPy array

In [None]:
!wget -O data2d.csv https://ibm.box.com/shared/static/nbin9unisgcfmxk9ig31af24p7te0teg.csv

In [None]:
dataFile=[]
with open("data2d.csv") as inputfile:
    for line in inputfile:
        words = line.strip().split(',')
        dataFile.append(words)
dataSet=np.asarray(dataFile, dtype = float)  
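
The manual parsing above can usually be replaced by `np.loadtxt`. The sketch below uses a small in-memory CSV (made-up values standing in for `data2d.csv`) so that it runs on its own.

```python
import io
import numpy as np

# Made-up stand-in for the contents of a two-column CSV file
sample_csv = io.StringIO("1.0,2.0\n3.5,4.5\n6.0,7.0\n")

# delimiter="," splits each line on commas; values are parsed as floats by default
sample = np.loadtxt(sample_csv, delimiter=",")
print(sample.shape)  # (3, 2)
```

Assuming the downloaded file contains only comma-separated numbers, `np.loadtxt("data2d.csv", delimiter=",")` should produce the same array as the loop above.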


#### Initialize the centroids at fixed starting locations:

In [None]:
centroids = np.asarray([[8., 1.], [5., 2.], [6., 3.]])    # hard-coded starting positions, one row per centroid

plt.figure(1)
plt.clf()
plt.scatter(dataSet[:, 0], dataSet[:, 1],alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1],s=50, c=u'r', marker=u's')
plt.show()
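
The starting centroids above are hard-coded. One common alternative is to pick `k` distinct data points at random; the sketch below assumes this strategy, and `init_centroids_random`, its `seed` parameter, and the sample data are all hypothetical.

```python
import numpy as np

def init_centroids_random(data, k, seed=0):
    """Pick k distinct rows of `data` as starting centroids."""
    rng = np.random.default_rng(seed)          # seeded for reproducibility
    idx = rng.choice(data.shape[0], size=k, replace=False)
    return data[idx].astype(float)

demo_data = np.array([[1., 1.], [2., 1.], [8., 9.], [9., 8.], [5., 5.]])
demo_centroids = init_centroids_random(demo_data, k=3)
print(demo_centroids.shape)  # (3, 2)
```

Because k-means can converge to different solutions from different starts, it is worth trying several seeds and comparing the results.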

#### Initialize some variables:

In [None]:
numFeatures = 2
k=3
iterations = 0
oldCentroids = np.empty([k,numFeatures])

#### Run the main k-means algorithm:

In [None]:
while not shouldStop(oldCentroids, centroids, iterations):
    print(iterations)
    oldCentroids = centroids.copy() # Save old centroids for convergence test.
    iterations += 1
    
    # Assign labels to each datapoint based on centroids
    labels = getLabels(dataSet, centroids)
    # Move each centroid to the mean of its assigned points
    centroids = getCentroids(dataSet, labels, k)
    print(labels[:,0])
    print(centroids)
    plt.figure(1)
    plt.clf()
    plt.scatter(dataSet[:, 0], dataSet[:, 1], c=labels[:,0],alpha=0.5)
    plt.scatter(centroids[:, 0], centroids[:, 1],s=50, c=u'r', marker=u's')
    plt.show()
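
Since the second column of `labels` stores each point's squared distance to its centroid, the total within-cluster sum of squared errors and the cluster sizes can be read off directly. The sketch below uses a made-up `demo_labels` array in the same two-column layout.

```python
import numpy as np

# Made-up assignments: column 0 = cluster index, column 1 = squared distance
demo_labels = np.array([[0.0, 1.0], [1.0, 4.0], [0.0, 0.25]])

sse = np.sum(demo_labels[:, 1])                               # total squared error
sizes = np.bincount(demo_labels[:, 0].astype(int), minlength=2)  # points per cluster
print(sse)    # 5.25
print(sizes)  # [2 1]
```

Printing the sum of squared errors on each pass of the loop is a simple way to confirm that it does not increase between iterations.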

#### Visualize the final results

In [None]:
print(labels[:,0])

fig = plt.figure(1)
plt.clf()
plt.scatter(dataSet[:, 0], dataSet[:, 1], c=labels[:,0],alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1],s=50, c=u'r', marker=u's')
plt.show()


### Thanks for completing this lesson!

Notebook created by: <a href = "https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>

<hr>
Copyright &copy; 2016 [Cognitive Class](https://cognitiveClass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).