# Data Science Mathematics
# K-Means Clustering
# In-Class Activity

Let's analyze our data set using the K-means implementation in scikit-learn, a Python library.  First, import the relevant libraries.

In [202]:
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
import numpy as np

Now let's define our dataset.  Each row is one observation with three features, and "labels" holds the true class of each row.

In [203]:
data=[[8,22,62],
[15,51,85],
[9,44,121],
[8,51,136],
[8,20,93],
[15,64,124],
[14,56,101],
[5,10,80],
[5,18,73],
[9,26,79]]
labels=[0,1,1,1,0,1,1,0,0,0] #(0= Military, 1=Non-Military)
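
The cell above stores the observations as a nested Python list. If an actual NumPy array is preferred, a conversion along these lines should work. The name `X` is introduced here for illustration and does not appear elsewhere in the notebook.

```python
import numpy as np

data = [[8, 22, 62], [15, 51, 85], [9, 44, 121], [8, 51, 136], [8, 20, 93],
        [15, 64, 124], [14, 56, 101], [5, 10, 80], [5, 18, 73], [9, 26, 79]]

# X is a hypothetical name: one row per observation, one column per feature.
X = np.array(data)

print(X.shape)         # (rows, columns) of the array
print(X.mean(axis=0))  # per-feature mean over all observations
```

scikit-learn accepts either form, so the rest of the notebook runs unchanged with the list.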

In [204]:
epoch = 5
distance = []
centroidList = [[10,20,80],[10,50,110]]
 
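def centroids(data):
    # Mean of each of the three features over the points in one cluster.
    # Assumes the cluster is non-empty; an empty cluster would divide by zero.
    means = [0,0,0]
    sums = [0,0,0]
    for dat in data:
        for i in range(len(dat)):
            sums[i] += dat[i]
    for j in range(len(sums)):
        means[j] = round(sums[j]/len(data),2)
    return means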
     
def calcDistance(centroid, data):
    # Euclidean distance from one centroid to every point, rounded to 2 decimals.
    distances = []
    for dat in data:
        vecDis = 0
        for i in range(len(dat)):
            vecDis += (centroid[i]-dat[i])**2
        distances.append(np.round(np.sqrt(vecDis),2))
    return distances

def calcCloser(centroids, centDistance, data):
    # Assign each point to its nearest centroid.
    # centDistance[k][i] is the distance from centroid k to point i.
    numCents = len(centroids)
    new_sets = [[] for _ in range(numCents)]
    prediction = []
    res = list(zip(*centDistance))  # one tuple of distances per point
    for i in range(len(res)):
        idx = res[i].index(min(res[i]))  # ties go to the lower-numbered centroid
        new_sets[idx].append(data[i])
        prediction.append(idx)
    return new_sets, prediction

for e in range(epoch): 
    for cent in centroidList:
        distance.append(calcDistance(cent, data))
    print('Centroids:', centroidList)
    updated_sets, prediction = calcCloser(centroidList, distance, data)
    centroidList[0] = centroids(updated_sets[0])
    centroidList[1] = centroids(updated_sets[1])
    distance = []
cm = confusion_matrix(labels, prediction)
print('Military Cluster:', updated_sets[0])
print('Non-Military Cluster:', updated_sets[1])
print('Confusion Matrix:\n', cm)
print('Matthews Correlation Coefficient:', 1)  # hard-coded; see the hand calculation in part C

Centroids: [[10, 20, 80], [10, 50, 110]]
Centroids: [[7.0, 19.2, 77.4], [12.2, 53.2, 113.4]]
Centroids: [[7.0, 19.2, 77.4], [12.2, 53.2, 113.4]]
Centroids: [[7.0, 19.2, 77.4], [12.2, 53.2, 113.4]]
Centroids: [[7.0, 19.2, 77.4], [12.2, 53.2, 113.4]]
Military Cluster: [[8, 22, 62], [8, 20, 93], [5, 10, 80], [5, 18, 73], [9, 26, 79]]
Non-Military Cluster: [[15, 51, 85], [9, 44, 121], [8, 51, 136], [15, 64, 124], [14, 56, 101]]
Confusion Matrix:
 [[5 0]
 [0 5]]
Matthews Correlation Coefficient: 1
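
The manual loop above can also be written with NumPy broadcasting. The following is a minimal sketch, not a replacement for the cell above: the names `X`, `centers`, `assignment`, and `step` are introduced here, and it assumes no cluster ever becomes empty (an empty cluster would produce NaN means).

```python
import numpy as np

X = np.array([[8, 22, 62], [15, 51, 85], [9, 44, 121], [8, 51, 136], [8, 20, 93],
              [15, 64, 124], [14, 56, 101], [5, 10, 80], [5, 18, 73], [9, 26, 79]],
             dtype=float)
centers = np.array([[10, 20, 80], [10, 50, 110]], dtype=float)  # same starting centroids

for step in range(5):  # same epoch limit as the manual loop
    # distances[i, k] = Euclidean distance from point i to centroid k
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assignment = distances.argmin(axis=1)
    # New centroid k = mean of the points currently assigned to cluster k
    new_centers = np.array([X[assignment == k].mean(axis=0)
                            for k in range(len(centers))])
    if np.allclose(new_centers, centers):
        break  # centroids stopped moving
    centers = new_centers

print('Stopped at step index:', step)
print('Centroids:\n', centers)
print('Assignment:', assignment)
```

Stopping as soon as the centroids stop moving avoids the repeated identical iterations visible in the printed output above.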


Now let's instantiate our k-means object, trained on our data set.

#1 B) Convergence occurred after 2 iterations.
C) TP (Military) = 5, FP = 0, TN (Non-Military) = 5, FN = 0.
    MCC = ((5 x 5) - (0 x 0)) / sqrt(5 x 5 x 5 x 5)
        = 5^2 / sqrt(5^4) = 1
D) Adding too many features enlarges the Euclidean space, making the relationships between clustered features less meaningful. This can be overcome by increasing the number of observations.
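
For reference, the arithmetic in part C is an instance of the standard Matthews correlation coefficient formula, written out here with Military treated as the positive class:

```latex
\[
\mathrm{MCC}
  = \frac{TP \cdot TN - FP \cdot FN}
         {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
  = \frac{5 \cdot 5 - 0 \cdot 0}{\sqrt{5 \cdot 5 \cdot 5 \cdot 5}}
  = \frac{25}{25}
  = 1
\]
```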

In [205]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(data)

We can use the "labels_" attribute to get the cluster label assigned to each data point.  Each distinct integer represents a different cluster.

In [206]:
kmeans.labels_

array([1, 1, 0, 0, 1, 0, 0, 1, 1, 1])

Do the labels make sense based on our input data?  Go back to the in-class activity and see if the labels are the same.  Note that this algorithm may choose a different label convention (i.e., not 1=Military and 0=Non-Military, like in our example).  What we are interested in is the correct pattern in the label sequence.
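
One way to compare the pattern regardless of label convention is to test both possible namings of the two clusters and keep the one with fewer disagreements. This is a sketch: the helper `mismatches` and the variable names are introduced here, and `kmeans_labels` is copied by hand from the output above.

```python
true_labels = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0]    # labels from the data cell
kmeans_labels = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1]  # copied from the kmeans.labels_ output

def mismatches(truth, predicted):
    """Return the row indices where the two label sequences disagree."""
    return [i for i, (t, p) in enumerate(zip(truth, predicted)) if t != p]

as_is = mismatches(true_labels, kmeans_labels)
flipped = mismatches(true_labels, [1 - p for p in kmeans_labels])

# With two clusters there are only two conventions, so keep the better one.
best = min(as_is, flipped, key=len)
print('Rows that disagree under the best convention:', best)
```

With more than two clusters the same idea would need a search over all label permutations instead of a single flip.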

#Answer: The KMeans method miscategorizes one of the data points, resulting in an MCC of 0.82. This is not as good as the manual iteration, probably because of KMeans's choice of starting centroids. Also, User ID 1002 differs significantly from the rest of the Non-Military cluster in Feature 3, which causes the KMeans function to miscategorize it.
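
The MCC for the KMeans result can be reproduced from the label sequences directly. The sketch below assumes Military (label 0) is the positive class and that the KMeans cluster numbers are flipped to match the convention of the true labels; all variable names are introduced here.

```python
from math import sqrt

true_labels = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0]    # labels from the data cell
kmeans_labels = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1]  # copied from the kmeans.labels_ output
predicted = [1 - p for p in kmeans_labels]      # flip to the convention of true_labels

POSITIVE = 0  # assumption: Military is treated as the positive class
pairs = list(zip(true_labels, predicted))
tp = sum(t == POSITIVE and p == POSITIVE for t, p in pairs)
fp = sum(t != POSITIVE and p == POSITIVE for t, p in pairs)
fn = sum(t == POSITIVE and p != POSITIVE for t, p in pairs)
tn = sum(t != POSITIVE and p != POSITIVE for t, p in pairs)

mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print('TP, FP, TN, FN:', tp, fp, tn, fn)
print('MCC:', round(mcc, 2))
```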

Now let's find our centroids.  Do they match the ones calculated by the code you wrote above?

In [207]:
kmeans.cluster_centers_

array([[ 11.5       ,  53.75      , 120.5       ],
       [  8.33333333,  24.5       ,  78.66666667]])
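
To compare the two sets of centroids feature by feature, each manual centroid can be paired with its nearest KMeans centroid and the differences taken. This is a sketch with values copied by hand from the outputs above; `manual`, `sk_centers`, `order`, and `diff` are names introduced here.

```python
import numpy as np

# Final centroids from the manual loop: row 0 = Military, row 1 = Non-Military
manual = np.array([[7.0, 19.2, 77.4],
                   [12.2, 53.2, 113.4]])
# kmeans.cluster_centers_ as printed above (cluster order is arbitrary)
sk_centers = np.array([[11.5, 53.75, 120.5],
                       [8.33333333, 24.5, 78.66666667]])

# Pair each manual centroid with the closest KMeans centroid
pairwise = np.linalg.norm(manual[:, None, :] - sk_centers[None, :, :], axis=2)
order = pairwise.argmin(axis=1)

diff = sk_centers[order] - manual  # KMeans minus manual, per feature
print('Matching KMeans rows:', order)
print('Differences:\n', np.round(diff, 2))
print('Largest gap (0-based feature index) per centroid:', np.abs(diff).argmax(axis=1))
```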

The centroids from the KMeans function are close to those calculated by the manual iteration method, but they differ noticeably in Feature 2 for one centroid (24.5 vs. 19.2) and in Feature 3 for the other (120.5 vs. 113.4). Again, this is probably related to the starting centroids that KMeans selects, which are randomized rather than fixed by hand as in the manual method.
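
The starting-centroid explanation can be checked by giving KMeans the same initial centroids as the manual loop. The sketch below relies on the `init` parameter of `KMeans` accepting an array of shape (n_clusters, n_features), with `n_init=1` so that only this one start is used. The names `start` and `km` are introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

data = [[8, 22, 62], [15, 51, 85], [9, 44, 121], [8, 51, 136], [8, 20, 93],
        [15, 64, 124], [14, 56, 101], [5, 10, 80], [5, 18, 73], [9, 26, 79]]

# Same starting centroids as centroidList in the manual loop
start = np.array([[10, 20, 80], [10, 50, 110]], dtype=float)

km = KMeans(n_clusters=2, init=start, n_init=1).fit(data)
print(km.cluster_centers_)
print(km.labels_)
```

If the explanation holds, this run should land on the manual centroids and labels rather than on the ones produced with the default initialization.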

***Now save your output.  Go to File -> Print Preview and save your final output as a PDF.  Turn it in to your instructor, along with any additional sheets.