## Revised Homework 2: Cold k-means

In this assignment, I use a dataset to sort people into two groups, to see whether it is possible to predict if someone will have a stroke.

I will be implementing a cold k-means.

In [20]:
#importing packages
from scipy.spatial import distance  
import pandas as pd
import numpy as np

In [45]:
#importing data
data = pd.read_csv("Country-data.csv", sep= ",")

data.head()

Unnamed: 0,country,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
0,Afghanistan,90.2,10.0,7.58,44.9,1610,9.44,56.2,5.82,553
1,Albania,16.6,28.0,6.55,48.6,9930,4.49,76.3,1.65,4090
2,Algeria,27.3,38.4,4.17,31.4,12900,16.1,76.5,2.89,4460
3,Angola,119.0,62.3,2.85,42.9,5900,22.4,60.1,6.16,3530
4,Antigua and Barbuda,10.3,45.5,6.03,58.9,19100,1.44,76.8,2.13,12200


In [46]:
#creating the subset
data_subset = data[['exports', 'imports','income','gdpp']]

#dropping rows with NaN values
data_subset = data_subset.dropna()
data_subset.head()


Unnamed: 0,exports,imports,income,gdpp
0,10.0,44.9,1610,553
1,28.0,48.6,9930,4090
2,38.4,31.4,12900,4460
3,62.3,42.9,5900,3530
4,45.5,58.9,19100,12200


In [47]:
#turning subset into numpy array
data_subset_NP = data_subset.to_numpy()

data_subset_NP

array([[1.00e+01, 4.49e+01, 1.61e+03, 5.53e+02],
       [2.80e+01, 4.86e+01, 9.93e+03, 4.09e+03],
       [3.84e+01, 3.14e+01, 1.29e+04, 4.46e+03],
       [6.23e+01, 4.29e+01, 5.90e+03, 3.53e+03],
       [4.55e+01, 5.89e+01, 1.91e+04, 1.22e+04],
       [1.89e+01, 1.60e+01, 1.87e+04, 1.03e+04],
       [2.08e+01, 4.53e+01, 6.70e+03, 3.22e+03],
       [1.98e+01, 2.09e+01, 4.14e+04, 5.19e+04],
       [5.13e+01, 4.78e+01, 4.32e+04, 4.69e+04],
       [5.43e+01, 2.07e+01, 1.60e+04, 5.84e+03],
       [3.50e+01, 4.37e+01, 2.29e+04, 2.80e+04],
       [6.95e+01, 5.09e+01, 4.11e+04, 2.07e+04],
       [1.60e+01, 2.18e+01, 2.44e+03, 7.58e+02],
       [3.95e+01, 4.87e+01, 1.53e+04, 1.60e+04],
       [5.14e+01, 6.45e+01, 1.62e+04, 6.03e+03],
       [7.64e+01, 7.47e+01, 4.11e+04, 4.44e+04],
       [5.82e+01, 5.75e+01, 7.88e+03, 4.34e+03],
       [2.38e+01, 3.72e+01, 1.82e+03, 7.58e+02],
       [4.25e+01, 7.07e+01, 6.42e+03, 2.18e+03],
       [4.12e+01, 3.43e+01, 5.41e+03, 1.98e+03],
       [2.97e+01, 5.

In [48]:
#min-max normalizing each variable in data_subset_NP and creating data_subset_norm

exports = data_subset_NP[:,0]
mx = np.max(exports)
mn = np.min(exports)

exports_norm = (exports - mn)/(mx - mn)
exports_norm = np.around(exports_norm, decimals = 2) 

imports = data_subset_NP[:,1]
mx = np.max(imports)
mn = np.min(imports)

imports_norm = (imports - mn)/(mx - mn)
imports_norm = np.around(imports_norm, decimals = 2) 

income = data_subset_NP[:,2]
mx = np.max(income)
mn = np.min(income)

income_norm = (income - mn)/(mx - mn)
income_norm = np.around(income_norm, decimals = 2) 

gdpp = data_subset_NP[:,3]
mx = np.max(gdpp)
mn = np.min(gdpp)

gdpp_norm = (gdpp - mn)/(mx - mn)
gdpp_norm = np.around(gdpp_norm, decimals = 2) 


data_subset_norm = np.stack((exports_norm, imports_norm,income_norm,gdpp_norm),axis=-1)
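
The four per-column blocks above all apply the same min-max formula, so they can be collapsed into a single vectorized step. A minimal sketch, assuming a 2-D numeric array with no constant columns; the helper name `min_max_normalize` and the toy array are illustrative and not part of the original notebook:

```python
import numpy as np

def min_max_normalize(arr, decimals=2):
    """Scale each column of a 2-D array to [0, 1], then round as the notebook does."""
    arr = np.asarray(arr, dtype=float)
    col_min = arr.min(axis=0)  # per-column minimum
    col_max = arr.max(axis=0)  # per-column maximum
    # Assumption: col_max > col_min for every column (a constant column would divide by zero).
    return np.around((arr - col_min) / (col_max - col_min), decimals=decimals)

# Toy data: the first three rows of the subset shown earlier
toy = np.array([[10.0, 44.9, 1610, 553],
                [28.0, 48.6, 9930, 4090],
                [38.4, 31.4, 12900, 4460]])
toy_norm = min_max_normalize(toy)
```

Broadcasting subtracts each column's minimum and divides by each column's range in one expression, which avoids reusing `mx` and `mn` across variables.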


In [49]:
def cold_kmeans(arrayName, k, randomState):
    df = pd.DataFrame(arrayName)
    
    #choosing k random data points as the initial centers (k=2 is used in this notebook)
    centers= df.sample(k, random_state = randomState)
    centers_np = centers.to_numpy()

    #oldCenters, used for storing the centers of the current iteration, is initialized
    oldCenters = []

    # The centers are recalculated until the newly computed centers equal the centers from 
    # the previous iteration. The first stopping condition is therefore that the clusters stop 
    # changing and all centers remain the same. 
    
    # As a safeguard in case the centers never settle exactly, a second stopping condition
    # caps the number of iterations at 200
    i = 0
    
    while (not (np.array_equal(oldCenters, centers_np)) and i<200):
        #iteration number increased
        i+=1
        
        #old Centers are set to whatever centers were calculated before 
        oldCenters = np.copy(centers_np)
        
        #new distances calculated and clusters are assigned
        dists = distance.cdist(arrayName, centers_np, 'euclidean')
        clusters = np.argmin(dists, axis=1)

        #looping over each cluster
        for n in range (len(centers_np)):
            subset = []
            
            #adding the indices of the points in each cluster
            #(j is used here so that the iteration counter i is not overwritten)
            for j in range (len(clusters)):
                if (clusters[j]==n):
                    subset.append(j)
                    
            #taking a subset of the points in the cluster
            sub = df.iloc[subset]
            
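            #finding a new center for this cluster and changing its value in centers_np
            #(an empty cluster keeps its previous center, since the mean of no points is NaN)
            if len(subset) > 0:
                centers_np[n] = sub.mean(axis=0)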
    
    return centers_np, clusters
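
The loop described in the comments above can also be written with a dedicated iteration counter and NumPy boolean masks. This is a minimal sketch rather than the notebook's implementation: the name `kmeans_sketch`, the `max_iter` parameter, the use of `np.random.default_rng` to pick the initial centers (the notebook uses `DataFrame.sample`), and the toy points are all assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(points, k, random_state, max_iter=200):
    """Cold-start k-means with two stopping conditions (hypothetical helper)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(random_state)
    # Pick k distinct data points as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()

    for n_iter in range(1, max_iter + 1):  # condition 2: cap on iterations
        # Euclidean distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)

        new_centers = centers.copy()
        for c in range(k):
            members = points[labels == c]  # boolean mask selects this cluster's points
            if len(members) > 0:           # an empty cluster keeps its old center
                new_centers[c] = members.mean(axis=0)

        if np.array_equal(new_centers, centers):  # condition 1: centers stopped moving
            break
        centers = new_centers

    return centers, labels, n_iter

# Toy data: two well-separated groups of three points each
toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centers, labels, n_iter = kmeans_sketch(toy, k=2, random_state=10)
```

Keeping the iteration counter in the `for` statement means no inner loop can overwrite it, so the cap on iterations always applies.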


In [50]:
data_subset_norm

output = cold_kmeans(data_subset_norm, 2, 10)

print(len(output[1]))

167


In [51]:
def compute_mse(truth_vec, predict_vec):
    return np.mean((truth_vec - predict_vec)**2)

0.4224893053575066

Question 1: Your k-means implementation
For this question, you will submit your "cold" k-means implementation and include your justification for the stopping condition(s) that you used in your implementation. Your implementation should be called my_kmeans() and should take three inputs in this order:

- A numpy array
- The number of clusters (i.e., k)
- The random_state

Your implementation should terminate with output including:

- The cluster centers
- Cluster labels for the data points

Question 2: Choosing k using elbowology
Part A
In k-means, we supply the number of clusters that we believe our data has. This means that the choice of k is made without being directly derived from the data. In this question, you will use elbowology to determine the number of clusters that our data falls into.

For this question, we will use students_info.csv. For each of the three variable combinations, please normalize your variables and then do the following:

1. Using either within-cluster sum of squares or average cluster cohesion as the measure of cluster "goodness", write a function looping_kmeans that performs k-means using sklearn and computes the "goodness" of clusters for k=1, k=2, ..., k=10. While the k-means should use the sklearn implementation, the measure of "goodness" should be written by you. The inputs should be 1) a numpy array and 2) a list of values for k, with the output being the list of the "goodness" measures.
2. Plot the values of k against your chosen measure of cluster "goodness" as a line plot with each point marked clearly.
3. Examining your plot, find the value of k that is closest to the "elbow", that is, where the plot changes direction most sharply. This point should look like the elbow on a bent arm.

The variable combinations are:

- Gym time and average cups of coffee
- Sleep and GPA
- All numerical variables within students_info.csv
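
The looping step in Part A could be sketched as follows. This is one possible shape under stated assumptions, not a reference solution: within-cluster sum of squares is chosen as the "goodness" measure, the helper name `within_cluster_ss`, the `random_state` default, and `n_init=10` are illustrative choices, and the toy array stands in for the normalized student data.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_cluster_ss(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = points - centers[labels]  # each point minus its own cluster's center
    return float(np.sum(diffs ** 2))

def looping_kmeans(points, k_values, random_state=0):
    """Fit sklearn's KMeans for each k and return the hand-computed goodness values."""
    goodness = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(points)
        goodness.append(within_cluster_ss(points, model.labels_, model.cluster_centers_))
    return goodness

# Toy data (assumption): two well-separated blobs in two dimensions
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                 rng.normal(1.0, 0.1, size=(20, 2))])
k_values = [1, 2, 3]
scores = looping_kmeans(toy, k_values)
```

Plotting `k_values` against `scores`, for example with `matplotlib.pyplot.plot(k_values, scores, marker="o")`, gives the line plot with each point marked; on this toy data the sharp drop happens between k=1 and k=2.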
Part B
Given your plots above, how many clusters do our students fall into? You must choose one number and justify your choice.

Question 3: Limitations of k-means
When working with a new method, we need to explore its limitations. For this question, we explore whether k-means can work on all data. Considering three kinds of data, please explain whether k-means can work on each. If you believe that it can, explain why; if not, offer a counterexample.

- Numerical variables: data that is just numbers, such as human height or outdoor temperature
- Categorical variables: data that is based on categories or classes, such as coffee/tea preferences or favorite color
- Ordinal variables: data that is categorical with an agreed-upon ordering, such as birth month or seasons

For each type, would k-means work? If yes, explain why. If not, provide a counterexample.
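
One way to probe the categorical case is to check whether distances survive an arbitrary relabeling of the categories. The sketch below is only an illustration under made-up assumptions: a three-value color variable and two hypothetical integer encodings, `encoding_a` and `encoding_b`, neither of which comes from the assignment data.

```python
import numpy as np

colors = ["red", "green", "blue"]

# Two equally arbitrary ways to turn the same categories into numbers
encoding_a = {"red": 0, "green": 1, "blue": 2}
encoding_b = {"red": 0, "green": 2, "blue": 1}

def pairwise_abs_dist(encoding, categories):
    """Absolute difference between the integer codes of every pair of categories."""
    codes = np.array([encoding[c] for c in categories], dtype=float)
    return np.abs(codes[:, None] - codes[None, :])

dist_a = pairwise_abs_dist(encoding_a, colors)
dist_b = pairwise_abs_dist(encoding_b, colors)

# Under encoding A, red is closer to green than to blue; under encoding B it is the reverse.
red_green_a, red_blue_a = dist_a[0, 1], dist_a[0, 2]
red_green_b, red_blue_b = dist_b[0, 1], dist_b[0, 2]
```

Because k-means assigns points by distance, a result that changes with the choice of encoding reflects the labeling rather than the data, which is the kind of observation a counterexample for this question could build on.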