# K-Means Simulation

K-means is a fairly simple algorithm, so in this notebook I will try to implement it from scratch.

## Algorithm Explanation

Given some data, K-means tries to find the most natural groups (clusters) in it, based on each point's distance to K cluster centers. The algorithm iteratively moves each center to the mean of its members' positions and reassigns every point to its nearest center, until the centers stop moving. The groups at that point are the final assignments.

The steps of operation can be defined as follows:

1. Choose the number of cluster centers, K.
2. Define K initial cluster centers by selecting random values in the feature space.
3. Calculate the distance from every point to each cluster center, and assign each point to its nearest center.
4. Repeat: 

    a) Update the position of each cluster center by taking the mean of the feature values of its members.
    
    b) Recalculate the distances between all data points and the new centers, and reassign each point to its nearest center.
    
    c) Stop when the assignments/labels no longer change.
    
There are several ways to perform these operations, and many variations on the steps above. Here, I will implement a minimal version for simplicity.
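Before working through the steps one at a time on real data, here is a compact NumPy sketch of the whole loop on a toy array. It is only an illustration under stated assumptions: the function name `kmeans_sketch`, the `max_iter` and `seed` parameters, and the choice to leave an empty cluster's center where it is are mine, not part of the notebook's code.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Minimal K-means loop on a 2-D NumPy array (rows are points)."""
    rng = np.random.default_rng(seed)
    lo, hi = points.min(axis=0), points.max(axis=0)
    # Step 2: draw K random centers inside the bounding box of the data
    centers = rng.uniform(lo, hi, size=(k, points.shape[1]))
    labels = None
    for _ in range(max_iter):
        # Steps 3 / 4b: distance from every point to every center, then nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4c: stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4a: move each center to the mean of its members
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers, labels

# Toy data: two loose groups of three points each
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centers, labels = kmeans_sketch(toy, k=2)
print(centers)
print(labels)
```

Because the initial centers are random, a different `seed` can lead to a different final grouping, which is one reason practical implementations usually restart several times.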

We'll also need some data to cluster. To make it interesting, let's use the CDC nutrition demographic data uploaded to [Kaggle](https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey). 

## Load Data and Preprocess

In [30]:
import pandas as pd

data= pd.read_csv('cdc_demographic.csv')
print(data.shape)
data.head()

(10175, 47)


Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,...,DMDHREDU,DMDHRMAR,DMDHSEDU,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,73557,8,2,1,69,,4,4,1.0,,...,3.0,4.0,,13281.237386,13481.042095,1,112,4.0,4.0,0.84
1,73558,8,2,1,54,,3,3,1.0,,...,3.0,1.0,1.0,23682.057386,24471.769625,1,108,7.0,7.0,1.78
2,73559,8,2,1,72,,3,3,2.0,,...,4.0,1.0,3.0,57214.803319,57193.285376,1,109,10.0,10.0,4.51
3,73560,8,2,1,9,,3,3,1.0,119.0,...,3.0,1.0,4.0,55201.178592,55766.512438,2,109,9.0,9.0,2.52
4,73561,8,2,2,73,,3,3,1.0,,...,5.0,1.0,5.0,63709.667069,65541.871229,2,116,15.0,15.0,5.0


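As you can see, a few columns contain many NaN (null) values. Let's drop any column in which fewer than a third of the values are non-null (that is, roughly two thirds or more are missing), which is what `thresh=len(data)/3` does in the cell below. For the columns that remain, we'll replace each NaN with that column's median.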
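The `thresh` argument of `dropna` is easy to misread, because it counts non-null values rather than nulls. The toy example below is a small sketch of that behaviour followed by the median fill. The column names are made up for illustration, and it uses integer division (`len(toy) // 3`) where the notebook cell passes `len(data)/3`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names
toy = pd.DataFrame({
    "mostly_null": [1.0, np.nan, np.nan, np.nan, np.nan, np.nan],  # 1 of 6 non-null
    "some_null":   [1.0, 2.0, np.nan, 4.0, np.nan, 6.0],           # 4 of 6 non-null
    "complete":    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],                 # 6 of 6 non-null
})

# thresh is the minimum number of NON-null values a column needs to be kept
kept = toy.dropna(thresh=len(toy) // 3, axis=1)

# Fill what is left with each column's median (computed ignoring NaN)
filled = kept.fillna(kept.median())

print(list(kept.columns))
print(filled)
```

With six rows the threshold is two non-null values, so only `mostly_null` is dropped, and the two gaps in `some_null` are filled with the median of its four observed values.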

In [31]:
#Drop heavy-null columns
data= data.dropna(thresh=len(data)/3, axis=1)

#Replace remaining nulls
data= data.fillna(data.median())
print(data.shape)
data.head()

(10175, 42)


Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,DMQMILIZ,...,DMDHREDU,DMDHRMAR,DMDHSEDU,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,73557,8,2,1,69,4,4,1.0,103.0,1.0,...,3.0,4.0,4.0,13281.237386,13481.042095,1,112,4.0,4.0,0.84
1,73558,8,2,1,54,3,3,1.0,103.0,2.0,...,3.0,1.0,1.0,23682.057386,24471.769625,1,108,7.0,7.0,1.78
2,73559,8,2,1,72,3,3,2.0,103.0,1.0,...,4.0,1.0,3.0,57214.803319,57193.285376,1,109,10.0,10.0,4.51
3,73560,8,2,1,9,3,3,1.0,119.0,2.0,...,3.0,1.0,4.0,55201.178592,55766.512438,2,109,9.0,9.0,2.52
4,73561,8,2,2,73,3,3,1.0,103.0,2.0,...,5.0,1.0,5.0,63709.667069,65541.871229,2,116,15.0,15.0,5.0


## Define Initial Cluster Centers

To do this, we take our data matrix of n columns and, column by column, randomly sample one observed value from each column. We repeat this K times, once for each of the K cluster centers.

In [36]:
import numpy as np

def initial_centers(k_num, input_data):
    
    center_values= []
    
    #Build one center per iteration
    for _ in range(k_num):
        #Sample one observed value from every column
        rand_values= [np.random.choice(input_data.iloc[:,col]) for col in range(input_data.shape[1])]
        center_values.append(rand_values)
        
    return center_values

In [35]:
centers0= pd.DataFrame(initial_centers(5, data), columns= data.columns)
print(centers0.shape)
centers0.head()

(5, 42)


Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,DMQMILIZ,...,DMDHREDU,DMDHRMAR,DMDHSEDU,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,77335,8,2,2,58,3,6,1.0,103.0,2.0,...,3.0,1.0,2.0,101795.871971,134586.772457,2,115,12.0,77.0,2.18
1,73944,8,2,1,3,4,3,1.0,20.0,2.0,...,5.0,5.0,4.0,18497.855373,17514.338376,2,105,15.0,5.0,1.705
2,81975,8,2,1,26,1,1,1.0,103.0,2.0,...,3.0,1.0,4.0,5717.661823,14604.826126,2,113,7.0,15.0,0.64
3,79474,8,2,2,14,3,3,2.0,103.0,2.0,...,3.0,5.0,5.0,22737.320612,15854.474767,1,112,3.0,7.0,1.28
4,82648,8,2,1,54,4,2,1.0,103.0,2.0,...,5.0,1.0,3.0,28762.603706,13009.147456,2,117,7.0,14.0,0.82
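The same column-wise sampling can be written without the nested loop. The version below is a sketch, not a replacement for the cell above: `sample_centers` and its `seed` parameter are hypothetical names, and it uses NumPy's `default_rng` so that runs can be reproduced.

```python
import numpy as np
import pandas as pd

def sample_centers(k_num, input_data, seed=None):
    """Draw k_num values per column, independently for each column."""
    rng = np.random.default_rng(seed)
    # One array of k_num sampled values per column (sampling with replacement)
    sampled = {col: rng.choice(input_data[col].to_numpy(), size=k_num)
               for col in input_data.columns}
    # Row i of the result is the i-th center
    return pd.DataFrame(sampled, columns=input_data.columns)

# Toy frame with made-up column names
toy = pd.DataFrame({"age": [9, 54, 69, 72, 73],
                    "income_ratio": [0.84, 1.78, 2.52, 4.51, 5.0]})
centers_toy = sample_centers(3, toy, seed=42)
print(centers_toy)
```

Note that, because every column is sampled independently, a center is generally a mix of values from different rows rather than an actual data point.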


## Calculate the Distance to Centers

In [167]:
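from sklearn.preprocessing import MinMaxScaler

def preproc_dists(array_1, array_2):
    #Scale data before calculating distance.
    #The scaler is fit on array_1 only, so both arrays end up on the same scale
    scaler = MinMaxScaler()
    scaler.fit(array_1)
     
    norm_1= scaler.transform(array_1)
    norm_2= scaler.transform(array_2)
    
    return norm_1, norm_2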

In [213]:
from scipy.spatial import distance
from sklearn.preprocessing import MinMaxScaler

def calc_closest_center(array_1, array_2):
    
    norm_1, norm_2= preproc_dists(array_1, array_2)
    
    #Calculate euclidean distance of data pts to k_clusters
    dist_matrix= pd.DataFrame(distance.cdist(norm_1, norm_2, metric='euclidean'))
    assigned_center= dist_matrix.idxmin(axis=1)
    
    #Return the label of the closest center for each point, plus the full distance matrix
    return assigned_center, dist_matrix

In [238]:
labels0, distmatr_output= calc_closest_center(data, centers0)
distmatr_output.head()

Unnamed: 0,0,1,2,3,4
0,2.663232,1.956968,2.163402,1.902756,2.307813
1,2.485627,1.808002,1.980917,1.875662,2.250152
2,2.800471,2.272875,2.540817,1.851571,2.6718
3,1.987791,1.729111,2.012231,2.223535,1.813013
4,2.216595,2.232526,2.394407,2.19161,2.503526
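To make the scaling and distance steps concrete, here is a NumPy-only sketch of what they compute on a toy example. It mirrors min-max scaling and a Euclidean distance matrix by hand rather than calling `MinMaxScaler` or `cdist`, and it assumes no column is constant (a constant column would cause a division by zero here). The arrays and variable names are made up for illustration.

```python
import numpy as np

# Toy data: the second feature has a much larger range than the first
points = np.array([[0.0, 0.0], [10.0, 100.0], [2.0, 10.0], [9.0, 80.0]])
centers = np.array([[1.0, 5.0], [9.0, 90.0]])

# Min-max scaling using the range of the data points only,
# then the same transform applied to the centers
lo, hi = points.min(axis=0), points.max(axis=0)
points_scaled = (points - lo) / (hi - lo)
centers_scaled = (centers - lo) / (hi - lo)

# Euclidean distance from every point (rows) to every center (columns)
dist_matrix = np.linalg.norm(
    points_scaled[:, None, :] - centers_scaled[None, :, :], axis=2)

# Index of the nearest center for each point
nearest = dist_matrix.argmin(axis=1)
print(dist_matrix)
print(nearest)  # [0 1 0 1]
```

Without the scaling step, the feature with the larger range would dominate the distances.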


In [239]:
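#Turn the label Series into a DataFrame with columns 'index' and 0 (the assigned center)
labels1= labels0.copy().reset_index()
#Show the first few data points assigned to center 1
labels1[labels1[0]== 1].merge(data.reset_index(), left_on='index', right_on='index').head()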

Unnamed: 0,index,0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDRETH1,RIDRETH3,RIDEXMON,...,DMDHREDU,DMDHRMAR,DMDHSEDU,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,1,1,73558,8,2,1,54,3,3,1.0,...,3.0,1.0,1.0,23682.057386,24471.769625,1,108,7.0,7.0,1.78
1,3,1,73560,8,2,1,9,3,3,1.0,...,3.0,1.0,4.0,55201.178592,55766.512438,2,109,9.0,9.0,2.52
2,5,1,73562,8,2,1,56,1,1,1.0,...,4.0,3.0,4.0,24978.144602,25344.992359,1,111,9.0,9.0,4.79
3,8,1,73565,8,1,1,42,2,2,2.0,...,3.0,1.0,5.0,23307.675629,0.0,2,106,15.0,15.0,5.0
4,10,1,73567,8,2,1,65,3,3,2.0,...,2.0,2.0,4.0,34002.622995,34795.429301,2,112,3.0,3.0,1.2


## Update the Centers and Calculate Within-Cluster Variation


Now that we have labels, we can recalculate the coordinates of each center by taking the mean of the feature values of the points currently assigned to it.

Here we will also calculate the <strong>within-cluster variation</strong>, a score that measures how spread out the points in each cluster are and that we can track at every iteration. The algorithm itself stops when the assigned cluster labels no longer change.
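As a small worked example of this score, the sketch below sums the Euclidean distances between all pairs of points in a cluster and divides by the number of points. The function name `within_cluster_variation` and the toy arrays are illustrative assumptions. Unlike the notebook's function, it skips the min-max rescaling step, and it counts each pair twice because it sums the full distance matrix.

```python
import numpy as np

def within_cluster_variation(cluster_points):
    """Sum of all pairwise Euclidean distances divided by the number of points."""
    diffs = cluster_points[:, None, :] - cluster_points[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2)  # (n, n) matrix, zeros on the diagonal
    return pairwise.sum() / len(cluster_points)

tight = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
loose = tight * 10  # same shape, ten times more spread out

print(within_cluster_variation(tight))
print(within_cluster_variation(loose))
```

A tighter cluster gets a lower score, so lower values indicate more compact groups.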

In [296]:
def update_centers(input_data, k_num, center_labels):
    """This function will take the raw data, current cluster labels, and k value,
    to calculate (for every cluster):
    1) New cluster center means (eg. coordinates)
    2) The cluster variation score to optimize for at every iteration."""
    
    data_copy= input_data.copy()
    labels_copy= center_labels.copy().reset_index()
    
    cluster_data_list= []
    center_means_list= []
    clust_var_list= []
    
    for k in range(k_num):
        #First group the original data points based on their assigned cluster label
        cluster_data= labels_copy[labels_copy[0]== k].merge(data_copy.reset_index(),
                               left_on= 'index', right_on= 'index').set_index('index')
        cluster_data= cluster_data.drop([0], axis=1)
        cluster_data_list.append(cluster_data)
        
        #Now calculate the new mean center from the subset of points belonging to this cluster
        new_center= cluster_data.mean(axis=0).to_frame().T
        center_means_list.append(new_center)
                
        #Now calculate within cluster variation (sum of euclidean distances/# of points)
        #by calculating the euc. distance of points to each other
        _, new_dists= calc_closest_center(cluster_data, cluster_data)
 
        #Then add euc. distances of one point to all others, then add this to equivalent for all points
        cluster_variation= new_dists.sum().sum()/len(cluster_data)
        clust_var_list.append(cluster_variation)
        
    #Output list of new center means, list of cluster variation scores for each cluster
    return center_means_list, clust_var_list

In [297]:
#Seems to be working but this error is confusing
cluster_opt= update_centers(data, 5, labels0)

In [298]:
cluster_opt[1]

[530.92244049440046,
 3242.823411288482,
 3009.0351144709607,
 14262.370625727854,
 1486.1758570821185]

In [160]:
new_means

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,DMQMILIZ,...,DMDHREDU,DMDHRMAR,DMDHSEDU,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,78116.228972,8.0,2.0,1.88785,21.182243,3.985981,4.471963,1.060748,114.369159,1.995327,...,3.350467,1.939252,3.602804,29869.349709,31063.024496,1.920561,113.17757,17.126168,18.196262,2.317103
1,77181.828115,8.0,1.976016,1.175883,28.0493,3.649567,3.96469,1.343771,86.060626,1.918721,...,3.851432,2.854763,4.00533,29252.733795,29985.190201,1.864757,109.794137,10.433045,9.910726,2.416752
2,79099.197007,8.0,1.985869,1.304239,32.847049,1.612635,1.624273,1.250208,113.41064,1.952618,...,2.516209,2.236076,3.054032,21786.390062,22040.835773,1.612635,112.682461,12.746467,12.389859,1.454219
3,78751.91826,8.0,1.954966,1.65438,34.808297,3.178439,3.376002,1.673041,107.612585,1.946021,...,3.637107,2.725632,3.888495,34056.767171,33761.008633,1.333282,110.46607,10.59269,10.170419,2.353776
4,80015.693402,8.0,1.978008,1.141009,11.001294,3.337646,3.521345,1.204398,102.26132,1.989651,...,3.460543,3.302717,3.831824,17944.005371,18276.13564,1.689521,113.636481,9.093144,8.857697,1.748292


# Work in Progress

In [200]:
def update_centers(input_data, k_num, center_labels):
    
    data_copy= input_data.copy()
    labels_copy= center_labels.copy().reset_index()
    
    cluster_data_list= []
    center_means= []
    dist_sum_list= []
    
    for k in range(k_num):
        #Select the data points currently assigned to cluster k
        cluster_data= labels_copy[labels_copy[0]== k].merge(data_copy.reset_index(),
                               left_on= 'index', right_on= 'index').set_index('index')
        cluster_data= cluster_data.drop([0], axis=1)
        cluster_data_list.append(cluster_data)
        
        #New center: column-wise mean of the members, kept as a one-row DataFrame
        new_center= cluster_data.mean(axis=0).to_frame().T
        center_means.append(new_center)
        
        print(k, cluster_data.shape)#, new_center.shape)
        print(type(new_center), new_center.shape)
        
        #Calculate sum of distances of points from new center
        #calc_closest_center returns (labels, distance matrix); only the distances are needed
        _, new_dists= calc_closest_center(cluster_data, new_center)
        new_dists_sum= new_dists.sum().sum()
        dist_sum_list.append(new_dists_sum)
        
    
    #Stack the one-row center frames into a single (k_num, n_features) DataFrame
    return pd.concat(center_means, ignore_index=True), cluster_data_list