# K-Means Simulation

K-Means is a fairly simple algorithm, so in this notebook I will try to implement it from scratch. 

## Algorithm Explanation

Given some data, K-Means tries to find the most natural groups (clusters) in it, based on each point's distance to K cluster centers. The algorithm iteratively moves each center to the mean position of its members until the centers stop moving, at which point the current groups become the final assignments. 

The steps of operation can be defined as follows:

1. Define the number of cluster centers, K.
2. Define K initial cluster centers by selecting random values in the feature space.
3. Calculate the distance from every point to each cluster center, and assign each point to its nearest center.
4. Repeat: 

    a) Update the position of each cluster center by taking the mean of the feature values of its members.
    
    b) Recalculate the distance between all data points and the new centers, and reassign each point to its nearest center. 
    
    c) Stop when the assignments/labels no longer change.
    
There are several ways to perform these operations, and many variations on the steps above. Here, I will implement a minimal version for simplicity.
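
The steps above can be sketched end to end in a few lines of NumPy. This is only a minimal sketch under assumptions that are not part of the notebook: the function name `kmeans_sketch`, the `max_iter` and `seed` parameters, the uniform initialization within each feature's range, and the rule that an empty cluster keeps its previous center are all illustrative choices.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Minimal K-Means loop (hypothetical helper); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # Step 2: draw each coordinate of each center uniformly within the data range
    low, high = points.min(axis=0), points.max(axis=0)
    centers = rng.uniform(low, high, size=(k, points.shape[1]))

    labels = None
    for _ in range(max_iter):
        # Steps 3 and 4b: Euclidean distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)

        # Step 4c: stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 4a: move each center to the mean of its members
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)

    return centers, labels

# Usage on two made-up blobs of 50 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
centers, labels = kmeans_sketch(np.vstack([blob_a, blob_b]), k=2)
```

The returned `labels` array holds one cluster index per row, and `centers` holds one row per cluster. Depending on where the random centers start, a cluster can end up empty, which is one reason practical implementations usually try several initializations.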

We'll also need some data to cluster. To make it interesting, let's use the CDC demographic data from the National Health and Nutrition Examination Survey, uploaded to [Kaggle](https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey). 

## Load Data and Preprocess

In [30]:
import pandas as pd

data= pd.read_csv('cdc_demographic.csv')
print(data.shape)
data.head()

(10175, 47)


Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,...,DMDHREDU,DMDHRMAR,DMDHSEDU,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,73557,8,2,1,69,,4,4,1.0,,...,3.0,4.0,,13281.237386,13481.042095,1,112,4.0,4.0,0.84
1,73558,8,2,1,54,,3,3,1.0,,...,3.0,1.0,1.0,23682.057386,24471.769625,1,108,7.0,7.0,1.78
2,73559,8,2,1,72,,3,3,2.0,,...,4.0,1.0,3.0,57214.803319,57193.285376,1,109,10.0,10.0,4.51
3,73560,8,2,1,9,,3,3,1.0,119.0,...,3.0,1.0,4.0,55201.178592,55766.512438,2,109,9.0,9.0,2.52
4,73561,8,2,2,73,,3,3,1.0,,...,5.0,1.0,5.0,63709.667069,65541.871229,2,116,15.0,15.0,5.0


As you can see, a few columns are mostly NaN. Let's drop any column that has fewer than one third of its values present (that is, more than about two thirds missing). In the columns that remain, we'll replace each NaN with that column's median.

In [31]:
#Drop heavy-null columns
data= data.dropna(thresh=len(data)/3, axis=1)

#Replace remaining nulls
data= data.fillna(data.median())
print(data.shape)
data.head()

(10175, 42)


Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,DMQMILIZ,...,DMDHREDU,DMDHRMAR,DMDHSEDU,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,73557,8,2,1,69,4,4,1.0,103.0,1.0,...,3.0,4.0,4.0,13281.237386,13481.042095,1,112,4.0,4.0,0.84
1,73558,8,2,1,54,3,3,1.0,103.0,2.0,...,3.0,1.0,1.0,23682.057386,24471.769625,1,108,7.0,7.0,1.78
2,73559,8,2,1,72,3,3,2.0,103.0,1.0,...,4.0,1.0,3.0,57214.803319,57193.285376,1,109,10.0,10.0,4.51
3,73560,8,2,1,9,3,3,1.0,119.0,2.0,...,3.0,1.0,4.0,55201.178592,55766.512438,2,109,9.0,9.0,2.52
4,73561,8,2,2,73,3,3,1.0,103.0,2.0,...,5.0,1.0,5.0,63709.667069,65541.871229,2,116,15.0,15.0,5.0
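
To make the `thresh` argument concrete: in `dropna`, it is the minimum number of non-null values a column needs in order to be kept, so `len(data)/3` keeps any column that is at least one third populated. Below is a small sketch on a made-up frame. The `toy` data and its column names are hypothetical, and the threshold is rounded up to an integer here, which is an assumption rather than what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: six rows, three columns with different amounts of missing values
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],  # 1 of 6 present
    "some_missing": [1.0, np.nan, 3.0, np.nan, 5.0, 7.0],             # 4 of 6 present
    "complete": [1, 2, 3, 4, 5, 6],                                   # 6 of 6 present
})

# A column survives only if it has at least this many non-null values
min_non_null = int(np.ceil(len(toy) / 3))  # 2 for six rows
kept = toy.dropna(thresh=min_non_null, axis=1)

# Fill whatever is still missing with the per-column median
filled = kept.fillna(kept.median())
print(list(kept.columns))
print(filled["some_missing"].tolist())
```

Only `mostly_missing` falls below the threshold, and the gaps in `some_missing` are filled with the median of its observed values.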


## Define Initial Cluster Centers

To do this, we take our data matrix of n columns and, column by column, randomly sample one observed value from each column. Doing this K times gives one point for each of the K cluster centers.

In [36]:
import numpy as np

def initial_centers(k_num, input_data):
    
    center_values= []
    
    #Build one center per iteration
    for _ in range(k_num):
        #Draw one random observed value from each column
        rand_values= [np.random.choice(input_data.iloc[:,col]) for col in range(input_data.shape[1])]
        center_values.append(rand_values)
        
    return center_values

In [35]:
centers0= pd.DataFrame(initial_centers(5, data), columns= data.columns)
print(centers0.shape)
centers0.head()

(5, 42)


Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,DMQMILIZ,...,DMDHREDU,DMDHRMAR,DMDHSEDU,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,77335,8,2,2,58,3,6,1.0,103.0,2.0,...,3.0,1.0,2.0,101795.871971,134586.772457,2,115,12.0,77.0,2.18
1,73944,8,2,1,3,4,3,1.0,20.0,2.0,...,5.0,5.0,4.0,18497.855373,17514.338376,2,105,15.0,5.0,1.705
2,81975,8,2,1,26,1,1,1.0,103.0,2.0,...,3.0,1.0,4.0,5717.661823,14604.826126,2,113,7.0,15.0,0.64
3,79474,8,2,2,14,3,3,2.0,103.0,2.0,...,3.0,5.0,5.0,22737.320612,15854.474767,1,112,3.0,7.0,1.28
4,82648,8,2,1,54,4,2,1.0,103.0,2.0,...,5.0,1.0,3.0,28762.603706,13009.147456,2,117,7.0,14.0,0.82
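
Because each column is sampled independently, a center built this way mixes values from different rows and need not resemble any real observation. One common variation, which is not what this notebook does, is to pick K existing rows as the starting centers. A minimal sketch, where the function name `sample_rows_as_centers` and the `seed` parameter are hypothetical:

```python
import numpy as np

def sample_rows_as_centers(points, k, seed=0):
    """Pick k distinct rows of `points` to use as initial centers."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Sample row positions without replacement so no center is repeated
    row_ids = rng.choice(len(points), size=k, replace=False)
    return points[row_ids]

# Usage on a made-up 10 x 2 array
toy_points = np.arange(20, dtype=float).reshape(10, 2)
toy_centers = sample_rows_as_centers(toy_points, k=3)
```

Every returned center is an actual data point, so each one starts with at least one point (itself) at distance zero.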


## Calculate the Distance to Centers

In [50]:
from scipy.spatial import distance
from sklearn.preprocessing import MinMaxScaler

def calc_closest_center(array_1, array_2):
    #Scale data before calculating distance
    scaler = MinMaxScaler()
    scaler.fit(array_1)
     
    norm_1= scaler.transform(array_1)
    norm_2= scaler.transform(array_2)
    
    #Calculate euclidean distance of data pts to k_clusters
    dist_matrix= pd.DataFrame(distance.cdist(norm_1, norm_2, metric='euclidean'))
    assigned_center= dist_matrix.idxmin(axis=1)
    
    #Return label of closest center
    return assigned_center

In [55]:
labels0= calc_closest_center(data, centers0)
labels0.head()

0    3
1    1
2    3
3    1
4    3
dtype: int64
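
To see what the scaling and the distance matrix do, here is the same idea written out with plain NumPy on a tiny made-up example. The arrays are hypothetical, and the manual min-max formula is a sketch that assumes no column is constant (a zero range would divide by zero here).

```python
import numpy as np

# Hypothetical data: three points and two centers, with features on very different scales
points = np.array([[0.0, 100.0],
                   [10.0, 200.0],
                   [2.0, 120.0]])
centers = np.array([[1.0, 110.0],
                    [9.0, 190.0]])

# Min-max scale both arrays using the range of the data points only
low = points.min(axis=0)
span = points.max(axis=0) - low  # assumption: no zero-range columns
scaled_points = (points - low) / span
scaled_centers = (centers - low) / span

# Pairwise Euclidean distances: one row per point, one column per center
dists = np.linalg.norm(scaled_points[:, None, :] - scaled_centers[None, :, :], axis=2)

# The label of each point is the column index of its smallest distance
labels = dists.argmin(axis=1)
print(labels)
```

Without the scaling step, the second feature would dominate the distance simply because its values are larger.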

## Update the Centers

Now that we have labels, we can recalculate the coordinates of each center by taking the mean of the feature values of the points currently assigned to it.

In [78]:
def update_centers(input_data, k_num, center_labels):
    
    data_copy= input_data.copy()
    labels_copy= center_labels.copy().reset_index()
    
    center_means= []
    
    for k in range(k_num):
        cluster_data= pd.merge(data_copy.reset_index(), labels_copy[labels_copy[0]== k],
                               left_on= 'index', right_on= 0)
        print(cluster_data.shape)
        new_center= cluster_data.set_index('index').mean(axis=0)
        center_means.append(new_center)
    
    return center_means

In [79]:
new_means= update_centers(data, 5, labels0)

(214, 46)
(1501, 46)
(1203, 46)
(6484, 46)
(773, 46)


In [80]:
new_means[4]

index_x         4.000000
SEQN        73561.000000
SDDSRVYR        8.000000
RIDSTATR        2.000000
RIAGENDR        2.000000
RIDAGEYR       73.000000
RIDRETH1        3.000000
RIDRETH3        3.000000
RIDEXMON        1.000000
RIDEXAGM      103.000000
DMQMILIZ        2.000000
DMDBORN4        1.000000
DMDCITZN        1.000000
DMDEDUC2        5.000000
DMDMARTL        1.000000
SIALANG         1.000000
SIAPROXY        2.000000
SIAINTRP        2.000000
FIALANG         1.000000
FIAPROXY        2.000000
FIAINTRP        2.000000
MIALANG         1.000000
MIAPROXY        2.000000
MIAINTRP        2.000000
AIALANGA        1.000000
DMDHHSIZ        2.000000
DMDFMSIZ        2.000000
DMDHHSZA        0.000000
DMDHHSZB        0.000000
DMDHHSZE        2.000000
DMDHRGND        1.000000
DMDHRAGE       78.000000
DMDHRBR4        1.000000
DMDHREDU        5.000000
DMDHRMAR        1.000000
DMDHSEDU        5.000000
WTINT2YR    63709.667069
WTMEC2YR    65541.871229
SDMVPSU         2.000000
SDMVSTRA      116.000000
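
The values printed for `new_means[4]` match row 4 of `data` exactly (`SEQN` 73561, `RIDAGEYR` 73, `index_x` 4). This suggests the merge pairs the data's row index with the label value in column `0` rather than with the row each label belongs to, so each "mean" is just row k repeated once per cluster member. One way around this is to select members with a boolean mask. The sketch below assumes the labels share the data's row index; the function name `update_centers_by_mask` and the toy frame are hypothetical.

```python
import pandas as pd

def update_centers_by_mask(input_data, k_num, center_labels):
    """Return one mean vector (a Series) per cluster label 0..k_num-1."""
    center_means = []
    for k in range(k_num):
        # Keep only the rows whose assigned label equals k
        members = input_data[center_labels == k]
        # Column-wise mean of the members; an empty cluster yields NaN values
        center_means.append(members.mean(axis=0))
    return center_means

# Usage on a made-up frame with two obvious groups
toy = pd.DataFrame({"x": [0.0, 2.0, 10.0, 12.0], "y": [1.0, 3.0, 5.0, 7.0]})
toy_labels = pd.Series([0, 0, 1, 1])
means = update_centers_by_mask(toy, 2, toy_labels)
print(means[0].tolist(), means[1].tolist())
```

Here each center is the average of its two members, rather than a copy of a single row.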


In [59]:
labels0.values

array([3, 1, 3, ..., 3, 4, 4])