## Day 47 Lecture 2 Assignment

In this assignment, we will perform K-Medoids clustering using responses to a survey about student life at a university.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from pyclustering.cluster.kmedoids import kmedoids
import random

This dataset consists of 35 binary features, each corresponding to a yes/no survey question that characterizes the student taking the survey. Each feature name summarizes its question, so we will not list them all here.

Load the dataset.

In [2]:
url = "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/student_life_survey.csv"
df = pd.read_csv(url)
df.head()




Unnamed: 0,Q2-Participated in Societies and Interest Groups,Q2-Participated in Clubs,"Q2-Participated in Halls, JCRCs and/or Residential College CSCs",Q2-Participated in University organised events,Q3-Interested in Arts & Culture,Q3-Interested in Science & Technology,Q3-Interested in Research and independent study,Q3-Interested in Sports,"Q3-Interested in Other competitions (eg case, debates)",Q3-Interested in Entrepreneurship,...,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others,response_id
0,0,1,0,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
1,0,1,0,0,1,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,2
2,0,0,1,0,0,0,0,1,0,0,...,1,0,0,0,0,0,1,0,0,3
3,1,1,1,1,0,1,1,0,0,0,...,1,0,1,1,1,1,0,1,0,4
4,1,0,1,1,0,1,1,0,0,1,...,1,1,0,1,0,0,0,1,0,5


For our analysis, we will focus on the subset of survey questions about stress. These questions all begin with the string 'Q5'. Filter the columns that meet this criterion (there should be 10 in total).
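As a small aside on the filtering step: `DataFrame.filter(like=...)` matches the substring anywhere in the column name, while `filter(regex=...)` with a `^` anchor matches only names that start with it. The sketch below uses a made-up toy frame (the "Notes on Q5" column is hypothetical and not part of the survey) to show the difference; for this dataset both approaches should select the same columns if no other column name contains 'Q5'.

```python
import pandas as pd

# Toy frame with invented values; "Notes on Q5" is a hypothetical column
# that mentions Q5 without being a Q5 question.
toy = pd.DataFrame({
    "Q3-Interested in Sports": [0, 1],
    "Q5-Stressed about Academic issues": [1, 1],
    "Q5-Stressed about Others": [0, 0],
    "Notes on Q5": [0, 1],
})

by_substring = toy.filter(like="Q5")    # substring match anywhere in the name
by_prefix = toy.filter(regex=r"^Q5")    # only names that begin with "Q5"

print(list(by_substring.columns))
print(list(by_prefix.columns))
```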

In [3]:
q5 = df.filter(like="Q5")
q5_copy = q5.copy()



The pyclustering implementation of kmedoids supports a variety of distance metrics, but they are primarily intended for numeric data. We will use SMC/Hamming dissimilarity and precompute the dissimilarity matrix (an alternative would be to specify a user-defined distance function, which you are welcome to try in addition).

We'll assume for the next step that a pair of negative values (i.e. both responses are "no") is as informative as a pair of positive values. Compute the full distance/dissimilarity matrix for the survey data using matching/hamming distance.
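To make the metric concrete, here is a minimal sketch on invented toy data (three respondents, four yes/no questions). It illustrates that the Hamming distance returned by `pdist` is the fraction of answers that differ, which is one minus the simple matching coefficient (SMC), so a shared "no" counts as agreement just like a shared "yes".

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up responses: rows are respondents, columns are yes/no questions.
toy = np.array([
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
])

# Hamming distance: fraction of positions where two rows differ.
toy_dmat = squareform(pdist(toy, metric="hamming"))

# Simple matching coefficient: fraction of positions where two rows agree,
# counting 0/0 matches as well as 1/1 matches.
smc_01 = np.mean(toy[0] == toy[1])

print(toy_dmat)
print(smc_01, 1 - toy_dmat[0, 1])
```

Rows 0 and 1 differ on one of four questions, so their distance is 0.25 and their SMC is 0.75.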

In [4]:
dist = pdist(q5, metric='hamming')
dmat = squareform(dist)
# dmat.shape

dmat_df = pd.DataFrame(dmat, index=q5.index, columns=q5.index)
# dmat_df.head()

Using the dissimilarity matrix, perform kmedoids clustering using k=2. Set the initial medoids randomly. Note that pyclustering expects the distance matrix to be a numpy array; a pandas dataframe may cause errors. 

Which survey responses are chosen as the cluster representatives? Print out the details of these responses.
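For intuition about what a "cluster representative" is, the sketch below computes a medoid directly from a precomputed distance matrix: the member whose total distance to the other members is smallest. This is the textbook definition applied to a toy matrix, not a description of pyclustering's internals; `cluster_medoid` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np

def cluster_medoid(dmat, members):
    """Return the member index with the smallest total distance to the other members."""
    members = np.asarray(members)
    # Restrict the distance matrix to the rows and columns of this cluster.
    sub = dmat[np.ix_(members, members)]
    return int(members[sub.sum(axis=1).argmin()])

# Invented symmetric distance matrix for four points.
toy_dmat = np.array([
    [0.0, 0.2, 0.4, 0.9],
    [0.2, 0.0, 0.2, 0.8],
    [0.4, 0.2, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
])

print(cluster_medoid(toy_dmat, [0, 1, 2]))
```

Unlike a centroid, the medoid is always an actual observation, which is why it can be looked up in the survey data and inspected as a real response.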

In [5]:
k = 2
# np.random.seed(42)

n_rows = dmat.shape[0]
# sample without replacement so the two initial medoids are distinct rows
init_medoids = np.random.choice(n_rows, size=k, replace=False)
init_medoids

array([1033,  721])

In [6]:
kmed = kmedoids(
    dmat, initial_index_medoids=init_medoids, data_type='distance_matrix'
)
kmed.process()

# get index of medoids
medoid_idxs = kmed.get_medoids()
# medoid_idxs

# show cluster representatives
q5_copy.iloc[medoid_idxs, :]

Unnamed: 0,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others
0,0,1,0,0,0,0,0,0,0,0
2415,0,1,0,0,0,0,0,1,0,0


If you run the previous cell a few times, you'll probably notice that the medoids are very sensitive to initialization. A common approach to produce well-separated clusters is to choose initial medoids that are far apart. Re-run the previous process, except with a random pair of medoids that have a dissimilarity of 0.8 or higher. Are the results more stable now? How would you describe the typical clusters you see?
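One way to pick such a pair is to redraw random indices until the threshold is met. An alternative, sketched here on a toy matrix, is to list every pair that already meets the threshold and sample one of them uniformly. `sample_far_pair` is a hypothetical helper, and the 0.8 default simply mirrors the threshold stated above.

```python
import numpy as np

def sample_far_pair(dmat, threshold=0.8, rng=None):
    """Pick one random pair (i, j), i < j, with dmat[i, j] >= threshold."""
    rng = np.random.default_rng() if rng is None else rng
    # Keep only the strict upper triangle so each unordered pair appears once
    # and a point is never paired with itself.
    rows, cols = np.nonzero(np.triu(dmat >= threshold, k=1))
    if rows.size == 0:
        raise ValueError("no pair meets the threshold")
    pick = rng.integers(rows.size)
    return [int(rows[pick]), int(cols[pick])]

# Invented distance matrix: pairs (0, 2) and (1, 2) meet the threshold.
toy_dmat = np.array([
    [0.0, 0.3, 0.9],
    [0.3, 0.0, 0.8],
    [0.9, 0.8, 0.0],
])

pair = sample_far_pair(toy_dmat, threshold=0.8, rng=np.random.default_rng(0))
print(pair)
```

The returned list could then be passed as `initial_index_medoids`. Compared with redrawing in a loop, this needs no retries, at the cost of scanning the full matrix once.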

In [7]:
kmed = kmedoids(
    dmat, initial_index_medoids=init_medoids, data_type='distance_matrix'
)
kmed.process()

# get index of medoids
medoid_idxs = kmed.get_medoids()
display(medoid_idxs)

dis = dmat_df.loc[medoid_idxs[0], medoid_idxs[1]]

[0, 2415]

In [13]:
dmat_df.loc[dmat_df[0] >= 0.8, :]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2948,2949,2950,2951,2952,2953,2954,2955,2956,2957
245,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
263,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
496,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
752,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
812,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
866,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
878,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
923,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
941,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
1125,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6


In [15]:
k = 2
n_rows = dmat.shape[0]

# Rejection sampling: redraw a pair of distinct indices until the two
# candidate medoids have a dissimilarity of at least 0.8.
count = 0
init_medoids = np.random.choice(n_rows, size=k, replace=False)
dis = dmat_df.loc[init_medoids[0], init_medoids[1]]

while dis < 0.8:
    count += 1
    init_medoids = np.random.choice(n_rows, size=k, replace=False)
    dis = dmat_df.loc[init_medoids[0], init_medoids[1]]
    print(f"{count} : {init_medoids}")

# cluster starting from the well-separated initial medoids
kmed = kmedoids(
    dmat, initial_index_medoids=init_medoids, data_type='distance_matrix'
)
kmed.process()
medoid_idxs = kmed.get_medoids()
print(f"DS: {dis}")
medoid_idxs



1 : [ 843 2776]
2 : [2099  456]
3 : [1636 2789]
4 : [627 628]
5 : [2390  319]
6 : [1371  481]
7 : [  79 1889]
8 : [2750 1366]
9 : [ 327 1383]
10 : [ 698 2510]
11 : [171 633]
12 : [378 915]
13 : [1440 1601]
14 : [258  40]
15 : [ 487 2207]
16 : [2781 1791]
17 : [1512  537]
18 : [ 195 1240]
19 : [ 788 1323]
20 : [2115 1788]
21 : [1572 2864]
22 : [1910 2431]
23 : [1817 1595]
24 : [ 804 1589]
25 : [2184  961]
26 : [ 884 1016]
27 : [220 716]
28 : [510 599]
29 : [1441 2393]
30 : [ 341 1728]
31 : [1989  460]
32 : [2073 2678]
33 : [2823 2861]
34 : [1461 1417]
35 : [1823 1320]
36 : [ 553 2259]
37 : [ 69 834]
38 : [ 61 672]
39 : [ 614 2318]
40 : [299 349]
41 : [1370 1662]
42 : [1304 2883]
43 : [2175 2323]
44 : [ 817 2042]
45 : [2198 1276]
46 : [1738 1840]
47 : [608  10]
48 : [2351  508]
49 : [ 161 1158]
50 : [ 981 1681]
51 : [1133 1103]
52 : [1075 1122]
53 : [ 205 2881]
54 : [ 497 2829]
55 : [1313  624]
56 : [  78 1959]
57 : [2367  719]
58 : [2669 1600]
59 : [ 398 2226]
60 : [1398 2089]
61 : [646

[2415, 1029]