## Day 47 Lecture 2 Assignment

In this assignment, we will perform K-Medoids clustering using responses to a survey about student life at a university.

In [1]:
# !pip install pyclustering

In [2]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.cluster import cluster_visualizer
import random

This dataset consists of 35 binary features, each corresponding to a yes/no survey question that characterizes the student taking the survey. Each feature name summarizes its question, so we will not list them all here.

Load the dataset.

In [3]:
# answer goes here
url = 'https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/student_life_survey.csv'

survey_df = pd.read_csv(url)

survey_df.head()

Unnamed: 0,Q2-Participated in Societies and Interest Groups,Q2-Participated in Clubs,"Q2-Participated in Halls, JCRCs and/or Residential College CSCs",Q2-Participated in University organised events,Q3-Interested in Arts & Culture,Q3-Interested in Science & Technology,Q3-Interested in Research and independent study,Q3-Interested in Sports,"Q3-Interested in Other competitions (eg case, debates)",Q3-Interested in Entrepreneurship,Q3-Interested in Volunteering,Q3-Interested in Others,Q4-Passionate about Animal welfare,Q4-Passionate about Arts/Culture/Heritage,Q4-Passionate about Children/Youth,Q4-Passionate about Community building,"Q4-Passionate about Diversity & Inclusion (e.g. special needs, migrant worker, interfaith and intercultural understanding)",Q4-Passionate about Environmental sustainability,Q4-Passionate about Families,Q4-Passionate about Health/Well-being (e.g mental health),Q4-Passionate about Seniors,Q4-Passionate about Poverty reduction,Q4-Passionate about Education,Q4-Passionate about None of the above,Q4-Passionate about Others,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others,response_id
0,0,1,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,2
2,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,3
3,1,1,1,1,0,1,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,1,0,1,1,1,1,0,1,0,4
4,1,0,1,1,0,1,1,0,0,1,1,0,0,0,1,1,0,1,0,1,0,0,1,0,0,0,1,1,0,1,0,0,0,1,0,5


In [4]:
survey_df.isnull().sum().loc[lambda x: x>0]

Series([], dtype: int64)

In [5]:
test1 = [np.nan, 1, 0 , np.nan]
test2 = [0, 1, 0 , 1]
test = pd.DataFrame({'test1':test1, 'test2':test2})
test.isnull().sum().loc[lambda x: x>0]

test1    2
dtype: int64

For our analysis, we will focus on the subset of survey questions about stress. The names of these questions all begin with the string 'Q5'. Select the columns that meet this criterion (there should be 10 in total).

In [6]:
# answer goes here
stress_df = survey_df.filter(like='Q5', axis=1)

stress_df

Unnamed: 0,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others
0,0,1,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,1,0,0
3,1,1,0,1,1,1,1,0,1,0
4,0,1,1,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
2953,0,1,1,0,0,1,0,0,0,0
2954,0,1,0,0,0,0,0,0,0,0
2955,0,1,0,0,0,0,0,0,0,0
2956,0,1,0,0,0,0,1,1,0,0


The pyclustering implementation of kmedoids supports a variety of distance metrics, but they are primarily intended for numeric data. We will use SMC/Hamming dissimilarity and precompute the dissimilarity matrix (an alternative would be to specify a user-defined distance function, which you are welcome to try in addition).

We'll assume for the next step that a pair of negative values (i.e. both responses are "no") is as informative as a pair of positive values. Compute the full distance/dissimilarity matrix for the survey data using matching/hamming distance.
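As a quick sanity check on what matching/Hamming dissimilarity measures, the following dependency-free sketch computes it by hand for two respondents. The helper name `hamming_dissimilarity` is an assumption made for illustration; the two vectors are the Q5 answers of responses 0 and 3 as displayed in the filtered table above.

```python
def hamming_dissimilarity(a, b):
    """Fraction of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


# Q5 answers of responses 0 and 3, copied from the filtered table.
resp_a = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
resp_b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

d = hamming_dissimilarity(resp_a, resp_b)
smc = 1 - d  # simple matching coefficient, the corresponding similarity
print(d, smc)
```

Because a shared "no" counts as a match just like a shared "yes", six mismatches out of ten questions give a dissimilarity of 0.6, which should agree with the entry for the pair (0, 3) produced by `pdist(..., metric='hamming')`.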

In [7]:
# answer goes here
ham_matrix_np = squareform(pdist(stress_df, metric ='hamming'))
ham_matrix = pd.DataFrame(ham_matrix_np, 
                          index=stress_df.index, columns=stress_df.index)
# np.fill_diagonal(ham_matrix.values, np.nan)
ham_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,2918,2919,2920,2921,2922,2923,2924,2925,2926,2927,2928,2929,2930,2931,2932,2933,2934,2935,2936,2937,2938,2939,2940,2941,2942,2943,2944,2945,2946,2947,2948,2949,2950,2951,2952,2953,2954,2955,2956,2957
0,0.0,0.0,0.1,0.6,0.3,0.1,0.2,0.3,0.4,0.4,0.3,0.3,0.5,0.4,0.2,0.3,0.3,0.4,0.4,0.3,0.3,0.2,0.3,0.0,0.3,0.2,0.3,0.2,0.3,0.1,0.3,0.3,0.5,0.1,0.3,0.1,0.1,0.3,0.3,0.0,...,0.2,0.2,0.2,0.1,0.0,0.4,0.1,0.2,0.1,0.0,0.4,0.0,0.1,0.2,0.1,0.2,0.2,0.1,0.0,0.1,0.3,0.1,0.0,0.4,0.0,0.1,0.2,0.1,0.2,0.2,0.1,0.3,0.2,0.3,0.3,0.2,0.0,0.0,0.2,0.2
1,0.0,0.0,0.1,0.6,0.3,0.1,0.2,0.3,0.4,0.4,0.3,0.3,0.5,0.4,0.2,0.3,0.3,0.4,0.4,0.3,0.3,0.2,0.3,0.0,0.3,0.2,0.3,0.2,0.3,0.1,0.3,0.3,0.5,0.1,0.3,0.1,0.1,0.3,0.3,0.0,...,0.2,0.2,0.2,0.1,0.0,0.4,0.1,0.2,0.1,0.0,0.4,0.0,0.1,0.2,0.1,0.2,0.2,0.1,0.0,0.1,0.3,0.1,0.0,0.4,0.0,0.1,0.2,0.1,0.2,0.2,0.1,0.3,0.2,0.3,0.3,0.2,0.0,0.0,0.2,0.2
2,0.1,0.1,0.0,0.7,0.4,0.2,0.1,0.2,0.5,0.3,0.2,0.2,0.4,0.5,0.3,0.2,0.2,0.3,0.3,0.4,0.2,0.3,0.2,0.1,0.2,0.3,0.2,0.1,0.4,0.0,0.2,0.2,0.4,0.2,0.4,0.0,0.2,0.2,0.4,0.1,...,0.3,0.1,0.1,0.0,0.1,0.3,0.0,0.1,0.2,0.1,0.3,0.1,0.2,0.3,0.2,0.1,0.1,0.0,0.1,0.0,0.2,0.2,0.1,0.3,0.1,0.2,0.3,0.0,0.3,0.1,0.2,0.4,0.3,0.4,0.4,0.3,0.1,0.1,0.1,0.1
3,0.6,0.6,0.7,0.0,0.5,0.5,0.8,0.7,0.2,0.6,0.7,0.5,0.5,0.4,0.6,0.7,0.7,0.6,0.6,0.7,0.7,0.6,0.7,0.6,0.5,0.6,0.7,0.8,0.5,0.7,0.7,0.7,0.7,0.5,0.5,0.7,0.7,0.7,0.7,0.6,...,0.6,0.6,0.8,0.7,0.6,0.4,0.7,0.6,0.7,0.6,0.4,0.6,0.5,0.6,0.5,0.8,0.6,0.7,0.6,0.7,0.5,0.5,0.6,0.6,0.6,0.7,0.6,0.7,0.4,0.6,0.7,0.5,0.4,0.5,0.3,0.6,0.6,0.6,0.6,0.8
4,0.3,0.3,0.4,0.5,0.0,0.2,0.3,0.6,0.3,0.3,0.4,0.4,0.4,0.5,0.5,0.6,0.4,0.5,0.3,0.4,0.2,0.3,0.4,0.3,0.4,0.3,0.2,0.3,0.4,0.4,0.4,0.2,0.4,0.4,0.4,0.4,0.2,0.4,0.4,0.3,...,0.3,0.3,0.3,0.4,0.3,0.3,0.4,0.5,0.2,0.3,0.3,0.3,0.2,0.3,0.4,0.5,0.5,0.4,0.3,0.4,0.4,0.4,0.3,0.5,0.3,0.2,0.3,0.4,0.3,0.5,0.2,0.2,0.3,0.4,0.2,0.3,0.3,0.3,0.5,0.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2953,0.2,0.2,0.3,0.6,0.3,0.3,0.2,0.5,0.4,0.4,0.3,0.5,0.3,0.4,0.4,0.5,0.1,0.6,0.2,0.3,0.3,0.2,0.3,0.2,0.5,0.0,0.3,0.2,0.1,0.3,0.3,0.3,0.5,0.3,0.5,0.3,0.1,0.3,0.3,0.2,...,0.2,0.4,0.2,0.3,0.2,0.6,0.3,0.4,0.1,0.2,0.6,0.2,0.3,0.4,0.3,0.4,0.2,0.3,0.2,0.3,0.5,0.1,0.2,0.4,0.2,0.1,0.2,0.3,0.4,0.4,0.1,0.1,0.4,0.3,0.5,0.0,0.2,0.2,0.4,0.2
2954,0.0,0.0,0.1,0.6,0.3,0.1,0.2,0.3,0.4,0.4,0.3,0.3,0.5,0.4,0.2,0.3,0.3,0.4,0.4,0.3,0.3,0.2,0.3,0.0,0.3,0.2,0.3,0.2,0.3,0.1,0.3,0.3,0.5,0.1,0.3,0.1,0.1,0.3,0.3,0.0,...,0.2,0.2,0.2,0.1,0.0,0.4,0.1,0.2,0.1,0.0,0.4,0.0,0.1,0.2,0.1,0.2,0.2,0.1,0.0,0.1,0.3,0.1,0.0,0.4,0.0,0.1,0.2,0.1,0.2,0.2,0.1,0.3,0.2,0.3,0.3,0.2,0.0,0.0,0.2,0.2
2955,0.0,0.0,0.1,0.6,0.3,0.1,0.2,0.3,0.4,0.4,0.3,0.3,0.5,0.4,0.2,0.3,0.3,0.4,0.4,0.3,0.3,0.2,0.3,0.0,0.3,0.2,0.3,0.2,0.3,0.1,0.3,0.3,0.5,0.1,0.3,0.1,0.1,0.3,0.3,0.0,...,0.2,0.2,0.2,0.1,0.0,0.4,0.1,0.2,0.1,0.0,0.4,0.0,0.1,0.2,0.1,0.2,0.2,0.1,0.0,0.1,0.3,0.1,0.0,0.4,0.0,0.1,0.2,0.1,0.2,0.2,0.1,0.3,0.2,0.3,0.3,0.2,0.0,0.0,0.2,0.2
2956,0.2,0.2,0.1,0.6,0.5,0.3,0.2,0.3,0.6,0.4,0.3,0.3,0.5,0.4,0.4,0.3,0.3,0.4,0.4,0.3,0.3,0.2,0.3,0.2,0.3,0.4,0.3,0.2,0.5,0.1,0.1,0.3,0.5,0.3,0.5,0.1,0.3,0.3,0.5,0.2,...,0.4,0.2,0.2,0.1,0.2,0.4,0.1,0.0,0.3,0.2,0.4,0.2,0.3,0.4,0.1,0.2,0.2,0.1,0.2,0.1,0.3,0.3,0.2,0.4,0.2,0.3,0.4,0.1,0.2,0.0,0.3,0.5,0.4,0.5,0.5,0.4,0.2,0.2,0.0,0.2


In [8]:
ham_matrix_np.shape

(2958, 2958)

Using the dissimilarity matrix, perform kmedoids clustering using k=2. Set the initial medoids randomly. Note that pyclustering expects the distance matrix to be a numpy array; a pandas dataframe may cause errors. 

Which survey responses are chosen as the cluster representatives? Print out the details of these responses.
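To make the procedure concrete, here is a minimal, dependency-free sketch of an alternating k-medoids scheme on a precomputed distance matrix. It is not pyclustering's implementation, which may use a different update strategy; the function names `assign` and `k_medoids` and the toy one-dimensional data are assumptions made for illustration only.

```python
def assign(dist, medoids):
    """Assign every point to the cluster of its nearest medoid."""
    clusters = [[] for _ in medoids]
    for p in range(len(dist)):
        nearest = min(range(len(medoids)), key=lambda c: dist[p][medoids[c]])
        clusters[nearest].append(p)
    return clusters


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate assignment and medoid-update steps until the medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # The new medoid of a cluster is the member with the smallest
        # total distance to all other members of that cluster.
        new_medoids = [
            min(members, key=lambda m: sum(dist[m][q] for q in members))
            if members else medoids[c]
            for c, members in enumerate(clusters)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data: six points on a line forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]

# Deliberately poor start: both initial medoids sit in the same group.
medoids, clusters = k_medoids(dist, initial_medoids=[0, 1])
print(medoids, clusters)  # [1, 4] [[0, 1, 2], [3, 4, 5]]
```

Note that the medoids are always indices of actual data points, which is why the cluster representatives can be looked up directly as survey responses.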

In [9]:
# answer goes here
initial_medoids = stress_df.sample(2).index.tolist()  # two random row indices as a plain list
kmedoids_instance = kmedoids(ham_matrix_np, initial_medoids, data_type='distance_matrix')

In [10]:
kmedoids_instance.process()

<pyclustering.cluster.kmedoids.kmedoids at 0x7f14530969e8>

In [11]:
clusters = kmedoids_instance.get_clusters()
medoids = kmedoids_instance.get_medoids()

In [12]:
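# Build one label Series per cluster, then stack them into a single Series indexed by response.
cluster_series = pd.concat([pd.Series(i, index=clusters[i], name='cluster') for i in range(len(clusters))], axis=0).sort_index()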

In [13]:
stress_df = pd.concat([stress_df, cluster_series], axis=1)
stress_df

Unnamed: 0,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others,cluster
0,0,1,0,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0,0,0,1
2,0,1,0,0,0,0,0,1,0,0,1
3,1,1,0,1,1,1,1,0,1,0,0
4,0,1,1,0,1,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...
2953,0,1,1,0,0,1,0,0,0,0,1
2954,0,1,0,0,0,0,0,0,0,0,1
2955,0,1,0,0,0,0,0,0,0,0,1
2956,0,1,0,0,0,0,1,1,0,0,1


In [14]:
stress_df['cluster'].value_counts()

1    2161
0     797
Name: cluster, dtype: int64

In [15]:
medoids

[2938, 1344]

In [16]:
stress_df.loc[medoids]

Unnamed: 0,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others,cluster
2938,1,1,0,0,0,0,0,1,1,0,0
1344,0,1,0,0,0,0,0,0,0,0,1


If you run the previous cell a few times, you'll probably notice that the medoids are very sensitive to initialization. A common approach to produce well-separated clusters is to choose initial medoids that are far apart. Re-run the previous process, except with a random pair of medoids that have a dissimilarity of 0.8 or higher. Are the results more stable now? How would you describe the typical clusters you see?

In [17]:
# answer goes here
ham_matrix['id'] = ham_matrix.index
ham_matrix_melted = pd.melt(ham_matrix, id_vars=['id'], value_vars=ham_matrix.columns[ham_matrix.columns != 'id'])
ham_matrix_melted.columns = ['response_id_1', 'response_id_2', 'dissimilarity']
ham_matrix_melted

Unnamed: 0,response_id_1,response_id_2,dissimilarity
0,0,0,0.0
1,1,0,0.0
2,2,0,0.1
3,3,0,0.6
4,4,0,0.3
...,...,...,...
8749759,2953,2957,0.2
8749760,2954,2957,0.2
8749761,2955,2957,0.2
8749762,2956,2957,0.2


In [18]:
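# Find every (row, column) pair with dissimilarity of at least 0.8, then pick one pair at random.
x, y = np.where(ham_matrix_np >= 0.8)
i = np.random.randint(len(x))
random_pos = [int(x[i]), int(y[i])]  # plain Python ints for pyclustering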

In [19]:
# initial_medoids_dissim = ham_matrix_melted.loc[lambda x: x['dissimilarity']>=0.8].sample(1)[['response_id_1','response_id_2']].values.tolist()[0]
initial_medoids_dissim = random_pos
kmedoids_instance_dissim = kmedoids(ham_matrix_np, initial_medoids_dissim, data_type='distance_matrix')

In [20]:
kmedoids_instance_dissim.process()

<pyclustering.cluster.kmedoids.kmedoids at 0x7f144bcfeba8>

In [21]:
clusters_dissim = kmedoids_instance_dissim.get_clusters()
medoids_dissim = kmedoids_instance_dissim.get_medoids()

In [22]:
# Same construction as before, now for the run started from a far-apart pair of medoids.
cluster__dissim_series = pd.concat([pd.Series(i, index=clusters_dissim[i], name='cluster_dissim') for i in range(len(clusters_dissim))], axis=0).sort_index()

In [23]:
stress_df = pd.concat([stress_df, cluster__dissim_series], axis=1)
stress_df

Unnamed: 0,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others,cluster,cluster_dissim
0,0,1,0,0,0,0,0,0,0,0,1,1
1,0,1,0,0,0,0,0,0,0,0,1,1
2,0,1,0,0,0,0,0,1,0,0,1,1
3,1,1,0,1,1,1,1,0,1,0,0,0
4,0,1,1,0,1,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2953,0,1,1,0,0,1,0,0,0,0,1,1
2954,0,1,0,0,0,0,0,0,0,0,1,1
2955,0,1,0,0,0,0,0,0,0,0,1,1
2956,0,1,0,0,0,0,1,1,0,0,1,1


In [24]:
stress_df['cluster_dissim'].value_counts()

1    1851
0    1107
Name: cluster_dissim, dtype: int64

In [25]:
medoids_dissim

[65, 0]

In [26]:
stress_df.loc[medoids_dissim]

Unnamed: 0,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others,cluster,cluster_dissim
65,1,1,0,0,1,0,0,1,1,0,0,0
0,0,1,0,0,0,0,0,0,0,0,1,1


The results are a little more stable, and the final medoids tend to be more dissimilar than those obtained from a purely random initialization.
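One possible way to put a number on stability is to compare the labels from two runs and measure how often they agree, ignoring which cluster happens to be called 0 or 1. The sketch below is only valid for two clusters labelled 0 and 1; the helper name `label_agreement` is an assumption, and the short label lists are the first five values of the `cluster` and `cluster_dissim` columns displayed above.

```python
def label_agreement(labels_a, labels_b):
    """Share of points grouped the same way by two 2-cluster runs.

    Cluster numbering is arbitrary, so the agreement is taken as the
    larger of the direct match rate and the match rate with labels swapped.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    share = same / len(labels_a)
    return max(share, 1 - share)


# First five labels from the two runs shown in the table.
run_random = [1, 1, 1, 0, 1]
run_far_apart = [1, 1, 1, 0, 0]

agreement = label_agreement(run_random, run_far_apart)
print(agreement)
```

Repeating each initialization strategy several times and averaging this score over pairs of runs would give a rough, comparable stability figure for the two strategies.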

One cluster tends to be more "stressed" than the other, whose medoid is almost always a response with only academic stress marked and zeros everywhere else.
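A description like this can be checked by computing, for each cluster, the share of "yes" answers per question. The following dependency-free sketch does this for a handful of rows; the helper name `cluster_profiles` is an assumption, and the rows and labels are the first five responses and their `cluster_dissim` values from the table above. With pandas, `stress_df.groupby('cluster_dissim').mean()` should give the equivalent summary for the full data.

```python
def cluster_profiles(rows, labels):
    """Per-cluster share of 'yes' (1) answers for each question column."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


# First five responses (Q5 columns only) and their cluster_dissim labels.
rows = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 1, 0],
]
labels = [1, 1, 1, 0, 0]

profiles = cluster_profiles(rows, labels)
for label in sorted(profiles):
    print(label, [round(share, 2) for share in profiles[label]])
```

In this small sample both clusters report academic stress (second column) in every row, while only cluster 0 shows high shares for several other stress sources.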