## Day 50 Lecture 1 Assignment

In this assignment, we will run affinity propagation clustering on responses to a survey about student life at a university.

In [90]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import AffinityPropagation
from scipy.spatial.distance import pdist, squareform

We will load a student life survey dataset. It consists of 35 binary features, each corresponding to a yes/no survey question that characterizes the student taking the survey. Each feature name summarizes its question, so we will not list them all here.

In [91]:
# answer goes here

survey = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/student_life_survey.csv')

For our analysis, we will focus on the subset of survey questions about stress. These questions all begin with the string 'Q5'. Keep only the columns that meet this criterion (there should be 10 in total).

In addition, we are only going to perform clustering on a random subset of this data, as affinity propagation is a fairly slow algorithm and requires infeasibly long times to converge for even medium-sized datasets. Select a random sample of 500 rows from the dataset.
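The two steps above (select columns by prefix, then draw a random sample) can be sketched on a small made-up frame. The column names and the `random_state` value here are illustrative assumptions, not part of the survey; fixing a seed is optional, but it makes the sample repeatable between runs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey: binary columns, two of them prefixed with 'Q5'.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.integers(0, 2, size=(20, 4)),
    columns=['Q4-Lives on campus', 'Q5-Stressed about A',
             'Q5-Stressed about B', 'Q6-Has a job'],
)

# Anchor the pattern with '^' so only names that start with 'Q5' match.
q5_only = toy.filter(regex='^Q5')

# A fixed random_state (an arbitrary choice here) makes the sample reproducible.
toy_sample = q5_only.sample(n=5, random_state=42)
```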

In [92]:
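# answer goes here

# Keep only the stress questions, whose column names start with 'Q5'
survey = survey.filter(regex='^Q5')

# Draw a random sample of 500 students (no seed is set, so the rows vary between runs)
survey_sample = survey.sample(500)
survey_sample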

| | Q5-Stressed about Adjustment issues | Q5-Stressed about Academic issues | Q5-Stressed about Financial issues | Q5-Stressed about Family issues | Q5-Stressed about Friendships | Q5-Stressed about Romantic relationships | Q5-Stressed about Health related issues | Q5-Stressed about Career related issues | Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc. | Q5-Stressed about Others |
|---|---|---|---|---|---|---|---|---|---|---|
| 1337 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 591 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2380 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 350 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2719 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1839 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1483 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 810 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1545 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


The sklearn implementation of affinity propagation only supports Euclidean and precomputed affinities, so we will need to precompute the matrix ourselves. Furthermore, it expects similarities rather than distances, so the values should be negative; the default affinity is the negative squared Euclidean distance.

Compute the full dissimilarity matrix between all pairs of students using the negative matching/hamming distance and store it in a dataframe. 

Note: Be sure to convert the values to negative to match what the algorithm expects.
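As a minimal sketch of this step on made-up data: the Hamming distance between two binary rows is the fraction of features on which they disagree, so with 10 questions a distance of 0.3 means three differing answers. The toy array below is an assumption for illustration. Note that `affinity='precomputed'` is the setting that tells `AffinityPropagation` to use the supplied matrix as the similarities; with the default `affinity='euclidean'`, the rows of whatever is passed in are treated as feature vectors.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AffinityPropagation

# Toy binary answers (made up): two loose groups of respondents.
toy_answers = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

# Hamming distance: fraction of features on which two rows disagree.
distances = squareform(pdist(toy_answers, metric='hamming'))

# Affinity propagation works with similarities, so negate the distances.
similarities = -distances

# Use the matrix as-is instead of recomputing Euclidean affinities from its rows.
model = AffinityPropagation(affinity='precomputed', damping=0.8)
labels = model.fit_predict(similarities)
```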

In [93]:
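# answer goes here

# Pairwise Hamming distances (fraction of questions answered differently), in condensed form
pdist_array = pdist(survey_sample, metric='hamming')

# Expand the condensed distances into a full square matrix
pdist_df = pd.DataFrame(squareform(pdist_array))

# Negate so that larger values mean more similar
pdist_df = -pdist_df

pdist_df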

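| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.0 | -0.3 | -0.2 | -0.3 | -0.1 | -0.1 | -0.1 | -0.3 | -0.3 | -0.1 | ... | -0.5 | -0.3 | -0.1 | -0.3 | -0.3 | -0.2 | -0.3 | -0.3 | -0.2 | -0.7 |
| 1 | -0.3 | -0.0 | -0.3 | -0.2 | -0.2 | -0.2 | -0.2 | -0.2 | -0.2 | -0.2 | ... | -0.4 | -0.2 | -0.2 | -0.4 | -0.2 | -0.1 | -0.0 | -0.2 | -0.1 | -0.4 |
| 2 | -0.2 | -0.3 | -0.0 | -0.1 | -0.1 | -0.1 | -0.1 | -0.1 | -0.3 | -0.1 | ... | -0.3 | -0.3 | -0.1 | -0.3 | -0.3 | -0.2 | -0.3 | -0.3 | -0.2 | -0.5 |
| 3 | -0.3 | -0.2 | -0.1 | -0.0 | -0.2 | -0.2 | -0.2 | -0.2 | -0.2 | -0.2 | ... | -0.4 | -0.4 | -0.2 | -0.4 | -0.4 | -0.3 | -0.2 | -0.4 | -0.3 | -0.4 |
| 4 | -0.1 | -0.2 | -0.1 | -0.2 | -0.0 | -0.0 | -0.0 | -0.2 | -0.2 | -0.0 | ... | -0.4 | -0.2 | -0.0 | -0.2 | -0.2 | -0.1 | -0.2 | -0.2 | -0.1 | -0.6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | -0.2 | -0.1 | -0.2 | -0.3 | -0.1 | -0.1 | -0.1 | -0.1 | -0.3 | -0.1 | ... | -0.3 | -0.1 | -0.1 | -0.3 | -0.1 | -0.0 | -0.1 | -0.1 | -0.0 | -0.5 |
| 496 | -0.3 | -0.0 | -0.3 | -0.2 | -0.2 | -0.2 | -0.2 | -0.2 | -0.2 | -0.2 | ... | -0.4 | -0.2 | -0.2 | -0.4 | -0.2 | -0.1 | -0.0 | -0.2 | -0.1 | -0.4 |
| 497 | -0.3 | -0.2 | -0.3 | -0.4 | -0.2 | -0.2 | -0.2 | -0.2 | -0.4 | -0.2 | ... | -0.4 | -0.2 | -0.2 | -0.2 | -0.2 | -0.1 | -0.2 | -0.0 | -0.1 | -0.4 |
| 498 | -0.2 | -0.1 | -0.2 | -0.3 | -0.1 | -0.1 | -0.1 | -0.1 | -0.3 | -0.1 | ... | -0.3 | -0.1 | -0.1 | -0.3 | -0.1 | -0.0 | -0.1 | -0.1 | -0.0 | -0.5 |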


Using the dissimilarity matrix, run affinity propagation on the survey results with the default value for preference, which is the median of the input similarities, and a damping parameter of 0.8. How many exemplars did it identify? If there are too many exemplars, what changes would we want to make?
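To make the default concrete, here is a small sketch that computes the value `preference` falls back to when it is left unset, assuming (per the sklearn documentation) that it is the median of the input similarities. The 4×4 matrix is made up for illustration.

```python
import numpy as np

# Made-up negative-distance (similarity) matrix for four respondents.
similarities = -np.array([
    [0.0, 0.1, 0.5, 0.6],
    [0.1, 0.0, 0.4, 0.5],
    [0.5, 0.4, 0.0, 0.1],
    [0.6, 0.5, 0.1, 0.0],
])

# The median over all entries is the fallback when preference is None.
default_preference = np.median(similarities)

# Preferences below the median tend to give fewer exemplars; the minimum
# similarity is a common low starting point to try.
lower_preference = similarities.min()
```

Passing a value such as `lower_preference` as `preference=` is one way to push the algorithm toward fewer clusters; values closer to zero push it toward more.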

In [94]:
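# answer goes here

# affinity is left at its default ('euclidean'), so sklearn treats each row of
# pdist_df as a feature vector; affinity='precomputed' would use the matrix directly
aff_prop = AffinityPropagation(damping=.8)

preds = aff_prop.fit_predict(pdist_df)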

In [95]:
print('Number of exemplars:', pd.Series(preds).nunique())

Number of exemplars: 1


In [96]:
# Only one exemplar was found, so let's try changing the preference

Try adjusting the value of the preference based on the result you saw in the previous step until you have a reasonable number of exemplars. Print out the data for each of these exemplars, as well as the number of surveys assigned to each exemplar. How do these clusters compare to what we saw previously with k-medoids?

Tip: large preferences can lead to numerical instability and issues with convergence. The "damping" parameter can help control this by downscaling the impact of incoming messages; check the documentation for AffinityPropagation for more details.
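The effect of damping can be illustrated with a short sketch. The function below is a hypothetical helper, not sklearn's internal code; it only shows the general idea that each update keeps a `damping` share of the old message and takes a `1 - damping` share of the newly computed one, with `damping` expected in the range [0.5, 1.0).

```python
import numpy as np

def damped_update(old_messages, new_messages, damping):
    """Blend freshly computed messages with the previous ones (illustrative helper)."""
    if not 0.5 <= damping < 1.0:
        raise ValueError('damping is expected to lie in [0.5, 1.0)')
    return damping * old_messages + (1.0 - damping) * new_messages

old = np.array([0.0, 0.0])
new = np.array([1.0, -2.0])

# Light damping moves halfway toward the new values.
light = damped_update(old, new, damping=0.5)

# Heavy damping moves only a tenth of the way, which slows oscillation
# but also means more iterations are needed to settle.
heavy = damped_update(old, new, damping=0.9)
```

With damping very close to 1, the messages barely change per iteration, so a run may hit the iteration limit before it settles; increasing `max_iter` alongside damping is worth considering in that case.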

In [97]:
# answer goes here
for pref in range(7):
    aff_prop = AffinityPropagation(damping=.8, preference=-pref)

    preds = aff_prop.fit_predict(pdist_df)
    print('Preference:', -pref)
    print('Number of exemplars:', pd.Series(preds).nunique())
    print('-------------------------')

Preference: 0
Number of exemplars: 310
-------------------------
Preference: -1
Number of exemplars: 161
-------------------------
Preference: -2
Number of exemplars: 159
-------------------------
Preference: -3
Number of exemplars: 160
-------------------------
Preference: -4
Number of exemplars: 157
-------------------------
Preference: -5
Number of exemplars: 1
-------------------------
Preference: -6
Number of exemplars: 1
-------------------------


In [98]:
pref = 5

for damp in np.arange(.5, 1.01, .1):
    aff_prop = AffinityPropagation(damping=damp, preference=-pref)

    preds = aff_prop.fit_predict(pdist_df)
    print('Damping:', damp)
    print('Number of exemplars:', pd.Series(preds).nunique())
    print('-------------------------')

Damping: 0.5
Number of exemplars: 1
-------------------------
Damping: 0.6
Number of exemplars: 1
-------------------------
Damping: 0.7
Number of exemplars: 1
-------------------------
Damping: 0.7999999999999999
Number of exemplars: 1
-------------------------
Damping: 0.8999999999999999
Number of exemplars: 95
-------------------------
Damping: 0.9999999999999999
Number of exemplars: 3
-------------------------


In [99]:
damp = .99999999
pref = -5

aff_prop = AffinityPropagation(damping=damp, preference=pref)

preds = aff_prop.fit_predict(pdist_df)
print('Damping:', damp)
print('Preference:', pref)
print('Number of exemplars:', pd.Series(preds).nunique())
print('-------------------------')

Damping: 0.99999999
Preference: -5
Number of exemplars: 4
-------------------------


In [100]:
predictions = pd.Series(preds, index=survey_sample.index)
print('Number of predictions for each class:')
print(predictions.value_counts())

Number of predictions for each class:
1    363
3     58
2     41
0     38
dtype: int64


In [103]:
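# cluster_centers_ holds the exemplars' rows of pdist_df;
# label them with the original survey row indices
exemplars = pd.DataFrame(
    aff_prop.cluster_centers_,
    index=survey_sample.iloc[aff_prop.cluster_centers_indices_].index,
)
exemplars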

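| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 657 | -0.3 | -0.4 | -0.3 | -0.2 | -0.4 | -0.4 | -0.4 | -0.4 | -0.4 | -0.4 | ... | -0.4 | -0.6 | -0.4 | -0.6 | -0.6 | -0.5 | -0.4 | -0.6 | -0.5 | -0.4 |
| 1841 | -0.3 | -0.2 | -0.3 | -0.4 | -0.2 | -0.2 | -0.2 | -0.2 | -0.4 | -0.2 | ... | -0.4 | -0.2 | -0.2 | -0.4 | -0.2 | -0.1 | -0.2 | -0.2 | -0.1 | -0.6 |
| 2853 | -0.4 | -0.3 | -0.6 | -0.5 | -0.5 | -0.5 | -0.5 | -0.5 | -0.3 | -0.5 | ... | -0.5 | -0.5 | -0.5 | -0.5 | -0.5 | -0.4 | -0.3 | -0.5 | -0.4 | -0.5 |
| 696 | -0.6 | -0.5 | -0.4 | -0.5 | -0.5 | -0.5 | -0.5 | -0.3 | -0.5 | -0.5 | ... | -0.3 | -0.5 | -0.5 | -0.3 | -0.5 | -0.4 | -0.5 | -0.3 | -0.4 | -0.3 |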


In [104]:
survey_sample.loc[exemplars.index]

| | Q5-Stressed about Adjustment issues | Q5-Stressed about Academic issues | Q5-Stressed about Financial issues | Q5-Stressed about Family issues | Q5-Stressed about Friendships | Q5-Stressed about Romantic relationships | Q5-Stressed about Health related issues | Q5-Stressed about Career related issues | Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc. | Q5-Stressed about Others |
|---|---|---|---|---|---|---|---|---|---|---|
| 657 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1841 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2853 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 696 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |


In k-medoids, the two clusters I got were essentially `stressed` and `unstressed`. Here the groups seem to be separated by type of stress. Group 1 seems to be stressed by money-related things; group 2 seems less stressed overall, but perhaps stressed about social things; group 3 seems stressed about nearly everything; and group 4 is somewhere in between.
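One way to check an interpretation like this beyond the exemplar rows is to look at the share of "yes" answers per question within each cluster. The sketch below uses made-up answers and labels as stand-ins for `survey_sample` and `preds`; the column names are shortened for illustration.

```python
import pandas as pd

# Made-up binary answers and cluster labels, standing in for survey_sample and preds.
answers = pd.DataFrame(
    {'Academic': [1, 1, 1, 0], 'Financial': [1, 1, 0, 0], 'Career': [0, 0, 1, 1]},
    index=[101, 102, 103, 104],
)
labels = pd.Series([0, 0, 1, 1], index=answers.index, name='cluster')

# The mean of a 0/1 column within a cluster is the share of members who answered "yes".
profile = answers.groupby(labels).mean()
```

Rows of `profile` with values near 1 for a question indicate a cluster where most members report that kind of stress, which gives a fuller picture than a single exemplar.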