## Day 50 Lecture 1 Assignment

In this assignment, we will perform affinity propagation clustering on responses to a survey about student life at a university.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import AffinityPropagation
from scipy.spatial.distance import pdist, squareform

We will load a student life survey dataset. It consists of 35 binary features, each corresponding to a yes/no question about the student taking the survey.

Each feature name summarizes its survey question, so we will not list them all here.

In [None]:
# answer goes here

df = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/student_life_survey.csv')



For our analysis, we will focus on the subset of the survey that deals with stress. The names of these columns all begin with the string 'Q5'. Keep only the columns that meet this criterion (there should be 10 in total).

In addition, we will cluster only a random subset of this data. Affinity propagation is a fairly slow algorithm: its running time and memory use grow quadratically with the number of samples, so it can take impractically long to converge even on medium-sized datasets. Select a random sample of 500 rows from the dataset.

In [None]:
# answer goes here

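# Keep only the stress-related columns, whose names start with 'Q5'
df = df.loc[:, df.columns.str.startswith('Q5')]

# Cluster a random sample of 500 rows to keep affinity propagation tractable
df_small = df.sample(n=500)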
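
As a minimal sketch of the filtering and sampling step, the example below uses a small made-up frame (the column names and the `toy`, `stress`, and `sample` variables are hypothetical, not taken from the survey). It illustrates why a prefix test is a little safer than a substring test, and how `random_state` can make the sample repeatable.

```python
import pandas as pd

# Hypothetical toy frame standing in for the survey data
toy = pd.DataFrame({
    "Q4-Sleeps well": [1, 0, 1, 0],
    "Q5-Stressed about A": [0, 1, 1, 0],
    "Q5-Stressed about B": [1, 1, 0, 0],
    "Notes on Q5": [0, 0, 1, 1],
})

# A prefix test keeps only names that begin with "Q5";
# a substring test ('Q5' in col) would also keep "Notes on Q5".
stress = toy.loc[:, toy.columns.str.startswith("Q5")]

# Fixing random_state makes the random sample reproducible between runs
sample = stress.sample(n=3, random_state=0)
print(stress.columns.tolist())
print(sample)
```

Leaving `random_state` unset, as in the cell above, gives a different sample on every run, so exemplars and cluster counts will vary from run to run.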

The sklearn implementation of affinity propagation supports only the `euclidean` and `precomputed` affinities, so to use any other metric we need to precompute the matrix ourselves. Furthermore, it expects similarities rather than distances, meaning larger values indicate more alike points; the default affinity is the negative squared Euclidean distance.

Compute the full dissimilarity matrix between all pairs of students using the negative matching/hamming distance and store it in a dataframe. 

Note: Be sure to convert the values to negative to match what the algorithm expects.

In [None]:
# answer goes here

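# pdist defaults to Euclidean distance; pass metric='hamming' for the matching distance named above.
# For 0/1 answers, the Euclidean distance is the square root of the number of mismatched answers.
dist = pd.DataFrame(squareform(pdist(df_small)))
diss = dist * (-1)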

In [None]:
diss.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
0,-0.0,-2.236068,-1.414214,-1.732051,-2.236068,-1.732051,-1.732051,-2.0,-2.236068,-1.732051,...,-2.236068,-2.0,-2.236068,-1.0,-1.732051,-2.0,-1.414214,-2.44949,-2.0,-1.414214
1,-2.236068,-0.0,-2.236068,-2.0,-2.0,-2.0,-2.44949,-2.645751,-1.414214,-2.44949,...,-2.0,-2.236068,-2.44949,-2.44949,-2.0,-2.236068,-2.236068,-1.732051,-1.732051,-2.236068
2,-1.414214,-2.236068,-0.0,-2.236068,-2.645751,-2.236068,-2.236068,-2.44949,-1.732051,-2.236068,...,-2.645751,-2.44949,-2.645751,-1.0,-2.236068,-2.44949,-2.0,-2.44949,-2.44949,-2.0
3,-1.732051,-2.0,-2.236068,-0.0,-2.0,-1.414214,-2.0,-1.732051,-2.44949,-2.0,...,-2.0,-1.732051,-2.44949,-2.0,-2.0,-1.732051,-2.236068,-1.732051,-1.732051,-1.0
4,-2.236068,-2.0,-2.645751,-2.0,-0.0,-1.414214,-1.414214,-1.732051,-2.44949,-1.414214,...,-0.0,-1.0,-1.414214,-2.44949,-1.414214,-1.0,-1.732051,-1.732051,-1.0,-1.732051
5,-1.732051,-2.0,-2.236068,-1.414214,-1.414214,-0.0,-2.0,-1.732051,-2.44949,-1.414214,...,-1.414214,-1.0,-2.0,-2.0,-1.414214,-1.0,-1.732051,-2.236068,-1.0,-1.0
6,-1.732051,-2.44949,-2.236068,-2.0,-1.414214,-2.0,-0.0,-1.732051,-2.44949,-1.414214,...,-1.414214,-1.732051,-1.414214,-2.0,-2.0,-1.732051,-1.732051,-1.732051,-1.732051,-1.732051
7,-2.0,-2.645751,-2.44949,-1.732051,-1.732051,-1.732051,-1.732051,-0.0,-3.0,-1.732051,...,-1.732051,-1.414214,-1.732051,-2.236068,-1.732051,-1.414214,-2.0,-2.0,-2.0,-1.414214
8,-2.236068,-1.414214,-1.732051,-2.44949,-2.44949,-2.44949,-2.44949,-3.0,-0.0,-2.44949,...,-2.44949,-2.645751,-2.44949,-2.0,-2.44949,-2.645751,-2.236068,-2.236068,-2.236068,-2.645751
9,-1.732051,-2.44949,-2.236068,-2.0,-1.414214,-1.414214,-1.414214,-1.732051,-2.44949,-0.0,...,-1.414214,-1.0,-1.414214,-2.0,-1.414214,-1.0,-1.0,-2.236068,-1.732051,-1.732051
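
The values above are negated square roots (for example, 2.236068 is the square root of 5), which is what the default Euclidean metric of `pdist` produces on 0/1 data. As a hedged sketch of the matching/Hamming variant the prompt describes, the example below uses a tiny hand-made frame (`toy` and the other names are hypothetical) and checks how the two metrics relate on binary data.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical binary answers for three respondents and four questions
toy = pd.DataFrame(
    [[1, 0, 1, 0],
     [1, 1, 1, 0],
     [0, 1, 0, 1]],
    columns=["a", "b", "c", "d"],
)

# Hamming distance: fraction of questions on which two respondents disagree
hamming = squareform(pdist(toy.values, metric="hamming"))

# Default metric is Euclidean: sqrt(number of disagreements) for 0/1 data
euclid = squareform(pdist(toy.values))

# Negate so that larger values mean "more similar", as affinity propagation expects
similarity = pd.DataFrame(-hamming, index=toy.index, columns=toy.index)

print(similarity)
# On binary data the two metrics are monotonically related
print(np.allclose(euclid ** 2, hamming * toy.shape[1]))
```

Because the relationship is monotonic, both metrics rank pairs of students the same way, but the scale differs, which matters when choosing a `preference` value.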


Using the dissimilarity matrix, run affinity propagation on the survey results with the default value for preference, which is the median of the input similarities, and a damping parameter of 0.8. How many exemplars did it identify? If there are too many exemplars, what changes would we want to make?

In [None]:
# answer goes here

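# Note: affinity defaults to 'euclidean', so each row of diss is treated as a feature vector;
# pass affinity='precomputed' to have diss used directly as the similarity matrix.
model = AffinityPropagation(damping=0.8, max_iter=1000, random_state=None)
model.fit(diss)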

len(model.cluster_centers_indices_)

36

In [None]:
model.cluster_centers_

array([[-1.41421356, -2.23606798, -0.        , ..., -2.44948974,
        -2.44948974, -2.        ],
       [-2.23606798, -1.41421356, -1.73205081, ..., -2.23606798,
        -2.23606798, -2.64575131],
       [-2.23606798, -2.        , -2.64575131, ..., -1.73205081,
        -1.        , -1.73205081],
       ...,
       [-2.        , -1.73205081, -2.        , ..., -2.44948974,
        -1.41421356, -2.        ],
       [-1.41421356, -1.73205081, -1.41421356, ..., -2.44948974,
        -2.        , -2.        ],
       [-2.        , -1.73205081, -2.44948974, ..., -2.        ,
        -2.        , -2.44948974]])
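
For comparison, here is a minimal sketch of fitting on a precomputed similarity matrix, assuming synthetic binary data in place of the survey (the arrays `X` and `S` and the way they are generated are illustrative assumptions). With `affinity="precomputed"` the matrix is used as given, and the exemplars are recovered by indexing back into the original rows with `cluster_centers_indices_`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AffinityPropagation

# Synthetic stand-in for the survey: noisy copies of three binary answer patterns
rng = np.random.default_rng(0)
prototypes = rng.integers(0, 2, size=(3, 10))
X = prototypes[rng.integers(0, 3, size=60)]
flips = rng.random(X.shape) < 0.1
X = np.where(flips, 1 - X, X)

# Negative Hamming distance as the similarity matrix
S = -squareform(pdist(X, metric="hamming"))

# preference is left at its default, the median of the values in S
ap = AffinityPropagation(affinity="precomputed", damping=0.8,
                         max_iter=1000, random_state=0)
ap.fit(S)

# Exemplars are reported as row positions; look up their answers in X
n_exemplars = len(ap.cluster_centers_indices_)
exemplar_rows = X[ap.cluster_centers_indices_]
print(n_exemplars)
print(exemplar_rows)
```

The number of exemplars found this way depends on the data and the preference, so no particular count should be expected from this sketch.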

Try adjusting the value of the preference based on the result you saw in the previous step until you have a reasonable number of exemplars. Print out the data for each of these exemplars, as well as the number of surveys assigned to each exemplar. How do these clusters compare to what we saw previously with k-medoids?

Tip: large preferences can lead to numerical instability and convergence problems. The `damping` parameter can help control this by downscaling the impact of incoming messages; see the documentation for AffinityPropagation for more details.
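
One way to explore the preference is to loop over a few candidate values and record how many exemplars each produces. The sketch below assumes synthetic binary data and a precomputed negative Hamming similarity matrix; the candidate values (the median and two multiples of the minimum similarity) are arbitrary illustrative choices, not recommended settings.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AffinityPropagation

# Synthetic stand-in for the survey: noisy copies of three binary answer patterns
rng = np.random.default_rng(0)
prototypes = rng.integers(0, 2, size=(3, 10))
X = prototypes[rng.integers(0, 3, size=60)]
X = np.where(rng.random(X.shape) < 0.1, 1 - X, X)
S = -squareform(pdist(X, metric="hamming"))

# Candidate preferences, from the default (median) to much more negative values
candidates = [np.median(S), 2 * S.min(), 10 * S.min()]

results = []
for pref in candidates:
    ap = AffinityPropagation(affinity="precomputed", preference=pref,
                             damping=0.8, max_iter=1000, random_state=0)
    ap.fit(S)
    # Record the preference and the number of exemplars it produced
    results.append((float(pref), len(ap.cluster_centers_indices_)))

for pref, n in results:
    print(f"preference={pref:.2f} -> {n} exemplars")
```

More negative preferences generally make points less willing to serve as exemplars, so the count tends to fall as the preference decreases. The useful range depends on the scale of the similarity matrix.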

In [None]:
# answer goes here

model = AffinityPropagation(damping=0.8, max_iter=1000, random_state=None, preference=-1000)
model.fit(diss)

len(model.cluster_centers_indices_)



13

In [None]:
df_small.iloc[model.cluster_centers_indices_]

Unnamed: 0,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others
2903,0,1,0,0,0,0,0,1,0,0
2303,0,1,0,0,0,0,0,0,1,0
2502,0,1,1,1,1,0,0,1,0,0
2641,0,1,1,0,0,0,0,0,0,0
1095,0,1,1,0,0,0,0,1,0,0
2219,1,1,0,0,1,0,0,0,1,0
287,1,1,0,0,0,0,0,0,0,0
2031,1,1,0,0,0,0,0,1,0,0
271,0,1,0,0,0,0,0,0,0,0
1256,0,1,1,1,0,0,0,0,1,0


In [None]:
df_small['Labels'] = model.labels_
df_small.head()

Unnamed: 0,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others,Labels
2884,1,1,1,0,1,0,0,1,0,0,2
2280,0,1,0,1,1,0,1,1,1,0,12
1831,1,1,1,0,1,1,1,1,0,0,12
282,0,1,1,1,0,0,0,1,0,0,2
1192,0,1,0,0,0,0,0,0,1,0,1


In [None]:
df_small['Labels'].value_counts()

8     64
5     47
0     46
10    45
9     41
2     41
12    39
4     34
6     33
3     33
7     32
11    30
1     15
Name: Labels, dtype: int64
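
To view each exemplar next to the number of surveys assigned to it, the two outputs above can be combined into one table. This is a small sketch with hypothetical stand-ins (`data`, `labels`, and `exemplar_positions` play the roles of `df_small`, `model.labels_`, and `model.cluster_centers_indices_`); it relies on label `k` referring to the `k`-th entry of the exemplar index array.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df_small, model.labels_ and model.cluster_centers_indices_
data = pd.DataFrame(
    {"Q5-A": [1, 1, 0, 0, 1], "Q5-B": [0, 0, 1, 1, 0]},
    index=[2903, 2303, 2502, 2641, 1095],
)
labels = np.array([0, 0, 1, 1, 0])
exemplar_positions = np.array([0, 2])

# Count the members of each cluster, ordered by label 0, 1, ...
sizes = np.bincount(labels, minlength=len(exemplar_positions))

# One row per exemplar, in label order, with its cluster size attached
summary = data.iloc[exemplar_positions].copy()
summary["n_assigned"] = sizes
print(summary)
```

Building the summary from a copy avoids adding helper columns to the clustered frame itself.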