## Day 49 Lecture 2 Assignment

In this assignment, we will apply mean shift clustering to a dataset containing the results of a survey on financial wellbeing.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import StandardScaler

This dataset contains the results of a survey on financial wellbeing conducted by the US Consumer Financial Protection Bureau and published in October 2017. The dataset has a large number of columns, most of which correspond to specific questions on the survey. The codebook for translating the column names to questions can be found here:

https://s3.amazonaws.com/files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-codebook.pdf

Load the dataset.

In [2]:
# read in dataset
data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/financial_wellbeing.csv')

While the survey questions have the potential for interesting cluster analysis, we will stick to the "score" columns to avoid clustering in an unreasonably high-dimensional space. The columns we are interested in all have "score" in their names; identify and isolate these columns. (There should be 4 in total.)

In [3]:
# select only score columns 
data = data.loc[:, data.columns.str.contains('score')].reset_index(drop=True)
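
As an aside, the same substring-based selection can also be written with `DataFrame.filter(like=...)`. The sketch below uses a small toy frame rather than the survey file; the `id` and `answer_1` column names are made up for illustration.

```python
import pandas as pd

# Toy frame: "id" and "answer_1" are hypothetical placeholder columns.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "FWBscore": [55, 61, 48],
    "answer_1": [3, 4, 2],
    "FSscore": [50, 47, 52],
})

# Boolean-mask selection, as in the cell above.
by_mask = toy.loc[:, toy.columns.str.contains("score")]

# filter(like=...) keeps columns whose label contains the substring.
by_filter = toy.filter(like="score")

print(list(by_filter.columns))
```

Both forms keep the original column order, so either one should be fine for isolating the score columns.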

Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [4]:
# scale features with StandardScaler
sc = StandardScaler()
data_sc = sc.fit_transform(data)
data_sc = pd.DataFrame(data_sc, columns=data.columns)
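
A quick sanity check on the scaling step is to confirm that each column ends up with mean 0 and standard deviation 1. This is a minimal sketch on synthetic data (the means and spreads below are made up, not taken from the survey). Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), so a pandas `.std()` call with its default `ddof=1` would come out very slightly above 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for four score columns (values are made up).
X = rng.normal(loc=[50, 50, 2, 0], scale=[14, 12, 1, 0.8], size=(500, 4))

X_sc = StandardScaler().fit_transform(X)

col_means = X_sc.mean(axis=0)
col_stds = X_sc.std(axis=0)  # NumPy's default ddof=0 matches StandardScaler

print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```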

Run mean shift clustering on the scores in the survey dataset using the default bandwidth. Then answer the following by printing or typing as appropriate:

- How many clusters are produced? 
- What are the cluster centers?
- How many responses are assigned to each cluster?
- Are these results reasonable? If not, what changes should we make?

In [5]:
# mean shift clustering
ms = MeanShift()
data['mean_shift_cluster'] = ms.fit_predict(data_sc)
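
For context on what "default bandwidth" means here: when `bandwidth` is left as `None`, scikit-learn's `MeanShift` estimates one with `estimate_bandwidth`, whose default `quantile` is 0.3. The sketch below illustrates this on synthetic blobs (not the survey data) by comparing the default fit with one that passes the estimated bandwidth explicitly.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic data standing in for the scaled survey scores.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# The bandwidth MeanShift falls back to when none is given.
bw = estimate_bandwidth(X)

default_ms = MeanShift().fit(X)
explicit_ms = MeanShift(bandwidth=bw).fit(X)

print("estimated bandwidth:", round(float(bw), 3))
print("clusters (default): ", len(default_ms.cluster_centers_))
print("clusters (explicit):", len(explicit_ms.cluster_centers_))
```

A relatively wide kernel like this tends to merge nearby modes, which is one plausible reason the default run yields few clusters.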

In [6]:
print('At default bandwidth, there were {} clusters'.format(ms.cluster_centers_.shape[0]))

At default bandwidth, there were 2 clusters


In [7]:
print('The cluster centers are as follows:')
clusters = pd.DataFrame(
    sc.inverse_transform(ms.cluster_centers_),
    columns = data_sc.columns
)

clusters.style.background_gradient()

The cluster centers are as follows:


| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| 0 | 57.437953 | 50.121439 | 2.814341 | 0.217414 |
| 1 | -1.0 | -1.0 | 0.25 | -1.826 |


In [8]:
print('Response assignment to clusters:')
data.mean_shift_cluster.value_counts()

Response assignment to clusters:


0    6326
1      68
Name: mean_shift_cluster, dtype: int64

These results are not really reasonable: a cluster holding only about 1% of the data points (68 of 6,394) is not very helpful. I'll adjust the bandwidth and see whether I can get more reasonable clusters.
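
The "about 1%" figure can be checked with `value_counts(normalize=True)`. The sketch below rebuilds a label column from the counts printed above instead of re-running the clustering.

```python
import pandas as pd

# Rebuild the label column from the counts printed above (6326 and 68).
labels = pd.Series([0] * 6326 + [1] * 68, name="mean_shift_cluster")

# normalize=True returns each cluster's share of all responses.
shares = labels.value_counts(normalize=True)
print(shares.round(4))
```

On the real frame, the same call would be `data.mean_shift_cluster.value_counts(normalize=True)`.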

Try changing the appropriate parameters of the mean shift algorithm to achieve a better clustering result. Answer all of the same questions from the previous clustering step.

In [16]:
ms2 = MeanShift(bandwidth=1.6)
data['ms2_clusters'] = ms2.fit_predict(data_sc)

In [21]:
print('At bandwidth of 1.6, there were {} clusters'.format(ms2.cluster_centers_.shape[0]))

At bandwidth of 1.6, there were 6 clusters


In [22]:
print('The cluster centers are as follows:')
clusters = pd.DataFrame(
    sc.inverse_transform(ms2.cluster_centers_),
    columns = data_sc.columns
)

clusters.style.background_gradient()

The cluster centers are as follows:


| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| 0 | 58.503932 | 50.45118 | 2.909567 | 0.352707 |
| 1 | 51.506838 | 45.698311 | 1.763475 | -0.720294 |
| 2 | 55.428571 | 7.0 | 0.142857 | -1.890714 |
| 3 | -1.0 | -1.0 | 0.25 | -1.826 |
| 4 | 95.0 | 85.0 | 0.0 | -2.053 |
| 5 | 80.0 | 10.0 | 3.0 | 0.712 |


In [23]:
data['ms2_clusters'].value_counts()

0    4154
1    2137
4      48
2      28
5      16
3      11
Name: ms2_clusters, dtype: int64

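These results are more reasonable than the default-bandwidth run, but they are still imbalanced: the first two clusters hold most of the data (6,291 of 6,394 responses), while the other four hold 48 or fewer each. Moving the bandwidth up or down from 1.6 didn't change that.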
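
One way to make the "up or down from 1.6" exploration systematic is to loop over several bandwidths and record the number of clusters and the smallest cluster size for each. This is a rough sketch on synthetic blobs; the bandwidth grid is arbitrary, and the numbers it prints say nothing about the survey data.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Synthetic data standing in for the scaled survey scores.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=1)

summary = {}
for bw in (0.8, 1.2, 1.6, 2.0, 3.0):  # arbitrary grid for illustration
    labels = MeanShift(bandwidth=bw).fit_predict(X)
    # Count how many points landed in each cluster that was actually used.
    _, sizes = np.unique(labels, return_counts=True)
    summary[bw] = (len(sizes), int(sizes.min()))
    print(f"bandwidth={bw}: {len(sizes)} clusters, smallest has {sizes.min()} points")
```

Swapping `X` for `data_sc` would apply the same loop to the scaled survey scores, at the cost of a longer run time.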