## Day 49 Lecture 2 Assignment

In this assignment, we will apply mean shift clustering to a dataset containing the results of a survey on financial wellbeing.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
from sklearn.preprocessing import StandardScaler

This dataset contains the results of a survey on financial wellbeing conducted by the US Consumer Financial Protection Bureau and published in October 2017. The dataset has a large number of columns, most of which correspond to specific questions on the survey. The codebook for translating the column names to questions can be found here:

https://s3.amazonaws.com/files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-codebook.pdf

Load the dataset.

In [7]:
# answer goes here
data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/financial_wellbeing.csv')
data.head()

Unnamed: 0,PUF_ID,sample,fpl,SWB_1,SWB_2,SWB_3,FWBscore,FWB1_1,FWB1_2,FWB1_3,FWB1_4,FWB1_5,FWB1_6,FWB2_1,FWB2_2,FWB2_3,FWB2_4,FSscore,FS1_1,FS1_2,FS1_3,FS1_4,FS1_5,FS1_6,FS1_7,FS2_1,FS2_2,FS2_3,SUBKNOWL1,ACT1_1,ACT1_2,FINGOALS,PROPPLAN_1,PROPPLAN_2,PROPPLAN_3,PROPPLAN_4,MANAGE1_1,MANAGE1_2,MANAGE1_3,MANAGE1_4,...,SOCSEC2,SOCSEC3,LIFEEXPECT,HHEDUC,KIDS_NoChildren,KIDS_1,KIDS_2,KIDS_3,KIDS_4,EMPLOY,EMPLOY1_1,EMPLOY1_2,EMPLOY1_3,EMPLOY1_4,EMPLOY1_5,EMPLOY1_6,EMPLOY1_7,EMPLOY1_8,EMPLOY1_9,RETIRE,MILITARY,Military_Status,agecat,generation,PPEDUC,PPETHM,PPGENDER,PPHHSIZE,PPINCIMP,PPMARIT,PPMSACAT,PPREG4,PPREG9,PPT01,PPT25,PPT612,PPT1317,PPT18OV,PCTLT200FPL,finalwt
0,10350,2,3,5,5,6,55,3,3,3,3,2,3,2,3,2,4,44,3,3,4,3,3,3,4,4,3,4,5,4,3,1,5,4,4,3,4,4,2,4,...,62,-2,-2,4,-1,0,0,0,0,8,0,0,0,0,0,0,0,1,0,1,0,5,8,1,4,1,1,1,7,3,1,4,8,0,0,0,0,1,0,0.367292
1,7740,1,3,6,6,6,51,2,2,3,3,3,4,2,2,2,3,43,3,3,3,3,4,3,2,4,3,2,5,4,3,0,3,2,2,1,4,4,1,4,...,-2,66,90,2,1,0,0,0,0,2,0,1,0,0,0,0,0,0,0,-2,0,5,3,3,2,1,1,2,6,3,1,2,3,0,0,0,0,2,0,1.327561
2,13699,1,3,4,3,4,49,3,3,3,3,3,3,3,3,3,3,42,3,3,3,3,3,3,3,3,3,3,5,3,3,1,4,4,4,4,3,3,3,3,...,-2,68,78,3,0,0,0,0,1,2,0,1,0,0,0,0,0,0,0,-2,0,5,3,3,3,2,1,3,6,3,1,4,9,0,0,0,1,2,1,0.835156
3,7267,1,3,6,6,6,49,3,3,3,3,3,3,3,3,3,3,42,3,3,3,3,3,3,3,3,3,3,-1,-1,-1,-1,3,3,3,3,4,4,2,4,...,-2,-1,-1,-1,-1,0,0,0,0,99,0,0,0,0,0,0,0,0,1,-2,-1,-1,3,3,2,1,1,1,8,3,1,3,7,0,0,0,0,1,0,1.410871
4,7375,1,3,4,4,4,49,3,3,3,3,3,3,3,3,3,3,42,3,3,3,3,3,3,3,3,3,3,4,3,3,1,3,3,3,3,3,3,3,3,...,-2,65,75,2,1,0,0,0,0,2,0,1,0,1,0,0,0,0,0,-2,0,5,2,4,2,3,1,5,7,1,1,2,4,0,0,1,0,4,1,4.260668


While the survey questions have the potential for interesting cluster analysis, we will stick to the "score" columns to avoid clustering in an unreasonably high-dimensional space. The columns we are interested in all have "score" in their names; identify and isolate these columns. (There should be 4 in total.)

In [14]:
# answer goes here
score_data = data.loc[:, data.columns.str.endswith('score')]
score_data_PUF = pd.concat([data['PUF_ID'], score_data], axis=1)
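
As a small sanity check on the column selection, the sketch below uses a toy DataFrame whose values are made up (only the column names mirror the survey) and shows two equivalent ways to isolate the "score" columns. `toy`, `by_filter`, and `by_suffix` are illustrative names, not part of the assignment.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the survey.
toy = pd.DataFrame({
    "PUF_ID": [1, 2],
    "FWBscore": [55, 51],
    "FSscore": [44, 43],
    "LMscore": [3, 3],
    "KHscore": [1.267, -0.570],
    "FWB1_1": [3, 2],
})

# Option 1: substring match on the column labels.
by_filter = toy.filter(like="score")

# Option 2: boolean mask on the column index (the approach used in the cell above).
by_suffix = toy.loc[:, toy.columns.str.endswith("score")]

score_cols = list(by_filter.columns)
print(score_cols)
print(len(score_cols))
```

Checking `len(score_cols)` against the expected count of 4 is a quick way to catch a column that matches the pattern by accident.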

Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [16]:
# answer goes here
ss = StandardScaler()
X_scaled = pd.DataFrame(ss.fit_transform(score_data), columns=score_data.columns)
X_scaled

Unnamed: 0,FWBscore,FSscore,LMscore,KHscore
0,-0.073062,-0.530906,0.653830,1.624716
1,-0.355677,-0.609920,0.653830,-0.629626
2,-0.496984,-0.688935,0.653830,-0.160841
3,-0.496984,-0.688935,-0.670399,-1.752502
4,-0.496984,-0.688935,-1.994628,-2.261785
...,...,...,...,...
6389,0.350859,-0.293863,0.653830,1.624716
6390,0.209552,0.654309,-1.994628,-1.421162
6391,0.209552,0.022194,-0.670399,-1.421162
6392,-0.708944,0.259237,-0.670399,-1.421162
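
To confirm that standardization did what the instructions describe, one can check the column means and standard deviations after scaling. The sketch below uses randomly generated stand-in data (the distribution parameters are arbitrary assumptions), and relies on the fact that `StandardScaler` divides by the population standard deviation, so the check uses `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Random stand-in data; the means and spreads here are arbitrary assumptions.
rng = np.random.default_rng(0)
toy_scores = pd.DataFrame({
    "FWBscore": rng.normal(56, 14, size=200),
    "FSscore": rng.normal(51, 12, size=200),
})

scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy_scores),
    columns=toy_scores.columns,
)

# StandardScaler uses the population standard deviation, hence ddof=0.
col_means = scaled.mean()
col_stds = scaled.std(ddof=0)
print(col_means.round(6))
print(col_stds.round(6))
```

Note that pandas defaults to `ddof=1` in `std()`, so calling `scaled.std()` without the argument gives values slightly above 1.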


Run mean shift clustering on the scores in the survey dataset using the default bandwidth. Then answer the following by printing or typing as appropriate:

- How many clusters are produced? 
- What are the cluster centers?
- How many responses are assigned to each cluster?
- Are these results reasonable? If not, what changes should we make?

In [32]:
# answer goes here
# bandwidth=None (the default): scikit-learn estimates it with estimate_bandwidth
ms = MeanShift()

In [33]:
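# Fit and label in one step. Caution: this passes every column of X_scaled to the
# model, so re-running the cell feeds the existing MS_Cluster labels back in as a
# fifth feature; X_scaled[score_data.columns] restricts the fit to the four scores.
X_scaled['MS_Cluster'] = ms.fit_predict(X_scaled)
print("----------Number of Clusters & Number of Responses per Cluster----------")
print(X_scaled['MS_Cluster'].value_counts())
print("----------Cluster Centers----------")
print(ms.cluster_centers_)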

----------Number of Clusters & Number of Responses per Cluster----------
0    6326
1      68
Name: MS_Cluster, dtype: int64
----------Cluster Centers----------
[[ 0.0974295  -0.04887488  0.40603632  0.33425047  0.        ]
 [-4.02966165 -4.08655136 -2.98779955 -2.17097301  1.        ]]


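Two clusters are produced, and the result is not reasonable: 6,326 of the 6,394 responses are assigned to cluster 0 and only 68 to cluster 1. The default `bandwidth=None` does not mean that no bandwidth is used; scikit-learn estimates one from the data with `estimate_bandwidth`. An estimate that merges almost all of the data into a single mode suggests the kernel is poorly matched to this data, so we should try other bandwidth values. Note also that the printed cluster centers have five coordinates rather than four, and the last one equals the cluster label: the `MS_Cluster` column from an earlier run of the cell was included in the features. The fit should use only the four naturally imbalanced score columns, not the labels.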
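
The sketch below illustrates two points on synthetic data: what bandwidth `MeanShift()` falls back to when none is given, and how keeping an explicit list of feature columns makes the cell safe to re-run. The blob data, `feature_cols`, and `toy` are assumptions made for illustration; they are not the survey data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the four standardized score columns.
features, _ = make_blobs(
    n_samples=300,
    centers=[[-3] * 4, [3] * 4],
    cluster_std=0.5,
    random_state=0,
)
feature_cols = ["FWBscore", "FSscore", "LMscore", "KHscore"]
toy = pd.DataFrame(features, columns=feature_cols)

# With bandwidth=None, MeanShift estimates the bandwidth this way internally.
default_bw = estimate_bandwidth(toy[feature_cols])

ms = MeanShift()
labels = ms.fit_predict(toy[feature_cols])  # fit on the features only
toy["MS_Cluster"] = labels                  # labels are stored, never fed back in

# Re-running the fit gives the same result because the label column is excluded.
labels_again = MeanShift().fit_predict(toy[feature_cols])

print(default_bw)
print(ms.cluster_centers_.shape)
print(toy["MS_Cluster"].value_counts())
```

With this pattern the cluster centers always have one coordinate per score column, regardless of how many times the cell is executed.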

Try changing the appropriate parameters of the mean shift algorithm to achieve a better clustering result. Answer all of the same questions from the previous clustering step.
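
One possible way to pick candidate bandwidths for this step, rather than guessing absolute values, is to derive them from the data by varying the `quantile` argument of `estimate_bandwidth`. The sketch below does this on synthetic blobs; the data, the chosen quantiles, and the `results` dictionary are illustrative assumptions. `bin_seeding=True` is optional and only reduces the number of starting seeds.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic four-feature data standing in for the standardized scores.
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

results = {}
for q in (0.1, 0.2, 0.3, 0.5):
    # Larger quantiles average over more neighbors and give wider bandwidths.
    bw = estimate_bandwidth(X, quantile=q, random_state=0)
    ms = MeanShift(bandwidth=bw, bin_seeding=True)
    labels = ms.fit_predict(X)
    counts = np.bincount(labels)  # responses per cluster, indexed by label
    results[q] = (bw, counts)
    print(f"quantile={q}: bandwidth={bw:.3f}, clusters={len(counts)}, sizes={counts}")
```

Comparing the cluster sizes across quantiles shows whether a setting yields a handful of sizeable clusters or many singletons.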

In [38]:
# answer goes here

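# Candidate bandwidths: 0.5, 1.5, 2.5, 3.5, 4.5
b = np.arange(.5, 5.5, 1)

# Caution: X_scaled still holds the MS_Cluster column from the previous cell, so each
# fit below sees five columns; X_scaled[score_data.columns] would limit it to the scores.
for i in b:
    ms = MeanShift(bandwidth=i)
    X_scaled['MS_Cluster'] = ms.fit_predict(X_scaled)
    print("----------Bandwidth {}: Number of Clusters & Number of Responses per Cluster----------".format(i))
    print(X_scaled['MS_Cluster'].value_counts())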

----------Bandwidth 0.5: Number of Clusters & Number of Responses per Cluster----------
0      487
2      481
4      342
1      325
3      273
      ... 
290      1
294      1
298      1
302      1
501      1
Name: MS_Cluster, Length: 505, dtype: int64
----------Bandwidth 1.5: Number of Clusters & Number of Responses per Cluster----------
1      812
0      754
2      342
3      304
4      214
      ... 
266      1
270      1
274      1
278      1
461      1
Name: MS_Cluster, Length: 466, dtype: int64
----------Bandwidth 2.5: Number of Clusters & Number of Responses per Cluster----------
0      1908
1       716
2       561
3       250
10      196
       ... 
174       1
267       1
265       1
261       1
269       1
Name: MS_Cluster, Length: 272, dtype: int64
----------Bandwidth 3.5: Number of Clusters & Number of Responses per Cluster----------
0     3624
1      977
3      426
2      253
4      148
      ... 
66       1
70       1
74       1
63       1
73       1
Name: MS_Cluster, Len