## Day 49 Lecture 2 Assignment

In this assignment, we will apply mean shift clustering to a dataset containing the results of a survey on financial wellbeing.

In [2]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
from sklearn.preprocessing import StandardScaler

This dataset contains the results of a survey on financial wellbeing conducted by the US Consumer Financial Protection Bureau and published in October 2017. It has a large number of columns, most of which correspond to specific questions on the survey. The codebook for translating the column names to questions can be found here:

https://s3.amazonaws.com/files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-codebook.pdf

Load the dataset.

In [3]:
# answer goes here
df = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/financial_wellbeing.csv')
df.head(2)

Unnamed: 0,PUF_ID,sample,fpl,SWB_1,SWB_2,SWB_3,FWBscore,FWB1_1,FWB1_2,FWB1_3,FWB1_4,FWB1_5,FWB1_6,FWB2_1,FWB2_2,FWB2_3,FWB2_4,FSscore,FS1_1,FS1_2,FS1_3,FS1_4,FS1_5,FS1_6,FS1_7,FS2_1,FS2_2,FS2_3,SUBKNOWL1,ACT1_1,ACT1_2,FINGOALS,PROPPLAN_1,PROPPLAN_2,PROPPLAN_3,PROPPLAN_4,MANAGE1_1,MANAGE1_2,MANAGE1_3,MANAGE1_4,...,SOCSEC2,SOCSEC3,LIFEEXPECT,HHEDUC,KIDS_NoChildren,KIDS_1,KIDS_2,KIDS_3,KIDS_4,EMPLOY,EMPLOY1_1,EMPLOY1_2,EMPLOY1_3,EMPLOY1_4,EMPLOY1_5,EMPLOY1_6,EMPLOY1_7,EMPLOY1_8,EMPLOY1_9,RETIRE,MILITARY,Military_Status,agecat,generation,PPEDUC,PPETHM,PPGENDER,PPHHSIZE,PPINCIMP,PPMARIT,PPMSACAT,PPREG4,PPREG9,PPT01,PPT25,PPT612,PPT1317,PPT18OV,PCTLT200FPL,finalwt
0,10350,2,3,5,5,6,55,3,3,3,3,2,3,2,3,2,4,44,3,3,4,3,3,3,4,4,3,4,5,4,3,1,5,4,4,3,4,4,2,4,...,62,-2,-2,4,-1,0,0,0,0,8,0,0,0,0,0,0,0,1,0,1,0,5,8,1,4,1,1,1,7,3,1,4,8,0,0,0,0,1,0,0.367292
1,7740,1,3,6,6,6,51,2,2,3,3,3,4,2,2,2,3,43,3,3,3,3,4,3,2,4,3,2,5,4,3,0,3,2,2,1,4,4,1,4,...,-2,66,90,2,1,0,0,0,0,2,0,1,0,0,0,0,0,0,0,-2,0,5,3,3,2,1,1,2,6,3,1,2,3,0,0,0,0,2,0,1.327561


While the survey questions have the potential for interesting cluster analysis, we will stick to the "score" columns to avoid clustering in an unreasonably high-dimensional space. The columns we are interested in all have "score" in their names; identify and isolate these columns. (There should be 4 in total.)

In [4]:
# answer goes here
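# Keep only the four "score" columns; .copy() makes df2 an independent frame so label columns can be added later
df2 = df.loc[:, ['FWBscore', 'FSscore', 'LMscore', 'KHscore']].copy()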
df2

Unnamed: 0,FWBscore,FSscore,LMscore,KHscore
0,55,44,3,1.267
1,51,43,3,-0.570
2,49,42,3,-0.188
3,49,42,2,-1.485
4,49,42,1,-1.900
...,...,...,...,...
6389,61,47,3,1.267
6390,59,59,1,-1.215
6391,59,51,2,-1.215
6392,46,54,2,-1.215
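
The cell above lists the four column names by hand. Since the prompt asks to identify the columns by name, a minimal sketch of doing that programmatically is shown below. The small `df_demo` frame is a hypothetical stand-in for the survey data (it borrows a few values from the first two rows shown earlier), so the example runs without downloading the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the survey DataFrame: two rows, a few columns
df_demo = pd.DataFrame({
    "PUF_ID": [10350, 7740],
    "FWBscore": [55, 51],
    "FSscore": [44, 43],
    "LMscore": [3, 3],
    "KHscore": [1.267, -0.570],
    "finalwt": [0.367292, 1.327561],
})

# Pick every column whose name contains the substring "score"
score_cols = [col for col in df_demo.columns if "score" in col]

# .copy() gives an independent frame that can safely receive new columns
scores = df_demo[score_cols].copy()

print(score_cols)
print(scores.shape)
```

`df_demo.filter(like="score")` should select the same columns in a single call.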


Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [8]:
# answer goes here
scale = StandardScaler()
X_scale = scale.fit_transform(df2)
X_scale

array([[-0.07306245, -0.53090616,  0.65383032,  1.62471561],
       [-0.35567668, -0.6099205 ,  0.65383032, -0.62962645],
       [-0.4969838 , -0.68893483,  0.65383032, -0.16084111],
       ...,
       [ 0.20955178,  0.02219421, -0.67039872, -1.42116191],
       [-0.70894447,  0.25923722, -0.67039872, -1.42116191],
       [-0.4969838 , -0.68893483, -0.67039872, -1.42116191]])
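
As a quick sanity check on the standardization step, the sketch below confirms that each column has mean 0 and standard deviation 1 after `StandardScaler`. It uses randomly generated stand-in data (an assumption made for self-containment, not the survey scores); in the notebook the same checks could be run on `X_scale`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: four columns on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[50, 45, 2, 0], scale=[14, 12, 0.8, 0.9], size=(500, 4))

X_scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std()
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```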

Run mean shift clustering on the scores in the survey dataset using the default bandwidth. Then answer the following by printing or typing as appropriate:

- How many clusters are produced? 
- What are the cluster centers?
- How many responses are assigned to each cluster?
- Are these results reasonable? If not, what changes should we make?

In [9]:
# answer goes here
msc = MeanShift()
df2['clusters'] = msc.fit_predict(X_scale)
df2

Unnamed: 0,FWBscore,FSscore,LMscore,KHscore,clusters
0,55,44,3,1.267,0
1,51,43,3,-0.570,0
2,49,42,3,-0.188,0
3,49,42,2,-1.485,0
4,49,42,1,-1.900,0
...,...,...,...,...,...
6389,61,47,3,1.267,0
6390,59,59,1,-1.215,0
6391,59,51,2,-1.215,0
6392,46,54,2,-1.215,0


In [10]:
df2['clusters'].value_counts()

0    6326
1      68
Name: clusters, dtype: int64
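
The questions above also ask for the number of clusters and the cluster centers, which a fitted `MeanShift` object exposes through its `cluster_centers_` and `labels_` attributes. The sketch below shows one way to read them, using synthetic two-blob data from `make_blobs` as a stand-in (an assumption for illustration; the explicit `bandwidth=2.0` is also chosen only for this toy data).

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Synthetic stand-in: two tight, well-separated blobs of 150 points each
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [5, 5]],
    cluster_std=0.5,
    random_state=0,
)

ms = MeanShift(bandwidth=2.0).fit(X)

n_clusters = len(ms.cluster_centers_)   # one row per cluster center
counts = np.bincount(ms.labels_)        # number of points assigned to each label

print("clusters:", n_clusters)
print("centers:\n", ms.cluster_centers_)
print("counts:", counts)
```

In the notebook, the corresponding attributes would be `msc.cluster_centers_` and `msc.labels_`. Because the model was fit on standardized data, those centers are in standardized units; `scale.inverse_transform(msc.cluster_centers_)` should map them back to the original score units.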

Try changing the appropriate parameters of the mean shift algorithm to achieve a better clustering result. Answer all of the same questions from the previous clustering step.
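
When no bandwidth is passed, `MeanShift` estimates one with `sklearn.cluster.estimate_bandwidth`. Instead of guessing bandwidth values by hand, one option is to call that function directly and vary its `quantile` parameter. The sketch below does this on synthetic data (an assumption; the quantile values are arbitrary illustrations, not recommendations for the survey data).

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in data with three blobs
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=42)

# A larger quantile looks at more distant neighbors, giving a larger bandwidth
bandwidths = {
    q: estimate_bandwidth(X, quantile=q, random_state=0)
    for q in (0.1, 0.3, 0.5)
}

for q, bw in bandwidths.items():
    n_found = len(MeanShift(bandwidth=bw).fit(X).cluster_centers_)
    print(f"quantile={q}: bandwidth={bw:.3f}, clusters={n_found}")
```

In the notebook, `estimate_bandwidth(X_scale)` would show which bandwidth the default run used, which gives a reference point for the manual values tried below.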

In [12]:
msc2 = MeanShift(bandwidth=0.1)
df2['clusters2'] = msc2.fit_predict(X_scale)
df2['clusters2'].value_counts()


8       18
3       18
16      16
2       16
6       15
        ..
2879     1
2875     1
2871     1
2867     1
2041     1
Name: clusters2, Length: 3378, dtype: int64

In [15]:
df2['clusters2'].nunique()

3378

In [13]:
# answer goes here
msc3 = MeanShift(bandwidth=0.5)
df2['clusters3'] = msc3.fit_predict(X_scale)
df2['clusters3'].value_counts()

2      484
0      480
4      342
1      320
3      273
      ... 
389      1
385      1
381      1
377      1
421      1
Name: clusters3, Length: 424, dtype: int64

In [16]:
df2['clusters3'].nunique()

424

In [17]:
msc4 = MeanShift(bandwidth=1)
df2['clusters4'] = msc4.fit_predict(X_scale)
df2['clusters4'].value_counts()

0     3215
1     1353
2      441
4      229
9      132
7      130
25     119
16     117
15     100
3       85
5       74
17      63
14      54
6       39
13      30
10      25
11      24
24      19
32      18
8       14
22      13
35      12
36      12
37      11
27      11
34      10
38       6
30       5
40       5
12       5
33       4
29       4
18       3
20       2
31       2
21       2
19       2
23       1
28       1
26       1
39       1
Name: clusters4, dtype: int64

In [19]:
df2['clusters4'].nunique()

41

In [20]:
msc5 = MeanShift(bandwidth=1.5)
df2['clusters5'] = msc5.fit_predict(X_scale)
df2['clusters5'].value_counts()

0    4096
1    2156
5      50
4      24
2      23
7      22
6      12
3      11
Name: clusters5, dtype: int64

In [24]:
df2['clusters5'].nunique()

8

In [22]:
msc6 = MeanShift(bandwidth=2)
df2['clusters6'] = msc6.fit_predict(X_scale)
df2['clusters6'].value_counts()

0    6326
1      68
Name: clusters6, dtype: int64

In [23]:
df2['clusters6'].nunique()

2

In [25]:
msc7 = MeanShift(bandwidth=2.5)
df2['clusters7'] = msc7.fit_predict(X_scale)
df2['clusters7'].value_counts()

0    6394
Name: clusters7, dtype: int64

In [26]:
df2['clusters7'].nunique()

1

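The most useful bandwidth probably lies somewhere between 1.5 and 2.0. With a bandwidth of 1.5 there were slightly too many clusters (8), and with 2.0 there were too few (2, the same split the default bandwidth produced). Bandwidth behaves as you would expect: small values fragment the data into many tiny clusters (3378 at 0.1), and large values merge everything into one (a single cluster at 2.5). Tuning it is a matter of dialing in a value that yields a sensible number of clusters, almost like tuning an old radio.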

One potential problem with this model is that the clusters are highly imbalanced. Even at a bandwidth of 1.5, the two largest clusters hold 6252 of the 6394 responses (4096 and 2156), and the remaining six clusters share the other 142.
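
The bandwidth experiments above repeat the same cell for each value. A compact alternative is to sweep the values in a loop and record both the number of clusters and the share of points in the largest cluster, which makes the imbalance visible at a glance. The sketch below uses synthetic data as a stand-in (an assumption); in the notebook, `X` would be replaced by `X_scale`, and the candidate list could be narrowed to values between 1.5 and 2.0.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, standardized like the survey scores
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=1)
X = StandardScaler().fit_transform(X)

results = []
for bw in (0.5, 1.0, 1.5, 2.0, 2.5):
    labels = MeanShift(bandwidth=bw).fit_predict(X)
    n_found = len(np.unique(labels))                      # distinct cluster labels
    largest_share = np.bincount(labels).max() / len(labels)  # fraction in biggest cluster
    results.append((bw, n_found, largest_share))

for bw, n_found, largest_share in results:
    print(f"bandwidth={bw}: clusters={n_found}, largest cluster share={largest_share:.2f}")
```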