## Day 49 Lecture 2 Assignment

In this assignment, we will apply mean shift clustering to a dataset containing the results of a survey on financial wellbeing.

In [1]:
%load_ext nb_black


In [2]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
from sklearn.preprocessing import StandardScaler


This dataset contains the results of a survey on financial wellbeing conducted by the US Consumer Financial Protection Bureau and published in October 2017. The dataset has a large number of columns, most of which correspond to specific survey questions. The codebook for translating the column names to questions can be found here:

https://s3.amazonaws.com/files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-codebook.pdf

Load the dataset.

In [3]:
# answer goes here

survey_df = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/financial_wellbeing.csv')

survey_df


Unnamed: 0,PUF_ID,sample,fpl,SWB_1,SWB_2,SWB_3,FWBscore,FWB1_1,FWB1_2,FWB1_3,...,PPMSACAT,PPREG4,PPREG9,PPT01,PPT25,PPT612,PPT1317,PPT18OV,PCTLT200FPL,finalwt
0,10350,2,3,5,5,6,55,3,3,3,...,1,4,8,0,0,0,0,1,0,0.367292
1,7740,1,3,6,6,6,51,2,2,3,...,1,2,3,0,0,0,0,2,0,1.327561
2,13699,1,3,4,3,4,49,3,3,3,...,1,4,9,0,0,0,1,2,1,0.835156
3,7267,1,3,6,6,6,49,3,3,3,...,1,3,7,0,0,0,0,1,0,1.410871
4,7375,1,3,4,4,4,49,3,3,3,...,1,2,4,0,0,1,0,4,1,4.260668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6389,11220,3,3,6,7,7,61,3,3,1,...,1,2,3,0,0,0,1,2,-5,0.522504
6390,13118,3,2,7,7,7,59,3,4,2,...,1,3,6,0,0,0,0,3,-5,1.015219
6391,8709,1,3,5,6,6,59,3,4,3,...,1,1,2,0,0,0,0,2,0,1.136270
6392,8515,1,3,5,5,5,46,2,2,3,...,1,4,9,0,0,0,0,2,0,1.224941



While the survey questions have the potential for interesting cluster analysis, we will stick to the "score" columns to avoid clustering in an unreasonably high-dimensional space. The columns we are interested in all have "score" in their names; identify and isolate these columns. (There should be 4 in total.)

In [4]:
# answer goes here

X = survey_df.filter(like='score').copy()
X



Unnamed: 0,FWBscore,FSscore,LMscore,KHscore
0,55,44,3,1.267
1,51,43,3,-0.570
2,49,42,3,-0.188
3,49,42,2,-1.485
4,49,42,1,-1.900
...,...,...,...,...
6389,61,47,3,1.267
6390,59,59,1,-1.215
6391,59,51,2,-1.215
6392,46,54,2,-1.215


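
As a small illustration of the column selection above, `DataFrame.filter(like=...)` keeps every column whose label contains the given substring. The toy frame below is hypothetical; only its column names imitate the survey's naming pattern.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame(
    {
        "PUF_ID": [1, 2],
        "FWBscore": [55, 51],
        "FSscore": [44, 43],
        "SWB_1": [5, 6],
    }
)

# Keep only the columns whose label contains the substring "score".
score_cols = toy.filter(like="score")
print(list(score_cols.columns))
```

The match is case-sensitive, so a column named `Score` would not be selected.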

Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [5]:
# answer goes here
scale = StandardScaler()
X_scale = pd.DataFrame(scale.fit_transform(X), columns=X.columns)
X_scale



Unnamed: 0,FWBscore,FSscore,LMscore,KHscore
0,-0.073062,-0.530906,0.653830,1.624716
1,-0.355677,-0.609920,0.653830,-0.629626
2,-0.496984,-0.688935,0.653830,-0.160841
3,-0.496984,-0.688935,-0.670399,-1.752502
4,-0.496984,-0.688935,-1.994628,-2.261785
...,...,...,...,...
6389,0.350859,-0.293863,0.653830,1.624716
6390,0.209552,0.654309,-1.994628,-1.421162
6391,0.209552,0.022194,-0.670399,-1.421162
6392,-0.708944,0.259237,-0.670399,-1.421162


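
To see what the standardization step does, the sketch below applies the same formula by hand with NumPy: subtract each column's mean and divide by its population standard deviation (`ddof=0`, which is what `StandardScaler` uses). The small `raw` array is made-up data, not taken from the survey.

```python
import numpy as np

# Made-up two-column data standing in for two score columns.
raw = np.array([[55.0, 44.0], [51.0, 43.0], [49.0, 42.0], [61.0, 47.0]])

# z = (x - mean) / std, computed column by column (np.std defaults to ddof=0).
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
```

Putting all four scores on the same scale matters for mean shift because a single bandwidth is applied across every dimension.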

Run mean shift clustering on the scores in the survey dataset using the default bandwidth. Then answer the following by printing or typing as appropriate:

- How many clusters are produced? 
- What are the cluster centers?
- How many responses are assigned to each cluster?
- Are these results reasonable? If not, what changes should we make?

In [6]:
# answer goes here
ms = MeanShift()
ms.fit(X_scale)


MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, max_iter=300,
          min_bin_freq=1, n_jobs=None, seeds=None)

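
To make the role of the bandwidth concrete, here is a minimal sketch of the flat-kernel update that mean shift repeats for each seed point: average all points within `bandwidth` of the current center, move there, and stop once the center barely moves. The helper `shift_to_mode` and the synthetic two-blob data are illustrative assumptions, not scikit-learn's implementation, which also merges nearby modes and assigns labels.

```python
import numpy as np


def shift_to_mode(points, start, bandwidth, max_iter=300, tol=1e-3):
    """Follow one seed to a density mode with a flat kernel (hypothetical helper)."""
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Flat kernel: every point inside the window gets equal weight.
        in_window = np.linalg.norm(points - center, axis=1) <= bandwidth
        new_center = points[in_window].mean(axis=0)
        if np.linalg.norm(new_center - center) < tol:
            return new_center
        center = new_center
    return center


# Synthetic data: two well-separated blobs around (-3, -3) and (3, 3).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=-3.0, scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=3.0, scale=0.5, size=(100, 2))
demo = np.vstack([blob_a, blob_b])

mode_a = shift_to_mode(demo, start=blob_a[0], bandwidth=2.0)
mode_b = shift_to_mode(demo, start=blob_b[0], bandwidth=2.0)
print(mode_a, mode_b)
```

A wide window pulls many points toward the same mode, which tends to give few clusters; a narrow window lets small local bumps survive as separate modes, which tends to give many.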

In [11]:
survey_df["clusters"] = ms.predict(X_scale)

cluster_counts = survey_df.clusters.value_counts()
print(f"{len(cluster_counts)} clusters")
print(cluster_counts)
print(f"cluster centers: {ms.cluster_centers_}")

2 clusters
0    6326
1      68
Name: clusters, dtype: int64
cluster centers: [[ 0.09918758 -0.04722472  0.40797515  0.33667793]
 [-4.02966165 -4.08655136 -2.98779955 -2.17097301]]



In [12]:
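# 2 clusters, but 6,326 of the 6,394 responses (about 99%) fall in cluster 0.
# That is not a useful split: the default bandwidth is too wide, so try smaller values.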

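
When `bandwidth` is left as `None`, scikit-learn's `MeanShift` estimates it with `sklearn.cluster.estimate_bandwidth`, whose default `quantile` is 0.3. The sketch below, run on synthetic data standing in for `X_scale`, suggests one way to pick a starting point for tuning: a smaller quantile gives a smaller bandwidth.

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth

# Synthetic stand-in for the standardized scores (assumption: 300 rows, 4 features).
rng = np.random.default_rng(0)
demo = rng.normal(size=(300, 4))

# Default quantile is 0.3; a lower quantile uses nearer neighbors only.
bw_default = estimate_bandwidth(demo)
bw_small = estimate_bandwidth(demo, quantile=0.1)

print(f"quantile=0.3 -> {bw_default:.3f}")
print(f"quantile=0.1 -> {bw_small:.3f}")
```

On the survey data, the same call with `X_scale` would show the value the default run used, which gives a reference point for the manual bandwidths tried next.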

Try changing the appropriate parameters of the mean shift algorithm to achieve a better clustering result. Answer all of the same questions from the previous clustering step.

In [14]:
for i in np.linspace(0.95, 1.5, 6):
    ms = MeanShift(bandwidth=i)
    ms.fit(X_scale)
    survey_df["clusters"] = ms.predict(X_scale)
    print(f"bandwidth {i}")
    cluster_counts = survey_df.clusters.value_counts()
    print(f"{len(cluster_counts)} clusters")
    print(cluster_counts)

bandwidth 0.95
47 clusters
0     3201
1     1410
2      429
10     120
29     119
7      117
4      116
19     107
18      87
3       80
8       72
44      69
17      48
15      47
20      45
5       35
6       35
12      29
41      19
28      19
13      18
37      17
14      16
26      13
9       12
33      11
42      11
11      11
31      11
40      10
24      10
39       7
46       7
43       6
35       5
34       4
16       4
21       3
25       3
22       2
36       2
23       2
32       1
30       1
38       1
27       1
45       1
Name: clusters, dtype: int64
bandwidth 1.06
36 clusters
0     3335
1     1349
2      435
4      246
9      165
13     118
6      114
3       96
12      85
15      74
11      63
7       53
5       46
8       32
22      21
32      19
29      18
10      13
20      13
24      13
33      11
31      10
30      10
14       9
19       8
35       8
23       7
27       5
26       4
28       4
16       3
17       2
18       2
21       1
25       1
34       1
Name

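
Printing the full `value_counts` for every bandwidth gets long quickly. One possible alternative, sketched here on synthetic two-blob data rather than the survey, is to record just the number of clusters and the share of points in the largest cluster for each bandwidth. The bandwidth values and the `summary` structure are illustrative choices.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Synthetic data: two blobs standing in for the standardized scores.
rng = np.random.default_rng(0)
demo = np.vstack(
    [rng.normal(-3.0, 0.5, size=(100, 2)), rng.normal(3.0, 0.5, size=(100, 2))]
)

summary = []
for bw in (0.5, 1.0, 2.0):
    labels = MeanShift(bandwidth=bw).fit(demo).labels_
    sizes = np.bincount(labels)  # points per cluster label
    summary.append((bw, len(sizes), sizes.max() / len(labels)))

for bw, n_clusters, top_share in summary:
    print(f"bandwidth {bw}: {n_clusters} clusters, largest holds {top_share:.0%}")
```

A compact table like this makes it easier to spot the bandwidth range where the cluster count stops changing rapidly.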

In [16]:
ms = MeanShift(bandwidth=1.6)
ms.fit(X_scale)
survey_df["clusters"] = ms.predict(X_scale)
print(f"bandwidth {ms.bandwidth}")
cluster_counts = survey_df.clusters.value_counts()
print(f"{len(cluster_counts)} clusters")
print(cluster_counts)
print(f"cluster centers: {ms.cluster_centers_}")

bandwidth 1.6
6 clusters
0    4154
1    2137
4      48
2      28
5      16
3      11
Name: clusters, dtype: int64
cluster centers: [[ 0.1745028  -0.02117048  0.53407698  0.50270745]
 [-0.31986675 -0.39671528 -0.98361138 -0.81406581]
 [-0.04278236 -3.45443665 -3.12968123 -2.25038953]
 [-4.02966165 -4.08655136 -2.98779955 -2.17097301]
 [ 2.75307983  2.70868169 -3.31885681 -2.4495444 ]
 [ 1.69327648 -3.21739364  0.65383032  0.94362696]]


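
The cluster centers above are expressed in standardized units, which makes them hard to read as survey scores. A fitted `StandardScaler` can map them back with `inverse_transform`. The sketch below uses made-up data and made-up centers to show the idea; in the notebook the equivalent call would presumably be `scale.inverse_transform(ms.cluster_centers_)`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up raw scores and a scaler fitted on them.
raw = np.array([[55.0, 44.0], [51.0, 43.0], [49.0, 42.0], [61.0, 47.0]])
scaler = StandardScaler().fit(raw)

# Hypothetical cluster centers in standardized units.
centers_std = np.array([[0.0, 0.0], [1.0, -1.0]])

# Back to original units: center * std + mean, column by column.
centers_raw = scaler.inverse_transform(centers_std)
print(centers_raw)
```

A standardized center of all zeros maps back to the column means, so centers near zero describe "typical" respondents.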