## Day 49 Lecture 2 Assignment

In this assignment, we will apply mean shift clustering to a dataset containing the results of a survey on financial wellbeing.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
from sklearn.preprocessing import StandardScaler

This dataset contains the results of a survey on financial wellbeing conducted by the US Consumer Financial Protection Bureau, published in October 2017. It has a large number of columns, most of which correspond to specific questions on the survey. The codebook for translating the column names to questions can be found here:

https://s3.amazonaws.com/files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-codebook.pdf

Load the dataset.

In [2]:
# answer goes here

fin_well = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/financial_wellbeing.csv')

In [3]:
fin_well

Unnamed: 0,PUF_ID,sample,fpl,SWB_1,SWB_2,SWB_3,FWBscore,FWB1_1,FWB1_2,FWB1_3,...,PPMSACAT,PPREG4,PPREG9,PPT01,PPT25,PPT612,PPT1317,PPT18OV,PCTLT200FPL,finalwt
0,10350,2,3,5,5,6,55,3,3,3,...,1,4,8,0,0,0,0,1,0,0.367292
1,7740,1,3,6,6,6,51,2,2,3,...,1,2,3,0,0,0,0,2,0,1.327561
2,13699,1,3,4,3,4,49,3,3,3,...,1,4,9,0,0,0,1,2,1,0.835156
3,7267,1,3,6,6,6,49,3,3,3,...,1,3,7,0,0,0,0,1,0,1.410871
4,7375,1,3,4,4,4,49,3,3,3,...,1,2,4,0,0,1,0,4,1,4.260668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6389,11220,3,3,6,7,7,61,3,3,1,...,1,2,3,0,0,0,1,2,-5,0.522504
6390,13118,3,2,7,7,7,59,3,4,2,...,1,3,6,0,0,0,0,3,-5,1.015219
6391,8709,1,3,5,6,6,59,3,4,3,...,1,1,2,0,0,0,0,2,0,1.136270
6392,8515,1,3,5,5,5,46,2,2,3,...,1,4,9,0,0,0,0,2,0,1.224941


While the survey questions have the potential for interesting cluster analysis, we will stick to the "score" columns to avoid clustering in an unreasonably high-dimensional space. The columns we are interested in all have "score" in their names; identify and isolate these columns. (There should be 4 in total.)

In [12]:
# answer goes here

X = fin_well.filter(regex='score')
X

Unnamed: 0,FWBscore,FSscore,LMscore,KHscore
0,55,44,3,1.267
1,51,43,3,-0.570
2,49,42,3,-0.188
3,49,42,2,-1.485
4,49,42,1,-1.900
...,...,...,...,...
6389,61,47,3,1.267
6390,59,59,1,-1.215
6391,59,51,2,-1.215
6392,46,54,2,-1.215
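
The selection above relies on `DataFrame.filter`, which matches against column names rather than values. As a minimal sketch on a made-up frame (the values and the name `toy` are placeholders, not survey data), `regex='score'` and the simpler `like='score'` pick out the same four columns:

```python
import pandas as pd

# Toy frame; the values are placeholders, not survey data.
toy = pd.DataFrame({
    "PUF_ID": [1, 2],
    "FWBscore": [55, 51],
    "FSscore": [44, 43],
    "LMscore": [3, 3],
    "KHscore": [1.267, -0.570],
    "finalwt": [0.37, 1.33],
})

by_regex = toy.filter(regex="score")  # regex search against column names
by_like = toy.filter(like="score")    # plain substring match on column names

score_cols = list(by_regex.columns)
print(score_cols)
```

Either form works here because no other column name contains the substring "score"; on a frame with less regular names, an anchored pattern such as `regex='score$'` would be stricter.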


Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [13]:
# answer goes here

scaler = StandardScaler()
X_std = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_std.head()

Unnamed: 0,FWBscore,FSscore,LMscore,KHscore
0,-0.073062,-0.530906,0.65383,1.624716
1,-0.355677,-0.60992,0.65383,-0.629626
2,-0.496984,-0.688935,0.65383,-0.160841
3,-0.496984,-0.688935,-0.670399,-1.752502
4,-0.496984,-0.688935,-1.994628,-2.261785
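
To see what the scaler is doing, here is a small sketch on placeholder numbers (the array `raw` is invented for illustration). It checks that `StandardScaler` matches the hand-computed z-score using the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder values on very different scales, not survey data.
raw = np.array([[55.0, 3.0], [51.0, 3.0], [49.0, 2.0], [61.0, 1.0]])

scaled = StandardScaler().fit_transform(raw)

# Same result by hand: subtract the column mean, divide by the
# population standard deviation (ddof=0, NumPy's default).
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)
```

One thing to keep in mind when checking `X_std` with pandas: `DataFrame.std()` defaults to `ddof=1`, so it will report values slightly different from 1 unless `ddof=0` is passed.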


Run mean shift clustering on the scores in the survey dataset using the default bandwidth. Then answer the following by printing or typing as appropriate:

- How many clusters are produced? 
- What are the cluster centers?
- How many responses are assigned to each cluster?
- Are these results reasonable? If not, what changes should we make?

In [35]:
# answer goes here

mean_shift = MeanShift()
labels = mean_shift.fit_predict(X_std)

In [41]:
cluster_df = pd.DataFrame(scaler.inverse_transform(mean_shift.cluster_centers_), columns=X.columns)
print('Number of clusters:', cluster_df.shape[0])

print('clusters:')
print(cluster_df)

labeled_X = X.copy()
labeled_X['labels'] = labels

print('responses per label:')
print(labeled_X['labels'].value_counts())

# Whether these results are reasonable is discussed below.

Number of clusters: 2
clusters:
    FWBscore    FSscore   LMscore   KHscore
0  57.437953  50.121439  2.814341  0.217414
1  -1.000000  -1.000000  0.250000 -1.826000
responses per label:
0    6326
1      68
Name: labels, dtype: int64


In [None]:
# Histogram of the cluster labels to show how unbalanced the clusters are
plt.hist(labeled_X['labels'])

In [37]:
mean_shift.get_params()

{'bandwidth': None,
 'bin_seeding': False,
 'cluster_all': True,
 'max_iter': 300,
 'min_bin_freq': 1,
 'n_jobs': None,
 'seeds': None}

I don't think the results are reasonable. The clusters are ridiculously unbalanced: 6,326 responses fall in one cluster and only 68 in the other. I think I can improve this by changing the bandwidth.
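
For context on what "default bandwidth" means: when `bandwidth=None`, `MeanShift` estimates one with `sklearn.cluster.estimate_bandwidth`, whose default `quantile` is 0.3. The sketch below uses synthetic blobs as a stand-in for `X_std` (the data and the quantile values tried are assumptions for illustration only). It shows that a smaller quantile gives a smaller bandwidth:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for X_std: three well-separated blobs.
points, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 0], [5, 5]],
    cluster_std=0.5,
    random_state=0,
)

results = {}
for quantile in (0.3, 0.1):
    # Bandwidth is derived from nearest-neighbor distances at this quantile.
    bw = estimate_bandwidth(points, quantile=quantile)
    blob_labels = MeanShift(bandwidth=bw).fit_predict(points)
    results[quantile] = (bw, len(np.unique(blob_labels)))
    print(quantile, round(bw, 3), results[quantile][1])
```

On the survey data, the same loop over a few quantiles would be one way to pick a bandwidth less arbitrarily than by trial and error, though the right value still depends on what cluster sizes are considered useful.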

Try changing the appropriate parameters of the mean shift algorithm to achieve a better clustering result. Answer all of the same questions from the previous clustering step.

In [45]:
# answer goes here

mean_shift = MeanShift(bandwidth=1.7)
labels = mean_shift.fit_predict(X_std)

cluster_df = pd.DataFrame(scaler.inverse_transform(mean_shift.cluster_centers_), columns=X.columns)
print('Number of clusters:', cluster_df.shape[0])

print('clusters:')
print(cluster_df)

labeled_X = X.copy()
labeled_X['labels'] = labels

print('responses per label:')
print(labeled_X['labels'].value_counts())

Number of clusters: 4
clusters:
    FWBscore    FSscore   LMscore   KHscore
0  58.027393  50.184094  2.868630  0.295934
1  54.222222  11.333333  0.111111 -1.851222
2  -1.000000  -1.000000  0.250000 -1.826000
3  80.000000  10.000000  3.000000  0.712000
responses per label:
0    5946
1     409
3      20
2      19
Name: labels, dtype: int64
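
One pattern worth noting in both runs is a cluster centered at exactly -1 for `FWBscore` and `FSscore`. That looks more like a sentinel code for non-response than a real score, but this is an assumption that should be checked against the codebook linked above. If it holds, those rows could be dropped before scaling and clustering. A minimal sketch on placeholder rows (the frame `X_toy` and the choice of columns to check are hypothetical):

```python
import pandas as pd

# Toy rows shaped like X; values are placeholders, not survey data.
X_toy = pd.DataFrame({
    "FWBscore": [55, -1, 49, 80],
    "FSscore": [44, -1, 42, 10],
    "LMscore": [3, 0, 3, 3],
    "KHscore": [1.267, -1.826, -0.188, 0.712],
})

# Assumption: negative FWBscore/FSscore values are non-response codes.
# KHscore is left out of the check because it appears to be a scale
# that takes negative values for ordinary responses.
coded_cols = ["FWBscore", "FSscore"]
valid_mask = (X_toy[coded_cols] >= 0).all(axis=1)
X_valid = X_toy[valid_mask]

print(len(X_toy) - len(X_valid), "rows dropped")
```

Removing such rows first would keep the sentinel values from distorting the standardization and from claiming a cluster of their own.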
