## Day 49 Lecture 2 Assignment

In this assignment, we will apply mean shift clustering to a dataset containing the results of a survey on financial wellbeing.

In [11]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import StandardScaler

This dataset contains the results of a survey on financial wellbeing conducted by the US Consumer Financial Protection Bureau and published in October 2017. The dataset has a large number of columns, most of which correspond to specific survey questions. The codebook for translating the column names to questions can be found here:

https://s3.amazonaws.com/files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-codebook.pdf

Load the dataset.

In [20]:
data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/financial_wellbeing.csv')
data

Unnamed: 0,PUF_ID,sample,fpl,SWB_1,SWB_2,SWB_3,FWBscore,FWB1_1,FWB1_2,FWB1_3,...,PPMSACAT,PPREG4,PPREG9,PPT01,PPT25,PPT612,PPT1317,PPT18OV,PCTLT200FPL,finalwt
0,10350,2,3,5,5,6,55,3,3,3,...,1,4,8,0,0,0,0,1,0,0.367292
1,7740,1,3,6,6,6,51,2,2,3,...,1,2,3,0,0,0,0,2,0,1.327561
2,13699,1,3,4,3,4,49,3,3,3,...,1,4,9,0,0,0,1,2,1,0.835156
3,7267,1,3,6,6,6,49,3,3,3,...,1,3,7,0,0,0,0,1,0,1.410871
4,7375,1,3,4,4,4,49,3,3,3,...,1,2,4,0,0,1,0,4,1,4.260668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6389,11220,3,3,6,7,7,61,3,3,1,...,1,2,3,0,0,0,1,2,-5,0.522504
6390,13118,3,2,7,7,7,59,3,4,2,...,1,3,6,0,0,0,0,3,-5,1.015219
6391,8709,1,3,5,6,6,59,3,4,3,...,1,1,2,0,0,0,0,2,0,1.136270
6392,8515,1,3,5,5,5,46,2,2,3,...,1,4,9,0,0,0,0,2,0,1.224941


While the survey questions have the potential for interesting cluster analysis, we will stick to the "score" columns to avoid clustering in an unreasonably high-dimensional space. The columns we are interested in all have "score" in their names; identify and isolate these columns. (There should be 4 in total.)

In [21]:
score_cols = [i for i in data.columns if 'score' in i]
score_data_df = data[score_cols]
score_data_df

Unnamed: 0,FWBscore,FSscore,LMscore,KHscore
0,55,44,3,1.267
1,51,43,3,-0.570
2,49,42,3,-0.188
3,49,42,2,-1.485
4,49,42,1,-1.900
...,...,...,...,...
6389,61,47,3,1.267
6390,59,59,1,-1.215
6391,59,51,2,-1.215
6392,46,54,2,-1.215
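
As an aside, the same column selection can also be written with pandas' `DataFrame.filter`. The sketch below runs on a small toy frame; the values are made up for illustration, and only the column names mirror the survey data.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the survey data.
toy = pd.DataFrame({
    "PUF_ID": [1, 2],
    "FWBscore": [55, 51],
    "FSscore": [44, 43],
    "LMscore": [3, 3],
    "KHscore": [1.267, -0.570],
    "finalwt": [0.37, 1.33],
})

# filter(like=...) keeps the columns whose name contains the given substring.
toy_scores = toy.filter(like="score")
score_cols = toy_scores.columns.tolist()
print(score_cols)
```

Either form yields the same columns; the list comprehension is more explicit, while `filter` is a little shorter.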


Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [8]:
scaler = StandardScaler()
score_data = scaler.fit_transform(score_data_df)
score_data

array([[-0.07306245, -0.53090616,  0.65383032,  1.62471561],
       [-0.35567668, -0.6099205 ,  0.65383032, -0.62962645],
       [-0.4969838 , -0.68893483,  0.65383032, -0.16084111],
       ...,
       [ 0.20955178,  0.02219421, -0.67039872, -1.42116191],
       [-0.70894447,  0.25923722, -0.67039872, -1.42116191],
       [-0.4969838 , -0.68893483, -0.67039872, -1.42116191]])
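
A quick sanity check after scaling is to confirm that each column really has mean 0 and standard deviation 1. The sketch below does this on synthetic data (the array `X` is an assumption standing in for the score columns, not the survey data itself).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the score columns (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(loc=[50.0, 3.0], scale=[14.0, 0.8], size=(500, 2))

X_scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std().
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```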

Run mean shift clustering on the scores in the survey dataset using the default bandwidth. Then answer the following by printing or typing as appropriate:

- How many clusters are produced? 
- What are the cluster centers?
- How many responses are assigned to each cluster?
- Are these results reasonable? If not, what changes should we make?

In [12]:
bw = estimate_bandwidth(score_data)
bw

1.9962824784211841

In [13]:
means = MeanShift(bandwidth=bw)
means.fit(score_data)

MeanShift(bandwidth=1.9962824784211841, bin_seeding=False, cluster_all=True,
          max_iter=300, min_bin_freq=1, n_jobs=None, seeds=None)

In [14]:
labels = means.labels_
centers = means.cluster_centers_
centers

array([[ 0.09918758, -0.04722472,  0.40797515,  0.33667793],
       [-4.02966165, -4.08655136, -2.98779955, -2.17097301]])
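
The cluster centers above are expressed in standardized units, which makes them hard to read as scores. One way to interpret them is to map them back to the original scale with the fitted scaler's `inverse_transform`. The sketch below illustrates the idea on two synthetic, well-separated groups; the data, the group locations, and the bandwidth of 1.0 are all assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.preprocessing import StandardScaler

# Two synthetic groups (hypothetical values, not the survey data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[55.0, 50.0], scale=2.0, size=(200, 2))
group_b = rng.normal(loc=[20.0, 15.0], scale=2.0, size=(50, 2))
X = np.vstack([group_a, group_b])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Bandwidth picked by hand for this toy data.
ms = MeanShift(bandwidth=1.0).fit(X_scaled)

# Map the centers from standardized units back to the original units.
centers_original = scaler.inverse_transform(ms.cluster_centers_)
sizes = np.bincount(ms.labels_)  # number of points assigned to each cluster

print(centers_original.round(1))
print(sizes)
```

In the notebook, the analogous call would be `scaler.inverse_transform(centers)`, using the `scaler` fitted earlier.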

In [17]:
labels.shape

(6394,)

In [22]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
score_data_df = score_data_df.assign(label=labels)
score_data_df['label'].value_counts()


0    6326
1      68
Name: label, dtype: int64

In [23]:
score_data_df

Unnamed: 0,FWBscore,FSscore,LMscore,KHscore,label
0,55,44,3,1.267,0
1,51,43,3,-0.570,0
2,49,42,3,-0.188,0
3,49,42,2,-1.485,0
4,49,42,1,-1.900,0
...,...,...,...,...,...
6389,61,47,3,1.267,0
6390,59,59,1,-1.215,0
6391,59,51,2,-1.215,0
6392,46,54,2,-1.215,0


In [26]:
score_data_df.loc[score_data_df['label']==1]

Unnamed: 0,FWBscore,FSscore,LMscore,KHscore,label
148,39,16,1,-0.909,1
164,14,10,2,-1.215,1
278,39,-1,0,-2.053,1
305,25,28,1,-1.215,1
380,14,5,3,0.242,1
...,...,...,...,...,...
6275,26,35,1,-0.909,1
6278,22,31,1,-0.909,1
6326,36,31,1,-1.485,1
6363,24,24,1,-1.713,1


Try changing the appropriate parameters of the mean shift algorithm to achieve a better clustering result. Answer all of the same questions from the previous clustering step.

In [5]:
# answer goes here
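
The default-bandwidth run placed 6326 of the 6394 responses in a single cluster, which suggests the bandwidth is too large to separate meaningful groups. One possible approach, sketched below, is to sweep the `quantile` argument of `estimate_bandwidth` (smaller quantiles give smaller bandwidths) and compare the number and size of the resulting clusters. The sketch uses synthetic blobs from `make_blobs` as a stand-in for `score_data`, and the quantile grid is an arbitrary assumption, not a recommendation.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for score_data (hypothetical, three blobs in 2-D).
X, _ = make_blobs(
    n_samples=600,
    centers=[[-5, -5], [0, 0], [5, 5]],
    cluster_std=0.6,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

results = {}
for quantile in (0.1, 0.2, 0.3, 0.5):
    # Smaller quantile -> smaller bandwidth -> typically more, tighter clusters.
    bw = estimate_bandwidth(X_scaled, quantile=quantile, random_state=0)
    # bin_seeding=True seeds from a coarse grid, which speeds up fitting.
    ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(X_scaled)
    sizes = np.bincount(ms.labels_)
    results[quantile] = (bw, len(ms.cluster_centers_), sizes.tolist())
    print(f"quantile={quantile}: bandwidth={bw:.3f}, "
          f"clusters={len(ms.cluster_centers_)}, sizes={sizes.tolist()}")
```

On the survey data, the same loop could be run with `score_data` in place of `X_scaled`, after which the earlier questions (number of clusters, centers, and cluster sizes) can be answered for whichever bandwidth gives the most interpretable grouping.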



