## Day 49 Lecture 2 Assignment

In this assignment, we will apply mean shift clustering to a dataset containing the results of a survey on financial wellbeing.

# IMPORTS

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
from sklearn.preprocessing import StandardScaler

This dataset contains the results of a survey on financial wellbeing conducted by the US Consumer Financial Protection Bureau and published in October 2017. The dataset has a large number of columns, most of which correspond to specific questions on the survey. The codebook for translating the column names to questions can be found here:

https://s3.amazonaws.com/files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-codebook.pdf

Load the dataset.

# LOAD THE DATA

In [None]:
# answer goes here
data = 'https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/financial_wellbeing.csv'
df = pd.read_csv(data)

In [None]:
col = list(df.columns)

While the survey questions have the potential for interesting cluster analysis, we will stick to the "score" columns to avoid clustering in an unreasonably high-dimensional space. The columns we are interested in all have "score" in their names; identify and isolate these columns. (There should be 4 in total.)
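One way to isolate these columns is a substring match on the column names. The sketch below uses a small invented frame (the `age_group` column and all values are placeholders, not survey data) and relies on `DataFrame.filter(like=...)`, which keeps labels containing the given substring; the match is case-sensitive.

```python
import pandas as pd

# Toy frame standing in for the survey data; all values are invented
toy = pd.DataFrame({
    "FWBscore": [55, 60],
    "FSscore": [48, 52],
    "age_group": [3, 5],        # hypothetical non-score column
    "KHscore": [-0.57, 0.71],
})

# Keep only the columns whose label contains the substring "score"
score_cols = toy.filter(like="score", axis=1)
print(list(score_cols.columns))
```

Because the match is on a plain substring, it is worth checking the resulting column list against the expected count before moving on.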

# PROCESS THE DATA

In [None]:
# answer goes here
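# Keep only the columns whose names contain "score"; copy the result so that
# adding label columns later does not raise SettingWithCopyWarning
score = df.filter(like='score', axis=1).copy()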
score.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6394 entries, 0 to 6393
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   FWBscore  6394 non-null   int64  
 1   FSscore   6394 non-null   int64  
 2   LMscore   6394 non-null   int64  
 3   KHscore   6394 non-null   float64
dtypes: float64(1), int64(3)
memory usage: 199.9 KB


In [None]:
score.describe()

| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| count | 6394.0 | 6394.0 | 6394.0 | 6394.0 |
| mean | 56.034094 | 50.719112 | 2.506256 | -0.056935 |
| std | 14.154676 | 12.656921 | 0.755215 | 0.814936 |
| min | -4.0 | -1.0 | 0.0 | -2.053 |
| 25% | 48.0 | 42.0 | 2.0 | -0.57 |
| 50% | 56.0 | 50.0 | 3.0 | -0.188 |
| 75% | 65.0 | 57.0 | 3.0 | 0.712 |
| max | 95.0 | 85.0 | 3.0 | 1.267 |


Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.
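Mean shift relies on Euclidean distances, so features on larger scales would otherwise dominate the result. As a quick check of what standardization does, the sketch below applies `StandardScaler` to a small invented array and confirms that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two invented features on very different scales
X = np.array([[50.0, 1.0], [60.0, 2.0], [70.0, 3.0], [40.0, 0.0]])

X_scaled = StandardScaler().fit_transform(X)

# Each column should now have mean ~0 and population standard deviation ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```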

In [None]:
# answer goes here
scale = StandardScaler()
scaled = pd.DataFrame(scale.fit_transform(score), columns=score.columns)

Run mean shift clustering on the scores in the survey dataset using the default bandwidth. Then answer the following by printing or typing as appropriate:

- How many clusters are produced? 
- What are the cluster centers?
- How many responses are assigned to each cluster?
- Are these results reasonable? If not, what changes should we make?
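When no bandwidth is given, scikit-learn estimates one from the data with `estimate_bandwidth`. The sketch below, on two synthetic blobs rather than the survey data, shows one way to inspect that estimate and to read off the number of clusters, the centers, and the cluster sizes from a fitted model.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
# Two tight, well-separated synthetic blobs (not survey data)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.1, size=(50, 2)),
])

# The value MeanShift falls back to when bandwidth=None
default_bw = estimate_bandwidth(X)
print("estimated bandwidth:", default_bw)

ms = MeanShift().fit(X)
labels, counts = np.unique(ms.labels_, return_counts=True)
print("number of clusters:", len(labels))
print("cluster centers:\n", ms.cluster_centers_)
print("points per cluster:", dict(zip(labels.tolist(), counts.tolist())))
```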

# BANDWIDTH DEFAULT

In [None]:
# answer goes here
msh = MeanShift()
msh.fit(scaled)

MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, max_iter=300,
          min_bin_freq=1, n_jobs=None, seeds=None)

In [None]:
# Reuse the labels from the fit above instead of fitting a second time
score['msh_cluster'] = msh.labels_



In [None]:
print('The number of clusters produced is:', len(list(score.msh_cluster.value_counts())))

The number of clusters produced is: 2


In [None]:
msh_clus = pd.DataFrame(msh.cluster_centers_, columns=scaled.columns)
msh_inverse = pd.DataFrame(scale.inverse_transform(msh_clus), columns=scaled.columns)
unscaled = msh_inverse.style.background_gradient()

In [None]:
print('The cluster centers are: \n')
unscaled

The cluster centers are: 



| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| 0 | 57.437953 | 50.121439 | 2.814341 | 0.217414 |
| 1 | -1.0 | -1.0 | 0.25 | -1.826 |
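The centers are reported in original units because the standardized centers were passed through `inverse_transform`. A minimal sketch of that conversion on invented numbers (the array and the "center" are placeholders, not results from the survey):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[50.0, 1.0], [60.0, 2.0], [70.0, 3.0], [40.0, 0.0]])
scaler = StandardScaler().fit(X)

# A made-up center in standardized space: one standard deviation above the
# mean on the first feature, exactly at the mean on the second
center_scaled = np.array([[1.0, 0.0]])

# inverse_transform computes center * scale_ + mean_, i.e. original units
center_original = scaler.inverse_transform(center_scaled)
print(center_original)
```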


In [None]:
print('The cluster values are: \n')
score.msh_cluster.value_counts()

The cluster values are: 



0    6326
1      68
Name: msh_cluster, dtype: int64
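The second cluster holds only 68 responses and its center sits at -1 on both FWBscore and FSscore, which lines up with the negative minimums in the summary statistics. One possible explanation, which is an assumption to verify against the codebook rather than something established here, is that negative values in these two columns are non-response codes rather than real scores. A minimal sketch of how such rows could be counted and set aside, on invented data:

```python
import pandas as pd

# Invented rows; -1 and -4 stand in for suspected non-response codes
toy = pd.DataFrame({
    "FWBscore": [55, -1, 62, -4, 48],
    "FSscore": [50, -1, 57, 44, 41],
    "KHscore": [-0.57, -1.83, 0.71, -0.19, 1.27],  # appears to be signed by design
})

# Flag rows where either score is negative (assumed here to mean "no answer")
suspect = (toy[["FWBscore", "FSscore"]] < 0).any(axis=1)
print("suspect rows:", int(suspect.sum()))

# Keep only the remaining rows for clustering
clean = toy.loc[~suspect].reset_index(drop=True)
print(clean)
```

KHscore is left out of the check because its summary statistics suggest negative values are ordinary there.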

Try changing the appropriate parameters of the mean shift algorithm to achieve a better clustering result. Answer all of the same questions from the previous clustering step.
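Since the main parameter to tune is the bandwidth, one way to compare several values is a small loop that fits once per bandwidth and records the cluster sizes. The sketch below uses synthetic blobs, and the helper name `cluster_sizes` and the candidate bandwidths are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
# Two tight, well-separated synthetic blobs (not survey data)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.1, size=(50, 2)),
])

def cluster_sizes(X, bandwidth):
    """Fit mean shift once and return the number of points in each cluster."""
    labels = MeanShift(bandwidth=bandwidth).fit_predict(X)
    return np.bincount(labels)

# A moderate bandwidth separates the blobs; a very large one merges them
results = {bw: cluster_sizes(X, bw) for bw in (2.0, 50.0)}
for bw, sizes in results.items():
    print(f"bandwidth={bw}: {len(sizes)} clusters, sizes={sizes.tolist()}")
```

The same pattern applied to `scaled` would give the cluster count for each candidate bandwidth in one pass.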

# BANDWIDTH 1

In [None]:
# answer goes here
msh1 = MeanShift(bandwidth=1)
# fit_predict fits the model itself, so a separate fit on the unscaled
# `score` frame is not needed
score['msh_band1'] = msh1.fit_predict(scaled)

print('The number of clusters produced is:', len(list(score.msh_band1.value_counts())))

The number of clusters produced is: 41




In [None]:
msh_clus1 = pd.DataFrame(msh1.cluster_centers_, columns=scaled.columns)
msh_inverse1 = pd.DataFrame(scale.inverse_transform(msh_clus1), columns=scaled.columns)
unscaled1 = msh_inverse1.style.background_gradient()

print('The cluster centers are: \n')
unscaled1

The cluster centers are: 



| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| 0 | 58.72734 | 50.687221 | 3.0 | 0.369296 |
| 1 | 52.039337 | 45.219462 | 2.0 | -0.641718 |
| 2 | 49.16895 | 41.721461 | 1.0 | -1.069096 |
| 3 | 48.155172 | 44.034483 | 0.0 | -1.351586 |
| 4 | 78.6 | 76.8 | 2.0 | -0.552033 |
| 5 | 53.185185 | 74.888889 | 1.0 | -1.331037 |
| 6 | 32.958333 | 73.708333 | 1.0 | -1.431042 |
| 7 | 30.3 | 64.15 | 2.0 | -0.5877 |
| 8 | 77.416667 | 80.333333 | 1.0 | -1.4715 |
| 9 | 86.454545 | 82.636364 | 2.0 | 0.334727 |


In [None]:
print('The cluster values are: \n')
score.msh_band1.value_counts()

The cluster values are: 



0     3215
1     1353
2      441
4      229
9      132
7      130
25     119
16     117
15     100
3       85
5       74
17      63
14      54
6       39
13      30
10      25
11      24
24      19
32      18
8       14
22      13
35      12
36      12
37      11
27      11
34      10
38       6
30       5
40       5
12       5
33       4
29       4
18       3
20       2
31       2
21       2
19       2
23       1
28       1
26       1
39       1
Name: msh_band1, dtype: int64

# BANDWIDTH 5

In [None]:
# answer goes here
msh5 = MeanShift(bandwidth=5)
# fit_predict fits the model itself, so a separate fit on the unscaled
# `score` frame is not needed
score['msh_band5'] = msh5.fit_predict(scaled)

print('The number of clusters produced is:', len(list(score.msh_band5.value_counts())))

The number of clusters produced is: 1




In [None]:
msh_clus5 = pd.DataFrame(msh5.cluster_centers_, columns=scaled.columns)
msh_inverse5 = pd.DataFrame(scale.inverse_transform(msh_clus5), columns=scaled.columns)
unscaled5 = msh_inverse5.style.background_gradient()

print('The cluster centers are: \n')
unscaled5

The cluster centers are: 



| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| 0 | 56.06296 | 50.776821 | 2.509632 | -0.054264 |


In [None]:
print('The cluster values are: \n')
score.msh_band5.value_counts()

The cluster values are: 



0    6394
Name: msh_band5, dtype: int64

# BANDWIDTH 2

In [None]:
# answer goes here
msh2 = MeanShift(bandwidth=2)
# fit_predict fits the model itself, so a separate fit on the unscaled
# `score` frame is not needed
score['msh_band2'] = msh2.fit_predict(scaled)

print('The number of clusters produced is:', len(list(score.msh_band2.value_counts())))

The number of clusters produced is: 2




In [None]:
msh_clus2 = pd.DataFrame(msh2.cluster_centers_, columns=scaled.columns)
msh_inverse2 = pd.DataFrame(scale.inverse_transform(msh_clus2), columns=scaled.columns)
unscaled2 = msh_inverse2.style.background_gradient()

print('The cluster centers are: \n')
unscaled2

The cluster centers are: 



| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| 0 | 57.421547 | 50.105086 | 2.812967 | 0.215943 |
| 1 | -1.0 | -1.0 | 0.25 | -1.826 |


In [None]:
print('The cluster values are: \n')
score.msh_band2.value_counts()

The cluster values are: 



0    6326
1      68
Name: msh_band2, dtype: int64

# BANDWIDTH .5

In [None]:
# answer goes here
msh_5 = MeanShift(bandwidth=.5)
# fit_predict fits the model itself, so a separate fit on the unscaled
# `score` frame is not needed
score['msh_band_5'] = msh_5.fit_predict(scaled)

print('The number of clusters produced is:', len(list(score.msh_band_5.value_counts())))

The number of clusters produced is: 424




In [None]:
msh_clus_5 = pd.DataFrame(msh_5.cluster_centers_, columns=scaled.columns)
msh_inverse_5 = pd.DataFrame(scale.inverse_transform(msh_clus_5), columns=scaled.columns)
unscaled_5 = msh_inverse_5.style.background_gradient()

print('The cluster centers are: \n')
unscaled_5

The cluster centers are: 



| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| 0 | 61.089494 | 53.031128 | 3.0 | 0.712 |
| 1 | 60.455696 | 53.729958 | 3.0 | 0.242 |
| 2 | 55.015873 | 45.031746 | 3.0 | 0.242 |
| 3 | 51.066667 | 43.277778 | 3.0 | -0.315333 |
| 4 | 62.417647 | 54.217647 | 3.0 | 1.267 |
| 5 | 56.818182 | 49.842424 | 3.0 | -0.322279 |
| 6 | 50.892562 | 41.975207 | 2.0 | -0.905331 |
| 7 | 51.990991 | 42.738739 | 2.0 | -0.380721 |
| 8 | 55.918367 | 49.204082 | 2.0 | -0.35951 |
| 9 | 68.443038 | 55.443038 | 3.0 | -0.265367 |


In [None]:
print('The cluster values are: \n')
score.msh_band_5.value_counts()

The cluster values are: 



2      484
0      480
4      342
1      320
3      273
      ... 
389      1
385      1
381      1
377      1
421      1
Name: msh_band_5, Length: 424, dtype: int64

# BANDWIDTH 1.5

In [None]:
# answer goes here
msh1_5 = MeanShift(bandwidth=1.5)
# fit_predict fits the model itself, so a separate fit on the unscaled
# `score` frame is not needed
score['msh_band1_5'] = msh1_5.fit_predict(scaled)

print('The number of clusters produced is:', len(list(score.msh_band1_5.value_counts())))

The number of clusters produced is: 8




In [None]:
msh_clus1_5 = pd.DataFrame(msh1_5.cluster_centers_, columns=scaled.columns)
msh_inverse1_5 = pd.DataFrame(scale.inverse_transform(msh_clus1_5), columns=scaled.columns)
unscaled1_5 = msh_inverse1_5.style.background_gradient()

print('The cluster centers are: \n')
unscaled1_5

The cluster centers are: 



| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| 0 | 58.683315 | 50.565689 | 2.943902 | 0.378775 |
| 1 | 51.626538 | 45.594903 | 1.792619 | -0.705463 |
| 2 | 54.5 | 6.833333 | 0.0 | -1.958333 |
| 3 | -1.0 | -1.0 | 0.25 | -1.826 |
| 4 | 70.0 | 11.0 | 2.5 | -1.599 |
| 5 | 95.0 | 85.0 | 0.0 | -2.053 |
| 6 | 80.0 | 10.0 | 3.0 | 0.712 |
| 7 | 61.0 | 8.0 | 1.0 | -1.485 |


In [None]:
print('The cluster values are: \n')
score.msh_band1_5.value_counts()

The cluster values are: 



0    4096
1    2156
5      50
4      24
2      23
7      22
6      12
3      11
Name: msh_band1_5, dtype: int64

# BANDWIDTH 1.75

In [None]:
# answer goes here
msh1_75 = MeanShift(bandwidth=1.75)
# fit_predict fits the model itself, so a separate fit on the unscaled
# `score` frame is not needed
score['msh_band1_75'] = msh1_75.fit_predict(scaled)

print('The number of clusters produced is:', len(list(score.msh_band1_75.value_counts())))

The number of clusters produced is: 4




In [None]:
msh_clus1_75 = pd.DataFrame(msh1_75.cluster_centers_, columns=scaled.columns)
msh_inverse1_75 = pd.DataFrame(scale.inverse_transform(msh_clus1_75), columns=scaled.columns)
unscaled1_75 = msh_inverse1_75.style.background_gradient()

print('The cluster centers are: \n')
unscaled1_75

The cluster centers are: 



| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| 0 | 57.850949 | 50.041371 | 2.857183 | 0.28281 |
| 1 | 54.222222 | 11.333333 | 0.111111 | -1.851222 |
| 2 | -1.0 | -1.0 | 0.25 | -1.826 |
| 3 | 80.0 | 10.0 | 3.0 | 0.712 |


In [None]:
print('The cluster values are: \n')
score.msh_band1_75.value_counts()

The cluster values are: 



0    5954
1     405
3      18
2      17
Name: msh_band1_75, dtype: int64

# BANDWIDTH 1.25

In [None]:
# answer goes here
msh1_25 = MeanShift(bandwidth=1.25)
# fit_predict fits the model itself, so a separate fit on the unscaled
# `score` frame is not needed
score['msh_band1_25'] = msh1_25.fit_predict(scaled)

print('The number of clusters produced is:', len(list(score.msh_band1_25.value_counts())))

The number of clusters produced is: 23




In [None]:
msh_clus1_25 = pd.DataFrame(msh1_25.cluster_centers_, columns=scaled.columns)
msh_inverse1_25 = pd.DataFrame(scale.inverse_transform(msh_clus1_25), columns=scaled.columns)
unscaled1_25 = msh_inverse1_25.style.background_gradient()

print('The cluster centers are: \n')
unscaled1_25

The cluster centers are: 



| | FWBscore | FSscore | LMscore | KHscore |
|---|---|---|---|---|
| 0 | 58.827638 | 50.748241 | 3.0 | 0.483269 |
| 1 | 52.132022 | 46.110955 | 2.0 | -0.615541 |
| 2 | 49.068323 | 43.332298 | 1.0 | -1.047966 |
| 3 | 48.418919 | 45.243243 | -0.0 | -1.315554 |
| 4 | 40.116279 | 72.186047 | 1.0 | -1.36714 |
| 5 | 70.8 | 78.36 | 1.0 | -1.29024 |
| 6 | 57.4 | 73.3 | 0.0 | -1.3074 |
| 7 | 39.333333 | 77.5 | 0.0 | -1.254 |
| 8 | 68.666667 | 84.0 | 2.0 | 1.267 |
| 9 | 63.333333 | 2.666667 | 0.0 | -2.053 |


In [None]:
print('The cluster values are: \n')
score.msh_band1_25.value_counts()

The cluster values are: 



0     3411
1     1664
2      466
8      310
5      138
4      127
3       96
18      39
22      24
19      19
16      17
13      16
6       15
20      15
17      10
7        6
21       6
15       4
9        3
10       3
11       3
14       1
12       1
Name: msh_band1_25, dtype: int64

# BANDWIDTH CLUSTER COMPARE

In [None]:
print('The number of clusters produced by bandwidth=none is: ', len(list(score.msh_cluster.value_counts())))
print('The number of clusters produced by bandwidth=1 is:    ', len(list(score.msh_band1.value_counts())))
print('The number of clusters produced by bandwidth=5 is:    ', len(list(score.msh_band5.value_counts())))
print('The number of clusters produced by bandwidth=2 is:    ', len(list(score.msh_band2.value_counts())))
print('The number of clusters produced by bandwidth=.5 is:   ', len(list(score.msh_band_5.value_counts())))
print('The number of clusters produced by bandwidth=1.5 is:  ', len(list(score.msh_band1_5.value_counts())))
print('The number of clusters produced by bandwidth=1.75 is: ', len(list(score.msh_band1_75.value_counts())))
print('The number of clusters produced by bandwidth=1.25 is: ', len(list(score.msh_band1_25.value_counts())))

The number of clusters produced by bandwidth=none is:  2
The number of clusters produced by bandwidth=1 is:     41
The number of clusters produced by bandwidth=5 is:     1
The number of clusters produced by bandwidth=2 is:     2
The number of clusters produced by bandwidth=.5 is:    424
The number of clusters produced by bandwidth=1.5 is:   8
The number of clusters produced by bandwidth=1.75 is:  4
The number of clusters produced by bandwidth=1.25 is:  23


# Conclusions


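The ideal bandwidth depends on how many clusters we want. A bandwidth somewhere between 1.25 and 1.5 gives the most usable results here (23 and 8 clusters, respectively), with most responses falling into a few large clusters. At a bandwidth of 2 the result is the same two clusters as the default, and at 5 everything collapses into a single cluster. At 1 and below, the number of clusters becomes unruly (41 at bandwidth 1 and 424 at 0.5).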