# Market Segmentation with Clustering - Lab

## Introduction

In this lab, we'll use our knowledge of clustering to perform market segmentation on a real-world dataset!

## Objectives

You will be able to:

* Identify and explain what Market Segmentation is, and how clustering can be used for segmentation
* Use clustering algorithms to create and interpret a market segmentation on real-world data

## Getting Started

In this lab, we're going to work with the [Wholesale Customers Dataset] from the UCI Machine Learning Repository. This dataset contains wholesale purchasing information from real businesses. These businesses range from small cafes and hotels to grocery stores and other retailers.

Here's the data dictionary for this dataset:

|      Column      |                                               Description                                              |
|:----------------:|:------------------------------------------------------------------------------------------------------:|
|       FRESH      |                    Annual spending on fresh products, such as fruits and vegetables                    |
|       MILK       |                               Annual spending on milk and dairy products                               |
|      GROCERY     |                                   Annual spending on grocery products                                  |
|      FROZEN      |                                   Annual spending on frozen products                                   |
| DETERGENTS_PAPER |                  Annual spending on detergents, cleaning supplies, and paper products                  |
|   DELICATESSEN   |                           Annual spending on meats and delicatessen products                           |
|      CHANNEL     | Type of customer.  1=Hotel/Restaurant/Cafe, 2=Retailer. (This is what we'll use clustering to predict) |
|      REGION      |            Region of Portugal that the customer is located in. (This column will be dropped)           |



One benefit of practicing segmentation on this dataset is that we have the ground-truth labels for the market segment each customer belongs to. For this reason, we'll borrow some methodology from Supervised Learning and store these labels separately, so that we can use them afterwards to check how well our clustering segmentation performed.

Let's get started by importing everything we'll need.

In the cell below:

* Import pandas, numpy, and matplotlib.pyplot, and set the standard alias for each. 
* Use numpy to set a random seed of `0`.
* Set all matplotlib visualizations to appear inline.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
%matplotlib inline

Now, let's load our data and inspect it. You'll find the data stored in `wholesale_customers_data.csv`. 

In the cell below, load the data into a DataFrame and then display the head to ensure everything loaded correctly.

In [4]:
raw_df = pd.read_csv('wholesale_customers_data.csv')
raw_df.head()

Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,2,3,12669,9656,7561,214,2674,1338
1,2,3,7057,9810,9568,1762,3293,1776
2,2,3,6353,8808,7684,2405,3516,7844
3,1,3,13265,1196,4221,6404,507,1788
4,2,3,22615,5410,7198,3915,1777,5185


Now, let's go ahead and store the `'Channel'` column in a separate variable, and then drop both the `'Channel'` and `'Region'` columns. Then, display the head of the new DataFrame to ensure everything worked correctly.

In [5]:
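channels = raw_df["Channel"]
# Note: this drops only 'Channel'; pass columns=["Channel", "Region"] to drop 'Region' as well.
# The outputs recorded below were produced with 'Region' still present.
df = raw_df.drop(columns="Channel")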

In [6]:
df.columns

Index(['Region', 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper',
       'Delicassen'],
      dtype='object')
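The instructions above ask for both `'Channel'` and `'Region'` to be dropped. A minimal sketch of that step, using a small hypothetical DataFrame (the `toy` names and values are made up for illustration and are not the lab's data), might look like this:

```python
import pandas as pd

# Hypothetical toy rows that mimic the lab's column layout
toy = pd.DataFrame({
    "Channel": [2, 1, 2],
    "Region": [3, 3, 1],
    "Fresh": [12669, 13265, 22615],
    "Milk": [9656, 1196, 5410],
})

# Keep the ground-truth labels aside, then drop both non-feature columns
toy_channels = toy["Channel"]
toy_features = toy.drop(columns=["Channel", "Region"])

print(list(toy_features.columns))
```

Passing a list to `columns` removes several columns in one call, so only the spending features remain for clustering.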

Now, let's get right down to it and begin our clustering analysis. 

In the cell below:

* Import `KMeans` from `sklearn.cluster`, and then create an instance of it. Set the number of clusters to `2`
* Fit the cluster object.
* Get the predictions from the clustering algorithm and store them in `cluster_preds`

In [7]:
from sklearn.cluster import KMeans

In [8]:
k_means = KMeans(2).fit(df)

cluster_preds = k_means.predict(df)

Now, let's use some of the metrics we've learned about to check the performance of our segmentation. We'll use `calinski_harabaz_score` (spelled `calinski_harabasz_score` in newer scikit-learn releases) and `adjusted_rand_score`, which can both be found inside `sklearn.metrics.cluster`.

In the cell below, import these scoring functions. 

In [9]:
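# Newer scikit-learn releases spell this calinski_harabasz_score; alias it so the cells below run unchanged
from sklearn.metrics.cluster import calinski_harabasz_score as calinski_harabaz_score, adjusted_rand_score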

Now, let's start with CH Score, to get the variance ratio. 

In [10]:
ch_score = calinski_harabaz_score(df, k_means.labels_)
ch_score

171.6846159379035

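With nothing to compare against yet, this number is hard to interpret on its own: the CH score has no fixed scale, and higher values simply indicate denser, better-separated clusters. We'll need another metric to judge how good these clusters are.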
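To make the "variance ratio" concrete, here is a rough sketch that computes it by hand on synthetic data and compares it with scikit-learn's result. The `variance_ratio` helper and the random data are assumptions made for illustration; they are not part of the lab.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
# Synthetic data: two loose groups in 3 dimensions (not the lab's dataset)
X = np.vstack([rng.normal(0, 1, size=(50, 3)), rng.normal(4, 1, size=(50, 3))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

def variance_ratio(X, labels):
    """Between-cluster dispersion over within-cluster dispersion,
    each divided by its degrees of freedom."""
    n, k = X.shape[0], len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))

manual = variance_ratio(X, labels)
library = calinski_harabasz_score(X, labels)
print(manual, library)
```

Because both dispersion terms are in the units of the features, the score changes when the features are rescaled, which is one reason it is best used for comparing clusterings of the same data.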

Since we have ground-truth labels in this case, we can use the `adjusted_rand_score` to tell us how well the clustering performed. Adjusted Rand Score is meant to compare two clusterings, and it can treat our labels as one of them. This will tell us how similar our predicted clusters are to the actual channels.

Adjusted Rand Score has a maximum of 1. A score close to 1 shows that the two clusterings are almost identical. A score close to 0 means that the predictions are essentially random, while a negative score means that the predictions are pathologically bad, since they agree with the labels less often than random chance would.
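A tiny hand-made example may help with reading these values. The label lists below are hypothetical; the point is that the score depends only on which points are grouped together, not on the names given to the groups.

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1]

# Same grouping as the truth, but with the cluster names swapped
same_groups_renamed = adjusted_rand_score(truth, [1, 1, 0, 0])

# Every true pair is split apart, and every predicted pair mixes the two true groups
split_every_pair = adjusted_rand_score(truth, [0, 1, 0, 1])

print(same_groups_renamed)  # 1.0
print(split_every_pair)     # -0.5
```

This is why it does not matter whether K-Means calls the retailers cluster `0` or cluster `1` when we compare its output with `channels`.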

In the cell below, call `adjusted_rand_score` and pass in our `channels` and `cluster_preds` to see how well our first iteration of clustering did. 

In [11]:
adjusted_rand_score(channels, k_means.labels_)

-0.03060891241109425

According to these results, our clusterings were essentially no better than random chance. Let's see if we can improve this. 

### Scaling Our Dataset

Recall that the results of K-Means Clustering are heavily affected by scaling. Since the clustering algorithm is distance-based, this makes sense. Let's use a `StandardScaler` object to scale our dataset and then try our clustering again and see if the results are different.
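As a rough illustration of why scale matters for distance-based methods, the sketch below uses three hypothetical customers with one large-valued column and one small-valued column. The numbers and the `dist` helper are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: column 0 is in the tens of thousands, column 1 in single digits
X = np.array([[10000.0, 1.0], [10500.0, 9.0], [30000.0, 1.5]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Unscaled: column 0 dominates, so row 0 looks far closer to row 1 than to row 2
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])

# After standardizing, each column has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_01, scaled_02 = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])

print("raw:   ", raw_01, raw_02)
print("scaled:", scaled_01, scaled_02)
```

Before scaling, the large gap in column 1 between rows 0 and 1 is almost invisible; after scaling it contributes on the same footing as the gap in column 0.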

In the cells below:

* Import a [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) object and use it to transform our dataset. 
* Create another K-Means object, fit it to our scaled data, and then use it to predict clusters.
* Calculate the Adjusted Rand Score of our new predictions and our labels. 

In [12]:
from sklearn.preprocessing import StandardScaler

In [15]:
scaler = StandardScaler()
# Calling fit() and then transform() on the same data...

scaler.fit(df)
df_scaled = scaler.transform(df)

# ...gives the same result as a single fit_transform() call
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)



In [32]:
k_means_scaled = KMeans(2).fit(df_scaled)
cluster_preds = k_means_scaled.predict(df_scaled)

In [33]:
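# Note: this scores the new labels against the unscaled df, as in the earlier CH cell.
# Pass df_scaled instead to score the clusters in the space they were fit in.
ch_score = calinski_harabaz_score(df, k_means_scaled.labels_)
ch_score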

132.51135828045622

In [34]:
ars = adjusted_rand_score(channels, k_means_scaled.labels_)
ars

0.19479267120292965

That's a big improvement! Although it's not perfect, we can see that scaling our data had a significant effect on the quality of our clusters. 

## Incorporating PCA

Since clustering algorithms are distance-based, dimensionality has a definite effect on their performance. The greater the dimensionality of the dataset, the greater the total volume of space that our clusters could exist in. Let's try using some Principal Component Analysis to transform our data and see if this affects the performance of our clustering algorithm.

Since you've already seen PCA in a previous section, we won't hold your hand through this section too much.

In the cells below:

* Import [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) from the appropriate module in sklearn
* Create a `PCA` instance and use it to transform our scaled data.
* Investigate the explained variance ratio for each Principal Component. Consider dropping certain components to reduce dimensionality if you feel it is worth the loss of information.
* Create a new `KMeans` object, fit it to our PCA-transformed data, and check the Adjusted Rand Score of the predictions it makes.

**_NOTE:_** Your overall goal here is to get the highest possible Adjusted Rand Score. Don't be afraid to change parameters and rerun things to see how it changes. 
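One possible way to decide how many components to keep is to look at the cumulative explained variance. The sketch below does this on synthetic data; the data, the `threshold` of 0.90, and the variable names are illustrative assumptions, not values prescribed by the lab.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a feature matrix: 6 columns built from 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components, then accumulate the explained variance ratios
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.90  # illustrative cutoff only
# Smallest number of components whose cumulative ratio reaches the threshold
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

X_reduced = PCA(n_components=n_keep).fit_transform(X_scaled)
print(n_keep, X_reduced.shape)
```

A variance threshold is only a starting point here: since the goal is the highest Adjusted Rand Score, it is still worth trying several component counts and comparing the scores directly.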

In [35]:
from sklearn.decomposition import PCA

In [69]:
pca = PCA(n_components=4)
pca_df = pca.fit_transform(df_scaled)

In [70]:
pca.components_

array([[ 2.05690181e-02,  4.36834460e-02,  5.45171062e-01,
         5.78949643e-01,  5.12486019e-02,  5.48207659e-01,
         2.49231700e-01],
       [ 5.83252985e-02,  5.28766954e-01,  8.23342670e-02,
        -1.47305448e-01,  6.07641595e-01, -2.56372374e-01,
         5.03558080e-01],
       [-9.86181604e-01, -6.88203033e-02,  7.07125304e-04,
         1.05440202e-02,  1.49286109e-01,  1.01145237e-02,
         1.44667411e-02],
       [ 8.47401397e-02, -8.04569548e-01,  6.27275168e-02,
        -1.11676621e-01,  1.54245575e-01, -1.40828673e-01,
         5.34280947e-01]])

In [58]:
# The dot product of two principal components is (numerically) 0,
# which shows that the vectors are perpendicular to each other.

# The first component stays the same when you add more components.
np.dot(pca.components_[0], pca.components_[1])


-8.326672684688674e-17
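The same check can be run for every pair of components at once. The following sketch uses synthetic data and assumes the `"full"` SVD solver so that the comparison between fits is deterministic; none of these names come from the lab.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # synthetic data, not the lab's dataset

pca_3 = PCA(n_components=3, svd_solver="full").fit(X)
pca_4 = PCA(n_components=4, svd_solver="full").fit(X)

# Rows of components_ are unit-length and mutually perpendicular,
# so multiplying by the transpose should give the identity matrix
gram = pca_4.components_ @ pca_4.components_.T
is_orthonormal = np.allclose(gram, np.eye(4))

# Asking for one more component leaves the earlier ones unchanged (compared up to sign)
earlier_unchanged = np.allclose(
    np.abs(pca_3.components_), np.abs(pca_4.components_[:3])
)
print(is_orthonormal, earlier_unchanged)
```

The off-diagonal entries of `gram` are the pairwise dot products, so values like `-8.3e-17` in the cell above are just floating-point noise around zero.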

In [59]:
print(pca.explained_variance_ratio_)
print(sum(pca.explained_variance_ratio_))

[0.37795265 0.24356898 0.14375576 0.10543601]
0.8707134131129592


In [60]:
pca_k_means = KMeans(n_clusters=2)
pca_k_means.fit(pca_df)
pca_preds = pca_k_means.predict(pca_df)
adjusted_rand_score(channels, pca_preds)

0.19479267120292965

In [61]:
for n in range(1,7):
    pca = PCA(n_components=n)
    pca_df = pca.fit_transform(df_scaled)
    print(n)
    print( "new_component: ", pca.components_[n-1])
    print("explained variance ratios: ", pca.explained_variance_ratio_)
    print("sum of explained variance: ", sum(pca.explained_variance_ratio_))
    pca_k_means = KMeans(n_clusters=2)
    pca_k_means.fit(pca_df)
    pca_preds = pca_k_means.predict(pca_df)
    print("adj rand score: ", adjusted_rand_score(channels, pca_preds))
    print("\n")
    
    

1
new_component:  [0.02056902 0.04368345 0.54517106 0.57894964 0.0512486  0.54820766
 0.2492317 ]
explained variance ratios:  [0.37795265]
sum of explained variance:  0.37795265472907175
adj rand score:  0.1313835844382013


2
new_component:  [ 0.0583253   0.52876695  0.08233427 -0.14730545  0.6076416  -0.25637237
  0.50355808]
explained variance ratios:  [0.37795265 0.24356898]
sum of explained variance:  0.621521638486782
adj rand score:  0.23084287036169227


3
new_component:  [-9.86181604e-01 -6.88203033e-02  7.07125304e-04  1.05440202e-02
  1.49286109e-01  1.01145237e-02  1.44667411e-02]
explained variance ratios:  [0.37795265 0.24356898 0.14375576]
sum of explained variance:  0.7652774018189819
adj rand score:  0.1879196366549076


4
new_component:  [ 0.08474014 -0.80456955  0.06272752 -0.11167662  0.15424557 -0.14082867
  0.53428095]
explained variance ratios:  [0.37795265 0.24356898 0.14375576 0.10543601]
sum of explained variance:  0.8707134131129592
adj rand score:  0.1947926

**_Question_**:  What was the highest Adjusted Rand Score you achieved? Interpret this score, and determine the overall quality of the clustering. Did PCA affect the performance overall? How many Principal Components resulted in the best overall clustering performance? Why do you think this is?

Write your answer below this line:
_______________________________________________________________________________________________________________________________

The highest ARS in the results shown above is ~0.23, which suggests that the clusters are better than random chance, but far from perfect. Overall, the clustering did a lot better than the first model we ran on unscaled data (ARS of about -0.03). The best performance was achieved when reducing the data down to 2 Principal Components. The increase in performance is likely due to the reduction in dimensionality. Although keeping only 2 PCs means that we retain only about 62% of the explained variance, this proved to be a net-positive tradeoff: the ARS rose from ~0.19 on the scaled data to ~0.23.

## Optional Step: Hierarchical Agglomerative Clustering

Now that we've tried doing market segmentation with K-Means Clustering, let's end this lab by trying with HAC!

In the cells below, use [Agglomerative Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) to make cluster predictions on the datasets we've created, and see how HAC's performance compares to K-Means' performance.

**_NOTE_**: Don't just try HAC on the PCA-transformed dataset--also compare algorithm performance on the scaled and unscaled datasets, as well!
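One way to organize this comparison is to keep each version of the features in a dictionary and score them in a loop. The sketch below uses well-separated synthetic blobs in place of the lab's data, so the variable names and the perfect scores it produces are illustrative only.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two well-separated groups in 4 dimensions
X_raw, y_true = make_blobs(
    n_samples=200,
    centers=[[-10, -10, -10, -10], [10, 10, 10, 10]],
    cluster_std=1.0,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X_raw)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

variants = {"unscaled": X_raw, "scaled": X_scaled, "pca": X_pca}

hac_scores = {}
for name, features in variants.items():
    # Default linkage is Ward; fit_predict returns the cluster label of each row
    preds = AgglomerativeClustering(n_clusters=2).fit_predict(features)
    hac_scores[name] = adjusted_rand_score(y_true, preds)

print(hac_scores)
```

On real data the three scores will usually differ, which is exactly what this comparison is meant to reveal.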

In [62]:
from sklearn.cluster import AgglomerativeClustering
agg_clust = AgglomerativeClustering(n_clusters=2)

In [63]:
from scipy.cluster.hierarchy import dendrogram, ward


In [64]:
# # use the ward() function
# linkage_array = ward(df)

# # Now we plot the dendrogram for the linkage_array containing the distances
# # between clusters
# dendrogram(linkage_array)

# ax = plt.gca()
# bounds = ax.get_xbound()
# ax.plot(bounds, [16, 16], '--', c='k')
# ax.plot(bounds, [9, 9], '--', c='k')
# # ax.text(bounds[1], 16, ' 2 clusters', va='center', fontdict={'size': 12})
# # ax.text(bounds[1], 9, ' 3 clusters', va='center', fontdict={'size': 12})
# plt.xlabel("Data index")
# plt.ylabel("Cluster distance")

In [74]:
hac = AgglomerativeClustering(n_clusters=2)
# pca_df here holds 4 principal components
hac.fit(pca_df)
hac_pca_preds = hac.labels_
adjusted_rand_score(channels, hac_pca_preds)

0.04822381910875346

In [73]:
hac2 = AgglomerativeClustering(n_clusters=2)
hac2.fit(df_scaled)
hac_scaled_preds = hac2.labels_
adjusted_rand_score(channels, hac_scaled_preds)

0.3120323665576148

In [72]:
# Same check on scaled_df, the fit_transform() result; the score matches
hac2 = AgglomerativeClustering(n_clusters=2)
hac2.fit(scaled_df)
hac_scaled_preds = hac2.labels_
adjusted_rand_score(channels, hac_scaled_preds)

0.3120323665576148

In [55]:
hac3 = AgglomerativeClustering(n_clusters=2)
hac3.fit(df)
hac_raw_preds = hac3.labels_
adjusted_rand_score(channels, hac_raw_preds)

-0.01923156414375716

## Summary

In this lab, we used our knowledge of clustering to perform a market segmentation on a real-world dataset. We started with a cluster analysis with poor performance, and then implemented some changes to iteratively improve the performance of the clustering analysis!