# **Customer Segmentation - K-means & TMAP Clustering**

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

------

## Overview

<img src="https://conversionxl.com/wp-content/uploads/2016/09/segmentation-illustration.png" width="500" height="500" align="center"/>

Image source: https://conversionxl.com/wp-content/uploads/2016/09/segmentation-illustration.png

Dataset source: *Abreu, N. (2011). Analise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Mestrado em Marketing, ISCTE-IUL, Lisbon* 

The goal of this demo is to cluster the customers of a Portuguese wholesale company into segments. Note that clustering is an unsupervised learning task: no labels are given. The fields are described below (m.u. stands for monetary unit): 


| Feature name     | Variable Type | Description 
|------------------|---------------|--------------------------------------------------------
| FRESH            | Continuous    | annual spending (m.u.) on fresh products   
| MILK             | Continuous    | annual spending (m.u.) on milk products  
| GROCERY          | Continuous    | annual spending (m.u.) on grocery products  
| FROZEN           | Continuous    | annual spending (m.u.) on frozen products   
| DETERGENTS_PAPER | Continuous    | annual spending (m.u.) on detergents and paper products  
| DELICATESSEN     | Continuous    | annual spending (m.u.) on delicatessen products    
| CHANNEL          | Categorical   | customer's channel, where 1 = HoReCa (Hotel/Restaurant/Cafe); 2 = Retail channel
| REGION           | Categorical   | customer's region, where 1 = Lisbon; 2 = Porto; 3 = Other  

------

## **Part 0**: Setup

In [None]:
# Put all import statements at the top of your notebook

# Standard imports
import pandas as pd
import numpy  as np
from time import time

# Clustering imports
from sklearn               import preprocessing
from sklearn               import datasets
from sklearn.cluster       import KMeans
from sklearn.metrics       import silhouette_samples, silhouette_score
from sklearn.decomposition import PCA
from sklearn.manifold      import TSNE
import tmap                as tm

# Visualization packages
from bokeh.models         import HoverTool
from bokeh.plotting       import output_notebook, figure, show, ColumnDataSource
import bokeh.plotting     as bp
import matplotlib.pyplot  as plt
import matplotlib.cm      as cm
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker    import NullFormatter
from faerun               import Faerun

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline


In [None]:
# Set a seed for replication
SEED = 10

## **Part 1**: Data Loading and EDA

In [None]:
# Load data
data = pd.read_csv('customer_data.csv')

In [None]:
# Features
data.columns

In [None]:
# Feature types
data.dtypes

In [None]:
# Top rows
data.head()

In [None]:
# Key statistics
data.describe()

## **Part 2**: Data Preprocessing

There are several preprocessing steps before doing the clustering:

* null values
* categorical variables
* standardization

First, let's check for duplicate and missing values:

In [None]:
# Number of observations and features
data.shape

In [None]:
# Drop possible duplicate rows
data.drop_duplicates(inplace=True)
data.shape

In [None]:
# Drop possible missing values
data.dropna(inplace=True)
data.shape
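
Before dropping rows, it can help to count how many duplicates and missing values there actually are. The following is a minimal sketch on a hypothetical toy frame (the values and the name `toy` are made up for illustration, not taken from `customer_data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 duplicates row 0, row 2 has a missing value
toy = pd.DataFrame({
    'Fresh': [100.0, 100.0, 250.0, 80.0],
    'Milk':  [20.0, 20.0, np.nan, 35.0],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
n_missing = toy.isna().sum()                # missing values per column

print('duplicate rows:', n_duplicates)
print(n_missing)

# Same two cleaning steps as above, without modifying `toy` in place
toy_clean = toy.drop_duplicates().dropna()
print(toy_clean.shape)
```

If both counts are zero on the real data, the two `drop` calls above leave the shape unchanged.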

We should also convert the categorical features into a one-hot encoding: 

In [None]:
cols_to_transform = [ 'Channel', 'Region']
data_with_dummies = pd.get_dummies(data = data, columns = cols_to_transform )

In [None]:
data_with_dummies.head()
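
To see what one-hot encoding does in isolation, here is a small sketch on a hypothetical three-row frame (the values are invented for illustration). Each categorical column is replaced by one indicator column per category:

```python
import pandas as pd

# Hypothetical toy data with the same two categorical columns
toy = pd.DataFrame({
    'Fresh':   [100, 250, 80],
    'Channel': [1, 2, 1],
    'Region':  [1, 3, 2],
})

toy_dummies = pd.get_dummies(data=toy, columns=['Channel', 'Region'])

# One indicator column per observed category, named <column>_<value>
print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Every row has exactly one active indicator per original categorical column, so the indicator columns of each group sum to 1.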

Next, we standardize the data so that each feature has zero mean and unit standard deviation: 

In [None]:
X_scaled = preprocessing.scale(data_with_dummies.values)
type(X_scaled)

In [None]:
# Convert standardized data from ndarray to data frame
data_with_dummies_ready = pd.DataFrame(X_scaled, columns = data_with_dummies.columns)
type(data_with_dummies_ready)

In [None]:
data_with_dummies_ready.head()

In [None]:
data_with_dummies_ready.describe()
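
As a sanity check on what standardization computes, the sketch below compares `preprocessing.scale` with the formula (x - mean) / std applied column by column. The data are random numbers generated for illustration only:

```python
import numpy as np
from sklearn import preprocessing

# Illustrative random data: two features on very different scales
rng = np.random.default_rng(0)
toy = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(50, 2))

# Manual standardization with the population standard deviation (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)
scaled = preprocessing.scale(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Note that pandas' `describe()` reports the sample standard deviation (ddof=1), so the `std` row shown above is slightly larger than 1 even though the data were scaled to unit population standard deviation.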

## **Part 3**: K-means Clustering

In [None]:
# Clustering data into 3 clusters
kmeans = KMeans(n_clusters=3, random_state=SEED).fit(data_with_dummies_ready)

In [None]:
kmeans.cluster_centers_

In [None]:
kmeans.labels_

However, we should pay attention to the following:

  * __initial centroids__: k-means is sensitive to its starting centroids. `KMeans` in sklearn can run the algorithm several times with different centroid initializations (the `n_init` parameter) and reports the best result in terms of the optimization score (inertia).  
  
  * __number of clusters__: to select the best number of clusters, one strategy is to maximize the "Silhouette" coefficient.  
  
    The Silhouette coefficient of a data point is calculated from the mean distance to all other points in the same cluster (a) and the mean distance to all points in the nearest other cluster (b): it equals (b - a) / max(a, b). The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar. We can use the mean Silhouette coefficient over all samples as a rough metric of clustering quality.  
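
Regarding the first point (initial centroids), the sketch below compares a single random initialization with ten of them on synthetic blobs. The dataset and parameter values are assumptions chosen for illustration. Depending on the sklearn version, the default value of `n_init` differs, so it is set explicitly here:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration only
X_toy, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.5, random_state=0)

# One random initialization versus the best of ten
single = KMeans(n_clusters=5, init='random', n_init=1, random_state=0).fit(X_toy)
multi = KMeans(n_clusters=5, init='random', n_init=10, random_state=0).fit(X_toy)

# inertia_ is the within-cluster sum of squared distances (lower is better)
print('n_init=1 :', single.inertia_)
print('n_init=10:', multi.inertia_)
```

With the same `random_state`, the ten-start fit includes the single-start run among its candidates, so its inertia should be no higher.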


Therefore, we vary the number of clusters from 2 to 20 and choose the one with the best Silhouette score. For each number of clusters, we plot the Silhouette score of each data point and project the data into 2 dimensions using PCA to visualize the detected clusters.
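
To make the definition of the Silhouette coefficient concrete, here is a minimal sketch that computes (b - a) / max(a, b) by hand and compares it with `silhouette_samples`. The helper `silhouette_by_hand` and the synthetic data are hypothetical, written only for this illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

# Synthetic data for illustration only
X_toy, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)
D = pairwise_distances(X_toy)  # Euclidean distance matrix

def silhouette_by_hand(D, labels):
    """Per-sample silhouette from a precomputed distance matrix."""
    scores = np.zeros(len(labels))
    for i, own in enumerate(labels):
        same = labels == own
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to the other points of the same cluster
        # (the point's distance to itself is 0, so divide by n_same - 1)
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(D[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores[i] = (b - a) / max(a, b)
    return scores

manual = silhouette_by_hand(D, labels)
reference = silhouette_samples(X_toy, labels)
print(np.allclose(manual, reference))
print('mean silhouette:', manual.mean())
```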

In [None]:
X = data_with_dummies_ready
range_n_clusters = range(2,21)

# For each candidate number of clusters, do the following
for n_clusters in range_n_clusters:
    
    # Define a subplot with one row and two columns
    fig, (ax1,ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    ax1.set_xlim([-1, 1])
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
    
    # Obtain cluster labels for n_clusters
    clusterer = KMeans(n_clusters=n_clusters, random_state=SEED)
    cluster_labels = clusterer.fit_predict(X)
    
    # Calculate average silhouette score 
    silhouette_avg = silhouette_score(X, cluster_labels)

    # Calculate silhouette score for each data point
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    
    # Visualize silhouette score of each data point as well as average silhouette score
    y_lower = 10
    for i in range(n_clusters):
       
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10  

    ax1.set_title('The silhouette plot for the various clusters.')
    ax1.set_xlabel('The silhouette coefficient values')
    ax1.set_ylabel('Cluster label')
    ax1.axvline(x=silhouette_avg, color='red', linestyle='--')
    ax1.set_yticks([])  
    ax1.set_xticks([-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1])
    
    # Project data into two dimensions using PCA, with a different color for each cluster
    pca = PCA(n_components=2, svd_solver='full')
    XX = pca.fit(X).transform(X)
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(XX[:, 0], XX[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')
    centers = pca.transform(clusterer.cluster_centers_)
   
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')

    ax2.set_title('The visualization of the clustered data.')
    ax2.set_xlabel('1st principal component')
    ax2.set_ylabel('2nd principal component')
    
    plt.suptitle(('\n Silhouette analysis for KMeans clustering on sample data '
                  'with n_clusters = %d (Avg score: %f)' % (n_clusters, silhouette_avg)), fontsize=14, fontweight='bold')

    plt.show()

In [None]:
# Calculate the silhouette score with different number of clusters and different random seeds
range_seeds = range(0,10)
range_n_clusters = range(2,21)
list_all = []
for seed in range_seeds:
    # Silhouette scores for this seed, one entry per number of clusters
    list_cluster = []
    for n_clusters in range_n_clusters:
        
        clusterer = KMeans(n_clusters=n_clusters, random_state=seed)
        cluster_labels = clusterer.fit_predict(data_with_dummies_ready)
        silhouette_avg = silhouette_score(data_with_dummies_ready, cluster_labels)
        
        list_cluster.append(silhouette_avg)
        
    list_all.append(list_cluster)


In [None]:
# Visualize silhouette score for different number of clusters and different random seeds
plt.xlabel('number of clusters')
plt.ylabel('silhouette score')
plt.grid(True)
plt.xticks(np.arange(min(range_n_clusters),max(range_n_clusters)+1,1.0))
gather_list_all = []

for i in range_seeds:
    gather_list_all.append(list_all[i])

results_avg = [float(sum(col))/len(col) for col in zip(*gather_list_all)]

plt.plot(range_n_clusters, results_avg)
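
The same seed-averaging idea can be written compactly with a NumPy array of shape (seeds, cluster counts), after which the best number of clusters is the `argmax` of the column means. This is a sketch on synthetic blobs with a reduced range of seeds and cluster counts (all values are assumptions for illustration, not the notebook's results):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only
X_toy, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.8, random_state=0)
seeds = range(3)
ks = list(range(2, 7))

# scores[s, j] = mean silhouette for seed s and ks[j] clusters
scores = np.array([
    [silhouette_score(
        X_toy,
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_toy))
     for k in ks]
    for seed in seeds
])

mean_scores = scores.mean(axis=0)           # average over seeds
best_k = ks[int(np.argmax(mean_scores))]    # k with the highest average score

print(dict(zip(ks, mean_scores.round(3))))
print('best k:', best_k)
```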

In [None]:
# Based on the averaged silhouette scores, we choose 9 clusters

clusterer = KMeans(n_clusters=9, random_state=SEED)
cluster_labels = clusterer.fit_predict(data_with_dummies_ready)
cluster_labels

## **SUMMARY OF SILHOUETTE SCORES**

In [None]:
width    = 35
clusters = [str(i) for i in range_n_clusters]
results  = results_avg
print('', '=' * width, '\n', 'Summary of Silhouette Scores'.center(width), '\n', '=' * width)  
for i in range(len(clusters)):
    print('K = {}'.format(clusters[i]).center(width-12), '{0:.4f}'.format(results[i]))

## **Part 4**: TMAP 

TMAP is a very fast visualization library for large, high-dimensional data sets. Using TMAP, we generate an interactive visualization to explore the 9 clusters we found earlier. For details on TMAP, see http://tmap.gdb.tools/index.html

Here is an illustration of the 4 conceptual steps that make up TMAP.

<img src="https://tmap.readthedocs.io/en/latest/_images/basic_pipelines.jpg" width="500" height="500" align="center"/>

Image source: https://tmap.readthedocs.io/en/latest/_images/basic_pipelines.jpg
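
As we understand the TMAP documentation, the pipeline roughly consists of indexing the data in an LSH forest, building an approximate k-nearest-neighbor graph, computing a minimum spanning tree of that graph, and laying the tree out in 2D. The sketch below illustrates only the middle two steps, and it swaps in an exact k-nearest-neighbor search from sklearn and SciPy's spanning-tree routine instead of TMAP's own implementation. The data and the choice of 10 neighbors are assumptions for illustration:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from sklearn.neighbors import kneighbors_graph

# Illustrative random data: 100 points in 6 dimensions
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 6))

# Exact kNN graph (TMAP uses an approximate, LSH-based search instead)
knn = kneighbors_graph(X_toy, n_neighbors=10, mode='distance')

# Minimum spanning tree of the kNN graph; if the graph is disconnected,
# the result is a spanning forest with one tree per connected component
mst = minimum_spanning_tree(knn)
n_components, _ = connected_components(knn, directed=False)

edges_from, edges_to = mst.nonzero()
print('tree edges:', len(edges_from), '| components:', n_components)
```

The arrays `edges_from` and `edges_to` play the same role as the `s` and `t` arrays returned by `tm.layout_from_lsh_forest` in the cells below: they list the endpoints of the tree edges that get drawn.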

In [None]:
# Set up MinHash encoding structure
# NB: Minhash might not run on all machines - instead, inspect the exported .html file in this directory
dims = data_with_dummies_ready.shape[0] * data_with_dummies_ready.shape[1]
enc = tm.Minhash(dims)

# Locality-sensitive hashing (LSH) speeds up k nearest neighbor search
lf = tm.LSHForest(dims, 128)
lf.batch_add(enc.batch_from_weight_array(data_with_dummies_ready.values))
lf.index()

In [None]:
# Configuration for the tmap layout
CFG = tm.LayoutConfiguration()

# Create labels that appear on hover
labels = ['customer{}_channel{}_region{}'.format(i, 
                                                 data.iloc[i,0], 
                                                 data.iloc[i,1]
                                                ) for i in list(data_with_dummies_ready.index)]

x, y, s, t, _ = tm.layout_from_lsh_forest(lf, CFG)

faerun = Faerun(clear_color="#111111", view="front", coords=False)
faerun.add_scatter(
    "Customers",
    {"x": x, "y": y, "c": cluster_labels, "labels": labels},
    colormap="tab10",
    shader="smoothCircle",
    point_scale=5,
    max_point_size=30,
    has_legend=True,
    categorical=True,
)
faerun.add_tree(
    "Customers_tree", {"from": s, "to": t}, point_helper="Customers", color="#666666"
)
faerun.plot("tmap_graph", template="default")

## **Part 5**: Interactive visualization of clusters using non-linear dimensionality reduction: t-SNE

PCA can capture only **linear** projections of a high-dimensional space into lower dimensions. In many cases, it is more faithful to learn the non-linear manifold along which the data shows the highest variance. There are different techniques to achieve this goal, all with their own pros and cons. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a state-of-the-art technique for this purpose. For more information, you can have a look [here](https://lvdmaaten.github.io/tsne/). The following toy example illustrates the difference between PCA and t-SNE.  

In [None]:
# Inspired from work by Jake Vanderplas -- <vanderplas@astro.washington.edu>

n_points = 1000
X_sample, color = datasets.make_s_curve(n_points, random_state=0)
n_components = 2

# Original space
fig = plt.figure(figsize=(32, 12))
ax = fig.add_subplot(251, projection='3d')
ax.scatter(X_sample[:, 0], X_sample[:, 1], X_sample[:, 2], c=color, cmap=plt.cm.Spectral)
ax.view_init(4, -72)

# PCA space
t0 = time()
pca = PCA(n_components=n_components, random_state=0)
Y_sample = pca.fit_transform(X_sample)
t1 = time()
print("PCA: %.2g sec" % (t1 - t0))
ax = fig.add_subplot(252)
plt.scatter(Y_sample[:, 0], Y_sample[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("PCA (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')

# TSNE space
t0 = time()
tsne = TSNE(n_components=n_components, init='pca', random_state=0)
Y_sample = tsne.fit_transform(X_sample)
t1 = time()
print("t-SNE: %.2g sec" % (t1 - t0))
ax = fig.add_subplot(253)
plt.scatter(Y_sample[:, 0], Y_sample[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("t-SNE (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')

plt.show()

In brief, t-SNE turns pairwise distances into two probability distributions over pairs of points, one in the original space and one in the reduced space, and minimizes the (Kullback-Leibler) divergence between them. Unlike PCA, t-SNE is sensitive to its configuration parameters. Among them, the following are worth mentioning:

* **perplexity**: loosely, the number of nearest neighbors considered 'close' in the original space. 
* **early_exaggeration**: controls how tight natural clusters in the original space are in the embedded space and how much space there will be between them. 
* **learning_rate**: the step size used when solving the optimization problem with gradient descent. It affects how fast the optimization converges and whether it can escape from local minima. Note that different runs of this algorithm with different initial points can lead to different results, because the optimization surface is non-convex. One might run the algorithm several times with different initial values and choose the run with the minimum divergence score.
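
The last point can be sketched as follows: run t-SNE from several random initializations and keep the embedding with the lowest final KL divergence, which sklearn exposes as `kl_divergence_`. The data set, the number of seeds, and the parameter values are assumptions chosen to keep the example small:

```python
from sklearn import datasets
from sklearn.manifold import TSNE

# Small S-curve sample for illustration only
X_toy, _ = datasets.make_s_curve(150, random_state=0)

runs = {}
for seed in range(3):
    # Illustrative parameter values; init='random' so that the seed matters
    tsne = TSNE(n_components=2, perplexity=30, early_exaggeration=12.0,
                learning_rate=200.0, init='random', random_state=seed)
    embedding = tsne.fit_transform(X_toy)
    runs[seed] = (tsne.kl_divergence_, embedding)

# Keep the run with the smallest final divergence
best_seed = min(runs, key=lambda s: runs[s][0])
best_embedding = runs[best_seed][1]

print({s: round(float(kl), 3) for s, (kl, _) in runs.items()})
print('best seed:', best_seed)
```

Comparing divergence scores is only meaningful between runs that share the same perplexity, since the perplexity changes the distribution being matched.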

Coming back to the customer segmentation problem, we now visualize our data with the chosen number of clusters (9) using t-SNE.

In [None]:
# First, we represent each data point by its distances to the 9 centroids (a 9-dimensional space)
X_kmeans_distances = clusterer.transform(data_with_dummies_ready)
X_kmeans_distances
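
For a fitted `KMeans` model, `transform` returns one column per cluster containing the Euclidean distance from each sample to that cluster's centroid. The sketch below checks this against a direct NumPy computation on synthetic data (the data and the choice of 3 clusters are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration only
X_toy, _ = make_blobs(n_samples=100, centers=3, n_features=5, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Cluster-distance space: shape (n_samples, n_clusters)
distances = km.transform(X_toy)

# Same quantity computed directly: ||x_i - c_j|| for every sample and centroid
manual = np.linalg.norm(
    X_toy[:, None, :] - km.cluster_centers_[None, :, :], axis=2)

print(distances.shape)
print(np.allclose(distances, manual))
# The nearest centroid in this space is the assigned cluster
print(np.array_equal(distances.argmin(axis=1), km.labels_))
```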

In [None]:
# Next, we transform the data from the 9-dimensional space to a 2-dimensional space using t-SNE
tsne2 = TSNE(n_components=2, verbose=1, random_state=1, method='exact')
X_kmeans_distances_tsne2 = tsne2.fit_transform(X_kmeans_distances)

In [None]:
X_kmeans_distances_tsne2.shape

In [None]:
# Define a colormap of random colors
colormap = np.array(["#6d8dca", "#69de53", "#723bca", "#c3e14c", "#c84dc9", "#68af4e", "#6e6cd5",
"#e3be38", "#4e2d7c", "#5fdfa8", "#d34690", "#3f6d31", "#d44427", "#7fcdd8", "#cb4053", "#5e9981",
"#803a62", "#9b9e39", "#c88cca", "#e1c37b", "#34223b", "#bdd8a3", "#6e3326", "#cfbdce", "#d07d3c",
"#52697d", "#7d6d33", "#d27c88", "#36422b", "#b68f79"])

In [None]:
# We convert the 2-dimensional t-SNE data into a data frame with the associated cluster number and color

def calculate_color(cluster):
    color = colormap[cluster]
    return color

dataset_kmeans_vis = pd.DataFrame(X_kmeans_distances_tsne2, columns=['x', 'y'])
dataset_kmeans_vis['cluster'] = cluster_labels
dataset_kmeans_vis['color'] = dataset_kmeans_vis.cluster.apply(calculate_color)
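# Use .values so rows are matched by position; the index of `data` may have gaps after dropping rows
dataset_kmeans_vis['channel'] = data['Channel'].values
dataset_kmeans_vis['region'] = data['Region'].values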

dataset_kmeans_vis.head(2)

In [None]:
# Visualize using the bokeh library, which is used for interactive visualization
# BokehJS may return an error the first time this cell is run

source = ColumnDataSource(data=dataset_kmeans_vis)

# Note: Bokeh 3.x removed plot_width/plot_height; use width/height there instead
plot_kmeans = bp.figure(plot_width=800, plot_height=600, title='KMeans clustering of wholesale customers',
    tools='pan,wheel_zoom,box_zoom,reset,hover',
    x_axis_type=None, y_axis_type=None, min_border=1)
output_notebook()
plot_kmeans.scatter(x='x', y='y', color='color', size=5, source=source)
hover = plot_kmeans.select(dict(type=HoverTool))
hover.tooltips=[("index", "$index"),('cluster','@cluster'), ('channel', '@channel'), ('region', '@region')]
show(plot_kmeans)

## **Part 6**: Discussion

* What could be the semantics behind each cluster? 
* How can the set of chosen features affect cluster semantics?