## Introduction

Clustering is one of the most popular unsupervised learning techniques for discovering hidden structure in data. The goal is to find natural groupings so that items in the same cluster are more similar to each other than to items in other clusters.

In this post we will learn customer segmentation using clustering. We will perform the following steps on our dataset:

*	Data exploration and pre-processing
*	K-Means Clustering with pre-defined number of clusters
*	Find optimal number of clusters using elbow and silhouette plot
*	Perform agglomerative hierarchical clustering using dendrogram
*	Visualization of clusters using PCA (Principal Component Analysis)
*	Visualization of clusters using T-SNE

## Import Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_samples
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# plotly imports
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

## Loading Data

In [None]:
# load the customer data
raw_df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv', index_col='CustomerID')
raw_df.info()

## Data Exploration

In [None]:
raw_df.head()

In [None]:
raw_df.describe()

In [None]:
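# Take labels and counts from the same Series so the bars cannot be mislabeled
gender_counts = raw_df['Gender'].value_counts()
fig = go.Figure([go.Bar(x=gender_counts.index.tolist(), y=gender_counts.values.tolist())])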
fig.show()

Now let's explore the spending scores with respect to the annual income of each customer.

In [None]:
fig = px.scatter(raw_df, x='Annual Income (k$)', y='Spending Score (1-100)', color="Gender", size="Age")
fig.show()

One clear observation from the plot above is that customers aged roughly 20-30 tend to have high spending scores (above 70) despite low annual incomes (below 40k$); they form the cluster in the top-left corner.
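
This observation can also be checked numerically rather than by eye. The following is a minimal sketch, assuming the column names used in the scatter plot; `toy_df` is made-up illustrative data (not the mall dataset) and `young_high_spenders` is a hypothetical helper. In the notebook, `raw_df` would be passed instead.

```python
import pandas as pd

# Made-up rows for illustration only; the real notebook would pass raw_df instead.
toy_df = pd.DataFrame({
    'Age': [22, 25, 45, 31, 28, 60],
    'Annual Income (k$)': [18, 30, 25, 80, 39, 20],
    'Spending Score (1-100)': [80, 75, 20, 90, 72, 10],
})

def young_high_spenders(df, max_income=40, min_score=70, age_range=(20, 30)):
    """Return customers within age_range who have a low income and a high spending score."""
    mask = (
        df['Age'].between(*age_range)
        & (df['Annual Income (k$)'] < max_income)
        & (df['Spending Score (1-100)'] > min_score)
    )
    return df[mask]

segment = young_high_spenders(toy_df)
print(segment)
print('Share of all customers: {:.0%}'.format(len(segment) / len(toy_df)))
```

The thresholds are keyword arguments so that they can be varied to see how sensitive the segment is to the chosen cut-offs.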

## Data Pre-Processing

In [None]:
# One-hot encoding of gender column
encoded_df = pd.get_dummies(raw_df, prefix='Gender', drop_first=True)
encoded_df.head()

In [None]:
# Standardize the dataset
scaler = StandardScaler()
# Scale only the numeric columns
scaled_features = pd.DataFrame(scaler.fit_transform(encoded_df.loc[:, ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]),
                              index=encoded_df.index, columns=['Scaled_Age', 'Scaled_AnnualIncome', 'Scaled_SpendingScore'])
# Concat with the Gender column
scaled_df = pd.concat([scaled_features, encoded_df.loc[:, ['Gender_Male']]], axis=1)
scaled_df.head()
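
For intuition, `StandardScaler` subtracts each column's mean and divides by its standard deviation (the population standard deviation, i.e. `ddof=0`). A minimal sketch on made-up numbers, checking that the manual computation agrees with scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration: the columns stand in for age and annual income.
X = np.array([[20.0, 15.0],
              [35.0, 60.0],
              [50.0, 120.0]])

# Manual z-score: (x - mean) / std, computed per column with ddof=0.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

# Compare the two results and inspect the per-column mean and standard deviation.
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and unit standard deviation, so no single feature dominates the Euclidean distances that K-Means relies on.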

## KMeans Clustering With Predefined Number of Clusters

In [None]:
result_df = scaled_df.copy()
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(result_df.values)
result_df['Cluster'] = clusters
result_df.head()

## Find optimal number of clusters using elbow and silhouette plot

In [None]:
# Elbow plot to determine the optimal number of clusters
distorsions = []
for n_cluster in range(1, 11):
    km = KMeans(n_clusters=n_cluster, n_init=50, max_iter=300, random_state=0)
    km.fit(scaled_df.values)
    distorsions.append(km.inertia_)
plt.plot(range(1, 11), distorsions, marker='o')
plt.show()

In [None]:
def silhoutte_plot(X, y_km):
    cluster_labels = np.unique(y_km)
    n_clusters = cluster_labels.shape[0]
    silhoutte_vals = silhouette_samples(X, y_km, metric='euclidean')
    y_ax_upper, y_ax_lower = 0, 0
    y_ticks = []
    for i, c in enumerate(cluster_labels):
        c_silhoutte_vals = silhoutte_vals[y_km==c]
        c_silhoutte_vals.sort()
        y_ax_upper += len(c_silhoutte_vals)
        color = cm.jet(float(i) / n_clusters)
        plt.barh(range(y_ax_lower, y_ax_upper), c_silhoutte_vals, height=1.0, edgecolor='none', color=color)
        y_ticks.append((y_ax_lower + y_ax_upper) / 2.)
        y_ax_lower += len(c_silhoutte_vals)

    silhoutte_avg = np.mean(silhoutte_vals)
    plt.axvline(silhoutte_avg, color='red', linestyle='--')
    plt.yticks(y_ticks, cluster_labels+1)
    plt.ylabel('Cluster')
    plt.xlabel('Silhouette Coefficient')
    plt.show()

In [None]:
km4 = KMeans(n_clusters=4, n_init=50)
y_km4 = km4.fit_predict(scaled_df.values)

In [None]:
silhoutte_plot(scaled_df.values, y_km4)

In [None]:
km5 = KMeans(n_clusters=5, n_init=50)
y_km5 = km5.fit_predict(scaled_df.values)

In [None]:
silhoutte_plot(scaled_df.values, y_km5)

From the silhouette plots above, 5 appears to be the optimal number of clusters.
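
Reading silhouette plots by eye can be backed up with a single number per candidate `k`. The sketch below assumes synthetic data from `make_blobs` as a stand-in for `scaled_df.values`, and uses `silhouette_score`, which is the mean of the per-sample values returned by `silhouette_samples`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled_df.values: 5 blobs in 4 dimensions.
X, _ = make_blobs(n_samples=200, centers=5, n_features=4, cluster_std=0.6, random_state=0)

# Mean silhouette coefficient for each candidate number of clusters.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric='euclidean')

for k, s in scores.items():
    print('k={}: mean silhouette = {:.3f}'.format(k, s))

best_k = max(scores, key=scores.get)
print('Highest mean silhouette at k =', best_k)
```

The mean score is only a summary. The per-cluster bars in the plots above remain useful for spotting a single poorly separated cluster that the average can hide.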

## Perform agglomerative hierarchical clustering using dendrogram

Now we will apply complete-linkage agglomeration to our data using the `linkage` function, which returns the so-called linkage matrix.

In [None]:
# Apply complete-linkage agglomeration using the linkage function, which returns a linkage matrix (row_clusters)
row_clusters = linkage(scaled_df.values, method='complete', metric='euclidean')
# Now let's make a dendrogram
plt.figure(figsize=(12, 8))
row_dndr = dendrogram(row_clusters, labels=scaled_df.index)
plt.tight_layout()
plt.ylabel('Euclidean Distance')
plt.show()
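
To make the linkage matrix less abstract: for `n` observations, `linkage` returns an `(n - 1) x 4` array in which each row records one merge (the two cluster indices joined, the distance between them, and the size of the new cluster). A minimal sketch on made-up 2-D points, which also uses `fcluster` (not used elsewhere in this notebook) to cut the tree into flat clusters:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up points for illustration: two tight pairs and one distant point.
points = np.array([[0.0, 0.0],
                   [0.0, 1.0],
                   [5.0, 5.0],
                   [5.0, 6.0],
                   [20.0, 20.0]])

Z = linkage(points, method='complete', metric='euclidean')

# Each row is one merge step; indices >= len(points) refer to clusters formed earlier.
merge_table = pd.DataFrame(Z, columns=['cluster_a', 'cluster_b', 'distance', 'n_items'])
print(merge_table)

# Cut the tree so that at most 3 flat clusters remain.
flat_labels = fcluster(Z, t=3, criterion='maxclust')
print(flat_labels)
```

The `distance` column holds the heights at which branches join in the dendrogram, so a large jump between consecutive rows suggests a natural place to cut the tree.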

In [None]:
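# Now let's perform agglomerative hierarchical clustering using the sklearn library
# Euclidean distance is the default, so the `affinity` argument (renamed to `metric` in newer scikit-learn releases) is omitted
ac = AgglomerativeClustering(n_clusters=5, linkage='complete')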
labels = ac.fit_predict(scaled_df.values)
print('First 10 cluster labels: {}'.format(labels[:10]))

## Visualization of clusters using PCA (Principal Component Analysis)

In [None]:
# PCA with three principal components
pca = PCA(n_components=3)

In [None]:
# PCA Dataframe (reuse the CustomerID index so the later concat aligns rows correctly)
PCs = pd.DataFrame(pca.fit_transform(result_df.drop(['Cluster'], axis=1)),
                   index=result_df.index, columns=['PC1', 'PC2', 'PC3'])

In [None]:
# Concatenate all PCs dataframes with result df
result_df = pd.concat([result_df, PCs], axis=1, join='inner')
result_df.head()

In [None]:
# Separate out 5 different clusters
cluster0 = result_df[result_df['Cluster'] == 0]
cluster1 = result_df[result_df['Cluster'] == 1]
cluster2 = result_df[result_df['Cluster'] == 2]
cluster3 = result_df[result_df['Cluster'] == 3]
cluster4 = result_df[result_df['Cluster'] == 4]

### PCA Visualization

In [None]:
init_notebook_mode(connected=True)

#### 2D-Visualization

In [None]:
# Instructions for building the 2D plot

# trace1 for `cluster0`
trace1 = go.Scatter(x=cluster0['PC1'], y=cluster0['PC2'], mode='markers', name='Cluster0', 
                    marker=dict(color = 'rgba(255, 128, 255, 0.8)'), text=None)

# trace2 for `cluster1`
trace2 = go.Scatter(x=cluster1['PC1'], y=cluster1['PC2'], mode='markers', name='Cluster1', 
                    marker=dict(color = 'rgba(255, 128, 2, 0.8)'), text=None)

# trace3 for `cluster2`
trace3 = go.Scatter(x=cluster2['PC1'], y=cluster2['PC2'], mode='markers', name='Cluster2', 
                    marker=dict(color = 'rgba(20, 128, 200, 0.8)'), text=None)

# trace4 for `cluster3`
trace4 = go.Scatter(x=cluster3['PC1'], y=cluster3['PC2'], mode='markers', name='Cluster3', 
                    marker=dict(color = 'rgba(0, 255, 200, 0.8)'), text=None)

# trace5 for `cluster4`
trace5 = go.Scatter(x=cluster4['PC1'], y=cluster4['PC2'], mode='markers', name='Cluster4', 
                    marker=dict(color = 'rgba(150, 0, 200, 0.8)'), text=None)

data = [trace1, trace2, trace3, trace4, trace5]

title = '2D Visualization of clusters using PCA'

layout = dict(title=title,
              xaxis=dict(title='PC1', ticklen=5, zeroline=False),
              yaxis=dict(title='PC2', ticklen=5, zeroline=False))

fig = dict(data=data, layout=layout)

iplot(fig)

#### 3D-Visualization

In [None]:
# Instructions for building the 3D plot

# trace1 for `cluster0`
trace1 = go.Scatter3d(x=cluster0['PC1'], y=cluster0['PC2'], z=cluster0['PC3'], mode='markers', name='Cluster0', 
                      marker=dict(color = 'rgba(255, 128, 255, 0.8)'), text=None)

# trace2 for `cluster1`
trace2 = go.Scatter3d(x=cluster1['PC1'], y=cluster1['PC2'], z=cluster1['PC3'], mode='markers', name='Cluster1',
                      marker=dict(color = 'rgba(255, 128, 2, 0.8)'), text=None)

# trace3 for `cluster2`
trace3 = go.Scatter3d(x=cluster2['PC1'], y=cluster2['PC2'], z=cluster2['PC3'], mode='markers', name='Cluster2',
                      marker=dict(color = 'rgba(20, 128, 200, 0.8)'), text=None)

# trace4 for `cluster3`
trace4 = go.Scatter3d(x=cluster3['PC1'], y=cluster3['PC2'], z=cluster3['PC3'], mode='markers', name='Cluster3',
                      marker=dict(color = 'rgba(0, 255, 200, 0.8)'), text=None)

# trace5 for `cluster4`
trace5 = go.Scatter3d(x=cluster4['PC1'], y=cluster4['PC2'], z=cluster4['PC3'], mode='markers', name='Cluster4',
                      marker=dict(color = 'rgba(150, 0, 200, 0.8)'), text=None)

data = [trace1, trace2, trace3, trace4, trace5]

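title = '3D Visualization of clusters using PCA'

# 3D axes are configured under `scene` rather than top-level xaxis/yaxis
layout = dict(title=title,
              scene=dict(xaxis=dict(title='PC1', ticklen=5, zeroline=False),
                         yaxis=dict(title='PC2', ticklen=5, zeroline=False),
                         zaxis=dict(title='PC3', ticklen=5, zeroline=False)))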

fig = dict(data=data, layout=layout)

iplot(fig)

## Visualization of clusters using T-SNE

In [None]:
# T-SNE with three components
perplexity = 12
tsne = TSNE(n_components=3, perplexity=perplexity)

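# This DataFrame contains the three dimensions built by T-SNE.
# Fit on the scaled features only (not the cluster labels or the PCA columns) and
# reuse the CustomerID index so the concat below aligns rows correctly
TCs = pd.DataFrame(tsne.fit_transform(scaled_df.values),
                   index=scaled_df.index, columns=['TC1', 'TC2', 'TC3'])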

# Concatenate the TCs 
result_df = pd.concat([result_df, TCs], axis=1, join='inner')
result_df.head()

In [None]:
# Separate out 5 different clusters
cluster0 = result_df[result_df['Cluster'] == 0]
cluster1 = result_df[result_df['Cluster'] == 1]
cluster2 = result_df[result_df['Cluster'] == 2]
cluster3 = result_df[result_df['Cluster'] == 3]
cluster4 = result_df[result_df['Cluster'] == 4]

### T-SNE Visualization

#### 2D-Visualization

In [None]:
# Instructions for building the 2D plot

# trace1 for `cluster0`
trace1 = go.Scatter(x=cluster0['TC1'], y=cluster0['TC2'], mode='markers', name='Cluster0', 
                    marker=dict(color = 'rgba(255, 128, 255, 0.8)'), text=None)

# trace2 for `cluster1`
trace2 = go.Scatter(x=cluster1['TC1'], y=cluster1['TC2'], mode='markers', name='Cluster1', 
                    marker=dict(color = 'rgba(255, 128, 2, 0.8)'), text=None)

# trace3 for `cluster2`
trace3 = go.Scatter(x=cluster2['TC1'], y=cluster2['TC2'], mode='markers', name='Cluster2', 
                    marker=dict(color = 'rgba(20, 128, 200, 0.8)'), text=None)

# trace4 for `cluster3`
trace4 = go.Scatter(x=cluster3['TC1'], y=cluster3['TC2'], mode='markers', name='Cluster3', 
                    marker=dict(color = 'rgba(0, 255, 200, 0.8)'), text=None)

# trace5 for `cluster4`
trace5 = go.Scatter(x=cluster4['TC1'], y=cluster4['TC2'], mode='markers', name='Cluster4', 
                    marker=dict(color = 'rgba(150, 0, 200, 0.8)'), text=None)

data = [trace1, trace2, trace3, trace4, trace5]

title = '2D Visualization of clusters using T-SNE'

layout = dict(title=title,
              xaxis=dict(title='TC1', ticklen=5, zeroline=False),
              yaxis=dict(title='TC2', ticklen=5, zeroline=False))

fig = dict(data=data, layout=layout)

iplot(fig)

#### 3D-Visualization

In [None]:
# Instructions for building the 3D plot

# trace1 for `cluster0`
trace1 = go.Scatter3d(x=cluster0['TC1'], y=cluster0['TC2'], z=cluster0['TC3'], mode='markers', name='Cluster0', 
                      marker=dict(color = 'rgba(255, 128, 255, 0.8)'), text=None)

# trace2 for `cluster1`
trace2 = go.Scatter3d(x=cluster1['TC1'], y=cluster1['TC2'], z=cluster1['TC3'], mode='markers', name='Cluster1',
                      marker=dict(color = 'rgba(255, 128, 2, 0.8)'), text=None)

# trace3 for `cluster2`
trace3 = go.Scatter3d(x=cluster2['TC1'], y=cluster2['TC2'], z=cluster2['TC3'], mode='markers', name='Cluster2',
                      marker=dict(color = 'rgba(20, 128, 200, 0.8)'), text=None)

# trace4 for `cluster3`
trace4 = go.Scatter3d(x=cluster3['TC1'], y=cluster3['TC2'], z=cluster3['TC3'], mode='markers', name='Cluster3',
                      marker=dict(color = 'rgba(0, 255, 200, 0.8)'), text=None)

# trace5 for `cluster4`
trace5 = go.Scatter3d(x=cluster4['TC1'], y=cluster4['TC2'], z=cluster4['TC3'], mode='markers', name='Cluster4',
                      marker=dict(color = 'rgba(150, 0, 200, 0.8)'), text=None)

data = [trace1, trace2, trace3, trace4, trace5]

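title = '3D Visualization of clusters using T-SNE'

# 3D axes are configured under `scene` rather than top-level xaxis/yaxis
layout = dict(title=title,
              scene=dict(xaxis=dict(title='TC1', ticklen=5, zeroline=False),
                         yaxis=dict(title='TC2', ticklen=5, zeroline=False),
                         zaxis=dict(title='TC3', ticklen=5, zeroline=False)))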

fig = dict(data=data, layout=layout)

iplot(fig)