<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import" data-toc-modified-id="Import-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import</a></span></li><li><span><a href="#Get-Data" data-toc-modified-id="Get-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Get Data</a></span></li><li><span><a href="#Exploratory-Data-Analysis" data-toc-modified-id="Exploratory-Data-Analysis-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exploratory Data Analysis</a></span><ul class="toc-item"><li><span><a href="#Visuals" data-toc-modified-id="Visuals-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Visuals</a></span></li></ul></li><li><span><a href="#Data-Split:-Train-/-Test" data-toc-modified-id="Data-Split:-Train-/-Test-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Split: Train / Test</a></span></li><li><span><a href="#Prepare-the-data" data-toc-modified-id="Prepare-the-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Prepare the data</a></span><ul class="toc-item"><li><span><a href="#LabelEncoder" data-toc-modified-id="LabelEncoder-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>LabelEncoder</a></span></li><li><span><a href="#StandardScaler" data-toc-modified-id="StandardScaler-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>StandardScaler</a></span></li><li><span><a href="#Principal-Component-Analysis-(PCA)" data-toc-modified-id="Principal-Component-Analysis-(PCA)-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Principal Component Analysis (PCA)</a></span></li></ul></li><li><span><a href="#Kmeans-Cluster-Model" data-toc-modified-id="Kmeans-Cluster-Model-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Kmeans Cluster Model</a></span></li><li><span><a href="#Pipeline" data-toc-modified-id="Pipeline-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Pipeline</a></span></li><li><span><a href="#Assign-Labels" data-toc-modified-id="Assign-Labels-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Assign 
Labels</a></span></li><li><span><a href="#Optimal-K" data-toc-modified-id="Optimal-K-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Optimal K</a></span><ul class="toc-item"><li><span><a href="#Centriods" data-toc-modified-id="Centriods-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Centriods</a></span></li><li><span><a href="#Centriod-Visuals" data-toc-modified-id="Centriod-Visuals-9.2"><span class="toc-item-num">9.2&nbsp;&nbsp;</span>Centriod Visuals</a></span></li><li><span><a href="#Interia" data-toc-modified-id="Interia-9.3"><span class="toc-item-num">9.3&nbsp;&nbsp;</span>Interia</a></span></li><li><span><a href="#Interia2" data-toc-modified-id="Interia2-9.4"><span class="toc-item-num">9.4&nbsp;&nbsp;</span>Interia2</a></span></li><li><span><a href="#Silhouette" data-toc-modified-id="Silhouette-9.5"><span class="toc-item-num">9.5&nbsp;&nbsp;</span>Silhouette</a></span></li></ul></li><li><span><a href="#Validate-with-New-dataset" data-toc-modified-id="Validate-with-New-dataset-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Validate with New dataset</a></span></li><li><span><a href="#Conclusion-!" data-toc-modified-id="Conclusion-!-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Conclusion !</a></span></li></ul></div>

* Problem Statement: Analyze and segment the mall's customers based on four attributes (age, gender, annual income and spending score), and thereby support the mall's marketing strategy in identifying its target customers.

* To break this requirement down:
        * What is the optimal number of customer segments to create, and how do we find it?
        * Who are the target customers?
        * Can this customer segmentation be tested with new data?

## Import

In [None]:
#DataFrames
import numpy as np
import pandas as pd

#Scikit learn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline   
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder

#Visuals
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns
from pandas import plotting

#Others
import random
random.seed(42)
import warnings
import os
warnings.filterwarnings("ignore")
from IPython.display import Image
import math


In [None]:
# Eaton Center !
Image(url='https://i.gifer.com/QE7.gif')

## Get Data

In [None]:
df = pd.read_csv('../input/Mall_Customers.csv')
df.head(5)

In [None]:
# rows, columns
df.shape

## Exploratory Data Analysis

In [None]:
df.info()

In [None]:
df.describe()

#### Check Nulls

In [None]:
# Check for Nulls
df.isnull().sum().sort_values(ascending=False)

No null records, which means no additional effort is required to handle nulls. Great news!

#### Check Duplicates

In [None]:
# Unique values count
print(df.nunique())

`CustomerID` is the primary key of the dataset, and its unique count matches the total record count. Hence, there are no duplicate records.
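Comparing unique counts with the row count only rules out repeated IDs. A more direct check, sketched here on a small made-up frame (the `toy` data is illustrative, not the mall dataset), is `DataFrame.duplicated`, which flags every row that repeats an earlier one:

```python
import pandas as pd

# Toy frame (hypothetical values) with one fully duplicated row.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Age": [19, 21, 21, 35],
})

# duplicated() marks every row that repeats an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

# Restricting the check to the key column catches repeated IDs
# even when the other columns differ.
n_duplicate_ids = int(toy.duplicated(subset="CustomerID").sum())

print(n_duplicate_rows, n_duplicate_ids)
```

On the real data the same two calls could be run on `df` before `CustomerID` is dropped.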

#### Drop Columns

In [None]:
# drop Customer id 
df = df.drop('CustomerID', axis=1)
df.head(2)

#### Rename Columns

In [None]:
# rename columns
new_cols = ['Gender', 'Age', 'AnnualIncome','SpendingScore']

df.columns = new_cols

df.head(3)

For ease of use, I have renamed all the columns to get rid of spaces and special characters.

### Visuals

In [None]:
# Categorical Scatterplot on Gender Vs Annual Income
sns.catplot(x="Gender", y="AnnualIncome", kind="swarm",hue="Gender", data=df.sort_values("Gender"))

In [None]:
# Distributions of observations within categorical attribute - "Gender"
sns.catplot(x="Gender", y="AnnualIncome", kind="boxen",data=df.sort_values("Gender"))

In the two plots above, males reach higher annual incomes than females, but at the lower income levels both genders earn about the same.

In [None]:
sns.relplot(x="Age", y="AnnualIncome", hue="Gender", style="Gender",size="SpendingScore",sizes=(1, 100), data=df);

In the graph above, customers aged 30 to 40 have higher annual incomes and better spending scores, and males appear to be heavier spenders than females. This age group could be the target customers of the mall.

In [None]:
# Data distribution 
sns.pairplot(df);

In the plots above, the distributions of annual income and age suggest that few people earn more than 100k US dollars a year. Most people earn around 25k-90k US dollars, and the lowest income is around 20k US dollars (the income column in this dataset is expressed in thousands of dollars).

People aged between 20 and 40 have good spending scores. The scatter plot of spending score against annual income gives us some insight into the clustering. Let's hold our conclusions for the time being.

In [None]:
# Visual linear relationship
plt.figure(1 , figsize = (15 , 7))
n=0
new_cols = ['Age', 'AnnualIncome','SpendingScore']

for x in new_cols:
    for y in new_cols:
        n += 1
        plt.subplot(3 , 3 , n)
        plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
        sns.regplot(x = x , y = y , data = df)
            
plt.show()

No linear relationship between the attributes is apparent.

In [None]:
# Let's calculate the variance of each numerical attribute in the dataset.
# numeric_only=True skips the string-valued Gender column.
df.var(ddof=0, numeric_only=True).plot(kind='bar')

In [None]:
# Let's calculate the standard deviation of each numerical attribute in the dataset.
# numeric_only=True skips the string-valued Gender column.
df.std(ddof=0, numeric_only=True)

A common heuristic assumes that low-variance features are "noise" and high-variance features are informative. In this case, Age has low variance. Again, let's not draw any conclusions, as the dataset is not standardized yet.

## Data Split: Train / Test

In [None]:
train_X, test_X = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_X), "train +", len(test_X), "test")

In [None]:
# lets take copy of the data 
df2 = train_X.copy()

## Prepare the data

The preparation stages are:
1. LabelEncoder - encodes labels as integers between 0 and n_classes-1.
2. StandardScaler - standardizes each feature to zero mean and unit variance.
3. Principal Component Analysis (PCA) - an unsupervised statistical technique used for dimensionality reduction.
4. Finally, feature selection.
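As a compact overview, the first three stages can be sketched end to end on a small made-up frame. The `toy` values are hypothetical, and the column names simply mirror the renamed columns above; this is only an illustration of how the stages chain together, not the notebook's actual run.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data (hypothetical values), shaped like the renamed columns above.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Age": [19, 21, 35, 52, 44, 28],
    "AnnualIncome": [15, 16, 60, 88, 45, 120],
    "SpendingScore": [39, 81, 50, 13, 47, 90],
})

# 1. Encode the categorical column as integers (classes are sorted alphabetically).
encoder = LabelEncoder()
toy["Gender"] = encoder.fit_transform(toy["Gender"])

# 2. Standardize every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(toy)

# 3. Project onto the first two principal components.
reduced = PCA(n_components=2).fit_transform(scaled)

print(list(encoder.classes_))  # position in this list = encoded integer
print(reduced.shape)
```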

### LabelEncoder

In [None]:
# Let fit and transform the Gender attribute into numeric
le = LabelEncoder()
le.fit(df2.Gender)

In [None]:
# 0 is Female, 1 is Male
le.classes_

In [None]:
# update df2 with the transformed values of Gender
# (plain column assignment replaces the string column with an integer one)
df2['Gender'] = le.transform(df2['Gender'])

In [None]:
df2.head(3)

### StandardScaler

In [None]:
# Create scaler: scaler
scaler = StandardScaler()
scaler.fit(df2)

In [None]:
# transform
data_scaled = scaler.transform(df2)
data_scaled[0:3]

The next step, PCA, is sensitive to the scale of the features. Hence, I have standardized the data.
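To make the scaling step concrete, here is a minimal sketch on a made-up matrix `X` (hypothetical values). It checks that `StandardScaler` matches the textbook formula z = (x - mean) / std, computed per column with the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values): columns on different scales.
X = np.array([[19.0, 15.0],
              [35.0, 60.0],
              [52.0, 88.0],
              [28.0, 120.0]])

# z = (x - mean) / std, per column, with the population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler applies the same formula.
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the principal components just because of its units.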

### Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. Its first step is "decorrelation" (rotating the data onto uncorrelated axes); it then reduces the dimension by keeping only the leading axes.
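The two steps can be illustrated on synthetic data. This is a sketch only: the correlated 2-D sample below is generated for the example and has nothing to do with the mall dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, strongly correlated 2-D data (illustrative only).
rng = np.random.default_rng(42)
x = rng.normal(size=200)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=200)])

corr_before = np.corrcoef(X, rowvar=False)[0, 1]

# Step 1: rotate onto the principal axes; the new features are uncorrelated.
rotated = PCA().fit_transform(X)
corr_after = np.corrcoef(rotated, rowvar=False)[0, 1]

# Step 2: reduce the dimension by keeping only the leading component.
reduced = PCA(n_components=1).fit_transform(X)

print(round(corr_before, 3), round(abs(corr_after), 6), reduced.shape)
```

The correlation between the two original columns is high, while the correlation between the two PCA features is zero up to floating-point error.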

In [None]:
pca = PCA()

# fit PCA
pca.fit(data_scaled)

In [None]:
# PCA features
features = range(pca.n_components_)
features

In [None]:
# PCA transformed data
data_pca = pca.transform(data_scaled)
data_pca.shape

In [None]:
# PCA components variance ratios.
pca.explained_variance_ratio_

In [None]:
plt.bar(features, pca.explained_variance_ratio_)
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
plt.show()

This tells you that 33.2% of the dataset's variance lies along the first axis and 26.7% along the second, so two components together retain about 59.9% of the variance. I assume that 2 intrinsic dimensions (the number of PCA features needed to approximate the dataset) are sufficient to represent the dataset in a flat 2-dimensional plane.
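The two ratios quoted above add up to 33.2% + 26.7% = 59.9%, so a 2-component projection discards a fair share of the variance. One way to choose the number of components less by eye is to look at the cumulative explained variance. The sketch below uses synthetic data and an arbitrary 80% threshold, both of which are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 4-feature data (illustrative only, not the mall dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.80  # arbitrary choice for this example
n_needed = int(np.argmax(cumulative >= threshold) + 1)

# Passing a float in (0, 1) asks PCA to make the same choice itself.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X)

print(cumulative.round(3), n_needed, pca_auto.n_components_)
```

Applied to `data_scaled`, the same cumulative sum would show how much variance each additional component recovers beyond the first two.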

In [None]:
# Principal component analysis (PCA) and singular value decomposition (SVD) 
# PCA and SVD are closely related approaches and can be both applied to decompose any rectangular matrices.
pca2 = PCA(n_components=2, svd_solver='full')

# fit PCA
pca2.fit(data_scaled)

# PCA transformed data
data_pca2 = pca2.transform(data_scaled)
data_pca2.shape

## Kmeans Cluster Model

In [None]:
xs = data_pca2[:,0]
ys = data_pca2[:,1]
# first component on the x-axis, second on the y-axis, matching the axis labels
plt.scatter(xs, ys)


plt.grid(False)
plt.title('Scatter Plot of Customers data')
plt.xlabel('PCA-01')
plt.ylabel('PCA-02')

plt.show()

In [None]:
# KMeans model

# lets assume 4 clusters to start with

k=4 
kmeans = KMeans(n_clusters=k, init = 'k-means++',random_state = 42) 

## Pipeline

In [None]:
# Build pipeline
pipeline = make_pipeline(scaler, pca2, kmeans)
#pipeline = make_pipeline(kmeans)

In [None]:
# fit the pipeline (scaler, PCA, KMeans) to the label-encoded training data;
# scaling happens inside the pipeline
model_fit = pipeline.fit(df2)
model_fit

## Assign Labels

In [None]:
# target/labels of train_X
labels = model_fit.predict(df2)
labels

In [None]:
# lets add the clusters to the dataset
train_X['Clusters'] = labels

In [None]:
# Number of data points for each feature in each cluster
train_X.groupby('Clusters').count()

In [None]:
# Scatter plot visuals with labels

xs = data_pca2[:,0]
ys = data_pca2[:,1]
# first component on the x-axis, second on the y-axis, matching the axis labels
plt.scatter(xs, ys, c=labels)

plt.grid(False)
plt.title('Scatter Plot of Customers data')
plt.xlabel('PCA-01')
plt.ylabel('PCA-02')

plt.show()

## Optimal K

### Centriods

In [None]:
# Centroids of each clusters.
centroids = model_fit[2].cluster_centers_
centroids

### Centriod Visuals

In [None]:
X = data_pca2
# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

In [None]:
# Visualising the clusters and their centroids
# (x and y are passed by keyword; recent seaborn versions require this)
plt.figure(figsize=(15,7))
sns.scatterplot(x=X[labels == 0, 0], y=X[labels == 0, 1], color = 'grey', label = 'Cluster 1',s=50)
sns.scatterplot(x=X[labels == 1, 0], y=X[labels == 1, 1], color = 'blue', label = 'Cluster 2',s=50)
sns.scatterplot(x=X[labels == 2, 0], y=X[labels == 2, 1], color = 'yellow', label = 'Cluster 3',s=50)
sns.scatterplot(x=X[labels == 3, 0], y=X[labels == 3, 1], color = 'green', label = 'Cluster 4',s=50)

sns.scatterplot(x=centroids_x, y=centroids_y, color = 'red', 
                label = 'Centroids',s=300,marker='*')
plt.grid(False)
plt.title('Clusters of customers')
plt.xlabel('PCA-01')
plt.ylabel('PCA-02')
plt.legend()
plt.show()

### Interia

In [None]:
# Sum of squared distances from each sample to the centroid of its cluster
model_fit[2].inertia_

In [None]:
# WCSS stands for Within Cluster Sum of Squares. It should be low.

ks = range(1, 10)
wcss = []
samples = data_pca2

for i in ks:
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(samples)
    # the inertia_ attribute holds the WCSS of the fitted model
    wcss.append(kmeans.inertia_)

The summation of Distance(p, c)² is the sum of the squared distances from the points p in a cluster to its centroid c; WCSS adds this up over all clusters.


![](https://i.imgur.com/5W63xul.png)
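To connect the formula with the `inertia_` attribute, here is a minimal sketch that computes WCSS by hand and compares it with what `KMeans` reports. The two synthetic blobs are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (illustrative only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# WCSS: squared Euclidean distance from each point to its own centroid, summed.
own_centroid = km.cluster_centers_[km.labels_]
wcss_manual = float(((X - own_centroid) ** 2).sum())

print(round(wcss_manual, 4), round(km.inertia_, 4))
```

The two numbers agree up to floating-point error, which confirms that inertia is a sum of squared distances rather than of plain distances.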

In [None]:
# lets visualize 
plt.figure(figsize=(10,5))
sns.lineplot(x=list(ks), y=wcss, marker='o', color='skyblue')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

As you can see, the inertia drops very quickly as we increase k (the number of clusters) up to 4, but then decreases much more slowly as we keep increasing k. The curve has roughly the shape of an arm (hence the name Elbow Method), and there is an "elbow" at k=4. But is the optimal k value 4 or 5?

Let's find out with other measures, such as the silhouette.

### Interia2

A complementary measure of performance is what is called inertia2 below: the sum of the squared distances between each point and its 2nd closest cluster center.

A good clustering solution should have a small inertia and a large inertia2, which means:

- points are close to the center of their own cluster
- points are far from the centers of the other clusters (since they are far even from the closest of those centers)

In [None]:
def getInertia2(X,kmeans):
    ''' Analogous to the KMeans inertia, but measured to the 2nd closest center rather than the closest one'''
    inertia2 = 0
    for J in range(len(X)):
        L = min(1,len(kmeans.cluster_centers_)-1) # index 1 = 2nd closest center; falls back to index 0 when there is only 1 cluster
        dist_to_center = sorted([np.linalg.norm(X[J] - z)**2 for z in kmeans.cluster_centers_])[L]
        inertia2 = inertia2 + dist_to_center
    return inertia2 

### Silhouette

Another performance measure is called _silhouette_:

The silhouette $s(x)$ for a point $x$ is defined as:

$$ s(x) = \frac{b(x)-a(x)}{\max\{a(x),b(x)\}} $$

where 

- $a(x)$ is the average distance between $x$ and the other points in the cluster $x$ belongs to

- $b(x)$ is the lowest average distance between $x$ and the points of any cluster $x$ does not belong to

$s(x)$ is a quantity in $[-1,1]$. The silhouette score is the average of $s(x)$ over all the points: $\frac{1}{|X|} \sum_x s(x)$
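The definition can be checked against `silhouette_score` on a tiny example. The helper `silhouette_manual` below is a hypothetical name introduced for this sketch, and the six points are made up; it assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Tiny toy dataset (hypothetical values) with two obvious groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_manual(X, labels):
    """Mean of s(x) = (b - a) / max(a, b) over all points.
    Assumes every cluster has at least two points."""
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        # a(x): mean distance to the *other* members of x's cluster.
        a = dist[i, same].sum() / (same.sum() - 1)
        # b(x): lowest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

print(round(silhouette_manual(X, labels), 4))
print(round(silhouette_score(X, labels), 4))
```

Both lines print the same value, and because the two groups are far apart it is close to +1.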

In [None]:
wcss = []
inertias_2 = []
silhouette_avgs = []

ks = range(1, 10)
samples = data_pca2

for i in ks:
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(samples)
    wcss.append(kmeans.inertia_)
    inertias_2.append(getInertia2(samples,kmeans))
    if i>1:
        silhouette_avgs.append(silhouette_score(samples, kmeans.labels_))

In [None]:
silhouette_avgs

The silhouette coefficient can vary between -1 and +1: a coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters, while a coefficient close to 0 means that it is close to a cluster boundary, and finally a coefficient close to -1 means that the instance may have been assigned to the wrong cluster.

In [None]:
plt.figure(figsize=(20,5))

plt.subplot(1,3,1)
plt.title("wcss: sum square distances to closest cluster")
plt.plot(ks,wcss)
plt.xticks(ks)
plt.xlabel('number of clusters')
plt.grid()
    
plt.subplot(1,3,2)    
plt.title("Ratio: wcss VS. sum square distances to 2nd closest cluster")
plt.plot(ks,np.array(wcss)/np.array(inertias_2))
plt.xticks(ks)
plt.xlabel('number of clusters')
plt.grid()

plt.subplot(1,3,3)  
plt.title("Average Silhouette")
plt.plot(ks[1:], silhouette_avgs)
plt.xticks(ks)
plt.xlabel('number of clusters')
plt.grid()

plt.show()

Diagram 1 is the inertia (Elbow method). From it alone we cannot tell whether 4, 5 or 6 is the optimal K.

Diagram 2 is the ratio of the inertia to the sum of squared distances to the 2nd closest cluster. The ratio is low at K=4, which means a small inertia and a large inertia2 (points far from other clusters). So 4 is the optimal K.

Diagram 3 is the silhouette score. A high silhouette score means that instances are well inside their own clusters and far from other clusters. K=4 has the highest score.

So the optimal number of clusters (K) is 4.

Note: the number of clusters (K) might increase as the number of PCA components increases.

## Validate with New dataset 

In [None]:
# Copy the dataset
df_new = test_X.copy()

In [None]:
# encode Gender with the encoder already fitted on the training data
# (refitting on the test set could change the label mapping)
df_new['Gender'] = le.transform(df_new['Gender'])

# predict the labels
labels_test = model_fit.predict(df_new)
labels_test

In [None]:
# lets add the clusters to the dataset
test_X['Clusters'] = labels_test


In [None]:
# Number of data points for each feature in each cluster
test_X.groupby('Clusters').count()

In [None]:
query = (test_X['Clusters']==1)
test_X[query]

## Conclusion !

For the given dataset, we have segmented the customers into 4 (optimal) clusters using the KMeans algorithm. Each cluster is a mix of the defined variables: gender, age, spending score and annual income.

We would have to run another machine learning model to distinguish the customers further. For now, let's assume the clusters include the following kinds of customers: Big Spenders, Bargain Hunters and Window Shoppers.
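Before reaching for another model, a quick way to put names on the segments is to profile each cluster by its feature means. The sketch below uses a made-up labelled frame (hypothetical values and cluster ids); on the real data the same `groupby` could be applied to `train_X`.

```python
import pandas as pd

# Toy labelled customers (hypothetical values and cluster ids).
toy = pd.DataFrame({
    "Age": [23, 25, 48, 52, 31, 33],
    "AnnualIncome": [80, 90, 30, 28, 85, 95],
    "SpendingScore": [85, 90, 20, 15, 25, 18],
    "Clusters": [0, 0, 1, 1, 2, 2],
})

# Per-cluster feature means give a quick profile of each segment.
profile = toy.groupby("Clusters").mean()

# Example reading: the cluster with the highest mean spending score.
big_spenders = int(profile["SpendingScore"].idxmax())

print(profile)
print(big_spenders)
```

In this toy frame, the cluster with high income and high spending score would be the natural candidate for the "Big Spenders" label.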

In [None]:
from IPython.display import display, HTML

HTML('''<div style="display: flex; justify-content: row;">
    <img src="https://media.giphy.com/media/MEgGD8bV72hfq/giphy.gif">
    <img src="https://media.giphy.com/media/3k9gOXgimLWF2/giphy.gif">
    <img src="https://media.giphy.com/media/3o751RE4VSNLjpSLew/giphy.gif">
</div>''')

In [None]:
# Are these shoppers? No idea what they are doing!
Image(url='https://media.giphy.com/media/fAhOtxIzrTxyE/giphy.gif')