**Customer Segmentation using Clustering**

1. Importing Libraries.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
#importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
import plotly.graph_objs as go
import plotly as py
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import sklearn.utils

In [None]:
#Reading the raw data
data = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')


In [None]:
data.head()

In [None]:
data.set_index('CustomerID', inplace=True)

2. Data Exploration

In [None]:
# Print a concise summary of the data
data.info()

In [None]:
#Return the first 5 rows
data.head()

In [None]:
#get data dimensionality
data.shape

Since the CustomerID column has been moved to the index, the mall data has 4 columns and 200 records.
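A minimal sketch of why the column count drops by one, using a hypothetical three-row frame (the values are made up; only the column names follow the dataset):

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror Mall_Customers.csv.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 21, 20],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})

shape_before = toy.shape                          # CustomerID still counts as a column
shape_after = toy.set_index("CustomerID").shape   # CustomerID is now the index

print(shape_before, shape_after)
```

`set_index` keeps every row and only moves the chosen column out of the column list, so the row count is unchanged.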

In [None]:
data.describe()

Interesting statistical insights:
1. The average age of customers is 39 years, with a minimum of 18 and a maximum of 70.
2. Three quarters of the customers are at most 49 years old (the 75th percentile of Age is 49).
3. The average annual income is about $60k, with a maximum of $137k and a minimum of $15k.
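The `75%` row of `describe()` is a quantile: roughly three quarters of the values lie at or below it. A minimal sketch with made-up ages (not the mall data) shows how to read it:

```python
import pandas as pd

# Hypothetical ages, chosen so the 75th percentile falls exactly on a data point.
ages = pd.Series([18, 30, 39, 49, 70], name="Age")

q75 = ages.quantile(0.75)               # the value describe() reports as "75%"
share_at_most = (ages <= q75).mean()    # fraction of customers aged q75 or younger
share_at_least = (ages >= q75).mean()   # fraction of customers aged q75 or older

print(q75, share_at_most, share_at_least)
```

On this toy series most of the values sit at or below the quantile, and only a minority sit at or above it.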




3. Data cleaning

In [None]:
data.isna().sum()

In [None]:
data.duplicated().sum()

There are no duplicated or missing values.
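For contrast, here is a sketch of what the same two checks report when a frame does contain problems. The frame below is hypothetical and deliberately includes one missing value and one repeated row:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN and one repeated row.
toy = pd.DataFrame({
    "Age": [19, 21, 21, np.nan],
    "Annual Income (k$)": [15, 16, 16, 17],
})

missing_per_column = toy.isna().sum()     # number of NaN values in each column
duplicate_rows = toy.duplicated().sum()   # rows identical to an earlier row

print(missing_per_column.to_dict(), duplicate_rows)
```

All-zero results from both calls, as in this notebook, mean no imputation or de-duplication step is needed.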

4. Data Visualization

In [None]:
data.columns

In [None]:
data.rename(columns={'Annual Income (k$)':'Annual_Income', 'Spending Score (1-100)':'Spending_Score'}, inplace=True)

In [None]:
data.head()

1. Scatterplot

In [None]:
# lmplot creates its own figure, so set the size with height/aspect instead of plt.figure
sns.lmplot(x='Age', y='Annual_Income', hue='Gender', data=data, height=10, aspect=1.5)
plt.xlabel('Age')
plt.ylabel('Annual_Income')
plt.title('Age vs Annual Income w.r.t Gender')
plt.show()

Analysis:
1. On average, males have a higher annual income than females up to the age of 50.
2. After 50, males tend to earn less than their female counterparts.
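One way to check a reading like this numerically is to compare mean income per gender on each side of the age-50 threshold. The sketch below uses invented rows, so its numbers say nothing about the mall data; `Age_Band` is a helper column introduced only for illustration:

```python
import pandas as pd

# Hypothetical rows; values are invented to illustrate the computation only.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Age": [25, 40, 30, 45, 60, 65],
    "Annual_Income": [70, 80, 60, 50, 40, 55],
})

# Split at age 50, the threshold used in the analysis above.
toy["Age_Band"] = pd.cut(toy["Age"], bins=[0, 50, 120], labels=["<=50", ">50"])

# Mean income per age band (rows) and gender (columns).
mean_income = (
    toy.groupby(["Age_Band", "Gender"], observed=True)["Annual_Income"]
       .mean()
       .unstack("Gender")
)
print(mean_income)
```

Running the same grouping on `data` would give a table to set beside the regression lines in the plot.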

In [None]:
# lmplot creates its own figure, so set the size with height/aspect instead of plt.figure
sns.lmplot(x='Age', y='Spending_Score', hue='Gender', data=data, height=6, aspect=2.5)
plt.xlabel('Age')
plt.ylabel('Spending_Score')
plt.title('Age vs Spending_Score w.r.t Gender')
plt.show()

Analysis:
1. Regardless of gender, spending score tends to decrease as age increases.
2. On average, females spend more than their male counterparts up to the age of 50.
3. After 50, female spending drops marginally.
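The direction of the age and spending relationship can be summarised with a correlation coefficient. A minimal sketch on made-up values (not the mall data), using Pearson correlation as one possible measure:

```python
import pandas as pd

# Hypothetical values where spending falls as age rises.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Spending_Score": [80, 70, 55, 45, 30],
})

# Pearson correlation: negative when one variable falls as the other rises.
r = toy["Age"].corr(toy["Spending_Score"])
print(round(r, 3))
```

A negative coefficient indicates a decreasing trend; it does not imply strict inverse proportionality.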

In [None]:
# lmplot creates its own figure, so set the size with height/aspect instead of plt.figure
sns.lmplot(x='Annual_Income', y='Spending_Score', hue='Gender', data=data, height=6, aspect=2.5)
plt.xlabel('Annual_Income')
plt.ylabel('Spending_Score')
plt.title('Annual_Income vs Spending_Score w.r.t Gender')
plt.show()

Analysis:
1. As annual income increases, females tend to spend substantially more than males.

2. Distplot

In [None]:
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15, 8))

# sns.distplot is deprecated; histplot with a KDE overlay draws the same kind of plot
sns.histplot(data['Age'], kde=True, stat='density', ax=ax1)
ax1.set_title('Distribution of Age')

sns.histplot(data['Annual_Income'], kde=True, stat='density', ax=ax2)
ax2.set_title('Distribution of Annual_Income')

sns.histplot(data['Spending_Score'], kde=True, stat='density', ax=ax3)
ax3.set_title('Distribution of Spending_Score')

3. Regplot

In [None]:
# Six ordered pairs of distinct features, so use a 2 x 3 grid of axes
features = ['Age', 'Annual_Income', 'Spending_Score']
f, axes = plt.subplots(2, 3, figsize=(15, 10))
count = 0
for i in features:
    for j in features:
        if i != j:
            # x and y must be passed by keyword in recent seaborn versions
            sns.regplot(x=i, y=j, data=data, ax=axes.flat[count])
            count += 1
plt.tight_layout()

In [None]:
plt.figure(1 , figsize = (15 , 6))
ax = sns.boxplot(x="Age", y="Annual_Income", hue="Gender",data=data, palette=sns.color_palette("muted"))

In [None]:
plt.figure(1 , figsize = (15 , 6))
ax = sns.boxplot(x="Age", y="Gender", data=data)

In [None]:
plt.figure(1 , figsize = (15 , 6))
ax = sns.boxplot(x="Annual_Income", y="Gender", data=data)

In [None]:
plt.figure(1 , figsize = (15 , 6))
ax = sns.boxplot(x="Spending_Score", y="Gender", data=data)

Using Clustering Algorithm:


**K-means Algorithm**

The **elbow method** is used to select the number of clusters.
The “elbow” (the point on the curve where the decrease in distortion levels off) indicates a good trade-off: adding more clusters beyond it yields little further improvement.
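The idea can be sketched without the visualizer by fitting K-means for several values of k and recording scikit-learn's `inertia_` for each. The data below is synthetic (three well-separated groups from `make_blobs`), and `n_init=10` and `random_state=0` are assumptions made for reproducibility:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (not the mall data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

inertia_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(X)
    inertia_by_k[k] = model.inertia_  # within-cluster sum of squared distances

for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 1))
```

On this toy data the value falls sharply up to k=3 and only slightly afterwards, which is the bend the elbow plot makes visible.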


**1. Segmentation using Age and Spending Score**

In [None]:
df1 = data[['Age', 'Spending_Score']].values
algorithm = KElbowVisualizer(KMeans(init='k-means++',algorithm='elkan'), k=12, metric="distortion")
algorithm.fit(df1)
algorithm.show()

The scoring parameter metric is set to ***distortion***, which computes the sum of squared distances from each point to its assigned center.
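As a sanity check on that definition, the sum of squared distances can be computed by hand and compared with the `inertia_` attribute that scikit-learn stores on a fitted `KMeans`. This is a sketch on synthetic points, not the visualizer's internal code:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D points forming two made-up groups (not the mall data).
X = np.vstack([
    rng.normal(0, 1, size=(50, 2)),
    rng.normal(8, 1, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up the center assigned to each point, then sum the squared distances.
assigned_centers = model.cluster_centers_[model.labels_]
manual_distortion = ((X - assigned_centers) ** 2).sum()

print(manual_distortion, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding.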

The visualization above suggests that **4 clusters** is a suitable choice for segmenting customers on the basis of Age and Spending Score.

In [None]:
algorithm = KMeans(n_clusters=4, init='k-means++', algorithm='elkan')
# Fit once: fit_predict fits the model and returns the cluster label of each row
y_km = algorithm.fit_predict(df1)
labels1 = algorithm.labels_  # same labels as y_km
centroids1 = algorithm.cluster_centers_
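K-means starts from randomly chosen centers, so cluster numbers can change from one fit to the next unless a seed is fixed. A minimal sketch on synthetic points, where `random_state=42` and `n_init=10` are arbitrary assumptions and not values used in this notebook:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))  # synthetic points, not the mall data

def fit_labels(seed):
    """Fit K-means with a fixed seed and return the label of each row."""
    model = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=seed)
    return model.fit_predict(X)

first = fit_labels(42)
second = fit_labels(42)

# Same seed, same data: the label arrays are identical.
labels_match = bool(np.array_equal(first, second))
print(labels_match)
```

Fixing the seed keeps a written description of "Cluster 0", "Cluster 1" and so on aligned with the labels when the notebook is re-run.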


In [None]:
plt.scatter(df1[y_km ==0,0], df1[y_km == 0,1], s=100, c='red')
plt.scatter(df1[y_km ==1,0], df1[y_km == 1,1], s=100, c='black')
plt.scatter(df1[y_km ==2,0], df1[y_km == 2,1], s=100, c='blue')
plt.scatter(df1[y_km ==3,0], df1[y_km == 3,1], s=100, c='cyan')
plt.xlabel('Age')
plt.ylabel('Spending_Score')
plt.title('Segmentation using Age and Spending Score')


**2. Segmentation using Annual Income and Spending Score**

In [None]:
df2 = data[['Annual_Income', 'Spending_Score']].values
algorithm = KElbowVisualizer(KMeans(init='k-means++',algorithm='elkan'), k=12, metric="distortion")
algorithm.fit(df2)
algorithm.show()


The visualization above suggests that **5 clusters** is a suitable choice for segmenting customers on the basis of Annual Income and Spending Score.

In [None]:
algorithm = KMeans(n_clusters=5, init='k-means++', algorithm='elkan')
# Fit once: fit_predict fits the model and returns the cluster label of each row
y_km = algorithm.fit_predict(df2)
labels1 = algorithm.labels_  # same labels as y_km
centroids1 = algorithm.cluster_centers_


In [None]:
plt.scatter(df2[y_km ==0,0], df2[y_km == 0,1], s=100, c='red')
plt.scatter(df2[y_km ==1,0], df2[y_km == 1,1], s=100, c='black')
plt.scatter(df2[y_km ==2,0], df2[y_km == 2,1], s=100, c='blue')
plt.scatter(df2[y_km ==3,0], df2[y_km == 3,1], s=100, c='cyan')
plt.scatter(df2[y_km ==4,0], df2[y_km == 4,1], s=100, c='orange')

plt.xlabel('Annual_Income')
plt.ylabel('Spending_Score')
plt.title('Segmentation using Annual Income and Spending Score')

**3. Segmentation using Age, Annual Income and Spending Score**

In [None]:
df3 = data[['Age','Annual_Income', 'Spending_Score']].values
algorithm = KElbowVisualizer(KMeans(init='k-means++',algorithm='elkan'), k=12, metric="distortion")
algorithm.fit(df3)
algorithm.show()


The visualization above suggests that **5 clusters** is a suitable choice for segmenting customers on the basis of Age, Annual Income and Spending Score.

In [None]:
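algorithm = KMeans(n_clusters=5, init='k-means++', algorithm='elkan')
# Fit once: fit_predict fits the model and returns the cluster label of each row
y_km = algorithm.fit_predict(df3)
labels1 = algorithm.labels_  # same labels as y_km; used below as the Cluster column
centroids1 = algorithm.cluster_centers_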


In [None]:

data['Cluster'] =  labels1
trace1 = go.Scatter3d(
    x= data['Age'],
    y= data['Spending_Score'],
    z= data['Annual_Income'],
    mode='markers',
     marker=dict(
        color = data['Cluster'], 
        size= 30,
        line=dict(
            color= data['Cluster'],
            width= 18
        ),
        opacity=0.8
     )
)
data1 = [trace1]
layout = go.Layout(
#     margin=dict(
#         l=0,
#         r=0,
#         b=0,
#         t=0
#     )
    title= 'Clusters',
    scene = dict(
            xaxis = dict(title  = 'Age'),
            yaxis = dict(title  = 'Spending Score'),
            zaxis = dict(title  = 'Annual Income')
        )
)
fig = go.Figure(data=data1, layout=layout)
py.offline.iplot(fig)


In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data[data['Cluster']==0].describe()

**Cluster 0 Analysis:**
* Ages range from 18 to 35 years, with an average of 25 years
* Annual income ranges from $15K to $39K, with an average of $26K
* Spending score ranges from 61 to 99

In [None]:
data[data['Cluster']==1].describe()

**Cluster 1 Analysis:**
* Ages range from 19 to 59 years, with an average of 40 years
* Annual income ranges from $70K to $137K, with an average of $87K
* Spending score ranges from 1 to 40

In [None]:
data[data['Cluster']==2].describe()

**Cluster 2 Analysis:**
* Ages range from 27 to 40 years, with an average of 32 years
* Annual income ranges from $69K to $137K, with an average of $86K
* Spending score ranges from 63 to 97

In [None]:
data[data['Cluster']==3].describe()

**Cluster 3 Analysis:**
* Ages range from 19 to 67 years, with an average of 45 years
* Annual income ranges from $15K to $39K, with an average of $26K
* Spending score ranges from 3 to 40

In [None]:
data[data['Cluster']==4].describe()

**Cluster 4 Analysis:**
* Ages range from 18 to 70 years, with an average of 43 years
* Annual income ranges from $39K to $71K, with an average of $54K
* Spending score ranges from 35 to 61
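The per-cluster ranges and averages above can also be collected in a single table with one `groupby`, rather than one `describe()` call per cluster. A sketch on hypothetical rows that already carry a `Cluster` label (the numbers are invented):

```python
import pandas as pd

# Hypothetical customers with an already assigned cluster label.
toy = pd.DataFrame({
    "Age": [20, 30, 45, 55, 35, 41],
    "Annual_Income": [20, 30, 80, 90, 50, 60],
    "Spending_Score": [80, 90, 10, 20, 50, 40],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# One row per cluster; min, mean and max for every numeric feature.
profile = toy.groupby("Cluster").agg(["min", "mean", "max"])
print(profile)
```

Applied to `data`, the same call would need the non-numeric `Gender` column dropped or excluded first.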

**Conclusion**


Based on the analysis above, the following five clusters segment the customers well.


**Cluster 0** (average age: 25): Customers with the lowest average income who nevertheless spend a lot

**Cluster 1** (average age: 40): Settled customers (likely with a family) with a high average income who spend little

**Cluster 2** (average age: 32): Customers who are settling down, with a high average income and high spending

**Cluster 3** (average age: 45): Settled customers (likely with a family) with the lowest average income who spend little

**Cluster 4** (average age: 43): Settled customers (likely with a family) with a mid-range income and moderate spending

The mall's marketing team can use these clusters to plan a targeted strategy for each segment.