# Customer segmentation basics 

![pic](https://acquire.io/wp-content/uploads/2016/09/25-Awesome-Customer-Service-Tips-You-Must-Employ-Updated%E2%80%A9.png)

Customer segmentation (or market segmentation) is the process of dividing customers into groups based on common characteristics so that companies can market to each group effectively and appropriately. All customers share a common need for your product or service, but beyond that there are distinct demographic differences (e.g., age, gender), and customers tend to have additional socio-economic, lifestyle, or other behavioral differences that can be useful to the organization.
In this notebook we're going to split customers into segments according to their age and income.

Let's go!

* [Data overview and preparation](#section-one)
* [EDA](#section-two)
    - [Spending score and Annual income](#subsection-one)
    - [Spending score and Age](#subsection-two)
* [Customer segmentation with K-means](#section-three)
    - [Spending score and Annual income](#subsection-two-one)
    - [Spending score and Age](#subsection-two-two)
* [Customer segmentation with DBSCAN](#section-four)
    - [Spending score and Annual income](#subsection-three-one)
    - [Spending score and Age](#subsection-three-two)
* [Results](#section-five)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

df=pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

First of all, a quick overview of the data for better understanding.

<a id="section-one"></a>
# Data overview and preparation

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
# dropna() returns a new DataFrame, so assign the result to keep the change
df = df.dropna()

<a id="section-two"></a>
# EDA

In [None]:
colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only

In [None]:
plt.figure(1,figsize=(15,6))
n=0
for x in colnames_numerics_only:
    n+=1
    plt.subplot(1,4,n)
    plt.subplots_adjust(hspace=0.5,wspace=0.5)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df[x], bins=20, kde=True, stat="density")
    plt.title('Distribution of {}'.format(x))
plt.show()

In [None]:
plt.figure(1,figsize=(15,10))
n=0
for x in colnames_numerics_only:
    for y in colnames_numerics_only:
        n+=1
        plt.subplot(4,4,n)
        plt.subplots_adjust(hspace=0.5,wspace=0.5)
        sns.regplot(x=x,y=y,data=df)
        plt.ylabel(y.split()[0]+''+y.split()[1] if len(y.split())>1 else y)
plt.show()

<a id="subsection-one"></a>
## Spending score and Annual income

In [None]:
plt.figure(figsize=(10,8))
plt.scatter(df["Spending Score (1-100)"], df["Annual Income (k$)"])

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(data=df, x="Spending Score (1-100)", y="Annual Income (k$)", hue="Gender")

<a id="subsection-two"></a>
## Spending score and Age

In [None]:
plt.figure(figsize=(10,8))
plt.scatter(df["Spending Score (1-100)"], df["Age"])

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(data=df, x="Spending Score (1-100)", y="Age", hue="Gender")

In [None]:
plt.figure(figsize=(7,5))
ax = sns.boxplot(x="Gender", y="Spending Score (1-100)", data=df)

<a id="section-three"></a>
# Customer segmentation with K-means

First, we take Spending score and Annual income.
Looking at the scatter plot, there seem to be 5 groups of customers. However, we should verify this. The two most widely used methods for determining the optimal number of clusters are the Elbow method and the Silhouette method. BIC is also used in some cases. Let's look at all of them!

<a id="subsection-two-one"></a>
## Spending score and annual income

In [None]:
df_income_score = df.iloc[:, [False, False, False, True, True]].values

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_income_score_scaled=scaler.fit_transform(df_income_score)

In [None]:
df_income_score_scaled



The Elbow method helps to select the optimal number of clusters by fitting the model for a range of values of K and plotting the distortion against K. If the line chart resembles an arm, then the “elbow” (the point where the curve bends and further increases in K bring only small improvements) is a good indication that the underlying model fits best at that point.
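The elbow cell in this notebook collects `KMeans.inertia_` as its distortion measure, which is the sum of squared distances from each point to its assigned centroid. As a minimal sketch, assuming synthetic data (the three blobs below are made up and only stand in for the scaled customer matrix), the value can be recomputed by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a scaled two-feature matrix: three blobs inside [0, 1]^2.
rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]])
X = np.vstack([c + 0.05 * rng.standard_normal((50, 2)) for c in centers])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to every point, then sum the squared distances.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = float(((X - assigned_centroids) ** 2).sum())

print("manual inertia: ", manual_inertia)
print("model.inertia_: ", model.inertia_)
```

The two printed numbers should agree up to floating-point error. Inertia always shrinks as K grows, which is why the method looks for a bend in the curve rather than a minimum.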


In [None]:
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, n_init=10, random_state=0)  # fixed seed for reproducible curves
    kmeanModel.fit(df_income_score_scaled)
    distortions.append(kmeanModel.inertia_)
    
plt.figure(figsize=(10,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

The Elbow method suggests that 5 is optimal. But what about the Silhouette method? By the way, the Elbow method and the Silhouette method are not alternatives to each other for finding the optimal number of clusters. Rather, they are tools to be used together for a more confident decision.

The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation). Its values lie within [-1, 1], and higher is better. The optimal number of clusters is the one at which the average silhouette score peaks.
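For a single point the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. A tiny worked example, assuming four made-up one-dimensional points, compares the hand calculation with `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four made-up points on a line, split into two obvious clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# Silhouette of the first point, computed by hand.
a = abs(X[0, 0] - X[1, 0])                           # cohesion: distance to its only cluster mate = 1
b = np.mean(np.abs(X[0, 0] - X[labels == 1, 0]))     # separation: mean of 10 and 11 = 10.5
s_manual = (b - a) / max(a, b)                       # (10.5 - 1) / 10.5

s_sklearn = silhouette_samples(X, labels)[0]
print(s_manual, s_sklearn)
```

Both values are about 0.905, close to the upper bound of 1, which reflects how well separated the two toy clusters are. `silhouette_score`, used in the next cell, is the mean of these per-point values.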

In [None]:
s_scores = []
clusters = [2,3,4,5,6,7,8,9,10]
clusters_inertia = []

for n in clusters:
    KM_est = KMeans(n_clusters=n, init='k-means++', n_init=10, random_state=0).fit(df_income_score_scaled)
    clusters_inertia.append(KM_est.inertia_)   
    silhouette_avg = silhouette_score(df_income_score_scaled, KM_est.labels_)
    s_scores.append(silhouette_avg)
    
    
fig, ax = plt.subplots(figsize=(12,5))
# recent seaborn versions require x and y to be passed as keywords
ax = sns.lineplot(x=clusters, y=s_scores, marker='o', ax=ax)

ax.set_xlabel("Number of clusters")
ax.set_ylabel("Silhouette score")
plt.title('The Silhouette Method showing the optimal k')
plt.grid()
plt.show()

Finally, let's take a look at the Bayesian information criterion (BIC).

In [None]:
from sklearn.mixture import GaussianMixture

gm_bic= []
gm_score=[]
for i in range(2,12):
    gm = GaussianMixture(n_components=i,n_init=10,tol=1e-3,max_iter=1000).fit(df_income_score_scaled)
    gm_bic.append(-gm.bic(df_income_score_scaled))
    gm_score.append(gm.score(df_income_score_scaled))
    
plt.figure(figsize=(7,4))
plt.title("The Gaussian Mixture model BIC",fontsize=16)
plt.scatter(x=[i for i in range(2,12)],y=np.log(gm_bic),s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters",fontsize=14)
plt.ylabel("Log of negated Gaussian mixture BIC",fontsize=15)
plt.xticks([i for i in range(2,12)],fontsize=14)
plt.yticks(fontsize=15)
plt.show()

Because the plot shows the negated BIC, the best model sits at the peak, and here again the optimal value is 5. Great! Going straight to the K-means model...
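For scikit-learn's `GaussianMixture.bic`, lower values are better, so an equivalent way to read the result is to take the number of components with the smallest raw BIC. A minimal sketch, assuming synthetic data (three made-up blobs, not the customer data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in: three well-separated blobs inside [0, 1]^2.
rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]])
X = np.vstack([c + 0.04 * rng.standard_normal((60, 2)) for c in centers])

# Fit one mixture per candidate number of components and record its raw BIC.
candidates = list(range(1, 7))
bics = [
    GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X).bic(X)
    for k in candidates
]

# Lower BIC is better, so the preferred model is the argmin.
best_k = candidates[int(np.argmin(bics))]
print("BIC per k:", dict(zip(candidates, np.round(bics, 1))))
print("k with the lowest BIC:", best_k)
```

Picking the minimum of the raw BIC and picking the peak of the negated BIC select the same model. The negation is only a plotting convenience.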

In [None]:
kmeanModel = KMeans(n_clusters=5,init='k-means++',max_iter=300,n_init=10,random_state=0)
y_kmeans= kmeanModel.fit_predict(df_income_score)
plt.figure(figsize=(8,8))
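# 'g' and 'green' are the same colour in matplotlib, so Cluster 1 uses 'lightgreen' to stay distinguishable from Cluster 5
plt.scatter(df_income_score[y_kmeans == 0, 0], df_income_score[y_kmeans == 0, 1], s = 100, c = 'lightgreen', label = 'Cluster 1')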
plt.scatter(df_income_score[y_kmeans == 1, 0], df_income_score[y_kmeans == 1, 1], s = 100, c = 'b', label = 'Cluster 2')
plt.scatter(df_income_score[y_kmeans == 2, 0], df_income_score[y_kmeans == 2, 1], s = 100, c = 'r', label = 'Cluster 3')
plt.scatter(df_income_score[y_kmeans == 3, 0], df_income_score[y_kmeans == 3, 1], s = 100, c = 'burlywood', label = 'Cluster 4')
plt.scatter(df_income_score[y_kmeans == 4, 0], df_income_score[y_kmeans == 4, 1], s = 100, c = 'green', label = 'Cluster 5')
plt.scatter(kmeanModel.cluster_centers_[:, 0], kmeanModel.cluster_centers_[:, 1], s = 200, c = 'black', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

Groups could be described as: 
* Blue - low annual income and high spending score (careless)
* Red - high income and high spending score (target)
* Light green - medium income and medium spending score (standard)
* Dark green - high income and low spending score (careful)
* Beige - low income and low spending score (sensible)
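Verbal descriptions like these can be backed up with a per-cluster summary table. A minimal sketch, assuming synthetic data: the `toy` frame and its five blobs are made up, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

income_col = "Annual Income (k$)"
score_col = "Spending Score (1-100)"

# Made-up customers: five blobs in the (income, score) plane, 40 points each.
rng = np.random.default_rng(42)
centers = [(25, 80), (25, 20), (55, 50), (90, 80), (90, 20)]
points = np.vstack([np.array(c) + 4 * rng.standard_normal((40, 2)) for c in centers])
toy = pd.DataFrame(points, columns=[income_col, score_col])

# Cluster on the two feature columns only, then store the label next to them.
model = KMeans(n_clusters=5, n_init=10, random_state=0)
toy["cluster"] = model.fit_predict(toy[[income_col, score_col]])

# One row per cluster: size plus mean income and mean spending score.
profile = toy.groupby("cluster").agg(
    customers=(income_col, "size"),
    mean_income=(income_col, "mean"),
    mean_score=(score_col, "mean"),
).round(1)
print(profile)
```

Reading the means row by row makes it easier to decide which cluster deserves a label such as "target" or "careful" without relying on plot colours alone.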

<a id="subsection-two-two"></a>
## Spending score and age

In [None]:
df_age_score = df.iloc[:, [False, False, True, False, True]].values

scaler = MinMaxScaler()
df_age_score_scaled=scaler.fit_transform(df_age_score)

In [None]:
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, n_init=10, random_state=0)  # fixed seed for reproducible curves
    kmeanModel.fit(df_age_score_scaled)
    distortions.append(kmeanModel.inertia_)
    
plt.figure(figsize=(10,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

In [None]:
s_scores = []
clusters = [2,3,4,5,6,7,8,9,10]
clusters_inertia = []

for n in clusters:
    KM_est = KMeans(n_clusters=n, init='k-means++', n_init=10, random_state=0).fit(df_age_score_scaled)
    clusters_inertia.append(KM_est.inertia_)   
    silhouette_avg = silhouette_score(df_age_score_scaled, KM_est.labels_)
    s_scores.append(silhouette_avg)
    
    
fig, ax = plt.subplots(figsize=(12,5))
# recent seaborn versions require x and y to be passed as keywords
ax = sns.lineplot(x=clusters, y=s_scores, marker='o', ax=ax)

ax.set_xlabel("Number of clusters")
ax.set_ylabel("Silhouette score")
plt.title('The Silhouette Method showing the optimal k')
plt.grid()
plt.show()

The optimal value here is 6. Let's try it!

In [None]:
kmeanModelAge = KMeans(n_clusters=6,init='k-means++',max_iter=300,n_init=10,random_state=0)
y_kmeansAge= kmeanModelAge.fit_predict(df_age_score)
plt.figure(figsize=(8,8))
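# 'g' and 'green' are the same colour in matplotlib, so Cluster 1 uses 'lightgreen' to stay distinguishable from Cluster 5
plt.scatter(df_age_score[y_kmeansAge == 0, 0], df_age_score[y_kmeansAge == 0, 1], s = 100, c = 'lightgreen', label = 'Cluster 1')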
plt.scatter(df_age_score[y_kmeansAge == 1, 0], df_age_score[y_kmeansAge == 1, 1], s = 100, c = 'b', label = 'Cluster 2')
plt.scatter(df_age_score[y_kmeansAge == 2, 0], df_age_score[y_kmeansAge == 2, 1], s = 100, c = 'grey', label = 'Cluster 3')
plt.scatter(df_age_score[y_kmeansAge == 3, 0], df_age_score[y_kmeansAge == 3, 1], s = 100, c = 'burlywood', label = 'Cluster 4')
plt.scatter(df_age_score[y_kmeansAge == 4, 0], df_age_score[y_kmeansAge == 4, 1], s = 100, c = 'green', label = 'Cluster 5')
plt.scatter(df_age_score[y_kmeansAge == 5, 0], df_age_score[y_kmeansAge == 5, 1], s = 100, c = 'red', label = 'Cluster 6')
plt.scatter(kmeanModelAge.cluster_centers_[:, 0], kmeanModelAge.cluster_centers_[:, 1], s = 200, c = 'black', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

Groups could be described as:

* Blue - middle-aged with medium spending
* Red - young with low spending
* Light green - young with medium spending
* Dark green - the elderly with low spending
* Beige - the elderly with medium spending
* Grey - young people who spend a lot (seems to be the target group)

<a id="section-four"></a>
# Customer segmentation with DBSCAN

<a id="subsection-three-one"></a>
## Spending score and annual income

Density-based clustering refers to unsupervised learning methods that identify distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density. Simply put, the main idea of the DBSCAN algorithm is to locate regions of high density that are separated from one another by regions of low density.

There are two parameters we need to set:
* Eps (ε) - the distance (radius) around each point. The higher eps is, the more elements will be included in a particular group and the lower the density of that group will be.
* MinPts - the minimum number of data points that should lie around a point within that radius. The higher minPts is, the more points can potentially end up as outliers, because more detached points will be excluded from clusters.

It should be said that selecting the parameters isn't easy; it takes time and several iterations. The reason is that suitable parameters are specific to each dataset and task.
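One commonly used heuristic for narrowing down the range of eps values to try is a sorted k-distance curve: for every point, take the distance to its k-th nearest neighbour, sort those distances, and look for the bend. This is not the procedure used in this notebook, where eps was picked by trial. A minimal sketch, assuming synthetic data (a made-up dense blob plus scattered points):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up data in [0, 1]^2: one dense blob plus a few scattered points.
rng = np.random.default_rng(1)
dense = 0.5 + 0.03 * rng.standard_normal((150, 2))
sparse = rng.uniform(0, 1, size=(20, 2))
X = np.vstack([dense, sparse])

min_samples = 5

# Querying the training data returns each point as its own first neighbour
# (distance 0), which matches scikit-learn's DBSCAN, where min_samples
# counts the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the farthest of those min_samples neighbours, sorted ascending.
k_dist = np.sort(distances[:, -1])

print("median k-distance:", round(float(np.median(k_dist)), 4))
print("largest k-distance:", round(float(k_dist[-1]), 4))
```

Plotting `k_dist` with `plt.plot(k_dist)` gives a curve that stays flat for points in dense regions and rises sharply for isolated ones. Eps values near the bend are reasonable candidates to try first, though the final choice still needs the kind of iteration described above.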

In [None]:
df_income_score = df.iloc[:, [False, False, False, True, True]]
df_norm = scaler.fit_transform(df_income_score)

In [None]:
df_income_score

Here I tried a range of eps values and selected 0.09, which gives an adequate number of clusters. By the way, the default value is 0.5, but here the features are scaled to [0, 1], so the points are much more densely packed.
As for min_samples, it seems logical to use approximately 10 or more, as there are 200 customers in the dataset in total; however, this leads to a huge number of points being considered outliers and an inadequate result. I decreased it and decided to use the value of 5.

In [None]:
from sklearn.cluster import DBSCAN


DBS_clustering = DBSCAN(eps=0.09, min_samples=5).fit(df_norm)
DBSCAN_clustered = df_norm.copy()
labels = DBS_clustering.labels_
labels

In [None]:
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
n_clusters_

In [None]:

colors = ['g','r','b','y','burlywood','green', 'm', 'c']
plt.figure(figsize=(8,8))
# DBSCAN labels clusters 0 .. n_clusters_ - 1, so iterate over all of them
for i in range(n_clusters_):
    plt.scatter(df_norm[labels == i, 0], df_norm[labels == i, 1], s = 100, c = colors[i], label = 'Cluster ' + str(i + 1))
plt.scatter(df_norm[labels == -1, 0], df_norm[labels == -1, 1], s = 50, c = 'black', label = 'Outliers')    
plt.legend()

Here we have 5 clusters that look different from what we had using K-means. In terms of customer segmentation and marketing strategies, the black outliers here should rather be interpreted as actual customers, but this is how the algorithm works ;-)
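If every customer has to end up in some segment, one possible post-processing step (an assumption for illustration, not something this notebook does) is to hand each noise point to the cluster of its nearest clustered neighbour. A minimal sketch, assuming synthetic data (two made-up blobs plus three stragglers):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: two dense blobs plus three isolated stragglers.
rng = np.random.default_rng(3)
blob_a = np.array([0.25, 0.25]) + 0.03 * rng.standard_normal((60, 2))
blob_b = np.array([0.75, 0.75]) + 0.03 * rng.standard_normal((60, 2))
stragglers = np.array([[0.25, 0.55], [0.75, 0.45], [0.5, 0.1]])
X = np.vstack([blob_a, blob_b, stragglers])

labels = DBSCAN(eps=0.09, min_samples=5).fit_predict(X)
noise = labels == -1

# Train a 1-nearest-neighbour classifier on the clustered points only,
# then use it to give every noise point the label of its closest neighbour.
knn = KNeighborsClassifier(n_neighbors=1).fit(X[~noise], labels[~noise])
filled = labels.copy()
filled[noise] = knn.predict(X[noise])

print("noise points before:", int(noise.sum()))
print("noise points after: ", int((filled == -1).sum()))
```

Whether this is appropriate depends on the use case: genuinely unusual customers get forced into a segment they may not resemble, so it can be worth keeping the original noise flag alongside the filled label.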

<a id="subsection-three-two"></a>
## Spending score and age

In [None]:
df_age_score = df.iloc[:, [False, False, True, False, True]]
df_norm_age = scaler.fit_transform(df_age_score)

DBS_clustering = DBSCAN(eps=0.08, min_samples=5).fit(df_norm_age)
DBSCAN_clustered = df_norm_age.copy()
labels = DBS_clustering.labels_
labels

In [None]:
n_clusters_age = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
n_clusters_age

In [None]:
colors = ['g','r','b','y','burlywood','green', 'm', 'c']
plt.figure(figsize=(8,8))
# DBSCAN labels clusters 0 .. n_clusters_age - 1, so iterate over all of them
for i in range(n_clusters_age):
    plt.scatter(df_norm_age[labels == i, 0], df_norm_age[labels == i, 1], s = 100, c = colors[i], label = 'Cluster ' + str(i + 1))
plt.scatter(df_norm_age[labels == -1, 0], df_norm_age[labels == -1, 1], s = 50, c = 'black', label = 'Outliers')    
plt.legend()

Here there are 6 clusters with some black outliers. Again, the result differs from the K-means one. However, clusters 1 and 4 look reasonable and logical. It would probably be better to merge some of the clusters 2, 3, 5 and 6, because they look too small to represent a whole category of customers.

<a id="section-five"></a>
# Results

All in all, we've successfully found several groups that show the spending score of customers depending on their age or annual income. These groups could be applied in marketing to optimize customer attraction and retention campaigns, as well as in strategic management and other business areas. Comparing the results of the two algorithms, K-means appears to perform better than DBSCAN for this particular task. However, this hypothesis can only be confirmed by applying our results and testing them.

