# Customer segmentation 

**A simple and pragmatic approach: V1**   
*A more detailed version will follow this kernel: V2!*  
*An expansion (segmenting with transactional data) will then follow: V3*   

<img src="https://cdn.pixabay.com/photo/2014/04/03/00/41/people-309099_960_720.png" width="250px" align="right">



The purpose of this analysis is to segment the customers in the "Mall customers" dataset.  
The dataset consists of 200 customers, each identifiable by "Customer ID".    
Segmentation is based on the variables in the dataset:  
* Gender	
* Age  
* Annual Income (k$)*  
* Spending Score (1-100)*

*\*Variables used for this approach!*

**Note!** This is a pragmatic approach, focusing on the output! I am skipping "details" here and there, e.g. the K-Means "bend analysis" (also known as the elbow method) to determine the number of clusters, and the analysis of differences between male/female and of the age distribution. I am also only focusing on a single method, K-Means clustering.  

If you want more information on K-Means, [read more on this site.](https://towardsdatascience.com/how-does-k-means-clustering-in-machine-learning-work-fdaaaf5acfa0) 
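
For intuition, here is a minimal sketch of what K-Means does under the hood (Lloyd's algorithm): assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing moves. The function `simple_kmeans`, its arguments and the toy data are my own names, made up for illustration; the rest of this kernel uses scikit-learn's `KMeans`, which also picks the starting centroids more cleverly (`k-means++`).

```python
import numpy as np

def simple_kmeans(X, init_centroids, n_iter=100):
    """Toy K-Means (Lloyd's algorithm), for illustration only."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(n_iter):
        # Step 1: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of the points assigned to it
        # (a centroid that lost all its points simply stays where it is)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break  # nothing moved: converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs (made-up data, not the mall dataset)
rng = np.random.default_rng(42)
toy = np.vstack([
    rng.normal(loc=[20, 20], scale=2, size=(50, 2)),
    rng.normal(loc=[80, 80], scale=2, size=(50, 2)),
])

# Start from the first and the last point, i.e. one point from each blob
toy_labels, toy_centroids = simple_kmeans(toy, init_centroids=toy[[0, -1]])
print(toy_centroids.round(1))
```

The result depends on where the centroids start, which is why the real `KMeans` runs several initializations (`n_init`) and keeps the best one.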



In [None]:
#standard data libraries
import pandas as pd
import numpy as np

#visualizations
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')

#machine learning models and related libs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn import metrics
from scipy.spatial.distance import cdist
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings("ignore")

# Exploratory data analysis (EDA)  
If you only want the segmentation part of the process, **feel free to skip this step.**  

**Getting to know the data.**

>We need to know e.g:  
>>How many rows and columns are in the dataset?  
>>Are there null-values we need to handle?  
>>What is the data type of each variable?  
>>What is the distribution of each variable?


I have kept all the information in one "print". 

*Not user-friendly? I know, just skip it then!*

In [None]:
df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
print('The "shape" of the dataset:\nRows:\t ', df.shape[0], 
      '\nColumns: ', df.shape[1], 
      '\n\n\nThe data types: \n', df.dtypes,
      '\n\n\nInformation: \n', df.describe().round(1),
      '\n\n\nDo we have null values?: \n', df.isnull().sum())

In [None]:
#I am renaming the columns; just to make the columns names a bit easier to work with.
df.rename(
    columns = {
        'CustomerID':'id', 
        'Gender':'gender',
        'Age':'age', 
        'Annual Income (k$)':'income', 
        'Spending Score (1-100)':'score'
        },
    inplace = True
    )
df.set_index('id', inplace=True)

## Visual overview
**Not necessarily part of the simplest and most pragmatic approach, but I am visually oriented.**  

**Correlation** between the variables can be seen below.  

In this analysis, I only focus on income and score. We see no significant correlation between the variables; that's a green light!  

If we had a larger number of highly correlated variables, we should consider dimensionality reduction, e.g. through a Principal Component Analysis (PCA).

In [None]:
plt.figure(1, figsize=(4, 4))
corr = df.select_dtypes('number').corr()  # 'gender' is text, so leave it out
sns.heatmap(abs(corr), annot=corr, cmap='RdYlBu_r', square=True, cbar=False)
plt.suptitle('Heatmap of correlation between key variables')
plt.show()
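
The PCA idea mentioned above is not needed for this dataset, but here is a rough sketch of how it could look when many variables are highly correlated. The data is made up (three variables that move together plus one independent variable), and the 95% variance target is just an assumption for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data: three strongly correlated variables plus one independent one
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_demo = np.column_stack([
    base + rng.normal(scale=0.1, size=200),
    2 * base + rng.normal(scale=0.1, size=200),
    -base + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Standardize first so no variable dominates because of its unit
X_scaled = StandardScaler().fit_transform(X_demo)

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca.fit_transform(X_scaled)

print('Components kept:', pca.n_components_)
print('Explained variance ratio:', pca.explained_variance_ratio_.round(3))
```

The reduced columns could then be passed to K-Means instead of the original variables, at the cost of segments that are harder to describe in plain business terms.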

A **pairplot** also gives a nice overview of the variables.  

This creates a scatterplot between variables and a histogram/kde of the distribution for each variable. We can e.g. spot outliers and see patterns, such as correlation.  

In [None]:
# pairplot creates its own figure, so no plt.figure() call is needed here
g = sns.pairplot(df, hue='gender', palette=['#00616f','#f47920'])
plt.suptitle('Pairplot overview', y=1.02)

for i, j in zip(*np.triu_indices_from(g.axes, 1)):
    g.axes[i, j].set_visible(False)
    
plt.show()

What we are especially interested in is the relationship between "score" and "income", as we use these variables for segmentation.  

Let's have a closer look in the figure below.  

In this dataset, we can already begin to see patterns. 

In [None]:
plt.figure(1, figsize=(6, 6))
plt.scatter(x=df.income, y=df.score)
plt.xlabel('income')
plt.ylabel('score')
plt.show()

# The segments!

**Number of clusters**  

There are a number of different ways to find the "correct" number of clusters in a K-Means analysis. The most frequently used is the so-called "Bend Analysis", also known as the elbow method.  

I am skipping this step. From a pragmatic point of view, I want the segments to be a) large enough to be "useful" and b) different enough to be targeted.  

*From a marketer's perspective, all clusters large enough to constitute a segment should be deemed valid, as long as the segments are "different enough" from each other, depending on the purpose of the segmentation, e.g. customer targeting. A general rule of thumb is that a segment should contain at least 10% of the data set.*
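
Even though I skip the bend analysis here, this is a minimal sketch of how it could be done: fit K-Means for a range of k and look at the inertia (within-cluster sum of squared distances). The data below is made up (four blobs in an income/score-like plane), not the mall dataset, and the range of k is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: four blobs in an income/score-like plane
rng = np.random.default_rng(1)
centers = np.array([[20, 20], [20, 80], [80, 20], [80, 80]])
X_demo = np.vstack([rng.normal(loc=c, scale=5, size=(50, 2)) for c in centers])

# Inertia generally drops as k grows, so we look for the "bend" in the curve
k_values = list(range(1, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=101).fit(X_demo).inertia_
    for k in k_values
]

# The bend is where adding one more cluster stops paying off
for k, inertia in zip(k_values, inertias):
    print(f'k={k}: inertia={inertia:,.0f}')
```

Plotting `k_values` against `inertias` gives the familiar elbow curve; on real data the bend is often much less obvious than on toy blobs, which is one reason to combine it with the size rule above.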

In [None]:
sc = StandardScaler()
rs = RobustScaler()
mm = MinMaxScaler()

SCALER = None

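def clusters(data, no_clusters, scaler):
    # Use only the feature columns; a 'cluster' column left over from an
    # earlier run must not be fed back into K-Means as a feature.
    features = data.drop(columns='cluster', errors='ignore')
    kmeans = KMeans(
        n_clusters=no_clusters,
        max_iter=1000000000,
        n_init=20,
        init='k-means++',
        random_state=101,
        algorithm='lloyd'  # called 'full' before scikit-learn 1.1
        )
    if scaler is not None:
        X = scaler.fit_transform(features)
    else:
        X = features
    # Fit once and return a labelled copy, leaving the input untouched
    result = features.copy()
    result['cluster'] = kmeans.fit_predict(X)
    return result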

data = df.drop(['gender','age'], axis=1)
#data = df.drop(['gender'], axis=1)

**Let's go through different scenarios, between 1 and 9 different clusters, i.e. segments**  

Try going through each scenario, from 1 through 9 clusters, to see the clusters evolve... exciting, right!!!

In [None]:
plt.figure(1 , figsize=(17 , 17))
n = 0 
for no in range(1,10):
    n += 1
    plt.subplot(3 , 3 , n)
    plt.subplots_adjust(hspace=0.25 , wspace=0.25)
    segments = clusters(data=data, no_clusters=no, scaler=SCALER)
    plt.title('No. clusters: {}'.format(no))
    sns.scatterplot(data=segments, x='income',y='score',hue='cluster', palette='tab10', s=70, legend=False)    
plt.show()

The above figure shows the process of dividing data points (customers) into clusters (segments). Does it make sense?

In [None]:
def ch_plot_for_k_means(X, k_range, resample):
    plt.figure(figsize=[12,5])
    for _ in range(resample):  # one curve per run, to show run-to-run variation
        scores = []
        for k in k_range:
            kmeansModel = KMeans(n_clusters = k).fit(X)
            labels = kmeansModel.labels_
            scores.append(metrics.calinski_harabasz_score(X, labels)) 
        plt.plot(k_range, scores)
    plt.xticks(k_range)
    plt.title("Optimal number of clusters Calinski-Harabasz criterion")
    plt.xlabel("Number of clusters")
    plt.ylabel("Calinski - Harabasz statistic")
    plt.show()
ch_plot_for_k_means(data, range(2,15), 3)

#thanks to: https://www.kaggle.com/gmateusz/customer-segmentation-using-kmeans

In [None]:
def counts(no_clusters, if_print_1):
    clus = clusters(data, no_clusters, SCALER).reset_index()
    clus = df.reset_index().merge(clus).set_index('id')
    clus_count = clus.groupby('cluster')['score'].count()
    clus_count = clus_count.to_frame()
    clus_count = clus_count.rename(columns={'score':'no_customers'})
    clus_count['pct. of total'] = clus_count['no_customers'].apply(lambda x: round(100*(x / len(data)),2))
    if if_print_1 == 1:
        mini = str(min(clus_count['pct. of total'])) + '%'
        print('Smallest cluster count in pct. :', mini, ', with ', no_clusters, ' clusters')
    clus_count['pct. of total'] = clus_count['pct. of total'].apply(lambda x: str(x) + '%')
    return clus, clus_count
clus, clus_count = counts(5, 0)

**Let's go on using the 5-cluster segmentation.**   

First, let's make sure that the segments are large enough to actually make sense. With only 200 customers, I'll use the 10% rule of thumb. 

In [None]:
clus_count

We are above the 10% threshold, great! 

But what about the other cluster solutions? If we e.g. go up to 6 segments, are we still above the threshold?  

In [None]:
for num in range(2,10):
    counts(num, 1)

I've iterated over all scenarios, and it turns out that the 5-cluster solution is the one just within the threshold! With 6 clusters, the smallest segment is only a little above 5% of the customers.
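
The size check itself is simple arithmetic. As a small sketch (the helper names `smallest_segment_share` and `passes_size_rule` and the example labels are made up for illustration), it boils down to this:

```python
from collections import Counter

def smallest_segment_share(labels):
    """Share of customers (in %) that sit in the smallest cluster."""
    counts = Counter(labels)
    return round(100 * min(counts.values()) / len(labels), 2)

def passes_size_rule(labels, min_pct=10):
    """True if every cluster holds at least `min_pct` percent of the customers."""
    return smallest_segment_share(labels) >= min_pct

# Made-up labels: 200 customers in clusters of 80, 60, 40 and 20
example_labels = [0] * 80 + [1] * 60 + [2] * 40 + [3] * 20

print(smallest_segment_share(example_labels))
print(passes_size_rule(example_labels))
print(passes_size_rule(example_labels, min_pct=15))
```

The 10% default mirrors the rule of thumb used in this kernel; with more customers a lower threshold may be perfectly fine.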


Secondly, a **pairplot to visualize** the differences between the clusters, in addition to the illustration above:

In [None]:
sns.pairplot(clus, hue='cluster', vars=['income','age','score'])
plt.show()

# Interpretation

**We want data!** 

You got it! The mean value for each variable:

In [None]:
# Average only the numeric columns; 'gender' is text
figdata = clus.select_dtypes('number').pivot_table(
    index=['cluster'],
    aggfunc='mean')

figdata = figdata.reset_index().merge(clus_count.reset_index()[['no_customers','cluster']]).set_index('cluster')
figdata = figdata.sort_values('score', ascending=False)

sc = StandardScaler()
figinfo = sc.fit_transform(figdata)
# Keep the cluster ids as row labels so the heatmap rows can be identified
figinfo = pd.DataFrame(figinfo, columns=figdata.columns, index=figdata.index)

plt.figure(1 , figsize=(6 ,6))
sns.heatmap(data=figinfo, annot=figdata, cmap='RdYlBu_r', cbar=False)
plt.suptitle('Heatmap for interpretation')
plt.show()

## From clusters to segments:

**In order to actually use the segmentation, we need to interpret the output.  
To keep it simple, let's just do this without chi2 or other measures, just freestyling.**

**From the top!**

* **Big spenders!**   
  * High income / high score
  * Young and with money to spend
  * Big spenders either due to 
    * "independence" (increasing income, still low living costs) 
    * "nesting" and/or parenthood (increasing costs)


  
* **Mall-rats!**   
  * Low income / high score
  * Young people, maybe hanging out at cafés, shopping for sneakers
  * They are likely using a large share-of-wallet at the mall
  
  

* **The middleground**
  * Medium income / medium score
  * Description
  * Largest segment
  
  
  
* **Low demanders**  
  * Low spenders / low income
  * Maybe with low disposable income
  * Not in the immediate target audience
  
  

  
* **Low spenders** 
  * Low spender / high income
  * Potential!
  * High income customers, but maybe not in the immediate target audience  
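
The "freestyling" above can also be sketched as a crude rule: bucket each cluster's mean income and mean score into low / medium / high and read off the combination. Everything in this example is hypothetical: the helper `level`, the cluster means and the cut-offs are made up to show the mechanics and are not the results of this kernel.

```python
import pandas as pd

def level(value, low_cut, high_cut):
    """Bucket a value into 'low', 'medium' or 'high' using two cut-offs."""
    if value < low_cut:
        return 'low'
    if value > high_cut:
        return 'high'
    return 'medium'

# Made-up cluster means and cut-offs, purely to show the mechanics
cluster_means = pd.DataFrame(
    {'income': [90, 25, 55, 25, 90], 'score': [80, 80, 50, 20, 20]},
    index=pd.Index(range(5), name='cluster'),
)
cuts = {'income': (40, 70), 'score': (40, 60)}

# One low/medium/high column per variable, then a readable label per cluster
profile = pd.DataFrame({
    col: cluster_means[col].apply(level, args=cuts[col])
    for col in cuts
})
profile['label'] = profile['income'] + ' income / ' + profile['score'] + ' score'
print(profile['label'])
```

A rule like this only gives a starting point; the catchy segment names and the story behind them still need a human.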
  

## ... and then?  
You have your segments! Go target them, spam them with emails, give them coupons!!!


## I will make at least one more kernel on customer segmentation. 
**It will include this dataset and analysis in much more detail AND transactional data to make the RFM analysis part of the process. Hang tight!**

# Extra: Single-function solution
**Just an easy way to test different variations**

In [None]:
def clusters(data, no_clusters,  k_range, df):
    
    print('Initiating...')
    
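    def clusters(data, no_clusters):
        # Use only the feature columns; a 'cluster' column from an earlier
        # run must not be fed back into K-Means as a feature.
        features = data.drop(columns='cluster', errors='ignore')
        kmeans = KMeans(
            n_clusters=no_clusters,
            max_iter=1000000000,
            n_init=20,
            init='k-means++',
            random_state=101,
            algorithm='lloyd'  # called 'full' before scikit-learn 1.1
        )
        # Fit once and return a labelled copy, leaving the input untouched
        result = features.copy()
        result['cluster'] = kmeans.fit_predict(features)
        return result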

    print('Cluster function created...')
    
    plt.figure(1 , figsize=(17 , 17))
    n = 0 
    for no in k_range:
        n += 1
        plt.subplot(3 , 3 , n)
        plt.subplots_adjust(hspace=0.25 , wspace=0.25)
        segments = clusters(data=data, no_clusters=no)
        plt.title('No. clusters: {}'.format(no))
        sns.scatterplot(data=segments, x='income',y='score',hue='cluster', palette='tab10', s=70, legend=False)    
    print('Cluster visuals based only on income and score:')
    plt.show()


    def ch_plot_for_k_means(X, k_range, resample):
        plt.figure(figsize=[12,5])
        for _ in range(resample):  # one curve per run, to show run-to-run variation
            scores = []
            for k in k_range:
                kmeansModel = KMeans(n_clusters = k).fit(X)
                labels = kmeansModel.labels_
                scores.append(metrics.calinski_harabasz_score(X, labels)) 
            plt.plot(k_range, scores)
        plt.xticks(k_range)
        plt.title("Optimal number of clusters Calinski-Harabasz criterion")
        plt.xlabel("Number of clusters")
        plt.ylabel("Calinski - Harabasz statistic")
        plt.show()
    print('Optimal number of clusters, please adjust parameters:')
    ch_plot_for_k_means(data, range(2,15), 3)
    
    
    def counts(no_clusters, if_print_1):
        clus = clusters(data, no_clusters).reset_index()
        clus = df.reset_index().merge(clus).set_index('id')
        clus_count = clus.groupby('cluster')['score'].count()
        clus_count = clus_count.to_frame()
        clus_count = clus_count.rename(columns={'score':'no_customers'})
        clus_count['pct. of total'] = clus_count['no_customers'].apply(lambda x: round(100*(x / len(data)),2))
        if if_print_1 == 1:
            mini = str(min(clus_count['pct. of total'])) + '%'
            print('Smallest cluster count in pct. :', mini, ', with ', no_clusters, ' clusters')
        clus_count['pct. of total'] = clus_count['pct. of total'].apply(lambda x: str(x) + '%')
        return clus, clus_count
    clus, clus_count = counts(no_clusters, 0)
    
    print('')
    for num in range(2,10):
        counts(num, 1)
    print('\nPlease adjust if needed...')
    print('\nCluster difference overview')
    sns.pairplot(clus, hue='cluster', vars=[col for col in df.columns if df[col].dtype not in ['object','str']])
    plt.show()
    print('\nCluster mean comparison:')
    # Average only the numeric columns; 'gender' is text
    figdata = clus.select_dtypes('number').pivot_table(
        index=['cluster'],
        aggfunc='mean')

    figdata = figdata.reset_index().merge(clus_count.reset_index()[['no_customers','cluster']]).set_index('cluster')
    figdata = figdata.sort_values('score', ascending=False)

    sc = StandardScaler()
    figinfo = sc.fit_transform(figdata)
    # Keep the cluster ids as row labels so the heatmap rows can be identified
    figinfo = pd.DataFrame(figinfo, columns=figdata.columns, index=figdata.index)

    plt.figure(1 , figsize=(6 ,6))
    sns.heatmap(data=figinfo, annot=figdata, cmap='RdYlBu_r', cbar=False)
    plt.suptitle('Heatmap for interpretation')
    plt.show()

## Including age
***And 6 clusters***

In [None]:
clusters(data=df[['income','score','age']], no_clusters=6, k_range=range(1,10), df=df)

## Including gender

In [None]:
df2 = df.copy()
df2['Gender'] = df2['gender'].apply(lambda x: 1 if x=='Male' else 0)

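# df2 already holds every column of df plus 'Gender' and keeps the 'id' index,
# which df.merge(df2) would throw away
clusters(data=df2[['income','score','age','Gender']], no_clusters=6, k_range=range(1,10), df=df2)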
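
One thing to keep in mind when a 0/1 gender column sits next to income and score: K-Means works on distances, so variables with a large range weigh more than a variable that only moves between 0 and 1. The sketch below uses made-up customers (not the mall dataset) to show the difference in spread before and after standardizing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up customers, only to illustrate the scale difference
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    'income': rng.integers(15, 140, size=200),
    'score': rng.integers(1, 101, size=200),
    'Gender': rng.integers(0, 2, size=200),
})

# Raw spread: income and score vary by tens of units, Gender by at most 1
print(demo.std().round(2))

# After standardizing, every column has mean 0 and unit variance, so each
# one contributes on the same footing to the distances K-Means uses
scaled = pd.DataFrame(StandardScaler().fit_transform(demo), columns=demo.columns)
print(scaled.std(ddof=0).round(2))
```

The first version of `clusters` in this kernel takes a `scaler` argument for exactly this purpose; whether scaling gives more useful segments is a judgment call that depends on how much each variable should count.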