<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h1 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#005097; border:0' role="tab" aria-controls="home"><center>Customer Segmentation </center></h1>

In [None]:
import numpy as np
import pandas as pd
import datetime
from datetime import date
import matplotlib
import seaborn as sns
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, normalize
from sklearn import metrics
from sklearn.mixture import GaussianMixture
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
data_folder = "/kaggle/input/arketing-campaign/"

### Table of Contents

* [Data Preprocessing](#section_1)
    * [Feature Engineering](#section_1_1)
    * [Statistical summary](#section_1_2)
    * [Outliers and missing values treatment](#section_1_3)
    * [Data normalization](#section_1_4)
    ___
* [Clustering Algorithm](#section_2)
    * [Number of clusters selection](#section_2_1)
    * [Clusters creation](#section_2_2)
    * [Clusters interpretation](#section_2_3)
    * [Clusters visualization](#section_2_4)
    
    ---
Link to my previous Notebook : https://www.kaggle.com/raphael2711/data-prep-visual-eda-and-statistical-hypothesis

# 1. Data Preprocessing <a class="anchor" id="section_1"></a>

### A. Feature Engineering <a class="anchor" id="section_1_1"></a>

In [None]:
data=pd.read_csv(data_folder+'marketing_campaign.csv',header=0,sep=';') 
data.head(10)

A first look at the raw data helps us think about useful variables we could create to better understand the dataset and to choose the features used to cluster the customers.  

We will create two variables:

>- Variable __*Spending*__ as the sum of the amounts spent on the 6 product categories
>- Variable __*Seniority*__ as the number of months the customer has been enrolled with the company

We will then keep only the variables used in this analysis: _Income_, _Spending_ and _Seniority_.
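
As a minimal sketch of the _Seniority_ idea, the snippet below measures the time between an enrollment date and a fixed reference date, using a 30-day month approximation. The toy dates and the name `toy` are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Toy enrollment dates (hypothetical values, for illustration only)
toy = pd.DataFrame({"Dt_Customer": ["2014-09-04", "2013-10-04", "2012-10-04"]})

# Reference date used to measure how long each customer has been enrolled
last_date = pd.Timestamp(2014, 10, 4)

# Elapsed days divided by 30 gives an approximate number of months
enrolled = pd.to_datetime(toy["Dt_Customer"], format="%Y-%m-%d")
toy["Seniority"] = (last_date - enrolled).dt.days / 30

print(toy)
```

Dividing by 30 is only an approximation of a calendar month, which is sufficient here because the variable is rescaled before clustering.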

In [None]:
# Spending: total amount spent across the 6 product categories
data['Spending'] = data['MntWines'] + data['MntFruits'] + data['MntMeatProducts'] + data['MntFishProducts'] + data['MntSweetProducts'] + data['MntGoldProds']

# Seniority: months between enrollment and a fixed reference date (30-day months)
last_date = date(2014, 10, 4)
data['Seniority'] = pd.to_datetime(data['Dt_Customer'], format='%Y-%m-%d')
data['Seniority'] = (pd.Timestamp(last_date) - data['Seniority']).dt.days / 30

# Keep only the three features used for clustering
dataset = data[['Income', 'Spending', 'Seniority']].copy()

### B. Statistical summary <a class="anchor" id="section_1_2"></a>

In [None]:
pd.options.display.float_format = "{:.2f}".format
dataset.describe()

Our variables do not share the same units, so we need to normalize them.  
Moreover, we saw in my previous Notebook that the _Income_ variable has both __outliers__ and __missing values__.<br>


### C. Outliers and missing values treatment <a class="anchor" id="section_1_3"></a>

For a clustering analysis, simply removing the rows with missing values is a reasonable option.

We will therefore remove the 24 rows that have no _Income_ value, as well as the single row where _Income_ is equal to 666K.

In [None]:
#Remove rows with missing values
dataset=dataset.dropna(subset=['Income'])

#Remove the only outlier in the dataset
dataset=dataset[dataset['Income']<600000]
dataset.describe()

In [None]:
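# Plot the distribution of each feature (histplot replaces the deprecated distplot)
nd = pd.melt(dataset, value_vars=list(dataset.columns))
n1 = sns.FacetGrid(nd, col='variable', col_wrap=5, sharex=False, sharey=False)
n1 = n1.map(sns.histplot, 'value', kde=True, stat='density')
n1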

We will normalize our data both column-wise and row-wise.

### D. Data normalization <a class="anchor" id="section_1_4"></a>

>We use __*Standard Scaler*__ to transform each column (feature) by removing its mean and scaling it to unit variance.<br>
We use __*Normalize*__ to rescale each row independently of the other rows so that its norm equals one.

In [None]:
scaler=StandardScaler()
dataset=dataset[['Income','Seniority','Spending']]

X_std=scaler.fit_transform(dataset)
X = normalize(X_std,norm='l2') 
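
To make the effect of the two steps concrete, here is a small sketch on a toy matrix. The values and the names `toy`, `toy_std` and `toy_norm` are hypothetical; only `StandardScaler` and `normalize` come from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Toy matrix: 4 customers x 3 features on very different scales (hypothetical values)
toy = np.array([
    [20000.0, 10.0, 50.0],
    [40000.0, 15.0, 300.0],
    [60000.0, 20.0, 900.0],
    [80000.0, 25.0, 1500.0],
])

# Column-wise: each feature ends up with mean 0 and unit variance
toy_std = StandardScaler().fit_transform(toy)

# Row-wise: each customer vector is rescaled to unit L2 norm
toy_norm = normalize(toy_std, norm="l2")

print(toy_std.mean(axis=0).round(6))
print(toy_std.std(axis=0).round(6))
print(np.linalg.norm(toy_norm, axis=1).round(6))
```

Note that after the row-wise step every point lies on the unit sphere, so the columns are no longer exactly centered with unit variance.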

In [None]:
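# Column order must match the order used to build X: Income, Seniority, Spending
df = pd.DataFrame(data=X, columns=['Income','Seniority','Spending'])
nd = pd.melt(df, value_vars=list(df.columns))
n1 = sns.FacetGrid(nd, col='variable', col_wrap=5, sharex=False, sharey=False)
n1 = n1.map(sns.histplot, 'value', kde=True, stat='density')
n1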

One advantage of the GMM clustering algorithm over K-means is that it assumes an observation can belong to several clusters, and can therefore compute the probability that each observation belongs to each cluster.

We can therefore perform either hard clustering or soft clustering with GMM.<br>
Soft clustering keeps the full set of membership probabilities for each observation. Hard clustering assigns each observation to the cluster yielding the highest probability: each observation belongs to exactly one cluster, and we can still retrieve the associated probability.

In our example, we will define a marketing strategy for each of the clusters generated. We need hard clustering to associate each customer with a single strategy, but we will keep the probabilities in case we want to reassign a customer to another cluster and try another marketing strategy. 
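
The difference between the two modes can be sketched on synthetic data. The blobs and the names `points`, `gmm_toy`, `soft` and `hard` are assumptions made for illustration; `predict_proba` and `predict` are the scikit-learn methods used later in this notebook.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic, well-separated 2-D blobs (hypothetical data)
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=-3.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(100, 2)),
])

gmm_toy = GaussianMixture(n_components=2, random_state=0).fit(points)

# Soft clustering: one probability per cluster for every observation
soft = gmm_toy.predict_proba(points)

# Hard clustering: the single most probable cluster for every observation
hard = gmm_toy.predict(points)

print(soft[:3].round(3))
print(hard[:3])
```

Each row of `soft` sums to one, and the hard label is simply the column holding the largest probability in that row.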

# 2. Clustering Algorithm <a class="anchor" id="section_2"></a>

### A. Number of clusters selection <a class="anchor" id="section_2_1"></a>

We choose the number of clusters using the Silhouette score and the Davies-Bouldin score.

In [None]:
Covariance=['full','tied','diag','spherical']
number_clusters=np.arange(1,21)
rows=[]
for i in Covariance:
    for n in number_clusters:
        gmm_cluster=GaussianMixture(n_components=n,covariance_type=i,random_state=5)
        clusters=gmm_cluster.fit_predict(X)
        # Both scores require at least two distinct clusters
        if len(np.unique(clusters))>=2:
            rows.append({"Covariance type":i,'Number of Clusters':n,"Silhouette Score":metrics.silhouette_score(X,clusters),'Davies Bouldin Score':metrics.davies_bouldin_score(X,clusters)})

# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
results_=pd.DataFrame(rows,columns=['Covariance type','Number of Clusters','Silhouette Score','Davies Bouldin Score'])
display(results_.sort_values(by=["Silhouette Score"], ascending=False)[:10])

>We will select the covariance type and number of clusters where:
> - The Silhouette score is maximized <br>
> - The Davies-Bouldin score is minimized <br>
>
>We choose the __spherical__ covariance type with __4__ clusters.
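
As a rough sketch of how the two criteria behave, the snippet below scores several cluster counts on synthetic data. The blobs and the names `points`, `labels_toy` and `scores` are hypothetical; the metric functions are the same ones used in the cell above.

```python
import numpy as np
from sklearn import metrics
from sklearn.mixture import GaussianMixture

# Three synthetic, well-separated 2-D blobs (hypothetical data)
rng = np.random.default_rng(0)
centers = [(-6.0, 0.0), (0.0, 6.0), (6.0, 0.0)]
points = np.vstack([rng.normal(loc=c, scale=0.7, size=(80, 2)) for c in centers])

scores = {}
for n in range(2, 6):
    labels_toy = GaussianMixture(n_components=n, random_state=0).fit_predict(points)
    if len(np.unique(labels_toy)) < 2:
        continue  # both scores need at least two non-empty clusters
    scores[n] = (
        metrics.silhouette_score(points, labels_toy),      # higher is better
        metrics.davies_bouldin_score(points, labels_toy),  # lower is better
    )

for n, (sil, dbi) in scores.items():
    print(n, round(sil, 3), round(dbi, 3))
```

The Silhouette score lies between -1 and 1, while the Davies-Bouldin score is non-negative, so the two rankings have to be read in opposite directions.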

In [None]:
sns.set()
number_clusters = np.arange(1, 10)
models = [GaussianMixture(n, covariance_type='spherical',max_iter=2000, random_state=5).fit(X) for n in number_clusters]
plt.plot(number_clusters, [m.bic(X) for m in models], label='BIC')
plt.plot(number_clusters, [m.aic(X) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('number_clusters')

>The BIC curve shows that the improvement slows down from __n=4__ clusters onward.<br> We therefore confirm our choice, which also keeps the number of clusters manageable.
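
One possible way to read such a curve numerically, rather than by eye, is to look at how much the BIC drops each time a component is added. This is a sketch on synthetic data; `points`, `candidates`, `bic` and `gains` are illustrative names, not part of the original analysis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three synthetic, well-separated 2-D blobs (hypothetical data)
rng = np.random.default_rng(1)
centers = [(-6.0, 0.0), (0.0, 6.0), (6.0, 0.0)]
points = np.vstack([rng.normal(loc=c, scale=0.7, size=(80, 2)) for c in centers])

candidates = list(range(1, 7))
bic = [
    GaussianMixture(n_components=n, covariance_type="spherical", random_state=0)
    .fit(points)
    .bic(points)
    for n in candidates
]

# Drop in BIC obtained by adding one more component;
# small (or negative) drops suggest diminishing returns
gains = -np.diff(bic)
for n, gain in zip(candidates[1:], gains):
    print(f"{n - 1} -> {n} components: BIC decreases by {gain:.1f}")
```

A lower BIC is better, so the point after which the decreases become small is a natural candidate for the number of clusters.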

### B. Clusters creation <a class="anchor" id="section_2_2"></a>

We fit the model and predict the cluster of each observation, specifying the __number of clusters__ and the __covariance type__.

In [None]:
gmm=GaussianMixture(n_components=4, covariance_type='spherical',max_iter=2000, random_state=5).fit(X)
labels = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis');

Below is the probability that each observation belongs to each cluster.

In [None]:
proba = gmm.predict_proba(X)
print(proba[:].round(2))

We assign each customer to the cluster with the highest probability.

In [None]:
# Hard assignment and the probability of the assigned cluster
dataset['Cluster'] = labels
dataset['Probability'] = proba.max(axis=1)

# Reset the index left non-contiguous by the earlier row removals
dataset = dataset.reset_index(drop=True)
dataset
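
Keeping the probabilities also makes it possible to look up the runner-up cluster of a customer, should we want to try another strategy. The sketch below uses a hypothetical probability matrix `proba_toy`; the column names `Second cluster` and `Second probability` are assumptions, not columns of the notebook's `dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical membership probabilities for 3 customers and 4 clusters
proba_toy = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.45, 0.40, 0.05],
    [0.25, 0.05, 0.10, 0.60],
])

# Cluster indices sorted from least to most probable, per customer
order = np.argsort(proba_toy, axis=1)

ranking = pd.DataFrame({
    "Cluster": order[:, -1],             # most probable cluster (hard assignment)
    "Probability": proba_toy.max(axis=1),
    "Second cluster": order[:, -2],      # runner-up cluster
    "Second probability": np.sort(proba_toy, axis=1)[:, -2],
})
print(ranking)
```

A customer whose two probabilities are close, such as the second row here, is a natural candidate for testing the alternative strategy.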

### C. Clusters interpretation <a class="anchor" id="section_2_3"></a>

We compute a statistical summary of the 4 clusters to understand what they represent and to give each segment a name.

In [None]:
pd.options.display.float_format = "{:.0f}".format
# Descriptive statistics of each feature, per cluster
summary = dataset[['Income','Spending','Seniority','Cluster']].groupby('Cluster').describe().transpose()
summary

The clusters are of similar size:
- __Cluster 0__ is composed of __long-standing customers__ with __high income__ and __high spending__<br>
- __Cluster 1__ is composed of __new customers__ with __below-average income__ and __low spending__<br>
- __Cluster 2__ is composed of __new customers__ with __high income__ and __high spending__<br>
- __Cluster 3__ is composed of __long-standing customers__ with __below-average income__ and __low spending__<br>

In [None]:
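# Rename clusters (restricted to the Cluster column so that numeric values
# equal to 0, 1, 2 or 3 in the other columns are left untouched)
dataset['Cluster'] = dataset['Cluster'].replace({0:'Stars',1:'Need attention',2:'High potential',3:'Leaky bucket'})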

### D. Clusters visualization <a class="anchor" id="section_2_4"></a>

In [None]:
PLOT = go.Figure()
# One 3D trace per cluster so that each segment gets its own color and legend entry
for C in list(dataset.Cluster.unique()):
    PLOT.add_trace(go.Scatter3d(x = dataset[dataset.Cluster == C]['Income'],
                                y = dataset[dataset.Cluster == C]['Seniority'],
                                z = dataset[dataset.Cluster == C]['Spending'],
                                mode = 'markers',marker_size = 6, marker_line_width = 1,
                                name = str(C)))
PLOT.update_traces(hovertemplate='Income: %{x} <br>Seniority: %{y} <br>Spending: %{z}')

# Axis titles use the title/font structure, since titlefont is deprecated
PLOT.update_layout(width = 850, height = 850, autosize = True, showlegend = True,
                   scene = dict(xaxis=dict(title = dict(text = 'Income', font = dict(color = 'black'))),
                                yaxis=dict(title = dict(text = 'Seniority', font = dict(color = 'black'))),
                                zaxis=dict(title = dict(text = 'Spending', font = dict(color = 'black')))),
                   font = dict(family = "Gilroy", color  = 'black', size = 12))

We can see that the 4 clusters are well defined.<br> 
Some customers with low income spend a lot, meaning we could try applying to them a marketing strategy initially defined for *Stars* customers.