<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h1 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#005097; border:0' role="tab" aria-controls="home"><center>Customer Persona</center></h1>

In [None]:
import numpy as np
import pandas as pd
import datetime
from datetime import date
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler, normalize
from sklearn import metrics
from sklearn.mixture import GaussianMixture
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import warnings
warnings.filterwarnings('ignore')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
data_folder = "/kaggle/input/arketing-campaign/"

### Table of Contents

* [Data Preprocessing](#section_1)
    * [Feature Engineering](#section_1_1)   
    * [Outliers and missing values treatment](#section_1_2)
    * [Customer clustering](#section_1_3)
    ___
* [Apriori Algorithm](#section_2)
    * [Association Rules generation](#section_2_1)
    * [Customer personas validation](#section_2_2)
    ---
Links to my previous notebooks on the same dataset : <br>
Exploratory Data Analysis: https://www.kaggle.com/raphael2711/data-prep-visual-eda-and-statistical-hypothesis <br>
Customer segmentation : https://www.kaggle.com/raphael2711/customer-segmentation-with-gmm-clustering

# 1. Data Preprocessing <a class="anchor" id="section_1"></a>

### A. Feature Engineering <a class="anchor" id="section_1_1"></a>

To define customer personas, I will create several variables :

>- Variable __*Age*__, replacing the variable *Year_Birth*
>- Variable __*Spending*__ as the sum of the amounts spent on the 6 product categories
>- Variable __*Seniority*__ as the number of months the customer has been enrolled with the company
>- Variable __*Marital_Status*__, regrouping the different marital statuses into only 2 categories : In couple vs Alone
>- Variable __*Education*__ as either Undergraduate or Postgraduate
>- Variable __*Children*__ as the total number of children at home
>- Variable __*Has_child*__ as a binary variable equal to *Has child* if the customer has 1 child or more, and *No child* otherwise

Variables that are not used in this analysis are then dropped.

In [None]:
data=pd.read_csv(data_folder+'marketing_campaign.csv',header=0,sep=';')
#Age variable creation (2014 matches the reference date used for Seniority below)
data['Age']=2014-data['Year_Birth']
#Spending variable creation
data['Spending']=data['MntWines']+data['MntFruits']+data['MntMeatProducts']+data['MntFishProducts']+data['MntSweetProducts']+data['MntGoldProds']
#Seniority variable creation (approximate months, counting 30 days per month)
last_date = pd.Timestamp('2014-10-04')
data['Seniority']=pd.to_datetime(data['Dt_Customer'], format='%Y-%m-%d')
data['Seniority']=(last_date-data['Seniority']).dt.days/30
data=data.rename(columns={'NumWebPurchases': "Web",'NumCatalogPurchases':'Catalog','NumStorePurchases':'Store'})
#Group marital statuses and education levels into 2 categories each
data['Marital_Status']=data['Marital_Status'].replace({'Divorced':'Alone','Single':'Alone','Married':'In couple','Together':'In couple','Absurd':'Alone','Widow':'Alone','YOLO':'Alone'})
data['Education']=data['Education'].replace({'Basic':'Undergraduate','2n Cycle':'Undergraduate','Graduation':'Postgraduate','Master':'Postgraduate','PhD':'Postgraduate'})
#Children variables creation
data['Children']=data['Kidhome']+data['Teenhome']
data['Has_child'] = np.where(data.Children> 0, 'Has child', 'No child')
#Assign the result back instead of using inplace=True on a single column
data['Children']=data['Children'].replace({3: "3 children",2:'2 children',1:'1 child',0:"No child"})
data=data.rename(columns={'MntWines': "Wines",'MntFruits':'Fruits','MntMeatProducts':'Meat','MntFishProducts':'Fish','MntSweetProducts':'Sweets','MntGoldProds':'Gold'})

#Keep only the variables used in this analysis
data=data[['Age','Education','Marital_Status','Income','Spending','Seniority','Has_child','Children','Wines','Fruits','Meat','Fish','Sweets','Gold']]
data

### B. Outliers and missing values treatment <a class="anchor" id="section_1_2"></a>

I covered the treatment of outliers and missing values in my previous notebook : <br>
https://www.kaggle.com/raphael2711/data-prep-visual-eda-and-statistical-hypothesis

There are 24 missing values and 1 outlier for the *Income* variable. We simply remove these observations for this analysis.

In [None]:
#Remove rows with missing values
data=data.dropna(subset=['Income'])
#Remove the only outlier in the dataset
data=data[data['Income']<600000]

### C. Customer clustering <a class="anchor" id="section_1_3"></a>

I explained in my previous notebook how I clustered the customers based on their Income, Spending level and Seniority in the company :
https://www.kaggle.com/raphael2711/customer-segmentation-with-gmm-clustering

We identified 4 equally weighted clusters :
- __Stars__ is composed of __old customers__ with __high income__ and __high spending amount__<br>
- __Need attention__ is composed of __new customers__ with __below average income__ and __small spending amount__<br>
- __High potential__ is composed of __new customers__ with __high income__ and __high spending amount__<br>
- __Leaky bucket__ is composed of __old customers__ with __below average income__  and __small spending amount__<br>
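
Gaussian mixture component indices carry no meaning of their own, so which index ends up being "Stars" depends on the fitted model. A minimal sketch of an alternative, on hypothetical toy data, derives each name from the cluster means instead. It assumes that comparing cluster means with the overall mean of Income and Seniority is enough to separate the four profiles (Spending is left out for brevity); the `toy` frame and the `persona` helper are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: 'Cluster' holds raw component indices from a mixture model
toy = pd.DataFrame({
    'Income':    [80000, 82000, 30000, 28000, 79000, 81000, 31000, 29000],
    'Seniority': [25, 26, 10, 11, 9, 10, 24, 26],
    'Cluster':   [2, 2, 0, 0, 3, 3, 1, 1],
})

# Compare each cluster mean with the overall mean
means = toy.groupby('Cluster')[['Income', 'Seniority']].mean()
high_income = means['Income'] > toy['Income'].mean()
old_customer = means['Seniority'] > toy['Seniority'].mean()

def persona(cluster_id):
    """Name a cluster from its position relative to the overall means."""
    if high_income[cluster_id]:
        return 'Stars' if old_customer[cluster_id] else 'High potential'
    return 'Leaky bucket' if old_customer[cluster_id] else 'Need attention'

names = {cluster_id: persona(cluster_id) for cluster_id in means.index}
toy['Persona'] = toy['Cluster'].map(names)
print(names)
```

With this approach the names stay attached to the right profiles even if a different `random_state` reorders the components.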

In [None]:
#Standardize the features, then scale each row to unit L2 norm before clustering
scaler=StandardScaler()
dataset_temp=data[['Income','Seniority','Spending']].copy()
X_std=scaler.fit_transform(dataset_temp)
X = normalize(X_std,norm='l2')

#fit the algorithm
gmm=GaussianMixture(n_components=4, covariance_type='spherical',max_iter=2000, random_state=5).fit(X)
#predict clusters
labels = gmm.predict(X)
#Name the clusters; map only the label column so that feature values equal to 0-3 are left untouched
#(the component numbering depends on the fitted model, fixed here by random_state=5)
cluster_names={0:'Stars',1:'Need attention',2:'High potential',3:'Leaky bucket'}
dataset_temp['Cluster'] = pd.Series(labels, index=dataset_temp.index).map(cluster_names)
data = data.merge(dataset_temp.Cluster, left_index=True, right_index=True)

In [None]:
pd.options.display.float_format = "{:.0f}".format
#Descriptive statistics of the clustering variables, per cluster
summary=data[['Income','Spending','Seniority','Cluster']]
summary=summary.groupby('Cluster').describe().transpose()
summary

In [None]:
PLOT = go.Figure()
#One 3D scatter trace per cluster
for C in list(data.Cluster.unique()):
    PLOT.add_trace(go.Scatter3d(x = data[data.Cluster == C]['Income'],
                                y = data[data.Cluster == C]['Seniority'],
                                z = data[data.Cluster == C]['Spending'],
                                mode = 'markers',marker_size = 6, marker_line_width = 1,
                                name = str(C)))
PLOT.update_traces(hovertemplate='Income: %{x} <br>Seniority: %{y} <br>Spending: %{z}')

#Axis titles use title=dict(text=..., font=...) rather than the deprecated titlefont_color shortcut
PLOT.update_layout(width = 800, height = 800, autosize = True, showlegend = True,
                   scene = dict(xaxis=dict(title = dict(text = 'Income', font = dict(color = 'black'))),
                                yaxis=dict(title = dict(text = 'Seniority', font = dict(color = 'black'))),
                                zaxis=dict(title = dict(text = 'Spending', font = dict(color = 'black')))),
                   font = dict(family = "Gilroy", color  = 'black', size = 12))

The Apriori algorithm works on categorical items, so we bin the numerical variables before running it.
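
The binning relies on two pandas functions that behave differently: `pd.cut` uses fixed bin edges, while `pd.qcut` picks the edges so that each bin holds roughly the same number of rows. A small sketch on made-up values (the `ages` and `incomes` series are illustrative, not taken from the dataset):

```python
import pandas as pd

# Fixed edges: each age falls in the interval that contains it, e.g. (30, 45] -> 'Adult'
ages = pd.Series([22, 35, 41, 58, 70])
age_group = pd.cut(ages, bins=[0, 30, 45, 65, 120],
                   labels=['Young', 'Adult', 'Mature', 'Senior'])
print(list(age_group))

# Quantile edges: q=4 splits the values into four equally sized groups
incomes = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])
income_group = pd.qcut(incomes, q=4,
                       labels=['Low income', 'Low to medium income',
                               'Medium to high income', 'High income'])
print(income_group.value_counts().sort_index())
```

Fixed edges keep the age groups interpretable, whereas quantile edges guarantee that no income or seniority group is nearly empty, which matters for the minimum support threshold used later.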

In [None]:
#Create Age segment
cut_labels_Age = ['Young', 'Adult', 'Mature', 'Senior']
cut_bins = [0, 30, 45, 65, 120]
data['Age_group'] = pd.cut(data['Age'], bins=cut_bins, labels=cut_labels_Age)
#Create Income segment
cut_labels_Income = ['Low income', 'Low to medium income', 'Medium to high income', 'High income']
data['Income_group'] = pd.qcut(data['Income'], q=4, labels=cut_labels_Income)
#Create Seniority segment
cut_labels_Seniority = ['New customers', 'Discovering customers', 'Experienced customers', 'Old customers']
data['Seniority_group'] = pd.qcut(data['Seniority'], q=4, labels=cut_labels_Seniority)

data=data.drop(columns=['Age','Income','Seniority'])
data

We define customer segments for each product based on the amount spent. The quartiles are computed among the customers who actually bought the product :
- **Non consumer :** Customers who spent nothing on the product
- **Low consumer :** Customers below the 1st quartile
- **Frequent consumer :** Customers between the 1st and 3rd quartile
- **Biggest consumer :** Customers above the 3rd quartile
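
As a quick illustration of these four segments, here is a sketch on a made-up spending series (the values are hypothetical). Zero spenders are set aside first, the quantile cut is applied to the buyers only, and the result is then realigned with the full index:

```python
import pandas as pd

# Hypothetical amounts spent on one product; two customers bought nothing
spend = pd.Series([0, 0, 5, 10, 20, 40, 80, 160, 320, 640])

# Quartile-based cut among buyers only
buyers = spend[spend > 0]
segment = pd.qcut(buyers, q=[0, .25, .75, 1],
                  labels=['Low consumer', 'Frequent consumer', 'Biggest consumer'])

# Realign with all customers; those without a segment are the non consumers
segment = segment.astype(object).reindex(spend.index).fillna('Non consumer')
print(segment.value_counts())
```

Excluding the zeros before computing the quartiles prevents a product that many customers never buy from collapsing the lower quantile edges onto 0.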

In [None]:
cut_labels = ['Low consumer', 'Frequent consumer', 'Biggest consumer']
products = ['Wines','Fruits','Meat','Fish','Sweets','Gold']
for product in products:
    #Quartiles are computed among buyers only; customers with 0 spending get NaN after index alignment
    buyers = data.loc[data[product]>0, product]
    data[product+'_segment'] = pd.qcut(buyers,q=[0, .25, .75, 1], labels=cut_labels).astype("object")
    #Fill only the segment column, so that missing values in other columns are not relabelled
    data[product+'_segment'] = data[product+'_segment'].fillna("Non consumer")

data.drop(columns=['Spending']+products,inplace=True)
data = data.astype(object)
data

# 2. Apriori algorithm <a class="anchor" id="section_2"></a>

### A. Association Rules generation <a class="anchor" id="section_2_1"></a>

We are now ready to run the Apriori algorithm. We will look for the profile of the biggest consumers of Wines.
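
The rules generated in the next cells are filtered on *lift* and ranked by *confidence*. For a rule A → C, support is the share of customers matching an itemset, confidence is support(A and C) / support(A), and lift is confidence / support(C); a lift above 1 means A and C occur together more often than if they were independent. A minimal sketch computing these by hand on a made-up one-hot table (column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical one-hot encoded customers (True = the customer has the attribute)
onehot = pd.DataFrame({
    'High income':       [1, 1, 1, 0, 0, 1, 0, 1],
    'Big wine consumer': [1, 1, 0, 0, 0, 1, 0, 1],
}).astype(bool)

antecedent, consequent = 'High income', 'Big wine consumer'

# Support: fraction of rows where the itemset is present
support_a = onehot[antecedent].mean()
support_c = onehot[consequent].mean()
support_ac = (onehot[antecedent] & onehot[consequent]).mean()

# Confidence: how often the consequent holds when the antecedent does
confidence = support_ac / support_a
# Lift: confidence relative to the baseline frequency of the consequent
lift = confidence / support_c

print(f"support={support_ac:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

These are the same quantities that `association_rules` reports in its `support`, `confidence` and `lift` columns.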

In [None]:
#Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 999)
pd.options.display.float_format = "{:.3f}".format

association=data.copy() 
df = pd.get_dummies(association)

#Apriori min support
min_support = 0.08

#Max length of apriori n-grams
max_len = 10

frequent_items = apriori(df, use_colnames=True, min_support=min_support, max_len=max_len + 1)
rules = association_rules(frequent_items, metric='lift', min_threshold=1)

In [None]:
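# We select the product and the segment we want to analyze
product='Wines'
segment='Biggest consumer'
# str(frozenset) looks like "frozenset({'Wines_segment_Biggest consumer'})", so this matches single-item consequents
target = '{\'%s_segment_%s\'}' %(product,segment)

# regex=False: the braces and quotes in target are matched as literal characters
results_wines = rules[rules['consequents'].astype(str).str.contains(target, regex=False, na=False)].sort_values(by='confidence', ascending=False)

results_wines.head(5)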

### B. Customer personas validation <a class="anchor" id="section_2_2"></a>

<span style="font-size: 200%;color:#6d071a;font-weight:bold">Customer profile of the best Wines customers</span> <br>
Our algorithm generated several rules from which we can clearly identify the profile of the best Wines customers.<br> Rule id 9440 tells us it is :
- A customer with an __average income of 69500 dollars__
- With an __average total spending of 1252 dollars__
- Enrolled with the company for __21 months__
- Holding a __Postgraduate diploma__
- Who is also a __big consumer of Meat products__
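
A rule only lists categories, so averages like the ones above have to come from the customers who match it. The notebook does not show that computation; one possible way, sketched here on hypothetical toy data, is to filter the customers matching every item of the rule and average their numerical variables (the `customers` frame, the `rule` dictionary and all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical customers, keeping both numerical values and binned segments
customers = pd.DataFrame({
    'Income':        [50000, 60000, 30000, 72000, 25000],
    'Spending':      [900, 1100, 100, 1250, 50],
    'Education':     ['Postgraduate', 'Postgraduate', 'Undergraduate', 'Postgraduate', 'Postgraduate'],
    'Meat_segment':  ['Biggest consumer', 'Biggest consumer', 'Low consumer', 'Frequent consumer', 'Low consumer'],
    'Wines_segment': ['Biggest consumer', 'Biggest consumer', 'Low consumer', 'Biggest consumer', 'Non consumer'],
})

# Hypothetical rule expressed as column -> required category
rule = {'Education': 'Postgraduate',
        'Meat_segment': 'Biggest consumer',
        'Wines_segment': 'Biggest consumer'}

# Keep the customers that satisfy every item of the rule
mask = pd.Series(True, index=customers.index)
for column, value in rule.items():
    mask &= customers[column] == value

# Average profile of the matching customers
profile = customers.loc[mask, ['Income', 'Spending']].mean()
print(int(mask.sum()), "matching customers")
print(profile)
```

Applying this to the real data requires keeping a copy of the numerical columns before they are dropped during binning.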