# CUSTOMER SEGMENTATION
### Author: Vu Duong
#### Date: June, 2020

# CREDITS:
This work is inspired by several great earlier sources:
- https://www.kaggle.com/kushal1996/customer-segmentation-k-means-analysis

- https://www.kaggle.com/roshansharma/mall-customers-clustering-analysis

# INTRODUCTION
You own a supermarket mall, and through membership cards you have some basic data about your customers: Customer ID, age, gender, annual income and spending score.
Spending Score is something you assign to each customer based on parameters you define, such as customer behavior and purchasing data.

### Problem Statement
You own the mall and want to understand which customers can be easily converted [Target Customers], so that this insight can be passed to the marketing team, who can plan their strategy accordingly.

# LIBRARY

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns 
import plotly as py
import plotly.graph_objs as go
from sklearn.cluster import KMeans
import warnings
import os
warnings.filterwarnings("ignore")
py.offline.init_notebook_mode(connected = True)

# DATA EXPLORATION

In [None]:
df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

In [None]:
df

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.drop('CustomerID', axis=1, inplace=True)

# DATA VISUALIZATION

In [None]:
plt.style.use('fivethirtyeight')

### AGE, ANNUAL INCOME, SPENDING SCORE DISTRIBUTION
#### Observation
- Age Distribution: The most regular customers of the Mall are around 30-35 years old, whereas senior citizens are the least frequent visitors. Youngsters are fewer in number than middle-aged people.
- Annual Income Distribution: Few people earn more than 100k US Dollars. Most people earn around 50-75k US Dollars, and the lowest income is around 20k US Dollars.
- Spending Score Distribution: Most customers have a Spending Score in the range of 35-60. The lowest is 1, and the highest is 99.
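
One way to check observations like these numerically, rather than by eye, is to bin a column and count customers per bin. The following is a minimal sketch on made-up ages: the values and the bin edges are illustrative assumptions, not taken from the dataset. In the notebook, `df['Age']` would take the place of `ages`.

```python
import pandas as pd

# Toy values for illustration only (hypothetical, not from Mall_Customers.csv).
ages = pd.Series([19, 21, 23, 31, 32, 33, 35, 35, 40, 47, 52, 68])

# Bin ages into 10-year-wide intervals (right edge inclusive) and count per bin.
bins = [10, 20, 30, 40, 50, 60, 70]
age_counts = pd.cut(ages, bins=bins).value_counts().sort_index()

# The interval with the highest count is the most frequent age group.
busiest_bin = age_counts.idxmax()
print(age_counts)
print("Most frequent age bin:", busiest_bin)
```

The same pattern applies to `'Annual Income (k$)'` and `'Spending Score (1-100)'` with suitable bin edges.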

In [None]:
plt.figure(1 , figsize = (15 , 6))
n = 0 
for x in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
    n += 1
    plt.subplot(1 , 3 , n)
    plt.subplots_adjust(hspace =0.5 , wspace = 0.5)
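    # distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
    sns.histplot(df[x], bins=20, kde=True, stat='density')
    plt.title('Distribution of {}'.format(x))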
plt.show()

### GENDER DISTRIBUTION
#### Observation
The pie chart below shows the Gender distribution in the Mall.
- Interestingly, Females are in the lead with a share of 56%, whereas Males have a share of 44%. That is a sizeable gap, especially given that the population of Males is comparatively higher than that of Females.
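
The shares can also be computed directly instead of being read off the chart. This is a small sketch on toy data (the counts below are made up so that the arithmetic is easy to follow). In the notebook, `df['Gender']` would be used instead of `gender`.

```python
import pandas as pd

# Toy data for illustration only (hypothetical counts).
gender = pd.Series(['Female'] * 14 + ['Male'] * 11)

# normalize=True returns fractions; multiply by 100 to get percentages.
share = gender.value_counts(normalize=True).mul(100).round(1)
print(share)

# The index of value_counts() holds the category names in the same order
# as the values, so it is a safe source for pie-chart labels.
labels = list(share.index)
```

Taking the labels from the index keeps each label aligned with its slice even if the ordering of the counts changes.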

In [None]:
size = df['Gender'].value_counts()
labels = size.index  # take labels from the counts so they always match the slices
colors = ['lightgreen', 'orange']
explode = [0, 0.1]

plt.pie(size, colors = colors, explode = explode, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Gender', fontsize = 20)
plt.axis('off')
plt.legend()
plt.show()

### OVERALL 

In [None]:
# pairplot creates its own figure; the `size` argument was renamed to `height` in seaborn 0.9
sns.pairplot(df, hue='Gender', height=3)

### GENDER AGAINST SPENDING SCORE
#### Observation
Most males have a Spending Score of around 25 to 70, whereas females have a Spending Score of around 35 to 75 (the score is a unitless value from 1 to 100, not an amount in dollars). This again suggests that women are the shopping leaders.
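
The ranges quoted here correspond roughly to the quartiles drawn by the plot, and they can be computed per gender with a `groupby`. This is a hedged sketch on a toy frame: the scores are invented for illustration, and the column names simply mirror those in the dataset.

```python
import pandas as pd

# Toy frame for illustration only (hypothetical scores).
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Male',
               'Female', 'Female', 'Female', 'Female'],
    'Spending Score (1-100)': [10, 30, 50, 70, 20, 40, 60, 80],
})

# 25th, 50th and 75th percentiles of the score for each gender,
# reshaped so that each quantile becomes a column.
quartiles = (toy.groupby('Gender')['Spending Score (1-100)']
                .quantile([0.25, 0.5, 0.75])
                .unstack())
print(quartiles)
```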

In [None]:
sns.boxenplot(x=df['Gender'], y=df['Spending Score (1-100)'])  # keyword arguments are required in recent seaborn
plt.title('Gender vs Spending Score', fontsize = 20)
plt.show()

### AGE, ANNUAL INCOME, SPENDING SCORE AND GENDER PLOT AGAINST EACH OTHER
#### Observation
- No clear relationship is visible when plotting Age against Annual Income, Age against Spending Score, or Annual Income against Spending Score.
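
A correlation matrix is one way to put a number on "no clear relationship". The sketch below uses randomly generated, independent columns as a stand-in, so the values and the seed are assumptions. In the notebook, the three numeric columns of `df` would be passed instead. Note that a low linear correlation does not rule out cluster structure.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical); columns are drawn independently.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Age': rng.integers(18, 70, size=200),
    'Annual Income (k$)': rng.integers(15, 140, size=200),
    'Spending Score (1-100)': rng.integers(1, 100, size=200),
})

# Pairwise Pearson correlations; values near 0 indicate no linear relationship.
corr = toy.corr()
print(corr.round(2))
```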

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes = axes.flatten()


sns.scatterplot(x=df['Age'], y=df['Annual Income (k$)'], hue=df['Gender'], ax=axes[0])

sns.scatterplot(x=df['Age'], y=df['Spending Score (1-100)'], hue=df['Gender'], ax=axes[1])

sns.scatterplot(x=df['Annual Income (k$)'], y=df['Spending Score (1-100)'], hue=df['Gender'], ax=axes[2])
plt.subplots_adjust(wspace = 0.3)

# CLUSTERING

### K-MEANS

In [None]:
df['Gender'] = df['Gender'].map({'Male':0, 'Female':1})

In [None]:
from sklearn.cluster import KMeans
SSE_to_nearest_centroid = []

for k in range(1,15):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(df)
    SSE_to_nearest_centroid.append(kmeans.inertia_)

#sns.set(style='whitegrid')
plt.figure(constrained_layout=True, figsize=(12, 5))
sns.lineplot(x=list(range(1,15)), y=SSE_to_nearest_centroid)
plt.xlabel("Number of Clusters",fontsize=14)
plt.ylabel("Inertia",fontsize=14)
plt.grid(True)
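
For reference, the `inertia_` value collected in the loop above is the sum of squared distances from each sample to the centre of its assigned cluster. The sketch below recomputes it by hand on synthetic points. The data, `n_init` and `random_state` are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic 2-D blobs (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centre assigned to every sample, then sum the squared distances.
assigned_centres = km.cluster_centers_[km.labels_]
manual_inertia = float(((X - assigned_centres) ** 2).sum())

print(manual_inertia, km.inertia_)
```

Because inertia generally keeps falling as more clusters are added, the elbow plot looks for the point where the decrease flattens rather than for a minimum.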

In [None]:
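# Cluster on the original features only, so that the labels_6 column
# is not fed into the second model as an extra feature.
features = df[['Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

kmeans = KMeans(n_clusters=6)
df["labels_6"] = kmeans.fit_predict(features)

kmeans = KMeans(n_clusters=8)
df["labels_8"] = kmeans.fit_predict(features)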

#### Observation
- In the plots below, the left column shows the segmentation into 6 types of customers, while the right column shows 8 types.
- For simplicity, and because the customer segments are separated quite clearly in both cases, we choose 6 types of customers for further analysis.
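
The choice between 6 and 8 clusters here is a visual one. A complementary check, not used in the original analysis, is the silhouette score from scikit-learn, which is higher when clusters are compact and well separated. The sketch below runs on synthetic blobs. The blob centres, `n_init` and `random_state` are assumptions, and in the notebook `X` would be the feature columns of `df`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs (hypothetical data).
X, _ = make_blobs(n_samples=200,
                  centers=[[0, 0], [10, 10], [0, 10], [10, 0]],
                  cluster_std=0.5, random_state=0)

# Score each candidate number of clusters; silhouette needs at least 2 clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Highest silhouette score at k =", best_k)
```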

In [None]:
# One row per pair of variables; left column uses 6 clusters, right column uses 8.
pairs = [('Age', 'Annual Income (k$)'),
         ('Age', 'Spending Score (1-100)'),
         ('Annual Income (k$)', 'Spending Score (1-100)')]

fig, axes = plt.subplots(3, 2, figsize=(15, 10))
plt.subplots_adjust(hspace = 0.7, wspace = 0.3)

for row, (x_col, y_col) in enumerate(pairs):
    for col, label_col in enumerate(['labels_6', 'labels_8']):
        ax = axes[row, col]
        sns.scatterplot(x=df[x_col], y=df[y_col], hue=df[label_col], ax=ax, palette='deep')
        # Remove Legend title & move legend outside to the right
        handles, labels = ax.get_legend_handles_labels()
        ax.legend(handles=handles[1:], labels=labels[1:], bbox_to_anchor=(0.9, 1.4), loc='upper left', borderaxespad=0.)

plt.show()

### ANNUAL INCOME - SPENDING SCORE ANALYSIS
#### Observation
- There are clearly five segments of customers, namely Miser, General, Target, Spendthrift and Careful, based on their Annual Income and Spending Score, which are reportedly the best attributes for determining customer segments in a Mall.
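
KMeans numbers its clusters arbitrarily, so a mapping from cluster ids to segment names only holds for the run that produced it. A hedged way to check the mapping is to look at the per-cluster averages before naming anything. The sketch below uses invented toy values, and `random_state` is an assumption added for reproducibility.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy customers (hypothetical values) forming four obvious groups.
toy = pd.DataFrame({
    'Annual Income (k$)':     [15, 18, 20, 16, 19, 85, 90, 88, 87, 92],
    'Spending Score (1-100)': [80, 85, 78, 10, 12, 90, 88, 15, 12, 10],
})

km = KMeans(n_clusters=4, n_init=10, random_state=0)
toy['label'] = km.fit_predict(toy[['Annual Income (k$)', 'Spending Score (1-100)']])

# Mean income and spending score per cluster id: read these before
# deciding which id deserves which segment name.
profile = toy.groupby('label').mean().round(1)
print(profile)
```

Fixing `random_state` makes the ids stable across reruns of the same code, which keeps a hand-written name mapping valid.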

In [None]:
# Note: cluster ids depend on the KMeans run, so check this mapping against the plot
df['labels_6_rename_AnnualIncome-SpendingScore'] = df['labels_6'].map({0: 'Miser', 1: 'General', 2: 'General', 3: 'Target', 4: 'Spendthrift', 5: 'Careful'})
fig, axes = plt.subplots(figsize=(10, 5))
sns.scatterplot(x=df['Annual Income (k$)'], y=df['Spending Score (1-100)'], hue=df['labels_6_rename_AnnualIncome-SpendingScore'], palette='deep')
handles, labels = axes.get_legend_handles_labels()
axes.legend(handles=handles[1:], labels=labels[1:], bbox_to_anchor=(1, 1.1), loc='upper left', borderaxespad=0.)

### AGE - SPENDING SCORE ANALYSIS
#### Observation
- Based on my own intuition from the clustering plot of customers' ages against their spending scores, I have aggregated the clusters into 4 categories: Usual Customers, Young Target Customers, Old Target Customers and Priority Customers.
- With these results, we can design different marketing strategies and policies to optimize the spending scores of customers in the Mall.

In [None]:
df['labels_6_Age-SpendingScore'] = df['labels_6'].map({0: 'Usual Customer', 
                                                       1:'Target Customer (old)', 
                                                       2:'Target Customer (young)', 
                                                       3:'Priority Customer', 
                                                       4:'Priority Customer', 
                                                       5: 'Usual Customer'})

In [None]:
fig, axes = plt.subplots(figsize=(10, 5))
sns.scatterplot(x=df['Age'], y=df['Spending Score (1-100)'], hue=df['labels_6_Age-SpendingScore'], palette='deep')
handles, labels = axes.get_legend_handles_labels()
axes.legend(handles=handles[1:], labels=labels[1:], bbox_to_anchor=(1, 1.1), loc='upper left', borderaxespad=0.)

### AGE - SPENDING SCORE - ANNUAL INCOME ANALYSIS
- The 6 clusters occupy clearly distinct regions.
- This 3D plot combines the two graphs right above.

In [None]:
df

In [None]:
trace1 = go.Scatter3d(x= df['Age'],
                      y= df['Spending Score (1-100)'],
                      z= df['Annual Income (k$)'],
                      mode='markers',
                      marker=dict(color = df['labels_6'],
                                  size= 10,
                                  line=dict(color= df['labels_6'],
                                            width= 12),
                                  opacity=0.8))
data = [trace1]
layout = go.Layout(title= 'Clusters',
                   scene = dict(xaxis = dict(title  = 'Age'),
                                yaxis = dict(title  = 'Spending Score'),
                                zaxis = dict(title  = 'Annual Income')))
                    #     margin=dict(
                    #         l=0,
                    #         r=0,
                    #         b=0,
                    #         t=0
                    #     )
        
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)