**Customer Segmentation**

Customer segmentation is the activity of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers (known as segments) based on some type of shared characteristics. The overall aim of segmentation is to identify high yield segments – that is, those segments that are likely to be the most profitable or that have growth potential – so that these can be selected for special attention ([Reference](https://en.wikipedia.org/wiki/Market_segmentation)).

**What is in this Kernel?**

* Cleaning/Transforming the Data
* Univariate Analysis
* Analyzing the KPIs
  1. Annual Revenue
  2. Monthly Revenue
  3. Monthly Revenue growth rate
  4. Monthly Active Customers
  5. Average Sales per Order
  6. New Customers Growth Rate
* Clustering with an arbitrary number of clusters
  1. Calculating Recency, Frequency and Monetary value for each customer
  2. Calculating RFM Score
  3. Dividing the customers into segments
* KMeans Clustering
  1. Data Preprocessing for KMeans
      1. Removing the Skewness for achieving Normal distribution using Log Transformation
      2. Standardizing the variables using Standard Scaler for equal variance and equal mean
      3. Choosing the number of clusters using Elbow Method
      4. Implementing KMeans
      5. Building Customer Personas
         1. Snake Plot
         2. Calculating the relative importance of each cluster compared to the population

  

In [None]:
#importing the required libraries
import pandas as pd
import numpy as np

#viz Libraries
import matplotlib.pyplot as plt

plt.style.use('ggplot')
import seaborn as sns

#warnings
import warnings
warnings.filterwarnings("ignore")

#datetime
import datetime as dt

#StandardScaler
from sklearn.preprocessing import StandardScaler

#KMeans
from sklearn.cluster import KMeans

#file directory
import os

In [None]:

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



**Reading the data**

In [None]:
#reading the data
df = pd.read_csv('../input/sample-sales-data/sales_data_sample.csv', encoding = 'unicode_escape')

In [None]:
df.shape #Dimensions of the data

In [None]:
df.head() #Glimpse of the data

**Dropping columns**

In [None]:
#Removing the variables which don't add significant value for the analysis.
to_drop = ['PHONE','ADDRESSLINE1','ADDRESSLINE2','STATE','POSTALCODE']
df = df.drop(to_drop, axis=1)

**Checking for null values**

In [None]:
df.isnull().sum()

We leave the missing values of the 'Territory' variable as they are, since they are unlikely to have a significant effect on the analysis.

**Checking for inconsistent data types**

In [None]:
df.dtypes

Changing the data type of variable 'ORDERDATE' from object to datetime

In [None]:
df['ORDERDATE'] = pd.to_datetime(df['ORDERDATE'])

**Summary stats of Quantitative variables**

In [None]:
quant_vars = ['QUANTITYORDERED','PRICEEACH','SALES','MSRP']
df[quant_vars].describe()

It is observed that there are no negative values for the quantitative variables, which is a good sign because we cannot have negative prices or quantities.

# Exploring the variables

**Order Quantity Distribution**

In [None]:
plt.figure(figsize=(9,6))
sns.histplot(df['QUANTITYORDERED'], kde=True) #histplot replaces the deprecated distplot
plt.title('Order Quantity Distribution')
plt.xlabel('Quantity Ordered')
plt.ylabel('Frequency')
plt.show()

From the distribution plot of quantity, we can infer that the orders are bulk orders: the majority of orders are for between 20 and 40 units.

**Price Distribution**

In [None]:
plt.figure(figsize=(9,6))
sns.histplot(df['PRICEEACH'], kde=True) #histplot replaces the deprecated distplot
plt.title('Price Distribution')
plt.xlabel('Price Each')
plt.ylabel('Frequency')
plt.show()

The distribution of Price is left-skewed, with a maximum price of $100. Interestingly, many of the orders received are at this price. We do not investigate the product lines at this highest price any further because the goal is to segment the customers.

**Sales Distribution**

In [None]:
plt.figure(figsize=(9,6))
sns.histplot(df['SALES'], kde=True) #histplot replaces the deprecated distplot
plt.title('Sales Distribution')
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.show()

**Analyzing the STATUS variable**

In [None]:
df['STATUS'].value_counts(normalize = True)

**Checking the time range of the data**

In [None]:
df.groupby(['YEAR_ID'])['MONTH_ID'].nunique()

We don't have complete data for 2005.

**Dealsize Distribution**

In [None]:
plt.figure(figsize=(9,6))
df['DEALSIZE'].value_counts(normalize = True).plot(kind = 'bar')
plt.title('DealSize distribution')
plt.xlabel('Deal Size')
plt.ylabel('% Proportion')
plt.show()

# Analyzing KPIs

# Annual Revenue

In [None]:
#Annual Revenue
plt.figure(figsize=(9,6))
df.groupby(['YEAR_ID'])['SALES'].sum().plot()
plt.xlabel('Year')
plt.ylabel('Revenue')
plt.title('Annual Revenue')
plt.xticks(np.arange(2003,2006,1))
plt.show()

As we don't have complete data for 2005, analyzing the Annual Revenue can be misleading. Instead, we can analyze Monthly Revenue.

# Monthly Revenue

In [None]:
#Monthly Revenue
plt.figure(figsize=(9,6))

monthly_revenue = df.groupby(['YEAR_ID','MONTH_ID'])['SALES'].sum().reset_index()
sns.lineplot(x="MONTH_ID", y="SALES",hue="YEAR_ID", data=monthly_revenue)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthly Revenue')
plt.show()

This clearly shows that revenue grows especially in October and November, which can be the result of seasonality (Thanksgiving and other festivities). We can also observe that 2005 performs better than the other years, with the highest sales in every month for which we have data (Jan - May). The reason behind this rise in 2005 can be investigated further to help maintain high sales in the future.

# Monthly Revenue Growth Rate

In [None]:
monthly_revenue['MONTHLY GROWTH'] = monthly_revenue['SALES'].pct_change()

In [None]:
monthly_revenue.head()

In [None]:
#Monthly Sales Growth Rate
plt.figure(figsize=(9,6))
sns.lineplot(x="MONTH_ID", y="MONTHLY GROWTH",hue="YEAR_ID", data=monthly_revenue)
plt.xlabel('Month')
plt.ylabel('Growth Rate')
plt.title('Monthly Sales Growth Rate')
plt.show()

Apart from the expected high/low growth rates during the seasonal months, there is a high growth rate from Apr 2005 to May 2005.
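
As a small illustration of how `pct_change` produces the growth rate, the sketch below uses made-up revenue figures (the values and index labels are hypothetical, not taken from the data set). Each entry is `(current - previous) / previous`, so the first entry has no predecessor and is `NaN`. Because the notebook applies `pct_change` to one frame sorted by year and month, each January is compared with the preceding December.

```python
import pandas as pd

# Hypothetical monthly revenue figures, for illustration only
revenue = pd.Series([100.0, 120.0, 90.0],
                    index=['2003-01', '2003-02', '2003-03'])

# pct_change computes (current - previous) / previous for each row;
# the first row has no previous value, so its growth is NaN
growth = revenue.pct_change()
print(growth)
```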

# Top 10 countries by Sales

In [None]:
plt.figure(figsize=(9,6))
#Total sales per country, keeping only the ten largest
top_countries = df.groupby(['COUNTRY'])['SALES'].sum().sort_values(ascending=False).head(10)
top_countries.plot(kind = 'bar')
plt.title('Top 10 countries by Sales')
plt.xlabel('Country')
plt.ylabel('Total Sales')
plt.show()

# Monthly Active Customers

In [None]:
#Zero-padded YYYYMM key so that the months sort chronologically
df['YEAR_MONTH'] = df['YEAR_ID'].map(str)+df['MONTH_ID'].map(str).map(lambda x: x.rjust(2,'0'))
monthly_active = df.groupby(['YEAR_MONTH'])['CUSTOMERNAME'].nunique().reset_index()
monthly_active.plot(kind='bar',x='YEAR_MONTH',y='CUSTOMERNAME')
plt.title('Monthly Active Customers')
plt.xlabel('Month/Year')
plt.ylabel('Number of Unique Customers')
plt.xticks(rotation=90)
plt.show()

As expected, customers are most active during October and November. The number of active customers increased from 2003 to 2004, which indicates that the company is succeeding at retaining old customers and acquiring new ones.

# Average Sales per Order

In [None]:
#Average Sales per Order
average_revenue = df.groupby(['YEAR_ID','MONTH_ID'])['SALES'].mean().reset_index()
plt.figure(figsize=(10,6))
sns.lineplot(x="MONTH_ID", y="SALES",hue="YEAR_ID", data=average_revenue)
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.title('Average Sales per Order')
plt.show()

# New Customers Growth Rate

A new customer is one who made their first purchase within the time window we define, which is a month in this analysis.
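
To make the definition concrete, here is a minimal sketch on a hypothetical five-row order table (the customers and periods are invented for illustration). It finds each customer's first month and then counts how many customers start in each month. It assumes the period is a zero-padded `YYYYMM` string, so the string minimum is also the earliest month.

```python
import pandas as pd

# Hypothetical orders: customer name and the YYYYMM period of each order
orders = pd.DataFrame({
    'CUSTOMERNAME': ['A', 'A', 'B', 'C', 'C'],
    'YEAR_MONTH':   ['200301', '200302', '200302', '200302', '200303'],
})

# The first month in which each customer appears
first_purchase = orders.groupby('CUSTOMERNAME')['YEAR_MONTH'].min()

# Number of customers whose first purchase falls in each month
new_customers = first_purchase.value_counts().sort_index()
print(new_customers)
```

A month with no first-time buyers (here `200303`) simply does not appear in the result, which is worth remembering when reading a growth rate computed from consecutive rows.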

In [None]:
#New Customers Growth Rate
df_first_purchase = df.groupby('CUSTOMERNAME').YEAR_MONTH.min().reset_index()
df_first_purchase.columns = ['CUSTOMERNAME','FirstPurchaseDate']

plt.figure(figsize=(10,6))
df_first_purchase.groupby(['FirstPurchaseDate'])['CUSTOMERNAME'].nunique().pct_change().plot(kind='bar')
plt.title('New Customers Growth Rate')
plt.xlabel('YearMonth')
plt.ylabel('Percentage Growth Rate')
plt.show()

The highest growth rate is observed in February 2002. This can be investigated further to better understand what factors contributed to the growth.

# Segmentation with number of clusters chosen randomly

In [None]:
df['ORDERDATE'] = df['ORDERDATE'].dt.date #keep only the date part of each timestamp
df.head()

**Calculate Recency, Frequency and Monetary value for each customer**

We assume that the analysis is run on the day after the latest order date in the data set, so we create a variable 'snapshot_date' set to one day after that latest date.

**Recency** : Recency is the number of days between the customer's latest order date and the snapshot date<br>
**Frequency**: Number of purchases made by the customer<br>
**MonetaryValue**: Revenue generated by the customer
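
The three definitions above can be sketched on a tiny, made-up order table (the names, dates and amounts are hypothetical). The snapshot date is taken as one day after the latest order, so the most recent customer has a recency of 1 rather than 0.

```python
import datetime as dt
import pandas as pd

# Hypothetical order lines, for illustration only
orders = pd.DataFrame({
    'CUSTOMERNAME': ['A', 'A', 'B'],
    'ORDERDATE': pd.to_datetime(['2005-05-01', '2005-05-29', '2005-05-31']),
    'SALES': [100.0, 50.0, 200.0],
})

# Snapshot date: one day after the latest order in the data
snapshot = orders['ORDERDATE'].max() + dt.timedelta(days=1)

rfm = orders.groupby('CUSTOMERNAME').agg(
    Recency=('ORDERDATE', lambda d: (snapshot - d.max()).days),  # days since last order
    Frequency=('ORDERDATE', 'count'),                            # number of order rows
    MonetaryValue=('SALES', 'sum'),                              # total revenue
)
print(rfm)
```

Note that counting rows treats every order line as a purchase; counting distinct order numbers would be an alternative if one order can span several rows.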

In [None]:
# Calculate Recency, Frequency and Monetary value for each customer
snapshot_date = df['ORDERDATE'].max() + dt.timedelta(days=1) #one day after the latest order date in the data set
df_RFM = df.groupby(['CUSTOMERNAME']).agg({
    'ORDERDATE': lambda x: (snapshot_date - x.max()).days,
    'ORDERNUMBER': 'count',
    'SALES':'sum'})

#Renaming the columns
df_RFM.rename(columns={'ORDERDATE': 'Recency',
                   'ORDERNUMBER': 'Frequency',
                   'SALES': 'MonetaryValue'}, inplace=True)


In [None]:
df_RFM.head()

**Dividing the customers into 4 segments (number chosen arbitrarily)**

**Recency/Frequency/MonetaryValue**: Level 4 > Level 3 > Level 2 > Level 1

* The lower the recency, the higher the Recency level<br>
* The higher the number of orders, the higher the Frequency level<br>
* The higher the monetary value, the higher the MonetaryValue level
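
The level rules above can be illustrated with `pd.qcut` on a few hypothetical values (eight invented customers). The only difference between the metrics is the direction of the labels: recency labels are reversed so that the most recent customers land in level 4.

```python
import pandas as pd

# Hypothetical recency (days) and spend values for eight customers
recency = pd.Series([1, 5, 10, 20, 40, 80, 160, 320])
spend = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Recency labels run 4..1 so the most recent customers get level 4
r_level = pd.qcut(recency, q=4, labels=[4, 3, 2, 1]).astype(int)
# Spend labels run 1..4 so the biggest spenders get level 4
m_level = pd.qcut(spend, q=4, labels=[1, 2, 3, 4]).astype(int)

print(pd.DataFrame({'Recency': recency, 'R': r_level,
                    'Spend': spend, 'M': m_level}))
```

Casting the labels to `int` makes them easy to add up later, since `pd.qcut` returns a categorical column.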


In [None]:
#Dividing into segments

# Quartile levels 1-4 for spend and frequency (higher value, higher level)
MonetaryValue_quartile = pd.qcut(df_RFM['MonetaryValue'], q=4, labels=range(1,5))
# Recency labels are reversed: the most recent customers get level 4
Recency_quartile = pd.qcut(df_RFM['Recency'], q=4, labels=list(range(4, 0, -1)))
Frequency_quartile = pd.qcut(df_RFM['Frequency'], q=4, labels=range(1,5))


# Assign the quartile levels to the R, F and M columns
df_RFM['R'] = Recency_quartile
df_RFM['F'] = Frequency_quartile
df_RFM['M'] = MonetaryValue_quartile

df_RFM.head()

**Calculating RFM Score**

In [None]:
# Calculate RFM_Score (the quartile labels are categorical, so cast them to int before summing)
df_RFM['RFM_Score'] = df_RFM[['R','F','M']].astype(int).sum(axis=1)
df_RFM.head()

**Labelling the levels**

* RFM Score >= 10 - **High Value Customer**<br>
* RFM Score >= 6 and RFM Score < 10 - **Mid Value Customer**<br>
* RFM Score < 6 - **Low Value Customer**
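
As an alternative to a row-by-row function, the same labelling can be sketched with `pd.cut`. This is only an illustration on hypothetical scores, and it assumes the thresholds used in the code of this notebook: a score of 10 or more is High, 6 to 9 is Mid, and below 6 is Low.

```python
import pandas as pd

# Hypothetical RFM scores (the sum of three levels, so between 3 and 12)
scores = pd.Series([3, 5, 6, 9, 10, 12])

# Bins are right-closed: (2, 5] -> Low, (5, 9] -> Mid, (9, 12] -> High
levels = pd.cut(scores, bins=[2, 5, 9, 12],
                labels=['Low Value Customer',
                        'Mid Value Customer',
                        'High Value Customer'])
print(levels.tolist())
```

Because the scores are integers, the right-closed bin `(9, 12]` is equivalent to "10 or more", and `(5, 9]` to "6 to 9".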


In [None]:
#Naming Levels
# Define rfm_level function; it receives one row of df_RFM at a time
# (np.bool was removed from recent NumPy versions, so plain comparisons are used)
def rfm_level(row):
    if row['RFM_Score'] >= 10:
        return 'High Value Customer'
    elif row['RFM_Score'] >= 6:
        return 'Mid Value Customer'
    else:
        return 'Low Value Customer'

# Create a new variable RFM_Level
df_RFM['RFM_Level'] = df_RFM.apply(rfm_level, axis=1)

# Print the header with top 5 rows to the console
df_RFM.head()

In [None]:
plt.figure(figsize=(10,6))
df_RFM['RFM_Level'].value_counts(normalize = True).plot(kind='bar')
plt.title('RFM_level Distribution')
plt.xlabel('RFM_Level')
plt.ylabel('% Proportion')
plt.show()

**Analyzing the customer segments**

In [None]:
#Analyzing customer segments
# Calculate average values for each RFM_Level, and return a size of each segment 
rfm_level_agg = df_RFM.groupby(['RFM_Level']).agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue':['mean','count']}).round(1)

# Print the aggregated dataset
print(rfm_level_agg)

# Segmentation using KMeans Clustering

# Data Preprocessing for KMeans

**K Means Assumptions**

* All variables have symmetrical (Normal) Distribution <br>
* All Variables have same average value(approx)<br>
* All Variables have same variance(approx)
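
One way to check the symmetry assumption numerically, rather than only by eye, is to compare skewness before and after a log transform. The sketch below uses synthetic log-normal values as a stand-in for a right-skewed metric such as MonetaryValue; the data and parameters are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a metric such as MonetaryValue
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=1000))

# Skewness near 0 indicates a roughly symmetric distribution
skew_raw = values.skew()
skew_log = np.log(values).skew()
print(round(skew_raw, 2), round(skew_log, 2))
```

The log transform only works for strictly positive values, which holds for recency here because the snapshot date is one day after the latest order.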

**Check the distribution of the variables**

In [None]:
data = df_RFM[['Recency','Frequency','MonetaryValue']]
data.head()

In [None]:
plt.figure(figsize=(10,6))

plt.subplot(1,3,1)
data['Recency'].plot(kind='hist')
plt.title('Recency')

plt.subplot(1,3,2)
data['Frequency'].plot(kind='hist')
plt.title('Frequency')

plt.subplot(1,3,3)
data['MonetaryValue'].plot(kind='hist')
plt.xticks(rotation = 90)
plt.title('MonetaryValue')

plt.tight_layout()
plt.show()

**Removing the skewness by performing log transformation on the variables**

In [None]:
data_log = np.log(data)

In [None]:
data_log.head()

**Distribution of Recency, Frequency and MonetaryValue after Log Transformation**

In [None]:
plt.figure(figsize=(10,6))

# histplot with a KDE overlay replaces the deprecated distplot;
# stat='density' keeps the three variables on a comparable scale
sns.histplot(data_log['Recency'], kde=True, stat='density', label='Recency')

sns.histplot(data_log['Frequency'], kde=True, stat='density', label='Frequency')

sns.histplot(data_log['MonetaryValue'], kde=True, stat='density', label='MonetaryValue')

plt.title('Distribution of Recency, Frequency and MonetaryValue after Log Transformation')
plt.legend()
plt.show()

**Standardizing the variables using StandardScaler() for equal variance and mean**

In [None]:
# Initialize a scaler
scaler = StandardScaler()

# Fit the scaler
scaler.fit(data_log)

# Scale and center the data
data_normalized = scaler.transform(data_log)

# Create a pandas DataFrame
data_normalized = pd.DataFrame(data_normalized, index=data_log.index, columns=data_log.columns)

# Print summary statistics
data_normalized.describe().round(2)
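
For reference, the scaling step amounts to subtracting each column's mean and dividing by its population standard deviation. The sketch below checks this on a small hypothetical array (the numbers are invented) by comparing a manual z-score with the output of `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical log-transformed values for two features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Manual z-score using the population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler applies the same centering and scaling per column
scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```

After scaling, every column has mean 0 and unit variance, so no single metric dominates the Euclidean distances that KMeans relies on.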

# Choosing number of Clusters using Elbow Method

In [None]:
# Fit KMeans and calculate SSE for each k
sse={}
for k in range(1, 21):
    kmeans = KMeans(n_clusters=k, random_state=1)
    kmeans.fit(data_normalized)
    sse[k] = kmeans.inertia_ 

    
plt.figure(figsize=(10,6))
# Add the plot title "The Elbow Method"
plt.title('The Elbow Method')

# Add X-axis label "k"
plt.xlabel('k')

# Add Y-axis label "SSE"
plt.ylabel('SSE')

# Plot SSE values for each key in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.text(4.5,60,"Largest Angle",bbox=dict(facecolor='lightgreen', alpha=0.5))
plt.show()
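
Reading an elbow by eye can be subjective. As a possible cross-check, which is not part of the original analysis, the silhouette score can be computed alongside the SSE. The sketch below does this on synthetic blobs with three known groups; the data, the range of k and the use of `silhouette_score` are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups, for illustration only
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=1)

sse, silhouette = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    sse[k] = model.inertia_                             # within-cluster sum of squares
    silhouette[k] = silhouette_score(X, model.labels_)  # higher is better

# The k with the highest silhouette score
best_k = max(silhouette, key=silhouette.get)
print(best_k)
```

SSE always decreases as k grows, which is why the elbow looks for the point of diminishing returns, whereas the silhouette score has a maximum that can be picked directly.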

# Running KMeans with 5 clusters

In [None]:
# Initialize KMeans
kmeans = KMeans(n_clusters=5, random_state=1) 

# Fit k-means clustering on the normalized data set
kmeans.fit(data_normalized)

# Extract cluster labels
cluster_labels = kmeans.labels_

# Assigning Cluster Labels to Raw Data
# Create a DataFrame by adding a new cluster label column
data_rfm = data.assign(Cluster=cluster_labels)
data_rfm.head()

In [None]:
# Group the data by cluster
grouped = data_rfm.groupby(['Cluster'])

# Calculate average RFM values and segment sizes per cluster value
grouped.agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
  }).round(1)


# Building Customer Personas

Customer personas can be built from the summary statistics of the RFM values or from a snake plot. A snake plot is a market research technique used to compare segments.
A visual representation of each segment's attributes helps us determine the relative importance of those attributes.
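
A snake plot needs the data in long format, with one row per customer and metric. The sketch below shows that reshaping step on two hypothetical customers with made-up standardized values; the frame and column values are for illustration only.

```python
import pandas as pd

# Hypothetical standardized RFM values with cluster labels
wide = pd.DataFrame({
    'CUSTOMERNAME': ['A', 'B'],
    'Cluster': [0, 1],
    'Recency': [-1.0, 1.0],
    'Frequency': [0.5, -0.5],
    'MonetaryValue': [0.8, -0.8],
})

# One row per (customer, metric) pair, ready for a line plot over 'Metric'
long_df = pd.melt(wide, id_vars=['CUSTOMERNAME', 'Cluster'],
                  value_vars=['Recency', 'Frequency', 'MonetaryValue'],
                  var_name='Metric', value_name='Value')

# Mean value per cluster and metric: the points a snake plot connects
print(long_df.groupby(['Cluster', 'Metric'])['Value'].mean().unstack())
```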

# Snake Plot

In [None]:
# Attach the cluster labels to the standardized values so the three metrics share one scale
data_normalized_rfm = data_normalized.assign(Cluster=cluster_labels)
data_rfm_melt = pd.melt(data_normalized_rfm.reset_index(), id_vars=['CUSTOMERNAME', 'Cluster'],
                        value_vars=['Recency', 'Frequency', 'MonetaryValue'], 
                        var_name='Metric', value_name='Value')

plt.figure(figsize=(10,6))
# Add the plot title
plt.title('Snake plot of normalized variables')

# Add the x axis label
plt.xlabel('Metric')

# Add the y axis label
plt.ylabel('Value')

# Plot a line for each value of the cluster variable
sns.lineplot(data=data_rfm_melt, x='Metric', y='Value', hue='Cluster')
plt.show()


# Calculating relative importance of each attribute

In [None]:
# Calculate average RFM values for each cluster
cluster_avg = data_rfm.groupby(['Cluster']).mean() 
print(cluster_avg)

In [None]:
# Calculate average RFM values for the total customer population
population_avg = data.mean()
print(population_avg)

In [None]:
# Calculate relative importance of cluster's attribute value compared to population
relative_imp = cluster_avg / population_avg - 1

# Print relative importance score rounded to 2 decimals
print(relative_imp.round(2))
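
To show how the score reads, here is the same calculation on four hypothetical customers in two clusters (all values invented). A score of 0 means the cluster average equals the population average; the further from 0, the more that attribute distinguishes the cluster.

```python
import pandas as pd

# Hypothetical RFM values for four customers in two clusters
toy = pd.DataFrame({
    'Recency': [10.0, 30.0, 100.0, 260.0],
    'MonetaryValue': [300.0, 500.0, 100.0, 100.0],
    'Cluster': [0, 0, 1, 1],
})

cluster_avg = toy.groupby('Cluster').mean()
population_avg = toy.drop(columns='Cluster').mean()

# 0 means the cluster matches the population average; the sign gives the direction
relative_imp = cluster_avg / population_avg - 1
print(relative_imp.round(2))
```

In this toy case cluster 0 has a recency 80% below the population average and a monetary value 60% above it, which would describe recent, high-spending customers.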

In [None]:
#Plot Relative Importance

# Initialize a plot with a figure size of 8 by 2 inches 
plt.figure(figsize=(8, 2))

# Add the plot title
plt.title('Relative importance of attributes')

# Plot the heatmap
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
plt.show()