## 1. Introduction
The dataset contains sales records for 2016, covering 64,682 transactions and 22,625 customer IDs. The columns are Date, Customer ID, Transaction ID, Category, SKU, Quantity, and Sales Amount.

## 2. Objectives
This analysis addresses the following questions.
1. What is the retention rate of customers, measured by the number of months since their first purchase (used here as the sign-up date)?
2. Which customers are in the top 20% frequency quantile with valid activity in the last 14 days? Which are in the top 5% quantile in recency, frequency, and monetary value? Who are the top 10 most valuable customers?
3. Who are the most valuable customers according to K-Means clustering?
4. How do the outcomes of K-Means clustering and the linear quantile method differ?

## 3. Method
- Python
The analysis is split into two main parts. The first part segments the customers with a linear quantile method, which makes it clear who purchased recently, how often they purchase, and how much they spend.

The second part is a K-Means clustering analysis. This method groups the customers into 3 clusters based on three criteria: recency, frequency, and monetary value. The result is a group of customers marked as the most important for this business.

Lastly, the analysis compares the outcomes of the linear quantile method and the K-Means clustering method, since the two methods segment the customers differently.

## 4. Prepare
- First of all, import the libraries that will be used and take a look at the dataset.

In [None]:
import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
from datetime import timedelta
import matplotlib.dates as mdates
import datetime as dt
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [None]:
data = pd.read_csv('../input/retail-store-sales-transactions/scanner_data.csv')
data

## 5. Process
- Determine whether there are any null values in the dataset.

In [None]:
data.info()
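
`data.info()` reports non-null counts per column, which then have to be compared with the row count by eye. As a minimal sketch, the same check can be made explicit with `isna()`. The `toy` frame and its values are made up for illustration and stand in for the sales data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the sales data (values are made up).
toy = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4],
    "Sales_Amount": [3.5, np.nan, 7.0, 1.2],
})

# Count missing values per column; all zeros would mean no nulls.
null_counts = toy.isna().sum()
print(null_counts)

# A single yes/no answer for the whole frame.
has_nulls = bool(toy.isna().any().any())
print(has_nulls)
```

On the real data, `data.isna().sum()` would give the per-column counts directly.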

## 6. Analyze
- Calculate the duration in months between each customer's first purchase and each of their transactions.

In [None]:
def get_day(x): 
    return dt.datetime(x.year, x.month, x.day)

def get_month(x):
    return dt.datetime(x.year, x.month, 1)

data['Date']= pd.to_datetime(data['Date'])
data['invoice_date'] = data['Date'].apply(get_day)
group_invoice = data.groupby('Customer_ID')['invoice_date']
data['first_date'] = group_invoice.transform('min')
data['last_date'] = group_invoice.transform('max')
data['invoice_month'] = data['Date'].apply(get_month)
group_month = data.groupby('Customer_ID')['invoice_date']
data['first_month'] = group_month.transform('min').apply(get_month)
data.head()

In [None]:
def get_ymd (df, column):
    year=df[column].dt.year
    month=df[column].dt.month
    day=df[column].dt.day
    return year, month, day

invoice_year, invoice_month, _ = get_ymd(data, 'invoice_date')
first_year, first_month, _ = get_ymd(data, 'first_date')

years_diff = invoice_year - first_year
months_diff = invoice_month - first_month

data['duration_month'] = years_diff * 12 + months_diff * 1 + 1
data.head()
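
To make the `duration_month` formula concrete, here is a small worked example on made-up dates (the `toy` frame is hypothetical). The `+ 1` means the month of the first purchase counts as month 1, and the year term keeps the count correct across a year boundary.

```python
import pandas as pd

# Made-up purchase dates and first-purchase dates for three rows.
toy = pd.DataFrame({
    "invoice_date": pd.to_datetime(["2016-01-15", "2016-03-02", "2017-02-10"]),
    "first_date": pd.to_datetime(["2016-01-15", "2016-01-15", "2016-11-30"]),
})

years_diff = toy["invoice_date"].dt.year - toy["first_date"].dt.year
months_diff = toy["invoice_date"].dt.month - toy["first_date"].dt.month

# Same formula as above: whole-month difference, starting the count at 1.
toy["duration_month"] = years_diff * 12 + months_diff + 1
print(toy["duration_month"].tolist())
```

The third row shows the year term at work: November to February spans a year boundary and still counts as month 4.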

In [None]:
group_cohort = data.groupby(['first_month','duration_month'])
cohort_data = group_cohort['Customer_ID'].apply(pd.Series.nunique).reset_index()
cohort_data

- Get the number of customers by how many months they have been active, then build the retention matrix with a pivot table.

In [None]:
cohort_counts = cohort_data.pivot(index='first_month', columns='duration_month', values='Customer_ID')
cohort_sizes = cohort_counts.iloc[:,0]
retention = cohort_counts.divide(cohort_sizes, axis=0)
retention.round(4)*100
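
The retention matrix divides every row of the pivot table by its first column, the cohort size. The sketch below repeats those steps on hypothetical cohort counts so the arithmetic can be checked by hand. The numbers are invented and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical counts of unique customers per (first_month, duration_month).
cohort_data = pd.DataFrame({
    "first_month": pd.to_datetime(["2016-01-01"] * 3 + ["2016-02-01"] * 2),
    "duration_month": [1, 2, 3, 1, 2],
    "Customer_ID": [100, 40, 25, 80, 20],
})

cohort_counts = cohort_data.pivot(
    index="first_month", columns="duration_month", values="Customer_ID"
)

# The first column holds each cohort's size in its first month.
cohort_sizes = cohort_counts.iloc[:, 0]

# Divide row-wise so that every cohort starts at 100%.
retention = cohort_counts.divide(cohort_sizes, axis=0)
print((retention * 100).round(1))
```

The February cohort has no value for month 3, so that cell stays `NaN`, which is why the lower-right part of the heatmap is empty.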

- Plot a heatmap to visualize the retention rates by duration and by the month of the customer's first purchase.

In [None]:
plt.subplots(figsize=(16, 8))
sns.set()
ax = sns.heatmap(data=retention, annot=True, fmt='.2%', vmin=0.0, vmax=0.4, cmap='BuGn')
# Label each row with the start date of its cohort.
ax.set_yticklabels(retention.index.strftime('%b-%d-%Y'))
plt.title('Retention Rate')
plt.show()

- Calculate the average sales amount for each cohort and duration month, and plot a heatmap to make the pattern easy to read.

In [None]:
group_sales = data.groupby(['first_month','duration_month']) 
cohort_sales = group_sales['Sales_Amount'].mean()
cohort_sales = cohort_sales.reset_index()
average_sales = cohort_sales.pivot(index='first_month', columns='duration_month', values='Sales_Amount')
average_sales.round(2)

In [None]:
plt.subplots(figsize=(16, 8))
sns.set()
ax = sns.heatmap(data=average_sales, annot=True, fmt='.2f',cmap='Blues')
# Label each row with the start date of its cohort, taken from the plotted frame itself.
ax.set_yticklabels(average_sales.index.strftime('%b-%d-%Y'))
plt.title('Average Sales')
plt.show()

In [None]:
data

- Check the date of the last transaction. It is Dec 31st, 2016, so Jan 1st, 2017 is used as the analysis date for calculating recency. (The lower the recency, the better.)

In [None]:
print(data['invoice_date'].min())
print(data['invoice_date'].max())

In [None]:
def get_ymd (df, column):
    year=df[column].dt.year
    month=df[column].dt.month
    day=df[column].dt.day
    return year, month, day

last_year, last_month, last_day = get_ymd(data, 'last_date')

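# Days between each customer's last purchase and the analysis date (Jan 1st, 2017).
# Every month is counted as 30 days, so values outside December are approximate.
years_diff_last = 2016 - last_year
months_diff_last = 12 - last_month
days_diff_last = 31 - last_day

data['recency'] = years_diff_last * 365 + months_diff_last * 30 + days_diff_last + 1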
data.head()
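
The recency calculation above counts every month as 30 days. It is exact for December dates but drifts for earlier months. As an alternative sketch, not what this notebook uses, pandas can subtract the dates directly. The `snapshot_date` name and the toy dates are assumptions for illustration.

```python
import pandas as pd

# Analysis date: the day after the last transaction in the data.
snapshot_date = pd.Timestamp("2017-01-01")

# Made-up last-purchase dates.
toy = pd.DataFrame({
    "last_date": pd.to_datetime(["2016-12-31", "2016-12-18", "2016-02-28"]),
})

# Exact number of days between the last purchase and the analysis date.
toy["recency_exact"] = (snapshot_date - toy["last_date"]).dt.days

# The 30-day-month approximation, for comparison.
years_diff = 2016 - toy["last_date"].dt.year
months_diff = 12 - toy["last_date"].dt.month
days_diff = 31 - toy["last_date"].dt.day
toy["recency_approx"] = years_diff * 365 + months_diff * 30 + days_diff + 1

print(toy)
```

The two columns agree for the December rows and differ by a few days for the February row, so thresholds such as "within 14 days" are unaffected by the approximation.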

- Generate a new dataframe for recency, frequency, and monetary.

In [None]:
data_recency = data.groupby('Customer_ID')['recency'].min().reset_index()
data_frequency = data.groupby('Customer_ID').size().reset_index(name='count')
data_monetary = data.groupby('Customer_ID')['Sales_Amount'].sum().reset_index()

data_rf = pd.merge(left=data_recency, right=data_frequency, how='inner', on='Customer_ID')
data_rfm = pd.merge(left=data_rf, right=data_monetary, how='inner', on='Customer_ID')
data_rfm.rename(columns={'count': 'frequency', 'Sales_Amount': 'monetary'}, inplace=True)
data_rfm = data_rfm.set_index('Customer_ID')
data_rfm
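
The three group-bys and two merges above could also be written as one named aggregation. The sketch below shows that variant on a made-up frame. It is an alternative formulation, assuming the same column names as the notebook, and it yields one row per customer indexed by `Customer_ID`.

```python
import pandas as pd

# Made-up transaction rows: customer 1 has 2 rows, customer 2 has 3, customer 3 has 1.
toy = pd.DataFrame({
    "Customer_ID": [1, 1, 2, 2, 2, 3],
    "recency": [5, 5, 40, 40, 40, 12],
    "Sales_Amount": [10.0, 2.5, 1.0, 4.0, 3.0, 20.0],
})

rfm = toy.groupby("Customer_ID").agg(
    recency=("recency", "min"),         # days since the last purchase
    frequency=("Sales_Amount", "size"), # number of rows per customer
    monetary=("Sales_Amount", "sum"),   # total spend
)
print(rfm)
```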

- Subset the data to customers in the top 20% frequency quantile whose last activity was within the past 2 weeks. This group contains 1,108 customers.

In [None]:
# Frequency threshold for the top 20% of customers.
freq_q80 = data_rfm['frequency'].quantile(q = 0.8)
print('Top 20% shopping frequency quantile:', freq_q80)

data_20qf = data_rfm[data_rfm['frequency'] >= freq_q80]
data_20qf_14d = data_20qf[data_20qf['recency'] <= 14]
data_20qf_14d

- Visualize the data to quickly view the distribution of these 1,108 customers.

In [None]:
fig=go.Figure()
fig.add_trace(
    go.Scatter(x=data_rfm['frequency'], y=data_rfm['monetary'], name='All customers', mode='markers', opacity=0.5)
)
fig.add_trace(
    go.Scatter(x=data_20qf_14d['frequency'], y=data_20qf_14d['monetary'], name='Top 20% frequency purchasing in 14 days', mode='markers')
)
fig.update_layout({  
      'showlegend':True, 'legend':{'x':0.02, 'y':0.96, 'bgcolor':'rgb(246, 228, 129)'}
      })
fig.update_xaxes(
        title_text = "Purchase Frequency",
        title_font = {"size": 16},
        title_standoff = 12)
fig.update_yaxes(
        title_text = "Monetary",
        title_font = {"size": 16},
        title_standoff = 12)
fig.show()

- Subset the data to the top 5% quantile in recency, frequency, and monetary. This group contains 224 customers.

In [None]:
print('Top 5% recency quantile:', data_rfm['recency'].quantile(q = 0.05))
print('Top 5% shopping frequency quantile:', data_rfm['frequency'].quantile(q = 0.95))
print('Top 5% purchase amount quantile:', data_rfm['monetary'].quantile(q = 0.95))

sel_data_5qr = data_rfm['recency'] <= data_rfm['recency'].quantile(q = 0.05)
sel_data_5qf = data_rfm['frequency'] >= data_rfm['frequency'].quantile(q = 0.95)
sel_data_5qm = data_rfm['monetary'] >= data_rfm['monetary'].quantile(q = 0.95)
data_5rfm = data_rfm[sel_data_5qr & sel_data_5qf & sel_data_5qm]
data_5rfm
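
The direction of each quantile cut is easy to get wrong. Recency is "better" when small, so it uses the lower tail (`<=` the q-quantile), while frequency and monetary use the upper tail (`>=` the (1 - q)-quantile). A minimal sketch on ten made-up customers, with `q = 0.2` chosen only so the toy data produces a visible result:

```python
import pandas as pd

# Ten made-up customers; IDs 101..110 are hypothetical.
toy = pd.DataFrame(
    {
        "recency": [1, 3, 10, 30, 60, 90, 120, 200, 250, 300],
        "frequency": [50, 2, 40, 5, 1, 3, 2, 1, 4, 2],
        "monetary": [900.0, 20.0, 700.0, 50.0, 5.0, 30.0, 15.0, 8.0, 45.0, 12.0],
    },
    index=pd.Index(range(101, 111), name="Customer_ID"),
)

q = 0.2
recent = toy["recency"] <= toy["recency"].quantile(q)               # lower tail
frequent = toy["frequency"] >= toy["frequency"].quantile(1 - q)     # upper tail
big_spender = toy["monetary"] >= toy["monetary"].quantile(1 - q)    # upper tail

# A customer must pass all three filters to be selected.
top = toy[recent & frequent & big_spender]
print(top)
```

Customer 103 buys often and spends a lot but is not recent enough, so only customer 101 passes all three filters. This is why the combined group is much smaller than any single 5% slice.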

- Again, visualize the data to see where these 224 customers are.

In [None]:
fig=go.Figure()
fig.add_trace(
    go.Scatter(x=data_rfm['frequency'], y=data_rfm['monetary'], name='All customers', mode='markers', opacity=0.5)
)
fig.add_trace(
    go.Scatter(x=data_5rfm['frequency'], y=data_5rfm['monetary'], name='Top 5% quantile in recency, frequency and spending', mode='markers')
)
fig.update_layout({  
      'showlegend':True, 'legend':{'x':0.02, 'y':0.96, 'bgcolor':'rgb(246, 228, 129)'}
      })
fig.update_xaxes(
        title_text = "Purchase Frequency",
        title_font = {"size": 16},
        title_standoff = 12)
fig.update_yaxes(
        title_text = "Monetary",
        title_font = {"size": 16},
        title_standoff = 12)
fig.show()

- Subset the data to the top 1% quantile in recency, frequency, and monetary, which yields 17 customers, and then sort by monetary to pick the top 10 most valuable customers.

In [None]:
print('Top 1% recency quantile:', data_rfm['recency'].quantile(q = 0.01))
print('Top 1% shopping frequency quantile:', data_rfm['frequency'].quantile(q = 0.99))
print('Top 1% purchase amount quantile:', data_rfm['monetary'].quantile(q = 0.99))

sel_data_1qr = data_rfm['recency'] <= data_rfm['recency'].quantile(q = 0.01)
sel_data_1qf = data_rfm['frequency'] >= data_rfm['frequency'].quantile(q = 0.99)
sel_data_1qm = data_rfm['monetary'] >= data_rfm['monetary'].quantile(q = 0.99)
data_sel_rfm = data_rfm[sel_data_1qr & sel_data_1qf & sel_data_1qm]
data_sel_rfm

In [None]:
data_top10 = data_sel_rfm.sort_values(by='monetary', ascending=False).head(10)
data_top10

- Highlight the top 10 customers in a scatter plot.

In [None]:
fig=go.Figure()
fig.add_trace(
    go.Scatter(x=data_rfm['frequency'], y=data_rfm['monetary'], name='All customers', mode='markers', opacity=0.3)
)
fig.add_trace(
    go.Scatter(x=data_top10['frequency'], y=data_top10['monetary'], name='Top 10 Customers', mode='markers')
)

annotaion_01={'x':'218', 'y':'3844.97', 'showarrow':True, 'arrowhead':4, 'xshift':-2,'yshift':8,'text':'ID-17450', 'textangle':-90, 'font':{'size':10, 'color':'green'}}
annotaion_02={'x':'143', 'y':'2057.69', 'showarrow':True, 'arrowhead':4, 'xshift':-2,'yshift':8,'text':'ID-16029', 'textangle':-90,'font':{'size':10, 'color':'green'}}
annotaion_03={'x':'66', 'y':'1690.49', 'showarrow':True, 'arrowhead':4, 'xshift':-2,'yshift':8,'text':'ID-13694', 'textangle':-90,'font':{'size':10, 'color':'green'}}
annotaion_04={'x':'114', 'y':'1563.10', 'showarrow':True, 'arrowhead':4, 'xshift':-2,'yshift':8,'text':'ID-13089', 'textangle':-90,'font':{'size':10, 'color':'green'}}
annotaion_05={'x':'72', 'y':'1515.32', 'showarrow':True, 'arrowhead':4, 'xshift':-2,'yshift':8,'text':'ID-17949', 'textangle':-90,'font':{'size':10, 'color':'green'}}
annotaion_06={'x':'71', 'y':'1485.44', 'showarrow':False, 'arrowhead':4, 'xshift':0,'yshift':-30,'text':'ID-15061', 'textangle':-90,'font':{'size':10, 'color':'green'}}
annotaion_07={'x':'228', 'y':'1429.91', 'showarrow':False, 'arrowhead':4, 'xshift':0,'yshift':-32,'text':'ID-14298', 'textangle':-90,'font':{'size':10, 'color':'green'}}
annotaion_08={'x':'52', 'y':'1427.39', 'showarrow':True, 'arrowhead':4, 'xshift':-2,'yshift':8,'text':'ID-17841', 'textangle':-90,'font':{'size':10, 'color':'green'}}
annotaion_09={'x':'83', 'y':'1063.42', 'showarrow':True, 'arrowhead':4, 'xshift':-2,'yshift':8,'text':'ID-13798', 'textangle':-90,'font':{'size':10, 'color':'green'}}
annotaion_10={'x':'112', 'y':'1028.61', 'showarrow':False, 'arrowhead':4, 'xshift':0,'yshift':-32,'text':'ID-16422', 'textangle':-90,'font':{'size':10, 'color':'green'}}

fig.update_layout({ 
    'annotations':[annotaion_01, annotaion_02, annotaion_03,
      annotaion_04, annotaion_05, annotaion_06,
      annotaion_07, annotaion_08, annotaion_09, annotaion_10], 
      'showlegend':True, 'legend':{'x':0.02, 'y':0.96, 'bgcolor':'rgb(246, 228, 129)'}
      })
fig.update_xaxes(
        title_text = "Purchase Frequency",
        title_font = {"size": 16},
        title_standoff = 12)
fig.update_yaxes(
        title_text = "Monetary",
        title_font = {"size": 16},
        title_standoff = 12)
fig.show()
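
The annotations above are typed in by hand, so they would need editing whenever the data changes. As a sketch of one alternative, the same dictionaries can be generated from the data frame. The `top` frame below is a two-row stand-in for `data_top10`, with values copied from the first two annotations, and every annotation gets the same arrow settings, unlike the hand-tuned offsets above.

```python
import pandas as pd

# Two-row stand-in for data_top10, using values from the annotations above.
top = pd.DataFrame(
    {"frequency": [218, 143], "monetary": [3844.97, 2057.69]},
    index=pd.Index([17450, 16029], name="Customer_ID"),
)

# One annotation dictionary per customer, positioned at the customer's point.
annotations = [
    {
        "x": row.frequency,
        "y": row.monetary,
        "text": f"ID-{row.Index}",
        "showarrow": True,
        "arrowhead": 4,
        "textangle": -90,
        "font": {"size": 10, "color": "green"},
    }
    for row in top.itertuples()
]
print([a["text"] for a in annotations])
```

The resulting list can be passed to the figure with `fig.update_layout(annotations=annotations)`.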

- The K-Means clustering analysis starts here. First, split the customers into 3 levels on each criterion to quickly review the size of each level. (The columns are named `*_quartile`, although with `q=3` they hold terciles.)

In [None]:
range_labels = list(range(3, 0, -1))
data_q=data_rfm.copy(deep=True)
data_q['recency_quartile']=pd.qcut(data_q['recency'].rank(method='first'), q=3, labels=range_labels)
data_q['frequency_quartile']=pd.qcut(data_q['frequency'].rank(method='first'), q=3, labels=range(1,4))
data_q['monetary_quartile']=pd.qcut(data_q['monetary'].rank(method='first'), q=3, labels=range(1,4))
data_q.head()
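
A small example of what `pd.qcut` does here, on a made-up recency series. Ranking with `method='first'` breaks ties so that the bin edges are always unique, and the reversed labels give the most recent customers the highest score.

```python
import pandas as pd

# Made-up recency values in days, sorted for readability.
recency = pd.Series([2, 5, 9, 30, 45, 60, 120, 200, 365], name="recency")

# Three equal-sized bins; low recency (recent purchase) gets the top score of 3.
r_score = pd.qcut(recency.rank(method="first"), q=3, labels=[3, 2, 1])

print(pd.DataFrame({"recency": recency, "r_score": r_score}))
```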

In [None]:
data_cat = (data_q['recency_quartile'].astype(str)).str.cat(data_q['frequency_quartile'].astype(str))
data_q['rfm_segment']=data_cat.str.cat(data_q['monetary_quartile'].astype(str))
data_q.head()

- Sum up the scores to rank the customers into 3 levels, the Top, Middle, and Low levels.

In [None]:
# qcut returns categorical columns, so convert them to integers before summing.
data_q['rfm_score'] = data_q[['recency_quartile','frequency_quartile','monetary_quartile']].astype(int).sum(axis=1)
data_q.head()

In [None]:
def rfm_level(df):
    if df['rfm_score'] >= 7:
        return 'Top'
    elif ((df['rfm_score'] >= 4) and (df['rfm_score'] < 7)):
        return 'Middle'
    else:
        return 'Low'
data_q['rfm_level'] = data_q.apply(rfm_level, axis=1)
data_q.head()
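
The row-wise `apply` above works, but the same mapping can be expressed as bins over the score. A minimal sketch with `pd.cut`, assuming scores run from 3 to 9 as they do when three 1-to-3 scores are summed:

```python
import pandas as pd

# Example scores covering the boundaries of each level.
scores = pd.Series([3, 4, 6, 7, 9], name="rfm_score")

# Bins are right-closed: (2, 3] -> Low, (3, 6] -> Middle, (6, 9] -> Top.
levels = pd.cut(scores, bins=[2, 3, 6, 9], labels=["Low", "Middle", "Top"])
print(levels.tolist())
```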

- Show the mean of each of the 3 criteria per level for the comparison that follows.

In [None]:
rfm_level_agg = data_q.groupby('rfm_level').agg({'recency': 'mean','frequency': 'mean', 'monetary': ['mean', 'count']
}).round(2)
rfm_level_agg.head()

- Because K-Means works best on variables with the same mean and standard deviation, normalize the data and review the key statistics to verify whether the data is ready to use.

In [None]:
scaler=StandardScaler()
scaler.fit(data_rfm)
data_normalized=scaler.transform(data_rfm)
data_normalized=pd.DataFrame(data_normalized, index=data_rfm.index, columns=data_rfm.columns)
print(data_normalized.describe().round(2))
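
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below checks that against a hand-written version on a tiny made-up array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up matrix: 3 rows, 2 columns on very different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled.mean(axis=0).round(6))  # column means after scaling
print(scaled.std(axis=0).round(6))   # column standard deviations after scaling
```

After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates the distance calculation in K-Means.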

- Plot the distributions of the 3 variables. As the plots show, they are highly skewed.

In [None]:
plt.subplot(3, 1, 1); sns.kdeplot(data_rfm['recency'])
plt.subplot(3, 1, 2); sns.kdeplot(data_rfm['frequency'])
plt.subplot(3, 1, 3); sns.kdeplot(data_rfm['monetary'])
plt.show()

- Apply a log transform to reduce the skew and get a better distribution.

In [None]:
data_log=np.log(data_rfm)
scaler=StandardScaler()
data_log.replace([np.inf, -np.inf], np.nan, inplace=True)
scaler.fit(data_log)
data_normalized=scaler.transform(data_log)
data_normalized = pd.DataFrame(data=data_normalized, index=data_log.index, columns=data_log.columns)
data_normalized.head()
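
`np.log` returns `-inf` for a value of 0, which is why the cell above replaces infinities with `NaN`. One possible alternative, shown here only as a sketch and not used in this notebook, is `np.log1p`, which computes log(1 + x) and maps 0 to 0. The `toy` frame is made up.

```python
import numpy as np
import pandas as pd

# Made-up values, including a zero monetary amount.
toy = pd.DataFrame({"frequency": [1, 3, 20], "monetary": [0.0, 9.0, 99.0]})

# Plain log: the zero becomes -inf (the warning is silenced for this demo).
with np.errstate(divide="ignore"):
    plain_log = np.log(toy)

# Shifted log: log(1 + x), so the zero stays finite.
shifted_log = np.log1p(toy)

print(plain_log)
print(shifted_log)
```

Note that `log1p` changes all values slightly, so cluster results would not be identical to the ones reported here.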

In [None]:
data_normalized=data_normalized.fillna(0)
kmeans = KMeans(n_clusters=3, random_state=1)
kmeans.fit(data_normalized)
cluster_labels = kmeans.labels_

data_rfm_k3 = data_rfm.assign(Cluster=cluster_labels)
grouped = data_rfm_k3.groupby(['Cluster'])

grouped.agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary': ['mean', 'count']
  }).round(1)

- Plot the log-transformed values to see the distribution. Although log recency is still slightly skewed, frequency and monetary look better than before.

In [None]:
plt.subplot(3, 1, 1); sns.kdeplot(data_log['recency'])
plt.subplot(3, 1, 2); sns.kdeplot(data_log['frequency'])
plt.subplot(3, 1, 3); sns.kdeplot(data_log['monetary'])
plt.show()

- Run K-Means clustering on the normalized data for a range of cluster counts. In the plot, the elbow with the sharpest angle marks the optimal number of clusters. Here, 2 or 3 clusters looks optimal.

In [None]:
sse={}
for k in range(1, 21):
    kmeans = KMeans(n_clusters=k, random_state=1)
    kmeans.fit(data_normalized)
    sse[k] = kmeans.inertia_

sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.title('The Elbow Method')
plt.xlabel('k')
plt.ylabel('SSE')
plt.show()
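
Reading an elbow by eye can be ambiguous, as the "2 or 3" result shows. A common complement, added here as a sketch and not part of the original analysis, is the silhouette score, where higher is better. The example uses synthetic data from `make_blobs` with three well-separated centers chosen for illustration, not the customer data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups (centers chosen for illustration).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=1,
)

sse = {}
silhouette = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    sse[k] = model.inertia_                             # within-cluster sum of squares
    silhouette[k] = silhouette_score(X, model.labels_)  # cohesion vs. separation

# Pick the k with the highest silhouette score.
best_k = max(silhouette, key=silhouette.get)
print(best_k)
```

On the customer data, the same loop could be run over `data_normalized` to see whether the silhouette score favors 2 or 3 clusters.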

- Add a 'cluster' column to the normalized dataframe, and then melt it into a long dataframe by vertically stacking the 3 variables (recency, frequency, and monetary) one after another.

In [None]:
data_normalized['cluster']=data_rfm_k3['Cluster']
data_normalized.head()

In [None]:
data_melt = pd.melt(
    data_normalized.reset_index(), 
    id_vars=['Customer_ID', 'cluster'],
    value_vars=['recency', 'frequency', 'monetary'], 
    var_name='metric', value_name='value'
)
data_melt.head()

- Plot a line chart to compare the clusters. Cluster 2 has higher frequency and monetary values with lower recency, which makes it clearly better than the others.

In [None]:
sns.lineplot(data=data_melt, x='metric', y='value', hue='cluster')
plt.title('Line plot of normalized variables')
plt.xlabel('Metric')
plt.ylabel('Value')
plt.show()

- Identify the importance of each segment's attributes by calculating the ratio between each cluster's average values and the population averages.

In [None]:
cluster_avg = data_rfm_k3.groupby(['Cluster']).mean() 
population_avg = data_rfm.mean()
relative_imp = cluster_avg / population_avg - 1
print(relative_imp.round(2))
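
A worked example of the relative importance score on made-up numbers. A value of 0 means the cluster average equals the population average, and values far from 0 in either direction mark the attributes that set a cluster apart.

```python
import pandas as pd

# Four made-up customers in two clusters.
toy = pd.DataFrame({
    "recency": [10, 20, 100, 110],
    "frequency": [8, 12, 1, 3],
    "Cluster": [0, 0, 1, 1],
})

cluster_avg = toy.groupby("Cluster").mean()
population_avg = toy.drop(columns="Cluster").mean()

# Ratio to the population average, shifted so that 0 means "same as average".
relative_imp = cluster_avg / population_avg - 1
print(relative_imp.round(2))
```

Cluster 0 has a recency 75% below average and a frequency about 67% above average, the pattern of a valuable segment.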

- Plot a heatmap to easily distinguish the most important cluster, cluster 2.

In [None]:
plt.figure(figsize=(16, 6))
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
plt.title('Relative importance of attributes')
plt.show()

- Identify the customers in cluster 2 and list their customer IDs.

In [None]:
cluster2=data_rfm_k3[data_rfm_k3['Cluster']==2]
cluster2

In [None]:
# Customer_ID is the index, so collect the unique IDs directly.
cluster2_list = cluster2.index.unique().tolist()
print(len(cluster2_list))
print(cluster2_list)
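
To compare this cluster with a quantile-based group, one option is to intersect the two sets of customer IDs. The sketch below uses hypothetical IDs. On the real data the inputs could be, for example, `data_5rfm.index` and `cluster2.index`.

```python
import pandas as pd

# Hypothetical customer IDs selected by each method.
quantile_ids = pd.Index([101, 102, 103], name="Customer_ID")
cluster_ids = pd.Index([102, 103, 104, 105, 106], name="Customer_ID")

# Customers selected by both methods.
overlap = quantile_ids.intersection(cluster_ids)

# Share of the quantile group that also falls inside the cluster.
share_of_quantile_group = len(overlap) / len(quantile_ids)

print(sorted(overlap.tolist()))
print(round(share_of_quantile_group, 2))
```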

## 7. Conclusion

In summary, the main findings are as follows.

1. Customers who started purchasing in early 2016 have a higher retention rate. Customers who started purchasing in February (at a duration of 6 months) and those who started in August (at a duration of 2 months) have higher average expenditure.

2. 1,108 users are in the top 20% frequency quantile with activity in the past two weeks. 224 users are in the top 5% quantile in recency, frequency, and monetary. The IDs of the top 10 most valuable users are shown in the table above.

3. There is a significant difference between the results of the K-Means clustering and the linear quantile method. The K-Means clustering groups 4,262 customers together, which may be too large a group to target with a specific marketing campaign.

Suggestions for further analysis:

1. Use other machine learning methods for deeper analysis and prediction.

2. Optimize the K-Means model.

3. Try other visualization types to explore more insights.