# Step 0. Importing packages, setting up the environment

In [None]:
!pip install pydot
!pip install pydotplus

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from datetime import datetime, date
from math import log

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

import matplotlib.pyplot as plt  # for eda and simple data visualization
import seaborn as sns  # for eda and simple data visualization

## Outline of the first notebook:
* calculate total revenue brought by every customer as total value
* calculate the frequency of orders by every customer
* calculate the recency of last purchase made by every customer at the end date of this dataset
* use an algorithm to create clusters of customers based on customers' activity in terms of recency, frequency and monetary value
* algorithm inference and customers EDA: at what time the most valuable customers buy, where customers are from
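
The first three outline items can be sketched on a tiny, made-up order table. The rows, invoice numbers and the `snapshot` name below are illustrative assumptions, not values from the real dataset; only the column names mirror the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical toy orders (not taken from the real dataset)
orders = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 2, 3],
    "Invoice": ["1001", "1002", "2001", "2001", "2002", "3001"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-01", "2011-10-15",
        "2011-10-15", "2011-12-05", "2011-09-01",
    ]),
    "Quantity": [2, 1, 5, 3, 1, 10],
    "Price": [10.0, 20.0, 2.0, 4.0, 50.0, 1.5],
})
orders["Amount"] = orders["Quantity"] * orders["Price"]

# Reference date: one day after the last order in the toy data
snapshot = orders["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("Customer ID").agg(
    frequency=("Invoice", "nunique"),      # number of distinct invoices
    last_purchase=("InvoiceDate", "max"),  # most recent order date
    amount=("Amount", "sum"),              # total money spent
)
rfm["recency"] = (snapshot - rfm["last_purchase"]).dt.days
rfm = rfm.drop(columns="last_purchase")
print(rfm)
```

Each customer ends up with one row holding frequency, amount and recency, which is the shape of table the clustering step works on.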

# Step 1. Read csv, explore NA values and some pre-cleaning of data

In [None]:
df = pd.read_csv('../input/online-retail-ii-uci/online_retail_II.csv')  # load the Online Retail II dataset

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

### There are 3 numeric columns. However, the Customer ID column should be treated as a categorical variable: each ID identifies one customer and carries no numeric meaning. The other five columns are of categorical type.


#### Let's check the NA values and how they are distributed across the columns

In [None]:
null_df = df.isnull().sum().reset_index()
null_df.rename(columns={0: 'nan values', 'index': 'column_name'}, inplace=True)

total_len_df = df.shape[0]
null_df['nan percentage'] = round(null_df['nan values'] / total_len_df * 100, 2)

print(null_df)

#### The majority of NA values are in the Customer ID column. Since we are interested in customer segmentation, we have no other option than to remove these rows.

#### The 0.41% of rows with NaN in the goods description is not a critical share of observations, so we can simply remove all rows with NaNs from the dataframe.
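
As a side note, the per-column NaN share can also be read off in one step with `isna().mean()`. The sketch below uses a made-up four-row frame (the values are assumptions for illustration) and shows what a plain `dropna()` then removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two columns
toy = pd.DataFrame({
    "Description": ["mug", None, "lamp", "clock"],
    "Customer ID": [13085.0, np.nan, np.nan, 17850.0],
})

# Share of missing values per column, in percent
nan_share = (toy.isna().mean() * 100).round(2)
print(nan_share)

# dropna() without arguments drops every row that has at least one NaN
cleaned = toy.dropna().reset_index(drop=True)
print(len(cleaned), "rows left")
```

Because `dropna()` removes a row if any column is missing, the number of rows lost is driven by the column with the most NaNs, here Customer ID.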

In [None]:
df.dropna(inplace=True)
df = df.reset_index(drop=True)

#### The dataset description also mentions that cancelled orders have invoice numbers starting with the prefix C. Let's exclude cancellations from our first calculation of metrics.

In [None]:
# check some rows of cancelled orders
df[df['Invoice'].str.contains('C')].head()

In [None]:
df['InvoiceDate'].max()

#### As we can see, cancelled orders also have a negative quantity. Now we can remove all rows associated with cancelled orders. This lets us segment customers by recency, frequency and monetary value regardless of whether a customer made cancellations.

#### Cancelled orders, as well as the customers who cancel them, should be treated in a separate analysis aimed at finding patterns in cancellations. Moreover, we may want to take a thorough look at cancelled orders from customers in the most valuable segments. So there will be a follow-up notebook with EDA on customers who cancel orders.
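
Since the cancellation marker is described as a prefix, one stricter way to express the filter is `str.startswith('C')`, which cannot match a "C" in the middle of an invoice number. This is a minimal sketch on made-up invoice numbers, not a claim about what the real data contains.

```python
import pandas as pd

# Hypothetical invoice numbers; the "C" prefix marks a cancellation
toy = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "C489459"],
    "Quantity": [12, -12, 6, -3],
})

# Match the prefix only, after making sure the column is a string
is_cancelled = toy["Invoice"].astype(str).str.startswith("C")

# .copy() gives an independent frame that is safe to add columns to
kept = toy[~is_cancelled].copy()
print(kept)
```

On this toy frame the cancelled rows are exactly the ones with negative quantities, which matches the observation above.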

In [None]:
# filter out cancellations from the dataset; .copy() avoids SettingWithCopyWarning when adding columns later
clean_df = df[~(df['Invoice'].str.contains('C'))].copy()

### Now the dataframe contains only orders that were NOT cancelled.

# Step 2. Calculate each customer's recency, frequency and amount of purchases

In [None]:
clean_df['Amount'] = clean_df['Quantity'] * clean_df['Price']  # creating a new column of amount spent

In [None]:
# aggregate metrics on each customer
customers_df = clean_df.groupby(['Customer ID']).agg(
    frequency = ('Invoice', 'nunique'),
    last_purchase = ('InvoiceDate', 'max'),
    amount = ('Amount', 'sum')
).reset_index()  # to turn groupby object into dataframe

customers_df['last_purchase'] = pd.to_datetime(customers_df['last_purchase'])  # let pandas infer the format, since the strings may include a time part

customers_df['recency'] = datetime(2011, 12, 11) - customers_df['last_purchase']  # the next day of the last invoice date
customers_df['recency'] = customers_df['recency'].dt.days  # leave only days number for recency

customers_df.drop(columns=['last_purchase'], inplace=True)  # drop last_purchase column as we will not need it further

In [None]:
customers_df.describe()

### The description of the customers dataset looks fine, except that there are customers who spent 0 at our online store. These are probably customers who used discounts or other promotions.

### Let's look at these customers:

In [None]:
print(f"There are {int(customers_df.describe()['Customer ID']['count'])} customers overall")
print(f"There are {customers_df[customers_df['amount'] == 0].shape[0]} customers who spent 0")
print(customers_df[customers_df['amount'] == 0].head())

#### As we can see, around 0.05% (3 out of 5,881) of customers spent 0 at our store. Each of them made just 1 purchase.
#### Therefore, we can exclude them from the analysis and focus on customers who spent more than zero at the store.

In [None]:
customers_df = customers_df[customers_df['amount'] > 0].reset_index(drop=True)

In [None]:
def plot_variable_distribution(X, column_name):
    
    fig, ax = plt.subplots()
    
    ax.boxplot(x=X, notch=True)
    ax.set_title(f"{column_name} distribution boxplot")
    
    mu = X.mean()
    sigma = X.std()
    num_bins = 50
    
    fig1, ax1 = plt.subplots()
    
    # the histogram of the data
    n, bins, patches = ax1.hist(X, num_bins, density=True)

    # add a 'best fit' line
    y = ((1 / (np.sqrt(2 * np.pi) * sigma)) *
         np.exp(-0.5 * (1 / sigma * (bins - mu))**2))
    
    mu = round(mu, 2)
    sigma = round(sigma, 2)
    
    ax1.plot(bins, y, '--')
    ax1.set_xlabel(column_name)
    ax1.set_ylabel('Probability density')
    ax1.set_title(rf'Histogram of {column_name} distribution: $\mu={mu}$, $\sigma={sigma}$')  # raw f-string keeps the LaTeX backslashes intact

In [None]:
feature_columns = ['frequency', 'amount', 'recency']
for col_name in feature_columns:
    
    x = customers_df[col_name]
    
    plot_variable_distribution(x, col_name)

### As we can see in the plots above, the distributions of the variables of interest do not look normal. There are outliers in the dataset in terms of recency, frequency of purchases and amount spent by customers. Although these outliers probably skew the distributions away from normality, arguably we should keep these observations in the dataset. If we get rid of these "outlier" observations, we can lose valuable information about customers who bring in a large part of the revenue or the number of purchases.

### Instead we can try to transform the distribution of the selected variables. This will allow us to utilize these variables for customer segmentation without any concerns about non-normality of variables' distributions.

But first of all, let's double-check whether the variable samples are normal with the Shapiro-Wilk, D’Agostino’s K^2 and Anderson-Darling statistical tests.

In [None]:
from scipy.stats import shapiro
from scipy.stats import normaltest
from scipy.stats import anderson

for col_name in feature_columns:
    
    print(f'Shapiro-Wilk test for {col_name}')
    stat, p = shapiro(customers_df[col_name])
    print('Statistics=%.3f, p=%.3f of %s distribution' % (stat, p, col_name))

    alpha = 0.05
    if p > alpha:
        print(f'Sample of {col_name} looks normal (fail to reject H0) with Shapiro-Wilk Test')
    else:
        print(f'Sample of {col_name} does not look normal (reject H0) with Shapiro-Wilk Test \n\n')
        
    print(f'D’Agostino’s K^2 test for {col_name}')
    stat, p = normaltest(customers_df[col_name])
    print('Statistics=%.3f, p=%.3f of %s distribution' % (stat, p, col_name))

    alpha = 0.05
    if p > alpha:
        print(f'Sample of {col_name} looks normal (fail to reject H0) with D’Agostino’s K^2 Test')
    else:
        print(f'Sample of {col_name} does not look normal (reject H0) with D’Agostino’s K^2 Test \n\n')
        
    print(f'Anderson-Darling test for {col_name}')
    
    result = anderson(customers_df[col_name])
    print('Statistic: %.3f' % result.statistic)
    # compare the statistic with the critical value at each significance level
    for sl, cv in zip(result.significance_level, result.critical_values):
        if result.statistic < cv:
            print('%.3f: %.3f, data looks normal (fail to reject H0) with Anderson-Darling test \n\n' % (sl, cv))
        else:
            print('%.3f: %.3f, data does not look normal (reject H0) with Anderson-Darling test \n\n' % (sl, cv))

### As we can see above, the normality tests suggest that our variables are probably not normally distributed.

### That is why we can try applying a natural logarithm transformation to our variables, to keep the outliers and get a distribution closer to normal.
**The idea is inspired by Anatoly Karpov's [report](https://www.youtube.com/watch?v=dFCJysbOJ8c) at Matemarketing 2019 Conference.**
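
To illustrate why a log transformation helps with long right tails, here is a small sketch on synthetic data. The sample is drawn from a log-normal distribution purely as an assumption for illustration; it is not the notebook's data, and real variables will only approximately behave like this.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (an assumption, not the real 'amount' column)
rng = np.random.default_rng(0)
amount = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=5000))

# Sample skewness before and after the natural log transformation
skew_raw = amount.skew()
skew_log = np.log(amount).skew()

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

Note that `np.log` is only defined for strictly positive values, which is one more reason customers with zero spend had to be filtered out before this step.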

In [None]:
for col_name in feature_columns:
    
    x = customers_df[col_name]
    
    ln_x = np.log(x)  # vectorized natural log; requires strictly positive values
    
    ln_col_name = 'ln ' + col_name
    
    plot_variable_distribution(ln_x, ln_col_name)

### The log transformation made the distributions look more like normal distributions. As a 'bonus', after the log transformation all variables now lie in a roughly similar range of absolute values.

In [None]:
customers_df['r'] = np.log(customers_df['recency'])
customers_df['f'] = np.log(customers_df['frequency'])
customers_df['m'] = np.log(customers_df['amount'])

In [None]:
use_df = customers_df[['Customer ID', 'r', 'f', 'm']].copy()  # independent copy, so adding the cluster column later is safe

## Now we can go to the segmentation of our customers by recency, frequency and amount of purchases. 
### We will try the k-means clustering algorithm. After that we will review the clusters with a decision tree algorithm (CART) to get an explicit interpretation.

# Step 3. Customer segmentation using K-Means

In [None]:
use_df = use_df.set_index('Customer ID')
use_df.head()

### Let's import necessary modules from sklearn library

In [None]:
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score

In [None]:
use_df.shape

In [None]:
%%time

inertia = []
ss = []
for k in range(2, 15):
    %time kmeans = KMeans(n_clusters=k, random_state=1).fit(use_df)
    ss.append(silhouette_score(use_df, kmeans.labels_, metric='euclidean'))
    inertia.append(np.sqrt(kmeans.inertia_))
    
print('\n\nTOTAL CELL RUNTIME: ', )

In [None]:
plt.plot(range(2, 15), inertia, marker='s')
plt.xlabel('$k$')
plt.ylabel('$J(C_k)$')

#### The optimal number of clusters is 4 according to the 'elbow rule'. After k=4 the change in inertia is less dramatic than it was before k=4.
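
The elbow rule can also be checked numerically by looking at how much the inertia drops with each extra cluster. The sketch below uses synthetic blobs with four well-separated centers (an assumption chosen so that the elbow is obvious) and raw inertia rather than its square root; on real data the elbow is usually less clear-cut.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (not the notebook's data)
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Inertia (within-cluster sum of squares) for each candidate k
inertia = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertia[k] = model.inertia_

# Drop in inertia gained by going from k-1 to k clusters
drops = {k: inertia[k - 1] - inertia[k] for k in range(3, 8)}
for k, drop in drops.items():
    print(f"k={k}: inertia={inertia[k]:.1f}, drop vs k-1={drop:.1f}")
```

The elbow is the last k for which the drop is still large; beyond it, extra clusters only split groups that already belong together.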

In [None]:
plt.plot(range(2, 15), ss, marker='h')
plt.xlabel('$k$')
plt.ylabel('silhouette_score')

#### The silhouette score also confirms the suggestion that 4 is the optimal number of clusters.
#### Let's also compare the speed and performance of MiniBatchKMeans with the default KMeans algorithm from scikit-learn.

In [None]:
%%time
inertia_minib = []
ss_minib = []
for k in range(2, 15):
    %time kmeans_mini = MiniBatchKMeans(n_clusters=k, random_state=1).fit(use_df)
    ss_minib.append(silhouette_score(use_df, kmeans_mini.labels_, metric='euclidean'))
    inertia_minib.append(np.sqrt(kmeans_mini.inertia_))
    
print('\n\nTOTAL CELL RUNTIME: ', )

In [None]:
plt.plot(range(2, 15), inertia_minib, marker='s')
plt.xlabel('$k$')
plt.ylabel('$J(C_k)$')
plt.title('Inertia')

In [None]:
plt.plot(range(2, 15), ss_minib, marker='h')
plt.xlabel('$k$')
plt.ylabel('silhouette_score')
plt.title('Silhouette score')

#### We can draw similar conclusions from the MiniBatchKMeans and KMeans runs. Nevertheless, MiniBatchKMeans gives a dramatic improvement in runtime compared to the default KMeans algorithm (17.2 seconds vs 4 min 49 seconds in total run time of the cell).

#### Therefore, on large datasets (the scikit-learn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) says n >= 10k observations) we can use the MiniBatchKMeans algorithm to find the optimal number of clusters for KMeans and save time.

However, for a single clustering run we can use the default KMeans algorithm to keep the results stable. On a single run we will not see as dramatic a speed-up from MiniBatchKMeans as we saw across many runs.

In [None]:
c = KMeans(n_clusters=4, random_state=42)
use_df['cluster'] = c.fit_predict(use_df)

In [None]:
use_df['cluster'] = use_df['cluster'] + 1

In [None]:
use_df['cluster'] = use_df['cluster'].astype(int)

### The distribution of customers among clusters is almost even without any huge overloads towards any cluster.
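
One quick way to check how evenly customers are spread across clusters is `value_counts(normalize=True)`. The labels below are made up for illustration; in this notebook the same call would be applied to the `cluster` column of `use_df`.

```python
import pandas as pd

# Hypothetical cluster labels for ten customers
labels = pd.Series([1, 2, 2, 3, 4, 4, 1, 3, 2, 4], name="cluster")

# Share of customers per cluster, in percent, ordered by cluster number
shares = (labels.value_counts(normalize=True).sort_index() * 100).round(1)
print(shares)
```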

### Let's explore the cluster distributions with some basic visualizations

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

scatter = ax.scatter(use_df['r'], use_df['f'], c=use_df['cluster'], s=50)
                    
ax.set_title('RFM clusters')
ax.set_xlabel('r')
ax.set_ylabel('f')
plt.colorbar(scatter)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

scatter = ax.scatter(use_df['r'], use_df['m'], c=use_df['cluster'], s=50)
                    
ax.set_title('RFM clusters')
ax.set_xlabel('r')
ax.set_ylabel('m')
plt.colorbar(scatter)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

scatter = ax.scatter(use_df['f'], use_df['m'], c=use_df['cluster'], s=50)
                    
ax.set_title('RFM clusters')
ax.set_xlabel('f')
ax.set_ylabel('m')
plt.colorbar(scatter)

### Now let's try to visualize the relationship between clusters, r, f and m in one scatter matrix to try to grasp the whole picture: 

In [None]:
sns.pairplot(use_df, hue="cluster", markers=["o", "s", "D", "*"], 
             palette=sns.color_palette('Set1', n_colors=4))

## A high-level interpretation of each cluster:
### Cluster 4: low recency, high frequency, high monetary value - bought recently, with high frequency and high monetary value. The most loyal and valuable customers.
### Cluster 1: high recency, high frequency, high monetary value - did NOT buy recently, but buy with high frequency and high monetary value. We need to re-activate these customers and convert them into Cluster 4.
### Cluster 3: high recency, low frequency, low monetary value - did NOT buy recently, buy with low frequency and low monetary value. Probably customers who bought once and churned. Can be converted into Cluster 4 customers.
### Cluster 2: low recency, low frequency, low monetary value - bought recently, with low frequency and low monetary value. Probably new customers who have just made a couple of orders.

### To get a more explicit interpretation in terms of how customers' segments differentiate from each other we will train a decision tree classifier. This will allow us to see the boundaries between segments.

# Step 4. Get an explicit interpretation of the clusters

In [None]:
use_df = use_df.reset_index()

Merge in each customer's cluster label.

In [None]:
final_df = customers_df.merge(use_df[['Customer ID', 'cluster']], how='inner', on='Customer ID')

In [None]:
# drop columns that are not necessary for further interpretation of clusters
final_df.drop(columns=['r', 'f', 'm'], inplace=True)

In [None]:
final_df.head()

In [None]:
from sklearn.tree import DecisionTreeClassifier  # import decision tree classifier
from sklearn.model_selection import train_test_split  # import train_test_split function
from sklearn import metrics  # import scikit-learn metrics module for accuracy calculation

In [None]:
print(feature_columns)
X = final_df[feature_columns]
y = final_df['cluster']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)  # 80% training and 20% test

In [None]:
clf_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=17)
clf_tree = clf_tree.fit(X_train,y_train)

In [None]:
y_pred = clf_tree.predict(X_test)

In [None]:
print("Accuracy score: ", metrics.accuracy_score(y_test, y_pred))

Let's visualize the decision tree we trained

In [None]:
import io
from io import StringIO
from sklearn.tree import export_graphviz
from IPython.display import Image  
import pydotplus
import graphviz
from pydot import graph_from_dot_data

In [None]:
cluster_names = ['1', '2', '3', '4']

In [None]:
def plot_decision_tree(clf, feature_cols, class_names):

    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data,  
                    filled=True, rounded=True,
                    special_characters=True,
                    feature_names = feature_cols,
                    class_names=class_names)
    (graph, ) = graph_from_dot_data(dot_data.getvalue())
    
    return Image(graph.create_png())

In [None]:
plot_decision_tree(clf_tree, feature_columns, cluster_names)

### We can arguably limit the depth of the decision tree to 2, because at depth = 3 there are no rules that can be applied without conflicting with the previous rules in terms of entropy gain. Also, the split at depth 3 that divides cluster 3 customers might look like overfitting.

### Let's now train a new decision tree with max_depth = 2

In [None]:
clf_2 = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=17)
clf_2 = clf_2.fit(X_train,y_train)
y_pred = clf_2.predict(X_test)
print("Accuracy score: ", metrics.accuracy_score(y_test, y_pred))

### With max_depth of 2 we even got a little bit higher accuracy of predictions.

### Let's visualize the tree with max_depth = 2

In [None]:
plot_decision_tree(clf_2, feature_columns, cluster_names)

## Also, from the rules the decision tree learned we can see that frequency is not used, probably because it correlates with the amount spent. This can also be seen in the plots of the 4 clusters above: there is no cluster with high frequency & low amount or low frequency & high amount.

## This point is part of the criticism towards the RFM-methodology. An alternative solution might be to use average amount spent per one invoice instead of total amount spent by customer.
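
The point about total amount versus average amount per invoice can be illustrated with a small synthetic sketch. The data below is generated so that order size is independent of order count, which is an assumption made for illustration only; on the real customers the relationship may look different.

```python
import numpy as np
import pandas as pd

# Synthetic customers: order count and typical order value drawn independently
rng = np.random.default_rng(42)
n = 2000
frequency = rng.integers(1, 51, size=n)
avg_order_value = rng.lognormal(mean=3.0, sigma=0.5, size=n)

toy = pd.DataFrame({
    "frequency": frequency,
    "amount": frequency * avg_order_value,  # total spend grows with order count
})
toy["avg_amount_per_invoice"] = toy["amount"] / toy["frequency"]

# Spearman correlation, computed as the Pearson correlation of ranks
rank_corr = toy.rank().corr()
print(rank_corr.loc["frequency"].round(2))
```

In this toy setup total amount is strongly correlated with frequency, while average amount per invoice is not, so the latter would add information that frequency does not already carry.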

**Huge thanks to https://mljar.com/ project blog for providing useful custom functions to retrieve rules from decision tree algorithm!**

In [None]:
from sklearn.tree import _tree
def get_rules(tree, feature_names, class_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]

    paths = []
    path = []
    
    def recurse(node, path, paths):
        
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            p1, p2 = list(path), list(path)
            p1 += [f"({name} <= {np.round(threshold, 3)})"]
            recurse(tree_.children_left[node], p1, paths)
            p2 += [f"({name} > {np.round(threshold, 3)})"]
            recurse(tree_.children_right[node], p2, paths)
        else:
            path += [(tree_.value[node], tree_.n_node_samples[node])]
            paths += [path]
            
    recurse(0, path, paths)

    # sort by samples count
    samples_count = [p[-1][1] for p in paths]
    ii = list(np.argsort(samples_count))
    paths = [paths[i] for i in reversed(ii)]
    
    rules = []
    for path in paths:
        rule = "if "
        
        for p in path[:-1]:
            if rule != "if ":
                rule += " and "
            rule += str(p)
        rule += " then "
        if class_names is None:
            rule += "response: "+str(np.round(path[-1][0][0][0],3))
        else:
            classes = path[-1][0][0]
            l = np.argmax(classes)
            rule += f"class: {class_names[l]} (proba: {np.round(100.0*classes[l]/np.sum(classes),2)}%)"
        rule += f" | based on {path[-1][1]:,} samples"
        rules += [rule]
        
    return rules

In [None]:
rule_values = get_rules(clf_2, feature_columns, cluster_names)

In [None]:
rule_values

From the rules formulated above we can divide our customers into segments.

In [None]:
conditions = [
    (final_df['recency'] > 71.5) & (final_df['amount'] <= 636.02),
    (final_df['recency'] > 71.5) & (final_df['amount'] > 636.02),
    (final_df['recency'] <= 71.5) & (final_df['amount'] > 1509.575),
    (final_df['recency'] <= 71.5) & (final_df['amount'] <= 1509.575)
]

choices = ['high recency - low amount', 'high recency - high amount', 'low recency - high amount',
          'low recency - low amount']

In [None]:
final_df['segment'] = np.select(conditions, choices, default='other')

In [None]:
agg_stats_df = final_df.groupby('segment').agg(
    median_recency = ('recency', 'median'),
    median_frequency = ('frequency', 'median'),
    median_amount = ('amount', 'median'),
    customers = ('Customer ID', 'nunique')
).reset_index()

In [None]:
agg_stats_df['customers percentage'] = round(
    agg_stats_df['customers'] / agg_stats_df['customers'].sum() * 100, 1)

In [None]:
agg_stats_df

# Step 5: Conclusion and first insights
### Some insights about recency of purchases by customers

According to the data, around 57% of clients are in 'high recency' segments. This might be a signal to improve buyer retention, for example by introducing a loyalty program. The business should challenge the strategy of acquisition channels that bring in a lot of customers who stay in the 'high recency - low amount' segment.
Special re-activation treatment should be given to the 'high recency - high amount' segment, as those are the more valuable customers for the business.

Using the threshold retrieved from the rules, we can also track the percentage of customers who made a purchase less than 71.5 days ago. This percentage is a possible KPI for this area of the business: the greater it gets, the better.

At the same time, there should be a mechanism to prevent customers from moving out of segments with low recency into segments with high recency. It might be profitable to create special offers for customers who are close to turning into 'high recency' customers. Specifically, if a customer has not bought anything for 60-70 days, it might be worth trying to give that customer a special offer.
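
Both ideas, the KPI and the special-offer trigger, can be sketched in a few lines. The threshold 71.5 comes from the decision tree rules above; the lower bound of 60 days, the constant names and the toy customers are assumptions for illustration.

```python
import pandas as pd

RECENCY_THRESHOLD = 71.5  # boundary between 'low' and 'high' recency from the tree rules
OFFER_WINDOW_START = 60   # assumed start of the 'about to lapse' window

# Hypothetical customers with days since last purchase
toy = pd.DataFrame({
    "Customer ID": [101, 102, 103, 104, 105],
    "recency": [5, 62, 70, 90, 300],
})

# KPI: share of customers who bought within the threshold, in percent
active_share = (toy["recency"] <= RECENCY_THRESHOLD).mean() * 100
print(f"Customers with recency <= {RECENCY_THRESHOLD} days: {active_share:.1f}%")

# Customers close to crossing into the 'high recency' segments
at_risk = toy[toy["recency"].between(OFFER_WINDOW_START, RECENCY_THRESHOLD)]
print(at_risk)
```

Recomputing both on a regular schedule would show whether retention efforts are moving the KPI and which customers are due for an offer.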

In [None]:
total_stats_df = final_df.groupby('segment').agg(
    total_frequency = ('frequency', 'sum'),
    total_amount = ('amount', 'sum')
).reset_index()

In [None]:
labels = total_stats_df['segment']
sizes = total_stats_df['total_frequency']
explode = (0, 0.1, 0, 0)  # offset ("explode") only the 2nd slice

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax1.set_title('Total frequency by segment')

plt.show()

61.7% of all orders are made by 'low recency - high amount' segment customers

In [None]:
labels = total_stats_df['segment']
sizes = total_stats_df['total_amount']
explode = (0, 0.1, 0, 0)  # offset ("explode") only the 2nd slice

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax1.set_title('Total amount by segment')

plt.show()

72.2% of all amount is spent by 'low recency - high amount' segment customers

In [None]:
final_df.to_csv('./customer_segments.csv', index=False)

### Further notebooks using these customer segments will involve:
1. Analysis of the different customer segments in terms of geography and time of purchases;
2. Analysis of cancelled orders - it might be useful to compare segments from the cancellations perspective

Links and resources used in this notebook:


2. https://habr.com/ru/company/mindbox/blog/423463/ - mindbox about how they have created similar solution for their customers
3. https://habr.com/ru/company/mindbox/blog/420915/ - yet another article by mindbox
4. https://stats.stackexchange.com/questions/102984/is-there-a-decision-tree-like-algorithm-for-unsupervised-clustering - stats exchange post about general way of approach
5. https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree - about retrieving rules
6. https://habr.com/ru/company/ods/blog/322534/#vvedenie - ods.ai post on decision trees
7. https://habr.com/ru/company/ods/blog/325654/#vybor-chisla-klasterov-dlya-kmeans - ods.ai post on KMeans dataset.
8. https://mljar.com/blog/extract-rules-decision-tree/ - great overview of how rules can be extracted from decision tree classifier in human-readable text

#### If you made it this far, thank you very much! Please upvote the kernel if you like it and leave your opinion in the comments section.