# How Is The Business Doing?

This notebook aims to understand the nuances of the business and find ways to improve its performance. It is broken down into the following parts:

1. How sales have performed throughout the entire time period.
2. What the best and worst selling products are.
3. Understanding customer behavior
    1. Repeat purchasers vs. first-time purchasers
    2. Predicted probability that a customer will still make a repeat purchase
    3. Estimated customer lifetime value for the next 30 days for each customer

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
#read csv file        
df = pd.read_csv("../input/retail-store-sales-transactions/scanner_data.csv")        

# Data Cleaning

Check the data for any quirks and convert the columns to the appropriate data types for analysis.

In [None]:
df.head()

There's a column named "Unnamed: 0" which will not be used in the analysis of the business. It is probably the index that was written out when the CSV file was exported.

In [None]:
df.dtypes

The "Date" column has an object data type. Let's turn it into a datetime format for analysis.

In [None]:
#checking for missing data
df.isnull().sum()

In [None]:
#Turn date column into datetime and create new features
df["Date"] = pd.to_datetime(df["Date"])
df["Day of Week"] = df["Date"].dt.dayofweek
df["Month"] = df["Date"].dt.month

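#truncate each date to the first day of its own month (adding MonthBegin(-1) would push dates that already fall on the 1st into the previous month)
df["Month Truncated"] = df["Date"].dt.to_period("M").dt.to_timestamp()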
df["Month Truncated"] = df["Month Truncated"].dt.date

#Create a price parameter
df["Price"] = df["Sales_Amount"]/df["Quantity"]

In [None]:
df.head()
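Month truncation has an edge case that is worth checking on a toy series before trusting a monthly rollup. The sketch below uses made-up dates (not rows from the dataset) to compare two common idioms: adding a negative `MonthBegin` offset, and converting to a monthly period and back.

```python
import pandas as pd

# Hypothetical dates: the first, middle and last day of the same month.
dates = pd.Series(pd.to_datetime(["2017-03-01", "2017-03-15", "2017-03-31"]))

# A negative MonthBegin offset rolls back to the previous month start.
# For a date that is already on the 1st, that is the month before.
with_offset = dates + pd.offsets.MonthBegin(-1)

# Converting to a monthly period and back always lands on the 1st
# of the same calendar month.
with_period = dates.dt.to_period("M").dt.to_timestamp()

print(pd.DataFrame({"date": dates, "offset": with_offset, "period": with_period}))
```

Only the period-based version keeps March 1st in March, which matters when sales on the first day of a month should count toward that month.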

# How is the business doing?

This section is broken down into sales across the entire dataset, the best and worst selling products, and who the customers are.

## Sales by time

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (16,8))

df.groupby("Date")["Sales_Amount"].sum().plot(ax = ax)
ax.set_title("Sales Across Time", fontsize = 14);

Insights:

1. The time series appears to show weekly seasonality.
2. Sales show no clear trend, which suggests the business has operated for some time and has started to stagnate. The next step to help grow the business is to optimize its operations. This can be done by checking the performance of its products and the behavior of its customers.
    1. No context is given in this dataset apart from the transactional data, so one thing that could be done is to compare online store sales against brick-and-mortar sales.
    2. Understand why customers come back to the store, since retaining repeat purchasers is far easier than acquiring new customers.
    3. Check all the operational costs of the business and try to minimize them.
    4. Do cross-selling, which is to try to make customers buy more items per transaction.
    
    
Note: I'll be assuming that the retail store has both a physical and e-commerce shop.

In [None]:
#add in monthly sales data as well
sales_month = df.groupby(["Month Truncated"])["Sales_Amount"].sum().reset_index()
sales_date = df.groupby(["Date", "Day of Week"])["Sales_Amount"].sum().reset_index()

dow_labels = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"] #dt.dayofweek numbers Monday as 0 and Sunday as 6
month = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize = (10, 5))

sales_month.plot(ax = ax1, rot = 90) #the monthly panel shows totals (sum), since the transactions only cover the year 2017
sales_date.groupby(["Day of Week"])["Sales_Amount"].mean().plot(ax = ax2, rot = 90) 

ax1.set_title("Month", fontsize = 12)
ax2.set_title("Day of Week", fontsize = 12)

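#sales_month is plotted against its integer row index, so place one tick per row and label it with that row's month name
ax1.set_xticks(range(len(sales_month)))
ax1.set_xticklabels(pd.to_datetime(sales_month["Month Truncated"]).dt.strftime("%B"))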
ax1.set_xlabel("")

ax2.set_xticks(range(0,7))
ax2.set_xticklabels(dow_labels)
ax2.set_xlabel("")

plt.suptitle("Average Sales", fontsize = 15);

Insights:

1. It is difficult to find any real trend in the monthly data since the dataset covers only a single year. Judging from the graph, the dips come in the winter and summer months, while sales rise rapidly and peak in December, when people usually buy a lot.
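2. The first two positions on the day-of-week axis have the lowest average daily sales. Because `dt.dayofweek` numbers Monday as 0, these positions correspond to Monday and Tuesday.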
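Before reading a day-of-week chart, it helps to confirm how pandas numbers the days. This is a small check on a made-up week of dates (not taken from the dataset) that puts `dt.dayofweek` next to `dt.day_name()`.

```python
import pandas as pd

# Hypothetical sample: seven consecutive days starting on Sunday 2017-01-01.
dates = pd.Series(pd.date_range("2017-01-01", periods=7, freq="D"))

mapping = pd.DataFrame({
    "date": dates,
    "dayofweek": dates.dt.dayofweek,  # 0 = Monday ... 6 = Sunday
    "day_name": dates.dt.day_name(),
})
print(mapping)

# Tick labels in the same order as the dayofweek codes 0..6.
dow_order = (
    mapping.sort_values("dayofweek")["day_name"].tolist()
)
print(dow_order)
```

Deriving the tick labels from the data in this way avoids hand-typing a list that starts on the wrong day.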

# What are the best and worst selling products?

In [None]:
total_prod = df["SKU"].nunique()
total_prod_cat = df["SKU_Category"].nunique()

print("There are a total of {} products and {} product categories".format(total_prod, total_prod_cat))

In [None]:
products = df["SKU"].value_counts().reset_index()
products.columns = ["SKU", "Items Sold"]

product_cat = df["SKU_Category"].value_counts().reset_index()
product_cat.columns = ["SKU_Category", "Items Sold"]

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize = (10, 5))

products["Items Sold"].plot(kind = "kde", rot = 90, ax = ax1)
ax1.set_xlim([0,200])
ax1.locator_params(axis="x", nbins= 20)
ax1.set_title("Products")

product_cat["Items Sold"].plot(kind = "kde", rot = 90, ax = ax2)
ax2.set_xlim([0,8000])
ax2.locator_params(axis="x", nbins= 20)
ax2.set_title("Product Category")

fig.suptitle("Distribution of Items Sold", fontsize = 15)
plt.tight_layout()

Both distributions are right skewed, which means that the bulk of the products and of the product categories have sold poorly. Identifying these items is important to help minimize the production cost of creating them.

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols =2, nrows = 1, figsize = (10, 5))

df["SKU_Category"].value_counts().head(10).plot(kind = "bar", ax = ax1)
df["SKU"].value_counts().head(10).plot(kind = "bar", ax = ax2)

ax1.set_title("By SKU Category", fontsize = 12)
ax2.set_title("By SKU", fontsize = 12)

ax1.set_xlabel("")
ax2.set_xlabel("")

fig.suptitle("Top 10 Best Selling Items", fontsize =  18, y = 1.03);

Unfortunately, we don't have much information about the SKUs, so it is difficult to find reasons why these are the top selling items and categories.

## Worst Selling Products

I didn't list the worst selling products because there are far too many to mention.

In [None]:
median_prod = products["Items Sold"].median()
median_prod_cat = product_cat["Items Sold"].median()

print("There are {} items, about {:.2f}%, that have been sold at most {} times. For product categories there are {} groups, {:.2f}% of total categories, that have sold at most {} times."
      .format(products[products["Items Sold"] <= median_prod].shape[0], 
              products[products["Items Sold"] <= median_prod].shape[0]/total_prod*100,
              median_prod,
              product_cat[product_cat["Items Sold"] <= median_prod_cat].shape[0],
              product_cat[product_cat["Items Sold"] <= median_prod_cat].shape[0]/total_prod_cat*100,
              median_prod_cat))

In [None]:
low_selling_items = df[df["SKU"].isin(products[products["Items Sold"] <= median_prod]["SKU"])]["Sales_Amount"].sum()/df["Sales_Amount"].sum()*100
low_selling_product_cat = df[df["SKU_Category"].isin(product_cat[product_cat["Items Sold"] <= median_prod_cat]["SKU_Category"])]["Sales_Amount"].sum()/df["Sales_Amount"].sum()*100

#report the thresholds from the computed medians instead of hard-coding them
print("SKUs that have sold at most {} times have contributed about {:.2f}% of the total sales.".format(median_prod, low_selling_items))
print("The lowest selling SKU categories, those that have sold at most {} times, made up only about {:.2f}% of the revenue.".format(median_prod_cat, low_selling_product_cat))

Now that the worst selling products have been identified, the next step is to check how much stock of them remains and how much it cost to create these items, and then weigh the options. If they are detrimental to the business, production of these items can be stopped and a way found to dispose of the remaining stock.

## Understanding Customer Behavior

From the previous section, we know that the sales of the business have been relatively stagnant throughout, and that there are several products that have sold very poorly, which may have added a lot of cost to the business. Next, let's try to understand purchaser behavior.

This section looks at the number of repeat and first-time purchasers, and at how sales break down across these two segments.

In [None]:
customer_trans = df.groupby("Customer_ID").agg({"Transaction_ID":"nunique"}).rename(columns = {"Transaction_ID":"Transactions"}).reset_index()
customer_trans["Buyer Status"] = np.where(customer_trans["Transactions"] > 1, "Repeat Purchaser", "First Time")

#merge the transaction count and buyer status back into the main dataframe
df = df.merge(customer_trans, on = "Customer_ID")

In [None]:
status_counts = customer_trans["Buyer Status"].value_counts()
status_counts.plot(kind = "pie", autopct = "%.2f", labels = None, figsize = (6,6))
plt.title("Share of Repeat and First Time Purchasers", fontsize = 18)
plt.legend(loc = 4, labels = status_counts.index) #take the labels from the data so they always match the wedge order
plt.ylabel("");

I've defined repeat purchasers as customers who have bought more than once over the whole period, not customers who happen to be on their 2nd transaction. As shown, the split is about 50/50. This means that about 50% of customers bought once and never returned for a repeat purchase, which is a high churn rate.
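To make the definition concrete, here is a minimal sketch on a made-up table (the IDs are hypothetical). Note that several rows sharing one `Transaction_ID` count as a single transaction, which is why `nunique` is used rather than a row count.

```python
import numpy as np
import pandas as pd

# Hypothetical line items: customer 1 has two transactions,
# customers 2 and 3 have one each (customer 3 bought two items in it).
toy = pd.DataFrame({
    "Customer_ID":    [1, 1, 1, 2, 3, 3],
    "Transaction_ID": [10, 10, 11, 20, 30, 30],
})

counts = (
    toy.groupby("Customer_ID")["Transaction_ID"]
    .nunique()
    .rename("Transactions")
    .reset_index()
)
counts["Buyer Status"] = np.where(
    counts["Transactions"] > 1, "Repeat Purchaser", "First Time"
)

# Share of each segment among customers.
share = counts["Buyer Status"].value_counts(normalize=True)
print(counts)
print(share)
```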

In [None]:
time_series = df.groupby(["Month Truncated", "Buyer Status"]).agg({"Customer_ID":"nunique"}).reset_index()
time_series.columns = ["Month Truncated", "Buyer Status", "Customer_Count"]

df.groupby(["Month Truncated", "Buyer Status"]).agg({"Customer_ID":"nunique"}).unstack().plot(kind = "bar", figsize = (8,5))

plt.legend(loc = 2, labels = ["First Time", "Repeat Purchaser"])
plt.title("Customers Segmented by Frequency", fontsize = 15)
plt.xlabel("");

Insights:

1. In the early months of the year, about 1/3 of the purchasers were first timers. The succeeding months show a slightly decreasing trend in first-time customers.
    1. June to August had the lowest number of first-time purchasers, which could be because the business is strongly affected during the summer months.
2. There's a massive jump in repeat purchasers after the first two months, from about 2.25K to nearly 3K a month. A probable cause is that marketing campaigns from March onwards brought in higher-quality users, which led them to make a repeat purchase.

## Market Basket Analysis

Market basket analysis is a technique that tries to find the relationship between two items, which can be described by three metrics (support, confidence, lift). Understanding how items are associated helps improve sales through cross-selling, which lets the customer buy more by getting more items into the cart. It is also possible to target customers online and entice them to buy more of the associated product.


Understanding the basic concepts of MBA:

1. Support - How frequently both items are bought together, as a share of all transactions.
2. Confidence - How likely the consequent item is to be bought IF the antecedent item has been added to the cart.
3. Lift - How much more often the items are bought together than would be expected if they were bought independently.
   1. Lift > 1 - Products are more likely to be bought together
   2. Lift = 1 - The items are bought independently of each other
   3. Lift < 1 - Products are less likely to be bought together
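The three metrics can be worked out by hand on a tiny example. The baskets below are made up for illustration, and the helper functions are my own sketch of the standard definitions rather than anything from a library.

```python
# Five hypothetical transactions, each a set of item categories.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]

def support(items):
    """Share of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(antecedent, consequent) / support(consequent)

print("support   :", support({"bread", "butter"}))
print("confidence:", confidence({"bread"}, {"butter"}))
print("lift      :", lift({"bread"}, {"butter"}))
```

Here bread and butter appear together in 2 of 5 baskets, butter is in 2 of the 3 bread baskets, and the lift of 10/9 is slightly above 1, so the pair is bought together a little more often than independence would suggest.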
    
The library used for market basket analysis is [mlxtend](https://pypi.org/project/mlxtend/).

In [None]:
#import libraries for MBA
from mlxtend.frequent_patterns import apriori, association_rules
import mlxtend as ml

In [None]:
#data pre-processing 
df1 = df.set_index("Transaction_ID")["SKU_Category"].str.get_dummies().reset_index() #one hot encode SKU category which is grouped by Transaction ID

list_name = df1.columns.tolist()
list_name.remove("Transaction_ID")

mba = df1.groupby("Transaction_ID")[list_name].sum() #sums up all the items bought per transaction

#apriori only needs to know whether a category appears in a transaction,
#so convert the counts to True (present) / False (absent)
mba = mba > 0

frequent_itemsets = apriori(mba, min_support = 0.01, use_colnames = True) #only shows the itemsets where the support >= 1%
rules = association_rules(frequent_itemsets, metric = "lift")

#showed percentage to easily interpret the data
rules["support"] = rules["support"]*100
rules["confidence"] = rules["confidence"]*100

Note:

I originally used the SKU column, but there were over 5200 unique SKUs and the Kaggle notebook couldn't handle it. So instead I used SKU Category, as it has fewer than 200 groups. I also tried taking the top 200 items, but the support (how often the combination of items appeared together) was extremely low (< 0.1%).

In [None]:
print("Maximum support for the retail store is {:.2f}%, and minimum is at {:.2f}%".format(rules["support"].max(), rules["support"].min()))

In [None]:
mba_table = rules[(rules["confidence"] >= 30) & (rules["lift"] >= 1)]
mba_table[["antecedents", "consequents","support", "confidence", "lift"]].sort_values("lift", ascending = False).head(10) 

I focused on itemsets with high lift because they have high potential for cross-selling, even though the top 3 have low support.

Understanding the data:

1. Since these itemsets have high lift, they can be used for cross-selling to get customers to buy more items in the store.
2. Although these itemsets have the highest lift, their support (the share of transactions in which these items are bought together) is on the lower end of the spectrum. Since it is a retail store, one way to improve this would be to put the two categories near each other, or to place them near the entrance. It is also possible to target customers digitally and entice them to buy both items together.

Do note that I kept only the rules with a lift of at least 1 and a confidence of at least 30% to get the best itemsets.

## Estimating customer LTV

In a retail business that constantly runs marketing campaigns to acquire new customers and bring back old ones, it is important to know the customer lifetime value (LTV), or the total sales a customer has brought to the business since their first purchase. LTV can be used to understand how much value each customer gives to the store.


The traditional approach to estimating LTV is through this equation:

Customer Lifetime Value = Average Order Value * Estimated Number of Purchases at some fixed time period * the average length of the customer relationship

    Where: Average Order Value = Sum of Sales for the customer/Total transactions
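As a worked example of the traditional formula, here is a small sketch with made-up numbers. The function name and its arguments are my own, chosen to mirror the terms of the equation.

```python
def traditional_ltv(total_sales, n_transactions, purchases_per_period, n_periods):
    """Traditional LTV = AOV * purchases per period * number of periods."""
    average_order_value = total_sales / n_transactions
    return average_order_value * purchases_per_period * n_periods

# Hypothetical customer: 300 USD spent over 12 transactions (AOV = 25 USD),
# expected to buy twice a month over a 6-month relationship.
ltv = traditional_ltv(
    total_sales=300.0,
    n_transactions=12,
    purchases_per_period=2,
    n_periods=6,
)
print("Traditional LTV: {:.2f} USD".format(ltv))
```

Every input here is a single fixed number per customer, which is exactly the limitation discussed next.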
    

One problem with this equation is that a single churn rate is applied to the entire customer base, which is unrealistic since every customer has their own level of engagement with the store. Another is that the average order value, which is unique to each customer, is assumed to stay constant over time. To overcome this, the Beta-Geometric/Negative Binomial Distribution (BG/NBD) model will be used to address the issues of the traditional formula and estimate the number of purchases as well as the churn rate. This is then tied to the Gamma-Gamma model, which predicts the average order value of the customer.

The BG/NBD model estimates the number of purchases within some time period, as well as the probability of the customer going back to the store. It makes the following assumptions:

1. While a customer is active, the number of transactions within some period can be modelled using a Poisson distribution. The Poisson distribution is used here since it models how many events occur within a fixed period given a transaction rate, lambda.
2. The transaction rate varies from customer to customer and is modelled using a Gamma distribution. The Gamma distribution is commonly described as the waiting time until the nth event occurs.
3. Each customer becomes inactive after each transaction with probability p, and p varies across customers following a Beta distribution. The Beta distribution is a distribution over probabilities, so it fits the problem of modelling the chance that a customer stops coming back.
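The three assumptions above can be sketched as a small simulation of repeat purchases. This is my own illustration with NumPy, not code from the lifetimes package, and the parameter values (`r`, `alpha`, `a`, `b`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_customer(r, alpha, a, b, T):
    """Simulate repeat-purchase times in (0, T] for one customer."""
    # Assumption 2: the purchase rate differs per customer (Gamma).
    lam = rng.gamma(shape=r, scale=1.0 / alpha)
    # Assumption 3: the dropout probability differs per customer (Beta).
    p = rng.beta(a, b)
    if lam <= 0:
        return []
    t, purchase_times = 0.0, []
    while True:
        # Assumption 1: Poisson purchasing, so gaps are exponential.
        t += rng.exponential(1.0 / lam)
        if t > T:
            break
        purchase_times.append(t)
        # After each purchase the customer may become inactive.
        if rng.random() < p:
            break
    return purchase_times

T = 365.0  # observation window in days
customers = [
    simulate_customer(r=0.25, alpha=4.0, a=0.8, b=2.5, T=T)
    for _ in range(1000)
]
repeat_counts = [len(c) for c in customers]
no_repeat = sum(n == 0 for n in repeat_counts) / len(repeat_counts)
print("Share of customers with no repeat purchase: {:.2f}".format(no_repeat))
```

Fitting the model runs this logic in reverse: given the observed purchase histories, it looks for the four parameters that make such histories most likely.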


Lastly, the Gamma-Gamma model predicts the AOV of the customer in the future. 

1. Similar to the BG/NBD model, it assumes that the value of a customer's transactions is Gamma distributed, with shape p and scale v.
2. The scale v is in turn assumed to follow a Gamma distribution across customers.
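A practical consequence of these two assumptions is that a customer's predicted AOV is a weighted average of their own observed average and the population average. The sketch below follows the closed form as I understand it from the Gamma-Gamma literature; `q` and `gamma` denote the shape and scale of the distribution over v, and all parameter values are hypothetical.

```python
def expected_average_order_value(frequency, monetary_value, p, q, gamma):
    """Conditional expected AOV under the Gamma-Gamma model (sketch)."""
    # Average spend across the whole customer base.
    population_mean = gamma * p / (q - 1)
    # The more repeat purchases we observe, the more we trust
    # the customer's own average over the population average.
    weight = p * frequency / (p * frequency + q - 1)
    return (1 - weight) * population_mean + weight * monetary_value

# Hypothetical parameters giving a population mean of 30 USD.
params = dict(p=6.0, q=4.0, gamma=15.0)

for freq in (0, 1, 10):
    estimate = expected_average_order_value(freq, 60.0, **params)
    print("frequency={:>2}: predicted AOV = {:.2f} USD".format(freq, estimate))
```

With these numbers, a customer who averaged 60 USD is predicted at 30 USD with no repeat purchases, 50 USD after one, and close to 60 USD after ten.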

Now that we know the idea behind getting the LTV of the customer, we'll use the [lifetimes library](https://pypi.org/project/Lifetimes/) to model and predict purchases in the next 30 days.

In [None]:
!pip install Lifetimes

from lifetimes import BetaGeoFitter #used to compute the number of transactions at period T, and the probability that the customer is alive
from lifetimes import GammaGammaFitter #used to compute for the average sales per transaction
from lifetimes.plotting import plot_frequency_recency_matrix, plot_probability_alive_matrix
from lifetimes.utils import summary_data_from_transaction_data, calculate_alive_path

When using the lifetimes package, it is important to transform your transactional data into RFM (Recency, Frequency, Monetary Value), as these are the inputs to the BG/NBD model and the Gamma-Gamma model.

If you build your own RFM data, note that the definitions differ slightly from the classical ones:
1. Recency - The time between the customer's first purchase and their last purchase.
2. Frequency - The number of repeat purchases a customer has made. Multiple transactions on a single day still count as 1. Computation for frequency: number of unique days with a transaction - 1
3. Monetary Value - The average sales per transaction for each customer.
4. T - How long the customer has existed. This equals the end of the observation period minus the date of first purchase.
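To make the definitions concrete, here is a hand-rolled sketch of frequency, recency and T on a made-up table. It only illustrates the definitions above; it is not how the lifetimes package computes them internally, and the observation end date is an assumption.

```python
import pandas as pd

# Hypothetical transactions: customer A buys on three different days
# (twice on the first day), customer B buys once.
toy = pd.DataFrame({
    "Customer_ID": ["A", "A", "A", "A", "B"],
    "Date": pd.to_datetime([
        "2017-01-01", "2017-01-01", "2017-01-11", "2017-01-31", "2017-01-21",
    ]),
    "Sales_Amount": [10.0, 5.0, 20.0, 30.0, 8.0],
})
observation_end = pd.Timestamp("2017-02-10")  # assumed end of the period

# Collapse same-day transactions into a single purchase day.
daily = toy.groupby(["Customer_ID", "Date"])["Sales_Amount"].sum().reset_index()
by_customer = daily.groupby("Customer_ID")["Date"]

rfm = pd.DataFrame({
    "frequency": by_customer.nunique() - 1,                          # repeat purchase days
    "recency": (by_customer.max() - by_customer.min()).dt.days,      # first to last purchase
    "T": (observation_end - by_customer.min()).dt.days,              # first purchase to period end
})
print(rfm)
```

Customer A ends up with frequency 2, recency 30 and T 40, while the one-time customer B has frequency 0 and recency 0.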

## Predicting Customer Purchases and Status

In [None]:
from lifetimes.plotting import plot_frequency_recency_matrix, plot_period_transactions, plot_calibration_purchases_vs_holdout_purchases, plot_probability_alive_matrix
from lifetimes.utils import calibration_and_holdout_data

In [None]:
df.columns

In [None]:
#prepare and transform transactional data into RFM
summary_trans = df.groupby(["Customer_ID", "Date", "Transaction_ID"])["Sales_Amount"].sum().reset_index()
data = summary_data_from_transaction_data(summary_trans, customer_id_col = "Customer_ID", datetime_col = "Date", monetary_value_col = "Sales_Amount")

#modelling using the lifetimes package is a lot similar to using 
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(data['frequency'], data['recency'], data['T'])

## BG/NBD - Model Evaluation

Before predicting the number of transactions for each customer, it is important to first check whether the model describes the data well. This is done with several diagnostic plots that are also part of the lifetimes package.

In [None]:
plot_period_transactions(bgf)

In [None]:
summary_cal_holdout = calibration_and_holdout_data(transactions = df, 
                                                   customer_id_col = 'Customer_ID', 
                                                   datetime_col = 'Date',
                                                   monetary_value_col = "Sales_Amount",
                                        calibration_period_end='2017-11-01',
                                        observation_period_end='2017-12-31' )

bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary_cal_holdout['frequency_cal'], summary_cal_holdout['recency_cal'], summary_cal_holdout['T_cal'])
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_cal_holdout)

In calibration_and_holdout_data, transactions up to calibration_period_end form the training (calibration) set, while transactions after it, up to observation_period_end, form the test (holdout) set. As shown, the model describes the data well, as there is very little error.

Note: Originally, with the penalizer set to 0, the model predictions against the holdout data were way off. Setting it to 0.001 made the model follow the data much more closely.

## Customer Analysis

Now that the model is working well, the next step is to analyze the future behavior of the customers.

In [None]:
#refit the bgf model with the entire dataset since it last fitted the data using the holdout and calibration
bgf.fit(data['frequency'], data['recency'], data['T'])

In [None]:
plot_frequency_recency_matrix(bgf, T = 30);

Customers who have been with the business for over 150 days are usually expected to make only a single purchase. Only when a customer's age with the business is over 300 days would you expect a high number of transactions.

In [None]:
plot_probability_alive_matrix(bgf)

Historically, only customers with an age of about 50 days show a high probability of making another purchase. One way to improve on this would be a retargeting campaign aimed at these users to raise the chances of a repeat purchase. From a marketing standpoint, enticing repeat customers is much cheaper than finding new purchasers.
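The shape of the probability-alive matrix follows from the BG/NBD closed form for P(alive). The sketch below is my own transcription of that formula as I understand it from the BG/NBD literature, with hypothetical parameter values rather than the fitted ones.

```python
def probability_alive(frequency, recency, T, r, alpha, a, b):
    """P(customer is still active) under BG/NBD (sketch)."""
    # With no repeat purchase the model has had no chance to see a dropout.
    if frequency == 0:
        return 1.0
    ratio = (alpha + T) / (alpha + recency)
    odds_dead = (a / (b + frequency - 1)) * ratio ** (r + frequency)
    return 1.0 / (1.0 + odds_dead)

# Hypothetical fitted parameters.
params = dict(r=0.25, alpha=4.0, a=0.8, b=2.5)

# Two customers with 5 repeat purchases each, observed for 100 days:
recent = probability_alive(frequency=5, recency=95, T=100, **params)
lapsed = probability_alive(frequency=5, recency=20, T=100, **params)
print("Last purchase on day 95: {:.3f}".format(recent))
print("Last purchase on day 20: {:.3f}".format(lapsed))
```

For the same number of purchases, a long silence since the last one pulls the probability down sharply, which is the kind of customer a retargeting campaign would aim at.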

## Estimating Average Order Value of Customer

The Gamma-Gamma model will be used to estimate the future AOV of the customer.

In [None]:
returning_customers = data[data["frequency"] > 0]

Part of the data processing for the Gamma-Gamma model is to take out the customers who did not make any repeat purchases.

In [None]:
returning_customers[['monetary_value', 'frequency']].corr()

One assumption of the Gamma-Gamma model is that there should ideally be zero or very little correlation between frequency and monetary value. The Pearson correlation shows hardly any correlation between the two, so the assumption is satisfied and we can proceed with the modelling.

## Gamma-Gamma - Model Evaluation

In [None]:
from sklearn.metrics import mean_absolute_error
cal_sample = summary_cal_holdout[summary_cal_holdout["frequency_cal"] > 0]
ggf = GammaGammaFitter()

ggf.fit(cal_sample["frequency_cal"], cal_sample["monetary_value_cal"])

#predict from calibration-period inputs only; feeding the holdout values back in would leak the answer into the prediction
x = ggf.conditional_expected_average_profit(cal_sample["frequency_cal"], cal_sample["monetary_value_cal"]).reset_index()
x.columns = ["Customer_ID", "Predicted_AOV"]

# (summary_trans["Date"] > '2017-11-01') &  (summary_trans["Date"] < '2017-07-01')
y = summary_trans[(summary_trans["Date"] > '2017-11-01') &  (summary_trans["Date"] < '2018-01-01')].groupby("Customer_ID").agg({"Sales_Amount":"sum", "Transaction_ID":"nunique"}).reset_index()
y["True AOV"] = y["Sales_Amount"]/y["Transaction_ID"]

z = x.merge(y[["Customer_ID", "True AOV"]], on = "Customer_ID")
z["Error"] = z["True AOV"] - z["Predicted_AOV"]

In [None]:
mae = mean_absolute_error(z["True AOV"], z["Predicted_AOV"])
print("MAE using the Gamma-Gamma model is: {:.2f}".format(mae))

The mean absolute error is used to understand how large the gap is between the estimated AOV and the true AOV.
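MAE hides the direction of the error, so it is worth looking at the signed error alongside it. A minimal sketch with made-up values (the helper names are mine):

```python
def mean_absolute_error_simple(true_values, predicted_values):
    """Average size of the error, ignoring its direction."""
    errors = [abs(t - p) for t, p in zip(true_values, predicted_values)]
    return sum(errors) / len(errors)

def mean_signed_error(true_values, predicted_values):
    """Average of true minus predicted; positive means underestimation."""
    errors = [t - p for t, p in zip(true_values, predicted_values)]
    return sum(errors) / len(errors)

# Hypothetical true and predicted AOVs for three customers.
true_aov = [20.0, 30.0, 10.0]
predicted_aov = [18.0, 25.0, 14.0]

print("MAE: {:.2f}".format(mean_absolute_error_simple(true_aov, predicted_aov)))
print("Mean signed error: {:.2f}".format(mean_signed_error(true_aov, predicted_aov)))
```

A positive mean signed error, with error defined as true minus predicted, indicates that the predictions are too low on average.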

In [None]:
z["Error"].plot(kind = "kde")
plt.xlim([-100,100])
plt.title("Error Distribution", fontsize = 15);

In [None]:
print("Median error is {:.2f}".format(z["Error"].median()))
print("Max error is {:.2f}, with {} customers that have an error of at least 13".format(z["Error"].max(), z[z["Error"] >= 13].shape[0]))
print("Min error is {:.2f}, with {} customers that have an error of less than -10".format(z["Error"].min(), z[z["Error"] < -10].shape[0]))

Most of the error sits around 3. However, the model tends to be conservative and underestimates the actual average order value, since there are over 1036 customers with an error larger than 13. One way to resolve this would be to segment the customers and fit a separate Gamma-Gamma model for each segment to get better model performance.

## Expected Future Customer Behavior

In [None]:
#refit ggf using the entire dataset
ggf = GammaGammaFitter()
ggf.fit(returning_customers['frequency'], returning_customers['monetary_value'])

In [None]:
period = 30 #30 days

data['predicted_purchases'] = bgf.conditional_expected_number_of_purchases_up_to_time(period, data['frequency'], data['recency'], data['T'])
data["p_alive"] = bgf.conditional_probability_alive(data["frequency"], data["recency"], data["T"])

data["average_sales"] = ggf.conditional_expected_average_profit(data['frequency'],data['monetary_value'])
data["LTV"] = ggf.customer_lifetime_value(bgf, frequency = data["frequency"], recency = data["recency"], T = data["T"], monetary_value = data["monetary_value"], 
                            time = period/30) #time is expressed in months here, so the 30-day period is passed as 1 month

In [None]:
data["average_sales"].plot(kind = "kde")
plt.xlim([-20,50])
plt.title("Distribution of AOV for Each Customer", fontsize = 15);

In [None]:
ref_table = df[["Customer_ID", "Buyer Status"]].drop_duplicates()
data = data.merge(ref_table, on = "Customer_ID")


first_time_aov = data[data["Buyer Status"] == "First Time"]["average_sales"].median()
repeat_aov = data[data["Buyer Status"] == "Repeat Purchaser"]["average_sales"].median()

print("Median AOV for first time purchasers is {:.2f}USD, while repeat customers are at {:.2f}USD".format(first_time_aov, repeat_aov))

There are 2 peaks in the distribution plot, at around 11 USD and 25 USD. The peak at 25 USD corresponds to the first-time purchasers, while the repeat customers are expected to have a lower AOV.

In [None]:

data["p_alive category"] = np.where(data["p_alive"] > 0.8, "Returning", "High Risk")

data["p_alive category"].value_counts().plot(kind = "pie", autopct = "%.2f", figsize = (6,6))
plt.ylabel("")
plt.title("Expected % of Customers for the next 30 days", fontsize = 15);

Over 35% of the customers are at high risk of not going back to the store.

# What to do next?

1. As the business has already stagnated, the next phase to elevate it would be through optimization.
    1. We have identified several SKUs and SKU categories that have underperformed and possibly have an impact on operational cost.
        1. Do a further analysis of the cost incurred for each product and decide whether to take it out of production or scale it down.
    2. Several item categories that are strongly associated with each other don't show up in many transactions.
        1. Since they have a high chance of cross-selling, placing those two categories beside each other could work well.
        2. Another option would be to email the customers and do cross-selling from there.
    3. First-time purchasers show a decreasing trend.
        1. Marketing campaigns that grow brand awareness would help entice potential new customers to go to the store.
    4. About 35% of the existing customer base is at risk of not returning in the next 30 days.
        1. Retarget the existing customers to entice them to come back to the store.