# E-COMMERCE DATA

### Author: Vu Duong

#### Date: June, 2020

# CREDITS:

This work is inspired by several great earlier sources:
- https://www.kaggle.com/admond1994/e-commerce-data-eda/notebook
- https://www.kaggle.com/fabiendaniel/customer-segmentation
- https://www.udemy.com/course/data-science-deep-learning-for-business-20-case-studies/
- https://www.kaggle.com/ostrowski/market-basket-analysis-exploring-e-commerce-data
- https://www.geeksforgeeks.org/implementing-apriori-algorithm-in-python/
- https://www.kaggle.com/carrie1/customer-insights
- https://www.kaggle.com/yugagrawal95/rfm-analysis
- https://www.kaggle.com/fszlnwr/customer-segmentation-rfm-cohort-analysis

# INTRODUCTION

This is a transnational data set containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

A detailed description of the dataset is available at the following link:
https://www.kaggle.com/carrie1/ecommerce-data

# LIBRARY

In [None]:
# Data Processing
import numpy as np
import pandas as pd
import datetime as dt

# Data Visualizing
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator
from IPython.display import display, HTML
import plotly.express as px
import plotly.graph_objs as go
from IPython.display import Image

# Data Clustering
from mlxtend.frequent_patterns import apriori # Data pattern exploration
from mlxtend.frequent_patterns import association_rules # Association rules conversion

# Data Modeling
from sklearn.ensemble import RandomForestRegressor

# Math
from scipy import stats  # Computing the t and p values using scipy 
from statsmodels.stats import weightstats 

# Warning Removal
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

# DATA EXPLORATION

In [None]:
# https://stackoverflow.com/questions/22216076/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-s/50538501#50538501
df = pd.read_csv('../input/ecommerce-data/data.csv', encoding= 'unicode_escape')

In [None]:
df

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.columns

# DATA MANIPULATION

### Check for any Duplicated Rows

In [None]:
print(df.duplicated().sum())
df.drop_duplicates(inplace = True)

### Description
- Each StockCode should map to one specific Description, so missing descriptions are filled with the most common description recorded for the same StockCode.

In [None]:
# https://stackoverflow.com/questions/574730/python-how-to-ignore-an-exception-and-proceed/575711#575711
# https://stackoverflow.com/questions/59127458/pandas-fillna-using-groupby-and-mode
def cleaning_description(df):
    try: 
        return df.mode()[0] # df.mode().iloc[0]
    except Exception:
        return 'unknown'
    
df[['StockCode', 'Description']] = df[['StockCode', 'Description']].fillna(df[['StockCode', 'Description']].groupby('StockCode').transform(cleaning_description))

# Cleaning Description field for proper aggregation
df['Description'] = df['Description'].str.strip().copy()
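As a minimal sketch of the fill step above, the toy frame below (hypothetical data, not taken from the dataset) shows how a per-StockCode mode can fill missing descriptions, falling back to 'unknown' when a code has no description at all. The helper name `most_common` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy data: code "C" has no description at all
toy = pd.DataFrame({
    "StockCode": ["A", "A", "A", "B", "B", "C"],
    "Description": ["MUG", None, "MUG", "BAG", None, None],
})

def most_common(series):
    """Return the most frequent non-null value, or 'unknown' if there is none."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else "unknown"

# One fill value per row, computed within each StockCode group
fill_values = toy.groupby("StockCode")["Description"].transform(most_common)
toy["Description"] = toy["Description"].fillna(fill_values)
print(toy)
```

Checking `modes.empty` explicitly avoids relying on an exception to detect groups without any description.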

### Invoice Number
- An invoice number should be a 6-digit integer. A code starting with the letter C indicates a cancelled invoice.
- However, some cancellations do not start with the letter C and can only be identified by a negative value in the Quantity feature.
- Thus, it is reasonable to strip the leading letter C and identify cancellations through Quantity instead.

In [None]:
def clean_InvoiceNo(InvoiceNo):    
    if InvoiceNo[0] == 'C':
        return InvoiceNo[1:]  # drop only the leading 'C', not every 'C' in the code
    else:
        return InvoiceNo
df['InvoiceNo'] = df['InvoiceNo'].apply(clean_InvoiceNo)
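A vectorized alternative to the row-wise function above can be sketched as follows. The invoice codes are made up for illustration, and keeping a separate `is_cancelled` flag is an assumption of this sketch rather than something the notebook does.

```python
import pandas as pd

# Hypothetical invoice codes, including one with a second 'C' inside the code
invoices = pd.Series(["536365", "C536379", "C53C001", "A563185"])

# Remember which invoices were flagged as cancelled before stripping the prefix
is_cancelled = invoices.str.startswith("C")

# Remove only a leading 'C'; the anchor ^ keeps any later 'C' untouched
cleaned = invoices.str.replace(r"^C", "", regex=True)

print(pd.DataFrame({"raw": invoices, "cleaned": cleaned, "is_cancelled": is_cancelled}))
```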

### Quantity
- Most invoices cluster around 0, in a range of [-10000, 10000].
- There are some outliers far above and below this range. We can safely remove them for the sake of revenue analysis.

In [None]:
# Plot Quantity
plt.figure(constrained_layout=True, figsize=(12, 5))
sns.boxplot(x=df['Quantity'])

# remove outliers for Quantity
df = df[(df['Quantity'] < 15000) & (df['Quantity'] > -15000)]

### InvoiceDate
- Applying feature extraction on InvoiceDate to get new features such as date, day, month, year, hour, day of week for further analysis.

In [None]:
# Change datatype of InvoiceDate as datetime type
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
# df['date'] = pd.to_datetime(df['InvoiceDate'], utc=False)
# df['date'].dtypes

# Create new features
df['date'] = df['InvoiceDate'].dt.date   # df['date'].dt.normalize()  # Show only date
df['day'] = df['InvoiceDate'].dt.day
df['month'] = df['InvoiceDate'].dt.month
df['year'] = df['InvoiceDate'].dt.year
df['hour'] = df['InvoiceDate'].dt.hour
df['dayofweek'] = df['InvoiceDate'].dt.dayofweek
df['dayofweek'] = df['dayofweek'].map( {0: '1_Mon', 1: '2_Tue', 2: '3_Wed', 3: '4_Thur', 4: '5_Fri', 5: '6_Sat', 6: '7_Sun'})
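The feature extraction above can be checked on two hand-picked timestamps. This is a small sketch on hypothetical values; the explicit `format` string is an assumption about how the raw dates are written.

```python
import pandas as pd

# Two hypothetical timestamps in month/day/year order
stamps = pd.Series(pd.to_datetime(
    ["12/01/2010 08:26", "12/09/2011 12:50"], format="%m/%d/%Y %H:%M"
))

# Same label scheme as above: the numeric prefix keeps the days sorted in plots
day_labels = {0: '1_Mon', 1: '2_Tue', 2: '3_Wed', 3: '4_Thur',
              4: '5_Fri', 5: '6_Sat', 6: '7_Sun'}

features = pd.DataFrame({
    "month": stamps.dt.month,
    "year": stamps.dt.year,
    "hour": stamps.dt.hour,
    "dayofweek": stamps.dt.dayofweek.map(day_labels),
})
print(features)
```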

### UnitPrice
- A unit price should be at least 0, so any unit price below that baseline is treated as an outlier and removed.

In [None]:
# Clean UnitPrice
''' 
Steps to clean Unit Price
    df['UnitPrice'].describe()
    df[df['UnitPrice'] < 0]
    sns.boxplot(df['UnitPrice'])
    sns.distplot(df['UnitPrice'])
    df[df['StockCode'] == 'M']
    df[df['UnitPrice'] > 15000]
'''
df = df[df['UnitPrice'] >= 0]

### CustomerID
- Many invoices have an unknown customer, yet we still know which country each invoice comes from. Thus, we should fill missing CustomerID values with 'unknown' rather than filter out the rows with an unknown CustomerID.

In [None]:
# Fill CustomerID with unknown
df['CustomerID'].dropna(inplace=True)

### Revenue
- Revenue is the product of UnitPrice and Quantity from each transaction.

In [None]:
# Create a new feature Revenue
df['Revenue'] = df['UnitPrice'] * df['Quantity']

# DATA ANALYSIS
To explore the data, we conduct an in-depth analysis guided by the following questions:
1. Which customers (CustomerID) bring the most revenue?
2. Which customers (CustomerID) buy the most in terms of quantity?
3. Which customers (CustomerID) are likely to return products?
4. Which items are bought most and least?
5. Which countries bring the most revenue in total and on average?
6. In which months do we sell the most and the least?
7. At what time of day do people tend to buy our products?
8. On which days of the week do people tend to visit and purchase?
9. Is there any relationship between Repeat Customers and All Customers over a year?
10. Which items are trending?

### Question 1 and 2: Which customers bring the most revenue, and which buy the most in terms of quantity?
- Setting aside the unknown CustomerID, CustomerID 14646 brings \\$279489 in total, at around \\$134 per transaction. This customer bought 196719 items in total, although the unit price of each item is only approximately \\$2.6.
- Most customers bring revenue between 0 and \\$1608.

In [None]:
CustomerID_Rev = df.groupby('CustomerID')[['Revenue',
                          'Quantity',
                          'UnitPrice']].agg(['sum',
                                             'mean',
                                             'median']).sort_values(by=[('Revenue', 'sum')], ascending=False)
display(CustomerID_Rev.reset_index())

display(pd.DataFrame(CustomerID_Rev.iloc[1:][('Revenue','sum')].describe()))

# Remove the unknown CustomerID
sns.distplot(CustomerID_Rev.iloc[1:][('Revenue','sum')], kde=False)

### Question 3: Which customers are likely to return products?
- CustomerIDs 15838 and 15749 returned the most items, 9361 and 9014 respectively, which causes a large deduction in our overall Revenue.

In [None]:
Item_retured = df[df['Quantity'] < 0].groupby('CustomerID')[['Revenue',
                                              'Quantity']].agg(['sum']).sort_values(by=[('Quantity', 'sum')], ascending=True).head(10)

sns.barplot(x=Item_retured.index, y=abs(Item_retured[('Quantity','sum')]))
plt.ylabel('A number of Quantity returned')
plt.xticks(rotation=90)
plt.show()

Item_retured

### Question 4: Which items are bought most and least?
- StockCodes 22197 and 84007 lead in quantity sold, at unit prices of 0.72, 0.29 & 0.21. This suggests that when the unit price is low, people can afford to buy more.
- There are about 14 least preferred items overall.
- Items with a negative total quantity are not considered in this analysis, because they were probably bought before the period covered by this dataset (perhaps in 2009) and returned in 2010 or 2011. Thus, we do not have enough evidence to analyze them.

In [None]:
most_prefered_items = df.groupby(['StockCode', 'UnitPrice'])[['Quantity']].sum().sort_values(by=['Quantity'],ascending=False).head(10)

most_prefered_items

In [None]:
most_prefered_items1 = df.groupby(['StockCode'])[['Quantity']].sum().sort_values(by=['Quantity'],ascending=False).head(10)

most_prefered_items2 = df.groupby(['StockCode', 'UnitPrice'])[['Quantity']].sum().sort_values(by=['Quantity'],ascending=False).head(10)

sns.barplot(x=most_prefered_items1.index, y=most_prefered_items1['Quantity'])
plt.ylabel('Total Quantity sold')
plt.xticks(rotation=90)
plt.show()

display(most_prefered_items1)
display(most_prefered_items2)

In [None]:
least_prefered_items = df.groupby(['StockCode'])[['Quantity']].sum().sort_values(by=['Quantity'],ascending=False)
least_prefered_items = least_prefered_items[least_prefered_items['Quantity']==0]
print('A list of least preferred items: ', len(least_prefered_items))
least_prefered_items

### Question 5: Which countries bring the most revenue in total and on average?
- This is a UK-based online retailer, so the United Kingdom brings the most revenue and quantity.
- The Netherlands comes second in total, but spends considerably more per transaction: around \\$120 and 84 items.
- The average UnitPrice per invoice for Singapore and Hong Kong, around \\$109 and \\$42, stands out from the rest, which range only between \\$2 and \\$8.5.

In [None]:
InvoiceNumber_Country = pd.DataFrame(df.groupby(['Country'])['InvoiceNo'].count())

fig = go.Figure(data=go.Choropleth(
                locations=InvoiceNumber_Country.index, # Spatial coordinates
                z = InvoiceNumber_Country['InvoiceNo'].astype(float), # Data to be color-coded
                locationmode = 'country names', # set of locations match entries in `locations`
                colorscale = 'Reds',
                colorbar_title = "Order number",
            ))

fig.update_layout(
    title_text = 'Order number per country',
    geo = dict(showframe = True, projection={'type':'mercator'})
)
fig.layout.template = None
fig.show()

In [None]:
# Source: https://stackoverflow.com/questions/36220829/fine-control-over-the-font-size-in-seaborn-plots-for-academic-papers/36222162#36222162
country_revenue = df.groupby('Country')[['Revenue']].agg(['sum',
                                        'mean',
                                        'median']).sort_values(by=[('Revenue', 'sum')], ascending=False)
display(country_revenue)

fig = plt.figure(constrained_layout=True, figsize=(20, 6))
a = sns.barplot(y=country_revenue.index, x=country_revenue[('Revenue', 'sum')])
plt.xlabel('Total Revenue from all country', fontsize=18)
plt.ylabel('Country', fontsize=18)


fig = plt.figure(constrained_layout=True, figsize=(20, 6))
country_revenue = country_revenue.drop('United Kingdom')
sns.barplot(y=country_revenue.index, x=country_revenue[('Revenue', 'sum')])
plt.xlabel('Total Revenue from all country but UK', fontsize=18)
plt.ylabel('Country', fontsize=18)
plt.show()



In [None]:
country_quantity = df.groupby('Country')[['Quantity']].agg(['sum',
                                        'mean',
                                        'median']).sort_values(by=[('Quantity', 'sum')], ascending=False)

display(country_quantity)

fig = plt.figure(constrained_layout=True, figsize=(20, 6))
a = sns.barplot(y=country_quantity.index, x=country_quantity[('Quantity', 'sum')])
plt.xlabel('Total Quantity from all country', fontsize=18)
plt.ylabel('Country', fontsize=18)


fig = plt.figure(constrained_layout=True, figsize=(20, 6))
country_quantity = country_quantity.drop('United Kingdom')
sns.barplot(y=country_quantity.index, x=country_quantity[('Quantity', 'sum')])
plt.xlabel('Total Quantity from all country but UK', fontsize=18)
plt.ylabel('Country', fontsize=18)
plt.show()

In [None]:
unitprice_average = df.groupby('Country')[['UnitPrice']].agg(['sum',
                                        'mean']).sort_values(by=[('UnitPrice', 'mean')], ascending=False)
display(unitprice_average)

fig = plt.figure(constrained_layout=True, figsize=(20, 6))
a = sns.barplot(y=unitprice_average.index, x=unitprice_average[('UnitPrice', 'mean')])
plt.xlabel('Average UnitPrice from all country', fontsize=18)
plt.ylabel('Country', fontsize=18)


fig = plt.figure(constrained_layout=True, figsize=(20, 6))
unitprice_average = unitprice_average.drop('United Kingdom')
# Use the index of unitprice_average itself so labels and values stay aligned
sns.barplot(y=unitprice_average.index, x=unitprice_average[('UnitPrice', 'mean')])
plt.xlabel('Average UnitPrice from all country but UK', fontsize=18)
plt.ylabel('Country', fontsize=18)
plt.show()


### Question 6: In which months do we sell the most and the least?
- From January to August, revenue increases gradually from \\$560K to \\$700K.
- Towards the end of the year, sales jump to over a million and peak in November at \\$1461K.
- However, the average revenue chart shows no drastic change.

In [None]:
month_sales = df.groupby(['month'])['Revenue'].agg(['sum','mean'])

fig, axes = plt.subplots(1, 2, figsize=(18, 5))
axes = axes.flatten()

sns.barplot(x=month_sales.index, y=month_sales['sum'], ax=axes[0]).set_title("Total Revenue over a year")
axes[0].set_ylabel('Total Revenue')
axes[0].tick_params(axis='x', labelrotation=90)

sns.barplot(x=month_sales.index, y=month_sales['mean'], ax=axes[1]).set_title("Average Revenue over a year")
axes[1].set_ylabel('Average Revenue')
axes[1].tick_params(axis='x', labelrotation=90)
plt.show()

month_sales

### Question 7: What time do people tend to buy our products?
- At 6 o'clock, people may want to return undesired items.
- Starting from 7 am, people begin to purchase from the online store. Revenue peaks at 12 pm, then gradually decreases until 6 pm, after which only a few customers make purchases.
- In the second chart, the average revenue per invoice at 7 am is substantially higher than at any other hour of the day. This suggests people buy large quantities per transaction at the beginning of the day.

In [None]:
hour_sales = df.groupby(['hour'])['Revenue'].agg(['sum','mean'])

fig, axes = plt.subplots(1, 2, figsize=(18, 5))
axes = axes.flatten()

sns.barplot(x=hour_sales.index, y=hour_sales['sum'], ax=axes[0]).set_title("Total Revenue in a day")
axes[0].set_ylabel('Total Revenue')
axes[0].tick_params(axis='x', labelrotation=90)

sns.barplot(x=hour_sales.index, y=hour_sales['mean'], ax=axes[1]).set_title("Average Revenue per Invoice in a day")
axes[1].set_ylabel('Average Revenue')
axes[1].tick_params(axis='x', labelrotation=90)
plt.show()

hour_sales

### Question 8: On which days of the week do people tend to visit and purchase?
- Sales go up and down during weekdays.
- At the weekend, there are no transactions on Saturday, and sales on Sunday are only about a half to a third of weekday sales.
- The two charts below show a similar distribution.

In [None]:
dayofweek_sales = df.groupby(['dayofweek'])['Revenue'].agg(['sum','mean',])

fig, axes = plt.subplots(1, 2, figsize=(18, 5))
axes = axes.flatten()

sns.barplot(x=dayofweek_sales.index, y=dayofweek_sales['sum'], ax=axes[0]).set_title("Total Revenue over a week")
axes[0].set_ylabel('Total Revenue')
axes[0].tick_params(axis='x', labelrotation=90)

sns.barplot(x=dayofweek_sales.index, y=dayofweek_sales['mean'], ax=axes[1]).set_title("Average Revenue over a week")
axes[1].set_ylabel('Average Revenue')
axes[1].tick_params(axis='x', labelrotation=90)
plt.show()

dayofweek_sales

### Question 9: Is there any relationship between Repeat Customers and All Customers over a year?
- Investigating the number of Repeat Customers and All Customers
- Looking at the revenue generated from the Repeat Customers and All Customers

In [None]:
# Get our date range for our data
print('Date Range: %s to %s' % (df['InvoiceDate'].min(), df['InvoiceDate'].max()))

# We're taking all of the transactions that occurred before December 01, 2011 
df = df[df['InvoiceDate'] < '2011-12-01']

In [None]:
# Get total amount spent per invoice and associate it with CustomerID and Country
invoice_customer_df = df.groupby(by=['InvoiceNo', 'InvoiceDate']).agg({'Revenue': 'sum', 'CustomerID': 'max', 'Country': 'max'}).reset_index()
invoice_customer_df

In [None]:
# Source: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
# We set our index to our invoice date
# Grouper(freq='M') then groups the data by the 'InvoiceDate' index, month by month
# We group this data by CustomerID and count the number of unique repeat customers for each month (the index shows the month-end date)
# The filter function keeps only the groups that satisfy the rule in our lambda, i.e. more than 1 invoice in the month (repeat customers)

monthly_repeat_customers_df = invoice_customer_df.set_index('InvoiceDate').groupby([
              pd.Grouper(freq='M'), 'CustomerID']).filter(lambda x: len(x) > 1).resample('M').nunique()['CustomerID']

monthly_repeat_customers_df

In [None]:
# Number of Unique customers per month
monthly_unique_customers_df = df.set_index('InvoiceDate')['CustomerID'].resample('M').nunique()
monthly_unique_customers_df

In [None]:
# Ratio of Repeat to Unique customers
monthly_repeat_percentage = monthly_repeat_customers_df/monthly_unique_customers_df*100.0
monthly_repeat_percentage
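The repeat-customer ratio can be illustrated on a handful of invoices. This sketch uses hypothetical data and monthly periods (`dt.to_period`) instead of `pd.Grouper`, which is a swapped-in technique chosen only to keep the example short; a customer counts as "repeat" in a month when they have more than one invoice in that month.

```python
import pandas as pd

# Hypothetical invoices: customer 10 orders twice in January
invoices = pd.DataFrame({
    "InvoiceNo": ["1", "2", "3", "4", "5"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-03", "2011-01-20", "2011-01-25", "2011-02-02", "2011-02-10"]
    ),
    "CustomerID": [10, 10, 11, 10, 12],
})
invoices["month"] = invoices["InvoiceDate"].dt.to_period("M")

# Invoices per customer per month
per_customer = invoices.groupby(["month", "CustomerID"])["InvoiceNo"].nunique()

# All customers per month, and those with more than one invoice in the month
unique = invoices.groupby("month")["CustomerID"].nunique()
repeat = per_customer[per_customer > 1].groupby(level="month").size()
repeat = repeat.reindex(unique.index, fill_value=0)  # months with no repeat customers

repeat_pct = repeat / unique * 100.0
print(repeat_pct)
```

Reindexing against the months of `unique` keeps months without any repeat customer at 0% rather than dropping them.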

In [None]:
fig = plt.figure(constrained_layout=True, figsize=(20, 6))
grid = gridspec.GridSpec(nrows=1, ncols=1,  figure=fig)

ax = fig.add_subplot(grid[0, 0])

pd.DataFrame(monthly_repeat_customers_df.values).plot(ax=ax, figsize=(12,8))

pd.DataFrame(monthly_unique_customers_df.values).plot(ax=ax,grid=True)

ax.set_xlabel('Date')
ax.set_ylabel('Number of Customers')
ax.set_title('Number of Unique vs. Repeat Customers Over Time')
plt.xticks(range(len(monthly_repeat_customers_df.index)), [x.strftime('%m.%Y') for x in monthly_repeat_customers_df.index], rotation=45)
ax.legend(['Repeat Customers', 'All Customers'])

In [None]:
# Let's investigate the relationship between revenue and repeat customers
monthly_revenue_df = df.set_index('InvoiceDate')['Revenue'].resample('M').sum()

monthly_rev_repeat_customers_df = invoice_customer_df.set_index('InvoiceDate').groupby([
    pd.Grouper(freq='M'), 'CustomerID']).filter(lambda x: len(x) > 1).resample('M').sum()['Revenue']

# Let's get a percentage of the revenue from repeat customers to the overall monthly revenue
monthly_rev_perc_repeat_customers_df = monthly_rev_repeat_customers_df/monthly_revenue_df * 100.0
monthly_rev_perc_repeat_customers_df

In [None]:
fig = plt.figure(constrained_layout=True, figsize=(20, 6))
grid = gridspec.GridSpec(nrows=1, ncols=1,  figure=fig)

ax = fig.add_subplot(grid[0, 0])
pd.DataFrame(monthly_rev_repeat_customers_df.values).plot(ax=ax, figsize=(12,8))

pd.DataFrame(monthly_revenue_df.values).plot(ax=ax,grid=True)

ax.set_xlabel('Date')
ax.set_ylabel('Revenue')
ax.set_title('Revenue from Repeat Customers vs. All Customers Over Time')
plt.xticks(range(len(monthly_repeat_customers_df.index)), [x.strftime('%m.%Y') for x in monthly_repeat_customers_df.index], rotation=45)
ax.legend(['Repeat Customers', 'All Customers'])

### Question 10: What are the item trends?

Let's count the number of items sold for each product for each period.

In [None]:
# Now let's get quantity of each item sold per month
date_item_df = df.set_index('InvoiceDate').groupby([pd.Grouper(freq='M'), 'StockCode'])['Quantity'].sum()
date_item_df.head(15)

In [None]:
# Rank items by the last month's sales
last_month_sorted_df = date_item_df.loc['2011-11-30']
last_month_sorted_df = last_month_sorted_df.reset_index()
last_month_sorted_df.sort_values(by='Quantity', ascending=False).head(10)

In [None]:
# Let's look at the top 5 items sale over a year
date_item_df = df.loc[df['StockCode'].isin(['23084', '84826', '22197', '22086', '85099B'])].set_index('InvoiceDate').groupby([
    pd.Grouper(freq='M'), 'StockCode','Description'])['Quantity'].sum().reset_index()

date_item_df

In [None]:
date_item_df = date_item_df.reset_index()

sns.set(style='whitegrid')
plt.figure(constrained_layout=True, figsize=(12, 5))
sns.lineplot(x=date_item_df['InvoiceDate'], y=date_item_df['Quantity'], hue=date_item_df['StockCode'])

### Question 11: Top 10 Reorder Items

In [None]:
df.groupby(['StockCode', 'Description'])['InvoiceNo'].count().sort_values(ascending = False).head(10)

### Question 12: What is the Mall's Cancellation Rate?

In [None]:
Num_Canceled_Orders = df[df['Quantity']<0]['InvoiceNo'].nunique()
Total_Orders = df['InvoiceNo'].nunique()
print('Cancellation Rate: {:.2f}%'.format(Num_Canceled_Orders/Total_Orders*100 ))
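A tiny worked example of the rate above, on hypothetical order lines: an invoice counts as cancelled as soon as any of its lines has a negative Quantity, which mirrors the filter used in the cell above.

```python
import pandas as pd

# Hypothetical order lines; invoice "3" mixes a purchase line and a return line
orders = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3", "3", "4"],
    "Quantity": [5, 2, -1, 3, -3, 10],
})

cancelled_orders = orders.loc[orders["Quantity"] < 0, "InvoiceNo"].nunique()
total_orders = orders["InvoiceNo"].nunique()
cancellation_rate = cancelled_orders / total_orders * 100

print('Cancellation Rate: {:.2f}%'.format(cancellation_rate))
```

Note that invoice "3" is counted as cancelled even though it also contains a positive line, so this definition gives an upper bound on fully cancelled orders.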

### Question 13: Does revenue come from items ordered repeatedly or from items ordered only once per month?

In [None]:
# Select Revenue before aggregating so non-numeric columns are never summed
Monthly_Reorder_Items_Revenue = df.set_index('InvoiceDate').groupby([ pd.Grouper(freq='M'), 'StockCode']).filter(lambda x: len(x) > 1)['Revenue'].resample('M').sum()
Monthly_One_Items_Revenue = df.set_index('InvoiceDate').groupby([ pd.Grouper(freq='M'), 'StockCode']).filter(lambda x: len(x) == 1)['Revenue'].resample('M').sum()
#Monthly_Revenue = df.groupby(['year','month'])['Revenue'].sum()  # Generate the same Result
Monthly_Revenue = df.set_index('InvoiceDate').groupby([pd.Grouper(freq='M')])['Revenue'].sum()

In [None]:
fig = plt.figure(constrained_layout=True, figsize=(20, 6))

ax = fig.add_subplot()
pd.DataFrame(Monthly_Reorder_Items_Revenue.values).plot(ax=ax, figsize=(12,8))
pd.DataFrame(Monthly_Revenue.values).plot(ax=ax,grid=True)
pd.DataFrame(Monthly_One_Items_Revenue.values).plot(ax=ax,grid=True)

ax.set_xlabel('Date')
ax.set_ylabel('Revenue')
ax.set_title('Revenue from Repeat vs. One-Time vs. All Items Over Time')
plt.xticks(range(len(monthly_repeat_customers_df.index)), [x.strftime('%m.%Y') for x in monthly_repeat_customers_df.index], rotation=45)
ax.legend(['Repeat Items', 'All Items', 'One Item'])

# MARKET BASKET ANALYSIS
- This section focuses on improving marketing performance in a data-driven way.
- We apply association rule mining with the Apriori algorithm to extract frequent itemsets.
- For further explanation, see this source: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/
- Final thought: Apriori is very useful for finding simple associations between our data items. It is easy to implement and highly explainable.
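To make the metrics behind association rules concrete, the sketch below computes support, confidence and lift by hand for one rule on four made-up baskets. It uses plain Python rather than mlxtend, and the item names are purely illustrative.

```python
# Hypothetical baskets (one set of items per invoice)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Share of transactions that contain every item in `itemset`."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

antecedent, consequent = {"bread"}, {"milk"}

rule_support = support(antecedent | consequent)   # P(bread and milk)
confidence = rule_support / support(antecedent)   # P(milk | bread)
lift = confidence / support(consequent)           # confidence relative to P(milk)

print(f"support={rule_support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift below 1, as in this toy case, means the two items appear together less often than independence would predict, which is why the rules are later filtered with a lift threshold of 1.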

In [None]:
Sample_df = df[:50]
Sample_df = Sample_df[['InvoiceNo', 'Description']]

In [None]:
Sample_df.set_index('InvoiceNo', inplace=True)

In [None]:
# Note that the quantity bought is not considered, only if the item was present or not in the basket
basket = pd.get_dummies(Sample_df)
basket_sets = pd.pivot_table(basket, index='InvoiceNo', aggfunc='sum') > 0  # True if the item appears in the invoice
basket_sets

In [None]:
# Apriori application: frequent_itemsets
# Note that min_support parameter was set to a very low value, this is the Spurious limitation, more on conclusion section
frequent_itemsets = apriori(basket_sets, min_support=0.22, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets

In [None]:
# Advanced and strategical data frequent set selection
frequent_itemsets[ (frequent_itemsets['length'] > 1) &
                   (frequent_itemsets['support'] >= 0.02)]

In [None]:
# Generating the association_rules: rules
# Selecting the important parameters for analysis
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules

In [None]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].sort_values('support', ascending=False).head()

In [None]:
# Visualizing the rules distribution color mapped by Lift
plt.figure(figsize=(14, 8))
plt.scatter(rules['support'], rules['confidence'], c=rules['lift'], alpha=0.9, cmap='YlOrRd');
plt.title('Rules distribution color mapped by lift');
plt.xlabel('Support')
plt.ylabel('Confidence')
plt.colorbar();

# MODELING
- RandomForest Regression

### RandomForest Regression
#### Observation:
- Build 6 customer-level features (NumberOrders, Unitprice, days_as_customer, days_since_last_purchase, NumberItems, OrderFrequency) and feed them to a regression model to see which features most influence the Revenue the company receives, so that marketing campaigns can be run accordingly to yield the highest profit.
- The diagram below indicates that NumberOrders and UnitPrice are the 2 most important factors in forming revenue.
- As discussed above, a higher price may trade off against NumberOrders, so the next step for the company is to run an A/B test to learn whether we should increase UnitPrice at the cost of a drop in NumberOrders, or vice versa.
- days_as_customer and days_since_last_purchase may not contribute much to telling whether a customer is loyal and brings the most revenue to us.

In [None]:
# df.InvoiceDate = pd.to_datetime(df.InvoiceDate, format="%m/%d/%Y %H:%M")
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

df['Revenue'] = df['Quantity']*df['UnitPrice']

In [None]:
invoice_ct = df.groupby(by='CustomerID', as_index=False)['InvoiceNo'].count()
invoice_ct.columns = ['CustomerID', 'NumberOrders']
invoice_ct

In [None]:
unitprice = df.groupby(by='CustomerID', as_index=False)['UnitPrice'].mean()
unitprice.columns = ['CustomerID', 'Unitprice']
unitprice

In [None]:
revenue = df.groupby(by='CustomerID', as_index=False)['Revenue'].sum()
revenue.columns = ['CustomerID', 'Revenue']
revenue

In [None]:
total_items = df.groupby(by='CustomerID', as_index=False)['Quantity'].sum()
total_items.columns = ['CustomerID', 'NumberItems']
total_items

In [None]:
earliest_order = df.groupby(by='CustomerID', as_index=False)['InvoiceDate'].min()
earliest_order

In [None]:
earliest_order.columns = ['CustomerID', 'EarliestInvoice']

In [None]:
earliest_order['now'] = pd.to_datetime((df['InvoiceDate']).max())

In [None]:
earliest_order

In [None]:
# == earliest_order['days_as_customer'] = 1 + (earliest_order.now-earliest_order.EarliestInvoice).dt.days
# Source: https://kite.com/python/docs/pandas.core.indexes.accessors.TimedeltaProperties
earliest_order['days_as_customer'] = 1 + (earliest_order['now']-earliest_order['EarliestInvoice']).dt.days

In [None]:
earliest_order.drop('now', axis=1, inplace=True)
earliest_order

In [None]:
# when was their last order and how long ago was that from the last date in file (presumably
# when the data were pulled)
last_order = df.groupby(by='CustomerID', as_index=False)['InvoiceDate'].max()
last_order.columns = ['CustomerID', 'last_purchase']
last_order['now'] = pd.to_datetime((df['InvoiceDate']).max())
last_order['days_since_last_purchase'] = 1 + (last_order['now'] - last_order['last_purchase']).dt.days
last_order.drop('now', axis=1, inplace=True)
last_order

In [None]:
#combine all the dataframes into one
import functools
dfs = [invoice_ct,unitprice,revenue,earliest_order,last_order,total_items]
CustomerTable = functools.reduce(lambda left,right: pd.merge(left,right,on='CustomerID', how='outer'), dfs)
CustomerTable['OrderFrequency'] = CustomerTable['NumberOrders']/CustomerTable['days_as_customer']
CustomerTable
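The chained outer merge used above can be illustrated on three tiny tables. The frames and values below are hypothetical; the point is that `how='outer'` keeps every CustomerID that appears in any table and leaves NaN where a feature is missing.

```python
import functools
import pandas as pd

# Hypothetical per-customer feature tables with partly different customers
orders = pd.DataFrame({"CustomerID": [1, 2], "NumberOrders": [3, 1]})
revenue = pd.DataFrame({"CustomerID": [1, 2, 3], "Revenue": [30.0, 5.0, 12.0]})
tenure = pd.DataFrame({"CustomerID": [1, 3], "days_as_customer": [10, 4]})

# Fold the list of tables into one, merging pairwise on CustomerID
table = functools.reduce(
    lambda left, right: pd.merge(left, right, on="CustomerID", how="outer"),
    [orders, revenue, tenure],
)
table["OrderFrequency"] = table["NumberOrders"] / table["days_as_customer"]
print(table)
```

Customers missing from any input table end up with NaN in the derived column, so such rows may need to be filled or dropped before fitting a model.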

In [None]:
CustomerTable.corr()['Revenue'].sort_values(ascending = False)

In [None]:
x = CustomerTable[['NumberOrders','Unitprice', 'days_as_customer', 'days_since_last_purchase', 'NumberItems', 'OrderFrequency']]
y = CustomerTable['Revenue']

#### Observation:
- NumberOrders and UnitPrice are the 2 most important factors in forming revenue.
- days_as_customer and days_since_last_purchase may not contribute much to telling whether a customer is loyal and brings the most revenue to us.

In [None]:
reg = RandomForestRegressor()
reg.fit(x.values, y)

#list(zip(x, reg.feature_importances_))
coef = pd.Series(reg.feature_importances_, index = x.columns)

imp_coef = coef.sort_values()
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Random Forest")

# RFM - Recency Frequency Monetary 

In [None]:
recency = df.groupby(by='CustomerID', as_index=False)['InvoiceDate'].max()
recency.columns = ['CustomerID', 'last_purchase']
recency['now'] = pd.to_datetime((df['InvoiceDate']).max())
recency['Recency'] = 1 + (recency['now'] - recency['last_purchase']).dt.days
recency.drop(['now','last_purchase'], axis=1, inplace=True)
recency.head()

In [None]:
# Frequency: the number of distinct invoices (transactions) per customer

frequency = df.copy()
frequency.drop_duplicates(subset=['CustomerID','InvoiceNo'], keep="first", inplace=True) 
frequency = frequency.groupby('CustomerID',as_index=False)['InvoiceNo'].count()
frequency.columns = ['CustomerID','Frequency']
frequency.head()

In [None]:
monetary=df.groupby('CustomerID',as_index=False)['Revenue'].sum()
monetary.columns = ['CustomerID','Monetary']
monetary.head()
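The three RFM measures built in the cells above can also be derived in a single grouped aggregation. The sketch below is one possible formulation on hypothetical transactions; the snapshot date is assumed to be the latest invoice date, as in the cells above.

```python
import pandas as pd

# Hypothetical transaction lines; invoice "a" has two lines
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["a", "a", "b", "c"],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-01", "2011-11-01", "2011-11-20", "2011-11-28"]
    ),
    "Revenue": [10.0, 5.0, 20.0, 7.5],
})

snapshot = tx["InvoiceDate"].max()  # assumed "now": the last date in the data

rfm_toy = tx.groupby("CustomerID").agg(
    last_purchase=("InvoiceDate", "max"),
    Frequency=("InvoiceNo", "nunique"),   # distinct invoices, not lines
    Monetary=("Revenue", "sum"),
)
rfm_toy["Recency"] = 1 + (snapshot - rfm_toy["last_purchase"]).dt.days
rfm_toy = rfm_toy.drop(columns="last_purchase")
print(rfm_toy)
```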

In [None]:
dfs = [recency, frequency, monetary]
rfm = functools.reduce(lambda left,right: pd.merge(left,right,on='CustomerID', how='outer'), dfs)

In [None]:
rfm

In [None]:
#bring all the quartile value in a single dataframe
rfm_segmentation = rfm.copy()

In [None]:
rfm_segmentation

In [None]:
from sklearn.cluster import KMeans
SSE_to_nearest_centroid = []

for k in range(1,15):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(rfm_segmentation)
    SSE_to_nearest_centroid.append(kmeans.inertia_)

plt.figure(figsize=(20,8))
plt.plot(range(1,15),SSE_to_nearest_centroid,"-o")
plt.title("SSE / K Chart", fontsize=18)
plt.xlabel("Amount of Clusters",fontsize=14)
plt.ylabel("Inertia (Sum of Squared Distances)",fontsize=14)
plt.xticks(range(1,15))
plt.grid(True)
plt.show()

In [None]:
# Fit K-Means with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=0).fit(rfm_segmentation)

# this creates a new column called cluster which has cluster number for each row respectively.
rfm_segmentation['cluster'] = kmeans.labels_
rfm_segmentation.head()

#### Observation
- Recency graph: cluster 0 has high recency, which is bad, while clusters 1 and 2 have low recency, so they compete for platinum and gold customer status.
- Frequency graph: clusters 1 and 2 have low values, so they compete for platinum and gold customer status on the frequency metric.
- Monetary graph: cluster 1 has the highest Monetary value (money spent) and is platinum, whereas cluster 2 is at a medium level (gold) and cluster 0 is the silver customer.

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x=rfm_segmentation['cluster'], y=rfm_segmentation['Recency'])

plt.figure(figsize=(8,5))
sns.boxplot(x=rfm_segmentation['cluster'], y=rfm_segmentation['Frequency'])

plt.figure(figsize=(8,5))
sns.boxplot(x=rfm_segmentation['cluster'], y=rfm_segmentation['Monetary'])

### A more granular level of analysis

In [None]:
quantile = rfm.quantile(q=[0.25,0.5,0.75])
quantile

In [None]:
# The lower the recency, the better for the store
def RScore(x):
    if x <= quantile['Recency'][0.25]:
        return 1
    elif x <= quantile['Recency'][0.50]:
        return 2
    elif x <= quantile['Recency'][0.75]: 
        return 3
    else:
        return 4

# Higher frequency and monetary values indicate a better customer
def FScore(x):
    if x <= quantile['Frequency'][0.25]:
        return 4
    elif x <= quantile['Frequency'][0.50]:
        return 3
    elif x <= quantile['Frequency'][0.75]: 
        return 2
    else:
        return 1

def MScore(x):
    if x <= quantile['Monetary'][0.25]:
        return 4
    elif x <= quantile['Monetary'][0.50]:
        return 3
    elif x <= quantile['Monetary'][0.75]: 
        return 2
    else:
        return 1
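The three scoring functions above share one pattern, which can be sketched as a single helper. The name `quartile_score`, its parameters and the cutoff values are assumptions made for this illustration; the logic is meant to match the `<=` comparisons used above.

```python
from bisect import bisect_left

def quartile_score(value, cutoffs, higher_is_better):
    """Score a value from 1 (best) to 4 (worst) given its 25/50/75% cutoffs.

    bisect_left returns the index of the first cutoff that is >= value,
    which is the bucket chosen by the chain of `value <= cutoff` tests.
    """
    bucket = bisect_left(cutoffs, value)   # 0, 1, 2 or 3
    return 4 - bucket if higher_is_better else bucket + 1

# Hypothetical quartile cutoffs
recency_cutoffs = [10, 50, 100]
frequency_cutoffs = [1, 2, 5]

print(quartile_score(11, recency_cutoffs, higher_is_better=False))
print(quartile_score(6, frequency_cutoffs, higher_is_better=True))
```

Recency uses `higher_is_better=False` because a recent purchase (a small number of days) is the desirable case, while Frequency and Monetary use `True`.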

In [None]:
rfm_segmentation

In [None]:
rfm_segmentation['R_quartile'] = rfm_segmentation['Recency'].apply(RScore)
rfm_segmentation['F_quartile'] = rfm_segmentation['Frequency'].apply(FScore)
rfm_segmentation['M_quartile'] = rfm_segmentation['Monetary'].apply(MScore)

In [None]:
rfm_segmentation

In [None]:
# Approach 1: concatenate the three quartile scores into a detailed customer profile;
# for example, 121 and 112 are different profiles.
rfm_segmentation['RFMScore'] = rfm_segmentation['R_quartile'].astype(str) \
                               + rfm_segmentation['F_quartile'].astype(str) \
                               + rfm_segmentation['M_quartile'].astype(str)

In [None]:
# Approach 2: sum the three quartile scores into a more general customer profile;
# for example, 121 and 112 both give a total of 4.
rfm_segmentation['TotalScore'] = rfm_segmentation['R_quartile'] \
                               + rfm_segmentation['F_quartile'] \
                               + rfm_segmentation['M_quartile']
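
The difference between the two approaches can be seen on two hypothetical customers. This is a small sketch with an invented `scores` frame that reuses the column names from the cells above.

```python
import pandas as pd

# Two hypothetical customers with quartile scores 1-2-1 and 1-1-2.
scores = pd.DataFrame({
    'R_quartile': [1, 1],
    'F_quartile': [2, 1],
    'M_quartile': [1, 2],
})

# Approach 1: concatenating the digits keeps the order of R, F and M.
scores['RFMScore'] = (scores['R_quartile'].astype(str)
                      + scores['F_quartile'].astype(str)
                      + scores['M_quartile'].astype(str))

# Approach 2: summing discards which dimension contributed which score.
scores['TotalScore'] = scores[['R_quartile', 'F_quartile', 'M_quartile']].sum(axis=1)

print(scores[['RFMScore', 'TotalScore']])
```

With four levels per dimension, `RFMScore` can take up to 64 distinct values, while `TotalScore` only ranges from 3 to 12.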

In [None]:
print("Best Customers: ",len(rfm_segmentation[rfm_segmentation['RFMScore']=='111']))
print('Loyal Customers: ',len(rfm_segmentation[rfm_segmentation['F_quartile']==1]))
print("Big Spenders: ",len(rfm_segmentation[rfm_segmentation['M_quartile']==1]))
print('Almost Lost: ', len(rfm_segmentation[rfm_segmentation['RFMScore']=='134']))
print('Lost Customers: ',len(rfm_segmentation[rfm_segmentation['RFMScore']=='344']))
print('Lost Cheap Customers: ',len(rfm_segmentation[rfm_segmentation['RFMScore']=='444']))

Image(url= "https://i.imgur.com/YmItbbm.png?")

In [None]:
rfm_segmentation.sort_values(by=['RFMScore', 'Monetary'], ascending=[True, False])

In [None]:
rfm_segmentation.groupby('RFMScore')['Monetary'].mean()

In [None]:
Score_Recency = rfm_segmentation.groupby('TotalScore')['Recency'].mean().reset_index()
Score_Monetatry = rfm_segmentation.groupby('TotalScore')['Monetary'].mean().reset_index()
Score_Frequency = rfm_segmentation.groupby('TotalScore')['Frequency'].mean().reset_index()

### Observation:
- Based on Recency, total scores 10, 11 and 12 have the highest mean values. This is what the scoring should produce, because these totals come from combinations such as 444, 434 and 334.
- Based on Frequency, total scores 3, 4 and 5 have the highest mean values. This is what the scoring should produce, because these totals come from combinations such as 111, 121 and 122.
- Based on Monetary, total scores 3, 4 and 5 have the highest mean values. This is what the scoring should produce, because these totals come from combinations such as 111, 121 and 122.

In [None]:
sns.barplot(x=Score_Recency['TotalScore'],y=Score_Recency['Recency'])

plt.figure(constrained_layout=True, figsize=(12, 4))

plt.subplot(1,2,1)
sns.barplot(x=Score_Frequency['TotalScore'],y=Score_Frequency['Frequency'])

plt.subplot(1,2,2)
sns.barplot(x=Score_Monetatry['TotalScore'],y=Score_Monetatry['Monetary'])
# Spacing between the two subplots is handled by constrained_layout=True above.

# COHORT ANALYSIS
- Cohort analysis is a subset of behavioral analytics that takes the data from a given eCommerce platform, web application, or online game and, rather than looking at all users as one unit, breaks them into related groups for analysis.
- A cohort is a group of users who share a common characteristic. For example, all users with the same acquisition date belong to the same cohort.
- The retention rate shows the percentage of customers who return in the months following their first purchase.
- Customer acquisition is so expensive that we have to remarket to existing clients to retain them. If the retention rate is low, we have to spend more of the budget on acquiring new customers.

In [None]:
def get_month(x): 
    return dt.datetime(x.year, x.month, 1)

In [None]:
df['InvoiceMonth'] = df['InvoiceDate'].apply(get_month)

In [None]:
# https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object
# explains the difference between apply and transform. Here transform is used so that every row gets its CohortMonth.
# CohortMonth: the month in which a customer first purchased from our retail store.
df['CohortMonth'] = df.groupby('CustomerID')['InvoiceMonth'].transform('min')
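
To illustrate why `transform` fits here, the sketch below compares a plain aggregation with `transform('min')` on an invented three-row frame `toy`; only the column names come from the notebook.

```python
import pandas as pd

# Toy purchases (invented): customer 1 buys in Dec and Feb, customer 2 in Jan only.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceMonth': pd.to_datetime(['2010-12-01', '2011-02-01', '2011-01-01']),
})

# An aggregation returns one row per customer...
first_month = toy.groupby('CustomerID')['InvoiceMonth'].min()

# ...whereas transform returns one value per original row, aligned to toy's index,
# so it can be assigned straight back as a new column.
toy['CohortMonth'] = toy.groupby('CustomerID')['InvoiceMonth'].transform('min')

print(len(first_month), len(toy['CohortMonth']))
print(toy)
```

Both rows of customer 1 receive the same `CohortMonth`, which is what the cohort index calculation needs.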

In [None]:
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

In [None]:
invoice_year, invoice_month, _ = get_date_int(df, 'InvoiceMonth')
cohort_year, cohort_month, _ = get_date_int(df, 'CohortMonth')

years_diff = invoice_year - cohort_year
months_diff = invoice_month - cohort_month

df['CohortIndex'] = years_diff * 12 + months_diff + 1

df.head()
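
The `CohortIndex` arithmetic can be checked by hand on single dates. The sketch below wraps the same formula in a hypothetical scalar helper, `cohort_index`, which is not part of the notebook.

```python
import pandas as pd

def cohort_index(invoice_month, cohort_month):
    """Months elapsed since the cohort month, counting the first month as 1."""
    years_diff = invoice_month.year - cohort_month.year
    months_diff = invoice_month.month - cohort_month.month
    return years_diff * 12 + months_diff + 1

first = pd.Timestamp('2010-12-01')
same_month = cohort_index(pd.Timestamp('2010-12-01'), first)
two_months_later = cohort_index(pd.Timestamp('2011-02-01'), first)

print(same_month)        # purchase in the cohort month itself
print(two_months_later)  # the year rolls over, so months_diff is negative
```

For the February 2011 purchase, `years_diff * 12` is 12 and `months_diff` is 2 - 12 = -10, so the index is 12 - 10 + 1 = 3.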

In [None]:
# Group customers by cohort and count the distinct customers per CohortMonth and CohortIndex.
cohort_data = df.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().reset_index()
# Convert CohortMonth to plain dates to avoid a problem when plotting the heatmap below.
cohort_data['CohortMonth'] = cohort_data['CohortMonth'].dt.date
cohort_counts = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='CustomerID')

### Observation:
- The CohortMonth 2010-12-01 row shows 949 distinct customers in the month they first came (CohortIndex 1).
- In the following month (CohortIndex 2), 363 of those customers purchased again, and so on.

In [None]:
cohort_counts

#### Observation:
- The table shows the customer counts as a percentage of each cohort's initial size.

In [None]:
cohort_sizes = cohort_counts.iloc[:,0]
retention = cohort_counts.divide(cohort_sizes, axis=0)
retention.round(2) * 100
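
The row-wise division can be seen on a small example. This is a sketch on an invented `toy_counts` table shaped like `cohort_counts`; the numbers are made up.

```python
import pandas as pd

# Toy cohort counts (invented numbers): rows are cohorts, columns are CohortIndex.
toy_counts = pd.DataFrame(
    {1: [100, 50], 2: [40, 10], 3: [25, float('nan')]},
    index=['2010-12', '2011-01'],
)

# Cohort size is the first column; axis=0 divides each row by its own cohort size.
toy_sizes = toy_counts.iloc[:, 0]
toy_retention = toy_counts.divide(toy_sizes, axis=0)

print(toy_retention)
```

The first column is always 1.0, and months that a cohort has not reached yet stay `NaN`. With the counts quoted in the earlier observation, the second-month retention of the first cohort is 363 / 949, or roughly 38%.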

In [None]:
plt.figure(figsize=(15, 8))
plt.title('Retention rates')
sns.heatmap(data = retention,
            annot = True,
            fmt = '.0%',
            vmin = 0.0, vmax = 0.5,
            cmap = 'BuGn')
plt.show()