## This notebook aims to answer the following questions:
- What is the distribution of sales across geographical regions, and are there specific products that drive revenue or profit, especially where the customer share is low?
- What are the most ordered products in general?
- Can we find mutually dependent buying patterns (association rule mining)?
- How does the company perform over time? Is it gaining customers or revenue?
- What are the possible ways to segment customers for efficient targeting and marketing?
- Are there efficient ways to manage inventory?


In [None]:
import numpy as np
import pandas as pd 
import os

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

from sklearn.cluster import KMeans

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
# specify encoding to deal with different formats
df = pd.read_csv('../input/ecommerce-data/data.csv', encoding = 'ISO-8859-1')
print(df)


In [None]:
print(df.dtypes)

Some attributes of the data

```
InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric, Product price per unit in sterling.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.
```

# Data cleaning

- Before going further: I took a lot of interesting ideas, especially on data cleaning, from the notebook "https://www.kaggle.com/anmoltripathi/complete-e-commerce-analysis", and some of them are used here

In [None]:
df.describe()

- As we can see, some of the numerical fields are negative. This could be due to erroneous values or refunds
- This can be checked by looking for duplicate invoice numbers or cancellations
- The letter C in the invoice number indicates a cancellation. Let's check that.

In [None]:
# Flag invoices whose number contains the letter C (either case), which marks a cancellation
is_cancellation = df["InvoiceNo"].astype(str).str.contains("c", case=False)

print("No of cancellations: ", is_cancellation.sum())

In [None]:
df['InvoiceDate']= pd.to_datetime(df['InvoiceDate'])

df.reset_index(drop=True, inplace=True)
neg_df = df[df["Quantity"] < 0]
pos_df = df[df["Quantity"] >= 0]

print("No of total negative quantity: ",len(neg_df))

In [None]:
neg_df.head(50)

Cancellations could account for the negative quantities. Let's check for refunds. A negative line is treated as a refund when
- The returned quantity is no larger than the quantity bought
- The return happens after the purchase date
- The customer ID and product description are the same
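
The criteria above can also be checked without a row-by-row loop. The following is a minimal sketch on made-up toy data, assuming the same column names as the dataset; the `toy`, `pairs` and `refunds` names are illustrative only.

```python
import pandas as pd

# Toy data (made up for illustration); column names follow the dataset.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2],
    "Description": ["MUG", "MUG", "PEN", "PEN"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-05", "2011-02-01", "2011-01-20"]
    ),
    "Quantity": [10, -4, 5, -5],
})

purchases = toy[toy["Quantity"] > 0]
returns = toy[toy["Quantity"] < 0]

# Pair every return with purchases of the same item by the same customer.
pairs = returns.merge(
    purchases, on=["CustomerID", "Description"], suffixes=("_ret", "_buy")
)

# Keep pairs where the return is later than, and no larger than, the purchase.
is_refund = (pairs["InvoiceDate_ret"] > pairs["InvoiceDate_buy"]) & (
    pairs["Quantity_ret"].abs() <= pairs["Quantity_buy"]
)
refunds = pairs[is_refund]
print(refunds[["CustomerID", "Description", "Quantity_ret", "Quantity_buy"]])
```

On the full dataset such a merge can create many pairs per customer and item, so it may need to be restricted to a subset first.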

In [None]:
for j in neg_df.index:
    # Earlier purchases of the same item by the same customer, at least as large as the return
    for_ref = pos_df[(pos_df["CustomerID"] == neg_df["CustomerID"].loc[j])
                     & (pos_df["Description"] == neg_df["Description"].loc[j])
                     & (pos_df["InvoiceDate"] < neg_df["InvoiceDate"].loc[j])
                     & (pos_df["Quantity"] >= abs(neg_df["Quantity"].loc[j]))]

    if len(for_ref) > 0:
        print("Refunds exist")
        break



### Let's check which items contribute to the most refunds

In [None]:
df_neg = neg_df[['Quantity','Description']].groupby("Description").sum()
df_neg["Quantity"] = abs(df_neg["Quantity"])

plt.figure(figsize=(10,10))
df_neg = df_neg.nlargest(10,columns="Quantity")
plt.title("Top 10 items that were refunded")
plt.pie(df_neg["Quantity"].to_list(),labels=df_neg.index,autopct='%1.1f%%')
plt.show()

- Paper craft little birdie along with medium ceramic top storage jar are the most refunded
- Refunded item details will not be deleted from the dataframe as they are vital later for correctly finding the revenue retention
- Additionally there are items mentioned as unsaleable, destroyed, printing smudges/thrown away which don't point to a specific item. These can be checked with stockcode

In [None]:
print(df[df.Description=="printing smudges/thrown away"])

print(df[df.Description=="Unsaleable, destroyed."])

print(df[df.Description=="check"])

- Since, they lack a CustomerID, they could be refunds at the company level i.e. refunds during purchase from the supplier
- Let's delete null values and find correctly the items that were refunded

In [None]:
neg_df = neg_df.copy()  # work on a copy so the assignments below do not touch a slice of df
neg_df["Description"] = neg_df["Description"].astype(str)
neg_df["Description"] = neg_df["Description"].apply(lambda x:np.nan if x.lower() == "nan" else x)
neg_df.dropna(inplace=True)

df_neg = neg_df[['Quantity','Description']].groupby("Description").sum()
df_neg["Quantity"] = abs(df_neg["Quantity"])

plt.figure(figsize=(10,10))
df_neg = df_neg.nlargest(10,columns="Quantity")
plt.title("Top 10 items that were refunded")
plt.pie(df_neg["Quantity"].to_list(),labels=df_neg.index,autopct='%1.1f%%')
plt.show()

- Null values need to be deleted later
- What's manual?

In [None]:
print(df[df.Description=="Manual"])

print(df[(df.InvoiceNo=="536569") & (df.CustomerID==16274)])

- Analysing the stock code later can be helpful

- Let's check for negative price

In [None]:
neg_price = df[df["UnitPrice"]<0]
print(neg_price)

- Since this does not count as a customer, let's delete it 

In [None]:
df = df[df["UnitPrice"]>0]

- Remove possible duplicates and check for null values

In [None]:
df = df.drop_duplicates(keep='first')
df.isnull().sum()

- From above, some descriptions are NaN, but they don't appear here.
- Maybe they are text, let's find them and convert them to null values
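
As a minimal sketch of the idea, the snippet below turns textual "nan" placeholders into real null values on a made-up Series. The sample values and the `is_nan_text` name are assumptions for illustration; the `isinstance` check keeps genuine missing values from raising an error.

```python
import numpy as np
import pandas as pd

# Made-up sample: one real description, three textual placeholders, one real null.
s = pd.Series(["WHITE MUG", "nan", "NaN", None, " nan "])

# True only for strings that read "nan" once trimmed and lower-cased.
is_nan_text = s.apply(lambda x: isinstance(x, str) and x.strip().lower() == "nan")

# Replace the flagged entries with a real null value.
cleaned = s.mask(is_nan_text, np.nan)
print(cleaned.isna().sum())
```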

In [None]:
df["Description"] = df["Description"].apply(lambda x:np.nan if isinstance(x, str) and x.lower() == "nan" else x)
print(df.isnull().sum()) # Null descriptions seem to be gone once zero-priced rows are removed

In [None]:
print(df[df["CustomerID"].isna()])

- These are probably customers whose details were not recorded
- To keep the analysis consistent, rows with null values are dropped throughout
- Additionally, some stock codes are shorter than 5 characters; they seem to be some kind of delivery charge

In [None]:
# Convert the string "nan" to a real null value
columns = ["InvoiceNo","StockCode","Country","Description"]
for column in columns:
    print(column)
    df[column] = df[column].apply(lambda x:np.nan if isinstance(x, str) and x.lower() == "nan" else x)

df.dropna(inplace=True)
print(df)

Let's look at the stock codes, especially those shorter than 5 characters

In [None]:
df_sc = df[df.StockCode.str.len()<5]
print(df_sc["StockCode"].value_counts())

In [None]:
# Print the first description recorded for each short stock code
for code in ["POST", "DOT", "M", "C2", "D", "S", "CRUK", "PADS", "B"]:
    print(code, df_sc[df_sc.StockCode == code]["Description"].head(1))

- These seem to be additional charges for delivery and other services. It is better to remove them, as they would inflate the revenue figures (discounts, code D, are kept)
- Before that, let's check for stock codes longer than 5 characters

In [None]:
df_sc = df[(df.StockCode.str.len()>5)]
print(df_sc.head(2))
df_sc = df[(df.StockCode.str.len()>6)]
print(df_sc.head(2))
df_sc = df[(df.StockCode.str.len()>7)]
print(df_sc.head(2))

- Delete the additional charges and bank charges; bank charges are probably transaction fees and would inflate the revenue the company actually earns from customers

In [None]:
df = df[~((df.StockCode.str.len()<5) & (df.StockCode!="D"))]
df = df[df.StockCode.str.lower() !="bank charges"]

- Let's check the stock codes further. Is each stock code associated with a single product description?

In [None]:
stock_code_df = df[["StockCode","Description"]].groupby("StockCode")["Description"].nunique()
stock_code_df = stock_code_df.sort_values(ascending=False)
plt.figure(figsize=(20,10))
stock_code_df.head(100).plot(kind='bar')

In [None]:
stockcode = df[["StockCode","Description"]].groupby("StockCode")["Description"].unique()
print("Product for stock code 23196: " + stockcode.at["23196"])

print("Product for stock code 23236: " + stockcode.at["23236"])

print("Product for stock code 21243: " + stockcode.at["21243"])

- A stock code seems to cover similar items from different suppliers, or the same item with different spellings
- So for some of the analysis it is advisable to work with stock codes rather than product descriptions
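
If a readable name per stock code is still needed for plots, one option is to pick the most frequent description for each code. This is only a sketch on made-up data; the `CanonicalDescription` column is a hypothetical name, not part of the dataset.

```python
import pandas as pd

# Made-up rows: stock code A1 appears with two spellings of the same product.
toy = pd.DataFrame({
    "StockCode": ["A1", "A1", "A1", "B2"],
    "Description": ["RED MUG", "RED MUG", "MUG RED", "BLUE PEN"],
})

# Most frequent description for each stock code.
canonical = toy.groupby("StockCode")["Description"].agg(
    lambda d: d.value_counts().idxmax()
)

# Attach it to every row (hypothetical helper column).
toy["CanonicalDescription"] = toy["StockCode"].map(canonical)
print(toy)
```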

# Study of distribution of sales and customers for each country

In [None]:
df["Individual Revenue"] = df["Quantity"]*df["UnitPrice"]

# Group by geographical distribution
df_revenue = df[["Country","Individual Revenue"]].groupby("Country").sum()
df_customers = df[["Country","CustomerID"]].groupby("Country").nunique()
df_customers.rename(columns ={"CustomerID":"Number of customers"},inplace=True)

In [None]:
df_revenue["% of total revenue"] = df_revenue["Individual Revenue"]*100/np.sum(df_revenue["Individual Revenue"])
df_revenue.reset_index(inplace=True)

df_customers["% of total customers"] = df_customers["Number of customers"]*100/np.sum(df_customers["Number of customers"])
df_customers.reset_index(inplace=True)

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2,sharey = True, figsize=(20,10))

ax[0].title.set_text('% of total revenue by country')
g1 = sns.barplot(x= "% of total revenue",y= "Country",
                data=df_revenue,
                ax=ax[0])

ax[1].title.set_text('% of total customers by country')
g2 = sns.barplot(x= "% of total customers",y= "Country",
                data=df_customers,
                ax=ax[1])



## Products which are profitable
- United Kingdom has the highest share of customers and revenue
- Netherlands and EIRE have a low share of customers but a comparatively high share of revenue, whereas Germany's shares of revenue and customers are roughly equal
- This could be due to customers preferring higher-priced products. It is essential to identify these products, as they can drive revenue and profitability even with a low customer share
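
One way to put a number on this comparison is revenue per customer by country. The sketch below uses made-up figures and the dataset's column names; the `summary` frame and its column names are illustrative assumptions.

```python
import pandas as pd

# Made-up figures: few customers with high spend versus many with low spend.
toy = pd.DataFrame({
    "Country": ["UK", "UK", "UK", "Netherlands", "Netherlands"],
    "CustomerID": [1, 2, 3, 4, 4],
    "Individual Revenue": [10.0, 20.0, 30.0, 100.0, 80.0],
})

summary = toy.groupby("Country").agg(
    revenue=("Individual Revenue", "sum"),
    customers=("CustomerID", "nunique"),
)

# A high value flags countries where few customers drive a lot of revenue.
summary["revenue_per_customer"] = summary["revenue"] / summary["customers"]
print(summary.sort_values("revenue_per_customer", ascending=False))
```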


In [None]:
plt.figure(figsize=(20,20))
sns.violinplot(x="UnitPrice", y="Country",
                    data=df[df["Individual Revenue"]>0],order=["EIRE","Netherlands", "Germany"],
                    )
plt.title("Violin plot of unit price with respect to the selected country (not for refunds)")
plt.show()

- Similar preferences in Netherlands, EIRE and Germany for sale of goods. However, occasionally higher priced goods are bought in Netherlands and EIRE
- Let's take this one step further to check if high priced goods are mostly seasonal or not

In [None]:
df["Seasonality"] = "Non seasonal"

# November, December are the seasonal months for sales

df.loc[df.InvoiceDate.dt.month.isin([11,12]),"Seasonality"] = "Seasonal"

plt.figure(figsize=(20,20))
sns.violinplot(x="UnitPrice", y="Country",
                    data=df[df["Individual Revenue"]>0],order=["EIRE","Netherlands", "Germany"],
                    hue="Seasonality",split=True)
plt.title("Violin plot of unit price with respect to the selected country considering the seasonality (not for refunds)")
plt.show()

- These products have an almost equitable distribution between seasonal and non-seasonal months
- In order to find these products or stock codes, we can implement K-means to segment these products
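
K-means assigns cluster numbers arbitrarily, so the clustering cells in this notebook relabel them so that label 0 has the smallest centre. The sketch below shows that relabelling step on hand-written centres and labels (made-up values, no model fitting) to make the lookup-table trick easier to follow.

```python
import numpy as np

# Hand-written stand-ins for kmeans.cluster_centers_ and kmeans.labels_.
centers = np.array([[50.0], [2.0], [10.0]])
labels = np.array([0, 1, 2, 1, 0])
k = len(centers)

# Cluster ids sorted by centre value: cluster 1 is smallest, then 2, then 0.
idx = np.argsort(centers.sum(axis=1))

# Lookup table: old cluster id -> rank of its centre.
lut = np.zeros_like(idx)
lut[idx] = np.arange(k)

ordered = lut[labels]
print(ordered)
```

After this step a higher label always means a higher centre, which is what makes the box plots per label comparable.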

In [None]:
df_price_analysis = df[(df["Individual Revenue"]>0) & (df['StockCode']!="D")].copy()  # copy so a label column can be added later

# Find optimal number of clusters

Sum_of_squared_distances_Price = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_price_analysis["UnitPrice"].values.reshape(-1,1))
    Sum_of_squared_distances_Price.append(km.inertia_)
    
plt.plot(K, Sum_of_squared_distances_Price, 'bx-')
plt.title('Elbow Method For Optimal k - Unit Price')
plt.show()

In [None]:
# Optimal number of clusters
k=6

Cluster_labels = []
kmeans = KMeans(n_clusters=k).fit(df_price_analysis["UnitPrice"].values.reshape(-1,1))
# Order kmeans in sequential order
# https://stackoverflow.com/questions/44888415/how-to-set-k-means-clustering-labels-from-highest-to-lowest-with-python
idx = np.argsort(kmeans.cluster_centers_.sum(axis=1))
lut = np.zeros_like(idx)
lut[idx] = np.arange(k)

df_price_analysis["Labels_UnitPrice"] = lut[kmeans.labels_]

In [None]:
sns.catplot(x="Labels_UnitPrice", y="UnitPrice", kind="box", data=df_price_analysis)
plt.show()

In [None]:
df_high_share = df_price_analysis[(df_price_analysis["Labels_UnitPrice"]>=3) & (df_price_analysis["Country"]!="United Kingdom")]

fig = px.pie(df_high_share, values='Quantity', names="Country",title="Distribution of high priced products sold with respect to different countries (apart from UK)")
fig.update_layout(title_x=0.5)
fig.show()

- Let's find the products which belong to this range which are driving the sales in Netherlands and EIRE

In [None]:
fig = px.treemap(df_high_share[(df_high_share["Country"]=="EIRE") | (df_high_share["Country"]=="Netherlands")], \
                                path=['Description'], values = "Quantity", title="Treemap of the most profitable products according to their quantity for Netherlands and EIRE")
fig.data[0].hovertemplate = '%{label}<br>%{value}'
fig.update_layout(title_x=0.5)
fig.show()

### Top 10 selling products in Netherlands, EIRE, UK and Germany in terms of quantity

In [None]:
df_neth = df[df.Country=="Netherlands"]
df_EIRE = df[df.Country=="EIRE"]
df_Germany = df[df.Country=="Germany"]
df_UK = df[df.Country=="United Kingdom"]

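# Sum only Quantity; summing every column can fail on non-numeric columns in recent pandas versions
df_neth = df_neth.groupby("Description")[["Quantity"]].sum()
df_EIRE = df_EIRE.groupby("Description")[["Quantity"]].sum()
df_Germany = df_Germany.groupby("Description")[["Quantity"]].sum()
df_UK = df_UK.groupby("Description")[["Quantity"]].sum()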

In [None]:
top_sellers = {"Netherlands": df_neth, "EIRE": df_EIRE, "Germany": df_Germany, "UK": df_UK}

# Plot the ten best-selling items (by quantity) for each country
for country, df_country in top_sellers.items():
    top10 = df_country.nlargest(10, columns="Quantity").reset_index()
    plt.figure(figsize=(20,20))
    plt.title("Top 10 selling items in " + country)
    sns.barplot(x="Quantity", y="Description", data=top10, palette='dark')
    plt.show()

## Association rule mining
- Let's find any interesting relation between products for the top 4 revenue generating countries
- Reference - https://pbpython.com/market-basket-analysis.html
- https://www.kaggle.com/datatheque/association-rules-mining-market-basket-analysis
- https://towardsdatascience.com/association-rules-2-aa9a77241654
- https://www.kaggle.com/anmoltripathi/complete-e-commerce-analysis


In [None]:
df_basket = df[df["Individual Revenue"]>0]
basket_Netherlands = df_basket[df_basket['Country']=="Netherlands"].groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')
basket_EIRE = df_basket[df_basket['Country']=="EIRE"].groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')
basket_Germany = df_basket[df_basket['Country']=="Germany"].groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')
basket_UK = df_basket[df_basket['Country']=="United Kingdom"].groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')

In [None]:
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

basket_sets_Neth = basket_Netherlands.applymap(encode_units)
basket_sets_EIRE = basket_EIRE.applymap(encode_units)
basket_sets_Germany = basket_Germany.applymap(encode_units)
basket_sets_UK = basket_UK.applymap(encode_units)

- Market basket analysis for Netherlands

In [None]:
frequent_itemsets = apriori(basket_sets_Neth, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift")
rules.sort_values(['lift','support'],ascending=False,inplace=True)
rules.head(10)

- Apart from an indication that FOLDING BUTTERFLY MIRROR is bought in bulk, probably from a large bulk supplier, 10 COLOUR SPACEBOY PEN seems to be commonly associated with other supplies, as indicated by its high lift and its consequents when it is the antecedent
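
For readers unfamiliar with the metrics in the rules table, here is a small hand computation of support, confidence and lift for one rule. The baskets and item names are made up; the definitions are the standard ones for association rules.

```python
# Made-up baskets; each set is one invoice.
baskets = [
    {"pen", "notebook"},
    {"pen", "notebook", "mirror"},
    {"pen"},
    {"mirror"},
]


def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


antecedent, consequent = {"pen"}, {"notebook"}

rule_support = support(antecedent | consequent)   # both bought together
confidence = rule_support / support(antecedent)   # P(consequent | antecedent)
lift = confidence / support(consequent)           # > 1 means bought together more than by chance

print(rule_support, confidence, lift)
```

A lift above 1 says the consequent is more likely once the antecedent is in the basket, which is why the rules are sorted by lift first.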

- Market basket analysis for EIRE

In [None]:
frequent_itemsets = apriori(basket_sets_EIRE, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift")
rules.sort_values(['lift','support'],ascending=False,inplace=True)
rules.head(10)

- Market basket analysis for Germany

In [None]:
frequent_itemsets = apriori(basket_sets_Germany, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift")
rules.sort_values(['lift','support'],ascending=False,inplace=True)
rules.head(10)

- Market basket analysis for UK

In [None]:
frequent_itemsets = apriori(basket_sets_UK, min_support=0.02, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift")
rules.sort_values(['lift','support'],ascending=False,inplace=True)
rules.head(10)

- In the data from Germany, UK and EIRE, there are indications of bulk purchases of similar types of goods.
- This suggests that they are probably being bought by wholesale retailers

# Analyse behaviour of revenue with time with a cohort analysis
- As the customer list contains some ambiguities with respect to refunds, we do the cohort analysis on revenue, where negative sales values net out when summed
- As marketing varies between geographical regions, let's perform the cohort analysis for the UK alone (the top revenue generator)
- http://www.gregreda.com/2015/08/23/cohort-analysis-with-python/
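
Before the full analysis, a compact sketch of the idea on made-up data may help: each customer is assigned to the month of their first purchase (the cohort), and later revenue is expressed relative to the cohort's first-month revenue. For brevity this sketch keeps calendar months as columns instead of numbering the periods; the toy values are assumptions.

```python
import pandas as pd

# Made-up orders for three customers.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-10", "2011-02-03", "2011-01-20", "2011-03-15", "2011-02-07"]
    ),
    "Individual Revenue": [100.0, 50.0, 200.0, 20.0, 80.0],
})

toy["OrderPeriod"] = toy["InvoiceDate"].dt.strftime("%Y-%m")
# Cohort = month of the customer's first purchase.
toy["CohortGroup"] = (
    toy.groupby("CustomerID")["InvoiceDate"].transform("min").dt.strftime("%Y-%m")
)

# Revenue per cohort and order month.
revenue = toy.pivot_table(
    index="CohortGroup", columns="OrderPeriod",
    values="Individual Revenue", aggfunc="sum",
)

# Revenue of each cohort in its own first month.
first = (
    toy[toy["OrderPeriod"] == toy["CohortGroup"]]
    .groupby("CohortGroup")["Individual Revenue"].sum()
)

retention = revenue.div(first, axis=0)
print(retention)
```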

In [None]:
# Label each order with its year-month period (InvoiceDate is already a datetime)
df['OrderPeriod'] = df.InvoiceDate.apply(lambda x: x.strftime('%Y-%m'))
df = df[df.Country=="United Kingdom"]

df.set_index('CustomerID', inplace=True)
df['CohortGroup'] = df.groupby(['CustomerID'])['InvoiceDate'].min().apply(lambda x: x.strftime('%Y-%m'))
df.reset_index(inplace=True)
df.drop(columns=["Country","UnitPrice"],inplace=True)

In [None]:
cohorts = df.groupby(['CohortGroup', 'OrderPeriod']).agg({'CustomerID': pd.Series.nunique,"Individual Revenue":'sum'})
cohorts.rename(columns={'CustomerID': 'Total Users', "Individual Revenue":'Total revenue'}, inplace=True)

In [None]:
def cohort_period(df):
    """
    Number of periods after the first purchase for each cohort group
    
    """
    df['CohortPeriod'] = np.arange(len(df)) + 1
    return df

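# group_keys=False keeps the original index, so CohortGroup is not added a second time
cohorts = cohorts.groupby(level=0, group_keys=False).apply(cohort_period)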
cohorts.head(20)

In [None]:
# reindex the DataFrame
cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)

# Series holding the first-period revenue of each cohort group
cohort_revenue = cohorts['Total revenue'].groupby(level=0).first()
print(cohort_revenue.head())

In [None]:
cohorts.drop(columns = ["OrderPeriod"],inplace=True)
cohorts.reset_index(inplace=True)

# For total revenue
cohorts_customer_revenue = cohorts.pivot(index="CohortPeriod",columns="CohortGroup",values="Total revenue")
print(cohorts_customer_revenue.head(10))

In [None]:
# Revenue retention
CustomerRevenueRetentionAnalysis = cohorts_customer_revenue.div(cohort_revenue)
print(CustomerRevenueRetentionAnalysis)

In [None]:
plt.figure(figsize=(20,20))
plt.title('Revenue Retention',fontsize=10, weight = 'bold', color="black")
sns.heatmap(CustomerRevenueRetentionAnalysis.T, mask=CustomerRevenueRetentionAnalysis.T.isnull(), annot=True,cmap="YlGnBu", fmt='.0%')

- The 2010-12, 2011-01 and 2011-08 periods have the highest revenue retention among all periods. Maybe the company had an effective marketing strategy during these periods.


# Derivation of marketing strategies

## Customer segmentation will be achieved with an RFM matrix. It has three indicators 
```
1. Recency - How recently the customer bought from the store. Calculated as the difference between the latest purchase date in the dataset and the customer's last purchase date
2. Frequency - Number of times the customer has bought from the store
3. Monetary - Total amount spent by the customer
```

- Here, let's analyse only based on purchases not based on returns or cancellations
- For UK alone, for targeted marketing

In [None]:
df = df[~df['InvoiceNo'].str.contains('C', na=False)]
df = df[df["Quantity"]>0]

# Non consideration of discounts
df = df[df["StockCode"]!='D']

In [None]:
# Latest purchase
max_date = df["InvoiceDate"].max()

rfm_df = df.groupby("CustomerID").agg({"InvoiceDate":lambda date: (max_date - date.max()).days,
                                      "InvoiceNo": 'count',  # counts invoice lines; 'nunique' would count distinct invoices
                                      "Individual Revenue":'sum'})

rfm_df.rename(columns={"InvoiceDate":"Recency","InvoiceNo":"Frequency","Individual Revenue":"Monetary"},inplace=True)
print(rfm_df)

- One way to segment customers is to divide them by quantiles.
- Another way is to use unsupervised learning to divide them into segments.
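
The quantile approach is not used further in this notebook, but as a rough sketch it could look like the following. The RFM values are made up, and scoring with quartiles (1 to 4) is an assumption; note that recency is scored in reverse because fewer days since the last purchase is better.

```python
import pandas as pd

# Made-up RFM values for eight customers.
rfm = pd.DataFrame({
    "Recency": [5, 40, 200, 10, 90, 365, 1, 30],
    "Monetary": [500, 120, 30, 800, 60, 10, 1500, 250],
})

# Quartile scores: recent buyers get 4, customers inactive for long get 1.
rfm["R_score"] = pd.qcut(rfm["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)
# Higher spend gets a higher score.
rfm["M_score"] = pd.qcut(rfm["Monetary"], q=4, labels=[1, 2, 3, 4]).astype(int)

print(rfm)
```

With heavily tied values (common for frequency), `pd.qcut` can fail on duplicate bin edges, so ranking the column first may be needed.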

### Customer segmentation by behaviour

In [None]:
km = KMeans(n_clusters=3)
cluster_labels = km.fit_predict(rfm_df)

# https://stackoverflow.com/questions/44888415/how-to-set-k-means-clustering-labels-from-highest-to-lowest-with-python
idx = np.argsort(km.cluster_centers_.sum(axis=1))
lut = np.zeros_like(idx)
lut[idx] = np.arange(3)

fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20,10))
sns.boxplot(x=lut[km.labels_], y=rfm_df["Recency"].tolist(), ax=ax[0])
ax[0].set(xlabel="Clusters", ylabel = "Recency in days")
ax[0].title.set_text('Clusters - Recency')
sns.boxplot(x=lut[km.labels_], y=rfm_df["Frequency"].tolist(), ax=ax[1])
ax[1].set(xlabel="Clusters", ylabel = "Frequency (purchase count)")
ax[1].title.set_text('Clusters - Frequency')
sns.boxplot(x=lut[km.labels_], y=rfm_df["Monetary"].tolist(), ax=ax[2])
ax[2].set(xlabel="Clusters", ylabel = "Total Revenue")
ax[2].title.set_text('Clusters - Monetary')

- Some interesting observations, such as the clear segmentation on revenue:
1. Many high-paying customers have not bought frequently or recently.
2. The company hasn't gained substantial income from frequent customers.
3. Let's break the customers into segments.

In [None]:
rfm_df["Cluster labels"] = lut[km.labels_]

In [None]:
typec1 = 'Customers who may or may not have bought recently, do not buy frequently enough and contribute the least' # 0
typec3 = 'Customers who have not bought recently, do not buy frequently enough but contribute the most'        # 2
typec2 = 'Customers who may or may not have bought recently, may or may not buy frequently enough but contribute sufficiently'# 1

def func(x):
    if x == 0:
        return typec1
    elif x == 2:
        return typec3
    elif x == 1:
        return typec2
    else:
        return 'N/A'

rfm_df["Customer segments"] = rfm_df["Cluster labels"].apply(func)
print(rfm_df)

In [None]:
# Plotly creates its own figures, so no matplotlib figure is needed here
fig = px.pie(rfm_df, names='Customer segments',title="Distribution of customer buying behaviour")
fig.data[0].hovertemplate = '%{label}'
fig.update_layout(showlegend=False, title_x=0.5)
fig.show()

fig = px.pie(rfm_df, names='Customer segments',title="Distribution of customer buying behaviour with revenue",values="Monetary")
fig.data[0].hovertemplate = '%{label}'
fig.update_layout(showlegend=False, title_x=0.5)
fig.show()

- The major share of customers do not buy frequently, and each contributes a low revenue individually
- However, together these customers account for the bulk of the company's revenue.
- Hence, inventory management is critical here


### Possible marketing strategies
- For effective targeting, the strategy must be selected for the specific use case
- Hence recency, frequency and monetary value are each clustered separately

### Finding the optimal number of clusters

In [None]:
# Find optimal number of clusters

Sum_of_squared_distances_rec = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(rfm_df["Recency"].values.reshape(-1,1))
    Sum_of_squared_distances_rec.append(km.inertia_)
    
Sum_of_squared_distances_freq = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(rfm_df["Frequency"].values.reshape(-1,1))
    Sum_of_squared_distances_freq.append(km.inertia_)
    
    
Sum_of_squared_distances_mon = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(rfm_df["Monetary"].values.reshape(-1,1))
    Sum_of_squared_distances_mon.append(km.inertia_)
 
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20,10))
ax[0].plot(K, Sum_of_squared_distances_rec, 'bx-')
ax[0].set_title('Elbow Method For Optimal k - Recency')


ax[1].plot(K, Sum_of_squared_distances_freq, 'bx-')
ax[1].set_title('Elbow Method For Optimal k - Frequency')


ax[2].plot(K, Sum_of_squared_distances_mon, 'bx-')
ax[2].set_title('Elbow Method For Optimal k - Monetary')

for a in ax.flat:
    a.set(xlabel='k', ylabel='Sum_of_squared_distances')
    
# Hide x labels and tick labels for top plots and y ticks for right plots.
for a in ax.flat:
    a.label_outer()
    
plt.show()

- 5 clusters seem to be an acceptable number for the Recency, Frequency and Monetary values

In [None]:
k=5

fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20,10))
labels = ["Recency","Frequency","Monetary"]

Cluster_labels = []

for i in range(len(labels)):
    kmeans = KMeans(n_clusters=k).fit(rfm_df[labels[i]].values.reshape(-1,1))   
    idx = np.argsort(kmeans.cluster_centers_.sum(axis=1))
    lut = np.zeros_like(idx)
    lut[idx] = np.arange(k)
    

    # One column for the metric, one for its ordered cluster label
    param_df = pd.DataFrame({
        labels[i]: rfm_df[labels[i]].tolist(),
        "Cluster_labels": list(lut[kmeans.labels_])
    })
    sns.boxplot(x="Cluster_labels", y=labels[i], data=param_df,ax=ax[i])
    
    ax[i].set(xlabel="Cluster labels", ylabel = labels[i])
    ax[i].title.set_text("Clustering " + labels[i])
    
        
    rfm_df["Cluster labels - " + str(labels[i])] = list(lut[kmeans.labels_])

    
    
plt.show()


In [None]:
# https://guillaume-martin.github.io/rfm-segmentation-with-python.html

segt_map = {
    r'[0-1][0-1]': 'First timers - SMS ads',
    r'[0-1][2-3]': 'At risk - Discounts',
    r'[0-1]4': 'Immediate targeting - Promotions',
    r'2[0-1]': 'About to lose - Feedback form',
    r'[2-3][3-4]': 'Regular customers - Loyalty cards',
    r'30': 'Promising - Instant vouchers',
    r'40': 'New customers - 50 % discounts on certain items',
    r'[3-4][1-2]': 'Potential regular customers - Loyalty cards',
    r'4[3-4]': 'Best customers - Loyalty cards',
    r'22': 'Need attention - Discounts'
}  # For example, if a customer hasn't bought for a long time and has tried the shop only once, an SMS ad might
   # rekindle interest


rfm_df["RFM class"] = rfm_df['Cluster labels - Recency'].apply(str)+ rfm_df['Cluster labels - Frequency'].apply(str) + \
rfm_df['Cluster labels - Monetary'].apply(str)

# Marketing strategy based on recency and frequency of shopping alone
rfm_df['Marketing strategy'] = rfm_df['Cluster labels - Recency'].apply(str)+ rfm_df['Cluster labels - Frequency'].apply(str)
rfm_df['Marketing strategy'] = rfm_df['Marketing strategy'].replace(segt_map, regex=True)
rfm_df.head(5)
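
To make the mapping step easier to follow, here is a small sketch of how a regex dictionary turns two-digit recency/frequency codes into segment names. The codes and the shortened `toy_map` are made up for illustration; only `Series.replace` with `regex=True` is taken from the cell above.

```python
import pandas as pd

# Made-up two-digit codes: first digit = recency cluster, second = frequency cluster.
codes = pd.Series(["00", "13", "44", "22"])

# Shortened, hypothetical version of the segment map.
toy_map = {
    r'[0-1][0-1]': 'First timers',
    r'[0-1][2-3]': 'At risk',
    r'4[3-4]': 'Best customers',
    r'22': 'Need attention',
}

# Each pattern that matches a code is replaced by its segment name.
segments = codes.replace(toy_map, regex=True)
print(segments.tolist())
```

Because the patterns are not anchored, this relies on the codes being exactly two digits and on the replacement texts not containing digit pairs that match a later pattern.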

In [None]:
fig = px.treemap(rfm_df, path=['Marketing strategy'],title ="Treemap of marketing strategies")
fig.data[0].hovertemplate = '%{label}'
fig.update_layout(title_x=0.5)
fig.show()

# Inventory management

- As discussed earlier, the major share of customers do not buy frequently and individually contribute a low revenue, yet together they account for the bulk of the company's revenue.
- Hence, inventory management is critical
- One strategy is to find products which sell quickly
- Another is to find products which sell mostly at particular times (for example holidays or particular months)
- Again, the analysis is done for the UK alone


## Inventory management as per purchase rate

In [None]:
# Latest purchase
max_date = df["InvoiceDate"].max()

# Analysis is done by stock code, as one stock code denotes the same type of item from multiple suppliers. Hence, the
# company has the flexibility to buy the same product from multiple suppliers
IM_df = df.groupby("StockCode").agg({"InvoiceDate":lambda date: (max_date - date.max()).days,
                                      "Quantity": 'sum',
                                      "Individual Revenue":'sum'})

IM_df.rename(columns={"InvoiceDate":"Recency","Quantity":"Frequency","Individual Revenue":"Monetary"},inplace=True)


In [None]:
# Find optimal number of clusters

Sum_of_squared_distances_freq = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(IM_df["Frequency"].values.reshape(-1,1))
    Sum_of_squared_distances_freq.append(km.inertia_)
    
plt.plot(K, Sum_of_squared_distances_freq, 'bx-')
plt.title('Elbow Method For Optimal k - Frequency')
plt.show()

In [None]:
k=5

Cluster_labels = []
kmeans = KMeans(n_clusters=k).fit(IM_df["Frequency"].values.reshape(-1,1))
idx = np.argsort(kmeans.cluster_centers_.sum(axis=1))
lut = np.zeros_like(idx)
lut[idx] = np.arange(k)
    
# One column for the total quantity sold, one for its ordered cluster label
param_df = pd.DataFrame({
    "Frequency": IM_df["Frequency"].tolist(),
    "Cluster_labels": list(lut[kmeans.labels_])
})
sns.boxplot(x="Cluster_labels", y="Frequency", data=param_df)

plt.title("Clustering " + "Frequency")
    
IM_df["Cluster labels - " + "Frequency"] = list(lut[kmeans.labels_])    
    
plt.show()


In [None]:
# Map the ordered cluster labels (0 = lowest total quantity) to readable names
purchase_rate_names = {0: "Low purchase rate", 1: "Lower mid purchase rate", 2: "Mid purchase rate",
                       3: "Higher mid purchase rate", 4: "Higher purchase rate"}
IM_df["Cluster labels - Frequency"] = IM_df["Cluster labels - Frequency"].map(purchase_rate_names)

In [None]:
fig = px.treemap(IM_df, path=['Cluster labels - Frequency'],values="Monetary", title="Treemap of product purchase rate with revenue")
fig.data[0].hovertemplate = '%{label}<br>%{value}'
fig.update_layout(title_x=0.5)
fig.show()

- Products which are bought infrequently contribute a larger share of the revenue

In [None]:
print(IM_df[IM_df["Cluster labels - Frequency"]=="Higher purchase rate"].head(10))

- These stocks need to be replenished at a faster rate, as also indicated by their lower recency

## Inventory management as per seasonality

In [None]:
df["Seasonality"] = "Non seasonal"

# November, December are the seasonal months for sales
# https://www.statista.com/topics/6297/holiday-retail-in-the-uk/#:~:text=Typically%2C%20during%20the%20months%20of,seen%20even%20more%20elevated%20numbers.

df.loc[df.InvoiceDate.dt.month.isin([11,12]),"Seasonality"] = "Seasonal"

fig = px.pie(df, values='Individual Revenue', names="Seasonality",title="Distribution of revenue according to seasonal and non seasonal sales")
fig.update_layout(title_x=0.5)
fig.show()

- Nearly 27% of sales revenue comes from the holiday months alone. Significant attention should therefore be paid to the products sold during this period so that they do not run out of stock
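The percentage read off the pie chart can also be computed directly. A minimal sketch with invented rows, assuming the same `InvoiceDate` and `Individual Revenue` columns as `df`:

```python
import pandas as pd

# Hypothetical transactions; column names follow the notebook's df
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2011-01-15", "2011-06-01", "2011-11-20", "2011-12-05"]),
    "Individual Revenue": [100.0, 200.0, 60.0, 40.0],
})

# Flag November and December as the seasonal months
is_holiday = toy["InvoiceDate"].dt.month.isin([11, 12])
toy["Seasonality"] = is_holiday.map({True: "Seasonal", False: "Non seasonal"})

# Revenue share per group, in percent
share = (toy.groupby("Seasonality")["Individual Revenue"].sum()
         / toy["Individual Revenue"].sum() * 100)
print(share.round(1))
```

As a rough baseline, two months out of twelve would account for about 16.7% of revenue if sales were spread evenly over one calendar year, so a share near 27% is a clear holiday uplift.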

In [None]:
Seasonal_df = df[df.Seasonality== "Seasonal"]
Non_Seasonal_df = df[df.Seasonality== "Non seasonal"]

- Group by stock code, since a single stock code can refer to similar products from different suppliers
- Reference: https://www.kaggle.com/anmoltripathi/complete-e-commerce-analysis

In [None]:
Seasonal_df = Seasonal_df[["Quantity","StockCode"]].groupby("StockCode").sum()
Non_Seasonal_df  = Non_Seasonal_df[["Quantity","StockCode"]].groupby("StockCode").sum()

Non_Seasonal_df.rename(columns={"Quantity":"IR_nonseasonal"},inplace=True)
Seasonal_df.rename(columns={"Quantity":"IR_seasonal"},inplace=True)


In [None]:
mergedDf = Non_Seasonal_df.merge(Seasonal_df, left_index=True, right_index=True)
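One thing to keep in mind: `DataFrame.merge` defaults to an inner join, so a stock code sold only in the holiday months, or only outside them, does not appear in `mergedDf`. If such products should be kept, an outer join with missing quantities filled as zero is one option. The sketch below uses made-up stock codes and shows an alternative, not what this notebook does:

```python
import pandas as pd

# Hypothetical aggregated quantities per stock code
non_seasonal = pd.DataFrame({"IR_nonseasonal": [120, 30]},
                            index=pd.Index(["A", "B"], name="StockCode"))
seasonal = pd.DataFrame({"IR_seasonal": [45, 80]},
                        index=pd.Index(["A", "C"], name="StockCode"))

# Default (inner) join: only stock codes present in both periods survive
inner = non_seasonal.merge(seasonal, left_index=True, right_index=True)

# Outer join: keep every stock code and treat a missing period as zero sales
outer = non_seasonal.merge(seasonal, left_index=True, right_index=True,
                           how="outer").fillna(0)

print(inner)
print(outer)
```

Whether to keep these products is a modelling choice: purely seasonal items are arguably the most interesting for holiday stocking, but zeros also interact badly with a log-scaled axis.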

In [None]:
# Find optimal number of clusters

Sum_of_squared_distances_freq = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    # Fit on the same two features that are used for the final clustering
    km = km.fit(mergedDf[['IR_nonseasonal','IR_seasonal']])
    Sum_of_squared_distances_freq.append(km.inertia_)
    
plt.plot(K, Sum_of_squared_distances_freq, 'bx-')
plt.title('Elbow Method For Optimal k - Seasonal and non-seasonal orders')
plt.show()

In [None]:
k = 6
kmeans = KMeans(n_clusters=k)
clustered_cust = kmeans.fit_predict(mergedDf[['IR_nonseasonal','IR_seasonal']])

# Ordering
idx = np.argsort(kmeans.cluster_centers_.sum(axis=1))
lut = np.zeros_like(idx)
lut[idx] = np.arange(k)
mergedDf['cluster'] = list(lut[kmeans.labels_])

plt.figure(figsize=(15,10))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
g1 = sns.scatterplot(x=mergedDf['IR_nonseasonal'],y=mergedDf['IR_seasonal'],hue=mergedDf['cluster'],palette="dark")
g1.set_xscale("log")
plt.xlabel('Non seasonal orders')
plt.title('Orders: Seasonal and non-seasonal segmentation by K-means')
plt.ylabel('Seasonal orders')

In [None]:
print(mergedDf)

In [None]:
df = pd.merge(df, mergedDf, left_on="StockCode", right_index=True)
df.drop(columns=["Seasonality","IR_nonseasonal","IR_seasonal"],inplace=True)
print(df.head(5))

In [None]:
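# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df["weeks"] = df["InvoiceDate"].dt.isocalendar().week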
print(df)
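The week numbers used here follow ISO 8601, which does not line up with calendar months at year boundaries: the first days of January can belong to week 52 or 53 of the previous ISO year. Because the weekly plots group on the week number alone, such dates are pooled with late-December weeks. A small illustration with made-up dates:

```python
import pandas as pd

# Hypothetical invoice dates: a Saturday, the following Monday, and a mid-November day
dates = pd.Series(pd.to_datetime(["2011-01-01", "2011-01-03", "2011-11-15"]))

# ISO week number of each date
weeks = dates.dt.isocalendar().week
print(weeks.tolist())  # [52, 1, 46]
```

This is worth remembering when reading peaks at the very start or end of the weekly plots.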

In [None]:
week_df = df[["Quantity","cluster","weeks"]].groupby(["cluster","weeks"]).sum()

In [None]:
# Number of clusters defined earlier
clusters = np.arange(0,k,1)

for cluster in clusters:
    plt.figure(figsize=(30,10))
    # Weekly totals for this cluster, with "weeks" restored as a column
    sub_df = week_df.loc[cluster].reset_index()
    
    sns.lineplot(data=sub_df, x="weeks", y="Quantity")
    plt.title("Orders for cluster "+ str(cluster+1))
    plt.show()

### Observations
- Clusters 1, 2, 3 and 5 are seasonal in nature. Large restocking efforts for them should therefore be concentrated in November and December
- Cluster 4 is in demand throughout the year, so its products need to be replenished continuously
- Cluster 6 seems to be in demand mostly in January
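These readings come from inspecting the weekly plots. They can be backed with a number by computing, per cluster, the fraction of units sold in November and December. A minimal sketch with invented rows, assuming the `cluster`, `InvoiceDate` and `Quantity` columns of `df` (cluster labels are zero-based, so label 0 is "Cluster 1"):

```python
import pandas as pd

# Hypothetical transactions for two clusters
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "InvoiceDate": pd.to_datetime(["2011-03-01", "2011-11-10", "2011-12-01",
                                   "2011-02-01", "2011-07-01", "2011-11-05"]),
    "Quantity": [10, 50, 40, 30, 50, 20],
})

is_holiday = toy["InvoiceDate"].dt.month.isin([11, 12])

# Total units and holiday-month units per cluster (0 if a cluster has no holiday sales)
total = toy.groupby("cluster")["Quantity"].sum()
holiday_qty = (toy[is_holiday].groupby("cluster")["Quantity"].sum()
               .reindex(total.index, fill_value=0))

holiday_share = holiday_qty / total
print(holiday_share)
```

A share well above the even-spread baseline of roughly 0.17 would support calling a cluster seasonal; the threshold itself is a judgment call.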

In [None]:
# Products in cluster 6 (zero-based label 5); only the first description is printed
print("Products in demand in the month of January: " + df[df.cluster==5]["Description"].unique()[0])


In [None]:
# Cluster 4 (zero-based label 3): products in demand throughout the year
fig = px.pie(df[df.cluster==3], values='Quantity', names="Description",
             title="Distribution of orders for products sold throughout the year")
fig.update_layout(title_x=0.5)
fig.show()

## Conclusion - Inventory management

- The company can manage its inventory either by replenishing stock according to how frequently products are sold
- or by prioritising products according to the quantities sold in seasonal and non-seasonal months

## I would appreciate any feedback. Please upvote if you find the notebook useful