# Complete E-Commerce Data Analysis: Business First Approach

This analysis looks for key insights in an e-commerce sales dataset and ties them to business decisions. We will try to derive business-impacting insights by answering relevant questions such as:
* Is the company’s performance improving or degrading over time?
* What are some important trends visible in the sales data?
* How can we measure our performance in terms of customer acquisition and building customer loyalty?
* Can we take some initiatives based on the data to increase sales?
* Based on the data, can we avoid out-of-stock situations?
* What kind of customers typically buy from us?

This dataset was originally uploaded at https://archive.ics.uci.edu/ml/datasets/online+retail#

Table of Contents 
> [Getting Started](#1)
>> [Importing Libraries](#1.1)<br>
>> [Reading Data](#1.2)
>>> [Data Description](#1.2.1)<br>

>> [Analysing -ve Quantity and -ve UnitPrice](#1.3)
>>> [Negative Quantity](#1.3.1) <br>
>>> [Negative Price](#1.3.2)

> [Preprocessing](#2)
>> [Duplicate Deletion](#2.1)<br>
>> [Analysing null Values](#2.2)

> [Feature Generation](#3)
>> [Creating Product StockCode Lookup](#3.1)<br>
>> [Feature Generation: Product Name](#3.2)<br>
>> [Feature Generation: Total Price](#3.3)<br>

> [Analysis](#4)
>> [Q. Is company’s performance improving or degrading over time?](#4.1)<br>
>> [Q. What are some important trends visible in the sales data and insights?](#4.2)<br>
>> [Regionwise Analysis](#4.2.1)<br>
>> [Q. How can we measure our performance in terms of customer acquisition and building customer loyalty?](#4.3)
>>> [RFM Analysis](#4.3.1)

>> [Q. Can we take some initiatives based on the data to increase the sales? ](#4.4)<br>
>> [Q. Based on data can we avoid out of stock situations?](#4.5)
>>> [Are there any products which are sold more during christmas season?](#4.5.1)

>> [Q.What kind of Customers are buying from us?](#4.6)

## Getting Started <a class="anchor" id="1"></a>
### Importing Libraries <a class="anchor" id="1.1"></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS 
from datetime import datetime
import plotly
plotly.offline.init_notebook_mode (connected = True)
import plotly.express as px
import squarify
import matplotlib.colors as mcolors
from sklearn.cluster import KMeans
from sklearn.decomposition import KernelPCA,PCA,TruncatedSVD
import spacy
import en_core_web_sm
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import calendar
# nlp = en_core_web_sm.load()
# !python -m spacy download en_core_web_md

In [None]:
### Copied from justfortherec's answer: https://stackoverflow.com/questions/28931224/adding-value-labels-on-a-matplotlib-bar-chart


def add_value_labels(ax, spacing=5):
    """Add labels to the end of each bar in a bar chart.

    Arguments:
        ax (matplotlib.axes.Axes): The matplotlib object containing the axes
            of the plot to annotate.
        spacing (int): The distance between the labels and the bars.
    """

    # For each bar: Place a label
    for rect in ax.patches:
        # Get X and Y placement of label from rect.
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2

        # Number of points between bar and label. Change to your liking.
        space = spacing
        # Vertical alignment for positive values
        va = 'bottom'

        # If value of bar is negative: Place label below bar
        if y_value < 0:
            # Invert space to place label below
            space *= -1
            # Vertically align label at top
            va = 'top'

        # Use Y value as label and format number with one decimal place
        label = "{:.1f}".format(y_value)

        # Create annotation
        ax.annotate(
            label,                      # Use `label` as label
            (x_value, y_value),         # Place label at end of the bar
            xytext=(0, space),          # Vertically shift label by `space`
            textcoords="offset points", # Interpret `xytext` as offset in points
            ha='center',                # Horizontally center label
            va=va)                      # Vertically align label differently for
                                        # positive and negative values.

### Reading Data <a class="anchor" id="1.2"></a>

Let's take a sneak peek at the data.

In [None]:
df = pd.read_csv("../input/ecommerce-data/data.csv",engine='python',parse_dates=['InvoiceDate'])
df_original = df.copy()
df.head()

#### Data Description <a class="anchor" id="1.2.1"></a>

The following information is taken from the UCI Machine Learning Repository:<br>
*This is a transnational data set which contains all the transactions occurring between **01/12/2010 and 09/12/2011** for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.*


| Column | Description |
| --- | :-- |
| InvoiceNo | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. **If this code starts with letter 'c', it indicates a cancellation.** |
| StockCode | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. |
| Description | Product (item) name. Nominal. |
| Quantity | The quantities of each product (item) per transaction. Numeric. |
| InvoiceDate | Invoice date and time. Numeric, the day and time when each transaction was generated. |
| UnitPrice | Numeric, Product price per unit in sterling. |
| CustomerID | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. |
| Country | Country name. Nominal, the name of the country where each customer resides. |



In [None]:
df.describe()

In [None]:
sns.pairplot(df)

Note that there are negative values in both ```UnitPrice``` and ```Quantity```. Do they correspond to cancelled orders?

### Analysing -ve Quantity and -ve UnitPrice <a class="anchor" id="1.3"></a>
#### Negative Quantity <a class="anchor" id="1.3.1"></a>

In [None]:
fig,ax = plt.subplots(nrows=2,figsize=(20,7))
# Invoices whose number starts with 'C' (or 'c') are cancellations
cancelled = df[df['InvoiceNo'].str[0].str.upper()=='C']
sns.boxplot(x=cancelled['Quantity'],ax=ax[0])
sns.boxplot(x=cancelled['UnitPrice'],ax=ax[1])
ax[0].title.set_text("Cancelled Orders 'C' quantity and price distribution")

In [None]:
neg_qty = df[df["Quantity"]<0]
neg_qty_without_C = neg_qty[neg_qty["InvoiceNo"].str[0]!="C"]
print("Negative Qty without 'C' in InvoiceNo \n Unit Prices: {} \t CustomerIDs: {}".format(neg_qty_without_C["UnitPrice"].unique(),neg_qty_without_C["CustomerID"].unique()))

We don't need to worry about the negative quantities without 'C' in the invoice number, as their unit price is always 0. They won't affect the calculations later on. These entries have no CustomerID associated with them.<br><br>
Do all negative quantities have an ```InvoiceNo``` starting with 'C'? <br>
'C' is said to denote cancelled orders, which therefore have negative quantities. But we need to analyse it to be sure.

#### **HYPOTHESIS: Rows with -ve quantities mean that the order was previously placed and cancelled later on.**
If this hypothesis is true, then for the majority of negative-quantity rows there must exist an entry that satisfies the following conditions (why a majority and not all? Because orders placed before the start of the data window are missing, so early cancellations cannot be matched):<br>
1. CustomerID (if it exists) must match
2. Quantity ordered <= Quantity cancelled
3. Order date must be before the cancellation date
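
As a minimal sketch of these three conditions, the toy example below checks each negative-quantity row against the positive-quantity rows. The data and the helper name `has_matching_order` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Toy transactions (hypothetical data, not taken from the dataset).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "Quantity": [5, -5, -3, 4],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-05", "2011-01-02", "2011-01-03"]),
})

orders = toy[toy["Quantity"] >= 0]
cancellations = toy[toy["Quantity"] < 0]


def has_matching_order(cancel_row, orders):
    """Return True if at least one order satisfies all three conditions."""
    match = orders[
        (orders["CustomerID"] == cancel_row["CustomerID"])      # condition 1
        & (orders["Quantity"] <= abs(cancel_row["Quantity"]))   # condition 2
        & (orders["InvoiceDate"] < cancel_row["InvoiceDate"])   # condition 3
    ]
    return len(match) > 0


# One boolean per cancellation row.
matched = cancellations.apply(
    lambda row: has_matching_order(row, orders), axis=1)
print(matched.tolist())
```

Customer 1 has an earlier order of the same size, so that cancellation matches; customer 2 has no order at all, so it does not.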

In [None]:
def check_hypothesis_cancelled_order(df):
    failed = 0
    passed = 0
    neg_qty = df[df["Quantity"]<0]
    pos_qty = df[df["Quantity"]>=0]  # non-negative quantities, i.e. regular orders
    for ind in neg_qty.index:
        if(neg_qty['CustomerID'][ind]):
            p = pos_qty[
                (pos_qty['CustomerID'] == neg_qty['CustomerID'][ind])&
                (pos_qty['Quantity'] <= abs(neg_qty['Quantity'][ind]))&
                ((pos_qty['InvoiceDate'] - neg_qty['InvoiceDate'][ind]).dt.total_seconds()>=0)
            ]
            if(len(p)==0):
                failed+=1
            else:
                passed+=1
    if(failed>passed):
        print("Hypothesis Rejected!")
        print("Failed Counts:"+str(failed)+" Passed Counts:"+str(passed))
        print("Approximately "+str(int(failed/(failed + passed)*100)) + "% rows didn't satisfy the condition")
    else:
        print("Hypothesis Accepted")
        print("Failed Counts:"+str(failed)+" Passed Counts:"+str(passed))
        print("Approximately "+str(int(passed/(failed + passed)*100)) + "% rows satisfy the condition")
            

In [None]:
check_hypothesis_cancelled_order(df)

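About 56% of rows satisfy the conditions, so the hypothesis is accepted: negative-quantity entries most likely represent cancellations of earlier orders. A 56% match rate is a majority, but it is suggestive rather than conclusive.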

#### Negative Pricing <a class="anchor" id="1.3.2"></a>

In [None]:
neg_price = df[df["UnitPrice"]<0]
neg_price

## Preprocessing <a class="anchor" id="2"></a>
### Duplicate Deletion <a class="anchor" id="2.1"></a>

In [None]:
df.drop_duplicates(inplace=True)

### Analysing the *null* values <a class="anchor" id="2.2"></a>

In [None]:
fig,ax = plt.subplots(figsize=(14,2))
((df.isnull().sum() / len(df))*100).plot.bar(ax=ax)
add_value_labels(ax)
ax.set_title('% of Null Values')
plt.show()

We can see there are missing values in ```Description``` and ```CustomerID```. <br>
**~25%** of ```CustomerID``` values and **~0.3%** of ```Description``` values are missing.<br>
We cannot do anything about ```CustomerID```, but let's check what we can do for ```Description```.

First, it is important to check whether each ```StockCode``` has a unique ```Description```.

In [None]:
x = pd.DataFrame(df.groupby("StockCode")["Description"].value_counts())
y = x.droplevel(level=1).index
y = y[y.duplicated()]
test = df[["StockCode","Description"]]
test = test.drop_duplicates()
test1 = test[test["StockCode"].isin(y)]
test2 = pd.DataFrame(test1.groupby("StockCode")["Description"].value_counts())
test2.head(10)

In [None]:
fig, ax = plt.subplots(figsize=(15,4))
grouped = df.groupby("StockCode")['Description'].unique()
grouped_counts = grouped.apply(lambda x: len(x)).sort_values(ascending=False)
grouped_counts.head(50).plot.bar(ax=ax)

We can see that there are products with several different ```Description``` values under the same ```StockCode```. For example, for ```StockCode``` = 20713 there are 8 unique descriptions (9 including null).

In [None]:
df[df["StockCode"]=="20713"]["Description"].unique()
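
The same uniqueness check can be sketched more compactly with `nunique`, which counts distinct non-null descriptions per code. The toy rows below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical rows: A1 carries a product name, a condition note and a null.
toy = pd.DataFrame({
    "StockCode": ["A1", "A1", "A1", "B2", "B2"],
    "Description": ["RED MUG", "damaged", None, "BLUE BOWL", "BLUE BOWL"],
})

# nunique() ignores nulls by default.
desc_per_code = toy.groupby("StockCode")["Description"].nunique()

# Codes that map to more than one distinct description.
ambiguous_codes = desc_per_code[desc_per_code > 1].index.tolist()
print(ambiguous_codes)
```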

## Feature Generation <a class="anchor" id="3"></a>
### Creating Product StockCode Lookup <a class="anchor" id="3.1"></a>

We can see that ```Description``` does not always name the product: in some cases it holds vague values or describes the condition of the product (e.g. "wet/rusty"). We can create a product lookup that gives a good idea of which product a specific ```StockCode``` refers to.

In [None]:
# LOGIC: Description with max upper case letters = Product Name

def get_product_name(descriptions):
    """Return the description with the most upper-case letters ('' if none has any)."""
    max_upper_count = 0
    product_name = ''
    for desc in descriptions:
        if pd.notna(desc):  # skip NaN descriptions
            count = sum(1 for letter in desc if letter.isupper())
            if count > max_upper_count:
                max_upper_count = count
                product_name = desc
    return product_name


In [None]:
grouped = df.groupby("StockCode")['Description'].unique()
lookup = grouped.apply(get_product_name)
# lookup.to_excel('lookup_product_stockCode.xlsx')
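
To make the heuristic concrete, here is a small self-contained sketch on hypothetical data. `pick_product_name` is an illustrative stand-in that follows the same "most upper-case letters wins" rule; the stock codes and descriptions are made up.

```python
import pandas as pd


def pick_product_name(descriptions):
    """Return the description with the most upper-case letters ('' if none)."""
    best, best_count = "", 0
    for desc in descriptions:
        if pd.isna(desc):
            continue  # ignore missing descriptions
        count = sum(ch.isupper() for ch in desc)
        if count > best_count:
            best, best_count = desc, count
    return best


# Hypothetical rows: X1 has a real name plus noise, Y2 has only a note.
toy = pd.DataFrame({
    "StockCode": ["X1", "X1", "X1", "Y2"],
    "Description": ["OWL DOORSTOP", "wet/rusty", None, "Found"],
})

toy_lookup = (toy.groupby("StockCode")["Description"]
                 .unique()
                 .apply(pick_product_name))
print(toy_lookup)
```

The second code shows a limitation of the rule: when a `StockCode` has no proper product name at all, the heuristic still returns whichever note has the most capitals.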

### Feature Generation: Product Name <a class="anchor" id="3.2"></a>

In [None]:
df = df.join(other=lookup, on='StockCode', how='left', rsuffix='ProductName')
df = df.rename(columns={'DescriptionProductName':'ProductName'})

In [None]:
# GETTING SIMILARITY BETWEEN THE Description AND ProductName
# !pip install jellyfish
# import jellyfish
from difflib import SequenceMatcher

des = df['Description']
prod = df['ProductName']
dist = []
for d,p in zip(des, prod):
    try:
        dist.append(SequenceMatcher(None,d,p).ratio())
#         dist.append(float(jellyfish.damerau_levenshtein_distance(d,p)))
    except TypeError:  # Description is NaN (a float), so there is nothing to compare
        dist.append(0)
        

In [None]:
df['dist'] = dist
df[(df['dist']<0.3)&(df['dist']!=0)][['StockCode','Description','ProductName','dist']]
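
For intuition about the `dist` values, the sketch below applies `SequenceMatcher` to a few hypothetical strings. The wrapper name `similarity` is illustrative.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1]; the comparison is case-sensitive."""
    return SequenceMatcher(None, a, b).ratio()


same = similarity("RED MUG", "RED MUG")        # identical strings
near = similarity("RED MUG", "RED MUGS")       # one extra character
far = similarity("wet/rusty", "OWL DOORSTOP")  # no characters in common

print(same, near, far)
```

Because the comparison is case-sensitive, a lower-case condition note shares few or no characters with an upper-case product name and scores close to 0. A ratio of exactly 0 is also what the loop above assigns when the comparison fails, which is why the filter excludes it.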

### Feature Generation: Total Price <a class="anchor" id="3.3"></a>

In [None]:
df["TotalPrice"] = df["UnitPrice"] * df["Quantity"]

In [None]:
fig, ax = plt.subplots(figsize=(7,10))
neg_qty = df[df["Quantity"]<0].copy()  # copy so the assignment below does not write to a view of df
neg_qty["TotalPrice"] = abs(neg_qty["TotalPrice"])
x = neg_qty[["ProductName","TotalPrice"]]
x.groupby("ProductName")["TotalPrice"].sum().sort_values(ascending=True).tail(30).plot.barh(ax=ax)

We can see that `AMAZON FEE`, `Manual`, `Bank Charges` and `Postage` are among the main contributors to the total value of negative-quantity entries.

#### Removing Cancelled ('C') Orders

In [None]:
cancelled_df = df[df['InvoiceNo'].str[0]=='C']
df = df[~(df['InvoiceNo'].str[0]=='C')]
cancelled_df = cancelled_df.reset_index(drop=True)

## Analysis <a class="anchor" id="4"></a>
Now that we have cleaned the dataset and generated the required features, we can move on to the actual business questions.
### Q. Is company’s performance improving or degrading over time? <a class="anchor" id="4.1"></a>
Company’s performance can be measured in many terms. Here, we will be analysing the performance by:

* Growth in Revenue by Sales

* Growth in Sales Quantity

* Growth in Customer Base 


In [None]:
fig, ax = plt.subplots(nrows=4, ncols=1,figsize=(15,20))
rev = df[df['TotalPrice']>=0].copy()  # copy so that adding a column does not trigger SettingWithCopyWarning
rev['TransactionsCount'] = 1
rev = rev.groupby(rev['InvoiceDate'].dt.date).agg({'TotalPrice':'sum',
                                                  'Quantity': 'sum',
                                                  'CustomerID': 'count',
                                                  'TransactionsCount':'sum'})
rev['10 Days Moving Average Revenue'] = rev['TotalPrice'].rolling(10).mean()
rev['10 Days Moving Average Quantity'] = rev['Quantity'].rolling(10).mean()
rev['10 Days Moving Transactions Count'] = rev['TransactionsCount'].rolling(10).mean()
cust = df.groupby('CustomerID').first().reset_index()[['CustomerID','InvoiceDate']]
cust = cust.groupby(cust.InvoiceDate.dt.date).agg({'CustomerID':'count'})
cust['10 Days Moving Average New Customers'] = cust['CustomerID'].rolling(10).mean()

sns.set_style("whitegrid")
sns.lineplot(data=rev[['TotalPrice','10 Days Moving Average Revenue']], palette='magma_r', linewidth=1.5, ax=ax[0],legend=False)
ax[0].legend(title='Revenue Trends', loc='upper left', labels=['Revenue', '10 Days Moving Average Revenue'])
ax[0].title.set_text('Revenue Trends')
ax[0].set_xlabel('')

sns.lineplot(data=rev[['Quantity','10 Days Moving Average Quantity']], palette='ocean', linewidth=1.5, ax=ax[1])
ax[1].legend(title='Quantity Trends', loc='upper left', labels=['Quantity Sold', '10 Days Moving Average Quantity'])
ax[1].title.set_text('Quantity Sold Trends')
ax[1].set_xlabel('')

sns.lineplot(data=cust, palette='cividis', linewidth=1.5, ax=ax[2])
ax[2].legend(title='New Customers Trends', loc='upper right', labels=['New Customers', '10 Days Moving Average New Customers'])
ax[2].title.set_text('New Customers Trends')
ax[2].set_xlabel('')

sns.lineplot(data=rev[['TransactionsCount','10 Days Moving Transactions Count']], palette='twilight_shifted', linewidth=1.5, ax=ax[3])
ax[3].legend(title='Transactions Count Trend', loc='upper right', labels=['Transactions Count', '10 Days Moving Average Transactions Count'])
ax[3].title.set_text('Transactions Count Trends')
ax[3].set_xlabel('')

plt.show()

In [None]:
fig, ax = plt.subplots(nrows=4, ncols=1,figsize=(15,20))
rev = df[(df['TotalPrice']>0)&(df['InvoiceDate'].dt.year==2011)].copy()
rev['Transactions Count'] = 1
rev = rev.groupby(rev['InvoiceDate'].dt.month).agg({'TotalPrice':'sum',
                                                  'Quantity': 'sum',
                                                  'CustomerID': 'count',
                                                'Transactions Count':'sum'})
rev = rev.reset_index()
rev['Month'] = rev['InvoiceDate'].apply(lambda x: calendar.month_abbr[x])
rev = rev.rename({'TotalPrice':'Revenue'},axis=1)
cust = df.groupby('CustomerID').first().reset_index()[['CustomerID','InvoiceDate']]
cust = cust.groupby(cust.InvoiceDate.dt.month).agg({'CustomerID':'count'})
cust = cust.reset_index()
cust['Month'] = cust['InvoiceDate'].apply(lambda x: calendar.month_abbr[x])


sns.set_style("whitegrid")
sns.barplot(data=rev, x=rev.Month, y='Revenue', palette='magma_r', ax=ax[0])
ax[0].title.set_text('Revenue by Months')
add_value_labels(ax[0])

sns.barplot(data=rev, x=rev.Month, y='Quantity',  palette='ocean', ax=ax[1])
ax[1].title.set_text('Quantity Sold by Months')
add_value_labels(ax[1])

sns.barplot(data=cust, x=cust.Month, y='CustomerID',  palette='cividis', ax=ax[2])
ax[2].title.set_text('New Customers by Months')
add_value_labels(ax[2])

sns.barplot(data=rev, x=rev.Month, y='Transactions Count',  palette='twilight_shifted', ax=ax[3])
ax[3].title.set_text('Transactions Count by Months')
fig.suptitle('Growth Month wise',fontsize=16)
add_value_labels(ax[3])

plt.show()

**Inferences**

* It seems that company’s performance is improving in terms of revenue and sales, but as we are provided with just one year of data, it is hard to claim this for sure


* There might be a seasonal rise during the end of the year.


* In terms of new customer acquisition, we can see a clear downward slope. This can partly be explained by the fact that, with a growing customer base, it becomes harder to acquire new customers.


* We can compare December sales for 2010 and 2011 to get an insight


In [None]:
sales_comp = df[(df['InvoiceDate'].dt.month==12)&(df['TotalPrice']>=0)][['InvoiceDate','TotalPrice','Quantity']]
sales_comp['Transactions Count'] = 1
sales_comp = sales_comp.groupby(sales_comp['InvoiceDate'].dt.year)[['TotalPrice','Quantity','Transactions Count']].sum()
fig, ax = plt.subplots(nrows=1, ncols=3,figsize=(20,5))

sns.set_style("whitegrid")
sns.barplot(data=sales_comp, x=sales_comp.index, y='TotalPrice', palette='magma_r', ax=ax[0])
ax[0].title.set_text('Revenue Comparison')
ax[0].set_ylabel('Revenue')
ax[0].set_xlabel('December of Year')
add_value_labels(ax[0])

sns.barplot(data=sales_comp, x=sales_comp.index, y='Quantity',  palette='ocean', ax=ax[1])
ax[1].title.set_text('Quantity Sold Comparison')
add_value_labels(ax[1])
ax[1].set_xlabel('December of Year')

sns.barplot(data=sales_comp, x=sales_comp.index, y='Transactions Count',  palette='twilight_shifted', ax=ax[2])
ax[2].title.set_text('Transactions Count Comparison')
add_value_labels(ax[2])
ax[2].set_xlabel('December of Year')

fig.suptitle('Comparison for the month of December in 2010 and 2011',fontsize=16)

plt.show()

In [None]:
print("Sales Revenue Difference: {:2.2f}% decline in revenue from 2010 \nSales Quantity Difference: {:2.2f}% decline in quantity from 2010".format(
((sales_comp['TotalPrice'][2010] - sales_comp['TotalPrice'][2011]) / sales_comp['TotalPrice'][2010])*100,
    ((sales_comp['Quantity'][2010] - sales_comp['Quantity'][2011]) / sales_comp['Quantity'][2010])*100
))
    

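* Here, we can see that the company’s sales revenue decreased from December 2010 to December 2011 by **22.4%**, which at first glance suggests that performance is degrading.
* We get a similar picture in sales quantity as well: a net decrease of **13.4%**.
* However, the data ends on 09/12/2011, so December 2011 covers only about nine days while December 2010 is not cut off in the same way. The two months are therefore not directly comparable, and the decline is at least partly an artifact of the data cut.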
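
One way to make the December comparison like-for-like is to restrict both years to the days of the month that both cover, since the data description gives 09/12/2011 as the last date. The sketch below uses hypothetical revenue figures and is only an illustration of the filtering step.

```python
import pandas as pd

# Hypothetical daily revenue, for illustration only.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01", "2010-12-02", "2010-12-20", "2011-12-01", "2011-12-02"]),
    "TotalPrice": [100.0, 200.0, 300.0, 150.0, 250.0],
})

# Keep only 1-9 December in both years, the window both years cover.
in_window = (toy["InvoiceDate"].dt.month == 12) & (toy["InvoiceDate"].dt.day <= 9)
same_window = toy[in_window]

by_year = same_window.groupby(same_window["InvoiceDate"].dt.year)["TotalPrice"].sum()
change_pct = (by_year[2011] - by_year[2010]) / by_year[2010] * 100
print(by_year)
print(round(change_pct, 1))
```

In this toy data the full-month totals would show a decline (600 vs 400), while the matched window shows growth, which is why the choice of window matters.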


### Q.What are some important trends visible in the sales data and insights? <a class="anchor" id="4.2"></a>

* After grouping the data in different ways, many interesting purchasing patterns emerge.
* Trends are best understood through the purchasing patterns of the customers.
* We can pose certain hypotheses to check the purchasing patterns, like:

> Do customers tend to buy more on certain weekday? <br>
> Do customers buy more at the start of the month?<br>
> Do customers tend to buy more at a specific hour of the day?<br>


In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1,figsize=(15,5))
sns.set_style("whitegrid")

week = df[df['TotalPrice']>=0][['InvoiceDate','TotalPrice','Quantity']]
week = week.groupby(week['InvoiceDate'].dt.weekday)[['TotalPrice','Quantity']].sum()
week = week.reset_index()
week['Week'] = week['InvoiceDate'].apply(lambda x: calendar.day_name[x])

sns.lineplot(data = week, x=week.Week, y='Quantity', marker='o', sort = False, ax=ax)
ax2 = ax.twinx()
sns.barplot(data = week, x=week.Week, y='TotalPrice', alpha=0.5, ax=ax2)
fig.suptitle('Revenue and Quantity by Sale Week Day Wise',fontsize=16)
add_value_labels(ax2)

plt.show()

* It is interesting to note that Thursdays have the highest sales by both volume and revenue.
* Also, there are no transactions on Saturdays. It is reasonable to assume that the business does not process orders on Saturdays.
* Nearly 42% of total sales occur on Thursdays and Tuesdays.
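
A share such as the 42% above can be computed directly from the grouped totals. The sketch below shows the calculation on hypothetical rows; the numbers are not from the dataset.

```python
import calendar
import pandas as pd

# Hypothetical sales: 2011-03-01 is a Tuesday, 03-03 a Thursday, 03-04 a Friday.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-03-01", "2011-03-03", "2011-03-03", "2011-03-04"]),
    "TotalPrice": [20.0, 30.0, 30.0, 20.0],
})

weekday_revenue = toy.groupby(toy["InvoiceDate"].dt.weekday)["TotalPrice"].sum()

# Convert totals to percentage shares and label them with day names.
share = weekday_revenue / weekday_revenue.sum() * 100
share.index = [calendar.day_name[i] for i in share.index]

tue_thu_share = share["Tuesday"] + share["Thursday"]
print(share)
print(tue_thu_share)
```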


In [None]:
fig, ax = plt.subplots(nrows=2, ncols=1,figsize=(15,7))
sns.set_style("whitegrid")

day = df[df['TotalPrice']>=0][['InvoiceDate','TotalPrice','Quantity']]
day = day.groupby(day['InvoiceDate'].dt.hour)[['TotalPrice','Quantity']].sum()

sns.barplot(data = day, x=day.index, y='TotalPrice', alpha=1, ax=ax[0])
sns.lineplot(data = day, x=day.index, y='Quantity', marker='o', sort = False, ax=ax[1])
fig.suptitle('Revenue and Quantity by Sale Hourwise',fontsize=16)
add_value_labels(ax[0])
plt.show()

* We can see that customers tend to buy between 9:00 hrs and 13:00 hrs. In fact, ~51% of sales by revenue occur in these hours.

* Also, no transactions take place before 6:00 hrs or after 20:00 hrs. It can be assumed that orders are usually processed during a 14-hour window between 6:00 hrs and 20:00 hrs.


In [None]:
fig, ax = plt.subplots(nrows=2, ncols=1,figsize=(15,7))
sns.set_style("whitegrid")

date = df[df['TotalPrice']>=0][['InvoiceDate','TotalPrice','Quantity']]
date = date.groupby(date['InvoiceDate'].dt.day)[['TotalPrice','Quantity']].sum()

sns.barplot(data = date, x=date.index, y='TotalPrice', alpha=1, ax=ax[0])
sns.lineplot(data = date, x=date.index, y='Quantity', marker='o', sort = False, ax=ax[1])
fig.suptitle('Revenue and Quantity by Sale Daywise',fontsize=16)

plt.show()

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=1,figsize=(15,7))
sns.set_style("whitegrid")

q = df[(df['TotalPrice']>=0)&(df['InvoiceDate'].dt.year==2011)][['InvoiceDate','TotalPrice','Quantity']]
q = q.groupby(q['InvoiceDate'].dt.quarter)[['TotalPrice','Quantity']].sum()

sns.barplot(data = q, x=q.index, y='TotalPrice', alpha=0.7, ax=ax[0])
sns.lineplot(data = q, x=q.index, y='Quantity', marker='o', sort = False, ax=ax[1])
fig.suptitle('Revenue and Quantity by Sale Quarterly for 2011',fontsize=16)
add_value_labels(ax[0])
ax[1].set_xticks(q.index)  # one tick per quarter instead of hand-written tick labels
plt.show()

* It is evident that sales increased by both volume and revenue in the fourth quarter of 2011.
* Major sales occur in the months of November and December.
* This could be because of the holiday season.

#### Regionwise Analysis <a class="anchor" id="4.2.1"></a>

In [None]:
reg = df[df['TotalPrice']>=0].groupby('Country').agg({'TotalPrice':'sum',
                                                  'Quantity': 'sum',
                                                  'CustomerID': 'count'})

In [None]:
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(10,30))
g1 = sns.barplot(x=reg['TotalPrice'], y=reg.index, alpha=1, ax=ax[0],palette='Reds', orient='h')
g2 = sns.barplot(x=reg['Quantity'], y=reg.index, alpha=1, ax=ax[1], palette='Blues',orient='h')
g3 = sns.barplot(x=reg['CustomerID'], y=reg.index, alpha=1, ax=ax[2], palette='Greens', orient='h')
ax[2].title.set_text('Customers Count by Country')
ax[2].set_xlabel("Customers (Log Scale)")
ax[1].title.set_text('Quantity Sold by Country')
ax[1].set_xlabel("Quantity (Log Scale)")
ax[0].title.set_text('Revenue by Country')
ax[0].set_xlabel("Revenue (Log Scale)")
g1.set_xscale("log")
g2.set_xscale("log")
g3.set_xscale("log")
plt.show()


* The majority of sales come from the UK (~85%).
* Almost 95% of all transactions come from Europe.

In [None]:
reg = reg[reg.index!='United Kingdom']
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(20,20))
# Change color
squarify.plot(sizes=reg['TotalPrice'], label=[str(x)+'\n'+str(y)+'K' for x,y in zip(reg.index,(reg['TotalPrice']/1000).round(2))], alpha=.6, ax=ax[0], color=mcolors.CSS4_COLORS )
ax[0].title.set_text('Revenue by Country (Excluding UK)')
squarify.plot(sizes=reg['Quantity'], label=[str(x)+'\n'+str(y)+'K' for x,y in zip(reg.index,(reg['Quantity']/1000).round(2))], alpha=.6, ax=ax[1], color=mcolors.CSS4_COLORS )
ax[1].title.set_text('Quantity Sold by Country (Excluding UK)')
r1 = reg[reg['CustomerID']!=0]
squarify.plot(sizes=r1['CustomerID'], label=[str(x)+'\n'+str(y)+'K' for x,y in zip(r1.index,(r1['CustomerID']/1000).round(2))], alpha=.6, ax=ax[2], color=mcolors.CSS4_COLORS )
ax[2].title.set_text('Customers Count by Country (Excluding UK)')
ax[0].axis('off')
ax[1].axis('off')
ax[2].axis('off')
plt.show()

In [None]:
## Copied this beautiful piece of code from fabiendaniel's notebook
## https://www.kaggle.com/fabiendaniel/customer-segmentation


import plotly.graph_objs as go
import warnings
from plotly.offline import init_notebook_mode,iplot
init_notebook_mode(connected=True)
warnings.filterwarnings("ignore")

temp = df[['CustomerID', 'InvoiceNo', 'Country']].groupby(['CustomerID', 'InvoiceNo', 'Country']).count()
temp = temp.reset_index(drop = False)
countries = temp['Country'].value_counts()

data = dict(type='choropleth',
locations = countries.index,
locationmode = 'country names', z = countries,
text = countries.index, colorbar = {'title':'Order no.'},
colorscale=[[0, 'rgb(224,255,255)'],
            [0.01, 'rgb(166,206,227)'], [0.02, 'rgb(31,120,180)'],
            [0.03, 'rgb(178,223,138)'], [0.05, 'rgb(51,160,44)'],
            [0.10, 'rgb(251,154,153)'], [0.20, 'rgb(255,255,0)'],
            [1, 'rgb(227,26,28)']],    
reversescale = False)
#_______________________
layout = dict(title='Number of orders per country',
geo = dict(showframe = True, projection={'type':'mercator'}))
#______________
choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate=False)

### Q. How can we measure our performance in terms of customer acquisition and building customer loyalty? <a class="anchor" id="4.3"></a>

#### RFM Analysis <a class="anchor" id="4.3.1"></a>

RFM (Recency, Frequency, Monetary) analysis is used to segregate customers based on their purchasing behaviour.

The reasoning behind each dimension:
* the more recent the purchase, the more responsive the customer is to promotions
* the more frequently the customer buys, the more engaged and satisfied they are
* monetary value differentiates heavy spenders from low-value purchasers
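
As a minimal sketch of how the three values are derived per customer, the toy example below builds an RFM table with a single `groupby`. The transactions are hypothetical, and the column names `recency`, `frequency` and `monetary` are illustrative; frequency is counted per row here, which is an assumption.

```python
import pandas as pd

# Hypothetical transactions for three customers.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-01", "2011-12-01", "2011-06-15",
         "2011-01-10", "2011-02-10", "2011-03-10"]),
    "TotalPrice": [50.0, 70.0, 20.0, 10.0, 15.0, 5.0],
})

# Reference date from which recency is measured (an assumption of this sketch).
snapshot = pd.Timestamp("2012-01-01")

rfm_toy = toy.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),  # days since last purchase
    frequency=("InvoiceDate", "count"),                            # number of rows
    monetary=("TotalPrice", "sum"),                                # total spend
)
print(rfm_toy)
```

A smaller recency value means a more recent purchase, so "good" customers have low recency together with high frequency and high monetary value.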


In [None]:
lastdate = datetime(2012,1,1)
cleaned_dataset = df[df['TotalPrice']>=0]
recent = (lastdate - cleaned_dataset.groupby("CustomerID")["InvoiceDate"].max()).dt.days  # max() gives the latest purchase regardless of row order
frequent = cleaned_dataset.groupby("CustomerID")["InvoiceDate"].count()
monetary = cleaned_dataset.groupby("CustomerID")["TotalPrice"].sum()

In [None]:
recent_quantile = recent.quantile(q=[0.25,0.5,0.75])
recent_quantile

In [None]:
frequent_quantile = frequent.quantile(q=[0.25,0.5,0.75])
frequent_quantile

In [None]:
monetary_quantile = monetary.quantile(q=[0.25,0.5,0.75])
monetary_quantile

In [None]:
rfm = pd.DataFrame(data=[recent,frequent,monetary])
rfm = rfm.transpose()
rfm.columns = ["recent","frequent","monetary"]
rfm

We will try to segregate customers based on RFM analysis. First, we'll create clusters and look at the characteristics of each cluster.


In [None]:
def get_kmeans_wcss(data, n_limit=15):
    wcss = [] #Within cluster sum of squares (WCSS)
    for i in range(1,n_limit):
        km = KMeans(init='k-means++', n_clusters=i, n_init=10)
        km.fit(data)
        wcss.append(km.inertia_)
    plt.title("Elbow Method")
    plt.plot(range(1, n_limit), wcss)
    plt.xlabel("Number of clusters")
    plt.ylabel("WCSS")
    return wcss

In [None]:
_ = get_kmeans_wcss(rfm, n_limit=15)

The elbow method suggests that the ideal number of clusters is 3 (the knee point).

In [None]:
kmeans = KMeans(n_clusters=3, init = "k-means++", random_state=42)
clustered_cust = kmeans.fit_predict(rfm)

In [None]:
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(10,20))
# x and y are passed by keyword, which works across seaborn versions
sns.boxplot(x=clustered_cust, y=recent.values, palette="cubehelix",ax=ax[0])
ax[0].set(xlabel="Clusters", ylabel = "Recency in Number of Days")
ax[0].title.set_text('Clusters on Recency')
sns.boxplot(x=clustered_cust, y=frequent.values, palette="cubehelix",ax=ax[1])
ax[1].set(xlabel="Clusters", ylabel = "Frequency in Number of Transactions")
ax[1].title.set_text('Clusters on Frequency')
sns.boxplot(x=clustered_cust, y=monetary.values, palette="cubehelix",ax=ax[2])
ax[2].set(xlabel="Clusters", ylabel = "Spending Amount")
ax[2].title.set_text('Clusters on Monetary')

* RECENCY boxplot<br>
Cluster 0 customers have not purchased recently (many days since their last purchase), which is undesirable. <br>
Clusters 1 and 2 have purchased recently, suggesting that they are more receptive to marketing.<br>

* FREQUENCY boxplot<br>
Cluster 0 has low frequency, which means these customers do not transact often.<br>
Clusters 1 and 2 have higher frequency, suggesting that they are more satisfied in general.<br>

* MONETARY boxplot<br>
Cluster 0 has the lowest value, meaning these customers spend the least.<br>
Cluster 1 customers are heavy spenders.<br>
Cluster 2 falls in the middle.<br>

In summary, <br>

| Clusters | Recency | Frequency | Monetary |
| --- | --- | --- | --- |
| 0 | Have not visited recently | Least frequent | Least spending |
| 1 | Most recently visited | Highest frequency | Highest spending |
| 2 | Recently visited | Decent frequency | Decent Spending |

<br>
Now let us see customer counts by clusters

In [None]:
rfm['Clusters'] = clustered_cust
rfm.Clusters.value_counts()

**Segregation:** <br>
* The majority of customers, ~96%, belong to cluster 0 (which is not good).
* Only ~1.14% of customers belong to cluster 1.
* ~2.74% of customers belong to cluster 2.
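
Percentages like these can be read straight from the cluster labels with `value_counts(normalize=True)`. The labels below are hypothetical and only show the calculation.

```python
import pandas as pd

# Hypothetical cluster labels for ten customers.
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 2, 2], name="Clusters")

# Share of customers per cluster, in percent.
share_pct = labels.value_counts(normalize=True).mul(100).round(2)
print(share_pct)
```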


### Q. Can we take some initiatives based on the data to increase the sales? <a class="anchor" id="4.4"></a>

* **Initiative 1**: We noticed that not just revenue and sales but also the number of new customers increased in November and December. This indicates that November and December are the peak season, so it would be wise to increase marketing during this period.

* **Initiative 2**: POSTAL CHARGES and CARRIAGE CHARGES. The majority of transactions in the UK, Germany and France are for postage charges. Working with postal companies to reduce postal charges could increase the company’s revenue and customer base.

* **Initiative 3**: Introduce a loyalty program for cluster 1 and cluster 2 customers. It would reward customers for buying more products from the company.

* **Initiative 4**: ~16% of all transactions are cancelled. This might be due to a poor transaction interface. Upgrading the company’s technical infrastructure might increase customer engagement.
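
A cancellation rate like the one in Initiative 4 can be sketched as the share of invoices whose number starts with 'C'. Whether the ~16% figure is counted per invoice or per row is not stated, so counting distinct invoices here is an assumption, and the invoice numbers are illustrative.

```python
import pandas as pd

# Illustrative invoice numbers; an invoice can span several rows.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "C536379", "536380", "C536391", "536400"],
})

# Count each invoice once, then flag the cancelled ones.
invoices = toy["InvoiceNo"].drop_duplicates()
is_cancelled = invoices.str.upper().str.startswith("C")

cancel_rate_pct = is_cancelled.mean() * 100
print(cancel_rate_pct)
```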


### Q. Based on data can we avoid out of stock situations? <a class="anchor" id="4.5"></a>

We have a huge variety of products (~4000 unique products). To make them manageable, it is important to segregate the products into categories.

In [None]:
stopwords = set(STOPWORDS)

# Lower-case every product name and join them into one long string
comment_words = " ".join(str(val).lower() for val in df.ProductName)

wordcloud = WordCloud(width = 1200, height = 600, 
                background_color ='white', 
                stopwords = stopwords, 
                min_font_size = 10).generate(comment_words) 

# plot the WordCloud image
plt.figure(figsize = (12, 6), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 

plt.show() 


#### Holiday Analysis
Let us see how holidays have impacted the sales figure

In [None]:
df['HolidaySeason'] = 0
df.loc[df['InvoiceDate'].dt.month.isin([9,10,11]), 'HolidaySeason'] = 1

In [None]:
# Take a copy so that adding a column does not write into a slice of df
d = df[df['InvoiceDate'].dt.year==2011].copy()
d['Transactions Count'] = 1
labels0 = ['Holiday Season Revenue', 'Non-Holiday Revenue']
sizes0 = [d[d['HolidaySeason']==1]['TotalPrice'].sum(),
         d[d['HolidaySeason']==0]['TotalPrice'].sum()
        ]

labels1 = ['Holiday Season Quantity', 'Non-Holiday Quantity']
sizes1 = [d[d['HolidaySeason']==1]['Quantity'].sum(),
         d[d['HolidaySeason']==0]['Quantity'].sum()
        ]

labels2 = ['Holiday Season Transactions Count', 'Non-Holiday Transactions Count']
sizes2 = [d[d['HolidaySeason']==1]['Transactions Count'].sum(),
         d[d['HolidaySeason']==0]['Transactions Count'].sum()
        ]

fig1, ax = plt.subplots(ncols=3,figsize=(18,5))
ax[0].pie(sizes0, labels=labels0, autopct='%1.1f%%', shadow=True)
ax[0].axis('equal')
ax[1].pie(sizes1, labels=labels1, autopct='%1.1f%%', shadow=True)
ax[1].axis('equal')
ax[2].pie(sizes2, labels=labels2, autopct='%1.1f%%', shadow=True)
ax[2].axis('equal')
plt.show()

It is safe to say that almost 37% of sales (by revenue, quantity and transactions) occur in the holiday season (Sep-Nov, the months flagged in ```HolidaySeason```)

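
The three shares behind the pie charts can also be computed as one small table. This is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical, while the column names and the month list mirror the ```HolidaySeason``` flag defined above.

```python
import pandas as pd

# Hypothetical toy transactions; column names follow the notebook
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2011-02-01", "2011-06-15", "2011-09-10",
                                   "2011-10-05", "2011-11-20", "2011-12-01"]),
    "TotalPrice": [10.0, 20.0, 30.0, 15.0, 15.0, 10.0],
    "Quantity":   [1, 2, 3, 2, 1, 1],
})
toy["HolidaySeason"] = toy["InvoiceDate"].dt.month.isin([9, 10, 11]).astype(int)

# Sum each measure per flag value, then divide by the column totals
totals = toy.groupby("HolidaySeason").agg(
    Revenue=("TotalPrice", "sum"),
    Quantity=("Quantity", "sum"),
    Transactions=("TotalPrice", "size"),
)
shares = totals / totals.sum()
print(shares.round(3))
```

Row `1` of `shares` holds the holiday-season share for each measure, which makes it easy to check whether revenue, quantity and transaction counts really move together.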
#### Are there any products which are sold more during christmas season? <a class="anchor" id="4.5.1"></a>
We have already seen a surge in sales during the holiday season. Are there any specific ```StockCode``` values contributing to it? Let's figure it out.
We will segregate the products in two ways, using the counts of seasonal and off-seasonal sales:<br>
1. Based on K-means cluster groupings
2. Based on quantile range groupings<br><br>
Let's first look at the K-means based groupings.<br>
LOGIC: Based on the seasonal and off-seasonal counts for each ```StockCode```, we will create clusters called ```ProductClusters``` and see how each cluster's sales occurred over the whole of 2011
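
The per-product seasonal and off-seasonal counts described above can be sketched with a cross-tabulation. This is an illustrative alternative on made-up data, not the notebook's own construction. The `toy` frame is hypothetical and only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per transaction line
toy = pd.DataFrame({
    "StockCode":     ["A", "A", "A", "B", "B", "C"],
    "HolidaySeason": [1,   1,   0,   0,   0,   1],
})

# Rows: StockCode, columns: HolidaySeason flag, values: number of transaction lines
counts = pd.crosstab(toy["StockCode"], toy["HolidaySeason"])

# reindex guards against a flag value that never occurs in the data
counts = counts.reindex(columns=[1, 0], fill_value=0)
counts.columns = ["Season", "Off-Season"]
print(counts)
```

Products that never sell in one of the two periods get a count of 0 directly, so no separate `fillna` step is needed.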

In [None]:
df['Transactions Count'] = 1
l1 = df[df['HolidaySeason']==1].groupby('StockCode')['Transactions Count'].sum()
l2 = df[df['HolidaySeason']==0].groupby('StockCode')['Transactions Count'].sum()
x = pd.DataFrame(data=[l1,l2]).T
x.columns = ['Season','Off-Season']
x = x.fillna(0)
x = x.reset_index()

In [None]:
_ = get_kmeans_wcss(x[['Season','Off-Season']], n_limit=20)

In [None]:
kmeans = KMeans(n_clusters=10, init = "k-means++", random_state=100)
clustered_cust = kmeans.fit_predict(x[['Season','Off-Season']])
x['cluster'] = clustered_cust

In [None]:
plt.figure(figsize=(15,10))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
g1 = sns.scatterplot(x=x['Season'],y=x['Off-Season'],hue=x['cluster'],palette="deep")
g1.set_xscale("log")
plt.xlabel('Season Counts')
plt.title('StockCodes: Seasonal and Off-seasonal counts by K-means clusters (Seasonal Count axis is in log scale)')
plt.ylabel('Off-Season Counts')

Let's see what products are there in Cluster 8 (It has great sales during both season and off-season)

In [None]:
df[df['StockCode'].isin(x[x['cluster']==8]['index'])]['ProductName'].unique()

Now let's look at the quantile based groupings.<br>
LOGIC: Based on the seasonal and off-seasonal counts for each ```StockCode```, we will create groups called ```Q-Regions``` and see how each region's sales occurred over the whole of 2011

| Quantile Region | Seasonal Sale (X-Axis) | Off-Seasonal Sale (Y-Axis) | Comment |
| --- | --- | --- | --- |
| 1 | High Sales | High Sales | Ever-green Products |
| 2 | Low and Moderate Sales | High Sales | Off-Seasonal Products (Day-to-day products) |
| 3 | Low and Moderate Sales | Low and Moderate Sales | Rarely and often used products |
| 4 | High Sales | Low and Moderate Sales | Seasonal Products/Giftings |

<br>
Note that because of the choice of quantile point, the distribution will be highly skewed towards Low and Moderate Sales. Such a high quantile is chosen so that the Region-1 products are segregated effectively
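
The four regions in the table can be assigned in one step with `np.select`. This is a minimal sketch on made-up counts. The `toy` frame and the 0.80 cut-off are assumptions, and the cut-off is deliberately lower than the notebook's so that the small example fills every region.

```python
import numpy as np
import pandas as pd

# Hypothetical per-product counts
toy = pd.DataFrame({
    "Season":     [1, 2, 3, 4, 100, 90, 2, 3, 1, 2],
    "Off-Season": [1, 2, 3, 200, 150, 2, 3, 1, 2, 4],
})

q = 0.80  # illustrative cut-off only
s_cut = toy["Season"].quantile(q)
o_cut = toy["Off-Season"].quantile(q)

high_s = toy["Season"] > s_cut        # high seasonal sales
high_o = toy["Off-Season"] > o_cut    # high off-seasonal sales

# Conditions are listed in the same order as the regions in the table
toy["Q-Region"] = np.select(
    [high_s & high_o, ~high_s & high_o, ~high_s & ~high_o, high_s & ~high_o],
    [1, 2, 3, 4],
)
print(toy["Q-Region"].value_counts().sort_index())
```

As the note above predicts, most products land in region 3 even on this tiny example, because only values above the cut-off count as high.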

In [None]:
QUANTILE = 0.90
# Compute each cut-off once instead of re-evaluating the quantile in every condition
season_cut = x['Season'].quantile(QUANTILE)
off_season_cut = x['Off-Season'].quantile(QUANTILE)
print(season_cut)
print(off_season_cut)
high_season = x['Season'] > season_cut
high_off_season = x['Off-Season'] > off_season_cut
x.loc[:,'Q-Region'] = 0
x.loc[high_season & high_off_season,'Q-Region'] = 1     # high in both periods
x.loc[~high_season & high_off_season,'Q-Region'] = 2    # high off-season only
x.loc[~high_season & ~high_off_season,'Q-Region'] = 3   # low/moderate in both periods
x.loc[high_season & ~high_off_season,'Q-Region'] = 4    # high in season only

In [None]:
plt.figure(figsize=(15,10))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
g1 = sns.scatterplot(x=x['Season'],y=x['Off-Season'],hue=x['Q-Region'],palette="deep")
# g1.set_xscale("log")
plt.title('StockCodes: Seasonal and Off-seasonal counts by Quantile Regions')
plt.xlabel('Season Counts')
plt.ylabel('Off-Season Counts')

In [None]:
df = df.merge(x,left_on=['StockCode'],right_on=['index']).drop(['index','Season','Off-Season'],axis=1).rename({'cluster':'ProductCluster'},axis=1)

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=1,figsize=(15,14))
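ddf = df[(df['Quantity']>0)&(df["InvoiceDate"].dt.year==2011)]
# Series.dt.week was removed in pandas 2.0; isocalendar().week is its replacement
weeks = ddf["InvoiceDate"].dt.isocalendar().week.astype(int).rename('InvoiceDate')
# Group the filtered frame itself so that keys and rows refer to the same data
d = ddf.groupby([weeks,'Q-Region']).agg({'TotalPrice':'sum',
                                        'Quantity': 'sum',
                                        'CustomerID': 'count'}).reset_index()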
sns.lineplot(data=d,y='TotalPrice',x='InvoiceDate',hue='Q-Region',palette="deep",ax=ax[0])
sns.lineplot(data=d,y='Quantity',x='InvoiceDate',hue='Q-Region',palette="deep",ax=ax[1])
plt.show()

Let us look at a few products from each quantile region

In [None]:
Q1 = df[df['Q-Region']==1]
Q2 = df[df['Q-Region']==2]
Q3 = df[df['Q-Region']==3]
Q4 = df[df['Q-Region']==4]

In [None]:
Q1.groupby(['StockCode','ProductName'])[['UnitPrice','TotalPrice','Quantity']].sum().sort_values(by='UnitPrice',ascending=False).head(10)

In [None]:
Q2.groupby(['StockCode','ProductName'])[['UnitPrice','TotalPrice','Quantity']].sum().sort_values(by='UnitPrice',ascending=False).head(10)

In [None]:
Q3.groupby(['StockCode','ProductName'])[['UnitPrice','TotalPrice','Quantity']].sum().sort_values(by='UnitPrice',ascending=False).head(10)

In [None]:
Q4.groupby(['StockCode','ProductName'])[['UnitPrice','TotalPrice','Quantity']].sum().sort_values(by='UnitPrice',ascending=False).head(10)

In [None]:
fig, ax = plt.subplots(nrows=10, ncols=1,figsize=(15,70))
ddf = df[(df['Quantity']>0)&(df["InvoiceDate"].dt.year==2011)]
# Series.dt.week was removed in pandas 2.0; isocalendar().week is its replacement
weeks = ddf["InvoiceDate"].dt.isocalendar().week.astype(int).rename('Weeks')
d = ddf.groupby([weeks,'ProductCluster']).agg({'TotalPrice':'sum',
                                              'Quantity': 'sum',
                                              'CustomerID': 'count'}).reset_index()
d = d.rename({'TotalPrice':'Revenue'},axis=1)
# One panel per product cluster, each drawn with its own palette
palettes = ["Greens","Reds","deep","tab10","tab10_r","prism","vlag","RdPu_r","RdPu_r","CMRmap"]
for cluster, palette in enumerate(palettes):
    sns.lineplot(data=d[d['ProductCluster']==cluster],y='Revenue',x='Weeks',hue='ProductCluster',palette=palette,ax=ax[cluster])
# sns.lineplot(data=d,y='Quantity',x='InvoiceDate',hue='ProductCluster',palette="deep",ax=ax[1])
plt.show()

We can see that certain product clusters get a sales spike during the holiday season and at year end, for example clusters 7, 8 and 1.<br>
Moreover, there is a sudden demand for product cluster 3 at the beginning of the year

### Q. What kind of Customers are buying from us? <a class="anchor" id="4.6"></a>
#### Understanding Customer's Purchase Pattern <a class="anchor" id="4.6.1"></a>
Now we can also analyse customers' purchasing patterns, which can give us some insight into who they are. We can do this using **Association Rule Learning** (ARL). These algorithms answer questions like "People who bought this also bought..." or "Customers tend to buy A and B together more often". We will use one of the best-known ARL algorithms, the Apriori algorithm
<br><br>
But first, let us segregate the transactions into baskets by country (creating a single basket of all the transactions would be resource intensive; for the same reason, we avoid creating a basket for the United Kingdom)

In [None]:
def make_basket(country):
    # One row per invoice, one column per product, cell = total quantity bought (0 if absent)
    return df[df['Country']==country].groupby(['InvoiceNo', 'ProductName'])['Quantity'].sum().unstack().fillna(0)

basket_Germany = make_basket("Germany")
basket_EIRE = make_basket("EIRE")
basket_UK = make_basket("UK")
basket_France = make_basket("France")

In [None]:
def encode_units(x):
    # 1 if the product appears in the invoice, 0 otherwise (including negative quantities)
    return 1 if x >= 1 else 0

basket_Germany.drop('POSTAGE',axis=1,inplace=True)
basket_France.drop('POSTAGE',axis=1,inplace=True)

basket_Germany = basket_Germany.applymap(encode_units)
basket_EIRE = basket_EIRE.applymap(encode_units)
basket_UK = basket_UK.applymap(encode_units)
basket_France = basket_France.applymap(encode_units)

Before going further, it is important to understand a few terms.<br>
### Support:
Support of an itemset X is the proportion of transactions in the data in which X appears. It shows the **popularity** of X, i.e. <br>
##### $$\text{Support(X)} = \frac{\text{Number of Transactions in which X appears}}{\text{Total Number of Transactions}} = \frac{freq(X)}{N}$$
In our case, for a given rule A->B (in other words, if A is bought, B is also bought)
##### $$\text{Support(A,B)} = \frac{\text{Number of Transactions in which A and B both appear}}{\text{Total Number of Transactions}} = \frac{freq(A,B)}{N}$$

### Confidence:
Confidence for a given rule A->B is defined as frequency of A and B together in a basket divided by frequency of A:
##### $$\text{Confidence} = \frac{freq(A,B)}{freq(A)}$$
It shows the **likelihood** of item B being purchased when A is purchased. Its drawback is that it takes only the popularity of A into account, not that of B. If B is popular on its own, a transaction containing A is likely to contain B as well, which inflates the confidence even when A and B are unrelated
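
To make the two definitions concrete, here is a minimal sketch that computes support and confidence by hand on made-up baskets. The item names, the `transactions` list and the helper functions are all hypothetical.

```python
# Hypothetical toy baskets: each set is one transaction
transactions = [
    {"clock red", "clock green"},
    {"clock red", "clock green", "snack box"},
    {"clock red"},
    {"snack box"},
    {"clock green", "snack box"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support(A,B) / Support(A) for the rule A -> B."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

sup_a = support({"clock red"}, transactions)
sup_ab = support({"clock red", "clock green"}, transactions)
conf_ab = confidence({"clock red"}, {"clock green"}, transactions)
print(sup_a, sup_ab, round(conf_ab, 3))
```

Here "clock red" appears in 3 of 5 baskets and together with "clock green" in 2 of 5, so the rule has a confidence of 2/3.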

### Lift:
Lift for a given rule A->B is defined as
##### $$\text{Lift} = \frac{Support(A,B)}{Support(A) \times Support(B)}$$
This signifies the likelihood of itemset B being purchased when itemset A is purchased, while taking into account the popularity of B. A lift greater than 1 means that B is likely to be bought together with A, a lift less than 1 means that B is unlikely to be bought if A is bought, and a lift of exactly 1 means there is no association between the two.

### Conviction:
Conviction for a given rule A->B is defined as
##### $$\text{Conviction(A,B)} = \frac{1-Support(B)}{1-Confidence(A,B)}$$
For example, a conviction value of 1.32 means that the rule A->B would be incorrect 32% more often if the association between A and B were purely accidental.
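
As a worked example with made-up numbers (both input values are assumptions chosen only for illustration): if Support(B) = 0.34 and Confidence(A,B) = 0.50, then
##### $$\text{Conviction(A,B)} = \frac{1-0.34}{1-0.50} = \frac{0.66}{0.50} = 1.32$$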


### Leverage:
Leverage for a given rule A->B is defined as
##### $$\text{Leverage(A,B)} = Support(A,B) - Support(A) \times Support(B)$$

" Leverage measures the difference of A and B appearing together in the data set and what would be expected if A and B were statistically independent. The rationale in a sales setting is to find out how many more units (items A and B together) are sold than expected from the independent sales. Using min. leverage thresholds at the same time incorporates an implicit frequency constraint. E.g., for setting a min. leverage threshold to 0.01% (corresponds to 10 occurrences in a data set with 100,000 transactions) one first can use an algorithm to find all itemsets with min. support of 0.01% and then filter the found item sets using the leverage constraint. Because of this property leverage also can suffer from the rare item problem. "

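Lift and leverage compare the same two quantities, once as a ratio and once as a difference. The sketch below shows this on made-up counts. All four counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical counts, chosen only for illustration
N = 100_000        # total number of transactions
freq_a = 2_000     # transactions containing A
freq_b = 1_000     # transactions containing B
freq_ab = 50       # transactions containing both A and B

sup_a, sup_b, sup_ab = freq_a / N, freq_b / N, freq_ab / N

expected_ab = sup_a * sup_b        # joint support if A and B were independent
lift = sup_ab / expected_ab        # ratio of observed to expected
leverage = sup_ab - expected_ab    # difference between observed and expected

print(f"lift     = {lift:.2f}")
print(f"leverage = {leverage:.4f}")
print(f"extra joint transactions vs. independence: {leverage * N:.0f}")

# A minimum leverage of 0.01% on 100,000 transactions corresponds to 10 occurrences
min_occurrences = 0.0001 * N
```

Multiplying the leverage by `N` turns it into a count: the number of joint transactions beyond what independence would predict.
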
Here are some good reads on the basics of the Apriori algorithm: <br>
https://michael.hahsler.net/research/recommender/associationrules.html<br>
https://www.hackerearth.com/blog/developers/beginners-tutorial-apriori-algorithm-data-mining-r-implementation/

In [None]:
frequent_itemsets = apriori(basket_France, min_support=0.07, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.sort_values(['lift','support'],ascending=False).reset_index(drop=True)

In [None]:
frequent_itemsets = apriori(basket_Germany, min_support=0.07, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.sort_values(['lift','support'],ascending=False).reset_index(drop=True)

Hmm... It is interesting to note that in France, (ALARM CLOCK BAKELIKE RED ) and (ALARM CLOCK BAKELIKE GREEN) are very frequently bought together. Similarly, (ALARM CLOCK BAKELIKE RED ) and (ALARM CLOCK BAKELIKE PINK) are often bought together.
Also, in Germany, (ROUND SNACK BOXES SET OF 4 FRUITS ) and (ROUND SNACK BOXES SET OF4 WOODLAND ) are most frequently bought together.
It is worth noting that the items in each pair are **essentially the same product in different variants, i.e. ALARM CLOCK and SNACK BOXES**. 

##### This purchasing pattern suggests that the majority of customers are more likely to be wholesale retailers than individuals. 

# *That's All Folks!*