<h1 style="text-align:center; font-size:250%; font-family:Arial;"><b>Market Basket Analysis</b></h1> 

<h2 style="text-align:left; font-family:Arial;"><b>1. Importing Necessary Dependencies</b></h2> 

In [1]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import networkx as nx

<h2 style="text-align:left; font-family:Arial;"><b>2. Loading and Reading Dataset</b></h2> 

In [2]:
bakeryDF=pd.read_csv("Bakery.csv")
bakeryDF.head()

Unnamed: 0,TransactionNo,Items,DateTime,Daypart,DayType
0,1,Bread,2016-10-30 09:58:11,Morning,Weekend
1,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend
2,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend
3,3,Hot chocolate,2016-10-30 10:07:57,Morning,Weekend
4,3,Jam,2016-10-30 10:07:57,Morning,Weekend


In [3]:
print("Database dimension :", bakeryDF.shape)
print("Database size      :", bakeryDF.size)

Database dimension : (20507, 5)
Database size      : 102535


In [4]:
bakeryDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20507 entries, 0 to 20506
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   TransactionNo  20507 non-null  int64 
 1   Items          20507 non-null  object
 2   DateTime       20507 non-null  object
 3   Daypart        20507 non-null  object
 4   DayType        20507 non-null  object
dtypes: int64(1), object(4)
memory usage: 801.2+ KB


In [5]:
bakeryDF['TransactionNo'].nunique()

9465

In [6]:
bakeryDF.describe(include=object)

Unnamed: 0,Items,DateTime,Daypart,DayType
count,20507,20507,20507,20507
unique,94,9465,4,2
top,Coffee,2017-11-02 14:08:27,Afternoon,Weekday
freq,5471,11,11569,12807


<h2 style="text-align:left; font-family:Arial;"><b>Data Summary:</b></h2>

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#E8F6EF;  
            font-size:100%;
            theme: cosmo;
            letter-spacing:0.5px">
<h3 style="padding-left: 20px; padding-top: 20px; color:#4a4a4a; font-family:Arial;"><b>Overview</b>
</h3>
<p style="padding-left: 20px; padding-right: 20px; color:#4a4a4a; font-size:110%;">
The dataset contains the transaction details of items purchased online from the bakery between 2016 and 2017. It has <b>20507</b> entries across <b>9465</b> transactions, and 5 columns.
</p>
<ul style="padding-left: 40px; padding-bottom: 20px; color:#4a4a4a; font-size:110%;">
    <li>Number of variables: 5</li>
    <li>Numeric variables: 1</li>
    <li>Categorical variables: 4</li>
    <li>Number of observations: 20507</li>
    <li>Total number of transactions: 9465</li>
    <li>Missing cells : 0</li>
</ul>
</div>

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#E8F6EF;  
            font-size:100%;
            theme:cosmo;
            letter-spacing:0.5px">
<h3 style="padding-left: 20px; padding-top: 20px; color:#4a4a4a; font-family:Arial;"><b>Variables</b>
</h3>
<ul style="padding-left: 40px; padding-bottom: 20px; color:#4a4a4a; font-size:110%;">
    <li><code>TransactionNo</code> : <b>9465</b> distinct values</li>
    <li><code><b>Items</b></code> has a high cardinality: <b>94</b> distinct values</li>
    <li><code><b>DateTime</b></code> has a high cardinality: <b>9465</b> distinct values</li>
    <li><code><b>Daypart</b></code> has <b>4</b> distinct values</li>
    <li><code><b>DayType</b></code> has <b>2</b> distinct values</li>
</ul>
</div>
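<p style="color:#4a4a4a; font-size:110%;">The figures in the summary can be derived directly from the dataframe. The snippet below is a minimal sketch of one way to do it; <code>sample</code> is a hypothetical stand-in for <code>bakeryDF</code> built from the five rows printed by <code>head()</code>, so the numbers it prints describe that sample only.</p>

```python
import pandas as pd

# Hypothetical stand-in for bakeryDF: the five rows shown by head() above.
sample = pd.DataFrame({
    "TransactionNo": [1, 2, 2, 3, 3],
    "Items": ["Bread", "Scandinavian", "Scandinavian", "Hot chocolate", "Jam"],
    "DateTime": ["2016-10-30 09:58:11", "2016-10-30 10:05:34", "2016-10-30 10:05:34",
                 "2016-10-30 10:07:57", "2016-10-30 10:07:57"],
    "Daypart": ["Morning"] * 5,
    "DayType": ["Weekend"] * 5,
})

overview = {
    "variables": sample.shape[1],
    "observations": sample.shape[0],
    "numeric": sample.select_dtypes(include="number").shape[1],
    "categorical": sample.select_dtypes(exclude="number").shape[1],
    "transactions": sample["TransactionNo"].nunique(),
    "missing_cells": int(sample.isna().sum().sum()),
}

# Distinct values per column, as listed under "Variables".
distinct = sample.nunique().to_dict()

print(overview)
print(distinct)
```

<p style="color:#4a4a4a; font-size:110%;">Running the same expressions on <code>bakeryDF</code> should reproduce the values reported by <code>shape</code>, <code>info()</code> and <code>describe()</code> above.</p>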

<h2 style="text-align:left; font-family:Arial;"><b>3. Data Exploration and Visualization</b></h2> 

<h3 style="text-align:left; font-family:Arial;"><b>3.1 Let's look into the frequent items and the best sellers</b></h3> 

In [7]:
itemFrequency = bakeryDF['Items'].value_counts()  # value_counts() already sorts in descending order
itemFrequency.head(10)

Coffee           5471
Bread            3325
Tea              1435
Cake             1025
Pastry            856
Sandwich          771
Medialuna         616
Hot chocolate     590
Cookies           540
Brownie           379
Name: Items, dtype: int64

In [8]:
fig = px.bar(itemFrequency.head(20), title='20 Most Frequent Items', color=itemFrequency.head(20), color_continuous_scale=px.colors.sequential.Mint)
fig.update_layout(margin=dict(t=50, b=0, l=0, r=0), titlefont=dict(size=20), xaxis_tickangle=-45, plot_bgcolor='white', coloraxis_showscale=False)
fig.update_yaxes(showticklabels=False, title=' ')
fig.update_xaxes(title=' ')
fig.update_traces(texttemplate='%{y}', textposition='outside', hovertemplate = '<b>%{x}</b><br>No. of Transactions: %{y}')
fig.show()

<p style="color:#4a4a4a; font-size:110%;">Coffee is the best-selling product by far, followed by bread and tea.</p>

<h3 style="text-align:left; font-family:Arial;"><b>3.2 Let's look into the peak hours of sales</b></h3> 

In [9]:
peakHours = bakeryDF.groupby('Daypart')['Items'].count().sort_values(ascending=False)
peakHours

Daypart
Afternoon    11569
Morning       8404
Evening        520
Night           14
Name: Items, dtype: int64

In [10]:
fig = go.Figure(data=[go.Pie(labels=peakHours.index,  # take labels from the data so they always match the values
                values=peakHours, title="Peak Selling Hours",titlefont=dict(size=20), textinfo='label+percent', marker=dict(colors=px.colors.sequential.Mint), hole=.5)])
fig.update_layout(margin=dict(t=40, b=40, l=0, r=0), font=dict(size=13), showlegend=False)
fig.show()

<p style="color:#4a4a4a; font-size:110%;">The bakery makes most of its sales in the afternoon, which accounts for over 56% of all items sold. Morning sales are also substantial, while sales fall sharply in the evening and at night.</p>
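<p style="color:#4a4a4a; font-size:110%;">The percentages shown in the pie chart follow from the counts in <code>peakHours</code>. As a quick check, the sketch below recomputes them; the dictionary <code>peak_hours</code> is an illustrative copy of the printed output, not a new object in the analysis.</p>

```python
# Row counts per Daypart, copied from the peakHours output above.
peak_hours = {"Afternoon": 11569, "Morning": 8404, "Evening": 520, "Night": 14}

total = sum(peak_hours.values())  # equals the number of rows in the dataset
shares = {part: 100 * count / total for part, count in peak_hours.items()}

for part, pct in shares.items():
    print(f"{part:<10}{pct:6.2f}%")
```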

<h3 style="text-align:left; font-family:Arial;"><b>3.3 Further let's look into the monthly and weekly sales</b></h3> 

<p style="color:#4a4a4a; font-size:110%;">We first need to extract the day, month, and year from the <code>DateTime</code> column for further analysis.</p>

In [11]:
dateTime=pd.to_datetime(bakeryDF['DateTime'])
bakeryDF['Day']=dateTime.dt.day_name()
bakeryDF['Month']=dateTime.dt.month_name()
bakeryDF['Year']=dateTime.dt.year
bakeryDF.head(5)

Unnamed: 0,TransactionNo,Items,DateTime,Daypart,DayType,Day,Month,Year
0,1,Bread,2016-10-30 09:58:11,Morning,Weekend,Sunday,October,2016
1,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend,Sunday,October,2016
2,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend,Sunday,October,2016
3,3,Hot chocolate,2016-10-30 10:07:57,Morning,Weekend,Sunday,October,2016
4,3,Jam,2016-10-30 10:07:57,Morning,Weekend,Sunday,October,2016
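<p style="color:#4a4a4a; font-size:110%;">The day and month names depend on how <code>pd.to_datetime</code> interprets each string. A minimal sketch of a stricter parse is shown below. It assumes that every value follows the year-month-day layout seen in the rows above, which has not been verified for the whole file; the series <code>raw</code> is an illustrative sample.</p>

```python
import pandas as pd

# Timestamps copied from the first rows shown above.
raw = pd.Series(["2016-10-30 09:58:11", "2016-10-30 10:05:34", "2016-10-30 10:07:57"])

# Assumption: every value in the column uses this year-month-day layout.
# With an explicit format, a value in any other layout raises an error
# instead of being interpreted by inference.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

print(parsed.dt.day_name().tolist())
print(parsed.dt.month_name().tolist())
print(parsed.dt.year.tolist())
```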


In [12]:
mpd = bakeryDF.groupby('Day')['Items'].count().sort_values(ascending=False)
mpd

Day
Saturday     3554
Friday       3266
Sunday       3118
Monday       3035
Tuesday      2645
Thursday     2601
Wednesday    2288
Name: Items, dtype: int64

In [13]:
fig = px.bar(mpd, title='Most Productive Day', color=mpd, color_continuous_scale=px.colors.sequential.Mint)
fig.update_layout(margin=dict(t=50, b=0, l=0, r=0), titlefont=dict(size=20), xaxis_tickangle=0, plot_bgcolor='white', coloraxis_showscale=False)
fig.update_yaxes(showticklabels=False, title=' ')
fig.update_xaxes(title=' ')
fig.update_traces(texttemplate='%{y}', textposition='outside', hovertemplate = '<b>%{x}</b><br>No. of Transactions: %{y}')
fig.show()

<p style="color:#4a4a4a; font-size:110%;">Saturday is the busiest day with 3554 items sold, followed by Friday (3266) and Sunday (3118). Sales are lower from Monday to Thursday, with Wednesday the quietest day (2288).</p>
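<p style="color:#4a4a4a; font-size:110%;">Sorting by volume makes the ranking clear but hides the shape of the week. One possible alternative, sketched below, is to reindex the same counts into calendar order before plotting. The series here is an illustrative copy of the <code>mpd</code> output, and <code>day_order</code> and <code>mpd_by_weekday</code> are names introduced only for this example.</p>

```python
import pandas as pd

# Row counts per day, copied from the mpd output above.
mpd = pd.Series(
    {"Saturday": 3554, "Friday": 3266, "Sunday": 3118, "Monday": 3035,
     "Tuesday": 2645, "Thursday": 2601, "Wednesday": 2288},
    name="Items",
)

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Same counts, rearranged from Monday to Sunday.
mpd_by_weekday = mpd.reindex(day_order)
print(mpd_by_weekday)
```

<p style="color:#4a4a4a; font-size:110%;">Passing <code>mpd_by_weekday</code> to <code>px.bar</code> instead of <code>mpd</code> would draw the bars from Monday to Sunday.</p>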

In [14]:
mpm = bakeryDF.groupby('Month')['Items'].count().sort_values(ascending=False)
mpm

Month
March        3220
November     3076
January      3027
February     2748
December     2647
April        1048
October      1041
May           924
July          741
June          739
August        700
September     596
Name: Items, dtype: int64

In [15]:
fig = px.bar(mpm, title='Most Productive Month', color=mpm, color_continuous_scale=px.colors.sequential.Mint)
fig.update_layout(margin=dict(t=50, b=0, l=0, r=0), titlefont=dict(size=20), xaxis_tickangle=0, plot_bgcolor='white', coloraxis_showscale=False)
fig.update_yaxes(showticklabels=False, title=' ')
fig.update_xaxes(title=' ')
fig.update_traces(texttemplate='%{y}', textposition='outside', hovertemplate = '<b>%{x}</b><br>No. of Transactions: %{y}')
fig.show()

<p style="color:#4a4a4a; font-size:110%;">The bakery does most of its business from November to March: these five months account for roughly 72% of all items sold.</p>
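<p style="color:#4a4a4a; font-size:110%;">To put a number on the seasonal pattern, the sketch below adds up the November to March counts and compares them with the total. The dictionary is an illustrative copy of the <code>mpm</code> output above.</p>

```python
# Row counts per month, copied from the mpm output above.
mpm = {"March": 3220, "November": 3076, "January": 3027, "February": 2748,
       "December": 2647, "April": 1048, "October": 1041, "May": 924,
       "July": 741, "June": 739, "August": 700, "September": 596}

busy_months = ["November", "December", "January", "February", "March"]
busy_total = sum(mpm[m] for m in busy_months)
all_total = sum(mpm.values())
share = busy_total / all_total

print(f"November to March: {busy_total} of {all_total} rows ({share:.1%})")
```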

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#E8F6EF;  
            font-size:100%;
            theme:cosmo;
            letter-spacing:0.5px">
<h3 style="padding-left: 20px; padding-top: 20px; color:#4a4a4a; font-family:Arial;"><b>EDA Summary:</b></h3>
<p style="padding-left: 20px; padding-bottom: 20px; color:#4a4a4a; font-size:110%;"> 
<b>Coffee</b> is the best-selling product by far, followed by <b>bread</b> and <b>tea</b>. The bakery makes most of its sales in the afternoon, which accounts for over <b>56%</b> of all items sold. Morning sales are also substantial, while sales fall sharply in the evening and at night. Saturday is the busiest day, followed by Friday and Sunday, and Wednesday is the quietest. The bakery does most of its business from November to March.
</p>
</div>

<h2 style="text-align:left; font-family:Arial;"><b>4. Association Rules Generation</b></h2> 

<h3 style="text-align:left; font-family:Arial;"><b>4.1 Data Preparation for Association Rule Mining</b></h3>
<p style="color:#4a4a4a; font-size:110%;">The Apriori algorithm requires a dataframe in which every transaction is one-hot encoded over all items: one row per transaction and one boolean column per item.</p>

- <h4 style="text-align:left; font-family:Arial;"><b>list of all the transactions</b></h4>

In [16]:
transactions=[]
# Build one list of distinct items per transaction number
for txn in bakeryDF['TransactionNo'].unique():
    lst=list(set(bakeryDF[bakeryDF['TransactionNo']==txn]['Items']))
    transactions.append(lst)

transactions[0:10]

[['Bread'],
 ['Scandinavian'],
 ['Hot chocolate', 'Jam', 'Cookies'],
 ['Muffin'],
 ['Bread', 'Coffee', 'Pastry'],
 ['Muffin', 'Pastry', 'Medialuna'],
 ['Tea', 'Coffee', 'Pastry', 'Medialuna'],
 ['Bread', 'Pastry'],
 ['Muffin', 'Bread'],
 ['Scandinavian', 'Medialuna']]
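<p style="color:#4a4a4a; font-size:110%;">The loop above filters the whole dataframe once per transaction. A possible alternative is a single <code>groupby</code>, sketched below on a hypothetical stand-in for <code>bakeryDF</code> built from the first five rows. The name <code>baskets</code> is introduced only for this example, and sorting the items is a choice made here to get a deterministic order within each basket.</p>

```python
import pandas as pd

# Hypothetical stand-in for bakeryDF: the five rows shown by head() earlier.
sample = pd.DataFrame({
    "TransactionNo": [1, 2, 2, 3, 3],
    "Items": ["Bread", "Scandinavian", "Scandinavian", "Hot chocolate", "Jam"],
})

# One pass over the data: collect the distinct items of each transaction.
# sorted() makes the item order within a basket deterministic.
baskets = (
    sample.groupby("TransactionNo")["Items"]
    .apply(lambda items: sorted(set(items)))
    .tolist()
)

print(baskets)
```

<p style="color:#4a4a4a; font-size:110%;">Because <code>TransactionEncoder</code> only records which items are present, the order of items inside a basket does not affect the encoded result.</p>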

- <h4 style="text-align:left; font-family:Arial;"><b>one hot encoding</b></h4>

In [17]:
te = TransactionEncoder()
encodedData = te.fit(transactions).transform(transactions)
data = pd.DataFrame(encodedData, columns=te.columns_)
data.head()

Unnamed: 0,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,...,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


<h3 style="text-align:left; font-family:Arial;"><b>4.2 Association Rules Generation</b></h3>

- <h4 style="text-align:left; font-family:Arial;"><b>frequent items</b></h4>

In [18]:
frequentItems= apriori(data, use_colnames=True, min_support=0.02)
frequentItems.head()

Unnamed: 0,support,itemsets
0,0.036344,(Alfajores)
1,0.327205,(Bread)
2,0.040042,(Brownie)
3,0.103856,(Cake)
4,0.478394,(Coffee)
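<p style="color:#4a4a4a; font-size:110%;">The <code>support</code> column is the fraction of transactions that contain the itemset. The sketch below computes it by hand on the first ten transactions listed earlier, and also shows how many of the 9465 transactions an itemset needs in order to pass <code>min_support=0.02</code>. The helper <code>support</code> is written for this illustration and is not part of mlxtend.</p>

```python
import math

# The first ten transactions from the list printed earlier.
transactions = [
    ["Bread"],
    ["Scandinavian"],
    ["Hot chocolate", "Jam", "Cookies"],
    ["Muffin"],
    ["Bread", "Coffee", "Pastry"],
    ["Muffin", "Pastry", "Medialuna"],
    ["Tea", "Coffee", "Pastry", "Medialuna"],
    ["Bread", "Pastry"],
    ["Muffin", "Bread"],
    ["Scandinavian", "Medialuna"],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


support_bread = support({"Bread"}, transactions)
support_pastry = support({"Pastry"}, transactions)
support_both = support({"Bread", "Pastry"}, transactions)
print(support_bread, support_pastry, support_both)

# Smallest number of transactions that reaches a support of 0.02 out of 9465.
min_count = math.ceil(0.02 * 9465)
print(min_count)
```

<p style="color:#4a4a4a; font-size:110%;">The values printed for this ten-row sample differ from the table above, which is computed over all transactions.</p>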


- <h4 style="text-align:left; font-family:Arial;"><b>association rules</b></h4>

In [19]:
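rules = association_rules(frequentItems, metric="lift", min_threshold=1)
# Convert each frozenset to a readable string; joining keeps every item if an itemset has more than one
rules.antecedents = rules.antecedents.apply(lambda x: ', '.join(sorted(x)))
rules.consequents = rules.consequents.apply(lambda x: ', '.join(sorted(x)))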
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,Bread,Pastry,0.327205,0.086107,0.02916,0.089119,1.034977,0.000985,1.003306
1,Pastry,Bread,0.086107,0.327205,0.02916,0.33865,1.034977,0.000985,1.017305
2,Cake,Coffee,0.103856,0.478394,0.054728,0.526958,1.101515,0.005044,1.102664
3,Coffee,Cake,0.478394,0.103856,0.054728,0.114399,1.101515,0.005044,1.011905
4,Cake,Tea,0.103856,0.142631,0.023772,0.228891,1.604781,0.008959,1.111865
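<p style="color:#4a4a4a; font-size:110%;">Each metric in the table can be derived from the three support values. As a worked example, the sketch below recomputes confidence, lift, leverage and conviction for the rule Cake → Tea using the standard definitions. The inputs are the rounded values from row 4, so the results match the table only up to rounding.</p>

```python
# Values for the rule Cake -> Tea, copied from row 4 of the table above.
support_cake = 0.103856   # antecedent support
support_tea = 0.142631    # consequent support
support_both = 0.023772   # support of {Cake, Tea}

# Share of cake transactions that also contain tea.
confidence = support_both / support_cake

# How much more often tea appears with cake than it does overall.
lift = confidence / support_tea

# Observed joint support minus the support expected under independence.
leverage = support_both - support_cake * support_tea

# Ratio of expected to observed frequency of "cake without tea".
conviction = (1 - support_tea) / (1 - confidence)

print(f"confidence={confidence:.6f} lift={lift:.6f} "
      f"leverage={leverage:.6f} conviction={conviction:.6f}")
```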


<h3 style="text-align:left; font-family:Arial;"><b>4.3 Rules Visualization</b></h3> 

In [20]:
network_A = list(rules["antecedents"].unique())
network_B = list(rules["consequents"].unique())
node_list = list(set(network_A + network_B))
G = nx.Graph()
for i in node_list:
    G.add_node(i)
for i,j in rules.iterrows():
    G.add_edges_from([(j["antecedents"], j["consequents"])])
pos = nx.spring_layout(G, k=0.5, dim=2, iterations=400)
for n, p in pos.items():
    G.nodes[n]['pos'] = p

edge_trace = go.Scatter(x=[], y=[], line=dict(width=0.5, color='#888'), hoverinfo='none', mode='lines')

for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])

node_trace = go.Scatter(x=[], y=[], text=[], mode='markers', hoverinfo='text',
    marker=dict(showscale=True, colorscale='Burg', reversescale=True, color=[], size=15,
    colorbar=dict(thickness=10, title='Node Connections', xanchor='left', titleside='right')))

for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])

for node, adjacencies in enumerate(G.adjacency()):
    node_trace['marker']['color']+=tuple([len(adjacencies[1])])
    node_info = str(adjacencies[0]) +'<br>No of Connections: {}'.format(str(len(adjacencies[1])))
    node_trace['text']+=tuple([node_info])

fig = go.Figure(data=[edge_trace, node_trace], 
    layout=go.Layout(title='Item Connections Network', titlefont=dict(size=20),
    plot_bgcolor='white', showlegend=False, margin=dict(b=0,l=0,r=0,t=50),
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))

iplot(fig)

<h2 style="text-align:left; font-family:Arial;"><b>5. Refining Rules</b></h2> 

<p style="color:#4a4a4a; font-size:110%;">Confidence tends to be high whenever the consequent is very frequent, even if the association is weak, so on its own it does not give a clear picture. Here, coffee is by far the most frequent item and the best seller, so it could be recommended alongside any other item anyway. We therefore drop the rules that recommend coffee to focus on the less obvious associations in the data.</p>
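<p style="color:#4a4a4a; font-size:110%;">A small comparison illustrates the point. The sketch below takes two rules from the table in section 4.2 and sets each confidence against the consequent's overall support. Cake → Coffee has the higher confidence, yet coffee already appears in about 48% of all transactions, so its lift is lower than that of Cake → Tea. The names <code>rules_sample</code> and <code>lifts</code> are introduced only for this example.</p>

```python
# (antecedent, consequent, confidence, consequent support),
# copied from rows 2 and 4 of the rules table in section 4.2.
rules_sample = [
    ("Cake", "Coffee", 0.526958, 0.478394),
    ("Cake", "Tea", 0.228891, 0.142631),
]

lifts = {}
for antecedent, consequent, confidence, baseline in rules_sample:
    # Lift compares the confidence with how often the consequent occurs anyway.
    lifts[consequent] = confidence / baseline
    print(f"{antecedent} -> {consequent}: confidence={confidence:.3f}, "
          f"baseline={baseline:.3f}, lift={lifts[consequent]:.3f}")
```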

In [21]:
index_names = rules[rules['consequents'] == 'Coffee'].index
refinedRules = rules.drop(index_names).sort_values('lift', ascending=False)
refinedRules.drop(['leverage','conviction'], axis=1, inplace=True)
refinedRules = refinedRules.reset_index()
refinedRules

Unnamed: 0,index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift
0,5,Tea,Cake,0.142631,0.103856,0.023772,0.166667,1.604781
1,4,Cake,Tea,0.103856,0.142631,0.023772,0.228891,1.604781
2,19,Coffee,Toast,0.478394,0.033597,0.023666,0.04947,1.472431
3,12,Coffee,Medialuna,0.478394,0.061807,0.035182,0.073542,1.189878
4,14,Coffee,Pastry,0.478394,0.086107,0.047544,0.099382,1.154168
5,10,Coffee,Juice,0.478394,0.038563,0.020602,0.043065,1.11675
6,16,Coffee,Sandwich,0.478394,0.071844,0.038246,0.079947,1.112792
7,3,Coffee,Cake,0.478394,0.103856,0.054728,0.114399,1.101515
8,6,Coffee,Cookies,0.478394,0.054411,0.028209,0.058966,1.083723
9,9,Coffee,Hot chocolate,0.478394,0.05832,0.029583,0.061837,1.060311


<h2 style="text-align:left; font-family:Arial;"><b>Summary:</b></h2> 

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#EEF7FA;  
            font-size:100%;
            theme:cosmo;
            letter-spacing:0.5px">
<h3 style="padding-left: 20px; padding-top: 20px; color:#4a4a4a; font-family:Arial;"><b>Insights</b>
</h3>
<ul style="padding-left: 40px; padding-bottom: 20px; color:#4a4a4a; font-size:110%;">
<li>Coffee is the bestseller of this bakery and it shows association with <b>8</b> other items.</li>
<li>Over <b>11%</b> of coffee buyers also buy cake, while almost <b>10%</b> of them also buy pastry.</li>
<li>Over <b>16%</b> of tea buyers also buy cake, and over <b>22%</b> of cake buyers also buy tea.</li>
<li>Over <b>33%</b> of pastry buyers also buy bread, while nearly <b>9%</b> of bread buyers also buy pastry.</li>
</ul>
</div>

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#F9ECEC;  
            font-size:100%;
            theme:cosmo;
            letter-spacing:0.5px">
<h3 style="padding-left: 20px; padding-top: 20px; color:#4a4a4a; font-family:Arial;"><b>Business Strategy</b>
</h3>
<p style="padding-left: 20px; color:#4a4a4a; font-size:110%;">
There are a couple of strategies the bakery can adopt, if it has not done so already, to increase its sales, given the associations we have seen between coffee and its 8 partner items.
</p>
<ul style="padding-left: 40px; padding-bottom: 20px; color:#4a4a4a; font-size:110%;">
<li>Promotional discounts on these items can entice customers to buy coffee or the other way round.</li>
<li>Placing these items close to the coffee ordering counter can be a good way to tempt customers into buying them.</li>
</ul>
</div>