



# Market Basket Analysis (Frequent Itemset Generation, Apriori Algorithm & Confidence-Based Pruning; Evaluation of Association Rules by Different Measures)

**Problem Statement**: A small dataset of hypothetical grocery transactions (baskets) is stored in `groceries`. Perform market basket analysis on this data.

In [None]:
import pandas as pd
import numpy as np

In [None]:
from mlxtend.frequent_patterns import association_rules, apriori
from mlxtend.preprocessing import TransactionEncoder

In [None]:
# Reading CSV file
groceries = pd.read_csv('groceries).csv') 
groceries.head()

Unnamed: 0,ID,Transaction
0,0.0,"milk,bread,biscuit"
1,1.0,"bread,milk,biscuit,cereal"
2,2.0,"bread,tea"
3,3.0,"jam,bread,milk"
4,4.0,"tea,biscuit"


In [None]:
# Dropping unwanted rows (i.e 20 & 21)
groceries.drop([20,21],inplace=True)

In [None]:
groceries.head()

Unnamed: 0,ID,Transaction
0,0.0,"milk,bread,biscuit"
1,1.0,"bread,milk,biscuit,cereal"
2,2.0,"bread,tea"
3,3.0,"jam,bread,milk"
4,4.0,"tea,biscuit"


In [None]:
# Get all the transactions as a list of sorted item lists
transcactions = list(groceries['Transaction'].apply(lambda x: sorted(x.split(','))))
transcactions

[['biscuit', 'bread', 'milk'],
 ['biscuit', 'bread', 'cereal', 'milk'],
 ['bread', 'tea'],
 ['bread', 'jam', 'milk'],
 ['biscuit', 'tea'],
 ['bread', 'tea'],
 ['cereal', 'tea'],
 ['biscuit', 'bread', 'tea'],
 ['bread', 'jam', 'tea'],
 ['bread', 'milk'],
 ['biscuit', 'cereal', 'coffee', 'orange'],
 ['biscuit', 'cereal', 'coffee', 'orange'],
 ['coffee', 'sugar'],
 ['bread', 'coffee', 'orange'],
 ['biscuit', 'bread', 'sugar'],
 ['cereal', 'coffee', 'sugar'],
 ['biscuit', 'bread', 'sugar'],
 ['bread', 'coffee', 'sugar'],
 ['bread', 'coffee', 'sugar'],
 ['cereal', 'coffee', 'milk', 'tea']]

In [None]:
# Fit the transaction encoder and one-hot encode the transactions
encoder = TransactionEncoder().fit(transcactions)
onehot = encoder.transform(transcactions)

# convert one-hot encode data to DataFrame
onehot = pd.DataFrame(onehot, columns=encoder.columns_)

In [None]:
onehot.head()

Unnamed: 0,biscuit,bread,cereal,coffee,jam,milk,orange,sugar,tea
0,True,True,False,False,False,True,False,False,False
1,True,True,True,False,False,True,False,False,False
2,False,True,False,False,False,False,False,False,True
3,False,True,False,False,True,True,False,False,False
4,True,False,False,False,False,False,False,False,True


Here, we encode the transactions as a boolean table with one row per basket and one column per item, which is the input format that `apriori` expects.
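
As a rough illustration of what the encoder produces, the sketch below builds the same kind of boolean table by hand for the first three baskets. The helper name `one_hot` is made up for this example and is not part of mlxtend.

```python
def one_hot(baskets):
    """Return (columns, rows): sorted item names and one boolean row per basket."""
    columns = sorted({item for basket in baskets for item in basket})
    rows = [[item in basket for item in columns] for basket in baskets]
    return columns, rows

# First three baskets from the transaction list above
baskets = [
    ["biscuit", "bread", "milk"],
    ["biscuit", "bread", "cereal", "milk"],
    ["bread", "tea"],
]
columns, rows = one_hot(baskets)
print(columns)
for row in rows:
    print(row)
```

Each printed row corresponds to one basket, and a `True` marks an item that the basket contains.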

In [None]:
# compute frequent items using the Apriori algorithm - Get up to three items
frequent_itemsets = apriori(onehot, min_support = 0.001, max_len = 3, use_colnames=True)
frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.4,(biscuit)
1,0.65,(bread)
2,0.3,(cereal)
3,0.4,(coffee)
4,0.1,(jam)




Support is the fraction of transactions that contain an itemset.
For example, a support of 0.65 for bread means that 65% of the baskets include bread.
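
The support values can be checked by counting baskets directly. The sketch below does this for the 20 grocery baskets listed earlier; the function name `support` is chosen for this example only.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(basket) for basket in baskets) / len(baskets)

# The 20 grocery baskets listed earlier, one comma-separated string per basket
raw = [
    "biscuit,bread,milk", "biscuit,bread,cereal,milk", "bread,tea",
    "bread,jam,milk", "biscuit,tea", "bread,tea", "cereal,tea",
    "biscuit,bread,tea", "bread,jam,tea", "bread,milk",
    "biscuit,cereal,coffee,orange", "biscuit,cereal,coffee,orange",
    "coffee,sugar", "bread,coffee,orange", "biscuit,bread,sugar",
    "cereal,coffee,sugar", "biscuit,bread,sugar", "bread,coffee,sugar",
    "bread,coffee,sugar", "cereal,coffee,milk,tea",
]
baskets = [line.split(",") for line in raw]

print(support(baskets, {"bread"}))              # 13 of the 20 baskets
print(support(baskets, {"biscuit", "cereal"}))  # 3 of the 20 baskets
```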

In [None]:
# compute all association rules for frequent_itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(cereal),(biscuit),0.3,0.4,0.15,0.5,1.25,0.03,1.2
1,(biscuit),(cereal),0.4,0.3,0.15,0.375,1.25,0.03,1.12
2,(milk),(biscuit),0.25,0.4,0.1,0.4,1.0,0.0,1.0
3,(biscuit),(milk),0.4,0.25,0.1,0.25,1.0,0.0,1.0
4,(orange),(biscuit),0.15,0.4,0.1,0.666667,1.666667,0.04,1.8


In [None]:
# Raise the display limit so that every rule is shown
pd.set_option('display.max_rows',500)
rules

           antecedents        consequents  antecedent support  \
0             (cereal)          (biscuit)                0.30   
1            (biscuit)           (cereal)                0.40   
2               (milk)          (biscuit)                0.25   
3            (biscuit)             (milk)                0.40   
4             (orange)          (biscuit)                0.15   
5            (biscuit)           (orange)                0.40   
6              (bread)              (jam)                0.65   
7                (jam)            (bread)                0.10   
8              (bread)             (milk)                0.65   
9               (milk)            (bread)                0.25   
10             (sugar)            (bread)                0.30   
11             (bread)            (sugar)                0.65   
12            (coffee)           (cereal)                0.40   
13            (cereal)           (coffee)               

In [None]:
rules.sort_values('confidence',ascending = False)[:25]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
52,"(cereal, orange)",(biscuit),0.1,0.4,0.1,1.0,2.5,0.06,inf
59,"(coffee, biscuit)",(orange),0.1,0.15,0.1,1.0,6.666667,0.085,inf
26,"(bread, cereal)",(biscuit),0.05,0.4,0.05,1.0,2.5,0.03,inf
76,"(milk, jam)",(bread),0.05,0.65,0.05,1.0,1.538462,0.0175,inf
87,"(coffee, milk)",(cereal),0.05,0.3,0.05,1.0,3.333333,0.035,inf
35,"(sugar, biscuit)",(bread),0.1,0.65,0.1,1.0,1.538462,0.035,inf
67,"(bread, orange)",(coffee),0.05,0.4,0.05,1.0,2.5,0.03,inf
64,"(bread, cereal)",(milk),0.05,0.25,0.05,1.0,4.0,0.0375,inf
19,(orange),(coffee),0.15,0.4,0.15,1.0,2.5,0.09,inf
94,"(cereal, orange)",(coffee),0.1,0.4,0.1,1.0,2.5,0.06,inf


Confidence is the fraction of transactions containing the antecedent that also contain the consequent, i.e. how often the if-then rule holds.

Many rules have a confidence of 100%.
For instance, every customer who bought (coffee, biscuit) also bought orange.
Likewise, every basket with (coffee, tea) also contained milk, although this rule has a support of only 0.05, which is a single basket out of 20.
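
Confidence can be computed directly from basket counts. The sketch below uses only the eight baskets that contain biscuit (taken from the transaction list above); that is enough here because confidence only looks at baskets containing the antecedent, and both antecedents include biscuit. The function name `confidence` is illustrative.

```python
def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [set(b) for b in baskets if antecedent <= set(b)]
    if not with_antecedent:
        raise ValueError("antecedent never occurs in the baskets")
    hits = sum(consequent <= b for b in with_antecedent)
    return hits / len(with_antecedent)

# The eight baskets from the groceries data that contain biscuit
biscuit_baskets = [
    ["biscuit", "bread", "milk"],
    ["biscuit", "bread", "cereal", "milk"],
    ["biscuit", "tea"],
    ["biscuit", "bread", "tea"],
    ["biscuit", "cereal", "coffee", "orange"],
    ["biscuit", "cereal", "coffee", "orange"],
    ["biscuit", "bread", "sugar"],
    ["biscuit", "bread", "sugar"],
]

print(confidence(biscuit_baskets, {"coffee", "biscuit"}, {"orange"}))
print(confidence(biscuit_baskets, {"biscuit", "cereal"}, {"orange"}))
```

The first rule holds in every basket that contains its antecedent, while the second holds in two out of three, which matches the confidence column of the rules table.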

Let’s say that we want to get the top associated rules, given that the left-hand side has two items, then which item is more likely to be added to the basket?

In [None]:
rules['lhs items'] = rules['antecedents'].apply(lambda x:len(x) )
rules[rules['lhs items']>1].sort_values('lift', ascending=False).head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,lhs items
59,"(coffee, biscuit)",(orange),0.1,0.15,0.1,1.0,6.666667,0.085,inf,2
53,"(cereal, biscuit)",(orange),0.15,0.15,0.1,0.666667,4.444444,0.0775,2.55,2
111,"(coffee, tea)",(milk),0.05,0.25,0.05,1.0,4.0,0.0375,inf,2
64,"(bread, cereal)",(milk),0.05,0.25,0.05,1.0,4.0,0.0375,inf,2
41,"(coffee, biscuit)",(cereal),0.1,0.3,0.1,1.0,3.333333,0.07,inf,2
54,"(orange, biscuit)",(cereal),0.1,0.3,0.1,1.0,3.333333,0.07,inf,2
87,"(coffee, milk)",(cereal),0.05,0.3,0.05,1.0,3.333333,0.035,inf,2
92,"(coffee, cereal)",(orange),0.2,0.15,0.1,0.5,3.333333,0.07,1.7,2
100,"(coffee, tea)",(cereal),0.05,0.3,0.05,1.0,3.333333,0.035,inf,2
105,"(milk, tea)",(cereal),0.05,0.3,0.05,1.0,3.333333,0.035,inf,2


As we can see, if someone has already added (coffee, biscuit) or (cereal, biscuit) to their basket, the item most strongly associated with it (highest lift) is **orange**.

Next, we show all the rules whose left-hand side consists of two items as a matrix, with the lift of each rule as the cell value.

In [None]:
# Replace frozen sets with strings
rules['antecedents_'] = rules['antecedents'].apply(lambda a: ','.join(list(a)))
rules['consequents_'] = rules['consequents'].apply(lambda a: ','.join(list(a)))
# Transform the DataFrame of rules into a matrix using the lift metric
pivot = rules[rules['lhs items']>1].pivot(index = 'antecedents_', 
                    columns = 'consequents_', values= 'lift')
pivot

consequents_,biscuit,bread,cereal,coffee,jam,milk,orange,sugar,tea
antecedents_,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
"bread,biscuit",,,,,,1.6,,1.333333,
"bread,cereal",2.5,,,,,4.0,,,
"bread,coffee",,,,,,,2.222222,2.222222,
"bread,jam",,,,,,2.0,,,1.428571
"bread,milk",1.25,,,,2.5,,,,
"bread,orange",,,,2.5,,,,,
"bread,tea",,,,,2.5,,,,
"cereal,biscuit",,,,1.666667,,1.333333,4.444444,,
"cereal,orange",2.5,,,2.5,,,,,
"cereal,tea",,,,1.25,,2.0,,,


Lift > 1 implies a positive association between the antecedent and the consequent: they appear together more often than would be expected if they were independent.
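
The remaining columns of the rules table follow from three support values. The sketch below applies the standard definitions of confidence, lift, leverage and conviction; the function name `rule_metrics` is an assumption of this example, and the inputs are taken from the rules tables above.

```python
import math

def rule_metrics(support_a, support_c, support_ac):
    """Metrics for a rule A -> C from support(A), support(C) and support(A and C)."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is infinite when the rule never fails (confidence of 1)
    conviction = math.inf if confidence >= 1 else (1 - support_c) / (1 - confidence)
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}

# (cereal) -> (biscuit): compare with row 0 of the rules table
m = rule_metrics(support_a=0.3, support_c=0.4, support_ac=0.15)
print({name: round(value, 4) for name, value in m.items()})

# (coffee, biscuit) -> (orange): a rule with confidence 1
m_full = rule_metrics(support_a=0.1, support_c=0.15, support_ac=0.1)
print(m_full["lift"], m_full["conviction"])
```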

**Problem Statement**: Perform market basket analysis on the 'Online Retail II' dataset described below.

**Data Set Information:**

This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retail between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.

**Attribute Information:**

1) InvoiceNo (column `Invoice` in the loaded file): Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.

2) StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.

3) Description: Product (item) name. Nominal.

4) Quantity: The quantities of each product (item) per transaction. Numeric.

5) InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.

6) UnitPrice (column `Price` in the loaded file): Unit price. Numeric. Product price per unit in sterling (£).

7) CustomerID (column `Customer ID` in the loaded file): Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.

8) Country: Country name. Nominal. The name of the country where a customer resides.

In [None]:
data=pd.read_csv('/content/online_retail_II (1).csv',encoding='ISO-8859-1')
data.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6.0,01-12-2010 08:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6.0,01-12-2010 08:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8.0,01-12-2010 08:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6.0,01-12-2010 08:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6.0,01-12-2010 08:26,3.39,17850.0,United Kingdom


As you can see, this dataset contains 8 columns, and each row is one product line of an invoice. The rows shown above are from 1 December 2010; the full date range and row count depend on the file that was loaded (the index in the `info()` output below runs up to 370510, so this file has at least 370511 rows).

In [None]:
# Drop rows with missing values, then inspect the structure of the data
data=data.dropna()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 272706 entries, 0 to 370510
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Invoice      272706 non-null  object 
 1   StockCode    272706 non-null  object 
 2   Description  272706 non-null  object 
 3   Quantity     272706 non-null  float64
 4   InvoiceDate  272706 non-null  object 
 5   Price        272706 non-null  float64
 6   Customer ID  272706 non-null  float64
 7   Country      272706 non-null  object 
dtypes: float64(3), object(5)
memory usage: 18.7+ MB


After dropping rows with missing values, 272706 entries remain. The row count before the drop was not printed, so the share of rows with at least one missing value cannot be read from this output.

In [None]:
# No. of missing values in each column
data.isna().sum()

Invoice        0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
Price          0
Customer ID    0
Country        0
dtype: int64

In [None]:

data_plus=data[data['Quantity']>=0]
data_plus.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 266349 entries, 0 to 370510
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Invoice      266349 non-null  object 
 1   StockCode    266349 non-null  object 
 2   Description  266349 non-null  object 
 3   Quantity     266349 non-null  float64
 4   InvoiceDate  266349 non-null  object 
 5   Price        266349 non-null  float64
 6   Customer ID  266349 non-null  float64
 7   Country      266349 non-null  object 
dtypes: float64(3), object(5)
memory usage: 18.3+ MB


In this dataset, the Quantity column shows the number of units bought for each item in a transaction (`Invoice`). Because this is an online retailer, transactions are sometimes cancelled, and a cancellation is recorded as a negative value in the Quantity column. Market basket analysis looks at what customers actually bought, so these rows are excluded. As the output above shows, 266349 entries remain after removing them.
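
The attribute description also says that a cancelled invoice has a code starting with the letter 'c'. A minimal sketch of a filter that checks both signals is shown below. The rows are invented for illustration, and `is_cancellation` is a hypothetical helper, not something the notebook defines.

```python
def is_cancellation(invoice, quantity):
    """Treat a row as cancelled if the invoice code starts with 'C'/'c' or the quantity is negative."""
    return str(invoice).upper().startswith("C") or quantity < 0

# Invented example rows (not taken from the dataset)
rows = [
    {"Invoice": "536365", "Description": "WHITE METAL LANTERN", "Quantity": 6.0},
    {"Invoice": "C536379", "Description": "WHITE METAL LANTERN", "Quantity": -1.0},
    {"Invoice": "536380", "Description": "CREAM CUPID HEARTS COAT HANGER", "Quantity": 8.0},
    {"Invoice": "536381", "Description": "CREAM CUPID HEARTS COAT HANGER", "Quantity": -2.0},
]

kept = [row for row in rows if not is_cancellation(row["Invoice"], row["Quantity"])]
print([row["Invoice"] for row in kept])
```

Checking the invoice prefix in addition to the sign of the quantity is a design choice of this sketch; the notebook itself filters on `Quantity` only.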

In [None]:
#Create the basket data using the transactions from the UK only
uk_rows = data_plus[data_plus['Country']=='United Kingdom']
basket_plus = (uk_rows.groupby(['Invoice','Description'])['Quantity'].sum()
               .unstack().reset_index().fillna(0).set_index('Invoice'))
basket_plus

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TOADSTOOL BEDSIDE LIGHT,TRELLIS COAT RACK,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
Invoice,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536366,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536368,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536369,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
569132,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
569133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
569134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
569136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Using only the UK transactions with non-negative quantity, we group the data by transaction (`Invoice`) and item (`Description`), sum the `Quantity` of each item, and unstack the result so that every item becomes a column. Finally, we set `Invoice` as the index, so each row shows the quantity of every item bought on that invoice. This data frame is the 'basket' that each customer carries to the checkout. A value of 0 means the item was not part of that transaction; any other value (12, for instance) means that 12 units of the item were bought.
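
To make the reshaping concrete, the sketch below reproduces the group, sum and unstack steps in plain Python on a few invented line items. The invoice codes, quantities and the helper names `build_baskets` and `to_wide` are assumptions of this example.

```python
from collections import defaultdict

def build_baskets(line_items):
    """Sum quantities per invoice and item, like groupby(['Invoice', 'Description']).sum()."""
    baskets = defaultdict(lambda: defaultdict(float))
    for invoice, description, quantity in line_items:
        baskets[invoice][description] += quantity
    return baskets

def to_wide(baskets):
    """One row per invoice, one column per item, 0.0 where the item is absent."""
    columns = sorted({d for items in baskets.values() for d in items})
    rows = {inv: [items.get(d, 0.0) for d in columns] for inv, items in baskets.items()}
    return columns, rows

# Invented line items: (invoice, description, quantity)
line_items = [
    ("A1", "WHITE METAL LANTERN", 6.0),
    ("A1", "WHITE METAL LANTERN", 2.0),
    ("A1", "CREAM CUPID HEARTS COAT HANGER", 8.0),
    ("A2", "WHITE METAL LANTERN", 12.0),
]
columns, rows = to_wide(build_baskets(line_items))
print(columns)
for invoice, row in rows.items():
    print(invoice, row)
```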

**Encode The Data**

In market basket analysis, the quantity of each item bought is not important; what matters is whether an item was bought at all, because we only want to know how buying some items is associated with buying others. So we encode the basket data as binary values that show whether an item was bought (1) or not (0).

In [None]:
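def encode_units(x):
  # 1 if the item was bought on this invoice (positive quantity), otherwise 0
  return 1 if x > 0 else 0

# Note: DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map replaces it there
basket_encode_plus=basket_plus.applymap(encode_units)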
basket_encode_plus

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TOADSTOOL BEDSIDE LIGHT,TRELLIS COAT RACK,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
Invoice,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536366,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536367,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
569132,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
569133,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
569134,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
569136,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


This way, we get a data frame that shows whether each item was bought or not in every transaction.

**Filter The Transactions: Keep Only Baskets With More Than One Item**

Market basket analysis uncovers associations between two or more items bought together in historical data. A transaction that contains a single item carries no information about such associations, so the next step is to keep only the transactions that contain more than one item.

In [None]:
basket_filter_plus=basket_encode_plus[(basket_encode_plus>0).sum(axis=1)>=2]
basket_filter_plus

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TOADSTOOL BEDSIDE LIGHT,TRELLIS COAT RACK,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
Invoice,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536366,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536367,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536372,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
569132,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
569133,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
569134,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
569136,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


According to the result above, 15376 transactions contain more than one item, which is 92.35% of the basket data.

With this dataset prepared, we can now apply the Apriori algorithm, which finds the itemsets that occur frequently (with at least a minimum support) in the data.
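
For intuition, here is a minimal level-wise sketch of the Apriori idea on the first five grocery baskets from the earlier example. It is not how mlxtend implements `apriori`; the function name `apriori_sketch` and the threshold of 0.4 are choices made for this illustration.

```python
from itertools import combinations

def apriori_sketch(baskets, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    items = {item for b in baskets for item in b}
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: v for s, v in level.items() if v >= min_support}
    frequent = dict(level)

    k = 2
    while level:
        prev = set(level)
        # Join step: merge frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = {c: support(c) for c in candidates}
        level = {c: v for c, v in level.items() if v >= min_support}
        frequent.update(level)
        k += 1
    return frequent

baskets = [
    ["milk", "bread", "biscuit"],
    ["bread", "milk", "biscuit", "cereal"],
    ["bread", "tea"],
    ["jam", "bread", "milk"],
    ["tea", "biscuit"],
]
frequent = apriori_sketch(baskets, min_support=0.4)
for itemset, value in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

The prune step is what keeps the search manageable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is counted.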


In [None]:
# compute frequent itemsets using the Apriori algorithm (no max_len is set, so itemsets of any length are returned)
frequent_itemsets_plus = apriori(basket_filter_plus, min_support = 0.03, use_colnames=True).sort_values('support',ascending=False).reset_index(drop=True)
frequent_itemsets_plus['length']=frequent_itemsets_plus['itemsets'].apply(len)
frequent_itemsets_plus

The result contains 108 itemsets that meet the minimum support. WHITE HANGING HEART T-LIGHT HOLDER is the most frequently bought item, with a support of 0.121358.

**Finding The Association Between Frequently Bought Items**

After finding the frequent itemsets with the Apriori algorithm, we can derive association rules from them. These rules tell us which items tend to be bought together and are therefore good candidates to be sold together.

In [None]:
rule=association_rules(frequent_itemsets_plus, metric="lift", min_threshold=1).sort_values('lift', ascending=False).reset_index(drop=True)
rule.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(GREEN REGENCY TEACUP AND SAUCER),(ROSES REGENCY TEACUP AND SAUCER ),0.045546,0.049762,0.03574,0.784708,15.769312,0.033474,4.413724
1,(ROSES REGENCY TEACUP AND SAUCER ),(GREEN REGENCY TEACUP AND SAUCER),0.049762,0.045546,0.03574,0.718232,15.769312,0.033474,3.387375
2,(WOODEN PICTURE FRAME WHITE FINISH),(WOODEN FRAME ANTIQUE WHITE ),0.060576,0.050495,0.030242,0.499244,9.887016,0.027183,1.896142
3,(WOODEN FRAME ANTIQUE WHITE ),(WOODEN PICTURE FRAME WHITE FINISH),0.050495,0.060576,0.030242,0.598911,9.887016,0.027183,2.342185
4,(LUNCH BAG CARS BLUE),(LUNCH BAG PINK POLKADOT),0.063691,0.063233,0.030242,0.47482,7.50904,0.026215,1.783707


From the association_rules results above, we can see that ROSES REGENCY TEACUP AND SAUCER and GREEN REGENCY TEACUP AND SAUCER are the most strongly associated items, since this pair has the highest lift. The higher the lift, the stronger the association between the items; a lift above 1 already indicates that two items are bought together more often than expected by chance. In this case the lift is 15.769, which is very high, so these two items are very good candidates to be sold together.

Besides that, the support of the pair is 0.03574, which means that about 3.57% of all transactions contain both items.

The confidence gives more information. Unlike support and lift, confidence depends on the direction of the rule, because it is the support of the pair divided by the support of the antecedent. Here the first rule, GREEN REGENCY TEACUP AND SAUCER → ROSES REGENCY TEACUP AND SAUCER, has a confidence of 0.785, while the reverse rule has a confidence of 0.718. In other words, about 78% of the customers who bought the green teacup also bought the roses teacup in the same transaction, compared with about 72% in the other direction. The rules do not say which item was put in the basket first. This is valuable information because it indicates where a discount is most likely to work: we could offer a discount on Roses Regency Teacup and Saucer to customers who buy Green Regency Teacup and Saucer.
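
The two rules for this pair share the same support and lift; only the confidence differs, because it divides by the support of the antecedent. A small check with the values from the table above (as printed, so the results agree with the table to about three decimals):

```python
# Values copied from the first two rows of the rules table
support_pair = 0.03574    # both teacups in the same transaction
support_green = 0.045546  # GREEN REGENCY TEACUP AND SAUCER
support_roses = 0.049762  # ROSES REGENCY TEACUP AND SAUCER

conf_green_to_roses = support_pair / support_green
conf_roses_to_green = support_pair / support_roses
lift = support_pair / (support_green * support_roses)  # the same in both directions

print(round(conf_green_to_roses, 3))
print(round(conf_roses_to_green, 3))
print(round(lift, 2))
```

The rarer item makes the better antecedent here: fewer baskets contain the green teacup, so a larger share of them also contains the roses teacup.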

# **Conclusion:**
The result of this market basket analysis could be used for a data-driven marketing strategy and decision making.

**Item Placement:** We could place ROSES REGENCY TEACUP AND SAUCER and GREEN REGENCY TEACUP AND SAUCER close to each other, for example on the same shelf.

**Product Bundling:** We could sell ROSES REGENCY TEACUP AND SAUCER and GREEN REGENCY TEACUP AND SAUCER as a single bundle at a lower price than the two items bought separately. This could attract more sales and generate more income.

**Customer Recommendation and Discounts:** We could place Roses Regency Teacup and Saucer near the cashier, so that every time a customer buys Green Regency Teacup and Saucer, we can recommend Roses Regency Teacup and Saucer at a lower price.

**Problem Statement**: In this implementation, we use the Market Basket Optimisation dataset. It comprises the transactions of a retail company over a period of one week: 7501 transaction records in total, where each record is the list of items sold in one transaction. Using these records, find the association rules between items.

In [None]:
dataset = pd.read_csv('/content/Market_Basket_Optimisation.csv',header=None)
dataset.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


In [None]:
#Getting the list of transactions from the dataset
#(NaN cells only pad short rows; they are skipped so that 'nan' is not treated as an item)
transactions = []
for row in dataset.values:
    transactions.append([str(item) for item in row if pd.notna(item)])

In [None]:
transactions

Now that the list of items per transaction is ready, we run the Apriori algorithm to find association rules in it. Suppose we are interested in products that are sold at least 3 times a day. The minimum support is then 3 sales per day multiplied by 7 days of the week and divided by the total number of transactions: (3*7)/7501 ≈ 0.0028, which we round to 0.003. We also require a confidence of at least 30%, so the minimum confidence is 0.3. The minimum lift is set to 3, and the minimum length to 2 because we want associations between at least two items. These hyperparameters can be tuned depending on the business requirements.
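
The support threshold is simple arithmetic; written out as code, with variable names chosen for this example:

```python
sales_per_day = 3       # the item should sell at least this often per day
days = 7                # the dataset covers one week
n_transactions = 7501   # total number of transaction records

min_support = sales_per_day * days / n_transactions
print(min_support)
print(round(min_support, 3))
```

Changing `sales_per_day` is an easy way to see how the threshold moves with the business requirement.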

In [None]:
pip install apyori

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Note: this import replaces the mlxtend `apriori` imported earlier in the notebook
from apyori import apriori

In [None]:
# Training Apriori algorithm on the dataset
rule_list = apriori(transactions, min_support = 0.003,min_confidence = 0.3, min_lift = 3, min_length = 2)
rule_list

<generator object apriori at 0x7f4d080867b0>

The call above returns a generator, so the association rules between the items are only computed once we iterate over it.

In [None]:
# Materialise the generator and print each rule
results = list(rule_list)
for i in results:
    print('\n')
    print(i)
    print('**********') 

The first rule indicates an association between mushroom cream sauce and escalope with a confidence of 30%. The next rule shows an association between escalope and pasta with a confidence of 37.28%. 
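
Each record that apyori yields holds the itemset, its support, and a list of ordered statistics (base items, added items, confidence, lift). The sketch below flattens such records into plain rows that are easier to read. To stay runnable without the library or the dataset, it uses stand-in namedtuples with the same field names and a single placeholder record whose numbers are invented, not results from this dataset; `flatten` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-ins that mirror the field names of apyori's result records
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"])

def flatten(records):
    """Turn nested rule records into one flat dict per rule."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": ", ".join(sorted(stat.items_base)),
                "consequent": ", ".join(sorted(stat.items_add)),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows

# Placeholder record: support, confidence and lift are invented for the example
example = [
    RelationRecord(
        items=frozenset({"escalope", "mushroom cream sauce"}),
        support=0.005,
        ordered_statistics=[
            OrderedStatistic(
                items_base=frozenset({"mushroom cream sauce"}),
                items_add=frozenset({"escalope"}),
                confidence=0.3,
                lift=3.5,
            )
        ],
    )
]
for row in flatten(example):
    print(row)
```

With the real results, `flatten(results)` should work the same way as long as the records expose these field names.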