# Market_Basket_Analysis_AssociationRuleLearning: a)Apriori, b)Eclat



**STRUCTURE**

*In this notebook, two algorithms (**Part A**: Apriori and **Part B**: Eclat) are demonstrated for Market Basket Analysis on a groceries dataset. This type of analysis is based on Association Rule Learning, which associates item sets according to how frequently they appear together in the dataset (it depends on 'past relational knowledge' from the dataset records).*

*In the first part of this project (**Part A**), the goal is to determine the strength of the relationship between all combinations of two items from the grocery store that are frequently bought together (i.e. how likely item 2 is to be placed in the basket given that item 1 is already in it). For the Apriori algorithm to perform this task, it is necessary to select values for a) minimum_support, b) minimum_confidence, c) minimum_lift and d) the minimum/maximum length of the item set. The resulting rules are ranked in descending order of 'lift', where lift(item 1 -> item 2) = confidence(item 1 -> item 2) / support(item 2). A lift above 1 means the two items appear together more often than would be expected if they were bought independently; the higher the 'lift' value, the stronger the association between the two grocery items.*

*In the second part (**Part B**), the Eclat algorithm is used, which is treated here as a simplified version of Apriori because the strength of an item set association is based only on support (item set frequency / number of records) and not on confidence and lift. Therefore, this example focuses on sets: the goal is to identify the item sets whose items appear together most often relative to the total number of dataset records.*
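To make the three measures concrete, here is a minimal sketch that computes support, confidence and lift by hand. The baskets are hypothetical toy data (not taken from groceries.csv) and the helper names `support`, `confidence` and `lift` are illustrative only; they are not part of any library used in this notebook.

```python
# Toy baskets (hypothetical data, not taken from groceries.csv)
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(base, add, baskets):
    """support(base and add together) / support(base)."""
    return support(set(base) | set(add), baskets) / support(base, baskets)

def lift(base, add, baskets):
    """confidence(base -> add) / support(add)."""
    return confidence(base, add, baskets) / support(add, baskets)

print("support   :", support({"milk", "bread"}, baskets))
print("confidence:", confidence({"milk"}, {"bread"}, baskets))
print("lift      :", lift({"milk"}, {"bread"}, baskets))
```

In this toy example milk and bread share 2 of 5 baskets, so the lift of the rule milk -> bread comes out slightly above 1, i.e. only a weak positive association.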



**The Dataset (.csv file format) for this project has been obtained from Kaggle:**

"*Groceries Market Basket Dataset*" -- File: "groceries.csv" -- Source:https://www.kaggle.com/irfanasrullah/groceries



In [1]:
# importing the libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')


In [2]:
# importing the dataset
data=pd.read_csv('groceries.csv')
# Dataset first 5 records
data.head()

Unnamed: 0,Item(s),Item 1,Item 2,Item 3,Item 4,Item 5,Item 6,Item 7,Item 8,Item 9,...,Item 23,Item 24,Item 25,Item 26,Item 27,Item 28,Item 29,Item 30,Item 31,Item 32
0,4,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,...,,,,,,,,,,
1,3,tropical fruit,yogurt,coffee,,,,,,,...,,,,,,,,,,
2,1,whole milk,,,,,,,,,...,,,,,,,,,,
3,4,pip fruit,yogurt,cream cheese,meat spreads,,,,,,...,,,,,,,,,,
4,4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,,...,,,,,,,,,,


In [3]:
# The first column of the dataset, 'Item(s)', only counts the number of products in each basket. Since we are only
# interested in the products themselves, this column is not required and can be dropped
data=data.drop('Item(s)',axis=1)

In [4]:
# Data info: 9835 records, 32 item columns of dtype 'object'; only 'Item 1' is non-null for every record
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9835 entries, 0 to 9834
Data columns (total 32 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Item 1   9835 non-null   object
 1   Item 2   7676 non-null   object
 2   Item 3   6033 non-null   object
 3   Item 4   4734 non-null   object
 4   Item 5   3729 non-null   object
 5   Item 6   2874 non-null   object
 6   Item 7   2229 non-null   object
 7   Item 8   1684 non-null   object
 8   Item 9   1246 non-null   object
 9   Item 10  896 non-null    object
 10  Item 11  650 non-null    object
 11  Item 12  468 non-null    object
 12  Item 13  351 non-null    object
 13  Item 14  273 non-null    object
 14  Item 15  196 non-null    object
 15  Item 16  141 non-null    object
 16  Item 17  95 non-null     object
 17  Item 18  66 non-null     object
 18  Item 19  52 non-null     object
 19  Item 20  38 non-null     object
 20  Item 21  29 non-null     object
 21  Item 22  18 non-null     object
 22  

## Part A: Apriori

In [5]:
# Convert the dataset records from a pandas DataFrame to a list of lists (one list of strings per basket).
# Empty slots (NaN) are converted to the string 'nan'
values=data.values
basket_records=[[str(item) for item in row] for row in values]
# Print basket_records type
print(type(basket_records))
print()
# Print first list record
print('First basket list entry: \n',basket_records[0])

<class 'list'>

First basket list entry: 
 ['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']
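As the output shows, every empty slot ends up in the basket as the string 'nan', so it is handled like an ordinary product until it is filtered out later (Part B removes such pairs explicitly). One possible alternative, which is not what this notebook does, is to drop the missing values while building the lists. The sketch below uses hypothetical rows shaped like the output of `data.values.tolist()`; the helper `is_missing` is an illustrative name. Note that the row indices shown in later outputs were produced with the 'nan' entries kept.

```python
import math

# Hypothetical rows shaped like data.values.tolist(): product names followed by NaN padding
rows = [
    ["citrus fruit", "semi-finished bread", "margarine", "ready soups", float("nan"), float("nan")],
    ["whole milk", float("nan"), float("nan"), float("nan"), float("nan"), float("nan")],
]

def is_missing(value):
    """True for the float NaN that pandas uses for empty cells."""
    return isinstance(value, float) and math.isnan(value)

# Keep only the real product names in each basket
clean_baskets = [[item for item in row if not is_missing(item)] for row in rows]
print(clean_baskets)
```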


In [6]:
# Importing the apriori function from apyori. apriori() returns a generator of 'RelationRecord' objects
from apyori import apriori
# Support = number of baskets that contain an item (or set of items) / total number of records
# Here an item set has to appear in at least 21 of the records
minimum_support=21/len(data)
# Minimum confidence: how often we want the apriori rule to be correct (i.e. 0.25 --> 25%)
# Confidence(item 1 -> item 2) = number of baskets that contain both items 1 & 2 divided by the number of baskets
# that contain item 1
minimum_confidence= 0.25
# Minimum lift to measure the quality/strength of the apriori rules
# Lift(item 1 -> item 2) = confidence(item 1 -> item 2) / support(item 2)
minimum_lift=3
apriori_rules=apriori(transactions=basket_records,
                      min_support=minimum_support,
                      min_confidence=minimum_confidence,
                      min_lift=minimum_lift,
                      min_length=2,
                      max_length=2)

In [7]:
# List of apriori rules
apriori_result=list(apriori_rules)
apriori_result

[RelationRecord(items=frozenset({'hamburger meat', 'Instant food products'}), support=0.003050330452465684, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Instant food products'}), items_add=frozenset({'hamburger meat'}), confidence=0.379746835443038, lift=11.42143769597027)]),
 RelationRecord(items=frozenset({'baking powder', 'whipped/sour cream'}), support=0.004575495678698526, ordered_statistics=[OrderedStatistic(items_base=frozenset({'baking powder'}), items_add=frozenset({'whipped/sour cream'}), confidence=0.25862068965517243, lift=3.607850330154072)]),
 RelationRecord(items=frozenset({'beef', 'root vegetables'}), support=0.017386883579054397, ordered_statistics=[OrderedStatistic(items_base=frozenset({'beef'}), items_add=frozenset({'root vegetables'}), confidence=0.3313953488372093, lift=3.0403668431100312)]),
 RelationRecord(items=frozenset({'berries', 'whipped/sour cream'}), support=0.009049313675648195, ordered_statistics=[OrderedStatistic(items_base=frozenset({'be

In [8]:
# First apriori rule
# If a customer buys 'Instant food products' (items_base), there is a 37.97% chance (confidence=0.3797) that they will
# also buy 'hamburger meat' (items_add). The two items appear together in 0.3% of the baskets (support=0.00305)
apriori_result[0]

RelationRecord(items=frozenset({'hamburger meat', 'Instant food products'}), support=0.003050330452465684, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Instant food products'}), items_add=frozenset({'hamburger meat'}), confidence=0.379746835443038, lift=11.42143769597027)])
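As a sanity check on how the three numbers of this rule relate, the sketch below backs out the underlying basket counts using only values printed above (9835 records from `data.info()`, plus the support, confidence and lift of the first rule). It assumes the standard definitions confidence = support(pair) / support(base) and lift = confidence / support(add); the variable names are illustrative.

```python
n_records = 9835                       # number of records reported by data.info()
rule_support = 0.003050330452465684    # support of {Instant food products, hamburger meat}
rule_confidence = 0.379746835443038    # confidence of Instant food products -> hamburger meat
rule_lift = 11.42143769597027          # lift of the same rule

# Baskets that contain both items
pair_count = rule_support * n_records
# confidence = support(pair) / support(base)  =>  support(base) = support(pair) / confidence
base_support = rule_support / rule_confidence
# lift = confidence / support(add)            =>  support(add) = confidence / lift
add_support = rule_confidence / rule_lift

print("baskets with both items        :", round(pair_count))
print("baskets with the base item     :", round(base_support * n_records))
print("baskets with the added item    :", round(add_support * n_records))
print("share of baskets with added item: %.4f" % add_support)
```

Read this way, the high lift comes from the fact that 'hamburger meat' appears in only about 3.3% of all baskets but in roughly 38% of the baskets that contain 'Instant food products'.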

In [9]:
## Converting the apriori results to pandas dataframe
def apriori_res(apriori_result):
    # Each record x is (items, support, ordered_statistics). x[2][0] is the first rule of the record,
    # stored as (items_base, items_add, confidence, lift). With max_length=2 each side holds a single item
    left_hand_side_str= [tuple(x[2][0][0])[0] for x in apriori_result]
    right_hand_side_str= [tuple(x[2][0][1])[0] for x in apriori_result]
    support= [x[1] for x in apriori_result]
    confidence= [x[2][0][2] for x in apriori_result]
    lift= [x[2][0][3] for x in apriori_result]
    return list(zip(left_hand_side_str,right_hand_side_str, support, confidence, lift))
df_res = pd.DataFrame(apriori_res(apriori_result), columns = ['L-Hand Side', 'R-Hand Side', 'Rule_Support',
                                                              'Rule_Confidence', 'Rule_Lift'])
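The positional indexing above (`x[2][0][0]` and so on) can be hard to read. The printed records show named fields (`items`, `support`, `ordered_statistics`, `items_base`, `items_add`, `confidence`, `lift`), so a variant based on attribute access might look like the sketch below. To keep it self-contained, it uses stand-in namedtuples that mirror those printed field names; that the real apyori records can be accessed by the same names is an assumption here, and `rules_to_rows` is a hypothetical helper. Unlike the cell above, it emits one row per rule in `ordered_statistics` rather than only the first.

```python
from collections import namedtuple

# Stand-ins mirroring the field names printed in the apriori output (assumption, see lead-in)
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])

# First record, copied from the output of apriori_result[0]
sample_records = [
    RelationRecord(
        items=frozenset({"hamburger meat", "Instant food products"}),
        support=0.003050330452465684,
        ordered_statistics=[
            OrderedStatistic(
                items_base=frozenset({"Instant food products"}),
                items_add=frozenset({"hamburger meat"}),
                confidence=0.379746835443038,
                lift=11.42143769597027,
            )
        ],
    )
]

def rules_to_rows(records):
    """Flatten records into one dict per rule, using field names instead of indices."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "L-Hand Side": ", ".join(sorted(stat.items_base)),
                "R-Hand Side": ", ".join(sorted(stat.items_add)),
                "Rule_Support": record.support,
                "Rule_Confidence": stat.confidence,
                "Rule_Lift": stat.lift,
            })
    return rows

rows = rules_to_rows(sample_records)
print(rows[0])
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)` to obtain the same columns as `df_res`.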




In [10]:
## Apriori results in pandas DataFrame
df_res


Unnamed: 0,L-Hand Side,R-Hand Side,Rule_Support,Rule_Confidence,Rule_Lift
0,Instant food products,hamburger meat,0.00305,0.379747,11.421438
1,baking powder,whipped/sour cream,0.004575,0.258621,3.60785
2,beef,root vegetables,0.017387,0.331395,3.040367
3,berries,whipped/sour cream,0.009049,0.272171,3.796886
4,liquor,bottled beer,0.004677,0.422018,5.240594
5,red/blush wine,bottled beer,0.004881,0.253968,3.15376
6,flour,sugar,0.004982,0.28655,8.463112
7,herbs,root vegetables,0.007016,0.43125,3.956477
8,popcorn,salty snack,0.002237,0.309859,8.19211
9,processed cheese,white bread,0.004169,0.251534,5.975445


In [11]:
## Top 5 Apriori results sorted by 'Rule_Lift' column (descending)
df_res.nlargest(n = 5, columns = 'Rule_Lift')

Unnamed: 0,L-Hand Side,R-Hand Side,Rule_Support,Rule_Confidence,Rule_Lift
0,Instant food products,hamburger meat,0.00305,0.379747,11.421438
6,flour,sugar,0.004982,0.28655,8.463112
8,popcorn,salty snack,0.002237,0.309859,8.19211
9,processed cheese,white bread,0.004169,0.251534,5.975445
4,liquor,bottled beer,0.004677,0.422018,5.240594


In [12]:
## Top 5 Apriori results sorted by 'Rule_Confidence' column (descending)
df_res.nlargest(n = 5, columns = 'Rule_Confidence')

Unnamed: 0,L-Hand Side,R-Hand Side,Rule_Support,Rule_Confidence,Rule_Lift
7,herbs,root vegetables,0.007016,0.43125,3.956477
4,liquor,bottled beer,0.004677,0.422018,5.240594
10,rice,root vegetables,0.003152,0.413333,3.792102
0,Instant food products,hamburger meat,0.00305,0.379747,11.421438
2,beef,root vegetables,0.017387,0.331395,3.040367


## Part B: Eclat

In [13]:
# Eclat is treated here as a simpler version of Apriori, as it is based only on support and not on confidence and lift.
# The apriori function from apyori is reused with a support threshold only
# Support = number of baskets that contain a set of item(s) / total number of records
minimum_support=21/len(data)
eclat=apriori(transactions=basket_records,
              min_support=minimum_support,
              min_length=2,
              max_length=2)
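The cell above obtains support-only results by reusing the apriori function. For comparison, Eclat as usually described works on a "vertical" layout: each item is mapped to the set of basket ids that contain it, and the support of a pair is obtained by intersecting two such sets. The following is a minimal sketch of that idea, restricted to pairs (matching max_length=2) and run on hypothetical toy baskets; `eclat_pairs` is an illustrative name, not a function of apyori, and a full Eclat implementation would extend the intersections recursively to larger item sets.

```python
from collections import defaultdict
from itertools import combinations

def eclat_pairs(baskets, min_support):
    """Return {(item_a, item_b): support} for all pairs meeting min_support."""
    n = len(baskets)
    # Vertical layout: item -> set of ids of the baskets that contain it
    tidsets = defaultdict(set)
    for tid, basket in enumerate(baskets):
        for item in basket:
            tidsets[item].add(tid)
    # Only frequent single items can be part of a frequent pair
    frequent = {item: tids for item, tids in tidsets.items() if len(tids) / n >= min_support}
    pairs = {}
    for a, b in combinations(sorted(frequent), 2):
        common = frequent[a] & frequent[b]   # baskets that contain both items
        if len(common) / n >= min_support:
            pairs[(a, b)] = len(common) / n
    return pairs

# Toy baskets (hypothetical data, not taken from groceries.csv)
toy_baskets = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["milk"],
    ["bread"],
    ["butter"],
]
pairs = eclat_pairs(toy_baskets, min_support=0.4)
print(pairs)
```

On the toy data only the pair (bread, milk) reaches the 0.4 threshold, since it is the only pair that shares two of the five baskets.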

In [14]:
# List of eclat rules
eclat_result=list(eclat)


In [15]:
## Converting the Eclat results to pandas dataframe
def eclat_rows(eclat_result):
    # Keep only item sets with more than one item (len(x[0])>1). For such a pair, x[2][1] holds the split into
    # first item (x[2][1][0]) and second item (x[2][1][1]), and x[1] is the support of the pair
    First_Item_Set= [tuple(x[2][1][0])[0] for x in eclat_result if len(x[0])>1]
    Second_Item_Set= [tuple(x[2][1][1])[0] for x in eclat_result if len(x[0])>1]
    support= [x[1] for x in eclat_result if len(x[0])>1]
    return list(zip(First_Item_Set,Second_Item_Set, support))

eclat_res = pd.DataFrame(eclat_rows(eclat_result), columns = ['First_Item_Set', 'Second_Item_Set', 'Support_Set'])




In [16]:
## Top 5 Eclat results sorted by 'Support_Set' column (descending)
# Pairs that contain the 'nan' placeholder (empty basket slots) are removed first
eclat_res=eclat_res[(eclat_res['First_Item_Set']!='nan')&(eclat_res['Second_Item_Set']!='nan')]
eclat_res.nlargest(n = 5, columns = 'Support_Set')

Unnamed: 0,First_Item_Set,Second_Item_Set,Support_Set
1484,other vegetables,whole milk,0.074835
1618,rolls/buns,whole milk,0.056634
1761,whole milk,yogurt,0.056024
1637,root vegetables,whole milk,0.048907
1460,other vegetables,root vegetables,0.047382
