I'll use the apyori library to mine the association rules, so install it first if you haven't already.

In [71]:
pip install apyori

Note: you may need to restart the kernel to use updated packages.


In [1]:
import matplotlib.pyplot as plt
import math
from math import log
import pandas as pd
import numpy as np
import random
from apyori import apriori

### Load the data
First we load the carts:

In [2]:
df_cart = pd.read_csv('../data/tesco/tesco_carts.csv', delimiter='\t') 
carts = df_cart['cart'].str.strip('[]').str.split(', ').tolist() #save the carts as long list of lists for using later in apriori
df_cart.head()

Unnamed: 0,cart_id,cart
0,0,[73314923]
1,1,"[58175124, 50502269, 70943424, 57346955, 56036..."
2,2,"[50962501, 50503297, 50507984, 50169663, 51965..."
3,3,"[72680530, 62284801, 58098273]"
4,4,"[56373768, 67336474, 52844621, 54921285]"


Next I expand the cart dataframe so that each row holds a single product id. This makes the product id easy to access, which I need in order to merge the carts with the inventory and learn more about the products in each cart:

In [3]:
df_cart['cart'] = df_cart['cart'].str.strip('[]').str.split(', ')
df_shopping_cart = df_cart.explode('cart')
df_shopping_cart.reset_index(drop=True, inplace=True)

In [4]:
# Rename the cart column to product_id and cast it to int
df_shopping_cart.rename(columns={'cart': 'product_id'}, inplace=True)
df_shopping_cart['product_id'] = df_shopping_cart['product_id'].astype('int64')
df_shopping_cart.head()

Unnamed: 0,cart_id,product_id
0,0,73314923
1,1,58175124
2,1,50502269
3,1,70943424
4,1,57346955
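
To make the reshaping step concrete, here is a minimal sketch of the same strip, split and explode sequence on a tiny frame. The `toy` frame and its ids are made up for illustration and are not taken from the Tesco data.

```python
import pandas as pd

# Toy frame, made up for illustration; each cart is stored as a string like "[1, 2, 3]".
toy = pd.DataFrame({'cart_id': [0, 1], 'cart': ['[111]', '[222, 333, 444]']})

# Turn each string into a list of id strings, then give every id its own row.
toy['cart'] = toy['cart'].str.strip('[]').str.split(', ')
toy_long = toy.explode('cart').reset_index(drop=True)

# Rename the column and cast the id strings to integers.
toy_long = toy_long.rename(columns={'cart': 'product_id'})
toy_long['product_id'] = toy_long['product_id'].astype('int64')

print(toy_long)
```

Each cart id is repeated once per product it contains, which is the shape needed for the merge with the inventory.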


Load the inventory:

In [5]:
df_inventory = pd.read_csv('../data/tesco/tesco_inventory_clean.csv') 
df_inventory.set_index('product_id', inplace=True)
df_inventory.head()

product_id,category,description,ingredients,energy,fat,saturates,salt,sugars,protein,carbohydrate,fibre,avg_price
68238698,fruit_veg,Tesco Baby Corn 190G (M),no_ingredients,28.0,0.4,0.1,0.3,1.9,2.5,2.7,2.0,1.645
53426251,fruit_veg,Tesco Cranberries 100G,Pineapple Juice from Concentrate Cranberries S...,336.0,1.6,0.2,0.1,65.0,0.3,77.4,5.5,1.495
59445495,fruit_veg,Tesco Crispy Slices 350G,Potato (78%) Batter Rapeseed Oil Batter contai...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.5
74533881,sweets,Kelloggs Pop Tarts Frosted S'mores 416G,"Enriched Flour (Wheat Flour, Niacin, Reduced I...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5
54198489,grains,Osem Bamba Peanut Snack 25G,Peanuts (50%) Corn Palm Oil Salt,534.0,34.0,6.6,1.0,3.2,0.0,0.0,0.0,0.4


### Superficial EDA

In [6]:
# Are there duplicates?
dup_shop = len(df_shopping_cart[df_shopping_cart.duplicated()])
dup_inv = len(df_inventory[df_inventory.duplicated()])
print(f'Number of duplicates in Shopping carts {dup_shop}, and in the inventory {dup_inv}')

Number of duplicates in Shopping carts 0, and in the inventory 5
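
One caveat when reading the inventory count: `DataFrame.duplicated()` compares column values only and ignores the index. Because `product_id` is the index here, the 5 flagged rows could be distinct product ids with identical attributes rather than repeated ids; the output above does not say which. The sketch below shows the difference on a made-up frame (`toy_inv` and its values are illustrative).

```python
import pandas as pd

# Toy inventory, made up for illustration: ids 1 and 2 share identical attributes.
toy_inv = pd.DataFrame(
    {'product_id': [1, 2, 3],
     'description': ['Apple', 'Apple', 'Pear'],
     'avg_price': [0.5, 0.5, 0.7]}
).set_index('product_id')

# duplicated() looks at column values only, so id 2 is flagged although its id is unique.
n_dup_rows = int(toy_inv.duplicated().sum())

# Repeated product ids are a separate question, answered on the index itself.
n_dup_ids = int(toy_inv.index.duplicated().sum())

print(n_dup_rows, n_dup_ids)
```

Running both checks on `df_inventory` would show whether the 5 duplicates are repeated ids or merely look-alike rows.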


In [7]:
# NaN values in the shopping carts?
nan_values = df_shopping_cart.isna().sum()
nan_values

cart_id       0
product_id    0
dtype: int64

In [8]:
# NaN values in the inventory?
nan_values = df_inventory.isna().sum()
nan_values

category        0
description     0
ingredients     0
energy          0
fat             0
saturates       0
salt            0
sugars          0
protein         0
carbohydrate    0
fibre           0
avg_price       0
dtype: int64

### Merging to get final dataframe

In [9]:
df_all = pd.merge(df_shopping_cart, df_inventory, on='product_id', how='left')
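
A left merge keeps every cart row even when the product id has no match in the inventory, in which case the inventory columns come back as NaN. As a hedged sketch of how one might check for that, the example below uses the `indicator=True` option of `pd.merge` on made-up frames (`toy_cart`, `toy_inv` and their values are illustrative).

```python
import pandas as pd

# Toy frames, made up for illustration; product 9 is missing from the inventory.
toy_cart = pd.DataFrame({'cart_id': [0, 0, 1], 'product_id': [1, 2, 9]})
toy_inv = pd.DataFrame(
    {'product_id': [1, 2], 'category': ['fruit_veg', 'sweets']}
).set_index('product_id')

# indicator=True adds a '_merge' column: 'both' for matched rows, 'left_only' otherwise.
checked = pd.merge(toy_cart, toy_inv, on='product_id', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'product_id'].tolist()

print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every product in the carts is described in the inventory.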

In [11]:
# All items in cart no 17
df_all[df_all['cart_id']==17]

Unnamed: 0,cart_id,product_id,category,description,ingredients,energy,fat,saturates,salt,sugars,protein,carbohydrate,fibre,avg_price
130,17,50553041,sweets,Tesco Marshmallows 200G,Sugar Glucose Syrup Water Cornflour Beef Gelat...,332.0,0.4,0.2,0.1,64.6,4.1,77.7,0.5,0.883333
131,17,75105344,beer,Estrella Damm Beer 660Ml,Water <strong>Barley Malt</strong> Maize Rice ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.885
132,17,58663952,wine,Jp Chenet Sauvignon Blanc 75Cl,no_ingredients,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.99


Total number of transactions:

In [25]:
df_all['cart_id'].nunique()

1582801

### Defining Support, Confidence and Lift

The marketing department requires us to deliver the following:

- A list of triplets of products ($A$, $B$, $C$) such that $C$ is purchased frequently after $A$ and $B$. They are interested in strong associations, such that the likelihood of the association is at least 2.5 times higher than chance.
- A list of pairs ($A$, $B$) such that $B$ is frequently purchased after $A$, and that the pair ($A$, $B$) is found in at least 0.7\% of the carts.

For the first list we need a minimum lift of 2.5. Since we also want some confidence that (A,B) --> C, I set the minimum confidence to 10%. Finally, we need a minimum support, which I set to 0.0017: I define an itemset as frequently bought if it appears in at least 2691 carts (0.0017 × 1,582,801 ≈ 2691).

For the second list we first set the minimum support to 0.007, since the marketing team wants pairs of items that appear together in at least 0.7% of the carts. Since we also want some confidence that A --> B, I propose a minimum lift of 1.17 (the association is at least 1.17 times more likely than chance) and a minimum confidence of 10%.
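
Since the three measures drive every threshold in this section, here is a minimal pure-Python sketch of how they are computed for a rule (A,B) --> C. The toy carts and the helper names `rule_support`, `rule_confidence` and `rule_lift` are made up for illustration; apyori computes these values internally.

```python
# Toy carts, made up for illustration.
toy_carts = [
    {'A', 'B', 'C'},
    {'A', 'B'},
    {'A', 'C'},
    {'B', 'C'},
    {'C'},
]

def rule_support(itemset, carts):
    """Fraction of carts that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= cart for cart in carts) / len(carts)

def rule_confidence(antecedent, consequent, carts):
    """Share of the carts containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return rule_support(both, carts) / rule_support(antecedent, carts)

def rule_lift(antecedent, consequent, carts):
    """Confidence divided by the consequent's own support; 1.0 means independence."""
    return rule_confidence(antecedent, consequent, carts) / rule_support(consequent, carts)

supp_abc = rule_support({'A', 'B', 'C'}, toy_carts)          # 1 of 5 carts
conf_ab_c = rule_confidence({'A', 'B'}, {'C'}, toy_carts)    # 0.2 / 0.4
lift_ab_c = rule_lift({'A', 'B'}, {'C'}, toy_carts)          # 0.5 / 0.8

# Converting the 0.0017 support threshold into a cart count for the 1,582,801 carts.
min_count = 0.0017 * 1582801

print(supp_abc, conf_ab_c, lift_ab_c, min_count)
```

In this toy data the lift is below 1, so buying A and B makes C slightly less likely than its baseline; the marketing requirement asks for the opposite, a lift of at least 2.5.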

#### List of triplets

In [30]:
# Parameters for Apriori
supp = 0.0017
lift = 2.5
confi = 0.1

In [31]:
association_rules = apriori(carts, min_support=supp, min_confidence=confi, min_lift=lift)
association_rules = list(association_rules)

In [32]:
for rule in association_rules:
    support = rule.support

    # Only the first ordered statistic of each record is inspected here
    first_stat = rule.ordered_statistics[0]
    antecedent = [df_inventory.loc[int(a), 'description'] for a in first_stat.items_base]
    consequent = [df_inventory.loc[int(c), 'description'] for c in first_stat.items_add]
    rule_lift = first_stat.lift  # separate name so the `lift` threshold is not overwritten
    confidence = first_stat.confidence

    # Print only rules of the shape (A,B) --> C
    if len(antecedent) > 1:
        print(f'{antecedent}->{consequent}')
        print(f'support = {support}')
        print(f'confidence = {confidence}')
        print(f'lift = {rule_lift}')
        print("-----------------------------------------")

['Tesco Bananas Loose', 'Tesco Whole Cucumber Each']->['Tesco White Potatoes 2.5Kg']
support = 0.003285315083829237
confidence = 0.20062502411358465
lift = 5.51215068465007
-----------------------------------------
['Tesco Bananas Loose', 'Tesco Whole Cucumber Each']->['Tesco Mixed Peppers 3 Pack']
support = 0.0030926187183354065
confidence = 0.18885759481461475
lift = 5.865510074567178
-----------------------------------------
['Tesco Bananas Loose', 'Tesco Whole Cucumber Each']->['Tesco British Semi Skimmed Milk 2.272Ltr 4 Pints']
support = 0.002872123532901483
confidence = 0.17539256915776072
lift = 2.849928486351225
-----------------------------------------
['Tesco Bananas Loose', 'Tesco Whole Cucumber Each']->['Tesco Red Seedless Grapes 500G']
support = 0.0017702793970941389
confidence = 0.10810602260889696
lift = 6.039471999554733
-----------------------------------------
['Tesco Bananas Loose', 'Tesco Whole Cucumber Each']->['Tesco Broccoli 335G']
support = 0.0028127351448476467

In [38]:
t_data = []
for rule in association_rules:
    support = rule.support

    # Only the first ordered statistic of each record is inspected here
    first_stat = rule.ordered_statistics[0]
    antecedent = [df_inventory.loc[int(a), 'description'] for a in first_stat.items_base]
    consequent = [df_inventory.loc[int(c), 'description'] for c in first_stat.items_add]
    rule_lift = first_stat.lift  # separate name so the `lift` threshold is not overwritten
    confidence = first_stat.confidence

    # Looking for triplets in the shape (A,B) --> C
    if len(antecedent) > 1:
        t_data.append([antecedent, consequent, support, rule_lift, confidence])

df_filtered_rules = pd.DataFrame(t_data, columns=['Antecedent pair (A,B)', 'Consequent item C', 'Support', 'Lift', 'Confidence'])

In [39]:
df_filtered_rules_sorted = df_filtered_rules.sort_values(by='Confidence', ascending=False)
df_filtered_rules_sorted

Unnamed: 0,"Antecedent pair (A,B)",Consequent item C,Support,Lift,Confidence
8,"[Tesco Cauliflower Each, Tesco Bananas Loose]",[Tesco Broccoli 335G],0.001816,10.019565,0.290832
9,"[Tesco White Potatoes 2.5Kg, Tesco Whole Cucum...",[Tesco Mixed Peppers 3 Pack],0.001874,6.787825,0.218554
11,"[Tesco Iceberg Lettuce Each, Tesco Whole Cucum...",[Tesco Mixed Peppers 3 Pack],0.002361,6.355447,0.204633
10,"[Tesco White Potatoes 2.5Kg, Tesco Whole Cucum...",[Tesco Broccoli 335G],0.001749,7.029397,0.204038
0,"[Tesco Bananas Loose, Tesco Whole Cucumber Each]",[Tesco White Potatoes 2.5Kg],0.003285,5.512151,0.200625
6,"[Tesco White Potatoes 2.5Kg, Tesco Bananas Loose]",[Tesco British Semi Skimmed Milk 2.272Ltr 4 Pi...,0.001935,3.228263,0.198676
7,"[Tesco White Potatoes 2.5Kg, Tesco Bananas Loose]",[Tesco Broccoli 335G],0.001864,6.594319,0.191409
1,"[Tesco Bananas Loose, Tesco Whole Cucumber Each]",[Tesco Mixed Peppers 3 Pack],0.003093,5.86551,0.188858
5,"[Tesco White Potatoes 2.5Kg, Tesco Bananas Loose]",[Tesco Cauliflower Each],0.001773,8.933693,0.182066
2,"[Tesco Bananas Loose, Tesco Whole Cucumber Each]",[Tesco British Semi Skimmed Milk 2.272Ltr 4 Pi...,0.002872,2.849928,0.175393
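
The loops above read only `ordered_statistics[0]` of each record. If a record carries several statistics (for example A --> (B, C) listed before (A, B) --> C), the later ones are never inspected. Whether that drops any rules for this dataset is not checked here. The sketch below shows one way to scan every statistic; `Record` and `Stat` are stand-in namedtuples that mirror only the fields used above, so the example runs without apyori or the Tesco data, and `triplet_rules` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-ins mirroring the fields read in the loops above; assumptions made
# so that the sketch is self-contained.
Record = namedtuple('Record', ['items', 'support', 'ordered_statistics'])
Stat = namedtuple('Stat', ['items_base', 'items_add', 'confidence', 'lift'])

def triplet_rules(records):
    """Yield every (A, B) -> C rule, scanning all ordered statistics of each record."""
    for record in records:
        for stat in record.ordered_statistics:
            # Exactly two antecedent items and one consequent item.
            if len(stat.items_base) == 2 and len(stat.items_add) == 1:
                yield (sorted(stat.items_base), sorted(stat.items_add),
                       record.support, stat.lift, stat.confidence)

# Made-up record: the first statistic is A -> (B, C), the (A, B) -> C rule comes second.
toy_records = [
    Record(frozenset({'A', 'B', 'C'}), 0.002, [
        Stat(frozenset({'A'}), frozenset({'B', 'C'}), 0.12, 3.0),
        Stat(frozenset({'A', 'B'}), frozenset({'C'}), 0.25, 4.0),
    ]),
]

rules = list(triplet_rules(toy_records))
print(rules)
```

Checking only the first statistic would skip this record entirely, while the full scan recovers the (A, B) --> C rule. The same generator could be pointed at `association_rules`, with the ids mapped to descriptions afterwards.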


#### List of pairs

In [41]:
# Parameters for Apriori
supp = 0.007
lift = 1.17
confi = 0.1

In [42]:
association_rules2 = apriori(carts, min_support=supp, min_confidence=confi, min_lift=lift)
association_rules2 = list(association_rules2)

In [43]:
for rule in association_rules2:
    support = rule.support

    # Only the first ordered statistic of each record is inspected here
    first_stat = rule.ordered_statistics[0]
    antecedent = [df_inventory.loc[int(a), 'description'] for a in first_stat.items_base]
    consequent = [df_inventory.loc[int(c), 'description'] for c in first_stat.items_add]
    rule_lift = first_stat.lift  # separate name so the `lift` threshold is not overwritten
    confidence = first_stat.confidence

    print(f'{antecedent}->{consequent}')
    print(f'support = {support}')
    print(f'confidence = {confidence}')
    print(f'lift = {rule_lift}')
    print("-----------------------------------------")

['Tesco Bananas Loose']->['Tesco Whole Cucumber Each']
support = 0.01637540031880192
confidence = 0.18466350332720616
lift = 3.3890147571430838
-----------------------------------------
['Tesco Bananas Loose']->['Tesco White Potatoes 2.5Kg']
support = 0.009737168475380038
confidence = 0.10980492739993446
lift = 3.0168784199264636
-----------------------------------------
['Tesco Iceberg Lettuce Each']->['Tesco Bananas Loose']
support = 0.007078590422927456
confidence = 0.285568639445379
lift = 3.220324656113548
-----------------------------------------
['Tesco Mixed Peppers 3 Pack']->['Tesco Bananas Loose']
support = 0.007034365027568216
confidence = 0.21847222494751095
lift = 2.463686117778433
-----------------------------------------
['Tesco Bananas Loose']->['Tesco British Semi Skimmed Milk 2.272Ltr 4 Pints']
support = 0.010278613672849587
confidence = 0.11591074252981662
lift = 1.883416889302292
-----------------------------------------
['Tesco Broccoli 335G']->['Tesco Bananas Loos

In [44]:
p_data = []
for rule in association_rules2:
    support = rule.support

    # Only the first ordered statistic of each record is inspected here
    first_stat = rule.ordered_statistics[0]
    antecedent = [df_inventory.loc[int(a), 'description'] for a in first_stat.items_base]
    consequent = [df_inventory.loc[int(c), 'description'] for c in first_stat.items_add]
    rule_lift = first_stat.lift  # separate name so the `lift` threshold is not overwritten
    confidence = first_stat.confidence

    p_data.append([antecedent, consequent, support, rule_lift, confidence])

df_pairs = pd.DataFrame(p_data, columns=['Antecedent', 'Consequent', 'Support', 'Lift', 'Confidence'])

In [45]:
df_pairs = df_pairs.sort_values(by='Lift', ascending=False)
df_pairs

Unnamed: 0,Antecedent,Consequent,Support,Lift,Confidence
7,[Tesco Whole Cucumber Each],[Tesco Iceberg Lettuce Each],0.011538,8.542365,0.211746
11,[Tesco Whole Cucumber Each],[Tesco Cherry Tomatoes 330G],0.00718,6.58842,0.131764
8,[Tesco Whole Cucumber Each],[Tesco Mixed Peppers 3 Pack],0.008603,4.903643,0.157887
10,[Tesco Whole Cucumber Each],[Tesco Broccoli 335G],0.007026,4.442392,0.128947
6,[Tesco Whole Cucumber Each],[Tesco White Potatoes 2.5Kg],0.008574,4.323285,0.157354
0,[Tesco Bananas Loose],[Tesco Whole Cucumber Each],0.016375,3.389015,0.184664
5,[Tesco Broccoli 335G],[Tesco Bananas Loose],0.008498,3.301599,0.292776
2,[Tesco Iceberg Lettuce Each],[Tesco Bananas Loose],0.007079,3.220325,0.285569
1,[Tesco Bananas Loose],[Tesco White Potatoes 2.5Kg],0.009737,3.016878,0.109805
3,[Tesco Mixed Peppers 3 Pack],[Tesco Bananas Loose],0.007034,2.463686,0.218472
