# Practice Session 04: Basket analysis

Author: <font color="blue">Rubén Vera Martínez</font>

E-mail: <font color="blue">ruben.vera01@estudiant.upf.edu</font>

Date: <font color="blue">21/10/22</font>

In [1]:
import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
import csv
import gzip
from apyori import apriori

# 0. The Apriori Algorithm in a nutshell

The Apriori algorithm has three major components, which we describe below using as an example the case where transactions are purchase histories.

**Support**: the number of transactions containing a particular item divided by total number of transactions:

   *Support(A) = (Transactions containing (A))/(Total Transactions)*

**Confidence**: normally indicates the likelihood that an item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by total number of transactions where A is bought:

   *Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)*

**Lift**: the ratio between how often B is bought when A is bought and how often B is bought overall. Lift(A→B) is calculated by dividing Confidence(A→B) by Support(B):

   *Lift(A→B) = (Confidence (A→B))/(Support (B))*
   
A lift of 1 means products A and B are bought together exactly as often as expected if they were independent, that is, there is no association between them. A lift greater than 1 means A and B are bought together more often than expected. A lift less than 1 means they are bought together less often than expected.

The Apriori algorithm first finds itemsets having the desired level of support, and then within those itemsets tries to derive rules having the desired confidence and lift.
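The three definitions above can be sketched directly in Python. This is a minimal illustration, not how `apyori` computes them internally; the helper names `support`, `confidence`, and `lift`, and the `toy` transactions, are assumptions made up for this example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical toy purchase histories, used only for illustration
toy = [
    ['bread', 'milk'],
    ['bread', 'butter'],
    ['bread', 'milk', 'butter'],
    ['milk'],
]

print("support(bread) =", support(toy, ['bread']))
print("confidence(bread -> milk) =", round(confidence(toy, ['bread'], ['milk']), 4))
print("lift(bread -> milk) =", round(lift(toy, ['bread'], ['milk']), 4))
```

Here the lift of bread → milk comes out below 1, because milk appears in 3 of 4 transactions overall but in only 2 of the 3 transactions that contain bread.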


# 1. Playing with apyori

In [2]:
# LEAVE AS-IS

def print_apyori_output (association_results, info=False, info_key=False):
    for relation_record in association_results:
        itemset = list(relation_record.items)
        
        # Consider only itemsets of two elements
        if len(itemset) > 1: 
        
            print("Rules involving itemset %s" % itemset)
            support = relation_record.support

            for rules in relation_record.ordered_statistics:
                antecedent = list(rules.items_base)
                consequent = list(rules.items_add)
                
                if info_key:
                    antecedent = [info.loc[x][info_key] for x in antecedent]
                    consequent = [info.loc[x][info_key] for x in consequent]
                
                confidence = rules.confidence
                lift = rules.lift

                print("%s => %s (support=%.4f, confidence=%.2f, lift=%.2f)" %
                      (antecedent, consequent, support, confidence, lift))
            print()

In [3]:
# Define 20 toy transactions to try out the apriori algorithm
transactions = [
    ['beer', 'chips', 'nuts', 'olives'],
    ['beer', 'chips', 'olives'],
    ['chips', 'nuts' ],
    ['chips', 'olives'],
    ['beer', 'nuts' ],
    ['chips'],
    ['nuts', 'olives'],
    ['beer', 'nuts'],
    ['beer', 'chips', 'olives'], 
    ['beer', 'nuts', 'chips'], 
    ['beer', 'nuts', 'olives', 'chips'], 
    ['beer', 'nuts', 'olives', 'coke'], 
    ['beer', 'nuts', 'coke'], 
    ['coke', 'olives'], 
    ['beer', 'olives'], 
    ['coke', 'nuts', 'olives'], 
    ['coke', 'nuts', 'olives'], 
    ['nuts', 'olives'], 
    ['coke'], 
    ['coke', 'olives'], 

]
results = list(apriori(transactions, min_support=0.2, min_confidence=0.75, min_lift=1.0))

In [4]:
# Print the rules that meet the desired support, confidence, and lift
print_apyori_output(results)

# Manually verify the rule ['beer', 'chips'] => ['olives']
count_c_b = 0
count_c_b_o = 0
count_olives = 0
for transaction in transactions:
    if 'beer' in transaction and 'chips' in transaction:
        count_c_b += 1  # Transactions containing both beer and chips
    if 'beer' in transaction and 'chips' in transaction and 'olives' in transaction:
        count_c_b_o += 1  # Transactions containing beer, chips, and olives
    if 'olives' in transaction:
        count_olives += 1  # Transactions containing olives

# Support of the antecedent and of the consequent, then confidence and lift of the rule
print("support_chips_beer =", count_c_b/len(transactions))
print("support_olives =", count_olives/len(transactions))
print("confidence_chips_beer_olives =", count_c_b_o/count_c_b)
print("lift_chips_beer =", count_c_b_o/count_c_b/(count_olives/len(transactions)))

# Manually verify the rule ['chips', 'olives'] => ['beer']
count_o_c = 0
count_o_c_b = 0
count_beer = 0
for transaction in transactions:
    if 'olives' in transaction and 'chips' in transaction:
        count_o_c += 1  # Transactions containing both olives and chips
    if 'olives' in transaction and 'chips' in transaction and 'beer' in transaction:
        count_o_c_b += 1  # Transactions containing olives, chips, and beer
    if 'beer' in transaction:
        count_beer += 1  # Transactions containing beer

# Support of the antecedent and of the consequent, then confidence and lift of the rule
print("\nsupport_olives_chips =", count_o_c/len(transactions))
print("support_beer =", count_beer/len(transactions))
print("confidence_olives_chips_beer =", count_o_c_b/count_o_c)
print("lift_olives_chips =", count_o_c_b/count_o_c/(count_beer/len(transactions)))

Rules involving itemset ['beer', 'chips', 'olives']
['beer', 'chips'] => ['olives'] (support=0.2000, confidence=0.80, lift=1.23)
['chips', 'olives'] => ['beer'] (support=0.2000, confidence=0.80, lift=1.60)

support_chips_beer = 0.25
support_olives = 0.65
confidence_chips_beer_olives = 0.8
lift_chips_beer = 1.2307692307692308

support_olives_chips = 0.25
support_beer = 0.5
confidence_olives_chips_beer = 0.8
lift_olives_chips = 1.6


# 2. Load and prepare the shopping baskets

In [5]:
# LEAVE AS-IS

# File names
INPUT_PRODUCTS = "instacart-products.csv"
INPUT_TRANSACTIONS = "instacart-transactions.csv.gz"

# Read into a dataframe
products = pd.read_csv(INPUT_PRODUCTS, delimiter=",")

# Set product_id as index, and drop column aisle_id
products = products.set_index('product_id').drop(columns=['aisle_id'])

products.head(100)

| product_id | product_name | department_id |
|---|---|---|
| 1 | Chocolate Sandwich Cookies | 19 |
| 2 | All-Seasons Salt | 13 |
| 3 | Robust Golden Unsweetened Oolong Tea | 7 |
| 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 1 |
| 5 | Green Chile Anytime Sauce | 13 |
| ... | ... | ... |
| 96 | Sprinklez Confetti Fun Organic Toppings | 13 |
| 97 | Organic Chamomile Lemon Tea | 7 |
| 98 | 2% Yellow American Cheese | 16 |
| 99 | Local Living Butter Lettuce | 4 |


## 2.1. Select by department

In [6]:
# LEAVE AS-IS

DEPT_BAKERY = 3
DEPT_VEGGIES = 4
DEPT_ALCOHOL = 5
DEPT_WORLD = 6
DEPT_DRINKS = 7
DEPT_PETS = 8
DEPT_PHARMACY = 11
DEPT_CLEANING = 17
DEPT_BABIES = 18

In [7]:
def select_from_departments(products, product_id, department_id):
    # Return the ids in product_id whose department is one of department_id
    product_id_belonging = []
    # Look up the department of every requested product
    department_belonging = products.loc[product_id].department_id
    for product in product_id:
        # Keep the product if its department is one of the desired departments
        if department_belonging[product] in department_id:
            product_id_belonging.append(product)
    # The list stays empty if no product belongs to any of the desired departments
    return product_id_belonging
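The same selection can also be expressed without an explicit loop, using a boolean mask. The sketch below is an alternative under assumptions, not the function used in the rest of this notebook: `toy_products` is a made-up miniature catalogue indexed by `product_id`, and `select_from_departments_vectorized` is a hypothetical name.

```python
import pandas as pd

# Hypothetical miniature catalogue, indexed by product_id like `products`
toy_products = pd.DataFrame(
    {
        "product_name": ["Dog Treats", "Cucumber", "Toothpicks"],
        "department_id": [8, 4, 17],
    },
    index=pd.Index([21, 45, 57], name="product_id"),
)


def select_from_departments_vectorized(products, product_ids, department_ids):
    """Return the ids in product_ids whose department is in department_ids."""
    # Department of each requested product, in the order requested
    departments = products.loc[product_ids, "department_id"]
    # Boolean mask: True where the department is one of the desired ones
    mask = departments.isin(department_ids)
    return departments[mask].index.tolist()


selected = select_from_departments_vectorized(toy_products, [21, 45, 57], [8, 17])
print(selected)
```

Like the loop version, this raises a `KeyError` if any requested id is missing from the catalogue.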

In [8]:
# Test that the function selects the expected products

product_id = [21, 26, 45, 54, 57, 71, 111, 112]
department_id = [DEPT_PETS, DEPT_CLEANING]
print("\nTest Case:\n", product_id, "\n")
print("Input products:")
for product in product_id:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")
products_selected = select_from_departments(products, product_id, department_id)
print("\nSelected products from departments", department_id, ":")
for product in products_selected:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")
    
product_id = [100, 200, 300, 400, 500, 600, 700, 800]
department_id = [DEPT_PETS, DEPT_CLEANING, 13]
print("\nTest Case:\n", product_id, "\n")
print("Input products:")
for product in product_id:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")
products_selected = select_from_departments(products, product_id, department_id)
print("\nSelected products from departments", department_id, ":")
for product in products_selected:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")
    
    
product_id = [40, 50, 60, 70, 80, 90, 100, 110]
department_id = [DEPT_WORLD, DEPT_BAKERY, 13]
print("\nTest Case:\n", product_id, "\n")
print("Input products:")
for product in product_id:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")
products_selected = select_from_departments(products, product_id, department_id)
print("\nSelected products from departments", department_id, ":")
for product in products_selected:
    print(product, products.loc[product].product_name, "(dept ", products.loc[product].department_id, ")")



Test Case:
 [21, 26, 45, 54, 57, 71, 111, 112] 

Input products:
21 Small & Medium Dental Dog Treats (dept  8 )
26 Fancy Feast Trout Feast Flaked Wet Cat Food (dept  8 )
45 European Cucumber (dept  4 )
54 24/7 Performance Cat Litter (dept  8 )
57 Flat Toothpicks (dept  17 )
71 Ultra 7 Inch Polypropylene Traditional Plates (dept  17 )
111 Fabric Softener, Geranium Scent (dept  17 )
112 Hot Tomatillo Salsa (dept  13 )

Selected products from departments [8, 17] :
21 Small & Medium Dental Dog Treats (dept  8 )
26 Fancy Feast Trout Feast Flaked Wet Cat Food (dept  8 )
54 24/7 Performance Cat Litter (dept  8 )
57 Flat Toothpicks (dept  17 )
71 Ultra 7 Inch Polypropylene Traditional Plates (dept  17 )
111 Fabric Softener, Geranium Scent (dept  17 )

Test Case:
 [100, 200, 300, 400, 500, 600, 700, 800] 

Input products:
100 Peanut Butter & Strawberry Jam Sandwich (dept  1 )
200 Radiant Pantiliners Regular Wrapped Unscented (dept  11 )
300 Organic Enriched Unbleached White Flour (dept  13 )
4

## 2.2. Read and filter transactions

In [9]:
def keep_items_of_department(departments): 
    # Open the compressed file
    with gzip.open(INPUT_TRANSACTIONS, "rt") as inputfile:

        # Create a CSV reader
        reader = csv.reader(inputfile, delimiter=",")
        transactions = []
        i = 0
        # Iterate through the CSV file
        for row in reader:
            # Convert to integers
            items = [int(x) for x in row]
            # Keep only the products of this row that belong to the desired departments
            temp = select_from_departments(products, items, departments)
            # Skip rows that have no product from the desired departments
            if temp != []:
                i += 1
                transactions.append(temp)
                # Report progress every 1000 kept transactions
                if i % 1000 == 0:
                    print("Reading transaction:", i)
                # Stop once more than 5000 transactions have been kept
                if i > 5000:
                    break
                
    return transactions
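The read, filter, and stop pattern used above can be tried on a small file without the Instacart data. The following is a minimal sketch under assumptions: `read_baskets`, its `keep` and `limit` parameters, and the file name `toy-transactions.csv.gz` are all hypothetical, and the filter is a plain set membership test rather than a department lookup.

```python
import csv
import gzip
import os
import tempfile


def read_baskets(path, keep, limit):
    """Yield up to `limit` non-empty filtered baskets from a gzipped CSV file."""
    kept = 0
    with gzip.open(path, "rt", newline="") as inputfile:
        for row in csv.reader(inputfile, delimiter=","):
            # Keep only the items we are interested in
            basket = [int(x) for x in row if int(x) in keep]
            if basket:
                yield basket
                kept += 1
                # Stop as soon as exactly `limit` baskets have been produced
                if kept >= limit:
                    break


# Write a tiny hypothetical transactions file, then read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-transactions.csv.gz")
    with gzip.open(path, "wt", newline="") as out:
        csv.writer(out).writerows([[1, 2, 3], [4, 5], [2, 6], [3]])
    baskets = list(read_baskets(path, keep={2, 3}, limit=2))

print(baskets)
```

Using `kept >= limit` makes the function return exactly `limit` baskets, whereas a check of the form `i > 5000` keeps one more basket than the stated threshold.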

## 2.3. Extract association rules and comment on them (DEPT_CLEANING)

In [10]:
# Extract rules from the filtered DEPT_CLEANING transactions with the chosen support, confidence, and lift thresholds
results = list(apriori(keep_items_of_department([DEPT_CLEANING]), min_support=0.0008, min_confidence=0.2, min_lift=1.0))
print_apyori_output(results, products, 'product_name')

Reading transaction: 1000
Reading transaction: 2000
Reading transaction: 3000
Reading transaction: 4000
Reading transaction: 5000
Rules involving itemset [47865, 5047]
['Easy Open TabsBags'] => ['Quart Storage Bags'] (support=0.0012, confidence=0.23, lift=38.47)

Rules involving itemset [37357, 8021]
['Natural Laundry Detergent, Free & Clear 33'] => ['100% Recycled Paper Towels'] (support=0.0010, confidence=0.21, lift=3.84)

Rules involving itemset [31801, 21653]
['Compostable Forks'] => ['9 Inch Plates'] (support=0.0010, confidence=0.25, lift=35.72)

Rules involving itemset [41387, 21653]
['Compostable Forks'] => ['Plastic Spoons'] (support=0.0016, confidence=0.40, lift=90.93)
['Plastic Spoons'] => ['Compostable Forks'] (support=0.0016, confidence=0.36, lift=90.93)



First, the rule I would recommend is the one with the highest confidence: customers who buy Compostable Forks also tend to buy Plastic Spoons (confidence 0.40). I would have the application recommend Plastic Spoons to customers who are buying Compostable Forks, and vice versa, because the reverse rule's 36% confidence is also high enough under my criterion. My criterion is to recommend every rule with confidence above 30% and lift above 30. With a lift of 90.93, both directions of this last itemset in the output stand out clearly over every other rule.

In the code cell I set min_confidence=0.2 and min_lift=1.0, which is looser than my criterion, so that all candidate rules are visible in the output.
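The criterion described above (confidence above 30% and lift above 30) could be applied programmatically rather than by reading the printed output. The sketch below is an illustration under assumptions: the two namedtuples are stand-ins that only mirror the field names read by `print_apyori_output`, product names are used as items in place of product ids, `filter_rules` is a hypothetical helper, and the numbers are copied from the DEPT_CLEANING output above.

```python
from collections import namedtuple

# Stand-ins mirroring the fields that print_apyori_output reads
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])


def filter_rules(records, min_confidence=0.30, min_lift=30.0):
    """Keep rules whose confidence and lift both exceed the thresholds."""
    kept = []
    for record in records:
        for rule in record.ordered_statistics:
            if rule.confidence > min_confidence and rule.lift > min_lift:
                kept.append((sorted(rule.items_base), sorted(rule.items_add),
                             rule.confidence, rule.lift))
    return kept


# Values copied from the printed rules; names replace product ids for readability
sample = [
    RelationRecord(frozenset({"Easy Open TabsBags", "Quart Storage Bags"}), 0.0012, [
        OrderedStatistic(frozenset({"Easy Open TabsBags"}), frozenset({"Quart Storage Bags"}), 0.23, 38.47),
    ]),
    RelationRecord(frozenset({"Compostable Forks", "9 Inch Plates"}), 0.0010, [
        OrderedStatistic(frozenset({"Compostable Forks"}), frozenset({"9 Inch Plates"}), 0.25, 35.72),
    ]),
    RelationRecord(frozenset({"Compostable Forks", "Plastic Spoons"}), 0.0016, [
        OrderedStatistic(frozenset({"Compostable Forks"}), frozenset({"Plastic Spoons"}), 0.40, 90.93),
        OrderedStatistic(frozenset({"Plastic Spoons"}), frozenset({"Compostable Forks"}), 0.36, 90.93),
    ]),
]

recommended = filter_rules(sample)
for antecedent, consequent, conf, lift_value in recommended:
    print("%s => %s (confidence=%.2f, lift=%.2f)" % (antecedent, consequent, conf, lift_value))
```

Only the two rules between Compostable Forks and Plastic Spoons pass both thresholds, which matches the manual reading of the output.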

## 2.4. Extract association rules and comment on them (other departments)

In [11]:
results = list(apriori(keep_items_of_department([DEPT_BABIES, DEPT_BAKERY]), min_support=0.002, min_confidence=0.2, min_lift=1.0))
print_apyori_output(results, products, 'product_name')

Reading transaction: 1000
Reading transaction: 2000
Reading transaction: 3000
Reading transaction: 4000
Reading transaction: 5000
Rules involving itemset [3020, 34134]
['Broccoli & Apple Stage 2 Baby Food'] => ['Spinach Peas & Pear Stage 2 Baby Food'] (support=0.0024, confidence=0.34, lift=46.34)
['Spinach Peas & Pear Stage 2 Baby Food'] => ['Broccoli & Apple Stage 2 Baby Food'] (support=0.0024, confidence=0.32, lift=46.34)

Rules involving itemset [47888, 43875]
['Baby Food Stage 2 Blueberry Pear & Purple Carrot'] => ['Apple and Carrot Stage 2 Baby Food'] (support=0.0022, confidence=0.22, lift=41.58)
['Apple and Carrot Stage 2 Baby Food'] => ['Baby Food Stage 2 Blueberry Pear & Purple Carrot'] (support=0.0022, confidence=0.41, lift=41.58)



I chose to keep items from the babies and bakery departments.
Applying the criterion described above (confidence above 30% and lift above 30) to the output rules, I would recommend 3 of the 4 rules.
The recommendations would be as follows:
- If purchasing Broccoli & Apple Stage 2 Baby Food, I'd recommend Spinach Peas & Pear Stage 2 Baby Food, due to >30% confidence and >30 lift.
- The same recommendation in reverse, for the same reasons.
- Finally, if purchasing Apple and Carrot Stage 2 Baby Food, I'd recommend Baby Food Stage 2 Blueberry Pear & Purple Carrot (confidence 0.41, lift 41.58). The reverse rule is excluded because its confidence is only 0.22.

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>