<a href="https://colab.research.google.com/github/wolframalexa/FrequentistML/blob/master/market_basket.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Market basket assignment: Select a dataset of interest to you and perform a market basket analysis, including finding frequent itemsets and mining association rules.
# This assignment is a little more subjective than previous assignments. You can use the code from the text or any off-the-shelf method that performs the Apriori algorithm.

# code mostly taken from https://github.com/pbharrin/machinelearninginaction3x/blob/master/Ch11/Ch11.ipynb

In [2]:
# import packages

import pandas as pd
import numpy as np

In [3]:
# read data

data = 'https://raw.githubusercontent.com/wolframalexa/FrequentistML/master/datasets/groceries.csv'
dataframe = pd.read_csv(data, sep=',', header='infer')
data = dataframe.to_numpy()
data = np.delete(data, 0, 1) # delete first column containing number of items in purchase
data = np.nan_to_num(data)
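
The filtering on "nan" further down suggests that rows with fewer items are padded with missing values. As a minimal sketch of handling that padding up front (the three-row frame below is made-up illustration data, not the groceries file, and `toy` / `toy_transactions` are hypothetical names), the missing cells can be dropped row by row while building the transaction sets:

```python
import numpy as np
import pandas as pd

# toy stand-in for the groceries table (hypothetical values)
toy = pd.DataFrame({
    "count": [2, 1, 3],
    "item1": ["whole milk", "soda", "yogurt"],
    "item2": ["butter", np.nan, "whole milk"],
    "item3": [np.nan, np.nan, "curd"],
})

# drop the item-count column, then drop the NaN padding in each row
items_only = toy.drop(columns="count")
toy_transactions = [set(row.dropna()) for _, row in items_only.iterrows()]
print(toy_transactions)
```

Each transaction then contains only real items, so later steps would not need to test for "nan".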

In [4]:
# create set of one-item candidate itemsets
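def createC1(dataSet):
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            # skip missing values (NaN padding) and items already seen
            if str(item) != "nan" and [str(item)] not in C1:
                C1.append([str(item)])
    print(C1)  # debugging output: the unsorted list of one-item candidates
    C1.sort()
    # use frozenset so each candidate can be used as a key in a dict
    return list(map(frozenset, C1))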

In [5]:
# create a set of all candidate itemsets of length one
C1 = createC1(data)

[['citrus fruit'], ['semi-finished bread'], ['margarine'], ['ready soups'], ['tropical fruit'], ['yogurt'], ['coffee'], ['whole milk'], ['pip fruit'], ['cream cheese'], ['meat spreads'], ['other vegetables'], ['condensed milk'], ['long life bakery product'], ['butter'], ['rice'], ['abrasive cleaner'], ['rolls/buns'], ['UHT-milk'], ['bottled beer'], ['liquor (appetizer)'], ['potted plants'], ['cereals'], ['white bread'], ['bottled water'], ['chocolate'], ['curd'], ['flour'], ['dishes'], ['beef'], ['frankfurter'], ['soda'], ['chicken'], ['sugar'], ['fruit/vegetable juice'], ['newspapers'], ['packaged fruit/vegetables'], ['specialty bar'], ['butter milk'], ['pastry'], ['processed cheese'], ['detergent'], ['root vegetables'], ['frozen dessert'], ['sweet spreads'], ['salty snack'], ['waffles'], ['candy'], ['bathroom cleaner'], ['canned beer'], ['sausage'], ['brown bread'], ['shopping bags'], ['beverages'], ['hamburger meat'], ['spices'], ['hygiene articles'], ['napkins'], ['pork'], ['berrie

In [6]:
# find sets that meet a minimum level of support
def scanD(D, Ck, minSupport):
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if can not in ssCnt: ssCnt[can]=1
                else: ssCnt[can] += 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key]/numItems
        if support >= minSupport:
            retList.insert(0,key)
        supportData[key] = support
    return retList, supportData

In [7]:
D = list(map(set,data)) # create sets: one per transaction
L1,suppData0 = scanD(D, C1, 0.10) # output sets that meet a minimum support
print("Common items:",L1)

Common items: [frozenset({'root vegetables'}), frozenset({'soda'}), frozenset({'bottled water'}), frozenset({'rolls/buns'}), frozenset({'other vegetables'}), frozenset({'whole milk'}), frozenset({'yogurt'}), frozenset({'tropical fruit'})]
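
The support of an itemset is the fraction of transactions that contain every item in it, which is what `scanD` computes for each candidate. A small sketch on made-up baskets (`toy_baskets` and `support` are illustrative names, not part of the notebook):

```python
toy_baskets = [
    {"whole milk", "butter"},
    {"whole milk", "yogurt"},
    {"soda"},
    {"whole milk", "butter", "yogurt"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset.issubset(b) for b in baskets) / len(baskets)

milk_support = support({"whole milk"}, toy_baskets)            # 3 of 4 baskets
pair_support = support({"whole milk", "butter"}, toy_baskets)  # 2 of 4 baskets
print(milk_support, pair_support)
```

With a minimum support of 0.6, only the single item would count as frequent in this toy data.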


In [8]:
def aprioriGen(Lk, k): #creates Ck
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk):
            # sort before slicing so the first k-2 items are compared in a consistent order
            L1 = sorted(Lk[i])[:k-2]; L2 = sorted(Lk[j])[:k-2]
            if L1==L2: #if first k-2 elements are equal
                retList.append(Lk[i] | Lk[j]) #set union
    return retList

def apriori(dataSet, minSupport = 0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        Ck = aprioriGen(L[k-2], k)
        Lk, supK = scanD(D, Ck, minSupport)#scan DB to get Lk
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData
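
The join step in `aprioriGen` is meant to merge two frequent (k-1)-itemsets only when they agree on their first k-2 items in sorted order, so that each candidate is produced once. A minimal standalone sketch on toy itemsets (`join_candidates`, `toy_L2`, and `toy_C3` are hypothetical names used only for illustration):

```python
def join_candidates(frequent, k):
    """Merge (k-1)-itemsets that share their first k-2 items in sorted order."""
    ordered = [sorted(s) for s in frequent]
    candidates = []
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            if ordered[i][:k - 2] == ordered[j][:k - 2]:
                candidates.append(frozenset(ordered[i]) | frozenset(ordered[j]))
    return candidates

toy_L2 = [
    frozenset({"butter", "whole milk"}),
    frozenset({"butter", "yogurt"}),
    frozenset({"whole milk", "yogurt"}),
]
toy_C3 = join_candidates(toy_L2, 3)
print(toy_C3)
```

Only the two itemsets that start with "butter" are joined, giving the single three-item candidate once rather than three times.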

In [9]:
# generate all possible sets of all lengths
L, suppData = apriori(data, minSupport = 0.02)


In [10]:
def calcConf(freqSet, H, supportData, brl, minConf):
    prunedH = [] #create new list to return
    for conseq in H:
        conf = supportData[freqSet]/supportData[freqSet-conseq] #calc confidence
        if conf >= minConf: 
            print(freqSet-conseq,'-->',conseq,'conf:',conf)
            brl.append((freqSet-conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH
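
The confidence of a rule A --> B is support(A ∪ B) / support(A), which is the ratio `calcConf` takes from the support dictionary. A sketch with made-up support values (`toy_support` and `confidence` are illustrative names, not part of the notebook):

```python
toy_support = {
    frozenset({"butter"}): 0.5,
    frozenset({"whole milk"}): 0.75,
    frozenset({"butter", "whole milk"}): 0.5,
}

def confidence(antecedent, consequent, support):
    """conf(A --> B) = support(A | B) / support(A)."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    return support[antecedent | consequent] / support[antecedent]

conf_butter_milk = confidence({"butter"}, {"whole milk"}, toy_support)  # 0.5 / 0.5
conf_milk_butter = confidence({"whole milk"}, {"butter"}, toy_support)  # 0.5 / 0.75
print(conf_butter_milk, conf_milk_butter)
```

The two directions of the same itemset give different confidences, which is why each rule is evaluated separately.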

In [11]:
def rulesFromConseq(freqSet, H, supportData, brl, minConf):
    m = len(H[0])
    if (len(freqSet) > (m + 1)): #try further merging
        Hmp1 = aprioriGen(H, m+1)#create Hm+1 new candidates
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if (len(Hmp1) > 1):    #need at least two sets to merge
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

In [12]:
def generateRules(L, supportData, minConf):  #supportData is a dict coming from scanD
    bigRuleList = []
    for i in range(1, len(L)):#only get the sets with two or more items
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if (i > 1):
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList   

In [13]:
# generate rules
rules = generateRules(L, suppData, 0.4)

frozenset({'margarine'}) --> frozenset({'whole milk'}) conf: 0.4131944444444444
frozenset({'beef'}) --> frozenset({'whole milk'}) conf: 0.4050387596899225
frozenset({'frozen vegetables'}) --> frozenset({'whole milk'}) conf: 0.4249471458773784
frozenset({'domestic eggs'}) --> frozenset({'whole milk'}) conf: 0.47275641025641024
frozenset({'whipped/sour cream'}) --> frozenset({'whole milk'}) conf: 0.449645390070922
frozenset({'whipped/sour cream'}) --> frozenset({'other vegetables'}) conf: 0.40283687943262414
frozenset({'root vegetables'}) --> frozenset({'whole milk'}) conf: 0.44869402985074625
frozenset({'root vegetables'}) --> frozenset({'other vegetables'}) conf: 0.43470149253731344
frozenset({'tropical fruit'}) --> frozenset({'whole milk'}) conf: 0.40310077519379844
frozenset({'curd'}) --> frozenset({'whole milk'}) conf: 0.4904580152671756
frozenset({'yogurt'}) --> frozenset({'whole milk'}) conf: 0.40160349854227406
frozenset({'butter'}) --> frozenset({'whole milk'}) conf: 0.497247706

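Generating rules with a minimum confidence of 40%, we find that buying margarine, beef, frozen vegetables, domestic eggs, curd, yogurt, butter, root vegetables, tropical fruit, or whipped/sour cream is a fairly strong indicator that one will also buy whole milk: in the rules shown above, between 40% and 50% of the baskets containing one of these items also contain whole milk.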
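
Confidence alone can overstate a rule when the consequent is popular: whole milk is among the items with at least 10% support, so part of each confidence may simply reflect how often milk is bought anyway. One common check is lift, the confidence divided by the support of the consequent, where values above 1 suggest a positive association. The sketch below assumes rules in the `(antecedent, consequent, confidence)` form returned by `generateRules` and a support dictionary like `suppData`; `add_lift` is a hypothetical helper and the numbers are made up for illustration.

```python
def add_lift(rules, support):
    """Return (antecedent, consequent, confidence, lift) tuples."""
    return [(a, c, conf, conf / support[c]) for a, c, conf in rules]

# made-up values, not taken from the groceries data
toy_support = {frozenset({"whole milk"}): 0.25}
toy_rules = [(frozenset({"butter"}), frozenset({"whole milk"}), 0.5)]

toy_with_lift = add_lift(toy_rules, toy_support)
print(toy_with_lift[0][3])
```

With the notebook's variables, the equivalent call would presumably be `add_lift(rules, suppData)`.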

For supermarkets, staples like milk and eggs are known as loss leaders: they are sold at or slightly below cost in order to get customers in the door and encourage them to buy higher-margin items, which is plausibly the case for the items on the left-hand side (LHS) of the rules. The grocery store can use the loss leader to its full potential by placing it at the back of the store, so that customers have to walk past full-priced items (those on the LHS) to pick it up.

In [14]:
rules = generateRules(L, suppData, 0.3)

frozenset({'yogurt'}) --> frozenset({'other vegetables'}) conf: 0.3112244897959184
frozenset({'pip fruit'}) --> frozenset({'other vegetables'}) conf: 0.3454301075268817
frozenset({'citrus fruit'}) --> frozenset({'other vegetables'}) conf: 0.34889434889434895
frozenset({'fruit/vegetable juice'}) --> frozenset({'whole milk'}) conf: 0.36849507735583686
frozenset({'frankfurter'}) --> frozenset({'whole milk'}) conf: 0.3482758620689655
frozenset({'newspapers'}) --> frozenset({'whole milk'}) conf: 0.34267515923566877
frozenset({'margarine'}) --> frozenset({'whole milk'}) conf: 0.4131944444444444
frozenset({'pip fruit'}) --> frozenset({'whole milk'}) conf: 0.3978494623655914
frozenset({'rolls/buns'}) --> frozenset({'whole milk'}) conf: 0.30790491984521834
frozenset({'beef'}) --> frozenset({'whole milk'}) conf: 0.4050387596899225
frozenset({'sausage'}) --> frozenset({'whole milk'}) conf: 0.3181818181818182
frozenset({'frozen vegetables'}) --> frozenset({'whole milk'}) conf: 0.4249471458773784
f

These rules include all of the above plus rules whose confidence is between 30% and 40%. The lower threshold is mostly interesting because it surfaces rules whose consequent is not whole milk. For instance, 32% of baskets containing sausage also contain rolls/buns, and buying yogurt, fruit, or eggs is an indicator that one will also buy other vegetables.
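
To look only at rules that do not end in whole milk, the rule list can be filtered on its consequent. A sketch assuming the `(antecedent, consequent, confidence)` tuples returned by `generateRules`; `rules_excluding` is a hypothetical helper, and the toy rules use confidences rounded from the output above:

```python
def rules_excluding(rules, item):
    """Keep rules whose consequent does not contain `item`, highest confidence first."""
    kept = [r for r in rules if item not in r[1]]
    return sorted(kept, key=lambda r: r[2], reverse=True)

toy_rules = [
    (frozenset({"yogurt"}), frozenset({"other vegetables"}), 0.31),
    (frozenset({"newspapers"}), frozenset({"whole milk"}), 0.34),
    (frozenset({"pip fruit"}), frozenset({"other vegetables"}), 0.35),
]

non_milk = rules_excluding(toy_rules, "whole milk")
for antecedent, consequent, conf in non_milk:
    print(set(antecedent), "-->", set(consequent), "conf:", conf)
```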

The supermarket can capitalize on these associations by placing associated items far from each other, so that customers pass other products on the way from one to the other, which encourages impulse buys.