# Tutorial: Association Rules
# Market Basket Analysis

<b>Association rules</b> are normally written like this: {Diapers} -> {Beer}, which means that customers who purchased diapers also tended to purchase beer in the same transaction.

In the above example, {Diapers} is the <b>antecedent</b> and {Beer} is the <b>consequent</b>. Both antecedents and consequents can have multiple items. In other words, {Diapers, Gum} -> {Beer, Chips} is a valid rule.

<b>Support</b> is the relative frequency with which the rule shows up, that is, the fraction of all transactions that contain every item in the rule. In many instances, you may want to look for high support in order to make sure it is a useful relationship. However, there may be instances where a low support is useful if you are trying to find “hidden” relationships.

<b>Confidence</b> is a measure of the reliability of the rule. A confidence of 0.5 in the above example would mean that in 50% of the cases where Diapers and Gum were purchased, the purchase also included Beer and Chips. For product recommendation, a 50% confidence may be perfectly acceptable, but in a medical situation this level may not be high enough.

<b>Lift</b> is the ratio of the observed support to the support expected if the antecedent and the consequent were independent. The basic rule of thumb is that a lift value close to 1 means the two sides of the rule are roughly independent. Lift values > 1 mean the items appear together more often than independence would predict; such rules are generally more “interesting” and could be indicative of a useful rule pattern.
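
To make the three definitions concrete, here is a minimal plain-Python sketch that computes them for a single rule. It uses the same four toy transactions as the example below; the helper names `support` and `rule_metrics` are illustrative assumptions, not part of any library.

```python
# The four toy transactions used in the example below, stored as sets
transactions = [
    {'Milk', 'Eggs', 'Bread'},
    {'Milk', 'Eggs'},
    {'Milk', 'Bread'},
    {'Eggs', 'Apple'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return both, confidence, lift

# Metrics for the rule {Bread} -> {Milk}
sup, conf, lift = rule_metrics({'Bread'}, {'Milk'}, transactions)
print(sup, conf, round(lift, 4))
```

Bread appears in two of the four transactions and Milk is present in both of them, so the rule has support 0.5 and confidence 1.0. Dividing the confidence by the support of Milk (0.75) gives a lift of about 1.33.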

In [1]:
import warnings
warnings.filterwarnings('ignore') # We can suppress the warnings

In [2]:
# Create a list of transactions named 'dataset'; each transaction is a list of items
dataset = [['Milk', 'Eggs', 'Bread'],
['Milk', 'Eggs'],
['Milk', 'Bread'],
['Eggs', 'Apple']]

In [3]:
# Display the records
print(dataset)

[['Milk', 'Eggs', 'Bread'], ['Milk', 'Eggs'], ['Milk', 'Bread'], ['Eggs', 'Apple']]


In [4]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# Create a TransactionEncoder instance
te = TransactionEncoder()

# fit() learns the unique item labels; transform() one-hot encodes each transaction
te_array = te.fit(dataset).transform(dataset)

# Wrap the boolean array in a DataFrame with one column per item
df = pd.DataFrame(te_array, columns = te.columns_)

In [5]:
df.head()

   Apple  Bread   Eggs   Milk
0  False   True   True   True
1  False  False   True   True
2  False   True  False   True
3   True  False   True  False
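
The encoding step can also be reproduced by hand, which shows where the table above comes from. This is a sketch of the idea in plain Python, not mlxtend's actual implementation; the variable names `columns` and `encoded` are illustrative.

```python
dataset = [['Milk', 'Eggs', 'Bread'],
           ['Milk', 'Eggs'],
           ['Milk', 'Bread'],
           ['Eggs', 'Apple']]

# Collect the distinct items in sorted order; these become the columns
columns = sorted({item for transaction in dataset for item in transaction})

# One row per transaction, True where the transaction contains the item
encoded = [[item in transaction for item in columns] for transaction in dataset]

print(columns)
for row in encoded:
    print(row)
```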


In [6]:
from mlxtend.frequent_patterns import apriori

# Compute the frequent itemsets with the apriori function
frequent_itemsets_ap = apriori(df, min_support = 0.01, use_colnames = True)

In [7]:
print(frequent_itemsets_ap)

   support             itemsets
0     0.25              (Apple)
1     0.50              (Bread)
2     0.75               (Eggs)
3     0.75               (Milk)
4     0.25        (Eggs, Apple)
5     0.25        (Eggs, Bread)
6     0.50        (Milk, Bread)
7     0.50         (Milk, Eggs)
8     0.25  (Milk, Eggs, Bread)
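
For a dataset this small, the support values above can be checked by brute force. The sketch below enumerates every possible itemset and keeps those that reach `min_support`. It is not the Apriori algorithm itself, which avoids testing most candidates by pruning supersets of infrequent itemsets; the variable names are illustrative.

```python
from itertools import combinations

dataset = [['Milk', 'Eggs', 'Bread'],
           ['Milk', 'Eggs'],
           ['Milk', 'Bread'],
           ['Eggs', 'Apple']]

transactions = [set(t) for t in dataset]
items = sorted(set().union(*transactions))
min_support = 0.01

# Test every combination of items (exponential in the number of items,
# so this is only practical for tiny examples)
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        count = sum(1 for t in transactions if set(candidate) <= t)
        sup = count / len(transactions)
        if sup >= min_support:
            frequent[frozenset(candidate)] = sup

for itemset, sup in frequent.items():
    print(sorted(itemset), sup)
```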


In [8]:
from mlxtend.frequent_patterns import fpgrowth

# Compute the frequent itemsets with the fpgrowth function
frequent_itemsets_fp = fpgrowth(df, min_support = 0.01, use_colnames = True)

print(frequent_itemsets_fp)

   support             itemsets
0     0.75               (Milk)
1     0.75               (Eggs)
2     0.50              (Bread)
3     0.25              (Apple)
4     0.50         (Milk, Eggs)
5     0.50        (Milk, Bread)
6     0.25        (Eggs, Bread)
7     0.25  (Milk, Eggs, Bread)
8     0.25        (Eggs, Apple)


In [9]:
from mlxtend.frequent_patterns import association_rules

# Generate and display the rules from the apriori itemsets
rules_ap = association_rules(frequent_itemsets_ap, metric = "confidence", min_threshold = 0.8)

print(rules_ap)

     antecedents consequents  antecedent support  consequent support  support  \
0        (Apple)      (Eggs)                0.25                0.75     0.25   
1        (Bread)      (Milk)                0.50                0.75     0.50   
2  (Bread, Eggs)      (Milk)                0.25                0.75     0.25   

   confidence      lift  leverage  conviction  zhangs_metric  
0         1.0  1.333333    0.0625         inf       0.333333  
1         1.0  1.333333    0.1250         inf       0.500000  
2         1.0  1.333333    0.0625         inf       0.333333  


In [10]:
# Generate and display the rules from the fp-growth itemsets
rules_fp = association_rules(frequent_itemsets_fp, metric = "confidence", min_threshold = 0.8)

print(rules_fp)

     antecedents consequents  antecedent support  consequent support  support  \
0        (Bread)      (Milk)                0.50                0.75     0.50   
1  (Bread, Eggs)      (Milk)                0.25                0.75     0.25   
2        (Apple)      (Eggs)                0.25                0.75     0.25   

   confidence      lift  leverage  conviction  zhangs_metric  
0         1.0  1.333333    0.1250         inf       0.500000  
1         1.0  1.333333    0.0625         inf       0.333333  
2         1.0  1.333333    0.0625         inf       0.333333  
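
The remaining columns in the rule tables above (`leverage`, `conviction` and `zhangs_metric`) can be derived from the same three support values. The following is a minimal sketch that assumes the standard textbook definitions of these metrics. The helper name `extra_metrics` and the choice to return 0 when the Zhang denominator is zero are illustrative assumptions.

```python
import math

def extra_metrics(sup_a, sup_c, sup_ac):
    """Leverage, conviction and Zhang's metric from three support values.

    sup_a  : support of the antecedent
    sup_c  : support of the consequent
    sup_ac : support of antecedent and consequent together
    """
    confidence = sup_ac / sup_a
    # Leverage: observed joint support minus the support expected under independence
    leverage = sup_ac - sup_a * sup_c
    # Conviction is infinite when the rule never fails (confidence == 1)
    conviction = math.inf if confidence == 1 else (1 - sup_c) / (1 - confidence)
    # Zhang's metric rescales leverage into the range [-1, 1]
    denom = max(sup_ac * (1 - sup_a), sup_a * (sup_c - sup_ac))
    zhang = leverage / denom if denom else 0.0  # zero-denominator handling is an assumption
    return leverage, conviction, zhang

# (Bread) -> (Milk): antecedent support 0.50, consequent support 0.75, support 0.50
print(extra_metrics(0.50, 0.75, 0.50))
# (Apple) -> (Eggs): antecedent support 0.25, consequent support 0.75, support 0.25
print(extra_metrics(0.25, 0.75, 0.25))
```

With the inputs taken from the tables, the results should match the printed leverage, conviction and zhangs_metric columns up to rounding.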


## Perform Market Basket Analysis for the following dataset

In [11]:
example1 = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

## Load the dataset

In [12]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Load the dataset
df = pd.read_excel('Online_Retail.xlsx')

# Display the records
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


There is a little cleanup we need to do. First, some of the descriptions have leading or trailing spaces that need to be removed. We also drop the rows that don’t have invoice numbers and remove the credit transactions (those with invoice numbers containing C).

In [13]:
# Remove any whitespace at the start and end of each description
df['Description'] = df['Description'].str.strip()

# Drop rows with a missing value in the 'InvoiceNo' column
df.dropna(axis = 0, subset = ['InvoiceNo'], inplace = True)

# Cast the invoice numbers to strings so that string methods can be applied
df['InvoiceNo'] = df['InvoiceNo'].astype('str')

# str.contains('C') flags credit transactions; ~ inverts the mask to keep the rest
df = df[~df['InvoiceNo'].str.contains('C')]

df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
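
The effect of the three cleanup steps is easier to see on a few rows where each step actually changes something. The sketch below applies the same logic in plain Python to invented rows; the records in `rows` are hypothetical and are not taken from the retail file.

```python
# Hypothetical raw rows, shaped like the retail data but invented for illustration
rows = [
    {'InvoiceNo': 536365, 'Description': ' WHITE METAL LANTERN '},
    {'InvoiceNo': 'C536379', 'Description': 'Discount'},
    {'InvoiceNo': None, 'Description': 'UNKNOWN ITEM'},
]

cleaned = []
for row in rows:
    if row['InvoiceNo'] is None:      # drop rows without an invoice number
        continue
    invoice = str(row['InvoiceNo'])   # cast to string so substring tests work
    if 'C' in invoice:                # skip credit transactions
        continue
    cleaned.append({'InvoiceNo': invoice,
                    'Description': row['Description'].strip()})

print(cleaned)
```

Only the first row survives: its invoice number becomes a string and the padding around its description is removed.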


After the cleanup, we need to consolidate the items into one transaction per row, with each product one-hot encoded. For the sake of keeping the data set small, we look only at sales for France. In additional code below, we compare these results to sales from Germany. Further country comparisons would be interesting to investigate.

In [14]:
basket = (df[df['Country'] == "France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

In [15]:
basket.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
536370,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
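
The `groupby(...).sum().unstack().fillna(0)` chain is compact, so it may help to see the same reshaping spelled out. The sketch below does it in plain Python on invented line items. The invoice labels, product names and quantities are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (invoice, description, quantity) line items
line_items = [
    ('INV-1', 'PRODUCT A', 24),
    ('INV-1', 'PRODUCT B', 12),
    ('INV-1', 'PRODUCT A', 6),    # same product twice on one invoice
    ('INV-2', 'PRODUCT B', 3),
]

# Sum quantities per (invoice, product): the groupby(...).sum() step
totals = defaultdict(float)
for invoice, description, quantity in line_items:
    totals[(invoice, description)] += quantity

# Spread products across columns and fill missing combinations with 0:
# the unstack() and fillna(0) steps
invoices = sorted({inv for inv, _ in totals})
products = sorted({desc for _, desc in totals})
basket_demo = {inv: {p: totals.get((inv, p), 0.0) for p in products}
               for inv in invoices}

for inv, row in basket_demo.items():
    print(inv, row)
```

Each invoice ends up as one row with a summed quantity per product, and products that were not on the invoice get 0.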


There are a lot of zeros in the data, but we also need to make sure any positive values are converted to 1 and anything less than or equal to 0 is set to 0. This step completes the one-hot encoding of the data. The line that would remove the POSTAGE column (a charge rather than a product we wish to explore) is commented out below, so POSTAGE still appears in the results for France.

In [16]:
# Map a summed quantity to a 0/1 purchase flag
def encode_units(x):
    # A quantity of 1 or more counts as a purchase; anything else does not
    if x >= 1:
        return 1
    return 0

basket_sets = basket.applymap(encode_units)
#basket_sets.drop('POSTAGE', inplace = True, axis = 1)

Now that the data is structured properly, we can generate frequent item sets that have a support of at least 7% (this number was chosen so that we could get enough useful examples):

## Apriori Rule for Market Basket Analysis

In [17]:
frequent_itemsets = apriori(basket_sets, min_support = 0.07, use_colnames = True)



The final step is to generate the rules with their corresponding support, confidence and lift:

In [18]:
rules = association_rules(frequent_itemsets, metric = "lift", min_threshold = 1)
rules.head()

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE PINK),0.096939,0.102041,0.07398,0.763158,7.478947,0.064088,3.791383,0.959283
1,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE GREEN),0.102041,0.096939,0.07398,0.725,7.478947,0.064088,3.283859,0.964734
2,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.094388,0.079082,0.815789,8.642959,0.069932,4.916181,0.979224
3,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,0.069932,5.568878,0.976465
4,(POSTAGE),(ALARM CLOCK BAKELIKE GREEN),0.765306,0.096939,0.084184,0.11,1.134737,0.009996,1.014676,0.505929


For instance, we can see that there are quite a few rules with a high lift value, which means that the items occur together more frequently than would be expected given the number of transactions and product combinations. We can also see several rules where the confidence is high as well. This part of the analysis is where domain knowledge comes in handy.
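
As a sanity check, the confidence and lift of a rule can be recomputed from its support columns. The sketch below uses the values printed in row 0 of the table above, (ALARM CLOCK BAKELIKE GREEN) -> (ALARM CLOCK BAKELIKE PINK), and also evaluates the reverse rule. The variable names are illustrative.

```python
# Values copied from row 0 of the rules table above
antecedent_support = 0.096939   # ALARM CLOCK BAKELIKE GREEN
consequent_support = 0.102041   # ALARM CLOCK BAKELIKE PINK
support = 0.07398               # both items in the same invoice

confidence = support / antecedent_support
lift = confidence / consequent_support

# The reverse rule divides by the other item's support, so its confidence
# differs, while lift is symmetric in antecedent and consequent.
reverse_confidence = support / consequent_support
reverse_lift = reverse_confidence / antecedent_support

print(round(confidence, 3), round(reverse_confidence, 3))
print(round(lift, 2), round(reverse_lift, 2))
```

The two confidences should agree with rows 0 and 1 of the table up to rounding. Both directions give the same lift, which is why rules appear in pairs when filtering on lift.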

We can filter the dataframe using standard pandas code. In this case, look for a large lift (at least 6) and high confidence (at least 0.8).

In [19]:
rules[ (rules['lift'] >= 6) &
       (rules['confidence'] >= 0.8) ]

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
2,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.094388,0.079082,0.815789,8.642959,0.069932,4.916181,0.979224
3,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,0.069932,5.568878,0.976465
75,(SET/6 RED SPOTTY PAPER PLATES),(SET/20 RED RETROSPOT PAPER NAPKINS),0.127551,0.132653,0.102041,0.8,6.030769,0.085121,4.336735,0.95614
76,(SET/6 RED SPOTTY PAPER CUPS),(SET/6 RED SPOTTY PAPER PLATES),0.137755,0.127551,0.122449,0.888889,6.968889,0.104878,7.852041,0.993343
77,(SET/6 RED SPOTTY PAPER PLATES),(SET/6 RED SPOTTY PAPER CUPS),0.127551,0.137755,0.122449,0.96,6.968889,0.104878,21.556122,0.981725
78,"(POSTAGE, ALARM CLOCK BAKELIKE GREEN)",(ALARM CLOCK BAKELIKE RED),0.084184,0.094388,0.071429,0.848485,8.989353,0.063483,5.977041,0.970454
79,"(POSTAGE, ALARM CLOCK BAKELIKE RED)",(ALARM CLOCK BAKELIKE GREEN),0.086735,0.096939,0.071429,0.823529,8.495356,0.063021,5.117347,0.966081
115,"(SET/6 RED SPOTTY PAPER CUPS, POSTAGE)",(SET/6 RED SPOTTY PAPER PLATES),0.117347,0.127551,0.102041,0.869565,6.817391,0.087073,6.688776,0.966763
116,"(POSTAGE, SET/6 RED SPOTTY PAPER PLATES)",(SET/6 RED SPOTTY PAPER CUPS),0.107143,0.137755,0.102041,0.952381,6.91358,0.087281,18.107143,0.958
118,(SET/6 RED SPOTTY PAPER PLATES),"(SET/6 RED SPOTTY PAPER CUPS, POSTAGE)",0.127551,0.117347,0.102041,0.8,6.817391,0.087073,4.413265,0.97807


Looking at the rules, it seems that the green and red alarm clocks are purchased together, and that the red paper cups, napkins and plates are purchased together, more often than their individual purchase frequencies would suggest.

At this point, you may want to look at how much opportunity there is to use the popularity of one product to drive sales of another. For instance, we can see that we sell 340 green alarm clocks but only 316 red alarm clocks, so maybe we can drive more red alarm clock sales through recommendations.

In [20]:
print(basket['ALARM CLOCK BAKELIKE GREEN'].sum())
print(basket['ALARM CLOCK BAKELIKE RED'].sum())

340.0
316.0


What is also interesting is to see how the combinations vary by country of purchase. Let’s check out what some popular combinations might be in Germany.

In [21]:
basket2 = (df[df['Country'] == "Germany"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

basket_sets2 = basket2.applymap(encode_units)
basket_sets2.drop('POSTAGE', inplace = True, axis = 1)
frequent_itemsets2 = apriori(basket_sets2, min_support = 0.05, use_colnames = True)
rules2 = association_rules(frequent_itemsets2, metric = "lift", min_threshold = 1)

rules2[ (rules2['lift'] >= 4) &
        (rules2['confidence'] >= 0.5)]



,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
1,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN WOODLAND ANIMALS),0.115974,0.137856,0.067834,0.584906,4.242887,0.051846,2.076984,0.86458
6,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN WOODLAND ANIMALS),0.107221,0.137856,0.061269,0.571429,4.145125,0.046488,2.01167,0.849877
10,(RED RETROSPOT CHARLOTTE BAG),(WOODLAND CHARLOTTE BAG),0.070022,0.126915,0.059081,0.84375,6.648168,0.050194,5.587746,0.913551


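The results above suggest that, in Germany, the Plasters in Tin designs tend to be bought together: customers who buy Circus Parade or Spaceboy often also buy Woodland Animals. The strongest association in this table is between the Red Retrospot and Woodland Charlotte bags, with a lift of about 6.6.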

## FP-growth algorithm for Market Basket Analysis

In [22]:
# Compute the frequent itemsets for the French baskets with the fpgrowth function
frequent_itemsets_fp1 = fpgrowth(basket_sets, min_support = 0.05, use_colnames = True)

print(frequent_itemsets_fp1)

      support                                           itemsets
0    0.765306                                          (POSTAGE)
1    0.181122                    (RED TOADSTOOL LED NIGHT LIGHT)
2    0.158163               (ROUND SNACK BOXES SET OF4 WOODLAND)
3    0.125000                               (SPACEBOY LUNCH BOX)
4    0.104592                           (MINI PAINT SET VINTAGE)
..        ...                                                ...
190  0.061224  (POSTAGE, LUNCH BAG RED RETROSPOT, LUNCH BAG A...
191  0.063776  (CHILDRENS CUTLERY DOLLY GIRL, CHILDRENS CUTLE...
192  0.053571             (POSTAGE, CHARLOTTE BAG APPLES DESIGN)
193  0.058673                        (JUMBO BAG APPLES, POSTAGE)
194  0.165816                      (RABBIT NIGHT LIGHT, POSTAGE)

[195 rows x 2 columns]




In [23]:
# Generate and display the rules from the fp-growth itemsets
rules_fp1 = association_rules(frequent_itemsets_fp1, metric = "confidence", min_threshold = 0.8)

print(rules_fp1)

                                          antecedents  \
0                     (RED TOADSTOOL LED NIGHT LIGHT)   
1                (ROUND SNACK BOXES SET OF4 WOODLAND)   
2                         (ALARM CLOCK BAKELIKE PINK)   
3                        (ALARM CLOCK BAKELIKE GREEN)   
4   (ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKEL...   
..                                                ...   
83  (LUNCH BAG RED RETROSPOT, LUNCH BAG APPLE DESIGN)   
84                     (CHILDRENS CUTLERY DOLLY GIRL)   
85                       (CHILDRENS CUTLERY SPACEBOY)   
86                                 (JUMBO BAG APPLES)   
87                               (RABBIT NIGHT LIGHT)   

                       consequents  antecedent support  consequent support  \
0                        (POSTAGE)            0.181122            0.765306   
1                        (POSTAGE)            0.158163            0.765306   
2                        (POSTAGE)            0.102041            0.765306   
3  

## References
- https://pbpython.com/market-basket-analysis.html
- http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/