# Market Basket Analysis: Association Rule Mining

<img src='pic/marketbasket.png' width = 400>

In this era, a company can hardly rely on selling one product at a time to make a profit. You may have heard of the classic [Beer and Diapers](https://tdwi.org/articles/2016/11/15/beer-and-diapers-impossible-correlation.aspx) case: companies that understand how customers behave while shopping can outperform their competitors. While some links can be guessed by intuition (apples and oranges, for example, are both fruits), other relationships are hard to detect and can be exceptionally surprising, as in the Beer and Diapers example.

**`Market Basket Analysis`** is one of the key techniques used by large retailers to uncover associations between items. Say that a person who buys bread has a 60% likelihood of also buying jam, but only a 20% likelihood of also purchasing melons. In this case, you should place bread next to jam rather than next to melons. A company should therefore look for associations between items and products that can be sold together, which can generate higher "marginal profits" per order.

**`Association Rule Mining`** is the method used for Market Basket Analysis and is all about building rules that uncover associations between items. If we know that "when a customer purchases A, then he/she is likely to buy B as well," we can describe the relationship between the two items. The first item or group of items is called the **antecedent**, and the latter one is called the **consequent**. This relationship is known as **Single Cardinality**.

The outcome of this type of technique is, in simple terms, a set of rules that can be understood as **“if this, then that”**. See the following example:

* **If** A occurs, **then** B will be likely to occur.
$$ A⇒B $$



* **If** customer purchases C, **then** he/she is likely to buy D as well.
$$ C⇒D $$


There are two important things to keep in mind:

* **Correlation doesn't imply causality**: A rule only describes a co-occurrence pattern (correlation). It says that the items tend to appear in the same basket, not that buying one causes the purchase of the other.

* **Directional**: Note that the two directions of a rule are not the same! The relationship between items is not bidirectional, which means:

$$ (A ⇒ B) \neq (B ⇒ A) $$

In this notebook, I will dive deep into Market Basket Analysis and its core algorithm, the Apriori Algorithm. We will use the [UCI Online Retail Dataset](https://data.world/uci/online-retail) and the [Kaggle's Instacart Market Basket Analysis](https://www.kaggle.com/c/instacart-market-basket-analysis/data) dataset, both of which are available on the Internet.

---

## Metrics for Association Rule Mining

Since the number of possible rules grows very quickly with the number of items, we use several metrics to help measure the strength of an association. The three metrics are:

* **`Support`**: the frequency with which the combination of items is bought, as a fraction of all $N$ transactions. With this metric, we can filter out items that are bought less frequently.

$$Support = \frac{freq(A,B)}{N}$$

* **`Confidence`**: measures how often B is bought in the transactions that contain A. Intuitively, if confidence is low, the rule is not worth discussing, since customers who buy A rarely buy B as well.

$$Confidence = \frac{freq(A,B)}{freq(A)}$$

Normally we set a minimum **Support** and **Confidence** to detect valuable relationships (rules) rather than the full, but possibly trivial, set. On top of these, a third metric can help.

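* **`Lift`**: measures the strength of a rule relative to what independence would predict.

$$Lift = \frac{Support(A,B)}{Support(A)  *  Support(B)}$$

The denominator is the probability that A and B occur together if they were independent. If $Support(A) * Support(B)$ is large relative to $Support(A,B)$, the co-occurrence is explained mostly by chance rather than by an association. A lift above 1 indicates that A and B appear together more often than expected under independence, a lift of 1 indicates independence, and a lift below 1 indicates that they appear together less often than expected. Therefore, we can use lift as the final judge to decide whether a rule is worth considering.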
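Lift can also be written in terms of confidence, which may make the comparison against independence easier to see. The short derivation below uses only the definitions in this section, with $Support(A) = freq(A)/N$ and $Support(B) = freq(B)/N$ for single items:

$$Confidence(A ⇒ B) = \frac{freq(A,B)}{freq(A)} = \frac{freq(A,B)/N}{freq(A)/N} = \frac{Support(A,B)}{Support(A)}$$

$$Lift(A ⇒ B) = \frac{Support(A,B)}{Support(A) * Support(B)} = \frac{Confidence(A ⇒ B)}{Support(B)}$$

Swapping A and B leaves the first expression for lift unchanged, so lift is symmetric even though confidence is not.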

---

# Example for Association Rules Mining

The Apriori Algorithm uses **frequent item sets** (item sets whose **support value is greater than the threshold value**) to generate association rules. It is based on the concept that **a subset of a frequent itemset must also be a frequent itemset**.

For example, if $\{A, B\}$ is a frequent item set, then $\{A\}$ and $\{B\}$ are each frequent item sets as well. Let's see the example below.

TransactionId | Items
:------:|:-------:
Trans 1| (oranges,apples,peaches,bananas)
Trans 2| (oranges,bananas,watermelons)
Trans 3| (oranges,apples)
Trans 4| (bananas)
Trans 5| (apples,bananas)
Trans 6| (oranges,watermelons)
Trans 7| (bananas,watermelons)

In the table above we can see seven transactions from a fruit store, with the items purchased in each transaction. We can represent the set of all items as follows:

$$ I = \{i_1, i_2,...,i_n\}$$

In this example it corresponds to 

$$ I = \{oranges, apples, peaches, bananas, watermelons\}$$

An association rule is written in the following format:
    
$$\{oranges,apples\} ⇒\{bananas\}$$


Now let's calculate the metrics discussed above.


$$Support(oranges ⇒ apples) = \frac{2}{7} = 28.6\%$$


$$Confidence(oranges ⇒ apples) = \frac{2/7}{4/7} = 50\%$$


$$Lift(bananas ⇒ watermelons) = \frac{2/7}{(5/7)*(3/7)} = 0.93$$
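The three figures above can be checked with a few lines of plain Python. This is a minimal sketch: the helper names `support`, `confidence` and `lift` are mine, and the transactions are copied from the table. It also evaluates the reverse rule (apples ⇒ oranges) to show that confidence depends on direction.

```python
# Transactions copied from the fruit-store table
transactions = [
    {"oranges", "apples", "peaches", "bananas"},
    {"oranges", "bananas", "watermelons"},
    {"oranges", "apples"},
    {"bananas"},
    {"apples", "bananas"},
    {"oranges", "watermelons"},
    {"bananas", "watermelons"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence of the rule divided by support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)


print(round(support({"oranges", "apples"}), 3))       # 0.286
print(round(confidence({"oranges"}, {"apples"}), 3))  # 0.5
print(round(confidence({"apples"}, {"oranges"}), 3))  # 0.667
print(round(lift({"bananas"}, {"watermelons"}), 3))   # 0.933
```

The lift just below 1 suggests that bananas and watermelons appear together slightly less often than independence would predict in this tiny sample.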

---

# Apriori Algorithm

The Apriori algorithm relies on the property that any subset of a frequent itemset must be frequent. It is the main algorithm behind Market Basket Analysis.

The Apriori Algorithm works in the following two steps:

1. **`Generating Frequent Itemsets`**: Scan the transactions to find all frequent itemsets. Normally we filter out less valuable itemsets, defined as those whose support is lower than a pre-determined hyperparameter, min_support.


2. **`Generating Associations`**: With all frequent itemsets at hand, the algorithm lists all the associations that can be formed from them. During this process it also calculates Support and Confidence and prunes unwanted associations that don't satisfy our minimum constraints.
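The two steps can be sketched in plain Python on the fruit-store transactions from the earlier example. This is an illustrative, unoptimized sketch and not the implementation of any particular library: the function names `apriori_sketch` and `rules_sketch` and the thresholds 0.28 and 0.6 are assumptions chosen for the demonstration.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Level-wise frequent itemset search; returns {frozenset: support}."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]): supp(frozenset([i])) for i in items}
    current = {s: v for s, v in current.items() if v >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = {c: supp(c) for c in candidates}
        current = {c: v for c, v in current.items() if v >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def rules_sketch(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for rules above the threshold."""
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is frequent, so the lookup is safe
                conf = s / frequent[antecedent]
                if conf >= min_confidence:
                    yield antecedent, itemset - antecedent, conf


transactions = [
    {"oranges", "apples", "peaches", "bananas"},
    {"oranges", "bananas", "watermelons"},
    {"oranges", "apples"},
    {"bananas"},
    {"apples", "bananas"},
    {"oranges", "watermelons"},
    {"bananas", "watermelons"},
]

frequent = apriori_sketch(transactions, min_support=0.28)
rules = list(rules_sketch(frequent, min_confidence=0.6))

for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), round(conf, 3))
```

The prune step is where the "subset of a frequent itemset" property pays off: a candidate such as {apples, bananas, watermelons} is discarded without counting, because {apples, watermelons} is not frequent.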

### Notebook Setting

In [1]:
import numpy as np
import pandas as pd
import glob
import sys
import warnings
from itertools import combinations, groupby
from collections import Counter
warnings.filterwarnings('ignore')


### Apriori with packages

In [2]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

First we perform some basic data cleaning to prepare a workable dataset for our Market Basket Analysis.

In [3]:
data = pd.read_csv('Online_Retail.csv')
data.dropna(inplace=True)  # drop rows with missing values
data = data[~data['InvoiceNo'].str.contains('C')]  # drop cancelled invoices (their InvoiceNo contains 'C')
data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010/12/1 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010/12/1 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010/12/1 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010/12/1 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011/12/9 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011/12/9 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011/12/9 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011/12/9 12:50,4.15,12680.0,France


In this dataset, `InvoiceNo` is the order ID and `StockCode` identifies the product, with a detailed description in `Description`. Let's reorganize our dataset so that **orders are rows** and **products are columns**.

In [4]:
data = data.groupby(['InvoiceNo','Description'])['Quantity'].sum().unstack().fillna(0)
data

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536366,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536368,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536369,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,12.0,0.0,0.0,0.0,24.0,0.0,0.0
581586,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Note that an order may contain several units of the same product. For each order we only need to know whether or not a product was purchased; the quantity is not considered in this analysis. Therefore, we apply the following transformation to clean our dataset again.

In [5]:
def encoding(x):
    # 1 if the product appears in the order, 0 otherwise (quantity is ignored)
    return 0 if x <= 0 else 1

basket = data.applymap(encoding)
basket.drop('POSTAGE', inplace=True, axis=1) # We drop "POSTAGE" since it's not a real product.
basket

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536366,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536367,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581583,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581584,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581585,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
581586,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
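The reshaping and encoding steps can be tried on a toy frame to see what each one does. The sketch below uses hypothetical invoices (not rows from the Online Retail file) and a vectorized comparison in place of the element-wise function. To my knowledge mlxtend's `apriori` also accepts a boolean frame like this one, but treat that as an assumption to verify against its documentation.

```python
import pandas as pd

# Hypothetical toy invoices, in the same long format as the raw data
toy = pd.DataFrame({
    "InvoiceNo":   ["1", "1", "1", "2", "2", "3"],
    "Description": ["MUG", "MUG", "PLATE", "MUG", "BOWL", "PLATE"],
    "Quantity":    [2, 1, 4, 6, 1, 3],
})

# Long format -> one row per invoice, one column per product, quantities summed
wide = toy.groupby(["InvoiceNo", "Description"])["Quantity"].sum().unstack().fillna(0)

# Vectorized encoding: True where the product was bought at least once
toy_basket = wide > 0
print(toy_basket)
```

Invoice 1 lists MUG twice, and the `sum()` collapses those lines into a single quantity of 3 before the comparison turns it into `True`.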


From the above we can see that this dataset includes 18,536 orders and 3,876 items. Some items are purchased often and form the **frequent itemsets**, while others are purchased less often. As discussed at the beginning of this notebook, we will set a **minimum support value** to consider only products that are often purchased.

The following is how we conduct Market Basket Analysis via the mlxtend Apriori Algorithm. First, we construct the frequent itemsets using the **apriori** function, with the **minimum support** specified. Then, we derive the rules from these itemsets using the **association_rules** function.

In [6]:
frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True, verbose=1)
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
rules.head()

Processing 15 combinations | Sampling itemset size 5 432


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(60 CAKE CASES DOLLY GIRL DESIGN),(PACK OF 72 RETROSPOT CAKE CASES),0.019098,0.055514,0.010196,0.533898,9.617433,0.009136,2.026353
1,(PACK OF 72 RETROSPOT CAKE CASES),(60 CAKE CASES DOLLY GIRL DESIGN),0.055514,0.019098,0.010196,0.183673,9.617433,0.009136,1.201605
2,(SET OF 20 VINTAGE CHRISTMAS NAPKINS),(60 CAKE CASES VINTAGE CHRISTMAS),0.026381,0.02514,0.010088,0.382413,15.211178,0.009425,1.578498
3,(60 CAKE CASES VINTAGE CHRISTMAS),(SET OF 20 VINTAGE CHRISTMAS NAPKINS),0.02514,0.026381,0.010088,0.401288,15.211178,0.009425,1.626188
4,(60 TEATIME FAIRY CAKE CASES),(72 SWEETHEART FAIRY CAKE CASES),0.035445,0.027028,0.011977,0.3379,12.501609,0.011019,1.469522


In [7]:
rules.sample(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
525,(REGENCY CAKESTAND 3 TIER),(SET OF 3 REGENCY CAKE TINS),0.091929,0.038735,0.013487,0.146714,3.787582,0.009926,1.126544
429,(PACK OF 72 RETROSPOT CAKE CASES),(LUNCH BAG RED RETROSPOT),0.055514,0.069486,0.010628,0.191448,2.755187,0.006771,1.15084
69,(STRAWBERRY CHARLOTTE BAG),(CHARLOTTE BAG PINK POLKADOT),0.026543,0.026273,0.011275,0.424797,16.168445,0.010578,1.69284
308,(LUNCH BAG VINTAGE LEAF DESIGN),(JUMBO BAG VINTAGE LEAF),0.032208,0.041649,0.011275,0.350084,8.405638,0.009934,1.474577
520,(ROSES REGENCY TEACUP AND SAUCER ),(REGENCY CAKESTAND 3 TIER),0.042242,0.091929,0.022659,0.536398,5.834907,0.018775,1.958731
816,"(LUNCH BAG SUKI DESIGN , LUNCH BAG CARS BLUE)",(LUNCH BAG RED RETROSPOT),0.02104,0.069486,0.012786,0.607692,8.745485,0.011324,2.371897
402,(LUNCH BAG DOLLY GIRL DESIGN),(LUNCH BAG SUKI DESIGN ),0.032262,0.0485,0.013757,0.426421,8.792155,0.012192,1.658883
230,(JUMBO BAG STRAWBERRY),(JUMBO BAG PINK POLKADOT),0.035283,0.04699,0.015807,0.448012,9.534276,0.014149,1.726506
470,(PACK OF 72 RETROSPOT CAKE CASES),(SET/20 RED RETROSPOT PAPER NAPKINS ),0.055514,0.039491,0.011491,0.206997,5.241664,0.009299,1.21123
797,"(LUNCH BAG PINK POLKADOT, LUNCH BAG CARS BLUE)",(LUNCH BAG SUKI DESIGN ),0.023036,0.0485,0.011114,0.482436,9.947081,0.009996,1.838418


Here we see two new metrics that we haven't discussed before: **leverage** and **conviction**.

The difference between **Lift** and **Leverage** is that **Lift** calculates the ratio and **Leverage** computes the difference.

$$Lift = \frac{Support(A,B)}{Support(A)  *  Support(B)}$$

$$Leverage = Support(A,B) - Support(A) * Support(B)$$

**Conviction** is a new metric that can be interpreted as the ratio of the expected frequency that A occurs without B if A and B were independent (that is to say, the frequency with which the rule would make an incorrect prediction) to the observed frequency of incorrect predictions.

$$Conviction = \frac{1 - Support(B)}{1 - Confidence(A,B)}$$
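As a sanity check, the formulas above can be applied to the three support values in the first row of the rules table (antecedent support 0.019098, consequent support 0.055514, joint support 0.010196). The helper `rule_metrics` is a hypothetical name of mine, not an mlxtend function. Because the table values are rounded, the results should agree with that row only up to rounding.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Compute confidence, lift, leverage and conviction from three supports."""
    confidence = support_ab / support_a
    lift = support_ab / (support_a * support_b)
    leverage = support_ab - support_a * support_b
    # Conviction is unbounded when the rule never fails (confidence == 1)
    conviction = (1 - support_b) / (1 - confidence) if confidence < 1 else float("inf")
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports copied from row 0 of the mlxtend output
metrics = rule_metrics(support_a=0.019098, support_b=0.055514, support_ab=0.010196)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Lift and leverage compare the same two quantities, the observed joint support and the support expected under independence: one as a ratio, the other as a difference.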

### Problem with the mlxtend Apriori Algorithm

The mlxtend Apriori implementation is not well optimized and can barely work on a large dataset. The key problem is that it requires a matrix with orders as rows and products as columns. For a large dataset, this matrix cannot fit into 16GB of RAM and thus cannot be processed on many computers. In addition, according to the linked discussion, the implementation does not apply the **Prune** technique and generates too many unused combinations. For a detailed discussion, you can refer to [this github issue](https://github.com/rasbt/mlxtend/issues/644) for the mlxtend apriori algorithm.

What can we do then? We need a way to build the Apriori Algorithm from scratch, without constructing the matrix mentioned above.
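To get a feel for the size of that matrix, here is a back-of-the-envelope estimate using the order and item counts that this notebook reports for the Instacart data. The bytes-per-cell figures are assumptions (1 byte for a NumPy boolean, 8 bytes for a float64), and the estimate ignores any index or other overhead.

```python
n_orders = 3_214_874  # unique orders reported for the Instacart data
n_items = 49_677      # unique items reported for the Instacart data

cells = n_orders * n_items

dense_bool_gb = cells * 1 / 1e9   # assuming 1 byte per cell
dense_float_gb = cells * 8 / 1e9  # assuming 8 bytes per cell

print(f"cells:          {cells:,}")
print(f"bool matrix:    ~{dense_bool_gb:,.0f} GB")
print(f"float64 matrix: ~{dense_float_gb:,.0f} GB")
```

Even at one byte per cell the dense matrix needs roughly ten times the 16GB mentioned above, which is why the next section works on the long order-item series instead.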

---

### Apriori from Scratch

In [8]:
orders = pd.read_csv('order_products__prior.csv')

# Convert from DataFrame to a Series, with order_id as index and item_id as value
orders = orders.set_index('order_id')['product_id'].rename('item_id').sort_index()
orders.head(10)

order_id
2    33120
2    28985
2     9327
2    45918
2    30035
2    17794
2    40141
2     1819
2    43668
3    33754
Name: item_id, dtype: int64

In [9]:
def size(obj):
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))

In [10]:
print(f'''dimensions: {orders.shape};
size: {size(orders)};   
unique_orders: {len(orders.index.unique())};
unique_items: {len(orders.value_counts())}''')

dimensions: (32434489,);
size: 518.95 MB;   
unique_orders: 3214874;
unique_items: 49677


### Algorithm From Scratch

In [11]:
# Returns frequency counts for items and item pairs
def freq(iterable):
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else:
        return pd.Series(Counter(iterable)).rename("freq")

    
# Returns number of unique orders
def order_count(order_item):
    return len(set(order_item.index))


# Returns generator that yields item pairs, one at a time
def get_item_pairs(order_item):
    order_item = order_item.reset_index().values
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
        for item_pair in combinations(item_list, 2):
            yield item_pair
                   
            

# Returns frequency and support associated with item
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


# Returns name associated with item
def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]
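The pair generator is the piece that avoids the big matrix, so it is worth seeing on a tiny input. The sketch below uses hypothetical `(order_id, item_id)` rows and a simplified generator named `pairs_per_order` (my name, not one of the notebook's functions) built on the same `groupby` and `combinations` idea.

```python
from collections import Counter
from itertools import combinations, groupby

# Hypothetical (order_id, item_id) rows, already sorted by order_id
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"), (3, "c")]


def pairs_per_order(sorted_rows):
    """Lazily yield every 2-item combination within each order."""
    for _, group in groupby(sorted_rows, key=lambda row: row[0]):
        items = [item for _, item in group]
        yield from combinations(items, 2)


pair_counts = Counter(pairs_per_order(rows))
print(pair_counts)
```

Two details matter here. `itertools.groupby` only groups consecutive rows, so the input has to be sorted by order ID, which the notebook does with `sort_index()`. Also, `combinations` yields ordered tuples, so `("a", "b")` and `("b", "a")` would be counted as different pairs if the same two items appeared in a different order in another basket; sorting the items within each order is one way to avoid that.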

In [12]:
# Note: this definition shadows the `association_rules` function imported from mlxtend earlier
def association_rules(order_item, min_support):

    print("Starting order_item: {:22d}".format(len(order_item)))


    # Calculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Filter from order_item items below min support 
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Filter from order_item orders with less than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)


    # Calculate item pair frequency and support
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100

    print("Item pairs: {:31d}".format(len(item_pairs)))


    # Filter from item_pairs those below min support
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))


    # Create table of association rules and compute relevant metrics
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
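    # Supports are stored as percentages (see the "* 100" above).
    # Confidence is a ratio of two percentages, so it is a plain fraction.
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
    # Lift divides one percentage by the product of two, so the value is 1/100 of the
    # textbook lift: independence corresponds to 0.01 here, not 1.
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB'])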
    
    
    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)

In [13]:
%%time
# Supports are computed as percentages inside the function, so 0.01 means 0.01% of orders
rules = association_rules(orders, 0.01)

Starting order_item:               32434489
Items with support >= 0.01:           10906
Remaining order_item:              29843570
Remaining orders with 2+ items:     3013325
Remaining order_item:              29662716
Item pairs:                        30622410
Item pairs with support >= 0.01:      48751

CPU times: user 4min 55s, sys: 12.4 s, total: 5min 7s
Wall time: 5min 23s


In [14]:
# Replace item ID with item name and display association rules
item_name   = pd.read_csv('products.csv')
item_name   = item_name.rename(columns={'product_id':'item_id', 'product_name':'item_name'})
rules = merge_item_name(rules, item_name).sort_values('lift', ascending=False)
rules.sample(10)

Unnamed: 0,itemA,itemB,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
30695,Bag of Organic Bananas,Peaches,424,0.014071,376367,12.49009,2830,0.093916,0.001127,0.149823,0.011995
7013,Trail Mix,Strawberries,370,0.012279,11515,0.382136,141805,4.705931,0.032132,0.002609,0.006828
32846,Small Hass Avocado,Organic Frozen Peas,313,0.010387,48888,1.622394,20164,0.669161,0.006402,0.015523,0.009568
46867,Organic Hass Avocado,Organic Light Agave Nectar,341,0.011316,212785,7.061469,3827,0.127003,0.001603,0.089104,0.012618
47848,Organic Hass Avocado,Tall Kitchen Drawstring Bags,343,0.011383,212785,7.061469,4444,0.147478,0.001612,0.077183,0.01093
31870,Organic Tomato Cluster,Organic Mint,700,0.02323,64144,2.128678,19688,0.653365,0.010913,0.035555,0.016703
3062,Apple Cinnamon Instant Oatmeal,Bag of Organic Bananas,320,0.010619,5604,0.185974,376367,12.49009,0.057102,0.00085,0.004572
31653,Organic Strawberries,Organic Rosemary,691,0.022931,263416,8.741706,10366,0.344005,0.002623,0.06666,0.007626
4452,Organic Fat Free Milk,Organic Hass Avocado,1717,0.05698,26473,0.878531,212785,7.061469,0.064859,0.008069,0.009185
43880,100% Whole Wheat Bread,Organic Multigrain Waffles,360,0.011947,60654,2.01286,10689,0.354724,0.005935,0.033679,0.016732


In [15]:
rules.sort_values(by='supportAB', ascending=False)

Unnamed: 0,itemA,itemB,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
1712,Bag of Organic Bananas,Organic Strawberries,40029,1.328400,376367,12.490090,263416,8.741706,0.106356,0.151961,0.012167
1581,Banana,Organic Strawberries,38564,1.279782,470096,15.600574,263416,8.741706,0.082034,0.146400,0.009384
4500,Bag of Organic Bananas,Organic Hass Avocado,37628,1.248720,376367,12.490090,212785,7.061469,0.099977,0.176836,0.014158
3492,Banana,Organic Baby Spinach,35590,1.181087,470096,15.600574,240637,7.985763,0.075708,0.147899,0.009480
6678,Banana,Organic Avocado,34032,1.129384,470096,15.600574,176241,5.848722,0.072394,0.193099,0.012378
...,...,...,...,...,...,...,...,...,...,...,...
21063,Organic Extra Large Grade AA Brown Eggs,Organic Small Bunch Celery,302,0.010022,13535,0.449172,67972,2.255714,0.022313,0.004443,0.009892
36421,Organic Lacinato (Dinosaur) Kale,"Organic Butterhead (Boston, Butter, Bibb) Lettuce",302,0.010022,42982,1.426398,12111,0.401915,0.007026,0.024936,0.017482
8128,Organic Greek Plain Nonfat Yogurt,Organic Raspberries,302,0.010022,6786,0.225200,136621,4.533895,0.044503,0.002210,0.009816
36454,Organic Baby Carrots,"Organic Butterhead (Boston, Butter, Bibb) Lettuce",302,0.010022,76615,2.542540,12111,0.401915,0.003942,0.024936,0.009808


### Limitations

Frequent itemset generation is the most computationally expensive step because the algorithm scans the database many times, which reduces overall performance. Worse, in the worst case the Apriori algorithm has **exponential time complexity, $O(2^n)$, where $n$ is the total number of unique items in the dataset**, since every subset of the items is a potential candidate itemset.
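A quick calculation shows how fast the candidate space grows. The snippet below counts the non-empty itemsets for a few hypothetical item counts, then the number of possible pairs among the 10,906 items that passed the support filter in the run above. The function name `n_candidate_itemsets` is mine.

```python
from math import comb


def n_candidate_itemsets(n_items):
    """Number of non-empty itemsets that can be formed from n_items distinct items."""
    return 2 ** n_items - 1


for n in (5, 10, 20, 40):
    print(f"{n:>2} items -> {n_candidate_itemsets(n):,} possible itemsets")

# Restricting the search to pairs keeps the count polynomial in the number of items
n_frequent_items = 10_906  # items with enough support in the Instacart run
print(f"possible pairs: {comb(n_frequent_items, 2):,}")
```

Limiting the from-scratch version to item pairs is what keeps it tractable: the pair count grows quadratically rather than exponentially, at the cost of not finding rules with larger antecedents.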

### End Note

Market Basket Analysis is not only popular in retail businesses. By nature, the Apriori Algorithm looks for the likelihood of different elements occurring together. The technique is therefore also useful in other settings that involve detecting associations and analyzing co-occurrences.

Market Basket Analysis is also a good complementary tool for a recommender system. While a recommender system guesses what a customer would purchase next by looking at what similar customers bought in the past, Market Basket Analysis looks at the "content" of the transaction itself: which items are most often sold together.

---

## Reference:

[[Wikipedia] Association Rule Learning](https://en.wikipedia.org/wiki/Association_rule_learning)

[[Youtube] Apriori Algorithm Explained | Association Rule Mining | Finding Frequent Itemset | Edureka](https://www.youtube.com/watch?v=guVvtZ7ZClw)

[[mlxtend/Apriori] Apriori is not Apriori. It is missing all the nice optimzations](https://github.com/rasbt/mlxtend/issues/644)

[[Kaggle] Association Rules Mining/Market Basket Analysis](https://www.kaggle.com/datatheque/association-rules-mining-market-basket-analysis)

[[Kaggle] Market Basket Analysis](https://www.kaggle.com/xvivancos/market-basket-analysis)

[[Medium] Market Basket Analysis](https://towardsdatascience.com/market-basket-analysis-978ac064d8c6)

[[Medium] A Gentle Introduction on Market Basket Analysis — Association Rules](https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce)

[[Quora] **Omkar's Answer** on What is the time and space complexity of Apriori algorithm?](https://www.quora.com/What-is-the-time-and-space-complexity-of-Apriori-algorithm)

[Pruning of Apriori-Algorithm’s Pruning Steps](https://www.iasj.net/iasj?func=fulltext&aId=41181)

[What is the difference between lift and leverage?](https://support.bigml.com/hc/en-us/articles/207310225-What-is-the-difference-between-lift-and-leverage-)