Most of you may already be familiar with clustering algorithms such as K-Means, Hierarical, or DBSCAN. However, clustering is not the only unsupervised way to find similarities between data points. You can also use association rule learning techniques to determine if certain data points (actions) are more likely to occur together.

A simple example would be the supermarket shopping basket analysis. If someone is buying ground beef, does it make them more likely to also buy spaghetti? We can answer these types of questions by using the Apriori algorithm.

![https://miro.medium.com/max/720/1*yTtuCcNJrSymJfrVj28fQg.png](https://miro.medium.com/max/720/1*yTtuCcNJrSymJfrVj28fQg.png)

# What category of algorithms does Apriori belong to?

Apriori is part of the association rule learning algorithms, which sit under the unsupervised branch of Machine Learning.

This is because Apriori does not require us to provide a target variable for the model. Instead, the algorithm identifies relationships between data points subject to our specified constraints.

Assume we analyze the above transaction data to find frequently bought items and determine if they are often purchased together. To help us find the answers, we will make use of the following 4 metrics:

- Support
- Confidence
- Lift
- Conviction

## Support
The first step for us and the algorithm is to find frequently bought items. It is a straightforward calculation that is based on frequency:

Support(A) = Transactions(A) / Total Transactions

So in our example:

- Support(Eggs) = 3/6 = 1/2 = 0.5
- Support(Bacon) = 4/6 = 2/3 = 0.667
- Support(Apple) = 2/6 = 1/3 = 0.333
- ...
- Support(Eggs&Bacon) = 3/6 = 0.5
- ...

Here we can set our first constraint by telling the algorithm the minimum support level we want to explore, which is useful when working with large datasets. We typically want to focus computing resources to search for associations between frequently bought items while discounting the infrequent ones.

For the sake of our example, let’s set minimum support to 0.5, which leaves us to work with Eggs and Bacon for the rest of this example.

**Important:** while Support(Eggs) and Support(Bacon) individually satisfy our minimum support constraint, it is crucial to understand that we also need the combination of them (Eggs&Bacon) to pass this constraint. Otherwise, we would not have a single item pairing to progress forward to create association rules.

## Confidence:
Now that we have identified frequently bought items let’s calculate confidence. This will tell us how confident (based on our data) we can be that an item will be purchased, given that another item has been purchased.

**Confidence(A→B)** = Probability(A & B) / Support(A)

Note, confidence is the same as what is also known as conditional probability in statistics:

P(B|A) = P(A & B) / P(A) 

Please beware of the notation. The above two equeations are equivalent, although the notations are in different order: **(A→B)** is the same as **(B|A)**.

So, let’s calculate confidence for our example:
    
**Confidence(Eggs→Bacon)** = P(Eggs & Bacon) / Support(Eggs) = (3/6) / (3/6) = 1

**Confidence(Bacon→Eggs)** = P(Eggs & Bacon) / Support(Bacon) = (3/6) / (2/3) = 3/4 = 0.75

The above tells us that whenever eggs are bought, bacon is also bought 100% of the time. Also, whenever bacon is bought, eggs are bought 75% of the time.

## Lift:
Given that different items are bought at different frequencies, how do we know that eggs and bacon really do have a strong association, and how do we measure it? You will be glad to hear that we have a way to evaluate this objectively using lift.

There are multiple ways to express the formula to calculate lift. Let me first show what the formulas look like, and then I will describe an intuitive way for you to think about it.

**Lift(A→B)** = Probability(A & B) / (Support(A) * Support(B))

You should be able to spot that we can simplify this formula by replacing P(A&B)/Sup(A) with Confidence(A→B). Hence, we have:

**Lift(A→B)** = Confidence(A→B) / Support(B)

Let’s calculate lift for our associated items:

**Lift(Eggs→Bacon)** = Confidence(Eggs→Bacon) / Support(Bacon) = 1 / (2/3) = 1.5

**Lift(Bacon→Eggs)** = Confidence(Bacon→Eggs) / Support(Eggs) = (3/4) / (1/2) = 1.5

Lift for the two items is equal to 1.5. Note, lift>1 means that the two items are more likely to be bought together, while lift<1 means that the two items are more likely to be bought separately. Finally, lift=1 means that there is no association between the two items.

An intuitive way to understand this would be to first think about the probability of eggs being bought: P(Eggs)=Support(Eggs)=0.5 as 3 out of 6 shoppers bought eggs.

Then think about the probability of eggs being bought whenever bacon was bought: P(Eggs|Bacon)=Confidence(Bacon->Eggs)=0.75 since out of the 4 shoppers that bought bacon, 3 of them also bought eggs.

Now, lift is simply a measure that tells us whether the probability of buying eggs increases or decreases given the purchase of bacon. Since the probability of buying eggs in such a scenario goes up from 0.5 to 0.75, we see a positive lift of 1.5 times (0.75/0.5=1.5). This means you are 1.5 times (i.e., 50%) more likely to buy eggs if you have already put bacon into your basket.

See if your supermarket places these two items nearby. 😜

## Conviction:
Conviction is another way of measuring association, although it is a bit harder to get your head around. It compares the probability that A appears without B if they were independent with the actual frequency of the appearance of A without B. Let’s take a look at the general formula first:

**Conviction(A→B)** = (1 - Support(B)) / (1 - Confidence(A→B))

In our example, this would be:

**Conviction(Eggs→Bacon)** = (1 - Sup(Bacon) / (1 - Conf(Eggs→Bacon)) = (1 - 2/3) / (1 - 1) = (1/3) / 0 = infinity

**Conviction(Bacon→Eggs)** = (1 - Sup(Eggs) / (1 - Conf(Bacon→Eggs)) = (1 - 1/2) / (1 - 3/4) = (1/2) / (1/4) = 2

As you can see, we had a division by 0 when calculating conviction for (Eggs→Bacon) and this is because we do not have a single instance of eggs being bought without bacon (confidence=100%).

In general, high confidence for A→B with low support for item B would yield a high conviction.

In contrast to lift, conviction is a directed measure. Hence, while lift is the same for both (Eggs→Bacon) and (Bacon→Eggs), conviction is different between the two, with Conv(Eggs→Bacon) being much higher. Thus, you can use conviction to evaluate the directional relationship between your items.

Finally, similar to lift, conviction=1 means that items are not associated, while conviction>1 indicates the relationship between the items (the higher the value, the stronger the relationship).

---

# Apriori algorithm:
Apriori is a pretty straightforward algorithm that performs the following sequence of calculations:

- Calculate support for itemsets of size 1.
- Apply the minimum support threshold and prune itemsets that do not meet the threshold.
- Move on to itemsets of size 2 and repeat steps one and two.
- Continue the same process until no additional itemsets satisfying the minimum threshold can be found.
- To make the process more visual, here is a diagram that illustrates what the algorithm does:

![https://miro.medium.com/max/720/1*PjV7oQE_RABg49Wsm2_Jdw.png](https://miro.medium.com/max/720/1*PjV7oQE_RABg49Wsm2_Jdw.png)

As you can see, most itemsets in this example got pruned since they did not meet the minimum support threshold of 0.5.

It is important to realize that by setting a lower minimum support threshold we would produce many more itemsets of size 2. To be precise, with a minimum support threshold of 0.3, none of the itemsets of size 1 would get pruned giving us a total of 15 itemsets of size 2 (5+4+3+2+1=15).

This is not an issue when we have a small dataset, but it could become a bottleneck if you are working with a large dataset. E.g., 1,000 items can create as many as 499,500 item pairs. Hence, choose your minimum support threshold carefully.

---
---
---

In [29]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

In [30]:
raw_data = pd.read_excel('online_retail_II.xlsx', sheet_name='Year 2010-2011')

In [32]:
raw_data

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [35]:
def prepare_retail(dataframe):
    # preparing dataset
    dataframe.dropna(inplace=True)
    dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)]
    dataframe = dataframe[dataframe["Quantity"] > 0]
    dataframe = dataframe[dataframe["Price"] > 0]
    return dataframe

In [36]:
df = prepare_retail(raw_data)

In [37]:
df

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [38]:
def create_apriori_datastructure(dataframe, id=False):
    ############################################
    # Preparing ARL Datastructure (Invoice-Product Matrix)
    ############################################

    # We need to create below structure:

    # Rows represents transactions (invoice, shopping cart etc.), columns represents products
    # We simulate as binary that which transaction (invoice, shopping cart etc.) contains which products
    # If the product is in the invoice, the intersection cell will be "1". If is not, it will be "0"

    # Description   Product1   Product2    Product3
    # Invoice
    # Invoice1          0         1          0
    # Invoice2          1         0          1
    # Invoice3          0         0         0
    # Invoice4          1         0         0
    # Invoice5          0         0         1
    if id:
        grouped = germany_df.groupby(
            ['Invoice', 'StockCode'], as_index=False).agg({'Quantity': 'sum'})
        apriori_datastructure = pd.pivot(data=grouped, index='Invoice', columns='StockCode', values='Quantity').fillna(
            0).applymap(lambda x: 1 if x > 0 else 0)
        return apriori_datastructure
    else:
        grouped = germany_df.groupby(
            ['Invoice', 'Description'], as_index=False).agg({'Quantity': 'sum'})
        apriori_datastructure = pd.pivot(data=grouped, index='Invoice', columns='Description', values='Quantity').fillna(
            0).applymap(lambda x: 1 if x > 0 else 0)
        return apriori_datastructure

In [39]:
germany_df = df[df['Country'] == 'Germany'] # In this example, we will cover just Germany based transactions

germany_df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
1109,536527,22809,SET OF 6 T-LIGHTS SANTA,6,2010-12-01 13:04:00,2.95,12662.0,Germany
1110,536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,2010-12-01 13:04:00,2.55,12662.0,Germany
1111,536527,84945,MULTI COLOUR SILVER T-LIGHT HOLDER,12,2010-12-01 13:04:00,0.85,12662.0,Germany
1112,536527,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,2010-12-01 13:04:00,1.65,12662.0,Germany
1113,536527,22244,3 HOOK HANGER MAGIC GARDEN,12,2010-12-01 13:04:00,1.95,12662.0,Germany


In [40]:
germany_apriori_df = create_apriori_datastructure(germany_df,True)

  uniques = Index(uniques)


In [41]:
germany_apriori_df.head() # we see the Invoice-Product matrix

StockCode,10002,10125,10135,11001,15034,15036,15039,16008,16011,16014,...,90161B,90161C,90161D,90201A,90201B,90201C,90201D,90202D,M,POST
Invoice,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536527,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
536840,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
536861,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
536967,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
536983,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [42]:
def get_rules(apriori_df, min_support=0.01):
    # Possibilities of all possible product combinations
    # We say that the products that can be sold together with a min 0.01 probability should come. the probability that each product will be sold together with each other. Applying apriori algorithm.
    frequent_itemsets = apriori(
        apriori_df, min_support=min_support, use_colnames=True)
    # Extracting Association Rules
    # We extract association rules by using the support metric from the dataset that we applied the apriori algorithm.
    rules = association_rules(
        frequent_itemsets, metric="support", min_threshold=min_support)
    # antecedents -> the first product(s)
    # consequents -> the next product(s)
    # antecedent support -> probability of observing the first product(s)
    # consequent support -> probability of observing the next product(s)
    # support -> probability of observing the next product(s) and the first product(s) together
    # confidence -> probability of observing the next product(s) when sold the first product(s)
    # lift -> When the first product is sold, the probability of selling the next product(s) increases by a factor of lift.
    # leverage -> Similar to lift, but tends to prioritize the higher support values.
    # conviction -> probability of observing the antecedents without consequents
    return rules

In [43]:
germany_rules = get_rules(germany_apriori_df)



In [44]:
germany_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(POST),(10125),0.818381,0.013129,0.013129,0.016043,1.221925,0.002384,1.002961
1,(10125),(POST),0.013129,0.818381,0.013129,1.0,1.221925,0.002384,inf
2,(POST),(15036),0.818381,0.019694,0.015317,0.018717,0.950386,-0.0008,0.999004
3,(15036),(POST),0.019694,0.818381,0.015317,0.777778,0.950386,-0.0008,0.817287
4,(16016),(POST),0.010941,0.818381,0.010941,1.0,1.221925,0.001987,inf


In [45]:
def get_item_name(dataframe, stock_code):
    # To find out the product name of the id in the data structure
    # if stock_code is a list, it will return the list by created of item names
    if type(stock_code) != list:
        product_name = dataframe[dataframe["StockCode"] ==
                                 stock_code][["Description"]].values[0].tolist()
        return product_name
    else:
        product_names = [dataframe[dataframe["StockCode"] == product][[
            "Description"]].values[0].tolist()[0] for product in stock_code]
        return product_names

In [47]:
get_item_name(germany_df,10125)

['MINI FUNKY DESIGN TAPES']

In [48]:
def get_golden_shot(target_id,dataframe,rules):
    target_product = get_item_name(dataframe,target_id)[0]
    recomended_product_ids = recommend_products(rules, target_id)
    recomended_product_names = get_item_name(dataframe,recommend_products(rules, target_id))
    print(f'Target Product ID (which is in the cart): {target_id}\nProduct Name: {target_product}')
    print(f'Recommended Products: {recomended_product_ids}\nProduct Names: {recomended_product_names}')

In [49]:
def recommend_products(rules_df, product_id, rec_count=5):
    # rules_df -> the dataframe that we extracted rules
    # product_id -> the product id which is in the cart
    # rec_count -> count of recommended products
    sorted_rules = rules_df.sort_values('lift', ascending=False) # we are sorting the rules dataframe by using "lift" metric
    recommended_products = []  # creating an empty list for holding the recommended products

    for i, product in sorted_rules["antecedents"].items(): # loop on the first products (the products which are in the cart)
        for j in list(product):  # assign to a list for each product
            if j == product_id:  # if the list you return is equal to product_id, which means the product id in the cart
                # consequences column's first product id add to recommended products list 
                recommended_products.append(
                    list(sorted_rules.iloc[i]["consequents"]))
                
    recommended_products = list({item for item_list in recommended_products for item in item_list}) # get unique products

    return recommended_products[:rec_count] # return the recommended_products list by using rec_count limiter

In [50]:
# simulating some products like they are in cart
TARGET_PRODUCT_ID_1 = 21987
TARGET_PRODUCT_ID_2 = 23235
TARGET_PRODUCT_ID_3 = 22747

In [51]:
get_item_name(germany_df, [TARGET_PRODUCT_ID_1,TARGET_PRODUCT_ID_2, TARGET_PRODUCT_ID_3]) # what are their names?

['PACK OF 6 SKULL PAPER CUPS',
 'STORAGE TIN VINTAGE LEAF',
 "POPPY'S PLAYHOUSE BATHROOM"]

In [52]:
get_golden_shot(TARGET_PRODUCT_ID_1,germany_df,germany_rules)

Target Product ID (which is in the cart): 21987
Product Name: PACK OF 6 SKULL PAPER CUPS
Recommended Products: [21124, 21244, 23050, '85049E', 23307]
Product Names: ['SET/10 BLUE POLKADOT PARTY CANDLES', 'BLUE POLKADOT PLATE ', 'RECYCLED ACAPULCO MAT GREEN', 'SCANDINAVIAN REDS RIBBONS', 'SET OF 60 PANTRY DESIGN CAKE CASES ']


In [53]:
get_golden_shot(TARGET_PRODUCT_ID_2,germany_df,germany_rules)

Target Product ID (which is in the cart): 23235
Product Name: STORAGE TIN VINTAGE LEAF
Recommended Products: [21244, 20750, 22037, 22423, 22551]
Product Names: ['BLUE POLKADOT PLATE ', 'RED RETROSPOT MINI CASES', 'ROBOT BIRTHDAY CARD', 'REGENCY CAKESTAND 3 TIER', 'PLASTERS IN TIN SPACEBOY']


In [28]:
get_golden_shot(TARGET_PRODUCT_ID_3,germany_df,germany_rules)

Target Product ID (which is in the cart): 22747
Product Name: POPPY'S PLAYHOUSE BATHROOM
Recommended Products: [22423, 22554, 22555, 22556, 22302]
Product Names: ['REGENCY CAKESTAND 3 TIER', 'PLASTERS IN TIN WOODLAND ANIMALS', 'PLASTERS IN TIN STRONGMAN', 'PLASTERS IN TIN CIRCUS PARADE ', 'COFFEE MUG PEARS  DESIGN']
