
# Lesson 2: Market Groceries

In this lesson, we'll use the market groceries dataset to introduce association rule learning with the Apriori algorithm.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Load the groceries dataset from GitHub.
# dataset source: https://www.kaggle.com/heeraldedhia/groceries-dataset
df_groceries = pd.read_csv('https://raw.githubusercontent.com/teellis/UnsupervisedLearning-Tutorial/main/Groceries_dataset.csv')
df_groceries.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk


In [3]:
# Check that there are no nulls
df_groceries.isnull().values.any()

False

In [4]:
# Take a look at what kinds of things people have bought
df_groceries['itemDescription'].unique()

array(['tropical fruit', 'whole milk', 'pip fruit', 'other vegetables',
       'rolls/buns', 'pot plants', 'citrus fruit', 'beef', 'frankfurter',
       'chicken', 'butter', 'fruit/vegetable juice',
       'packaged fruit/vegetables', 'chocolate', 'specialty bar',
       'butter milk', 'bottled water', 'yogurt', 'sausage', 'brown bread',
       'hamburger meat', 'root vegetables', 'pork', 'pastry',
       'canned beer', 'berries', 'coffee', 'misc. beverages', 'ham',
       'turkey', 'curd cheese', 'red/blush wine',
       'frozen potato products', 'flour', 'sugar', 'frozen meals',
       'herbs', 'soda', 'detergent', 'grapes', 'processed cheese', 'fish',
       'sparkling wine', 'newspapers', 'curd', 'pasta', 'popcorn',
       'finished products', 'beverages', 'bottled beer', 'dessert',
       'dog food', 'specialty chocolate', 'condensed milk', 'cleaner',
       'white wine', 'meat', 'ice cream', 'hard cheese', 'cream cheese ',
       'liquor', 'pickled vegetables', 'liquor (appetizer

Now that we've looked at the dataset a bit, let's think. Each row has a member number, a date, and an item description, so each row is just one item within a larger purchase. We'll need to group the items that belong to the same purchase (same member, same date) before we can do our association rule learning.

In [44]:
# Group by customer (Member_number) and purchase date (Date)
# so we can see which items a customer bought together on a given date
df_purchases = df_groceries.groupby(['Member_number', 'Date'])['itemDescription'].apply(list)
df_purchases.head()

Member_number  Date      
1000           15-03-2015    [sausage, whole milk, semi-finished bread, yog...
               24-06-2014                    [whole milk, pastry, salty snack]
               24-07-2015                       [canned beer, misc. beverages]
               25-11-2015                          [sausage, hygiene articles]
               27-05-2015                           [soda, pickled vegetables]
Name: itemDescription, dtype: object
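
To make the grouping step concrete, here is a minimal plain-Python sketch of the same idea. The rows are hypothetical and only mimic the dataset's (Member_number, Date, itemDescription) layout; `baskets` is an illustrative name, not something used elsewhere in this lesson.

```python
from collections import defaultdict

# Hypothetical rows in the same (Member_number, Date, itemDescription) layout
rows = [
    (1000, "15-03-2015", "sausage"),
    (1000, "15-03-2015", "whole milk"),
    (1000, "24-06-2014", "pastry"),
    (2000, "15-03-2015", "yogurt"),
]

# Collect every item that shares a (member, date) key into one basket
baskets = defaultdict(list)
for member, date, item in rows:
    baskets[(member, date)].append(item)

for key, items in sorted(baskets.items()):
    print(key, items)
```

Each key plays the role of one entry in the grouped index above, and each list is one purchase.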

---

## Calculating Association Rules Metrics

Let's run through some example calculations of the metrics from lesson 1:

### Calculating Support

First, let's calculate the support for whole milk, yogurt, and butter milk on their own, and then the support for the pairs $\{whole\ milk, yogurt\}$ and $\{whole\ milk, butter\ milk\}$, using this formula, where $N$ is the total number of purchases:

$support = \frac{frequency(X, Y)}{N}$


In [6]:
# Calculate the support for whole milk
has_whole_milk = ['whole milk' in x for x in df_purchases]
whole_milk_support = sum(has_whole_milk) / len(df_purchases)

# Calculate the support for yogurt
has_yogurt = ['yogurt' in x for x in df_purchases]
yogurt_support = sum(has_yogurt) / len(df_purchases)

# Calculate the support for butter milk
has_butter_milk = ['butter milk' in x for x in df_purchases]
butter_milk_support = sum(has_butter_milk) / len(df_purchases)

# Calculate the support for whole milk -> yogurt
has_whole_milk_yogurt = ['whole milk' in x and 'yogurt' in x for x in df_purchases]
whole_milk_yogurt_support = sum(has_whole_milk_yogurt) / len(df_purchases)

# Calculate the support for whole milk -> butter milk
has_whole_milk_butter_milk = ['whole milk' in x and 'butter milk' in x for x in df_purchases]
whole_milk_butter_milk_support = sum(has_whole_milk_butter_milk) / len(df_purchases)

print(f"support(whole milk) = {whole_milk_support}")
print(f"support(yogurt) = {yogurt_support}")
print(f"support(butter milk) = {butter_milk_support}")
print(f"support(whole milk, yogurt) = {whole_milk_yogurt_support}")
print(f"support(whole milk, butter milk) = {whole_milk_butter_milk_support}")

support(whole milk) = 0.15792287642852368
support(yogurt) = 0.08587850030074183
support(butter milk) = 0.017576689166610975
support(whole milk, yogurt) = 0.011160863463209249
support(whole milk, butter milk) = 0.001670787943594199


This means that 15.8%, 8.6%, and 1.8% of purchases contain whole milk, yogurt, and butter milk, respectively. Considering combinations, 1.1% of purchases contain both whole milk and yogurt, and 0.17% contain both whole milk and butter milk.
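
The repeated pattern in the cell above can be wrapped in a small helper. This is only a sketch on hypothetical toy baskets; the function name `support` and the list `toy` are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items.issubset(t))
    return hits / len(transactions)

# Hypothetical toy baskets
toy = [
    ["whole milk", "yogurt"],
    ["whole milk"],
    ["yogurt", "soda"],
    ["soda"],
]

print(support(toy, ["whole milk"]))            # 2 of the 4 baskets
print(support(toy, ["whole milk", "yogurt"]))  # 1 of the 4 baskets
```

Assuming `df_purchases` from earlier, the same helper could be called as `support(df_purchases, ["whole milk", "yogurt"])`.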


---



### Calculating Confidence
Next, let's try calculating the confidence for $\{whole\ milk\} \rightarrow \{yogurt\}$ and $\{whole\ milk\} \rightarrow \{butter\ milk\}$, with the formula:

 $confidence = \frac{frequency(X, Y)}{frequency(X)} = \frac{support(X, Y)}{support(X)}$

Here $X$ is the antecedent (whole milk) and $Y$ is the consequent, so we divide by the number of purchases that contain whole milk.

In [7]:
# Calculate the confidence for whole milk -> yogurt
confidence_whole_milk_yogurt = sum(has_whole_milk_yogurt) / sum(has_whole_milk)

# Calculate the confidence for whole milk -> butter milk
confidence_whole_milk_butter_milk = sum(has_whole_milk_butter_milk) / sum(has_whole_milk)

print(f"confidence(whole milk, yogurt) = {confidence_whole_milk_yogurt:.4f}")
print(f"confidence(whole milk, butter milk) = {confidence_whole_milk_butter_milk:.4f}")

confidence(whole milk, yogurt) = 0.0707
confidence(whole milk, butter milk) = 0.0106


In [8]:
# Alternatively, we can use our support values from before:
confidence_whole_milk_yogurt = whole_milk_yogurt_support / whole_milk_support
confidence_whole_milk_butter_milk = whole_milk_butter_milk_support / whole_milk_support

print(f"confidence(whole milk, yogurt) = {confidence_whole_milk_yogurt:.4f}")
print(f"confidence(whole milk, butter milk) = {confidence_whole_milk_butter_milk:.4f}")

confidence(whole milk, yogurt) = 0.0707
confidence(whole milk, butter milk) = 0.0106


This means that there is a 7.1% chance that yogurt will be purchased given that whole milk is purchased, while there is only a 1.1% chance that butter milk is purchased given that whole milk is purchased.

(Additionally, note that we get the same confidence values regardless of how we calculate it)
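
Confidence depends on the direction of the rule, which is easy to miss. The sketch below uses hypothetical toy baskets and made-up helper names (`count`, `confidence`) and follows the standard definition, where the denominator is the antecedent's count.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items.issubset(t))

def confidence(transactions, antecedent, consequent):
    """confidence(antecedent -> consequent) = count(both) / count(antecedent)."""
    both = count(transactions, set(antecedent) | set(consequent))
    return both / count(transactions, antecedent)

# Hypothetical toy baskets: bread is common, butter only appears with bread
toy = [
    ["bread", "butter"],
    ["bread", "butter"],
    ["bread"],
    ["bread"],
    ["jam"],
]

bread_to_butter = confidence(toy, ["bread"], ["butter"])  # 2 of 4 bread baskets
butter_to_bread = confidence(toy, ["butter"], ["bread"])  # 2 of 2 butter baskets
print(bread_to_butter, butter_to_bread)
```

The two rules share the same itemset and the same support, yet their confidences differ because each is measured against a different antecedent.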


---



### Calculating Lift

Finally, let's calculate the lift for $\{whole\ milk\} \rightarrow \{yogurt\}$ and $\{whole\ milk\} \rightarrow \{butter\ milk\}$ using the formula for lift: 
$lift = \frac{Support(X, Y)}{Support(X) \cdot Support(Y)} = \frac{Confidence(X,Y)}{Support(Y)}$

In [9]:
# Calculate the lift for whole milk -> yogurt
lift_whole_milk_yogurt = whole_milk_yogurt_support / (whole_milk_support * yogurt_support)

# Calculate the lift for whole milk -> butter milk
lift_whole_milk_butter_milk = whole_milk_butter_milk_support / (whole_milk_support * butter_milk_support)

print(f"lift(whole milk, yogurt) = {lift_whole_milk_yogurt}")
print(f"lift(whole milk, butter milk) = {lift_whole_milk_butter_milk}")

lift(whole milk, yogurt) = 0.822940237876076
lift(whole milk, butter milk) = 0.6019206106821097


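Since the lift for both rules is less than 1, yogurt and butter milk are each less likely to appear in a purchase that contains whole milk than in a purchase chosen at random. A lift of exactly 1 would mean the items are bought independently of each other.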
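
To build intuition for values above and below 1, here is a small sketch on hypothetical toy baskets. The function name `lift` and the data are illustrative assumptions; the formula is the support-based one given above, rewritten in terms of counts.

```python
def lift(transactions, x, y):
    """lift = support(X, Y) / (support(X) * support(Y)), computed from counts."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if x in t)
    n_y = sum(1 for t in transactions if y in t)
    n_xy = sum(1 for t in transactions if x in t and y in t)
    return (n_xy * n) / (n_x * n_y)

# Hypothetical toy baskets: salsa mostly travels with chips, rarely with milk
toy = (
    [["chips", "salsa"]] * 3
    + [["chips"]]
    + [["milk"]] * 3
    + [["milk", "salsa"]]
)

print(lift(toy, "chips", "salsa"))  # above 1: bought together more than chance
print(lift(toy, "milk", "salsa"))   # below 1: bought together less than chance
print(lift(toy, "salsa", "chips"))  # same as the first: lift is symmetric
```

Unlike confidence, lift does not change when the antecedent and consequent are swapped.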


---

## Your Turn

Try calculating these metrics yourself!

* Support(butter)
* Support(sugar)
* Support(butter, sugar)
* Confidence(butter, sugar)
* Lift(butter, sugar)

In [10]:
# YOUR CODE HERE
butter_support = None
sugar_support = None
butter_sugar_support = None
butter_sugar_confidence = None
butter_sugar_lift = None

When you're done, unhide the next cell to compare your results.

In [11]:
#@title
# Calculate support(butter)
has_butter = ['butter' in x for x in df_purchases]
butter_support = sum(has_butter) / len(df_purchases)

# Calculate support(sugar)
has_sugar = ['sugar' in x for x in df_purchases]
sugar_support = sum(has_sugar) / len(df_purchases)

# Calculate support(butter, sugar)
has_butter_sugar = ['butter' in x and 'sugar' in x for x in df_purchases]
butter_sugar_support = sum(has_butter_sugar) / len(df_purchases)

# Calculate confidence(butter, sugar): divide by the support of the antecedent (butter)
butter_sugar_confidence = butter_sugar_support / butter_support

# Calculate lift(butter, sugar)
butter_sugar_lift = butter_sugar_confidence / sugar_support

print(f"support(butter) = {butter_support}")
print(f"support(sugar) = {sugar_support}")
print(f"support(butter, sugar) = {butter_sugar_support}")
print(f"confidence(butter, sugar) = {butter_sugar_confidence:.4f}")
print(f"lift(butter, sugar) = {butter_sugar_lift:.4f}")

support(butter) = 0.03522020985096572
support(sugar) = 0.01771035220209851
support(butter, sugar) = 0.00040098910646260775
confidence(butter, sugar) = 0.0114
lift(butter, sugar) = 0.6429


---

## Example Model

Finally, let's demonstrate how one would use the apriori algorithm to determine association rules in Python. We'll be using the MLXtend library's apriori implementation for this example. For more information, check out the documentation [here](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/).

In [16]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [18]:
# MLXtend's apriori function takes in a one-hot encoded dataframe,
# so first extract the transactions from df_purchases as a list of lists
transactions = df_purchases.tolist()
transactions[0:5]

[['sausage', 'whole milk', 'semi-finished bread', 'yogurt'],
 ['whole milk', 'pastry', 'salty snack'],
 ['canned beer', 'misc. beverages'],
 ['sausage', 'hygiene articles'],
 ['soda', 'pickled vegetables']]

In [23]:
# One-hot encode transactions
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
encoded_transactions = pd.DataFrame(te_ary, columns=te.columns_)
encoded_transactions.head()

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,beverages,bottled beer,bottled water,brandy,brown bread,butter,butter milk,cake bar,candles,candy,canned beer,canned fish,canned fruit,canned vegetables,cat food,cereals,chewing gum,chicken,chocolate,chocolate marshmallow,citrus fruit,cleaner,cling film/bags,cocoa drinks,coffee,condensed milk,cooking chocolate,cookware,cream,cream cheese,...,salt,salty snack,sauces,sausage,seasonal products,semi-finished bread,shopping bags,skin care,sliced cheese,snack products,soap,soda,soft cheese,softener,soups,sparkling wine,specialty bar,specialty cheese,specialty chocolate,specialty fat,specialty vegetables,spices,spread cheese,sugar,sweet spreads,syrup,tea,tidbits,toilet cleaner,tropical fruit,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
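
For readers who want to see what one-hot encoding means without the library, here is a plain-Python sketch on hypothetical baskets. It is not MLXtend's implementation; it only reproduces the shape of the result: one column per distinct item, one boolean row per transaction.

```python
# Hypothetical toy transactions
toy_transactions = [
    ["sausage", "whole milk", "yogurt"],
    ["whole milk", "pastry"],
    ["soda"],
]

# One column per distinct item, in sorted order
columns = sorted({item for t in toy_transactions for item in t})

# One row per transaction: True where the basket contains that column's item
encoded = [[item in t for item in columns] for t in toy_transactions]

print(columns)
for row in encoded:
    print(row)
```

An item that appears twice in the same basket still produces a single `True`, so the encoding records presence rather than quantity.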


In [40]:
# Run apriori, keeping items and itemsets with at least 0.1% support (there are quite a few unique items)
freq_items = apriori(encoded_transactions, min_support=0.001, use_colnames=True)
freq_items

Unnamed: 0,support,itemsets
0,0.004010,(Instant food products)
1,0.021386,(UHT-milk)
2,0.001470,(abrasive cleaner)
3,0.001938,(artif. sweetener)
4,0.008087,(baking powder)
...,...,...
745,0.001136,"(sausage, whole milk, rolls/buns)"
746,0.001002,"(soda, whole milk, rolls/buns)"
747,0.001337,"(whole milk, yogurt, rolls/buns)"
748,0.001069,"(sausage, whole milk, soda)"


In [52]:
# Now, get some association rules from these frequent items and itemsets
rules = association_rules(freq_items, metric="confidence", min_threshold=0.001)

In [53]:
rules.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(bottled water),(UHT-milk),0.060683,0.021386,0.001069,0.017621,0.823954,-0.000228,0.996168
1,(UHT-milk),(bottled water),0.021386,0.060683,0.001069,0.05,0.823954,-0.000228,0.988755
2,(other vegetables),(UHT-milk),0.122101,0.021386,0.002139,0.017515,0.818993,-0.000473,0.99606
3,(UHT-milk),(other vegetables),0.021386,0.122101,0.002139,0.1,0.818993,-0.000473,0.975443
4,(rolls/buns),(UHT-milk),0.110005,0.021386,0.001804,0.016403,0.767013,-0.000548,0.994934
5,(UHT-milk),(rolls/buns),0.021386,0.110005,0.001804,0.084375,0.767013,-0.000548,0.972009
6,(root vegetables),(UHT-milk),0.069572,0.021386,0.001002,0.014409,0.673766,-0.000485,0.992921
7,(UHT-milk),(root vegetables),0.021386,0.069572,0.001002,0.046875,0.673766,-0.000485,0.976187
8,(sausage),(UHT-milk),0.060349,0.021386,0.001136,0.018826,0.880298,-0.000154,0.997391
9,(UHT-milk),(sausage),0.021386,0.060349,0.001136,0.053125,0.880298,-0.000154,0.992371


In [54]:
rules.tail(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1228,"(whole milk, soda)",(sausage),0.011629,0.060349,0.001069,0.091954,1.523708,0.000368,1.034806
1229,(sausage),"(whole milk, soda)",0.060349,0.011629,0.001069,0.017719,1.523708,0.000368,1.0062
1230,(whole milk),"(sausage, soda)",0.157923,0.005948,0.001069,0.006771,1.138374,0.00013,1.000829
1231,(soda),"(sausage, whole milk)",0.097106,0.008955,0.001069,0.011012,1.229612,0.0002,1.002079
1232,"(sausage, whole milk)",(yogurt),0.008955,0.085879,0.00147,0.164179,1.91176,0.000701,1.093681
1233,"(sausage, yogurt)",(whole milk),0.005748,0.157923,0.00147,0.255814,1.619866,0.000563,1.131541
1234,"(whole milk, yogurt)",(sausage),0.011161,0.060349,0.00147,0.131737,2.182917,0.000797,1.082219
1235,(sausage),"(whole milk, yogurt)",0.060349,0.011161,0.00147,0.024363,2.182917,0.000797,1.013532
1236,(whole milk),"(sausage, yogurt)",0.157923,0.005748,0.00147,0.00931,1.619866,0.000563,1.003596
1237,(yogurt),"(sausage, whole milk)",0.085879,0.008955,0.00147,0.017121,1.91176,0.000701,1.008307
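
The table also reports leverage and conviction, which were not covered above. As a rough check, the sketch below recomputes them for rule 1232 from the rounded values in the table, assuming the usual definitions: leverage is the rule's support minus the support expected under independence, and conviction is $(1 - support(Y)) / (1 - confidence)$. The variable names are illustrative.

```python
# Values copied (rounded) from row 1232 of the rules table
antecedent_support = 0.008955  # support({sausage, whole milk})
consequent_support = 0.085879  # support({yogurt})
rule_support = 0.00147         # support of all three items together
confidence = 0.164179

# Leverage: observed support minus the support expected if X and Y were independent
leverage = rule_support - antecedent_support * consequent_support

# Conviction: how much more often the rule would fail if X and Y were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(round(leverage, 6), round(conviction, 6))
```

Up to rounding in the inputs, these agree with the `leverage` and `conviction` columns for that row. A leverage above 0 and a conviction above 1 both point the same way as a lift above 1.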



---
### Analysis

Let's look at a couple of these generated rules. 


Rule 4: $\{rolls/buns\} \rightarrow \{UHT-milk\}$
* This rule has 1.6% confidence and a lift ratio of 0.77. Only 1.6% of purchases containing rolls/buns also contain UHT-milk, which is below UHT-milk's overall support of 2.1%. The lift ratio being below 1 says the same thing: buying rolls/buns makes UHT-milk less likely than it is at baseline.

Rule 1232: $\{sausage, whole\ milk\} \rightarrow \{yogurt\}$
* This rule has 16.4% confidence and a lift ratio of 1.9. Someone who has bought sausage and whole milk buys yogurt 16.4% of the time, about 1.9 times yogurt's overall support of 8.6%, so this antecedent makes yogurt noticeably more likely.

And now something slightly different, rule 1236: $\{whole\ milk\} \rightarrow \{sausage, yogurt\}$
* This rule has only 0.93% confidence but a lift ratio of 1.6. Why is confidence so low? Because sausage and yogurt are rarely bought together at all (the consequent support for $\{sausage, yogurt\}$ is 0.57%). The lift ratio above 1 tells us that, rare as it is, this pair is about 1.6 times as likely to show up in a purchase that contains whole milk as in a purchase overall. Compare rule 1233, the same itemset in the other direction: when someone does buy sausage and yogurt together, they also buy whole milk 25.6% of the time.
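
The numbers for rules 1236 and 1233 can be checked by hand from the table. Both rules are built from the same three items, so they share one support and one lift, and differ only in which side's support is the denominator of the confidence. The sketch below uses the rounded table values; the variable names are illustrative.

```python
# Rounded values from the rules table
support_all = 0.00147              # support({whole milk, sausage, yogurt})
support_whole_milk = 0.157923      # support({whole milk})
support_sausage_yogurt = 0.005748  # support({sausage, yogurt})

# Rule 1236: {whole milk} -> {sausage, yogurt}
conf_1236 = support_all / support_whole_milk

# Rule 1233: {sausage, yogurt} -> {whole milk}
conf_1233 = support_all / support_sausage_yogurt

# Lift is the same for both directions
lift = support_all / (support_whole_milk * support_sausage_yogurt)

print(round(conf_1236, 4), round(conf_1233, 4), round(lift, 2))
```

Small differences from the table in the last digits come from using rounded inputs.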





## Wrapping Up

Now that you've run through an example of how to use the apriori algorithm and generate association rules, try it yourself on a more complicated, real-world example!