<a href="https://colab.research.google.com/github/sgulyano/aic402/blob/main/lab6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIC-402 Lab 6: Apriori Algorithm

CMKL University

By Sarun Gulyanon


### Goal

Explore the application of association rule learning and frequent pattern mining in common scenarios such as market basket analysis.

### Outline

In this lab, we provide example code for building association rule learning models, such as the Apriori algorithm.

The example uses a transaction dataset of gift purchases for various occasions, the [`Online Retail` dataset](http://archive.ics.uci.edu/ml/datasets/Online+Retail). The code examples are implemented using the [`mlxtend`](https://rasbt.github.io/mlxtend/) library.

**Reference**: https://www.geeksforgeeks.org/machine-learning/implementing-apriori-algorithm-in-python/

----



## 1. Import Packages and Modules

In [4]:
import warnings
warnings.filterwarnings("ignore", message=r"datetime.datetime.utcnow", module="jupyter_client")

In [2]:
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules



## 2. Load Dataset

In this dataset, each row represents the purchase of a single product within one transaction. Every transaction is identified by a unique `InvoiceNo`. The `Description` field specifies the product name, while `Quantity` indicates the number of units purchased in that transaction.

For this lab, we focus only on transactions from Sweden and consider only these three features. We treat each product as one item, and an **itemset** is defined as a collection of items purchased together in the same transaction. Transactions are identified using `InvoiceNo` as the **transaction ID**.

In [7]:
data = pd.read_excel("https://github.com/sgulyano/aic402/blob/main/Online%20Retail%20Sweden.xlsx?raw=true")
data.drop(['StockCode', 'InvoiceDate', 'UnitPrice', 'CustomerID'], axis=1, inplace=True)
# Clean the Description column: drop rows with null descriptions and convert to string
data.dropna(subset=['Description'], inplace=True)
data['Description'] = data['Description'].astype(str)
# Filter data for Country='Sweden'
data = data[data['Country'] == 'Sweden']
data

Unnamed: 0,InvoiceNo,Description,Quantity,Country
0,C538847,SET OF 3 BABUSHKA STACKING TINS,-240,Sweden
1,538848,SET OF 3 BABUSHKA STACKING TINS,240,Sweden
2,539338,WORLD WAR 2 GLIDERS ASSTD DESIGNS,576,Sweden
3,539338,60 CAKE CASES DOLLY GIRL DESIGN,240,Sweden
4,539338,PACK OF 60 SPACEBOY CAKE CASES,240,Sweden
...,...,...,...,...
457,578089,RED TOADSTOOL LED NIGHT LIGHT,12,Sweden
458,578089,POSTAGE,4,Sweden
459,C578090,POSTAGE,-2,Sweden
460,580704,VINTAGE DOILY DELUXE SEWING KIT,40,Sweden


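Transform the data into a binary matrix, where each column represents an item and each row represents a transaction (`InvoiceNo`). Each cell is `True` (1) if the corresponding item was purchased in that transaction and `False` (0) otherwise. We first group the product descriptions by invoice into a list of transactions, then encode that list with `TransactionEncoder`.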

In [9]:
basket = data.groupby(['InvoiceNo'])['Description'].apply(list).reset_index()
# Ensure all items within the transaction lists are strings
transactions = [[str(item) for item in transaction] for transaction in basket['Description'].tolist()]
for i in transactions[:10]:
    print(i)

['SET OF 3 BABUSHKA STACKING TINS']
['WORLD WAR 2 GLIDERS ASSTD DESIGNS', '60 CAKE CASES DOLLY GIRL DESIGN', 'PACK OF 60 SPACEBOY CAKE CASES', 'PACK OF 60 PINK PAISLEY CAKE CASES', '72 SWEETHEART FAIRY CAKE CASES', 'PACK OF 72 RETROSPOT CAKE CASES', '36 DOILIES DOLLY GIRL', 'SET OF 72 RETROSPOT PAPER  DOILIES', 'MAGIC DRAWING SLATE PURDEY', 'MAGIC DRAWING SLATE DOLLY GIRL ', '5 HOOK HANGER MAGIC TOADSTOOL', 'PACK OF 12 TRADITIONAL CRAYONS', 'LARGE PURPLE BABUSHKA NOTEBOOK  ', 'LARGE RED BABUSHKA NOTEBOOK ', 'LARGE YELLOW BABUSHKA NOTEBOOK ', 'GIFT BAG BIRTHDAY', 'TEA BAG PLATE RED RETROSPOT', 'VICTORIAN SEWING KIT', 'GENTLEMAN SHIRT REPAIR KIT ', 'SET OF 3 CAKE TINS PANTRY DESIGN ', 'SET OF 3 CAKE TINS SKETCHBOOK', 'ROUND SNACK BOXES SET OF 4 FRUITS ', 'ROUND SNACK BOXES SET OF4 WOODLAND ', 'SET 3 RETROSPOT TEA,COFFEE,SUGAR', 'ASSORTED BOTTLE TOP  MAGNETS ']
['FELT FARM ANIMAL WHITE BUNNY ', 'EAU DE NIL LOVE BIRD CANDLE', 'IVORY LOVE BIRD CANDLE', 'BLACK LOVE BIRD CANDLE', 'FELTCRAFT D

In [11]:
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df_encoded = pd.DataFrame(te_array, columns=te.columns_)
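
To make the encoding step concrete, here is a minimal pure-Python sketch of the same idea on a toy example. The function `one_hot_encode` and the toy item names are hypothetical and only for illustration; this is not how `TransactionEncoder` is implemented, and the columns are sorted here simply to get a deterministic order.

```python
# Illustrative sketch only: `one_hot_encode` and the toy items are hypothetical.
def one_hot_encode(transactions):
    """Return (columns, rows): item names and one boolean row per transaction."""
    # One column per distinct item; sorted only for a deterministic order.
    columns = sorted({item for t in transactions for item in t})
    # A cell is True when the item occurs in that transaction.
    rows = [[item in set(t) for item in columns] for t in transactions]
    return columns, rows


toy = [["TEA", "CAKE"], ["CAKE"], ["TEA", "MILK"]]
columns, rows = one_hot_encode(toy)
print(columns)
for row in rows:
    print(row)
```

Each printed row corresponds to one toy transaction, in the same layout as `df_encoded`.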

In [14]:
df_encoded.head()

Unnamed: 0,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE WOODLAND,3 PIECE SPACEBOY COOKIE CUTTER SET,3 RAFFIA RIBBONS 50'S CHRISTMAS,3 RAFFIA RIBBONS VINTAGE CHRISTMAS,3 TIER CAKE TIN RED AND CREAM,3 TRADITIONAl BISCUIT CUTTERS SET,36 DOILIES DOLLY GIRL,...,WOODEN STAR CHRISTMAS SCANDINAVIAN,WOODEN TREE CHRISTMAS SCANDINAVIAN,WOODLAND CHARLOTTE BAG,WOODLAND SMALL RED FELT HEART,WORLD WAR 2 GLIDERS ASSTD DESIGNS,WRAP VINTAGE DOILY,WRAP ALPHABET DESIGN,WRAP DOLLY GIRL,WRAP RED VINTAGE DOILY,ZINC WILLIE WINKIE CANDLE STICK
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,True,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False


## 3. Learn Association Rules

The Apriori algorithm follows a breadth-first search (BFS) strategy, exploring itemsets level by level: it starts from single items and gradually expands to larger itemsets. At each level, Apriori prunes infrequent candidates using the *Apriori property*, which states that any superset of an infrequent itemset must also be infrequent (equivalently, every subset of a frequent itemset is frequent). This pruning significantly reduces the search space compared with enumerating all possible itemsets.
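
The level-wise search described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the `mlxtend` implementation: the function name `apriori_sketch` and the toy transactions are hypothetical, and support is recomputed by scanning all transactions, which is fine for a toy but slow on real data.

```python
from itertools import combinations


# Illustrative sketch only: `apriori_sketch` and the toy data are hypothetical.
def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for itemsets with support >= min_support."""
    tx_sets = [frozenset(t) for t in transactions]
    n = len(tx_sets)

    def support(itemset):
        # Fraction of transactions that contain every item of `itemset`.
        return sum(itemset <= t for t in tx_sets) / n

    # Level 1: frequent single items.
    items = {item for t in tx_sets for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep the candidates that reach the support threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


toy = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C", "D"], ["A", "B", "C"]]
frequent = apriori_sketch(toy, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this toy run, item `D` is dropped at level 1, so no candidate containing `D` is ever generated, and the triple `{A, B, C}` is counted but rejected because its support falls below the threshold.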

Next, we mine association rules in two steps. First, we compute the frequent itemsets with the [`apriori`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/) algorithm, keeping itemsets that appear in at least 7% of transactions (`min_support=0.07`).

In [47]:
frq_items = apriori(df_encoded, min_support=0.07, use_colnames=True)

In [48]:
frq_items.sort_values(['support'], ascending=False)

Unnamed: 0,support,itemsets
9,0.521739,(POSTAGE)
15,0.152174,(SET OF 3 CAKE TINS PANTRY DESIGN )
5,0.130435,(MINI PAINT SET VINTAGE )
20,0.130435,(WOODEN OWLS LIGHT GARLAND )
21,0.108696,(WORLD WAR 2 GLIDERS ASSTD DESIGNS)
3,0.108696,(JAM MAKING SET PRINTED)
25,0.108696,"(RED TOADSTOOL LED NIGHT LIGHT, POSTAGE)"
11,0.108696,(RED TOADSTOOL LED NIGHT LIGHT)
2,0.108696,(GUMBALL COAT RACK)
22,0.108696,"(GUMBALL COAT RACK, POSTAGE)"


In [49]:
print("Total Frequent Itemsets:", frq_items.shape[0])

Total Frequent Itemsets: 27


Then we use the frequent itemsets to generate association rules, keeping only rules with a lift of at least 1.

In [50]:
rules = association_rules(frq_items, metric="lift", min_threshold=1)
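
As a rough illustration of what rule generation involves, the sketch below splits each frequent itemset into every antecedent/consequent pair and keeps the rules whose lift meets a threshold. The function `generate_rules` is hypothetical and is not the `mlxtend` implementation; the three supports are the rounded values for POSTAGE and GUMBALL COAT RACK from the frequent-itemset table above.

```python
from itertools import combinations


# Illustrative sketch only: `generate_rules` is a hypothetical helper.
def generate_rules(frequent, min_lift=1.0):
    """frequent: {frozenset: support}. Return rule dicts with lift >= min_lift."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                antecedent = frozenset(ante)
                consequent = itemset - antecedent
                # Subsets of a frequent itemset are frequent, so both lookups succeed.
                confidence = supp / frequent[antecedent]
                lift = confidence / frequent[consequent]
                if lift >= min_lift:
                    rules.append({
                        "antecedent": antecedent,
                        "consequent": consequent,
                        "support": supp,
                        "confidence": confidence,
                        "lift": lift,
                    })
    return rules


# Rounded supports copied from the frequent-itemset table above.
frequent = {
    frozenset({"POSTAGE"}): 0.521739,
    frozenset({"GUMBALL COAT RACK"}): 0.108696,
    frozenset({"GUMBALL COAT RACK", "POSTAGE"}): 0.108696,
}
rules_sketch = generate_rules(frequent, min_lift=1.0)
for rule in rules_sketch:
    print(sorted(rule["antecedent"]), "->", sorted(rule["consequent"]),
          round(rule["confidence"], 3), round(rule["lift"], 3))
```

One itemset of size two yields two rules, one per direction. They share the same lift but have different confidence values.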

## 4. Rule Evaluation

We evaluate the discovered association rules using multiple metrics, including support, confidence, lift, leverage, and conviction. These measures help quantify how frequently items occur together, the strength of implication between items, and the degree to which the observed co-occurrence differs from random chance.
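
For reference, the standard definitions of these metrics for a rule $A \rightarrow C$ are given below, where $\mathrm{supp}(X)$ is the fraction of transactions containing itemset $X$. The worked numbers are a check under the assumption that the rounded supports from the frequent-itemset table above are used as inputs.

$$
\begin{aligned}
\mathrm{support}(A \rightarrow C) &= \mathrm{supp}(A \cup C) \\
\mathrm{confidence}(A \rightarrow C) &= \frac{\mathrm{supp}(A \cup C)}{\mathrm{supp}(A)} \\
\mathrm{lift}(A \rightarrow C) &= \frac{\mathrm{confidence}(A \rightarrow C)}{\mathrm{supp}(C)} \\
\mathrm{leverage}(A \rightarrow C) &= \mathrm{supp}(A \cup C) - \mathrm{supp}(A)\,\mathrm{supp}(C) \\
\mathrm{conviction}(A \rightarrow C) &= \frac{1 - \mathrm{supp}(C)}{1 - \mathrm{confidence}(A \rightarrow C)}
\end{aligned}
$$

For the rule (POSTAGE) $\rightarrow$ (GUMBALL COAT RACK), with $\mathrm{supp}(A) = 0.521739$ and $\mathrm{supp}(C) = \mathrm{supp}(A \cup C) = 0.108696$:

$$
\begin{aligned}
\mathrm{confidence} &\approx \frac{0.108696}{0.521739} \approx 0.2083 \\
\mathrm{lift} &\approx \frac{0.2083}{0.108696} \approx 1.917 \\
\mathrm{leverage} &\approx 0.108696 - 0.521739 \times 0.108696 \approx 0.0520 \\
\mathrm{conviction} &\approx \frac{1 - 0.108696}{1 - 0.2083} \approx 1.126
\end{aligned}
$$

When confidence equals 1, the denominator of conviction is zero, which is why such rules are reported with a conviction of `inf`.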

In [51]:
rules.drop(['zhangs_metric', 'jaccard', 'certainty', 'kulczynski', 'representativity'], axis=1).sort_values(['confidence', 'lift'], ascending=[False, False]).head(50)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(PACK OF 72 RETROSPOT CAKE CASES),(PACK OF 60 SPACEBOY CAKE CASES),0.086957,0.086957,0.086957,1.0,11.5,0.079395,inf
3,(PACK OF 60 SPACEBOY CAKE CASES),(PACK OF 72 RETROSPOT CAKE CASES),0.086957,0.086957,0.086957,1.0,11.5,0.079395,inf
0,(GUMBALL COAT RACK),(POSTAGE),0.108696,0.521739,0.108696,1.0,1.916667,0.051985,inf
4,(RABBIT NIGHT LIGHT),(POSTAGE),0.086957,0.521739,0.086957,1.0,1.916667,0.041588,inf
6,(RED TOADSTOOL LED NIGHT LIGHT),(POSTAGE),0.108696,0.521739,0.108696,1.0,1.916667,0.051985,inf
8,(WALL TIDY RETROSPOT ),(POSTAGE),0.086957,0.521739,0.086957,1.0,1.916667,0.041588,inf
1,(POSTAGE),(GUMBALL COAT RACK),0.521739,0.108696,0.108696,0.208333,1.916667,0.051985,1.125858
7,(POSTAGE),(RED TOADSTOOL LED NIGHT LIGHT),0.521739,0.108696,0.108696,0.208333,1.916667,0.051985,1.125858
5,(POSTAGE),(RABBIT NIGHT LIGHT),0.521739,0.086957,0.086957,0.166667,1.916667,0.041588,1.095652
9,(POSTAGE),(WALL TIDY RETROSPOT ),0.521739,0.086957,0.086957,0.166667,1.916667,0.041588,1.095652


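The results reveal co-purchase behavior in the transaction data. For example, the two cake case items (PACK OF 72 RETROSPOT CAKE CASES and PACK OF 60 SPACEBOY CAKE CASES) have a confidence of 1.0 and a lift of 11.5 in both directions, meaning they are bought together far more often than expected if the purchases were independent. This suggests a strong complementary relationship, although the rule covers only a small share of transactions (support of about 8.7%).

Rules involving POSTAGE appear frequently because of its high overall support (about 52% of transactions). Every rule of the form item → POSTAGE in the table has confidence 1.0, so in this dataset these items are always bought together with postage. The reverse rules, POSTAGE → item, have a confidence of only about 0.17 to 0.21, so paying postage does not strongly imply purchasing any specific product. Lift is the same in both directions (about 1.92), so the asymmetry shows up in confidence, which highlights the directional nature of association rules.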

----