<a href="https://colab.research.google.com/github/wdittaya/MLWorkshop/blob/main/2025_04_10_CUVIP_AssociationRules.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Association rules

Find frequent itemsets and association rules in the dataset, then answer the following questions:
- How many itemsets of length 1 are there when `min_support=0.005` and `min_confidence=0.5`?
- How many items are in the longest itemset?
- How many rules can be extracted from the data when `min_support=0.005` and `min_confidence=0.5`?

Upload your Colab notebook and a screenshot of the rules to the Google Form.
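
As a reminder of what the two thresholds measure, here is a minimal sketch that computes support and confidence by hand. The transactions and the helper names `support` and `confidence` are hypothetical and are not part of the workshop dataset or of any library.

```python
# Toy transactions (hypothetical data, not taken from the Online Retail dataset)
transactions = [
    ("bread", "milk"),
    ("bread", "butter"),
    ("bread", "milk", "butter"),
    ("milk",),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items.issubset(t))
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(lhs and rhs together) divided by support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


supp_bread_milk = support({"bread", "milk"}, transactions)
conf_bread_to_milk = confidence({"bread"}, {"milk"}, transactions)
print(supp_bread_milk, conf_bread_to_milk)
```

An itemset is kept as frequent only if its support is at least `min_support`, and a rule is kept only if its confidence is at least `min_confidence`.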

## Loading the dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [None]:
# This download takes around 2 minutes.

from ucimlrepo import fetch_ucirepo
data = fetch_ucirepo(id=352)

In [None]:
df = data.data.original
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France


## Data cleansing

- Drop any rows that contain N/A values.
- Convert the `StockCode` column type to `string`.
    - `df['StockCode'] = df['StockCode'].astype('string')`
- Keep only the rows where `Country` is `United Kingdom`.
    - Hint: `361878` rows should remain.
- Create a list of StockCodes for each invoice. We will pass these lists to the Apriori algorithm.
    - Hint: Transaction `489434` has the following stock code list: `['85048', '79323P', '79323W', '22041', '21232', '22064', '21871', '21523']`
- Set the support to `0.005` and the confidence to `0.5`.
    - You may try different values. However, a lower support threshold needs more transactions in the dataset for the resulting patterns to be **interesting**.
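
The steps above can be sketched on a tiny hand-made table. This is only an illustration under stated assumptions: the values in `toy` are hypothetical, the column names mirror the dataset, and `clean` and `trans_list` are names chosen here (`trans_list` matches the variable used later in the notebook).

```python
import pandas as pd

# Toy invoice lines (hypothetical values) using the dataset's column names
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A2", "A3", "A4"],
    "StockCode": [111, 222, 111, 333, 222, 111],
    "CustomerID": [1.0, 1.0, 2.0, 2.0, None, 3.0],
    "Country": ["United Kingdom"] * 5 + ["France"],
})

# 1. Drop rows with missing values (removes the A3 line)
clean = toy.dropna().copy()

# 2. Convert StockCode to the pandas string dtype
clean["StockCode"] = clean["StockCode"].astype("string")

# 3. Keep only United Kingdom rows (removes the A4 line)
clean = clean[clean["Country"] == "United Kingdom"]

# 4. Collect the stock codes of each invoice into one list per invoice
trans_list = clean.groupby("InvoiceNo")["StockCode"].apply(list).tolist()
print(trans_list)
```

Grouping by `InvoiceNo` gives one list per invoice, which is the transaction format described in the hint above.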

In [None]:
# Drop N/A
# Convert 'StockCode' column to 'string'
# Choose only rows where 'Country' is 'United Kingdom'





In [None]:
# Create a list of stock codes for each invoice
# Hint: Transaction `489434` has the following stock code list
#    ['85048', '79323P', '79323W', '22041', '21232', '22064', '21871', '21523']


In [None]:
!pip install efficient-apriori


Collecting efficient-apriori
  Downloading efficient_apriori-2.0.5-py3-none-any.whl.metadata (6.7 kB)
Downloading efficient_apriori-2.0.5-py3-none-any.whl (14 kB)
Installing collected packages: efficient-apriori
Successfully installed efficient-apriori-2.0.5


In [None]:
from efficient_apriori import apriori

In [None]:
# Pass the transaction list to the apriori algorithm and set min_support and min_confidence
# e.g. if
# - trans_list is the transaction list variable
# - min_support is 0.005
# - min_confidence is 0.5
# then the call is:
itemsets, rules = apriori(trans_list, min_support=0.005, min_confidence=0.5)

In [None]:
# Code to answer the following questions
# - How many itemsets of length 1 are there when min_support=0.005 and min_confidence=0.5?
# - How many items are in the longest itemset?
# - How many rules can be extracted from the data when min_support=0.005 and min_confidence=0.5?
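
One possible way to read the answers off the results is sketched below. It assumes `itemsets` is a dictionary that maps each itemset length to a dictionary of `{itemset tuple: count}`, and that `rules` is a list. The values here are hand-written stand-ins so the snippet runs without the dataset; they are not real results.

```python
# Hand-written stand-ins (hypothetical values) mimicking the assumed result shape
itemsets = {
    1: {("A",): 5, ("B",): 4, ("C",): 3},
    2: {("A", "B"): 3, ("A", "C"): 2},
    3: {("A", "B", "C"): 2},
}
rules = ["A -> B", "B -> A", "A, C -> B"]  # placeholders for rule objects

# Number of frequent itemsets of length 1
n_length_1 = len(itemsets.get(1, {}))

# Size of the longest frequent itemset (ignoring any empty levels)
longest = max((k for k, v in itemsets.items() if v), default=0)

# Number of extracted rules
n_rules = len(rules)

print(n_length_1, longest, n_rules)
```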



In [None]:
# Build a StockCode -> Description lookup table (one description per stock code)

stock_df = df[['StockCode', 'Description']].drop_duplicates(subset='StockCode', ignore_index=True).set_index('StockCode')
stock_df

In [None]:
# rules is the association rules from the apriori algorithm

for i, rule in enumerate(rules):
    print('Rule {}'.format(i+1))

    lhs = '/'.join(stock_df.loc[list(rule.lhs)].values.reshape(-1)) # antecedent
    rhs = '/'.join(stock_df.loc[list(rule.rhs)].values.reshape(-1)) # consequent

    # antecedent -> consequent
    print('{} -> {}'.format(lhs, rhs))

    print('Supp = {}, Conf = {}, Lift = {}, Conv = {}'.format(rule.support, rule.confidence, rule.lift, rule.conviction))
    print()
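
To help interpret the lift and conviction values printed above, here is a small sketch of their textbook definitions. The function names and the input numbers are hypothetical, and returning infinity when the confidence equals 1 is a choice made for this sketch; the library may handle that case differently.

```python
def lift(confidence, support_rhs):
    """How much more often rhs appears with lhs than expected if independent."""
    return confidence / support_rhs


def conviction(confidence, support_rhs):
    """Ratio of the expected to the observed frequency of lhs occurring without rhs."""
    if confidence == 1:
        return float("inf")  # the rule never fails in the data
    return (1 - support_rhs) / (1 - confidence)


# Hypothetical rule: confidence 0.75, consequent present in 50% of transactions
rule_lift = lift(0.75, 0.5)
rule_conv = conviction(0.75, 0.5)
print(rule_lift, rule_conv)
```

A lift above 1 suggests the antecedent and consequent occur together more often than they would if they were independent, while a lift near 1 suggests little association.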