<a href="https://colab.research.google.com/github/slavyolov/Algorithms/blob/main/data_mining/association_rules.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Mount Google Drive and set up the working directory

In [4]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/Python_files', force_remount=True)

Mounted at /content/Python_files


In [26]:
# Change working directory
%cd /content/Python_files/MyDrive/Python_files

/content/Python_files/MyDrive/Python_files


In [27]:
!ls

barcode_reader.py  Data  YOLOv3_course_or


## Install packages

In [None]:
!pip install mlxtend==0.19.0



## Package imports

In [31]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules 

## Data 
- https://www.kaggle.com/datasets/devchauhan1/market-basket-optimisationcsv

Note: the data was downloaded from the link above and uploaded manually to Google Drive under "/content/Python_files/MyDrive/Python_files".

In [29]:
data = pd.read_csv("Data/Market_Basket_Optimisation.csv", header=None)
print(data.shape)

(7501, 20)


In [30]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


## Data Pre-Processing


To mine frequent itemsets with mlxtend (Apriori, or FP-growth as used below), the dataset must first be transformed into a binary matrix in which rows are transactions and columns are products. A cell is 1 (True) if the product was bought in that transaction and 0 (False) if it was not. The frequent-pattern functions expect their input in this form.

http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
http://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/
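
To make the target layout concrete, here is a minimal pure-Python sketch of the encoding described above. The baskets in `toy_transactions` are hypothetical and not taken from the dataset. `TransactionEncoder` produces the same kind of boolean matrix; this sketch only illustrates the idea and is not its implementation.

```python
# Hypothetical toy baskets (not taken from the dataset).
toy_transactions = [
    ["milk", "bread"],
    ["bread", "eggs", "butter"],
    ["milk"],
]

# Columns of the matrix: every distinct item, in sorted order.
items = sorted({item for basket in toy_transactions for item in basket})

# One row per transaction; True where the basket contains the item.
one_hot = [[item in basket for item in items] for basket in toy_transactions]

print(items)
for row in one_hot:
    print(row)
```

Each row has the same length regardless of basket size, which is what lets the result be loaded into a DataFrame with one column per product.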


In [33]:
# Convert every transaction to a separate list and gather the lists into a NumPy array

transaction_l = []
for i in range(data.shape[0]):
    transaction_l.append([str(data.values[i,j]) for j in range(data.shape[1])])
    
transaction_arr = np.array(transaction_l)

print(type(transaction_arr))
print(transaction_arr)

<class 'numpy.ndarray'>
[['shrimp' 'almonds' 'avocado' ... 'frozen smoothie' 'spinach'
  'olive oil']
 ['burgers' 'meatballs' 'eggs' ... 'nan' 'nan' 'nan']
 ['chutney' 'nan' 'nan' ... 'nan' 'nan' 'nan']
 ...
 ['chicken' 'nan' 'nan' ... 'nan' 'nan' 'nan']
 ['escalope' 'green tea' 'nan' ... 'nan' 'nan' 'nan']
 ['eggs' 'frozen smoothie' 'yogurt cake' ... 'nan' 'nan' 'nan']]
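
The array printed above contains the literal string 'nan' for every empty cell, which is why a 'nan' column has to be dropped after encoding. A possible alternative, sketched here under the assumption that the input is a DataFrame padded with missing values like `data`, is to skip the missing cells while building the lists. `toy_data` is a hypothetical stand-in, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for `data`: ragged baskets padded with missing values.
toy_data = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", None, None],
    ["turkey", "avocado", None],
])

# Keep only the non-missing cells of each row, so no 'nan' item is created.
transactions = [
    [str(item) for item in row if pd.notna(item)]
    for row in toy_data.itertuples(index=False)
]
print(transactions)
```

The result is a list of lists of different lengths rather than a rectangular array, so the later `drop(['nan'], axis=1)` step would no longer be needed with this variant.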


In [51]:
te = TransactionEncoder()
te_ary = te.fit(transaction_arr).transform(transaction_arr)
df_train = pd.DataFrame(te_ary, columns=te.columns_)
df_train = df_train.drop(['nan'], axis=1)

print(type(te_ary))
df_train.head()

## There are 120 columns/features at the moment. ('nan' feature removed)

<class 'numpy.ndarray'>


Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False


In [52]:
# FP growth algorithm
model_result = fpgrowth(df_train, min_support=0.0005, use_colnames=True)
print(model_result)

        support                                          itemsets
0      0.238368                                   (mineral water)
1      0.132116                                       (green tea)
2      0.076523                                  (low fat yogurt)
3      0.071457                                          (shrimp)
4      0.065858                                       (olive oil)
...         ...                                               ...
21125  0.000533                 (chocolate, ground beef, oatmeal)
21126  0.000533  (chocolate, ground beef, mineral water, oatmeal)
21127  0.000533                   (spaghetti, green tea, oatmeal)
21128  0.000533               (spaghetti, mineral water, oatmeal)
21129  0.000533                                (spaghetti, cream)

[21130 rows x 2 columns]
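
The smallest supports in the output above (0.000533) can be read as transaction counts. The arithmetic below is a small sketch using only the numbers already shown (7501 transactions, `min_support=0.0005`); the variable names are illustrative.

```python
import math

n_transactions = 7501   # rows in df_train
min_support = 0.0005    # threshold passed to fpgrowth

# Smallest whole number of transactions whose share reaches the threshold.
min_count = math.ceil(min_support * n_transactions)
print(min_count)

# Support corresponding to that count, rounded like the printed table.
print(round(min_count / n_transactions, 6))
```

In other words, an itemset has to appear in at least 4 of the 7501 baskets to pass the threshold, which explains both the very large number of itemsets and the repeated value 0.000533 at the bottom of the list.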


In [53]:
# Associations
# Note: do not rely on lift when ranking by the Kulczynski measure (lift is not null-invariant)
assoc_rules = association_rules(model_result, metric="confidence", min_threshold=0.0001)
print(len(assoc_rules))

198860
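
The comment about lift in the cell above can be illustrated with a small sketch. The counts are hypothetical, and the two helper functions are written here only for illustration (they are not part of mlxtend). The point is that lift changes when transactions containing neither item are added, while the Kulczynski measure does not.

```python
def lift(n_a, n_b, n_ab, n_total):
    """lift = P(A and B) / (P(A) * P(B)), computed from counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))


def kulczynski(n_a, n_b, n_ab):
    """Average of the two confidences P(B|A) and P(A|B)."""
    return 0.5 * (n_ab / n_a + n_ab / n_b)


# Hypothetical counts: A in 100 baskets, B in 100 baskets, both in 50.
n_a, n_b, n_ab = 100, 100, 50

# Only the number of baskets containing neither A nor B changes.
for n_total in (200, 2000):
    print(n_total, lift(n_a, n_b, n_ab, n_total), kulczynski(n_a, n_b, n_ab))
```

With 200 baskets the lift is about 1, with 2000 baskets it is about 10, yet the relationship between A and B is identical in both cases; the Kulczynski value stays at 0.5.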


In [59]:
# Display of the output results

assoc_rules_final = assoc_rules.copy(deep=True)

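# STEP 0 : keep only 1:1 rules (one antecedent item, one consequent item)
# Build the mask from the frame being filtered to avoid pandas' "Boolean Series key will be reindexed" warning
one_to_one = (assoc_rules_final['antecedents'].str.len() == 1) & \
             (assoc_rules_final['consequents'].str.len() == 1)
assoc_rules_final = assoc_rules_final[one_to_one]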
print(len(assoc_rules_final))
print(assoc_rules_final.info())
assoc_rules_final.head()

6498
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6498 entries, 0 to 198859
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   antecedents         6498 non-null   object 
 1   consequents         6498 non-null   object 
 2   antecedent support  6498 non-null   float64
 3   consequent support  6498 non-null   float64
 4   support             6498 non-null   float64
 5   confidence          6498 non-null   float64
 6   lift                6498 non-null   float64
 7   leverage            6498 non-null   float64
 8   conviction          6498 non-null   float64
dtypes: float64(7), object(2)
memory usage: 507.7+ KB
None




Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(green tea),(mineral water),0.132116,0.238368,0.031063,0.235116,0.986357,-0.00043,0.995748
1,(mineral water),(green tea),0.238368,0.132116,0.031063,0.130313,0.986357,-0.00043,0.997927
2,(spaghetti),(green tea),0.17411,0.132116,0.02653,0.152374,1.153335,0.003527,1.0239
3,(green tea),(spaghetti),0.132116,0.17411,0.02653,0.200807,1.153335,0.003527,1.033405
4,(french fries),(green tea),0.170911,0.132116,0.02853,0.166927,1.263488,0.00595,1.041786


In [60]:
# STEP 1 : Convert the frozensets to string
change_cols = ['antecedents', 'consequents']

# 1) frozenset to set; 2) set to string
for col in change_cols:
    assoc_rules_final[col] = assoc_rules_final[col].apply(set)
    assoc_rules_final[col] = assoc_rules_final[col].apply(lambda x: ','.join(map(str, x)))

print(assoc_rules_final.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6498 entries, 0 to 198859
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   antecedents         6498 non-null   object 
 1   consequents         6498 non-null   object 
 2   antecedent support  6498 non-null   float64
 3   consequent support  6498 non-null   float64
 4   support             6498 non-null   float64
 5   confidence          6498 non-null   float64
 6   lift                6498 non-null   float64
 7   leverage            6498 non-null   float64
 8   conviction          6498 non-null   float64
dtypes: float64(7), object(2)
memory usage: 507.7+ KB
None


In [61]:
# STEP 2 : Calculate 'Kulczynski'
assoc_rules_final['kulczynski'] = 0.5 * ((assoc_rules_final['support'] / assoc_rules_final['antecedent support']) +
                                          (assoc_rules_final['support'] / assoc_rules_final['consequent support'])
                                          )
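
As a sanity check of the formula in STEP 2, the sketch below recomputes the value by hand for the first rule displayed earlier (green tea and mineral water), using the rounded supports from that table. The variable names are illustrative.

```python
# Rounded supports of the rule green tea -> mineral water, as displayed above.
support_a = 0.132116    # antecedent support (green tea)
support_b = 0.238368    # consequent support (mineral water)
support_ab = 0.031063   # support of both items together

confidence_a_to_b = support_ab / support_a   # P(B | A)
confidence_b_to_a = support_ab / support_b   # P(A | B)

# Kulczynski = average of the two conditional probabilities.
kulczynski = 0.5 * (confidence_a_to_b + confidence_b_to_a)
print(round(kulczynski, 4))
```

Because the measure averages both directions, the rule and its reverse (mineral water and green tea) receive the same Kulczynski value.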

In [62]:
# STEP 3 : Calculate IR
assoc_rules_final['IR'] = np.abs(assoc_rules_final['antecedent support'] -
                                  assoc_rules_final['consequent support']
                                  ) / (assoc_rules_final['antecedent support'] +
                                      assoc_rules_final['consequent support'] -
                                      assoc_rules_final['support']
                                      )
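
The imbalance ratio from STEP 3 can be checked the same way. This sketch reuses the rounded supports of the green tea and mineral water rule displayed earlier; the variable names are illustrative.

```python
# Rounded supports of the rule green tea -> mineral water, as displayed above.
support_a = 0.132116    # antecedent support (green tea)
support_b = 0.238368    # consequent support (mineral water)
support_ab = 0.031063   # support of both items together

# IR = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B))
imbalance_ratio = abs(support_a - support_b) / (support_a + support_b - support_ab)
print(round(imbalance_ratio, 3))
```

IR is 0 when both items are equally frequent and grows toward 1 as one item becomes much more frequent than the other, so it is usually read together with the Kulczynski value rather than on its own.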

In [63]:
# STEP 4 : Transaction number
assoc_rules_final['total_transactions'] = len(df_train)
assoc_rules_final['antecedent_trans_count'] = assoc_rules_final['antecedent support'] * \
                                              assoc_rules_final['total_transactions']
assoc_rules_final['consequent_trans_count'] = assoc_rules_final['consequent support'] * \
                                              assoc_rules_final['total_transactions']
assoc_rules_final['both_AC_count'] = assoc_rules_final['support'] * \
                                      assoc_rules_final['total_transactions']

In [64]:
# STEP 5 : cosine / harmonized lift
# The cosine measure can be viewed as a harmonized lift measure: the two formulae are similar, except that for cosine
# the square root is taken of the product of the probabilities of A and B. This is an important difference:
# because of the square root, the cosine value is influenced only by the supports of A, B, and A∪B,
# and not by the total number of transactions.

assoc_rules_final['cosine'] = ((assoc_rules_final['support'] / assoc_rules_final['antecedent support']) *
                                (assoc_rules_final['support'] / assoc_rules_final['consequent support'])
                                ) ** 0.5
print(assoc_rules_final.head(5).to_markdown())

|    | antecedents   | consequents   |   antecedent support |   consequent support |   support |   confidence |     lift |     leverage |   conviction |   kulczynski |       IR |   total_transactions |   antecedent_trans_count |   consequent_trans_count |   both_AC_count |   cosine |
|---:|:--------------|:--------------|---------------------:|---------------------:|----------:|-------------:|---------:|-------------:|-------------:|-------------:|---------:|---------------------:|-------------------------:|-------------------------:|----------------:|---------:|
|  0 | green tea     | mineral water |             0.132116 |             0.238368 | 0.0310625 |     0.235116 | 0.986357 | -0.000429663 |     0.995748 |     0.182715 | 0.31304  |                 7501 |                      991 |                     1788 |             233 | 0.175039 |
|  1 | mineral water | green tea     |             0.238368 |             0.132116 | 0.0310625 |     0.130313 | 0.986357 | -0.000429663 |     0.9
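
The claim in the STEP 5 comment, that cosine does not depend on the total number of transactions, can be verified with the counts shown in the first table row (991, 1788 and 233 out of 7501). The sketch below computes cosine once from raw counts and once from supports with the same formula as the notebook; the variable names are illustrative.

```python
import math

# Transaction counts for green tea / mineral water from the table above.
n_a, n_b, n_ab, n_total = 991, 1788, 233, 7501

# From counts only: the total number of transactions cancels out.
cosine_from_counts = n_ab / math.sqrt(n_a * n_b)

# From supports, using the same expression as the notebook cell.
s_a, s_b, s_ab = n_a / n_total, n_b / n_total, n_ab / n_total
cosine_from_supports = math.sqrt((s_ab / s_a) * (s_ab / s_b))

print(round(cosine_from_counts, 4), round(cosine_from_supports, 4))
```

Both expressions agree, and the first one never uses `n_total`, which is exactly the property that distinguishes cosine from lift.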

In [77]:
# STEP 6 : Example queries (top rules ranked by each measure)
assoc_rules_query = assoc_rules_final.copy(deep=True)

# 0) Result
assoc_rules_query = assoc_rules_query.sort_values(by='cosine', ascending=False)
print("Top 5 associations sorted by cosine measure :\n")
print(assoc_rules_query.head(5).to_markdown())

print("\n\n")
print("Top 5 associations sorted by kulczynski measure :\n")
assoc_rules_query = assoc_rules_query.sort_values(by='kulczynski', ascending=False)
print(assoc_rules_query.head(5).to_markdown())

print("\n\n")
print("Top 5 associations sorted by both_AC_count :\n")
assoc_rules_query = assoc_rules_query.sort_values(by='both_AC_count', ascending=False)
print(assoc_rules_query.head(5).to_markdown())

Top 5 associations sorted by cosine measure :

|        | antecedents   | consequents   |   antecedent support |   consequent support |   support |   confidence |    lift |   leverage |   conviction |   kulczynski |       IR |   total_transactions |   antecedent_trans_count |   consequent_trans_count |   both_AC_count |   cosine |
|-------:|:--------------|:--------------|---------------------:|---------------------:|----------:|-------------:|--------:|-----------:|-------------:|-------------:|---------:|---------------------:|-------------------------:|-------------------------:|----------------:|---------:|
| 150963 | spaghetti     | ground beef   |            0.17411   |            0.0982536 | 0.0391948 |     0.225115 | 2.29116 |  0.0220878 |      1.16372 |     0.312015 | 0.325329 |                 7501 |                     1306 |                      737 |             294 | 0.299669 |
| 150962 | ground beef   | spaghetti     |            0.0982536 |            0.17411   | 0.039