## **RADI608: Data Mining and Machine Learning**

### Assignment: APriori Algorithm
**Romen Samuel Rodis Wabina** <br>
Student, PhD Data Science in Healthcare and Clinical Informatics <br>
Clinical Epidemiology and Biostatistics, Faculty of Medicine (Ramathibodi Hospital) <br>
Mahidol University

Note: In case of Python Markdown errors, you may access the assignment through this GitHub [Link](https://github.com/rrwabina/RADI608/tree/main/Submitted)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings('ignore')

import mlxtend
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

from sklearn.preprocessing import LabelEncoder

### Question 1: Generate the frequent itemset and association rules using the given transaction database below. Use a <code>minimum support threshold = 50%</code>

<center>
<img src = "figures/question1.JPG" width = "650"/> <br>
</center>

##### We first compute the support of each item to determine which items are frequent. The results are shown in the figure below.

\begin{equation*}
\begin{aligned}    
    \text{support}(\text{onions})       = \frac{3}{5} = 0.60 & & \text{support}(\text{potatoes}) = \frac{2}{5} = 0.40 \\
    \text{support}(\text{burger})       = \frac{3}{5} = 0.60 & & \text{support}(\text{cereal})   = \frac{2}{5} = 0.40 \\
    \text{support}(\text{potato chips}) = \frac{4}{5} = 0.80 & & \text{support}(\text{beer})     = \frac{3}{5} = 0.60 \\
    \text{support}(\text{eggs})         = \frac{2}{5} = 0.40 \\
\end{aligned}
\end{equation*}

<center>
<img src = "../figures/02.JPG" width = '650'> <br>
</center>

##### We then applied the minimum support threshold of 50% to prune the infrequent items. Potatoes, cereal, and eggs fall below this threshold and are omitted, which leaves onions, burger, potato chips, and beer as the frequent items.

<center>
<img src = "figures/03.JPG" width = '650'> <br>
</center>
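##### The counting and pruning step above can be sketched in plain Python. This is only an illustration, assuming the transaction list used in the Python check of this question; the names `transactions`, `MIN_SUPPORT`, and `frequent_items` are our own and not part of any library.

```python
from collections import Counter

# Transactions as encoded in this question's Python check.
transactions = [
    ['onions', 'potatoes', 'burger', 'cereal'],
    ['potato chips', 'burger', 'beer', 'eggs'],
    ['onions', 'potatoes', 'potato chips', 'burger', 'beer'],
    ['potato chips', 'beer', 'onions'],
    ['eggs', 'cereal', 'potato chips'],
]
MIN_SUPPORT = 0.50  # minimum support threshold from the question

# Count each item at most once per transaction, then convert counts to supports.
counts = Counter(item for basket in transactions for item in set(basket))
support = {item: n / len(transactions) for item, n in counts.items()}

# Prune: keep only the items whose support meets the threshold.
frequent_items = sorted(item for item, s in support.items() if s >= MIN_SUPPORT)
print(frequent_items)  # ['beer', 'burger', 'onions', 'potato chips']
```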

##### We then form the candidate 2-itemsets from the frequent items in the previous table and compute the support of each candidate.

\begin{equation*}
\begin{aligned}    
    \text{support}(\text{onions, burger})        = \frac{2}{5} = 0.40 & & \text{support}(\text{onions, potato chips})       = \frac{2}{5} = 0.40 \\
    \text{support}(\text{onions, beer})          = \frac{2}{5} = 0.40 & & \text{support}(\text{burger, potato chips})       = \frac{2}{5} = 0.40 \\
    \text{support}(\text{burger, beer})          = \frac{2}{5} = 0.40 & & \text{support}(\text{potato chips, beer})         = \frac{3}{5} = 0.60 \\
\end{aligned}
\end{equation*}

##### We omitted the itemsets whose support is below the minimum support threshold, which leaves only the itemset (potato chips, beer). We stopped at 2-itemsets because a single frequent 2-itemset cannot be joined with another one to build a candidate 3-itemset.

<center>
<img src = "figures/04.JPG" width = '650'> <br>
</center>
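##### The join-and-prune step for 2-itemsets can be sketched as follows. This is a minimal illustration, assuming the same transaction list as the Python check of this question and the four frequent items found in the previous step; the helper `support` is a hypothetical name of our own.

```python
from itertools import combinations

transactions = [
    ['onions', 'potatoes', 'burger', 'cereal'],
    ['potato chips', 'burger', 'beer', 'eggs'],
    ['onions', 'potatoes', 'potato chips', 'burger', 'beer'],
    ['potato chips', 'beer', 'onions'],
    ['eggs', 'cereal', 'potato chips'],
]
MIN_SUPPORT = 0.50
frequent_items = ['beer', 'burger', 'onions', 'potato chips']  # from the pruning step

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= set(basket) for basket in transactions) / len(transactions)

# Join: every unordered pair of frequent items is a candidate 2-itemset.
candidates = list(combinations(frequent_items, 2))

# Prune: keep the candidates that meet the minimum support.
frequent_pairs = [pair for pair in candidates if support(pair) >= MIN_SUPPORT]
print(len(candidates), frequent_pairs)  # 6 [('beer', 'potato chips')]
```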

##### After pruning, only one frequent 2-itemset remains, which yields two candidate rules. We calculated the confidence and lift values for these two rules.

##### Confidence Values

\begin{equation*}
\begin{aligned}    
    \text{confidence}(\text{potato chips} \longrightarrow \text{beer}) &= \frac{\text{support}(\text{potato chips, beer})}{\text{support}(\text{potato chips})} = \frac{0.60}{0.80} = 0.75 = 75\%  \\
    \\
    \\
    \text{confidence}(\text{beer} \longrightarrow \text{potato chips}) &= \frac{\text{support}(\text{potato chips, beer})}{\text{support}(\text{beer})} = \frac{0.60}{0.60} = 1.00 = 100\%  \\
\end{aligned}
\end{equation*}

##### Lift Values

\begin{equation*}
\begin{aligned}    
    \text{lift}(\text{potato chips} \longrightarrow \text{beer}) &= \frac{\text{support}(\text{potato chips, beer})}{\text{support}(\text{potato chips}) \times \text{support}(\text{beer})} = \frac{0.60}{0.80 \times 0.60} = 1.25  \\
    \\
    \\
    \text{lift}(\text{beer} \longrightarrow \text{potato chips}) &= \frac{\text{support}(\text{potato chips, beer})}{\text{support}(\text{beer}) \times \text{support}(\text{potato chips})} = \frac{0.60}{0.60 \times 0.80} = 1.25  \\
\end{aligned}
\end{equation*}
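##### The confidence and lift formulas above can be checked with exact fractions. This is a small sketch, assuming the supports computed earlier (counts out of 5 transactions); the functions `confidence` and `lift` are our own helpers and not library calls.

```python
from fractions import Fraction

# Supports from the worked example, stored as exact fractions.
support = {
    frozenset({'potato chips'}): Fraction(4, 5),
    frozenset({'beer'}): Fraction(3, 5),
    frozenset({'potato chips', 'beer'}): Fraction(3, 5),
}

def confidence(antecedent, consequent):
    """support(A and C) / support(A)"""
    return support[frozenset(antecedent | consequent)] / support[frozenset(antecedent)]

def lift(antecedent, consequent):
    """support(A and C) / (support(A) * support(C)), i.e. confidence / support(C)"""
    return confidence(antecedent, consequent) / support[frozenset(consequent)]

chips, beer = {'potato chips'}, {'beer'}
print(float(confidence(chips, beer)), float(lift(chips, beer)))  # 0.75 1.25
print(float(confidence(beer, chips)), float(lift(beer, chips)))  # 1.0 1.25
```

##### Lift is the same in both directions because its formula is symmetric in the two itemsets, whereas confidence depends on which itemset is the antecedent.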

##### In summary, we have the following table:
<center>
<img src = "figures/05.JPG" width = '650'> <br>
</center>

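##### If a customer buys potato chips, there is a 75% chance that the customer also buys beer. Conversely, every transaction in this database that contains beer also contains potato chips, so the rule $\text{beer} \longrightarrow \text{potato chips}$ has a **perfect** confidence of 100%. Both rules have a lift value of 1.25. Because the lift is greater than 1, potato chips and beer are **positively associated** rather than independent: they occur together 1.25 times as often as we would expect if they were independent. Lift is also symmetric, so it is the same in both directions of the rule. We may conclude that a customer who buys one of these two items is more likely than average to buy the other as well.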

##### We now check our manual computation using Python.

In [11]:
# Transaction database from the table in Question 1
df = [['onions', 'potatoes', 'burger', 'cereal'], 
      ['potato chips', 'burger', 'beer', 'eggs'],
      ['onions', 'potatoes', 'potato chips', 'burger', 'beer'],
      ['potato chips', 'beer', 'onions'],
      ['eggs', 'cereal', 'potato chips']]

# One-hot encode the transactions into a boolean DataFrame
encoder = TransactionEncoder()
encoder_array  = encoder.fit(df).transform(df)
df_encode = pd.DataFrame(encoder_array, columns = encoder.columns_)

# Mine the itemsets with support >= 50%, then derive the rules with lift >= 1
frequent_itemsets = apriori(df_encode, min_support = 0.50, use_colnames = True)
rules = association_rules(frequent_itemsets, metric = 'lift', min_threshold = 1)
rules

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (beer) | (potato chips) | 0.6 | 0.8 | 0.6 | 1.0 | 1.25 | 0.12 | inf |
| 1 | (potato chips) | (beer) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 |
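##### The output also reports leverage and conviction, which we did not compute by hand. The sketch below reproduces both columns, assuming the usual definitions: leverage is the joint support minus the product of the individual supports, and conviction is (1 - consequent support) / (1 - confidence). The helper functions are our own names, not mlxtend calls.

```python
import math
from fractions import Fraction

def leverage(sup_both, sup_a, sup_c):
    """support(A and C) - support(A) * support(C); zero under independence."""
    return sup_both - sup_a * sup_c

def conviction(sup_both, sup_a, sup_c):
    """(1 - support(C)) / (1 - confidence); infinite when confidence is 1."""
    conf = sup_both / sup_a
    if conf == 1:
        return math.inf
    return (1 - sup_c) / (1 - conf)

# Supports from the worked example (counts out of 5 transactions).
chips, beer, both = Fraction(4, 5), Fraction(3, 5), Fraction(3, 5)

# beer -> potato chips
print(float(leverage(both, beer, chips)), conviction(both, beer, chips))         # 0.12 inf
# potato chips -> beer
print(float(leverage(both, chips, beer)), float(conviction(both, chips, beer)))  # 0.12 1.6
```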


### Question 2: Using the <code>prescriptionDB.csv</code>, perform an Apriori Algorithm to generate the association rules by using <code>support = 0.001, confidence = 0.5,</code> and find the top 10 of the <code>RHS = 'OMPZ'</code>.

In [21]:
# Load the prescriptions and keep only the four drug-code columns
df = pd.read_csv('../data/prescriptionDB.csv')
df = df[['Code.1', 'Code.2', 'Code.3', 'Code.4']].fillna('')
df = df.to_numpy()

# Build one transaction per prescription, dropping the empty slots
dataset = []
for index in range(0, df.shape[0]):
    new_list = list(filter(None, df[index]))
    dataset.append(new_list)

# One-hot encode the transactions
encoder = TransactionEncoder()
encoder_array = encoder.fit(dataset).transform(dataset)
df_encode = pd.DataFrame(encoder_array, columns = encoder.columns_)

frequent_itemsets = apriori(df_encode, min_support = 0.001, use_colnames = True)
rules = association_rules(frequent_itemsets, metric = 'lift', min_threshold = 1)

# Keep the rules with RHS = {OMPZ} and confidence >= 0.5, ranked by lift
rule_ompz = rules[rules['consequents'] == frozenset({'OMPZ'})]
rule_ompz = rule_ompz[rule_ompz['confidence'] >= 0.50]
rule_ompz = rule_ompz.sort_values(by = ['lift'], ascending = False).head(10)
rule_ompz

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 61 | (NAPX) | (OMPZ) | 0.113511 | 0.746678 | 0.103821 | 0.914634 | 1.224938 | 0.019065 | 2.967489 |
| 23 | (ASPT) | (OMPZ) | 0.047619 | 0.746678 | 0.042082 | 0.883721 | 1.183537 | 0.006526 | 2.178571 |
| 49 | (INDM) | (OMPZ) | 0.008306 | 0.746678 | 0.007198 | 0.866667 | 1.160697 | 0.000997 | 1.899917 |
| 51 | (MELO) | (OMPZ) | 0.03433 | 0.746678 | 0.029347 | 0.854839 | 1.144856 | 0.003713 | 1.745109 |
| 46 | (IBUP) | (OMPZ) | 0.039037 | 0.746678 | 0.031008 | 0.794326 | 1.063814 | 0.00186 | 1.23167 |
| 66 | (VOLS) | (OMPZ) | 0.007752 | 0.746678 | 0.006091 | 0.785714 | 1.05228 | 0.000303 | 1.182171 |
| 21 | (ASA.) | (OMPZ) | 0.284884 | 0.746678 | 0.215947 | 0.758017 | 1.015187 | 0.003231 | 1.046862 |
| 55 | (MOBC) | (OMPZ) | 0.020487 | 0.746678 | 0.015504 | 0.756757 | 1.013498 | 0.000206 | 1.041436 |


### Question 3: Perform an Apriori Algorithm to generate the association rules by selecting the top 20 rules at <code>support = 0.0001</code> and the LHS has at least two drugs in the basket.

In [18]:
# Load the prescriptions and keep only the four drug-code columns
df = pd.read_csv('../data/prescriptionDB.csv')
df = df[['Code.1', 'Code.2', 'Code.3', 'Code.4']].fillna('')
df = df.to_numpy()

# Build one transaction per prescription, dropping the empty slots
dataset = []
for index in range(0, df.shape[0]):
    new_list = list(filter(None, df[index]))
    dataset.append(new_list)

# One-hot encode the transactions
encoder = TransactionEncoder()
encoder_array = encoder.fit(dataset).transform(dataset)
df_encode = pd.DataFrame(encoder_array, columns = encoder.columns_)

frequent_itemsets = apriori(df_encode, min_support = 0.0001, use_colnames = True)
rules = association_rules(frequent_itemsets, metric = 'lift', min_threshold = 1)

# Keep the rules whose LHS contains at least two drugs. Empty codes were
# already removed when building `dataset`, so no itemset contains ''.
rules = rules[rules['antecedents'].apply(len) >= 2]

# Rank the remaining rules by lift and keep the top 20
rules = rules.sort_values(by = ['lift'], ascending = False).head(20)
rules

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 254 | (MELO, XAND) | (OMPZ, PARI) | 0.001107 | 0.001107 | 0.000277 | 0.25 | 225.75 | 0.000276 | 1.331857 |
| 251 | (OMPZ, PARI) | (MELO, XAND) | 0.001107 | 0.001107 | 0.000277 | 0.25 | 225.75 | 0.000276 | 1.331857 |
| 248 | (OMPZ, PARI, XAND) | (MELO) | 0.000277 | 0.03433 | 0.000277 | 1.0 | 29.129032 | 0.000267 | inf |
| 247 | (OMPZ, MELO, XAND) | (PARI) | 0.000277 | 0.048173 | 0.000277 | 1.0 | 20.758621 | 0.000264 | inf |
| 240 | (MUCT, ULSN) | (PARI) | 0.000277 | 0.048173 | 0.000277 | 1.0 | 20.758621 | 0.000264 | inf |
| 217 | (IBUP, XAND) | (PRVF) | 0.000277 | 0.052326 | 0.000277 | 1.0 | 19.111111 | 0.000262 | inf |
| 212 | (ULSN, GAVI) | (PRVF) | 0.000277 | 0.052326 | 0.000277 | 1.0 | 19.111111 | 0.000262 | inf |
| 210 | (PRVF, ULSN) | (GAVI) | 0.000554 | 0.030177 | 0.000277 | 0.5 | 16.568807 | 0.00026 | 1.939646 |
| 252 | (OMPZ, XAND) | (PARI, MELO) | 0.063953 | 0.000277 | 0.000277 | 0.004329 | 15.636364 | 0.000259 | 1.00407 |
| 253 | (PARI, MELO) | (OMPZ, XAND) | 0.000277 | 0.063953 | 0.000277 | 1.0 | 15.636364 | 0.000259 | inf |
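##### The selection logic for this question (keep the rules whose LHS has at least two drugs, then rank by lift) can be illustrated without pandas. The sketch below is only an illustration: the tuples are a handful of rows copied from the outputs of Questions 2 and 3, and `top_rules` is a hypothetical helper of our own.

```python
# (antecedents, consequents, lift) rows copied from the printed rule tables.
rules = [
    ({'MELO', 'XAND'}, {'OMPZ', 'PARI'}, 225.75),
    ({'NAPX'}, {'OMPZ'}, 1.224938),              # single-drug LHS, from Question 2
    ({'MUCT', 'ULSN'}, {'PARI'}, 20.758621),
    ({'ASPT'}, {'OMPZ'}, 1.183537),              # single-drug LHS, from Question 2
    ({'OMPZ', 'PARI', 'XAND'}, {'MELO'}, 29.129032),
]

def top_rules(rules, min_lhs_size=2, k=20):
    """Keep the rules with at least `min_lhs_size` LHS items, then take the k highest lifts."""
    kept = [rule for rule in rules if len(rule[0]) >= min_lhs_size]
    return sorted(kept, key=lambda rule: rule[2], reverse=True)[:k]

for lhs, rhs, lift in top_rules(rules):
    print(sorted(lhs), '->', sorted(rhs), lift)
```

##### The two single-drug rules are dropped, and the three remaining rules are listed from the highest lift to the lowest.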
