# Association Rules

The goal of this notebook is to write an algorithm to calculate the association rules for a given set of frequent itemsets.

> "The quest to mine frequent patterns appears in many other domains. The prototypical application is market basket analysis, that is, to mine the sets of items that are frequently bought together at a supermarket by analyzing the customer shopping carts (the so-called “market baskets”). Once we mine the frequent sets, they allow us to extract association rules among the item sets, where we make some statement about how likely are two sets of items to co-occur or to conditionally occur." 

> ~ http://cs.ndsu.edu/~perrizo/saturday/Mohammad%20Zaki%20Data%20Mining.pdf


In [None]:
import os
import time
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth
from mlxtend.preprocessing import TransactionEncoder
from itertools import chain, permutations

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Itemsets and Tidsets

Let $I = \{x_1, x_2, \ldots, x_m\}$ be a set of elements called items. A set $X \subseteq I$ is called an **itemset**. The set of items $I$ may denote, for example, the collection of all products sold at a supermarket, the set of all web pages at a website, and so on. An itemset of cardinality (or size) $k$ is called a $k$-itemset. Further, we denote by $I^{(k)}$ the set of all $k$-itemsets, that is, subsets of $I$ with size $k$.

Let $\tau = \{t_1,t_2,...,t_n\}$ be another set of elements called transaction identifiers or tids. A set $T \subseteq \tau$ is called a **tidset**. We assume that itemsets and tidsets are kept sorted in lexicographic order.

A **transaction** is a tuple of the form $⟨t,X⟩$, where $t \in \tau$ is a unique transaction identifier, and $X$ is an itemset. The set of transactions $\tau$ may denote the set of all customers at a supermarket, the set of all the visitors to a website, and so on. For convenience, we refer to a transaction $⟨t,X⟩$ by its identifier $t$.

> ~ http://cs.ndsu.edu/~perrizo/saturday/Mohammad%20Zaki%20Data%20Mining.pdf
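
As a minimal sketch of these definitions, the snippet below builds a small hypothetical database (tids 1 to 6 and items A to E are made up for illustration and are not taken from the bread basket data) and derives the item universe $I$ and the tidset of every single item.

```python
# Hypothetical toy database: tid -> itemset (tuples kept in lexicographic order)
transactions = {
    1: ("A", "B", "D", "E"),
    2: ("B", "C", "E"),
    3: ("A", "B", "D", "E"),
    4: ("A", "B", "C", "E"),
    5: ("A", "B", "C", "D", "E"),
    6: ("B", "C", "D"),
}

# The item universe I is the union of the items of all transactions
items = sorted(set().union(*transactions.values()))

# Tidset of every single item x: the sorted tids whose itemset contains x
tidsets = {x: sorted(t for t, X in transactions.items() if x in X) for x in items}

print(items)
for x in items:
    print(x, tidsets[x])
```

Each value of `tidsets` is a subset of the tids, so it is a tidset in the sense defined above.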

In [None]:
# Create list of transaction data, i.e.
#
#                 
# Transactions = [transaction-1-list, transaction-2-list, ... , transaction-n-list]
# e.g.         = [[Eggs, Ham, Cheese], [Bread, Ham], ... , [Eggs, Bread, Milk, Butter]]
#

# Function to create the above: one list of items per transaction id,
# in order of first appearance
def transaction_data(data):
    return data.groupby('Transaction', sort=False)['Item'].apply(list).tolist()

# Database Representation
A **binary database** $D$ is a binary relation on the set of tids and items, that is, $D \subseteq \tau \times I$.

We say that tid $t \in \tau$ contains item $x \in I$ $\iff$ $(t,x) \in D$.

In other words, $(t,x) \in D$ $\iff$ $x \in X$ in the tuple $⟨t,X⟩$.

We say that tid $t$ contains itemset $X = \{x_1, x_2, \ldots, x_k\}$ $\iff$ $(t, x_i) \in D$ for all $i = 1, 2, \ldots, k$.

> ~ http://cs.ndsu.edu/~perrizo/saturday/Mohammad%20Zaki%20Data%20Mining.pdf
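
The relation view can be sketched directly in plain Python. The toy transactions and the helper name `contains` below are assumptions made for illustration; the snippet stores $D$ as a set of `(tid, item)` pairs and checks containment exactly as in the definition above.

```python
# Hypothetical toy database: tid -> itemset (not the bread basket data)
transactions = {
    1: {"A", "B", "D", "E"},
    2: {"B", "C", "E"},
    3: {"A", "B", "D", "E"},
}

# Binary database D as a set of (tid, item) pairs
D = {(t, x) for t, X in transactions.items() for x in X}

def contains(D, t, itemset):
    """True iff (t, x) is in D for every item x of the itemset."""
    return all((t, x) in D for x in itemset)

print(contains(D, 1, {"A", "D"}))  # tid 1 holds both A and D
print(contains(D, 2, {"A", "D"}))  # tid 2 holds neither A nor D
```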

In [None]:
# apriori & fpgrowth take a (pandas) binary transaction table, i.e.
#
#
#               Item
# Transaction   Eggs    Ham     Cheese
#          t1   True    True    False
#          t2   False   True    False
#
#

# We can create this using mlxtend's built-in TransactionEncoder class
def binary_transactions(data):
    te = TransactionEncoder()
    binary_data = te.fit(data).transform(data)
    return pd.DataFrame(binary_data, columns=te.columns_)

# Support
The support of an itemset $X$ in a dataset $D$, denoted $sup(X,D)$, is the number of transactions in $D$ that contain $X$:

$$sup(X,D) = |\{t \mid ⟨t,i(t)⟩ \in D \mbox{ and } X \subseteq i(t)\}| = |t(X)|$$

where:
* $i(t)$ is the set of items contained in tid $t \in \tau$,
* $t(X)$ is the set of tids that contain every item of the itemset $X \subseteq I$.

The relative support $sup(X,D) / |D|$ is an estimate of the joint probability of the items comprising $X$.
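
A small worked example may help here. The toy database and the helper names `tidset`, `sup` and `rsup` are hypothetical; the sketch computes the tidset of an itemset, its absolute support, and the relative support obtained by dividing by the number of transactions.

```python
# Hypothetical toy database: tid -> itemset (not the bread basket data)
transactions = {
    1: {"A", "B", "D", "E"},
    2: {"B", "C", "E"},
    3: {"A", "B", "D", "E"},
    4: {"A", "B", "C", "E"},
    5: {"A", "B", "C", "D", "E"},
    6: {"B", "C", "D"},
}

def tidset(itemset, transactions):
    """The tids whose transaction contains every item of the itemset."""
    X = set(itemset)
    return {t for t, items in transactions.items() if X <= items}

def sup(itemset, transactions):
    """Absolute support: the size of the tidset."""
    return len(tidset(itemset, transactions))

def rsup(itemset, transactions):
    """Relative support: the fraction of transactions containing the itemset."""
    return sup(itemset, transactions) / len(transactions)

print(sorted(tidset({"A", "D"}, transactions)))
print(sup({"A", "D"}, transactions))
print(rsup({"B", "E"}, transactions))
```

Note that `apriori` and `fpgrowth` from mlxtend take `min_support` as a fraction between 0 and 1, that is, as a relative support.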

In [None]:
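# Function to calculate support: the number of transactions that contain
# every item of `item` (`itemset` is the list of transactions)
def compute_support(item, itemset):
    item = set(item)
    return sum(1 for transaction in itemset if item.issubset(transaction))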

# Frequent itemsets

An itemset $X$ is said to be **frequent** in $D$ if $sup(X,D) \geq minsup$, where $minsup$ is a user-defined **minimum support threshold**.

We use $F$ to denote the set of all **frequent itemsets**, and $F^{(k)}$ to denote the set of frequent $k$-itemsets.

> ~ http://cs.ndsu.edu/~perrizo/saturday/Mohammad%20Zaki%20Data%20Mining.pdf
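
To make $F$ and $F^{(k)}$ concrete, here is a brute-force sketch that enumerates every candidate itemset of a hypothetical toy database and keeps those meeting an assumed absolute threshold `minsup = 3`. It is meant only as an illustration of the definition; it is not how Apriori or FPGrowth search the space, and it would not scale beyond a handful of items.

```python
from itertools import combinations

# Hypothetical toy database (not the bread basket data)
transactions = [
    {"A", "B", "D", "E"},
    {"B", "C", "E"},
    {"A", "B", "D", "E"},
    {"A", "B", "C", "E"},
    {"A", "B", "C", "D", "E"},
    {"B", "C", "D"},
]

def sup(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    X = set(itemset)
    return sum(1 for items in transactions if X <= items)

minsup = 3  # assumed absolute minimum support threshold
items = sorted(set().union(*transactions))

# F maps each frequent itemset (as a sorted tuple) to its support
F = {}
for k in range(1, len(items) + 1):
    for X in combinations(items, k):
        s = sup(X, transactions)
        if s >= minsup:
            F[X] = s

# F_k groups the frequent itemsets by size k
F_k = {}
for X in F:
    F_k.setdefault(len(X), []).append(X)

for k, itemsets in F_k.items():
    print(k, ["".join(X) for X in itemsets])
```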


# Association Rule
    
An **association rule** is an expression $X \longrightarrow^{s,c} Y$, where $X$ and $Y$ are disjoint itemsets, that is, $X, Y \subseteq I$ and $X \cap Y = \emptyset$. Let the itemset $X \cup Y$ be denoted as $XY$.

The **support** of the rule is the number of transactions in which both $X$ and $Y$ co-occur as subsets:

$$s = sup(X \longrightarrow Y) = |t(XY)| = sup(XY)$$

# Confidence

The **confidence of a rule** is the conditional probability that a transaction contains $Y$ given that it contains $X$:

$$c = conf(X \longrightarrow Y) = P(Y|X) = \frac{P(X \wedge Y)}{P(X)} = \frac{sup(XY)}{sup(X)}$$

A rule is frequent if the itemset $XY$ is frequent, that is, $sup(XY) \geq minsup$, and a rule is strong if $conf \geq minconf$, where $minconf$ is a user-specified **minimum confidence threshold**.

> ~ http://cs.ndsu.edu/~perrizo/saturday/Mohammad%20Zaki%20Data%20Mining.pdf
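
The following sketch works through one rule on a hypothetical toy database. The helper names `sup` and `conf` and the thresholds `minsup = 3` and `minconf = 0.8` are assumptions chosen for illustration. It evaluates the rule in both directions to show that confidence depends on which side is the antecedent.

```python
# Hypothetical toy database (not the bread basket data)
transactions = [
    {"A", "B", "D", "E"},
    {"B", "C", "E"},
    {"A", "B", "D", "E"},
    {"A", "B", "C", "E"},
    {"A", "B", "C", "D", "E"},
    {"B", "C", "D"},
]

def sup(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    X = set(itemset)
    return sum(1 for items in transactions if X <= items)

def conf(antecedent, consequent, transactions):
    """Confidence of antecedent -> consequent; assumes the antecedent occurs at least once."""
    X, Y = set(antecedent), set(consequent)
    return sup(X | Y, transactions) / sup(X, transactions)

minsup, minconf = 3, 0.8  # assumed thresholds
X, Y = {"B", "E"}, {"A", "D"}

rule_support = sup(X | Y, transactions)   # sup(ABDE)
forward = conf(X, Y, transactions)        # sup(ABDE) / sup(BE)
backward = conf(Y, X, transactions)       # sup(ABDE) / sup(AD)

is_frequent = rule_support >= minsup
is_strong = forward >= minconf

print(rule_support, forward, backward, is_frequent, is_strong)
```

Both directions share the same rule support, but only the denominator $sup(X)$ of the antecedent decides the confidence, so swapping antecedent and consequent generally changes the result.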

# Association Rules Algorithm:

In [None]:
# Wrap itertools.combinations to yield lists of strings rather than tuples of strings
from itertools import combinations as _tuple_combinations

def combinations(iterable, r):
    for combo in _tuple_combinations(iterable, r):
        yield list(combo)


# Return the candidate antecedents of the frequent itemset: all non-empty
# proper subsets, in order of decreasing size
def antecedents(freq_itemset):
    s = list(freq_itemset)
    return sorted(chain.from_iterable(combinations(s, r) for r in range(1, len(s))),
                  key=len, reverse=True)


# Produce power set of itemset
def power_set(itemset, min_length=1, rev = False):
    s = list(itemset)
    return sorted(chain.from_iterable(combinations(s, r) for r in range(min_length, len(s)+1)),
                  key=lambda x: len(x), reverse=rev)


# Function to update antecedents by removing all subsets of an item
# (the item itself included), preserving the order of the remaining antecedents
def update_antecedents(ants, item):
    item = set(item)
    return [ant for ant in ants if not set(ant).issubset(item)]


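# Function to return all rules X -> Y (X an antecedent, Y the remaining items)
# whose confidence sup(XY) / sup(X) is at least minconf
def association_rules(freq_itemset, full_itemset, minconf):
    result = []
    antecs = antecedents(freq_itemset)
    support = compute_support(freq_itemset, full_itemset)
    while antecs:
        # Take a largest remaining antecedent
        a = max(antecs, key=len)
        antecs.remove(a)
        consequent = [x for x in freq_itemset if x not in a]
        # The denominator is the support of the antecedent, not of the consequent
        ant_support = compute_support(a, full_itemset)
        conf = support / ant_support if ant_support else 0.0
        if conf >= minconf:
            result.append([a, consequent, conf])
        else:
            # Subsets of a have support >= sup(a), so their rules cannot reach minconf
            antecs = update_antecedents(antecs, a)
    return result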

In [None]:
# Read our data
data = pd.read_csv('/kaggle/input/the-bread-basket/bread basket.csv')
trans_data = transaction_data(data)
bin_tran = binary_transactions(trans_data)

# All items
all_items = data['Item'].unique()

# Minimum confidence (the cells below also reuse this value as the minimum support threshold)
min_conf = 0.01

## Apriori Algorithm

In [None]:
# Apriori
print(f"Apriori with min_support = {min_conf}:")
tic1 = time.perf_counter()
a = apriori(bin_tran,min_support=min_conf,use_colnames=True)
toc1 = time.perf_counter()
print(a.sort_values(by='support',ascending=False))
print(f"Apriori took {toc1 - tic1:0.4f} seconds")

## FPGrowth Algorithm

In [None]:
# FPGrowth
print(f"FPGrowth with min_support = {min_conf}:")
tic2 = time.perf_counter()
fp = fpgrowth(bin_tran, min_support=min_conf, use_colnames=True)
toc2 = time.perf_counter()
print(fp.sort_values(by='support',ascending=False))
print(f"FPGrowth took {toc2 - tic2:0.4f} seconds")

## Association Rules Algorithm

Let's generate association rules from the frequent itemset \['Bread','Coffee','Tea','Cake'\]:

In [None]:
# Let's compute association rules using our algorithm.
print("Association rules in the form: [Antecedent, Target, Confidence] with min-confidence {n}:".format(n=min_conf))
tic3 = time.perf_counter()
assoc = association_rules(['Bread','Coffee','Tea','Cake'], trans_data, min_conf)
toc3 = time.perf_counter()
for a in assoc:
    print(a)
print(f"Our association rules algorithm took {toc3 - tic3:0.4f} seconds")

# Top Results

|Antecedent|Target|Confidence|
|---|---|---|
|['Coffee', 'Tea']|['Cake', 'Bread']|0.0588|
|['Bread', 'Coffee']|['Tea', 'Cake']|0.0578|
|['Coffee', 'Cake']|['Tea', 'Bread']|0.0489|

### Interpretation:

$$P(\mbox{Target} \mid \mbox{Antecedent}) = \mbox{Confidence}$$