# &sect; 4.2.1: Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation

**Apriori** is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for <mark style="background-color: rgba(255, 0, 255, .25); color: white;">mining frequent itemsets for Boolean association rules</mark>. The name of the algorithm is based on the fact that the algorithm <mark style="background-color: rgba(255, 0, 255, .25); color: white;">uses prior knowledge of frequent itemset properties</mark>. Apriori employs an iterative approach known as <mark style="background-color: rgba(255, 0, 255, .25); color: white;">*level-wise* search, where $k$-itemsets are used to explore $(k + 1)$-itemsets</mark>.

* First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy the *minimum support*. The resulting set is denoted by $L_1$.
* Next, $L_1$ is used to find $L_2$, the set of frequent 2-itemsets, which is used to find $L_3$, and so on, until no frequent $k$-itemsets can be found.
* The finding of each $L_k$ requires one full scan of the database.
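
The three steps above can be sketched as a loop. This is a minimal illustration, not the original authors' implementation: the function name `level_wise_search` and the toy `transactions` list are assumptions made for the example, `min_count` is an absolute support count, and candidates are formed by simply unioning pairs of frequent $k$-itemsets.

```python
from itertools import combinations


def level_wise_search(transactions, min_count):
    """Return {k: {itemset: count}} for every level k that has frequent itemsets."""
    # Level 1: one scan of the database to count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_count}

    levels = {}
    k = 1
    while frequent:
        levels[k] = frequent
        # Candidates for level k + 1: unions of two frequent k-itemsets
        # that differ in exactly one item.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k + 1}
        # One full scan of the database per level.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        k += 1
    return levels


# Hypothetical toy database, for illustration only.
transactions = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"])]
levels = level_wise_search(transactions, min_count=2)
for k, itemsets in levels.items():
    print(k, sorted(sorted(s) for s in itemsets))
```

With this toy input the loop stops after level 2, because the only candidate 3-itemset occurs in a single transaction.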

To improve the efficiency of the level-wise generation of (candidate) frequent itemsets, an important property is used to reduce the search space:
> **Apriori property**: All nonempty subsets of a frequent itemset must also be frequent.

***Further intuition***: By definition, if an itemset $I$ does not satisfy the minimum support threshold, *min_sup*, then $I$ is not frequent, that is, $P(I)$ < *min_sup*. If an item $A$ is added to the itemset $I$, then the resulting itemset (i.e., $I \cup A$) cannot occur more frequently than $I$, because every transaction that contains $I \cup A$ also contains $I$. Therefore, $I \cup A$ is not frequent either, that is, $P(I \cup A)$ < *min_sup*.

This property belongs to a special category of properties called **antimonotonicity**, in the sense that *if a set cannot pass a test, all of its supersets will fail the same test*.
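
A small exhaustive check makes the antimonotone behaviour concrete. The sketch below is illustrative only: `support_count` is a hypothetical helper and the four-transaction database is made up for the example. It verifies that adding an item to an itemset never increases its support count.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions)


# Hypothetical toy database, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
items = sorted(set().union(*transactions))

# Try every nonempty proper itemset and every single-item extension of it.
violations = 0
for size in range(1, len(items)):
    for subset in combinations(items, size):
        base = set(subset)
        for extra in set(items) - base:
            # A transaction containing the superset also contains the base,
            # so the superset's count can never exceed the base count.
            if support_count(base | {extra}, transactions) > support_count(base, transactions):
                violations += 1

print(violations)
```

Because no extension ever gains support, an itemset that falls below *min_sup* can be discarded together with all of its supersets.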

---

## Example:

Let's look at a concrete example, based on the transaction database, $D$, of the table provided below. There are 9 transactions in this database, that is, $|D| = 9$.

| Transaction ID    | Set of Item IDs           |
|:------------------|:--------------------------|
| $T_1$             | $\{I_1, I_2, I_5\}$       |
| $T_2$             | $\{I_2, I_4\}$            |
| $T_3$             | $\{I_2, I_3\}$            |
| $T_4$             | $\{I_1, I_2, I_4\}$       |
| $T_5$             | $\{I_1, I_3\}$            |
| $T_6$             | $\{I_2, I_3\}$            |
| $T_7$             | $\{I_1, I_3\}$            |
| $T_8$             | $\{I_1, I_2, I_3, I_5\}$  |
| $T_9$             | $\{I_1, I_2, I_3\}$       |

In [1]:
"""Define transaction database."""

from json   import dumps

# Define transaction database
D:  dict[str, set]  = {
    "T1":   {"I1", "I2", "I5"},
    "T2":   {"I2", "I4"},
    "T3":   {"I2", "I3"},
    "T4":   {"I1", "I2", "I4"},
    "T5":   {"I1", "I3"},
    "T6":   {"I2", "I3"},
    "T7":   {"I1", "I3"},
    "T8":   {"I1", "I2", "I3", "I5"},
    "T9":   {"I1", "I2", "I3"}
}

# Print transaction database
print(f"D: {dumps(D, indent = 2, default = str)}")

D: {
  "T1": "{'I1', 'I5', 'I2'}",
  "T2": "{'I2', 'I4'}",
  "T3": "{'I2', 'I3'}",
  "T4": "{'I1', 'I4', 'I2'}",
  "T5": "{'I1', 'I3'}",
  "T6": "{'I2', 'I3'}",
  "T7": "{'I1', 'I3'}",
  "T8": "{'I1', 'I3', 'I5', 'I2'}",
  "T9": "{'I1', 'I3', 'I2'}"
}


### Step 1:

In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, $C_1$. The algorithm simply scans all of the transactions to count the number of occurrences of each item.

In [2]:
"""Generate 1-itemset candidates."""

from collections    import Counter
from itertools      import chain
from json           import dumps

# Generate candidate 1-itemsets, C-1
C_1:    dict[str, int]  = Counter(sorted(list(chain(*D.values()))))

# Print C-1
print(f"C-1: {dumps(C_1, indent = 2, default = str)}")

C-1: {
  "I1": 6,
  "I2": 7,
  "I3": 6,
  "I4": 2,
  "I5": 2
}


### Step 2:

Suppose that the minimum support count required is 2, that is, *min_sup* = 2. (Here, we are referring to *absolute* support because we are using a support count. The corresponding relative support is $2 / 9 \approx 22\%$.) The set of frequent 1-itemsets, $L_1$, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in $C_1$ satisfy minimum support.

More concretely:

| 1-itemset | Support                                                            |
|:----------|:-------------------------------------------------------------------|
| $I_1$     | $support(I_1) = \frac{frequency(I_1)}{\#~of~transactions} = \frac{6}{9} \approx 67\%$  |
| $I_2$     | $support(I_2) = \frac{frequency(I_2)}{\#~of~transactions} = \frac{7}{9} \approx 78\%$  |
| $I_3$     | $support(I_3) = \frac{frequency(I_3)}{\#~of~transactions} = \frac{6}{9} \approx 67\%$  |
| $I_4$     | $support(I_4) = \frac{frequency(I_4)}{\#~of~transactions} = \frac{2}{9} \approx 22\%$  |
| $I_5$     | $support(I_5) = \frac{frequency(I_5)}{\#~of~transactions} = \frac{2}{9} \approx 22\%$  |
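
The relative supports follow from dividing each support count by $|D| = 9$. The snippet below is a small sketch of that arithmetic; the names `counts`, `n_transactions`, and `relative_support` are chosen for illustration, and the counts are copied from $C_1$.

```python
# Support counts copied from the candidate 1-itemsets C_1.
counts = {"I1": 6, "I2": 7, "I3": 6, "I4": 2, "I5": 2}
n_transactions = 9  # |D|

# Relative support = support count / number of transactions.
relative_support = {item: count / n_transactions for item, count in counts.items()}

for item, value in relative_support.items():
    print(f"{item}: {value:.1%}")
```

Printed to one decimal place, the values are 66.7%, 77.8%, 66.7%, 22.2%, and 22.2%.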

In [3]:
"""Determine frequent 1-itemsets."""

from json   import dumps

# Define minimum support count
min_sup:    int             = 2

# Eliminate items that do not satisfy minimum support count
L_1:        dict[str, int]  = {item: count for item, count in C_1.items() if count >= min_sup}

# Print L-1
print(f"L-1: {dumps(L_1, indent = 2, default = str)}")

L-1: {
  "I1": 6,
  "I2": 7,
  "I3": 6,
  "I4": 2,
  "I5": 2
}


### Step 3:

To discover the set of frequent 2-itemsets, $L_2$, the algorithm uses the join $L_1 \Join L_1$ to generate a candidate set of 2-itemsets, $C_2$. Note that no candidates are removed from $C_2$ during the prune step, because every subset of each candidate is also frequent: the only nonempty proper subsets of a 2-itemset are its two individual items, and both belong to $L_1$.

2-itemsets that will be generated:
* $\{I_1, I_2\}$
* $\{I_1, I_3\}$
* $\{I_1, I_4\}$
* $\{I_1, I_5\}$
* $\{I_2, I_3\}$
* $\{I_2, I_4\}$
* $\{I_2, I_5\}$
* $\{I_3, I_4\}$
* $\{I_3, I_5\}$
* $\{I_4, I_5\}$
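
The join and prune steps can be sketched generically. The version below is an illustration under stated assumptions, not the notebook's own code: `apriori_gen` and `has_infrequent_subset` are hypothetical helper names, itemsets are represented as sorted tuples, and two $(k-1)$-itemsets are joined only when they agree on their first $k-2$ items.

```python
from itertools import combinations


def has_infrequent_subset(candidate, prev_frequent):
    """True if some (k-1)-subset of `candidate` is not in `prev_frequent`."""
    k = len(candidate)
    return any(sub not in prev_frequent for sub in combinations(candidate, k - 1))


def apriori_gen(prev_frequent):
    """Join L_{k-1} with itself, then prune by the Apriori property."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for a, b in combinations(prev, 2):
        # Join step: same first k-2 items, different last item.
        if a[:-1] == b[:-1]:
            candidate = a + (b[-1],)
            # Prune step: drop candidates that have an infrequent (k-1)-subset.
            if not has_infrequent_subset(candidate, prev_set):
                candidates.append(candidate)
    return candidates


# Frequent 1-itemsets from the running example, written as 1-tuples.
frequent_1 = [("I1",), ("I2",), ("I3",), ("I4",), ("I5",)]
candidates_2 = apriori_gen(frequent_1)
print(len(candidates_2))

# Hypothetical input showing the prune step at work: ("I2", "I3") is absent,
# so the joined candidate ("I1", "I2", "I3") is discarded.
print(apriori_gen([("I1", "I2"), ("I1", "I3")]))
```

For the running example the join yields all 10 pairs and the prune step removes none of them, which matches the list above.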

In [9]:
"""Generate candidate 2-itemsets."""

from itertools      import combinations

# Join L_1 with itself: every pair of frequent items is a candidate 2-itemset
C_2_itemset:    list[tuple[str, ...]]   = sorted(combinations(sorted(L_1), 2))

print("C_2 itemset:")
for itemset in C_2_itemset: print(itemset)

C_2 itemset:
('I1', 'I2')
('I1', 'I3')
('I1', 'I4')
('I1', 'I5')
('I2', 'I3')
('I2', 'I4')
('I2', 'I5')
('I3', 'I4')
('I3', 'I5')
('I4', 'I5')


### Step 4:

Next, the transactions in $D$ are scanned and the support count of each candidate itemset in $C_2$ is accumulated.

More concretely:

| 2-itemset         | Support                                                                                           |
|:------------------|:--------------------------------------------------------------------------------------------------|
| $\{I_1, I_2\}$    | $support(\{I_1, I_2\}) = \frac{frequency(\{I_1, I_2\})}{\#~of~transactions} = \frac{4}{9} \approx 44\%$ |
| $\{I_1, I_3\}$    | $support(\{I_1, I_3\}) = \frac{frequency(\{I_1, I_3\})}{\#~of~transactions} = \frac{4}{9} \approx 44\%$ |
| $\{I_1, I_4\}$    | $support(\{I_1, I_4\}) = \frac{frequency(\{I_1, I_4\})}{\#~of~transactions} = \frac{1}{9} \approx 11\%$ |
| $\{I_1, I_5\}$    | $support(\{I_1, I_5\}) = \frac{frequency(\{I_1, I_5\})}{\#~of~transactions} = \frac{2}{9} \approx 22\%$ |
| $\{I_2, I_3\}$    | $support(\{I_2, I_3\}) = \frac{frequency(\{I_2, I_3\})}{\#~of~transactions} = \frac{4}{9} \approx 44\%$ |
| $\{I_2, I_4\}$    | $support(\{I_2, I_4\}) = \frac{frequency(\{I_2, I_4\})}{\#~of~transactions} = \frac{2}{9} \approx 22\%$ |
| $\{I_2, I_5\}$    | $support(\{I_2, I_5\}) = \frac{frequency(\{I_2, I_5\})}{\#~of~transactions} = \frac{2}{9} \approx 22\%$ |
| $\{I_3, I_4\}$    | $support(\{I_3, I_4\}) = \frac{frequency(\{I_3, I_4\})}{\#~of~transactions} = \frac{0}{9} = 0\%$        |
| $\{I_3, I_5\}$    | $support(\{I_3, I_5\}) = \frac{frequency(\{I_3, I_5\})}{\#~of~transactions} = \frac{1}{9} \approx 11\%$ |
| $\{I_4, I_5\}$    | $support(\{I_4, I_5\}) = \frac{frequency(\{I_4, I_5\})}{\#~of~transactions} = \frac{0}{9} = 0\%$        |
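
The scan described above can be carried out in a single pass over $D$ by enumerating the 2-item subsets of each transaction and incrementing the matching candidates. The sketch below assumes this strategy, which is only one of several ways to count; `transactions` repeats the database $D$ so that the block is self-contained, and `candidate_counts` is an illustrative name.

```python
from itertools import combinations

# The transaction database D from the example, repeated for self-containment.
transactions = {
    "T1": {"I1", "I2", "I5"},
    "T2": {"I2", "I4"},
    "T3": {"I2", "I3"},
    "T4": {"I1", "I2", "I4"},
    "T5": {"I1", "I3"},
    "T6": {"I2", "I3"},
    "T7": {"I1", "I3"},
    "T8": {"I1", "I2", "I3", "I5"},
    "T9": {"I1", "I2", "I3"},
}

items = sorted(set().union(*transactions.values()))
# Start every candidate 2-itemset at zero so that absent pairs are reported too.
candidate_counts = {pair: 0 for pair in combinations(items, 2)}

# Single pass over D: each transaction increments the candidates it contains.
for itemset in transactions.values():
    for pair in combinations(sorted(itemset), 2):
        candidate_counts[pair] += 1

for pair, count in candidate_counts.items():
    print(pair, count)
```

Sorting each transaction before taking combinations keeps every pair in the same order as the candidate keys, so the dictionary lookup always succeeds.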

In [8]:
"""Calculate support count of C_2 candidates."""

from json           import dumps

# Scan the database and count the transactions that contain each candidate 2-itemset
C_2:    dict[tuple, int]    = {
    itemset: sum(set(itemset) <= transaction for transaction in D.values())
    for itemset in C_2_itemset
}

# Print C-2
print(f"C-2: {dumps([{str(k): v} for k, v in C_2.items()], indent = 2, default = str)}")

C-2: [
  {
    "('I1', 'I2')": 4
  },
  {
    "('I1', 'I3')": 4
  },
  {
    "('I1', 'I4')": 1
  },
  {
    "('I1', 'I5')": 2
  },
  {
    "('I2', 'I3')": 4
  },
  {
    "('I2', 'I4')": 2
  },
  {
    "('I2', 'I5')": 2
  },
  {
    "('I3', 'I4')": 0
  },
  {
    "('I3', 'I5')": 1
  },
  {
    "('I4', 'I5')": 0
  }
]


### Step 5:

The set of frequent 2-itemsets, $L_2$, is then determined, consisting of those candidate 2-itemsets in $C_2$ having minimum support (a support count of at least 2, roughly 22\%):

| 2-itemset         | Support                                                                                           |
|:------------------|:--------------------------------------------------------------------------------------------------|
| $\{I_1, I_2\}$    | $support(\{I_1, I_2\}) = \frac{frequency(\{I_1, I_2\})}{\#~of~transactions} = \frac{4}{9} \approx 44\%$ |
| $\{I_1, I_3\}$    | $support(\{I_1, I_3\}) = \frac{frequency(\{I_1, I_3\})}{\#~of~transactions} = \frac{4}{9} \approx 44\%$ |
| $\{I_1, I_5\}$    | $support(\{I_1, I_5\}) = \frac{frequency(\{I_1, I_5\})}{\#~of~transactions} = \frac{2}{9} \approx 22\%$ |
| $\{I_2, I_3\}$    | $support(\{I_2, I_3\}) = \frac{frequency(\{I_2, I_3\})}{\#~of~transactions} = \frac{4}{9} \approx 44\%$ |
| $\{I_2, I_4\}$    | $support(\{I_2, I_4\}) = \frac{frequency(\{I_2, I_4\})}{\#~of~transactions} = \frac{2}{9} \approx 22\%$ |
| $\{I_2, I_5\}$    | $support(\{I_2, I_5\}) = \frac{frequency(\{I_2, I_5\})}{\#~of~transactions} = \frac{2}{9} \approx 22\%$ |

In [6]:
"""Eliminate C_2 candidates that do not satisfy minimum support."""

from json   import dumps

# Eliminate candidates with support count < min_sup
L_2:    dict[tuple, int]    = {k: v for k, v in C_2.items() if v >= min_sup}

# Print L_2
print(f"L_2: {dumps([{str(k): v} for k, v in L_2.items()], indent = 2, default = str)}")

L_2: [
  {
    "('I1', 'I2')": 4
  },
  {
    "('I1', 'I3')": 4
  },
  {
    "('I1', 'I5')": 2
  },
  {
    "('I2', 'I3')": 4
  },
  {
    "('I2', 'I4')": 2
  },
  {
    "('I2', 'I5')": 2
  }
]


### Step 6:

From the join step, we first get $C_3 = L_2 \Join L_2$