# 10. Association rule mining

Association rule mining is a method for discovering interesting relationships between variables in large datasets. The method is designed for categorical data and is used to identify frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases or other data repositories.

The most common example of association rule mining is market basket analysis, where the goal is to identify items that are frequently purchased together. For example, a customer who buys bread may be more likely to buy butter as well.

## Association rules

Association rule mining operates on categorical data. As an example, consider the following dataset, where each row represents a transaction. In each transaction, the customer buys products from different categories.

| TID | Items                        |
| --- | ---------------------------- |
| 1   | {soap, milk, candy, fish}    |
| 2   | {milk, candy, fish}          |
| 3   | {fruit, milk, candy}         |
| 4   | {fruit, soap, milk, fish}    |
| 5   | {soap, milk, candy}          |
| 6   | {soap, milk, fish}           |
| 7   | {soap, fish}                 |


In this simple example, there are seven transactions, and the items come from five categories: {fruit, soap, milk, candy, fish}.

> This is just a tiny example to illustrate the concept. In real-world applications, there can be thousands or even millions of transactions and a large number of items. Association rule mining is designed to scale to very large datasets.

An association rule is a directed implication of the form X -> Y, where X and Y are itemsets. The rule X -> Y holds in the transaction database if the occurrence of X in a transaction implies the occurrence of Y in the same transaction. 

Some examples of association rules derived from the above dataset are:
- {milk} -> {candy}
- {milk, candy} -> {fish}

The first rule states that a customer who buys milk also tends to buy candy. The second rule states that a customer who buys both milk and candy also tends to buy fish.

Pay attention to the fact that the rules are directed. The first rule does not imply that a customer who buys candy will also buy milk. In general, there may be a large number of candy-buyers who do not buy milk, even if almost all milk-buyers also buy candy.

Moreover, association rules do not generally imply causation. The rule that says X -> Y does not mean that X causes Y. It only means that X and Y are associated, i.e. the occurrence of X implies the occurrence of Y.

## Support and confidence

So far, we have constructed a couple of association rules, but we have no idea whether the rules are interesting or not. To evaluate the quality of an association rule, we use two metrics: support and confidence.

### Support

Support is always defined for an itemset. Whenever we talk about the support of a rule X -> Y, we are referring to the support of the itemset X ∪ Y. Support is the proportion of transactions in the database that contain the itemset.

For example, the support of the rule {milk, candy} -> {fish} is the proportion of transactions that contain all three items: milk, candy, and fish. In this case, that is 2/7, or 29%.
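The support calculation can be sketched in a few lines of Python. This is only an illustration using the standard library; the names `transactions` and `support` are chosen here and are not part of any particular library.

```python
from fractions import Fraction

# The seven transactions from the example table, one set per row.
transactions = [
    {"soap", "milk", "candy", "fish"},
    {"milk", "candy", "fish"},
    {"fruit", "milk", "candy"},
    {"fruit", "soap", "milk", "fish"},
    {"soap", "milk", "candy"},
    {"soap", "milk", "fish"},
    {"soap", "fish"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

# The support of the rule {milk, candy} -> {fish} is the support of X ∪ Y.
rule_support = support({"milk", "candy"} | {"fish"}, transactions)
print(rule_support)  # 2/7
```

Using `Fraction` keeps the result in the same form as the hand calculation; `float(rule_support)` gives the decimal value.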

Intuitively speaking, a high support for a rule, or set of items, means that the rule is interesting because it can be applied to a large number of transactions. Thus, the rule is not marginal.

### Confidence

Confidence is a measure of the reliability of the rule. It is the proportion of transactions that contain the item set X that also contain the item set Y. It can also be interpreted as the conditional probability of Y given X.

For example, the confidence of the rule {milk, candy} -> {fish} is the proportion of transactions that contain milk and candy that also contain fish. In this case, that is 2/4, or 50%.
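As a rough sketch, confidence can be computed by counting the transactions that contain X and, among those, the ones that also contain Y. The helper name `confidence` is an assumption made for this illustration.

```python
from fractions import Fraction

# The seven transactions from the example table, one set per row.
transactions = [
    {"soap", "milk", "candy", "fish"},
    {"milk", "candy", "fish"},
    {"fruit", "milk", "candy"},
    {"fruit", "soap", "milk", "fish"},
    {"soap", "milk", "candy"},
    {"soap", "milk", "fish"},
    {"soap", "fish"},
]

def confidence(antecedent, consequent, transactions):
    """Share of the transactions containing X that also contain Y.

    Raises ZeroDivisionError if no transaction contains the antecedent.
    """
    x = set(antecedent)
    xy = x | set(consequent)
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if xy <= t)
    return Fraction(n_xy, n_x)

print(confidence({"milk", "candy"}, {"fish"}, transactions))  # 1/2

# Rules are directed: swapping the two sides changes the confidence.
print(confidence({"milk"}, {"candy"}, transactions))  # 2/3
print(confidence({"candy"}, {"milk"}, transactions))  # 1
```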

Intuitively speaking, a high confidence for a rule means that the rule is interesting because it is reliable, or trustworthy.

## Apriori algorithm

The Apriori algorithm is a popular algorithm for mining frequent itemsets for boolean association rules. The algorithm is designed to operate on databases containing transactions, such as the ones we have seen above.

> Original citation: Agrawal, R. and Srikant, R. (1994) Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, Santiago de Chile, 12-15 September 1994, 487-499.

The algorithm finds all frequent itemsets level by level: a candidate generation function builds candidate itemsets of size k+1 from the frequent itemsets of size k. The algorithm terminates when no more frequent itemsets can be found.

Let's see how the Apriori algorithm works in practice, using the example dataset above.

The algorithm has two hyperparameters:
- the minimum support threshold for an itemset to be considered frequent. This parameter controls the search space of the algorithm. A lower value will mean a higher execution time, and potentially a lot of marginal rules.
- the minimum confidence threshold for an association rule to be considered interesting. This parameter effectively filters out the rules that are not reliable. A lower value will mean a lot of rules, but many of them will be unreliable.

In our example, let's set the minimum support threshold to 0.4 and the minimum confidence threshold to 0.75.

The Apriori algorithm proceeds in two phases. In the first phase, it finds all frequent itemsets in the database. In the second phase, it generates the association rules from the frequent itemsets.

### Phase 1: Find all frequent itemsets

In the first phase, the algorithm finds all itemsets whose support is at least the minimum support threshold. The algorithm starts with the itemsets of size 1 and iteratively increases the size of the itemsets until no more frequent itemsets can be found.

Let's see how the algorithm works with the example dataset.

The first thing to do is to find all itemsets of size 1, and calculate their support:

| Itemset | Support |
| ------- |---------|
| {fruit} | 2/7     |
| {soap}  | 5/7     |
| {milk}  | 6/7     |
| {candy} | 4/7     |
| {fish}  | 5/7     |

This is called the L1 candidate set.

At this point, the algorithm filters out the itemsets that do not meet the minimum support threshold. In this case, the itemset {fruit} is filtered out, as its support of 2/7 does not satisfy the minimum support threshold of 0.4. As a result, the L1 frequent set is:

| Itemset | Support |
| ------- |---------|
| {soap}  | 5/7     |
| {milk}  | 6/7     |
| {candy} | 4/7     |
| {fish}  | 5/7     |
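The size-1 step amounts to counting how often each item occurs and keeping the items that meet the threshold. A minimal sketch, with variable names (`candidates_1`, `frequent_1`) chosen for this illustration only:

```python
from collections import Counter
from fractions import Fraction

# The seven transactions from the example table, one set per row.
transactions = [
    {"soap", "milk", "candy", "fish"},
    {"milk", "candy", "fish"},
    {"fruit", "milk", "candy"},
    {"fruit", "soap", "milk", "fish"},
    {"soap", "milk", "candy"},
    {"soap", "milk", "fish"},
    {"soap", "fish"},
]
min_support = Fraction(2, 5)  # the 0.4 threshold used in the text

# Count in how many transactions each single item occurs.
counts = Counter(item for t in transactions for item in t)
n = len(transactions)

# Candidate set: every item with its support.
candidates_1 = {item: Fraction(c, n) for item, c in counts.items()}

# Frequent set: only the items that meet the minimum support.
frequent_1 = {item: s for item, s in candidates_1.items() if s >= min_support}

for item in sorted(frequent_1):
    print(item, frequent_1[item])
```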

Next, the algorithm generates the L2 candidate set by joining the items in the L1 frequent set with other items in the same set. The L2 candidate set is:

| Itemset         | Support |
| --------------- |---------|
| {soap, milk}    | 4/7     |
| {soap, candy}   | 2/7     |
| {soap, fish}    | 4/7     |
| {milk, candy}   | 4/7     |
| {milk, fish}    | 4/7     |
| {candy, fish}   | 2/7     |

Again, the algorithm filters out the itemsets that do not meet the minimum support threshold. In this case, the itemsets {soap, candy} and {candy, fish} are filtered out. The L2 frequent set is:

| Itemset         | Support |
| --------------- |---------|
| {soap, milk}    | 4/7     |
| {soap, fish}    | 4/7     |
| {milk, candy}   | 4/7     |
| {milk, fish}    | 4/7     |
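The size-2 step can be sketched by pairing up the frequent single items and counting each pair. The list `frequent_items` is copied from the L1 frequent set in the text; the other names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# The seven transactions from the example table, one set per row.
transactions = [
    {"soap", "milk", "candy", "fish"},
    {"milk", "candy", "fish"},
    {"fruit", "milk", "candy"},
    {"fruit", "soap", "milk", "fish"},
    {"soap", "milk", "candy"},
    {"soap", "milk", "fish"},
    {"soap", "fish"},
]
min_support = Fraction(2, 5)  # the 0.4 threshold used in the text
frequent_items = ["soap", "milk", "candy", "fish"]  # L1 frequent set

def support(itemset):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))

# Join step: every unordered pair of frequent single items is a candidate.
candidates_2 = {frozenset(p): support(p) for p in combinations(frequent_items, 2)}

# Filter step: keep the pairs that meet the minimum support.
frequent_2 = {s: v for s, v in candidates_2.items() if v >= min_support}

for itemset in sorted(frequent_2, key=sorted):
    print(sorted(itemset), frequent_2[itemset])
```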

The algorithm continues to generate larger itemsets until no more frequent itemsets can be found. In this case, the L3 candidate set is computed next. From this point onwards, a specific optimization is applied: the algorithm prunes the candidate set by removing the itemsets that have infrequent subsets. The pruning relies on the Apriori property: every subset of a frequent itemset must itself be frequent, so an itemset with an infrequent subset cannot be frequent.


The L3 candidate set, without support calculation, is:

| Itemset             | All size 2 subsets found in L2?   |
|---------------------|-----------------------------------|
| {soap, milk, fish}  | Yes                               |
| {soap, milk, candy} | No, as {soap, candy} is not in L2 |
| {milk, candy, fish} | No, as {candy, fish} is not in L2 |

This allows us to immediately discard the itemsets {soap, milk, candy} and {milk, candy, fish} from the candidate set, as each has an infrequent subset. The L3 candidate set, together with the support calculation, is:

| Itemset             | Support |
|---------------------|---------|
| {soap, milk, fish}  | 3/7     |

> The reason for utilizing the Apriori property is straightforward: as the previous frequent set was just computed and memorized, the lookup for the subsets is very fast. This allows the algorithm to prune the candidate set efficiently. While not strictly necessary, the Apriori property is a significant optimization that makes the algorithm faster.
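The pruning step can be sketched as a subset lookup against the previous frequent set. Note that the join used here (taking the union of any two frequent pairs that yields three items) is a simplification assumed for this illustration; the original algorithm joins lexicographically ordered itemsets that share a common prefix.

```python
from itertools import combinations

# L2 frequent set from the text.
frequent_2 = {
    frozenset({"soap", "milk"}),
    frozenset({"soap", "fish"}),
    frozenset({"milk", "candy"}),
    frozenset({"milk", "fish"}),
}

# Simplified join: union two frequent pairs when the result has three items.
joined = {a | b for a, b in combinations(frequent_2, 2) if len(a | b) == 3}

def has_infrequent_subset(candidate, frequent_prev):
    """True if any subset one item smaller is missing from the previous level."""
    k = len(candidate) - 1
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k))

# Prune: drop candidates with an infrequent subset, without counting support.
pruned = {c for c in joined if not has_infrequent_subset(c, frequent_2)}

print([sorted(c) for c in pruned])  # [['fish', 'milk', 'soap']]
```

Only the surviving candidates need a pass over the transactions to count their support.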


The support of the only itemset in the L3 candidate set is 3/7, which is higher than the minimum support threshold of 0.4. Therefore, the L3 frequent set is:

| Itemset             | Support |
|---------------------|---------|
| {soap, milk, fish}  | 3/7     |

As it is not possible to generate any candidate itemsets of size 4, the algorithm terminates. The frequent itemsets are:

| Itemset             | Support |
|---------------------|---------|
| {soap}              | 5/7     |
| {milk}              | 6/7     |
| {candy}             | 4/7     |
| {fish}              | 5/7     |
| {soap, milk}        | 4/7     |
| {soap, fish}        | 4/7     |
| {milk, candy}       | 4/7     |
| {milk, fish}        | 4/7     |
| {soap, milk, fish}  | 3/7     |

This concludes the first phase of the Apriori algorithm.
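Putting the steps together, the first phase can be sketched as a loop over itemset sizes. This is a compact, unoptimized illustration rather than a faithful reproduction of the original algorithm: the function name `frequent_itemsets` and the simplified join are assumptions made here.

```python
from fractions import Fraction
from itertools import combinations

# The seven transactions from the example table, one set per row.
transactions = [
    {"soap", "milk", "candy", "fish"},
    {"milk", "candy", "fish"},
    {"fruit", "milk", "candy"},
    {"fruit", "soap", "milk", "fish"},
    {"soap", "milk", "candy"},
    {"soap", "milk", "fish"},
    {"soap", "fish"},
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return Fraction(sum(1 for t in transactions if itemset <= t), n)

    # Size-1 candidates: every distinct item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while candidates:
        # Count support and keep only the frequent candidates of size k.
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        result.update(level)
        k += 1
        # Simplified join, then prune candidates with an infrequent subset.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return result

frequent = frequent_itemsets(transactions, Fraction(2, 5))
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```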

### Phase 2: Generate association rules

In the second phase of the algorithm, the association rules are generated from the frequent itemsets. The algorithm generates all possible rules from the frequent itemsets and filters out the rules that do not meet the minimum confidence threshold. The creation of the rules is done by brute force, by trying all possible combinations of the items in the itemset.

In the example dataset, the following association rules are generated from the frequent itemsets, together with their confidence:

| Itemset        | Rule                   | Confidence |
|----------------|------------------------|------------|
| {soap, milk}   | {soap} -> {milk}       | 4/5        |
| {soap, milk}   | {milk} -> {soap}       | 4/6        |
| {soap, fish}   | {soap} -> {fish}       | 4/5        |
| {soap, fish}   | {fish} -> {soap}       | 4/5        |
| {milk, candy}  | {milk} -> {candy}      | 4/6        |
| {milk, candy}  | {candy} -> {milk}      | 4/4        |
| {milk, fish}   | {milk} -> {fish}       | 4/6        |
| {milk, fish}   | {fish} -> {milk}       | 4/5        |
| {soap, milk, fish} | {soap, milk} -> {fish} | 3/4        |
| {soap, milk, fish} | {soap, fish} -> {milk} | 3/4        |
| {soap, milk, fish} | {milk, fish} -> {soap} | 3/4        |
| {soap, milk, fish} | {soap} -> {milk, fish} | 3/5        |
| {soap, milk, fish} | {milk} -> {soap, fish} | 3/6        |
| {soap, milk, fish} | {fish} -> {soap, milk} | 3/5        |

Next, the algorithm filters out the rules that do not meet the minimum confidence threshold of 0.75. The final set of association rules, together with the support and confidence, is:

| Rule              | Support | Confidence |
|-------------------|---------|------------|
| {soap} -> {milk}  | 4/7     | 4/5        |
| {soap} -> {fish}  | 4/7     | 4/5        |
| {fish} -> {soap}  | 4/7     | 4/5        |
| {candy} -> {milk} | 4/7     | 4/4        |
| {fish} -> {milk}  | 4/7     | 4/5        |
| {soap, milk} -> {fish} | 3/7 | 3/4        |
| {soap, fish} -> {milk} | 3/7 | 3/4        |
| {milk, fish} -> {soap} | 3/7 | 3/4        |

This is the final output of the Apriori algorithm. The algorithm has found the frequent itemsets and generated the association rules that meet the minimum support and confidence thresholds. For application, these may be sorted by support or confidence, or other metrics, to identify the most interesting rules.
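The second phase can be sketched as follows. The supports are copied from the frequent itemsets found in Phase 1 (with {soap} at 5/7, as counted from the transaction table), and the confidence of X -> Y is computed as support(X ∪ Y) / support(X). The helper `fs` and the tuple layout of `rules` are choices made for this illustration.

```python
from fractions import Fraction
from itertools import combinations

def fs(*items):
    """Shorthand for building a frozenset key."""
    return frozenset(items)

# Frequent itemsets and their supports from Phase 1.
support = {
    fs("soap"): Fraction(5, 7),
    fs("milk"): Fraction(6, 7),
    fs("candy"): Fraction(4, 7),
    fs("fish"): Fraction(5, 7),
    fs("soap", "milk"): Fraction(4, 7),
    fs("soap", "fish"): Fraction(4, 7),
    fs("milk", "candy"): Fraction(4, 7),
    fs("milk", "fish"): Fraction(4, 7),
    fs("soap", "milk", "fish"): Fraction(3, 7),
}
min_confidence = Fraction(3, 4)  # the 0.75 threshold used in the text

rules = []
for itemset, s in support.items():
    # Try every non-empty proper subset of the itemset as the antecedent.
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = s / support[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, itemset - antecedent, s, conf))

# Sort by confidence, highest first, and print each rule.
for x, y, s, conf in sorted(rules, key=lambda rule: rule[3], reverse=True):
    print(sorted(x), "->", sorted(y), "support", s, "confidence", conf)
```

The lookup `support[antecedent]` always succeeds because every subset of a frequent itemset is itself frequent.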

## Python implementation

The **sklearn** library does not provide an implementation of the Apriori algorithm. However, other libraries do. One of them is the **apyori** library.

Let's apply the library to a larger example dataset that contains 100,000 transactions.
