# Lab 12 - Association Rule Learning - FP-Growth Algorithm



## 1. Dataset

We will use the same dataset as in the previous lab: customer transactions from the Instacart online
grocery delivery platform. "The Instacart Online Grocery Shopping Dataset 2017" was originally
published on the Kaggle platform. Although it is no longer available at its original location, you
can find the files at any of the following:
- https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset
- https://www.kaggle.com/datasets/psparks/instacart-market-basket-analysis
- https://www.kaggle.com/datasets/suhasyogeshwara/instcart-market-analysis

This is a classic example of a market basket analysis dataset. Download the dataset and extract the
files. The dataset contains much more information, but for the purpose of this lab we will focus on
just two files:
- `order_products__prior.csv` - to obtain the baskets of items purchased by customers
- `products.csv` - to obtain the names of the items (for interpretation of the results)
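
As a starting point, the two files can be turned into baskets and a name lookup with the standard
library alone. The sketch below is only one possible approach: the helper names `load_baskets` and
`load_product_names` are hypothetical, the column names `order_id`, `product_id` and `product_name`
are assumed to match the headers of the downloaded files, and the in-memory sample rows are made up
so that the example runs without the real data.

```python
import csv
import io
from collections import defaultdict

def load_baskets(order_products_file):
    """Group product ids by order id: {order_id: [product_id, ...]}."""
    baskets = defaultdict(list)
    for row in csv.DictReader(order_products_file):
        baskets[int(row["order_id"])].append(int(row["product_id"]))
    return dict(baskets)

def load_product_names(products_file):
    """Map product id to product name, for interpreting the results."""
    return {int(row["product_id"]): row["product_name"]
            for row in csv.DictReader(products_file)}

# Made-up sample rows standing in for the real CSV files (assumed headers).
orders_csv = io.StringIO(
    "order_id,product_id,add_to_cart_order,reordered\n"
    "1,10,1,0\n"
    "1,20,2,0\n"
    "2,20,1,1\n"
)
products_csv = io.StringIO(
    "product_id,product_name,aisle_id,department_id\n"
    "10,Bananas,1,1\n"
    "20,Whole Milk,2,2\n"
)

baskets = load_baskets(orders_csv)
names = load_product_names(products_csv)
named_baskets = {order: [names[p] for p in items] for order, items in baskets.items()}
```

With the real files, the same functions could be called on handles opened with
`open("order_products__prior.csv", newline="")` and `open("products.csv", newline="")`.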

## 2. FP-Growth Algorithm

Many of the approaches to association rule learning, including the Apriori algorithm, are based on
generating candidate itemsets and filtering them according to their support values. However, the
candidate generation step can be computationally expensive. An alternative approach is the FP-Growth
algorithm, which introduces a data structure called a frequent pattern tree (FP-tree) to represent 
transactions in a more compact form. By utilizing the FP-tree, the algorithm can efficiently mine
frequent itemsets without the need for explicit candidate generation.

The FP-Growth algorithm was first proposed by Han et al. in 2000 in the paper "Mining Frequent Patterns without Candidate Generation" ([pdf](https://dl.acm.org/doi/pdf/10.1145/335191.335372)).

Let us also recall the monotonicity property of frequent itemsets: If `A` is a frequent itemset,
then all subsets of `A` are frequent itemsets.
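
The monotonicity property can be checked on a toy example. The transactions and the helper
`support` below are made up for illustration; support is counted as an absolute number of
transactions.

```python
# Made-up toy transactions, each represented as a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

s_pair = support({"bread", "milk"}, transactions)   # 2
s_bread = support({"bread"}, transactions)          # 3
s_milk = support({"milk"}, transactions)            # 3

# Every subset is contained in at least as many transactions as the full set,
# so if the pair meets a threshold, both single items meet it as well.
subsets_at_least_as_frequent = s_bread >= s_pair and s_milk >= s_pair
```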


### 2.1 Frequent Pattern Tree (FP-Tree)

The frequent pattern tree (FP-Tree) is a data structure that represents transactions in a compact
way. It is a prefix tree in which each path from the root to a node is a prefix of a transaction
whose items have been restricted to the frequent ones and sorted by descending global frequency. The
FP-Growth algorithm uses it to find frequent itemsets.

Given a desired support threshold `s`, the outline of the FP-Tree construction algorithm is as
follows:
- compute the frequency of each item in the transaction database
- eliminate infrequent items (those that do not meet the threshold `s`) from all transactions
- sort the remaining items in each transaction in descending order of their frequency, breaking ties
  in the same way for every transaction
- start with an empty FP-Tree consisting of a single dummy node (the root node)
- iteratively add transactions to the FP-Tree and process each transaction as follows: iterate over
  the items in the transaction and follow the existing path in the FP-Tree as long as possible,
  incrementing the count of each node along the path; once no matching child exists, add new nodes
  with count 1 for the remaining items

While the tree is being built, the following structure is maintained:
- each node in the FP-Tree stores the count of transactions that pass through it
- each node maintains links to its children and to its parent
- all nodes related to a given item/product are linked together to allow efficient traversal of all
  occurrences of that item in the FP-Tree (via a linked list across the tree)
- an additional header table stores pointers to the heads of the linked lists for each item/product
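
The steps above can be sketched as follows. This is one possible shape for `build_fp_tree`, not a
reference solution: the `FPNode` class, the `(root, header)` return value, the helper `item_total`,
the alphabetical tie-breaking and the reading of `s` as an absolute transaction count are all
assumptions made for this example, and the toy transactions are made up.

```python
from collections import Counter

class FPNode:
    """An FP-tree node: item, count, parent/children links, and a `next`
    pointer to the following node that holds the same item."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}   # item -> child FPNode
        self.next = None     # node-link used by the header table

def build_fp_tree(transactions, s):
    """Return (root, header); `s` is an absolute count threshold (assumption)."""
    freq = Counter(item for t in transactions for item in set(t))
    freq = {item: n for item, n in freq.items() if n >= s}
    root = FPNode(None, None)
    header = {}   # item -> first node of that item's linked list
    tails = {}    # item -> last node of the list, for O(1) appends
    for t in transactions:
        # keep frequent items only; sort by descending frequency, then by name
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                if item in tails:
                    tails[item].next = child
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1
            node = child
    return root, header

def item_total(header, item):
    """Sum the counts along an item's linked list (equals its frequency)."""
    total, node = 0, header.get(item)
    while node is not None:
        total += node.count
        node = node.next
    return total

transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["jam"],
]
root, header = build_fp_tree(transactions, s=2)
```

In this toy run `jam` is dropped as infrequent, and the three transactions that contain `bread`
share a single `bread` node directly below the root.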

An example of a transaction database and the corresponding FP-Tree is shown in the figure, taken
from Han et al. (2000), "Mining Frequent Patterns without Candidate Generation".

![FP-tree](./figures/fptree.png)

Implement a function `build_fp_tree` that takes transaction data and a support threshold `s` as
input and returns an FP-tree. You are responsible for defining appropriate data structures to
represent the FP-tree. This function will be used during the frequent pattern mining phase - in some
cases, it will be called recursively to construct conditional FP-Trees.

An important aspect of the FP-tree construction algorithm is that the resulting FP-tree size is
bounded by the transaction database size: each transaction contributes at most one path to the
FP-tree, so the number of nodes cannot exceed the total number of frequent-item occurrences in the
database. In practice, a more compact representation is commonly achieved because many transactions
share common items, which allows nodes to be reused. Transactions are reordered so that the most
frequent items are processed first, which makes them more likely to be shared.

In [None]:
# write your code here


### 2.2 Discover Frequent Itemsets

An FP-tree is a compact representation of the transactions in a database. It can be used to
efficiently discover all frequent itemsets.


If the FP-tree has only one path, return all non-empty subsets of the items on that path as frequent
itemsets; the support of each subset is the smallest node count among its items. Otherwise, follow
the steps below to obtain frequent itemsets:
- iterate over all items in the FP-tree in ascending order of their support, i.e., least frequent
  items first
- when handling item `i`, use the header table to traverse the FP-tree and collect the prefix of
  every path that ends with item `i`
- these prefix paths, each with its counts limited to the count of the corresponding `i` node, form
  `i`'s conditional pattern base, i.e., the set of paths that assume the presence of item `i`; for
  example, processing the item `i=p` in the above FP-tree, we obtain the following conditional
  pattern base for `p`: `{(f: 2, c: 2, a: 2, m: 2), (c: 1, b: 1)}`
- item `i` combined with the current suffix is itself a frequent itemset, so add it to the results
- build an FP-tree from the conditional pattern base - this is the conditional FP-tree for item `i`
- recursively get all frequent itemsets from the conditional FP-tree and append item `i` to each of
  them
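
A compact sketch of this recursion is shown below, run on the five transactions of the example from
Han et al. with a support threshold of 3. It rests on several assumptions: the `Node` class and the
`build_tree` helper are hypothetical, the header table is simplified to a plain list of nodes per
item instead of a linked list, transactions carry explicit counts so that conditional pattern bases
can be fed back into the same builder, the result is a dictionary mapping itemsets to supports
rather than a list, and the single-path shortcut is omitted because the general recursion also
covers that case. Ties in frequency are broken alphabetically, so the tree may be shaped differently
from the figure, but the frequent itemsets are the same.

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted, s):
    """Build a tree from (items, count) pairs; returns (root, header, freq)."""
    freq = Counter()
    for items, n in weighted:
        for item in set(items):
            freq[item] += n
    freq = {i: n for i, n in freq.items() if n >= s}
    root, header = Node(None, None), defaultdict(list)
    for items, n in weighted:
        node = root
        for item in sorted((i for i in set(items) if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += n
    return root, header, freq

def generate_frequent_itemsets(tree, s, suffix=frozenset()):
    """Return {itemset: support} for all frequent itemsets extending `suffix`."""
    root, header, freq = tree
    found = {}
    for item in sorted(freq, key=lambda i: (freq[i], i)):  # least frequent first
        itemset = suffix | {item}
        found[itemset] = freq[item]
        # conditional pattern base: the prefix path above every `item` node,
        # weighted by the count of that node
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        conditional_tree = build_tree(base, s)
        found.update(generate_frequent_itemsets(conditional_tree, s, itemset))
    return found

# The five transactions of the example from Han et al., one letter per item.
transactions = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
tree = build_tree([(list(t), 1) for t in transactions], 3)
itemsets = generate_frequent_itemsets(tree, 3)
```

For item `p`, the only other item that stays frequent in the conditional pattern base is `c`, so
the itemsets found for `p` are `{p}` and `{c, p}`, both with support 3.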

Implement a function `generate_frequent_itemsets` that takes an FP-tree, a support threshold `s` and
a suffix of items to be added to the frequent itemsets (empty by default) as input and returns a
list of frequent itemsets. 

In [None]:
# write your code here


### 2.3 Association Rule Generation

For a given frequent itemset `A`, you can generate association rules using the same procedure as in
the previous lab:
- suppose `A` consists of `n` items,
- there are `n` possible association rules of the form $A \setminus \{a\} \rightarrow \{a\}$ for
  each item $a \in A$,
- check the confidence of each rule and keep only the ones that meet a given confidence threshold.
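
A minimal sketch of this procedure is given below. The function name `generate_rules`, the
`supports` dictionary (itemset to absolute support count) and the toy numbers are assumptions made
for the example. The confidence of a rule is the support of `A` divided by the support of the
antecedent; by the monotonicity property, the antecedent of a frequent itemset is always present in
the dictionary of frequent itemsets.

```python
def generate_rules(itemset, supports, c):
    """Return (antecedent, consequent, support, confidence) tuples for rules
    with a single-item consequent whose confidence is at least `c`."""
    itemset = frozenset(itemset)
    rules = []
    if len(itemset) < 2:
        return rules  # a rule needs a non-empty antecedent
    for a in sorted(itemset):
        antecedent = itemset - {a}
        confidence = supports[itemset] / supports[antecedent]
        if confidence >= c:
            rules.append((antecedent, a, supports[itemset], confidence))
    return rules

# Made-up supports for illustration.
supports = {
    frozenset({"bread"}): 4,
    frozenset({"butter"}): 3,
    frozenset({"bread", "butter"}): 3,
}
rules = generate_rules({"bread", "butter"}, supports, c=0.8)
```

Here `{butter} -> {bread}` has confidence 3/3 = 1.0 and is kept, while `{bread} -> {butter}` has
confidence 3/4 = 0.75 and is dropped.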

Prepare a function that for a given itemset `A` and a confidence threshold `c` generates association
rules that meet the confidence threshold.

Generate association rules from the frequent itemsets found in the previous section. Compute support
and confidence statistics for them.

In [None]:
# write your code here


### 2.4 Compare the Results

Compare the obtained association rules with the results from any of the available libraries. To
mention a few:
- https://rasbt.github.io/mlxtend/
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.fpm.FPGrowth.html

Are the results the same or sufficiently close? Are the run times required to run your code
comparable to those of the libraries?
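
One way to organize the comparison is to bring both results into the same form, a dictionary from
itemset to support, and then diff them and time both runs. The sketch below uses a brute-force
enumerator as a stand-in for a library, so that it runs without extra dependencies; it is not the
API of any of the libraries above. The helper names `compare_itemsets` and `timed` are hypothetical.
Note that a library may report support as a fraction of transactions rather than a count, in which
case one side needs to be converted before comparing.

```python
import time
from itertools import combinations

def brute_force_itemsets(transactions, s):
    """Reference miner: count every non-empty subset of every transaction.
    Exponential in transaction length, so only usable on tiny inputs."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {k: n for k, n in counts.items() if n >= s}

def compare_itemsets(ours, theirs):
    """Return (missing, extra, mismatched) between two {itemset: support} dicts."""
    missing = set(theirs) - set(ours)
    extra = set(ours) - set(theirs)
    mismatched = {k for k in set(ours) & set(theirs) if ours[k] != theirs[k]}
    return missing, extra, mismatched

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Made-up toy transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
reference, seconds = timed(brute_force_itemsets, transactions, 2)

# Simulate an implementation that failed to report one itemset.
ours = dict(reference)
ours.pop(frozenset({"bread", "milk"}))
missing, extra, mismatched = compare_itemsets(ours, reference)
```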

In [None]:
# write your code here
