# Lab 11 - Association Rule Learning - Apriori Algorithm

During this lab we will explore association rule learning. It is a domain of data mining that
focuses on discovering interesting relationships between variables in transactional data. You will
familiarize yourself with basic concepts such as association rules, support, confidence, lift, and
leverage, and you will implement the Apriori algorithm.

You can also deepen your understanding and knowledge by studying the relevant materials from
[Chapter 6 (pdf)](http://infolab.stanford.edu/~ullman/mmds/ch6.pdf) of "Mining of Massive Datasets"
\- http://www.mmds.org/.

## 1. Dataset

One of the most popular types of datasets for association rule learning is the market basket. For
this lab we have several datasets available (they will be described in the next subsections). Try to
familiarize yourself with the datasets and choose the one that you find most interesting and that
matches the resources available to you.

---

Preprocess the dataset into a format that is appropriate for checking the co-occurrences of products
in transactions. It is up to you to decide what that format should be in order to perform the
computations efficiently. You will later need to answer questions such as "What is the overall
number of transactions?", "What is the number of transactions in which a specific product was
bought?", "What is the number of transactions in which two specific products were bought?", etc. You
may start with the original data layout and then return to this step if needed.
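
One possible layout, shown here only as a sketch on toy data, keeps two lookup tables: one from each transaction to its set of products, and one from each product to the set of transactions containing it. The function name `build_indexes` and the toy identifiers are hypothetical; with pandas, the input pairs could come from `df[["order_id", "product_id"]].itertuples(index=False)`.

```python
from collections import defaultdict

def build_indexes(rows):
    """Build two lookup tables from (order_id, product_id) pairs."""
    baskets = defaultdict(set)             # order_id -> set of product ids
    orders_by_product = defaultdict(set)   # product_id -> set of order ids
    for order_id, product_id in rows:
        baskets[order_id].add(product_id)
        orders_by_product[product_id].add(order_id)
    return dict(baskets), dict(orders_by_product)

# Toy data with made-up identifiers.
rows = [(1, "bread"), (1, "milk"), (2, "bread"), (3, "milk"), (3, "bread"), (3, "eggs")]
baskets, orders_by_product = build_indexes(rows)

# Overall number of transactions.
n_transactions = len(baskets)
# Number of transactions containing one specific product.
n_with_bread = len(orders_by_product["bread"])
# Number of transactions containing two specific products (set intersection).
n_with_both = len(orders_by_product["bread"] & orders_by_product["milk"])
```

With the product-to-orders table, co-occurrence counts reduce to set intersections, which is usually cheaper than scanning every basket.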

---

Find the 10 products that customers bought most often and the 10 products they bought least often.
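
A minimal sketch of such a ranking, assuming the transactions are available as a list of sets (the helper name `most_and_least_bought` and the toy baskets are hypothetical):

```python
from collections import Counter

def most_and_least_bought(baskets, k=10):
    """Return the k most and the k least frequently bought items with their counts."""
    # Count each product once per transaction.
    counts = Counter(item for basket in baskets for item in set(basket))
    ranked = counts.most_common()          # sorted by count, highest first
    return ranked[:k], ranked[::-1][:k]    # (top k, bottom k starting from the rarest)

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a"}]
top, bottom = most_and_least_bought(baskets, k=2)
```

Ties are kept in the order in which `Counter` first encountered the items, so for real data you may want an explicit tie-breaking rule.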

---

You may think of limiting the number of products to consider - for example, is there a point in
considering products that were bought fewer than 10 times? 100? 1000? That definitely depends on the
particular dataset and context. It is up to you to decide what makes sense for your analysis.
Justify your choice.
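
A sketch of such a filter, assuming baskets are stored as sets; `drop_rare_items` and `min_count` are hypothetical names, and the cut-off value is only an illustration:

```python
from collections import Counter

def drop_rare_items(baskets, min_count):
    """Remove items that occur in fewer than min_count baskets."""
    counts = Counter(item for basket in baskets for item in basket)
    keep = {item for item, c in counts.items() if c >= min_count}
    # Baskets that become empty are kept on purpose, so that the total
    # number of transactions (the denominator of support) does not change.
    return [basket & keep for basket in baskets]

baskets = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}]
filtered = drop_rare_items(baskets, min_count=2)
```

Running the filter with several values of `min_count` and checking how many distinct products survive is one way to support the justification asked for above.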

---

<span style="color:gold">Note:</span> The dataset may suggest that the "in" relationship between
items and baskets corresponds to a real-life "part of" relationship. This is true in this case:
products (items) are purchased together in larger transactions (baskets). However, you should be
aware that for association rule learning, the "in" relationship could be any arbitrary many-to-many
relationship, even if it appears "backwards" compared to real life. See [Section 6.1.2
(pdf)](http://infolab.stanford.edu/~ullman/mmds/ch6.pdf) for an example.



### 1.1 Instacart Online Grocery Shopping

This dataset contains customer transactions from the Instacart online grocery delivery platform.
"The Instacart Online Grocery Shopping Dataset 2017" was originally published on the Kaggle
platform. Although it is no longer available at its original location, you can find the files at
any of the following mirrors:
- https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset
- https://www.kaggle.com/datasets/psparks/instacart-market-basket-analysis
- https://www.kaggle.com/datasets/suhasyogeshwara/instcart-market-analysis

It is a classic example of a market basket analysis dataset. Download the dataset, extract the files
and familiarize yourself with the data.

There are 6 files available:
- `aisles.csv` - aisles of the store
- `departments.csv` - departments of the store
- `order_products__prior.csv` - details of prior orders (historical data), tells which items
  (products) were bought together in one basket (transaction)
- `order_products__train.csv` - details of train orders (last order for each customer)
- `orders.csv` - orders of the customers
- `products.csv` - products of the store


---

We are not going to use all the information provided by the dataset. For now, we are only
interested in the list of products that customers bought in each transaction. Let's focus on the
`order_products__prior.csv` (historical transactions data) and `products.csv` (for the names of the
products) files.

For example:

```python
>>> order_products_prior_df[order_products_prior_df["order_id"] == 7]

    order_id  product_id  add_to_cart_order  reordered
59         7       34050                  1          0
60         7       46802                  2          0
```

We can see that the customer bought two products (with ids `34050` and `46802`) in one transaction.

If we want to check the names of the products, we can use the `products.csv` file:

```python
>>> products_df[products_df["product_id"].isin([34050, 46802])]

       product_id      product_name  aisle_id  department_id
34049       34050      Orange Juice        31              7
46801       46802  Pineapple Chunks       116              1
```
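
If you prefer to work with product names rather than ids, the two tables can be joined on `product_id`. The sketch below uses small stand-in DataFrames that contain only the columns and values shown in the outputs above; the variable names are hypothetical.

```python
import pandas as pd

# Stand-ins for the real tables, restricted to the rows shown above.
order_products = pd.DataFrame({"order_id": [7, 7], "product_id": [34050, 46802]})
products = pd.DataFrame(
    {"product_id": [34050, 46802], "product_name": ["Orange Juice", "Pineapple Chunks"]}
)

# Attach the product name to every purchased item.
named = order_products.merge(
    products[["product_id", "product_name"]], on="product_id", how="left"
)

# One set of product names per transaction.
baskets = named.groupby("order_id")["product_name"].apply(set)
```

Keeping integer ids during the heavy computations and translating to names only when presenting results is usually lighter on memory.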

### 1.2 Online Retail / Online Retail II

Use the Online Retail or Online Retail II dataset from the UCI Machine Learning Repository. These
datasets represent transactions of a UK-based online retail store that focuses on selling unique
all-occasion gifts.

The datasets are provided as XLSX files. Note that Online Retail II contains two separate sheets for
two consecutive years of transactions (you may consider merging them into a single dataframe).

- https://archive.ics.uci.edu/dataset/352/online+retail
- https://archive.ics.uci.edu/dataset/502/online+retail+ii
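
Merging the two sheets could look roughly like the sketch below. `pd.read_excel(path, sheet_name=None)` returns a dictionary that maps sheet names to DataFrames; here toy frames stand in for the real sheets, and the sheet names, column names and values are assumptions made for illustration, so check them against the actual file.

```python
import pandas as pd

def combine_sheets(sheets):
    """Stack a {sheet_name: DataFrame} mapping into a single DataFrame."""
    return pd.concat(list(sheets.values()), ignore_index=True)

# Toy stand-in for: sheets = pd.read_excel(path, sheet_name=None)
sheets = {
    "sheet_1": pd.DataFrame({"Invoice": ["A1", "A1"], "StockCode": ["X", "Y"]}),
    "sheet_2": pd.DataFrame({"Invoice": ["B7"], "StockCode": ["X"]}),
}
retail = combine_sheets(sheets)
```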


In [None]:
# write your code here


## 2. Association Rules

Association rules are rules that express the relationship between items in transactions. They are
usually presented in the form `A -> B`, where `A` and `B` are subsets of items, and `A -> B`
means that if the items from `A` are purchased, then the items from `B` tend to be purchased as
well. We will refer to `A` as the antecedent and `B` as the consequent of the rule. It is quite
common to consider consequents as single items, as it is easier to interpret the rules. For example,
if the client buys bread and milk, then what other item is likely to be purchased with them?

Let's consider an association rule `A -> B`. There are several metrics that can be used to evaluate
its quality. The most common metrics are:
- `support` - the ratio of transactions that contain all items from both `A` and `B` to the total
  number of transactions (alternatively, just the number of such transactions); we can interpret it
  as how often the rule applies,
- `confidence` - the ratio of transactions that contain all items from both `A` and `B` to the
  number of transactions that contain all items from `A`; we can interpret it as the probability of
  purchasing `B` given that `A` was purchased,
- `coverage` - the ratio of transactions that contain all items from `A` to the total number of
  transactions; we can interpret it as how often `A` is purchased; it tells us how popular the
  items from `A` are in general and in what fraction of transactions the rule is applicable
  (regardless of whether `B` is purchased).
  
<span style="color:gold">Note:</span> The concept of coverage can be defined not only
for rules but also for itemsets. It is the ratio of transactions that contain the items from the
itemset to the total number of transactions.
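
Written as formulas, with $T$ denoting the set of all transactions and $\sigma(X)$ the number of transactions that contain every item of $X$ (both symbols are notation introduced here only for illustration), the three metrics read:

$$
\begin{aligned}
\mathrm{support}(A \rightarrow B) &= \frac{\sigma(A \cup B)}{|T|} \\
\mathrm{confidence}(A \rightarrow B) &= \frac{\sigma(A \cup B)}{\sigma(A)} \\
\mathrm{coverage}(A \rightarrow B) &= \frac{\sigma(A)}{|T|}
\end{aligned}
$$

It follows directly that $\mathrm{support}(A \rightarrow B) = \mathrm{coverage}(A \rightarrow B) \cdot \mathrm{confidence}(A \rightarrow B)$.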

---

Write functions that let you compute support, confidence, and coverage metrics for association
rules. It is up to you to define the expected association rule structure, e.g., a dictionary, a
tuple, a named tuple, a dataclass, etc.

Below, the product names are used for simplicity, but you should adapt this to the format you
decided on in the previous step. Use your functions to compute the support, confidence, and coverage
metrics of the following association rules:
- {Bread} -> {Milk}
- {Milk} -> {Bread}
- {Bread, Milk} -> {Butter}
- {Banana, Apple} -> {Milk}
- {Bread, Milk, Butter} -> {Eggs}

The example items above should work for the "Instacart" dataset. For the "Online Retail" dataset,
please choose your own appropriate example products.
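
As a starting point, here is one possible shape for these functions on a tiny hand-made dataset. The `Rule` named tuple, the helper `count` and the toy baskets are hypothetical choices, not a required design, and the linear scan in `count` is far too slow for the full dataset.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset
    consequent: frozenset

def count(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def support(rule, baskets):
    return count(rule.antecedent | rule.consequent, baskets) / len(baskets)

def coverage(rule, baskets):
    return count(rule.antecedent, baskets) / len(baskets)

def confidence(rule, baskets):
    n_antecedent = count(rule.antecedent, baskets)
    if n_antecedent == 0:
        return 0.0  # convention chosen here: the rule never applies
    return count(rule.antecedent | rule.consequent, baskets) / n_antecedent

# Toy transactions, not taken from any of the lab datasets.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Butter"},
    {"Bread"},
    {"Milk", "Eggs"},
]
rule = Rule(frozenset({"Bread"}), frozenset({"Milk"}))
s, c, cov = support(rule, baskets), confidence(rule, baskets), coverage(rule, baskets)
```

For this toy data the rule {Bread} -> {Milk} has support 2/4, coverage 3/4 and confidence 2/3.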

In [None]:
# write your code here


## 3. Apriori Algorithm

Before we start with the details on how to efficiently search for association rules, let's first
state that the number of possible association rules is exponential in the number of items. We can
choose any item as the consequent and any subset of the items as the antecedent. However, most of
these association rules are not interesting and have no practical value. What we usually want to
find (at least in the most basic cases) are association rules that have large enough support and
confidence.

Finding a useful association rule is not much different from finding an interesting (frequent)
itemset, so the first decision is the support threshold. It is said that for brick-and-mortar
stores a reasonable threshold is around 1% of the transactions. For online stores, where the number
of products is much larger, the threshold is usually even lower, around 0.1%.

In practice, we aim for a manageable number of frequent itemsets and association rules. The
computed results are usually presented to or interpreted by a human, and each additional candidate
rule has to be reviewed and acted upon. It is therefore quite normal to tune the thresholds for
support and confidence in order to control the number of association rules.

### 3.1 Frequent Itemsets

Frequent itemsets are the sets of items that appear together in transactions with a frequency
not lower than a given threshold. In other words, we are interested in itemsets with "high" (that
depends on the context) coverage.

**Monotonicity Property**: If `A` is a frequent itemset, then all subsets of `A` are frequent
itemsets.

Can you prove this property?

---

Below, the product names are used for simplicity, but you should adapt this to the format you
decided on in the previous step. Compute the frequency of the following itemsets:
- {Bread}
- {Milk}
- {Banana}
- {Bread, Milk}
- {Bread, Butter}
- {Bread, Milk, Butter, Banana}

How would you choose a reasonable frequency threshold to use? How would you approach this problem?
What are the trade-offs?

The example items above should work for the "Instacart" dataset. For the "Online Retail" dataset,
please choose your own appropriate example products.
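
One way to approach the threshold question is to look at how many itemsets survive at several candidate thresholds. The sketch below does this for single items only, on toy baskets; the function names and the threshold values are hypothetical.

```python
from collections import Counter

def item_frequencies(baskets):
    """Fraction of baskets that contain each single item."""
    counts = Counter(item for basket in baskets for item in set(basket))
    return {item: c / len(baskets) for item, c in counts.items()}

def n_frequent(freqs, threshold):
    """Number of items whose frequency is at least threshold."""
    return sum(1 for f in freqs.values() if f >= threshold)

# Toy transactions, not taken from any of the lab datasets.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Butter"},
    {"Bread", "Banana"},
    {"Milk"},
    {"Bread", "Milk", "Butter", "Banana"},
]
freqs = item_frequencies(baskets)
survivors = {t: n_frequent(freqs, t) for t in (0.2, 0.5, 0.9)}
```

A low threshold keeps every item and makes the later steps expensive, while a high one can leave nothing to analyse; plotting such counts for the real data makes the trade-off visible.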

In [None]:
# write your code here


### 3.2 Searching for Frequent Itemsets

As a consequence of the above property, if we detect an itemset that is not frequent, then we know
that all its supersets are not frequent either. That observation is the basis of the Apriori
algorithm.

The Apriori algorithm was proposed by R. Agrawal and R. Srikant in the 1994 paper "Fast Algorithms
for Mining Association Rules" ([pdf](https://www.vldb.org/conf/1994/P487.PDF)). It is an iterative
algorithm for discovering frequent itemsets in a transaction database. The monotonicity property is
used to prune the search space. Given a desired support threshold `s`, the algorithm can be
described by the following steps (see the link above for more details):


- compute the set $L_1$ of all frequent $1$-itemsets
- generate all pairs of distinct $1$-itemsets from $L_1$ and take their union to form the candidate
  set $C_2$ of $2$-itemsets
- filter out the itemsets from $C_2$ if their support is lower than `s`; this results in the set
  $L_2$ of true frequent $2$-itemsets
- we can generalize this process as follows:
    - assume that the set $L_{k-1}$ has been already computed
    - generate all pairs of distinct frequent $(k-1)$-itemsets from $L_{k-1}$ that differ by exactly
      one element; for each such pair, take their union to obtain an itemset `X` of size $k$; if all
      subsets of `X` of size $(k-1)$ are frequent, add `X` to the candidate set $C_k$
    - filter out the itemsets from $C_k$ if their support is lower than `s`; this results in the set
      $L_k$ of true frequent $k$-itemsets
- the process stops when $L_k$ becomes empty or when the maximum itemset length to consider has been
  reached

Implement the Apriori algorithm and apply it to the dataset. Keep in mind that, depending on the
chosen parameters, the number of generated itemsets may become very large. Our goal is to identify
the interesting itemsets without being overwhelmed by an excessive number of them. Therefore, begin
with relatively high support thresholds and gradually lower them to discover additional itemsets.
Adjust thresholds based on the available computational, memory or time resources.
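
The level-wise procedure described above could be sketched as follows. This is a deliberately naive illustration on toy data: the function name `apriori_sketch`, its parameters and the baskets are hypothetical, and the counting step scans every basket for every candidate, which will not scale to the full dataset.

```python
from itertools import combinations

def apriori_sketch(baskets, min_support, max_len=None):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def keep_frequent(candidates):
        # Count every candidate by scanning all baskets (naive but simple).
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    singles = {frozenset([item]) for basket in baskets for item in basket}
    level = keep_frequent(singles)            # L_1
    result = dict(level)
    k = 2
    while level and (max_len is None or k <= max_len):
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            # Join: the two (k-1)-itemsets differ by exactly one element.
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset of the union must be frequent.
            if all(frozenset(s) in level for s in combinations(union, k - 1)):
                candidates.add(union)         # member of C_k
        level = keep_frequent(candidates)     # L_k
        result.update(level)
        k += 1
    return result

# Toy transactions, not taken from any of the lab datasets.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Butter"},
    {"Bread", "Milk", "Butter"},
]
frequent_itemsets = apriori_sketch(baskets, min_support=0.6)
```

On this toy data all three single items and all three pairs reach the 0.6 threshold, while the triple appears in only two of five baskets and is dropped, so the loop stops after the third level.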

In [None]:
# write your code here


### 3.3 Association Rules Generation

Suppose we have identified all "interesting" itemsets. We can then generate interesting association
rules (meeting some additional criteria, such as "high" confidence) from them. The procedure to
generate association rules from a frequent itemset `A` can be as simple as follows:
- `A` consists of `n` items,
- we have `n` possible association rules of the form $A \setminus \{a\} \rightarrow \{a\}$ for each
  item $a \in A$,
- we check the confidence of each rule and keep only the ones that meet the confidence threshold.

Prepare a function that for a given itemset `A` and a confidence threshold generates association
rules that meet the confidence threshold.

Generate the set of association rules from the frequent itemsets found in the previous step. Compute
statistics (support, confidence) for them.
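
The procedure above could be sketched like this, again on toy data. The function name `rules_from_itemset`, the tuple layout of a rule and the baskets are hypothetical; itemsets with a single item are skipped here because their antecedent would be empty.

```python
def count(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def rules_from_itemset(itemset, baskets, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples for one itemset."""
    itemset = frozenset(itemset)
    n_itemset = count(itemset, baskets)
    rules = []
    for item in sorted(itemset):
        antecedent = itemset - {item}
        if not antecedent:
            continue  # a single-item itemset yields no rule in this sketch
        n_antecedent = count(antecedent, baskets)
        conf = n_itemset / n_antecedent if n_antecedent else 0.0
        if conf >= min_confidence:
            rules.append((antecedent, frozenset({item}), n_itemset / len(baskets), conf))
    return rules

# Toy transactions, not taken from any of the lab datasets.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Milk"},
    {"Bread"},
    {"Bread"},
    {"Milk", "Eggs"},
]
rules = rules_from_itemset({"Bread", "Milk"}, baskets, min_confidence=0.6)
```

Here {Milk} -> {Bread} has confidence 2/3 and is kept, while {Bread} -> {Milk} has confidence 2/4 and is filtered out.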

In [None]:
# write your code here


### 3.4 Compare the Results

Compare the results you have obtained with the results from any of the available libraries. To
mention a few:
- https://rasbt.github.io/mlxtend/
- https://github.com/tommyod/Efficient-Apriori

Are the results the same or sufficiently close? Are the run times required to run your code
comparable to those of the libraries?

<span style="color:gold">Note:</span> You may want to use some features of the libraries to make it
possible to run the computations on the entire dataset. For example, consider using the
`min_support` (start with higher required support and then lower it), `max_len` (start with smaller
max_len and then increase it), `low_memory` parameters in the `mlxtend` library or transform the
entire dataset into a sparse matrix.
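
For the comparison itself, two small helpers may be enough: one that checks whether two results contain the same itemsets with (almost) the same support, and one that measures the run time of a call. Both names are hypothetical, and the sketch assumes each result has been converted to a dictionary from `frozenset` to support. For `mlxtend`, whose `apriori` returns a DataFrame with `itemsets` and `support` columns, `dict(zip(df["itemsets"], df["support"]))` is one way to do that.

```python
import math
import time

def same_itemsets(mine, theirs, tol=1e-9):
    """True if both results hold the same itemsets with supports equal within tol."""
    if set(mine) != set(theirs):
        return False
    return all(math.isclose(mine[k], theirs[k], abs_tol=tol) for k in mine)

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy results standing in for your implementation and a library.
mine = {frozenset({"a"}): 0.5, frozenset({"a", "b"}): 0.25}
theirs = {frozenset({"a"}): 0.5 + 1e-12, frozenset({"a", "b"}): 0.25}
equal = same_itemsets(mine, theirs)
total, elapsed = timed(sum, [1, 2, 3])
```

When comparing run times, use the same support threshold and maximum itemset length on both sides, otherwise the numbers are not comparable.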

In [None]:
# write your code here
