# Lab 11 - Association Rule Learning - Apriori Algorithm

During this lab we will explore association rule learning. It is a domain of data mining that
focuses on discovering interesting relationships between variables in transactional data. You will
familiarize yourself with basic concepts such as association rules, support, confidence, lift, and
leverage, and you will implement the Apriori algorithm.

You can also deepen your understanding and knowledge by studying the relevant materials from
[Chapter 6 (pdf)](http://infolab.stanford.edu/~ullman/mmds/ch6.pdf) of "Mining of Massive Datasets"
(http://www.mmds.org/).

## 1. The Dataset

We will use a dataset that represents customer transactions from the Instacart online grocery
delivery platform. "The Instacart Online Grocery Shopping Dataset 2017" was provided on the
Kaggle platform. Although it is no longer available at its original location, you can find the
files at any of the following locations:
- https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset
- https://www.kaggle.com/datasets/psparks/instacart-market-basket-analysis
- https://www.kaggle.com/datasets/suhasyogeshwara/instcart-market-analysis

It is a classic example of a market basket analysis dataset. Download the dataset, extract the files
and familiarize yourself with the data.

There are 6 files available:
- aisles.csv - aisles of the store
- departments.csv - departments of the store
- order_products__prior.csv - details of prior orders (historical data), tells which items
  (products) were bought together in one basket (transaction)
- order_products__train.csv - details of train orders (last order for each customer)
- orders.csv - orders of the customers
- products.csv - products of the store

In [None]:
# write your code here


### 1.1 Products and Transactions

We are not going to use all the information provided by the dataset. For now, we are interested in
the list of products that customers bought in each transaction. Let's focus on the
`order_products__prior.csv` (historical transactions data) and `products.csv` (for the names of the
products) files.

For example:

```python
>>> order_products_prior_df[order_products_prior_df["order_id"] == 7]

    order_id  product_id  add_to_cart_order  reordered
59         7       34050                  1          0
60         7       46802                  2          0
```

We can see that a customer bought both products (with IDs `34050` and `46802`) in one transaction.

If we want to check the names of the products, we can use the `products.csv` file:

```python
>>> products_df[products_df["product_id"].isin([34050, 46802])]

       product_id      product_name  aisle_id  department_id
34049       34050      Orange Juice        31              7
46801       46802  Pineapple Chunks       116              1
```

---

Preprocess the dataset into a format that is appropriate for checking the co-occurrences of products
in transactions. It is up to you to decide what that format should be in order to perform the
computations efficiently. You will later need to answer questions such as "What is the overall
number of transactions?", "What is the number of transactions in which a specific product was
bought?", "What is the number of transactions in which two specific products were bought?", etc. You
may start with the original data layout and then return to this step if needed. You can work on
product IDs and use the `products.csv` file later to check the product names.
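
One possible layout, shown here only as a sketch, keeps one set of product IDs per transaction and,
in addition, an inverted index from each product to the orders that contain it. The rows of `toy_df`
are made up for illustration and are not taken from the dataset; the names `baskets` and
`orders_by_product` are hypothetical.

```python
import pandas as pd

# Toy stand-in for order_products__prior.csv; the rows are made up.
toy_df = pd.DataFrame(
    {
        "order_id": [1, 1, 1, 2, 2, 3],
        "product_id": [10, 20, 30, 10, 20, 30],
    }
)

# One set of product IDs per transaction (basket).
baskets = toy_df.groupby("order_id")["product_id"].agg(set)

# Inverted index: product ID -> set of order IDs that contain it.
orders_by_product = toy_df.groupby("product_id")["order_id"].agg(set)

# Overall number of transactions.
n_transactions = len(baskets)
# Number of transactions that contain product 10.
n_with_10 = len(orders_by_product.loc[10])
# Number of transactions that contain both product 10 and product 20.
n_with_10_and_20 = len(orders_by_product.loc[10] & orders_by_product.loc[20])

print(n_transactions, n_with_10, n_with_10_and_20)
```

With the inverted index, co-occurrence counts reduce to set intersections, which avoids scanning
every basket for each query. Whether this pays off on the full dataset depends on available memory.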

---

Find the 10 products that customers bought most often and the 10 products that they bought least
often.
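
As a rough sketch of how such a ranking could be computed in the original (long) data layout, the
example below counts rows per product ID on a tiny made-up DataFrame. It assumes that each product
appears at most once per order, so the row count equals the number of transactions; it uses the 2
most and least frequent products instead of 10 only because the toy data is so small.

```python
import pandas as pd

# Toy stand-in for order_products__prior.csv; the rows are made up.
toy_df = pd.DataFrame(
    {
        "order_id": [1, 1, 1, 2, 2, 3],
        "product_id": [10, 20, 30, 10, 20, 10],
    }
)

# Number of rows (and thus transactions) per product ID.
counts = toy_df["product_id"].value_counts()

most_frequent = counts.nlargest(2)    # use 10 on the real data
least_frequent = counts.nsmallest(2)  # use 10 on the real data

print(most_frequent)
print(least_frequent)
```

The resulting product IDs can then be looked up in `products.csv` to obtain the product names.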

---

<span style="color:gold">Note:</span> The dataset may suggest that the "in" relationship between
items and baskets corresponds to a real-life "part of" relationship. This is true in this case:
products (items) are purchased together in larger transactions (baskets). However, you should be
aware that for association rule learning, the "in" relationship could be any arbitrary many-to-many
relationship, even if it appears "backwards" compared to real life. See [Section 6.1.2
(pdf)](http://infolab.stanford.edu/~ullman/mmds/ch6.pdf) for an example.


In [36]:
# write your code here


## 2. Association Rules

Association rules are rules that express the relationship between items in transactions. They are
usually presented in the form `A -> B`, where `A` and `B` are subsets of items, and `A -> B`
means that if items from `A` are purchased, then items from `B` are also purchased. We will refer to
`A` as the antecedent and `B` as the consequent of the rule. It is quite common to consider
consequents as single items, as it is easier to interpret the rules. For example, if the client buys
bread and milk, then what other item is likely to be purchased with them?

Let's consider an association rule `A -> B`. There are several metrics that can be used to evaluate
its quality. The most common metrics are:
- `support` - the ratio of transactions that contain all items from both `A` and `B` to the total
  number of transactions (alternatively, just the number of such transactions); we can interpret it
  as how often the rule applies,
- `confidence` - the ratio of the number of transactions that contain all items from both `A` and
  `B` to the number of transactions that contain all items from `A`; we can interpret it as the
  probability of purchasing `B` given that `A` was purchased,
- `coverage` - the ratio of transactions that contain all items from `A` to the total number of
  transactions; we can interpret it as how often `A` is purchased; it tells us how popular the
  items from `A` are and in what fraction of transactions the rule is applicable (regardless of
  whether `B` is purchased).
  
<span style="color:gold">Note:</span> The concept of coverage can be defined not only
for rules but also for itemsets. It is the ratio of transactions that contain the items from the
itemset to the total number of transactions.
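
To make the three definitions concrete, here is a small worked example on five made-up baskets.
The item names, the helper `count_containing`, and the variable names are all hypothetical and
serve only as an illustration of the formulas above.

```python
# Five made-up transactions; the item names are for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"milk"},
    {"banana"},
]


def count_containing(itemset):
    """Number of baskets that contain every item of `itemset`."""
    return sum(itemset <= basket for basket in baskets)


antecedent, consequent = {"bread"}, {"milk"}
n = len(baskets)

# 2 of the 5 baskets contain both bread and milk.
rule_support = count_containing(antecedent | consequent) / n
# 3 of the 5 baskets contain bread.
rule_coverage = count_containing(antecedent) / n
# 2 of the 3 baskets with bread also contain milk.
rule_confidence = count_containing(antecedent | consequent) / count_containing(antecedent)

print(rule_support, rule_coverage, rule_confidence)
```

Note that the confidence equals the support divided by the coverage, which follows directly from
the definitions.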

---

Write functions that let you compute support, confidence, and coverage metrics for association
rules.

Below, the product names are used for simplicity, but you should adapt this to the format you
decided on in the previous step. Use your functions to compute the support, confidence, and coverage
metrics of the following association rules:
- {Bread} -> {Milk}
- {Milk} -> {Bread}
- {Bread, Milk} -> {Butter}
- {Banana, Apple} -> {Milk}
- {Bread, Milk, Butter} -> {Eggs}

In [1]:
# write your code here

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

df = pd.read_csv("../../data/lab-11/order_products__prior.csv")
products_df = pd.read_csv("../../data/lab-11/products.csv")
aisles_df = pd.read_csv("../../data/lab-11/aisles.csv")
departments_df = pd.read_csv("../../data/lab-11/departments.csv")

# te = TransactionEncoder()
# te_ary = te.fit(q["product_id"]).transform(q["product_id"])
# w = pd.DataFrame(te_ary, columns=te.columns_)

q = df.groupby('order_id').agg({'product_id': lambda x: set(x)})
q

| order_id | product_id |
|---|---|
| 2 | {33120, 17794, 40141, 9327, 30035, 43668, 2898... |
| 3 | {17668, 24838, 17704, 46667, 21903, 17461, 326... |
| 4 | {26434, 32645, 10054, 21351, 22598, 39758, 348... |
| 5 | {48002, 45698, 18569, 37011, 15005, 8479, 9633... |
| 6 | {15873, 41897, 40462} |
| ... | ... |
| 3421079 | {30136} |
| 3421080 | {25122, 4932, 31717, 27845, 12935, 10667, 3806... |
| 3421081 | {38185, 32299, 3060, 35221, 12218, 20539, 12861} |
| 3421082 | {12738, 47941, 12023, 43352, 32700, 16797, 17279} |


In [39]:
PRODUCT_COL = "product_id"

def cov_itemset(df, itemset: set[int], absolute=False, col=PRODUCT_COL):
    count = df[col].map(lambda basket: itemset.issubset(basket)).sum()
    return count if absolute else count / len(df)

def coverage(df, rule, absolute=False, col=PRODUCT_COL):
    antecedent, _ = rule
    return cov_itemset(df, antecedent, absolute, col)

def support(df, rule, absolute=False, col=PRODUCT_COL):
    antecedent, consequent = rule
    return cov_itemset(df, antecedent | consequent, absolute, col)

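def confidence(df, rule, col=PRODUCT_COL):
    antecedent, consequent = rule
    both = cov_itemset(df, antecedent | consequent, absolute=True, col=col)
    ante = cov_itemset(df, antecedent, absolute=True, col=col)
    # Confidence is undefined when the antecedent never occurs.
    return both / ante if ante else float("nan")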

def show_product_name(products_df, product_id):
    return products_df[products_df["product_id"] == product_id]["product_name"].iloc[0]

print(coverage(q, ({25122}, {4932}), absolute=True))
print(support(q, ({25122}, {4932}), absolute=True))
print(confidence(q, ({25122}, {4932})))

print(show_product_name(products_df, 25122))
print(show_product_name(products_df, 4932))




1082
20
0.018484288354898338
Organic European Style Lightly Salted Butter
Vanilla Bean Ice Cream


In [38]:
products_df[products_df["product_id"] == 1]["product_name"].values[0]

'Chocolate Sandwich Cookies'

In [30]:
s1 = set([1, 2])
s2 = set([2, 3])

s1.issubset(s2)

False

## 3. Apriori Algorithm

Before we start with the details on how to efficiently search for association rules, let's first
state that the number of possible association rules is exponential in the number of items. We can
choose any item as the consequent and any subset of the items as the antecedent. However, most of
these association rules are not interesting and have no practical value. What we usually want to
find (at least in the most basic cases) are association rules that have large enough support and
confidence.

Finding a useful association rule is not much different from finding an interesting (frequent)
itemset. It is said that for brick-and-mortar stores, a reasonable frequency threshold is around 1%
of the transactions. For online stores, where the number of products is much larger, the threshold
is usually even lower, around 0.1%.

Suppose we have identified all "interesting" itemsets. We can then generate interesting association
rules (meeting some additional criteria, such as "high" confidence) from them. The procedure to
generate association rules from a frequent itemset `A` can be as simple as follows:
- `A` consists of `n` items,
- we have `n` possible association rules of the form $A \setminus \{a\} \rightarrow \{a\}$ for each
  item $a \in A$,
- we check the confidence of each rule and keep only the ones that meet the confidence threshold.
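
The procedure above can be sketched as follows. This is only an illustration on made-up baskets:
`generate_rules` and `count` are hypothetical names, and the sketch additionally assumes that rules
with an empty antecedent (which arise from single-item itemsets) should be skipped.

```python
# Five made-up transactions; the item names are for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"bread"},
    {"milk"},
]


def count(itemset):
    """Number of baskets that contain every item of `itemset`."""
    return sum(itemset <= basket for basket in baskets)


def generate_rules(itemset, min_confidence):
    """Yield (antecedent, consequent, confidence) with single-item consequents."""
    itemset = frozenset(itemset)
    for item in sorted(itemset):
        antecedent = itemset - {item}
        if not antecedent:
            continue  # assumption: skip rules with an empty antecedent
        conf = count(itemset) / count(antecedent)
        if conf >= min_confidence:
            yield antecedent, frozenset({item}), conf


rules = list(generate_rules({"bread", "milk"}, min_confidence=0.6))
for antecedent, consequent, conf in rules:
    print(set(antecedent), "->", set(consequent), round(conf, 3))
```

Here `{milk} -> {bread}` is kept (2 of the 3 baskets with milk also contain bread), while
`{bread} -> {milk}` is dropped (2 of 4 baskets, below the 0.6 threshold).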

In practice, we want to keep the number of frequent itemsets and association rules manageable. The
computed results are usually presented to or interpreted by a human, and each additional candidate
rule requires someone to act on it. It is quite normal to adjust the support and confidence
thresholds to control the number of association rules.

### 3.1 Frequent Itemsets

Frequent itemsets are sets of items that appear together in transactions with a frequency higher
than a given threshold. In other words, we are interested in itemsets with "high" coverage, where
what counts as "high" depends on the context.

**Property**: If `A` is a frequent itemset, then all subsets of `A` are frequent itemsets.

Can you prove this property?

---

Below, the product names are used for simplicity, but you should adapt this to the format you
decided on in the previous step. Compute the frequency of the following itemsets:
- {Bread}
- {Milk}
- {Banana}
- {Bread, Milk}
- {Bread, Butter}
- {Bread, Milk, Butter, Banana}

How would you choose a reasonable frequency threshold to use? How would you approach this problem?

In [None]:
# write your code here


### 3.2 Searching for Frequent Itemsets

As a consequence of the above property, if we detect an itemset that is not frequent, then we know
that all its supersets are not frequent either. That observation is the basis of the Apriori
algorithm.

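The Apriori algorithm is an iterative algorithm that searches for frequent itemsets in a dataset.
It proceeds level by level: frequent itemsets of size `k - 1` are combined into candidates of size
`k`, candidates that have an infrequent subset are discarded, and the remaining candidates are
counted against the transactions.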
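
The sketch below illustrates one way the level-wise search could look on a handful of made-up
baskets. It is a deliberately naive version (the join step compares all pairs of frequent itemsets)
and is not tuned for the full dataset; the function name `apriori` and the absolute threshold
`min_count` are assumptions made for this example.

```python
from itertools import combinations


def apriori(baskets, min_count):
    """Return {itemset: count} for all itemsets found in >= min_count baskets."""
    baskets = [frozenset(b) for b in baskets]

    # Level 1: count single items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset({item})
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count step: one pass over the baskets for the surviving candidates.
        counts = {c: sum(c <= b for b in baskets) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Made-up transactions; the item names are for illustration only.
toy_baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk"},
]
frequent = apriori(toy_baskets, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this toy run, `{bread, milk, butter}` is never counted: it is pruned because its subset
`{milk, butter}` is not frequent.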

In [None]:
# write your code here


### 3.3 Association Rules Generation