# Lab 11 - Association Rule Learning - Apriori Algorithm

During this lab we will explore association rule learning. It is a domain of data mining that
focuses on discovering interesting relationships between variables in transactional data. You will
familiarize yourself with basic concepts such as association rules, support, confidence, lift, and
leverage, and you will implement the Apriori algorithm.

You can also deepen your understanding by studying
[Chapter 6 (pdf)](http://infolab.stanford.edu/~ullman/mmds/ch6.pdf) of "Mining of Massive Datasets"
(http://www.mmds.org/).

## 1. The Dataset

We will use a dataset of customer transactions from the Instacart online grocery delivery platform.
"The Instacart Online Grocery Shopping Dataset 2017" was originally provided on the Kaggle platform.
Although it is no longer available at its original location, you can find the files at any of the
following locations:
- https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset
- https://www.kaggle.com/datasets/psparks/instacart-market-basket-analysis
- https://www.kaggle.com/datasets/suhasyogeshwara/instcart-market-analysis

Download the dataset, extract the files and familiarize yourself with the data.

There are 6 files available:
- aisles.csv - aisles of the store
- departments.csv - departments of the store
- order_products__prior.csv - details of prior orders (historical data)
- order_products__train.csv - details of train orders (last order for each customer)
- orders.csv - orders of the customers
- products.csv - products of the store

In [None]:
# write your code here


### 1.1 Products and Transactions

We are not going to use the whole information provided by the dataset. For now, we are interested in
the list of products that customers bought in each transaction. Let's focus on the
`order_products__prior.csv` (historical transactions data) and `products.csv` (for the names of the
products) files.

For example:

```python
>>> order_products_prior_df[order_products_prior_df["order_id"] == 7]

    order_id  product_id  add_to_cart_order  reordered
59         7       34050                  1          0
60         7       46802                  2          0
```

We can see that a customer bought both products (with ids `34050` and `46802`) in one transaction.

If we want to check the names of the products, we can use the `products.csv` file:

```python
>>> products_df[products_df["product_id"].isin([34050, 46802])]

       product_id      product_name  aisle_id  department_id
34049       34050      Orange Juice        31              7
46801       46802  Pineapple Chunks       116              1
```

Preprocess the dataset into a format that is appropriate for checking the co-occurrences of products
in transactions. It is up to you to decide what that format should be in order to perform the
computations efficiently. You will later need to answer questions such as "What is the overall
number of transactions?", "What is the number of transactions in which a specific product was
bought?", "What is the number of transactions in which two specific products were bought?", etc. You
may start with the original data layout and then return to this step if needed. 
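
One possible layout, shown here only as a sketch, is an inverted index that maps each product to the set of orders containing it. The rows below are made-up toy data, and the names `product_to_orders` and `count_transactions` are hypothetical helpers, not part of the dataset or of any library. With the real data, the same loop could be fed from `zip(df["order_id"], df["product_id"])`.

```python
from collections import defaultdict

# Toy (order_id, product_id) rows standing in for order_products__prior.csv.
# These values are invented for illustration only.
rows = [
    (1, 10), (1, 20), (1, 30),
    (2, 10), (2, 20),
    (3, 10), (3, 30), (3, 40),
]

# Inverted index: product_id -> set of order_ids that contain the product.
product_to_orders = defaultdict(set)
for order_id, product_id in rows:
    product_to_orders[product_id].add(order_id)

# Overall number of transactions (distinct order ids).
n_transactions = len({order_id for order_id, _ in rows})


def count_transactions(*product_ids):
    """Return the number of transactions that contain all given products."""
    if not product_ids:
        return n_transactions
    # .get avoids creating empty entries for unknown products.
    order_sets = [product_to_orders.get(p, set()) for p in product_ids]
    return len(set.intersection(*order_sets))


print(n_transactions)              # overall number of transactions
print(count_transactions(10))      # transactions containing product 10
print(count_transactions(10, 20))  # transactions containing both 10 and 20
```

Set intersection answers the co-occurrence questions without rescanning the transaction table. Whether this is efficient enough for the full dataset depends on available memory, so treat it as one option among several.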

In [36]:
# write your code here


## 2. Association Rules

Association rules express relationships between items in transactions. They are usually presented
in the form `A -> B`, where `A` and `B` are disjoint subsets of items, and `A -> B` means that
transactions containing the items from `A` tend to also contain the items from `B`. We will refer to
`A` as the antecedent and `B` as the consequent of the rule.

There are several metrics that can be used to evaluate the quality of association rules. The most
common metrics are:
- support, 
- confidence, 
- lift, 
- leverage.
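
The sketch below computes the four metrics using their commonly used definitions: support is the fraction of transactions containing all items of `A` and `B`, confidence is `support(A ∪ B) / support(A)`, lift is `confidence / support(B)`, and leverage is `support(A ∪ B) - support(A) * support(B)`. The transactions are invented toy data and the function names are hypothetical. The code assumes the antecedent and consequent each occur in at least one transaction, otherwise the divisions fail.

```python
# Toy transactions: each one is the set of items bought together (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent):
    """Support, confidence, lift and leverage of antecedent -> consequent."""
    a, b = set(antecedent), set(consequent)
    supp_ab = support(a | b)
    supp_a, supp_b = support(a), support(b)
    confidence = supp_ab / supp_a  # assumes supp_a > 0
    return {
        "support": supp_ab,
        "confidence": confidence,
        "lift": confidence / supp_b,  # assumes supp_b > 0
        "leverage": supp_ab - supp_a * supp_b,
    }


metrics = rule_metrics({"diapers"}, {"beer"})
print(metrics)
```

On this toy data, `diapers -> beer` has support 0.4, confidence of about 0.67, lift of about 1.11, and leverage of about 0.04. A lift above 1 and a positive leverage both indicate that the two items occur together more often than they would if they were independent.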

In [69]:
# write your code here

import pandas as pd

df = pd.read_csv("../../data/lab-11/order_products__prior.csv")
products_df = pd.read_csv("../../data/lab-11/products.csv")
aisles_df = pd.read_csv("../../data/lab-11/aisles.csv")
departments_df = pd.read_csv("../../data/lab-11/departments.csv")


In [68]:
products_df[products_df["product_name"].str.lower().str.contains("diaper")]

Unnamed: 0,product_id,product_name,aisle_id,department_id
14,15,Overnight Diapers Size 6,56,18
681,682,Cruisers Diapers Jumbo Pack - Size 5,56,18
764,765,Swaddlers Diapers Jumbo Pack Size Newborn,56,18
878,879,Baby Dry Diapers Size 4,56,18
1303,1304,Little Movers Comfort Fit Size 3 Diapers,56,18
...,...,...,...,...
46607,46608,Free & Clear Newborn Up To 10 lbs Baby Diapers,56,18
46792,46793,Maximum Strength Original Diaper Rash Ointment,6,2
47577,47578,Diapers,56,18
47631,47632,Honest Diapers,56,18


In [74]:
departments_df.loc[17]

department_id        18
department       babies
Name: 17, dtype: object

In [73]:
aisles_df.loc[55]

aisle_id               56
aisle       diapers wipes
Name: 55, dtype: object

In [118]:
from mlxtend.preprocessing import TransactionEncoder

# Group the first 100 rows by order and collect each order's product ids into a sorted list.
q = df[:100].groupby("order_id").agg({"product_id": lambda x: sorted(x)})
q

order_id,product_id
2,"[1819, 9327, 17794, 28985, 30035, 33120, 40141..."
3,"[17461, 17668, 17704, 21903, 24838, 32665, 337..."
4,"[10054, 17616, 21351, 22598, 25146, 26434, 277..."
5,"[6184, 6348, 8479, 9633, 12962, 13176, 13245, ..."
6,"[15873, 40462, 41897]"
7,"[34050, 46802]"
8,[23423]
9,"[432, 2014, 3990, 11182, 14183, 14992, 18362, ..."
10,"[1529, 3464, 4605, 4796, 14992, 21137, 22122, ..."
11,"[1313, 5994, 27085, 30162, 31506]"


In [114]:
df[df["order_id"] == 2]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0
5,2,17794,6,1
6,2,40141,7,1
7,2,1819,8,1
8,2,43668,9,0


order_id
2     [1819, 9327, 17794, 28985, 30035, 33120, 40141...
3     [17461, 17668, 17704, 21903, 24838, 32665, 337...
4     [10054, 17616, 21351, 22598, 25146, 26434, 277...
5     [6184, 6348, 8479, 9633, 12962, 13176, 13245, ...
6                                 [15873, 40462, 41897]
7                                        [34050, 46802]
8                                               [23423]
9     [432, 2014, 3990, 11182, 14183, 14992, 18362, ...
10    [1529, 3464, 4605, 4796, 14992, 21137, 22122, ...
11                    [1313, 5994, 27085, 30162, 31506]
12                                [15221, 30597, 43772]
Name: product_id, dtype: object

In [121]:
# One-hot encode the transactions: one row per order, one boolean column per product.
te = TransactionEncoder()
te_ary = te.fit(q["product_id"]).transform(q["product_id"])
w = pd.DataFrame(te_ary, columns=te.columns_)


pandas.core.frame.DataFrame

## 3. Apriori Algorithm


### 3.1 Frequent Itemsets

Frequent itemsets are the sets of items that appear together in a transaction with a frequency
higher than a given threshold.
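
To make the definition concrete, the following sketch finds frequent itemsets of size 1 and 2 by brute-force counting. It is not the Apriori algorithm, which avoids enumerating candidates that cannot be frequent. The transactions, the threshold `min_support = 0.5`, and the size limit `max_size` are assumptions chosen for illustration, and the comparison uses `>=`, which is a convention you may want to change to a strict `>`.

```python
from collections import Counter
from itertools import combinations

# Toy transactions (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

min_support = 0.5  # assumed threshold, as a fraction of all transactions
max_size = 2       # largest itemset size counted in this sketch

# Count every itemset of size 1..max_size that occurs in some transaction.
counts = Counter()
for transaction in transactions:
    for size in range(1, max_size + 1):
        # Sorting gives each itemset one canonical tuple representation.
        for itemset in combinations(sorted(transaction), size):
            counts[itemset] += 1

# Keep only the itemsets whose support reaches the threshold.
n = len(transactions)
frequent = {
    itemset: count / n
    for itemset, count in counts.items()
    if count / n >= min_support
}

for itemset, supp in sorted(frequent.items()):
    print(itemset, supp)
```

Here all four single items are frequent, but only the pair `("bread", "milk")` reaches the threshold. Brute-force counting grows quickly with the number of products and the itemset size, which is the cost that Apriori's candidate pruning is designed to reduce.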

In [None]:
# write your code here


### 3.2 Algorithm


In [None]:
# write your code here
