In [None]:
# Import standard libraries
import pandas as pd
import numpy as np
import urllib.request
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

# Import from mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Association Rules

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*4yFCbNwp0gGdGR5KbquFHA.png' width="600">

Source: [Comparing Association Rule Mining with other similar methods](https://medium.com/@utkarsh.kant/comparing-association-rule-mining-with-other-similar-methods-d964eaafad91), by Utkarsh Kant


## Content

The goal of this walkthrough is to introduce association rules. After presenting the main concepts, we show how to implement association rule mining in Python. Finally, it will be your turn to practice on a dataset of grocery purchases.

This notebook is organized as follows:
- [Background](#Background)
    - [Objective](#Objective)
    - [Concepts](#Concepts)
    - [Apriori algorithm](#Apriori-algorithm)
- [Implementation](#Implementation)
    - [Discover dataset](#Discover-dataset)
    - [Preprocessing](#Preprocessing)
    - [Applying Apriori algorithm](#Applying-Apriori-algorithm)
    - [Mining Association Rules](#Mining-Association-Rules)
- [Your turn!](#Your-turn!)

## Background

### Objective

[Association rule learning](https://en.wikipedia.org/wiki/Association_rule_learning) aims to discover interesting relations between variables in large datasets. Like clustering, association rule mining is an **unsupervised learning** method. However, while clustering techniques group observations based on similarities, association rule mining finds associations based on co-occurrences.

### Concepts

Our goal is to learn rules of the form $S \Rightarrow i$, telling us that, when a set of items $S$ *occur together*, another item $i$ *frequently occurs with them*. Note that **$\Rightarrow$ does not indicate a causal link**.

The most important relationships can be identified using the *support* and *confidence*:
- The **support** indicates how frequently the itemset appears in our dataset, i.e., it measures the notion *occur together*:
$$\text{support}_{S \Rightarrow i}=\frac{\text{# observations containing }S\text{ and }i}{\text{total number of observations}}$$
- The **confidence** measures how frequently item $i$ appears with the set of items $S$, i.e., the notion *frequently occurs with them*:
$$\text{confidence}_{S \Rightarrow i}=\frac{\text{# observations containing }S\text{ and }i}{\text{# observations containing }S}$$

We need both the support and the confidence to reach a minimum *threshold*. Indeed:
- a low support indicates that the relation may be due to chance and may not generalize.
- a low confidence indicates that the rule is not reliable.

One drawback of the confidence is that $S \Rightarrow i$ can have a high confidence because item $i$ appears frequently, not because it is associated with $S$. To better measure the interestingness of a rule, we can use the **lift**:
$$\text{lift}_{S \Rightarrow i}=\frac{\frac{\text{# observations containing }S\text{ and }i}{\text{# observations containing }S}}{\frac{\text{# observations containing }i}{\text{# total observations}}}$$
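
To make the three formulas concrete, here is a minimal sketch in plain Python. The toy transactions and the helper names (`support`, `confidence`, `lift`) are made up for illustration and are not part of the notebook's dataset or of `mlxtend`.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Share of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(S, i, transactions):
    """Among transactions containing S, the share that also contain i."""
    return support(set(S) | {i}, transactions) / support(S, transactions)

def lift(S, i, transactions):
    """Confidence of S => i divided by the overall frequency of i."""
    return confidence(S, i, transactions) / support({i}, transactions)

S, i = {"bread"}, "butter"
rule_support = support(S | {i}, transactions)
rule_confidence = confidence(S, i, transactions)
rule_lift = lift(S, i, transactions)

print(f"support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}, "
      f"lift={rule_lift:.2f}")
```

In this toy example the rule `{bread} => butter` has a fairly high confidence (about 0.75), yet its lift is slightly below 1: butter is simply frequent overall, which is exactly the drawback of the confidence described above.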

### Apriori algorithm

[Apriori](https://en.wikipedia.org/wiki/Apriori_algorithm) is an algorithm for frequent itemset mining and association rule learning, proposed by Agrawal and Srikant in 1994.

The main idea of Apriori is that the subsets of a frequent itemset must also be frequent.
$$\text{For all sets } X,Y, \text{ if } (X \subseteq Y) \text{ then support}(X) \geq \text{support}(Y) $$
Conversely, if an itemset is not frequent, then none of its supersets can be frequent.

Hence, instead of computing the support of every possible itemset, which would be computationally expensive, Apriori uses a "bottom-up" approach: frequent itemsets are extended one item at a time and tested, while infrequent itemsets and all their supersets are pruned, i.e., not considered.
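
The bottom-up idea can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions (the function name `apriori_sketch` and the toy data are hypothetical); it is not the implementation used by `mlxtend`, which is far more optimized.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([item]) for item in items})
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: discard candidates that have an infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Toy transactions (hypothetical data, for illustration only)
toy = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

Because `jam` is infrequent at level 1, no itemset containing it is ever generated or counted; this is the pruning that saves computation.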

*Reference:* Agrawal, Rakesh, and Ramakrishnan Srikant. "[Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf)" Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994



## Implementation

We will use the Apriori algorithm to mine the frequent itemsets. The `mlxtend` library provides an implementation of this algorithm ([Documentation](http://rasbt.github.io/mlxtend/)). You can install the library using `pip` or `conda`:

```python
!pip install mlxtend
```

### Discover dataset

We are going to use a dataset of customer purchases, available in the `/data` folder of the course repository.

Source: [Harsh-Git-Hub](https://gist.github.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751)

In [None]:
url_retail ='https://raw.githubusercontent.com/michalis0/MGT-502-Data-Science-and-Machine-Learning/main/data/retail.csv'
retail = pd.read_csv(url_retail, sep=',')
retail.head()

Each row of the dataset represents items that were purchased together by a customer, on the same day at the same store.

The dataset is **sparse**, as a relatively high percentage of cells is null (NA, NaN or equivalent). These null values make it difficult to read the table. Let's find out which unique items can actually be found in the table (based on the first column):

In [None]:
# Unique items in first column:
items = retail['0'].unique()

# Print result - we use the join method to print items one by one:
print('Our dataset contains the following items: '
      +', '.join(items))

### Preprocessing

To use the `apriori` function provided by the `mlxtend` library, we need to convert the dataset to the appropriate format. The `apriori` function requires a dataframe whose values are either 0 and 1 or True and False. Since our data consists of strings (names of items), we need to encode it.

We first convert our dataframe to a list of lists, removing the NaN values:

In [None]:
# Convert dataframe to a list of lists (one inner list per transaction)
retail_list = retail.values.tolist()

# Remove NaNs with a nested list comprehension
retail_list_cleaned = [[x for x in y if pd.notna(x)] for y in retail_list]

Let's check the results for a few transactions:

In [None]:
print(retail_list_cleaned[0])
print(retail_list_cleaned[4])

Next, we use the `TransactionEncoder` class of the `mlxtend` library to encode the transactions as `True` or `False` values, with one column per item ([Documentation](http://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/)). The class is imported at the beginning of the notebook with the following line of code:

```python
from mlxtend.preprocessing import TransactionEncoder
```

In [None]:
# Create instance of Encoder
te = TransactionEncoder()

# Fit encoder and transform our list
retail_list_encoded = te.fit(retail_list_cleaned).transform(retail_list_cleaned)

# Create dataframe with results
retail_encoded = pd.DataFrame(retail_list_encoded, columns=te.columns_)
retail_encoded.head()
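
To see what this encoding does conceptually, here is a minimal sketch in plain Python on made-up transactions. It only mimics the idea (one boolean column per distinct item); the variable names are hypothetical and this is not how `TransactionEncoder` is implemented internally.

```python
# Toy transactions (hypothetical data, for illustration only)
toy_transactions = [["bread", "milk"], ["bread", "butter"], ["milk"]]

# One column per distinct item, in alphabetical order
toy_columns = sorted({item for t in toy_transactions for item in t})

# One row per transaction: True if the item was bought, False otherwise
toy_encoded = [[item in t for item in toy_columns] for t in toy_transactions]

print(toy_columns)
for row in toy_encoded:
    print(row)
```

Each row has the same length regardless of how many items the transaction contains, which is what makes the table usable as input for frequent itemset mining.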

### Applying Apriori algorithm

We will now apply the Apriori algorithm using the `apriori` function of the `mlxtend` library to find the frequent itemsets ([Documentation](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/)). Here is the import line:

```python
from mlxtend.frequent_patterns import apriori
```

Here are some of the parameters of the function:
- `df` : DataFrame that has 0 and 1 or True and False as values.
- `min_support` : Floating point value between 0 and 1 that indicates the minimum support required for an itemset to be selected.
- `use_colnames` : If `True`, itemsets are reported with column names rather than column indices, which makes them more readable.
- `max_len` : Maximum length of the itemsets generated. If not set, all possible lengths are evaluated.

As output, we obtain a DataFrame with columns 'support' and 'itemsets' of all itemsets that have a support of at least `min_support` and a length of at most `max_len` (if `max_len` is set).

Let's try with a minimum support of 0.2 and no maximum length:

In [None]:
# Apriori algorithm
freq_items = apriori(retail_encoded, min_support=0.2, use_colnames=True)
freq_items.head(15)

### Mining Association Rules

We will now mine association rules using the `association_rules` function of the `mlxtend` library ([Documentation](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)). Here is the import line:

```python
from mlxtend.frequent_patterns import association_rules
```

As you know by now, frequent if-then associations are called "association rules". They consist of an antecedent (if) and a consequent (then): `{antecedent} => {consequent}`.

The `association_rules` function requires as input parameters a DataFrame of frequent itemsets as well as:
- `metric` : metric to evaluate if a rule is of interest; can be set to "support", "confidence", "lift", "leverage" and "conviction". See the documentation for more information on how these metrics are defined.
- `min_threshold` : minimal threshold for the evaluation metric to decide whether a candidate rule is of interest.

We obtain as output a DataFrame with columns "antecedents" and "consequents" that store itemsets, plus the scoring metric columns "antecedent support", "consequent support", "support", "confidence", "lift", "leverage", and "conviction", for all rules for which the `metric` is at or above the `min_threshold`.
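
Leverage and conviction were not defined in the Concepts section. Using their standard definitions, leverage is $\text{support}(A \cup C) - \text{support}(A)\times\text{support}(C)$ and conviction is $\frac{1 - \text{support}(C)}{1 - \text{confidence}(A \Rightarrow C)}$. The sketch below computes the metrics by hand from three support values; the function name `rule_metrics` and the input numbers are hypothetical and chosen for illustration only.

```python
import math

def rule_metrics(support_a, support_c, support_ac):
    """Metrics of a rule A => C from the supports of A, C, and A together with C."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is infinite for a rule that always holds (confidence of 1)
    conviction = math.inf if confidence == 1 else (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Hypothetical supports: A in 80% of transactions, C in 80%, both in 60%
example = rule_metrics(support_a=0.8, support_c=0.8, support_ac=0.6)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

A leverage of 0 and a lift or conviction of 1 all correspond to antecedent and consequent being independent; here the values fall slightly below those marks, indicating a weak negative association.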

Let's try using the confidence metric with a threshold of 0.6, i.e., we are only keeping rules with a confidence at or above 0.6:

In [None]:
# Generate rules
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)

# Rules sorted by lift
rules.sort_values(by="lift", ascending=False)

The `rules` dataframe contains all the association rules that we determined as interesting. What do you think? Are they really interesting? What does the lift metric tell us?

Use the interactive function below to further explore the above rules with different thresholds for confidence. What do you think about the rules when the threshold is 0.4?

In [None]:
metrics = ['lift', 'support', 'confidence']
thresholds = [0.6, 0.5, 0.4]

@interact
def interactive_association(sort_by = metrics, threshold = thresholds):
    rules_interactive = association_rules(freq_items, metric="confidence", min_threshold= threshold)
    return rules_interactive.sort_values(by=sort_by, ascending=False)

## Your turn!

Now it's your turn to practice. We will use a bigger dataset containing the grocery purchases of customers.

Note that this is not a proper CSV file, since the number of values differs from row to row. Hence, we have to read the file manually.

In [None]:
url_groceries = 'https://raw.githubusercontent.com/michalis0/MGT-502-Data-Science-and-Machine-Learning/main/data/groceries.csv'

# Open and read url, and decode into a string
groceries_str = urllib.request.urlopen(url_groceries).read().decode("utf-8")

# Create a list where each item is one line, i.e., one transaction
groceries_lis = groceries_str.splitlines()

# Create a list of lists where each inner list holds the items of one transaction
# (empty lines, e.g. from a trailing newline, are skipped)
groceries = [line.split(',') for line in groceries_lis if line.strip()]

Here is what our processed data looks like:

In [None]:
groceries[0:4]

- Encode the data in a dataframe of True and False

In [None]:
# YOUR CODE HERE



- Find association rules for the Groceries dataset using **confidence** as the `metric` parameter, a support threshold of **0.001** and confidence threshold of **0.05**

In [None]:
# YOUR CODE HERE



- Extract all the rules you have found containing "bottled beer" as *antecedent*. Which rules do you find interesting? Can you explain them (e.g., potato chips may be frequently bought with bottled beer for "apéro")?

In [None]:
# YOUR CODE HERE



- Feel free to further explore various other thresholds and antecedents...

In [None]:
# YOUR CODE HERE

