<a href="https://colab.research.google.com/github/venkisamarth/Machine-Learning-/blob/main/association_rules_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to Association Rules in Python

##### Association rule mining is a rule-based learning method used to find frequent patterns and correlations in datasets such as transactional and relational data.

##### In essence, it computes co-occurrence statistics between items and expresses them as implications of the form X → Y.

##### For instance, in market basket analysis, {diaper} → {beer} means that if diapers are bought, beer tends to be in the same basket.

#### 4 fundamental concepts in association rules:

* *(Not a rule metric)* Support(X): the fraction of all transactions that contain X.

* Support(X→Y) is the probability that X and Y occur together, measured over all transactions.

* Confidence(X→Y) is the probability that Y occurs given that X is present.

* Lift(X→Y) is the confidence of the rule divided by the support of Y, i.e. how much more likely Y is when X is present than it is overall.

* Conviction(X→Y) measures the strength of the implication. A value of 1 means X and Y are unrelated; the further above 1, the more strongly Y depends on X.
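
Written as formulas, the four rule metrics above look as follows. This uses the standard definitions; the symbol $N$ (total number of transactions) and the set notation are introduced here for illustration only.

$$
\begin{aligned}
\text{support}(X \rightarrow Y) &= \frac{\left|\{\, t : X \cup Y \subseteq t \,\}\right|}{N} \\
\text{confidence}(X \rightarrow Y) &= \frac{\text{support}(X \rightarrow Y)}{\text{support}(X)} \\
\text{lift}(X \rightarrow Y) &= \frac{\text{confidence}(X \rightarrow Y)}{\text{support}(Y)} \\
\text{conviction}(X \rightarrow Y) &= \frac{1 - \text{support}(Y)}{1 - \text{confidence}(X \rightarrow Y)}
\end{aligned}
$$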

So basically it comes down to probability and statistics: a simple but useful decision-making tool for a wide range of uses such as market basket analysis, customer relationship management, recommender systems, marketing activities, network traffic analysis, intrusion detection (fraud & malware detection) and bioinformatics.


# Example 1

### Before getting into the formulas and terminology, let's begin with a simple example.

Mlxtend is a rich and useful library for machine learning. It provides functions for association rule mining, including the widely used *apriori* algorithm.

You can install mlxtend via pip or conda.

In [1]:
# !pip install mlxtend

In [7]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

To use association rules, we first need some data in one-hot encoded format.

Imagine a grocery database in which each order ID is associated with some products...

In [8]:
data = {'ID':[1,2,3,4,5,6],
       'Onion':[1,0,0,1,1,1],
       'Potato':[1,1,0,1,1,1],
       'Burger':[1,1,0,0,1,1],
       'Milk':[0,1,1,1,0,1],
       'Beer':[0,0,1,0,1,0]}


In [9]:
df = pd.DataFrame(data)


In [10]:
df = df[['ID', 'Onion', 'Potato', 'Burger', 'Milk', 'Beer' ]]


In [11]:
df


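|   | ID | Onion | Potato | Burger | Milk | Beer |
|---|----|-------|--------|--------|------|------|
| 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 0 | 1 | 1 | 1 | 0 |
| 2 | 3 | 0 | 0 | 0 | 1 | 1 |
| 3 | 4 | 1 | 1 | 0 | 1 | 0 |
| 4 | 5 | 1 | 1 | 1 | 0 | 1 |
| 5 | 6 | 1 | 1 | 1 | 1 | 0 |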


### Then, we can generate frequent itemsets based on *support*.

Here we need to set the minimum support value, which lies in the range [0, 1]. Using min_support = 0.5 (50%) means we only want itemsets that occur in at least half of the transactions.

`apriori(df, min_support=0.5, use_colnames=False, max_len=None)`

In [12]:
# Pass only the one-hot item columns; ID is an identifier, not an item
frequent_itemsets = apriori(df[['Onion', 'Potato', 'Burger', 'Milk', 'Beer']], min_support=0.5, use_colnames=True)


In [13]:
frequent_itemsets


|   | support | itemsets |
|---|---------|----------|
| 0 | 0.666667 | (Onion) |
| 1 | 0.833333 | (Potato) |
| 2 | 0.666667 | (Burger) |
| 3 | 0.666667 | (Milk) |
| 4 | 0.666667 | (Onion, Potato) |
| 5 | 0.5 | (Onion, Burger) |
| 6 | 0.666667 | (Burger, Potato) |
| 7 | 0.5 | (Milk, Potato) |
| 8 | 0.5 | (Onion, Burger, Potato) |


Itemsets with 1, 2 or 3 items are returned, each with support >= 0.5.

The only itemset with 3 products is [Onion, Potato, Burger].
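
To see where these supports come from, the frequent itemsets can be recomputed by brute force in plain Python. This is only a sketch for a tiny dataset: it checks every combination of items, whereas apriori prunes candidates level by level, so it is not how mlxtend implements the search. The names `rows`, `baskets` and `frequent` are illustrative.

```python
from itertools import combinations

# One-hot rows copied from the Example 1 table (order IDs 1 to 6).
items = ["Onion", "Potato", "Burger", "Milk", "Beer"]
rows = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
]

# Turn each one-hot row into the set of items present in that order.
baskets = [{item for item, flag in zip(items, row) if flag} for row in rows]

min_support = 0.5
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        # Count the orders that contain every item of the candidate itemset.
        count = sum(set(combo) <= basket for basket in baskets)
        support = count / len(baskets)
        if support >= min_support:
            frequent[frozenset(combo)] = support

for itemset, support in frequent.items():
    print(sorted(itemset), round(support, 6))
```

The loop keeps nine itemsets, the same ones listed in the table above, and Beer is dropped because it appears in only two of the six orders.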

### Final Step: generate the rules with their corresponding support, confidence and lift (plus leverage and conviction):

```association_rules(df, metric='confidence', min_threshold=0.8)```

* Here, df is the frequent_itemsets dataframe;

* metric is the measure used to decide whether a rule is kept. You can set it to one of the supported metrics, such as support, confidence, lift, leverage or conviction.

* min_threshold is the minimum value required for the specified metric.
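
The role of the threshold can be illustrated with a small pure-Python sketch that builds rules from the frequent itemsets of this example and keeps those whose confidence reaches 0.8. It is not mlxtend's implementation; the helper `rules_by_confidence` and the hard-coded `counts` (number of orders, out of 6, containing each frequent itemset) are assumptions made for illustration.

```python
from itertools import combinations

# Order counts for the frequent itemsets of Example 1 (support = count / 6).
counts = {
    frozenset({"Onion"}): 4,
    frozenset({"Potato"}): 5,
    frozenset({"Burger"}): 4,
    frozenset({"Milk"}): 4,
    frozenset({"Onion", "Potato"}): 4,
    frozenset({"Onion", "Burger"}): 3,
    frozenset({"Burger", "Potato"}): 4,
    frozenset({"Milk", "Potato"}): 3,
    frozenset({"Onion", "Burger", "Potato"}): 3,
}

def rules_by_confidence(counts, min_threshold):
    """Split each itemset into antecedent -> consequent and filter by confidence."""
    rules = []
    for itemset, count in counts.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                # confidence = support(X and Y) / support(X); the /6 cancels out.
                confidence = count / counts[antecedent]
                if confidence >= min_threshold:
                    rules.append((antecedent, consequent, confidence))
    return rules

high_conf = rules_by_confidence(counts, min_threshold=0.8)
for antecedent, consequent, confidence in high_conf:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

With a confidence threshold of 0.8, five rules survive; every rule with confidence 0.75 is filtered out. Choosing a different metric changes only the quantity that is compared against `min_threshold`.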

In [14]:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)


In [15]:
df


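|   | ID | Onion | Potato | Burger | Milk | Beer |
|---|----|-------|--------|--------|------|------|
| 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 0 | 1 | 1 | 1 | 0 |
| 2 | 3 | 0 | 0 | 0 | 1 | 1 |
| 3 | 4 | 1 | 1 | 0 | 1 | 0 |
| 4 | 5 | 1 | 1 | 1 | 0 | 1 |
| 5 | 6 | 1 | 1 | 1 | 1 | 0 |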


In [16]:
rules


|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Onion) | (Potato) | 0.666667 | 0.833333 | 0.666667 | 1.0 | 1.2 | 0.111111 | inf | 0.5 |
| 1 | (Potato) | (Onion) | 0.833333 | 0.666667 | 0.666667 | 0.8 | 1.2 | 0.111111 | 1.666667 | 1.0 |
| 2 | (Onion) | (Burger) | 0.666667 | 0.666667 | 0.5 | 0.75 | 1.125 | 0.055556 | 1.333333 | 0.333333 |
| 3 | (Burger) | (Onion) | 0.666667 | 0.666667 | 0.5 | 0.75 | 1.125 | 0.055556 | 1.333333 | 0.333333 |
| 4 | (Burger) | (Potato) | 0.666667 | 0.833333 | 0.666667 | 1.0 | 1.2 | 0.111111 | inf | 0.5 |
| 5 | (Potato) | (Burger) | 0.833333 | 0.666667 | 0.666667 | 0.8 | 1.2 | 0.111111 | 1.666667 | 1.0 |
| 6 | (Onion, Burger) | (Potato) | 0.5 | 0.833333 | 0.5 | 1.0 | 1.2 | 0.083333 | inf | 0.333333 |
| 7 | (Onion, Potato) | (Burger) | 0.666667 | 0.666667 | 0.5 | 0.75 | 1.125 | 0.055556 | 1.333333 | 0.333333 |
| 8 | (Burger, Potato) | (Onion) | 0.666667 | 0.666667 | 0.5 | 0.75 | 1.125 | 0.055556 | 1.333333 | 0.333333 |
| 9 | (Onion) | (Burger, Potato) | 0.666667 | 0.666667 | 0.5 | 0.75 | 1.125 | 0.055556 | 1.333333 | 0.333333 |


### Interpreting the result:

Every rule returned has a lift above 1, which means the antecedent and consequent occur together more often than would be expected if they were independent, given how often each appears on its own.

Several are high in confidence as well, but domain knowledge is still needed to explain why the pattern occurs.

In [17]:
rules[(rules['lift'] > 1.125) & (rules['confidence'] > 0.8)]


|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Onion) | (Potato) | 0.666667 | 0.833333 | 0.666667 | 1.0 | 1.2 | 0.111111 | inf | 0.5 |
| 4 | (Burger) | (Potato) | 0.666667 | 0.833333 | 0.666667 | 1.0 | 1.2 | 0.111111 | inf | 0.5 |
| 6 | (Onion, Burger) | (Potato) | 0.5 | 0.833333 | 0.5 | 1.0 | 1.2 | 0.083333 | inf | 0.333333 |


Filtering on lift and confidence returns the rules whose items are most strongly associated in this data.

We can see that:

* **If Onion or Burger is in a user's basket, it is highly likely that the user will buy Potato as well.**
* **If Burger and Onion are both in a user's basket, it is highly likely that the user will also buy Potato.**

### Some notes on Lift, Conviction & Leverage:


1.  Lift(X→Y): how much more likely Y is to be bought when X is present than it is overall, which takes the popularity of Y into account.
    > When Lift=1, X has no impact on Y  
    > When Lift>1, X and Y occur together more often than expected under independence
2.  Conviction(X→Y): a measure of the implication, with value 1 if the items are unrelated.
    > A high conviction value means that the consequent depends strongly on the antecedent. With a perfect confidence score the denominator becomes 0 (due to 1 - 1), and the conviction is defined as 'inf'.
3.  Leverage(X→Y): the difference between the observed frequency of X and Y appearing together and the frequency that would be expected if X and Y were independent. A leverage value of 0 indicates independence.
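
As a quick check of these three definitions, the values for Potato → Onion in Example 1 can be recomputed by hand. The sketch below uses the standard formulas with exact fractions; the function names are illustrative and are not part of mlxtend.

```python
from fractions import Fraction

def lift(sup_x, sup_y, sup_xy):
    """Observed co-occurrence divided by the co-occurrence expected under independence."""
    return sup_xy / (sup_x * sup_y)

def leverage(sup_x, sup_y, sup_xy):
    """Observed co-occurrence minus the co-occurrence expected under independence."""
    return sup_xy - sup_x * sup_y

def conviction(sup_x, sup_y, sup_xy):
    """(1 - support(Y)) / (1 - confidence), defined as infinity when confidence is 1."""
    confidence = sup_xy / sup_x
    if confidence == 1:
        return float("inf")
    return (1 - sup_y) / (1 - confidence)

# Supports from Example 1, as exact fractions of the 6 orders.
sup_potato, sup_onion, sup_both = Fraction(5, 6), Fraction(4, 6), Fraction(4, 6)

potato_to_onion = {
    "lift": lift(sup_potato, sup_onion, sup_both),
    "leverage": leverage(sup_potato, sup_onion, sup_both),
    "conviction": conviction(sup_potato, sup_onion, sup_both),
}
print({name: round(float(value), 6) for name, value in potato_to_onion.items()})

# Reversing the rule gives confidence 1, so conviction is infinite.
onion_to_potato_conviction = conviction(sup_onion, sup_potato, sup_both)
print(onion_to_potato_conviction)
```

The results are 6/5, 1/9 and 5/3, which agree with the 1.2, 0.111111 and 1.666667 reported for Potato → Onion in the rules table. Lift and leverage are symmetric in X and Y, while conviction depends on the direction of the rule.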

# Example 2

In [18]:
retail_shopping_basket = {'ID':[1,2,3,4,5,6],
                         'Basket':[['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
                                   ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
                                   ['Soda', 'Chips', 'Milk'],
                                   ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
                                   ['Soda', 'Coffee', 'Milk', 'Bread'],
                                   ['Beer', 'Chips']
                                  ]
                         }


In [19]:
retail = pd.DataFrame(retail_shopping_basket)
retail


|   | ID | Basket |
|---|----|--------|
| 0 | 1 | [Beer, Diaper, Pretzels, Chips, Aspirin] |
| 1 | 2 | [Diaper, Beer, Chips, Lotion, Juice, BabyFood,... |
| 2 | 3 | [Soda, Chips, Milk] |
| 3 | 4 | [Soup, Beer, Diaper, Milk, IceCream] |
| 4 | 5 | [Soda, Coffee, Milk, Bread] |
| 5 | 6 | [Beer, Chips] |


In [20]:
retail = retail[['ID', 'Basket']]


In [21]:
pd.options.display.max_colwidth=100


Suppose we have customer IDs, each mapped to a list of basket items:

In [22]:
retail


|   | ID | Basket |
|---|----|--------|
| 0 | 1 | [Beer, Diaper, Pretzels, Chips, Aspirin] |
| 1 | 2 | [Diaper, Beer, Chips, Lotion, Juice, BabyFood, Milk] |
| 2 | 3 | [Soda, Chips, Milk] |
| 3 | 4 | [Soup, Beer, Diaper, Milk, IceCream] |
| 4 | 5 | [Soda, Coffee, Milk, Bread] |
| 5 | 6 | [Beer, Chips] |


First one-hot encode the basket, but how?

In [23]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
retail = pd.DataFrame(mlb.fit_transform(retail.Basket), columns=mlb.classes_)


In [24]:
retail


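|   | Aspirin | BabyFood | Beer | Bread | Chips | Coffee | Diaper | IceCream | Juice | Lotion | Milk | Pretzels | Soda | Soup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |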


Using scikit-learn's `MultiLabelBinarizer`, we can easily one-hot encode lists of items stored in a dataframe's column!
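
To make explicit what the one-hot table contains, the same encoding can be written out by hand. This is a minimal sketch in plain Python; the helper name `one_hot_rows` is hypothetical and is not part of pandas, scikit-learn or mlxtend.

```python
# Baskets copied from the retail example above.
baskets = [
    ["Beer", "Diaper", "Pretzels", "Chips", "Aspirin"],
    ["Diaper", "Beer", "Chips", "Lotion", "Juice", "BabyFood", "Milk"],
    ["Soda", "Chips", "Milk"],
    ["Soup", "Beer", "Diaper", "Milk", "IceCream"],
    ["Soda", "Coffee", "Milk", "Bread"],
    ["Beer", "Chips"],
]

def one_hot_rows(baskets):
    """Return the sorted item names and one 0/1 row per basket."""
    columns = sorted({item for basket in baskets for item in basket})
    rows = [[int(item in basket) for item in columns] for basket in baskets]
    return columns, rows

columns, rows = one_hot_rows(baskets)
print(columns)
for row in rows:
    print(row)
```

Each column is one distinct item, in alphabetical order, and each row marks with 1 the items present in that basket. Passing the result to `pd.DataFrame(rows, columns=columns)` gives a table with the same layout as the encoded `retail` dataframe.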

In [25]:
frequent_itemsets_2 = apriori(retail, min_support=0.2, use_colnames=True)


In [26]:
frequent_itemsets_2


|   | support | itemsets |
|---|---------|----------|
| 0 | 0.666667 | (Beer) |
| 1 | 0.666667 | (Chips) |
| 2 | 0.5 | (Diaper) |
| 3 | 0.666667 | (Milk) |
| 4 | 0.333333 | (Soda) |
| 5 | 0.5 | (Chips, Beer) |
| 6 | 0.5 | (Diaper, Beer) |
| 7 | 0.333333 | (Milk, Beer) |
| 8 | 0.333333 | (Chips, Diaper) |
| 9 | 0.333333 | (Chips, Milk) |


Judging by Support(X→Y) alone, [Beer, Chips] and [Beer, Diaper] are the two most frequent baskets of interest.

But which one is more correlated than the other?

In [27]:
association_rules(frequent_itemsets_2, metric='lift', min_threshold=1.5)


|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Diaper) | (Beer) | 0.5 | 0.666667 | 0.5 | 1.0 | 1.5 | 0.166667 | inf | 0.666667 |
| 1 | (Beer) | (Diaper) | 0.666667 | 0.5 | 0.5 | 0.75 | 1.5 | 0.166667 | 2.0 | 1.0 |
| 2 | (Milk) | (Soda) | 0.666667 | 0.333333 | 0.333333 | 0.5 | 1.5 | 0.111111 | 1.333333 | 1.0 |
| 3 | (Soda) | (Milk) | 0.333333 | 0.666667 | 0.333333 | 1.0 | 1.5 | 0.111111 | inf | 0.5 |
| 4 | (Chips, Diaper) | (Beer) | 0.333333 | 0.666667 | 0.333333 | 1.0 | 1.5 | 0.111111 | inf | 0.5 |
| 5 | (Beer) | (Chips, Diaper) | 0.666667 | 0.333333 | 0.333333 | 0.5 | 1.5 | 0.111111 | 1.333333 | 1.0 |
| 6 | (Diaper, Milk) | (Beer) | 0.333333 | 0.666667 | 0.333333 | 1.0 | 1.5 | 0.111111 | inf | 0.5 |
| 7 | (Milk, Beer) | (Diaper) | 0.333333 | 0.5 | 0.333333 | 1.0 | 2.0 | 0.166667 | inf | 0.75 |
| 8 | (Diaper) | (Milk, Beer) | 0.5 | 0.333333 | 0.333333 | 0.666667 | 2.0 | 0.166667 | 2.0 | 1.0 |
| 9 | (Beer) | (Diaper, Milk) | 0.666667 | 0.333333 | 0.333333 | 0.5 | 1.5 | 0.111111 | 1.333333 | 1.0 |
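
To answer the question of which pair is more strongly associated, lift can be computed for both pairs directly from the basket counts. The sketch below is a hand calculation under the standard lift formula; the `count` dictionary and the helper `pair_lift` are illustrative names, with counts taken from the six baskets of this example.

```python
from fractions import Fraction

n_baskets = 6
# Number of baskets (out of 6) containing each item or pair of items.
count = {
    "Beer": 4,
    "Chips": 4,
    "Diaper": 3,
    ("Beer", "Chips"): 3,
    ("Beer", "Diaper"): 3,
}

def pair_lift(x, y):
    """lift = support(x and y) / (support(x) * support(y)), as an exact fraction."""
    sup_x = Fraction(count[x], n_baskets)
    sup_y = Fraction(count[y], n_baskets)
    sup_xy = Fraction(count[(x, y)], n_baskets)
    return sup_xy / (sup_x * sup_y)

lift_beer_chips = pair_lift("Beer", "Chips")
lift_beer_diaper = pair_lift("Beer", "Diaper")
print(float(lift_beer_chips), float(lift_beer_diaper))
```

Both pairs have the same support of 0.5, but Beer and Diaper reach a lift of 1.5 while Beer and Chips reach only 1.125. That is why the Beer and Chips rules do not appear in the table above, which keeps only rules with lift of at least 1.5.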
