
# Association Rules Mining
Once the item sets have been generated using apriori, we can start mining association rules.  Given that we are only looking at item sets of size 2, the association rules we will generate will be of the form {A} -> {B}.  One common application of these rules is in the domain of recommender systems, where customers who purchased item A are recommended item B.

Here are 3 key metrics to consider when evaluating association rules:

1. <b>support</b>  
    This is the percentage of orders that contain the item set. In the example above, there are 5 orders in total 
    and {apple,egg} occurs in 3 of them, so: 
       
                    support{apple,egg} = 3/5 or 60%
        
    The minimum support threshold required by apriori can be set based on knowledge of your domain.  In this 
    grocery dataset for example, since there could be thousands of distinct items and an order can contain 
    only a small fraction of these items, setting the support threshold to 0.01% may be reasonable.<br><br><br>
    
2. <b>confidence</b>  
    Given two items, A and B, confidence measures the percentage of times that item B is purchased, given that 
    item A was purchased. This is expressed as:
       
                    confidence{A->B} = support{A,B} / support{A}   
                    
    Confidence values range from 0 to 1, where 0 indicates that B is never purchased when A is purchased, and 1 
    indicates that B is always purchased whenever A is purchased.  Note that the confidence measure is directional.
    This means that we can also compute the percentage of times that item A is purchased, given that item B was 
    purchased:
       
                    confidence{B->A} = support{A,B} / support{B}    
                    
    In our example, the percentage of times that egg is purchased, given that apple was purchased is:  
       
                    confidence{apple->egg} = support{apple,egg} / support{apple}
                                           = (3/5) / (4/5)
                                           = 0.75 or 75%

    A confidence value of 0.75 implies that out of all orders that contain apple, 75% of them also contain egg.  Now, 
    we look at the confidence measure in the opposite direction (ie: egg->apple): 
       
                    confidence{egg->apple} = support{apple,egg} / support{egg}
                                           = (3/5) / (3/5)
                                           = 1 or 100%  
                                           
    Here we see that all of the orders that contain egg also contain apple.  But, does this mean that there is a 
    relationship between these two items, or are they occurring together in the same orders simply by chance?  To 
    answer this question, we look at another measure which takes into account the popularity of <i>both</i> items.<br><br><br>  
    
3. <b>lift</b>  
    Given two items, A and B, lift indicates whether there is a relationship between A and B, or whether the two items 
    are occurring together in the same orders simply by chance (ie: at random).  Unlike the confidence metric whose 
    value may vary depending on direction (eg: confidence{A->B} may be different from confidence{B->A}), 
    lift has no direction. This means that the lift{A,B} is always equal to the lift{B,A}: 
       
                    lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B})   
    
    In our example, we compute lift as follows:
    
         lift{apple,egg} = lift{egg,apple} = support{apple,egg} / (support{apple} * support{egg})
                         = (3/5) / (4/5 * 3/5) 
                         = 1.25    
               
    One way to understand lift is to think of the denominator as the likelihood that A and B will appear in the same 
    order if there was <i>no</i> relationship between them. In the example above, if apple occurred in 80% of the
    orders and egg occurred in 60% of the orders, then if there was no relationship between them, we would 
    <i>expect</i> both of them to show up together in the same order 48% of the time (ie: 80% * 60%).  The numerator, 
    on the other hand, represents how often apple and egg <i>actually</i> appear together in the same order.  In 
    this example, that is 60% of the time.  Dividing the numerator by the denominator tells us how many times more 
    often apple and egg actually appear in the same order than they would if there were no relationship between 
    them (ie: if they occurred together simply at random).  
    
    In summary, lift can take on the following values:
    
        * lift = 1 implies no relationship between A and B. 
          (ie: A and B occur together only by chance)
      
        * lift > 1 implies that there is a positive relationship between A and B.
          (ie:  A and B occur together more often than random)
    
        * lift < 1 implies that there is a negative relationship between A and B.
          (ie:  A and B occur together less often than random)
        
    In our example, apple and egg occur together 1.25 times as often as we would expect at random, so we conclude 
    that there exists a positive relationship between them.
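
The three metrics can be checked with a few lines of plain Python. The sketch below is an illustration only: the five orders are hypothetical, chosen so that apple appears in 4 orders, egg in 3, and both together in 3, matching the counts used in the example. The helper names `support`, `confidence`, and `lift` are made up for this sketch, and `Fraction` is used so the results are exact.

```python
from fractions import Fraction

# Hypothetical orders: apple is in 4 of 5, egg in 3 of 5, both in 3 of 5.
orders = [
    {"apple", "egg", "milk"},
    {"apple", "egg"},
    {"apple", "egg", "bread"},
    {"apple", "bread"},
    {"milk", "bread"},
]

def support(itemset, orders):
    """Fraction of orders that contain every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= order for order in orders), len(orders))

def confidence(antecedent, consequent, orders):
    """support{A,B} / support{A}: how often B appears when A is present."""
    both = set(antecedent) | set(consequent)
    return support(both, orders) / support(antecedent, orders)

def lift(a, b, orders):
    """support{A,B} / (support{A} * support{B}); symmetric in A and B."""
    both = set(a) | set(b)
    return support(both, orders) / (support(a, orders) * support(b, orders))

print(float(support({"apple", "egg"}, orders)))       # 0.6
print(float(confidence({"apple"}, {"egg"}, orders)))  # 0.75
print(float(confidence({"egg"}, {"apple"}, orders)))  # 1.0
print(float(lift({"apple"}, {"egg"}, orders)))        # 1.25
```

Swapping the arguments of `lift` gives the same value, while swapping the arguments of `confidence` does not, which mirrors the directional versus non-directional distinction described above.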
   
Armed with knowledge of apriori and association rules mining, let's dive into the data and code to see what relationships we unravel!

Reference: [link](http://pbpython.com/market-basket-analysis.html)

In [0]:
import pandas as pd

## Install packages
!pip install mlxtend
!pip install xlrd

## import packages
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules




### Load Data

In [0]:
data = pd.read_csv('Retail_Data.csv')

In [0]:
data.head()

Unnamed: 0,Trans_ID,Product1,Product2,Product3
0,1,Bread,Butter,Dairy
1,2,Nachos,Butter,Dairy
2,3,Juice,Jam,Egg
3,4,Juice,Jam,Egg
4,5,Juice,Vegetable,Salad


In [0]:
data.shape

(2000, 4)

### Set Features

In [0]:
features = ['Product1', 'Product2', 'Product3']
X = data[features]

### Encode

In [0]:
X_encoded = pd.get_dummies(X)

X_encoded.head()

Unnamed: 0,Product1_Bread,Product1_Fruits,Product1_Juice,Product1_Nachos,Product2_Butter,Product2_Jam,Product2_Salsa,Product2_Vegetable,Product3_Dairy,Product3_Egg,Product3_Salad
0,1,0,0,0,1,0,0,0,1,0,0
1,0,0,0,1,1,0,0,0,1,0,0
2,0,0,1,0,0,1,0,0,0,1,0
3,0,0,1,0,0,1,0,0,0,1,0
4,0,0,1,0,0,0,0,1,0,0,1


In [0]:
X_encoded.shape

(2000, 11)
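
Note that `pd.get_dummies` prefixes each indicator column with the name of its source column, so the same product appearing in two different columns would be encoded as two distinct items. The sketch below shows this on a tiny made-up frame (not the retail data). It also passes `dtype=bool`, on the assumption that a boolean frame is the preferred input for `apriori` in recent mlxtend versions.

```python
import pandas as pd

# Made-up toy orders (not the retail dataset) using the same column layout.
toy = pd.DataFrame({
    "Product1": ["Bread", "Juice", "Bread"],
    "Product2": ["Butter", "Jam", "Jam"],
})

# One indicator column per (source column, value) pair.
toy_encoded = pd.get_dummies(toy, dtype=bool)

print(list(toy_encoded.columns))
# The mean of a boolean column is the support of that single item.
print(toy_encoded.mean())
```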

### Compute Support

Now that the data is structured properly, we can generate frequent item sets that have a support of at least 10% (this number was chosen so that I could get enough useful examples):

In [0]:
frequent_itemsets = apriori(X_encoded, min_support=0.10, use_colnames=True)

In [0]:
frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.351,(Product1_Bread)
1,0.1825,(Product1_Fruits)
2,0.2465,(Product1_Juice)
3,0.22,(Product1_Nachos)
4,0.2465,(Product2_Butter)


### Compute Confidence and Lift

The final step is to generate the rules with their corresponding support, confidence and lift. Here we keep only rules with a *lift* of at least 1:

In [0]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold = 1)
rules.head(20)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Product2_Butter),(Product1_Bread),0.2465,0.351,0.2145,0.870183,2.479153,0.127978,4.999328
1,(Product1_Bread),(Product2_Butter),0.351,0.2465,0.2145,0.611111,2.479153,0.127978,1.937571
2,(Product2_Jam),(Product1_Bread),0.341,0.351,0.128,0.375367,1.06942,0.008309,1.039009
3,(Product1_Bread),(Product2_Jam),0.351,0.341,0.128,0.364672,1.06942,0.008309,1.03726
4,(Product3_Dairy),(Product1_Bread),0.203,0.351,0.1005,0.495074,1.410467,0.029247,1.285337
5,(Product1_Bread),(Product3_Dairy),0.351,0.203,0.1005,0.286325,1.410467,0.029247,1.116754
6,(Product3_Egg),(Product1_Bread),0.576,0.351,0.2505,0.434896,1.239019,0.048324,1.148461
7,(Product1_Bread),(Product3_Egg),0.351,0.576,0.2505,0.713675,1.239019,0.048324,1.480836
8,(Product1_Fruits),(Product2_Vegetable),0.1825,0.216,0.1705,0.934247,4.325216,0.13108,11.923333
9,(Product2_Vegetable),(Product1_Fruits),0.216,0.1825,0.1705,0.789352,4.325216,0.13108,3.880879


In [0]:
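# save the rules to a CSV file
rules.to_csv('assoc_rules.csv')

# download the file (this only works inside Google Colab)
try:
    from google.colab import files
    files.download('assoc_rules.csv')
except ImportError:
    pass  # not running in Colab; the CSV is already saved locally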

### High Support 
Sorting the rules by antecedent support puts the rules with the most frequently bought antecedents first.  
Here, the most frequently bought items are eggs, bread, jam, juice, butter, etc.

In [0]:
rules.sort_values('antecedent support', ascending = False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
24,(Product3_Egg),(Product2_Salsa),0.576,0.1965,0.1175,0.203993,1.038133,0.004316,1.009413
7,(Product3_Egg),(Product1_Bread),0.576,0.351,0.2505,0.434896,1.239019,0.048324,1.148461
21,(Product3_Egg),(Product2_Butter),0.576,0.2465,0.1945,0.337674,1.369873,0.052516,1.137657
15,(Product3_Egg),(Product1_Juice),0.576,0.2465,0.208,0.361111,1.464954,0.066016,1.179391
23,(Product3_Egg),(Product2_Jam),0.576,0.341,0.257,0.446181,1.308447,0.060584,1.189918
45,(Product3_Egg),"(Product2_Jam, Product1_Juice)",0.576,0.201,0.201,0.348958,1.736111,0.085224,1.227264
33,(Product3_Egg),"(Product1_Bread, Product2_Butter)",0.576,0.2145,0.1945,0.337674,1.574236,0.070948,1.185971
50,(Product3_Egg),"(Product1_Nachos, Product2_Salsa)",0.576,0.188,0.1175,0.203993,1.085069,0.009212,1.020092
6,(Product1_Bread),(Product3_Egg),0.351,0.576,0.2505,0.713675,1.239019,0.048324,1.480836
31,(Product1_Bread),"(Product2_Butter, Product3_Egg)",0.351,0.1945,0.1945,0.554131,2.849003,0.126231,1.806585


### High Confidence 
Rules with the highest confidence are those whose consequent is almost always bought whenever the antecedent is bought.  
For example, out of all the people who purchased eggs and salsa, 100% also bought nachos.  
Similarly, 93% of the people who bought fruits also bought vegetables and salad.

In [0]:
rules.sort_values('confidence', ascending = False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
40,"(Product2_Jam, Product1_Juice)",(Product3_Egg),0.201,0.576,0.201,1.0,1.736111,0.085224,inf
46,"(Product1_Nachos, Product3_Egg)",(Product2_Salsa),0.1175,0.1965,0.1175,1.0,5.089059,0.094411,inf
36,"(Product2_Vegetable, Product1_Fruits)",(Product3_Salad),0.1705,0.221,0.1705,1.0,4.524887,0.13282,inf
11,(Product1_Fruits),(Product3_Salad),0.1825,0.221,0.1825,1.0,4.524887,0.142168,inf
30,"(Product2_Butter, Product3_Egg)",(Product1_Bread),0.1945,0.351,0.1945,1.0,2.849003,0.126231,inf
48,"(Product3_Egg, Product2_Salsa)",(Product1_Nachos),0.1175,0.22,0.1175,1.0,4.545455,0.09165,inf
27,(Product2_Vegetable),(Product3_Salad),0.216,0.221,0.209,0.967593,4.378247,0.161264,24.037714
42,"(Product1_Juice, Product3_Egg)",(Product2_Jam),0.208,0.341,0.201,0.966346,2.83386,0.130072,19.581714
17,(Product2_Salsa),(Product1_Nachos),0.1965,0.22,0.188,0.956743,4.348832,0.14477,18.031765
26,(Product3_Salad),(Product2_Vegetable),0.221,0.216,0.209,0.945701,4.378247,0.161264,14.438667


### High Lift  
Rules with high lift are the strongest candidates for recommendations, because the items occur together far more often than chance would predict.  
Below, for example, if a person buys salsa, we should recommend nachos and egg along with it.  
Similarly, if a person buys salad, we recommend fruits along with it.

In [0]:
rules.sort_values('lift', ascending = False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
51,(Product2_Salsa),"(Product1_Nachos, Product3_Egg)",0.1965,0.1175,0.1175,0.597964,5.089059,0.094411,2.195079
46,"(Product1_Nachos, Product3_Egg)",(Product2_Salsa),0.1175,0.1965,0.1175,1.0,5.089059,0.094411,inf
49,(Product1_Nachos),"(Product3_Egg, Product2_Salsa)",0.22,0.1175,0.1175,0.534091,4.545455,0.09165,1.894146
48,"(Product3_Egg, Product2_Salsa)",(Product1_Nachos),0.1175,0.22,0.1175,1.0,4.545455,0.09165,inf
11,(Product1_Fruits),(Product3_Salad),0.1825,0.221,0.1825,1.0,4.524887,0.142168,inf
37,(Product3_Salad),"(Product2_Vegetable, Product1_Fruits)",0.221,0.1705,0.1705,0.771493,4.524887,0.13282,3.630089
36,"(Product2_Vegetable, Product1_Fruits)",(Product3_Salad),0.1705,0.221,0.1705,1.0,4.524887,0.13282,inf
10,(Product3_Salad),(Product1_Fruits),0.221,0.1825,0.1825,0.825792,4.524887,0.142168,4.692662
39,(Product1_Fruits),"(Product3_Salad, Product2_Vegetable)",0.1825,0.209,0.1705,0.934247,4.470079,0.132358,12.029792
34,"(Product3_Salad, Product2_Vegetable)",(Product1_Fruits),0.209,0.1825,0.1705,0.815789,4.470079,0.132358,4.437857
