## **Association Rule Learning**

Association Rule Learning is a technique for identifying underlying relationships between items. Take a supermarket, where customers can buy a wide variety of items. There is usually a pattern in what customers buy: for instance, parents with babies tend to buy baby products such as milk and diapers. In short, transactions follow patterns, and a retailer can generate more profit by identifying the relationships between items purchased across transactions.

For instance, if items A and B are frequently bought together, several steps can be taken to increase profit:

* A and B can be placed together, so that a customer who buys one of them does not have to go far to buy the other.
* People who buy one of the products can be targeted through an advertisement campaign to buy the other.
* Collective discounts can be offered on these products if the customer buys both of them.
* Both A and B can be packaged together.

The process of identifying associations between products is called Association Rule Learning.

![Alt text](shopping_baskets.png)

## **Apriori Algorithm for Association Rule Learning**
Several algorithms have been developed to implement Association Rule Learning, and Apriori is one of them. We will first study the theory behind the Apriori algorithm and then implement it in Python.

The Apriori algorithm relies on three measures:

* Support
* Confidence
* Lift

#### **Support**
Support refers to the baseline popularity of an item. It is calculated as the number of transactions containing the item divided by the total number of transactions. For example, the support for item B is:

**Support(B) = (Transactions containing B) / (Total Transactions)**
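As a small worked example, the sketch below computes support on five made-up baskets. The item names and the `support` helper are illustrative assumptions and are not part of the dataset used later.

```python
# Five made-up shopping baskets, used only for illustration.
transactions = [
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "bread"},
    {"diapers", "beer"},
    {"milk", "diapers", "beer"},
]


def support(item, baskets):
    """Fraction of baskets that contain `item`."""
    return sum(item in basket for basket in baskets) / len(baskets)


# Diapers appear in 4 of the 5 baskets.
support_diapers = support("diapers", transactions)
print(support_diapers)
```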

#### **Confidence**
Confidence refers to the likelihood that item B is also bought when item A is bought. It is calculated as the number of transactions in which A and B are bought together divided by the total number of transactions in which A is bought. Mathematically, it can be represented as:

**Confidence(A, B) = (Transactions containing both A and B) / (Transactions containing A)**
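Confidence is directional: Confidence(A, B) and Confidence(B, A) generally differ. The sketch below illustrates this on made-up baskets; the `confidence` helper is a hypothetical name and assumes item `a` occurs in at least one basket.

```python
# Five made-up shopping baskets, used only for illustration.
transactions = [
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "bread"},
    {"diapers", "beer"},
    {"milk", "diapers", "beer"},
]


def confidence(a, b, baskets):
    """Share of the baskets containing `a` that also contain `b`."""
    with_a = [basket for basket in baskets if a in basket]
    with_both = [basket for basket in with_a if b in basket]
    return len(with_both) / len(with_a)


# Both beer baskets also contain diapers...
conf_beer_diapers = confidence("beer", "diapers", transactions)
# ...but only 2 of the 4 diaper baskets contain beer.
conf_diapers_beer = confidence("diapers", "beer", transactions)
print(conf_beer_diapers, conf_diapers_beer)
```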

#### **Lift**
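Lift measures how much more likely B is to be bought when A is bought, compared with the baseline popularity of B. It is calculated by dividing Confidence(A, B) by Support(B). Since Confidence(A, B) = Support(A, B) / Support(A), where Support(A, B) is the fraction of transactions containing both A and B, lift can be represented as:

**Lift(A, B) = (Support(A, B)) / (Support(A) * Support(B))**

A lift greater than 1 means A and B are bought together more often than would be expected if they were independent; a lift below 1 means they are bought together less often.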
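To see that dividing confidence by Support(B) and using the support-only formula give the same number, here is a minimal sketch on made-up baskets. The data and the variadic `support` helper are assumptions for illustration.

```python
# Five made-up shopping baskets, used only for illustration.
transactions = [
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "bread"},
    {"diapers", "beer"},
    {"milk", "diapers", "beer"},
]


def support(*items):
    """Fraction of baskets that contain every item in `items`."""
    hits = sum(all(item in basket for item in items) for basket in transactions)
    return hits / len(transactions)


a, b = "diapers", "beer"

# Route 1: Confidence(A, B) / Support(B)
confidence_ab = support(a, b) / support(a)
lift_from_confidence = confidence_ab / support(b)

# Route 2: Support(A, B) / (Support(A) * Support(B))
lift_from_supports = support(a, b) / (support(a) * support(b))

print(lift_from_confidence, lift_from_supports)
```

Both routes give a lift above 1 for this toy data, meaning diaper baskets contain beer more often than baskets in general do.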

![Alt text](grocery_carts.png)
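The measures above describe individual rules. The search for frequent itemsets itself rests on one observation: an itemset can only be frequent if all of its subsets are frequent. The sketch below is a deliberately simple, unoptimized illustration of that level-wise search on made-up baskets. The function name `apriori_sketch` is hypothetical, and this is not how `mlxtend` implements `apriori` internally.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: every single item is a candidate.
    current = keep_frequent(
        {frozenset([item]) for basket in baskets for item in basket}
    )
    frequent = dict(current)

    k = 2
    while current:
        # Join: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Made-up baskets, used only for illustration.
transactions = [
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "bread"},
    {"diapers", "beer"},
    {"milk", "diapers", "beer"},
]
result = apriori_sketch(transactions, min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

With this data, no three-item candidate survives the prune step, because each one contains a pair that is below the threshold, so its support never has to be counted.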

## **Apriori Algorithm Implementation**

1. Prepare the ARL data structure (Invoice-Product matrix)
2. Reveal the association rules
3. Suggest products to users at the basket stage

In [1]:
#######################################################################
####### Prepare the ARL data structure (Invoice-Product matrix) #######
#######################################################################

# !pip install pandas
# !pip install mlxtend

import os
import pandas as pd

from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv(os.path.realpath("data.csv"))
df.head()

|   | UserId | ServiceId | CategoryId | CreateDate |
|---|---|---|---|---|
| 0 | 25446 | 4 | 5 | 2017-08-06 16:11:00 |
| 1 | 22948 | 48 | 5 | 2017-08-06 16:12:00 |
| 2 | 10618 | 0 | 8 | 2017-08-06 16:13:00 |
| 3 | 7256 | 9 | 4 | 2017-08-06 16:14:00 |
| 4 | 25446 | 48 | 5 | 2017-08-06 16:16:00 |


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162523 entries, 0 to 162522
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   UserId      162523 non-null  int64 
 1   ServiceId   162523 non-null  int64 
 2   CategoryId  162523 non-null  int64 
 3   CreateDate  162523 non-null  object
dtypes: int64(3), object(1)
memory usage: 5.0+ MB


In fact, the main difficulty is getting the data into a shape the Apriori algorithm can use. The unique values of the "BasketID" variable become the rows of the table, and the unique values of the "Service" variable become its columns (both variables are constructed in the cells that follow). Here is a piece of the table we are ultimately trying to build:

![Alt text](output_example.png)
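Before working through the real data, the target shape can be illustrated on a tiny made-up purchase log. The column names mirror the notebook, but the rows are invented, and `pd.crosstab` is used here as one possible shortcut rather than the approach taken in the cells below.

```python
import pandas as pd

# Toy purchase log; the values are made up for illustration.
toy = pd.DataFrame({
    "UserId":     [1, 1, 1, 2, 2],
    "ServiceId":  [4, 48, 4, 48, 9],
    "CategoryId": [5, 5, 5, 5, 4],
    "CreateDate": pd.to_datetime([
        "2017-08-06", "2017-08-07", "2017-09-01", "2017-08-06", "2017-08-09",
    ]),
})

# A service is identified by its ServiceId and CategoryId together.
toy["Service"] = toy["ServiceId"].astype(str) + "_" + toy["CategoryId"].astype(str)

# A basket is everything one user bought within one calendar month.
year_month = toy["CreateDate"].dt.to_period("M").astype(str)
toy["BasketID"] = toy["UserId"].astype(str) + "_" + year_month

# Count purchases per (basket, service) and turn the counts into flags.
basket_matrix = pd.crosstab(toy["BasketID"], toy["Service"]) > 0
print(basket_matrix)
```

Each row is one basket and each column one service, with `True` where the basket contains that service.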

In [3]:
# Combine ServiceId and CategoryId into a single service key, e.g. "4_5".
df["Service"] = df["ServiceId"].astype(str) + "_" + df["CategoryId"].astype(str)
df.head()

|   | UserId | ServiceId | CategoryId | CreateDate | Service |
|---|---|---|---|---|---|
| 0 | 25446 | 4 | 5 | 2017-08-06 16:11:00 | 4_5 |
| 1 | 22948 | 48 | 5 | 2017-08-06 16:12:00 | 48_5 |
| 2 | 10618 | 0 | 8 | 2017-08-06 16:13:00 | 0_8 |
| 3 | 7256 | 9 | 4 | 2017-08-06 16:14:00 | 9_4 |
| 4 | 25446 | 48 | 5 | 2017-08-06 16:16:00 | 48_5 |


In [4]:
df["CreateDate"] = pd.to_datetime(df["CreateDate"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162523 entries, 0 to 162522
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   UserId      162523 non-null  int64         
 1   ServiceId   162523 non-null  int64         
 2   CategoryId  162523 non-null  int64         
 3   CreateDate  162523 non-null  datetime64[ns]
 4   Service     162523 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 6.2+ MB


In [5]:
df["Year_Month"] = df["CreateDate"].dt.to_period('M')
df.head()

|   | UserId | ServiceId | CategoryId | CreateDate | Service | Year_Month |
|---|---|---|---|---|---|---|
| 0 | 25446 | 4 | 5 | 2017-08-06 16:11:00 | 4_5 | 2017-08 |
| 1 | 22948 | 48 | 5 | 2017-08-06 16:12:00 | 48_5 | 2017-08 |
| 2 | 10618 | 0 | 8 | 2017-08-06 16:13:00 | 0_8 | 2017-08 |
| 3 | 7256 | 9 | 4 | 2017-08-06 16:14:00 | 9_4 | 2017-08 |
| 4 | 25446 | 48 | 5 | 2017-08-06 16:16:00 | 48_5 | 2017-08 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162523 entries, 0 to 162522
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   UserId      162523 non-null  int64         
 1   ServiceId   162523 non-null  int64         
 2   CategoryId  162523 non-null  int64         
 3   CreateDate  162523 non-null  datetime64[ns]
 4   Service     162523 non-null  object        
 5   Year_Month  162523 non-null  period[M]     
dtypes: datetime64[ns](1), int64(3), object(1), period[M](1)
memory usage: 7.4+ MB


In [7]:
# A basket is everything one user bought within one calendar month.
df["BasketID"] = df["UserId"].astype(str) + "_" + df["Year_Month"].astype(str)
df.head()

|   | UserId | ServiceId | CategoryId | CreateDate | Service | Year_Month | BasketID |
|---|---|---|---|---|---|---|---|
| 0 | 25446 | 4 | 5 | 2017-08-06 16:11:00 | 4_5 | 2017-08 | 25446_2017-08 |
| 1 | 22948 | 48 | 5 | 2017-08-06 16:12:00 | 48_5 | 2017-08 | 22948_2017-08 |
| 2 | 10618 | 0 | 8 | 2017-08-06 16:13:00 | 0_8 | 2017-08 | 10618_2017-08 |
| 3 | 7256 | 9 | 4 | 2017-08-06 16:14:00 | 9_4 | 2017-08 | 7256_2017-08 |
| 4 | 25446 | 48 | 5 | 2017-08-06 16:16:00 | 48_5 | 2017-08 | 25446_2017-08 |


In [8]:
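# Count purchases per (BasketID, Service) pair, pivot Service into columns
# with unstack(), and convert the counts to 0/1 flags.
# (.gt(0).astype(int) replaces applymap, which is deprecated in recent pandas.)

df_pvt = (
    df.groupby(["BasketID", "Service"])["Service"]
      .count()
      .unstack(fill_value=0)
      .gt(0)
      .astype(int)
)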
df_pvt.head()

| BasketID | 0_8 | 10_9 | 11_11 | 12_7 | 13_11 | 14_7 | 15_1 | 16_8 | 17_5 | 18_4 | ... | 46_4 | 47_7 | 48_5 | 49_1 | 4_5 | 5_11 | 6_7 | 7_3 | 8_5 | 9_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0_2017-08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0_2017-09 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 0_2018-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 0_2018-04 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10000_2017-08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [9]:
########################################################################
################## Revealing the rules of association ##################
########################################################################

frequent_itemsets = apriori(df_pvt,
                            min_support=0.01,
                            use_colnames=True)
                            
frequent_itemsets.sort_values("support",ascending=False)[:30]



|   | support | itemsets |
|---|---|---|
| 8 | 0.238121 | (18_4) |
| 19 | 0.130286 | (2_0) |
| 5 | 0.120963 | (15_1) |
| 39 | 0.067762 | (49_1) |
| 28 | 0.066568 | (38_4) |
| 3 | 0.056627 | (13_11) |
| 12 | 0.047515 | (22_0) |
| 9 | 0.045563 | (19_6) |
| 15 | 0.042895 | (25_0) |
| 7 | 0.041533 | (17_5) |


In [10]:
rules = association_rules(frequent_itemsets,
                          metric="support",
                          min_threshold = 0.01)

rules.sort_values("lift",ascending=False)

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 10 | (25_0) | (22_0) | 0.042895 | 0.047515 | 0.01112 | 0.259247 | 5.456141 | 0.009082 | 1.285834 |
| 11 | (22_0) | (25_0) | 0.047515 | 0.042895 | 0.01112 | 0.234043 | 5.456141 | 0.009082 | 1.249553 |
| 18 | (38_4) | (9_4) | 0.066568 | 0.041393 | 0.010067 | 0.151234 | 3.653623 | 0.007312 | 1.129413 |
| 19 | (9_4) | (38_4) | 0.041393 | 0.066568 | 0.010067 | 0.243216 | 3.653623 | 0.007312 | 1.233418 |
| 4 | (15_1) | (33_4) | 0.120963 | 0.02731 | 0.011233 | 0.092861 | 3.400299 | 0.007929 | 1.072262 |
| 5 | (33_4) | (15_1) | 0.02731 | 0.120963 | 0.011233 | 0.411311 | 3.400299 | 0.007929 | 1.493211 |
| 12 | (2_0) | (22_0) | 0.130286 | 0.047515 | 0.016568 | 0.127169 | 2.676409 | 0.010378 | 1.09126 |
| 13 | (22_0) | (2_0) | 0.047515 | 0.130286 | 0.016568 | 0.3487 | 2.676409 | 0.010378 | 1.33535 |
| 14 | (2_0) | (25_0) | 0.130286 | 0.042895 | 0.013437 | 0.103136 | 2.404371 | 0.007849 | 1.067168 |
| 15 | (25_0) | (2_0) | 0.042895 | 0.130286 | 0.013437 | 0.313257 | 2.404371 | 0.007849 | 1.266432 |


The columns in the table correspond to the quantities defined earlier:

* antecedents: A
* consequents: B
* antecedent support: Support(A)
* consequent support: Support(B)
* support: Support(A, B)
* confidence: Confidence(A, B)
* lift: Lift(A, B)
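As a sanity check, the derived columns can be recomputed by hand from the three support values of the first rule in the table, (25_0) -> (22_0). The leverage and conviction formulas below follow the definitions given in the `mlxtend` documentation. The inputs are the rounded values printed in the table, so the results agree with it only up to rounding.

```python
# Values copied from the rule (25_0) -> (22_0) in the table.
antecedent_support = 0.042895  # Support(A)
consequent_support = 0.047515  # Support(B)
support_ab = 0.01112           # Support(A, B)

# Confidence(A, B) = Support(A, B) / Support(A)
confidence = support_ab / antecedent_support

# Lift(A, B) = Confidence(A, B) / Support(B)
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected
# if A and B were independent.
leverage = support_ab - antecedent_support * consequent_support

# Conviction: (1 - Support(B)) / (1 - Confidence(A, B))
conviction = (1 - consequent_support) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```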

In [11]:
########################################################################
########### Suggesting products to users at the basket stage ###########
########################################################################

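def arl_recommender(rules_df, product):
    """Return the first consequent item of every rule whose antecedent
    contains `product`, ordered from highest to lowest lift."""
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendation_list = []
    for i, antecedent in enumerate(sorted_rules["antecedents"]):
        if product in antecedent:
            # Only the first item of the consequent set is kept.
            recommendation_list.append(list(sorted_rules.iloc[i]["consequents"])[0])
    return recommendation_list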

arl_recommender(rules, "2_0")

['22_0', '25_0', '15_1', '13_11', '38_4']

**As a result, the other services that can be recommended to someone who receives the 2_0 service are listed in descending order of lift: because the rules are sorted by lift before matching, the strongest associations come first.**
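In practice it can be useful to cap the number of suggestions and avoid repeating a service. The variant below is a hypothetical sketch of that idea, not part of the original notebook. The name `recommend_top_n`, the parameter `n`, and the small `toy_rules` table (with partly made-up lift values) are all assumptions for illustration.

```python
import pandas as pd


def recommend_top_n(rules_df, product, n=3):
    """Return up to `n` distinct consequent items for rules whose
    antecedent contains `product`, highest lift first."""
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendations = []
    pairs = zip(sorted_rules["antecedents"], sorted_rules["consequents"])
    for antecedent, consequent in pairs:
        if product not in antecedent:
            continue
        # Consider every item in the consequent, not only the first one.
        for item in sorted(consequent):
            if item != product and item not in recommendations:
                recommendations.append(item)
    return recommendations[:n]


# Small hand-built rules table in the same layout as the mlxtend output.
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"2_0"}),
        frozenset({"2_0"}),
        frozenset({"22_0"}),
        frozenset({"2_0", "15_1"}),
    ],
    "consequents": [
        frozenset({"22_0"}),
        frozenset({"25_0"}),
        frozenset({"2_0"}),
        frozenset({"22_0", "13_11"}),
    ],
    "lift": [2.68, 2.40, 2.60, 1.90],
})

top_three = recommend_top_n(toy_rules, "2_0")
print(top_three)
```

Membership is tested against the antecedent set, so `"2_0"` does not accidentally match the rule whose antecedent is `"22_0"`.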