# Association Rules

Association rules are a method for revealing relationships between products from customers' purchase records. They show which products tend to be bought together, subject to a threshold value chosen by the analyst.

Association rule learning is a rule-based machine learning technique used to find patterns (relationships, structures) in data. It is also known as __Market Basket Analysis__.

Association rule applications are among the most common in data science, and they often appear in the form of recommendation systems.

You may have come across them in the following ways: "those who bought that product also bought this product", "those who viewed that ad also looked at these ads", "we created a playlist for you", or "recommended next video".

These are the most common scenarios in e-commerce data science and data mining work.

Many e-commerce companies and platforms around the world, such as Spotify, Amazon, and Netflix, use recommendation systems.

So what do these association analyses do?

__Apriori Algorithm__

It is the most widely used method in this field.

Association rule analysis is performed by examining some metrics:

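X: a product (or set of products)

Y: another product (or set of products)

N: total number of transactions (shopping trips / orders)

__Support__: Probability that X and Y appear together in a transaction

- Support(X, Y) = Freq(X, Y) / N


__Confidence__: Probability that Y is purchased given that X is purchased

- Confidence(X, Y) = Freq(X, Y) / Freq(X)

__Lift__: How many times more likely Y is to be purchased when X is purchased, compared with Y's overall purchase rate. A lift of 1 means X and Y are independent; a lift above 1 means they appear together more often than chance would suggest.

- Lift(X, Y) = Support(X, Y) / ( Support(X) * Support(Y) )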
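
To make the three formulas concrete, here is a minimal pure-Python sketch. The five baskets are hypothetical and invented only for illustration, and the helper names `support`, `confidence`, and `lift` are my own, not part of any library.

```python
# Hypothetical toy baskets, invented purely for illustration
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread"},
]
N = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / N


def confidence(x, y):
    """Support(X, Y) / Support(X), i.e. Freq(X, Y) / Freq(X)."""
    return support(x | y) / support(x)


def lift(x, y):
    """Support(X, Y) / (Support(X) * Support(Y))."""
    return support(x | y) / (support(x) * support(y))


print(support({"bread", "milk"}))       # bread and milk share 2 of 5 baskets
print(confidence({"bread"}, {"milk"}))  # 2 of the 4 bread baskets contain milk
print(lift({"bread"}, {"milk"}))        # about 0.83, i.e. below 1
```

In this toy data the lift is below 1, so milk is slightly less likely in baskets that contain bread than in baskets overall.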

In [None]:
# Install mlxtend library for apriori algorithm
!pip install mlxtend

In [None]:
# Install arulesviz library for visualization
!pip install arulesviz

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
from arulesviz import Arulesviz

## Example 1: Retail Dataset

In [2]:
# Load the dataset and show the first rows
df = pd.read_csv('retail_dataset.csv', sep=',')
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


In [3]:
# See the number of rows and columns
df.shape

(315, 7)

In [4]:
# The apriori algorithm needs a one-hot-encoded dataset, so we first have to transform our data.

In [5]:
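# Collect every distinct item across all columns (NaN marks empty slots in shorter baskets)
items = pd.Series(df.values.ravel()).dropna().unique()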

In [6]:
# One-hot encoding: build one dict per transaction that maps every item to 0 or 1
encoded_vals = []
for _, row in df.iterrows():
    labels = {}
    # Items that do not appear in this transaction
    uncommons = list(set(items) - set(row))
    # Items that do appear in this transaction
    commons = list(set(items).intersection(row))
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)

In [7]:
# Create a one-hot-encoded dataframe 
ohe_df = pd.DataFrame(encoded_vals) 
ohe_df.head()

Unnamed: 0,Bagel,Milk,Meat,Diaper,Cheese,Bread,Wine,Pencil,Eggs
0,0,0,1,1,1,1,1,1,1
1,0,1,1,1,1,1,1,1,0
2,0,1,1,0,1,0,1,0,1
3,0,1,1,0,1,0,1,0,1
4,0,0,1,0,0,0,1,1,0


In [8]:
# See the definition of apriori function
? apriori

Signature:
apriori(
    df,
    min_support=0.5,
    use_colnames=False,
    max_len=None,
    verbose=0,
    low_memory=False,
)
Docstring:
Get frequent itemsets from a one-hot DataFrame

Parameters
-----------
df : pandas DataFrame
  pandas DataFrame the encoded format. Also supports
  DataFrames with sparse data; for more info, please
  see (https://pandas.pydata.org/pandas-docs/stable/
       user_guide/sparse.html#sparse-data-structures)

  Please note that the old pandas SparseDataFrame format
  is no longer supported in mlxtend >= 0.17.2.

  The allowed values are either 0/1 or True/False.
  For example,



In [9]:
# Run the apriori algorithm and keep every itemset whose support is at least 0.2
freq_items = apriori(ohe_df, min_support = 0.2, use_colnames = True, verbose = 1)

Processing 4 combinations | Sampling itemset size 4 3


In [10]:
# See the results
freq_items

Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.501587,(Milk)
2,0.47619,(Meat)
3,0.406349,(Diaper)
4,0.501587,(Cheese)
5,0.504762,(Bread)
6,0.438095,(Wine)
7,0.361905,(Pencil)
8,0.438095,(Eggs)
9,0.225397,"(Bagel, Milk)"
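
For intuition about what a call like `apriori(...)` computes, here is a minimal level-wise sketch in pure Python. It is not mlxtend's implementation; the function name `apriori_sketch` is hypothetical, and it runs only on the five rows shown by `df.head()` above, so its supports differ from the full 315-row results.

```python
from itertools import combinations

# The five transactions shown by df.head() above (not the full dataset)
head_rows = [
    ["Bread", "Wine", "Eggs", "Meat", "Cheese", "Pencil", "Diaper"],
    ["Bread", "Cheese", "Meat", "Diaper", "Wine", "Milk", "Pencil"],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine"],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine"],
    ["Meat", "Pencil", "Wine"],
]


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for itemsets meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    singles = {frozenset([item]) for b in baskets for item in b}
    current = {c for c in singles if support(c) >= min_support}
    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = support(c)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


result = apriori_sketch(head_rows, min_support=0.6)
for itemset, sup in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is the core Apriori idea: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is ever counted.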


In [11]:
# Sort the values in descending order
freq_items.sort_values("support", ascending = False)

Unnamed: 0,support,itemsets
5,0.504762,(Bread)
1,0.501587,(Milk)
4,0.501587,(Cheese)
2,0.47619,(Meat)
6,0.438095,(Wine)
8,0.438095,(Eggs)
0,0.425397,(Bagel)
3,0.406349,(Diaper)
7,0.361905,(Pencil)
16,0.32381,"(Cheese, Meat)"


In [12]:
# See the definition of association_rules function
? association_rules

Signature:
association_rules(
    df,
    metric='confidence',
    min_threshold=0.8,
    support_only=False,
)
Docstring:
Generates a DataFrame of association rules including the
metrics 'score', 'confidence', and 'lift'

Parameters
-----------
df : pandas DataFrame
  pandas DataFrame of frequent itemsets
  with columns ['support', 'itemsets']

metric : string (default: 'confidence')
  Metric to evaluate if a rule is of interest.
  **Automatically set to 'support' if `support_only=True`.**
  Otherwise, supported metrics are 'support', 'confidence', 'lift',
  'leverage', and 'conviction'
  These metrics are computed as follows:

  - support(A->C) = support(A+C) [aka 'support'], range: [0, 1]

  - confidence(A->C) =

In [13]:
# Generate the rules and see other metrics such as confidence, lift, leverage, and conviction
association_rules(freq_items, metric="confidence", min_threshold = 0.6)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
1,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
2,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
3,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891
4,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754
5,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624
6,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754
7,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203
8,"(Cheese, Milk)",(Meat),0.304762,0.47619,0.203175,0.666667,1.4,0.05805,1.571429
9,"(Cheese, Meat)",(Milk),0.32381,0.501587,0.203175,0.627451,1.250931,0.040756,1.337845


__Important and commonly used metrics and definitions__:
- Antecedent Support and Consequent Support --> probability of observing the antecedent (X) or the consequent (Y) on its own
- Support --> probability of observing X and Y together
- Confidence --> probability that Y is purchased given that X is purchased
- Lift --> how many times more likely Y is to be purchased when X is purchased, compared with Y's overall purchase rate
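
As a sanity check, the derived columns can be recomputed from the three support values of a single rule. The sketch below copies the numbers from row 4 (Meat -> Cheese) of the rules table above; the formulas are the standard definitions of confidence, lift, leverage, and conviction, and small differences in the last digits come from the rounded inputs.

```python
# Values copied from row 4 (Meat -> Cheese) of the rules table above
antecedent_support = 0.47619   # Support(Meat)
consequent_support = 0.501587  # Support(Cheese)
support = 0.32381              # Support(Meat, Cheese)

# Confidence: share of Meat baskets that also contain Cheese
confidence = support / antecedent_support

# Lift: confidence relative to Cheese's overall purchase rate
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support

# Conviction: how much more often the rule would fail if X and Y were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 3), round(lift, 3), round(leverage, 3), round(conviction, 3))
```

The results agree with the table's 0.68, 1.356, 0.085, and 1.558 up to rounding.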

In [14]:
# Create a dataframe to keep these rules. Here we defined min_threshold as 0.6.
df_ar = association_rules(freq_items, metric = "confidence", min_threshold = 0.6)

In [15]:
# Filter the values. Return Support < 0.3 and Confidence > 0.7
df_ar[(df_ar.support < 0.3) & (df_ar.confidence > 0.7)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
10,"(Milk, Meat)",(Cheese),0.244444,0.501587,0.203175,0.831169,1.657077,0.080564,2.952137
12,"(Cheese, Eggs)",(Meat),0.298413,0.47619,0.215873,0.723404,1.519149,0.073772,1.893773
13,"(Meat, Eggs)",(Cheese),0.266667,0.501587,0.215873,0.809524,1.613924,0.082116,2.616667


## Example 2: GroceryStoreDataSet

Dataset link: https://www.kaggle.com/shazadudwadia/supermarket

In [18]:
# Load the dataset and show the first rows
data = pd.read_csv('GroceryStoreDataSet.csv', names = ['products'], header = None)
df = data.copy()
df.head()

Unnamed: 0,products
0,"MILK,BREAD,BISCUIT"
1,"BREAD,MILK,BISCUIT,CORNFLAKES"
2,"BREAD,TEA,BOURNVITA"
3,"JAM,MAGGI,BREAD,MILK"
4,"MAGGI,TEA,BISCUIT"


In [19]:
# Shape of the data --> number of rows and columns
df.shape

(20, 1)

In [20]:
# Look at the data from a different angle: the raw values as a NumPy array
df.values

array([['MILK,BREAD,BISCUIT'],
       ['BREAD,MILK,BISCUIT,CORNFLAKES'],
       ['BREAD,TEA,BOURNVITA'],
       ['JAM,MAGGI,BREAD,MILK'],
       ['MAGGI,TEA,BISCUIT'],
       ['BREAD,TEA,BOURNVITA'],
       ['MAGGI,TEA,CORNFLAKES'],
       ['MAGGI,BREAD,TEA,BISCUIT'],
       ['JAM,MAGGI,BREAD,TEA'],
       ['BREAD,MILK'],
       ['COFFEE,COCK,BISCUIT,CORNFLAKES'],
       ['COFFEE,COCK,BISCUIT,CORNFLAKES'],
       ['COFFEE,SUGER,BOURNVITA'],
       ['BREAD,COFFEE,COCK'],
       ['BREAD,SUGER,BISCUIT'],
       ['COFFEE,SUGER,CORNFLAKES'],
       ['BREAD,SUGER,BOURNVITA'],
       ['BREAD,COFFEE,SUGER'],
       ['BREAD,COFFEE,SUGER'],
       ['TEA,MILK,COFFEE,CORNFLAKES']], dtype=object)

In [21]:
# We need one-hot-encoded dataset in order to apply apriori algorithm. So, we need to make some transformations in our dataset. 

In [22]:
# Split each row into a list of individual items
df1 = list(df["products"].apply(lambda x: x.split(',')))
df1

[['MILK', 'BREAD', 'BISCUIT'],
 ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['JAM', 'MAGGI', 'BREAD', 'MILK'],
 ['MAGGI', 'TEA', 'BISCUIT'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['MAGGI', 'TEA', 'CORNFLAKES'],
 ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'],
 ['JAM', 'MAGGI', 'BREAD', 'TEA'],
 ['BREAD', 'MILK'],
 ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'COCK'],
 ['BREAD', 'SUGER', 'BISCUIT'],
 ['COFFEE', 'SUGER', 'CORNFLAKES'],
 ['BREAD', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]

In [23]:
# Import the TransactionEncoder class to one-hot encode the item lists
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_data = te.fit(df1).transform(df1)
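
To see what this encoding step produces, here is a rough pure-Python equivalent on the first three transactions. It is a sketch of the idea only, not mlxtend's implementation: one boolean column per distinct item, with the columns in sorted order.

```python
# First three transactions from the list above
transactions = [
    ['MILK', 'BREAD', 'BISCUIT'],
    ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'],
    ['BREAD', 'TEA', 'BOURNVITA'],
]

# One column per distinct item, in sorted order
columns = sorted({item for basket in transactions for item in basket})

# One boolean row per transaction: True where the basket contains the item
encoded = [[item in basket for item in columns] for basket in transactions]

print(columns)
for row in encoded:
    print(row)
```

Each row of `encoded` corresponds to one basket, which is the one-hot layout that `apriori` expects.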

In [24]:
# Create the final dataset, on which we are going to apply the apriori algorithm
df = pd.DataFrame(te_data, columns = te.columns_)
df

Unnamed: 0,BISCUIT,BOURNVITA,BREAD,COCK,COFFEE,CORNFLAKES,JAM,MAGGI,MILK,SUGER,TEA
0,True,False,True,False,False,False,False,False,True,False,False
1,True,False,True,False,False,True,False,False,True,False,False
2,False,True,True,False,False,False,False,False,False,False,True
3,False,False,True,False,False,False,True,True,True,False,False
4,True,False,False,False,False,False,False,True,False,False,True
5,False,True,True,False,False,False,False,False,False,False,True
6,False,False,False,False,False,True,False,True,False,False,True
7,True,False,True,False,False,False,False,True,False,False,True
8,False,False,True,False,False,False,True,True,False,False,True
9,False,False,True,False,False,False,False,False,True,False,False


In [25]:
from mlxtend.frequent_patterns import apriori

In [26]:
# Keep every itemset whose support is at least 0.2
freq_items = apriori(df, min_support = 0.2, use_colnames = True)
freq_items

Unnamed: 0,support,itemsets
0,0.35,(BISCUIT)
1,0.2,(BOURNVITA)
2,0.65,(BREAD)
3,0.4,(COFFEE)
4,0.3,(CORNFLAKES)
5,0.25,(MAGGI)
6,0.25,(MILK)
7,0.3,(SUGER)
8,0.35,(TEA)
9,0.2,"(BREAD, BISCUIT)"


In [27]:
# Sort these values and see the most common itemsets
freq_items.sort_values(by = "support", ascending = False)

Unnamed: 0,support,itemsets
2,0.65,(BREAD)
3,0.4,(COFFEE)
0,0.35,(BISCUIT)
8,0.35,(TEA)
4,0.3,(CORNFLAKES)
7,0.3,(SUGER)
5,0.25,(MAGGI)
6,0.25,(MILK)
1,0.2,(BOURNVITA)
9,0.2,"(BREAD, BISCUIT)"


In [28]:
# Let's find the length of each itemset
freq_items['length'] = freq_items['itemsets'].apply(len)
freq_items

Unnamed: 0,support,itemsets,length
0,0.35,(BISCUIT),1
1,0.2,(BOURNVITA),1
2,0.65,(BREAD),1
3,0.4,(COFFEE),1
4,0.3,(CORNFLAKES),1
5,0.25,(MAGGI),1
6,0.25,(MILK),1
7,0.3,(SUGER),1
8,0.35,(TEA),1
9,0.2,"(BREAD, BISCUIT)",2


In [29]:
# Generate the rules and see other metrics such as confidence, lift, leverage, and conviction
association_rules(freq_items, metric = "confidence", min_threshold = 0.6)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(MILK),(BREAD),0.25,0.65,0.2,0.8,1.230769,0.0375,1.75
1,(SUGER),(BREAD),0.3,0.65,0.2,0.666667,1.025641,0.005,1.05
2,(CORNFLAKES),(COFFEE),0.3,0.4,0.2,0.666667,1.666667,0.08,1.8
3,(SUGER),(COFFEE),0.3,0.4,0.2,0.666667,1.666667,0.08,1.8
4,(MAGGI),(TEA),0.25,0.35,0.2,0.8,2.285714,0.1125,3.25


In [30]:
# We may want to focus on some combinations
# Create a dataframe to hold the rules so that we can filter the ones that matter to us
df_ar = association_rules(freq_items, metric = "confidence", min_threshold = 0.6)

In [31]:
# Filter the values. Return Support < 0.3 and Confidence > 0.7
df_ar[(df_ar.support < 0.3) & (df_ar.confidence > 0.7)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(MILK),(BREAD),0.25,0.65,0.2,0.8,1.230769,0.0375,1.75
4,(MAGGI),(TEA),0.25,0.35,0.2,0.8,2.285714,0.1125,3.25


## Results and Comments  

__0 : MILK & BREAD:__

- Support = 0.2 --> Milk and bread appear together in 20% of all transactions.
- Confidence = 0.8 --> 80% of the transactions that contain milk also contain bread.
- Lift = 1.23 --> Customers who buy milk are 1.23 times as likely to buy bread as customers overall.


__1 : SUGAR & BREAD:__

- Support = 0.2 --> Sugar and bread appear together in 20% of all transactions.
- Confidence = 0.67 --> 67% of the transactions that contain sugar also contain bread.
- Lift = 1.03 --> Customers who buy sugar are 1.03 times as likely to buy bread as customers overall. Since this is very close to 1, the two products are nearly independent.


__4 : MAGGI & TEA :__

- Support = 0.2 --> Maggi and tea appear together in 20% of all transactions.
- Confidence = 0.8 --> 80% of the transactions that contain maggi also contain tea.
- Lift = 2.29 --> Customers who buy maggi are 2.29 times as likely to buy tea as customers overall.
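
These figures can be re-derived by simple counting over the 20 transactions printed by `df.values` in Example 2. The sketch below does this in plain Python; `rule_metrics` is a hypothetical helper name, and the item spellings (such as `SUGER`) are kept exactly as they appear in the dataset.

```python
# The 20 transactions exactly as listed in the df.values output of Example 2
raw = [
    "MILK,BREAD,BISCUIT", "BREAD,MILK,BISCUIT,CORNFLAKES", "BREAD,TEA,BOURNVITA",
    "JAM,MAGGI,BREAD,MILK", "MAGGI,TEA,BISCUIT", "BREAD,TEA,BOURNVITA",
    "MAGGI,TEA,CORNFLAKES", "MAGGI,BREAD,TEA,BISCUIT", "JAM,MAGGI,BREAD,TEA",
    "BREAD,MILK", "COFFEE,COCK,BISCUIT,CORNFLAKES", "COFFEE,COCK,BISCUIT,CORNFLAKES",
    "COFFEE,SUGER,BOURNVITA", "BREAD,COFFEE,COCK", "BREAD,SUGER,BISCUIT",
    "COFFEE,SUGER,CORNFLAKES", "BREAD,SUGER,BOURNVITA", "BREAD,COFFEE,SUGER",
    "BREAD,COFFEE,SUGER", "TEA,MILK,COFFEE,CORNFLAKES",
]
baskets = [set(row.split(",")) for row in raw]
n = len(baskets)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = sum(antecedent in b and consequent in b for b in baskets)
    ante = sum(antecedent in b for b in baskets)
    cons = sum(consequent in b for b in baskets)
    support = both / n
    confidence = both / ante
    lift = confidence / (cons / n)
    return support, confidence, lift


for a, c in [("MILK", "BREAD"), ("SUGER", "BREAD"), ("MAGGI", "TEA")]:
    s, conf, l = rule_metrics(a, c)
    print(f"{a} -> {c}: support={s:.2f}, confidence={conf:.2f}, lift={l:.2f}")
# MILK -> BREAD: support=0.20, confidence=0.80, lift=1.23
# SUGER -> BREAD: support=0.20, confidence=0.67, lift=1.03
# MAGGI -> TEA: support=0.20, confidence=0.80, lift=2.29
```

With only 20 transactions, each of these rules rests on just four co-occurring baskets, so the numbers are best read as illustrations rather than stable estimates.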

## Conclusion

Association rules are used to reveal product relationships from user purchases. 
By examining all purchases of all customers, we measure how often products appear together in the same basket. Marketing strategies can then be developed according to these frequencies.
The rules show which products tend to be purchased together, subject to a chosen threshold value.

Association rules can be applied in many areas:
- Shelf layout planning in supermarkets
- Recommender systems
- Market Basket Analysis

## References 

- VBO - Data Science Machine Learning Bootcamp, Lecturer: M. Vahit Keskin, Mentor: Muhammet Cakmak, Assistant Mentors: Cemal Cici, Saltuk Bugra Karacan
- Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Wiley, 2015 
- https://www.kaggle.com/shazadudwadia/supermarket
- https://www.kaggle.com/mariekaram/apriori-association-rule
- http://www.ams.med.uni-goettingen.de/download/Steffen-Unkel/chap9.pdf
- https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html
- https://medium.com/edureka/apriori-algorithm-d7cc648d4f1e