<a href="https://colab.research.google.com/github/tanongsakintean/machineLearning-basic/blob/main/Chapter_4_grocery_apriori.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 9  Apriori Association Rule Algorithm for a Grocery Shop

# Association Rules

Association rule learning is a rule-based machine learning technique used to find patterns (relationships, structures) in data.

Association analysis is among the most common applications in data science. It also underlies many recommender systems.

You have probably met these applications in forms such as "the person who bought that product also bought this product", "those who viewed that ad also looked at these ads", or "the next video recommended for you".

These are the scenarios most frequently encountered in e-commerce data science and data mining studies.


# Apriori Algorithm

It is one of the most widely used methods in this field.

Association rule analysis is carried out by examining a few metrics. In the definitions below:
X: a product
Y: a product
N: total number of transactions

* Support: the fraction of all transactions that contain both X and Y. It tells us which products, or combinations of products, are bought frequently.
  Support(X, Y) = Freq(X, Y) / N

* Confidence: how often X and Y occur together, relative to the number of times X occurs.
  Confidence(X, Y) = Freq(X, Y) / Freq(X)

* Lift: the strength of a rule compared with the co-occurrence expected if X and Y were independent. A lift above 1 means X and Y occur together more often than chance would suggest; a lift below 1 means less often.
  Lift(X, Y) = Support(X, Y) / (Support(X) * Support(Y))
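
As a quick check of these definitions, the three metrics can be computed by hand. The sketch below uses a small made-up list of five transactions (not the grocery data set), and the helper names `support`, `confidence`, and `lift` are hypothetical, chosen only for this illustration.

```python
# Toy transactions, invented for illustration only.
transactions = [
    {"MILK", "BREAD"},
    {"MILK", "BREAD", "TEA"},
    {"BREAD", "TEA"},
    {"MILK"},
    {"TEA"},
]
N = len(transactions)


def support(*items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(wanted <= t for t in transactions) / N


def confidence(x, y):
    """Support(X, Y) / Support(X), which equals Freq(X, Y) / Freq(X)."""
    return support(x, y) / support(x)


def lift(x, y):
    """Support(X, Y) / (Support(X) * Support(Y))."""
    return support(x, y) / (support(x) * support(y))


print(f"Support(MILK, BREAD)    = {support('MILK', 'BREAD'):.3f}")
print(f"Confidence(MILK, BREAD) = {confidence('MILK', 'BREAD'):.3f}")
print(f"Lift(MILK, BREAD)       = {lift('MILK', 'BREAD'):.3f}")
```

In this toy data MILK and BREAD appear together in 2 of 5 transactions, so the support is 0.4. MILK appears in 3 transactions, so the confidence is 2/3, and the lift comes out slightly above 1.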



## 1. Understanding the Data

In [None]:
import pandas as pd
import numpy as np

In [None]:
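import io

from google.colab import files  # Colab-only helper for uploading local files

# Upload GroceryStoreDataSet.csv from the local machine.
uploaded = files.upload()

# Each line of the file is one transaction; read it into a single "products" column.
df = pd.read_csv(io.BytesIO(uploaded['GroceryStoreDataSet.csv']), names=['products'], header=None)
df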


Saving GroceryStoreDataSet.csv to GroceryStoreDataSet.csv


Unnamed: 0,products
0,"MILK,BREAD,BISCUIT"
1,"BREAD,MILK,BISCUIT,CORNFLAKES"
2,"BREAD,TEA,BOURNVITA"
3,"JAM,MAGGI,BREAD,MILK"
4,"MAGGI,TEA,BISCUIT"
5,"BREAD,TEA,BOURNVITA"
6,"MAGGI,TEA,CORNFLAKES"
7,"MAGGI,BREAD,TEA,BISCUIT"
8,"JAM,MAGGI,BREAD,TEA"
9,"BREAD,MILK"


In [None]:
df.values

array([['MILK,BREAD,BISCUIT'],
       ['BREAD,MILK,BISCUIT,CORNFLAKES'],
       ['BREAD,TEA,BOURNVITA'],
       ['JAM,MAGGI,BREAD,MILK'],
       ['MAGGI,TEA,BISCUIT'],
       ['BREAD,TEA,BOURNVITA'],
       ['MAGGI,TEA,CORNFLAKES'],
       ['MAGGI,BREAD,TEA,BISCUIT'],
       ['JAM,MAGGI,BREAD,TEA'],
       ['BREAD,MILK'],
       ['COFFEE,COCK,BISCUIT,CORNFLAKES'],
       ['COFFEE,COCK,BISCUIT,CORNFLAKES'],
       ['COFFEE,SUGER,BOURNVITA'],
       ['BREAD,COFFEE,COCK'],
       ['BREAD,SUGER,BISCUIT'],
       ['COFFEE,SUGER,CORNFLAKES'],
       ['BREAD,SUGER,BOURNVITA'],
       ['BREAD,COFFEE,SUGER'],
       ['BREAD,COFFEE,SUGER'],
       ['TEA,MILK,COFFEE,CORNFLAKES']], dtype=object)

In [None]:
df.shape

(20, 1)

## 2. Data Preparation

I will use the Apriori algorithm to perform an association analysis.

The apriori function of the mlxtend library expects the dataset as a True-False dataframe, with one column per product and one row per transaction. I will use the preprocessing tools of the **mlxtend library** to convert the data, so I first convert the raw data set into the format those tools require.

In [None]:
# Step 1: Convert the data into a list of lists by splitting each line on ','.
data = list(df["products"].apply(lambda x : x.split(',')))
data

[['MILK', 'BREAD', 'BISCUIT'],
 ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['JAM', 'MAGGI', 'BREAD', 'MILK'],
 ['MAGGI', 'TEA', 'BISCUIT'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['MAGGI', 'TEA', 'CORNFLAKES'],
 ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'],
 ['JAM', 'MAGGI', 'BREAD', 'TEA'],
 ['BREAD', 'MILK'],
 ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'COCK'],
 ['BREAD', 'SUGER', 'BISCUIT'],
 ['COFFEE', 'SUGER', 'CORNFLAKES'],
 ['BREAD', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]

In [None]:
!pip install mlxtend



In [None]:
# Step 2: Use mlxtend's TransactionEncoder to convert the data into a True-False
# dataframe.
# (mlxtend was installed in the previous cell for those who do not have it installed.)

from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_data = te.fit(data).transform(data)
df = pd.DataFrame(te_data,columns=te.columns_)
df

Unnamed: 0,BISCUIT,BOURNVITA,BREAD,COCK,COFFEE,CORNFLAKES,JAM,MAGGI,MILK,SUGER,TEA
0,True,False,True,False,False,False,False,False,True,False,False
1,True,False,True,False,False,True,False,False,True,False,False
2,False,True,True,False,False,False,False,False,False,False,True
3,False,False,True,False,False,False,True,True,True,False,False
4,True,False,False,False,False,False,False,True,False,False,True
5,False,True,True,False,False,False,False,False,False,False,True
6,False,False,False,False,False,True,False,True,False,False,True
7,True,False,True,False,False,False,False,True,False,False,True
8,False,False,True,False,False,False,True,True,False,False,True
9,False,False,True,False,False,False,False,False,True,False,False
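
To make the True-False format concrete, the sketch below builds the same kind of table by hand for the first three transactions of the data set, using plain Python only. It illustrates the idea of the encoding and is not a description of how TransactionEncoder is implemented; the names `sample`, `columns`, and `encoded` are made up for this example.

```python
# First three transactions from the data set above.
sample = [
    ["MILK", "BREAD", "BISCUIT"],
    ["BREAD", "MILK", "BISCUIT", "CORNFLAKES"],
    ["BREAD", "TEA", "BOURNVITA"],
]

# One column per distinct item, in alphabetical order.
columns = sorted({item for basket in sample for item in basket})

# One row per transaction: True if the item is in the basket, else False.
encoded = [[item in basket for item in columns] for basket in sample]

print(columns)
for row in encoded:
    print(row)
```

With pandas, `pd.DataFrame(encoded, columns=columns)` would turn this into a dataframe of the same kind as the one above, restricted to these three rows and the six items they contain.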


## 3. Implementing Apriori Algorithm

The output of the Apriori algorithm gives the support (relative frequency) of each item combination in the whole data set. For example, in the output below, the support of "BISCUIT" alone is 0.35, while the support of "BISCUIT and BREAD" together is 0.20.

The apriori algorithm was given a **min_support value of 0.2.** Thus, product combinations with a support below 0.2 have been eliminated. If the verbose argument is set to 1, it also reports how many combinations were examined. In our example, 42 combinations were formed and 16 of them remain in the final result. The other 42 - 16 = 26 combinations fell below the 0.2 support threshold and were treated as too infrequent to include in our comments.

In [None]:
from mlxtend.frequent_patterns import apriori
freq_items = apriori(df, min_support=0.2, use_colnames=True)  # try changing the min_support value
freq_items

Unnamed: 0,support,itemsets
0,0.35,(BISCUIT)
1,0.2,(BOURNVITA)
2,0.65,(BREAD)
3,0.4,(COFFEE)
4,0.3,(CORNFLAKES)
5,0.25,(MAGGI)
6,0.25,(MILK)
7,0.3,(SUGER)
8,0.35,(TEA)
9,0.2,"(BISCUIT, BREAD)"
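
To show where these support values come from, here is a minimal pure-Python sketch of the level-wise idea behind Apriori, run on the 20 transactions from section 1. It is a simplified illustration, not mlxtend's implementation, and the function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations

# The 20 transactions shown in section 1.
raw = [
    "MILK,BREAD,BISCUIT", "BREAD,MILK,BISCUIT,CORNFLAKES",
    "BREAD,TEA,BOURNVITA", "JAM,MAGGI,BREAD,MILK",
    "MAGGI,TEA,BISCUIT", "BREAD,TEA,BOURNVITA",
    "MAGGI,TEA,CORNFLAKES", "MAGGI,BREAD,TEA,BISCUIT",
    "JAM,MAGGI,BREAD,TEA", "BREAD,MILK",
    "COFFEE,COCK,BISCUIT,CORNFLAKES", "COFFEE,COCK,BISCUIT,CORNFLAKES",
    "COFFEE,SUGER,BOURNVITA", "BREAD,COFFEE,COCK",
    "BREAD,SUGER,BISCUIT", "COFFEE,SUGER,CORNFLAKES",
    "BREAD,SUGER,BOURNVITA", "BREAD,COFFEE,SUGER",
    "BREAD,COFFEE,SUGER", "TEA,MILK,COFFEE,CORNFLAKES",
]
transactions = [frozenset(line.split(",")) for line in raw]


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-(k+1) candidates come only from frequent size-k sets."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those that reach min_support.
        level = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates, then prune any
        # candidate that has an infrequent k-item subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent


result = frequent_itemsets(transactions, min_support=0.2)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(f"{s:.2f}  {sorted(itemset)}")
```

The pruning step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so combinations containing a rare item such as JAM or COCK are never counted beyond the first level.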


## 4. Create strong association rules

I will apply mlxtend's association_rules method to the frequent itemsets and their support values computed above. I will interpret the resulting output according to the values of "support" and "confidence" and suggest a sample action idea.

Interpretation of Sample Association Analysis Output:

The probability of BISCUIT and BREAD being seen together is 20% since support = 0.20.
When BISCUIT is taken, the probability of getting BREAD is around 57% since confidence = 0.571429.

Setting "min_threshold = 0.5" with metric = "confidence" ensures that rules with a "confidence" value below 0.5 are not returned.

In [None]:
from mlxtend.frequent_patterns import association_rules
df_res = association_rules(freq_items, metric="confidence", min_threshold=0.5)  # try changing the confidence threshold
df_res



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(BISCUIT),(BREAD),0.35,0.65,0.2,0.571429,0.879121,-0.0275,0.816667,-0.174603
1,(MILK),(BREAD),0.25,0.65,0.2,0.8,1.230769,0.0375,1.75,0.25
2,(SUGER),(BREAD),0.3,0.65,0.2,0.666667,1.025641,0.005,1.05,0.035714
3,(TEA),(BREAD),0.35,0.65,0.2,0.571429,0.879121,-0.0275,0.816667,-0.174603
4,(COFFEE),(CORNFLAKES),0.4,0.3,0.2,0.5,1.666667,0.08,1.4,0.666667
5,(CORNFLAKES),(COFFEE),0.3,0.4,0.2,0.666667,1.666667,0.08,1.8,0.571429
6,(COFFEE),(SUGER),0.4,0.3,0.2,0.5,1.666667,0.08,1.4,0.666667
7,(SUGER),(COFFEE),0.3,0.4,0.2,0.666667,1.666667,0.08,1.8,0.571429
8,(MAGGI),(TEA),0.25,0.35,0.2,0.8,2.285714,0.1125,3.25,0.75
9,(TEA),(MAGGI),0.35,0.25,0.2,0.571429,2.285714,0.1125,1.75,0.865385
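
The columns of this table can be reproduced by hand from the support values alone. The sketch below copies the item and pair supports from the tables above and derives confidence, lift, leverage, and conviction for every rule with confidence of at least 0.5. The leverage and conviction formulas are the usual textbook definitions, which match the values shown here; the names `item_support`, `pair_support`, and `rules_from_pairs` are hypothetical.

```python
# Supports copied from the tables above (min_support = 0.2).
item_support = {
    "BISCUIT": 0.35, "BREAD": 0.65, "COFFEE": 0.40, "CORNFLAKES": 0.30,
    "MAGGI": 0.25, "MILK": 0.25, "SUGER": 0.30, "TEA": 0.35,
}
pair_support = {
    ("BISCUIT", "BREAD"): 0.2, ("BREAD", "MILK"): 0.2, ("BREAD", "SUGER"): 0.2,
    ("BREAD", "TEA"): 0.2, ("COFFEE", "CORNFLAKES"): 0.2,
    ("COFFEE", "SUGER"): 0.2, ("MAGGI", "TEA"): 0.2,
}


def rules_from_pairs(min_confidence):
    """Build X -> Y rules from frequent pairs and keep those above the threshold."""
    rules = []
    for (a, b), s_ab in pair_support.items():
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            s_x, s_y = item_support[x], item_support[y]
            conf = s_ab / s_x
            if conf < min_confidence:
                continue
            rules.append({
                "antecedent": x,
                "consequent": y,
                "confidence": conf,
                "lift": s_ab / (s_x * s_y),
                "leverage": s_ab - s_x * s_y,
                # Conviction is infinite when confidence is exactly 1.
                "conviction": (1 - s_y) / (1 - conf) if conf < 1 else float("inf"),
            })
    return rules


rules = rules_from_pairs(min_confidence=0.5)
for r in rules:
    print(f"{r['antecedent']} -> {r['consequent']}: "
          f"confidence={r['confidence']:.6f}, lift={r['lift']:.6f}, "
          f"leverage={r['leverage']:.4f}, conviction={r['conviction']:.6f}")
```

Rules with BREAD as the antecedent drop out because their confidence is 0.2 / 0.65, roughly 0.31, which is below the threshold.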


## 5. Preparation for Data Filtering

In this section, I take the lowest and highest confidence values. They will be used to filter the rules and to propose an action idea.

Let's find the highest confidence value. The output shows that the highest confidence value is 0.80.

In [None]:
conf_max = df_res['confidence'].max()
conf_max



0.8

Let's find the lowest confidence value. The output shows that the lowest confidence value is 0.5.

In [None]:
conf_min = df_res["confidence"].min()
conf_min



0.5

## 6. Data Filtering

The rules are displayed, ranked in ascending order of "lift" value.

In [None]:
df_res.sort_values("lift",ascending = True)




Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(BREAD),(BISCUIT),0.65,0.35,0.2,0.307692,0.879121,-0.0275,0.938889,-0.282051
1,(BISCUIT),(BREAD),0.35,0.65,0.2,0.571429,0.879121,-0.0275,0.816667,-0.174603
6,(BREAD),(TEA),0.65,0.35,0.2,0.307692,0.879121,-0.0275,0.938889,-0.282051
7,(TEA),(BREAD),0.35,0.65,0.2,0.571429,0.879121,-0.0275,0.816667,-0.174603
4,(BREAD),(SUGER),0.65,0.3,0.2,0.307692,1.025641,0.005,1.011111,0.071429
5,(SUGER),(BREAD),0.3,0.65,0.2,0.666667,1.025641,0.005,1.05,0.035714
2,(BREAD),(MILK),0.65,0.25,0.2,0.307692,1.230769,0.0375,1.083333,0.535714
3,(MILK),(BREAD),0.25,0.65,0.2,0.8,1.230769,0.0375,1.75,0.25
8,(CORNFLAKES),(COFFEE),0.3,0.4,0.2,0.666667,1.666667,0.08,1.8,0.571429
9,(COFFEE),(CORNFLAKES),0.4,0.3,0.2,0.5,1.666667,0.08,1.4,0.666667


Rules with a lift of at least 1 whose confidence equals either the highest value (0.8) or 0.5 are kept. The result is ranked in ascending order of "confidence" value.

In [None]:
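# Keep rules with lift >= 1 whose confidence is the maximum (conf_max) or exactly 0.5.
df_filt = df_res[(df_res["lift"] >= 1) & ((df_res["confidence"] == conf_max) | (df_res["confidence"] == 0.5))]
df_filt.sort_values("confidence", ascending=True)
# lift > 1: the items in the rule are positively associated
# lift = 1: the items are independent (not associated)
# lift < 1: the items are negatively associated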



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
4,(COFFEE),(CORNFLAKES),0.4,0.3,0.2,0.5,1.666667,0.08,1.4,0.666667
6,(COFFEE),(SUGER),0.4,0.3,0.2,0.5,1.666667,0.08,1.4,0.666667
1,(MILK),(BREAD),0.25,0.65,0.2,0.8,1.230769,0.0375,1.75,0.25
8,(MAGGI),(TEA),0.25,0.35,0.2,0.8,2.285714,0.1125,3.25,0.75
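
The filter above compares floating-point confidence values with `==`. That works for this output, but exact equality on floats can be fragile in general. Below is a minimal sketch of a tolerance-based version using `math.isclose` on a few rows copied from the rule table. The list `rows` and the helper `keep` are hypothetical; on a dataframe column, `numpy.isclose` plays the same role.

```python
import math

# (antecedent, consequent, confidence, lift) copied from the rule table above.
rows = [
    ("BISCUIT", "BREAD", 0.571429, 0.879121),
    ("MILK", "BREAD", 0.8, 1.230769),
    ("SUGER", "BREAD", 0.666667, 1.025641),
    ("COFFEE", "CORNFLAKES", 0.5, 1.666667),
    ("COFFEE", "SUGER", 0.5, 1.666667),
    ("MAGGI", "TEA", 0.8, 2.285714),
]

conf_max = max(conf for _, _, conf, _ in rows)


def keep(conf, lift, targets, tol=1e-9):
    """True if lift >= 1 and conf is within `tol` of any target value."""
    return lift >= 1 and any(math.isclose(conf, t, abs_tol=tol) for t in targets)


# Filter, then sort by confidence in ascending order (sorted() is stable).
filtered = sorted(
    (r for r in rows if keep(r[2], r[3], targets=(conf_max, 0.5))),
    key=lambda r: r[2],
)
for antecedent, consequent, conf, lift in filtered:
    print(f"{antecedent} -> {consequent}: confidence={conf}, lift={lift}")
```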


# Comments and Action Advice

Based on the above output:

Row 3 (MILK, BREAD):
a). Milk and bread are bought together in 20% of all transactions, since support = 0.2.
b). When milk is bought, the probability of also buying bread is 80%, since confidence = 0.8.

Comments:
About 30% of customers who buy bread also buy biscuits. Cornflakes were also bought in half of the coffee purchases. Milk and bread were bought together in 20% of all purchases.

Action Proposal:
Customers who buy milk are quite likely to buy bread (80%). A customer who has picked up milk will therefore probably head for the bread aisle next. The milk and bread departments can be placed far apart in the store, so that a customer who buys milk has to pass through many other departments on the way to the bread. This increases the chance that the customer buys additional products along the way.

