## Association Rule Mining in a Retail Store

### Problem Statement:
 * What are the items that may be frequently purchased together?

### Objective:
* To learn how the Apriori algorithm and association rules work.
* To learn how combinations and permutations help to compute the support of itemsets and the confidence of rules, respectively.
* To find frequent itemsets with high confidence and lift; placing such items together can help to increase sales.


### Introduction
* Association rule mining is an important data mining technique for knowledge discovery.
* It can reveal correlations between the items that occur together in transaction data.
* Retail store analysis is one application area of association rule mining.
* The estimated strength of the correlation between combined items is new knowledge, which helps decision makers to make their decisions.
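
To make the three measures used throughout this notebook concrete, here is a minimal sketch on a handful of made-up baskets. The baskets and the helper name `support` are assumptions for illustration only and are not taken from the bakery data.

```python
# Hypothetical toy baskets, not taken from the bakery data
baskets = [
    {"Coffee", "Bread"},
    {"Coffee", "Cake"},
    {"Coffee"},
    {"Bread"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

sup_cake = support({"Cake"})            # share of baskets with Cake
sup_coffee = support({"Coffee"})        # share of baskets with Coffee
sup_both = support({"Cake", "Coffee"})  # share of baskets with both

# confidence of the rule Cake -> Coffee: P(Coffee | Cake)
confidence = sup_both / sup_cake
# lift compares that confidence with how common Coffee is overall;
# a value above 1 means the items occur together more often than by chance
lift = confidence / sup_coffee

print(sup_both, confidence, lift)
```

Support looks at an itemset as a whole, while confidence and lift depend on which side of the rule each item is on.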

### Analysis

In [2]:
# Importing Required Library

import pandas as pd
import numpy as np

In [3]:
# Reading the Excel file
# version 1.2.0 of the xlrd package is needed here to read Excel files: pip install xlrd==1.2.0
bread = pd.read_excel('raw_bread.xlsx')

In [4]:
## The transaction data arrives as a single column holding Date,Time,Transaction,Item
## Duplicate rows only record the quantity of an item within the same transaction, so we remove them:
## the Apriori algorithm only cares about which distinct items appear in a particular transaction
bread

Unnamed: 0,"Date,Time,Transaction,Item"
0,"2016-10-30,09:58:11,1,Bread"
1,"2016-10-30,10:05:34,2,Scandinavian"
2,"2016-10-30,10:05:34,2,Scandinavian"
3,"2016-10-30,10:07:57,3,Hot chocolate"
4,"2016-10-30,10:07:57,3,Jam"
...,...
21288,"2017-04-09,14:32:58,9682,Coffee"
21289,"2017-04-09,14:32:58,9682,Tea"
21290,"2017-04-09,14:57:06,9683,Coffee"
21291,"2017-04-09,14:57:06,9683,Pastry"


In [5]:
## dropping Duplicate Transaction
bread = bread.drop_duplicates()

In [6]:
## we need to split the combined column into a tabular structure (one column per field) as follows
new = bread['Date,Time,Transaction,Item'].str.split(',', n = 3, expand = True)
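
The `n = 3` argument limits the split to the first three commas. A small sketch with plain Python strings shows why that matters; the second row is a hypothetical example of an item name that itself contains a comma and is not taken from the data.

```python
# A row in the same shape as the raw data
row = "2016-10-30,09:58:11,1,Bread"
date, time, transaction, item = row.split(",", 3)

# Hypothetical row whose item name contains a comma
tricky = "2016-10-30,09:58:11,1,Tea, green"
parts = tricky.split(",", 3)      # at most 3 splits -> exactly 4 fields
unlimited = tricky.split(",")     # no limit -> the item name is broken apart

print(parts)
print(len(unlimited))
```

With the limit in place, everything after the third comma stays together in the last field.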

In [7]:
import warnings
warnings.filterwarnings('ignore')

In [8]:
## assigning column to data frame "bread"
bread['Date'] = new[0]
bread['Time'] = new[1]
bread['Transaction'] = new[2]
bread['Item'] = new[3]

In [9]:
# in this dataframe we only need the columns Transaction and Item; the rest is not needed for association rule mining
bread[['Date', 'Time', 'Transaction', 'Item']].head(10)
                                                    

Unnamed: 0,Date,Time,Transaction,Item
0,2016-10-30,09:58:11,1,Bread
1,2016-10-30,10:05:34,2,Scandinavian
3,2016-10-30,10:07:57,3,Hot chocolate
4,2016-10-30,10:07:57,3,Jam
5,2016-10-30,10:07:57,3,Cookies
6,2016-10-30,10:08:41,4,Muffin
7,2016-10-30,10:13:03,5,Coffee
8,2016-10-30,10:13:03,5,Pastry
9,2016-10-30,10:13:03,5,Bread
10,2016-10-30,10:16:55,6,Medialuna


In [10]:
# we need to convert the columns Transaction and Item into a crosstab (a binary transaction-by-item matrix) as follows
transaction = pd.crosstab(index= bread['Transaction'], columns= bread['Item'])
transaction

Item,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,...,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
Transaction,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
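
The row order above (1, 10, 100, 1000, ...) is a consequence of the transaction IDs being strings after the split, so they sort as text. The sketch below reproduces this on a tiny hypothetical frame (`toy` is an assumed name); converting the IDs with `astype(int)` is only a suggestion and is not something this notebook does.

```python
import pandas as pd

# Hypothetical mini data set in the same shape as bread[['Transaction', 'Item']]
toy = pd.DataFrame({
    "Transaction": ["1", "2", "2", "10"],
    "Item": ["Bread", "Coffee", "Tea", "Coffee"],
})

# One row per transaction, one column per item, cell = number of occurrences
matrix = pd.crosstab(index=toy["Transaction"], columns=toy["Item"])
print(list(matrix.index))      # string IDs are ordered as text

# Optional: numeric IDs give a numeric row order
numeric = pd.crosstab(index=toy["Transaction"].astype(int), columns=toy["Item"])
print(list(numeric.index))
```

The ordering does not affect the support values, since every row is still counted once.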


In [11]:
## Write the matrix to a csv file to check the result
## there is one unwanted column named "NONE"; we remove it in the next cell before proceeding
tab = transaction.copy()
tab.to_csv('tab.csv')

In [12]:
## removing unwanted col "NONE"
transaction = transaction.drop(['NONE'], axis = 1)

In [13]:
transaction

Item,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,...,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
Transaction,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Creating the APRIORI function to generate frequent itemsets above a minimum support threshold (default 0.04)

In [14]:
def APRIORI(data, min_support = 0.04,  max_length = 4):
    # Collecting Required Library
    import pandas as pd
    from itertools import combinations
    
    support = {}               # maps each frequent itemset (a tuple) to its support
    L = list(data.columns)     # start with every item as a candidate
    k = 0
    # stop when nothing is frequent any more or itemsets of size max_length have been built
    while len(L)>0 and k<max_length:
        C = []
        k = k+1
        # collect the distinct items that still occur in a frequent (k-1)-itemset
        all_item = []
        for candidate in L:
            if isinstance(candidate, str):
                all_item.append(candidate)
            else:
                for item in candidate:
                    if item not in all_item:
                        all_item.append(item)
        # every k-combination of those items is a candidate k-itemset
        for candidate in combinations(all_item , r = k):
            can_list = list(candidate)
            # support = share of transactions that contain every item of the candidate
            s = (data[can_list]==1).all(axis = 1).sum() / data.shape[0]
            if s >= min_support:
                support[candidate] = s
                C.append(can_list)
        L = C
    result = pd.DataFrame(list(support.items()), columns = ["Items", "Support"])
    return result

In [15]:
## finding frequent itemset with min support = 4%
my_freq_itemset = APRIORI(transaction, 0.04, 3)
my_freq_itemset.sort_values(by = 'Support', ascending = False)

Unnamed: 0,Items,Support
2,"(Coffee,)",0.475081
0,"(Bread,)",0.32494
8,"(Tea,)",0.141643
1,"(Cake,)",0.103137
9,"(Bread, Coffee)",0.089393
6,"(Pastry,)",0.08551
7,"(Sandwich,)",0.071346
5,"(Medialuna,)",0.061379
4,"(Hot chocolate,)",0.057916
10,"(Cake, Coffee)",0.054349
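
The level-wise idea behind the search can be sketched without pandas. The baskets and threshold below are hypothetical and chosen only to keep the trace short; the point is that items which fail at size k are not used to build candidates of size k+1.

```python
from itertools import combinations

# Hypothetical toy baskets and threshold, for illustration only
baskets = [
    {"Bread", "Coffee"},
    {"Cake", "Coffee"},
    {"Bread", "Coffee", "Cake"},
    {"Tea"},
]
min_support = 0.5

def support(itemset):
    """Share of baskets that contain every item of `itemset`."""
    return sum(set(itemset) <= basket for basket in baskets) / len(baskets)

items = sorted(set().union(*baskets))   # all distinct items
frequent = {}
k = 1
while items:
    # candidates of size k, built only from items that survived the previous level
    level = {c: support(c) for c in combinations(items, k)}
    level = {c: s for c, s in level.items() if s >= min_support}
    frequent.update(level)
    # an item that is in no frequent k-itemset cannot be in a frequent (k+1)-itemset
    items = sorted({item for c in level for item in c})
    k += 1

for itemset, s in frequent.items():
    print(itemset, s)
```

Here "Tea" drops out after the first level, so no pair containing it is ever counted.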


### Creating the ASSOCIATION_RULE function to generate rules based on a minimum confidence threshold.

In [16]:
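def ASSOCIATION_RULE(df, min_confidence_threshold = 0.5):
    import pandas as pd
    from itertools import permutations
    
    # key the supports by frozenset so that lookups do not depend on item order
    support = {frozenset(items): s for items, s in zip(df.Items, df.Support)}
    data = []
    
    for itemset in df.Items.values:
        seen = set()
        # cutting each permutation at each position gives an antecedent -> consequent split
        # (single items produce no split, so they produce no rule)
        for perm in permutations(itemset):
            for i in range(1, len(perm)):
                left, right = frozenset(perm[:i]), frozenset(perm[i:])
                if left in seen:      # same split already reached through another ordering
                    continue
                seen.add(left)
                ant_sup, con_sup, sup = support[left], support[right], support[left | right]
                conf = sup / ant_sup
                if conf >= min_confidence_threshold:
                    lift = conf / con_sup
                    leverage = sup - ant_sup * con_sup
                    conviction = (1 - con_sup) / (1 - conf) if conf < 1 else float('inf')
                    data.append([tuple(sorted(left)), tuple(sorted(right)), ant_sup, con_sup,
                                 sup, conf, lift, leverage, conviction])
    
    result = pd.DataFrame(data, columns = ["antecedents", "consequents", "antecedent support", "consequent support",
                                        "support", "confidence", "Lift", "Leverage", "Convection"])
    return result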

In [109]:
import pandas as pd
from itertools import permutations
df = my_freq_itemset.copy()
support = pd.Series(df.Support.values, index=df.Items).to_dict()
min_confidence_threshold = 0.5    # scratch value; the function takes this as a parameter
data = []

L = df.Items.values
for rule in L:
    left = []
    right = []
    # cut every permutation at every position to list the antecedent/consequent splits
    for per in permutations(rule):
        for i in range(len(rule)-1):
            # sorted() gives a stable order, so the same split is not added twice
            l , r = sorted(set(per[:i+1])) , sorted(set(per[i+1:]))
            if l not in left:
                left.append(l)
                right.append(r)
    for i in range(len(left)):
        ant_sup = support[tuple(left[i])]
        con_sup = support[tuple(right[i])]
        sup = support[rule]           # support of the whole itemset
        conf = sup/ant_sup            # confidence = support(A and B) / support(A)
        lift = sup/(ant_sup*con_sup)
        if conf>min_confidence_threshold:
            data.append([tuple(left[i]), tuple(right[i]), ant_sup, con_sup, sup, conf, lift])

result = pd.DataFrame(data, columns = ["antecedents", "consequents", "antecedent support",
                                       "consequent support", "support", "confidence", "Lift"])


In [117]:
# the itemset may be stored in either order, so try both
try:
    a = support[tuple(right[0]+left[0])]
except KeyError:
    a = support[tuple(left[0]+right[0])]
a

0.04952261042912601

In [80]:
from itertools import permutations
rule = ['a','b','c','d']
p = permutations(rule)
left = []
right = []
for per in p:
    if per[0]=='b':
        print(per)
    for i in range(len(rule)-1):
        l = list(set(per[:i+1]))
        r = list(set(per[i+1:]))
#        print(setted , l , r)
        if not l in left:
            print(i)
            left.append(l)
            right.append(r) 
dd = pd.DataFrame({'left':left , 'right':right})
dd

0
1
2
2
1
2
1
('b', 'a', 'c', 'd')
0
('b', 'a', 'd', 'c')
('b', 'c', 'a', 'd')
1
2
('b', 'c', 'd', 'a')
2
('b', 'd', 'a', 'c')
1
('b', 'd', 'c', 'a')
0
1
2
1
0


Unnamed: 0,left,right
0,[a],"[d, c, b]"
1,"[a, b]","[d, c]"
2,"[c, a, b]",[d]
3,"[d, a, b]",[c]
4,"[c, a]","[d, b]"
5,"[d, c, a]",[b]
6,"[d, a]","[c, b]"
7,[b],"[d, c, a]"
8,"[c, b]","[d, a]"
9,"[a, c, b]",[d]
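
Rows 2 and 9 of the table above hold the same antecedent in a different order, because comparing lists is order-sensitive. As an alternative sketch (the helper name `rule_splits` is an assumption, not part of the notebook), the splits can be listed with `combinations` over the sorted items, which gives each antecedent exactly once.

```python
from itertools import combinations

def rule_splits(itemset):
    """List every (antecedent, consequent) split of an itemset exactly once."""
    items = sorted(itemset)
    splits = []
    # antecedent sizes 1 .. n-1, so neither side is ever empty
    for r in range(1, len(items)):
        for left in combinations(items, r):
            right = tuple(i for i in items if i not in left)
            splits.append((left, right))
    return splits

splits = rule_splits(["a", "b", "c", "d"])
print(len(splits))
for left, right in splits:
    print(left, "->", right)
```

An itemset of n items has 2^n - 2 such splits, which is 14 for the four items used here.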


In [168]:
can = ('Coffee' , 'Coffee' , 'Bread')
can.count('Coffee')

2

In [16]:
## Rules with minimum confidence = 50%
my_rule = ASSOCIATION_RULE(my_freq_itemset, 0.5)
my_rule
# the column names in the following table mirror those generated by mlxtend
# you can refer to http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ 
# to find the formulas of lift, leverage and conviction (labelled "Convection" in this table)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,Lift,Leverage,Convection
0,"(Cake,)","(Coffee,)",0.103137,0.475081,0.054349,0.526958,1.109196,0.00535,1.109667
1,"(Pastry,)","(Coffee,)",0.08551,0.475081,0.047214,0.552147,1.162216,0.00659,1.172079


### Finally sorting results by Lift to get highly associated itemsets.

In [17]:
my_rule.sort_values(by='Lift', ascending= False).head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,Lift,Leverage,Convection
1,"(Pastry,)","(Coffee,)",0.08551,0.475081,0.047214,0.552147,1.162216,0.00659,1.172079
0,"(Cake,)","(Coffee,)",0.103137,0.475081,0.054349,0.526958,1.109196,0.00535,1.109667


## Cross-verifying the results with apriori and association_rules from mlxtend

In [18]:
# Loading standard package
from mlxtend.frequent_patterns import apriori, association_rules

In [19]:
## finding frequent itemset with min support = 4%
frequent_itemset = apriori(df = transaction, min_support= 0.04, use_colnames= True)
frequent_itemset.sort_values(by = 'support', ascending = False)

Unnamed: 0,support,itemsets
2,0.475081,(Coffee)
0,0.32494,(Bread)
8,0.141643,(Tea)
1,0.103137,(Cake)
9,0.089393,"(Coffee, Bread)"
6,0.08551,(Pastry)
7,0.071346,(Sandwich)
5,0.061379,(Medialuna)
4,0.057916,(Hot chocolate)
10,0.054349,"(Coffee, Cake)"
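
mlxtend stores each itemset as a `frozenset`, which is why a pair can print as (Coffee, Bread) here and as (Bread, Coffee) in the tuple-based table produced by `APRIORI`. A minimal sketch of an order-independent comparison, using two of the values shown in the tables (the dictionary names are assumptions for illustration):

```python
# Two itemsets as they appear in the APRIORI output (tuples) ...
mine = {("Bread", "Coffee"): 0.089393, ("Cake", "Coffee"): 0.054349}
# ... and in the mlxtend output (frozensets)
theirs = {
    frozenset({"Coffee", "Bread"}): 0.089393,
    frozenset({"Coffee", "Cake"}): 0.054349,
}

# Converting the tuple keys to frozensets removes the dependence on item order
normalised = {frozenset(items): s for items, s in mine.items()}
print(normalised == theirs)
```

With real, unrounded supports a tolerance-based comparison would be safer than exact equality.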


### Creating association rules in which the consequent is bought with conditional probability (confidence) of more than 50% given the antecedent

In [20]:
## Rules with minimum confidence = 50%
Rules = association_rules(frequent_itemset, min_threshold= 0.5)
Rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Cake),(Coffee),0.103137,0.475081,0.054349,0.526958,1.109196,0.00535,1.109667
1,(Pastry),(Coffee),0.08551,0.475081,0.047214,0.552147,1.162216,0.00659,1.172079


In [21]:
## Finally sorting results by Lift to get highly associated itemsets.
Rules.sort_values(by='lift', ascending= False).head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1,(Pastry),(Coffee),0.08551,0.475081,0.047214,0.552147,1.162216,0.00659,1.172079
0,(Cake),(Coffee),0.103137,0.475081,0.054349,0.526958,1.109196,0.00535,1.109667


### Conclusion
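 * The results from the functions developed here (APRIORI, ASSOCIATION_RULE) match those of the mlxtend package.
 * At the 4% support threshold used above, the strongest rule is "Pastry" -> "Coffee" (lift 1.16), followed by "Cake" -> "Coffee" (lift 1.11).
 * "Toast" & "Coffee" are reported as highly associated with lift 1.48, but "Toast" does not reach 4% support, so this rule does not appear in the tables above and would need a lower support threshold.
 * Coffee is the most frequently bought item, appearing in 47.5% of all transactions.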