<h1><center>Association rule mining</center></h1>
<h3><center>Mining association rules in Transaction database</center></h3>

Association rule mining is a data mining technique for discovering associations and relationships between variables. Here the associations are mined from a transaction database, which highlights relationships between products, such as which ones are usually bought together.

The transaction_data file contains one dummy column per product category. A 0 means that nothing from that category was bought in that transaction; a number greater than 0 is the quantity bought.

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from mlxtend.frequent_patterns import fpgrowth, apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

In [14]:
df = pd.read_csv('/home/raj/Github/Olist-business-analysis/Joined data/transaction_data.csv')
df.head(5)

Unnamed: 0,customer_unique_id,order_purchase_timestamp,catg_agro_industry_and_commerce,catg_air_conditioning,catg_art,catg_arts_and_craftmanship,catg_audio,catg_auto,catg_baby,catg_bed_bath_table,...,catg_security_and_services,catg_signaling_and_security,catg_small_appliances,catg_small_appliances_home_oven_and_coffee,catg_sports_leisure,catg_stationery,catg_tablets_printing_image,catg_telephony,catg_toys,catg_watches_gifts
0,0000366f3b9a7992bf8c76cfdf3221e2,2018-05-10 10:56:27,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0000b849f77a49e4a4ce2b2a4ca5be3f,2018-05-07 11:11:27,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0000f46a3911fa3c0805444483337064,2017-03-10 21:05:03,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0000f6ccb0745a6a4b88665a16c9f078,2017-10-12 20:29:41,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0004aac84e0df4da2b147fca70cf8255,2017-11-14 19:45:42,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


The column names are cleaned so that the outputs of the mining algorithm are easier to read: the prefix catg_ is removed and the underscores are replaced with spaces.

In [15]:
cols = df.columns
# Remove the 'catg_' prefix, then turn underscores back into spaces.
cols = cols.str.replace('catg_', '', regex=False)
cols = cols.str.replace('_', ' ', regex=False)
df.columns = cols
df.head(5)

Unnamed: 0,customer unique id,order purchase timestamp,agro industry and commerce,air conditioning,art,arts and craftmanship,audio,auto,baby,bed bath table,...,security and services,signaling and security,small appliances,small appliances home oven and coffee,sports leisure,stationery,tablets printing image,telephony,toys,watches gifts
0,0000366f3b9a7992bf8c76cfdf3221e2,2018-05-10 10:56:27,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0000b849f77a49e4a4ce2b2a4ca5be3f,2018-05-07 11:11:27,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0000f46a3911fa3c0805444483337064,2017-03-10 21:05:03,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0000f6ccb0745a6a4b88665a16c9f078,2017-10-12 20:29:41,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0004aac84e0df4da2b147fca70cf8255,2017-11-14 19:45:42,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


Each row records one order of a customer: the purchase timestamp and the quantity bought in each product category. The aim is to perform rule mining and frequent pattern mining on this dataset.<br>

For static frequent pattern mining, the temporal dimension of the dataset is removed: the orders are grouped by customer unique id only and summed. The result is then converted to binary columns, since the actual quantity is not important.
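
The aggregation described above can be sketched in plain Python to show what happens to the rows of each customer. The order rows and the `to_static_baskets` helper below are made up for illustration and are not taken from transaction_data.csv.

```python
from collections import defaultdict

# Hypothetical order rows: (customer id, {category: quantity}).
orders = [
    ("c1", {"toys": 2, "auto": 0}),
    ("c1", {"toys": 0, "auto": 1}),
    ("c2", {"toys": 0, "auto": 3}),
]

def to_static_baskets(rows):
    """Sum quantities per customer, then keep only a 0/1 purchase flag."""
    totals = defaultdict(lambda: defaultdict(int))
    for customer, quantities in rows:
        for category, qty in quantities.items():
            totals[customer][category] += qty
    return {
        customer: {cat: int(total > 0) for cat, total in cats.items()}
        for customer, cats in totals.items()
    }

baskets = to_static_baskets(orders)
print(baskets)
```

Customer c1 ends up with a 1 for both categories even though the purchases happened in different orders, which is exactly the information the static view keeps and the order in time it discards.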

In [18]:
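# Drop the timestamp so that only the customer id and the category counts remain.
static_data = df.drop('order purchase timestamp', axis=1)
# Collapse all orders of a customer into a single row of summed quantities.
static_data = static_data.groupby('customer unique id').sum().reset_index()

# Binarize: 1 if the customer ever bought from the category, else 0.
for col in static_data.columns[1:]:
    static_data[col] = (static_data[col] > 0).astype(int)

static_data.head(5)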

Unnamed: 0,customer unique id,agro industry and commerce,air conditioning,art,arts and craftmanship,audio,auto,baby,bed bath table,books general interest,...,security and services,signaling and security,small appliances,small appliances home oven and coffee,sports leisure,stationery,tablets printing image,telephony,toys,watches gifts
0,0000366f3b9a7992bf8c76cfdf3221e2,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0000b849f77a49e4a4ce2b2a4ca5be3f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0000f46a3911fa3c0805444483337064,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0000f6ccb0745a6a4b88665a16c9f078,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0004aac84e0df4da2b147fca70cf8255,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


FP-Growth is an efficient and scalable frequent pattern mining method. It uses an extended prefix tree structure, called an FP-tree, to store frequent patterns in compressed form.<br>
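
To make the prefix tree idea concrete, here is a minimal sketch of how transactions might be inserted into such a tree. It is an illustration only, not the mlxtend implementation: the `FPNode` class and `build_fp_tree` function are hypothetical names, the baskets are invented, and the header table and node links that a full FP-tree uses for mining are left out.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item, a count, and child nodes."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    """Insert each transaction into a prefix tree after dropping rare items."""
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in item_counts.items() if c >= min_count}
    root = FPNode()
    for t in transactions:
        # Most frequent items first (ties broken by name) so prefixes overlap.
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-item_counts[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

# Hypothetical baskets of product categories.
transactions = [
    ["bed bath table", "housewares"],
    ["bed bath table", "housewares", "toys"],
    ["bed bath table", "auto"],
]
tree = build_fp_tree(transactions, min_count=2)
bed = tree.children["bed bath table"]
print(bed.count, bed.children["housewares"].count)
```

All three baskets share the single "bed bath table" node, and two of them share the "housewares" node below it, so the shared prefixes are stored once with a count instead of once per transaction.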

The mining works in two steps. The first step is to find all the frequent patterns in the transaction database. A pattern is said to be frequent if its support is at least the chosen threshold. Since the available data covers about 2 years, between 2016 and 2018, and the company started complete country-wide operations in 2016, the threshold is set relatively low: roughly 2 customers, passed to fpgrowth as the fraction min_support = 0.00002.<br>
The second step is generating association rules from the frequent pattern sets found previously.
These association rules are generated based on a threshold for confidence. The confidence of a rule is the estimated probability that the consequent products are bought, given that the antecedent products were bought.
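
As a small worked example of the two measures, the sketch below computes support and confidence by hand on five invented baskets. The baskets and the helper names `support` and `confidence` are assumptions made for illustration; the notebook itself relies on mlxtend for these numbers.

```python
# Hypothetical baskets, one set of categories per customer.
baskets = [
    {"housewares", "bed bath table"},
    {"housewares"},
    {"bed bath table"},
    {"housewares", "bed bath table", "toys"},
    {"toys"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

s = support({"housewares", "bed bath table"}, baskets)
c = confidence({"housewares"}, {"bed bath table"}, baskets)
print(s, c)
```

Two of the five baskets contain both categories, so the support of the pair is 2/5. Three baskets contain housewares, so the confidence of the rule housewares -> bed bath table is 2/3.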

In [5]:
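# Mine frequent itemsets; min_support is a fraction of all customers.
patterns = fpgrowth(static_data.iloc[:, 1:], min_support=0.00002, use_colnames=True)

# Keep the 10 most frequent itemsets and convert support from a fraction to a count.
frequent_sets = patterns.sort_values('support', ascending=False).head(10).copy()
frequent_sets['support'] = frequent_sets['support'] * len(static_data)
frequent_sets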

Unnamed: 0,support,itemsets
0,9145.0,(bed bath table)
1,8678.0,(health beauty)
5,7515.0,(sports leisure)
8,6557.0,(computers accessories)
22,6317.0,(furniture decor)
14,5821.0,(housewares)
16,5547.0,(watches gifts)
3,4152.0,(telephony)
21,3852.0,(auto)
11,3844.0,(toys)


The patterns are returned as a dataframe. The 10 most frequent product categories are shown above.<br>
Note: The support does not show the quantity of products sold, but the number of unique customers who bought from the category.<br>

Rules are generated from these patterns. Since the transactions are very sparse and there are about 70 categories, the confidence threshold is also set low, at 0.10. This means that every rule kept has a confidence of at least 10%: at least 10% of the customers who bought the antecedent categories also bought the consequent.

In [6]:
rules = association_rules(patterns, metric= 'confidence', min_threshold= 0.1)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,"(garden tools, housewares)",(bed bath table),0.000252,0.095839,3.1e-05,0.125,1.304265,7.334451e-06,1.033326
1,"(garden tools, bed bath table)",(housewares),0.000168,0.061004,3.1e-05,0.1875,3.07357,2.121082e-05,1.155687
2,"(housewares, fashion bags accessories)",(bed bath table),0.000115,0.095839,2.1e-05,0.181818,1.897112,9.911613e-06,1.105085
3,"(fashion bags accessories, bed bath table)",(housewares),0.000115,0.061004,2.1e-05,0.181818,2.980431,1.392744e-05,1.147662
4,(home confort),(bed bath table),0.00415,0.095839,0.000566,0.136364,1.422834,0.0001681784,1.046923
5,"(home confort, furniture decor)",(bed bath table),9.4e-05,0.095839,2.1e-05,0.222222,2.318693,1.19204e-05,1.162492
6,"(housewares, furniture decor)",(bed bath table),0.000629,0.095839,7.3e-05,0.116667,1.217314,1.309613e-05,1.023578
7,"(housewares, bed bath table)",(furniture decor),0.000702,0.066202,7.3e-05,0.104478,1.578163,2.687552e-05,1.042741
8,"(computers accessories, housewares)",(bed bath table),0.00021,0.095839,2.1e-05,0.1,1.043412,8.720506e-07,1.004623
9,"(computers accessories, bed bath table)",(housewares),0.00021,0.061004,2.1e-05,0.1,1.639237,8.173552e-06,1.043329


The rules show which product category is likely to be bought (consequents) if the customer has already bought a set of categories (antecedents), with at least 10% probability.<br>
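
The remaining columns of the rules table can be derived from the three support values. The sketch below recomputes them for rule 4 above, (home confort) -> (bed bath table), using the standard definitions of confidence, lift, leverage and conviction. The inputs are the rounded values printed in the table, so the results should agree with the table only up to rounding.

```python
# Values copied from rule 4 above: (home confort) -> (bed bath table).
antecedent_support = 0.00415
consequent_support = 0.095839
rule_support = 0.000566

# confidence: share of antecedent buyers who also bought the consequent.
confidence = rule_support / antecedent_support
# lift: how much more likely the consequent is, compared to its base rate.
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence.
leverage = rule_support - antecedent_support * consequent_support
# conviction: ratio of expected to observed frequency of the rule failing.
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 3), round(conviction, 3))
```

A lift above 1, as here, suggests that customers who bought home confort items bought bed bath table items more often than customers in general.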

These rules can be used to understand buying patterns and make recommendations to customers accordingly. They also help sellers and sales executives understand market demand and the relationships between the products sold.

In [7]:
rules.to_csv('/home/raj/Github/Olist-business-analysis/Generated data/assocaition_rules_category.csv')

The above rule mining and frequent pattern mining was done on product categories and not on the products themselves. Although it shows relationships between product categories, it does not give a granular view of the products themselves.<br>

For this, rule mining is carried out again on the product ids, to find relationships between individual products. Since rules were found for product categories, the products are expected to follow similar patterns.

In [8]:
df = pd.read_csv('/home/raj/Github/Olist-business-analysis/Joined data/customer_order.csv')
df.head(5)

Unnamed: 0,order_id,order_purchase_timestamp,order_item_id,product_id,price,freight_value,customer_unique_id,product_category_name_english
0,e481f51cbdc54678b7cc49136f2d6af7,2017-10-02 10:56:33,1,87285b34884572647811a353c7ac498a,29.99,8.72,7c396fd4830fd04220f754e42b4e5bff,housewares
1,128e10d95713541c87cd1a2e48201934,2017-08-15 18:29:31,1,87285b34884572647811a353c7ac498a,29.99,7.78,3a51803cc0d012c3b5dc8b7528cb05f7,housewares
2,0e7e841ddf8f8f2de2bad69267ecfbcf,2017-08-02 18:24:47,1,87285b34884572647811a353c7ac498a,29.99,7.78,ef0996a1a279c26e7ecbd737be23d235,housewares
3,bfc39df4f36c3693ff3b63fcbea9e90a,2017-10-23 23:26:46,1,87285b34884572647811a353c7ac498a,29.99,14.1,e781fdcc107d13d865fc7698711cc572,housewares
4,53cdb2fc8bc7dce0b6741e2150273451,2018-07-24 20:41:37,1,595fac2a385ac33a80bd5114aec74eb8,118.7,22.76,af07308b275d755c9edb36a90c618231,perfumery


As before, products are grouped by customer_unique_id to get all products bought by each customer as a list. Static mining is performed here as well.

In [9]:
def get_prod_list(products):
    """Return the distinct product ids in one customer's group as a list."""
    return list(products.unique())

txns = df[[
    'customer_unique_id', 'product_id'
]].groupby('customer_unique_id').agg(get_prod_list).reset_index().product_id

txns = list(txns)
txns[:5]

[['372645c7439f9661fbbacfd129aa92ec'],
 ['5099f7000472b634fea8304448d20825'],
 ['64b488de448a5324c4134ea39c28a34b'],
 ['2345a354a6f2033609bbf62bf5be9ef6'],
 ['c72e18b3fe2739b8d24ebf3102450f37']]

The transactions are currently lists of product ids. For the algorithm, they have to be encoded, so TransactionEncoder is used to turn each product into a boolean column that marks which customers' transactions contain it.
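
The encoding step can be pictured with a few lines of plain Python. This is a sketch of the idea only, not how TransactionEncoder is implemented; the function name `one_hot_encode` and the shortened product ids are hypothetical.

```python
def one_hot_encode(transactions):
    """Return (sorted item names, boolean rows), one row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        rows.append([col in present for col in columns])
    return columns, rows

# Hypothetical product ids, shortened for readability.
txns_demo = [["p2"], ["p1", "p3"], ["p1"]]
columns, rows = one_hot_encode(txns_demo)
print(columns)
print(rows)
```

Each row has one boolean per distinct product, which is why the real encoded table is very wide and almost entirely False.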

In [10]:
tEncoder = TransactionEncoder()
txn_array = tEncoder.fit(txns).transform(txns)
txn_data = pd.DataFrame(txn_array, columns= tEncoder.columns_)
txn_data.head(5)

Unnamed: 0,00066f42aeeb9f3007548bb9d3f33c38,00088930e925c41fd95ebfe695fd2655,0009406fd7479715e4bef61dd91f2462,000b8f95fcb9e0096488278317764d19,000d9be29b5207b54e86aa1b1ac54872,0011c512eb256aa0dbbb544d8dffcf6e,00126f27c813603687e6ce486d909d01,001795ec6f1b187d37335e1c4704762e,001b237c0e9bb435f2e54071129237e9,001b72dfd63e9833e8c02742adf472e3,...,ffef256879dbadcab7e77950f4f4a195,fff0a542c3c62682f23305214eaeaa24,fff1059cd247279f3726b7696c66e44e,fff28f91211774864a1000f918ed00cc,fff515ea94dbf35d54d256b3e39f0fea,fff6177642830a9a94a0f2cba5e476d1,fff81cc3158d2725c0655ab9ba0f712c,fff9553ac224cec9d15d49f5a263411f,fffdb2d0ec8d6a61f0a0a0db3f25b441,fffe9eeff12fcbd74a2f2b007dde0c58
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


FP-Growth is applied to the transaction database, followed by rule generation, as before. The frequent patterns are found using a low support threshold of roughly 3 customers, passed as the fraction min_support = 0.00003.
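
Because fpgrowth expects the support threshold as a fraction, a threshold stated as a number of transactions has to be divided by the total number of transactions. A minimal sketch of that conversion follows; the helper name `min_support_fraction` and the round figure of 100,000 customers are assumptions for illustration, and in the notebook the real size would be len(txns).

```python
def min_support_fraction(min_count, n_transactions):
    """Convert an absolute transaction count into a fractional support."""
    if n_transactions <= 0:
        raise ValueError("n_transactions must be positive")
    return min_count / n_transactions

# Hypothetical size; the real value is len(txns) in the notebook.
n_customers = 100_000
print(min_support_fraction(3, n_customers))
```

Deriving the fraction from the data size this way keeps the threshold meaningful if the dataset grows or is filtered.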

In [11]:
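# Mine frequent product sets; min_support is a fraction of all customers.
patterns = fpgrowth(txn_data, min_support=0.00003, use_colnames=True)

# Keep the 10 most frequent itemsets and convert support from a fraction to a count.
frequent_sets = patterns.sort_values('support', ascending=False).head(10).copy()
frequent_sets['support'] = frequent_sets['support'] * len(txns)
frequent_sets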

Unnamed: 0,support,itemsets
210,466.0,(99a4788cb24856965c36a24e339b6058)
248,430.0,(aca2eb7d00ea1a7b8ebd4e68314663af)
131,349.0,(422879e10f46682990de24d770e7f83d)
230,322.0,(d1c427060a0f73f6b889a5c7c61f2ac4)
214,310.0,(389d119b48cf3043d311335e499d9c6b)
552,303.0,(53b36df67ebb7c41585e8d54d6772e08)
58,289.0,(368c6c730842d78016ad823897a372db)
218,283.0,(53759a2ecddad2bb87a079a1f1519f73)
50,268.0,(154e7e31ebfa092203795c972e5804a6)
175,257.0,(2b4609f8948be18874494203496bc318)


The 10 most frequent products are shown above along with their support.
The patterns are now used to generate association rules with a minimum confidence of 65%.

In [12]:
rules = association_rules(patterns, metric= 'confidence', min_threshold= 0.65)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(058b372f969b59e8c4a27e224243909c),(fb55982be901439613a95940feefd9ee),7.3e-05,0.000859,5.2e-05,0.714286,831.184669,5.2e-05,3.496992
1,(e2cac69b319c0f8a21dbf04b925121bf),(b9900407a55cb2b306ae612415c3340e),7.3e-05,6.3e-05,5.2e-05,0.714286,11359.52381,5.2e-05,3.49978
2,(b9900407a55cb2b306ae612415c3340e),(e2cac69b319c0f8a21dbf04b925121bf),6.3e-05,7.3e-05,5.2e-05,0.833333,11359.52381,5.2e-05,5.99956
3,"(e2cac69b319c0f8a21dbf04b925121bf, 55bfa0307d7...",(b9900407a55cb2b306ae612415c3340e),3.1e-05,6.3e-05,3.1e-05,1.0,15903.333333,3.1e-05,inf
4,"(55bfa0307d7a46bed72c492259921231, b9900407a55...",(e2cac69b319c0f8a21dbf04b925121bf),3.1e-05,7.3e-05,3.1e-05,1.0,13631.428571,3.1e-05,inf
5,"(4025ee582ef6b8c478af3b44cf89054b, f4d705aa95c...",(c211ff3068fcd2f8898192976d8b3a32),3.1e-05,0.000325,3.1e-05,1.0,3078.064516,3.1e-05,inf
6,"(4025ee582ef6b8c478af3b44cf89054b, c211ff3068f...",(f4d705aa95ccca448e5b0deb6e5290ba),3.1e-05,0.000252,3.1e-05,1.0,3975.833333,3.1e-05,inf
7,"(f4d705aa95ccca448e5b0deb6e5290ba, c211ff3068f...",(4025ee582ef6b8c478af3b44cf89054b),4.2e-05,0.000168,3.1e-05,0.75,4472.8125,3.1e-05,3.999329
8,(1dc7685f4fdb9622d84ae2ec658d5bbf),(e256d05115f9eb3766f3ab752132a4e2),9.4e-05,8.4e-05,6.3e-05,0.666667,7951.666667,6.3e-05,2.999748
9,(e256d05115f9eb3766f3ab752132a4e2),(1dc7685f4fdb9622d84ae2ec658d5bbf),8.4e-05,9.4e-05,6.3e-05,0.75,7951.666667,6.3e-05,3.999623


These generated rules show which products are likely to be bought (consequents) if the customer has already purchased a set of products (antecedents), with at least 65% probability. These rules can be used to understand the relationships between products and the demand among customers. They can also be used to recommend a product to a customer based on what the customer is currently buying.
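
One way such rules might drive a recommendation is a simple lookup: find every rule whose antecedent is contained in the current basket and suggest its consequent. The sketch below assumes the rules have been reduced to (antecedent, consequent, confidence) tuples; the `recommend` function and the toy rules are hypothetical and not part of the notebook.

```python
# Hypothetical rules: (antecedent, consequent, confidence).
toy_rules = [
    (frozenset({"p1"}), frozenset({"p2"}), 0.71),
    (frozenset({"p1", "p3"}), frozenset({"p4"}), 1.00),
    (frozenset({"p5"}), frozenset({"p1"}), 0.67),
]

def recommend(basket, rules):
    """Suggest items whose rule antecedent is fully contained in the basket."""
    basket = frozenset(basket)
    best = {}
    for antecedent, consequent, conf in rules:
        if antecedent <= basket:
            # Skip items the customer already has in the basket.
            for item in consequent - basket:
                best[item] = max(conf, best.get(item, 0.0))
    # Highest-confidence suggestions first, ties broken by item name.
    return sorted(best, key=lambda item: (-best[item], item))

print(recommend({"p1", "p3"}, toy_rules))
```

Ranking by confidence is only one possible choice; with supports as low as those in the table above, ranking by lift or requiring a minimum support count may be worth considering.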

In [13]:
rules.to_csv('/home/raj/Github/Olist-business-analysis/Generated data/assocaition_rules_product.csv')

This concludes the association rule mining.