# Association Analysis

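Association analysis is the task of finding interesting relationships in large datasets, usually expressed as frequent itemsets (items that often occur together) and association rules (for example, customers who buy Milk and Meat also tend to buy Cheese).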

Apriori is an algorithm for frequent item set mining and association rule learning over relational databases.
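
To make the idea concrete, here is a minimal sketch of the level-wise search that Apriori performs. It relies on the Apriori property: every subset of a frequent itemset must itself be frequent. The toy transactions and the helper names `support` and `apriori_sketch` are assumptions made for illustration; this is not the mlxtend implementation used later in the notebook.

```python
from itertools import combinations

# Toy transactions (hypothetical data, not the retail dataset)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Cheese", "Eggs"},
    {"Milk", "Cheese", "Eggs"},
    {"Bread", "Milk", "Cheese"},
    {"Bread", "Milk", "Eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori_sketch(transactions, min_support):
    """Level-wise search: frequent k-itemsets seed the (k+1)-item candidates."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates, then prune any
        # candidate with an infrequent k-item subset (the Apriori property)
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

frequent_itemsets = apriori_sketch(transactions, min_support=0.6)
for itemset, s in sorted(frequent_itemsets.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is what keeps the search tractable: a candidate is only counted against the data if all of its smaller subsets already passed the threshold.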

We will work with a dataset in which each row lists the items that were purchased together on the same day at the same store. It is a sparse dataset.
The dataset can be found here: https://gist.github.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751



In [122]:
# Import Libraries
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt

In [123]:
# Read the dataset 
df = pd.read_csv('retail_dataset.csv', sep=',')
print(len(df)) #315 total transactions/customers

# Print the first 5 rows
df.head(5)

315


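|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | Bread | Wine | Eggs | Meat | Cheese | Pencil | Diaper |
| 1 | Bread | Cheese | Meat | Diaper | Wine | Milk | Pencil |
| 2 | Cheese | Meat | Eggs | Milk | Wine | NaN | NaN |
| 3 | Cheese | Meat | Eggs | Milk | Wine | NaN | NaN |
| 4 | Meat | Pencil | Wine | NaN | NaN | NaN | NaN |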


In [124]:
# Find the unique items in the table
items = set()
for col in df:
    items.update(df[col].unique())
print(items)

{'Diaper', nan, 'Bread', 'Pencil', 'Bagel', 'Meat', 'Eggs', 'Wine', 'Milk', 'Cheese'}


## Data Preprocessing

The `apriori` function from mlxtend requires a DataFrame whose values are either 0/1 or True/False. Our data consists entirely of strings (the item names), so we need to one-hot encode it: one column per item, with a 1 wherever the transaction contains that item.

In [125]:
itemset = set(items)
encoded_vals = []
for index, row in df.iterrows():
    rowset = set(row)
    labels = {}
    uncommons = list(itemset - rowset)            # items absent from this transaction
    commons = list(itemset.intersection(rowset))  # items present in this transaction
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)
# Note: the NaN padding is treated as an item here, so the result has a NaN column
ohe_df = pd.DataFrame(encoded_vals)

In [126]:
ohe_df

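|   | NaN | Bagel | Milk | Diaper | Bread | Pencil | Meat | Wine | Eggs | Cheese |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 4 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 310 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 311 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 312 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 313 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |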
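
The NaN column in the encoded table comes from the empty cells that pad shorter transactions, which the encoding loop treats as an item. One possible way to avoid this, sketched here on a hypothetical three-row sample in the same padded layout, is to skip missing cells while encoding. The names `sample`, `item_names` and `ohe_clean` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample in the same padded layout as the retail data
sample = pd.DataFrame([
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", np.nan],
    ["Meat", np.nan, np.nan],
])

# Collect the distinct items, ignoring the NaN padding
item_names = sorted({x for x in sample.values.ravel() if pd.notna(x)})

# Build one row of True/False flags per transaction
rows = []
for _, row in sample.iterrows():
    basket = set(row.dropna())
    rows.append({item: item in basket for item in item_names})

ohe_clean = pd.DataFrame(rows, columns=item_names)
print(ohe_clean)
```

Applied to the full dataset, an encoding like this would leave out itemsets and rules that contain `nan`, so the supports reported below for real items would stay the same while the `nan` entries would disappear.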


Generate the frequent itemsets that have a support of at least 10%, that is, itemsets that appear in at least 32 of the 315 transactions (the threshold is a choice made for this exercise).

Generate the rules with their corresponding support, confidence and lift.

In [127]:
%%time
# Applying apriori
freq_items = apriori(ohe_df, min_support=0.1, use_colnames=True)

print(freq_items)
# Mining association rules
apriori_rules = association_rules(freq_items, metric="confidence", min_threshold=0.5)
apriori_rules.head()

      support                         itemsets
0    0.869841                            (nan)
1    0.425397                          (Bagel)
2    0.501587                           (Milk)
3    0.406349                         (Diaper)
4    0.504762                          (Bread)
..        ...                              ...
141  0.101587       (Wine, Milk, Meat, Cheese)
142  0.152381       (Eggs, Milk, Meat, Cheese)
143  0.104762       (Eggs, Wine, Milk, Cheese)
144  0.111111       (Eggs, Wine, Meat, Cheese)
145  0.117460  (nan, Meat, Eggs, Milk, Cheese)

[146 rows x 2 columns]
CPU times: total: 0 ns
Wall time: 12 ms




|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Bagel) | (nan) | 0.425397 | 0.869841 | 0.336508 | 0.791045 | 0.909413 | -0.03352 | 0.622902 | -0.147743 |
| 1 | (Milk) | (nan) | 0.501587 | 0.869841 | 0.409524 | 0.816456 | 0.938626 | -0.026778 | 0.709141 | -0.115976 |
| 2 | (Diaper) | (nan) | 0.406349 | 0.869841 | 0.31746 | 0.78125 | 0.898152 | -0.035999 | 0.595011 | -0.160381 |
| 3 | (Bread) | (nan) | 0.504762 | 0.869841 | 0.396825 | 0.786164 | 0.903801 | -0.042237 | 0.608683 | -0.176903 |
| 4 | (Pencil) | (nan) | 0.361905 | 0.869841 | 0.266667 | 0.736842 | 0.8471 | -0.048133 | 0.494603 | -0.220499 |


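The **confidence** of a rule A -> C is the fraction of the transactions containing A that also contain C, that is, support(A and C together) / support(A).

The **lift** gives us the strength of the association: it is the confidence divided by support(C). A lift above 1 means that A and C occur together more often than would be expected if they were independent, and a lift below 1 means they occur together less often.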

In [128]:
apriori_rules[ (apriori_rules['lift'] >= 1.5) &
      (apriori_rules['confidence'] >= 0.7) ]

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 85 | (Bagel, Milk) | (Bread) | 0.225397 | 0.504762 | 0.171429 | 0.760563 | 1.506777 | 0.057657 | 2.068347 | 0.434199 |
| 119 | (Eggs, Milk) | (Meat) | 0.244444 | 0.47619 | 0.177778 | 0.727273 | 1.527273 | 0.061376 | 1.920635 | 0.456933 |
| 121 | (Milk, Meat) | (Eggs) | 0.244444 | 0.438095 | 0.177778 | 0.727273 | 1.660079 | 0.070688 | 2.060317 | 0.526261 |
| 122 | (Milk, Meat) | (Cheese) | 0.244444 | 0.501587 | 0.203175 | 0.831169 | 1.657077 | 0.080564 | 2.952137 | 0.524816 |
| 131 | (Eggs, Milk) | (Cheese) | 0.244444 | 0.501587 | 0.196825 | 0.805195 | 1.605293 | 0.074215 | 2.558519 | 0.499051 |
| 180 | (Pencil, Eggs) | (Wine) | 0.165079 | 0.438095 | 0.120635 | 0.730769 | 1.66806 | 0.048314 | 2.087075 | 0.479688 |
| 193 | (Eggs, Meat) | (Cheese) | 0.266667 | 0.501587 | 0.215873 | 0.809524 | 1.613924 | 0.082116 | 2.616667 | 0.518717 |
| 194 | (Eggs, Cheese) | (Meat) | 0.298413 | 0.47619 | 0.215873 | 0.723404 | 1.519149 | 0.073772 | 1.893773 | 0.487091 |
| 207 | (nan, Milk, Meat) | (Eggs) | 0.168254 | 0.438095 | 0.12381 | 0.735849 | 1.679655 | 0.050098 | 2.127211 | 0.486494 |
| 210 | (nan, Milk, Meat) | (Cheese) | 0.168254 | 0.501587 | 0.146032 | 0.867925 | 1.730356 | 0.061638 | 3.773696 | 0.507468 |
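
To make the metric columns concrete, the short sketch below recomputes confidence, lift, leverage and conviction for the (Milk, Meat) -> (Cheese) row of the table above, using only its three support values. The inputs are the rounded figures shown in the table, so the results agree with the reported columns only to about four decimal places.

```python
# Values copied from the (Milk, Meat) -> (Cheese) row of the rules table
antecedent_support = 0.244444   # support(Milk, Meat)
consequent_support = 0.501587   # support(Cheese)
rule_support = 0.203175         # support(Milk, Meat, Cheese)

# confidence: how often Cheese appears among transactions with Milk and Meat
confidence = rule_support / antecedent_support
# lift: confidence relative to the baseline frequency of Cheese
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence
leverage = rule_support - antecedent_support * consequent_support
# conviction: ratio of expected to observed frequency of the rule failing
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.4f}")
```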


# TASKS

1. Execute the association analysis using the **fpgrowth** and **ECLAT** algorithms on the same dataset. 
2. For the above 2 algorithms, find the following:
  
  a. rate of Milk, Meat and Cheese being purchased together.
  
  b. percentage of customers who buy Eggs, Meat and Cheese. 

3. Compute the overall time for the association analysis.
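
For task 3, `%%time` reports a single run, which can be too coarse for cells that finish in a few milliseconds. One possible alternative is to time repeated calls with `time.perf_counter` and keep the best run. The helper `time_call` below is a hypothetical name introduced for this sketch, and the workload is a stand-in rather than one of the mining functions.

```python
import time

def time_call(func, *args, repeats=5, **kwargs):
    """Return (best wall-clock seconds over `repeats` runs, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in workload; in the notebook this could instead be a call such as
# time_call(apriori, ohe_df, min_support=0.1, use_colnames=True)
elapsed, total = time_call(sum, range(100_000))
print(f"best of 5 runs: {elapsed * 1000:.3f} ms")
```

Taking the minimum over several runs reduces the influence of one-off delays such as imports or caching on the first call.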

In [129]:
#method 1: fpgrowth
from mlxtend.frequent_patterns import fpgrowth

In [183]:
%%time
te_df = ohe_df.astype(bool)  # convert the one-hot encoded dataframe to True/False

te_df2 = fpgrowth(te_df, min_support=0.1, use_colnames=True)  # mine the frequent itemsets
print(te_df2[20:30])
print(te_df2[30:40], '\n')

# Select each itemset by its contents instead of by a hard-coded row position
milk_meat_cheese = frozenset({'Milk', 'Meat', 'Cheese'})
print(milk_meat_cheese)
mmc = te_df2.loc[te_df2['itemsets'] == milk_meat_cheese]
print(mmc)

# support * number of transactions = number of transactions containing all three items
print("\nrate of milk, meat and cheese being purchased together:", mmc['support'].iloc[0] * len(te_df), '\n')


eggs_meat_cheese = frozenset({'Eggs', 'Meat', 'Cheese'})
print(eggs_meat_cheese)
emc = te_df2.loc[te_df2['itemsets'] == eggs_meat_cheese]
print(emc)

# the support is the fraction of transactions (customers) containing all three items
print("\npercentage of customers who buy eggs, meat and cheese:", emc['support'].iloc[0], '\n')

     support                   itemsets
20  0.104762        (Bread, Milk, Meat)
21  0.120635         (nan, Bread, Meat)
22  0.203175       (Milk, Meat, Cheese)
23  0.168254          (nan, Milk, Meat)
24  0.146032  (nan, Milk, Meat, Cheese)
25  0.298413             (Eggs, Cheese)
26  0.266667               (Eggs, Meat)
27  0.187302              (Eggs, Bread)
28  0.336508                (Eggs, nan)
29  0.244444               (Eggs, Milk)
     support                   itemsets
30  0.219048        (Eggs, nan, Cheese)
31  0.215873       (Eggs, Meat, Cheese)
32  0.187302          (Eggs, nan, Meat)
33  0.155556  (Eggs, nan, Meat, Cheese)
34  0.117460      (Eggs, Bread, Cheese)
35  0.104762        (Bread, Eggs, Milk)
36  0.107937         (Eggs, nan, Bread)
37  0.196825       (Eggs, Milk, Cheese)
38  0.177778         (Eggs, Milk, Meat)
39  0.174603          (Eggs, nan, Milk) 

frozenset({'Milk', 'Meat', 'Cheese'})
     support              itemsets
22  0.203175  (Milk, Meat, Cheese)

rate of m

In [131]:
#method 2: eclat
from pyECLAT import ECLAT

In [184]:
%%time

df2 = df.values.tolist() #convert df to list
df3 = pd.DataFrame(df2) #back to a dataframe
print(df3)

min_n_products = 3 # smallest itemset size to evaluate
min_support = .001 # low minimum support so that nearly all itemsets are kept
max_length = 3  # largest itemset size to evaluate, so only 3-item combinations are mined

my_eclat = ECLAT(data=df3, verbose=True) #association algorithm

#fit the model
rule_indices, rule_supports = my_eclat.fit(min_support=min_support, min_combination=min_n_products, max_combination=max_length)

print('\n', len(rule_supports))
print(rule_supports)

#'Meat & Milk & Cheese support: 0.20317460317460317'
print('\n\nrate of milk, meat and cheese being purchased together: ', (0.20317460317460317)*315)
#'Meat & Eggs & Cheese support: 0.21587301587301588'
print('percentage of customers who buy eggs, meat and cheese:', (0.21587301587301588), '\n')

          0       1       2       3       4       5       6
0     Bread    Wine    Eggs    Meat  Cheese  Pencil  Diaper
1     Bread  Cheese    Meat  Diaper    Wine    Milk  Pencil
2    Cheese    Meat    Eggs    Milk    Wine     NaN     NaN
3    Cheese    Meat    Eggs    Milk    Wine     NaN     NaN
4      Meat  Pencil    Wine     NaN     NaN     NaN     NaN
..      ...     ...     ...     ...     ...     ...     ...
310   Bread    Eggs  Cheese     NaN     NaN     NaN     NaN
311    Meat    Milk  Pencil     NaN     NaN     NaN     NaN
312   Bread  Cheese    Eggs    Meat  Pencil  Diaper    Wine
313    Meat  Cheese     NaN     NaN     NaN     NaN     NaN
314    Eggs    Wine   Bagel   Bread    Meat     NaN     NaN

[315 rows x 7 columns]


100%|██████████| 9/9 [00:00<00:00, 475.02it/s]
100%|██████████| 9/9 [00:00<?, ?it/s]
100%|██████████| 9/9 [00:00<00:00, 3009.07it/s]


Combination 3 by 3


84it [00:00, 277.94it/s]


 84
{'Diaper & Bread & Pencil': 0.10476190476190476, 'Diaper & Bread & Bagel': 0.12063492063492064, 'Diaper & Bread & Meat': 0.12063492063492064, 'Diaper & Bread & Eggs': 0.08888888888888889, 'Diaper & Bread & Wine': 0.1492063492063492, 'Diaper & Bread & Milk': 0.11428571428571428, 'Diaper & Bread & Cheese': 0.1365079365079365, 'Diaper & Pencil & Bagel': 0.08253968253968254, 'Diaper & Pencil & Meat': 0.08253968253968254, 'Diaper & Pencil & Eggs': 0.08253968253968254, 'Diaper & Pencil & Wine': 0.10793650793650794, 'Diaper & Pencil & Milk': 0.07301587301587302, 'Diaper & Pencil & Cheese': 0.08888888888888889, 'Diaper & Bagel & Meat': 0.10476190476190476, 'Diaper & Bagel & Eggs': 0.06984126984126984, 'Diaper & Bagel & Wine': 0.10158730158730159, 'Diaper & Bagel & Milk': 0.07936507936507936, 'Diaper & Bagel & Cheese': 0.10793650793650794, 'Diaper & Meat & Eggs': 0.08571428571428572, 'Diaper & Meat & Wine': 0.12380952380952381, 'Diaper & Meat & Milk': 0.06349206349206349, 'Diaper & Meat & 
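
For intuition about what ECLAT computes, the sketch below shows its core idea on toy data: store the data vertically, as a set of transaction ids per item, and obtain the support of an itemset from the intersection of those sets. The data and the names `tidlists` and `eclat_support` are assumptions made for illustration. A full ECLAT implementation also searches the itemsets depth-first with pruning, which is omitted here, and this is not the pyECLAT code.

```python
from itertools import combinations

# Toy transactions (hypothetical data, not the retail dataset)
transactions = [
    ["Bread", "Milk"],
    ["Bread", "Cheese", "Eggs"],
    ["Milk", "Cheese", "Eggs"],
    ["Bread", "Milk", "Cheese"],
    ["Bread", "Milk", "Eggs"],
]

# Vertical layout: item -> set of ids of the transactions that contain it
tidlists = {}
for tid, basket in enumerate(transactions):
    for item in basket:
        tidlists.setdefault(item, set()).add(tid)

def eclat_support(itemset):
    """Support = size of the intersection of the items' id sets / n."""
    shared = set.intersection(*(tidlists[item] for item in itemset))
    return len(shared) / len(transactions)

# Supports of all 2-item combinations, keyed in the "A & B" style seen above
pair_supports = {
    " & ".join(pair): eclat_support(pair)
    for pair in combinations(sorted(tidlists), 2)
}
print(pair_supports)
```

Because each support comes from a set intersection, the original rows never need to be rescanned once the id sets are built.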


