# Pattern Mining in Supermarkets
### João Vidigal
-----
## General concepts 

Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational data sets.
If we think of the universe as the set of items available at the store, then each item has a Boolean variable representing the presence or absence of that item. Each basket can then be represented by a Boolean vector of values assigned to these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules.
Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. These thresholds can be set by users or domain experts. Additional analysis can be performed to discover interesting statistical correlations between associated items.
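To make the Boolean-vector view concrete, here is a minimal sketch in plain Python. The baskets, item names and thresholds are made up for illustration and are not taken from the Foodmart data.

```python
# Toy baskets; the item names are illustrative, not taken from the dataset.
baskets = [
    {"Soup", "Pasta"},
    {"Soup", "Cheese", "Wine"},
    {"Pasta", "Cheese"},
    {"Soup", "Pasta", "Cheese"},
]

# The "universe" is every item seen in any basket, in a fixed order.
items = sorted(set().union(*baskets))

# One Boolean vector per basket: True where the item is present.
vectors = [[item in basket for item in items] for basket in baskets]


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    cols = [items.index(i) for i in itemset]
    hits = sum(all(row[c] for c in cols) for row in vectors)
    return hits / len(vectors)


def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"Soup", "Pasta"})          # 2 of the 4 baskets
rule_confidence = confidence({"Soup"}, {"Pasta"})  # 2 of the 3 Soup baskets

# A rule is "strong" when it clears both (illustrative) thresholds.
min_support, min_confidence = 0.25, 0.5
is_strong = rule_support >= min_support and rule_confidence >= min_confidence
```

With these toy baskets the rule Soup ⇒ Pasta clears both thresholds, so it would count as a strong rule.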

* **Association rule mining** consists of first finding **frequent itemsets** (sets of items, such as A and B, satisfying a minimum support threshold), from which strong association rules in the form of A ⇒ B are generated. These rules also satisfy a minimum confidence threshold (a prespecified probability of satisfying B under the condition that A is satisfied). Associations can be further analyzed to uncover **correlation rules**, which convey statistical correlations between itemsets A and B.

* Many efficient and scalable algorithms have been developed for **frequent itemset mining**, from which association and correlation rules can be derived. These algorithms can be classified into three categories: (1) Apriori-like algorithms, (2) FP-growth–based algorithms, and (3) algorithms that use the vertical data format.
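As an illustration of the third category, the vertical data format stores, for each item, the set of transaction IDs in which it appears, so the support of an itemset comes from intersecting those sets. The sketch below uses made-up transactions and is not the implementation used later in this notebook.

```python
from itertools import combinations

# Toy horizontal data: transaction ID -> items (illustrative names).
transactions = {
    1: {"Soup", "Pasta"},
    2: {"Soup", "Cheese"},
    3: {"Pasta", "Cheese"},
    4: {"Soup", "Pasta", "Cheese"},
}

# Vertical format: item -> set of transaction IDs (its "tidset").
tidsets = {}
for tid, basket in transactions.items():
    for item in basket:
        tidsets.setdefault(item, set()).add(tid)

n = len(transactions)

# Support of a pair is the size of the intersection of the two tidsets.
pair_support = {
    (a, b): len(tidsets[a] & tidsets[b]) / n
    for a, b in combinations(sorted(tidsets), 2)
}
```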

## 1. Frequent Itemsets and Association Rules: Ignoring Product Quantities and Stores

The dataset to be analysed is **`Foodmart_2000_PD.csv`**. This is a modified version of the [Foodmart 2000(2005) dataset](https://github.com/neo4j-examples/neo4j-foodmart-dataset/tree/master/data). 

**`Foodmart_2000_PD.csv`** stores a set of **69549 transactions** from **24 stores**, where **103 different products** can be bought. Each transaction (row) has a STORE_ID, an integer from 1 to 24, and a list of products (items) together with the quantities bought. For example, in one transaction a customer bought 2 units of pasta and 2 units of soup at store 1.

### 1.1. Load and Preprocess Dataset

 **Product quantities and stores should not be considered.**

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
# Write code in cells like this
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Note: preprocessing steps and repeated code were moved into functions in the file needed_fun.py.
from needed_fun import *

In [3]:
#loading the original data
data = load_transactions('Foodmart_2019_PD.csv')

Number of transactions 69549
Number of unique transactions 103


In [4]:
#loading and preprocessing the data
data = load_trans('Foodmart_2019_PD.csv')

Number of transactions in  All : 69338
Number of unique transactions in All : 103 

Number of transactions in  Small_Gro : 2278
Number of unique transactions in Small_Gro : 103 

Number of transactions in  Sup : 27146
Number of unique transactions in Sup : 103 

Number of transactions in  Gourmet_sup : 5328
Number of unique transactions in Gourmet_sup : 102 

Number of transactions in  Del_Gour_sup : 31251
Number of unique transactions in Del_Gour_sup : 103 

Number of transactions in  Deluxe_sup : 25923
Number of unique transactions in Deluxe_sup : 103 

Number of transactions in  Mid_Size_Groc : 4760
Number of unique transactions in Mid_Size_Groc : 103 



In [5]:
print(f'Percentage of lost transactions due to missing information: {100-((69338/69549)*100)}')

Percentage of lost transactions due to missing information: 0.30338322621462055


An initial look at the data revealed some missing values.
Most of them concerned the *store_id* field, and fortunately they affected less than 1% of the transactions (about 0.3%).
Since the number of unique products stayed at 103 after removal (the loader output labels this count "unique transactions"), I decided not to use any imputation method and simply dropped the rows that lack store_id information.
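The loading helpers live in `needed_fun.py` and are not shown here. As a rough sketch of the preprocessing logic described above, assume the rows have already been parsed into records with a `store_id` and an item-to-quantity mapping; that record structure is my assumption, not the actual file layout.

```python
# Hypothetical parsed records; the real file layout and the helpers in
# needed_fun.py are not shown, so this structure is an assumption.
records = [
    {"store_id": 1, "items": {"Pasta": 2, "Soup": 2}},
    {"store_id": None, "items": {"Soup": 1}},          # missing store
    {"store_id": 4, "items": {"Wine": 1, "Soup": 3}},
]

# Drop rows with no store_id instead of imputing a value.
kept = [r for r in records if r["store_id"] is not None]

# Ignore quantities and stores: each transaction becomes a set of product names.
transactions = [set(r["items"]) for r in kept]

# Share of rows lost, and a check that no product disappeared entirely.
lost_pct = 100 * (len(records) - len(kept)) / len(records)
products_before = {p for r in records for p in r["items"]}
products_after = set().union(*transactions)
missing_products = products_before - products_after
```

An empty `missing_products` corresponds to the observation above that the number of unique products did not change after dropping the incomplete rows.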

----

### 1.2. Compute Frequent Itemsets

In order to compute the frequent itemsets we are going to use the Apriori Algorithm. 

**Apriori** is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules [AS94b]. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.
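The level-wise search can be sketched in a few lines of plain Python. This is a simplified illustration with toy baskets, not the MLxtend implementation used in the cells below; the function name `apriori_sketch` is made up for this example.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for all frequent itemsets (level-wise search)."""
    n = len(transactions)
    min_count = min_support * n

    # L1: one scan to count single items.
    counts = {}
    for basket in transactions:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 1
    while level:
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        # One scan of the data to count the surviving candidates.
        counts = {c: sum(c <= basket for basket in transactions) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1

    return {s: c / n for s, c in frequent.items()}


# Toy baskets (illustrative item names).
baskets = [
    {"Soup", "Pasta"},
    {"Soup", "Cheese"},
    {"Pasta", "Cheese"},
    {"Soup", "Pasta", "Cheese"},
]
result = apriori_sketch(baskets, min_support=0.5)
```

In this toy run, every single item and every pair is frequent, while the 3-itemset appears in only one basket and is discarded, which ends the search.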

**Compute frequent itemsets considering a minimum support of 1%**

In [7]:
All_matrix = data['All']
#encode_df: Encode the transactions matrix into a binary pandas dataframe in order to be used by MLxtend Apriori implementation
All_df = encode_df(All_matrix) 
# Compute frequent itemsets with the apriori algorithm with minimum support of 1%.
frequent_itemsets_all = comp_freqItem(All_df, 0.01) #comp_freqItem: Compute Frequent Itemsets using MLxtend Apriori implementation.

Unnamed: 0,support,itemsets,length
0,0.014393,(Acetominifen),1
74,0.013773,(Pots and Pans),1
73,0.013427,(Pot Scrubbers),1
72,0.027186,(Pot Cleaners),1
71,0.052828,(Popsicles),1



 Size of the itemset: 178


In [8]:
# Count the number of itemsets by length
frequent_itemsets_all[['itemsets','length']].groupby('length').count()

Unnamed: 0_level_0,itemsets
length,Unnamed: 1_level_1
1,102
2,76


Frequent itemsets computed for the whole dataset: 178 itemsets, of which 102 are 1-itemsets and 76 are 2-itemsets.

In [9]:
frequent_itemsets_all[['itemsets','support']].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
support,178.0,0.029962,0.031111,0.01011,0.013431,0.01572,0.04014,0.284173


Statistics that summarize the support distribution.

In [10]:
frequent_itemsets_all[['length','support']].groupby('length').describe(percentiles=[.25, .5, .75, .95])

Unnamed: 0_level_0,support,support,support,support,support,support,support,support,support
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,95%,max
length,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
1,102.0,0.040765,0.037237,0.01285,0.013993,0.028671,0.053917,0.104119,0.284173
2,76.0,0.015464,0.006719,0.01011,0.011249,0.013622,0.016798,0.028581,0.050867


**List frequent 1-itemsets and 2-itemsets with support of at least 25%.**

In [11]:
freq_itemLenght(frequent_itemsets_all, 0.25, 1)

Unnamed: 0,support,itemsets,length
40,0.284173,(Fresh Vegetables),1


Size of the itemset with support of 25.0% and 1-itemset is 1


In [12]:
freq_itemLenght(frequent_itemsets_all, 0.25, 2)

Unnamed: 0,support,itemsets,length


Size of the itemset with support of 25.0% and 2-itemset is 0


**Change the support thresholds to approximately the value of the 95th percentile (0.10 for 1-itemsets, 0.029 for 2-itemsets)**

In [13]:
freq_itemLenght(frequent_itemsets_all, 0.10, 1)

Unnamed: 0,support,itemsets,length
40,0.284173,(Fresh Vegetables),1
39,0.175257,(Fresh Fruit),1
86,0.120064,(Soup),1
12,0.117872,(Cheese),1
31,0.117281,(Dried Fruit),1
22,0.105353,(Cookies),1


Size of the itemset with support of 10.0% and 1-itemset is 6


In [14]:
freq_itemLenght(frequent_itemsets_all, 0.029, 2)

Unnamed: 0,support,itemsets,length
136,0.050867,"(Fresh Fruit, Fresh Vegetables)",2
172,0.035435,"(Fresh Vegetables, Soup)",2
129,0.035219,"(Dried Fruit, Fresh Vegetables)",2
111,0.031166,"(Fresh Vegetables, Cheese)",2


Size of the itemset with support of 2.9000000000000004% and 2-itemset is 4


### **Results and discussion**

**Note:** support(A⇒B) = P(A∪B), where the support s is the percentage of transactions in the data that contain A ∪ B (that is, every item of both A and B).

Computing the frequent itemsets with a minimum support of 1% using the Apriori algorithm produced 178 frequent itemsets: 102 1-itemsets and 76 2-itemsets. The maximum support was 0.284173 (Fresh Vegetables) among the 1-itemsets and 0.050867 (Fresh Fruit, Fresh Vegetables) among the 2-itemsets.
Support is the fraction of transactions that contain the itemset, so few itemsets are expected to reach a high threshold. Accordingly, filtering with support >= 25% leaves a single 1-itemset and no 2-itemsets.

To see which items have the most impact on sales, I lowered the support threshold to roughly the 95th percentile for each length: 0.10 for 1-itemsets and 0.029 for 2-itemsets. This yields 6 1-itemsets and 4 2-itemsets (see the tables above) and also shows which item combinations have the most impact on sales.

Finally, choosing the support threshold from the 95th percentile rather than an arbitrary absolute value makes it possible to compare the best-selling products across different stores.
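The percentile-based choice of threshold can be sketched as follows. The supports below are illustrative placeholders rather than the notebook's full table, and the `percentile` helper is written out by hand (linear interpolation between ranks) so the example has no dependencies.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Illustrative supports by itemset length; not the notebook's full table.
supports = {
    1: {"A": 0.28, "B": 0.17, "C": 0.12, "D": 0.05, "E": 0.02},
    2: {("A", "B"): 0.05, ("A", "C"): 0.03, ("B", "C"): 0.02},
}

# Keep, for each length, only the itemsets at or above the 95th percentile.
top = {}
for length, table in supports.items():
    threshold = percentile(table.values(), 95)
    top[length] = {k: v for k, v in table.items() if v >= threshold}
```

Because the threshold is derived from each distribution separately, the same code can be applied to any store group without hand-tuning an absolute cut-off.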

----

### 1.3. Generate Association Rules from Frequent Itemsets

Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support. Frequent itemsets can be stored ahead of time in hash tables along with their counts so that they can be accessed quickly.
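A minimal sketch of this step, assuming the frequent itemsets and their supports are kept in a dictionary keyed by `frozenset` (Python's hash table). The three supports are the rounded values reported in section 1.2; `generate_rules` is a name made up for this example and is not the helper used in the cells below.

```python
from itertools import combinations

# Frequent itemsets with their supports, keyed by frozenset for fast lookup.
# Values are the rounded figures reported in section 1.2.
supports = {
    frozenset(["Fresh Vegetables"]): 0.284173,
    frozenset(["Fresh Fruit"]): 0.175257,
    frozenset(["Fresh Fruit", "Fresh Vegetables"]): 0.050867,
}


def generate_rules(supports, min_confidence):
    """Return (antecedent, consequent, confidence) for every rule that passes."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        # Every non-empty proper subset can act as the antecedent.
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / supports[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


rules = generate_rules(supports, min_confidence=0.25)
```

Only Fresh Fruit ⇒ Fresh Vegetables survives the 25% threshold (confidence of about 0.29); the reverse rule has a confidence of about 0.18 because Fresh Vegetables is a much more common antecedent.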

**Generate association rules with a minimum confidence of 25%.**

In [15]:
# Generate association rules from itemsets with 1% minimum support, using confidence as the metric with a threshold of 25%
rules_conf = associat_rules(All_df, 0.01, 'confidence', 0.25)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Batteries),(Fresh Vegetables),0.053982,0.284173,0.015071,0.279188,0.982457,-0.000269,0.993084
1,(Bologna),(Fresh Vegetables),0.040613,0.284173,0.011913,0.293324,1.032201,0.000372,1.012949
2,(Canned Vegetables),(Fresh Vegetables),0.078600,0.284173,0.022080,0.280917,0.988543,-0.000256,0.995472
3,(Cereal),(Fresh Vegetables),0.054314,0.284173,0.015042,0.276952,0.974588,-0.000392,0.990012
4,(Cheese),(Fresh Vegetables),0.117872,0.284173,0.031166,0.264407,0.930444,-0.002330,0.973129
5,(Chips),(Fresh Vegetables),0.064712,0.284173,0.018316,0.283040,0.996012,-0.000073,0.998419
6,(Chocolate Candy),(Fresh Vegetables),0.066673,0.284173,0.019138,0.287043,1.010099,0.000191,1.004025
7,(Cleaners),(Fresh Vegetables),0.039776,0.284173,0.010283,0.258521,0.909729,-0.001020,0.965404
8,(Coffee),(Fresh Vegetables),0.052958,0.284173,0.015013,0.283497,0.997620,-0.000036,0.999056
9,(Cookies),(Fresh Vegetables),0.105353,0.284173,0.027719,0.263107,0.925870,-0.002219,0.971413



 Number of association rules: 48


**Statistical summary of the association rule metrics**

In [16]:
rules_conf.describe(percentiles=[.25, .5, .75, .95]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,95%,max
antecedent support,48.0,0.060259,0.02688186,0.039243,0.041204,0.054054,0.066068,0.117665,0.175257
consequent support,48.0,0.284173,1.121972e-16,0.284173,0.284173,0.284173,0.284173,0.284173,0.284173
support,48.0,0.01693,0.007765496,0.010168,0.012068,0.015028,0.018486,0.0338,0.050867
confidence,48.0,0.280574,0.01400875,0.253506,0.26999,0.282465,0.290675,0.301132,0.308586
lift,48.0,0.987334,0.04929654,0.892083,0.95009,0.993988,1.022879,1.059679,1.085906
leverage,48.0,-0.000194,0.0009022043,-0.002472,-0.000747,-0.00011,0.0003,0.001041,0.001891
conviction,48.0,0.995366,0.01936432,0.958918,0.980571,0.99762,1.009166,1.024267,1.035308


**Generate association rules with a minimum lift of 1.1**

In [17]:
# Generate association rules from itemsets with 1% minimum support, using lift as the metric with a threshold of 1.1
rules_lift = associat_rules(All_df, 0.01, 'lift', 1.1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Fresh Fruit),(Batteries),0.175257,0.053982,0.010788,0.061554,1.140264,0.001327,1.008068
1,(Batteries),(Fresh Fruit),0.053982,0.175257,0.010788,0.19984,1.140264,0.001327,1.030722
2,(Fresh Fruit),(Juice),0.175257,0.053722,0.010773,0.061471,1.144242,0.001358,1.008257
3,(Juice),(Fresh Fruit),0.053722,0.175257,0.010773,0.200537,1.144242,0.001358,1.031621
4,(Pizza),(Fresh Fruit),0.054126,0.175257,0.010672,0.197176,1.125063,0.001186,1.027301
5,(Fresh Fruit),(Pizza),0.175257,0.054126,0.010672,0.060895,1.125063,0.001186,1.007208
6,(Fresh Fruit),(Sliced Bread),0.175257,0.056347,0.010975,0.062623,1.111386,0.0011,1.006696
7,(Sliced Bread),(Fresh Fruit),0.056347,0.175257,0.010975,0.194779,1.111386,0.0011,1.024243
8,(Soup),(Wine),0.120064,0.080663,0.011249,0.093694,1.161547,0.001565,1.014378
9,(Wine),(Soup),0.080663,0.120064,0.011249,0.13946,1.161547,0.001565,1.022539



 Number of association rules: 10


**Statistical summary of the association rule metrics**

In [18]:
rules_lift.describe(percentiles=[.25, .5, .75, .95]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,95%,max
antecedent support,10.0,0.111993,0.057987,0.053722,0.054681,0.100363,0.175257,0.175257,0.175257
consequent support,10.0,0.111993,0.057987,0.053722,0.054681,0.100363,0.175257,0.175257,0.175257
support,10.0,0.010892,0.000215,0.010672,0.010773,0.010788,0.010975,0.011249,0.011249
confidence,10.0,0.127203,0.065479,0.060895,0.061821,0.116577,0.196576,0.200223,0.200537
lift,10.0,1.1365,0.018031,1.111386,1.125063,1.140264,1.144242,1.161547,1.161547
leverage,10.0,0.001307,0.000168,0.0011,0.001186,0.001327,0.001358,0.001565,0.001565
conviction,10.0,1.018103,0.010243,1.006696,1.008115,1.018459,1.026537,1.031216,1.031621


**Generate association rules with both confidence >= 25% and lift >= 1.1**

In [36]:
rules_lift[rules_lift['confidence']>= 0.25]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction


**Change the confidence threshold to 0.20 (the value of the 95th percentile when lift was used as the metric)**

In [42]:
rules_lift[rules_lift['confidence']>= 0.20]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
3,(Juice),(Fresh Fruit),0.053722,0.175257,0.010773,0.200537,1.144242,0.001358,1.031621


**Change the lift threshold to 1.06 (the value of the 95th percentile when confidence was used as the metric)**

In [43]:
rules_conf[rules_conf['lift']>= 1.06]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
14,(Donuts),(Fresh Vegetables),0.040382,0.284173,0.012446,0.308214,1.0846,0.000971,1.034752
34,(Personal Hygiene),(Fresh Vegetables),0.05466,0.284173,0.016484,0.301583,1.061265,0.000952,1.024928
40,(Shampoo),(Fresh Vegetables),0.040988,0.284173,0.012648,0.308586,1.085906,0.001001,1.035308


### **Results and discussion**

**Note:**
* **Confidence(A⇒B)** = P(B|A): the rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B.
* **Lift** is a simple correlation measure that takes into account the base popularity of both constituent items: lift(A⇒B) = P(A∪B) / (P(A)·P(B)). If the value is greater than 1, as in all the rules obtained with the lift metric, products A and B are positively correlated, meaning that the occurrence of one makes the occurrence of the other more likely. The strongest case here is "wine and soup" (lift of about 1.16).
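The metric columns in the rule tables can be recomputed from the three support values alone. The sketch below uses the standard definitions of confidence, lift, leverage and conviction and checks them against the Soup ⇒ Wine row of the lift table; small differences are expected because the inputs are rounded.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Confidence, lift, leverage and conviction for the rule A => B."""
    confidence = support_ab / support_a
    lift = confidence / support_b
    leverage = support_ab - support_a * support_b
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_b) / (1 - confidence)
    return confidence, lift, leverage, conviction


# Soup => Wine, using the rounded supports from the lift table above.
confidence, lift, leverage, conviction = rule_metrics(0.120064, 0.080663, 0.011249)
```

A lift above 1, a positive leverage and a conviction above 1 all say the same thing here: Soup and Wine appear together slightly more often than independence would predict.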

A rule is interesting when it is unexpected and actionable. For that reason, changing the thresholds (see the last table above) leaves 3 unlikely but interesting associations.
In these 3 rules, the antecedent is among the products whose buyers are most likely to also buy "Fresh Vegetables". The lift, although not large, shows a positive correlation. Although the confidence values are around 30%, the support of each rule is only between 1% and 2%, mainly because the antecedent support is low. The confidence values also need care because they can overstate the importance of these associations: Fresh Vegetables appears in 28.4% of all transactions, so a confidence of about 30% is only slightly above that baseline. In other words, customers who buy donuts, personal hygiene products or shampoo are slightly more likely than average to also buy fresh vegetables.

When the confidence threshold is changed while keeping lift above 1.1, a less surprising association appears: juice with fresh fruit. However, this association has lower confidence and support than the ones mentioned above.

----

## 2. Frequent Itemsets and Association Rules: Looking for Differences between Stores

The 24 stores, whose transactions were analysed, are in fact different types of stores:
* Deluxe Supermarkets: STORE_ID = 8, 12, 13, 17, 19, 21
* Gourmet Supermarkets: STORE_ID = 4, 6
* Mid-Size Groceries: STORE_ID = 9, 18, 20, 23
* Small Groceries: STORE_ID = 2, 5, 14, 22
* Supermarkets: STORE_ID = 1, 3, 7, 10, 11, 15, 16
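One way to split the transactions by store type is a plain lookup table, sketched below. The key names mirror those of the notebook's `data` dictionary, but the function `group_by_store_type` and the sample records are hypothetical; the actual grouping is done inside `needed_fun.py`, which is not shown.

```python
# Store types as listed above.
STORE_TYPES = {
    "Deluxe_sup": {8, 12, 13, 17, 19, 21},
    "Gourmet_sup": {4, 6},
    "Mid_Size_Groc": {9, 18, 20, 23},
    "Small_Gro": {2, 5, 14, 22},
    "Sup": {1, 3, 7, 10, 11, 15, 16},
}


def group_by_store_type(records):
    """Split (store_id, items) pairs into one transaction list per store type."""
    groups = {name: [] for name in STORE_TYPES}
    groups["All"] = []
    for store_id, items in records:
        groups["All"].append(items)
        for name, ids in STORE_TYPES.items():
            if store_id in ids:
                groups[name].append(items)
    # Section 2.1 analyses Deluxe and Gourmet supermarkets together.
    groups["Del_Gour_sup"] = groups["Deluxe_sup"] + groups["Gourmet_sup"]
    return groups


# Hypothetical records for illustration.
records = [(8, {"Soup"}), (4, {"Wine"}), (1, {"Pasta"}), (22, {"Cheese"})]
groups = group_by_store_type(records)
```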

### 2.1. Analyse Deluxe Supermarkets and Gourmet Supermarkets

#### 2.1.1. Load/Preprocess the Dataset

In [19]:
# Dictionary keys corresponding to the different stores
data.keys()

dict_keys(['All', 'Small_Gro', 'Sup', 'Gourmet_sup', 'Del_Gour_sup', 'Deluxe_sup', 'Mid_Size_Groc'])

In [20]:
# Data matrix for the Deluxe and Gourmet stores, already processed at the beginning.
del_Gour_matrix = data['Del_Gour_sup']

#### 2.1.2 Compute Frequent Itemsets

**Compute the frequent itemsets with a minimum support of 1%**

In [22]:
del_Gour_df = encode_df(del_Gour_matrix)
frequent_itemsets_del_Gour = comp_freqItem(del_Gour_df, 0.01)

Unnamed: 0,support,itemsets,length
0,0.014304,(Acetominifen),1
74,0.012928,(Pots and Pans),1
73,0.013856,(Pot Scrubbers),1
72,0.027263,(Pot Cleaners),1
71,0.051806,(Popsicles),1



 Size of the itemset: 173


In [23]:
frequent_itemsets_del_Gour[['itemsets','length']].groupby('length').count()

Unnamed: 0_level_0,itemsets
length,Unnamed: 1_level_1
1,102
2,71


The computation yields 173 frequent itemsets, of which 102 are 1-itemsets and 71 are 2-itemsets.

**Statistical summary of the support metric for all frequent itemsets**

In [24]:
frequent_itemsets_del_Gour[['itemsets','support']].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
support,173.0,0.030792,0.031674,0.010496,0.013632,0.017503,0.040479,0.290071


**Statistical summary of the support metric for 1-itemsets and 2-itemsets**

In [25]:
frequent_itemsets_del_Gour[['length','support']].groupby('length').describe(percentiles=[.25, .5, .75, .95])

Unnamed: 0_level_0,support,support,support,support,support,support,support,support,support
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,95%,max
length,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
1,102.0,0.040922,0.037715,0.012704,0.01412,0.029103,0.053254,0.104822,0.290071
2,71.0,0.016238,0.006962,0.010496,0.011872,0.014528,0.018079,0.030591,0.051806


In [26]:
#Divide the data from 1-itemsets and 2-itemsets into 2 variables.
frequent_itemsets_del_Gour_1item = frequent_itemsets_del_Gour[frequent_itemsets_del_Gour['length']==1]
frequent_itemsets_del_Gour_2item = frequent_itemsets_del_Gour[frequent_itemsets_del_Gour['length']==2]

**List of 1-itemsets with at least 25% support**

In [29]:
freq_itemLenght(frequent_itemsets_del_Gour, 0.25, 1)

Unnamed: 0,support,itemsets,length
40,0.290071,(Fresh Vegetables),1


Size of the 1-itemset with 25.0% support is 1


**List of 2-itemsets with at least 25% support**

In [30]:
freq_itemLenght(frequent_itemsets_del_Gour, 0.25, 2)

Unnamed: 0,support,itemsets,length


Size of the 2-itemset with 25.0% support is 0


**Change the support thresholds to approximately the value of the 95th percentile (0.10 for 1-itemsets, 0.03 for 2-itemsets)**

In [56]:
freq_itemLenght(frequent_itemsets_del_Gour, 0.10, 1)

Unnamed: 0,support,itemsets,length
40,0.290071,(Fresh Vegetables),1
39,0.176282,(Fresh Fruit),1
86,0.12134,(Soup),1
31,0.119388,(Dried Fruit),1
12,0.11798,(Cheese),1
22,0.106173,(Cookies),1


Size of the itemset with support of 10.0% and 1-itemset is 6


In [58]:
freq_itemLenght(frequent_itemsets_del_Gour, 0.03, 2)

Unnamed: 0,support,itemsets,length
133,0.051806,"(Fresh Fruit, Fresh Vegetables)",2
167,0.036191,"(Soup, Fresh Vegetables)",2
129,0.035679,"(Dried Fruit, Fresh Vegetables)",2
111,0.031167,"(Cheese, Fresh Vegetables)",2
121,0.030015,"(Cookies, Fresh Vegetables)",2


Size of the itemset with support of 3.0% and 2-itemset is 5


### **Results and discussion**

Computing the frequent itemsets with a minimum support of 1% produced 173 frequent itemsets: 102 1-itemsets and 71 2-itemsets. The maximum support was 0.290071, again for Fresh Vegetables; among the 2-itemsets it was 0.051806 (Fresh Fruit, Fresh Vegetables). Again, filtering with support >= 25% leaves a single 1-itemset, Fresh Vegetables.

To see which items have the most impact on sales, I lowered the thresholds to 0.10 for 1-itemsets and 0.03 for 2-itemsets (roughly the 95th percentile of support for each length). This yields 6 1-itemsets and 5 2-itemsets (see the tables above). The supports are slightly higher than in the data from all stores, and the top items are the same, with (Cookies, Fresh Vegetables) as the only additional 2-itemset.

From a marketing perspective, customers of deluxe and gourmet stores therefore seem to buy the same top products as everybody else.

#### 2.1.3 Generate Association Rules from Frequent Itemsets


**Generate association rules with a minimum confidence of 25%**

In [59]:
rules_conf_del_Gour = associat_rules(del_Gour_df, 0.01, 'confidence', 0.25)


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Batteries),(Fresh Vegetables),0.053726,0.290071,0.015008,0.279333,0.962982,-0.000577,0.9851
1,(Bologna),(Fresh Vegetables),0.040863,0.290071,0.011872,0.290525,1.001565,1.9e-05,1.00064
2,(Canned Vegetables),(Fresh Vegetables),0.076126,0.290071,0.021919,0.287936,0.992641,-0.000162,0.997002
3,(Cereal),(Fresh Vegetables),0.054206,0.290071,0.014624,0.269776,0.930034,-0.0011,0.972207
4,(Cheese),(Fresh Vegetables),0.11798,0.290071,0.031167,0.264171,0.910714,-0.003056,0.964803
5,(Chips),(Fresh Vegetables),0.064286,0.290071,0.017919,0.278746,0.960958,-0.000728,0.984298
6,(Chocolate Candy),(Fresh Vegetables),0.06771,0.290071,0.019967,0.294896,1.016635,0.000327,1.006843
7,(Cleaners),(Fresh Vegetables),0.040479,0.290071,0.011232,0.27747,0.956561,-0.00051,0.982561
8,(Coffee),(Fresh Vegetables),0.052542,0.290071,0.015264,0.290499,1.001478,2.3e-05,1.000604
9,(Cookies),(Fresh Vegetables),0.106173,0.290071,0.030015,0.2827,0.974591,-0.000783,0.989725



 Number of association rules: 46


**Statistical summary of the association rule metrics with a minimum confidence of 25%.**

In [60]:
rules_conf_del_Gour.describe(percentiles=[.25, .5, .75, .95]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,95%,max
antecedent support,46.0,0.061307,0.02735102,0.038591,0.041511,0.053966,0.067126,0.119036,0.176282
consequent support,46.0,0.290071,5.612455000000001e-17,0.290071,0.290071,0.290071,0.290071,0.290071,0.290071
support,46.0,0.017669,0.007952697,0.010624,0.01308,0.015232,0.019503,0.034551,0.051806
confidence,46.0,0.288604,0.01600595,0.256787,0.27828,0.28886,0.297483,0.316159,0.31762
lift,46.0,0.994944,0.05517948,0.885258,0.959353,0.995826,1.025553,1.089937,1.094974
leverage,46.0,-0.000114,0.0009629654,-0.003056,-0.000637,-9.1e-05,0.000619,0.001112,0.001353
conviction,46.0,0.998434,0.02250822,0.955217,0.983663,0.998299,1.010553,1.038152,1.040372


**Generate association rules with a minimum lift of 1.1.**

In [61]:
rules_lift_del_Gour = associat_rules(del_Gour_df, 0.01, 'lift', 1.1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Fresh Fruit),(Batteries),0.176282,0.053726,0.011232,0.063714,1.185899,0.001761,1.010667
1,(Batteries),(Fresh Fruit),0.053726,0.176282,0.011232,0.209053,1.185899,0.001761,1.041432
2,(Cheese),(Soup),0.11798,0.12134,0.015839,0.134255,1.106439,0.001524,1.014918
3,(Soup),(Cheese),0.12134,0.11798,0.015839,0.130538,1.106439,0.001524,1.014443
4,(Fresh Fruit),(Juice),0.176282,0.053086,0.010816,0.061354,1.155744,0.001457,1.008808
5,(Juice),(Fresh Fruit),0.053086,0.176282,0.010816,0.203737,1.155744,0.001457,1.03448
6,(Paper Wipes),(Fresh Fruit),0.079165,0.176282,0.015552,0.196443,1.114366,0.001596,1.025089
7,(Fresh Fruit),(Paper Wipes),0.176282,0.079165,0.015552,0.088219,1.114366,0.001596,1.00993
8,(Fresh Fruit),(Pizza),0.176282,0.054782,0.011456,0.064985,1.186234,0.001798,1.010911
9,(Pizza),(Fresh Fruit),0.054782,0.176282,0.011456,0.209112,1.186234,0.001798,1.04151



 Number of association rules: 14


**Statistical summary of the association rule metrics with a minimum lift of 1.1.**

In [62]:
rules_lift_del_Gour.describe(percentiles=[.25, .5, .75, .95]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,95%,max
antecedent support,14.0,0.115532,0.052722,0.053086,0.062006,0.11966,0.176282,0.176282,0.176282
consequent support,14.0,0.115532,0.052722,0.053086,0.062006,0.11966,0.176282,0.176282,0.176282
support,14.0,0.01253,0.002091,0.010816,0.01124,0.011456,0.014552,0.015839,0.015839
confidence,14.0,0.133432,0.061066,0.061354,0.071202,0.132397,0.201904,0.209074,0.209112
lift,14.0,1.156529,0.033387,1.106439,1.124692,1.155744,1.18615,1.19135,1.19135
leverage,14.0,0.001643,0.000138,0.001457,0.001532,0.001596,0.001789,0.001809,0.001809
conviction,14.0,1.021405,0.012302,1.008808,1.010728,1.015677,1.032633,1.041459,1.04151


**Generate association rules with both confidence >= 25% and lift >= 1.1**

In [63]:
rules_lift_del_Gour[rules_lift_del_Gour['confidence']>= 0.25]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction


There are none.

**Change the confidence threshold to 0.209 (the value of the 95th percentile when lift was used as the metric)**

In [65]:
rules_lift_del_Gour[rules_lift_del_Gour['confidence']>= 0.209]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1,(Batteries),(Fresh Fruit),0.053726,0.176282,0.011232,0.209053,1.185899,0.001761,1.041432
9,(Pizza),(Fresh Fruit),0.054782,0.176282,0.011456,0.209112,1.186234,0.001798,1.04151


**Change the lift threshold to 1.089 (the value of the 95th percentile when confidence was used as the metric)**

In [70]:
rules_conf_del_Gour[rules_conf_del_Gour['lift']>= 1.089]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
14,(Donuts),(Fresh Vegetables),0.041823,0.290071,0.013248,0.316756,1.091996,0.001116,1.039057
21,(Hot Dogs),(Fresh Vegetables),0.041407,0.290071,0.013152,0.31762,1.094974,0.001141,1.040372
43,(TV Dinner),(Fresh Vegetables),0.041215,0.290071,0.013056,0.31677,1.092045,0.0011,1.039078


### **Results and discussion**

With confidence >= 25% and lift >= 1.1 there are again no associations, even though the support, confidence and lift values are higher than for all stores.

Lowering the thresholds reveals some surprising associations, mainly junk food with fresh vegetables and batteries with fresh fruit.

Although the best-selling products for deluxe and gourmet customers seem to be the same as for everybody else, the products bought together show a different behavior: alongside fresh vegetables, these customers also tend to buy junk food (donuts, hot dogs and TV dinners).
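The comparison between the two groups can be made explicit with set operations. The antecedents below are copied from the filtered rule tables in sections 1.3 and 2.1.3; note that the two tables were filtered with different lift thresholds (1.06 and 1.089), so this is only a rough comparison.

```python
# Antecedents of the highest-lift rules pointing to Fresh Vegetables,
# copied from the filtered tables in sections 1.3 and 2.1.3.
all_stores = {"Donuts", "Personal Hygiene", "Shampoo"}
deluxe_gourmet = {"Donuts", "Hot Dogs", "TV Dinner"}

shared = all_stores & deluxe_gourmet               # present in both groups
only_deluxe_gourmet = deluxe_gourmet - all_stores  # specific to Deluxe/Gourmet
only_all_stores = all_stores - deluxe_gourmet      # not seen in Deluxe/Gourmet
```

Only Donuts is shared; Hot Dogs and TV Dinner appear only for the Deluxe and Gourmet stores.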