In [10]:
import warnings
warnings.filterwarnings('ignore')



The data that we're going to analyze is market basket data collected from a local grocery store over a 30-day period. The data is stored in CSV (comma-separated values) format as follows:
```
1. citrus fruit,semi-finished bread,margarine,ready soups
2. tropical fruit,yogurt,coffee
3. whole milk
4. pip fruit,yogurt,cream cheese,meat spreads
5. other vegetables,whole milk,condensed milk,long life bakery product
```
Each row in the data represents a set of items purchased by a customer during a store visit, which we refer to as a transaction. As expected, the number of items in each transaction varies, so we cannot simply bulk import this data into a tabular data structure such as a pandas DataFrame as-is. Instead, we need to import the data one row at a time.

To do this, we first need to import the `reader` function from the `csv` module.

In [2]:
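# 1.	Read the data and remove all null or empty values.
from csv import reader

groceries = []
# newline='' is the setting the csv module recommends for file objects
with open('Groceries.csv', 'r', newline='') as csvfile:
    read = reader(csvfile)
    for row in read:
        # Keep only real items: drop empty fields and 'NaN' placeholders
        row_new = [x for x in row if x.strip() and x != 'NaN']
        groceries.append(row_new)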

In [3]:
groceries[0:5]

[['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups'],
 ['tropical fruit', 'yogurt', 'coffee'],
 ['whole milk'],
 ['pip fruit', 'yogurt', 'cream cheese ', 'meat spreads'],
 ['other vegetables',
  'whole milk',
  'condensed milk',
  'long life bakery product']]

Now that we've imported the transactions into a list, we need to one-hot encode them, so that each transaction becomes a row of `True`/`False` flags with one column per item, before we can generate frequent itemsets.

To transform our data, we first need to import the `TransactionEncoder` class from the `mlxtend.preprocessing` subpackage. The `mlxtend` package provides several functions and objects for preprocessing transaction data, generating frequent itemsets and creating association rules.

In [4]:
from mlxtend.preprocessing import TransactionEncoder

encoder = TransactionEncoder()

transactions = encoder.fit(groceries).transform(groceries)
transactions

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False,  True, False],
       [False, False, False, ...,  True, False, False],
       ...,
       [False, False, False, ..., False,  True, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
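
To make the encoding concrete, here is a minimal plain-Python sketch of the same idea on a toy list of baskets. The `toy_transactions` data is hypothetical, and the sketch only mimics the shape of the result (alphabetically sorted item columns, one boolean row per transaction); it is not how `TransactionEncoder` is implemented.

```python
# Toy transactions (hypothetical, not taken from Groceries.csv).
toy_transactions = [
    ["whole milk", "yogurt"],
    ["yogurt", "coffee"],
    ["whole milk"],
]

# One column per distinct item, sorted alphabetically.
columns = sorted({item for basket in toy_transactions for item in basket})

# One row per transaction: True where the basket contains the column's item.
encoded = [[item in basket for item in columns] for basket in toy_transactions]

print(columns)
for row in encoded:
    print(row)
```

Each row has the same width regardless of how many items the basket held, which is what lets us load the result into a DataFrame.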

In [5]:
import pandas as pd

itemsets = pd.DataFrame(transactions, columns = encoder.columns_)

In [6]:
display(itemsets.head())
display(itemsets.info())

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9835 entries, 0 to 9834
Columns: 168 entries, Instant food products to zwieback
dtypes: bool(168)
memory usage: 1.6 MB


None

# Generate Frequent Itemsets

Now that we have our transactions in a compatible format (one-hot encoded pandas DataFrame), let's limit our focus to the frequent itemsets. The `mlxtend.frequent_patterns` subpackage provides three functions for generating frequent itemsets: `apriori`, `fpgrowth` and `fpmax`. All three functions have similar syntax, so we'll limit our illustration to the `apriori` function. Let's import it.

The `apriori` function takes several arguments. The first one is the pandas DataFrame of the transactions we wish to analyze. The second is the minimum support threshold of the itemsets we consider frequent. This value specifies how often an itemset must occur in the transaction set in order to warrant our attention. 
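
As a rough illustration of what "support" means here, the sketch below computes it by hand as the fraction of transactions that contain every item in an itemset. The helper name `support` and the toy data are assumptions made for this example only.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


# Toy transactions (hypothetical, not taken from Groceries.csv).
toy_transactions = [
    ["whole milk", "yogurt"],
    ["yogurt", "coffee"],
    ["whole milk"],
    ["whole milk", "yogurt", "coffee"],
]

print(support({"whole milk"}, toy_transactions))            # 3 of 4 baskets
print(support({"whole milk", "yogurt"}, toy_transactions))  # 2 of 4 baskets
```

An itemset counts as frequent when this fraction meets or exceeds the minimum support threshold.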

Let’s assume that we only want to focus our attention on itemsets that occur at least $5$ times a day. Given that our data is for $30$ days and our dataset has $9,835$ transactions, this means that we need to set our minimum support threshold to $ 5 \times \frac{30}{9835} \approx 0.015$.
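
The arithmetic behind that threshold can be checked with a few lines of Python. The variable names are ours, chosen for illustration; the numbers come from the paragraph above.

```python
purchases_per_day = 5   # minimum daily occurrences we care about
days = 30               # length of the collection period
n_transactions = 9835   # number of transactions in the dataset

# Minimum number of transactions an itemset must appear in.
min_count = purchases_per_day * days

# Express that count as a fraction of all transactions.
min_support = min_count / n_transactions

print(min_count, round(min_support, 4))
```

Note that rounding the threshold down to $0.015$ is slightly more permissive than the exact value: it corresponds to roughly $147.5$ transactions rather than $150$.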

In [8]:
from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(itemsets, min_support = 0.015, use_colnames = True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.033452,(UHT-milk)
1,0.017692,(baking powder)
2,0.033249,(berries)
3,0.026029,(beverages)
4,0.080529,(bottled beer)
...,...,...
176,0.023183,"(other vegetables, whole milk, root vegetables)"
177,0.017082,"(whole milk, other vegetables, tropical fruit)"
178,0.022267,"(whole milk, yogurt, other vegetables)"
179,0.015557,"(whole milk, rolls/buns, yogurt)"


From the output, we can tell that there are $181$ itemsets with a support of at least $0.015$. To get a better sense of which itemsets have the lowest or highest support values, let's sort the data (in descending order of support):

In [11]:
frequent_itemsets.sort_values('support', ascending = False)

Unnamed: 0,support,itemsets
70,0.255516,(whole milk)
44,0.193493,(other vegetables)
53,0.183935,(rolls/buns)
60,0.174377,(soda)
71,0.139502,(yogurt)
...,...,...
164,0.015252,"(yogurt, shopping bags)"
180,0.015150,"(whole milk, yogurt, tropical fruit)"
11,0.015048,(canned fish)
168,0.015048,"(whole milk, sugar)"


We see that `{whole milk}`, `{other vegetables}`, `{rolls/buns}`, `{soda}`, and `{yogurt}` are the five most frequently bought items in the store.

One of the benefits of working with pandas DataFrames is that we can easily transform and filter our results to meet our needs. For example, let's assume that we are only interested in frequent itemsets with a length greater than $2$. We start by getting the number of items in each entry of the `itemsets` column as follows:

In [12]:
frequent_itemsets['length'] = frequent_itemsets['itemsets'].str.len()

In [14]:
frequent_itemsets.head(-5)

Unnamed: 0,support,itemsets,length
0,0.033452,(UHT-milk),1
1,0.017692,(baking powder),1
2,0.033249,(berries),1
3,0.026029,(beverages),1
4,0.080529,(bottled beer),1
...,...,...,...
171,0.032232,"(whole milk, whipped/sour cream)",2
172,0.020742,"(yogurt, whipped/sour cream)",2
173,0.017082,"(whole milk, white bread)",2
174,0.056024,"(whole milk, yogurt)",2


In [17]:
frequent_itemsets[frequent_itemsets.length>2].sort_values('support',ascending=False)

# Higher-order itemsets are not association rules; they only show items that are bought together.

Unnamed: 0,support,itemsets,length
176,0.023183,"(other vegetables, whole milk, root vegetables)",3
178,0.022267,"(whole milk, yogurt, other vegetables)",3
175,0.017895,"(other vegetables, whole milk, rolls/buns)",3
177,0.017082,"(whole milk, other vegetables, tropical fruit)",3
179,0.015557,"(whole milk, rolls/buns, yogurt)",3
180,0.01515,"(whole milk, yogurt, tropical fruit)",3


Now we see the six frequent itemsets with a length greater than $2$.

We can also use the `describe()` method of a pandas DataFrame to get a big picture view of the distribution of values in the data. For example, to get a statistical summary of the support values by itemset length, we do the following:

In [18]:
frequent_itemsets.groupby('length')['support'].describe()

length,count,mean,std,min,25%,50%,75%,max
1,72.0,0.054157,0.046237,0.015048,0.024504,0.037265,0.071835,0.255516
2,103.0,0.024893,0.009981,0.015048,0.018454,0.021657,0.028724,0.074835
3,6.0,0.018522,0.003417,0.01515,0.015938,0.017489,0.021174,0.023183


The `count` column tells us that the majority of the frequent itemsets are two-item sets ($103$ of $181$), while the `mean` and `50%` columns tell us that itemsets with a length of $1$ typically have higher support values than those with a length of $2$ or $3$.

# Create Association Rules

The next step in our market basket analysis process is to create association rules that describe the co-occurrence of itemsets within the transaction set. The `association_rules` function in the `mlxtend.frequent_patterns` subpackage allows us to create these rules. Let's import it.

The `association_rules` function takes several arguments. The first is the DataFrame of frequent itemsets. The next is the metric we intend to use to filter the rules for significance. This can be `"support"`, `"confidence"`, `"lift"`, `"leverage"` or `"conviction"`.

Let's assume that we want to limit our focus to rules that have a confidence of $0.25$ or more. To do this, we set the `metric` argument to `"confidence"` and the `min_threshold` argument to $0.25$.


In [20]:
from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.25)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(bottled beer),(whole milk),0.080529,0.255516,0.020437,0.253788,0.993237,-0.000139,0.997684,-0.007351
1,(bottled water),(soda),0.110524,0.174377,0.028978,0.262190,1.503577,0.009705,1.119017,0.376535
2,(bottled water),(whole milk),0.110524,0.255516,0.034367,0.310948,1.216940,0.006126,1.080446,0.200417
3,(brown bread),(other vegetables),0.064870,0.193493,0.018709,0.288401,1.490503,0.006157,1.133374,0.351914
4,(brown bread),(whole milk),0.064870,0.255516,0.025216,0.388715,1.521293,0.008641,1.217899,0.366435
...,...,...,...,...,...,...,...,...,...,...
74,"(whole milk, yogurt)",(rolls/buns),0.056024,0.183935,0.015557,0.277677,1.509648,0.005252,1.129779,0.357630
75,"(yogurt, rolls/buns)",(whole milk),0.034367,0.255516,0.015557,0.452663,1.771563,0.006775,1.360192,0.451027
76,"(whole milk, yogurt)",(tropical fruit),0.056024,0.104931,0.015150,0.270417,2.577089,0.009271,1.226823,0.648285
77,"(whole milk, tropical fruit)",(yogurt),0.042298,0.139502,0.015150,0.358173,2.567516,0.009249,1.340701,0.637483
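
As a sanity check on the `confidence` column, the sketch below recomputes it for one row using the standard definition (rule support divided by antecedent support). The helper name `confidence` is ours; the inputs are the rounded values printed in row 1 of the table, so the result only matches to a few decimal places.

```python
def confidence(rule_support, antecedent_support):
    """Share of transactions containing the antecedent that also contain the consequent."""
    return rule_support / antecedent_support


# Row 1 of the rules table: (bottled water) -> (soda)
conf_water_soda = confidence(rule_support=0.028978, antecedent_support=0.110524)

print(round(conf_water_soda, 4))
print(conf_water_soda >= 0.25)  # passes the min_threshold used above
```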


In [21]:
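# Rules whose antecedent is exactly {rolls/buns}
rows = rules['antecedents']=={'rolls/buns'}
display(rules[rows])

# Rules whose antecedent contains rolls/buns, possibly alongside other items
rows = rules['antecedents'].astype(str).str.contains('rolls/buns')
display(rules[rows])

# Rules with more than one item in the antecedent, highest lift first
rows = rules['antecedents'].str.len()>1
display(rules[rows].sort_values('lift',ascending=False))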

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
52,(rolls/buns),(whole milk),0.183935,0.255516,0.056634,0.307905,1.205032,0.009636,1.075696,0.208496


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
52,(rolls/buns),(whole milk),0.183935,0.255516,0.056634,0.307905,1.205032,0.009636,1.075696,0.208496
63,"(other vegetables, rolls/buns)",(whole milk),0.042603,0.255516,0.017895,0.420048,1.643919,0.00701,1.283699,0.409128
64,"(whole milk, rolls/buns)",(other vegetables),0.056634,0.193493,0.017895,0.315978,1.633026,0.006937,1.179067,0.410912
73,"(whole milk, rolls/buns)",(yogurt),0.056634,0.139502,0.015557,0.274686,1.969049,0.007656,1.18638,0.521686
75,"(yogurt, rolls/buns)",(whole milk),0.034367,0.255516,0.015557,0.452663,1.771563,0.006775,1.360192,0.451027


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
65,"(whole milk, other vegetables)",(root vegetables),0.074835,0.108998,0.023183,0.309783,2.842082,0.015026,1.2909,0.700572
76,"(whole milk, yogurt)",(tropical fruit),0.056024,0.104931,0.01515,0.270417,2.577089,0.009271,1.226823,0.648285
77,"(whole milk, tropical fruit)",(yogurt),0.042298,0.139502,0.01515,0.358173,2.567516,0.009249,1.340701,0.637483
67,"(whole milk, root vegetables)",(other vegetables),0.048907,0.193493,0.023183,0.474012,2.44977,0.013719,1.53332,0.62223
71,"(other vegetables, whole milk)",(yogurt),0.074835,0.139502,0.022267,0.297554,2.132979,0.011828,1.225003,0.574138
68,"(whole milk, tropical fruit)",(other vegetables),0.042298,0.193493,0.017082,0.403846,2.08714,0.008898,1.352851,0.54388
70,"(whole milk, yogurt)",(other vegetables),0.056024,0.193493,0.022267,0.397459,2.054131,0.011427,1.338511,0.543633
78,"(yogurt, tropical fruit)",(whole milk),0.029283,0.255516,0.01515,0.517361,2.02477,0.007668,1.542528,0.521384
72,"(other vegetables, yogurt)",(whole milk),0.043416,0.255516,0.022267,0.512881,2.007235,0.011174,1.52834,0.524577
73,"(whole milk, rolls/buns)",(yogurt),0.056634,0.139502,0.015557,0.274686,1.969049,0.007656,1.18638,0.521686


# Evaluate Association Rules

Now that we've created association rules and know how to filter rules based on different criteria, let's take a look at how to evaluate them based on the associated metrics. 

A quick way to get a big-picture view of the metrics is with summary statistics. We do this by calling the `describe()` method of the `rules` DataFrame:

In [21]:
rules.describe()

Unnamed: 0,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
count,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0
mean,0.074138,0.210698,0.025793,0.358884,1.753029,0.010218,1.242006,0.438391
std,0.03627,0.046664,0.011942,0.071235,0.360594,0.004778,0.115478,0.123545
min,0.029283,0.104931,0.015048,0.253714,0.993237,-0.000139,0.997684,-0.007351
25%,0.053279,0.193493,0.018149,0.295216,1.504115,0.006947,1.163202,0.359346
50%,0.06487,0.193493,0.022267,0.350962,1.7394,0.009271,1.226392,0.451027
75%,0.082766,0.255516,0.028927,0.404822,1.936331,0.012508,1.291339,0.514627
max,0.255516,0.255516,0.074835,0.517361,2.842082,0.026291,1.542528,0.700572


The summary statistics provide us with the mean, standard deviation, minimum, maximum and some quartile values for the association rule metrics. From the summary, we can tell that a typical rule has a lift of $1.75$ and that the lift values range from $0.99$ to $2.84$.

**Lift** tells us how much more often the antecedent and consequent occur together than we would expect if they were independent. In other words, lift measures the strength of association. Lift values range from $0$ to $\infty$, where a value of $1$ indicates independence between the antecedent and the consequent. Let's take a look at the top $5$ rules by lift:

In [22]:
rules.sort_values('lift',ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
65,"(whole milk, other vegetables)",(root vegetables),0.074835,0.108998,0.023183,0.309783,2.842082,0.015026,1.2909,0.700572
33,(meat),(root vegetables),0.076462,0.108998,0.021759,0.284574,2.610811,0.013425,1.245415,0.668058
76,"(whole milk, yogurt)",(tropical fruit),0.056024,0.104931,0.01515,0.270417,2.577089,0.009271,1.226823,0.648285
48,(pip fruit),(tropical fruit),0.075648,0.104931,0.020437,0.270161,2.574648,0.012499,1.226392,0.66165
77,"(whole milk, tropical fruit)",(yogurt),0.042298,0.139502,0.01515,0.358173,2.567516,0.009249,1.340701,0.637483


The first rule has a lift score of $2.84$. We interpret this to mean that customers who bought whole milk and other vegetables are $2.84$ times as likely to also purchase root vegetables as customers in general. Note that lift values above $1$ indicate an increased likelihood, while lift values below $1$ indicate a reduced likelihood.
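
To see where that number comes from, here is a small sketch that recomputes lift from the support columns, using the standard definition (rule support divided by the product of antecedent and consequent support). The function name `lift` is ours, and the inputs are the rounded table values for rule 65.

```python
def lift(rule_support, antecedent_support, consequent_support):
    """Observed co-occurrence divided by the co-occurrence expected under independence."""
    return rule_support / (antecedent_support * consequent_support)


# Rule 65: (whole milk, other vegetables) -> (root vegetables)
lift_65 = lift(
    rule_support=0.023183,
    antecedent_support=0.074835,
    consequent_support=0.108998,
)

print(round(lift_65, 2))
```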

**Leverage** is similar to lift, but it measures the difference, rather than the ratio, between how often the antecedent and consequent occur together and how often we would expect them to if they were independent. Leverage values range from $-1$ to $1$, where a value of $0$ indicates independence between the antecedent and the consequent. Let's take a look at the top $5$ rules by leverage:

In [23]:
rules.sort_values('leverage',ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
40,(root vegetables),(other vegetables),0.108998,0.193493,0.047382,0.434701,2.246605,0.026291,1.426693,0.622764
44,(other vegetables),(whole milk),0.193493,0.255516,0.074835,0.386758,1.513634,0.025394,1.214013,0.42075
45,(whole milk),(other vegetables),0.255516,0.193493,0.074835,0.292877,1.513634,0.025394,1.140548,0.455803
53,(root vegetables),(whole milk),0.108998,0.255516,0.048907,0.448694,1.756031,0.021056,1.350401,0.483202
62,(yogurt),(whole milk),0.139502,0.255516,0.056024,0.401603,1.571735,0.020379,1.244132,0.422732


The first rule has a leverage score of $0.026$. We interpret this to mean that customers who bought root vegetables are also likely to purchase other vegetables. This is expected behavior. However, the second rule, which tells us that customers who bought other vegetables are $1.5$ times as likely (about $50\%$ more likely, using the lift score) to also purchase whole milk, is a bit suspect. Rules that include frequently purchased items such as whole milk can be deceiving, so we should also look at the conviction of association rules.
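
For reference, leverage can be recomputed from the support columns as the rule support minus the product of the antecedent and consequent supports. The sketch below does this for rule 40; the function name `leverage` is ours, and the inputs are the rounded values from the table.

```python
def leverage(rule_support, antecedent_support, consequent_support):
    """Observed co-occurrence minus the co-occurrence expected under independence."""
    return rule_support - antecedent_support * consequent_support


# Rule 40: (root vegetables) -> (other vegetables)
leverage_40 = leverage(
    rule_support=0.047382,
    antecedent_support=0.108998,
    consequent_support=0.193493,
)

print(round(leverage_40, 3))
```

Because the expression is symmetric in the antecedent and consequent, a rule and its reverse share the same leverage, as rules 44 and 45 do in the table.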

**Conviction** quantifies how dependent the consequent is on the antecedent. It is also related to lift. However, unlike lift, conviction is sensitive to rule direction. This means that $\text{Conviction}_{A \rightarrow B} \neq \text{Conviction}_{B \rightarrow A}$. Conviction values range from $0$ to $\infty$, where a value of $1$ indicates independence between the antecedent and the consequent. Let's take a look at the top $5$ rules by conviction:

In [24]:
rules.sort_values('conviction',ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
78,"(yogurt, tropical fruit)",(whole milk),0.029283,0.255516,0.01515,0.517361,2.02477,0.007668,1.542528,0.521384
67,"(whole milk, root vegetables)",(other vegetables),0.048907,0.193493,0.023183,0.474012,2.44977,0.013719,1.53332,0.62223
72,"(other vegetables, yogurt)",(whole milk),0.043416,0.255516,0.022267,0.512881,2.007235,0.011174,1.52834,0.524577
6,(butter),(whole milk),0.055414,0.255516,0.027555,0.497248,1.946053,0.013395,1.480817,0.514659
16,(curd),(whole milk),0.053279,0.255516,0.026131,0.490458,1.919481,0.012517,1.461085,0.505984


The first rule has a conviction of $1.54$. This means that the rule $\{\text{tropical fruit, yogurt}\} \rightarrow \{\text{whole milk}\}$ would be incorrect $54\%$ more often (or $1.54$ times as often) if the consequent were independent of the antecedent. The higher the conviction, the more likely it is that the consequent is dependent on the antecedent.
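
The conviction value can be reproduced from the table using the standard definition, $(1 - \text{Support}_{B}) / (1 - \text{Confidence}_{A \rightarrow B})$. The sketch below applies it to rule 78; the function name `conviction` is ours, the inputs are rounded table values, and a confidence of exactly $1$ (which would divide by zero) is not handled.

```python
def conviction(consequent_support, rule_confidence):
    """Ratio of the expected error rate under independence to the observed error rate."""
    return (1 - consequent_support) / (1 - rule_confidence)


# Rule 78: (yogurt, tropical fruit) -> (whole milk)
conviction_78 = conviction(consequent_support=0.255516, rule_confidence=0.517361)

print(round(conviction_78, 2))
```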

**Zhang's metric** is another useful metric that we should take into consideration when evaluating rules. It ranges in value from $-1$ to $1$, which represent perfect dissociation and perfect association respectively. Zhang's metric is useful in identifying items that should not be placed next to each other, even if they have been purchased together previously. It is calculated as follows:

$$ \text{Zhang}_{A \rightarrow B} = \frac{\text{Support}_{A \rightarrow B} - (\text{Support}_{A} \times \text{Support}_{B})}{\text{max}\{[\text{Support}_{A \rightarrow B} \times (1 - \text{Support}_{A})], [\text{Support}_{A} \times (\text{Support}_{B} - \text{Support}_{A \rightarrow B})]\}}$$

Where $\text{Support}_{A \rightarrow B}$ is the support of the rule, $\text{Support}_{A}$ is the antecedent support and $\text{Support}_{B}$ is the consequent support.
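
The formula translates directly into a small function. This is a sketch under our own naming (`zhangs_metric_from_supports` is not an `mlxtend` function), checked against the rounded values for rules 65 and 0 in the table; it does not guard against a zero denominator.

```python
def zhangs_metric_from_supports(support_ab, support_a, support_b):
    """Zhang's metric for the rule A -> B, computed from the three support values."""
    numerator = support_ab - support_a * support_b
    denominator = max(
        support_ab * (1 - support_a),
        support_a * (support_b - support_ab),
    )
    return numerator / denominator


# Rule 65: (whole milk, other vegetables) -> (root vegetables)
zhang_65 = zhangs_metric_from_supports(0.023183, 0.074835, 0.108998)

# Rule 0: (bottled beer) -> (whole milk)
zhang_0 = zhangs_metric_from_supports(0.020437, 0.080529, 0.255516)

print(round(zhang_65, 2), round(zhang_0, 3))
```

With a DataFrame like `rules`, the same function could be applied row by row to the `support`, `antecedent support` and `consequent support` columns.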

The `rules` DataFrame shown above already contains a `zhangs_metric` column, so we can sort on it directly. Let's take a look at the top $5$ rules by Zhang's metric:

In [25]:
rules.sort_values('zhangs_metric',ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
65,"(whole milk, other vegetables)",(root vegetables),0.074835,0.108998,0.023183,0.309783,2.842082,0.015026,1.2909,0.700572
33,(meat),(root vegetables),0.076462,0.108998,0.021759,0.284574,2.610811,0.013425,1.245415,0.668058
48,(pip fruit),(tropical fruit),0.075648,0.104931,0.020437,0.270161,2.574648,0.012499,1.226392,0.66165
76,"(whole milk, yogurt)",(tropical fruit),0.056024,0.104931,0.01515,0.270417,2.577089,0.009271,1.226823,0.648285
77,"(whole milk, tropical fruit)",(yogurt),0.042298,0.139502,0.01515,0.358173,2.567516,0.009249,1.340701,0.637483


The first rule has a Zhang's metric score of $0.7$. This indicates a fairly strong positive association between (whole milk, other vegetables) and root vegetables. This tells us that if we were to separate (whole milk, other vegetables) from root vegetables in our store, there could be an impact on how much of each is purchased. In other words, pairing (whole milk, other vegetables) and root vegetables for promotional purposes is a good choice.

Looking at rules that have a low Zhang's metric is also very useful. Let's take a look at the bottom $5$ rules by Zhang's metric:

In [27]:
rules.sort_values('zhangs_metric',ascending=True).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(bottled beer),(whole milk),0.080529,0.255516,0.020437,0.253788,0.993237,-0.000139,0.997684,-0.007351
2,(bottled water),(whole milk),0.110524,0.255516,0.034367,0.310948,1.21694,0.006126,1.080446,0.200417
52,(rolls/buns),(whole milk),0.183935,0.255516,0.056634,0.307905,1.205032,0.009636,1.075696,0.208496
55,(sausage),(whole milk),0.09395,0.255516,0.029893,0.318182,1.245252,0.005887,1.09191,0.217372
13,(coffee),(whole milk),0.058058,0.255516,0.018709,0.322242,1.261141,0.003874,1.098451,0.21983


The first rule has a Zhang's metric score of $-0.007$. This indicates a very slight dissociation between bottled beer and whole milk, which is close to independence. This tells us that if we were to separate bottled beer from whole milk in the store, there would likely not be an appreciable impact on purchase patterns for either item. This means that pairing these two items for promotional purposes would be a poor choice.