# Market Basket Analysis with Association Rules in Python

## Learning Objectives
Market basket analysis or affinity analysis is the process of identifying patterns and extracting meaningful insight from transaction sets. Association rules are often used for market basket analysis because they allow us to discover, quantify and analyze the co-occurrence of items within a set of transactions. By the end of this tutorial, you will have learned:

+ How to import and explore a transaction set
+ How to create Frequent Itemsets
+ How to create Association Rules
+ How to evaluate Association Rules 

## 1. Collect the Data

The data that we're going to analyze is market basket data collected from a local grocery store over a 30-day period. The data is stored in CSV (Comma  Separated Values) format as follows:
```
1. citrus fruit,semi-finished bread,margarine,ready soups
2. tropical fruit,yogurt,coffee
3. whole milk
4. pip fruit,yogurt,cream cheese,meat spreads
5. other vegetables,whole milk,condensed milk,long life bakery product
```
Each row in the data represents a set of items purchased by a customer during a store visit, which we refer to as a transaction. As expected, the number of items in each transaction varies so we cannot simply bulk import this data into a tabular data structure such as a pandas DataFrame as-is. Instead, we need to import the data one row at a time.

To do this, we first need to import the `reader` object from the `csv` package.

In [6]:
from csv import reader

Next, we iterate over each line in the input file (*groceries.csv*) and append it to a list called `groceries`.

In [None]:
"""import csv

def remove_trailing_commas(filename):
    with open(filename, 'r') as infile, open('output.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        
        for row in reader:
            cleaned_row = [item.strip() for item in row if item.strip()]
            writer.writerow(cleaned_row)

# Replace 'dataset.csv' with your actual file name
filename = 'groceries.csv'
remove_trailing_commas(filename)"""

In [None]:
"""import csv

def clean_csv(filename):
    with open(filename, 'r') as infile, open('cleaned_output.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        # Skip the first row
        next(reader)

        for row in reader:
            # Remove the first column
            cleaned_row = row[1:]
            writer.writerow(cleaned_row)

# Replace 'dataset.csv' with your actual file name
filename = 'groceries.csv'
clean_csv(filename)"""

In [14]:
import csv

def clean_csv(filename):
    with open(filename, 'r') as infile, open('groceries.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile)

        # Skip the first row (header)
        next(reader)

        writer = csv.writer(outfile)
        for row in reader:
            # Remove the first column
            cleaned_row = row[1:]
            # Remove trailing commas and extra spaces
            cleaned_row = [item.strip() for item in cleaned_row if item.strip()]
            writer.writerow(cleaned_row)

# Replace 'dataset.csv' with your actual file name
filename = 'original_groceries.csv'
clean_csv(filename)

In [15]:
groceries = []
with open('groceries.csv', 'r') as csvfile:
    csv_reader = reader(csvfile)
    for row in csv_reader:
        groceries.append(row)

Let’s preview the first 5 elements in the `groceries` list to make sure that the import worked as expected.

In [16]:
groceries[1:5]

[['tropical fruit', 'yogurt', 'coffee'],
 ['whole milk'],
 ['pip fruit', 'yogurt', 'cream cheese', 'meat spreads'],
 ['other vegetables',
  'whole milk',
  'condensed milk',
  'long life bakery product']]

Now that we've imported the transactions into a list, we need to encode them and represent the data in a sparse format before we can generate frequent itemsets.

To transform our data, we first need to import the `TransactionEncoder` class from the `mlxtend.preprocessing` subpackage. The `mlxtend` package provides several functions and objects for preprocesing transaction data, generating frequent itemsets and creating association rules. 

In [17]:
from mlxtend.preprocessing import TransactionEncoder

Then we instantiate an object called `encoder` from the `TransactionEncoder` class.

In [18]:
encoder=TransactionEncoder()

Using the `encoder` object, we call the `fit()` method to extract the unique labels in the transaction set and the `transform()` method to one-hot encode the transactions into a boolean NumPy array.

In [19]:
Transactions=encoder.fit(groceries).transform(groceries)
Transactions

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False,  True, False],
       [False, False, False, ...,  True, False, False],
       ...,
       [False, False, False, ..., False,  True, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

Next, we convert the transactions into a pandas DataFrame. This requires that we first import the `pandas` package.

In [20]:
import pandas as pd

Then we pass the boolean array and item names to the pandas `DataFrame()` constructor function to create a new DataFrame called `itemsets`.

In [23]:
itemsets=pd.DataFrame(Transactions,columns=encoder.columns_)

We can preview the first 5 rows in the `itemsets` DataFrame by calling the `head()` method.

In [28]:
itemsets.head()
#itemsets.iloc[5:10]

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


We can also get a concise summary of the structure of the `itemsets` DataFrame by calling the `info()` method.

In [29]:
itemsets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9835 entries, 0 to 9834
Columns: 169 entries, Instant food products to zwieback
dtypes: bool(169)
memory usage: 1.6 MB


By looking at the `RangeIndex` value of the summary, we can tell that there are 9,835 transactions in the dataset. The `Columns` value tells us that that there are 169 features (or unique items) in the dataset. 

## 2. Generate Frequent Itemsets

Now that we have our transactions in a compatible format (one-hot encoded pandas DataFrame), let's limit our focus to the frequent itemsets. The `mlxtend.frequent_patterns` subpackage provides three functions for generating frequent itemsets - `apriori`, `fpgrowth` and `fpmax`. All three functions have similar syntax, so we'll limit our illustration to the the `apriori` function. Let's import it.

In [30]:
from mlxtend.frequent_patterns import apriori

The `apriori` function takes several arguments. The first one is the pandas DataFrame of the transactions we wish to analyze. The second is the minimum support threshold of the itemsets we consider frequent. This value specifies how often an itemset must occur in the transaction set in order to warrant our attention. 

Let’s assume that we only want to focus our attention on itemsets that occur at least $5$ times a day. Given that our data is for $30$ days and our dataset has $9,835$ transactions, this means that we need to set our minimum support threshold to $ 5 \times \frac{30}{9835} \approx 0.015$.

In [33]:
frequent_itemsets=apriori(itemsets, min_support= 0.015, use_colnames=True)

By default, the `apriori` function returns the numeric column indices of the items in the transaction set. For ease of interpretation, we set the `use_colnames` argument within the function to `True`. This tells the function to return item names instead of integer values. Let's see what we got.

In [34]:
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.033452,(UHT-milk)
1,0.017692,(baking powder)
2,0.052466,(beef)
3,0.033249,(berries)
4,0.026029,(beverages)
...,...,...
175,0.023183,"(other vegetables, whole milk, root vegetables)"
176,0.017082,"(other vegetables, tropical fruit, whole milk)"
177,0.022267,"(other vegetables, whole milk, yogurt)"
178,0.015557,"(rolls/buns, whole milk, yogurt)"


From the output, we can tell that there are $180$ itemsets with a minimum support value of $0.015$. To get a better sense of which itemsets have the lowest or highest support values, let's sort the data (in descending order of support):

In [35]:
frequent_itemsets.sort_values('support', ascending=False)

Unnamed: 0,support,itemsets
71,0.255516,(whole milk)
45,0.193493,(other vegetables)
54,0.183935,(rolls/buns)
61,0.174377,(soda)
72,0.139502,(yogurt)
...,...,...
163,0.015252,"(shopping bags, yogurt)"
179,0.015150,"(tropical fruit, whole milk, yogurt)"
12,0.015048,(canned fish)
46,0.015048,(pasta)


We see that `{whole milk}`, `{other vegetables}`, `{rolls/buns}`, `{soda}`, and `{yogurt}` are the five most frequently bought items in the store.

One of the benefits of working with pandas DataFrames is that we can easily transform and filter our results to meet our needs. For example, let's assume that we are only interested in frequent itemsets with a length greater than $2$. We start by getting the length of the items in the `itemsets` column as follows:

In [36]:
length= frequent_itemsets['itemsets'].str.len()

Then we create a logical filter based on the length of the item sets:  

In [37]:
rows=length>2

Finally, we apply the filter to the `frequent_itemsets` DataFrame:

In [38]:
frequent_itemsets[rows]

Unnamed: 0,support,itemsets
174,0.017895,"(rolls/buns, other vegetables, whole milk)"
175,0.023183,"(other vegetables, whole milk, root vegetables)"
176,0.017082,"(other vegetables, tropical fruit, whole milk)"
177,0.022267,"(other vegetables, whole milk, yogurt)"
178,0.015557,"(rolls/buns, whole milk, yogurt)"
179,0.01515,"(tropical fruit, whole milk, yogurt)"


Now we see the six frequent itemsets with a length greater than $2$.

We can also use the `describe()` method of a pandas DataFrame to get a big picture view of the distribution of values in the data. For example, to get a statistical summary of the support values by itemset length, we do the following:

In [39]:
frequent_itemsets.groupby(length) ['support'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
itemsets,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,73.0,0.053441,0.045956,0.015048,0.024504,0.037112,0.06487,0.255516
2,101.0,0.024799,0.010058,0.015048,0.018404,0.021047,0.027555,0.074835
3,6.0,0.018522,0.003417,0.01515,0.015938,0.017489,0.021174,0.023183


The `count` column tells us that the majority of the transactions that are considered frequent are two-item purchases ($101$), while the `mean` and `50%` columns, tell us that transactions with a length of $1$ typically have higher support values than those with a length of $2$ or $3$.

In this tutorial we only use the `apriori` frequent itemset generation function. Note that if you want to try out the `fpgrowth` or the `fpmax` functions, you simply need to import them as follows:
```python
from mlxtend.frequent_patterns import fpgrowth
```
or
```python
from mlxtend.frequent_patterns import fpmax
```
Then you can generate frequent itemsets using the applicable function just as we've done here.

## 3. Create Association Rules

The next step in our market basket analysis process is to create association rules that describe the co-occurrence of itemsets within the transaction set. The `association_rules` function in the `mlxtend.frequent_patterns` subpackage allows us to create these rules. Let's import it.

In [40]:
from mlxtend.frequent_patterns import association_rules

The `association_rules` function takes several arguments. The first is the frequent itemset. The next is the metric we intend to use to filter the rules for significance. This can either be "*support*", "*confidence*", "*lift*", "*leverage*" or "*conviction*". 

Let's assume that we want to limit our focus to rules that have a confidence of `0.25` or more. To do this, we set the `metric` argument to `"confidence"` and the `min_threshold` argument to `0.25`.

In [41]:
rules=association_rules(frequent_itemsets, metric='confidence', min_threshold=0.25)

Let's see what rules were generated.

In [42]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(beef),(other vegetables),0.052466,0.193493,0.019725,0.375969,1.943066,0.009574,1.292416
1,(beef),(root vegetables),0.052466,0.108998,0.017387,0.331395,3.040367,0.011668,1.332628
2,(beef),(whole milk),0.052466,0.255516,0.021251,0.405039,1.585180,0.007845,1.251315
3,(bottled beer),(whole milk),0.080529,0.255516,0.020437,0.253788,0.993237,-0.000139,0.997684
4,(bottled water),(soda),0.110524,0.174377,0.028978,0.262190,1.503577,0.009705,1.119017
...,...,...,...,...,...,...,...,...,...
73,"(rolls/buns, yogurt)",(whole milk),0.034367,0.255516,0.015557,0.452663,1.771563,0.006775,1.360192
74,"(whole milk, yogurt)",(rolls/buns),0.056024,0.183935,0.015557,0.277677,1.509648,0.005252,1.129779
75,"(tropical fruit, whole milk)",(yogurt),0.042298,0.139502,0.015150,0.358173,2.567516,0.009249,1.340701
76,"(tropical fruit, yogurt)",(whole milk),0.029283,0.255516,0.015150,0.517361,2.024770,0.007668,1.542528


There are $78$ association rules that meet our criteria. Each rule is made up of an antecedent and a consequent. For each rule, we get metrics that tell us the support of the antecedent and the support of the consequent. We also get metrics that tell us the support, confidence, lift, leverage and conviction of each rule. 

Because our rules are returned as a pandas DataFrames we can easily transform and filter the data to find what we need. For example, let's say we're only interested in rules that include `'rolls/buns'` in the antecedent. We start by creating a logical expression as a filter:

In [44]:
rows=rules['antecedents'] == {'rolls/buns'}

Note that the entries in the `antecedents` and `consequents` columns are of type frozenset, which is why we include the curly braces `{}` around the item names.

The next step is to apply the filter to the `rules` DataFrame:

In [45]:
rules[rows]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
51,(rolls/buns),(whole milk),0.183935,0.255516,0.056634,0.307905,1.205032,0.009636,1.075696


We get $1$ rule that matches our criteria. As you can imagine, we can create a similar filter with the consequent:

In [47]:
rows= rules['consequents']== {'rolls/buns'}
rules[rows]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
24,(frankfurter),(rolls/buns),0.058973,0.183935,0.019217,0.325862,1.771616,0.00837,1.210531
50,(sausage),(rolls/buns),0.09395,0.183935,0.030605,0.325758,1.771048,0.013324,1.210344
74,"(whole milk, yogurt)",(rolls/buns),0.056024,0.183935,0.015557,0.277677,1.509648,0.005252,1.129779


This time, we get $3$ rules that match our filter.

Note that in the previous two examples, we only matched rules with `'rolls/buns'` alone as the antecedent or the consequent. If our goal is to match all rules with `'rolls/buns'` and any other item in the antecedent for example, we would need to convert the antecedent column to a string, vectorize the string and use the `contains()` method in the following way:

In [52]:
rows=rules['antecedents'].astype(str).str.contains('rolls/buns')
rules[rows]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
51,(rolls/buns),(whole milk),0.183935,0.255516,0.056634,0.307905,1.205032,0.009636,1.075696
62,"(rolls/buns, other vegetables)",(whole milk),0.042603,0.255516,0.017895,0.420048,1.643919,0.00701,1.283699
63,"(rolls/buns, whole milk)",(other vegetables),0.056634,0.193493,0.017895,0.315978,1.633026,0.006937,1.179067
72,"(rolls/buns, whole milk)",(yogurt),0.056634,0.139502,0.015557,0.274686,1.969049,0.007656,1.18638
73,"(rolls/buns, yogurt)",(whole milk),0.034367,0.255516,0.015557,0.452663,1.771563,0.006775,1.360192


There are $5$ rules with `'rolls/buns'` anywhere in the antecedent.

We can aslo filter our rules by the length of the antecedent or consequent. For example, to match only rules with an antecedent length more than `1` we do the following:

In [56]:
rows=rules['antecedents'].str.len()>1
rules[rows]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
62,"(rolls/buns, other vegetables)",(whole milk),0.042603,0.255516,0.017895,0.420048,1.643919,0.00701,1.283699
63,"(rolls/buns, whole milk)",(other vegetables),0.056634,0.193493,0.017895,0.315978,1.633026,0.006937,1.179067
64,"(other vegetables, whole milk)",(root vegetables),0.074835,0.108998,0.023183,0.309783,2.842082,0.015026,1.2909
65,"(other vegetables, root vegetables)",(whole milk),0.047382,0.255516,0.023183,0.48927,1.914833,0.011076,1.457687
66,"(whole milk, root vegetables)",(other vegetables),0.048907,0.193493,0.023183,0.474012,2.44977,0.013719,1.53332
67,"(tropical fruit, other vegetables)",(whole milk),0.035892,0.255516,0.017082,0.475921,1.862587,0.007911,1.420556
68,"(tropical fruit, whole milk)",(other vegetables),0.042298,0.193493,0.017082,0.403846,2.08714,0.008898,1.352851
69,"(other vegetables, whole milk)",(yogurt),0.074835,0.139502,0.022267,0.297554,2.132979,0.011828,1.225003
70,"(other vegetables, yogurt)",(whole milk),0.043416,0.255516,0.022267,0.512881,2.007235,0.011174,1.52834
71,"(whole milk, yogurt)",(other vegetables),0.056024,0.193493,0.022267,0.397459,2.054131,0.011427,1.338511


We get $16$ rules that match our criteria.

We can also filter our rules based on the values in any of the numeric columns. For example, let's assume that we only want to see rules that have a lift of more than `2`, a leverage score more than `0.01` and a conviction score of more than `1.4`. This can be written as follows:

In [59]:
rows=(rules['lift']>2) & (rules['leverage']>0.01) & (rules['conviction']>1.4)
rules[rows]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
39,(root vegetables),(other vegetables),0.108998,0.193493,0.047382,0.434701,2.246605,0.026291,1.426693
66,"(whole milk, root vegetables)",(other vegetables),0.048907,0.193493,0.023183,0.474012,2.44977,0.013719,1.53332
70,"(other vegetables, yogurt)",(whole milk),0.043416,0.255516,0.022267,0.512881,2.007235,0.011174,1.52834


There are $3$ rules with a lift of more than `2`, a leverage score more than `0.01` and a conviction score of more than `1.4`.

As you can imagine, the examples shown here are just a tip of the iceberg. You can slice and dice the `rules` DataFrame in as many ways as you can imagine. Feel free to do so.

## 4. Evaluate Association Rules

Now that we've created association rules and know how to filter rules based on different criteria, let's take a look at how to evaluate them based on the associated metrics. 

A quick way to get a big-picture view of the metrics is with summary statistics. We do this by calling the `describe()` method of the `rules` DataFrame:

In [60]:
rules.describe()

Unnamed: 0,antecedent support,consequent support,support,confidence,lift,leverage,conviction
count,78.0,78.0,78.0,78.0,78.0,78.0,78.0
mean,0.073186,0.211041,0.025578,0.360834,1.763158,0.010156,1.245113
std,0.036738,0.046866,0.012045,0.070495,0.377079,0.004766,0.115688
min,0.029283,0.104931,0.015048,0.253714,0.993237,-0.000139,0.997684
25%,0.052466,0.193493,0.017895,0.308374,1.504669,0.00697,1.166832
50%,0.06121,0.193493,0.021912,0.354567,1.740032,0.00926,1.226608
75%,0.082766,0.255516,0.028876,0.405608,1.942669,0.011788,1.294081
max,0.255516,0.255516,0.074835,0.517361,3.040367,0.026291,1.542528


The summary statistics provide us with the mean, standard deviation, minimum, maximum and some percentile values for the association rule metrics. From the summary, we can tell that a typical rule has a lift of $1.76$ and that the lift values range from $0.99$ to $3.04$.

**Lift** tells us how much more the antecedent and consequent occur together in contrast to how often they occur independently. In other words, lift is the strength of association. Lift values range from $0$ to $\infty$, where a value of $1$ indicates independence between the antecedent and the consequent. Let's take a look at the top $5$ rules by lift: 

In [61]:
rules.sort_values('lift', ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1,(beef),(root vegetables),0.052466,0.108998,0.017387,0.331395,3.040367,0.011668,1.332628
64,"(other vegetables, whole milk)",(root vegetables),0.074835,0.108998,0.023183,0.309783,2.842082,0.015026,1.2909
77,"(whole milk, yogurt)",(tropical fruit),0.056024,0.104931,0.01515,0.270417,2.577089,0.009271,1.226823
47,(pip fruit),(tropical fruit),0.075648,0.104931,0.020437,0.270161,2.574648,0.012499,1.226392
75,"(tropical fruit, whole milk)",(yogurt),0.042298,0.139502,0.01515,0.358173,2.567516,0.009249,1.340701


The first rule has a lift score of $3.04$. We interpret this to mean that customers who bought beef are $3.04$ times more likely to also purchase root vegetables. Note that lift values above $1$ indicate an increased likelihood, while lift values below $1$ indicate a reduced likelihood.

**Leverage** is similar to lift and can be thought of as a normalized value for lift. Leverage values range from $-1$ to $1$, where a value of $0$ indicates independence between the antecedent and the consequent. Let's take a look at the top $5$ rules by leverage: 

In [62]:
rules.sort_values('leverage', ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
39,(root vegetables),(other vegetables),0.108998,0.193493,0.047382,0.434701,2.246605,0.026291,1.426693
43,(other vegetables),(whole milk),0.193493,0.255516,0.074835,0.386758,1.513634,0.025394,1.214013
44,(whole milk),(other vegetables),0.255516,0.193493,0.074835,0.292877,1.513634,0.025394,1.140548
52,(root vegetables),(whole milk),0.108998,0.255516,0.048907,0.448694,1.756031,0.021056,1.350401
61,(yogurt),(whole milk),0.139502,0.255516,0.056024,0.401603,1.571735,0.020379,1.244132


The first rule has a leverage score of $0.026$. We interpret this to mean that customers who bought root vegetables are also likely to purchase other vegetables. This is expected behavior. However, the second rule which tells us that customers who bought whole milk are $1.5$ times or $50\%$ (using the lift scores) more likely to also purchase other vegetables is a bit suspect. Rules that include highly purchased items such as whole milk can be deceiving, so we should also look at the conviction of association rules.

**Conviction** quantifies how dependent the consequent is on the antecedent. It is also related to lift. However unlike lift, coviction is sensitive to rule direction. This means that $\text{Conviction}_{A \rightarrow B} \neq \text{Conviction}_{B \rightarrow A}$. Conviction values range from $0$ to $\infty$, where a value of $1$ indicates independence between the antecedent and the consequent. Let's take a look at the top $5$ rules by conviction: 

In [63]:
rules.sort_values('conviction', ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
76,"(tropical fruit, yogurt)",(whole milk),0.029283,0.255516,0.01515,0.517361,2.02477,0.007668,1.542528
66,"(whole milk, root vegetables)",(other vegetables),0.048907,0.193493,0.023183,0.474012,2.44977,0.013719,1.53332
70,"(other vegetables, yogurt)",(whole milk),0.043416,0.255516,0.022267,0.512881,2.007235,0.011174,1.52834
9,(butter),(whole milk),0.055414,0.255516,0.027555,0.497248,1.946053,0.013395,1.480817
19,(curd),(whole milk),0.053279,0.255516,0.026131,0.490458,1.919481,0.012517,1.461085


The first rule has a conviction of $1.54$. This means that the rule $\{\text{tropical fruit, yogurt}\} \rightarrow \{\text{whole milk}\}$ would be incorrect $54\%$ more often (or $1.54$ times as often) if the consequent was independent of the antecedent. The higher the conviction, the more likely it is that the consequent is dependent on the antecedent.

Besides the metrics returned by the `association_rules` function, **Zhang's Metric** is another useful metric that we should also take into consideration when evaluating rules. It ranges in value from $-1$ to $1$ which represent perfect association and perfect dissociation respectively. Zhang's metric is useful in identifying items that should not be placed next to each other, even if they have been purchased together previously. It is calculated as follows:

$$ \text{Zhang}_{A \rightarrow B} = \frac{\text{Support}_{A \rightarrow B} - (\text{Support}_{A} \times \text{Support}_{B})}{\text{max}\{[\text{Support}_{A \rightarrow B} \times (1 - \text{Support}_{A})], [\text{Support}_{A} \times (\text{Support}_{B} - \text{Support}_{A \rightarrow B})]\}}$$

Where $\text{Support}_{A \rightarrow B}$ is the support of the rule, $\text{Support}_{A}$ is the antecedent support and $\text{Support}_{B}$ is the consequent support.

We can add Zhang's metric to our `rules` DataFrame by first creating a function that calculates it:

In [73]:
import numpy as np
def zhang_metric(rules):
    sup=rules['support'].copy()
    sup_a=rules['antecedent support'].copy()
    sup_b=rules['consequent support'].copy()
    num=sup-sup_a*sup_b
    denom=np.max( (sup*(1-sup_a).values, sup_a*(sup_b-sup).values ),axis=0)
    return num

Then, we assign the result of the function to new column called `'zhang'` in the `rules` DataFrame as follows:

In [74]:
rules['zhang']=zhang_metric(rules)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhang
0,(beef),(other vegetables),0.052466,0.193493,0.019725,0.375969,1.943066,0.009574,1.292416,0.009574
1,(beef),(root vegetables),0.052466,0.108998,0.017387,0.331395,3.040367,0.011668,1.332628,0.011668
2,(beef),(whole milk),0.052466,0.255516,0.021251,0.405039,1.585180,0.007845,1.251315,0.007845
3,(bottled beer),(whole milk),0.080529,0.255516,0.020437,0.253788,0.993237,-0.000139,0.997684,-0.000139
4,(bottled water),(soda),0.110524,0.174377,0.028978,0.262190,1.503577,0.009705,1.119017,0.009705
...,...,...,...,...,...,...,...,...,...,...
73,"(rolls/buns, yogurt)",(whole milk),0.034367,0.255516,0.015557,0.452663,1.771563,0.006775,1.360192,0.006775
74,"(whole milk, yogurt)",(rolls/buns),0.056024,0.183935,0.015557,0.277677,1.509648,0.005252,1.129779,0.005252
75,"(tropical fruit, whole milk)",(yogurt),0.042298,0.139502,0.015150,0.358173,2.567516,0.009249,1.340701,0.009249
76,"(tropical fruit, yogurt)",(whole milk),0.029283,0.255516,0.015150,0.517361,2.024770,0.007668,1.542528,0.007668


Let's take a look at the top $5$ rules by the zhang metric: 

In [75]:
rules.sort_values('zhang',ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhang
39,(root vegetables),(other vegetables),0.108998,0.193493,0.047382,0.434701,2.246605,0.026291,1.426693,0.026291
43,(other vegetables),(whole milk),0.193493,0.255516,0.074835,0.386758,1.513634,0.025394,1.214013,0.025394
44,(whole milk),(other vegetables),0.255516,0.193493,0.074835,0.292877,1.513634,0.025394,1.140548,0.025394
52,(root vegetables),(whole milk),0.108998,0.255516,0.048907,0.448694,1.756031,0.021056,1.350401,0.021056
61,(yogurt),(whole milk),0.139502,0.255516,0.056024,0.401603,1.571735,0.020379,1.244132,0.020379


The first rule has a zhang metric score of $0.708$. This indicates a pretty strong positive association between beef and root vegetables. This tells us that if we were to separate beef from root vegetables in our store, there could be an impact to how much of both are purchased. In other words, pairing beef and root vegetables for promotional purposes is a good choice.

Looking at rules that have a low zhang metric is also very useful. Let's take a look at the bottom $5$ rules by the zhang metric: 

In [76]:
rules.sort_values('zhang').head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhang
3,(bottled beer),(whole milk),0.080529,0.255516,0.020437,0.253788,0.993237,-0.000139,0.997684,-0.000139
16,(coffee),(whole milk),0.058058,0.255516,0.018709,0.322242,1.261141,0.003874,1.098451,0.003874
12,(chocolate),(whole milk),0.049619,0.255516,0.016675,0.336066,1.315243,0.003997,1.121322,0.003997
23,(frankfurter),(other vegetables),0.058973,0.193493,0.016472,0.27931,1.443519,0.005061,1.119077,0.005061
74,"(whole milk, yogurt)",(rolls/buns),0.056024,0.183935,0.015557,0.277677,1.509648,0.005252,1.129779,0.005252


The first rule has a zhang metric score of $-0.000139$. This indicates a slight dissociation between bottled beer and whole milk. This tells us that if we were to separate bottled beer from whole milk in the store, there would likely not be an appreciable impact on purchase patterns for both items. This means that it would be a bad choice to pair these two items together for promotional purposes.