# 🛒 Customer Purchase Association Rules with Apriori

## 📌 Objective

This checkpoint focuses on discovering **association rules** using the **Apriori algorithm** on a dataset of **customer purchase history**. The goal is to help supermarket owners identify product associations and design more effective marketing strategies and product placements.

---


In [None]:
# import libraries 
import mlxtend
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules 

## 🧼 Data Cleaning & Preprocessing

In [5]:
# Load the dataset; the file has no header row, so header=None keeps the first transaction as data
df = pd.read_csv('Market_Basket_Optimisation.csv', header=None)

In [6]:
# Display the first few rows of the dataset
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       7501 non-null   object
 1   1       5747 non-null   object
 2   2       4389 non-null   object
 3   3       3345 non-null   object
 4   4       2529 non-null   object
 5   5       1864 non-null   object
 6   6       1369 non-null   object
 7   7       981 non-null    object
 8   8       654 non-null    object
 9   9       395 non-null    object
 10  10      256 non-null    object
 11  11      154 non-null    object
 12  12      87 non-null     object
 13  13      47 non-null     object
 14  14      25 non-null     object
 15  15      8 non-null      object
 16  16      4 non-null      object
 17  17      4 non-null      object
 18  18      3 non-null      object
 19  19      1 non-null      object
dtypes: object(20)
memory usage: 1.1+ MB


## 🛠️ Apriori Preparation

Before applying the **Apriori algorithm**, we need to convert the dataset into the format the algorithm expects.

### 🔄 One-Hot Encoding

The `apriori` function from `mlxtend` expects a DataFrame where:
- Each **row** represents a transaction
- Each **column** represents a unique item
- Each **cell** contains either **True** or **False** indicating the presence or absence of that item in the transaction

This transformation is known as **one-hot encoding**.
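To make the target layout concrete, here is a minimal sketch in plain Python (no `mlxtend` required) that one-hot encodes three made-up transactions. The helper name `one_hot_encode` and the toy baskets are illustrative assumptions, not part of the notebook.

```python
def one_hot_encode(transactions):
    """Return (sorted item names, list of boolean rows), one row per transaction."""
    # Collect every distinct item, sorted so the column order is reproducible
    items = sorted({item for basket in transactions for item in basket})
    # One row per transaction, one True/False cell per item
    rows = [[item in basket for item in items] for basket in transactions]
    return items, rows


# Hypothetical toy baskets, not taken from the dataset
toy_transactions = [
    ["milk", "eggs"],
    ["eggs"],
    ["milk", "bread", "eggs"],
]

columns, matrix = one_hot_encode(toy_transactions)
print(columns)
for row in matrix:
    print(row)
```

Each printed row lines up with the column names, which is the same shape of table the encoder step in this notebook produces for the real data.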

### 🧰 Using `TransactionEncoder`

To achieve this, we use the `TransactionEncoder` class from the **`mlxtend`** library. It:
- Takes a list of transactions (each a list of item names) as input
- Returns a NumPy array of boolean values (`True`/`False`), with one column per unique item
- Uses booleans rather than 0/1 integers, which carry the same information

In [9]:
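# Convert the DataFrame to a list of lists, one inner list per transaction.
# Empty cells come through as NaN and are removed in the next step.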
transactions = df.values.tolist()
transactions


[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers',
  'meatballs',
  'eggs',
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan],
 ['chutney',
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan],
 ['turkey',
  'avocado',
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan],
 ['mineral water',
  'milk',
  'energy bar',
  'whole wheat rice',
  'green tea',
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan,
  nan],
 ['low fat yogurt',
  nan,
  ...

In [None]:
# Drop the NaN padding from each transaction, keeping only the actual items
cleaned_transactions = [[item for item in transaction if pd.notna(item)] for transaction in transactions]
cleaned_transactions

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers', 'meatballs', 'eggs'],
 ['chutney'],
 ['turkey', 'avocado'],
 ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea'],
 ['low fat yogurt'],
 ['whole wheat pasta', 'french fries'],
 ['soup', 'light cream', 'shallot'],
 ['frozen vegetables', 'spaghetti', 'green tea'],
 ['french fries'],
 ['eggs', 'pet food'],
 ['cookies'],
 ['turkey', 'burgers', 'mineral water', 'eggs', 'cooking oil'],
 ['spaghetti', 'champagne', 'cookies'],
 ['mineral water', 'salmon'],
 ['mineral water'],
 ['shrimp',
  'chocolate',
  'chicken',
  'honey',
  'oil',
  'cooking oil',
  'low fat yogurt'],
 ['turkey', 'eggs'],
 ['turkey',
  'fresh tuna',
  'tomatoes',
  ...

In [12]:
# Encode Transactions (One-Hot)
te = TransactionEncoder()
te_ary = te.fit(cleaned_transactions).transform(cleaned_transactions)
encoded_df = pd.DataFrame(te_ary, columns=te.columns_)
encoded_df.head()

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
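The column header above lists both `asparagus` and `asparagus.1`, which suggests the same product may be stored under two slightly different labels (for example, with stray whitespace). This is an inference from the output, not something the notebook confirms. Under that assumption, a hedged sketch of normalising labels before encoding could look like this; `normalize_transactions` is a hypothetical helper and the baskets are made up.

```python
def normalize_transactions(transactions):
    """Strip surrounding whitespace from each label and drop repeats within a basket."""
    cleaned = []
    for basket in transactions:
        seen = []
        for item in basket:
            label = item.strip()
            if label and label not in seen:  # skip empty labels and repeats
                seen.append(label)
        cleaned.append(seen)
    return cleaned


# Hypothetical baskets showing the suspected problem: " asparagus" vs "asparagus"
raw = [[" asparagus", "eggs"], ["asparagus", "milk "], ["eggs", "eggs"]]
normalized = normalize_transactions(raw)
print(normalized)

distinct_before = {item for basket in raw for item in basket}
distinct_after = {item for basket in normalized for item in basket}
print(len(distinct_before), "distinct labels before,", len(distinct_after), "after")
```

If the assumption holds, applying a step like this to `cleaned_transactions` before encoding would merge the two columns, and the support values for that item would change accordingly.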


## 📊 Support Code – Filtering Itemsets by Support

Now that the dataset has been transformed using one-hot encoding, we can move on to **extracting frequent itemsets** using the **Apriori algorithm**.

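🎯 Goal: We want to find itemsets that appear in **at least 2%** of all transactions (`min_support=0.02`); this cutoff is the **minimum support threshold**. The threshold has to be low here because even the most frequent item, mineral water, appears in only about 24% of transactions.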
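Support is the fraction of transactions that contain a given itemset. The sketch below computes it by hand on made-up baskets and shows the pruning idea behind Apriori: a pair is only considered if each of its items is frequent on its own. It is a simplified illustration limited to itemsets of size 1 and 2, not the `mlxtend` implementation, and the function names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def frequent_itemsets_up_to_pairs(transactions, min_support):
    """Level-wise search restricted to itemsets of size 1 and 2."""
    items = sorted({item for basket in transactions for item in basket})
    # Level 1: keep single items that meet the threshold
    frequent = {}
    for item in items:
        s = support([item], transactions)
        if s >= min_support:
            frequent[frozenset([item])] = s
    # Level 2: only pair up items that are frequent on their own,
    # since a pair can never be more frequent than either of its members
    singles = sorted(next(iter(single)) for single in frequent)
    for pair in combinations(singles, 2):
        s = support(pair, transactions)
        if s >= min_support:
            frequent[frozenset(pair)] = s
    return frequent


# Hypothetical toy baskets, not taken from the dataset
toy_transactions = [
    ["milk", "eggs"],
    ["milk", "bread"],
    ["milk", "eggs", "bread"],
    ["eggs"],
    ["tea"],
]

result = frequent_itemsets_up_to_pairs(toy_transactions, min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(s, 2))
```

Here `tea` falls below the threshold at level 1, so no pair containing it is ever counted; that early pruning is what keeps the search tractable on thousands of transactions.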

In [16]:
# Extract frequent itemsets with the Apriori algorithm and sort them by support (highest first)
frequent_itemsets = apriori(encoded_df, min_support=0.02, use_colnames=True)
frequent_itemsets.sort_values(by='support', ascending=False, inplace=True)
frequent_itemsets

Unnamed: 0,support,itemsets
34,0.238368,(mineral water)
13,0.179709,(eggs)
44,0.174110,(spaghetti)
17,0.170911,(french fries)
9,0.163845,(chocolate)
...,...,...
0,0.020397,(almonds)
80,0.020264,"(frozen smoothie, mineral water)"
67,0.020131,"(cooking oil, mineral water)"
78,0.020131,"(pancakes, french fries)"


## ✅ Confidence-Based Rule Extraction

Once we have our frequent itemsets, we can extract **association rules** and rank them by metrics such as **confidence** and **lift**. The confidence of a rule X → Y is the share of transactions containing X that also contain Y.
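As a minimal sketch of how these two metrics are defined, the plain-Python functions below compute confidence and lift directly from a list of baskets. The function names and the toy data are assumptions for illustration; in the notebook itself these values come from `association_rules`.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(basket) for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of baskets containing the antecedent that also contain the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


# Hypothetical toy baskets, not taken from the dataset
toy_transactions = [
    ["beef", "pasta"],
    ["beef", "pasta"],
    ["beef"],
    ["pasta"],
    ["tea"],
    ["tea"],
    ["tea"],
    ["tea"],
]

conf_beef_pasta = confidence(["beef"], ["pasta"], toy_transactions)
lift_beef_pasta = lift(["beef"], ["pasta"], toy_transactions)
lift_pasta_beef = lift(["pasta"], ["beef"], toy_transactions)
print(round(conf_beef_pasta, 3), round(lift_beef_pasta, 3), round(lift_pasta_beef, 3))
```

Note that lift is symmetric (X → Y and Y → X give the same value) while confidence generally is not, which is why each pair shows up twice in the rule tables with equal lift but different confidence.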

In [20]:
# Generate association rules with confidence >= 0.01, then sort them by confidence (highest first)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.01)
rules.sort_values(by='confidence', ascending=False, inplace=True)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
68,(soup),(mineral water),0.050527,0.238368,0.023064,0.456464,1.914955,1.0,0.011020,1.401255,0.503221,0.086760,0.286354,0.276610
42,(olive oil),(mineral water),0.065858,0.238368,0.027596,0.419028,1.757904,1.0,0.011898,1.310962,0.461536,0.099759,0.237201,0.267400
8,(ground beef),(mineral water),0.098254,0.238368,0.040928,0.416554,1.747522,1.0,0.017507,1.305401,0.474369,0.138413,0.233952,0.294127
10,(ground beef),(spaghetti),0.098254,0.174110,0.039195,0.398915,2.291162,1.0,0.022088,1.373997,0.624943,0.168096,0.272197,0.312015
94,(cooking oil),(mineral water),0.051060,0.238368,0.020131,0.394256,1.653978,1.0,0.007960,1.257349,0.416672,0.074752,0.204676,0.239354
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69,(mineral water),(soup),0.238368,0.050527,0.023064,0.096756,1.914955,1.0,0.011020,1.051182,0.627330,0.086760,0.048690,0.276610
77,(mineral water),(chicken),0.238368,0.059992,0.022797,0.095638,1.594172,1.0,0.008497,1.039415,0.489364,0.082729,0.037921,0.237819
93,(mineral water),(frozen smoothie),0.238368,0.063325,0.020264,0.085011,1.342461,1.0,0.005169,1.023701,0.334938,0.072004,0.023152,0.202506
95,(mineral water),(cooking oil),0.238368,0.051060,0.020131,0.084452,1.653978,1.0,0.007960,1.036472,0.519145,0.074752,0.035189,0.239354


## 🚀 Lift-Based Rule Extraction

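In addition to **confidence**, we can filter rules by **lift**. Lift compares how often X and Y occur together with how often they would if they were purchased independently; values above 1 indicate a positive association.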

In [26]:
# Keep only the rules with lift >= 1.8 (rows are returned unsorted)
association_rules(frequent_itemsets, metric="lift", min_threshold=1.8)


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(ground beef),(spaghetti),0.098254,0.17411,0.039195,0.398915,2.291162,1.0,0.022088,1.373997,0.624943,0.168096,0.272197,0.312015
1,(spaghetti),(ground beef),0.17411,0.098254,0.039195,0.225115,2.291162,1.0,0.022088,1.163716,0.682343,0.168096,0.140684,0.312015
2,(eggs),(burgers),0.179709,0.087188,0.028796,0.160237,1.83783,1.0,0.013128,1.086988,0.555754,0.120941,0.080026,0.245256
3,(burgers),(eggs),0.087188,0.179709,0.028796,0.330275,1.83783,1.0,0.013128,1.224818,0.499424,0.120941,0.183552,0.245256
4,(frozen vegetables),(milk),0.095321,0.129583,0.023597,0.247552,1.910382,1.0,0.011245,1.156781,0.526755,0.117219,0.135532,0.214826
5,(milk),(frozen vegetables),0.129583,0.095321,0.023597,0.182099,1.910382,1.0,0.011245,1.106099,0.54749,0.117219,0.095921,0.214826
6,(soup),(mineral water),0.050527,0.238368,0.023064,0.456464,1.914955,1.0,0.01102,1.401255,0.503221,0.08676,0.286354,0.27661
7,(mineral water),(soup),0.238368,0.050527,0.023064,0.096756,1.914955,1.0,0.01102,1.051182,0.62733,0.08676,0.04869,0.27661
8,(olive oil),(spaghetti),0.065858,0.17411,0.02293,0.348178,1.999758,1.0,0.011464,1.267048,0.535186,0.105651,0.210764,0.239939
9,(spaghetti),(olive oil),0.17411,0.065858,0.02293,0.1317,1.999758,1.0,0.011464,1.075829,0.605334,0.105651,0.070484,0.239939


In [29]:
# Keep the first 10 rules with lift >= 1.8 and rename the columns to "Products" and "Recommendations" for presentation
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.8)[:10]
rules.rename(columns={'antecedents': 'Products', 'consequents': 'Recommendations'}, inplace=True)
# Display the rules
rules[['Products', 'Recommendations', 'support', 'confidence', 'lift']]

Unnamed: 0,Products,Recommendations,support,confidence,lift
0,(ground beef),(spaghetti),0.039195,0.398915,2.291162
1,(spaghetti),(ground beef),0.039195,0.225115,2.291162
2,(eggs),(burgers),0.028796,0.160237,1.83783
3,(burgers),(eggs),0.028796,0.330275,1.83783
4,(frozen vegetables),(milk),0.023597,0.247552,1.910382
5,(milk),(frozen vegetables),0.023597,0.182099,1.910382
6,(soup),(mineral water),0.023064,0.456464,1.914955
7,(mineral water),(soup),0.023064,0.096756,1.914955
8,(olive oil),(spaghetti),0.02293,0.348178,1.999758
9,(spaghetti),(olive oil),0.02293,0.1317,1.999758
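The slice `[:10]` keeps rows in the order they were generated rather than ranking them. As a hedged illustration, the plain-Python sketch below orders the same ten rules by lift, using values copied from the table above; in the notebook itself, `rules.sort_values(by='lift', ascending=False)` would presumably give the same ranking.

```python
# (antecedent, consequent, lift) triples copied from the table above
rules_by_row = [
    ("ground beef", "spaghetti", 2.291162),
    ("spaghetti", "ground beef", 2.291162),
    ("eggs", "burgers", 1.83783),
    ("burgers", "eggs", 1.83783),
    ("frozen vegetables", "milk", 1.910382),
    ("milk", "frozen vegetables", 1.910382),
    ("soup", "mineral water", 1.914955),
    ("mineral water", "soup", 1.914955),
    ("olive oil", "spaghetti", 1.999758),
    ("spaghetti", "olive oil", 1.999758),
]

# sorted() is stable, so the two directions of each pair stay adjacent
ranked = sorted(rules_by_row, key=lambda rule: rule[2], reverse=True)
for antecedent, consequent, lift_value in ranked:
    print(f"{antecedent:>18} -> {consequent:<18} lift={lift_value:.3f}")
```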


## 📈 Association Rules Interpretation (Lift ≥ 1.8)

We used the **Apriori algorithm** to generate product recommendation rules from frequent itemsets, keeping only rules with a **lift of at least 1.8** so that weak associations are excluded.

Below are the **10 rules** that pass this filter, listed in the order they were returned (not ranked by lift), with columns renamed for clarity:

| 🛒 **If you buy...**        | ➕ **You might also like...** |
|-----------------------------|-------------------------------|
| ground beef                 | spaghetti                     |
| spaghetti                   | ground beef                   |
| eggs                        | burgers                       |
| burgers                     | eggs                          |
| frozen vegetables           | milk                          |
| milk                        | frozen vegetables             |
| soup                        | mineral water                 |
| mineral water               | soup                          |
| olive oil                   | spaghetti                     |
| spaghetti                   | olive oil                     |

---

### 🔍 Rule Format:
> If you buy **X**, you might also like **Y**

These rules were extracted using `mlxtend.frequent_patterns.association_rules()` and filtered by `lift ≥ 1.8`, meaning that items **X and Y appear together at least 1.8 times as often as they would if purchased independently**.
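The relationship between the table columns can be checked by hand. The sketch below takes the three support values from the (ground beef) → (spaghetti) row and recomputes confidence, lift, and leverage from them; the variable names are illustrative only.

```python
# Values copied from the (ground beef) -> (spaghetti) row of the rules table
support_beef = 0.098254       # antecedent support
support_spaghetti = 0.174110  # consequent support
support_both = 0.039195       # support of the pair

# Joint support we would expect if the two purchases were independent
expected_if_independent = support_beef * support_spaghetti

confidence = support_both / support_beef
lift = support_both / expected_if_independent
leverage = support_both - expected_if_independent

print(f"expected joint support under independence: {expected_if_independent:.6f}")
print(f"confidence: {confidence:.4f}")
print(f"lift:       {lift:.4f}")
print(f"leverage:   {leverage:.6f}")
```

Up to rounding of the inputs, the results agree with the `confidence`, `lift`, and `leverage` columns reported for that rule.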

---

### 🧠 Example Interpretations:
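- About 40% of baskets containing **ground beef** also contain **spaghetti** (confidence 0.399), roughly 2.3 times the baseline spaghetti rate of 17.4%.
- **Frozen vegetables** and **milk** appear together in about 2.4% of transactions, roughly 1.9 times as often as independence would predict, which may be useful for shelf placement.
- About 35% of baskets containing **olive oil** also contain **spaghetti** (lift ≈ 2.0), suggesting potential for bundled promotions.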
