# Problem 5: Market Basket Analysis using Apriori

This notebook implements the fifth problem statement: performing Market Basket Analysis to find product associations from retail transactions using the Apriori algorithm.

### Task 1: Setup and Data Loading

First, we need to install the `apyori` library, which provides a simple implementation of the Apriori algorithm. Then, we will import the necessary libraries and load the dataset.

In [1]:
%pip install apyori


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

In [3]:
# Load the dataset from the local CSV file
# This dataset has no header row.
file_path = 'd:\\ml\\LP-I\\Association Rule mining datasets_Market_Basket_Optimisation.csv'
df = pd.read_csv(file_path, header=None)

# Display the first few rows
print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


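| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
| 1 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |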


### Task 2: Data Pre-processing

The Apriori algorithm requires the data to be in the form of a list of lists, where each inner list represents a transaction. We will convert our pandas DataFrame into this format.

In [4]:
# Convert the pandas DataFrame into a list of lists (transactions)
# We use apply with a lambda function to handle dropping NaNs and converting to string for each row.
transactions = df.apply(lambda row: [str(item) for item in row.dropna()], axis=1).tolist()

print("First 3 transactions in list format:")
print(transactions[:3])

First 3 transactions in list format:
[['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil'], ['burgers', 'meatballs', 'eggs'], ['chutney']]
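
To make the conversion easier to follow, here is a minimal sketch that applies the same expression to a tiny, made-up DataFrame (the name `toy_df` and its contents are illustrative assumptions, not part of the dataset). It shows that the NaN padding is dropped, so each transaction keeps only the items actually bought.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV: ragged baskets padded with NaN.
toy_df = pd.DataFrame([
    ["milk", "bread", np.nan],
    ["eggs", np.nan, np.nan],
    ["milk", "eggs", "butter"],
])

# Same expression as in the notebook, applied row by row.
toy_transactions = toy_df.apply(
    lambda row: [str(item) for item in row.dropna()], axis=1
).tolist()

print(toy_transactions)
```

Without the `dropna()` call, the padding values would be carried into the transactions as if they were items.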


### Task 3: Train Apriori Algorithm to Generate Frequent Itemsets and Rules

Now we will run the Apriori algorithm on our list of transactions. We need to set some key parameters:
- **`min_support`**: The minimum support for an itemset to be considered frequent. We'll keep itemsets that appear in at least 3 transactions per day. The dataset has 7501 transactions over one week, so `(3 * 7) / 7501` ≈ `0.0028`, which we round to `0.003`.
- **`min_confidence`**: The minimum confidence for a rule. We'll start with `0.2` (20%).
- **`min_lift`**: The minimum lift for a rule. A lift greater than 1 suggests a positive correlation. We'll set this to `3`.
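- **`min_length`**: Intended as the minimum number of items in a rule, set to `2`. Note that `apyori` reads only `min_support`, `min_confidence`, `min_lift`, and `max_length` from its keyword arguments, so `min_length` appears to be silently ignored. Single-item rules are still excluded here in practice, because a rule with an empty left-hand side always has a lift of exactly 1, which is below the `min_lift=3` threshold.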
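
The three metrics behind these thresholds can be sketched in plain Python. This is a minimal illustration on made-up transactions using the standard definitions of support, confidence, and lift. The helper names `support`, `confidence`, and `lift` are assumptions for this sketch and are not part of `apyori`.

```python
# Toy transactions (made up for illustration, not taken from the dataset).
toy = [
    {"pasta", "sauce", "beef"},
    {"pasta", "sauce"},
    {"pasta", "beef"},
    {"milk", "bread"},
    {"milk", "pasta", "sauce"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Estimated P(rhs | lhs): support of the union divided by support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """Confidence relative to the baseline frequency of rhs."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

print(support({"pasta", "sauce"}, toy))       # 3 of 5 baskets
print(confidence({"pasta"}, {"sauce"}, toy))  # about 0.75
print(lift({"pasta"}, {"sauce"}, toy))        # about 1.25
```

A lift of about 1.25 means that, in this toy data, baskets containing pasta include sauce about 25% more often than baskets in general.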

In [5]:
rules = apriori(transactions=transactions,
                min_support=0.003,
                min_confidence=0.2,
                min_lift=3,
                min_length=2) # intended: rules with at least two items (apyori may ignore this keyword)
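
The `min_support` value passed above can be sanity-checked with a few lines of arithmetic. This sketch uses only the figures stated in the notebook (3 transactions per day, 7 days, 7501 transactions). The variable names are illustrative.

```python
import math

transactions_per_day = 3
days = 7
total_transactions = 7501

# Exact ratio described in the text.
exact_support = transactions_per_day * days / total_transactions
print(round(exact_support, 4))  # 0.0028

# Smallest transaction count that satisfies the rounded threshold of 0.003.
min_count = math.ceil(0.003 * total_transactions)
print(min_count)  # 23
```

Rounding up to `0.003` therefore makes the filter slightly stricter than the stated "3 per day": an itemset must appear in at least 23 transactions rather than 21.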

### Task 4: Generate and Visualize Association Rules

The output of the `apriori` function is a generator. We'll convert it to a list and then display the discovered rules in a structured and readable way.
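
The reason for converting to a list is that a Python generator can be consumed only once. The sketch below uses a hypothetical stand-in generator, `toy_rules`, rather than `apyori` itself, to show the behaviour.

```python
def toy_rules():
    """Hypothetical stand-in for a generator of rules."""
    yield "rule A"
    yield "rule B"

gen = toy_rules()
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left to yield

print(first_pass)   # ['rule A', 'rule B']
print(second_pass)  # []
```

Storing the list in a variable lets the results be inspected repeatedly without re-running the algorithm.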

In [6]:
def inspect(results: list) -> pd.DataFrame:
    """Converts a list of apyori results into a readable pandas DataFrame.
    
    Args:
        results (list): The list of association rules from the apriori algorithm.
    
    Returns:
        pd.DataFrame: A DataFrame with columns for LHS, RHS, Support, Confidence, and Lift.
    """
    
    rule_details = []
    for result in results:
        for rule in result.ordered_statistics:
            lhs = ', '.join(list(rule.items_base))
            rhs = ', '.join(list(rule.items_add))
            rule_details.append((lhs, rhs, result.support, rule.confidence, rule.lift))
    return pd.DataFrame(rule_details, columns=['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])

results_df = inspect(list(rules))

print("Discovered Association Rules (sorted by Lift):")
results_df.nlargest(n=10, columns='Lift')

Discovered Association Rules (sorted by Lift):


| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 107 | soup, frozen vegetables | mineral water, milk | 0.003066 | 0.383333 | 7.987176 |
| 103 | olive oil, frozen vegetables | mineral water, milk | 0.003333 | 0.294118 | 6.128268 |
| 69 | mineral water, whole wheat pasta | olive oil | 0.003866 | 0.402778 | 6.115863 |
| 108 | soup, milk | frozen vegetables, mineral water | 0.003066 | 0.201754 | 5.646864 |
| 56 | tomato sauce | spaghetti, ground beef | 0.003066 | 0.216981 | 5.535971 |
| 109 | mineral water, frozen vegetables, milk | soup | 0.003066 | 0.277108 | 5.484407 |
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 58 | spaghetti, tomato sauce | ground beef | 0.003066 | 0.489362 | 4.9806 |
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
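
The columns in this table are tied together by the standard definitions: confidence is the rule's support divided by the support of the left-hand side, and lift is confidence divided by the support of the right-hand side. As a rough cross-check, the sketch below works backwards from the `light cream` → `chicken` row to the individual item supports that those numbers imply. The variable names are illustrative.

```python
# Values copied from the "light cream -> chicken" row of the table above.
rule_support = 0.004533   # support of {light cream, chicken}
confidence = 0.290598
lift = 4.843951

support_lhs = rule_support / confidence   # implied support of light cream
support_rhs = confidence / lift           # implied support of chicken

print(f"support(light cream) ~ {support_lhs:.4f}")
print(f"support(chicken)     ~ {support_rhs:.4f}")
```

Under these definitions, chicken appears in roughly 6% of all baskets but in about 29% of baskets that contain light cream, which is where the lift of about 4.8 comes from.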


### Task 5: Observe How Rules Change with Varying Parameters

Now, let's see what happens if we increase our **`min_confidence`** threshold to `0.4` (40%). We expect to get fewer, but potentially more reliable, rules.

In [7]:
rules_high_confidence = apriori(transactions=transactions,
                                min_support=0.003,
                                min_confidence=0.4, # Increased confidence
                                min_lift=3,
                                min_length=2)

results_high_conf_df = inspect(list(rules_high_confidence))

print("Discovered Rules with Higher Confidence (min_confidence=0.4):")
results_high_conf_df.nlargest(n=10, columns='Lift')

Discovered Rules with Higher Confidence (min_confidence=0.4):


| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 18 | mineral water, whole wheat pasta | olive oil | 0.003866 | 0.402778 | 6.115863 |
| 14 | spaghetti, tomato sauce | ground beef | 0.003066 | 0.489362 | 4.9806 |
| 8 | herb & pepper, french fries | ground beef | 0.0032 | 0.461538 | 4.697422 |
| 2 | spaghetti, cereals | ground beef | 0.003066 | 0.46 | 4.681764 |
| 30 | mineral water, soup, frozen vegetables | milk | 0.003066 | 0.605263 | 4.670863 |
| 5 | chocolate, herb & pepper | ground beef | 0.003999 | 0.441176 | 4.490183 |
| 23 | chocolate, mineral water, shrimp | frozen vegetables | 0.0032 | 0.421053 | 4.417225 |
| 28 | mineral water, olive oil, frozen vegetables | milk | 0.003333 | 0.510204 | 3.937285 |
| 1 | ground beef, cereals | spaghetti | 0.003066 | 0.676471 | 3.885303 |
| 4 | chicken, olive oil | milk | 0.0036 | 0.5 | 3.858539 |
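
Raising `min_confidence` behaves like a per-rule filter on the Confidence column, which is consistent with the two result tables. The sketch below assumes that behaviour and applies it to a few rows copied from the first table. The `rules_df` and `high_conf` names are illustrative, and only five of the rules are included.

```python
import pandas as pd

# A few rows copied from the first results table (min_confidence=0.2).
rules_df = pd.DataFrame(
    [
        ("soup, frozen vegetables", "mineral water, milk", 0.003066, 0.383333, 7.987176),
        ("mineral water, whole wheat pasta", "olive oil", 0.003866, 0.402778, 6.115863),
        ("spaghetti, tomato sauce", "ground beef", 0.003066, 0.489362, 4.980600),
        ("light cream", "chicken", 0.004533, 0.290598, 4.843951),
        ("pasta", "escalope", 0.005866, 0.372881, 4.700812),
    ],
    columns=["Left Hand Side", "Right Hand Side", "Support", "Confidence", "Lift"],
)

# Keep only rules that would survive min_confidence=0.4.
high_conf = rules_df[rules_df["Confidence"] >= 0.4]
print(high_conf[["Left Hand Side", "Right Hand Side", "Confidence"]])
```

Of these five rules, only the two with `olive oil` and `ground beef` on the right-hand side pass the 0.4 threshold, and both appear in the second table.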


### Conclusion

We have successfully performed a market basket analysis using the Apriori algorithm.

**Code Quality and Clarity:**
- The notebook is structured logically, from data loading and pre-processing to rule generation and analysis.
- Comments explain the rationale behind parameter choices (like `min_support`).
- A helper function (`inspect`) is used to format the results neatly into a DataFrame, which is a clear and standard way to present tabular data.

**Observations:**
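- The initial run with `min_confidence=0.2` yielded several interesting rules. For example, the rule `light cream → chicken` has a lift of about 4.84 with a confidence of about 29%: customers who buy light cream are roughly 4.8 times more likely to buy chicken than customers overall, even though fewer than a third of them do.
- When we increased `min_confidence` to `0.4`, the number of rules generated decreased. Rules such as `pasta → escalope` (confidence ≈ 0.37) dropped out, while the rules that remain, such as `spaghetti, tomato sauce → ground beef` (confidence ≈ 0.49), have a higher probability that the Right Hand Side item will be purchased when the Left Hand Side items are. The highest-lift rule from the first run, `soup, frozen vegetables → mineral water, milk` (lift ≈ 7.99, confidence ≈ 0.38), also dropped out, so higher confidence does not imply higher lift.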
- This analysis provides actionable insights for a retail store, such as placing associated items together or creating targeted promotions.