# Association Rules Mining Using Python Generators to Handle Large Datasets

### Motivation
Association analysis using the Apriori algorithm can be employed to derive rules of the form {A} -> {B}. However, this algorithm is not included in most standard Python machine learning libraries. While some implementations are available, they may not be optimized for handling large datasets. For example, a dataset with 32 million records, 3.2 million unique orders, and approximately 50,000 unique items (around 1 GB in size) can exceed the capabilities of many existing libraries.

To address this, a custom implementation of the Apriori algorithm can be developed to generate simple {A} -> {B} association rules. When the primary objective is to identify relationships between pairs of items, generating item sets of size 2 is sufficient. The process can involve splitting the dataset into smaller subsets to allow memory-intensive functions such as crosstab and combinations to run on systems with limited memory. However, even with this approach, processing a large number of items (e.g., over 1,800) may still lead to memory issues or kernel crashes. In such cases, using Python generators can offer an efficient solution for handling large datasets while reducing memory consumption.

### Using Python Generators

Generators are a special kind of function in Python that produce a sequence of values lazily. A regular function computes its whole result before returning it (for example, a fully populated list), whereas a generator produces one value at a time using the yield keyword. To retrieve the next value, you either call the built-in next() on the generator or iterate over it with a loop, such as a for loop. Because values are generated on demand, processed individually, and discarded when no longer needed, the full sequence never has to be held in memory at once. This makes generators well suited to tasks involving large datasets.
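As a minimal sketch of this behavior (the function name `squares` is made up for illustration), the generator below computes each value only when it is requested, and the generator object itself stays small regardless of how many values it can produce:

```python
import sys

def squares(n):
    """Yield the squares 0, 1, 4, ... one at a time instead of building a list."""
    for i in range(n):
        yield i * i

gen = squares(1_000_000)      # nothing is computed yet
first = next(gen)             # 0, computed on demand
second = next(gen)            # 1
remaining_total = sum(gen)    # consumes the rest, one value at a time

# Compare the size of a fresh generator object with the equivalent list,
# which holds all one million values in memory at once
gen_size = sys.getsizeof(squares(1_000_000))
list_size = sys.getsizeof([i * i for i in range(1_000_000)])
print(first, second, gen_size < list_size)   # 0 1 True
```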

A practical application of this is generating item pairs from orders and counting their frequency of co-occurrence. Generators are especially well-suited for this purpose as they can efficiently create and process item pairs without overloading memory. Here's a concrete illustration of the process:

1. Generate all possible item pairs for each order. Example:

        Order 1: apple, egg, milk → Item pairs: {apple, egg}, {apple, milk}, {egg, milk}
        Order 2: egg, milk → Item pairs: {egg, milk}

2. Count the frequency of each item pair. Example:

        {apple, egg}: 1
        {apple, milk}: 1
        {egg, milk}: 2

The following generator implements both tasks:

In [1]:
import numpy as np
from itertools import combinations, groupby
from collections import Counter

# Sample data
orders = np.array([[1,'apple'], [1,'egg'], [1,'milk'], [2,'egg'], [2,'milk']], dtype=object)

# Generator that yields item pairs, one at a time
def get_item_pairs(order_item):

    # For each order, generate a list of items in that order
    # (groupby only groups adjacent rows, so rows must be ordered by order ID)
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]

        # For each item list, generate item pairs, one at a time
        for item_pair in combinations(item_list, 2):
            yield item_pair


# Counter iterates through the item pairs returned by our generator and keeps a tally of their occurrence
Counter(get_item_pairs(orders))

Counter({('egg', 'milk'): 2, ('apple', 'egg'): 1, ('apple', 'milk'): 1})

### Apriori Algorithm 
Apriori is an algorithm used to identify frequent item sets (in our case, item pairs).  It does so using a "bottom-up" approach, first identifying individual items that satisfy a minimum occurrence threshold. It then extends the item sets one item at a time, keeping only those that still satisfy the threshold.  The algorithm stops when no item set can be extended further while meeting the minimum occurrence requirement.  Here's an example of apriori in action, assuming a minimum occurrence threshold of 3:


    order 1: apple, egg, milk  
    order 2: carrot, milk  
    order 3: apple, egg, carrot
    order 4: apple, egg
    order 5: apple, carrot

    
    Iteration 1:  Count the number of times each item occurs   
    item set      occurrence count    
    {apple}              4   
    {egg}                3   
    {carrot}             3   
    {milk}               2   

    {milk} is eliminated because it does not meet the minimum occurrence threshold.


    Iteration 2: Build item sets of size 2 using the remaining items from Iteration 1 
                 (ie: apple, egg, carrot)  
    item set           occurrence count  
    {apple, egg}             3  
    {apple, carrot}          2  
    {egg, carrot}            1  

    {apple, carrot} and {egg, carrot} are eliminated.  Only {apple, egg} remains, and the 
    algorithm stops since no item set of size 3 can be built from a single frequent pair.
   
   
If we had more orders and items, we could continue to iterate, building item sets of more than 2 elements.  For the problem we are trying to solve (ie: finding relationships between pairs of items), it suffices to run apriori up to item sets of size 2.
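The two passes can be sketched in a few lines of Python, using the five example orders and the threshold of 3. This is a minimal illustration limited to item sets of size 2, not a general Apriori implementation, and the variable names are illustrative only:

```python
from collections import Counter
from itertools import combinations

orders = [
    ["apple", "egg", "milk"],
    ["carrot", "milk"],
    ["apple", "egg", "carrot"],
    ["apple", "egg"],
    ["apple", "carrot"],
]
min_count = 3

# Pass 1: count single items and keep those that reach the threshold
item_counts = Counter(item for order in orders for item in set(order))
frequent_items = {item for item, n in item_counts.items() if n >= min_count}

# Pass 2: build pairs only from the surviving items, then count and filter again
pair_counts = Counter(
    pair
    for order in orders
    for pair in combinations(sorted(set(order) & frequent_items), 2)
)
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}

print(frequent_pairs)   # {('apple', 'egg'): 3}
```

Restricting the second pass to items that survived the first is what keeps the number of candidate pairs manageable on large datasets.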

### Association Rules Mining
Once the item sets have been generated using apriori, we can start mining association rules.  Given that we are only looking at item sets of size 2, the association rules we will generate will be of the form {A} -> {B}.  One common application of these rules is in the domain of recommender systems, where customers who purchased item A are recommended item B.

Here are 3 key metrics to consider when evaluating association rules:

1. <b>support</b>  
    This is the percentage of orders that contain the item set. In the example above, there are 5 orders in total 
    and {apple,egg} occurs in 3 of them, so: 
       
                    support{apple,egg} = 3/5 or 60%
        
    The minimum support threshold required by apriori can be set based on knowledge of your domain.  In this 
    grocery dataset for example, since there could be thousands of distinct items and an order can contain 
    only a small fraction of these items, setting the support threshold to 0.01% may be reasonable.<br><br><br>
    
2. <b>confidence</b>  
    Given two items, A and B, confidence measures the percentage of times that item B is purchased, given that 
    item A was purchased. This is expressed as:
       
                    confidence{A->B} = support{A,B} / support{A}   
                    
    Confidence values range from 0 to 1, where 0 indicates that B is never purchased when A is purchased, and 1 
    indicates that B is always purchased whenever A is purchased.  Note that the confidence measure is directional.  
    This means that we can also compute the percentage of times that item A is purchased, given that item B was 
    purchased:
       
                    confidence{B->A} = support{A,B} / support{B}    
                    
    In our example, the percentage of times that egg is purchased, given that apple was purchased is:  
       
                    confidence{apple->egg} = support{apple,egg} / support{apple}
                                           = (3/5) / (4/5)
                                           = 0.75 or 75%

    A confidence value of 0.75 implies that out of all orders that contain apple, 75% of them also contain egg.  Now, 
    we look at the confidence measure in the opposite direction (ie: egg->apple): 
       
                    confidence{egg->apple} = support{apple,egg} / support{egg}
                                           = (3/5) / (3/5)
                                           = 1 or 100%  
                                           
    Here we see that all of the orders that contain egg also contain apple.  But, does this mean that there is a 
    relationship between these two items, or are they occurring together in the same orders simply by chance?  To 
    answer this question, we look at another measure which takes into account the popularity of <i>both</i> items.<br><br><br>  
    
3. <b>lift</b>  
    Given two items, A and B, lift indicates whether there is a relationship between A and B, or whether the two items 
    are occurring together in the same orders simply by chance (ie: at random).  Unlike the confidence metric whose 
    value may vary depending on direction (eg: confidence{A->B} may be different from confidence{B->A}), 
    lift has no direction. This means that the lift{A,B} is always equal to the lift{B,A}: 
       
                    lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B})   
    
    In our example, we compute lift as follows:
    
         lift{apple,egg} = lift{egg,apple} = support{apple,egg} / (support{apple} * support{egg})
                         = (3/5) / (4/5 * 3/5) 
                         = 1.25    
               
    One way to understand lift is to think of the denominator as the likelihood that A and B will appear in the same 
    order if there were <i>no</i> relationship between them. In the example above, apple occurs in 80% of the
    orders and egg occurs in 60% of the orders, so if there were no relationship between them, we would 
    <i>expect</i> both of them to show up together in the same order 48% of the time (ie: 80% * 60%).  The numerator, 
    on the other hand, represents how often apple and egg <i>actually</i> appear together in the same order.  In 
    this example, that is 60% of the time.  Dividing the numerator by the denominator tells us how many times 
    more often apple and egg actually appear in the same order than they would if there were no relationship 
    between them (ie: if they occurred together purely at random).  
    
    In summary, lift can take on the following values:
    
        * lift = 1 implies no relationship between A and B. 
          (ie: A and B occur together only by chance)
      
        * lift > 1 implies that there is a positive relationship between A and B.
          (ie:  A and B occur together more often than random)
    
        * lift < 1 implies that there is a negative relationship between A and B.
          (ie:  A and B occur together less often than random)
        
    In our example, apple and egg occur together 1.25 times as often as would be expected by chance, so we 
    conclude that there exists a positive relationship between them.
   
Armed with knowledge of apriori and association rules mining, let's dive into the data and code to see what relationships we unravel!

Intuitively, what do these terms mean?
- **Support**: "How popular is this combination of items overall?"
- **Confidence**: "If someone buys one item, how confident can we be they'll also buy the other?"
- **Lift**: "Does buying one item make it more likely that the other will be bought, compared to random chance?"
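The three metrics can be checked numerically for the apple and egg example with a few lines of Python. The counts below are taken from the five example orders, and the variable names are illustrative only:

```python
n_orders = 5
count_apple, count_egg, count_both = 4, 3, 3

# Support: share of orders containing the item (or the pair)
support_apple = count_apple / n_orders
support_egg = count_egg / n_orders
support_both = count_both / n_orders

# Confidence is directional; lift is symmetric
confidence_apple_to_egg = support_both / support_apple
confidence_egg_to_apple = support_both / support_egg
lift = support_both / (support_apple * support_egg)

print(f"support{{apple,egg}}     = {support_both:.2f}")             # 0.60
print(f"confidence{{apple->egg}} = {confidence_apple_to_egg:.2f}")  # 0.75
print(f"confidence{{egg->apple}} = {confidence_egg_to_apple:.2f}")  # 1.00
print(f"lift{{apple,egg}}        = {lift:.2f}")                     # 1.25
```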

### Input Dataset
Instacart, an online grocer, has graciously made some of their datasets accessible to the public.  The order and product datasets that we will be using can be downloaded from the link below, along with the data dictionary:

“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on September 1, 2017.<br><br>

In [3]:
import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter
from IPython.display import display

In [4]:
# Function that returns the size of an object in MB
def size(obj):
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))

### Part 1:  Data Preparation

#### A. Load order  data

In [20]:
orders = pd.read_csv('instacart-market-basket-analysis/data_out/compiled_df.csv')
print('orders -- dimensions: {0};   size: {1}'.format(orders.shape, size(orders)))
display(orders.head())

orders -- dimensions: (33819106, 12);   size: 5072.87 MB


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,aisle_id,department_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2,33120,1,1,86,16,202279,prior,3,5,9,8
1,2,28985,2,1,83,4,202279,prior,3,5,9,8
2,2,9327,3,0,104,13,202279,prior,3,5,9,8
3,2,45918,4,1,19,13,202279,prior,3,5,9,8
4,2,30035,5,0,17,13,202279,prior,3,5,9,8


#### B. Convert order data into format expected by the association rules function

In [21]:
# Convert from DataFrame to a Series, with order_id as index and item_id as value
orders = orders.set_index('order_id')['product_id'].rename('item_id')
display(orders.head(10))
type(orders)

order_id
2    33120
2    28985
2     9327
2    45918
2    30035
2    17794
2    40141
2     1819
2    43668
3    33754
Name: item_id, dtype: int64

pandas.core.series.Series

#### C. Display summary statistics for order data

In [22]:
print('dimensions: {0};   size: {1};   unique_orders: {2};   unique_items: {3}'
      .format(orders.shape, size(orders), len(orders.index.unique()), len(orders.value_counts())))

dimensions: (33819106,);   size: 541.11 MB;   unique_orders: 3346083;   unique_items: 49685


### Part 2: Association Rules Function

#### A. Helper functions to the main association rules function

In [24]:
# Returns frequency counts for items and item pairs
def freq(iterable):
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else: 
        return pd.Series(Counter(iterable)).rename("freq")

    
# Returns number of unique orders
def order_count(order_item):
    return len(set(order_item.index))


# Returns generator that yields item pairs, one at a time
def get_item_pairs(order_item):
    order_item = order_item.reset_index().to_numpy()
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
              
        for item_pair in combinations(item_list, 2):
            yield item_pair
            

# Returns frequency and support associated with item
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


# Returns name associated with item
def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]               
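
Two properties of `get_item_pairs` are worth keeping in mind. `itertools.groupby` only groups adjacent rows, so all rows of an order must be contiguous. In addition, `combinations` preserves input order, so (A, B) and (B, A) are tallied as different pairs when the two items appear in a different sequence in different orders. The sketch below shows one possible variant that handles both cases. It is not the function used in the rest of this notebook, and the name `get_sorted_item_pairs` is hypothetical:

```python
from collections import Counter
from itertools import combinations, groupby

import pandas as pd

def get_sorted_item_pairs(order_item):
    """Yield each unordered item pair of every order as a sorted tuple."""
    # Sort by order_id so groupby sees every order as one contiguous run
    rows = order_item.sort_index().reset_index().to_numpy()
    for _, order_rows in groupby(rows, key=lambda row: row[0]):
        # Deduplicate and sort items so (A, B) and (B, A) share one key
        items = sorted({row[1] for row in order_rows})
        yield from combinations(items, 2)

# Toy data: order 1 is split across non-adjacent rows and lists egg before apple
toy = pd.Series(
    ["egg", "milk", "apple", "apple", "egg"],
    index=pd.Index([1, 2, 1, 2, 2], name="order_id"),
    name="item_id",
)
pair_counts = Counter(get_sorted_item_pairs(toy))
print(pair_counts[("apple", "egg")])   # 2
```

Sorting costs extra time and memory on a dataset of this size, so whether the variant is worthwhile depends on how the input is already ordered.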

#### B. Association rules function

In [25]:
def association_rules(order_item, min_support):

    print("Starting order_item: {:22d}".format(len(order_item)))


    # Calculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Filter from order_item items below min support 
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Filter from order_item orders with less than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)


    # Calculate item pair frequency and support
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100

    print("Item pairs: {:31d}".format(len(item_pairs)))


    # Filter from item_pairs those below min support
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))


    # Create table of association rules and compute relevant metrics
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    print(item_pairs)
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
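    # Supports are stored as percentages: the confidence ratios are unaffected,
    # but 'lift' is scaled by 1/100 (independence corresponds to 0.01, not 1)
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB'])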
    
    
    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)

### Part 3:  Association Rules Mining

In [17]:
len(orders)

33819106

In [26]:
orders.head()

order_id
2    33120
2    28985
2     9327
2    45918
2    30035
Name: item_id, dtype: int64

In [31]:
%%time
rules = association_rules(orders, 0.01)  

Starting order_item:               33819106
Items with support >= 0.01:           10949
Remaining order_item:              31116451
Remaining orders with 2+ items:     3136021
Remaining order_item:              30928016
Item pairs:                        31436613
Item pairs with support >= 0.01:      48923

       item_A  item_B  freqAB  supportAB
0       33120   28985     458   0.014604
1       33120   17794     396   0.012627
2       28985   17794    2963   0.094483
3       33754   24838     378   0.012053
4       33754   21903    1588   0.050637
...       ...     ...     ...        ...
48918   21903    9124     341   0.010874
48919   24852   22034     357   0.011384
48920   13176   44097     316   0.010076
48921   34262   45007     331   0.010555
48922   24852   31372     319   0.010172

[48923 rows x 4 columns]
CPU times: user 5min, sys: 11.3 s, total: 5min 11s
Wall time: 5min 11s


In [32]:
# Replace item ID with item name and display association rules
item_name   = pd.read_csv('instacart-market-basket-analysis/data_out/products.csv')
item_name   = item_name.rename(columns={'product_id':'item_id', 'product_name':'item_name'})
rules_final = merge_item_name(rules, item_name).sort_values('lift', ascending=False)
display(rules_final)

Unnamed: 0,itemA,itemB,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
0,Organic Strawberry Chia Lowfat 2% Cottage Cheese,Organic Cottage Cheese Blueberry Acai Chia,317,0.010108,1215,0.038743,878,0.027997,0.260905,0.361048,9.318960
1,Grain Free Chicken Formula Cat Food,Grain Free Turkey Formula Cat Food,332,0.010587,1877,0.059853,912,0.029081,0.176878,0.364035,6.082161
2,Organic Fruit Yogurt Smoothie Mixed Berry,Apple Blueberry Fruit Yogurt Smoothie,364,0.011607,1572,0.050127,1296,0.041326,0.231552,0.280864,5.603028
3,Nonfat Strawberry With Fruit On The Bottom Gre...,"0% Greek, Blueberry on the Bottom Yogurt",421,0.013425,1717,0.054751,1430,0.045599,0.245195,0.294406,5.377182
4,Organic Grapefruit Ginger Sparkling Yerba Mate,Cranberry Pomegranate Sparkling Yerba Mate,359,0.011448,1785,0.056919,1174,0.037436,0.201120,0.305792,5.372385
...,...,...,...,...,...,...,...,...,...,...,...
48918,Organic Strawberries,Strawberries,667,0.021269,274267,8.745700,148254,4.727456,0.002432,0.004499,0.000514
48919,Organic Hass Avocado,Organic Avocado,479,0.015274,220055,7.017013,183626,5.855382,0.002177,0.002609,0.000372
48920,Organic Avocado,Organic Hass Avocado,462,0.014732,183626,5.855382,220055,7.017013,0.002516,0.002099,0.000359
48921,Banana,Bag of Organic Bananas,665,0.021205,488729,15.584366,391725,12.491147,0.001361,0.001698,0.000109


In [33]:
rules_final[rules_final["itemA"]=="Banana"]

Unnamed: 0,itemA,itemB,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
2368,Banana,Blueberry,657,0.020950,488729,15.584366,1325,0.042251,0.001344,0.495849,0.031817
3559,Banana,Honey Turkey,575,0.018335,488729,15.584366,1603,0.051116,0.001177,0.358702,0.023017
3864,Banana,Sweet Grape Tomatoes,365,0.011639,488729,15.584366,1066,0.033992,0.000747,0.342402,0.021971
3972,Banana,Peaches Large,898,0.028635,488729,15.584366,2664,0.084948,0.001837,0.337087,0.021630
4386,Banana,Reduced Fat Medium Cheddar Cheese,368,0.011735,488729,15.584366,1156,0.036862,0.000753,0.318339,0.020427
...,...,...,...,...,...,...,...,...,...,...,...
48898,Banana,Organic Extra Virgin Olive Oil,381,0.012149,488729,15.584366,13390,0.426974,0.000780,0.028454,0.001826
48907,Banana,Soda,909,0.028986,488729,15.584366,34366,1.095847,0.001860,0.026451,0.001697
48910,Banana,Organic Tortilla Chips,334,0.010650,488729,15.584366,14077,0.448881,0.000683,0.023727,0.001522
48916,Banana,Clementines,435,0.013871,488729,15.584366,31103,0.991798,0.000890,0.013986,0.000897


In [34]:
rules_final.head(20)

Unnamed: 0,itemA,itemB,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
0,Organic Strawberry Chia Lowfat 2% Cottage Cheese,Organic Cottage Cheese Blueberry Acai Chia,317,0.010108,1215,0.038743,878,0.027997,0.260905,0.361048,9.31896
1,Grain Free Chicken Formula Cat Food,Grain Free Turkey Formula Cat Food,332,0.010587,1877,0.059853,912,0.029081,0.176878,0.364035,6.082161
2,Organic Fruit Yogurt Smoothie Mixed Berry,Apple Blueberry Fruit Yogurt Smoothie,364,0.011607,1572,0.050127,1296,0.041326,0.231552,0.280864,5.603028
3,Nonfat Strawberry With Fruit On The Bottom Gre...,"0% Greek, Blueberry on the Bottom Yogurt",421,0.013425,1717,0.054751,1430,0.045599,0.245195,0.294406,5.377182
4,Organic Grapefruit Ginger Sparkling Yerba Mate,Cranberry Pomegranate Sparkling Yerba Mate,359,0.011448,1785,0.056919,1174,0.037436,0.20112,0.305792,5.372385
5,Baby Food Pouch - Roasted Carrot Spinach & Beans,"Baby Food Pouch - Butternut Squash, Carrot & C...",347,0.011065,1559,0.049713,1333,0.042506,0.222579,0.260315,5.236392
6,Uncured Cracked Pepper Beef,Chipotle Beef & Pork Realstick,422,0.013457,1888,0.060204,1406,0.044834,0.223517,0.300142,4.985447
7,Unsweetened Whole Milk Mixed Berry Greek Yogurt,Unsweetened Whole Milk Blueberry Greek Yogurt,460,0.014668,1705,0.054368,1707,0.054432,0.269795,0.269479,4.956543
8,Grain Free Chicken Formula Cat Food,Grain Free Turkey & Salmon Formula Cat Food,403,0.012851,1877,0.059853,1595,0.050861,0.214704,0.252665,4.221425
9,Raspberry Essence Water,Unsweetened Pomegranate Essence Water,376,0.01199,2095,0.066804,1341,0.042761,0.179475,0.280388,4.197145
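
When reading the lift column, note that `association_rules` stores support as a percentage, so the reported lift equals the lift defined earlier divided by 100. As a rough check, the sketch below recomputes lift for the first row from its raw counts. The helper name `lift_from_counts` is hypothetical, and the order count is the "Remaining orders with 2+ items" figure printed by the function:

```python
def lift_from_counts(freq_ab, freq_a, freq_b, n_orders):
    """Lift computed from raw counts: P(A,B) / (P(A) * P(B))."""
    return (freq_ab / n_orders) / ((freq_a / n_orders) * (freq_b / n_orders))

# Counts from the first row of the rules table above
lift = lift_from_counts(freq_ab=317, freq_a=1215, freq_b=878, n_orders=3136021)
print(round(lift, 1))   # 931.9, i.e. 100 times the 9.31896 shown in the table
```

The ranking of the rules is unchanged by this scaling. Only the comparison against the break-even value of 1 is affected.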


## Generating multi-step recommendations

In [38]:
def multi_step_recommendations(product, df, top_n=3, max_depth=3, threshold=0.1):
    recommendations = []
    visited_products = set()  # Tracks products visited during recursion
    recommended_products = set()  # Tracks products already recommended

    def recurse(current_product, depth):
        if depth == max_depth or current_product in visited_products:
            return
        visited_products.add(current_product)

        # Get recommendations for the current product
        filtered = df[df['itemA'] == current_product]
        filtered = filtered[filtered['confidenceAtoB'] >= threshold]
        top_recs = filtered.sort_values(by='confidenceAtoB', ascending=False).head(top_n)

        for _, row in top_recs.iterrows():
            # Check if the recommended product is already in recommended_products
            if row['itemB'] not in recommended_products:
                recommendations.append((current_product, row['itemB'], row['confidenceAtoB']))
                recommended_products.add(row['itemB'])  # Mark as recommended
                recurse(row['itemB'], depth + 1)

    recurse(product, 0)
    return recommendations

In [39]:
input_product = "Grain Free Chicken Formula Cat Food"
result = multi_step_recommendations(input_product, rules_final, top_n=2, max_depth=2, threshold=0.01)

In [40]:
result = multi_step_recommendations("Banana", rules_final, top_n=3, max_depth=3, threshold=0.08)
for rec in result:
    print(f"From {rec[0]} -> Recommend {rec[1]} with confidence {rec[2]:.2f}")

From Banana -> Recommend Organic Strawberries with confidence 0.08
From Organic Strawberries -> Recommend Bag of Organic Bananas with confidence 0.08
From Bag of Organic Bananas -> Recommend Organic Hass Avocado with confidence 0.10
From Bag of Organic Bananas -> Recommend Organic Baby Spinach with confidence 0.09


In [41]:
# Extract unique product names
unique_products = set()

# Loop through the result tuples
for item in result:
    unique_products.add(item[0])  # Add the first product
    unique_products.add(item[1])  # Add the second product

# Convert the set back to a sorted list (optional)
unique_products_list = sorted(unique_products)

# Print the unique products
print(unique_products_list)

['Bag of Organic Bananas', 'Banana', 'Organic Baby Spinach', 'Organic Hass Avocado', 'Organic Strawberries']


## Example Usage

In [42]:
# ----------------------------
# Helper functions (unchanged)
# ----------------------------

def freq(iterable):
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else:
        return pd.Series(Counter(iterable)).rename("freq")

def order_count(order_item):
    return len(set(order_item.index))

def get_item_pairs(order_item):
    order_item = order_item.reset_index().to_numpy()
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
        for item_pair in combinations(item_list, 2):
            yield item_pair

def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))

# --------------------------------------------------------
# Step-by-step association rules function for visualization
# --------------------------------------------------------

def association_rules_visualize(order_item, min_support):
    print("=" * 80)
    print("Step 1: Calculate initial item frequency and support")
    # Calculate frequency and support for each item
    item_stats = freq(order_item).to_frame("freq")
    item_stats['support'] = item_stats['freq'] / order_count(order_item) * 100
    print(item_stats, "\n")
    
    print("=" * 80)
    print("Step 2: Filter items below min support threshold")
    # Filter out items with support below the minimum threshold
    qualifying_items = item_stats[item_stats['support'] >= min_support].index
    order_item_filtered = order_item[order_item.isin(qualifying_items)]
    print("Qualifying items:")
    print(list(qualifying_items))
    print("Filtered order_item:")
    print(order_item_filtered, "\n")
    
    print("=" * 80)
    print("Step 3: Filter orders with fewer than 2 items")
    # Identify orders that have 2 or more items
    order_size = freq(order_item_filtered.index)
    qualifying_orders = order_size[order_size >= 2].index
    order_item_filtered = order_item_filtered[order_item_filtered.index.isin(qualifying_orders)]
    print("Qualifying orders:")
    print(list(qualifying_orders))
    print("Filtered order_item (orders with 2+ items):")
    print(order_item_filtered, "\n")
    
    print("=" * 80)
    print("Step 4: Recalculate item frequency and support after filtering")
    # Recalculate stats with the filtered data
    item_stats_filtered = freq(order_item_filtered).to_frame("freq")
    item_stats_filtered['support'] = item_stats_filtered['freq'] / order_count(order_item_filtered) * 100
    print(item_stats_filtered, "\n")
    
    print("=" * 80)
    print("Step 5: Generate item pairs from qualifying orders")
    # Generate all possible item pairs from the filtered orders
    item_pair_gen = get_item_pairs(order_item_filtered)
    
    print("Step 6: Calculate item pair frequency and support")
    # Calculate frequency and support for each item pair
    item_pairs = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100
    print(item_pairs, "\n")
    
    print("=" * 80)
    print("Step 7: Filter item pairs below min support threshold")
    # Keep only those pairs that meet the minimum support threshold
    item_pairs_filtered = item_pairs[item_pairs['supportAB'] >= min_support]
    print(item_pairs_filtered, "\n")
    
    print("=" * 80)
    print("Step 8: Merge item statistics and compute association metrics")
    # Prepare the table with association metrics
    item_pairs_filtered = item_pairs_filtered.reset_index().rename(
        columns={'level_0': 'item_A', 'level_1': 'item_B'}
    )
    item_pairs_with_stats = merge_item_stats(item_pairs_filtered, item_stats_filtered)
    # Calculate confidence and lift
    item_pairs_with_stats['confidenceAtoB'] = item_pairs_with_stats['supportAB'] / item_pairs_with_stats['supportA']
    item_pairs_with_stats['confidenceBtoA'] = item_pairs_with_stats['supportAB'] / item_pairs_with_stats['supportB']
    item_pairs_with_stats['lift'] = item_pairs_with_stats['supportAB'] / (
        item_pairs_with_stats['supportA'] * item_pairs_with_stats['supportB']
    )
    print(item_pairs_with_stats, "\n")
    
    print("=" * 80)
    print("Final association rules sorted by lift (descending)")
    final_rules = item_pairs_with_stats.sort_values('lift', ascending=False)
    print(final_rules)
    
    return final_rules

# --------------------------
# Example usage on orders_df
# --------------------------

# Example orders DataFrame: each row is an item in an order
data = {
    'order_id': [1, 1, 2, 2, 3, 3, 4, 4, 4],
    'item_id':  ['apple', 'banana', 'apple', 'orange', 'banana', 'orange', 'apple', 'banana', 'orange']
}
orders_df = pd.DataFrame(data)

# Prepare the order_item Series with order_id as the index
order_item = orders_df.set_index('order_id')['item_id']

# Run the visualized association rules analysis with a minimum support threshold of 30%
final_association_rules = association_rules_visualize(order_item, min_support=30.0)

Step 1: Calculate initial item frequency and support
         freq  support
item_id               
apple       3     75.0
banana      3     75.0
orange      3     75.0 

Step 2: Filter items below min support threshold
Qualifying items:
['apple', 'banana', 'orange']
Filtered order_item:
order_id
1     apple
1    banana
2     apple
2    orange
3    banana
3    orange
4     apple
4    banana
4    orange
Name: item_id, dtype: object 

Step 3: Filter orders with fewer than 2 items
Qualifying orders:
[1, 2, 3, 4]
Filtered order_item (orders with 2+ items):
order_id
1     apple
1    banana
2     apple
2    orange
3    banana
3    orange
4     apple
4    banana
4    orange
Name: item_id, dtype: object 

Step 4: Recalculate item frequency and support after filtering
         freq  support
item_id               
apple       3     75.0
banana      3     75.0
orange      3     75.0 

Step 5: Generate item pairs from qualifying orders
Step 6: Calculate item pair frequency and support
             

### Part 4:  Conclusion

From the output above, we see that the top associations are not surprising: one flavor of an item tends to be purchased with another flavor from the same item family (eg: Strawberry Chia Cottage Cheese with Blueberry Acai Cottage Cheese, Chicken Cat Food with Turkey Cat Food, etc).  As mentioned, one common application of association rules mining is in the domain of recommender systems.  Once item pairs have been identified as having a positive relationship, recommendations can be made to customers in order to increase sales, and hopefully also introduce customers to items they would never have tried or even imagined existed!