# Fall 2024 Data Science Track: Week 5 - Unsupervised Learning

## Packages, Packages, Packages!

Import *all* the things here! You need: `matplotlib`, `networkx`, `numpy`, and `pandas`―and also `ast.literal_eval` to correctly deserialize two columns in the `rules.tsv.xz` file.

If you have more stuff you want to use, add it here too. 🙂

In [2]:
# Don’t worry about this. This is needed to interpret the Python code that is embedded in the data set. You only need it literally in the very next code cell and nowhere else. 
from ast import literal_eval

# The rest is just the stuff from the lecture.
from matplotlib import pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd

# Instacart Association Rules

## Introduction

With the packages out of the way, now you will be working with the Instacart association rules data set, mined from the [Instacart Market Basket Analysis data set](https://www.kaggle.com/c/instacart-market-basket-analysis/data) on Kaggle. [The script](https://github.com/LiKenun/shopping-assistant/blob/main/api/preprocess_instacart_market_basket_analysis_data.py) that does it and the instructions to run it can be found in my [Shopping Assistant Project](https://github.com/LiKenun/shopping-assistant) repository.

## Load the Data

This code has already been pre-written, simply because there are a few quirks which require converters to ensure the correct deserialization of some columns.

In [3]:
rules_data_path = 'data/rules.tsv.xz'                       # You do not need to decompress this yourself. Pandas understands how to read compressed data.

df_rules = pd.read_csv(rules_data_path,
                       sep='\t',
                       quoting=3,                           # This disables interpretation of quotes by Pandas itself, because both single and double quotes will be resolved by literal_eval.
                       converters={
                           'consequent_item': literal_eval,
                           'antecedent_items': literal_eval # This reads something like ["Grandma's 8\" Chip Cookies", '6" Apple Pie'] into a list, so you will get a column where each individual cell is a list.
                       },
                       low_memory=True)                     # For Chris
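
To see why the converters are needed, here is a minimal sketch of what `literal_eval` does to a single raw cell. The string is the example from the comment above, assumed to be representative of the file's contents; `raw_cell` and `items` are names made up for this illustration.

```python
from ast import literal_eval

# A raw cell shaped like the example in the comment: a Python list literal mixing both quote styles.
raw_cell = r'''["Grandma's 8\" Chip Cookies", '6" Apple Pie']'''

# literal_eval only evaluates literals (no arbitrary code runs) and returns a real list.
items = literal_eval(raw_cell)

print(type(items).__name__, len(items))
for item in items:
    print(item)
```

Because the quotes are resolved here, Pandas itself is told not to interpret them (`quoting=3`).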

But just *how* many rules were just loaded‽

In [4]:
# Show the list of column names and the number of rules

print(df_rules.head())

print("\n")
df_rules.info(verbose=True)

print("\n")
print('Columns:', ', '.join(df_rules.columns))
print('Rules: %d' % len(df_rules))


                                     consequent_item  transaction_count  \
0  Total 2% with Raspberry Pomegranate Lowfat Gre...            3346083   
1  Total 2% Lowfat Greek Strained Yogurt With Blu...            3346083   
2   Total 0% with Honey Nonfat Greek Strained Yogurt            3346083   
3                          Total 0% Raspberry Yogurt            3346083   
4                                Pineapple Yogurt 2%            3346083   

   item_set_count  antecedent_count  consequent_count  \
0             101               123               128   
1             101               128               123   
2             101               123               128   
3             101               123               128   
4             101               128               123   

                                    antecedent_items  
0  [Fat Free Blueberry Yogurt, Pineapple Yogurt 2...  
1  [Fat Free Strawberry Yogurt, Total 0% Raspberr...  
2  [Fat Free Blueberry Yogurt, Pineapple 

In [4]:
print(len(df_rules))

1048575


## Metrics

Compute the support, confidence, and lift of each rule.

* The rule’s *support* tells you how frequently the set of items appears in the dataset. It’s important to prune infrequent sets from further consideration.
    * The simple definition: $$P(A \cap B)$$
    * `= item_set_count / transaction_count`
* The rule’s *confidence* tells you how often the rule is true. Divide the support for the set of items by the support for just the antecedents. Rules which are not true very often are also pruned.
    * The simple definition: $$\frac{P(A \cap B)}{P(A)}$$
    * `= item_set_count / transaction_count / (antecedent_count / transaction_count)`
    * `= item_set_count / antecedent_count`
* The rule’s *lift* tells you how much more likely the consequent is, given the antecedents, compared to its baseline probability. Divide the support for the set of items by both the support of the antecedents and consequent. Equivalently, divide the confidence by the support of the consequent.
    * The simple definition: $$\frac{P(A \cap B)}{P(A) \cdot P(B)}$$
    * `= item_set_count / transaction_count / (antecedent_count / transaction_count * (consequent_count / transaction_count))`
    * `= item_set_count / antecedent_count / (consequent_count / transaction_count)`
    * `= item_set_count * transaction_count / (antecedent_count * consequent_count)`
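
As a quick sanity check of the formulas above, here is a worked example in plain Python using the counts of the first rule in the output shown earlier (101, 123, 128 out of 3,346,083 transactions). The variable `lift_via_confidence` is introduced only for this sketch, to show that the two ways of writing lift agree.

```python
# Counts taken from the first rule shown in this notebook's output.
transaction_count = 3346083   # all transactions
item_set_count = 101          # transactions containing antecedents and consequent together
antecedent_count = 123        # transactions containing the antecedents
consequent_count = 128        # transactions containing the consequent

support = item_set_count / transaction_count
confidence = item_set_count / antecedent_count
lift = item_set_count * transaction_count / (antecedent_count * consequent_count)

# Lift is also the confidence divided by the consequent's baseline probability.
lift_via_confidence = confidence / (consequent_count / transaction_count)

print(f'support={support:.6f} confidence={confidence:.6f} lift={lift:.6f}')
```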

In [5]:
# Add new columns support, confidence, and lift to df_rules. And show the first 50 rules.
df_rules = df_rules.pipe(lambda df: df.assign(support=df.item_set_count / df.transaction_count)) \
                   .pipe(lambda df: df.assign(confidence=df.item_set_count / df.antecedent_count)) \
                   .pipe(lambda df: df.assign(lift=df.item_set_count * df.transaction_count / (df.antecedent_count * df.consequent_count)))
df_rules.head(50)


| | consequent_item | transaction_count | item_set_count | antecedent_count | consequent_count | antecedent_items | support | confidence | lift |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Total 2% with Raspberry Pomegranate Lowfat Gre... | 3346083 | 101 | 123 | 128 | [Fat Free Blueberry Yogurt, Pineapple Yogurt 2... | 3e-05 | 0.821138 | 21465.598514 |
| 1 | Total 2% Lowfat Greek Strained Yogurt With Blu... | 3346083 | 101 | 128 | 123 | [Fat Free Strawberry Yogurt, Total 0% Raspberr... | 3e-05 | 0.789062 | 21465.598514 |
| 2 | Total 0% with Honey Nonfat Greek Strained Yogurt | 3346083 | 101 | 123 | 128 | [Fat Free Blueberry Yogurt, Pineapple Yogurt 2... | 3e-05 | 0.821138 | 21465.598514 |
| 3 | Total 0% Raspberry Yogurt | 3346083 | 101 | 123 | 128 | [Fat Free Blueberry Yogurt, Pineapple Yogurt 2... | 3e-05 | 0.821138 | 21465.598514 |
| 4 | Pineapple Yogurt 2% | 3346083 | 101 | 128 | 123 | [Fat Free Strawberry Yogurt, Total 0% Raspberr... | 3e-05 | 0.789062 | 21465.598514 |
| 5 | Fat Free Strawberry Yogurt | 3346083 | 101 | 123 | 128 | [Fat Free Blueberry Yogurt, Pineapple Yogurt 2... | 3e-05 | 0.821138 | 21465.598514 |
| 6 | Fat Free Blueberry Yogurt | 3346083 | 101 | 128 | 123 | [Fat Free Strawberry Yogurt, Total 0% Raspberr... | 3e-05 | 0.789062 | 21465.598514 |
| 7 | Total 2% with Raspberry Pomegranate Lowfat Gre... | 3346083 | 101 | 147 | 117 | [Blackberry Yogurt, Fat Free Strawberry Yogurt... | 3e-05 | 0.687075 | 19649.653061 |
| 8 | Total 2% Lowfat Greek Strained Yogurt With Blu... | 3346083 | 101 | 147 | 117 | [Blackberry Yogurt, Fat Free Strawberry Yogurt... | 3e-05 | 0.687075 | 19649.653061 |
| 9 | Total 2% Greek Strained Yogurt with Cherry 5.3 oz | 3346083 | 101 | 117 | 147 | [Pineapple Yogurt 2%, Total 0% Raspberry Yogur... | 3e-05 | 0.863248 | 19649.653061 |


The yogurts have got some insane lift (*over 9,000*). Why do you think that might be?

*(Write your answer here.)*

In [5]:
print("Maybe they have a deal on yogurt. Or because there are just so many different types of yogurt, it's being recorded more.")

Maybe they have a deal on yogurt. Or because there are just so many different types of yogurt, it's being recorded more.
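
One way to probe this question with numbers: the item set can appear at most as often as its rarer side, so lift can never exceed `transaction_count / max(antecedent_count, consequent_count)`. The sketch below compares that ceiling for the first yogurt rule (counts 123 and 128) with the ceiling for a rule between two much more common items (counts 41,052 and 58,005, taken from a bell pepper rule later in this notebook). `max_possible_lift` is a helper name made up for this illustration, not part of the assignment.

```python
def max_possible_lift(transaction_count, antecedent_count, consequent_count):
    """Upper bound on lift, reached when the rarer side never appears without the other."""
    return transaction_count / max(antecedent_count, consequent_count)

transaction_count = 3346083

# Rare items: the ceiling is in the tens of thousands.
rare_bound = max_possible_lift(transaction_count, 123, 128)

# Common items: the ceiling is only in the tens.
common_bound = max_possible_lift(transaction_count, 41052, 58005)

print(f'rare items:   lift <= {rare_bound:.1f}')
print(f'common items: lift <= {common_bound:.1f}')
```

This suggests that very high lift is possible only for rarely bought items that almost always show up together, which is worth checking against the counts in the table.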


In [7]:
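# Query the rule set if you need to find out more.

# antecedent_items holds lists, so each item is tested for the substring;
# `'Yogurt' in x` on a list would only match an item named exactly 'Yogurt'.
yogurt_rules = df_rules[(df_rules['consequent_item'].apply(lambda x: 'Yogurt' in x)) | 
                         (df_rules['antecedent_items'].apply(lambda items: any('Yogurt' in item for item in items)))]

# Display the rules related to yogurt
print(yogurt_rules[['antecedent_items', 'consequent_item', 'support', 'confidence', 'lift']])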

                                          antecedent_items  \
0        [Fat Free Blueberry Yogurt, Pineapple Yogurt 2...   
1        [Fat Free Strawberry Yogurt, Total 0% Raspberr...   
2        [Fat Free Blueberry Yogurt, Pineapple Yogurt 2...   
3        [Fat Free Blueberry Yogurt, Pineapple Yogurt 2...   
4        [Fat Free Strawberry Yogurt, Total 0% Raspberr...   
...                                                    ...   
1048121                                                 []   
1048135                                                 []   
1048248                                                 []   
1048340                                                 []   
1048452                                                 []   

                                           consequent_item   support  \
0        Total 2% with Raspberry Pomegranate Lowfat Gre...  0.000030   
1        Total 2% Lowfat Greek Strained Yogurt With Blu...  0.000030   
2         Total 0% with Honey Nonfat Gr

## Network Visualization for Consequents with Single Antecedents

Let’s now visualize a small subset of the 1,000,000+ rules. First, filter the rule set by the following criteria to whittle it down to something more manageable:

1. The rule must have exactly `1` antecedent item. (There should be 38,684 such rules.)
2. The lift must be between `5` and `20`. (There should be 1,596 such rules, including the prior criterion.)
3. Either the antecedent or consequent of the rule must contain `'Hummus'`, but not both. (This should get you down to 26 rules.)
    * Convert the antecedents `list`-typed column to a `str`-typed column (`antecedent_item`) since there will only be a single antecedent in the subset.
    * Replace any item containing `'Hummus'` with just `'Hummus'`. This will make the visualization more readable later.

Hint: your code may run more efficiently if you re-order certain processing steps.

Assign the subset to `df_rules_subset`.
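
The "either, but not both" criterion is an exclusive or. Below is a minimal sketch of that idea on a toy table; every item name in it is invented for illustration, and `toy`, `toy_subset`, and the two mask variables are hypothetical names, not part of the data set.

```python
import pandas as pd

# Toy single-antecedent rules; all item names here are invented for illustration.
toy = pd.DataFrame({
    'antecedent_item': ['Garlic Hummus', 'Pita Chips', 'Plain Hummus', 'Milk'],
    'consequent_item': ['Pita Chips', 'Classic Hummus', 'Spicy Hummus', 'Eggs'],
})

antecedent_has_hummus = toy.antecedent_item.str.contains('Hummus')
consequent_has_hummus = toy.consequent_item.str.contains('Hummus')

# Exclusive or on boolean Series: true when exactly one side mentions hummus.
toy_subset = toy[antecedent_has_hummus ^ consequent_has_hummus]

# Collapse every hummus product to the single label 'Hummus'.
toy_subset = toy_subset.assign(
    antecedent_item=lambda df: df.antecedent_item.where(~df.antecedent_item.str.contains('Hummus'), 'Hummus'),
    consequent_item=lambda df: df.consequent_item.where(~df.consequent_item.str.contains('Hummus'), 'Hummus'),
)
print(toy_subset)
```

The hummus-to-hummus row and the row without any hummus are both dropped, leaving two rules.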

In [10]:
# Define df_rules_subset.
# Criteria 1 and 2: exactly one antecedent and a lift between 5 and 20.
df_rules_subset = df_rules[(df_rules.antecedent_items.apply(len) == 1)
                         & (df_rules.lift.between(5, 20))]
print('There are %d rules.' % len(df_rules_subset))

# Criterion 3: flatten the single antecedent to a string, collapse every hummus product to 'Hummus',
# and keep the rules where exactly one side is 'Hummus'.
df_rules_subset = df_rules_subset \
    .assign(antecedent_item=lambda df: df.antecedent_items.apply(lambda items: items[0])) \
    .assign(antecedent_item=lambda df: np.where(df.antecedent_item.str.contains('Hummus'), 'Hummus', df.antecedent_item),
            consequent_item=lambda df: np.where(df.consequent_item.str.contains('Hummus'), 'Hummus', df.consequent_item)) \
    .pipe(lambda df: df[(df.antecedent_item == 'Hummus') != (df.consequent_item == 'Hummus')])

There are 1596 rules.




In [43]:
# Rules whose antecedent list contains an item named exactly 'Hummus'.
# `'Hummus' in x` tests list membership, so longer product names that merely contain 'Hummus' are not matched.
hummus_antecedents = df_rules[df_rules['antecedent_items'].apply(lambda x: 'Hummus' in x)]

# Display the filtered DataFrame
print(f"Rules with 'Hummus' in antecedents: {len(hummus_antecedents)}")
print(hummus_antecedents[['antecedent_items', 'consequent_item']].head())  # Display the first few results
hummus_antecedents_with_lift = hummus_antecedents[['antecedent_items', 'consequent_item', 'lift']]
print(f"\nLift values for rules with 'Hummus' as an antecedent:\n{hummus_antecedents_with_lift}")

# No lift condition was applied above, so this prints the same rules again.
df_hummus_final = hummus_antecedents[['antecedent_items', 'consequent_item', 'lift']]
print(f"\nFinal subset of rules with 'Hummus' (no lift condition):\n{df_hummus_final}")

# Rules whose consequent contains 'Hummus' anywhere in its name (substring match).
hummus_consequents = df_rules[df_rules['consequent_item'].str.contains('Hummus', na=False)]
df_hummus_final_consequents = hummus_consequents[['antecedent_items', 'consequent_item', 'lift']]
print(f"\nFinal subset of rules with 'Hummus' as a consequent:\n{df_hummus_final_consequents}")

Rules with 'Hummus' in antecedents: 3
        antecedent_items         consequent_item
946319          [Hummus]    Organic Hass Avocado
1006726         [Hummus]    Organic Strawberries
1022383         [Hummus]  Bag of Organic Bananas

Lift values for rules with 'Hummus' as an antecedent:
        antecedent_items         consequent_item      lift
946319          [Hummus]    Organic Hass Avocado  2.156840
1006726         [Hummus]    Organic Strawberries  1.549182
1022383         [Hummus]  Bag of Organic Bananas  1.302926

Final subset of rules with 'Hummus' (no lift condition):
        antecedent_items         consequent_item      lift
946319          [Hummus]    Organic Hass Avocado  2.156840
1006726         [Hummus]    Organic Strawberries  1.549182
1022383         [Hummus]  Bag of Organic Bananas  1.302926

Final subset of rules with 'Hummus' as a consequent:
                                          antecedent_items  \
234323   [Frozen Broccoli Florets, Organic Extra Firm T...   
239

Build a network `graph_rules_subset` from the association rules subset.

In [11]:
# Define graph_rules_subset, add the graph’s edges, and plot it. You may need a large figure size, smaller node size, and smaller font size.
# This needs the antecedent_item column that the filtering task above asks for; without it, this cell raises an AttributeError.

graph_rules_subset = nx.MultiDiGraph()
graph_rules_subset.add_edges_from(zip(df_rules_subset.antecedent_item, df_rules_subset.consequent_item))
plt.figure(1, figsize=(20, 20))
nx.draw_networkx(graph_rules_subset, arrows=True, node_size=100, font_size=6, node_color='tab:green', font_color='blue', connectionstyle='arc3, rad=0.1')

# Then render the graph.
plt.show()

What can you tell about people who buy hummus?

*(Write your answer here.)*

## Make a Prediction

Given that the basket of items contains the following items, use the full set of association rules to predict the next 20 most likely items (consequents) that the person will add to the basket in descending order of lift:

* `'Orange Bell Pepper'`
* `'Organic Red Bell Pepper'`

Hint: a single item in the basket may be a better predictor of some consequents than both items considered together. You must consider both or either, but not neither.

In [12]:
basket = {'Orange Bell Pepper', 'Organic Red Bell Pepper'}

df_rules[(df_rules.antecedent_items.apply(len) >= 1)
       & (df_rules.antecedent_items.apply(basket.issuperset))] \
        .sort_values('lift', ascending=False) \
        .head(20)

| | consequent_item | transaction_count | item_set_count | antecedent_count | consequent_count | antecedent_items | support | confidence | lift |
|---|---|---|---|---|---|---|---|---|---|
| 364989 | Organic Bell Pepper | 3346083 | 582 | 2721 | 24331 | [Orange Bell Pepper, Organic Red Bell Pepper] | 0.000174 | 0.213892 | 29.415159 |
| 370144 | Yellow Bell Pepper | 3346083 | 499 | 2721 | 26625 | [Orange Bell Pepper, Organic Red Bell Pepper] | 0.000149 | 0.183388 | 23.047249 |
| 370169 | Yellow Bell Pepper | 3346083 | 7520 | 41052 | 26625 | [Orange Bell Pepper] | 0.002247 | 0.183182 | 23.021341 |
| 386994 | Organic Bell Pepper | 3346083 | 6024 | 59878 | 24331 | [Organic Red Bell Pepper] | 0.0018 | 0.100605 | 13.835486 |
| 404185 | Green Bell Pepper | 3346083 | 494 | 2721 | 58005 | [Orange Bell Pepper, Organic Red Bell Pepper] | 0.000148 | 0.181551 | 10.472966 |
| 449846 | Red Peppers | 3346083 | 5529 | 41052 | 58185 | [Orange Bell Pepper] | 0.001652 | 0.134683 | 7.745295 |
| 480595 | Green Bell Pepper | 3346083 | 7086 | 59878 | 58005 | [Organic Red Bell Pepper] | 0.002118 | 0.118341 | 6.826611 |
| 531201 | Green Bell Pepper | 3346083 | 4144 | 41052 | 58005 | [Orange Bell Pepper] | 0.001238 | 0.100945 | 5.823133 |
| 603392 | Organic Red Onion | 3346083 | 278 | 2721 | 70804 | [Orange Bell Pepper, Organic Red Bell Pepper] | 8.3e-05 | 0.102168 | 4.82831 |
| 657541 | Organic Cucumber | 3346083 | 6480 | 59878 | 85005 | [Organic Red Bell Pepper] | 0.001937 | 0.10822 | 4.259905 |
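
The ranking above lists some consequents more than once (for example, Yellow Bell Pepper and Green Bell Pepper), because several applicable rules point to the same item. If the goal is 20 distinct items, one possible refinement is to keep only the highest-lift rule per consequent and to skip items already in the basket. The sketch below tries this on four rows copied from the output above; `predict` and `toy_rules` are hypothetical names introduced for this illustration.

```python
import pandas as pd

basket = {'Orange Bell Pepper', 'Organic Red Bell Pepper'}

# A few rows copied from the output above (only the columns needed here).
toy_rules = pd.DataFrame({
    'antecedent_items': [['Orange Bell Pepper', 'Organic Red Bell Pepper'],
                         ['Orange Bell Pepper', 'Organic Red Bell Pepper'],
                         ['Orange Bell Pepper'],
                         ['Organic Red Bell Pepper']],
    'consequent_item': ['Organic Bell Pepper', 'Yellow Bell Pepper',
                        'Yellow Bell Pepper', 'Organic Bell Pepper'],
    'lift': [29.415159, 23.047249, 23.021341, 13.835486],
})

def predict(rules, basket, n=20):
    """Rank consequents by their best lift among rules whose antecedents all lie in the basket."""
    applicable = rules[rules.antecedent_items.apply(
        lambda items: 0 < len(items) and basket.issuperset(items))]
    return (applicable[~applicable.consequent_item.isin(basket)]  # skip items already in the basket
            .sort_values('lift', ascending=False)
            .drop_duplicates('consequent_item')                    # keep the highest-lift rule per item
            .head(n))

predictions = predict(toy_rules, basket)
print(predictions[['consequent_item', 'lift']])
```

On the full `df_rules` the same function could be called as `predict(df_rules, basket)`, assuming the columns are named as in this notebook.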


## Bonus: Other Interesting Findings

Find and share something else interesting about these association rules. It can be a graph, table, or some other format that illustrates your point.