# CS 634 - Midterm Project
**Name:** Srushti Thakre  
**ID:** 31667303  
**Course:** Data Mining

## Environment
Kernel: project `.venv`. Print Python version and interpreter path for reproducibility.

In [None]:
import sys, platform
print("Python:", platform.python_version())
print("Interpreter:", sys.executable)

## Imports and project path
Add `../src` to `sys.path`, import helpers and mining functions.

In [2]:
from pathlib import Path
import sys

# notebook is in /notebooks -> add ../src to path
root = Path.cwd().parent
sys.path.append(str(root / "src"))

from io_utils import load_transactions
from bruteforce import mine_frequent_itemsets
from rules import generate_rules

# Library wrappers (Apriori & FP-Growth)
import importlib, apriori_fp
importlib.reload(apriori_fp)
from apriori_fp import mine_with_apriori, mine_with_fpgrowth, print_rules

## Datasets

Five deterministic CSVs (one transaction per line): Amazon, Best Buy, KMart, Nike, Walmart.
- data/amazon.csv
- data/bestbuy.csv
- data/kmart.csv
- data/nike.csv
- data/walmart.csv
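
The project's `load_transactions` helper is not shown in this notebook. As a rough illustration of the "one transaction per line" layout, the sketch below parses comma-separated lines into item sets. The function name `parse_transactions`, the comma delimiter, and the sample rows are assumptions made for this example, not the project's actual loader or data.

```python
import io

def parse_transactions(lines):
    """Turn text lines into a list of item sets (one transaction per line)."""
    transactions = []
    for line in lines:
        # Assumption: items are separated by commas and contain no quoted commas.
        items = {cell.strip() for cell in line.split(",") if cell.strip()}
        if items:  # skip blank lines
            transactions.append(items)
    return transactions

# Made-up sample rows, not taken from the project's data files.
sample = io.StringIO("milk, bread\neggs\n\nbread,eggs,bread\n")
txns_demo = parse_transactions(sample)
print(len(txns_demo), txns_demo)
```

Storing each transaction as a set drops repeated items within a line, which keeps support counting a simple subset test.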



## Brute-force runner

`run_scenario(csv_path, minsup, minconf)`  
Loads the CSV, mines frequent itemsets by brute force, prints them grouped by size (L1, L2, L3, ...) with count and support, then prints the final rules (A → B) and the mining and rule-generation timings.  
Returns the transactions so they can be reused for Apriori/FP-Growth.
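
The source of `mine_frequent_itemsets` lives in `../src` and is not reproduced here. For readers who want a picture of what brute-force mining involves, the following is a minimal sketch under stated assumptions: it enumerates every k-item combination, counts it against all transactions, and returns a dict mapping `frozenset` itemsets to fractional support (the shape `run_scenario` iterates over). The function name `brute_force_frequent_itemsets` and the toy transactions are hypothetical.

```python
from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup):
    """Return {frozenset(items): support} for every itemset with support >= minsup."""
    tx_sets = [set(t) for t in transactions]
    n = len(tx_sets)
    if n == 0:
        return {}
    items = sorted(set().union(*tx_sets))
    frequent = {}
    for k in range(1, len(items) + 1):
        found_any = False
        # Enumerate every k-item combination and count it against all transactions.
        for candidate in combinations(items, k):
            cand = frozenset(candidate)
            count = sum(1 for t in tx_sets if cand <= t)
            if count / n >= minsup:
                frequent[cand] = count / n
                found_any = True
        # No frequent k-itemset means no larger itemset can be frequent either.
        if not found_any:
            break
    return frequent

# Made-up toy transactions, not taken from the project's CSVs.
toy = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
freq_demo = brute_force_frequent_itemsets(toy, 0.5)
for itemset, sup in sorted(freq_demo.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{sup:.0%}")
```

The only shortcut taken is stopping once a whole level comes up empty; within a level, every combination is still tested, which is what makes the approach brute force.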


In [11]:
from time import perf_counter

def run_scenario(csv_path, minsup, minconf):
    """
    Load CSV → mine frequent itemsets (brute force) → generate rules → print.
    Returns the loaded transactions so later cells can reuse them.
    """
    txns = load_transactions(csv_path)
    n_tx = len(txns)

    t0 = perf_counter()
    freq = mine_frequent_itemsets(txns, minsup)
    t1 = perf_counter()
    rules = generate_rules(freq, minconf)
    t2 = perf_counter()

    print(f"\nDataset: {csv_path.name} | transactions={n_tx} | "
          f"minsup={minsup:.0%} | minconf={minconf:.0%}")
    # Print frequent itemsets by size
    by_k = {}
    for it, sup in freq.items():
        by_k.setdefault(len(it), []).append((sorted(it), sup))
    if not by_k:
        print("No frequent itemsets at this support.")
    else:
        for k in sorted(by_k):
            print(f"\nL{k} (frequent {k}-itemsets):")
            for items, sup in sorted(by_k[k]):
                count = int(round(sup * n_tx))
                print(f"{items} | count={count} | support={sup:.2%}")

    # Print rules
    print("\nFinal Association Rules (A -> B):")
    if not rules:
        print("No rules at these thresholds.")
    else:
        for i, (A,B,s,conf) in enumerate(rules, 1):
            count = int(round(s * n_tx))
            print(f"Rule {i}: {sorted(A)} -> {sorted(B)}")
            print(f"  Support: {s:.2%} (count={count})  Confidence: {conf:.2%}")

    print(f"\nTiming: mining={t1 - t0:.4f}s  rules={t2 - t1:.4f}s")
    return txns  # reused by the Apriori / FP-Growth cells


### Run 1 - Amazon (minsup=20%, minconf=50%)
Moderate thresholds. Expect multiple L2/L3 itemsets and several rules.


In [12]:
csv = root / "data/amazon.csv"
minsup, minconf = 0.20, 0.50
txns_amazon_2050 = run_scenario(csv, minsup, minconf)



Dataset: amazon.csv | transactions=20 | minsup=20% | minconf=50%

L1 (frequent 1-itemsets):
['A Beginner’s Guide'] | count=11 | support=55.00%
['Android Programming: The Big Nerd Ranch'] | count=13 | support=65.00%
['Beginning Programming with Java'] | count=6 | support=30.00%
['Head First Java 2nd Edition'] | count=8 | support=40.00%
['Java 8 Pocket Guide'] | count=4 | support=20.00%
['Java For Dummies'] | count=13 | support=65.00%
['Java: The Complete Reference'] | count=10 | support=50.00%

L2 (frequent 2-itemsets):
['A Beginner’s Guide', 'Android Programming: The Big Nerd Ranch'] | count=6 | support=30.00%
['A Beginner’s Guide', 'Java For Dummies'] | count=9 | support=45.00%
['A Beginner’s Guide', 'Java: The Complete Reference'] | count=9 | support=45.00%
['Android Programming: The Big Nerd Ranch', 'Head First Java 2nd Edition'] | count=6 | support=30.00%
['Android Programming: The Big Nerd Ranch', 'Java For Dummies'] | count=9 | support=45.00%
['Android Programming: The Big Nerd 

### Run 2 - Amazon (minsup=70%, minconf=70%)
Very strict thresholds. Likely few/no frequent pairs and no rules.


In [13]:
csv = root / "data/amazon.csv"
minsup, minconf = 0.70, 0.70
_ = run_scenario(csv, minsup, minconf)


Dataset: amazon.csv | transactions=20 | minsup=70% | minconf=70%
No frequent itemsets at this support.

Final Association Rules (A -> B):
No rules at these thresholds.

Timing: mining=0.0001s  rules=0.0000s
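
The empty result can be checked by hand. The sketch below converts the 70% threshold into a minimum transaction count and compares it with the L1 counts printed in Run 1. It assumes the threshold is inclusive (support >= minsup), which matches Run 1 listing an item at exactly 20%; the helper name `min_count` is made up for this example.

```python
def min_count(minsup_percent, n_transactions):
    """Smallest transaction count that reaches the threshold (integer ceiling division)."""
    return -(-minsup_percent * n_transactions // 100)

# Counts of the frequent 1-itemsets printed in Run 1 (Amazon, 20 transactions).
run1_l1_counts = [11, 13, 6, 8, 4, 13, 10]

needed = min_count(70, 20)
best = max(run1_l1_counts)
print(f"needed={needed}  best single item={best}  any frequent={best >= needed}")
```

An item must appear in 14 of the 20 transactions to reach 70%, while the most frequent items appear in 13, so no single item qualifies and no larger itemset can.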


### Run 3 - Walmart (minsup=20%, minconf=50%)
Moderate thresholds on grocery-style data; expect a few L2 pairs and rules.


In [14]:
csv = root / "data/walmart.csv"
minsup, minconf = 0.20, 0.50
txns_walmart_2050 = run_scenario(csv, minsup, minconf)


Dataset: walmart.csv | transactions=25 | minsup=20% | minconf=50%

L1 (frequent 1-itemsets):
['bread'] | count=14 | support=56.00%
['butter'] | count=6 | support=24.00%
['cereal'] | count=7 | support=28.00%
['eggs'] | count=11 | support=44.00%
['milk'] | count=10 | support=40.00%

L2 (frequent 2-itemsets):
['bread', 'eggs'] | count=6 | support=24.00%
['bread', 'milk'] | count=6 | support=24.00%

Final Association Rules (A -> B):
Rule 1: ['milk'] -> ['bread']
  Support: 24.00% (count=6)  Confidence: 60.00%
Rule 2: ['eggs'] -> ['bread']
  Support: 24.00% (count=6)  Confidence: 54.55%

Timing: mining=0.0002s  rules=0.0000s
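
The two rules follow directly from the supports printed above, since confidence(A -> B) = support(A ∪ B) / support(A). The sketch below rebuilds them from those numbers. It is an illustration of the calculation, not the project's `generate_rules`; the function name `rules_from_supports` is hypothetical, and the supports are copied from the Run 3 output.

```python
from itertools import combinations

# Supports copied from the Run 3 output (walmart.csv, 25 transactions).
freq_walmart = {
    frozenset(["bread"]): 0.56,
    frozenset(["butter"]): 0.24,
    frozenset(["cereal"]): 0.28,
    frozenset(["eggs"]): 0.44,
    frozenset(["milk"]): 0.40,
    frozenset(["bread", "eggs"]): 0.24,
    frozenset(["bread", "milk"]): 0.24,
}

def rules_from_supports(freq, minconf):
    """List (A, B, support, confidence) for each split of each frequent itemset."""
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                A = frozenset(antecedent)
                B = itemset - A
                # Every subset of a frequent itemset is frequent, so freq[A] exists.
                conf = sup / freq[A]
                if conf >= minconf:
                    rules.append((A, B, sup, conf))
    rules.sort(key=lambda rule: rule[3], reverse=True)
    return rules

walmart_rules = rules_from_supports(freq_walmart, 0.50)
for A, B, s, conf in walmart_rules:
    print(f"{sorted(A)} -> {sorted(B)}  support={s:.2%}  confidence={conf:.2%}")
```

The reverse directions, bread -> milk and bread -> eggs, have confidence 0.24 / 0.56 ≈ 42.86%, which is below the 50% threshold and explains why only two rules are reported.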


### Run 4 - Best Buy (minsup=25%, minconf=60%)
Moderate thresholds on electronics-style data; expect a few L2 pairs and some rules.


In [17]:
csv = root / "data/bestbuy.csv"
minsup, minconf = 0.25, 0.60
_ = run_scenario(csv, minsup, minconf)



Dataset: bestbuy.csv | transactions=20 | minsup=25% | minconf=60%

L1 (frequent 1-itemsets):
['Anti-Virus'] | count=14 | support=70.00%
['Desk Top'] | count=6 | support=30.00%
['Digital Camera'] | count=9 | support=45.00%
['External Hard-Drive'] | count=9 | support=45.00%
['Flash Drive'] | count=13 | support=65.00%
['Lab Top'] | count=12 | support=60.00%
['Lab Top Case'] | count=14 | support=70.00%
['Microsoft Office'] | count=11 | support=55.00%
['Printer'] | count=10 | support=50.00%
['Speakers'] | count=11 | support=55.00%

L2 (frequent 2-itemsets):
['Anti-Virus', 'Digital Camera'] | count=5 | support=25.00%
['Anti-Virus', 'External Hard-Drive'] | count=9 | support=45.00%
['Anti-Virus', 'Flash Drive'] | count=10 | support=50.00%
['Anti-Virus', 'Lab Top'] | count=10 | support=50.00%
['Anti-Virus', 'Lab Top Case'] | count=12 | support=60.00%
['Anti-Virus', 'Microsoft Office'] | count=8 | support=40.00%
['Anti-Virus', 'Printer'] | count=7 | support=35.00%
['Anti-Virus', 'Speakers'] | 

### Run 5 - KMart (minsup=30%, minconf=60%)
Slightly stricter minsup; likely fewer L2/L3 itemsets and fewer rules.


In [18]:
csv = root / "data/kmart.csv"
minsup, minconf = 0.30, 0.60
_ = run_scenario(csv, minsup, minconf)



Dataset: kmart.csv | transactions=20 | minsup=30% | minconf=60%

L1 (frequent 1-itemsets):
['Bed Skirts'] | count=11 | support=55.00%
['Bedding Collections'] | count=7 | support=35.00%
['Bedspreads'] | count=7 | support=35.00%
['Decorative Pillows'] | count=10 | support=50.00%
['Embroidered Bedspread'] | count=6 | support=30.00%
['Kids Bedding'] | count=12 | support=60.00%
['Quilts'] | count=8 | support=40.00%
['Shams'] | count=11 | support=55.00%
['Sheets'] | count=10 | support=50.00%

L2 (frequent 2-itemsets):
['Bed Skirts', 'Bedspreads'] | count=7 | support=35.00%
['Bed Skirts', 'Kids Bedding'] | count=10 | support=50.00%
['Bed Skirts', 'Shams'] | count=9 | support=45.00%
['Bed Skirts', 'Sheets'] | count=9 | support=45.00%
['Bedding Collections', 'Kids Bedding'] | count=6 | support=30.00%
['Bedspreads', 'Kids Bedding'] | count=7 | support=35.00%
['Bedspreads', 'Sheets'] | count=7 | support=35.00%
['Decorative Pillows', 'Quilts'] | count=6 | support=30.00%
['Kids Bedding', 'Shams'] 

### Run 6 - Nike (minsup=20%, minconf=50%)
Looser thresholds; expect more frequent pairs and several rules.


In [19]:
csv = root / "data/nike.csv"
minsup, minconf = 0.20, 0.50
_ = run_scenario(csv, minsup, minconf)


Dataset: nike.csv | transactions=20 | minsup=20% | minconf=50%

L1 (frequent 1-itemsets):
['Dry Fit V-Nick'] | count=10 | support=50.00%
['Hoodies'] | count=8 | support=40.00%
['Modern Pants'] | count=9 | support=45.00%
['Rash Guard'] | count=12 | support=60.00%
['Running Shoe'] | count=14 | support=70.00%
['Soccer Shoe'] | count=5 | support=25.00%
['Socks'] | count=12 | support=60.00%
['Sweatshirts'] | count=12 | support=60.00%
['Swimming Shirt'] | count=11 | support=55.00%
['Tech Pants'] | count=9 | support=45.00%

L2 (frequent 2-itemsets):
['Dry Fit V-Nick', 'Hoodies'] | count=7 | support=35.00%
['Dry Fit V-Nick', 'Modern Pants'] | count=4 | support=20.00%
['Dry Fit V-Nick', 'Rash Guard'] | count=10 | support=50.00%
['Dry Fit V-Nick', 'Running Shoe'] | count=5 | support=25.00%
['Dry Fit V-Nick', 'Soccer Shoe'] | count=4 | support=20.00%
['Dry Fit V-Nick', 'Socks'] | count=4 | support=20.00%
['Dry Fit V-Nick', 'Sweatshirts'] | count=4 | support=20.00%
['Dry Fit V-Nick', 'Swimming Sh

### Apriori on the same data (Amazon 20% / 50%)

Reuse the **Amazon** transactions from Run 1 with the **same thresholds**  
(*minsup = 20%*, *minconf = 50%*), then print the **top 10 rules**.  
If no rules meet the thresholds, a clear message is shown.


In [15]:
# Use the same transactions/thresholds as Run 1 (Amazon 20/50)
txns = txns_amazon_2050
minsup, minconf = 0.20, 0.50

ap_rules = mine_with_apriori(txns, minsup, minconf)
print("Apriori (top 10 rules):")
if not ap_rules:
    print("No rules at these thresholds.")
else:
    print_rules(ap_rules, 10)

Apriori (top 10 rules):
Rule 1: ['Java: The Complete Reference'] -> ['Java For Dummies']
  Support: 50.00%  Confidence: 100.00%
Rule 2: ['A Beginner’s Guide', 'Java: The Complete Reference'] -> ['Java For Dummies']
  Support: 45.00%  Confidence: 100.00%
Rule 3: ['A Beginner’s Guide', 'Java For Dummies'] -> ['Java: The Complete Reference']
  Support: 45.00%  Confidence: 100.00%
Rule 4: ['Android Programming: The Big Nerd Ranch', 'Java: The Complete Reference'] -> ['Java For Dummies']
  Support: 30.00%  Confidence: 100.00%
Rule 5: ['A Beginner’s Guide', 'Android Programming: The Big Nerd Ranch', 'Java: The Complete Reference'] -> ['Java For Dummies']
  Support: 25.00%  Confidence: 100.00%
Rule 6: ['A Beginner’s Guide', 'Android Programming: The Big Nerd Ranch', 'Java For Dummies'] -> ['Java: The Complete Reference']
  Support: 25.00%  Confidence: 100.00%
Rule 7: ['Java: The Complete Reference'] -> ['A Beginner’s Guide']
  Support: 45.00%  Confidence: 90.00%
Rule 8: ['Java For Dummies', '

### FP-Growth on the same data (Amazon 20% / 50%)

Run **FP-Growth** from `mlxtend` on the **same transactions and thresholds** as Apriori  
(*minsup = 20%*, *minconf = 50%*).  
Print the **top 10 rules**, or show a message if none qualify.


In [16]:
fp_rules = mine_with_fpgrowth(txns, minsup, minconf)
print("\nFP-Growth (top 10 rules):")
if not fp_rules:
    print("No rules at these thresholds.")
else:
    print_rules(fp_rules, 10)


FP-Growth (top 10 rules):
Rule 1: ['Java: The Complete Reference'] -> ['Java For Dummies']
  Support: 50.00%  Confidence: 100.00%
Rule 2: ['A Beginner’s Guide', 'Java: The Complete Reference'] -> ['Java For Dummies']
  Support: 45.00%  Confidence: 100.00%
Rule 3: ['A Beginner’s Guide', 'Java For Dummies'] -> ['Java: The Complete Reference']
  Support: 45.00%  Confidence: 100.00%
Rule 4: ['Android Programming: The Big Nerd Ranch', 'Java: The Complete Reference'] -> ['Java For Dummies']
  Support: 30.00%  Confidence: 100.00%
Rule 5: ['A Beginner’s Guide', 'Android Programming: The Big Nerd Ranch', 'Java: The Complete Reference'] -> ['Java For Dummies']
  Support: 25.00%  Confidence: 100.00%
Rule 6: ['A Beginner’s Guide', 'Android Programming: The Big Nerd Ranch', 'Java For Dummies'] -> ['Java: The Complete Reference']
  Support: 25.00%  Confidence: 100.00%
Rule 7: ['Java: The Complete Reference'] -> ['A Beginner’s Guide']
  Support: 45.00%  Confidence: 90.00%
Rule 8: ['Java For Dummies'

## Observations
- On Amazon with **minsup=70% / minconf=70%**, no itemset is frequent (the most common single items reach only 65% support), so no rules are produced.
- With **minsup=20% / minconf=50%**, more L2/L3 itemsets appear and several rules are produced.
- On Amazon 20/50, the strongest rules reach 100% confidence, e.g. `['Java: The Complete Reference'] -> ['Java For Dummies']` at 50% support.
- Library **Apriori** and **FP-Growth** produce rules consistent with our brute-force output at the same thresholds (differences mainly in ordering/duplicates).
- Mining time is negligible on these tiny datasets (20-25 transactions); the timing lines shown report about 0.0001-0.0002 s for mining.
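
The consistency claim can be checked mechanically rather than by eye. The sketch below compares two rule lists while ignoring ordering and duplicates. It assumes each rule is an `(A, B, support, confidence)` tuple, as in the brute-force output; the library wrappers' return format is not shown here and may need adapting. The helper names and the sample rules are made up for illustration.

```python
def normalize(rules, digits=4):
    """Map rules to a set of hashable keys, ignoring order and duplicates."""
    return {
        (frozenset(A), frozenset(B), round(s, digits), round(conf, digits))
        for A, B, s, conf in rules
    }

def diff_rules(left, right):
    """Return (rules only in left, rules only in right) after normalization."""
    left_set, right_set = normalize(left), normalize(right)
    return left_set - right_set, right_set - left_set

# Made-up rule lists: same rules, different order, one duplicate.
rules_a = [({"x"}, {"y"}, 0.5, 1.0), ({"y"}, {"x"}, 0.5, 0.8)]
rules_b = [({"y"}, {"x"}, 0.5, 0.8), ({"x"}, {"y"}, 0.5, 1.0), ({"x"}, {"y"}, 0.5, 1.0)]

only_a, only_b = diff_rules(rules_a, rules_b)
print("only in A:", only_a)
print("only in B:", only_b)
```

Rounding support and confidence before comparing guards against tiny floating-point differences between implementations.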

## How to Reproduce
1. Select the Jupyter kernel for this project (Python **.venv**).
2. Ensure CSVs are in `../data/`.
3. Run cells in order. Change `csv`, `minsup`, `minconf` to try other settings.
4. Export: **File → Save and Export As → PDF** (report) and **Executable Script (.py)**.
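
Before running the cells, step 2 can be verified with a short check. This is a sketch only: the helper name `missing_datasets` is hypothetical, and the file names are the five listed in the Datasets section.

```python
from pathlib import Path

EXPECTED = ["amazon.csv", "bestbuy.csv", "kmart.csv", "nike.csv", "walmart.csv"]

def missing_datasets(data_dir):
    """Return the expected CSV names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED if not (data_dir / name).is_file()]

# The notebook runs from /notebooks, so the data folder is one level up.
missing = missing_datasets(Path.cwd().parent / "data")
print("Missing files:", missing if missing else "none")
```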