# Market Basket Analysis

- Construct association rules
- Identify items purchased together

## Association Rules
- (Antecedent) -> (Consequent)
    - e.g. (Fiction) -> (Biography) means customers who buy Fiction also tend to buy Biography; the rule describes co-occurrence, not causation

In [1]:
import pandas as pd

books = pd.read_csv(
    "https://assets.datacamp.com/production/repositories/5654/datasets/1e5276e5a24493ec07b44fe99b46984f2ffa4488/bookstore_transactions.csv"
)

In [2]:
books.head()

Unnamed: 0,Transaction
0,"History,Bookmark"
1,"History,Bookmark"
2,"Fiction,Bookmark"
3,"Biography,Bookmark"
4,"History,Bookmark"


In [3]:
# Convert strings to list
txl = books["Transaction"].apply(lambda x: x.split(","))
tx = list(txl)
tx

[['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'B

### Identifying Association Rules
- Set of all possible rules is large
- Most rules are not useful
- **Restrict the set of rules**
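To get a feel for how quickly the rule space grows, here is a minimal sketch that counts every rule with a non-empty antecedent and a non-empty, disjoint consequent, allowing several items on each side. The helper name `count_rules` is hypothetical and not part of the original notebook. The brute-force count is compared against the closed form 3^n - 2^(n+1) + 1.

```python
from itertools import product


def count_rules(n_items: int) -> int:
    """Count rules whose antecedent and consequent are non-empty and disjoint."""
    count = 0
    # Assign every item a role: 0 = unused, 1 = antecedent, 2 = consequent
    for roles in product(range(3), repeat=n_items):
        if 1 in roles and 2 in roles:
            count += 1
    return count


n_items = 5  # the bookstore data has 5 categories
brute_force = count_rules(n_items)
closed_form = 3**n_items - 2 ** (n_items + 1) + 1
print(brute_force, closed_form)
```

Even with only 5 categories there are 180 candidate rules once multi-item sides are allowed, compared with 20 when both sides are limited to a single item.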

In [4]:
# Get the number of genres
from itertools import chain
from collections import Counter

flatten = list(chain.from_iterable(tx))
categories = [*set(flatten)]
print(Counter(flatten))
print("Num of categories:", len(categories))

Counter({'Bookmark': 99, 'Biography': 40, 'History': 25, 'Fiction': 25, 'Poetry': 9})
Num of categories: 5


In [5]:
# Generating rules
from itertools import permutations

rules = list(permutations(categories, 2))
print(rules)
print("Num of rules:", len(rules))

[('Poetry', 'Bookmark'), ('Poetry', 'History'), ('Poetry', 'Fiction'), ('Poetry', 'Biography'), ('Bookmark', 'Poetry'), ('Bookmark', 'History'), ('Bookmark', 'Fiction'), ('Bookmark', 'Biography'), ('History', 'Poetry'), ('History', 'Bookmark'), ('History', 'Fiction'), ('History', 'Biography'), ('Fiction', 'Poetry'), ('Fiction', 'Bookmark'), ('Fiction', 'History'), ('Fiction', 'Biography'), ('Biography', 'Poetry'), ('Biography', 'Bookmark'), ('Biography', 'History'), ('Biography', 'Fiction')]
Num of rules: 20


## Computing different metrics

### Support metric
- Given a rule R: (X) -> (Y)
- Support = number of transactions containing both X and Y / number of all transactions
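The definition above can be sketched in plain Python with a subset check, which counts a transaction whenever it contains all items of the rule, regardless of item order or extra items. The function name `support` and the four-row `sample` list are illustrative assumptions, not taken from the notebook.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items.issubset(t))
    return hits / len(transactions)


# Hypothetical mini-dataset, shaped like the bookstore transactions
sample = [
    ["History", "Bookmark"],
    ["Fiction", "Bookmark"],
    ["Biography", "Bookmark"],
    ["History", "Bookmark"],
]

print(support(["History", "Bookmark"], sample))  # 2 of 4 transactions
print(support(["Bookmark"], sample))             # all 4 transactions
```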

In [6]:
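# Support for the itemset ("History", "Bookmark")
# Counter matches whole tuples exactly (same items, same order), so this
# equals the support only if the pair never appears inside a larger basket
tx = [tuple(li) for li in tx]  # lists are unhashable; Counter needs tuples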

count_tx = Counter(tx)
all_tx = len(tx)
count_tx[("History", "Bookmark")] / all_tx

0.25252525252525254

### Computing support with pandas & numpy

In [9]:
from mlxtend.preprocessing import TransactionEncoder
encoder = TransactionEncoder().fit(tx)

# one-hot encode
onehot = encoder.transform(tx)


In [11]:
# Convert the one-hot encoded boolean array to a DataFrame
df = pd.DataFrame(onehot, columns=encoder.columns_)

In [12]:
df.head()

Unnamed: 0,Biography,Bookmark,Fiction,History,Poetry
0,False,True,False,True,False
1,False,True,False,True,False
2,False,True,True,False,False
3,True,True,False,False,False
4,False,True,False,True,False


In [28]:
# Support for each single item

df.mean()

Biography    0.404040
Bookmark     1.000000
Fiction      0.252525
History      0.252525
Poetry       0.090909
dtype: float64

In [33]:
import numpy as np

# Support for multiple items

df["fiction+poetry"] = np.logical_and(df["Fiction"], df["Poetry"])

df["history+biography"] = np.logical_and(df["History"], df["Biography"])

df["history+bookmark"] = np.logical_and(df["History"], df["Bookmark"])

In [34]:
df.mean()

Biography            0.404040
Bookmark             1.000000
Fiction              0.252525
History              0.252525
Poetry               0.090909
fiction+poetry       0.000000
history+biography    0.000000
history+bookmark     0.252525
dtype: float64

### Confidence and Lift


#### When support is misleading

| TID | Transactions        |
| --- | ------------------- |
| 1   | Coffee, Milk        |
| 2   | Bread, Milk, Orange |
| 3   | Bread, Milk         |
| 4   | Bread, Milk, Sugar  |
| 5   | Bread, Jam, Milk    |

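- Milk and Bread appear together in 4 of 5 transactions (support 0.8), which suggests the rule Milk -> Bread
- High support alone **is not informative for marketing**
    - Milk and Bread are both popular on their own (Milk is in every transaction, Bread in 4 of 5), so **co-occurrence does not necessarily imply association**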
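A minimal sketch of how confidence and lift expose this, using the five transactions from the table above. It assumes the standard definitions: confidence(X -> Y) = support(X and Y) / support(X), and lift(X -> Y) = support(X and Y) / (support(X) * support(Y)). The helper name `itemset_support` is hypothetical.

```python
groceries = [
    {"Coffee", "Milk"},
    {"Bread", "Milk", "Orange"},
    {"Bread", "Milk"},
    {"Bread", "Milk", "Sugar"},
    {"Bread", "Jam", "Milk"},
]


def itemset_support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


s_milk = itemset_support({"Milk"}, groceries)
s_bread = itemset_support({"Bread"}, groceries)
s_both = itemset_support({"Milk", "Bread"}, groceries)

# Confidence: how often Bread appears among transactions that contain Milk
confidence = s_both / s_milk
# Lift: observed joint support relative to what independence would predict
lift = s_both / (s_milk * s_bread)

print(f"support={s_both:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift of about 1 means Milk and Bread occur together no more often than expected if they were purchased independently, even though the support of the pair is high.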