# Market Basket Analysis

- Construct association rules
- Identify items purchased together

## Association Rules
- (Antecedent) => (Consequent)
    - e.g. (Fiction) => (Biography) means that customers who buy Fiction are also likely to buy Biography

In [1]:
import pandas as pd

books = pd.read_csv(
    "https://assets.datacamp.com/production/repositories/5654/datasets/1e5276e5a24493ec07b44fe99b46984f2ffa4488/bookstore_transactions.csv"
)

In [2]:
books.head()

Unnamed: 0,Transaction
0,"History,Bookmark"
1,"History,Bookmark"
2,"Fiction,Bookmark"
3,"Biography,Bookmark"
4,"History,Bookmark"


In [3]:
# Convert strings to list
txl = books["Transaction"].apply(lambda x: x.split(","))
tx = list(txl)
tx

[['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'B

### Identifying Association Rules
- Set of all possible rules is large
- Most rules are not useful
- **Restrict the set of rules**

In [4]:
# Get the number of genres
from itertools import chain
from collections import Counter

flatten = list(chain.from_iterable(tx))
categories = [*set(flatten)]
print(Counter(flatten))
print("Num of categories:", len(categories))

Counter({'Bookmark': 99, 'Biography': 40, 'History': 25, 'Fiction': 25, 'Poetry': 9})
Num of categories: 5


In [5]:
# Generating rules
from itertools import permutations

rules = list(permutations(categories, 2))
print(rules)
print("Num of rules:", len(rules))

[('Poetry', 'Bookmark'), ('Poetry', 'History'), ('Poetry', 'Fiction'), ('Poetry', 'Biography'), ('Bookmark', 'Poetry'), ('Bookmark', 'History'), ('Bookmark', 'Fiction'), ('Bookmark', 'Biography'), ('History', 'Poetry'), ('History', 'Bookmark'), ('History', 'Fiction'), ('History', 'Biography'), ('Fiction', 'Poetry'), ('Fiction', 'Bookmark'), ('Fiction', 'History'), ('Fiction', 'Biography'), ('Biography', 'Poetry'), ('Biography', 'Bookmark'), ('Biography', 'History'), ('Biography', 'Fiction')]
Num of rules: 20
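The 20 rules above only pair one antecedent item with one consequent item. As a rough sketch of how quickly the full rule set grows once multi-item antecedents and consequents are allowed, the hypothetical helper `count_rules` below counts every rule whose two sides are non-empty and disjoint:

```python
from math import comb


def count_rules(n_items):
    """Count rules X => Y where X and Y are non-empty, disjoint itemsets."""
    total = 0
    for k in range(1, n_items):  # k = size of the antecedent
        n_antecedents = comb(n_items, k)
        # any non-empty subset of the remaining items can be the consequent
        n_consequents = 2 ** (n_items - k) - 1
        total += n_antecedents * n_consequents
    return total


print(count_rules(5))  # 180 rules for the 5 bookstore categories
print(count_rules(20))  # grows roughly like 3 ** n_items
```

Even with only 5 categories there are 180 possible rules, which is why the rule set has to be restricted.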


## Computing different metrics

### Support metric
- Given a rule R
- Support(R) = number of transactions containing all items in R / number of all transactions

In [6]:
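# Support metric for ("History", "Bookmark")
# Counter matches whole transactions exactly, so the item order in the key matters
tx = [tuple(li) for li in tx]  # replace lists with tuples so they are hashable for Counter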

count_tx = Counter(tx)
all_tx = len(tx)
count_tx[("History", "Bookmark")] / all_tx

0.25252525252525254
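The `Counter` lookup works here because every transaction holds exactly two items in a fixed order. A more general sketch, assuming support should count any transaction that contains the itemset regardless of order or extra items, could look like this (`support` and `sample_tx` are illustrative names, not part of the original notebook):

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    itemset = set(items)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# small hand-made sample, not the full bookstore dataset
sample_tx = [
    ("History", "Bookmark"),
    ("Fiction", "Bookmark"),
    ("Biography", "Bookmark"),
    ("History", "Bookmark"),
]

print(support(sample_tx, ["Bookmark", "History"]))  # 0.5, item order is irrelevant
print(support(sample_tx, ["Bookmark"]))  # 1.0
```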

### Computing support with pandas & numpy

In [9]:
from mlxtend.preprocessing import TransactionEncoder
encoder = TransactionEncoder().fit(tx)

# one-hot encode
onehot = encoder.transform(tx)


In [11]:
# convert OHE list-of-lists to DF
df = pd.DataFrame(onehot, columns=encoder.columns_)

In [12]:
df.head()

Unnamed: 0,Biography,Bookmark,Fiction,History,Poetry
0,False,True,False,True,False
1,False,True,False,True,False
2,False,True,True,False,False
3,True,True,False,False,False
4,False,True,False,True,False


In [28]:
# Support for each single item

df.mean()

Biography    0.404040
Bookmark     1.000000
Fiction      0.252525
History      0.252525
Poetry       0.090909
dtype: float64

In [33]:
# Support for multiple items
import numpy as np

# an itemset is present in a transaction only if every one of its items is present
df["fiction+poetry"] = np.logical_and(df["Fiction"], df["Poetry"])

df["history+biography"] = np.logical_and(df["History"], df["Biography"])

df["history+bookmark"] = np.logical_and(df["History"], df["Bookmark"])

In [34]:
df.mean()

Biography            0.404040
Bookmark             1.000000
Fiction              0.252525
History              0.252525
Poetry               0.090909
fiction+poetry       0.000000
history+biography    0.000000
history+bookmark     0.252525
dtype: float64

### Confidence and Lift
Confidence and lift **refine** the support metric

#### When support is misleading

| TID | Transactions        |
| --- | ------------------- |
| 1   | Coffee, Milk        |
| 2   | Bread, Milk, Orange |
| 3   | Bread, Milk         |
| 4   | Bread, Milk, Sugar  |
| 5   | Bread, Jam, Milk    |

- Milk and Bread are frequently purchased together (support 0.8), which suggests the rule Milk => Bread
- The rule above **is not informative for marketing**
    - Milk and Bread are both popular items; **co-occurrence does not necessarily imply association**

### Confidence
X => Y: Support(X and Y) / Support(X)

**Shows likelihood of people buying Y, given they bought X**

For the table above: 

Confidence(Milk, Coffee) = 0.2 / 1 = 0.2  
Low likelihood of association: Milk => Coffee

Confidence(Coffee, Milk) = 0.2 / 0.2 = 1  
High likelihood of association: Coffee => Milk
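The two confidence values can be reproduced with a short sketch over the five grocery transactions from the table; the helper names `support` and `confidence` are illustrative:

```python
# the five transactions from the table above, as sets
transactions = [
    {"Coffee", "Milk"},
    {"Bread", "Milk", "Orange"},
    {"Bread", "Milk"},
    {"Bread", "Milk", "Sugar"},
    {"Bread", "Jam", "Milk"},
]


def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent):
    """Support(X and Y) / Support(X) for the rule X => Y."""
    return support(antecedent | consequent) / support(antecedent)


print(confidence({"Milk"}, {"Coffee"}))  # 0.2
print(confidence({"Coffee"}, {"Milk"}))  # 1.0
```

Confidence is not symmetric: swapping antecedent and consequent changes the denominator.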

### Lift

X => Y: Support(X and Y) / (Support(X) * Support(Y))

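**observed support of X and Y / expected support of X and Y**

**Lift < 1 means X and Y are paired less frequently than random pairing would produce**

Expected = the support X and Y would have together if they were independent

Lift > 1 indicates a positive association; Lift = 1 means X and Y look independent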

For the table above:

Lift(Coffee, Milk) = 0.2 / (0.2 * 1) = 1  
Coffee and Milk occur together exactly as often as independence predicts, so the rule carries no extra information
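A minimal sketch of the lift calculation on the same grocery table, with the illustrative helpers `support` and `lift`. Because Milk is in every transaction, any rule involving Milk has a lift of 1; Bread and Jam are included as a contrasting pair:

```python
# the five transactions from the grocery table, as sets
transactions = [
    {"Coffee", "Milk"},
    {"Bread", "Milk", "Orange"},
    {"Bread", "Milk"},
    {"Bread", "Milk", "Sugar"},
    {"Bread", "Jam", "Milk"},
]


def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def lift(x, y):
    """Observed joint support divided by the support expected under independence."""
    return support(x | y) / (support(x) * support(y))


print(lift({"Coffee"}, {"Milk"}))  # 1.0
print(round(lift({"Bread"}, {"Jam"}), 2))  # 1.25
print(lift({"Bread"}, {"Coffee"}))  # 0.0, never bought together
```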

#### Case Study: Support, Confidence and Lift

**Given:**
- Support(Hunger Games, Harry Potter) = 0.12
- Support(Hunger Games, Twilight) = 0.09
- Support(Harry Potter, Twilight) = 0.14


- Support(Harry Potter) = 0.477
- Support(Twilight) = 0.256


- Confidence(Potter, Twilight) = 0.29
- Confidence(Twilight, Potter) = 0.55

- Lift(Potter, Twilight) = 1.15

**Inferences:**
- Harry Potter is more popular than Twilight
- Only about 29% of Potter buyers also buy Twilight, while about 55% of Twilight buyers also buy Potter
- Potter and Twilight are bought together more often than independence would predict (Lift > 1)

The stronger rule is Twilight => Potter
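The confidence and lift figures in this case study follow directly from the given supports. A quick check, using variable names chosen here for illustration:

```python
# supports given in the case study
support_potter = 0.477
support_twilight = 0.256
support_both = 0.14  # Support(Harry Potter, Twilight)

conf_potter_twilight = support_both / support_potter
conf_twilight_potter = support_both / support_twilight
lift = support_both / (support_potter * support_twilight)

print(round(conf_potter_twilight, 2))  # 0.29
print(round(conf_twilight_potter, 2))  # 0.55
print(round(lift, 2))  # 1.15
```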

### Leverage

X => Y: Support(X and Y) - Support(X) * Support(Y)

**observed support of X and Y - expected support of X and Y**

**Leverage > 0 means X and Y appear together more often than independence predicts**

Leverage > 0 is good; Leverage = 0 means X and Y look independent

Range is -1 to 1, while Lift's range is 0 to infinity

For the table above:

Leverage(Coffee, Milk) = 0.2 - (0.2 * 1) = 0  
Coffee and Milk have zero leverage: they occur together exactly as often as independence predicts
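As a small sketch, leverage can be computed straight from three support values. The function name `leverage` is illustrative, and the supports are read off the grocery table (Bread and Jam are added as a pair with non-zero leverage):

```python
def leverage(support_xy, support_x, support_y):
    """Observed joint support minus the support expected under independence."""
    return support_xy - support_x * support_y


# Coffee => Milk: Support(Coffee and Milk) = 0.2, Support(Coffee) = 0.2, Support(Milk) = 1.0
print(leverage(0.2, 0.2, 1.0))  # 0.0

# Bread => Jam: Support(Bread and Jam) = 0.2, Support(Bread) = 0.8, Support(Jam) = 0.2
print(round(leverage(0.2, 0.8, 0.2), 2))  # 0.04
```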

### Conviction

X => Y: (Support(X) * Support(~Y)) / Support(X and ~Y)  

OR

X => Y: Support(~Y) / (1 - Confidence(X, Y))

Ratio of the probability that X appears without Y if they were independent to the actual frequency of X appearing without Y

**Shows Y's dependence on X**

**Conviction(Apple, Pear) = 1.01 means the rule Apple => Pear would be incorrect 1% more often if Apple and Pear were independent than it actually is**

For the table above:

Conviction(Coffee, Milk) = 0 / 0 = NaN (undefined, because Milk is in every transaction)  
Conviction(Milk, Coffee) = 0.8 / 0.8 = 1

Rule (Milk => Coffee) would be incorrect 0% more often under independence, so Coffee does not depend on Milk
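A sketch of the second conviction formula that also guards the division by zero. Returning NaN for 0 / 0 and infinity when only the denominator is zero is a convention assumed here, not something the notes prescribe; libraries may handle these cases differently:

```python
import math


def conviction(support_xy, support_x, support_y):
    """Support(~Y) / (1 - Confidence(X, Y)) for the rule X => Y."""
    confidence = support_xy / support_x
    numerator = 1 - support_y  # Support(~Y)
    denominator = 1 - confidence
    if denominator == 0:
        # assumed convention: 0 / 0 is undefined, anything else / 0 is infinite
        return math.nan if numerator == 0 else math.inf
    return numerator / denominator


# supports read off the grocery table
print(conviction(0.2, 0.2, 1.0))  # Coffee => Milk: nan
print(conviction(0.2, 1.0, 0.2))  # Milk => Coffee: 1.0
print(conviction(0.2, 0.2, 0.8))  # Jam => Bread: inf, the rule never fails
```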

#### Case Study: Conviction

**Given:**
- Conviction(Twilight, Potter) = 1.16
- Conviction(Potter, Twilight) = 1.05


**Inferences:**
- Twilight => Potter would be incorrect 16% more often if the variables were independent
- Potter => Twilight would be incorrect 5% more often if the variables were independent

### Zhang's Metric - Association and Dissociation

`(Conf(A, B) - Conf(~A, B)) / max(Conf(A, B), Conf(~A, B))`

- Range -1 to 1, **1 is perfect association and -1 is perfect dissociation**
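A minimal sketch of Zhang's metric, assuming its standard form (confidence of A => B compared with confidence of ~A => B) rewritten in terms of supports so that no complement columns are needed. The function name `zhang` and the three test cases are illustrative:

```python
def zhang(support_ab, support_a, support_b):
    """Zhang's metric for A => B, expressed with supports only."""
    numerator = support_ab - support_a * support_b
    denominator = max(
        support_ab * (1 - support_a),
        support_a * (support_b - support_ab),
    )
    return numerator / denominator


# A and B always appear together: perfect association
print(zhang(0.5, 0.5, 0.5))  # 1.0

# A and B never appear together: perfect dissociation
print(zhang(0.0, 0.5, 0.5))  # -1.0

# A and B are independent
print(zhang(0.25, 0.5, 0.5))  # 0.0
```

The support form is equivalent to the confidence form after multiplying numerator and denominator by Support(A) * (1 - Support(A)). The sketch does not guard against a zero denominator, which occurs for example when A is in every transaction.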


## Overview of MBA
1. Generate large set of rules
2. Filter those rules with metrics
3. Apply intuition and common sense
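
The first two steps can be sketched end to end on the small grocery table. The thresholds `MIN_SUPPORT` and `MIN_LIFT` are hypothetical values chosen for illustration, and only one-item antecedents and consequents are generated:

```python
from itertools import permutations

# the five transactions from the grocery table, as sets
transactions = [
    {"Coffee", "Milk"},
    {"Bread", "Milk", "Orange"},
    {"Bread", "Milk"},
    {"Bread", "Milk", "Sugar"},
    {"Bread", "Jam", "Milk"},
]


def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# 1. Generate a large set of candidate rules
items = sorted(set().union(*transactions))
candidates = list(permutations(items, 2))

# 2. Filter those rules with metrics (hypothetical thresholds)
MIN_SUPPORT = 0.2
MIN_LIFT = 1.0
kept = []
for x, y in candidates:
    s_xy = support({x, y})
    if s_xy < MIN_SUPPORT:
        continue
    lift = s_xy / (support({x}) * support({y}))
    if lift > MIN_LIFT:
        kept.append((x, y, s_xy, lift))

# 3. Review the survivors by hand; this step cannot be automated
for x, y, s_xy, lift in kept:
    print(f"{x} => {y}: support={s_xy:.2f}, lift={lift:.2f}")
```

Every rule involving Milk is dropped because its lift is exactly 1, which leaves only the Bread pairings for manual review.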