# Computing Association Rules with `dplyr`

### Review - Association Rules

Consider the rule $\{butter\} \rightarrow \{whole.milk\}$

  * $Support(\textrm{butter and milk}) = \frac{\textrm{# butter and milk transactions}}{\textrm{# total transactions}}$ 
  * $Support(\textrm{butter}) = \frac{\textrm{# butter transactions}}{\textrm{# total transactions}}$ 
  * $Confidence= \frac{Support(\textrm{butter and milk})}{Support(\textrm{butter})}$ 
  * $Lift= \frac{Confidence}{Support(\textrm{milk})}$ 
  

### Small example:  Compute the confidence and lift of {bread} -> {milk} 


<img width="350" src="img/small_example.png">


Use `dplyr` to:  

  * mutate to compute joint transactions 
  * summarize to compute counts and percents 
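
The two steps above can be sketched on a tiny table. The data below are hypothetical (they are not the values shown in the figure), and the names `toy` and `toy_rule` are made up for illustration; the point is only to show where `mutate` and `summarize` fit.

```r
library(dplyr)

# Hypothetical toy basket data (NOT the values from the figure):
# one row per transaction, 1 = purchased, 0 = not purchased.
toy <- data.frame(
  bread = c(1, 1, 0, 1, 0, 1, 1, 0),
  milk  = c(1, 0, 0, 1, 1, 1, 0, 0)
)

toy_rule <- toy %>%
  # mutate: flag transactions that contain both items
  mutate(bread_and_milk = bread * milk) %>%
  # summarize: collapse the table to counts
  summarize(n_total = n(),
            n_bread = sum(bread),
            n_milk  = sum(milk),
            n_both  = sum(bread_and_milk)) %>%
  # turn the counts into supports, then confidence, then lift
  mutate(support_bread = n_bread / n_total,
         support_milk  = n_milk / n_total,
         support_rule  = n_both / n_total,
         confidence    = support_rule / support_bread,
         lift          = confidence / support_milk)
toy_rule
```

With these made-up rows, 3 of the 5 bread transactions also contain milk, so the confidence is 3/5 and the lift is that value divided by the milk support of 4/8.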
  

### New example: investigate rule $\{butter\} \longrightarrow \{milk\}$ with `dplyr`
  

In [1]:
groceries <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Groceries.csv')
head(groceries)

frankfurter,sausage,liver.loaf,ham,meat,finished.products,organic.sausage,chicken,turkey,pork,⋯,candles,light.bulbs,sound.storage.medium,newspapers,photo.film,pot.plants,flower.soil.fertilizer,flower..seeds.,shopping.bags,bags
0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0


In [2]:
# Run if needed
# install.packages('dplyr')

In [3]:
library(dplyr)
butter_milk <- groceries %>%
                select(butter, whole.milk)
head(butter_milk)





butter,whole.milk
0,0
0,0
0,1
0,0
0,1
1,1


In [4]:
N <- nrow(groceries)
N

#### Support(Butter): 2 steps

In [5]:
butter_milk %>%
  summarize(Nbutter = sum(butter)) %>% 
  mutate(support_butter = Nbutter/N)

Nbutter,support_butter
545,0.05541434


#### Support(Butter): all at once

In [6]:
butter_milk %>%
  summarize(support_butter = sum(butter)/N)

support_butter
0.05541434


#### Support of whole.milk

In [7]:
butter_milk %>%
  summarize(support_milk = sum(whole.milk)/N)

support_milk
0.255516


#### Support of $\{Butter\;and\;Milk\}$

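Why `butter * whole.milk`? Both columns are 0/1 indicators, so their product is 1 only when both items were purchased in the same transaction and 0 otherwise.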

In [8]:
butter_milk %>%
  mutate(butter_and_milk = butter * whole.milk) %>%
  summarize(support_rule = sum(butter_and_milk)/N)

support_rule
0.02755465
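
One way to convince yourself that the product of two 0/1 columns marks joint purchases is to list every possible combination. This is a small base R sketch; the data frame `combos` is a made-up name and is not part of the notebook's data.

```r
# All four possible combinations of two 0/1 purchase indicators.
combos <- expand.grid(butter = c(0, 1), whole.milk = c(0, 1))

# Product of the indicators.
combos$product <- combos$butter * combos$whole.milk

# Logical "both purchased", converted to 0/1 for comparison.
combos$both <- as.integer(combos$butter == 1 & combos$whole.milk == 1)

combos
```

The `product` and `both` columns agree in every row, which is why summing the product counts the transactions that contain both items.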


#### All together now (+ confidence and lift)

In [9]:
groceries %>%
  mutate(bought_butter_milk = butter *  whole.milk) %>%
  summarize(support_milk = sum(whole.milk)/N,
            support_butter = sum(butter)/N,
            support_rule = sum(bought_butter_milk)/N) %>%
  mutate(confidence = support_rule/support_butter) %>%
  mutate(lift = confidence/support_milk)

support_milk,support_butter,support_rule,confidence,lift
0.255516,0.05541434,0.02755465,0.4972477,1.946053


#### Notes

* Each value must be computed before it is used:
    * Supports before confidence
    * Confidence before lift
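
The ordering matters because `dplyr` evaluates the expressions inside `summarize` (and `mutate`) from top to bottom, so a later expression can use a value created earlier in the same call. The sketch below uses hypothetical items `a` and `b`, and relies on the fact that the mean of a 0/1 column equals its sum divided by the number of rows.

```r
library(dplyr)

# Hypothetical 0/1 purchase indicators for two made-up items, a and b.
toy <- data.frame(a = c(1, 1, 0, 1, 0),
                  b = c(1, 0, 0, 1, 1))

one_call <- toy %>%
  summarize(support_a    = mean(a),                  # sum(a) / number of rows
            support_b    = mean(b),
            support_rule = mean(a * b),
            confidence   = support_rule / support_a, # uses values defined above
            lift         = confidence / support_b)
one_call
```

Moving the `confidence` line above `support_rule` would fail, because `support_rule` would not exist yet at that point.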

### Computing Many Rules At Once


* Stack the LHS into one column
* Group by LHS
* Compute:
    * Support
    * Confidence
    * Lift
  

#### Step 0 - Read the data and load libraries

In [16]:
# Run if needed
# install.packages('tidyr')



In [10]:
library(tidyr)
library(dplyr)

In [11]:
groceries <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Groceries.csv')
N <- nrow(groceries)

#### Step 1 - Stack all of the other products

In [12]:
groceries_stacked <-
  groceries %>%
  gather(key = "lhs",
         value = "pur_lhs",
         -whole.milk) 
head(groceries_stacked)

whole.milk,lhs,pur_lhs
0,frankfurter,0
0,frankfurter,0
1,frankfurter,0
0,frankfurter,0
1,frankfurter,0
1,frankfurter,0
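
`gather()` still works, but the `tidyr` documentation marks it as superseded in favor of `pivot_longer()`. The sketch below compares the two on a hypothetical three-row table (`toy` and the `stacked_*` names are made up). The rows come back in a different order, since `pivot_longer()` works transaction by transaction, but the per-product totals used in the later steps are the same.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same layout as groceries.
toy <- data.frame(whole.milk = c(1, 0, 1),
                  butter     = c(1, 0, 0),
                  yogurt     = c(0, 1, 1))

# Stacking as done in the cell above.
stacked_gather <- toy %>%
  gather(key = "lhs", value = "pur_lhs", -whole.milk)

# Equivalent stacking with pivot_longer().
stacked_pivot <- toy %>%
  pivot_longer(cols = -whole.milk,
               names_to = "lhs",
               values_to = "pur_lhs")

# Purchases per stacked product agree between the two versions.
totals_gather <- stacked_gather %>% count(lhs, wt = pur_lhs)
totals_pivot  <- stacked_pivot  %>% count(lhs, wt = pur_lhs)
totals_gather
totals_pivot
```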


#### Step 2 - Flag whether the lhs item and milk were bought together

In [13]:
groceries_stacked <-
  groceries_stacked %>%
  mutate(pur_both = whole.milk * pur_lhs) 
head(groceries_stacked)

whole.milk,lhs,pur_lhs,pur_both
0,frankfurter,0,0
0,frankfurter,0,0
1,frankfurter,0,0
0,frankfurter,0,0
1,frankfurter,0,0
1,frankfurter,0,0


#### Step 3 - Compute the support, confidence, and lift for each

In [14]:
# Group by lhs so that each product's rule is summarized separately.
many_rules <-
  groceries_stacked %>%
  group_by(lhs) %>%
  summarize(sup_milk = sum(whole.milk)/N,
            sup_lhs = sum(pur_lhs)/N,
            joint_support = sum(pur_both)/N) %>%
  mutate(conf = joint_support/sup_lhs) %>%
  mutate(lift = conf/sup_milk) 
head(many_rules)



lhs,sup_milk,sup_lhs,joint_support,conf,lift
abrasive.cleaner,0.255516,0.0035587189,0.0016268429,0.4571429,1.7890967
artif..sweetener,0.255516,0.0032536858,0.0011184545,0.34375,1.3453169
baby.cosmetics,0.255516,0.0006100661,0.000305033,0.5,1.9568245
baby.food,0.255516,0.0001016777,0.0,0.0,0.0
bags,0.255516,0.0004067107,0.0001016777,0.25,0.9784123
baking.powder,0.255516,0.0176919166,0.009252669,0.5229885,2.0467935


#### Step 4 - Filter out rules with low support; sort by lift

In [15]:
many_rules %>%
  filter(joint_support > .05) %>%
  arrange(desc(lift))

lhs,sup_milk,sup_lhs,joint_support,conf,lift
yogurt,0.255516,0.1395018,0.0560244,0.4016035,1.571735
other.vegetables,0.255516,0.1934926,0.07483477,0.3867578,1.513634
rolls.buns,0.255516,0.1839349,0.05663447,0.3079049,1.205032


Interpretation of the first rule, $\{yogurt\} \rightarrow \{whole.milk\}$: 

* Whole milk appears in 25.6% of all transactions.  
* Among transactions that include yogurt, whole milk appears 40.2% of the time, a lift of 1.57.  
* In other words, knowing that yogurt was purchased increases the likelihood that milk was purchased by 57%, relative to the baseline rate at which milk is purchased.
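
The 57% figure can be recovered from the table directly. This is a small arithmetic sketch using the rounded values printed in the yogurt row; the variable names are chosen here for illustration.

```r
# Values copied from the yogurt row of the output above.
conf     <- 0.4016035   # P(milk | yogurt)
sup_milk <- 0.255516    # P(milk)

# Lift is the ratio of the conditional rate to the baseline rate.
lift <- conf / sup_milk

# Percent increase over the baseline rate.
pct_increase <- (lift - 1) * 100

round(lift, 3)
round(pct_increase, 1)
```

A lift of exactly 1 would mean that buying yogurt tells us nothing about milk; values below 1 would mean milk is less likely than its baseline.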