
# Computing Association Rules with `dplyr`

### Review - Association Rules

Consider the rule $\{butter\} \rightarrow \{whole.milk\}$

  * $Support(\textrm{butter and milk}) = \frac{\textrm{# butter and milk transactions}}{\textrm{# total transactions}}$
  * $Support(\textrm{butter}) = \frac{\textrm{# butter transactions}}{\textrm{# total transactions}}$
  * $Confidence= \frac{Support(\textrm{butter and milk})}{Support(\textrm{butter})}$
  * $Lift= \frac{Confidence}{Support(\textrm{milk})}$
  

### Small example:  Compute the confidence and lift of {bread} -> {milk}


<img width="350" src="https://github.com/yardsale8/DSCI_210_R_notebooks/blob/main/img/small_example.png?raw=1">


Use `dplyr` to:  

  * mutate to compute joint transactions
  * summarize to compute counts and percents
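
As a rough sketch of these two steps, the cell below uses a small hypothetical table of five transactions. The values are made up for illustration and are not the data in the image above; the names `toy` and `toy_rule` are also assumptions.

```r
library(dplyr)

# Hypothetical transactions: 1 = item purchased, 0 = not purchased
toy <- data.frame(
  bread = c(1, 1, 0, 1, 0),
  milk  = c(1, 0, 1, 1, 1)
)

(toy
 %>% mutate(both = bread * milk)                    # joint transactions
 %>% summarize(support_bread = sum(bread)/n(),      # counts turned into percents
               support_milk  = sum(milk)/n(),
               support_both  = sum(both)/n())
 %>% mutate(confidence = support_both/support_bread,
            lift = confidence/support_milk)
) -> toy_rule

toy_rule
```

With these made-up values the lift falls below 1, meaning that in this toy table bread buyers are slightly less likely than average to buy milk.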
  

### New example: investigate rule $\{butter\} \longrightarrow \{milk\}$ with `dplyr`
  

In [1]:
groceries <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Groceries.csv')
head(groceries)

Unnamed: 0_level_0,frankfurter,sausage,liver.loaf,ham,meat,finished.products,organic.sausage,chicken,turkey,pork,⋯,candles,light.bulbs,sound.storage.medium,newspapers,photo.film,pot.plants,flower.soil.fertilizer,flower..seeds.,shopping.bags,bags
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0


In [2]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [3]:
(groceries
 %>% select(butter, whole.milk)
 ) -> butter_milk

butter_milk %>% head

Unnamed: 0_level_0,butter,whole.milk
Unnamed: 0_level_1,<int>,<int>
1,0,0
2,0,0
3,0,1
4,0,0
5,0,1
6,1,1


#### Support(Butter): 2 steps

In [4]:
(butter_milk
%>% summarize(Nbutter = sum(butter),
              N = n())
%>% mutate(support_butter = Nbutter/N)
)

Nbutter,N,support_butter
<int>,<int>,<dbl>
545,9835,0.05541434


#### Support(Butter): all at once

In [6]:
(butter_milk
 %>% summarize(support_butter = sum(butter)/n())
)

support_butter
<dbl>
0.05541434


#### Support of whole.milk

In [7]:
(butter_milk
%>% summarize(support_milk = sum(whole.milk)/n())
)

support_milk
<dbl>
0.255516


#### Support of $\{Butter\;and\;Milk\}$

Why `butter * whole.milk`?

In [8]:
(butter_milk
%>% summarize(support_rule = sum(butter * whole.milk)/n())
)

support_rule
<dbl>
0.02755465
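
One way to answer the question above: each column is a 0/1 indicator, so the product is 1 only in rows where both items were purchased, and summing the product counts the joint transactions. The vectors below are hypothetical and cover the four possible combinations.

```r
# Hypothetical indicators covering every combination of the two items
butter     <- c(0, 0, 1, 1)
whole.milk <- c(0, 1, 0, 1)

# The product is 1 only when both indicators are 1
both <- butter * whole.milk
both

# Summing the product counts the rows that contain both items
sum(both)
```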


#### Trying to combine the previous steps (won't work!)

In [10]:
(butter_milk
 %>% summarize(support_butter = sum(butter)/n())                # First summarize collapses the data to one row (simple aggregation)
 %>% summarize(support_milk = sum(whole.milk)/n())              # Can no longer aggregate the collapsed data
 %>% summarize(support_rule = sum(butter * whole.milk)/n())
)

ERROR: Error in `summarize()`:
ℹ In argument: `support_milk = sum(whole.milk)/n()`.
Caused by error:
! object 'whole.milk' not found


#### Solution - Combine summaries in one `summarize`

In [11]:
(groceries
%>% summarize(support_milk = sum(whole.milk)/n(),
              support_butter = sum(butter)/n(),
              support_rule = sum(butter *  whole.milk)/n())
)

support_milk,support_butter,support_rule
<dbl>,<dbl>,<dbl>
0.255516,0.05541434,0.02755465


#### All together now (+ confidence and lift)

In [12]:
(groceries
%>% summarize(support_milk = sum(whole.milk)/n(),
              support_butter = sum(butter)/n(),
              support_rule = sum(butter *  whole.milk)/n())
%>% mutate(confidence = support_rule/support_butter)
%>% mutate(lift = confidence/support_milk)
)

support_milk,support_butter,support_rule,confidence,lift
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.255516,0.05541434,0.02755465,0.4972477,1.946053


#### Notes

* Must compute values before you use them
    * Supports before confidence
    * Confidence before lift
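
The same ordering applies inside a single `mutate` call, since dplyr evaluates the arguments in order and a later argument can use a column created by an earlier one. A minimal sketch, assuming the supports are typed in by hand from the output above (`supports` and `rule_metrics` are illustrative names):

```r
library(dplyr)

# Supports copied from the earlier output for {butter} -> {whole.milk}
supports <- data.frame(support_milk   = 0.255516,
                       support_butter = 0.05541434,
                       support_rule   = 0.02755465)

(supports
 %>% mutate(confidence = support_rule/support_butter,   # computed first
            lift = confidence/support_milk)             # uses confidence from the line above
) -> rule_metrics

rule_metrics
```

Swapping the two arguments would fail, because `lift` would then refer to a `confidence` column that does not exist yet.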

## <font color="red"> Exercise 9.3.1 </font>

Use a similar approach to compute the above values for the rule $\{\text{domestic.eggs}\}\longrightarrow\{\text{ham}\}$.

In [None]:
# Your code here

### Computing Many Rules At Once


* Stack the LHS into one column
* Group by LHS
* Compute:
    * Support
    * Confidence
    * Lift
  

In [13]:
groceries <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Groceries.csv')
N <- nrow(groceries)

#### Step 1 - Stack all of the other products

In [14]:
(groceries
 %>% gather(key = "lhs",
            value = "pur_lhs",
            -whole.milk)
) -> groceries_stacked

groceries_stacked %>% head

Unnamed: 0_level_0,whole.milk,lhs,pur_lhs
Unnamed: 0_level_1,<int>,<chr>,<int>
1,0,frankfurter,0
2,0,frankfurter,0
3,1,frankfurter,0
4,0,frankfurter,0
5,1,frankfurter,0
6,1,frankfurter,0


#### Step 2 - Compute the support, confidence, and lift for each

In [15]:
# Note that we group_by the products (lhs) to keep each rule separate.
(groceries_stacked
 %>% group_by(lhs)
 %>% summarize(sup_milk = sum(whole.milk)/n(),
               sup_lhs = sum(pur_lhs)/n(),
               joint_support = sum(whole.milk*pur_lhs)/n())
 %>% mutate(conf = joint_support/sup_lhs)
 %>% mutate(lift = conf/sup_milk)
) -> many_rules

many_rules %>% head

lhs,sup_milk,sup_lhs,joint_support,conf,lift
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Instant.food.products,0.255516,0.0080325369,0.00305033,0.3797468,1.486196
UHT.milk,0.255516,0.0334519573,0.00396543,0.118541,0.463928
abrasive.cleaner,0.255516,0.0035587189,0.001626843,0.4571429,1.789097
artif..sweetener,0.255516,0.0032536858,0.001118454,0.34375,1.345317
baby.cosmetics,0.255516,0.0006100661,0.000305033,0.5,1.956825
baby.food,0.255516,0.0001016777,0.0,0.0,0.0


#### Step 3 - Sort by lift

We can use the `arrange` function with `desc` to sort by the lift from largest to smallest.

In [19]:
(many_rules
 %>% arrange(desc(lift))
 %>% head
 )

lhs,sup_milk,sup_lhs,joint_support,conf,lift
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
kitchen.utensil,0.255516,0.0004067107,0.000305033,0.75,2.935237
honey,0.255516,0.0015251652,0.0011184545,0.7333333,2.870009
cereals,0.255516,0.0056939502,0.0036603965,0.6428571,2.515917
rice,0.255516,0.0076258261,0.0046771734,0.6133333,2.400371
rubbing.alcohol,0.255516,0.0010167768,0.0006100661,0.6,2.348189
cocoa.drinks,0.255516,0.002236909,0.0013218099,0.5909091,2.312611


#### Step 4 - filter rules with high joint support; sort by lift

In [20]:
(many_rules
 %>% filter(joint_support > .05)
 %>% arrange(desc(lift))
)

lhs,sup_milk,sup_lhs,joint_support,conf,lift
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
yogurt,0.255516,0.1395018,0.0560244,0.4016035,1.571735
other.vegetables,0.255516,0.1934926,0.07483477,0.3867578,1.513634
rolls.buns,0.255516,0.1839349,0.05663447,0.3079049,1.205032


In [21]:
(many_rules
 %>% filter(joint_support > .05)
 %>% arrange(lift)
)

lhs,sup_milk,sup_lhs,joint_support,conf,lift
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
rolls.buns,0.255516,0.1839349,0.05663447,0.3079049,1.205032
other.vegetables,0.255516,0.1934926,0.07483477,0.3867578,1.513634
yogurt,0.255516,0.1395018,0.0560244,0.4016035,1.571735


Interpretation of the yogurt rule (the highest lift of these three):

* Milk is purchased in 25.6% of all transactions.  
* Knowing yogurt was also purchased 'lifts' this rate of purchase by 57% (lift of about 1.57).  
* In other words, knowing yogurt was purchased increases the likelihood that milk was purchased by 57%, relative to the baseline rate at which milk is purchased.
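
The 57% figure comes directly from the lift. The arithmetic can be checked with the values from the yogurt row of the table, typed in by hand (the variable names are illustrative):

```r
# Values copied from the yogurt row above
sup_milk    <- 0.255516    # baseline rate of milk purchases
conf_yogurt <- 0.4016035   # rate of milk purchases among yogurt transactions

lift <- conf_yogurt / sup_milk
pct_increase <- (lift - 1) * 100   # relative increase over the baseline, in percent

round(lift, 3)
round(pct_increase)
```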

## <font color="red"> Exercise 9.3.2 </font>

Use a similar approach to compute all rules of the form $\{\text{<something>}\}\longrightarrow\{\text{ham}\}$, then answer the following questions.

1. Which of these rules is least useful in the prediction of `ham`?  Explain how you made this determination.
2. Which items would you say are the most useful for predicting `ham`?  Explain.  


In [None]:
# Your code here

<font color="orange">
Your answers here
</font>