
### Introduction

* `dplyr` works well when we use a single LHS item to predict a single RHS item, *but...*
  * What about rules with multiple LHS items?
  * How do we find the best rule across all possible RHS items?

The big idea: We need a better "search" algorithm!
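
To get a rough sense of why a manual search does not scale, the sketch below counts how many candidate rules exist when the RHS is a single item and the LHS is any non-empty set of the remaining items. The helper name `count_single_rhs_rules` is hypothetical and is not part of `dplyr` or `arules`.

```r
# Count candidate rules with exactly one RHS item and a non-empty LHS
# drawn from the other items (hypothetical helper, for illustration only).
count_single_rhs_rules <- function(n_items) {
  # For each choice of RHS item, the LHS can be any non-empty subset
  # of the remaining n_items - 1 items.
  n_items * (2^(n_items - 1) - 1)
}

count_single_rhs_rules(3)    # 3 * (4 - 1) = 9
count_single_rhs_rules(10)   # 10 * (512 - 1) = 5110
count_single_rhs_rules(20)   # grows exponentially with the number of items
```

Even a modest catalog produces far more candidates than can be checked one at a time, which is the motivation for a dedicated search algorithm.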

### Automation with `arules`

We can automate the rule search with the `arules` package.

In [1]:
install.packages("arules")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [2]:
library(arules)

Loading required package: Matrix


Attaching package: ‘arules’


The following objects are masked from ‘package:base’:

    abbreviate, write




In [3]:
library(dplyr)
library(tidyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:arules’:

    intersect, recode, setdiff, setequal, union


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘tidyr’


The following objects are masked from ‘package:Matrix’:

    expand, pack, unpack




In [None]:
groceries <- read.csv("https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Groceries.csv")
head(groceries)

### Formatting the Data

The data needs to be transformed into an object of class *transactions*.

In [None]:
groceries2 <- (groceries 
                %>% mutate(id = row_number())                             # tag each transaction
                %>% gather(key = "item", value = "val", frankfurter:bags) # wide to long
                %>% mutate(val = val != 0)                                # 0/1 to FALSE/TRUE
                %>% spread(key = item, value = val)                       # long back to wide
                %>% select(-id)
               )
head(groceries2)
groceries3 <- as(groceries2,"transactions")
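
As an alternative sketch of the same idea, a logical matrix (rows are transactions, columns are items) can also be coerced to *transactions*. The toy data frame below is hypothetical and only assumes the same 0/1 layout as the groceries file.

```r
library(arules)

# Hypothetical toy data in a 0/1 layout: one row per transaction
toy <- data.frame(
  milk   = c(1, 1, 0, 1),
  bread  = c(1, 0, 1, 1),
  butter = c(0, 1, 0, 1)
)

# Comparing the matrix with 1 yields a logical matrix that keeps the column names
toy_trans <- as(as.matrix(toy) == 1, "transactions")

length(toy_trans)       # number of transactions
itemLabels(toy_trans)   # item names, taken from the column names
size(toy_trans)         # number of items in each transaction
```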


### Exploring the Dataset
We can use the `summary` function to explore the data.

In [None]:
summary(groceries3)

### Exploring the first five transactions
We can use the `inspect` function to explore the first five transactions.

In [None]:
groceries3 %>% head(5) %>% inspect()

### Determining how often each item was purchased
We can use the `itemFrequency` function to determine what proportion of transactions included each item. Then, we can use the `itemFrequencyPlot` function to visualize the top 10 most frequently purchased items.

In [None]:
itemFrequency(groceries3)
itemFrequencyPlot(groceries3,topN=10)
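
A small sketch of the same functions on hypothetical toy transactions (the object names `toy_trans`, `rel_freq`, and `abs_freq` are made up for illustration). It shows that `itemFrequency` returns proportions by default and raw counts with `type = "absolute"`, and that the result is a named vector that can be sorted.

```r
library(arules)

# Hypothetical toy transactions, built from a list of item vectors
toy_trans <- as(list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread"),
  c("milk", "bread", "butter"),
  c("milk", "bread")
), "transactions")

rel_freq <- itemFrequency(toy_trans)                     # proportions (the default)
abs_freq <- itemFrequency(toy_trans, type = "absolute")  # raw counts

# Most frequently purchased items first
sort(rel_freq, decreasing = TRUE)
```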

### Using the apriori algorithm


* The `apriori` function finds associations between the items in the dataset; the `parameter =` argument sets the minimum thresholds.
* Note the defaults: `parameter = list(support = .1, confidence = .8, maxlen = 10)`
    * `minlen = 2` removes rules that contain fewer than two items (that is, rules with an empty LHS)
    * `maxlen = 2` limits rules to two items, so together with `minlen = 2` each rule has one item on the LHS and one on the RHS
    * **The "support" filter refers to the JOINT support, *support{LHS, RHS}*!**
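
To make the joint support concrete, here is a minimal base R sketch that computes support, confidence, and lift by hand for the rule {bread} => {milk}. The toy data and variable names are hypothetical.

```r
# Hypothetical toy data: each row is one transaction, 1 = item purchased
toy <- data.frame(
  milk  = c(1, 1, 0, 1, 1),
  bread = c(1, 0, 1, 1, 0)
)

supp_lhs   <- mean(toy$bread == 1)                   # support{bread}
supp_rhs   <- mean(toy$milk == 1)                    # support{milk}
supp_joint <- mean(toy$bread == 1 & toy$milk == 1)   # support{bread, milk}, the JOINT support

confidence <- supp_joint / supp_lhs   # share of bread transactions that also contain milk
lift       <- confidence / supp_rhs   # confidence relative to the baseline rate of milk

c(support = supp_joint, confidence = confidence, lift = lift)
```

The `support` threshold passed to `apriori` is compared against `supp_joint`, not against the support of the LHS or RHS alone.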

In [None]:
groc_rules <- apriori(groceries3, 
                      parameter = list(supp = 0.01,
                                       conf = 0.25,
                                       minlen = 2))
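
The effect of `minlen` and `maxlen` can be checked on a small example. The sketch below uses hypothetical toy transactions and thresholds chosen only for illustration; `control = list(verbose = FALSE)` silences the progress output.

```r
library(arules)

# Hypothetical toy transactions
toy_trans <- as(list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread"),
  c("milk", "bread", "butter"),
  c("milk", "bread")
), "transactions")

# minlen = 2 and maxlen = 2 together: one item on the LHS, one on the RHS
toy_rules <- apriori(toy_trans,
                     parameter = list(supp = 0.1,
                                      conf = 0.5,
                                      minlen = 2,
                                      maxlen = 2),
                     control = list(verbose = FALSE))

size(toy_rules)      # number of items (LHS + RHS) in each rule
inspect(toy_rules)
```

`apriori` may warn that mining stopped because `maxlen` was reached; that is expected here.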
                    

### Evaluating the results

Use `summary()` to get an overview of the association rules.

In [None]:
summary(groc_rules)

We can use `inspect()` to inspect the individual rules:

In [None]:
inspect(groc_rules[1:10]) 

**Remember, the `support` column is the JOINT support of {LHS,RHS}**

### Sorting the association rules

We can use the `sort()` function to sort rules according to support, confidence, or lift.

In [None]:
groc_rules %>% sort(by = "lift", decreasing = TRUE) %>% head(10) %>% inspect()
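
As a quick check of how `sort()` orders rules, the sketch below sorts hypothetical toy rules and pulls out the sorted measure with `quality()`. In `arules`, `sort()` uses `decreasing = TRUE` by default, so the explicit argument is only needed for ascending order.

```r
library(arules)

# Hypothetical toy transactions and rules
toy_trans <- as(list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread"),
  c("milk", "bread", "butter"),
  c("milk", "bread")
), "transactions")

toy_rules <- apriori(toy_trans,
                     parameter = list(supp = 0.1, conf = 0.1, minlen = 2),
                     control = list(verbose = FALSE))

by_lift     <- sort(toy_rules, by = "lift")                            # largest lift first
by_conf_asc <- sort(toy_rules, by = "confidence", decreasing = FALSE)  # smallest confidence first

quality(by_lift)$lift   # the lift column, in sorted order
```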

### Pull out the rules with whole.milk

We can use the `subset()` function and the `%in%` operator to filter rules.

In [None]:
groc_rules %>%
  subset(rhs %in% 'whole.milk') %>%
  sort(by = 'lift', decreasing = TRUE) %>% head(10) %>%
  inspect()
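
`arules` also provides matching operators beyond `%in%`, which help when filtering on multiple LHS items. The sketch below runs on hypothetical toy rules and contrasts `%in%` (any of the items), `%ain%` (all of the items), and `%pin%` (partial match on the item label).

```r
library(arules)

# Hypothetical toy transactions and rules
toy_trans <- as(list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread"),
  c("milk", "bread", "butter"),
  c("milk", "bread")
), "transactions")

toy_rules <- apriori(toy_trans,
                     parameter = list(supp = 0.1, conf = 0.1, minlen = 2),
                     control = list(verbose = FALSE))

milk_rhs <- subset(toy_rules, rhs %in% "milk")                 # RHS is milk
any_lhs  <- subset(toy_rules, lhs %in% c("bread", "butter"))   # LHS contains bread OR butter
all_lhs  <- subset(toy_rules, lhs %ain% c("bread", "butter"))  # LHS contains bread AND butter
partial  <- subset(toy_rules, lhs %pin% "bu")                  # LHS has an item whose label contains "bu"

inspect(all_lhs)
```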


### A few more examples

#### Finding the 20 best rules for predicting whole milk, considering rules with at least 1% support.

In [None]:
milk_rules_1pct <- subset(groc_rules, rhs %in% 'whole.milk' & support >= .01)
milk_rules_1pct %>% 
  sort(by = "lift") %>%
  head(20) %>% 
  inspect() 

#### Finding the 10 best rules overall, among rules with at least 2% support. 

In [None]:
# Note: despite the name, rules_10pct holds the rules with at least 2% support
rules_10pct <- subset(groc_rules, support >= .02)
rules_10pct %>% 
  sort(by="lift") %>%
  head(10) %>% 
  inspect()

### Visualizing association rules


The `arulesViz` package can be used to visualize and interact with individual rules. 

In [4]:
install.packages('arulesViz')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘iterators’, ‘foreach’, ‘zoo’, ‘tweenr’, ‘polyclip’, ‘RcppEigen’, ‘gridExtra’, ‘RcppArmadillo’, ‘later’, ‘TSP’, ‘qap’, ‘gclus’, ‘ca’, ‘registry’, ‘lmtest’, ‘Rcpp’, ‘ggforce’, ‘ggrepel’, ‘viridis’, ‘tidygraph’, ‘graphlayouts’, ‘htmlwidgets’, ‘crosstalk’, ‘promises’, ‘lazyeval’, ‘seriation’, ‘vcd’, ‘igraph’, ‘scatterplot3d’, ‘ggraph’, ‘DT’, ‘plotly’, ‘visNetwork’




In [12]:
library(arulesViz)

#### Scatter plot with `color = lift`

In [None]:
plot(rules_10pct)

#### Rearranged scatter plot with `color = confidence`

In [None]:
#change the visual encoding:
plot(rules_10pct, measure = c('support','lift'), shading = 'confidence')

#### More on plotting rules

In [None]:
plot(rules_10pct, method = "grouped")

In [None]:
plot(rules_10pct, method = "graph")

# The graph is easier to read when limited to the top five rules by lift
rules_10pct %>%
  sort(by="lift") %>%
  head(5) %>% 
  plot(method = "graph")