# Association Rules with Groceries Dataset

Adapted from Lantz (2015), Chapter 8


Our market basket analysis uses purchase data collected from one month of operation at a real-world grocery store. The data contains 9,835 transactions, or about 327 transactions per day (roughly 30 transactions per hour in a 12-hour business day), suggesting that the retailer is neither particularly large nor particularly small.

## Libraries and dataset

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(plotly) # for interactive visualizations
library(arules) # for association rules and data
library(arulesViz) # visualizing association rules
library(formattable) # for formatting numbers

In [None]:
data("Groceries", package = "arules")

Note that the Groceries dataset is stored in a special format called "transactions", which is suitable for association rule analysis:

In [None]:
Groceries

We can also import CSV data into a transactions object:

In [None]:
groceries <- arules::read.transactions("../data/csv/09_01_groceries.csv", sep = ",")

In [None]:
groceries

## Explore data

First let's inspect the first three transactions:

In [None]:
arules::inspect(groceries[1:3])

And let's view the structure of the object:

In [None]:
str(groceries)

How do we get the items in a transaction from this sparse representation?

From @itemInfo we can see that there are 169 distinct items across all transactions:

In [None]:
groceries@itemInfo[,1]

@data@p holds the cumulative number of items across the 9835 transactions, starting from 0:

In [None]:
head(groceries@data@p)

In [None]:
tail(groceries@data@p)

We have a total of 43367 items across the 9835 transactions.

We can extract them from @data@i using these cumulative counts:

In [None]:
starts <- groceries@data@p[1:3] + 1
ends <- groceries@data@p[2:4]

starts
ends

These are the positions in @data@i where each of the first three transactions starts and ends.

For example, the items of the 2nd transaction are stored at positions 5 to 7 of @data@i.

The values stored there are only zero-based item indices. To get the item names, we add 1 to each index and use the result to subset @itemInfo:

In [None]:
mapply(function(x,y) groceries@itemInfo[,1][groceries@data@i[seq(x,y)] + 1],
       starts,
       ends)
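The same decoding logic can be tried on a toy example that does not need the arules package. This is only a sketch: `item_labels`, `i`, and `p` are made-up stand-ins for `@itemInfo`, `@data@i`, and `@data@p`, and the three transactions are invented for illustration.

```r
# Toy data: 3 transactions over 4 items (all values are made up).
item_labels <- c("bread", "butter", "milk", "yogurt")

# Zero-based item indices, stored transaction after transaction
# (plays the role of groceries@data@i).
i <- c(0L, 2L, 1L, 2L, 3L, 2L)

# Cumulative item counts, starting at 0 (plays the role of groceries@data@p).
p <- c(0L, 2L, 5L, 6L)

# One-based start and end positions of each transaction inside `i`.
starts <- p[-length(p)] + 1
ends   <- p[-1]

# Decode each transaction into item names (+1 converts to R's 1-based indexing).
decoded <- Map(function(s, e) item_labels[i[s:e] + 1], starts, ends)
decoded

# Transaction sizes are the successive differences of the cumulative counts.
sizes <- diff(p)
sizes
```

Because the counts are cumulative, the last entry of `p` equals the total number of stored items, which is how the total of 43367 can be read from the tail of `@data@p`.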

Now let's see some summary statistics:

In [None]:
summary(groceries)

A density of 0.026 means that, out of the 9835 * 169 possible transaction-item combinations, only 2.6% (43367) actually occur in the dataset:

In [None]:
43367 / (9835 * 169) 

We can see that the most frequently purchased items are whole milk, other vegetables, rolls/buns, soda, and yogurt.

2159 transactions contain only one item, 1643 transactions contain two items, and so on.

The mean number of items per transaction is 4.409. We can also derive this value as follows:

In [None]:
43367 / 9835
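The hard-coded totals above can also be computed from the object itself. The sketch below assumes the arules package is installed and uses the `Groceries` copy bundled with it (assumed here to match the CSV-based `groceries` object) so that it runs on its own.

```r
library(arules)

# Bundled copy of the data, used so that this block is self-contained.
data("Groceries", package = "arules")

# Number of items in each transaction.
sizes <- size(Groceries)

# dim() returns the number of transactions and the number of distinct items.
dims <- dim(Groceries)

total_items <- sum(sizes)                        # all stored items
density     <- total_items / (dims[1] * dims[2]) # share of filled cells
mean_size   <- mean(sizes)                       # items per transaction

c(total_items = total_items, density = density, mean_size = mean_size)

# Distribution of transaction sizes (first few sizes only).
head(table(sizes))
```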

### Frequencies

We can examine the frequency of selected items:

In [None]:
arules::itemFrequency(groceries[,1:3]) %>% formattable::percent()

As we see, the alphabetically first items are not very frequent in the dataset.

Which items, then, are the most frequent?

For example, we can plot only those items with a support of at least 10% (items that appear in at least 10% of all transactions):

In [None]:
arules::itemFrequencyPlot(groceries, support = 0.1)

The plot is not very visually appealing, but it is easy to create.

What about the 20 most frequent items?

In [None]:
arules::itemFrequencyPlot(groceries, topN = 20)

### Image

In addition to looking at the items, it's also possible to visualize the entire sparse matrix.

To do so, use the image() function. The command to display the sparse matrix for the first five transactions is as follows:

In [None]:
arules::image(groceries[1:5])

The first, fourth, and fifth transactions contained four items each, since their rows have four cells filled in.

You can also see that rows three and five share an item, as do rows two and four (on the right side of the diagram).

We can also visualize a random sample of 100 transactions:

In [None]:
set.seed(1)
sample(groceries, 100) %>% arules::image()

## Train a model

We will use the arules::apriori() function:

In [None]:
?arules::apriori

We supply support, confidence and length parameters:

The support of an itemset or rule measures how frequently it occurs in the data.

The support can be calculated for any itemset or even a single item.

A rule's confidence is a measurement of its predictive power or accuracy.

It is defined as the support of the itemset containing both X and Y divided by the support of the itemset containing only X.

The lift of a rule measures how much more likely one item or itemset is purchased relative to its typical rate of purchase, given that you know another item or itemset has been purchased.

![arules](https://www.saedsayad.com/images/AR_1.png)
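The three measures can be worked through by hand on a tiny example. The sketch below uses base R only; the five baskets and the helper `support()` are made up for illustration and are not part of arules.

```r
# Five made-up baskets, used only for illustration.
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter"),
  c("milk")
)

# Support: share of baskets that contain every item in `itemset`.
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

x <- "bread"  # left-hand side (antecedent)
y <- "milk"   # right-hand side (consequent)

supp_xy <- support(c(x, y), baskets)      # 2 of 5 baskets contain both
conf_xy <- supp_xy / support(x, baskets)  # support(X, Y) / support(X)
lift_xy <- conf_xy / support(y, baskets)  # confidence / support(Y)

c(support = supp_xy, confidence = conf_xy, lift = lift_xy)
```

In this toy data the lift is below 1, meaning that baskets with bread are slightly less likely to contain milk than baskets in general. A lift above 1 would indicate the opposite.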

We'll start with a confidence threshold of 0.25, which means that in order to be included in the results, the rule has to be correct at least 25 percent of the time.

This will eliminate the most unreliable rules, while allowing some room for us to modify behavior with targeted promotions.

We are now ready to generate some rules. In addition to the minimum support and confidence parameters, it is helpful to set minlen = 2 to eliminate rules that contain fewer than two items. This prevents uninteresting rules from being created simply because the item is purchased frequently, for instance, {} → whole milk. This rule meets the minimum support and confidence because whole milk is purchased in over 25 percent of the transactions, but it isn't a very actionable insight.

In [None]:
groceryrules <- arules::apriori(groceries,
                               parameter = list(support = 0.006,
                                               confidence = 0.25,
                                               minlen = 2))

The model created 463 rules.

Let's visualize them in an interactive scatterplot:

In [None]:
plot(groceryrules,
     measure = c("support", "confidence"),
     shading = "lift",
     engine = "plotly")

The x-axis shows support and the y-axis shows confidence, while the shading indicates the lift value.

Since plotly is used as the engine, the plot is interactive, with tooltips on hover.

We can also aggregate all rules into k groups:

- In each group, the most frequent and important LHS items appear in the columns, while the RHS items appear in the rows
- The size of each circle shows the support, while the tone of the color shows the lift value
- The aggregator function for the values in each group is median by default

In [None]:
plot(groceryrules,
        method = "grouped",
        control = list(k = 20))

## Evaluate performance

Let's first look at the summary of the model:

In [None]:
summary(groceryrules)

We have 150 rules with only two items (counting both LHS and RHS items) and 297 rules with three items.

Let's inspect some of the rules:

In [None]:
arules::inspect(groceryrules[1:3])

The first rule can be read in plain language as, "if a customer buys potted plants, they will also buy whole milk."

With support of 0.007 and confidence of 0.400, we can determine that this rule covers 0.7 percent of the transactions and is correct in 40 percent of purchases involving potted plants.

The lift value tells us how much more likely a customer is to buy whole milk relative to the average customer, given that they bought a potted plant.

Since we know that about 25.6 percent of the customers bought whole milk (support), while 40 percent of the customers buying a potted plant bought whole milk (confidence), we can compute the lift value as 0.40 / 0.256 = 1.56, which matches the value shown.

## Improve model performance

### Sorting the set of association rules 

Depending on the objectives of the market basket analysis, the most useful rules might be the ones with the highest support, confidence, or lift.

The arules package includes a sort() function that can be used to reorder the list of rules so that the ones with the highest or lowest values of the quality measure come first:

In [None]:
arules::sort(groceryrules, by = "lift")[1:5] %>% arules::inspect()

These rules appear to be more interesting than the ones we looked at previously. The first rule, with a lift of about 3.96, implies that people who buy herbs are nearly four times more likely to buy root vegetables than the typical customer—perhaps for a stew of some sort?

Rule two is also interesting. Whipped cream is over three times more likely to be found in a shopping cart with berries versus other carts, suggesting perhaps a dessert pairing?
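Sorting works the same way for the other quality measures and in either direction. The sketch below assumes arules is installed and rebuilds the rules from the bundled `Groceries` data (assumed to match the CSV) so that it runs on its own; the object names are illustrative.

```r
library(arules)

# Rebuild the rule set so this block is self-contained.
data("Groceries", package = "arules")
rules <- apriori(Groceries,
                 parameter = list(support = 0.006, confidence = 0.25, minlen = 2),
                 control = list(verbose = FALSE))

# Highest confidence first (decreasing order is the default).
by_conf <- sort(rules, by = "confidence")
inspect(by_conf[1:5])

# Lowest lift first: the rules that barely beat the baseline purchase rate.
lowest_lift <- sort(rules, by = "lift", decreasing = FALSE)[1:5]
inspect(lowest_lift)
```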

### Taking subsets of association rules

The subset() function provides a method to search for subsets of transactions, items, or rules.

To find any rules in which berries appear, use the following command. It will store the rules in a new object named berryrules:

In [None]:
berryrules <- arules::subset(groceryrules, items %in% "berries")

In [None]:
arules::inspect(berryrules)

The subset() function is very powerful. The criteria for choosing the subset can be defined with several keywords and operators:

- The keyword items, explained previously, matches an item appearing anywhere in the rule. To limit the subset to where the match occurs only on the left- or right-hand side, use lhs and rhs instead.

- The operator %in% means that at least one of the items must be found in the list you defined. If you want any rules matching either berries or yogurt, you could write items %in% c("berries", "yogurt").

- Additional operators are available for partial matching (%pin%) and complete matching (%ain%). Partial matching allows you to find both citrus fruit and tropical fruit using one search: items %pin% "fruit". Complete matching requires that all the listed items are present. For instance, items %ain% c("berries", "yogurt") finds only rules with both berries and yogurt.

- Subsets can also be limited by support, confidence, or lift. For instance, confidence > 0.50 would limit you to the rules with confidence greater than 50 percent.

- Matching criteria can be combined with the standard R logical operators such as and (&), or (|), and not (!). Using these options, you can limit the selection of rules to be as specific or general as you need.
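The operators listed above can be compared side by side. The sketch below assumes arules is installed and rebuilds the rules from the bundled `Groceries` data (assumed to match the CSV) so that it runs on its own; the object names and the 0.3 confidence cut-off are illustrative choices.

```r
library(arules)

# Rebuild the rule set so this block is self-contained.
data("Groceries", package = "arules")
rules <- apriori(Groceries,
                 parameter = list(support = 0.006, confidence = 0.25, minlen = 2),
                 control = list(verbose = FALSE))

# %in%: rules containing berries OR yogurt anywhere in the rule
either_rules <- subset(rules, items %in% c("berries", "yogurt"))

# %ain%: rules containing berries AND yogurt
both_rules <- subset(rules, items %ain% c("berries", "yogurt"))

# %pin%: partial match on the item label
fruit_rules <- subset(rules, items %pin% "fruit")

# Restrict the match to the left-hand side and combine it with a quality measure
berry_lhs <- subset(rules, lhs %in% "berries" & confidence > 0.3)

# Number of rules selected by each criterion
sapply(list(either = either_rules, both = both_rules,
            fruit = fruit_rules, berry_lhs = berry_lhs), length)
```

Note that a partial match on "fruit" also picks up labels such as pip fruit and fruit/vegetable juice, so it is worth inspecting the result before relying on it.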

Now let's plot those rules as a graph. The size of each circle indicates the lift of the rule, while the color shows its confidence:

In [None]:
plot(berryrules, method = "graph", measure = "lift", shading = "confidence")

Another option to show a rule is a double-decker plot:

In [None]:
inspect(berryrules[1])

In [None]:
plot(berryrules[1], method = "doubledecker", data = groceries)

The area of blocks gives the support and the height of the “yes” blocks is proportional to the confidence for the rules consisting of the antecedent items marked as “yes.”

Items that show a significant jump in confidence when changed from “no” to “yes” are interesting.

Let's view a more interesting example:

First, create a subset of rules where the size of the LHS is 3, confidence is above 0.4, and lift is above 3:

In [None]:
subset3 <- subset(groceryrules, size(lhs) == 3 & confidence > 0.4 & lift > 3)

Let's inspect the rules:

In [None]:
inspect(subset3)

Now visualize the first rule as a double-decker plot:

In [None]:
plot(subset3[1], method = "doubledecker", data = groceries)

As we see from the rightmost columns, adding tropical fruit to root vegetables produces a clear jump in confidence, both with and without whole milk.

## Export rules

We can export rules into a data frame:

In [None]:
groceryrules_dt <- as(groceryrules, "data.frame") %>% as.data.table()

In [None]:
str(groceryrules_dt)

In [None]:
groceryrules_dt
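Rules can also be written straight to a CSV file with the write() method that arules provides for rule objects. The sketch below assumes arules is installed, rebuilds the rules from the bundled `Groceries` data so that it runs on its own, and writes to a temporary file; `out_file` is a placeholder path.

```r
library(arules)

# Rebuild the rule set so this block is self-contained.
data("Groceries", package = "arules")
rules <- apriori(Groceries,
                 parameter = list(support = 0.006, confidence = 0.25, minlen = 2),
                 control = list(verbose = FALSE))

# Placeholder output location; replace with a real path as needed.
out_file <- tempfile(fileext = ".csv")

# Write one row per rule, with the quality measures as columns.
write(rules, file = out_file, sep = ",", quote = TRUE, row.names = FALSE)

# Read the file back to check what was written.
exported <- read.csv(out_file)
str(exported)
```

Quoting matters here because rule labels with several items contain commas.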