-
Notifications
You must be signed in to change notification settings - Fork 7
Expand file tree
/
Copy pathcubist-and-rulefit.Rmd
More file actions
286 lines (213 loc) · 10.5 KB
/
cubist-and-rulefit.Rmd
File metadata and controls
286 lines (213 loc) · 10.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
---
title: "What's the difference between Cubist and RuleFit?"
---
```{r}
#| include: false
if (rlang::is_installed(c("modeldata", "recipes", "Cubist", "xrf"))) {
run <- TRUE
} else {
run <- FALSE
}
knitr::opts_chunk$set(
eval = run,
collapse = TRUE,
comment = "#>"
)
library(dplyr)
library(purrr)
library(rlang)
library(tidyr)
library(recipes)
```
[Cubist](https://www.rulequest.com/cubist-win.html) and [RuleFit](https://statweb.stanford.edu/~jhf/R_RuleFit.html) are two rule-based regression models. They are similar in some ways but otherwise very different. This is a short description of their approaches.
Cubist is for numeric outcomes while RuleFit can work with numeric and categorical outcomes. For this document, we'll focus on numeric outcomes (without loss of generality).
## Both start by making rules from trees
For each model, an an initial tree-based model is created. Let's look at a really simple example tree:
```r
if (A < a) {
if (B < b) {
node <- 1
} else {
node <- 2
}
} else
node <- 3
}
```
In this case, two predictors are used in splits: `A` and `B`. With this model, the split to the left is the "<" condition. There are three terminal nodes. This tree-based structure can be converted to a set of distinct rules. Each rule is a collection of `if/then` statements that define the path to the terminal nodes. In this example:
```r
rule_1 <- (A < a) & (B < b)
rule_2 <- (A < a) & (B >= b)
rule_3 <- (A >= a)
```
As is, these paths through the tree are mutually exclusive. For deep trees, these rules can become overly complex and may contain redundancies. Cubist takes the approach of simplifying rules whenever possible (Quinlan (1987)). For example,
```r
rule_x <- (A < 10) & (A < 7) & (B >= 1)
```
becomes
```r
rule_y <- (A < 7) & (B >= 1)
```
Cubist also has an approach that can _prune_ conditions inside of the rule that simplifies the structure without degrading performance. If `B` doesn't matter much in the rule above, then Cubist could reduce it to `A < 7`.
Cubist and RuleFit are also ensemble methods. For RuleFit, any tree ensemble model could generate a set of rules. For the `xrf` package, the `xgboost` package creates the initial set of rules. The rules create be each tree are pooled into a broader rule set. Cubist takes a different approach and uses model committees (see [_APM_](http://appliedpredictivemodeling.com/) Chapter 14). This is similar to boosting but creates a pseudo-outcome for each tree in the ensemble. This outcome adjusts the trees based on the size of the residual from the previous tree.
At the end of the rule generating process, the rules are unlikely to mutually exclusive. Any data point is likely to be fall into multiple rules.
Now let's look at how each model uses the rules.
## Cubists makes a model for each rule
Before delving into the rules, let's start with the _model tree_ initially created by Cubist (Quinlan (1992)). An initial tree is created and a regression model is created for _each split in the tree_. Each linear regression is created on the subset of the data covered by the current rule and uses only the predictors in the current rule. In the tree shown above, the first split produces two models with `A` as a predictor. The data for each are filtered either with `A < a` or `A >= a`. The entire set of models are:
```{r}
#| eval: false
split_1_low <- filter(data, A < a)
model_1_low <- lm(y ~ A, data = split_1_low)
split_1_high <- filter(data, A >= a)
model_1_high <- lm(y ~ A, data = split_1_high)
split_2_low <- filter(data, A < a & B < b)
model_2_low <- lm(y ~ A + B, data = split_2_low)
split_2_high <- filter(data, A < a & B >= b)
model_2_high <- lm(y ~ A + B, data = split_2_high)
```
Cubist does some feature selection on these models so they may not contain all of the possible predictors. Also, since Cubist prunes the rules, there may be not appear to be a connection between the variables used in a rule and its corresponding model.
The models associated with the rules are actually average of many models further up the tree. Since these are linear models, the models used by Cubist for each rule have coefficients that are averages:
| Rule | Logic | Blended Models |
|------|-------------------|----------------------------------|
| 1 | `A < a & B < b` | `model_1_low` and `model_2_low` |
| 2 | `A < a & B >= b` | `model_1_low` and `model_2_high` |
| 2 | `A >= a` | `model_1_high` |
The equations for averaging were first described in Quinlan (1992) but the updated equation for Cubist can be found in Chapter 14 of [_APM_](http://appliedpredictivemodeling.com/).
There is a model for each committee and rule within committee. When predicting, a new observation is compared to the conditions in the rules to determine which rules are active for this data point. The active linear models predict the new sample and these predictions are averaged to produce the final prediction value.
Perhaps unrelated to this document, which focuses on how rules are used, Cubist also does a nearest-neighbor correction to the predicted values (Quinlan R (1993)).
Let's look at an example model on the Palmer penguin data:
```{r}
#| label: cubist-penguins
library(rules)
data(penguins, package = "modeldata")
cubist_fit <-
cubist_rules(committees = 2) |>
set_engine("Cubist") |>
fit(body_mass_g ~ ., data = penguins)
cubist_fit
```
The `summary()` function shows the details of the rules.
```{r}
#| label: cubist-summary
summary(cubist_fit$fit)
```
Note that the only rule in the second committee has no conditions; all data points being predicted are affected by that rule. Also, it is possible for the linear model within a rule to only contain an intercept.
The `tidy()` function can extract the rule and model data:
```{r}
#| label: cubist-tidy
cb_res <- tidy(cubist_fit)
cb_res
```
The `estimate` and `statistic` columns contain tibbles with parameter estimates and rule statistics, respectively. They can be easily expanded using `unnest()`:
```{r}
#| label: stats
library(tidyr)
cb_res |>
dplyr::select(committee, rule_num, statistic) |>
unnest(cols = c(statistic))
```
For the first committee, rule four contains two conditions and covers 23 data points from the original data set. We can find these data points by converting the character string for the rule to an R expression, then evaluate the expression on the data set:
```{r}
#| label: cubist-expr
library(dplyr)
library(purrr)
library(rlang)
rule_4_filter <-
cb_res |>
dplyr::filter(rule_num == 4) |>
pluck("rule") |> # <- character string
parse_expr() |> # <- R expression
eval_tidy(penguins) # <- logical vector
penguins |>
dplyr::slice(which(rule_4_filter))
```
## RuleFit makes one model with rule predictors
RuleFit uses rules in a more straightforward way. It creates an initial tree (or ensemble of trees) from the data. Rules are extracted from this initial model and converted to a set of binary features. These features are then added to a regularized regression model (along with the original columns). For example:
```{r}
#| label: rule-fit
#| eval: false
data_with_rules <-
data |>
mutate(
rule_1 = ifelse(A < a & B < b, 0, 1),
rule_2 = ifelse(A < a & B >= b, 0, 1),
rule_2 = ifelse(A >= a, 0, 1)
)
rule_fit_model <-
glmnet(x = data_with_rules |> select(A, B, starts_with("rule_") |> as.matrix(),
y = data_with_rules$y,
alpha = 1)
```
It is common to use a lasso model to regularize the model and eliminate non-informative features.
While the details can depend on the implementation, the rules do not appear to be pruned or simplified.
The tuning parameters for this model inherits the parameter of the tree-based model as well as the amount of regularization used in the `glmnet` model. The complexity of the rules is determined by the allowed depth of the tree. For example, using a depth of four means that each rule may have up to four terms that define it. The number of rules would be primarily determined by the number of boosting iterations.
RuleFit, as implemented by the `xrf` package, required all of the data to be complete (i.e, non-missing). In the original data set, the body mass is encoded as integer and `xrf` requires a double, so:
```{r}
#| label: preproc
penguins <-
penguins |>
mutate(body_mass_g = body_mass_g + 0.0) |>
na.omit()
```
Fitting the model:
```{r}
#| label: rf-penguins
#| results: hide
#| warning: false
#| message: false
rule_fit_spec <-
rule_fit(trees = 10, tree_depth = 5, penalty = 0.01) |>
set_engine("xrf") |>
set_mode("regression")
rule_fit_fit <-
rule_fit_spec |>
fit(body_mass_g ~ ., data = penguins)
```
```{r}
#| label: rf-penguins-print
rule_fit_fit
```
To get the rules and their associated coefficients, the `tidy()` method can be used again:
```{r}
#| label: rf-tidy
rf_res <- tidy(rule_fit_fit, penalty = 0.01)
rf_res
```
Note that the units of each predictor have been scaled to be between zero and one.
Looking at the rules, there are examples of some rules, such as:
```r
( bill_depth_mm >= 14.1499996 ) &
( flipper_length_mm < 227 ) &
( flipper_length_mm < 228.5 ) &
( flipper_length_mm >= 197.5 ) &
( flipper_length_mm >= 224.5 )"
```
which can be simplified to have fewer conditions:
```r
( bill_depth_mm >= 14.1499996 ) &
( flipper_length_mm < 227 ) &
( flipper_length_mm >= 197.5 ) &
```
Also, the `tidy()` function can extract the same information but use the original predictors as the unit:
```{r}
#| label: rf-cols
rf_variable_res <- tidy(rule_fit_fit, unit = "columns", penalty = 0.01)
rf_variable_res
```
A single rule might be represented with multiple rows of this version of the data. These results can also be used to compute a rough estimate or variable importance using the absolute value of the coefficients:
```{r}
#| label: rf-imp
num_rules <- sum(grepl("^r[0-9]*_", unique(rf_res$rule_id))) + 1
rf_variable_res |>
dplyr::filter(term != "(Intercept)") |>
group_by(term) |>
summarize(effect = sum(abs(estimate)), .groups = "drop") |>
ungroup() |>
# normalize by number of possible occurrences
mutate(effect = effect / num_rules ) |>
arrange(desc(effect))
```
## References
- Quinlan R (1987). "Simplifying Decision Trees." _International Journal of Man-Machine Studies_, 27(3), 221-234.
- Quinlan R (1992). "Learning with Continuous Classes." _Proceedings of the 5th Australian Joint Conference On Artificial Intelligence_, pp. 343-348.
- Quinlan R (1993). "Combining Instance-Based and Model-Based Learning." _Proceedings of the Tenth International Conference on Machine Learning_, pp. 236-243.