---
title: "Evaluating submodels with the same model object"
output: rmarkdown::html_vignette
description: |
  You can use `multi_predict()` to evaluate submodels with the same model object
  and avoid having to fit any of the submodels.
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Evaluating submodels with the same model object}
  %\VignetteEncoding{UTF-8}
---
```{r startup, include = FALSE}
library(utils)
library(ggplot2)
theme_set(theme_bw())
```
Some R packages can create predictions from models that are different from the one that was fit. For example, if a boosted tree is fit with 10 iterations of boosting, the model can usually make predictions on _submodels_ that have fewer than 10 trees (all other parameters being held constant). This is helpful for model tuning: many tuning parameter combinations can be evaluated cheaply from a single fit, which often results in a large speed-up in the computations.
In parsnip, the method that does this is called `multi_predict()`. Its current methods are:
```{r methods}
library(parsnip)
methods("multi_predict")
```
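To see where the savings come from, consider the naive alternative of refitting the model once per candidate value. The sketch below is what `multi_predict()` lets you avoid; it is not run, and `candidate_trees`, `train_x`, `train_y`, and `test_x` are hypothetical placeholders:
```{r naive-loop, eval = FALSE}
# Naive approach: one full model fit per candidate number of trees
naive_preds <- purrr::map(
  candidate_trees,
  function(n_trees) {
    boost_tree(mode = "classification", trees = n_trees) %>%
      set_engine("C5.0") %>%
      fit_xy(x = train_x, y = train_y) %>%
      predict(new_data = test_x, type = "prob")
  }
)
```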
We'll use the attrition data from the `modeldata` package to illustrate:
```{r}
library(tidymodels)
data(attrition, package = "modeldata")
set.seed(4595)
data_split <- initial_split(attrition, strata = "Attrition")
attrition_train <- training(data_split)
attrition_test <- testing(data_split)
```
A boosted classification tree is one of the lowest-maintenance approaches that we could take to these data:
```{r boost-model}
# requires the C50 package
attrition_boost <-
boost_tree(mode = "classification", trees = 100) %>%
set_engine("C5.0")
```
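To check how this specification is mapped to the engine's own arguments, `translate()` shows how parsnip will construct the underlying engine call:
```{r translate}
translate(attrition_boost)
```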
Suppose that 10-fold cross-validation was being used to tune the model over the number of trees:
```{r folds}
set.seed(616)
folds <- vfold_cv(attrition_train)
```
For each resample, the process fits a model on 90% of the data and predicts on the remaining 10%. Using `rsample` to get the data for the first fold:
```{r fold-1}
model_data <- analysis(folds$splits[[1]])
pred_data <- assessment(folds$splits[[1]])
fold_1_model <-
attrition_boost %>%
fit_xy(x = model_data %>% dplyr::select(-Attrition), y = model_data$Attrition)
```
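As an aside, the same model could be fit with parsnip's formula interface (not run here, to avoid refitting):
```{r fit-formula, eval = FALSE}
fold_1_model <-
  attrition_boost %>%
  fit(Attrition ~ ., data = model_data)
```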
For `multi_predict()`, the same semantics as `predict()` are used but, for this model, there is an extra argument called `trees`. Candidate submodel values can be passed in with `trees`:
```{r fold-1-pred}
fold_1_pred <-
multi_predict(
fold_1_model,
new_data = pred_data %>% dplyr::select(-Attrition),
trees = 1:100,
type = "prob"
)
fold_1_pred
```
The result is a tibble with as many rows as the data being predicted (_n_ = `r nrow(pred_data)`). The `.pred` column contains a list of tibbles, and each one has the predictions across the different numbers of trees:
```{r obs-1-pred}
fold_1_pred$.pred[[1]]
```
To get this into a more usable format, we can use `tidyr::unnest()`, but first we add row numbers so that we can track the predictions by test sample, along with the actual classes:
```{r unnest}
fold_1_df <-
fold_1_pred %>%
bind_cols(pred_data %>% dplyr::select(Attrition)) %>%
add_rowindex() %>%
tidyr::unnest(.pred)
fold_1_df
```
For two of the test set samples, what do these predictions look like across the number of trees?
```{r prob-plot, fig.alt = "Two lines tracing the predicted probability of no attrition for two observations over the number of trees, ranging from 0 to 100. For both observations, the true value of attrition is 'no'. The predicted probability is larger than 0.5 for both observations across all numbers of trees and does not change much between 15 and 100 trees."}
fold_1_df %>%
dplyr::filter(.row %in% c(1, 88)) %>%
ggplot(aes(x = trees, y = .pred_No, col = Attrition, group = .row)) +
geom_step() +
ylim(0:1) +
theme(legend.position = "top")
```
What does performance look like over trees (using the area under the ROC curve)?
```{r auc-plot, fig.alt = "A line plot of the ROC AUC for the number of trees varying between 1 and 100. The AUC is largest for higher numbers of trees, but the gains get smaller beyond 50 trees, with the maximum being reached around 70 trees."}
fold_1_df %>%
group_by(trees) %>%
roc_auc(truth = Attrition, .pred_No) %>%
ggplot(aes(x = trees, y = .estimate)) +
geom_step()
```
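From here, a natural next step is to pick the submodel with the best performance. A minimal sketch, assuming AUC is the only selection criterion (a full analysis would aggregate this profile across all 10 folds first):
```{r best-trees}
# Rank the candidate tree counts by AUC and keep the best one
fold_1_df %>%
  group_by(trees) %>%
  roc_auc(truth = Attrition, .pred_No) %>%
  ungroup() %>%
  slice_max(.estimate, n = 1)
```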