---
title: "Early Stopping"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Early Stopping}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
  # fig.path = "Readme_files/"
)
library(compboost)
```
## Before Starting
- Read the [use-case](https://danielschalk.com/compboost/articles/getting_started/use_case.html) to learn how to define a `Compboost` object using the `R6` interface.
## Data: Titanic Passenger Survival Data Set
We use the [titanic dataset](https://www.kaggle.com/c/titanic/data) with binary
classification on `Survived`. First of all, we store the data in a data frame
and remove all rows that contain missing values (`NA`s):
```{r}
# Load the data and remove rows with missing values:
df = na.omit(titanic::titanic_train)
df$Survived = factor(df$Survived, labels = c("no", "yes"))
```
For the early stopping later on, we split the dataset into train and test indices:
```{r}
set.seed(123)
idx_train = sample(seq_len(nrow(df)), size = floor(nrow(df) * 0.8))
idx_test = setdiff(seq_len(nrow(df)), idx_train)
```
## Defining the Model
We define the same model as in the [use-case](https://danielschalk.com/compboost/articles/getting_started/use_case.html), but fit it only on the training observations and without specifying an out-of-bag fraction:
```{r}
cboost = Compboost$new(data = df[idx_train, ], target = "Survived")
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline)
cboost$addBaselearner("Sex", "ridge", BaselearnerCategoricalRidge)
```
## Early Stopping in Compboost
### How does it work?
Early stopping in `compboost` is implemented via logger objects. Loggers are executed after each iteration and store class-dependent data such as the runtime or the risk. Additionally, each logger can be declared as a stopper by setting `use_as_stopper = TRUE`. A logger declared as a stopper halts the algorithm as soon as its logger-specific criterion is reached. For example, the `LoggerTime` stops the algorithm after a pre-defined runtime has elapsed.
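Conceptually, the mechanism can be pictured as the following loop (pseudo-code only, not the actual `compboost` implementation or API):

```r
# Pseudo-code of the logger/stopper mechanism -- not runnable compboost code:
# for (m in seq_len(max_iterations)) {
#   addBaselearnerToEnsemble()            # one boosting iteration
#   for (logger in loggers)
#     logger$logStep()                    # e.g. record runtime or oob risk
#   if (any(stopper_criterion_reached(loggers)))
#     break                               # a stopper's criterion was hit
# }
```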
### Example with runtime stopping
Now it is time to define a logger to track the runtime. As mentioned above, we set `use_as_stopper = TRUE`. With `max_time` we define how long we want to train the model, here 50000 microseconds:
```{r, warning=FALSE}
cboost$addLogger(logger = LoggerTime, use_as_stopper = TRUE, logger_id = "time",
max_time = 50000, time_unit = "microseconds")
cboost$train(2000, trace = 250)
cboost
```
As we can see, the fitting stops early after `r cboost$getCurrentIteration()` iterations and does not train the full 2000 iterations. The logger data can be accessed by calling `$getLoggerData()`:
```{r}
tail(cboost$getLoggerData())
```
## Loss-Based Early Stopping
```{r, include=FALSE}
cboost = Compboost$new(data = df[idx_train, ], target = "Survived")
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline)
cboost$addBaselearner("Sex", "ridge", BaselearnerCategoricalRidge)
```
In machine learning, we often want to stop at the iteration with the best model performance. To determine a good number of iterations $m$, we need either tuning or early stopping. A well-known procedure is to log the out-of-bag (oob) behavior of the model and to stop as soon as the model performance starts to deteriorate. The required parameters for the logger are
- the loss $L$ that is used for stopping: $$\mathcal{R}_{\text{emp}}^{[m]} = \frac{1}{n}\sum_{i=1}^n L\left(y^{(i)}, f^{[m]}(x^{(i)})\right)$$
- the minimal relative improvement in performance that must be achieved to continue training: $$\text{err}^{[m]} = \frac{\mathcal{R}_{\text{emp}}^{[m- 1]} - \mathcal{R}_{\text{emp}}^{[m]}}{\mathcal{R}_{\text{emp}}^{[m - 1]}}$$
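To make the relative improvement $\text{err}^{[m]}$ concrete, it can be computed by hand from a vector of logged risk values. This is a plain-R illustration with made-up numbers, not `compboost` internals:

```r
# Per-iteration empirical risk values (made-up numbers for illustration):
risk = c(0.70, 0.64, 0.60, 0.58, 0.579)

# Relative improvement err^[m] between consecutive iterations:
improvement = (head(risk, -1) - tail(risk, -1)) / head(risk, -1)
round(improvement, 4)
#> [1] 0.0857 0.0625 0.0333 0.0017
```

The improvement shrinks towards zero as the risk curve flattens, which is exactly what the oob risk logger monitors.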
### Define the risk logger
Since we are interested in the oob behavior, it is necessary to prepare the oob data and response for `compboost`. Therefore, it is possible to use the `$prepareResponse()` and `$prepareData()` member functions to create suitable objects:
```{r}
oob_response = cboost$prepareResponse(df$Survived[idx_test])
oob_data = cboost$prepareData(df[idx_test,])
```
With these objects we can add the oob risk logger, declare it as stopper, and train the model:
```{r}
cboost$addLogger(logger = LoggerOobRisk, use_as_stopper = TRUE, logger_id = "oob",
used_loss = LossBinomial$new(), eps_for_break = 0, patience = 5, oob_data = oob_data,
oob_response = oob_response)
cboost$train(2000, trace = 250)
```
**Note:** Setting `eps_for_break = 0` is a hard constraint that stops the training as soon as the oob risk starts to increase.
Taking a look at the logger data tells us that training stopped exactly after the first five consecutive differences were greater than zero (i.e., the oob risk of these iterations was greater than in the previous ones):
```{r}
tail(cboost$getLoggerData(), n = 10)
diff(tail(cboost$getLoggerData()$oob, n = 10))
```
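The interplay of `eps_for_break` and `patience` can be mimicked on a plain vector of oob risk values. The following sketch uses a hypothetical helper name and is illustrative only, not the `compboost` implementation:

```r
# Hypothetical helper, not part of compboost: returns the iteration at
# which training would stop, given a vector of per-iteration oob risks.
stopIteration = function(risk, eps_for_break = 0, patience = 5) {
  # relative improvement between consecutive iterations
  improvement = (head(risk, -1) - tail(risk, -1)) / head(risk, -1)
  streak = 0
  for (m in seq_along(improvement)) {
    # count consecutive iterations whose improvement is at most eps_for_break
    streak = if (improvement[m] <= eps_for_break) streak + 1 else 0
    if (streak == patience) return(m + 1)
  }
  length(risk)  # no early stop within the logged iterations
}

# Risk decreases, then increases for five iterations in a row:
stopIteration(c(0.7, 0.6, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55))
#> [1] 8
```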
```{r}
library(ggplot2)
ggplot(data = cboost$getLoggerData(), aes(x = `_iterations`, y = oob)) +
geom_line() +
xlab("Iteration") +
ylab("Empirical Risk")
```
Taking a look at the full 2000 iterations shows that we stopped at quite a good point:
```{r}
cboost$train(2000, trace = 0)
ggplot(data = cboost$getLoggerData(), aes(x = `_iterations`, y = oob)) +
geom_line() +
xlab("Iteration") +
ylab("Empirical Risk")
```
**Note:** It can happen that the model's oob risk increases locally for a few iterations and then starts to decrease again. To account for this, the `patience` parameter waits for, say, 5 iterations and stops the algorithm only if the improvement in each of these 5 iterations is smaller than our criterion. Setting this parameter to one can lead to unstable results:
```{r}
df = na.omit(titanic::titanic_train)
df$Survived = factor(df$Survived, labels = c("no", "yes"))
set.seed(123)
idx_train = sample(seq_len(nrow(df)), size = floor(nrow(df) * 0.8))
idx_test = setdiff(seq_len(nrow(df)), idx_train)
cboost = Compboost$new(data = df[idx_train, ], target = "Survived", loss = LossBinomial$new())
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline)
cboost$addBaselearner("Sex", "ridge", BaselearnerCategoricalRidge)
oob_response = cboost$prepareResponse(df$Survived[idx_test])
oob_data = cboost$prepareData(df[idx_test,])
cboost$addLogger(logger = LoggerOobRisk, use_as_stopper = TRUE, logger_id = "oob",
used_loss = LossBinomial$new(), eps_for_break = 0, patience = 1, oob_data = oob_data,
oob_response = oob_response)
cboost$train(2000, trace = 0)
library(ggplot2)
ggplot(data = cboost$getLoggerData(), aes(x = `_iterations`, y = oob)) +
geom_line() +
xlab("Iteration") +
ylab("Empirical Risk")
```
### Further comments on risk logging
- Since we can define as many loggers as we like, it is possible to define multiple risk loggers with respect to different loss functions.
- It is also possible to log performance measures with the risk-logging mechanism. This is covered as an advanced topic.
## Some remarks
- Early stopping can be done globally or locally:
    - *locally* (any): the algorithm stops after **the first** stopping criterion of a logger is reached
    - *globally* (all): the algorithm stops only after **all** stopping criteria are reached
- Some arguments are ignored if the logger is not declared as a stopper, e.g. `max_time` of the time logger
- The logger functionality is summarized [here](https://danielschalk.com/compboost/articles/functionality/logger.html)
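The difference between local and global stopping can be illustrated with plain logical values (an illustrative snippet, not the `compboost` API):

```r
# Suppose a time stopper has triggered but an oob risk stopper has not:
criterion_reached = c(time = TRUE, oob = FALSE)

any(criterion_reached)  # local stopping: stop now
#> [1] TRUE
all(criterion_reached)  # global stopping: keep training
#> [1] FALSE
```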