---
title: "Use-case"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Use-case}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
  # fig.path = "Readme_files/"
)
library(compboost)
```
## Data: Titanic Passenger Survival Data Set
We use the [titanic dataset](https://www.kaggle.com/c/titanic/data) for binary
classification on `Survived`. First, we store the training data in a data frame
and remove all rows that contain `NA`s:
```{r}
# Store the training data and remove rows with missing values:
df_train = na.omit(titanic::titanic_train)
str(df_train)
```
In the next step we transform the response to a factor with more intuitive levels:
```{r}
df_train$Survived = factor(df_train$Survived, labels = c("no", "yes"))
```
## Initializing Model
Due to the `R6` API, a model is initialized by creating a new `Compboost` object which receives the data and the target as a character string; a loss can also be supplied. Note that the loss must be passed as an initialized loss object:
```{r}
cboost = Compboost$new(data = df_train, target = "Survived", oob_fraction = 0.3)
```
Using an initialized loss object makes it possible, for example, to use a loss initialized with a custom offset.
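As a sketch (not evaluated), this could look as follows; the `loss` constructor argument and the offset argument of `LossBinomial$new()` are assumptions, not taken from the chunk above:

```{r, eval=FALSE}
# Sketch (assumptions: the constructor takes a `loss` argument and the loss
# constructor accepts a custom offset, e.g. LossBinomial$new(0.7)):
cboost_custom = Compboost$new(data = df_train, target = "Survived",
  loss = LossBinomial$new(), oob_fraction = 0.3)
```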
## Adding Base-Learner
New base-learners are added by passing a character string that indicates the feature. The second argument is an identifier for the factory, which is important because we can define multiple base-learners on the same feature.
### Numerical Features
For instance, we can define a spline and a linear base-learner of the same feature:
```{r}
# Spline base-learner of age:
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
# Linear base-learner of age (degree = 1 with intercept is default):
cboost$addBaselearner("Age", "linear", BaselearnerPolynomial)
```
Additional arguments can be specified after naming the base-learner:
```{r}
# Spline base-learner of fare:
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline, degree = 2,
  n_knots = 14, penalty = 10, differences = 2)
```
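The same works for polynomial base-learners; as a sketch (not evaluated here, to avoid registering another factory), a quadratic effect of age could look like this:

```{r, eval=FALSE}
# Sketch: quadratic base-learner of age using the degree argument:
cboost$addBaselearner("Age", "quadratic", BaselearnerPolynomial, degree = 2)
```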
For a reference of the available base-learners, see the [functionality](https://danielschalk.com/compboost/articles/fct-baselearner.html) article on the project page.
### Categorical Features
When adding categorical features, we use a dummy-coded representation with a ridge penalty:
```{r}
cboost$addBaselearner("Sex", "categorical", BaselearnerCategoricalRidge)
```
Finally, we can check what factories are registered:
```{r}
cboost$getBaselearnerNames()
```
## Define Logger
### Time logger
This logger tracks the elapsed time. The time unit can be one of `microseconds`, `seconds`, or `minutes`. When used as a stopper, the logger stops the training as soon as `max_time` is reached. Here, we do not use it as a stopper:
```{r}
cboost$addLogger(logger = LoggerTime, use_as_stopper = FALSE, logger_id = "time",
  max_time = 0, time_unit = "microseconds")
```
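As a sketch (not evaluated), the same logger could act as a stopper by setting `use_as_stopper = TRUE` and a positive `max_time`:

```{r, eval=FALSE}
# Sketch: time logger as stopper; training stops once 30 seconds are exceeded:
cboost$addLogger(logger = LoggerTime, use_as_stopper = TRUE, logger_id = "time_stop",
  max_time = 30, time_unit = "seconds")
```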
## Train Model and Access Elements
```{r, warning=FALSE}
cboost$train(2000, trace = 250)
cboost
```
Objects of the `Compboost` class have member functions such as `getCoef()`, `getInbagRisk()`, or `predict()` to access the results:
```{r}
str(cboost$getCoef())
str(cboost$getInbagRisk())
str(cboost$predict())
```
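A sketch (not evaluated) of scoring new observations with `predict()`, assuming it accepts the data as `newdata`; here we simply reuse the training data:

```{r, eval=FALSE}
# Sketch (assumption: predict() accepts a newdata argument):
str(cboost$predict(newdata = df_train))
```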
To obtain a vector of the selected base-learners, use `getSelectedBaselearner()`:
```{r}
table(cboost$getSelectedBaselearner())
```
We can also access predictions directly from the response objects `cboost$response` and `cboost$response_oob`. Note that `$response_oob` is created automatically when an `oob_fraction` is defined in the constructor:
```{r}
oob_label = cboost$response_oob$getResponse()
oob_pred = cboost$response_oob$getPredictionResponse()
table(true_label = oob_label, predicted = oob_pred)
```
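From this confusion table we can also compute the out-of-bag accuracy, assuming true labels and predictions share the same label encoding:

```{r}
# Out-of-bag accuracy from the confusion table (assumes matching label encodings):
conf_oob = table(true_label = oob_label, predicted = oob_pred)
sum(diag(conf_oob)) / sum(conf_oob)
```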
## Retrain the Model
To continue the training, or to set the whole model to another iteration, simply call `train()` again:
```{r, warning=FALSE}
cboost$train(3000)
str(cboost$getCoef())
str(cboost$getInbagRisk())
table(cboost$getSelectedBaselearner())
```
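Setting the model back to an earlier iteration works the same way; a sketch (not evaluated):

```{r, eval=FALSE}
# Sketch: set the model back to iteration 1500 instead of continuing the training:
cboost$train(1500)
table(cboost$getSelectedBaselearner())
```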
## Next steps
- Have a look at the [visualization capabilities](https://danielschalk.com/compboost/articles/getting_started/visualizations.html) of the package.
- See how [other loss functions](https://danielschalk.com/compboost/articles/getting_started/robust_regression.html) affect the model training.