update vignettes
schalkdaniel committed Mar 30, 2019
1 parent 30c6477 commit 810901e
Showing 2 changed files with 32 additions and 30 deletions.
26 changes: 13 additions & 13 deletions vignettes/getting_started/early_stopping.Rmd
@@ -25,7 +25,7 @@ library(compboost)

We use the [titanic dataset](https://www.kaggle.com/c/titanic/data) with binary
classification on `Survived`. First of all we store the train and test data
in two data frames and remove all rows that contains `NA`s:
into two data frames and remove all rows that contain missing values (`NA`s):

```{r}
# Store train and test data:
@@ -43,7 +43,7 @@ idx_test = setdiff(seq_len(nrow(df)), idx_train)

## Defining the Model

We define the same model as in the [use-case](https://danielschalk.com/compboost/articles/getting_started/use_case.html) but now just on the train index without specifying an out of bag fraction:
We define the same model as in the [use-case](https://danielschalk.com/compboost/articles/getting_started/use_case.html), but now just on the train index and without specifying an out-of-bag fraction:
```{r}
cboost = Compboost$new(data = df[idx_train, ], target = "Survived", loss = LossBinomial$new())
@@ -57,11 +57,11 @@ cboost$addBaselearner("Sex", "categorical", BaselearnerPolynomial, intercept = F

### How does it work?

The early stopping of `compboost` is done using logger objects. The logger is executed after each iteration and stores class dependent data, e.g. the runtime. Additionally, each logger can be declared as a stopper with `use_as_stopper = TRUE`. Declaring a logger as stopper, the logged data is used to stop the algorithm after a logger-specific criteria is reached. For example, using `LoggerTime` as stopper, the algorithm can be stopped after a pre-defined runtime is reached.
The early stopping of `compboost` is done by using logger objects. A logger is executed after each iteration and stores class-dependent data, e.g. the runtime. Additionally, each logger can be declared as a stopper with `use_as_stopper = TRUE`. When a logger is declared as stopper, the logged data is used to stop the algorithm once a logger-specific criterion is reached. For example, using `LoggerTime` as stopper will stop the algorithm after a pre-defined runtime is reached.

### Example with runtime stopping

Now it's time to define a logger to log the runtime. As mentioned above, we set `use_as_stopper = TRUE`. Now it matters what is specified in `max_time` since this defines the stopping behavior. Here we want to stop after 50000 microseconds:
Now it is time to define a logger that tracks the runtime. As mentioned above, we set `use_as_stopper = TRUE`. It then matters what is specified in `max_time`, since this defines how long we want to train the model. Here we want to stop after 50000 microseconds:

```{r, warnings=FALSE}
cboost$addLogger(logger = LoggerTime, use_as_stopper = TRUE, logger_id = "time",
@@ -71,7 +71,7 @@ cboost$train(2000, trace = 100)
cboost
```

As we can see, the fittings is stopped after `r cboost$getCurrentIteration()` and not after 2000 iterations as specified in train. Taking a look at the logger data, we can see that the last entry exceeds the 50000 microseconds and therefore hits the stopping criteria:
As we can see, the fitting stops after `r cboost$getCurrentIteration()` iterations and not after 2000 iterations as specified in `$train()`. Taking a look at the logger data, we can see that the last entry exceeds the 50000 microseconds and therefore triggers the stopping criterion:
```{r}
tail(cboost$getLoggerData())
```
@@ -87,34 +87,34 @@ cboost$addBaselearner("Fare", "spline", BaselearnerPSpline)
cboost$addBaselearner("Sex", "categorical", BaselearnerPolynomial, intercept = FALSE)
```

In machine learning we often like to stop when the best model performance is reached. Especially in boosting, which tends to overfit, we need either tuning or early stopping to determine what is a good number of iteration $m$ to get good model performance. A well-known procedure is to log the out of bag (oob) behavior of the model and stop after this starts to get worse. This is how oob early stopping is implemented in `compboost`. The freedom we have is to specify
In machine learning we often like to stop when the best model performance is reached. Especially in boosting, which tends to overfit when run for too many iterations, we need either tuning or early stopping to determine a good number of iterations $m$ that yields good model performance. A well-known procedure is to log the out-of-bag (oob) behavior of the model and to stop once it starts to get worse. This is how oob early stopping is implemented in `compboost`. The parameters we need to specify are (a small toy sketch of the resulting stopping rule is given after this list):

- the loss $L$ that is used for stopping: $$\mathcal{R}_{\text{emp}}^{[m]} = \frac{1}{n}\sum_{i=1}^n L\left(y^{(i)}, f^{[m]}(x^{(i)})\right)$$

- the minimal relative performance improvement that has to be achieved; if the actual improvement falls below it, the iteration counts towards stopping: $$\text{err}^{[m]} = \frac{\mathcal{R}_{\text{emp}}^{[m-1]} - \mathcal{R}_{\text{emp}}^{[m]}}{\mathcal{R}_{\text{emp}}^{[m-1]}}$$
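
To make the stopping rule concrete, here is a small toy sketch in plain R (made-up numbers, no `compboost` internals) of how the relative improvement and the effect of `eps_for_break = 0` could be computed from a vector of oob risk values:

```{r}
# Made-up oob risk values over six iterations:
oob_risk = c(0.70, 0.65, 0.62, 0.61, 0.615, 0.62)

# Relative improvement err^[m] between consecutive iterations:
err = (head(oob_risk, -1) - tail(oob_risk, -1)) / head(oob_risk, -1)
round(err, 4)

# With eps_for_break = 0, an iteration counts towards stopping as soon as
# its relative improvement drops to zero or below:
err <= 0
```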

### Define the risk logger

Since we are interested in the oob behavior it is necessary to define the oob data and response in a manner that `compboost` understands it. Therefore, it is possible to use the `$prepareResponse()` and `$prepareData()` to get the objects:
Since we are interested in the oob behavior, it is necessary to define the oob data and response in a way that `compboost` understands. For this purpose we can use the `$prepareResponse()` and `$prepareData()` member functions to create suitable objects:

```{r}
oob_response = cboost$prepareResponse(df$Survived[idx_test])
oob_data = cboost$prepareData(df[idx_test,])
```

With that data we can add the oob risk logger, declare it as stopper, and train the model:
With these objects we can add the oob risk logger, declare it as stopper, and train the model:

```{r}
cboost$addLogger(logger = LoggerOobRisk, use_as_stopper = TRUE, logger_id = "oob",
used_loss = LossBinomial$new(), eps_for_break = 0, patience = 5, oob_data = oob_data,
used_loss = LossBinomial$new(), eps_for_break = 0, patience = 5, oob_data = oob_data,
oob_response = oob_response)
cboost$train(2000, trace = 100)
```

**Note:** The use of `eps_for_break = 0` is a hard constrained that says that the training should continue until the oob risk starts to get bigger.
**Note:** Using `eps_for_break = 0` is a hard constraint that continues the training just until the oob risk starts to increase.

Taking a look at the logger data tells us that we stopped after the first 5 differences are bigger than zero (the oob risk of that iterations is bigger than the previous ones):
Taking a look at the logger data tells us that we stop exactly after the first five differences are bigger than zero (the oob risk of each of these iterations is bigger than that of its predecessor):
```{r}
tail(cboost$getLoggerData(), n = 10)
diff(tail(cboost$getLoggerData()$oob, n = 10))
@@ -139,7 +139,7 @@ ggplot(data = cboost$getLoggerData(), aes(x = `_iterations`, y = oob)) +
ylab("Empirical Risk")
```

**Note:** It could happen that the model's oob behavior increases locally for a few iterations and then starts to decrease again. To capture this we would need the "patience" parameter which waits for, lets say, 5 iterations and breaks just if all 5 iterations fulfill the criteria. Setting this parameter to 1 can lead to very unstable results:
**Note:** It can happen that the model's oob risk increases locally for a few iterations and then starts to decrease again. To capture this we need the `patience` parameter, which waits for, let's say, 5 iterations and stops only if all 5 iterations fulfill the stopping criterion. Setting this parameter to one can lead to very unstable results.
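
As a toy illustration in plain R (made-up numbers, not `compboost` internals), consider a vector of consecutive oob risk differences, where a positive value means the oob risk got worse in that iteration:

```{r}
# Made-up differences of the oob risk between consecutive iterations:
risk_diff = c(-0.010, -0.004, 0.002, -0.003, 0.001, 0.002, 0.001, 0.002, 0.001)

# With patience = 1 the algorithm would already stop at the first positive
# difference (here iteration 3), although the risk decreases again afterwards:
which(risk_diff > 0)[1]

# With patience = 5 the single bad iteration is ignored and stopping is only
# triggered by the run of five consecutive positive differences at the end:
rle(risk_diff > 0)
```

Refitting the model from above with `patience = 1` shows this instability on the real data: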
```{r}
df = na.omit(titanic::titanic_train)
df$Survived = factor(df$Survived, labels = c("no", "yes"))
@@ -158,7 +158,7 @@ oob_response = cboost$prepareResponse(df$Survived[idx_test])
oob_data = cboost$prepareData(df[idx_test,])
cboost$addLogger(logger = LoggerOobRisk, use_as_stopper = TRUE, logger_id = "oob",
used_loss = LossBinomial$new(), eps_for_break = 0, patience = 1, oob_data = oob_data,
used_loss = LossBinomial$new(), eps_for_break = 0, patience = 1, oob_data = oob_data,
oob_response = oob_response)
cboost$train(2000, trace = 0)
36 changes: 19 additions & 17 deletions vignettes/getting_started/use_case.Rmd
@@ -20,8 +20,8 @@ library(compboost)
## Data: Titanic Passenger Survival Data Set

We use the [titanic dataset](https://www.kaggle.com/c/titanic/data) with binary
classification on `survived`. First of all we store the train and test data
in two data frames and remove all rows that contains `NA`s:
classification on `Survived`. First of all, we store the train and test data
into two data frames and remove all rows that contain missing values (`NA`s):

```{r}
# Store train and test data:
@@ -30,29 +30,27 @@ df_train = na.omit(titanic::titanic_train)
str(df_train)
```

In the next step we transform the response to a factor with more intuitive levels:
In the next step we transform the response to a factor having more intuitive levels:

```{r}
df_train$Survived = factor(df_train$Survived, labels = c("no", "yes"))
```

## Initializing Model

Due to the `R6` API it is necessary to create a new class object which gets the data, the target as character, and the used loss. Note that it is important to give an initialized loss object:
Due to the `R6` API it is necessary to create a new class object by calling the `$new()` constructor, which gets the data, the target as a character string, and the loss to be used. Note that it is important to pass an initialized loss object, which gives the opportunity to use, for example, a custom offset:
```{r}
cboost = Compboost$new(data = df_train, target = "Survived",
loss = LossBinomial$new(), oob_fraction = 0.3)
```

Use an initialized object for the loss gives the opportunity to use a loss initialized with a custom offset.
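
A hedged sketch of such a custom offset (not evaluated here, and assuming the loss constructor simply takes the offset as a numeric argument):

```{r, eval=FALSE}
# Binomial loss with a manually chosen offset instead of the automatically
# computed one (the exact argument is an assumption for illustration):
custom_loss = LossBinomial$new(0.5)

cboost_custom = Compboost$new(data = df_train, target = "Survived",
  loss = custom_loss, oob_fraction = 0.3)
```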

## Adding Base-Learner

Adding new base-learners is also done by giving a character to indicate the feature. As second argument it is important to name an identifier for the factory since we can define multiple base-learner on the same source.
Adding a new base-learner requires as first argument a character string that indicates which feature we want to use for the new base-learner. As second argument it is important to define an identifier for the factory; this is necessary since it is possible to define multiple base-learners on the same data source.

### Numerical Features

For instance, we can define a spline and a linear base-learner of the same feature:
We can define a spline and a linear base-learner of the same feature:
```{r}
# Spline base-learner of age:
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
@@ -61,7 +59,7 @@ cboost$addBaselearner("Age", "linear", BaselearnerPolynomial)
cboost$addBaselearner("Age", "linear", BaselearnerPolynomial)
```

Additional arguments can be specified after naming the base-learner. For a complete list see the [functionality](https://compboost.org/functionality.html) at the project page:
Additional arguments can be specified after the base-learner class. For a complete list see the [functionality](https://compboost.org/functionality.html) page of the project:
```{r}
# Spline base-learner of fare:
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline, degree = 2,
@@ -70,22 +68,26 @@ cboost$addBaselearner("Fare", "spline", BaselearnerPSpline, degree = 2,

### Categorical Features

When adding categorical features each group is added as single base-learner to avoid biased feature selection. Also note that we don't need an intercept here:
When adding categorical features, each group is added as a single base-learner. Also note that we don't want an intercept here:
```{r}
cboost$addBaselearner("Sex", "categorical", BaselearnerPolynomial,
intercept = FALSE)
```

Finally, we can check what factories are registered:
Finally, we can get all registered factories:
```{r}
cboost$getBaselearnerNames()
```

## Define Logger

A logger is another class that is evaluated after each iteration to track the performance, the elapsed runtime, or the number of iterations. For each `Compboost` object, an iterations logger is defined by default with as many iterations as specified in the `$train()` function.

To control the fitting behavior with loggers, each logger can also be declared as a stopper to stop the fitting process after a pre-defined stopping criterion is reached.

### Time logger

This logger logs the elapsed time. The time unit can be one of `microseconds`, `seconds` or `minutes`. The logger stops if `max_time` is reached. But we do not use that logger as stopper here:
This logger tracks the elapsed time. The time unit can be one of `microseconds`, `seconds` or `minutes`. Used as a stopper, the logger stops the fitting once `max_time` is reached; here, however, we do not use it as a stopper:

```{r}
cboost$addLogger(logger = LoggerTime, use_as_stopper = FALSE, logger_id = "time",
@@ -100,7 +102,7 @@ cboost$train(2000, trace = 100)
cboost
```

Objects of the `Compboost` class do have member functions such as `getEstimatedCoef()`, `getInbagRisk()` or `predict()` to access the results:
Objects of the `Compboost` class have member functions such as `$getEstimatedCoef()`, `$getInbagRisk()` or `$predict()` to access the results:
```{r}
str(cboost$getEstimatedCoef())
@@ -109,12 +111,12 @@ str(cboost$getInbagRisk())
str(cboost$predict())
```

To obtain a vector of selected learner just call `getSelectedBaselearner()`
To obtain a vector of the selected base-learners, just call `$getSelectedBaselearner()`:
```{r}
table(cboost$getSelectedBaselearner())
```

We can also access predictions directly from the response object `cboost$response` and `cboost$response_oob`. Note that `$response_oob` was created automatically when defining an `oob_fraction` within the constructor:
We can also access the predictions directly from the response objects `cboost$response` and `cboost$response_oob`. Note that `$response_oob` was created automatically when defining an `oob_fraction` within the constructor:
```{r}
oob_label = cboost$response_oob$getResponse()
oob_pred = cboost$response_oob$getPredictionResponse()
@@ -130,7 +132,7 @@ cboost$plotInbagVsOobRisk()

## Retrain the Model

To set the whole model to another iteration one can easily call `train()` to another iteration:
To set the whole model to another iteration, one can simply call `$train()` again. If the new number of iterations is smaller than the number already trained, the model is just set back to that already seen iteration; if it is larger, additional base-learners are trained until the new number is reached.
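
As a hedged sketch (not evaluated here), setting the model back to an earlier iteration could look like this:

```{r, eval=FALSE}
# Set the already trained model back to iteration 500; no refitting is needed
# since the base-learners of these iterations are already available:
cboost$train(500)
cboost$getCurrentIteration()
```

Training beyond the number of iterations fitted so far continues the boosting procedure: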
```{r, warnings=FALSE}
cboost$train(3000)
@@ -143,7 +145,7 @@ table(cboost$getSelectedBaselearner())

## Visualizing Base-Learner

To visualize a base-learner it is important to exactly use a name from `getBaselearnerNames()`:
To visualize a base-learner, it is important to use exactly one of the names returned by `$getBaselearnerNames()`:
```{r, eval=FALSE}
gg1 = cboost$plot("Age_spline")
gg2 = cboost$plot("Age_spline", iters = c(50, 100, 500, 1000, 1500))