diff --git a/vignettes/getting_started/early_stopping.Rmd b/vignettes/getting_started/early_stopping.Rmd
index 6d5fabaa..5941d3e4 100644
--- a/vignettes/getting_started/early_stopping.Rmd
+++ b/vignettes/getting_started/early_stopping.Rmd
@@ -25,7 +25,7 @@ library(compboost)
 
 We use the [titanic dataset](https://www.kaggle.com/c/titanic/data) with binary
 classification on `Survived`. First of all we store the train and test data
-in two data frames and remove all rows that contains `NA`s:
+into two data frames and remove all rows that contain missing values (`NA`s):
 
 ```{r}
 # Store train and test data:
@@ -43,7 +43,7 @@ idx_test = setdiff(seq_len(nrow(df)), idx_train)
 
 ## Defining the Model
 
-We define the same model as in the [use-case](https://danielschalk.com/compboost/articles/getting_started/use_case.html) but now just on the train index without specifying an out of bag fraction:
+We define the same model as in the [use-case](https://danielschalk.com/compboost/articles/getting_started/use_case.html) but just on the train index without specifying an out of bag fraction:
 
 ```{r}
 cboost = Compboost$new(data = df[idx_train, ], target = "Survived", loss = LossBinomial$new())
@@ -57,11 +57,11 @@ cboost$addBaselearner("Sex", "categorical", BaselearnerPolynomial, intercept = F
 
 ### How does it work?
 
-The early stopping of `compboost` is done using logger objects. The logger is executed after each iteration and stores class dependent data, e.g. the runtime. Additionally, each logger can be declared as a stopper with `use_as_stopper = TRUE`. Declaring a logger as stopper, the logged data is used to stop the algorithm after a logger-specific criteria is reached. For example, using `LoggerTime` as stopper, the algorithm can be stopped after a pre-defined runtime is reached.
+The early stopping of `compboost` is done using logger objects. A logger is executed after each iteration and stores class-dependent data, e.g. the runtime. Additionally, each logger can be declared as a stopper with `use_as_stopper = TRUE`. If a logger is declared as stopper, the logged data is used to stop the algorithm as soon as a logger-specific criterion is reached. For example, using `LoggerTime` as stopper stops the algorithm after a pre-defined runtime is reached.
 
 ### Example with runtime stopping
 
-Now it's time to define a logger to log the runtime. As mentioned above, we set `use_as_stopper = TRUE`. Now it matters what is specified in `max_time` since this defines the stopping behavior. Here we want to stop after 50000 microseconds:
+Now it is time to define a logger to track the runtime. As mentioned above, we set `use_as_stopper = TRUE`. Now it matters what is specified in `max_time` since this defines how long we want to train the model. Here we want to stop after 50000 microseconds:
 
 ```{r, warnings=FALSE}
 cboost$addLogger(logger = LoggerTime, use_as_stopper = TRUE, logger_id = "time",
@@ -71,7 +71,7 @@ cboost$train(2000, trace = 100)
 cboost
 ```
 
-As we can see, the fittings is stopped after `r cboost$getCurrentIteration()` and not after 2000 iterations as specified in train. Taking a look at the logger data, we can see that the last entry exceeds the 50000 microseconds and therefore hits the stopping criteria:
+As we can see, the fitting is stopped after `r cboost$getCurrentIteration()` iterations and not after 2000 iterations as specified in `train()`.
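+
+The stopping iteration can also be queried programmatically via the `$getCurrentIteration()` member function (the same call used inline above):
+
+```{r}
+# Iteration at which the time stopper interrupted the training:
+cboost$getCurrentIteration()
+```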
+Taking a look at the logger data, we can see that the last entry exceeds 50000 microseconds and therefore triggers the stopping criterion:
 
 ```{r}
 tail(cboost$getLoggerData())
 ```
@@ -87,7 +87,7 @@ cboost$addBaselearner("Fare", "spline", BaselearnerPSpline)
 cboost$addBaselearner("Sex", "categorical", BaselearnerPolynomial, intercept = FALSE)
 ```
 
-In machine learning we often like to stop when the best model performance is reached. Especially in boosting, which tends to overfit, we need either tuning or early stopping to determine what is a good number of iteration $m$ to get good model performance. A well-known procedure is to log the out of bag (oob) behavior of the model and stop after this starts to get worse. This is how oob early stopping is implemented in `compboost`. The freedom we have is to specify
+In machine learning we often like to stop when the best model performance is reached. Especially in boosting, which may tend to overfit, we need either tuning or early stopping to determine a good number of iterations $m$ that yields good model performance. A well-known procedure is to log the out of bag (oob) behavior of the model and stop as soon as it starts to get worse. This is how oob early stopping is implemented in `compboost`. The parameters we need to specify are
 
 - the loss $L$ that is used for stopping: $$\mathcal{R}_{\text{emp}}^{[m]} = \frac{1}{n}\sum_{i=1}^n L\left(y^{(i)}, f^{[m]}(x^{(i)})\right)$$
 
@@ -95,26 +95,26 @@ In machine learning we often like to stop when the best model performance is rea
 
 ### Define the risk logger
 
-Since we are interested in the oob behavior it is necessary to define the oob data and response in a manner that `compboost` understands it. Therefore, it is possible to use the `$prepareResponse()` and `$prepareData()` to get the objects:
+Since we are interested in the oob behavior, it is necessary to define the oob data and response in a manner that `compboost` understands. Therefore, it is possible to use the `$prepareResponse()` and `$prepareData()` member functions to create suitable objects:
 
 ```{r}
 oob_response = cboost$prepareResponse(df$Survived[idx_test])
 oob_data = cboost$prepareData(df[idx_test,])
 ```
 
-With that data we can add the oob risk logger, declare it as stopper, and train the model:
+With these objects we can add the oob risk logger, declare it as stopper, and train the model:
 
 ```{r}
 cboost$addLogger(logger = LoggerOobRisk, use_as_stopper = TRUE, logger_id = "oob",
-  used_loss = LossBinomial$new(), eps_for_break = 0, patience = 5, oob_data = oob_data,
+  used_loss = LossBinomial$new(), eps_for_break = 0, patience = 5, oob_data = oob_data,
   oob_response = oob_response)
 
 cboost$train(2000, trace = 100)
 ```
 
-**Note:** The use of `eps_for_break = 0` is a hard constrained that says that the training should continue until the oob risk starts to get bigger.
+**Note:** The use of `eps_for_break = 0` is a hard constraint that continues the training just until the oob risk starts to increase.
 
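+Instead of waiting for the oob risk to actually increase, a positive tolerance can be used to stop as soon as the improvement becomes negligible. A sketch reusing the signature from above; note that the exact interpretation of `eps_for_break` (absolute vs. relative improvement) is an assumption here and should be checked against the logger documentation:
+
+```{r, eval=FALSE}
+# Hypothetical variation: stop as soon as the oob risk improves by less than 0.001:
+cboost$addLogger(logger = LoggerOobRisk, use_as_stopper = TRUE, logger_id = "oob_eps",
+  used_loss = LossBinomial$new(), eps_for_break = 0.001, patience = 5,
+  oob_data = oob_data, oob_response = oob_response)
+```
+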
-Taking a look at the logger data tells us that we stopped after the first 5 differences are bigger than zero (the oob risk of that iterations is bigger than the previous ones):
+Taking a look at the logger data tells us that we stop exactly after the first five consecutive differences are bigger than zero (the oob risk in these iterations is bigger than in the previous ones):
 
 ```{r}
 tail(cboost$getLoggerData(), n = 10)
 diff(tail(cboost$getLoggerData()$oob, n = 10))
@@ -139,7 +139,7 @@ ggplot(data = cboost$getLoggerData(), aes(x = `_iterations`, y = oob)) +
   ylab("Empirical Risk")
 ```
 
-**Note:** It could happen that the model's oob behavior increases locally for a few iterations and then starts to decrease again. To capture this we would need the "patience" parameter which waits for, lets say, 5 iterations and breaks just if all 5 iterations fulfill the criteria. Setting this parameter to 1 can lead to very unstable results:
+**Note:** It could happen that the model's oob risk increases locally for a few iterations and then starts to decrease again. To capture this we need the "patience" parameter, which waits for, let's say, 5 iterations and stops only if all 5 iterations fulfill the stopping criterion. Setting this parameter to one can lead to very unstable results:
 
 ```{r}
 df = na.omit(titanic::titanic_train)
 df$Survived = factor(df$Survived, labels = c("no", "yes"))
@@ -158,7 +158,7 @@ oob_response = cboost$prepareResponse(df$Survived[idx_test])
 oob_data = cboost$prepareData(df[idx_test,])
 
 cboost$addLogger(logger = LoggerOobRisk, use_as_stopper = TRUE, logger_id = "oob",
-  used_loss = LossBinomial$new(), eps_for_break = 0, patience = 1, oob_data = oob_data,
+  used_loss = LossBinomial$new(), eps_for_break = 0, patience = 1, oob_data = oob_data,
   oob_response = oob_response)
 
 cboost$train(2000, trace = 0)
diff --git a/vignettes/getting_started/use_case.Rmd b/vignettes/getting_started/use_case.Rmd
index 7d9e8a46..91f6a12d 100644
--- a/vignettes/getting_started/use_case.Rmd
+++ b/vignettes/getting_started/use_case.Rmd
@@ -20,8 +20,8 @@ library(compboost)
 
 ## Data: Titanic Passenger Survival Data Set
 
 We use the [titanic dataset](https://www.kaggle.com/c/titanic/data) with binary
-classification on `survived`. First of all we store the train and test data
-in two data frames and remove all rows that contains `NA`s:
+classification on `survived`. First of all, we store the train and test data
+into two data frames and remove all rows that contain missing values (`NA`s):
 
 ```{r}
 # Store train and test data:
@@ -30,7 +30,7 @@ df_train = na.omit(titanic::titanic_train)
 str(df_train)
 ```
 
-In the next step we transform the response to a factor with more intuitive levels:
+In the next step we transform the response into a factor with more intuitive levels:
 
 ```{r}
 df_train$Survived = factor(df_train$Survived, labels = c("no", "yes"))
@@ -38,21 +38,19 @@ df_train$Survived = factor(df_train$Survived, labels = c("no", "yes"))
 
 ## Initializing Model
 
-Due to the `R6` API it is necessary to create a new class object which gets the data, the target as character, and the used loss. Note that it is important to give an initialized loss object:
+Due to the `R6` API it is necessary to create a new class object by calling the `$new()` constructor, which takes the data, the target as character, and the used loss. Note that it is important to pass an initialized loss object, which gives the opportunity to use, for example, a custom offset:
 
 ```{r}
 cboost = Compboost$new(data = df_train, target = "Survived", loss = LossBinomial$new(), oob_fraction = 0.3)
 ```
 
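+The custom offset mentioned above would be set when initializing the loss object. A minimal sketch; that the loss constructor accepts an offset value as its argument is an assumption based on the loss documentation, not something shown in this vignette:
+
+```{r, eval=FALSE}
+# Hypothetical: use a fixed custom offset instead of the offset
+# that compboost estimates from the data:
+loss_custom = LossBinomial$new(0.3)
+cboost_custom = Compboost$new(data = df_train, target = "Survived", loss = loss_custom)
+```
+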
-Use an initialized object for the loss gives the opportunity to use a loss initialized with a custom offset.
-
 ## Adding Base-Learner
 
-Adding new base-learners is also done by giving a character to indicate the feature. As second argument it is important to name an identifier for the factory since we can define multiple base-learner on the same source.
+Adding a new base-learner requires, as first argument, a character indicating the feature we want to use. As second argument it is important to define an identifier for the factory. This is necessary since it is possible to define multiple base-learners on the same data source.
 
 ### Numerical Features
 
-For instance, we can define a spline and a linear base-learner of the same feature:
+We can define a spline and a linear base-learner on the same feature:
 
 ```{r}
 # Spline base-learner of age:
 cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
@@ -61,7 +59,7 @@ cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
 cboost$addBaselearner("Age", "linear", BaselearnerPolynomial)
 ```
 
-Additional arguments can be specified after naming the base-learner. For a complete list see the [functionality](https://compboost.org/functionality.html) at the project page:
+Additional arguments can be specified after the base-learner class. For a complete list see the [functionality](https://compboost.org/functionality.html) at the project page:
 
 ```{r}
 # Spline base-learner of fare:
 cboost$addBaselearner("Fare", "spline", BaselearnerPSpline, degree = 2,
@@ -70,22 +68,26 @@ cboost$addBaselearner("Fare", "spline", BaselearnerPSpline, degree = 2,
 
 ### Categorical Features
 
-When adding categorical features each group is added as single base-learner to avoid biased feature selection. Also note that we don't need an intercept here:
+When adding categorical features, each group is added as a single base-learner. Also note that we don't want an intercept here:
 
 ```{r}
 cboost$addBaselearner("Sex", "categorical", BaselearnerPolynomial, intercept = FALSE)
 ```
 
-Finally, we can check what factories are registered:
+Finally, we can get all registered factories:
 
 ```{r}
 cboost$getBaselearnerNames()
 ```
 
 ## Define Logger
 
+A logger is another class that is evaluated after each iteration to track the performance, the elapsed runtime, or the iterations. For each `Compboost` object, an iterations logger is defined by default with as many iterations as specified in the `$train()` function.
+
+To control the fitting behavior with loggers, each logger can also be declared as stopper to stop the fitting process as soon as a pre-defined stopping criterion is reached.
+
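+As shown in the early stopping vignette, the data tracked by all registered loggers (including the `_iterations` column written by the default iterations logger) can be inspected after training:
+
+```{r, eval=FALSE}
+# Inspect the logger data once the model is trained:
+head(cboost$getLoggerData())
+```
+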
 ### Time logger
 
-This logger logs the elapsed time. The time unit can be one of `microseconds`, `seconds` or `minutes`. The logger stops if `max_time` is reached. But we do not use that logger as stopper here:
+This logger tracks the elapsed time. The time unit can be one of `microseconds`, `seconds` or `minutes`. The logger stops if `max_time` is reached. But we do not use that logger as stopper here:
 
 ```{r}
 cboost$addLogger(logger = LoggerTime, use_as_stopper = FALSE, logger_id = "time",
@@ -100,7 +102,7 @@ cboost$train(2000, trace = 100)
 cboost
 ```
 
-Objects of the `Compboost` class do have member functions such as `getEstimatedCoef()`, `getInbagRisk()` or `predict()` to access the results:
+Objects of the `Compboost` class have member functions such as `$getEstimatedCoef()`, `$getInbagRisk()` or `$predict()` to access the results:
 
 ```{r}
 str(cboost$getEstimatedCoef())
@@ -109,12 +111,12 @@ str(cboost$getInbagRisk())
 str(cboost$predict())
 ```
 
-To obtain a vector of selected learner just call `getSelectedBaselearner()`
+To obtain a vector of the selected base-learners, just call `$getSelectedBaselearner()`:
 
 ```{r}
 table(cboost$getSelectedBaselearner())
 ```
 
-We can also access predictions directly from the response object `cboost$response` and `cboost$response_oob`. Note that `$response_oob` was created automatically when defining an `oob_fraction` within the constructor:
+We can also access the predictions directly from the response objects `cboost$response` and `cboost$response_oob`. Note that `$response_oob` was created automatically when defining an `oob_fraction` within the constructor:
 
 ```{r}
 oob_label = cboost$response_oob$getResponse()
 oob_pred = cboost$response_oob$getPredictionResponse()
@@ -130,7 +132,7 @@ cboost$plotInbagVsOobRisk()
 
 ## Retrain the Model
 
-To set the whole model to another iteration one can easily call `train()` to another iteration:
+To set the whole model to another iteration one can again call `$train()`. The model is then set back to an already seen iteration if the new number is smaller than the number of iterations already trained; otherwise, additional base-learners are fitted until the new number is reached:
 
 ```{r, warnings=FALSE}
 cboost$train(3000)
@@ -143,7 +145,7 @@ table(cboost$getSelectedBaselearner())
 
 ## Visualizing Base-Learner
 
-To visualize a base-learner it is important to exactly use a name from `getBaselearnerNames()`:
+To visualize a base-learner it is important to use a name exactly as returned by `$getBaselearnerNames()`:
 
 ```{r, eval=FALSE}
 gg1 = cboost$plot("Age_spline")
 gg2 = cboost$plot("Age_spline", iters = c(50, 100, 500, 1000, 1500))