
Documentation | Contributors | Release Notes

# compboost: Fast and Flexible Component-Wise Boosting Framework

Component-wise boosting applies the boosting framework to statistical models, e.g., generalized additive models using component-wise smoothing splines. Boosting these kinds of models maintains interpretability and enables unbiased model selection in high-dimensional feature spaces.
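
As a brief sketch of the underlying idea (generic notation, not taken from the package documentation): in each iteration, every base-learner is fit to the current pseudo-residuals, and only the best-fitting one is added to the model, scaled by the learning rate.

```latex
% Component-wise boosting update (generic notation): in iteration m, the
% base-learner with the best fit to the pseudo-residuals r^{[m]} is
% selected and added with learning rate \nu.
\hat{f}^{[m]}(x) = \hat{f}^{[m-1]}(x) + \nu \cdot \hat{b}_{j^*}^{[m]}\big(x_{j^*}\big),
\qquad
j^* = \operatorname*{arg\,min}_{j} \sum_{i=1}^{n} \Big( r_i^{[m]} - \hat{b}_j^{[m]}(x_{i,j}) \Big)^2
```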

The R package compboost is an alternative implementation of component-wise boosting written in C++ to obtain high runtime performance and full memory control. The main idea is to provide a modular class system that can be extended without editing the source code. Therefore, it is possible to use R functions as well as C++ functions to define custom base-learners, losses, logging mechanisms, or stopping criteria.
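
To illustrate the extension mechanism, here is a minimal sketch of a custom loss written as plain R functions. It assumes a `LossCustom` class that takes a loss function, its gradient, and a constant-initializer function, as described in the package vignettes; check `?LossCustom` for the exact constructor signature.

```r
# Sketch of a custom quadratic loss defined in R (LossCustom and its
# three-argument constructor are assumptions; verify with ?LossCustom).
lossQuadratic = function (truth, prediction) {
  0.5 * (truth - prediction)^2
}
# Gradient of the loss with respect to the prediction:
gradQuadratic = function (truth, prediction) {
  prediction - truth
}
# Loss-optimal constant initialization, here the mean of the target:
constInitQuadratic = function (truth) {
  mean(truth)
}
my.loss = LossCustom$new(lossQuadratic, gradQuadratic, constInitQuadratic)
```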

For an introduction to and an overview of the functionality, visit the project page.

## Installation

CRAN version:

```r
install.packages("compboost")
```

Developer version:

```r
devtools::install_github("schalkdaniel/compboost")
```

## Examples

These examples were rendered using compboost 0.1.0.

To be as flexible as possible, one should use the R6 API to define base-learners, losses, stopping criteria, or optimizers as desired. Another option is to use the wrapper functions described on the project page, as sketched below.
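
For example, a hedged sketch of the wrapper route, assuming a `boostSplines()` function that registers a spline base-learner for every feature; the argument names here are illustrative, so see the project page for the actual interface:

```r
# Sketch: one call that adds a spline base-learner per feature and trains
# the model (boostSplines() and its arguments are assumptions; the data
# set is the one loaded in the walkthrough below).
cboost = boostSplines(data = PimaIndiansDiabetes, target = "diabetes",
  loss = LossBinomial$new(), iterations = 1000L)
```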

```r
library(compboost)

# Check installed version:
packageVersion("compboost")
#> [1] '0.1.0'

# Load data set with binary classification task:
data(PimaIndiansDiabetes, package = "mlbench")
# Create categorical feature:
PimaIndiansDiabetes$pregnant.cat = ifelse(PimaIndiansDiabetes$pregnant == 0, "no", "yes")

# Define Compboost object:
cboost = Compboost$new(data = PimaIndiansDiabetes, target = "diabetes", loss = LossBinomial$new())
cboost
#> Component-Wise Gradient Boosting
#> 
#> Trained on PimaIndiansDiabetes with target diabetes
#> Number of base-learners: 0
#> Learning rate: 0.05
#> Iterations: 0
#> Positive class: neg
#> 
#> LossBinomial Loss:
#> 
#>   Loss function: L(y,x) = log(1 + exp(-2yf(x))
#> 
#> 

# Add p-spline base-learner with default parameter:
cboost$addBaselearner(feature = "pressure", id = "spline", bl.factory = BaselearnerPSpline)

# Add another p-spline learner with custom parameters:
cboost$addBaselearner(feature = "age", id = "spline", bl.factory = BaselearnerPSpline, degree = 3, 
  n.knots = 10, penalty = 4, differences = 2)

# Add categorical feature (as single linear base-learner):
cboost$addBaselearner(feature = "pregnant.cat", id = "category", bl.factory = BaselearnerPolynomial,
  degree = 1, intercept = FALSE)

# Check all registered base-learner:
cboost$getBaselearnerNames()
#> [1] "pressure_spline"           "age_spline"               
#> [3] "pregnant.cat_yes_category" "pregnant.cat_no_category"

# Train model:
cboost$train(1000L, trace = 200L)
#>    1/1000: risk = 0.66
#>  200/1000: risk = 0.58
#>  400/1000: risk = 0.57
#>  600/1000: risk = 0.57
#>  800/1000: risk = 0.57
#> 1000/1000: risk = 0.57
#> 
#> 
#> Train 1000 iterations in 0 Seconds.
#> Final risk based on the train set: 0.57
cboost
#> Component-Wise Gradient Boosting
#> 
#> Trained on PimaIndiansDiabetes with target diabetes
#> Number of base-learners: 4
#> Learning rate: 0.05
#> Iterations: 1000
#> Positive class: neg
#> Offset: 0.3118
#> 
#> LossBinomial Loss:
#> 
#>   Loss function: L(y,x) = log(1 + exp(-2yf(x))
#> 
#> 

cboost$getBaselearnerNames()
#> [1] "pressure_spline"           "age_spline"               
#> [3] "pregnant.cat_yes_category" "pregnant.cat_no_category"

selected.features = cboost$getSelectedBaselearner()
table(selected.features)
#> selected.features
#>               age_spline pregnant.cat_no_category          pressure_spline 
#>                      434                      150                      416

params = cboost$getEstimatedCoef()
str(params)
#> List of 4
#>  $ age_spline              : num [1:14, 1] 2.99 1.501 0.588 -0.535 -0.119 ...
#>  $ pregnant.cat_no_category: num [1, 1] -0.299
#>  $ pressure_spline         : num [1:24, 1] -0.8087 -0.4274 -0.0602 0.2226 0.3368 ...
#>  $ offset                  : num 0.312
```

```r
cboost$train(3000)
#> 
#> You have already trained 1000 iterations.
#> Train 2000 additional iterations.
```

```r
cboost$plot("age_spline", iters = c(100, 500, 1000, 2000, 3000)) +
  ggthemes::theme_tufte() + 
  ggplot2::scale_color_brewer(palette = "Spectral")
```
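
After training, predictions can be obtained from the same object. A minimal sketch, assuming the R6 `predict()` method accepts an optional `newdata` data frame (check the `Compboost` class documentation):

```r
# Sketch: scores on the training data and on new observations
# (the newdata argument is an assumption; see the Compboost class docs).
pred.train = cboost$predict()
pred.new = cboost$predict(newdata = PimaIndiansDiabetes[1:5, ])
```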

## Benchmark

To get an idea of the performance of compboost, we conducted a small benchmark comparing compboost with mboost with respect to runtime behavior and memory consumption. The results of the benchmark can be read here.

## Citing

To cite compboost in publications, please use:

Schalk et al., (2018). compboost: Modular Framework for Component-Wise Boosting. Journal of Open Source Software, 3(30), 967, https://doi.org/10.21105/joss.00967

```bibtex
@article{schalk2018compboost,
  author = {Daniel Schalk and Janek Thomas and Bernd Bischl},
  title = {compboost: Modular Framework for Component-Wise Boosting},
  url = {https://doi.org/10.21105/joss.00967},
  year = {2018},
  publisher = {Journal of Open Source Software},
  volume = {3},
  number = {30},
  pages = {967},
  journal = {JOSS}
}
```

## Testing

### On your local machine

To test the package functionality, you can use devtools on your local machine:

```r
devtools::test()
```

### Using Docker

You can test the package locally using Docker and the compboost-test repository:

- Latest R release:

  ```bash
  docker run schalkdaniel/compboost-test
  ```

- Latest R devel build:

  ```bash
  docker run schalkdaniel/compboost-test:devel
  ```