Merge pull request #436 from tidymodels/xgboost-event-level
Set event level for xgboost
hfrick committed Mar 31, 2021
2 parents 8238826 + 221ea10 commit 30dc026
Showing 8 changed files with 123 additions and 22 deletions.
2 changes: 2 additions & 0 deletions NEWS.md
@@ -22,6 +22,8 @@

* Column names for `x` are now required when `fit_xy()` is used. (#398)

* There is now an `event_level` argument for the `xgboost` engine. (#420)

* New mode "censored regression" and new prediction types "linear_pred", "time", "survival", "hazard". (#396)

* Censored regression models cannot use `fit_xy()` (use `fit()`). (#442)
23 changes: 19 additions & 4 deletions R/boost_tree.R
@@ -305,6 +305,9 @@ check_args.boost_tree <- function(object) {
#' @param objective A single string (or NULL) that defines the loss function that
#' `xgboost` uses to create trees. See [xgboost::xgb.train()] for options. If left
#' NULL, an appropriate loss function is chosen.
#' @param event_level For binary classification, this is a single string of either
#' `"first"` or `"second"` that specifies which level of the outcome should be
#' considered the "event".
#' @param ... Other options to pass to `xgb.train`.
#' @return A fitted `xgboost` object.
#' @keywords internal
@@ -313,8 +316,11 @@ xgb_train <- function(
  x, y,
  max_depth = 6, nrounds = 15, eta = 0.3, colsample_bytree = 1,
  min_child_weight = 1, gamma = 0, subsample = 1, validation = 0,
  early_stop = NULL, objective = NULL, ...) {
  early_stop = NULL, objective = NULL,
  event_level = c("first", "second"),
  ...) {

  event_level <- rlang::arg_match(event_level, c("first", "second"))
  others <- list(...)

  num_class <- length(levels(y))
@@ -347,7 +353,7 @@
  n <- nrow(x)
  p <- ncol(x)

  x <- as_xgb_data(x, y, validation)
  x <- as_xgb_data(x, y, validation, event_level)

  # translate `subsample` and `colsample_bytree` to be on (0, 1] if not
  if (subsample > 1) {
@@ -427,7 +433,7 @@ xgb_pred <- function(object, newdata, ...) {
}


as_xgb_data <- function(x, y, validation = 0, ...) {
as_xgb_data <- function(x, y, validation = 0, event_level = "first", ...) {
  lvls <- levels(y)
  n <- nrow(x)

@@ -436,7 +442,16 @@ as_xgb_data <- function(x, y, validation = 0, ...) {
  }

  if (is.factor(y)) {
    y <- as.numeric(y) - 1
    if (length(lvls) < 3) {
      if (event_level == "first") {
        y <- -as.numeric(y) + 2
      } else {
        y <- as.numeric(y) - 1
      }
    } else {
      if (event_level == "second") rlang::warn("`event_level` can only be set for binary variables.")
      y <- as.numeric(y) - 1
    }
  }

  if (!inherits(x, "xgb.DMatrix")) {
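For a two-level factor, the arithmetic above encodes whichever level is chosen as the event as the label 1 that xgboost expects. A minimal standalone sketch of that mapping (an editorial illustration, not part of the diff; the factor values are made up):

y <- factor(c("low", "high", "low"), levels = c("low", "high"))
-as.numeric(y) + 2   # event_level = "first":  "low" -> 1, "high" -> 0
as.numeric(y) - 1    # event_level = "second": "low" -> 0, "high" -> 1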
5 changes: 4 additions & 1 deletion man/boost_tree.Rd

Some generated files are not rendered by default.

4 changes: 3 additions & 1 deletion man/rmd/boost-tree.Rmd
@@ -41,7 +41,9 @@ mod_param <-
For this engine, tuning over `trees` is very efficient since the same model
object can be used to make predictions over multiple values of `trees`.

Finally, note that `xgboost` models require that non-numeric predictors (e.g., factors) must be converted to dummy variables or some other numeric representation. By default, when using `fit()` with `xgboost`, a one-hot encoding is used to convert factor predictors to indicator variables.
Note that `xgboost` models require that non-numeric predictors (e.g., factors) be converted to dummy variables or some other numeric representation. By default, when using `fit()` with `xgboost`, a one-hot encoding is used to convert factor predictors to indicator variables.

Finally, in classification mode, non-numeric outcomes (i.e., factors) are converted to numeric. For binary classification, the `event_level` argument of `set_engine()` can be set to either `"first"` or `"second"` to specify which level should be used as the event. This can be helpful when a watchlist is used to monitor performance from within the xgboost training process.
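For example, here is a minimal sketch of setting the option via `set_engine()` (it mirrors the test added in this pull request and assumes the penguins data from the modeldata package, with the binary factor `sex` as the outcome; `event_level = "second"` treats the second level of `sex` as the event):

library(parsnip)
data(penguins, package = "modeldata")
penguins <- na.omit(penguins[, -c(1:2)])

boost_tree(trees = 10, tree_depth = 3) %>%
  set_mode("classification") %>%
  set_engine("xgboost", eval_metric = "aucpr", event_level = "second") %>%
  fit(sex ~ ., data = penguins)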


## C5.0
5 changes: 5 additions & 0 deletions man/xgb_train.Rd

Some generated files are not rendered by default.

55 changes: 39 additions & 16 deletions revdep/README.md
@@ -3,44 +3,67 @@
|field |value |
|:--------|:----------------------------|
|version |R version 4.0.3 (2020-10-10) |
|os |macOS Catalina 10.15.5 |
|system |x86_64, darwin17.0 |
|ui |RStudio |
|language |(EN) |
|collate |en_US.UTF-8 |
|ctype |en_US.UTF-8 |
|tz |America/New_York |
|date |2021-01-19 |
|os |Ubuntu 18.04.5 LTS |
|system |x86_64, linux-gnu |
|ui |X11 |
|language |en |
|collate |en_GB.UTF-8 |
|ctype |en_GB.UTF-8 |
|tz |Europe/London |
|date |2021-02-23 |

# Dependencies

|package |old |new |Δ |
|:-----------|:------|:----------|:--|
|parsnip |0.1.4 |0.1.4.9000 |* |
|parsnip |0.1.5 |0.1.5.9000 |* |
|assertthat |0.2.1 |0.2.1 | |
|cli |2.2.0 |2.2.0 | |
|cpp11 |0.2.5 |0.2.5 | |
|crayon |1.3.4 |1.3.4 | |
|cli |2.3.1 |2.3.1 | |
|cpp11 |0.2.6 |0.2.6 | |
|crayon |1.4.1 |1.4.1 | |
|digest |0.6.27 |0.6.27 | |
|dplyr |1.0.3 |1.0.3 | |
|dplyr |1.0.4 |1.0.4 | |
|ellipsis |0.3.1 |0.3.1 | |
|fansi |0.4.2 |0.4.2 | |
|generics |0.1.0 |0.1.0 | |
|globals |0.14.0 |0.14.0 | |
|glue |1.4.2 |1.4.2 | |
|lifecycle |0.2.0 |0.2.0 | |
|lifecycle |1.0.0 |1.0.0 | |
|magrittr |2.0.1 |2.0.1 | |
|pillar |1.4.7 |1.4.7 | |
|pillar |1.5.0 |1.5.0 | |
|pkgconfig |2.0.3 |2.0.3 | |
|prettyunits |1.1.1 |1.1.1 | |
|purrr |0.3.4 |0.3.4 | |
|R6 |2.5.0 |2.5.0 | |
|rlang |0.4.10 |0.4.10 | |
|tibble |3.0.5 |3.0.5 | |
|tibble |3.0.6 |3.0.6 | |
|tidyr |1.1.2 |1.1.2 | |
|tidyselect |1.1.0 |1.1.0 | |
|utf8 |1.1.4 |1.1.4 | |
|vctrs |0.3.6 |0.3.6 | |

# Revdeps

## Failed to check (18)

|package |version |error |warning |note |
|:------------------|:-------|:-----|:-------|:----|
|baguette |? | | | |
|bayesian |? | | | |
|butcher |? | | | |
|coefplot |? | | | |
|condvis2 |? | | | |
|discrim |? | | | |
|easyalluvial |? | | | |
|finetune |? | | | |
|insight |? | | | |
|modeltime |? | | | |
|modeltime.ensemble |? | | | |
|modeltime.gluonts |? | | | |
|modeltime.resample |? | | | |
|plsmod |? | | | |
|SSLR |? | | | |
|tabnet |? | | | |
|timetk |? | | | |
|vip |? | | | |

Binary file modified revdep/data.sqlite
Binary file not shown.
51 changes: 51 additions & 0 deletions tests/testthat/test_boost_tree_xgboost.R
@@ -350,6 +350,18 @@ test_that('xgboost data conversion', {
  expect_true(inherits(from_mat$watchlist$validation, "xgb.DMatrix"))
  expect_true(nrow(from_sparse$data) > nrow(from_sparse$watchlist$validation))

  # set event_level for factors

  mtcars_y <- factor(mtcars$mpg < 15, levels = c(TRUE, FALSE), labels = c("low", "high"))
  expect_error(from_df <- parsnip:::as_xgb_data(mtcar_x, mtcars_y), regexp = NA)
  expect_equal(xgboost::getinfo(from_df$data, name = "label")[1:5], rep(0, 5))
  expect_error(from_df <- parsnip:::as_xgb_data(mtcar_x, mtcars_y, event_level = "second"), regexp = NA)
  expect_equal(xgboost::getinfo(from_df$data, name = "label")[1:5], rep(1, 5))

  mtcars_y <- factor(mtcars$mpg < 15, levels = c(TRUE, FALSE, "na"), labels = c("low", "high", "missing"))
  expect_warning(from_df <- parsnip:::as_xgb_data(mtcar_x, mtcars_y, event_level = "second"),
                 regexp = "`event_level` can only be set for binary variables.")

})


@@ -410,4 +422,43 @@ test_that('argument checks for data dimensions', {

})

test_that("set `event_level` as engine-specific argument", {

skip_if_not_installed("xgboost")

data(penguins, package = "modeldata")
penguins <- na.omit(penguins[, -c(1:2)])

spec <-
boost_tree(trees = 10, tree_depth = 3) %>%
set_engine(
"xgboost",
eval_metric = "aucpr",
event_level = "second",
verbose = 1
) %>%
set_mode("classification")

set.seed(24)
fit_p <- spec %>% fit(sex ~ ., data = penguins)

penguins_x <- as.matrix(penguins[, -5])
penguins_y <- as.numeric(penguins$sex) - 1
xgbmat <- xgb.DMatrix(data = penguins_x, label = penguins_y)

set.seed(24)
fit_xgb <- xgboost::xgb.train(data = xgbmat,
params = list(eta = 0.3, max_depth = 3,
gamma = 0, colsample_bytree = 1,
min_child_weight = 1,
subsample = 1),
nrounds = 10,
watchlist = list("training" = xgbmat),
objective = "binary:logistic",
verbose = 1,
eval_metric = "aucpr",
nthread = 1)

expect_equal(fit_p$fit$evaluation_log, fit_xgb$evaluation_log)

})
