Closed
Labels: bug (an unexpected problem or unintended behavior)
Description
The problem
I've consistently found different results when using `boost_tree()` vs `xgb.train()` in classification mode. After looking through the source code, I noticed that `y` is converted from factor to numeric via `y <- as.numeric(y) - 1`. This codes the first factor level as 0 and the second factor level as 1. This has been confusing because tidymodels defaults to the first factor level as the event class, but when the xgboost model is trained the second factor level is represented as 1.
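To make the coding concrete, here is a minimal base R sketch on a toy factor (not the Pima data). The names `label_as_converted` and `label_event_first` are hypothetical and used only for illustration; the second one shows one possible event-first coding, not what parsnip does.

```r
# Toy outcome with 'pos' as the first factor level (the tidymodels event level).
y <- factor(c("pos", "neg", "pos", "neg"), levels = c("pos", "neg"))

# Conversion quoted above: first level -> 0, second level -> 1.
label_as_converted <- as.numeric(y) - 1

# Hypothetical alternative for comparison: first level (the event) -> 1.
label_event_first <- as.numeric(y == levels(y)[1])

# Side-by-side view of the two codings.
print(data.frame(y, label_as_converted, label_event_first))
```

With the first coding, xgboost's `binary:logistic` objective models the probability of the second level, so any metric computed inside xgboost (such as `aucpr`) treats the second level as the positive class.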
Reproducible example
library(tidyverse)
library(tidymodels)
library(mlbench)
library(xgboost)
data("PimaIndiansDiabetes")
set.seed(24)
df <- PimaIndiansDiabetes %>%
  mutate(diabetes = fct_relevel(diabetes, 'pos'))
xgb_model_1 <-
  boost_tree(trees = 10,
             tree_depth = 3) %>%
  set_engine('xgboost',
             eval_metric = 'aucpr',
             verbose = 1) %>%
  set_mode('classification')
# The model is using 'neg' as the event class.
# as.numeric(y) - 1 codes the first level ('pos') as 0 and the second ('neg') as 1,
# which reverses the classes if the first factor level is meant to be the event.
as.numeric(df$diabetes) - 1
df$diabetes
xgb_model_1 %>%
  fit(diabetes ~ ., df)
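# Expected result
# Predictors: every column except the last one (diabetes)
x <- as.matrix(df[, -ncol(df)])
# Code the first level ('pos') as 1 and the second level ('neg') as 0
y <- if_else(as.numeric(df$diabetes) == 2, 0, 1)
xgbmat <- xgb.DMatrix(data = x, label = y)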
set.seed(24)
xgboost::xgb.train(params = list(eta = 0.3, max_depth = 3, gamma = 0,
                                 colsample_bytree = 1, min_child_weight = 1, subsample = 1),
                   data = xgbmat, nrounds = 10, watchlist = list('train' = xgbmat), verbose = 1,
                   objective = "binary:logistic", eval_metric = "aucpr",
                   nthread = 1)