Closed
Labels
bug: an unexpected problem or unintended behavior
Description
The problem
I consistently get different results from boost_tree() and xgb.train() in classification mode. Looking through the source code, I noticed that y is converted from factor to numeric via y <- as.numeric(y) - 1, which codes the first factor level as 0 and the second factor level as 1. This is confusing because tidymodels defaults to the first factor level as the event class, yet when the xgboost model is trained it is the second factor level that is represented as 1.
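A minimal base R sketch of the conversion described above. The factor `y` and its values are made up for illustration, and `label_first_as_event` is only one possible alternative coding, not something taken from the parsnip source:

```r
# Hypothetical two-level outcome with 'pos' as the first (event) level
y <- factor(c("pos", "neg", "pos"), levels = c("pos", "neg"))

# Conversion as described in this issue: first level -> 0, second level -> 1
label_current <- as.numeric(y) - 1
print(label_current)         # 0 1 0, so 'neg' ends up as the positive label

# Assumed alternative: first level -> 1, every other level -> 0
label_first_as_event <- as.numeric(y == levels(y)[1])
print(label_first_as_event)  # 1 0 1, so 'pos' is the positive label
```

With the first coding, any metric that xgboost evaluates on the label 1 refers to the second factor level, not the first.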
Reproducible example
library(tidyverse)
library(tidymodels)
library(mlbench)
library(xgboost)
data("PimaIndiansDiabetes")
set.seed(24)
df <- PimaIndiansDiabetes %>%
  mutate(diabetes = fct_relevel(diabetes, 'pos'))
xgb_model_1 <-
  boost_tree(trees = 10,
             tree_depth = 3
  ) %>%
  set_engine('xgboost',
             eval_metric = 'aucpr',
             verbose = 1) %>%
  set_mode('classification')
# The model is using 'neg' as the relevant class.
# as.numeric(y) - 1 codes the first factor level ('pos') as 0 and the second ('neg') as 1,
# which reverses the classes if the first factor level is taken to be the relevant class.
as.numeric(df$diabetes) - 1
df$diabetes
xgb_model_1 %>%
  fit(diabetes ~ ., df)
# Expected result
x <- as.matrix(df[,-ncol(df)])
y <- if_else(as.numeric(df$diabetes) == 2, 0, 1)
xgbmat <- xgb.DMatrix(data = x, label = y)
set.seed(24)
xgboost::xgb.train(params = list(eta = 0.3, max_depth = 3, gamma = 0,
                                 colsample_bytree = 1, min_child_weight = 1, subsample = 1),
                   data = xgbmat, nrounds = 10, watchlist = list('train' = xgbmat), verbose = 1,
                   objective = "binary:logistic", eval_metric = "aucpr",
                   nthread = 1)