boost_tree() reverses 'event' class when converting to Xgb.DMatrix #420

@joeycouse

The problem

I've consistently found different results when using boost_tree() vs xgb.train() in classification mode. After looking through the source code, I noticed that y is converted from factor to numeric via y <- as.numeric(y) - 1, which codes the first factor level as 0 and the second factor level as 1. This is confusing because tidymodels defaults to the first factor level as the event class, but when the xgboost model is trained, the second factor level is the one represented as 1.
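To make the mapping concrete, here is a minimal base R sketch on a toy factor. The vector y and the names y_numeric and y_event_first are made up for illustration and are not taken from the parsnip source. It only shows what as.numeric(y) - 1 produces when 'pos' is the first level, next to one possible encoding that would mark the first level as the event.

```r
# Toy factor with 'pos' as the first level, mirroring fct_relevel(diabetes, 'pos')
y <- factor(c("pos", "neg", "pos"), levels = c("pos", "neg"))

# The conversion described above: first level -> 0, second level -> 1
y_numeric <- as.numeric(y) - 1
print(y_numeric)

# One possible alternative (an assumption, not the library's behavior):
# encode the first level as 1 so it is treated as the event class
y_event_first <- as.numeric(y == levels(y)[1])
print(y_event_first)
```

With the first encoding, 'pos' ends up as label 0, so a metric such as aucpr that treats label 1 as the positive class would be computed for 'neg'.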

Reproducible example

library(tidyverse)
library(tidymodels)
library(mlbench)
library(xgboost)

data("PimaIndiansDiabetes")

set.seed(24)
df <- PimaIndiansDiabetes %>%
  mutate(diabetes = fct_relevel(diabetes, 'pos'))

xgb_model_1 <- 
  boost_tree(trees = 10,
             tree_depth = 3
             ) %>%
  set_engine('xgboost', 
             eval_metric = 'aucpr',
             verbose = 1) %>%
  set_mode('classification')


# The model treats 'neg' as the event class.
# Converting the factor to numeric reverses the classes, assuming the first factor level is the intended event class.

as.numeric(df$diabetes) - 1
df$diabetes

xgb_model_1 %>%
  fit(diabetes ~ . , df)

# Expected result

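# Feature matrix: drop the outcome column (diabetes is the last column)
x <- as.matrix(df[, -ncol(df)])

# Encode the first factor level ('pos') as 1 so that it is the event class
y <- if_else(as.numeric(df$diabetes) == 2, 0, 1)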

xgbmat <- xgb.DMatrix(data = x, label = y)

set.seed(24)

xgboost::xgb.train(params = list(eta = 0.3, max_depth = 3, gamma = 0, 
    colsample_bytree = 1, min_child_weight = 1, subsample = 1), 
    data = xgbmat, nrounds = 10, watchlist = list('train' = xgbmat), verbose = 1, 
    objective = "binary:logistic", eval_metric = "aucpr", 
    nthread = 1)

Labels

bug: an unexpected problem or unintended behavior