Sparse matrix conversion not working with XGboost #690

@tolliam

I adapted the example in this tidyverse blog post to use boost_tree with xgboost instead of logistic_reg with glmnet, hoping to get the speed boost that comes from passing XGBoost a sparse matrix rather than a tibble. However, the sparse-matrix version of the workflow does not return the same predictions as the tibble version, unlike in the lasso example. The predictions from the sparse-matrix version are worse (accuracy 0.474 versus 0.743 in the resampling results below), and I suspect they are systematically wrong.

Could this be related to how factor levels are translated, or something similar? I first ran into the problem with my own data set, but it also occurs with the small_fine_foods dataset.

The code below is very similar to what is in the blog.

library(tidyverse)
library(tidymodels)

data("small_fine_foods")
training_data
#> # A tibble: 4,000 × 3
#>    product    review                                                       score
#>    <chr>      <chr>                                                        <fct>
#>  1 B000J0LSBG "this stuff is  not stuffing  its  not good at all  save yo… other
#>  2 B000EYLDYE "I absolutely LOVE this dried fruit.  LOVE IT.  Whenever I … great
#>  3 B0026LIO9A "GREAT DEAL, CONVENIENT TOO.  Much cheaper than WalMart and… great
#>  4 B00473P8SK "Great flavor, we go through a ton of this sauce! I discove… great
#>  5 B001SAWTNM "This is excellent salsa/hot sauce, but you can get it for … great
#>  6 B000FAG90U "Again, this is the best dogfood out there.  One suggestion… great
#>  7 B006BXTCEK "The box I received was filled with teas, hot chocolates, a… other
#>  8 B002GWH5OY "This is delicious coffee which compares favorably with muc… great
#>  9 B003R0MFYY "Don't let these little tiny cans fool you.  They pack a lo… great
#> 10 B001EO5ZXI "One of the nicest, smoothest cup of chai I've made. Nice m… great
#> # … with 3,990 more rows

library(hardhat)
#> Warning: package 'hardhat' was built under R version 4.1.2
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

library(textrecipes)

text_rec <-
  recipe(score ~ review, data = training_data) %>%
  step_tokenize(review)  %>%
  step_tokenfilter(review, max_tokens = 1e3) %>%
  step_tfidf(review)

xgb_spec <-
  boost_tree(mode = "classification") %>%
  set_engine("xgboost")

wf_sparse <- 
  workflow() %>%
  add_recipe(text_rec, blueprint = sparse_bp) %>%
  add_model(xgb_spec)

wf_default <- 
  workflow() %>%
  add_recipe(text_rec) %>%
  add_model(xgb_spec)

set.seed(123)
food_folds <- vfold_cv(training_data, v = 3)

results <- bench::mark(
  iterations = 1, check = FALSE,
  sparse = fit_resamples(wf_sparse, food_folds),
  default = fit_resamples(wf_default, food_folds)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

fit_resamples(wf_sparse, food_folds) %>%
  collect_metrics()
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.474     3  0.0492 Preprocessor1_Model1
#> 2 roc_auc  binary     0.533     3  0.0426 Preprocessor1_Model1

fit_resamples(wf_default, food_folds) %>%
  collect_metrics()
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.743     3 0.00416 Preprocessor1_Model1
#> 2 roc_auc  binary     0.780     3 0.00521 Preprocessor1_Model1
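
One way to narrow this down might be to check whether the two workflows hand the same predictor values to the engine, which would separate a preprocessing problem from a model-fitting one. The sketch below is only a diagnostic idea, not a confirmed cause or fix: `same_predictors()` is a hypothetical helper name, and the commented usage assumes a version of workflows that provides `extract_mold()`.

```r
library(Matrix)

# Hypothetical helper: TRUE when a sparse and a dense predictor matrix hold
# the same values in the same columns.
same_predictors <- function(x_sparse, x_dense) {
  # Densify both inputs so they can be compared element by element.
  x1 <- as.matrix(x_sparse)
  x2 <- as.matrix(x_dense)

  identical(dim(x1), dim(x2)) &&
    identical(colnames(x1), colnames(x2)) &&
    isTRUE(all.equal(x1, x2, check.attributes = FALSE))
}

# Usage with the objects defined in the code above (assumes extract_mold()
# is available in the installed workflows version):
# fit_sparse  <- fit(wf_sparse, training_data)
# fit_default <- fit(wf_default, training_data)
# same_predictors(
#   extract_mold(fit_sparse)$predictors,
#   extract_mold(fit_default)$predictors
# )
```

If this returned TRUE, the preprocessed predictors would match and the discrepancy would more likely arise when the data is passed to xgboost; if FALSE, the blueprint or recipe output would be the place to look.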

Labels: bug (an unexpected problem or unintended behavior)