Sparse matrix conversion not working with XGBoost #690

I have tried to adapt the example in [this tidyverse blog post](https://www.tidyverse.org/blog/2020/11/tidymodels-sparse-support/) to use `boost_tree()` with the xgboost engine instead of `logistic_reg()` with glmnet. I am keen to get the speed boost that comes from passing XGBoost a sparse matrix rather than a tibble. However, the sparse matrix version of the workflow does not return the same predictions as the tibble version, unlike in the lasso example. The predictions from the sparse matrix version are worse, and I suspect they are systematically wrong.

I wonder if this problem relates to translating factor levels or something like that? I first discovered the problem with my own data set, but it also occurs with the `small_fine_foods` data set.

The code below is very similar to what is in [the blog](https://www.tidyverse.org/blog/2020/11/tidymodels-sparse-support/).

```r
library(tidyverse)
library(tidymodels)

data("small_fine_foods")
training_data
#> # A tibble: 4,000 × 3
#>    product    review                                                       score
#>    <chr>      <chr>                                                        <fct>
#>  1 B000J0LSBG "this stuff is  not stuffing  its  not good at all  save yo… other
#>  2 B000EYLDYE "I absolutely LOVE this dried fruit.  LOVE IT.  Whenever I … great
#>  3 B0026LIO9A "GREAT DEAL, CONVENIENT TOO.  Much cheaper than WalMart and… great
#>  4 B00473P8SK "Great flavor, we go through a ton of this sauce! I discove… great
#>  5 B001SAWTNM "This is excellent salsa/hot sauce, but you can get it for … great
#>  6 B000FAG90U "Again, this is the best dogfood out there.  One suggestion… great
#>  7 B006BXTCEK "The box I received was filled with teas, hot chocolates, a… other
#>  8 B002GWH5OY "This is delicious coffee which compares favorably with muc… great
#>  9 B003R0MFYY "Don't let these little tiny cans fool you.  They pack a lo… great
#> 10 B001EO5ZXI "One of the nicest, smoothest cup of chai I've made. Nice m… great
#> # … with 3,990 more rows

library(hardhat)
#> Warning: package 'hardhat' was built under R version 4.1.2
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

library(textrecipes)

text_rec <-
  recipe(score ~ review, data = training_data) %>%
  step_tokenize(review)  %>%
  step_tokenfilter(review, max_tokens = 1e3) %>%
  step_tfidf(review)

xgb_spec <-
  boost_tree(mode = "classification") %>%
  set_engine("xgboost")

wf_sparse <- 
  workflow() %>%
  add_recipe(text_rec, blueprint = sparse_bp) %>%
  add_model(xgb_spec)

wf_default <- 
  workflow() %>%
  add_recipe(text_rec) %>%
  add_model(xgb_spec)

set.seed(123)
food_folds <- vfold_cv(training_data, v = 3)

results <- bench::mark(
  iterations = 1, check = FALSE,
  sparse = fit_resamples(wf_sparse, food_folds),  
  default = fit_resamples(wf_default, food_folds), 
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

fit_resamples(wf_sparse, food_folds) %>%
  collect_metrics()
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.474     3  0.0492 Preprocessor1_Model1
#> 2 roc_auc  binary     0.533     3  0.0426 Preprocessor1_Model1

fit_resamples(wf_default, food_folds) %>%
  collect_metrics()
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.743     3 0.00416 Preprocessor1_Model1
#> 2 roc_auc  binary     0.780     3 0.00521 Preprocessor1_Model1
```
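One way to narrow this down might be to check whether the two blueprints hand the model the same preprocessed data. The sketch below uses `hardhat::mold()` on a small made-up data frame (`toy`, `x1`, `x2` are hypothetical stand-ins, not part of the original reprex) and compares the dense and sparse predictors value by value. If they match on the real data too, the discrepancy would have to arise later, when the data are passed to the model.

```r
library(recipes)
library(hardhat)
library(Matrix)

# Hypothetical toy data standing in for the tf-idf features.
toy <- data.frame(
  y  = factor(c("great", "other", "great", "other")),
  x1 = c(0, 1.5, 0, 2),
  x2 = c(3, 0, 0, 1)
)

rec <- recipe(y ~ ., data = toy)
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

# Preprocess the same data under the default and the sparse blueprint.
dense_mold  <- mold(rec, toy)
sparse_mold <- mold(rec, toy, blueprint = sparse_bp)

# Convert both predictor sets to plain matrices and compare the values.
dense_x  <- as.matrix(dense_mold$predictors)
sparse_x <- as.matrix(sparse_mold$predictors)

same_predictors <- isTRUE(
  all.equal(dense_x, sparse_x, check.attributes = FALSE)
)

# Compare the factor levels of the outcome under both blueprints.
same_levels <- identical(
  levels(dense_mold$outcomes$y),
  levels(sparse_mold$outcomes$y)
)

print(c(same_predictors = same_predictors, same_levels = same_levels))
```

To run the same check against the reprex, `rec` and `toy` would be swapped for `text_rec` and `training_data`, and `y` for `score`.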

