Description
I have attempted to adapt the example in this tidyverse blog post to use boost_tree with xgboost rather than logistic_reg with glmnet. I am keen to get the speed boost that comes from passing XGBoost a sparse matrix rather than a tibble. However, what I find is that the sparse-matrix version of the workflow does not return the same predictions as the tibble version, unlike in the lasso example. The predictions coming out of the sparse-matrix version are worse, and I suspect systematically wrong.
I wonder if this problem relates to translating factor levels or something like that? (A sketch probing that hypothesis appears after the reprex below.) I first discovered this problem with my own data set, but it also occurs with the small_fine_foods dataset.
The code below is very similar to what is in the blog post.
library(tidyverse)
library(tidymodels)
data("small_fine_foods")
training_data
#> # A tibble: 4,000 × 3
#> product review score
#> <chr> <chr> <fct>
#> 1 B000J0LSBG "this stuff is not stuffing its not good at all save yo… other
#> 2 B000EYLDYE "I absolutely LOVE this dried fruit. LOVE IT. Whenever I … great
#> 3 B0026LIO9A "GREAT DEAL, CONVENIENT TOO. Much cheaper than WalMart and… great
#> 4 B00473P8SK "Great flavor, we go through a ton of this sauce! I discove… great
#> 5 B001SAWTNM "This is excellent salsa/hot sauce, but you can get it for … great
#> 6 B000FAG90U "Again, this is the best dogfood out there. One suggestion… great
#> 7 B006BXTCEK "The box I received was filled with teas, hot chocolates, a… other
#> 8 B002GWH5OY "This is delicious coffee which compares favorably with muc… great
#> 9 B003R0MFYY "Don't let these little tiny cans fool you. They pack a lo… great
#> 10 B001EO5ZXI "One of the nicest, smoothest cup of chai I've made. Nice m… great
#> # … with 3,990 more rows
library(hardhat)
#> Warning: package 'hardhat' was built under R version 4.1.2
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")
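For anyone unfamiliar with blueprints: the only thing changed from the default here is the composition, which asks hardhat to hand the parsnip model a sparse dgCMatrix instead of a tibble. As a quick sanity check (a sketch; I am assuming the composition field is where the blueprint object stores this setting), it can be read back off the object:
sparse_bp$composition  # should report "dgCMatrix"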
library(textrecipes)
text_rec <-
recipe(score ~ review, data = training_data) %>%
step_tokenize(review) %>%
step_tokenfilter(review, max_tokens = 1e3) %>%
step_tfidf(review)
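To rule out the preprocessing itself, one check (a diagnostic sketch, not part of the original reprex) is to prep the recipe once and bake it into both compositions; the tf-idf values should be numerically identical regardless of composition.
prepped <- prep(text_rec, training = training_data)
dense_mat <- bake(prepped, new_data = NULL, all_predictors(), composition = "matrix")
sparse_mat <- bake(prepped, new_data = NULL, all_predictors(), composition = "dgCMatrix")
max(abs(dense_mat - as.matrix(sparse_mat)))  # 0 would mean the two compositions agree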
xgb_spec <-
boost_tree(mode = "classification") %>%
set_engine("xgboost")
wf_sparse <-
workflow() %>%
add_recipe(text_rec, blueprint = sparse_bp) %>%
add_model(xgb_spec)
wf_default <-
workflow() %>%
add_recipe(text_rec) %>%
add_model(xgb_spec)
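The divergence can also be seen without resampling (again a sketch, not from the blog post): fit both workflows to the same training data, where xgboost should be essentially deterministic given identical inputs, and compare the predictions row by row.
fit_sparse <- fit(wf_sparse, training_data)
fit_default <- fit(wf_default, training_data)
mean(
  predict(fit_sparse, training_data)$.pred_class ==
    predict(fit_default, training_data)$.pred_class
)  # anything well below 1 reproduces the discrepancy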
set.seed(123)
food_folds <- vfold_cv(training_data, v = 3)
results <- bench::mark(
iterations = 1, check = FALSE,
sparse = fit_resamples(wf_sparse, food_folds),
default = fit_resamples(wf_default, food_folds),
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
fit_resamples(wf_sparse, food_folds) %>%
collect_metrics()
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 accuracy binary 0.474 3 0.0492 Preprocessor1_Model1
#> 2 roc_auc binary 0.533 3 0.0426 Preprocessor1_Model1
fit_resamples(wf_default, food_folds) %>%
collect_metrics()
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 accuracy binary 0.743 3 0.00416 Preprocessor1_Model1
#> 2 roc_auc binary 0.780 3 0.00521 Preprocessor1_Model1
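To probe the factor-level hypothesis, the two fits from the sketch above can be inspected directly. This is hedged too: I am relying on parsnip keeping the outcome levels in the lvl element of its fit object, and on xgboost storing its training parameters in params on the booster.
extract_fit_parsnip(fit_sparse)$lvl     # outcome levels as parsnip recorded them
extract_fit_parsnip(fit_default)$lvl    # the two orderings should be identical
extract_fit_engine(fit_sparse)$params$objective   # both should be binary:logistic
extract_fit_engine(fit_default)$params$objective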