Related to #239—just a place to keep notes on the thought process for supporting sparse tibbles with formula preprocessors. In #245, we see:
library(tidymodels)

sparse_hotel_rates <- function() {
  # 99.2% sparsity
  hotel_rates <- modeldata::hotel_rates

  # Append `prefix` to each column name (e.g. "<level>_country")
  prefix_colnames <- function(x, prefix) {
    colnames(x) <- paste(colnames(x), prefix, sep = "_")
    x
  }

  # One-hot encode the three high-cardinality factors
  dummies_country <- hardhat::fct_encode_one_hot(hotel_rates$country)
  dummies_company <- hardhat::fct_encode_one_hot(hotel_rates$company)
  dummies_agent <- hardhat::fct_encode_one_hot(hotel_rates$agent)

  res <- cbind(
    hotel_rates["avg_price_per_room"],
    prefix_colnames(dummies_country, "country"),
    prefix_colnames(dummies_company, "company"),
    prefix_colnames(dummies_agent, "agent")
  )

  res <- as.matrix(res)
  Matrix::Matrix(res, sparse = TRUE)
}

hotel_data <- sparse_hotel_rates()
hotel_data <- sparsevctrs::coerce_to_sparse_tibble(hotel_data)

spec <- boost_tree() %>%
  set_mode("regression") %>%
  set_engine("xgboost")

form <- avg_price_per_room ~ .
rec <- recipe(form, data = hotel_data)
wflow <- workflow(spec = spec)

# Recipe preprocessor
system.time({fit(wflow %>% add_recipe(rec), hotel_data)})
#>    user  system elapsed
#>   0.255   0.014   0.269

# Formula preprocessor
system.time({fit(wflow %>% add_formula(form), hotel_data)})
#>    user  system elapsed
#>   3.847   0.039   3.905
Created on 2024-09-13 with reprex v2.1.1
In the formula preprocessor fit() evaluation, the data type conversions don't actually take a ton of time.

It's just that, with add_formula(), the x passed to parsnip::xgb_train() is a base (dense) matrix, whereas it's a dgCMatrix when passed with add_recipe(), and xgboost is much slower when data that ought to be sparse is dense.
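To make the matrix vs. dgCMatrix distinction concrete, here is a minimal sketch using only the Matrix package. The toy dimensions and density are made up for illustration and are not the hotel_rates data. It does not time xgboost; it only shows that densifying a sparse matrix keeps the same values while storing every zero explicitly.

```r
library(Matrix)

set.seed(1)

# Hypothetical toy data: 1,000 x 200 with roughly 1% non-zero entries
x_sparse <- rsparsematrix(nrow = 1000, ncol = 200, density = 0.01)

# Densify, as happens somewhere along the formula preprocessor path
x_dense <- as.matrix(x_sparse)

# Column-compressed sparse storage vs. a base R matrix
is(x_sparse, "dgCMatrix")
is.matrix(x_dense)

# Same values, very different memory footprint
identical(dim(x_sparse), dim(x_dense))
object.size(x_sparse)
object.size(x_dense)
```

Checking class(x) on the object that reaches the engine is a quick way to confirm which of the two paths a given preprocessor took.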