
address fit() slowdown with sparse tibble and formula preprocessor #246

@simonpcouch

Related to #239. This is just a place to keep notes on the thought process for supporting sparse tibbles with formula preprocessors. In #245, we see:

library(tidymodels)
  
sparse_hotel_rates <- function() {
  # 99.2% sparsity
  hotel_rates <- modeldata::hotel_rates
  
  prefix_colnames <- function(x, prefix) {
    colnames(x) <- paste(colnames(x), prefix, sep = "_")
    x
  }
  
  dummies_country <- hardhat::fct_encode_one_hot(hotel_rates$country)
  dummies_company <- hardhat::fct_encode_one_hot(hotel_rates$company)
  dummies_agent <- hardhat::fct_encode_one_hot(hotel_rates$agent)
  
  res <- cbind(
    hotel_rates["avg_price_per_room"],
    prefix_colnames(dummies_country, "country"),
    prefix_colnames(dummies_company, "company"),
    prefix_colnames(dummies_agent, "agent")
  )
  
  res <- as.matrix(res)
  Matrix::Matrix(res, sparse = TRUE)
}

hotel_data <- sparse_hotel_rates()
hotel_data <- sparsevctrs::coerce_to_sparse_tibble(hotel_data)

spec <- boost_tree() %>%
  set_mode("regression") %>%
  set_engine("xgboost")

form <- avg_price_per_room ~ .
rec <- recipe(form, data = hotel_data)

wflow <- workflow(spec = spec)

system.time({fit(wflow %>% add_recipe(rec), hotel_data)})
#>    user  system elapsed 
#>   0.255   0.014   0.269
system.time({fit(wflow %>% add_formula(form), hotel_data)})
#>    user  system elapsed 
#>   3.847   0.039   3.905

Created on 2024-09-13 with reprex v2.1.1

In the formula preprocessor fit() evaluation, the data type conversions don't actually take a ton of time:

[Screenshot: 2024-09-13, 10:16:32 AM]

It's just that, with add_formula(), the x passed to parsnip::xgb_train() is a dense base matrix, whereas it's a dgCMatrix when the preprocessor is added with add_recipe(), and xgboost is much slower when data that ought to be sparse is passed in as dense.
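
To make the dense vs. sparse distinction concrete, here's a minimal sketch using only the Matrix package. The data is made up (one non-zero per row, loosely mimicking one-hot columns) and the names sparse_x and dense_x are just for illustration. It doesn't touch workflows or xgboost; it only shows what changes when a dgCMatrix gets densified.

```r
library(Matrix)

set.seed(1)
n_rows <- 2000
n_cols <- 500

# Hypothetical one-hot-like data: exactly one non-zero entry per row,
# so 99.8% of the cells are zero.
sparse_x <- Matrix::sparseMatrix(
  i = seq_len(n_rows),
  j = sample.int(n_cols, n_rows, replace = TRUE),
  x = 1,
  dims = c(n_rows, n_cols)
)

# Densifying keeps the same values but stores every zero explicitly.
dense_x <- as.matrix(sparse_x)

# The two objects hold identical values but have different classes.
class(sparse_x)
class(dense_x)

# Compare the in-memory footprint of the two representations.
sparse_bytes <- as.numeric(object.size(sparse_x))
dense_bytes <- as.numeric(object.size(dense_x))
dense_bytes / sparse_bytes
```

The dense copy is much larger in memory for the same information, and anything downstream that iterates over the dense matrix has to visit all of those zeros. That is presumably a good chunk of what's going on in the add_formula() timing above, though this sketch doesn't measure xgboost itself.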
