step_bs is extremely slow for large datasets #574

@AshesITR

In practice, especially in modelling tasks, predictors often take a comparatively small number of distinct values, even when they are continuous.

splines::bs() is very inefficient for such predictors, because it computes all basis functions for every row separately, even when many rows share the same value.
This inefficiency is exacerbated by the fact that recipes:::prep.step_bs unnecessarily calls splines::bs() on the training data, only to throw away the most expensive part of the computation altogether.

I have coded a small reprex with a fast preprocessing step for splines::bs() that yields identical results in all respects, with one small exception: if knots is not provided and df > degree, the automatic knot selection in splines::bs() will select different knots when called with unique(x) instead of x, because the knots are placed at quantiles of the input.

As such, fixing the performance inefficiency in recipes is a little bit (but not much) more work than this proof of concept.
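One way the extra work for the df > degree case could look is sketched below: pick the interior knots from the full vector first, then evaluate the basis only on the distinct values. The helper name bs_fast_df is hypothetical, and the sketch assumes intercept = FALSE, no missing values, and that splines::bs() places interior knots at evenly spaced quantiles of x (its documented default). It is an illustration, not the code recipes uses.

```r
library(splines)

# Hypothetical helper: deduplicate x, but choose knots from the full vector
# so that they match what bs(x, df = df) would select.
bs_fast_df <- function(x, df, degree = 3L) {
  n_knots <- df - degree  # number of interior knots when intercept = FALSE
  stopifnot(n_knots > 0)
  # Evenly spaced probabilities, excluding 0 and 1 (the boundary knots).
  probs <- seq(0, 1, length.out = n_knots + 2L)[-c(1L, n_knots + 2L)]
  knots <- unname(quantile(x, probs))  # quantiles of x, not of unique(x)
  xu <- unique(x)
  ru <- bs(xu, knots = knots, degree = degree, Boundary.knots = range(x))
  # Expand the per-value basis back to one row per observation.
  ru[match(x, xu), , drop = FALSE]
}

set.seed(123)
x <- sample(rnorm(100), 10000, TRUE)
slow <- bs(x, df = 5)
fast <- bs_fast_df(x, df = 5)

# Compare the basis values only; attributes such as knot names are ignored.
all.equal(unclass(slow), unclass(fast), check.attributes = FALSE)
```

Passing the knots explicitly keeps the result independent of how often each value occurs in the deduplicated vector, which is the part the plain unique(x) shortcut gets wrong.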

If there is interest in such a performance improvement, I can try coding up a pull request with more efficient versions of prep.step_bs and bake.step_bs.

Minimal, runnable code:

library(recipes)
library(tibble)
set.seed(123)
v_u <- rnorm(100)
x <- tibble(
  v = sample(v_u, 1000000, TRUE)
)

t0 <- Sys.time()
rec <- recipe(x) %>%
  step_bs(v) %>%
  prep()

t1 <- Sys.time()
baked <- rec %>% bake(new_data = x)
t2 <- Sys.time()

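bs_fast <- function(x, ...) {
  # Evaluate the spline basis once per distinct value only
  xu <- unique(x)
  ru <- splines::bs(xu, ...)
  # Expand back to one row per observation; drop = FALSE keeps a matrix
  # even if the basis has a single column
  res <- ru[match(x, xu), , drop = FALSE]
  # Restore the "bs" attributes that matrix subsetting discards
  copy_attrs <- c("class", "degree", "knots", "Boundary.knots", "intercept")
  attributes(res)[copy_attrs] <- attributes(ru)[copy_attrs]
  res
}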

t3 <- Sys.time()
baked_fast <- bs_fast(x$v) %>%
  unclass() %>%
  `attributes<-`(attributes(.)[c("dim", "dimnames")]) %>%
  as_tibble() %>%
  magrittr::set_colnames(sprintf("v_bs_%d", 1L:3L))
t4 <- Sys.time()

identical(baked, baked_fast)
#> [1] TRUE
identical(bs_fast(x$v), splines::bs(x$v))
#> [1] TRUE

cat(glue::glue(
  "prep recipe: {t1 - t0}\n",
  "bake recipe: {t2 - t1}\n",
  "total recipe: {t2 - t0}\n",
  "bake fast: {t4 - t3}"
))
#> prep recipe: 0.714876890182495
#> bake recipe: 0.235795021057129
#> total recipe: 0.950671911239624
#> bake fast: 0.11726713180542

Created on 2020-09-24 by the reprex package (v0.3.0)
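
A rough sketch of how the proposed prep/bake split could be organised follows. The function names prep_bs_sketch and bake_bs_sketch are hypothetical and are not part of recipes. The sketch assumes the default step_bs() settings used in the reprex (no df, no knots, degree 3, no intercept) and no missing values: prep stores only the spline parameters without building a basis matrix, and bake evaluates the basis once per distinct value.

```r
library(splines)

# Hypothetical "prep": keep only what is needed to rebuild the basis later.
# With no df and no knots there are no interior knots, so only the range of
# the training data has to be computed.
prep_bs_sketch <- function(x, degree = 3L) {
  list(
    degree = degree,
    knots = numeric(0),
    Boundary.knots = range(x),
    intercept = FALSE
  )
}

# Hypothetical "bake": evaluate the basis on the distinct values, then
# expand back to all rows with match().
bake_bs_sketch <- function(params, x) {
  xu <- unique(x)
  ru <- do.call(bs, c(list(x = xu), params))
  ru[match(x, xu), , drop = FALSE]
}

set.seed(123)
x <- sample(rnorm(100), 10000, TRUE)
params <- prep_bs_sketch(x)
fast <- bake_bs_sketch(params, x)

# Compare the basis values against a direct call on the full vector.
all.equal(unclass(fast), unclass(bs(x)), check.attributes = FALSE)
```

Storing the parameters as a named list mirrors the arguments of splines::bs(), so the same list can be reused for new data at bake time.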

Labels: feature (a feature request or enhancement)
