Description
In practice, especially in modelling tasks, datasets often contain comparatively few distinct values per predictor, even for continuous predictors.
splines::bs() is very inefficient for such predictors because it evaluates the basis functions for every row separately, even when many rows share the same value.
This inefficiency is exacerbated by the fact that recipes:::prep.step_bs unnecessarily calls splines::bs() on the full training data and then throws away the most expensive part of that computation.
I have coded a small reprex with a fast wrapper around splines::bs() that yields identical results in all respects, with one small exception: if knots is not provided and df > degree, the automatic knot selection in splines::bs() will select different knots when called with unique(x) instead of x, because the knots are placed at quantiles of the input.
As such, fixing the performance problem in recipes is a little (but not much) more work than this proof of concept.
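One way the knot exception could be handled is to choose the interior knots from the full x and pass them explicitly, so that only the basis evaluation runs on unique(x). The following is a minimal sketch under stated assumptions, not the recipes implementation: bs_fast_knots is a hypothetical name, x is assumed to contain no missing values, and Boundary.knots is left at its default. The quantile rule follows the documented default of splines::bs() (df - degree interior knots, minus one if there is an intercept).

```r
# Hypothetical helper: pick knots from the full data, evaluate on unique values.
# Assumes no NAs in x and the default Boundary.knots, i.e. range(x).
bs_fast_knots <- function(x, df = NULL, degree = 3L, intercept = FALSE) {
  xu <- unique(x)
  knots <- NULL
  if (!is.null(df)) {
    # Number of interior knots that splines::bs() would choose for this df
    n_inner <- df - degree - as.integer(intercept)
    if (n_inner > 0L) {
      # Evenly spaced probabilities, excluding 0 and 1
      probs <- seq(0, 1, length.out = n_inner + 2L)[-c(1L, n_inner + 2L)]
      # Quantiles of the FULL vector, not of the unique values
      knots <- stats::quantile(x, probs, names = FALSE)
    }
  }
  # Expensive part: evaluated once per distinct value
  ru <- splines::bs(xu, knots = knots, degree = degree, intercept = intercept)
  # Cheap part: expand back to one row per observation
  res <- ru[match(x, xu), , drop = FALSE]
  copy_attrs <- c("class", "degree", "knots", "Boundary.knots", "intercept")
  attributes(res)[copy_attrs] <- attributes(ru)[copy_attrs]
  res
}
```

A comparison such as all.equal(unclass(bs_fast_knots(x, df = 6)), unclass(splines::bs(x, df = 6)), check.attributes = FALSE) can be used to check the basis values; the knots attribute may differ in its names depending on the R version, so it is safer to compare it after unname().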
If there is interest in such a performance improvement, I can try coding up a pull request with more efficient versions of prep.step_bs and bake.step_bs.
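To give a rough idea of what such versions might look like, here is a sketch of the prep/bake split outside of recipes. The names prep_bs_sketch and bake_bs_sketch are hypothetical and do not mirror the internals of prep.step_bs or bake.step_bs. The sketch assumes knots are either passed explicitly or left at the default of no interior knots, which sidesteps the automatic knot selection. It relies on the predict() method for bs objects to re-evaluate the basis at new values.

```r
# Hypothetical stand-ins for the prep and bake steps; not the recipes code.
prep_bs_sketch <- function(x, knots = NULL, degree = 3L) {
  # Evaluate on unique values only. The returned object carries the spline
  # specification (degree, knots, Boundary.knots, intercept) as attributes.
  splines::bs(unique(x), knots = knots, degree = degree)
}

bake_bs_sketch <- function(spec, new_x) {
  nu <- unique(new_x)
  # predict() on a bs object rebuilds the basis at the new values
  pu <- stats::predict(spec, newx = nu)
  # Expand back to one row per observation; the result is a plain matrix
  pu[match(new_x, nu), , drop = FALSE]
}

set.seed(2)
x <- sample(rnorm(30), 2000, TRUE)
spec <- prep_bs_sketch(x)
out <- bake_bs_sketch(spec, x)
dim(out)
```

With this split, the cost of both steps scales with the number of distinct values plus one match() over the rows, rather than with a full basis evaluation per row.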
Minimal, runnable code:
library(recipes)
library(tibble)
set.seed(123)
v_u <- rnorm(100)
x <- tibble(
  v = sample(v_u, 1000000, TRUE)
)
t0 <- Sys.time()
rec <- recipe(x) %>%
  step_bs(v) %>%
  prep()
t1 <- Sys.time()
baked <- rec %>% bake(new_data = x)
t2 <- Sys.time()
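bs_fast <- function(x, ...) {
  # Evaluate the basis once per distinct value
  xu <- unique(x)
  ru <- splines::bs(xu, ...)
  # Expand back to one row per observation
  res <- ru[match(x, xu), ]
  # Subsetting drops the bs attributes, so restore them
  copy_attrs <- c("class", "degree", "knots", "Boundary.knots", "intercept")
  attributes(res)[copy_attrs] <- attributes(ru)[copy_attrs]
  res
}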
t3 <- Sys.time()
baked_fast <- bs_fast(x$v) %>%
  unclass() %>%
  `attributes<-`(attributes(.)[c("dim", "dimnames")]) %>%
  as_tibble() %>%
  magrittr::set_colnames(sprintf("v_bs_%d", 1L:3L))
t4 <- Sys.time()
identical(baked, baked_fast)
#> [1] TRUE
identical(bs_fast(x$v), splines::bs(x$v))
#> [1] TRUE
cat(glue::glue(
  "prep recipe: {t1 - t0}\n",
  "bake recipe: {t2 - t1}\n",
  "total recipe: {t2 - t0}\n",
  "bake fast: {t4 - t3}"
))
#> prep recipe: 0.714876890182495
#> bake recipe: 0.235795021057129
#> total recipe: 0.950671911239624
#> bake fast: 0.11726713180542

Created on 2020-09-24 by the reprex package (v0.3.0)