
step_bs is extremely slow for large datasets #574

Closed
AshesITR opened this issue Sep 24, 2020 · 12 comments · Fixed by #640
Labels
feature a feature request or enhancement

Comments

@AshesITR
Contributor

In practice, especially for modelling tasks, datasets can contain a comparatively small number of distinct values for a predictor, even a continuous one.

splines::bs() is very inefficient for such predictors, because it computes all basis functions for all rows separately.
This inefficiency is exacerbated by the fact that recipes:::prep.step_bs unnecessarily calls splines::bs() on the full training data, only to throw away the result of the most expensive part of the computation.

I have coded a small reprex with a fast preprocessing for splines::bs() that yields identical results in all respects, with one small exception: if knots is not provided and df > degree, the automatic knot selection algorithm of splines::bs() will select different knots when called with unique(x) instead of x.

As such, fixing the performance inefficiency in recipes is a little bit (but not much) more work than this proof of concept.

If there is interest in such a performance improvement, I can try coding up a pull request with more efficient versions of prep.step_bs and bake.step_bs.

Minimal, runnable code:

library(recipes)
library(tibble)
set.seed(123)
v_u <- rnorm(100)
x <- tibble(
  v = sample(v_u, 1000000, TRUE)
)

t0 <- Sys.time()
rec <- recipe(x) %>%
  step_bs(v) %>%
  prep()

t1 <- Sys.time()
baked <- rec %>% bake(new_data = x)
t2 <- Sys.time()

bs_fast <- function(x, ...) {
  xu <- unique(x)
  ru <- splines::bs(xu, ...)
  res <- ru[match(x, xu), ]
  copy_attrs <- c("class", "degree", "knots", "Boundary.knots", "intercept")
  attributes(res)[copy_attrs] <- attributes(ru)[copy_attrs]
  res
}

t3 <- Sys.time()
baked_fast <- bs_fast(x$v) %>%
  unclass() %>%
  `attributes<-`(attributes(.)[c("dim", "dimnames")]) %>%
  as_tibble() %>%
  magrittr::set_colnames(sprintf("v_bs_%d", 1L:3L))
t4 <- Sys.time()

identical(baked, baked_fast)
#> [1] TRUE
identical(bs_fast(x$v), splines::bs(x$v))
#> [1] TRUE

cat(glue::glue(
  "prep recipe: {t1 - t0}\n",
  "bake recipe: {t2 - t1}\n",
  "total recipe: {t2 - t0}\n",
  "bake fast: {t4 - t3}"
))
#> prep recipe: 0.714876890182495
#> bake recipe: 0.235795021057129
#> total recipe: 0.950671911239624
#> bake fast: 0.11726713180542

Created on 2020-09-24 by the reprex package (v0.3.0)

@juliasilge juliasilge added the feature a feature request or enhancement label Dec 4, 2020
@topepo
Member

topepo commented Dec 8, 2020

This looks very promising. I think we could add a top-level argument to the step (say fast = FALSE) that toggles between the normal approach and this one.

A PR would be great.

Does this affect the natural spline step too?

@AshesITR
Contributor Author

AshesITR commented Dec 9, 2020

Likely, yes.

If I make a PR, it wouldn't change the behaviour of step_bs (and/or step_ns), so I'd leave out the argument fast if you don't mind.

That means the actual preparation code, namely choosing knots from the quantiles of x, would be duplicated from splines::bs.
This would also speed up the prep part a lot.

@juliasilge
Member

We are thinking this would be best as an argument to the existing step_bs() (i.e. being able to turn on/off the fast option), not as a separate step. Do you have any questions/thoughts about that?

@AshesITR
Contributor Author

AshesITR commented Dec 9, 2020

I was thinking of transparently improving performance in step_bs, because it's possible to get identical behaviour with much improved performance.

@juliasilge
Member

We are happy about your proposed improvement here, but believe that it would be surprising for users to have this different behavior (compared to splines::bs()) as a default. The idea would be to have the argument fast = FALSE as the default.

@AshesITR
Contributor Author

AshesITR commented Dec 9, 2020

I'm sorry if I confused something, but the behaviour will be absolutely identical for any input - only the runtime will be reduced.

@juliasilge
Member

I think we do have some confusion here. Here are a couple of clarifications:

  • The idea of offering an option to only use distinct values sounds like a good one that can help for large datasets. We would have to do extensive testing to understand if it helps in all circumstances.
  • The basis functions do need to be computed during prep(), and then applied to data (such as new or testing data) during bake(). This is the basic underlying design principle of recipes.

We are thinking that this argument would be passed down and then the change implemented in this function:

recipes/R/bs.R

Lines 103 to 116 in 394b401

bs_wrapper <- function(x, args) {
  if (!("Boundary.knots" %in% names(args)))
    args$Boundary.knots <- range(x)
  args$x <- x
  bs_obj <- do.call("bs", args)
  ## don't need to save the original data so keep 1 row
  out <- matrix(NA, ncol = ncol(bs_obj), nrow = 1)
  class(out) <- c("bs", "basis", "matrix")
  attr(out, "knots") <- attr(bs_obj, "knots")[]
  attr(out, "degree") <- attr(bs_obj, "degree")
  attr(out, "Boundary.knots") <- attr(bs_obj, "Boundary.knots")
  attr(out, "intercept") <- attr(bs_obj, "intercept")
  out
}
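For illustration, a hedged sketch of how the distinct-values idea could slot into this wrapper — not the merged implementation. Since only the attributes of the bs object are kept, splines::bs() could be evaluated on unique(x) alone, subject to the automatic-knot caveat from the issue description:

```r
# Sketch (not the actual recipes code): same wrapper, but evaluating
# splines::bs() on unique(x) only, since just the attributes are kept.
# Caveat from above: with df > degree and no explicit knots, the
# automatic quantile-based knot choice differs on unique(x) vs x.
bs_wrapper_fast <- function(x, args) {
  if (!("Boundary.knots" %in% names(args)))
    args$Boundary.knots <- range(x)  # boundary knots still from the full x
  args$x <- unique(x)
  bs_obj <- do.call(splines::bs, args)
  # keep a 1-row placeholder, as in bs_wrapper()
  out <- matrix(NA, ncol = ncol(bs_obj), nrow = 1)
  class(out) <- c("bs", "basis", "matrix")
  for (a in c("knots", "degree", "Boundary.knots", "intercept")) {
    attr(out, a) <- attr(bs_obj, a)
  }
  out
}
```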

@AshesITR
Contributor Author

AshesITR commented Dec 9, 2020

I'm sorry, but I still don't understand why you need to evaluate the basis functions during the estimation of parameters.

Take for example scaling. To estimate, you need to calculate the mean and standard deviation, but not the z-scores.

For evaluation you then compute the z-scores based on precomputed parameters.

In the B-spline case, the parameters are the degree and the knot vector and evaluation would be computation of all the basis polynomials at the necessary points.
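The analogy can be sketched in a few lines; the helper names here are illustrative only, not recipes internals:

```r
# Illustrative sketch of the prep/bake split (hypothetical helpers):
# prep estimates parameters only, bake applies them to data.
prep_scale <- function(x) list(mean = mean(x), sd = sd(x))
bake_scale <- function(params, new_x) (new_x - params$mean) / params$sd

# The B-spline analogue: prep would store only degree, knots, and
# boundary knots; bake evaluates the basis polynomials with them.
prep_bs <- function(x, knots, degree = 3) {
  list(degree = degree, knots = knots, boundary = range(x))
}
bake_bs <- function(params, new_x) {
  splines::bs(new_x, knots = params$knots, degree = params$degree,
              Boundary.knots = params$boundary)
}
```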

@juliasilge
Member

Is there a way to get the boundary knots from the training data (to then be applied to new/other data) without evaluating splines::bs() on the training data? I think this is what we're concerned about:

If knots is not provided and df > degree, the automatic knot selection algorithm for splines::bs() will select different knots if called with unique(x) instead of x.

@AshesITR
Contributor Author

AshesITR commented Dec 9, 2020

Yeah, those different knots are just quantiles of x (via quantile()), a much cheaper computation than evaluating all of the basis functions. If you're interested in the details, check the source code of splines::bs(). I'd duplicate this automatic selection behaviour without the expensive basis function evaluation.
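For reference, a sketch of that automatic selection, modeled on the knot-selection logic in the splines::bs() source for the df-given, knots-missing case — treat it as an approximation of the upstream code, not a drop-in replacement:

```r
# Sketch: compute splines::bs()'s default interior knots from quantiles
# of x, without evaluating any basis functions.
bs_knots <- function(x, df, degree = 3, intercept = FALSE) {
  ord <- degree + 1
  n_iknots <- df - ord + (1 - intercept)  # number of interior knots
  if (n_iknots <= 0) return(numeric(0))
  probs <- seq.int(0, 1, length.out = n_iknots + 2)[-c(1, n_iknots + 2)]
  quantile(x, probs)
}
```

Called with unique(x) instead of x, this reproduces the knot difference mentioned in the issue description.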

@juliasilge
Member

Ah, I see what you are getting at. 👍 Using a different approach to find the knots in prep() may work well. As a next step, can you do some benchmarking (we like to use the bench package) of your proposed new version of prep.step_bs() compared to the existing implementation of prep.step_bs()?
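A sketch of such a benchmark with bench, comparing splines::bs() on the full vector against the bs_fast() helper from the reprex (redefined here so the chunk is self-contained); timings will of course vary by machine. Default arguments are used so both calls select the same (empty) interior knot set and return identical results:

```r
# Sketch of the requested benchmark using bench::mark().
bs_fast <- function(x, ...) {
  xu <- unique(x)
  ru <- splines::bs(xu, ...)
  res <- ru[match(x, xu), ]
  copy_attrs <- c("class", "degree", "knots", "Boundary.knots", "intercept")
  attributes(res)[copy_attrs] <- attributes(ru)[copy_attrs]
  res
}

set.seed(123)
x <- sample(rnorm(100), 1e6, TRUE)
bench::mark(
  slow = splines::bs(x),
  fast = bs_fast(x)
)
```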

AshesITR added a commit to AshesITR/recipes that referenced this issue Jan 30, 2021
@github-actions

github-actions bot commented Apr 9, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 9, 2021