Closed
Labels
feature (a feature request or enhancement), target encoding (temporary label to group target encodings)
Description
I think I found some evidence that we can significantly improve the speed of step_lencode_glm().
The following shows a rough benchmark. A few things to note:
- The two methods produce the same result, up to 10^-15.
- The ordering of the values is not the same, but that doesn't matter since we left_join() them on.
- This only works for a numeric outcome, but it would be easy enough to extend to the other supported modes.
- The old method's run time scales linearly with the number of levels of x. The new method takes about the same time regardless of the number of levels.
library(embed)

n_obs <- 500000
data <- tibble(
  outcome = rnorm(n_obs),
  x = factor(sample(seq_len(100), n_obs, TRUE))
)

tictoc::tic("old")
res <- recipe(outcome ~ x, data = data) |>
  step_lencode_glm(x, outcome = vars(outcome)) |>
  prep()
tictoc::toc()
#> old: 8.327 sec elapsed

tictoc::tic("new")
tmp <- data |>
  summarise(value = mean(outcome), .by = x)
tictoc::toc()
#> new: 0.007 sec elapsed
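As a small sanity check of the equivalence claimed above, here is a minimal sketch in base R. It assumes the encoding comes from a GLM fitted without an intercept (`outcome ~ 0 + x`), which is an assumption about the step rather than something shown in this issue. The toy data and variable names are made up for illustration. The binomial part is only a guess at how the idea might extend to a two-class outcome, and it ignores any sign or level conventions the step may apply.

```r
set.seed(1)
n_obs <- 1000
x <- factor(sample(seq_len(5), n_obs, TRUE))

# --- Numeric outcome ---------------------------------------------------
outcome <- rnorm(n_obs)

# No-intercept Gaussian GLM: one coefficient per level of x
# (assumed to mirror what the step fits)
fit_num <- glm(outcome ~ 0 + x, family = gaussian)
glm_num <- unname(coef(fit_num))

# Per-level means, in the same level order as the coefficients
mean_num <- as.vector(tapply(outcome, x, mean))

same_num <- isTRUE(all.equal(glm_num, mean_num))

# --- Two-class outcome (hypothetical extension) -------------------------
y_bin <- rbinom(n_obs, 1, 0.3)

# No-intercept logistic GLM: one log-odds coefficient per level of x
fit_bin <- glm(y_bin ~ 0 + x, family = binomial)
glm_bin <- unname(coef(fit_bin))

# Log-odds of the per-level proportion of 1s
logit_bin <- as.vector(tapply(y_bin, x, function(v) qlogis(mean(v))))

# Looser tolerance because the logistic fit is iterative
same_bin <- isTRUE(all.equal(glm_bin, logit_bin, tolerance = 1e-6))
```

If both `same_num` and `same_bin` are `TRUE`, the grouped summaries reproduce the per-level coefficients on this toy data. Levels where every outcome is 0, or every outcome is 1, would give infinite log-odds, so a real implementation would need to handle that case separately.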