speed improvement for step_lencode_glm() #232

@EmilHvitfeldt

I think I found some evidence that we can improve the speed of step_lencode_glm() significantly.

The following is a rough benchmark. A few things to note:

  • The two methods produce the same results up to about 10^-15.
  • The ordering of the values is not the same, but that doesn't matter since we left_join() them on.
  • This only works for a numeric outcome, but it would be easy enough to extend to the other supported modes.
  • The old method scales linearly in time with the number of levels of x; the new method runs at about the same speed regardless of the number of levels.

library(embed)
n_obs <- 500000

data <- tibble(
  outcome = rnorm(n_obs),
  x = factor(sample(seq_len(100), n_obs, TRUE))
)

tictoc::tic("old")
res <- recipe(outcome ~ x, data = data) |>
  step_lencode_glm(x, outcome = vars(outcome)) |>
  prep()
tictoc::toc()
#> old: 8.327 sec elapsed


tictoc::tic("new")
tmp <- data |>
  summarise(value = mean(outcome), .by = x)
tictoc::toc()
#> new: 0.007 sec elapsed
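
As a rough check of why the shortcut gives the same numbers, and of how it might extend to a binary outcome, here is a minimal base R sketch. It assumes the step fits a no-intercept GLM of the outcome on the factor, which I have not confirmed against the embed source. The data and all variable names (`y_num`, `y_bin`, `fit_num`, and so on) are made up for illustration.

```r
# Base R only; the data below are simulated purely for illustration.
set.seed(1)
n <- 2000
x <- factor(sample(letters[1:5], n, replace = TRUE))
y_num <- rnorm(n)
y_bin <- rbinom(n, size = 1, prob = 0.3)

# Numeric outcome: a no-intercept Gaussian GLM on a single factor has one
# coefficient per level, and each coefficient is that level's mean.
# (Assumption: this mirrors the model the step fits.)
fit_num <- glm(y_num ~ 0 + x, family = gaussian())
means_num <- tapply(y_num, x, mean)
max_diff_num <- max(abs(unname(coef(fit_num)) - unname(means_num)))

# Binary outcome: the same model with a binomial family is saturated in x,
# so each coefficient is the log-odds of that level's observed proportion.
fit_bin <- glm(y_bin ~ 0 + x, family = binomial())
logit_bin <- qlogis(tapply(y_bin, x, mean))
max_diff_bin <- max(abs(unname(coef(fit_bin)) - unname(logit_bin)))

# Both differences should be tiny (floating point / IRLS tolerance).
print(c(numeric = max_diff_num, binary = max_diff_bin))
```

One caveat for the binary case: a level whose outcomes are all 0 or all 1 has an infinite log-odds, so a closed-form version would need to decide how to handle that, whereas glm() returns a large finite estimate with a warning.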

Labels: feature, target encoding

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions