Skip to content

Conversation

@EmilHvitfeldt
Copy link
Member

@EmilHvitfeldt EmilHvitfeldt commented Oct 19, 2022

This PR aim to start the conversation on creating a function that takes a factor and return the corresponding integer indicator matrix. I find that I want to do this many places in tidymodels without a good way to do it.

This proposed function

  • returns integers
  • doesn't rely on global options
  • returns sensible column names
  • Works for factors that are too big for model.matrix()
  • Doesn't work with NAs

Benchmarking

This new function outperforms model.matrix() for all sizes of factors that I have tested

library(hardhat)

old_fun <- function(x) {
  res <- stats::model.matrix(~y-1, data = data.frame(y = x))
  attr(res, "assign") <- NULL
  attr(res, "contrasts") <- NULL
  res
}

create_factor <- function(n, n_levels) {
  factor(sample(seq_len(n_levels), n, replace = TRUE), levels = seq_len(n_levels))
}

res <- bench::press(
  n = 10 ^ seq_len(5),
  n_levels = 10 ^ seq_len(3),
  {
    fff <- create_factor(n, n_levels)
    bench::mark(
      new = unname(get_indicators(fff)),
      old = unname(old_fun(fff))
    )
  }
)

ggplot2::autoplot(res)

model.matrix() can't do factors with many levels

get_indicators(create_factor(10, 100000)) |> str()
#>  int [1:10, 1:100000] 0 0 0 0 0 0 0 0 0 0 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:100000] "1" "2" "3" "4" ...
old_fun(create_factor(10, 100000))
#> Error: vector memory exhausted (limit reached?)

model.matrix() also couldn't calculate a vector with `10^8 elements and 20 levels.

res <- bench::press(
  n = 10 ^ seq_len(7),
  n_levels = 20,
  {
    fff <- create_factor(n, n_levels)
    bench::mark(
      min_iterations = 10,
      new = unname(get_indicators(fff)),
      old = unname(old_fun(fff))
    )
  }
)

ggplot2::autoplot(res)

Created on 2022-10-19 with reprex v2.0.2

@DavisVaughan DavisVaughan changed the title Add get_indicators() function Implement fct_encode_one_hot() Oct 28, 2022
@DavisVaughan DavisVaughan merged commit c2c896c into main Oct 28, 2022
@DavisVaughan DavisVaughan deleted the get_indicators branch October 28, 2022 20:41
@github-actions
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Nov 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants