step_YeoJohnson() possible speedup #782

Closed
kadyb opened this issue Sep 7, 2021 · 10 comments · Fixed by #812

Comments

@kadyb commented Sep 7, 2021

I did a speed comparison of the Yeo-Johnson transformation functions from the {recipes} and {bestNormalize} packages. There seems to be some overhead in the {recipes} implementation (especially for large datasets), and I wonder if it can be optimized in some way.

library("bestNormalize")
library("recipes")

set.seed(123)
df = data.frame(x = rgamma(20000000, 1, 1))

### bestNormalize
system.time({
  x1 = yeojohnson(df$x, standardize = FALSE, lower = -5, upper = 5)
})
#>  user  system elapsed 
#> 60.06    6.31   66.38
hist(x1[["x.t"]])

### recipes
rec = recipe( ~ ., data = df)
rec = step_YeoJohnson(rec, x)
system.time({
  estimates = prep(rec, training = df, retain = TRUE)
  x2 = bake(estimates, new_data = NULL)
})
#>  user  system elapsed 
#> 68.74   10.11   78.86
hist(x2[["x"]])
@juliasilge (Member)

Looks like there may be some differences between how recipes currently implements the Yeo-Johnson transformation and how bestNormalize does it:

set.seed(123)
df = data.frame(x = rgamma(2e6, 1, 1))

results <- bench::mark(
  iterations = 10, check = FALSE,
  recipes = recipes::estimate_yj(df$x),  
  bestNormalize = bestNormalize::yeojohnson(df$x, standardize = FALSE, lower = -5, upper = 5)
)

results
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 recipes          1.22s    1.43s     0.705    2.55GB     18.2
#> 2 bestNormalize    1.02s    1.09s     0.926    1.55GB     11.6

Created on 2021-09-09 by the reprex package (v2.0.1)

I'm not sure we want to import an external package for this, but we can look into whether our implementation is doing anything unnecessarily slow, given our priorities.
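
One way to pin down where the time and memory go is to profile the current estimator directly; a minimal sketch (profvis is an assumption here, not something used in this thread):

library("profvis")

set.seed(123)
x <- rgamma(2e6, 1, 1)

# Profile the current recipes estimator to see which calls dominate
profvis::profvis({
  recipes::estimate_yj(x)
})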

@kadyb (Author) commented Sep 9, 2021

Gentle ping @petersonR: could you give some advice on optimization?

@kadyb (Author) commented Sep 10, 2021

Out of curiosity, I also checked the Box-Cox transformation, and in this case {recipes} is faster.

library("bestNormalize")
library("recipes")

set.seed(123)
df = data.frame(x = rgamma(20000000, 1, 1))

### bestNormalize
system.time({
  x1 = boxcox(df$x, standardize = FALSE, lower = -5, upper = 5)
})
#>  user  system elapsed
#> 74.26    8.44   82.89

### recipes
rec = recipe( ~ ., data = df)
rec = step_BoxCox(rec, x)
system.time({
  estimates = prep(rec, training = df, retain = TRUE)
  x2 = bake(estimates, new_data = NULL)
})
#>  user  system elapsed
#> 53.87    4.02   57.92

@EmilHvitfeldt (Member) commented Sep 10, 2021

I couldn't resist, so I tried to optimize it a little and got roughly a 60% reduction in run time compared to recipes::estimate_yj, with a lot less memory allocation:

set.seed(123)
df = data.frame(x = rgamma(2e6, 1, 1))

results <- bench::mark(
  iterations = 10, check = TRUE,
  recipes = recipes::estimate_yj(df$x),
  optimized = estimate_yj(df$x)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

results
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 recipes       2.86s    2.99s     0.324    2.55GB    10.1 
#> 2 optimized     1.27s    1.32s     0.731  673.06MB     4.97

Created on 2021-09-10 by the reprex package (v2.0.1)

Code

estimate_yj <- function(dat, limits = c(-5, 5), num_unique = 5,
                        na_rm = TRUE) {
    na_rows <- which(is.na(dat))
    if (length(na_rows) > 0) {
        if (na_rm) {
            dat <- dat[-na_rows]
        } else {
            rlang::abort("Missing values in data. See `na_rm` option")
        }
    }
    eps <- 0.001
    if (length(unique(dat)) < num_unique)
        return(NA)
    # Precompute the negative/non-negative indices and the constant term once,
    # rather than recomputing them on every iteration of optimize()
    dat_neg <- dat < 0
    ind_neg <- list(is = which(dat_neg), not = which(!dat_neg))

    const <- sum(sign(dat) * log(abs(dat) + 1))

    res <- optimize(yj_obj, interval = limits, maximum = TRUE,
        dat = dat, ind_neg = ind_neg, const = const, tol = 1e-04)
    lam <- res$maximum
    if (abs(limits[1] - lam) <= eps | abs(limits[2] - lam) <= eps)
        lam <- NA
    lam
}

yj_obj <- function(lam, dat, ind_neg, const) {
    # dat <- dat[complete.cases(dat)]
    ll_yj(lambda = lam, y = dat, ind_neg = ind_neg, const = const)
}

ll_yj <- function(lambda, y, ind_neg, const, eps = 0.001) {
    # y <- y[!is.na(y)]
    n <- length(y)
    # nonneg <- all(y > 0)
    y_t <- yj_transform(y, lambda, ind_neg)
    mu_t <- mean(y_t)
    var_t <- var(y_t) * (n - 1)/n
    res <- -0.5 * n * log(var_t) + (lambda - 1) * const
    res
}

yj_transform <- function(x, lambda, ind_neg, eps = 0.001) {
    if (is.na(lambda)) 
        return(x)
    if (!inherits(x, "tbl_df") || is.data.frame(x)) {
        x <- unlist(x, use.names = FALSE)
    }
    else {
        if (!is.vector(x)) 
            x <- as.vector(x)
    }
    not_neg <- ind_neg[["not"]]
    is_neg <- ind_neg[["is"]]
    nn_trans <- function(x, lambda) if (abs(lambda) < eps) 
        log(x + 1)
    else ((x + 1)^lambda - 1)/lambda
    ng_trans <- function(x, lambda) if (abs(lambda - 2) < eps) 
        -log(-x + 1)
    else -((-x + 1)^(2 - lambda) - 1)/(2 - lambda)
    if (length(not_neg) > 0) 
        x[not_neg] <- nn_trans(x[not_neg], lambda)
    if (length(is_neg) > 0) 
        x[is_neg] <- ng_trans(x[is_neg], lambda)
    x
}
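
For reference (not part of the original comment), a quick standalone check that the precomputed-index version lands on the same lambda as the current recipes estimator; this is what check = TRUE in the benchmark above already enforces:

set.seed(123)
x <- rgamma(1e4, 1, 1)

lam_current   <- recipes::estimate_yj(x)
lam_optimized <- estimate_yj(x)  # the function defined above

# Same objective function, so the optimizer should find the same lambda
all.equal(lam_current, lam_optimized)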

@petersonR

Thank you all for looking into this and working to optimize these transformation functions. Emil, I will borrow a few of your tips for bestNormalize if you don't mind.

@EmilHvitfeldt (Member)

@petersonR Sure thing!

@EmilHvitfeldt (Member)

@juliasilge do you want me to package my above improvements into a PR?

@juliasilge (Member)

Yes, that sounds great! Can you include a few more examples with the PR showing how the results compare, beyond the tests we have?
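
One possible shape for such a comparison (a sketch, not taken from the eventual PR), checking the estimated lambda against bestNormalize on the same data; it assumes the yeojohnson object exposes its estimate as $lambda:

library("bestNormalize")
library("recipes")

set.seed(123)
x <- rgamma(1e5, 1, 1)

lam_recipes <- recipes::estimate_yj(x)
bn <- bestNormalize::yeojohnson(x, standardize = FALSE, lower = -5, upper = 5)

# Different optimizers, so allow a loose tolerance
all.equal(lam_recipes, bn$lambda, tolerance = 1e-2)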

@kadyb (Author) commented Sep 20, 2021

Thanks! I reran the code from the first post and {recipes} is now about 30 seconds faster than with the earlier version. Below is a comparison with the new {bestNormalize} release (1.8.2); performance is now very similar.

set.seed(123)
df = data.frame(x = rgamma(2e6, 1, 1))

result = bench::mark(
  iterations = 10, check = FALSE,
  recipes = recipes::estimate_yj(df$x),
  bestNormalize = bestNormalize::yeojohnson(df$x, standardize = FALSE, lower = -5, upper = 5)
)

result
#>   expression      min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 recipes       4.39s  4.45s     0.225     672MB    0.360
#> 2 bestNormalize 4.95s  4.98s     0.200     732MB    0.280

@github-actions bot commented Oct 5, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators Oct 5, 2021