Using rlang with memoise #79

Open
SteveBronder opened this issue Jan 4, 2019 · 8 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@SteveBronder

The below example fails because memoise can't cache with arguments that are expressions.

library(rlang)
library(memoise)

foo = data.frame(a = 1:5, b = 1:5)

quo_subset = function(df, rows) {
  rows = rlang::enquo(rows)
  vals = rlang::eval_tidy(rows, data = df)
  df[vals,]
}

# Cool!
quo_subset(foo, b == 2)
## a b
## 2 2 2
quo_mem_subset = memoise(quo_subset)
# Dang!
quo_mem_subset(foo, b == 2)
# Error in FUN(X[[i]], ...) : object 'b' not found
traceback()
## 4: FUN(X[[i]], ...)
## 3: FUN(X[[i]], ...)
## 2: lapply(called_args, eval, parent.frame())
## 1: quo_mem_subset(foo, b == 2)

This seems like it should be fine? Line 19 of memoise seems to be the culprit.
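A minimal base R sketch of why the traceback ends in "object 'b' not found", assuming the wrapper evaluates each captured argument expression in the caller's environment (as the `lapply(called_args, eval, parent.frame())` frame suggests). The names `called_args` and `caller_env` are stand-ins chosen for illustration; this is not memoise's actual source.

```r
# The two argument expressions from quo_mem_subset(foo, b == 2).
called_args <- list(df = quote(foo), rows = quote(b == 2))

# Stand-in for the caller's environment: it contains `foo` but no `b`,
# because `b` only exists as a column inside the data frame.
caller_env <- new.env(parent = baseenv())
caller_env$foo <- data.frame(a = 1:5, b = 1:5)

# Evaluating every argument up front fails on `b == 2`.
outcome <- tryCatch(
  lapply(called_args, eval, caller_env),
  error = function(e) conditionMessage(e)
)
outcome  # an "object not found" message instead of a list of values
```

The first expression evaluates fine; the second one only makes sense with the data frame as a data mask, which is what `eval_tidy(rows, data = df)` provides inside `quo_subset()`.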

SessionInfo()

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin17.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] memoise_1.1.0 rlang_0.3.0.1

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1    yaml_2.2.0     digest_0.6.18 
@jimhester added the bug (an unexpected problem or unintended behavior) label Apr 3, 2020
@yogat3ch

yogat3ch commented Nov 4, 2021

Bumping this, memoise doesn't seem to be compatible with rlang and it would be wonderful if it was!

@yogat3ch

yogat3ch commented Nov 4, 2021

We figured out a nice hack for some use cases, in case anyone lands here. We used the omit_args argument of memoise() to skip the argument that rlang acts on, then added an argument whose default captures the calling function's call, so that the hash still reflects the inputs.

Example is below:

get_tbl_query <- function(tbl, scenario_ids, filter_month_day = FALSE, addtl_exp, cf = the_get_fn()) {
  # stuff
}

We ignore addtl_exp since it's captured by rlang::enexpr and evaluated by rlang::eval_bare internally, and it would cause an error if memoise evaluated it to create the hash.
To ensure that the correct hash is computed, we use the cf argument, which captures the call three levels up (our top-level function with unique arguments).

the_get_fn <- function(x = tail(rlang::trace_back(bottom = 3)$calls, 1)[[1]]) {x}

The bottom argument needs to be adjusted to match how far up the call stack the call to be hashed sits.
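A reduced sketch of the omit_args idea, assuming a memoise version that provides the `omit_args` parameter. The function `subset_rows()` and its `key` argument are hypothetical; `key` is an explicit stand-in for the surrogate that the `cf` default supplies above, used here only so the example stays small.

```r
library(memoise)

foo <- data.frame(a = 1:5, b = 1:5)

# `rows` is captured with substitute(); `key` is a hypothetical argument that
# exists only so the cache key reflects which expression was passed.
subset_rows <- function(df, rows, key = NULL) {
  message("Running subset_rows()")
  df[eval(substitute(rows), df, parent.frame()), ]
}

# Leave `rows` out of the cache key so it is never evaluated for hashing.
subset_rows_m <- memoise(subset_rows, omit_args = "rows")

r1 <- subset_rows_m(foo, b == 2, key = "b == 2")
r2 <- subset_rows_m(foo, b == 2, key = "b == 2")  # same key as r1
r3 <- subset_rows_m(foo, b > 3,  key = "b > 3")   # different key
```

Without some surrogate like `key`, every call with the same `df` would share one cache entry regardless of the `rows` expression, which is why the approach above pairs omit_args with an extra hashed argument.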

@wch
Member

wch commented Nov 5, 2021

I don't think there's a general solution to this problem. memoise takes the (evaluated) input values, combines them, and hashes them; the resulting value is used as a key. This isn't going to play well with nonstandard evaluation, where you want to use the unevaluated expression as the key.

To illustrate, here's a normal function, which prints a string representation of the input. Each time it executes, it also prints Running f() -- this will be informative when we memoize the function later.

library(rlang)
library(memoise)

f <- function(x) {
  message("Running f()")
  paste("The input value was:", x)
}

a <- 10
f(a + 1)
#> Running f()
#> [1] "The input value was: 11"

a <- 20
f(a + 1)
#> Running f()
#> [1] "The input value was: 21"

Now let's memoize it. When there's a cache hit, it will return the same value, but it will not print out Running f().

fm <- memoise(f)

a <- 10
fm(a + 1)
#> Running f()
#> [1] "The input value was: 11"

fm(a + 1)  # Will have a cache hit, so won't print "Running f()".
#> [1] "The input value was: 11"

a <- 20
fm(a + 1)
#> Running f()
#> [1] "The input value was: 21"

When it sees the input value of 11 a second time, it doesn't need to execute f(); it simply gets the return value from the cache. But when it sees a new input value, 21, it executes f() again. So far, this is exactly what we'd expect.
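The value-keyed behaviour can be mimicked with a small hand-rolled cache in base R. This is only a sketch: `memoise_by_value()` is a hypothetical helper, and the deparsed argument list is a crude stand-in for the hash that memoise actually computes.

```r
memoise_by_value <- function(fun) {
  cache <- list()
  function(...) {
    args <- list(...)                            # forces every argument
    key <- paste(deparse(args), collapse = "")   # crude stand-in for a hash
    if (is.null(cache[[key]])) {
      cache[[key]] <<- fun(...)
    }
    cache[[key]]
  }
}

runs <- 0
report_value <- function(x) {
  runs <<- runs + 1                              # count real executions
  paste("The input value was:", x)
}
report_value_m <- memoise_by_value(report_value)

a <- 10
report_value_m(a + 1)
report_value_m(a + 1)  # same value, served from the cache
a <- 20
report_value_m(a + 1)  # new value, so the function runs again
runs                   # 2
```

Because the key is built from `list(...)`, every argument has to be evaluated before the cache can be consulted, which is exactly what breaks for arguments meant to stay unevaluated.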

Now, here's a function with non-standard evaluation. It uses rlang::enexpr(), which for our purposes is similar to rlang::enquo() or substitute(). It simply prints out the unevaluated expression that was passed in.

g <- function(expr) {
  message("Running g()")
  expr_string <- deparse(enexpr(expr))
  paste("The input expression was:", expr_string)
}

a <- 10
g(a + 1)
#> Running g()
#> [1] "The input expression was: a + 1"

a <- 20
g(a + 1)
#> Running g()
#> [1] "The input expression was: a + 1"

The only thing that matters for the result is the unevaluated input expression a + 1 -- the actual value of a is irrelevant. When we memoize it, we get the following behavior:

gm <- memoise(g)

a <- 10
gm(a + 1)
#> Running g()
#> [1] "The input expression was: a + 1"

gm(a + 1) # Will have a cache hit.
#> [1] "The input expression was: a + 1"

a <- 20
gm(a + 1)
#> Running g()
#> [1] "The input expression was: a + 1"

Notice that this time around, when the value of a is changed, it causes the memoized function gm() to call g() again. This is not what you want, but the alternative would break how memoise() works.

Suppose memoise() used the unevaluated expression as a cache key. Then each time you call gm(a+1), it simply has to fetch the value out of the cache (after the first run, of course). That works for gm().

But what about for fm()? fm() cares about the (evaluated) input values. If we simply used the unevaluated expression a+1 as the cache key, then each time the user called fm(a+1), it would return the cached value. But that would return the incorrect value when we change the value of a from 10 to 20. It would behave like this:

## Note: This chunk is not what actually happens with fm().
## It shows what would happen if fm() used the unevaluated expression for the key.

a <- 10
fm(a+1)
#> Running f()
#> [1] "The input value was: 11"

a <- 20
fm(a+1)
#> [1] "The input value was: 11"
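The hypothetical chunk above can be reproduced with a hand-rolled cache that keys on the deparsed argument expressions rather than their values. `memoise_by_expr()` is an illustrative helper written for this sketch, not part of memoise.

```r
memoise_by_expr <- function(fun) {
  cache <- list()
  function(...) {
    # Capture the argument expressions without evaluating them.
    exprs <- as.list(substitute(list(...)))[-1]
    parts <- vapply(
      exprs,
      function(e) paste(deparse(e), collapse = ""),
      character(1)
    )
    key <- paste0("expr:", paste(parts, collapse = ", "))
    if (is.null(cache[[key]])) {
      cache[[key]] <<- fun(...)
    }
    cache[[key]]
  }
}

report_value <- function(x) paste("The input value was:", x)
report_value_e <- memoise_by_expr(report_value)

a <- 10
first <- report_value_e(a + 1)
a <- 20
second <- report_value_e(a + 1)  # key is still "expr:a + 1", so this is stale
second                           # "The input value was: 11"
```

The second call returns the result computed when `a` was 10, which is the incorrect behaviour described for a value-consuming function.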

I think that what you want is for memoise to know what the function is doing with an input:

  • If it's using the input value normally, then use the value for a cache key.
  • If it's capturing the input as an unevaluated expression, then use the uneval'ed expression for the cache key.

However, memoise() has no idea what the memoized function is doing internally, so it can't know which one of these routes to go. One could imagine a version of memoise where you specify that some input values are used for the cache key, and other input expressions are used for the cache key, but we don't currently have plans for that.

In summary, memoise() won't work well with non-standard evaluation. You could avoid NSE by putting the burden on the user to pass in quoted expressions or quosures. This isn't great from a usability standpoint, but it will make the caching behave as you want.

h <- function(expr) {
  message("Running h()")
  paste("The input expression was:", deparse(expr))
}

a <- 10
h(expr(a + 1))
#> Running h()
#> [1] "The input expression was: `a + 1`"


hm <- memoise(h)

a <- 10
hm(expr(a + 1))
#> Running h()
#> [1] "The input expression was: `a + 1`"

a <- 20
hm(expr(a + 1))  # Cache hit
#> [1] "The input expression was: `a + 1`"

@yogat3ch

yogat3ch commented Nov 5, 2021

Hi @wch,
Thanks for elaborating on the functionality of memoise in greater depth!
I think I understand what you've explained here but I'm a little confused on the fourth chunk because you use
gm <- memoise(f)
but the message is Running g(). How is that possible?

gm <- memoise(f)

a <- 10
gm(a + 1)
#> Running g()
#> [1] "The input expression was: a + 1"

gm(a + 1) # Will have a cache hit.
#> [1] "The input expression was: a + 1"

a <- 20
gm(a + 1)
#> Running g()
#> [1] "The input expression was: a + 1"

Another thing I find puzzling here: why did fm not have a cache hit when the input is changed in the global env?

fm(a + 1)  # Will have a cache hit, so won't print "Running f()".
#> [1] "The input value was: 11"

a <- 20
fm(a + 1)
#> Running f()
#> [1] "The input value was: 21"

But in chunk 5, changing the input value in global env and passing it to fm triggers a cache hit:

a <- 10
fm(a+1)
#> Running f()
#> [1] "The input value was: 11"

a <- 20
fm(a+1)
#> [1] "The input value was: 11"

If I'm getting the gist of the last chunk, I think it's that unevaluated expressions passed in will not change the hash, even if the global-environment inputs to that expression change. However, I noticed that if we use !! like so, we don't run into that issue:

h <- function(expr) {
  message("Running h()")
  paste("The input expression was:", deparse(expr))
}

a <- 10
h(expr(a + 1))
#> Running h()
#> [1] "The input expression was: `a + 1`"


hm <- memoise(h)

a <- 10
hm(expr(!!a + 1))
#> Running h()
#> [1] "The input expression was: 10 + 1"

a <- 20
hm(expr(!!a + 1))  # Cache miss: the unquoted value changed
#> Running h()
#> [1] "The input expression was: 20 + 1"

Unfortunately, with the size of our app and the number of long, repetitive db queries, NSE is a must. We've nested the memoised function inside a function that will have unique inputs each time it's run. Because cf (formerly mc, renamed to disambiguate it from the mc inside a memoised function) evaluates to the call of that function with its unique inputs, I'm hoping the hash will be unique for our nested memoised function.
I do understand that this hack won't work in all use cases though.
Theoretically, couldn't you have a function argument that's just the trace_back in instances where you're using NSE? In all but the most basic use cases, isn't it going to give you a unique hash?

@wch
Member

wch commented Nov 5, 2021

I think I understand what you've explained here but I'm a little confused on the fourth chunk because you use
gm <- memoise(f)
but the message is Running g(). How is that possible?

Sorry, that was some bad copying and pasting and editing. It should have been gm <- memoise(g). I've corrected it in the original post.

Another thing I find puzzling here, why did fm not have a cache hit when the input is changed in the global env .... But in chunk 5, changing the input value in global env and passing it to fm triggers a cache hit:

Sorry I wasn't clear about that -- chunk 5 shows what would happen if it used the unevaluated expression for the key. I've added a comment to the top of that chunk to make it clearer.

@yogat3ch

yogat3ch commented Nov 5, 2021

@wch Ah, thank you for clarifying that!

@wch
Member

wch commented Nov 5, 2021

Here's a wrapper function for memoise() called memoise2(), which allows you to designate specific parameters for which the unevaluated expression will be used for caching, instead of the value.

library(memoise)
library(rlang)

# A version of memoise(). Some parameters can be designated as `expr_vars`. For
# these params, the unevaluated expression will be used for caching, instead of
# the value (which is what is normally used for caching).
memoise2 <- function(f, ..., expr_vars = character(0)) {
  f_wrapper <- function(args) {
    eval_tidy(expr(f(!!!args)))
  }
  f_wrapper_m <- memoise(f_wrapper, ...)


  function(...) {
    # Capture args as (unevaluated) quosures
    dot_args <- dots_definitions(...)$dots

    # For each arg, if it's in our set of `expr_vars`, extract the expression
    # (and discard the environment); if it's not in that set, then evaluate the
    # quosure. 
    dot_args <- mapply(
      dot_args,
      names(dot_args), 
      FUN = function(quo, name) {
        if (name %in% expr_vars) {
          get_expr(quo)
        } else {
          eval_tidy(quo)
        }
      },
      SIMPLIFY = FALSE
    )

    # Print out the captured items
    # str(dot_args)
    
    # Call the memoized wrapper function.
    f_wrapper_m(dot_args)
  }
}


f <- function(x, y) {
  message("Running f()")
  paste0("Captured x: ", deparse(enexpr(x)), ".   Evaluated y: ", y)
}

fm <- memoise2(f, expr_vars = c("x"))

a <- 10
b <- 10
f(x=a+1, b+2)
#> Running f()
#> [1] "Captured x: a + 1.   Evaluated y: 12"

# Run the memoized version twice. Second time results in a hit.
fm(x=a+1, b+2)
#> Running f()
#> [1] "Captured x: a + 1.   Evaluated y: 12"
fm(x=a+1, b+2)
#> [1] "Captured x: a + 1.   Evaluated y: 12"

# Changing `b` causes a cache miss, because it results in a different value for y,
# and y is a normal arg, where the value is used for caching.
b <- 20
fm(x=a+1, b+2)
#> Running f()
#> [1] "Captured x: a + 1.   Evaluated y: 22"

# Changing `a` does NOT cause a cache miss, because when we called memoise2, we
# designated `x` as an arg for which the unevaluated expression should be used for
# caching, instead of the value.
a <- 20
fm(x=a+1, b+2)
#> [1] "Captured x: a + 1.   Evaluated y: 22"

Note that there are some limitations:

  • When it invokes the user-defined function (f, in the example), it simply uses the unevaluated expression, and it doesn't call f from the expected environment. For some uses of NSE, this is fine; for others, it isn't. It is OK when the expression is not evaluated in the calling environment.
    There's a possible way to address this, but I think it would make the caching much less effective. In essence, f_wrapper would have the signature function(args, env), and you'd pass in the calling environment. But if that environment is used for caching, then it would result in cache misses when calling it from different environments, and that would happen each time you call a function which calls the memoized function.
  • The function returned by memoise2() only has ... args, so autocompletion won't work. That's probably fixable.
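The point about environments in the first limitation can be checked in base R: every call to a function creates a fresh frame, so a key that included the calling environment would differ from call to call. `inner()` and `wrapper()` below are hypothetical stand-ins for a memoized function and a function that calls it.

```r
inner <- function() parent.frame()  # the environment inner() was called from
wrapper <- function() inner()       # stands in for a function calling the memoized one

env_first  <- wrapper()
env_second <- wrapper()

# Each call to wrapper() runs in its own frame, so the two environments differ.
same_env <- identical(env_first, env_second)
same_env  # FALSE
```

Any cache key derived from these environments would therefore change on every call to `wrapper()`, giving a miss each time.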

@yogat3ch

yogat3ch commented Nov 12, 2021

Hi @wch,
Apologies for the delayed response on this, just now have the time to get to it.
Thank you for taking the time to create this wrapper for memoise and I hope it proves helpful for folks landing on this thread.

I don't think I explained my question too well honestly, but we were able to solve it.
For our situation, we actually need the memoised results of the unevaluated expression to depend on all of its tidy-evaluated inputs, rather than just having the unevaluated expression be a hashed value for the memoisation. I.e., the NSE expression is not unique to the results; the variables unquoted with !! in the NSE expression are.

In our case, we have two standard inputs to our memoised function tbl_query, which calls a database and munges the data. The data munging is accomplished with an unevaluated expression that is evaluated with lots of !! unquoting.

tbl_query <- function(tbl = "tbl_res_elev", scenario_ids, filter_month_day = FALSE, addtl_exp, cf = find_calling_fn()) {

  out <- dplyr::tbl(db_connect(), tbl)
  if (!missing(scenario_ids))
    out <- dplyr::filter(out, scenario_id %in% scenario_ids)

  if (filter_month_day)
    out <- dplyr::filter(out, month(date) == 12 & day(date) == 31)


  if (!missing(addtl_exp))
    out <- rlang::eval_bare(rlang::enexpr(addtl_exp))

  if (inherits(out, "tbl_Pool"))
    out <- dplyr::collect(out)
  return(out)
}

To overcome the limitation with NSE causing cache misses (or incorrect cache matches) we use the cf argument (short for "calling functions").

We made it such that tbl_query is always nested inside a function with the prefix get_ and this function is always called explicitly rather than anonymously such that the get_ always shows up in the prior calls. The NSE argument addtl_exp relies on arguments passed down by these get_ functions that determine which memoised result to return.
To ensure accurate cache matches, we pass a function to cf called find_calling_fn, which traverses the traceback, finds any call with the get_ prefix, evaluates its arguments in its calling context, and represents the call with evaluated arguments as a character string in the output, so that memoise accounts for the values passed to these upstream functions when it hashes.

You'll also see code that ensures this doesn't happen in db_connect where we also use it - though this is for better debugging and not related to memoisation.

find_calling_fn <- function(x = rlang::trace_back()) {
  code <- purrr::map_chr(tail(x$calls, 10), ~paste0(rlang::expr_deparse(.x), collapse = ""))

  in_db_connect <- any(stringr::str_detect(code, "get0\\(\"cf\"\\, envir \\= ce\\)"))

  idx_get <- stringr::str_which(code, "get_\\w+")
  if (!in_db_connect) {
    for (i in idx_get) {
      # the call index in the traceback
      idx_frame <- length(x$calls) - (length(code) - i)
      # the call itself
      the_call <- x$calls[[idx_frame]]
      # the call frame
      the_frame <- sys.frame(x$parents[idx_frame])
      # evaluate the args from the call in their respective frame (similar to memoise)
      .args <- purrr::map(rlang::call_args(the_call), rlang::eval_bare, env = the_frame)
      if (UU::is_legit(.args)) {
        # refit the call with the evaluated args so memoise makes a unique hash for the arguments
        new_code <- paste0(rlang::expr_deparse(rlang::call_modify(rlang::call_standardise(the_call, env = the_frame), !!!.args)), collapse = "")

        code[i] <- new_code
      }
    }
  }

  code <- code[idx_get]
  if (golem::app_dev() && is_debug() && !in_db_connect) {
    cli::cat_line(cli::col_green("cf_hash: "), cli::code_highlight(code, "material"))
  }
  invisible(code)
}
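A stripped-down base R sketch of the core step, for readers who want the idea without the app-specific parts: take the call of the enclosing function, evaluate its supplied arguments, and render the result as a string that can be hashed. `describe_caller()` and `get_rows()` are hypothetical names; the sketch assumes it is called directly from the body of the function being described and that all supplied arguments match named formals.

```r
describe_caller <- function() {
  the_call <- sys.call(-1)        # call of the function that invoked us
  the_fun <- sys.function(-1)     # that function's definition
  caller_frame <- parent.frame()  # its evaluation frame

  # Attach formal names to the arguments that were actually supplied.
  matched <- match.call(the_fun, the_call)
  arg_names <- names(matched)[-1]

  # Look the arguments up in the caller's frame, forcing their values.
  vals <- lapply(arg_names, get, envir = caller_frame)
  vals_chr <- vapply(
    vals,
    function(v) paste(deparse(v), collapse = ""),
    character(1)
  )

  paste0(
    deparse(matched[[1]]), "(",
    paste0(arg_names, " = ", vals_chr, collapse = ", "),
    ")"
  )
}

get_rows <- function(id, n = 5) describe_caller()

x <- 3
key_a <- get_rows(x)  # "get_rows(id = 3)"
x <- 4
key_b <- get_rows(x)  # "get_rows(id = 4)"
```

The two keys differ even though both calls were written as `get_rows(x)`, which is the property the `cf` argument relies on. Default arguments that were not supplied (here `n`) are not part of the key in this sketch.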

I know this is all highly specific and probably not entirely clear but I hope it's useful for folks landing on this thread. Happy to answer questions if need be.

I'm wondering if you see any potential exceptions or pitfalls to this situation, provided that we observe the syntactical conventions that make it work, that might cause memoisation to start malfunctioning?

swpease added a commit to swpease/hotdeckts that referenced this issue Mar 18, 2024
This *appears* correct. I tested it by inspection with the SUGG and CRAM data on a set seed in CV. I think I *could* set up a test for this, but it'd be a lot of work, a time-consuming test, and amount to repeatedly re-doing said test-by-inspection.

As it is, this memoization looks like it cuts the run time of CV down to a quarter of what it was.

refs: https://memoise.r-lib.org/reference/memoise.html#details
https://rdrr.io/r/base/ns-hooks.html
r-lib/profvis#134
r-lib/memoise#79 (comment)
4 participants