Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache invalidation is too strict? #58

Closed
karldw opened this issue Oct 28, 2017 · 4 comments
Closed

Cache invalidation is too strict? #58

karldw opened this issue Oct 28, 2017 · 4 comments

Comments

@karldw
Copy link
Contributor

karldw commented Oct 28, 2017

I noticed that memoise gives different results when a function is source-d vs run line by line, or when a package is installed on one platform vs another. (As Jim Hester mentioned in #55, this happens because the function itself is included in the hash, and I guess the underlying bytecode can differ.)

Below is a quick example that demonstrates the non-caching behavior when the program is run line by line or sourced.

One way to address this, if you think it's a problem that should be addressed, would be to use or borrow from digest::sha1. sha1 is a generic that dispatches to methods that try to look at meaningful differences (see vignette). Note that in the example below, the digests differ, but the sha1s are the same.

Do you think this is worth addressing? It seems like a non-issue for in-memory caches, but could be a pain for on-disk, potentially cross-platform caches.

library(memoise)
library(digest)

cache <- cache_filesystem("temp_Rcache")
fn <- function() {
  message("Actually running the function")
}
message("Digest: ", digest(fn, algo = "xxhash64"), "\nsha1:   ", sha1(fn))
fn_cached <- memoise(fn, cache = cache)
fn_cached()
@karldw
Copy link
Contributor Author

karldw commented Nov 5, 2017

I'd be happy to contribute a PR for this, if you think it's an issue worth addressing.

I think the difference stems from the srcref attribute of body(fn), which contains the source filename. digest's sha1.function gets around this by hashing as.character(body(fn)), where memoise just hashes body(fn).

If caches are shared across multiple machines with S3/Drive/Dropbox, it seems useful to allow for differing filenames. (I was wrong in my previous comment; the cross-platform aspect doesn't really matter.)

jimhester pushed a commit that referenced this issue Jan 3, 2018
`body(some_function)` returns the function's `srcref`, which will differ
across computers. Using `as.character(body(some_function))` hashes the
contents of the function, but not the source file.

See also: `digest::sha1`
https://cran.r-project.org/web/packages/digest/vignettes/sha1.html

Fixes #58
@krlmlr
Copy link
Member

krlmlr commented Nov 6, 2018

I wonder if as.list(body(fn)) would work too? I'm not sure if as.character() really drops the srcref, which seems desirable.

@karldw
Copy link
Contributor Author

karldw commented Nov 7, 2018

As far as I can tell, both work. Do you prefer one way over the other?

@krlmlr
Copy link
Member

krlmlr commented Nov 7, 2018

It's interesting that timings are opposite of what I expected:

bench::mark(
  digest::digest(as.character(body(tibble::new_tibble))),
  digest::digest(as.list(body(tibble::new_tibble))),
  check = FALSE
)
#> # A tibble: 2 x 10
#>   expression     min  mean  median      max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch> <bch:t> <bch:tm>     <dbl> <bch:byt> <dbl> <int>
#> 1 digest::d…  86.9µs 107µs  95.4µs 532.67µs     9343.    5.93MB     6  4428
#> 2 digest::d… 340.3µs 441µs 399.9µs   1.03ms     2269.   85.74KB     2  1103
#> # ... with 1 more variable: total_time <bch:tm>

Created on 2018-11-07 by the reprex package (v0.2.1)

As far as I can tell, the body() still has a srcref, but it's ignored in as.character() . I think the current implementation is fine, sorry for the noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants