Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expose algo / digest in cache_ #28

Closed
MikeBadescu opened this issue Oct 25, 2016 · 7 comments
Closed

expose algo / digest in cache_ #28

MikeBadescu opened this issue Oct 25, 2016 · 7 comments

Comments

@MikeBadescu
Copy link

Would it be possible to expose to the user the algo parameter for digest added in 4b3eb9f?

128 characters for a file name might be a little too much for Windows when the total path length should be 260 (although this is changing). Moreover, other hash functions are faster.

Even better, would it be possible to supply the digest function from the cache_ closure (as it is done with reset, set, etc.)? This would allow the user to provide their own digest function.

@jimhester
Copy link
Member

I think this is a good suggestion, there should be a more public way to do so.

It is technically possible to change this already (though undocumented), the _digest object in the functions environment holds the digest function used.

f <- function(x) x
memoise(f) -> fm
environment(fm)$`_digest` <- function(..., algo) digest::digest(algo = "xxhash64", ...)
fm(1)
#> [1] 1
fm(2)
#> [1] 2

@jimhester
Copy link
Member

25162b8 now has the cache functions specify the digest, and the algo paramter is in their constructor, so this can be easily overridden if needed.

@MikeBadescu
Copy link
Author

This was super fast, thank you! I agree with using xxhash64 for the file system.

@wlandau-lilly
Copy link

I am curious about the thought process behind the choice of hash functions, bothxxhash64 for the file system cache and sha512 for the other two caches. I am facing a similar decision in ropensci/drake#53 (offshoot from ropensci/drake#4).

@MikeBadescu
Copy link
Author

As stated in the body of the issue, certain Windows versions impose limitations on the total path length, i.e., folder\subfolder\yet_another_long_subfolder\very_long_file_name. To avoid such issues, a smaller file name will help, thus xxhash, which has a good speed performance. IIRC, the fastest implementation was the 64 bit version of xxhash running on a 64 bit OS; moreover, 64 chars has a much better chance at avoiding collisions. At the end, it is a trade-off.

@MikeBadescu
Copy link
Author

I think these are the links regarding speed: http://create.stephan-brumme.com/xxhash/ http://cyan4973.github.io/xxHash/
However, the serialization (first part of digest) in R takes a lot of time and it would become the main concern now with respect to performance.

@wlandau-lilly
Copy link

Thank you, @MikeBadescu. Your explanation and references helped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants