
Rehash files when and only when necessary. Use timestamps to decide. #4

Closed
wlandau-lilly opened this issue Feb 27, 2017 · 16 comments

@wlandau-lilly
Collaborator

wlandau-lilly commented Feb 27, 2017

I use fingerprints (hashes) to detect when the user's external files change. Fingerprints are expensive, so I use file modification times to judge whether fingerprinting is even worth the expense. The trouble is that file.mtime() is egregiously imprecise on Windows and Mac, so genuine updates to files could be missed. My current workaround is to force a rehash when

  1. the file is less than 100 KB, or
  2. the file was just built by drake (not imported)

This rule mostly covers it, but manual changes to any medium-to-large file may be missed if drake checks that file within the same second. I would really like more precise timestamps. With even millisecond precision, I could simply wait until the next increment in mtime before importing or building the next file.

Sometime in the future, I may be able to assume all file systems support high-resolution times. Apparently, R 3.3.3 will have this solved for Windows. But until that happens on all platforms, I do not think I can solve this issue.
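For reference, here is a quick way to see how much mtime precision a given platform actually reports (illustration only, not drake code):

f <- tempfile()
file.create(f)
# Print the modification time with microsecond digits. On file systems or
# R builds with coarse mtimes, the fractional seconds come back as zeros.
format(file.mtime(f), "%Y-%m-%d %H:%M:%OS6")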

@wlandau-lilly
Collaborator Author

For 2.0.0, I got rid of the force_rehash argument. It's esoteric for most users, and it shouldn't really be necessary.

@wlandau-lilly wlandau-lilly changed the title Improve use of timestamps to minimize unnecessary rehashing of files Timestamps: minimize time spent rehashing of files, but still rebuild files when needed. Mar 7, 2017
@AlexAxthelm
Collaborator

AlexAxthelm commented Aug 8, 2017

Would using xxhash64 rather than md5 as the hashing algorithm be sufficient? It's noticeably faster on files, and not slower on objects in memory.

It's built into digest already. There would have to be some logic to identify hashes that are already md5, to prevent unnecessary rebuilding, but that same function could update the stored md5 hash to xxhash64 to make things faster the next time around.
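Something like this might work for that logic (hash_is_current() is a hypothetical helper, not part of digest or drake); the length of the stored hash tells us which algorithm produced it:

# md5 digests are 32 hex characters; xxhash64 digests are 16.
hash_is_current <- function(file, stored_hash) {
  algo <- if (nchar(stored_hash) == 32) "md5" else "xxhash64"
  digest::digest(file, algo = algo, file = TRUE) == stored_hash
}
# When an old md5 hash still matches, the caller could quietly re-store the
# xxhash64 digest so the next check uses the faster algorithm.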

I ran a test this morning comparing the digest algorithms:

[Plot: hashing time (seconds) vs. size (MB) for each algorithm, faceted by object vs. file]

library(tidyverse)
library(progress)
library(dplyr)
library(stringr)
library(microbenchmark)
library(digest)

results <- NULL
algo_list <- c("md5", "sha1",  "xxhash32", "xxhash64", "murmur32")
n_loops <- 25
pb <- progress::progress_bar$new(
  format = ":algo :size [:bar] :percent eta: :eta",
  total = sum(2 ^ seq_len(n_loops)) * length(algo_list) # one tick of length 2^i per algorithm
  )
tf <- tempfile(fileext = ".RDS")

for (i in seq(from = 1, to = n_loops, by = 1)){
  rn <- rnorm(2^i)
  saveRDS(rn, tf)
  for (algo in algo_list){
    pb$tick(len = 2^i, tokens = list(algo = stringr::str_pad(algo, 9), size = stringr::str_pad(i, 3)))
    mbx <- microbenchmark::microbenchmark(
      object = digest::digest(rn, algo = algo, file = FALSE), # hash the in-memory object
      file = digest::digest(tf, algo = algo, file = TRUE),    # hash the serialized file
      unit = "s"
    )
    filesize <- file.size(tf)
    objectsize = object.size(rn)
    mbx <- dplyr::mutate(mbx,
      algorithm = algo,
      size = ifelse(expr == "file", filesize, objectsize)
      )
    results <- dplyr::bind_rows(
      results,
      mbx
    )
  }
}

print("saving")
saveRDS(results, "hash_results.RDS")

print("plotting")
ggplot(
  results,
  aes(
    x = size / (10^6),
    y = time / (10^9),
    group = factor(algorithm),
    color = factor(algorithm),
    shape = factor(expr)
    )
) + 
  labs(
    x = "Mb", y = "Seconds"
  ) +
  facet_wrap(~ expr) +
  # geom_point() +
  geom_smooth()

I can't embed the RDS in a comment here, but I can make a gist if you want. The individual results mostly show that md5 has a higher variance in hash time than the xx family.

@wlandau-lilly
Collaborator Author

@AlexAxthelm Thanks for the idea! I will need some time to think about this, but I am absolutely interested.

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Aug 8, 2017

@AlexAxthelm A few initial thoughts:

  • Speeding up and avoiding hashing are the best things we can do to boost performance, so +1 for bringing this up.
  • Changing the algorithm could threaten backward compatibility, but as you say, we could grandfather in md5 sums if we're careful.
  • Faster hashing would be great, but it would not totally solve this particular issue. The original purpose was to make better use of file modification times to detect cases where we already know the file is up to date: i.e., where it would be pointless to hash at all.

@wlandau-lilly
Collaborator Author

Also, richfitz/storr#20 is relevant here.

@wlandau-lilly
Collaborator Author

And yet md5 and sha1 seem to be more commonly used, possibly because they are better at avoiding collisions. Safety may be worth the extra hashing time, and I think we should investigate further. I did notice that xxhash64 hashes are relatively short.

write.csv(mtcars, "mtcars.csv")
library(digest)
digest("mtcars.csv", file = TRUE, algo = "sha1")
## [1] "0d84e86eebe633a69a78206e47f3522eafe0fbbf"
digest("mtcars.csv", file = TRUE, algo = "md5")
## [1] "6463474bfe6973a81dc7cbc4a71e8dd1"
digest("mtcars.csv", file = TRUE, algo = "xxhash64")
## [1] "cd094514792ef806"

@wlandau-lilly wlandau-lilly changed the title Timestamps: minimize time spent rehashing of files, but still rebuild files when needed. Rehash files when and only when necessary. Use timestamps to decide. Aug 8, 2017
@AlexAxthelm
Collaborator

Hadn't noticed that xxhash64 hashes are actually shorter. I guess that's one way to make things faster. 😃

@wlandau-lilly
Collaborator Author

In these past several months, I have not had any new ideas about making better use of file modification times. The original point of this issue has exhausted its shelf life. For the latest discussion of drake's hashing, see #53.

@wlandau-lilly
Collaborator Author

The decision to rehash is now better encapsulated in the should_rehash_file() function. See R/hash.R in the next release (v4.1.0).

should_rehash_file <- function(filename, new_mtime, old_mtime,
  size_cutoff){
  # Rehash small files unconditionally, and larger files only when the
  # modification time has advanced.
  do_rehash <- file.size(filename) < size_cutoff | new_mtime > old_mtime
  if (is.na(do_rehash)){
    # Missing file size or mtime: rehash to be safe.
    do_rehash <- TRUE
  }
  do_rehash
}

file_hash <- function(target, config, size_cutoff = 1e5) {
  if (is_not_file(target))
    return(as.character(NA))
  filename <- eply::unquote(target)
  if (!file.exists(filename))
    return(as.character(NA))
  # Compare the cached modification time (if any) with the current one.
  old_mtime <- ifelse(target %in% config$inventory_filemtime,
    config$cache$get(key = target, namespace = "filemtime"),
    -Inf)
  new_mtime <- file.mtime(filename)
  do_rehash <- should_rehash_file(
    filename = filename,
    new_mtime = new_mtime,
    old_mtime = old_mtime,
    size_cutoff = size_cutoff)
  if (do_rehash){
    rehash_file(target)
  } else {
    # Reuse the hash already stored in the cache.
    config$cache$get(target)$value
  }
}
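To illustrate the decision rule (hypothetical calls, not drake's test suite):

f <- tempfile()
writeLines("small file", f)
should_rehash_file(f, new_mtime = Sys.time(), old_mtime = Sys.time() - 10,
  size_cutoff = 1e5)
## TRUE: under the size cutoff, and the mtime advanced
should_rehash_file(f, new_mtime = 0, old_mtime = 1, size_cutoff = 0)
## FALSE: over the (zero) cutoff and no newer mtime, so skip rehashing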

@wlandau
Collaborator

wlandau commented Nov 13, 2018

Back to the original issue: we could consider microbenchmark::get_nanotime() for timestamps. It seems to avoid the resolution problems I was seeing on Windows, and it could help clean up some messy logic.
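For example (illustration only):

# A monotonic clock in nanoseconds, useful for measuring elapsed time at
# high resolution.
t0 <- microbenchmark::get_nanotime()
Sys.sleep(0.001)
(microbenchmark::get_nanotime() - t0) / 1e9 # elapsed seconds, roughly 0.001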

@wlandau
Collaborator

wlandau commented Nov 13, 2018

Oops, I keep forgetting: whatever we do in R, the actual file system will use its own potentially low-resolution method for assigning modification times to files. Never mind.

@AlexAxthelm
Collaborator

At the risk of adding more complexity: could we use an optional meta-object that records the nanotimes of file creation? drake would check whether that meta-object exists, use those precise timestamps if it does, and fall back to the file system otherwise. That way the default behavior doesn't change, but anyone running into this problem could flip a flag to get the higher precision.

@wlandau
Collaborator

wlandau commented Nov 16, 2018

I think it would be easy to include nanotimes in the existing target-level metadata. But if the user manually edits a file using, say, a text editor, how do we get that nanotime? This is exactly what I keep forgetting to consider.

@AlexAxthelm
Collaborator

We wouldn't. drake would need to use max(<<filesystem_time>>, <<recorded_nanotime>>), and we accept that if the user edits a file outside of drake, then that precision can never be recovered.
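Something like this, perhaps (hypothetical names, not drake's actual API):

# recorded_mtime: a sub-second timestamp drake would store when it builds the
# file. Edits made outside drake only show up through the coarser file.mtime().
effective_mtime <- function(filename, recorded_mtime = -Inf) {
  fs_mtime <- as.numeric(file.mtime(filename)) # may be truncated to whole seconds
  max(fs_mtime, recorded_mtime)
}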

@wlandau
Collaborator

wlandau commented Nov 16, 2018

Hmm... that could actually work. I will need to think about it some more. It would be great to simplify the logic in should_rehash_file() and rely more on time stamps than on hashes. But because file-system mtimes still lose precision, I am not yet sure high-resolution time stamps would make a difference there, and the benefit would need to be weighed against adding microbenchmark as a package dependency.

@wlandau
Collaborator

wlandau commented Jun 2, 2019

Update: effective 9793232, drake now uses file.size() to help decide whether to rehash files. The new decision rule is to recompute the hash if and only if:

  1. The file is under 100 KB, or
  2. The new time stamp is later than the old time stamp, or
  3. The file size changed.

cc @skolenik.
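Here is a minimal sketch of the updated rule (illustrative only; it assumes the previous size and mtime are stored alongside each file target):

should_rehash <- function(new_mtime, old_mtime, new_size, old_size,
  size_cutoff = 1e5){
  out <- new_size < size_cutoff ||  # 1. small files are cheap to hash
    new_mtime > old_mtime ||        # 2. the time stamp advanced
    new_size != old_size            # 3. the size changed
  if (is.na(out)){
    out <- TRUE                     # missing metadata: rehash to be safe
  }
  out
}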
