Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit file rehashing policy #1062

Closed
3 tasks done
wlandau opened this issue Apr 30, 2023 · 7 comments
Closed
3 tasks done

Revisit file rehashing policy #1062

wlandau opened this issue Apr 30, 2023 · 7 comments
Assignees

Comments

@wlandau
Copy link
Collaborator

wlandau commented Apr 30, 2023

Prework

  • I understand and agree to help guide.
  • I understand and agree to contributing guide.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.

Problem

The current file rehashing policy is coded in file_should_rehash():

targets/R/class_file.R

Lines 58 to 63 in c64a7e8

file_should_rehash <- function(file, time, size, bytes) {
small <- bytes < file_small_bytes
touched <- !identical(time, file$time)
resized <- !identical(size, file$size)
small || touched || resized
}

In particular, small files are always rehashed which is a bottleneck for pipelines with large numbers of small files. Because time stamps have low resolution on e.g. Windows, they are only trusted when the file is large.

Proposal

For small files in _targets/objects/ which the user should not modify by hand, I propose we try to avoid rehashing them. We might just compare the modification time to Sys.time() and trust the time stamp if it is older than a second. We could reduce this threshold on non-Windows machines. Would be good to revisit the actual timestamp resolution on various platforms.

I plan to keep the existing policy for format = "file" because those files are controlled by the user and it is harder to make the required simplifying assumptions for more nuanced cache invalidation.

@wlandau wlandau self-assigned this Apr 30, 2023
@wlandau
Copy link
Collaborator Author

wlandau commented Apr 30, 2023

In drake, I started with a policy that naively checked timestamps before checking hashes. On a low-spec Windows machine, I soon began to notice that files were incorrectly labeled up to date when two drake::make() calls happened within one second of each other. This was happening because time stamps had low precision. In fact, from a quick search, it looks like the precision of timestamps can be as poor as 2 seconds.

At the time, I worked around this problem by making drake always hash files that were less than 100 Kb, and I continued this policy in targets (file_small_bytes). I intend to keep this policy for format = "file" targets because these are files that the user is likely to modify manually. I want to relax this policy for files in _targets/objects/, but it is tricky. Suppose we make a rule to not trust time stamps less than 2 seconds old. This may not be enough because the user could modify a file in _targets/objects within a seconds of its creation/modification by targets, which could prevent the correct hash from being found.

@wlandau
Copy link
Collaborator Author

wlandau commented Apr 30, 2023

In drake, I started with a policy that naively checked timestamps before checking hashes. On a low-spec Windows machine, I soon began to notice that files were incorrectly labeled up to date when two drake::make() calls happened within one second of each other. This was happening because time stamps had low precision. In fact, from a quick search, it looks like the precision of timestamps can be as poor as 2 seconds.

I think this was happening for external user-defined files, the equivalent of format = "file" in targets. That would make sense because drake has always used storr to store ordinary targets, and back in 2017 I would not have messed around trying to find time stamps of internal files in a storr file system.

@wlandau
Copy link
Collaborator Author

wlandau commented Apr 30, 2023

I was hoping to find out which systems support trustworthy file timestamps, but it looks like this information is too varied and too brittle.

@wlandau
Copy link
Collaborator Author

wlandau commented Apr 30, 2023

I think the right call may be to enhance the file argument of tar_cue(). Proposed values:

  • "auto" (default): trust the time stamps of files in _targets/objects/, but do not trust the time stamps of any external local files. Given Revisit file rehashing policy #1062 (comment), I think this is permissible. I think I will need to check the docs and make sure users know not to modify _targets/objects/ manually.
  • "hash": always check the hash.
  • "system": check the timestamp and file size from the operating system. If both are the same, consider the file up to date. If either changed, recompute the hash and check if it agrees with the one from the metadata. I can add a bunch of docs for this, including in the manual.
  • "skip": do not check the output files.

Remarks:

  • file = TRUE will map to file = "auto", and file = FALSE will map to file = "skip" (with deprecation warnings).
  • As I mentioned, above, the current policy is to trust time stamps for large files and recompute hashes for small files. This avoids some of the very worst outcomes in both extremes, but it is neither efficient nor safe. I plan to remove this file size threshold policy as part of this issue.
  • For cloud storage, my current thinking is to recommend file = "skip" if you can trust that the bucket will stay where you leave it, but I might reconsider.

@wlandau
Copy link
Collaborator Author

wlandau commented May 1, 2023

On second thought, the more complicated file cue would create more for users to think about. For format = "file", I like the file threshold rule more. Smaller files are more likely to change more often, and the general approach has worked for years. Maybe something else would work better, but I still want to fully automate it.

@wlandau
Copy link
Collaborator Author

wlandau commented May 1, 2023

New plan:

  • Make format = "file" never trust timestamps.
  • Add a new "file_fast" format that always trusts timestamps.
  • Add a new trust_object_timestamps to control whether to trust the timestamps of files in _targets/objects/ (default: TRUE).
  • Abandon the file size cutoff for always rehashing files.
  • Scan the docs and make sure the policy is to never write to _targets/objects/ outside the pipeline.

@wlandau
Copy link
Collaborator Author

wlandau commented May 1, 2023

Implemented in 720a865.

@wlandau wlandau closed this as completed May 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant