Revisit file rehashing policy #1062

wlandau · 2023-04-30T17:19:16Z

Prework

I understand and agree to help guide.
I understand and agree to contributing guide.
New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.

Problem

The current file rehashing policy is coded in file_should_rehash():

Lines 58 to 63 in c64a7e8

    
           file_should_rehash <- function(file, time, size, bytes) { 
        
             small <- bytes < file_small_bytes 
        
             touched <- !identical(time, file$time) 
        
             resized <- !identical(size, file$size) 
        
             small || touched || resized 
        
           }

In particular, small files are always rehashed which is a bottleneck for pipelines with large numbers of small files. Because time stamps have low resolution on e.g. Windows, they are only trusted when the file is large.

Proposal

For small files in _targets/objects/ which the user should not modify by hand, I propose we try to avoid rehashing them. We might just compare the modification time to Sys.time() and trust the time stamp if it is older than a second. We could reduce this threshold on non-Windows machines. Would be good to revisit the actual timestamp resolution on various platforms.

I plan to keep the existing policy for format = "file" because those files are controlled by the user and it is harder to make the required simplifying assumptions for more nuanced cache invalidation.

The text was updated successfully, but these errors were encountered:

wlandau · 2023-04-30T22:46:21Z

In drake, I started with a policy that naively checked timestamps before checking hashes. On a low-spec Windows machine, I soon began to notice that files were incorrectly labeled up to date when two drake::make() calls happened within one second of each other. This was happening because time stamps had low precision. In fact, from a quick search, it looks like the precision of timestamps can be as poor as 2 seconds.

At the time, I worked around this problem by making drake always hash files that were less than 100 Kb, and I continued this policy in targets (file_small_bytes). I intend to keep this policy for format = "file" targets because these are files that the user is likely to modify manually. I want to relax this policy for files in _targets/objects/, but it is tricky. Suppose we make a rule to not trust time stamps less than 2 seconds old. This may not be enough because the user could modify a file in _targets/objects within a seconds of its creation/modification by targets, which could prevent the correct hash from being found.

wlandau · 2023-04-30T23:20:48Z

In drake, I started with a policy that naively checked timestamps before checking hashes. On a low-spec Windows machine, I soon began to notice that files were incorrectly labeled up to date when two drake::make() calls happened within one second of each other. This was happening because time stamps had low precision. In fact, from a quick search, it looks like the precision of timestamps can be as poor as 2 seconds.

I think this was happening for external user-defined files, the equivalent of format = "file" in targets. That would make sense because drake has always used storr to store ordinary targets, and back in 2017 I would not have messed around trying to find time stamps of internal files in a storr file system.

wlandau · 2023-04-30T23:21:12Z

I was hoping to find out which systems support trustworthy file timestamps, but it looks like this information is too varied and too brittle.

wlandau · 2023-04-30T23:43:43Z

I think the right call may be to enhance the file argument of tar_cue(). Proposed values:

"auto" (default): trust the time stamps of files in _targets/objects/, but do not trust the time stamps of any external local files. Given Revisit file rehashing policy #1062 (comment), I think this is permissible. I think I will need to check the docs and make sure users know not to modify _targets/objects/ manually.
"hash": always check the hash.
"system": check the timestamp and file size from the operating system. If both are the same, consider the file up to date. If either changed, recompute the hash and check if it agrees with the one from the metadata. I can add a bunch of docs for this, including in the manual.
"skip": do not check the output files.

Remarks:

file = TRUE will map to file = "auto", and file = FALSE will map to file = "skip" (with deprecation warnings).
As I mentioned, above, the current policy is to trust time stamps for large files and recompute hashes for small files. This avoids some of the very worst outcomes in both extremes, but it is neither efficient nor safe. I plan to remove this file size threshold policy as part of this issue.
For cloud storage, my current thinking is to recommend file = "skip" if you can trust that the bucket will stay where you leave it, but I might reconsider.

wlandau · 2023-05-01T09:16:15Z

On second thought, the more complicated file cue would create more for users to think about. For format = "file", I like the file threshold rule more. Smaller files are more likely to change more often, and the general approach has worked for years. Maybe something else would work better, but I still want to fully automate it.

wlandau · 2023-05-01T12:05:27Z

New plan:

Make format = "file" never trust timestamps.
Add a new "file_fast" format that always trusts timestamps.
Add a new trust_object_timestamps to control whether to trust the timestamps of files in _targets/objects/ (default: TRUE).
Abandon the file size cutoff for always rehashing files.
Scan the docs and make sure the policy is to never write to _targets/objects/ outside the pipeline.

wlandau · 2023-05-01T14:59:30Z

Implemented in 720a865.

wlandau added the topic: performance label Apr 30, 2023

wlandau self-assigned this Apr 30, 2023

wlandau closed this as completed May 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revisit file rehashing policy #1062

Revisit file rehashing policy #1062

wlandau commented Apr 30, 2023

wlandau commented Apr 30, 2023

wlandau commented Apr 30, 2023

wlandau commented Apr 30, 2023

wlandau commented Apr 30, 2023 •

edited

wlandau commented May 1, 2023 •

edited

wlandau commented May 1, 2023

wlandau commented May 1, 2023

Revisit file rehashing policy #1062

Revisit file rehashing policy #1062

Comments

wlandau commented Apr 30, 2023

Prework

Problem

Proposal

wlandau commented Apr 30, 2023

wlandau commented Apr 30, 2023

wlandau commented Apr 30, 2023

wlandau commented Apr 30, 2023 • edited

wlandau commented May 1, 2023 • edited

wlandau commented May 1, 2023

wlandau commented May 1, 2023

wlandau commented Apr 30, 2023 •

edited

wlandau commented May 1, 2023 •

edited