store outputs in a content-addressed store #33
Conversation
Force-pushed 2a990c8 to 2886da1
```rust
///////////////////////////////////////////////
let mut ancestors: Vec<&Path> = source.ancestors().skip(1).collect();
ancestors.pop(); // we've made sure this is relative, so the last item is ""
ancestors.reverse(); // go `[a, a/b, a/b/c]` instead of `[a/b/c, a/b, a]`
```
I think you can put this into the iterator before the collect: `source.ancestors().skip(1).rev().collect()`. Should be more efficient.
In that case we'd need to pop `""` from the front (see my other comment) and would need to use `remove()` with an O(n) runtime, or use a `VecDeque`. `reverse()` happens in place. The docs don't specify a runtime, but I'd imagine it's like half the length, since we'd be doing n/2 swaps.
But in the bigger picture, I don't expect these will be more than a handful of directories deep. There's no need for extremely deep nesting, since output directories do not clash with each other in the store. The n here would be small enough not to matter, unless I've badly misunderstood something about how iterators operate!
Ah, so I guess you could do `source.ancestors().skip(1).rev().skip(1)` and avoid the collecting altogether. But that is also harder to understand, so I totally get it if you prefer this setup.
I'm not sure either of those is particularly intuitive. I should probably add a comment on this similar to the one I left above!
BTW, turns out we can't do this: `rev` is only implemented for double-ended iterators, which `ancestors` is not.
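For completeness, a small runnable sketch of the collect-then-reverse workaround; the relative path here is made up for illustration:

```rust
use std::path::Path;

fn main() {
    // Hypothetical relative output path, for illustration only.
    let source = Path::new("a/b/c/file.txt");

    // `ancestors()` yields longest-to-shortest: "a/b/c/file.txt",
    // "a/b/c", "a/b", "a", "". It is not a DoubleEndedIterator, so
    // `.rev()` is unavailable; collect and reverse in place instead.
    let mut ancestors: Vec<&Path> = source.ancestors().skip(1).collect();
    ancestors.pop(); // the path is relative, so the last item is ""
    ancestors.reverse(); // shallow-to-deep: [a, a/b, a/b/c]

    assert_eq!(
        ancestors,
        vec![Path::new("a"), Path::new("a/b"), Path::new("a/b/c")]
    );
}
```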
```rust
    &ancestor.display(),
    temp.path().display()
);
std::fs::create_dir(temp.path().join(&ancestor)).with_context(|| {
```
Do you really just want `std::fs::create_dir_all`?
That might simplify the code some, yeah, but at a performance cost:
- we'd be doing a filesystem read for each item in the ancestor chain for every item in the outputs to check if we need to make the directory
- we'd have to walk all the output paths again later to make the directories read-only after all the files are situated within them
That approach would save us some memory and a little bit of complexity, but make disk performance much worse. I could still see it being ok though, since ideally we will be avoiding rebuilds (and the subsequent workspace consumptions) as much as possible. The exception would be on the first build, where we'll be rebuilding everything.
> we'd be doing a filesystem read for each item in the ancestor chain
Not exactly: `create_dir_all` starts at the child node and works backwards towards the ancestors as necessary. So to create `a/b/c/d/e` where `a/b` already exists, it will:
- try and fail to create `a/b/c/d/e`
- try and fail to create `a/b/c/d`
- succeed at creating `a/b/c`
- succeed at creating `a/b/c/d`
- succeed at creating `a/b/c/d/e`

Not sure of the cost of failing to create a dir because its parent doesn't exist, but assuming that is cheap, this should be fast.
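A tiny std-only sketch of that behavior; the scratch location under the OS temp dir is just for illustration:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Scratch location under the OS temp dir, for illustration only.
    let base = std::env::temp_dir().join(format!("cda-demo-{}", std::process::id()));
    fs::create_dir_all(base.join("a/b"))?;

    // One call creates the whole missing chain c/d/e under a/b: it
    // tries the full path first, recurses to create missing parents,
    // then retries the children.
    fs::create_dir_all(base.join("a/b/c/d/e"))?;
    assert!(base.join("a/b/c/d/e").is_dir());

    // Re-running on an already-existing path is not an error.
    fs::create_dir_all(base.join("a/b/c/d/e"))?;

    fs::remove_dir_all(&base) // clean up the scratch tree
}
```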
> we'd have to walk all the output paths again later to make the directories read-only after all the files are situated within them
True; the cost here is probably fs-dependent. I would assume not marking anything read-only and then later marking everything read-only would have low cost: (1) most things should be in fs cache; (2) you would only be traversing metadata that should be stored pretty compactly. Also, I would assume the cost to hash and copy files would overshadow all of this anyway.
Just to clarify, in this scenario, either everything or nothing should exist, correct? Since either we ran before and have all the outputs, or we haven't and need to build all directories. If this is correct, your most efficient option would be to process all ancestors of all files at once and just do a de-duplication of the list of ancestors.
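If we went that way, here is a hedged sketch of the dedup-then-create idea; the output paths are made up, and the `BTreeSet` is one possible choice because `Path`'s component-wise ordering sorts parents before children:

```rust
use std::collections::BTreeSet;
use std::path::{Path, PathBuf};

fn main() {
    // Hypothetical relative output paths, for illustration only.
    let outputs = ["a/b/c/one.txt", "a/b/d/two.txt", "a/e/three.txt"];

    // Gather every ancestor directory of every output exactly once.
    // BTreeSet deduplicates, and Path's component-wise ordering puts
    // parents before children, so iterating in order means each
    // directory's parent already exists by the time we create it.
    let mut dirs: BTreeSet<PathBuf> = BTreeSet::new();
    for output in &outputs {
        for ancestor in Path::new(output).ancestors().skip(1) {
            if !ancestor.as_os_str().is_empty() {
                dirs.insert(ancestor.to_path_buf());
            }
        }
    }

    let in_order: Vec<String> = dirs.iter().map(|d| d.display().to_string()).collect();
    assert_eq!(in_order, ["a", "a/b", "a/b/c", "a/b/d", "a/e"]);
}
```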
> Not exactly: `create_dir_all` starts at the child node and works backwards towards the ancestors as necessary.

Ah, that's on me for assuming that `create_dir_all` would start from the root. Thanks for clarifying.
> Also, I would assume the cost to hash and copy files would overshadow all of this anyway.
Hashing, sure, but we're renaming the files instead of copying them since the original bytes would be reaped anyway when the build workspace was deleted.
> Just to clarify, in this scenario, either everything or nothing should exist, correct?
Yeah, that's right.
> If this is correct, your most efficient option would be to process all ancestors of all files at once and just do a de-duplication of the list of ancestors.
Let me rephrase this to make sure I understand what you're suggesting:
- calculate the hash based on the files
- exit early if we already have the hash
- do the filesystem work, deduplicating ancestors
I ask because we already are effectively deduplicating ancestors: we're leaning heavily on the fact that the output directory won't exist in advance, so we don't need to check whether some parent directory exists before creating it. I wonder about combining our suggestions, such that the "do the filesystem work" step above looks like:
- create the temporary collection directory
- create all the directories
- move all the files, marking them as read-only
- mark all the directories as read-only
- move the collection directory into the store
(I want to insist on steps 1 and 5 for safety, even though they'll add just a little overhead. It'd be really bad to have a corrupt path in the CAS!)
Anyway, I think this could end up much nicer, both from an efficiency perspective and an ease-of-understanding one. We could reuse the approach for create-move-mark here, or use some other kind of filesystem walking like `create_dir_all`. (I think the deciding factor maybe should be approachability instead of speed here, although with some reasonable documentation of assumptions I think the current approach could be fairly understandable.)
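A rough sketch of those five steps, assuming everything lives under a throwaway scratch root; the directory names and the final hash (`abc123`) are all hypothetical:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // All locations and the final hash here are hypothetical.
    let base = std::env::temp_dir().join(format!("cas-demo-{}", std::process::id()));
    let workspace = base.join("workspace");
    let store = base.join("store");
    fs::create_dir_all(workspace.join("out/sub"))?;
    fs::create_dir_all(&store)?;
    fs::write(workspace.join("out/sub/result.txt"), b"hello")?;

    // 1. create the temporary collection directory (inside the store,
    //    so the final rename never crosses a filesystem boundary)
    let temp = store.join(".incoming-demo");
    fs::create_dir(&temp)?;

    // 2. create all the directories up front
    fs::create_dir(temp.join("sub"))?;

    // 3. move (rename, not copy) the files, marking them read-only
    let file = temp.join("sub/result.txt");
    fs::rename(workspace.join("out/sub/result.txt"), &file)?;
    let mut perms = fs::metadata(&file)?.permissions();
    perms.set_readonly(true);
    fs::set_permissions(&file, perms)?;

    // 4. mark the directories read-only once the files are in place
    let mut dir_perms = fs::metadata(temp.join("sub"))?.permissions();
    dir_perms.set_readonly(true);
    fs::set_permissions(temp.join("sub"), dir_perms)?;

    // 5. move the collection directory into the store under its hash
    let final_location = store.join("abc123");
    fs::rename(&temp, &final_location)?;

    assert_eq!(fs::read(final_location.join("sub/result.txt"))?, b"hello");
    Ok(())
}
```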
Oh also, this maybe should be a separate refactor in a new PR! The current approach works OK; this would get that last little bit of performance out and make it easier to understand, but I don't think it makes sense to block the rest of the outstanding PRs on that work.
Yeah, I am fine with either solution, just feel there are ways to simplify the readability here. Anyway, not needed for this PR for sure. Maybe file an issue to track if you want to change it.
Yeah, I think I do want to change it. Might as well apply the guiding principles (understandable, then approachable, then fast) to the code as well as the system's behavior. Thank you for this excellent review!
```rust
// that it doesn't get automatically removed when it's dropped. We've
// so far avoided that to avoid leaving temporary directories laying
// around in case of errors.
std::fs::rename(temp.into_path(), &final_location)
```
Instead of having a temporary directory that is in `/tmp`, would it make more sense to put it in the store with a prefix or suffix? That way, if something goes wrong, you can still look at what was outputted and potentially debug better.
Yeah, that might make sense. Easy to implement, too!
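A minimal sketch of that idea, with a made-up store root and hash: keeping the scratch directory inside the store means the final rename stays on one filesystem, and a failed build leaves its partial output behind under a recognizable prefix:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Hypothetical store root, for illustration only.
    let store = std::env::temp_dir().join(format!("store-demo-{}", std::process::id()));
    fs::create_dir_all(&store)?;

    // The scratch directory lives inside the store under a recognizable
    // prefix: if a build fails, its partial output stays here where it
    // can be inspected, instead of vanishing from /tmp.
    let scratch = store.join(".tmp-build-demo");
    fs::create_dir(&scratch)?;
    fs::write(scratch.join("out.txt"), b"partial output")?;

    // On success, a same-filesystem rename publishes it atomically.
    let final_location = store.join("abc123"); // hypothetical hash
    fs::rename(&scratch, &final_location)?;
    assert!(final_location.join("out.txt").exists());

    fs::remove_dir_all(&store) // clean up the demo tree
}
```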
Force-pushed 2886da1 to 801289b
That lets us avoid doing work.
I know this code will need to be changed for this to work in parallel, but we need this logic anyway so I think that's OK for the moment.
Based on #32; merge that first!