Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

store outputs in a content-addressed store #33

Merged
merged 24 commits into from
Aug 25, 2022
Merged

store outputs in a content-addressed store #33

merged 24 commits into from
Aug 25, 2022

Conversation

BrianHicks
Copy link
Member

That lets us avoid doing work, like so:

$ roc run examples/hello/rbt.roc
馃敤 Rebuilding host...
2022-08-18T22:34:02.255Z DEBUG [host::coordinator] preparing to run job d2a0eee0ece469cb
2022-08-18T22:34:02.259Z TRACE [host::store] moving out into collection path
2022-08-18T22:34:02.259Z DEBUG [host::store] moving f8ccb17e33d1f54fdc0761c864ef8d5881430bc55e887e0647ffc951b96d8032 into store

$ roc run examples/hello/rbt.roc
馃敤 Rebuilding host...
2022-08-18T22:34:11.889Z DEBUG [host::coordinator] preparing to run job d2a0eee0ece469cb
2022-08-18T22:34:11.889Z DEBUG [host::coordinator] already had output of this job; skipping

I know this code will need to be changed for this to work in parallel, but we need this logic anyway so I think that's OK for the moment.

Based on #32; merge that first!

src/cli.rs Show resolved Hide resolved
src/coordinator.rs Show resolved Hide resolved
src/job.rs Show resolved Hide resolved
src/job.rs Show resolved Hide resolved
src/store.rs Show resolved Hide resolved
src/store.rs Show resolved Hide resolved
///////////////////////////////////////////////
let mut ancestors: Vec<&Path> = source.ancestors().skip(1).collect();
ancestors.pop(); // we've made sure this is relative, so the first item is ""
ancestors.reverse(); // go `[a, a/b, a/b/c]` instead of `[a/b/c, a/b, a]`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can put this into the iterator before the collect source.ancestors().skip(1).rev().collect(). Should be more efficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case we'd need to pop "" from the front (see my other comment) and would need to use remove() with an O(n) runtime, or use a VecDeque. reverse() happens in place. The docs don't specify a runtime, but I'd imagine it's like half the length since we'd be doing n/2 swaps.

But in the bigger picture, I don't expect these will be more than a handful of directories deep. There's no need for extremely deep nesting since output directories do not clash with each other in the store. The n here would small enough to not matter, unless I've badly misunderstood something about how iterators operate!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so I guess you could do source.ancestors().skip(1).rev().skip(1) and avoid the collecting all together.

But that is also harder to understand, so I totally get it if you prefer this setup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure either of those is particularly intuitive. I should probably add a comment on this similar to the one I left above!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, turns out we can't do this: rev is only implemented for double-ended iterators, which ancestors is not. 鈽癸笍

&ancestor.display(),
temp.path().display()
);
std::fs::create_dir(temp.path().join(&ancestor)).with_context(|| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really just want std::fs::create_dir_all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that might simplify the code some, yeah, but at a performance cost:

  • we'd be doing a filesystem read for each item in the ancestor chain for every item in the outputs to check if we need to make the directory
  • we'd have to walk all the output paths again later to make the directories read-only after all the files are situated within them

That approach would save us some memory and a little bit of complexity, but make disk performance much worse. I could still see it being ok though, since ideally we will be avoiding rebuilds (and the subsequent workspace consumptions) as much as possible. The exception would be on the first build, where we'll be rebuilding everything.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd be doing a filesystem read for each item in the ancestor chain

Not exactly. create_dir_all, starts at the child node and works backwards towards the ancestors as necessary. So to create a/b/c/d/e where a/b already exists, it will:

  1. try and fail to create a/b/c/d/e
  2. try and fail to create a/b/c/d
  3. succeed at creating a/b/c
  4. succeed at creating a/b/c/d
  5. succeed at creating a/b/c/d/e

Not sure the cost of failing to create a dir because it's parent doesn't exist, but assuming that is cheap, this should be fast.

we'd have to walk all the output paths again later to make the directories read-only after all the files are situated within them

True, the cost here is probably fs dependent. I would assume not marking anything read only and then later marking everything read only would have low cost. 1, most things should be in fs cache. 2, you would only be traversing metadata that should be stored pretty compactly. Also, I would assume the cost to hash and copy files would overshadow all of this anyway.


Just to clarify, in this scenario, either everything or nothing should exist, correct? Since either we ran before and have all the outputs, or we haven't and need to build all directories. If this is correct, your most efficient option would be to process all ancestors of all files at once and just do a de-duplication of the list of ancestors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly. create_dir_all, starts at the child node and works backwards towards the ancestors as necessary.

Ah, that's on me for assuming that create_dir_all would start from the root. Thanks for clarifying.

Also, I would assume the cost to hash and copy files would overshadow all of this anyway.

Hashing, sure, but we're renaming the files instead of copying them since the original bytes would be reaped anyway when the build workspace was deleted.

Just to clarify, in this scenario, either everything or nothing should exist, correct?

Yeah, that's right.

If this is correct, your most efficient option would be to process all ancestors of all files at once and just do a de-duplication of the list of ancestors.

Let me rephrase this to make sure I understand what you're suggesting:

  1. calculate the hash based on the files
  2. exit early if we already have the hash
  3. do the filesystem work, deduplicating ancestors

I ask because we already are effectively deduplicating ancestors鈥攚e're leaning heavily on the fact that the output directory won't exist in advance, so we don't need to check if some parent directory exists in advance of creating it. I wonder about combining our suggestions, such that the "do the filesystem work" above looks like:

  1. create the temporary collection directory
  2. create all the directories
  3. move all the files, marking them as read-only
  4. mark all the directories as read-only
  5. move the collection directory into the store

(I want to insist on steps 1 and 5 for safety, even though they'll add just a little overhead. It'd be really bad to have a corrupt path in the CAS!)

Anyway, I think this could end up much nicer, both from an efficiency perspective and an ease-of-understanding one. We could reuse the approach for create-move-mark here, or use some other kind of filesystem walking like create_dir_all. (I think the deciding factor maybe should be approachability instead of speed here, although with some reasonable documentation of assumptions I think the current approach could be fairly understandable.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh also, this maybe should be a separate refactor in a new PR! The current approach works OK鈥攖his would get that last little bit of performance out and make it easier to understand, but I don't think it makes sense to block the rest of the outstanding PRs on that work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am fine with either solution, just feel there are ways to simplify the readability here. Anyway, not needed for this PR for sure. Maybe file an issue to track if you want to change it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I think I do want to change it. Might as well apply the guiding principles (understandable, then approachable, then fast) to the code as well as the system's behavior. Thank you for this excellent review!

src/store.rs Show resolved Hide resolved
// that it doesn't get automatically removed when it's dropped. We've
// so far avoided that to avoid leaving temporary directories laying
// around in case of errors.
std::fs::rename(temp.into_path(), &final_location)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a temporary directory that is in /tmp, would it make more sense to put in the store with a prefix or suffix, that way if something goes wrong, you can still look at what was outputted and potentially debug better?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that might make sense. Easy to implement, too!

Base automatically changed from simplify to trunk August 25, 2022 15:28
@BrianHicks BrianHicks merged commit 494927b into trunk Aug 25, 2022
@BrianHicks BrianHicks deleted the caching-store branch August 25, 2022 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants