store outputs in a content-addressed store #33
Conversation
Force-pushed 2a990c8 to 2886da1
```rust
///////////////////////////////////////////////
let mut ancestors: Vec<&Path> = source.ancestors().skip(1).collect();
ancestors.pop(); // we've made sure this is relative, so the last item is ""
ancestors.reverse(); // go `[a, a/b, a/b/c]` instead of `[a/b/c, a/b, a]`
```
I think you can put this into the iterator before the collect: `source.ancestors().skip(1).rev().collect()`. Should be more efficient.
In that case we'd need to pop `""` from the front (see my other comment) and would need to use `remove()` with an O(n) runtime, or use a `VecDeque`. `reverse()` happens in place. The docs don't specify a runtime, but I'd imagine it's like half the length, since we'd be doing n/2 swaps.
But in the bigger picture, I don't expect these will be more than a handful of directories deep. There's no need for extremely deep nesting, since output directories do not clash with each other in the store. The n here would be small enough not to matter, unless I've badly misunderstood something about how iterators operate!
Ah, so I guess you could do `source.ancestors().skip(1).rev().skip(1)` and avoid the collecting altogether. But that is also harder to understand, so I totally get it if you prefer this setup.
I'm not sure either of those is particularly intuitive. I should probably add a comment on this similar to the one I left above!
BTW, turns out we can't do this: `rev` is only implemented for double-ended iterators, which `ancestors` is not.
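For completeness, a small runnable sketch of the collect-then-reverse workaround; the relative path here is made up for illustration:

```rust
use std::path::Path;

fn main() {
    // Hypothetical relative output path, for illustration only.
    let source = Path::new("a/b/c/file.txt");

    // `ancestors()` yields longest-to-shortest: "a/b/c/file.txt",
    // "a/b/c", "a/b", "a", "". It is not a DoubleEndedIterator, so
    // `.rev()` is unavailable; collect and reverse in place instead.
    let mut ancestors: Vec<&Path> = source.ancestors().skip(1).collect();
    ancestors.pop(); // the path is relative, so the last item is ""
    ancestors.reverse(); // shallow-to-deep: [a, a/b, a/b/c]

    assert_eq!(
        ancestors,
        vec![Path::new("a"), Path::new("a/b"), Path::new("a/b/c")]
    );
}
```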
```rust
    &ancestor.display(),
    temp.path().display()
);
std::fs::create_dir(temp.path().join(&ancestor)).with_context(|| {
```
Do you really just want `std::fs::create_dir_all`?
That might simplify the code some, yeah, but at a performance cost:
- we'd be doing a filesystem read for each item in the ancestor chain for every item in the outputs to check if we need to make the directory
- we'd have to walk all the output paths again later to make the directories read-only after all the files are situated within them
That approach would save us some memory and a little bit of complexity, but make disk performance much worse. I could still see it being ok though, since ideally we will be avoiding rebuilds (and the subsequent workspace consumptions) as much as possible. The exception would be on the first build, where we'll be rebuilding everything.
> we'd be doing a filesystem read for each item in the ancestor chain
Not exactly: `create_dir_all` starts at the child node and works backwards towards the ancestors as necessary. So to create `a/b/c/d/e` where `a/b` already exists, it will:
- try and fail to create `a/b/c/d/e`
- try and fail to create `a/b/c/d`
- succeed at creating `a/b/c`
- succeed at creating `a/b/c/d`
- succeed at creating `a/b/c/d/e`

Not sure of the cost of failing to create a dir because its parent doesn't exist, but assuming that is cheap, this should be fast.
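A tiny std-only sketch of that behavior; the scratch location under the OS temp dir is just for illustration:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Scratch location under the OS temp dir, for illustration only.
    let base = std::env::temp_dir().join(format!("cda-demo-{}", std::process::id()));
    fs::create_dir_all(base.join("a/b"))?;

    // One call creates the whole missing chain c/d/e under a/b: it
    // tries the full path first, recurses to create missing parents,
    // then retries the children.
    fs::create_dir_all(base.join("a/b/c/d/e"))?;
    assert!(base.join("a/b/c/d/e").is_dir());

    // Re-running on an already-existing path is not an error.
    fs::create_dir_all(base.join("a/b/c/d/e"))?;

    fs::remove_dir_all(&base) // clean up the scratch tree
}
```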
> we'd have to walk all the output paths again later to make the directories read-only after all the files are situated within them
True; the cost here is probably fs-dependent. I would assume not marking anything read-only and then later marking everything read-only would have low cost: (1) most things should be in fs cache; (2) you would only be traversing metadata that should be stored pretty compactly. Also, I would assume the cost to hash and copy files would overshadow all of this anyway.
Just to clarify, in this scenario, either everything or nothing should exist, correct? Since either we ran before and have all the outputs, or we haven't and need to build all directories. If this is correct, your most efficient option would be to process all ancestors of all files at once and just do a de-duplication of the list of ancestors.
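If we went that way, here is a hedged sketch of the dedup-then-create idea; the output paths are made up, and the `BTreeSet` is one possible choice because `Path`'s component-wise ordering sorts parents before children:

```rust
use std::collections::BTreeSet;
use std::path::{Path, PathBuf};

fn main() {
    // Hypothetical relative output paths, for illustration only.
    let outputs = ["a/b/c/one.txt", "a/b/d/two.txt", "a/e/three.txt"];

    // Gather every ancestor directory of every output exactly once.
    // BTreeSet deduplicates, and Path's component-wise ordering puts
    // parents before children, so iterating in order means each
    // directory's parent already exists by the time we create it.
    let mut dirs: BTreeSet<PathBuf> = BTreeSet::new();
    for output in &outputs {
        for ancestor in Path::new(output).ancestors().skip(1) {
            if !ancestor.as_os_str().is_empty() {
                dirs.insert(ancestor.to_path_buf());
            }
        }
    }

    let in_order: Vec<String> = dirs.iter().map(|d| d.display().to_string()).collect();
    assert_eq!(in_order, ["a", "a/b", "a/b/c", "a/b/d", "a/e"]);
}
```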
> Not exactly: `create_dir_all` starts at the child node and works backwards towards the ancestors as necessary.

Ah, that's on me for assuming that `create_dir_all` would start from the root. Thanks for clarifying.
> Also, I would assume the cost to hash and copy files would overshadow all of this anyway.
Hashing, sure, but we're renaming the files instead of copying them since the original bytes would be reaped anyway when the build workspace was deleted.
> Just to clarify, in this scenario, either everything or nothing should exist, correct?
Yeah, that's right.
> If this is correct, your most efficient option would be to process all ancestors of all files at once and just do a de-duplication of the list of ancestors.
Let me rephrase this to make sure I understand what you're suggesting:
- calculate the hash based on the files
- exit early if we already have the hash
- do the filesystem work, deduplicating ancestors
I ask because we already are effectively deduplicating ancestors: we're leaning heavily on the fact that the output directory won't exist in advance, so we don't need to check whether some parent directory exists before creating it. I wonder about combining our suggestions, such that the "do the filesystem work" step above looks like:
- create the temporary collection directory
- create all the directories
- move all the files, marking them as read-only
- mark all the directories as read-only
- move the collection directory into the store
(I want to insist on steps 1 and 5 for safety, even though they'll add just a little overhead. It'd be really bad to have a corrupt path in the CAS!)
Anyway, I think this could end up much nicer, both from an efficiency perspective and an ease-of-understanding one. We could reuse the approach for create-move-mark here, or use some other kind of filesystem walking like `create_dir_all`. (I think the deciding factor maybe should be approachability instead of speed here, although with some reasonable documentation of assumptions I think the current approach could be fairly understandable.)
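A rough sketch of those five steps, assuming everything lives under a throwaway scratch root; the directory names and the final hash (`abc123`) are all hypothetical:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // All locations and the final hash here are hypothetical.
    let base = std::env::temp_dir().join(format!("cas-demo-{}", std::process::id()));
    let workspace = base.join("workspace");
    let store = base.join("store");
    fs::create_dir_all(workspace.join("out/sub"))?;
    fs::create_dir_all(&store)?;
    fs::write(workspace.join("out/sub/result.txt"), b"hello")?;

    // 1. create the temporary collection directory (inside the store,
    //    so the final rename never crosses a filesystem boundary)
    let temp = store.join(".incoming-demo");
    fs::create_dir(&temp)?;

    // 2. create all the directories up front
    fs::create_dir(temp.join("sub"))?;

    // 3. move (rename, not copy) the files, marking them read-only
    let file = temp.join("sub/result.txt");
    fs::rename(workspace.join("out/sub/result.txt"), &file)?;
    let mut perms = fs::metadata(&file)?.permissions();
    perms.set_readonly(true);
    fs::set_permissions(&file, perms)?;

    // 4. mark the directories read-only once the files are in place
    let mut dir_perms = fs::metadata(temp.join("sub"))?.permissions();
    dir_perms.set_readonly(true);
    fs::set_permissions(temp.join("sub"), dir_perms)?;

    // 5. move the collection directory into the store under its hash
    let final_location = store.join("abc123");
    fs::rename(&temp, &final_location)?;

    assert_eq!(fs::read(final_location.join("sub/result.txt"))?, b"hello");
    Ok(())
}
```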
Oh also, this maybe should be a separate refactor in a new PR! The current approach works OK; this would get that last little bit of performance out and make it easier to understand, but I don't think it makes sense to block the rest of the outstanding PRs on that work.
Yeah, I am fine with either solution, just feel there are ways to simplify the readability here. Anyway, not needed for this PR for sure. Maybe file an issue to track if you want to change it.
Yeah, I think I do want to change it. Might as well apply the guiding principles (understandable, then approachable, then fast) to the code as well as the system's behavior. Thank you for this excellent review!
```rust
// that it doesn't get automatically removed when it's dropped. We've
// so far avoided that to avoid leaving temporary directories laying
// around in case of errors.
std::fs::rename(temp.into_path(), &final_location)
```
Instead of having a temporary directory that is in `/tmp`, would it make more sense to put it in the store with a prefix or suffix? That way, if something goes wrong, you can still look at what was outputted and potentially debug better.
Yeah, that might make sense. Easy to implement, too!
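A minimal sketch of that idea, with a made-up store root and hash: keeping the scratch directory inside the store means the final rename stays on one filesystem, and a failed build leaves its partial output behind under a recognizable prefix:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Hypothetical store root, for illustration only.
    let store = std::env::temp_dir().join(format!("store-demo-{}", std::process::id()));
    fs::create_dir_all(&store)?;

    // The scratch directory lives inside the store under a recognizable
    // prefix: if a build fails, its partial output stays here where it
    // can be inspected, instead of vanishing from /tmp.
    let scratch = store.join(".tmp-build-demo");
    fs::create_dir(&scratch)?;
    fs::write(scratch.join("out.txt"), b"partial output")?;

    // On success, a same-filesystem rename publishes it atomically.
    let final_location = store.join("abc123"); // hypothetical hash
    fs::rename(&scratch, &final_location)?;
    assert!(final_location.join("out.txt").exists());

    fs::remove_dir_all(&store) // clean up the demo tree
}
```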
Force-pushed 2886da1 to 801289b
That lets us avoid doing work.
I know this code will need to be changed for this to work in parallel, but we need this logic anyway so I think that's OK for the moment.
Based on #32; merge that first!