Deterministic way to pull codebase #4811

Open
ceedubs opened this issue Mar 21, 2024 · 5 comments

@ceedubs
Contributor

ceedubs commented Mar 21, 2024

Tools like Bazel and Nix ensure reproducible builds by constraining IO at build time. One way that Nix enforces this (I assume Bazel too?) is by only allowing builds to perform network activity if the result has a fixed output hash. Unfortunately, a pull from Share does not result in a file with a fixed hash. I suspect that two culprits are timestamps (like in the reflog) and fetches happening in parallel, but for all I know it could be that SQLite is just completely incompatible with deterministic file hashes (unlike a git codebase).

So far in Nix builds I have gotten around this by only saving the result of compile and not the whole codebase. But this isn't an ideal solution for a couple of reasons:

  • It loses all names for definitions, so if you get a runtime failure it prints as gibberish.
  • It prevents the ability to have an intermediate codebase layer that you reuse for different builds (runtime, tests, etc).

Some notes on the properties I care about:

  • It would be fine if the result were only deterministic for an exact version of ucm; it's okay if the hash is different for ucm 0.5.19 vs 0.5.20.
  • Ideally the hash would not change as Share changes.
  • I just need a file or directory with a predictable hash. The pull itself doesn't necessarily need to be deterministic if I have a ucm command to create a deterministic copy of a codebase or something.

Related (but more helpful for Docker than Bazel/Nix): #3892

Side note: it seems a bit ironic that this is hard in Unison, a language premised on code being content-addressed, when it comes for free(ish) in just about any language that uses text files and traditional source control 😬.

@aryairani
Contributor

Yeah this is a bummer. I was surprised to be reminded that git reflog doesn't include timestamps.

I did a basic sqlite test (create a table and add two rows), and that did produce identical results in two trials.
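
A test along those lines can be sketched in Python as follows. This is only an illustration of the idea, not the exact test that was run: the table name, column names, and row values are made up, and it uses Python's built-in `sqlite3` module rather than anything from ucm. It builds the same two-row database twice and compares the SHA-256 of the resulting files.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def build_db(path: Path) -> None:
    """Create a one-table database with two rows, always in the same order.

    The schema and values are hypothetical; they only stand in for
    "create a table and add two rows".
    """
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO t (name) VALUES ('first')")
        conn.execute("INSERT INTO t (name) VALUES ('second')")
        conn.commit()
    finally:
        conn.close()


def file_sha256(path: Path) -> str:
    """Hash the raw bytes of the database file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def two_trials_match() -> bool:
    """Build the database twice and report whether the file hashes agree."""
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "trial1.db"
        second = Path(tmp) / "trial2.db"
        build_db(first)
        build_db(second)
        return file_sha256(first) == file_sha256(second)


if __name__ == "__main__":
    print("identical:", two_trials_match())
```

This only covers the case where the same statements run in the same order with the same SQLite version; it says nothing about what happens when the order of writes varies.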

It would be nice to know if anyone is using reflog timestamps. They seem nice, but I'm not sure I've used them. They also are a culprit in some nondeterministic transcript outputs, which cause CI to fail.

@ceedubs
Contributor Author

ceedubs commented Apr 19, 2024

@aryairani is it really just reflog timestamps? I assumed that if I did a pull or clone it would fetch a bunch of stuff in parallel which would result in different orders of rows in my SQLite tables.
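
The row-order concern can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `term` table, its columns, and the `normalized_copy` helper are hypothetical and do not reflect ucm's real schema or any existing ucm command. The sketch writes the same rows in two different orders (standing in for fetches that complete in different orders), shows that the raw files hash differently, and then shows that re-inserting the rows in sorted order into a fresh database gives matching hashes.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical schema, NOT the real ucm codebase schema.
SCHEMA = "CREATE TABLE term (hash TEXT NOT NULL, body TEXT NOT NULL)"


def write_db(path, rows):
    """Create a fresh database and insert rows in the order given."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(SCHEMA)
        conn.executemany("INSERT INTO term (hash, body) VALUES (?, ?)", rows)
        conn.commit()
    finally:
        conn.close()


def read_rows(path):
    """Return all rows from an existing database."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT hash, body FROM term").fetchall()
    finally:
        conn.close()


def normalized_copy(src, dst):
    """Hypothetical helper: copy src into a new file with rows sorted."""
    write_db(dst, sorted(read_rows(src)))


def sha256_of(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def demo():
    """Return (raw files equal?, normalized copies equal?)."""
    rows = [("#aaa", "x = 1"), ("#bbb", "y = 2"), ("#ccc", "z = 3")]
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        # Same content, different arrival order.
        write_db(tmp / "pull1.db", rows)
        write_db(tmp / "pull2.db", list(reversed(rows)))
        raw_equal = sha256_of(tmp / "pull1.db") == sha256_of(tmp / "pull2.db")

        normalized_copy(tmp / "pull1.db", tmp / "norm1.db")
        normalized_copy(tmp / "pull2.db", tmp / "norm2.db")
        norm_equal = sha256_of(tmp / "norm1.db") == sha256_of(tmp / "norm2.db")
        return raw_equal, norm_equal


if __name__ == "__main__":
    print(demo())
```

A real codebase has many tables, indexes, and metadata such as reflog timestamps, so a real "deterministic copy" command would need to handle much more than this; the sketch only shows why insertion order alone is enough to change the file hash.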

@aryairani
Contributor

@ceedubs I'm not sure about the parallel fetches, but I would guess that you're right.

I think that fetching stuff in parallel may not be that useful, though, and we might consider turning the number of concurrent fetches down to 1 or something, which should then help.

Side note, I just talked to @rlmark who definitely uses the reflog timestamps.
