Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a deterministic number of digits in rustc_tools_util commit hashes #13222

Merged
merged 1 commit into from
Aug 5, 2024

Conversation

dtolnay
Copy link
Member

@dtolnay dtolnay commented Aug 5, 2024

Using git rev-parse --short in rustc_tools_util causes nondeterministic compilation of projects that use setup_version_info! and get_version_info! when built from the exact same source code and git commit. The number of digits printed by --short is sensitive to how many other branches and tags in the repository have been fetched so far, what other commits have been worked on in other branches, how recently you had run git gc, platform-specific variation in git's default configuration, and platform differences in the sequence of steps performed by the release pipeline. Someone can compile a tool from a particular commit, switch branches to work on a different commit (or simply do a git fetch), go back to the first commit and be unable to reproduce the binary that was built from it previously.

Currently, variation in short commit hashes causes Clippy version strings to be out of sync between different targets. On x86_64-unknown-linux-gnu:

$ clippy-driver +1.80.0 --version
clippy 0.1.80 (0514789 2024-07-21)

Whereas on aarch64-apple-darwin:

$ clippy-driver +1.80.0 --version
clippy 0.1.80 (05147895 2024-07-21)

changelog: none

@rustbot
Copy link
Collaborator

rustbot commented Aug 5, 2024

r? @Alexendoo

rustbot has assigned @Alexendoo.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties label Aug 5, 2024
@@ -104,10 +104,11 @@ impl std::fmt::Debug for VersionInfo {
#[must_use]
pub fn get_commit_hash() -> Option<String> {
let output = std::process::Command::new("git")
.args(["rev-parse", "--short", "HEAD"])
.args(["rev-parse", "HEAD"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rustc uses --short=9 in bootstrap. But that would result in at least 9 characters. IIUC it will only use more characters, if 9 chars would not lead to a unique SHA. I think this is desirable.

https://github.com/rust-lang/rust/blob/2b78d920964e1d70927bcd208529bda0e11120d0/src/bootstrap/src/utils/channel.rs#L69-L71

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am planning to remove --short from there too since, as you allude to, "at least 9" is still a source of nondeterminism that is unjustified.

"At least 9" is not actually better than "exactly 10" in practice. In fact, "at least 9" gives rise to 8× more collisions than "exactly 10", so it's worse from that perspective. For every new commit added to the repo, the probability that it makes a previously-unambiguous 9-digit hash become ambiguous is 16× higher than the probability that it collides with any previous 10-digit hash (offset by a factor of 2 because the second situation results in 2 ambiguous hashes, not just one).

In general, --short=N is not meaningful to use when the repository history is going to be mutated between when the short hash is generated until when a user makes use of the short hash.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, --short=N is not meaningful to use when the repository history is going to be mutated between when the short hash is generated until when a user makes use of the short hash.

That is a really good argument. I didn't think this through in that "direction".

I just ran this script I found on SO. The Clippy repo today needs 9 digits to produce a unique hash for all existing commits:

$ git rev-list --abbrev=4 --abbrev-commit --all | ( while read -r line; do echo ${#line}; done; ) | sort -n | uniq -c
   1735 4
  17526 5
   3084 6
    203 7
     19 8
      1 9

The Rust repo itself already requires 10 digits:

$ git rev-list --abbrev=4 --abbrev-commit --all | ( while read -r line; do echo ${#line}; done; ) | sort -n | uniq -c
  20978 5
 206615 6
  36783 7
   2438 8
    151 9
      8 10

So to be on the save side and future proof, lets go with 10.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like rustc should use 12 digits... 8 commits with the same first 9 characters means we're close to 50% on two commits with the same first 10 characters? Or maybe that's off by a factor of 2, but not by an order of magnitude.

Copy link
Member

@flip1995 flip1995 Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't mean that 8 commits share the same 9 characters, just that there are 8 pairs of refs that have the same first 9 characters. The refs don't have to be 2 commits with the same first 9 characters. git also takes blobs and trees into account:

$ git rev-list --abbrev=4 --abbrev-commit --all | ( while read -r line; do if [ ${#line} -eq 10 ]; then echo ${line::-1}; fi; done; ) | xargs -I {} git switch -d {}
error: short object ID f2dc18d0a is ambiguous
hint: The candidates are:
hint:   f2dc18d0a118 commit 2023-12-01 - update addr docs
hint:   f2dc18d0a09c tree
fatal: invalid reference: f2dc18d0a
error: short object ID e76b3f3b5 is ambiguous
hint: The candidates are:
hint:   e76b3f3b5b1b commit 2022-04-09 - Rename `unsigned_offset_from` to `sub_ptr`
hint:   e76b3f3b5d22 blob
fatal: invalid reference: e76b3f3b5
error: short object ID 2c4337a70 is ambiguous
hint: The candidates are:
hint:   2c4337a70a4d commit 2021-02-10 - Comments :3
hint:   2c4337a7029e blob
fatal: invalid reference: 2c4337a70
error: short object ID 01c625617 is ambiguous
hint: The candidates are:
hint:   01c6256178fb commit 2020-07-02 - Change debuginfo to default to 1 if `debug = true` is set
hint:   01c6256173f0 blob
fatal: invalid reference: 01c625617
error: short object ID 576294237 is ambiguous
hint: The candidates are:
hint:   576294237b10 commit 2018-01-29 - fix typos
hint:   576294237f84 tree
fatal: invalid reference: 576294237
error: short object ID 0b2e6f808 is ambiguous
hint: The candidates are:
hint:   0b2e6f808769 commit 2014-09-07 - Relicense shootout-reverse-complement.rs to the shootout license.
hint:   0b2e6f808075 tree
fatal: invalid reference: 0b2e6f808
error: short object ID 2d7b690b2 is ambiguous
hint: The candidates are:
hint:   2d7b690b2eaa commit 2013-12-09 - Register new snapshots
hint:   2d7b690b2ab9 tree
fatal: invalid reference: 2d7b690b2
error: short object ID 7ee90cc7b is ambiguous
hint: The candidates are:
hint:   7ee90cc7be7b commit 2012-04-24 - syntax: Rename is_word to is_keyword, etc.
hint:   7ee90cc7b29f tree
fatal: invalid reference: 7ee90cc7b

In fact, no commits share the first 9 chars. I don't know if this is for the better or worse.

How did you come up with the 50% number? I couldn't follow the logic.

@flip1995
Copy link
Member

flip1995 commented Aug 5, 2024

@bors r+

Thanks!

@bors
Copy link
Collaborator

bors commented Aug 5, 2024

📌 Commit 9f6536c has been approved by flip1995

It is now in the queue for this repository.

@bors
Copy link
Collaborator

bors commented Aug 5, 2024

⌛ Testing commit 9f6536c with merge 5f6d07b...

@bors
Copy link
Collaborator

bors commented Aug 5, 2024

☀️ Test successful - checks-action_dev_test, checks-action_remark_test, checks-action_test
Approved by: flip1995
Pushing 5f6d07b to master...

@bors bors merged commit 5f6d07b into rust-lang:master Aug 5, 2024
8 checks passed
@dtolnay dtolnay deleted the deterministic branch August 5, 2024 19:46
@RalfJung
Copy link
Member

RalfJung commented Oct 8, 2024 via email

@flip1995
Copy link
Member

flip1995 commented Oct 8, 2024

I see. But it's not 8 commits with the same 9 char prefix, but 8 pairs (of 2 and not more) of refs with the same 9 char prefix. So there are 14/16=87.5% left for all of them. But I don't think we can put a simple % number on it. I also don't know how we could calculate the probability of a 10 char clash in the near future. For this I clearly lack knowledge in statistics and security/hashing.

@RalfJung
Copy link
Member

RalfJung commented Oct 8, 2024

it's not 8 commits with the same 9 char prefix, but 8 pairs (of 2 and not more) of refs with the same 9 char prefix

Right, I misinterpreted the data. That sounds to me like we are fairly close to getting a first pair with the same 9 char prefix.

According to this, the approximate expected number of git commits one needs to have to see a 9-char collision is 5.4 * 10^7. For a 10-char collision it's 1.3 * 10^8. I currently have 272371 commits in my checkout. That still seems waay off though, in fact even an 8-char collision is unexpected at this number (expected number there is 2 * 10^7), so I am probably doing something wrong.

The refs don't have to be 2 commits with the same first 9 characters. git also takes blobs and trees into account:

Probably that is related to the mistake in my analysis. I don't know how to count all git objects (including the packed ones) though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants