-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a deterministic number of digits in rustc_tools_util commit hashes #13222
Conversation
r? @Alexendoo rustbot has assigned @Alexendoo. Use |
@@ -104,10 +104,11 @@ impl std::fmt::Debug for VersionInfo { | |||
#[must_use] | |||
pub fn get_commit_hash() -> Option<String> { | |||
let output = std::process::Command::new("git") | |||
.args(["rev-parse", "--short", "HEAD"]) | |||
.args(["rev-parse", "HEAD"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rustc
uses --short=9
in bootstrap
. But that would result in at least 9 characters. IIUC it will only use more characters, if 9 chars would not lead to a unique SHA. I think this is desirable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am planning to remove --short
from there too since, as you allude to, "at least 9" is still a source of nondeterminism that is unjustified.
"At least 9" is not actually better than "exactly 10" in practice. In fact, "at least 9" gives rise to 8× more collisions than "exactly 10", so it's worse from that perspective. For every new commit added to the repo, the probability that it makes a previously-unambiguous 9-digit hash become ambiguous is 16× higher than the probability that it collides with any previous 10-digit hash (offset by a factor of 2 because the second situation results in 2 ambiguous hashes, not just one).
In general, --short=N
is not meaningful to use when the repository history is going to be mutated between when the short hash is generated until when a user makes use of the short hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general,
--short=N
is not meaningful to use when the repository history is going to be mutated between when the short hash is generated until when a user makes use of the short hash.
That is a really good argument. I didn't think this through in that "direction".
I just ran this script I found on SO. The Clippy repo today needs 9 digits to produce a unique hash for all existing commits:
$ git rev-list --abbrev=4 --abbrev-commit --all | ( while read -r line; do echo ${#line}; done; ) | sort -n | uniq -c
1735 4
17526 5
3084 6
203 7
19 8
1 9
The Rust repo itself already requires 10 digits:
$ git rev-list --abbrev=4 --abbrev-commit --all | ( while read -r line; do echo ${#line}; done; ) | sort -n | uniq -c
20978 5
206615 6
36783 7
2438 8
151 9
8 10
So to be on the save side and future proof, lets go with 10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like rustc should use 12 digits... 8 commits with the same first 9 characters means we're close to 50% on two commits with the same first 10 characters? Or maybe that's off by a factor of 2, but not by an order of magnitude.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't mean that 8 commits share the same 9 characters, just that there are 8 pairs of refs that have the same first 9 characters. The refs don't have to be 2 commits with the same first 9 characters. git
also takes blob
s and tree
s into account:
$ git rev-list --abbrev=4 --abbrev-commit --all | ( while read -r line; do if [ ${#line} -eq 10 ]; then echo ${line::-1}; fi; done; ) | xargs -I {} git switch -d {}
error: short object ID f2dc18d0a is ambiguous
hint: The candidates are:
hint: f2dc18d0a118 commit 2023-12-01 - update addr docs
hint: f2dc18d0a09c tree
fatal: invalid reference: f2dc18d0a
error: short object ID e76b3f3b5 is ambiguous
hint: The candidates are:
hint: e76b3f3b5b1b commit 2022-04-09 - Rename `unsigned_offset_from` to `sub_ptr`
hint: e76b3f3b5d22 blob
fatal: invalid reference: e76b3f3b5
error: short object ID 2c4337a70 is ambiguous
hint: The candidates are:
hint: 2c4337a70a4d commit 2021-02-10 - Comments :3
hint: 2c4337a7029e blob
fatal: invalid reference: 2c4337a70
error: short object ID 01c625617 is ambiguous
hint: The candidates are:
hint: 01c6256178fb commit 2020-07-02 - Change debuginfo to default to 1 if `debug = true` is set
hint: 01c6256173f0 blob
fatal: invalid reference: 01c625617
error: short object ID 576294237 is ambiguous
hint: The candidates are:
hint: 576294237b10 commit 2018-01-29 - fix typos
hint: 576294237f84 tree
fatal: invalid reference: 576294237
error: short object ID 0b2e6f808 is ambiguous
hint: The candidates are:
hint: 0b2e6f808769 commit 2014-09-07 - Relicense shootout-reverse-complement.rs to the shootout license.
hint: 0b2e6f808075 tree
fatal: invalid reference: 0b2e6f808
error: short object ID 2d7b690b2 is ambiguous
hint: The candidates are:
hint: 2d7b690b2eaa commit 2013-12-09 - Register new snapshots
hint: 2d7b690b2ab9 tree
fatal: invalid reference: 2d7b690b2
error: short object ID 7ee90cc7b is ambiguous
hint: The candidates are:
hint: 7ee90cc7be7b commit 2012-04-24 - syntax: Rename is_word to is_keyword, etc.
hint: 7ee90cc7b29f tree
fatal: invalid reference: 7ee90cc7b
In fact, no commits share the first 9 chars. I don't know if this is for the better or worse.
How did you come up with the 50% number? I couldn't follow the logic.
@bors r+ Thanks! |
☀️ Test successful - checks-action_dev_test, checks-action_remark_test, checks-action_test |
I guessed 50% because 8 is half of 16, the number of options for a hex digit. It's a rough estimate.
|
I see. But it's not 8 commits with the same 9 char prefix, but 8 pairs (of 2 and not more) of refs with the same 9 char prefix. So there are 14/16=87.5% left for all of them. But I don't think we can put a simple |
Right, I misinterpreted the data. That sounds to me like we are fairly close to getting a first pair with the same 9 char prefix. According to this, the approximate expected number of git commits one needs to have to see a 9-char collision is 5.4 * 10^7. For a 10-char collision it's 1.3 * 10^8. I currently have 272371 commits in my checkout. That still seems waay off though, in fact even an 8-char collision is unexpected at this number (expected number there is 2 * 10^7), so I am probably doing something wrong.
Probably that is related to the mistake in my analysis. I don't know how to count all git objects (including the packed ones) though. |
Using
git rev-parse --short
in rustc_tools_util causes nondeterministic compilation of projects that usesetup_version_info!
andget_version_info!
when built from the exact same source code and git commit. The number of digits printed by--short
is sensitive to how many other branches and tags in the repository have been fetched so far, what other commits have been worked on in other branches, how recently you had rungit gc
, platform-specific variation in git's default configuration, and platform differences in the sequence of steps performed by the release pipeline. Someone can compile a tool from a particular commit, switch branches to work on a different commit (or simply do a git fetch), go back to the first commit and be unable to reproduce the binary that was built from it previously.Currently, variation in short commit hashes causes Clippy version strings to be out of sync between different targets. On x86_64-unknown-linux-gnu:
Whereas on aarch64-apple-darwin:
changelog: none