Reduce the number of bytes hashed by IchHasher. #37427
Conversation
r? @eddyb (rust_highfive has picked a reviewer for you, use r? to override)
The first optimization looks good; I'm happy to approve that.
I've changed the second commit to leb128-encode integers.
Do you have numbers on the performance impact of that?
leb128 made very little difference to the performance. The number of bytes hashed increased slightly (a few percent) compared to the "just truncate" version, mostly because 255 is a fairly common value and it leb128-encodes to two bytes; I think it's used by Hasher to indicate the end of some types, or something like that? The speed difference was negligible -- leb128 encoding is much cheaper than blake2b hashing :)
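For reference, unsigned LEB128 emits 7 payload bits per byte with the high bit as a continuation flag, which is why 255 spills into two bytes while most small values fit in one. A minimal encoder sketch (illustrative, not rustc's actual implementation):

```
// Minimal unsigned LEB128 encoder sketch: 7 payload bits per byte, high
// bit set while more bytes follow. Illustrative, not rustc's code.
fn write_unsigned_leb128(buf: &mut Vec<u8>, mut value: u64) {
    loop {
        let mut byte = (value & 0x7f) as u8; // low 7 bits of the value
        value >>= 7;
        if value != 0 {
            byte |= 0x80; // continuation bit: more bytes follow
        }
        buf.push(byte);
        if value == 0 {
            break;
        }
    }
}

fn main() {
    let mut buf = Vec::new();
    write_unsigned_leb128(&mut buf, 255);
    assert_eq!(buf, [0xff, 0x01]); // 255 takes two bytes, as noted above
    buf.clear();
    write_unsigned_leb128(&mut buf, 3);
    assert_eq!(buf, [0x03]); // small values take a single byte
}
```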
```
@@ -50,23 +73,28 @@ impl Hasher for IchHasher {
    }

    #[inline]
    fn write_u8(&mut self, i: u8) {
        self.write_uleb128(i as u64);
```
michaelwoerister (Contributor) • Oct 29, 2016
u8 doesn't need to use leb128. There's nothing to be gained here.
nnethercote (Author, Contributor) • Oct 31, 2016
I think it does need to use leb128. If you have a u16 value like 256 it gets encoded as [128, 2]. That would be indistinguishable from two u8 values 128 and 2.
arielb1 (Contributor) • Oct 31, 2016
But the encoding is typeful - a u8 can never be in the same spot as a u16. A sufficient criterion is that the encoding for each type is prefix-free.
michaelwoerister (Contributor) • Oct 31, 2016
For a hash like this to be correct, it has to be unambiguous. My personal benchmark for this, the one I find most intuitive, is: if we were serializing this data instead of hashing it, would we be able to correctly deserialize it again? In this case yes, because we always know whether we are reading a u8 or a u16 or something else next. A problem only arises if we have a bunch of, say, u16s in a row but they are encoded with an ambiguous variable-length encoding.
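To see the prefix-free criterion concretely: a LEB128 decoder always knows where one value ends, so a run of same-typed values round-trips unambiguously. A hedged decoder sketch (illustrative, not the rustc implementation):

```
// LEB128 decoder sketch: a clear high bit marks the final byte of a value,
// so back-to-back same-typed values decode unambiguously. Illustrative only.
fn read_unsigned_leb128(bytes: &[u8], pos: &mut usize) -> u64 {
    let mut result = 0u64;
    let mut shift = 0;
    loop {
        let byte = bytes[*pos];
        *pos += 1;
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return result; // no continuation bit: this value is complete
        }
        shift += 7;
    }
}

fn main() {
    // Two u16 values, 256 and 2, encoded back to back: [128, 2], then [2].
    let bytes = [0x80, 0x02, 0x02];
    let mut pos = 0;
    assert_eq!(read_unsigned_leb128(&bytes, &mut pos), 256);
    assert_eq!(read_unsigned_leb128(&bytes, &mut pos), 2);
}
```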
It's a bit unfortunate that we have to duplicate the leb128 implementation. Can you factor things in a way so that we can have a unit test for it (e.g. by making …)?
This significantly reduces the number of bytes hashed by IchHasher.
I redid the second commit so it uses the leb128 encoding from libserialize.
Looks good to me! Happy to merge it after you renamed the bytes field.
```
    #[inline]
    fn write_usize(&mut self, i: usize) {
        // always hash as u64, so we don't depend on the size of `usize`
```
michaelwoerister (Contributor) • Nov 2, 2016
Since we are encoding as leb128 before hashing, it should not make a difference whether the number was a usize or a u64 before. The leb128 representation of both will be identical.
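In other words, once every integer funnels through the same leb128 path, the original width leaves no trace in the byte stream. A hypothetical sketch of that forwarding pattern (names illustrative):

```
// Hypothetical forwarding sketch: write_usize widens to u64 and both types
// share one leb128 path, so equal values produce identical byte streams.
struct Sketch {
    bytes: Vec<u8>,
}

impl Sketch {
    fn write_uleb128(&mut self, mut value: u64) {
        loop {
            let mut byte = (value & 0x7f) as u8;
            value >>= 7;
            if value != 0 {
                byte |= 0x80;
            }
            self.bytes.push(byte);
            if value == 0 {
                break;
            }
        }
    }

    fn write_u64(&mut self, i: u64) {
        self.write_uleb128(i);
    }

    fn write_usize(&mut self, i: usize) {
        // leb128 only emits significant bits, so widening changes nothing
        self.write_uleb128(i as u64);
    }
}

fn main() {
    let mut a = Sketch { bytes: Vec::new() };
    let mut b = Sketch { bytes: Vec::new() };
    a.write_usize(300);
    b.write_u64(300);
    assert_eq!(a.bytes, b.bytes); // identical streams for identical values
}
```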
```
#[derive(Debug)]
pub struct IchHasher {
    state: ArchIndependentHasher<Blake2bHasher>,
    bytes: Vec<u8>,
```
michaelwoerister (Contributor) • Nov 2, 2016
Can you rename this to something like leb128_helper? Otherwise one might think that it contains the bytes that were hashed so far.
This significantly reduces the number of bytes hashed by IchHasher.
Updated to address comments.
@bors r+ Thanks, @nnethercote!
Reduce the number of bytes hashed by IchHasher.
IchHasher uses blake2b hashing, which is expensive, so the fewer bytes hashed the better. There are two big ways to reduce the number of bytes hashed.
- Filenames in spans account for ~66% of all bytes (for builds with debuginfo). The vast majority of spans have the same filename for the start of the span and the end of the span, so hashing the filename just once in those cases is a big win (see the sketch below).
- u32, u64, and usize values account for ~25--33% of all bytes (for builds with debuginfo). The vast majority of these are small, i.e. fit in a u8, so shrinking them down before hashing is also a big win.

This PR implements these two optimizations. I'm certain the first one is safe. I'm about 90% sure that the second one is safe.
Here are measurements of the number of bytes hashed when doing debuginfo-enabled builds of stdlib and rustc-benchmarks/syntex-0.42.2-incr-clean.
```
               stdlib       syntex-incr
               ------       -----------
original       156,781,386  255,095,596
half-SawSpan   106,744,403  176,345,419
short-ints      45,890,534  118,014,227
no-SawSpan[*]    6,831,874   45,875,714

[*] don't hash the SawSpan at all. Not part of this PR, just implemented
    for comparison's sake.
```
For debug builds of syntex-0.42.2-incr-clean, the two changes give a 1--2% speed-up.
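The first optimization can be sketched roughly as below. A tag byte keeps the encoding unambiguous, in the spirit of the review discussion above; the types and names are illustrative, not rustc's actual span-hashing code.

```
// Rough sketch of the span-filename optimization: hash the filename once
// when both ends of a span are in the same file, with a tag byte so the
// one-filename and two-filename cases stay distinguishable. Illustrative
// types and names, not rustc's actual code.
struct Loc<'a> {
    file: &'a str,
    line: u64,
    col: u64,
}

fn hash_str(out: &mut Vec<u8>, s: &str) {
    // Length-prefix strings so consecutive strings stay unambiguous.
    out.extend_from_slice(&(s.len() as u64).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

fn hash_span(out: &mut Vec<u8>, lo: &Loc, hi: &Loc) {
    if lo.file == hi.file {
        out.push(0); // tag: one filename covers both ends (the common case)
        hash_str(out, lo.file);
    } else {
        out.push(1); // tag: the span crosses files, so hash both names
        hash_str(out, lo.file);
        hash_str(out, hi.file);
    }
    // Line/column values would be leb128-encoded per the second
    // optimization; fixed-width writes keep this sketch short.
    for v in [lo.line, lo.col, hi.line, hi.col] {
        out.extend_from_slice(&v.to_le_bytes());
    }
}

fn main() {
    let lo = Loc { file: "lib.rs", line: 1, col: 0 };
    let hi = Loc { file: "lib.rs", line: 3, col: 10 };
    let mut out = Vec::new();
    hash_span(&mut out, &lo, &hi);
    // One tag byte + one length-prefixed filename + four u64s.
    assert_eq!(out.len(), 1 + (8 + 6) + 4 * 8);
}
```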
@bors retry
@bors: retry
@bors: retry