Reduce the number of bytes hashed by IchHasher. #37427

Merged
merged 2 commits into rust-lang:master from nnethercote:opt-IchHasher on Nov 5, 2016

Conversation

7 participants
@nnethercote
Contributor

nnethercote commented Oct 27, 2016

IchHasher uses blake2b hashing, which is expensive, so the fewer bytes hashed
the better. There are two big ways to reduce the number of bytes hashed.

  • Filenames in spans account for ~66% of all bytes (for builds with debuginfo).
    The vast majority of spans have the same filename for the start of the span
    and the end of the span, so hashing the filename just once in those cases is
    a big win.
  • u32 and u64 and usize values account for ~25%--33% of all bytes (for builds
    with debuginfo). The vast majority of these are small, i.e. fit in a u8, so
    shrinking them down before hashing is also a big win.

This PR implements these two optimizations. I'm certain the first one is safe.
I'm about 90% sure that the second one is safe.

Here are measurements of the number of bytes hashed when doing
debuginfo-enabled builds of stdlib and
rustc-benchmarks/syntex-0.42.2-incr-clean.

                    stdlib   syntex-incr
                    ------   -----------
original       156,781,386   255,095,596
half-SawSpan   106,744,403   176,345,419
short-ints      45,890,534   118,014,227
no-SawSpan[*]    6,831,874    45,875,714

[*] don't hash the SawSpan at all. Not part of this PR, just implemented for
    comparison's sake. 

For debug builds of syntex-0.42.2-incr-clean, the two changes give a 1--2%
speed-up.

@rust-highfive

Collaborator

rust-highfive commented Oct 27, 2016

r? @eddyb

(rust_highfive has picked a reviewer for you, use r? to override)

@eddyb

Member

eddyb commented Oct 27, 2016

@michaelwoerister

@michaelwoerister

Contributor

michaelwoerister commented Oct 27, 2016

The first optimization looks good; I'm happy to approve that.
I don't think the second one is safe, e.g. &[0x1234, 0x56] will have the same hash as &[0x12, 0x3456], which we don't want here. You could try encoding integers with leb128 (there's an implementation somewhere in the compiler) and see if that is faster than hashing them directly.

@nnethercote

Contributor

nnethercote commented Oct 28, 2016

I've changed the second commit to leb128-encode integers.

@michaelwoerister

Contributor

michaelwoerister commented Oct 28, 2016

I've changed the second commit to leb128-encode integers

Do you have numbers on the performance impact of that?

@nnethercote

Contributor

nnethercote commented Oct 28, 2016

leb128 made very little difference to the performance. The number of bytes hashed increased slightly (a few percent) compared to the "just truncate" version, mostly because 255 is a fairly common value and it leb128-encodes to two bytes; I think it's used by Hasher to indicate the end of some types, or something like that?

The speed difference was negligible -- leb128 encoding is much cheaper than blake2b hashing :)

@@ -50,23 +73,28 @@ impl Hasher for IchHasher {
}
+ #[inline]
+ fn write_u8(&mut self, i: u8) {
+ self.write_uleb128(i as u64);

@michaelwoerister

michaelwoerister Oct 29, 2016

Contributor

u8 doesn't need to use leb128. There's nothing to be gained here.

@nnethercote

nnethercote Oct 31, 2016

Contributor

I think it does need to use leb128. If you have a u16 value like 256 it gets encoded as [128,2]. That would be indistinguishable from two u8 values 128 and 2.

@arielb1

arielb1 Oct 31, 2016

Contributor

But encoding is typeful - a u8 can never be in the same spot as a u16. A sufficient criterion is that the encoding for each type is prefix-free.

@michaelwoerister

michaelwoerister Oct 31, 2016

Contributor

For a hash like this to be correct, it has to be unambiguous. My personal benchmark for this, the one I find most intuitive, is: If we were serializing this data instead of hashing it, would we be able to correctly deserialize it again? In this case yes, because we always know whether we are reading a u8 or a u16 or something else next. A problem only arises if we have a bunch of, say, u16s in a row but they are encoded with an ambiguous variable length encoding.
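The "could we deserialize it again?" benchmark can be made concrete. Because ULEB128 is self-delimiting, a reader always knows where one value ends and the next begins. A sketch of the matching decoder (illustrative, not compiler code):

```rust
// ULEB128 decoder: returns the decoded value and how many bytes it
// consumed. Each byte's high bit says whether more bytes follow, so
// consecutive values in one buffer can be read back unambiguously,
// unlike values that were merely truncated before hashing.
fn read_uleb128(bytes: &[u8]) -> (u64, usize) {
    let mut value = 0u64;
    let mut shift = 0u32;
    let mut pos = 0usize;
    loop {
        let byte = bytes[pos];
        pos += 1;
        value |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            return (value, pos);
        }
        shift += 7;
    }
}
```

For example, the u16 value 256 encodes as [0x80, 0x02] and decodes back to exactly 256; a run of leb128-encoded u16s can be split back into its elements, which is the unambiguity property the hash needs.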

@michaelwoerister

Contributor

michaelwoerister commented Oct 29, 2016

It's a bit unfortunate that we have to duplicate the leb128 implementation. Can you factor things in a way so that we can have a unit test for it (e.g. by making write_uleb128 a free standing function that writes its output to a &mut [u8])? These hashes are critical for correctness, so I'd like them to be well tested.
Alternatively, you could add a Vec<u8> to the hasher struct and re-use that to avoid allocations.

@bors

Contributor

bors commented Oct 31, 2016

☔️ The latest upstream changes (presumably #37439) made this pull request unmergeable. Please resolve the merge conflicts.

Don't hash span filenames twice in IchHasher.
This significantly reduces the number of bytes hashed by IchHasher.
@nnethercote

Contributor

nnethercote commented Nov 2, 2016

I redid the second commit so it uses the leb128 encoding from libserialize.

@michaelwoerister

Looks good to me! Happy to merge it after you've renamed the bytes field and removed the comment about usize.

+
+ #[inline]
+ fn write_usize(&mut self, i: usize) {
+ // always hash as u64, so we don't depend on the size of `usize`

@michaelwoerister

michaelwoerister Nov 2, 2016

Contributor

Since we are encoding as leb128 before hashing, it should not make a difference whether the number was a usize or a u64 before. The leb128 representation of both will be identical.

#[derive(Debug)]
pub struct IchHasher {
state: ArchIndependentHasher<Blake2bHasher>,
+ bytes: Vec<u8>,

@michaelwoerister

michaelwoerister Nov 2, 2016

Contributor

Can you rename this to something like leb128_helper? Otherwise one might think that it contains the bytes that were hashed so far.

leb128-encode integers before hashing them in IchHasher.
This significantly reduces the number of bytes hashed by IchHasher.
@nnethercote

Contributor

nnethercote commented Nov 2, 2016

Updated to address comments.

@michaelwoerister

Contributor

michaelwoerister commented Nov 2, 2016

@bors r+

Thanks, @nnethercote!

@bors

Contributor

bors commented Nov 2, 2016

📌 Commit d73c68c has been approved by michaelwoerister

@bors

Contributor

bors commented Nov 3, 2016

⌛️ Testing commit d73c68c with merge 69c4cfb...

bors added a commit that referenced this pull request Nov 3, 2016

Auto merge of #37427 - nnethercote:opt-IchHasher, r=michaelwoerister
Reduce the number of bytes hashed by IchHasher.

@bors

Contributor

bors commented Nov 3, 2016

💔 Test failed - auto-mac-64-opt-rustbuild

@michaelwoerister

Contributor

michaelwoerister commented Nov 3, 2016

@bors retry

@bors

Contributor

bors commented Nov 3, 2016

⌛️ Testing commit d73c68c with merge e67a9ba...

bors added a commit that referenced this pull request Nov 3, 2016

Auto merge of #37427 - nnethercote:opt-IchHasher, r=michaelwoerister
Reduce the number of bytes hashed by IchHasher.

@bors

Contributor

bors commented Nov 3, 2016

💔 Test failed - auto-mac-64-opt-rustbuild

@alexcrichton

Member

alexcrichton commented Nov 3, 2016

@bors: retry

jonathandturner added a commit to jonathandturner/rust that referenced this pull request Nov 4, 2016

Rollup merge of #37427 - nnethercote:opt-IchHasher, r=michaelwoerister
Reduce the number of bytes hashed by IchHasher.


alexcrichton added a commit to alexcrichton/rust that referenced this pull request Nov 4, 2016

Rollup merge of #37427 - nnethercote:opt-IchHasher, r=michaelwoerister
Reduce the number of bytes hashed by IchHasher.


bors added a commit that referenced this pull request Nov 5, 2016

@bors

Contributor

bors commented Nov 5, 2016

⌛️ Testing commit d73c68c with merge aa90c30...

bors added a commit that referenced this pull request Nov 5, 2016

Auto merge of #37427 - nnethercote:opt-IchHasher, r=michaelwoerister
Reduce the number of bytes hashed by IchHasher.

@bors

Contributor

bors commented Nov 5, 2016

💔 Test failed - auto-win-msvc-64-opt-rustbuild

bors added a commit that referenced this pull request Nov 5, 2016

@alexcrichton

Member

alexcrichton commented Nov 5, 2016

@bors: retry

On Fri, Nov 4, 2016 at 10:23 PM, bors notifications@github.com wrote:

💔 Test failed - auto-win-msvc-64-opt-rustbuild
https://buildbot.rust-lang.org/builders/auto-win-msvc-64-opt-rustbuild/builds/2924


@bors

Contributor

bors commented Nov 5, 2016

⌛️ Testing commit d73c68c with merge 149d86a...

bors added a commit that referenced this pull request Nov 5, 2016

Auto merge of #37427 - nnethercote:opt-IchHasher, r=michaelwoerister
Reduce the number of bytes hashed by IchHasher.

@bors

Contributor

bors commented Nov 5, 2016

💔 Test failed - auto-mac-64-opt-rustbuild

bors added a commit that referenced this pull request Nov 5, 2016

@alexcrichton

Member

alexcrichton commented Nov 5, 2016

@bors: retry

On Fri, Nov 4, 2016 at 11:04 PM, bors notifications@github.com wrote:

💔 Test failed - auto-mac-64-opt-rustbuild
https://buildbot.rust-lang.org/builders/auto-mac-64-opt-rustbuild/builds/2906


@bors

Contributor

bors commented Nov 5, 2016

⌛️ Testing commit d73c68c with merge 0883996...

bors added a commit that referenced this pull request Nov 5, 2016

Auto merge of #37427 - nnethercote:opt-IchHasher, r=michaelwoerister
Reduce the number of bytes hashed by IchHasher.


@bors bors merged commit d73c68c into rust-lang:master Nov 5, 2016

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
homu Test successful
Details

alexcrichton added a commit to alexcrichton/rust that referenced this pull request Nov 5, 2016

Rollup merge of #37427 - nnethercote:opt-IchHasher, r=michaelwoerister
Reduce the number of bytes hashed by IchHasher.


bors added a commit that referenced this pull request Nov 5, 2016

bors added a commit that referenced this pull request Nov 6, 2016

@nnethercote nnethercote deleted the nnethercote:opt-IchHasher branch Nov 6, 2016
