
Micro-optimize the heck out of LEB128 reading and writing. #69050

Merged
merged 1 commit into rust-lang:master from nnethercote:micro-optimize-leb128 on Feb 13, 2020

Conversation

@nnethercote (Contributor) commented Feb 11, 2020

This commit makes the following writing improvements:

  • Removes the unnecessary `write_to_vec` function.
  • Reduces the number of conditions per loop from 2 to 1.
  • Avoids a mask and a shift on the final byte.

And the following reading improvements:

  • Removes an unnecessary type annotation.
  • Fixes a dangerous unchecked slice access. Imagine a slice `[0x80]` --
    the current code will read past the end of the slice by some number
    of bytes. The bounds check at the end will subsequently trigger,
    unless something bad (like a crash) happens first. The cost of doing
    a bounds check in the loop body is negligible.
  • Avoids a mask on the final byte.

And the following improvements for both reading and writing:

  • Changes `for` to `loop` for the loops, avoiding an unnecessary
    condition on each iteration. This also removes the need for
    `leb128_size`.

All of these changes give significant perf wins, up to 5%.
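
For illustration, here is roughly the shape the loops end up with after these
changes, written out for a single integer type (the real code is macro-generated
for several widths, so details differ; this is a sketch of the approach, not the
exact code that landed, and it assumes well-formed input):

pub fn write_u32_leb128(out: &mut Vec<u8>, mut value: u32) {
    loop {
        if value < 0x80 {
            // Final byte: no mask, no shift, continuation bit left clear.
            out.push(value as u8);
            break;
        } else {
            // One condition per iteration; push directly instead of `write_to_vec`.
            out.push((value | 0x80) as u8);
            value >>= 7;
        }
    }
}

pub fn read_u32_leb128(slice: &[u8]) -> (u32, usize) {
    let mut result = 0;
    let mut shift = 0;
    let mut position = 0;
    loop {
        // Ordinary indexing keeps the bounds check inside the loop body.
        let byte = slice[position];
        position += 1;
        if (byte & 0x80) == 0 {
            // Final byte: its top bit is already clear, so no mask is needed.
            result |= (byte as u32) << shift;
            return (result, position);
        } else {
            result |= ((byte & 0x7f) as u32) << shift;
        }
        shift += 7;
    }
}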

r? @michaelwoerister


@nnethercote (Contributor, Author) commented Feb 11, 2020

@bors try @rust-timer queue

@rust-timer commented Feb 11, 2020

Awaiting bors try build completion

@bors (Contributor) commented Feb 11, 2020

⌛️ Trying commit ad7802f with merge d902ca0...

bors added a commit that referenced this pull request Feb 11, 2020
Micro-optimize the heck out of LEB128 reading and writing.

@nnethercote (Contributor, Author) commented Feb 11, 2020

Local check results:

clap-rs-check
        avg: -2.7%      min: -5.6%      max: -0.0%
ucd-check
        avg: -1.3%      min: -2.8%      max: -0.4%
coercions-check
        avg: -1.0%?     min: -2.2%?     max: -0.0%?
tuple-stress-check
        avg: -0.7%      min: -1.6%      max: -0.0%
wg-grammar-check
        avg: -0.6%      min: -1.6%      max: -0.0%
html5ever-check
        avg: -0.9%      min: -1.4%      max: -0.2%
script-servo-check
        avg: -0.8%      min: -1.1%      max: -0.1%
cranelift-codegen-check
        avg: -0.5%      min: -1.0%      max: -0.1%
unused-warnings-check
        avg: -0.4%      min: -1.0%      max: -0.0%
webrender-check
        avg: -0.6%      min: -1.0%      max: -0.1%
regression-31157-check
        avg: -0.6%      min: -1.0%      max: -0.2%
regex-check
        avg: -0.7%      min: -1.0%      max: -0.1%
piston-image-check
        avg: -0.6%      min: -0.9%      max: -0.1%
cargo-check
        avg: -0.5%      min: -0.9%      max: -0.0%
webrender-wrench-check
        avg: -0.6%      min: -0.8%      max: -0.1%
hyper-2-check
        avg: -0.4%      min: -0.8%      max: -0.1%
keccak-check
        avg: -0.3%      min: -0.8%      max: -0.0%
futures-check
        avg: -0.5%      min: -0.8%      max: -0.1%
syn-check
        avg: -0.5%      min: -0.8%      max: -0.1%
packed-simd-check
        avg: -0.4%      min: -0.8%      max: -0.0%
ripgrep-check
        avg: -0.5%      min: -0.8%      max: -0.1%
serde-check
        avg: -0.3%      min: -0.8%      max: -0.0%
encoding-check
        avg: -0.5%      min: -0.8%      max: -0.1%
serde-serde_derive-check
        avg: -0.4%      min: -0.7%      max: -0.0%
style-servo-check
        avg: -0.4%      min: -0.7%      max: -0.0%
tokio-webpush-simple-check
        avg: -0.5%      min: -0.7%      max: -0.2%
inflate-check
        avg: -0.2%      min: -0.7%      max: -0.0%
await-call-tree-check
        avg: -0.6%      min: -0.7%      max: -0.4%
issue-46449-check
        avg: -0.5%      min: -0.7%      max: -0.4%
wf-projection-stress-65510-che...
        avg: -0.2%      min: -0.6%      max: 0.0%
unicode_normalization-check
        avg: -0.2%      min: -0.6%      max: -0.0%
helloworld-check
        avg: -0.3%      min: -0.5%      max: -0.1%
ctfe-stress-4-check
        avg: -0.2%?     min: -0.5%?     max: 0.2%?
unify-linearly-check
        avg: -0.3%      min: -0.4%      max: -0.2%
deeply-nested-check
        avg: -0.3%      min: -0.4%      max: -0.2%
deep-vector-check
        avg: -0.1%      min: -0.3%      max: -0.0%
token-stream-stress-check
        avg: -0.1%      min: -0.1%      max: -0.0%

The biggest improvements are on "clean incremental" runs, followed by "patched incremental".

@bors (Contributor) commented Feb 11, 2020

☀️ Try build successful - checks-azure
Build commit: d902ca0 (d902ca046d0a8cc72dd69a16627fa5da540030f1)

@rust-timer commented Feb 11, 2020

Queued d902ca0 with parent dc4242d, future comparison URL.

@michaelwoerister (Contributor) commented Feb 11, 2020

That's interesting. I remember that switching the code from `loop` to `for` sped it up considerably a couple of years ago. My theory now is that the earlier speedup came from duplicating the machine code for each integer type, which let the branch predictor do a better job, and that this effect was big enough to outweigh the extra overhead the `for` loop introduced.

Anyway, I'm happy to get any kind of improvement here. And it's even safer than before 🎉

(In case someone is interested in the history of this implementation: https://github.com/michaelwoerister/encoding-bench contains a number of different versions that I tried out. It's rather messy, as it's essentially a private repo, but one interesting aspect is the test data files, which are generated from actual rustc invocations.)

@rust-timer commented Feb 11, 2020

Finished benchmarking try commit d902ca0, comparison URL.

@michaelwoerister (Contributor) commented Feb 12, 2020

@bors r+

Thanks, @nnethercote!

@bors (Contributor) commented Feb 12, 2020

📌 Commit ad7802f has been approved by michaelwoerister

@nnethercote (Contributor, Author) commented Feb 12, 2020

@bors r- until I have tried out @ranma42's suggestion.

@ranma42 (Contributor) commented Feb 12, 2020

I just found it strange that the most significant bit was cleared out (`_ & 0x7f`) right before being set (`_ | 0x80`).
I do not think it should make any difference in the timing (or even in the generated code, as I believe LLVM will optimize the mask away).
If this is a performance-sensitive part of the compiler, I will try to take a deeper look :)
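
Concretely, the pattern in question looks like this (illustrative helpers, not code from the PR):

// The mask-then-set form being questioned:
fn with_mask(value: u32) -> u8 {
    ((value & 0x7f) as u8) | 0x80
}

// Equivalent, since the OR forces bit 7 regardless of what the cast left there:
fn without_mask(value: u32) -> u8 {
    (value as u8) | 0x80
}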

@eddyb (Member) commented Feb 12, 2020

@nnethercote If you're bored, I wonder how this implementation compares to the pre-#59820 one in libproc_macro (which I implemented from scratch in safe code).

It definitely feels like your new version here is close to mine, but without checking I can't tell which one LLVM will prefer (or whether they compile to the same thing).

EDIT: also, has anyone considered using SIMD here, like @BurntSushi and others have employed for handling UTF-8/regexes etc.? I'm asking because UTF-8 is like a more complex LEB128.

@bjorn3 (Contributor) commented Feb 12, 2020

UTF-8 validation handles a lot of codepoints every call, while these read and write methods only handle a single LEB128 int per call, so SIMD is likely not useful.

@eddyb (Member) commented Feb 12, 2020

while these read and write methods only handle a single LEB128 int per call

May not be relevant, but the serialized data is basically a sequence of LEB128s (perhaps intermixed with strings); they just semantically represent more hierarchical values than a UTF-8 stream does.

@ranma42 (Contributor) commented Feb 12, 2020

If you are willing to do processor-specific tuning, PDEP/PEXT (available on modern x86 processors) might be better suited than generic SIMD for this task.

@gereeter (Contributor) commented Feb 12, 2020

also, has anyone considered using SIMD here

See also Masked VByte [arXiv].

@nnethercote (Contributor, Author) commented Feb 13, 2020

@nnethercote If you're bored, I wonder how this implementation compares to the pre-#59820 one in libproc_macro (which I implemented from scratch in safe code).

I tried the read and write implementations from libproc_macro individually; both were slower than the code in this PR.

@nnethercote (Contributor, Author) commented Feb 13, 2020

also, has anyone considered using SIMD here

See also Masked VByte [arXiv].

Thanks for the link, I will take a look... but not in this PR :)

@nnethercote (Contributor, Author) commented Feb 13, 2020

@bors r=michaelwoerister

@bors (Contributor) commented Feb 13, 2020

📌 Commit ad7802f has been approved by michaelwoerister

@nnethercote (Contributor, Author) commented Feb 13, 2020

BTW, in case anyone is curious, here's how I approached this bug. From profiling with Callgrind I saw that clap-rs-Check-CleanIncr was the benchmark+run+build combination most affected by LEB128 encoding. Its text output has entries like this:

265,344,872 ( 2.97%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:rustc::ty::query::on_disk_cache::__ty_decoder_impl::<impl serialize::serialize::Decoder for rustc::ty::query::on_disk_cache::CacheDecoder>::read_usize
236,097,015 ( 2.64%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc::ty::query::on_disk_cache::CacheEncoder<E> as serialize::serialize::Encoder>::emit_u32
213,551,888 ( 2.39%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:rustc::ty::codec::encode_with_shorthand
165,042,682 ( 1.85%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc_target::abi::VariantIdx as serialize::serialize::Decodable>::decode
 40,540,500 ( 0.45%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<u32 as serialize::serialize::Encodable>::encode
 24,026,292 ( 0.27%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:serialize::serialize::Encoder::emit_seq
 20,160,540 ( 0.23%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc::dep_graph::serialized::SerializedDepNodeIndex as serialize::serialize::Decodable>::decode
  9,661,323 ( 0.11%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:serialize::serialize::Decoder::read_tuple
  4,898,927 ( 0.05%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc::ty::query::on_disk_cache::CacheEncoder<E> as serialize::serialize::Encoder>::emit_usize
  3,384,018 ( 0.04%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc_metadata::rmeta::encoder::EncodeContext as serialize::serialize::Encoder>::emit_u32
  2,296,440 ( 0.03%)  /home/njn/moz/rust0/src/libserialize/leb128.rs:<rustc::ty::UniverseIndex as serialize::serialize::Decodable>::decode

These are instruction counts, and the percentages sum to about 11%. Lots of different functions are involved because the LEB128 functions are inlined, but the file is leb128.rs in all of them, so I could tell where the relevant code lives. And the annotated code in that file looks like this:

          .           macro_rules! impl_write_unsigned_leb128 {
          .               ($fn_name:ident, $int_ty:ident) => {
          .                   #[inline]
          .                   pub fn $fn_name(out: &mut Vec<u8>, mut value: $int_ty) {
          .                       for _ in 0..leb128_size!($int_ty) {
143,877,210 ( 1.61%)                  let mut byte = (value & 0x7F) as u8;
 48,003,612 ( 0.54%)                  value >>= 7;
239,884,434 ( 2.69%)                  if value != 0 {
 47,959,070 ( 0.54%)                      byte |= 0x80;
          .                           }
          .
          .                           write_to_vec(out, byte);
          .
 47,959,070 ( 0.54%)                  if value == 0 {
          .                               break;
          .                           }
          .                       }
          .                   }
          .               };
          .           }
          .
          .           impl_write_unsigned_leb128!(write_u16_leb128, u16);
-- line 50 ----------------------------------------
-- line 57 ----------------------------------------
          .               ($fn_name:ident, $int_ty:ident) => {
          .                   #[inline]
          .                   pub fn $fn_name(slice: &[u8]) -> ($int_ty, usize) {
          .                       let mut result: $int_ty = 0;
          .                       let mut shift = 0;
          .                       let mut position = 0;
          .
          .                       for _ in 0..leb128_size!($int_ty) {
 59,507,824 ( 0.67%)                  let byte = unsafe { *slice.get_unchecked(position) };
          .                           position += 1;
204,126,888 ( 2.29%)                  result |= ((byte & 0x7F) as $int_ty) << shift;
119,023,350 ( 1.33%)                  if (byte & 0x80) == 0 {
          .                               break;
          .                           }
          .                           shift += 7;
          .                       }
          .
          .                       // Do a single bounds check at the end instead of for every byte.
 67,805,748 ( 0.76%)              assert!(position <= slice.len());
          .
          .                       (result, position)
          .                   }
          .               };
          .           }

Those percentages also add up to about 11%. Plus I poked around a bit at call sites and found this in a different file (libserialize/opaque.rs):

         .           macro_rules! read_uleb128 {
          .               ($dec:expr, $fun:ident) => {{
100,680,777 ( 1.13%)          let (value, bytes_read) = leb128::$fun(&$dec.data[$dec.position..]);
 67,858,196 ( 0.76%)          $dec.position += bytes_read;
 43,378,625 ( 0.49%)          Ok(value)
          .               }};
          .           }

which is another 2.38%. So it was clear that LEB128 reading/writing was hot.

I then tried gradually improving the code, and ended up measuring 18 different changes: 10 were improvements (which I kept), and 8 were regressions (which I discarded). The following table shows the notes I took. The descriptions of the changes are a bit cryptic, but the basic technique should be clear.

IMPROVEMENTS
            clap-rs-Check-CleanIncr
feb10/Leb0  8,992M        $RUSTC0
feb10/Leb1  8,927M/99.3%  First attempt
feb11/Leb4  8,996M        $RUSTC0 but with bounds checking
feb11/Leb5  8,983M        `loop` for reading
feb11/Leb6  8,928M/99.3%  `loop` for writing, `write_to_vec` removed
feb11/Leb8  8,829M/98.1%  avoid mask on final byte in read loop
feb11/Leb9  8,529M/94.8%  in write loop, avoid a condition
feb11/Leb10 8,488M/94.4%  in write loop, mask/shift on final byte
feb13/Leb13 8,488M/94.4%  in write loop, push `(value | 0x80) as u8`
feb13/Leb15 8,488M/94.4%  in read loop, do `as` before `&`
feb13/Leb18 8,492M/94.4%  Landed (not sure about the extra 4M, oh well)

REGRESSIONS
feb11/Leb2  8,927M/99.3%  add slice0, slice1, slice2 vars
feb11/Leb3  9,127M        move the slow loop into a separate no-inline function
feb11/Leb7  8,930M        `< 128` in read loop
feb11/Leb11 8,492M        use `byte < 0x80` in read loop
feb12/Leb12 8,721M        unsafe pushing in write
feb13/Leb14 8,494M/94.4%  in write loop, push `(value as u8) | 0x80`
feb13/Leb16 8,831M        eddyb's write loop
feb13/Leb17 8,578M        eddyb's read loop

Every iteration took about 6.5 minutes to recompile, and about 2 minutes to measure with Cachegrind. I interleaved these steps with other work, so in practice each iteration took anywhere from 10-30 minutes, depending on context-switching delays.

The measurements in the notes are close to those from the CI run, which indicate the following for clap-rs-Check-CleanIncr:

  • instructions: -5.3%
  • cycles: -4.4%
  • wall-time: -3.9%

Instruction counts are almost deterministic and highly reliable. Cycle counts are more variable but still reasonable. Wall-time is highly variable and barely trustworthy. But they're all pointing in the same direction, which is encouraging.

Looking at the instruction counts, we saw that LEB128 operations were about 11-13% of instructions originally, and total instruction counts went down by about 5%. In other words, LEB128's share dropped from roughly 11% to roughly 6% of the original total, which suggests the LEB128 operations are a bit less than twice as fast as they were (about 11/6 ≈ 1.8x). Pretty good.

Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Feb 13, 2020
…r=michaelwoerister

Micro-optimize the heck out of LEB128 reading and writing.

bors added a commit that referenced this pull request Feb 13, 2020
Rollup of 9 pull requests

Successful merges:

 - #67642 (Relax bounds on HashMap/HashSet)
 - #68848 (Hasten macro parsing)
 - #69008 (Properly use parent generics for opaque types)
 - #69048 (Suggestion when encountering assoc types from hrtb)
 - #69049 (Optimize image sizes)
 - #69050 (Micro-optimize the heck out of LEB128 reading and writing.)
 - #69068 (Make the SGX arg cleanup implementation a NOP)
 - #69082 (When expecting `BoxFuture` and using `async {}`, suggest `Box::pin`)
 - #69104 (bootstrap: Configure cmake when building sanitizer runtimes)

Failed merges:

r? @ghost
bors merged commit ad7802f into rust-lang:master on Feb 13, 2020
5 checks passed:
- homu: Test successful
- pr: Build #20200211.19 succeeded
- pr (Linux mingw-check): succeeded
- pr (Linux x86_64-gnu-llvm-7): succeeded
- pr (Linux x86_64-gnu-tools): succeeded

@bors (Contributor) commented Feb 13, 2020

☔️ The latest upstream changes (presumably #69118) made this pull request unmergeable. Please resolve the merge conflicts.

nnethercote deleted the nnethercote:micro-optimize-leb128 branch on Feb 13, 2020
@Veedrac (Contributor) commented Feb 13, 2020

In response to earlier comments, PDEP can be used to encode with something like the following (untested; requires BMI2):

use std::arch::x86_64::_pdep_u64;

fn leb128enc(value: u32) -> [u8; 8] {
    let hi = 0x8080_8080_8080_8080u64;
    // Deposit each 7-bit group of `value` into the low 7 bits of a byte.
    let split = unsafe { _pdep_u64(value as u64, !hi) };
    // Set the continuation bit on every byte below the topmost non-empty byte.
    let tags = (!0 >> (split | 1).leading_zeros()) & hi;
    (split | tags).to_le_bytes()
}

You can do a similar thing with PEXT for decoding. Encoding larger integers is probably best handled with a branch that processes full chunks of 56 bits (with `let tags = hi`) before finishing with the above.
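
For illustration, a possible decode counterpart in the same spirit (my own untested sketch, not code from the thread; it assumes BMI2, well-formed input, and that 8 bytes are readable at the current position):

use std::arch::x86_64::_pext_u64;

#[target_feature(enable = "bmi2")]
unsafe fn leb128dec(bytes: [u8; 8]) -> (u32, usize) {
    let hi = 0x8080_8080_8080_8080u64;
    let word = u64::from_le_bytes(bytes);
    // The encoded length is one past the first byte whose continuation bit is clear.
    let len = ((!word & hi).trailing_zeros() / 8 + 1) as usize;
    // Zero everything past the encoded bytes, then gather the 7 data bits of each byte.
    let mask = !0u64 >> (64 - 8 * len as u32);
    let value = _pext_u64(word & mask, !hi) as u32;
    (value, len)
}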

@nnethercote (Contributor, Author) commented Feb 13, 2020

@fitzgen tried using PEXT a while back in a different project. For the common case (small integers that fit in 1 byte) it was a slight slowdown:
https://twitter.com/fitzgen/status/1138784734417432576

@fitzgen (Member) commented Feb 13, 2020

Also, on Intel chips, PEXT is implemented in hardware and is super fast (one or two cycles, IIRC), but on AMD it is implemented in microcode and is much slower (150-300 cycles). We'd have to be careful with it.

@Veedrac (Contributor) commented Feb 13, 2020

@nnethercote The thing I would worry about with PEXT is the copy; if you do that byte-at-a-time (or with memcpy) you probably eat a lot of the gains. The key to a fast variable-length copy is to always append the maximum size and then bump the pointer by the actual length instead (or truncate the vector, in the Rust case). Being able to avoid the >10% mispredict rate probably pays for the few extra instructions in the common cases, but you need to specifically design for that.
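
A minimal sketch of that strategy, building on the hypothetical `leb128enc` above (my own illustration, untested):

fn push_leb128(out: &mut Vec<u8>, value: u32) {
    let bytes = leb128enc(value);
    // Encoded length: one past the first byte whose continuation bit is clear.
    let hi = 0x8080_8080_8080_8080u64;
    let len = ((!u64::from_le_bytes(bytes) & hi).trailing_zeros() / 8 + 1) as usize;
    // Branch-free copy: always append all 8 bytes, then drop the unused tail.
    out.extend_from_slice(&bytes);
    out.truncate(out.len() - (8 - len));
}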

nnethercote added a commit to nnethercote/rust that referenced this pull request Feb 14, 2020
PR rust-lang#69050 changed LEB128 reading and writing. After it landed I did some
double-checking and found that the writing changes were universally a
speed-up, but the reading changes were not. I'm not exactly sure why,
perhaps there was a quirk of inlining in the particular revision I was
originally working from.

This commit reverts some of the reading changes, while still avoiding
`unsafe` code. I have checked it on multiple revisions and the speed-ups
seem to be robust.
bors added a commit that referenced this pull request Feb 14, 2020
Tweak LEB128 reading some more.

nnethercote added a commit to nnethercote/rust that referenced this pull request Feb 16, 2020
bors added a commit that referenced this pull request Feb 16, 2020
Tweak LEB128 reading some more.
