Improve Float Parsing Speeds by up to 99.94% Through Improvements to the Bellerophon Algorithm #85234

Alexhuszagh · 2021-05-12T18:26:43Z

Summary

When the fast-path algorithm cannot be used (see #85198), Rust falls back to the Bellerophon algorithm, based off this paper. Examples of floats that can be correctly parsed via the Bellerophon algorithm include "9007199254740992.0" (1 << 53), while near-halfway cases such as "9007199254740992992e-3" (just below the halfway point, (1 << 53) + 1) must fall back to slower algorithms. Unfortunately, the current implementation of the Bellerophon algorithm requires the use of arbitrary-precision arithmetic, which can lead to a 10,000x performance penalty.

Please see the "Sample Repository" section below for the exact specifics, or to replicate these changes. This is an initial attempt as part of an ongoing effort to speed up float parsing in Rust, and aims to integrate algorithms I've implemented (currently used in nom and serde-json) back into the core library.

Issue

When parsing floating-point numbers, if the number cannot be exactly parsed by the fast-path algorithm, the parser falls back to an extended-precision representation (often consisting of 64 bits for the significant digits, or mantissa, and 16 bits for the exponent). For a more detailed description of halfway cases, see the halfway cases section below.
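As a rough illustration of what "exactly parsed by the fast-path algorithm" means for an f64, here is a minimal sketch. It is not the code in core: the function name `fast_path`, the table `POW10`, and the exact bounds are assumptions for illustration, based on the fact that integers up to 2^53 and powers of ten up to 10^22 are exactly representable in an f64, so a single multiplication or division is correctly rounded.

```rust
/// Largest decimal exponent whose power of ten is exact in an f64.
const MAX_EXACT_EXP10: i32 = 22;

/// Powers of ten from 10^0 through 10^22, all exactly representable as f64.
const POW10: [f64; 23] = [
    1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22,
];

/// Hypothetical fast path: returns `None` when the inputs are out of
/// range, meaning a slower algorithm would have to take over.
fn fast_path(mantissa: u64, exp10: i32) -> Option<f64> {
    if mantissa > (1u64 << 53) || exp10.abs() > MAX_EXACT_EXP10 {
        return None;
    }
    // Exact conversion, since the mantissa fits in 53 bits.
    let value = mantissa as f64;
    let pow = POW10[exp10.unsigned_abs() as usize];
    // One IEEE-754 operation on two exact operands is correctly rounded.
    Some(if exp10 < 0 { value / pow } else { value * pow })
}

fn main() {
    // 1.2345e22 == 12345 * 10^18, which stays on the fast path.
    assert_eq!(fast_path(12345, 18), Some(1.2345e22));
    // 1.2345e30 == 12345 * 10^26 is out of range for this simple check.
    assert_eq!(fast_path(12345, 26), None);
}
```

This sketch deliberately does not handle disguised fast-path values; those are the subject of #85198.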

If this extended-precision representation can be unambiguously rounded to the nearest native float, by showing that the maximum error is less than the difference to the nearest halfway case, then we have an accurate representation and can skip slower algorithms.
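The check described above can be sketched as follows. This is a simplified illustration, not the implementation in core: it assumes a normalized 64-bit mantissa targeting an f64 (53 bits kept, 11 bits truncated), an error bound expressed in units of the lowest bit of that 64-bit mantissa, and it ignores denormals. The name `rounds_unambiguously` is hypothetical.

```rust
/// Bits of the 64-bit mantissa that do not fit in an f64's 53 bits.
const EXTRA_BITS: u32 = 64 - 53;
/// Value of the truncated bits that lies exactly halfway between two floats.
const HALFWAY: u64 = 1 << (EXTRA_BITS - 1);

/// Returns true if the truncated bits are further from the halfway point
/// than the accumulated error, so rounding cannot change direction.
fn rounds_unambiguously(mantissa: u64, max_error: u64) -> bool {
    let truncated = mantissa & ((1u64 << EXTRA_BITS) - 1);
    let distance = if truncated > HALFWAY {
        truncated - HALFWAY
    } else {
        HALFWAY - truncated
    };
    distance > max_error
}

fn main() {
    let top_bit = 1u64 << 63;
    // Far from halfway: safe to round directly.
    assert!(rounds_unambiguously(top_bit, 8));
    // Exactly halfway: must defer to a slower, exact algorithm.
    assert!(!rounds_unambiguously(top_bit | HALFWAY, 8));
}
```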

These slower algorithms make use of arbitrary-precision arithmetic to exactly represent the significant digits of the float, and therefore round to the nearest native float. The current implementation of Bellerophon, however, generates the significant digits from a big integer, which leads to significantly reduced performance.

By using a 64-bit representation of the significant digits parsed from the first 19-20 digits of the float, we can improve performance by orders of magnitude.
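A minimal sketch of the truncated 64-bit representation, assuming the input is a slice of ASCII digits with the decimal point already removed. The function name `parse_truncated` and its return shape are assumptions for illustration. Any 19 decimal digits fit in a u64 because 10^19 - 1 < 2^64.

```rust
/// Any 19-digit decimal integer fits in a u64.
const MAX_DIGITS: usize = 19;

/// Hypothetical helper: accumulates at most 19 digits and reports how many
/// digits were dropped and whether any dropped digit was non-zero.
/// Assumes every byte is an ASCII digit.
fn parse_truncated(digits: &[u8]) -> (u64, usize, bool) {
    let mut mantissa = 0u64;
    let mut used = 0usize;
    let mut lossy = false;
    for &byte in digits {
        let digit = (byte - b'0') as u64;
        if used < MAX_DIGITS {
            mantissa = mantissa * 10 + digit;
            used += 1;
        } else if digit != 0 {
            // A non-zero digit was discarded, so the mantissa is inexact
            // and the error bound must account for it.
            lossy = true;
        }
    }
    (mantissa, digits.len() - used, lossy)
}

fn main() {
    // The digits of "8.988465674311580536566680e307" without the point.
    let (mantissa, dropped, lossy) = parse_truncated(b"8988465674311580536566680");
    println!("{} (dropped {}, lossy {})", mantissa, dropped, lossy);
}
```

The dropped-digit count would be added to the decimal exponent, and the lossy flag is what forces a stricter error bound when checking for halfway cases.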

Halfway Cases

When parsing floats, the most significant problem is determining how to round the resulting value. The IEEE-754 standard specifies rounding to nearest, with ties rounding to even.

For example, applying this rounding scheme to decimal numbers:

8.9 would round to 9.0.
9.1 would round to 9.0.
9.5 would round to 10.0.
10.5 would round to 10.0.
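The four decimal examples above can be reproduced with a small helper. This is an illustrative sketch only: `round_half_even` is a hypothetical function written out by hand so that the tie-breaking rule is visible.

```rust
/// Round to the nearest integer, breaking exact ties towards the even one.
fn round_half_even(x: f64) -> f64 {
    let floor = x.floor();
    let diff = x - floor;
    if diff < 0.5 {
        floor
    } else if diff > 0.5 {
        floor + 1.0
    } else if floor % 2.0 == 0.0 {
        // Exact tie and the lower neighbour is even: keep it.
        floor
    } else {
        // Exact tie and the lower neighbour is odd: the upper one is even.
        floor + 1.0
    }
}

fn main() {
    assert_eq!(round_half_even(8.9), 9.0);
    assert_eq!(round_half_even(9.1), 9.0);
    assert_eq!(round_half_even(9.5), 10.0);
    assert_eq!(round_half_even(10.5), 10.0);
}
```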

When parsing decimal strings to binary, fixed-width floating-point numbers, we must round to the nearest float. This becomes tricky when values are near their halfway point. For example, with a single-precision float f32, we would round as follows:

16777216.9 rounds to 16777216.0
16777217.0 rounds to 16777216.0
16777217.1 rounds to 16777218.0
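These three roundings can be checked directly against the standard library parser; the helper name `parse_f32` below is just a convenience for this example.

```rust
/// Parse a decimal string as f32, panicking on malformed input.
fn parse_f32(s: &str) -> f32 {
    s.parse::<f32>().expect("valid float literal")
}

fn main() {
    // Below the halfway point: rounds down.
    assert_eq!(parse_f32("16777216.9"), 16777216.0);
    // Exactly halfway: the tie goes to the neighbour with an even mantissa.
    assert_eq!(parse_f32("16777217.0"), 16777216.0);
    // Above the halfway point: rounds up.
    assert_eq!(parse_f32("16777217.1"), 16777218.0);
}
```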

This is easier to illustrate if we represent the float in binary. First, here's the layout of an IEEE-754 single-precision float as bits:

🟦🟩🟩🟩🟩🟩🟩🟩🟩🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪

Where:

🟦 is the sign bit.
🟩 are the exponent bits.
🟪 are the mantissa, or significant digit, bits.

We'll ignore the exponent and sign bits for now, and only consider the mantissa, or significant digits. An implicit leading bit, also called the hidden bit, provides an extra bit of precision for normal floats, meaning we have 24 bits of precision. For three consecutive integers, we would therefore have the following representations, where the last bit is truncated off:

16777216.0 => 100000000000000000000000 0
16777217.0 => 100000000000000000000000 1
16777218.0 => 100000000000000000000001 0

Therefore, 16777217.0 is exactly halfway between 16777216.0 and 16777218.0. Although solving these halfway cases can superficially seem easy, simple algorithms will fail even when parsing the shortest, accurate decimal representation.
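The bit patterns above can be reproduced with integer arithmetic. The sketch below only covers integers in the range [2^24, 2^25), where exactly one bit is truncated; the function names are hypothetical and chosen for this illustration.

```rust
/// Split a 25-bit integer into the 24 bits an f32 keeps and the one bit it drops.
fn split(value: u32) -> (u32, u32) {
    (value >> 1, value & 1)
}

/// Round to the nearest value representable with 24 bits, ties to even.
/// With a single truncated bit, a set bit means "exactly halfway".
fn round_to_f32_grid(value: u32) -> u32 {
    let (kept, truncated) = split(value);
    let rounded = if truncated == 1 && kept & 1 == 1 {
        kept + 1 // halfway, and the kept mantissa is odd: round up to even
    } else {
        kept // not halfway, or already even: keep as-is
    };
    rounded << 1
}

fn main() {
    for value in [16777216u32, 16777217, 16777218].iter() {
        let (kept, truncated) = split(*value);
        println!("{} => {:024b} {}", value, kept, truncated);
    }
    assert_eq!(round_to_f32_grid(16777217), 16777216);
    // Rust's integer-to-float cast also rounds to nearest, ties to even.
    assert_eq!(16777217u32 as f32, 16777216.0);
}
```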

Binary Sizes

These were compiled on a target of x86_64-unknown-linux-gnu, running kernel version 5.11.16-100, on a Rust version of rustc 1.53.0-nightly (132b4e5d1 2021-04-13). The sizes reflect the binary sizes reported by ls -sh, both before and after running the strip command. The debug profile was used for opt-levels 0 and 1, and was as follows:

[profile.dev]
opt-level = "..."
debug = true
lto = false

The release profile was used for opt-levels 2, 3, s and z and was as follows:

[profile.release]
opt-level = "..."
debug = false
debug-assertions = false
lto = true

core

These are the binary sizes prior to making changes.

opt-level	size	size(stripped)
0	3.6M	360K
1	3.5M	316K
2	1.3M	236K
3	1.3M	248K
s	1.3M	244K
z	1.3M	248K

moderate

These are the binary sizes after making changes to speed up the Bellerophon algorithm.

opt-level	size	size(stripped)
0	3.6M	364K
1	3.5M	316K
2	1.3M	248K
3	1.3M	252K
s	1.3M	244K
z	1.3M	244K

Performance

Overall, the changes to speed up the Bellerophon algorithm led to a:

~79% reduction in parse time for the MODERATE float.
~99.7% reduction in parse time for the LARGE float.
~99.94% reduction in parse time for the DENORMAL float.

And it did not affect the performance of the fast-path algorithm.

These benchmarks were run on an i7-6560U CPU @ 2.20GHz, on a target of x86_64-unknown-linux-gnu, running kernel version 5.11.16-100, on a Rust version of rustc 1.53.0-nightly (132b4e5d1 2021-04-13). The performance CPU governor was used for all benchmarks, which were run on A/C power with only tmux and Sublime Text open. The floats that were parsed are as follows:

// Example fast-path value.
const FAST: &str = "1.2345e22";
// Example disguised fast-path value.
const DISGUISED: &str = "1.2345e30";
// Example moderate path value: clearly not halfway `1 << 53`.
const MODERATE: &str = "9007199254740992.0";
// Example exactly-halfway value `(1<<53) + 1`.
const HALFWAY: &str = "9007199254740993.0";
// Example large, near-halfway value.
const LARGE: &str = "8.988465674311580536566680e307";
// Example denormal, near-halfway value.
const DENORMAL: &str = "8.442911973260991817129021e-309";
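As a quick sanity check on what these inputs should parse to, the following sketch runs a few of them through the standard library parser. The constants are copied from the list above; the `parse` helper is only a convenience for this example.

```rust
use std::num::FpCategory;

const MODERATE: &str = "9007199254740992.0";
const HALFWAY: &str = "9007199254740993.0";
const LARGE: &str = "8.988465674311580536566680e307";
const DENORMAL: &str = "8.442911973260991817129021e-309";

/// Parse with the standard library, panicking on malformed input.
fn parse(s: &str) -> f64 {
    s.parse::<f64>().expect("valid float literal")
}

fn main() {
    let two_53 = (1u64 << 53) as f64;
    // Exactly representable: no rounding needed.
    assert_eq!(parse(MODERATE), two_53);
    // Exactly halfway between 2^53 and 2^53 + 2: the tie goes to even.
    assert_eq!(parse(HALFWAY), two_53);
    // Just below the halfway point: rounds down.
    assert_eq!(parse("9007199254740992992e-3"), two_53);
    // The large value is a normal float, the small one is subnormal.
    assert_eq!(parse(LARGE).classify(), FpCategory::Normal);
    assert_eq!(parse(DENORMAL).classify(), FpCategory::Subnormal);
}
```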

core

These are the benchmarks prior to making changes.

float	speed
fast	32.952ns
disguised	129.86ns
moderate	237.08ns
halfway	371.21ns
large	287.81us
denormal	122.36us

moderate

These are the benchmarks after making changes to speed up the Bellerophon algorithm.

float	speed
fast	26.668ns
disguised	34.599ns
moderate	49.378ns
halfway	224.81ns
large	796.34ns
denormal	63.763ns
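The percentage figures quoted at the top of this section follow from the two tables. A short worked calculation, with the timings copied from the tables and converted to nanoseconds (the function name `percent_change` is just for this example):

```rust
/// Relative change in parse time as a percentage (negative means faster).
fn percent_change(before_ns: f64, after_ns: f64) -> f64 {
    (after_ns - before_ns) / before_ns * 100.0
}

fn main() {
    // (name, before in ns, after in ns); 287.81us == 287_810ns.
    let rows = [
        ("moderate", 237.08, 49.378),
        ("large", 287_810.0, 796.34),
        ("denormal", 122_360.0, 63.763),
    ];
    for &(name, before, after) in rows.iter() {
        println!("{}: {:.2}%", name, percent_change(before, after));
    }
}
```

This works out to roughly -79.2%, -99.72%, and -99.95% for the moderate, large, and denormal inputs respectively.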

Correctness Concerns

There are a few correctness concerns, since this uses a potentially truncated representation of the significant digits for error calculation. I've therefore made the error detection stricter, so it rejects more halfway cases than before and correctly compounds error with truncated cases and non-normalized representations after multiplication.

In practice, this only rejects a handful of cases that would normally be accepted by the algorithm, with a major benefit to overall performance.

I've also extended the powers-of-10 to handle denormal floats, as well as values that could lead to Infinity, and updated the internal logic to ensure correct rounding.

This passes all of Rust's float parsing tests, as well as carefully crafted examples to try to detect errors, and therefore is unlikely to have any correctness issues.

Sample Repository

I've created a simple, minimal repository tracking these changes on rust-dec2flt, which has a core branch that is identical to Rust's current implementation in the core library. The moderate branch contains the changes to improve parsing speeds for the Bellerophon algorithm. This currently relies on changes made to infer binary exponents; however, it can be trivially rewritten to explicitly store them.

Update Rust Float-Parsing Algorithms to use the Eisel-Lemire algorithm.

# Summary

Rust, although it implements a correct float parser, has major performance issues in float parsing. Even for common floats, the performance can be 3-10x [slower](https://arxiv.org/pdf/2101.11408.pdf) than external libraries such as [lexical](https://github.com/Alexhuszagh/rust-lexical) and [fast-float-rust](https://github.com/aldanor/fast-float-rust).

Recently, major advances in float-parsing algorithms have been developed by Daniel Lemire, along with others, and implement a fast, performant, and correct float parser, with speeds up to 1200 MiB/s on Apple's M1 architecture for the [canada](https://github.com/lemire/simple_fastfloat_benchmark/blob/0e2b5d163d4074cc0bde2acdaae78546d6e5c5f1/data/canada.txt) dataset, 10x faster than Rust's 130 MiB/s.

In addition, [edge-cases](rust-lang#85234) in Rust's [dec2flt](https://github.com/rust-lang/rust/tree/868c702d0c9a471a28fb55f0148eb1e3e8b1dcc5/library/core/src/num/dec2flt) algorithm can lead to over a 1600x slowdown relative to efficient algorithms. This is due to the use of Clinger's correct, but slow [AlgorithmM and Bellerophon](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.4152&rep=rep1&type=pdf), which have been improved by faster big-integer algorithms and the Eisel-Lemire algorithm, respectively.

Finally, this algorithm provides substantial improvements in the number of floats the Rust core library can parse. Denormal floats with a large number of digits cannot be parsed, due to use of the `Big32x40`, which simply does not have enough digits to round a float correctly. Using a custom decimal class, with much simpler logic, we can parse all valid decimal strings of any digit count.

```rust
// Issue in Rust's dec2flt.
"2.47032822920623272088284396434110686182e-324".parse::<f64>(); // Err(ParseFloatError { kind: Invalid })
```

# Solution

This pull request implements the Eisel-Lemire algorithm, modified from [fast-float-rust](https://github.com/aldanor/fast-float-rust) (which is licensed under Apache 2.0/MIT), along with numerous modifications to make it more amenable to inclusion in the Rust core library. The following describes both features in fast-float-rust and improvements in fast-float-rust for inclusion in core.

**Documentation**

Extensive documentation has been added to ensure the code base may be maintained by others, which explains the algorithms as well as various associated constants and routines. For example, two seemingly magical constants include documentation to describe how they were derived as follows:

```rust
// Round-to-even only happens for negative values of q
// when q ≥ −4 in the 64-bit case and when q ≥ −17 in
// the 32-bit case.
//
// When q ≥ 0, we have that 5^q ≤ 2m+1. In the 64-bit case, we
// have 5^q ≤ 2m+1 ≤ 2^54 or q ≤ 23. In the 32-bit case, we have
// 5^q ≤ 2m+1 ≤ 2^25 or q ≤ 10.
//
// When q < 0, we have w ≥ (2m+1)×5^−q. We must have that w < 2^64
// so (2m+1)×5^−q < 2^64. We have that 2m+1 > 2^53 (64-bit case)
// or 2m+1 > 2^24 (32-bit case). Hence, we must have 2^53×5^−q < 2^64
// (64-bit) and 2^24×5^−q < 2^64 (32-bit). Hence we have 5^−q < 2^11
// or q ≥ −4 (64-bit case) and 5^−q < 2^40 or q ≥ −17 (32-bit case).
//
// Thus we have that we only need to round ties to even when
// we have that q ∈ [−4,23] (in the 64-bit case) or q ∈ [−17,10]
// (in the 32-bit case). In both cases, the power of five (5^|q|)
// fits in a 64-bit word.
const MIN_EXPONENT_ROUND_TO_EVEN: i32;
const MAX_EXPONENT_ROUND_TO_EVEN: i32;
```

This ensures maintainability of the code base.

**Improvements for Disguised Fast-Path Cases**

The fast path in float parsing algorithms attempts to use native, machine floats to represent both the significant digits and the exponent, which is only possible if both can be exactly represented without rounding. In practice, this means that the significant digits must be 53-bits or less and the exponent must be in the range `[-22, 22]` (for an f64). This is similar to the existing dec2flt implementation.

However, disguised fast-path cases exist, where there are few significant digits and an exponent above the valid range, such as `1.23e25`. In this case, powers-of-10 may be shifted from the exponent to the significant digits, discussed at length in rust-lang#85198.

**Digit Parsing Improvements**

Typically, integers are parsed from string 1-at-a-time, requiring unnecessary multiplications which can slow down parsing. An approach to parse 8 digits at a time using only 3 multiplications is described at length [here](https://johnnylee-sde.github.io/Fast-numeric-string-to-int/). This leads to significant performance improvements, and is implemented for both big and little-endian systems.

**Unsafe Changes**

Relative to fast-float-rust, this library makes less use of unsafe functionality and clearly documents it. This includes the refactoring and documentation of numerous unsafe methods undesirably marked as safe. The original code would look something like this, which is deceptively marked as safe for unsafe functionality.

```rust
impl AsciiStr {
    #[inline]
    pub fn step_by(&mut self, n: usize) -> &mut Self {
        unsafe { self.ptr = self.ptr.add(n) };
        self
    }
}

...

#[inline]
fn parse_scientific(s: &mut AsciiStr<'_>) -> i64 {
    // the first character is 'e'/'E' and scientific mode is enabled
    let start = *s;
    s.step();
    ...
}
```

The new code clearly documents safety concerns, and does not mark unsafe functionality as safe, leading to better safety guarantees.

```rust
impl AsciiStr {
    /// Advance the view by n, advancing it in-place to (n..).
    pub unsafe fn step_by(&mut self, n: usize) -> &mut Self {
        // SAFETY: same as step_by, safe as long n is less than the buffer length
        self.ptr = unsafe { self.ptr.add(n) };
        self
    }
}

...

/// Parse the scientific notation component of a float.
fn parse_scientific(s: &mut AsciiStr<'_>) -> i64 {
    let start = *s;
    // SAFETY: the first character is 'e'/'E' and scientific mode is enabled
    unsafe {
        s.step();
    }
    ...
}
```

This allows us to trivially demonstrate the new implementation of dec2flt is safe.

**Inline Annotations Have Been Removed**

In the previous implementation of dec2flt, inline annotations exist practically nowhere in the entire module. Therefore, these annotations have been removed, which mostly does not impact [performance](aldanor/fast-float-rust#15 (comment)).

**Fixed Correctness Tests**

Numerous compile errors in `src/etc/test-float-parse` were present, due to deprecation of `time.clock()`, as well as the crate dependencies with `rand`. The tests have therefore been reworked as a [crate](https://github.com/Alexhuszagh/rust/tree/master/src/etc/test-float-parse), and any errors in `runtests.py` have been patched.

**Undefined Behavior**

An implementation of `check_len` which relied on undefined behavior (in fast-float-rust) has been refactored, to ensure that the behavior is well-defined. The original code is as follows:

```rust
#[inline]
pub fn check_len(&self, n: usize) -> bool {
    unsafe { self.ptr.add(n) <= self.end }
}
```

And the new implementation is as follows:

```rust
/// Check if the slice at least `n` length.
fn check_len(&self, n: usize) -> bool {
    n <= self.as_ref().len()
}
```

Note that this has since been fixed in [fast-float-rust](aldanor/fast-float-rust#29).

**Inferring Binary Exponents**

Rather than explicitly store binary exponents, this new implementation infers them from the decimal exponent, reducing the amount of static storage required. This removes the requirement to store [611 i16s](https://github.com/rust-lang/rust/blob/868c702d0c9a471a28fb55f0148eb1e3e8b1dcc5/library/core/src/num/dec2flt/table.rs#L8).

# Code Size

The code size, for all optimizations, does not considerably change relative to before for stripped builds, however it is **significantly** smaller prior to stripping the resulting binaries. These binary sizes were calculated on x86_64-unknown-linux-gnu.

**new**

Using rustc version 1.55.0-dev.

| opt-level | size | size(stripped) |
|:-:|:-:|:-:|
| 0 | 400k | 300K |
| 1 | 396k | 292K |
| 2 | 392k | 292K |
| 3 | 392k | 296K |
| s | 396k | 292K |
| z | 396k | 292K |

**old**

Using rustc version 1.53.0-nightly.

| opt-level | size | size(stripped) |
|:-:|:-:|:-:|
| 0 | 3.2M | 304K |
| 1 | 3.2M | 292K |
| 2 | 3.1M | 284K |
| 3 | 3.1M | 284K |
| s | 3.1M | 284K |
| z | 3.1M | 284K |

# Correctness

The dec2flt implementation passes all of Rust's unittests and comprehensive float parsing tests, along with numerous other tests such as Nigel Tao's comprehensive float [tests](https://github.com/nigeltao/parse-number-fxx-test-data) and Hrvoje Abraham's [strtod_tests](https://github.com/ahrvoje/numerics/blob/master/strtod/strtod_tests.toml). Therefore, it is unlikely that this algorithm will incorrectly round parsed floats.

# Issues Addressed

This will fix and close the following issues:

- resolves rust-lang#85198
- resolves rust-lang#85214
- resolves rust-lang#85234
- fixes rust-lang#31407
- fixes rust-lang#31109
- fixes rust-lang#53015
- resolves rust-lang#68396
- closes aldanor/fast-float-rust#15

Alexhuszagh mentioned this issue May 12, 2021

Consider integration in core and std ? aldanor/fast-float-rust#15

Closed

bjorn3 added the A-floating-point, C-enhancement, and T-libs labels May 12, 2021

Alexhuszagh mentioned this issue Jun 30, 2021

Update Rust Float-Parsing Algorithms to use the Eisel-Lemire algorithm. #86761

Merged

bors closed this as completed in 8752b40 Jul 17, 2021
