Improved Performance for Disguised Fast-Path Cases in Float Parsing

# Summary

Rust's float-parsing algorithm, dec2flt, uses a slower algorithm than necessary to parse numbers like `"1.2345e30"`, which can slow down parsing by nearly 300%. Trivial changes to dec2flt lead to dramatically improved parsing times for these cases, without increasing binary sizes or slowing down other parse cases. Please see the "Sample Repository" section below for the exact specifics, or to replicate these changes. This is an [initial attempt](https://internals.rust-lang.org/t/implementing-a-fast-correct-float-parser/14670) as part of an ongoing effort to speed up float parsing in Rust, and aims to integrate algorithms I've implemented (currently used in nom and serde-json) back into the core library.

# Issue

When parsing floating-point numbers, there is a fast-path algorithm that uses native floats to parse the float if applicable. This only occurs if:
- The significant digits of the float, or mantissa, can be represented in `mantissa_size+1` bits.
- The power of 10 for the exponent can be exactly represented, that is, the absolute value of the exponent is at most `⌊(mantissa_size+1) / log2(5)⌋`.

Please note that this is the exponent relative to the significant digits, for example, for `"1.2345e5"`, this exponent would be `1`, but for `"12345e5"` this exponent would be `5`.
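To make the relative exponent concrete, here is a minimal sketch of the decomposition. The `decompose` helper is hypothetical (it is not part of dec2flt) and only handles simple, unsigned inputs of the form `digits[.digits][e±digits]`:

```rust
/// Hypothetical helper: split a simple decimal string such as "1.2345e5"
/// into its significant digits and the exponent relative to those digits.
fn decompose(s: &str) -> Option<(u64, i64)> {
    // Separate the written exponent, defaulting to 0 when there is none.
    let (mantissa, exp) = match s.split_once('e') {
        Some((m, e)) => (m, e.parse::<i64>().ok()?),
        None => (s, 0),
    };
    // Separate the integral and fractional digits.
    let (integral, fractional) = mantissa.split_once('.').unwrap_or((mantissa, ""));
    let digits: u64 = format!("{}{}", integral, fractional).parse().ok()?;
    // Every fractional digit lowers the relative exponent by one.
    Some((digits, exp - fractional.len() as i64))
}

fn main() {
    assert_eq!(decompose("1.2345e5"), Some((12345, 1)));
    assert_eq!(decompose("12345e5"), Some((12345, 5)));
}
```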

The reason why we use `mantissa_size+1` is due to the implicit, hidden bit of the float. A longer post detailing the attempts to improve float parsing on rust-internals can be found [here](https://internals.rust-lang.org/t/implementing-a-fast-correct-float-parser/14670). The exact values for `f32` and `f64` are as follows:

**f32:**
- significant digit bits: 24
- exponent range: `[-10, 10]`

**f64:**
- significant digit bits: 53
- exponent range: `[-22, 22]`
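The exponent limits follow from the fact that `10^n = 2^n * 5^n`: the factor `2^n` only changes the binary exponent, so `10^n` is exactly representable whenever `5^n` fits in the significant bits. A small sketch that derives the limits with integer arithmetic (the function name is made up for illustration):

```rust
/// Largest `n` such that `5^n < 2^sig_bits`, which equals
/// `floor(sig_bits / log2(5))`. Integer arithmetic avoids rounding issues.
fn max_fast_path_exponent(sig_bits: u32) -> u32 {
    let limit: u128 = 1u128 << sig_bits;
    let mut power: u128 = 1;
    let mut n = 0;
    // Keep multiplying by 5 while the next power still fits below the limit.
    while power * 5 < limit {
        power *= 5;
        n += 1;
    }
    n
}

fn main() {
    assert_eq!(max_fast_path_exponent(24), 10); // f32
    assert_eq!(max_fast_path_exponent(53), 22); // f64
}
```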

However, there is an exception: if the value has fewer significant bits than the maximum, but has an exponent larger than our range, we can shift powers-of-10 from the exponent to the significant digits. For example, `"1.2345e30"` would have significant digits of `12345` and an exponent of `26`, which is outside our range of `[-22, 22]`. However, if we shift `10^4` from the exponent to the significant digits, we get significant digits of `123450000` and an exponent of `22`, which is a valid fast-path case. This leads to a massive performance improvement with a large number of real-world float cases, and has an insignificant impact on other cases.
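As a rough, standalone illustration of the shift for `f64` only, consider the sketch below. The names `fast_path_f64`, `MAX_SIG`, `MAX_EXP`, `MAX_SHIFT` and `POW10` are invented for this example, and it takes already-decoded significant digits rather than byte slices, so it is not the code from dec2flt:

```rust
/// Largest significand accepted here: it must fit in 53 bits.
const MAX_SIG: u64 = (1u64 << 53) - 1;
/// Largest power-of-ten exponent an `f64` represents exactly.
const MAX_EXP: i64 = 22;
/// Most powers of ten that can move into the significand (10^15 < 2^53).
const MAX_SHIFT: i64 = 15;
/// Exactly representable powers of ten, 10^0 through 10^22.
const POW10: [f64; 23] = [
    1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22,
];

/// Hypothetical helper: compute `sig * 10^exp`, or return `None` when the
/// input is outside both the regular and the disguised fast path.
fn fast_path_f64(mut sig: u64, mut exp: i64) -> Option<f64> {
    if sig > MAX_SIG || exp < -MAX_EXP || exp > MAX_EXP + MAX_SHIFT {
        return None;
    }
    if exp > MAX_EXP {
        // Disguised case: move the excess powers of ten into the significand.
        let shift = (exp - MAX_EXP) as u32;
        sig = sig.checked_mul(10u64.pow(shift))?;
        if sig > MAX_SIG {
            return None;
        }
        exp = MAX_EXP;
    }
    // Both operands are exact, so one multiplication or division rounds once.
    let pow10 = POW10[exp.abs() as usize];
    Some(if exp < 0 { sig as f64 / pow10 } else { sig as f64 * pow10 })
}

fn main() {
    // "1.2345e30" has significant digits 12345 and relative exponent 26.
    assert_eq!(fast_path_f64(12345, 26), Some(1.2345e30));
    // Too large an exponent: even the disguised fast path does not apply.
    assert_eq!(fast_path_f64(12345, 40), None);
}
```

Inputs that return `None` would fall through to the slower algorithms, exactly as they do without the shift.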

# Binary Sizes

These were compiled on a target of `x86_64-unknown-linux-gnu`, running kernel version `5.11.16-100`, on a Rust version of `rustc 1.53.0-nightly (132b4e5d1 2021-04-13)`. The sizes reflect the binary sizes reported by `ls -sh`, both before and after running the `strip` command. The debug profile was used for opt-levels `0` and `1`, and was as follows:

```toml
[profile.dev]
opt-level = "..."
debug = true
lto = false
```

The release profile was used for opt-levels `2`, `3`, `s` and `z` and was as follows:

```toml
[profile.release]
opt-level = "..."
debug = false
debug-assertions = false
lto = true
```

**core**

These are the binary sizes prior to making changes.

|opt-level|size|size(stripped)|
|:-:|:-:|:-:|
|0|3.6M|360K|
|1|3.5M|316K|
|2|1.3M|236K|
|3|1.3M|248K|
|s|1.3M|244K|
|z|1.3M|248K|

**disguised**

These are the binary sizes after making changes to speed up disguised fast-path cases.

|opt-level|size|size(stripped)|
|:-:|:-:|:-:|
|0|3.6M|360K|
|1|3.5M|316K|
|2|1.3M|236K|
|3|1.3M|248K|
|s|1.3M|252K|
|z|1.3M|248K|

# Performance

Overall, the changes reduced the parse time for disguised fast-path cases by roughly 74% relative to core (129.86ns to 33.813ns), while the other benchmarks changed by about 6% or less in either direction.

These benchmarks were run on an `i7-6560U CPU @ 2.20GHz`, on a target of `x86_64-unknown-linux-gnu`, running kernel version `5.11.16-100`, on a Rust version of `rustc 1.53.0-nightly (132b4e5d1 2021-04-13)`. The performance CPU governor was used for all benchmarks, which were run consecutively on A/C power with only tmux and Sublime Text open. The floats that were parsed are as follows:

```rust
// Example fast-path value.
const FAST: &str = "1.2345e22";
// Example disguised fast-path value.
const DISGUISED: &str = "1.2345e30";
// Example moderate path value: clearly not halfway `1 << 53`.
const MODERATE: &str = "9007199254740992.0";
// Example exactly-halfway value `(1<<53) + 1`.
const HALFWAY: &str = "9007199254740993.0";
// Example large, near-halfway value.
const LARGE: &str = "8.988465674311580536566680e307";
// Example denormal, near-halfway value.
const DENORMAL: &str = "8.442911973260991817129021e-309";
```

**core**

These are the benchmarks prior to making changes.

|float|time|
|:-:|:-:|
|fast|32.952ns|
|disguised|129.86ns|
|moderate|237.08ns|
|halfway|371.21ns|
|large|287.81us|
|denormal|122.36us|

**disguised**

These are the benchmarks after making changes to speed up disguised fast-path cases.

|float|time|
|:-:|:-:|
|fast|32.572ns|
|disguised|33.813ns|
|moderate|233.03ns|
|halfway|350.99ns|
|large|300.29us|
|denormal|129.36us|

# Correctness Concerns

None, since this merely transfers powers-of-10 from the exponent to the significant digits using integer multiplication, and can therefore trivially be verified for correctness.
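One way to spot-check this claim is to compare the shifted computation against the standard library's parser, bit for bit, over a sweep of disguised inputs. The sketch below is an illustrative check for `f64` only, not part of the proposed change, and the function name is made up:

```rust
/// Compares the shifted fast-path result with `str::parse` for a sweep of
/// disguised inputs. Returns how many inputs were checked; panics on mismatch.
fn check_disguised_cases() -> usize {
    let mut checked = 0;
    for sig in (1u64..=99_999).step_by(7) {
        for exp in 23i64..=37 {
            let shift = (exp - 22) as u32;
            // Skip inputs whose shifted significand no longer fits in 53 bits.
            let shifted = match sig.checked_mul(10u64.pow(shift)) {
                Some(v) if v < (1u64 << 53) => v,
                _ => continue,
            };
            // Shifted significand times 10^22: both operands are exact.
            let fast = shifted as f64 * 1e22;
            let reference: f64 = format!("{}e{}", sig, exp).parse().unwrap();
            assert_eq!(fast.to_bits(), reference.to_bits(), "{}e{}", sig, exp);
            checked += 1;
        }
    }
    checked
}

fn main() {
    println!("checked {} disguised inputs", check_disguised_cases());
}
```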

# Changes

The diff, which would be relative to `library/core/src/num`, is as follows:

```diff
diff --git a/src/dec2flt/algorithm.rs b/src/dec2flt/algorithm.rs
index 2b0b4cb..76d8105 100644
--- a/src/dec2flt/algorithm.rs
+++ b/src/dec2flt/algorithm.rs
@@ -110,7 +110,7 @@ mod fpu_precision {
 ///
 /// This is extracted into a separate function so that it can be attempted before constructing
 /// a bignum.
-pub fn fast_path<T: RawFloat>(integral: &[u8], fractional: &[u8], e: i64) -> Option<T> {
+pub fn fast_path<T: RawFloat>(integral: &[u8], fractional: &[u8], mut e: i64) -> Option<T> {
     let num_digits = integral.len() + fractional.len();
     // log_10(f64::MAX_SIG) ~ 15.95. We compare the exact value to MAX_SIG near the end,
     // this is just a quick, cheap rejection (and also frees the rest of the code from
@@ -118,14 +118,29 @@ pub fn fast_path<T: RawFloat>(integral: &[u8], fractional: &[u8], e: i64) -> Opt
     if num_digits > 16 {
         return None;
     }
-    if e.abs() >= T::CEIL_LOG5_OF_MAX_SIG as i64 {
+    let max_exp = T::FLOOR_LOG5_OF_MAX_SIG as i64;
+    let min_exp = -max_exp;
+    let shift_exp = T::FLOOR_LOG10_OF_MAX_SIG as i64;
+    let disguised_exp = max_exp + shift_exp;
+    if e < min_exp || e > disguised_exp {
         return None;
     }
-    let f = num::from_str_unchecked(integral.iter().chain(fractional.iter()));
+    let mut f = num::from_str_unchecked(integral.iter().chain(fractional.iter()));
     if f > T::MAX_SIG {
         return None;
     }
 
+    // Handle a disguised fast path case here.
+    if e > max_exp {
+        let shift = e - max_exp;
+        let value = f.checked_mul(T::short_int_pow10(shift as usize))?;
+        if value > T::MAX_SIG {
+            return None;
+        }
+        f = value;
+        e = max_exp;
+    }
+
     // The fast path crucially depends on arithmetic being rounded to the correct number of bits
     // without any intermediate rounding. On x86 (without SSE or SSE2) this requires the precision
     // of the x87 FPU stack to be changed so that it directly rounds to 64/32 bit.
diff --git a/src/dec2flt/rawfp.rs b/src/dec2flt/rawfp.rs
index a3acf3d..15a5839 100644
--- a/src/dec2flt/rawfp.rs
+++ b/src/dec2flt/rawfp.rs
@@ -73,13 +73,21 @@ pub trait RawFloat:
     /// represented, the other code in this module makes sure to never let that happen.
     fn from_int(x: u64) -> Self;
 
+    fn short_int_pow10(e: usize) -> u64 {
+        table::SHORT_POWERS[e]
+    }
+
     /// Gets the value 10<sup>e</sup> from a pre-computed table.
-    /// Panics for `e >= CEIL_LOG5_OF_MAX_SIG`.
+    /// Panics for `e >= FLOOR_LOG5_OF_MAX_SIG`.
     fn short_fast_pow10(e: usize) -> Self;
 
     /// What the name says. It's easier to hard code than juggling intrinsics and
     /// hoping LLVM constant folds it.
-    const CEIL_LOG5_OF_MAX_SIG: i16;
+    const FLOOR_LOG5_OF_MAX_SIG: i16;
+
+    /// What the name says. It's easier to hard code than juggling intrinsics and
+    /// hoping LLVM constant folds it.
+    const FLOOR_LOG10_OF_MAX_SIG: i16;
 
     // A conservative bound on the decimal digits of inputs that can't produce overflow or zero or
     /// subnormals. Probably the decimal exponent of the maximum normal value, hence the name.
@@ -147,7 +155,8 @@ impl RawFloat for f32 {
 
     const SIG_BITS: u8 = 24;
     const EXP_BITS: u8 = 8;
-    const CEIL_LOG5_OF_MAX_SIG: i16 = 11;
+    const FLOOR_LOG5_OF_MAX_SIG: i16 = 10;
+    const FLOOR_LOG10_OF_MAX_SIG: i16 = 7;
     const MAX_NORMAL_DIGITS: usize = 35;
     const INF_CUTOFF: i64 = 40;
     const ZERO_CUTOFF: i64 = -48;
@@ -196,7 +205,8 @@ impl RawFloat for f64 {
 
     const SIG_BITS: u8 = 53;
     const EXP_BITS: u8 = 11;
-    const CEIL_LOG5_OF_MAX_SIG: i16 = 23;
+    const FLOOR_LOG5_OF_MAX_SIG: i16 = 22;
+    const FLOOR_LOG10_OF_MAX_SIG: i16 = 15;
     const MAX_NORMAL_DIGITS: usize = 305;
     const INF_CUTOFF: i64 = 310;
     const ZERO_CUTOFF: i64 = -326;
diff --git a/src/dec2flt/table.rs b/src/dec2flt/table.rs
index 97b497e..bd9e53d 100644
--- a/src/dec2flt/table.rs
+++ b/src/dec2flt/table.rs
@@ -1234,6 +1234,30 @@ pub static POWERS: ([u64; 611], [i16; 611]) = (
     ],
 );
 
+#[rustfmt::skip]
+pub const SHORT_POWERS: [u64; 20] = [
+    1,
+    10,
+    100,
+    1000,
+    10000,
+    100000,
+    1000000,
+    10000000,
+    100000000,
+    1000000000,
+    10000000000,
+    100000000000,
+    1000000000000,
+    10000000000000,
+    100000000000000,
+    1000000000000000,
+    10000000000000000,
+    100000000000000000,
+    1000000000000000000,
+    10000000000000000000,
+];
+
 #[rustfmt::skip]
 pub const F32_SHORT_POWERS: [f32; 11] = [
     1e0,
```

I'd be happy to submit a pull request with these changes, if they are satisfactory to you.

# Sample Repository

I've created a simple, minimal repository tracking these changes on [rust-dec2flt](https://github.com/Alexhuszagh/rust-dec2flt), which has a [core branch](https://github.com/Alexhuszagh/rust-dec2flt/tree/core) that is identical to Rust's current implementation in the core library. The [disguised branch](https://github.com/Alexhuszagh/rust-dec2flt/tree/disguised) contains the changes to improve parsing speeds for disguised fast-path cases. I will also, if there is interest, gradually be making changes for the moderate and slow-path algorithms.
