seq: use BigDecimal to represent floats #2698

jfinkels · 2021-10-02T02:19:20Z

Use BigDecimal to represent arbitrary precision floats in order to
prevent numerical precision issues when iterating over a sequence of
numbers. This commit makes several changes at once to accomplish this
goal.

First, it creates a new struct, PreciseNumber, that is responsible for
storing not only the number itself but also the number of digits (both
integer and decimal) needed to display it. This information is collected
at the time of parsing the number, which lives in the new
numberparse.rs module.

Second, it uses the BigDecimal struct to store arbitrary precision
floating point numbers instead of the previous f64 primitive
type. This protects against issues of numerical precision when
repeatedly accumulating a very small increment.

Third, since neither the BigDecimal nor BigInt types have a
representation of infinity, minus infinity, minus zero, or NaN, we add
the ExtendedBigDecimal and ExtendedBigInt enumerations which extend
the basic types with these concepts.

jfinkels · 2021-10-02T02:57:31Z

There are some remaining issues due to parsing/formatting of NaNs and -0.0 not being implemented in Rust versions prior to 1.53.0; see this pull request rust-lang/rust#78618. I'm confident that I can make this work with some adjustments though.

sylvestre · 2021-10-02T08:28:24Z

MinRustV is failing (interesting issue :)

jfinkels · 2021-10-05T01:21:57Z

I updated this branch with changes that I hope support Rust v1.47.0. The updates in v1.53.0 that allow parsing and displaying floating point negative zero would make the code much simpler, but I suppose that will have to wait!

jfinkels · 2021-10-06T00:48:45Z

I have added an additional commit that adds a missing test case: the use of -w with both negative and positive floats in the sequence.

jfinkels · 2021-10-09T22:44:11Z

@sylvestre it looks like the merge commit you added may have broken something. Should I try to fix that?

jfinkels · 2021-10-11T17:04:15Z

I will update this pull request due to the new feature added in pull request #2701.

jfinkels · 2021-10-11T18:07:14Z

tests/by-util/test_seq.rs

@@ -60,7 +60,7 @@ fn test_hex_identifier_in_wrong_place() {
        .args(&["1234ABCD0x"])
        .fails()
        .no_stdout()
-        .stderr_contains("invalid hexadecimal argument: '1234ABCD0x'")
+        .stderr_contains("invalid floating point argument: '1234ABCD0x'")


I have changed the error message here (and in the tests above) to match the behavior of GNU seq:

$ seq ff0x seq: invalid floating point argument: ‘ff0x’ Try 'seq --help' for more information.

It makes the implementation simpler, since anything that does not start with "0x" or "-0x" follows the code path for parsing a base ten number, eventually resulting in a ParseNumberError::Float.

sylvestre · 2021-10-23T21:48:26Z

Is it expected that it has no impact on the GNU tests?

Changes from master: PASS +0 / FAIL +0 / ERROR +0 / SKIP +0

jfinkels · 2021-10-24T15:15:11Z

I don't remember anymore, I'll double check the GNU tests and report back here.

jfinkels · 2021-10-24T15:26:58Z

This pull request will resolve a few of the test cases but not all of them. There are still a few test failures in at least the tests/misc/seq.pl file and tests/misc/seq-precision.sh file, among others. Here are two test cases that I expect to remain failing after this pull request.

The tests/misc/seq.pl file will remain failing because we don't yet support the -f option (see issue seq: missing -f option #2616): https://github.com/coreutils/coreutils/blob/f60a3981c31519d03fd2be19e006349ea00b4464/tests/misc/seq.pl#L87-L99
The tests/misc/seq-precision.sh file will remain failing because we don't yet support hexadecimal floating point numbers (see issue seq: add support for parsing hexadecimal floating point inputs #2660): https://github.com/coreutils/coreutils/blob/f60a3981c31519d03fd2be19e006349ea00b4464/tests/misc/seq-precision.sh#L52-L55

Use `BigDecimal` to represent arbitrary precision floats in order to prevent numerical precision issues when iterating over a sequence of numbers. This commit makes several changes at once to accomplish this goal. First, it creates a new struct, `PreciseNumber`, that is responsible for storing not only the number itself but also the number of digits (both integer and decimal) needed to display it. This information is collected at the time of parsing the number, which lives in the new `numberparse.rs` module. Second, it uses the `BigDecimal` struct to store arbitrary precision floating point numbers instead of the previous `f64` primitive type. This protects against issues of numerical precision when repeatedly accumulating a very small increment. Third, since neither the `BigDecimal` nor `BigInt` types have a representation of infinity, minus infinity, minus zero, or NaN, we add the `ExtendedBigDecimal` and `ExtendedBigInt` enumerations which extend the basic types with these concepts.

miDeb

I like this change, just a few comments. Have you done any benchmarks to check how performance looks like? I'd obviously expect BigDecimal to be much slower than using f64 but I'm interested in the effect of wrapping BigInt in ExtendedBigInt.

miDeb · 2021-11-01T23:20:04Z

src/uu/seq/src/extendedbigdecimal.rs

+            (Self::BigDecimal(_), Self::MinusInfinity) => false,
+            (Self::BigDecimal(_), Self::Infinity) => false,
+            (Self::BigDecimal(_), Self::Nan) => false,
+            (Self::BigDecimal(_), Self::MinusZero) => false,


I think positive 0 and negative 0 should compare equal.

I'll make that change.

Oh actually, this is intentionally false.

The reason I am representing MinusZero as a distinct concept at all is because if -0.0 is given in the input as the starting number, it must be displayed as -0.0 in the output. Because -0.0 gets rendered as "0.0" in older versions of Rust, we need to know whether the value is ExtendedBigDecimal::MinusZero or the value is ExtendedBigDecimal::BigDecimal(BigDecimal::zero()). The equality comparison happens when we are deciding how to render the current value in write_value_float():

if *value == ExtendedBigDecimal::MinusZero && is_first_iteration {

https://github.com/uutils/coreutils/pull/2698/files#diff-8ef9575cb5c2b35e15751308bf63d6dbf3ac217a383790f9355bc5e13ab2fdb9R231

If the BigDecimal::zero() and MinusZero were to compare equal, then the write_value_float() would write "-0.0" when it should be writing "0.0". For example,

$ ./target/debug/coreutils seq -w 0.0 1 -0.0 1.0

We don't want that, so we need to ensure that negative zero and positive zero are considered different things.

When the minimum Rust version becomes large enough, we may be able to eliminate much of the special handling code for negative zero.

miDeb · 2021-11-01T23:20:50Z

src/uu/seq/src/extendedbigdecimal.rs

+            (Self::MinusInfinity, Self::MinusInfinity) => true,
+            (Self::MinusInfinity, Self::Nan) => false,
+            (Self::Nan, _) => false,
+            (Self::MinusZero, Self::BigDecimal(_)) => false,


miDeb · 2021-11-01T23:23:56Z

src/uu/seq/src/extendedbigdecimal.rs

+                // zero. See
+                // https://github.com/rust-lang/rust/pull/78618. Currently,
+                // this just formats "0.0".
+                (0.0f32).fmt(f)


So we want to change this to (-0.0f32).fmt(f) at a later point?

Yes, I believe so.

src/uu/seq/src/seq.rs

miDeb · 2021-11-02T19:34:08Z

src/uu/seq/src/seq.rs

+    write!(writer, "{}", value_as_str)
+}
+
+// TODO `print_seq()` and `print_seq_integers()` are nearly identical,


Maybe we could create a trait that both ExtendedBigInt and ExtendedBigDecimal implement and make write_valuegeneric, but we can do this at a later point.

Yes definitely. If this pull request gets merged, I propose we open a new issue for that change. This diff is already larger than I usually like to produce.

src/uu/seq/src/seq.rs

jfinkels · 2021-11-05T01:44:14Z

I have not yet done any benchmarks. I can work on that next.

jfinkels · 2021-11-06T13:52:12Z

I have made the changes suggested by @miDeb, and provided a response to the suggestion about negative zero and positive zero comparing equal. Since it came up, I have added a simple BENCHMARKING.md file describing how to use hyperfine to benchmark the tool.

Finally, here's a quick comparison between the GNU version of seq, the uutils version on the current master branch (commit a7aa6b8), and the version on this branch, seq-bigdecimal-extended-numbers.

When enumerating the floating point numbers between 0 and 100,000 by increments of 0.1, the master branch version is fastest, the GNU version is 1.23 times slower, and this branch's version is 1.72 times slower:

$ hyperfine --warmup 10 "seq 0 0.1 100000 > /dev/null" "./seq-master 0 0.1 100000 > /dev/null" "./seq-branch 0 0.1 100000 > /dev/null"
Benchmark #1: seq 0 0.1 100000 > /dev/null
  Time (mean ± σ):     353.7 ms ±   0.8 ms    [User: 353.3 ms, System: 0.5 ms]
  Range (min … max):   352.8 ms … 355.3 ms    10 runs
 
Benchmark #2: ./seq-master 0 0.1 100000 > /dev/null
  Time (mean ± σ):     276.8 ms ±   5.8 ms    [User: 202.6 ms, System: 74.2 ms]
  Range (min … max):   271.7 ms … 288.6 ms    11 runs
 
Benchmark #3: ./seq-branch 0 0.1 100000 > /dev/null
  Time (mean ± σ):     477.0 ms ±   6.3 ms    [User: 391.3 ms, System: 85.6 ms]
  Range (min … max):   469.8 ms … 491.5 ms    10 runs
 
Summary
  './seq-master 0 0.1 100000 > /dev/null' ran
    1.28 ± 0.03 times faster than 'seq 0 0.1 100000 > /dev/null'
    1.72 ± 0.04 times faster than './seq-branch 0 0.1 100000 > /dev/null'

When enumerating the integers between 0 and 2,000,000, the GNU version is fastest, the master branch version is 31 times slower, and this branch's version is 40 times slower.

$ hyperfine --warmup 20 "seq 0 2000000 > /dev/null" "./seq-master 2000000 > /dev/null" "./seq-branch 2000000 > /dev/null"
Benchmark #1: seq 0 2000000 > /dev/null
  Time (mean ± σ):      14.8 ms ±   0.7 ms    [User: 14.7 ms, System: 0.4 ms]
  Range (min … max):    13.9 ms …  16.9 ms    198 runs
 
Benchmark #2: ./seq-master 2000000 > /dev/null
  Time (mean ± σ):     462.1 ms ±   2.1 ms    [User: 303.9 ms, System: 158.3 ms]
  Range (min … max):   458.5 ms … 463.9 ms    10 runs
 
Benchmark #3: ./seq-branch 2000000 > /dev/null
  Time (mean ± σ):     604.0 ms ±  12.4 ms    [User: 449.5 ms, System: 154.6 ms]
  Range (min … max):   592.0 ms … 636.0 ms    10 runs
 
Summary
  'seq 0 2000000 > /dev/null' ran
   31.25 ± 1.59 times faster than './seq-master 2000000 > /dev/null'
   40.85 ± 2.23 times faster than './seq-branch 2000000 > /dev/null'

My intent in making this pull request was to provide correctness, not efficiency. A future pull request could make optimizations. For example, in the common case of enumerating small positive integers, much simpler code could be used.

miDeb · 2021-11-06T14:44:45Z

Thanks for the very detailed response! I agree that we can do some optimizations later. I think my main concern is that wrapping the BigInt might not be necessary because we only need it for the negative zero case, which was previously handled by a single flag. Likewise wrapping BigDecimal might be avoidable as well, but again that's something we can think about later.

vulppine mentioned this pull request Oct 4, 2021

seq: Adds hexadecimal integer parsing #2701

Merged

jfinkels force-pushed the seq-bigdecimal-extended-numbers branch from 54bca04 to 499dde4 Compare October 5, 2021 01:19

jfinkels force-pushed the seq-bigdecimal-extended-numbers branch from 499dde4 to 0e90c48 Compare October 5, 2021 01:29

jfinkels force-pushed the seq-bigdecimal-extended-numbers branch from f44fb7c to 3379817 Compare October 11, 2021 18:04

jfinkels commented Oct 11, 2021

View reviewed changes

jfinkels force-pushed the seq-bigdecimal-extended-numbers branch from 3379817 to c2c2622 Compare October 30, 2021 00:57

miDeb reviewed Nov 2, 2021

View reviewed changes

jfinkels added 3 commits November 4, 2021 20:38

fixup! seq: use BigDecimal to represent floats

d07c15e

fixup! seq: use BigDecimal to represent floats

ba9b4fc

fixup! seq: use BigDecimal to represent floats

38d444e

jfinkels added 2 commits November 6, 2021 09:32

fixup! seq: use BigDecimal to represent floats

57787a8

fixup! seq: use BigDecimal to represent floats

bd8600b

miDeb merged commit 2e12316 into uutils:master Nov 6, 2021

jfinkels mentioned this pull request Jan 15, 2022

seq: eliminate special handling of -0.0 #2870

Closed

jfinkels deleted the seq-bigdecimal-extended-numbers branch January 29, 2022 01:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

seq: use BigDecimal to represent floats #2698

seq: use BigDecimal to represent floats #2698

jfinkels commented Oct 2, 2021

jfinkels commented Oct 2, 2021

sylvestre commented Oct 2, 2021

jfinkels commented Oct 5, 2021

jfinkels commented Oct 6, 2021

jfinkels commented Oct 9, 2021

jfinkels commented Oct 11, 2021

jfinkels Oct 11, 2021

sylvestre commented Oct 23, 2021

jfinkels commented Oct 24, 2021

jfinkels commented Oct 24, 2021

miDeb left a comment

miDeb Nov 1, 2021

jfinkels Nov 5, 2021

jfinkels Nov 5, 2021

miDeb Nov 1, 2021

miDeb Nov 1, 2021

jfinkels Nov 5, 2021

miDeb Nov 2, 2021

jfinkels Nov 5, 2021

jfinkels commented Nov 5, 2021

jfinkels commented Nov 6, 2021

miDeb commented Nov 6, 2021

seq: use BigDecimal to represent floats #2698

seq: use BigDecimal to represent floats #2698

Conversation

jfinkels commented Oct 2, 2021

jfinkels commented Oct 2, 2021

sylvestre commented Oct 2, 2021

jfinkels commented Oct 5, 2021

jfinkels commented Oct 6, 2021

jfinkels commented Oct 9, 2021

jfinkels commented Oct 11, 2021

Choose a reason for hiding this comment

sylvestre commented Oct 23, 2021

jfinkels commented Oct 24, 2021

jfinkels commented Oct 24, 2021

miDeb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jfinkels commented Nov 5, 2021

jfinkels commented Nov 6, 2021

miDeb commented Nov 6, 2021