Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seq: use BigDecimal to represent floats #2698

Merged
merged 6 commits into from
Nov 6, 2021

Conversation

jfinkels
Copy link
Collaborator

@jfinkels jfinkels commented Oct 2, 2021

Use BigDecimal to represent arbitrary precision floats in order to
prevent numerical precision issues when iterating over a sequence of
numbers. This commit makes several changes at once to accomplish this
goal.

First, it creates a new struct, PreciseNumber, that is responsible for
storing not only the number itself but also the number of digits (both
integer and decimal) needed to display it. This information is collected
at the time of parsing the number, which lives in the new
numberparse.rs module.

Second, it uses the BigDecimal struct to store arbitrary precision
floating point numbers instead of the previous f64 primitive
type. This protects against issues of numerical precision when
repeatedly accumulating a very small increment.

Third, since neither the BigDecimal nor BigInt types have a
representation of infinity, minus infinity, minus zero, or NaN, we add
the ExtendedBigDecimal and ExtendedBigInt enumerations which extend
the basic types with these concepts.

@jfinkels
Copy link
Collaborator Author

jfinkels commented Oct 2, 2021

There are some remaining issues due to parsing/formatting of NaNs and -0.0 not being implemented in Rust versions prior to 1.53.0; see this pull request rust-lang/rust#78618. I'm confident that I can make this work with some adjustments though.

@sylvestre
Copy link
Sponsor Contributor

MinRustV is failing (interesting issue :)

@jfinkels
Copy link
Collaborator Author

jfinkels commented Oct 5, 2021

I updated this branch with changes that I hope support Rust v1.47.0. The updates in v1.53.0 that allow parsing and displaying floating point negative zero would make the code much simpler, but I suppose that will have to wait!

@jfinkels jfinkels force-pushed the seq-bigdecimal-extended-numbers branch from 499dde4 to 0e90c48 Compare October 5, 2021 01:29
@jfinkels
Copy link
Collaborator Author

jfinkels commented Oct 6, 2021

I have added an additional commit that adds a missing test case: the use of -w with both negative and positive floats in the sequence.

@jfinkels
Copy link
Collaborator Author

jfinkels commented Oct 9, 2021

@sylvestre it looks like the merge commit you added may have broken something. Should I try to fix that?

@jfinkels
Copy link
Collaborator Author

I will update this pull request due to the new feature added in pull request #2701.

@jfinkels jfinkels force-pushed the seq-bigdecimal-extended-numbers branch from f44fb7c to 3379817 Compare October 11, 2021 18:04
@@ -60,7 +60,7 @@ fn test_hex_identifier_in_wrong_place() {
.args(&["1234ABCD0x"])
.fails()
.no_stdout()
.stderr_contains("invalid hexadecimal argument: '1234ABCD0x'")
.stderr_contains("invalid floating point argument: '1234ABCD0x'")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the error message here (and in the tests above) to match the behavior of GNU seq:

$ seq ff0x
seq: invalid floating point argument: ‘ff0x’
Try 'seq --help' for more information.

It makes the implementation simpler, since anything that does not start with "0x" or "-0x" follows the code path for parsing a base ten number, eventually resulting in a ParseNumberError::Float.

@sylvestre
Copy link
Sponsor Contributor

Is it expected that it has no impact on the GNU tests?

Changes from master: PASS +0 / FAIL +0 / ERROR +0 / SKIP +0

@jfinkels
Copy link
Collaborator Author

I don't remember anymore, I'll double check the GNU tests and report back here.

@jfinkels
Copy link
Collaborator Author

This pull request will resolve a few of the test cases but not all of them. There are still a few test failures in at least the tests/misc/seq.pl file and tests/misc/seq-precision.sh file, among others. Here are two test cases that I expect to remain failing after this pull request.

Use `BigDecimal` to represent arbitrary precision floats in order to
prevent numerical precision issues when iterating over a sequence of
numbers. This commit makes several changes at once to accomplish this
goal.

First, it creates a new struct, `PreciseNumber`, that is responsible for
storing not only the number itself but also the number of digits (both
integer and decimal) needed to display it. This information is collected
at the time of parsing the number, which lives in the new
`numberparse.rs` module.

Second, it uses the `BigDecimal` struct to store arbitrary precision
floating point numbers instead of the previous `f64` primitive
type. This protects against issues of numerical precision when
repeatedly accumulating a very small increment.

Third, since neither the `BigDecimal` nor `BigInt` types have a
representation of infinity, minus infinity, minus zero, or NaN, we add
the `ExtendedBigDecimal` and `ExtendedBigInt` enumerations which extend
the basic types with these concepts.
@jfinkels jfinkels force-pushed the seq-bigdecimal-extended-numbers branch from 3379817 to c2c2622 Compare October 30, 2021 00:57
Copy link
Contributor

@miDeb miDeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change, just a few comments. Have you done any benchmarks to check how performance looks like? I'd obviously expect BigDecimal to be much slower than using f64 but I'm interested in the effect of wrapping BigInt in ExtendedBigInt.

(Self::BigDecimal(_), Self::MinusInfinity) => false,
(Self::BigDecimal(_), Self::Infinity) => false,
(Self::BigDecimal(_), Self::Nan) => false,
(Self::BigDecimal(_), Self::MinusZero) => false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think positive 0 and negative 0 should compare equal.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make that change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh actually, this is intentionally false.

The reason I am representing MinusZero as a distinct concept at all is because if -0.0 is given in the input as the starting number, it must be displayed as -0.0 in the output. Because -0.0 gets rendered as "0.0" in older versions of Rust, we need to know whether the value is ExtendedBigDecimal::MinusZero or the value is ExtendedBigDecimal::BigDecimal(BigDecimal::zero()). The equality comparison happens when we are deciding how to render the current value in write_value_float():

if *value == ExtendedBigDecimal::MinusZero && is_first_iteration {

https://github.com/uutils/coreutils/pull/2698/files#diff-8ef9575cb5c2b35e15751308bf63d6dbf3ac217a383790f9355bc5e13ab2fdb9R231

If the BigDecimal::zero() and MinusZero were to compare equal, then the write_value_float() would write "-0.0" when it should be writing "0.0". For example,

$ ./target/debug/coreutils seq -w 0.0 1
-0.0
1.0

We don't want that, so we need to ensure that negative zero and positive zero are considered different things.

When the minimum Rust version becomes large enough, we may be able to eliminate much of the special handling code for negative zero.

(Self::MinusInfinity, Self::MinusInfinity) => true,
(Self::MinusInfinity, Self::Nan) => false,
(Self::Nan, _) => false,
(Self::MinusZero, Self::BigDecimal(_)) => false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

// zero. See
// https://github.com/rust-lang/rust/pull/78618. Currently,
// this just formats "0.0".
(0.0f32).fmt(f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we want to change this to (-0.0f32).fmt(f) at a later point?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe so.

src/uu/seq/src/seq.rs Outdated Show resolved Hide resolved
src/uu/seq/src/seq.rs Outdated Show resolved Hide resolved
write!(writer, "{}", value_as_str)
}

// TODO `print_seq()` and `print_seq_integers()` are nearly identical,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could create a trait that both ExtendedBigInt and ExtendedBigDecimal implement and make write_valuegeneric, but we can do this at a later point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes definitely. If this pull request gets merged, I propose we open a new issue for that change. This diff is already larger than I usually like to produce.

src/uu/seq/src/seq.rs Outdated Show resolved Hide resolved
@jfinkels
Copy link
Collaborator Author

jfinkels commented Nov 5, 2021

I have not yet done any benchmarks. I can work on that next.

@jfinkels
Copy link
Collaborator Author

jfinkels commented Nov 6, 2021

I have made the changes suggested by @miDeb, and provided a response to the suggestion about negative zero and positive zero comparing equal. Since it came up, I have added a simple BENCHMARKING.md file describing how to use hyperfine to benchmark the tool.

Finally, here's a quick comparison between the GNU version of seq, the uutils version on the current master branch (commit a7aa6b8), and the version on this branch, seq-bigdecimal-extended-numbers.

When enumerating the floating point numbers between 0 and 100,000 by increments of 0.1, the master branch version is fastest, the GNU version is 1.23 times slower, and this branch's version is 1.72 times slower:

$ hyperfine --warmup 10 "seq 0 0.1 100000 > /dev/null" "./seq-master 0 0.1 100000 > /dev/null" "./seq-branch 0 0.1 100000 > /dev/null"
Benchmark #1: seq 0 0.1 100000 > /dev/null
  Time (mean ± σ):     353.7 ms ±   0.8 ms    [User: 353.3 ms, System: 0.5 ms]
  Range (min … max):   352.8 ms … 355.3 ms    10 runs
 
Benchmark #2: ./seq-master 0 0.1 100000 > /dev/null
  Time (mean ± σ):     276.8 ms ±   5.8 ms    [User: 202.6 ms, System: 74.2 ms]
  Range (min … max):   271.7 ms … 288.6 ms    11 runs
 
Benchmark #3: ./seq-branch 0 0.1 100000 > /dev/null
  Time (mean ± σ):     477.0 ms ±   6.3 ms    [User: 391.3 ms, System: 85.6 ms]
  Range (min … max):   469.8 ms … 491.5 ms    10 runs
 
Summary
  './seq-master 0 0.1 100000 > /dev/null' ran
    1.28 ± 0.03 times faster than 'seq 0 0.1 100000 > /dev/null'
    1.72 ± 0.04 times faster than './seq-branch 0 0.1 100000 > /dev/null'

When enumerating the integers between 0 and 2,000,000, the GNU version is fastest, the master branch version is 31 times slower, and this branch's version is 40 times slower.

$ hyperfine --warmup 20 "seq 0 2000000 > /dev/null" "./seq-master 2000000 > /dev/null" "./seq-branch 2000000 > /dev/null"
Benchmark #1: seq 0 2000000 > /dev/null
  Time (mean ± σ):      14.8 ms ±   0.7 ms    [User: 14.7 ms, System: 0.4 ms]
  Range (min … max):    13.9 ms …  16.9 ms    198 runs
 
Benchmark #2: ./seq-master 2000000 > /dev/null
  Time (mean ± σ):     462.1 ms ±   2.1 ms    [User: 303.9 ms, System: 158.3 ms]
  Range (min … max):   458.5 ms … 463.9 ms    10 runs
 
Benchmark #3: ./seq-branch 2000000 > /dev/null
  Time (mean ± σ):     604.0 ms ±  12.4 ms    [User: 449.5 ms, System: 154.6 ms]
  Range (min … max):   592.0 ms … 636.0 ms    10 runs
 
Summary
  'seq 0 2000000 > /dev/null' ran
   31.25 ± 1.59 times faster than './seq-master 2000000 > /dev/null'
   40.85 ± 2.23 times faster than './seq-branch 2000000 > /dev/null'

My intent in making this pull request was to provide correctness, not efficiency. A future pull request could make optimizations. For example, in the common case of enumerating small positive integers, much simpler code could be used.

@miDeb miDeb merged commit 2e12316 into uutils:master Nov 6, 2021
@miDeb
Copy link
Contributor

miDeb commented Nov 6, 2021

Thanks for the very detailed response! I agree that we can do some optimizations later. I think my main concern is that wrapping the BigInt might not be necessary because we only need it for the negative zero case, which was previously handled by a single flag. Likewise wrapping BigDecimal might be avoidable as well, but again that's something we can think about later.

@jfinkels jfinkels deleted the seq-bigdecimal-extended-numbers branch January 29, 2022 01:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants