Float string formatting with no specified presentation type behavior. #99694

Open
jagerber48 opened this issue Nov 22, 2022 · 18 comments
Labels
docs Documentation in the Doc dir

Comments

@jagerber48

jagerber48 commented Nov 22, 2022

Documentation

The string format specification mini-language is used to customize the presentation of floats (and other types) as strings. Details like the minimum field width and the precision of the float's representation can be controlled. Documentation for the mini-language is found at https://docs.python.org/3/library/string.html#format-specification-mini-language.

In the documentation it states that

For float this is the same as 'g', except that when fixed-point notation is used to format the result, it always includes at least one digit past the decimal point. The precision used is as large as needed to represent the given value faithfully.

However, I don't find this to be the case in practice:

value = 3141500.0
print(value)
print(f'{value:}')
print(f'{value:g}')

results in

3141500.0
3141500.0
3.1415e+06

Is there accurate documentation about the expected formatting of floats when presentation type is not specified? If not could someone describe the expected behavior?

@jagerber48 jagerber48 added the docs Documentation in the Doc dir label Nov 22, 2022
@mdickinson
Member

mdickinson commented Nov 22, 2022

The docs could possibly be clarified here; "same as 'g'" is intended to mean "the same style as 'g'-formatting", but without specifying the precision used. The precision is described later in the same paragraph:

The precision used is as large as needed to represent the given value faithfully.

So in the example you give, the relevant precision would be 7, so the text is supposed to indicate that the result is equivalent to formatting with .7g, apart from the extra ".0" already described.

>>> value = 3141500.0
>>> f"{value:.7g}"
'3141500'
>>> f"{value}"
'3141500.0'

Perhaps we'd do better not to mention the "g" format at all here, and to give a lower-level description of the behaviour.

@mdickinson
Member

OTOH, a precision of 5 is also large enough to represent the given value faithfully for this particular example, so that text is indeed misleading.
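
To make the ambiguity concrete, here is a minimal sketch that searches for the smallest 'g' precision whose output converts back to the same float. The helper name min_roundtrip_precision is made up for this illustration and is not part of Python; it assumes a finite float.

```python
def min_roundtrip_precision(x: float) -> int:
    """Smallest 'g' precision whose output converts back to exactly x.

    Hypothetical helper for illustration only; assumes x is finite.
    """
    for precision in range(1, 18):
        if float(format(x, f".{precision}g")) == x:
            return precision
    # Reached only for values that never compare equal to themselves (NaN).
    raise ValueError("x must be a finite float")


value = 3141500.0
p = min_roundtrip_precision(value)
print(p, format(value, f".{p}g"))  # 5 3.1415e+06
print(format(value, ""))           # 3141500.0
```

So five significant digits already represent this value faithfully, yet the no-presentation-type output shows seven digits plus the trailing ".0", which is why "as large as needed" does not pin the result down when read together with "same as 'g'".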

@ericvsmith
Member

Is there accurate documentation about the expected formatting of floats when presentation type is None? If not could someone describe the expected behavior?

I think you mean when the presentation type is the empty string? I don't think it can be None.

format(a_float, '') is the same as str(a_float). I think it's documented somewhere, but I can't put my finger on it.
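
A quick way to check this claim on a given interpreter is to compare the three spellings directly. This is only a spot check over a few sample values taken from this thread, not a proof of the general behaviour.

```python
# Sample floats that appear elsewhere in this discussion.
values = [3141500.0, 0.1, 1e100, 1729.3141]

for x in values:
    empty_spec = format(x, "")
    # For floats, an empty format spec falls back to str(), and str() matches repr().
    print(repr(empty_spec), empty_spec == str(x) == repr(x))
```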

@jagerber48
Author

Even

The precision used is as large as needed to represent the given value faithfully.

is basically meaningless without further clarification:

print(f'{0.1}')
print(f'{0.1:.17}')

gives

0.1
0.10000000000000001

There should at least be information about how floats get rounded when cast to strings.

@ericvsmith Yes, the question is what happens when there is no explicit presentation type in the format spec. Under format() I see that format(a_float, '') is probably the same as str(a_float): https://docs.python.org/3/library/functions.html#format. But I don't see documentation for str for floats anywhere.

@pochmann
Contributor

The doc says None (the English word), not None (the Python value). I suggest you reword the places where you wrote None accordingly.

@jagerber48 jagerber48 changed the title from "Float string formatting with None presentation type behavior." to "Float string formatting with no specified presentation type behavior." Nov 23, 2022
@ericvsmith
Member

Are there any suggested changes here? If not, I think we should close this.

@jagerber48
Author

jagerber48 commented Jan 9, 2023

Are there any suggested changes here? If not, I think we should close this.

Yes. On this page https://docs.python.org/3/library/string.html#formatspec the "None" row for presentation type is ambiguous at best and flat out wrong at worst. See the two examples above. The main issues are (1) the comparison with the g format type doesn't make sense, because the g format type may use scientific notation whereas no format type will never use scientific notation, and (2) there is no documentation anywhere, as far as I can see, about how floats get rounded when they are displayed directly.

I tried to dig into some source code to figure out how and where floats do get rounded for string formatting but I wasn't able to figure it out since I have very little familiarity with the C code underlying Python.

(1) could be addressed by changing the documentation in the linked page. (2) could be addressed on this page or possibly another documentation page.

Also, for what it's worth (though this shouldn't really be necessary), the reason I became interested in this is that I've been looking at a project that involves custom numeric string formatting. The project will either extend or re-implement the native Python float string formatting, but some edge cases are challenging to code or unit test without clearer documentation of the expected native behavior.

@ericvsmith
Member

Without a format spec, it can still show scientific notation:

>>> format(1e100, '')
'1e+100'

Or am I missing your point?

For the other point, @mdickinson can answer better than I, but I don't think there are any guarantees. Where possible, we use the so-called "short float repr", but this isn't guaranteed on all platforms.
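
Whether a given build uses the short float repr can be checked at runtime through sys.float_repr_style, which is either 'short' or 'legacy'. A minimal check:

```python
import sys

# 'short': repr(x) aims for a short string with float(repr(x)) == x.
# 'legacy': repr(x) behaves as it did in Python versions before 3.1.
style = sys.float_repr_style
print(style)

if style == "short":
    print(repr(0.1))  # 0.1
```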

@mdickinson
Member

Ignoring the docs completely for a moment, it may be helpful to look at the source to understand the current behaviour. On a typical machine (i.e., one where we're not being forced back to the legacy code instead of using dtoa.c for one reason or another), all these formatting operations go through the format_float_short function:

cpython/Python/pystrtod.c

Lines 1006 to 1011 in 2e80c2a

static char *
format_float_short(double d, char format_code,
                   int mode, int precision,
                   int always_add_sign, int add_dot_0_if_integer,
                   int use_alt_formatting, int no_negative_zero,
                   const char * const *float_strings, int *type)

With some debugging printfs added to the format_float_short function, we get:

>>> x = 1729.3141
>>> format(x, '')
format_code=r, mode=0, precision=0
flags: add_dot_0_if_integer
'1729.3141'
>>> format(x, 'g')
format_code=g, mode=2, precision=6
flags: none
'1729.31'
>>> format(x, '.3')
format_code=g, mode=2, precision=3
flags: add_dot_0_if_integer
'1.73e+03'
>>> format(x, '.3g')
format_code=g, mode=2, precision=3
flags: none
'1.73e+03'
>>> format(x, '21')
format_code=r, mode=0, precision=0
flags: add_dot_0_if_integer
'            1729.3141'
>>> format(x, '21g')
format_code=g, mode=2, precision=6
flags: none
'              1729.31'

There are two separate cases for a "no presentation type" format specification:

  • If there's no presentation type and no precision is given (e.g., a format string of '', or '21', or '+'), then the output is based on the repr, so on most machines will be using the shortest roundtrippable string algorithm.
  • If there's no presentation type and a precision is given (e.g., '.5', or '<12.5', or ...) then the inputs to format_float_short exactly match what the inputs would have been if g were appended to the format string, except that the add_dot_0_if_integer flag is passed.

As to the effects of that add_dot_0_if_integer flag: we get exactly the same significant digits in the output as we would have done with the 'g' presentation type (computed in the exact same way through the dtoa.c functions). The only difference lies in how we choose whether to use scientific notation or not, and only applies in one corner case.

  • For the g presentation type, we use scientific notation if either the formatted value (after the rounding that's implicit in the formatting operation) is small (smaller than 0.0001), or if it's large enough that the final significant digit would have been in the tens place or above; the goal is to avoid padding on the right with zeros and thereby printing more significant digits than have been computed (and potentially printing the wrong significant digits).
  • For a missing presentation type, we instead use scientific notation if the formatted value after rounding is small (same threshold as before), or if it's large enough that the final significant digit would be in the units place or above. That's because for non-scientific notation we'd end up adding a misleading zero in this case.

Here's an example where that difference manifests itself:

>>> format(157.6, ".3g")
'158'
>>> format(157.6, ".3")
'1.58e+02'

In that second case, if we'd followed the exact same rules as for .3g and then added the trailing zero, we would have ended up with 158.0, which would be misleading. Hence the use of scientific notation. With a precision of 4, we see no difference:

>>> format(157.6, ".4g")
'157.6'
>>> format(157.6, ".4")
'157.6'

Similarly, with a precision of 2, both cases will use scientific notation:

>>> format(157.6, ".2g")
'1.6e+02'
>>> format(157.6, ".2")
'1.6e+02'

And a case where that extra .0 comes into play (the key point here is to get a representation that Python would interpret as a float rather than an int if it were given as a literal):

>>> format(1576.0, ".6g")
'1576'
>>> format(1576.0, ".6")
'1576.0'
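
The "precision given, no presentation type" rules described above can be sketched in pure Python on top of the 'e' and 'g' presentation types. This is an illustration under stated assumptions, not how CPython implements it: the function name format_no_type is made up, only finite floats and a precision of at least 1 are handled, and it has only been checked against the examples in this comment.

```python
def format_no_type(x: float, precision: int) -> str:
    """Approximate format(x, f'.{precision}') for a finite float (a sketch)."""
    if precision < 1:
        raise ValueError("this sketch only handles precision >= 1")
    # Round to `precision` significant digits and read off the decimal exponent.
    mantissa, exp_text = format(x, f".{precision - 1}e").split("e")
    exponent = int(exp_text)
    # 'g' switches to scientific notation when exponent >= precision; with no
    # presentation type the upper threshold is one lower, so that the output
    # never needs a padding ".0" after the last significant digit.
    if exponent < -4 or exponent >= precision - 1:
        if "." in mantissa:
            mantissa = mantissa.rstrip("0").rstrip(".")
        return f"{mantissa}e{exp_text}"
    text = format(x, f".{precision}g")
    # Fixed-point output must look like a float, so add ".0" if needed.
    return text if "." in text else text + ".0"


for value, precision in [(157.6, 3), (157.6, 4), (157.6, 2), (1576.0, 6)]:
    print(format_no_type(value, precision), format(value, f".{precision}"))
```

Each printed pair should match, which mirrors the four examples above: only the precision-3 case differs from plain 'g', and only the 1576.0 case picks up the extra ".0".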

@mdickinson
Member

mdickinson commented Jan 9, 2023

So going back to the docs, it looks as though we're not clearly expressing that "with precision" versus "without precision" distinction, and then we're kinda mashing the two cases together in the text. The

the same as 'g', except that when fixed-point notation is used to format the result, it always includes at least one digit past the decimal point

text is accurate, but only when a precision is used, while the

The precision used is as large as needed to represent the given value faithfully.

text is accurate in the case where no precision is used (and is valid for both the legacy and the dtoa.c-based floating-point repr in that case).

@jagerber48
Author

jagerber48 commented Jan 10, 2023

@mdickinson thanks so much for the thorough analysis, it's just what I needed. I'll parse through this and see if I can draft revised documentation for this. A note to all, I'm not really familiar with Decimal or its use cases so I won't target that part of the documentation.

@jagerber48
Author

Here's my attempt at a summary of the cases. I'm considering the formatting of a float to a string when no presentation type is specified. In this case:

  • If no precision is specified then the string is formatted the same as 'f' with a precision as large as needed to represent the given value faithfully.
    • But note that f'{123:}' == f'{123:.0f}' == '123' while f'{123.:}' == f'{123.0:}' == f'{123:.1f}' == f'{123.:.1f}' == f'{123.0:.1f}' == '123.0'. I'm not sure how to express this edge case in words. It actually looks to me like the integer formatting documentation needs to be extended to include f as a presentation type.
  • If precision is specified then this is the same as 'g', except that when fixed-point notation is used to format the result, it always includes at least one digit past the decimal point

Do these statements seem correct? @mdickinson? The usage of 'f' in the case where no precision is given seems like a big and important case. I'm also not sure if this covers all corner cases, but I think it covers at least more cases than the current documentation. If this is correct I can work on drafting up new text for the documentation.

@mdickinson
Member

Doesn't that example match the docs? Can you say what output you were expecting instead?

@jagerber48
Author

jagerber48 commented May 1, 2023

Doesn't that example match the docs? Can you say what output you were expecting instead?

Sorry, I just deleted my comment because I started to think the same. I'll see if I can restore it.

The case was

>>> format(100, '.5g')
'100'

I was expecting 100.00 because that is the result of format(100, '.2f') which the docs say .5g gets converted to in this case. I was ignoring the part of the docs about removing trailing zeros. I guess the trailing zeros are considered insignificant even though 5 significant digits are requested and they are, by my definition of significant, significant.

I guess this is the use case for

>>> format(100, '#.5g')
'100.00'
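
A short side-by-side of the 'g' type with and without the '#' option shows the effect on trailing zeros. The 0.5 value is an extra sample added here for illustration; the 100 case is the one from this comment.

```python
results = {}
for value in (100, 0.5):
    for spec in (".5g", "#.5g"):
        # Without '#', trailing zeros (and a bare decimal point) are stripped;
        # with '#', all requested significant digits are kept.
        results[(value, spec)] = format(value, spec)
        print(value, spec, results[(value, spec)])
```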

@mdickinson
Member

mdickinson commented May 1, 2023

Thanks. Yes, the doc wording is using "insignificant" in the sense that removing the zero doesn't change the value (as opposed to, for example, removing a trailing zero from the string "100"); the intent is not to match the meaning of "significant" in "significant digits". It hadn't occurred to me that readers might try to link those two things. There may well be a better way of wording this.

@mdickinson
Member

Responding to your earlier comment:

If no precision is specified then the string is formatted the same as 'f' with a precision as large as needed to represent the given value faithfully.

Yes, I think that's accurate.

But note that f'{123:}' == f'{123:.0f}' == '123' while f'{123.:}' == f'{123.0:}' == f'{123:.1f}' == f'{123.:.1f}' == f'{123.0:.1f}'. [...]

Yes. For floats, one of the constraints being applied for the no-precision case is that the output should "look like" a float rather than appearing to be an integer - that is, the output string should have either a decimal separator or an exponent indicator. A second constraint is that (for whatever reason - it's essentially cosmetic), we never produce a trailing "." without a following digit.

If precision is specified then this is the same as 'g', except that when fixed-point notation is used to format the result, it always includes at least one digit past the decimal point

Again, this seems accurate, yes.

I do understand the frustration: for a lot of this, the reason behind the behaviour is no better than history + "C did it/does it that way", so this is the behaviour that users expected at the time. But it's virtually impossible to change these sorts of details without breaking someone's code, somewhere. (For this exact reason I have a back-burner project for a third-party library to generalise formatting and rounding tasks and make them more consistent and flexible, but it may be some time before it sees the light of day.)

@jagerber48
Author

No, I was incorrect that no presentation type plus no precision is similar to the 'f' presentation type. The 'f' presentation type will never give scientific notation, whereas no presentation type will use scientific notation for floats >=1e16 or <= 1e-5 (need to check exact changeover value I think?). It may be worth pointing out that something like format(100, 'f') casts 100 to a float before formatting whereas format(100, '') does not.
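
A few spot checks around the changeover points, assuming a build with the short float repr. The sample values are chosen here for illustration and only probe the boundaries; they do not establish the exact rule.

```python
# Empty format spec versus 'f' near the large and small changeover points.
samples = [1e15, 1e16, 0.0001, 0.00001]
no_type = {x: format(x, "") for x in samples}
fixed = {x: format(x, "f") for x in samples}

for x in samples:
    print(no_type[x], fixed[x])
# First column: 1000000000000000.0, 1e+16, 0.0001, 1e-05

# An int is formatted as an int with an empty spec, but as a float with 'f'.
print(format(100, ""), format(100, "f"))  # 100 100.000000
```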

Your formatting and rounding package sounds interesting. I think the python community is missing a standalone package for easy formatting of scientific numbers. The native formatting is close enough that I can understand why something hasn't arisen. But it seems to be solving slightly different problems so it ends up being a little clunky to get just what you want if you're being very particular about digits (like you sometimes are in certain sciences). The uncertainties package does a pretty nice job of this but its formatting syntax is intertwined pretty heavily with the native python string formatting which, in my opinion, is a bit clunky. I've written my own uncertainties formatting which incorporates rounding using the round function and a new mini format language dedicated to the formatting of value + uncertainty float pairs according to significant digits. One feature which I haven't seen anywhere is the ability to format a float via "engineering notation" where the exponents are always integer multiples of 3 (mm, km, MHz, etc.).

@jagerber48
Author

jagerber48 commented May 2, 2023

I don't yet know how to contribute to the code that generates the python documentation. I'll work on learning that, but in the meantime, here's a first draft of my proposed changes.


Under the 'g' presentation type, change

In both cases insignificant trailing zeros are removed from the significand, and the decimal point is also removed if there are no remaining digits following it, unless the '#' option is used.

to

In both cases trailing zeros are removed from the significand, and the decimal point is also removed if there are no remaining digits following it, unless the '#' option is used.

Simply remove the word "insignificant".


Under 'None': I'm going to set Decimal aside, since I don't understand it well enough to write documentation for it. Someone else will have to help me integrate my float explanation with the Decimal explanation (or I'll have to spend more time learning).

Rewrite the text as follows:

For float, if a precision is provided then this is the same as 'g' except that when fixed-point notation is used to format the result, it always includes at least one digit past the decimal point. If no precision is provided then the numerical part of the result is resolved using the float str() method. In this case, on most systems, the float is converted to the shortest string that will roundtrip back to the same float. If the float is <1e-4 or >=1e16 it will be displayed in scientific notation.

A comment here: The main blind spot in the documentation of the "None" presentation type is that the rules for the str() method for float are not laid out anywhere in the documentation that I can find. Not sure if this is true for Decimal also or not. Those rules can either be laid out here (like in my text above), or elsewhere in the docs and pointed to at this point. I think this table should be a one-stop shop for answering "how will this number get formatted", so my vote would be to put it here for now (as opposed to on a page about floats or something...)


Under 'G':

General format. Same as 'g' except switches to 'E' if the number gets too large.

to

General format. Same as 'g' except switches to 'E' if the number gets too large or small.


At the bottom (or top) of the table include the comment

All format types in the table above except "None" cast integer inputs to floats before formatting.
