clarify amount parsing/rendering/commodity directives #793

simonmichael · 2018-06-01T23:52:42Z

Here are some notes and actions aimed at improving amount parsing, rendering, and commodity directives, from this mail thread which spun off from #698. Other issues that might be affected:
#489 #561 #688

(Here's a shorter summary of the points below.)

problems

Problems with amount parsing/rendering in current master:

numbers are parsed loosely, accepting either of two decimal separators
we don't warn about inconsistent choice of decimal separator across amounts
ambiguous-separator amounts can be silently misparsed (a single digit group separator is interpreted as a decimal separator)
commodity directives, the recommended solution, have unclear semantics
- they are used for: declaring commodity symbols, resolving input decimal separator ambiguity, controlling output style
- the directive's scope for each of these is non-obvious. Should subfiles be affected? Other files? Transactions in the same file preceding the directive?
commodity directives used to resolve the input decimal separator also limit output style
- they force digit group separators to appear in output
- they force the output decimal separator to match the input
D directives do all that commodity directives do and more, adding complexity
allowing commodities to have different input decimal separators within a file is excessive flexibility
allowing commodities to have different output decimal separators within a report is excessive flexibility
lack of clarity makes this annoying to learn, costly to support; ongoing issue reports
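
To make the "ambiguous separator" problem concrete, here is a minimal Python sketch (not hledger's actual parser) that classifies the separator in a bare number string. The function name `classify_separator` and the rule "a single mark followed by exactly three digits is ambiguous" are assumptions made for illustration; the issue does not define ambiguity precisely.

```python
import re
from typing import Optional, Tuple

def classify_separator(number: str) -> Tuple[str, Optional[str]]:
    """Return (kind, decimal_separator) for a bare number string.

    kind is "none", "definite" or "ambiguous" (illustrative labels).
    """
    marks = re.findall(r"[.,]", number)
    if not marks:
        return ("none", None)
    if len(set(marks)) == 2:
        # Both marks present: the right-most one is the decimal separator.
        return ("definite", marks[-1])
    if len(marks) > 1:
        # One mark repeated: it can only be a digit group separator,
        # so the decimal separator must be the other mark.
        return ("definite", "." if marks[0] == "," else ",")
    # One mark, used once. Assumption: exactly three digits after it
    # means it could be either a group or a decimal separator.
    digits_after = len(number) - number.rindex(marks[0]) - 1
    if digits_after == 3:
        return ("ambiguous", marks[0])
    return ("definite", marks[0])

for text in ["2,000.00", "1,000", "2000,00", "1,000,000"]:
    print(text, classify_separator(text))
```

Under these assumptions only `1,000` is ambiguous; the other three can be read in just one way.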

goals

be convenient and intuitive
be i18n-aware
detect and report errors, avoid guessing
don't require learning detailed rules
in every situation, do something that's sensible at least in hindsight
keep as much backward and sideways compatibility as possible
end confusion and bug reports about basic number parsing and rendering

short term

clarify how commodity directives can control both input and output

a directive for a commodity sets its input decimal point for the rest of the current file, exclusive of subfiles
a directive for a commodity declares its symbol's validity for the rest of the current file, exclusive of subfiles
the first directive for a commodity across all files sets its output style in the report

parse decimal separators more carefully

commodities and input decimal separators are always declared together, and per file
at the start of each file (and each included file), no commodities or input decimal separators are known
parsing a commodity directive
- declares the commodity and its input decimal separator for the current file
- the directive's amount must have a separator
- if the separator is ambiguous, it is assumed to be the decimal separator. Could display a warning.
- if this is the first directive for this commodity among all files: also declares the commodity's output style for the report
parsing a definite-separator amount whose commodity is not yet declared has the same effect as a commodity directive. Could display a warning. In a future strict mode, this will raise an error.
parsing an ambiguous-separator amount whose commodity is not yet declared has the same effect as a commodity directive, and also displays a warning (one per file)
parsing an amount whose decimal separator is inconsistent with the one declared for its commodity, raises an error
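
The rules above can be sketched as a small per-file state machine. This is an illustrative Python model, not hledger code: the names `Report`, `FileParser`, `classify`, `to_decimal` and `AmountError` are hypothetical, the ambiguity rule (one mark followed by exactly three digits) is an assumption, and the "output style" is simply stored as the first example amount seen.

```python
from decimal import Decimal
from typing import Dict, List, Optional, Tuple

class AmountError(ValueError):
    """Raised for a missing or inconsistent decimal separator."""

def classify(number: str) -> Tuple[str, Optional[str]]:
    marks = [c for c in number if c in ".,"]
    if not marks:
        return ("none", None)
    if len(set(marks)) == 2:
        return ("definite", marks[-1])          # right-most mark is decimal
    if len(marks) > 1:                          # repeated mark is a group mark
        return ("definite", "," if marks[0] == "." else ".")
    digits_after = len(number) - number.rindex(marks[0]) - 1
    return ("ambiguous" if digits_after == 3 else "definite", marks[0])

def to_decimal(number: str, decimal_sep: Optional[str]) -> Decimal:
    if decimal_sep is None:
        return Decimal(number)
    group_sep = "," if decimal_sep == "." else "."
    return Decimal(number.replace(group_sep, "").replace(decimal_sep, "."))

class Report:
    """Shared across all files: the first declaration sets output style."""
    def __init__(self) -> None:
        self.output_style: Dict[str, str] = {}

class FileParser:
    """Per-file state: nothing is known at the start of each file."""
    def __init__(self, report: Report) -> None:
        self.report = report
        self.decimal: Dict[str, str] = {}
        self.warnings: List[str] = []
        self._warned_ambiguous = False

    def _declare(self, commodity: str, sep: str, example: str) -> None:
        self.decimal[commodity] = sep
        self.report.output_style.setdefault(commodity, example)

    def directive(self, commodity: str, example: str) -> None:
        kind, sep = classify(example)
        if kind == "none":
            raise AmountError(f"directive amount {example!r} has no separator")
        # An ambiguous separator is assumed to be the decimal separator.
        self._declare(commodity, sep, example)

    def amount(self, commodity: str, number: str) -> Decimal:
        kind, sep = classify(number)
        declared = self.decimal.get(commodity)
        if declared is None and kind != "none":
            if kind == "ambiguous" and not self._warned_ambiguous:
                self.warnings.append(f"ambiguous separator in {number!r}")
                self._warned_ambiguous = True   # one warning per file
            self._declare(commodity, sep, number)
            declared = sep
        elif kind == "definite" and sep != declared:
            raise AmountError(f"{number!r} conflicts with declared {declared!r}")
        return to_decimal(number, declared)
```

In this model, parsing `2,000.00` and then `2000,00` for the same commodity in one file raises `AmountError` instead of silently producing two different readings. What should happen to an undeclared commodity whose amount has no separator at all is not covered by the rules above; the sketch leaves it undeclared.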

clarify docs

"use commodity directives in each file to help parse it correctly, declaring commodity symbols and the decimal separator used"
"use commodity directives in the first/uppermost file to help control each commodity's output style, declaring the symbol position, digit groups, decimal separator, and number of decimal places"

simplify D directive if it gives any trouble

D only specifies a default symbol, eg: D $ . The old syntax is accepted for backwards compatibility but only the symbol matters.

medium term

add a decimal or decimal-separator directive to set that once per file. "decimal ,"
add an alternate simpler form of commodity directive that just defines a symbol. "commodity $". And perhaps allow declaring multiple symbols with one directive.
use system locale to choose a default output decimal separator
add an --amount-style command line option that overrides output style, for individual commodities or all commodities.
add a strict mode that does more error checking

awjchen · 2018-06-10T01:12:32Z

I haven't been participating more in this discussion because I know I have no experience designing syntax and have little experience with hledger. But since there aren't yet any other comments, it might be worth saying something, even if only to summarize your points.

Summary

If I understand correctly, the major proposed short-term change to hledger's parser is to associate a unique decimal separator with every commodity so that the numerical value of commoditized amounts is unambiguous. (Output styles and user documentation are addressed as well, but I will focus on the parser, since that is the only part of hledger I know anything about.)

The decimal separator for a commodity will be determined by a commodity directive, if present; otherwise, it will be determined by the first encountered amount of that commodity. In either case, the example amount must contain a separator that can be interpreted as a decimal separator (where we will assume that a separator is a decimal separator wherever we can), or else an error will be thrown (?).

The medium-term objectives seem to be refinements of the main idea of explicitly declaring a decimal separator.

Comments

It looks like the focus on the decimal separator, as opposed to e.g. the digit group separators, is sound since the decimal separator is the only separator that affects the interpretation of an amount.

Given that we want to throw errors when encountering decimal separator inconsistency, the proposed short-term changes seem minimal, which is good. No new syntax is introduced, for instance.

This is a breaking change for users that have mixed amount styles for a single commodity in a single journal, but perhaps this is not common practice?

Alternatives

Also, are there other feasible alternatives to disambiguating the parsing of amounts? I can't imagine an alternative to the proposed change that would be better for maintaining backwards and sideways compatibility. One alternative I can imagine, one that would be less sideways compatible, would be to have the user choose amount parsing styles from a number of pre-set styles. In particular, these styles would determine the decimal separator, or the absence thereof. The idea is that, since there are only a finite number of conventions for numbers or monetary amounts, we might be able to write specialized parsers for each one.

If we are indeed able to implement parsers for all (or at least most?) conventional systems, I imagine it would be convenient for the user to simply choose one of them; rather than learning the rules for specification, they would only need to learn the name of the system they already know. I also think it could be beneficial to restrict the syntax for amounts to that of conventional systems, in particular for the purpose of exporting or otherwise communicating hledger data. Furthermore, by using specialized parsers, we might be able to more cleanly handle "exotic" systems (e.g. checking that the Japanese digit group separators in issue #796 appear in a certain order and are not repeated). For backwards compatibility, maybe we could retain the current amount parser (or the currently proposed amount parser?) as the default parser.
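
As a rough illustration of the preset-styles idea, here is a Python sketch in which each named style gets its own strict pattern. The style names, the `STYLES` table and `parse_with_style` are all hypothetical, and only three Western conventions with three-digit groups are covered.

```python
import re
from decimal import Decimal

# Hypothetical preset styles: name -> (group separator, decimal separator).
STYLES = {
    "1,000.00": (",", "."),
    "1.000,00": (".", ","),
    "1 000,00": (" ", ","),
}

def _pattern(group: str, decimal: str) -> "re.Pattern[str]":
    g, d = re.escape(group), re.escape(decimal)
    # Either properly grouped digits (1,234,567) or ungrouped digits (1234567),
    # followed by an optional decimal part.
    return re.compile(rf"(?:\d{{1,3}}(?:{g}\d{{3}})+|\d+)(?:{d}\d+)?")

def parse_with_style(style: str, text: str) -> Decimal:
    group, decimal = STYLES[style]
    if not _pattern(group, decimal).fullmatch(text):
        raise ValueError(f"{text!r} is not a valid {style!r} number")
    return Decimal(text.replace(group, "").replace(decimal, "."))

print(parse_with_style("1,000.00", "1,000"))   # read as one thousand
print(parse_with_style("1.000,00", "1,000"))   # read as one, to three places
```

Because the style is chosen up front, `1,000` is never ambiguous here, and a number such as `2000,00` is rejected outright under the `1,000.00` style rather than guessed at.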

simonmichael · 2018-06-11T18:58:24Z

Thanks for the input @awjchen. I would say it this way: Currently the input decimal separator is detected or guessed for each individual amount. This is too loose, allowing numbers to be quietly misparsed. As a workaround we allow commodity directives to specify it unambiguously, but this has several problems due to unclear semantics and scope. Overall the current setup causes confusion and bug reports. The proposal aims to make this more intuitive and robust by parsing decimal separators more strictly and tightening up the semantics:

1. Input decimal separator will be detected or guessed within each file, and for each commodity in that file. (We don't really need or want it to vary across commodities, but given current syntax it's easier to allow that.)
2. While parsing the amounts in a file, any inconsistency with the file-wide input decimal separator, or any guessing required, will be reported as a warning or error.
3. Commodity directives will have clearer semantics (first one sets output decimal separator, most recent one in parse stream sets input decimal separator) and scope (affecting subsequent entries in the current file only).

alerque · 2019-06-20T13:44:53Z

Cross linking related comment on symbols in currencies.

simonmichael · 2019-09-25T05:34:09Z

Another example of confusing behavior, via @bradyt. Not yet understood.

; problems with amount parsing/display
; cf https://github.com/simonmichael/hledger/issues/793

2019/09/24
    a                  2,000.00
    b                  1,000
    c

2019/09/26
    (d)             2000,00

comment

$ hledger print
2019/09/24
    a       2,000,000
    b           1,000
    c

2019/09/26
    (d)       2,000,000

$ hledger -f a.j reg a amt:'2000'
2019/09/24                   a                    2,000,000     2,000,000

[Opened as #1091].

And if they did, the stats command would now throw an error. Changed: journalApplyCommodityStyles journalInferCommodityStyles commodityStylesFromAmounts

simonmichael · 2020-03-25T18:00:42Z

Related: currently it seems not possible for hledger to display a decimal mark different from the one in the journal. Eg, you can't print reports with a . decimal point from this journal:

2020-01-01
    (a)         $1,23

The multi-commodity-directive behaviour proposed above (first one sets output decimal mark, most recent sets input decimal mark) would allow it, as would a dedicated decimal or decimal-mark directive.
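
The underlying point is that input and output decimal marks are independent once the amount is held as a number. A minimal Python sketch, with a hypothetical `render` helper that is not part of hledger:

```python
from decimal import Decimal

def render(quantity: Decimal, decimal_mark: str = ".", group_mark: str = "") -> str:
    """Format a quantity with the chosen output marks (three-digit groups)."""
    text = format(quantity, "f")                 # plain fixed-point text
    sign = "-" if text.startswith("-") else ""
    integer, _, fraction = text.lstrip("-").partition(".")
    if group_mark:
        groups = []
        while len(integer) > 3:
            groups.insert(0, integer[-3:])
            integer = integer[:-3]
        groups.insert(0, integer)
        integer = group_mark.join(groups)
    return sign + integer + (decimal_mark + fraction if fraction else "")

# The journal writes $1,23 with "," as the input decimal mark...
quantity = Decimal("1,23".replace(",", "."))
# ...but the report can still show it with "." as the output decimal mark.
print("$" + render(quantity, decimal_mark="."))
```

Here the input mark is only used while parsing; which mark appears in the report is a separate choice made at rendering time.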

simonmichael · 2020-11-09T00:43:01Z

https://hledger.org/csv.html#decimal-mark rule added in master, for CSV files. We might want to add it to journal format also.

This was referenced Jun 1, 2018

digit group separator parsed as decimal point #698

Closed

commodity directive with no decimal point causes wrong parsing #688

Closed

JPY separators #798

Closed

awjchen mentioned this issue Jun 26, 2018

Allow "re-parsing" with custom parse errors, some commodity cleanups #823

Closed

Repository owner deleted a comment from awchen Jul 11, 2019

simonmichael mentioned this issue Sep 25, 2019

some journals cause confusing output with same digit group mark & decimal mark #1091

Closed

simonmichael removed the help wanted label Mar 7, 2020

simonmichael mentioned this issue Nov 7, 2020

csv decimal mark #1382

Merged

apauley mentioned this issue Jul 8, 2021

clarify scope of commodity directive in regard to number notation #1375

Closed

simonmichael mentioned this issue Aug 25, 2021

support decimal-mark in journal files also #1670

Closed
