New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

space should not be parsed as decimal point #749

Closed
simonmichael opened this Issue May 3, 2018 · 11 comments

Comments

Projects
None yet
3 participants
@simonmichael
Owner

simonmichael commented May 3, 2018

With a journal containing only the below, the amount is parsed as 1. I think (hope) no country uses space for decimal point, so we should prevent this:

2018-01-01
  (a)   USD1 000

;$ hledger -f d.j reg amt:1
;2018/01/01                      (a)                       USD1 000      USD1 000

And possibly the same issue, the following commodity directive should complain about a missing decimal point (now required), instead it gives a parse error:

commodity 1 000  USD

2018-01-01
  (a)   USD1 000

; hledger: d.j:1:12:
; unexpected space
; expecting digit

A correct commodity directive does work as intended:

commodity 1 000.00  USD

2018-01-01
  (a)   USD1 000

;$ hledger -f d.j reg amt:1000
;2018/01/01                      (a)                   1 000.00 USD  1 000.00 USD

@simonmichael simonmichael changed the title from space as digit group separator can be parsed as decimal point to space as digit group separator should not be parsed as decimal point May 3, 2018

@ony

This comment has been minimized.

Collaborator

ony commented May 4, 2018

Space have special meaning and cannot be decimal separator:

    (firstSep, groups) <- option (Nothing, []) $ do
        leadingDigits <- some digitChar
        option (Nothing, [leadingDigits]) . try $ do
            firstSep <- oneOf sepChars <|> whitespaceChar
            groups <- some digitChar `sepBy1` char firstSep
            return (Just firstSep, leadingDigits : groups)

Isn't USD1 might be treated as a commodity?

@simonmichael

This comment has been minimized.

Owner

simonmichael commented May 4, 2018

Contrary to the haddock, firstSep <- oneOf sepChars <|> whitespaceChar allows a space (and possibly other things in the unicode Space category).

@ony

This comment has been minimized.

Collaborator

ony commented May 4, 2018

I see problem is in fromRawNumber actually we get (Just ' ', ["1", "000"], Nothing) same as before we assume that if there is two groups - last one is decimals regardless of what separator were used.

Something like next should do the trick:

diff --git a/hledger-lib/Hledger/Read/Common.hs b/hledger-lib/Hledger/Read/Common.hs
index 891e5b04..b5ee7c1b 100644
--- a/hledger-lib/Hledger/Read/Common.hs
+++ b/hledger-lib/Hledger/Read/Common.hs
@@ -604,21 +604,19 @@ fromRawNumber suggestedStyle negated raw = (quantity, precision, mdecimalpoint,
         AmountStyle{asdecimalpoint = Just d} -> (d ==)
         AmountStyle{asdigitgroups = Just (DigitGroups g _)} -> (g /=)
         AmountStyle{asprecision = 0} -> const False
-        _ -> const True
+        _ -> (`elemOf` digitSepChars)
 
 
 rawnumberp :: TextParser m (Maybe Char, [String], Maybe (Char, String))
 rawnumberp = do
-    let sepChars = ['.', ','] -- all allowed punctuation characters
-
     (firstSep, groups) <- option (Nothing, []) $ do
         leadingDigits <- some digitChar
         option (Nothing, [leadingDigits]) . try $ do
-            firstSep <- oneOf sepChars <|> whitespaceChar
+            firstSep <- oneOf digitSepChars <|> whitespaceChar
             groups <- some digitChar `sepBy1` char firstSep
             return (Just firstSep, leadingDigits : groups)
 
-    let remSepChars = maybe sepChars (`delete` sepChars) firstSep
+    let remSepChars = maybe digitSepChars (`delete` digitSepChars) firstSep
         modifier
             | null groups = fmap Just  -- if no digits so far, we require at least some decimals
             | otherwise = optional
@@ -634,6 +632,9 @@ rawnumberp = do
     return $ dbg8 "rawnumberp" (firstSep, groups, extraGroup)
     <?> "rawnumberp"
 
+-- | All allowed punctuation characters in number
+digitSepChars = ['.', ',']
+
 -- | Parse a unicode char that represents any non-control space char (Zs general category).
 whitespaceChar :: TextParser m Char
 whitespaceChar = charCategory Space

But ideally we need to return different kinds of values to distinguish ambiguous numbers and when it is clear that it is digit groups (+ separator if more than 1 group) with optional fractional part.

simonmichael added a commit that referenced this issue May 17, 2018

@simonmichael

This comment has been minimized.

simonmichael added a commit that referenced this issue May 17, 2018

@awjchen

This comment has been minimized.

Collaborator

awjchen commented May 18, 2018

Addressing the first example given in the OP:

In http://hledger.org/journal.html#amounts there is a section that explains the observed behaviour.

... However, there is some ambiguous way of representing numbers like $1.000 and $1,000 both may mean either one thousand or one dollar. By default hledger will assume that this is sole delimiter is used only for decimals. On the other hand commodity format declared prior to that line will help to resolve that ambiguity differently: ...

This seems to be implemented in the following snippet of fromRawNumber (starting on line 716 of Hledger/Read/Common.hs on the current master).

(mseparator, intparts, mdecimalpoint, frac) =
        case raw of
            -- just a single punctuation between two digits groups, assume it's a decimal point
            (Just s, [firstGroup, lastGroup], Nothing)
                -- if have a decimalHint restrict this assumpion only to a matching separator
                | maybe True (`asdecimalcheck` s) suggestedStyle -> (Nothing, [firstGroup], Just s, lastGroup)

Here, raw is a triple of Maybe Char (the separator character, if any), [String] (the digit groups), and Maybe (Char, String) (the decmial character and decimal string, if they are detected). raw is the output of some inital parsing done in rawnumberp.

In the first example, raw evaluates to (Just ' ', ["1", "000"], Nothing), matching the pattern of the first case (I have excluded the others), while suggestedStyle evaluates to Nothing, so the guard evaluates to True and we indeed interpret the space as a decimal point.


Addressing the second example of the OP:

The parsing fails on the directive on the first line. We are trying to parse amountp, which is an alternative of leftsymbolamountp, rightsymbolamountp, and nosymbolamount. The case we want is rightsymbolamountp.

leftsymbolamountp fails, as expected. But, while trying to parse rightsymbolamountp, we fail rawnumberp in Common.hs. Specifically, it fails on this line:

groups <- some digitChar `sepBy1` char firstSep

Here, firstSep is the space character, and we have just parsed the 1. Then, this line parses "000" for some digitChar, then successfully parses one of the two spaces before the commodity symbol USD, and is exepcting another some digitChar but instead meets a space.

Upon this failure, we backtrack and obtain a nonsensical result of parsing just 1 with no separators, causing a failure down the line in rawnumberp. We backtrack to nosymbolamountp, trying and failing to parse rawnumberp in the same way, which gives the error we see.

@simonmichael

This comment has been minimized.

Owner

simonmichael commented May 18, 2018

Re OP example 1, you're right, the observed behaviour is consistent and even documented. I meant to suggest we should change it and break consistency here, knowing that nobody ever uses space as decimal point (?), allowing that 1 to be parsed as part of the commodity symbol instead. I might be wrong though, it might be too confusing.

I should link to the context: this reddit thread where someone was complaining about having to double-quote commodity symbols with numbers in them.

@awjchen

This comment has been minimized.

Collaborator

awjchen commented May 19, 2018

Ah, I see. I'm a little wary about adding an additional rule for the parsing of amounts on top of the special case we already have for numbers with exactly two digit groups. To me this seems complex.

For the "SP500" commodity example from reddit, would be it reasonable to introduce a new separator for separating commodity symbols from their amounts? It would at least cut down the additional typing from two charcters to one.

(misclicked and closed the issue, reopening)

@awjchen

This comment has been minimized.

Collaborator

awjchen commented May 20, 2018

The second example journal file in PR #780 seems like something one would often write if one used whitespace as a decimal point. If this is the first time we've exposed this parsing bug, is that evidence that no one uses whitespace as a decimal point?

@awjchen

This comment has been minimized.

Collaborator

awjchen commented May 20, 2018

I take back what I said about disallowing whitespace as decimal points being complex. The manual has a rule that decimal points can only be commas or periods! So perhaps we should disallow it, since, in some sense, it's the current behaviour that complicates things by creating an exception to the above rule!

@simonmichael

This comment has been minimized.

Owner

simonmichael commented May 20, 2018

Good question. Eh.. this gets confusing. I think:

  • nobody uses space as decimal point with hledger
  • no country in the world uses space as decimal point
  • we fairly recently added the ability to use space (just space, not all whitespace, hopefully) as digit group (eg thousands) separator
  • consequently our existing logic allows it to be tried it as a decimal point as well
  • we should change the logic to never try space as decimal point
@awjchen

This comment has been minimized.

Collaborator

awjchen commented May 20, 2018

Agreed. I can make this change. I would implement changes along the lines of @ony's suggestions.

awjchen added a commit to awjchen/hledger that referenced this issue May 21, 2018

awjchen added a commit to awjchen/hledger that referenced this issue May 21, 2018

awjchen added a commit to awjchen/hledger that referenced this issue May 21, 2018

@wafflebot wafflebot bot added the in progress label May 21, 2018

awjchen added a commit to awjchen/hledger that referenced this issue May 21, 2018

simonmichael added a commit that referenced this issue May 21, 2018

lib: fix issue where spaces were allowed as decimal points
- Fixes #749
- Also enabling the tests prepared for #749

@wafflebot wafflebot bot removed the in progress label May 21, 2018

@simonmichael simonmichael changed the title from space as digit group separator should not be parsed as decimal point to space should not be parsed as decimal point May 21, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment