Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow optional underscores in numberLiteral parsers #2

Open
rmunn opened this issue Feb 17, 2017 · 6 comments
Open

Allow optional underscores in numberLiteral parsers #2

rmunn opened this issue Feb 17, 2017 · 6 comments

Comments

@rmunn
Copy link
Contributor

rmunn commented Feb 17, 2017

Now that Java and F# allow underscores in numeric literals, and some configuration file syntaxes like TOML do as well, it would be nice if the numberLiteral parser had an optional flag for allowing underscores between digits. As it is, if I want to write a TOML parser, I won't be able to leverage numberLiteral and I'll have to essentially reimplement it myself.

@stephan-tolksdorf
Copy link
Owner

I agree, an option to skip underscores in numeric literals as supported by Java and F# would be a good addition.

@stephan-tolksdorf
Copy link
Owner

stephan-tolksdorf commented Mar 5, 2017

I've implemented an AllowDigitSeparator option for the numberLiteral parser on the 'digit-separator' branch, see f2686b9

With this option the numberLiteral parser allows an underscore char ('_') as a digit separator. Each underscore must be surrounded by digits on both sides, and if an underscore is not followed by a digit, an error is generated. Is this behaviour suitable for your applications?

Note that this doesn't affect the pfloat parser, as the digit separator flag is not in the default options. If you wanted a variant of pfloat that supports digit separators you'd need to copy the implementation, add the AllowDigitSeparator option when calling numberLiteralE and then strip the underscores from the string in the NumberLiteral result before passing it to e.g. System.Double.Parse.

Changing the defaults for pfloat and the pint would be a breaking change of behavior, which I'd like to avoid. But we could add new variants with different options, or maybe variants that are configurable through an options argument. What do you think?

@rmunn
Copy link
Contributor Author

rmunn commented Mar 6, 2017

I don't want to change the defaults of pfloat and pint: number literals without underscores are by far the most common use case for most parsing scenarios, I believe. And yes, this change to the numberLiteral parser is exactly what I needed for my use case (parsing TOML).

What you describe about stripping the underscores from the string is pretty much what I did in my own code: I implemented a parser to parse a sequence of digits with optional underscores in between, and returned a string of the digits without the underscores which I then forwarded to System.Int32.Parse:

let digitsWithUnderscores = stringsSepBy (many1Satisfy isDigit)
                                         (underscore >>% "")
                            |> notEmpty

In the case of parsing floats, I used that digitsWithUnderscores parser in three different places in the float parser I wrote, and since it drops underscores, I could hand the whole thing off to System.Double.Parse as-is. However, it took a bit of fiddling to get the backtracking exactly right, so I'll be happy to replace it with a numberLiteral-based parser once this code lands in a released version.

I like the idea of different variants, but I can't come up with good names for them. The names should be short, preferably a single character appended to the existing names (e.g., pfloatU)... but I strongly dislike pfloatU when I look at it. It's not at all obvious what it means. And pfloatWithUnderscores (or even worse, pfloatWithOptionalUnderscores) is/are way too wordy. Oh, and while I'm mentioning horrible ideas that I don't like, pfloat_ is even worse than pfloatU. Blech.

I'm not thrilled about the idea of pfloat growing an options argument, as that would be somewhat inconsistent with the rest of the syntax where variants are different functions with one-character variant codes appended (e.g., sepBy vs sepBy1).

So I don't have any good ideas so far for function names, but if I come up with a bright idea over the next few days I'll mention it.

@stephan-tolksdorf
Copy link
Owner

Having a pfloat variant that takes an options argument would simplify adding further options in the future. Or maybe it should be a a generic pnumber parser that also accepts a function argument for mapping the string to a generic result type? The identifier parser is an example for a similarly configurable parser.

It would be great if you could maybe try out the new AllowDigitSeparator option a bit before I merge it into master, just so that we gain a little more confidence in its implementation.

@rmunn
Copy link
Contributor Author

rmunn commented Jan 26, 2018

Well, I've finally been able to try this branch out, and here are my thoughts:

First, there are some differences between various syntaxes out there that allow underscores between digits. For example, in Java and F#, 5_______2 is a valid numeric literal (which parses to 52). But in TOML, 5_______2 is an error (the TOML spec states that "Each underscore must be surrounded by at least one digit.") At the moment, the numberLiteral implementation allows only a single underscore between digits, and 5_______2 produces the error "Expecting: decimal digit" at the second underscore. This is actually what I want since I'm interested in building a TOML parser, but anyone trying to parse F#, or their own language where they've followed F#'s syntax rules, might want a second flag AllowMultipleSeparatorsBetweenDigits (which would be false to parse TOML but true to parse F#).

Or perhaps that second flag isn't worth adding just yet, until we see whether anyone complains about 5_______2 not being allowed. Hmmmm. I see three possible options:

  1. Add a second flag so that 5_____2 can be considered an error (for TOML parsing) or valid (for F# parsing), according to whatever the library's end users need. Pro: maximum flexibility. Con: an ever more complicated numberLiteral function, which is already complicated enough.
  2. Change the implementation to allow 5_____2. Pro: Can parse F# code. Con: My use case (TOML parsing) is broken, unless the TOML spec changes to allow multiple underscores.
  3. Release as-is. Pro: Least work, my use case works. Con: Some valid F# or Java numeric literals would be unparseable with the numberLiteral function as it is right now. (Though really, why would anyone WANT to write 5______2 unless you're trying to obfuscate your code?)

I'll poke around some more and offer further thoughts in a few days.

@rmunn
Copy link
Contributor Author

rmunn commented Feb 5, 2018

There's a second language syntax besides TOML that disallows multiple underscores: Python just added underscores in numeric literals in Python 3.6, but the rules are "just one underscore between digits", like TOML. And it looks like TOML won't be changing. So there are now two broad "families" of language specs that allow underscores in numeric literals:

  1. Multiple underscores allowed between any pair of digits: C#, F#, Java
  2. Only one underscore allowed between any pair of digits: TOML, Python

Since both styles have multiple examples that aren't likely to change, it looks like alternative 1 from my previous comment is the way to go. I propose another flag, maybe AllowRunsOfDigitSeparator or AllowDigitSeparatorRuns or AllowMultipleDigitSeparators, to pick between F# style (multiple underscores allowed between any pair of digits) and Python style (a single underscore allowed between any pair of digits). Then the code that skips a single digitSeparator character would be rewritten to check the second flag and, if the second flag is set, allow skipping multiple digitSeparator characters before the next digit. I hope this will not add too much complexity to the feature, but I think the extra complexity is going to be necessary to allow parsing both F#/C#/Java and Python/TOML.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants