Add custom number parsing #984

CrazyInfin8 · 2021-04-23T02:58:24Z

Added a custom number parser so we don't have to rely on strtoll or strtod

This PR will add support for parsing octal, binary, and hexadecimal number systems as well as provide support for separating digits with underscores for better organization. Because of this I will probably close #945 and #946.

I think parsing the numbers inside the VM is better than re-implementing strtod and strtoll because we already validate the number so why not parse it at the same time? There are also some slight complexities when using strtoll and strtod that Wren just doesn't use so this way is a tad bit simpler as well. Lastly the parser only parses the values to a double unlike strtoll that parses to a 64-bit integer then converts to a double which would allow for more (albeit slightly less accurate) digits in a hex, oct, and bin literal.

Now a couple of questions I have.

I've noticed that the literal: 0x is a valid in wren when compiling but is not valid when calling Num.fromString. I decided to patch it so that the literal 0x is a compile time error to be more consistent.
When calling Num.fromString, an invalid parse returns null instead of throwing an error or printing what it has so far. I get the argument that it is simpler to handle null than to handle error but I also noticed that a literal that is too large throws an error (see this test). I decided to copy this behavior but in my opinion I think we should either return null on all invalid parses if we do it for simplicity or return errors explaining why the parse went wrong as to be helpful.

ruby0x1 · 2021-04-23T03:05:33Z

from a quick glance there seems to be some duplication for some of the logic, with this kind of nuance it would be ideal to share it. Have you tried to consolidate it to a shared function or maybe that wasn't feasible?

CrazyInfin8 · 2021-04-23T03:16:02Z

One thing I wasn't sure was whether ObjString*.value is always null-terminated so there were plenty of added checks to verify that I'm not incrementing past the string provided (although I would actually have to test if the parser could have broken out of a non null terminated string though which I have not done so yet). If I can confirm that they are always null terminated, then the checks could be removed making the two functions more or less the same.

ruby0x1 · 2021-04-23T03:20:48Z

A quick look at the source seems to suggest it is...

https://github.com/wren-lang/wren/blob/main/src/vm/wren_value.c#L684

I haven't checked if every path goes via this one, but seems to be the case.

Edit: It should be noted though that ObjString* can contain null bytes within it, so it's context sensitive whether it matters or not that it's always null terminated. If it's dealing with user strings then it's not safe to assume that.

CrazyInfin8 · 2021-04-23T03:33:14Z

The other thing is where readNumber takes a parser, it calls lexError when it fails while Num.fromString returns null when it fails. I could probably pass in a function or pass in a string pointer to get error messages if it is mandatory to merge them together.

readNumber also doesn't check for a leading negative or prefix like Num.fromString does as well but I guess this may be negligible.

If it's dealing with user strings then it's not safe to assume that.

I'll leave in the check for now at least until we can say for certain that it is null terminated. I guess I'm just a little too paranoid 😅

ChayimFriedman2 · 2021-04-24T19:44:19Z

Unfortunately, parsing doubles is one of the hardest things in the universe 😄

Doing it correctly requires bigfloats, or at least bigints.

Although there are shortcuts, they never cover all of the cases.

As an example, the following number will throw an error (incorrectly) in your parser:

100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

I do support not using C's strtod(), but don't roll your own.

mhermier · 2021-04-25T06:14:22Z

We will be forced to do it to some extends. Considering all the features we want, it will be hard to find an implementation that has the correct licence, is correct and follow (or at minimum is adaptable to) all our requirements...

data-man · 2021-04-25T10:58:34Z

Really fast float parsing: https://github.com/fastfloat/fast_float
But it's C++.

PureFox48 · 2021-04-25T11:01:22Z

Well, if custom number parsing is so hard, why not just do #945 which everyone seems to think is a good idea and can be done with what we already have.

Personally, I'd be inclined to just put #946 on the back-burner for now. This feature was introduced recently into Go (along with binary and octal literals) but I've rarely used it. Most of the time where I'd want to use separators, it's for decimal numbers which are exact powers of ten so it's easier to just write them in the form 1eN anyway.

CrazyInfin8 · 2021-04-25T23:28:04Z

As an example, the following number will throw an error (incorrectly) in your parser:

100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

hmm my build seems to parse that number successfully.

System.print(Num.fromString("100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"))
// 1e+269
System.print(100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000)
// 1e+269

results seem equivalent to browsers console. If I were to put 309 zeroes or more, then it will start spitting errors which is equivalent to how the compiler behaved when it was using strtod.

CrazyInfin8 · 2021-04-25T23:31:37Z

I do have a new idea to make one function that both Num.fromString and readNumber can use instead of writing duplicate code but I am also wondering in which file something like that belongs to. could anyone guide me to which file is the best place to have a number parser function for both wren_compiler.c and wren_core.c?

ruby0x1 · 2021-04-25T23:34:15Z

wren_utils.c or maybe a new one should be fine to start with

ChayimFriedman2 · 2021-04-26T01:50:37Z

results seem equivalent to browsers console. If I were to put 309 zeroes or more, then it will start spitting errors which is equivalent to how the compiler behaved when it was using strtod.

Missed what you do with the exponent. But you still have rounding problems. Try with 6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125E-316, for example.

For deeper dive into this, I suggest you to read How strtod() Works (and Sometimes Doesn't).

CrazyInfin8 · 2021-04-26T02:55:25Z

Missed what you do with the exponent. But you still have rounding problems. Try with 6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125E-316, for example.

Again, this behaves just like how the old parsing method did using strtod. strtod sets errno whenever the normalized result would be less than DBL_MIN (assuming that the value was not meant to be zero). DBL_MIN is defined as 2.2250738585072014e-308 even though doubles can technically be as small as 4.9406564584124654e-324 by impacting the mantissa. I chose to stick with this behavior because I wasn't sure if it was done this way for a specific reason but it isn't to hard to check if the results is zero while the value is not meant to be zero if we wish to have exponents less than -308.

One potential issue I may be noticing is I don't actually check if the exponent will overflow as yet when adding or incrementing but I am planning to consolidate the two areas I parse numbers to one function so can add that in the next commit.

For deeper dive into this, I suggest you to read How strtod() Works (and Sometimes Doesn't).

For my previous PR I had to do some research as to how doubles work since I wasn't completely sure why my implementations were off. I did try creating an alternative to strtod here which functions very close to strtod but I had issues with the compiler complaining that const char was used as a non constant. I don't think we use all of the features of strtod though so I thought not trying to replicate strtod would be simpler in the long run.

ChayimFriedman2 · 2021-04-26T03:55:57Z

If you ignore overflows, try 6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125.

CrazyInfin8 · 2021-04-26T04:40:43Z

If you ignore overflows, try 6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125.

I'm not sure I follow what your trying to tell me.

This does print 6.6312368714698 on both this version of the parser and the original one. Plugging into the JS console, I managed to get some extra decimal places (6.631236871469758) so I think this is an issue with converting number to string rather than parsing string to number. We aren't going to actually get all of the digits saved exactly as the mantissa can only store integers that are at most 53-bit in size. 10(2^(53-1)). With this function log10(2^(53-1)) we can expect that at most 16 base 10 digits can be stored.

The goal of this PR is not to surpass the original method of parsing by accuracy but by convenience, although I do attempt to try to be as accurate as the original method by using long doubles where data could potentially be lost.

ChayimFriedman2 · 2021-04-26T04:42:25Z

You need to print all of the digits (the problem is usually with the last binary one): try with printf()'s %a specifier.

CrazyInfin8 · 2021-05-01T03:15:15Z

The parser is now consolidated into one function wrenParseNum in wren_utils.c.
Added comments to explain how the parsing works
Added overflow detection (only tested if the exponent part of a scientific number will overflow. Checking if too many digits will overflow is somewhat impractical as it would require more than 2GBs of numbers)

Some changed behavior:

Num.fromString No longer throws errors when the number would be too big or small. Instead it simply returns Null like it does for any other unsuccessful parse.
values are checked to see if they are infinity or if they are zero when the string provided is not zero. This is slightly different to how strtod sets errno since strtod checks if the value would be smaller than 2.2250738585072014e-308 to determine it is too small whereas the value can be as small as 4.9406564584124654e-324 before it is zero. This behavior seems more inline to how many other languages behave (although they also usually allow zero and infinity as values as well instead of throwing errors but this may be a discussion for a different issue).
Some error messages are changed to say that values are too big or to small instead of just saying it is too large. Also some values are slightly different in that they do not handle formatting like they originally did.

Add custom number parsing

8345297

minor fixes and consistancy changes

7805d3c

CrazyInfin8 added 3 commits April 30, 2021 18:28

Consolidate number parsing to one function

3f5a16a

Add overflow detection. Minor cleanups and readability improvements

d49e571

More minor readability improvements.

a33359a

CrazyInfin8 mentioned this pull request May 2, 2021

Add optional argument to specify base in Num.fromString #1001

Draft

include limits.h to build on pi

cff5637

This was referenced Jan 5, 2023

Num.fromString(" ") returns 0 #1131

Open

Fix null convertion in Num::fromString (#1131) #1132

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add custom number parsing #984

Add custom number parsing #984

CrazyInfin8 commented Apr 23, 2021 •

edited

Loading

ruby0x1 commented Apr 23, 2021

CrazyInfin8 commented Apr 23, 2021

ruby0x1 commented Apr 23, 2021 •

edited

Loading

CrazyInfin8 commented Apr 23, 2021

ChayimFriedman2 commented Apr 24, 2021

mhermier commented Apr 25, 2021

data-man commented Apr 25, 2021

PureFox48 commented Apr 25, 2021

CrazyInfin8 commented Apr 25, 2021

CrazyInfin8 commented Apr 25, 2021

ruby0x1 commented Apr 25, 2021 •

edited

Loading

ChayimFriedman2 commented Apr 26, 2021

CrazyInfin8 commented Apr 26, 2021 •

edited

Loading

ChayimFriedman2 commented Apr 26, 2021

CrazyInfin8 commented Apr 26, 2021 •

edited

Loading

ChayimFriedman2 commented Apr 26, 2021

CrazyInfin8 commented May 1, 2021

Add custom number parsing #984

Are you sure you want to change the base?

Add custom number parsing #984

Conversation

CrazyInfin8 commented Apr 23, 2021 • edited Loading

ruby0x1 commented Apr 23, 2021

CrazyInfin8 commented Apr 23, 2021

ruby0x1 commented Apr 23, 2021 • edited Loading

CrazyInfin8 commented Apr 23, 2021

ChayimFriedman2 commented Apr 24, 2021

mhermier commented Apr 25, 2021

data-man commented Apr 25, 2021

PureFox48 commented Apr 25, 2021

CrazyInfin8 commented Apr 25, 2021

CrazyInfin8 commented Apr 25, 2021

ruby0x1 commented Apr 25, 2021 • edited Loading

ChayimFriedman2 commented Apr 26, 2021

CrazyInfin8 commented Apr 26, 2021 • edited Loading

ChayimFriedman2 commented Apr 26, 2021

CrazyInfin8 commented Apr 26, 2021 • edited Loading

ChayimFriedman2 commented Apr 26, 2021

CrazyInfin8 commented May 1, 2021

CrazyInfin8 commented Apr 23, 2021 •

edited

Loading

ruby0x1 commented Apr 23, 2021 •

edited

Loading

ruby0x1 commented Apr 25, 2021 •

edited

Loading

CrazyInfin8 commented Apr 26, 2021 •

edited

Loading

CrazyInfin8 commented Apr 26, 2021 •

edited

Loading