Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom number parsing #984

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

CrazyInfin8
Copy link
Contributor

@CrazyInfin8 CrazyInfin8 commented Apr 23, 2021

Added a custom number parser so we don't have to rely on strtoll or strtod

This PR will add support for parsing octal, binary, and hexadecimal number systems as well as provide support for separating digits with underscores for better organization. Because of this I will probably close #945 and #946.

I think parsing the numbers inside the VM is better than re-implementing strtod and strtoll because we already validate the number so why not parse it at the same time? There are also some slight complexities when using strtoll and strtod that Wren just doesn't use so this way is a tad bit simpler as well. Lastly the parser only parses the values to a double unlike strtoll that parses to a 64-bit integer then converts to a double which would allow for more (albeit slightly less accurate) digits in a hex, oct, and bin literal.


Now a couple of questions I have.

  1. I've noticed that the literal: 0x is a valid in wren when compiling but is not valid when calling Num.fromString. I decided to patch it so that the literal 0x is a compile time error to be more consistent.
  2. When calling Num.fromString, an invalid parse returns null instead of throwing an error or printing what it has so far. I get the argument that it is simpler to handle null than to handle error but I also noticed that a literal that is too large throws an error (see this test). I decided to copy this behavior but in my opinion I think we should either return null on all invalid parses if we do it for simplicity or return errors explaining why the parse went wrong as to be helpful.

@ruby0x1
Copy link
Member

ruby0x1 commented Apr 23, 2021

from a quick glance there seems to be some duplication for some of the logic, with this kind of nuance it would be ideal to share it. Have you tried to consolidate it to a shared function or maybe that wasn't feasible?

@CrazyInfin8
Copy link
Contributor Author

One thing I wasn't sure was whether ObjString*.value is always null-terminated so there were plenty of added checks to verify that I'm not incrementing past the string provided (although I would actually have to test if the parser could have broken out of a non null terminated string though which I have not done so yet). If I can confirm that they are always null terminated, then the checks could be removed making the two functions more or less the same.

@ruby0x1
Copy link
Member

ruby0x1 commented Apr 23, 2021

A quick look at the source seems to suggest it is...

https://github.com/wren-lang/wren/blob/main/src/vm/wren_value.c#L684

I haven't checked if every path goes via this one, but seems to be the case.

Edit: It should be noted though that ObjString* can contain null bytes within it, so it's context sensitive whether it matters or not that it's always null terminated. If it's dealing with user strings then it's not safe to assume that.

@CrazyInfin8
Copy link
Contributor Author

The other thing is where readNumber takes a parser, it calls lexError when it fails while Num.fromString returns null when it fails. I could probably pass in a function or pass in a string pointer to get error messages if it is mandatory to merge them together.

readNumber also doesn't check for a leading negative or prefix like Num.fromString does as well but I guess this may be negligible.

If it's dealing with user strings then it's not safe to assume that.

I'll leave in the check for now at least until we can say for certain that it is null terminated. I guess I'm just a little too paranoid 😅

@ChayimFriedman2
Copy link
Contributor

Unfortunately, parsing doubles is one of the hardest things in the universe 😄

Doing it correctly requires bigfloats, or at least bigints.

Although there are shortcuts, they never cover all of the cases.

As an example, the following number will throw an error (incorrectly) in your parser:

100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

I do support not using C's strtod(), but don't roll your own.

@mhermier
Copy link
Contributor

We will be forced to do it to some extends. Considering all the features we want, it will be hard to find an implementation that has the correct licence, is correct and follow (or at minimum is adaptable to) all our requirements...

@data-man
Copy link

Really fast float parsing: https://github.com/fastfloat/fast_float
But it's C++.

@PureFox48
Copy link
Contributor

Well, if custom number parsing is so hard, why not just do #945 which everyone seems to think is a good idea and can be done with what we already have.

Personally, I'd be inclined to just put #946 on the back-burner for now. This feature was introduced recently into Go (along with binary and octal literals) but I've rarely used it. Most of the time where I'd want to use separators, it's for decimal numbers which are exact powers of ten so it's easier to just write them in the form 1eN anyway.

@CrazyInfin8
Copy link
Contributor Author

As an example, the following number will throw an error (incorrectly) in your parser:

100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

hmm my build seems to parse that number successfully.

System.print(Num.fromString("100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"))
// 1e+269
System.print(100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000)
// 1e+269

results seem equivalent to browsers console. If I were to put 309 zeroes or more, then it will start spitting errors which is equivalent to how the compiler behaved when it was using strtod.

@CrazyInfin8
Copy link
Contributor Author

I do have a new idea to make one function that both Num.fromString and readNumber can use instead of writing duplicate code but I am also wondering in which file something like that belongs to. could anyone guide me to which file is the best place to have a number parser function for both wren_compiler.c and wren_core.c?

@ruby0x1
Copy link
Member

ruby0x1 commented Apr 25, 2021

wren_utils.c or maybe a new one should be fine to start with

@ChayimFriedman2
Copy link
Contributor

results seem equivalent to browsers console. If I were to put 309 zeroes or more, then it will start spitting errors which is equivalent to how the compiler behaved when it was using strtod.

Missed what you do with the exponent. But you still have rounding problems. Try with 6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125E-316, for example.

For deeper dive into this, I suggest you to read How strtod() Works (and Sometimes Doesn't).

@CrazyInfin8
Copy link
Contributor Author

CrazyInfin8 commented Apr 26, 2021

Missed what you do with the exponent. But you still have rounding problems. Try with 6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125E-316, for example.

Again, this behaves just like how the old parsing method did using strtod. strtod sets errno whenever the normalized result would be less than DBL_MIN (assuming that the value was not meant to be zero). DBL_MIN is defined as 2.2250738585072014e-308 even though doubles can technically be as small as 4.9406564584124654e-324 by impacting the mantissa. I chose to stick with this behavior because I wasn't sure if it was done this way for a specific reason but it isn't to hard to check if the results is zero while the value is not meant to be zero if we wish to have exponents less than -308.

One potential issue I may be noticing is I don't actually check if the exponent will overflow as yet when adding or incrementing but I am planning to consolidate the two areas I parse numbers to one function so can add that in the next commit.

For deeper dive into this, I suggest you to read How strtod() Works (and Sometimes Doesn't).

For my previous PR I had to do some research as to how doubles work since I wasn't completely sure why my implementations were off. I did try creating an alternative to strtod here which functions very close to strtod but I had issues with the compiler complaining that const char was used as a non constant. I don't think we use all of the features of strtod though so I thought not trying to replicate strtod would be simpler in the long run.

@ChayimFriedman2
Copy link
Contributor

If you ignore overflows, try 6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125.

@CrazyInfin8
Copy link
Contributor Author

CrazyInfin8 commented Apr 26, 2021

If you ignore overflows, try 6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125.

I'm not sure I follow what your trying to tell me.

This does print 6.6312368714698 on both this version of the parser and the original one. Plugging into the JS console, I managed to get some extra decimal places (6.631236871469758) so I think this is an issue with converting number to string rather than parsing string to number. We aren't going to actually get all of the digits saved exactly as the mantissa can only store integers that are at most 53-bit in size. 10(2^(53-1)). With this function log10(2^(53-1)) we can expect that at most 16 base 10 digits can be stored.

The goal of this PR is not to surpass the original method of parsing by accuracy but by convenience, although I do attempt to try to be as accurate as the original method by using long doubles where data could potentially be lost.

@ChayimFriedman2
Copy link
Contributor

You need to print all of the digits (the problem is usually with the last binary one): try with printf()'s %a specifier.

@CrazyInfin8
Copy link
Contributor Author

  • The parser is now consolidated into one function wrenParseNum in wren_utils.c.
  • Added comments to explain how the parsing works
  • Added overflow detection (only tested if the exponent part of a scientific number will overflow. Checking if too many digits will overflow is somewhat impractical as it would require more than 2GBs of numbers)

Some changed behavior:

  • Num.fromString No longer throws errors when the number would be too big or small. Instead it simply returns Null like it does for any other unsuccessful parse.
  • values are checked to see if they are infinity or if they are zero when the string provided is not zero. This is slightly different to how strtod sets errno since strtod checks if the value would be smaller than 2.2250738585072014e-308 to determine it is too small whereas the value can be as small as 4.9406564584124654e-324 before it is zero. This behavior seems more inline to how many other languages behave (although they also usually allow zero and infinity as values as well instead of throwing errors but this may be a discussion for a different issue).
  • Some error messages are changed to say that values are too big or to small instead of just saying it is too large. Also some values are slightly different in that they do not handle formatting like they originally did.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants