Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN and infinity float literals #16

Merged
merged 8 commits into from
Jan 14, 2016

Conversation

jonathanvdc
Copy link
Contributor

This PR adds the following named floating-point literals. There names and values are listed below. I went will all-lowercase names because of the precedent set by @null, @void, @true and @false.

  • @inf_d - double.PositiveInfinity
  • -@inf_d - double.NegativeInfinity
  • @nan_d - double.NaN
  • @inf_f - float.PositiveInfinity
  • -@inf_f - float.NegativeInfinity
  • @nan_f - float.NaN

I think I also fixed a bug in BaseLexer.Match(int, int), which surfaced while I was rewriting the grammar. It seemed like the if and else cases were reversed there, and the lexer LLLPG generated wouldn't accept the new float literals until I changed that.

I also threw in some test cases for these new float literals, which seem to work fine.

Any comments or suggestions are most welcome.

PS: Visual Studio kept complaining that the end line sequences in BaseLexer.cs were inconsistent (which is probably my fault), so I told it to go ahead and reformat all line endings in that file as LF, which, at first glance, seemed to be the line ending scheme the other files were using. Sorry about that.

@qwertie
Copy link
Owner

qwertie commented Nov 23, 2015

I've been busy... sorry about the delay. You've made an excessive number of whitespace changes, which is very distracting. Um... wish you hadn't done that.

There is already a function ParseIdValue that detects named literals. The code for detecting inf/nan should go in there. Since it probably isn't convenient to directly detect negatives like -@inf_f (and they make me uncomfortable anyway), I would define instead @inf_f, @-inf_f, @inf_d and @-inf_f. Also, -@nan_f shouldn't be considered invalid - even if it isn't supported directly as a literal (and I think it could be allowed as a literal since positive and negative NaNs do exist), it would still make perfect sense as a tree @-(nan_f).

@jonathanvdc
Copy link
Contributor Author

Yeah, detecting inf and nan literals in ParseIdValue would be a bit simpler than the current implementation. I picked -@inf_f for negative infinity literals to make them analogous to regular number literals, as per your suggestion in #15 (comment), but I'm just as fine with the @-inf_f format. I guess it's a trade-off either way, as implementing the latter format would require a special-case in the printer, instead of the lexer. (This is based on the assumption that an id node -inf_f gets printed as @-inf_f right now. I haven't actually tested if that's the case, so feel free to correct me on that.)

Negative NaNs sure do exist, and I may have overlooked them when adding NaN literals to the lexer. But even with negative NaN literals, there would still be a whole bunch of NaNs that we can't encode right now. But according to the Wikipedia article you mentioned, any 32-bit float whose bits look like s111 1111 1xxx xxxx xxxx xxxx xxxx xxxx other than positive and negative infinity is a NaN. Adding a positive NaN and its negation leaves us with 2^24 - 4 other NaNs that cannot be encoded correctly. Perhaps a more general solution is appropriate for this problem, like a format for specifying floating-point numbers' bits as hex numbers.

I'm really sorry about the whitespace. It kind of happened by accident. I suppose I could just kill the current named float literal implementation by reverting all changes to LesLexer.cs, LexLexerGrammar.cs and LesLexerGrammar.les, and then implement your @-inf_f proposal instead. That shouldn't be too hard, and it'd fix the whitespace edits.

Is that something you'd like me to do? Also, what about NaN? Adding nothing more than a negative NaN literal just feels like putting a Band-Aid on a gaping wound.

@qwertie
Copy link
Owner

qwertie commented Nov 24, 2015

Well, you could do it yourself or wait for me, but I probably won't do it soon. I am ill right now and my wedding is on Saturday.

@qwertie
Copy link
Owner

qwertie commented Nov 24, 2015

I'm okay with not being able to represent most nans, for now, since needing more than one is quite unusual. Supposing there were a special literal to represent all the nans bitwise, it still makes sense to have a default NaN with a straightforward name like @nan_d.

I have no opinion about whether @-nan_d should exist. As for the printer, you just have to find the code that handles @false and extend it, I figure

@jonathanvdc
Copy link
Contributor Author

All right. I rolled back my previous named float literal implementation, and replaced it with some ParseIdValue magic. The printer has also been modified to output @-inf_f/@-inf_d instead of -@inf_f/-@inf_d.

Furthermore, I have also fixed an edge case bug that arose from reserving @-inf_f and @-inf_d for their respective literals: identifiers "-inf_f" and "-inf_d" were printed as @-inf_f and @-inf_d, which were then interpreted by the parser as named float literals. The printer now prints these two identifiers as
@-inf_f and `@`-inf_d, which are then parse correctly.

I decided against adding @-nan_f/@-nan_d literals right now. I'm no expert on NaN. I don't know if negative NaNs are a meaningful concept. Adding these literals at this point without a reasonable use-case seems like something that can come back to haunt LES. Besides, they can always be added at some later date if somebody needs them.

Congrats on the wedding, by the way. Get well soon.

qwertie added a commit that referenced this pull request Jan 14, 2016
@qwertie qwertie merged commit 467c17e into qwertie:master Jan 14, 2016
@qwertie
Copy link
Owner

qwertie commented Jan 14, 2016

Hey, I finally got around to reviewing your code. Nice work! I'm happy with it just the way it is, which is really nice 'cause I can just click github's handy "Merge" button. Sorry I took so long! I got proper internet access 10 days ago btw...

@qwertie
Copy link
Owner

qwertie commented Jan 14, 2016

Ack, there is one thing: you used spaces for indents, whereas the Loyc repo uses tabs. I've updated master with tabs. Too bad VS still doesn't support per-project tabs/spaces--or my preferred solution, per-file auto-detection. In Visual Studio, I enable faintly-visible spaces in order to prevent myself from making this particular mistake:
image

@jonathanvdc
Copy link
Contributor Author

Thanks for taking the time to review my changes!

Oh, and sorry about the whole whitespace thing. I'll try too keep that in mind in the future. I wonder if monodevelop has a feature similar to that faintly visible spaces thing (I've been using an Ubuntu desktop lately, so Visual Studio's not much of an option for me right now).

By the way, do you think there's any chance of getting the BLT format upstreamed at some point? It's proven really useful to me as a reliable and compact representation of LNodes, and it's also become the default encoding for on-disk Flame IR. I'm sure that the code in my fork doesn't meet the Loyc repo's standards quite yet, but I'd be happy to elaborate on the format and/or make any necessary changes if you point me in the right direction - I'll see if I can tabify the source files this evening.

@qwertie
Copy link
Owner

qwertie commented Jan 18, 2016

To be honest... I really didn't understand the BLT format as you originally described it. I just had a feeling that it was more complicated than necessary and... well, I didn't feel like expending the effort necessary to understand it. Currently, I can't find the link to the BLT info.

I'm going to level with you... I've been battling depression over the past couple of months, but I recently had a burst of energy, which I put into work on the EC# parser again. This uncovered a bug in LLLPG, and unfortunately I realized that I had no idea how to fix the bug. The problem seems to be that LLLPG is too complicated, especially its prediction analysis. I'd like to rewrite it to use a simplified grammar representation that eliminates loops and other "gratuitous" elements. Maybe then, I thought, I could understand how my own algorithm worked well enough to not only fix that pesky bug, but maybe optimize it and subtly change the way zero-width assertions work... however, when I started working on that, I ran out of energy again.

I keep hoping that the next feature I do is going to finally attract some external attention - what I thought was a small feature, a macro to implement pattern recognition in C#, was my next chance to convince people to sit up and notice EC#. But since a bug in LLLPG is blocking progress, that small feature has become a lot of work and I'm having trouble solving it. Really what I should be doing right now is changing LES so that you don't need a semicolon after a closing brace, and then trying to convince the WebAssembly guys to use LES instead of an ad-hoc notation.

Ack, the power went out. Boo Philippines. Yay laptop! Will send this when the power comes back... huh, the water's out too. How about that.

@jonathanvdc
Copy link
Contributor Author

I'm really sorry to hear that you're feeling that way.

All parser generators, including LLLPG, are indistinguishable from magic to me. There's no way I can provide any advice on that, so I'll present my thoughts on the other topics you mentioned.

Look, I've been reading the WebAssembly design discussions myself, but I doubt adding additional complexity to LES in the form of optional semicolons will make it a more attractive format to the WebAssembly folks. Seems to me like they picked S-expressions because of their (percieved) simplicity. They're using a wide variety of programming languages over there, and mandating all WebAssembly tools to implement a full-fledged LES parser/printer would place a relatively heavy burden on non-.NET implementations.

None of that does in any way diminish Loyc - or LES -, though. I like LES, but no format is going to be all things to all people. Fortunately, Loyc's architecture makes it both feasible and easy to create custom printers/parsers. Using Loyc as a library is fun, too. I had a very positive experience with it when I was creating an on-disk IR format for Flame. Same thing when I created fecs. Given an S-expression printer/parser, I'm sure Loyc could easily be leveraged to create awesome WebAssembly tools, like a macro preprocessor.

If you really want the WebAssembly file format to be LES-like, then I'd recommend offering them a regular, easy-to-implement subset of LES that is exactly what WebAssembly needs, and nothing more. It's okay to aim for world domination with LES, but you can hardly expect that to happen instantaneously. World domination is a gradual process, after all. Just keep moving in the right direction, and you'll get there eventually.

Here's a link to my sad attempt at describing the BLT format: Core/Loyc.Binary/README.md. I have no idea what's supposed to be in a readme for a file format, to be honest. The example in that readme is probably confusing as hell, too. But that's just me being bad at explaining how BLT works. I would gladly answer any and all questions about BLT.

It's not an overly complex format, and its parser is just 412 lines of text, which I think is acceptable. I'm pretty happy about BLT file sizes. Statically linking the core Flame libraries results in a 791.6 kB DLL (CIL) vs an 810.3 kB BLT (IR) vs a 7.1 MB LES (IR).

I though BLT would be a cool addition to Loyc, because binary formats are convenient for inter-program communication. I needed a binary format for Flame, so I can imagine that (future) Loyc users may want one, too.

@qwertie
Copy link
Owner

qwertie commented Jan 19, 2016

Uhh... What part(s) of LES would you say they don't need? Perhaps attributes aren't needed (that's the main difference between Loyc trees and s-exprs), but having them might turn out to be useful. While making semicolons optional will slightly complicate the parser, I figure that moving LES syntax closer to Javascript would please them, as they are experienced browser devs and therefore likely to be "traditionalist".

It's good that you provide an example in the BLT document, because I generally cannot understand specs without seeing a detailed example first. The problem I have understanding the 1 + 2 + x + y example is perhaps that I don't understand what TemplatedNode is, and that I don't know how the textual description corresponds to the bytes in the file. I suggest showing the actual bytes (or ASCII characters where applicable) with two or three columns, like

Offset Bytes Meaning
0 'B' 'L' 'T' BLT 3-byte file-type signature
3 3 Symbol table has 3 items (LEB128 encoding)
4 1 '+' $0: "+"
6 1 'x' $1: "x"
8 1 'y' $2: "y"

@jonathanvdc
Copy link
Contributor Author

I'd say ditching space-sensitivity and Python mode is a good idea, even if the latter has not been implemented yet. Both seem like hacks, which may be undesirable in a format that's supposed to last "forever." Attributes are nice (and easy to implement on a syntactic level), but they also complicate things when it comes to semantics: which attributes can a WebAssembly VM safely ignore, and which attributes does it recognize? WebAssembly wasn't designed with attributes in mind, so maybe they're more trouble than they're worth.

IIRC, PNaCl was undone in part by its dependence on the ever-changing LLVM IR, which made it virtually impossible for the PNaCl project to upgrade to a newer version of LLVM without breaking compatibility. It makes sense for them to be looking for a more stable format now. You're obviously not done making changes to LES yet. The WebAssembly group wants a stable, well-defined file format that'll last for a long time. A mini-version of LES with exactly those properties should work nicely.

As for the BLT file: I am, embarassingly enough, unable to compile Loyc on this machine (Windows-specific pre-build commands, I reckon), which includes the bltc LES <-> BLT converter. I'll try to encode the ((1 + 2) + x) + y node manually below, based on the BLT source code. You got that first part absolutely right, by the way.

Offset Bytes Meaning
0 'B' 'L' 'T' BLT 3-byte file-type signature
3 3 Symbol table has 3 items (LEB128 encoding)
4 1 '+' $0: "+"
6 1 'x' $1: "x"
8 1 'y' $2: "y"
10 2 Template table has 2 items (LEB128 encoding)
11 1, 2, 0 0 0 0, 5, 5 CallIdNode - Symbol $0: "+" (Int32 - Why am I still using this here?) - 2 children (LEB128 encoding): - Int32 - Int32
19 1, 2, 0 0 0 0, 0, 1 CallIdNode - Symbol $0: "+" (Int32) - 2 children (LEB128 encoding): - TemplatedNode - Id
27 1 Number of top-level LNodes in the BLT (LEB128 encoding)
28 0 TemplatedNode (explicit encoding prefix, which states that this node is a templated node)
29 1 Template #1 (this node is a TemplatedNode, because it has an explicit prefix. Only top-level nodes have encoding type prefixes)
30 1 Template #1 (this node is a TemplatedNode, because that's what the parent node's template, i.e. template #1, says)
31 0 Template #0 (same as above, this node is the first argument of a template #1 node)
32 0 0 0 1 Int32 1 (Int32, as per template #0)
36 0 0 0 2 Int32 2 (Int32, as per template #0)
40 2 $2: "y" (Id, as per template #1)
41 1 $1: "x" (Id, as per template #1)

As you can see, a node template defines a data structure of sorts. The "structure" of a BLT's data is defined by the template table, whereas the actual data is defined by the top-level node list.

Looks like I'm still using 32-bit integers to encode id indices in CallIdNode templates. I'll see what I can do about that as soon as I get Loyc to compile on Linux.

@jonathanvdc
Copy link
Contributor Author

Update: got (most) Loyc projects to compile on Linux. Pushed to my fork.

@qwertie
Copy link
Owner

qwertie commented Feb 15, 2016

Hmmm. So looking at template #0, "CallIdNode" (byte 1) refers to "A call node whose target is an id node". The second byte means there are 2 children; the next four bytes refer to symbol 0 which is "+", and its arguments have "type 5" which is one of the 17 types in the README file of BLT (of which 15 are literal types, one is for identifiers, and one is for references to templates.) Hence the bytes 1, 2, 0 0 0 0, 5, 5. The other template is the same except for having different argument types and thus different final bytes.

I should have mentioned this before, but when doing examples of this type, it's best to use disjoint numbers to avoid confusion, e.g. ((8 + 9) + x) + y rather than ((1 + 2) + x) + y because 8 and 9 do not appear in the binary file except as literals, thus a reader knows when he sees "1" that it is not a literal.

Hmm. It makes me uncomfortable that there is a special optimization for "A call node whose target is an id node", and in particular that the optimization is necessary. If the target were encoded the same way as the arguments, your file format would require the "top-level node list" to include three extra bytes for 3 references to "+" ($0).

This is not unprecedented - LNode itself specifically optimizes a call with a Target that is an identifier. But I tend to look at file formats with a more critical eye because of their longevity and resistance to change. And I note that if there were a file like foo(a, b) + bar(c, d) + baz(e, f), the "optimization" would, depending on how the encoder is written, make the file bigger rather than smaller (you'd need 3 CallId templates rather than 1 Call template). It's fair to ask if it's the "right" decision to optimize for cases like f(1, 2); f(3, 4); f(5, 6) in which f is repeated, but not optimize for cases like f(1, null); g(x, null); h(g(), null) in which an argument is repeated (in BLT, f is deduplicated while null is not). To be sure, it's not a bad decision; it just makes me wonder if there is a better approach that could optimize more cases.

Also, Wasm discussions make me wonder whether the binary format should be canonical and "multi-level", i.e. have a multi-stage encode/decode pipeline. BLT is currently flexible, allowing a single Loyc tree to be encoded in numerous ways. JSStats (who somehow isn't really getting along with the other devs, I'm not sure why) has proposed reducing degrees of freedom so that two different "level 0 encoded" Wasm modules would never convert to the same text form. This makes sense to me, since their level 0 is the "uncompressed" binary form, and canonicality could allow "directly" comparing binaries produced from different sources. "Level 1" would be a compression layer, and thus the same module could have infinite level-1 encodings. (There's one drawback to mandating canonicality: a small overhead in the decoder to detect non-canonical files.)

I'm sorry I was slow to respond again. The lack of interest in Loyc tools from everyone but you, combined with that blocking bug in LLLPG, and personal issues, has sapped my energy. I wish I could hang out with you for a hackathon or something, maybe I'd get motivated again. Belgium! This world is too darn big. I'm curious, what do you do for a living? I'm currently unemployed.

There aren't any special build steps in Loyc.sln btw. The only obvious potential problem is that there are more configurations than just "Debug" and "Release", ones for .NET 3.5, .NET 4 and .NET 4.5. I've been meaning to make a solution based on PCLs instead (at the sacrifice of finally dropping support for Visual Studio 2010 and .NET 3.5). How did you fix your problem (building on Linux)?

@qwertie
Copy link
Owner

qwertie commented Feb 15, 2016

Actually let's do any more discussion of BLT in #17.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants