Final changes to LES3 #52

qwertie opened this issue Sep 17, 2017 · 13 comments

@qwertie
Owner

qwertie commented Sep 17, 2017

I'm changing a bunch of things, mainly in order to finalize LESv3. However, the changes to operator precedence/classification affect both LESv2 and LES3, partly because both languages share precedence-choosing code but also for the sake of consistency. These changes are not committed as of today.

!! to suffix operator

To reiterate, !! changed from an infix to a suffix operator based on Kotlin's !! operator. I'm duplicating this change now in LESv2. And I noticed that suffix-operator printing was broken in LESv2 (e.g. a++ came out as `'++suf`a; it parsed fine but looked ugly), so I fixed that.

.dotKeywords will be stored as #hashKeywords and # is an id-char

To save us the trouble of changing the whole Loyc codebase, the dot in a .keyword will be changed into a # in the Loyc tree. Also, # will be an identifier character. Thus .return x is equivalent to #return(x). I'm also reverting the definition of LNode.IsSpecialName to exclude ..
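As a sketch of the stored form (a hypothetical Python helper for illustration; the real Loyc codebase is C#), the transformation is just a one-character swap:

```python
def dot_keyword_to_stored_name(keyword: str) -> str:
    """Map a parsed .keyword to the name stored in the Loyc tree:
    the leading dot becomes '#', so the head of `.return x` is #return."""
    assert keyword.startswith(".")
    return "#" + keyword[1:]

dot_keyword_to_stored_name(".return")  # "#return"
```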

.keyw0rds can now contain digits

.keywords will be allowed to be any valid non-backquoted identifier.

Number parsing change

In order to simplify the lexer by removing a lookahead loop, I'm removing the requirement to have a "p0" clause on hexadecimal floating-point numbers. Consequently, 0x1.Ep0 can be written as 0x1.E, which will be treated as a floating-point number instead of 0x1 . E. However, 0x1.Equals(1) now has the bizarre interpretation quals"0x1.E"(1): the lexer reads the number 0x1.E, treats the following quals as its suffix, and the result is called with the argument 1.
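A rough Python sketch of the relaxed rule (a simplified assumption for illustration; the real lexer is generated from an LLLPG grammar) shows why 0x1.Equals(1) lexes this way:

```python
import re

# Hex literal: hex digits, an optional hex fraction, and a p-exponent
# that is now optional even when a fraction is present. Simplified sketch.
HEX_LITERAL = re.compile(r"0x[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]+)?(?:[pP][+-]?[0-9]+)?")

def lex_one_hex(text):
    """Return (matched literal, remaining text)."""
    m = HEX_LITERAL.match(text)
    return m.group(0), text[m.end():]

lex_one_hex("0x1.Ep0")        # ("0x1.Ep0", "")
lex_one_hex("0x1.E")          # ("0x1.E", "")
lex_one_hex("0x1.Equals(1)")  # ("0x1.E", "quals(1)")  <- the bizarre case
```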

Weird Operators

Given a "weird" operator like x *+ y, the precedence of *+ was previously based on its first character, *. I think the motivation for this was to act similarly to Nim operators. But then I noticed something: MATLAB and Julia have "elementwise" operators like X .* Y which multiply pairs of scalars in matrices X and Y. So I checked several other languages. Apparently, most other languages do not have any "compound" operators, but Swift does: it has &*, &/, &+, &-, and &%. In Swift, MATLAB and Julia, the last character determines the precedence. Because Swift, MATLAB, etc. are more popular than Nim, I'm changing LES to decide precedence based on the last character instead of the first. Sound good @jonathanvdc ?

Earlier I added a !. operator with precedence of . so that Kotlin's !!. operator would have the correct precedence. This change makes the !. operator redundant. Null-dot ?. must still be special-cased as its precedence is lower than ..

Note: Long operators like |%| will continue to be based on the first and last character. In the case of combo operators, like x s>> y, the initial identifier is ignored for the purpose of deciding precedence so that this particular operator has the same precedence as >>.
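A toy Python sketch of the decision (the precedence numbers are assumed for illustration, not Loyc's actual table, and the first-and-last rule for long operators like |%| is omitted): short weird operators take the precedence of their last character, and a combo operator's leading identifier is ignored:

```python
import string

# Assumed illustrative precedence table keyed by a single character.
CHAR_PREC = {'*': 70, '/': 70, '%': 70, '+': 60, '-': 60, '<': 50, '>': 50}

def weird_op_precedence(op: str) -> int:
    # Drop the identifier prefix of a combo operator like "s>>".
    punct = op.lstrip(string.ascii_letters + string.digits + '_')
    return CHAR_PREC[punct[-1]]  # the last character decides

weird_op_precedence("*+")   # same as "+"
weird_op_precedence("s>>")  # same as ">>"
```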

Operators and fractions

I decided that, to better match other programming languages, x*.2 should be parsed as x * 0.2 rather than as x *. 2. However, the tricky thing is that 0..2 and 0...3 should still be parsed as ranges, 0 .. 2 and 0 ... 3 respectively. To achieve this I tried splitting Operator into two rules:

token DotOperator returns [object result] : 
	'.' OpChar* 
	{$result = ParseOp(out _type);};
token Operator returns [object result] : 
	('$'|OpCharExceptDot) (OpCharExceptDot | '.' ((~'0'..'9' | EOF) =>))*
	{$result = ParseOp(out _type);};

[inline] extern OpCharExceptDot :
	'~'|'!'|'%'|'^'|'&'|'*'|'-'|'+'|'='|'|'|'<'|'>'|'/'|'?'|':';
[inline] extern OpChar : 
	'~'|'!'|'%'|'^'|'&'|'*'|'-'|'+'|'='|'|'|'<'|'>'|'/'|'?'|':'|'.';

This is an efficient implementation. However, with this grammar, if you write x.+.2 it is parsed as x .+. 2, which is not necessarily the desired result. It's a perfectly reasonable way to parse, but I think it would be better to match Julia/MATLAB. So I thought of an alternative which actually looks simpler in the grammar:

token Operator returns [object result] : 
	( '$'? (OpCharExceptDot | '.' ((~'0'..'9' | EOF) =>) '.'*)+ / '$' )
	{$result = ParseOp(out _type);};

This turns out to generate more, and slower, code. But for the sake of compatibility, I'll accept that.
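The digit lookahead in the second grammar can be mimicked with a regular expression (a simplified Python sketch that ignores the '$' cases):

```python
import re

# A '.' may start or extend an operator only when the next character is
# not a digit; immediately following dots are then consumed greedily.
OPERATOR = re.compile(r"(?:[~!%^&*\-+=|<>/?:]|\.(?![0-9])\.*)+")

def split_operator(text):
    """Return (operator, remaining text) at the start of `text`."""
    m = OPERATOR.match(text)
    return (m.group(0), text[m.end():])

split_operator("*.2")   # ("*", ".2")   -> x * 0.2
split_operator("..2")   # ("..", "2")   -> range 0 .. 2
split_operator("...3")  # ("...", "3")  -> range 0 ... 3
split_operator(".+.2")  # (".+", ".2")  -> x .+ 0.2, matching Julia/MATLAB
```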

:: operator

I think the precedence of :: should be changed to match C++ and EC#. In LESv2 I chose the syntax x::Loyc.Symbol for variable declarations, which required :: to have a lower precedence than .. Why didn't I just use the syntax x: Loyc.Symbol? I think it was because it conflicted with LES's old "Python mode" where you could write

if c:
    print "c is true!"

Meaning if c { print "c is true!" }. This feature was removed 2016-06-21, I think in order to make the language easier to parse and ensure colon would behave like any other operator.

So I will raise the precedence of :: up to ., but will make the change exclusive to LESv3 because I have a fairly large amount of LESv2 code relying on the old precedence.

Minor implementation detail

?. will now be classified as TokenType.Dot while .* will no longer be classified as TokenType.Dot.

Arrow operators

I have reduced the precedence of arrow operators -> <-. Technically their precedence is now above && but below | ^. My thinking is that some people would like to use arrows as assignment operators: flag <- x > y || y < 0 gets the structure (flag <- (x > y)) || (y < 0).

The previous precedence of -> <- was sort of a compromise between C (which wants high precedence) and other languages that want lower precedence. But arguably the old precedence served neither case very well. Now I'm thinking that people wanting to use a C++-style arrow operator as in obj->f() should pick another operator, such as obj*.f().
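A toy precedence climber (with assumed numeric levels chosen only to satisfy the ordering described above: above && but below | and ^, with arrows treated as right-associative per the later tweak in this thread) shows the intended structure:

```python
# Assumed precedence levels satisfying: || < && < (-> <-) < | ^ < comparisons.
PREC = {'||': 10, '&&': 20, '->': 25, '<-': 25, '|': 30, '^': 30,
        '==': 40, '<': 50, '>': 50}
RIGHT = {'->', '<-'}  # arrows are right-associative

def climb(tokens, min_prec=0):
    """Parse a flat token list into nested (op, lhs, rhs) tuples."""
    lhs = tokens.pop(0)
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        rhs = climb(tokens, PREC[op] + (op not in RIGHT))
        lhs = (op, lhs, rhs)
    return lhs

climb(['flag', '<-', 'x', '>', 'y', '||', 'y', '<', '0'])
# ('||', ('<-', 'flag', ('>', 'x', 'y')), ('<', 'y', '0'))
```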

Continuators

The set of possible continuators should be carefully considered because it cannot be compatibly changed later. Since no one has offered an opinion, I am going to suggest that the set be the ten words else, elseif, elsif, catch, except, finally, while, until, plus, and using, together with the set of all identifiers that begin with two underscores (__etc). Note that instead of double underscores it would seem more natural to add #hashKeywords to the set; however, this creates an ambiguity in code like the following:

    .foo c { f() }
    #bar(x)

I think it's more likely this was intended to be two separate statements, rather than that #bar(x) is intended to continue the .foo statement. I selected __, a traditional "potential reserved word" marker in C++, because it is not currently used for any purpose in Loyc. Unlike continuators like else that will be stored with # in the Loyc tree (#else), double-underscore identifiers will be stored unchanged. Though it could equally be argued that #s should not be added even in the former case.

Since continuators are not allowed to be used as binary operators, I removed and, or, but from the continuator set, thinking that some users would prefer to use them as binary operators.
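The membership test for the proposed rule is then trivial (hypothetical sketch):

```python
# The ten fixed continuator words proposed above.
CONTINUATORS = {'else', 'elseif', 'elsif', 'catch', 'except',
                'finally', 'while', 'until', 'plus', 'using'}

def is_continuator(word: str) -> bool:
    # A word continues the previous statement iff it is one of the ten
    # fixed words or begins with two underscores.
    return word in CONTINUATORS or word.startswith('__')

is_continuator('else')   # True
is_continuator('__etc')  # True
is_continuator('and')    # False: removed so it can be a binary operator
```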

And introducing token lists / prefix-expressions / whatever you call 'em

I proposed that LESv3 not have token literals (unlike LESv2 and EC#) and instead adopt "token lists" such as ' + x 1. No one offered an opinion on this, or about whether '(+ x y) should be represented as `'`(`'()`(`'+`, x, y)) or as `'`(`'+`(x, y)) so I'm somewhat arbitrarily selecting the first representation.
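Using nested tuples as a stand-in for Loyc call nodes (a hypothetical notation where a tuple's first element is the call target), the two candidate representations of '(+ x y) can be sketched side by side:

```python
# Chosen: the quote head `'` wraps a call to `'()` holding the tokens.
chosen = ("'", ("'()", "'+", "x", "y"))    # `'`(`'()`(`'+`, x, y))

# Rejected alternative: the operator itself becomes the call target.
alternative = ("'", ("'+", "x", "y"))      # `'`(`'+`(x, y))
```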

Decisions on other questions previously raised

  • Should attributes be allowed in the middle of an expression, as in foo: @immutable string? Still undecided; currently, no.
  • Should comma be allowed as a separator within braces, as in { a, b, c }? I don't think so. The parser isn't currently complaining about it, but I may add a check, possibly a check dependent on whether { is followed by a string (to carve out an exception for JSON syntax).
  • Should comma be allowed as a separator within tuples as in (x, y)? Yes, but (x,) will be reported as a syntax error, with the parser recovering by treating it as (x, ``).
  • What should # mean? It will be treated as an identifier character like an underscore.
  • Should continuator keywords be bona fide keywords? No, I think not.
  • Edit: Should non-ASCII identifiers be supported without backquotes? I suppose that can wait for the next version.

Idea

Numeric literals can have any identifier as a suffix, including backquoted identifiers. This feature could be used to support my favorite feature that few languages allow: compound unit types like 1.2`kB/record` or 3e8`m/s` . Currently an expression like size `kB` is meaningless, but I suppose it could be defined as some sort of suffix operator which could then be used for unit types. However, this idea has the disadvantage that a numeric value with units would have a different syntactic structure than any other expression with units, and 3`px` would have a different meaning than 3 `px` .

@jonathanvdc
Contributor

Wow, that sounds like a lot of work! This all seems super reasonable to me. I'm especially happy about you deciding to store .dotKeywords as #hashKeywords.

I also like your idea of supporting arbitrary suffixes for numeric literals. How does that interoperate with custom literals? Is 3e8`m/s` parsed more or less like #unit(3e8, "m/s") or does the suffix actually change the literal's Value?

@qwertie
Owner Author

qwertie commented Sep 17, 2017

3e8`m/s` is an ordinary custom literal meaning `m/s`"3e8" and both syntaxes were already supported. The suffix is part of the LNode.Value, although I'm thinking of splitting the suffix from the Value in order to remove the distinction between, say, Value = 5.0 and Value = new CustomLiteral(5.0, (Symbol)"n").

So I'm thinking there should be a new ValueTypeMarker (ValueType?) property which perhaps would be null by default but would be set to n, short for number, by the LESv3 parser when it sees 5.0.

Which reminds me of another change I forgot to mention - I've changed the type marker number to n so that negative literals like n"-1.0" are as short as possible.

@jonathanvdc
Contributor

jonathanvdc commented Sep 17, 2017

Hmmm. Adding a ValueTypeMarker field might be a more efficient representation than using CustomLiteral values. But one thing that I think is great about CustomLiteral instances is that you actually need to handle them separately—Value is double won't evaluate to true for 5.0n. Adding a ValueTypeMarker makes it easy to write code that accidentally ignores custom literals by never even looking at the ValueTypeMarker field.

And people probably aren't going to check ValueTypeMarker unless they expect a custom literal. So if I didn't know custom literals were a thing and I then wrote a simple calculator program, that program would probably evaluate 1m + 2mi to 3, which is pretty deceptive and probably not what anyone intended for the program to do.

@qwertie
Owner Author

qwertie commented Sep 18, 2017

Um, three things. First, there is supposed to be no semantic difference between 5 and n"5" as n is the default type marker for numbers. Yet there are at least three seemingly legitimate possible values of LNode.Value - 5, new CustomLiteral(5, @@n), and new CustomLiteral("5", @@n). So that's awkward. What I'm proposing at least reduces this from three to two plausible representations.

Second, there are multiple type markers that produce, say, a 32-bit integer. You can write n"5" to use the general number parser, or i32"5" to ask for a 32-bit result specifically. It would be nice, but I guess not crucial, to be able to round-trip such differences.

Third, I want to port LES to other languages. I don't want the APIs in different languages to be much different from each other. So consider JavaScript, which has no integer type. What, then, is the distinction between 5 and 5.0? How do we store 5 and 5.0 in such a way that they don't get mixed up? It's dumb for all simple literals to need a separate CustomLiteral object. But if there's a separate valueType attribute, we can store the distinction in there. (however, obviously it's no good if both of them use a type marker of n - I don't think I should add i for integer as it occurs to me that i should mean "imaginary", but I can use i32 and i64 for ints.)

@qwertie
Owner Author

qwertie commented Sep 19, 2017

P.S. given an unknown suffix like 2mi, the LESv3 parser actually stores "2" as a string.

@jonathanvdc
Contributor

I think it's interesting that suffixes don't always make a difference semantically and that all custom literals with unknown suffixes are parsed as strings. I also seem to recall that you said you would've liked to parse custom literals that are actually integers/floats as what they are instead of strings.

Maybe the root cause of most custom literal–related issues is that custom literals and value types are orthogonal, but they're represented using the same prefix/suffix scheme.

Here's a strawman proposal: drop the CustomLiteral type and instead wrap literals in call nodes to create custom literals, like so: #unit(2, @@mi). Advantages include:

  1. A valueTypeMarker field can make sure i32"1" round-trips fine without making custom literals trickier.
  2. Less ambiguity. If custom literals are encoded as call nodes, then there is no confusion between 5, new CustomLiteral(5, @@n), etc.
  3. Custom literals don't need to be strings, which means that they don't have to be parsed twice (once by the parser, again when a CustomLiteral is encountered during a tree walk).
  4. Identifiers and call nodes can have units too. Languages that use units will probably need to be able to accept something like x miles.
  5. This doesn't have to be mutually exclusive with a dedicated syntax in LES. Perhaps a suffix in backticks can be treated as syntactic sugar for a #unit call node.
  6. It doesn't require non-LES round-tripping parsers/printers to have special syntax for custom literals. That's useful for languages that don't have built-in support for units, like EC#.

I know it's a completely different approach, but I feel like it solves a lot of issues with custom literals. What do you think?

@qwertie
Owner Author

qwertie commented Sep 20, 2017

I'm not going to call this #unit because I consider the question of how to represent units as an orthogonal, higher-level concern. I'll use #customLiteral instead.

I probably don't understand your proposal because you refer to a valueTypeMarker field but I do not see any role for a valueTypeMarker in your proposal. It looks at first like you are proposing that 2mi should produce three LNodes (actually 4 at the conceptual level), one of which is a literal with Value=(Symbol)"mi". I will therefore ignore point (1) which makes no sense in the context of the previous paragraph.

So let me first demolish my faulty interpretation of your strawman.

[edit: struck out because arguing against a false interpretation of a strawman is such a silly exercise] I presume under your idea there is a different representation for 2mi and mi"2", namely #customLiteral(2, @@mi) versus #typeMarker("2", @@mi)? This would seem to imply that negative numbers must always be stored as strings, while positive numbers can be stored either as strings or not-strings depending on how they are written. It would also imply that the distinction between, say, 13_6xx and 1_36xx is lost; and since we have no idea what xx means, we also have no idea whether the user might have wanted that difference to be significant. And how should LES react to giant numbers like 0x1_0000_0000_0000_0001xx? The increased memory use of #typeMarker("2", @@mi) versus a custom literal is also a downside.

The biggest problem I have with this idea relates to one of the motivations for introducing custom literals in the first place. Namely, literals are no longer orthogonal to calls, as the two are conflated. If I create a Loyc tree programmatically with the structure #customLiteral(5, @@i32), it'll be printed out as 5i32 and then upon parsing it becomes the non-custom literal 5. Thus it is impossible to round-trip the call #customLiteral(5, @@i32).

Hmm, reading over your "advantages list" it seems that what you really want to propose is a "units" thing, not a "custom literal" thing. Please don't confuse the two, they are totally different concepts that would exist for totally different reasons. The "Idea" regarding units that I wrote down earlier is, as I implied, imperfect. It was an idea about co-opting custom literals to represent units - consider it an interesting hack that one could use if LES does not support units in any other way.

@jonathanvdc
Contributor

I'm not particularly invested in the #unit symbol—#customLiteral seems more accurate.

I'm sorry that I didn't explain what I meant in depth. I'll try to sketch my motivations for proposing to use a #customLiteral node.

First off, right now a prefix/suffix can mean one of two things.

  1. A type marker that tells the parser how a literal should be parsed, e.g., i32, n. Such a marker may or may not be semantically relevant, but we'd like to preserve it so we can round-trip literals. Crucially, I expect that only the parser and the printer will care about type markers because applications that accept a Loyc tree tend to care about what a literal's Value is, but not about whether it was spelled n"5" or 5i32. Just like ecsc doesn't care whether an integer literal is in hex or in decimal. So I propose to encode a type marker as a field (if I understand correctly, valueTypeMarker) in literal nodes. That way, the type marker does not require an extra object and non–printer/parser logic can ignore them.

  2. A custom literal prefix/suffix turns a string literal into something else. I don't know of any fundamental reason why custom literals should always be represented by strings: for a lot of use cases (for example, units), representing custom literals by numbers makes more sense. And if you're going to represent a custom literal as a number, then you might want to use, say, a single-precision float. AFAICT, this is pretty much impossible if you use the current syntax—a number can either have an f suffix or a mi suffix, but not both. So I propose that custom literals and type markers each get their own underlying representation and LES syntax. For example: 2.0f, 2.0`mi` and 2.0f`mi` .

Furthermore, it's the job of some separate component outside of the printer/parser to interpret custom literals. So we probably shouldn't hide the custom literal suffix in a field of a literal node (like valueTypeMarker). Instead we should try and make sure that whatever logic is responsible for handling custom literals doesn't accidentally ignore them by failing to check a field.

I think a call to #customLiteral is the least bad way to represent a custom literal. A possible alternative is to keep using the CustomLiteral type, but I think that's rather awkward given that a hypothetical valueTypeMarker in the literal node applies to the custom literal's underlying storage; new LiteralNode(new CustomLiteral(5.0, @@mi), "n").Print() would come out 5.0n`mi` , which is kind of hacky. I'm not fundamentally opposed to using CustomLiteral, though.

Does that make more sense? Would you like me to elaborate on some specific aspect of this proposal?

@qwertie
Owner Author

qwertie commented Sep 21, 2017

I don't know of any fundamental reason why custom literals should always be represented by strings

They are not always represented by strings.

custom literals and type markers

Type markers are what make custom literals custom. I'm not sure that you understand why I created custom literals and type markers, so let me explain.

The vision is this: there will be X LES parsers written by Y people in Z languages. I would hope that someday Z>20, you know? Now, each programming environment is different, so each one will have a set of literals that it supports... and another set that it does not support. Most will have built-in support for string, float64, float32, int32 and characters (though not necessarily proper 21-bit characters), but not all of them. Many will support int64, BigInt, Symbol and small integers (byte, short, etc.). Some will support regexes, some might even support unums.

The purpose of custom literals is to handle literals in a uniform way across the myriad environments where LES may be used, in a way that is (1) future-proof and (2) allows unknown literals to be round-tripped by lexers that don't understand them. I want to avoid a rigid framework that says "these are the standard literals which everyone should support, and everything else is non-standard". Instead, quite the opposite: the only thing that will be required from an LES parser is the ability to store a literal as a string. Everything else is optional.

And today it occurred to me that lexers should by convention support a plug-in literal parsing system so users can expand (or even clear) the set of supported literal types. Probably the parser collection should even be a separate object from the lexer itself.

If a literal type is unknown, then it should be stored as a string. It should not be stored as a number even if it is written as a number, partly because in general it's infeasible. For example, consider this number:

 1234567890_1234567890_1234567890.1234567890_1234567890e+1999f64

Some environments may be able to store this in a numeric form, but most can't. So rather than attempt to parse it, I believe it's better to keep it as a string. And remember, if you want to write a negative literal, it must be written as a string:

f64"-1234567890_1234567890_1234567890.1234567890_1234567890e+1999"

So the use of string versus number notation is not a strong signal about whether the user wants it to be parsed as a string or as a number.
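This policy can be sketched as a plug-in parser table with a string fallback (the names and table contents here are assumptions for illustration, not Loyc's actual API):

```python
# Assumed plug-in table mapping type markers to parse functions.
PARSERS = {'i32': int, 'f64': float, 'n': float}

def parse_literal(text: str, marker: str):
    """Parse a literal if its type marker is known; else keep the string."""
    parser = PARSERS.get(marker)
    if parser is None:
        return text               # unknown marker: keep the raw string
    try:
        return parser(text.replace('_', ''))
    except (ValueError, OverflowError):
        return text               # unparseable: keep the raw string

parse_literal("5", "i32")   # 5
parse_literal("-1.0", "n")  # -1.0
parse_literal("2", "mi")    # "2"  (unknown marker, stored as a string)
```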

In fact, some users will explicitly want to avoid parsing literals because a parsed literal contains less information than the original literal, e.g. 12_3 and 1_23 parse to the same thing and so the original information is lost unless some special measures are taken. (The Loyc lexers use NodeStyle to preserve the distinction between hex, binary and decimal literals, but that's all they can preserve.)

If your goal is to load an LES file and, say, find and replace all instances of Pow(x,y) with x**y, you don't actually want to change 12_3 to 123 as a side-effect (though you will change the spacing - because writing an entire Roslyn clone by myself is not my goal.)

Now, in the design of LES we must consider not only the conversion of LES to Loyc trees, but the reverse conversion also. So here are a couple of things to consider:

  1. At first it seems sensible that we only need to support a sequence of letters as the type marker, e.g. 123abcdefg, but nothing stops a user from creating a Literal whose type marker is something weird, like $$$. Therefore, I decided that a literal could use any string as a type marker in backquotes.
  2. The LES printer may encounter a literal type it doesn't recognize. What do we do with it? Well since I chose a highly general syntax, a natural default behavior is to call .ToString() to get the string part of the literal, and .GetType().FullName to get the type marker (I guess we need a plug-in system for the printer too).

So, given what I had in mind, I don't understand why you are proposing a system that involves two suffixes like 5n`mi` .

@jonathanvdc
Contributor

jonathanvdc commented Sep 21, 2017

Oh, I see. I was under the impression that by "custom literals" you meant something like C++'s user-defined literals. Thanks for clearing that up.

A plug-in literal parsing system for the lexer sounds promising. I'm not sure if .GetType().FullName is the best way to name a type marker, though. Putting the System namespace in a suffix doesn't make much sense for non-CLR languages. Int32"4" is reasonable on any platform, but `System.Int32`"4" is very .NET-centric.

Edit: maybe FullName does make sense if you want to automatically parse/deserialize data in the lexer.

qwertie added a commit that referenced this issue Sep 27, 2017
@qwertie
Owner Author

qwertie commented Oct 1, 2017

I'm tweaking the precedence of -> and <-. These operators were immiscible with logical and bitwise operators (||, &&, |, ^, &, ==, >, etc.) but I don't really see why they should be. I also made these operators right-associative. So c && a <- b > 1 will mean c && (a <- (b > 1)).

Btw I once proposed <- as a "slide" operator, an operator I would still like to have...

@qwertie
Owner Author

qwertie commented Oct 1, 2017

Oh, and I'm changing them to be right-associative.

@qwertie
Owner Author

qwertie commented Mar 27, 2020

One more thing, I notice that four-char codes like "\u2022" are valid JSON but larger code points like "\u1F4A9" and "\u01F4A9" (💩) don't work; only the first four characters contribute to the escape sequence. Traditionally I've parsed up to 6 characters after \u... I think I should introduce a new escape sequence "\U1F4A9" for code points that may be longer than 4 characters and then treat \u the same way JSON/JavaScript does.
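A Python sketch of the proposed pair of escapes (the cap of 6 hex digits for \U follows the "up to 6 characters" convention mentioned above; this is an illustration, not the actual LES implementation):

```python
import re

def unescape(s: str) -> str:
    # \u takes exactly 4 hex digits (JSON-compatible);
    # the proposed \U takes up to 6, covering all Unicode code points.
    def repl(m):
        return chr(int(m.group(1) or m.group(2), 16))
    return re.sub(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{1,6})', repl, s)

unescape(r"\u2022")   # the bullet character, U+2022
unescape(r"\U1F4A9")  # the pile-of-poo emoji, U+1F4A9
unescape(r"\u1F4A9")  # chr(0x1F4A) + '9': only the first 4 digits contribute
```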

@qwertie qwertie changed the title Final changes to LESv3 Final changes to LES3 Apr 13, 2020