Final changes to LES3 #52
Comments
Wow, that sounds like a lot of work! This all seems super reasonable to me. I'm especially happy about you deciding to store I also like your idea of supporting arbitrary suffixes for numeric literals. How does that interoperate with custom literals? Is |
So I'm thinking there should be a new Which reminds me of another change I forgot to mention - I've changed the type marker |
Hmmm. Adding a And people probably aren't going to check |
Um, three things. First, there is supposed to be no semantic difference between Second, there are multiple type markers that produce, say, a 32-bit integer. You can write Third, I want to port LES to other languages. I don't want the APIs in different languages to be much different from each other. So consider JavaScript, which has no integer type. What, then, is the distinction between |
P.S. given an unknown suffix like |
I think it's interesting that suffixes don't always make a difference semantically and that all custom literals with unknown suffixes are parsed as strings. I also seem to recall that you said you would've liked to parse custom literals that are actually integers/floats as what they are instead of strings. Maybe the root cause of most custom literal–related issues is that custom literals and value types are orthogonal, but they're represented using the same prefix/suffix scheme. Here's a strawman proposal: drop the
I know it's a completely different approach, but I feel like it solves a lot of issues with custom literals. What do you think? |
I'm not going to call this I probably don't understand your proposal because you refer to a So let me first demolish my faulty interpretation of your strawman. [edit: struck out because arguing against a false interpretation of a strawman is such a silly exercise]
Hmm, reading over your "advantages list" it seems that what you really want to propose is a "units" thing, not a "custom literal" thing. Please don't confuse the two, they are totally different concepts that would exist for totally different reasons. The "Idea" regarding units that I wrote down earlier is, as I implied, imperfect. It was an idea about co-opting custom literals to represent units - consider it an interesting hack that one could use if LES does not support units in any other way. |
I'm not particularly invested in the I'm sorry that I didn't explain what I meant in depth. I'll try to sketch my motivations for proposing to use a First off, right now a prefix/suffix can mean one of two things.
Furthermore, it's the job of some separate component outside of the printer/parser to interpret custom literals. So we probably shouldn't hide the custom literal suffix in a field of a literal node (like I think a call to Does that make more sense? Would you like me to elaborate on some specific aspect of this proposal? |
They are not always represented by strings.
Type markers are what make custom literals custom. I'm not sure that you understand why I created custom literals and type markers, so let me explain. The vision is this: there will be X LES parsers written by Y people in Z languages. I would hope that someday Z>20, you know? Now, each programming environment is different, so each one will have a set of literals that it supports... and another set that it does not support. Most will have built-in support for string, float64, float32, int32 and characters (though not necessarily proper 21-bit characters), but not all of them. Many will support int64, BigInt, Symbol and small integers (byte, short, etc.). Some will support regexes, some might even support unums. The purpose of custom literals is to handle literals in a uniform way across the myriad environments where LES may be used, in a way that is (1) future-proof and (2) allows unknown literals to be round-tripped by lexers that don't understand them. I want to avoid a rigid framework that says "these are the standard literals which everyone should support, and everything else is non-standard". Instead, quite the opposite: the only thing that will be required from an LES parser is the ability to store a literal as a string. Everything else is optional. And today it occurred to me that lexers should by convention support a plug-in literal parsing system so users can expand (or even clear) the set of supported literal types. Probably the parser collection should even be a separate object from the lexer itself. If a literal type is unknown, then it should be stored as a string. It should not be stored as a number even if it is written as a number, partly because in general it's infeasible. For example, consider this number:
Some environments may be able to store this in a numeric form, but most can't. So rather than attempt to parse it, I believe it's better to keep it as a string. And remember, if you want to write a negative literal, it must be written as a string:
So the use of string versus number notation is not a strong signal about whether the user wants it to be parsed as a string or as a number. In fact, some users will explicitly want to avoid parsing literals because a parsed literal contains less information than the original literal, e.g. If your goal is to load an LES file and, say, find and replace all instances of Now, in the design of LES we must consider not only the conversion of LES to Loyc trees, but the reverse conversion also. So here are a couple of things to consider:
So, given what I had in mind, I don't understand why you are proposing a system that involves two suffixes like |
Oh, I see. I was under the impression that by "custom literals" you meant something like C++'s user-defined literals. Thanks for clearing that up. A plug-in literal parsing system for the lexer sounds promising. I'm not sure if Edit: maybe |
I'm tweaking the precedence of Btw I once proposed |
Oh, and I'm changing them to be right-associative. |
One more thing, I notice that four-char codes like "\u2022" are valid JSON but larger code points like "\u1F4A9" and "\u01F4A9" (💩) don't work; only the first four characters contribute to the escape sequence. Traditionally I've parsed up to 6 characters after |
I'm changing a bunch of things, mainly in order to finalize LESv3. However, the changes to operator precedence/classification affect both LESv2 and LES3, partly because both languages share precedence-choosing code but also for the sake of consistency. These changes are not committed as of today.
`!!` to suffix operator
To reiterate, `!!` changed from an infix to a suffix operator based on Kotlin's `!!` operator. I'm duplicating this change now in LESv2. And I noticed that suffix operator printing seems broken in LESv2 (e.g. `a++` came out as `` `'++suf`a `` - it parses fine but looks ugly) so I fixed that.
`.dotKeywords` will be stored as `#hashKeywords` and `#` is an id-char
To save us the trouble of changing the whole Loyc codebase, the dot in a `.keyword` will be changed into a `#` in the Loyc tree. Also, `#` will be an identifier character. Thus `.return x` is equivalent to `#return(x)`. I'm also reverting the definition of `LNode.IsSpecialName` to exclude `.`.
`.keyw0rds` can now contain digits
`.keywords` will be allowed to be any valid non-backquoted identifier.
Number parsing change
In order to simplify the lexer by removing a lookahead loop, I'm removing the requirement to have a "p0" clause on hexadecimal floating-point numbers. Consequently, `0x1.Ep0` can be written as `0x1.E`, which will be treated as a floating-point number instead of `0x1 . E`. However, `0x1.Equals(1)` now has the bizarre interpretation `quals"0x1.E"(1)`.
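To make the consequence concrete, here is a rough Python sketch of a hex-literal matcher whose exponent clause is optional. The regular expression and the helper names `split_hex_literal` and `hex_value` are assumptions made for illustration; they are not taken from the LES lexer.

```python
import re

# Simplified hex-literal pattern (an assumption, not the LES grammar):
# the "p" exponent clause is optional, so a hex fraction alone ends the number.
HEX_NUMBER = re.compile(r"0x[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]+)?(?:[pP][+-]?[0-9]+)?")


def split_hex_literal(text):
    """Return (number_text, rest) for input that starts with a hex literal."""
    match = HEX_NUMBER.match(text)
    if match is None:
        raise ValueError("input does not start with a hex literal")
    return match.group(0), text[match.end():]


def hex_value(number_text):
    """Evaluate the literal; float.fromhex also treats the exponent as optional."""
    return float.fromhex(number_text)
```

With this matcher, `split_hex_literal("0x1.Equals(1)")` consumes `0x1.E` and leaves `quals(1)` behind, which is the same split that produces the odd reading mentioned above, and `hex_value("0x1.E")` equals `hex_value("0x1.Ep0")`.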
Weird Operators
Given a "weird" operator like `x *+ y`, the precedence of `*+` was previously based on the first character, `*`. I think the motivation for this was to act similar to Nim operators. But then I noticed something: MATLAB and Julia have "elementwise" operators like `X .* Y` which multiply pairs of scalars in matrices `X` and `Y`. So I checked several other languages. Apparently, most other languages do not have any "compound" operators, but Swift does. Swift has `&*`, `&/`, `&+`, `&-`, and `&%`. In Swift, MATLAB and Julia, the last character determines the precedence. Because Swift, MATLAB, etc. are more popular than Nim, I'm changing LES to decide precedence based on the last character instead of the first. Sound good @jonathanvdc ?
Earlier I added a `!.` operator with precedence of `.` so that Kotlin's `!!.` operator would have the correct precedence. This change makes the `!.` operator redundant. Null-dot `?.` must still be special-cased as its precedence is lower than `.`.
Note: Long operators like `|%|` will continue to be based on the first and last character. In the case of combo operators, like `x s>> y`, the initial identifier is ignored for the purpose of deciding precedence so that this particular operator has the same precedence as `>>`.
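As a rough illustration of the last-character rule, the Python sketch below looks up an operator as a whole and otherwise falls back to its final character, after dropping the identifier prefix of a combo operator. The numeric levels in `BASE_PRECEDENCE` and both function names are hypothetical, and the first-and-last-character rule for long operators such as `|%|` is deliberately not modelled.

```python
# Hypothetical precedence levels for illustration; the real LES table differs.
BASE_PRECEDENCE = {
    "*": 60, "/": 60, "%": 60,
    "+": 50, "-": 50,
    ">>": 45, "<<": 45,
    "<": 40, ">": 40,
}


def strip_combo_prefix(op):
    """Drop a leading identifier, so a combo operator like 's>>' becomes '>>'."""
    i = 0
    while i < len(op) and (op[i].isalnum() or op[i] == "_"):
        i += 1
    return op[i:]


def precedence_by_last_char(op):
    """Look up the whole operator first, then fall back to its last character.

    Returns None when neither lookup succeeds.
    """
    op = strip_combo_prefix(op)
    if op in BASE_PRECEDENCE:
        return BASE_PRECEDENCE[op]
    return BASE_PRECEDENCE.get(op[-1]) if op else None
```

Under these assumed levels, `.*` and `&*` share the precedence of `*`, `*+` takes the precedence of `+`, and `s>>` takes the precedence of `>>`.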
Operators and fractions
I decided that, to better match other programming languages, `x*.2` should be parsed as `x * 0.2` rather than as `x *. 2`. However, the tricky thing is that `0..2` and `0...3` should still be parsed as ranges, `0 .. 2` and `0 ... 3` respectively. To achieve this I tried splitting `Operator` into two rules:
This is an efficient implementation. However, with this grammar, if you write `x.+.2` it is parsed as `x .+. 2` which is not necessarily the desired result. It's a perfectly reasonable way to parse, but I think it would be better to match Julia/MATLAB. So I thought of an alternative which actually looks simpler in the grammar:
This turns out to generate more, and slower, code. But for the sake of compatibility, I'll accept that.
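The following Python sketch shows one way a hand-written tokenizer could get the behaviour described here. It is not the grammar from this issue: the three regular expressions, the `tokenize` function, and the reading that `x.+.2` should lex as `x`, `.+`, `.2` are all assumptions made for illustration.

```python
import re

# Simplified token shapes (assumptions, not the LES grammar).
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?|\.[0-9]+")
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
OPERATOR = re.compile(r"[-+*/%<>=!&|^~.?:]+")


def tokenize(src):
    """Split src into number, identifier and operator tokens."""
    tokens, i = [], 0
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        m = NUMBER.match(src, i) or IDENT.match(src, i)
        if m is not None:
            tokens.append(m.group(0))
            i = m.end()
            continue
        m = OPERATOR.match(src, i)
        if m is None:
            raise ValueError(f"unexpected character {src[i]!r}")
        op, end = m.group(0), m.end()
        # A trailing dot followed by a digit belongs to the number,
        # unless that dot is part of a ".." or "..." range operator.
        if (len(op) > 1 and op.endswith(".") and not op.endswith("..")
                and end < len(src) and src[end].isdigit()):
            end -= 1
        tokens.append(src[i:end])
        i = end
    return tokens
```

In this sketch `x*.2` yields the tokens `x`, `*`, `.2`, while `0..2` and `0...3` keep their range operators intact because a run ending in two dots is never trimmed.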
`::` operator
I think the precedence of `::` should be changed to match C++ and EC#. In LESv2 I chose the syntax `x::Loyc.Symbol` for variable declarations, which required `::` to have a lower precedence than `.`. Why didn't I just use the syntax `x: Loyc.Symbol`? I think it was because it conflicted with LES's old "Python mode" where you could write
Meaning `if c { print "c is true!" }`. This feature was removed 2016-06-21, I think in order to make the language easier to parse and ensure colon would behave like any other operator.
So I will raise the precedence of `::` up to `.`, but will make the change exclusive to LESv3 because I have a fairly large amount of LESv2 code relying on the old precedence.
Minor implementation detail
`?.` will now be classified as `TokenType.Dot` while `.*` will no longer be classified as `TokenType.Dot`.
Arrow operators
I have reduced the precedence of arrow operators `-> <-`. Technically their precedence is now above `&&` but below `| ^`. My thinking is that some people would like to use arrows as assignment operators: `flag <- x > y || y < 0` gets the structure `(flag <- (x > y)) || (y < 0)`.
The previous precedence of `-> <-` was sort of a compromise between C (which wants high precedence) and other languages that want lower precedence. But arguably the old precedence served neither case very well. Now I'm thinking that people wanting to use a C++-style arrow operator as in `obj->f()` should pick another operator, such as `obj*.f()`.
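The grouping of `flag <- x > y || y < 0` can be reproduced with a small precedence-climbing parser. The Python sketch below is only an illustration: the numeric levels in `PRECEDENCE` are hypothetical (only their relative order follows the text, with arrows above `&&` and `||` and below comparisons), every operator is treated as left-associative for simplicity, and `split_tokens` and `parse` are made-up names.

```python
import re

# Hypothetical binding strengths; only their relative order matters here.
PRECEDENCE = {"||": 1, "&&": 2, "<-": 3, "->": 3, "<": 4, ">": 4}
TOKEN = re.compile(r"\s*(\|\||&&|<-|->|[<>]|\w+)")


def split_tokens(src):
    """Split src into operator and operand tokens."""
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(tokens, min_prec=1):
    """Precedence climbing; consumes tokens from the front of the list."""
    left = tokens.pop(0)  # operand (identifier or number)
    while tokens and PRECEDENCE.get(tokens[0], 0) >= min_prec:
        op = tokens.pop(0)
        right = parse(tokens, PRECEDENCE[op] + 1)  # left-associative
        left = f"({left} {op} {right})"
    return left
```

Calling `parse(split_tokens("flag <- x > y || y < 0"))` returns `((flag <- (x > y)) || (y < 0))`, matching the structure given above.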
Continuators
The set of possible continuators should be carefully considered because it cannot be compatibly changed later. Since no one has offered an opinion, I am going to suggest that the set be the ten words `else, elseif, elsif, catch, except, finally, while, until, plus, using`, plus the set of all identifiers that begin with two underscores (`__etc`). Note that instead of double-underscores it would seem more natural to add `#hashKeywords` to the set; however, this creates an ambiguity in case of code like the following:
I think it's more likely this was intended to be two separate statements, rather than that `#bar(x)` is intended to continue the `.foo` statement. I selected `__`, a traditional "potential reserved word" marker in C++, because it is not currently used for any purpose in Loyc. Unlike continuators like `else` that will be stored with `#` in the Loyc tree (`#else`), double-underscore identifiers will be stored unchanged. Though it could equally be argued that `#`s should not be added even in the former case.
Since continuators are not allowed to be used as binary operators, I removed `and, or, but` from the continuator set, thinking that some users would prefer to use them as binary operators.
And introducing token lists / prefix-expressions / whatever you call 'em
I proposed that LESv3 not have token literals (unlike LESv2 and EC#) and instead adopt "token lists" such as `' + x 1`. No one offered an opinion on this, or about whether `'(+ x y)` should be represented as `` `'`(`'()`(`'+`, x, y)) `` or as `` `'`(`'+`(x, y)) `` so I'm somewhat arbitrarily selecting the first representation.
Decisions on other questions previously raised
`foo: @immutable string`? Still undecided; currently, no.
`{ a, b, c }`? I don't think so. The parser isn't currently complaining about it, but I may add a check, possibly a check dependent on whether `{` is followed by a string (to carve out an exception for JSON syntax).
`(x, y)`? Yes, but `(x,)` will be a syntax error and parsed as ```(x, ``)```.
`#` mean? It will be treated as an identifier character like an underscore.
Idea
Numeric literals can have any identifier as a suffix, including backquoted identifiers. This feature could be used to support my favorite feature that few languages allow: compound unit types like `` 1.2`kB/record` `` or `` 3e8`m/s` ``. Currently an expression like `` size `kB` `` is meaningless, but I suppose it could be defined as some sort of suffix operator which could then be used for unit types. However, this idea has the disadvantage that a numeric value with units would have a different syntactic structure than any other expression with units, and `` 3`px` `` would have a different meaning than `` 3 `px` ``.
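A minimal Python sketch of how a consumer might split such a literal into its number text and unit suffix. The pattern and the `split_suffix` helper are assumptions for illustration rather than the LES lexer's rules; the number is kept as text instead of being parsed.

```python
import re

# Assumed shape of a suffixed number: digits, optional fraction and exponent,
# then either a backquoted identifier or a plain identifier.
SUFFIXED_NUMBER = re.compile(
    r"(?P<number>[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)"
    r"(?:`(?P<quoted>[^`]+)`|(?P<plain>[A-Za-z_][A-Za-z0-9_]*))?"
)


def split_suffix(literal):
    """Return (number_text, suffix) where suffix is None if absent."""
    m = SUFFIXED_NUMBER.fullmatch(literal)
    if m is None:
        raise ValueError(f"not a single suffixed number: {literal!r}")
    return m.group("number"), m.group("quoted") or m.group("plain")
```

Here `` 3`px` `` is accepted as one literal with the suffix `px`, whereas `` 3 `px` `` (with a space) is rejected, which mirrors the syntactic difference noted above.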