Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexical syntax simplification #90

Merged
merged 1 commit into from May 29, 2014

Conversation

Projects
None yet
10 participants
@cmr
Copy link
Member

commented May 24, 2014

No description provided.

@cmr

This comment has been minimized.

Copy link
Member Author

commented May 24, 2014

Another benefit of this is that the output of the lexer can be only spans and their associated token type, rather than having to do any work.


LIT_STR_RAW
: 'r' LIT_STR_RAW_INNER
| 'r' '"' .*? '"'

This comment has been minimized.

Copy link
@lilyball

lilyball May 24, 2014

Contributor

This can't just be 'r' LIT_STR_RAW_INNER2? (and the inner tokens should probably be swapped).

This comment has been minimized.

Copy link
@cmr

cmr May 24, 2014

Author Member

Indeed it can.

;

fragment IDSTART : [a-zA-Z_] | XIDSTART ;
fragment IDCONT : [a-zA-Z_0-9] | XIDCONT ;

This comment has been minimized.

Copy link
@lilyball

lilyball May 24, 2014

Contributor

Aren't [a-zA-Z_] already in XIDSTART and [a-zA-Z_0-9] in XIDCONT?

This comment has been minimized.

Copy link
@cmr

cmr May 24, 2014

Author Member

Yes, but antlr is slow at actually matching XIDSTART/XIDCONT afaik.

This comment has been minimized.

Copy link
@lilyball

lilyball May 24, 2014

Contributor

Are you proposing we actually use antlr itself as the lexer?

This comment has been minimized.

Copy link
@cmr

cmr May 24, 2014

Author Member

I'll just remove these, it's not a big deal.

- Character escapes are not unescaped. That is, if you write '\x20', this
lexer will give you `LIT_CHAR('\x20')` rather than `LIT_CHAR(' ')`. The same
applies to string literals.
- The UNDERSCORE token has been removed. It can be considered an identifier.

This comment has been minimized.

Copy link
@lilyball

lilyball May 24, 2014

Contributor

We don't allow underscores as identifiers today, we only allow it as a pattern. Will treating it as an identifier not be problematic?

This comment has been minimized.

Copy link
@cmr

cmr May 24, 2014

Author Member

No, for the same reason that keywords aren't a problem as identifiers.

This comment has been minimized.

Copy link
@bstrie

bstrie May 25, 2014

Contributor

I don't understand what you mean by "not a problem". Do you just mean that let for = 2; would be rejected in the parser, rather than the lexer? And to be clear, this wouldn't change the status of _ as a pattern rather than an identifier, correct?

This comment has been minimized.

Copy link
@cmr

cmr May 25, 2014

Author Member

Correct. This is what happens today, except for the special case of _

This comment has been minimized.

Copy link
@lilyball

lilyball May 25, 2014

Contributor

We could extend this lexer to generate tokens for keywords (or just a single KEYWORD token to cover all the keywords), instead of requiring the parser to distinguish between keywords and identifiers.

@cmr

This comment has been minimized.

Copy link
Member Author

commented May 24, 2014

This needs to take into account rust-lang/rust#14400 still.

;

LIT_FLOAT
: [0-9][0-9_]* ('.' [0-9][0-9_]*)? ([eE] [-+]? [0-9][0-9]*)? FLOAT_SUFFIX?

This comment has been minimized.

Copy link
@chris-morgan

chris-morgan May 25, 2014

Member

The exponent [0-9]* part—should it be [0-9_]*?

Also I think this will be tightening what is accepted; at present, for example, 1. is acceptable (but not 1.f32 for clear reasons), but this change will break that. Is that deliberate? Desirable? &c.

This comment has been minimized.

Copy link
@cmr

cmr May 25, 2014

Author Member

Good catch. That was not deliberate, I didn't mean to change the float literal syntax at all.

# Drawbacks

Including comments in the token stream is very non-traditional and not
strictly necessary.

This comment has been minimized.

Copy link
@pcwalton

pcwalton May 25, 2014

Contributor

(It might actually be a good idea for rustfmt though!)

This comment has been minimized.

Copy link
@cmr

cmr May 25, 2014

Author Member

rustfmt was what I was thinking of primarily with this decision.

This comment has been minimized.

Copy link
@chris-morgan

chris-morgan May 26, 2014

Member

I’m wondering more how it will fit in with macros (procedural and structural). I presume that they would both be wanting to simply ignore regular comments.

This comment has been minimized.

Copy link
@cmr

cmr May 26, 2014

Author Member

Creating the token-tree would strip them out.

@pcwalton

This comment has been minimized.

Copy link
Contributor

commented May 25, 2014

+1, sounds like an improvement for rustfmt

@cmr

This comment has been minimized.

Copy link
Member Author

commented May 25, 2014

@kballard is the CRLF stuff correct? I extended the places that accept newline to also accept '\r\n', but not '\r', and I've removed '\r' from the whitespace skipping.

@lilyball

This comment has been minimized.

Copy link
Contributor

commented May 25, 2014

@cmr My patch actually allows bare '\r' in whitespace skipping and in non-doc comments. It only rejects it inside of strings and doc comments. That said, I don't know if it's worth trying to be that permissive. It may be better just to go ahead and treat a bare '\r' without a subsequent '\n' as a hard error anywhere in the file.

;
LIT_CHAR
: '\'' ( '\\' CHAR_ESCAPE | [^'\n\t\r] ) '\''

This comment has been minimized.

Copy link
@lilyball

lilyball May 25, 2014

Contributor

Shouldn't [^'\n\t\r] be ~['\n\t\r]?

This comment has been minimized.

Copy link
@lilyball

lilyball May 25, 2014

Contributor

Also the character set needs to include \\ or else invalid character escapes will end up matching anyway.

;

LIT_STR
: '"' ('\\\n' | '\\\r\n' | ~[\n])* '"'

This comment has been minimized.

Copy link
@lilyball

lilyball May 25, 2014

Contributor

This rule looks rather broken to me. You're matching the string \n, the string \r followed by a newline, or anything that's not a newline. And it's greedy.

;

fragment BLOCK_COMMENT
: '/*' (BLOCK_COMMENT | ~('\n' | '\r\n')*? '*/'

This comment has been minimized.

Copy link
@lilyball

lilyball May 25, 2014

Contributor

This looks broken too. It's matching anything that's not a LF or CRLF. Which basically means it doesn't allow newlines, but does allow bare CR's. It also has unbalanced parentheses, and if you were to balance it, it would only allow a nested block comment at the very beginning.

;
COMMENT
: '//' ~('\n' | '\r\n')*

This comment has been minimized.

Copy link
@lilyball

lilyball May 25, 2014

Contributor

This disallows LF and CRLF, but seems like it wouldn't disallow CR. It should just be '//' ~[\r\n]*. Also, should it consume the trailing newline? The sample comment rule on this documentation page, does. That would make it '//' ~[\r\n]* '\r'? '\n'

This comment has been minimized.

Copy link
@lilyball

lilyball May 25, 2014

Contributor

It probably shouldn't include the trailing newline. Any tooling that handles these comments won't want that newline, and it can just be skipped by the WS rule. So that leaves just '//' ~[\r\n]*.

@zwarich

This comment has been minimized.

Copy link
Contributor

commented May 27, 2014

This grammar is wildly ambiguous. Identifiers, numbers and operators can be tokenized in multiple ways.

@huonw

This comment has been minimized.

Copy link
Member

commented May 27, 2014

@zwarich do you have an example of an ambiguous sequence of tokens?

@cmr

This comment has been minimized.

Copy link
Member Author

commented May 27, 2014

"12.12" could be INTEGER(12) DOT INTEGER(12)

@cmr

This comment has been minimized.

Copy link
Member Author

commented May 27, 2014

etc. will be relatively easy to fix.

@lilyball

This comment has been minimized.

Copy link
Contributor

commented May 27, 2014

Doesn't antlr4 pick the longest matching token?

@huonw

This comment has been minimized.

Copy link
Member

commented May 27, 2014

(Yeah, isn't the maximal munch principal the standard way to resolve "ambiguities" like this?)

@zwarich

This comment has been minimized.

Copy link
Contributor

commented May 27, 2014

@kballard @huonw Yes, that is the standard way of resolving ambiguities with lexical syntax, and apparently the 'lexer grammar' feature of ANTLR makes it choose this strategy, as opposed to what it uses for normal grammars.

@anasazi

This comment has been minimized.

Copy link

commented May 27, 2014

I like the idea of keeping comments after lexing so pretty-printers / refactoring tools can use the same lexer as the compiler, but how about we just make comment dropping a micropass between the lexer and parser instead of adding to the parser workload?

@cmr

This comment has been minimized.

Copy link
Member Author

commented May 27, 2014

Sure, whatever.

@cmr

This comment has been minimized.

Copy link
Member Author

commented May 27, 2014

Fixed most things, and verified that it works as I expect.

@cmr

This comment has been minimized.

Copy link
Member Author

commented May 29, 2014

cc @nikomatsakis @pcwalton @brson I've updated this. It behaves as I expect for the code I've run it against, and accepts/rejects everything it should in the compiler/libs/testsuite/servo.

@brson brson merged commit 26b5257 into rust-lang:master May 29, 2014

@brson

This comment has been minimized.

Copy link
Contributor

commented May 29, 2014

withoutboats pushed a commit to withoutboats/rfcs that referenced this pull request Jan 15, 2017

@Centril Centril added the A-syntax label Nov 23, 2018

wycats pushed a commit to wycats/rust-rfcs that referenced this pull request Mar 5, 2019

Merge pull request rust-lang#90 from ember-cli/addon-tree-caching
RFC for caching results of `treeFor` hook
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.