
The essence of lexer #59706

Open
wants to merge 1 commit into
base: master
from

Conversation

@matklad
Member

commented Apr 4, 2019

cc @eddyb

I would love to make a reusable library to lex Rust code, which could be used by rustc, rust-analyzer, proc-macros, etc. This draft PR is my attempt at the API. Currently, the PR uses the new lexer to lex comments and shebangs, while using the old lexer for everything else. This should be enough to agree on the API though!

High-level picture

A rust_lexer crate is introduced, with zero or minimal dependencies (only for XID_Start and other unicode data). This crate basically exposes a single function: next_token(&str) -> (TokenKind, usize), which returns the first token of a non-empty string (the usize is the length of the token). The main goal of the API is to be minimal. Concerns that are not strictly essential, like string interning, are left to the clients.
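As a sketch of the shape such a function could take (the names and tiny token set here are illustrative, not the actual crate's API):

```rust
// Hypothetical sketch of the proposed interface shape; the real
// crate's names and token set will differ.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Whitespace,
    Ident,
    Unknown,
}

/// Returns the kind and byte length of the first token of a
/// non-empty string; it never fails, at worst yielding `Unknown`.
fn next_token(input: &str) -> (TokenKind, usize) {
    let first = input.chars().next().expect("non-empty input");
    // Byte length of the leading run of chars satisfying `pred`.
    let len_while = |pred: fn(char) -> bool| -> usize {
        input.chars().take_while(|&c| pred(c)).map(char::len_utf8).sum()
    };
    if first.is_whitespace() {
        (TokenKind::Whitespace, len_while(char::is_whitespace))
    } else if first.is_alphabetic() || first == '_' {
        (TokenKind::Ident, len_while(|c| c.is_alphanumeric() || c == '_'))
    } else {
        (TokenKind::Unknown, first.len_utf8())
    }
}

fn main() {
    assert_eq!(next_token("foo + bar"), (TokenKind::Ident, 3));
    assert_eq!(next_token("  foo"), (TokenKind::Whitespace, 2));
}
```

Note how string interning, spans, and error reporting are all absent from the signature; the caller slices the input itself using the returned length.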

Finer Points

Iterator API

We should probably expose a convenience function: fn tokenize(&str) -> impl Iterator<Item = Token>

EDIT: I've added tokenize
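Such an iterator can be a thin adapter over the stateless function; here is a minimal sketch, using a trivial stand-in for next_token (all names illustrative):

```rust
// Hypothetical Token plus a trivial stand-in next_token, just to show
// how the iterator adapter layers on top of the stateless function.
#[derive(Debug, PartialEq)]
struct Token {
    len: usize,
}

// Stand-in lexing function: each char is its own token.
fn next_token(input: &str) -> Token {
    Token { len: input.chars().next().unwrap().len_utf8() }
}

/// Repeatedly applies `next_token`, advancing through the input.
fn tokenize(mut input: &str) -> impl Iterator<Item = Token> + '_ {
    std::iter::from_fn(move || {
        if input.is_empty() {
            None
        } else {
            let token = next_token(input);
            input = &input[token.len..];
            Some(token)
        }
    })
}

fn main() {
    assert_eq!(tokenize("héllo").count(), 5);
    assert_eq!(tokenize("").count(), 0);
}
```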

Error handling

The lexer itself provides only a minimal amount of error detection and reporting. Additionally, it never fatal-errors and always produces some non-empty token. Examples of errors detected by the lexer:

  • unterminated block comment
  • unterminated string literals

Examples of errors not detected by the lexer:

  • invalid escape sequence in a string literal
  • out of range integer literal
  • bare \r in the doc comment.

The idea is that the clients are responsible for additional validation of tokens. This is the mode an IDE operates in: you want to skip validation for library files, because you are not showing errors there anyway, while for user code you want deep validation with quick fixes and suggestions, which is not really a fit for the lexer itself.

In particular, in this PR an unclosed /* comment is handled by the new lexer, while bare \r and the distinction between doc and non-doc comments are handled by the old lexer.
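To illustrate the division of labour, here is a hedged sketch of the kind of check a client could run over a token's text after lexing — validating escape sequences, which the lexer itself does not do (the function name and the accepted escape set are made up for the example):

```rust
// Illustrative client-side validation of a kind the lexer skips:
// rejecting unknown escape sequences in a string literal's body.
// Returns the byte offset of the first bad escape, if any.
fn check_escapes(literal_body: &str) -> Result<(), usize> {
    let mut chars = literal_body.char_indices();
    while let Some((pos, c)) = chars.next() {
        if c == '\\' {
            match chars.next() {
                // A toy whitelist of "known" escapes for the sketch.
                Some((_, 'n' | 't' | '\\' | '"' | '\'' | '0')) => {}
                _ => return Err(pos),
            }
        }
    }
    Ok(())
}

fn main() {
    assert!(check_escapes(r"a\nb").is_ok());
    assert_eq!(check_escapes(r"a\qb"), Err(1));
}
```

An IDE can simply skip calling such a check for library files, which is exactly the flexibility the split is meant to buy.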

Performance

No attempt at performance measurement has been made so far :) I think it is acceptable to regress perf here a bit in exchange for cleaner code, and I hope the regression won't be too costly. In particular, because we validate tokens separately, we'll have to do one more pass for some of the tokens. I hope this is not a prohibitive cost. For example, for doc comments we already do two passes (lexing + interning), so adding a third one shouldn't be that much slower (and we also do an additional pass for utf-8 validation). And lexing is hopefully not a bottleneck. Note that for IDEs, separate validation might actually improve performance, because we will be able to skip validation when, for example, computing completions.

Long term, I hope that this approach will allow for better performance. If we separate out pure lexing, in the future we can code-gen a super-optimized state machine that walks utf-8 directly, instead of the current manual char-by-char toil.

Cursor API

For the implementation, I am going slightly unconventional. Instead of defining a Lexer struct with a bunch of helper methods (current, bump) and a bunch of lexing methods (lex_comment, lex_whitespace), I define a Cursor struct which has only helpers, and define a top-level function with a &mut Cursor argument for each grammar production. I find this C style more readable for parsers and lexers.

EDIT: switched to a more conventional setup with lexing methods
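A minimal sketch of what that conventional setup could look like — a Cursor owning the low-level helpers, with grammar productions as methods (all names illustrative, not the crate's actual code):

```rust
// Sketch of the conventional setup the PR settled on: a Cursor with
// low-level helpers plus lexing methods, rather than free functions.
struct Cursor<'a> {
    chars: std::str::Chars<'a>,
    initial_len: usize,
}

impl<'a> Cursor<'a> {
    fn new(input: &'a str) -> Self {
        Cursor { chars: input.chars(), initial_len: input.len() }
    }
    /// Peek at the next char without consuming it.
    fn first(&self) -> char {
        self.chars.clone().next().unwrap_or('\0')
    }
    /// Consume one char.
    fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }
    /// Bytes consumed so far.
    fn len_consumed(&self) -> usize {
        self.initial_len - self.chars.as_str().len()
    }
    /// A grammar production as a method: consume a `//` line comment
    /// and return its byte length.
    fn line_comment(&mut self) -> usize {
        while self.first() != '\n' && self.first() != '\0' {
            self.bump();
        }
        self.len_consumed()
    }
}

fn main() {
    let mut cursor = Cursor::new("// a comment\nfn");
    assert_eq!(cursor.line_comment(), 12);
}
```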

So, what do folks think about this?

@rust-highfive

Collaborator

commented Apr 4, 2019

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton

Member

commented Apr 5, 2019

I'm personally pretty unfamiliar with this work, but @matklad do you know who'd be good to review this?

@matklad

Member Author

commented Apr 5, 2019

That is a good question. Perhaps @eddyb or @petrochenkov? I also feel that maybe this needs to be tagged with T-compiler and discussed more generally?

@alexcrichton

Member

commented Apr 5, 2019

@bors

Contributor

commented Apr 5, 2019

☔️ The latest upstream changes (presumably #59721) made this pull request unmergeable. Please resolve the merge conflicts.

@petrochenkov

Contributor

commented Apr 7, 2019

Since I was assigned, here are my priorities:

  • High priority: the lexer code (both interfaces and implementation) can be tweaked at any time for performance or other reasons (this means zero stability guarantees) without infrastructural hurdles (no separate repos, submodule updates, crate version changes).
  • Lower priority: reuse of the lexer code with other projects.

So, if the lexer crate follows the model of rustc-ap-syntax, then I'm happy.
(It should probably be named librustc_lexer rather than rust_lexer in that case.)

If the first priority is satisfied, then I'm not even too interested in discussing the exact interface of the proposed reusable lexer - it could be improved at any time if some usability or performance issues are found.
Frankly, I have no idea how the perfect reusable lexer interface should look, I never wrote a whole lexer and don't know the requirements.
What this PR does seems fine for a start.

Reassigning to someone who can into high-level design.

@petrochenkov petrochenkov assigned Zoxc and eddyb and unassigned petrochenkov Apr 7, 2019

@matklad

Member Author

commented Apr 8, 2019

Thanks!

I agree that this should be just a usual library in the rust monorepo, and that it shouldn't have any compatibility guarantees. As a stretch goal, I'd love to additionally make sure that just cargo test inside the librustc_lexer's dir works. This would help with

  1. lexer-specific test suite: now, to test my changes, I need to build the rest of the compiler, b/c some bits are only covered by run-pass, and that is slow
  2. a second "specification" implementation (a bunch of regexes plus special casing for /* and r#") which is compared with the production one and used in the language reference (this is something @rust-lang/wg-grammar might be interested in).

The hard requirement for me though is building on stable. This is different from the ap-syntax model, which is nightly only. I hope it'll "just work", as the interface seems pretty minimal (although the various unicode tables in libcore might be a problem). At worst, we can have a feature flag in the crate to enable rustc_private stuff.

@Dylan-DPC

Member

commented Apr 15, 2019

r? @petrochenkov
(since they weren't assigned by highfive)

@rust-highfive rust-highfive assigned petrochenkov and unassigned Zoxc and eddyb Apr 15, 2019

@petrochenkov petrochenkov assigned Zoxc and eddyb and unassigned petrochenkov Apr 15, 2019

@eddyb

Member

commented Apr 16, 2019

One concern I have is that the API of the old lexer kind of preceded external iterators.
That may sound strange, but, AFAIK, the lexer Reader may literally be older than Iterator.

So that said, I think we should have one or two of:

  1. a stateless "match token at start of string" API
  2. a stateful Iterator that applies 1. repeatedly

What I don't think we should have is anything resembling the current API, which is stateful but at the same time not a proper Iterator.

I also agree with @petrochenkov that rustc_lex(er) (or, IMO, syntax_lex(er)) are better names.

Resolved review comments (now outdated) on src/rust_lexer/Cargo.lock, src/rust_lexer/Cargo.toml, and src/rust_lexer/src/lib.rs.
@matklad

Member Author

commented Apr 16, 2019

Thanks for the review @eddyb! Given the general thumbsup here, I'll work on this in the coming weeks to make this production ready!

So that said, I think we should have one or two of:

I think we should do both: the stateless one is less powerful (you can't lex python-style f-strings with it), so, since Rust's lexical grammar admits stateless lexing, we should use it. Stateless is also good for incremental relexing. For users though, an iterator API on top of the stateless API would be preferable.

I also plan to initially preserve the API of the current code in libsyntax exactly (by proxying to the new crate), and do simplification refactoring in a separate PR.

Now that I think about it, is_beginning_of_file is only used for shebangs, right?

Yeah, I was debating what to do with shebangs as well... Part of me wants to say "nah, this is an implementation-defined concern" and just not handle it in this library. Your proposal of a separate fn strip_shebang is nice though: we both keep the core interface clean and handle shebangs in an implementation-independent way.
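A hedged sketch of the strip_shebang idea — the real function's signature and its edge-case handling (e.g. not eating an inner attribute like #![feature(...)]) may well differ:

```rust
// Illustrative sketch: strip a leading shebang line outside the core
// token interface. The `#![` check is a rough approximation so that
// a crate-level inner attribute is not mistaken for a shebang.
fn strip_shebang(input: &str) -> Option<usize> {
    if input.starts_with("#!") && !input.starts_with("#![") {
        // The shebang runs to the end of the first line.
        Some(input.find('\n').unwrap_or(input.len()))
    } else {
        None
    }
}

fn main() {
    assert_eq!(strip_shebang("#!/usr/bin/env run\nfn main() {}"), Some(18));
    assert_eq!(strip_shebang("#![feature(never_type)]"), None);
    assert_eq!(strip_shebang("fn main() {}"), None);
}
```

Returning the length to skip keeps the helper composable with next_token, which only ever sees the post-shebang input.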

That would make this a &str -> Option, which is very close to FromStr!

Is it OK for FromStr to consume only part of the input though? I guess, if the public API is fn tokenize(src: &str) -> impl Iterator<Item = Token> this doesn't even really matter.

I think it would be nicer if these were methods.

Heh, for me personally, free-standing functions for grammar productions and methods for lookahead/bump work much better, but, even if this approach were objectively better, it still makes sense to go with methods to minimize exoticism. Will fix that!

@mominul


commented Jul 14, 2019

This is really strange: changing ? to a manual match improves performance by 10-20%: matklad@90909b0. Seems like a codegen bug.

This is the bug #37939
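For context, the two forms being compared are semantically identical, which is why a speed difference between them points at codegen rather than at the code itself; schematically:

```rust
// The transformation being discussed, in schematic form: `?` and the
// equivalent hand-written match desugar to the same logic, so any
// performance gap between them is purely a codegen artifact.
fn with_question(chars: &mut std::str::Chars<'_>) -> Option<char> {
    let c = chars.next()?;
    Some(c.to_ascii_lowercase())
}

fn with_match(chars: &mut std::str::Chars<'_>) -> Option<char> {
    let c = match chars.next() {
        Some(c) => c,
        None => return None,
    };
    Some(c.to_ascii_lowercase())
}

fn main() {
    assert_eq!(with_question(&mut "A".chars()), Some('a'));
    assert_eq!(with_match(&mut "A".chars()), Some('a'));
    assert_eq!(with_match(&mut "".chars()), None);
}
```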

@eddyb

Member

commented Jul 15, 2019

I've also switched from using xid-unicode

Is this needed at all? I'm not sure I understand why this was done.

@matklad

Member Author

commented Jul 15, 2019

Is this needed at all? I'm not sure I understand why this was done.

Just to keep closer to the original implementation. I am not sure what the maintenance status of unicode-xid is, and it doesn't belong to the rust-lang org. Also, it doesn't contain a definition of Pattern_White_Space, and hard-coding that seems wrong.

I do think we should switch to unicode-xid, but I'd love to do that in a separate PR, as it looks like we need to:

  • make sure that it matches the current state of unicode
  • make sure that it is maintained
  • make sure that t-compiler has publish rights for this crate
  • move this crate to rust-lang org (or at least discuss this)?
  • add Pattern_White_Space to it
@petrochenkov

Contributor

commented Jul 15, 2019

One remaining unresolved comment - #59706 (comment).

@petrochenkov petrochenkov removed their assignment Jul 15, 2019

@matklad

Member Author

commented Jul 15, 2019

Heh, indeed! Fixed (EDIT: and now pushed the commit)!

I wonder if we should remove (in a follow-up PR) the number of hashes from the literal kind (thus making it a C-style enum) and store the whole thing, with hashes, inside Lit::symbol? If we do that, we should probably include quotes for all other stringy things as well, and that will break the optimization where we reuse the same symbol in both token and AST literals...

```rust
}
/// For debug assertions only
pub(crate) fn prev(&self) -> char {
    #[cfg(debug_assertions)]
```

@bjorn3

bjorn3 Jul 16, 2019

Contributor

Maybe make the function itself conditional on debug_assertions?

@matklad

matklad Jul 16, 2019

Author Member

That was my original idea, but that would just push these cfgs outwards: #62527

@petrochenkov

Contributor

commented Jul 16, 2019

@matklad

I wonder if we should remove (in a follow-up PR) the number of hashes from the literal kind (thus making it a C-style enum) and store the whole thing, with hashes, inside Lit::symbol? If we do that, we should probably include quotes for all other stringy things as well, and that will break the optimization where we reuse the same symbol in both token and AST literals...

The motivation behind the current representation is the ability to query the literal kind easily (without looking into its contents), while at the same time avoiding duplication between the kind and the contents, so that they cannot go out of sync for various synthetic tokens (#60936 (comment)).
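A rough sketch of the representation being discussed (names illustrative): keeping the hash count in the kind lets clients answer "what kind of literal, and how was it delimited?" without re-scanning the symbol, while the symbol stores only the contents between the delimiters:

```rust
// Illustrative model of the trade-off: kind carries delimiter info
// (here, the raw string's hash count), symbol carries only contents.
#[derive(Debug, PartialEq)]
enum LitKind {
    Str,
    RawStr { n_hashes: u16 },
    Int,
}

struct Lit {
    kind: LitKind,
    symbol: String, // contents without quotes/hashes in this scheme
}

/// Reconstruct the surface form from kind + contents, showing that
/// the two pieces together are non-redundant and cannot diverge.
fn render(lit: &Lit) -> String {
    match lit.kind {
        LitKind::Str => format!("\"{}\"", lit.symbol),
        LitKind::RawStr { n_hashes } => {
            let h = "#".repeat(n_hashes as usize);
            format!("r{h}\"{}\"{h}", lit.symbol)
        }
        LitKind::Int => lit.symbol.clone(),
    }
}

fn main() {
    let raw = Lit { kind: LitKind::RawStr { n_hashes: 1 }, symbol: "a".to_string() };
    assert_eq!(render(&raw), "r#\"a\"#");
    let s = Lit { kind: LitKind::Str, symbol: "hi".to_string() };
    assert_eq!(render(&s), "\"hi\"");
}
```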

@bors

Contributor

commented Jul 16, 2019

☔️ The latest upstream changes (presumably #62723) made this pull request unmergeable. Please resolve the merge conflicts.

@matklad matklad force-pushed the matklad:the-essence-of-lexer branch from e412f84 to fcefc6f Jul 16, 2019

@matklad

Member Author

commented Jul 17, 2019

Just to clarify, this is waiting for @eddyb's review, right? I wonder if this also should be brought up on some t-compiler meeting?

@petrochenkov

Contributor

commented Jul 20, 2019

@bors

Contributor

commented Jul 20, 2019

📌 Commit fcefc6f has been approved by petrochenkov

@eddyb

Member

commented Jul 20, 2019

@matklad I'm not blocking this, I think my only concern was about the unicode thing and you replied.

@bors

Contributor

commented Jul 20, 2019

🔒 Merge conflict

This pull request and the master branch diverged in a way that cannot be automatically merged. Please rebase on top of the latest master branch, and let the reviewer approve again.

How do I rebase?

Assuming self is your fork and upstream is this repository, you can resolve the conflict following these steps:

  1. git checkout the-essence-of-lexer (switch to your branch)
  2. git fetch upstream master (retrieve the latest master)
  3. git rebase upstream/master -p (rebase on top of it)
  4. Follow the on-screen instruction to resolve conflicts (check git status if you got lost).
  5. git push self the-essence-of-lexer --force-with-lease (update this PR)

You may also read Git Rebasing to Resolve Conflicts by Drew Blessing for a short tutorial.

Please avoid the "Resolve conflicts" button on GitHub. It uses git merge instead of git rebase which makes the PR commit history more difficult to read.

Sometimes step 4 will complete without asking for resolution. This is usually due to differences in how a Cargo.lock conflict is handled during merge and rebase. This is normal, and you should still perform step 5 to update this PR.

Error message

```
warning: Cannot merge binary files: Cargo.lock (HEAD vs. heads/homu-tmp)
Auto-merging Cargo.lock
CONFLICT (content): Merge conflict in Cargo.lock
Automatic merge failed; fix conflicts and then commit the result.
```

Introduce rustc_lexer
The idea here is to make a reusable library out of the existing
rust lexer, by separating out pure lexing from rustc-specific concerns,
like spans, error reporting, and interning.

So, rustc_lexer operates directly on `&str`, produces simple tokens
which are a pair of a type-tag and a bit of original text, and does not
report errors, instead storing them as flags on the token.

@matklad matklad force-pushed the matklad:the-essence-of-lexer branch from fcefc6f to 395ee0b Jul 20, 2019

@matklad

Member Author

commented Jul 20, 2019

@bors r=petrochenkov

@bors

Contributor

commented Jul 20, 2019

📌 Commit 395ee0b has been approved by petrochenkov
