The essence of lexer #59706

Merged
merged 1 commit into from Jul 21, 2019

matklad
Member

@matklad matklad commented Apr 4, 2019

cc @eddyb

I would love to make a reusable library to lex Rust code, which could be used by rustc, rust-analyzer, proc-macros, etc. This draft PR is my attempt at the API. Currently, the PR uses the new lexer to lex comments and shebangs, while using the old lexer for everything else. This should be enough to agree on the API though!

High-level picture

A rust_lexer crate is introduced, with zero or minimal (for XID_Start and other Unicode data) dependencies. This crate basically exposes a single function: next_token(&str) -> (TokenKind, usize), which returns the first token of a non-empty string (usize is the length of the token). The main goal of the API is to be minimal. Concerns that are not strictly essential, like string interning, are left to the clients.
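For illustration only, here is a minimal sketch of what a function with this shape could look like. The `TokenKind` variants and the `take_while` helper are hypothetical and far smaller than anything a real Rust lexer needs; this is not the code from the PR.

```rust
/// Hypothetical, heavily reduced set of token kinds (an assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Whitespace,
    Ident,
    Number,
    /// Any single character this sketch does not recognize.
    Unknown,
}

/// Byte length of the longest prefix of `input` whose chars all satisfy `pred`.
fn take_while(input: &str, pred: impl Fn(char) -> bool) -> usize {
    input.find(|c: char| !pred(c)).unwrap_or(input.len())
}

/// Returns the kind and byte length of the first token of `input`.
/// Panics on empty input, mirroring the "non-empty string" precondition.
pub fn next_token(input: &str) -> (TokenKind, usize) {
    let first = input.chars().next().expect("input must be non-empty");
    if first.is_whitespace() {
        (TokenKind::Whitespace, take_while(input, char::is_whitespace))
    } else if first.is_alphabetic() || first == '_' {
        // Stand-in for XID_Start / XID_Continue checks.
        (TokenKind::Ident, take_while(input, |c| c.is_alphanumeric() || c == '_'))
    } else if first.is_ascii_digit() {
        (TokenKind::Number, take_while(input, |c| c.is_ascii_digit()))
    } else {
        (TokenKind::Unknown, first.len_utf8())
    }
}
```

Because the function is stateless and returns only a kind and a length, the caller decides what to do with the text: slice it, intern it, or ignore it.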

Finer Points

Iterator API

We probably should expose a convenience function fn tokenize(&str) -> impl Iterator<Item = Token>

EDIT: I've added tokenize
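A rough sketch of how such a convenience iterator can be layered on a stateless `next_token`. The `Token` struct and the whitespace-only `next_token` stand-in below are assumptions made to keep the example self-contained, not the definitions from the PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Whitespace,
    Word,
}

/// Hypothetical token: a kind plus a byte length, no text and no position.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub len: usize,
}

/// Stand-in for the real `next_token`: splits only at whitespace boundaries.
fn next_token(input: &str) -> (TokenKind, usize) {
    let first = input.chars().next().expect("input must be non-empty");
    let is_ws = first.is_whitespace();
    let len = input
        .find(|c: char| c.is_whitespace() != is_ws)
        .unwrap_or(input.len());
    let kind = if is_ws { TokenKind::Whitespace } else { TokenKind::Word };
    (kind, len)
}

/// Applies `next_token` repeatedly until the input is exhausted.
pub fn tokenize(mut input: &str) -> impl Iterator<Item = Token> + '_ {
    std::iter::from_fn(move || {
        if input.is_empty() {
            return None;
        }
        let (kind, len) = next_token(input);
        // Advance past the token; `len` is always on a char boundary.
        input = &input[len..];
        Some(Token { kind, len })
    })
}
```

Since tokens carry only lengths, a client that needs offsets can recover them by keeping a running sum while iterating.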

Error handling

The lexer itself provides only a minimal amount of error detection and reporting. Additionally, it never fatal-errors and always produces some non-empty token. Examples of errors detected by the lexer:

  • unterminated block comment
  • unterminated string literals

Examples of errors not detected by the lexer:

  • invalid escape sequence in a string literal
  • out of range integer literal
  • bare \r in a doc comment

The idea is that the clients are responsible for additional validation of tokens. This is the mode an IDE operates in: you want to skip validation for library files, because you are not showing errors there anyway, and for user code you want to do deep validation with quick fixes and suggestions, which is not really a fit for the lexer itself.

In particular, in this PR an unclosed /* comment is handled by the new lexer, while bare \r and the distinction between doc and non-doc comments are handled by the old lexer.
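One way to picture this split, as a sketch under assumptions: the lexer records "unterminated" as a flag on the token instead of failing, and a separate client-side pass looks for bare \r. The `terminated` field and the `bare_cr_offsets` function are hypothetical names chosen for this example.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum TokenKind {
    /// The lexer never fails; it only records whether the comment was closed.
    BlockComment { terminated: bool },
    /// Anything else, collapsed into one kind to keep the sketch short.
    Other,
}

/// Lexes a (possibly nested) block comment at the start of `input`.
pub fn next_token(input: &str) -> (TokenKind, usize) {
    let first = input.chars().next().expect("input must be non-empty");
    if !input.starts_with("/*") {
        return (TokenKind::Other, first.len_utf8());
    }
    let bytes = input.as_bytes();
    let mut depth = 0usize;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"/*") {
            depth += 1;
            i += 2;
        } else if bytes[i..].starts_with(b"*/") {
            depth -= 1;
            i += 2;
            if depth == 0 {
                return (TokenKind::BlockComment { terminated: true }, i);
            }
        } else {
            i += 1;
        }
    }
    // Ran off the end: still a non-empty token, just flagged.
    (TokenKind::BlockComment { terminated: false }, input.len())
}

/// Client-side validation the lexer skips: byte offsets of every `\r`
/// that is not immediately followed by `\n`.
pub fn bare_cr_offsets(comment: &str) -> Vec<usize> {
    let bytes = comment.as_bytes();
    (0..bytes.len())
        .filter(|&i| bytes[i] == b'\r' && bytes.get(i + 1) != Some(&b'\n'))
        .collect()
}
```

A compiler would run `bare_cr_offsets` on every doc comment and report each offset, while an IDE could skip it entirely for library files.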

Performance

No attempt at performance measurement is made so far :) I think it is acceptable to regress perf here a bit in exchange for cleaner code, and I hope that regression wouldn't be too costly. In particular, because we validate tokens separately, we'll have to do one more pass for some of the tokens. I hope this is not a prohibitive cost. For example, for doc comments we already do two passes (lexing + interning), so adding a third one shouldn't be that much slower (and we also do an additional pass for utf-8 validation). And lexing is hopefully not a bottleneck. Note that for IDEs separate validation might actually improve performance, because we will be able to skip validation when, for example, computing completions.

Long term, I hope that this approach will allow for better performance. If we separate pure lexing, in the future we can code-gen a super-optimized state machine that walks UTF-8 directly, instead of the current manual char-by-char toil.

Cursor API

For implementation, I am going slightly unconventionally. Instead of defining a Lexer struct with a bunch of helper methods (current, bump) and a bunch of lexing methods (lex_comment, lex_whitespace), I define a Cursor struct which has only helpers, and define a top-level function with a &mut Cursor argument for each grammar production. I find this C-style more readable for parsers and lexers.

EDIT: switched to a more conventional setup with lexing methods
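To make the two styles concrete, here is a small sketch. The `Cursor` fields and the `len_consumed` and `eat_while` helpers are assumptions for this example; only `current`, `bump` and `lex_whitespace` are names taken from the description above.

```rust
use std::str::Chars;

/// Helper-only cursor over the input: lookahead and bump, no grammar knowledge.
pub struct Cursor<'a> {
    initial_len: usize,
    chars: Chars<'a>,
}

impl<'a> Cursor<'a> {
    pub fn new(input: &'a str) -> Cursor<'a> {
        Cursor { initial_len: input.len(), chars: input.chars() }
    }

    /// Peeks at the next char without consuming it.
    pub fn current(&self) -> Option<char> {
        self.chars.clone().next()
    }

    /// Consumes and returns the next char.
    pub fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }

    /// Number of bytes consumed since the cursor was created.
    pub fn len_consumed(&self) -> usize {
        self.initial_len - self.chars.as_str().len()
    }

    /// Consumes chars while `pred` holds.
    pub fn eat_while(&mut self, pred: impl Fn(char) -> bool) {
        while let Some(c) = self.current() {
            if !pred(c) {
                break;
            }
            self.bump();
        }
    }
}

/// "C-style": a free function per grammar production, taking `&mut Cursor`.
pub fn lex_whitespace(cursor: &mut Cursor<'_>) -> usize {
    cursor.eat_while(char::is_whitespace);
    cursor.len_consumed()
}

/// Conventional style: the same production as a method on the cursor.
impl Cursor<'_> {
    pub fn whitespace(&mut self) -> usize {
        self.eat_while(char::is_whitespace);
        self.len_consumed()
    }
}
```

The two variants do the same work; the difference is only whether grammar productions live next to the helpers or outside the type.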

So, what do folks think about this?

@rust-highfive
Collaborator

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 4, 2019
@alexcrichton
Member

I'm personally pretty unfamiliar with this work, but @matklad do you know who'd be good to review this?

@matklad
Member Author

matklad commented Apr 5, 2019

That is a good question. Perhaps @eddyb or @petrochenkov? I also feel that maybe this needs to be tagged with T-compiler and discussed more generally?

@alexcrichton alexcrichton added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2019
@alexcrichton
Member

r? @petrochenkov

@bors
Contributor

bors commented Apr 5, 2019

☔ The latest upstream changes (presumably #59721) made this pull request unmergeable. Please resolve the merge conflicts.

@petrochenkov
Contributor

Since I was assigned, here are my priorities:

  • High priority: the lexer code (both interfaces and implementation) can be tweaked at any time for performance or other reasons (this means zero stability guarantees) without infrastructural hurdles (no separate repos, submodule updates, crate version changes).
  • Lower priority: reuse of the lexer code with other projects.

So, if the lexer crate follows the model of rustc-ap-syntax, then I'm happy.
(It should probably be named librustc_lexer rather than rust_lexer in that case.)

If the first priority is satisfied, then I'm not even too interested in discussing the exact interface of the proposed reusable lexer - it could be improved at any time if some usability or performance issues are found.
Frankly, I have no idea how the perfect reusable lexer interface should look, I never wrote a whole lexer and don't know the requirements.
What this PR does seems fine for a start.

Reassigning to someone who can into high-level design.

@petrochenkov petrochenkov assigned Zoxc and eddyb and unassigned petrochenkov Apr 7, 2019
@matklad
Member Author

matklad commented Apr 8, 2019

Thanks!

I agree that this should be just a usual library in the rust monorepo, and that it shouldn't have any compatibility guarantees. As a stretch goal, I'd love to additionally make sure that just cargo test inside the librustc_lexer's dir works. This would help with

  1. lexer-specific test suite: now, to test my changes, I need to build the rest of the compiler, b/c some bits are only covered by run-pass, and that is slow
  2. a second "specification" implementation (a bunch of regexes + special casing /* and r#") which is compared with the production one and used in the language reference (this is something @rust-lang/wg-grammar might be interested in).

The hard requirement for me though is building on stable. This is different from the ap-syntax model, which is nightly only. I hope it'll "just work", since the interface seems pretty minimal (although various Unicode tables in libcore might be a problem). At worst, we can have a feature flag in the crate to enable rustc_private stuff.

@rust-highfive rust-highfive assigned petrochenkov and unassigned Zoxc and eddyb Apr 15, 2019
@petrochenkov petrochenkov assigned Zoxc and eddyb and unassigned petrochenkov Apr 15, 2019
@eddyb
Member

eddyb commented Apr 16, 2019

One concern I have is that the API of the old lexer kind of preceded external iterators.
That may sound strange, but, AFAIK, the lexer Reader may literally be older than Iterator.

So that said, I think we should have one or two of:

  1. a stateless "match token at start of string" API
  2. a stateful Iterator that applies 1. repeatedly

What I don't think we should have is anything resembling the current API, which is stateful but at the same time it

I also agree with @petrochenkov that rustc_lex(er) (or, IMO, syntax_lex(er)) are better names.

@matklad
Member Author

matklad commented Apr 16, 2019

Thanks for the review @eddyb! Given the general thumbsup here, I'll work on this in the coming weeks to make this production ready!

> So that said, I think we should have one or two of:

I think we should do both: the stateless one is less powerful (you can't lex Python-style f-strings with it), so, as long as Rust's lexical grammar admits stateless lexing, we should use it. Stateless is also good for incremental relexing. For the users though, an iterator API on top of the stateless API would be preferable.

I also plan to initially preserve the API of the current code in libsyntax exactly (by proxying to the new crate), and do the simplification refactoring in a separate PR.

> Now that I think about it, is_beginning_of_file is only used for shebangs, right?

Yeah, I was debating about what to do with shebangs as well... Part of me wants to say "nah, this is an implementation-defined concern", and just not handle it in this library. Your proposal of a separate fn strip_shebang is nice though: we both keep the core interface clean and handle shebangs in an implementation-independent way.
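A possible shape for such a helper, as a sketch only. The return type (byte length of the shebang line, if any) and the rule that `#![` starts an inner attribute instead of a shebang are assumptions of this example, not something settled in this thread.

```rust
/// Returns the byte length of a leading shebang line (without the newline),
/// or `None` if `input` does not start with one.
///
/// Meant to be called once, on the very beginning of a file, before
/// handing the rest of the text to the tokenizer.
pub fn strip_shebang(input: &str) -> Option<usize> {
    // Assumption: `#![...]` is an inner attribute, not a shebang.
    if !input.starts_with("#!") || input.starts_with("#![") {
        return None;
    }
    Some(input.find('\n').unwrap_or(input.len()))
}
```

With this in place the core `next_token` never needs an `is_beginning_of_file` flag: the caller skips `strip_shebang(src).unwrap_or(0)` bytes and lexes the remainder.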

> That would make this a &str -> Option, which is very close to FromStr!

Is it OK for FromStr to consume only part of the input though? I guess, if the public API is fn tokenize(src: &str) -> impl Iterator<Item = Token> this doesn't even really matter.

> I think it would be nicer if these were methods.

Heh, for me personally, free-standing functions for grammar productions and methods for lookahead/bump work much better, but, even if this approach is objectively better, it still makes sense to go with methods to minimize exoticism. Will fix that!

bors added a commit that referenced this pull request Jul 21, 2019
The essence of lexer

@bors
Contributor

bors commented Jul 21, 2019

☀️ Test successful - checks-azure
Approved by: petrochenkov
Pushing 83dfe7b to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Jul 21, 2019
@bors bors merged commit 395ee0b into rust-lang:master Jul 21, 2019
@matklad matklad deleted the the-essence-of-lexer branch July 21, 2019 11:14
@matklad
Member Author

matklad commented Jul 21, 2019

🎉 hopefully, this is the last lexer for Rust :)

@eddyb
Member

eddyb commented Jul 21, 2019

@matklad Would be cool to share it between the compiler and proc-macro2, as well!
cc @dtolnay

Centril added a commit to Centril/rust that referenced this pull request Sep 5, 2019
Use unicode-xid crate instead of libcore

This PR proposes to remove `char::is_xid_start` and `char::is_xid_continue` functions from `libcore` and use `unicode_xid` crate from crates.io (note that this crate is already present in rust-lang/rust's Cargo.lock).

Reasons to do this:

* removing rustc-binary-specific stuff from libcore
* making sure that, across the ecosystem, there's a single definition of what a Rust identifier is (`unicode-xid` has almost 10 million downloads, as a `proc_macro2` dependency)
* making it easier to share `rustc_lexer` crate with rust-analyzer: no need to `#[cfg]` if we are building as a part of the compiler

Reasons not to do this:

* increased maintenance burden: we'll need to upgrade the Unicode version both in libcore and in unicode-xid. However, this shouldn't be too heavy a burden: just running `./unicode.py` after a new Unicode version. I (@matklad) am ready to be a t-compiler side maintainer of unicode-xid. Moreover, given that unicode-xid is an important dependency of syn, *someone* needs to maintain it anyway.
* the unicode-xid implementation is significantly slower. It uses a more compact table with binary search, instead of a trie. However, this shouldn't matter in practice, because we have a fast path for ASCII anyway, and the code size savings are a plus. Moreover, in rust-lang#59706 not using libcore turned out to be *faster*, presumably because checking for whitespace with match is even faster.
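
The combination of an ASCII fast path and a binary search over a compact range table can be sketched as follows. The table here is a toy, deliberately incomplete subset invented for this example (it is not real XID_Start data and not the `unicode-xid` implementation), and `is_id_start` is a hypothetical name.

```rust
use std::cmp::Ordering;

/// Toy range table, sorted and non-overlapping. NOT real XID_Start data.
const TOY_RANGES: &[(char, char)] = &[
    ('\u{00C0}', '\u{00D6}'), // Latin-1 letters before the multiplication sign
    ('\u{00D8}', '\u{00F6}'), // Latin-1 letters before the division sign
    ('\u{0391}', '\u{03A1}'), // Greek capital Alpha..Rho
];

/// Binary search for the range containing `c`.
fn in_table(c: char, table: &[(char, char)]) -> bool {
    table
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                Ordering::Greater // this range lies after `c`
            } else if c > hi {
                Ordering::Less // this range lies before `c`
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// ASCII is decided by a `match`; only non-ASCII chars reach the table.
pub fn is_id_start(c: char) -> bool {
    match c {
        'a'..='z' | 'A'..='Z' | '_' => true,
        c if c.is_ascii() => false,
        c => in_table(c, TOY_RANGES),
    }
}
```

Since most source text is ASCII, the table lookup is rarely reached, which is why the slower non-ASCII path is expected to matter little in practice.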

<details>

<summary>old description</summary>

Followup to rust-lang#59706

r? @eddyb

Note that this doesn't actually remove tables from libcore, to avoid conflict with rust-lang#62641.

cc unicode-rs/unicode-xid#11

</details>
@matklad
Member Author

matklad commented Sep 5, 2019

published as https://crates.io/crates/rustc_lexer

@eddyb
Member

eddyb commented Sep 6, 2019

@matklad Maybe we should move it out of tree if we publish it on crates.io?

cc @rust-lang/compiler I'm not sure if we have a clear policy on this.

@Centril
Contributor

Centril commented Sep 6, 2019

I'm opposed to moving important parts of the compiler out of tree because it becomes impractical to review changes to them from a language team perspective. I don't want to have to check what a bump in a PR that just updates Cargo.lock means in terms of semantic changes to the language.

@matklad
Member Author

matklad commented Sep 6, 2019

I feel like we need a dedicated discussion/RFC to figure out how to organize libraries in the librarified world. At the moment, I am content with the status quo, and I don't feel like we should change anything right now. Rather, we should ask this question when something bigger, like chalk, matures.

Long term, I personally would prefer a monorepo setup, but one where just cargo test --package thing-I-am-hacking is enough for most testing and where ./x.py test is something that mostly only bors executes. That is, I don't see a problem in things being in the same source tree, I see a problem with building the whole tree to get the basic testing of a single component.

@eddyb
Member

eddyb commented Sep 6, 2019

FWIW I would like ./x.py test --stage 0 src/librustc_lexer to "just" work.

@Mark-Simulacrum has some ideas about being able to use the most recent CI build artifact to avoid bootstrapping, not sure if they're required or not.

@Mark-Simulacrum
Member

IMO, if that doesn't work today, it's probably a bug. Furthermore, I would expect crates that are decoupled from the compiler (i.e., don't depend on unstable details from libstd and such) to work via cargo test --manifest-path ... as well, if you want to use fully "native" Cargo.

@matklad
Member Author

matklad commented Sep 6, 2019

> FWIW I would like ./x.py test --stage 0 src/librustc_lexer to "just" work.

It works, but only after going through the whole bootstrapping process.

> Furthermore, I would expect crates that are decoupled from the compiler (i.e., don't depend on unstable details from libstd and such) to work via cargo test --manifest-path ... as well, if you want to use fully "native" Cargo.

Wow, this indeed just works for rustc_lexer, now that #62848 is merged. It's not maximally useful at the moment, as there are few lexer specific tests, but hopefully rust-lang/wg-grammar#3 will fix that.

I guess that I am now happy with the current setup as the long-term setup :) The only minor thing which may be worth doing, if we are to pursue librarization seriously, is to move clean, bootstrap-independent librarified components from /src/libfoo to /crates/foo: the current src directory looks a little like a kitchen sink.

Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.