The essence of lexer #59706
I would love to make a reusable library for lexing Rust code, which could be used by rustc, rust-analyzer, proc-macros, etc. This draft PR is my attempt at the API. Currently, the PR uses the new lexer to lex comments and shebangs, while using the old lexer for everything else. This should be enough to agree on the API, though!
We probably should expose a convenience function
EDIT: I've added
The lexer itself provides only a minimal amount of error detection and reporting. Additionally, it never fatal-errors and always produces some non-empty token. Examples of errors detected by the lexer:
Examples of errors not detected by the lexer:
The idea is that clients are responsible for additional validation of tokens. This is the mode an IDE operates in: you want to skip validation for library files, because you aren't showing errors there anyway, while for user code you want deep validation with quick fixes and suggestions, which is not really a good fit for the lexer itself.
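To make that division of labor concrete, here is a minimal sketch (all names are invented for illustration, not the PR's actual API): the lexer records raw facts about each token and never fails, while a separate validation pass, which an IDE may simply skip, turns those facts into diagnostics.

```rust
// Sketch only: illustrative names, not the actual API of this PR.
// The lexer never fatal-errors; malformed input becomes an ordinary
// token whose kind records the problem, and every token is non-empty.

#[derive(Debug, PartialEq)]
enum TokenKind {
    Ident,
    Whitespace,
    /// "/* ..." with no closing "*/": noted, but not an error here.
    BlockComment { terminated: bool },
    /// A character the lexer does not recognize at all.
    Unknown,
}

#[derive(Debug)]
struct Token {
    kind: TokenKind,
    len: usize, // length in bytes; always > 0
}

fn first_token(input: &str) -> Token {
    let mut chars = input.chars();
    match chars.next().expect("non-empty input") {
        '/' if chars.clone().next() == Some('*') => match input.find("*/") {
            Some(i) => Token { kind: TokenKind::BlockComment { terminated: true }, len: i + 2 },
            None => Token { kind: TokenKind::BlockComment { terminated: false }, len: input.len() },
        },
        c if c.is_alphabetic() || c == '_' => {
            let len = input
                .find(|c: char| !(c.is_alphanumeric() || c == '_'))
                .unwrap_or(input.len());
            Token { kind: TokenKind::Ident, len }
        }
        c if c.is_whitespace() => {
            let len = input.find(|c: char| !c.is_whitespace()).unwrap_or(input.len());
            Token { kind: TokenKind::Whitespace, len }
        }
        c => Token { kind: TokenKind::Unknown, len: c.len_utf8() },
    }
}

/// The separate validation pass: clients that don't show errors
/// (e.g. an IDE lexing library files) simply never call this.
fn validate(kind: &TokenKind) -> Option<&'static str> {
    match kind {
        TokenKind::BlockComment { terminated: false } => Some("unterminated block comment"),
        _ => None,
    }
}

fn main() {
    // An unterminated comment still lexes to a single, non-empty token...
    let tok = first_token("/* oops");
    assert_eq!(tok.len, 7);
    // ...and only the optional validation pass reports it as an error.
    assert_eq!(validate(&tok.kind), Some("unterminated block comment"));
}
```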
In particular, in this PR unclosed
No attempt at performance measurement has been made so far :) I think it is acceptable to regress perf here a bit in exchange for cleaner code, and I hope the regression won't be too costly. In particular, because we validate tokens separately, we'll have to do one more pass over some of the tokens. I hope this is not a prohibitive cost. For example, for doc comments we already do two passes (lexing + interning), so adding a third one shouldn't be that much slower (and we also do an additional pass for UTF-8 validation). And lexing is hopefully not a bottleneck. Note that for IDEs, separate validation might actually improve performance, because we will be able to skip validation when, for example, computing completions.
Long term, I hope that this approach will allow for better performance. If we separate out pure lexing, in the future we can code-gen a super-optimized state machine that walks UTF-8 directly, instead of the current manual char-by-char toil.
For implementation, I am going slightly unconventionally. Instead of defining a
EDIT: switched to a more conventional setup with lexing methods
So, what do folks think about this?
Since I was assigned, here are my priorities:
So, if the lexer crate follows the model of rustc-ap-syntax, then I'm happy.
If the first priority is satisfied, then I'm not even too interested in discussing the exact interface of the proposed reusable lexer - it could be improved at any time if some usability or performance issues are found.
Reassigning to someone who can weigh in on high-level design.
I agree that this should be just a usual library in the rust monorepo, and that it shouldn't have any compatibility guarantees. As a stretch goal, I'd love to additionally make sure that just
The hard requirement for me, though, is building on stable. This is different from the ap-syntax model, which is nightly-only. I hope it'll "just work"; the interface seems pretty minimal (although the various Unicode tables in libcore might be a problem). At worst, we can have a feature flag in the crate to enable rustc_private stuff.
One concern I have is that the API of the old lexer kind of predates external iterators.
So that said, I think we should have one or two of:
What I don't think we should have is anything resembling the current API, which is stateful but at the same time it
I also agree with @petrochenkov that
Thanks for the review @eddyb! Given the general thumbs-up here, I'll work on this in the coming weeks to make it production-ready!
I think we should do both: the stateless one is less powerful (you can't lex Python-style f-strings with it), so, since Rust's lexical grammar admits stateless lexing, we should use it. Stateless lexing is also good for incremental relexing. For users, though, an iterator API on top of the stateless API would be preferable.
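The layering described here can be sketched roughly like this (names are illustrative, not necessarily the PR's exact API): a stateless `first_token` function does all the work, and the iterator is a thin loop over it, which keeps incremental relexing possible by calling the stateless core at any restart point.

```rust
// Sketch of layering an iterator API over a stateless lexing function
// (illustrative names; not the actual API of this PR).

struct Token {
    kind: &'static str,
    len: usize,
}

/// Stateless core: lex exactly one token from the start of `input`.
fn first_token(input: &str) -> Token {
    let c = input.chars().next().expect("non-empty input");
    if c.is_whitespace() {
        let len = input.find(|c: char| !c.is_whitespace()).unwrap_or(input.len());
        Token { kind: "whitespace", len }
    } else {
        let len = input.find(char::is_whitespace).unwrap_or(input.len());
        Token { kind: "word", len }
    }
}

/// Convenience iterator on top: just repeatedly call the stateless core.
fn tokenize(mut input: &str) -> impl Iterator<Item = Token> + '_ {
    std::iter::from_fn(move || {
        if input.is_empty() {
            return None;
        }
        let tok = first_token(input);
        input = &input[tok.len..];
        Some(tok)
    })
}

fn main() {
    let kinds: Vec<&str> = tokenize("fn main").map(|t| t.kind).collect();
    assert_eq!(kinds, ["word", "whitespace", "word"]);
}
```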
I also plan to initially preserve the API of the current code in libsyntax exactly (by proxying to the new crate), and do the simplification refactoring in a separate PR.
Yeah, I was debating what to do with shebangs as well... Part of me wants to say "nah, this is an implementation-defined concern" and just not handle it in this library. Your proposal of a separate
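One way to keep shebang handling out of the token stream entirely is a separate helper that callers may apply before lexing. A rough sketch (the function name and exact semantics are assumptions, not the PR's confirmed design):

```rust
// Hypothetical helper: shebang handling lives outside the lexer proper,
// so the token stream never sees the "#!" line.

/// If `input` starts with a "#!" line, return the number of bytes to
/// skip (including the newline) before handing the rest to the lexer.
fn strip_shebang(input: &str) -> Option<usize> {
    if input.starts_with("#!") {
        Some(input.find('\n').map(|i| i + 1).unwrap_or(input.len()))
    } else {
        None
    }
}

fn main() {
    let src = "#!/usr/bin/env rust\nfn main() {}\n";
    let offset = strip_shebang(src).unwrap_or(0);
    assert_eq!(&src[offset..], "fn main() {}\n");
    // Ordinary source is untouched.
    assert_eq!(strip_shebang("fn main() {}"), None);
}
```

A real implementation would also need to disambiguate `#!` from the start of an inner attribute like `#![feature(...)]`, which this sketch ignores.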
Is it OK for
Heh, for me personally, free-standing functions for grammar productions and methods for lookahead/bump work much better, but even if this approach is objectively better, it still makes sense to go with methods to minimize exoticism. Will fix that!
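For concreteness, the "everything as methods" style being settled on looks roughly like this (names invented for illustration): grammar productions become methods on the same cursor that owns the `first`/`bump` primitives, rather than free functions taking the cursor as an argument.

```rust
// Sketch of the methods-on-a-cursor style (illustrative names only).

struct Cursor<'a> {
    chars: std::str::Chars<'a>,
}

impl<'a> Cursor<'a> {
    fn new(input: &'a str) -> Self {
        Cursor { chars: input.chars() }
    }

    /// Lookahead: peek at the next character without consuming it.
    fn first(&self) -> Option<char> {
        self.chars.clone().next()
    }

    /// Consume and return the next character.
    fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }

    /// A grammar production as a method, sitting next to the primitives:
    /// consume a run of decimal digits, returning how many were eaten.
    fn eat_digits(&mut self) -> usize {
        let mut n = 0;
        while matches!(self.first(), Some(c) if c.is_ascii_digit()) {
            self.bump();
            n += 1;
        }
        n
    }
}

fn main() {
    let mut cursor = Cursor::new("123abc");
    assert_eq!(cursor.eat_digits(), 3);
    assert_eq!(cursor.first(), Some('a'));
}
```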
Just to keep closer to the original implementation. I am not sure what the maintenance status of
I do think we should switch to
Heh, indeed! Fixed (EDIT: and now pushed the commit)!
I wonder if we should remove (in a follow-up PR) the number of hashes from the literal kind (thus making it a C-style enum), and store the whole thing, with hashes, inside of
The motivation behind the current representation is the ability to query the literal kind easily (without looking into its content), while at the same time avoiding duplication between the kind and content, so they cannot go out of sync for various synthetic tokens (#60936 (comment)).
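The trade-off being weighed can be sketched as follows (types invented for illustration; not the actual rustc representation). Keeping the hash count in the kind means the kind can be queried without inspecting the string, while the symbol stays delimiter-free; the alternative would make the kind a C-style enum and push the hashes into the symbol itself.

```rust
// Illustrative sketch of the representation under discussion.

#[derive(Debug, Clone, Copy, PartialEq)]
enum LitKind {
    Str,
    // r##"..."## => n_hashes = 2
    RawStr { n_hashes: u16 },
}

struct Lit<'a> {
    kind: LitKind,
    /// Contents *without* delimiters or hashes, so the kind and the
    /// content cannot disagree for synthetic tokens.
    symbol: &'a str,
}

fn main() {
    let lit = Lit { kind: LitKind::RawStr { n_hashes: 2 }, symbol: "hello" };
    // Querying the kind requires no string inspection:
    assert!(matches!(lit.kind, LitKind::RawStr { n_hashes: 2 }));
    assert_eq!(lit.symbol, "hello");
}
```

The C-style alternative would instead store something like `symbol = "##\"hello\"##"` under a hash-less `RawStr` kind, trading easy kind queries for a smaller enum.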
This pull request and the master branch diverged in a way that cannot be automatically merged. Please rebase on top of the latest master branch, and let the reviewer approve again.
How do I rebase?
You may also read Git Rebasing to Resolve Conflicts by Drew Blessing for a short tutorial.
Please avoid the "Resolve conflicts" button on GitHub. It uses
Sometimes step 4 will complete without asking for resolution. This is usually due to difference between how