How to preserve spans from the lexer into the parser? #125

Closed
kevinbarabash opened this issue Apr 16, 2022 · 6 comments

Comments

@kevinbarabash

kevinbarabash commented Apr 16, 2022

I'm struggling to access char spans in my parser when using a two-step lexer + parser approach.

My lexer looks like:

Parser<char, Vec<(Token, Span)>, Error = Simple<char>>

and I started working on a parser that looks like:

Parser<(Token, Span), WithSpan<Expr>, Error = Simple<(Token, Span)>>

In nano_rust.rs, the parser only accepts Token instead of (Token, Span).

I'm going to try doing the same, then use the token spans provided by .map_with_span() in the parser to look up the corresponding span in the original char stream from the Vec<(Token, Span)> produced by the lexer.

I was wondering if this is the recommended way of accessing char spans from the parser when using a two-step parsing approach?
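
For illustration, here is a minimal sketch of the lookup described above, written against the standard library only. The `Token` variants, `char_span`, and `sample_tokens` are hypothetical names, not part of chumsky; the sketch assumes the parser reports spans as half-open ranges of token indices and that the lexer's output is still available.

```rust
use std::ops::Range;

/// Hypothetical token type, standing in for the lexer's `Token`.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    Num(u32),
    Plus,
}

/// Map a span measured in token indices back to a span measured in chars,
/// using the `(Token, Span)` pairs produced by the lexer.
fn char_span(
    tokens: &[(Token, Range<usize>)],
    token_span: Range<usize>,
) -> Option<Range<usize>> {
    // An empty or out-of-bounds token span has no meaningful source range.
    if token_span.start >= token_span.end || token_span.end > tokens.len() {
        return None;
    }
    let first = &tokens[token_span.start].1;
    let last = &tokens[token_span.end - 1].1;
    // Cover everything from the first token's start to the last token's end.
    Some(first.start..last.end)
}

/// Example lexer output for the source text `foo + 42`.
fn sample_tokens() -> Vec<(Token, Range<usize>)> {
    vec![
        (Token::Ident("foo".to_string()), 0..3),
        (Token::Plus, 4..5),
        (Token::Num(42), 6..8),
    ]
}
```

With this sample input, a parser span covering all three tokens (`0..3`) maps back to chars `0..8` of the source text.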

@CraftSpider
Collaborator

Personally, I use a custom Span with two fields.

struct Span {
    source_span: (usize, usize),
    stream_span: (usize, usize),
}

impl chumsky::Span for Span {
    // Return stream span here
}

Then later, you use Stream::from_iter() and construct your spans there from both bits of info.
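
A rough sketch of the "construct your spans from both bits of info" step, using only the standard library. It reuses the two-field `Span` from above; `with_spans`, `merge`, and `cover` are hypothetical helpers, and the `chumsky::Span` impl is left out. The resulting `(token, span)` pairs are presumably what would then be handed to `Stream::from_iter()`.

```rust
/// Two-field span: one position in the source text, one in the token stream.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    source_span: (usize, usize),
    stream_span: (usize, usize),
}

/// Smallest (start, end) pair covering both inputs.
fn cover(a: (usize, usize), b: (usize, usize)) -> (usize, usize) {
    (a.0.min(b.0), a.1.max(b.1))
}

impl Span {
    /// Combine two spans, e.g. the first and last token of an expression.
    fn merge(self, other: Span) -> Span {
        Span {
            source_span: cover(self.source_span, other.source_span),
            stream_span: cover(self.stream_span, other.stream_span),
        }
    }
}

/// Pair every lexed token with a `Span` that records both its source
/// range (from the lexer) and its index range in the token stream.
fn with_spans<T>(lexed: Vec<(T, (usize, usize))>) -> Vec<(T, Span)> {
    lexed
        .into_iter()
        .enumerate()
        .map(|(i, (token, source_span))| {
            let span = Span {
                source_span,
                // Each token occupies exactly one slot in the stream.
                stream_span: (i, i + 1),
            };
            (token, span)
        })
        .collect()
}
```

Keeping both positions in one value means later stages can report errors against the source text while the parser continues to work in token indices.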

@kevinbarabash
Author

@CraftSpider I like that the custom Span you suggested provides a way to differentiate source spans from (token) stream spans. Is the idea to use this custom Span for the parser and for the lexer just use (usize, usize)?

@CraftSpider
Collaborator

Yes. I lex with logos, so it just uses its range internally, then when I hit chumsky, I convert to this new differentiated span, and that span gets preserved into future steps.

@CraftSpider
Collaborator

If you want more example code, I can grab some snippets from the implementation

@zesterer
Owner

In nano_rust.rs, the parser only accepts Token instead of (Token, Span).

Chumsky supports spans 'natively', so you don't need to mention them in the type of the input token. Although nano_rust's parser takes Token, it still uses the same span internally, which is why the span is preserved until runtime (if you trigger a runtime error, such as calling a function with the wrong number of arguments, you will see this).

@kevinbarabash
Author

I ended up dropping the separate lexer that was generating a Vec<Token>. I'm working on a parser for a language with syntax very similar to JavaScript (but with more expressions and fewer statements). As part of this, I'd like to support parsing JSX and I don't think that's possible with a distinct lexer.
