Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example of Python-like parsing #20

Closed
VinnyVicious opened this issue Nov 3, 2021 · 18 comments
Closed

Example of Python-like parsing #20

VinnyVicious opened this issue Nov 3, 2021 · 18 comments
Labels
wontfix This will not be worked on

Comments

@VinnyVicious
Copy link

Having an example on how to parse Python-like languages that are aware of indentation would be interesting.

@zesterer
Copy link
Owner

zesterer commented Nov 4, 2021

Python's parser/lexer (whichever processes identation, I've not checked) is not context-free because correctly parsing each line relies on the contextual information available in previous lines (i.e: lines 'pair up' with neighbouring lines into blocks if their indentation matches).

Chumsky (and virtually all parser combinators and parser generators) does not play well with context-sensitive grammars. It's usually necessary to hand-write a context-sensitive parser for such languages.

All is not lost, however! A neat trick you can use is to write your lexer by hand and have it translate Pythonic indentation into traditional code blocks with delimiters. Once this is done, the tokens can be fed into Chumsky and parsed in a context-free manner.

I would actually advise you use this strategy: get rid of the context-sensitive indentation as early as you possibly can in the compiler, it makes parsing (and representing the AST later on) needlessly complex.

I've actually written an example lexer that does this in the past, complete with correct handling of parentheses and newlines. You can take a look at it here. Once you've turned the text into a stream of tokens, they're pretty easy to feed to Chumsky.

It's probably possible to implement something similar as custom parser for Chumsky too, although it's obviously a bit ugly for the reasons I mentioned above. I'll look into this a little more to see if it's possible.

@VinnyVicious
Copy link
Author

Thank you for the detailed answer and the bonus example! Really appreciate it.

@Xandaros
Copy link

Xandaros commented Nov 9, 2021

Something I have seen before is introducing special INDENT and OUTDENT tokens in your lexer. I thought that was quite clever.

@mainrs
Copy link

mainrs commented Nov 12, 2021

Something I have seen before is introducing special INDENT and OUTDENT tokens in your lexer. I thought that was quite clever.

Yep, that's how I've seen a lot of lexer/parser combinations do it. Although I find it then hard to work with languages (like Python) that use newlines as a separator and as a whitespace.

@zesterer zesterer added the wontfix This will not be worked on label Nov 15, 2021
@LikeLakers2
Copy link

@zesterer Apologies for the bump after two weeks, but I'm curious.

I'll look into this a little more to see if it's possible.

If it turns out not to be possible, could an indent-aware parsing example still be included in the /examples/ directory? That way, users would know how they might make Chumsky work with indent-aware content, even if it requires some hand-written code on their end.

@zesterer
Copy link
Owner

I've had a think about it and although I believe that it should be possible to implement, I don't think that any implementation I could add to Chumsky would be sufficiently flexible to cover all use-cases.

In terms of a hand-written example implementation, I'm very happy to accept something like this if someone is interested in implementing it. Right now, unfortunately, I'm focusing on a number of other things so I likely won't get time to work on it until after New Year.

@zesterer zesterer reopened this Apr 14, 2022
@zesterer
Copy link
Owner

I've actually made some progress on this! master now contains a parser called semantic_whitespace that can be used to tokenise Python-style syntax into token trees, respecting semantic indentation. You can find an example of its use here.

I'd really like feedback about how this function might be improved: I'm sure that it's not general enough for some cases, but it's not yet clear to me how best to generalise it. Thoughts would be very welcome!

@tatsuya6502
Copy link

tatsuya6502 commented May 1, 2022

Hi @zesterer,

Thank you for adding the semantic_whitespace parser. It looks promising!

I am new to Chumsky so it took me a while to figure out how to feed the nested token trees from the lexing stage to the next stage. I think I need to use Stream::from_nested to create a flattened token stream before feeding them to the parser.

I am hoping the exsample could demonstrate that too. So I have updated the example here. Am I on the right track? or is there any better way?

@VinnyVicious
Copy link
Author

Could that approach be merged back to master?

@zesterer
Copy link
Owner

zesterer commented Jul 5, 2022

@tatsuya6502

Hi @zesterer,

Thank you for adding the semantic_whitespace parser. It looks promising!

I am new to Chumsky so it took me a while to figure out how to feed the nested token trees from the lexing stage to the next stage. I think I need to use Stream::from_nested to create a flattened token stream before feeding them to the parser.

I am hoping the exsample could demonstrate that too. So I have updated the example here. Am I on the right track? or is there any better way?

Hey, sorry it took so long to respond to this!

Yes, this approach works fine for now. That said, I'm a little hesitant to recommend it as a long-term solution because it's quite verbose. I'm still hopeful that there's a way to add support for parsing token trees into the zero-copy branch, but it seems... mechanically complicated to say the least.

Could you open a PR with this change to the example? I think it's merge-worthy still, provided it's made clear that the Stream::from_nested is conceptually distinct from the lexing stage.

@tatsuya6502
Copy link

Hi @zesterer,

No worries. I opened the PR #166. I simply rebased my branch to the latest master. Please let me know if you want me to change some codes.

That said, I'm a little hesitant to recommend it as a long-term solution because it's quite verbose. I'm still hopeful that there's a way to add support for parsing token trees into the zero-copy branch, but it seems... mechanically complicated to say the least.

I hope you will find a way. But I am already quite happy with current Chumsky too. Thank you for creating such a great library!

@ehllie
Copy link

ehllie commented Sep 30, 2022

I'm not certain if the example is meant to cover this, but the pythonic parser does not enforce that the individual tokens be separated by whitespace, meaning ambiguous syntax like

10foo:
    20bar

Gets parsed to: [(Int(10), 0..2), (Ident("foo"), 2..5), (Op(":"), 5..6), (Open(Block), 9..14), (Int(20), 9..11), (Ident("bar"), 11..14), (Close(Block), 9..14)]
I could not figure out how to enforce that while using the semantic_whitespace parser, but that could be down to me not understanding the library well enough yet.

@zesterer
Copy link
Owner

The pythonic example is just a very simple example, not a full parser for Python code. The issue you're seeing is unrelated to semantic_whitespace, it's just a product of it being a simple example rather than a fully-fledged parser.

@TTTPOB
Copy link

TTTPOB commented Feb 9, 2023

hello, I am not sure whether here is the right place to ask about the question.
it seems that the semantic_indentation parser will report indentation demilited code blocks with span (index_of_opening_indentation, end_of_the_opening_indentation_line). but for parenthesis delimited code block the span is correct.
In pythonic.rs example, here is a parenthesis delimited code block
image
and here are the related items in token stream
image

here is a indentation code block
image
here are the items
image

I am writing a parser for a dsl which is a superset of python, and when i am building the lexer, I have seen that as early as the nested TokenTree step, the span is not correct.
image

obviously the span should be 14..44

I am wandering is this mean that semantic_indentation parser has some bug? or is this by-design?
I am new to language parser and chumsky, thank you!

@zesterer
Copy link
Owner

zesterer commented Feb 9, 2023

@TTTPOB

It's quite likely that semantic_indentation has either a bug or is not quite able to express what you want: it's a stop-gap for chumsky's (current) lack of true context-sensitive parsing, which is required to parse Python-style indentation properly.

However, there's rapid movement on this front! The zero-copy branch (due for merge soon) is getting a rugged, powerful system for expressive context-sensitive parsing. You can see an example of this being used to parse Python-style indentation here.

If you feel you can't wait for zero-copy and a new release (I suspect it's probably about a month away), my best advice would be to parse indentation in a hand-written lexer and pass it on to chumsky as delimiter token, such as C-style braces.

@TTTPOB
Copy link

TTTPOB commented Feb 10, 2023

If you feel you can't wait for zero-copy and a new release (I suspect it's probably about a month away), my best advice would be to parse indentation in a hand-written lexer and pass it on to chumsky as delimiter token, such as C-style braces.

Thank you! it looks promising, I will try to write a lexer first, and maybe switch to the new release after.

@zesterer
Copy link
Owner

zesterer commented Feb 20, 2023

zero-copy now supports context-sensitive parsing (see #269), enough to properly implement Python style indentation sensitivity (and much more!). There's even an example.

@VinnyVicious
Copy link
Author

@zesterer this is fantastic. Thank you so much for all the hard work on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

8 participants