Example of Python-like parsing #20
Comments
Python's parser/lexer (whichever processes indentation, I've not checked) is not context-free, because correctly parsing each line relies on contextual information from previous lines (i.e. lines 'pair up' with neighbouring lines into blocks if their indentation matches). Chumsky (like virtually all parser combinators and parser generators) does not play well with context-sensitive grammars. It's usually necessary to hand-write a context-sensitive parser for such languages.
All is not lost, however! A neat trick you can use is to write your lexer by hand and have it translate Pythonic indentation into traditional code blocks with delimiters. Once this is done, the tokens can be fed into Chumsky and parsed in a context-free manner. I would actually advise you to use this strategy: get rid of the context-sensitive indentation as early as you possibly can in the compiler, since it makes parsing (and representing the AST later on) needlessly complex.
I've actually written an example lexer that does this in the past, complete with correct handling of parentheses and newlines. You can take a look at it here. Once you've turned the text into a stream of tokens, they're pretty easy to feed to Chumsky.
It's probably possible to implement something similar as a custom parser for Chumsky too, although it's obviously a bit ugly for the reasons I mentioned above. I'll look into this a little more to see if it's possible. |
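For readers following along, here is a minimal sketch of the "translate indentation into delimiters" trick described above. It is not the example lexer linked in the comment and it does not use Chumsky; the names `Token` and `delimit_blocks` are hypothetical, and it assumes space-only indentation, skips blank lines, and ignores the parenthesis handling mentioned above.

```rust
/// Tokens produced by the hand-written pass (hypothetical names).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Open,         // an indented block starts here
    Close,        // an indented block ends here
    Line(String), // one source line, not yet split into finer tokens
}

/// Replace significant indentation with explicit `Open`/`Close` delimiters.
/// Assumption: indentation is made of spaces only.
fn delimit_blocks(src: &str) -> Result<Vec<Token>, String> {
    // Indentation widths of the blocks currently open; 0 is the top level.
    let mut stack = vec![0usize];
    let mut out = Vec::new();

    for (idx, raw) in src.lines().enumerate() {
        let text = raw.trim_start_matches(' ');
        if text.trim().is_empty() {
            continue; // blank lines never open or close a block
        }
        let indent = raw.len() - text.len();
        let current = *stack.last().unwrap();

        if indent > current {
            // Deeper than the enclosing block: open a new one.
            stack.push(indent);
            out.push(Token::Open);
        } else {
            // Shallower: close blocks until we reach a matching level.
            while indent < *stack.last().unwrap() {
                stack.pop();
                out.push(Token::Close);
            }
            if indent != *stack.last().unwrap() {
                return Err(format!("line {}: inconsistent dedent", idx + 1));
            }
        }
        out.push(Token::Line(text.trim_end().to_string()));
    }

    // Close any blocks still open at the end of the input.
    while stack.len() > 1 {
        stack.pop();
        out.push(Token::Close);
    }
    Ok(out)
}
```

Calling `delimit_blocks` on a snippet with one indented body yields the header line, an `Open`, the body lines, and a `Close`. After this pass the token stream no longer depends on column positions, so it can be handed to a context-free parser.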
Thank you for the detailed answer and the bonus example! Really appreciate it. |
Something I have seen before is introducing special |
Yep, that's how I've seen a lot of lexer/parser combinations do it. That said, I find this approach hard to work with in languages (like Python) that use newlines both as a separator and as whitespace. |
@zesterer Apologies for the bump after two weeks, but I'm curious.
If it turns out not to be possible, could an indent-aware parsing example still be included in the |
I've had a think about it and although I believe that it should be possible to implement, I don't think that any implementation I could add to Chumsky would be sufficiently flexible to cover all use-cases. In terms of a hand-written example implementation, I'm very happy to accept something like this if someone is interested in implementing it. Right now, unfortunately, I'm focusing on a number of other things so I likely won't get time to work on it until after New Year. |
I've actually made some progress on this! I'd really like feedback about how this function might be improved: I'm sure that it's not general enough for some cases, but it's not yet clear to me how best to generalise it. Thoughts would be very welcome! |
Hi @zesterer, Thank you for adding the I am new to Chumsky so it took me a while to figure out how to feed the nested token trees from the lexing stage to the next stage. I think I need to use I am hoping the example could demonstrate that too. So I have updated the example here. Am I on the right track? Or is there a better way? |
Could that approach be merged back to master? |
Hey, sorry it took so long to respond to this! Yes, this approach works fine for now. That said, I'm a little hesitant to recommend it as a long-term solution because it's quite verbose. I'm still hopeful that there's a way to add support for parsing token trees into the Could you open a PR with this change to the example? I think it's merge-worthy still, provided it's made clear that the |
Hi @zesterer, No worries. I opened PR #166. I simply rebased my branch onto the latest master. Please let me know if you want me to change any of the code.
I hope you will find a way. But I am already quite happy with current Chumsky too. Thank you for creating such a great library! |
I'm not certain if the example is meant to cover this, but the pythonic parser does not enforce that the individual tokens be separated by whitespace, meaning ambiguous syntax like 10foo:
20bar Gets parsed to: |
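As an illustration of the boundary problem raised above, here is a small hand-written sketch of a lexer that refuses a number running straight into an identifier. It is independent of Chumsky and of the example in the repository; `Tok` and `lex_line` are hypothetical names, and the token set (integers, identifiers, `:`) is an assumption chosen to fit the `10foo:` case.

```rust
/// Hypothetical token type covering just enough to show the check.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Int(String),
    Ident(String),
    Colon,
}

/// Lex one line, rejecting input such as `10foo` where a number is
/// immediately followed by an identifier character.
fn lex_line(line: &str) -> Result<Vec<Tok>, String> {
    let chars: Vec<char> = line.chars().collect();
    let mut i = 0;
    let mut out = Vec::new();

    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c == ':' {
            out.push(Tok::Colon);
            i += 1;
        } else if c.is_ascii_digit() {
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            // The boundary check: a digit run must not touch an identifier.
            if i < chars.len() && (chars[i].is_alphabetic() || chars[i] == '_') {
                return Err(format!("column {}: number runs into identifier", i + 1));
            }
            out.push(Tok::Int(chars[start..i].iter().collect()));
        } else if c.is_alphabetic() || c == '_' {
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            out.push(Tok::Ident(chars[start..i].iter().collect()));
        } else {
            return Err(format!("column {}: unexpected character {:?}", i + 1, c));
        }
    }
    Ok(out)
}
```

With this check, `10foo:` is reported as an error, while `10 foo:` lexes as an integer, an identifier, and a colon.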
The |
It's quite likely that However, there's rapid movement on this front! The If you feel you can't wait for |
Thank you! It looks promising. I will try to write a lexer first, and maybe switch to the new release afterwards. |
@zesterer this is fantastic. Thank you so much for all the hard work on this. |
Having an example on how to parse Python-like languages that are aware of indentation would be interesting.