Example of Python-like parsing #20

VinnyVicious · 2021-11-03T21:59:12Z

Having an example on how to parse Python-like languages that are aware of indentation would be interesting.

zesterer · 2021-11-04T10:43:22Z

Python's parser/lexer (whichever processes identation, I've not checked) is not context-free because correctly parsing each line relies on the contextual information available in previous lines (i.e: lines 'pair up' with neighbouring lines into blocks if their indentation matches).

Chumsky (and virtually all parser combinators and parser generators) does not play well with context-sensitive grammars. It's usually necessary to hand-write a context-sensitive parser for such languages.

All is not lost, however! A neat trick you can use is to write your lexer by hand and have it translate Pythonic indentation into traditional code blocks with delimiters. Once this is done, the tokens can be fed into Chumsky and parsed in a context-free manner.

I would actually advise you use this strategy: get rid of the context-sensitive indentation as early as you possibly can in the compiler, it makes parsing (and representing the AST later on) needlessly complex.

I've actually written an example lexer that does this in the past, complete with correct handling of parentheses and newlines. You can take a look at it here. Once you've turned the text into a stream of tokens, they're pretty easy to feed to Chumsky.

It's probably possible to implement something similar as custom parser for Chumsky too, although it's obviously a bit ugly for the reasons I mentioned above. I'll look into this a little more to see if it's possible.

VinnyVicious · 2021-11-04T18:50:34Z

Thank you for the detailed answer and the bonus example! Really appreciate it.

Xandaros · 2021-11-09T23:40:00Z

Something I have seen before is introducing special INDENT and OUTDENT tokens in your lexer. I thought that was quite clever.

mainrs · 2021-11-12T14:00:36Z

Something I have seen before is introducing special INDENT and OUTDENT tokens in your lexer. I thought that was quite clever.

Yep, that's how I've seen a lot of lexer/parser combinations do it. Although I find it then hard to work with languages (like Python) that use newlines as a separator and as a whitespace.

LikeLakers2 · 2021-11-30T00:26:20Z

@zesterer Apologies for the bump after two weeks, but I'm curious.

I'll look into this a little more to see if it's possible.

If it turns out not to be possible, could an indent-aware parsing example still be included in the /examples/ directory? That way, users would know how they might make Chumsky work with indent-aware content, even if it requires some hand-written code on their end.

zesterer · 2021-11-30T15:51:32Z

I've had a think about it and although I believe that it should be possible to implement, I don't think that any implementation I could add to Chumsky would be sufficiently flexible to cover all use-cases.

In terms of a hand-written example implementation, I'm very happy to accept something like this if someone is interested in implementing it. Right now, unfortunately, I'm focusing on a number of other things so I likely won't get time to work on it until after New Year.

zesterer · 2022-04-14T17:25:01Z

I've actually made some progress on this! master now contains a parser called semantic_whitespace that can be used to tokenise Python-style syntax into token trees, respecting semantic indentation. You can find an example of its use here.

I'd really like feedback about how this function might be improved: I'm sure that it's not general enough for some cases, but it's not yet clear to me how best to generalise it. Thoughts would be very welcome!

tatsuya6502 · 2022-05-01T07:47:34Z

Hi @zesterer,

Thank you for adding the semantic_whitespace parser. It looks promising!

I am new to Chumsky so it took me a while to figure out how to feed the nested token trees from the lexing stage to the next stage. I think I need to use Stream::from_nested to create a flattened token stream before feeding them to the parser.

I am hoping the exsample could demonstrate that too. So I have updated the example here. Am I on the right track? or is there any better way?

VinnyVicious · 2022-07-05T14:16:29Z

Could that approach be merged back to master?

zesterer · 2022-07-05T14:55:24Z

@tatsuya6502

Hi @zesterer,

Thank you for adding the semantic_whitespace parser. It looks promising!

I am new to Chumsky so it took me a while to figure out how to feed the nested token trees from the lexing stage to the next stage. I think I need to use Stream::from_nested to create a flattened token stream before feeding them to the parser.

I am hoping the exsample could demonstrate that too. So I have updated the example here. Am I on the right track? or is there any better way?

Hey, sorry it took so long to respond to this!

Yes, this approach works fine for now. That said, I'm a little hesitant to recommend it as a long-term solution because it's quite verbose. I'm still hopeful that there's a way to add support for parsing token trees into the zero-copy branch, but it seems... mechanically complicated to say the least.

Could you open a PR with this change to the example? I think it's merge-worthy still, provided it's made clear that the Stream::from_nested is conceptually distinct from the lexing stage.

tatsuya6502 · 2022-07-06T17:01:36Z

Hi @zesterer,

No worries. I opened the PR #166. I simply rebased my branch to the latest master. Please let me know if you want me to change some codes.

That said, I'm a little hesitant to recommend it as a long-term solution because it's quite verbose. I'm still hopeful that there's a way to add support for parsing token trees into the zero-copy branch, but it seems... mechanically complicated to say the least.

I hope you will find a way. But I am already quite happy with current Chumsky too. Thank you for creating such a great library!

ehllie · 2022-09-30T12:40:39Z

I'm not certain if the example is meant to cover this, but the pythonic parser does not enforce that the individual tokens be separated by whitespace, meaning ambiguous syntax like

10foo:
    20bar

Gets parsed to: [(Int(10), 0..2), (Ident("foo"), 2..5), (Op(":"), 5..6), (Open(Block), 9..14), (Int(20), 9..11), (Ident("bar"), 11..14), (Close(Block), 9..14)]
I could not figure out how to enforce that while using the semantic_whitespace parser, but that could be down to me not understanding the library well enough yet.

zesterer · 2022-09-30T12:56:23Z

The pythonic example is just a very simple example, not a full parser for Python code. The issue you're seeing is unrelated to semantic_whitespace, it's just a product of it being a simple example rather than a fully-fledged parser.

TTTPOB · 2023-02-09T16:08:45Z

hello, I am not sure whether here is the right place to ask about the question.
it seems that the semantic_indentation parser will report indentation demilited code blocks with span (index_of_opening_indentation, end_of_the_opening_indentation_line). but for parenthesis delimited code block the span is correct.
In pythonic.rs example, here is a parenthesis delimited code block

and here are the related items in token stream

here is a indentation code block

here are the items

I am writing a parser for a dsl which is a superset of python, and when i am building the lexer, I have seen that as early as the nested TokenTree step, the span is not correct.

obviously the span should be 14..44

I am wandering is this mean that semantic_indentation parser has some bug? or is this by-design?
I am new to language parser and chumsky, thank you!

zesterer · 2023-02-09T21:24:10Z

@TTTPOB

It's quite likely that semantic_indentation has either a bug or is not quite able to express what you want: it's a stop-gap for chumsky's (current) lack of true context-sensitive parsing, which is required to parse Python-style indentation properly.

However, there's rapid movement on this front! The zero-copy branch (due for merge soon) is getting a rugged, powerful system for expressive context-sensitive parsing. You can see an example of this being used to parse Python-style indentation here.

If you feel you can't wait for zero-copy and a new release (I suspect it's probably about a month away), my best advice would be to parse indentation in a hand-written lexer and pass it on to chumsky as delimiter token, such as C-style braces.

TTTPOB · 2023-02-10T01:02:52Z

If you feel you can't wait for zero-copy and a new release (I suspect it's probably about a month away), my best advice would be to parse indentation in a hand-written lexer and pass it on to chumsky as delimiter token, such as C-style braces.

Thank you! it looks promising, I will try to write a lexer first, and maybe switch to the new release after.

zesterer · 2023-02-20T22:27:44Z

zero-copy now supports context-sensitive parsing (see #269), enough to properly implement Python style indentation sensitivity (and much more!). There's even an example.

VinnyVicious · 2023-04-18T22:56:47Z

@zesterer this is fantastic. Thank you so much for all the hard work on this.

zesterer closed this as completed Nov 15, 2021

zesterer added the wontfix This will not be worked on label Nov 15, 2021

zesterer reopened this Apr 14, 2022

tatsuya6502 mentioned this issue Jul 6, 2022

Update the pythonic example to demonstrate how to create a flatten token stream from a token tree #166

Closed

zesterer closed this as completed Feb 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Example of Python-like parsing #20

Example of Python-like parsing #20

VinnyVicious commented Nov 3, 2021

zesterer commented Nov 4, 2021 •

edited

VinnyVicious commented Nov 4, 2021

Xandaros commented Nov 9, 2021

mainrs commented Nov 12, 2021

LikeLakers2 commented Nov 30, 2021

zesterer commented Nov 30, 2021

zesterer commented Apr 14, 2022

tatsuya6502 commented May 1, 2022 •

edited

VinnyVicious commented Jul 5, 2022

zesterer commented Jul 5, 2022 •

edited

tatsuya6502 commented Jul 6, 2022

ehllie commented Sep 30, 2022

zesterer commented Sep 30, 2022

TTTPOB commented Feb 9, 2023

zesterer commented Feb 9, 2023

TTTPOB commented Feb 10, 2023

zesterer commented Feb 20, 2023 •

edited

VinnyVicious commented Apr 18, 2023

Example of Python-like parsing #20

Example of Python-like parsing #20

Comments

VinnyVicious commented Nov 3, 2021

zesterer commented Nov 4, 2021 • edited

VinnyVicious commented Nov 4, 2021

Xandaros commented Nov 9, 2021

mainrs commented Nov 12, 2021

LikeLakers2 commented Nov 30, 2021

zesterer commented Nov 30, 2021

zesterer commented Apr 14, 2022

tatsuya6502 commented May 1, 2022 • edited

VinnyVicious commented Jul 5, 2022

zesterer commented Jul 5, 2022 • edited

tatsuya6502 commented Jul 6, 2022

ehllie commented Sep 30, 2022

zesterer commented Sep 30, 2022

TTTPOB commented Feb 9, 2023

zesterer commented Feb 9, 2023

TTTPOB commented Feb 10, 2023

zesterer commented Feb 20, 2023 • edited

VinnyVicious commented Apr 18, 2023

zesterer commented Nov 4, 2021 •

edited

tatsuya6502 commented May 1, 2022 •

edited

zesterer commented Jul 5, 2022 •

edited

zesterer commented Feb 20, 2023 •

edited