Dialects: Deprecate as many RegexParsers as we can #3390

WittierDinosaur · 2022-05-22T16:12:59Z

Search before asking

I searched the issues and found no similar issues.

Description

RegexParser doesn't support a hash matching - making it far less performant than its StringParser equivalent. In an ideal world everything would be direct from the Lexer, but if that can't be, StringParser is better. I doubt that we can remove every instance of this, but the more we remove, the better.

Use case

No response

Dialect

All

Are you willing to work on and submit a PR to address the issue?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

barrywhart · 2022-05-23T20:59:37Z

Can you explain what you mean by hash matching? Is this something in the StringParser code itself, or something about how Python string == works?

WittierDinosaur · 2022-05-24T15:08:24Z

i just mean the simple() matching route in the BaseSegment

alanmcruickshank · 2022-08-11T21:08:40Z

I might be able to have a go at this while I try and unpick the remnants of #3692. I think they're related.

alanmcruickshank · 2022-08-31T15:56:19Z

To make progress on this my suggestion builds on #3819.

The introduction of TypedMatcher doesn't directly solve this problem, but I think it provides a potential path forward. The way I see it, the constraints we have are:

.simple() matching is currently tied to string matching.
Some segments don't have a single string form, or even one which can be addressed using MultiStringParser.
There isn't an obvious way to match very custom strings in the simple route.

Given those constraints I think we need to break one of them. That boils down to making .simple() matching more powerful, to do more than just strings, but in a way that is likewise very performant. In my mind there are two options for doing that:

Making regexes a first class citizen and being able to effectively look ahead using them.
Leveraging the type of future segments as a way of doing effective lookaheads.

I think the latter is much more viable. I don't think I can see a way to performantly do the first one.

More specifically I'm suggesting:

All regex operations are pushed up into the lexing step. That produces a set of raw segments of different types which can be later picked up. We then deprecate the RegexParser and replace all use cases with the TypedParser.
We extend the .simple() matching route to have two paths:
- The existing string based route where we use in and index on the raw strings to identify matches ahead in the sequence of segments.
- A second pathway which uses the type of segments to look ahead. This could leverage roughly the same methodology and look ahead for pre-lexed tokens from the lexer which had already done the heavy lifting.

The second pathway should run at the same order of magnitude speed as the first one, and only needs to run at all if some of our options need to use it.

Very specifically I'm thinking that:

.simple() returns a tuple rather than just a string (<pathway>, <marker>) so that we might have results like ("raw", "USING") or ("type", "numeric_literal").
The BaseGrammar._look_ahead_match() method, which does most of the heavy lifting for simple matching, will take the main load of logic to achieve this, potentially with some additional helper functions which could also be used in other places.
With both of these things done it might be possible to make .simple() a required method on all matchers and change the type hints to reflect that. We could also deprecate some of the code which handles cases where .simple() doesn't exist.

@WittierDinosaur - I think you've put the most thought into this so far. What do you think of this as a path forward?

WittierDinosaur added enhancement New feature or request parser Means adding functionality to the parser or dialects performance dialect Issue related to general dialect support. Use dialect-specific label instead where appropriate labels May 22, 2022

judahrand mentioned this issue Jun 29, 2022

Core: Add MultiStringParser to match a collection of strings #3510

Merged

1 task

alanmcruickshank self-assigned this Aug 11, 2022

alanmcruickshank removed their assignment Mar 20, 2023

alanmcruickshank mentioned this issue Mar 24, 2023

.simple() matching for TypedMatcher #4612

Merged

alanmcruickshank mentioned this issue Apr 2, 2023

Remove three RegexParsers #4658

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dialects: Deprecate as many RegexParsers as we can #3390

Dialects: Deprecate as many RegexParsers as we can #3390

WittierDinosaur commented May 22, 2022

barrywhart commented May 23, 2022

WittierDinosaur commented May 24, 2022

alanmcruickshank commented Aug 11, 2022

alanmcruickshank commented Aug 31, 2022

Dialects: Deprecate as many RegexParsers as we can #3390

Dialects: Deprecate as many RegexParsers as we can #3390

Comments

WittierDinosaur commented May 22, 2022

Search before asking

Description

Use case

Dialect

Are you willing to work on and submit a PR to address the issue?

Code of Conduct

barrywhart commented May 23, 2022

WittierDinosaur commented May 24, 2022

alanmcruickshank commented Aug 11, 2022

alanmcruickshank commented Aug 31, 2022