Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dialects: Deprecate as many RegexParsers as we can #3390

Open
3 tasks done
WittierDinosaur opened this issue May 22, 2022 · 4 comments
Open
3 tasks done

Dialects: Deprecate as many RegexParsers as we can #3390

WittierDinosaur opened this issue May 22, 2022 · 4 comments
Labels
dialect Issue related to general dialect support. Use dialect-specific label instead where appropriate enhancement New feature or request parser Means adding functionality to the parser or dialects performance

Comments

@WittierDinosaur
Copy link
Contributor

Search before asking

  • I searched the issues and found no similar issues.

Description

RegexParser doesn't support a hash matching - making it far less performant than its StringParser equivalent. In an ideal world everything would be direct from the Lexer, but if that can't be, StringParser is better. I doubt that we can remove every instance of this, but the more we remove, the better.

Use case

No response

Dialect

All

Are you willing to work on and submit a PR to address the issue?

  • Yes I am willing to submit a PR!

Code of Conduct

@WittierDinosaur WittierDinosaur added enhancement New feature or request parser Means adding functionality to the parser or dialects performance dialect Issue related to general dialect support. Use dialect-specific label instead where appropriate labels May 22, 2022
@barrywhart
Copy link
Member

Can you explain what you mean by hash matching? Is this something in the StringParser code itself, or something about how Python string == works?

@WittierDinosaur
Copy link
Contributor Author

i just mean the simple() matching route in the BaseSegment

@alanmcruickshank
Copy link
Member

I might be able to have a go at this while I try and unpick the remnants of #3692. I think they're related.

@alanmcruickshank
Copy link
Member

To make progress on this my suggestion builds on #3819.

The introduction of TypedMatcher doesn't directly solve this problem, but I think it provides a potential path forward. The way I see it, the constraints we have are:

  • .simple() matching is currently tied to string matching.
  • Some segments don't have a single string form, or even one which can be addressed using MultiStringParser.
  • There isn't an obvious way to match very custom strings in the simple route.

Given those constraints I think we need to break one of them. That boils down to making .simple() matching more powerful, to do more than just strings, but in a way that is likewise very performant. In my mind there are two options for doing that:

  1. Making regexes a first class citizen and being able to effectively look ahead using them.
  2. Leveraging the type of future segments as a way of doing effective lookaheads.

I think the latter is much more viable. I don't think I can see a way to performantly do the first one.

More specifically I'm suggesting:

  • All regex operations are pushed up into the lexing step. That produces a set of raw segments of different types which can be later picked up. We then deprecate the RegexParser and replace all use cases with the TypedParser.
  • We extend the .simple() matching route to have two paths:
    • The existing string based route where we use in and index on the raw strings to identify matches ahead in the sequence of segments.
    • A second pathway which uses the type of segments to look ahead. This could leverage roughly the same methodology and look ahead for pre-lexed tokens from the lexer which had already done the heavy lifting.

The second pathway should run at the same order of magnitude speed as the first one, and only needs to run at all if some of our options need to use it.

Very specifically I'm thinking that:

  • .simple() returns a tuple rather than just a string (<pathway>, <marker>) so that we might have results like ("raw", "USING") or ("type", "numeric_literal").
  • The BaseGrammar._look_ahead_match() method, which does most of the heavy lifting for simple matching, will take the main load of logic to achieve this, potentially with some additional helper functions which could also be used in other places.
  • With both of these things done it might be possible to make .simple() a required method on all matchers and change the type hints to reflect that. We could also deprecate some of the code which handles cases where .simple() doesn't exist.

@WittierDinosaur - I think you've put the most thought into this so far. What do you think of this as a path forward?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dialect Issue related to general dialect support. Use dialect-specific label instead where appropriate enhancement New feature or request parser Means adding functionality to the parser or dialects performance
Projects
None yet
Development

No branches or pull requests

3 participants