Rewrite of matching interface #5230
Merged
alanmcruickshank changed the title from "[DRAFT] Parsing with less mutation" to "Rewrite of matching interface" on Sep 30, 2023
Done 👍 |
Conflicts: src/sqlfluff/core/parser/grammar/greedy.py test/core/parser/grammar/grammar_other_test.py
3 tasks
WARNING: This PR is pretty big, see instructions below on how to review.
In local testing against a large dbt project, this cuts parsing times by about 45% compared to current `main` (595ad1f) and more than half compared to the most recent release (2.3.2). Notably, on large files, it cut peak parsing time by almost half.

Summary
This is another pass at what I tried with #5143 (although a lot more groundwork has been done now). It closes out the work I had planned to resolve #5124.
At its core, this PR changes the interface for the `.match()` method, and in particular changes the structure of the return value, `MatchResult`.

- Previously, the `.match` method was expected both to assess whether potential code fitted the specification and to mutate the raw segments into new segments to return from the method. This meant that in searching through the tree we would do a lot of segment manipulation: many instantiations and many mutations (and re-mutations).
- Now, the `.match()` method only returns the instructions on how to create segments, but does no mutation. That all happens at the end of the matching process, when we call `.apply()` to materialise the result.

This has the upside of making the separation between matching and application much cleaner, and means that most of the logic is manipulation of indices (integers) rather than a lot of `tuple` manipulation. We can also drop a lot of the consistency checks, where we test to make sure we've not lost any segments, because no segments are being added or removed until the very end of the process.

Notable effects
Upsides:

- […] `Segment` classes, but instead match in principle and then instantiate once at the end.

Side effects:

- The most obvious side effect is that when looking at anything currently matched with an `Anything()` grammar or a `GreedyUntil()` grammar, we previously got nested brackets "for free". Now if you look at those sections of parsed files (usually script sections, or dialects like Materialize where lots is left very open), you'll see we don't get nested sections any more.
- […] `Anything` grammar, which means that in that case we do nest any brackets found, but for the normal greedy matching we don't. That means the changes to the parsed tree structures are really minimal. 🏆 I think this means I'm not happy merging this as a bugfix release, even given the scale of the internal re-write.
- […] `expression` grammar or not. I think given the scale of these changes this is acceptable collateral, and on reading the dialects, I think the new behaviour is what was intended rather than the old.

How to review
Here's my suggestion on how to review this:
- […] `segments` module. These changes should hopefully all make sense by this point, in particular the changes to how fix validation is done. You'll also see we've been able to remove quite a lot of code here.
- […] `match_algorithms.py` changes. These are probably the deepest. All of the methods here had to have at least some rewriting. I've tried to give them more sensible names and, where possible, stick to a similar structure so that the git diff allows you to follow along. The ones that have changed the most are the ones around bracket matching, which are total re-writes. There's also a scattering of new methods here for wrangling indices. Hopefully my comments are sufficient to understand mostly what's going on, and the test coverage is fairly good here.
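
For anyone who wants a mental model of the match-then-apply split from the summary before reading the diff, here is a minimal, self-contained sketch. It is not the code in this PR: `SketchMatch`, `Node`, `match_word` and `match_sequence` are hypothetical names invented for illustration, and the real `MatchResult` carries more information. The sketch only shows the idea that matching records indices and instructions, and that nothing is constructed until `.apply()` is called once at the end.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a parsed segment: a label plus children."""

    label: str
    children: tuple


@dataclass(frozen=True)
class SketchMatch:
    """Hypothetical match record: a span of indices plus instructions.

    No Node is created while matching; that only happens in apply().
    """

    span: slice
    label: Optional[str] = None  # None: return children without a wrapper
    child_matches: Tuple["SketchMatch", ...] = ()

    def __len__(self) -> int:
        return self.span.stop - self.span.start

    def apply(self, tokens: tuple) -> tuple:
        """Materialise nodes from the flat token sequence."""
        out = []
        idx = self.span.start
        for child in sorted(self.child_matches, key=lambda m: m.span.start):
            out.extend(tokens[idx:child.span.start])  # tokens before child
            out.extend(child.apply(tokens))
            idx = child.span.stop
        out.extend(tokens[idx:self.span.stop])  # trailing tokens
        if self.label is None:
            return tuple(out)
        return (Node(self.label, tuple(out)),)


def match_word(tokens: tuple, idx: int, word: str, label: str) -> SketchMatch:
    """Match one token at idx; a zero-length match signals failure."""
    if idx < len(tokens) and tokens[idx].upper() == word:
        return SketchMatch(slice(idx, idx + 1), label)
    return SketchMatch(slice(idx, idx))


def match_sequence(tokens: tuple, idx: int, parts: list, label: str) -> SketchMatch:
    """Match each (word, label) pair in order, tracking only integers."""
    start = idx
    children = []
    for word, child_label in parts:
        child = match_word(tokens, idx, word, child_label)
        if len(child) == 0:
            return SketchMatch(slice(start, start))  # nothing to undo
        children.append(child)
        idx = child.span.stop
    return SketchMatch(slice(start, idx), label, tuple(children))


tokens = ("select", "1")
parts = [("SELECT", "keyword"), ("1", "literal")]
result = match_sequence(tokens, 0, parts, "select_statement")
tree = result.apply(tokens)  # the only point where Node objects are built
```

Note that a failed match in this sketch has nothing to roll back, because no tokens were wrapped or mutated along the way; that is the same reason the summary gives for being able to drop many of the consistency checks.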