Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite of matching interface #5230

Merged
merged 99 commits into from Oct 8, 2023
Merged

Rewrite of matching interface #5230

merged 99 commits into from Oct 8, 2023

Conversation

alanmcruickshank
Copy link
Member

@alanmcruickshank alanmcruickshank commented Sep 22, 2023

WARNING: This PR is pretty big, see instructions below on how to review.

In local testing against a large dbt project, this cuts parsing times by about 45% compared to current main (595ad1f) and more than half compared to the most recent release (2.3.2). Notably, on large files, it cut peak parsing time by almost half.

Summary

This is another pass at what I tried with #5143 (although a lot more groundwork has been done now). It closes out the work I had planned to resolve #5124.

At it's core, this PR changes the interface for the .match() method, and in particular changes the structure of the return value MatchResult.

  • Previously, a match method would be expected to both assess potential code for whether it fitted the specification, but then also mutate the raw segments into new segments to return from the method. This mean that in searching through the tree, we would do a lot of segment manipulation. Many instantiations and many mutations (and re-mutation).
  • Now, the .match() method only returns the instructions on how to create segments, but does no mutation. That all happens at the end of the matching process when we call .apply() to materialise the result.

This has the upside of making the separation between matching and application much cleaner, and means that most of the logic is manipulation of indices (integers) rather than a lot of tuple manipulation. We can also drop a lot of the consistency checks, where we test to make sure we've not lost any segments, because no segments are being added or removed until the very end of the process.

Notable effects

Upsides:

  • Even just by measuring the test runtimes, this improves parsing times by ~20%. Perhaps more on larger files. Most of that is because we avoid passing around (and frequently recreating) Segment classes, but instead match in principle and then instantiate once at the end.
  • This includes ripping out some very old logic in the depths of the matching logic. Notably, this takes a totally different approach to bracket matching which I think is much more robust.
  • Consistency checking is much less necessary because we only mutate segments once (at the end).
  • In the process I've added much more extensive testing.
  • We never re-parse the same segments twice (except during fix validation, and even then we convert back to a flattened sequence of raw segments first) - this makes the matching logic much easier to reason about. NOTE: making sure this still works properly was the reason for a lot of my other recent PRs as this wasn't entirely trivial enable.
    • ... actually, this isn't quite true. Some of the bracket matching does cover the same ground more than once, but it's always the same (or better) than the old way. I think there's a way of doing bracket counting during lexing to enable faster lookups, but that's a job for another day.

Side effects:

  • The most obvious side effect is that when looking at anything currently matched with an Anything() grammar or a GreedyUntil() grammar, we used to previous get nested brackets "for free". Now if you look at those sections of parsed files (usually script sections, or dialects like Materialize where lots is left very open), you'll see we don't get nested sections any more.
    • UPDATE: Not true anymore - I've added a special path specifically for the Anything grammar which means that in that case, we do nest any brackets found, but for the normal greedy matching we don't. That means the changes to the parsed tree structures are really minimal. 🏆 . I think this means I'm not happy merging this as a bugfix release, even given the scale of the internal re-write.
  • There are a few other small parsing changes which I don't think are consequential, mostly whether some literals get wrapped in an expression grammar or not. I think given the scale of these changes this is acceptable collateral - and on reading the dialects, I think the new behaviour is what was intended rather than the old.

How to review.

Here's my suggestion on how to review this:

  1. Start the review by looking at the grammars, this will familiarise you with the interface that I'm proposing.
  2. Then review the changed parse yaml fixtures. This will show the side effects of this change. You'll see most of them are the un-nesting of bracketed sections in script blocks (see side effects above).
  3. Then take a run through the changes to the segments module. These changes should hopefully all make sense by this point, in particular the changes to how fix validation is done. You'll also see we've been able to remove quite a lot of code here.
  4. Last I would suggest running through the match_algorithms.py changes. These are probably the deepest. All of the methods here had to have at least some rewriting. I've tried to give them more sensible names, and where possible, stick to a similar structure so that the git diff allows you to follow along. The ones that have changed the most are the ones around bracket matching, which are total re-writes. There's also a scattering of new methods here for wrangling indices. Hopefully my comments are sufficient to understand mostly what's going on, and the test coverage is fairly good here.

@alanmcruickshank alanmcruickshank changed the title [DRAFT] Parsing with less mutation Rewrite of matching interface Sep 30, 2023
@alanmcruickshank
Copy link
Member Author

alanmcruickshank commented Oct 1, 2023

I think I might be able to make the matching of brackets for Anything() more consistent with the old behaviour without sacrificing much performance. Will see if I can add that on to this - but also could be done as a second PR. Carry on reviewing if you're able.

Done 👍

Conflicts:
	src/sqlfluff/core/parser/grammar/greedy.py
	test/core/parser/grammar/grammar_other_test.py
@alanmcruickshank alanmcruickshank merged commit 106cac1 into main Oct 8, 2023
29 checks passed
@alanmcruickshank alanmcruickshank deleted the ac/match_segs_3 branch October 8, 2023 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pathway to "parse-less" parsing.
2 participants