Rewrite of matching interface #5230

alanmcruickshank · 2023-09-22T09:20:05Z

WARNING: This PR is pretty big, see instructions below on how to review.

In local testing against a large dbt project, this cuts parsing times by about 45% compared to current main (595ad1f) and more than half compared to the most recent release (2.3.2). Notably, on large files, it cut peak parsing time by almost half.

Summary

This is another pass at what I tried with #5143 (although a lot more groundwork has been done now). It closes out the work I had planned to resolve #5124.

At it's core, this PR changes the interface for the .match() method, and in particular changes the structure of the return value MatchResult.

Previously, a match method would be expected to both assess potential code for whether it fitted the specification, but then also mutate the raw segments into new segments to return from the method. This mean that in searching through the tree, we would do a lot of segment manipulation. Many instantiations and many mutations (and re-mutation).
Now, the .match() method only returns the instructions on how to create segments, but does no mutation. That all happens at the end of the matching process when we call .apply() to materialise the result.

This has the upside of making the separation between matching and application much cleaner, and means that most of the logic is manipulation of indices (integers) rather than a lot of tuple manipulation. We can also drop a lot of the consistency checks, where we test to make sure we've not lost any segments, because no segments are being added or removed until the very end of the process.

Notable effects

Upsides:

Even just by measuring the test runtimes, this improves parsing times by ~20%. Perhaps more on larger files. Most of that is because we avoid passing around (and frequently recreating) Segment classes, but instead match in principle and then instantiate once at the end.
This includes ripping out some very old logic in the depths of the matching logic. Notably, this takes a totally different approach to bracket matching which I think is much more robust.
Consistency checking is much less necessary because we only mutate segments once (at the end).
In the process I've added much more extensive testing.
We never re-parse the same segments twice (except during fix validation, and even then we convert back to a flattened sequence of raw segments first) - this makes the matching logic much easier to reason about. NOTE: making sure this still works properly was the reason for a lot of my other recent PRs as this wasn't entirely trivial enable.
- ... actually, this isn't quite true. Some of the bracket matching does cover the same ground more than once, but it's always the same (or better) than the old way. I think there's a way of doing bracket counting during lexing to enable faster lookups, but that's a job for another day.

Side effects:

The most obvious side effect is that when looking at anything currently matched with an Anything() grammar or a GreedyUntil() grammar, we used to previous get nested brackets "for free". Now if you look at those sections of parsed files (usually script sections, or dialects like Materialize where lots is left very open), you'll see we don't get nested sections any more.
- UPDATE: Not true anymore - I've added a special path specifically for the Anything grammar which means that in that case, we do nest any brackets found, but for the normal greedy matching we don't. That means the changes to the parsed tree structures are really minimal. 🏆 . I think this means I'm not happy merging this as a bugfix release, even given the scale of the internal re-write.
There are a few other small parsing changes which I don't think are consequential, mostly whether some literals get wrapped in an expression grammar or not. I think given the scale of these changes this is acceptable collateral - and on reading the dialects, I think the new behaviour is what was intended rather than the old.

How to review.

Here's my suggestion on how to review this:

Start the review by looking at the grammars, this will familiarise you with the interface that I'm proposing.
Then review the changed parse yaml fixtures. This will show the side effects of this change. You'll see most of them are the un-nesting of bracketed sections in script blocks (see side effects above).
Then take a run through the changes to the segments module. These changes should hopefully all make sense by this point, in particular the changes to how fix validation is done. You'll also see we've been able to remove quite a lot of code here.
Last I would suggest running through the match_algorithms.py changes. These are probably the deepest. All of the methods here had to have at least some rewriting. I've tried to give them more sensible names, and where possible, stick to a similar structure so that the git diff allows you to follow along. The ones that have changed the most are the ones around bracket matching, which are total re-writes. There's also a scattering of new methods here for wrangling indices. Hopefully my comments are sufficient to understand mostly what's going on, and the test coverage is fairly good here.

alanmcruickshank · 2023-10-01T23:34:59Z

I think I might be able to make the matching of brackets for Anything() more consistent with the old behaviour without sacrificing much performance. Will see if I can add that on to this - but also could be done as a second PR. Carry on reviewing if you're able.

Done 👍

Conflicts: src/sqlfluff/core/parser/grammar/greedy.py test/core/parser/grammar/grammar_other_test.py

alanmcruickshank added 28 commits September 18, 2023 23:39

MatchResult2

2b5c82e

Switch over to match2 (WIP)

47e9212

Add match2 into meta segments

cde9251

incorporate progress bar

2aa58fb

Add back in some more tests from the stash

298e407

port terminator filtering logic to greedy_match2

48b6a5d

WIP: Sequence STRICT

9ed679d

finish off sequence tests and remove is_clean

c93d7ec

fix greedy tests

2be3804

Reuse the skipping methods

ace5421

AnyOf parse modes match2

d09f326

fix bug (?) in terminator trimming

76dcdb9

Consolidate deeper into match2

4df30ad

parsing fixes

b777b40

Fix in bracketed matching

11a4cd2

more sequence differences

4639849

ansi tweek and better error handling in generate yamls

0b94ad4

missing section for no gaps

67672ef

accessor grammar unification

4e01b34

nit

b676e39

Unify lexer names and types for brackets

d4ce4c4

Merge branch 'ac/bracket_lexing' into ac/match_segs_3

4318146

Log the time spent in the parse fixture script

219dac7

Reword the statement

6396868

Merge branch 'ac/yml_timing' into ac/match_segs_3

61ce594

Strip out more of the old

afa79bc

resolve validation bug on new match

a0d8c44

Merge remote-tracking branch 'origin/main' into ac/match_segs_3

45f1d01

alanmcruickshank mentioned this pull request Sep 25, 2023

Introduce "word" segment #5234

Merged

Pull all the segments in together

27daf03

alanmcruickshank added 5 commits September 30, 2023 16:56

linting

d3c001f

bracket resolution coverage

fc0cde5

Remove unused sections of _position_segments

14e5e39

coverage

5e52cb7

docstrings and tidyup

569f63c

alanmcruickshank changed the title ~~[DRAFT] Parsing with less mutation~~ Rewrite of matching interface Sep 30, 2023

alanmcruickshank added 3 commits September 30, 2023 18:26

coverage and mypy

f690f81

update docs

e4b2af9

doclinting

da691ef

alanmcruickshank marked this pull request as ready for review September 30, 2023 17:47

alanmcruickshank added the performance label Sep 30, 2023

alanmcruickshank mentioned this pull request Oct 1, 2023

Refactor the Bracketed grammar .match() #5190

Closed

Merge branch 'main' into ac/match_segs_3

027ea93

alanmcruickshank added 2 commits October 2, 2023 11:20

Remove GreedyUntil

3b0ecb4

Merge branch 'ac/greedy'

29b89a8

Conflicts: src/sqlfluff/core/parser/grammar/greedy.py test/core/parser/grammar/grammar_other_test.py

alanmcruickshank mentioned this pull request Oct 2, 2023

Finish the removal of GreedyUntil #5263

Merged

alanmcruickshank added 3 commits October 2, 2023 17:34

Structure tests for Anything.

d14f405

Merge remote-tracking branch 'origin/main' into ac/ms3_greedy_temp

4b685ec

Port bracket matching logic back for Anything.

5edfd17

alanmcruickshank requested review from greg-finley and barrywhart October 3, 2023 20:18

alanmcruickshank and others added 5 commits October 5, 2023 12:36

Merge remote-tracking branch 'origin/main' into ac/match_segs_3

a672cda

Merge remote-tracking branch 'origin/main' into ac/match_segs_3

6d0c819

linting

578aeb1

fix tsql issue

a06e1f7

Merge branch 'main' into ac/match_segs_3

812fd92

alanmcruickshank merged commit 106cac1 into main Oct 8, 2023
29 checks passed

alanmcruickshank deleted the ac/match_segs_3 branch October 8, 2023 21:29

alanmcruickshank mentioned this pull request Oct 23, 2023

Parser: Bracketed should just be a sequence with brackets tacked on #3388

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite of matching interface #5230

Rewrite of matching interface #5230

alanmcruickshank commented Sep 22, 2023 •

edited

alanmcruickshank commented Oct 1, 2023 •

edited

Rewrite of matching interface #5230

Rewrite of matching interface #5230

Conversation

alanmcruickshank commented Sep 22, 2023 • edited

Summary

Notable effects

How to review.

alanmcruickshank commented Oct 1, 2023 • edited

alanmcruickshank commented Sep 22, 2023 •

edited

alanmcruickshank commented Oct 1, 2023 •

edited