Core: Add `MultiStringParser` to match a collection of strings #3510

judahrand · 2022-06-29T10:45:02Z

Brief summary of the change made

This parser takes a collection of strings and returns a match if any of
them are found. This is more performant than implementing this through
the RegexParser as a hash matching route can be taken.
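
To illustrate why a hash lookup can be cheaper than a regex here, a minimal sketch in plain Python follows. It is not SQLFluff's implementation: `KEYWORDS`, `regex_match` and `set_match` are hypothetical names, and case folding via `.upper()` is an assumption about how case-insensitive matching would be done.

```python
import re

# Hypothetical collection of strings to match (not taken from any dialect).
KEYWORDS = ["DATE", "TIME", "TIMESTAMP"]

# Regex route: a case-insensitive alternation over all strings,
# anchored at match time by using fullmatch().
_PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

# Hash route: normalise once up front, then each lookup is a set membership test.
_KEYWORD_SET = frozenset(k.upper() for k in KEYWORDS)


def regex_match(raw: str) -> bool:
    """Return True if `raw` is exactly one of the keywords (regex route)."""
    return _PATTERN.fullmatch(raw) is not None


def set_match(raw: str) -> bool:
    """Return True if `raw` is exactly one of the keywords (hash route)."""
    return raw.upper() in _KEYWORD_SET
```

Both functions accept the same inputs; the set version avoids walking the alternation and instead does one hash lookup per candidate string.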

This looks to at least partially address #3390 and gives a modest speed-up of about 2% in parsing (2.02 to 1.98 in the timings below) when running `tox -e bench`.

Before:

==== overall timings ====
Clock time:    10.77
=== templating ===
cnt:               1 sum:            0.01
min:            0.01 max:            0.01
avg:            0.01
=== lexing ===
cnt:               1 sum:            0.02
min:            0.02 max:            0.02
avg:            0.02
=== parsing ===
cnt:               1 sum:            2.02
min:            2.02 max:            2.02
avg:            2.02
=== linting ===
cnt:               1 sum:            8.68
min:            8.68 max:            8.68
avg:            8.68
===END PROCESS OUTPUT===
Fix command failed with return code: 1
===== Done =====
Run    #0: {'004_L003_indentation_3': 0.5691906670108438, 'B_001_package': 0.7864880409906618}
Run    #1: {'004_L003_indentation_3': 0.40543387504294515, 'B_001_package': 0.675880833005067}
Run    #2: {'004_L003_indentation_3': 0.3828648329945281, 'B_001_package': 0.6732317919959314}

After:

==== overall timings ====
Clock time:    10.70
=== templating ===
cnt:               1 sum:            0.01
min:            0.01 max:            0.01
avg:            0.01
=== lexing ===
cnt:               1 sum:            0.03
min:            0.03 max:            0.03
avg:            0.03
=== parsing ===
cnt:               1 sum:            1.98
min:            1.98 max:            1.98
avg:            1.98
=== linting ===
cnt:               1 sum:            8.66
min:            8.66 max:            8.66
avg:            8.66
===END PROCESS OUTPUT===
Fix command failed with return code: 1
===== Done =====
Run    #0: {'004_L003_indentation_3': 1.0236480419989675, 'B_001_package': 0.7732945830211975}
Run    #1: {'004_L003_indentation_3': 0.40719666599761695, 'B_001_package': 0.6686641669948585}
Run    #2: {'004_L003_indentation_3': 0.39612441597273573, 'B_001_package': 0.6604930419707671}

Are there any other side effects of this change that we should be aware of?

Pull Request checklist

Please confirm you have completed any of the necessary steps below.
Included test cases to demonstrate any code changes, which may be one or more of the following:
- .yml rule test cases in test/fixtures/rules/std_rule_cases.
- .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
- Full autofix test cases in test/fixtures/linter/autofix.
- Other.
Added appropriate documentation for the change.
Created GitHub issues for any relevant followup/future enhancements if appropriate.

codecov · 2022-06-29T10:58:15Z

Codecov Report

Merging #3510 (e9c66ea) into main (ba60508) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #3510   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          171       171           
  Lines        12997     13027   +30     
=========================================
+ Hits         12997     13027   +30

| Impacted Files | Coverage Δ |
|---|---|
| src/sqlfluff/dialects/dialect_ansi.py | `100.00% <ø> (ø)` |
| src/sqlfluff/dialects/dialect_bigquery.py | `100.00% <ø> (ø)` |
| src/sqlfluff/dialects/dialect_exasol.py | `100.00% <ø> (ø)` |
| src/sqlfluff/dialects/dialect_sparksql.py | `100.00% <ø> (ø)` |
| src/sqlfluff/core/dialects/base.py | `100.00% <100.00%> (ø)` |
| src/sqlfluff/core/parser/__init__.py | `100.00% <100.00%> (ø)` |
| src/sqlfluff/core/parser/grammar/base.py | `100.00% <100.00%> (ø)` |
| src/sqlfluff/core/parser/parsers.py | `100.00% <100.00%> (ø)` |
| src/sqlfluff/dialects/dialect_snowflake.py | `100.00% <100.00%> (ø)` |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

tunetheweb

This is great! 🚀🚀🚀

Not only is it (slightly) faster, it's also more readable.

Could we add some tests to test/core/parse/grammar_test.py? I realise code coverage has returned 100% but still think it would be good to have some explicit tests (positive and negative).

Also have some other feedback below.

tunetheweb · 2022-06-29T11:09:04Z

src/sqlfluff/core/parser/parsers.py

+        self.templates = templates
+        self.raw_class = raw_class
+        self.name = name
+        self.type = type
+        self.optional = optional
+        self.segment_kwargs = segment_kwargs or {}


Should we uppercase during `__init__` to prevent having to do it for each match?

Also, should this just call `super()` like `RegexParser` does for the other values, in case the base `StringParser` ever changes in future?

judahrand · 2022-06-29

I avoided doing this as I don't want to set `self.template`. mypy will complain if `self.template` is set with a `Collection[str]` rather than `str`, hence using `self.templates` instead.

However, upper-casing in `__init__` feels reasonable if you think it is better. I only didn't for consistency with how `StringParser` returns `simple`.

tunetheweb · 2022-06-29

No, I agree you need `templates` rather than `template`, as it is a different type, as you say. I'm just saying you might want to call `super()` anyway for the other ones (including setting `template` to an empty string?). Yes, it's the same as what you have, as all `super().__init__` does is set them like you are setting them here, but it still feels like the better thing to do in case that ever changes, and it is what `RegexParser` is doing. But I'm not a Python coder by trade, so maybe it's fine the way it is?

I wonder, should `StringParser` also be changed to upper-case at `__init__` time?

judahrand · 2022-06-29

> including setting template to an empty string?

This is what feels odd to me. I'm not sure we should set an attribute which isn't relevant to the class at all.

> I wonder should StringParser also be changed to upper at init time?

Quite possibly, though I suspect the overhead is very small!

tunetheweb · 2022-06-29

> This is what feels odd to me. I'm not sure we should set an attribute which isn't relevant to the class at all.

Yeah, I get your hesitancy. Let's leave this comment open and let the others weigh in.

judahrand · 2022-06-29

I've implemented a compromise where I do call `super()` but then call `del self.template`. This way we get the advantages of calling `super()` while not adding an attribute that serves no purpose other than to confuse.
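
The compromise described above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `ToyStringParser` and `ToyMultiStringParser` are hypothetical stand-ins, and the real constructors take more arguments (`raw_class`, `type`, `segment_kwargs`) than shown here.

```python
from typing import Collection


class ToyStringParser:
    """Simplified stand-in for a parser that matches one string."""

    def __init__(self, template: str, name: str, optional: bool = False) -> None:
        self.template = template
        self.name = name
        self.optional = optional


class ToyMultiStringParser(ToyStringParser):
    """Matches any string in a collection, reusing the parent's setup."""

    def __init__(
        self, templates: Collection[str], name: str, optional: bool = False
    ) -> None:
        # Let the parent set the shared attributes, so later changes there
        # are picked up automatically. The empty string is only a placeholder.
        super().__init__(template="", name=name, optional=optional)
        # Drop the single-string attribute, which has no meaning here.
        del self.template
        # Upper-case once at construction rather than on every match.
        self.templates = {t.upper() for t in templates}


parser = ToyMultiStringParser(["foo", "bar"], name="example")
```

After construction, `parser` has `templates`, `name` and `optional`, but no `template` attribute.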

judahrand · 2022-06-29T11:42:16Z

> Could we add some tests to test/core/parse/grammar_test.py? I realise code coverage has returned 100% but still think it would be good to have some explicit tests (positive and negative).

@tunetheweb there doesn't seem to be a precedent for how to add tests for a new Parser. For example, the RegexParser appears just twice in grammar_test.py and then only to test other elements of the matching. NamedParser doesn't appear in the tests at all. Did you have an idea of how you'd like the new Parser included?

tunetheweb · 2022-06-29T11:51:36Z

Oh, that's a good point. I just presumed they were in there when I did a quick search but didn't look at the tests themselves!

That's a bit poor of us, to be honest. Though, as I say, we still have 100% code coverage, as the dialect tests do test it.

Still think it would be good to add. Could we add a simple test case to `test__parser__grammar_oneof`?

judahrand · 2022-06-29T12:01:05Z

@tunetheweb I hope that was what you were after?

tunetheweb · 2022-06-29T12:05:24Z

Ideally we'd leave the current test alone and add a couple of new ones, something like:

Ref.keyword("foo") matches MultiStringParser(["foo","bar"], KeywordSegment)
Ref.keyword("fo") does not match MultiStringParser(["foo","bar"], KeywordSegment)
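
The positive and negative cases suggested above could look something like the sketch below. It deliberately avoids SQLFluff's real classes: `ToyMultiStringParser` and its `matches` method are hypothetical stand-ins, used only to show the shape of the two tests.

```python
from typing import Collection


class ToyMultiStringParser:
    """Hypothetical stand-in: accepts a raw string if it is in the collection."""

    def __init__(self, templates: Collection[str]) -> None:
        self._templates = frozenset(t.upper() for t in templates)

    def matches(self, raw: str) -> bool:
        return raw.upper() in self._templates


def test_matches_listed_string() -> None:
    # Positive case: "foo" is one of the templates.
    assert ToyMultiStringParser(["foo", "bar"]).matches("foo")


def test_rejects_prefix_of_listed_string() -> None:
    # Negative case: "fo" is only a prefix of a template, so it must not match.
    assert not ToyMultiStringParser(["foo", "bar"]).matches("fo")
```

The negative case is the one a loosely anchored regex could get wrong, which is why it is worth testing explicitly.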

judahrand · 2022-06-29T12:10:46Z

@tunetheweb I'm still not convinced grammar_test.py is the right place to be adding a test for the Parser at all... After all it inherits directly from Matchable not BaseGrammar and so isn't a 'grammar' at all.

There should probably be a parser_test.py?

tunetheweb · 2022-06-29T12:13:16Z

There is a parse_test.py. Maybe belongs in there instead?

judahrand · 2022-06-29T12:16:33Z

> There is a parse_test.py. Maybe belongs in there instead?

That seems to be testing BaseSegment which doesn't inherit from Matchable.

judahrand · 2022-06-29T12:29:28Z

@tunetheweb That should be better?

judahrand · 2022-06-29T12:40:43Z

@tunetheweb This should be good to go now, I think.

tunetheweb

Changes LGTM!

I'll leave it open to see if any of the others spot anything.

barrywhart

Nice work! I have a couple questions which could potentially improve the speed a bit more.

barrywhart · 2022-06-29T23:05:38Z

src/sqlfluff/core/parser/parsers.py

@@ -100,6 +100,47 @@ def match(
        return MatchResult.from_unmatched(segments)


+class MultiStringParser(StringParser):


Is there some benefit to inheriting from `StringParser`? I notice that:

- In the constructor, we delete one of the parent class's fields (`template`).
- None of the methods use the superclass.

Basically, I'm wondering whether this class needs to inherit from `StringParser` at all.

judahrand · 2022-06-30

There are other methods which are used from `StringParser`, for example `match`. Though you are right that the inheritance is odd. I reckon there should perhaps be a `BaseParser` which inherits from `Matchable` and implements the reusable bits of `StringParser`, given that it looks to me like all other parsers inherit from `StringParser`?

judahrand · 2022-06-30

Had a go at this and keen for feedback.
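
A rough sketch of the `BaseParser` idea discussed above is shown here. All class names are hypothetical toy versions, and the `_is_match` hook is an assumption made for this illustration; the abstract class actually added in the PR may be structured differently.

```python
from abc import ABC, abstractmethod
from typing import Collection


class ToyMatchable(ABC):
    """Stand-in for the matching interface that parsers implement."""

    @abstractmethod
    def match(self, raw: str) -> bool:
        """Return True if the raw string is accepted."""


class ToyBaseParser(ToyMatchable):
    """Holds the pieces every parser shares; subclasses decide what matches."""

    def __init__(self, name: str, optional: bool = False) -> None:
        self.name = name
        self.optional = optional

    @abstractmethod
    def _is_match(self, raw_upper: str) -> bool:
        """Return True if the upper-cased raw string is accepted."""

    def match(self, raw: str) -> bool:
        # Shared matching logic lives here once, for all subclasses.
        return self._is_match(raw.upper())


class ToyStringParser(ToyBaseParser):
    def __init__(self, template: str, name: str, optional: bool = False) -> None:
        super().__init__(name, optional)
        self.template = template.upper()

    def _is_match(self, raw_upper: str) -> bool:
        return raw_upper == self.template


class ToyMultiStringParser(ToyBaseParser):
    def __init__(
        self, templates: Collection[str], name: str, optional: bool = False
    ) -> None:
        super().__init__(name, optional)
        self.templates = frozenset(t.upper() for t in templates)

    def _is_match(self, raw_upper: str) -> bool:
        return raw_upper in self.templates
```

With this layout neither concrete parser inherits from the other, so there is no parent attribute to delete, and the shared `match` logic is still written only once.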

barrywhart · 2022-06-29T23:06:16Z

src/sqlfluff/core/parser/parsers.py

+        Because string matchers are not case sensitive we can
+        just return the templates here.
+        """
+        return list(self.templates)


Is this function called frequently? I notice that we're creating and returning a new list each time, which could be slow.

judahrand · 2022-06-30

The trade-off here was whether to cache the set which is used for matching or to cache the list which is expected to be returned by `simple`. We could cache both as instance attributes?

judahrand · 2022-06-30

I've implemented a `self._simple` which is set in `__init__`, which should avoid creating the list multiple times (this goes for `StringParser` too).
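
The caching approach described above might look roughly like this. It is a sketch under assumptions: `ToyMultiStringParser` is a hypothetical stand-in, and sorting the cached list is a choice made here for deterministic output, not something stated in the PR.

```python
from typing import Collection, FrozenSet, List


class ToyMultiStringParser:
    """Hypothetical stand-in that caches both lookup structures up front."""

    def __init__(self, templates: Collection[str]) -> None:
        # Set used for fast membership checks during matching.
        self._templates: FrozenSet[str] = frozenset(t.upper() for t in templates)
        # List built once, so simple() does not allocate on every call.
        self._simple: List[str] = sorted(self._templates)

    def simple(self) -> List[str]:
        # Returns the cached list itself; callers are expected not to mutate it.
        return self._simple

    def matches(self, raw: str) -> bool:
        return raw.upper() in self._templates


parser = ToyMultiStringParser(["foo", "bar"])
```

Because `simple()` hands back the same list object each time, repeated calls cost nothing beyond the attribute lookup, at the price of trusting callers to treat the list as read-only.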

barrywhart · 2022-06-29T23:09:06Z

@WittierDinosaur: Would you like to review this since it relates to one of the performance issues you created?

alanmcruickshank

Neat - I like this. The regex construction for the set matching was a bit hacky and this provides a really neat solution to that.

Good work 👍 .

WittierDinosaur · 2022-06-30T22:01:58Z

@barrywhart I would, but unfortunately I'm on holiday now and for a couple of weeks. I'll evaluate it post hoc when I'm back, but conceptually I like the idea. I'll leave it to you guys on the implementation!

tunetheweb · 2022-06-30T22:11:24Z

OK in that case I vote to merge rather than waiting for you.

@barrywhart could you review the changes for your feedback?

Add MultiStringParser to match a set of strings

13a3832

This parser takes a collection of strings and returns a match if any of them are found. This is more performant than implementing this through the `RegexParser` as a hash matching route can be taken.

judahrand changed the title from "Core: Add MultiStringParser to match a set of strings" to "Core: Add MultiStringParser to match a collection of strings" Jun 29, 2022

tunetheweb requested changes Jun 29, 2022

tunetheweb requested review from barrywhart, alanmcruickshank and WittierDinosaur June 29, 2022 11:14

judahrand added 2 commits June 29, 2022 12:31

Perform .upper() in __init__

f129f07

Check is_code first for performance

733c3f2

Use super to initialize object

db03a79

judahrand force-pushed the mult-string-parser branch from 7f811a9 to 79d2277 Compare June 29, 2022 12:29

judahrand force-pushed the mult-string-parser branch from 79d2277 to 94e3aa8 Compare June 29, 2022 12:34

Add MultiStringParser simple test

88404f5

judahrand force-pushed the mult-string-parser branch from 94e3aa8 to 88404f5 Compare June 29, 2022 12:36

tunetheweb approved these changes Jun 29, 2022

barrywhart reviewed Jun 29, 2022

judahrand requested a review from barrywhart June 30, 2022 07:42

judahrand force-pushed the mult-string-parser branch from aa529e9 to 9da1b18 Compare June 30, 2022 07:55

judahrand added 2 commits June 30, 2022 09:18

Implement BaseParser abstract class

0bb6468

Avoid calling .upper() when matching

96f63a4

Avoid list creation at every simple call

ed04460

judahrand force-pushed the mult-string-parser branch from 9da1b18 to ed04460 Compare June 30, 2022 08:34

alanmcruickshank approved these changes Jun 30, 2022

judahrand force-pushed the mult-string-parser branch 2 times, most recently from 60b728a to b792eec Compare June 30, 2022 18:35

Remove more RegexParsers

414c713

judahrand force-pushed the mult-string-parser branch from b792eec to 414c713 Compare June 30, 2022 18:35

Merge branch 'main' into mult-string-parser

e9c66ea

barrywhart approved these changes Jun 30, 2022

barrywhart merged commit 911774d into sqlfluff:main Jun 30, 2022
