Core: Add MultiStringParser to match a collection of strings #3510

Merged
merged 10 commits into from Jun 30, 2022

Contributor

@judahrand judahrand commented Jun 29, 2022

Brief summary of the change made

This parser takes a collection of strings and returns a match if any of
them are found. This is more performant than implementing this through
the RegexParser as a hash matching route can be taken.
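
To illustrate the difference between the two routes, here is a minimal sketch. It is not the code from this PR: the keyword list and both helper functions are hypothetical, and it only contrasts a compiled regex alternation with a set lookup on an uppercased token.

```python
import re

# Hypothetical keyword collection; not taken from any sqlfluff dialect.
KEYWORDS = ["SELECT", "FROM", "WHERE"]

# Regex route: build one alternation pattern and match the whole token.
keyword_pattern = re.compile(
    "|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE
)


def matches_regex(raw: str) -> bool:
    """Return True if the whole token equals one of the keywords."""
    return keyword_pattern.fullmatch(raw) is not None


# Hash route: uppercase the collection once, then do one set lookup per token.
keyword_set = frozenset(k.upper() for k in KEYWORDS)


def matches_hash(raw: str) -> bool:
    """Return True if the uppercased token is in the keyword set."""
    return raw.upper() in keyword_set
```

Both helpers give the same answer for a given token; the set lookup simply avoids running the regex engine for every candidate.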

This appears to at least partially address #3390, and it gives a modest 2% speed-up in parsing when running tox -e bench.

Before:

==== overall timings ====
Clock time:    10.77
=== templating ===
cnt:               1 sum:            0.01
min:            0.01 max:            0.01
avg:            0.01
=== lexing ===
cnt:               1 sum:            0.02
min:            0.02 max:            0.02
avg:            0.02
=== parsing ===
cnt:               1 sum:            2.02
min:            2.02 max:            2.02
avg:            2.02
=== linting ===
cnt:               1 sum:            8.68
min:            8.68 max:            8.68
avg:            8.68
===END PROCESS OUTPUT===
Fix command failed with return code: 1
===== Done =====
Run    #0: {'004_L003_indentation_3': 0.5691906670108438, 'B_001_package': 0.7864880409906618}
Run    #1: {'004_L003_indentation_3': 0.40543387504294515, 'B_001_package': 0.675880833005067}
Run    #2: {'004_L003_indentation_3': 0.3828648329945281, 'B_001_package': 0.6732317919959314}

After:

==== overall timings ====
Clock time:    10.70
=== templating ===
cnt:               1 sum:            0.01
min:            0.01 max:            0.01
avg:            0.01
=== lexing ===
cnt:               1 sum:            0.03
min:            0.03 max:            0.03
avg:            0.03
=== parsing ===
cnt:               1 sum:            1.98
min:            1.98 max:            1.98
avg:            1.98
=== linting ===
cnt:               1 sum:            8.66
min:            8.66 max:            8.66
avg:            8.66
===END PROCESS OUTPUT===
Fix command failed with return code: 1
===== Done =====
Run    #0: {'004_L003_indentation_3': 1.0236480419989675, 'B_001_package': 0.7732945830211975}
Run    #1: {'004_L003_indentation_3': 0.40719666599761695, 'B_001_package': 0.6686641669948585}
Run    #2: {'004_L003_indentation_3': 0.39612441597273573, 'B_001_package': 0.6604930419707671}
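
For reference, the quoted speed-up can be checked from the "parsing" and "Clock time" figures in the two runs above. The helper name below is made up for illustration, and a single run per side is a small sample, so treat the result as indicative only.

```python
# Figures copied from the benchmark output above (seconds).
PARSING_BEFORE = 2.02
PARSING_AFTER = 1.98
CLOCK_BEFORE = 10.77
CLOCK_AFTER = 10.70


def percent_speedup(before: float, after: float) -> float:
    """Relative reduction in time, expressed as a percentage of `before`."""
    return (before - after) / before * 100


parsing_speedup = percent_speedup(PARSING_BEFORE, PARSING_AFTER)
clock_speedup = percent_speedup(CLOCK_BEFORE, CLOCK_AFTER)
```

The parsing stage comes out at roughly 2%, while the overall clock time improves by well under 1% because linting dominates the run.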

Are there any other side effects of this change that we should be aware of?

Pull Request checklist

  • Please confirm you have completed any of the necessary steps below.

  • Included test cases to demonstrate any code changes, which may be one or more of the following:

    • .yml rule test cases in test/fixtures/rules/std_rule_cases.
    • .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
    • Full autofix test cases in test/fixtures/linter/autofix.
    • Other.
  • Added appropriate documentation for the change.

  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

@judahrand judahrand changed the title Core: Add MultiStringParser to match a set of strings Core: Add MultiStringParser to match a collection of strings Jun 29, 2022

codecov bot commented Jun 29, 2022

Codecov Report

Merging #3510 (e9c66ea) into main (ba60508) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #3510   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          171       171           
  Lines        12997     13027   +30     
=========================================
+ Hits         12997     13027   +30     
Impacted Files                                Coverage Δ
src/sqlfluff/dialects/dialect_ansi.py         100.00% <ø> (ø)
src/sqlfluff/dialects/dialect_bigquery.py     100.00% <ø> (ø)
src/sqlfluff/dialects/dialect_exasol.py       100.00% <ø> (ø)
src/sqlfluff/dialects/dialect_sparksql.py     100.00% <ø> (ø)
src/sqlfluff/core/dialects/base.py            100.00% <100.00%> (ø)
src/sqlfluff/core/parser/__init__.py          100.00% <100.00%> (ø)
src/sqlfluff/core/parser/grammar/base.py      100.00% <100.00%> (ø)
src/sqlfluff/core/parser/parsers.py           100.00% <100.00%> (ø)
src/sqlfluff/dialects/dialect_snowflake.py    100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@tunetheweb tunetheweb left a comment

This is great! 🚀🚀🚀

Not only is it (slightly) faster, it's also more readable.

Could we add some tests to test/core/parse/grammar_test.py? I realise code coverage has returned 100% but still think it would be good to have some explicit tests (positive and negative).

Also have some other feedback below.

Comment on lines 115 to 120
self.templates = templates
self.raw_class = raw_class
self.name = name
self.type = type
self.optional = optional
self.segment_kwargs = segment_kwargs or {}
Member

Should we uppercase during __init__ to avoid having to do it for each match?

Also, should this just call super() like RegexParser does for the other values, in case the base StringParser ever changes in future?

Contributor Author

@judahrand judahrand Jun 29, 2022

I avoided doing this as I don't want to set self.template. mypy will complain if self.template is set with a Collection[str] rather than str - hence using self.templates instead.

However, uppering in __init__ feels reasonable if you think it is better. I only didn't for consistency with how StringParser returns simple.

Member

No, I agree you need templates rather than template, since it is a different type, as you say. I'm just saying you might want to call super anyway for the other attributes (including setting template to an empty string?). Yes, it's the same as what you have, since all super().__init__ does is set them the way you are setting them here, but it still feels like the better thing to do in case that ever changes, and it is what RegexParser does. But I'm not a Python coder by trade, so maybe it's fine the way it is?

I wonder should StringParser also be changed to upper at __init__ time?

Contributor Author

@judahrand judahrand Jun 29, 2022

including setting template to an empty string?

This is what feels odd to me. I'm not sure we should set an attribute which isn't relevant to the class at all.

I wonder should StringParser also be changed to upper at __init__ time?

Quite possibly, though I suspect the overhead is very small!

Member

This is what feels odd to me. I'm not sure we should set an attribute which isn't relevant to the class at all.

Yeah I get your hesitancy. Let's leave this comment open and let the others weigh in.

Contributor Author

I've implemented a compromise where I do call super() but then call del self.template. This way we get the advantages of calling super() without adding an attribute that serves no purpose other than to confuse.
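
A rough sketch of that compromise follows. The class names and the reduced constructor signatures are stand-ins invented for illustration, not the real sqlfluff classes; the point is only the pattern of calling super().__init__ with a placeholder and then deleting the attribute that does not apply.

```python
from typing import Collection, Optional


class ParentParserSketch:
    """Stand-in for a single-template parser; not sqlfluff's actual class."""

    def __init__(
        self, template: str, name: Optional[str] = None, optional: bool = False
    ):
        self.template = template
        self.name = name
        self.optional = optional


class MultiTemplateParserSketch(ParentParserSketch):
    """Stand-in for a parser that matches any of several templates."""

    def __init__(
        self,
        templates: Collection[str],
        name: Optional[str] = None,
        optional: bool = False,
    ):
        # Reuse the parent's attribute setup, passing a placeholder template.
        super().__init__(template="", name=name, optional=optional)
        # Remove the placeholder so only `templates` exists on the instance.
        del self.template
        # Uppercase once here so matching never has to repeat the work.
        self.templates = {t.upper() for t in templates}
```

One consequence of this design is that any inherited method which reads self.template would raise AttributeError on the subclass, so such methods need overriding.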

Contributor Author

judahrand commented Jun 29, 2022

Could we add some tests to test/core/parse/grammar_test.py? I realise code coverage has returned 100% but still think it would be good to have some explicit tests (positive and negative).

@tunetheweb there doesn't seem to be a precedent for how to add tests for a new Parser. For example, the RegexParser appears just twice in grammar_test.py and then only to test other elements of the matching. NamedParser doesn't appear in the tests at all. Did you have an idea of how you'd like the new Parser included?

@tunetheweb
Member

Oh, that's a good point. I just presumed they were in there when I did a quick search, but didn't look at the tests themselves!

That's a bit poor of us, to be honest. Though, as I say, we still have 100% code coverage, since the dialect tests do exercise the parsers.

Still think it would be good to add. Could we add a simple test case to test__parser__grammar_oneof?

@judahrand
Contributor Author

@tunetheweb I hope that was what you were after?

@tunetheweb
Member

Ideally we'd leave the current test alone and add a couple of new ones, something like:

Ref.keyword("foo") matches MultiStringParser(["foo","bar"], KeywordSegment)
Ref.keyword("fo") does not match MultiStringParser(["foo","bar"], KeywordSegment)

@judahrand
Contributor Author

@tunetheweb I'm still not convinced grammar_test.py is the right place to be adding a test for the Parser at all... After all it inherits directly from Matchable not BaseGrammar and so isn't a 'grammar' at all.

There should probably be a parser_test.py?

@tunetheweb
Member

There is a parse_test.py. Maybe belongs in there instead?

@judahrand
Contributor Author

There is a parse_test.py. Maybe belongs in there instead?

That seems to be testing BaseSegment which doesn't inherit from Matchable.

@judahrand
Contributor Author

@tunetheweb That should be better?

@judahrand
Contributor Author

@tunetheweb This should be good to go now, I think.

Member

@tunetheweb tunetheweb left a comment

Changes LGTM!

I'll leave it open to see if any of the others spot anything.

Member

@barrywhart barrywhart left a comment

Nice work! I have a couple questions which could potentially improve the speed a bit more.

@@ -100,6 +100,47 @@ def match(
return MatchResult.from_unmatched(segments)


class MultiStringParser(StringParser):
Member

Is there some benefit to inheriting from StringParser? I notice that:

  • In the constructor, we delete one of the parent class' fields (template)
  • None of the methods use the superclass

Basically, I'm wondering if this class doesn't need to inherit from StringParser.

Contributor Author

@judahrand judahrand Jun 30, 2022

There are other methods which are used from StringParser, for example match. Though, you are right that the inheritance is odd. I reckon there should be a BaseParser perhaps which inherits from Matchable and implements the reusable bits of StringParser given that it looks to me like all other Parsers inherit from StringParser?
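
One possible shape for such a refactor is sketched below. All class and method names are hypothetical, and `match` is simplified to take a single raw string rather than segments; the idea shown is just that shared state and the matching flow live in a base class, while each subclass supplies its own comparison.

```python
from abc import ABC, abstractmethod
from typing import Collection, List, Optional


class BaseParserSketch(ABC):
    """Hypothetical shared base holding what every parser needs."""

    def __init__(self, name: Optional[str] = None, optional: bool = False):
        self.name = name
        self.optional = optional

    @abstractmethod
    def simple(self) -> List[str]:
        """Return the uppercase strings this parser can match."""

    @abstractmethod
    def _is_template_match(self, raw: str) -> bool:
        """Subclass-specific comparison against one raw token."""

    def match(self, raw: str) -> Optional[str]:
        # Shared flow: only the comparison is delegated to subclasses.
        return raw if self._is_template_match(raw) else None


class SingleStringSketch(BaseParserSketch):
    def __init__(self, template: str, **kwargs):
        super().__init__(**kwargs)
        self.template = template.upper()

    def simple(self) -> List[str]:
        return [self.template]

    def _is_template_match(self, raw: str) -> bool:
        return raw.upper() == self.template


class MultiStringSketch(BaseParserSketch):
    def __init__(self, templates: Collection[str], **kwargs):
        super().__init__(**kwargs)
        self.templates = {t.upper() for t in templates}

    def simple(self) -> List[str]:
        return sorted(self.templates)

    def _is_template_match(self, raw: str) -> bool:
        return raw.upper() in self.templates
```

With this layout the multi-string class never acquires a template attribute, so nothing has to be deleted after calling super().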

Contributor Author

Had a go at this and keen for feedback.

Because string matchers are not case sensitive we can
just return the templates here.
"""
return list(self.templates)
Member

Is this function called frequently? I notice that we're creating and returning a new list each time, which could be slow.

Contributor Author

The compromise here was whether to cache the set which is used for matching or to cache the list which is expected to be returned by simple. We could cache both as class attributes?

Contributor Author

I've implemented a self._simple which is set in __init__ which should avoid creating the list multiple times (this goes for StringParser too).
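
A minimal sketch of that caching idea, assuming a simplified class (the name and constructor here are hypothetical, not the merged code): the set used for matching and the list returned by simple() are both built once in __init__.

```python
from typing import Collection, List


class CachedSimpleSketch:
    """Hypothetical parser-like class that precomputes its simple() result."""

    def __init__(self, templates: Collection[str]):
        # Set used for fast membership checks during matching.
        self.templates = {t.upper() for t in templates}
        # List built once here, so simple() does no allocation per call.
        self._simple: List[str] = sorted(self.templates)

    def simple(self) -> List[str]:
        # Returns the cached list object; callers should not mutate it.
        return self._simple
```

The trade-off is that every call returns the same list object, so a caller that modifies it would affect all later calls.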

@barrywhart
Member

@WittierDinosaur: Would you like to review this since it relates to one of the performance issues you created?

Member

@alanmcruickshank alanmcruickshank left a comment

Neat - I like this. The regex construction for the set matching was a bit hacky and this provides a really neat solution to that.

Good work 👍 .

@judahrand judahrand force-pushed the mult-string-parser branch 2 times, most recently from 60b728a to b792eec Compare June 30, 2022 18:35
@WittierDinosaur
Contributor

@barrywhart I would, but unfortunately I'm on holiday now and for a couple of weeks. I'll evaluate it post-hoc when I'm back, but conceptually I like the idea. I'll leave it to you guys on the implementation!

@tunetheweb
Member

OK in that case I vote to merge rather than waiting for you.

@barrywhart could you review the changes for your feedback?

@barrywhart barrywhart merged commit 911774d into sqlfluff:main Jun 30, 2022