New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Add MultiStringParser
to match a collection of strings
#3510
Conversation
This parser takes a collection of strings and returns a match if any of them are found. This is more performant than implementing this through the `RegexParser` as a hash matching route can be taken.
MultiStringParser
to match a set of stringsMultiStringParser
to match a collection of strings
Codecov Report
@@ Coverage Diff @@
## main #3510 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 171 171
Lines 12997 13027 +30
=========================================
+ Hits 12997 13027 +30
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! 🚀🚀🚀
Not only is it (slightly) faster, it's also more readable.
Could we add some tests to test/core/parse/grammar_test.py
? I realise code coverage has returned 100% but still think it would be good to have some explicit tests (positive and negative).
Also have some other feedback below.
src/sqlfluff/core/parser/parsers.py
Outdated
self.templates = templates | ||
self.raw_class = raw_class | ||
self.name = name | ||
self.type = type | ||
self.optional = optional | ||
self.segment_kwargs = segment_kwargs or {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we uppercase during _init_
to prevent having to do it for each match?
Also should this just call super like RegexParse does for the other values? In case the base StringParser ever changes in future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I avoided doing this as I don't want to set self.template
. mypy
will complain if self.template
is set with a Collection[str]
rather than str
- hence using self.templates
instead.
However, uppering in __init__
feels reasonable if you think it is better. I only didn't for consistency with how StringParser
returns simple
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I agree you need a templates
rather than template
as it is a different type as you say. I'm just saying you might want to call super
anyway on the other ones (including setting template
to an empty string?). Yes it's the same as you have as all the super().__init__
does is set them like you are setting here, but still feels like a better thing to do in case that ever changes, and it is what RegexParser is doing. But I'm not a python coder by trade so maybe it's fine the way it is?
I wonder should StringParser
also be changed to upper
at __init__
time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
including setting template to an empty string?
This is what feels odd to me. I'm not sure we should set an attribute which isn't relevant to the class at all.
I wonder should StringParser also be changed to upper at init time?
Quite possibly, though I suspect the overhead is very small!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what feels odd to me. I'm not sure we should set an attribute which isn't relevant to the class at all.
Yeah I get your hesitancy. Let's leave this comment open and let the others weigh in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've implemented a compromise where I do call super()
but then call del self.template
this way we get the advantages of calling super()
while not adding an attribute that doesn't serve any purpose other than to confuse.
@tunetheweb there doesn't seem to be a precedent for how to add tests for a new |
Oh that's a good point. I just presumed they were in there when I did a quick search but didn't look at the test themselves! That's a bit poor of us to be honest. Though, as I say, we still have 100% code coverage as the dialect tests do test. Stil think it would be good to add. Could we add a simple test case to |
@tunetheweb I hope that was what you were after? |
Ideally we'd leave the current test along, and add a couple of new ones to be something like:
|
@tunetheweb I'm still not convinced There should probably be a |
There is a |
That seems to be testing |
7f811a9
to
79d2277
Compare
@tunetheweb That should be better? |
79d2277
to
94e3aa8
Compare
94e3aa8
to
88404f5
Compare
@tunetheweb This should be good to go now, I think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM!
I'll leave it open to see if any of the others spot anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I have a couple questions which could potentially improve the speed a bit more.
src/sqlfluff/core/parser/parsers.py
Outdated
@@ -100,6 +100,47 @@ def match( | |||
return MatchResult.from_unmatched(segments) | |||
|
|||
|
|||
class MultiStringParser(StringParser): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some benefit to inheriting from StringParser
? I notice that:
- In the constructor, we delete one of the parent class' fields (
template
) - None of the methods use the superclass
Basically, I'm wondering if this class doesn't need to inherit from StringParser
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are other methods which are used from StringParser
, for example match
. Though, you are right that the inheritance is odd. I reckon there should be a BaseParser
perhaps which inherits from Matchable
and implements the reusable bits of StringParser
given that it looks to me like all other Parsers inherit from StringParser
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a go at this and keen for feedback.
src/sqlfluff/core/parser/parsers.py
Outdated
Because string matchers are not case sensitive we can | ||
just return the templates here. | ||
""" | ||
return list(self.templates) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function called frequently? I notice that we're creating and returning a new list each time, which could be slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compromise here was whether to cache the set
which is used for matching or to cache the list
which is expected to be returned by simple
. We could cache both as class attributes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've implemented a self._simple
which is set in __init__
which should avoid creating the list
multiple times (this goes for StringParser
too).
@WittierDinosaur: Would you like to review this since it relates to one of the performance issues you created? |
aa529e9
to
9da1b18
Compare
9da1b18
to
ed04460
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat - I like this. The regex construction for the set matching was a bit hacky and this provides a really neat solution to that.
Good work 👍 .
60b728a
to
b792eec
Compare
b792eec
to
414c713
Compare
@barrywhart I would, but unfornately I'm on holiday now and for a couple of weeks. I'll evaluate it post-hoc when I'm back, but conceptually I like the idea. I'll leave it to you guys on the implementation! |
OK in that case I vote to merge rather than waiting for you. @barrywhart could you review the changes for your feedback? |
Brief summary of the change made
This parser takes a collection of strings and returns a match if any of
them are found. This is more performant than implementing this through
the
RegexParser
as a hash matching route can be taken.This looks to at least partially address #3390 and does see a modest 2% speed up in parsing when running
tox -e bench
Before:
After:
Are there any other side effects of this change that we should be aware of?
Pull Request checklist
Please confirm you have completed any of the necessary steps below.
Included test cases to demonstrate any code changes, which may be one or more of the following:
.yml
rule test cases intest/fixtures/rules/std_rule_cases
..sql
/.yml
parser test cases intest/fixtures/dialects
(note YML files can be auto generated withtox -e generate-fixture-yml
).test/fixtures/linter/autofix
.Added appropriate documentation for the change.
Created GitHub issues for any relevant followup/future enhancements if appropriate.