Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark3: parse some VALUES clauses #2245

Merged
merged 58 commits into from
Jan 31, 2022

Conversation

mcannamela
Copy link
Contributor

Brief summary of the change made

Makes progress on #2152.

VALUES clauses fail to parse in the spark3 dialect I believe because TableExpressionSegment does not include ValuesClauseSegment. I address that only for the spark3 dialect by including it, and also redefining some segments to address subsequent parsing failures.

At present, I believe I am covering most simple cases of VALUES used in FROM clauses, but not any of the optional directives likeORDER BY, LIMIT, OFFSET that you can see in e.g. the postgres docs (Spark SQL docs don't mention VALUES except in INSERT statements). I suspect this should all apply to the ansi dialect but I must admit I gave up trying to find the spec after not that much looking.

Also, it only works reliably when the VALUES clause is in parentheses:

    select * from (values 1, 2) --works 
    select * from values 1, 2 --fails

I think this is because the parse_grammar of FromClauseSegment is Delimited(Ref("FromExpressionSegment")), which splits on the first comma it finds rather than looking for the longest FromExpressionSegment. I don't see a good way around that. I could live with this, but I wish I had a way to hint to the user that this is what went wrong, since the output of parse does not make this at all obvious that just wrapping with parens will fix it.

Are there any other side effects of this change that we should be aware of?

The only dialect tests that fail for spark3 are ones that I added, but since I do override TableExpressionSegment for spark3 I imagine there are tests we were relying on in ansi that we would want to make sure spark3 passes. Before going nuts with that, I wanted to see whether I shouldn't just be pushing all this down into the ansi dialect anyway.

Pull Request checklist

  • Please confirm you have completed any of the necessary steps below.

  • Included test cases to demonstrate any code changes, which may be one or more of the following:

    • .yml rule test cases in test/fixtures/rules/std_rule_cases.
    • .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
    • Full autofix test cases in test/fixtures/linter/autofix.
    • Other.
  • Added appropriate documentation for the change.

  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

test/conftest.py Outdated Show resolved Hide resolved
logger = logging.getLogger()
logger.addHandler(logging.StreamHandler())
for _, _, message in caplog.record_tuples:
logger.warning(message)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my debugging purposes.

i wanted to ask, is there a good place that explains how to read the detailed parsing output?

just by staring at it i could see it was only picking up part of from values 1,2 as the FromExpressionSegment, but beyond that i was having trouble reading it:

[PD:0  MD:1 ]	.None                                             	Ref.match OUT       	+   	[match=<MatchResult 1/9: 'from'>, seg="'from values 1 , 2'"]
[PD:0  MD:2 ]	..None                                            	Delim._look_ahead_match IN	    	[ls=7, seg='values 1 , 2']
[PD:0  MD:2 ]	..None                                            	Delim._look_ahead_match SI	    	[mq=[], sb=['VALUES', '', '1', '', ',', '', '2']]
[PD:0  MD:2 ]	..None                                            	Ref.match OUT       	+   	[match=<MatchResult 1/3: ','>, seg="', 2'"]
[PD:0  MD:2 ]	..None                                            	Delim._look_ahead_match SC	    	[bsm=(4, 1, <Ref: CommaSegment>)]
[PD:0  MD:4 ]	....FromExpressionSegment                         	OneOf.match PRN     	    	[ns=1, ps=1, ms=0, pruned=[<Ref: MLTableExpressionSegment>], opts=[<Ref: FromExpressionElementSegment>]]
[PD:0  MD:7 ]	.......FromExpressionElementSegment               	Ref.match OUT       	    	[match=<MatchResult 0/3: ''>, seg="'values 1'"]
[PD:0  MD:7 ]	.......FromExpressionElementSegment               	Optio.match PRN     	    	[ns=1, ps=1, ms=0, pruned=[<Bracketed: [<Ref: TableExpressionSegment>]>], opts=[<Ref: TableExpressionSegment>]]
[PD:0  MD:9 ]	.........TableExpressionSegment                   	OneOf.match PRN     	    	[ns=3, ps=1, ms=1, pruned=[<Bracketed: [<Ref: SelectableGrammar>]>], opts=[<Ref: ValuesClauseSegment>, <Ref: BareFunctionSegment>, <Ref: FunctionSegment>, <Ref: TableReferenceSegment>]]
[PD:0  MD:12]	............ValuesClauseSegment                   	Ref.match OUT       	+   	[match=<MatchResult 1/3: 'values'>, seg="'values 1'"]
[PD:0  MD:11]	...........ValuesClauseSegment                    	Start.match OUT     	++  	[match=<MatchResult 3/3: 'values 1'>, seg="'values 1'"]
[PD:0  MD:10]	..........ValuesClauseSegment                     	Value.match OUT     	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:10]	..........TableExpressionSegment                  	Ref.match OUT       	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:9 ]	.........TableExpressionSegment                   	OneOf.match OUT     	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:8 ]	........TableExpressionSegment                    	Table.match OUT     	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:8 ]	........FromExpressionElementSegment              	Ref.match OUT       	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:7 ]	.......FromExpressionElementSegment               	Optio.match OUT     	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:6 ]	......FromExpressionElementSegment                	Seque.match OUT     	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:5 ]	.....FromExpressionElementSegment                 	FromE.match OUT     	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:5 ]	.....FromExpressionSegment                        	Ref.match OUT       	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:4 ]	....FromExpressionSegment                         	OneOf.match OUT     	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:3 ]	...FromExpressionSegment                          	Condi.match OUT     	++  	[match=<MatchResult 1/1: ''>, seg="''"]
[PD:0  MD:3 ]	...FromExpressionSegment                          	Seque.match OUT     	++  	[match=<MatchResult 3/3: 'values 1'>, seg="'values 1'"]
[PD:0  MD:2 ]	..FromExpressionSegment                           	FromE.match OUT     	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:2 ]	..None                                            	Ref.match OUT       	++  	[match=<MatchResult 1/1: 'values 1'>, seg="'values 1'"]
[PD:0  MD:2 ]	..None                                            	Delim._look_ahead_match IN	    	[ls=2, seg=' 2']
[PD:0  MD:2 ]	..None                                            	Delim._look_ahead_match SI	    	[mq=[], sb=['', '2']]
[PD:0  MD:2 ]	..None                                            	Delim._look_ahead_match SC	    	[bsm=None]
[PD:0  MD:4 ]	....FromExpressionSegment                         	OneOf.match PRN     	    	[ns=1, ps=1, ms=0, pruned=[<Ref: MLTableExpressionSegment>], opts=[<Ref: FromExpressionElementSegment>]]
[PD:0  MD:7 ]	.......FromExpressionElementSegment               	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:7 ]	.......FromExpressionElementSegment               	Optio.match PRN     	    	[ns=1, ps=1, ms=0, pruned=[<Bracketed: [<Ref: TableExpressionSegment>]>], opts=[<Ref: TableExpressionSegment>]]
[PD:0  MD:9 ]	.........TableExpressionSegment                   	OneOf.match PRN     	    	[ns=3, ps=2, ms=0, pruned=[<Ref: ValuesClauseSegment>, <Bracketed: [<Ref: SelectableGrammar>]>], opts=[<Ref: BareFunctionSegment>, <Ref: FunctionSegment>, <Ref: TableReferenceSegment>]]
[PD:0  MD:10]	..........TableExpressionSegment                  	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:11]	...........FunctionSegment                        	OneOf.match PRN     	    	[ns=1, ps=1, ms=0, pruned=[<Sequence: [<Sequence: [<Ref: DatePartFunctionNameSe...]>], opts=[<Sequence: [<Sequence: [<AnyNumberOf: [<Ref: Functio..., <Ref: PostFunctionGrammar [opt]>]>]]
[PD:0  MD:17]	.................DatePartFunctionNameSegment      	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:16]	................DatePartFunctionNameSegment       	Seque.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:15]	...............DatePartFunctionNameSegment        	DateP.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:15]	...............FunctionSegment                    	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:14]	..............FunctionSegment                     	AnyNu.match PRN     	    	[ns=1, ps=0, ms=0, pruned=[], opts=[<Ref: FunctionNameSegment>]]
[PD:0  MD:17]	.................FunctionNameSegment              	AnyNu.match PRN     	    	[ns=1, ps=0, ms=0, pruned=[], opts=[<Sequence: [<Ref: SingleIdentifierGrammar>, <Ref: DotSegment>]>]]
[PD:0  MD:19]	...................SingleIdentifierGrammar        	OneOf.match PRN     	    	[ns=2, ps=0, ms=0, pruned=[], opts=[<Ref: NakedIdentifierSegment>, <Ref: QuotedIdentifierSegment>]]
[PD:0  MD:20]	....................SingleIdentifierGrammar       	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:20]	....................SingleIdentifierGrammar       	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:19]	...................SingleIdentifierGrammar        	OneOf.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:19]	...................FunctionNameSegment            	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:18]	..................FunctionNameSegment             	Seque.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:17]	.................FunctionNameSegment              	AnyNu.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:17]	.................FunctionNameSegment              	OneOf.match PRN     	    	[ns=2, ps=0, ms=0, pruned=[], opts=[<Ref: FunctionNameIdentifierSegment>, <Ref: QuotedIdentifierSegment>]]
[PD:0  MD:18]	..................FunctionNameSegment             	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:18]	..................FunctionNameSegment             	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:17]	.................FunctionNameSegment              	OneOf.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:16]	................FunctionNameSegment               	Seque.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:15]	...............FunctionNameSegment                	Funct.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:15]	...............FunctionSegment                    	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:14]	..............FunctionSegment                     	AnyNu.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:13]	.............FunctionSegment                      	Seque.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:12]	............FunctionSegment                       	Seque.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:11]	...........FunctionSegment                        	OneOf.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:10]	..........FunctionSegment                         	Funct.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:10]	..........TableExpressionSegment                  	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:12]	............TableReferenceSegment                 	Delim._look_ahead_match IN	    	[ls=1, seg='2']
[PD:0  MD:12]	............TableReferenceSegment                 	Delim._look_ahead_match SI	    	[mq=[], sb=['2']]
[PD:0  MD:12]	............TableReferenceSegment                 	OneOf.match PRN     	    	[ns=1, ps=10, ms=0, pruned=[<Ref: OnKeywordSegment>, <Ref: AsKeywordSegment>, <Ref: UsingKeywordSegment>, <Ref: CommaSegment>, <Ref: CastOperatorSegment>, <Ref: StartSquareBracketSegment>, <Ref: StartBracketSegment>, <Ref: ColonSegment>, <Ref: DelimiterSegment>, <class 'sqlfluff.core.parser.segments.base.BracketedSegment'>], opts=[<Ref: BinaryOperatorGrammar>]]
[PD:0  MD:13]	.............BinaryOperatorGrammar                	OneOf.match PRN     	    	[ns=1, ps=3, ms=0, pruned=[<Ref: ArithmeticBinaryOperatorGrammar>, <Ref: StringBinaryOperatorGrammar>, <Ref: BooleanBinaryOperatorGrammar>], opts=[<Ref: ComparisonOperatorGrammar>]]
[PD:0  MD:14]	..............ComparisonOperatorGrammar           	OneOf.match PRN     	    	[ns=1, ps=9, ms=0, pruned=[<Ref: EqualsSegment>, <Ref: EqualsSegment_a>, <Ref: EqualsSegment_b>, <Ref: GreaterThanSegment>, <Ref: LessThanSegment>, <Ref: GreaterThanOrEqualToSegment>, <Ref: LessThanOrEqualToSegment>, <Ref: NotEqualToSegment_a>, <Ref: NotEqualToSegment_b>], opts=[<Ref: LikeOperatorSegment>]]
[PD:0  MD:15]	...............ComparisonOperatorGrammar          	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:14]	..............ComparisonOperatorGrammar           	OneOf.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:14]	..............BinaryOperatorGrammar               	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:13]	.............BinaryOperatorGrammar                	OneOf.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:13]	.............TableReferenceSegment                	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:12]	............TableReferenceSegment                 	OneOf.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:12]	............TableReferenceSegment                 	NonCo.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:12]	............SingleIdentifierGrammar               	OneOf.match PRN     	    	[ns=2, ps=0, ms=0, pruned=[], opts=[<Ref: NakedIdentifierSegment>, <Ref: QuotedIdentifierSegment>]]
[PD:0  MD:13]	.............SingleIdentifierGrammar              	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:13]	.............SingleIdentifierGrammar              	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:12]	............SingleIdentifierGrammar               	OneOf.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:12]	............TableReferenceSegment                 	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:11]	...........TableReferenceSegment                  	Delim.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:10]	..........TableReferenceSegment                   	Table.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:10]	..........TableExpressionSegment                  	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:9 ]	.........TableExpressionSegment                   	OneOf.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:8 ]	........TableExpressionSegment                    	Table.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:8 ]	........FromExpressionElementSegment              	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:7 ]	.......FromExpressionElementSegment               	Optio.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:6 ]	......FromExpressionElementSegment                	Seque.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:5 ]	.....FromExpressionElementSegment                 	FromE.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:5 ]	.....FromExpressionSegment                        	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:4 ]	....FromExpressionSegment                         	OneOf.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:3 ]	...FromExpressionSegment                          	Seque.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:2 ]	..FromExpressionSegment                           	FromE.match OUT     	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:2 ]	..None                                            	Ref.match OUT       	    	[match=<MatchResult 0/1: ''>, seg="'2'"]
[PD:0  MD:1 ]	.None                                             	Delim.match OUT     	    	[match=<MatchResult 0/2: ''>, seg="'values 1 , 2'"]
[PD:0  MD:0 ]	None                                              	Seque.match OUT     	    	[match=<MatchResult 0/9: ''>, seg="'from values 1 , 2'"]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing I'm aware of. @alanmcruickshank wrote the parser; maybe he has something to add?

@mcannamela
Copy link
Contributor Author

@barrywhart I've copied tests from your PR as you suggest

@jpy-git your PR allowed me to reuse the ValuesClauseSegment from ansi, which is great

I'm going to take it out of draft, ought not be more than a couple small items outstanding.

@mcannamela mcannamela marked this pull request as ready for review January 26, 2022 14:00
@barrywhart
Copy link
Member

@mcannamela: A couple of the test suites have failures. One is the normal Python tests.

Another is a new test added yesterday which runs the full sqlfluff lint on each of the parse test files in test/fixtures/dialects/spark3 and the other sibling directories. This tends to catch two kinds of issues:

  • Latent rule bugs
  • Dialects that may do some things in a new or different way compared to other dialects, causing rules to fail.

Can you look at the first? I'll take an initial look at the second one. Many rules are failing, but the error message is the same for each, and I think several of them reuse some of the same underlying components, so it may be essentially the same error repeated multiple times.

@barrywhart
Copy link
Member

I have some possible fixes for the rule issues. I'll create a PR against your PR later today with those fixes.

@tunetheweb
Copy link
Member

I have some possible fixes for the rule issues. I'll create a PR against your PR later today with those fixes.

Would a separate PR be better? Can then merge main into this branch after?

@barrywhart
Copy link
Member

@tunetheweb: Good idea! I created it as a standalone PR. It's similar to some of your fixes yesterday. #2488

@mcannamela
Copy link
Contributor Author

Thanks everybody for reviewing; I think I have addressed all feedback so assuming checks pass I'm hunting for ✅ s

Copy link
Member

@tunetheweb tunetheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK think this is good to merge. Thanks for all your work on it. Realise there's still more to do, but had a quick look and can't see a simple solution so let's merge as is and tackle later.

One suggestion to correct docstring back ticks that I'll merge in.

src/sqlfluff/dialects/dialect_spark3.py Outdated Show resolved Hide resolved
@tunetheweb tunetheweb merged commit 986e66f into sqlfluff:main Jan 31, 2022
@tunetheweb tunetheweb changed the title spark3: parse some VALUES clauses Spark3: parse some VALUES clauses Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants