[SPARK-52545][SQL] Standardize double-quote escaping to follow SQL specification #51242
Conversation
Can we check the ANSI standard, and whether it clarifies this syntax? If so, I think we can cover this change under the ANSI flag: treating "" as " in a double-quoted literal.
Hello @HyukjinKwon. I made a quick check; the ANSI standard seems to support doubled single quotes as escapes in string literals and doubled double quotes in identifiers. I will make a more careful check, but how can we support two different dialects using one ANTLR lexer config? Could you point me to a demo PR that implemented similar functionality, so I can follow it?
If it complies with ANSI, let's cover this under the ANSI configuration.
Hi @HyukjinKwon. For string literals:
For identifiers:
I don't think I need to cover this under the ANSI configuration, because #38147 already did it. Our behaviors will be as follows (dqi = doubleQuotedIdentifiers):
I only changed the way we parse string concatenation. In both ANSI mode and Hive mode we should always treat 'O''Reilly' as O'Reilly; the previous way of treating 'O''Reilly' as concat(O, Reilly) can be regarded as a bug. Ping @gengliangwang to take a look, as you are the author of #38147.
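The escaping rule under discussion can be sketched as a tiny helper. This is illustrative only, assuming SQL-standard doubled-quote escaping; the function name is hypothetical and this is not Spark's actual unescapeSQLString:

```python
def unescape_sql_string(raw: str) -> str:
    """Unescape a quoted SQL literal, treating a doubled quote
    character as a single literal quote (SQL-standard escaping).
    Illustrative sketch, not Spark's implementation."""
    quote = raw[0]      # the enclosing quote, ' or "
    body = raw[1:-1]    # strip the enclosing quotes
    return body.replace(quote * 2, quote)

print(unescape_sql_string("'O''Reilly'"))  # O'Reilly
print(unescape_sql_string('"S""par""k"'))  # S"par"k
```

Under the old concatenation reading, the same input would instead have produced OReilly.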
Ping @gengliangwang; also cc @cloud-fan and @srielau since you participated in #38022.
I like it in general. My worry is that the lexer solution does not permit control via a config, so we have the potential to break someone. A solution that hooks into the implicit-concat code would allow you control.
@srielau Thank you for your review. If I understand correctly, you mean that some users may have written SQL that relies on the old behavior, so we want to keep those queries bug-compatible by adding a config before changing the default.
@@ -111,6 +111,201 @@ org.apache.spark.sql.catalyst.parser.ParseException
}

-- !query
select 1 from "not_""exists"
do we have similar behavior for the backtick quoting?
I'm not worried about the use of double quotes for identifiers. I'm worried about the use of double and single quotes for strings.
I think you are considering situations like
@srielau Thank you for your clarification. I got your idea; we have 2 choices to implement this.
I tried solution 2; you can take a look at my latest commit. I will submit some test cases after we agree on the solution.
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
.internal()
.doc("When true, consecutive quotes in string literals (e.g. \"\" or '') are treated as " +
  "literal characters rather than escape sequences. This preserves pre-Spark 3.0 behavior " +
  "where \"a\"\"b\" would be parsed as 'a\"b' instead of 'ab'.")
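The two parse results the doc string contrasts ('a"b' versus 'ab') can be sketched side by side. These helper names are illustrative, not Spark internals; which result the flag selects follows the doc text above:

```python
def consecutive_quotes_as_literal(body: str, quote: str = '"') -> str:
    # the doubled quote is kept as one literal quote character: a""b -> a"b
    return body.replace(quote * 2, quote)

def consecutive_quotes_as_concat(body: str, quote: str = '"') -> str:
    # the doubled quote acts as close-then-reopen, contributing
    # no character to the value: a""b -> ab
    return body.replace(quote * 2, '')

print(consecutive_quotes_as_literal('a""b'))  # a"b
print(consecutive_quotes_as_concat('a""b'))   # ab
```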
to make it more readable, can we use single quote string literal here?
Yes, the original doc also contained some errors, and I refined it.
@@ -6398,6 +6408,8 @@ class SQLConf extends Serializable with Logging with SqlApiConf {

def escapedStringLiterals: Boolean = getConf(ESCAPED_STRING_LITERALS)

def legacyConsecutiveStringLiterals: Boolean = getConf(LEGACY_CONSECUTIVE_STRING_LITERALS)
it's only used once, we can inline it.
Done.
@@ -143,6 +143,16 @@ class ParserUtilsSuite extends SparkFunSuite {
// Guard against off-by-one errors in the "all chars are hex" routine:
assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa")

// Double-quote escaping ("")
assert(unescapeSQLString("\"\"\"aa\"\"\"") == "\"aa\"")
can we write the test with Scala triple-quoted string literals """xxx"""?
That can make it more readable, done.
@@ -154,6 +154,215 @@ org.apache.spark.sql.AnalysisException
}

-- !query
select 1 from "not_""exists"
This PR updates the string literal parsing. We don't need to test so many different SQL statements that contain string literals; simply SELECT the string literal and focus more on the string literal itself:
SELECT 'a''b'
SELECT 'a' 'b'
...
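The distinction these two examples draw can be sketched with a toy lexer: within one quoted token a doubled quote escapes ('a''b' is one literal, a'b), while whitespace-separated literals are distinct tokens the parser may implicitly concatenate ('a' 'b' becomes ab). Illustration only, not Spark's ANTLR grammar:

```python
import re

def lex_string_literals(sql):
    """Extract single-quoted literal tokens from a SQL fragment,
    unescaping doubled quotes within each token. Toy sketch."""
    # one token: a quote, then non-quote chars or doubled quotes, then a quote
    token = re.compile(r"'(?:[^']|'')*'")
    return [m.group(0)[1:-1].replace("''", "'") for m in token.finditer(sql)]

print(lex_string_literals("SELECT 'a''b'"))          # ["a'b"]  one literal
print("".join(lex_string_literals("SELECT 'a' 'b'")))  # ab, after implicit concat
```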
It's indeed overkill to test these double-quoted identifiers here, so I removed most of the unrelated tests. But I still kept select 1 from "not_""exists" since this PR also changed the behavior when handling double-quoted identifiers.
@cloud-fan Thank you for your review, I resolved your comments, PTAL again.
@@ -143,6 +143,16 @@ class ParserUtilsSuite extends SparkFunSuite {
// Guard against off-by-one errors in the "all chars are hex" routine:
assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa")

// Double-quote escaping ("", '')
assert(unescapeSQLString(""""a""a"""") == "a\"a")
Suggested change:
- assert(unescapeSQLString(""""a""a"""") == "a\"a")
+ assert(unescapeSQLString(""" "a""a" """) == """a"a""")
Done, also improved the lines below.
-- !query
SELECT "S""par""k", "S\"par\"k", 'S""par""k'
can we test the cases when there are spaces in the middle?
That's a good idea. I added some cases with various usages of ", ', and ' '.
@@ -143,6 +143,16 @@ class ParserUtilsSuite extends SparkFunSuite {
// Guard against off-by-one errors in the "all chars are hex" routine:
assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa")

// Double-quote escaping ("", '')
assert(unescapeSQLString(""""a""a"""") == """a"a""")
super nit:
- assert(unescapeSQLString(""""a""a"""") == """a"a""")
+ assert(unescapeSQLString(""" "a""a" """.trim) == """ a"a """.trim)
then it's cleaner what is the input and output.
-- !query
SELECT "S""par""k", "S\"par\"k", 'S""par""k'
super nit:
- SELECT "S""par""k", "S\"par\"k", 'S""par""k'
+ SELECT "S""par""k" AS c1, "S\"par\"k" AS c2, 'S""par""k' AS c3
to make it easier to match the result with the schema.
That's nitpicky and reasonable. I fixed them all.
The failed tests are unrelated. Thanks, merging to master!
What changes were proposed in this pull request?
This PR standardizes Spark's string literal parsing behavior to align with other database systems (MySQL, Hive, etc.) by implementing proper quote escaping rules:
Previous Behavior (Inconsistent): consecutive quotes inside a string literal were treated as implicit concatenation, e.g. 'O''Reilly' was parsed as concat('O', 'Reilly'), i.e. OReilly.
New Behavior (Standard-Compliant): a doubled quote is an escaped quote character, e.g. 'O''Reilly' is parsed as O'Reilly.
Why are the changes needed?
The current behavior incorrectly treats consecutive quotes as string concatenation rather than escaping. The new behavior aligns with major databases:
MySQL: Reference
Hive: Hplsql.g4 implementation
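To make the comparison concrete, the two readings of the same literal can be sketched as follows (an illustration of the before/after semantics, not the parser code):

```python
QUOTE = "'"

def previous_reading(raw: str) -> str:
    # consecutive quotes acted as close-then-reopen, so the pieces
    # were implicitly concatenated: 'S''par''k' -> Spark
    return raw[1:-1].replace(QUOTE * 2, "")

def standard_reading(raw: str) -> str:
    # SQL standard: a doubled quote escapes one quote: 'S''par''k' -> S'par'k
    return raw[1:-1].replace(QUOTE * 2, QUOTE)

print(previous_reading("'S''par''k'"))  # Spark
print(standard_reading("'S''par''k'"))  # S'par'k
```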
Does this PR introduce any user-facing change?
Yes. The previous behavior can be restored via SET spark.sql.legacy.consecutiveStringLiterals.enabled=true;

How was this patch tested?
Added some SQL tests based on the existing test file.
Was this patch authored or co-authored using generative AI tooling?
No.