[SPARK-52545][SQL] Standardize double-quote escaping to follow SQL specification #51242


Closed · wants to merge 7 commits

Conversation

dengziming
Member

@dengziming dengziming commented Jun 22, 2025

What changes were proposed in this pull request?

This PR standardizes Spark's string literal parsing behavior to align with other database systems (MySQL, Hive, etc.) by implementing proper quote escaping rules:

Previous Behavior (Inconsistent):

spark.sql("""SELECT " ""aa"" " AS f1, " \"aa\" " AS f2, '""aa""' as f3""".stripMargin).show()
+----+------+------+
|  f1|    f2|    f3|
+----+------+------+
| aa | "aa" |""aa""|
+----+------+------+

New Behavior (Standard-Compliant):

spark.sql("""SELECT " ""aa"" " AS f1, " \"aa\" " AS f2, '""aa""' as f3""".stripMargin).show()

+------+------+------+
|    f1|    f2|    f3|
+------+------+------+
| "aa" | "aa" |""aa""|
+------+------+------+

Why are the changes needed?

The current behavior incorrectly treats consecutive quotes as string concatenation rather than as escapes. The new behavior aligns with major databases:

MySQL: Reference

Hive: Hplsql.g4 implementation

Does this PR introduce any user-facing change?

Yes.

  1. Consecutive quotes are now parsed as escapes: 'a''b' → a'b. Previously 'a''b' → ab, because the quotes were treated as literal separators for string concatenation. The legacy behavior can be restored via SET spark.sql.legacy.consecutiveStringLiterals.enabled=true;
  2. Double-quote escaping ("", '') is now consistently supported in all string literal contexts, such as column comments, partition directory names, and other SQL string literals;
  3. ANSI identifier enhancement: double-quoted identifiers now support quote escaping in ANSI mode, e.g. CREATE TABLE "t" ("I""d" INT). Note that while parsing allows escaped quotes, some downstream analysis may still reject identifiers containing special characters (e.g. " in table names). A sketch of changes 1 and 2 follows this list.
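
For illustration, a minimal spark-shell sketch of changes 1 and 2 (results assume this PR is applied; the table name is hypothetical):

// Change 1: consecutive quotes are escapes, not concatenation boundaries.
spark.sql("SELECT 'a''b' AS c").show()
// +---+
// |  c|
// +---+
// |a'b|
// +---+

// The legacy concatenation behavior can be restored per session:
spark.sql("SET spark.sql.legacy.consecutiveStringLiterals.enabled=true")
spark.sql("SELECT 'a''b' AS c").show()  // prints: ab

// Change 2: the same escaping works in other literal contexts,
// e.g. a column comment (demo_t is a hypothetical table):
spark.sql("CREATE TABLE demo_t (id INT COMMENT 'O''Reilly id') USING parquet")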

How was this patch tested?

Added some SQL tests based on the existing test file.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon
Member

Can we check the ANSI standard and see whether it clarifies this syntax? Then I think we can cover this change under the ANSI flag.

@dengziming dengziming changed the title [SPARK-52545][SQL] Handle "" as " in double-quoted literal [SPARK-52545][SQL] Double-quote should prioritize over implicit string concatenation Jun 23, 2025
@dengziming
Member Author

dengziming commented Jun 23, 2025

Can we check ANSI standard, and if they clarify this syntax?

Hello @HyukjinKwon. I made a quick check; it seems the ANSI standard supports a doubled single quote ('') as an escape in literals and a doubled double quote ("") as an escape in identifiers. For example, select 'O''Reilly' outputs O'Reilly, and CREATE TABLE "tab""le" ("i""d" INT) creates table tab"le with field i"d. Using "O""Reilly" as a literal, however, is a SQL dialect of Hive and MySQL.

I will make a more careful check, but how can we support 2 different dialects with one ANTLR lexer config? Could you point me to a demo PR that implemented something similar, which I can follow?

@HyukjinKwon
Member

If it complies with ANSI, let's cover this under the ANSI configuration, spark.sql.ansi.enabled.
If it doesn't, let's add a new configuration and disable it by default.

@dengziming
Member Author

dengziming commented Jun 23, 2025

Hi @HyukjinKwon
I have confirmed that the ANSI standard has relevant descriptions:

For string literal:

<character representation> ::= <nonquote character> | <quote symbol>
<nonquote character> ::= !! See the Syntax Rules.
<quote symbol> ::= <quote> <quote>
<quote> ::= '

For identifier:

<delimited identifier> ::= <double quote> <delimited identifier body> <double quote>
<delimited identifier body> ::= <delimited identifier part>...
<delimited identifier part> ::= <nondoublequote character> | <doublequote symbol>
<doublequote symbol> ::= "" !! two consecutive double quote characters

I don't think I need to cover this under the ANSI configuration, because #38147 already did it. Our behaviors will be as follows (dqi = doubleQuotedIdentifiers):

+---------------------------+----------------------+----------------------+
|                           | ANSI && dqi          | Hive                 |
+---------------------------+----------------------+----------------------+
| SELECT 'O''Reilly'        | literal: O'Reilly    | the same             |
+---------------------------+----------------------+----------------------+
| SELECT "O""Reilly"        | column: O"Reilly     | literal: O"Reilly    |
+---------------------------+----------------------+----------------------+
| CREATE TABLE t("I""d" int)| column: I"d          | error                |
+---------------------------+----------------------+----------------------+
| CREATE TABLE t('I''d' int)| error                | the same             |
+---------------------------+----------------------+----------------------+

I only changed the way we parse string concatenation. In both ANSI mode and Hive mode we should always treat 'O''Reilly' as O'Reilly; the previous way of treating 'O''Reilly' as concat('O', 'Reilly') can be regarded as a bug. A sketch of the matrix is below. Ping @gengliangwang to take a look, as you are the author of #38147.
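
For example, a sketch of two rows of the matrix (conf names per #38147; results assume this PR is applied):

// ANSI && dqi: "O""Reilly" is a delimited identifier, 'O''Reilly' a literal.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.conf.set("spark.sql.ansi.doubleQuotedIdentifiers", "true")
spark.sql("SELECT 'O''Reilly' AS v").show()       // literal: O'Reilly
// spark.sql("""SELECT "O""Reilly" FROM t""")     // would resolve column O"Reilly of t

// Hive-style default: "O""Reilly" is a string literal.
spark.conf.set("spark.sql.ansi.doubleQuotedIdentifiers", "false")
spark.sql("""SELECT "O""Reilly" AS v""").show()   // literal: O"Reilly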

@HyukjinKwon
Member

cc @gengliangwang

@dengziming
Member Author

ping @gengliangwang , also cc @cloud-fan and @srielau since you participated in #38022

@srielau
Contributor

srielau commented Jun 27, 2025

ping @gengliangwang , also cc @cloud-fan and @srielau since you participated in #38022

I like it in general. My worry is that the lexer solution does not permit control via a config, so we have the potential to break someone. A solution that hooks into the implicit concat code would give you that control.

@dengziming
Member Author

@srielau Thank you for your review. If I understand correctly, you mean that some users may have written select "Sp""ark", which equaled select "Spark" in the past, but after this change we lose this defect-induced feature.

So do we want to make these SQL statements bug-compatible by adding a config before stringLit+ so that concatenation has higher priority by default? I don't fancy the idea much, since that would make Spark different from all popular databases, but I can change it if this is how Spark treats defect-induced features. Let's reach a consensus before making this big change.

@@ -111,6 +111,201 @@ org.apache.spark.sql.catalyst.parser.ParseException
}


-- !query
select 1 from "not_""exists"
Contributor

do we have similar behavior for the backtick quoting?

Member Author

@dengziming dengziming Jun 27, 2025

Yes, Spark already supports a doubled backtick inside a backquoted identifier. Here is the grammar in SqlBaseLexer.g4: BACKQUOTED_IDENTIFIER : '`' ( ~'`' | '``' )* '`' ;
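
A quick shell check of this existing backtick behavior (not changed by this PR):

// A doubled backtick inside a backquoted identifier is unescaped to one backtick:
spark.sql("SELECT 1 AS `a``b`").columns  // Array(a`b)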

@srielau
Contributor

srielau commented Jun 27, 2025

@srielau Thank you for your review. If I understand correctly, you mean that some users may have written select "Sp""ark", which equaled select "Spark" in the past, but after this change we lose this defect-induced feature.

So do we want to make these SQL statements bug-compatible by adding a config before stringLit+ so that concatenation has higher priority by default? I don't fancy the idea much, since that would make Spark different from all popular databases, but I can change it if this is how Spark treats defect-induced features. Let's reach a consensus before making this big change.

I'm not worried about the use of double quotes for identifiers. I'm worried about the use of double and single quotes for strings.
The ability to coalesce back-to-back literals (without ||) is a feature. What we are now saying is: "unless there is no whitespace at all". We can argue: "Why would anyone do that? What's the point?" But we can't call it a bug.
So the question is: do we need to cater to this scenario by providing a switch?

@dengziming
Member Author

Do we need to cater to this scenario by providing a switch?

I think you are considering situations like select 'S' 'pa''rk': it used to be parsed as 'S' + 'pa' + 'rk' and will now be parsed as 'S' + 'pa''rk'. I think it's OK to change this, since it is a progressive move forward; I investigated several SQL systems and found that some, like PostgreSQL, only support concatenating consecutive literals that are separated by a newline (\n).
Can you also provide me with some examples and the expected behaviors that would not only support double-quote escaping but also maintain compatibility with old use cases? A sketch of the case in question is below.
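
To spell the case out (a sketch; results assume this PR is applied):

spark.sql("SELECT 'S' 'pa''rk' AS c").show()
// previously:   'S' + 'pa' + 'rk' -> Spark
// with this PR: 'S' + "pa'rk"     -> Spa'rk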

@srielau
Contributor

srielau commented Jun 28, 2025

Do we need to cater to this scenario by providing a switch?

I think you are considering situations like select 'S' 'pa''rk': it used to be parsed as 'S' + 'pa' + 'rk' and will now be parsed as 'S' + 'pa''rk'. I think it's OK to change this, since it is a progressive move forward; I investigated several SQL systems and found that some, like PostgreSQL, only support concatenating consecutive literals that are separated by a newline (\n). Can you also provide me with some examples and the expected behaviors that would not only support double-quote escaping but also maintain compatibility with old use cases?

Maybe I did not make myself clear: I agree this is a desirable change. But I disagree that the change is a bug fix. It is a behavior change. Typically behavior changes are guarded with a "legacy" config.
The question is whether we're OK to not have one. @cloud-fan

@dengziming
Member Author

dengziming commented Jun 28, 2025

@srielau Thank you for your clarification; I got your idea. We have 2 choices to implement this.

  1. Adding a config in the grammar file, which would be very complex, since we only want to keep the old behavior for string literals while using the new behavior for comments, locations, and other rules that don't support stringLit+.
  2. Since we only want to keep the old behavior when handling stringLit+, we can use a unified grammar in the lexing stage and add a config in the parsing stage to switch between the 2 handling logics, which can be accomplished by changing visitStringLiteral of the visitor.

I tried solution 2 (a rough sketch of the idea follows); you can take a look at my latest commit. I will submit some test cases after we agree on the solution.
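
A rough, self-contained sketch of the idea behind solution 2 (names and structure are illustrative, not the PR's actual visitStringLiteral code; real unescaping also handles backslash escapes):

// The lexer always emits whole quoted tokens; a parse-stage config then
// decides how each token body is unescaped when folding stringLit+.
def foldStringLits(tokens: Seq[String], legacy: Boolean): String = {
  def unescape(tok: String): String = {
    val quote = tok.head.toString                 // "'" or "\""
    val body = tok.substring(1, tok.length - 1)
    if (legacy) body.replace(quote * 2, "")       // old: 'a''b' -> ab
    else body.replace(quote * 2, quote)           // new: 'a''b' -> a'b
  }
  tokens.map(unescape).mkString                   // implicit concatenation
}

foldStringLits(Seq("'S'", "'pa''rk'"), legacy = true)   // "Spark"
foldStringLits(Seq("'S'", "'pa''rk'"), legacy = false)  // "Spa'rk"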

.internal()
.doc("When true, consecutive quotes in string literals (e.g. \"\" or '') are treated as " +
"literal characters rather than escape sequences. This preserves pre-Spark 3.0 behavior " +
"where \"a\"\"b\" would be parsed as 'a\"b' instead of 'ab'.")
Contributor

to make it more readable, can we use single quote string literal here?

Member Author

Yes, the original doc also contained some errors, and I refined them.

@@ -6398,6 +6408,8 @@ class SQLConf extends Serializable with Logging with SqlApiConf {

def escapedStringLiterals: Boolean = getConf(ESCAPED_STRING_LITERALS)

def legacyConsecutiveStringLiterals: Boolean = getConf(LEGACY_CONSECUTIVE_STRING_LITERALS)
Contributor

it's only used once, we can inline it.

Member Author

Done.

@@ -143,6 +143,16 @@ class ParserUtilsSuite extends SparkFunSuite {
// Guard against off-by-one errors in the "all chars are hex" routine:
assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa")

// Double-quote escaping ("")
assert(unescapeSQLString("\"\"\"aa\"\"\"") == "\"aa\"")
Contributor

can we write the test with Scala triple-quoted string literals ("""xxx""")?

Member Author

That can make it more readable, done.

@@ -154,6 +154,215 @@ org.apache.spark.sql.AnalysisException
}


-- !query
select 1 from "not_""exists"
Contributor

This PR updates string literal parsing. We don't need to test so many different SQL statements that contain string literals; simply SELECT the string literal and focus more on the literal itself:

SELECT 'a''b'
SELECT 'a' 'b'
...

Member Author

It's indeed overkill to test these double-quoted identifiers here, so I removed most of the unrelated tests, but I still kept select 1 from "not_""exists" since this PR also changed the behavior when handling double-quoted identifiers.

Member Author

@dengziming dengziming left a comment

@cloud-fan Thank you for your review, I resolved your comments, PTAL again.


@@ -143,6 +143,16 @@ class ParserUtilsSuite extends SparkFunSuite {
// Guard against off-by-one errors in the "all chars are hex" routine:
assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa")

// Double-quote escaping ("", '')
assert(unescapeSQLString(""""a""a"""") == "a\"a")
Contributor

@cloud-fan cloud-fan Jul 1, 2025

Suggested change
assert(unescapeSQLString(""""a""a"""") == "a\"a")
assert(unescapeSQLString(""" "a""a" """) == """a"a""")

Member Author

Done, also improved the lines below.



-- !query
SELECT "S""par""k", "S\"par\"k", 'S""par""k'
Contributor

can we test the cases when there are spaces in the middle?

Member Author

That's a good idea; I added some cases with various combinations of ", ', and spaces.
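
For instance (illustrative cases under the new rules):

spark.sql("""SELECT "S" "par""k" AS c1, 'S' "par" 'k' AS c2""").show()
// c1: Spar"k  ("" escapes inside the second piece, then the pieces concatenate)
// c2: Spark   (three adjacent literals concatenate)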

@dengziming dengziming changed the title [SPARK-52545][SQL] Double-quote should prioritize over implicit string concatenation [SPARK-52545][SQL] Standardize double-quote escaping to follow SQL specification Jul 1, 2025
@@ -143,6 +143,16 @@ class ParserUtilsSuite extends SparkFunSuite {
// Guard against off-by-one errors in the "all chars are hex" routine:
assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa")

// Double-quote escaping ("", '')
assert(unescapeSQLString(""""a""a"""") == """a"a""")
Contributor

super nit

Suggested change
assert(unescapeSQLString(""""a""a"""") == """a"a""")
assert(unescapeSQLString(""" "a""a" """.trim) == """ a"a """.trim)

then it's clearer what the input and output are



-- !query
SELECT "S""par""k", "S\"par\"k", 'S""par""k'
Contributor

super nit

Suggested change
SELECT "S""par""k", "S\"par\"k", 'S""par""k'
SELECT "S""par""k" AS c1, "S\"par\"k" AS c2, 'S""par""k' AS c3

to make it easier to match the result with the schema.

Member Author

That's nitpicky and reasonable. I fixed them all.

@cloud-fan
Contributor

The failed tests are unrelated, thanks, merging to master!

@cloud-fan cloud-fan closed this in c5ed408 Jul 2, 2025