[SPARK-52545][SQL] Standardize double-quote escaping to follow SQL specification #51242


Closed · wants to merge 7 commits

Conversation

dengziming
Member

@dengziming dengziming commented Jun 22, 2025

What changes were proposed in this pull request?

This PR standardizes Spark's string literal parsing behavior to align with other database systems (MySQL, Hive, etc.) by implementing proper quote escaping rules:

Previous Behavior (Inconsistent):

spark.sql("""SELECT " ""aa"" " AS f1, " \"aa\" " AS f2, '""aa""' as f3""".stripMargin).show()
+----+------+------+
|  f1|    f2|    f3|
+----+------+------+
| aa | "aa" |""aa""|
+----+------+------+

New Behavior (Standard-Compliant):

spark.sql("""SELECT " ""aa"" " AS f1, " \"aa\" " AS f2, '""aa""' as f3""".stripMargin).show()

+------+------+------+
|    f1|    f2|    f3|
+------+------+------+
| "aa" | "aa" |""aa""|
+------+------+------+

Why are the changes needed?

The current behavior incorrectly treats consecutive quotes as string concatenation rather than as escapes. The new behavior aligns with major databases:

MySQL: Reference

Hive: Hplsql.g4 implementation

Does this PR introduce any user-facing change?

Yes.

  1. Consecutive quotes are now parsed as escapes: 'a''b' → a'b. Previously 'a''b' → ab, because the quotes were treated as literal separators for string concatenation. The legacy behavior can be restored via SET spark.sql.legacy.consecutiveStringLiterals.enabled=true;
  2. Double-quote escaping ("", '') is now consistently supported in all string literal contexts, such as column comments, partition directory names, and other SQL string literals;
  3. ANSI identifier enhancement: double-quoted identifiers now support quote escaping in ANSI mode, e.g. CREATE TABLE "t" ("I""d" INT). Note that while parsing allows escaped quotes, some downstream analysis may still reject identifiers containing special characters (e.g. " in table names). A sketch of changes 1 and 2 follows this list.
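
For illustration, a minimal spark-shell sketch of changes 1 and 2 (results assume this PR is applied; the table name is hypothetical):

// Change 1: consecutive quotes are escapes, not concatenation boundaries.
spark.sql("SELECT 'a''b' AS c").show()
// +---+
// |  c|
// +---+
// |a'b|
// +---+

// The legacy concatenation behavior can be restored per session:
spark.sql("SET spark.sql.legacy.consecutiveStringLiterals.enabled=true")
spark.sql("SELECT 'a''b' AS c").show()  // prints: ab

// Change 2: the same escaping works in other literal contexts,
// e.g. a column comment (demo_t is a hypothetical table):
spark.sql("CREATE TABLE demo_t (id INT COMMENT 'O''Reilly id') USING parquet")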

How was this patch tested?

Added some SQL tests based on the existing test file.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon
Member

Can we check the ANSI standard and see whether it clarifies this syntax? Then I think we can cover this change under the ANSI flag.

@dengziming dengziming changed the title [SPARK-52545][SQL] Handle "" as " in double-quoted literal [SPARK-52545][SQL] Double-quote should prioritize over implicit string concatenation Jun 23, 2025
@dengziming
Member Author

dengziming commented Jun 23, 2025

Can we check ANSI standard, and if they clarify this syntax?

Hello @HyukjinKwon. I made a quick check; it seems the ANSI standard supports a doubled single quote ('') as an escape in literals and a doubled double quote ("") as an escape in identifiers. For example, select 'O''Reilly' outputs O'Reilly, and CREATE TABLE "tab""le" ("i""d" INT) creates table tab"le with field i"d. Using "O""Reilly" as a literal, however, is a SQL dialect of Hive and MySQL.

I will make a more careful check, but how can we support 2 different dialects with one ANTLR lexer config? Could you point me to a demo PR that implemented something similar, which I can follow?

@HyukjinKwon
Member

If it complies with ANSI, let's cover this under the ANSI configuration, spark.sql.ansi.enabled.
If it doesn't, let's add a new configuration and disable it by default.

@dengziming
Member Author

dengziming commented Jun 23, 2025

Hi @HyukjinKwon
I have confirmed that the ANSI standard has relevant descriptions:

For string literal:

<character representation> ::= <nonquote character> | <quote symbol>
<nonquote character> ::= !! See the Syntax Rules.
<quote symbol> ::= <quote> <quote>
<quote> ::= '

For identifier:

<delimited identifier> ::= <double quote> <delimited identifier body> <double quote>
<delimited identifier body> ::= <delimited identifier part>...
<delimited identifier part> ::= <nondoublequote character> | <doublequote symbol>
<doublequote symbol> ::= "" !! two consecutive double quote characters

I don't think I need to cover this under the ANSI configuration, because #38147 already did it. Our behaviors will be as follows (dqi = doubleQuotedIdentifiers):

+---------------------------+----------------------+----------------------+
|                           | ANSI && dqi          | Hive                 |
+---------------------------+----------------------+----------------------+
| SELECT 'O''Reilly'        | literal: O'Reilly    | the same             |
+---------------------------+----------------------+----------------------+
| SELECT "O""Reilly"        | column: O"Reilly     | literal: O"Reilly    |
+---------------------------+----------------------+----------------------+
| CREATE TABLE t("I""d" int)| column: I"d          | error                |
+---------------------------+----------------------+----------------------+
| CREATE TABLE t('I''d' int)| error                | the same             |
+---------------------------+----------------------+----------------------+

I only changed the way we parse string concatenation. In both ANSI mode and Hive mode we should always treat 'O''Reilly' as O'Reilly; the previous way of treating 'O''Reilly' as concat('O', 'Reilly') can be regarded as a bug. A sketch of the matrix is below. Ping @gengliangwang to take a look, as you are the author of #38147.
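
For example, a sketch of two rows of the matrix (conf names per #38147; results assume this PR is applied):

// ANSI && dqi: "O""Reilly" is a delimited identifier, 'O''Reilly' a literal.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.conf.set("spark.sql.ansi.doubleQuotedIdentifiers", "true")
spark.sql("SELECT 'O''Reilly' AS v").show()       // literal: O'Reilly
// spark.sql("""SELECT "O""Reilly" FROM t""")     // would resolve column O"Reilly of t

// Hive-style default: "O""Reilly" is a string literal.
spark.conf.set("spark.sql.ansi.doubleQuotedIdentifiers", "false")
spark.sql("""SELECT "O""Reilly" AS v""").show()   // literal: O"Reilly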

@HyukjinKwon
Member

cc @gengliangwang

@dengziming
Member Author

ping @gengliangwang , also cc @cloud-fan and @srielau since you participated in #38022

@srielau
Contributor

srielau commented Jun 27, 2025

ping @gengliangwang , also cc @cloud-fan and @srielau since you participated in #38022

I like it in general. My worry is that the lexer solution does not permit control via a config, so we have the potential to break someone. A solution that hooks into the implicit concat code would give you that control.

@dengziming
Member Author

@srielau Thank you for your review. If I understand correctly, you mean that some users may have written select "Sp""ark", which equaled select "Spark" in the past, but after this change we lose this defect-induced feature.

So do we want to make these SQL statements bug-compatible by adding a config before stringLit+ so that concatenation has higher priority by default? I don't fancy the idea much, since that would make Spark different from all popular databases, but I can change it if this is how Spark treats defect-induced features. Let's reach a consensus before making this big change.

@@ -111,6 +111,201 @@ org.apache.spark.sql.catalyst.parser.ParseException
}


-- !query
select 1 from "not_""exists"
Contributor

do we have similar behavior for the backtick quoting?

Member Author

@dengziming dengziming Jun 27, 2025

Yes, Spark already supports a doubled backtick inside a backquoted identifier. Here is the grammar in SqlBaseLexer.g4: BACKQUOTED_IDENTIFIER : '`' ( ~'`' | '``' )* '`' ;
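
A quick shell check of this existing backtick behavior (not changed by this PR):

// A doubled backtick inside a backquoted identifier is unescaped to one backtick:
spark.sql("SELECT 1 AS `a``b`").columns  // Array(a`b)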

@srielau
Contributor

srielau commented Jun 27, 2025

@srielau Thank you for your review. If I understand correctly, you mean that some users may have written select "Sp""ark", which equaled select "Spark" in the past, but after this change we lose this defect-induced feature.

So do we want to make these SQL statements bug-compatible by adding a config before stringLit+ so that concatenation has higher priority by default? I don't fancy the idea much, since that would make Spark different from all popular databases, but I can change it if this is how Spark treats defect-induced features. Let's reach a consensus before making this big change.

I'm not worried about the use of double quotes for identifiers. I'm worried about the use of double and single quotes for strings.
The ability to coalesce back-to-back literals (without ||) is a feature. What we are now saying is: "unless there is no whitespace at all". We can argue: "Why would anyone do that? What's the point?" But we can't call it a bug.
So the question is: do we need to cater to this scenario by providing a switch?

@dengziming
Member Author

Do we need to cater to this scenario by providing a switch?

I think you are considering situations like select 'S' 'pa''rk': it used to be parsed as 'S' + 'pa' + 'rk' and will now be parsed as 'S' + 'pa''rk'. I think it's OK to change this, since it is a progressive move forward; I investigated several SQL systems and found that some, like PostgreSQL, only support concatenating consecutive literals that are separated by a newline (\n).
Can you also provide me with some examples and the expected behaviors that would not only support double-quote escaping but also maintain compatibility with old use cases? A sketch of the case in question is below.
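
To spell the case out (a sketch; results assume this PR is applied):

spark.sql("SELECT 'S' 'pa''rk' AS c").show()
// previously:   'S' + 'pa' + 'rk' -> Spark
// with this PR: 'S' + "pa'rk"     -> Spa'rk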

@srielau
Contributor

srielau commented Jun 28, 2025

Do we need to cater to this scenario by providing a switch?

I think you are considering situations like select 'S' 'pa''rk': it used to be parsed as 'S' + 'pa' + 'rk' and will now be parsed as 'S' + 'pa''rk'. I think it's OK to change this, since it is a progressive move forward; I investigated several SQL systems and found that some, like PostgreSQL, only support concatenating consecutive literals that are separated by a newline (\n). Can you also provide me with some examples and the expected behaviors that would not only support double-quote escaping but also maintain compatibility with old use cases?

Maybe I did not make myself clear: I agree this is a desirable change. But I disagree that the change is a bug fix. It is a behavior change. Typically behavior changes are guarded with a "legacy" config.
The question is whether we're OK to not have one. @cloud-fan

@dengziming
Member Author

dengziming commented Jun 28, 2025

@srielau Thank you for your clarification; I got your idea. We have 2 choices to implement this.

  1. Adding a config in the grammar file, which would be very complex, since we only want to keep the old behavior for string literals while using the new behavior for comments, locations, and other rules that don't support stringLit+.
  2. Since we only want to keep the old behavior when handling stringLit+, we can use a unified grammar in the lexing stage and add a config in the parsing stage to switch between the 2 handling logics, which can be accomplished by changing visitStringLiteral of the visitor.

I tried solution 2 (a rough sketch of the idea follows); you can take a look at my latest commit. I will submit some test cases after we agree on the solution.
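
A rough, self-contained sketch of the idea behind solution 2 (names and structure are illustrative, not the PR's actual visitStringLiteral code; real unescaping also handles backslash escapes):

// The lexer always emits whole quoted tokens; a parse-stage config then
// decides how each token body is unescaped when folding stringLit+.
def foldStringLits(tokens: Seq[String], legacy: Boolean): String = {
  def unescape(tok: String): String = {
    val quote = tok.head.toString                 // "'" or "\""
    val body = tok.substring(1, tok.length - 1)
    if (legacy) body.replace(quote * 2, "")       // old: 'a''b' -> ab
    else body.replace(quote * 2, quote)           // new: 'a''b' -> a'b
  }
  tokens.map(unescape).mkString                   // implicit concatenation
}

foldStringLits(Seq("'S'", "'pa''rk'"), legacy = true)   // "Spark"
foldStringLits(Seq("'S'", "'pa''rk'"), legacy = false)  // "Spa'rk"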

.internal()
.doc("When true, consecutive quotes in string literals (e.g. \"\" or '') are treated as " +
"literal characters rather than escape sequences. This preserves pre-Spark 3.0 behavior " +
"where \"a\"\"b\" would be parsed as 'a\"b' instead of 'ab'.")
Contributor

to make it more readable, can we use single quote string literal here?

Member Author

Yes, the original doc also contained some errors, and I refined them.

@@ -6398,6 +6408,8 @@ class SQLConf extends Serializable with Logging with SqlApiConf {

def escapedStringLiterals: Boolean = getConf(ESCAPED_STRING_LITERALS)

def legacyConsecutiveStringLiterals: Boolean = getConf(LEGACY_CONSECUTIVE_STRING_LITERALS)
Contributor

it's only used once, we can inline it.

Member Author

Done.

@@ -143,6 +143,16 @@ class ParserUtilsSuite extends SparkFunSuite {
// Guard against off-by-one errors in the "all chars are hex" routine:
assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa")

// Double-quote escaping ("")
assert(unescapeSQLString("\"\"\"aa\"\"\"") == "\"aa\"")
Contributor

can we write the test with Scala triple-quoted string literals ("""xxx""")?

Member Author

That can make it more readable, done.

@@ -154,6 +154,215 @@ org.apache.spark.sql.AnalysisException
}


-- !query
select 1 from "not_""exists"
Contributor

This PR updates string literal parsing. We don't need to test so many different SQL statements that contain string literals; simply SELECT the string literal and focus more on the literal itself:

SELECT 'a''b'
SELECT 'a' 'b'
...

Member Author

It's indeed overkill to test these double-quoted identifiers here, so I removed most of the unrelated tests, but I still kept select 1 from "not_""exists" since this PR also changed the behavior when handling double-quoted identifiers.

Member Author

@dengziming dengziming left a comment

@cloud-fan Thank you for your review, I resolved your comments, PTAL again.


@@ -143,6 +143,16 @@ class ParserUtilsSuite extends SparkFunSuite {
// Guard against off-by-one errors in the "all chars are hex" routine:
assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa")

// Double-quote escaping ("", '')
assert(unescapeSQLString(""""a""a"""") == "a\"a")
Contributor

@cloud-fan cloud-fan Jul 1, 2025

Suggested change
assert(unescapeSQLString(""""a""a"""") == "a\"a")
assert(unescapeSQLString(""" "a""a" """) == """a"a""")

Member Author

Done, also improved the lines below.



-- !query
SELECT "S""par""k", "S\"par\"k", 'S""par""k'
Contributor

can we test the cases when there are spaces in the middle?

Member Author

That's a good idea; I added some cases with various combinations of ", ', and spaces.
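
For instance (illustrative cases under the new rules):

spark.sql("""SELECT "S" "par""k" AS c1, 'S' "par" 'k' AS c2""").show()
// c1: Spar"k  ("" escapes inside the second piece, then the pieces concatenate)
// c2: Spark   (three adjacent literals concatenate)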

@dengziming dengziming changed the title [SPARK-52545][SQL] Double-quote should prioritize over implicit string concatenation [SPARK-52545][SQL] Standardize double-quote escaping to follow SQL specification Jul 1, 2025
@@ -143,6 +143,16 @@ class ParserUtilsSuite extends SparkFunSuite {
// Guard against off-by-one errors in the "all chars are hex" routine:
assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa")

// Double-quote escaping ("", '')
assert(unescapeSQLString(""""a""a"""") == """a"a""")
Contributor

super nit

Suggested change
assert(unescapeSQLString(""""a""a"""") == """a"a""")
assert(unescapeSQLString(""" "a""a" """.trim) == """ a"a """.trim)

then it's clearer what the input and output are



-- !query
SELECT "S""par""k", "S\"par\"k", 'S""par""k'
Contributor

super nit

Suggested change
SELECT "S""par""k", "S\"par\"k", 'S""par""k'
SELECT "S""par""k" AS c1, "S\"par\"k" AS c2, 'S""par""k' AS c3

to make it easier to match the result with the schema.

Member Author

That's nitpicky and reasonable. I fixed them all.

@cloud-fan
Contributor

The failed tests are unrelated, thanks, merging to master!

@cloud-fan cloud-fan closed this in c5ed408 Jul 2, 2025