Skip to content

[clang-format] Handle Trailing Whitespace After Line Continuation (P2223R2) #145243

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 25, 2025

Conversation

naveen-seth
Copy link
Contributor

@naveen-seth naveen-seth commented Jun 22, 2025

Fixes #145226.

Implement P2223R2 in clang-format to correctly handle cases where a backslash '\' is followed by trailing whitespace before the newline.
Previously, clang-format failed to properly detect and handle such cases, leading to misformatted code.

With this, clang-format matches the behavior already implemented in Clang's lexer and DependencyDirectivesScanner.cpp, which allow trailing whitespace after a line continuation in any C++ standard.

…223R2)

Fixes llvm#145226.

Implement P2223R2 in clang-format to correctly handle cases where a
backslash '\' is followed by trailing whitespace before the newline.
Previously, clang-format failed to properly detect and
handle such cases, leading to misformatted code.

With this, clang-format matches the behavior already implemented in
Clang's lexer and DependencyDirectivesScanner.cpp, which allow
trailing whitespace after a line continuation in any C++ standard.
@llvmbot llvmbot added clang Clang issues not falling into any other category clang-format labels Jun 22, 2025
@naveen-seth naveen-seth changed the title [clang-format] Handle Trailing Whitespace After Line Continuation (P2… [clang-format] Handle Trailing Whitespace After Line Continuation (P2223R2) Jun 22, 2025
@llvmbot
Copy link
Member

llvmbot commented Jun 22, 2025

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-format

Author: Naveen Seth Hanig (naveen-seth)

Changes

Fixes #145226.

Implement P2223R2 in clang-format to correctly handle cases where a backslash '\' is followed by trailing whitespace before the newline.
Previously, clang-format failed to properly detect and handle such cases, leading to misformatted code.

With this, clang-format matches the behavior already implemented in Clang's lexer and DependencyDirectivesScanner.cpp, which allow trailing whitespace after a line continuation in any C++ standard.


Full diff: https://github.com/llvm/llvm-project/pull/145243.diff

3 Files Affected:

  • (modified) clang/docs/ReleaseNotes.rst (+3)
  • (modified) clang/lib/Format/FormatTokenLexer.cpp (+23-8)
  • (added) clang/test/Format/splice-trailing-whitespace-p2223r2.cpp (+14)
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 96477ef6ddc9a..7c8e231655e86 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -1025,6 +1025,9 @@ clang-format
   ``enum`` enumerator lists.
 - Add ``OneLineFormatOffRegex`` option for turning formatting off for one line.
 - Add ``SpaceAfterOperatorKeyword`` option.
+- Support trailing whitespace in line splicing.
+  (P2223R2 <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2223r2.pdf>_, #GH145226)
+
 
 clang-refactor
 --------------
diff --git a/clang/lib/Format/FormatTokenLexer.cpp b/clang/lib/Format/FormatTokenLexer.cpp
index 4cc4f5f22db0d..bcc8e6ffd91ab 100644
--- a/clang/lib/Format/FormatTokenLexer.cpp
+++ b/clang/lib/Format/FormatTokenLexer.cpp
@@ -14,6 +14,7 @@
 
 #include "FormatTokenLexer.h"
 #include "FormatToken.h"
+#include "clang/Basic/CharInfo.h"
 #include "clang/Basic/SourceLocation.h"
 #include "clang/Basic/SourceManager.h"
 #include "clang/Format/Format.h"
@@ -1205,14 +1206,23 @@ static size_t countLeadingWhitespace(StringRef Text) {
   while (Cur < End) {
     if (isspace(Cur[0])) {
       ++Cur;
-    } else if (Cur[0] == '\\' && (Cur[1] == '\n' || Cur[1] == '\r')) {
-      // A '\' followed by a newline always escapes the newline, regardless
-      // of whether there is another '\' before it.
+    } else if (Cur[0] == '\\') {
+      // A '\' followed by a optional horizontal whitespace (P22232R2) and then
+      // newline  always escapes the newline, regardless  of whether there is
+      // another '\' before it.
       // The source has a null byte at the end. So the end of the entire input
       // isn't reached yet. Also the lexer doesn't break apart an escaped
       // newline.
-      assert(End - Cur >= 2);
-      Cur += 2;
+      const unsigned char *Lookahead = Cur + 1;
+      while (isHorizontalWhitespace(*Lookahead))
+        ++Lookahead;
+      if (*Lookahead == '\n' || *Lookahead == '\r') {
+        // Splice found, consume it.
+        Cur = Lookahead + 1;
+        continue;
+      }
+      // No line splice found; the '\' is a token.
+      break;
     } else if (Cur[0] == '?' && Cur[1] == '?' && Cur[2] == '/' &&
                (Cur[3] == '\n' || Cur[3] == '\r')) {
       // Newlines can also be escaped by a '?' '?' '/' trigraph. By the way, the
@@ -1295,13 +1305,18 @@ FormatToken *FormatTokenLexer::getNextToken() {
       case '/':
         // The text was entirely whitespace when this loop was entered. Thus
         // this has to be an escape sequence.
-        assert(Text.substr(i, 2) == "\\\r" || Text.substr(i, 2) == "\\\n" ||
-               Text.substr(i, 4) == "\?\?/\r" ||
+        assert(Text.substr(i, 4) == "\?\?/\r" ||
                Text.substr(i, 4) == "\?\?/\n" ||
                (i >= 1 && (Text.substr(i - 1, 4) == "\?\?/\r" ||
                            Text.substr(i - 1, 4) == "\?\?/\n")) ||
                (i >= 2 && (Text.substr(i - 2, 4) == "\?\?/\r" ||
-                           Text.substr(i - 2, 4) == "\?\?/\n")));
+                           Text.substr(i - 2, 4) == "\?\?/\n")) ||
+               (Text[i] == '\\' && [&]() -> bool {
+                 size_t j = i + 1;
+                 while (j < Text.size() && isHorizontalWhitespace(Text[j]))
+                   ++j;
+                 return j < Text.size() && (Text[j] == '\n' || Text[j] == '\r');
+               }()));
         InEscape = true;
         break;
       default:
diff --git a/clang/test/Format/splice-trailing-whitespace-p2223r2.cpp b/clang/test/Format/splice-trailing-whitespace-p2223r2.cpp
new file mode 100644
index 0000000000000..3cc08cc631965
--- /dev/null
+++ b/clang/test/Format/splice-trailing-whitespace-p2223r2.cpp
@@ -0,0 +1,14 @@
+// RUN: grep -Ev "// *[A-Z-]+:" %s \
+// RUN:   | clang-format -style='{BasedOnStyle: LLVM, AlignEscapedNewlines: DontAlign}' \
+// RUN:   | FileCheck -strict-whitespace %s
+
+// CHECK: {{^#define TAG\(\.\.\.\) \\}}
+// CHECK: {{^  struct a \{\};}}
+// There is whitespace following  v  this backslash!
+#define TAG(...)  struct a {     \     
+  };
+
+// CHECK: {{^int   i;}}
+// The comment below eats its following line because of the line splice.
+// I also have trailing whitespace. Nom nom nom \     
+int   i;

Copy link
Contributor

@HazardyKnusperkeks HazardyKnusperkeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please wait for @owenca .

Comment on lines -1299 to +1319
Text.substr(i, 4) == "\?\?/\r" ||
assert(Text.substr(i, 4) == "\?\?/\r" ||
Text.substr(i, 4) == "\?\?/\n" ||
(i >= 1 && (Text.substr(i - 1, 4) == "\?\?/\r" ||
Text.substr(i - 1, 4) == "\?\?/\n")) ||
(i >= 2 && (Text.substr(i - 2, 4) == "\?\?/\r" ||
Text.substr(i - 2, 4) == "\?\?/\n")));
Text.substr(i - 2, 4) == "\?\?/\n")) ||
(Text[i] == '\\' && [&]() -> bool {
size_t j = i + 1;
while (j < Text.size() && isHorizontalWhitespace(Text[j]))
++j;
return j < Text.size() && (Text[j] == '\n' || Text[j] == '\r');
}()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd delete the assertion. WDYT @HazardyKnusperkeks ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is, why was it put there. But I have a hard time understanding it, how can Text.substr(i, 4) be equal to ??/\r, when Text[i] is /?

I think an assert is fine, but it needs to be reformulated, and maybe add a comment about trigraph. I at least keep forgetting, that these things exist(ed).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion was originated from here. Maybe @sstwcw still remembers why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit (ed80aca), which addresses the other feedback, doesn’t change anything here. I’m also having trouble understanding this, so I’ll wait for @sstwcw’s input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was it put there? Before my patch, the code did not handle trigraphs. It looked at the character following the backslash. My patch added support for trigraphs. To continue checking for the character following the trigraph meant I had to add more code. I realized that I did not have to add code for checking past the first character because the entire block only ran when there was whitespace. So I removed the code for checking past the first character. Then @owenca asked why it was unnecessary to check further than the first character. I thought that if someone had to ask then that meant the reasoning was not obvious. So I added an assertion. I thought it could help developers understand the code.

Why can Text.substr(i, 4) be equal to ??/\r, when Text[i] is /? If you click on the arrow to see more lines around the changed code, you will see that the block also runs when the character is \ or ?.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did look around, but did not consider this while reading the assert...

Then the assert should stay, but maybe add a comment, as I said trigraphs are not on everyones mind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want him to add a comment in this pull request, or do you want me to add a comment separately?

I can expand the comment on line 1296 like this.

The code preceding the loop and in the countLeadingWhitespace function guarantees that Text is entirely whitespace, not including comments but including escaped newlines which may be escaped with a trigraph. So if 1 of these characters show up, then it has to be in an escape sequence.

Copy link
Contributor Author

@naveen-seth naveen-seth Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it’s best if you author the comment, but I (or you) can add your suggested comment to this PR and credit you as a co-author (@sstwcw). Otherwise, I’ll merge it as is.

If you’d like me to add it in this PR, do I need to re-request a review afterward? @HazardyKnusperkeks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done separately.

@owenca owenca removed the clang Clang issues not falling into any other category label Jun 24, 2025
@owenca owenca requested a review from sstwcw June 24, 2025 08:27
@naveen-seth naveen-seth requested a review from owenca June 24, 2025 14:52
@HazardyKnusperkeks HazardyKnusperkeks merged commit dd47b84 into llvm:main Jun 25, 2025
7 checks passed
@naveen-seth naveen-seth deleted the clang-format-p2223r2 branch June 26, 2025 11:06
anthonyhatran pushed a commit to anthonyhatran/llvm-project that referenced this pull request Jun 26, 2025
…223R2) (llvm#145243)

Fixes llvm#145226.

Implement
[P2223R2](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2223r2.pdf)
in clang-format to correctly handle cases where a backslash '\\' is
followed by trailing whitespace before the newline.
Previously, `clang-format` failed to properly detect and handle such
cases, leading to misformatted code.

With this, `clang-format` matches the behavior already implemented in
Clang's lexer and `DependencyDirectivesScanner.cpp`, which allow
trailing whitespace after a line continuation in any C++ standard.
rlavaee pushed a commit to rlavaee/llvm-project that referenced this pull request Jul 1, 2025
…223R2) (llvm#145243)

Fixes llvm#145226.

Implement
[P2223R2](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2223r2.pdf)
in clang-format to correctly handle cases where a backslash '\\' is
followed by trailing whitespace before the newline.
Previously, `clang-format` failed to properly detect and handle such
cases, leading to misformatted code.

With this, `clang-format` matches the behavior already implemented in
Clang's lexer and `DependencyDirectivesScanner.cpp`, which allow
trailing whitespace after a line continuation in any C++ standard.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clang-format format several times gives different result
5 participants