Documentation regarding raw string literals and backslashes is incorrect / misleading #113418

@geofft

Documentation

Raw string literals have a surprising but ultimately understandable/necessary behavior involving a backslash followed by a quote. Specifically, it is not possible to end a raw string literal with a single backslash or an odd number of backslashes. This is documented at the FAQ entry https://docs.python.org/3/faq/design.html#why-can-t-raw-strings-r-strings-end-with-a-backslash :

Raw strings were designed to ease creating input for processors (chiefly regular expression engines) that want to do their own backslash escape processing. Such processors consider an unmatched trailing backslash to be an error anyway, so raw strings disallow that. In return, they allow you to pass on the string quote character by escaping it with a backslash. These rules work well when r-strings are used for their intended purpose.

But this is subtly incorrect. It's not precisely true that you can "pass on the string quote character by escaping it with a backslash." If it were, you would expect r'\'' to evaluate to ', a single quote, just like '\'' does. Instead it evaluates to \', that is, the backslash character followed by a single quote. There is no way to get a single quote, by itself, into a raw string literal delimited by single quotes (the same holds for double quotes, and for triple-quoted strings with respect to the three-quote delimiter sequence). If you do pass your literal to a regex engine, you are ultimately getting a quote character, yes, but you're not "passing on the string quote character," alone, escaped by the backslash; you are passing on the two-character sequence \' to the engine, whose processing might interpret that sequence as e.g. matching a single quote in the target string.
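To make the difference concrete, here is a small sketch comparing the ordinary and raw spellings, and then handing the raw value to the standard `re` module as one example of a processor that does its own backslash handling. The variable names are only illustrative.

```python
import re

cooked = '\''   # ordinary literal: backslash-quote collapses to a single quote
raw = r'\''     # raw literal: the backslash and the quote both survive

print(len(cooked), repr(cooked))   # 1 "'"
print(len(raw), repr(raw))         # 2 "\\'"

# The regex engine receives the two characters \' and itself decides that
# they mean "match one literal quote character".
print(re.fullmatch(raw, "'") is not None)   # True
```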

This is reasonable for the use case of regex engines, but calling the behavior "escaping" is misleading. (It could be seen as an escape sequence, in that it prevents the standard interpretation of the next character as the closing quote, but you would not describe '\n' as "escaping the letter n with a backslash," despite it preventing the standard interpretation of the letter n. Inside a raw string delimited by single quotes, \' is an escape sequence that expands to the two characters \', and there is no escape sequence that expands to ' alone.)

In other words, there are two things you can't do in a raw string literal: end the string's value with an odd number of backslashes, or contain the starting quote character without it being preceded by a backslash in the string's value.

Furthermore, the FAQ entry, as written, sounds like it's enforcing a rule because it's good for you (trailing backslashes are invalid in a regex) but the rule could be changed. There is an actual mechanical reason why a raw string literal cannot end with a single backslash / odd number of backslashes: when a parser in the middle of a raw string literal sees a single backslash followed by what would be a closing quote, it triggers the above interpretation. And when a parser in the middle of a raw string literal sees two backslashes, it is interpreted as two backslashes. So there is no possible syntax that could be used for a trailing backslash. (Put another way, because the sequence r'\'' is a complete string, the first four characters of it r'\' are necessarily an incomplete string.)
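The mechanical argument can be checked with a short sketch that asks the compiler whether a given piece of source text is a complete expression. The helper `is_complete_literal` is a hypothetical name for this example, and the source strings are assembled from `chr(92)` so that the example itself does not depend on any tricky quoting.

```python
def is_complete_literal(source: str) -> bool:
    """Return True if `source` compiles as a single Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


BS = chr(92)   # a single backslash
Q = "'"        # a single quote

four_chars = "r" + Q + BS + Q    # the source text r'\'
five_chars = four_chars + Q      # the source text r'\''

print(is_complete_literal(four_chars))   # False: the \' pair swallowed the closing quote
print(is_complete_literal(five_chars))   # True

# With n backslashes directly before the closing quote, only even n parses,
# because the backslashes are consumed in pairs from the left.
for n in range(1, 6):
    source = "r" + Q + BS * n + Q
    print(n, is_complete_literal(source))
```

In this sketch the loop reports a complete literal only for n = 2 and n = 4, which matches the "odd number of backslashes" rule described above.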

Documenting that would address the confusion in e.g. #75319 where one person asked (as I was also asking myself) why the syntax could not be extended to cover this - the FAQ makes it sound like it's just a choice to disallow it. Another person on that ticket also pointed out that the "escape" terminology is misleading.

Apart from the FAQ entry, the definition of raw string literals in section 2.4.1 of https://docs.python.org/3/reference/lexical_analysis.html is incomplete and does not point out these two surprising special cases. It simply says that they "treat backslashes as literal characters." They do appear literally in the output, but they are a bit more magical, because of the special interpretation of backslash-quote. If backslashes were just literal characters, any quote would immediately terminate the string (and the stringescapeseq/bytesescapeseq grammar productions would not apply to raw strings). The next section 2.4.1.1 covers this at the end, but it's pretty distant from the actual definition of raw strings and it uses the "escape" terminology (but, to its credit, clearly explains what is meant by that).
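As a rough illustration of the point that backslashes are not purely literal at the lexical level, the standard `tokenize` module can be asked how it splits the source text r'\''. If the backslash had no special role, the second quote would end the string; instead the whole thing comes back as one STRING token. The variable names are only illustrative.

```python
import io
import tokenize

BS, Q = chr(92), "'"
source = "r" + Q + BS + Q + Q + "\n"   # one line of source: r'\''

tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
string_tokens = [text for name, text in tokens if name == "STRING"]

# The backslash-quote pair did not terminate the literal, so all five
# characters belong to a single STRING token.
print(string_tokens)
```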

I would suggest something like the following for the FAQ entry:

Raw strings were designed to ease creating input for processors (chiefly regular expression engines) that want to do their own backslash escape processing, so backslashes are generally not processed specially at the syntax level. However, there needs to be a way to include the string's quote character itself inside a raw string. Because these processors do their own backslash handling, it is safe to send them an extra backslash before a quote character. So, Python interprets the two-character sequence of a backslash followed by the quote character as two literal characters to include in the raw string. The advantage of this interpretation is that the raw string remains raw: the exact same characters are in the source code as in the final string.

Because of this interpretation, a backslash followed by a closing quote already has a meaning - the quote is not interpreted as a closing quote. So, unfortunately, there is no syntax that can be used at the end of a raw string to mean a single backslash. (Similarly, there is also no syntax that can be used to include a quote character that is not preceded by a literal backslash.)
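The "raw string remains raw" property in the proposed wording can be sketched as a check that, for a few sample raw literals, the evaluated value is exactly the source text between the prefix and the closing quote. The sample list is an assumption made up for this example; `ast.literal_eval` is used only as a safe way to evaluate a literal from its source text.

```python
import ast

# Each entry is the source text of a raw literal (shown in the comment).
samples = [
    "r'\\''",              # r'\''
    'r"C:\\new\\table"',   # r"C:\new\table"
    "r'\\\\'",             # r'\\'
]

for source in samples:
    value = ast.literal_eval(source)
    body = source[2:-1]   # strip the r prefix, opening quote, and closing quote
    print(source, value == body)
```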

(leaving out the bits about "intended purpose" and "it's an error anyway"), and for the definition of raw strings, maybe merge the paragraph at the end of 2.4.1.1 into the one in 2.4.1 and rework it as something like

Both string and bytes literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings. In a raw string, the normal interpretation of backslashes does not apply, so that every character in the string literal, including backslashes, appears unmodified in the string's value. A single backslash followed by a quote character used to start the raw string prevents the quote character from being interpreted as a closing quote (or part of a closing triple-quote), but the backslash still appears in the string's value. (That is, r"spam\"spam" is a ten-character string.) Similarly, a single backslash followed by a newline is interpreted as those two characters as part of the literal (though it prevents a SyntaxError from the newline), not as a line continuation that does not appear in the value.
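The two behaviors described in this proposed wording can be checked with a short sketch. The multi-line literals are built as source strings from `chr(92)` and `chr(10)` and then evaluated, so the example does not rely on how an editor or the issue tracker renders a backslash at the end of a line; all names here are illustrative.

```python
BS = chr(92)   # backslash
NL = chr(10)   # newline

# Backslash-quote inside a double-quoted raw string: both characters are kept.
print(len(r"spam\"spam"))   # 10

# Source text of a single-quoted literal containing a, backslash, newline, b.
raw_source = "r'a" + BS + NL + "b'"
cooked_source = "'a" + BS + NL + "b'"

raw_value = eval(raw_source)
cooked_value = eval(cooked_source)

print(repr(raw_value))      # backslash and newline are both part of the value
print(repr(cooked_value))   # the backslash-newline pair is dropped entirely

# Without the backslash, a newline inside a single-quoted raw string is an error.
try:
    compile("r'a" + NL + "b'", "<example>", "eval")
    bare_newline_ok = True
except SyntaxError:
    bare_newline_ok = False
print(bare_newline_ok)
```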

(using "prevents it from being interpreted" language instead of "escapes"), and move the final sentence in the original ("Given that Python 2.x’s raw unicode literals behave differently than Python 3.x’s the 'ur' syntax is not supported.") two paragraphs down to where it talks about u literals.

Happy to send either of the above as a PR if this text seems agreeable. Alternatively, if it seems like a good starting point but someone wants to rework it, I believe I have a CLA on file, so consider it licensed appropriately.

Metadata

    Labels

    docs: Documentation in the Doc dir
