
Conversation

@serhiy-storchaka (Member) commented Oct 1, 2025

  • Support non-UTF-8 shebang and comments if non-UTF-8 encoding is specified.
  • Detect decoding error in comments for UTF-8 encoding.

Comment on lines +347 to +358
    const char *line = tok->lineno <= 2 ? tok->buf : tok->cur;
    int lineno = tok->lineno <= 2 ? 1 : tok->lineno;
    if (!tok->encoding) {
        /* The default encoding is UTF-8, so make sure we don't have any
           non-UTF-8 sequences in it. */
        if (!_PyTokenizer_ensure_utf8(line, tok, lineno)) {
            _PyTokenizer_error_ret(tok);
            return 0;
        }
    }
    else {
        PyObject *tmp = PyUnicode_Decode(line, strlen(line),
Contributor
Suggested change
Original:

    const char *line = tok->lineno <= 2 ? tok->buf : tok->cur;
    int lineno = tok->lineno <= 2 ? 1 : tok->lineno;
    if (!tok->encoding) {
        /* The default encoding is UTF-8, so make sure we don't have any
           non-UTF-8 sequences in it. */
        if (!_PyTokenizer_ensure_utf8(line, tok, lineno)) {
            _PyTokenizer_error_ret(tok);
            return 0;
        }
    }
    else {
        PyObject *tmp = PyUnicode_Decode(line, strlen(line),

Suggested:

    const int is_pseudo_line = (tok->lineno <= 2);
    const char *line = is_pseudo_line ? tok->buf : tok->cur;
    int lineno = is_pseudo_line ? 1 : tok->lineno;
    size_t slen = strlen(line);
    if (slen > (size_t)PY_SSIZE_T_MAX) {
        _PyTokenizer_error_ret(tok);
        return 0;
    }
    Py_ssize_t linelen = (Py_ssize_t)slen;
    if (!tok->encoding) {
        /* The default encoding is UTF-8, so make sure we don't have any
           non-UTF-8 sequences in it. */
        if (!_PyTokenizer_ensure_utf8(line, tok, lineno)) {
            _PyTokenizer_error_ret(tok);
            return 0;
        }
    }
    else {
        PyObject *tmp = PyUnicode_Decode(line, linelen,

@vstinner (Member) left a comment:
LGTM. I am not sure about the tokenizer changes, but I trust unit tests :-)

@serhiy-storchaka (Member, Author) replied:

Unfortunately, there was a regression which caused one of the existing tests to fail. Previously, a decoding error for the default (UTF-8) encoding was raised only when the tokenizer tried to decode an identifier or string literal, so the traceback showed the affected line with the identifier or string literal containing the undecodable bytes underlined. Now the error is raised at the beginning of parsing a string, or after reading a line from the file (but only for the first few lines).

Fixing this regression was not easy. But now the traceback shows the offending line with a cursor pointing exactly at the undecodable byte, and this works in more cases than before.

However, this did not and still does not work if the encoding is explicitly specified: in that case you get a SyntaxError without a correct reference to the position of the decoding error. That is a separate, complex issue.

Labels: awaiting merge, needs backport to 3.13 (bugs and security fixes), needs backport to 3.14 (bugs and security fixes)
3 participants