gh-63161: Fix tokenize detect_encoding() for non-ASCII coding #139235

vstinner · 2025-09-22T13:06:18Z

Issue: Non-UTF8 encoding line #63161

serhiy-storchaka

Please add a test for a coding cookie on the second line (with a non-ASCII first line).

Also add a test with an explicitly declared ASCII encoding but non-ASCII content that can still be decoded as UTF-8, e.g. '#coding=ascii €'.encode('utf-8'), and the corresponding two-line case.
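A minimal sketch of the inputs requested above, assuming only the standard `tokenize.detect_encoding()` API. The helper name `probe` and the exact byte strings are illustrative and are not taken from the PR's test suite. No expected output is shown for the non-ASCII cases, because the result depends on whether the interpreter includes this fix.

```python
import io
import tokenize


def probe(source: bytes) -> str:
    """Hypothetical helper: return the detected encoding, or 'SyntaxError'."""
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    except SyntaxError:
        return "SyntaxError"
    return encoding


# Control: ASCII cookie with pure-ASCII content.
ascii_only = b"#coding=ascii\nx = 1\n"

# Cookie on the first line, followed by a UTF-8 encoded euro sign.
one_line = "#coding=ascii €\n".encode("utf-8")

# Same cookie on the second line, after a non-ASCII first comment line.
two_lines = "#€\n#coding=ascii\n".encode("utf-8")

for label, data in [
    ("ascii only", ascii_only),
    ("one line", one_line),
    ("two lines", two_lines),
]:
    print(f"{label}: {probe(data)}")
```

The last two inputs are valid UTF-8 but not valid ASCII, so they show whether detection only trusts the cookie or also checks the line against the declared encoding.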

vstinner · 2025-09-22T14:18:42Z

@serhiy-storchaka: I added more tests, please review the updated PR. Is it what you wanted?

serhiy-storchaka

Thank you for the update. In the two-line cases, please use non-ASCII data in the first line, before the coding cookie. Test that the tokenizer uses the correct encoding to decode comments in the first line.

This may already be tested elsewhere, but I would also add tests for non-ASCII data in the first and second comment lines when no coding cookie is present (so UTF-8 should be used), covering both valid and invalid UTF-8.

I expect the tokenizer to correctly decode files that match the explicit or implicit encoding and to reject files that do not. The interpreter should behave the same way.
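The cases in this comment can be sketched as follows, again assuming only `tokenize.detect_encoding()`. The helper `detect` and the sample byte strings are hypothetical and are not the tests added in the PR. The last case (Latin-1 data before a cookie on the second line) is the one whose result is expected to differ between interpreters with and without this fix, so it is printed rather than asserted.

```python
import io
import tokenize


def detect(source: bytes):
    """Hypothetical helper: detected encoding name, or None if rejected."""
    try:
        return tokenize.detect_encoding(io.BytesIO(source).readline)[0]
    except SyntaxError:
        return None


cases = {
    # No cookie: the implicit encoding is UTF-8.
    "valid UTF-8, first line": "# é\n# plain\n".encode("utf-8"),
    "valid UTF-8, second line": "# plain\n# é\n".encode("utf-8"),
    # 0xE9 followed by a newline is not a valid UTF-8 sequence.
    "invalid UTF-8, first line": b"# \xe9\n# plain\n",
    "invalid UTF-8, second line": b"# plain\n# \xe9\n",
    # Non-ASCII Latin-1 data in the first line, cookie on the second line.
    "Latin-1 first line, cookie on second": "# é\n# coding: latin-1\n".encode("latin-1"),
}

for label, data in cases.items():
    print(f"{label}: {detect(data)}")
```

Without a cookie, the valid inputs should be reported as UTF-8 and the invalid ones rejected, which matches the expectation stated above.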

vstinner · 2025-09-23T14:28:39Z

Ok, I added more tests. Please review the updated PR.

gh-63161: Fix tokenize detect_encoding() for non-ASCII coding

fb7b944

vstinner requested review from pablogsal and lysnikolaou as code owners September 22, 2025 13:06

vstinner added the "needs backport to 3.13" (bugs and security fixes) and "needs backport to 3.14" (bugs and security fixes) labels Sep 22, 2025

bedevere-app bot added the awaiting core review label Sep 22, 2025

bedevere-app bot mentioned this pull request Sep 22, 2025

Non-UTF8 encoding line #63161

Open

Add NEWS entry

c535b65

serhiy-storchaka reviewed Sep 22, 2025

Add tests

5723fc5

serhiy-storchaka reviewed Sep 22, 2025

vstinner added 2 commits September 23, 2025 16:23

Add more tests

911dc3a

Test comments with no coding marker

e36d860
