Edge case in compiler when error displaying with non-utf8 lines #88515
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
assignee = None closed_at = <Date 2021-06-08.23:55:24.045> created_at = <Date 2021-06-08.17:50:20.904> labels = ['interpreter-core', '3.9', '3.10', '3.11'] title = 'Edge case in compiler when error displaying with non-utf8 lines' updated_at = <Date 2021-06-09.00:29:32.909> user = 'https://github.com/ammaraskar'
activity = <Date 2021-06-09.00:29:32.909> actor = 'pablogsal' assignee = 'none' closed = True closed_date = <Date 2021-06-08.23:55:24.045> closer = 'pablogsal' components = ['Parser'] creation = <Date 2021-06-08.17:50:20.904> creator = 'ammar2' dependencies =  files =  hgrepos =  issue_num = 44349 keywords = ['patch'] message_count = 6.0 messages = ['395347', '395350', '395351', '395354', '395369', '395370'] nosy_count = 4.0 nosy_names = ['ammar2', 'lys.nikolaou', 'pablogsal', 'miss-islington'] pr_nums = ['26611', '26616'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = None url = 'https://bugs.python.org/issue44349' versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
The AST currently stores column offsets for characters as byte-offsets. However, when displaying errors, these byte-offsets must be turned into character-offsets so that the characters line up properly with the characters on the line when printed. This is done with the function
However, consider a file like this:
'┬ó┬ó┬ó┬ó┬ó┬ó' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError
File "test-normal.py", line 1
However, if we add a custom source encoding declaration:
# -*- coding: cp437 -*-
it ends up printing out
File "C:\Users\ammar\junk\test-utf16.py", line 2
where the carets/offsets are misaligned with the actual characters. This is because the string "┬ó" has a display width of 2 characters and encodes to 2 bytes in cp437, but those same 2 bytes, when interpreted as utf-8, form the single character "¢" with a display width of 1.
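The byte-level claim can be checked with a few lines of Python. This is only an illustrative sketch, not the compiler's code: `char_offset` is a hypothetical helper that counts the characters in a byte prefix, and the line used here is a shortened version of the example with three "┬ó" pairs instead of six.

```python
def char_offset(line: bytes, byte_offset: int, encoding: str) -> int:
    """Hypothetical helper: number of characters in the first byte_offset bytes."""
    return len(line[:byte_offset].decode(encoding))


# The two-character string from the report, as stored on disk in cp437.
shown = "┬ó"
raw = shown.encode("cp437")        # b'\xc2\xa2'
as_utf8 = raw.decode("utf-8")      # the same bytes read as utf-8: '¢'
print(len(shown), len(raw), repr(as_utf8), len(as_utf8))

# A shortened version of the offending line, encoded as cp437.
line = "'┬ó┬ó┬ó' + f(4, 'Hi' for x in range(1))".encode("cp437")
call_at = line.index(b"f(")        # byte offset of the call

# The same byte offset maps to different character offsets
# depending on which encoding is used to decode the prefix.
print(char_offset(line, call_at, "cp437"))   # 11
print(char_offset(line, call_at, "utf-8"))   # 8
```

The two results differ by one column per "┬ó" pair, which is the kind of drift that shows up as misplaced carets.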
Note that this edge case is relatively hard to trigger, because ordinarily the call to PyErr_ProgramTextObject fails when it tries to decode the line as utf-8:
So this bug requires the input to be valid as both utf-8 and the source encoding.
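To see which inputs fall into this category, one can test whether the raw bytes of a line decode under both encodings. The sketch below is an assumption-laden illustration: `decodes_as` and `is_ambiguous` are hypothetical names, and the check only mirrors the condition stated above rather than anything the interpreter does.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def is_ambiguous(data: bytes, source_encoding: str) -> bool:
    """Hypothetical check: bytes are valid as utf-8 and as the source encoding."""
    return decodes_as(data, "utf-8") and decodes_as(data, source_encoding)


# "┬ó" in cp437 is b'\xc2\xa2', which is also valid utf-8 (it reads as "¢").
print(is_ambiguous("'┬ó'".encode("cp437"), "cp437"))

# "é" in cp437 is b'\x82', a lone continuation byte, so utf-8 decoding fails.
print(is_ambiguous("'é'".encode("cp437"), "cp437"))
```

Only the first kind of line can reach the misaligned-caret path; the second fails to decode as utf-8 first.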