
decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...") #59016

Closed
vpython mannequin opened this issue May 15, 2012 · 20 comments
Labels
3.8 (only security fixes), 3.9 (only security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), topic-unicode, type-bug (An unexpected behavior, bug, or error)

Comments

@vpython
Mannequin

vpython mannequin commented May 15, 2012

BPO 14811
Nosy @pitrou, @vstinner, @tjguk, @ezio-melotti, @bitdancer, @briancurtin, @hynek, @serhiy-storchaka, @eryksun, @lysnikolaou, @pablogsal, @isidentical
Superseder
  • bpo-25643: Python tokenizer rewriting
  Files
  • t33a.py: test case demonstrating bug
  • detect_truncate.patch
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-04-13.17:07:04.612>
    created_at = <Date 2012-05-15.04:31:40.176>
    labels = ['interpreter-core', 'type-bug', '3.8', '3.9', 'expert-unicode']
    title = 'decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")'
    updated_at = <Date 2021-04-13.23:35:33.078>
    user = 'https://bugs.python.org/vpython'

    bugs.python.org fields:

    activity = <Date 2021-04-13.23:35:33.078>
    actor = 'pablogsal'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-04-13.17:07:04.612>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2012-05-15.04:31:40.176>
    creator = 'v+python'
    dependencies = []
    files = ['25593', '25605']
    hgrepos = []
    issue_num = 14811
    keywords = ['patch']
    message_count = 20.0
    messages = ['160679', '160686', '160688', '160697', '160701', '160705', '160706', '160708', '160709', '160767', '160772', '160807', '165841', '167154', '390969', '390974', '390975', '390978', '391015', '391017']
    nosy_count = 13.0
    nosy_names = ['pitrou', 'vstinner', 'tim.golden', 'ezio.melotti', 'v+python', 'r.david.murray', 'brian.curtin', 'hynek', 'serhiy.storchaka', 'eryksun', 'lys.nikolaou', 'pablogsal', 'BTaskaya']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '25643'
    type = 'behavior'
    url = 'https://bugs.python.org/issue14811'
    versions = ['Python 3.8', 'Python 3.9']

    @vpython
    Mannequin Author

    vpython mannequin commented May 15, 2012

    t33a.py demonstrates a compilation problem. OK, it has a long line, but making it one space longer (add a space after the left parenthesis) makes it work... so it must not be line length alone. Rather, since the error is about a bad UTF-8 character starting with \xc3, it seems that the UTF-8 decoder might play a role. I was surprised that I could reduce the test case by removing all the lines before and after these 3: the original failure was in a much longer file to which I added this line.

    The problem was originally detected in 3.2.2; I upgraded to 3.2.3 and it still occurred.

    @vpython vpython mannequin added the build (The build process and cross-build) and interpreter-core (Objects, Python, Grammar, and Parser dirs) labels May 15, 2012
    @vpython
    Mannequin Author

    vpython mannequin commented May 15, 2012

    Forgot to mention that I was running on Windows, 64-bit.

    @hynek
    Member

    hynek commented May 15, 2012

    Would you mind adding more information like the full traceback? By saying "compilation error", I presume you mean the compilation of the t33a.py file into byte code (and not compilation of Python itself)?

    I can't reproduce it with either the vanilla 3.2.3 on OS X or Ubuntu's 3.2.

    My only suspicion is that the platform default encoding has bitten you; does it also crash if you add "# -*- coding: utf-8 -*-" as the first line?

    @vpython
    Mannequin Author

    vpython mannequin commented May 15, 2012

    There is no traceback. Here is the text of the Syntax error.

    d:\my\im\infiles>c:\python32\python.exe d:\my\py\t33a.py -h
    File "d:\my\py\t33a.py", line 2
    SyntaxError: Non-UTF-8 code starting with '\xc3' in file d:\my\py\t33a.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

    My understanding is Python 3 uses utf-8 as the default encoding for source files -- unless there is an encoding line; and I've set my emacs to save all .py files as utf-8-unix (meaning with no CR, if you aren't an emacs user).

    I verified with a hex dump that the encoding in the file is UTF-8, but you are welcome to check as well; that is the file I uploaded.

    So your testing would seem to indicate it is a platform-specific bug. Try running it on Windows, then.

    Further, if it were the platform default encoding, adding a space wouldn't cure it... the encoding of the file would still be UTF-8, and the platform default encoding would still be the same whatever you think it might be (but I think it is UTF-8 for source text), so adding a space would not affect an encoding mismatch.

    @hynek
    Member

    hynek commented May 15, 2012

    You are right; it's the file system encoding that is platform dependent, not the file encoding.

    This space-after-parenthesis trigger is odd; I'm adding the Windows guys to the ticket. Please also tell us your exact version of Windows.

    @hynek hynek added the type-bug (An unexpected behavior, bug, or error) label and removed the interpreter-core (Objects, Python, Grammar, and Parser dirs) and build (The build process and cross-build) labels May 15, 2012
    @pitrou
    Member

    pitrou commented May 15, 2012

    I tried to reproduce but failed to compile a Windows Python - see bpo-14813.

    @serhiy-storchaka
    Member

    I can reproduce it on Linux. Minimal example:

    $ ./python -c "open('longline.py', 'w').write('#' + repr('\u00A1' * 4096) + '\n')"
    $ ./python longline.py
      File "longline.py", line 1
    SyntaxError: Non-UTF-8 code starting with '\xc2' in file longline.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

    @serhiy-storchaka
    Member

    And for Python 2.7 too.

    @serhiy-storchaka
    Member

    Function decoding_fgets (Parser/tokenizer.c) reads a line into a buffer of fixed size 8192 (the line is truncated to 8191 bytes) and then fails because the line is cut in the middle of a multibyte UTF-8 character.

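    For illustration, here is a minimal sketch of that failure mode in pure Python. It assumes the 8192-byte buffer described above and reuses the construction from the earlier reproduction; it is not the actual C code from Parser/tokenizer.c.

    # Sketch only: cutting a UTF-8 line at a fixed buffer boundary can split a
    # two-byte character, and the truncated chunk then fails to decode.
    line = "#" + repr("\u00a1" * 4096) + "\n"  # same construction as the reproduction above
    data = line.encode("utf-8")                # 8196 bytes: '#', "'", 4096 two-byte chars, "'", '\n'

    BUFSIZ = 8192                              # assumed buffer size (Linux)
    chunk = data[:BUFSIZ - 1]                  # a fixed buffer keeps only the first 8191 bytes

    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(exc)                             # the last kept byte is a lone 0xc2 lead byte
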
    @serhiy-storchaka serhiy-storchaka changed the title from "compile fails - UTF-8 character decoding" to "Syntax error on long UTF-8 lines" May 15, 2012
    @bitdancer
    Member

    By the way, Glenn, what you posted as "the syntax error" (which it was) *is* the traceback. A syntax error on the file directly being compiled will only have one line in the traceback.

    @vpython
    Mannequin Author

    vpython mannequin commented May 15, 2012

    Thanks, David, for the clarification. I had been mentally separating
    syntax errors from other errors.

    @vstinner
    Member

    Function decoding_fgets (Parser/tokenizer.c) reads a line into a buffer
    of fixed size 8192 (the line is truncated to 8191 bytes) and then fails
    because the line is cut in the middle of a multibyte UTF-8 character.

    It looks like BUFSIZ is much smaller than 8192 on Windows: it may be only 1024 bytes.

    Attached patch detects when a line is truncated (longer than the internal buffer).

    A better solution might be to reallocate the buffer when the string is longer than the buffer (i.e. write a universal fgets which grows the buffer while the line is read). Most functions parsing Python source code use a dynamic buffer. For example, "import module" now reads the whole file content before parsing it (see FileLoader.get_data() in Lib/importlib/_bootstrap.py).

    At least, we should use a longer buffer on Windows (e.g. use 8192 on all platforms?).

    I only found two functions parsing a Python file line by line: PyRun_InteractiveOneFlags() and PyRun_FileExFlags(). There are many variants of these functions (e.g. PyRun_InteractiveOne and PyRun_File). These functions are part of the C Python API and are used by programs to execute Python code when Python is embedded in a program.

    PS: As noticed by Serhiy Storchaka, the bug is not specific to Windows. It's just that the internal buffer is much smaller on Windows.

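    As a rough illustration of the dynamic-buffer idea in Python terms (a hypothetical helper, not the actual C change), reading the whole file and decoding it once means no line can be split in the middle of a character; this mirrors what FileLoader.get_data() effectively does for imports.

    # Hypothetical sketch of the "read everything, decode once" approach.
    # (Simplified: it ignores a PEP 263 encoding declaration and assumes UTF-8.)
    def read_source(path):
        with open(path, "rb") as f:
            data = f.read()            # whole file, no fixed-size line buffer
        return data.decode("utf-8")    # a long line can no longer be split mid-character
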
    @vstinner vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) and removed OS-windows labels May 16, 2012
    @vstinner vstinner changed the title from "Syntax error on long UTF-8 lines" to "decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")" May 16, 2012
    @hynek
    Member

    hynek commented Jul 19, 2012

    Are we going to fix this before 3.3? Any objections to Victor's patch?

    @vstinner
    Member

    vstinner commented Aug 1, 2012

    Are we going to fix this before 3.3? Any objections to Victor's patch?

    detect_truncate.patch now raises an error if a line is longer than BUFSIZ, whereas Python supports lines longer than BUFSIZ bytes (it's just that the encoding cookie is ignored if line 1 or 2 is longer than BUFSIZ bytes). So my patch is not correct.

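    To see why an unconditional "line too long" error would be too strict, here is a hedged check (assuming an affected, pre-3.10 interpreter; the file name longascii.py is made up): a pure-ASCII line far longer than BUFSIZ still runs fine, because the failure only happens when the cut lands inside a multibyte character.

    # Sketch only: a long single-byte (ASCII) line exceeds any plausible BUFSIZ
    # yet still compiles and runs, so over-long lines cannot simply be rejected.
    import subprocess
    import sys

    with open("longascii.py", "w", encoding="ascii") as f:
        f.write("#" + "x" * 20000 + "\n")   # much longer than 8192, single-byte chars only
        f.write("print('ok')\n")

    result = subprocess.run([sys.executable, "longascii.py"])
    print(result.returncode)                # expected: 0, even on affected versions
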
    @eryksun eryksun added the 3.8 (only security fixes) and 3.9 (only security fixes) labels Apr 13, 2021
    @pablogsal
    Member

    I don't get any error executing the t33a.py script

    @eryksun
    Contributor

    eryksun commented Apr 13, 2021

    I don't get any error executing the t33a.py script

    The second line in t33a.py is 1618 bytes. The standard I/O BUFSIZ in Linux is 8192 bytes, but it's only 512 bytes in Windows. The latest alpha release, 3.10a7, includes your rewrite of the tokenizer, and in that case t33a.py no longer fails in Windows.

    @pablogsal
    Member

    no longer fails in Windows.

    So that means we can close the issue, no?

    @vstinner
    Member

    With the example from https://bugs.python.org/issue14811#msg160706, I get a SyntaxError on Python 3.7, 3.8, 3.9 and 3.10.0a6. But I don't get an error on the master branch (Python 3.10.0a7+).

    Eryk:

    The latest alpha release, 3.10a7, includes your rewrite of the tokenizer, and in that case t33a.py no longer fails in Windows.

    Oh ok, this issue was fixed by the following commit, which is part of the v3.10.0a7 release:

    commit 261a452
    Author: Pablo Galindo <Pablogsal@gmail.com>
    Date: Sun Mar 28 23:48:05 2021 +0100

    bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050)
    

    @eryksun
    Contributor

    eryksun commented Apr 13, 2021

    So that means we can close the issue, no?

    This is a bug in 3.8 and 3.9, which need a fix that keeps reading until a "\n" is seen on the line. I arrived at this issue via bpo-38755, if you think it should be addressed there instead, but it's the same bug that's reported here.

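    In Python terms, the "keep reading until a newline is seen" idea looks roughly like the hypothetical helper below (not the actual C patch): fixed-size reads are accumulated until the line really ends, so a character split at a chunk boundary is completed before decoding.

    # Hypothetical sketch of "keep reading until '\n' is seen"; the 512-byte
    # chunk size stands in for the Windows stdio BUFSIZ mentioned above.
    def read_full_line(f, chunk_size=512):
        # f must be opened in binary mode ("rb")
        parts = []
        while True:
            chunk = f.readline(chunk_size)  # may stop mid-line after chunk_size bytes
            parts.append(chunk)
            if not chunk or chunk.endswith(b"\n"):
                break
        return b"".join(parts).decode("utf-8")  # decode only once the line is complete
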
    @pablogsal
    Member

    Ok, let's continue the discussion on https://bugs.python.org/issue38755

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022