
decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...") #59016

Closed
vpython mannequin opened this issue May 15, 2012 · 20 comments
Labels
3.8 (only security fixes), 3.9 (only security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), topic-unicode, type-bug (An unexpected behavior, bug, or error)

Comments

@vpython
Mannequin

vpython mannequin commented May 15, 2012

BPO 14811
Nosy @pitrou, @vstinner, @tjguk, @ezio-melotti, @bitdancer, @briancurtin, @hynek, @serhiy-storchaka, @eryksun, @lysnikolaou, @pablogsal, @isidentical
Superseder
  • bpo-25643: Python tokenizer rewriting
  Files
  • t33a.py: test case demonstrating bug
  • detect_truncate.patch
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-04-13.17:07:04.612>
    created_at = <Date 2012-05-15.04:31:40.176>
    labels = ['interpreter-core', 'type-bug', '3.8', '3.9', 'expert-unicode']
    title = 'decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")'
    updated_at = <Date 2021-04-13.23:35:33.078>
    user = 'https://bugs.python.org/vpython'

    bugs.python.org fields:

    activity = <Date 2021-04-13.23:35:33.078>
    actor = 'pablogsal'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-04-13.17:07:04.612>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2012-05-15.04:31:40.176>
    creator = 'v+python'
    dependencies = []
    files = ['25593', '25605']
    hgrepos = []
    issue_num = 14811
    keywords = ['patch']
    message_count = 20.0
    messages = ['160679', '160686', '160688', '160697', '160701', '160705', '160706', '160708', '160709', '160767', '160772', '160807', '165841', '167154', '390969', '390974', '390975', '390978', '391015', '391017']
    nosy_count = 13.0
    nosy_names = ['pitrou', 'vstinner', 'tim.golden', 'ezio.melotti', 'v+python', 'r.david.murray', 'brian.curtin', 'hynek', 'serhiy.storchaka', 'eryksun', 'lys.nikolaou', 'pablogsal', 'BTaskaya']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '25643'
    type = 'behavior'
    url = 'https://bugs.python.org/issue14811'
    versions = ['Python 3.8', 'Python 3.9']

    @vpython
    Mannequin Author

    vpython mannequin commented May 15, 2012

    t33a.py demonstrates a compilation problem. OK, it has a long line, but making it one space longer (add a space after the left parenthesis) makes it work... so it must not be line length alone. Rather, since the error is about a bad UTF-8 character starting with \xc3, it seems that the UTF-8 decoder might play a role. I was surprised that I could reduce the test case by removing all the lines before and after these 3: the original failure was in a much longer file to which I added this line.

    The problem was originally detected in 3.2.2; I upgraded to 3.2.3 and it still occurred.

    @vpython vpython mannequin added the build (The build process and cross-build) and interpreter-core (Objects, Python, Grammar, and Parser dirs) labels May 15, 2012
    @vpython
    Mannequin Author

    vpython mannequin commented May 15, 2012

    Forgot to mention that I was running on Windows, 64-bit.

    @hynek
    Member

    hynek commented May 15, 2012

    Would you mind adding more information like the full traceback? By saying "compilation error", I presume you mean the compilation of the t33a.py file into byte code (and not compilation of Python itself)?

    I can't reproduce it with either the vanilla 3.2.3 on OS X or Ubuntu's 3.2.

    My only suspicion is that the platform default encoding has bitten you; does it also crash if you add "# -*- coding: utf-8 -*-" as the first line?

    @vpython
    Mannequin Author

    vpython mannequin commented May 15, 2012

    There is no traceback. Here is the text of the Syntax error.

    d:\my\im\infiles>c:\python32\python.exe d:\my\py\t33a.py -h
    File "d:\my\py\t33a.py", line 2
    SyntaxError: Non-UTF-8 code starting with '\xc3' in file d:\my\py\t33a.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

    My understanding is Python 3 uses utf-8 as the default encoding for source files -- unless there is an encoding line; and I've set my emacs to save all .py files as utf-8-unix (meaning with no CR, if you aren't an emacs user).

    I verified with a hex dump that the encoding in the file is UTF-8, but you are welcome to check as well; that is the file I uploaded.

    So your testing would seem to indicate it is a platform-specific bug. Try running it on Windows, then.

    Further, if it were the platform default encoding, adding a space wouldn't cure it... the encoding of the file would still be UTF-8, and the platform default encoding would still be the same whatever you think it might be (but I think it is UTF-8 for source text), so adding a space would not affect an encoding mismatch.

    @hynek
    Member

    hynek commented May 15, 2012

    You are right; it's the file system encoding that is platform dependent, not the file encoding.

    This space-after-parenthesis trigger is odd; I'm adding the Windows guys to the ticket. Please also tell us your exact version of Windows.

    @hynek hynek added the type-bug (An unexpected behavior, bug, or error) label and removed the interpreter-core (Objects, Python, Grammar, and Parser dirs) and build (The build process and cross-build) labels May 15, 2012
    @pitrou
    Member

    pitrou commented May 15, 2012

    I tried to reproduce but failed to compile a Windows Python - see bpo-14813.

    @serhiy-storchaka
    Member

    I can reproduce it on Linux. Minimal example:

    $ ./python -c "open('longline.py', 'w').write('#' + repr('\u00A1' * 4096) + '\n')"
    $ ./python longline.py
      File "longline.py", line 1
    SyntaxError: Non-UTF-8 code starting with '\xc2' in file longline.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

    @serhiy-storchaka
    Member

    And for Python 2.7 too.

    @serhiy-storchaka
    Member

    Function decoding_fgets (Parser/tokenizer.c) reads a line into a buffer of fixed size 8192 (the line is truncated to 8191 bytes) and then fails because the line is cut in the middle of a multibyte UTF-8 character.

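    For illustration, here is a minimal sketch of that failure mode in pure Python. It assumes the 8192-byte buffer described above and reuses the construction from the earlier reproduction; it is not the actual C code from Parser/tokenizer.c.

    # Sketch only: cutting a UTF-8 line at a fixed buffer boundary can split a
    # two-byte character, and the truncated chunk then fails to decode.
    line = "#" + repr("\u00a1" * 4096) + "\n"  # same construction as the reproduction above
    data = line.encode("utf-8")                # 8196 bytes: '#', "'", 4096 two-byte chars, "'", '\n'

    BUFSIZ = 8192                              # assumed buffer size (Linux)
    chunk = data[:BUFSIZ - 1]                  # a fixed buffer keeps only the first 8191 bytes

    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(exc)                             # the last kept byte is a lone 0xc2 lead byte
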
    @serhiy-storchaka serhiy-storchaka changed the title from "compile fails - UTF-8 character decoding" to "Syntax error on long UTF-8 lines" May 15, 2012
    @bitdancer
    Member

    By the way, Glenn, what you posted as "the syntax error" (which it was) *is* the traceback. A syntax error on the file directly being compiled will only have one line in the traceback.

    @vpython
    Mannequin Author

    vpython mannequin commented May 15, 2012

    Thanks, David, for the clarification. I had been mentally separating
    syntax errors from other errors.

    @vstinner
    Member

    Function decoding_fgets (Parser/tokenizer.c) reads a line into a buffer
    of fixed size 8192 (the line is truncated to 8191 bytes) and then fails
    because the line is cut in the middle of a multibyte UTF-8 character.

    It looks like BUFSIZ is much smaller than 8192 on Windows: it may be only 1024 bytes.

    Attached patch detects when a line is truncated (longer than the internal buffer).

    A better solution might be to reallocate the buffer when the string is longer than the buffer (i.e. write a universal fgets which grows the buffer while the line is read). Most functions parsing Python source code use a dynamic buffer. For example, "import module" now reads the whole file content before parsing it (see FileLoader.get_data() in Lib/importlib/_bootstrap.py).

    At least, we should use a longer buffer on Windows (e.g. use 8192 on all platforms?).

    I only found two functions parsing a Python file line by line: PyRun_InteractiveOneFlags() and PyRun_FileExFlags(). There are many variants of these functions (e.g. PyRun_InteractiveOne and PyRun_File). These functions are part of the C Python API and are used by programs to execute Python code when Python is embedded in a program.

    PS: As noticed by Serhiy Storchaka, the bug is not specific to Windows. It's just that the internal buffer is much smaller on Windows.

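    As a rough illustration of the dynamic-buffer idea in Python terms (a hypothetical helper, not the actual C change), reading the whole file and decoding it once means no line can be split in the middle of a character; this mirrors what FileLoader.get_data() effectively does for imports.

    # Hypothetical sketch of the "read everything, decode once" approach.
    # (Simplified: it ignores a PEP 263 encoding declaration and assumes UTF-8.)
    def read_source(path):
        with open(path, "rb") as f:
            data = f.read()            # whole file, no fixed-size line buffer
        return data.decode("utf-8")    # a long line can no longer be split mid-character
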
    @vstinner vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) and removed OS-windows labels May 16, 2012
    @vstinner vstinner changed the title from "Syntax error on long UTF-8 lines" to "decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")" May 16, 2012
    @hynek
    Member

    hynek commented Jul 19, 2012

    Are we going to fix this before 3.3? Any objections to Victor's patch?

    @vstinner
    Member

    vstinner commented Aug 1, 2012

    Are we going to fix this before 3.3? Any objections to Victor's patch?

    detect_truncate.patch now raises an error if a line is longer than BUFSIZ, whereas Python supports lines longer than BUFSIZ bytes (it's just that the encoding cookie is ignored if line 1 or 2 is longer than BUFSIZ bytes). So my patch is not correct.

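    To see why an unconditional "line too long" error would be too strict, here is a hedged check (assuming an affected, pre-3.10 interpreter; the file name longascii.py is made up): a pure-ASCII line far longer than BUFSIZ still runs fine, because the failure only happens when the cut lands inside a multibyte character.

    # Sketch only: a long single-byte (ASCII) line exceeds any plausible BUFSIZ
    # yet still compiles and runs, so over-long lines cannot simply be rejected.
    import subprocess
    import sys

    with open("longascii.py", "w", encoding="ascii") as f:
        f.write("#" + "x" * 20000 + "\n")   # much longer than 8192, single-byte chars only
        f.write("print('ok')\n")

    result = subprocess.run([sys.executable, "longascii.py"])
    print(result.returncode)                # expected: 0, even on affected versions
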
    @eryksun eryksun added the 3.8 (only security fixes) and 3.9 (only security fixes) labels Apr 13, 2021
    @pablogsal
    Member

    I don't get any error executing the t33a.py script

    @eryksun
    Contributor

    eryksun commented Apr 13, 2021

    I don't get any error executing the t33a.py script

    The second line in t33a.py is 1618 bytes. The standard I/O BUFSIZ in Linux is 8192 bytes, but it's only 512 bytes in Windows. The latest alpha release, 3.10a7, includes your rewrite of the tokenizer, and in that case t33a.py no longer fails in Windows.

    @pablogsal
    Member

    no longer fails in Windows.

    So that means we can close the issue, no?

    @vstinner
    Member

    With the example from https://bugs.python.org/issue14811#msg160706, I get a SyntaxError on Python 3.7, 3.8, 3.9 and 3.10.0a6. But I don't get an error on the master branch (Python 3.10.0a7+).

    Eryk:

    The latest alpha release, 3.10a7, includes your rewrite of the tokenizer, and in that case t33a.py no longer fails in Windows.

    Oh ok, this issue was fixed by the following commit, which is part of the v3.10.0a7 release:

    commit 261a452
    Author: Pablo Galindo <Pablogsal@gmail.com>
    Date: Sun Mar 28 23:48:05 2021 +0100

    bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050)
    

    @eryksun
    Contributor

    eryksun commented Apr 13, 2021

    So that means we can close the issue, no?

    This is a bug in 3.8 and 3.9, which need a fix that keeps reading until a "\n" is seen on the line. I arrived at this issue via bpo-38755, if you think it should be addressed there instead, but it's the same bug that's reported here.

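    In Python terms, the "keep reading until a newline is seen" idea looks roughly like the hypothetical helper below (not the actual C patch): fixed-size reads are accumulated until the line really ends, so a character split at a chunk boundary is completed before decoding.

    # Hypothetical sketch of "keep reading until '\n' is seen"; the 512-byte
    # chunk size stands in for the Windows stdio BUFSIZ mentioned above.
    def read_full_line(f, chunk_size=512):
        # f must be opened in binary mode ("rb")
        parts = []
        while True:
            chunk = f.readline(chunk_size)  # may stop mid-line after chunk_size bytes
            parts.append(chunk)
            if not chunk or chunk.endswith(b"\n"):
                break
        return b"".join(parts).decode("utf-8")  # decode only once the line is complete
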
    @pablogsal
    Member

    Ok, let's continue the discussion on https://bugs.python.org/issue38755

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022