
discrepancy between tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename() #58834

Closed
ericsnowcurrently opened this issue Apr 20, 2012 · 10 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@ericsnowcurrently
Member

BPO 14629
Nosy @loewis, @brettcannon, @ericsnowcurrently
Files
  • _tokenizer.c: extension module wrapping PyTokenizer_FindEncodingFilename
  • setup.py: setup script for the extension module
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-04-20.12:39:37.692>
    created_at = <Date 2012-04-20.05:17:17.452>
    labels = ['type-bug', 'library']
    title = 'discrepancy between tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename()'
    updated_at = <Date 2012-04-20.17:35:20.572>
    user = 'https://github.com/ericsnowcurrently'

    bugs.python.org fields:

    activity = <Date 2012-04-20.17:35:20.572>
    actor = 'eric.snow'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-04-20.12:39:37.692>
    closer = 'loewis'
    components = ['Library (Lib)']
    creation = <Date 2012-04-20.05:17:17.452>
    creator = 'eric.snow'
    dependencies = []
    files = ['25283', '25284']
    hgrepos = []
    issue_num = 14629
    keywords = []
    message_count = 9.0
    messages = ['158797', '158824', '158825', '158831', '158839', '158847', '158849', '158855', '158858']
    nosy_count = 5.0
    nosy_names = ['loewis', 'brett.cannon', 'Arfrever', 'python-dev', 'eric.snow']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue14629'
    versions = ['Python 3.2', 'Python 3.3']

    @ericsnowcurrently
    Member Author

    (see http://mail.python.org/pipermail/python-dev/2012-April/118889.html)

    The behavior of tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename() is unexpectedly different and this has bearing on the current work on imports.

    When a file has no encoding indicator (see PEP-263) it falls back to UTF8 (see PEP-3120). The tokenize module (Lib/tokenize.py) facilitates this through "detect_encoding()". The CPython internal tokenizer (Python/tokenizer.c) does so through "PyTokenizer_FindEncodingFilename()". Both check the first two lines of the file, per PEP-263.
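
    For illustration, a minimal sketch of that fallback (not from the original report; io.BytesIO stands in for a file):

      import io
      import tokenize

      # No PEP-263 coding cookie anywhere, so detection falls back to
      # UTF-8 (PEP-3120) after consuming at most the first two lines.
      source = b"x = 1\ny = 2\n"
      encoding, consumed = tokenize.detect_encoding(io.BytesIO(source).readline)
      print(encoding)  # 'utf-8'
      print(consumed)  # [b'x = 1\n', b'y = 2\n'] -- the lines read during detection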

    When faced with an unparsable file (per the encoding), tokenize.detect_encoding() will gladly give you the encoding without any fuss. However, PyTokenizer_FindEncodingFilename() will raise a SyntaxError in that situation.

    The 'badsyntax_pep3120' test (Lib/test/badsyntax_pep3120.py) is one module that demonstrates this discrepancy. I'll use it in the following example.

    ---

    For tokenize.detect_encoding():

      import tokenize
      enc = tokenize.detect_encoding(open("cpython/Lib/test/badsyntax_pep3120.py").readline)
      print(enc)  # "utf-8" (no SyntaxError)

    For PyTokenizer_FindEncodingFilename():

    I've attached the source for a C extension module ('_tokenizer') that wraps PyTokenizer_FindEncodingFilename().

      import _tokenizer
      enc = _tokenizer.detect_encoding("cpython/Lib/test/badsyntax_pep3120.py")
      print(enc)  # raises SyntaxError

    ---

    Some relevant, related notes:

    The discrepancies extend further. The following code raises a UnicodeDecodeError rather than a SyntaxError:

      tokenize.tokenize(open("/home/esnow/projects/import_cleanup/Lib/test/badsyntax_pep3120.py").readline)

    In 3.1 (C-based import machinery, Python/import.c), the following results in a SyntaxError during encoding detection. In the current repo tip (importlib-based import machinery, Lib/importlib/_bootstrap.py), it results in a SyntaxError much later, during compilation.

      import test.badsyntax_pep3120

    importlib uses tokenize.detect_encoding() and import.c uses PyTokenizer_FindEncodingFilename()...

    @ericsnowcurrently ericsnowcurrently added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Apr 20, 2012
    @python-dev

    python-dev mannequin commented Apr 20, 2012

    New changeset b07488490001 by Martin v. Löwis in branch '3.2':
    Issue bpo-14629: Raise SyntaxError in tokenizer.detect_encoding
    http://hg.python.org/cpython/rev/b07488490001

    New changeset 98a6a57c5876 by Martin v. Löwis in branch 'default':
    merge 3.2: bpo-14629
    http://hg.python.org/cpython/rev/98a6a57c5876

    @loewis

    loewis mannequin commented Apr 20, 2012

    Thanks for the report. This is now fixed in 3.2 and default.

    Notice that your usage of tokenize is incorrect: you need to open the file in binary mode.
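
    For reference, a minimal sketch of the corrected usage (with the fix above, an undecodable file now raises SyntaxError during detection):

      import tokenize

      with open("cpython/Lib/test/badsyntax_pep3120.py", "rb") as f:
          try:
              encoding, _ = tokenize.detect_encoding(f.readline)
              print(encoding)
          except SyntaxError as exc:
              # Raised when the sampled lines cannot be decoded with the
              # declared (or default UTF-8) encoding.
              print("detection failed:", exc)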

    @loewis loewis mannequin closed this as completed Apr 20, 2012
    @ericsnowcurrently
    Member Author

    Thanks, Martin! That did the trick.

    @ericsnowcurrently
    Member Author

    Apparently the message string contained in the SyntaxError differs between the two. I noticed due to the hard-coded check in test_find_module_encoding (in Lib/test/test_imp.py). I've raised the specific issue of that hard-coded message check in bpo-14633. However, in case it otherwise matters that the two messages match, I've brought it up here.

    @loewis

    loewis mannequin commented Apr 20, 2012

    IMO, the test is flawed in checking for the specific error message. OTOH, the original message is better than the tokenize message in that it mentions the file name. However, tokenize does not have the file name available, so it can't possibly report it.

    I have no idea how to resolve this. Contributions are welcome.

    @python-dev

    python-dev mannequin commented Apr 20, 2012

    New changeset a281a6622714 by Brett Cannon in branch 'default':
    Issue bpo-14633: Simplify imp.find_modue() test after fixes from issue
    http://hg.python.org/cpython/rev/a281a6622714

    @python-dev

    python-dev mannequin commented Apr 20, 2012

    New changeset 1b57de8a8383 by Brett Cannon in branch 'default':
    Issue bpo-14629: Mention the filename in SyntaxError exceptions from
    http://hg.python.org/cpython/rev/1b57de8a8383

    @ericsnowcurrently
    Member Author

    Looks good. Thanks for the help, Martin and Brett.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @benrg

    benrg commented Apr 27, 2022

    This change broke the tokenize module on non-UTF-8 files; see issue #91972.

    I was going to write a pull request to fix it, but I realized I don't understand how decoding errors should be reported.

    The current behavior is that detect_encoding raises SyntaxError if the lines that it happens to read aren't valid UTF-8 (regardless of the declared encoding), while tokenize raises UnicodeDecodeError if any line isn't valid in the actual encoding. I doubt that's what was intended. Aside from the problem mentioned in #91972, it also means, for instance, that if the file has no encoding declaration and the first line is blank or a comment, you'll get a SyntaxError for an invalid byte in the first two lines, but a UnicodeDecodeError for an invalid byte on line 3 or later.
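
    A minimal sketch demonstrating that split, with the behavior described above (the file contents here are hypothetical, built in memory):

      import io
      import tokenize

      # Invalid UTF-8 byte within the first two lines:
      # detect_encoding() raises SyntaxError.
      early = b"# comment \xff\n"
      try:
          tokenize.detect_encoding(io.BytesIO(early).readline)
      except SyntaxError as exc:
          print("detect_encoding:", exc)

      # The same invalid byte on line 3: detection succeeds, and
      # tokenize() raises UnicodeDecodeError on reaching that line.
      late = b"\n# comment\nx = '\xff'\n"
      try:
          list(tokenize.tokenize(io.BytesIO(late).readline))
      except UnicodeDecodeError as exc:
          print("tokenize:", exc)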

    • Should UnicodeDecodeErrors really be translated to SyntaxErrors? compile does it, but potentially useful information is lost in the process. In fact, the column position returned by compile on decoding failure is not even correct: it's the byte position from the UnicodeDecodeError plus one, but the returned line is a str in which there may be preceding valid multibyte characters. The original UnicodeDecodeError is preserved only in stringified form.

    • Should detect_encoding really be raising an exception at all? I realize that was the whole point of this bug, but... the lines that it reads always have to be parsed again with the returned encoding, since they aren't necessarily comment lines, so nothing is gained by adding duplicate error-detection code for just those 1 or 2 lines.

      Perhaps somebody wants to scan a directory and find encoding errors without actually tokenizing the files. But their code is wrong if they depend on detect_encoding to spot errors, since it won't find most of them, and they probably don't want the SyntaxError it raises anyway (nor a UnicodeDecodeError with special first-lines-only information attached). They should just try to decode the whole file with the detected encoding, as in the sketch after this list.

      The behavior of _PyTokenizer_FindEncodingFilename seems to be just a side effect of the way it's implemented: it creates a tokenizer and looks for an encoding token. It has only one caller (display_source_line_with_margin in traceback.c), which rereads the file from the beginning after calling it. In the tokenize module, the logic is reversed: tokenize calls detect_encoding, and there's no reason for detect_encoding to raise SyntaxError – that's tokenize's job.
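
    A minimal sketch of that whole-file check (the helper name is hypothetical):

      import tokenize

      def decodes_cleanly(path):
          """Check the whole file against its detected encoding, rather
          than relying on detect_encoding() to spot errors."""
          with open(path, "rb") as f:
              try:
                  encoding, _ = tokenize.detect_encoding(f.readline)
              except SyntaxError:
                  return False  # undecodable bytes in the first two lines
              f.seek(0)
              try:
                  f.read().decode(encoding)
              except UnicodeDecodeError:
                  return False
          return True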
