discrepancy between tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename() #58834
Comments
(see http://mail.python.org/pipermail/python-dev/2012-April/118889.html)

The behavior of tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename() is unexpectedly different, and this has bearing on the current work on imports.

When a file has no encoding indicator (see PEP-263), it falls back to UTF-8 (see PEP-3120). The tokenize module (Lib/tokenize.py) facilitates this through "detect_encoding()". The CPython internal tokenizer (Python/tokenizer.c) does so through "PyTokenizer_FindEncodingFilename()". Both check the first two lines of the file, per PEP-263.

When faced with an unparsable file (per the encoding), tokenize.detect_encoding() will gladly give you the encoding without any fuss. However, PyTokenizer_FindEncodingFilename() will raise a SyntaxError in that situation.

The 'badsyntax_pep3120' test (Lib/test/badsyntax_pep3120.py) is one module that demonstrates this discrepancy. I'll use it in the following example.

---

For tokenize.detect_encoding():

    import tokenize
    enc = tokenize.detect_encoding(open("cpython/Lib/test/badsyntax_pep3120.py").readline)
    print(enc) # "utf-8" (no SyntaxError)

For PyTokenizer_FindEncodingFilename():

I've attached the source for a C extension module ('_tokenizer') that wraps PyTokenizer_FindEncodingFilename().

    import _tokenizer
    enc = _tokenizer.detect_encoding("cpython/Lib/test/badsyntax_pep3120.py")
    print(enc) # raises SyntaxError

---

Some relevant, related notes:

The discrepancies extend further too. The following code raises a UnicodeDecodeError, rather than a SyntaxError:

    tokenize.tokenize(open("/home/esnow/projects/import_cleanup/Lib/test/badsyntax_pep3120.py").readline)

In 3.1 (C-based import machinery, Python/import.c), the following results in a SyntaxError during encoding detection. In the current repo tip (importlib-based import machinery, Lib/importlib/_bootstrap.py), the following results in a SyntaxError much later, during compilation.

    import test.badsyntax_pep3120

importlib uses tokenize.detect_encoding() and import.c uses PyTokenizer_FindEncodingFilename()... |
New changeset b07488490001 by Martin v. Löwis in branch '3.2': New changeset 98a6a57c5876 by Martin v. Löwis in branch 'default': |
Thanks for the report. This is now fixed in 3.2 and default. Notice that your usage of tokenize is incorrect: you need to open the file in binary mode. |
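As an illustration of the binary-mode point, here is a minimal sketch of calling tokenize.detect_encoding() on a file opened with "rb". The helper names (detect_source_encoding, write_temp_source) and the sample byte strings are assumptions made up for this example; they are not taken from the attached sources or from the changesets. The last sample contains a byte that is not valid UTF-8, loosely in the spirit of badsyntax_pep3120.py, and the loop just reports whatever happens rather than assuming a particular outcome.

```python
import os
import tempfile
import tokenize


def detect_source_encoding(path):
    """Return the encoding that tokenize reports for the file at `path`."""
    # Binary mode matters: detect_encoding() expects readline to return bytes.
    with open(path, "rb") as source:
        encoding, _lines_read = tokenize.detect_encoding(source.readline)
    return encoding


def write_temp_source(data):
    """Hypothetical helper: write raw bytes to a temporary .py file."""
    fd, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


if __name__ == "__main__":
    samples = [
        b"x = 1\n",                             # no encoding cookie
        b"# -*- coding: latin-1 -*-\nx = 1\n",  # explicit PEP 263 cookie
        b'print("b\xf6se")\n',                  # not valid UTF-8, no cookie
    ]
    for data in samples:
        path = write_temp_source(data)
        try:
            print(detect_source_encoding(path))
        except SyntaxError as exc:
            print("SyntaxError:", exc)
        finally:
            os.remove(path)
```

Opening the file in text mode instead would make readline return str, which is not what detect_encoding() is documented to accept.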
Thanks, Martin! That did the trick. |
Apparently the message string contained by the SyntaxError is different between the two. I noticed due to the hard-coded check in test_find_module_encoding (in Lib/test/test_imp.py). I've brought up the specific issue of that hard-coded message check in bpo-14633. However, in case it otherwise matters that the message string be the same between the two, I've brought it up here. |
IMO, the test is flawed in testing for the specific error message. OTOH, the original message is better than the tokenize message in that it mentions the file name. However, tokenize does not have the file name available, so it can't possibly report it. I have no idea how to resolve this. Contributions are welcome. |
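One possible way around the missing file name, sketched here purely as an assumption and not as a description of what any changeset in this issue does, is to look at the object that owns the readline callable: a bound method of a file opened by path exposes that file through `__self__`, and the file carries a `name` attribute. The functions filename_from_readline and format_error below are hypothetical names for this illustration.

```python
import io


def filename_from_readline(readline):
    """Best-effort guess at the file name behind a bound readline method.

    Returns None when the callable is not a bound method, or when its owner
    has no string `name` (for example io.BytesIO, or a file opened from a
    raw file descriptor, whose name is an int).
    """
    owner = getattr(readline, "__self__", None)
    name = getattr(owner, "name", None)
    return name if isinstance(name, str) else None


def format_error(message, readline):
    """Append the file name to `message` when one can be recovered."""
    name = filename_from_readline(readline)
    if name is None:
        return message
    return "{} for '{}'".format(message, name)


if __name__ == "__main__":
    # An in-memory stream has no name, so the message is left unchanged.
    print(format_error("bad encoding", io.BytesIO(b"x = 1\n").readline))
```

The lookup is only a heuristic: callers that pass a plain function or a generator's `__next__` would still get a message without a file name.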
New changeset a281a6622714 by Brett Cannon in branch 'default': |
New changeset 1b57de8a8383 by Brett Cannon in branch 'default': |
Looks good. Thanks for the help, Martin and Brett. |
This change broke I was going to write a pull request to fix it, but I realized I don't understand how decoding errors should be reported. The current behavior is that
|