
discrepancy between tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename() #58834

Closed
ericsnowcurrently opened this issue Apr 20, 2012 · 10 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@ericsnowcurrently
Member

BPO 14629
Nosy @loewis, @brettcannon, @ericsnowcurrently
Files
  • _tokenizer.c: extension module wrapping PyTokenizer_FindEncodingFilename
  • setup.py: setup script for the extension module
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-04-20.12:39:37.692>
    created_at = <Date 2012-04-20.05:17:17.452>
    labels = ['type-bug', 'library']
    title = 'discrepancy between tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename()'
    updated_at = <Date 2012-04-20.17:35:20.572>
    user = 'https://github.com/ericsnowcurrently'

    bugs.python.org fields:

    activity = <Date 2012-04-20.17:35:20.572>
    actor = 'eric.snow'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-04-20.12:39:37.692>
    closer = 'loewis'
    components = ['Library (Lib)']
    creation = <Date 2012-04-20.05:17:17.452>
    creator = 'eric.snow'
    dependencies = []
    files = ['25283', '25284']
    hgrepos = []
    issue_num = 14629
    keywords = []
    message_count = 9.0
    messages = ['158797', '158824', '158825', '158831', '158839', '158847', '158849', '158855', '158858']
    nosy_count = 5.0
    nosy_names = ['loewis', 'brett.cannon', 'Arfrever', 'python-dev', 'eric.snow']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue14629'
    versions = ['Python 3.2', 'Python 3.3']

    @ericsnowcurrently
    Member Author

    (see http://mail.python.org/pipermail/python-dev/2012-April/118889.html)

    The behavior of tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename() is unexpectedly different and this has bearing on the current work on imports.

    When a file has no encoding indicator (see PEP-263) it falls back to UTF8 (see PEP-3120). The tokenize module (Lib/tokenize.py) facilitates this through "detect_encoding()". The CPython internal tokenizer (Python/tokenizer.c) does so through "PyTokenizer_FindEncodingFilename()". Both check the first two lines of the file, per PEP-263.
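
    For illustration, a minimal sketch of that fallback (not from the original report; io.BytesIO stands in for a file):

      import io
      import tokenize

      # No PEP-263 coding cookie anywhere, so detection falls back to
      # UTF-8 (PEP-3120) after consuming at most the first two lines.
      source = b"x = 1\ny = 2\n"
      encoding, consumed = tokenize.detect_encoding(io.BytesIO(source).readline)
      print(encoding)  # 'utf-8'
      print(consumed)  # [b'x = 1\n', b'y = 2\n'] -- the lines read during detection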

    When faced with an unparsable file (per the encoding), tokenize.detect_encoding() will gladly give you the encoding without any fuss. However, PyTokenizer_FindEncodingFilename() will raise a SyntaxError in that situation.

    The 'badsyntax_pep3120' test (Lib/test/badsyntax_pep3120.py) is one module that demonstrates this discrepancy. I'll use it in the following example.

    ---

    For tokenize.detect_encoding():

      import tokenize
      enc = tokenize.detect_encoding(open("cpython/Lib/test/badsyntax_pep3120.py").readline)
      print(enc)  # "utf-8" (no SyntaxError)

    For PyTokenizer_FindEncodingFilename():

    I've attached the source for a C extension module ('_tokenizer') that wraps PyTokenizer_FindEncodingFilename().

      import _tokenizer
      enc = _tokenizer.detect_encoding("cpython/Lib/test/badsyntax_pep3120.py")
      print(enc)  # raises SyntaxError

    ---

    Some relevant, related notes:

    The discrepancies extend further. The following code raises a UnicodeDecodeError rather than a SyntaxError:

      tokenize.tokenize(open("/home/esnow/projects/import_cleanup/Lib/test/badsyntax_pep3120.py").readline)

    In 3.1 (C-based import machinery, Python/import.c), the following results in a SyntaxError during encoding detection. In the current repo tip (importlib-based import machinery, Lib/importlib/_bootstrap.py), it results in a SyntaxError much later, during compilation.

      import test.badsyntax_pep3120

    importlib uses tokenize.detect_encoding() and import.c uses PyTokenizer_FindEncodingFilename()...

    @ericsnowcurrently ericsnowcurrently added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Apr 20, 2012
    @python-dev

    python-dev mannequin commented Apr 20, 2012

    New changeset b07488490001 by Martin v. Löwis in branch '3.2':
    Issue bpo-14629: Raise SyntaxError in tokenizer.detect_encoding
    http://hg.python.org/cpython/rev/b07488490001

    New changeset 98a6a57c5876 by Martin v. Löwis in branch 'default':
    merge 3.2: bpo-14629
    http://hg.python.org/cpython/rev/98a6a57c5876

    @loewis

    loewis mannequin commented Apr 20, 2012

    Thanks for the report. This is now fixed in 3.2 and default.

    Notice that your usage of tokenize is incorrect: you need to open the file in binary mode.
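
    For reference, a minimal sketch of the corrected usage (with the fix above, an undecodable file now raises SyntaxError during detection):

      import tokenize

      with open("cpython/Lib/test/badsyntax_pep3120.py", "rb") as f:
          try:
              encoding, _ = tokenize.detect_encoding(f.readline)
              print(encoding)
          except SyntaxError as exc:
              # Raised when the sampled lines cannot be decoded with the
              # declared (or default UTF-8) encoding.
              print("detection failed:", exc)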

    @loewis loewis mannequin closed this as completed Apr 20, 2012
    @ericsnowcurrently
    Member Author

    Thanks, Martin! That did the trick.

    @ericsnowcurrently
    Member Author

    Apparently the message string contained in the SyntaxError differs between the two. I noticed due to the hard-coded check in test_find_module_encoding (in Lib/test/test_imp.py). I've raised the specific issue of that hard-coded message check in bpo-14633. However, in case it otherwise matters that the two messages match, I've brought it up here.

    @loewis

    loewis mannequin commented Apr 20, 2012

    IMO, the test is flawed in checking for the specific error message. OTOH, the original message is better than the tokenize message in that it mentions the file name. However, tokenize does not have the file name available, so it can't possibly report it.

    I have no idea how to resolve this. Contributions are welcome.

    @python-dev

    python-dev mannequin commented Apr 20, 2012

    New changeset a281a6622714 by Brett Cannon in branch 'default':
    Issue bpo-14633: Simplify imp.find_modue() test after fixes from issue
    http://hg.python.org/cpython/rev/a281a6622714

    @python-dev

    python-dev mannequin commented Apr 20, 2012

    New changeset 1b57de8a8383 by Brett Cannon in branch 'default':
    Issue bpo-14629: Mention the filename in SyntaxError exceptions from
    http://hg.python.org/cpython/rev/1b57de8a8383

    @ericsnowcurrently
    Member Author

    Looks good. Thanks for the help, Martin and Brett.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @benrg

    benrg commented Apr 27, 2022

    This change broke the tokenize module on non-UTF-8 files; see issue #91972.

    I was going to write a pull request to fix it, but I realized I don't understand how decoding errors should be reported.

    The current behavior is that detect_encoding raises SyntaxError if the lines that it happens to read aren't valid UTF-8 (regardless of the declared encoding), while tokenize raises UnicodeDecodeError if any line isn't valid in the actual encoding. I doubt that's what was intended. Aside from the problem mentioned in #91972, it also means, for instance, that if the file has no encoding declaration and the first line is blank or a comment, you'll get a SyntaxError for an invalid byte in the first two lines, but a UnicodeDecodeError for an invalid byte on line 3 or later.
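
    A minimal sketch demonstrating that split, with the behavior described above (the file contents here are hypothetical, built in memory):

      import io
      import tokenize

      # Invalid UTF-8 byte within the first two lines:
      # detect_encoding() raises SyntaxError.
      early = b"# comment \xff\n"
      try:
          tokenize.detect_encoding(io.BytesIO(early).readline)
      except SyntaxError as exc:
          print("detect_encoding:", exc)

      # The same invalid byte on line 3: detection succeeds, and
      # tokenize() raises UnicodeDecodeError on reaching that line.
      late = b"\n# comment\nx = '\xff'\n"
      try:
          list(tokenize.tokenize(io.BytesIO(late).readline))
      except UnicodeDecodeError as exc:
          print("tokenize:", exc)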

    • Should UnicodeDecodeErrors really be translated to SyntaxErrors? compile does it, but potentially useful information is lost in the process. In fact, the column position returned by compile on decoding failure is not even correct: it's the byte position from the UnicodeDecodeError plus one, but the returned line is a str in which there may be preceding valid multibyte characters. The original UnicodeDecodeError is preserved only in stringified form.

    • Should detect_encoding really be raising an exception at all? I realize that was the whole point of this bug, but... the lines that it reads always have to be parsed again with the returned encoding, since they aren't necessarily comment lines, so nothing is gained by adding duplicate error-detection code for just those 1 or 2 lines.

      Perhaps somebody wants to scan a directory and find encoding errors without actually tokenizing the files. But their code is wrong if they depend on detect_encoding to spot errors, since it won't find most of them, and they probably don't want the SyntaxError it raises anyway (nor a UnicodeDecodeError with special first-lines-only information attached). They should just try to decode the whole file with the detected encoding, as in the sketch after this list.

      The behavior of _PyTokenizer_FindEncodingFilename seems to be just a side effect of the way it's implemented: it creates a tokenizer and looks for an encoding token. It has only one caller (display_source_line_with_margin in traceback.c), which rereads the file from the beginning after calling it. In the tokenize module, the logic is reversed: tokenize calls detect_encoding, and there's no reason for detect_encoding to raise SyntaxError – that's tokenize's job.
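
    A minimal sketch of that whole-file check (the helper name is hypothetical):

      import tokenize

      def decodes_cleanly(path):
          """Check the whole file against its detected encoding, rather
          than relying on detect_encoding() to spot errors."""
          with open(path, "rb") as f:
              try:
                  encoding, _ = tokenize.detect_encoding(f.readline)
              except SyntaxError:
                  return False  # undecodable bytes in the first two lines
              f.seek(0)
              try:
                  f.read().decode(encoding)
              except UnicodeDecodeError:
                  return False
          return True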
