Python throws “SyntaxError: Non-UTF-8 code start with \xe8...” when parse source file #79160
Comments
The file above is for testing. Its encoding is UTF-8, the length of When execute
I've found that this error occurs at about line 630 (the bottom of the function When Python executes xxx.py, Python will call the function If the length of the raw bytes is too long (for example, greater than 1023 bytes), then Python will call I suggest that we should always use
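A minimal sketch of the failure mode described above, assuming the raw read behaves like C `fgets` with a fixed-size buffer. The buffer size of 1024 and the helper names `split_like_fgets` and `first_undecodable` are illustrative assumptions, not CPython internals. It shows that a long UTF-8 line cut at a fixed byte count can end in the middle of a multibyte character, so the first chunk is no longer valid UTF-8 on its own.

```python
# Hypothetical buffer size; the real value (BUFSIZ) is platform dependent.
BUF_SIZE = 1024

# One source line with 340 CJK characters (1020 bytes in UTF-8), as in the report.
line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")


def split_like_fgets(data, size):
    """Mimic fgets(buf, size, fp): at most size - 1 bytes per read,
    stopping early right after a newline."""
    chunks = []
    pos = 0
    while pos < len(data):
        end = min(pos + size - 1, len(data))
        newline = data.find(b"\n", pos, end)
        if newline != -1:
            end = newline + 1
        chunks.append(data[pos:end])
        pos = end
    return chunks


def first_undecodable(chunks):
    """Return (chunk index, offending byte) for the first chunk that is not
    valid UTF-8 on its own, or None if every chunk decodes."""
    for index, chunk in enumerate(chunks):
        try:
            chunk.decode("utf-8")
        except UnicodeDecodeError as exc:
            return index, chunk[exc.start:exc.start + 1]
    return None


chunks = split_like_fgets(line, BUF_SIZE)
print(len(line), [len(c) for c in chunks])
print(first_undecodable(chunks))
```

Under these assumptions the first chunk ends with the lone lead byte of a three-byte character, which is consistent with the `'\xe8'` in the reported error message, although the sketch does not prove that this is the exact code path CPython takes.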
Thanks for the report. Is this a case of the encoding not being declared at the top of the file, or am I missing something?
➜ cpython git:(master) cat ../backups/bpo34979.py
print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
File "../backups/bpo34979.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xe8' in file ../backups/bpo34979.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details # With encoding declared ➜ cpython git:(master) cat ../backups/bpo34979.py s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试' print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
str len : 340
bytes len : 1020 # Double the original string ➜ cpython git:(master) cat ../backups/bpo34979.py s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试' print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
str len : 680
bytes len : 2040
Thanks
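To see which encoding a source file is treated as having, with and without a PEP 263 declaration, the standard library's `tokenize.detect_encoding` can be used. This is only a sketch: it exercises the pure-Python `tokenize` module, not the C tokenizer discussed in this issue, and the helper name `source_encoding` and the sample sources are made up for illustration.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding the tokenize module detects for this source."""
    encoding, _consumed_lines = tokenize.detect_encoding(
        io.BytesIO(source_bytes).readline
    )
    return encoding


# No declaration: UTF-8 is the default source encoding in Python 3.
plain = "s = '测试'\n".encode("utf-8")
# Explicit PEP 263 declaration on the first line.
declared = b"# -*- coding: utf-8 -*-\n" + plain
# A declaration naming a different codec.
legacy = b"# -*- coding: gbk -*-\ns = 'x'\n"

print(source_encoding(plain))
print(source_encoding(declared))
print(source_encoding(legacy))
```

Both the undeclared and the declared UTF-8 file report the same encoding here, which matches the point made in this thread that the declaration should not be required for a UTF-8 file.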
If you declare the encoding at the top of the file, then everything is
But if you did not declare the encoding, then Python will use
In my opinion, when the encoding of the file is utf-8, and because the
On Sun, Oct 14, 2018 at 1:06 PM, Karthikeyan Singaravelan <report@bugs.python.org> wrote:
Got it. Thanks for the details and patience. I tested with less number of characters and it seems to work fine so using the encoding at the top is not a good way to test the original issue as you have mentioned. Then I searched around and found bpo-14811 with test. This seems to be a very similar issue and there is a patch to detect this scenario to throw SyntaxError that the line is longer than the internal buffer instead of an encoding related error. I applied the patch to master and it throws an error about the internal buffer length as expected. But the patch was not applied and it seems Victor had another solution in mind as per msg167154. I tested with the patch as below : # master ➜ cpython git:(master) cat ../backups/bpo34979.py s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试' print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
File "../backups/bpo34979.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file ../backups/bpo34979.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
# Applying the patch file from bpo-14811
➜ cpython git:(master) ✗ ./python.exe ../backups/bpo34979.py
# Patch on master
diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index fc75bae537..48b3ac0ee9 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -586,6 +586,7 @@ static char *
 decoding_fgets(char *s, int size, struct tok_state *tok)
 {
     char *line = NULL;
+    size_t len;
     int badchar = 0;
     for (;;) {
         if (tok->decoding_state == STATE_NORMAL) {
@@ -597,6 +598,15 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
             /* We want a 'raw' read. */
             line = Py_UniversalNewlineFgets(s, size,
                                             tok->fp, NULL);
+            if (line != NULL) {
+                len = strlen(line);
+                if (1 < len && line[len-1] != '\n') {
+                    PyErr_Format(PyExc_SyntaxError,
+                            "Line %i of file %U is longer than the internal buffer (%i)",
+                            tok->lineno + 1, tok->filename, size);
+                    return error_ret(tok);
+                }
+            }
             break;
         } else {
             /* We have not yet determined the encoding.
If it's the same issue, then I think closing this issue and continuing the discussion there would be good, since that issue has a patch with a test and relevant discussion. Also, it seems BUFSIZ is platform dependent, so adding your platform details would help. TIL about the difference between Python 2 and 3 in handling Unicode-related files. Thanks again!
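The condition added by the patch can be restated in Python to make its behaviour easier to reason about. This is a sketch of the check only, not of the tokenizer; the function name `line_exceeds_buffer` is made up for illustration.

```python
def line_exceeds_buffer(chunk):
    # Mirrors the C condition in the patch: more than one byte was read
    # and the last byte is not a newline, so the read stopped because the
    # buffer was full rather than because the line ended.
    return len(chunk) > 1 and not chunk.endswith(b"\n")


print(line_exceeds_buffer(b"x = 1\n"))           # complete line
print(line_exceeds_buffer(b"s = '\xe6\xb5"))      # read cut off mid-line
print(line_exceeds_buffer(b"print('no newline')"))  # last line, no newline
```

Note that, taken in isolation, this condition is also true for a final line that simply lacks a trailing newline, so a real fix would need to tell that case apart from a full buffer.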
I think these two issues are the same issue, and the following is a patch
By the way, my platform is macOS Mojave Version 10.14.
On Sun, Oct 14, 2018 at 5:10 PM, Karthikeyan Singaravelan <report@bugs.python.org> wrote:
Thanks for the confirmation. I think the expected solution is to use a buffer that can be resized. CPython accepts GitHub PRs, so if you have time I would suggest raising a PR against the linked issue; a lot of people have subscribed there, and you would get good feedback. As a suggestion, when you reply by email, please remove the quoted content, since it makes the message very long and hard to read in the bug tracker.
Thanks for your suggestions. I will make a PR on GitHub. The buffer is resizable now; please see cpython/Parser/tokenizer.c#L1043
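The idea of a resizable buffer can be sketched in Python as follows: keep reading fixed-size pieces and append them to a growing buffer until a newline arrives, and only then decode the complete line. The function `read_full_line` and the chunk size are assumptions for illustration; this is not the code at the tokenizer.c location linked above.

```python
import io


def read_full_line(fp, chunk_size=1024):
    """Read one complete line from a binary file object, growing the
    buffer as needed instead of truncating at chunk_size - 1 bytes."""
    buf = bytearray()
    while True:
        piece = fp.readline(chunk_size - 1)  # at most chunk_size - 1 bytes
        if not piece:                        # end of file
            break
        buf += piece
        if piece.endswith(b"\n"):            # the line is complete
            break
    return bytes(buf)


long_line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")
fp = io.BytesIO(long_line + b"print(len(s))\n")

first = read_full_line(fp)
print(len(first), len(first.decode("utf-8")))
print(read_full_line(fp))
```

Because decoding happens only after the whole line has been collected, a multibyte character can never be split between two decode calls.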
This is part of the more general bpo-25643. I'll try to revive that issue.
On Windows, with 3.7, 3.8.0, and master, neither the demo.py statement here nor the examples in bpo-38755 raise an error. I tried 'python -m module', running from the IDLE editor, and interactive IDLE and the REPL. Even the following worked.
>>> s = (b'\xe2\x96\x91'*1111111).decode()
>>> s[-10:]
'░░░░░░░░░░'
susaki, what OS, and do you have the same problem with current Python (at least 3.8)? Also, susuki, when replying by email, please delete the quoted message. When your message is added to the web page, the quoted message is redundant and distracting noise.
If this issue effectively duplicates (part of) bpo-14811 and/or bpo-25643, it should be closed as a duplicate of one of them.
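Since the behaviour appears to differ between platforms and versions, a small reproduction script may help when reporting results. The sketch below writes a UTF-8 file with one long non-ASCII line and no encoding declaration, then runs it with the current interpreter. The file name `demo.py`, the helper `run_long_line_script`, and the repeat count are assumptions chosen to match the examples in this thread.

```python
import os
import subprocess
import sys
import tempfile


def run_long_line_script(repeats=170):
    """Write a UTF-8 script whose first line holds 2 * repeats CJK characters,
    with no encoding declaration, and run it with the current interpreter."""
    source = "s = '" + "测试" * repeats + "'\nprint(len(s))\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.py")
        with open(path, "wb") as f:
            f.write(source.encode("utf-8"))
        return subprocess.run(
            [sys.executable, path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )


result = run_long_line_script()
print(sys.version)
print("return code:", result.returncode)
print(result.stdout or result.stderr)
```

A return code of 0 with the character count on stdout means the interpreter parsed the long line; a non-zero code with a `SyntaxError` on stderr reproduces the report.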
I think this issue is a duplicate of bpo-14811, so I will close it. The key point of this issue is that the size of You can increase the size of
✦ ➜ cat demo.py
✦ ➜ ./python -V
✦ ➜ ./python demo.py
Duplicate of bpo-14811