Python throws “SyntaxError: Non-UTF-8 code start with \xe8...” when parse source file #79160

ausaki · 2018-10-14T02:01:39Z

BPO	34979
Nosy	@terryjreedy, @ezio-melotti, @serhiy-storchaka, @zhangyangyu, @tirkarthi, @ausaki
PRs	bpo-34979: fix "SyntaxError: Non-UTF-8 code start with \xe8..." caused by function decoding_fgets #9923

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2019-11-16.17:37:53.843>
created_at = <Date 2018-10-14.02:01:38.987>
labels = ['interpreter-core', 'type-bug']
title = 'Python throws \xe2\x80\x9cSyntaxError: Non-UTF-8 code start with \\xe8...\xe2\x80\x9d when parse source file'
updated_at = <Date 2019-11-16.17:37:53.840>
user = 'https://github.com/ausaki'

bugs.python.org fields:

activity = <Date 2019-11-16.17:37:53.840>
actor = 'ausaki'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2019-11-16.17:37:53.843>
closer = 'ausaki'
components = ['Interpreter Core']
creation = <Date 2018-10-14.02:01:38.987>
creator = 'ausaki'
dependencies = []
files = []
hgrepos = []
issue_num = 34979
keywords = ['patch']
message_count = 11.0
messages = ['327686', '327689', '327697', '327699', '327702', '327706', '327709', '327878', '356711', '356741', '356760']
nosy_count = 6.0
nosy_names = ['terry.reedy', 'ezio.melotti', 'serhiy.storchaka', 'xiang.zhang', 'xtreak', 'ausaki']
pr_nums = ['9923']
priority = 'normal'
resolution = 'duplicate'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue34979'
versions = ['Python 3.6']

ausaki · 2018-10-14T02:01:38Z

# demo.py
s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'

The file above is for testing. Its encoding is UTF-8, and the length of s is 1020 bytes (3 * 340).

When I run python3 demo.py in a terminal, Python throws the following error:

$ python3 -V
Python 3.6.4

$ python3 demo.py
  File "demo.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file demo.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

After reading the Python 3.6.6 source code, I found that this error is raised around line 630 (the bottom of the function decoding_fgets) of cpython/Parser/tokenizer.c.

When Python executes xxx.py, it calls decoding_fgets to read one line of raw bytes from the file into a buffer whose initial length is 1024 bytes. decoding_fgets then uses the function valid_utf8 to check the encoding of the raw bytes.

If the line of raw bytes is too long (for example, longer than 1023 bytes), Python calls decoding_fgets multiple times and grows the buffer by 1024 bytes each time. The raw bytes returned by a single decoding_fgets call may therefore be incomplete: for example, they may end with only part of a multi-byte character, which causes valid_utf8 to fail.

I suggest that we always use fp_readl to read source code from the file.
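
The chunking problem described above can be illustrated with a short Python sketch. It is only a model of the behaviour, not the tokenizer's code: the 1023-byte chunk size is an assumption based on the 1024-byte buffer mentioned above (one byte reserved for the terminating NUL), and is_valid_utf8 is a hypothetical stand-in for valid_utf8.

```python
# Model of reading the demo.py line in fixed-size chunks (not CPython's actual code).
CHUNK = 1023  # assumption: a 1024-byte buffer holds at most 1023 bytes plus a NUL


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical stand-in for valid_utf8: True if data decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same line as in demo.py: 340 characters, 3 bytes each in UTF-8.
line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")

# Split the line the way a single fixed-size read would.
first, rest = line[:CHUNK], line[CHUNK:]

print(len(line))             # total bytes in the line
print(is_valid_utf8(line))   # the whole line checked at once
print(is_valid_utf8(first))  # the first chunk checked on its own
print(hex(first[-1]))        # last byte of the first chunk
```

Under these assumptions the first chunk ends with the byte 0xe8, the first of the three bytes of 试, which matches the '\xe8' in the error message.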

tirkarthi · 2018-10-14T05:06:19Z

Thanks for the report. Is this a case of encoding not being declared at the top of the file or am I missing something?

➜ cpython git:(master) cat ../backups/bpo34979.py
s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'

print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
  File "../backups/bpo34979.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xe8' in file ../backups/bpo34979.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

# With encoding declared

➜ cpython git:(master) cat ../backups/bpo34979.py
# -- coding: utf-8 --

s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'

print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
str len :  340
bytes len :  1020

# Double the original string

➜ cpython git:(master) cat ../backups/bpo34979.py
# -- coding: utf-8 --

s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'

print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
str len :  680
bytes len :  2040

Thanks

ausaki · 2018-10-14T08:22:55Z

If you declare the encoding at the top of the file, everything is fine, because in that case Python uses io.open to open the file and stream.readline to read one line of code. See the function fp_setreadl in cpython/Parser/tokenizer.c for details.

If you do not declare the encoding, Python uses Py_UniversalNewlineFgets to read one line of raw bytes and checks the encoding of those raw bytes with valid_utf8.

In my opinion, since the file is encoded in UTF-8 and the default source encoding of Python 3 is UTF-8, it should work whether or not the encoding is declared.
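
The default described here can be observed from pure Python with tokenize.detect_encoding, which applies the PEP 263 rules. This is a sketch using the standard library rather than the C tokenizer discussed in this issue, so it does not go through decoding_fgets; the helper name source_encoding is hypothetical.

```python
import io
import tokenize


def source_encoding(source: bytes):
    """Return the encoding detected for source, or None if detection fails."""
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
        return encoding
    except SyntaxError:
        # Raised when the bytes are not valid UTF-8 and no usable declaration exists.
        return None


plain = "s = '测试'\n".encode("utf-8")            # UTF-8, no declaration
declared = b"# -*- coding: utf-8 -*-\n" + plain   # UTF-8, with declaration
gbk = "s = '测试'\n".encode("gbk")                # not UTF-8, no declaration

print(source_encoding(plain))
print(source_encoding(declared))
print(source_encoding(gbk))
```

The first two inputs are both detected as utf-8, while the undeclared GBK input is rejected, which is the case the "Non-UTF-8 code" error is meant for.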

tirkarthi · 2018-10-14T09:10:45Z

Got it. Thanks for the details and patience. I tested with fewer characters and it works fine, so declaring the encoding at the top is not a good way to test the original issue, as you mentioned. I then searched around and found bpo-14811, which has a test. It seems to be a very similar issue, and it has a patch that detects this scenario and throws a SyntaxError saying that the line is longer than the internal buffer, instead of an encoding-related error. I applied the patch to master and it throws an error about the internal buffer length as expected. The patch was not merged, though, and it seems Victor had another solution in mind as per msg167154. I tested with the patch as below:

# master

➜ cpython git:(master) cat ../backups/bpo34979.py

s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'

print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
  File "../backups/bpo34979.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file ../backups/bpo34979.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

# Applying the patch file from bpo-14811

➜ cpython git:(master) ✗ ./python.exe ../backups/bpo34979.py
File "../backups/bpo34979.py", line 2
SyntaxError: Line 2 of file ../backups/bpo34979.py is longer than the internal buffer (1024)

# Patch on master

diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index fc75bae537..48b3ac0ee9 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -586,6 +586,7 @@ static char *
 decoding_fgets(char *s, int size, struct tok_state *tok)
 {
     char *line = NULL;
+    size_t len;
     int badchar = 0;
     for (;;) {
         if (tok->decoding_state == STATE_NORMAL) {
@@ -597,6 +598,15 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
             /* We want a 'raw' read. */
             line = Py_UniversalNewlineFgets(s, size,
                                             tok->fp, NULL);
+           if (line != NULL) {
+                len = strlen(line);
+                if (1 < len && line[len-1] != '\n') {
+                    PyErr_Format(PyExc_SyntaxError,
+                            "Line %i of file %U is longer than the internal buffer (%i)",
+                                tok->lineno + 1, tok->filename, size);
+                    return error_ret(tok);
+                }
+            }
             break;
         } else {
             /* We have not yet determined the encoding.

If it's the same issue, then I think it would be good to close this issue and continue the discussion there, since that issue has a patch with a test and relevant discussion. Also, BUFSIZ seems to be platform dependent, so adding your platform details would help.

TIL about the difference between Python 2 and 3 in handling Unicode source files. Thanks again!
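
The check that the bpo-14811 patch adds can be modelled in Python as follows. This is a rough sketch, not the patch itself: read_line_checked is a hypothetical helper, and fp.readline(size - 1) is used as an approximation of fgets with a buffer of size bytes.

```python
import io


def read_line_checked(fp, size: int, lineno: int) -> bytes:
    """Read one raw line; reject it if it did not fit in a size-byte buffer."""
    # Like fgets, read at most size - 1 bytes, stopping after a newline.
    line = fp.readline(size - 1)
    # Same condition as the patch: more than one byte read, yet no trailing newline.
    if len(line) > 1 and not line.endswith(b"\n"):
        raise SyntaxError(
            f"Line {lineno} is longer than the internal buffer ({size})"
        )
    return line


short = io.BytesIO(b"x = 1\n")
long_ = io.BytesIO(("s = '" + "测试" * 170 + "'\n").encode("utf-8"))

print(read_line_checked(short, 1024, 1))
try:
    read_line_checked(long_, 1024, 1)
except SyntaxError as exc:
    print(exc)
```

In this sketch the condition also triggers for a last line that has no trailing newline; whether that matters for the real patch is not something this model can show.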

ausaki · 2018-10-14T11:12:30Z

I think these two issues are the same. Below is a patch I wrote; I hope it helps.

diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index 1af27bf..ba6fb3a 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -617,32 +617,21 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
         if (!check_coding_spec(line, strlen(line), tok, fp_setreadl)) {
             return error_ret(tok);
         }
-    }
-#ifndef PGEN
-    /* The default encoding is UTF-8, so make sure we don't have any
-       non-UTF-8 sequences in it. */
-    if (line && !tok->encoding) {
-        unsigned char *c;
-        int length;
-        printf("[DEBUG] - [decoding_fgets]: line = %s\n", line);
-        for (c = (unsigned char *)line; *c; c += length)
-            if (!(length = valid_utf8(c))) {
-                badchar = *c;
-                break;
+        if(!tok->encoding){
+            char* cs = new_string("utf-8", 5, tok);
+            int r = fp_setreadl(tok, cs);
+            if (r) {
+                tok->encoding = cs;
+                tok->decoding_state = STATE_NORMAL;
+            } else {
+                PyErr_Format(PyExc_SyntaxError,
+                             "You did not decalre the file encoding at the top of the file, "
+                             "and we found that the file is not encoding by utf-8,"
+                             "see http://python.org/dev/peps/pep-0263/ for details.");
+                PyMem_FREE(cs);
             }
+        }
     }
-    if (badchar) {
-        /* Need to add 1 to the line number, since this line
-           has not been counted, yet.  */
-        PyErr_Format(PyExc_SyntaxError,
-                "Non-UTF-8 code starting with '\\x%.2x' "
-                "in file %U on line %i, "
-                "but no encoding declared; "
-                "see http://python.org/dev/peps/pep-0263/ for details",
-                badchar, tok->filename, tok->lineno + 1);
-        return error_ret(tok);
-    }
-#endif
     return line;
 }

By the way, my platform is macOS Mojave, version 10.14.

tirkarthi · 2018-10-14T13:25:06Z

Thanks for the confirmation. I think the expected solution is to use a buffer that can be resized. CPython accepts GitHub PRs, so if you have time I would suggest raising a PR against the linked issue: a lot of people are subscribed there, and you would get good feedback.

As a suggestion, when you reply by email, please remove the quoted content, since it makes the message very long and hard to read in the bug tracker.

ausaki · 2018-10-14T13:53:16Z

Thanks for your suggestions. I will make a PR on GitHub.

The buffer is resizable now; please see cpython/Parser/tokenizer.c#L1043 <https://github.com/python/cpython/blob/master/Parser/tokenizer.c#L1043> for details.

serhiy-storchaka · 2018-10-17T09:14:15Z

This is a part of more general bpo-25643. I'll try to revive that issue.

terryjreedy · 2019-11-15T19:57:17Z

On Windows, with 3.7, 3.8.0, and master, neither the demo.py statement here nor the examples in bpo-38755 raise an error. I tried 'python -m module', running from the IDLE editor, and interactive IDLE and the REPL. Even the following worked.

>>> s = (b'\xe2\x96\x91'*1111111).decode()
>>> s[-10:]
'░░░░░░░░░░'

ausaki, what OS are you on, and do you have the same problem with current Python (at least 3.8)?

Also, ausaki, when replying by email, please delete the quoted message. When your message is added to the web page, the quoted message is redundant and distracting noise.

If this issue effectively duplicates (part of) bpo-14811 and/or bpo-25643, it should be closed as a duplicate of one of them.

ausaki · 2019-11-16T04:34:09Z

I think this issue is a duplicate of bpo-14811, so I will close it.

The key point of this issue is that the size of tok->buf is fixed and equal to BUFSIZ (defined in stdio.h, with a value that depends on the OS). A line of code is truncated if its size exceeds BUFSIZ, and the function valid_utf8 then fails.

You can increase the size of s to reproduce this issue.

✦ ➜ cat demo.py
s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测
试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'

✦ ➜ ./python -V
Python 3.7.4

✦ ➜ ./python demo.py
File "demo.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xe6' in file demo.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
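
A reproduction file like the one above can be generated instead of pasted. The following sketch is an assumption about how one might do that: write_demo and run_demo are hypothetical helpers, the character count is arbitrary, and whether the generated file triggers the error depends on the Python version and platform.

```python
import subprocess
import sys
from pathlib import Path


def write_demo(path: Path, n_chars: int) -> int:
    """Write a one-line file with no encoding declaration; return its size in bytes."""
    data = ("s = '" + "测试" * (n_chars // 2) + "'\n").encode("utf-8")
    path.write_bytes(data)
    return len(data)


def run_demo(path: Path):
    """Run the file with the current interpreter; return (exit code, stderr text)."""
    result = subprocess.run([sys.executable, str(path)], capture_output=True)
    return result.returncode, result.stderr.decode("utf-8", errors="replace")


# Example use (not executed here):
#     size = write_demo(Path("demo.py"), 4000)
#     code, err = run_demo(Path("demo.py"))
#     print(size, code, err)
```

A non-zero exit code together with the "Non-UTF-8 code" message in stderr would indicate that the interpreter under test is affected.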

ausaki · 2019-11-16T17:37:54Z

Duplicate of bpo-14811.

ausaki mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Oct 14, 2018

serhiy-storchaka self-assigned this Oct 17, 2018

ausaki mannequin closed this as completed Nov 16, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022
