bpo-42827: Fix crash on SyntaxError in multiline expressions #24140

lysnikolaou · 2021-01-06T12:53:37Z

When trying to extract the error line for the error message there
are two distinct cases:

The input comes from a file, which means that we can extract the
error line by using PyErr_ProgramTextObject and which we already
do.
The input does not come from a file, at which point we need to get
the source code from the tokenizer:
- If the tokenizer's current line number is the same with the line
  of the error, we get the line from tok->buf and we're ready.
- Else, we can extract the error line from the source code in the
  following two ways:
  - If the input comes from a string we have all the input
    in tok->str and we can extract the error line from it.
  - If the input comes from stdin, i.e. the interactive prompt, we
    do not have access to the previous line. That's why a new
    field tok->stdin_content is added which holds the whole input for the
    current (multiline) statement or expression. We can then extract the
    error line from tok->stdin_content like we do in the string case above.

https://bugs.python.org/issue42827

When trying to extract the error line for the error message there are two distinct cases: 1. The input comes from a file, which means that we can extract the error line by using `PyErr_ProgramTextObject` and which we already do. 2. The input does not come from a file, at which point we need to get the source code from the tokenizer: * If the tokenizer's current line number is the same with the line of the error, we get the line from `tok->buf` and we're ready. * Else, we can extract the error line from the source code in the following two ways: * If the input comes from a string we have all the input in `tok->str` and we can extract the error line from it. * If the input comes from stdin, i.e. the interactive prompt, we do not have access to the previous line. That's why a new field `tok->stdin_content` is added which holds the whole input for the current (multiline) statement or expression. We can then extract the error line from `tok->stdin_content` like we do in the string case above.

pablogsal · 2021-01-06T16:07:39Z

Thanks for the PR @lysnikolaou! I will review this soon but here is some initial question. My understanding is that p->tok->str is not decoded so there may be some issues with unicode. In particular, when I was making my version I got some failures with check(b'# -*- coding: cp1251 -*-\nPython = "\xcf\xb3\xf2\xee\xed" +', 2, 19, encoding='cp1251') when using p->tok->str. As there are no failing tests I assume it may be fine but it maybe is worth to double check.

Another though: another possibility is to not report the location of the error for stdin like the old parser does. This simplifies a bit this particular case and speeds up the parsing

Parser/tokenizer.c

lysnikolaou · 2021-01-07T18:08:54Z

As there are no failing tests I assume it may be fine but it maybe is worth to double check.

Yup, I checked this a bit more closely and it seems to be working fine. We can add some more tests, in order to be even more confident.

Another though: another possibility is to not report the location of the error for stdin like the old parser does. This simplifies a bit this particular case and speeds up the parsing

I like it a lot that the new parser reports the location in stdin as well and my feeling is that performance isn't such a big consideration here, since all the new code only runs, in case the input comes from the REPL. If you feel otherwise, we can of course explore other options as well.

bedevere-bot · 2021-01-07T22:21:15Z

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit bdfc2a4 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

Parser/pegen.c

Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>

Parser/pegen.c

Parser/tokenizer.c

bedevere-bot · 2021-01-07T22:59:14Z

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 78a0f9c 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

lysnikolaou · 2021-01-08T00:36:26Z

Bad news. The assertion assert(p->tok->fp == NULL || p->tok->fp == stdin) works for all cases except for one, which is when the tokenizer ended with tok->done == E_EOF, which signifies an unexpected EOF while parsing. Oddly enough in that case, tok->lineno is somewhat corrupted, since it points to a non-existent line (line 0 in my tests), so PyErr_ProgramTextObject fails and the assertion is reached. This also leads to the wrong line being reported in almost all of E_EOF erros in tokenizer-level:

➜  cpython git:(multiline-expr-crash) ✗ cat a.py 
\%                                                                                                                                                       
➜  cpython git:(multiline-expr-crash) ✗ ./python a.py        
  File "/home/lysnikolaou/repos/cpython/a.py", line 0
    \
     ^
SyntaxError: unexpected EOF while parsing
➜  cpython git:(multiline-expr-crash) ✗ cat b.py
"""abc%                                                                                                                                                  
➜  cpython git:(multiline-expr-crash) ✗ ./python b.py
  File "/home/lysnikolaou/repos/cpython/b.py", line 0
    """abc
          ^
SyntaxError: EOF while scanning triple-quoted string literal

Even more oddly, this wasn't a problem without the assertion, because p->tok->lineno == t->lineno and p->tok->buf still has the correct value, so the error line was computed correctly in that if.

Any thoughts on what we should do here? Fixing the tokenizer to point to the correct line upon an E_EOF error seems like the way to go, but it isn't at all trivial, at least for me.

pablogsal · 2021-01-08T01:58:00Z

Any thoughts on what we should do here? Fixing the tokenizer to point to the correct line upon an E_EOF error seems like the way to go, but it isn't at all trivial, at least for me.

Hummm, is this a problem that happens after the change in this PR or it happens already on master?

What does the old parser do with this?

Fixing the tokenizer to point to the correct line upon an E_EOF error seems like the way to go, but it isn't at all trivial, at least for me.

Isn't the tokenizer pointing to the last line upon E_EOF? Maybe I am misunderstanding the case where this happens.

Parser/pegen.c

Co-authored-by: Irit Katriel <iritkatriel@yahoo.com>

lysnikolaou · 2021-01-08T19:56:22Z

It was maybe a little bit too late last night, so I got a bit confused after a while and didn't explain the situation too well. Let me try to explain again after some more digging into the whole thing.

After some more consideration, I have reconsidered and this is not a tokenizer bug. It's an edge case, which we only handled correctly in the new parser code by accident, or at least it seems like that to me. Let me explain. When an E_EOF tokenizer error happens, which might be caused by a line continuation not being followed by another source line or by an unterminated triple-quoted string, p->tok->lineno does not point to the last line of source, it points to last_line + 1. This makes sense in my opinion, since it's not the last line that created the error, it's the non-existence of the following line. For example:

If we have the file a.py with the following content:

a = 0
\

Then the error looks like this:

➜  cpython git:(multiline-expr-crash) ✗ ./python a.py
  File "/home/lysnikolaou/repos/cpython/a.py", line 4
    \
     ^
SyntaxError: unexpected EOF while parsing

Note that it says line 4, while the file is 3 lines long.

So if this makes sense, why is it a problem? My thought was that for this case the error line would be found by the call to PyErr_ProgramTextObject on

cpython/Parser/pegen.c

Line 415 in 68e1f25

error_line = PyErr_ProgramTextObject(p->tok->filename, (int) lineno);

but it doesn't because lineno points to a non-existent line. And so the error line is actually found by looking at p->tok->buf on

cpython/Parser/pegen.c

Line 420 in 68e1f25

error_line = PyUnicode_DecodeUTF8(p->tok->buf, size, "replace");

That's why the assertion I added on 78a0f9c#diff-ec9a9d08c5b1d518997db0f5b4795c528a41efc35507d6c2a384eee43f047cb7R439 fails for this case.

How do we fix this? Removing the assertion works. Changing the assertion to assert(p->tok->fp == NULL || p->tok->fp == stdin || p->tok->done == E_EOF) also works. I think we should go with the latter and add a comment (no matter how long) that explains all this.

Hopefully everything is clear now. If not, let me know and I'll give it another go.

What does the old parser do with this?

Just FYI the old parser used to have a specialized struct perrdetail that had a text field that was used for the error line, so it didn't have to use tok->lineno or tok->buf directly and thus it was not affected by this.

lysnikolaou · 2021-01-08T19:56:47Z

Also, thanks a lot @iritkatriel for the review!

pablogsal

Thanks for the thorough explanation @lysnikolaou.

LGTM! Let's go with this. I assume if we go with something like #24161 that exercises this path more we will detect quickly if we missed anything

…H-24140) When trying to extract the error line for the error message there are two distinct cases: 1. The input comes from a file, which means that we can extract the error line by using `PyErr_ProgramTextObject` and which we already do. 2. The input does not come from a file, at which point we need to get the source code from the tokenizer: * If the tokenizer's current line number is the same with the line of the error, we get the line from `tok->buf` and we're ready. * Else, we can extract the error line from the source code in the following two ways: * If the input comes from a string we have all the input in `tok->str` and we can extract the error line from it. * If the input comes from stdin, i.e. the interactive prompt, we do not have access to the previous line. That's why a new field `tok->stdin_content` is added which holds the whole input for the current (multiline) statement or expression. We can then extract the error line from `tok->stdin_content` like we do in the string case above. Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>

lysnikolaou requested review from gvanrossum and pablogsal January 6, 2021 12:53

the-knights-who-say-ni added the CLA signed label Jan 6, 2021

bedevere-bot added the awaiting core review label Jan 6, 2021

lysnikolaou added 2 commits January 6, 2021 16:12

Remove smelly symbol

c928526

Add news blurb

5b9dfed

pablogsal reviewed Jan 6, 2021

View reviewed changes

Parser/tokenizer.c Show resolved Hide resolved

Add PyMem_Free for the new field in PyTokenizer_Free

bdfc2a4

pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 7, 2021

bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 7, 2021

pablogsal reviewed Jan 7, 2021

View reviewed changes

Parser/pegen.c Outdated Show resolved Hide resolved

Update lineno type to Py_ssize_t

75e714c

Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>

pablogsal reviewed Jan 7, 2021

View reviewed changes

Parser/pegen.c Outdated Show resolved Hide resolved

pablogsal reviewed Jan 7, 2021

View reviewed changes

Parser/pegen.c Show resolved Hide resolved

pablogsal reviewed Jan 7, 2021

View reviewed changes

Parser/tokenizer.c Outdated Show resolved Hide resolved

Address feedback

78a0f9c

pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 7, 2021

bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 7, 2021

Remove assertion in raise error function

1fbb092

pablogsal mentioned this pull request Jan 8, 2021

bpo-42864: Improve error messages regarding unclosed parentheses #24161

Merged

iritkatriel reviewed Jan 8, 2021

View reviewed changes

Parser/pegen.c Outdated Show resolved Hide resolved

Parser/pegen.c Outdated Show resolved Hide resolved

lysnikolaou and others added 2 commits January 8, 2021 21:07

Fix comment

c37af49

Co-authored-by: Irit Katriel <iritkatriel@yahoo.com>

Fix second comment

75cca9f

Improve comments and assertions

5f4fd51

pablogsal approved these changes Jan 14, 2021

View reviewed changes

bedevere-bot added awaiting merge and removed awaiting core review labels Jan 14, 2021

pablogsal merged commit e5fe509 into python:master Jan 14, 2021

bedevere-bot removed the awaiting merge label Jan 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

bpo-42827: Fix crash on SyntaxError in multiline expressions #24140

bpo-42827: Fix crash on SyntaxError in multiline expressions #24140

Uh oh!

lysnikolaou commented Jan 6, 2021 •

edited by bedevere-bot

Loading

Uh oh!

pablogsal commented Jan 6, 2021

Uh oh!

Uh oh!

lysnikolaou commented Jan 7, 2021

Uh oh!

bedevere-bot commented Jan 7, 2021

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bedevere-bot commented Jan 7, 2021

Uh oh!

lysnikolaou commented Jan 8, 2021 •

edited

Loading

Uh oh!

pablogsal commented Jan 8, 2021

Uh oh!

Uh oh!

Uh oh!

lysnikolaou commented Jan 8, 2021

Uh oh!

lysnikolaou commented Jan 8, 2021

Uh oh!

pablogsal left a comment

Uh oh!

Uh oh!

Uh oh!

bpo-42827: Fix crash on SyntaxError in multiline expressions #24140

bpo-42827: Fix crash on SyntaxError in multiline expressions #24140

Uh oh!

Conversation

lysnikolaou commented Jan 6, 2021 • edited by bedevere-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pablogsal commented Jan 6, 2021

Uh oh!

Uh oh!

lysnikolaou commented Jan 7, 2021

Uh oh!

bedevere-bot commented Jan 7, 2021

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bedevere-bot commented Jan 7, 2021

Uh oh!

lysnikolaou commented Jan 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pablogsal commented Jan 8, 2021

Uh oh!

Uh oh!

Uh oh!

lysnikolaou commented Jan 8, 2021

Uh oh!

lysnikolaou commented Jan 8, 2021

Uh oh!

pablogsal left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lysnikolaou commented Jan 6, 2021 •

edited by bedevere-bot

Loading

lysnikolaou commented Jan 8, 2021 •

edited

Loading