Skip to content

bpo-42827: Fix crash on SyntaxError in multiline expressions #24140

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jan 14, 2021

Conversation

lysnikolaou
Copy link
Member

@lysnikolaou lysnikolaou commented Jan 6, 2021

When trying to extract the error line for the error message there
are two distinct cases:

  1. The input comes from a file, which means that we can extract the
    error line by using PyErr_ProgramTextObject and which we already
    do.
  2. The input does not come from a file, at which point we need to get
    the source code from the tokenizer:
    • If the tokenizer's current line number is the same with the line
      of the error, we get the line from tok->buf and we're ready.
    • Else, we can extract the error line from the source code in the
      following two ways:
      • If the input comes from a string we have all the input
        in tok->str and we can extract the error line from it.
      • If the input comes from stdin, i.e. the interactive prompt, we
        do not have access to the previous line. That's why a new
        field tok->stdin_content is added which holds the whole input for the
        current (multiline) statement or expression. We can then extract the
        error line from tok->stdin_content like we do in the string case above.

https://bugs.python.org/issue42827

When trying to extract the error line for the error message there
are two distinct cases:
1. The input comes from a file, which means that we can extract the
   error line by using `PyErr_ProgramTextObject` and which we already
   do.
2. The input does not come from a file, at which point we need to get
   the source code from the tokenizer:
   * If the tokenizer's current line number is the same with the line
     of the error, we get the line from `tok->buf` and we're ready.
   * Else, we can extract the error line from the source code in the
     following two ways:
     * If the input comes from a string we have all the input
       in `tok->str` and we can extract the error line from it.
     * If the input comes from stdin, i.e. the interactive prompt, we
       do not have access to the previous line. That's why a new
       field `tok->stdin_content` is added which holds the whole input for the
       current (multiline) statement or expression. We can then extract the
       error line from `tok->stdin_content` like we do in the string case above.
@pablogsal
Copy link
Member

Thanks for the PR @lysnikolaou! I will review this soon but here is some initial question. My understanding is that p->tok->str is not decoded so there may be some issues with unicode. In particular, when I was making my version I got some failures with check(b'# -*- coding: cp1251 -*-\nPython = "\xcf\xb3\xf2\xee\xed" +', 2, 19, encoding='cp1251') when using p->tok->str. As there are no failing tests I assume it may be fine but it maybe is worth to double check.

Another though: another possibility is to not report the location of the error for stdin like the old parser does. This simplifies a bit this particular case and speeds up the parsing

@lysnikolaou
Copy link
Member Author

As there are no failing tests I assume it may be fine but it maybe is worth to double check.

Yup, I checked this a bit more closely and it seems to be working fine. We can add some more tests, in order to be even more confident.

Another though: another possibility is to not report the location of the error for stdin like the old parser does. This simplifies a bit this particular case and speeds up the parsing

I like it a lot that the new parser reports the location in stdin as well and my feeling is that performance isn't such a big consideration here, since all the new code only runs, in case the input comes from the REPL. If you feel otherwise, we can of course explore other options as well.

@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 7, 2021
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit bdfc2a4 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 7, 2021
Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 7, 2021
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 78a0f9c 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 7, 2021
@lysnikolaou
Copy link
Member Author

lysnikolaou commented Jan 8, 2021

Bad news. The assertion assert(p->tok->fp == NULL || p->tok->fp == stdin) works for all cases except for one, which is when the tokenizer ended with tok->done == E_EOF, which signifies an unexpected EOF while parsing. Oddly enough in that case, tok->lineno is somewhat corrupted, since it points to a non-existent line (line 0 in my tests), so PyErr_ProgramTextObject fails and the assertion is reached. This also leads to the wrong line being reported in almost all of E_EOF erros in tokenizer-level:

➜  cpython git:(multiline-expr-crash) ✗ cat a.py 
\%                                                                                                                                                       
➜  cpython git:(multiline-expr-crash) ✗ ./python a.py        
  File "/home/lysnikolaou/repos/cpython/a.py", line 0
    \
     ^
SyntaxError: unexpected EOF while parsing
➜  cpython git:(multiline-expr-crash) ✗ cat b.py
"""abc%                                                                                                                                                  
➜  cpython git:(multiline-expr-crash) ✗ ./python b.py
  File "/home/lysnikolaou/repos/cpython/b.py", line 0
    """abc
          ^
SyntaxError: EOF while scanning triple-quoted string literal

Even more oddly, this wasn't a problem without the assertion, because p->tok->lineno == t->lineno and p->tok->buf still has the correct value, so the error line was computed correctly in that if.

Any thoughts on what we should do here? Fixing the tokenizer to point to the correct line upon an E_EOF error seems like the way to go, but it isn't at all trivial, at least for me.

@pablogsal
Copy link
Member

Any thoughts on what we should do here? Fixing the tokenizer to point to the correct line upon an E_EOF error seems like the way to go, but it isn't at all trivial, at least for me.

Hummm, is this a problem that happens after the change in this PR or it happens already on master?

What does the old parser do with this?

Fixing the tokenizer to point to the correct line upon an E_EOF error seems like the way to go, but it isn't at all trivial, at least for me.

Isn't the tokenizer pointing to the last line upon E_EOF? Maybe I am misunderstanding the case where this happens.

lysnikolaou and others added 2 commits January 8, 2021 21:07
Co-authored-by: Irit Katriel <iritkatriel@yahoo.com>
@lysnikolaou
Copy link
Member Author

It was maybe a little bit too late last night, so I got a bit confused after a while and didn't explain the situation too well. Let me try to explain again after some more digging into the whole thing.

After some more consideration, I have reconsidered and this is not a tokenizer bug. It's an edge case, which we only handled correctly in the new parser code by accident, or at least it seems like that to me. Let me explain. When an E_EOF tokenizer error happens, which might be caused by a line continuation not being followed by another source line or by an unterminated triple-quoted string, p->tok->lineno does not point to the last line of source, it points to last_line + 1. This makes sense in my opinion, since it's not the last line that created the error, it's the non-existence of the following line. For example:

If we have the file a.py with the following content:

a = 0
\

Then the error looks like this:

➜  cpython git:(multiline-expr-crash) ✗ ./python a.py
  File "/home/lysnikolaou/repos/cpython/a.py", line 4
    \
     ^
SyntaxError: unexpected EOF while parsing

Note that it says line 4, while the file is 3 lines long.

So if this makes sense, why is it a problem? My thought was that for this case the error line would be found by the call to PyErr_ProgramTextObject on

error_line = PyErr_ProgramTextObject(p->tok->filename, (int) lineno);
but it doesn't because lineno points to a non-existent line. And so the error line is actually found by looking at p->tok->buf on
error_line = PyUnicode_DecodeUTF8(p->tok->buf, size, "replace");
That's why the assertion I added on 78a0f9c#diff-ec9a9d08c5b1d518997db0f5b4795c528a41efc35507d6c2a384eee43f047cb7R439 fails for this case.

How do we fix this? Removing the assertion works. Changing the assertion to assert(p->tok->fp == NULL || p->tok->fp == stdin || p->tok->done == E_EOF) also works. I think we should go with the latter and add a comment (no matter how long) that explains all this.

Hopefully everything is clear now. If not, let me know and I'll give it another go.

What does the old parser do with this?

Just FYI the old parser used to have a specialized struct perrdetail that had a text field that was used for the error line, so it didn't have to use tok->lineno or tok->buf directly and thus it was not affected by this.

@lysnikolaou
Copy link
Member Author

Also, thanks a lot @iritkatriel for the review!

Copy link
Member

@pablogsal pablogsal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough explanation @lysnikolaou.

LGTM! Let's go with this. I assume if we go with something like #24161 that exercises this path more we will detect quickly if we missed anything

@pablogsal pablogsal merged commit e5fe509 into python:master Jan 14, 2021
adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 13, 2021
…H-24140)

When trying to extract the error line for the error message there
are two distinct cases:

1. The input comes from a file, which means that we can extract the
   error line by using `PyErr_ProgramTextObject` and which we already
   do.
2. The input does not come from a file, at which point we need to get
   the source code from the tokenizer:
   * If the tokenizer's current line number is the same with the line
     of the error, we get the line from `tok->buf` and we're ready.
   * Else, we can extract the error line from the source code in the
     following two ways:
     * If the input comes from a string we have all the input
       in `tok->str` and we can extract the error line from it.
     * If the input comes from stdin, i.e. the interactive prompt, we
       do not have access to the previous line. That's why a new
       field `tok->stdin_content` is added which holds the whole input for the
       current (multiline) statement or expression. We can then extract the
       error line from `tok->stdin_content` like we do in the string case above.

Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants