
Parsed expression has wrong col_offset when concatenating f-strings #83745

Closed
lysnikolaou opened this issue Feb 5, 2020 · 10 comments
Labels
3.7 (EOL), 3.8, 3.9, interpreter-core (Objects, Python, Grammar, and Parser dirs), type-bug

Comments

@lysnikolaou
Contributor

BPO 39564
Nosy @gvanrossum, @ericvsmith, @lysnikolaou, @pablogsal
Superseder
  • bpo-34364: problem with traceback for syntax error in f-string
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-02-05.22:05:07.916>
    created_at = <Date 2020-02-05.21:21:22.191>
    labels = ['interpreter-core', '3.8', 'type-bug', '3.7', '3.9']
    title = 'Parsed expression has wrong col_offset when concatenating f-strings'
    updated_at = <Date 2020-02-07.03:32:04.321>
    user = 'https://github.com/lysnikolaou'

    bugs.python.org fields:

    activity = <Date 2020-02-07.03:32:04.321>
    actor = 'lys.nikolaou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-02-05.22:05:07.916>
    closer = 'eric.smith'
    components = ['Interpreter Core']
    creation = <Date 2020-02-05.21:21:22.191>
    creator = 'lys.nikolaou'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39564
    keywords = []
    message_count = 10.0
    messages = ['361456', '361457', '361459', '361462', '361467', '361468', '361477', '361484', '361532', '361535']
    nosy_count = 4.0
    nosy_names = ['gvanrossum', 'eric.smith', 'lys.nikolaou', 'pablogsal']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '34364'
    type = 'behavior'
    url = 'https://bugs.python.org/issue39564'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @lysnikolaou
    Contributor Author

    When concatenating f-strings, if there is an expression in any STRING node other than the first, the col_offset of the parsed expression is wrong.

    For example, parsing f"hello" f"{world}" outputs the following AST:

    Module(
        body=[
            Expr(
                value=JoinedStr(
                    values=[
                        Constant(
                            value="hello",
                            kind=None,
                            lineno=1,
                            col_offset=0,
                            end_lineno=1,
                            end_col_offset=19,
                        ),
                        FormattedValue(
                            value=Name(
                                id="world",
                                ctx=Load(),
                                lineno=1,
                                col_offset=1,      # wrong
                                end_lineno=1,
                                end_col_offset=6,  # wrong
                            ),
                            conversion=-1,
                            format_spec=None,
                            lineno=1,
                            col_offset=0,
                            end_lineno=1,
                            end_col_offset=19,
                        ),
                    ],
                    lineno=1,
                    col_offset=0,
                    end_lineno=1,
                    end_col_offset=19,
                ),
                lineno=1,
                col_offset=0,
                end_lineno=1,
                end_col_offset=19,
            )
        ],
        type_ignores=[],
    )

    Here, col_offset and end_col_offset are wrong in the parsed NAME 'world'.
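
    A minimal reproducer (the attribute path below simply follows the AST dump above):

        import ast

        # Parse two implicitly concatenated f-strings and pull out the Name
        # node for "world".
        tree = ast.parse('f"hello" f"{world}"')
        name = tree.body[0].value.values[1].value
        # On affected versions this prints "1 6"; relative to the start of
        # the line, "world" actually spans columns 12-17.
        print(name.col_offset, name.end_col_offset)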

    @lysnikolaou lysnikolaou added the 3.7 (EOL), 3.8, 3.9, and interpreter-core labels Feb 5, 2020
    @lysnikolaou lysnikolaou changed the title from "Parsed expression has wrong line/col info when concatenating f-strings" to "Parsed expression has wrong col_offset when concatenating f-strings" Feb 5, 2020
    @ericvsmith
    Member

    There are a number of bugs about this. See bpo-35212 and bpo-34364, at least.

    I have a stale patch that tried to fix this, but I was never able to get it completely correct.

    I have another approach that I hope to discuss at the language summit at PyCon. Hopefully progress after that will be quick.

    I'm going to close this as a duplicate. When I decide on a course of action, I'll either use bpo-34364 or create a new issue.

    @ericvsmith ericvsmith added the type-bug label Feb 5, 2020
    @gvanrossum
    Member

    Note that this is somewhat urgent for oefen, since we strive to match the produced AST exactly.

    --Guido (mobile)

    @gvanrossum
    Member

    This is indeed a duplicate of bpo-35212 (the second message there also shows a problem with implicit f-string concatenation). There's not much information in bpo-34364 (just a merged PR, and a mention of a PR of yours from 2 years ago that apparently never materialized).

    FWIW I meant "pegen" (https://github.com/gvanrossum/pegen/) -- not sure why my mobile spell corrector turned that into "oefen" (maybe it's becoming multilingual :-).

    And the reason I called this "urgent" is that until this is fixed in CPython 3.8 we have to disable the part of our tests that compares the AST generated by pegen to the one generated by ast.parse. (We've found several other issues with incorrect column offsets in ast.parse this way, and always managed to fix them upstream. :-)

    If we came up with a hacky fix, would you oppose incorporating it?

    I'm guessing your alternate approach is modifying the tokenizer to treat f"x{y}z{a}b" as several tokens:

    • f"x{
    • y (or multiple tokens if there's a complex expression here)
    • }z{
    • a (same)
    • }b"

    (This has been proposed before by a few others, and IIRC we did it this way in ABC, in the early '80s.)
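
    A rough sketch of that split as plain Python, just to make the proposed token stream concrete (illustrative only: it assumes a simple f"..." literal and ignores nesting, quoting, escapes, and format specs):

        def split_fstring(src):
            # Walk the f-string body, cutting at top-level braces so that
            # literal text (with its surrounding quotes and braces) and the
            # embedded expression text come out as separate tokens.
            assert src.startswith('f"') and src.endswith('"')
            tokens, literal, in_expr = [], 'f"', False
            for ch in src[2:-1]:
                if not in_expr and ch == "{":
                    tokens.append(literal + "{")   # e.g. f"x{  or  }z{
                    literal, in_expr = "", True
                elif in_expr and ch == "}":
                    tokens.append(literal)         # expression text: y, a
                    literal, in_expr = "}", False
                else:
                    literal += ch
            tokens.append(literal + '"')           # trailing }b"
            return tokens

        print(split_fstring('f"x{y}z{a}b"'))
        # ['f"x{', 'y', '}z{', 'a', '}b"']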

    @ericvsmith
    Member

    Yes, my approach is to use the tokenizer to produce the parts of the f-string, instead of treating it as a normal string and making a second pass to parse out the pieces. I've been discussing this for a few years, and at last PyCon I talked to many tool vendors (editors, type checkers, etc.) to gather opinions, and no one was opposed. I've also had a few discussions since then, which were also positive. So I think I'll try to move ahead on it.

    I'm not sure this is the best place to discuss it, but let me give a little example of why it's complicated.

    Consider: f"x{y:.2f}z{a!s:10}b"

    What type of token is ".2f", and how does the tokenizer recognize it? Or maybe the token is ":.2f". It can be an arbitrary sequence of characters (with a few restrictions, like no newlines), ending before a right brace. And there's the issue of backslash escapes there.
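
    For reference, the AST already treats the spec as arbitrary (possibly dynamic) text: the format_spec of a FormattedValue is itself a JoinedStr (the exact ast.dump output varies slightly across versions):

        import ast

        tree = ast.parse('f"x{y:.2f}z{a!s:10}b"')
        spec = tree.body[0].value.values[1].format_spec  # FormattedValue for y
        print(ast.dump(spec))
        # Roughly: JoinedStr(values=[Constant(value='.2f')])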

    And the f"{z = :20}" thing we added doesn't make this any easier, since whitespace is significant after the "=" and before the ":", if it exists. And I need the entire text of the expression, before it's parsed, to make it work. I don't think that "unparsing" could ever be made to work in this case, or at least not so as to be compatible with 3.8 and preserving whitespace.

    I really haven't looked at what's involved in the implementation in any greater depth: my first goal was (and will be) to get buy-off on the change, then to look at implementation issues. PEP-536 has some discussion of this, but if I go forward I would want to start from scratch (and reject PEP-536).

    As far as the patch I was working on: the general problem is that you have to "prime" the compiler with a line number and column to start on, instead of it assuming line 1 column 1. Despite spending the better part of a week on it at the core sprints 1.5 years ago, I never got it to work (although I did delete and simplify a few functions, so it was a net win anyway). I wouldn't be opposed to a patch to fix this.
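
    For what it's worth, the stdlib exposes that kind of "priming" for line numbers (ast.increment_lineno), though nothing equivalent exists for columns, which is the hard part here:

        import ast

        tree = ast.parse("x = 1")
        # Shift every node as if the snippet had started ten lines further
        # down the file; there is no stdlib helper for shifting columns.
        ast.increment_lineno(tree, 10)
        print(tree.body[0].lineno)   # 11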

    @gvanrossum
    Member

    Wow. Do you still have a branch with that non-working patch? Maybe there are some ideas that can be salvaged.

    I'm wondering if maybe we're better off leaving f-strings alone? The current code is quite voluminous, but it looks as if by the time you've addressed everything you mention using additional tokenizer state, you might end up with just as much or even more convoluted code... F-strings totally rule, despite the current limitations!

    @ericvsmith
    Member

    I'll see if I can dig up the patch today. If I can find it, I'll attach it to bpo-34364.

    This is really the first time I've tried to write down all of the issues related to tokenizing f-strings. It does seem a little daunting, but I'm not done noodling it through. At first blush it looks like the tokenizer would need to remember if it's inside an f-string or not and switch to different rules if so. Which doesn't exactly describe your average tokenizer, but I'm not sure how Python's tokenizer would need to be changed to deal with it, or how messy that change would be.

    I should probably write an informational PEP about parsing f-strings. And I should include the reason I went with the "just a regular string which is later hand-parsed" approach: at the time, f-strings were a controversial topic (there were any number of reddit threads predicting doom and gloom if they were added). By parsing them as just regular strings with one simple added string prefix, it allowed existing tooling (editors, syntax highlighters, etc.) to easily skip over them just by recognizing 'f' as an additional string prefix.

    @lysnikolaou
    Contributor Author

    Also note that for pegen this is a solved problem. If we decide to change the parser, it may not be worth it to spend too much time on a universal solution.

    @ericvsmith
    Member

    Can I ask how pegen solved this problem? Did you move parsing of f-strings into the grammar?

    I found my patch and I'm working on resolving merge conflicts.

    @lysnikolaou
    Contributor Author

    Can I ask how pegen solved this problem? Did you move parsing of f-strings into the grammar?

    No, we actually do a very similar thing to what ast.c does.

    I think there is only one major difference between what pegen does and what ast.c does. If I understand correctly, fstring_compile_expr calls fstring_fix_node_location with parent set to the whole STRING+ node. We, OTOH, only pass as parent the current STRING token (we work with tokens returned by the tokenizer) and then walk the AST tree and adjust the line/col_offset based on that. I'm not sure that this is bug-free, but after some testing, I couldn't find any obvious errors.
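
    A rough Python rendering of that walk-and-adjust step (names and details here are illustrative, not the actual pegen code):

        import ast

        def fix_locations(tree, tok_lineno, tok_col_offset):
            # The expression was parsed in isolation (line 1, column 0), so
            # shift every located node to where the STRING token actually
            # sits in the file. Columns only change on the token's first line.
            for node in ast.walk(tree):
                if not hasattr(node, "lineno"):
                    continue
                if node.lineno == 1:
                    node.col_offset += tok_col_offset
                if getattr(node, "end_lineno", None) == 1:
                    node.end_col_offset += tok_col_offset
                node.lineno += tok_lineno - 1
                if getattr(node, "end_lineno", None) is not None:
                    node.end_lineno += tok_lineno - 1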

    If you want, you can take a look at the code in we-like-parsers/pegen_experiments#183.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022