
Parsed expression has wrong col_offset when concatenating f-strings #83745

Closed
lysnikolaou opened this issue Feb 5, 2020 · 10 comments
Labels
3.7 (EOL), 3.8, 3.9, interpreter-core (Objects, Python, Grammar, and Parser dirs), type-bug

Comments

@lysnikolaou
Contributor

BPO 39564
Nosy @gvanrossum, @ericvsmith, @lysnikolaou, @pablogsal
Superseder
  • bpo-34364: problem with traceback for syntax error in f-string
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-02-05.22:05:07.916>
    created_at = <Date 2020-02-05.21:21:22.191>
    labels = ['interpreter-core', '3.8', 'type-bug', '3.7', '3.9']
    title = 'Parsed expression has wrong col_offset when concatenating f-strings'
    updated_at = <Date 2020-02-07.03:32:04.321>
    user = 'https://github.com/lysnikolaou'

    bugs.python.org fields:

    activity = <Date 2020-02-07.03:32:04.321>
    actor = 'lys.nikolaou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-02-05.22:05:07.916>
    closer = 'eric.smith'
    components = ['Interpreter Core']
    creation = <Date 2020-02-05.21:21:22.191>
    creator = 'lys.nikolaou'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39564
    keywords = []
    message_count = 10.0
    messages = ['361456', '361457', '361459', '361462', '361467', '361468', '361477', '361484', '361532', '361535']
    nosy_count = 4.0
    nosy_names = ['gvanrossum', 'eric.smith', 'lys.nikolaou', 'pablogsal']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '34364'
    type = 'behavior'
    url = 'https://bugs.python.org/issue39564'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @lysnikolaou
    Contributor Author

    When concatenating f-strings, if there is an expression in any STRING node other than the first, the col_offset of the parsed expression is wrong.

    For example, parsing f"hello" f"{world}" outputs the following AST:

    Module(
        body=[
            Expr(
                value=JoinedStr(
                    values=[
                        Constant(
                            value="hello",
                            kind=None,
                            lineno=1,
                            col_offset=0,
                            end_lineno=1,
                            end_col_offset=19,
                        ),
                        FormattedValue(
                            value=Name(
                                id="world",
                                ctx=Load(),
                                lineno=1,
                                col_offset=1,      # wrong
                                end_lineno=1,
                                end_col_offset=6,  # wrong
                            ),
                            conversion=-1,
                            format_spec=None,
                            lineno=1,
                            col_offset=0,
                            end_lineno=1,
                            end_col_offset=19,
                        ),
                    ],
                    lineno=1,
                    col_offset=0,
                    end_lineno=1,
                    end_col_offset=19,
                ),
                lineno=1,
                col_offset=0,
                end_lineno=1,
                end_col_offset=19,
            )
        ],
        type_ignores=[],
    )

    Here, col_offset and end_col_offset are wrong in the parsed NAME 'world'.
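
    A minimal reproducer (the attribute path below simply follows the AST dump above):

        import ast

        # Parse two implicitly concatenated f-strings and pull out the Name
        # node for "world".
        tree = ast.parse('f"hello" f"{world}"')
        name = tree.body[0].value.values[1].value
        # On affected versions this prints "1 6"; relative to the start of
        # the line, "world" actually spans columns 12-17.
        print(name.col_offset, name.end_col_offset)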

    @lysnikolaou lysnikolaou added the 3.7 (EOL), 3.8, 3.9, and interpreter-core labels Feb 5, 2020
    @lysnikolaou lysnikolaou changed the title from "Parsed expression has wrong line/col info when concatenating f-strings" to "Parsed expression has wrong col_offset when concatenating f-strings" Feb 5, 2020
    @ericvsmith
    Member

    There are a number of bugs about this. See bpo-35212 and bpo-34364, at least.

    I have a stale patch that tried to fix this, but I was never able to get it completely correct.

    I have another approach that I hope to discuss at the language summit at PyCon. Hopefully progress after that will be quick.

    I'm going to close this as a duplicate. When I decide on a course of action, I'll either use bpo-34364 or create a new issue.

    @ericvsmith ericvsmith added the type-bug label Feb 5, 2020
    @gvanrossum
    Member

    Note that this is somewhat urgent for oefen, since we strive to match the produced AST exactly.

    --Guido (mobile)

    @gvanrossum
    Member

    This is indeed a duplicate of bpo-35212 (the second message there also shows a problem with implicit f-string concatenation). There's not much information in bpo-34364 (just a merged PR, and a mention of a PR of yours from 2 years ago that apparently never materialized).

    FWIW I meant "pegen" (https://github.com/gvanrossum/pegen/) -- not sure why my mobile spell corrector turned that into "oefen" (maybe it's becoming multilingual :-).

    And the reason I called this "urgent" is that until this is fixed in CPython 3.8 we have to disable the part of our tests that compares the AST generated by pegen to the one generated by ast.parse. (We've found several other issues with incorrect column offsets in ast.parse this way, and always managed to fix them upstream. :-)

    If we came up with a hacky fix, would you oppose incorporating it?

    I'm guessing your alternate approach is modifying the tokenizer to treat f"x{y}z{a}b" as several tokens:

    • f"x{
    • y (or multiple tokens if there's a complex expression here)
    • }z{
    • a (same)
    • }b"

    (This has been proposed before by a few others, and IIRC we did it this way in ABC, in the early '80s.)
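
    A rough sketch of that split as plain Python, just to make the proposed token stream concrete (illustrative only: it assumes a simple f"..." literal and ignores nesting, quoting, escapes, and format specs):

        def split_fstring(src):
            # Walk the f-string body, cutting at top-level braces so that
            # literal text (with its surrounding quotes and braces) and the
            # embedded expression text come out as separate tokens.
            assert src.startswith('f"') and src.endswith('"')
            tokens, literal, in_expr = [], 'f"', False
            for ch in src[2:-1]:
                if not in_expr and ch == "{":
                    tokens.append(literal + "{")   # e.g. f"x{  or  }z{
                    literal, in_expr = "", True
                elif in_expr and ch == "}":
                    tokens.append(literal)         # expression text: y, a
                    literal, in_expr = "}", False
                else:
                    literal += ch
            tokens.append(literal + '"')           # trailing }b"
            return tokens

        print(split_fstring('f"x{y}z{a}b"'))
        # ['f"x{', 'y', '}z{', 'a', '}b"']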

    @ericvsmith
    Member

    Yes, my approach is to use the tokenizer to produce the parts of the f-string, instead of treating it as a normal string and making a second pass to parse out the pieces. I've been discussing this for a few years, and at last PyCon I talked to many tool vendors (editors, type checkers, etc.) to gather opinions, and no one was opposed. I've also had a few discussions since then, which were also positive. So I think I'll try to move ahead on it.

    I'm not sure this is the best place to discuss it, but let me give a little example of why it's complicated.

    Consider: f"x{y:.2f}z{a!s:10}b"

    What type of token is ".2f", and how does the tokenizer recognize it? Or maybe the token is ":.2f". It can be an arbitrary sequence of characters (with a few restrictions, like no newlines), ending before a right brace. And there's the issue of backslash escapes there.
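
    For reference, the AST already treats the spec as arbitrary (possibly dynamic) text: the format_spec of a FormattedValue is itself a JoinedStr (the exact ast.dump output varies slightly across versions):

        import ast

        tree = ast.parse('f"x{y:.2f}z{a!s:10}b"')
        spec = tree.body[0].value.values[1].format_spec  # FormattedValue for y
        print(ast.dump(spec))
        # Roughly: JoinedStr(values=[Constant(value='.2f')])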

    And the f"{z = :20}" thing we added doesn't make this any easier, since whitespace is significant after the "=" and before the ":", if it exists. And I need the entire text of the expression, before it's parsed, to make it work. I don't think that "unparsing" could ever be made to work in this case, or at least not so as to be compatible with 3.8 and preserving whitespace.

    I really haven't looked at what's involved in the implementation in any greater depth: my first goal was (and will be) to get buy-off on the change, then to look at implementation issues. PEP-536 has some discussion of this, but if I go forward I would want to start from scratch (and reject PEP-536).

    As far as the patch I was working on: the general problem is that you have to "prime" the compiler with a line number and column to start on, instead of it assuming line 1 column 1. Despite spending the better part of a week on it at the core sprints 1.5 years ago, I never got it to work (although I did delete and simplify a few functions, so it was a net win anyway). I wouldn't be opposed to a patch to fix this.
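
    For what it's worth, the stdlib exposes that kind of "priming" for line numbers (ast.increment_lineno), though nothing equivalent exists for columns, which is the hard part here:

        import ast

        tree = ast.parse("x = 1")
        # Shift every node as if the snippet had started ten lines further
        # down the file; there is no stdlib helper for shifting columns.
        ast.increment_lineno(tree, 10)
        print(tree.body[0].lineno)   # 11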

    @gvanrossum
    Member

    Wow. Do you still have a branch with that non-working patch? Maybe there are some ideas that can be salvaged.

    I'm wondering if maybe we're better off leaving f-strings alone? The current code is quite voluminous, but it looks as if by the time you've addressed everything you mention using additional tokenizer state, you might end up with just as much or even more convoluted code... F-strings totally rule, despite the current limitations!

    @ericvsmith
    Member

    I'll see if I can dig up the patch today. If I can find it, I'll attach it to bpo-34364.

    This is really the first time I've tried to write down all of the issues related to tokenizing f-strings. It does seem a little daunting, but I'm not done noodling it through. At first blush it looks like the tokenizer would need to remember if it's inside an f-string or not and switch to different rules if so. Which doesn't exactly describe your average tokenizer, but I'm not sure how Python's tokenizer would need to be changed to deal with it, or how messy that change would be.

    I should probably write an informational PEP about parsing f-strings. And I should include the reason I went with the "just a regular string which is later hand-parsed" approach: at the time, f-strings were a controversial topic (there were any number of reddit threads predicting doom and gloom if they were added). By parsing them as just regular strings with one simple added string prefix, it allowed existing tooling (editors, syntax highlighters, etc.) to easily skip over them just by recognizing 'f' as an additional string prefix.

    @lysnikolaou
    Contributor Author

    Also note that for pegen this is a solved problem. If we decide to change the parser, it may not be worth it to spend too much time on a universal solution.

    @ericvsmith
    Member

    Can I ask how pegen solved this problem? Did you move parsing of f-strings into the grammar?

    I found my patch and I'm working on resolving merge conflicts.

    @lysnikolaou
    Contributor Author

    Can I ask how pegen solved this problem? Did you move parsing of f-strings into the grammar?

    No, we actually do a very similar thing to what ast.c does.

    I think there is only one major difference between what pegen does and what ast.c does. If I understand correctly, fstring_compile_expr calls fstring_fix_node_location with parent set to the whole STRING+ node. We, OTOH, only pass as parent the current STRING token (we work with tokens returned by the tokenizer) and then walk the AST tree and adjust the line/col_offset based on that. I'm not sure that this is bug-free, but after some testing, I couldn't find any obvious errors.
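
    A rough Python rendering of that walk-and-adjust step (names and details here are illustrative, not the actual pegen code):

        import ast

        def fix_locations(tree, tok_lineno, tok_col_offset):
            # The expression was parsed in isolation (line 1, column 0), so
            # shift every located node to where the STRING token actually
            # sits in the file. Columns only change on the token's first line.
            for node in ast.walk(tree):
                if not hasattr(node, "lineno"):
                    continue
                if node.lineno == 1:
                    node.col_offset += tok_col_offset
                if getattr(node, "end_lineno", None) == 1:
                    node.end_col_offset += tok_col_offset
                node.lineno += tok_lineno - 1
                if getattr(node, "end_lineno", None) is not None:
                    node.end_lineno += tok_lineno - 1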

    If you want, you can take a look at the code in we-like-parsers/pegen_experiments#183.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022