Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include column offsets for bytecode instructions #88116

Open
pablogsal opened this issue Apr 27, 2021 · 52 comments
Open

Include column offsets for bytecode instructions #88116

pablogsal opened this issue Apr 27, 2021 · 52 comments

Comments

@pablogsal
Copy link
Member

pablogsal commented Apr 27, 2021

BPO 43950
Nosy @terryjreedy, @nedbat, @aroberge, @markshannon, @serhiy-storchaka, @ammaraskar, @pablogsal, @miss-islington, @brandtbucher, @isidentical, @sweeneyde, @sobolevn
PRs
  • bpo-43950: Initial base implementation for PEP 657 #26955
  • bpo-43950: Print columns in tracebacks (PEP 657) #26958
  • bpo-43950: optimize column table assembling with pre-sizing the column table #26997
  • bpo-43950: use 0-indexed column offsets for bytecode positions #27011
  • bpo-43950: include position in dis.Instruction #27015
  • bpo-43950: Add option to opt-out of PEP-657 #27023
  • bpo-43950: Add option to opt-out of PEP-657 #27023
  • bpo-43950: Add option to opt-out of PEP-657 #27023
  • bpo-43950: Specialize tracebacks for subscripts/binary ops #27037
  • bpo-43950: Add documentation for PEP-657 #27047
  • bpo-43950: implement on-the-fly source tracking for interactive mode #27117
  • bpo-43950: make BinOp specializations more reliable #27126
  • bpo-43950: distinguish errors happening on character offset decoding #27217
  • bpo-43950: ensure source_line is present when specializing the traceback #27313
  • bpo-43950: support long lines in traceback.py #27336
  • bpo-43950: check against the raw string, not the pyobject #27337
  • bpo-43950: support some multi-line expressions for PEP 657 #27339
  • bpo-43950: support positions for dis.Instructions created through dis.Bytecode #28142
  • bpo-43950: handle wide unicode characters in tracebacks #28150
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-04-27.10:58:38.225>
    labels = []
    title = 'Include column offsets for bytecode instructions'
    updated_at = <Date 2022-01-18.12:31:06.245>
    user = 'https://github.com/pablogsal'

    bugs.python.org fields:

    activity = <Date 2022-01-18.12:31:06.245>
    actor = 'sobolevn'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = []
    creation = <Date 2021-04-27.10:58:38.225>
    creator = 'pablogsal'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43950
    keywords = ['patch']
    message_count = 39.0
    messages = ['392054', '392062', '392081', '392082', '392520', '392540', '392541', '392555', '392557', '396866', '396874', '396950', '396957', '396962', '397109', '397352', '397368', '397589', '397666', '397667', '397669', '397673', '397829', '397837', '397844', '397882', '397909', '397910', '397911', '397914', '397915', '398089', '398144', '398169', '398170', '398172', '398197', '400998', '410856']
    nosy_count = 12.0
    nosy_names = ['terry.reedy', 'nedbat', 'aroberge', 'Mark.Shannon', 'serhiy.storchaka', 'ammar2', 'pablogsal', 'miss-islington', 'brandtbucher', 'BTaskaya', 'Dennis Sweeney', 'sobolevn']
    pr_nums = ['26955', '26958', '26997', '27011', '27015', '27023', '27023', '27023', '27037', '27047', '27117', '27126', '27217', '27313', '27336', '27337', '27339', '28142', '28150']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue43950'
    versions = []

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Apr 27, 2021

    If we could include column offsets from the AST nodes for every bytecode instructions we could leverage these to offer much better tracebacks and a lot more information to debuggers and similar tools. For instance, in an expression such as:

    z['aaaaaa']['bbbbbb']['cccccc']['dddddd].sddfsdf.sdfsdf

    where one of these elements is None, we could tell exactly what of these pieces is the actual None and we could do some cool highlighting on tracebacks.

    Similarly, coverage tools and debuggers could also make this distinction and therefore could offer reports with more granularity.

    The cost is not 0: it would be two integers per bytecode instruction, but I think it may be worth the effort.

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Apr 27, 2021

    I'm going to prepare a PEP since the discussion regarding if the two integers per bytecode are worth enough is going to be eternal.

    @markshannon
    Copy link
    Member

    markshannon commented Apr 27, 2021

    The additional cost will not only be the line number table, but we need to store the line for exceptions that are reraised after cleanup.
    Adding a column will mean more stack consumption.

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Apr 27, 2021

    Yup, but I still think is worth the cost, giving that debugging improvements are usually extremely popular among users.

    @terryjreedy
    Copy link
    Member

    terryjreedy commented Apr 30, 2021

    Specific examples of current messages and proposed improvements would help focus discussion.

    If you are willing to only handle code lines up to 256 chars, only 2 bytes should be needed. (0,0) or (255,255) could mean 'somewhere beyond the 256th char'.

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Apr 30, 2021

    Specific examples of current messages and proposed improvements would help focus discussion.

    Yeah, I am proposing going from:

    >>> x['aaaaaa']['bbbbbb']['cccccc']['dddddd'].sddfsdf.sdfsdf
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'NoneType' object is not subscriptable

    to

    >>> x['aaaaaa']['bbbbbb']['cccccc']['dddddd'].sddfsdf.sdfsdf
                             ^^^^^^^^^^
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'NoneType' object is not subscriptable

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Apr 30, 2021

    Basically, to highlight in all exceptions the range in the displayed line where the error ocurred. For instance:

    >>> foo(a, b/z+2, c, 132432 /x, d /y)
                         ^^^^^^^^^
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ZeroDivisionError: division by zero

    @terryjreedy
    Copy link
    Member

    terryjreedy commented May 1, 2021

    Marking what expression evaluated to None would be extremely helpful. So would marking the 0 denominator when there is more than one candidate: "e = a/b + c/d". It should be easy to revise IDLE Shell's print_exception to tag the span. In some cases, the code that goes from a traceback line to a line in a file and marks it could do the same.

    What would you do when the expression is not the last line?

    try:
    x/y
    ...
    except Exception as e:
    ...
    raise e

    The except and raise might even be in a separate module.
    I look forward to the PEP and discussion.

    @pablogsal
    Copy link
    Member Author

    pablogsal commented May 1, 2021

    So would marking the 0 denominator when there is more than one candidate: "e = a/b + c/d".

    No, it will mark the offset of the bytecode that was getting executed when the exception was raised. Is just a way to mark what is raising the particular exception

    What would you do when the expression is not the last line?

    There is some logic needed to re-raise exceptions. But basically it boils down to comparate the line numbers to decide if you want to propagate or not the offsets.

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 2, 2021

    New changeset 98eee94 by Pablo Galindo in branch 'main':
    bpo-43950: Add code.co_positions (PEP-657) (GH-26955)
    98eee94

    @miss-islington
    Copy link
    Contributor

    miss-islington commented Jul 2, 2021

    New changeset ec8759b by Batuhan Taskaya in branch 'main':
    bpo-43950: optimize column table assembling with pre-sizing object (GH-26997)
    ec8759b

    @miss-islington
    Copy link
    Contributor

    miss-islington commented Jul 4, 2021

    New changeset 44f91fc by Batuhan Taskaya in branch 'main':
    bpo-43950: use 0-indexed column offsets for bytecode positions (GH-27011)
    44f91fc

    @miss-islington
    Copy link
    Contributor

    miss-islington commented Jul 4, 2021

    New changeset 693cec0 by Batuhan Taskaya in branch 'main':
    bpo-43950: include position in dis.Instruction (GH-27015)
    693cec0

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 4, 2021

    New changeset 5644c7b by Ammar Askar in branch 'main':
    bpo-43950: Print columns in tracebacks (PEP-657) (GH-26958)
    5644c7b

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 7, 2021

    New changeset 4823d9a by Ammar Askar in branch 'main':
    bpo-43950: Add option to opt-out of PEP-657 (GH-27023)
    4823d9a

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 12, 2021

    New changeset 1890dd2 by Batuhan Taskaya in branch 'main':
    bpo-43950: Specialize tracebacks for subscripts/binary ops (GH-27037)
    1890dd2

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 13, 2021

    New changeset 9c3eaf8 by Ammar Askar in branch 'main':
    bpo-43950: Add documentation for PEP-657 (GH-27047)
    9c3eaf8

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 15, 2021

    New changeset 919ad53 by Batuhan Taskaya in branch 'main':
    bpo-43950: make BinOp specializations more reliable (GH-27126)
    919ad53

    @ammaraskar
    Copy link
    Member

    ammaraskar commented Jul 16, 2021

    As reported by Will McGugan on twitter:

    Printing may be wrong for some unicode characters like:

    Traceback (most recent call last):
      File "test.py", line 1, in <module>
        x = ("" / 0) + (1 / 2)
             ~~~~~~^
    TypeError: unsupported operand type(s) for /: 'str' and 'int'

    and

    Traceback (most recent call last):
      File "test.py", line 3, in <module>
        x = ("💩" / 0) + (1 / 2)
             ~~~~~~~
    TypeError: unsupported operand type(s) for /: 'str' and 'int'

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 16, 2021

    Ah, this is our friend _PyPegen_byte_offset_to_character_offset. I am glad we refactor it to be used by the Parser and the traceback mechanism. Ammar, would you like to take a look?

    @ammaraskar
    Copy link
    Member

    ammaraskar commented Jul 16, 2021

    Aah, I won't have time to investigate until Monday. I'll take a look then unless someone gets to it first :)

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 16, 2021

    We can probably leverage the unicodedata_UCD_east_asian_width_impl function in the unicodedata extension module to get the width of every character as emojis normally take 2 characters in the terminal. We could expose the necessary functions from there and use them in the _PyPegen_byte_offset_to_character_offset function

    @ammaraskar
    Copy link
    Member

    ammaraskar commented Jul 19, 2021

    Had some time to look into this. Just to summarize this problem, it deals with unicode points that are single characters but take up more than the width of a single character, even with a monospace font [1].

    In the examples from above, the Chinese character itself counts as one character in a Python string. However, notice that it needs two carets:

    >>> x = ""
    >>> print(x)
    该
    >>> len(x)
    1
    >>> print(x + '\n' + '^^')
    该
    ^^

    This issue is somewhat font dependent, in the case of the emoji I know that windows sometimes renders emojis as single-character wide black-and-white glyphs or colorful ones depending on the program.

    As Pablo alluded to, unicodedata.east_asian_width is probably the best solution we can implement. For these wide characters it provides:

    >>> unicodedata.east_asian_width('💩')
    'W'
    >>> unicodedata.east_asian_width('')
    'W'

    W corresponding to Wide. Whereas for regular width characters:

    >>> unicodedata.east_asian_width('b')
    'Na'
    >>> unicodedata.east_asian_width('=')
    'Na'

    we get Neutral (Not East Asian). This can be used to count the "displayed width" of the characters and hence the carets. However, organization is going to be a bit tricky since we're currently using _PyPegen_byte_offset_to_character_offset to get offsets to use for string slicing in the ast segment parsing code. We might have to make a separate function that gets the font display-width.

    -------------

    [1] Way more details on this issue here: https://denisbider.blogspot.com/2015/09/when-monospace-fonts-arent-unicode.html and an example of a Python library that tries to deal with this issue here: https://github.com/jquast/wcwidth

    @terryjreedy
    Copy link
    Member

    terryjreedy commented Jul 19, 2021

    The effort to match caret lines to general unicode is similar to a previous issue that was closed as futile. (But I could not find it.) It has a downside that should be considered.

    The fundamental problem is that there is no fixed pitch font for unicode. (Let alone any font covering all of unicode.) Nor is there a single-double width definition, or font, for all of unicode. Some character sets are not amenable to such treatment.

    To see the problem easier, open, for instance, IDLE's option/settings dialog, showing the fonts tab and a multi-script sample. On Windows, select what I believe is the most 'fixed' font -- Courier New. ASCII, Latin1, IPA, Greek, Cyrillic, Hebrew, and Arabic are all rendered in the same fixed pitch. But the last 4 Cyrillic characters of "...ЪъЭэѠѤѬӜ" are extremely cramped and may be rendered differently from the rest. The East Asian characters are in a different fixed pitch, about 1.6 times the Ascii, etc. fixed pitch. (So the double-wide 2 is 1.6 rounded up. And with some fonts, the East Asian scripts are not all the same pitch.) The South Asian script are variable pitch and for the sample chars, average wider than 1 (they have 20 chars, like the Ascii, etc, lines). Tamil, especially, has a wide range of widths, with the widest as wide as the East Asian chars.

    On Windows, on my machine, pasting the sample text between quotes results in the Greek chars, the last 4 Cyrillic chars, and all Asian chars (including Hebrew and Arabic) being replaced by replacement chars. (I thought that this was better once, but maybe I mis-remember.) While one can get script-specific fonts, the fixed-pitch South Asian fonts I tried on Mac were hardly readable. My conclusion is that people using certain scripts and anyone wanting a wide variety of scripts needs to use a GUI-based editor and shell rather than a fixed-pitch terminal/console.

    As long as the caret line has 1 char per code char, a GUI program can use it to mark code characters, and do so differently for '~' and '^'. If some of these chars are doubled, exact character information is lost. If you go ahead with this, please use a third character, such as '-', for additions. GUI programs could then ignore these, given that they can otherwise can get the start character information.

    @ammaraskar
    Copy link
    Member

    ammaraskar commented Jul 19, 2021

    I think this is the previous issue you are talking about Terry. https://bugs.python.org/issue24665

    The correct algorithm is a little more complex than just using east_asian_widths. Ideally we would like to use the wcwidth function (https://man7.org/linux/man-pages/man3/wcwidth.3.html) which implements all the logic to somewhat properly figure out widths.

    Given Terry's comments from above and the previous issues I'm starting to think the fix for this issue might not be worth the added complexity.

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 20, 2021

    New changeset fbc349f by Batuhan Taskaya in branch 'main':
    bpo-43950: Distinguish errors happening on character offset decoding (GH-27217)
    fbc349f

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 20, 2021

    I don't know, seems that rust can deal with this just fine:

    ❯ cargo run
    Compiling rtest v0.1.0 (/home/pablogsal/rtest)
    error[E0308]: mismatched types
    --> src/main.rs:7:10
    |
    7 | test("hellooooooo: 🥳 😏 😒 😞 😔 some emojis");
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | |
    | expected struct String, found &str
    | help: try using a conversion method: "hellooooooo: 🥳 😏 😒 😞 😔 some emojis".to_string()

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 20, 2021

    In bpo this looks bad, but test this small program yourself and check the error in your terminal:

    fn test(number: String) -> () {
    println!("Hello, world: {}", number);
    }

    fn main() {
    test("hellooooooo: 🥳 😏 😒 😞 😔 some emojis");
    }

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jul 20, 2021

    They seem to be checking the unicode width as we are proposing here. Check the calls to unicode_width::UnicodeWidthChar::width:

    https://github.com/rust-lang/rust/blob/master/compiler/rustc_errors/src/emitter.rs#L729-L733

    @ammaraskar
    Copy link
    Member

    ammaraskar commented Jul 20, 2021

    Indeed, and the unicode_width package seems to implement the wcwidth algorithm:

    https://bugs.python.org/issue12568 is an issue from the past that proposed adding wcwidth to the standard library.

    @JustAnotherArchivist
    Copy link
    Contributor

    JustAnotherArchivist commented Apr 14, 2022

    Yeah, I noticed that as well. The unpacking could be solved with a custom class that provides an __iter__ method for compatibility. For example:

    import dataclasses
    import typing
    
    @dataclasses.dataclass
    class X:
    	a: int
    	b: str
    	c: float
    
    	def __iter__(self):
    		return iter((self.a, self.b))
    
    x = X(42, 'foo', 3.1416)
    a, b = x
    print(a, b)
    print(x.a, x.b)

    This would still break other, more obscure uses of the named tuple specifics though, e.g. isinstance(x, tuple) or x._asdict().

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Apr 14, 2022

    Yeah, I noticed that as well. The unpacking could be solved with a custom class that provides an __iter__ method for compatibility. For example:

    import dataclasses
    import typing
    
    @dataclasses.dataclass
    class X:
    	a: int
    	b: str
    	c: float
    
    	def __iter__(self):
    		return iter((self.a, self.b))
    
    x = X(42, 'foo', 3.1416)
    a, b = x
    print(a, b)
    print(x.a, x.b)

    This would still break other, more obscure uses of the named tuple specifics though, e.g. isinstance(x, tuple) or x._asdict().

    That's likely too convoluted and it will certainly be very surprising in a bad way to new users.

    @JustAnotherArchivist
    Copy link
    Contributor

    JustAnotherArchivist commented Apr 14, 2022

    I agree that it's more of an awful workaround than an actual solution.

    Perhaps a better option might be adding a kwarg to trace and stack that makes them return a different data structure with the extra info. It would then default to False for backward compatibility. There's precedent for that as well: typing.get_type_hints and its include_extras argument. Although the question would still be what data type would be returned there; if we don't anticipate further additions anytime soon, maybe just an extended named tuple would do, but a dataclass or similar structure would be more future-proof.

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Apr 14, 2022

    . Although the question would still be what data type would be returned there; if we don't anticipate further additions anytime soon, maybe just an extended named tuple would do, but a dataclass or similar structure would be more future-proof.

    Unfortunately adding a data class to the inspect module will have an impact on startup time and size, which is very undesirable as is on the hot path. I don't dislike the keyword argument, though, but we need to see what others think. @ammaraskar @isidentical

    @JelleZijlstra
    Copy link
    Member

    JelleZijlstra commented Apr 14, 2022

    We can make a namedtuple subclass that adds additional attributes. There is precedent for this in the form of os.stat_result, although that's a structseq (the C equivalent of namedtuple). It's a little confusing, but it should preserve compatibility and avoid convoluting the API.

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Apr 17, 2022

    We can make a namedtuple subclass that adds additional attributes. There is precedent for this in the form of os.stat_result, although that's a structseq (the C equivalent of namedtuple). It's a little confusing, but it should preserve compatibility and avoid convoluting the API.

    Yeah, I recall that one. I am not a fan of that approach as I find it very confusing. Maybe we should open a thread in python-dev about this.

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Apr 17, 2022

    I have sent an email about this to python-dev

    @gpshead
    Copy link
    Member

    gpshead commented Apr 17, 2022

    As awkward of a beast as a tuple subclass emulating the previous fixed size namedtuple while adding additional attributes seems, it is very practical.

    Users don't care about the specifics. They just need existing code to keep working and will want access to new values when updating this code or writing new code. Which means it should still be a tuple subclass for typing reasons known as inspect.Traceback and inspect.FrameInfo, support the existing indices and unpacking semantics, and support named field access. That this would be adding additional fields that indexing does not provide is a mere curiosity that most users would not notice as modern code should be field based anyways.

    Nobody should care if the implementation behind the returned instances is unusual. The behavior will be what they expect. It avoids the need to expand into multiple different return types via an arg flag or the creation of parallel APIs.

    This is a good practical example of why we should not design APIs to return tuples if they could conceivably change in the future. And why people unpacking on API calls that return tuples might find it wise to always [:slice] the return value as a style idiom.


    All this said... we could also just add another field to the namedtuple. Some code inspection among existing API users is required, but I expect most users either immediately unpack upon return value assignment or use field names. Mix and matching of indexing and named access on a single return value is probably rare. The easy modification to existing code using unpacking to deal with this kind of API change upon version upgrade is to append a [:5] or similar to the API call. We have had other tuple returning APIs add fields in the past where this was the workaround though I've forgotten what they were off the top of my head.

    This would be the more disruptive option. It is nice to avoid this if possible as it delays people's ability to update to and even test their code on 3.11. So I still lean towards the hybrid "half namedtuple" class.

    markshannon added a commit that referenced this issue Apr 21, 2022
    …ts. (GH-91666)
    
    * Stores all location info in linetable to conform to PEP 626.
    
    * Remove column table from code objects.
    
    * Remove end-line table from code objects.
    
    * Document new location table format
    pablogsal added a commit to pablogsal/cpython that referenced this issue Jun 28, 2022
    …e objects
    
     It seems like we'll fail to correctly decode a varint between INT_MAX
     and UINT_MAX or if a signed varint is > INT_MAX / 2 or < INT_MIN / 2 as
     this is written today. In fact, it might left shift a 1 into the sign
     bit, which would be UB.
    
     Indeed, this test program shows the issue:
    
    ```
    
    int main()
    {
        int val = 2147483647;
        unsigned int shift = 1;
        std::cout << (val << shift) << std::endl;
    }
    ```
    
    As this prints -2.
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 28, 2022
    …e objects (pythonGH-94375)
    
    (cherry picked from commit c485ec0)
    
    Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
    miss-islington added a commit that referenced this issue Jun 28, 2022
    …cts (GH-94375)
    
    (cherry picked from commit c485ec0)
    
    Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jun 28, 2022

    @markshannon I'm noticing that the implementation for varints and signed varints that we use has two encodings for 0 (both uval==0 and uval==1 decode to 0), and no way to encode -2**31. Is this a problem? What prevents us to need to encode -2**31?

    @pablogsal
    Copy link
    Member Author

    pablogsal commented Jun 28, 2022

    Also, weirdly there are two ways to encode 0 and no ways to encode -INT_MIN. @markshannon is this expected?

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    None yet
    Projects
    None yet
    Development

    No branches or pull requests