Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bytecode positions seem way too broad #93691

Closed
brandtbucher opened this issue Jun 10, 2022 · 5 comments
Closed

Bytecode positions seem way too broad #93691

brandtbucher opened this issue Jun 10, 2022 · 5 comments
Assignees
Labels
3.11 only security fixes 3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@brandtbucher
Copy link
Member

brandtbucher commented Jun 10, 2022

(Note that dis currently has a bug in displaying accurate location info in the presence of CACHEs. The correct information can be observed by working with co_positions directly or using the code from that PR.)

While developing specialist, I realized that there are lots of common code patterns that produce bytecode with unexpectedly large source ranges. In addition to being unhelpful for both friendly tracebacks (the original motivation) and things like bytecode introspection, I suspect these huge ranges may also be bloating the size of our internal position tables as well.

Consider the following function:

def analyze(path):                         #  1
    upper = lower = total = 0              #  2
    with open(path) as file:               #  3
        for line in file:                  #  4
            for character in line:         #  5
                if character.isupper():    #  6
                    upper += 1             #  7
                elif character.islower():  #  8
                    lower += 1             #  9
                total += 1                 # 10
    return lower / total, upper / total    # 11


import dis
from pprint import pprint as pp
def pos(p):
    return (p.lineno, p.end_lineno, p.col_offset, p.end_col_offset)

pp([(pos(x.positions), x.opname, x.argval) for x in dis.get_instructions(analyze)])

Things that should probably span one line at most:

  • The first GET_ITER/FOR_ITER pair span all of lines 4 through 10.
  • The second GET_ITER/FOR_ITER pair spans all of lines 5 through 10.
  • The first POP_JUMP_FORWARD_IF_FALSE spans all of lines 6 through 9.
  • The second POP_JUMP_FORWARD_IF_FALSE spans all of lines 8 through 9.
  • Ten instructions for with cleanup each span all of lines 3 through 10.

Things that should probably be artificial:

  • A JUMP_FORWARD spans all of line 7.
  • The first JUMP_BACKWARD spans all of line 10.
  • The second JUMP_BACKWARD spans all of lines 5 through 10.

Things I don't get:

  • A NOP spans all of lines 4 through 10.

As a result, over half of the generated bytecode for this function claims to span line 9, for instance. Also not shown here: the instructions for building functions and classes have similarly huge spans.

I think this can be tightened up in the compiler by:

  • Being more aggressive in calling SET_LOC on child nodes.
  • Being more aggressive in calling UNSET_LOC before unconditional jumps.

Linked PRs

@brandtbucher brandtbucher added type-bug An unexpected behavior, bug, or error interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.11 only security fixes 3.12 bugs and security fixes labels Jun 10, 2022
@brandtbucher
Copy link
Member Author

I can perhaps see why the argument could be made that we should have the location info for certain constructs span their entire block or jump range, but to me this just feels like the shape of the AST and the design of the compiler are leaking into the bytecode more than is really helpful in practice.

@brandtbucher
Copy link
Member Author

@brandtbucher
Copy link
Member Author

As a quick-and-dirty experiment: for the given example function, setting node.end_lineno = node.lineno on every node before compiling the AST resulted in a 20% reduction in the size of co_linetable.

@pablogsal
Copy link
Member

I can perhaps see why the argument could be made that we should have the location info for certain constructs span their entire block or jump range, but to me this just feels like the shape of the AST and the design of the compiler are leaking into the bytecode more than is really helpful in practice.

I don't think that many of these things were conscious decisions. Originally we added and enabled the infrastructure so the debug information could be propagated and we spent some time doing small optimisations, but we are missing a full pass over the compiler to fix things like this. Additionally, there are many instructions that don't really benefit from having position information because they are either artificial or don't map well to source code.

I think this is a very good find. With what seems like a small tedious amount of work we could reduce substantially the size for some functions, specially in block setup stuff.

Thanks for opening the issue and the insights @brandtbucher, this is very interesting indeed.

@markshannon
Copy link
Member

I don't think the shape of the AST is "leaking" into the bytecode. The AST defines the locations.
The problem is, IMO, in the design of the compiler. Tracking the "current" location in the compiler only makes sense if the bytecode is produced in a linear fashion, which it clearly isn't.

We should make the location used explicit when generating code, not use the implicit location stored in the compiler.
ADDOP(opcode, oparg, location)

@iritkatriel iritkatriel self-assigned this Sep 30, 2022
iritkatriel added a commit that referenced this issue Oct 17, 2022
carljm added a commit to carljm/cpython that referenced this issue Oct 17, 2022
* main: (31 commits)
  pythongh-95913: Move subinterpreter exper removal to 3.11 WhatsNew (pythonGH-98345)
  pythongh-95914: Add What's New item describing PEP 670 changes (python#98315)
  Remove unused arrange_output_buffer function from zlibmodule.c. (pythonGH-98358)
  pythongh-98174: Handle EPROTOTYPE under macOS in test_sendfile_fallback_close_peer_in_the_middle_of_receiving (python#98316)
  pythonGH-98327: Reduce scope of catch_warnings() in _make_subprocess_transport (python#98333)
  pythongh-93691: Compiler's code-gen passes location around instead of holding it on the global compiler state (pythonGH-98001)
  pythongh-97669: Create Tools/build/ directory (python#97963)
  pythongh-95534: Improve gzip reading speed by 10% (python#97664)
  pythongh-95913: Forward-port int/str security change to 3.11 What's New in main (python#98344)
  pythonGH-91415: Mention alphabetical sort ordering in the Sorting HOWTO (pythonGH-98336)
  pythongh-97930: Merge with importlib_resources 5.9 (pythonGH-97929)
  pythongh-85525: Remove extra row in doc (python#98337)
  pythongh-85299: Add note warning about entry point guard for asyncio example (python#93457)
  pythongh-97527: IDLE - fix buggy macosx patch (python#98313)
  pythongh-98307: Add docstring and documentation for SysLogHandler.createSocket (pythonGH-98319)
  pythongh-94808: Cover `PyFunction_GetCode`, `PyFunction_GetGlobals`, `PyFunction_GetModule` (python#98158)
  pythonGH-94597: Deprecate child watcher getters and setters (python#98215)
  pythongh-98254: Include stdlib module names in error messages for NameErrors (python#98255)
  Improve speed. Reduce auxiliary memory to 16.6% of the main array. (pythonGH-98294)
  [doc] Update logging cookbook with an example of custom handling of levels. (pythonGH-98290)
  ...
iritkatriel added a commit to iritkatriel/cpython that referenced this issue Jun 5, 2024
iritkatriel added a commit to iritkatriel/cpython that referenced this issue Jun 10, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 12, 2024
…tors (pythonGH-120330)

(cherry picked from commit 97b69db)

Co-authored-by: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>
iritkatriel added a commit that referenced this issue Jun 12, 2024
…ators (GH-120330) (#120399)

gh-93691: fix too broad source locations of for statement iterators (GH-120330)
(cherry picked from commit 97b69db)

Co-authored-by: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>
iritkatriel added a commit to iritkatriel/cpython that referenced this issue Jun 12, 2024
…t iterators (pythonGH-120330).

(cherry picked from commit 97b69db)

Co-authored-by: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>
iritkatriel added a commit that referenced this issue Jun 13, 2024
…ators (GH-120330 (#120405)

[3.12] gh-93691: fix too broad source locations of for statement iterators (GH-120330).
(cherry picked from commit 97b69db)
mrahtz pushed a commit to mrahtz/cpython that referenced this issue Jun 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 only security fixes 3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

4 participants