Skip to content

Conversation

@gaogaotiantian
Copy link
Member

@gaogaotiantian gaogaotiantian commented Feb 4, 2024

We observed a severe performance regression on sys.settrace when PEP 669 is implemented. One of the important reason is how we deal with the instrumented lines in the trace function.

In 3.11, it's a simple check for tstate->tracing in frame eval. However, because we have INSTRUMENTED_LINE now, the check is deeper than before. The extra calls and unnecessary code when tstate->tracing != 0 contributed to the majority of the overhead.

In a real life application, a trace function is normally complicated. Take coverage as an example, for each line that is really "instrumented", there will be many lines in the trace function that are "instrumented" but should not trigger instrument callbacks. Therefore, tstate->tracing != 0 is the real hot path for tracing.

I made a microbenchmark:

import sys

def do_nothing():
    a = 1

def trace(frame, event, arg):
    for _ in range(2000):
        a = 1
    return trace

sys.settrace(trace)


for i in range(10000):
    do_nothing()

This is a bit extreme as there will be a couple thousand "suppressed" line events in the trace function for each real line event. The result is:

before:

real    0m2.009s
user    0m2.008s
sys     0m0.001s

after:

real    0m1.128s
user    0m1.127s
sys     0m0.001s

3.11:

real    0m0.779s
user    0m0.768s
sys     0m0.011s

Then I realized - we should not instrument the code object when we are tracing. Many functions are only called in tracing callbacks (same as sys.monitoring callbacks), and they should not be instrumented. Not only we saved time by not instrumenting them and not calling the instrumented version of bytecode, we can actually enjoy specializing in tracing functions.

So I made some changes to the RESUME and INSTRUMENTED_RESUME instructions, which brings the result to

real    0m0.908s
user    0m0.898s
sys     0m0.011s

The code for RESUME has an extra check for tstate->tracing, but for normal code without monitoring, it should be immediately specialized to RESUME_CHECK so the impact should not be significant.

Even considering the fact that the example is a bit extreme, the result is super significant. I believe this has a better effect than #107780

Is there a way to do a more comprehensive benchmark on this PR?

One difference would be asserts. Moving the check out of the instrumentation function skips the assertions - which we could add back but it would make the code a bit uglier. Admittedly, structure wise, it might not be as pretty as the current code, but I think this amount of performance improvement is worth it.

Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nice performance improvement, thanks.

One minor issue, but otherwise looks good.

@bedevere-app
Copy link

bedevere-app bot commented Feb 5, 2024

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@gaogaotiantian
Copy link
Member Author

Hi @markshannon , is there anything I need to do on this PR? Thanks!

Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion. Otherwise looks good.

next_instr = this_instr;
}
else {
if (tstate->tracing == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because RESUME converts itself to RESUME_CHECK, this shouldn't matter in terms of performance, so maybe keep it simple?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I remembered why I did not set this instr to RESUME_CHECK when tracing != 0 - it will convert this instr to RESUME_CHECK but RESUME_CHECK will be deopted as the version does not add up.

For your question - the key is to stop the code from being instrumented, removing this check will still result in an instrumented code, and instrumented code is always slower.

However, we should probably check if the code is tracing and do not deopt RESUME_CHECK if so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to do this is to DISPATCH() in the instrument case, so we don't have to write the "normal" case twice. Is it frown upon to have multiple DISPATCH() in one bytecode?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple DISPATCH()s should be fine.

Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this.

next_instr = this_instr;
}
else {
if (tstate->tracing == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple DISPATCH()s should be fine.

@gaogaotiantian
Copy link
Member Author

Okay I cleaned up the code a bit - using an extra DISPATCH() in instrument case to avoid duplicate code. This PR should be ready to go.

@markshannon markshannon merged commit 0a61e23 into python:main Feb 28, 2024
@gaogaotiantian gaogaotiantian deleted the sys-settrace-optimize branch February 28, 2024 18:27
gaogaotiantian added a commit to gaogaotiantian/cpython that referenced this pull request Mar 1, 2024
encukou pushed a commit that referenced this pull request Mar 1, 2024
…H-114986)" (GH-116178)

Revert "gh-107674: Improve performance of `sys.settrace` (GH-114986)"

This reverts commit 0a61e23.
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this pull request Mar 4, 2024
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this pull request Mar 4, 2024
….settrace` (pythonGH-114986)" (pythonGH-116178)

Revert "pythongh-107674: Improve performance of `sys.settrace` (pythonGH-114986)"

This reverts commit 0a61e23.
gaogaotiantian added a commit to gaogaotiantian/cpython that referenced this pull request Mar 21, 2024
adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 25, 2024
adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 25, 2024
….settrace` (pythonGH-114986)" (pythonGH-116178)

Revert "pythongh-107674: Improve performance of `sys.settrace` (pythonGH-114986)"

This reverts commit 0a61e23.
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
….settrace` (pythonGH-114986)" (pythonGH-116178)

Revert "pythongh-107674: Improve performance of `sys.settrace` (pythonGH-114986)"

This reverts commit 0a61e23.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants