
Generation of tier 2 traces is overly optimistic about the accuracy of profiling for branches. #115727

Open
markshannon opened this issue Feb 20, 2024 · 3 comments
Labels
3.13 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

markshannon (Member) commented Feb 20, 2024

Bug report

We have a test for our confidence that the trace is going the right way:

def test_confidence_score(self):
    def testfunc(n):
        bits = 0
        for i in range(n):
            if i & 0x01:
                bits += 1
            if i & 0x02:
                bits += 1
            if i & 0x04:
                bits += 1
            if i & 0x08:
                bits += 1
            ...

But consider this:

def test_confidence_score(self):
    def testfunc(n):
        bits = 0
        for i in range(n):
            if i & 0x10:
                bits += 1
            if i & 0x20:
                bits += 1
            if i & 0x40:
                bits += 1
            if i & 0x80:
                bits += 1
            ...

The direction of the branch changes only every 16 iterations, so it looks perfectly predictable to profiling, but it is just as pathological as the first version.
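To make the contrast concrete, here is a rough model of what the profiling sees -- assuming a 16-bit taken/not-taken history whose popcount gives the taken fraction, which matches the bitmask behavior described in the comments below but is not the actual interpreter code:

def taken_history(mask_bit, iterations=64, window=16):
    """Shift the outcome of `if i & mask_bit` into a `window`-bit history."""
    bits = 0
    for i in range(iterations):
        taken = 1 if i & mask_bit else 0
        bits = ((bits << 1) | taken) & ((1 << window) - 1)
    return bits

# i & 0x01 alternates every iteration: the history holds 8 ones out of 16,
# a taken fraction of 0.5, i.e. maximally unpredictable.
# i & 0x10 flips only every 16 iterations: the history is usually all ones or
# all zeros, a fraction of 1.0 or 0.0, so the branch looks perfectly predictable.
for mask in (0x01, 0x10):
    bits = taken_history(mask)
    print(f"mask {mask:#04x}: history {bits:016b}, taken fraction {bin(bits).count('1') / 16:.2f}")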

How to fix this

We should lower our confidence a bit, even if profiling claims 100% predictability.

I don't think there is a perfect way to do this, but it shouldn't be too hard to come up with something good enough.

Here is a possible scheme (a rough sketch in code follows the lists below):

Use lower confidence and threshold when generating the initial trace:

  • 0.8 * profile_fraction + 0.1 for bool checks (maybe higher for is None checks)
  • 0.95 - 0.05 * misses recorded for type checks
  • Stop trace at a lower confidence than before, say < 0.2

Reassess confidence in optimizer:

  • Type information may raise confidence (if we eliminate a guard, we have 100% confidence)
  • Stop trace at confidence < 0.33 (maybe allow a lower confidence if it completes a loop)
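For concreteness, a rough sketch of how these numbers could fit together; the function names, the event representation, and the loop structure are illustrative assumptions, not the actual (C) trace-projection code:

INITIAL_TRACE_THRESHOLD = 0.2   # proposed: stop projecting the trace below this
OPTIMIZER_THRESHOLD = 1 / 3     # proposed: the optimizer later stops below this when reassessing

def bool_check_confidence(profile_fraction):
    # 0.8 * profile_fraction + 0.1: even a branch that went the same way in
    # every recorded sample contributes at most 0.9, never 1.0.
    return 0.8 * profile_fraction + 0.1

def type_check_confidence(misses):
    # 0.95 - 0.05 * misses recorded for the type check.
    return 0.95 - 0.05 * misses

def project_trace(events):
    # events: ('bool', profile_fraction) or ('type', misses) pairs, one per guard.
    confidence = 1.0
    kept = []
    for kind, value in events:
        multiplier = (bool_check_confidence(value) if kind == 'bool'
                      else type_check_confidence(value))
        if confidence * multiplier < INITIAL_TRACE_THRESHOLD:
            break                # end the trace at this guard
        confidence *= multiplier
        kept.append((kind, value))
    return kept, confidence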

Linked PRs

@markshannon markshannon added type-bug An unexpected behavior, bug, or error 3.13 bugs and security fixes labels Feb 20, 2024
@Eclips4 Eclips4 added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Feb 20, 2024
markshannon (Member, Author) commented

@gvanrossum this might be of interest to you

markshannon added a commit that referenced this issue Feb 20, 2024
gvanrossum (Member) commented

I tried to use lltrace debugging to understand better what's happening here (you probably already know this). A few issues contribute to the seemingly random behavior if the test is repeatedly called after resetting optimizers and executors. I simulated this effect as follows:

    def test_repeat_confidence_score(self):
        for _ in range(10):
            self.test_confidence_score()
            sys._clear_internal_caches()

One of the things I learned is that it's not just the bitmasks in the POP_JUMP_IF_* opcodes that aren't being reset; the counter in JUMP_BACKWARD survives the reset too (of course).

This has the interesting effect that the first run is different from all later runs: the first time, the counter starts at 0 and we optimize once it has hit the threshold (16), so we execute 16 iterations in Tier 1. But all later runs (well, until our poor signed 12-bit counter overflows) execute a single Tier 1 iteration, find the counter already > 16, and immediately optimize.

Since the initial iteration is always executed in Tier 1, and the jump bitmask is only updated in Tier 1, and the first iteration always has i == 0, this jump is always taken and counted on the first iteration, biasing the bitmask towards 1. On the second iteration, i == 1, so the jump isn't taken and the executor takes the side exit, but this manipulates the Tier 1 instruction pointer directly, and doesn't update the bitmask. On the third iteration, i == 2 and the executor makes it past the successful jump. Etcetera.

All in all we now find that for the second and following calls to testfunc(), the bitmask for the first POP_JUMP_IF_FALSE only gets updated once per call, and it always gets marked as "taken", gradually shifting in more 1 bits, and hence increasing the "confidence" as well. I believe this is what caused this test to break the refleaks CI build (the extra confidence caused a second POP_JUMP_IF_FALSE to be kept in the trace).
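A toy model of that effect -- assuming the 16-bit history and popcount-based fraction described above, not the actual C code -- starting from some hypothetical mixed history and shifting in one "taken" sample per call:

bits = 0b0101010101010101          # hypothetical leftover history from earlier runs
for call in range(1, 9):
    # Each call to testfunc() runs one Tier 1 iteration with i == 0, where the
    # first POP_JUMP_IF_FALSE is always taken, so a single 1-bit is shifted in.
    bits = ((bits << 1) | 1) & 0xFFFF
    fraction = bin(bits).count("1") / 16
    print(f"after call {call}: taken fraction {fraction:.2f}")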

Is there a bug here? We might argue that we don't care -- this is obviously an unusual situation, regular apps won't be resetting the optimizer or clearing the executors, so the rolling over of the counter in JUMP_BACKWARD and the resulting strange bias in the POP_JUMP_IF_* bitmask won't affect regular users. But it sure surprised me!

gvanrossum (Member) commented

Use lower confidence and threshold when generating the initial trace:

* 0.8 * profile_fraction + 0.1 for bool checks (maybe higher for `is None` checks)

The first half of this is simple (it changes the multiplier to something in the range 0.5-0.9 instead of 0.5-1.0), and I'll send a PR for this.
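Roughly, that amounts to the following mapping (illustrative Python; the actual change is in the C trace-projection code), where predicted is the fraction of recorded outcomes that went the direction the trace follows, so it is at least 0.5:

def old_multiplier(predicted):
    return predicted                 # 0.5 (coin flip) .. 1.0 (always the same way)

def new_multiplier(predicted):
    return 0.8 * predicted + 0.1     # 0.5 (unchanged) .. 0.9 (capped below 1.0)

for predicted in (0.5, 0.75, 1.0):
    print(f"{predicted}: {old_multiplier(predicted):.2f} -> {new_multiplier(predicted):.2f}")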

I'm not sure what to do for None checks -- are you suggesting that if we see e.g. 16 None checks fail in a row we should be more confident that this trend continues than if we see 16 failing False checks in a row? That feels a bit specific to the examples we typically use in tests -- while the `if i & 0x01` example is nice for generating a worst-case example quickly, it doesn't feel very realistic, so I don't know how to pick a different threshold for None checks.

* 0.95 - 0.05 * misses recorded for type checks

Assuming you're referring to CALL_ISINSTANCE, the machinery needed to actually record such failed type checks is a bit more involved -- we have spare cache entries in that opcode where we can record the outcome, similarly to the POP_JUMP_IF_* opcodes, and then we could special-case that when constructing traces to adjust the confidence value accordingly. Is that what you had in mind? Or are you talking about guards like _GUARD_BOTH_INT? (Hopefully those have a much higher likelihood of being correct.)

Assuming CALL_ISINSTANCE, would you want the confidence adjustment to only happen when followed by a POP_JUMP_IF_*? How would we arrange that?

* Stop trace at a lower confidence than before, say < 0.2

I'm not ready for that yet. With the current threshold (1/3), it would take over 10 maximally-confident jumps to end the trace early -- that's gotta be very branchy code indeed. :-) For unpredictable jumps, there is no difference.
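A quick check of that figure, assuming the new 0.9 cap per maximally-confident jump and the existing 1/3 cutoff:

# How many 0.9-confidence jumps does it take to fall below the 1/3 threshold?
confidence, jumps = 1.0, 0
while confidence >= 1 / 3:
    confidence *= 0.9
    jumps += 1
print(jumps, round(confidence, 3))   # 11 jumps: 0.9**10 ≈ 0.349, 0.9**11 ≈ 0.314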

gvanrossum added a commit that referenced this issue Feb 22, 2024
The theory is that even if we saw a jump go in the same direction the
last 16 times we got there, we shouldn't be overly confident that it's
still going to go the same way in the future. This PR makes it so that
in the extreme cases, the confidence is multiplied by 0.9 instead of
remaining unchanged. For unpredictable jumps, there is no difference
(still 0.5). For somewhat predictable jumps, we interpolate.
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this issue Mar 4, 2024
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024