
We need to change the contract and interface of _PyExecutorObject and _PyOptimizerObject #108866

Closed
markshannon opened this issue Sep 4, 2023 · 12 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@markshannon
Member

markshannon commented Sep 4, 2023

The contract of _PyExecutorObject currently is that it executes zero or more bytecode instructions.
We should change that so that it must execute at least one instruction.
The reason for this change is that we want to be able to use ENTER_EXECUTOR anywhere, which means we will need to replace arbitrary instructions with ENTER_EXECUTOR (see below for why).

If a _PyExecutorObject executes zero instructions, then ENTER_EXECUTOR is responsible for executing the original instruction.

If it executes one or more instructions, the behavior of the first instruction is handled by the _PyExecutorObject, so ENTER_EXECUTOR is just a simple, fast (tail) call.

We also want to change the signature of the execute function pointer to take _PyExecutorObject ** instead of _PyExecutorObject *.
See faster-cpython/ideas#621 for details.

We might as well make both changes at once.
I think our only "real" optimizer already executes at least three instructions, so it should be a fairly easy change.

Why do we need to insert executors at arbitrary instructions?

Consider a nested if with at least two balanced hot paths, and at least one cold path.
At the join point, we want both paths to continue in optimized code, but since neither represents more than 50% of the flow, each will likely stop at the join point. Ideally, both will jump into the same following optimized code. But in order to find that code, it needs to be attached to the tier 1 instructions using ENTER_EXECUTOR, and the join point could be an arbitrary instruction, likely a LOAD_FAST or LOAD_GLOBAL.

Note that this makes life harder for the optimizer, as it cannot simply exit the optimized code if a guard fails in the first instruction. It is obliged to fully execute that instruction.

Other optimizers might also want to overwrite instructions with ENTER_EXECUTOR; PyTorch Dynamo, for example.


@markshannon
Member Author

We haven't implemented optimization on RESUME yet. One of the reasons for that is the awkwardness of handling executors that execute zero instructions.

@markshannon
Member Author

We may also need the executor to return the next_instr, not the frame, as INSTRUMENTED_LINE needs both the next instruction (itself) and frame->prev_instr to point to the previous instruction.

@gvanrossum
Member

Note that this makes life harder for the optimizer, as it cannot simply exit the optimized code if a guard fails in the first instruction. It is obliged to fully execute that instruction.

I think you meant this makes life harder for the executor? Or maybe it's harder for both. :-)

I'm not sure how to implement the proposal except by ensuring in the optimizer that the first Tier 1 instruction it translates doesn't deopt. This is easily checked, since gh-108583 introduced the HAS_DEOPT_FLAG and OPCODE_HAS_DEOPT(OP) to check for it. But it would reduce the effectiveness of the optimizer in case the first instruction can deoptimize -- we'd be forced not to optimize that trace at all. This would currently affect all for loops, because they all start with a specialization of FOR_ITER, all of which start with a guard.

I wouldn't want the Tier 2 executor to be responsible for executing the deoptimized Tier 1 instruction. I suppose in some cases we could define the "family head" in Tier 1 as a macro that is translatable to Tier 2 (in some cases it may already be translatable, though that isn't the case for FOR_ITER mentioned above), and the deopt operation could be reimplemented as a branch to a stub containing that macro expansion, followed by SAVE_IP and EXIT_TRACE. This would require changing all (or at least some) guard uops into specially-marked branch uops. I guess it's possible, but not simple.

We may also need the executor to return the next_instr, not the frame, as INSTRUMENTED_LINE needs both the next instruction (itself) and frame->prev_instr to point to the previous instruction.

Are you thinking of having INSTRUMENTED_LINE translated to Tier 2? Right now Tier 2 assumes all instrumentation is off, and attempts to enable it for code that is currently executing in Tier 2 should cause the executor to bail at the next CHECK_EVAL_BREAKER() call (that isn't implemented though).

Anyway, I see the attractiveness of the idea (you can plunk ENTER_EXECUTOR pretty much anywhere and you won't ever have to look at the opcode/oparg pair it hides), but I also see some issues implementing it.

@markshannon
Member Author

This would currently affect all for loops, because they all start with a specialization of FOR_ITER, all of which start with a guard.

We start optimizing on JUMP_BACKWARD, which cannot fail and which we currently skip in ENTER_EXECUTOR to avoid the potential for an infinite loop. FOR_ITER is the second instruction, so it is fine if that de-opts.

I wouldn't want the Tier 2 executor to be responsible for executing the deoptimized Tier 1 instruction.

Indeed. So don't. That's what EXIT_TRACE is for: to avoid executing anything that is better handled by tier 1.

Are you thinking of having INSTRUMENTED_LINE translated to Tier 2?

No

I also see some issues implementing it.

Being obliged to execute the first instruction would make things complicated, except that there is no obligation to optimize at all. If it is too complicated, just let tier 1 handle it.

@gvanrossum
Member

Okay, since the "first instruction" in this case is always JUMP_BACKWARD, this all seems fine then. I guess in the future we'll also have RESUME, which also can't fail (or we can know statically whether it can fail, right?).

@gvanrossum
Member

Consider a nested if with at least two balanced hot paths, and at least one cold path.
At the join point, we want both paths to continue in optimized code, but since neither represents more than 50% of the flow, each will likely stop at the join point. Ideally, both will jump into the same following optimized code. But in order to find that code, it needs to be attached to the tier 1 instructions using ENTER_EXECUTOR, and the join point could be an arbitrary instruction, likely a LOAD_FAST or LOAD_GLOBAL.

I just realized I don't think I understood this example before. :-(

But maybe another motivation works for me. If ENTER_EXECUTOR exits without error and with the instruction pointer still pointing at the same (ENTER_EXECUTOR) instruction, presumably because the instruction that ENTER_EXECUTOR replaced needs to deoptimize, we'd need to special case this, otherwise we'd just be attempting to execute ENTER_EXECUTOR again, likely causing an infinite loop.

However, consider the example you gave, where we want to replace a specialization of LOAD_GLOBAL with ENTER_EXECUTOR. This specialization (e.g. LOAD_GLOBAL_MODULE) contains a guard, which calls DEOPT_IF(). So it looks like the Tier 2 interpreter won't be able to honor the contract "execute at least one instruction" in that case, and the optimizer has no choice but to refuse to optimize at all when presented with that particular first instruction.

So I'm still a little bit stumped. Aren't we introducing a lot of complexity with this requirement?

@gvanrossum
Member

Separately, it would be nice if the ENTER_EXECUTOR patch could be inserted at the top of the loop instead of on the JUMP_BACKWARD, so that future calls won't have to go through the Tier 1 interpreter for the first iteration. But this would definitely run into the "at least one instruction" thing.

The special-casing required wouldn't be too terrible, would it?

@markshannon
Member Author

However, consider the example you gave, where we want to replace a specialization of LOAD_GLOBAL with ENTER_EXECUTOR. This specialization (e.g. LOAD_GLOBAL_MODULE) contains a guard, which calls DEOPT_IF(). So it looks like the Tier 2 interpreter won't be able to honor the contract "execute at least one instruction" in that case, and the optimizer has no choice but to refuse to optimize at all when presented with that particular first instruction.

That is all true, but easily solved for our tier 2 optimizer; we simply don't insert ENTER_EXECUTOR at a LOAD_GLOBAL.

However, the ability to add an ENTER_EXECUTOR anywhere is empowering. We might well want to insert ENTER_EXECUTOR in places other than JUMP_BACKWARD and RESUME.
Third party optimizers like PyTorch Dynamo might also want this capability to resume after what they call "graph breaks".

Separately, it would be nice if the ENTER_EXECUTOR patch could be inserted at the top of the loop

We want to enter the executor at the end of the loop. It effectively gives us loop peeling for free, which improves type stability.

The special-casing required wouldn't be too terrible, would it?

Only if it makes things measurably better.

@gvanrossum
Member

Offline there was considerable skepticism about the feasibility -- e.g. even RESUME_CHECK still deoptimizes (in fact that's how it works :-), so by these rules it cannot be replaced by ENTER_EXECUTOR. @brandtbucher believes a simple check after the executor has run for whether it made any progress will be feasible. Instead of "measurably faster" the bar will probably have to be "not measurably slower".

Point taken about loop peeling.

@markshannon
Member Author

markshannon commented Nov 14, 2023

There seems to be some reluctance to do this. So let me give the reasons why we need to do this.

We want a guarantee of progress when the flow of control becomes obscure, as it does when we are moving from tier 1 to tier 2 and from trace to trace, otherwise the program can get stuck.

If executors are not required to make progress, we cannot stitch them together freely, as we might end up forming a loop that makes no progress.

Consider three traces E, A, B where E is an executor and A and B are exits, stubs, or some other non-executor trace.
Provided we start with E, and E is guaranteed to make progress, we can freely join A and/or B to form larger regions, including loops.
However if E is not guaranteed to make progress, stitching becomes more complicated and thus error-prone and slower.

If the requirement that an executor makes progress were difficult to implement, I would understand the reluctance.
But it isn't.
There are a few ways it can be done. Here are two simple ones:

  • Always de-specialize the first instruction in the trace, e.g. convert LOAD_GLOBAL_MODULE into LOAD_GLOBAL
  • Use a special exit stub for the first instruction that executes the de-specialized instruction before continuing.

@gvanrossum
Member

The example is too abstract, because I haven't seen the future yet where traces are joined. Is joining the same as stitching? Where would A and B come from? Currently all we have are exit stubs, which live in an executor. I would need to understand more about stitching before believing that dropping the progress requirement causes a problem. I'm happy to hold off judgment until we are actually stitching.

Regarding tricks to force progress, consider CALL. It's now macro(CALL) = _SPECIALIZE_CALL + _CALL; but _CALL is not a viable uop. Likely that is because it uses DISPATCH_INLINED() and manipulates next_instr. Breaking it up into uops would not help, because then we'd have guards again that could deopt.

@gvanrossum
Member

All right, offline I got a better understanding of stitching. The idea is that (almost) every exit from a trace (as of gh-112045 always through a deopt) has a counter, and if it becomes hot enough, we project a new trace from that point, and change the deopt exit to transfer directly to that new trace. This should give us polymorphism as well (e.g. the deopt from BINARY_OP_ADD_INT may lead to another trace that uses BINARY_OP_ADD_FLOAT). There are several possible mechanisms for this. One idea is that we could just reset the Tier 1 instruction to its unspecialized form and reset its counter. The Tier 1 version would only be executed when the Executor deopts, so in the example it would never see the int arguments. Another approach would have counters on the Tier 2 deopt exits. We could try both approaches together for different situations.

Anyway, I do see that this kind of stitching would be complicated if the target executor makes no progress. One simple way to guarantee that, BTW, would be to just leave certain instructions (like unspecializable CALL) in Tier 1 and create a trace with the first guaranteed-success Tier 1 bytecode after that (e.g. LOAD_FAST).
