Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-111520: Integrate the Tier 2 interpreter in the Tier 1 interpreter #111428

Merged
merged 27 commits into from
Nov 1, 2023

Conversation

gvanrossum
Copy link
Member

@gvanrossum gvanrossum commented Oct 28, 2023

See also faster-cpython/ideas#631

Beware! This changes the env vars to enable uops and their debugging to PYTHON_UOPS and PYTHON_LLTRACE.

Use `GOTO_ERROR(error)` instead of `goto error`.
This macro is defined differently in the tier two interpreter.
This replaces PYTHONUOPSDEBUG=N. The meaning of N is the same (for now):

  0: no tracing
  1: print when tier 2 trace created
  2: print contents of tier 2 trace
  3: print every uop executed
  4: print optimization attempts and details
  5: print tier 1 instructions and stack
@gvanrossum
Copy link
Member Author

With uops always enabled, only test_embed fails, but that's the same on main (see GH-111339). I added that to benchmark Tier 2 perf (who knows, the speedier Tier switching may make things faster there).

@gvanrossum
Copy link
Member Author

@markshannon Can you review this? If you think this is roughly going in the right direction I will clean it up.

Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks promising.

Python/bytecodes.c Outdated Show resolved Hide resolved
Python/ceval.c Outdated Show resolved Hide resolved
Python/ceval.c Outdated Show resolved Hide resolved
Python/ceval.c Outdated

OPT_STAT_INC(traces_executed);
_Py_CODEUNIT *ip_offset = (_Py_CODEUNIT *)_PyFrame_GetCode(frame)->co_code_adaptive;
_PyUOpInstruction *next_uop = self->trace;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have been set by ENTER_EXECUTOR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and no. It would mean that next_uop would have to be a local in the top-level function scope. I am trying to reduce the number of locals there. For now let me keep it this way. I expect that it will be fine even when we start stitching, because each trace can only be entered at the top and we need the executor (so that we can decref it upon exiting the trace). The initial next_uop is just the executor plus a fixed offset anyway.

Python/ceval.c Show resolved Hide resolved
);
goto error_tier_two;

pop_4_error_tier_two:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this be shared with tier 1? Unwinding should work the same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, not quite. The debug output is different, the stats collection is different, but most importantly, we need to DECREF(self) here before jumping to error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we need to set next_instr = frame->instr_ptr.

@gvanrossum gvanrossum changed the title Integrate the Tier 2 interpreter in the Tier 1 interpreter gh-111520: Integrate the Tier 2 interpreter in the Tier 1 interpreter Oct 30, 2023
- Remove CHECK_EVAL_BREAKER() from top of Tier 2 loop
- Make Tier 2 default case Py_UNREACHABLE() in non-debug mode
- GOTO_TIER_TWO() macro instead of `goto enter_tier_two`
- Inline ip_offset (Tier 2 LOAD_IP() is now a no-op)
- Move #define/#under TIER_ONE/TIER_TWO into generated .c.h files
- ceval_macros.h defines Tier 1 macros, Tier 2 is inlined in ceval.c
Python/bytecodes.c Outdated Show resolved Hide resolved
Python/ceval.c Outdated Show resolved Hide resolved
Python/ceval.c Outdated Show resolved Hide resolved
Python/ceval.c Outdated

OPT_STAT_INC(traces_executed);
_Py_CODEUNIT *ip_offset = (_Py_CODEUNIT *)_PyFrame_GetCode(frame)->co_code_adaptive;
_PyUOpInstruction *next_uop = self->trace;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and no. It would mean that next_uop would have to be a local in the top-level function scope. I am trying to reduce the number of locals there. For now let me keep it this way. I expect that it will be fine even when we start stitching, because each trace can only be entered at the top and we need the executor (so that we can decref it upon exiting the trace). The initial next_uop is just the executor plus a fixed offset anyway.

Python/ceval.c Show resolved Hide resolved
);
goto error_tier_two;

pop_4_error_tier_two:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, not quite. The debug output is different, the stats collection is different, but most importantly, we need to DECREF(self) here before jumping to error.

Python/ceval.c Outdated
Comment on lines 1048 to 1051
frame->return_offset = 0; // Don't leave this random
_PyFrame_SetStackPointer(frame, stack_pointer);
Py_DECREF(self);
goto resume_with_error;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may not need to reset return_offset; we can also skip syncing stack_pointer; so we could make this slightly faster as follows (but it doesn't matter since errors are presumed to be on the slow path):

Suggested change
frame->return_offset = 0; // Don't leave this random
_PyFrame_SetStackPointer(frame, stack_pointer);
Py_DECREF(self);
goto resume_with_error;
Py_DECREF(self);
next_instr = frame->instr_ptr;
goto error;

Copy link
Member Author

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider this sufficiently cleaned up to undraft it, and if @markshannon doesn't comment I'll merge tomorrow.

I could noodle endlessly with the code for dropping from tier 2 back into tier 1, with and without errors; but it may be better to put that off until a more careful review.

@gvanrossum gvanrossum marked this pull request as ready for review November 1, 2023 00:23
@gvanrossum gvanrossum requested a review from a team as a code owner November 1, 2023 00:23
@gvanrossum gvanrossum marked this pull request as draft November 1, 2023 00:42
@gvanrossum
Copy link
Member Author

Whoops, back to draft, I accidentally left in the uops-forever cherrypick. Will fix later.

(I accidentally kept this commit after pushing it experimentally.
This means I've been testing with uops on all the time, which is
actually pretty amazing.)

This reverts commit e0e60ce.
@gvanrossum gvanrossum marked this pull request as ready for review November 1, 2023 04:35
@gvanrossum
Copy link
Member Author

Alas, we still run out of stack space on the recursion tests on Windows. I'll keep working on this.

@zooba
Copy link
Member

zooba commented Nov 1, 2023

Alas, we still run out of stack space on the recursion tests on Windows. I'll keep working on this.

The tests run against debug builds, which will have very different stack usage patterns compared to optimised builds. I wouldn't worry too much about changing anything other than the default recursion limit (which is different for debug vs. release), and then do a buildbot run to check the optimised builds.

Otherwise, the best way to reduce stack usage will be to reduce the size of local variables, possibly by refactoring into separate functions so that the variables are only allocated temporarily and don't remain on the stack as the Python code recurses.

@gvanrossum
Copy link
Member Author

gvanrossum commented Nov 1, 2023

Otherwise, the best way to reduce stack usage will be to reduce the size of local variables, possibly by refactoring into separate functions so that the variables are only allocated temporarily and don't remain on the stack as the Python code recurses.

But the whole exercise is to merge two functions in one because we want to be able to transfer back and forth using goto rather than calls!

I will see by how much the debug recursion limit needs to be adjusted to make the tests pass; I think I have only two extra local variables left.

@gvanrossum gvanrossum added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Nov 1, 2023
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @gvanrossum for commit fdf1a2f 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Nov 1, 2023
@gvanrossum
Copy link
Member Author

If this now passes the tests and the buildbots don't freak out I am going to merge this. However, @markshannon, could you have a look at how I fixed test_call? This feels like a band-aid.

@zooba
Copy link
Member

zooba commented Nov 1, 2023

Another fix would be to increase the value on this line:

<StackReserveSize Condition="$(Configuration) == 'Debug'">8000000</StackReserveSize>

That's what determines how much stack is available, and running out is what causes the hard crash. As you can see, it's already 4x larger for debug builds than releases, but it's still only 8MB and so making it larger isn't really going to hurt anyone much.

(I'll note that I haven't looked through the whole PR, so I don't know exactly what is causing it or whether the existing recursion limit comes into play at all. I'm just throwing out helpful pointers and trusting that you guys know what's actually going on :) )

@gvanrossum
Copy link
Member Author

Another fix would be to increase the value on this line:

Ah, thanks. That makes more sense. I don't know exactly what's going on either, but I know that I've added some local variables to the function containing the main interpreter loop, and that causes hard crashes in a bunch of tests that recurse deeply (at the C level). I'm now applying your fix instead of my previous hacks -- except I'm keeping the bit in test.support that dynamically pulls the value of Py_C_RECURSION_LIMIT1 from _testcapi (if it exists) rather than hardcoding a value that must be kept in sync with something defined in pystate.h. (It's technically still the case, but only if _testcapi can't be imported.)

I've just pushed that and if tests still pass I'll merge.

@gvanrossum gvanrossum merged commit 7e135a4 into python:main Nov 1, 2023
34 checks passed
@gvanrossum gvanrossum deleted the mix-tiers branch November 1, 2023 20:13
FullteaR pushed a commit to FullteaR/cpython that referenced this pull request Nov 3, 2023
…preter (python#111428)

- There is no longer a separate Python/executor.c file.
- Conventions in Python/bytecodes.c are slightly different -- don't use `goto error`,
  you must use `GOTO_ERROR(error)` (same for others like `unused_local_error`).
- The `TIER_ONE` and `TIER_TWO` symbols are only valid in the generated (.c.h) files.
- In Lib/test/support/__init__.py, `Py_C_RECURSION_LIMIT` is imported from `_testcapi`.
- On Windows, in debug mode, stack allocation grows from 8MiB to 12MiB.
- **Beware!** This changes the env vars to enable uops and their debugging
  to `PYTHON_UOPS` and `PYTHON_LLTRACE`.
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
…preter (python#111428)

- There is no longer a separate Python/executor.c file.
- Conventions in Python/bytecodes.c are slightly different -- don't use `goto error`,
  you must use `GOTO_ERROR(error)` (same for others like `unused_local_error`).
- The `TIER_ONE` and `TIER_TWO` symbols are only valid in the generated (.c.h) files.
- In Lib/test/support/__init__.py, `Py_C_RECURSION_LIMIT` is imported from `_testcapi`.
- On Windows, in debug mode, stack allocation grows from 8MiB to 12MiB.
- **Beware!** This changes the env vars to enable uops and their debugging
  to `PYTHON_UOPS` and `PYTHON_LLTRACE`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants