Skip to content

Conversation

Fidget-Spinner
Copy link
Member

@Fidget-Spinner Fidget-Spinner commented Oct 11, 2025

I know it isn't good practice to bundle multiple changes into the same PR, but this PR does two things which are required to get tail calling working on MSVC and which I think are beneficial.

  1. Move the _PyStackRef off a per-ceval allocated buffer into a per-thread one. This actually reduces the C stack consumption of ceval.c. I had planned this for some time (see Move stackref buffer to thread state to reduce interp stack usage #138115), but we also need this to let MSVC know the variables are fine to outlive their current scope. This may be a tiny bit slower versus allocating it on the stack, but it removes the performance cliff of having any function calling more than 10 arguments suddenly getting a lot slower due to having to allocate memory. Overall, I think this is a better design choice for the two reasons of less C stack consumption and allowing longer calls.
  2. Add __restrict to correct places and scoping where needed to tell MSVC that a local variable does not outlive the scope. We only use this for MSVC for now as other compilers have different qualifier rules. However, it's definitely possible to extend this to clang and gcc to let them benefit in the future.

There is no noticeable perf drop on a nogil benchmark for the normal interpreter https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20251011-3.15.0a0-0786133-NOGIL

Fidget-Spinner and others added 7 commits August 21, 2025 22:20
Co-Authored-By: Brandt Bucher <brandt@python.org>
Co-Authored-By: Hulon Jenkins <109993038+hulonjenkins@users.noreply.github.com>
Co-Authored-By: Chris Eibl <138194463+chris-eibl@users.noreply.github.com>
Copy link
Member

@zooba zooba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very neat, looks good to me, but don't let me be the final reviewer on this.

#endif

/* How much scratch space to give stackref to PyObject* conversion. */
#define MAX_STACKREF_SCRATCH 1024
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How arbitrary is this amount? What is the usual amount of space required for conversions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to arbitrarily always consume 10 PyObject/PyStackRef per call. Meaning even if you had a single item, the whole 10 slot space would be used.

inst(LOAD_BUILD_CLASS, ( -- bc)) {
PyObject *bc_o;
int err = PyMapping_GetOptionalItem(BUILTINS(), &_Py_ID(__build_class__), &bc_o);
PyObject *Py_MSVC_RESTRICT bc_o;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a more semantically meaningful name than Py_MSVC_RESTRICT here? "Unaliased" or "no escape"?

And possibly define it as a macro like #define UNALIASED(x) x Py_MSVC_RESTRICT to get UNALIASED(PyObject *) bc_o?

Just looking to reduce the sense of "oh I have to go look up this MSVC thing to know what's going on here" when reading the code. There may be other ways.

Comment on lines 2294 to 2296
int *Py_MSVC_RESTRICT method_found_ptr = &method_found;
attr_o = _PySuper_Lookup(cls, self, name,
Py_TYPE(self)->tp_getattro == PyObject_GenericGetAttr ? method_found_ptr : NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent? Like in other places

PyObject *stack[] = {class, self};
PyObject *super = PyObject_Vectorcall(global_super, stack, oparg & 2, NULL);
PyObject *super;
Py_BEGIN_LOCALS_MUST_NOT_ESCAPE();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually don't require parentheses on macros like this, especially when they define a scope.

<ClCompile Include="..\Python\brc.c" />
<ClCompile Include="..\Python\ceval.c" />
<ClCompile Include="..\Python\ceval.c">
<AdditionalOptions Condition="'$(UseTailCallInterp)' == 'true' and $(PlatformToolset) != 'ClangCL'">/std:clatest %(AdditionalOptions)</AdditionalOptions>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we pass this option to clang-cl? Does it break?

Any possibility of passing a specific /std:c rather than "latest"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I've worked with Chris on this and apparently it builds with clang-cl still the last time I checked with him? @chris-eibl

Copy link
Member

@chris-eibl chris-eibl Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we pass this option to clang-cl? Does it break?

Yupp, clang-cl doesn't like /std:clatest, hence I omitted it, because it doesn't need it.
Seems the clang-cl versions we are using (>=19)

  • support by default at least C99 (which is would be needed for restrict)
  • and support must tail, etc, in the "clang specific notation"

But I assume MSVC needs /std:clatest for [[msvc::musttail]] - @Fidget-Spinner correct?
And at least MSVC definitely needs at least /std:c11 for restrict (https://learn.microsoft.com/en-us/cpp/build/reference/std-specify-language-standard-version?view=msvc-170).

Any possibility of passing a specific /std:c rather than "latest"?

Yeah, we could explicitely enforce /std:c11for clang-cl, but we didn't need it in the past and AFAICT this PR didn't change anything that clang-cl sees, due to carefully crafted combinations of _MSC_VER and __clang__ like

#if defined(_MSC_VER) && !defined(__clang__)

and apparently it builds with clang-cl still the last time I checked with him?

Yes, it still does. Having done a lot of builds recently and still playing around.
And tail calling CI is green as well: https://github.com/python/cpython/actions/runs/18435075900/job/52527447202

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm under the impression we need clatest for msvc musttail as well.

Fidget-Spinner and others added 3 commits October 14, 2025 04:49
# define Py_NO_INLINE_MSVC_TAILCALL
#endif

// Tells the compiler that this variable cannot be alised.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar nit, but "cannot be aliased" is ambiguous (because it's not clear whether we are choosing not to alias it, or telling the compiler that it cannot choose to alias it).

I think we're promising that it won't be aliased? So "Tells the compiler to trust that we haven't aliased this variable" better explains what it's doing. Or just "... this variable is not going to be aliased".

Alternatively, "Tells the compiler not to alias the variable" (but I don't think that even makes sense, let alone being correct here).

Comment on lines +3256 to +3257
PyObject *attr_o;
Py_BEGIN_LOCALS_MUST_NOT_ESCAPE;
Copy link
Member

@zooba zooba Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another crazy idea, based on seeing how this is used (feel free to ignore unless you really like it):

Py_BEGIN_RESTRICTED_PYOBJECT_SCOPE(attr_o);
...
attr_o = ...
...
Py_END_RESTRICTED_PYOBJECT_SCOPE;

It seems like it only gets used in the same pattern, and this makes it really clear what it's for (at the expense of overly limiting how it could be used).

But it's all generated code, so not a big deal.

@markshannon
Copy link
Member

I know it isn't good practice to bundle multiple changes into the same PR

Indeed not. Can we move all the restrict annotations to their own PR? Are they MSVC specific? If the restrict annotation is correct, we should add them for all platforms. If they are not correct we shouldn't add them anywhere.

// we make no attempt to optimize here; specializations should
// handle any case whose performance we care about
PyObject *super;
Py_BEGIN_LOCALS_MUST_NOT_ESCAPE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Py_BEGIN_LOCALS_MUST_NOT_ESCAPE;
{ // 'stack' array must not escape

@Fidget-Spinner
Copy link
Member Author

I know it isn't good practice to bundle multiple changes into the same PR

Indeed not. Can we move all the restrict annotations to their own PR? Are they MSVC specific? If the restrict annotation is correct, we should add them for all platforms. If they are not correct we shouldn't add them anywhere.

Fair. This PR will be just fixing the stackref buffer then.

#define SPECIAL_MAX 3

// Special counterparts of ceval functions for performance reasons
PyAPI_FUNC(int) _PyEval_Mapping_GetOptionalItem(PyObject *obj, PyObject *key, PyObject **result);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the only difference the restrict annotation?
We could just add the annotation to PyMapping_GetOptionalItem the result pointer can never alias the other parameters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We can't because PyMapping_GetOptionalItem is part of the public API, and we can't change the signature of it.

Copy link
Member

@markshannon markshannon Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't changing the signature, just telling the compiler than one of its parameters can't alias the others.
I'm surprised we need to do this anyway. I would have thought that strict aliasing already tells MSVC that they can't alias.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is, that strict aliasing can't help here, since in

PyMapping_GetOptionalItem(PyObject *obj, PyObject *key, PyObject **result)

all parameters are of type PyObject. And by using restrict we tell the compiler that "we assure you, that none of these PyObjects point to the same memory", and if we brake that contract, UB kicks in.

@markshannon
Copy link
Member

What is necessary for tail calling?

IIUC, the restict annotations and the tighter scoping around arrays in the interpreter are necessary, but moving the temporary arrays from the C stack to the threadstate are not. Is that correct?

@Fidget-Spinner
Copy link
Member Author

IIUC, the restict annotations and the tighter scoping around arrays in the interpreter are necessary, but moving the temporary arrays from the C stack to the threadstate are not. Is that correct?

No this is needed too. The problem is that the temporary arrays from the C stack are local variables. These variables are passed around in ways that are unpredictable to MSVC (such as passing to an arbitrary vectorcall function pointer). So we have to move this to thread state to get tail calling.

There are other side benefits and I've wanted to do this for awhile, so I thought might as well put this in this PR. Do you still want to split up the restrict stuff knowing that we need both to get TC on MSVC?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants