Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Recommendation]: Don't mess around with builtins, that's what modules are for :) #96

Closed
rwgk opened this issue Oct 27, 2022 · 11 comments
Closed

Comments

@rwgk
Copy link

rwgk commented Oct 27, 2022

Problem description

The title is by @Yhg1s, I'm just the messenger here.

We happened to stumble over this in connection with pybind11, then I looked in nanobind and see that's also adding capsules to builtins:

int rv = PyDict_SetItemString(PyEval_GetBuiltins(), NB_INTERNALS_ID, capsule);

Thomas wrote:

  • pybind11 should use a separate module for this instead.
  • messing with builtins is going to have some undesireable performance effects in the future.
  • If you want per-interpreter state that's easily accessible, just stuff a container object into sys.modules.

Reproducible example code

No response

@wjakob
Copy link
Owner

wjakob commented Oct 27, 2022

Thanks for for reaching out @Yhg1s and @rwgk. Indeed, it would be nice to do this in a better way!

Just for context, what is this good for? nanobind and pybind11 maintain some internal data structures about the bindings. Some complex binding projects are split up across multiple extension modules, and in that case those internal data structures must be shared.

So this ugly capsule in PyEval_GetBuiltins() is so that the separate pybind11/nanobind extensions can find each other and exchange information.

It's actually kind of undesirable that this is even visible at all. The best solution would be if there is some "secret" dictionary only accessible by CPython (but not pure Python) code, where such things can be stashed.

The problem I see with sys.modules is that the user could simply delete the capsule from there in pure python code, which could trigger serious problems. (Perhaps builtins() has the same issue, I am actually not sure).

@Yhg1s, do you have any suggestions?

@JelleZijlstra
Copy link

For what it's worth, users can also delete things from builtins (with predictably catastrophic effects):

In [17]: del sys.modules["builtins"].print
Error in sys.excepthook:
Traceback (most recent call last):
  File "/main_instance_shell/jelle/venv3.9/lib/python3.9/site-packages/IPython/core/application.py", line 288, in excepthook
    return crashhandler.crash_handler_lite(etype, evalue, tb)
  File "/main_instance_shell/jelle/venv3.9/lib/python3.9/site-packages/IPython/core/crashhandler.py", line 227, in crash_handler_lite
    traceback.print_exception(etype, evalue, tb)
  File "/main_instance_shell/jelle/venv3.9/lib/python3.9/traceback.py", line 105, in print_exception
    print(line, file=file, end="")
NameError: name 'print' is not defined

@wjakob
Copy link
Owner

wjakob commented Oct 27, 2022

@Yhg1s, @JelleZijlstra: What about the Py_LIMITED_API function PyInterpreterState_GetDict()? (docs link: https://docs.python.org/3/c-api/init.html#c.PyInterpreterState_GetDict). Would that be an acceptable data structure for this purpose?

(FWIW the documentation of this function also mentions PyModule_GetState, but that one doesn't seem to be an option for us. The capsule is for inter-module communication and must be stored globally somewhere.)

@wjakob
Copy link
Owner

wjakob commented Oct 27, 2022

(By the way: @rwgk -- this may be an interesting coordinated switch to also do in pybind11 along with the internals ABI bump)

@Yhg1s
Copy link

Yhg1s commented Oct 27, 2022

There's nothing that is usable by an extension module, provided by Python, that isn't also usable by other modules. That includes builtins. (The builtins you're currently setting are even easier to access than the separate module I suggested, since you don't even need to import anything :)

If you want some way to control who can change things, you will need to add access control yourself. A separate module with accessor functions may be what you're looking for there, but how you're going to prevent people from using it incorrectly, I don't know.

@wjakob
Copy link
Owner

wjakob commented Oct 27, 2022

Is it possible to access the dictionary PyInterpreterState_GetDict() from pure Python code? (I don't see how barring some use of ctypes.)

@wjakob
Copy link
Owner

wjakob commented Oct 27, 2022

I was looking for where this API came from originally, and it seems to serve a related purpose in the cffi module: python/cpython#80305

wjakob added a commit that referenced this issue Oct 27, 2022
Previously, nanobind stored its internal data structures in the
``builtins`` dictionary, which was rightfully criticised in issue #96.
This tentative commit instead uses the interpreter state dictionary
exposed via ``PyInterpreterState_GetDict()``.
@JelleZijlstra
Copy link

PyInterpreterState_GetDict does look like a good fit for your use case, but I'm not a specialist in this area.

wjakob added a commit that referenced this issue Oct 27, 2022
Previously, nanobind stored its internal data structures in the
``builtins`` dictionary, which was rightfully criticised in issue #96.
This tentative commit instead uses the interpreter state dictionary
exposed via ``PyInterpreterState_GetDict()``.
@Yhg1s
Copy link

Yhg1s commented Oct 28, 2022

PyInterpreterState_GetDict() gets you a regular dict, which you can always get access to from Python code. (If nothing else, through the gc module.) What I'm trying to say is that you cannot protect against users intentionally messing with these internals, regardless of whether it's in builtins, a regular module or a special object in sys.modules, PyInterpreterState_GetDict or some other storage. Not even from Python code.

@wjakob
Copy link
Owner

wjakob commented Oct 28, 2022

What I'm trying to say is that you cannot protect against users intentionally messing with these internals, regardless of whether it's in builtins, a regular module or a special object in sys.modules, PyInterpreterState_GetDict or some other storage.

Okay, that makes sense. I am not trying to protect against such behavior. I moved storage of the internal capsule from builtins to the interpreter state dictionary (PyInterpreterState_GetDict()) in commit ca23da7.

The previous location (PyEval_GetBuiltins()) problematic for the reasons pointed out above. Automatically exporting this implementation detail to every module is also kind of horrible. The reason for not sticking the capsule into sys.modules is because this inter-module scratch space does not correspond to the notion of a module. It would seem like a misuse of that data structure.

PyInterpreterState_GetDict() is nice because it is not directly exposed to the user. Somebody malicious user could still try to get hold of this dictionary and break things, but we are not trying to protect against such an adversary.

For posterity, some answers to the question "what could possibly go wrong?":

  • If somebody uses gc to find and mess up the dictionary in question, the consequence is that calls to bound functions involving types that span across multiple extension libraries fail with TypeError.
  • If somebody manages to stick a capsule pointing to uninitialized memory in that place, it could crash the interpreter. Commit 42db2fd adds one small bit of extra protection, which is to set a name (e.g. __nb_internals_v5_gcc_libstdcpp__) to the capsule name attribute. nanobind won't decode a capsule if it has a different name. Note that this is mostly protection against unforeseen bugs rather than a malicious user. Somebody could still, e.g., use ctypes to create a capsule themselves which has that name and references some unrelated process memory. But they can already break things in arbitrary ways at that point without using nanobind.

@markshannon
Copy link

From a performance perspective (ignoring code quality issues), it is fine to add values to builtins, provided that:

  • It is done early as possible, ideally before importing any modules.
  • The value never changes.

Python 3.11 is fairly forgiving of changing builtins later, but future versions of CPython won't be, and neither is PyPy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants