Add sys.stdlib_module_names: list of stdlib module names (Python and extension modules) #87121
Some use cases need to know whether a module comes from the stdlib. For example, I would like to dump only extension modules which don't come from the stdlib in bpo-42923: "Py_FatalError(): dump the list of extension modules".

Stdlib modules are special. For example, their maintenance and updates are tied to the Python lifecycle. Stdlib modules cannot be updated with "pip install --upgrade". They are shipped with the system ("system" Python). They are usually "read only": on Unix, only the root user can write into the /usr directory where the stdlib is installed, whereas modules installed with "pip install --user" can be modified by the current user.

There is a third party project on PyPI which contains the list of stdlib modules:

There is already sys.builtin_module_names:

I propose to add a similar sys.module_names tuple of strings (module names).

There are different constraints:
Should we only list top level packages, or also submodules? For example, only list "asyncio", or list the 31 submodules (asyncio.base_events, asyncio.futures, ...)? Maybe it can be decided on a case by case basis. For example, I consider that "os.path" is a stdlib module, even if it's just an alias to "posixpath" or "ntpath" depending on the platform.

I propose to include all extensions in the list, even if they are not built/available on some platforms. For example, "winsound" would also be listed on Linux, even if the extension is specific to Windows.

I also propose to include stdlib module names even if they are overridden at runtime using PYTHONPATH with a different implementation. For example, "asyncio" would be in the list, even if a user creates an "asyncio.py" file. The list would not depend on sys.path.

--

Another option is to add an attribute to modules to mark them as coming from the stdlib. The API would be an attribute: module.__stdlib__ (bool). The attribute could be set directly in the module code. For example, add "__stdlib__ = True" in Python modules. Similar idea for C extension modules. Or the attribute could be set after importing the module, in the import site. But we don't control how stdlib modules are imported.

--

For the specific case of bpo-42923, another option is to use a list of stdlib paths, check module.__file__ to determine whether a module is a stdlib module, and also use sys.builtin_module_names. That way, no API needs to be added. |
On Fedora 33, the stdlib lives in two main directories:
Example:
>>> import os.path
>>> import _asyncio
>>> os.path.dirname(os.__file__) # Python
'/usr/lib64/python3.9'
>>> os.path.dirname(_asyncio.__file__)
'/usr/lib64/python3.9/lib-dynload'

The Python stdlib path can be retrieved with:
>>> import sysconfig; sysconfig.get_paths()['stdlib']
'/usr/lib64/python3.9'

But I'm not sure how to retrieve the /usr/lib64/python3.9/lib-dynload path. "platstdlib" is not what I expect:
>>> import sysconfig; sysconfig.get_paths()['platstdlib']
'/usr/lib64/python3.9'

I found DESTSHARED in the Makefile:
>>> sysconfig.get_config_var('DESTSHARED')
'/usr/lib64/python3.9/lib-dynload'
|
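The path-based option described above can be sketched as follows. This is only a minimal illustration, not what the PR implements: the helper names `stdlib_directories` and `file_in_stdlib` are hypothetical, and the sketch assumes that third party code lives in a `site-packages` or `dist-packages` directory.

```python
import os
import sysconfig


def stdlib_directories() -> list[str]:
    """Collect candidate stdlib directories, skipping missing values."""
    paths = sysconfig.get_paths()
    candidates = [
        paths.get("stdlib"),
        paths.get("platstdlib"),
        # DESTSHARED comes from the Makefile; it is None where no Makefile
        # is used (for example on Windows).
        sysconfig.get_config_var("DESTSHARED"),
    ]
    unique = []
    for directory in candidates:
        if directory and directory not in unique:
            unique.append(directory)
    return unique


def file_in_stdlib(filename: str) -> bool:
    """Guess whether filename lives in one of the stdlib directories."""
    real = os.path.realpath(filename)
    parts = real.split(os.sep)
    # Assumption: third party code lives in site-packages or dist-packages,
    # which are usually nested inside the stdlib directory.
    if "site-packages" in parts or "dist-packages" in parts:
        return False
    for directory in stdlib_directories():
        prefix = os.path.realpath(directory) + os.sep
        if real.startswith(prefix):
            return True
    return False
```

A caller would pass `module.__file__` to `file_in_stdlib()`. Built-in modules have no `__file__`, so they would still need the separate sys.builtin_module_names check mentioned earlier.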
There is no technical difference between stdlib modules and other modules. Stdlib modules are only special in context of copyright and responsibility. |
I wrote PR 24238 which implements a hardcoded list of stdlib module names. The PR uses the CI to ensure that the list always remains up to date. If a new module is added tomorrow, a CI job fails ("file not up to date").
Hum, that would go against my bpo-42923 use case. Also for bpo-42923, I would prefer a tuple of strings ;-) |
Use cases for sys.module_names:
--- The isort uses the following script to generate the list of stdlib modules: The script uses sphinx.ext.intersphinx.fetch_inventory(...)["py:module"]. This API uses objects.inv from the online Python documentation. Example of Python 3.9: https://docs.python.org/3.9/objects.inv On the "dev" version, mkstdlibs.py lists 211 modules. |
(...) This API uses objects.inv from the online Python documentation. By the way, there is also this documentation page listing Python stdlib modules: |
Another way to gather the list of Python modules: pydoc help("modules") uses pkgutil.walk_packages(). pydoc uses ModuleScanner which uses sys.builtin_module_names and pkgutil.walk_packages(). pkgutil.walk_packages() calls pkgutil.iter_modules() which iterates on sys.meta_path and sys.path, and calls iter_modules() on each importer. For FileImporter, it iterates on os.listdir() on the importer path. For zipimporter, it iterates on zipimport._zip_directory_cache[importer.archive]. |
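As a rough sketch of that discovery approach, the snippet below combines sys.builtin_module_names with pkgutil.iter_modules() restricted to the stdlib directories. The function name `discover_top_level_modules` is hypothetical, and the use of the DESTSHARED config variable for the extension directory is an assumption that only holds on builds that have a Makefile.

```python
import pkgutil
import sys
import sysconfig


def discover_top_level_modules() -> set[str]:
    """List top-level module names found in the stdlib directories."""
    search_path = [sysconfig.get_paths()["stdlib"]]
    # Assumption: DESTSHARED points at the extension module directory
    # (lib-dynload); it is None on platforms without a Makefile.
    destshared = sysconfig.get_config_var("DESTSHARED")
    if destshared:
        search_path.append(destshared)

    # Built-in modules have no file on disk, so start from them.
    names = set(sys.builtin_module_names)
    # iter_modules() only scans directories; it does not import anything.
    for module_info in pkgutil.iter_modules(search_path):
        names.add(module_info.name)
    return names
```

Unlike a hardcoded list, the result depends on what is installed on the current system, so modules that were not built or were unbundled by a packager will be missing.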
Note for myself: regen-keyword should be included in "make regen-all". |
A list of stdlib modules/extensions is IMHO problematic for maintenance, esp. if you consider that not all modules/extensions are installed on all systems (both because dependencies aren't present and because packagers have decided to unbundle parts of the stdlib). Wouldn't it be sufficient to somehow mark the stdlib entries on sys.path? Although that might give misleading answers with tools like pyinstaller/py2exe/py2app that package an application and its dependencies into a single zipfile. |
Ronald Oussoren:
My PR 24238 adds a list of module names (tuple of str): sys.module_names. It includes modules which are not available on the platform. For example, "winsound" is listed on Linux but not available, and "tkinter" is listed even if the extension could not be built (missing dependency). I tried to make it clear in the sys.module_names documentation (see my PR).

Making the list conditional on whether the module is built or not causes different issues. For example, for the "isort" (sort and group imports) use case, you want to know on Linux if "import winsound" is a stdlib import or a third party import (same for Linux-only modules on Windows). Moreover, there is a practical issue: extension modules are only built by setup.py *after* the sys module is built. I don't want to rebuild the sys module if building an extension failed.

Even if a module is available (listed in sys.module_names and the file is present on disk), it doesn't mean that "it works". For example, "import multiprocessing" fails if there is no working lock implementation. Other modules have similar issues.
Having to actually import modules to check if it's a stdlib module or not is not convenient. Many stdlib modules have side effects on import. For example, "import antigravity" opens a web browser. An import can open files, spawn threads, run programs, etc. |
You wouldn't necessarily have to import a module to test, this is something that could be added to importlib. One (poorly thought out) option is to add sys._stdlib_path with the subsection of sys.path that contains the stdlib, and a function in importlib that returns if a spec is for a stdlib module. The disadvantage of this, or for the most part anything but your initial proposal, is that it might not be safe to use a function in importlib in Py_FatalError. |
Myself:
The "trace" use case needs an exhaustive list of all module names. It is even less convenient to have to list all Python modules available on the system only to check modules coming from the stdlib. Ronald:
I'm not sure how it would work. I listed different cases which have different constraints:
For the test on the module name, how would it work with sys._stdlib_path? Should you import the module and then check if its path comes from sys._stdlib_path? Ronald:
PR 24254 is a working implementation of my use case: only list third party extension modules on a Python fatal error. It relies on the sys.module_names list from PR 24238. The implementation works when called from a signal handler (when faulthandler catches fatal signals like SIGSEGV): it avoids memory allocations on the heap (one of the limits of a signal handler). |
For a module name use importlib.util.find_spec() to locate the module (or toplevel package if this is a module in package). The new importlib function could then use the spec and sys._stdlib_path to check if the spec is one for a stdlib module. This is pretty handwavy, but I do something similar in py2app (but completely based on paths calculated outside of the import machinery). For a module object you can extract the spec from the object and use the same function.
I think we agree on that point: my counter proposal won’t work in the faulthandler scenario, and may be problematic in the Py_FatalError case as well. Ronald |
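The spec-based counter proposal could look roughly like the sketch below. Since sys._stdlib_path is only a suggestion in this discussion and does not exist, the sketch substitutes sysconfig.get_paths()["stdlib"]; the function names are hypothetical, and the site-packages/dist-packages exclusion is an assumption about the installation layout.

```python
import importlib.util
import os
import sysconfig


def spec_is_stdlib(spec) -> bool:
    """Guess whether a module spec refers to a stdlib module."""
    if spec is None or not spec.origin:
        return False
    if spec.origin in ("built-in", "frozen"):
        return True
    origin = os.path.realpath(spec.origin)
    parts = origin.split(os.sep)
    # Assumption: third party code is installed under one of these names.
    if "site-packages" in parts or "dist-packages" in parts:
        return False
    stdlib = os.path.realpath(sysconfig.get_paths()["stdlib"])
    return origin.startswith(stdlib + os.sep)


def name_is_stdlib(name: str) -> bool:
    """Locate the top-level package of name and test its spec."""
    top_level = name.partition(".")[0]
    try:
        # For a top-level name, find_spec() does not import the module.
        spec = importlib.util.find_spec(top_level)
    except (ImportError, ValueError):
        return False
    return spec_is_stdlib(spec)
```

For a module object, `spec_is_stdlib(module.__spec__)` would apply the same test. As noted in the thread, this relies on the import machinery and is therefore not suitable for the faulthandler scenario.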
BTW, a list of stdlib module names is not sufficient to determine if a module is in the stdlib: thanks to .pth files, it is possible to have entries on sys.path before the stdlib. |
Ronald:
Yeah, I wrote it in my first message. I solved this issue with documentation: "Note: If a third party module has the same name than a standard library module and it comes before the standard library in sys.path, it overrides the standard library module on import." I updated sys.module_names documentation in my PR. IMO it's an acceptable limitation. |
Ronald:
The API providing a tuple of str (sys.module_names) works with the 4 use cases that I listed:
Ronald:
These tools worked for years without sys.module_names and don't need to be modified to use sys.module_names. sys.module_names is well defined, extract of its doc: "The list is the same on all platforms. Modules which are not available on some platforms and modules disabled at Python build are also listed." These packaging tools may require further checks than just checking if the name is in sys.module_names. These tools are complex anyway ;-) |
There is already sysconfig.get_paths()['stdlib']. Maybe we need to add a new key for extension modules. I don't think that these two options are exclusive. For me, it can even be a similar but different use case. |
New changeset cad8020 by Victor Stinner in branch 'master': |
sys.module_names also solves the following old StackOverflow question: "I have installed sooo many libraries/modules/packages with pip and now I cannot differentiate which is native to the python standard library and which is not. This causes problem when my code works on my machine but it doesn't work anywhere else." |
The pants project also uses a list of stdlib module names, similar to isort, to infer dependencies: "Pants is a scalable build system for monorepos: codebases containing multiple projects, often using multiple programming languages and frameworks, in a single unified code repository." |
Yet another use case: the friendly-traceback project detects typos on imports using the list of stdlib module names: For example, it suggests replacing "import Tkinter" with "import tkinter". |
Another use case similar to the "isort" use case: a vim plugin to insert an import checks if the module name is a stdlib module. It checks the module path (for Python stdlib modules) and uses a hardcoded list of builtin stdlib modules. |
Another example: the unimport project ("linter, formatter for finding and removing unused import statements") uses the following function to guess if a module object is part of the stdlib:

def is_std(package: str) -> bool:
    """Returns True if package module came with from Python."""
    if package in C.BUILTIN_MODULE_NAMES:
        return True
    spec = get_spec(package)
    if spec and isinstance(spec.origin, str):
        return any(
            (
                spec.origin.startswith(C.STDLIB_PATH),
                spec.origin in ["built-in", "frozen"],
                spec.origin.endswith(".so"),
            )
        )
    else:
        return False
|
For similar reasons as friendly-traceback, I'd be interested in a list of stdlib modules to be able to provide additional information in case of exceptions. For instance:

string.ascii_lowercase

from maths import pi
> Before: ImportError('No module named maths',)
> After: ImportError("No module named maths. Did you mean 'math'?",)

choice

from itertools import pi
> Before: ImportError('cannot import name pi',)
> After: ImportError("cannot import name pi. Did you mean 'from math import pi'?",)

The first 2 cases only use the module name but the last 2 cases will actually import the modules to get the names in them, and I want to do this only for modules which are safe to import (no side effects expected).

Source: https://github.com/SylvainDe/DidYouMean-Python/blob/master/didyoumean/didyoumean_internal.py#L30 (but if you are interested in that kind of feature, I'd recommend using friendly-traceback instead) |
Another potential use case: restrict pydoc web server to stdlib modules, see bpo-42988. |
I merged my PR, thanks for the feedback and reviews. |
Update: attribute renamed to sys.stdlib_module_names. |
I also created a thread on python-dev to discuss this new feature: |
I noticed that in #81261 all of the stdlib module names were explicitly listed, however as of Python 3.10 the stdlib now has a mechanism for this. python/cpython#87121

I figured it was better to use `sys.stdlib_module_names` going forward for 3.10+ instead of having to maintain this file for every new Python release. For docs see: https://docs.python.org/3/library/sys.html#sys.stdlib_module_names

I did a symmetric difference to determine what the effective change would be. I verified that everything listed in this file is included in sys.stdlib_module_names. However, there are entries in sys.stdlib_module_names that are not included in the previous hard coded definition. Namely these are:

```
frozenset({'__future__', '_abc', '_aix_support', '_asyncio', '_bisect', '_blake2', '_bootsubprocess', '_bz2', '_codecs', '_codecs_cn', '_codecs_hk', '_codecs_iso2022', '_codecs_jp', '_codecs_kr', '_codecs_tw', '_collections', '_collections_abc', '_compat_pickle', '_compression', '_contextvars', '_crypt', '_csv', '_ctypes', '_curses', '_curses_panel', '_datetime', '_dbm', '_decimal', '_elementtree', '_frozen_importlib', '_frozen_importlib_external', '_functools', '_gdbm', '_hashlib', '_heapq', '_imp', '_io', '_json', '_locale', '_lsprof', '_lzma', '_markupbase', '_md5', '_msi', '_multibytecodec', '_multiprocessing', '_opcode', '_operator', '_osx_support', '_overlapped', '_pickle', '_posixshmem', '_posixsubprocess', '_py_abc', '_pydecimal', '_pyio', '_queue', '_random', '_scproxy', '_sha1', '_sha256', '_sha3', '_sha512', '_signal', '_sitebuiltins', '_socket', '_sqlite3', '_sre', '_ssl', '_stat', '_statistics', '_string', '_strptime', '_struct', '_symtable', '_threading_local', '_tkinter', '_tracemalloc', '_uuid', '_warnings', '_weakref', '_weakrefset', '_winapi', '_zoneinfo', 'antigravity', 'genericpath', 'idlelib', 'nt', 'nturl2path', 'opcode', 'pydoc_data', 'pyexpat', 'this'})
```

I'm not sure if excluding these matters. I wouldn't think it would, but if it does and it is better to explicitly update this file each time, then feel free to close this.

Pull Request resolved: #81520
Approved by: https://github.com/malfet