New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate TOC class and remove its usage from codebase #7615
Conversation
Rework TOC-style 3-tuple detection in `add_datas` and `binaries`; instead of checking for instance of `TOC` class, check if the length of the first entry is three. This will allow us to replace the `TOC` class with regular list.
Instantiate `self.binaries` as a regular list instead of an instance of the `TOC` class; we do not need any of its functionality here.
Use the lazy string formatting offered by logger method calls instead of directly formatting strings using the % syntax.
We are returning a simple list of modules, so there's no need for TOC class there as we know the list has no duplicates.
There is exactly one call that passes the existing TOC in our codebase, and even there we can simply append the output to the existing TOC. The returned toc is now a list as opposed to instance of the TOC class. This removes implict deduplication functionality, but should not be an issue at this step, because entries are generated from the module graph (and so there should not be any duplicates here).
Turn `Tree` class into child of regular `list` instead of `TOC` class, as part of on-going `TOC` class depreciation. As `Tree` describes the contents of the filesystem, implicit TOC de-duplication should not be required in the first place. Also initialize `Target.dependencies` with plain `list`.
For now, the normalization helper function is a stub that internally uses the TOC class; this allows us to gradually limit/encapsulate the use of TOC class within the codebase.
Process extensions (rename them to full filenames) immediately after we obtain the list of extensions from the modulegraph.
Instead of using TOC class, perform explicit TOC list normalization at certain build stages.
When `Analysis.pure` TOC list was an instance of the `TOC` class, we could dynamically add an attribute `_code_cache` to it, and use that to pass the code cache dictionary around. The `PYZ` writer would then check for the presence of the attribute on the received arguments and extend its internal code cache accordingly. Now that the TOC list is a plain `list`, the custom attribute cannot be added anymore. To work around this, we store new dictionary in global `CONF['code_cache']`, where we associate the list's `id()` with its code cache dictionary. This gives us the same semantics as the old approach.
The CONF override in `test_issue_2492` and `test_issue_5131` should initialize the newly-introduced `code_cache` dictionary.
The `Splash` target is trying to detect whether user's code already uses `_tkinter` by looking for extension in the `binaries` TOC. However, it uses the `filenames` property of the `TOC` class, which is not available anymore now that the TOC lists have been switched to plain `list`. Futhermore, as it is trying to look up `_tkinter` extension by module name, this means that the detection has been effectively broken since pyinstaller#7273 where we pushed conversion of extension module names to full extension filenames into `Analysis`. So the search in `binaries` needs to be done with full extension filename rather than module name.
Replace the TOC list normalization stubs with proper implementation that encodes priority information. This for example ensures that if a file is collected both as a DATA and BINARY, we always treat is as BINARY (and hence subject it to additional binary processing).
We now have TOC normalization with priority system in place, so we can directly extend the EXE's TOC with entries from all passed targets' dependencies.
@@ -106,7 +106,11 @@ def __init__(self, *tocs, **kwargs): | |||
self.toc = [] | |||
self.code_dict = {} | |||
for toc in tocs: | |||
self.code_dict.update(getattr(toc, '_code_cache', {})) | |||
# Check if code cache association exists for the given TOC list | |||
code_cache = CONF['code_cache'].get(id(toc)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's going into these code_cache
dicts? I'm guessing a mapping from module names to module.__code__
attribute? If so then why does this cache need to be separated per toc? Why not one single process-wide code_cache
global dict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, code_cache dicts are mappings of module names to their code objects, retrieved from modulegraph's nodes:
pyinstaller/PyInstaller/depend/analysis.py
Lines 523 to 542 in fb545db
def get_code_objects(self): | |
""" | |
Get code objects from ModuleGraph for pure Python modules. This allows to avoid writing .pyc/pyo files to hdd | |
at later stage. | |
:return: Dict with module name and code object. | |
""" | |
code_dict = {} | |
mod_types = PURE_PYTHON_MODULE_TYPES | |
for node in self.iter_graph(start=self._top_script_node): | |
# TODO This is terrible. To allow subclassing, types should never be directly compared. Use isinstance() | |
# instead, which is safer, simpler, and accepts sets. Most other calls to type() in the codebase should also | |
# be refactored to call isinstance() instead. | |
# get node type e.g. Script | |
mg_type = type(node).__name__ | |
if mg_type in mod_types: | |
if node.code: | |
code_dict[node.identifier] = node.code | |
return code_dict |
Previously, they were attached to pure
TOC objects and then collected in PYZ
- although in practice, PYZ
usually receives only one pure
TOC, so it ended up with one such dictionary.
I tried to match the original behavior mostly because of possible corner cases - for example, spec instantiates two Analysis
objects, with different search paths. In that case, I assume we might end up with different but eponymous modules, and would need separate code caches for each resulting PYZ and executable.
It's not a very likely scenario (hopefully?), but I didn't want to delve too much into it at this point. I think we'll need to rework the part with code caches again when we switch from the old modulegraph, anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that switching to a single global code cache seems reasonable, though. It was my first idea as well, until I started to think about possible implications. If those are not a concern, I can add a commit that switches to single global code cache for the current build process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Let's leave it as is for now.
And ignore it in the TOC class unit tests.
Update documentation on TOC lists now that the TOC class has been officially deprecated.
Ensure that TOC is properly de-duplicated even if dest_name contains loops with parent directory path components. For example, `numpy/core/../../numpy.libs/libquadmath-2d0c479f.so.0.0.0` and `numpy/linalg/../../numpy.libs/libquadmath-2d0c479f.so.0.0.0` should be considered duplicates, as they are both normalized to `numpy.libs/libquadmath-2d0c479f.so.0.0.0`. Therefore, we now have the TOC normalization helpers to always sanitize the `dest_name` using `os.path.normpath` (with `pathlib` lacking the equivalent functionality), so that the entries are properly de-duplicated and that destination name is always in its compact/normalized form. We should probably also look into path normalization in the `bindepend.getImports` function, but at the end of the day, the TOC normalization serves as the last guard against problematic entries.
The
TOC
class with its on-line entry de-duplication seems convenient at first, but trying to be both a list (officially) and a set (unofficially) brings up a slew of corner-case issues, especially when it comes to element update and removal operations.Therefore, within our codebase, we mostly treat
TOC
as alist
. To avoid hitting the weird corner cases with modification behavior, we mostly treat it as immutable - if we need to modify the list, we create a plain list with filtered entries, and pass it to aTOC
constructor, which performs list normalization (i.e., entry de-duplication).One issue I have with the
TOC
's normalization is that it is order-based - if there is a DATA entry already present in the TOC, it is difficult to have it replaced with a BINARY entry. Although in the grand scheme of things, BINARY entries should have precedence over the DATA ones, due to binary dependency scanning and additional processing steps we subject them to on some OSes. While it would be possible to implement a priority-based mechanism within the existingTOC
class, doing so would even further complicate the already-muddy semantics of the element operations.Therefore, this PR marks the
TOC
class as deprecated (both with warning emitted from constructor and in revised documentation) and removes all uses of theTOC
class from our codebase. The implicit list normalization that theTOC
provided is now replaced by helper functions which we explicitly use to perform priority-based list normalization at certain "checkpoints" during the build process. Specifically, the TOC lists are explicitly normalized at the end ofAnalysis
instantiation, before they are stored inAnalysis
properties (e.g.,Analysis.pure
,Analysis.datas
,Analysis.binaries
). The build targets (PYZ
,EXE
,COLLECT
, etc.) also explicitly normalize the combined TOC that they construct from the TOCs that they are passed as arguments of the constructor - this is to account for the possibility of the user performing additional modification of the lists in the spec file (which could result in entry duplication).We now have a separate normalization function for PYZ TOC (because at least nominally, PYZ implementation is case sensitive) and "regular" TOC (which follows the OS-specific case-normalization behavior). And the priority system for the latter ensures the following precedence order: DEPENDENCY > BINARY, EXTENSION > everything else.
The
TOC
class also made it easier to implement shortcuts in other areas of the code - for example passing the code cache dictionary to PYZ via dynamic attribute set on theTOC
object. We cannot do that with plainlist
anymore, so code cache passing is now implemented via globalCONF
dictionary.Closes #6688.
See also: #6669, #6689.