Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate TOC class and remove its usage from codebase #7615

Merged
merged 18 commits into from May 10, 2023

Conversation

rokm
Copy link
Member

@rokm rokm commented May 9, 2023

The TOC class with its on-line entry de-duplication seems convenient at first, but trying to be both a list (officially) and a set (unofficially) brings up a slew of corner-case issues, especially when it comes to element update and removal operations.

Therefore, within our codebase, we mostly treat TOC as a list. To avoid hitting the weird corner cases with modification behavior, we mostly treat it as immutable - if we need to modify the list, we create a plain list with filtered entries, and pass it to a TOC constructor, which performs list normalization (i.e., entry de-duplication).

One issue I have with the TOC's normalization is that it is order-based - if there is a DATA entry already present in the TOC, it is difficult to have it replaced with a BINARY entry. Although in the grand scheme of things, BINARY entries should have precedence over the DATA ones, due to binary dependency scanning and additional processing steps we subject them to on some OSes. While it would be possible to implement a priority-based mechanism within the existing TOC class, doing so would even further complicate the already-muddy semantics of the element operations.

Therefore, this PR marks the TOC class as deprecated (both with warning emitted from constructor and in revised documentation) and removes all uses of the TOC class from our codebase. The implicit list normalization that the TOC provided is now replaced by helper functions which we explicitly use to perform priority-based list normalization at certain "checkpoints" during the build process. Specifically, the TOC lists are explicitly normalized at the end of Analysis instantiation, before they are stored in Analysis properties (e.g., Analysis.pure, Analysis.datas, Analysis.binaries). The build targets (PYZ, EXE, COLLECT, etc.) also explicitly normalize the combined TOC that they construct from the TOCs that they are passed as arguments of the constructor - this is to account for the possibility of the user performing additional modification of the lists in the spec file (which could result in entry duplication).

We now have a separate normalization function for PYZ TOC (because at least nominally, PYZ implementation is case sensitive) and "regular" TOC (which follows the OS-specific case-normalization behavior). And the priority system for the latter ensures the following precedence order: DEPENDENCY > BINARY, EXTENSION > everything else.

The TOC class also made it easier to implement shortcuts in other areas of the code - for example passing the code cache dictionary to PYZ via dynamic attribute set on the TOC object. We cannot do that with plain list anymore, so code cache passing is now implemented via global CONF dictionary.

Closes #6688.

See also: #6669, #6689.

rokm added 15 commits May 9, 2023 15:31
Rework TOC-style 3-tuple detection in `add_datas` and `binaries`;
instead of checking for instance of `TOC` class, check if the
length of the first entry is three.

This will allow us to replace the `TOC` class with regular list.
Instantiate `self.binaries` as a regular list instead of an
instance of the `TOC` class; we do not need any of its functionality
here.
Use the lazy string formatting offered by logger method calls
instead of directly formatting strings using the % syntax.
We are returning a simple list of modules, so there's no need for
TOC class there as we know the list has no duplicates.
There is exactly one call that passes the existing TOC in our
codebase, and even there we can simply append the output to the
existing TOC.

The returned toc is now a list as opposed to instance of the TOC
class. This removes implict deduplication functionality, but
should not be an issue at this step, because entries are generated
from the module graph (and so there should not be any duplicates
here).
Turn `Tree` class into child of regular `list` instead of `TOC`
class, as part of on-going `TOC` class depreciation.

As `Tree` describes the contents of the filesystem, implicit TOC
de-duplication should not be required in the first place.

Also initialize `Target.dependencies` with plain `list`.
For now, the normalization helper function is a stub that internally
uses the TOC class; this allows us to gradually limit/encapsulate
the use of TOC class within the codebase.
Process extensions (rename them to full filenames) immediately
after we obtain the list of extensions from the modulegraph.
Instead of using TOC class, perform explicit TOC list normalization
at certain build stages.
When `Analysis.pure` TOC list was an instance of the `TOC` class,
we could dynamically add an attribute `_code_cache` to it, and use
that to pass the code cache dictionary around. The `PYZ` writer
would then check for the presence of the attribute on the received
arguments and extend its internal code cache accordingly.

Now that the TOC list is a plain `list`, the custom attribute
cannot be added anymore. To work around this, we store new
dictionary in global `CONF['code_cache']`, where we associate
the list's `id()` with its code cache dictionary. This gives us
the same semantics as the old approach.
The CONF override in `test_issue_2492` and `test_issue_5131`
should initialize the newly-introduced `code_cache` dictionary.
The `Splash` target is trying to detect whether user's code already
uses `_tkinter` by looking for extension in the `binaries` TOC.
However, it uses the `filenames` property of the `TOC` class, which
is not available anymore now that the TOC lists have been switched
to plain `list`.

Futhermore, as it is trying to look up `_tkinter` extension by
module name, this means that the detection has been effectively
broken since pyinstaller#7273 where we pushed conversion of extension module
names to full extension filenames into `Analysis`. So the search
in `binaries` needs to be done with full extension filename rather
than module name.
Replace the TOC list normalization stubs with proper implementation
that encodes priority information. This for example ensures that
if a file is collected both as a DATA and BINARY, we always treat
is as BINARY (and hence subject it to additional binary processing).
We now have TOC normalization with priority system in place, so
we can directly extend the EXE's TOC with entries from all passed
targets' dependencies.
@rokm rokm marked this pull request as ready for review May 9, 2023 22:07
@rokm rokm requested a review from bwoodsend May 9, 2023 22:09
@@ -106,7 +106,11 @@ def __init__(self, *tocs, **kwargs):
self.toc = []
self.code_dict = {}
for toc in tocs:
self.code_dict.update(getattr(toc, '_code_cache', {}))
# Check if code cache association exists for the given TOC list
code_cache = CONF['code_cache'].get(id(toc))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's going into these code_cache dicts? I'm guessing a mapping from module names to module.__code__ attribute? If so then why does this cache need to be separated per toc? Why not one single process-wide code_cache global dict?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, code_cache dicts are mappings of module names to their code objects, retrieved from modulegraph's nodes:

def get_code_objects(self):
"""
Get code objects from ModuleGraph for pure Python modules. This allows to avoid writing .pyc/pyo files to hdd
at later stage.
:return: Dict with module name and code object.
"""
code_dict = {}
mod_types = PURE_PYTHON_MODULE_TYPES
for node in self.iter_graph(start=self._top_script_node):
# TODO This is terrible. To allow subclassing, types should never be directly compared. Use isinstance()
# instead, which is safer, simpler, and accepts sets. Most other calls to type() in the codebase should also
# be refactored to call isinstance() instead.
# get node type e.g. Script
mg_type = type(node).__name__
if mg_type in mod_types:
if node.code:
code_dict[node.identifier] = node.code
return code_dict

Previously, they were attached to pure TOC objects and then collected in PYZ - although in practice, PYZ usually receives only one pure TOC, so it ended up with one such dictionary.

I tried to match the original behavior mostly because of possible corner cases - for example, spec instantiates two Analysis objects, with different search paths. In that case, I assume we might end up with different but eponymous modules, and would need separate code caches for each resulting PYZ and executable.

It's not a very likely scenario (hopefully?), but I didn't want to delve too much into it at this point. I think we'll need to rework the part with code caches again when we switch from the old modulegraph, anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that switching to a single global code cache seems reasonable, though. It was my first idea as well, until I started to think about possible implications. If those are not a concern, I can add a commit that switches to single global code cache for the current build process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Let's leave it as is for now.

tests/unit/test_TOC.py Outdated Show resolved Hide resolved
rokm added 2 commits May 10, 2023 11:19
And ignore it in the TOC class unit tests.
Update documentation on TOC lists now that the TOC class has
been officially deprecated.
Ensure that TOC is properly de-duplicated even if dest_name
contains loops with parent directory path components. For example,
`numpy/core/../../numpy.libs/libquadmath-2d0c479f.so.0.0.0` and
`numpy/linalg/../../numpy.libs/libquadmath-2d0c479f.so.0.0.0`
should be considered duplicates, as they are both normalized to
`numpy.libs/libquadmath-2d0c479f.so.0.0.0`.

Therefore, we now have the TOC normalization helpers to always
sanitize the `dest_name` using `os.path.normpath` (with `pathlib`
lacking the equivalent functionality), so that the entries are
properly de-duplicated and that destination name is always in
its compact/normalized form.

We should probably also look into path normalization in the
`bindepend.getImports` function, but at the end of the day,
the TOC normalization serves as the last guard against problematic
entries.
@rokm rokm merged commit 1c86739 into pyinstaller:develop May 10, 2023
18 checks passed
@rokm rokm deleted the deprecate-toc-class branch May 10, 2023 20:59
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent TOC subtraction behavior when entry's type is not specified
2 participants