Deprecate TOC class and remove its usage from codebase #7615

rokm · 2023-05-09T13:31:42Z

The TOC class with its on-line entry de-duplication seems convenient at first, but trying to be both a list (officially) and a set (unofficially) brings up a slew of corner-case issues, especially when it comes to element update and removal operations.

Therefore, within our codebase, we mostly treat TOC as a list. To avoid hitting the weird corner cases with modification behavior, we mostly treat it as immutable - if we need to modify the list, we create a plain list with filtered entries, and pass it to a TOC constructor, which performs list normalization (i.e., entry de-duplication).

One issue I have with the TOC's normalization is that it is order-based - if there is a DATA entry already present in the TOC, it is difficult to have it replaced with a BINARY entry. Although in the grand scheme of things, BINARY entries should have precedence over the DATA ones, due to binary dependency scanning and additional processing steps we subject them to on some OSes. While it would be possible to implement a priority-based mechanism within the existing TOC class, doing so would even further complicate the already-muddy semantics of the element operations.

Therefore, this PR marks the TOC class as deprecated (both with warning emitted from constructor and in revised documentation) and removes all uses of the TOC class from our codebase. The implicit list normalization that the TOC provided is now replaced by helper functions which we explicitly use to perform priority-based list normalization at certain "checkpoints" during the build process. Specifically, the TOC lists are explicitly normalized at the end of Analysis instantiation, before they are stored in Analysis properties (e.g., Analysis.pure, Analysis.datas, Analysis.binaries). The build targets (PYZ, EXE, COLLECT, etc.) also explicitly normalize the combined TOC that they construct from the TOCs that they are passed as arguments of the constructor - this is to account for the possibility of the user performing additional modification of the lists in the spec file (which could result in entry duplication).

We now have a separate normalization function for PYZ TOC (because at least nominally, PYZ implementation is case sensitive) and "regular" TOC (which follows the OS-specific case-normalization behavior). And the priority system for the latter ensures the following precedence order: DEPENDENCY > BINARY, EXTENSION > everything else.

The TOC class also made it easier to implement shortcuts in other areas of the code - for example passing the code cache dictionary to PYZ via dynamic attribute set on the TOC object. We cannot do that with plain list anymore, so code cache passing is now implemented via global CONF dictionary.

Closes #6688.

See also: #6669, #6689.

Rework TOC-style 3-tuple detection in `add_datas` and `binaries`; instead of checking for instance of `TOC` class, check if the length of the first entry is three. This will allow us to replace the `TOC` class with regular list.

Instantiate `self.binaries` as a regular list instead of an instance of the `TOC` class; we do not need any of its functionality here.

Use the lazy string formatting offered by logger method calls instead of directly formatting strings using the % syntax.

We are returning a simple list of modules, so there's no need for TOC class there as we know the list has no duplicates.

There is exactly one call that passes the existing TOC in our codebase, and even there we can simply append the output to the existing TOC. The returned toc is now a list as opposed to instance of the TOC class. This removes implict deduplication functionality, but should not be an issue at this step, because entries are generated from the module graph (and so there should not be any duplicates here).

Turn `Tree` class into child of regular `list` instead of `TOC` class, as part of on-going `TOC` class depreciation. As `Tree` describes the contents of the filesystem, implicit TOC de-duplication should not be required in the first place. Also initialize `Target.dependencies` with plain `list`.

For now, the normalization helper function is a stub that internally uses the TOC class; this allows us to gradually limit/encapsulate the use of TOC class within the codebase.

Process extensions (rename them to full filenames) immediately after we obtain the list of extensions from the modulegraph.

Instead of using TOC class, perform explicit TOC list normalization at certain build stages.

When `Analysis.pure` TOC list was an instance of the `TOC` class, we could dynamically add an attribute `_code_cache` to it, and use that to pass the code cache dictionary around. The `PYZ` writer would then check for the presence of the attribute on the received arguments and extend its internal code cache accordingly. Now that the TOC list is a plain `list`, the custom attribute cannot be added anymore. To work around this, we store new dictionary in global `CONF['code_cache']`, where we associate the list's `id()` with its code cache dictionary. This gives us the same semantics as the old approach.

The CONF override in `test_issue_2492` and `test_issue_5131` should initialize the newly-introduced `code_cache` dictionary.

The `Splash` target is trying to detect whether user's code already uses `_tkinter` by looking for extension in the `binaries` TOC. However, it uses the `filenames` property of the `TOC` class, which is not available anymore now that the TOC lists have been switched to plain `list`. Futhermore, as it is trying to look up `_tkinter` extension by module name, this means that the detection has been effectively broken since pyinstaller#7273 where we pushed conversion of extension module names to full extension filenames into `Analysis`. So the search in `binaries` needs to be done with full extension filename rather than module name.

Replace the TOC list normalization stubs with proper implementation that encodes priority information. This for example ensures that if a file is collected both as a DATA and BINARY, we always treat is as BINARY (and hence subject it to additional binary processing).

We now have TOC normalization with priority system in place, so we can directly extend the EXE's TOC with entries from all passed targets' dependencies.

bwoodsend · 2023-05-09T22:51:17Z

PyInstaller/building/api.py

@@ -106,7 +106,11 @@ def __init__(self, *tocs, **kwargs):
        self.toc = []
        self.code_dict = {}
        for toc in tocs:
-            self.code_dict.update(getattr(toc, '_code_cache', {}))
+            # Check if code cache association exists for the given TOC list
+            code_cache = CONF['code_cache'].get(id(toc))


What's going into these code_cache dicts? I'm guessing a mapping from module names to module.__code__ attribute? If so then why does this cache need to be separated per toc? Why not one single process-wide code_cache global dict?

Yeah, code_cache dicts are mappings of module names to their code objects, retrieved from modulegraph's nodes:

pyinstaller/PyInstaller/depend/analysis.py

Lines 523 to 542 in fb545db

def get_code_objects(self):

"""

Get code objects from ModuleGraph for pure Python modules. This allows to avoid writing .pyc/pyo files to hdd

at later stage.

:return: Dict with module name and code object.

"""

code_dict = {}

mod_types = PURE_PYTHON_MODULE_TYPES

for node in self.iter_graph(start=self._top_script_node):

# TODO This is terrible. To allow subclassing, types should never be directly compared. Use isinstance()

# instead, which is safer, simpler, and accepts sets. Most other calls to type() in the codebase should also

# be refactored to call isinstance() instead.

# get node type e.g. Script

mg_type = type(node).__name__

if mg_type in mod_types:

if node.code:

code_dict[node.identifier] = node.code

return code_dict

Previously, they were attached to pure TOC objects and then collected in PYZ - although in practice, PYZ usually receives only one pure TOC, so it ended up with one such dictionary.

I tried to match the original behavior mostly because of possible corner cases - for example, spec instantiates two Analysis objects, with different search paths. In that case, I assume we might end up with different but eponymous modules, and would need separate code caches for each resulting PYZ and executable.

It's not a very likely scenario (hopefully?), but I didn't want to delve too much into it at this point. I think we'll need to rework the part with code caches again when we switch from the old modulegraph, anyway.

I agree that switching to a single global code cache seems reasonable, though. It was my first idea as well, until I started to think about possible implications. If those are not a concern, I can add a commit that switches to single global code cache for the current build process.

Fair enough. Let's leave it as is for now.

tests/unit/test_TOC.py

And ignore it in the TOC class unit tests.

Update documentation on TOC lists now that the TOC class has been officially deprecated.

Ensure that TOC is properly de-duplicated even if dest_name contains loops with parent directory path components. For example, `numpy/core/../../numpy.libs/libquadmath-2d0c479f.so.0.0.0` and `numpy/linalg/../../numpy.libs/libquadmath-2d0c479f.so.0.0.0` should be considered duplicates, as they are both normalized to `numpy.libs/libquadmath-2d0c479f.so.0.0.0`. Therefore, we now have the TOC normalization helpers to always sanitize the `dest_name` using `os.path.normpath` (with `pathlib` lacking the equivalent functionality), so that the entries are properly de-duplicated and that destination name is always in its compact/normalized form. We should probably also look into path normalization in the `bindepend.getImports` function, but at the end of the day, the TOC normalization serves as the last guard against problematic entries.

rokm added 15 commits May 9, 2023 15:31

depend: imphookapi: rework TOC-style tuple detection

609d9ba

Rework TOC-style 3-tuple detection in `add_datas` and `binaries`; instead of checking for instance of `TOC` class, check if the length of the first entry is three. This will allow us to replace the `TOC` class with regular list.

building: splash: remove reference to TOC class

132f673

Instantiate `self.binaries` as a regular list instead of an instance of the `TOC` class; we do not need any of its functionality here.

building: splash: clean up string formatting in logger method calls

24e57a7

Use the lazy string formatting offered by logger method calls instead of directly formatting strings using the % syntax.

depend: remove use of TOC class in get_bootstrap_modules

e0e0ae0

We are returning a simple list of modules, so there's no need for TOC class there as we know the list has no duplicates.

building: begin replacing TOC class with normalization helper function

2792fc2

For now, the normalization helper function is a stub that internally uses the TOC class; this allows us to gradually limit/encapsulate the use of TOC class within the codebase.

building: main: move extension processing further up in workflow

7634f85

Process extensions (rename them to full filenames) immediately after we obtain the list of extensions from the modulegraph.

building: main: remove direct uses of TOC class

1f2bed8

Instead of using TOC class, perform explicit TOC list normalization at certain build stages.

tests: fix CONF override in test_issue_2492 and test_issue_5131

b59792f

The CONF override in `test_issue_2492` and `test_issue_5131` should initialize the newly-introduced `code_cache` dictionary.

building: EXE: remove the work-around for merging PYZ.dependencies

0603e01

We now have TOC normalization with priority system in place, so we can directly extend the EXE's TOC with entries from all passed targets' dependencies.

tests: add basic tests for the new TOC normalization helpers

e6f6f1f

rokm force-pushed the deprecate-toc-class branch from 1185703 to b857375 Compare May 9, 2023 14:57

rokm marked this pull request as ready for review May 9, 2023 22:07

rokm requested a review from bwoodsend May 9, 2023 22:09

bwoodsend reviewed May 9, 2023

View reviewed changes

rokm added 2 commits May 10, 2023 11:19

building: add deprecation warning to TOC class

69e6130

And ignore it in the TOC class unit tests.

docs: update documentation on TOC lists

cf95497

Update documentation on TOC lists now that the TOC class has been officially deprecated.

rokm force-pushed the deprecate-toc-class branch from b857375 to cf95497 Compare May 10, 2023 09:20

rokm force-pushed the deprecate-toc-class branch from cb413a7 to 97bc98e Compare May 10, 2023 15:15

bwoodsend approved these changes May 10, 2023

View reviewed changes

rokm merged commit 1c86739 into pyinstaller:develop May 10, 2023
18 checks passed

rokm deleted the deprecate-toc-class branch May 10, 2023 20:59

github-actions bot locked as resolved and limited conversation to collaborators Jul 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deprecate TOC class and remove its usage from codebase #7615

Deprecate TOC class and remove its usage from codebase #7615

rokm commented May 9, 2023 •

edited

bwoodsend May 9, 2023

rokm May 10, 2023

rokm May 10, 2023

bwoodsend May 10, 2023

	def get_code_objects(self):
	"""
	Get code objects from ModuleGraph for pure Python modules. This allows to avoid writing .pyc/pyo files to hdd
	at later stage.

	:return: Dict with module name and code object.
	"""
	code_dict = {}
	mod_types = PURE_PYTHON_MODULE_TYPES
	for node in self.iter_graph(start=self._top_script_node):
	# TODO This is terrible. To allow subclassing, types should never be directly compared. Use isinstance()
	# instead, which is safer, simpler, and accepts sets. Most other calls to type() in the codebase should also
	# be refactored to call isinstance() instead.

	# get node type e.g. Script
	mg_type = type(node).__name__
	if mg_type in mod_types:
	if node.code:
	code_dict[node.identifier] = node.code
	return code_dict

Deprecate TOC class and remove its usage from codebase #7615

Deprecate TOC class and remove its usage from codebase #7615

Conversation

rokm commented May 9, 2023 • edited

bwoodsend May 9, 2023

Choose a reason for hiding this comment

rokm May 10, 2023

Choose a reason for hiding this comment

rokm May 10, 2023

Choose a reason for hiding this comment

bwoodsend May 10, 2023

Choose a reason for hiding this comment

rokm commented May 9, 2023 •

edited