Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

building: assorted fixes and cleanup #7273

Merged
merged 19 commits into from Nov 26, 2022

Conversation

rokm
Copy link
Member

@rokm rokm commented Nov 23, 2022

This branch was intended as a preliminary clean-up step that would make rebasing my symlinks branch back onto develop a bit easier. However, the clean-up endeavor quickly morphed into attempts to fix various forms of file duplication that occur under certain build conditions.

One such problem is that in the current implementation, EXTENSION TOC entries keep their module-like name until the very end, when they need to be collected. This means that all codepaths need to deal with extension file name adjustment; even worse, it also means that from the TOC's perspective, the following two entries are different and are thus not de-duplicated when they should be:

    ('torch._C',
       '...\\site-packages\\torch\\_C.cp39-win_amd64.pyd',
       'EXTENSION'),
    ('torch\\_C.cp39-win_amd64.pyd',
       '...\\site-pakages\\torch\\_C.cp39-win_amd64.pyd',
       'DATA'),

So when building a onefile torch program, the user ends up with a

WARNING: file already exists but should not: C:\Users\user\AppData\Local\Temp\MEI1234567\torch\_C.cp39-win_amd64.pyd

As part of the cleanup/fix, we push the EXTENSION name resolving into analysis stage, which should take care of this problem, as well as simplify the code because now the name handling needs to be done in only one place.

Due to the planned deprecation of TOC class and removal of its use from the codebase, the idea was to implement extra checks that would ensure that no file duplication is introduced during the process. So we now have optional run-time and build-time checks for duplication (for onefile PKG extraction at run-time, and during PKG and COLLECT assembly at build-time) and all are enabled on our CI to ensure we are aware of any duplication issues.

These extra checks revealed that MERGE has been broken since PyInstaller v5.0 as it is not removing entries from passed Analysis.binaries and Analysis.datas, due to 14949d3 changing the way toc[i] = (None, None, None) works. So trying to use MERGE correctly fails to yield any de-duplication at all. Therefore, the second part of the PR tries to deal with MERGE and introduce some sanity in the related parts of the codebase.

Closes #7215 - the documentation issue part is partly addressed by #7236 and party here, and the lack of de-duplication is addressed here.

@rokm rokm force-pushed the prevent-pkg-collect-duplication branch 2 times, most recently from b817760 to a69f5b0 Compare November 23, 2022 13:11
@rokm rokm marked this pull request as ready for review November 23, 2022 14:54
@rokm
Copy link
Member Author

rokm commented Nov 24, 2022

So I decided to test this on Windows with torch to ensure that the _C.cp310-win_amd64.pyd "file already exists but should not issue" is really gone for good and... it wasn't. Turns out this is because within TOC, only names for BINARY and DATA were case-normalized, so an EXTENSION or DEPENDENCY could still end up duplicating those. Fixed by ensuring case normalization is performed for all four typecodes. Hopefully this does not upset the CI.

@rokm
Copy link
Member Author

rokm commented Nov 24, 2022

Hmm... linter was updated to a stricter version overnight, and is now picking on an unrelated line of the code. I'll add a commit here, since I was planning on keep them separated anyway for easier change-to-commit-message association in the blame log.

The `trash` in `PKG.assemble` is never populated, so its use is
no-op.
`PKG.assemble` is trying to track duplicated binaries (multiple
binaries collected into the same location, or a single binary
collected into multiple locations), but does so in a very limited
way, by accounting only for BINARY entries. Remove this check
to simplify the code; we will add proper duplication detection
later.
Remove directory copying/collection codepaths from `assemble` methods
in `COLLECT` and `BUNDLE`. At the point when the data is actually
be collected, the TOC should not contain directory entries anymore;
the `format_binaries_and_datas` calls, sprinkled around the codebase,
should ensure that their contents are listed in TOC instead.

If any cases remain where a directory slips into the final TOC,
they should be caught and fixed. Hence, raise an error.
…tage

Move filename processing of EXTENSION TOC entries (i.e., converting
the module name to file path and adding the suffix) from the
build stage (i.e., `assemble` in `PKG`, `COLLECT`, and `BUNDLE`)
into analysis stage.

This ensures that during the build stage, the EXTENSION entries
in the TOC are already full filenames, same as other file-based
entries (DATA, BINARY, etc.).

This in turn means that the `add_suffix_to_extension` helper does
not need to worry about DEPENDENCY entries anymore, and can
process only EXTENSION ones, as implied by its name.

Early conversion of EXTENSION module names to file names also
prevents duplication when the same file is collected as both
an EXTENSION and some other type, for example DATA:
```
('torch._C',
   '...\\site-packages\\torch\\_C.cp39-win_amd64.pyd',
   'EXTENSION'),
('torch\\_C.cp39-win_amd64.pyd',
   '...\\site-pakages\\torch\\_C.cp39-win_amd64.pyd',
   'DATA'),
```
Prior to this commit, the entries were considered different
from the `TOC` perspective, but led to duplication in onefile
build's PKG once extension's name was changed to the file name
(whereas in onedir build, the first entry was overwritten by
the second).
Refactor the partially-redundant missing file check in `COLLECT.assemble`;
if the path is referring to an egg-contained resource, the part based on
`not os.path.exists()` will always evaluate to true, rendering the
second part of the check redundant.

This check was modified into its current form by 6a916fc without an
explanation, but the intention was likely to guard against a missing
file error. Therefore, refactor the check to use only `os.path.exists()`
to check for the existence, and display a warning about a missing
resource unless the path is referring to an egg-contained resource.

For the sake of consistency, apply the same refactoring to `PKG.assemble`
that uses a similar (albeit unmodified) check. Also explicitly check
for OPTION typecode instead of checking for empty filename string.
Implement a strict unpack mode, where a duplicated file in the
onefile build's PKG triggers a run-time error instead of a warning.

This mode is dynamically enabled at runtime based on the
`PYINSTALLER_STRICT_UNPACK_MODE` environment variable being set
and its value being different from 0.

This feature is primarily intended for the use on our CI as means
for catching file duplication in onefile builds.
Fix a double-free bug in (multipackage) dependency extraction codepath,
specifically in the error handling part, which can now be triggered
by the newly-added strict unpack mode.
If explicitly enabled via PYINSTALLER_STRICT_COLLECT_MODE
environment variable, we now raise an error if we try to collect
a duplicated file into PKG/CArchive or as part of a COLLECT.

Similarly to strict unpack mode, this is primarily aimed for use
on our CI.
Try to detect file duplication problems at build time.
Rewrite the TOC cleanup part in the MERGE to not use
`toc[i] = (None, None, None)` to track the entries to be deleted;
instead, construct new TOC like normal people would.

The original assignment-based approach seems to have been broken
since 14949d3, causing failure to modify the TOC after the first
such modification. This in turn means that MERGE did not correctly
modify the `binaries` and `datas` TOCs.
For DEPENDENCY entries, we do not need to check whether the source
file exists, as we will not be actually collecting it. Omitting
this check will make it easier to adjust the contents of the TOC
entries and their format (currently, the existence check passes
because the second entry is left unmodified, and thus matches the
original source file).
…entry

Within MERGE, do not combine the reference path and target file
name into a single string and store it as the destination name
(the first TOC element). Instead, store the target file name as
destination name (the first TOC element) and the reference path
into the source name (the second TOC element, which is otherwise
left unused for DEPENDENCY TOC entries).

Have the CArchive writer perform the final merge, before writing
the entry to the PKG file.

This ensures that the target name remains unchanged within the
TOC, making it subject of de-duplication codepaths and duplication
checks. Previously, an entry for DEPENDENCY may end up duplicating
another entry (e.g., EXTENSION) at run-time, due to target name
containing the reference path prefix.

We can also get rid of DEPENDENCY-specific handling in `checkCache`
(which returns without any processing if `fnm` contains a colon);
this crutch was needed because `PKG.assemble` incorrectly handled
DEPENDENCY entries and unnecessarily tried running them through
`checkCache`. So we rework that part of `PKG.assemble` to process
DEPENDENCY entries as part of general entry handling. At this point,
this becomes necessary, because even if we kept the hack in
`checkCache`, there is no colon in the `fnm` anymore, so the check
would fail, leading to error...
`PYZ.dependencies` might contain entries for the ˙zlib` and
`_struct` extensions (typecode: EXTENSION). However, these
extensions might in fact also be passed as DEPENDENCY entries
via a.binaries. In this case, DEPENDENCY entries should take
precedence, but until we replace the TOC class with a mechansim
that has some notion of type-based priorities, we need to
manipulate the insertion order to ensure that DEPENDENCY entries
from `binaries` are inserted before corresponding EXTENSION ones
from `PYZ.dependencies`.
… mode

When source-file-existence check fails in PKG or COLLECT, raise an
error if we are in stric collect mode.
Minor clean up of functions involved in multipackage dependency
extraction.
…ries

Remove redundant conditions in if statements to make them easier
to understand, and add comments for the next time when I have
the misfortune of having to poke around this part of the codebase.
Turns out that on Windows, the effort of previous commits at
solving the duplication problems is negated by the TOC class
case-normalizing the names only for BINARY and DATA entries,
but not for EXTENSION and DEPENDENCY ones.

Fix that by adding EXTENSION and DEPENDENCY to the list of typecodes
whose names need to be case-normnalized before comparison against
existing names.
@rokm rokm force-pushed the prevent-pkg-collect-duplication branch from 243377d to f49f251 Compare November 26, 2022 11:19
@rokm rokm enabled auto-merge (rebase) November 26, 2022 11:19
@rokm rokm merged commit 39eaa67 into pyinstaller:develop Nov 26, 2022
@rokm rokm deleted the prevent-pkg-collect-duplication branch November 26, 2022 15:07
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 25, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

documentation multipackage is not at current state of the code
2 participants