New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
building: assorted fixes and cleanup #7273
building: assorted fixes and cleanup #7273
Conversation
b817760
to
a69f5b0
Compare
So I decided to test this on Windows with |
Hmm... linter was updated to a stricter version overnight, and is now picking on an unrelated line of the code. I'll add a commit here, since I was planning on keep them separated anyway for easier change-to-commit-message association in the blame log. |
The `trash` in `PKG.assemble` is never populated, so its use is no-op.
`PKG.assemble` is trying to track duplicated binaries (multiple binaries collected into the same location, or a single binary collected into multiple locations), but does so in a very limited way, by accounting only for BINARY entries. Remove this check to simplify the code; we will add proper duplication detection later.
Remove directory copying/collection codepaths from `assemble` methods in `COLLECT` and `BUNDLE`. At the point when the data is actually be collected, the TOC should not contain directory entries anymore; the `format_binaries_and_datas` calls, sprinkled around the codebase, should ensure that their contents are listed in TOC instead. If any cases remain where a directory slips into the final TOC, they should be caught and fixed. Hence, raise an error.
…tage Move filename processing of EXTENSION TOC entries (i.e., converting the module name to file path and adding the suffix) from the build stage (i.e., `assemble` in `PKG`, `COLLECT`, and `BUNDLE`) into analysis stage. This ensures that during the build stage, the EXTENSION entries in the TOC are already full filenames, same as other file-based entries (DATA, BINARY, etc.). This in turn means that the `add_suffix_to_extension` helper does not need to worry about DEPENDENCY entries anymore, and can process only EXTENSION ones, as implied by its name. Early conversion of EXTENSION module names to file names also prevents duplication when the same file is collected as both an EXTENSION and some other type, for example DATA: ``` ('torch._C', '...\\site-packages\\torch\\_C.cp39-win_amd64.pyd', 'EXTENSION'), ('torch\\_C.cp39-win_amd64.pyd', '...\\site-pakages\\torch\\_C.cp39-win_amd64.pyd', 'DATA'), ``` Prior to this commit, the entries were considered different from the `TOC` perspective, but led to duplication in onefile build's PKG once extension's name was changed to the file name (whereas in onedir build, the first entry was overwritten by the second).
Refactor the partially-redundant missing file check in `COLLECT.assemble`; if the path is referring to an egg-contained resource, the part based on `not os.path.exists()` will always evaluate to true, rendering the second part of the check redundant. This check was modified into its current form by 6a916fc without an explanation, but the intention was likely to guard against a missing file error. Therefore, refactor the check to use only `os.path.exists()` to check for the existence, and display a warning about a missing resource unless the path is referring to an egg-contained resource. For the sake of consistency, apply the same refactoring to `PKG.assemble` that uses a similar (albeit unmodified) check. Also explicitly check for OPTION typecode instead of checking for empty filename string.
Implement a strict unpack mode, where a duplicated file in the onefile build's PKG triggers a run-time error instead of a warning. This mode is dynamically enabled at runtime based on the `PYINSTALLER_STRICT_UNPACK_MODE` environment variable being set and its value being different from 0. This feature is primarily intended for the use on our CI as means for catching file duplication in onefile builds.
Fix a double-free bug in (multipackage) dependency extraction codepath, specifically in the error handling part, which can now be triggered by the newly-added strict unpack mode.
If explicitly enabled via PYINSTALLER_STRICT_COLLECT_MODE environment variable, we now raise an error if we try to collect a duplicated file into PKG/CArchive or as part of a COLLECT. Similarly to strict unpack mode, this is primarily aimed for use on our CI.
Try to detect file duplication problems at build time.
Rewrite the TOC cleanup part in the MERGE to not use `toc[i] = (None, None, None)` to track the entries to be deleted; instead, construct new TOC like normal people would. The original assignment-based approach seems to have been broken since 14949d3, causing failure to modify the TOC after the first such modification. This in turn means that MERGE did not correctly modify the `binaries` and `datas` TOCs.
For DEPENDENCY entries, we do not need to check whether the source file exists, as we will not be actually collecting it. Omitting this check will make it easier to adjust the contents of the TOC entries and their format (currently, the existence check passes because the second entry is left unmodified, and thus matches the original source file).
…entry Within MERGE, do not combine the reference path and target file name into a single string and store it as the destination name (the first TOC element). Instead, store the target file name as destination name (the first TOC element) and the reference path into the source name (the second TOC element, which is otherwise left unused for DEPENDENCY TOC entries). Have the CArchive writer perform the final merge, before writing the entry to the PKG file. This ensures that the target name remains unchanged within the TOC, making it subject of de-duplication codepaths and duplication checks. Previously, an entry for DEPENDENCY may end up duplicating another entry (e.g., EXTENSION) at run-time, due to target name containing the reference path prefix. We can also get rid of DEPENDENCY-specific handling in `checkCache` (which returns without any processing if `fnm` contains a colon); this crutch was needed because `PKG.assemble` incorrectly handled DEPENDENCY entries and unnecessarily tried running them through `checkCache`. So we rework that part of `PKG.assemble` to process DEPENDENCY entries as part of general entry handling. At this point, this becomes necessary, because even if we kept the hack in `checkCache`, there is no colon in the `fnm` anymore, so the check would fail, leading to error...
`PYZ.dependencies` might contain entries for the ˙zlib` and `_struct` extensions (typecode: EXTENSION). However, these extensions might in fact also be passed as DEPENDENCY entries via a.binaries. In this case, DEPENDENCY entries should take precedence, but until we replace the TOC class with a mechansim that has some notion of type-based priorities, we need to manipulate the insertion order to ensure that DEPENDENCY entries from `binaries` are inserted before corresponding EXTENSION ones from `PYZ.dependencies`.
… mode When source-file-existence check fails in PKG or COLLECT, raise an error if we are in stric collect mode.
Minor clean up of functions involved in multipackage dependency extraction.
…ries Remove redundant conditions in if statements to make them easier to understand, and add comments for the next time when I have the misfortune of having to poke around this part of the codebase.
Turns out that on Windows, the effort of previous commits at solving the duplication problems is negated by the TOC class case-normalizing the names only for BINARY and DATA entries, but not for EXTENSION and DEPENDENCY ones. Fix that by adding EXTENSION and DEPENDENCY to the list of typecodes whose names need to be case-normnalized before comparison against existing names.
243377d
to
f49f251
Compare
This branch was intended as a preliminary clean-up step that would make rebasing my symlinks branch back onto
develop
a bit easier. However, the clean-up endeavor quickly morphed into attempts to fix various forms of file duplication that occur under certain build conditions.One such problem is that in the current implementation,
EXTENSION
TOC entries keep their module-like name until the very end, when they need to be collected. This means that all codepaths need to deal with extension file name adjustment; even worse, it also means that from the TOC's perspective, the following two entries are different and are thus not de-duplicated when they should be:So when building a onefile
torch
program, the user ends up with aAs part of the cleanup/fix, we push the
EXTENSION
name resolving into analysis stage, which should take care of this problem, as well as simplify the code because now the name handling needs to be done in only one place.Due to the planned deprecation of
TOC
class and removal of its use from the codebase, the idea was to implement extra checks that would ensure that no file duplication is introduced during the process. So we now have optional run-time and build-time checks for duplication (for onefilePKG
extraction at run-time, and duringPKG
andCOLLECT
assembly at build-time) and all are enabled on our CI to ensure we are aware of any duplication issues.These extra checks revealed that
MERGE
has been broken since PyInstaller v5.0 as it is not removing entries from passedAnalysis.binaries
andAnalysis.datas
, due to 14949d3 changing the waytoc[i] = (None, None, None)
works. So trying to useMERGE
correctly fails to yield any de-duplication at all. Therefore, the second part of the PR tries to deal withMERGE
and introduce some sanity in the related parts of the codebase.Closes #7215 - the documentation issue part is partly addressed by #7236 and party here, and the lack of de-duplication is addressed here.