Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

archive: clean up archive reader and writer classes #7518

Merged
merged 7 commits into from
Mar 26, 2023

Conversation

rokm
Copy link
Member

@rokm rokm commented Mar 25, 2023

Clean up archive reader and writer classes by removing the "common" ArchiveReader and ArchiveWriter classes. The archive data formats that we use actually have little in common, so trying to shoehorn them onto common/abstract classes actually does more harm than good and leads to questionable design choices.

One example is the complicated attempt at lazy file access (to prevent file locking on Windows), which needs a home-grown thread-local storage due to trying to track the last position in the files (which is redundant anyway, because every read access needs to seek to the data blob's position). This all originates from base ArchiveReader trying to keep a reference to the archive file, whereas ZlibArchiveReader needs to open/close the file on each archive access.

Similarly, the base class implementations imply that we have a data structure whose signature is PYL\0 while there is no such thing (anymore?).

@rokm rokm force-pushed the cleanup-archive-writer-reader branch 2 times, most recently from 34d4c82 to 4481ba6 Compare March 25, 2023 18:51
Remove the common `ArchiveReader` class. Because the different archive
structures used by PyInstaller have very little in common in terms of
format, having separate implementations actually simplifies them.

The refactoring also simplifies the lazy file access in
`ZlibArchiveReader`, which now stores just the archive file path and
opens the file whenever access is needed. We do not need to track the
previous position in file, because each access performs seek before
reading. This allows us to get rid of associated `FilePos` and
`ArchiveFile` classes and their home-grown thread-local storage for
file position pointers.
The test was testing the multi-threaded read behavior of the
ArchiveFile, which was used by archive readers to dynamically
open and close the archive file when reading from it. The
ArchiveFile wrapper is gone now (it is easier and less error-prone
to just open the file directly whenever needed), so remove the
test as well.
Move the helper for recursive listing of PKG contents from
`PyInstaller.utisl.cliutils.archive_viewer` to
`PyInstaller.archive.readers` and simplify the implementation.
Remove the common `ArchiveWriter` class. Because the different archive
structures used by PyInstaller have very little in common in terms of
format, having separate implementations actually simplifies them.
Add a `raw` flag that allows us to read the module data without
unmarshaling it (e.g., for extraction in archive viewer program).
Rewrite the `archive_viewer` utility to take advantage of the clean up
in archive reader classes. When opening embedded archive, do not
extract it to temporary file. Rename the dubious `--log` command line
option to `--list` (for non-interactive content listing). Display
entries line-by-line using `print` and string formatting instead
of dumping the TOC dictionary with `pprint.pprint`.
@rokm rokm force-pushed the cleanup-archive-writer-reader branch from 4481ba6 to 7befd43 Compare March 25, 2023 19:03
@rokm rokm marked this pull request as ready for review March 25, 2023 23:20
@rokm rokm requested a review from bwoodsend March 25, 2023 23:20
@rokm rokm merged commit 04b19aa into pyinstaller:develop Mar 26, 2023
@rokm rokm deleted the cleanup-archive-writer-reader branch March 26, 2023 10:25
rokm added a commit to rokm/pyinstaller that referenced this pull request Apr 14, 2023
The archive writers clean-up in pyinstaller#7518 introduced a subtle bug that
causes incorrect padding of TOC entries that require no padding;
such entries receive padding of the previously-processed entry.

This in turn breaks 16-byte alignment of the TOC entries, causing
bus error when running onefile executable on platforms with
stricter alignment requirements, for example armhf (32-bit
Raspian OS).
rokm added a commit to rokm/pyinstaller that referenced this pull request Apr 14, 2023
The archive writers clean-up in pyinstaller#7518 introduced a subtle bug that
causes incorrect padding of TOC entries that require no padding;
such entries receive padding of the previously-processed entry.

This in turn breaks 16-byte alignment of the TOC entries, causing
bus error when running onefile executable on platforms with
stricter alignment requirements, for example armhf (32-bit
Raspian OS).
rokm added a commit to rokm/pyinstaller that referenced this pull request Apr 14, 2023
The archive writers clean-up in pyinstaller#7518 introduced a subtle bug that
causes incorrect padding of TOC entries that require no padding;
such entries receive padding of the previously-processed entry.

This in turn breaks 16-byte alignment of the TOC entries, causing
bus error when running executables on platforms with stricter
data alignment requirements, for example armhf (32-bit Raspbian OS).
rokm added a commit that referenced this pull request Apr 14, 2023
The archive writers clean-up in #7518 introduced a subtle bug that
causes incorrect padding of TOC entries that require no padding;
such entries receive padding of the previously-processed entry.

This in turn breaks 16-byte alignment of the TOC entries, causing
bus error when running executables on platforms with stricter
data alignment requirements, for example armhf (32-bit Raspbian OS).
JonathonReinhart added a commit to JonathonReinhart/staticx that referenced this pull request Apr 16, 2023
fix: update to support new pyinstaller 5.10 syntax

Before pyinstaller/pyinstaller#7518:
- The `CArchiveReader.toc` attribute referred to a `CTOCReader` object.
- `CTOCReader.data` was a `list` of `(dpos, dlen, ulen, flag, typcd, name)` tuples.
- `CTOCReader.find()` was used to find an entry for a given name.
- `CArchiveReader.extract()` accepted either:
  - An index into the TOC (this is what `staticx` used to use: `n`)
  - OR an entry name

After pyinstaller/pyinstaller#7518:
- The `CTOCReader` class was removed.
- The `CArchiveReader.toc` attribute is now a `dict`: `name` => `(entry_offset, data_length, uncompressed_length, compression_flag, typecode)` (the same tuple but without the final `name` element)
- `CArchiveReader.extract()` accepts only an entry name
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 25, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants