Add bunch of hooks for contemporary ML stuff #676
Conversation
Force-pushed from 56ee492 to 01b1573
Create a separate `test_pytorch.py` test for `torch` and its associated libraries. Move existing tests to the new file.
Explicitly force the `pyi_builder` into onedir-only mode instead of skipping onefile tests. This reduces the number of reported tests.
Add a test that uses `torchvision.datasets`, `torchvision.transforms`, and torchscript. The latter demonstrates the need for collecting source .py files from `torchvision`.
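For illustration, the kind of frozen program such a test builds might look roughly like this (a sketch, not the actual test code; the transform choices are arbitrary):

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# TorchScript compilation reads the torchvision source .py files at run-time,
# which is why the hook must collect them.
transforms = nn.Sequential(T.Resize(256), T.CenterCrop(224))
scripted = torch.jit.script(transforms)

print(scripted(torch.rand(3, 300, 400)).shape)  # torch.Size([3, 224, 224])
```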
Force-pushed from 01b1573 to ac5dfa6
Rename `hook-torchvision.ops.py` to `hook-torchvision.py`, and add `module_collection_mode = 'pyz+py'` to collect source .py files for torch JIT/torchscript.
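A minimal sketch of what the renamed hook might contain (the actual `hook-torchvision.py` may collect additional items):

```python
from PyInstaller.utils.hooks import collect_data_files, collect_dynamic_libs

# Dynamically-loaded libraries and data files shipped with the package.
binaries = collect_dynamic_libs("torchvision")
datas = collect_data_files("torchvision")

# Collect modules into the PYZ archive *and* ship their source .py files,
# so that torch JIT / TorchScript can find the sources at run-time.
module_collection_mode = "pyz+py"
```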
When collecting binaries in the PyInstaller >= 6.0 codepath, explicitly collect versioned .so files by adding '*.so.*' to the list of search patterns passed to `collect_dynamic_libs`, just in case any of those libs happens to be dynamically loaded at run-time.
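Roughly, the PyInstaller >= 6.0 branch might look like this (a sketch; the exact pattern list used by the real hook may differ):

```python
from PyInstaller.utils.hooks import collect_dynamic_libs, is_module_satisfies

if is_module_satisfies("PyInstaller >= 6.0"):
    # Extend the default patterns with versioned shared libraries (e.g.,
    # libfoo.so.2), in case any of them are dlopen()ed at run-time.
    binaries = collect_dynamic_libs(
        "torch",
        search_patterns=["*.dll", "*.dylib", "lib*.so", "*.so.*"],
    )
```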
The contemporary PyPI torch wheels for linux use CUDA libraries that are installed via `nvidia-*` packages. Therefore, attempt to convert the `nvidia-*` requirements from the `torch` metadata into hidden imports. This way, we can provide hooks for `nvidia.*` packages that collect the shared libs, in case any of them are dynamically loaded (which currently seems to be the case with some of the shared libraries from `nvidia.cudnn`).
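A sketch of the conversion (not the hook's exact implementation; the prefix/suffix handling below is an assumption based on the dist names involved):

```python
import re
from importlib.metadata import requires, PackageNotFoundError
from packaging.requirements import Requirement

hiddenimports = []
try:
    for req_string in requires("torch") or []:
        dist_name = Requirement(req_string).name
        if not dist_name.startswith("nvidia-"):
            continue
        # e.g., nvidia-cuda-runtime-cu12 -> nvidia.cuda_runtime,
        #       nvidia-cudnn-cu12        -> nvidia.cudnn
        module_name = re.sub(r"-cu\d+$", "", dist_name)  # drop the CUDA version suffix
        module_name = module_name[len("nvidia-"):]       # drop the dist-name prefix
        hiddenimports.append("nvidia." + module_name.replace("-", "_"))
except PackageNotFoundError:
    pass
```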
Force-pushed from ac5dfa6 to 36f68b2
One caveat remains on Linux, even if we collect CUDA libs from […]. I think the reason is that […]. To deal with this, we will need to add a mechanism that will allow the hooks to suppress (specific) symlink generation during binary dependency analysis; but I need to think whether to introduce a dedicated mechanism for that, or try to make it part of the more general file-exclusion mechanism that we were discussing some time back...
Force-pushed from 36f68b2 to d789ba0
On Linux, NVIDIA CUDA 11.x and 12.x shared libraries can be installed via PyPI wheels (e.g., `nvidia-cuda-runtime-cu12`, `nvidia-cudnn-cu12`, `nvidia-cublas-cu12`). These all provide sub-packages under the `nvidia` top-level package (e.g., `nvidia.cuda_runtime`, `nvidia.cudnn`, `nvidia.cublas`). Add hooks for these sub-packages that ensure that all shared libraries from each sub-package's `lib` directory are collected, in case they are dynamically loaded. For example, `torch` PyPI wheels for linux do not bundle CUDA libraries inside `torch/lib` (whereas the wheels from their own server, built with "non-default" CUDA versions, do bundle them), and dynamically load `libcudnn_ops_infer.so.8` from `nvidia.cudnn.lib`.
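A minimal sketch of one such hook, using `nvidia.cudnn` as the example (the real hooks may use a different collection helper or patterns):

```python
from PyInstaller.utils.hooks import collect_dynamic_libs

# Collect every shared library from the nvidia/cudnn/lib directory, including
# versioned ones such as libcudnn_ops_infer.so.8, in case they are dlopen()ed.
binaries = collect_dynamic_libs("nvidia.cudnn", search_patterns=["lib*.so*"])
```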
Add a test for torchaudio that uses a transform to resample a signal. The test shows the need to collect binaries from the torchaudio package, and, due to use of torchscript, also shows the need to collect source .py files.
Add hook for torchaudio that collects binaries and ensures that source .py files are collected.
Add a test for torchtext that uses a tokenization transform from the RoBERTa encoder. The test shows the need to collect binaries from the torchtext package, and, due to the use of torchscript, also shows the need to collect source .py files. We perform only the tokenization part of the processing, in order to avoid having to download the whole model (~240 MB).
Add hook for torchtext that collects binaries and ensures that source .py files are collected.
Move tensorflow tests into separate `test_tensorflow` file. Improve the `tensorflow_onedir_only` test mark (force pyi_builder into generating only onedir case instead of skipping the onefile case).
Add a basic `transformers` pipeline test. Add a basic import test for `transformers.DebertaModel`, which shows that we need to collect source .py files for TorchScript.
Add hook for Hugging Face `transformers` package. Attempt to automatically collect metadata for all of the package's dependencies (as declared in the `deps` dictionary in the `transformers.dependency_versions_table` module). Collect source .py files, as some of the functionality uses TorchScript.
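The dependency-metadata part could be sketched roughly as follows (an illustration only; the real hook may obtain the `deps` table and handle missing dists differently):

```python
from PyInstaller.utils.hooks import collect_data_files, copy_metadata

datas = collect_data_files("transformers")
module_collection_mode = "pyz+py"  # source .py files for TorchScript

# Collect metadata for every dependency declared in the deps dictionary,
# skipping those that are not installed in the build environment.
try:
    from transformers.dependency_versions_table import deps
except ImportError:
    deps = {}

for dependency_name in deps:
    try:
        datas += copy_metadata(dependency_name)
    except Exception:
        pass  # dependency not installed at build time
```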
Add a hook for fastai, which ensures that the package's source .py files are collected, as they are required for TorchScript. Add a test based on the "building models from tabular data" example from https://docs.fast.ai/quick_start.html, which demonstrates the need for the source .py files.
`torchvision.io.image` attempts to dynamically load the `torchvision.image` extension, so add a hook that ensures the latter is collected. Add a basic image encoding/decoding test.
Add hook for timm, which ensures that the package's source .py files are collected, as they are required for TorchScript. Add a basic model listing and creation test.
Add test for `lightning`, based on their "Getting started" example with an autoencoder trained on the MNIST dataset. Requires `torchvision` for the dataset.
Add hook for (PyTorch) `lightning`. Currently, the main functionality is to ensure that the `version.info` file from the package is collected. We do not collect source .py files, as it seems that even if `lightning.LightningModule.to_torchscript()` is used, it requires the source where the model inheriting from `lightning.LightningModule` is defined, rather than `lightning`'s own sources. We can always add source .py files collection later, if it proves to be necessary.
Add hooks for `bitsandbytes` and its dependency `triton`. Both packages have dynamically-loaded extension libraries, and both require collection of source .py files for (`triton`'s) JIT module. With `triton`, some of the submodules need to be collected only as source .py files (no PYZ), because the code naively tries to read the files pointed to by the `__file__` attribute under the assumption that they are source .py files. So we must not collect these modules into the PYZ.
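To illustrate the per-module granularity that `module_collection_mode` allows (the module names below are placeholders, not necessarily the ones used in the real hooks):

```python
# hook-triton.py (sketch)
module_collection_mode = {
    # Byte-compiled modules in the PYZ plus source .py files for the JIT.
    "triton": "pyz+py",
    # Source-only collection for modules whose code reads __file__ as a .py
    # file; these must not be collected into the PYZ.
    "triton.runtime.jit": "py",
}
```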
Add hook and basic test for `linear_operator`. The package uses torchscript/JIT, so we need to collect its source .py files.
Add a basic test for GPyTorch, based on their "Simple GP Regression" example.
Add hook for `fvcore.nn` to collect its source .py files for torchscript/JIT. Add a basic import test that demonstrates the need for that.
Add hook for `detectron2` to collect its source .py files for torchscript/JIT. Add a basic import test that demonstrates the need for that.
Add hook for `datasets` to collect its source .py files for torchscript/JIT. Add a basic dataset loading test that demonstrates the need for that.
Add basic test for Hugging Face `accelerate`; demonstrates that for the tested subset of functionality, no hook is necessary.
Use 120-character lines to reduce the amount of line wrapping and make the hook easier to read.
Remove the `tensorflow.python._pywrap_tensorflow_internal` hack (adding it to excluded modules to avoid duplication) for PyInstaller >= 6.0, where the issue is alleviated thanks to the binary dependency analysis preserving the parent directory layout of discovered/collected shared libraries. This should fix the problem with `tensorflow` builds where the `_pywrap_tensorflow_internal` module is not used as a shared library, as seen in `tensorflow` builds for Raspberry Pi: pyinstaller#121
Collect sources for `tensorflow.python.autograph` to avoid run-time warning about AutoGraph being unavailable. Not sure if we need to collect sources for other parts of `tensorflow`, though; if that proves to be the case, we can adjust the collection mode later.
Add `_pyinstaller_hooks_contrib.compat` module to hide the gory details of stdlib `importlib.metadata` vs. `importlib_metadata`. Import the `importlib_metadata` from `PyInstaller.compat` if available (PyInstaller >= 6.0), otherwise duplicate the logic. Copy `importlib_metadata` and `packaging` requirements from PyInstaller to pyinstaller-hooks-contrib.
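Conceptually, something along these lines (a simplified sketch; the actual fallback logic and version bounds may differ):

```python
# _pyinstaller_hooks_contrib/compat.py (sketch)
try:
    # PyInstaller >= 6.0 already resolves stdlib vs. backport for us.
    from PyInstaller.compat import importlib_metadata
except ImportError:
    import sys

    if sys.version_info >= (3, 10):
        import importlib.metadata as importlib_metadata
    else:
        # Older Python: use the importlib_metadata backport.
        import importlib_metadata
```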
Determine tensorflow's dist name using a list of potential candidates, and if a dist is found, retrieve the version from it. Otherwise, fall back to reading `tensorflow.__version__`. The `tensorflow` package is available in several variants, and sometimes the `tensorflow` dist installs a separate dist (e.g., `tensorflow-intel` on Windows); but the user can install this separate dist without installing the "top-level" `tensorflow` one. In PyInstaller v5 and earlier, `is_module_satisfies` fell back to querying `tensorflow.__version__` if the dist could not be found - in v6, the implementation of `is_module_satisfies` (or rather, `check_requirement`) checks only the metadata. As an added bonus, the direct version comparisons are nicer to read than the comparisons against `tf_pre_1_15_0`, `tf_post_1_15_0`, etc.
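The lookup could be sketched as follows (the candidate list here is illustrative and incomplete):

```python
from importlib.metadata import distribution, PackageNotFoundError
from PyInstaller.utils.hooks import get_module_attribute

_DIST_CANDIDATES = ["tensorflow", "tensorflow-cpu", "tensorflow-gpu", "tensorflow-intel", "tensorflow-macos"]

tf_version = None
for candidate in _DIST_CANDIDATES:
    try:
        tf_version = distribution(candidate).version
        break
    except PackageNotFoundError:
        continue

if tf_version is None:
    # No dist metadata found - fall back to the module's own version attribute.
    tf_version = get_module_attribute("tensorflow", "__version__")
```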
Linux builds of `tensorflow` can install CUDA via `nvidia-*` packages that are enabled by the `and-cuda` extra marker. So parse the dist metadata for requirements, and turn the `nvidia-*` requirements into `nvidia.*` hidden imports. Consolidate the shared code for conversion of a `nvidia-*` dist name into a `nvidia.*` module name in a utility function, and use it in both `torch` and `tensorflow` hooks.
Consolidate `pytest` configuration from `pytest.ini` into `setup.cfg`, to match what we have in the main PyInstaller repository. Add `-v`, `-rsxXfE`, and `--doctest-glob=` to the test flags. The addition of `-v` ensures that in manual (local) pytest runs, the test names are displayed as they are run (the CI workflows seem to explicitly specify `-v` as part of their pytest commands).
When running `test_lightning_mnist_autoencoder` under arm64 macOS, `multiprocessing` seems to be activated at some point, and the test gets stuck due to the lack of a `multiprocessing.freeze_support()` call; so add that call to the test program.
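The standard fix in the frozen test program looks like this (`main` here stands for the test program's hypothetical entry point):

```python
import multiprocessing

def main():
    ...  # build and train the autoencoder

if __name__ == "__main__":
    # Required in frozen applications when multiprocessing uses the spawn
    # start method; without it, the spawned worker re-runs the whole program
    # and the test hangs.
    multiprocessing.freeze_support()
    main()
```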
Have the `tensorflow` standard hook collect binaries from the `tensorflow-plugins` package; this contains plugins for tensorflow's pluggable device architecture (such as `tensorflow-metal` for macOS and `tensorflow-directml-plugin` for Windows). Have the `tensorflow` run-time hook override `site.getsitepackages()` with a custom implementation that allows us to trick `tensorflow` into loading the plugins.
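The run-time part could look roughly like this (a sketch; the paths reported by the actual run-time hook may differ):

```python
# pyi_rth_tensorflow.py (sketch)
import site
import sys

_orig_getsitepackages = getattr(site, "getsitepackages", None)

def _pyi_getsitepackages():
    # Report the frozen application's top-level directory as a site-packages
    # location, so that tensorflow's plugin-location check accepts the
    # collected tensorflow-plugins directory.
    paths = [sys._MEIPASS]
    if _orig_getsitepackages is not None:
        paths += _orig_getsitepackages()
    return paths

site.getsitepackages = _pyi_getsitepackages
```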
Force-pushed from d789ba0 to 5ec43b8
How do we use these hooks? Do they work as defaults?
If you have […]
Add a bunch of hooks for contemporary machine-learning stuff, along with updates to the existing hooks for `torch` and `tensorflow`.

We now have hooks for `torchaudio` and `torchtext` that collect the dynamically-loaded extensions, and source .py files for TorchScript. So we can close #375, which was initially about missing extension libs (and the rest of the horrors in there should be resolved in PyInstaller 6.x). The `torchvision` hook is similarly updated to collect source .py files, and I've added a hook for `torchvision.io.image` to ensure that this module collects the `torchvision.image` extension that it dynamically loads.

The `tensorflow` hook got a bunch of updates as well:
- Under PyInstaller >= 6.0, the hook no longer excludes `tensorflow.python._pywrap_tensorflow_internal` as a module (which was done in an attempt to avoid duplication); this exclusion was the cause of "Tensorflow error undefined symbol: _ZTIN10tensorflow8OpKernelE" #121, so we can now close that.
- The hook now determines tensorflow's dist name from a list of candidates and reads the version from the dist metadata, falling back to `tensorflow.__version__`. In PyInstaller v5 and earlier, `is_module_satisfies` implicitly fell back to checking `tensorflow.__version__`, while in v6, `is_module_satisfies` is purely metadata-based.
- Sources are collected for `tensorflow.python.autograph` to avoid a run-time warning about AutoGraph being unavailable.
- Binaries are collected from `tensorflow-plugins`, and the run-time hook now monkey-patches `site.getsitepackages` to work around a faulty module file location check, which prevents the collected plugins from being loaded.

Both `torch` and `tensorflow` hooks now add hiddenimports for `nvidia.*` subpackages, which seems to be the contemporary way of providing CUDA libs via PyPI wheels on Linux.

Then there's a hook for Hugging Face `transformers` - this one primarily deals with collection of metadata of all its dependencies; we collect metadata for all listed dependencies that are available at build time, to ensure they are also visible at run time (since we have no way of knowing what will be used). Plus, there's the source .py collection. This closes #462 and #562.
In addition, there are hooks for several other packages that we have seen in recent reports:
pyinstaller/pyinstaller#5672
pyinstaller/pyinstaller#5729
pyinstaller/pyinstaller#7647
pyinstaller/pyinstaller#7906
pyinstaller/pyinstaller#7911
pyinstaller/pyinstaller#7918
pyinstaller/pyinstaller#7985
pyinstaller/pyinstaller#8013
pyinstaller/pyinstaller#8155
Primarily, these are about collection of source .py files for packages that use TorchScript or some other form of JIT. While the plan in pyinstaller/pyinstaller#6290 is to automatically detect such cases, I think at this point it is more feasible to add hooks for the more popular packages.
There are a bunch of tests added, but these are mostly there so that I could automatically test on different platforms (my main linux box, Windows laptop, and arm64 Mac) without having to juggle test program scripts. They are definitely not meant for CI - both due to speed and because some of them need to download datasets and/or models. I did not put them under the slow mark, but we can do so if we see them triggered by accident (i.e., something in our CI manages to pull in `torch` & co. as part of its dependencies).