
Add bunch of hooks for contemporary ML stuff #676

Merged
merged 35 commits into pyinstaller:master from deeplearning-libs on Dec 23, 2023

Conversation

@rokm (Member) commented Dec 20, 2023

Add a bunch of hooks for contemporary machine-learning stuff, along with updates to the existing hooks for torch and tensorflow.

We now have hooks for torchaudio and torchtext to collect the dynamically-loaded extensions, and source .py files for TorchScript. So we can close #375, which was initially about missing extension libs (and the rest of the horrors in there should be resolved in PyInstaller 6.x). The torchvision hook is similarly updated to collect source .py files, and I've added a hook for torchvision.io.image to ensure that this module collects the torchvision.image extension that it dynamically loads.

The tensorflow hook got a bunch of updates as well:

  • with PyInstaller >= 6.0, we no longer need to exclude tensorflow.python._pywrap_tensorflow_internal as a module (which was done in an attempt to avoid duplication); this exclusion was the cause of the Tensorflow error undefined symbol: _ZTIN10tensorflow8OpKernelE #121, so we can now close that issue.
  • the version check is now done based on the distribution name (which we need to guess, due to the different possibilities) - this addresses a potential regression from PyInstaller v5: there, is_module_satisfies implicitly fell back to checking tensorflow.__version__, while in v6, is_module_satisfies is purely metadata-based.
  • we collect source files from tensorflow.python.autograph to avoid a run-time warning about AutoGraph being unavailable.
  • we now collect plugins from tensorflow-plugins, and the run-time hook now monkey-patches site.getsitepackages to work around a faulty module file location check, which prevents the collected plugins from being loaded.
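The monkey-patching idea can be sketched as follows (a minimal sketch; the function name and the stand-in application directory are illustrative assumptions, not the actual run-time hook code):

```python
import site

# Sketch: make site.getsitepackages() also report the frozen
# application's top-level directory, so that tensorflow's
# module-file-location check accepts the collected plugins.
_orig_getsitepackages = getattr(site, "getsitepackages", lambda: [])
_APP_ROOT = "/path/to/frozen/app"  # stands in for sys._MEIPASS

def _pyi_getsitepackages():
    paths = list(_orig_getsitepackages())
    if _APP_ROOT not in paths:
        paths.insert(0, _APP_ROOT)
    return paths

site.getsitepackages = _pyi_getsitepackages
```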

Both the torch and tensorflow hooks now add hiddenimports for nvidia.* subpackages, which seem to be the contemporary way of providing CUDA libs via PyPI wheels on Linux.
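The dist-name-to-module mapping behind this can be sketched as follows (a hypothetical helper, not the hooks' exact implementation):

```python
import re

def nvidia_dist_to_module(dist_name):
    # Map a nvidia-* distribution name (e.g. nvidia-cudnn-cu12) to the
    # nvidia.* module it provides (nvidia.cudnn): strip the trailing
    # -cuNN CUDA version suffix and turn remaining dashes into
    # underscores under the nvidia top-level package.
    name = re.sub(r'-cu\d+$', '', dist_name)
    parts = name.split('-')
    if parts[0] != 'nvidia' or len(parts) < 2:
        return None  # not a nvidia-* dist
    return 'nvidia.' + '_'.join(parts[1:])
```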

Then there's a hook for Hugging Face transformers - this one primarily deals with collecting metadata for all of its dependencies; we collect metadata for all listed dependencies that are available at build time, to ensure they are also visible at run time (since we have no way of knowing which will be used). Plus, there's the source .py collection. This closes #462 and #562.
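The collection strategy can be sketched like this (the deps dict is a stand-in for transformers.dependency_versions_table.deps; a real hook would feed each surviving name to PyInstaller's copy_metadata()):

```python
from importlib.metadata import distribution, PackageNotFoundError

# Stand-in for transformers' declared dependency table.
deps = {
    "tokenizers": "tokenizers>=0.14",
    "definitely-not-installed-xyz": "definitely-not-installed-xyz>=1.0",
}

# Keep only the dependencies whose dist is actually installed at
# build time; only their metadata can (and needs to) be collected.
available = []
for dist_name in deps:
    try:
        distribution(dist_name)
    except PackageNotFoundError:
        continue
    available.append(dist_name)
```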

In addition, there are hooks for several other packages that we have seen in recent reports:
pyinstaller/pyinstaller#5672
pyinstaller/pyinstaller#5729
pyinstaller/pyinstaller#7647
pyinstaller/pyinstaller#7906
pyinstaller/pyinstaller#7911
pyinstaller/pyinstaller#7918
pyinstaller/pyinstaller#7985
pyinstaller/pyinstaller#8013
pyinstaller/pyinstaller#8155

Primarily, these are about collection of source .py files for packages that use TorchScript or some other form of JIT. While the plan in pyinstaller/pyinstaller#6290 is to automatically detect such cases, I think at this point it is more feasible to add hooks for the more popular packages.

There are a bunch of tests added, but these are mostly there so that I could automatically test on different platforms (my main linux box, Windows laptop, and arm64 Mac) without having to juggle test program scripts. They are definitely not meant for CI - both due to speed and because some of them need to download datasets and/or models. I did not put them under the slow mark, but we can do so if we see them triggered by accident (i.e., if something in our CI manages to pull in torch & co. as part of its dependencies).

Create a separate `test_pytorch.py` test for `torch` and its
associated libraries. Move existing tests to the new file.
Explicitly force the `pyi_builder` into onedir-only mode instead of
skipping onefile tests. Reduces number of reported tests.
Add a test that uses `torchvision.datasets`, `torchvision.transforms`,
and torchscript. The latter demonstrates the need for collecting
source .py files from `torchvision`.
Rename `hook-torchvision.ops.py` to `hook-torchvision.py`, and
add `module_collection_mode = 'pyz+py'` to collect source .py
files for torch JIT/torchscript.
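Sketched as a hook file, the change might look like this (a sketch only, assuming a PyInstaller version that supports module_collection_mode; collect_dynamic_libs is PyInstaller's standard helper):

```python
# Hypothetical hook-torchvision.py sketch.
from PyInstaller.utils.hooks import collect_dynamic_libs

# Collect the package's dynamically-loaded shared libraries.
binaries = collect_dynamic_libs('torchvision')

# Collect source .py files alongside the bytecode in the PYZ, so that
# torch JIT/TorchScript can retrieve module sources at run time.
module_collection_mode = 'pyz+py'
```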
When collecting binaries in the PyInstaller >= 6.0 codepath,
explicitly collect versioned .so files by adding '*.so.*' to the
list of search patterns passed to `collect_dynamic_libs`.

Just in case any of those libs happens to be dynamically
loaded at run-time...
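The pattern set can be illustrated with a small sketch: PyInstaller's default `*.so` pattern misses versioned ELF libraries like `libfoo.so.2`, hence the extra `*.so.*` entry.

```python
import fnmatch

# Linux search patterns: plain extensions plus versioned shared libs.
patterns = ['*.so', '*.so.*']

def is_shared_lib(filename):
    # True if the file name matches any of the search patterns.
    return any(fnmatch.fnmatch(filename, p) for p in patterns)
```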
The contemporary PyPI torch wheels for linux use CUDA libraries
that are installed via `nvidia-*` packages. Therefore, attempt to
convert the `nvidia-*` requirements from the `torch` metadata
into hidden imports. This way, we can provide hooks for `nvidia.*`
packages that collect the shared libs, in case any of them are
dynamically loaded (which currently seems to be the case with
some of the shared libraries from `nvidia.cudnn`).
@rokm (Member, Author) commented Dec 21, 2023

One caveat remains - on Linux, even if we collect CUDA libs from nvidia.* packages, tensorflow fails to discover them at run-time.

I think the reason is that the libtensorflow_cc.so.2, libtensorflow_framework.so.2, and _pywrap_tensorflow_internal.so symlinks in sys._MEIPASS cause the discovery code to search in the wrong directories (i.e., it takes the unresolved location of these files as the base location for the search), as removing those three symlinks from the generated frozen application seems to fix things.

To deal with this, we will need to add a mechanism that will allow the hooks to suppress (specific) symlink generation during binary dependency analysis; but I need to think about whether to introduce a dedicated mechanism for that, or try to make it part of the more general file-exclusion mechanism that we were discussing some time back...

On Linux, NVIDIA CUDA 11.x and 12.x shared libraries can be
installed via PyPI wheels (e.g., `nvidia-cuda-runtime-cu12`,
`nvidia-cudnn-cu12`, `nvidia-cublas-cu12`). These all
provide sub-packages under `nvidia` top level package (e.g.,
`nvidia.cuda_runtime`, `nvidia.cudnn`, `nvidia.cublas`).

Add hooks for these sub-packages that ensure that all shared
libraries from the sub-packages `lib` directory are collected,
in case they are dynamically loaded.

For example, `torch` PyPI wheels for linux do not bundle CUDA
inside `torch/lib` (whereas the wheels from their own server,
built with "non-default" CUDA versions, bundle them), and
dynamically load `libcudnn_ops_infer.so.8` from
`nvidia.cudnn.lib`.
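What such a hook needs to accomplish can be sketched in plain Python (collect_package_libs is a hypothetical stand-in for what PyInstaller's collect_dynamic_libs helper does for these packages):

```python
import os

def collect_package_libs(package_dir, package_name):
    # Gather every shared library from the sub-package's lib directory,
    # preserving its destination path inside the frozen application.
    binaries = []
    lib_dir = os.path.join(package_dir, 'lib')
    if os.path.isdir(lib_dir):
        for fname in sorted(os.listdir(lib_dir)):
            if '.so' in fname:  # matches plain *.so and versioned *.so.N
                src = os.path.join(lib_dir, fname)
                dest = os.path.join(*package_name.split('.'), 'lib')
                binaries.append((src, dest))
    return binaries
```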
Add a test for torchaudio that uses a transform to resample a
signal. The test shows the need to collect binaries from the
torchaudio package, and, due to use of torchscript, also shows
the need to collect source .py files.
Add hook for torchaudio that collects binaries and ensures that
source .py files are collected.
Add a test for torchtext that uses a tokenization transform from
Berta Encoder. The test shows the need to collect binaries from the
torchtext package and, due to the use of torchscript, also shows the
need to collect source .py files.

We perform only the tokenization part of the processing, in order to
avoid having to download the whole model (~240 MB).
Add hook for torchtext that collects binaries and ensures that
source .py files are collected.
Move tensorflow tests into separate `test_tensorflow` file.
Improve the `tensorflow_onedir_only` test mark (force pyi_builder
into generating only the onedir case instead of skipping the onefile
case).
Add a basic `transformers` pipeline test. Add a basic import test
for `transformers.DebertaModel`, which shows that we need to collect
source .py files for TorchScript.
Add hook for Hugging Face `transformers` package.

Attempt to automatically collect metadata for all of the package's
dependencies (as declared in the `deps` dictionary in the
`transformers.dependency_versions_table` module).

Collect source .py files as some of the functionality uses
TorchScript.
Add a hook for fastai, which ensures that the package's source .py
files are collected, as they are required for TorchScript.

Add a test based on the "building models from tabular data" example
from https://docs.fast.ai/quick_start.html, which demonstrates
the need for the source .py files.
`torchvision.io.image` attempts to dynamically load
`torchvision.image` extension, so add a hook that ensures the
latter is collected.

Add a basic image encoding/decoding test.
Add hook for timm, which ensures that the package's source .py
files are collected, as they are required for TorchScript.

Add a basic model listing and creation test.
Add test for `lightning`, based on their "Getting started" example
with an autoencoder trained on the MNIST dataset. Requires `torchvision`
for the dataset.
Add hook for (PyTorch) `lightning`. Currently, the main functionality
is to ensure that the `version.info` file from the package is
collected.

We do not collect source .py files, as it seems that even if
`lightning.LightningModule.to_torchscript()` is used, it requires
the source where the model inheriting from `lightning.LightningModule`
is defined, rather than `lightning`'s own sources. We can always
add source .py files collection later, if it proves to be necessary.
Add hooks for `bitsandbytes`, and its dependency `triton`. Both
packages have dynamically-loaded extension libraries, and both
require collection of source .py files for (`triton`'s) JIT module.

With `triton`, some of the submodules need to be collected only as
source .py files (no PYZ), because the code naively tries to read
the files pointed to by the `__file__` attribute under the assumption
that they are source .py files. So we must not collect these modules
into the PYZ.
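A sketch of what such collection-mode settings might look like (the submodule names here are illustrative assumptions, not the hook's actual list):

```python
# Hypothetical hook-triton.py fragment. Most of the package can be
# collected as bytecode plus source, but modules that naively read the
# file pointed to by __file__ must be collected as source-only, so the
# file on disk really is a .py file rather than a PYZ entry.
module_collection_mode = {
    'triton': 'pyz+py',          # bytecode in PYZ, plus source .py files
    'triton.runtime.jit': 'py',  # illustrative: source-only, no PYZ entry
}
```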
Add hook and basic test for `linear_operator`. The package uses
torchscript/JIT, so we need to collect its source .py files.
Add a basic test for GPyTorch, based on their "Simple GP Regression"
example.
Add hook for `fvcore.nn` to collect its source .py files for
torchscript/JIT.

Add a basic import test that demonstrates the need for that.
Add hook for `detectron2` to collect its source .py files for
torchscript/JIT.

Add a basic import test that demonstrates the need for that.
Add hook for `datasets` to collect its source .py files for
torchscript/JIT.

Add a basic dataset loading test that demonstrates the need for that.
Add basic test for Hugging Face `accelerate`; demonstrates that
for the tested subset of functionality, no hook is necessary.
Use 120-character lines to reduce the amount of line wrapping and
make the hook easier to read.
Remove the `tensorflow.python._pywrap_tensorflow_internal` hack
(adding it to excluded modules to avoid duplication) for
PyInstaller >= 6.0, where the issue is alleviated thanks to
the binary dependency analysis preserving the parent directory
layout of discovered/collected shared libraries.

This should fix the problem with `tensorflow` builds where the
`_pywrap_tensorflow_internal` module is not used as a shared
library, as seen in `tensorflow` builds for Raspberry Pi:
pyinstaller#121
Collect sources for `tensorflow.python.autograph` to avoid
run-time warning about AutoGraph being unavailable. Not sure if
we need to collect sources for other parts of `tensorflow`, though;
if that proves to be the case, we can adjust the collection mode
later.
Add `_pyinstaller_hooks_contrib.compat` module to hide the gory
details of stdlib `importlib.metadata` vs. `importlib_metadata`.
Import the `importlib_metadata` from `PyInstaller.compat` if
available (PyInstaller >= 6.0), otherwise duplicate the logic.

Copy `importlib_metadata` and `packaging` requirements from
PyInstaller to pyinstaller-hooks-contrib.
Determine tensorflow's dist name using a list of potential
candidates, and if a dist is found, retrieve the version from
it. Otherwise, fall back to reading `tensorflow.__version__`.

The `tensorflow` package is available in several variants, and
sometimes the `tensorflow` dist installs a separate dist (e.g.,
`tensorflow-intel` on Windows); but the user can install this
separate dist without installing the "top-level" `tensorflow`
one. In PyInstaller v5 and earlier, the `is_module_satisfies`
fell back to querying `tensorflow.__version__` if the dist could
not be found - in v6, the implementation of `is_module_satisfies`
(or rather, `check_requirement`) checks only the metadata.

As an added bonus, the direct version comparisons are nicer to
read than the comparisons against `tf_pre_1_15_0`, `tf_post_1_15_0`,
etc.
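The lookup order can be sketched as follows (the candidate list is illustrative, not exhaustive, and find_dist_version is a hypothetical helper):

```python
from importlib.metadata import version, PackageNotFoundError

# Illustrative candidate dist names for the tensorflow package.
_CANDIDATES = [
    'tensorflow', 'tensorflow-cpu', 'tensorflow-gpu',
    'tensorflow-intel', 'tensorflow-macos',
]

def find_dist_version(candidates):
    # Return (dist_name, version) for the first installed candidate;
    # (None, None) signals the caller to fall back to the imported
    # module's __version__ attribute.
    for name in candidates:
        try:
            return name, version(name)
        except PackageNotFoundError:
            continue
    return None, None
```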
Linux builds of `tensorflow` can install CUDA via nvidia-* packages
that are enabled by the `and-cuda` extra marker. So parse the dist
metadata for requirements, and turn the `nvidia-*` requirements
into `nvidia.*` hidden imports.

Consolidate the shared code for converting a `nvidia-*` dist name
into a `nvidia.*` module name into a utility function, and use it in
both `torch` and `tensorflow` hooks.
Consolidate `pytest` configuration from `pytest.ini` into
`setup.cfg`, to match what we have in the main PyInstaller
repository.

Add -v, -rsxXfE, and --doctest-glob= to the test flags. The
addition of -v ensures that in manual (local) pytest runs,
the test names are displayed as they are run (the CI
workflows seem to explicitly specify -v as part of their
pytest commands).
When running the `test_lightning_mnist_autoencoder` test under arm64
macOS, `multiprocessing` seems to be activated at some point,
and the test gets stuck due to the lack of a
`multiprocessing.freeze_support` call; so add that call to the
test program.
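The required program shape is the standard one for frozen applications that touch multiprocessing (a minimal sketch; main's body stands in for the actual test program):

```python
import multiprocessing

def main():
    # Body of the frozen test program would go here.
    return 0

if __name__ == '__main__':
    # In a frozen application this must run before any other use of
    # multiprocessing; otherwise a spawned worker process re-executes
    # the bundled program from the top, and the run gets stuck.
    multiprocessing.freeze_support()
    main()
```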
Have the `tensorflow` standard hook collect binaries from the
`tensorflow-plugins` package; this contains plugins for tensorflow's
pluggable device architecture (such as `tensorflow-metal` for macOS
and `tensorflow-directml-plugin` for Windows).

Have the `tensorflow` run-time hook override `site.getsitepackages()`
with a custom implementation that allows us to trick `tensorflow` into
loading the plugins.
@rokm rokm merged commit a2f65ef into pyinstaller:master Dec 23, 2023
2 checks passed
@rokm rokm deleted the deeplearning-libs branch December 23, 2023 13:37
github-actions bot pushed a commit to wxx9248/CIS-Game-Project-2023W that referenced this pull request Jan 4, 2024
…23.12 (#83)

Bumps
[pyinstaller-hooks-contrib](https://github.com/pyinstaller/pyinstaller-hooks-contrib)
from 2023.11 to 2023.12.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/releases">pyinstaller-hooks-contrib's
releases</a>.</em></p>
<blockquote>
<h2>2023.12</h2>
<p>Please see the <a
href="https://www.github.com/pyinstaller/pyinstaller-hooks-contrib/tree/master/CHANGELOG.rst">changelog</a>
for more details</p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/blob/master/CHANGELOG.rst">pyinstaller-hooks-contrib's
changelog</a>.</em></p>
<blockquote>
<h2>2023.12 (2024-01-03)</h2>
<p>New hooks</p>
<ul>
<li>Add hook for <code>detectron2</code> to collect its source .py files for TorchScript/JIT. (#676)</li>
<li>Add hook for <code>fastai</code> to collect its source .py files for TorchScript/JIT. (#676)</li>
<li>Add hook for <code>fvcore.nn</code> to collect its source .py files for TorchScript/JIT. (#676)</li>
<li>Add hook for <code>langchain</code> that collects data files from the package. (#681)</li>
<li>Add hook for <code>lightning</code> (PyTorch Lightning) to ensure that its <code>version.info</code> data file is collected. (#676)</li>
<li>Add hook for <code>linear_operator</code> to collect its source .py files for TorchScript/JIT. (#676)</li>
<li>Add hook for <code>seedir</code> that collects the <code>words.txt</code> data file from the package. (#681)</li>
<li>Add hook for <code>timm</code> (Hugging Face PyTorch Image Models) to collect its source .py files for TorchScript/JIT. (#676)</li>
<li>Add hook for <code>torchaudio</code> that collects dynamically-loaded extensions, as well as source .py files for TorchScript/JIT. (#676)</li>
<li>Add hook for <code>torchtext</code> that collects dynamically-loaded extensions, as well as source .py files for TorchScript/JIT. (#676)</li>
<li>Add hook for <code>torchvision.io.image</code> to ensure that the dynamically-loaded extension required by this module is collected. (#676)</li>
<li>Add hook for <code>VADER</code>. (#679)</li>
<li>Add hook for Hugging Face <code>datasets</code> to collect its source .py files for TorchScript/JIT. (#676)</li>
<li>Add hook for Hugging Face <code>transformers</code>. The hook attempts to automatically collect the metadata of all dependencies (as declared in the <code>deps</code> dictionary in the <code>transformers.dependency_versions_table</code> module), in order to make dependencies available at build time visible to <code>transformers</code> at run time. The hook also collects source .py files, as some of the package's functionality uses TorchScript/JIT. (#676)</li>
</ul>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/commit/16295d14a07aa7567903ecdc63556008dc988cb6"><code>16295d1</code></a>
Release v2023.12</li>
<li><a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/commit/13fc0fd1656e9313172f751df063f75d68631a3e"><code>13fc0fd</code></a>
hooks: remove hook for google.api</li>
<li><a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/commit/7d0c295856b465b132474d1f438b239396bc33a4"><code>7d0c295</code></a>
hooks: add hook for seedir</li>
<li><a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/commit/adb0eeb5f654a45250cf5bfe3fe31670116d47a4"><code>adb0eeb</code></a>
hooks: add hook for langchain</li>
<li><a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/commit/9322348879736e9ee154841556b8d71db7172657"><code>9322348</code></a>
Scheduled weekly dependency update for week 00 (<a
href="https://redirect.github.com/pyinstaller/pyinstaller-hooks-contrib/issues/680">#680</a>)</li>
<li><a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/commit/98e03f00795c147bec1768f0cb37e5053796b7ee"><code>98e03f0</code></a>
hooks: add hook for VADER</li>
<li><a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/commit/df1035483f268fca3a2848eeec6235cdf9a2fee6"><code>df10354</code></a>
Scheduled weekly dependency update for week 52 (<a
href="https://redirect.github.com/pyinstaller/pyinstaller-hooks-contrib/issues/678">#678</a>)</li>
<li><a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/commit/b9ed3babf51a3ed9cfd87857c414d7c7e24994be"><code>b9ed3ba</code></a>
cleanup: clean up and simplify <code>get_hooks_dirs</code>
entry-point</li>
<li><a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/commit/a2f65ef1af8df2edc64f8097df0c10e98f6bae69"><code>a2f65ef</code></a>
hooks: tensorflow: collect plugins from tensorflow-plugins</li>
<li><a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/commit/4ac0e974d0fae07cbeae4b1739bb89636d079a91"><code>4ac0e97</code></a>
tests: add multiprocessing.freeze_support() call to lightning test</li>
<li>Additional commits viewable in <a
href="https://github.com/pyinstaller/pyinstaller-hooks-contrib/compare/2023.11...2023.12">compare
view</a></li>
</ul>
</details>
@oushu1zhangxiangxuan1

How do we use these hooks? Do they work by default?

@bwoodsend (Member)

If you have pyinstaller-hooks-contrib>=2023.12 installed then they will apply automatically.
