Add bunch of hooks for contemporary ML stuff #676

Create a separate `test_pytorch.py` test for `torch` and its associated libraries. Move existing tests to the new file.

Explicitly force the `pyi_builder` into onedir-only mode instead of skipping onefile tests. This reduces the number of reported tests.

Add a test that uses `torchvision.datasets`, `torchvision.transforms`, and torchscript. The latter demonstrates the need for collecting source .py files from `torchvision`.

Rename `hook-torchvision.ops.py` to `hook-torchvision.py`, and add `module_collection_mode = 'pyz+py'` to collect source .py files for torch JIT/torchscript.
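As a minimal sketch of what the renamed hook boils down to (the real hook may contain more than this), the collection mode is set through a module-level variable in the hook file:

```python
# hook-torchvision.py (sketch, not the complete hook)
#
# 'pyz+py' keeps the byte-compiled modules in the PYZ archive and additionally
# collects the corresponding source .py files as data files, so that code that
# inspects sources at run time (as torch JIT/torchscript does) can find them.
module_collection_mode = 'pyz+py'
```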

When collecting binaries in the PyInstaller >= 6.0 codepath, explicitly collect versioned .so files by adding '*.so.*' to the list of search patterns passed to `collect_dynamic_libs`. This covers the case where any of those libraries is dynamically loaded at run-time.
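To illustrate why the extra pattern matters, here is a small stdlib-only sketch. It assumes that the default search patterns are `*.dll`, `*.dylib`, and `lib*.so`, and it approximates the glob matching with `fnmatch` rather than calling `collect_dynamic_libs` itself:

```python
from fnmatch import fnmatchcase

# Assumption: these mirror the default search patterns of collect_dynamic_libs.
DEFAULT_PATTERNS = ['*.dll', '*.dylib', 'lib*.so']
# The extra pattern picks up versioned shared libraries.
EXTENDED_PATTERNS = DEFAULT_PATTERNS + ['*.so.*']


def matches_any(filename, patterns):
    """Return True if the file name matches at least one glob pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


# An unversioned library matches the defaults; a versioned one does not.
unversioned_ok = matches_any('libfoo.so', DEFAULT_PATTERNS)
versioned_default = matches_any('libcudnn_ops_infer.so.8', DEFAULT_PATTERNS)
versioned_extended = matches_any('libcudnn_ops_infer.so.8', EXTENDED_PATTERNS)
```

Here `libfoo.so` is a made-up file name; `libcudnn_ops_infer.so.8` is only matched once the `*.so.*` pattern is added.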

The contemporary PyPI `torch` wheels for Linux use CUDA libraries that are installed via `nvidia-*` packages. Therefore, attempt to convert the `nvidia-*` requirements from the `torch` metadata into hidden imports. This way, we can provide hooks for `nvidia.*` packages that collect the shared libs, in case any of them are dynamically loaded (which currently seems to be the case with some of the shared libraries from `nvidia.cudnn`).

On Linux, NVIDIA CUDA 11.x and 12.x shared libraries can be installed via PyPI wheels (e.g., `nvidia-cuda-runtime-cu12`, `nvidia-cudnn-cu12`, `nvidia-cublas-cu12`). These all provide sub-packages under the `nvidia` top-level package (e.g., `nvidia.cuda_runtime`, `nvidia.cudnn`, `nvidia.cublas`). Add hooks for these sub-packages that ensure that all shared libraries from each sub-package's `lib` directory are collected, in case they are dynamically loaded. For example, `torch` PyPI wheels for Linux do not bundle CUDA inside `torch/lib` (whereas the wheels from their own server, built with "non-default" CUDA versions, do bundle them), and dynamically load `libcudnn_ops_infer.so.8` from `nvidia.cudnn.lib`.
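The idea behind these hooks can be sketched with the standard library alone. The helper below is hypothetical (the actual hooks rely on PyInstaller's hook utilities); it lists the shared libraries in a sub-package's `lib` directory and pairs each with a destination directory that preserves the package layout:

```python
from fnmatch import fnmatchcase
from pathlib import Path


def collect_lib_dir(package_dir, package_name):
    """Hypothetical helper: return (source_path, dest_dir) pairs for all
    shared libraries found in the 'lib' directory of a package."""
    lib_dir = Path(package_dir) / 'lib'
    # e.g. 'nvidia.cudnn' -> 'nvidia/cudnn/lib'
    dest_dir = '/'.join(package_name.split('.') + ['lib'])
    entries = []
    if lib_dir.is_dir():
        for path in sorted(lib_dir.iterdir()):
            # Match both unversioned and versioned shared libraries.
            is_lib = fnmatchcase(path.name, '*.so') or fnmatchcase(path.name, '*.so.*')
            if path.is_file() and is_lib:
                entries.append((str(path), dest_dir))
    return entries
```

Keeping the destination under `nvidia/<name>/lib` means that code which looks up the libraries relative to the package directory still finds them in the frozen application.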

Add a test for torchaudio that uses a transform to resample a signal. The test shows the need to collect binaries from the torchaudio package, and, due to use of torchscript, also shows the need to collect source .py files.

Add hook for torchaudio that collects binaries and ensures that source .py files are collected.

Add a test for torchtext that uses a tokenization transform from Berta Encoder. The test shows the need to collect binaries from the torchtext package, and, due to use of torchscript, also shows the need to collect source .py files. We perform only the tokenization part of the processing, in order to avoid having to download the whole model (~240 MB).

Add hook for torchtext that collects binaries and ensures that source .py files are collected.

Move tensorflow tests into separate `test_tensorflow` file. Improve the `tensorflow_onedir_only` test mark (force pyi_builder into generating only onedir case instead of skipping the onefile case).

Add a basic `transformers` pipeline test. Add a basic import test for `transformers.DebertaModel`, which shows that we need to collect source .py files for TorchScript.

Add hook for Hugging Face `transformers` package. Attempt to automatically collect metadata for all of package's dependencies (as declared in `deps` dictionary in the `transformers.dependency_versions_table` module). Collect source .py files as some of the functionality uses TorchScript.

Add a hook for fastai, which ensures that the package's source .py files are collected, as they are required for TorchScript. Add a test based on the "building models from tabular data" example from https://docs.fast.ai/quick_start.html, which demonstrates the need for the source .py files.

`torchvision.io.image` attempts to dynamically load `torchvision.image` extension, so add a hook that ensures the latter is collected. Add a basic image encoding/decoding test.

Add hook for timm, which ensures that the package's source .py files are collected, as they are required for TorchScript. Add a basic model listing and creation test.

Add test for `lightning`, based on their "Getting started" example with an autoencoder trained on the MNIST dataset. Requires `torchvision` for the dataset.

Add hook for (PyTorch) `lightning`. Currently, the main functionality is to ensure that the `version.info` file from the package is collected. We do not collect source .py files, as it seems that even if `lightning.LightningModule.to_torchscript()` is used, it requires the source where the model inheriting from `lightning.LightningModule` is defined, rather than `lightning`'s own sources. We can always add source .py files collection later, if it proves to be necessary.

Add hooks for `bitsandbytes`, and its dependency `triton`. Both packages have dynamically-loaded extension libraries, and both require collection of source .py files for (`triton`'s) JIT module. With `triton`, some of the submodules need to be collected only as source .py files (no PYZ), because the code naively tries to read the files pointed to by `__file__` attribute under assumption that they are source .py files. So we must not collect these modules into the PYZ.
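A per-module collection mode can be expressed by assigning a dictionary to `module_collection_mode`. The sketch below assumes a submodule name purely for illustration; the set of `triton` submodules that actually need the source-only treatment is not spelled out here:

```python
# hook-triton.py (sketch, not the complete hook)
module_collection_mode = {
    # Package-wide default: byte-compiled modules in the PYZ, plus source .py files.
    'triton': 'pyz+py',
    # Hypothetical example of a submodule collected only as a source .py file
    # (not in the PYZ), so that its __file__ points to a readable .py file.
    'triton.runtime.jit': 'py',
}
```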

Add hook and basic test for `linear_operator`. The package uses torchscript/JIT, so we need to collect its source .py files.

Add a basic test for GPyTorch, based on their "Simple GP Regression" example.

Add hook for `fvcore.nn` to collect its source .py files for torchscript/JIT. Add a basic import test that demonstrates the need for that.

Add hook for `detectron2` to collect its source .py files for torchscript/JIT. Add a basic import test that demonstrates the need for that.

Add hook for `datasets` to collect its source .py files for torchscript/JIT. Add a basic dataset loading test that demonstrates the need for that.

Add basic test for Hugging Face `accelerate`; demonstrates that for the tested subset of functionality, no hook is necessary.

Use 120-character lines to reduce amount of line wrapping and make the hook easier to read.

Remove the `tensorflow.python._pywrap_tensorflow_internal` hack (adding it to excluded modules to avoid duplication) for PyInstaller >= 6.0, where the issue is alleviated thanks to the binary dependency analysis preserving the parent directory layout of discovered/collected shared libraries. This should fix the problem with `tensorflow` builds where the `_pywrap_tensorflow_internal` module is not used as a shared library, as seen in `tensorflow` builds for Raspberry Pi: pyinstaller#121

Collect sources for `tensorflow.python.autograph` to avoid run-time warning about AutoGraph being unavailable. Not sure if we need to collect sources for other parts of `tensorflow`, though; if that proves to be the case, we can adjust the collection mode later.

Add `_pyinstaller_hooks_contrib.compat` module to hide the gory details of stdlib `importlib.metadata` vs. `importlib_metadata`. Import the `importlib_metadata` from `PyInstaller.compat` if available (PyInstaller >= 6.0), otherwise duplicate the logic. Copy `importlib_metadata` and `packaging` requirements from PyInstaller to pyinstaller-hooks-contrib.
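A rough sketch of such a compatibility shim is shown below. The exact version cut-off and the extra fallback for a missing backport are assumptions made so that the snippet runs in isolation; the real module may differ:

```python
import sys

try:
    # PyInstaller >= 6.0 exposes the selected implementation in its compat module.
    from PyInstaller.compat import importlib_metadata
except ImportError:
    # Otherwise, duplicate the selection logic (assumed cut-off: Python 3.10).
    if sys.version_info >= (3, 10):
        import importlib.metadata as importlib_metadata
    else:
        try:
            import importlib_metadata  # third-party backport
        except ImportError:
            # Fallback for this sketch only, so it runs without the backport.
            import importlib.metadata as importlib_metadata
```

Hooks can then use `importlib_metadata.version(...)` and related calls without caring which implementation is behind the name.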

Determine the `tensorflow` dist name using a list of potential candidates, and if a dist is found, retrieve the version from it. Otherwise, fall back to reading `tensorflow.__version__`. The `tensorflow` package is available in several variants, and sometimes the `tensorflow` dist installs a separate dist (e.g., `tensorflow-intel` on Windows); but the user can install this separate dist without installing the "top-level" `tensorflow` one. In PyInstaller v5 and earlier, `is_module_satisfies` fell back to querying `tensorflow.__version__` if the dist could not be found; in v6, the implementation of `is_module_satisfies` (or rather, `check_requirement`) checks only the metadata. As an added bonus, the direct version comparisons are nicer to read than the comparisons against `tf_pre_1_15_0`, `tf_post_1_15_0`, etc.
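The candidate lookup can be sketched with the standard library as follows. The function name is hypothetical, and the candidate list shown in the comment contains only the two dist names mentioned above, not the full list used by the hook:

```python
from importlib import metadata


def find_dist_version(candidates):
    """Hypothetical helper: return (dist_name, version) for the first
    installed candidate, or None if none of them is installed."""
    for name in candidates:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


# Illustrative usage; when this returns None, a hook would fall back to
# reading tensorflow.__version__ instead.
# find_dist_version(['tensorflow', 'tensorflow-intel'])
```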

Linux builds of `tensorflow` can install CUDA via nvidia-* packages that are enabled by `and-cuda` extra marker. So parse the dist metadata for requirements, and turn the `nvidia-*` requirements into `nvidia.*` hidden imports. Consolidate the shared code for conversion of `nvidia-*` dist name into `nvidia.*` module in a utility function, and use it in both `torch` and `tensorflow` hooks.
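The conversion of an `nvidia-*` requirement into an `nvidia.*` module name could look roughly like the sketch below. The function name and the exact parsing rules are assumptions; only the example mappings (such as `nvidia-cuda-runtime-cu12` to `nvidia.cuda_runtime`) come from the package names described above:

```python
import re


def nvidia_requirement_to_module(requirement):
    """Hypothetical helper: map a requirement string such as
    'nvidia-cudnn-cu12; platform_system == "Linux"' to 'nvidia.cudnn'.
    Returns None for requirements that are not nvidia-* packages."""
    # The dist name is everything before version specifiers, extras, or markers.
    match = re.match(r'\s*([A-Za-z0-9][A-Za-z0-9._-]*)', requirement)
    if match is None:
        return None
    # Normalize separators and case in the dist name.
    name = re.sub(r'[-_.]+', '-', match.group(1)).lower()
    if not name.startswith('nvidia-'):
        return None
    parts = name.split('-')[1:]
    # Drop the trailing CUDA version suffix (e.g. 'cu11', 'cu12').
    if parts and re.fullmatch(r'cu\d+', parts[-1]):
        parts = parts[:-1]
    if not parts:
        return None
    return 'nvidia.' + '_'.join(parts)
```

The resulting names can be added to `hiddenimports`, which in turn triggers the hooks for the corresponding `nvidia.*` sub-packages.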

Consolidate `pytest` configuration from `pytest.ini` into `setup.cfg`, to match what we have in the main PyInstaller repository. Add -v, -rsxXfE, and --doctest-glob= to test flags. The addition of -v ensures that in manual (local) pytest runs, the test names are displayed as they are run (the CI workflows seem to explicitly specify -v as part of their pytest commands).

When running the `test_lightning_mnist_autoencoder` test under arm64 macOS, `multiprocessing` seems to be activated at some point, and the test gets stuck due to the lack of a `multiprocessing.freeze_support` call.
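For reference, the usual shape of an entry point that guards against this in a frozen application is sketched below; `main` is a placeholder, not the actual test program:

```python
import multiprocessing


def main():
    # Placeholder for the actual program body (e.g. model training).
    return sum(range(10))


if __name__ == '__main__':
    # Must be called before any other multiprocessing use in a frozen program;
    # it has no effect in a regular (non-frozen) Python process.
    multiprocessing.freeze_support()
    main()
```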

Have the `tensorflow` standard hook collect binaries from the `tensorflow-plugins` package; this contains plugins for tensorflow's pluggable device architecture (such as `tensorflow-metal` for macOS and `tensorflow-directml-plugin` for Windows). Have the `tensorflow` run-time hook override `site.getsitepackages()` with a custom implementation that allows us to trick `tensorflow` into loading the plugins.
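A simplified sketch of such an override follows. It assumes that reporting the application's top-level bundle directory (`sys._MEIPASS`) as the only site-packages directory is sufficient; the function name is hypothetical and the actual run-time hook may be more involved:

```python
import site
import sys


def install_getsitepackages_override(bundle_dir):
    """Hypothetical helper: make site.getsitepackages() report the given
    directory, so that plugin discovery looks inside the frozen bundle."""
    def _getsitepackages(prefixes=None):
        # Assumption: the bundle directory is the only location to report.
        return [bundle_dir]

    site.getsitepackages = _getsitepackages


# In a frozen application, sys._MEIPASS points to the bundle directory.
_bundle_dir = getattr(sys, '_MEIPASS', None)
if _bundle_dir is not None:
    install_getsitepackages_override(_bundle_dir)
```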

Commits on Dec 21, 2023

Commits on Dec 22, 2023