Using --no-binary disables reuse of locally compiled wheels #9162

Closed
Kirill888 opened this issue Nov 24, 2020 · 12 comments · Fixed by #11872
Labels
C: build logic Stuff related to metadata generation / wheel generation C: cache Dealing with cache and files in it UX: functionality research epic Temporary label to link tickets to #8516

Comments

@Kirill888

Environment

  • pip version: 20.2.4
  • Python version: 3.8.5
  • OS: Ubuntu 20.04

Description

Using the --no-binary=<some-lib> flag disables reuse of locally compiled and cached wheels for that library.

Expected behavior

When using the --no-binary flag, I expect locally compiled wheels to be cached and reused between invocations of the pip wheel and pip install commands, but instead they get recompiled on every invocation.

How to Reproduce

Starting with an empty cache, let's fetch the Cython package and compile it from source:

pip wheel --verbose --no-deps --no-binary=Cython Cython

The above works as expected: it downloads and caches the source distribution for Cython, compiles it, stores the compiled wheel in the cache, and copies it into the current working directory.

Stored in directory: /home/ubuntu/.cache/pip/wheels/82/f9/1a/ff4ade708988218648847e0438b632ce876aee8fa3c9b5fc6e

If we run this command again, I expect the cached version of the locally compiled wheel to be copied from cache, but instead Cython is re-compiled.

pip wheel --verbose --no-deps --no-binary=Cython Cython

So the locally compiled wheel is cached, but can never be extracted from the cache other than manually.

The impact of this is particularly noticeable for things like Cython and numpy which are commonly needed as a dependency for building other packages, but can also benefit from being compiled locally rather than using manylinux wheels. If you do something like pip wheel --no-binary :all: pandas scipy numpy you'll be recompiling numpy and Cython over and over again.

I have used constraints to work around this problem in the past, using paths to locally compiled wheels as constraints, but that is being deprecated as constraints are meant for versions only.

@webknjaz
Member

I'm pretty sure that this is expected, since the cache stores binary wheels and you're telling pip not to use those.

@pfmoore
Member

pfmoore commented Nov 25, 2020

Correct. Using locally cached and built wheels in spite of --no-binary would be a backward compatibility break, and would probably be a surprise to many people, so I don't think it's something we would want to do. Without a very strong use case that justifies (1) using --no-binary in the first place, and (2) wanting cached wheels to be used even when --no-binary is specified, this is unlikely to happen.

For context, remember that building from source can (in theory) result in a different wheel every time you build. That's clearly not the intended way builds should work, but --no-binary is essentially designed to handle that use case, so having it use the cache would (in effect) break its core reason for existing.

@webknjaz
Member

@Kirill888 if for some reason you want to build certain packages from sdist, you could create a wheelhouse dir with pip wheel or something like that with the first run and then include that into your subsequent install commands.

@Kirill888
Author

Thanks, everyone. I guess my interpretation of what --no-binary means is incorrect and biased by my needs. Still, the feature I'm after, and what I assumed --no-binary was designed for, is as follows:

For a certain library X, do not use the manylinux wheels published on PyPI, but instead compile on the local machine, without recompiling every time (as this can be really slow, tens of minutes). This is particularly true for numpy, as it is a common build-time dependency for all the other numeric libraries out there. My issue is that I cannot easily force installation of a locally compiled numpy as a build dependency for other libraries. So my options are:

  1. Compile against the manylinux numpy, then replace numpy with the local version and hope the binary interface is compatible
  2. Turn off build isolation and install all the base libraries needed, including the locally compiled numpy (that step might not be possible, as some libs might have conflicting Cython version dependencies)
  3. Use the deprecated --constraints=local_numpy.txt with a file that points to the locally compiled wheel

I maintain a largish Python environment used for scientific workflows. While I love the fact that most hard-to-compile modules now ship with easy-to-install binary wheels, they are not great for "production". They are awesome for experimentation, since you can try out a new library without investing time in understanding its build requirements, but not so great for production, where you want the best performance. For numeric code, using a more modern compiler alone can easily give you a ~20% performance gain (the manylinux GCC is rather old), but the main gains come from linking to performance libraries, of course. Another reason to compile locally is to avoid binary dependency version conflicts: for example, pyproj and rasterio might ship incompatible versions of the libgeos library and cause segfaults, which is something I have seen in the past.

Other than maintaining a PyPI proxy that filters out manylinux wheels for libraries I'd rather compile locally, what other options are there to force pip to prefer locally compiled wheels without having to rebuild them every time (including installation into an isolated environment as a build dependency)?

@Kirill888
Author

For context, remember that building from source can (in theory) result in a different wheel every time you build. That's clearly not the intended way builds should work, but --no-binary is essentially designed to handle that use case, so having it use the cache would (in effect) break its core reason for existing.

Agreed. I realize that almost no compiled software these days supports reproducible builds, and I know how insanely hard that is to achieve. What's worse, you can get very different functionality, not just a different hash, for a compiled wheel depending on what libraries were installed at the time of compilation. But that's a reason to be aware of the cache, not a reason to disable it altogether. What is the behaviour for a module that requires compilation but only ships source code? Take ciso8601, say: is it re-compiled on every install because things might have changed, or does the previously compiled wheel get reused?

What I'm after is essentially "for library X pretend that there are no wheels for your platform, so fetch sources, compile them, cache the wheels, reuse the wheels".

@pfmoore
Member

pfmoore commented Nov 26, 2020

I thought manylinux had a mechanism for declaring what manylinux versions you supported? Sorry, I'm not a Linux specialist, so I don't have details. A quick search says that you want a _manylinux module. From PEP 513:

To handle potential future incompatibilities, we standardize a mechanism for a Python distributor to signal that a particular Python install definitely is or is not compatible with manylinux1: this is done by installing a module named _manylinux, and setting its manylinux1_compatible attribute
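
As a rough illustration of the mechanism quoted above, a distributor could drop a one-line _manylinux.py somewhere on the interpreter's import path. The helper name write_manylinux_optout and its directory argument are assumptions made for this sketch; only the module name _manylinux and the manylinux1_compatible attribute come from PEP 513. Note that this opts the whole Python install out of manylinux1 wheels for every package, not just for selected ones.

```shell
# Sketch only: write a _manylinux.py that declares this Python install
# NOT compatible with manylinux1 wheels (mechanism described in PEP 513).
# write_manylinux_optout is a hypothetical helper name.
write_manylinux_optout() {
    target_dir=$1   # a directory on sys.path, e.g. the venv's site-packages
    printf '%s\n' 'manylinux1_compatible = False' > "$target_dir/_manylinux.py"
}
```

For example, `write_manylinux_optout "$(python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"` would target the active environment's site-packages directory.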

@uranusjr
Member

uranusjr commented Nov 26, 2020

But that's not a reason to disable caching, it's a reason to be aware of the cache, but not to disable it altogether.

This is why they say cache invalidation is one of the two most difficult things in programming 😉

Wheel tags are limited and cannot capture all the possible compile-time variants, and pip has no way to know a previously-built wheel in cache is actually compatible with what you’re trying to do. So when you say --no-binary, it has to disable all binaries, otherwise it may end up hitting the cache when it shouldn’t, which would be very bad, likely worse.

The pip wheel solution above is the best approach IMO; you build once, and point pip to that location with something like --find-links or simpleindex plus --index-url. This promises pip that all binaries here are compatible, and instructs it to look here instead of PyPI.
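
A minimal sketch of that wheelhouse approach, assuming a local directory is acceptable as the wheel store. The WHEELHOUSE variable and the two function names are made up for illustration; the pip options used (--no-deps, --no-binary, --wheel-dir, --no-index, --find-links) are existing pip flags. The build helper assumes bare package names without version specifiers, since the same names are passed to --no-binary.

```shell
# Hypothetical wheelhouse location; any writable directory works.
WHEELHOUSE="${WHEELHOUSE:-$HOME/wheelhouse}"

# Build the named packages from source once and keep the resulting wheels.
# The arguments are joined with commas to form the --no-binary value.
build_wheelhouse() {
    pip wheel --no-deps \
        --no-binary="$(IFS=,; printf '%s' "$*")" \
        --wheel-dir "$WHEELHOUSE" "$@"
}

# Install using only the wheels found in the wheelhouse, never the index.
install_from_wheelhouse() {
    pip install --no-index --find-links "$WHEELHOUSE" "$@"
}
```

Running `build_wheelhouse numpy Cython` once and then `install_from_wheelhouse numpy Cython` in later runs reuses the locally compiled wheels instead of rebuilding them.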

@Kirill888
Author

Thanks for the link to PEP 513; I was not aware of it, and it might prove useful. I was not looking for that because I do not want to disable ALL manylinux wheels, just the binary wheels for those libraries that ship mutually incompatible binaries or are problematic for other reasons: for example, libs that link to libcurl and then require special handling to support anything https-related when running on Ubuntu, or libs that require local compilation for optimal performance.

@uranusjr

Wheel tags are limited and cannot capture all the possible compile-time variants, and pip has no way to know a previously-built wheel in cache is actually compatible with what you’re trying to do.

Yet, for modules that do not ship wheels, but only sources, locally compiled wheels are cached and do get reused.

The pip wheel solution above is the best approach IMO; you build once, and point pip to that location with something like --find-links or simpleindex plus --index-url. This promises pip that all binaries here are compatible, and instructs it to look here instead of PyPI.

In the Python environment I have, there are ~70 Python modules that depend on numpy. Not all of them have compiled components, but let's say 10-20 do. So if I just did pip wheel -r requirements.txt and I had --no-binary=numpy in there, I would get numpy compiled 10-20 separate times. This would make things too slow to use free CI solutions for the task.

We can probably close this issue, as I clearly misinterpreted what --no-binary is supposed to be, which I now understand is "compile me a fresh copy of this module from source no matter what you have in the cache". And that's fine. I just wish there was also a way to prefer locally compiled wheels without going all in with a wheelhouse-type solution or hosting your own PyPI, but that's probably a topic for a different issue.

@uranusjr
Member

So if I just did pip wheel -r requirements.txt and I had --no-binary=numpy in there, I would get numpy compiled 10-20 separate times.

I don’t think this is the case, and would consider it a bug if it is.

@Kirill888
Author

requirements.txt:

numpy
scipy
pandas
--no-binary=numpy,scipy,pandas

pip wheel --no-deps -r requirements.txt

builds numpy 3 times, possibly even different versions of numpy. It builds it twice to set up the build environments for scipy and pandas, and then once more to compile numpy itself into a wheel.

@pfmoore
Member

pfmoore commented Nov 26, 2020

Is numpy a build requirement for scipy and pandas? That's very different than if it's a runtime dependency. If you want to avoid setting up multiple build environments, you should be using --no-build-isolation (and managing the build environment manually).
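
A rough sketch of what managing the build environment manually might look like, under the assumption that numpy is the only shared build requirement being compiled locally. The function name build_against_local_numpy and its arguments are hypothetical. With --no-build-isolation, every other build requirement of the target packages (setuptools, wheel, Cython and so on) must already be installed in the current environment, which this sketch does not handle.

```shell
# Hypothetical helper: compile numpy once, install it, then build the
# remaining packages from source against that installed numpy.
# Usage: build_against_local_numpy <wheel_dir> <package>...
build_against_local_numpy() {
    wheel_dir=$1
    shift
    # 1. Compile numpy from source once and keep the wheel.
    pip wheel --no-deps --no-binary=numpy --wheel-dir "$wheel_dir" numpy || return 1
    # 2. Install that locally built wheel into the current environment.
    pip install --no-index --find-links "$wheel_dir" numpy || return 1
    # 3. Build the other packages from source without isolated build
    #    environments, so they compile against the numpy installed in step 2.
    pip wheel --no-deps --no-build-isolation \
        --no-binary="$(IFS=,; printf '%s' "$*")" \
        --wheel-dir "$wheel_dir" "$@"
}
```

For example, `build_against_local_numpy ./wheelhouse scipy pandas` would compile numpy a single time rather than once per isolated build environment.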

@brainwane brainwane added the UX: functionality research epic Temporary label to link tickets to #8516 label Nov 29, 2020
@sbidoul
Member

sbidoul commented Jan 2, 2021

It is unfortunate that --no-binary currently implies so many things, i.e.:

  1. do not download wheels
  2. install with the legacy setup.py install instead of building a wheel first (which we'll change at some point, as discussed in #8102, 'Raise a warning when pip falls back to the "legacy" direct install code')
  3. disable usage of the wheel cache

@Kirill888's arguments resonate with me, and I believe disentangling cache control from --no-binary would benefit many users. I even suspect that it is the common case. Users of --no-binary who would be impacted by changing (3), because subsequent builds produce different results, probably already suffer from it today: their cache gets "polluted", since the built wheel is still stored in the cache. This issue may become more prevalent when we allow --no-binary to do build+install instead of setup.py install.

@sbidoul sbidoul added C: build logic Stuff related to metadata generation / wheel generation C: cache Dealing with cache and files in it labels Jan 3, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 1, 2023