Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache wheels that pip wheel built locally #6853

Closed
wants to merge 9 commits into from

Conversation

sbidoul
Copy link
Member

@sbidoul sbidoul commented Aug 10, 2019

Instead of building to the target wheel directory, build to cache and then copy the result to the target directory. This more closely matches what pip install does and makes the cache more effective.

Closes #6852

@pradyunsg
Copy link
Member

Too sleepy to think through the implications of this right now, but this definitely needs tests.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 10, 2019

think through the implications of this right now

What would probably help is some sort of definition of what autobuilding means exactly.

this definitely needs tests

Of course, I'll add one (unless some reason for not changing this pops up).

@cjerdonek
Copy link
Member

What would probably help is some sort of definition of what autobuilding means exactly.

Here is the commit that introduced "autobuilding": 08acb66

Also, note in a comment to the commit that the docstring's unpack argument is apparently meant to document autobuilding.

@sbidoul sbidoul force-pushed the pip6852-sbi branch 2 times, most recently from 433d92c to 6198823 Compare August 12, 2019 21:13
@sbidoul
Copy link
Member Author

sbidoul commented Aug 12, 2019

Thanks for the pointer. So as I understand it, autobuilding=True means the wheel being built is temporary and meant to be unpacked to a target directory, while when it's False it means the wheel being built is the final artifact.

It should then be safe to cache in both cases

I added a test. I used log parsing as test assertion, as it's the closest I could find as being useful to test this.

@cjerdonek
Copy link
Member

Thanks for filing the PR. I don't think the implementation is as simple as what you provided though because should_use_ephemeral_cache() was originally implemented assuming that caching would only take place in the autobuilding=True case (i.e. in the install case, but not the wheel case). For example, look at how should_use_ephemeral_cache() returns early if autobuilding is false:

def should_use_ephemeral_cache(
req, # type: InstallRequirement
format_control, # type: FormatControl
autobuilding, # type: bool
cache_available # type: bool
):
# type: (...) -> Optional[bool]
"""
Return whether to build an InstallRequirement object using the
ephemeral cache.
:param cache_available: whether a cache directory is available for the
autobuilding=True case.
:return: True or False to build the requirement with ephem_cache=True
or False, respectively; or None not to build the requirement.
"""
if req.constraint:
return None
if req.is_wheel:
if not autobuilding:
logger.info(
'Skipping %s, due to already being wheel.', req.name,
)
return None
if not autobuilding:
return False
if req.editable or not req.source_dir:
return None
if "binary" not in format_control.get_allowed_formats(
canonicalize_name(req.name)):
logger.info(
"Skipping bdist_wheel for %s, due to binaries "
"being disabled for it.", req.name,
)
return None
if req.link and not req.link.is_artifact:
# VCS checkout. Build wheel just for this run.
return True
link = req.link
base, ext = link.splitext()
if cache_available and _contains_egg_info(base):
return False
# Otherwise, build the wheel just for this run using the ephemeral
# cache since we are either in the case of e.g. a local directory, or
# no cache directory is available to use.
return True

So we need to be very careful. Also, in the original commit introducing the autobuilding flag, it's a bit confusing because it was simultaneously used to mean two or three things. Now you want to change it's meaning in one axis, but not another. I also think the name of the argument should be changed to something like should_unpack (in line with what was apparently going to be used, judging from the docstring's content), to reduce confusion about what the flag will mean going forward.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 13, 2019

Yes, we need to be careful indeed.

Digging a bit more into this, it is correct that WheelBuilder.build() is only invoked by the pip install and pip wheel commands? autobuilding seems to be always True when installing and always False when invoked from pip wheel?

@cjerdonek
Copy link
Member

cjerdonek commented Aug 13, 2019

Digging a bit more into this, it is correct that WheelBuilder.build() is only invoked by the pip install and pip wheel commands. autobuilding seems to be is always True used when installing and always False when invoked from pip wheel?

Yes, I think so. At least that's how it was when I checked the original commit that introduced autobuilding. autobuilding=True was basically equivalent to "called from install as opposed to wheel." It's possible things have changed since then, but it should be easy to confirm by searching the code.

I suspect that the change to should_use_ephemeral_cache() could be as simple as removing the autobuilding argument. For that function, it seems to have the meaning of "should cache" (one of autobuilding's axes of meaning, to use the phrasing in my earlier comment). You want to change that to always being true in your PR. Actually, it might be a little more complicated than that. Perhaps should_use_ephemeral_cache()'s autobuilding argument needs to be changed to an argument that means "should build" (unless it's already built).

@sbidoul
Copy link
Member Author

sbidoul commented Aug 13, 2019

it should be easy to confirm by searching the code.

My investigation confirm this. So my questioning is moving from understanding what autobuildbuilding means to understanding when the regular cache must be used as opposed to the ephemeral cache.

Also I note cache_available = bool(self._wheel_dir or self.wheel_cache.cache_dir) is confusing because it does not necessarily mean a cache is available. It rather seems to mean a target directory is available for the built wheel.

And assert have_directory_for_build would become unnecessary, since if we always build to either the ephem cache or the regular cache, we always have a directory for build.

The more I read this the more I think it would be good to untangle the caching logic from the destination of the built wheel.

@cjerdonek
Copy link
Member

So my questioning is moving from understanding what autobuilding means to understanding when the regular cache must be used as opposed to the ephemeral cache.

Yes, that is a key question. Won't you more or less want to respect what should_use_ephemeral_cache() already does in the install case (i.e the autobuilding=True case), at least in the cases where the scenarios overlap / are applicable? That way install / wheel will behave the same as far as making that choice.

You may want to look at the original commit for further background / insight. For example, it has this text: 08acb66#diff-2695f32c4432acd141c3dbe7e7e3a6b0R54-R59

As well as this: 08acb66#diff-2695f32c4432acd141c3dbe7e7e3a6b0R679-R680

The more I read this the more I think it would be good to untangle the caching logic from the destination of the built wheel.

Certainly, perhaps. Though I do think it would be best IMO to keep individual PR's limited in scope so that each PR stays manageable. (It's easy for things to grow out of hand as one starts to improve surrounding code.)

Incidentally, in the wheel use case, will it help your use case to use the ephemeral cache, or is only the regular cache useful?

@cjerdonek
Copy link
Member

cjerdonek commented Aug 13, 2019

Also I note cache_available = bool(self._wheel_dir or self.wheel_cache.cache_dir) is confusing

In the case where cache_available is actually used (i.e. when autobuilding=True), isn't it the case that self._wheel_dir will always be None? This is because only the wheel command (implies autobuilding=False) has the wheel_dir option. Making that simplification could be its own PR.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 13, 2019

Incidentally, in the wheel use case, will it help your use case to use the ephemeral cache, or is only the regular cache useful?

Not really, as I want (rather I'd find it more coherent) for pip wheel to cache the same way as pip install does.

Making that simplification could be its own PR.

Maybe I can start with that to clarify the reasoning a bit.

@cjerdonek
Copy link
Member

Not really, as I want (rather I'd find it more coherent) for pip wheel to cache the same way as pip install does.

Yes, I was agreeing with that. What I meant was, in cases where pip install uses the ephemeral cache, is it ever useful in the pip wheel case (e.g. in any of your use cases) when pip wheel uses the ephemeral cache? Or are only the cases where pip wheel uses the regular cache of practical interest to you?

@sbidoul
Copy link
Member Author

sbidoul commented Aug 13, 2019

are only the cases where pip wheel uses the regular cache of practical interest to you?

So far, yes.

I found a problem with this PR as it stands now. Since should_use_ephemeral_cache returns False if not autobuilding, and since we concluded not autobuilding means pip wheel, it would currently cache something like pip wheel --editable while pip install --editable would not cache.

It works correctly in master because the caller function (build) knows that it will never use the cache in the case of pip wheel (i.e. autobuilding=False), but this is misleading IMO (at least, I fell in into the trap :).

So if we agree that pip wheel and pip install should behave the same way wrt caching, I'm wondering if we should not change should_use_ephemeral_cache to ignore autobuilding.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 13, 2019

Also, I would like to add more tests for caching (for this and #6640). The only example I found to find the cache directory is:

def test_install_builds_wheels(script, data, with_wheel):
# We need to use a subprocess to get the right value on Windows.
res = script.run('python', '-c', (
'from pip._internal.utils import appdirs; '
'print(appdirs.user_cache_dir("pip"))'
))
wheels_cache = os.path.join(res.stdout.rstrip('\n'), 'wheels')

Is there a better approach?

@sbidoul
Copy link
Member Author

sbidoul commented Aug 13, 2019

we should not change should_use_ephemeral_cache to ignore autobuilding

Hm, maybe not. I'm mislead again by the method name. It not only decides which cache to use, but also to skip building, and that part is different for pip wheel and pip install.

@cjerdonek
Copy link
Member

I found a problem with this PR as it stands now. Since should_use_ephemeral_cache returns False if not autobuilding, and since we concluded not autobuilding means pip wheel, it would currently cache something like pip wheel --editable while pip install --editable would not cache.

Yes, this is what I was getting at when I said this above:

... For example, look at how should_use_ephemeral_cache() returns early if autobuilding is false:

Regarding this--

I'm mislead again by the method name. It not only decides which cache to use, but also to skip building, and that part is different for pip wheel and pip install.

Yes, I was confused for a moment by this, too, which is why, after reflecting again, I said this:

Perhaps should_use_ephemeral_cache()'s autobuilding argument needs to be changed to an argument that means "should build" (unless it's already built).

The reason is that, for the wheel case, you want to be able to tell should_use_ephemeral_cache() that you are wanting a build to happen (i.e. not to return None). But the function will still need to decide whether to return False or True (whether to use the regular cache or not). In any case, let's wait for the clarification PR to be finished first. And then the code will be in a slightly simpler state to talk about.

The only example I found to find the cache directory is:

There are also some unit tests of should_use_ephemeral_cache(), which is the most important part IMO, because that decides whether to use the regular cache or not. So one component of this will be to add more unit tests of that function.

@cjerdonek
Copy link
Member

Here's a thought in addition to my comment here re: adding a build_optional / force_build argument to build(). What if instead of build() having a should_unpack argument, it has an after_build argument that's a callable with a signature something like after_build(builder, req, wheel_file). Then build() won't have to contain any logic or knowledge related to "unpacking."

@sbidoul
Copy link
Member Author

sbidoul commented Aug 14, 2019

I pushed the fix and further clarification of should_use_ephem_cache. There is one thing that I do not understand yet (namely what no source_dir means), but otherwise it becomes much clearer (to me at least).

I still need to expand the test, but if you want to have look at it already, your early feedback is welcome.

@cjerdonek I need to think more about the after_build suggestion. Do you think this must go in this PR or could it be left for another refactoring PR?

@cjerdonek
Copy link
Member

I have the feeling that the if logic can be made a lot simpler. It still seems unnecessarily complicated to me.

Since all of the None return values happen towards the beginning of should_use_ephem_cache(), what about the idea of dividing should_use_ephem_cache() into two methods: should_build() and should_use_ephem_cache()? The former only needs to be called when must_build isn't True, and then should_use_ephem_cache() only needs to be called if we know we need to build.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 14, 2019

dividing should_use_ephem_cache() into two methods: should_build() and should_use_ephem_cache()?

Yes, I was thinking about that too.

return None

if "binary" not in format_control.get_allowed_formats(
canonicalize_name(req.name)):
if must_build:
# --no-binary has no effect as a pip wheel option
return True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of returning True here, why wouldn't you instead want to skip this clause and continue on (so make the if check if not must_build and "binary" not in format_control ...)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it nice to have the symmetry with other cases, and have a place to put that little comment about pip wheel --no-binary which I thought useful.

Copy link
Member

@cjerdonek cjerdonek Aug 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm saying is that it seems like this can cause the wrong return value. If there is a condition later on that can result in a return value of False, the way the code is here short-circuits that (and in particular means that --no-binary is affecting the return value).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, if you do the should_build() approach, this issue will go away because the code will be structured differently. (It will wind up being equivalent to what I was suggesting you do here.)

if req.is_wheel:
if not should_unpack:
if must_build:
logger.info(
'Skipping %s, due to already being wheel.', req.name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to decide if this log message really needs to be specific to one of the two cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, any opinion about that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or remove the message completely? In my use cases (i.e. pip wheels -r requirements.txt), it's not really useful to know that some requirements were wheels in the first place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be the topic of another PR?

@sbidoul
Copy link
Member Author

sbidoul commented Aug 14, 2019

dividing should_use_ephem_cache() into two methods: should_build() and should_use_ephem_cache()?

I think we would then need 3 methods?

  • should_build => False if constraint or already a wheel: needs to be tested for pip wheel and pip install
  • should_build_before_unpack => False if editable, etc: needs to be tested only for pip install
  • should_use_ephem_cache =>same test for pip wheel and pip install

At first glance it's not necessarily simpler.

@cjerdonek
Copy link
Member

should_build_before_unpack => False if editable, etc: needs to be tested only for pip install

This isn't the only way. You can put the editable check in should_use_ephem_cache() for the wheel case and it just won't be seen in the install case because it will be bypassed earlier in the process. In any case, it will be good to split the function into simpler functions. Whether that's two or three can be determined. But having simpler pieces to look at will make it easier to make that decision.

@cjerdonek
Copy link
Member

Another option is for should_build() to accept the must_build argument. There are many options. I would just come up with something that respects this division, and then it can be iterated on. Or let me know if you need additional thoughts.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 16, 2019

@pradyunsg Thanks for your interest in this PR. Sorry if I was not clear. It was hard to understand and it is still a bit hard to explain :) Let me try to summarize.

We have two cases in this area: pip install and pip wheel. Before, pip wheel did almost always build (except for constraints and requirements that were already wheels) and never cache, while pip install does build in some cases, and when it decides to build, it uses the ephemeral cache or the persistent cache.

With this PR, the behavior of pip install does not change and pip wheel still builds in the same conditions. The change is that when pip wheel builds, it now uses the persistent cache the same way as pip install does. And in conditions where pip wheel builds while pip install would not have built, it uses the ephemeral cache.

Good luck with your talk.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 19, 2019

I have rebased and amended #6851 so we can see that it fits nicely in f8096a5.

Let me know if there are remaining concerns that need addressing.

@cjerdonek
Copy link
Member

@cjerdonek regarding your suggestion about after_build, I don't feel comfortable adding this in this PR as it sounds like a very big change

I wouldn't say it's big (just moving a couple chunks of code), but no, it doesn't have to be done as part of this PR. I will review again when I'm able to set aside some time. Hopefully soon.

@cjerdonek
Copy link
Member

Okay, thanks for your patience, @sbidoul.

Coming back to this, I'll start with some comments on should_build(). should_build() looks a lot more complicated than it needs to be because of the number of times it's checking should_unpack. Here is I think a simpler way to write this function (taking out the docstring and comments for brevity). The key is that you can use a guard clause for early return:

def should_build(
    req,             # type: InstallRequirement
    format_control,  # type: FormatControl
    need_wheel,      # type: bool
):
    if req.constraint:
        return False

    if req.is_wheel:
        if need_wheel:
            logger.info(
                'Skipping %s, due to already being wheel.', req.name,
            )
        return False

    if need_wheel:
        return True

    if req.editable or not req.source_dir:
        return False

    if "binary" not in format_control.get_allowed_formats(
            canonicalize_name(req.name)):
        logger.info(
            "Skipping bdist_wheel for %s, due to binaries "
            "being disabled for it.", req.name,
        )
        return False

    return True

The other thing to notice is that I've changed the should_unpack argument to need_wheel because should_unpack doesn't really have any meaning in the context of a should_build() function. For should_build(), the important piece of info is whether the caller "wants" a wheel to be built. With this argument, the caller will pass need_wheel=(not should_unpack) since should_unpack in the calling code is False when running pip wheel.

Once this function is updated, I can take another look..

Instead of building to the target wheel directory,
build to cache instead and then copy the result to the target directory.
This more closely matches what pip install does and makes the
cache more effective.
@sbidoul
Copy link
Member Author

sbidoul commented Aug 24, 2019

@cjerdonek the early return in should_build makes a lot of sense, done.

I personally came to appreciate not should_unpack in should_build, reading it as "don't build needlessly if it should ultimately be kept unpacked". need_wheel reads well too though. I made the change from not should_unpack to need_wheel in a separate commit.

Copy link
Member

@cjerdonek cjerdonek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional comments (not a complete review though).

src/pip/_internal/wheel.py Outdated Show resolved Hide resolved
src/pip/_internal/wheel.py Outdated Show resolved Hide resolved
if "binary" not in format_control.get_allowed_formats(
canonicalize_name(req.name)):
# TODO explain this
return True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mentioned the duplication of these three conditions in your comments, which I think we should aim to avoid. I think an elegant way to address this can be to include this code at the beginning:

if need_wheel and not should_build(
    req,
    need_wheel=False,
    format_control=format_control,
):
    return True

This is basically a translation into code this English sentence that you wrote:

And in conditions where pip wheel builds while pip install would not have built, it uses the ephemeral cache.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I'm not convinced. I see benefit to have should_build and should_cache independent: it's easier to reason about them. For instance if you look at the comment for if req.editable, there is a different reason to not build and to not cache editables. If we reintroduce a should_build test in should_cache, I think we pay a big price in readability to remove some maybe redundant tests that may be only accidentally the same.

Of course the fact that I don't fully understand the "no source_dir" and "no binary" parts does not help :) Yet I suspect the reasoning for caching and building in those cases could be different too.

Let me know what you think after reading the latest version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjerdonek I gave one more thought to this and I now think this test need wheel and not should_build it's actually not equivalent. In case need_wheel is False it would skip the test and may end up, e.g. caching an editable in install mode. Tss what a rabbit hole I put myself in :)

So for now I'm quite happy with the current version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave one more thought to this and I now think this test need wheel and not should_build it's actually not equivalent. In case need_wheel is False it would skip the test and may end up, e.g. caching an editable in install mode.

I don't see any issue.. What I suggested was logically the same by design. In install mode with an editable, should_build() would have returned False, so should_use_ephemeral_cache() would never have been called. To reiterate my comment, I don't think we should copy-paste all those lines verbatim. If it turns out that the logic needs to be different at some point in the future, we can always reconsider (but until then, YAGNI, as they say). If you'd prefer not to call should_build() with different arguments inside should_cache(), another possibility would be to refactor those copy-pasted lines into a helper function that is called from both functions. You can also document that should_cache() should only be called when should_build() is known to be true.

src/pip/_internal/wheel.py Outdated Show resolved Hide resolved
src/pip/_internal/wheel.py Outdated Show resolved Hide resolved
@sbidoul
Copy link
Member Author

sbidoul commented Sep 5, 2019

I see there is quite of work ongoing in this area of the code (by @chrahunt in particular).
Should I rebase this to fix the conflict? Are there any other concerns that I need to address?

@BrownTruck
Copy link
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label Sep 5, 2019
@chrahunt
Copy link
Member

chrahunt commented Sep 7, 2019

Sorry, I thought I commented on this.

I would rebase, I think our changes are very complementary (and I definitely like the split up of should_use_ephemeral_cache 👍).

Someone will get to this, but it can vary how long that takes. If you want more of a guarantee of forward progress and have a little time I would separate out the refactoring changes into a few smaller PRs (literally no PR is too small), so those changes would be easier to evaluate in a single sitting.

@sbidoul
Copy link
Member Author

sbidoul commented Sep 8, 2019

@chrahunt ok, thanks. I'll get back to this in a couple of weeks.

@sbidoul
Copy link
Member Author

sbidoul commented Oct 27, 2019

Ok, it turns out the couple of weeks became a couple of months, but I'm back to this in #7262. I therefore close this PR to continue in smaller increments.

@sbidoul sbidoul closed this Oct 27, 2019
@sbidoul sbidoul deleted the pip6852-sbi branch November 16, 2019 16:36
@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Dec 16, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Dec 16, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation needs rebase or merge PR has conflicts with current master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pip wheel does not cache wheels it built locally
5 participants