
Add build_sdist and tweak the existing interface #275


Closed
wants to merge 5 commits

Conversation

@dstufft (Member) commented May 31, 2017

  • Add the source_directory parameter to the build backend methods. This provides an explicit way to pass the source tree that is currently being operated on into each method.
  • Return the name of the metadata directory from get_wheel_metadata as a single item list, allowing future expansion to multiple wheels if needed.
    • Note: This could/should arguably be a single string and if we end up supporting multiple wheels, allow a list return value as well, or create a new method that returns a list.
  • For performance reasons, create an unpacked wheel instead of an already zipped .whl file.
  • Add a mandatory build_sdist method which will build an unpacked sdist, similarly to how build_wheel creates an unpacked wheel file.
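Taken together, the interface described above would look roughly like this (an illustrative sketch only; the signatures paraphrase the bullets above and the diff below, and are not normative text):

def get_wheel_metadata(source_directory, metadata_directory, config_settings):
    # Writes a {name}-{version}.dist-info directory into
    # metadata_directory and returns its name in a single-item list,
    # leaving room for multiple wheels later.
    ...

def build_wheel(source_directory, wheel_directory, config_settings,
                metadata_directory=None):
    # Places the *contents* of the wheel, unpacked, into wheel_directory.
    ...

def build_sdist(source_directory, sdist_directory, config_settings):
    # Places an unpacked sdist built from source_directory into
    # sdist_directory.
    ...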

dstufft added 5 commits May 31, 2017 07:17
The PEP was previously silent about how a build backend should determine
what directory contains the source tree it was being asked to operate
on. Instead of somewhat implicitly assuming it should be os.curdir, let
the build tool pass it in as part of the API contract.
Dumping a directory named {name}-{version}.dist-info into a directory
can run into problems where the calling tool can only determine what
the directory is supposed to be by globbing that directory. This can
cause bugs when a second entry gets erroneously added in the future (or
if the ability to build multiple wheels is added). Returning the
expected name sidesteps that issue, while using a list allows future
expansion to building multiple wheels.
Instead of creating a zipped up .whl file which a project like pip is
only going to immediately unzip again, create an unpacked wheel
directory that a tool like pip can just immediately install from, while
a tool like twine could zip up to create the wheel artifact.
Creating a source distribution is an important part of what is done with
``setup.py``, and any replacement for it must also handle that.
...

MUST build an unpacked sdist from the ``source_directory`` and place it into the
specified ``sdist_directory``. If the ``source_directory`` is already an
Contributor:

An sdist normally contains a top level folder called name-version, e.g. requests-2.17.3. Is the backend expected to create this inside sdist_directory, or assume that sdist_directory is that folder? I.e. which of these is true:

from os.path import join, exists
exists(join(sdist_directory, 'requests-2.17.3', 'tests'))
# OR
exists(join(sdist_directory, 'tests'))

Member Author (dstufft):

I think the first should be true, largely because at this stage the metadata doesn't exist for the calling tool to know that it is requests-2.17.3 and not something else (I briefly considered trying to add metadata like what setuptools sometimes does, but I avoided it because it adds another layer of complexity to this PEP, and I think we can avoid it).

I can adjust this PR to this effect.

Member:

We have a build_wheel hook and a get_wheel_metadata hook. But here, we just have build_sdist - is the backend expected to create the sdist metadata in sdist_directory? Why the discrepancy? And doesn't this require backends to track future "sdist >=2.0" specs to maintain correct metadata? I think we should decouple creating sdist metadata from backends. And once we get into this discussion, we're opening up the whole "sdist metadata" debate, which is nowhere near resolution. Hence my preference for a "just give me the files required to build" hook.

Member Author (dstufft):

The discrepancy is because as of today, there is no real standard for metadata inside of a sdist. The only real standard is "a tarball with a setup.py script that looks close to what setuptools supports". This PEP extends that to also include "a tarball with a pyproject.toml file with a particular key". We currently call python setup.py egg_info to get the metadata that "belongs" to the sdist (it doesn't really, it belongs to the ephemeral binary output that the sdist creates), and with this PEP we can just call the get_wheel_metadata hook to achieve the same thing.

If we create a sdist 2.0 then we can focus on what metadata we want to include in an sdist to enable some other things that would be nice to have, but like you said that opens up a huge debate. This PEP is "let me replace python setup.py bdist_wheel with something else", and my statement is that we also need to replace python setup.py sdist, not that we need to drastically alter the status quo.

And yes, they'll have to track any work for a sdist 2.0 (well, they don't have to, but if they don't they'll never gain any improvements we get from that), just like they'll have to track a hypothetical wheel 2.0.


MUST build an unpacked sdist from the ``source_directory`` and place it into the
specified ``sdist_directory``. If the ``source_directory`` is already an
unpacked sdist (as opposed to a VCS source tree) then it SHOULD simply copy the
Contributor:

  • Do we specify how to tell what case it's handling?
  • What about cases which are neither? E.g. if I download a tarball from Github and unpack it, it is neither a VCS checkout (no .git directory), nor an sdist. Is it OK to error out? Should frontend tools try something else if this fails?

Contributor:

A couple more cases occurred to me:

  • It's a known VCS checkout but the VCS tools are not available (e.g. git is not on PATH)
  • It's a VCS checkout of a VCS the tool does not recognise.


  • Do we specify how to tell what case it's handling?
  • What about cases which are neither? E.g. if I download a tarball from Github and unpack it, it is neither a VCS checkout (no .git directory), nor an sdist. Is it OK to error out? Should frontend tools try something else if this fails?

I think it's the backend's job to try to detect which case it's handling, and it's completely fair for the backend to throw up its hands if it's in a situation it was not designed to handle (like being neither in a VCS checkout nor in an sdist).

Backends could then be improved to handle different situations. I don't think we can future-proof this part (e.g. an export from GitHub might look different from an export from SourceForge, due to the inclusion or not of a top-level directory with an arbitrary name that might or might not include the repository name).

Contributor:

Thinking about this a bit more, what I think we need is a definition of what possible cases there are and what they mean.

This hasn't been important so far in this PEP, because we've just treated a 'source tree' - a directory containing pyproject.toml - as the input for building a wheel. If we're going to specify a build_sdist step which may have different behaviour in different cases, we need to define what an sdist is, and what the different possible inputs for the build_sdist step are.

There are (at least) four kinds of input to pip which can result in it installing from source (names, sdist paths/urls, local directories, VCS URLs). In which of these cases will it:

a. Not call the build_sdist hook
b. Call it but be prepared to handle its failure by installing another way
c. Call it and fail if the hook fails

And that's just with pip; then there's the question of other tools which may use this API, and what they expect from it.

Member Author (dstufft):

We don't specify how to tell; there are only three states that a thing can be in: it can be a wheel, a sdist, or something that can be turned into a wheel or sdist. While I used the words "VCS source tree", it doesn't literally have to be a VCS tree. It could just be a VCS-less directory on disk; that's fine. What requirements any particular build tool might place on a target project is really up to that build tool. Some might work just fine with a sdist generation step that is nothing more than shutil.copytree, in which case they don't really need to differentiate between a "VCS source tree" and a "sdist". Another project might, when in a VCS tree, dynamically compute a version number based on the git tags and "bake" that into the sdist (using some scheme internal to them); that project would differentiate, and would likely do it by looking at that internal scheme.

This PEP declares what a sdist is: it's the same thing we've had for like two decades, just s/setup.py/pyproject.toml/. That's perfectly reasonable for this PEP, and I don't think we need to get bogged down in the weeds declaring a sdist 2.0 or anything like that.

As far as when pip itself would call this: as written in this PR, it would call the build_sdist hook anytime it is building from source, unless that source is literally a foobar-1.0.tar.gz. This means it'll do it from a git clone or if you unpack a sdist yourself (this is why the hook needs to handle the case of being run in something that was already a sdist), because we can't differentiate between an unpacked sdist and a random git clone (or whatever) without either getting bogged down in a sdist 2.0 or knowing more about the internals of the build tool.

As far as failure cases go, if the hook fails the build fails.
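A hedged sketch of that frontend behaviour (the backend object and all helper names here are hypothetical, not part of the PEP):

import tarfile

def prepare_source(backend, source, sdist_directory, config_settings):
    # Hypothetical frontend step: ensure we end up with an unpacked sdist.
    if source.endswith('.tar.gz'):
        # Already a packed sdist (e.g. foobar-1.0.tar.gz): just unpack it.
        with tarfile.open(source) as tf:
            tf.extractall(sdist_directory)
    else:
        # Git clone, plain directory, or already-unpacked sdist: the
        # backend builds the sdist, and if the hook raises, the build fails.
        backend.build_sdist(source, sdist_directory, config_settings)
    return sdist_directory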

...

This hook MUST return an additional list of strings containing PEP 508
dependency specifications, above and beyond those specified in the
-``pyproject.toml`` file. Example::
+``pyproject.toml`` file. The ``source_directory`` parameter MUST be a path that
+points to the source tree that is being operated on. Example::
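(The PEP's own example is cut off in this excerpt. Purely for illustration, a backend implementing this hook might look like the following; the hook name is assumed from the surrounding draft, and the return value is invented:)

def get_requires_for_build_wheel(source_directory, config_settings):
    # Requirements discovered by inspecting the source tree, beyond
    # those already declared in pyproject.toml.
    return ['wheel >= 0.25', 'setuptools']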
Contributor:

I think previously the idea was that the source directory would be the CWD when the hook is invoked. Is there a reason this is inadequate?


Is there a reason this [source_directory simply being CWD instead of being passed as a parameter] is inadequate?

Maybe "explicit is better than implicit"? But I guess the only effect of passing this parameter in is forcing the backend to check where it is and cd'ing in there before running any commands.

BTW, should this PEP also specify guidelines on handling CWD? Like "if you os.chdir() somewhere else, make sure you go back to where you started"?
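For illustration, such a guideline could be satisfied with a small helper along these lines (a sketch, not proposed PEP text):

import os
from contextlib import contextmanager

@contextmanager
def restore_cwd(path):
    # Run a hook with the CWD set to the source tree, then put the
    # original CWD back, per the guideline floated above.
    old = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old)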

Contributor:

should this PEP also specify guidelines on handling CWD? like "if you os.chdir() somewhere else, make sure you go back to where you started"?

The idea is that the frontend would run these in a separate process, which would complete after running the hook, so there's no need for it to reset global state.

Member:

It's probably worth explicitly stating that backends MAY alter global process state, and it's up to front ends to deal with this possibility.

Member Author (dstufft):

It wasn't specifically stated whether it was CWD or what (though I assumed the implication was that it was). I believe that passing this in is a better way to write this code; it makes it easier to write tests and reason about the interface, particularly since some projects may not run this in a separate subprocess (pip will of course, but I don't think it is, or should be, mandatory that you invoke it in a subprocess if that doesn't make sense for your specific use of the API).

That being said, if we want to mandate that it must be called via a subprocess and use an implicit parameter of os.curdir to pass in the build chain, I'm OK with being outvoted on that.

@@ -218,6 +241,14 @@ metadata. The directory passed in by the build frontend MUST be
identical to the directory created by ``get_wheel_metadata``,
including any unrecognized files it created.

This method only places the *contents* of the wheel file, without actually
creating the final artifact. This is to allow flexibility in how the result is
Contributor:

This change is annoying for flit, because it builds the .whl file directly, without an intermediate directory containing files in the same layout. Not a showstopper if other people want it this way, but I'm -0.5 on it.

Member Author (dstufft):

Basically doing it that way means that in some cases we're going to round trip through zip and then unzip for no real reason. In your example, flit is going to produce a zipped-up wheel, then pip is going to immediately unzip it. However, if flit is modified to stage the files to an intermediate directory, then using this API pip can just install directly from that, instead of taking the performance hit of a compress + decompress round trip.
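To make the trade concrete, the zipping step a tool like twine would take on is roughly this (a sketch; RECORD handling and deterministic timestamps are omitted):

import os
import zipfile

def pack_wheel(unpacked_dir, whl_path):
    # Zip an unpacked wheel directory into the final .whl artifact.
    with zipfile.ZipFile(whl_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(unpacked_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, unpacked_dir))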

@takluyver (Contributor):

I also think that if we're asking tools to produce an (unpacked) sdist, we need to specify more precisely what that is. What do backends need to put in there? What can frontends use it for, other than passing it back to the backend to build a wheel?

creating the final artifact. This is to allow flexibility in how the result is
going to be used. A front end like pip, which is only going to immediately
install this wheel can speed up the installation process by skipping a round
trip through zip. However a command like a hypothetical ``twine wheel`` could


A front end like pip, which is only going to immediately install this wheel can speed up the installation process by skipping a round trip through zip.

Is this true for pip, though? Won't pip want to drop the built wheel in the wheel cache?

BTW I have this idea of pip eventually switching to having a wheel cache of unpacked wheels and just symlinking or hard-linking them inside the virtualenv, but this is a subject for another space...

Contributor:

From experience with conda (which uses hard linking much like you suggest), I'd be wary of this. I've seen files get modified in one installed version, but because it's hard linked, that modification affects every installed copy - and it's in the cache, so even reinstalling doesn't get the original version. When we suspect this, we usually advise people to uninstall and reinstall the whole Anaconda distribution.


From experience with conda (which uses hard linking much like you suggest), I'd be wary of this. I've seen files get modified in one installed version, but because it's hard linked, that modification affects every installed copy - and it's in the cache, so even reinstalling doesn't get the original version. When we suspect this, we usually advise people to uninstall and reinstall the whole Anaconda distribution.

I guess "switching to" is a strong expression. I was more thinking in the line of:

  • having a switch for pip to cache wheels unpacked, and
  • having another switch to use symlinks or hardlinks when installing (hardlinks can be created unprivileged on Windows, while symlinks cannot); when the switch is not given, handle unpacked wheels by copying.

Like any pip switches, these switches could be persisted in a pip.conf file site-wide, user-wide or virtualenv-wide.

Contributor:

I wouldn't want to maintain and support both of those code paths, but then it's not my project. ;-)


I wouldn't want to maintain and support both of those code paths

It doesn't really require multiple code paths. Wheels need to be unpacked anyway, so moving the unpacked wheel to the cache is a single move if it was unpacked in a tempdir inside the wheel cache to begin with.

And since wheels cannot be blindly shutil.copytree()ed into the destination location anyway, you only need to abstract the method by which a single file is copied into the destination to switch it to sym/hard-linking.
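For example, the single abstraction point described here might look like this (a sketch; the link_mode switch is hypothetical):

import os
import shutil

def install_file(src, dest, link_mode=None):
    # Copy by default; hard-link or symlink when the switch is set.
    if link_mode == 'hard':
        os.link(src, dest)
    elif link_mode == 'sym':
        os.symlink(src, dest)
    else:
        shutil.copy2(src, dest)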

Member:

I think the point about the wheel cache is a good one. I'm skeptical of the benefit we'd get in pip from this. @dstufft knows more about this area than me, but I'd like to see the benefits more clearly spelled out (and specifically how we'd handle the wheel cache in this case).

I'm strongly -1 on having unpacked wheels in the cache and links. Not all filesystems support links, Windows has issues with them, dangling links could become something we need to consider, ... I can't see the benefits justifying this.

Member Author (dstufft):

I am strongly -1 on having unpacked wheels in the cache and using links (hard or otherwise). The isolation provided by the current setup is a feature and I don't think adding another flag is worth it in this case.

That being said, we have a few cases where pip is going to be building a wheel, some of which will eventually store the resulting artifact in the wheel cache, but some of which are just going to build an ephemeral wheel and then throw it away. Doing something like pip install . or pip install mycoolsdist-1.0.tar.gz is not going to store a cached wheel and will rebuild each time.

Beyond that, I could see us adjusting the cache to try and speed it up even more. One example might be to just cache unpacked wheels (not link them, just essentially precompute the unzip step); another idea might be to build the wheel files specially to store the files uncompressed inside of the .whl instead of compressed (obviously not something you'd want to publish to PyPI, but fine for a local cache).

Whether these make sense or not would require benchmarking the size trade-off vs. the speed gained. However, building an unpacked wheel basically costs us nothing (except maybe some churn in some projects) over building an already-packed wheel. Even if we're in a case where we're ultimately going to zip it up again, we aren't really taking a performance hit; we're just shifting the work from one phase in the process to another, while keeping the flexibility to implement performance improvements where they do make sense.

def build_sdist(source_directory, sdist_directory, config_settings):
...

MUST build an unpacked sdist from the ``source_directory`` and place it into the
Member:

Do we really want to leave it to the backend to decide what goes into a sdist? Certainly the backend needs to say "we must have these files if we are to be able to build a wheel", but for other things (like allowing the user to specify additional files to include in the sdist, like LICENSE) it seems like it would be better to have a standard method, probably specified via pyproject.toml that sdist-building tools would deal with, rather than expecting backends to all duplicate the same logic.

I think I'd prefer a get_build_files hook that has to return a list of the files needed for a build. Backends SHOULD return a minimal list. The files returned MUST be appropriate for including in a sdist (I think it's OK to leave this use of "sdist" intentionally vague), so VCS control files, tox caches, etc, need to be excluded.


Do we really want to leave it to the backend to decide what goes into a sdist?

As of now, this is the status quo. Some dists even require plugins (e.g. setuptools-git) so their backends can do that correctly.

Certainly the backend needs to say "we must have these files if we are to be able to build a wheel",

Which is why Thomas would prefer if we restricted this method to only handle the files needed to build the wheel, and tabling the whole sdist discussion for later.

I think I'd prefer a get_build_files hook that has to return a list of the files needed for a build

Not sure if it would be better to list the files instead of copying them, as a backend like flit could satisfy the current method with a shutil.copytree(). Maybe we just need to rename the method? The only thing that really needs to happen is for the destination directory to contain something, created by the backend, that the backend itself knows how to turn into a wheel.
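For instance, a minimal-effort backend could satisfy the current method like this (a sketch; a real backend would write into a {name}-{version} subdirectory and prune VCS metadata, tox caches, and so on):

import shutil

def build_sdist(source_directory, sdist_directory, config_settings):
    # The destination just has to contain something the backend itself
    # can later turn into a wheel. (dirs_exist_ok needs Python 3.8+.)
    shutil.copytree(source_directory, sdist_directory, dirs_exist_ok=True)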

Member:

I'm OK with copying them.

The only thing that really needs to happen is for the destination directory to contain something, created by the backend, that the backend itself knows how to turn it into a wheel

That's a good description of what I was trying to get at. There may be other use cases (building sdists being the obvious one!) so I'd state it a little more generically as "the hook should write to the destination directory the minimal set of files needed by the backend, that the backend can use to build a wheel". But that's just details.

Contributor:

Not sure if it would be better to list the files instead of copying them, as a backend like flit could satisfy the current method with a shutil.copytree().

If the hook is specified as defining the minimal set of files needed to build a wheel, flit will probably do something slightly more advanced than calling copytree() on the whole directory, for performance reasons. I think the implementation difficulty would be similar whether it returns a list of files or copies them itself.

If the hook actually generates something like an sdist, as Donald favours, then it's a much more complex question.

Member Author (dstufft):

I mentioned it before, but yes, I think we should leave it up to the backend what goes into the sdist, because the backend is the place that is handling this already in this PEP (unless we want to drastically alter this PEP). Building the wheel is more involved than building the sdist, not less: it needs to compute almost all the same stuff, but also a bunch of other stuff.

In addition, we're already assuming (or we should be, if we're not) that the build tool is going to have logic to produce sdists; otherwise, with setup.py gone, what is producing a sdist? If we're not baking support for sdists into the standard, then build tool authors are going to be incentivized not to make it possible to produce sdists, and I am strongly -1 on any proposal that makes it, or incentivizes it to be, meaningfully harder to produce a sdist + wheels than just wheels.

Member:

In addition, we're already assuming (or we should be, if we're not) that the build tool is going to have logic to produce sdists; otherwise, with setup.py gone, what is producing a sdist?

There's nowhere that's explicitly stated, and I'm not sure it's as obvious as you think. Personally, I find your argument compelling (with the proviso that I don't have a good feel whether it's pip or some other tool that should be the generic "sdist builder") but I do think it ought to be presented as an argument and agreed on. Could you post this point to distutils-sig as a thread of its own? I fear it'll get missed in the rest of the debate.

@ncoghlan (Contributor):

ncoghlan commented Jun 5, 2017

I merged #277 with @takluyver's preferred approach to this, so we'll use that as the basis for any further design discussions.

@ncoghlan ncoghlan closed this Jun 5, 2017