Add build_sdist and tweak the existing interface #275
Conversation
The PEP was previously silent about how a build backend should determine what directory contains the source tree it was being asked to operate on. Instead of somewhat implicitly assuming it should be os.curdir, let the build tool pass it in as part of the API contract.
Dumping a directory named {name}-{version}.dist-info into a directory can run into problems where the calling tool can only determine what the directory is supposed to be by globbing that directory. This can cause bugs when a second entry gets erroneously added in the future (or if the ability to build multiple wheels is added). Returning the expected name sidesteps that issue, while using a list allows future expansion to building multiple wheels.
Instead of creating a zipped up .whl file which a project like pip is only going to immediately unzip again, create an unpacked wheel directory that a tool like pip can just immediately install from, while a tool like twine could zip up to create the wheel artifact.
Creating a source distribution is an important part of what is done with ``setup.py``, and any replacement for it must also handle that.
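The reworked hooks described above can be sketched as a minimal toy backend. The project name, version, and file contents below are illustrative assumptions, not something the PEP specifies; a real backend derives them from its own configuration:

```python
import shutil
from pathlib import Path

# Assumed project metadata, hard-coded purely for illustration.
NAME, VERSION = "example", "1.0"

def build_sdist(source_directory, sdist_directory, config_settings=None):
    # Create an unpacked {name}-{version} sdist directory inside
    # sdist_directory and report its name back to the frontend.
    target = Path(sdist_directory) / "{}-{}".format(NAME, VERSION)
    shutil.copytree(str(source_directory), str(target))
    return target.name

def get_wheel_metadata(source_directory, metadata_directory, config_settings=None):
    # Write the .dist-info directory and return its name as a single-item
    # list, leaving room for building multiple wheels in a future revision.
    dist_info = Path(metadata_directory) / "{}-{}.dist-info".format(NAME, VERSION)
    dist_info.mkdir(parents=True)
    (dist_info / "METADATA").write_text("Name: {}\nVersion: {}\n".format(NAME, VERSION))
    return [dist_info.name]
```

Returning the created names explicitly is what lets the frontend avoid globbing the output directories, per the rationale above.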
MUST build an unpacked sdist from the ``source_directory`` and place it into the
specified ``sdist_directory``. If the ``source_directory`` is already an
An sdist normally contains a top level folder called name-version, e.g. requests-2.17.3. Is the backend expected to create this inside sdist_directory, or assume that sdist_directory is that folder? I.e. which of these is true:
from os.path import join, exists
exists(join(sdist_directory, 'requests-2.7.13', 'tests'))
# OR
exists(join(sdist_directory, 'tests'))
I think the first should be true, largely because at this stage the metadata doesn't exist for the calling tool to know that it is requests-2.7.13 and not something else (I briefly considered trying to add metadata like what setuptools sometimes does, but I avoided it because it adds another layer of complexity to this PEP and I think we can avoid it).
I can adjust this PR to this effect.
We have a build_wheel hook and a get_wheel_metadata hook. But here, we just have build_sdist - is the backend expected to create the sdist metadata in sdist_directory? Why the discrepancy? And doesn't this require backends to track future "sdist >=2.0" specs to maintain correct metadata? I think we should decouple creating sdist metadata from backends. And once we get into this discussion, we're opening up the whole "sdist metadata" debate, which is nowhere near resolution. Hence my preference for a "just give me the files required to build" hook.
The discrepancy is because as of today, there is no real standard for metadata inside of a sdist. The only real standard is "a tarball with a setup.py script that looks close to what setuptools supports". This PEP extends that to also include "a tarball with a pyproject.toml file with a particular key". We currently call python setup.py egg_info to get the metadata that "belongs" to the sdist (it doesn't really; it belongs to the ephemeral binary output that the sdist creates), and with this PEP we can just call the get_wheel_metadata hook to achieve the same thing.
If we create a sdist 2.0 then we can focus on what metadata we want to include in an sdist to enable some other things that would be nice to have, but like you said, that opens up a huge debate. This PEP is "let me replace python setup.py bdist_wheel with something else", and my statement is that we also need to replace python setup.py sdist, not that we need to drastically alter the status quo.
And yes, they'll have to track any work for a sdist 2.0 (well, they don't have to, but if they don't they'll never gain any improvements we get from it), just like they'll have to track a hypothetical wheel 2.0.
MUST build an unpacked sdist from the ``source_directory`` and place it into the
specified ``sdist_directory``. If the ``source_directory`` is already an
unpacked sdist (as opposed to a VCS source tree) then it SHOULD simply copy the
- Do we specify how to tell what case it's handling?
- What about cases which are neither? E.g. if I download a tarball from Github and unpack it, it is neither a VCS checkout (no .git directory), nor an sdist. Is it OK to error out? Should frontend tools try something else if this fails?
A couple more cases occurred to me:
- It's a known VCS checkout but the VCS tools are not available (e.g. git is not on PATH)
- It's a VCS checkout of a VCS the tool does not recognise.
- Do we specify how to tell what case it's handling?
- What about cases which are neither? E.g. if I download a tarball from Github and unpack it, it is neither a VCS checkout (no .git directory), nor an sdist. Is it OK to error out? Should frontend tools try something else if this fails?
I think that's the backend's job, to try to detect which case it's handling, and it's completely fair of the backend to throw up its hands if it's in a situation it was not designed to handle (like being neither in a VCS checkout nor in an sdist).
Backends could then be improved to handle different situations. I don't think we can future-proof this part (e.g. an export from GitHub might look different from an export from SourceForge, due to the inclusion or not of a top-level directory with an arbitrary name that might or might not include the repository name).
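As a rough sketch of the kind of detection a backend might do (the function name is invented, and the markers checked are only conventional heuristics, not anything the PEP would mandate):

```python
import os

def classify_source(directory):
    # Illustrative heuristics only; a real backend would check for its own
    # internal markers (e.g. version info it baked into a generated sdist).
    if os.path.isdir(os.path.join(directory, ".git")):
        return "vcs-checkout"
    if os.path.isfile(os.path.join(directory, "PKG-INFO")):
        return "unpacked-sdist"   # PKG-INFO is the traditional sdist marker
    return "plain-directory"      # e.g. an unpacked GitHub tarball
```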
Thinking about this a bit more, what I think we need is a definition of what possible cases there are and what they mean.
This hasn't been important so far in this PEP, because we've just treated a 'source tree' - a directory containing pyproject.toml - as the input for building a wheel. If we're going to specify a build_sdist step which may have different behaviour in different cases, we need to define what an sdist is, and what the different possible inputs for the build_sdist step are.
There are (at least) four kinds of input to pip which can result in it installing from source (names, sdist paths/urls, local directories, VCS URLs). In which of these cases will it:
a. Not call the build_sdist hook
b. Call it but be prepared to handle its failure by installing another way
c. Call it and fail if the hook fails
And that's just with pip; then there's the question of other tools which may use this API, and what they expect from it.
We don't specify how to tell; there are only three states that a thing can be in: it can be a wheel, a sdist, or something that can be turned into a wheel or sdist. While I used the words "VCS source tree", it doesn't literally have to be a VCS tree. It could just be a VCS-less directory on disk, that's fine. What requirements any particular build tool might place on a target project are really up to that particular build tool. Some might work just fine with a sdist generation step that is nothing more than shutil.copytree, in which case they don't really need to differentiate between a "VCS source tree" and a "sdist". Another project might, when in a VCS tree, dynamically compute a version number based on the git tags and "bake" that into the sdist (using some scheme internal to them); that project would differentiate, and would likely do it by looking at that internal scheme.
This PEP declares what a sdist is: it's the same thing we've had for like 2 decades, just s/setup.py/pyproject.toml/. That's perfectly reasonable for this PEP and I don't think we need to get bogged down in the weeds declaring a sdist 2.0 or anything like that.
As far as when pip itself would call this: as written in this PR it would call the build_sdist hook anytime it is building from source, unless that source is literally a foobar-1.0.tar.gz. This means it'll do it from a git clone or if you unpack a sdist yourself (this is why the hook needs to handle the case of being run in something that was already a sdist), because we can't differentiate between an unpacked sdist and a random git clone (or whatever) without either getting bogged down in a sdist 2.0 or knowing more about the internals of the build tool.
As far as failure cases go, if the hook fails the build fails.
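A sketch of that frontend behaviour under the stated assumptions (the helper name is invented, and only the .tar.gz case is shown; a real frontend handles more input kinds):

```python
import tarfile
import tempfile

def obtain_unpacked_sdist(backend, source_path, config_settings=None):
    # Hypothetical frontend helper: a literal .tar.gz is unpacked directly;
    # anything else (git clone, unpacked sdist, plain directory) goes
    # through the build_sdist hook, and any exception the hook raises
    # simply fails the build.
    work_dir = tempfile.mkdtemp()
    if str(source_path).endswith(".tar.gz"):
        with tarfile.open(source_path) as tf:
            tf.extractall(work_dir)
    else:
        backend.build_sdist(source_path, work_dir, config_settings or {})
    return work_dir
```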
This hook MUST return an additional list of strings containing PEP 508
dependency specifications, above and beyond those specified in the
``pyproject.toml`` file. The ``source_directory`` parameter MUST be a path that
points to the source tree that is being operated on. Example::
I think previously the idea was that the source directory would be the CWD when the hook is invoked. Is there a reason this is inadequate?
Is there a reason this [source_directory simply being CWD instead of being passed as a parameter] is inadequate?
Maybe "explicit is better than implicit"? But I guess the only effect of passing this parameter in is forcing the backend to check where it is and CDing in there before running any commands.
BTW, should this PEP also specify guidelines on handling CWD? Like "if you os.chdir() somewhere else, make sure you go back to where you started"?
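That guideline, if a backend wanted to follow it, amounts to a small helper along these lines (a sketch, not something any PEP mandates; the name pushd is borrowed from shell usage):

```python
import contextlib
import os

@contextlib.contextmanager
def pushd(directory):
    # Change into `directory`, and always restore the previous working
    # directory on exit, even if the body raises.
    previous = os.getcwd()
    os.chdir(directory)
    try:
        yield
    finally:
        os.chdir(previous)
```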
should this PEP also specify guidelines on handling CWD? like "if you os.chdir() somewhere else, make sure you go back to where you started"?
The idea is that the frontend would run these in a separate process, which would complete after running the hook, so there's no need for it to reset global state.
It's probably worth explicitly stating that backends MAY alter global process state, and it's up to front ends to deal with this possibility.
It wasn't specifically stated whether it was CWD or what (though I assumed the implication was that it was). I believe that passing this in is a better way to write this code; it makes it easier to write tests and reason about the interface, particularly since some projects may not run this in a separate subprocess (pip will of course, but I don't think it is or should be mandatory that you invoke it in a subprocess if that doesn't make sense for your specific use of the API).
That being said, if we want to mandate that it must be called via a subprocess and use an implicit parameter of os.curdir to pass in the build chain, I'm OK with being outvoted on that.
@@ -218,6 +241,14 @@
metadata. The directory passed in by the build frontend MUST be
identical to the directory created by ``get_wheel_metadata``,
including any unrecognized files it created.

This method only places the *contents* of the wheel file, without actually
creating the final artifact. This is to allow flexibility in how the result is
This change is annoying for flit, because it builds the .whl file directly, without an intermediate directory containing files in the same layout. Not a showstopper if other people want it this way, but I'm -0.5 on it.
Basically doing it that way means that in some cases we're going to round trip through zip and then unzip for no real reason. In your example, flit is going to produce a zipped up wheel, then pip is going to immediately unzip it. However if flit is modified to stage the files to an intermediate directory, then using this API pip can just install directly from that, instead of taking the performance hit of a compress + decompress round trip.
I also think that if we're asking tools to produce an (unpacked) sdist, we need to specify more precisely what that is. What do backends need to put in there? What can frontends use it for, other than passing it back to the backend to build a wheel?
creating the final artifact. This is to allow flexibility in how the result is
going to be used. A front end like pip, which is only going to immediately
install this wheel can speed up the installation process by skipping a round
trip through zip. However a command like a hypothetical ``twine wheel`` could
A front end like pip, which is only going to immediately install this wheel can speed up the installation process by skipping a round trip through zip.
Is this true for pip, though? Won't pip want to drop the built wheel in the wheel cache?
BTW I have this idea of pip eventually switching to having a wheel cache of unpacked wheels and just symlinking or hard-linking them inside the virtualenv, but this is a subject for another space...
From experience with conda (which uses hard linking much like you suggest), I'd be wary of this. I've seen files get modified in one installed version, but because it's hard linked, that modification affects every installed copy - and it's in the cache, so even reinstalling doesn't get the original version. When we suspect this, we usually advise people to uninstall and reinstall the whole Anaconda distribution.
From experience with conda (which uses hard linking much like you suggest), I'd be wary of this. I've seen files get modified in one installed version, but because it's hard linked, that modification affects every installed copy - and it's in the cache, so even reinstalling doesn't get the original version. When we suspect this, we usually advise people to uninstall and reinstall the whole Anaconda distribution.
I guess "switching to" is a strong expression. I was more thinking in the line of:
- having a switch to pip to cache wheels unpacked, and
- having another switch to use symlinks or hardlinks when installing (hard links can be created unprivileged on Windows, while symlinks cannot). When missing, handle unpacked wheels by copying.
Like any pip switches, these switches could be persisted in a pip.conf file site-wide, user-wide or virtualenv-wide.
I wouldn't want to maintain and support both of those code paths, but then it's not my project. ;-)
I wouldn't want to maintain and support both of those code paths
It doesn't really require multiple code paths. Wheels need to be unpacked anyway, so moving the unpacked wheel to the cache is a single move if it was unpacked in a tempdir inside the wheel cache to begin with.
And since wheels cannot be blindly shutil.copytree()'d into the destination location anyway, you only need to abstract the method by which a single file is copied into the destination to switch it to sym/hard-linking.
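That single abstraction point might look like this sketch (the function name is invented, and the fallback-to-copy behaviour is an assumption about how a frontend would want to degrade when linking isn't possible):

```python
import os
import shutil

def place_file(src, dest, prefer_link=False):
    # The one per-file operation that changes between a plain install and a
    # link-based install; everything around it stays the same.
    if prefer_link:
        try:
            os.link(src, dest)  # hard link; fall through below if unsupported
            return
        except OSError:
            pass                # cross-device link, FAT filesystem, etc.
    shutil.copy2(src, dest)
```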
I think the point about the wheel cache is a good one. I'm skeptical of the benefit we'd get in pip from this. @dstufft knows more about this area than me, but I'd like to see the benefits more clearly spelled out (and specifically how we'd handle the wheel cache in this case).
I'm strongly -1 on having unpacked wheels in the cache and links. Not all filesystems support links, Windows has issues with them, dangling links could become something we need to consider, ... I can't see the benefits justifying this.
I am strongly -1 on having unpacked wheels in the cache and using links (hard or otherwise). The isolation provided by the current setup is a feature and I don't think adding another flag is worth it in this case.
That being said, we have a few cases where pip is going to be building a wheel, some of which will eventually store the resulting artifact in the wheel cache, but some of which are just going to build an ephemeral wheel and then throw it away. Doing something like pip install . or pip install mycoolsdist-1.0.tar.gz is not going to store a cached wheel and will rebuild each time.
Beyond that, I could see us adjusting the cache to try and speed it up even more. One example might be to just cache the unpacked wheel (but not link it, just essentially precomputing the unzip step); another idea might be to build the wheel files specially to store the files uncompressed inside of the .whl instead of compressed (obviously not something you'd want to publish to PyPI, but fine for a local cache).
Whether these make sense or not would require benchmarking the size trade-off vs the speed gained. However, building an unpacked wheel basically costs us nothing (except maybe some churn in some projects) over building an already packed wheel. Even if we're in a case where we're ultimately going to zip it up again, we aren't really taking a performance hit; we're just shifting the work from one phase in the process to another, while leaving us the flexibility to implement performance improvements where they do make sense.
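The "uncompressed .whl in the local cache" idea mentioned here would be roughly a one-line difference when packing. A sketch, assuming an invented pack_wheel helper (wheels also need a RECORD and proper metadata, which this omits):

```python
import zipfile
from pathlib import Path

def pack_wheel(unpacked_dir, whl_path, for_local_cache=False):
    # ZIP_STORED skips compression entirely, trading disk space for faster
    # unpacking on install; fine for a local cache, not for publishing.
    method = zipfile.ZIP_STORED if for_local_cache else zipfile.ZIP_DEFLATED
    with zipfile.ZipFile(whl_path, "w", compression=method) as zf:
        for path in sorted(Path(unpacked_dir).rglob("*")):
            if path.is_file():
                zf.write(str(path), path.relative_to(unpacked_dir).as_posix())
```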
def build_sdist(source_directory, sdist_directory, config_settings):
    ...

MUST build an unpacked sdist from the ``source_directory`` and place it into the
Do we really want to leave it to the backend to decide what goes into a sdist? Certainly the backend needs to say "we must have these files if we are to be able to build a wheel", but for other things (like allowing the user to specify additional files to include in the sdist, like LICENSE) it seems like it would be better to have a standard method, probably specified via pyproject.toml, that sdist-building tools would deal with, rather than expecting backends to all duplicate the same logic.
I think I'd prefer a get_build_files hook that has to return a list of the files needed for a build. Backends SHOULD return a minimal list. The files returned MUST be appropriate for including in a sdist (I think it's OK to leave this use of "sdist" intentionally vague), so VCS control files, tox caches, etc, need to be excluded.
Do we really want to leave it to the backend to decide what goes into a sdist?
As of now, this is the status quo. Some dists even require plugins (e.g. setuptools-git) so their backends can do that correctly.
Certainly the backend needs to say "we must have these files if we are to be able to build a wheel",
Which is why Thomas would prefer if we restricted this method to only handle the files needed to build the wheel, tabling the whole sdist discussion for later.
I think I'd prefer a get_build_files hook that has to return a list of the files needed for a build
Not sure if it would be better to list the files instead of copying them, as a backend like flit could satisfy the current method with a shutil.copytree(). Maybe we just need to rename the method? The only thing that really needs to happen is for the destination directory to contain something, created by the backend, that the backend itself knows how to turn into a wheel.
I'm OK with copying them.
The only thing that really needs to happen is for the destination directory to contain something, created by the backend, that the backend itself knows how to turn it into a wheel
That's a good description of what I was trying to get at. There may be other use cases (building sdists being the obvious one!) so I'd state it a little more generically as "the hook should write to the destination directory the minimal set of files needed by the backend, that the backend can use to build a wheel". But that's just details.
Not sure if it would be better to list the files instead of copying them, as a backend like flit could satisfy the current method with a shutil.copytree().
If the hook is specified as defining the minimal set of files needed to build a wheel, flit will probably do something slightly more advanced than calling copytree() on the whole directory, for performance reasons. I think the implementation difficulty would be similar whether it returns a list of files or copies them itself.
If the hook actually generates something like an sdist, as Donald favours, then it's a much more complex question.
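For comparison, the list-returning shape being discussed might look like this (the hook name comes from the comments above; the hard-coded candidate file set is purely illustrative, since a real backend would compute it from its own configuration):

```python
import os

def get_build_files(source_directory, config_settings=None):
    # Return the minimal set of files the backend needs to build a wheel,
    # as paths relative to source_directory.
    candidates = ["pyproject.toml", "README.rst", "LICENSE"]
    return [name for name in candidates
            if os.path.isfile(os.path.join(source_directory, name))]
```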
I mentioned it before, but yes, I think we should leave it up to the backend what goes into the sdist, because the backend is the place that is handling this already in this PEP (unless we want to drastically alter this PEP). Building the wheel is more involved than building the sdist, not less; it needs to compute almost all the same stuff, plus a bunch of other stuff.
In addition, we're already assuming (or we should be if we're not) that the build tool is going to have logic to produce sdists; otherwise, with setup.py gone, what is producing a sdist? If we're not baking support for sdists into the standard, then build tool authors are going to be incentivized to not make it possible to produce sdists, and I am strongly -1 on any proposal that makes it, or incentivizes it to be, meaningfully harder to produce a sdist + wheels than just wheels.
In addition, we're already assuming (or we should be if we're not) that the build tool is going to be having logic to produce sdists, otherwise with setup.py gone, what is producing a sdist?
There's nowhere that's explicitly stated, and I'm not sure it's as obvious as you think. Personally, I find your argument compelling (with the proviso that I don't have a good feel whether it's pip or some other tool that should be the generic "sdist builder") but I do think it ought to be presented as an argument and agreed on. Could you post this point to distutils-sig as a thread of its own? I fear it'll get missed in the rest of the debate.
I merged #277 with @takluyver's preferred approach to this, so we'll use that as the basis for any further design discussions.
- Adds a source_directory parameter to the build backend methods. This provides an explicit way to pass the source tree that is currently being operated on into each method.
- Has get_wheel_metadata return the expected directory name as a single item list, allowing future expansion to multiple wheels if needed.
- Builds an unpacked wheel directory instead of a zipped .whl file.
- Adds a build_sdist method which will build an unpacked sdist, similarly to how build_wheel creates an unpacked wheel file.