
Support incremental complexity and build steps for packages #248

Open
jaraco opened this issue Mar 9, 2019 · 8 comments

Comments

@jaraco (Member) commented Mar 9, 2019

At the most recent PyPA sprint in NYC (thanks Bloomberg), I finally was able to put my finger on what was bothering me about Python Packaging. The following describes what I envision as a simpler, more intuitive system for dealing with the bi-modal nature of a package under development.

Problem

The issue stems from the disparity between a built/installed package and a package under development. Even today, pip relies on setuptools to perform an "editable" install (a package under development installed into an environment). Building is a necessary step for every package, even the most basic hello-world package: a project containing one file and one function in a directory. Before the project is "built", it has no (supported) metadata: not only does it have no name, it also doesn't appear to be installed.

To create this metadata, one needs to select a build tool (setuptools, flit, etc.), author the metadata in that tool's source format, and then run the tool to generate the metadata. It's generally not possible or desirable to persist the generated metadata with the project (such as in the SCM repo); instead, the best recommendation is to enact the various steps required to publish the package and expect usable metadata only from that published location (often PyPI). Those steps include:

  1. Author the metadata in some source format.
  2. Run build tools to translate the source format to the publishable format.
  3. Publish the built package.

Solution - Incremental and Inferred Metadata

Imagine instead a world where simple packages could author their metadata directly. Users would create files in something akin to myproject.dist-info, files that presented the static or default metadata for the project and which a build tool would copy directly and extend. To avoid mutation of source files, this metadata directory would allow metadata to be supplied by multiple files, similar to the conf.d concept in Debian (among others). The metadata could also record which build steps the project requires and which have already been run, so that a build tool could determine what remains to be done.

Many projects would have no build steps - a git checkout might produce a viable package with metadata.

Furthermore, such a system could also define some inferred values, such that useful metadata could be derived from the source code itself. Imagine for example that if a project has no "name" defined, it could infer the project name from the containing folder or the basename of the SCM URL. The version could have a sane default like 0.0.0 but also honor SCM metadata (tags) if present.
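The inference rules described above might look like this in code; the function and its fallbacks are illustrative, not taken from any existing tool:

```python
import pathlib

def infer_metadata(project_dir, explicit=None):
    """Fill in missing fields from the environment: name falls back to
    the containing folder's name, version to a sane default of 0.0.0."""
    explicit = explicit or {}
    name = explicit.get("name") or pathlib.Path(project_dir).name
    version = explicit.get("version") or "0.0.0"
    return {"name": name, "version": version}
```

An SCM-aware tool could additionally consult git tags before falling back to 0.0.0.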

In such a world, it may not be necessary to "build" anything to have a viable (distribution) package for a project. If one creates a directory and puts a Python module in it (or even without a module, really), it will already represent a minimal package (dirname-0.0.0). From there, the developer can add modules, Python packages, and other metadata--incrementally increasing the sophistication, complexity, and build requirements for the package.

Even projects with unmet build steps would have some viable metadata... and except for builds that can't happen in-place (for whatever reason), an editable install is basically a no-op: ensure the project is on sys.path.

Execution

To support this model, several changes would need to take place. The main change would be to create a new metadata format, one that supported the model described above. It would need to be both user-friendly and machine-readable. It should probably be flexible and extensible to support unforeseen use-cases.

A model for the build steps would need to be devised. I imagine these to be arbitrary callables, but they should have constraints on what artifacts they produce and where. I imagine that the output from some build steps might be the input for subsequent build steps.

In addition to build steps, there may be install steps. The basic, implicit install step is one that copies the manifest of the project's files to a site-packages directory, but perhaps separate install steps could define console scripts that get created or copy arbitrary data to other directories based on platform rules.

Tools that read metadata (pip, importlib_metadata, maybe pkg_resources) would need to develop support for this new format.

Discussion

I believe this proposal is largely compatible with, and independent of, the work done on PEP 517 and PEP 518. It would build on and eventually supersede the work done in prior metadata specs such as PEP 566 and its predecessors. It would also supersede setup.py, setup.cfg, and possibly some of pyproject.toml as the recommended way for a packager to supply metadata.

Perhaps run-time compatibility could be provided with an optional install step that converts the metadata into one of the older formats.

I've posted this description here to gather feedback and to serve as a location to reference the concept and proposal. Most likely a PEP would be in order to formalize and refine the proposal. I'm happy to embark on that process after gaining some tacit consent and clearing any initial hurdles/concerns.

@KOLANICH

> it's generally not possible or desirable to persist that metadata with the project

Of course it is both necessary and desirable.

> Imagine instead a world where simple packages could author the metadata directly. Users would create files in something akin to myproject.dist-info

setup.cfg, pyproject.toml.

@njsmith (Member) commented Mar 10, 2019

> The main change would be to create a new metadata format, one that supported the model described above. It would need to be both user-friendly and machine-readable. It should probably be flexible and extensible to support unforeseen use-cases.

I think you mean pyproject.toml :-). So I guess the distilled essence of this proposal would be: when pip or other tools want to gather metadata about what packages are available in a python environment, in addition to looking for .dist-info and .egg-info, they should also look for some section in any pyproject.toml files they see.

> A model for the build steps would need to be devised. I imagine these to be arbitrary callables, but they should have constraints on what artifacts they produce and where. I imagine that the output from some build steps might be the input for subsequent build steps.

I don't think the incremental build part makes sense. Designing a new build framework is the land-war-in-Asia of software projects. Trying to standardize a one-size-fits-all-framework seems like a bad idea. (Which is why we've been trying to go the other way with PEP 517 etc.). And, I don't think it's the important part of your idea anyway – if you need to invoke a build step to make a package usable, then you might as well use PEP 517.

I'm also not sure there's much value in teaching pip that an empty directory you've run git init . in is mypackage-0.0.0. What's pip going to do with this information? It's not like there are packages on PyPI that depend on mypackage==0.0.0...

The case where I think this might be interesting, is for working with source checkouts of projects that have real names/versions/etc., but where the code is pure Python and the "build" is trivial. This is a special case in some ways, but it's a pretty common one, and it's the one where "just put the git checkout on your PYTHONPATH" is most attractive for development. And, it's the case where "develop" installs almost work perfectly – the only problem right now is that the metadata in the .egg-info directory is inevitably out of date. But if pip could read the metadata directly from the original source-of-truth, then that might solve this problem? I'm imagining something like, moving the static metadata that flit uses into some standardized place inside pyproject.toml, where build backends and package managers could both potentially get at it.

OTOH — the two most important pieces of metadata are the package name and version, and flit doesn't store the version in static metadata. (It pulls it out of the package.__version__ string.)
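flit's trick of pulling the version out of `package.__version__` can be done without importing the package, roughly like this (a simplified sketch; flit's actual implementation handles more cases):

```python
import ast

def read_version(source):
    """Statically extract a module's __version__ string from its source,
    without importing it."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if getattr(target, "id", None) == "__version__":
                    return ast.literal_eval(node.value)
    return None
```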

@ncoghlan (Member)

I like the framing of this aspect of the problem as "How can git clone result in a candidate sys.path entry with valid installation database metadata?"

I don't think we want to conflate this directly with pyproject.toml though, as there should still be a way to distinguish between "This needs to be built & installed to be usable" and "This can be used pretty much as-is", and tools shouldn't need to parse pyproject.toml to tell the difference.

Instead, I think the distinction can be made on the installation database side, by letting editable installs have a more relaxed notion of what their dist-info directory is required to contain: https://www.python.org/dev/peps/pep-0376/#one-dist-info-directory-per-installed-distribution

For example, we could allow touch myproject-0.0.1.dist-info/.gitignore to define a valid installation record marking the containing directory as an editable install for version 0.0.1 of myproject.

That doesn't include a valid metadata file according to https://packaging.python.org/specifications/core-metadata/#core-metadata-specifications, but the two required project-specific fields (Name and Version) can be inferred from the dist-info directory name.
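Inferring those two fields from the directory name alone takes only a few lines; this sketch splits on the last hyphen, since project names may themselves contain hyphens:

```python
def infer_from_dist_info(dirname):
    """Recover (name, version) from e.g. 'myproject-0.0.1.dist-info'."""
    stem = dirname[: -len(".dist-info")]
    name, _, version = stem.rpartition("-")
    return name, version
```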

@njsmith (Member) commented Mar 10, 2019

> For example, we could allow touch myproject-0.0.1.dist-info/.gitignore to define a valid installation record marking the containing directory as an editable install for version 0.0.1 of myproject.

But this would mean that now every time you release, you have to do git mv myproject-${OLD_VERSION}.dist-info myproject-${NEW_VERSION}.dist-info. Updating all the duplicate places Python projects keep version numbers is already a sore spot for packagers. And if that's a manual step, you might as well run setup.py egg_info (or your build system's equivalent) and check in the full metadata, remembering to re-do that on every release.

@gaborbernat

I don't think this should be version-controlled, as much of the metadata is probably platform- and environment-dependent. I agree we want this, and it should be blazing fast, but it probably needs to be calculated on the fly at request time.

@ncoghlan (Member)

Yes, there are reasons people don't do things this way: different things look in different places, and automating the data duplication process is easier than trying to get three decades' worth of tools to all look in the same place.

But right now, we don't even explain what a custom "bumpversion.py" would need to generate directly in the repo in order to allow "git clone" to create a mostly valid PEP 376 database entry.

It really isn't that much:

  • dist-info directory with the right name and version
  • a METADATA file in that directory with the same Name and Version entries, plus a "Metadata-Version: 1.0" entry
  • an INSTALLER file saying "git"

You wouldn't be able to generate RECORD, but "no RECORD file" can become the marker for "unmanaged editable install".

@gaborbernat

I would like to circle back to why we need this information and how we'll use it. To me it seems like what we want here is https://www.python.org/dev/peps/pep-0517/#prepare-metadata-for-build-wheel, and I'm not sure we really need to version-control that 🤔

@ncoghlan (Member)

@gaborbernat: The use case isn't for full-fledged packages; it's for cases where you're not doing "proper" package management but still want some form of basic machine-readable record keeping. This is why I don't think it makes sense to care about build steps: once you have a build process, running pip install -e . (or an equivalent) isn't a major burden.

By contrast, if we define what manually maintained metadata (or an ad hoc metadata generator) would look like, then that also defines what tools would need to do to implement their own pip install -e . equivalent.

For example:

```
[ncoghlan@localhost tinkering]$ ./setversion.py foo 1.2.3 .
[ncoghlan@localhost tinkering]$ ./setversion.py bar 4.5.6 .
[ncoghlan@localhost tinkering]$ ./setversion.py baz 7.8.9 .
[ncoghlan@localhost tinkering]$ ls
bar-4.5.6.dist-info  baz-7.8.9.dist-info  foo-1.2.3.dist-info  LICENSE  README.md  setversion.py
[ncoghlan@localhost tinkering]$ cat foo-1.2.3.dist-info/METADATA
Metadata-Version: 1.0
Name: foo
Version: 1.2.3
[ncoghlan@localhost tinkering]$ cat foo-1.2.3.dist-info/INSTALLER
unmanaged
[ncoghlan@localhost tinkering]$ python3 -m pip list | grep -E '(foo|bar|baz)'
bar               4.5.6
baz               7.8.9
foo               1.2.3
```

Where setversion.py is this no-dependencies script: https://github.com/ncoghlan/tinkering/blob/master/setversion.py

As far as tools like pip are concerned, foo, bar, and baz are now packages that actually exist on the local system.

5 participants