PGXN meta sketch #4

Open
wants to merge 2 commits into base: pre-sketch

@theory (Owner) commented Mar 21, 2024

Already published and in main, but making this PR for commentary.

Also strip newlines from HTML element attributes, as seen in the
`description` meta field in the headers, now that I'm including newlines
in descriptions.
@theory self-assigned this Mar 21, 2024
@theory changed the title from "Pgxn meta sketch" to "PGXN meta sketch" Mar 21, 2024

@MMeent left a comment

NOTE: This isn't meant as heavy criticism; it's mostly curious answer-seeking from someone not very familiar with how such a service operates.

Comment on lines +52 to +53
distributed together. Packages may be downloaded directly from version
control repositories or in [archive files] generated by a release tag.

Could you expand on what you mean by the "version control repositories", and how you expect to download these packages from there?

I'd assume that you only want to distribute (and include) pre-built extensions using this metadata system, and I'd think that this would be annoying to do in the same system that also hosts the code, if only because including all those binaries in the code versioning repository would be hell.

Owner Author

For how to get the packages from source repos, I'd take a hard look at borrowing how Go does it. But perhaps it won't be necessary if we can improve the tooling overall such that everyone just automatically sets up release pipelines to publish to PGXN.

I intend this as an expansion of the PGXN Meta Spec for source code distribution. The idea is, however, to support enough metadata that we can build tools that auto-generate binary packages for distribution. Those packages would have a different (derived?) metadata format.

I don't think Go is a good example to copy from in this part, as Go packages are cross-platform distributions that you're expected to compile yourself.
I doubt that'll be the case in general for PGXN-distributed packages- I'd expect this to be mostly pre-packaged data shipping.

Owner Author

PGXN source packages are source code. Binary packaging will be a different thing (hopefully also to be provided by PGXN). I'm a little confused what I'm omitting from my above explanations to clarify that, or what I might be misunderstanding about your point. :-(

I went into this with the understanding that this would be primarily for PGXN, which I believe to be a package manager for PostgreSQL extensions, based on the `pgxn install` line I've seen tossed around lately.

Additionally, an extension will still function when distributed as binaries without C/C++/Rust/Python sources (assuming it was built for the relevant platform), but not all target systems have the infrastructure to build these binaries from sources.

So my confusion seems to be about what PGXN is: it's not a binary package repository, but a source package repository.

Owner Author

Yes, that's right. A lot of the stuff in this proposal is designed to add metadata to facilitate the creation of binary packages, though.

Comment on lines +71 to +74
* **Source Distribution:** The contents of a single package bundled together
with [package metadata](#package-metadata) into distributable archive
file, usually named with the last part of the package path or the main
extension, a dash, and the version, e.g., `pgtap-1.14.3.zip`.

As a "source distribution", shouldn't this include the sources of the package?

Owner Author

Yes, like what gets published to PGXN today. I don't talk about the contents so much, as this document is about the metadata, but perhaps at some point we should get more into designing a source distribution format like Python defines. Today it's mostly driven by the needs of PGXS or pgrx.

Comment on lines +80 to +82
* **Release:** A single instance of a package and version published on PGXN,
expressed as the package path, an at sign, and the [semver]. Example:
`github.com/theory/pgtap@v1.14.3`.

"a single instance of a package and version" implies that there can be more instances of a "package and version".

I'd probably cover this as

**Release:** A single version of the package made available to the public on PGXN, expressed as [...]. One package's release can have different packages for different /release channels/.
Example: [...]

Owner Author

Done in 752cb61.
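
For illustration only, here is a minimal Python sketch of how a client might split a release identifier of the form quoted above (package path, an at sign, and the semver) and derive the usual archive name. The names `Release`, `parse_release`, and `archive_name` are hypothetical, the semver pattern is a simplified approximation, and dropping the leading `v` is an assumption based on the `pgtap-1.14.3.zip` example rather than anything the draft specifies.

```python
import re
from typing import NamedTuple


class Release(NamedTuple):
    package_path: str
    version: str  # semver without the leading "v" (assumption)


# Simplified semver: MAJOR.MINOR.PATCH plus optional pre-release and build parts.
_SEMVER = re.compile(r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$")


def parse_release(identifier: str) -> Release:
    """Split 'path@vX.Y.Z' at the last at sign and validate the version."""
    path, sep, version = identifier.rpartition("@")
    if not sep or not path:
        raise ValueError(f"expected <package path>@<semver>, got {identifier!r}")
    if version.startswith("v"):
        version = version[1:]
    if not _SEMVER.match(version):
        raise ValueError(f"not a semantic version: {version!r}")
    return Release(path, version)


def archive_name(release: Release) -> str:
    """Last part of the package path, a dash, and the version."""
    last_part = release.package_path.rsplit("/", 1)[-1]
    return f"{last_part}-{release.version}.zip"


release = parse_release("github.com/theory/pgtap@v1.14.3")
print(release.package_path, release.version, archive_name(release))
```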

Comment on lines +55 to +59
* **Package Path:** Identifies a package, declared in the [package
metadata](#package-metadata) file. A package path should describe both
what the package does and where to find it. Typically, a package path
consists of a repository root path --- the directory that contains the
metadata file --- and a directory within the repository.

How do you distinguish the repository root path in the Package Path? Must there be only a single directory level under the repository root path to get to the package?

> A package path should describe both what the package does and where to find it

I don't think we want package descriptions in package paths.

Owner Author
@theory Mar 21, 2024

I'm borrowing from Go's definitions (but s/module/extension/g). It just means they should be somewhat descriptive and not opaque, but a lot of people use funny names anyway.

Comment on lines +107 to +108
* **Maintainer**: List of maintainers, each an object with `name` and either
`email` or `url` (or both)

I'd use plural maintainers here. Also, I'm not sure I agree with requiring email or url for all maintainers.
While this is seemingly a useful way to contact maintainers, many projects have a public issue tracker that's better as a point of contact, and archiving these email addresses/URLs isn't exactly great when considering things like GDPR or CCPA.

Owner Author

Done in 752cb61.

Comment on lines +424 to +427
"linux": [ "amd64", "arm64" ],
"darwin": [ "amd64", "arm64" ],
"windows": [ "amd64" ],
"freebsd": [ "amd64" ]

I think this needs more care for external dependencies: Debian dependencies are often named differently from those in the Red Hat family, which are different again from those in SUSE, BSD, etc.

Owner Author

I'm kind of leaving open for now how to specify dependencies that have all sorts of different names in different places, but reference a few leads later in the doc.

This bit you've highlighted isn't packages, though, but hardware architectures.

Comment on lines +488 to +490
* Is `pipeline` really necessary, given configure requirements? I think so,
because it tells the client the preferred build system to use, in case it
can't detect it for some reason.

Curiosity: Why is this pipeline here? Isn't this metadata for packaged packages, not to-be-packaged packages?

Owner Author

Sorry, I don't understand the question. But the point of `pipeline` is so that a client that downloads one of these source packages knows what build pipeline to use to build it (including compilation, etc.).

"downloads": 20
},
"ratings.example.com": {
"stags": {

Suggested change
- "stags": {
+ "stats": {

Owner Author

Fixed in 71f6c7b.


* The `aggregates` section aggregates results from multiple sources, for
example summing all downloads or averaging ratings. The list of items to
aggregate could evolve regularly.

How does the system check which items to aggregate, and what aggregate to choose? I could see reasons to use any of a weighted average, mean, median, weighted median, sum, min, max, etc.

Owner Author

Dunno, this is a forward-thinking bit of design I haven't really thought through, yet. I expect to build it up incrementally, though, perhaps just starting with download stats aggregated for 30, 90, and 365 days, as well as all-time.
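
A rough sketch of what that first increment could look like, assuming download counts are available per day. The function name `aggregate_downloads`, the window labels, and the input shape are all hypothetical; nothing here is defined by the draft.

```python
from datetime import date, timedelta

# Hypothetical window labels mapped to their length in days.
WINDOWS = {"30d": 30, "90d": 90, "365d": 365}


def aggregate_downloads(daily, today):
    """Sum per-day download counts over trailing windows and all time.

    `daily` maps a date to the number of downloads recorded on that day.
    """
    totals = {"all_time": sum(daily.values())}
    for label, days in WINDOWS.items():
        cutoff = today - timedelta(days=days)
        # Count days strictly after the cutoff, up to and including today.
        totals[label] = sum(n for day, n in daily.items() if cutoff < day <= today)
    return totals


sample = {
    date(2024, 3, 21): 5,
    date(2024, 3, 1): 3,
    date(2024, 1, 1): 2,
    date(2023, 1, 1): 10,
}
print(aggregate_downloads(sample, today=date(2024, 3, 21)))
```

Summing is only one choice; the same loop could compute a mean or median per window if a source reports ratings rather than counts.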

* Each key in `sources` identifies a trusted downstream source of
information. Each would have its own schema describing its objects and
their meaning, along with URI templates to link to. For example,
`stats.example.com` might have these templates:

Does this consider summation loops in this federated network of statistics? Assuming anyone can run a PGXN node, of course.

Owner Author

Sorry, I don't really follow, can you explain what "summation loops" means?

I'd like to come up with a federation model, but admit I wasn't really thinking about it here.

Basically, there needs to be a way to distinguish locally measured stats in a way that's clearly distinct from federated stats, and distinct from the aggregate.

With summation loops, I basically mean this:

Assume mirror-nl.pgxn.org has "downloads": 30, and federates with primary.pgxn.org, which has measured 40 downloads locally.

Now mirror-nl updates its data from the primary and notices that the primary has 40 downloads. Aggregated with its own downloads, that gives 70 downloads, which it then publishes.

The primary then pulls the data from mirror-nl for its download statistics and notices that mirror-nl has an aggregate of 70 downloads. Added to its own pool of 40, that adds up to 110 total downloads.

After another sync, mirror-nl notices it has to update its aggregate, because the primary now advertises 110 downloads, or 70 more than its previous 40. Etc. Etc.
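
To make the loop concrete, here is a minimal Python simulation of the scenario above, followed by one possible way to avoid it. The node names and counts come from the example; the function names `naive_sync`, `merge`, and `keyed_sync` are hypothetical. The fix shown is not something proposed in this thread: it keeps locally measured counts keyed by the node that measured them and merges by per-origin maximum, which is essentially a grow-only counter (G-Counter).

```python
def naive_sync(rounds):
    """Each node adds the peer's published total to its own local count."""
    local = {"primary": 40, "mirror-nl": 30}
    published = dict(local)  # before any sync, each node publishes its local count
    history = []
    for _ in range(rounds):
        # mirror-nl pulls the primary's published figure and republishes the sum.
        published["mirror-nl"] = local["mirror-nl"] + published["primary"]
        # The primary then pulls mirror-nl's published figure and does the same.
        published["primary"] = local["primary"] + published["mirror-nl"]
        history.append((published["mirror-nl"], published["primary"]))
    return history


def merge(mine, theirs):
    """Merge two origin -> count maps, keeping the larger count per origin."""
    return {
        origin: max(mine.get(origin, 0), theirs.get(origin, 0))
        for origin in mine.keys() | theirs.keys()
    }


def keyed_sync(rounds):
    """Nodes exchange per-origin local counts; the aggregate is derived on read."""
    views = {"primary": {"primary": 40}, "mirror-nl": {"mirror-nl": 30}}
    for _ in range(rounds):
        views["mirror-nl"] = merge(views["mirror-nl"], views["primary"])
        views["primary"] = merge(views["primary"], views["mirror-nl"])
    return {node: sum(view.values()) for node, view in views.items()}


print(naive_sync(2))  # totals keep growing with every sync
print(keyed_sync(3))  # totals settle at 70 on both nodes
```

Because the aggregate is never stored or republished, repeated syncs cannot inflate it, and the locally measured, federated, and aggregate figures stay distinguishable.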

Owner Author

Oooh. I hadn't even thought about the aggregation being bidirectional, but of course people would want that, if things are successful. Not an immediate goal, I think, because you're right, the technical infrastructure to prevent these kinds of loops will need some careful thought.

2 participants