Initial implementation #16

pradyunsg · 2020-06-17T23:06:47Z

Built on top of #12

Filing early to get inputs on this design / approach. As of right now, the design has 4 interacting components: the Installer class encapsulates all the installation logic, interacts with all the other abstractions and is the main entry point. WheelSource and Destination, which do exactly what their name suggests. And, finally, there's the coolest kids on the block, Validators.

WheelSource represents a wheel file (like a .whl file on disk). It provides random access to dist-info files, and sequential access to all the files within the wheel. This lets Installer read the metadata files it needs (only WHEEL), any potential validators to read other metadata (like RECORD files), while making it straightforward to consume the contents of the wheel.

Destination handles all the file-writing stuff (so all the I/O) and has the job of rewriting the RECORD file appropriately.

Validators are... Callable[[WheelSource], None] which can raise an error to signal not-valid. That's it.

One of the quirks of this approach/design, is that we don't actually "unpack" the wheel root into the corresponding scheme, but instead figure out the correct place to put the files the moment we see them. This lets us simplify the Destination API, and avoids an extra step (potentially reducing I/O operations).

Two things that I am wondering:

Maybe we should have some way to preserve stat(...).st_mtime from WheelSource to Destination?
Is there a better abstraction to pass around than BufferedReader?

pradyunsg · 2020-06-17T23:11:18Z

The most important files are likely protocols.py (open to naming suggestions!) and _core.py (I like this name, let's keep this as is).

There's a few commits from #12, which don't really add much noise to the overall PR due to the installer.{records -> parsers} renaming.

pradyunsg · 2020-06-17T23:13:25Z

Oh, and this code doesn't actually run, since there's missing imports and shitz. I'll polish all of that up, once it's clear whether this is an appropriate direction to invest effort into. :)

FFY00 · 2020-06-17T23:24:46Z

Overall looks good, one thing jumps out to me, why does the API user need to implement Destination.rewrite_record?

FFY00 · 2020-06-17T23:27:38Z

Actually, I am not entirely sure how it is supposed to work. Does it check all files and fixes up RECORD?

pradyunsg · 2020-06-18T00:06:10Z

That's because only the Destination object has information about the "final" file paths, which needed for performing the rewrite of the RECORD file. I'm 100% open to figuring out a helper to simplify end-user implementations in the common case, but within the abstractions, I don't think the responsibility for rewriting the RECORD file can fit anywhere else.

pradyunsg · 2020-06-18T00:08:25Z

Does it check all files and fixes up RECORD?

See point 4 in "spread" on PEP 427's description of how to install a wheel: https://www.python.org/dev/peps/pep-0427/#installing-a-wheel-distribution-1-0-py32-none-any-whl

Basically, the path component of each record needs to be changed. This is worth mentioning in the rewrite_record's docstring.

FFY00 · 2020-06-18T01:02:19Z

That's because only the Destination object has information about the "final" file paths

Then we can ask the user to define a get_full_path method, which could then be used by our rewrite_record. As an API user, I don't want to care about RECORD rewriting logic. This is also one of those things that people can easily get wrong, we want to prevent that.

gaborbernat

No strong feelings about this. Would like to see some example uses (tests?), that can focus on the high-level interface. I think we established that virtualenv needs finer grain control 👍 than the scope of this library, so it doesn't block me in any way.

src/installer/_core.py

gaborbernat · 2020-06-18T09:02:25Z

src/installer/_core.py

+    Supports Wheel version 1.0 (PEP 427).
+    """
+
+    def __init__(self, name, validators):


We'll we need custom validators?

We can have them. The RECORD matches the file hashes check is also a validator (i.e. opt in).

src/installer/parsers.py

pradyunsg · 2020-06-18T09:19:54Z

I think we established that virtualenv needs finer grain control 👍 than the scope of this library, so it doesn't block me in any way.

Hmm... I'm not sure I follow. I thought that this would be able to cover the use case for virtualenv, since both source and destination are decoupled completely, and don't make assumptions about how the source is provided.

gaborbernat · 2020-06-18T09:38:35Z

Hmm... I'm not sure I follow. I thought that this would be able to cover the use case for virtualenv, since both source and destination are decoupled completely, and don't make assumptions about how the source is provided.

What do you define as a source? virtualenv does not install things straight from wheel... it rather generates pre-baked images on the disk and install is then essentially a copy/symlink. Furthermore, it still needs to trigger the console script generation in this case post copy/symlink. As I said this is highly custom and not sure we can/should force these restrictions on this library.

Additionally, virtualenv has probably the least aggressive python 2 deprecation policy, we'll probably support Python 2 until PyPy does it, and that's undefined for now.

FFY00 · 2020-06-18T09:53:07Z

What do you define as a source? virtualenv does not install things straight from wheel... it rather generates pre-baked images on the disk and install is then essentially a copy/symlink.

You can subclass Destination and abstract all that, in fact, you are expected to. In my case, in python-install I unpack the wheel to the filesystem and work from there. Does this still not work in your use-case? If so, why?

Furthermore, it still needs to trigger the console script generation in this case post copy/symlink. As I said this is highly custom and not sure we can/should force these restrictions on this library.

Script generation is optional and as such, should be provided independently from the install step. We might be able to come up with a model that works for you, but if we don't, you can still implement your own custom version.

pradyunsg · 2020-06-18T09:54:00Z

Then we can ask the user to define a get_full_path method, which could then be used by our rewrite_record. As an API user, I don't want to care about RECORD rewriting logic. This is also one of those things that people can easily get wrong, we want to prevent that.

Indeed!

I just spent a bit of time thinking, and I think it's possible to reduce Destination to a single method, by making write_file return the path that's supposed to go into the rewritten RECORD.

Since classes with a method are better represented as a function, I think we can reduce Destination down to a function.

FFY00 · 2020-06-18T09:58:42Z

Since classes with a method are better represented as a function, I think we can reduce Destination down to a function.

I might want to save state, but in that case, I can implement a class with __call__.

pradyunsg · 2020-06-18T10:01:09Z

I just spent a bit of time thinking, and I think it's possible to reduce Destination to a single method, by making write_file return the path that's supposed to go into the rewritten RECORD.

It's not. :(

rewrite_record also serves as "the last operation". I do think that the earlier plan, to have a RecordKeeper class that Destination classes can use to keep track of how to write a RECORD is sufficient. Maybe, what we want to do is a mix of the two, with RecordKeeper being inside Installer, write_file returning the destination path, and being passed into rewrite_record?

pradyunsg · 2020-06-18T10:06:20Z

What do you define as a source?

For virtualenv, it'd be based off the prebaked image.

install is then essentially a copy/symlink.

This can be what is done by the Destination object. Since the value from WheelSource.iter_files is passed straight into Destination.write_file, it's possible to use a object that holds the fioe descriptor, and performing the copy with it.

Furthermore, it still needs to trigger the console script generation in this case post copy/symlink.

This is completely controlled by the Destination class, which handles all the script generation.

As I said this is highly custom and not sure we can/should force these restrictions on this library.

Thanks for elaborating on this. I think this API, as it stands, does accommodates for virtualenv's needs. I'd appreciate it if you could take a look at this, and see if/how that might not be the case.

gaborbernat · 2020-06-18T10:38:12Z

Thanks for elaborating on this. I think this API, as it stands, does accommodates for virtualenv's needs. I'd appreciate it if you could take a look at this, and see if/how that might not be the case.

I'd have to give a try to use it. I'll do it at some point but unlikely I'll have time this within the next week.

pradyunsg · 2020-06-18T11:14:05Z

RecordKeeper being inside Installer, write_file returning the destination path, and being passed into rewrite_record

I think I like this. It makes it straightforward enough for any of the consumers to depend on this library (return the paths from write_file, use the object passed into rewrite_record) while eliminating the need to determine file paths.

I do wonder how we'll map the hashes / sizes to the files, so that's certainly something for me to think about. I'll iterate on this with a piece of paper, later, to see what I can come up with.

pradyunsg · 2020-06-18T12:51:57Z

Cool, lemme know if we're not going to use this for virtualenv, since there might be simplifications possible if we're not going to be trying to accommodate for virtualenv's fairly different approach to dealing with things.

agronholm · 2020-06-18T13:16:49Z

Maybe we should have some way to preserve stat(...).st_mtime from WheelSource to Destination?

If we don't, I see little point in recording the actual timestamps in the wheel file either.

pradyunsg · 2020-06-18T13:33:26Z

Fair enough. I'd imagine folks who care about build reproducibility definitely care.

100% open to suggestions on how to do this, since I'm actually not sure how to do this.

agronholm · 2020-06-18T19:58:02Z

Reproducibility.

How would it be less reproducible if we simply used a default timestamp, say 1970-01-01T00:00:00+00:00?

src/installer/_core.py

src/installer/protocols.py

brettcannon · 2020-06-18T21:32:24Z

src/installer/protocols.py

+
+    # All files in the wheel
+    @abc.abstractmethod
+    def iter_files(self):


I will say I dislike iter_*() API names as it doesn't communicate what the method actually does, just what you are expected to do with it or what the turn type is depending if you read that as shorthand as iterate or iterator, respectively. I personally would be fine with files() and simply document it returns an iterator.

Personally I like having iter_ since the fact that an iterator is returned is the second most important part of what the function does (the first is it iterators through files). Python does not offer any way to guard against iterator reuse, and the only thing to prevent accidental usage is to stick something into the name. Type-checking does not help here because it would happily accept usages like

def install_files(files: Iterable[File]): ... files = source.files() logger.debug('Processing %s', [path for path, _ in files]) install_files(files)

I won’t speak for others, but I can’t tell whether this code is correct or not in a quick code review. This works if files() returns a list, but not if it returns an iterator. Maybe I should call files() twice? But that wouldn make things worse if the function implies significant performance impact. I can go through this thought process every time I see this, and potentially dig into the documentation or source code. Or I can easily avoid all the mental work by simply adding iter_ to the function name, with a slight aesthetics cost. The trade-off is worth it IMO.

I'm gonna have to change this for timestamp + record-keeping reasons.

For now, I'm think gonna call this get_contents, returning a Tuple[Record, BufferedReader, Optional[stat_result]]. I'm a bit concerned with using Record here directly, since @agronholm has noted that he wants to be able to make wheel's WheelFile capable of being a WheelSource.

So are we going to restore the modification time and file attributes present in the archive when the wheel is being installed? Sadly the wheel PEP does not say anything about this which I feel is a significant problem.

Looking at pip's logic, we unpack the zip with the following guarantees:

All files are written based on system defaults and umask (i.e. permissions are not preserved), except that regular file members with any execute permissions (user, group, or world) have "chmod +x" applied after being written. Note that for windows, any execute changes using os.chmod are no-ops per the python docs.

And then install from the unpacked wheel with, among other things:

# Copy over the metadata for the file, currently this only # includes the atime and mtime. st = os.stat(srcfile) if hasattr(os, "utime"): os.utime(destfile, (st.st_atime, st.st_mtime))

I'm very confused right now about what the potential details we'd need, since IIUC, pip's not preserving the stat() details during unpacking, but preserving them when moving the files (!).

I have so many questions:

Can we have empty folders in a wheel? If yes, how should they get installed?

What metadata about a file in a wheel, do we actually care about?

Executable bits?

Timestamps?

What properties from os.stat / ZipInfo?

Would people be OK with files being written with timestamp=0? (this would be a concrete Destination implementation thing, so I'm not too worried if the answer is yes; but if we're preserving timestamps, somebody explain to me how they think it should work)

The answers to these questions should be added to the wheel spec, by the way 🙂

Can we have empty folders in a wheel? If yes, how should they get installed?

No. There is a Dicourse thread proposing to amend this.

Would people be OK with files being written with timestamp=0?

This would cause pain when integrating wheels into timestamp-based incremental build systems (e.g. GNU Make). I think setting timestamp to 0 would be worse than what pip currently does (which sets the timestamp to whenever the wheel is installed, I believe?)

Can we have empty folders in a wheel? If yes, how should they get installed?

No

Well, I'm gonna implement status quo, so not implementing any support for this sounds good. :)

timestamp-based incremental build systems

To be clear, the situation would be that WheelSource objects wouldn't provide timestamp information, so Destination objects would set the timestamp based on whatever logic they want - I imagine pip would preserve its current approach and keep the time dependent on unpack time.

The question really should've been, do we care about preserving timestamps from a WheelSource when installing from it? I think the answer is no, so I'm going to move forward with that assumption.

To me it seems correct to leave the timestamp in newly created files alone. If pip indeed does this already, then that opens the door for wheel to use the zip default timestamp for all files (instead of copying the timestamp from the file system). This too should be clarified in the new PEP which @uranusjr is apparently writing now.

src/installer/protocols.py

There's TODOs in this, but I hope this is pretty close to where we want this to be.

Co-authored-by: Brett Cannon <brett@python.org>

agronholm · 2020-06-21T20:27:19Z

@pradyunsg So, how would you prefer us to align the APIs? The new WheelFile (and related) APIs are going to be Python 3 only, while this project seems to be designed with Python 2 compatibility for reasons I don't really understand. This makes it impossible for you to import my code directly.

pradyunsg · 2020-06-21T20:39:43Z

FWIW, I think wheel.WheelFile can just implement the WheelSource protocol and be usable with this library. The WheelSource API (by design) permits Python 3 users to use pathlib.Path if they want to, and wheel.WheelFile would then be a Py3-only WheelSource (which is 100% OK).

I think I need to add 2 things here:

"RECORD management" in Installer, so that rewrite_record gets a RecordKeeper object in addition to the scheme.
Add RECORD-entry + optional timestamp information in iter_files (which should probably get renamed).

The first doesn't affect the WheelSource API. The second does.

I don't want us to end up introducing a wheel -> installer dependency, so the WheelSource protocol shouldn't depend on anything else from installer. Gimme a day or so to iterate here, and figure out what's the best way to add these things, without making it difficult to implement the WheelSource protocol in wheel.

src/installer/records.py

pradyunsg · 2020-07-05T18:52:16Z

Note to self: need to allow for some mechanism to add "install-time metadata" like INSTALLER, REQUESTED, direct_url.json (PEP 610) and more in the future (like proposed HASH file) without hard-coding support for the specific files in the code. This would allow for addition of these files to be done transparently as the .dist-info directory's contents evolve.

brettcannon · 2020-08-26T00:20:31Z

What's next on this PR (and project)? More time? 😉

pradyunsg · 2020-08-26T03:55:25Z

More time?

Yea, pretty much.

ofek · 2020-10-04T19:41:10Z

Any update?

pradyunsg · 2020-10-11T11:03:50Z

I've blocked out a few hours in the coming week to pull this forward, including a (small) chunk of co-working time with @FFY00! :)

pradyunsg · 2020-12-01T01:25:34Z

A few notes:

I can't find the notes from my and @FFY00's conversation. I'd like to blame my broken MacBook. :)
there's a few tweaks to be made to the API at play here, and there's a need for some wrangling w.r.t. handling RECORD entries properly in this.
I have an open offer for folks to co-work with me on this (https://discuss.python.org/t/pep-517-workflow-for-distributions/5959/5?u=pradyunsg) which would help me concentrate on getting this over-the-line.
There are abstract interfaces that the "core" Installer class would see, and those would need concrete implementations. I expect the abstract interfaces to be subject to change until we hit "stable" on this project.
I would still like downstream folks to try to integrate with this -- I'll communicate more about this on Scope of this project and API design #1, once this is ready-enough to actually try using. :)

pradyunsg · 2020-12-14T17:41:47Z

Update: @lkisskol and I are working on getting the ball rolling again this week.

pradyunsg · 2020-12-17T21:31:18Z

Closing since this isn't going to be the primary PR where work is done. Please follow #1 instead. :)

pradyunsg mentioned this pull request Jun 17, 2020

Scope of this project and API design #1

Closed

pradyunsg requested review from brettcannon and gaborbernat June 17, 2020 23:15

gaborbernat reviewed Jun 18, 2020

View reviewed changes

pradyunsg mentioned this pull request Jun 18, 2020

Rework installer.records #12

Merged

pradyunsg mentioned this pull request Jun 18, 2020

Refactor WheelFile to have a public API pypa/wheel#357

Closed

brettcannon reviewed Jun 18, 2020

View reviewed changes

FFY00 mentioned this pull request Jun 19, 2020

Provide sample code for a very basic client #17

Closed

uranusjr reviewed Jun 21, 2020

View reviewed changes

src/installer/protocols.py Show resolved Hide resolved

uranusjr reviewed Jun 21, 2020

View reviewed changes

src/installer/protocols.py Show resolved Hide resolved

pradyunsg force-pushed the initial-implementation branch from 5edb8c6 to 69f5c53 Compare June 21, 2020 19:29

pradyunsg and others added 3 commits June 22, 2020 01:08

Add parsing function for METADATA and WHEEL

Verified

This commit was signed with the committer’s verified signature.

pradyunsg Pradyun Gedam

GPG key ID: DA17C4B29CB32E4B

Verified
Learn about vigilant mode

507284b

Checkin initial implementation

Verified

This commit was signed with the committer’s verified signature.

pradyunsg Pradyun Gedam

GPG key ID: DA17C4B29CB32E4B

Verified
Learn about vigilant mode

2a397b6

There's TODOs in this, but I hope this is pretty close to where we want this to be.

Apply suggestions from code review

Verified

This commit was signed with the committer’s verified signature.

pradyunsg Pradyun Gedam

GPG key ID: DA17C4B29CB32E4B

Verified
Learn about vigilant mode

1ef1210

Co-authored-by: Brett Cannon <brett@python.org>

pradyunsg force-pushed the initial-implementation branch from 69f5c53 to 1ef1210 Compare June 21, 2020 19:39

pradyunsg mentioned this pull request Jun 22, 2020

Document all the existing components + proper Sphinx documentation #19

Merged

agronholm reviewed Jun 22, 2020

View reviewed changes

src/installer/records.py Show resolved Hide resolved

pradyunsg mentioned this pull request Jun 23, 2020

Stop using os.linesep when reading RECORD with csv #20

Merged

pradyunsg mentioned this pull request Jul 3, 2020

Expect a plain list in get_csv_rows_for_installed pypa/pip#8533

Merged

pradyunsg closed this Dec 17, 2020

pradyunsg deleted the initial-implementation branch May 3, 2021 13:20

Initial implementation #16

Initial implementation #16

Conversation

pradyunsg commented Jun 17, 2020 • edited Loading

pradyunsg commented Jun 17, 2020

pradyunsg commented Jun 17, 2020

FFY00 commented Jun 17, 2020

FFY00 commented Jun 17, 2020

pradyunsg commented Jun 18, 2020 • edited Loading

pradyunsg commented Jun 18, 2020

FFY00 commented Jun 18, 2020

gaborbernat left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pradyunsg commented Jun 18, 2020

gaborbernat commented Jun 18, 2020

FFY00 commented Jun 18, 2020

pradyunsg commented Jun 18, 2020

FFY00 commented Jun 18, 2020

pradyunsg commented Jun 18, 2020

pradyunsg commented Jun 18, 2020 • edited Loading

gaborbernat commented Jun 18, 2020

pradyunsg commented Jun 18, 2020

pradyunsg commented Jun 18, 2020

agronholm commented Jun 18, 2020

pradyunsg commented Jun 18, 2020

agronholm commented Jun 18, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pradyunsg Jun 22, 2020 • edited by brettcannon Loading

Choose a reason for hiding this comment

pradyunsg Jun 22, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

agronholm commented Jun 21, 2020 • edited Loading

pradyunsg commented Jun 21, 2020 • edited Loading

pradyunsg commented Jul 5, 2020

brettcannon commented Aug 26, 2020

pradyunsg commented Aug 26, 2020

ofek commented Oct 4, 2020

pradyunsg commented Oct 11, 2020 • edited Loading

pradyunsg commented Dec 1, 2020

pradyunsg commented Dec 14, 2020

pradyunsg commented Dec 17, 2020

pradyunsg commented Jun 17, 2020 •

edited

Loading

pradyunsg commented Jun 18, 2020 •

edited

Loading

pradyunsg commented Jun 18, 2020 •

edited

Loading

agronholm commented Jun 18, 2020 •

edited

Loading

pradyunsg Jun 22, 2020 •

edited by brettcannon

Loading

pradyunsg Jun 22, 2020 •

edited

Loading

agronholm commented Jun 21, 2020 •

edited

Loading

pradyunsg commented Jun 21, 2020 •

edited

Loading

pradyunsg commented Oct 11, 2020 •

edited

Loading