Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation #16

Closed
wants to merge 3 commits into from
Closed

Initial implementation #16

wants to merge 3 commits into from

Conversation

pradyunsg
Copy link
Member

@pradyunsg pradyunsg commented Jun 17, 2020

Built on top of #12

Filing early to get inputs on this design / approach. As of right now, the design has 4 interacting components: the Installer class encapsulates all the installation logic, interacts with all the other abstractions and is the main entry point. WheelSource and Destination, which do exactly what their name suggests. And, finally, there's the coolest kids on the block, Validators.

WheelSource represents a wheel file (like a .whl file on disk). It provides random access to dist-info files, and sequential access to all the files within the wheel. This lets Installer read the metadata files it needs (only WHEEL), any potential validators to read other metadata (like RECORD files), while making it straightforward to consume the contents of the wheel.

Destination handles all the file-writing stuff (so all the I/O) and has the job of rewriting the RECORD file appropriately.

Validators are... Callable[[WheelSource], None] which can raise an error to signal not-valid. That's it.


One of the quirks of this approach/design, is that we don't actually "unpack" the wheel root into the corresponding scheme, but instead figure out the correct place to put the files the moment we see them. This lets us simplify the Destination API, and avoids an extra step (potentially reducing I/O operations).


Two things that I am wondering:

  • Maybe we should have some way to preserve stat(...).st_mtime from WheelSource to Destination?
  • Is there a better abstraction to pass around than BufferedReader?

@pradyunsg
Copy link
Member Author

The most important files are likely protocols.py (open to naming suggestions!) and _core.py (I like this name, let's keep this as is).

There's a few commits from #12, which don't really add much noise to the overall PR due to the installer.{records -> parsers} renaming.

@pradyunsg
Copy link
Member Author

Oh, and this code doesn't actually run, since there's missing imports and shitz. I'll polish all of that up, once it's clear whether this is an appropriate direction to invest effort into. :)

@FFY00
Copy link
Member

FFY00 commented Jun 17, 2020

Overall looks good, one thing jumps out to me, why does the API user need to implement Destination.rewrite_record?

@FFY00
Copy link
Member

FFY00 commented Jun 17, 2020

Actually, I am not entirely sure how it is supposed to work. Does it check all files and fixes up RECORD?

@pradyunsg
Copy link
Member Author

pradyunsg commented Jun 18, 2020

That's because only the Destination object has information about the "final" file paths, which needed for performing the rewrite of the RECORD file. I'm 100% open to figuring out a helper to simplify end-user implementations in the common case, but within the abstractions, I don't think the responsibility for rewriting the RECORD file can fit anywhere else.

@pradyunsg
Copy link
Member Author

Does it check all files and fixes up RECORD?

See point 4 in "spread" on PEP 427's description of how to install a wheel: https://www.python.org/dev/peps/pep-0427/#installing-a-wheel-distribution-1-0-py32-none-any-whl

Basically, the path component of each record needs to be changed. This is worth mentioning in the rewrite_record's docstring.

@FFY00
Copy link
Member

FFY00 commented Jun 18, 2020

That's because only the Destination object has information about the "final" file paths

Then we can ask the user to define a get_full_path method, which could then be used by our rewrite_record. As an API user, I don't want to care about RECORD rewriting logic. This is also one of those things that people can easily get wrong, we want to prevent that.

Copy link

@gaborbernat gaborbernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong feelings about this. Would like to see some example uses (tests?), that can focus on the high-level interface. I think we established that virtualenv needs finer grain control 👍 than the scope of this library, so it doesn't block me in any way.

src/installer/_core.py Show resolved Hide resolved
src/installer/_core.py Show resolved Hide resolved
Supports Wheel version 1.0 (PEP 427).
"""

def __init__(self, name, validators):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll we need custom validators?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have them. The RECORD matches the file hashes check is also a validator (i.e. opt in).

src/installer/parsers.py Outdated Show resolved Hide resolved
src/installer/parsers.py Outdated Show resolved Hide resolved
@pradyunsg
Copy link
Member Author

I think we established that virtualenv needs finer grain control 👍 than the scope of this library, so it doesn't block me in any way.

Hmm... I'm not sure I follow. I thought that this would be able to cover the use case for virtualenv, since both source and destination are decoupled completely, and don't make assumptions about how the source is provided.

@gaborbernat
Copy link

Hmm... I'm not sure I follow. I thought that this would be able to cover the use case for virtualenv, since both source and destination are decoupled completely, and don't make assumptions about how the source is provided.

What do you define as a source? virtualenv does not install things straight from wheel... it rather generates pre-baked images on the disk and install is then essentially a copy/symlink. Furthermore, it still needs to trigger the console script generation in this case post copy/symlink. As I said this is highly custom and not sure we can/should force these restrictions on this library.

Additionally, virtualenv has probably the least aggressive python 2 deprecation policy, we'll probably support Python 2 until PyPy does it, and that's undefined for now.

@FFY00
Copy link
Member

FFY00 commented Jun 18, 2020

What do you define as a source? virtualenv does not install things straight from wheel... it rather generates pre-baked images on the disk and install is then essentially a copy/symlink.

You can subclass Destination and abstract all that, in fact, you are expected to. In my case, in python-install I unpack the wheel to the filesystem and work from there. Does this still not work in your use-case? If so, why?

Furthermore, it still needs to trigger the console script generation in this case post copy/symlink. As I said this is highly custom and not sure we can/should force these restrictions on this library.

Script generation is optional and as such, should be provided independently from the install step. We might be able to come up with a model that works for you, but if we don't, you can still implement your own custom version.

@pradyunsg
Copy link
Member Author

Then we can ask the user to define a get_full_path method, which could then be used by our rewrite_record. As an API user, I don't want to care about RECORD rewriting logic. This is also one of those things that people can easily get wrong, we want to prevent that.

Indeed!

I just spent a bit of time thinking, and I think it's possible to reduce Destination to a single method, by making write_file return the path that's supposed to go into the rewritten RECORD.

Since classes with a method are better represented as a function, I think we can reduce Destination down to a function.

@FFY00
Copy link
Member

FFY00 commented Jun 18, 2020

Since classes with a method are better represented as a function, I think we can reduce Destination down to a function.

I might want to save state, but in that case, I can implement a class with __call__.

@pradyunsg
Copy link
Member Author

I just spent a bit of time thinking, and I think it's possible to reduce Destination to a single method, by making write_file return the path that's supposed to go into the rewritten RECORD.

It's not. :(

rewrite_record also serves as "the last operation". I do think that the earlier plan, to have a RecordKeeper class that Destination classes can use to keep track of how to write a RECORD is sufficient. Maybe, what we want to do is a mix of the two, with RecordKeeper being inside Installer, write_file returning the destination path, and being passed into rewrite_record?

@pradyunsg
Copy link
Member Author

pradyunsg commented Jun 18, 2020

What do you define as a source?

For virtualenv, it'd be based off the prebaked image.

install is then essentially a copy/symlink.

This can be what is done by the Destination object. Since the value from WheelSource.iter_files is passed straight into Destination.write_file, it's possible to use a object that holds the fioe descriptor, and performing the copy with it.

Furthermore, it still needs to trigger the console script generation in this case post copy/symlink.

This is completely controlled by the Destination class, which handles all the script generation.

As I said this is highly custom and not sure we can/should force these restrictions on this library.

Thanks for elaborating on this. I think this API, as it stands, does accommodates for virtualenv's needs. I'd appreciate it if you could take a look at this, and see if/how that might not be the case.

@gaborbernat
Copy link

Thanks for elaborating on this. I think this API, as it stands, does accommodates for virtualenv's needs. I'd appreciate it if you could take a look at this, and see if/how that might not be the case.

I'd have to give a try to use it. I'll do it at some point but unlikely I'll have time this within the next week.

@pradyunsg
Copy link
Member Author

RecordKeeper being inside Installer, write_file returning the destination path, and being passed into rewrite_record

I think I like this. It makes it straightforward enough for any of the consumers to depend on this library (return the paths from write_file, use the object passed into rewrite_record) while eliminating the need to determine file paths.

I do wonder how we'll map the hashes / sizes to the files, so that's certainly something for me to think about. I'll iterate on this with a piece of paper, later, to see what I can come up with.

@pradyunsg
Copy link
Member Author

Cool, lemme know if we're not going to use this for virtualenv, since there might be simplifications possible if we're not going to be trying to accommodate for virtualenv's fairly different approach to dealing with things.

@agronholm
Copy link

Maybe we should have some way to preserve stat(...).st_mtime from WheelSource to Destination?

If we don't, I see little point in recording the actual timestamps in the wheel file either.

@pradyunsg
Copy link
Member Author

Fair enough. I'd imagine folks who care about build reproducibility definitely care.

100% open to suggestions on how to do this, since I'm actually not sure how to do this.

@agronholm
Copy link

agronholm commented Jun 18, 2020

Reproducibility.

How would it be less reproducible if we simply used a default timestamp, say 1970-01-01T00:00:00+00:00?

src/installer/_core.py Outdated Show resolved Hide resolved
src/installer/_core.py Show resolved Hide resolved
src/installer/_core.py Outdated Show resolved Hide resolved
src/installer/_core.py Show resolved Hide resolved
src/installer/_core.py Show resolved Hide resolved
src/installer/protocols.py Outdated Show resolved Hide resolved
src/installer/protocols.py Outdated Show resolved Hide resolved

# All files in the wheel
@abc.abstractmethod
def iter_files(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will say I dislike iter_*() API names as it doesn't communicate what the method actually does, just what you are expected to do with it or what the turn type is depending if you read that as shorthand as iterate or iterator, respectively. I personally would be fine with files() and simply document it returns an iterator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I like having iter_ since the fact that an iterator is returned is the second most important part of what the function does (the first is it iterators through files). Python does not offer any way to guard against iterator reuse, and the only thing to prevent accidental usage is to stick something into the name. Type-checking does not help here because it would happily accept usages like

def install_files(files: Iterable[File]):
   ...

files = source.files()
logger.debug('Processing %s', [path for path, _ in files])
install_files(files)

I won’t speak for others, but I can’t tell whether this code is correct or not in a quick code review. This works if files() returns a list, but not if it returns an iterator. Maybe I should call files() twice? But that wouldn make things worse if the function implies significant performance impact. I can go through this thought process every time I see this, and potentially dig into the documentation or source code. Or I can easily avoid all the mental work by simply adding iter_ to the function name, with a slight aesthetics cost. The trade-off is worth it IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna have to change this for timestamp + record-keeping reasons.

For now, I'm think gonna call this get_contents, returning a Tuple[Record, BufferedReader, Optional[stat_result]]. I'm a bit concerned with using Record here directly, since @agronholm has noted that he wants to be able to make wheel's WheelFile capable of being a WheelSource.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So are we going to restore the modification time and file attributes present in the archive when the wheel is being installed? Sadly the wheel PEP does not say anything about this which I feel is a significant problem.

Copy link
Member Author

@pradyunsg pradyunsg Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at pip's logic, we unpack the zip with the following guarantees:

All files are written based on system defaults and umask (i.e. permissions are not preserved), except that regular file members with any execute permissions (user, group, or world) have "chmod +x" applied after being written. Note that for windows, any execute changes using os.chmod are no-ops per the python docs.

And then install from the unpacked wheel with, among other things:

                # Copy over the metadata for the file, currently this only
                # includes the atime and mtime.
                st = os.stat(srcfile)
                if hasattr(os, "utime"):
                    os.utime(destfile, (st.st_atime, st.st_mtime))

I'm very confused right now about what the potential details we'd need, since IIUC, pip's not preserving the stat() details during unpacking, but preserving them when moving the files (!).

Copy link
Member Author

@pradyunsg pradyunsg Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have so many questions:

  • Can we have empty folders in a wheel? If yes, how should they get installed?
  • What metadata about a file in a wheel, do we actually care about?
    • Executable bits?
    • Timestamps?
    • What properties from os.stat / ZipInfo?
  • Would people be OK with files being written with timestamp=0? (this would be a concrete Destination implementation thing, so I'm not too worried if the answer is yes; but if we're preserving timestamps, somebody explain to me how they think it should work)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answers to these questions should be added to the wheel spec, by the way 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have empty folders in a wheel? If yes, how should they get installed?

No. There is a Dicourse thread proposing to amend this.

Would people be OK with files being written with timestamp=0?

This would cause pain when integrating wheels into timestamp-based incremental build systems (e.g. GNU Make). I think setting timestamp to 0 would be worse than what pip currently does (which sets the timestamp to whenever the wheel is installed, I believe?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have empty folders in a wheel? If yes, how should they get installed?

No

Well, I'm gonna implement status quo, so not implementing any support for this sounds good. :)

timestamp-based incremental build systems

To be clear, the situation would be that WheelSource objects wouldn't provide timestamp information, so Destination objects would set the timestamp based on whatever logic they want - I imagine pip would preserve its current approach and keep the time dependent on unpack time.

The question really should've been, do we care about preserving timestamps from a WheelSource when installing from it? I think the answer is no, so I'm going to move forward with that assumption.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it seems correct to leave the timestamp in newly created files alone. If pip indeed does this already, then that opens the door for wheel to use the zip default timestamp for all files (instead of copying the timestamp from the file system). This too should be clarified in the new PEP which @uranusjr is apparently writing now.

src/installer/protocols.py Show resolved Hide resolved
src/installer/protocols.py Show resolved Hide resolved
pradyunsg and others added 3 commits June 22, 2020 01:08
There's TODOs in this, but I hope this is pretty close to where we want
this to be.
Co-authored-by: Brett Cannon <brett@python.org>
@agronholm
Copy link

agronholm commented Jun 21, 2020

@pradyunsg So, how would you prefer us to align the APIs? The new WheelFile (and related) APIs are going to be Python 3 only, while this project seems to be designed with Python 2 compatibility for reasons I don't really understand. This makes it impossible for you to import my code directly.

@pradyunsg
Copy link
Member Author

pradyunsg commented Jun 21, 2020

FWIW, I think wheel.WheelFile can just implement the WheelSource protocol and be usable with this library. The WheelSource API (by design) permits Python 3 users to use pathlib.Path if they want to, and wheel.WheelFile would then be a Py3-only WheelSource (which is 100% OK).

I think I need to add 2 things here:

  • "RECORD management" in Installer, so that rewrite_record gets a RecordKeeper object in addition to the scheme.
  • Add RECORD-entry + optional timestamp information in iter_files (which should probably get renamed).

The first doesn't affect the WheelSource API. The second does.

I don't want us to end up introducing a wheel -> installer dependency, so the WheelSource protocol shouldn't depend on anything else from installer. Gimme a day or so to iterate here, and figure out what's the best way to add these things, without making it difficult to implement the WheelSource protocol in wheel.

@pradyunsg
Copy link
Member Author

Note to self: need to allow for some mechanism to add "install-time metadata" like INSTALLER, REQUESTED, direct_url.json (PEP 610) and more in the future (like proposed HASH file) without hard-coding support for the specific files in the code. This would allow for addition of these files to be done transparently as the .dist-info directory's contents evolve.

@brettcannon
Copy link
Member

What's next on this PR (and project)? More time? 😉

@pradyunsg
Copy link
Member Author

More time?

Yea, pretty much.

@ofek
Copy link
Contributor

ofek commented Oct 4, 2020

Any update?

@pradyunsg
Copy link
Member Author

pradyunsg commented Oct 11, 2020

I've blocked out a few hours in the coming week to pull this forward, including a (small) chunk of co-working time with @FFY00! :)

@pradyunsg
Copy link
Member Author

A few notes:

  • I can't find the notes from my and @FFY00's conversation. I'd like to blame my broken MacBook. :)
  • there's a few tweaks to be made to the API at play here, and there's a need for some wrangling w.r.t. handling RECORD entries properly in this.
  • I have an open offer for folks to co-work with me on this (https://discuss.python.org/t/pep-517-workflow-for-distributions/5959/5?u=pradyunsg) which would help me concentrate on getting this over-the-line.
  • There are abstract interfaces that the "core" Installer class would see, and those would need concrete implementations. I expect the abstract interfaces to be subject to change until we hit "stable" on this project.
  • I would still like downstream folks to try to integrate with this -- I'll communicate more about this on Scope of this project and API design #1, once this is ready-enough to actually try using. :)

@pradyunsg
Copy link
Member Author

Update: @lkisskol and I are working on getting the ball rolling again this week.

@pradyunsg
Copy link
Member Author

Closing since this isn't going to be the primary PR where work is done. Please follow #1 instead. :)

@pradyunsg pradyunsg closed this Dec 17, 2020
@pradyunsg pradyunsg deleted the initial-implementation branch May 3, 2021 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants