Initial implementation #16
Conversation
The most important files are likely There's a few commits from #12, which don't really add much noise to the overall PR due to the |
Oh, and this code doesn't actually run, since there's missing imports and shitz. I'll polish all of that up, once it's clear whether this is an appropriate direction to invest effort into. :) |
Overall looks good, one thing jumps out to me, why does the API user need to implement |
Actually, I am not entirely sure how it is supposed to work. Does it check all files and fixes up |
That's because only the |
See point 4 in "spread" on PEP 427's description of how to install a wheel: https://www.python.org/dev/peps/pep-0427/#installing-a-wheel-distribution-1-0-py32-none-any-whl Basically, the path component of each record needs to be changed. This is worth mentioning in the |
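As a rough illustration of the RECORD rewriting that PEP 427 describes, here is a minimal sketch. The helper name `rewrite_record` and the `path_map` argument (wheel path to installed path) are assumptions made for this example, not this PR's API.

```python
import csv
import io


def rewrite_record(record_text, path_map):
    """Return RECORD text with each row's path component replaced.

    path_map is a hypothetical dict: path inside the wheel -> installed path.
    Rows whose path is not in the map are kept unchanged.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(record_text)):
        if not row:
            continue  # skip blank lines
        path, digest, size = row
        writer.writerow([path_map.get(path, path), digest, size])
    return out.getvalue()


record = (
    "pkg/__init__.py,sha256=abc,10\n"
    "pkg-1.0.data/scripts/tool,sha256=def,20\n"
    "pkg-1.0.dist-info/RECORD,,\n"
)
# Files under the .data directory are "spread" elsewhere, so their paths change.
new_record = rewrite_record(
    record, {"pkg-1.0.data/scripts/tool": "../../../bin/tool"}
)
print(new_record)
```

Only the path column changes; the hash and size columns are carried over untouched.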
Then we can ask the user to define a |
No strong feelings about this. Would like to see some example uses (tests?), that can focus on the high-level interface. I think we established that virtualenv needs finer grain control 👍 than the scope of this library, so it doesn't block me in any way.
Supports Wheel version 1.0 (PEP 427). | ||
""" | ||
|
||
def __init__(self, name, validators): |
Will we need custom validators?
We can have them. The check that RECORD matches the file hashes is also a validator (i.e. opt-in).
Hmm... I'm not sure I follow. I thought that this would be able to cover the use case for virtualenv, since both source and destination are decoupled completely, and don't make assumptions about how the source is provided. |
What do you define as a source? virtualenv does not install things straight from wheel... it rather generates pre-baked images on the disk and install is then essentially a copy/symlink. Furthermore, it still needs to trigger the console script generation in this case post copy/symlink. As I said this is highly custom and not sure we can/should force these restrictions on this library. Additionally, virtualenv has probably the least aggressive python 2 deprecation policy, we'll probably support Python 2 until PyPy does it, and that's undefined for now. |
You can subclass
Script generation is optional and as such, should be provided independently from the install step. We might be able to come up with a model that works for you, but if we don't, you can still implement your own custom version. |
Indeed! I just spent a bit of time thinking, and I think it's possible to reduce Destination to a single method, by making write_file return the path that's supposed to go into the rewritten RECORD. Since classes with a method are better represented as a function, I think we can reduce Destination down to a function. |
I might want to save state, but in that case, I can implement a class with |
It's not. :( rewrite_record also serves as "the last operation". I do think that the earlier plan, to have a RecordKeeper class that Destination classes can use to keep track of how to write a RECORD is sufficient. Maybe, what we want to do is a mix of the two, with RecordKeeper being inside Installer, write_file returning the destination path, and being passed into rewrite_record? |
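To make the "mix of the two" idea concrete, here is a minimal sketch under stated assumptions: `InMemoryDestination` and `install` are hypothetical stand-ins, `write_file` returns the destination path, and the collected paths are passed into `rewrite_record` as the final operation. It is not the PR's actual implementation.

```python
import io


class InMemoryDestination:
    """Hypothetical Destination that stores files in a dict, not on disk."""

    def __init__(self):
        self.files = {}
        self.record = None

    def write_file(self, path, stream):
        # The destination decides where the file goes and reports that
        # path back, so the caller never has to work out file paths.
        dest_path = "site-packages/" + path
        self.files[dest_path] = stream.read()
        return dest_path

    def rewrite_record(self, entries):
        # Called last, so it can also act as the "finalize" step.
        self.record = list(entries)


def install(source_files, destination):
    """source_files: iterable of (path, binary stream) pairs."""
    entries = []  # plays the role of a RecordKeeper living inside Installer
    for path, stream in source_files:
        entries.append(destination.write_file(path, stream))
    destination.rewrite_record(entries)


dest = InMemoryDestination()
install([("pkg/__init__.py", io.BytesIO(b"x = 1\n"))], dest)
print(dest.record)
```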
For virtualenv, it'd be based off the prebaked image.
This can be done by the Destination object. Since the value from WheelSource.iter_files is passed straight into Destination.write_file, it's possible to use an object that holds the file descriptor, and perform the copy with it.
This is completely controlled by the Destination class, which handles all the script generation.
Thanks for elaborating on this. I think this API, as it stands, does accommodate virtualenv's needs. I'd appreciate it if you could take a look at this, and see if/how that might not be the case.
I'd have to try using it. I'll do it at some point, but it's unlikely I'll have time for this within the next week.
I think I like this. It makes it straightforward enough for any of the consumers to depend on this library (return the paths from write_file, use the object passed into rewrite_record) while eliminating the need to determine file paths. I do wonder how we'll map the hashes / sizes to the files, so that's certainly something for me to think about. I'll iterate on this with a piece of paper, later, to see what I can come up with. |
Cool, lemme know if we're not going to use this for virtualenv, since there might be simplifications possible if we're not going to be trying to accommodate for virtualenv's fairly different approach to dealing with things. |
If we don't, I see little point in recording the actual timestamps in the wheel file either. |
Fair enough. I'd imagine folks who care about build reproducibility definitely care. 100% open to suggestions on how to do this, since I'm actually not sure how to do this. |
How would it be less reproducible if we simply used a default timestamp, say |
|
||
# All files in the wheel | ||
@abc.abstractmethod | ||
def iter_files(self): |
I will say I dislike iter_*() API names, as they don't communicate what the method actually does, just what you are expected to do with it or what the return type is, depending on whether you read iter as shorthand for iterate or iterator, respectively. I personally would be fine with files() and simply documenting that it returns an iterator.
Personally I like having iter_ since the fact that an iterator is returned is the second most important part of what the function does (the first is that it iterates through files). Python does not offer any way to guard against iterator reuse, and the only thing to prevent accidental usage is to stick something into the name. Type-checking does not help here because it would happily accept usages like:
def install_files(files: Iterable[File]):
    ...
files = source.files()
logger.debug('Processing %s', [path for path, _ in files])
install_files(files)
I won’t speak for others, but I can’t tell whether this code is correct or not in a quick code review. This works if files() returns a list, but not if it returns an iterator. Maybe I should call files() twice? But that would make things worse if the function has a significant performance cost. I can go through this thought process every time I see this, and potentially dig into the documentation or source code. Or I can easily avoid all the mental work by simply adding iter_ to the function name, at a slight aesthetic cost. The trade-off is worth it IMO.
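The reuse hazard described in this comment can be reproduced in a few lines. This is only a sketch: `files_as_iterator` is a hypothetical stand-in for a `files()` method that returns an iterator.

```python
def files_as_iterator():
    # Hypothetical stand-in for source.files() returning an iterator.
    yield ("pkg/__init__.py", b"")
    yield ("pkg/core.py", b"")


files = files_as_iterator()
logged = [path for path, _ in files]     # the debug log consumes the iterator
installed = [path for path, _ in files]  # so nothing is left to "install"

print(logged)
print(installed)
```

The second pass silently sees no files, and no exception is raised, which is why the bug is easy to miss in review.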
I'm gonna have to change this for timestamp + record-keeping reasons.
For now, I think I'm gonna call this get_contents, returning a Tuple[Record, BufferedReader, Optional[stat_result]]. I'm a bit concerned with using Record here directly, since @agronholm has noted that he wants to be able to make wheel's WheelFile capable of being a WheelSource.
So are we going to restore the modification time and file attributes present in the archive when the wheel is being installed? Sadly the wheel PEP does not say anything about this which I feel is a significant problem.
Looking at pip's logic, we unpack the zip with the following guarantees:
All files are written based on system defaults and umask (i.e. permissions are not preserved), except that regular file members with any execute permissions (user, group, or world) have "chmod +x" applied after being written. Note that for windows, any execute changes using os.chmod are no-ops per the python docs.
And then install from the unpacked wheel with, among other things:
# Copy over the metadata for the file, currently this only
# includes the atime and mtime.
st = os.stat(srcfile)
if hasattr(os, "utime"):
    os.utime(destfile, (st.st_atime, st.st_mtime))
I'm very confused right now about which details we'd potentially need, since IIUC, pip's not preserving the stat() details during unpacking, but preserving them when moving the files (!).
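For reference, the "only the execute bit survives unpacking" guarantee quoted above could look roughly like the following. This is a sketch of the general technique using the standard zipfile module, not pip's actual code; the helper name `extract_member` is made up for this example.

```python
import os
import stat
import tempfile
import zipfile


def extract_member(zf, info, dest_dir):
    """Write one archive member, keeping only the executable bit."""
    target = os.path.join(dest_dir, info.filename)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with zf.open(info) as src, open(target, "wb") as dst:
        dst.write(src.read())
    # The upper 16 bits of external_attr hold the Unix mode, if recorded.
    mode = info.external_attr >> 16
    if mode and stat.S_ISREG(mode) and mode & 0o111:
        # Equivalent of "chmod +x" on top of the umask-derived permissions.
        os.chmod(target, os.stat(target).st_mode | 0o111)
    return target


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "demo.zip")
    member = zipfile.ZipInfo("bin/tool")
    member.external_attr = (stat.S_IFREG | 0o755) << 16
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(member, "#!/bin/sh\n")
    with zipfile.ZipFile(archive) as zf:
        path = extract_member(zf, zf.getinfo("bin/tool"), os.path.join(tmp, "out"))
    with open(path, "rb") as fh:
        extracted = fh.read()
    is_executable = os.access(path, os.X_OK)
```

Timestamps are deliberately not touched here, so the written file gets whatever time the filesystem assigns at extraction.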
I have so many questions:
- Can we have empty folders in a wheel? If yes, how should they get installed?
- What metadata about a file in a wheel, do we actually care about?
  - Executable bits?
  - Timestamps?
  - What properties from os.stat/ZipInfo?
- Would people be OK with files being written with timestamp=0? (this would be a concrete Destination implementation thing, so I'm not too worried if the answer is yes; but if we're preserving timestamps, somebody explain to me how they think it should work)
The answers to these questions should be added to the wheel spec, by the way 🙂
Can we have empty folders in a wheel? If yes, how should they get installed?
No. There is a Discourse thread proposing to amend this.
Would people be OK with files being written with timestamp=0?
This would cause pain when integrating wheels into timestamp-based incremental build systems (e.g. GNU Make). I think setting timestamp to 0 would be worse than what pip currently does (which sets the timestamp to whenever the wheel is installed, I believe?)
Can we have empty folders in a wheel? If yes, how should they get installed?
No
Well, I'm gonna implement status quo, so not implementing any support for this sounds good. :)
timestamp-based incremental build systems
To be clear, the situation would be that WheelSource objects wouldn't provide timestamp information, so Destination objects would set the timestamp based on whatever logic they want - I imagine pip would preserve its current approach and keep the time dependent on unpack time.
The question really should've been, do we care about preserving timestamps from a WheelSource when installing from it? I think the answer is no, so I'm going to move forward with that assumption.
To me it seems correct to leave the timestamp in newly created files alone. If pip indeed does this already, then that opens the door for wheel to use the zip default timestamp for all files (instead of copying the timestamp from the file system). This too should be clarified in the new PEP which @uranusjr is apparently writing now.
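A small sketch of what "use the zip default timestamp for all files" could mean on the wheel-building side, using only the standard zipfile module. The function name `build_archive` is an assumption for illustration; this is not how the wheel project currently builds archives.

```python
import hashlib
import io
import zipfile


def build_archive(files):
    """Build a zip in memory, giving every member ZipInfo's default timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in sorted(files):
            # ZipInfo's date_time defaults to (1980, 1, 1, 0, 0, 0), so no
            # filesystem timestamp leaks into the archive.
            info = zipfile.ZipInfo(name)
            zf.writestr(info, files[name])
    return buf.getvalue()


files = {"pkg/__init__.py": b"x = 1\n"}
first = hashlib.sha256(build_archive(files)).hexdigest()
second = hashlib.sha256(build_archive(files)).hexdigest()
print(first == second)
```

Because nothing in the archive depends on when it was built, two builds of the same content produce identical bytes.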
5edb8c6 to 69f5c53
There's TODOs in this, but I hope this is pretty close to where we want this to be.
Co-authored-by: Brett Cannon <brett@python.org>
69f5c53 to 1ef1210
@pradyunsg So, how would you prefer us to align the APIs? The new WheelFile (and related) APIs are going to be Python 3 only, while this project seems to be designed with Python 2 compatibility for reasons I don't really understand. This makes it impossible for you to import my code directly. |
FWIW, I think I need to add 2 things here:
The first doesn't affect the WheelSource API. The second does. I don't want us to end up introducing a wheel -> installer dependency, so the |
Note to self: need to allow for some mechanism to add "install-time metadata" like INSTALLER, REQUESTED, direct_url.json (PEP 610) and more in the future (like proposed HASH file) without hard-coding support for the specific files in the code. This would allow for addition of these files to be done transparently as the .dist-info directory's contents evolve. |
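One possible shape for that mechanism, sketched under assumptions: the caller hands over a plain mapping of dist-info file names to contents, so the library never needs to know about specific files. The function name `additional_metadata` and its parameters are hypothetical.

```python
import json


def additional_metadata(installer_name, requested, direct_url=None):
    """Return a mapping of dist-info file name -> bytes to write."""
    files = {"INSTALLER": (installer_name + "\n").encode("utf-8")}
    if requested:
        files["REQUESTED"] = b""  # the file's presence is the signal
    if direct_url is not None:
        # PEP 610 stores this as a JSON document.
        files["direct_url.json"] = json.dumps(direct_url, sort_keys=True).encode("utf-8")
    return files


extra = additional_metadata(
    "example-installer",
    requested=True,
    direct_url={"url": "https://example.com/pkg-1.0-py3-none-any.whl", "archive_info": {}},
)
print(sorted(extra))
```

A new file such as the proposed HASH file would then be one more key in the mapping, with no change to the installing code.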
What's next on this PR (and project)? More time? 😉 |
Yea, pretty much. |
Any update? |
I've blocked out a few hours in the coming week to pull this forward, including a (small) chunk of co-working time with @FFY00! :) |
A few notes:
|
Update: @lkisskol and I are working on getting the ball rolling again this week. |
Closing since this isn't going to be the primary PR where work is done. Please follow #1 instead. :) |
Built on top of #12
Filing early to get inputs on this design / approach. As of right now, the design has 4 interacting components: the Installer class encapsulates all the installation logic, interacts with all the other abstractions and is the main entry point. WheelSource and Destination, which do exactly what their name suggests. And, finally, there's the coolest kids on the block, Validators.
WheelSource represents a wheel file (like a .whl file on disk). It provides random access to dist-info files, and sequential access to all the files within the wheel. This lets Installer read the metadata files it needs (only WHEEL), any potential validators to read other metadata (like RECORD files), while making it straightforward to consume the contents of the wheel.
Destination handles all the file-writing stuff (so all the I/O) and has the job of rewriting the RECORD file appropriately.
Validators are... Callable[[WheelSource], None] which can raise an error to signal not-valid. That's it.
One of the quirks of this approach/design, is that we don't actually "unpack" the wheel root into the corresponding scheme, but instead figure out the correct place to put the files the moment we see them. This lets us simplify the Destination API, and avoids an extra step (potentially reducing I/O operations).
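To illustrate the "validator is just a callable" idea, here is a sketch of a RECORD-hash validator. `DictWheelSource`, `read_dist_info` and `validate_record` are hypothetical names invented for this example; only `iter_files` appears in the PR discussion, and its real signature may differ.

```python
import base64
import csv
import hashlib
import io


class DictWheelSource:
    """Hypothetical in-memory WheelSource used only for this sketch."""

    def __init__(self, dist_info_dir, files):
        self.dist_info_dir = dist_info_dir
        self._files = files  # path -> bytes

    def read_dist_info(self, name):
        return self._files[self.dist_info_dir + "/" + name].decode("utf-8")

    def iter_files(self):
        for path, data in self._files.items():
            yield path, io.BytesIO(data)


def record_hash(data):
    # RECORD uses urlsafe base64 without trailing "=" padding (PEP 427).
    digest = hashlib.sha256(data).digest()
    return "sha256=" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def validate_record(source):
    """A validator: returns None when valid, raises ValueError otherwise."""
    rows = csv.reader(io.StringIO(source.read_dist_info("RECORD")))
    expected = {row[0]: row[1] for row in rows if row}
    for path, stream in source.iter_files():
        want = expected.get(path)
        if want is None:
            raise ValueError("not listed in RECORD: " + path)
        # RECORD's own entry has an empty hash, so it is skipped here.
        if want and want != record_hash(stream.read()):
            raise ValueError("hash mismatch: " + path)


body = b"x = 1\n"
record = "pkg/__init__.py,{},{}\npkg-1.0.dist-info/RECORD,,\n".format(
    record_hash(body), len(body)
)
good = DictWheelSource(
    "pkg-1.0.dist-info",
    {"pkg/__init__.py": body, "pkg-1.0.dist-info/RECORD": record.encode("utf-8")},
)
validate_record(good)  # passes silently
```

Since the validator only needs the source, it can run before anything is written to the destination.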
Two things that I am wondering:
stat(...).st_mtime from WheelSource to Destination?