
Splitting RequirementPreparer #7049

Open · pradyunsg opened this issue Sep 20, 2019 · 14 comments
Labels: state: needs discussion (This needs some more discussion) · type: refactor (Refactoring code)

pradyunsg (Member) commented Sep 20, 2019

Welcome to another edition of Pradyun dumps his thoughts in an issue to get inputs from actually smart humans!

Currently, RequirementPreparer has two "jobs": fetching and unpacking a requirement, and generating its metadata.

It does this using:

  • unpack_url from pip._internal.download
  • Distribution objects from pip._internal.distributions

It also provides some additional functionality: mainly, it calls req.archive, which is used by pip download. I think there's benefit to splitting out all of these operations.
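
(For illustration, here's a heavily simplified sketch of that current shape. This is not actual pip code, just the two jobs plus the archiving collapsed into one place; the helper names mirror pip internals.)

```python
# Illustrative reduction, not actual pip code: one entry point currently
# performs both "jobs", plus the archiving that only pip download needs.
from pip._internal.download import unpack_url


class RequirementPreparer:
    def __init__(self, download_dir=None):
        self.download_dir = download_dir

    def prepare_linked_requirement(self, req):
        # Job 1: fetch and unpack the requirement.
        unpack_url(req.link, req.source_dir)
        # Job 2: generate metadata.
        dist = self._get_prepared_distribution(req)
        # Extra: archive the unpacked sources, used only by pip download.
        if self.download_dir:
            req.archive(self.download_dir)
        return dist

    def _get_prepared_distribution(self, req):
        ...  # build isolation + metadata generation live down this path
```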

Given that InstallRequirement.archive skips the generated pip-egg-info directory, I think it's reasonable to move the archive generation logic so it runs before metadata is generated.

This would result in a behavior change that I'm okay with: when using pip download, we'd create the archive before calling egg_info. I'm not sure whether this affects setuptools_scm, but based on my rudimentary understanding, it shouldn't. And if we really care, I think we'd be better off moving the archiving logic into pip download itself, so that we can maintain the separation between these stages. We should probably do that anyway, since in the more-correct resolver model we'd only want to archive whatever is in the final set. For now, though, I'm calling it "not required", so I won't be making that change.

With that change, all the fetching-related logic would happen before metadata generation. That would allow splitting RequirementPreparer into a RequirementFetcher and MetadataGenerators, which in turn would make it possible to introduce abstraction-style objects between these stages if we want to. I'm open to exploring that based on how the refactor here goes.

My understanding is that we can get away with making the MetadataGenerators just functions. In the future, we could make them transform some kind of FetchedCandidate into a Distribution object; for now, they'd consume an InstallRequirement, do whatever we're doing today, and return the same object. This change also leaves me unsure about what we'd want to do with Distribution objects (they contain the build isolation code and call the metadata generation code in InstallRequirement), but, hey, one step at a time. I'll look into that once this "stage" is done.
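
(A minimal sketch of what that split could look like, assuming the shape described above; RequirementFetcher and "metadata generators as plain functions" come from this comment, while every signature here is invented for illustration.)

```python
# Hypothetical sketch of the proposed split, not a real pip interface.
from pip._internal.download import unpack_url


class RequirementFetcher:
    """Fetching only: everything that happens before metadata generation."""

    def __init__(self, download_dir=None):
        self.download_dir = download_dir

    def fetch(self, req):
        unpack_url(req.link, req.source_dir)
        if self.download_dir:
            req.archive(self.download_dir)  # now happens before metadata
        return req


def metadata_generator(req):
    """Consume an InstallRequirement, do what pip does today, and return
    the same object (per the paragraph above)."""
    req.prepare_metadata()  # assumed hook, standing in for today's logic
    return req
```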

chrahunt (Member) commented Sep 20, 2019

> And if we really care, I think we'd be better off moving the archiving logic into pip download itself, so that we can maintain the separation between these stages.

I agree with this approach in general - currently, a lot of what makes things complicated is that pip download, pip wheel, and hash checking are spread across at least download and RequirementPreparer. If we were able to do these operations at the end (by operating on a return value from Resolver.resolve, for example), or via a callback/event mechanism, it would let us lift a lot of code from "generic" modules into the more specific command modules.

> In the future, we could make them transform some kind of FetchedCandidate into a Distribution object.

We had talked before about having the objects advance themselves (as opposed to being transformed from the outside). I played with that a little; you can see, for example, RemoteWheel.__next__() -> LocalWheel, UnpackedSources.__next__() -> Local*Project, and RemoteSdist.__next__() -> LocalSdist.
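
(For readers who haven't seen that experiment, the pattern is roughly the following toy sketch; the real prototype's types and interfaces may differ, and download() is a stand-in for pip's download machinery.)

```python
# Toy sketch of "objects advance themselves": each state produces its own
# successor, instead of an external coordinator transforming it.
import tempfile
import urllib.request


def download(url):
    # Stand-in helper: fetch url to a temporary file and return its path.
    path = tempfile.NamedTemporaryFile(suffix=".whl", delete=False).name
    urllib.request.urlretrieve(url, path)
    return path


class LocalWheel:
    def __init__(self, path):
        self.path = path


class RemoteWheel:
    def __init__(self, url):
        self.url = url

    def __next__(self):
        # The remote -> local transition is a method on the object itself.
        return LocalWheel(download(self.url))
```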

pradyunsg (Member, Author) commented:

> We had talked before about having the objects advance themselves

Yep yep. I was mostly just hinting at how these operations would then be the same as the transitions we'd want in that model. The exact interface for how the transitions happen would likely stay the same as what we'd discussed.

cjerdonek (Member) commented:

> We had talked before about having the objects advance themselves (as opposed to being transformed from the outside).

I haven't seen the full design, but this seems a little too fancy to me. In general, I would go for decoupling things, using simple, straightforward patterns, and favoring explicitness over cleverness or implicit behavior.

pradyunsg (Member, Author) commented:

Looking at this again, there are three things here:

  • Renaming _get_prepared_distribution, since it isn't "preparing" (i.e. download+unpack + generate-metadata); it's only generating metadata. It and everything it calls should likely be renamed accordingly.
  • Factoring the renamed _get_prepared_distribution out into the callers of RequirementPreparer, ideally as a separate public function or a separate class.
  • At that point, RequirementPreparer would basically be a RequirementFetcher, so we can likely go through and do the renaming across the board for that.

I think step 2 is going to be the most involved, since it's the one that moves code around -- the rest seems to be renaming things.
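
(For concreteness, step 2 could come out looking roughly like this. The body mirrors what _get_prepared_distribution does today; the standalone function's name and signature here are invented.)

```python
# Hypothetical shape for step 2: the metadata-generation logic lifted out of
# RequirementPreparer into a standalone function that callers invoke after
# fetching.
from pip._internal.distributions import (
    make_distribution_for_install_requirement,
)


def generate_metadata(req, finder, build_isolation):
    """Do what _get_prepared_distribution does today, outside the preparer."""
    abstract_dist = make_distribution_for_install_requirement(req)
    abstract_dist.prepare_distribution_metadata(finder, build_isolation)
    return abstract_dist
```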

pradyunsg (Member, Author) commented Mar 27, 2022

I think the thing that makes step 2 complicated is the "partial fetch" logic for --use-feature=fast-deps, and I am wondering whether we should drop fast-deps handling.

If we get rid of it, we'd basically have a much simpler workflow, where the download and metadata generation steps are separate and not intertwined in any way.

uranusjr (Member) commented Apr 7, 2022

We should probably get PEP 658 in first, but that might not be trivial either with the current structure. I'm not quite sure; I only did a very brief investigation a while ago and may very well have missed obvious approaches.
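
(Context for readers: PEP 658 serves an artifact's Core Metadata on the simple index at the artifact URL with ".metadata" appended, advertised via the data-dist-info-metadata attribute. A minimal retrieval sketch, not pip's implementation:)

```python
import requests


def fetch_pep658_metadata(wheel_url):
    """Fetch just the METADATA file for a wheel, per PEP 658's URL layout.
    Assumes the index advertised it via data-dist-info-metadata."""
    resp = requests.get(wheel_url + ".metadata")
    resp.raise_for_status()
    return resp.content  # Core Metadata, email-header format
```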

pradyunsg (Member, Author) commented:

There's a PR implementing PEP 658, thanks to @cosmicexplorer -- #11111

A comment from @cosmicexplorer was marked as off-topic and hidden.

cosmicexplorer (Contributor) commented:

I hid my previous comment since it wasn't feasible, but I wanted to restate that removing fast-deps handling until it can be shown to improve performance is probably a good idea if it simplifies any of this. I'll continue experimenting with that technique on my own, but that shouldn't block any of this work.

pradyunsg (Member, Author) commented Sep 30, 2022

Based on #11447 and #8670, my guess is that it's still useful -- if a little poorly implemented.

I do think that it's worth exploring whether we should get rid of it once PEP 658 is implemented on PyPI though.

dholth (Member) commented Oct 3, 2022

A huge percentage of wheels have the entire .dist-info directory in the last 8-10 kB of the zip archive. I checked this once but don't have the measurement on hand.

cosmicexplorer (Contributor) commented Oct 5, 2022

> A huge percentage of wheels have the entire .dist-info directory in the last 8-10 kB of the zip archive.

I might be misunderstanding you, but that exact realization is what led directly to the current --use-feature=fast-deps implementation. It was surprising that it alone wasn't enough to produce a useful speedup without other changes such as parallel downloads, and I regret overselling the potential speedup of fast-deps when it was first proposed.

dholth (Member) commented Oct 5, 2022

@cosmicexplorer the lazier lazy-wheel makes far fewer requests! Check it out! We avoid the HEAD request and inspect the ZipInfo entries to make a single read (translating to a single request) for the entire .dist-info directory. Almost 4s faster, but I have a fast connection and very low latency to Fastly.

The less-lazy lazy wheel wastes a HEAD request, and ZipFile makes a couple of tiny reads for each file's inline header (which is not at the end of the file) if we don't prefetch.

#8670 (comment)

This work was part of https://github.com/conda-incubator/conda-package-streaming; we can do a great job on .conda (which is ZIP) and .tar.bz2 conda files by hanging up the connection once we've seen what we're interested in.
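
(To make the trick concrete, here is a minimal sketch of the lazy-read idea, not dholth's implementation: a file-like object that maps reads onto HTTP Range requests, assuming a server that honors them, so zipfile can pull METADATA out of a wheel without downloading the whole file. The URL is a placeholder.)

```python
import zipfile

import requests


class LazyHTTPFile:
    """File-like object turning read() calls into HTTP Range requests."""

    def __init__(self, url):
        self._url = url
        self._session = requests.Session()
        # A ranged GET for the final byte reveals the total size via the
        # Content-Range header, so no separate HEAD request is needed.
        resp = self._session.get(url, headers={"Range": "bytes=-1"})
        resp.raise_for_status()
        self._length = int(resp.headers["Content-Range"].rsplit("/", 1)[-1])
        self._pos = 0

    def seek(self, offset, whence=0):
        base = {0: 0, 1: self._pos, 2: self._length}[whence]
        self._pos = base + offset
        return self._pos

    def tell(self):
        return self._pos

    def read(self, size=-1):
        if size < 0:
            size = self._length - self._pos
        end = min(self._pos + size, self._length) - 1
        if end < self._pos:
            return b""
        # Assumes the server replies 206 with exactly the requested bytes.
        resp = self._session.get(
            self._url, headers={"Range": f"bytes={self._pos}-{end}"}
        )
        resp.raise_for_status()
        self._pos = end + 1
        return resp.content


# zipfile only needs read/seek/tell, so it can locate the central directory
# at the end of the wheel and then read just the METADATA member.
with zipfile.ZipFile(LazyHTTPFile("https://example.com/pkg.whl")) as zf:
    name = next(n for n in zf.namelist() if n.endswith(".dist-info/METADATA"))
    print(zf.read(name).decode())
```

Note that this naive version still issues one request per read; the "lazier" approach described above coalesces the whole .dist-info into a single ranged read after inspecting the ZipInfo entries.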

pradyunsg (Member, Author) commented:

Folks, I'd like to suggest moving further discussion on lazy wheels to #8670 -- it's unrelated to the refactoring that is on-topic for this issue. :)
