Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importlib: Read distribution name/version from metadata directory names, if possible #12656

Merged
merged 3 commits into from
Jul 9, 2024

Conversation

ichard26
Copy link
Member

@ichard26 ichard26 commented Apr 28, 2024

importlib does not cache metadata in-memory, so querying even simple attributes like distribution names and versions can quickly become expensive (as each access requires reading METADATA). Fortunately, Distribution.canonical_name is optimized to parse the metadata directory name to query the name if possible. This commit extends this optimization to the finder implementation and version attribute.

.egg-info directory names tend to not include the version so they are not considered for optimizing version lookup.

simplewheel-2.0-1-py2.py3-none-any.whl had to be modified to rename the .dist-info directory which mistakenly included the wheel build tag (in violation of the wheel specification).

simplewheel/__init__.py
simplewheel-2.0-1.dist-info/DESCRIPTION.rst
simplewheel-2.0-1.dist-info/metadata.json
simplewheel-2.0-1.dist-info/top_level.txt
simplewheel-2.0-1.dist-info/WHEEL
simplewheel-2.0-1.dist-info/METADATA
simplewheel-2.0-1.dist-info/RECORD

Otherwise, it was mistaken for part of the version and led pip to think the wheel was a post-release, breaking tests...

This caught my eye when I was profiling the performance of various commands. iter_all_distributions() was often a noticeable chunk of the profile. For example, reading METADATA is responsible for 15% of the total pip check runtime in my Python 3.11.7 environment with 66 packages installed.

cProfile graph

image

With this patch, the percentage decreases to 5% (only reading Requires-Dist now requires expensive file IO)

cProfile graph

image

@ichard26 ichard26 added the type: performance Commands take too long to run label Apr 29, 2024
@ichard26 ichard26 force-pushed the optimise-importlib branch 2 times, most recently from ddd7dba to 2ddbd85 Compare May 1, 2024 21:13
@ichard26 ichard26 marked this pull request as ready for review May 1, 2024 21:29
@ichard26
Copy link
Member Author

ichard26 commented May 1, 2024

@uranusjr I know you contributed the importlib metadata backend, so I'd appreciate it if you could you review this! Do you think wheels with invalid metadata directories are frequent enough in the wild to render this untenable? Thanks!

@uranusjr
Copy link
Member

uranusjr commented May 2, 2024

The reason we didn’t do this is because some people rely on the “real” name they gave to the package, not the canonical name. This is an artistic choice that they value very high. We can probably use the version (since Version normalises it anyway).

@ichard26
Copy link
Member Author

ichard26 commented May 2, 2024

Hmm, yeah that's a good point. However, I do think the optimisation is still safe here as the finder already normalises the distribution names (it doesn't use raw_name anywhere else)

for dist in importlib.metadata.distributions(path=[location]):
info_location = get_info_location(dist)
try:
raw_name = get_dist_name(dist)
except BadMetadata as e:
logger.warning("Skipping %s due to %s", info_location, e.reason)
continue
normalized_name = canonicalize_name(raw_name)
if normalized_name in self._found_names:
continue
self._found_names.add(normalized_name)

If I remove the optimization from get_dist_name and only use it directly in Distribution.canonical_name and _DistributionFinder.find_impl, would that be okay?

I'll take a closer look at the Version attribute and try to find ways it would lead to undesirable behaviour sometime later this week.

@uranusjr
Copy link
Member

uranusjr commented May 2, 2024

I believe we use the raw name in some places, such as in pip show and pip list. Or maybe they don’t use get_dist_name? If this function is only used to get the canonical name, we should just rename it and make it always return the canonicalised value directly.

@ichard26
Copy link
Member Author

ichard26 commented May 2, 2024

@uranusjr sorry I should've clarified that get_dist_name is only used in those two places (the finder and Distribution canonical name property). I'll rename it 👍

@ichard26 ichard26 force-pushed the optimise-importlib branch 2 times, most recently from 0c846cb to 137fb7e Compare May 5, 2024 16:12
@ichard26
Copy link
Member Author

ichard26 commented May 6, 2024

I was profiling pip and I stumbled on a clearer profile to sell the benefit of this optimisation. Reinstalling a local wheel without dependencies spends a lot of time (barring imports) within get_dist_name() with 70 packages installed.

python -m pip install --no-deps six-1.16.0-py2.py3-none-any.whl --force-reinstall

image
image

There are five calls to get_distribution() (setuptools, six, six, pip, pip) and numerous calls to iter_all_distributions() which each require a full distribution search, calling get_dist_name() on every discovered distribution.

src/pip/_internal/metadata/importlib/_compat.py Outdated Show resolved Hide resolved
…e, if possible

importlib does not cache metadata in-memory, so querying even simple
attributes like distribution names and versions can quickly become
expensive (as each access requires reading METADATA). Fortunately,
`Distribution.canonical_name` is optimized to parse the metadata
directory name to query the name if possible. This commit extends this
optimization to the finder implementation and version attribute.

.egg-info directory names tend to not include the version so they are
not considered for optimizing version lookup.

simplewheel-2.0-1-py2.py3-none-any.whl had to be modified to rename the
.dist-info directory which mistakenly included the wheel build tag (in
violation of the wheel specification).

    simplewheel/__init__.py
    simplewheel-2.0-1.dist-info/DESCRIPTION.rst
    simplewheel-2.0-1.dist-info/metadata.json
    simplewheel-2.0-1.dist-info/top_level.txt
    simplewheel-2.0-1.dist-info/WHEEL
    simplewheel-2.0-1.dist-info/METADATA
    simplewheel-2.0-1.dist-info/RECORD

Otherwise, it was mistaken for part of the version and led pip to think
the wheel was a post-release, breaking tests...
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
@ichard26 ichard26 added this to the 24.2 milestone May 19, 2024
@pradyunsg pradyunsg merged commit ceeccee into pypa:main Jul 9, 2024
28 checks passed
@ichard26 ichard26 deleted the optimise-importlib branch July 10, 2024 23:03
@ichard26
Copy link
Member Author

Thank you for the review @pradyunsg and @uranusjr!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 28, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants