Importing pkg_resources is slow. Work is done at import time. #926

Closed
dsully opened this issue Jan 17, 2017 · 6 comments

dsully commented Jan 17, 2017

Any "import pkg_resources" by a module is slow, in the 100-150ms range, depending on the system.

This is due to the number of modules imported by pkg_resources itself (email.parser is also slow), and side-effect work done at import time, notably the _initialize and _initialize_master_working_set functions at the bottom of pkg_resources/__init__.py.

As wall clock time matters to humans running CLIs, every millisecond counts.

The work done at import time should be deferred until needed, and the imports themselves also deferred if possible.
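
One way to defer the cost, sketched under assumptions: wrap the import in a cached helper so the module loads on first use instead of at startup. The names `lazy_module` and `parse_headers` are hypothetical, and `email.parser` is only a stand-in for any slow import; this is not how pkg_resources itself is structured.

```python
import functools
import importlib


@functools.lru_cache(maxsize=None)
def lazy_module(name):
    """Import `name` on the first call and cache the module object."""
    return importlib.import_module(name)


def parse_headers(raw):
    # The import cost is paid here, on first use, not at program startup.
    parser = lazy_module("email.parser")
    return parser.Parser().parsestr(raw, headersonly=True)


if __name__ == "__main__":
    msg = parse_headers("Name: example\nVersion: 1.0\n\n")
    print(msg["Name"], msg["Version"])
```

A CLI that only needs the parser on one code path then pays nothing for it on the others.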

@rawouter

Updating this bug: I traced a very slow import OpenSSL on our system back to this issue. We have a fairly large sys.path with many user-generated third-party libraries. The import takes more than 800 ms to load, and we aim to run scripts in under a second, which this makes impossible:

$ time python -c "import pkg_resources"
real	0m0.896s
user	0m0.324s
sys	0m0.136s

$ python -c "import sys; print(len(sys.path))"
40
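
The same measurement can be taken from inside Python, which avoids counting interpreter startup. A minimal sketch, assuming the module has not been imported yet in the process; `time_import` is a hypothetical helper name. On Python 3.7+, `python -X importtime -c "import pkg_resources"` gives a per-module breakdown as well.

```python
import importlib
import sys
import time


def time_import(module_name):
    """Return (elapsed_seconds, was_already_loaded) for importing module_name."""
    # A module already in sys.modules is served from the cache, so the
    # timing is only meaningful for a first import.
    already_loaded = module_name in sys.modules
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    return elapsed, already_loaded


if __name__ == "__main__":
    name = sys.argv[1] if len(sys.argv) > 1 else "json"
    elapsed, cached = time_import(name)
    print(f"{name}: {elapsed * 1000:.1f} ms (already loaded: {cached})")
```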


akx commented Mar 6, 2017

I'm seeing ~200 msec for a click CLI app I'm developing (printing the usage):

With:     0.43 real         0.38 user         0.04 sys
Without:  0.25 real         0.22 user         0.02 sys


cboylan commented Mar 28, 2017

Ran into this and poked at it briefly. A significant portion of the cost appears to come from the sorting in https://github.com/pypa/setuptools/blob/master/pkg_resources/__init__.py#L1977-L2000. Note the number of calls to _by_version and the total cumulative time there in the attached file, which was generated by running python -m cProfile -s cumulative foo.py, where foo.py contains only import pkg_resources. Digging deeper, the biggest cost in sorting appears to be parsing all of the versions using packaging.version.parse.

pkg_resources_import_profile.txt
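
A comparable profile can be collected programmatically. This is a sketch only: `profile_import` is a hypothetical helper, and the report is only informative if the target module has not already been imported in the process.

```python
import cProfile
import importlib
import io
import pstats


def profile_import(module_name, limit=15):
    """Profile one import and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        importlib.import_module(module_name)
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Same ordering as `python -m cProfile -s cumulative`.
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_import("json"))
```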

The other big cost here appears to be https://github.com/pypa/setuptools/blob/master/pkg_resources/__init__.py#L640-L658.

I don't know enough about why this state is generated on import, but it seems that sorting, or even reading through the entire list of available packages, could be deferred until the information is needed (also, why is sorting important at all?).
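
To illustrate what deferring the sort could look like, here is a hedged sketch. `LazyVersionIndex` and `simple_version_key` are hypothetical names, and the key function is a simplified stand-in for a real version parser; none of this reflects the actual pkg_resources data structures.

```python
import functools


def simple_version_key(version):
    # Stand-in for a real version parser; handles only dotted integers.
    return tuple(int(part) for part in version.split("."))


class LazyVersionIndex:
    """Hypothetical container that parses and sorts only on first access."""

    def __init__(self, versions):
        self._versions = list(versions)  # cheap: no parsing, no sorting

    @functools.cached_property
    def newest_first(self):
        # Parsing and sorting happen once, the first time this is read.
        return sorted(self._versions, key=simple_version_key, reverse=True)


if __name__ == "__main__":
    index = LazyVersionIndex(["1.2.0", "34.2.0", "9.0.1"])
    print(index.newest_first[0])
```

Constructing the index stays cheap, so code that never asks for an ordering never pays for one.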

Finally, pip list reports setuptools (34.2.0) as the version in use here.


mgedmin commented Apr 27, 2017

Duplicate of #510?


jaraco commented Apr 27, 2017

Yes, thanks.

@jaraco jaraco closed this as completed Apr 27, 2017
lpsinger added a commit to lpsinger/ligo.skymap that referenced this issue Apr 29, 2020
Speed up `import ligo.skymap` by up to a second by replacing uses of
`pkg_resources` with the new Python standard library module
`importlib.resources` (or, for Python < 3.7, the backport
`importlib_resources`). The old `pkg_resources` module is known to be
slow because it does a lot of work on startup.

See, for example,
[pypa/setuptools#926](pypa/setuptools#926) and
[pypa/setuptools#510](pypa/setuptools#510).
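
For readers unfamiliar with the replacement, a minimal sketch of reading a bundled data file through `importlib.resources` follows. It assumes Python 3.9+ for `importlib.resources.files`; `read_package_text`, `make_demo_package`, and the `demo_pkg_926` package are hypothetical and exist only so the example runs anywhere. It is not taken from the referenced commits.

```python
import importlib
import importlib.resources
import pathlib
import sys
import tempfile


def read_package_text(package, resource):
    """Read a text resource bundled inside an importable package."""
    target = importlib.resources.files(package).joinpath(resource)
    return target.read_text(encoding="utf-8")


def make_demo_package(root, name="demo_pkg_926"):
    # Hypothetical throwaway package with one data file.
    pkg_dir = pathlib.Path(root) / name
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("", encoding="utf-8")
    (pkg_dir / "greeting.txt").write_text("hello\n", encoding="utf-8")
    return name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        name = make_demo_package(root)
        sys.path.insert(0, root)
        importlib.invalidate_caches()
        try:
            print(read_package_text(name, "greeting.txt"), end="")
        finally:
            sys.path.remove(root)
```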
lpsinger added a commit to lpsinger/gwcelery that referenced this issue May 4, 2020
Speed up imports by up to a second by replacing uses of `pkg_resources`
with the new Python standard library module `importlib.resources` (or,
for Python < 3.7, the backport `importlib_resources`). The old
`pkg_resources` module is known to be slow because it does a lot of
work on startup.

See, for example,
[pypa/setuptools#926](pypa/setuptools#926) and
[pypa/setuptools#510](pypa/setuptools#510).
lalsuite-bot pushed a commit to lscsoft/lalsuite that referenced this issue May 6, 2020
Speed up imports by up to a second by replacing uses of `pkg_resources`
with the new Python standard library module `importlib.resources` (or,
for Python < 3.7, the backport `importlib_resources`). The old
`pkg_resources` module is known to be slow because it does a lot of
work on startup.

See, for example,
[pypa/setuptools#926](pypa/setuptools#926) and
[pypa/setuptools#510](pypa/setuptools#510).
tbabej added a commit to tbabej/tasklib that referenced this issue Sep 9, 2020
Providing a __version__ attribute is a reasonably common convention among
packages in the Python ecosystem. Currently the only other reliable
alternative is the pkg_resources.get_distribution method, but
importing pkg_resources is notoriously slow [1,2].

Add the __version__ attribute as an API for checking
the version of tasklib at runtime.

Bump the version in order to reflect module API change.

[1] pypa/setuptools#510
[2] pypa/setuptools#926
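
On Python 3.8+, the standard library also offers `importlib.metadata.version` as a lookup that does not import pkg_resources. A minimal sketch, where `installed_version` and its `default` parameter are hypothetical:

```python
from importlib import metadata


def installed_version(dist_name, default="unknown"):
    """Return the installed version of a distribution, or `default`."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return default


if __name__ == "__main__":
    print(installed_version("tasklib"))
```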
tbabej added a commit to tools-life/taskwiki that referenced this issue Sep 18, 2020
Importing pkg_resources module is notoriously slow, see [1,2]. Tasklib
module now provides __version__ attribute for an easy method of version
checking.

[1] pypa/setuptools#510
[2] pypa/setuptools#926
Alphadelta14 added a commit to Alphadelta14/setuptools that referenced this issue Sep 23, 2020
mergify bot pushed a commit to DataDog/dd-trace-py that referenced this issue Jun 22, 2022
pkg_resources has known performance issues: pypa/setuptools#926. This PR replaces pkg_resources with importlib.metadata and uses this module to retrieve package names and versions. 
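
As an illustration of that kind of lookup (not the code from the PR), installed package names and versions can be collected with `importlib.metadata.distributions()`. The helper name `installed_packages` is hypothetical, and skipping unreadable distributions is an assumption about what telemetry code would want.

```python
from importlib import metadata


def installed_packages():
    """Map distribution name to version for the current environment."""
    packages = {}
    for dist in metadata.distributions():
        try:
            meta = dist.metadata
            name = meta.get("Name")
            version = meta.get("Version")
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name and version:
            packages[name] = version
    return packages


if __name__ == "__main__":
    for name, version in sorted(installed_packages().items()):
        print(name, version)
```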

A further optimization was made to the importlib implementation which parses package metadata: https://github.com/DataDog/dd-trace-py/compare/munir/benchmark-importlib...munir/tests-importlib-metadata-custom-parsing?expand=1. Benchmarks for this third optimization are also shown in the table below:

| benchmark                 | test case                      | Number of Packages | mean (ms) | std (ms) | baseline (ms) | overhead (ms) | overhead (%) |
|---------------------------|--------------------------------|--------------------|:---------:|:--------:|:-------------:|:-------------:|:------------:|
| ddtracerun-auto_telemetry | pkg_resources (1.x branch)     | 15                 | 326       | 13       | 274           | 52            | 19.0         |
| ddtracerun-auto_telemetry | importlib                      | 15                 | 285       | 5        | 270           | 15            | 5.6          |
| ddtracerun-auto_telemetry | importlib with partial parsing | 15                 | 285       | 10       | 269           | 16            | 5.9          |
| ddtracerun-auto_telemetry | importlib                      | 30                 | 377       | 5        | 350           | 27            | 7.7          |
| ddtracerun-auto_telemetry | importlib with partial parsing | 30                 | 362       | 7        | 350           | 12            | 3.4          |
| ddtracerun-auto_telemetry | importlib                      | 45                 | 381       | 24       | 348           | 31            | 8.9          |
| ddtracerun-auto_telemetry | importlib with partial parsing | 45                 | 363       | 9        | 350           | 23            | 6.3          |
| ddtracerun-auto_telemetry | importlib                      | 313                | 1050      | 79       | 991           | 59            | 5.9          |
| ddtracerun-auto_telemetry | importlib with partial parsing | 313                | 911       | 28       | 905           | 6             | 0.6          |


|             benchmark             | test case                      | Number of Packages | mean (ms) | std (ms) | baseline (ms) | overhead (ms) | overhead (%) |
|:---------------------------------:|--------------------------------|--------------------|:---------:|:--------:|:-------------:|:-------------:|:------------:|
| ddtracerun-auto_tracing_telemetry | pkg_resources (1.x)            | 15                 | 324       | 8        | 274           | 50            | 18.2         |
| ddtracerun-auto_tracing_telemetry | importlib                      | 15                 | 293       | 11       | 272           | 21            | 8.3          |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 15                 | 291       | 12       | 272           | 19            | 6.9          |
| ddtracerun-auto_tracing_telemetry | importlib                      | 30                 | 373       | 11       | 351           | 22            | 6.28         |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 30                 | 367       | 13       | 354           | 13            | 3.6          |
| ddtracerun-auto_tracing_telemetry | importlib                      | 45                 | 376       | 8        | 355           | 21            | 5.9          |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 45                 | 364       | 9        | 352           | 22            | 6.5          |
| ddtracerun-auto_tracing_telemetry | importlib                      | 313                | 1010      | 80       | 960           | 50            | 5.2          |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 313                | 910       | 20       | 873           | 37            | 4.2          |

Note: redis, requests and urllib3 were included in test cases with 30 and 45 packages. These packages were patched by `ddtrace-run`, which increased the baseline by ~74ms, but the observed overhead of telemetry remained consistent. The case with 313 packages patched gevent, pylons, SQLAlchemy, requests, flask, grpc, cassandra, botocore, and urllib3. This was to simulate the overhead of telemetry in a real-world application with telemetry enabled.
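
The overhead columns appear to follow from the mean and baseline columns: overhead (ms) as mean minus baseline, and overhead (%) as that difference relative to the baseline. This reading is an assumption, not something the PR states, and a few rows do not reconcile exactly under it. A small sketch with a hypothetical `overhead` helper, checked against the first row of the first table:

```python
def overhead(mean_ms, baseline_ms):
    """Return (overhead in ms, overhead as a percentage of the baseline)."""
    overhead_ms = mean_ms - baseline_ms
    overhead_pct = 100.0 * overhead_ms / baseline_ms
    return overhead_ms, round(overhead_pct, 1)


if __name__ == "__main__":
    # pkg_resources (1.x branch), 15 packages: mean 326 ms, baseline 274 ms
    print(overhead(326, 274))  # (52, 19.0)
```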


Findings from benchmarking sending telemetry events with different numbers of packages installed, patching integrations, and/or enabling tracing:
 - Using importlib instead of pkg_resources cut the overhead of telemetry by more than half (~50ms to ~19ms).
 - The number of packages does not appear to correlate with the overhead of telemetry.
      - The benchmarks might have been too noisy to measure the difference accurately.
 - Creating a custom parser to retrieve package names and versions from PKG-INFO and METADATA files led to notable performance gains with a large number of packages.
      - The difference appears to be within a standard deviation, so more testing is required to measure it accurately.
      - Iterating on this approach might lead to better results: https://github.com/DataDog/dd-trace-py/compare/munir/benchmark-importlib...munir/tests-importlib-metadata-custom-parsing?expand=1
      - These performance gains seem to be minor. It might not be worth developing and maintaining a metadata parser.
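
To make the "partial parsing" idea concrete, here is a hedged sketch of reading only the Name and Version headers from METADATA or PKG-INFO text and stopping as soon as both are found. `parse_name_version` is a hypothetical function and is not the parser from the linked branch.

```python
def parse_name_version(metadata_text):
    """Return (name, version) from METADATA/PKG-INFO text; None if missing."""
    name = version = None
    for line in metadata_text.splitlines():
        if not line.strip():
            break  # a blank line ends the header block; the body follows
        if line[0] in " \t":
            continue  # continuation of the previous header's value
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        if key == "name" and name is None:
            name = value.strip()
        elif key == "version" and version is None:
            version = value.strip()
        if name is not None and version is not None:
            break  # stop early: the remaining headers are not needed
    return name, version


if __name__ == "__main__":
    sample = "Metadata-Version: 2.1\nName: example\nVersion: 1.2.3\n\nBody\n"
    print(parse_name_version(sample))
```

Stopping early avoids building a full message object for every installed distribution, which is where any saving would come from.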
