
Restrict to unique distributions in entry points #281

Merged (6 commits) on Feb 21, 2021

Conversation

jaraco (Member) commented Jan 27, 2021

Closes #280.

jaraco (Member, Author) commented Jan 27, 2021

In this latest commit, I added 2b64fa2 against the main branch to measure the performance of entry_points before and after the patch. On my 2019 MacBook Pro, I see 9.5 ms as the baseline and 28 ms after the patch, roughly a 200% increase (about 3x).
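A micro-benchmark along these lines reproduces the kind of measurement described. This is a hypothetical sketch, not the actual benchmark commit; it times the stdlib importlib.metadata so it runs anywhere, whereas 2b64fa2 presumably benchmarks the importlib_metadata backport itself:

```python
import timeit

# Hypothetical timing harness (assumed, not from commit 2b64fa2):
# average the cost of one entry_points() call over several runs.
n = 10
elapsed = timeit.timeit(
    "entry_points()",
    setup="from importlib.metadata import entry_points",
    number=n,
)
per_call_ms = elapsed / n * 1000
print(f"{per_call_ms:.1f} ms per call")
```

The absolute numbers depend heavily on how many distributions are installed, which is why before/after comparisons on the same environment (as done here) are the meaningful ones.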

anntzer (Contributor) commented Jan 27, 2021

I'd guess the way to avoid the large performance penalty is along the lines of (simplified here)

diff --git i/importlib_metadata/__init__.py w/importlib_metadata/__init__.py
index 4c420a5..2a90b6a 100644
--- i/importlib_metadata/__init__.py
+++ w/importlib_metadata/__init__.py
@@ -590,6 +590,7 @@ class PathDistribution(Distribution):
                      .joinpath(), __div__, .parent, and .read_text().
         """
         self._path = path
+        self._normalized_name = path.stem.partition("-")[0]
 
     def read_text(self, filename):
         with suppress(
@@ -648,7 +649,7 @@ def entry_points():
 
     :return: EntryPoint objects for all installed packages.
     """
-    unique = functools.partial(unique_everseen, key=operator.attrgetter('name'))
+    unique = functools.partial(unique_everseen, key=operator.attrgetter('_normalized_name'))
     eps = itertools.chain.from_iterable(
         dist.entry_points for dist in unique(distributions())
     )

(on top of your PR)
i.e., take advantage of the fact that in practice the vast majority of Distributions are PathDistributions, for which the normalized name can be inferred directly from the filename (I didn't handle eggs here, but something similar applies, modulo the different normalization). To handle non-PathDistributions, the base Distribution class should also gain a _normalized_name property that computes it from the non-normalized name.

Of course, this assumes that distributions with different names but the same normalized name can be considered identical. If that's not the case, you'd need something slightly more elaborate along the same lines: first key by normalized name, then for each group of distributions sharing the same normalized name, if that group has size > 1 (which should be rare in practice), actually resolve the non-normalized names and deduplicate on those.

The patch above (which doesn't implement all these subtleties) removes nearly all of the overhead of this PR.
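The two pieces of that idea can be sketched standalone. The regex follows PEP 503-style name normalization; `normalize` and `normalized_name_from_path` are hypothetical helper names, not existing importlib_metadata API:

```python
import re
from pathlib import Path

# PEP 503-style normalization: runs of '-', '_', '.' collapse to a single
# '-', and the result is lowercased.
def normalize(name: str) -> str:
    return re.sub(r"[-_.]+", "-", name).lower()

# Fast path, as in the diff above: infer the name from a .dist-info
# directory name alone, without reading the METADATA file at all.
def normalized_name_from_path(path: Path) -> str:
    return normalize(path.stem.partition("-")[0])

print(normalized_name_from_path(Path("Foo_Bar-1.0.dist-info")))  # foo-bar
print(normalize("zope.interface"))  # zope-interface
```

The win is that `entry_points()` can deduplicate on this cheap filename-derived key and only fall back to reading metadata for the rare normalized-name collisions.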


Also, orthogonally, entry_points.txt parsing itself can be sped up significantly. The best option is probably a hand-rolled parser, since the format is actually much less general than everything ConfigParser supports (I didn't find a spec for entry_points.txt, but can it ever use https://docs.python.org/3/library/configparser.html#interpolation-of-values?). A quick fix that already yields large gains (~2x speedup on entry_points() before this current PR) is to reuse a single ConfigParser instance, since ConfigParser instantiation is quite slow:

diff --git i/importlib_metadata/__init__.py w/importlib_metadata/__init__.py
index fac3063..257e19e 100644
--- i/importlib_metadata/__init__.py
+++ w/importlib_metadata/__init__.py
@@ -87,6 +87,10 @@ class EntryPoint(
 
     dist: Optional['Distribution'] = None
 
+    cp = ConfigParser(delimiters='=')
+    # case sensitive: https://stackoverflow.com/q/1611799/812183
+    cp.optionxform = str
+
     def load(self):
         """Load the entry point from its definition. If only a module
         is indicated by the value, return that module. Otherwise,
@@ -122,11 +126,9 @@ class EntryPoint(
 
     @classmethod
     def _from_text(cls, text):
-        config = ConfigParser(delimiters='=')
-        # case sensitive: https://stackoverflow.com/q/1611799/812183
-        config.optionxform = str
-        config.read_string(text)
-        return cls._from_config(config)
+        cls.cp.clear()
+        cls.cp.read_string(text)
+        return cls._from_config(cls.cp)
 
     @classmethod
     def _from_text_for(cls, text, dist):
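The key detail making this safe is that `ConfigParser.clear()` drops all previously parsed sections, so a single instance can be reused across calls. A minimal standalone check of that behavior:

```python
from configparser import ConfigParser

# Shared instance, configured once, as in the diff above.
cp = ConfigParser(delimiters='=')
cp.optionxform = str  # keep entry-point names case sensitive

cp.read_string("[console_scripts]\nFoo = pkg:main\n")
first = dict(cp["console_scripts"])

cp.clear()  # reset parsed sections before reuse
cp.read_string("[gui_scripts]\nbar = pkg:gui\n")

print(first)                    # {'Foo': 'pkg:main'}
print("console_scripts" in cp)  # False: cleared before the second parse
print(dict(cp["gui_scripts"]))  # {'bar': 'pkg:gui'}
```

(One caveat worth noting: a class-level shared parser is not thread-safe, so concurrent `_from_text` calls would need a lock or per-call instances.)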

Meanwhile, a hand-rolled parser such as

     @classmethod
     def _from_text(cls, text):
-        config = ConfigParser(delimiters='=')
-        # case sensitive: https://stackoverflow.com/q/1611799/812183
-        config.optionxform = str
-        config.read_string(text)
-        return cls._from_config(config)
+        if not text:
+            return
+        group = None
+        for line in filter(None, map(str.strip, text.splitlines())):
+            if line.startswith("["):
+                group = line[1:-1]
+            else:
+                name, value = map(str.strip, line.split("=", 1))
+                yield cls(name, value, group)

gives you another nearly 2-fold speedup (depending on how robustly you want to handle malformed inputs, of course, and assuming we indeed don't need to support interpolation or anything fancy).
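As a standalone function, with a hypothetical `EntryPoint` namedtuple standing in for the real class, the hand-rolled parser behaves like this:

```python
from collections import namedtuple

# Hypothetical stand-in for importlib_metadata.EntryPoint.
EntryPoint = namedtuple("EntryPoint", "name value group")

def parse_entry_points(text):
    """Minimal entry_points.txt parser, as sketched in the diff above.

    Assumes well-formed input: no interpolation, no continuation lines,
    and every non-section line contains an '='.
    """
    group = None
    # Strip whitespace and drop blank lines in one pass.
    for line in filter(None, map(str.strip, (text or "").splitlines())):
        if line.startswith("["):
            group = line[1:-1]  # '[console_scripts]' -> 'console_scripts'
        else:
            name, value = map(str.strip, line.split("=", 1))
            yield EntryPoint(name, value, group)

eps = list(parse_entry_points("[console_scripts]\nfoo = pkg.mod:main\n"))
print(eps)  # [EntryPoint(name='foo', value='pkg.mod:main', group='console_scripts')]
```

A line without `=` raises ValueError here, which is the "how well do you want to handle malformed inputs" trade-off mentioned above.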

(edit: actually, half of the hand-rolled implementation could even be shared with the implementation of Distribution._read_sections.)

domdfcoding

> this is assuming we indeed don't need to support interpolation or anything fancy

Could it fall back to ConfigParser in those cases?

anntzer (Contributor) commented Feb 21, 2021

> Could it fall back to ConfigParser in those cases?

Yes, but we'd need to actually detect whether interpolation is being used (for which I'd guess checking for "%(" in the text is a close enough approximation, with no false negatives).
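That detect-then-fall-back idea could be sketched as follows. The helper names here are hypothetical, and `fast_parse` is the hand-rolled parser from earlier in the thread, returning plain tuples for brevity:

```python
from configparser import ConfigParser

def fast_parse(text):
    # Hand-rolled fast path from earlier in the thread.
    group = None
    for line in filter(None, map(str.strip, text.splitlines())):
        if line.startswith("["):
            group = line[1:-1]
        else:
            name, value = map(str.strip, line.split("=", 1))
            yield (name, value, group)

def from_text(text):
    """Hypothetical dispatcher: hand-rolled parse unless interpolation
    syntax might be present, in which case defer to ConfigParser."""
    if "%(" in text:  # conservative check: no false negatives
        cp = ConfigParser(delimiters="=")
        cp.optionxform = str  # keep names case sensitive
        cp.read_string(text)
        return [
            (name, value, section)
            for section in cp.sections()
            for name, value in cp.items(section)
        ]
    return list(fast_parse(text))

print(from_text("[g]\na = b\n"))  # fast path: [('a', 'b', 'g')]
```

Inputs containing `%(` take the slow ConfigParser path, where `items()` performs the interpolation; everything else gets the fast path.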

jaraco (Member, Author) commented Feb 21, 2021

I'm all but certain that interpolation isn't expected. In fact, that's how pkg_resources parses it. So it's definitely safe not to rely on ConfigParser.

For this PR, I'm going to accept it as written, and we can work on optimization techniques subsequently.

Successfully merging this pull request may close these issues.

Metadata returned even for superseded packages