
Restrict to unique distributions in entry points #281

Merged (6 commits) on Feb 21, 2021

Conversation

jaraco (Member) commented Jan 27, 2021

Closes #280.

jaraco (Member, Author) commented Jan 27, 2021

In this latest commit, I added 2b64fa2 against the main branch to measure the performance of entry_points before and after the patch. On my 2019 MacBook Pro, I see 9.5 ms as the baseline and 28 ms after the patch, roughly a 200% increase (about 3x).
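A micro-benchmark along these lines reproduces the kind of measurement described. This is a hypothetical sketch, not the actual benchmark commit; it times the stdlib importlib.metadata so it runs anywhere, whereas 2b64fa2 presumably benchmarks the importlib_metadata backport itself:

```python
import timeit

# Hypothetical timing harness (assumed, not from commit 2b64fa2):
# average the cost of one entry_points() call over several runs.
n = 10
elapsed = timeit.timeit(
    "entry_points()",
    setup="from importlib.metadata import entry_points",
    number=n,
)
per_call_ms = elapsed / n * 1000
print(f"{per_call_ms:.1f} ms per call")
```

The absolute numbers depend heavily on how many distributions are installed, which is why before/after comparisons on the same environment (as done here) are the meaningful ones.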

anntzer (Contributor) commented Jan 27, 2021

I'd guess the way to avoid the large performance penalty is along the lines of (simplified here)

diff --git i/importlib_metadata/__init__.py w/importlib_metadata/__init__.py
index 4c420a5..2a90b6a 100644
--- i/importlib_metadata/__init__.py
+++ w/importlib_metadata/__init__.py
@@ -590,6 +590,7 @@ class PathDistribution(Distribution):
                      .joinpath(), __div__, .parent, and .read_text().
         """
         self._path = path
+        self._normalized_name = path.stem.partition("-")[0]
 
     def read_text(self, filename):
         with suppress(
@@ -648,7 +649,7 @@ def entry_points():
 
     :return: EntryPoint objects for all installed packages.
     """
-    unique = functools.partial(unique_everseen, key=operator.attrgetter('name'))
+    unique = functools.partial(unique_everseen, key=operator.attrgetter('_normalized_name'))
     eps = itertools.chain.from_iterable(
         dist.entry_points for dist in unique(distributions())
     )

(on top of your PR)
i.e., take advantage of the fact that in practice the vast majority of Distributions are PathDistributions, for which the normalized name can be inferred directly from the filename (I didn't handle eggs here, but something similar applies, modulo the different normalization). To handle non-PathDistributions, the base Distribution class should also gain a _normalized_name property that computes it from the non-normalized name.

Of course, this assumes that distributions with different names but the same normalized name can be considered identical. If that's not the case, you'd need something slightly more elaborate along the same lines: first key by normalized name, then for each group of distributions sharing the same normalized name, if that group has size > 1 (which should be rare in practice), actually resolve the non-normalized names and deduplicate on those.

The patch above (which doesn't implement all these subtleties) removes nearly all of the overhead of this PR.
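The two pieces of that idea can be sketched standalone. The regex follows PEP 503-style name normalization; `normalize` and `normalized_name_from_path` are hypothetical helper names, not existing importlib_metadata API:

```python
import re
from pathlib import Path

# PEP 503-style normalization: runs of '-', '_', '.' collapse to a single
# '-', and the result is lowercased.
def normalize(name: str) -> str:
    return re.sub(r"[-_.]+", "-", name).lower()

# Fast path, as in the diff above: infer the name from a .dist-info
# directory name alone, without reading the METADATA file at all.
def normalized_name_from_path(path: Path) -> str:
    return normalize(path.stem.partition("-")[0])

print(normalized_name_from_path(Path("Foo_Bar-1.0.dist-info")))  # foo-bar
print(normalize("zope.interface"))  # zope-interface
```

The win is that `entry_points()` can deduplicate on this cheap filename-derived key and only fall back to reading metadata for the rare normalized-name collisions.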


Also, orthogonally, entry_points.txt parsing itself can be sped up significantly. The best option is probably a hand-rolled parser, since the format is actually much less general than everything ConfigParser supports (I didn't find a spec for entry_points.txt, but can it ever use https://docs.python.org/3/library/configparser.html#interpolation-of-values?). A quick fix that already yields large gains (~2x speedup on entry_points() before this current PR) is to reuse a single ConfigParser instance, since ConfigParser instantiation is quite slow:

diff --git i/importlib_metadata/__init__.py w/importlib_metadata/__init__.py
index fac3063..257e19e 100644
--- i/importlib_metadata/__init__.py
+++ w/importlib_metadata/__init__.py
@@ -87,6 +87,10 @@ class EntryPoint(
 
     dist: Optional['Distribution'] = None
 
+    cp = ConfigParser(delimiters='=')
+    # case sensitive: https://stackoverflow.com/q/1611799/812183
+    cp.optionxform = str
+
     def load(self):
         """Load the entry point from its definition. If only a module
         is indicated by the value, return that module. Otherwise,
@@ -122,11 +126,9 @@ class EntryPoint(
 
     @classmethod
     def _from_text(cls, text):
-        config = ConfigParser(delimiters='=')
-        # case sensitive: https://stackoverflow.com/q/1611799/812183
-        config.optionxform = str
-        config.read_string(text)
-        return cls._from_config(config)
+        cls.cp.clear()
+        cls.cp.read_string(text)
+        return cls._from_config(cls.cp)
 
     @classmethod
     def _from_text_for(cls, text, dist):
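The key detail making this safe is that `ConfigParser.clear()` drops all previously parsed sections, so a single instance can be reused across calls. A minimal standalone check of that behavior:

```python
from configparser import ConfigParser

# Shared instance, configured once, as in the diff above.
cp = ConfigParser(delimiters='=')
cp.optionxform = str  # keep entry-point names case sensitive

cp.read_string("[console_scripts]\nFoo = pkg:main\n")
first = dict(cp["console_scripts"])

cp.clear()  # reset parsed sections before reuse
cp.read_string("[gui_scripts]\nbar = pkg:gui\n")

print(first)                    # {'Foo': 'pkg:main'}
print("console_scripts" in cp)  # False: cleared before the second parse
print(dict(cp["gui_scripts"]))  # {'bar': 'pkg:gui'}
```

(One caveat worth noting: a class-level shared parser is not thread-safe, so concurrent `_from_text` calls would need a lock or per-call instances.)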

Meanwhile, a hand-rolled parser such as

     @classmethod
     def _from_text(cls, text):
-        config = ConfigParser(delimiters='=')
-        # case sensitive: https://stackoverflow.com/q/1611799/812183
-        config.optionxform = str
-        config.read_string(text)
-        return cls._from_config(config)
+        if not text:
+            return
+        group = None
+        for line in filter(None, map(str.strip, text.splitlines())):
+            if line.startswith("["):
+                group = line[1:-1]
+            else:
+                name, value = map(str.strip, line.split("=", 1))
+                yield cls(name, value, group)

gives you another nearly 2-fold speedup (depending on how robustly you want to handle malformed inputs, of course, and assuming we indeed don't need to support interpolation or anything fancy).
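As a standalone function, with a hypothetical `EntryPoint` namedtuple standing in for the real class, the hand-rolled parser behaves like this:

```python
from collections import namedtuple

# Hypothetical stand-in for importlib_metadata.EntryPoint.
EntryPoint = namedtuple("EntryPoint", "name value group")

def parse_entry_points(text):
    """Minimal entry_points.txt parser, as sketched in the diff above.

    Assumes well-formed input: no interpolation, no continuation lines,
    and every non-section line contains an '='.
    """
    group = None
    # Strip whitespace and drop blank lines in one pass.
    for line in filter(None, map(str.strip, (text or "").splitlines())):
        if line.startswith("["):
            group = line[1:-1]  # '[console_scripts]' -> 'console_scripts'
        else:
            name, value = map(str.strip, line.split("=", 1))
            yield EntryPoint(name, value, group)

eps = list(parse_entry_points("[console_scripts]\nfoo = pkg.mod:main\n"))
print(eps)  # [EntryPoint(name='foo', value='pkg.mod:main', group='console_scripts')]
```

A line without `=` raises ValueError here, which is the "how well do you want to handle malformed inputs" trade-off mentioned above.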

(edit: actually, half of the hand-rolled implementation could even be shared with the implementation of Distribution._read_sections.)

domdfcoding

> this is assuming we indeed don't need to support interpolation or anything fancy

Could it fall back to ConfigParser in those cases?

anntzer (Contributor) commented Feb 21, 2021

> Could it fall back to ConfigParser in those cases?

Yes, but we'd need to actually detect whether interpolation is being used (for which I'd guess checking for "%(" in the text is a close enough approximation, with no false negatives).
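That detect-then-fall-back idea could be sketched as follows. The helper names here are hypothetical, and `fast_parse` is the hand-rolled parser from earlier in the thread, returning plain tuples for brevity:

```python
from configparser import ConfigParser

def fast_parse(text):
    # Hand-rolled fast path from earlier in the thread.
    group = None
    for line in filter(None, map(str.strip, text.splitlines())):
        if line.startswith("["):
            group = line[1:-1]
        else:
            name, value = map(str.strip, line.split("=", 1))
            yield (name, value, group)

def from_text(text):
    """Hypothetical dispatcher: hand-rolled parse unless interpolation
    syntax might be present, in which case defer to ConfigParser."""
    if "%(" in text:  # conservative check: no false negatives
        cp = ConfigParser(delimiters="=")
        cp.optionxform = str  # keep names case sensitive
        cp.read_string(text)
        return [
            (name, value, section)
            for section in cp.sections()
            for name, value in cp.items(section)
        ]
    return list(fast_parse(text))

print(from_text("[g]\na = b\n"))  # fast path: [('a', 'b', 'g')]
```

Inputs containing `%(` take the slow ConfigParser path, where `items()` performs the interpolation; everything else gets the fast path.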

jaraco (Member, Author) commented Feb 21, 2021

I'm all but certain that interpolation isn't expected. In fact, that's how pkg_resources parses it. So it's definitely safe not to rely on ConfigParser.

For this PR, I'm going to accept it as written, and we can work on optimization techniques subsequently.

Successfully merging this pull request may close these issues.

Metadata returned even for superseded packages