Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Version is no longer totally ordered and comparison speed regressed #26464

Closed
haampie opened this issue Oct 4, 2021 · 11 comments
Closed

Version is no longer totally ordered and comparison speed regressed #26464

haampie opened this issue Oct 4, 2021 · 11 comments
Assignees
Labels
bug Something isn't working triage The issue needs to be prioritized
Milestone

Comments

@haampie
Copy link
Member

haampie commented Oct 4, 2021

Steps to reproduce

After #24639 the set of Version objects no longer satisfy a total order with <:

In [1]: a = spack.version.ver('1111111111222222222233333333334444444444')

In [2]: b = spack.version.ver('5555555555666666666677777777778888888888')

In [3]: a < b
Out[3]: False

In [4]: b < a
Out[4]: False

You may say this is an edge case, but generally I'm not a fan of extending Version with git-related things, because Git commit sha's are the exception, and they don't map perfectly to Spack registered versions.

If what we want is to compare a Git commit sha to a Spack registered Version, why not create a GitVersion class and implement < and friends to compare with Version?

Soon enough we'll be extending commit sha's to general git refs (branches, tags, sha's... of forks even?), and if that's going to be part of Version that'd be really bad.

Also note that we're executing a regex multiple times on every single version string for every comparsion <:

@coerced
def __lt__(self, other):
"""Version comparison is designed for consistency with the way RPM
does things. If you need more complicated versions in installed
packages, you should override your package's version string to
express it more sensibly.
"""
if other is None:
return False
# If either is a commit and we haven't indexed yet, can't compare
if (other.is_commit or self.is_commit) and not (self.commit_lookup or
other.commit_lookup):
return False
# Use tuple comparison assisted by VersionStrComponent for performance
return self._cmp(other.commit_lookup) < other._cmp(self.commit_lookup)

@property
def is_commit(self):
"""
Determine if the original string is referencing a commit.
"""
if self.string in infinity_versions:
return False
return COMMIT_VERSION.match(self.string) is not None

which is redundant if we had a GitVersion object directly.

Currently:

In [1]: a = spack.version.Version('1.0.0')

In [2]: b = spack.version.Version('1.0.1')

In [3]: %timeit a < b
2.41 µs ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Before #24639:

In [3]: %timeit a < b
623 ns ± 6.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So, version comparison got 4x slower.

@haampie haampie added bug Something isn't working triage The issue needs to be prioritized labels Oct 4, 2021
@haampie haampie changed the title Version is no longer totally ordered Version is no longer totally ordered and comparison speed regressed Oct 4, 2021
@haampie
Copy link
Member Author

haampie commented Oct 4, 2021

Note that in #22973 we were happy with a 2x speedup, now we have a 4x slowdown :(

@alalazo
Copy link
Member

alalazo commented Oct 4, 2021

For the record, version comparison affects in non-negligible ways the "setup" time of each concretization (it is needed to compute a lot of facts about packages and preferences)

@haampie
Copy link
Member Author

haampie commented Oct 4, 2021

So, if instead we make the spec parser create GitVersion's we pay the cost of the regex only once, and otherwise there's just the cost of dispatching to the right __lt__ implementation depending on the types of the arguments (and get a fast path for Version < Version comparisons), which as far as I understand is slow in Python anyways, but sure it must be faster than executing a regex...

(waiting for someone to mention memoization on is_commit 😆...)

@alalazo
Copy link
Member

alalazo commented Oct 4, 2021

A few random thoughts that almost date back to #1997 The semantic of versions is package specific, in the sense that it makes little sense to compare versions across packages (what is the meaning of comparing @1.3 in mpich with @4.5 in openmpi?). Given that, we may think of having, longer term, more version classes implementing different semantics - rather than a single class that tries to fit them all. This can possibly trim the number of checks or edge cases we need to have in each class, since we can take advantage upfront of knowing that e.g. a package follows semver.

For version semantic like the one in git commits (that can be used in any package that has an underlying git repo), we can maybe define a literal tag (similar to 0x before writing hex numbers) and know by that that we are talking about a commit.

@haampie
Copy link
Member Author

haampie commented Oct 4, 2021

@alalazo I think I know what you're trying to say with "what is the meaning of comparing @1.3 in mpich with @4.5 in openmpi?" but clearly nobody is going to compare versions across packages, that never comes up.

I guess what you're getting at is that every package can impose a different total order on a different set of versions S:

  1. By default (S = {Version}, <); lexicographical on the Version tuple of integers and strings with special handling of element-wise tuple < involving strings
  2. An explicit order: (S = ['abc', 1.0, 'def', -5], <) where a < b for a, b in S iff S.index(a) < S.index(b)
  3. (S = {Version + set of git commit sha's}, <); ordering of 1 on Version < Version, and some special care for commit sha < Version and commit sha < commit sha.

... and both S and < may depend on the package in some cases.

In principle VersionRange could be implemented just knowing it takes two elements of S and the binary operator <, but because we've chosen a weird open-ended range for Version, it also needs some function called next: S -> S:

x:y contains all elements s in S such that s >= x and s < next(y) where next(Version(a, b, c)) := Version(a, b, c+1) with order (1), or next('def') = -5 in order (2)

@alalazo
Copy link
Member

alalazo commented Oct 4, 2021

but clearly nobody is going to compare versions across packages, that never comes up.

That's exactly my point. We don't need to be able to compare arbitrary Version objects to each other (which would be a valid reason to have a single class representing a version in Spack's codebase). We just need a total order for objects referring to the same package.

For the rest that's where I'd like to arrive. A relevant example of 2 for instance are C++ versions where 98 < 03 < 11 < 14 < 17 < 20.

@trws
Copy link
Contributor

trws commented Nov 8, 2021

I was planning to start another issue on this, may well use this comment as a seed for it anyway, but one significant issue with the current situation is that it's horrifyingly broken for some common git usage models. The main one being the infamous git-flow, which we use on the raja project, and is used by a number of others.

By design the commits that represent any given tag are not reachable from develop commits, or vice-versa. There is an actual ordering to them, but only by merging into release branches from develop and from release branches to main, tagging the merge commit that references the release branch.

For all intents and purposes, that means that all git revisions on develop after the most recent release are treated as a version below zero. All commits that are between two tags are also treated as a negative version because none of their parents are tagged, their parents are on release branches or develop. The current model works only with exactly the tagged hashes, everything else is completely divorced from the order.

As far as I can tell, there's no way for a package to deal with these, other than defining something generic with when="@:0" and hoping. It would really help if ordering were imposed on hashes, even if they are not directly linked.

@trws
Copy link
Contributor

trws commented Mar 3, 2022

So, you mentioned memoization... Much as I would love to see the many, many corner cases go away. The implementation tolerates this one much better with a few tweaks, like doing a string length comparison with 40 before paying to match a regex that must be exactly 40 characters long. PR in the works, but this is the progress so far:

script:

from IPython import get_ipython
from spack.version import Version as V
import spack.repo as repo
ipython = get_ipython()
a=V('1.0.0')
b=V('1.0.1')
print('normal versions')
ipython.magic("%timeit a<b")
a=V('1f3a4cf7f8210e6cb50db9c193f1e843d1fc0ec4')
a.generate_commit_lookup(repo.get('raja'))
b=V('357933a42842dd91de5c1034204d937fce0a2a44')
print('hashes')
ipython.magic("%timeit a<b")

before:

normal versions
1.76 µs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
hashes
8.96 µs ± 58.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

after:

normal versions
668 ns ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
hashes
719 ns ± 51.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The timing for commit versions assumes the repository has already been downloaded, without that the whole thing gets skewed like crazy by the download time.

@haampie
Copy link
Member Author

haampie commented Mar 3, 2022

Yeah, that's a hacky fix though. It'd be better to just have a GitVersion object. Again, git versions are the exception.

@trws
Copy link
Contributor

trws commented Mar 3, 2022

I agree, but the two vaguely doable options for that both have some nasty downsides when I look at the required refactoring.

  1. Make Version a function that dispatches as appropriate. It's used in too many places to easily split it, so this would work, it's what ver is supposed to be, but way too many places already use Version directly, so we're kinda stuck with it.
    • downside: all places using isinstance(..., Version) now break. That is order 20 rather than order hundreds, so it might be doable, but it's in a lot of critical places and might be tricky
  2. Make Version reject commit versions, refactor everywhere a commit version could show up to call ver.
    • downside: we have to pay to reject them in Version and do a large refactor to make this happen

My actual preference right now is to make it work in Version by adding a new class alongside VersionStrComponent, possibly VersionCommitComponent or similar, since it would give us the same benefits of splitting the classes and specialization so we only really pay for commit versions when we find them, while avoiding the massive refactor of a top level Version class.

@trws
Copy link
Contributor

trws commented Mar 3, 2022

Possibly even replace the version tuple with a commit-version class actually, that would probably be less overall divergence.

@alalazo alalazo added this to the v0.20.0 milestone Apr 6, 2023
@haampie haampie closed this as completed May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage The issue needs to be prioritized
Projects
None yet
Development

No branches or pull requests

5 participants