What are the actual algorithms we should use? #70

Closed
bakkot opened this issue Feb 11, 2022 · 28 comments

@bakkot
Collaborator

bakkot commented Feb 11, 2022

Consider intersection. The most efficient way to implement it is to choose the smaller of the two sets and then iterate it, checking membership in the other set. Are we going to specify that, with all the calls to size and has and so on that it entails? Are we going to specify something observably different from that, so that worse time complexity is observably required?

For a more complex case, here's CPython's implementation of difference.

This is only relevant if we decide that methods like intersection should make any observable method calls on their receivers or arguments (though either is enough for this to be relevant; for example, if we decided to perform has calls on arguments, but not receivers). If we say that all methods operate on the internal slot directly, this isn't relevant.
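
For concreteness, here's a rough sketch of the "iterate the smaller side" approach in plain JS (illustration only, not proposed spec text); the question is which of the size, has, keys, and next calls below should be observable:

function intersection(thisSet, other) {
  const result = new Set();
  // Probe the larger side while iterating the smaller one,
  // so the work is O(min(n, m)) lookups rather than O(m).
  if (thisSet.size <= other.size) {          // observable size access on the argument
    for (const value of thisSet) {
      if (other.has(value)) result.add(value); // observable has calls on the argument
    }
  } else {
    for (const value of other.keys()) {        // observable keys/next calls on the argument
      if (thisSet.has(value)) result.add(value);
    }
  }
  return result;
}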

@zloirock
Contributor

Arguments of methods from this proposal are iterables, not sets, so we can't get their size.

@bakkot
Collaborator Author

bakkot commented Feb 12, 2022

That question is also not yet decided. I think it would be pretty weird to force them to be consumed as iterables when the common case is that you're passing a Set, since treating the argument as a Set allows the use of algorithms with better time complexity.

@zloirock
Contributor

zloirock commented Feb 12, 2022

At the least, it's useful to pass, for example, arrays to such methods, so limiting them only to sets is a bad idea - even for the sake of performance. If you have 2 sets and want to optimize performance, you can manually choose the bigger one and call .intersection on it.

@zloirock
Contributor

since treating the argument as a Set allows the use of algorithms with better time complexity

It's O(n) anyway (if we consider checking the existence of a Set element to be O(1)), since it's required to iterate one of the collections.

@bakkot
Collaborator Author

bakkot commented Feb 12, 2022

It's anyway O(n)

There are two relevant parameters: the size of the receiver and the size of the argument. Call them n and m respectively. Assuming constant-time set membership testing, intersecting a Set with a Set is O(min(n, m)). Intersecting a Set with an iterable is O(m). That matters when the receiver is small and the argument is large.

To put it another way: intersecting a one-element Set with a thousand-element Set requires one set lookup. Intersecting a one-element Set with a thousand-element iterable requires a thousand set lookups. It is not obvious to me that we should specify the option which is a thousand times worse, here.

@zloirock
Contributor

However, in computer science, both of those cases have the same linear time complexity.

@bakkot
Collaborator Author

bakkot commented Feb 12, 2022

No, O(min(n, m)) is not the same as O(m).

@zloirock
Contributor

Which of those cases is sublinear or requires more than linear time?

@zloirock
Contributor

Nobody disputes that by choosing to iterate over the smaller collection you can optimize the method, but this has nothing to do with time complexity.

@bakkot
Collaborator Author

bakkot commented Feb 12, 2022

It doesn't really make sense to talk about whether something is sub-linear when it's a function of multiple variables. O(min(n, m)) is not the same time complexity as O(m).

@zloirock
Contributor

In this case, it makes no sense to talk about time complexity that way. This is definitely not a case where having multiple variables affects it.

@bakkot
Collaborator Author

bakkot commented Feb 12, 2022

OK, let's be precise about this, I guess.

There are a few different definitions people use for big-O for multiple variables. They should all work out to be about the same here, but the one I generally use is the one given in (3.1) of this paper, i.e.:

For f ∈ ℝ × ℝ → ℝ, O(f(n, m)) is the set of all functions g ∈ ℝ × ℝ → ℝ for which there exist c ≥ 1 ∈ ℝ and N ∈ ℕ such that for all n ≥ N and m ≥ N, g(n, m) ≤ c * f(n, m) and hat(g)(n, m) ≤ c * hat(f)(n, m).

(The hat operator is a technical thing which doesn't end up being relevant here; we can just deal with the first part, g(n, m) ≤ c * f(n, m).)

I claim that O(min(n, m)) is a strict subset of O(m).

Containment is trivial: you can just reuse the same c and N. That is, if we suppose a function g ∈ O(min(n, m)), that means there exist c and N such that for all n ≥ N and m ≥ N, g(n, m) ≤ c * min(n, m). Since min(n, m) ≤ m by definition, this implies g(n, m) ≤ c * m.

To show the containment is strict, consider g(n, m) = m. This is in O(m) trivially. It is not in O(min(n, m)): for any choice of c and N, set n = N and m = c * N + 1. This satisfies n ≥ N and m ≥ N, but we have g(n, m) = m = c * N + 1 and c * min(n, m) = c * min(N, c * N + 1) = c * N, so g(n, m) is not ≤ c * min(n, m). ∎

@zloirock
Contributor

I don't understand - what are you trying to show? Sure, I read this document many years ago. Iterating n times and 2n times - both have linear time complexity. The same with iterating over all n or all m elements - there is no difference, even though min(n, m) <= m, which is obvious.

@bakkot
Collaborator Author

bakkot commented Feb 12, 2022

I don't understand, what are you trying to show?

What I am trying to show - and have shown, in fact - is that O(min(n, m)) is strictly better time complexity than O(m).

@zloirock
Contributor

...have tried to show. Big O describes the worst case - and the worst case for both is the same.

@Ginden
Collaborator

Ginden commented Feb 12, 2022

We can also go the way of Array.prototype.sort if we decide not to use .has checks, as per #50.

  1. Sort items using an implementation-defined sequence of calls to SortCompare. If any such call returns an abrupt completion, stop before performing any further calls to SortCompare or steps in this algorithm and return that completion.
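
Analogous wording for intersection could look something like this (just a sketch of the shape such a step might take, not proposed spec text):

  1. Determine the elements of the result using an implementation-defined sequence of calls to the has and keys methods of this and other. If any such call returns an abrupt completion, stop before performing any further calls or steps in this algorithm and return that completion.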

@bakkot
Collaborator Author

bakkot commented Feb 12, 2022

...have tried to show. O shows the worst case - and the worst case for them is similar.

No, it's not. I wrote out a proof. Was part of it unclear?

We can also go with way of Array.prototype.sort

Yeah, I considered this. I suspect there will not be much appetite for adding more implementation-defined stuff to the spec - I don't love the idea myself - but it is technically an option.

@zloirock
Contributor

No, it's not. I wrote out a proof. Was part of it unclear?

Are you serious? You posted the first search result about asymptotic notation (a perfect proof, but of what?). You posted a good proof that min(n, m) <= m, which is obvious. However, the big O definition means asymptotic bounding from above - the worst case (as the variables go to infinity) - and for both the worst case is the same.

@bakkot
Collaborator Author

bakkot commented Feb 12, 2022

You posted a good proof that min(n, m) <= m

The claim was that O(min(m, n)) is a strict subset of O(m), not that min(n, m) <= m. In particular, min(n, m) is O(m) but m is not O(min(m, n)). This is what it means to have better time complexity.

However, the big O definition means asymptotic bounding from above

Yes, which is formalized in the definition I referred to in the proof. And with that definition you can show that they are not the same complexity class.

I don't think I'm capable of explaining this in a way which is going to be clear to you, so I'm going to step away from this conversation now.

@acutmore

Maybe there is a middle ground?

function intersection(setA, setB) {
  let primary = {
    has: setA.has.bind(setA),
    it: setA[Symbol.iterator](),
  };
  let secondary = {
    has: setB.has.bind(setB),
    it: setB[Symbol.iterator](),
  };
  const retVal = new Set();
  for (;;) {
    const {done, value} = primary.it.next();
    if (done) break;
    if (!primary.has(value)) throw new Error("Not set-like behavior");
    if (secondary.has(value)) retVal.add(value);
    [secondary, primary] = [primary, secondary]; // switch sets for the next iteration of the loop
  }
  return retVal;
}
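
For instance, with this sketch (and assuming set membership checks are O(1)), intersection(new Set([1, 2, 3]), new Set([9, 3, 1])) and intersection(new Set([9, 3, 1]), new Set([1, 2, 3])) both produce a set containing 1 then 3, and the alternation means the loop stops after roughly 2 * min(n, m) iterations, once the smaller side's iterator is exhausted.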

@bakkot
Collaborator Author

bakkot commented Jul 10, 2022

@acutmore That algorithm doubles the number of calls required, which seems bad.

@acutmore

Yes, it sacrifices some performance while keeping O(min(n, m)). The intended benefits are that the resulting set order doesn't flip when the argument size changes, and that it can catch a Map being passed as an argument (as long as this.size > 1, though that could be improved).

Only a suggestion - perhaps the performance cost outweighs the benefits.

@bakkot
Collaborator Author

bakkot commented Jul 10, 2022

Oh, the point about order is a good one; it might be worth the cost of the extra calls to get a more consistent order.

I'm unconvinced by the extra primary.has(value) call, though; it's pretty unusual for JS methods to do extra calls which do nothing except establish consistency. Usually we just say it's the caller's responsibility to ensure consistency.

Edit: though it might be possible to get consistent order without this, depending on what the implementation of Set actually is; I'll have to check with implementors.

@bakkot
Collaborator Author

bakkot commented Jul 11, 2022

After some discussion about the actual implementation of Sets, I am now pretty sure we can specify that the order of the result should be the order of the elements in this without needing to do extra calls and without significant overhead. (It does add an O(log(size of result)) factor overhead to the theoretical performance, but no user-observable calls, so I think it's worth it.)

We'd still have the problem that the choice of which methods in other to call would depend on the sizes of the inputs, but I'm personally OK with that; the implementor of the Set-like is responsible for ensuring that doesn't matter.
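
To illustrate the shape of that, a rough sketch (insertionIndexIn is a hypothetical stand-in for the insertion index an engine can read off its internal table; it is not reachable from userland JS):

function orderedIntersection(receiver, other) {
  // Iterate the smaller side, probing the larger one, as before.
  const [small, large] = receiver.size <= other.size ? [receiver, other] : [other, receiver];
  const matches = [];
  for (const value of small) {
    if (large.has(value)) matches.push(value);
  }
  // Restore the receiver's order by sorting on the receiver's insertion index.
  // insertionIndexIn is hypothetical: a real engine reads this off its hash table.
  // This is the O(r log r) step, with r = matches.length, and it makes no
  // user-observable calls.
  matches.sort((a, b) => insertionIndexIn(receiver, a) - insertionIndexIn(receiver, b));
  return new Set(matches);
}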

@bakkot
Collaborator Author

bakkot commented Sep 17, 2022

The current spec text uses the time-complexity optimal algorithm for intersection and requires that the result be ordered according to the order in the receiver, which should be possible to do efficiently per the previous comment.

@bakkot bakkot closed this as completed Sep 17, 2022
@brad4d

brad4d commented Jan 12, 2023

I feel I should point out that polyfills won't have access to the data that makes sorting of the intersection result efficient.

Should we recommend that they just take the performance hit and always iterate over the original set?

@bakkot
Collaborator Author

bakkot commented Jan 12, 2023

Up to each polyfill what it wants to do, but that's what I'd do. It's pretty common for new language features to be impossible to polyfill performantly.
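
A minimal sketch of that polyfill approach - iterate the receiver and probe the argument, so the result order falls out for free at the cost of O(n) has calls, with n the receiver's size:

function intersectionPolyfill(receiver, other) {
  const result = new Set();
  for (const value of receiver) {            // O(n) in the receiver, preserving its order
    if (other.has(value)) result.add(value);
  }
  return result;
}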

@ljharb
Member

ljharb commented Jan 12, 2023

That's what my polyfills have done, given that there's no other choice. In practice, my experience tells me most Sets won't be prohibitively large anyways.
