Skip to content
This repository has been archived by the owner on Jul 8, 2024. It is now read-only.

Set.prototype.symmetricDifference spec has unexpected behavior when given iterable containing duplicate elements #56

Closed
nanaze opened this issue Jan 30, 2019 · 17 comments
Labels

Comments

@nanaze
Copy link
Contributor

nanaze commented Jan 30, 2019

noted while writing polyfill in Closure cl/229848386 (see review comments)

See
https://tc39.github.io/proposal-set-methods/#Set.prototype.symmetricDifference

In 5. 10. d. -- you're calling "remover" on newSet and using the return value to indicate whether the value is in the original set. However, since you're comparing to an iterable (which can have duplicate values) and not a set (which cannot), it means a future value of nextValue may see the same value, which you then use with remover and get a "false" value of removed, erroneously adding it back to newSet.

Instead, you should not depend on the value of newSet.delete() but independently call this.has(value) to see if it is in the original set.

Requesting you to confirm my logic. Feel free to follow up with questions.

@jplaisted
Copy link

To give an example, we can roughly implement this as

function symmetricDifference(set, iterable) {
  const newSet = new Set(set);
  for (const e of iterable) {
    const removed = newSet.delete(e);
    if (removed) newSet.add(e);
  }
  return newSet;
}

And this has unexpected behavior when the iterable has an even number of duplicate elements.

symmetricDifference(new Set(), [0]); // (0), correct
symmetricDifference(new Set(), [0, 1]); // (0, 1), correct
symmetricDifference(new Set(), [0, 1, 1]); // (0), incorrect
symmetricDifference(new Set(), [0, 1, 1, 1]); // (0, 1), correct
// etc

// Also unexpected if the set also contains the element...
symmetricDifference(new Set([1]), [0]); // (1, 0), correct
symmetricDifference(new Set([1]), [0, 1]); // (0), correct
symmetricDifference(new Set([1]), [0, 1, 1]); // (0, 1), incorrect
symmetricDifference(new Set([1]), [0, 1, 1, 1]); // (0), correct
//etc

Note the the spec'd algorithm is correct if the iterable does not have any duplicate elements, which means it is correct for two sets.

@ljharb
Copy link
Member

ljharb commented Jan 30, 2019

The implication is that if the argument can be any iterable, that an internal Set must be used to dedupe anyways (either alongside, or, by converting the iterable to a Set). This relates to #50.

@bakkot
Copy link
Collaborator

bakkot commented Jan 30, 2019

I don't actually that this is "wrong", because symmetric difference is not normally defined in contexts where there can be duplicates, but it's inconsistent with e.g. python:

>>> set([]).symmetric_difference([0, 0])
{0}

#51 fixes this, mostly by accident.

@nanaze
Copy link
Contributor Author

nanaze commented Jan 30, 2019

I don't actually that this is "wrong", because symmetric difference is not normally defined in contexts where there can be duplicates

In that case, it's problematic for the signature of the method to take an iterable, given that a iterable, unlike a set, can possibly and is likely to contain duplicate values. It is weird for behavior to undefined in some cases of valid inputs. For the method to fail in this way would be surprising to the developer and the source of subtle bugs.

I would hope that there would be some form of test case to check this behavior when implemented.

@bakkot
Copy link
Collaborator

bakkot commented Jan 30, 2019

It is weird for behavior to undefined in some cases of valid inputs.

Sorry, to be clear, the mathematical operation called "symmetric difference" is not generally defined for collections which can contain duplicates. The behavior of the function proposed here of course will be defined.

I agree that the current behavior is unnecessarily surprising, and should probably behave the way you labelled "correct". It's just not wrong, per se, because it's a case the mathematical operation doesn't really cover.

@nanaze nanaze changed the title possible bug in Set.prototype.symmetricDifference spec Set.prototype.symmetricDifference spec has unexpected behavior when given iterable contains duplicate elements Jan 30, 2019
@nanaze
Copy link
Contributor Author

nanaze commented Jan 30, 2019

The implication is that if the argument can be any iterable, that an internal Set must be used to dedupe anyways (either alongside, or, by converting the iterable to a Set). This relates to #50.

That is a solution. There may be a way to chain iterables/iterators to avoid creation of yet-another set -- I will contemplate this.

@gsathya
Copy link
Member

gsathya commented Jan 30, 2019

I agree that this is a bug and the intention is to support an iterable as an argument, not just a Set.

@Ginden Ginden added the bug label Feb 1, 2019
@Ginden
Copy link
Collaborator

Ginden commented Feb 1, 2019

I see two options:

  • dedupe iterable before iteration
  • throw error

I'm not sure which one is better, because obviously, symmetric difference doesn't make sense for iterables with non-unique members.

@bakkot
Copy link
Collaborator

bakkot commented Feb 1, 2019

I'd lean towards deduping personally.

(Doesn't have to be before iteration, though. Again, see #51, which mostly-incidentally fixes this.)

@nanaze
Copy link
Contributor Author

nanaze commented Feb 1, 2019

  • throw error

I would find this behavior really undesirable.

@nanaze
Copy link
Contributor Author

nanaze commented Feb 1, 2019

It seems to me that permitting iterables as parameters on these methods is a convienence to the caller to accept a liberal set of inputs, rather than forcing to a set before calling. This mirror's Python behavior with the set methods, but not the set operators, which require both operands to be sets.

>>> a = set([1,2])
>>> a.union([3])
set([1, 2, 3])
>>> a | set([3])
set([1, 2, 3])
>>> a | [3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for |: 'set' and 'list'

There are also obvious benefits to accepting an iterable -- at no point do all elements in the iterable need to be in an in-memory set, which is a good property for iterables with a very large numbers of elements.

Presumably, this isn't much of a problem for implementing union, intersection, and difference, in which only one of the two containers needs to be hashable for lookups. I suspect this is not true for symmetricDifference (I would love to be proven wrong here). So the options I see are to, as mentioned above, to "dedupe" (which I take to mean creating a set from the iterable), which sadly loses the property described in the previous paragraph, or to tighten the parameter type to a Set, which makes symmetricDifference an outlier from the other operations, but at least makes it clear that the operation cannot be done with the same space efficiency as the others.

@ljharb
Copy link
Member

ljharb commented Feb 1, 2019

I question that most users of these APIs will be thinking about space efficiency as much as they'll be thinking about value uniqueness.

@nanaze
Copy link
Contributor Author

nanaze commented Feb 1, 2019

I question that most users of these APIs will be thinking about space efficiency as much as they'll be thinking about value uniqueness.

I'm largely ambivalent about which route to take save for the fact that there is an implicity about the method taking an iterable and the space/time performance. Either route forward is fine with me -- I just want to illuminate the implications for the sake of decision making.

@nanaze nanaze changed the title Set.prototype.symmetricDifference spec has unexpected behavior when given iterable contains duplicate elements Set.prototype.symmetricDifference spec has unexpected behavior when given iterable containing duplicate elements Feb 2, 2019
@tbondwilkinson
Copy link

tbondwilkinson commented Feb 12, 2019

I believe we could implement this with an iterable without storing iterable in memory:

function symmetricDifference(set, iterable) {
  const newSet = new Set(set);
  for (const e of iterable) {
    if (set.has(e)) {
      newSet.delete(e);
    } else {
      newSet.add(e);
    }
  }
  return newSet;
}

Takes advantage of the fact that we never delete from the original set.

@nanaze
Copy link
Contributor Author

nanaze commented Feb 12, 2019

(disclosure -- @tbondwilkinson and I know each other and have discussed this outside this thread)

My biggest concern is "flattening" iterable -- in the other methods, it's a desirable quality to not flatten an iterable as it might be backed by I/O and a giant data set -- "this" is already in memory, so it probably is conceivably "flattenable".

If @tbondwilkinson 's suggestion passes test caes, I believe it would be much preferable to flattening iterable.

@nanaze
Copy link
Contributor Author

nanaze commented Feb 12, 2019

see also some sanity case runs: https://codepen.io/anon/pen/KJRrzP?editors=0011

@bakkot
Copy link
Collaborator

bakkot commented Sep 17, 2022

I've just rewritten the spec text; the new algorithm does not have this problem. It takes the essentially approach mentioned in the above comment, though now the calls to add and delete operate on a spec-internal list and so are not observable; it's observably equivalent to copying the argument into a new Set and then intersecting that with the receiver.

@bakkot bakkot closed this as completed Sep 17, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

7 participants