Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TreeSet.map losing its ordering when used as Iterable #11987

Open
jrudolph opened this issue May 8, 2020 · 16 comments
Open

TreeSet.map losing its ordering when used as Iterable #11987

jrudolph opened this issue May 8, 2020 · 16 comments

Comments

@jrudolph
Copy link

jrudolph commented May 8, 2020

(Seems like this has to be a well-known issue/puzzler but I couldn't find one with superficial searching)

It seems that TreeSet is not working as expected when it is used as an Iterable.

Welcome to Scala 2.13.1 (OpenJDK 64-Bit Server VM, Java 1.8.0_252).

scala> import scala.collection.immutable.TreeSet
import scala.collection.immutable.TreeSet

// as expected
scala> TreeSet[Long](5,3,2,12,38,2,6,2).map(identity).mkString(", ")
res0: String = 2, 3, 5, 6, 12, 38

// unexpected
scala> (TreeSet[Long](5,3,2,12,38,2,6,2): Iterable[Long]).map(identity).mkString(", ")
res1: String = 5, 6, 38, 2, 12, 3

The underlying assumption is that for all i: Iterable[T] where the implementing collection is sorted this property should be valid:

i.toList == i.map(identity).toList

Without the constraint "where the implementing collection is sorted", it is not expected to be true because Iterable can be implemented for unsorted collections.

Above I used the vague phrase "not working as expected" because the behavior is not explicitly specified (TreeSet does not specify how it implements Iterable which is a common problem of inheritance) and in absence of a specification it should do the least surprising thing.

Was observed here: spray/spray-json#329

@swsnr
Copy link

swsnr commented May 8, 2020

I'd like to point out that i.toList == i.map(identity).toList wouldn't help with the spray-json issue you linked, because the affected code in spray-json maps into another domain (JSON values) so it's not really .map(identity).

For spray json you'd actually need something along the lines of "for all pure functions f: A => B and all i:

i.toList.map(f) == i.map(f).toList

In other words map should preserve the ordering of the source domain A and not order by the ordering of the target domain B.

I'm not sure whether that's better than the current behaviour: In terms of B the collection would still be unordered, just in a different way than it is now.

That said having (TreeSet[Long](5,3,2,12,38,2,6,2): Iterable[Long]).map(f) move to a hashed Set which has no real order at all is somewhat surprising 🙂

@jrudolph
Copy link
Author

jrudolph commented May 8, 2020

I'd like to point out that i.toList == i.map(identity).toList wouldn't help with the spray-json issue you linked, because the affected code in spray-json maps into another domain (JSON values) so it's not really .map(identity).

There are two map methods on TreeSet, one that takes an ordering of the codomain to return another TreeSet. Then the one inherited from Iterable. I'm only talking about the one inherited from Iterable.

There's no way to tell the map method inherited from Iterable what the ordering of the codomain should be so it returns a HashSet.

I guess the reason it was done that way was to keep compatibility with the previous CanBuildFrom construction from Scala 2.12 and before. The point of CanBuildFrom was to return "the right thing implicitly" (which was hotly debated).

There are a few reasons that the current behavior might never change:

  • There might not be a consensus that the behavior is bad
  • Even if there were, it cannot be changed due to binary compatibility constraints that are particularly hard to fix because of the inheritance hierarchy.

The workaround for Sets is to mostly ignore that Sets inherit from Iterable but in particular the domain changing methods like map and rather the use the methods from the iterator it returns. I guess there could be something like a Set.toSafeIterable that returns a plain Iterable on which methods will work without surprises.

def toSafeIterable[T](s: Set[T]): Iterable[T] = new Iterable[T] { def iterator: Iterator[T] = s.iterator }

@SethTisue
Copy link
Member

in the same vein: scala/scala-parallel-collections#101 (but the thing lost is parallelism instead of ordering)

@swsnr
Copy link

swsnr commented May 8, 2020

This problem exists on 2.12 and 2.13; we hit the original spray issue with both Scala versions.

The linked parallel collections issue is 2.13 only.

@Jasper-M
Copy link

Jasper-M commented May 8, 2020

IIUC the only fix for this would be to add overloads of map to Iterable just for this use case. Simplified:

def map[B](f: A => B): Iterable[B]
def map(f: A => A)(implicit d: DummyImplicit): Iterable[A]

Otherwise you can never distinguish between the case where you can keep the ordering and where you can't. If you want to keep getting a TreeSet back even for the A => B case you'd have to add the overload that takes an Ordering to Iterable which would make even less sense.

@som-snytt som-snytt changed the title TreeSet.map losing it's ordering when used as Iterable TreeSet.map losing its ordering when used as Iterable May 10, 2020
@julienrf
Copy link

The problem comes from the fact that TreeSet has several overloads of map:

Screenshot from 2020-05-11 09-25-33

When you call map, the compiler selects the overload that returns a TreeSet, because it is “more specfiic” by the overload resolution rules.

However, if you upcast the TreeSet to Iterable, then the compiler sees only one map operation, which returns an unsorted set.

On Scala 2.12, the situation was different but the result was the same (the source collection type was part of the CanBuildFrom argument to compute the resulting collection).

Unfortunately, I don’t see how we could fix this. I think we should live with this issue but maybe document it better?

@jrudolph
Copy link
Author

jrudolph commented May 11, 2020

There's also this issue:

scala> (TreeSet(5,3,2,12,38,2,6,2): Iterable[Int]).map(_ => 5).toList
res1: List[Int] = List(5)

So, there we have an it: Iterable[T] where this isn't true for all f:

it.size == it.map(f).size

@julienrf
Copy link

So, there we have an it: Iterable[T] where this isn't true for all f:

it.size == it.map(f).size

This property is not true for all iterables, only for seqs.

@swsnr
Copy link

swsnr commented May 11, 2020

Shouldn't iteration be size- and order-preserving? At least that's my perhaps naive intuitive expectation 🤷

@julienrf
Copy link

This is not the case on Iterable, because sets and maps don’t necessarily have a specific iteration order.

@swsnr
Copy link

swsnr commented May 11, 2020

Still, I find it surprising if order gets lost for types which are ordered initially, and I'm also surprised that it.size == it.map(f).size doesn't hold for all iterables. Looks like I assumed a lot of things about iterables which aren't true in the general case; I'd love to see this documented in the scaladoc for Iterable.

@Jasper-M
Copy link

I find it surprising if order gets lost for types which are ordered initially, and I'm also surprised that it.size == it.map(f).size doesn't hold for all iterables

Though it would be impossible for both those expectations to hold. You can't have it.map(_ => 5) return a TreeSet and always preserve the size.

@jrudolph
Copy link
Author

This is not the case on Iterable, because sets and maps don’t necessarily have a specific iteration order.

Technically, that's probably right because descriptions of the operators are vague enough to allow that behavior:

E.g. Iterable.map:

Builds a new iterable collection by applying a function to all elements of this iterable collection.

It only says that the function is called for every element of the source collection and that a new collection is built but not what the exact relationship between source and target collections are.

For Iterable.collect on the other hand, it does say that the order is preserved:

a new iterable collection resulting from applying the given partial function pf to each element on which it is defined and collecting the results. The order of the elements is preserved.

This is just false but shows how unclear the semantics of Iterable are even to the authors. Incidentally, filter and filterNot also disagree about whether ordering should be preserved or not.

That makes Iterable a pretty useless abstraction. It means you mustn't accept Iterable as arguments in your APIs or things will break eventually.

Once again, look at that implementation of toSafeIterable which contains the essence of iteration:

def toSafeIterable[T](s: Iterable[T]): Iterable[T] =
  new Iterable[T] { def iterator: Iterator[T] = s.iterator }

This looks like a no-op but it isn't and that tells a lot.

To add a possible explanation of the status quo: since Scala collections are eager, the problem with the generic methods that need to return another collection type are:

  • each eager method (like map) needs to return some eager holding collection. But which one should that be by default and in more specific cases? (The new Iterable snippet above will default to List. Before 2.13 it was statically determined by CanBuildFrom, with 2.13 it is dynamically determined by factories.)
  • you want to avoid extra copying (i.e. not use toXYZ before and after the operation to be safe but go directly to the resulting collection)

So for now, the safe option is to

  • avoid passing Set or Map anywhere where Iterable is expected
  • when accepting an Iterable, never use the methods on Iterable directly but go through Iterator and build the final collection explicitly using toXYZ before returning the result

@julienrf
Copy link

For Iterable.collect on the other hand, it does say that the order is preserved:

a new iterable collection resulting from applying the given partial function pf to each element on which it is defined and collecting the results. The order of the elements is preserved.

This is just false but shows how unclear the semantics of Iterable are even to the authors. Incidentally, filter and filterNot also disagree about whether ordering should be preserved or not.

Ouch, this is a problem, indeed!

That makes Iterable a pretty useless abstraction. It means you mustn't accept Iterable as arguments in your APIs or things will break eventually.

I disagree. As a user, you can still use fold-like operations on iterables without issues. Also, you can flatten a Set[List[Int]] into a Set thanks to the fact that both share a common collection type (IterableOnce, in this case).

I disagree with your conclusion. I think we should fix this problem by better documenting the behavior of Iterable (and fixing the errors in the current documentation), to clearly state that transformation operations on Iterable don’t preserve the order of elements.

@jrudolph
Copy link
Author

I disagree. As a user, you can still use fold-like operations on iterables without issues. Also, you can flatten a Set[List[Int]] into a Set thanks to the fact that both share a common collection type (IterableOnce, in this case).

Maybe I wasn't accurate enough. Iterable may have some useful properties or features in its implementation. In that way it is useful. What I meant that accepting an Iterable as a library writer is not useful because it gives so little guarantees that I cannot safely interface with it. Mixing some useful with some dangerous operations doesn't help in that regard.

you can flatten a Set[List[Int]] into a Set

That's another example where the features of the concrete types (that happen to be implemented by Iterable) can be useful, but if you happen to come across an instance of such a type disguised as an Iterable[Iterable[Int]] or similar and you call flatten on this in a library you will be up for surprises.

@NthPortal
Copy link

What I meant that accepting an Iterable as a library writer is not useful

That tends to be correct, yes. Generally, you'll either want an IterableOnce (since you only need to iterate over it once to do whatever you're doing) or a more specific collection type (SeqMap, Set, ect.) because you want more specific properties. It is uncommon that the only property you care about is that it's iterable more than once.

@dwijnand dwijnand added this to the Backlog milestone Nov 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants