Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterator.toIndexedSeq calls hasNext() multiple times in Scala 2.13.0-RC1 (preview) #11455

Open
xerial opened this Issue Mar 28, 2019 · 20 comments

Comments

Projects
None yet
6 participants
@xerial
Copy link

commented Mar 28, 2019

Problem:

  • Iterator.toIndexedSeq skips the first element if Iterator.hasNext changes some internal state.
  • Iterator's hasNext() will be called multiple times, so if the code is expecting that hasNext() will be called only once and changing some internal state, some elements will be skipped in the generated IndexedSeq.
  • This behavior is introduced after Scala 2.13.0-M5.
  • Iterator.haxNext() implementation that are not idempotent will produce broken results with this change, even though ideally all of the implementations of Iterator.haxNext() should be idempotent.

Code Example

class MyIterator extends Iterator[Int] {
   private var i = 0
   def hasNext: Boolean = {
      val flag = i < 2
      i += 1
      flag
   }
   def next:Int = i
}

object Example {
  def main(args:Array[String]): Unit = {
    val s = new MyIterator().toIndexedSeq
    println(s.mkString(", ")) // Expected 1,2, but 2.13.0-pre-59975bb produces 2
  }
}

This code is in https://github.com/xerial/scala-bug-repro

How to reproduce:

$ git clone https://github.com/xerial/scala-bug-repro
$ cd scala-bug-repo
$ sbt run
...
2 

For the other Scala versions

$ sbt 
> ++ 2.12.8!
> run
1, 2

This behavior is found at #11453

Cause

VectorBuilder.addAll calls the Iterator.hasNext to check whether it should advance to the next block or not:

while (it.hasNext) { // The first call is here
  advanceToNextBlockIfNecessary()
  lo += it.copyToArray(xs = display0, start = lo, len = display0.length - lo) // The next call happens here
}

Related change: scala/scala#7588

@Sciss

This comment has been minimized.

Copy link

commented Mar 28, 2019

It should probably be said though that the contract of Iterator does not forbid hasNext to be called any number of times. I would say if your iterator's behaviour changes if hasNext is called multiple times, it is a bug in your code.

@xerial

This comment has been minimized.

Copy link
Author

commented Mar 28, 2019

@Sciss There has been no guarantee nor any specification that Iterator.hasNext() must return the same result for subsequent calls before calling next() https://docs.oracle.com/javase/8/docs/api/java/util/Iterator.html#hasNext--

@xerial

This comment has been minimized.

Copy link
Author

commented Mar 28, 2019

It also means hasNext() is checked multiple times in toIndexedSeq, so some redundancy is introduced in the recent change.

@Sciss

This comment has been minimized.

Copy link

commented Mar 28, 2019

@xerial I don't disagree that calls might be redundant, but I think you are mistaken to assume hasNext can return different answers when called multiple times (without popping elements through next()). Indeed you must even be able to call next() without prior calls to hasNext, this is not forbidden by the spec. Iterator of course is a tricky structure because for any non-trivial iterator, hasNext must advance the internal state.

There are many discussions on the Internet, e.g. https://coderanch.com/t/378826/java/Iterator-hasNext-idempotent

I insist your code is buggy if hasNext cannot be safely called multiple times.

@xerial

This comment has been minimized.

Copy link
Author

commented Mar 28, 2019

Actually fixing my code is not a big deal, but I don't know how many Iterator implementations are assuming that hasNext() will be called only once for each move.

@xerial

This comment has been minimized.

Copy link
Author

commented Mar 28, 2019

I'm not arguing Iterator.haxNext() must be called only once.

Rather I mean releasing Scala 2.13.0-RC1 with this behavior change will reveal how many Iterator.hasNext() implementations are non-idempotent, and we need to ask all dependent Scala/Java library maintainers to fix such Iterator implementations.

I don't think this is a good timing to do so, because Scala 2.12 to 2.13 is already a big jump and has been causing many pains for library maintainers.

@xerial

This comment has been minimized.

Copy link
Author

commented Mar 28, 2019

The first hasNext call happens in VectorBuilder:

        while (it.hasNext) { // The first call is here
          advanceToNextBlockIfNecessary()
          lo += it.copyToArray(xs = display0, start = lo, len = display0.length - lo)
        }

And IterableOnce.copyToArray will call hasNext again:

  def copyToArray[B >: A](xs: Array[B], start: Int, len: Int): Int = {
    val it = iterator
    var i = start
    val end = start + math.min(len, xs.length - start)
    while (i < end && it.hasNext) { // the second call 
      xs(i) = it.next()
      i += 1
    }
    i - start
  }
@xerial

This comment has been minimized.

Copy link
Author

commented Mar 28, 2019

@joshlemer Do we have any workaround to avoid calling Iterator.hasNext() multiple times in Vector? This optimization is introduced at scala/scala#7588

@Ichoran

This comment has been minimized.

Copy link

commented Mar 28, 2019

I don't consider this a bug. Iterators that can't consistently present what elements they have aren't necessarily going to end up with a maximal representation as a sequence. Relying on any particular series of calls to hasNext and next isn't specified in the API and shouldn't be supported. It's too fragile.

In this case there's a simple recursive function or do-while loop that avoids the extra call (at the cost of a bit of extra work if the iterator starts out empty), but I don't think we should promise anything. Relying on fragile undocumented behavior for correctness is almost never a good way to go.

@xerial

This comment has been minimized.

Copy link
Author

commented Mar 28, 2019

@Ichoran Yes. It's not a bug, rather it's a behavior change that was introduced just 7 days ago scala/scala#7588 that will affect all poorly implemented Iterators.

And finding the cause of such behavior change will be tricky because only the first element will be missing without any error. Some DBMS implementation that might be relying on the previous behavior will cause incorrect results. This kind of bug will be hard to find.

@Ichoran

This comment has been minimized.

Copy link

commented Mar 28, 2019

Workaround

class SingleCallIterator[A](underlying: Iterator[A]) extends Iterator[A] {
  private[this] var state = 0
  def hasNext = state > 0 || {
    val ans = underlying.hasNext
    state = if (ans) 1 else -1
    ans
  }
  def next: A = {
    val ans = underlying.next
    state = 0
    ans
  }
}

Wrap your unruly iterators in that.

@Ichoran

This comment has been minimized.

Copy link

commented Mar 28, 2019

Another workaround:

Replace

myUnrulyIterator.toIndexedSeq

with

val b = IndexedSeq.newBuilder[Foo]
while (myUnrulyIterator.hasNext) b += myUnrulyIterator.next
b.result
@xerial

This comment has been minimized.

Copy link
Author

commented Mar 28, 2019

@Ichoran My point is how many such fixes we need to make against the existing Scala/Java libraries because of this behavior change.

It's not only about my code.

I don't have any numbers about how many libraries will start to produce missing results silently.

@joshlemer

This comment has been minimized.

Copy link
Member

commented Mar 28, 2019

@xerial actually the regression/change is a result of https://github.com/scala/scala/pull/7110/files#diff-59f3462485b74027de4fd5e9febcc81bR744, not the PR merged a few days ago.

I don't have a very strong opinion whether this should be considered a bug or not, but if it will be considered a bug I would be willing to help think/write an alternative implementation which calls hasNext just once.

@Ichoran

This comment has been minimized.

Copy link

commented Mar 28, 2019

Okay, but this was just IndexedSeq. What if you created a PriorityQueue? What if you created an Array? What if you created a TreeSet? What if you split the iterator with span? What if you used duplicate?

It's an enormous minefield.

It'd be good if we could help people catch their sketchy iterators, but I don't think we can maintain that this is a behavior change that leads from safe expected behavior into unexplained danger. The danger was always there and huge; some people lucked out in some scenarios.

@Ichoran

This comment has been minimized.

Copy link

commented Mar 28, 2019

Josh -- FWIW, you can attempt the copy even if it's empty, and terminate if it says it copied zero elements. (Use a tail-recursive method or do-while.)

@xerial

This comment has been minimized.

Copy link
Author

commented Mar 28, 2019

@Ichoran
I feel the discussion point is getting bigger than it should be.

  • As @joshlemer commented, if we can change the implementation to call hasNext() only once in VectorBuilder without sacrificing the existing behavior and performance, why we should not do that?

  • If the answer is it's too hard to maintain such behavior for upcoming versions and various paths to use Iterator.hasNext(), it's OK. We just need to update the documentation of Iterator to mention that hasNext needs to be idempotent to return the same result for multiple calls: https://www.scala-lang.org/api/2.12.8/scala/collection/Iterator.html

  • I'm just wondering how do you evaluate the impact of such behavior change? For example, Spark took a long time to overcome such breaking behavior changes between Scala 2.11 and 2.12, so a lot of libraries that use Spark still need to maintain Scala 2.11 versions, even though such behavior change was not a bug of Scala 2.12.

    • Actually Iterator.hasNext's behavior is a minor change, but it seems no one has checked the impact to major libraries like Spark.
    • I think the community-build is a new process to evaluate such impacts to existing Scala libraries. So I hope there will be a better way than simply saying: it's not a bug.
@joshlemer

This comment has been minimized.

Copy link
Member

commented Mar 28, 2019

@Ichoran

Josh -- FWIW, you can attempt the copy even if it's empty, and terminate if it says it copied zero elements. (Use a tail-recursive method or do-while.)

I don't think this will work quite so simply, since the copyToArray(Array[AnyRef]) cannot be called until an array is allocated. And that array is only allocated if it needs to be (if(it.hasNext) advanceToNextBlockIfNecessary() // allocate next array). Maybe something that would work is something like

    val it = (xs.iterator : Iterator[A]).asInstanceOf[Iterator[AnyRef]]
    while (it.hasNext) {
      advanceToNextBlockIfNecessary()
      display0(lo) = it.next()
      lo += 1
      lo += it.copyToArray(xs = display0, start = lo, len = display0.length - lo)
    }
    this
  
@szeiger

This comment has been minimized.

Copy link

commented Apr 10, 2019

I would be surprised if there are many Iterator implementations that relied on calling hasNext only once. I would also be surprised to learn that Iterators in 2.12 were always used in a way where hasNext is only called once.

There is nothing in the docs that implies that hasNext and next should always be called in an alternating way. The documentation for hasNext says: "Tests whether this iterator can provide another element." Futhermore, there is this section:

It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. The two most important exceptions are also the sole abstract methods: next and hasNext. Both these methods can be called any number of times without having to discard the iterator.

If there are obvious places where hasNext is called twice for no good reason we should change it because the unnecessary call is bad for performance, but getting to a state where any sequence of calls that wouldn't have called hasNext twice in 2.12 will also not do it in 2.13 is probably impossible. I wouldn't want to compromise the design of the library in an attempt to "fix" these issues.

@xerial

This comment has been minimized.

Copy link
Author

commented Apr 10, 2019

@szeiger I understand there is no practical way to guarantee to call hasNext only once per element. https://github.com/scala/scala/pull/7110/files#diff-59f3462485b74027de4fd5e9febcc81bR744

How about thinking this as an optimization problem?

Actually, the initial behavior change was made for reducing the cost of allocating the next Vector block optimizing the element copy by using System.arrayCopy:
https://github.com/scala/scala/pull/7110/files#diff-59f3462485b74027de4fd5e9febcc81bR687
Basicaly this code does:

  • If there is a next element in the Iterator (call hasNext), prepare the next Vector block, and use System.arrayCopy for the optimization.

So this optimization is making a trade-off between:

  • A) Calling hasNext twice (assuming all implementations of Iterator in the world are idempotent), or
  • B) Initializing the next vector block always in advance even if hasNext() will return false.

For B), we can amortize the memory allocation cost if copying empty Iterator or iterator that has exactly the same number of elements with (block size) x M (= at the boundary of Vector blocks) is quite rare.

And also if Iterator implementation prepares a cache internally, we need to compare two things:

  • A1) Iterator.hasNext (preparing a buffer) -> prepare the next Vector block -> System.arrayCopy
  • A2) prepare the next Vector block -> Call System.arrayCopy (This can be cache efficient)

A1) is memory efficient, but A2) can be much faster because the cache for iterating the elements will be hot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.