
Test a new scheme to implement lazy vals #6979

Merged (5 commits, Aug 30, 2019)

Conversation

@odersky (Contributor) commented Aug 1, 2019

This is a demonstrator for a new algorithm to handle lazy vals. The idea is that
we use the field slot itself for all synchronization; there are no separate bitmaps
or locks. The type of a field is always Object. The field goes through the following
state changes:

null -> Evaluating -+--------------+-> Initialized
                    |              |
                    +--> Waiting --+

The states of a field are characterized as follows:

x == null         Nobody has evaluated the variable yet
x == Evaluating   A thread has started evaluating
x: Waiting        A thread has started evaluating and other threads are waiting
                  for the result
otherwise         Variable is initialized

Note 1: This assumes that fields cannot have null as normal value. Once we have
nullability checking, this should be the standard case. We can still accommodate
fields that can be null by representing null with a special value (say NULL)
and storing NULL instead of null in the field. The necessary tweaks are added
as comment lines to the code below.

A lazy val x: A = rhs is compiled to the following code scheme:

private var _x: AnyRef = null

def x: A =
    _x match
    case current: A =>
        current
    case null =>
        if CAS(_x, null, Evaluating) then
            var result = rhs
//          if result == null then result = NULL
            if !CAS(_x, Evaluating, result) then
                val lock = _x.asInstanceOf[Waiting]
                _x = result
                lock.release(result)
        x
    case Evaluating =>
        CAS(_x, Evaluating, new Waiting)
        x
    case current: Waiting =>
        _x = current.awaitRelease()
        x
//  case NULL =>
//      null

The code makes use of the following runtime class:

class Waiting:

    private var done = false
    private var result: AnyRef = _

    def release(result: AnyRef): Unit = synchronized:
        this.result = result
        done = true
        notifyAll()

    def awaitRelease(): AnyRef = synchronized:
        if !done then wait()
        result

Note 2: The code assumes that the getter result type A is disjoint from the type
of Evaluating and the Waiting class. If this is not the case (e.g. A is AnyRef),
then the conditions in the match have to be re-ordered so that case _x: A becomes
the final default case.

Cost analysis:

  • 2 CAS on contention-free initialization
  • 0 or 1 CAS on first read in a thread other than the initializer thread, depending on
    whether the cache has been updated
  • no synchronization operations on reads after the first one
  • If there is contention, we see in addition
    • for the initializing thread: a synchronized notifyAll
    • for a reading thread: 0 or 1 CAS and a synchronized wait

Code sizes for getter:

this scheme, if nulls are excluded in type: 72 bytes
current Dotty scheme: 131 bytes
Scala 2 scheme: 39 bytes + 1 exception handler

Advantages of the scheme:

  • no slot other than the field itself is needed
  • no locks are shared among lazy val initializations and between lazy val initializations
    and normal code
  • no deadlocks (other than those inherent in user code)
  • synchronized code is executed only if there is contention
  • simpler than current Dotty scheme

Disadvantages:

  • does not work for local lazy vals (but maybe these could be unsynchronized anyway?)
  • lazy vals of primitive types are boxed
  • every read access goes through a type test and cast.
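For illustration, the whole state machine can be sketched with plain java.util.concurrent atomics. This is a hand-written sketch, not the code the compiler would generate; `LazyCell` and all member names are made up, and the contention path uses a latch-based `Waiting`:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hand-written sketch of the proposed state machine. All names (LazyCell,
// EVALUATING, Waiting, get) are illustrative, not compiler output. The
// single slot cycles through: null -> EVALUATING -> (Waiting) -> value.
final class LazyCell {
    private static final Object EVALUATING = new Object();

    // Contention path: a latch that carries the result to blocked readers.
    private static final class Waiting {
        final CountDownLatch latch = new CountDownLatch(1);
        volatile Object result;

        void release(Object r) {
            result = r;
            latch.countDown();
        }

        Object awaitRelease() {
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted during lazy init", e);
            }
            return result;
        }
    }

    // AtomicReference supplies the volatile read + CAS used by the scheme.
    private final AtomicReference<Object> slot = new AtomicReference<>(null);

    // Assumes the computed value is never null and never a Waiting instance.
    Object get(Supplier<Object> rhs) {
        while (true) {
            Object cur = slot.get();
            if (cur == null) {
                if (slot.compareAndSet(null, EVALUATING)) {
                    Object result = rhs.get();
                    if (!slot.compareAndSet(EVALUATING, result)) {
                        // A reader installed a Waiting meanwhile: publish via it.
                        Waiting w = (Waiting) slot.get();
                        slot.set(result);
                        w.release(result);
                    }
                    return result;
                }
            } else if (cur == EVALUATING) {
                slot.compareAndSet(EVALUATING, new Waiting()); // then retry
            } else if (cur instanceof Waiting) {
                return ((Waiting) cur).awaitRelease();
            } else {
                return cur; // initialized
            }
        }
    }
}
```

Reads after initialization take only the fast path — one volatile read plus a type test — matching the cost analysis above.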

Review comment (Member) on:

    def awaitRelease(): AnyRef = synchronized {
      if (!done) wait()

To handle spurious wake-ups:

Suggested change:
-   if (!done) wait()
+   while (!done) wait()

odersky (Contributor Author):

I don't see how a spurious wakeup could happen?

Reply:

That's what makes them spurious :)

odersky (Contributor Author):

Are you referring to something like this?

https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is

That something like this could happen was news to me! Yes, in that world we need a while.

Reply (Contributor):

@odersky yes, exactly that.

Reply (Contributor):

You also need to consider what would/should happen if someone fires a Thread.interrupt at any threads waiting on initialisation.

@odersky (Contributor Author) commented Aug 6, 2019

Aleksandar Prokopec has pointed out to me a problem with this. There is no happens-before between the reader of a lazy val and the CAS of the initializing code. This is not a problem for the lazy value itself, since the algorithm takes care to go into a CAS if the lazy val is not yet published, which will publish the correct value. But it is a problem for fields contained in the lazy val, which will then not be safely published under the JMM. So the lazy val has to be made @volatile. Here is the relevant part of Alek's mail:


Consider the following program:

object Main {
  lazy val x = new Integer(2)
  
  def main(args: Array[String]): Unit = {
    val T2 = new Thread() {
      override def run() = x
    }
    T2.start()
    val initializedInteger = x
    assert(initializedInteger.value == 2)
  }
}

My point is that, according to the way I understand the JMM, the assert can fail on some architectures.

Assume that the thread T2 is faster, and it executes the first and the second CAS.
It affects the memory in the following manner:

T2:
at address(_x): CAS(_x, null, Evaluating)
at address(obj): WRITE(obj + 0, <initialization bit pattern for the Integer's header>)
at address(obj): WRITE(obj + 12, 2)
at address(_x): CAS(Evaluating, address(obj))

Assume that the main thread now proceeds:

main:
at address(_x): READ to a register - let's assume that main is so late that it sees address(obj) now

At this point, there is no guarantee that the main thread also sees the other, earlier memory effects of the thread T2.
In particular, on some architectures:

at address(obj): READ(obj + 12), which is needed by the assert, can return something that was at that memory address before T2's WRITE(obj + 12, 2).

The reason for this is that the READ(_x) is not followed by a load-load barrier (which a volatile read would typically guarantee).
A load-load barrier would ensure that the CPU does not reorder the subsequent memory loads.
This barrier is not needed on x86, since loads cannot be reordered, but it may be required on ARM or Power, from what I understand.

@smarter (Member) commented Aug 6, 2019

But it is a problem for fields contained in the lazy val, which will then not be safely published under the JMM.

Is this solvable by adding an explicit barrier in the initialization code? I believe @retronym is the expert here.

@liufengyun (Contributor):

It seems to me the problem cannot happen. For the main thread, the following two reads cannot be reordered, due to the dependency between them:

READ(_x, REG1)
...
READ(REG1 + 12)

@viktorklang commented Aug 6, 2019 via email

@axel22 commented Aug 6, 2019

It seems to me the problem cannot happen. For the main thread, the following two reads are impossible to be reordered due to the dependency between them:

From what I understand, it depends on the architecture. I'm no expert on ARM, but it seems that ARM might, as you point out, prevent reordering due to a data or control flow dependency, as explained starting from page 13 below:

https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf

However, I don't think one should rely on this.
There are architectures that do speculative reads that allow such a reordering (such as Alpha).
More importantly, from what I understand, the JMM does not guarantee the visibility of the effects in rhs.
Having a lazy val scheme that relies on how the different JIT compilers and different architectures behave might be brittle.

More importantly, what are the suspected gains from avoiding a volatile read?
On x86, which is the dominant platform, there is no load-load barrier emitted after the read, so the performance difference between a volatile read and a non-volatile read does not come from how an x86 CPU executes the memory read, but from how the JIT compiler treats this read.

In particular - the read will not be reordered with subsequent loads, volatile or not (http://gee.cs.oswego.edu/dl/jmm/cookbook.html).
Effectively, this can, depending on the situation, mean that such a read is not hoisted out of a loop, and factored out into a single read that is then placed into a register. This can furthermore prevent other simplifications, depending on what the loop is.

For example, a JIT compiler could technically pull a volatile read of x out of the loop here, and put it into a register, since this would not reorder any loads (I'm not aware of this being done):

class Foo {
  lazy val x = 2

  def foo = {
    var i = 0
    var sum = 0
    while (i < 4) {
      sum += x
      i += 1
    }
    sum
  }
}

No hoisting is possible in the next snippet when x is a volatile read, since it would potentially be equivalent to reordering some loads:

class Foo {
  lazy val x = 2
  var y = 3

  def foo = {
    var i = 0
    var sum = 0
    while (i < 4) {
      sum += x
      sum += y
      i += 1
    }
    sum
  }
}

Although, people using lazy vals and caring about performance would probably manually extract this read out of the loop.

However, if you really think that lazy val accesses should be optimized, then I think you should identify a set of workloads where such accesses are important, and benchmark them to see the differences. It is possible that potential performance differences can be reduced or eliminated by improving the JIT compilers, rather than making the accesses weaker.

I am under the impression that there is an inherent assumption here that a volatile READ(_x) in the proposed initialization scheme is the cause for the slow reads. It might be other things however - for example, the if-cascade after the read might reveal the JIT compiler's inability to properly do loop unswitching, and further simplify the loop.
Therefore, having a set of concrete workloads first (even if microbenchmarks) would also help in identifying other potential performance obstacles in the current scheme (because, e.g., we could take a look at concrete compiler graphs, and see what these accesses are compiled down to).

@axel22 commented Aug 6, 2019

But it is a problem for fields contained in the lazy val, which will then not be safely published under the JMM.

Is this solvable by adding an explicit barrier in the initialization code? I believe @retronym is the expert here.

I don't think so. The issue seems to be that a barrier is required in the fast-path, i.e. the part that reads the lazy value.

@odersky (Contributor Author) commented Aug 7, 2019

If we make the cache @volatile we get volatile writes as well as volatile reads. Strictly speaking, the write is not needed as the value of the cache is set with a CAS which already is in a happens-before with a volatile read. How expensive would the volatile write be in this case? Should we optimize it away?

EDIT: Actually, it won't matter. We only set _x directly in the contention case which is rare. So the only accesses to the cache that matter are the (volatile) read and the CAS.

@odersky (Contributor Author) commented Aug 7, 2019

@axel22 pointed out to me that we can also handle local lazy vals this way. In fact, if a lazy val is not captured in a closure, it is always confined to a single thread, so a simple null test, like the scheme in #6967, suffices. If a lazy val is captured, the cache variable will have to be boxed. And then we have the box object to hang our CAS on. For this to work, we have to move the LazyVals miniphase after CapturedVars and have to modify CapturedVars so that it boxes captured lazy vals as well as mutable variables.
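The thread-confined case can be illustrated in plain Java: the lazy slot is a local variable that no other thread can see, so an unsynchronized null check suffices. All names below are made up for the sketch; a captured lazy val would instead box the slot and hang the CAS on the box, as described above:

```java
// A thread-confined local lazy val needs no synchronization: a plain null
// check suffices, since no other thread can observe the slot. All names
// here are illustrative, not compiler output.
final class LocalLazyDemo {
    static int computeCount = 0; // instrumentation for the sketch

    static int sumAboveThreshold(int[] xs) {
        Object threshold = null; // the local lazy val's slot
        int total = 0;
        for (int x : xs) {
            if (x > 10) {
                if (threshold == null) {     // unsynchronized check is safe here
                    computeCount++;          // evaluated at most once per call
                    threshold = expensiveThreshold();
                }
                total += x - (Integer) threshold;
            }
        }
        return total;
    }

    static Integer expensiveThreshold() { return 10; }
}
```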

@liufengyun (Contributor) commented Aug 7, 2019

@axel22 Thanks for the links and detailed explanation, it's pretty helpful.

The spec (JLS §17.4.5) says that happens-before is consistent with program order.
If the two instructions of the main thread can be reordered, it seems the semantics will differ from the program order, thus violating it.

Maybe I'm missing something here.

Edit: Alpha does indeed seem to be weird: http://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html

@viktorklang:

@odersky @axel22

You could also consider the following:

@volatile private[this] final var _x$generated: AnyRef = _ // do not assign to null to avoid volatile write

@tailrec final def x = _x$generated match {
  case cur: A => cur // Fast path—there's already a value
  case null =>
    if (CMPXCHG(null, Evaluating)) {
      val xResult = result // What are the desired semantics if `result` throws?
      XCHG(xResult) match {
        case s: Semaphore => // Semaphore would be a type which can never be `A`
          s.release()
          xResult
        case _ =>
          xResult // If there are no waiters, we can simply return the result
      }
    } else x // If we fail the CAS then there was a race for starting Evaluation, so retry `x`
  case Evaluating =>
    CAS(Evaluating, Semaphore(1)) // Only allocates in the case of initialization clashes
    x // retry `x`, it might already have a value
  case s: Semaphore =>
    blocking { s.awaitRelease() } // This would allow any current ExecutionContext to provide ManagedBlocking in the case where the initializer takes time, to try to prevent deadlocks.
    _x$generated
}

In essence this means that initialization is best-case volatile read + LOCK cmpxchg + LOCK xchg

@odersky (Contributor Author) commented Aug 7, 2019

@viktorklang Yes, that seems to work as well and looks more streamlined to me. How do I access CMPXCHG and XCHG? My knowledge only extends to the old sun.misc.Unsafe which had compare-and-set but not compare-and-swap.

@viktorklang:

@odersky cmpxchg is compareAndSwapObject, and xchg is getAndSetObject
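As a side note, the public java.util.concurrent.atomic API exposes the same two primitives without reaching for sun.misc.Unsafe: compareAndSet corresponds to cmpxchg, and getAndSet to xchg. A minimal sketch:

```java
import java.util.concurrent.atomic.AtomicReference;

// cmpxchg/xchg via the public atomics API (Unsafe-free sketch).
public class AtomicOps {
    public static void main(String[] args) {
        AtomicReference<String> cell = new AtomicReference<>(null);

        // cmpxchg: compare-and-set, returns whether the swap happened.
        boolean won = cell.compareAndSet(null, "Evaluating");

        // xchg: get-and-set, atomically installs the new value and
        // returns the previous one.
        String prev = cell.getAndSet("result");

        System.out.println(won + " " + prev + " " + cell.get());
        // prints: true Evaluating result
    }
}
```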

@viktorklang:

@odersky And, perhaps needless to say, null would have to be Uninitialized, since result could evaluate to null. Which means that _x$generated would have to be initialized to Uninitialized

@odersky (Contributor Author) commented Aug 7, 2019

@viktorklang Thanks! I did not find getAndSetObject in http://www.docjar.com/docs/api/sun/misc/Unsafe.html, but those docs are from JDK 7. Do you have a link to more up-to-date docs?

@viktorklang:

@odersky I do not know of any "official" javadocs for Unsafe. Here's what's available in jdk8: https://github.com/AdoptOpenJDK/openjdk-jdk/blob/jdk8-b120/jdk/src/share/classes/sun/misc/Unsafe.java#L1104

@odersky (Contributor Author) commented Aug 7, 2019

Synthesizing the state of discussion so far, and also handling exceptions:

private @volatile var _x: AnyRef = null

@tailrec def x: A =
    _x match
    case current: A =>
        current
    case null =>
        if CAS(_x, null, Evaluating) then
            var result = 
                try rhs 
                catch case ex: Throwable =>
                    publish(null)
                    throw ex
            if (result == null) result = NULL  // drop if A is non-nullable
            publish(result)
            result
        else x
    case Evaluating =>
        CAS(_x, Evaluating, new Semaphore)
        x
    case current: Semaphore =>
        _x = current.awaitRelease()
        x
    case NULL => null                          // drop if A is non-nullable

def publish(result: Any): Unit = 
    if !CAS(_x, Evaluating, result) then
        val lock = _x.asInstanceOf[Semaphore]
        CAS(_x, lock, result)
        lock.release()

@odersky (Contributor Author) commented Aug 7, 2019

@viktorklang Thanks for the link! Browsing some more it seems that XCHG is usually implemented with CAS in a loop. So we can just use CAS instead.

@odersky (Contributor Author) commented Aug 7, 2019

Here's a slightly shorter version that uses a try-finally:

@tailrec def x: A =
    _x match
    case current: A =>
        current
    case null =>
        if CAS(_x, null, Evaluating) then
            var result: AnyRef = null
            try
                result = rhs
                if result == null then result = NULL // drop if A is non-nullable
            finally
                if !CAS(_x, Evaluating, result) then
                    val lock = _x.asInstanceOf[Semaphore]
                    CAS(_x, lock, result)
                    lock.release()
        x
    case Evaluating =>
        CAS(_x, Evaluating, new Semaphore)
        x
    case current: Semaphore =>
        _x = current.awaitRelease()
        x
    case NULL => null                                // drop if A is non-nullable

(actually, the generated bytecode is almost the same in both cases since a try-finally duplicates the finally block).


Review comment (Contributor) on:

    class LazyControl

    class Waiting extends LazyControl {

I believe this is equivalent to a CountDownLatch initialised with a count of 1. release() is now countDown() and awaitRelease() is await(). It may be worth benchmarking both.
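The suggested equivalence can be sketched directly; this assumes, as in the original Waiting, that release is called exactly once (LatchWaiting is an illustrative name):

```java
import java.util.concurrent.CountDownLatch;

// Waiting re-expressed with a CountDownLatch(1), per the review suggestion.
// The latch handles spurious wake-ups and memory visibility internally.
final class LatchWaiting {
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile Object result;

    void release(Object r) {   // was: release() -> countDown()
        result = r;
        latch.countDown();
    }

    Object awaitRelease() throws InterruptedException { // was: await()
        latch.await();
        return result;
    }
}
```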

Review comment (Contributor) on:

    def initialize(base: Object, offset: Long, result: Object): Unit =
      if (!unsafe.compareAndSwapObject(base, offset, Evaluating, result)) {
        val lock = unsafe.getObject(base, offset).asInstanceOf[Waiting]
        unsafe.compareAndSwapObject(base, offset, lock, result)

I don't think we need a CAS here:

    unsafe.putObjectVolatile(base, offset, result)

Review comment (Contributor) on:

    def x$lzy: String = {

Currently Tailrec runs before LazyVals. We need to write the loop manually:

def x$lzy: String = {
  while (<EmptyTree>) {
    val current = _x
    if (current.isInstanceOf[String])
      return current.asInstanceOf[String]
    else {
      val offset = C.x_offset
      if (current == null) {
        if (LazyRuntime.isUnitialized(this, offset)) {
          try {
            val result = init("x")
            LazyRuntime.initialize(this, offset, result)
            return result
          } catch {
            case ex: Throwable =>
              LazyRuntime.initialize(this, offset, null)
              throw ex
          }
        }
      }
      else
        LazyRuntime.awaitInitialized(this, offset, current)
    }
  }
}

Note the while (<EmptyTree>). We special-case it in the compiler: it means an infinite loop and types to Nothing: https://github.com/lampepfl/dotty/blob/963719ed2679f3d4c8188ec068e53941b181ef76/compiler/src/dotty/tools/dotc/typer/TypeAssigner.scala#L529-L530

odersky (Contributor Author):

Good to know. That's a trap one could fall into easily.

Reply:

Might want to rethrow the original exception in case the second initialize fails.

odersky (Contributor Author):

I don't think the second initialize can fail. There's no user code executed for it.

Reply:

@odersky Good point. I guess in the cases where it fails it is something like a StackOverflowError (SOE) or a similar VirtualMachineError (VME).

@axel22 commented Aug 8, 2019

The spec (jls §17.4.5) says that happens-before is consistent with program order.
If the two instructions of main thread can be reordered, it seems the semantics will be different from the program order, thus invalidates the program order.

@liufengyun

There is a line in JLS that says:

If x and y are actions of the same thread and x comes before y in program order, then hb(x, y).

I am pasting a part of my reply to Martin that he did not paste above, and which is about (my interpretation of) JMM description in JLS:

A happens-before relationship between an action A1 and an action A2 exists iff
(https://docs.oracle.com/javase/specs/jls/se11/html/jls-17.html):

  • A1 and A2 happen on the same thread, and A1 precedes A2 in the program order. For example, if rhs is a new Integer(2), which in the ctor does this.value = 2, then there is a happens-before relationship between this.value = 2 and CAS(_x, Evaluating, <that Integer object>).
  • A1 synchronizes-with A2. The following actions are in the synchronizes-with relationship
    (there are a few more cases, not applicable here):
    • A1 = unlock of a monitor M, A2 = subsequent lock of a monitor M
    • A1 = volatile write to variable V (including CAS), A2 = volatile read of V (including CAS)
  • If A1 happens-before Ax and Ax happens-before A2 (transitivity).
  • A1 is end of the constructor, and A2 is the invocation of finalize (not applicable here).

If two actions are in a happens-before relationship, then their memory effects are visible
to each other.

As far as I understand, the intent in the first solution is that, since the execution of rhs
is in a happens-before relationship with the second CAS(_x, Evaluating, result),
the READ(_x) that sees a value of type A should be fully initialized.
However, this is not guaranteed by the Java Memory Model, in particular because
the CAS is not in a happens-before relationship with the READ.
The JMM allows this READ to subsequently observe the object in a non-initialized state.

@axel22 commented Aug 8, 2019

If we make the cache @volatile we get volatile writes as well as volatile reads. Strictly speaking, the write is not needed as the value of the cache is set with a CAS which already is in a happens-before with a volatile read. How expensive would the volatile write be in this case? Should we optimize it away?

I think you do not need to worry about the cost of a volatile write vs a CAS. I'd suggest just using CAS everywhere for simplicity - in the uncontended case, it's just an L1-cache access.

CAS will usually be intrinsified to a native instruction by most VMs and on most architectures. Other instructions such as incrementAndGet, getAndSet, etc., are implemented as a retry-loop in JDK's Unsafe class, but these calls can also be simply replaced with native instructions (typically, they are on x86 and ARM).

EDIT: Actually, it won't matter. We only set _x directly in the contention case which is rare. So the only accesses to the cache that matter are the (volatile) read and the CAS.

Agreed. In the contended case, in which a Waiting object is allocated, and there is a set of cache-line exchanges in the cache coherence protocol, CAS vs a write is not a big deal.

@smarter (Member) commented Aug 8, 2019

Other instructions such as incrementAndGet, getAndSet, etc., are implemented as a retry-loop in JDK's Unsafe class, but these calls can also be simply replaced with native instructions (typically, they are on x86 and ARM).

So it might be worth it to use getAndSetObject then? I guess we can't really say without trying it.

@axel22 commented Aug 8, 2019

So it might be worth it to use getAndSetObject then? I guess we can't really say without trying it.

It would be interesting to try out.
My guess is that there would be no observable benefits, since getAndSet would happen on the initialization path, which evaluates rhs - which should at least be some object allocation or something more expensive (otherwise, why would you use a lazy val).

@retronym (Member) commented Aug 8, 2019

We can use jcstress, a concurrency testing framework in OpenJDK, to try to find counterexamples that disprove our assumptions. It uses diagnostic flags to make the JIT reorder instructions in ways that are legal under the JMM, to provoke races.

Here's a somewhat relevant sample test: https://hg.openjdk.java.net/code-tools/jcstress/file/96d2a23d4b7b/jcstress-samples/src/main/java/org/openjdk/jcstress/samples/JMMSample_04_PartialOrder.java

I tried (but failed) to do this for the safe publication of ListBuffer in https://github.com/scala/scala/tree/2.13.x/test/jcstress .

@axel22 commented Aug 15, 2019

One unrelated thought that was in the back of my head: wouldn't it be reasonable for a lazy value whose rhs threw an exception to continue throwing that exception on every access?
It would make it easier to figure out why it failed, and more deterministic.

@odersky (Contributor Author) commented Aug 16, 2019

One unrelated thought that was in the back of my head: wouldn't it be reasonable for a lazy value whose rhs threw an exception to continue throwing that exception on every access?
It would make it easier to figure out why it failed, and more deterministic.

That's actually what happens. If rhs throws an exception, the value is reset back to null, so each subsequent access will retry the rhs.
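The reset-and-retry behavior described here can be sketched against a stripped-down version of the scheme (illustrative names; the contention/Waiting cases are elided to keep the sketch short):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Sketch: if rhs throws, the slot is reset to null, so the next access
// re-runs the initializer. Names are illustrative, not compiler output.
final class RetryingLazy {
    private static final Object EVALUATING = new Object();
    private final AtomicReference<Object> slot = new AtomicReference<>(null);

    Object get(Supplier<Object> rhs) {
        while (true) {
            Object cur = slot.get();
            if (cur == null) {
                if (slot.compareAndSet(null, EVALUATING)) {
                    try {
                        Object result = rhs.get();
                        slot.compareAndSet(EVALUATING, result);
                        return result;
                    } catch (RuntimeException ex) {
                        slot.compareAndSet(EVALUATING, null); // reset: retry later
                        throw ex;
                    }
                }
            } else if (cur != EVALUATING) {
                return cur; // initialized (contention cases elided in this sketch)
            }
        }
    }
}
```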

@GavinRay97 (Contributor):

@smarter Sorry to necro a comment from two years ago, but was crawling through Dotty Project board, found this on 3.1.0 milestone, and saw your comment on the original implementation:

-    if (!done) wait()
+    while (!done) wait()

I read that article on "Spurious Wakeups", but I'm not certain I get what difference the if vs while makes

If you read this and find the time to respond, would be thankful if you might explain the scenario that breaks the logic here 🙂

class Waiting:

    private var done = false
    private var result: AnyRef = _

    def release(result: AnyRef): Unit = synchronized:
        this.result = result
        done = true
        notifyAll()

    def awaitRelease(): AnyRef = synchronized:
        if !done then wait()
        result

@olsdavis (Contributor):

@GavinRay97 If I am not mistaken, a thread that called wait can be woken up even though no corresponding notifyAll/notify was called, i.e. it wasn't intentionally woken up. In general, we need to verify that the condition to get past the wait is satisfied, because the thread could have been woken up prematurely.
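The guarded-wait idiom that the suggested change introduces looks like this in plain Java (a generic sketch of the pattern, not the Dotty runtime class): the while loop re-checks the condition after every wake-up, so a wake-up without a matching notify simply goes back to waiting.

```java
// Guarded wait: tolerate wake-ups that happen without a notify.
final class OneShotResult {
    private boolean done = false;
    private Object result;

    synchronized void release(Object r) {
        result = r;
        done = true;
        notifyAll();
    }

    synchronized Object awaitRelease() throws InterruptedException {
        while (!done) {   // 'if' would fall through on a spurious wake-up
            wait();       // releases the monitor while waiting
        }
        return result;
    }
}
```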

szymon-rd added a commit that referenced this pull request Oct 28, 2022
Cleaned up version of #14545

Implementing the new lazy vals scheme mentioned in
#6979, fixing
#7140

New PR to avoid force pushing to main branch of the previous author.

Fixes #15149 as well
Applied #14780
10 participants