# **Adaptive Locks: Combining Transactions and Locks for Efficient Concurrency**

Takayuki Usui, Reimer Behrends

Department of Computer and Information Science
University of Oregon, Eugene, OR 97403, USA

Email: tusui@uoregon.edu, behrends@cs.uoregon.edu

Jacob Evans, Yannis Smaragdakis

Department of Computer Science

University of Massachusetts, Amherst, MA 01003, USA

Email: jmevans@cs.umass.edu, yannis@cs.umass.edu

Abstract—Transactional memory is being advanced as an alternative to traditional lock-based synchronization for concurrent programming. Transactional memory simplifies the programming model and maximizes concurrency. At the same time, transactions can suffer from interference that causes them to often abort, from heavy overheads for memory accesses, and from expressiveness limitations (e.g., for I/O operations). In this paper we propose an adaptive locking technique that dynamically observes whether a critical section would be best executed transactionally or while holding a mutex lock. The critical new elements of our approach include the adaptivity logic and cost-benefit analysis, a lowoverhead implementation of statistics collection and adaptive locking in a full C compiler, and an exposition of the effects on the programming model. In experiments with both microand macro-benchmarks we found adaptive locks to consistently match or outperform the better of the two component mechanisms (mutexes or transactions). Compared to either mechanism alone, adaptive locks often provide 3-to-10x speedups. Additionally, adaptive locks simplify the programming model by reducing the need for fine-grained locking: with adaptive locks, the programmer can specify coarse-grained locking annotations and often achieve fine-grained locking performance due to the transactional memory mechanisms.

# I. Introduction

Multi-core processors are turning shared-memory parallelism into the default model of computation for mainstream software development. Explicit multi-threading remains the most direct way to program parallel systems. In the multi-threaded programming world, interference between threads is a major issue and results in hard-to-trace defects such as race conditions or deadlocks. Traditionally, programmers have coordinated threads using programming patterns based on mutual-exclusion (*mutex*) locks.

In recent years, an alternative model has been proposed for thread coordination. *Transactional memory* (TM) replaces mutexes and condition variables with "atomic" blocks of code, that are meant to execute as if all other threads had stopped running during the execution of the atomic block. TM has intrigued both software and hardware designers, and many major processor manufacturers have already announced support for TM in upcoming architectures. The advantage of TM is twofold: First, it offers a higher-level programming model by obviating the need for stating which locks to acquire. This means that code is more composable: Callers do not need to know which locks their callees hold,

and writing code does not require global knowledge of which locks are used by possibly interfering threads. The possibility of low-level deadlock is also avoided, as there is no potential for the programmer to erroneously specify circular lock dependencies. Furthermore, TM does not require finegrained delineation of critical sections in order to achieve high concurrency. Most TM implementations allow threads to proceed unless they interfere on the same shared memory data. In contrast, mutex locks conservatively prevent threads from proceeding if they need to acquire the same lock, even if they never access the same data.

The TM approach is not free of disadvantages, however. Transactions eliminate deadlock, but replace it with possible livelock or slower progress: Interfering threads can cause each other's transactions to abort and retry. Furthermore, transactions cannot easily support irreversible operations, such as I/O, despite several proposals in this direction [3], [12], [29]. Finally, when transactions are implemented in software they can suffer from high overheads during the execution of atomic blocks: Every shared memory read and write operation needs to be trapped and treated specially. The overheads have led some authors to even claim that software TM is "only a research toy" [5].

In this paper, we present *adaptive locks*: a synchronization mechanism combining locks and transactions for best performance. In our approach, the programmer specifies critical sections, which can be executed either with mutual exclusion or atomically as transactions. For instance a critical section

```
atomic (11) { ... }
is equivalent to either
atomic { ... }
(when the system executes in transaction mode) or
lock(11); ... unlock(11);
```

(when the system executes in *mutex mode*). At any point in time, all critical sections that use the same lock, 11, have to execute in the same mode.

The decision to execute in mutex mode or in transaction mode depends on the observed behavior of the critical section, namely on the *nominal contention* (how many threads are blocked on the lock when in mutex mode), the *actual*  contention (how many times each transaction retries when in transaction mode), and the transactional overhead (how much slower is the critical section when in transaction mode compared to mutex mode). Our adaptive locks compute these three factors dynamically during the program's execution and combine them for an accurate cost-benefit analysis, as described in Section II. We present techniques for performing this computation highly efficiently. The overall adaptive lock implementation imposes very low overhead compared to either a regular mutex lock or a transaction.

What adaptive locks achieve is the ability to dynamically switch concurrency mechanisms depending on execution conditions. A single code base (e.g., a library implementation of a general data structure, such as a hash table or tree) can be used in environments with high or low contention and always achieve optimal performance. For example, a program could contain two tree structures, both implemented by the same code, but one of them being large and accessed by many threads, while the other being small or only infrequently accessed concurrently. With adaptive locks, unnecessary overheads due to concurrency mismatch will be avoided for both data structures.

The adaptive locks programming model resembles mutex locks more than it does transactions. For instance, the deadlock-freedom and composability guarantees of transactions are not preserved, since our critical sections may execute in mutex lock mode. It is, therefore, important to ask, "are adaptive locks just an optimized implementation of locks?" Based on the benefits observed in our evaluation, we argue that the practical impact of adaptive locks is much more than that. We believe that adaptive locks significantly change the programming model for concurrency. Adaptive locks allow the programmer to concentrate only on coarse-grained locking approaches, instead of trying to achieve more performance by introducing error-prone finegrained locks. The performance of fine-grained locks is then often fully recovered automatically by employing the TM mechanism when appropriate. All our benchmark measurements are implemented with very coarse-grained lock annotations (often a single global lock, which trivially has good composability and deadlock-freedom properties), yet still achieve significant performance improvements. (Indeed, such coarse-grained locks can also be automatically inferred for correctness-e.g., [4].) Thus, adaptive locks encourage programmers to use locks at whichever level of abstraction correctness is easy to establish, and not at the granularity needed for performance.

Some of our work's closest relatives in the research literature are Rajwar and Goodman's *lock elision* [25] and Welc et al.'s *transactional monitors* [31]. Lock elision is a hardware technique for (effectively) implementing locks as low-level transactions, but with no clear adaptive cost-benefit model, as the one we introduce. Welc et al.'s transactional monitors implement locks optimistically as soon as the

monitor encounters contention. Again, there is no dynamic cost-benefit model for the two modes of execution, or a possibility of reverting back to locks if the TM mechanism turns out to be inefficient. Welc et al. acknowledge the need for more adaptive solutions, which our work provides. Finally, the work in this paper is an evolution and realization of the *non-blocking locks* idea that we presented in an earlier position paper [30]. Overall, our concrete contributions are as follows:

- We present a highly efficient and effective implementation of the concept of adaptive locks. Our adaptive locks keep precise statistics on the behavior of the program, while introducing very low overhead: acquiring an adaptive lock is practically no more costly than acquiring a mutex lock. Importantly, this removes all performance arguments against Software Transactional Memory [5]: transactions are used only when they yield benefits, and incur no overhead otherwise. We describe the optimizations responsible for our mechanism's efficiency—e.g., trading some inaccuracy in our statistics in exchange for shortening the critical path of lock acquisition and avoiding bottlenecks. Our implementation is in the form of a full C compiler, based on the CIL framework [24], and is freely available for download (making it one of the most mature opensource platforms for TM research).
- We evaluate adaptive locks with several micro- and macro-benchmarks. Our evaluation shows that adaptive locks combine the performance benefits of mutex locks and transactions. In every case, the performance of adaptive locks closely matches the performance of the better of the two component mechanisms. This allows adaptive locks to achieve the highest possible performance not just for different applications, but also for different configurations of the same application (e.g., 3x faster than TM for 2 processors, 3x faster than mutex locks for 64 processors). Compared to either mutex locks or transactions alone, adaptive locks routinely achieve order-of-magnitude performance improvements by emulating the performance of the complementary mechanism. Adaptive locks occasionally outperform both component mechanisms at the same time, by up to 50%, due to the varied contention behavior of different application phases.

## II. DESIGN AND ADAPTIVITY LOGIC

We next discuss the concept of adaptive locks, as well as the cost-benefit logic that the locks implement in order to choose their optimal execution mode.

# A. Programming with Adaptive Locks

Adaptive locks introduce syntax for a labeled atomic section. This is a block structured construct, headed by the keyword atomic with a label indicating which adaptive lock protects the code statement (usually a block statement) that follows. By convention, in this paper (as well as in our

implementation) adaptive locks are declared as instances of type al t, e.g.:

```
al_t lock1; ...
atomic (lock1) { ... /* critical section */ }
```

The programmer is responsible for ensuring that the lock labels are "correct"—i.e., that the program will work correctly if all instances of atomic(<lckLbl>) are replaced by a regular mutex, Lock (< lckLbl>). (We assume a block-structured mutex lock, with an unlock performed at the end of the block.) The programmer also has the obligation to ensure that the program is equally correct if all lock labels are dropped and all critical sections atomic  $(\langle lckLbl \rangle) \langle stmt \rangle$  execute as transactions, atomic *<stmt>*, in a conventional TM system (e.g., [13], [14], [28]). The reason is that transactions can have subtly different behavior from mutex locks [2], [11], [20], [28], [30]. Nevertheless, the topic of adaptive locks is orthogonal to such differences. For instance, one can implement an adaptive scheme with a TM system supporting single-globallock semantics [20]. For weaker back-end TM systems, the programmer can ensure correctness of adaptive locks code under either mode by employing static separation (i.e., ensuring that data that are ever shared are always accessed under a lock [30]) or dynamic separation [1] techniques. Note also that adaptive locks do not support transactional constructs that rely on retrying (such as an explicit retry or abort statement).

The adaptive lock implementation is, thus, free to execute the critical section it protects either as a transaction or as a critical section protected by a mutex lock. As mentioned in the Introduction, we say that the adaptive lock is in transaction mode or in mutex mode, respectively. All critical sections associated with the same adaptive lock have to execute in the same mode at a given time. If a thread tries to acquire an adaptive lock and decides it wants to execute in a different mode than the current one, it marks the adaptive lock "in-transition" and waits until all current critical sections executing with this lock finish. (Clearly, there is more than one critical section executing only if the adaptive lock is in transaction mode.) While the lock is in-transition, no further mode switching decisions can be made. Furthermore, in the case of lock nesting, the mode of a nested adaptive lock cannot differ from the mode of a surrounding lock.

<sup>1</sup>One can argue that the terms "transaction" and "mutex lock" refer to programming models, rather than implementation mechanisms. E.g., transactions can be implemented by a mechanism that guarantees exclusion, or mutex locks can be implemented speculatively. In this paper, we use the terms to refer to the implementation mechanisms overwhelmingly associated with them in common practice. We have found this to be best for communications purposes: when describing our work, listeners have been more likely to grasp it quickly if we explain it as a "mechanism adapting between mutex locks and transactions" rather than as a "mechanism adapting between speculative and non-speculative locks, where the speculation is implemented through TM techniques".

The reasons for switching the mode of an adaptive lock are either correctness- or performance-related. In the former case, if the lock is executing in transaction mode and an irreversible operation is called (e.g., I/O) the (outermost) critical section restarts in mutex mode. The latter case captures the heuristic at the core of adaptive locks, for deciding when to switch modes in order to improve performance.

## B. Cost-Benefit Analysis

The main reason for executing an adaptive lock in transaction mode is that mutex locks can exhibit *false exclusion* [26]. A single mutex lock is commonly used to protect a large amount of shared data—an approach known as *coarse grained locking*. In this way, multiple threads are blocked from accessing the data, even in cases when they would not really conflict. Programmers use coarse grained locking because it is often far easier than trying to correctly associate locks with smaller amounts of data. Several domains and data structures (e.g., red-black trees) are notoriously difficult to code with a fine-grained locking discipline.

Therefore, the performance benefit of transactions is due to higher concurrency: More threads can execute the same critical section with transactions than with mutexes. Assuming that separate processors exist to run these threads, a net performance increase can result.

At the same time, executing an adaptive lock in transaction mode incurs high overheads when there is true contention on the data. In this case, different threads interfere with each other, preventing the successful commit of transactions. Therefore, transactions have to retry multiple times before they successfully commit, and the result is slower progress, or even livelock. The problem is solved when switching to mutex mode because the thread "reserves" the right to run up-front, thus making progress without interference.

A second factor hindering the performance of transaction mode is that, in pure-software TM, there is typically a high overhead associated with executing a critical section transactionally. *Software transactional memory (STM)* systems need to execute logging actions on each read or write operation of shared memory data. Depending on the design of the STM, the logged values are either used to update shared memory on transaction commit (*redo-logging*), or to revert shared memory to its previous state on transaction abort (*undo-logging*).<sup>2</sup> A second overhead is due to the need to perform synchronization operations (e.g., acquiring locks associated with each written word) to ensure consistent memory writes. The need for logging actions and synchronization imposes

<sup>2</sup>A few STM systems suffer no such overhead [6], [17], [19], by translating transactions into lock acquisitions and releases in a way that guarantees deadlock-freedom (and, thus, the transaction never needs to retry). The performance of such "auto-locking" systems depends crucially on (non-modular) compiler analysis or program annotation. No representative of this approach has yet achieved the same level of performance as standard STMs (pessimistic or optimistic) in a general-purpose, automatic setting.

a heavy overhead on shared memory operations and often slows down transaction mode execution of critical sections by a significant factor (e.g., 2-8x). Additionally, STM implementations often impose extra overheads for policy-specific reasons—e.g., re-validating the read set when a conflict is detected, incurring cost for aborting, etc.

Therefore the adaptive lock analysis of whether to execute in transaction mode or mutex mode has to take into account three factors:

- Nominal contention (c): the number of threads contending for the lock. This quantifies the potential benefit of executing in transaction mode instead of mutex mode. The quantity can be measured by keeping a counter of how many threads are blocked on the lock when in mutex mode. When in transaction mode, c is equal the number of threads currently executing the critical section.
- Actual contention (a): the number of times a transaction needs to try before it commits. This quantifies the contention by other threads on the actual data the critical section tries to access. The quantity is a multiplicative factor in the *cost* of executing the critical section in transaction mode.
- Transactional overhead (o): the slowdown factor due to transactional execution, because of the need to trap shared memory reads and writes, the need to synchronize, the need to re-validate as part of a complex contention management policy, etc. This is a multiplicative factor in the cost of transaction mode.

Thus, the cost-benefit analysis of adaptive locks is based on the inequality:

$$a \cdot o \ge c$$

(The two sides correspond to the overheads of each mode of execution relative to an idealized, no-contention execution. All three factors are computed separately for each adaptive lock, since the decision on which mode to execute affects all critical sections of the lock.) If this inequality holds, mutex mode execution is preferable, otherwise the benefit of transaction mode execution outweighs its cost. Note that the analysis applies and a trade-off exists even if transactional execution incurs no overhead (o=1), e.g., through the use of specialized hardware.

The above cost-benefit analysis is *exact* and not approximate, yet approximations need to be introduced because, for instance, it is hard to measure the overhead o fully accurately, factor a is predictive of future executions so it needs to be estimated from past data, etc. As we describe next, factors c and a are computed dynamically at all times. Factor o is also computed dynamically by sampling a subset of the executions—an approach that proved superior to off-line estimates in our measurements due to the high variance of o for different applications and locks.

To see the advantage of having a complete model for cost and benefit, consider, for instance, the adaptivity approach followed by Welc et al. [31]. Their technique converts a critical section to a transactional implementation as soon as any contention is observed, i.e., as soon as c is more than 1. This completely disregards the costs of transactional execution and results in obtaining good behavior only for transaction-friendly workloads.

## III. IMPLEMENTATION AND OPTIMIZATIONS

We next describe our implementation of adaptive locks. We selectively present key components that expose the precise logic (e.g., behavior when an adaptive lock is in the process of switching modes) or reveal crucial elements for high performance.

# A. Compiler and Locking Mechanism

We have implemented adaptive locks in a conservative extension of the C language. Our compiler is based on the CIL infrastructure [24] for extensible C compilers. A special pragma at the function level is used to supply atomic annotations: the entire body of the function is then considered to be protected by the corresponding adaptive lock. The compiler translates each function body with atomic annotations into two different object code versions: a raw version, used for mutex mode execution and incurring no further overheads, and a transactional version, where all shared memory reads and writes become TM operations for an underlying STM. We use TL2 [7], a high-performance STM library, as our backend STM. Our implementation is freely available (see http://ix.cs.uoregon.edu/~takayuki/al/) represents a mature open-source compiler infrastructure for STM experimentation. Other researchers can build on our compiler support for TM by modifying our CIL patterns to produce full compilers either for different TM constructs or for different back-end TM implementations.

Our implementation of adaptive locks replaces regular lock acquisition and release with versions that perform the adaptive reasoning. We use a standard pattern for high-performance synchronization: The adaptive lock's state is packed in a memory word and we represent bit blocks as different pseudo-variables. The components of the state include the number of threads executing in transaction mode (thrdsInStmMode), whether we are currently in mutex mode (mutexMode), whether the mutex lock is held (lockHeld), and whether we are currently in the process of switching modes (transition). The next state is then computed and updated atomically with a compare-and-swap (CAS) instruction. The thread spins, retrying the state update until the CAS succeeds, or until exceeding a number of tries, in which case it has to yield the CPU.

These elements are illustrated in Figure 1, which shows the state transitions of adaptive locks, as well as in in Figure 2, which contains the code for the acquire routine—the main workhorse of the lock acquisition process. This



| State  | mutexMode | lockHeld | thrdsInStmMode | transition | Description                                       |
|--------|-----------|----------|----------------|------------|---------------------------------------------------|
| $S_0$  | 0         | 0        | 0              | 0          | STM mode, no thread in critical region            |
| $S_1$  | 0         | 0        | $\geq 1$       | 0          | STM mode, thread(s) in critical region            |
| $L_0$  | 1         | 0        | 0              | 0          | Lock mode, no thread in critical region           |
| $L_1$  | 1         | 1        | 0              | 0          | Lock mode, thread in critical region              |
| $LS_0$ | 0         | 1        | 0              | 1          | Begin transition from lock to STM mode            |
| $LS_1$ | 0         | 0        | 0              | 1          | Signal completed transition from lock to STM mode |
| $SL_0$ | 1         | 0        | $\geq 1$       | 1          | Begin transition from STM to lock mode            |
| $SL_1$ | 1         | 0        | 0              | 1          | Signal completed transition from STM to lock mode |

Figure 1. Adaptive lock state machine with explanation of states and transitions. We need two transition states in each direction, because of the Acquire/Release/Acquire handshake taking place (for ensuring progress during mode switches). The initial Acquire changes to the first transitional state, then waits for the Release operation to complete, which is signaled by the move to the second transitional state, and the Acquire then completes the transition by going to  $S_1/L_1$ .

routine is called every time a thread attempts to acquire an adaptive lock. The return value indicates whether the adaptive lock was acquired in transaction mode (TRANS\_MODE) or mutex mode (MUTEX\_MODE). Introducing some conventions is helpful:

- The separate bit ranges of both the current state (prev) and the next state (next) are set through macros maintaining the naming convention. For instance, checking the lockHeld bit of the current state is done with the expression lockHeld(prev) whereas setting the same bit to 1 on the next state is done with the call setLockHeld(next,1). We use TRUE and FALSE for 1 and 0, respectively, when the bit value represents a boolean.
- Atomic operations are shown in all capital letters. INC, DEC, and CAS call (directly or indirectly) atomic instructions. This will be important when we discuss performance optimizations.
- Each adaptive lock holds data for computing its adaptivity statistics. These data are not accessed directly in the code of Figure 2, with the exception of lock->thdsBlocked: a counter of threads blocked on the lock, if the lock is in mutex mode—adding thrdsInStmMode yields the c factor from Section II-B. For its adaptivity logic, the acquire routine calls transactMode which implements the cost-benefit anal-

ysis of Section II-B and returns the estimated best mode for the adaptive lock.

We can now see precisely the behavior of adaptive locks. If the lock is not already in a state of transition from one mode to the other then the cost-benefit analysis is performed to see what is the optimal execution mode. (It is necessary for ensuring progress to choose the mode using the cost-benefit analysis only when the lock is not already in transition. Otherwise, threads that decide to acquire the adaptive lock in mutex mode might be waiting for all threads executing in transaction mode to finish. Yet new threads can keep acquiring the lock in transaction mode with no problem, thus causing the thread desiring to enter in mutex mode to wait forever.) All possibilities end with an attempt to CAS into the next state of the lock. If the CAS succeeds, in most cases we are done, unless we are switching modes, in which case the CAS will just set the state to be in-transition, and will repeat the loop until the new state is set. A failed CAS results in retrying, up to a predefined threshold of times (spin thrld) before yielding.

When the acquire routine returns to its caller (not shown), the adaptive lock is held in the appropriate mode, and the system only needs to execute the corresponding version of the critical section (raw or transactional), per the return value. Transaction mode execution also maintains statistics for the cost-benefit analysis, namely it increments

```
Mode acquire(al t* lock) {
 int spins = 0;
 int useTransact = MUTEX MODE;
while (TRUE) {
 intptr t prev,next;
  prev = lock->state;
  if (!transition(prev))
   // we are not already in transition
   if ((useTransact = transactMode(lock, spins)) ==
        TRANS MODE)
   { // we are better off in transaction mode
    if (!lockHeld(prev)) {
     // the lock is free or in transaction mode
    next = setMutexMode(prev, FALSE);
     next = setThrdsInStmMode(next,
                          thrdsInStmMode(next)+1);
     if (CAS(lock->state,prev,next) == prev) break;
    } else {
     // the lock is in mutex mode. Need transition
     next = setMutexMode(prev, FALSE);
     next = setTransition(next, TRUE);
     CAS(lock->state,prev,next);
    else {\ //\ } we are better off in mutex mode
    if (!lockHeld(prev) &&
        thrdsInStmMode(prev) == 0)
    { // the lock is free, no threads in crit.sec.
     next = setMutexMode(prev, TRUE);
     next = setLockHeld(next, TRUE);
     if (CAS(lock->state,prev,next) == prev) break;
    } else if (!mutexMode(prev)) {
     // lock is currently in transaction mode
    next = setMutexMode(prev, TRUE);
    next = setTransition(next, TRUE);
     CAS(lock->state, prev, next);
  } else { // we are in transition
   if (!mutexMode(prev)) {
    // we want to transition to transaction mode
    if (!lockHeld(prev)) {
     // and the lock is no longer held
     useTransact = TRANS MODE;
    next = setThrdsInStmMode(prev, 1);
    next = setTransition(next, FALSE);
    if (CAS(lock->state, prev, next) == prev) break;
   } else { // we want to transition to mutex mode
    if (thrdsInStmMode(prev) == 0) {
     // and it seems we can do so
    useTransact = MUTEX MODE;
     next = setLockHeld(prev, TRUE);
     next = setTransition(next, FALSE);
     if (CAS(lock->state,prev,next) == prev) break;
  // account for blocked thread on first spin
  if (spins == 0) INC(lock->thdsBlocked);
 if (spin thrld < ++spins) Yield();
   /* end while(TRUE) */
 if (0 < spins) DEC(lock->thdsBlocked);
 return useTransact;
```

Figure 2. The main routine for adaptive lock acquisition. Returns whether the lock was acquired in mutex mode or transaction mode.

a counter for every transaction retry and commit.

## B. Performance Discussion

Adaptive locks keep global statistics, necessary for computing quantities c, a, and o of the adaptivity reasoning. Such statistics include the <code>lock->thdsBlocked</code> count, a count of transaction tries, and a count of transaction commits. Because these counts need to be updated by every thread's execution, they represent a global bottleneck for the performance of adaptive locks. Removing this bottleneck is crucial for performance. In some cases, even a single extra atomic instruction (e.g., a slightly less optimal implementation instead of that of Figure 2) would result in no scalability for the benchmarks we present later. (We do not show such what-if experiments for lack of space.)

One way in which we address this problem is by allowing small inaccuracies in our statistics gathering. The inaccuracies can only influence the performance of an adaptive lock (i.e., which mode it chooses) and not its correctness. For instance, quantity a of the adaptivity reasoning (the "actual contention") is computed from counts of transaction tries and commits for the critical section. Although we make sure that these counts are not cached for long periods of time (by using volatile variables), we do not update the counts atomically. Instead, regular memory writes are performed and later instructions serve as memory barriers, forcing a shared memory update. This allows for races, including write-write races (i.e., an update being lost because a different thread overwrites it). In practical use, the sporadic inaccuracies in such statistics are not significant, especially since the counts of tries and commits are cumulative (although time-decayed).

Additionally, the transactional overhead factor, o, of our analysis depends on the proportion of shared memory operations (which become transactional reads and writes) in a transaction's workload. For instance, transactions that work mostly with thread-local memory (including non-shared external resources) will not incur a heavy overhead for execution in an STM, in contrast to transactions that perform many shared memory operations. The relative mix of reads and writes also matters, depending on the specifics of the STM implementation. For instance, TL2 keeps the cost of reading shared memory low, and contains special handling for read-only transactions. For these reasons, the value of factor o varies widely between applications, as well as between different critical sections of the same application.

In our implementation, we perform a dynamic measurement of o, using architecture-specific instruction (or cycle, when available) counters. Thus, we can estimate o by measuring the execution time of a transaction, and dividing it by the execution time minus the time spent in the wrapper functions for transactional read/write memory operations (which closely approximates the time that would have been spent executing the critical section in lock mode). Getting

good estimates for these times is costly, however. We found that sampling even the cheapest CPU performance counters can be prohibitive for transactions, which are typically quite brief. Furthermore, reading the values of performance counters on every TM read and write can disturb the behavior of the transaction, by prolonging it.

To keep our estimate of o precise yet inexpensive, we apply two optimizations. First, the measurement is not performed on every transactional execution, but only in specific sampling intervals (currently every 512 calls).<sup>3</sup> Second, we do not measure precisely how much time is spent in handling transactional reads and writes. Instead, we just keep a count of the numbers of each operation and multiply these counts by a static estimate. This is just an approximation (since the cost of reads and writes is not constant in TL2 or other STMs) but we have not found it to induce enough noise to skew our decisions.

The result of our dynamic estimation of the overhead factor is a mechanism that adapts very well to the characteristics of the application and critical section, while introducing negligible overhead, as we later show in our experiments.

#### C. Sensitivity Discussion

Although the cost-benefit analysis of Section II-B is fully general, our implementation is specialized for our back-end STM, TL2, and somewhat reflects our intended execution platform. Namely:

- The main transactional overheads of TL2 are due to read and write logging [7]. Therefore our estimate of o ignores (i.e., approximates as a constant) other transactional overheads, such as the cost of acquiring locks, the cost of aborting a transaction, the cost of contention management (e.g., delaying a transaction or re-validating the read-set in order to make progress). These either do not apply to TL2 or have been shown to be secondary factors. Generally, to measure o precisely, one needs to measure the full end-to-end cost of equivalent executions in mutex mode and in transaction mode. This is usually not feasible, as the cost is dependent on other threads, semantic equivalence is hard to establish, etc. Therefore we expect that different realizations of adaptive locks will need to employ appropriately specialized techniques for estimating o.
- We have not found a need to employ more scalable locking or counter techniques (e.g., avoid a CAS when the lock is in transaction mode). This may be partly because our primary execution platform (a Sun Niagara2 architecture) uses a shared L2 cache. Preliminary microbenchmarks, however, do not substantiate this theory: we found that for much higher contention/shorter transactions the performance of our technique would degrade substantially on the

same architecture. Still, an implementation specialized for other architectures (e.g., x86) may need to employ different low-level scalability techniques.

#### IV. EXPERIMENTAL EVALUATION

To evaluate the effectiveness of adaptive locks, we performed experiments with an array of microbenchmarks (for testing boundary conditions) and macrobenchmarks. All measurements are medians of 3 runs on a Sun UltraSparc T2 (Niagara2) T5220 machine (8 cores with 8 threads each for a total of 64 hardware threads, at 1.2GHz; 32 GB RAM). We used GCC 4.0.4, and our implementation of adaptive locks uses version 0.9.4 of TL2. Our plots show the performance of our adaptive locks, compared to modes of our compiler that perform no adaptivity reasoning. (We have confirmed that the non-adaptive modes of our compiler yield virtually identical performance to plain mutexes and the STAMP STM under the base CIL compiler.)

# A. Microbenchmarks

We stress-tested adaptive locks with microbenchmarks corresponding to standard mapping data structures: red-black trees, hash tables, and splay trees.

Red-black trees are the poster child benchmark for transactional memory systems. Mutex-based red-black tree solutions typically do not scale, as they use coarse-grained locking due to the very high complexity of coding a fine-grained red-black tree. TM approaches perform well because the data structure has low actual contention (different operations can access different parts of the tree without conflicts) and can benefit from increased concurrency.

Splay trees, on the other hand, are pathologically bad for implementations that emphasize concurrency (such as TM) since every update to a part of the tree needs to change the root, which becomes a point of contention. Thus, the interesting question for splay trees is how to incur less overhead, rather than how to gain more concurrency. We use a single-lock splay tree in our experiments.

We experimented with two fixed-size hash table implementations: one with coarse-grained locking (single lock per entire table) and one with fine-grained locking (one lock per table bucket). Naturally, there is no difference in the performance of TM in the two implementations, but mutex locks perform better in the latter.

For each data structure we used a relatively high-contention workload with 50% lookup operations, 25% inserts and 25% deletes. Each thread performs 100,000 operations total. Our results are shown in Figure 3—note that these are throughput plots, so higher numbers are better. As can be seen, none of the benchmarks scales perfectly to 64 threads, largely because of the small size of the data and the resulting contention, and possibly partly because our hardware is not a full 64-way machine, but has 8 separate cores with 8 hardware threads each.

<sup>&</sup>lt;sup>3</sup>Our implementation pre-compiles a special *sampling transactional version* of the critical section, which, in addition to TM operations, contains instrumentation for estimating the overhead of transactional execution.



Figure 3. Microbenchmarks: Data structures with different characteristics. *Higher is better.* Note that the fine-grained hash table plot includes the coarse-grained mutex performance for reference.

Adaptive locks succeed in closely tracking the performance of the better of the two component mechanisms for each benchmark. This means that adaptive locks soundly outperform either of the component mechanisms on its own. Statistically, over all microbenchmarks and all thread configurations, adaptive locks are on average 47% faster than mutexes (min: -16%, max: 433%) and 176% faster than transactions (min: -26%, max: 837%). (This should only be viewed as a summary of the figure data, as the average does not map to a real-world quantity.) For red-black trees and coarse-grained hash tables, adaptive locks imitate a mutex lock for low degrees of parallelism (1-2 threads) and a TM for more threads, outperforming the mutex-based implementation. For splay trees, adaptive locks precisely match the performance of a plain mutex lock, outperforming the STM implementation. For fine-grained hash tables, adaptive locks emulate mutexes, yielding better performance than TM for few threads and identical performance for more threads. The stress-testing reveals small overheads in our adaptive locks, compared to a plain STM approach (see the difference between TM and adaptive locks in the red-black tree plot). This is due to the cost of the adaptivity logic, as discussed in Section III-B. We observed such overheads only in stress-testing scenarios but not in more realistic settings, so we have not emphasized removing the last bit of overhead. Compared to mutex locks, our adaptive locks have no measurable overhead, as seen in the splay tree benchmark.

# B. Macrobenchmarks

For larger benchmarks of adaptive locks, we used the STAMP (Stanford Transactional Applications for Multi-Processing) benchmark suite [21], version 0.9.7. Subsequently, a new version of the STAMP benchmarks has been released and we ported the three new benchmark applications of STAMP 0.9.10 to also work with adaptive locks (while keeping the slightly different library interface of STAMP 0.9.7, which was already rewritten for adaptive locks). STAMP comprises 8 applications: bayes (a bayesian network learning program), genome (a gene sequencing program), intruder (a network intrusion detector), kmeans (an implementation of K-means clustering), labyrinth (a maze routing program), ssca2 (a graph analysis compute-

intensive benchmark), vacation (a client/server travel reservation system), and yada (an implementation of Delaunay mesh refinement). All STAMP applications are written to employ a TM system explicitly. That is, the code contains explicit STM primitives (of the TL2 STM) for beginning a transaction, transactionally reading/writing a word from/to shared memory, committing a transaction, etc. As discussed in Section III, our adaptive lock compiler supports a higherlevel programming interface: all shared memory operations become implicitly transactional loads/stores when executing in transaction mode. Therefore, the STAMP applications needed careful manual modification to ensure that the output of our compiler reflects the original hand-written code, and to introduce locking annotations in the code. Our goal was to add only very coarse-grained locking, equivalent to what a programmer would be able to add with minimal effort and sophistication. Indeed, for seven out of the eight STAMP benchmarks (bayes, genome, intruder, labyrinth, ssca2, vacation, yada) we only introduced trivial locking: a single global lock for the entire application. For kmeans, 3 separate locks were introduced, with a localized code change (the critical sections for all 3 locks are in a single file and in adjacent routines).

The performance of adaptive locks for the STAMP benchmarks is illustrated in Figure 4. The graphs plot execution times, so lower is better. For a statistical summary, over all STAMP benchmarks and all thread configurations, adaptive locks are on average 103% faster than mutexes (min: -27%, max: 1021%) and 76% faster than TM (min: -35%, max: 660%).

Adaptive locks track very closely, and even outperform the better of the two component mechanisms over all applications. We should note that we have low confidence in the results for yada: we have already fixed a number of performance and correctness bugs in the application (independent of adaptive locks) and the measurements shown were taken after performing an overly conservative, STMdisadvantaging rewrite (turned many local operations into transactional ones) to avoid crashes. (More study and a deeper understanding of the yada semantics will undoubtedly address this problem.) For labyrinth, adaptive locks imitate TM behavior and vastly outperform mutex locks for all thread configurations. For kmeans, adaptive locks imitate mutexes and outperform the TL2 STM for all thread configurations. The behavior of bayes is unstable by its nature (the STAMP documentation reads "for multithreaded runs, the running time can vary depending on the insertion order of edges") (depends very much on the scheduling order of threads) but adaptive locks consistently perform well for 4 or more threads. More interesting behavior can be seen for genome, ssca2, and vacation, where adaptive locks emulate mutexes for best performance with a low number of threads, while executing in transactional mode and perfectly matching or beating the performance of plain TL2 for higher numbers of threads. Occasionally, adaptivity is profitable even in the course of the same execution. For instance, for most of the intruder data points, as well as for genome in a 2-4 thread configuration, the adaptive lock version of the program profitably switches modes *during* execution, outperforming both mutexes and transactions alone.

Overall, the performance of adaptive locks for STAMP benchmarks validates the approach very well. Our use of only coarse-grained adaptive locking illustrates the intended usage mode of the mechanism. Adaptive locks simplify the multi-threaded programming model, by allowing the programmer to write coarse-grained annotations and achieve easy multi-threaded correctness. The convenience comes without sacrificing concurrent performance: The adaptivity mechanism can detect when coarse-grained locking is too conservative and recover concurrency (as if using fine-grained locks) by executing in transaction mode.

# V. RELATED WORK

We discussed directly related work throughout the previous sections. Here we outline some work that is less directly related, yet offers context for our work, or explores closely related directions in different settings.

Transactions originated in the databases research literature [10] before they transitioned to general-purpose programming in the form of transactional memory [16]. Although the principles are similar, the challenges in the two domains are quite distinct. For instance, TM has to allow for arbitrary memory accesses and, thus, cannot generally predict all locks that need to be acquired. Furthermore, the granularity of access is finer in TM, creating very different trade-offs for high-performance implementations.

In the database world, our adaptive locks might be described as a mechanism adapting between optimistic concurrency control and pessimistic concurrency control. The term "optimistic" refers to allowing transactions to proceed in the hope that they will not conflict, while installing mechanisms to detect such conflicts. The term "pessimistic" refers to acquiring locks up front, so that any transactions that have the possibility of conflict end up serializing. Database researchers have explored combinations of optimistic and pessimistic concurrency control, and so have researchers in automatic parallelization [8], [26]. The options are sometimes said to be akin to "apologizing versus asking permission" [15]. The mutex mode of our adaptive locks is an ultra-pessimistic mechanism, as it forces all transactions to "ask permission" up front. Receiving permission means that the transaction can proceed and is guaranteed to not roll back: it has "reserved" the right to perform its memory operations.

The PhTM [18] system is related to our work in that it describes a mechanism for dynamically switching synchronization mechanisms. Nevertheless, our work advances the PhTM ideas in several ways. First, PhTM introduces only



Figure 4. STAMP benchmarks, for varying numbers of threads. Lower is better.

a single global lock instead of individual locks. Second, although the PhTM "SEQUENTIAL-NOABORT" mode supports switching to lock-based execution, the PhTM prototype does not support such switching. In fact, the PhTM authors speculate, "we can likely improve performance in most cases by monitoring progress of transactions, commit/abort rates, status of transactions with respect to the current mode, etc." and conclude that "[f]uture work includes ... mechanisms for deciding when to switch to what mode." Our work directly addresses these topics. Other researchers have recently also proposed mechanisms for replacing locks with software transactions [27], again without a cost-benefit adaptivity model to guide a run-time mode switch.

Our exploration of adaptive locks is in the context of a pure software implementation. An important trend is to provide hardware support for TM [9], [22], [23]. With hardware support, the performance trade-offs change—e.g., the transactional overhead of loads and stores may be virtually eliminated. Yet the idea of adaptive locks should be quite applicable to hardware TMs: even with no overhead for TM execution, it will be beneficial to adaptively detect when transactions have high actual contention and mutual exclusion would be profitable. Furthermore, most hardware support for TM employs a hybrid software-hardware approach—e.g., transactions that access a lot of shared data need to be implemented in software, making our approach perfectly applicable. Finally, many of the adaptivity ideas of this paper can be employed in hardware mechanisms such as speculative lock elision [25] or optimistic thread concurrency [9].

# VI. CONCLUSIONS

We presented the idea of *adaptive locks* as a concurrency control construct for multi-threaded programming. A major contribution of our work is in identifying the statistics needed for an effective cost-benefit adaptivity analysis and in developing mechanisms for maintaining such statistics highly efficiently. Overall, we believe that our work establishes adaptive locks as an excellent candidate for inclusion in industrial-strength systems.

The efficiency of our adaptivity approach suggests several other future applications. One promising possibility is adaptive switching between a pessimistic and an optimistic TM implementation—i.e., rolling back on deadlock vs. rolling back on actual contention. Although both pessimistic and optimistic TM approaches perform best when there is no actual contention on the data, pessimistic TM approaches reduce the probability of roll-back at the expense of some thread waiting for locks to become available. Adaptivity can help determine when this trade-off is profitable. Another potential application of adaptivity is in changing the locking granularity of a TM system: an adaptive TM can start with associating lock words (used internally by the TM

implementation) at a coarse granularity. Then, as nominal contention is encountered, the system can switch to a finer granularity, and possibly back, if the overhead of fine-grained locking appears heavy and unwarranted. Such granularity switching is applicable to practically all TM systems we are aware of, both pessimistic and optimistic. Our adaptive locks can be seen as an extreme case of both granularity switching and pessimistic-optimistic switching at the same time.

Generally, we believe that the idea of adaptive concurrency control holds significant promise and that adaptive locks, as analyzed in this paper, are an excellent representative of the possibilities.

## ACKNOWLEDGMENTS.

This work was supported by Sun Microsystems via an equipment grant, by the National Science Foundation under grants CCF-0917774 and CCF-0934631, as well as by LogicBlox Inc. We would like to thank Doug Lea for his support and access to equipment, Michal Young for several insightful conversations, as well as the anonymous TRANSACT'09 and PACT'09 reviewers for many comments that helped improve the paper.

#### REFERENCES

- Martin Abadi, Andrew Birrell, Tim Harris, Johnson Hsieh, and Michael Isard. Implementation and use of transactional memory with dynamic separation. In *Compiler Construction*, March 2009.
- [2] Colin Blundell, E. Christopher Lewis, and Milo M. Martin. Subtleties of transactional memory atomicity semantics. *IEEE Comput. Archit. Lett.*, 5(2):17, 2006.
- [3] Colin Blundell, E Christopher Lewis, and Milo M. K. Martin. Unrestricted transactional memory: Supporting I/O and system calls within transactions. Technical Report CIS-06-09, Department of Computer and Information Science, University of Pennsylvania, Apr 2006.
- [4] Chandrasekhar Boyapati and Martin Rinard. A parameterized type system for race-free java programs. In OOPSLA '01: Proceedings of the 16th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, pages 56–69, New York, NY, USA, 2001. ACM.
- [5] Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee. Software transactional memory: why is it only a research toy? *Commun. ACM*, 51(11):40–46, 2008.
- [6] Sigmund Cherem, Trishul Chilimbi, and Sumit Gulwani. Inferring locks for atomic sections. In SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 304–315, New York, NY, USA, 2008. ACM.
- [7] David Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In Shlomi Dolev, editor, *Distributed Computing*, 20th International Symposium (DISC), volume 4167 of Lecture Notes in Computer Science. Springer, 2006.

- [8] Pedro C. Diniz and Martin C. Rinard. Dynamic feedback: An effective technique for adaptive computing. In SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 71–84, 1997.
- [9] Brian Goetz. Optimistic Thread Concurrency. http:// www.azulsystems.com/products/whitepaper/wp\_otc.pdf, January 2006.
- [10] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.
- [11] Dan Grossman, Jeremy Manson, and William Pugh. What do high-level memory models mean for transactions? In ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC), pages 62–69, New York, NY, USA, 2006. ACM.
- [12] Tim Harris. Exceptions and side-effects in atomic blocks. *Science of Computer Programming*, 58(3):325–343, 2005.
- [13] Tim Harris and Keir Fraser. Language support for lightweight transactions. In *OOPSLA '03: Proc. 18th conf. on Object-oriented Programing, Systems, Languages, and Applications*, pages 388–402, Anaheim, CA, 2003. ACM Press.
- [14] Tim Harris, Maurice Herlihy, Simon Marlow, and Simon Peyton-Jones. Composable memory transactions. In Proc. Symposium on Principles and Practice of Parallel Programming, Jun 2005.
- [15] Maurice Herlihy. Apologizing versus asking permission: Optimistic concurrency control for abstract data types. ACM Trans. Database Syst., 15(1):96–124, 1990.
- [16] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 289–300. May 1993.
- [17] Michael Hicks, Jeffrey S. Foster, and Polyvios Prattikakis. Lock inference for atomic sections. In Proceedings of the First ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing. Jun 2006.
- [18] Yossi Lev, Mark Moir, and Dan Nussbaum. PhTM: Phased transactional memory. In TRANSACT 07: Proceedings of the second ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing, 2007.
- [19] Bill McCloskey, Feng Zhou, David Gay, and Eric Brewer. Autolocker: Synchronization inference for atomic sections. In POPL '06: Proc. 33rd symposium on Principles of Programming Languages, pages 346–358. ACM Press, 2006.
- [20] Vijay Menon, Steven Balensiefer, Tatiana Shpeisman, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Bratin Saha, and Adam Welc. Single global lock semantics in a weakly atomic STM. In 3rd workshop of Transactional Computing (TRANSACT), 2008.
- [21] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun. STAMP: Stanford transactional applications for multi-processing. In IISWC '08: Proc. IEEE International Symposium on Workload Characterization, September 2008.

- [22] Chi Cao Minh, Martin Trautmann, JaeWoong Chung, Austen McDonald, Nathan Bronson, Jared Casper, Christos Kozyrakis, and Kunle Olukotun. An effective hybrid transactional memory system with strong isolation guarantees. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer architecture, pages 69–80, New York, NY, USA, 2007. ACM Press.
- [23] Michelle J. Moravan, Jayaram Bobba, Kevin E. Moore, Luke Yen, Mark D. Hill, Ben Liblit, Michael M. Swift, and David A. Wood. Supporting nested transactional memory in logTM. In ASPLOS-XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, pages 359–370, New York, NY, USA, 2006. ACM Press.
- [24] George C. Necula, Scott McPeak, Shree Prakash Rahul, and Westley Weimer. CIL: Intermediate language and tools for analysis and transformation of C programs. In CC '02: Proc. 11th Int. Conf. on Compiler Construction, pages 213–228, London, UK, 2002. Springer-Verlag.
- [25] Ravi Rajwar and James Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. 34th International Symposium on Microarchitecture, December, 00:294, 2001.
- [26] Martin C. Rinard. Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives. ACM Transactions on Computer Systems, 17(4):337–371, 1999.
- [27] Amitabha Roy, Steven Hand, and Tim Harris. A runtime system for software lock elision. In *Eurosys*, April 2009.
- [28] Tatiana Shpeisman, Vijay Menon, Ali-Reza Adl-Tabatabai, Steven Balensiefer, Dan Grossman, Richard L. Hudson, Katherine F. Moore, and Bratin Saha. Enforcing isolation and ordering in STM. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design Implementation, pages 78–88, New York, NY, USA, 2007. ACM Press.
- [29] Yannis Smaragdakis, Anthony Kay, Reimer Behrends, and Michal Young. Transactions with isolation and cooperation. In ACM Symposium on Object Oriented Programming: Systems, Languages, and Applications (OOPSLA). ACM Press, October 2007.
- [30] Yannis Smaragdakis, Anthony Kay, Reimer Behrends, and Michal Young. General and efficient locking without blocking. In ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC), 2008.
- [31] Adam Welc, Antony L. Hosking, and Suresh Jagannathan. Transparently reconciling transactions with locking for java synchronization. In European Conference on Object-Oriented Programming, pages 148–173, Jul 2006.