riscv · hboehm · Mar 28, 2023 · aswaterman · Apr 19, 2023 · hboehm
diff --git a/src/a.tex b/src/a.tex
@@ -229,19 +229,27 @@ \section{Load-Reserved/Store-Conditional Instructions}
 
 An SC instruction can never be observed by another RISC-V hart
 before the LR instruction that established the reservation.
+
+\begin{commentary}
 The LR/SC
 sequence can be given acquire semantics by setting the {\em aq} bit on
 the LR instruction.  The LR/SC sequence can be given release semantics
-by setting the {\em rl} bit on the SC instruction.  Setting the {\em
-  aq} bit on the LR instruction, and setting both the {\em aq} and the {\em
-  rl} bit on the SC instruction makes the LR/SC sequence sequentially
-consistent, meaning that it cannot be reordered with earlier or
-later memory operations from the same hart.
-
-If neither bit is set on both LR and SC, the LR/SC sequence can be
+by setting the {\em rl} bit on the SC instruction.  Assuming
+suitable mappings for other atomic operations, setting the
+{\em aq} bit on the LR instruction, and setting the
+{\em rl} bit on the SC instruction makes the LR/SC
+sequence sequentially consistent in the C++ {\em memory\_order\_seq\_cst}
+sense. Such a sequence does not act as a fence for ordering ordinary
+load and store instructions before and after the sequence. Specific
+instruction mappings for other C++ atomic operations,
+or stronger notions of ``sequential consistency'', may require both
+bits to be set on either or both of the LR or SC instruction.
+
+If neither bit is set on either LR or SC, the LR/SC sequence can be
 observed to occur before or after surrounding memory operations from
 the same RISC-V hart.  This can be appropriate when the LR/SC
 sequence is used to implement a parallel reduction operation.
+\end{commentary}
 
 Software should not set the {\em rl} bit on an LR instruction unless the {\em
 aq} bit is also set, nor should software set the {\em aq} bit on an SC
@@ -478,20 +486,6 @@ \section{Atomic Memory Operations}
 it cannot be reordered with earlier or later memory operations from the same
 hart.
 
-\begin{commentary}
-The AMOs were designed to implement the C11 and C++11 memory models
-efficiently.  Although the FENCE R, RW instruction suffices to
-implement the {\em acquire} operation and FENCE RW, W suffices to
-implement {\em release}, both imply additional unnecessary ordering as
-compared to AMOs with the corresponding {\em aq} or {\em rl} bit set.
-\end{commentary}
-
-An example code sequence for a critical section guarded by a
-test-and-test-and-set spinlock is shown in Figure~\ref{critical}.  Note the
-first AMO is marked {\em aq} to order the lock acquisition before the
-critical section, and the second AMO is marked {\em rl} to order
-the critical section before the lock relinquishment.
-
 \begin{figure}[h!]
 \begin{center}
 \begin{verbatim}
@@ -511,14 +505,36 @@ \section{Atomic Memory Operations}
 \label{critical}
 \end{figure}
 
+\begin{commentary}
+The AMOs were designed to implement the C11 and C++11 memory models
+efficiently.  Although the FENCE R, RW instruction suffices to
+implement the {\em acquire} operation and FENCE RW, W suffices to
+implement {\em release}, both imply additional unnecessary ordering as
+compared to AMOs with the corresponding {\em aq} or {\em rl} bit set.
+\end{commentary}
+
+An example code sequence for a critical section guarded by a
+test-and-test-and-set spinlock is shown in Figure~\ref{critical}.  Note the
+first AMO is marked {\em aq} to order the lock acquisition before the
+critical section, and the second AMO is marked {\em rl} to order
+the critical section before the lock relinquishment.
+
 \begin{commentary}
 We recommend the use of the AMO Swap idiom shown above for both lock
 acquire and release to simplify the implementation of speculative lock
 elision~\cite{Rajwar:2001:SLE}.
 \end{commentary}
 
-The instructions in the ``A'' extension can also be used to provide
-sequentially consistent loads and stores.  A sequentially consistent load can
-be implemented as an LR with both {\em aq} and {\em rl} set. A sequentially
-consistent store can be implemented as an AMOSWAP that writes the old value to
-x0 and has both {\em aq} and {\em rl} set.
+\begin{commentary}
+The instructions in the ``A'' extension can be used to provide sequentially
+consistent loads and stores, but this constrains hardware
+reordering of memory accesses more than necessary.
+A C++ sequentially consistent load can be implemented as
+an LR with {\em aq} set. However, the LR/SC eventual
+success guarantee may slow down concurrent loads from the same effective
+address. A sequentially consistent store can be implemented as an AMOSWAP
+that writes the old value to x0 and has {\em rl} set. However the superfluous
+load may impose ordering constraints that are unnecessary for this use case.
+Specific compilation conventions may require both the {\em aq} and {\em rl}
+bits to be set in either or both the LR and AMOSWAP instructions.
+\end{commentary}
diff --git a/src/memory.tex b/src/memory.tex
@@ -1225,6 +1225,7 @@ \section{Code Porting and Mapping Guidelines}
 Table~\ref{tab:c11mappings} provides a mapping of C11/C++11 atomic operations onto RISC-V memory instructions.
 If load and store opcodes with {\em aq} and {\em rl} modifiers are introduced, then the mappings in Table~\ref{tab:c11mappings_hypothetical} will suffice.
 Note however that the two mappings only interoperate correctly if {\tt atomic\_<op>(memory\_order\_seq\_cst)} is mapped using an LR that has both {\em aq} and {\em rl} set.
+Even more importantly, a Table~\ref{tab:c11mappings} sequentially consistent store, followed by a Table~\ref{tab:c11mappings_hypothetical} sequentially consistent load can be reordered unless the Table~\ref{tab:c11mappings} mapping of stores is strengthened by either adding a second fence or mapping the store to {\tt amoswap.rl} instead.
 
 Any AMO can be emulated by an LR/SC pair, but care must be taken to ensure that any PPO orderings that originate from the LR are also made to originate from the SC, and that any PPO orderings that terminate at the SC are also made to terminate at the LR.
 For example, the LR must also be made to respect any data dependencies that the AMO has, given that load operations do not otherwise have any notion of a data dependency.