Add some declarations for allocator control

and a design doc explaining why. Unless compiled in with #+system-tlabs, there is no change to the generated code. So average users won't care either way. The implementation of this feature is much further along than implied by the design. Which is to say, it works, and will be committed in a series of changes, pending some touch-ups.
sbcl · Oct 13, 2022 · 7f65522
1 parent 31d7946
commit 7f65522
Showing 4 changed files with 189 additions and 13 deletions.
diff --git a/doc/internals-notes/arena-allocation.txt b/doc/internals-notes/arena-allocation.txt
@@ -0,0 +1,166 @@
+Overview
+========
+Arenas offer a mark/release paradigm for rapid deallocation of thread-local
+lisp objects with the goal of reducing global heap usage.
+
+The implementation is built atop the existing pointer-bump allocator
+of 'gencgc' with a provision for redirecting the allocator's free-pointer
+somewhere other than dynamic space. When the C fallback is invoked,
+it notices that allocation should not occur to the main heap.
+
+It is possible for multiple threads to share one arena, or for threads
+to each get their own arena, or potentially even to have more than one
+arena controlled by a thread. The constraint is that in order to release
+all memory used by an arena without incurring a stop-the-world event
+there must be no heap-to-arena pointer reachable in a graph trace,
+supposing that the about-to-be-released memory is not a root.
+An arena has to be released in total, though in theory it could be
+possible to provide a partial release feature as well.
+It thus becomes possible to discard large portions of the
+reachability graph under user control.
+
+Design consideration
+====================
+Two possible approaches toward modifying the pointer-bump were reasonable:
+1) upon each allocation, decide whether it is to occur to the dynamic space
+   or elsewhere, and then use a pair of pointers (free-pointer and limit)
+   that are particular to either the dynamic space or the elsewhere.
+   As always, when the free-pointer and limit coincide, the slow path
+   is invoked which calls to the C runtime support for help.
+
+2) upon each allocation, assume that a _single_ pair of pointers (as before,
+   the free-pointer and limit) are always pointing to the correct "place"
+   (dynamic space or elsewhere). This introduces no branching except
+   in the case where the slow path is invoked.
+
+Option (1) entails significantly more runtime overhead, as every allocation
+would involve flag checking and lookups of the address containing the
+pointers that should be used.
+
+Option (2) entails no overhead beyond the free==limit check, which is always
+performed no matter what. By suitably initializing the pointer and limit
+to coincide on each "switch" of arena<->heap, the only overhead to be
+introduced is in the fallback code (the C runtime).
+This settles the matter: approach (2) wins.
+
+Implementation
+==============
+Each thread structure is augmented with two new thread-local
+allocation buffers ("TLABs"). One is for conses and the other
+is for everything else. These mirror the existing two TLABs.
+The distinction between cons/non-cons is that cons pages can be prefilled
+in each byte with 0xFF which is not a valid cons; whereas other objects are
+prefilled with 0s, which is similarly not valid. Thus we can recognize
+portions of memory that have not been initialized with valid objects.
+The cons of (0 . 0) is valid, and so 0 is inadequate to detect uninitialized
+memory.
+
+In total, each thread has 4 TLABs:
+  system conses
+  system "Mixed objects"
+  user conses
+  user "Mixed objects"
+
+Correct use of the distinct TLABs allows the user code to avoid
+creating heap-to-arena pointers.
+In the absence of arenas, the "user" TLABs are the ones ordinarily
+used for all allocations, including "system" allocation, i.e. there is
+no distinction between "user" and "system" code.
+In the presence of arenas, the user TLABs are directed either
+to the dynamic space or to the arena, depending on the dynamic control
+which selects whether arena allocation is to occur.
+In contrast, "system" TLABs can only allocate to dynamic space.
+
+Memory is claimed from the arena in small chunks, much as it is
+obtained from dynamic space in a certain granularity, currently
+32 KiB, which is SB-VM:GENCGC-PAGE-BYTES.  The arena allocation
+granularity is the same, for no particular reason.
+There is actually no restriction on the chunk size, so objects
+in excess of 32 KiB can be allocated to the arena.
+When several threads share a single arena, they claim successive
+chunks using a compare-and-swap on the arena-relative free-pointer.
+
+GC interaction
+==============
+The memory in an arena is intended to be invisible to GC
+for the most part. Pointers between dynamic space and the arena
+in either direction are not traced by the collector. Therefore,
+applications making use of arenas should generally inhibit
+collection around use of the arena. Arenas do not attempt
+to emulate "thread local heaps".
+
+While debugging arena-based algorithms it is helpful to treat arenas
+as GC roots, so that if garbage-collection occurs organically due to
+dynamic-space usage, all heap objects pointed to by any arena remain live.
+Using the tools available such as SB-EXT:SEARCH-ROOTS and the new
+FIND-HEAP->ARENA, it is almost always possible to eradicate
+the "forbidden" heap->arena pointers. This is of course only for
+debugging, because any real-world scenario would expect not to need
+the extra delay that comes from hunting for pointers, as it is
+entirely contrary to the intent of using the arena in the first place.
+
+Thread interaction
+==================
+A created thread inherits the arena of its creator.
+
+At present, threads do not maintain enough state to know where
+they were allocating in both the arena and the dynamic space.
+Consequently, each "switch" from arena to dynamic space and back
+incurs a small amount of waste, as the last chunk of memory claimed
+for that thread in a particular TLAB is discarded.
+
+Control mechanisms
+==================
+SB-VM:WITH-ARENA
+   specifies that all allocations within its dynamic scope
+   (i.e. regardless of where in the program allocation occurs)
+   are to be directed to the arena, with the exception that
+   code which was compiled to use the system TLAB will only
+   allocate to the heap.
+SB-VM:WITHOUT-ARENA
+   specifies that all allocations within its dynamic scope
+   are to be directed to the dynamic space.
+SB-VM:IN-SAME-ARENA (X)
+   specifies that allocation should occur where object X
+   was allocated.
+(DECLARE (SB-C::TLAB :SYSTEM))
+   specifies that within its lexical scope, all allocations
+   should go to the heap.  This works as intended _only_ _if_
+   all allocations within the scope are handled as
+   "inline" allocations.  Code that is called from within
+   the scope of this declaration does not see the declaration
+   (as is to be expected per the language semantics)
+   and therefore uses the dynamic mechanism.
+
+Best practice
+=============
+Based on the preceding description of the control mechanisms
+and the limitation upon switching in terms of memory waste,
+it should be evident that code which uses the lexical declaration
+is slightly to be preferred.
+It is often possible to avoid use of the dynamic mechanism
+by replacing an allocation point with the following pattern:
+
+ (if (should-allocate-to-heap)
+     (locally (declare (sb-c::tlab :system)) (do-allocation))
+     (do-allocation))
+
+So despite the "doubling" of the allocator form, this is potentially
+more efficient. Most likely the user would wrap this idiom in a macro.
+
+In practice, all mechanisms of control are necessary. Within a
+lexically scoped usage, there might be a hidden call to a builtin
+function such as REVERSE that would cons a new list using
+the dynamic choice of heap or arena.
+
+Pending items
+=============
+* Some of the waste that comes from switching between arena and heap
+  can be avoided by adding more state to the thread structure.
+
+* Background thread pools in particular are a problem.
+  In one such implementation, a thread which requests work to be performed
+  by a worker in a pool sends as part of the work request an identifier
+  of the arena in use by the requester. The worker will switch to that
+  same arena. The implication is that worker threads will constantly be
+  switching their arena, which, as per above, is inefficient.
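
Note on the "Best practice" section of arena-allocation.txt: the design doc suggests that users would wrap the doubled-allocator idiom in a macro. A minimal sketch of such a wrapper follows. It is not part of this commit. The names ALLOCATING-TO-HEAP-IF and MAKE-PAIR are hypothetical, and the sketch assumes an SBCL build in which the (SB-C::TLAB :SYSTEM) declaration is recognized. The body is expanded twice so that the allocation form sits textually inside the LOCALLY; only one copy is evaluated at run time.

```lisp
;; Hypothetical helper, for illustration only; not defined by SBCL.
(defmacro allocating-to-heap-if (test &body body)
  "Evaluate BODY exactly once. When TEST is true, the copy of BODY that
runs is the one compiled within the lexical scope of the system-TLAB
declaration; otherwise the undeclared copy runs."
  `(if ,test
       ;; This copy of BODY is compiled under the lexical declaration.
       (locally (declare (sb-c::tlab :system)) ,@body)
       ;; This copy follows the dynamic choice of heap or arena.
       (progn ,@body)))

;; Hypothetical usage: the CONS form appears directly in the macro body,
;; so the declaration can apply to it when it is allocated inline.
(defun make-pair (heap-p a b)
  (allocating-to-heap-if heap-p
    (cons a b)))
```

As the design doc points out, the declaration only covers allocations that are handled inline within its scope. A call inside the body to a function such as REVERSE would still use the dynamic choice of heap or arena.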
diff --git a/src/code/fd-stream.lisp b/src/code/fd-stream.lisp
@@ -40,7 +40,7 @@
 ;;;;  (incf (buffer-tail buffer) n))
 ;;;;
 
-(defstruct (buffer (:constructor %make-buffer (sap length))
+(defstruct (buffer (:constructor !make-buffer (sap length))
                    (:copier nil))
   (sap (missing-arg) :type system-area-pointer :read-only t)
   (length (missing-arg) :type index :read-only t)
@@ -55,10 +55,11 @@
   "Default number of bytes per buffer.")
 
 (defun alloc-buffer (&optional (size +bytes-per-buffer+))
+  (declare (sb-c::tlab :system) (inline !make-buffer))
   ;; Don't want to allocate & unwind before the finalizer is in place.
   (without-interrupts
     (let* ((sap (allocate-system-memory size))
-           (buffer (%make-buffer sap size)))
+           (buffer (!make-buffer sap size)))
       (when (zerop (sap-int sap))
         (error "Could not allocate ~D bytes for buffer." size))
       (finalize buffer (lambda ()
@@ -77,6 +78,7 @@
   buffer)
 
 (defun release-buffer (buffer)
+  (declare (sb-c::tlab :system))
   (reset-buffer buffer)
   (atomic-push buffer *available-buffers*))
 

diff --git a/src/code/final.lisp b/src/code/final.lisp
@@ -43,8 +43,7 @@
 (declaim (simple-vector **finalizer-store**))
 
 (defun finalize (object function &key dont-save
-                        &aux (function (%coerce-callable-to-fun function))
-                             (item (if dont-save (list function) function)))
+                        &aux (function (%coerce-callable-to-fun function)))
   "Arrange for the designated FUNCTION to be called when there
 are no more references to OBJECT, including references in
 FUNCTION itself.
@@ -91,11 +90,14 @@ Examples:
     (finalize \"oops\" #'oops)
     (oops)) ; GC causes re-entry to #'oops due to the finalizer
             ; -> ERROR, caught, WARNING signalled"
+  (declare (sb-c::tlab :system))
   (unless object
     (error "Cannot finalize NIL."))
-  (with-finalizer-store (store)
-    (let ((id (gethash object (finalizer-id-map store))))
-      (cond (id ; object already has at least one finalizer
+  (let ((item (if dont-save (list function) function)))
+    (with-finalizer-store (store)
+      (let ((id (gethash object (finalizer-id-map store))))
+        (cond
+            (id ; object already has at least one finalizer
              ;; Multiple finalizers are invoked in the order added.
              (let* ((old (svref store id))
                     (new (make-array (if (simple-vector-p old)
@@ -130,7 +132,7 @@ Examples:
              ;; Clear out lingering junk from (SVREF STORE ID) before
              ;; establishing that OBJECT maps to that index.
              (setf (svref store id) item
-                   (gethash object (finalizer-id-map store)) id)))))
+                   (gethash object (finalizer-id-map store)) id))))))
   object)
 
 (defun invalidate-fd-streams ()
@@ -230,8 +232,9 @@ Examples:
          ;; Not strictly necessary to do this: the next FINALIZE claiming
          ;; the same ID would assign a fresh list anyway.
          (setf (svref store it) 0)
-         (atomic-push it (finalizer-recycle-bin store)))))
-    object))
+         (locally (declare (sb-c::tlab :system))
+           (atomic-push it (finalizer-recycle-bin store)))))))
+  object)
 
 ;;; Drain the queue of finalizers and return when empty.
 ;;; Concurrent invocations of this function in different threads are ok.

diff --git a/src/code/target-thread.lisp b/src/code/target-thread.lisp
@@ -363,6 +363,8 @@ created and old ones may exit at any time."
   "True if THREAD, defaulting to current thread, is the main thread of the process."
   (eq thread *initial-thread*))
 
+(locally (declare (sb-c::tlab :system)) (defun sys-tlab-list (&rest args) args))
+
 (defmacro return-from-thread (values-form &key allow-exit)
   "Unwinds from and terminates the current thread, with values from
 VALUES-FORM as the results visible to JOIN-THREAD.
@@ -374,7 +376,7 @@ ALLOW-EXIT is true, returning from the main thread is equivalent to
 calling SB-EXT:EXIT with :CODE 0 and :ABORT NIL.
 
 See also: ABORT-THREAD and SB-EXT:EXIT."
-  `(%return-from-thread (multiple-value-list ,values-form) ,allow-exit))
+  `(%return-from-thread (multiple-value-call #'sys-tlab-list ,values-form) ,allow-exit))
 
 (defun %return-from-thread (values allow-exit)
   (let ((self *current-thread*))
@@ -1539,7 +1541,8 @@ on this semaphore, then N of them is woken up."
          (let ((old (sb-ext:cas (thread-%visible thread) 1 -1)))
            ;; now (LIST-ALL-THREADS) won't see it
            (aver (eql old 1)))
-         (sb-ext:atomic-push thread *joinable-threads*))
+         (locally (declare (sb-c::tlab :system))
+           (sb-ext:atomic-push thread *joinable-threads*)))
         (t ; otherwise, physically remove from *ALL-THREADS*
          ;; The memory allocation/deallocation is handled in C.
          ;; I would like to combine the recycle bin for foreign and lisp threads though.
@@ -1885,6 +1888,8 @@ session."
           (prot "protect_alien_stack_guard_page")))
       (unless (= (sap-int thread-sap) 0) thread-sap))))
 
+;;; FIXME: now that #-pauseless-threadstart is gone, this macro is just silly.
+;;;        So remove it and write DEFUN like a normal human.
 (defmacro thread-trampoline-defining-macro (&body body) ; NEW WAY
   `(defun run ()
      (macrolet ((apply-real-function ()
@@ -1938,7 +1943,7 @@ session."
                              1))
                      (unmask-signals)
                      (let ((list
-                             (multiple-value-list
+                             (multiple-value-call #'sys-tlab-list
                                 (unwind-protect
                                      (catch '%return-from-thread
                                        (sb-c::inspect-unwinding