Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add delayed reclamation mechanism for free-threaded build (QSBR) #115103

Closed
Tracked by #108219
colesbury opened this issue Feb 6, 2024 · 3 comments
Closed
Tracked by #108219

Add delayed reclamation mechanism for free-threaded build (QSBR) #115103

colesbury opened this issue Feb 6, 2024 · 3 comments
Assignees
Labels
topic-free-threading type-feature A feature request or enhancement

Comments

@colesbury
Copy link
Contributor

colesbury commented Feb 6, 2024

Feature or enhancement

Many operations in the free-threaded build are protected by locks. However, in some cases, we want to allow reads to happen concurrently with updates 1. For example, we want to avoid locking during most list read accesses. If there is a concurrent list resize operation, when is it safe to free the list's previous array?

"Safe memory reclamation" (SMR) schemes address this usually by delaying the "free" operation until all concurrent read accesses are guaranteed to have completed. Quiescent state-based reclamation (QSBR) is a safe memory reclamation scheme that will be useful in the free-threaded build. QSBR relies on marked "quiescent states", when a thread reports that it is definitely not in the middle of a non-locking read operation. The eval breaker checks serve as a good place to report a quiescent state.

Background

SMR schemes are most common in systems that don't use garbage collection or reference counting. CPython uses reference counting and has a tracing garbage collector, so it may seem like an odd fit. However, some data structures are not reference counted. For example, a list object's array is not separately reference counted, but may have a shorter lifetime than the containing PyListObject. We could delay reclamation until the GC runs, but we want reclamation to happen quickly, and we generally want to run the GC less frequently in the free-threaded build, because it requires pausing all threads and, at least for now, scanning the entire heap.

Use cases

  • Dict keys (PyDictKeysObject) and list arrays (ob_item): If we resize a dict or list that may be shared between threads, we use QSBR to delay freeing the old keys/array until it's safe to do so. Note that most dicts and lists are not accessed by multiple threads; their keys/arrays can be immediately freed on resize.
  • Mimalloc mi_page_t: The non-locking dict and list accesses require cooperation from mimalloc. We want to ensure that even if an item is freed during access and the memory reused for a new object, the new object’s reference count field is placed at the same location in memory. In practice, this means that when an mi_page_t is empty, we don't immediately allow it to be re-used for allocations of a different size class. We use QSBR to determine when it's safe to use the mi_page_t for a different size class (or return the memory to the OS).

Implementation Overview

The implementation will be partially based on FreeBSD's "Global Unbounded Sequences". The FreeBSD implementation provides a good starting point because it's relatively simple, self contained, and the license (2-Clause BSD) is compatible with CPython. The implementation relies on a few sequence counters:

  • Global (per-interpreter) write sequence: When we want to safely free a data structure, we increment the global write sequence counter and tag the data to be freed with the new value.
  • Per-thread state read sequence: When a thread reaches a quiescent state (such as when it checks the eval breaker), it copies the global write sequence to its local read counter.

It's safe to free data once all thread states' read sequences are greater than or equal to the write sequence used to tag the data.

To check that, we scan over all the thread states read sequences to compute the minimum value (excluding detached threads). For efficiency, we also cache that computed value globally (per-interpreter).

Limitations

Determining the current minimum read sequence requires scanning over all thread states. This will likely become a bottleneck if we have a large number of threads (>1,000?). We will likely eventually want to address this, possibly with some combination of a tree-based mechanism 2 and incremental scanning.

For now, I think keeping the implementation simple important until we are able to run and benchmark multithreaded programs with the GIL disabled.

Linked PRs

Footnotes

  1. See "Optimistically Avoiding Locking" in PEP 703.

  2. https://people.kernel.org/joelfernandes/gus-vs-rcu

@colesbury colesbury added type-feature A feature request or enhancement topic-free-threading labels Feb 6, 2024
@colesbury colesbury self-assigned this Feb 6, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Feb 7, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Feb 8, 2024
This adds a safe memory reclamation scheme based on FreeBSD's "GUS" and
quiescent state based reclamation (QSBR). The API provides a mechanism
for callers to detect when it is safe to free memory that may be
concurrently accessed by readers.
colesbury added a commit to colesbury/cpython that referenced this issue Feb 8, 2024
This adds a safe memory reclamation scheme based on FreeBSD's "GUS" and
quiescent state based reclamation (QSBR). The API provides a mechanism
for callers to detect when it is safe to free memory that may be
concurrently accessed by readers.
colesbury added a commit to colesbury/cpython that referenced this issue Feb 12, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Feb 12, 2024
…uilds

This adds `_PyMem_FreeDelayed()` and supporting functions. The
`_PyMem_FreeDelayed()` function frees memory with the same allocator as
`PyMem_Free()`, but after some delay to ensure that concurrent lock-free
readers have finished.
colesbury added a commit to colesbury/cpython that referenced this issue Feb 13, 2024
…uilds

This adds `_PyMem_FreeDelayed()` and supporting functions. The
`_PyMem_FreeDelayed()` function frees memory with the same allocator as
`PyMem_Free()`, but after some delay to ensure that concurrent lock-free
readers have finished.
colesbury added a commit to colesbury/cpython that referenced this issue Feb 14, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Feb 14, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Feb 16, 2024
colesbury added a commit that referenced this issue Feb 16, 2024
This adds a safe memory reclamation scheme based on FreeBSD's "GUS" and
quiescent state based reclamation (QSBR). The API provides a mechanism
for callers to detect when it is safe to free memory that may be
concurrently accessed by readers.
colesbury added a commit to colesbury/cpython that referenced this issue Feb 16, 2024
…uilds

This adds `_PyMem_FreeDelayed()` and supporting functions. The
`_PyMem_FreeDelayed()` function frees memory with the same allocator as
`PyMem_Free()`, but after some delay to ensure that concurrent lock-free
readers have finished.
colesbury added a commit that referenced this issue Feb 20, 2024
…115367)

This adds `_PyMem_FreeDelayed()` and supporting functions. The
`_PyMem_FreeDelayed()` function frees memory with the same allocator as
`PyMem_Free()`, but after some delay to ensure that concurrent lock-free
readers have finished.
@corona10
Copy link
Member

corona10 commented Mar 2, 2024

For the record: 2e91578
The commit title should be Update gc.collect to process delayed objects not Update refleak checker to trigger _PyMem_ProcessDelayed but I forgot to change the title before merging.
I leave the comment for someone who track the commit log in future.

corona10 added a commit to corona10/cpython that referenced this issue Mar 2, 2024
corona10 added a commit to corona10/cpython that referenced this issue Mar 2, 2024
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this issue Mar 4, 2024
…uilds (python#115367)

This adds `_PyMem_FreeDelayed()` and supporting functions. The
`_PyMem_FreeDelayed()` function frees memory with the same allocator as
`PyMem_Free()`, but after some delay to ensure that concurrent lock-free
readers have finished.
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this issue Mar 4, 2024
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this issue Mar 4, 2024
…gh-116251)

* Revert "pythongh-115103: Update refleak checker to trigger _PyMem_ProcessDelayed (pythongh-116238)"

This reverts commit 2e91578.

* pythongh-115103: Update gc.collect to process delayed objects
colesbury added a commit to colesbury/cpython that referenced this issue Mar 4, 2024
This sets `MI_DEBUG` to `2` in debug builds to enable
`mi_assert_internal()` calls. Expensive internal assertions are not
neabled.
colesbury added a commit to colesbury/cpython that referenced this issue Mar 4, 2024
This sets `MI_DEBUG` to `2` in debug builds to enable
`mi_assert_internal()` calls. Expensive internal assertions are not
enabled.

This also disables an assertion in free-threaded builds that would be
triggered by the free-threaded GC because we traverse heaps that are not
owned by the current thread.
colesbury added a commit to colesbury/cpython that referenced this issue Mar 4, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Mar 5, 2024
colesbury added a commit that referenced this issue Mar 5, 2024
This sets `MI_DEBUG` to `2` in debug builds to enable `mi_assert_internal()`
calls. Expensive internal assertions are not enabled.

This also disables an assertion in free-threaded builds that would be
triggered by the free-threaded GC because we traverse heaps that are not
owned by the current thread.
colesbury added a commit to colesbury/cpython that referenced this issue Mar 5, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Mar 6, 2024
colesbury added a commit that referenced this issue Mar 6, 2024
This implements the delayed reuse of mimalloc pages that contain Python
objects in the free-threaded build.

Allocations of the same size class are grouped in data structures called
pages. These are different from operating system pages. For thread-safety, we
want to ensure that memory used to store PyObjects remains valid as long as
there may be concurrent lock-free readers; we want to delay using it for
other size classes, in other heaps, or returning it to the operating system.

When a mimalloc page becomes empty, instead of immediately freeing it, we tag
it with a QSBR goal and insert it into a per-thread state linked list of
pages to be freed. When mimalloc needs a fresh page, we process the queue and
free any still empty pages that are now deemed safe to be freed. Pages
waiting to be freed are still available for allocations of the same size
class and allocating from a page prevent it from being freed. There is
additional logic to handle abandoned pages when threads exit.
@colesbury
Copy link
Contributor Author

I think this is complete now

colesbury added a commit to colesbury/cpython that referenced this issue Mar 7, 2024
If a thread blocks while waiting on the `shared->mutex` lock, the array
of QSBR states may be reallocated. The `tstate->qsbr` values before the
lock is acquired may not be the same as the value after the lock is acquired.
colesbury added a commit that referenced this issue Mar 8, 2024
If a thread blocks while waiting on the `shared->mutex` lock, the array
of QSBR states may be reallocated. The `tstate->qsbr` values before the
lock is acquired may not be the same as the value after the lock is acquired.
adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
…gh-116251)

* Revert "pythongh-115103: Update refleak checker to trigger _PyMem_ProcessDelayed (pythongh-116238)"

This reverts commit 2e91578.

* pythongh-115103: Update gc.collect to process delayed objects
adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
…ython#116343)

This sets `MI_DEBUG` to `2` in debug builds to enable `mi_assert_internal()`
calls. Expensive internal assertions are not enabled.

This also disables an assertion in free-threaded builds that would be
triggered by the free-threaded GC because we traverse heaps that are not
owned by the current thread.
adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
…ython#115435)

This implements the delayed reuse of mimalloc pages that contain Python
objects in the free-threaded build.

Allocations of the same size class are grouped in data structures called
pages. These are different from operating system pages. For thread-safety, we
want to ensure that memory used to store PyObjects remains valid as long as
there may be concurrent lock-free readers; we want to delay using it for
other size classes, in other heaps, or returning it to the operating system.

When a mimalloc page becomes empty, instead of immediately freeing it, we tag
it with a QSBR goal and insert it into a per-thread state linked list of
pages to be freed. When mimalloc needs a fresh page, we process the queue and
free any still empty pages that are now deemed safe to be freed. Pages
waiting to be freed are still available for allocations of the same size
class and allocating from a page prevent it from being freed. There is
additional logic to handle abandoned pages when threads exit.
adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
If a thread blocks while waiting on the `shared->mutex` lock, the array
of QSBR states may be reallocated. The `tstate->qsbr` values before the
lock is acquired may not be the same as the value after the lock is acquired.
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…115180)

This adds a safe memory reclamation scheme based on FreeBSD's "GUS" and
quiescent state based reclamation (QSBR). The API provides a mechanism
for callers to detect when it is safe to free memory that may be
concurrently accessed by readers.
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…uilds (python#115367)

This adds `_PyMem_FreeDelayed()` and supporting functions. The
`_PyMem_FreeDelayed()` function frees memory with the same allocator as
`PyMem_Free()`, but after some delay to ensure that concurrent lock-free
readers have finished.
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…gh-116251)

* Revert "pythongh-115103: Update refleak checker to trigger _PyMem_ProcessDelayed (pythongh-116238)"

This reverts commit 2e91578.

* pythongh-115103: Update gc.collect to process delayed objects
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…ython#116343)

This sets `MI_DEBUG` to `2` in debug builds to enable `mi_assert_internal()`
calls. Expensive internal assertions are not enabled.

This also disables an assertion in free-threaded builds that would be
triggered by the free-threaded GC because we traverse heaps that are not
owned by the current thread.
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…ython#115435)

This implements the delayed reuse of mimalloc pages that contain Python
objects in the free-threaded build.

Allocations of the same size class are grouped in data structures called
pages. These are different from operating system pages. For thread-safety, we
want to ensure that memory used to store PyObjects remains valid as long as
there may be concurrent lock-free readers; we want to delay using it for
other size classes, in other heaps, or returning it to the operating system.

When a mimalloc page becomes empty, instead of immediately freeing it, we tag
it with a QSBR goal and insert it into a per-thread state linked list of
pages to be freed. When mimalloc needs a fresh page, we process the queue and
free any still empty pages that are now deemed safe to be freed. Pages
waiting to be freed are still available for allocations of the same size
class and allocating from a page prevent it from being freed. There is
additional logic to handle abandoned pages when threads exit.
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
If a thread blocks while waiting on the `shared->mutex` lock, the array
of QSBR states may be reallocated. The `tstate->qsbr` values before the
lock is acquired may not be the same as the value after the lock is acquired.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-free-threading type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants