GH-133136: Revise QSBR to reduce excess memory held #135473
Conversation
The free threading build uses QSBR to delay the freeing of dictionary keys and list arrays when the objects are accessed by multiple threads, in order to allow concurrent reads to proceed without holding the object lock. The requests are processed in batches to reduce execution overhead, but for large memory blocks this can lead to excess memory usage. Take into account the size of the memory block when deciding when to process QSBR requests.
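To illustrate the idea, here is a minimal, self-contained sketch of size-aware batching. The names and thresholds (`struct deferred_state`, `DEFERRED_COUNT_LIMIT`, `DEFERRED_MEM_LIMIT`, the helpers) are assumptions for illustration, not the identifiers used in CPython.

```c
/* Minimal, self-contained sketch of the size-aware batching heuristic
 * described above.  Names and thresholds are illustrative only; they are
 * not the identifiers used in CPython's qsbr.c/obmalloc.c. */
#include <stddef.h>
#include <stdio.h>

#define DEFERRED_COUNT_LIMIT 10              /* process after this many requests  */
#define DEFERRED_MEM_LIMIT   (1024 * 1024)   /* ...or once this much memory waits */

struct deferred_state {
    size_t count;     /* number of pending delayed frees */
    size_t memory;    /* total bytes held by those frees */
};

/* Stand-in for processing the QSBR queue once readers have quiesced. */
static void process_deferred(struct deferred_state *st)
{
    printf("processing %zu blocks, %zu bytes\n", st->count, st->memory);
    st->count = 0;
    st->memory = 0;
}

static void free_delayed(struct deferred_state *st, size_t size)
{
    st->count += 1;
    st->memory += size;
    /* Counting bytes as well as requests means one large block can trigger
     * processing on its own instead of waiting for a full batch of small frees. */
    if (st->count >= DEFERRED_COUNT_LIMIT || st->memory >= DEFERRED_MEM_LIMIT) {
        process_deferred(st);
    }
}

int main(void)
{
    struct deferred_state st = {0, 0};
    free_delayed(&st, 64);              /* small block: just queued         */
    free_delayed(&st, 8 * 1024 * 1024); /* large block: processed promptly  */
    return 0;
}
```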
Objects/obmalloc.c (outdated):

    size_t bsize = mi_page_block_size(page);
    page->qsbr_goal = _Py_qsbr_advance_with_size(tstate->qsbr, page->capacity*bsize);
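For context, a rough sketch of what an advance-with-size helper might look like follows. The threshold, struct, and field names are assumptions, and the real helper was later reworked and renamed (see the commit notes below).

```c
/* Illustrative sketch only: advance the shared write sequence once enough
 * deferred memory has accumulated, rather than on every deferred free.
 * ADVANCE_MEM_THRESHOLD and the struct fields are assumed names. */
#include <stdint.h>
#include <stddef.h>

#define QSBR_INCR 2                          /* sequence advances in steps of 2 */
#define ADVANCE_MEM_THRESHOLD (256 * 1024)

struct qsbr_shared_sketch {
    uint64_t wr_seq;          /* global write sequence (atomic in the real code) */
    size_t   deferred_memory; /* bytes deferred since the last advance           */
};

static uint64_t
advance_with_size_sketch(struct qsbr_shared_sketch *q, size_t size)
{
    q->deferred_memory += size;
    if (q->deferred_memory < ADVANCE_MEM_THRESHOLD) {
        /* Not enough pending memory yet: return the next sequence value as
         * the goal without actually advancing the shared sequence. */
        return q->wr_seq + QSBR_INCR;
    }
    q->deferred_memory = 0;
    q->wr_seq += QSBR_INCR;   /* an atomic add in the real code */
    return q->wr_seq;
}
```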
This might be the right heuristic, but this is a bit different from _PyMem_FreeDelayed:

* _PyMem_FreeDelayed holds onto the memory until quiescence. It prevents the memory from being used for any purpose.
* _PyMem_mi_page_maybe_free only prevents the page from being used by another thread or for a different size class. That's a lot less restrictive.
Ah, good point. The memory being held (avoiding collection) by mimalloc is not at all the same as the deferred frees. I reworked the PR so that memory is tracked separately. I also decoupled the write sequence advance from the triggering of _PyMem_ProcessDelayed(), and used process_seq as a target value for the read sequence.

Now _qsbr_thread_state is larger than 64 bytes. I don't think that should be a problem.
* Keep a separate count of mimalloc page memory that is deferred from collection. This memory doesn't get freed by _PyMem_ProcessDelayed(); we want to advance the write sequence if there is too much of it, but calling _PyMem_ProcessDelayed() is not helpful.
* Use the `process_seq` variable to schedule the next call to `_PyMem_ProcessDelayed()`.
* Rename advance functions to have "deferred" in the name.
* Move the `_Py_qsbr_should_process()` call up one level.
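A rough sketch of the per-thread bookkeeping these changes imply is below; the struct and field names are assumptions for illustration and not necessarily the ones used in the PR.

```c
/* Illustrative only: per-thread QSBR bookkeeping with mimalloc page memory
 * counted separately from memory that _PyMem_ProcessDelayed() can free.
 * Names are assumptions, not necessarily those in pycore_qsbr.h. */
#include <stdint.h>
#include <stddef.h>

struct qsbr_thread_state_sketch {
    uint64_t seq;                   /* this thread's read sequence                 */
    uint64_t process_seq;           /* read sequence target for the next call to
                                       _PyMem_ProcessDelayed()                     */
    size_t   deferred_count;        /* pending delayed-free requests               */
    size_t   deferred_memory;       /* bytes freeable by _PyMem_ProcessDelayed()   */
    size_t   deferred_page_memory;  /* mimalloc page bytes held back from
                                       collection; processing the queue does not
                                       free these, but too much of it should still
                                       push the write sequence forward             */
};
```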
Since _Py_atomic_add_uint64() returns the old value, we need to add QSBR_INCR.
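As a minimal illustration of that point (assuming _Py_atomic_add_uint64() has the usual fetch-and-add semantics, returning the pre-increment value):

```c
#include <stdint.h>

#define QSBR_INCR 2   /* QSBR sequence numbers advance in steps of 2 */

/* Sketch: a fetch-and-add returns the value *before* the increment, so the
 * new write sequence is the returned value plus QSBR_INCR. */
static uint64_t
qsbr_advance_sketch(uint64_t *wr_seq)
{
    /* stand-in for: _Py_atomic_add_uint64(wr_seq, QSBR_INCR) */
    uint64_t old = *wr_seq;
    *wr_seq = old + QSBR_INCR;
    return old + QSBR_INCR;   /* report the new sequence, not the old one */
}
```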
Refactor code to keep obmalloc logic out of the qsbr.c file. Call _PyMem_ProcessDelayed() from the eval breaker.
After reverting the erroneous change, I refactored the code to put the "should advance" logic into the obmalloc file. I think that makes more sense compared with having it in the qsbr.c file. The dict_mutate_qsbr_mem.py.txt benchmark RSS sizes, in MB:
Updated pyperformance results:
This is a refinement of GH-135107. Additional changes:

* `_Py_qsbr_advance_with_size()` to reduce duplicated code

With these changes, the memory held by QSBR is typically freed a bit more quickly and the process RSS stays a bit smaller.
Regarding the changes to advance and processing, GH-135107 has the following minor issues: if the memory threshold is exceeded when a new item is added by free_delayed(), we immediately set memory_deferred = 0 and process. It is very unlikely that the goal has been reached for the newly added item. If that's a big chunk of memory, we would have to wait until the next process in order to actually free it. This PR tries to avoid that by storing the seq (local read sequence) as it was at the last process time. If that hasn't changed (this thread hasn't entered a quiescent state), then we wait before processing. This at least gives a chance that other readers will catch up and the process can actually free things.

This PR also changes how often we can defer the advance of the global write sequence. Previously, we deferred it up to 10 times. However, I think there is not much benefit to advancing it unless we are nearly ready to process. So, should_advance_qsbr() checks if it seems time to process. _Py_qsbr_should_process() checks if the local read sequence has been updated. That means the write sequence has advanced (it's time to process) and the read sequence for this thread has also advanced. This doesn't tell us that the other threads have advanced their read sequences, but we don't want to pay the cost of checking that (it would require a "poll").
pyperformance memory usage results