[sleep mode] save memory for on-the-fly quantization #24731
Conversation
Signed-off-by: youkaichao <youkaichao@gmail.com>
Code Review
This pull request aims to save memory during on-the-fly quantization by manually releasing unused memory blocks from the custom `CuMemAllocator` pool. The changes introduce logic to iterate through memory pool allocations at the end of a `use_memory_pool` context and free any blocks that are no longer in use. This is a workaround for a PyTorch issue where `torch.cuda.empty_cache()` does not work with pluggable allocators. The changes also include additional logging to provide visibility into memory management. My review focuses on improving the robustness of the new memory release logic. I've identified a potential `KeyError` that could occur and suggested adding error handling to prevent crashes.
```python
allocations = data[0].snapshot()
for allocation in allocations:
    if allocation["allocated_size"] == 0:
        handle = self._python_free_callback(allocation["address"])
        unmap_and_release(handle)
```
The current implementation assumes that any memory block with `allocated_size == 0` will have a corresponding entry in `self.pointer_to_data`. However, it's possible that the free callback was already triggered for an allocation, removing it from `self.pointer_to_data`, while the memory block is still tracked by the pool. This would lead to a `KeyError` when `_python_free_callback` is called, as it internally performs a `pop`, which could crash the application.

To make the code more robust, you should wrap the calls in a `try...except KeyError` block to gracefully handle cases where the allocation has already been freed and removed from `pointer_to_data`.
Suggested change:

```python
allocations = data[0].snapshot()
for allocation in allocations:
    if allocation["allocated_size"] == 0:
        try:
            handle = self._python_free_callback(
                allocation["address"])
            unmap_and_release(handle)
        except KeyError:
            # This can happen if the allocation was already freed
            # through the normal path, but the memory pool has not
            # released the block.
            pass
```
if allocation["allocated_size"] == 0: | ||
handle = self._python_free_callback(allocation["address"]) |
The problem is on the NVIDIA side; they are not implementing what you want. And as I've said repeatedly, it's not a question of exposing `empty_cache`.
Question: Got a segmentation fault when trying to free memory inside the pool context to prevent OOM.
@jiaau your OOM might be caused by memory fragmentation. Right now the custom memory pool does not support `empty_cache`; I think you will need to have the real `empty_cache` to fully solve that.
@youkaichao Thanks a lot for your reply and the helpful explanation!
It's not a question of writing an extension, you can expose
Can we have one more line to test whether waking up the kv_cache yields the expected memory? Thanks!
@jiaau one possible solution to your memory fragmentation issue might be to allocate a very large tensor and release it at the beginning, so that later tensors can hopefully reuse this buffer. But anyway, I think for on-the-fly quantization you need to have enough memory to hold the bf16 checkpoint first; we don't do per-layer on-the-fly quantization. If you only want to pay the memory cost of the fp8 checkpoint, you need to convert the checkpoint to fp8 directly.
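A minimal sketch of that pre-allocation workaround, using plain PyTorch outside of vLLM's pool machinery; the buffer size is a placeholder and would need to roughly match the peak allocation expected later:

```python
import torch

# Hypothetical illustration of the workaround described above: reserve a
# large buffer up front, then free it, so that later tensors (e.g. the bf16
# weights and their fp8 copies) can reuse the same address range instead of
# fragmenting the pool. The 20 GiB size is a placeholder.
reserve_bytes = 20 * 1024**3
buffer = torch.empty(reserve_bytes, dtype=torch.uint8, device="cuda")
del buffer  # the freed block stays cached and can be reused by later allocations
```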
@vermouth1992 Running the example in the description: before this PR, I get `Available KV cache memory: 9.89 GiB`. We can confirm that we have more memory for the KV cache now. Testing it in CI would be quite complicated, though.
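A hedged sketch of the kind of check the reviewer is asking for, measuring free GPU memory around a sleep/wake_up cycle; the model name and the 1 GiB tolerance are placeholders, not part of this PR:

```python
import torch
from vllm import LLM

# Hypothetical test: after wake_up, free GPU memory should return to roughly
# its pre-sleep level, i.e. the KV cache is re-allocated as expected.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)

free_before, _ = torch.cuda.mem_get_info()
llm.sleep(level=1)   # offload weights to CPU, discard KV cache
llm.wake_up()        # reload weights, re-allocate KV cache
free_after, _ = torch.cuda.mem_get_info()

assert abs(free_before - free_after) < 1 * 1024**3
```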
Purpose
Solves #19855
Test Plan
Run the code, and the memory taken with quantization is close to 9 GiB, showing the benefit of online quantization.
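The example script itself lives in the PR description and is not reproduced here; below is a minimal sketch of that kind of setup, assuming the standard `LLM` entry point with on-the-fly fp8 quantization and sleep mode enabled (the model name is a placeholder). The engine log then reports the available KV cache memory after loading:

```python
from vllm import LLM

# Hypothetical reproduction of the scenario: load an unquantized bf16
# checkpoint and quantize it to fp8 on the fly, with sleep mode enabled.
# The "Available KV cache memory: ... GiB" log line shows how much memory
# is left for the KV cache after model loading.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    enable_sleep_mode=True,
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```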
Test Result