Closed
Labels: actionable · module: CUDACachingAllocator · module: cuda (related to torch.cuda and CUDA support in general) · triaged
Description
🐛 Describe the bug
In the code below, inside `c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::release_available_cached_blocks()`, `totalReleased += (*cur)->size;` is executed after `*cur` has been released, and can crash if `cur` ends up pointing at `pool.blocks.end()`. I have observed this crash using C++ libTorch 2.7.1 with CUDA 12.8.1, with `expandable_segments:True` and `max_split_size_mb:2048`, but it is hard to hit reliably because it only crashes when the last segment is released.
The order of the two statements inside the `if` body should be reversed, from
```cpp
if (!(*cur)->expandable_segment_) {
  release_block(*cur, context);
  totalReleased += (*cur)->size;
}
```
to
```cpp
if (!(*cur)->expandable_segment_) {
  totalReleased += (*cur)->size;
  release_block(*cur, context);
}
```
Versions
N/A (using libTorch release)