release_available_cached_blocks() adds wrong releaseSize to totalReleased and can crash if cur is pointing to last position #159567

@ronbos

🐛 Describe the bug

In the code below, inside c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::release_available_cached_blocks(), totalReleased += (*cur)->size; is executed after *cur has been released, so the size is read from a block that is no longer valid and the wrong value is added to totalReleased. It can also crash if cur ends up pointing at pool.blocks.end(). I have observed this crash using C++ libTorch 2.7.1 with CUDA 12.8.1, with expandable_segments:true and max_split_size_mb:2048, but it is hard to hit reliably because it only crashes when the last segment is released.

The order of operations inside the if statement should be reversed, from

        if (!(*cur)->expandable_segment_) {
          release_block(*cur, context);
          totalReleased += (*cur)->size;
        }

to

        if (!(*cur)->expandable_segment_) {
          totalReleased += (*cur)->size;
          release_block(*cur, context);
        }
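
To illustrate why the order matters, here is a minimal standalone sketch, not the actual allocator code. Block, BlockSet, release_block and release_largest are simplified, hypothetical stand-ins; it assumes only that releasing a block removes it from the pool and frees it, which is what invalidates *cur. The size is read before the release, and the iterator is stepped back before the current entry is erased.

```cpp
#include <cstddef>
#include <functional>
#include <set>

// Hypothetical stand-in for the allocator's Block; only the fields used here.
struct Block {
  std::size_t size;
  bool expandable_segment;
};

// Order blocks by size, then by address so blocks of equal size can coexist.
struct BlockLess {
  bool operator()(const Block* a, const Block* b) const {
    if (a->size != b->size) return a->size < b->size;
    return std::less<const Block*>{}(a, b);
  }
};

using BlockSet = std::set<Block*, BlockLess>;

// Hypothetical release: removes the block from the pool and frees it.
// After this call the pointer, and any iterator to its entry, are invalid.
void release_block(BlockSet& pool, Block* block) {
  pool.erase(block);
  delete block;
}

// Release non-expandable blocks from largest to smallest until at least
// `target` bytes have been freed or the pool has been fully walked.
std::size_t release_largest(BlockSet& pool, std::size_t target) {
  std::size_t total_released = 0;
  if (pool.empty()) return total_released;
  auto it = pool.end();
  --it;  // start at the largest block
  while (total_released < target) {
    auto cur = it;
    const bool is_first = (cur == pool.begin());
    if (!is_first) --it;  // step away before cur can be invalidated
    if (!(*cur)->expandable_segment) {
      total_released += (*cur)->size;  // read size while the block is alive
      release_block(pool, *cur);       // cur must not be used after this
    }
    if (is_first) break;
  }
  return total_released;
}
```

With the two statements swapped, (*cur)->size would dereference an iterator whose element was just erased and whose block was just deleted, which is undefined behavior and may read garbage or crash.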

Versions

N/A (using libTorch release)

cc @ptrblck @msaroufim @eqy @jerryzh168

Labels: actionable, module: CUDACachingAllocator, module: cuda, triaged