[Bug] RegisteredMemory not properly destroyed #533

Open
@liangyuRain

Hi, the following code causes a GPU out-of-memory (OOM) error on Hopper with NVLS enabled. I am using the latest main branch.

from mscclpp import Transport, TcpBootstrap, Communicator
from mscclpp._mscclpp import Context, RawGpuBuffer
import cupy as cp

cp.cuda.Device(0).use()
# Single-process setup: rank 0 of 1.
bootstrap = TcpBootstrap.create(0, 1)
bootstrap.initialize(bootstrap.create_unique_id(), 60)
comm = Communicator(bootstrap)
# Each iteration allocates 1 GiB, registers it, then drops both references.
for i in range(100):
    if i % 10 == 0:
        print(f"{i=}", flush=True)
    mem = RawGpuBuffer(2 ** 30)
    reg = comm.register_memory(mem.data(), mem.bytes(), Transport.CudaIpc)
    del reg, mem

Output:

i=0
i=10
i=20
i=30
i=40
i=50
i=60
i=70
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
mscclpp._mscclpp.CuError: (2, 'Call to result failed./.../mscclpp/src/gpu_utils.cc:128 (Cu failure: out of memory)')

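The same loop runs to completion if the memory is not registered, so the buffers appear to stay alive only when `register_memory` is called. Could you please check whether this can be reproduced on your side?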
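To confirm that memory is lost on every iteration, and not only at the point of failure, free device memory could be sampled before and after each loop body. The following is a minimal sketch, not part of mscclpp: `free_memory_deltas` and `FakeDevice` are hypothetical names introduced here, and `FakeDevice` is only a stand-in that mimics a 1 GiB leak per step so the helper can be run without a GPU.

```python
from typing import Callable, List


def free_memory_deltas(
    step: Callable[[], None],
    get_free_bytes: Callable[[], int],
    iterations: int,
) -> List[int]:
    """Run `step` repeatedly and return the free bytes lost in each iteration.

    A positive value means free memory shrank during that iteration.
    """
    deltas = []
    for _ in range(iterations):
        before = get_free_bytes()
        step()
        after = get_free_bytes()
        deltas.append(before - after)
    return deltas


class FakeDevice:
    """Hypothetical stand-in for a GPU whose allocations are never released."""

    def __init__(self, total_bytes: int) -> None:
        self.free = total_bytes

    def leaky_step(self) -> None:
        # Mimics allocating 1 GiB that is not returned to the device.
        self.free -= 2 ** 30


device = FakeDevice(8 * 2 ** 30)
deltas = free_memory_deltas(device.leaky_step, lambda: device.free, 3)
print([d // 2 ** 20 for d in deltas], "MiB lost per iteration")
```

On a real device, `get_free_bytes` could be `lambda: cp.cuda.runtime.memGetInfo()[0]` (CuPy returns a `(free, total)` tuple), and `step` could be a function wrapping the allocate/register body of the loop above. If the registered buffers were released correctly, the per-iteration deltas would be expected to stay near zero after the first few iterations.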
