[NCCL over MSCCL++] Mscclpp failure: InvalidUsage 'This NVLS connection cannot map the requested devBuffSize' when test single node H100 (8GPUs)

I built libmscclpp_nccl.so and nccl-tests in the Docker image ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4.

I compiled libmscclpp_nccl.so with:
`cd mscclpp/build && cmake -DCMAKE_BUILD_TYPE=Release .. -DMSCCLPP_BUILD_TESTS=OFF -DMSCCLPP_BUILD_PYTHON_BINDINGS=OFF && make -j`
   
I compiled nccl-tests with:
`make MPI=1 MPI_HOME=/usr/local/mpi NCCL_HOME=/usr/lib/x86_64-linux-gnu/`

Then I ran nccl-tests' all_reduce_perf with libmscclpp_nccl.so preloaded on a single H100 node (8 GPUs):
`mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/build/apps/nccl/libmscclpp_nccl.so ./build/all_reduce_perf -b 1k -e 16G -d bfloat16 -f 2 -g 1 -t 1 -c 0 -n 20 -w 20`

The run fails with the following error (output from several ranks is interleaved):
what():  terminate called after throwing an instance of 'mscclpp::Error'
  what():  This NVLS connection cannot map the requested devBuffSize (Mscclpp failure: InvalidUsage)terminate called after throwing an instance of 'mscclpp::Error'
  what():  This NVLS connection cannot map the requested devBuffSize (Mscclpp failure: InvalidUsage)
terminate called after throwing an instance of 'mscclpp::Error'
This NVLS connection cannot map the requested devBuffSize (Mscclpp failure: InvalidUsage)


When I add `-x MSCCLPP_FORCE_DISABLE_NVLS=true` to disable NVLS, the test runs successfully on a single H100 node with 8 GPUs.
When I test with 4 GPUs on a single H100 node, however, I get the following error:
Test CUDA failure common.cu:306 'an illegal memory access was encountered'
    terminate called after throwing an instance of 'mscclpp::CudaError'
    what():  Call to cudaThreadExchangeStreamCaptureMode(&mode_) failed. /mscclpp/src/gpu_utils.cc:11 (Cuda failure: driver shutting down)
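
For reference, the 8-GPU workaround run is the original invocation with one extra `-x` flag. The sketch below only prints the command rather than running it; the `/path/to/mscclpp` fallback is a placeholder used when `MSCCLPP_BUILD` is unset, not a real path.

```shell
# Same invocation as the failing 8-GPU run, plus the NVLS override.
# /path/to/mscclpp is a placeholder; set MSCCLPP_BUILD to your checkout.
cmd="mpirun -np 8 --bind-to numa --allow-run-as-root \
-x LD_PRELOAD=${MSCCLPP_BUILD:-/path/to/mscclpp}/build/apps/nccl/libmscclpp_nccl.so \
-x MSCCLPP_FORCE_DISABLE_NVLS=true \
./build/all_reduce_perf -b 1k -e 16G -d bfloat16 -f 2 -g 1 -t 1 -c 0 -n 20 -w 20"

# Dry run: print the command instead of executing it.
echo "$cmd"
```

To actually launch it, run the printed command from the nccl-tests directory.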


[NCCL over MSCCL++] Mscclpp failure: InvalidUsage 'This NVLS connection cannot map the requested devBuffSize' when test single node H100 (8GPUs) #554
