Description
I built libmscclpp_nccl.so and nccl-tests inside the Docker image ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4.
I compiled libmscclpp_nccl.so with the command:
cd mscclpp/build && cmake -DCMAKE_BUILD_TYPE=Release .. -DMSCCLPP_BUILD_TESTS=OFF -DMSCCLPP_BUILD_PYTHON_BINDINGS=OFF && make -j
I compiled nccl-tests with the command:
make MPI=1 MPI_HOME=/usr/local/mpi NCCL_HOME=/usr/lib/x86_64-linux-gnu/
Then I ran nccl-tests/all_reduce_perf with libmscclpp_nccl.so preloaded on a single H100 node (8 GPUs) as follows:
mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/build/apps/nccl/libmscclpp_nccl.so ./build/all_reduce_perf -b 1k -e 16G -d bfloat16 -f 2 -g 1 -t 1 -c 0 -n 20 -w 20
I got the following error (output from several ranks is interleaved):
what(): terminate called after throwing an instance of 'mscclpp::Error'
what(): This NVLS connection cannot map the requested devBuffSize (Mscclpp failure: InvalidUsage)terminate called after throwing an instance of 'mscclpp::Error'
what(): This NVLS connection cannot map the requested devBuffSize (Mscclpp failure: InvalidUsage)
terminate called after throwing an instance of 'mscclpp::Error'
This NVLS connection cannot map the requested devBuffSize (Mscclpp failure: InvalidUsage)
When I add -x MSCCLPP_FORCE_DISABLE_NVLS=true to disable NVLS, the test runs successfully on a single H100 node with 8 GPUs,
but on a single H100 node with 4 GPUs I get the following error:
Test CUDA failure common.cu:306 'an illegal memory access was encountered'
terminate called after throwing an instance of 'mscclpp::CudaError'
what(): Call to cudaThreadExchangeStreamCaptureMode(&mode_) failed. /mscclpp/src/gpu_utils.cc:11 (Cuda failure: driver shutting down)
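
For easier reproduction, here is a minimal dry-run sketch of the three command lines discussed above. It only prints the commands and does not launch anything. The helper name build_cmd is made up for illustration, and the 4-GPU line assumes that only -np changes relative to the 8-GPU command, which is not stated explicitly above.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the mpirun command line, does not run it.
# Usage: build_cmd NGPUS DISABLE_NVLS   (DISABLE_NVLS is "true" or "false")
build_cmd() {
  cmd="mpirun -np $1 --bind-to numa --allow-run-as-root"
  # Single quotes keep $MSCCLPP_BUILD literal, so it expands only when the
  # printed command is pasted into a shell where that variable is set.
  cmd="${cmd}"' -x LD_PRELOAD=$MSCCLPP_BUILD/build/apps/nccl/libmscclpp_nccl.so'
  if [ "$2" = "true" ]; then
    cmd="${cmd} -x MSCCLPP_FORCE_DISABLE_NVLS=true"
  fi
  cmd="${cmd} ./build/all_reduce_perf -b 1k -e 16G -d bfloat16 -f 2 -g 1 -t 1 -c 0 -n 20 -w 20"
  printf '%s\n' "${cmd}"
}

build_cmd 8 false  # NVLS enabled, 8 GPUs: fails with the devBuffSize error
build_cmd 8 true   # NVLS disabled, 8 GPUs: runs successfully
build_cmd 4 true   # NVLS disabled, 4 GPUs (assumed -np 4): illegal memory access
```

Each printed line can be copied into the container shell from the nccl-tests directory with MSCCLPP_BUILD set, so the three cases differ only in -np and the NVLS flag.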