Describe the bug
The spark-rapids-ml RandomForestClassifier/Regressor, which is built on cuML, throws an exception when n_streams=2 (or any value > 1) is set and two processes run on the same node, each using a different GPU. There is no issue when the two processes run on different nodes.
The exception is shown below:
/home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/api_decorators.py:188: UserWarning: To use pickling first train using float32 data to fit the estimator
  ret = func(*args, **kwargs)
terminate called after throwing an instance of 'raft::cuda_error'
  what(): CUDA error encountered at: file=/opt/conda/conda-bld/work/cpp/src/decisiontree/batched-levelalgo/builder.cuh line=328: call='cudaMemsetAsync(done_count, 0, sizeof(int) * max_batch * n_col_blks, builder_stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 8 stack frames
#0 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x84) [0x7f2860bf1fc4]
#1 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7f2860bf2a1d]
#2 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML2DT7BuilderINS0_21GiniObjectiveFunctionIdiiEEE15assignWorkspaceEPcS5_+0x2f1) [0x7f2861839211]
#3 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML2DT7BuilderINS0_21GiniObjectiveFunctionIdiiEEEC1ERKN4raft8handle_tEP11CUstream_stimRKNS0_18DecisionTreeParamsEPKdPKiiiPN3rmm14device_uvectorIiEEiRKNS0_9QuantilesIdiEE+0x2cc) [0x7f28618397bc]
#4 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(+0x10f78b4) [0x7f286186b8b4]
#5 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../.././libgomp.so.1(+0x177f0) [0x7f285663a7f0]
#6 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f28d247c609]
#7 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f28d223d133]
Steps/Code to reproduce bug
Follow the spark-rapids-ml guidelines
Open a terminal on the node with 2 GPUs and enter
Then paste the code below into the terminal
Expected behavior
No exception is thrown
Environment details (please complete the following information):