[BUG] RandomForest throws exception when setting n_streams=2 #5402

Open
wbo4958 opened this issue May 4, 2023 · 1 comment
Labels
? - Needs Triage (Need team to review and classify), bug (Something isn't working)


wbo4958 commented May 4, 2023

Describe the bug
The spark-rapids-ml RandomForestClassifier/Regressor, which is built on cuml, throws an exception when n_streams is set to 2 (or any value > 1) and two processes run on the same node, each using a different GPU. Note that there is no issue when the two processes run on different nodes.

The exception looks like this:

/home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/api_decorators.py:188: UserWarning: To use pickling first train using float32 data to fit the estimator
  ret = func(*args, **kwargs)
terminate called after throwing an instance of 'raft::cuda_error'
  what():  CUDA error encountered at: file=/opt/conda/conda-bld/work/cpp/src/decisiontree/batched-levelalgo/builder.cuh line=328: call='cudaMemsetAsync(done_count, 0, sizeof(int) * max_batch * n_col_blks, builder_stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 8 stack frames
#0 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x84) [0x7f2860bf1fc4]
#1 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7f2860bf2a1d]
#2 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML2DT7BuilderINS0_21GiniObjectiveFunctionIdiiEEE15assignWorkspaceEPcS5_+0x2f1) [0x7f2861839211]
#3 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML2DT7BuilderINS0_21GiniObjectiveFunctionIdiiEEEC1ERKN4raft8handle_tEP11CUstream_stimRKNS0_18DecisionTreeParamsEPKdPKiiiPN3rmm14device_uvectorIiEEiRKNS0_9QuantilesIdiEE+0x2cc) [0x7f28618397bc]
#4 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(+0x10f78b4) [0x7f286186b8b4]
#5 in /home/xxx/anaconda3/envs/rapids-23.04/lib/python3.10/site-packages/cuml/internals/../../../.././libgomp.so.1(+0x177f0) [0x7f285663a7f0]
#6 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f28d247c609]
#7 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f28d223d133]

Steps/Code to reproduce bug
Set up the environment by following the spark-rapids-ml guidelines.

Open a terminal on a node with 2 GPUs and run:

pyspark --master local[2]

Then paste the code below into the pyspark shell:

from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
    (1.0, Vectors.dense(1.0)),
    (0.0, Vectors.sparse(1, [], []))] * 100, ["label", "features"])
from pyspark.ml.feature import StringIndexer
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(df)
td = si_model.transform(df)
from spark_rapids_ml.classification import RandomForestClassifier, RandomForestClassificationModel
rf = RandomForestClassifier(num_workers=2, n_streams=12, numTrees=20, maxDepth=2, labelCol="indexed", seed=42)
model: RandomForestClassificationModel = rf.fit(td)
model.transform(df).show()
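
Since the crash is reported only for n_streams values greater than 1, a possible stopgap is to pin n_streams to 1. The sketch below reuses the parameters from the repro above with only n_streams changed. It assumes, without this having been confirmed here, that n_streams=1 avoids the error; `RF_PARAMS` and `build_classifier` are hypothetical names introduced for illustration.

```python
# Sketch only: same estimator settings as the repro, with n_streams forced to 1.
# Assumption: n_streams=1 does not trigger the crash (the report says values > 1 fail).
RF_PARAMS = {
    "num_workers": 2,
    "n_streams": 1,  # the repro uses a value > 1 here
    "numTrees": 20,
    "maxDepth": 2,
    "labelCol": "indexed",
    "seed": 42,
}


def build_classifier(params=None):
    """Create the spark-rapids-ml classifier from a parameter dict.

    The import is done lazily so this snippet can be loaded on a machine
    where spark_rapids_ml is not installed.
    """
    from spark_rapids_ml.classification import RandomForestClassifier

    return RandomForestClassifier(**(RF_PARAMS if params is None else params))
```

In the pyspark shell from the steps above, `build_classifier().fit(td)` would take the place of the `rf.fit(td)` call.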

Expected behavior
No exception is thrown

Environment details:

  • Environment location: [Bare-metal]
  • Linux Distro/Architecture: [Ubuntu 20.04 amd64]
  • GPU Model/Driver: [TITAN RTX and driver 495.29.05]
  • CUDA: [11.5]
  • Method of cuDF & cuML install: [conda, rapids-23.04]
wbo4958 added the "? - Needs Triage" and "bug" labels on May 4, 2023

wbo4958 commented Mar 27, 2024

This issue still exists in cuml 24.02.
