
Treeshap hypothesis tests #4671

Merged: 8 commits, Apr 13, 2022
Conversation

@RAMitchell (Contributor)

Increased test coverage for TreeExplainer, greatly expanding the set of model types tested. The new tests take around 4.8s on my machine.

Fixes #4352

New bugs found:
#4663
dmlc/treelite#375
#4670

@RAMitchell RAMitchell requested review from a team as code owners March 30, 2022 15:29
@github-actions bot added the CUDA/C++ and Cython / Python labels Mar 30, 2022
@RAMitchell (Contributor, Author)

Looks like xgboost segfaulted. Can't reproduce this locally and it doesn't seem strongly related to any of these changes.

cuml/test/explainer/test_gpu_treeshap.py::test_xgb_toy_categorical [0]	train-error:0.00000
[c8d0704a3bfc:3027 :0:3561] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x5581e5d67ff8)
==== backtrace (tid:   3561) ====
 0  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7eff71a3f3f5]
 1  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7eff71a3f791]
 2  /opt/conda/envs/rapids/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d962) [0x7eff71a3f962]
 3  /usr/lib64/libc.so.6(+0x36400) [0x7f00b1fb4400]
 4  /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libnccl.so.2(+0x5563e) [0x7effbd70d63e]
 5  /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libnccl.so.2(+0x445b9) [0x7effbd6fc5b9]
 6  /usr/lib64/libpthread.so.0(+0x7ea5) [0x7f00b2c64ea5]
 7  /usr/lib64/libc.so.6(clone+0x6d) [0x7f00b207cb0d]
=================================
Fatal Python error: Segmentation fault

Thread 0x00007efecefd5700 (most recent call first):
  File "/opt/conda/envs/rapids/lib/python3.8/threading.py", line 306 in wait
  File "/opt/conda/envs/rapids/lib/python3.8/threading.py", line 558 in wait
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/envs/rapids/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/opt/conda/envs/rapids/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f00b308e740 (most recent call first):
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/xgboost/core.py", line 1423 in __del__
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/xgboost/training.py", line 188 in train
  File "/workspace/python/cuml/test/explainer/test_gpu_treeshap.py", line 347 in test_xgb_toy_categorical

@dantegd (Member)

dantegd commented Apr 4, 2022

@RAMitchell I've seen that in another PR or nightly job, so I'm sure it's unrelated to this PR's changes, but it does look like XGB segfaulting. I'll dig through the log.

@RAMitchell (Contributor, Author)

I tried to reproduce the CI failure locally by running the test in isolation, but unfortunately couldn't reproduce it.

import numpy as np
import pandas as pd
import xgboost as xgb
from cuml.experimental.explainer.tree_shap import TreeExplainer


def test_xgb_toy_categorical():
    X = pd.DataFrame({'dummy': np.zeros(5, dtype=np.float32),
                      'x': np.array([0, 1, 2, 3, 4], dtype=np.int32)})
    y = np.array([0, 0, 1, 1, 1], dtype=np.float32)
    X['x'] = X['x'].astype("category")
    dtrain = xgb.DMatrix(X, y, enable_categorical=True)
    params = {"tree_method": "gpu_hist", "eval_metric": "error",
              "objective": "binary:logistic", "max_depth": 2,
              "min_child_weight": 0, "lambda": 0}
    xgb_model = xgb.train(params, dtrain, num_boost_round=1,
                          evals=[(dtrain, 'train')])
    explainer = TreeExplainer(model=xgb_model)
    out = explainer.shap_values(X)

    ref_out = xgb_model.predict(dtrain, pred_contribs=True)
    np.testing.assert_almost_equal(out, ref_out[:, :-1], decimal=5)
    np.testing.assert_almost_equal(explainer.expected_value, ref_out[0, -1],
                                   decimal=5)


# Run the test repeatedly to try to trigger the intermittent segfault
for i in range(1000):
    test_xgb_toy_categorical()

@trivialfis (Member)

trivialfis commented Apr 5, 2022

Have you tried address sanitizer?

@RAMitchell RAMitchell changed the base branch from branch-22.04 to branch-22.06 April 6, 2022 15:38
Comment on lines +585 to +587
n_targets = draw(st.integers(2, 5))
else:
n_targets = 1
Contributor


So n_targets means n_classes in the context of classification? Let's just use n_classes for this purpose, since it's confusing otherwise. (I was wondering if we were using an unreleased feature of XGBoost)

@RAMitchell (Contributor, Author) commented Apr 8, 2022

These tests will support multi-output regression. I am using n_targets as a more generic term.

@RAMitchell (Contributor, Author)

I am wondering if the xgboost CI failure is somehow due to nccl. I've seen this occur in different tests calling xgboost, so the failure doesn't seem tied to any particular test, only to xgboost being used. It seems to occur more often in this PR than elsewhere in CI, maybe because I increased the frequency of xgboost tests.

@RAMitchell (Contributor, Author)

Took @trivialfis's advice and tried compiling cuml with AddressSanitizer. It turns out we were deleting a derived object through a base class pointer that had no virtual destructor, so the derived class destructor was never called. Hopefully this resolves the issue.

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.06@dfc1eae).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.06    #4671   +/-   ##
===============================================
  Coverage                ?   84.04%           
===============================================
  Files                   ?      252           
  Lines                   ?    20348           
  Branches                ?        0           
===============================================
  Hits                    ?    17102           
  Misses                  ?     3246           
  Partials                ?        0           
Flag Coverage Δ
dask 44.97% <0.00%> (?)
non-dask 77.34% <0.00%> (?)

Flags with carried forward coverage won't be shown.



@trivialfis (Member)

Seems to be working?

@dantegd added the tests, improvement, and non-breaking labels Apr 13, 2022
@dantegd dantegd added this to PR-WIP in v22.06 Release via automation Apr 13, 2022
v22.06 Release automation moved this from PR-WIP to PR-Reviewer approved Apr 13, 2022
@dantegd (Member)

dantegd commented Apr 13, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b3967cf into rapidsai:branch-22.06 Apr 13, 2022
v22.06 Release automation moved this from PR-Reviewer approved to Done Apr 13, 2022
rapids-bot bot pushed a commit that referenced this pull request May 11, 2022
Stacked on #4671.

- Remove an extra redundant class in the Python layer.
- Simplify the interface between C++ and Python using variants.
- Fix #4670 by allowing double precision data
- Document TreeExplainer
- Add interventional SHAP method
- Add Shapley interactions and Taylor interactions
- Promote from experimental
- Support sklearn estimator types from xgb/lgbm (i.e. no need to convert to a booster before using TreeExplainer)

Authors:
  - Rory Mitchell (https://github.com/RAMitchell)

Approvers:
  - Philip Hyunsu Cho (https://github.com/hcho3)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #4697
Labels
CUDA/C++ · Cython / Python (Cython or Python issue) · improvement (Improvement / enhancement to an existing function) · non-breaking (Non-breaking change) · tests (Unit testing for project)
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

[BUG] RF trained with float64 segfault in conversion to Treelite object
5 participants