ESoC 2026 Project Ideas
SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP is used by millions of users, and your contributions could significantly improve their experience.
Currently, 15 functions use numba's njit decorator for just-in-time compilation.
This works reasonably well, but we have recently had problems with numba, namely on our macOS CI jobs and with delays in supporting new Python versions. We therefore want to move these functions to C++ via nanobind, matching the speed of the numba versions, so that SHAP can drop its numba/llvmlite dependency. The port needs thorough testing to make sure no memory errors are introduced.
Furthermore, we have one Cython function, which should be moved to C++ as well.
The goal is to move all of the following functions to C++:
- _compute_grey_code_row_values
- _compute_grey_code_row_values_st
- lower_credit
- _single_delta_mask
- _delta_masking
- _jit_build_partition_tree
- _pt_shuffle_rec
- delta_minimization_order
- _reverse_window
- _reverse_window_score_gain
- _mask_delta_score
- _build_fixed_single_output
- _build_fixed_multi_output
- _init_masks
- _rec_fill_masks
- _exp_val (Cython)
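One way to check that a ported function behaves like the version it replaces is a small parity and timing harness. The sketch below is only an illustration in plain Python: `check_parity`, `relative_speed`, and the two cumulative-sum stand-ins are hypothetical names, not part of SHAP. In practice the reference would be the existing numba-compiled function and the candidate the new nanobind-backed one.

```python
import math
import timeit
from itertools import accumulate
from typing import Callable, Iterable, Sequence


def check_parity(reference: Callable, candidate: Callable,
                 cases: Iterable[Sequence], rel_tol: float = 1e-9) -> None:
    """Raise AssertionError if candidate disagrees with reference on any case."""
    for args in cases:
        expected = reference(*args)
        actual = candidate(*args)
        assert len(expected) == len(actual), f"length mismatch for {args!r}"
        for e, a in zip(expected, actual):
            assert math.isclose(e, a, rel_tol=rel_tol, abs_tol=1e-12), (
                f"mismatch for {args!r}: {e} != {a}"
            )


def relative_speed(reference: Callable, candidate: Callable,
                   args: Sequence, number: int = 200) -> float:
    """Return candidate_time / reference_time (below 1.0 means faster)."""
    t_ref = timeit.timeit(lambda: reference(*args), number=number)
    t_new = timeit.timeit(lambda: candidate(*args), number=number)
    return t_new / t_ref


# Hypothetical stand-ins for "old" and "new" implementations of one function.
def reference_cumsum(values):
    out, total = [], 0.0
    for v in values:
        total += v
        out.append(total)
    return out


def candidate_cumsum(values):
    return list(accumulate(values))


if __name__ == "__main__":
    cases = [([1.0, 2.0, 3.0],), ([],), ([0.5] * 100,)]
    check_parity(reference_cumsum, candidate_cumsum, cases)
    ratio = relative_speed(reference_cumsum, candidate_cumsum, ([0.5] * 1000,))
    print(f"candidate/reference time ratio: {ratio:.2f}")
```

A harness like this only covers output parity and speed. Memory errors in the C++ code would need separate tooling, which this sketch does not attempt to show.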
Optionally, we could move the existing code in _cext (C API with PyObject overhead) to nanobind as well to provide a more modern approach for C++ extensions.
Expected Time: 200-250 hours
Difficulty Rating: Hard
Required Skills: C++, Python
Potential Mentors: Tobias Pitters
SHAP currently supports many small tree libraries explicitly (see here). We have previously had problems with several of these libraries that effectively blocked our CI; see this issue for further details. There is even interest in adding more models (see here).
Currently SHAP creates a TreeEnsemble from all those different models, which is then used to calculate the SHAP values efficiently. To avoid supporting an ever-growing list of libraries, we plan to provide a TreeEnsemble.from_trees classmethod with a well-defined and documented interface, so that users or package maintainers can generate the TreeEnsemble object themselves and compute SHAP values from it.
The initial plan is therefore to prototype a TreeEnsemble.from_trees classmethod that generates the object SHAP needs to explain the model. Once this works, we want an exhaustive test suite covering all currently supported models with this new behaviour. We'll target scikit-learn's tree structure, see here:
https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html.
Finally, adding DeprecationWarnings to a list of models (to be defined) will round out the feature.
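To make the idea more concrete, here is a minimal sketch of what such an interface could look like. It is not SHAP's actual API: `SketchTree`, `SketchTreeEnsemble`, and the dictionary layout are assumptions made for illustration. The array names (`children_left`, `children_right`, `feature`, `threshold`, `value`), the -1 leaf marker, and the "go left if the feature value is <= threshold" rule follow scikit-learn's tree structure. As a simplification, `value` holds one scalar per node here, whereas scikit-learn stores a multi-dimensional array.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

TREE_LEAF = -1  # scikit-learn marks a missing child with -1


@dataclass
class SketchTree:
    """One tree in scikit-learn's parallel-array layout (hypothetical class)."""
    children_left: List[int]
    children_right: List[int]
    feature: List[int]
    threshold: List[float]
    value: List[float]  # simplification: one scalar prediction per node

    def predict_one(self, x: Sequence[float]) -> float:
        node = 0
        # Walk from the root until a leaf is reached.
        while self.children_left[node] != TREE_LEAF:
            if x[self.feature[node]] <= self.threshold[node]:
                node = self.children_left[node]
            else:
                node = self.children_right[node]
        return self.value[node]


@dataclass
class SketchTreeEnsemble:
    """Hypothetical stand-in for the planned TreeEnsemble object."""
    trees: List[SketchTree]

    @classmethod
    def from_trees(cls, tree_dicts: Sequence[Dict[str, list]]) -> "SketchTreeEnsemble":
        required = ("children_left", "children_right", "feature", "threshold", "value")
        trees = []
        for i, d in enumerate(tree_dicts):
            missing = [k for k in required if k not in d]
            if missing:
                raise ValueError(f"tree {i} is missing keys: {missing}")
            n_nodes = len(d["children_left"])
            if any(len(d[k]) != n_nodes for k in required):
                raise ValueError(f"tree {i}: every array needs one entry per node")
            trees.append(SketchTree(**{k: list(d[k]) for k in required}))
        return cls(trees)

    def predict_one(self, x: Sequence[float]) -> float:
        # Additive ensemble: sum the leaf values of all trees.
        return sum(t.predict_one(x) for t in self.trees)


# A single decision stump: the root splits on feature 0 at 0.5.
stump = {
    "children_left": [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature": [0, -2, -2],         # -2 marks "undefined" at leaves
    "threshold": [0.5, -2.0, -2.0],
    "value": [2.0, 1.0, 3.0],
}
ensemble = SketchTreeEnsemble.from_trees([stump, stump])
```

Validating the input up front, as `from_trees` does here, is one possible design choice: it gives package maintainers a clear error when their export does not match the expected layout, rather than a failure deep inside the SHAP value computation.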
Goals:
- prototype a TreeEnsemble.from_trees classmethod
- thoroughly test the method against all currently supported tree models
- build a notebook showing how to calculate SHAP values from currently supported models using this new functionality
- deprecate a list of currently supported models (smaller libraries like gpboost, ngboost, etc.)
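The deprecation step could be as small as the following sketch. The helper name `warn_if_deprecated` and the contents of `DEPRECATED_MODEL_MODULES` are hypothetical; the actual list of models is still to be defined, and the two entries below are only the examples named above.

```python
import warnings

# Hypothetical list: the final set of deprecated libraries is not yet decided.
DEPRECATED_MODEL_MODULES = ("gpboost", "ngboost")


def warn_if_deprecated(model) -> bool:
    """Emit a DeprecationWarning if the model comes from a deprecated library."""
    # Top-level package that defines the model's class, e.g. "ngboost".
    module = type(model).__module__.split(".")[0]
    if module in DEPRECATED_MODEL_MODULES:
        warnings.warn(
            f"Built-in support for {module} models is deprecated; "
            "consider building the ensemble via TreeEnsemble.from_trees instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's code
        )
        return True
    return False
```

Python hides DeprecationWarning by default outside of `__main__` and test runners, so tests for this behaviour would need to capture warnings explicitly, for example with `warnings.catch_warnings(record=True)`.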
Expected Time: 200 hours
Difficulty Rating: Medium
Required Skills: Python, familiarity with tree-based ML models and scikit-learn
Potential Mentors: Tobias Pitters