Hello all! I'm just getting my feet wet with explainability, and specifically with SHAP values. Thanks to Scott (Dr. Lundberg, I should say 😉) for this library and for moving the ball forward with ML explainability!
Here's my use case. My data is tabular, with <100k samples and <20 features, and my model is a neural network. We previously used random forests, but we're hoping to incorporate text in a multimodal model in the future, so we've moved to DL models. We also found that the neural network performs similarly to (perhaps slightly better than) the RF anyway. This obviously makes interpretability somewhat harder, but we're fairly committed to DL at this point.
So I've turned to SHAP values to help with understanding feature importance (I've also done some permutation importance and am going to compare the two). I'm using DeepExplainer to calculate SHAP values. Here is my code for preparing the explainer:
This runs very quickly. However, when I generate the SHAP values from this explainer for ~4k test samples, it takes about 40 minutes. Here's my code for that:
Am I doing something wrong? 40 minutes seems like a very long time when I have <20 features. I understand that this would be expected with a high number of features, but should it take so long with under 20?
If I'm not doing something "wrong", am I doing things that don't need to be done and that take up extra time? For example, I've seen this GitHub issue about how many samples you need to generate a good prior and how many explanatory samples you need for the global feature importance to be fairly accurate. That brings these optimization questions to mind:
Thanks for any insight you can lend me!
Replies: 2 comments 1 reply
-
Answer to the main part of my own question
The short answer to my own question is yes, I'm doing things that are likely unnecessarily expensive. The docs for shap.DeepExplainer clearly state that the time complexity scales linearly with the length of your background data. What is a little less clear (but now fairly obvious to me) is that this refers to the time complexity of the call that computes the SHAP values, not of constructing the explainer. When I instead use only 100 samples in my background data (instead of 4.5k), that call is much faster.
Follow-up question
This does lead me to a related question. The docs say that using 100 samples will give a good estimate of the actual expected value, and that using 1000 samples will give a very good estimate. However, I'm dealing with a domain where a difference of even 0.01 in probability matters a great deal. When I use 4.5k samples, I get expected values (after softmaxing them) of [0.45131 0.54869]. When using 1k samples, I get [0.454981 0.545019]. And when using 100, I get [0.448709 0.551291]. These are important differences.
EDIT: I am not sure if softmaxing the expected values is appropriate, but the sentiment of this question is the same.
These of course vary every time I run them, depending on the random samples I get. If I find a random sample of 100 points that gets closer to the actual expectation than others, should I use that one? Will that make my SHAP values more accurate?
To extend this question further, let's say I found a subset of samples that gave exactly the same expected value as the entire data set. Would that give me exactly the same SHAP values as if I had used the entire data set for my background? Or is the background used for far more than just the expected value? I assume it is, or else we could just pass in the expected value of the entire data set.
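One way to build intuition for the last question is a toy linear model, where SHAP values against a background set reduce to `w_i * (x_i - mean of feature i over the background)`. The sketch below is an illustration under that assumption, not the DeepExplainer implementation; the weights, the two background sets, and all function names are hypothetical. It constructs two backgrounds with the same expected model output but different feature means, and shows that the resulting SHAP values differ.

```python
# Toy linear model f(x) = W . x + B (hypothetical weights, for illustration only).
W = [2.0, -1.0]
B = 0.5

def predict(x):
    return sum(w * xi for w, xi in zip(W, x)) + B

def feature_means(background):
    n = len(background)
    return [sum(row[i] for row in background) / n for i in range(len(W))]

def expected_value(background):
    # Mean model output over the background set.
    return sum(predict(row) for row in background) / len(background)

def linear_shap(x, background):
    # Assumption: for a linear model, the SHAP value of feature i is
    # w_i * (x_i - background mean of feature i).
    means = feature_means(background)
    return [w * (xi - m) for w, xi, m in zip(W, x, means)]

# Two backgrounds chosen so that the expected output is identical (1.5),
# while the per-feature means differ: (1, 1) versus (2, 3).
background_a = [[0.0, 0.0], [2.0, 2.0]]
background_b = [[1.0, -1.0], [3.0, 7.0]]

x = [3.0, 2.0]
shap_a = linear_shap(x, background_a)
shap_b = linear_shap(x, background_b)

print("expected values:", expected_value(background_a), expected_value(background_b))
print("SHAP vs background A:", shap_a)
print("SHAP vs background B:", shap_b)
print("both sum to f(x) - E[f]:", sum(shap_a), sum(shap_b))
```

In this toy case both attributions sum to the same total, `f(x) - E[f]`, but they split it differently across features. That suggests matching the expected value alone is not enough: the background also fixes the per-feature reference point.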
-
I have simply found a pragmatic solution: running with 100, 500, 1,000, and 10,000 background points and comparing the SHAP summary plots produced. In my case, 100 seemed to perform almost the same as 10k (some features that were neighbors in importance swapped places, but they had very similar mean SHAP values, so it isn't really concerning to me).
For reference, I have about 20 features, and had ~25k samples in my explanatory set when performing these experiments. Here are the approximate runtimes for varying background set sizes (you can see the linear relationship):
Doing the same thing with the number of explanatory samples will be slightly more involved, since I believe we have domain shift in our test data. My assumption is that I will need to generate explanations using only the pre-shift data and then only the post-shift data and analyze them separately, rather than getting one set of global explanations across the domain shift. Any input is welcome, especially any that answers with more theory than experimentation, but I am marking the original thread as answered.
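The background-size comparison described above can be sketched without any ML library. This is a minimal illustration, assuming a toy linear model (for which the SHAP value of feature i is `w_i * (x_i - background mean of feature i)`) in place of the real network; the weights, the synthetic Gaussian data, and the helper names are all hypothetical. It computes mean absolute SHAP values per feature for each background size and compares the resulting importance rankings.

```python
import random

# Hypothetical weights with clearly different magnitudes, standing in for a real model.
WEIGHTS = [3.0, 1.5, 0.5]

def column_means(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(WEIGHTS))]

def mean_abs_shap(explain_rows, background_rows):
    # Assumption: linear-model SHAP, phi_i = w_i * (x_i - background mean of feature i).
    means = column_means(background_rows)
    totals = [0.0] * len(WEIGHTS)
    for x in explain_rows:
        for i, (w, xi, m) in enumerate(zip(WEIGHTS, x, means)):
            totals[i] += abs(w * (xi - m))
    return [t / len(explain_rows) for t in totals]

def ranking(importances):
    # Feature indices sorted from most to least important.
    return sorted(range(len(importances)), key=lambda i: -importances[i])

rng = random.Random(0)

def sample(n):
    # Synthetic standard-normal features (illustrative data only).
    return [[rng.gauss(0.0, 1.0) for _ in WEIGHTS] for _ in range(n)]

train = sample(10_000)
explain = sample(500)

results = {}
for size in (100, 500, 1_000, 10_000):
    background = rng.sample(train, size)
    results[size] = mean_abs_shap(explain, background)

for size, importances in results.items():
    print(size, ranking(importances), [round(v, 3) for v in importances])
```

With a real model, the same loop would wrap the explainer construction and the SHAP computation for each background size. Features whose mean absolute SHAP values are close together may still swap places between runs.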