Am I doing something wrong to cause DeepExplainer to run slowly in low-dimension space? #1494

jaxondk · 2020-10-03T01:10:55Z


Hello all!

I'm just getting my feet wet with explainability and specifically with SHAP values. Thanks to Scott (Dr. Lundberg, I should say 😉) for this library and for moving the ball forward with ML explainability!

Here's my use case. My data is tabular, with <100k samples and <20 features, and I use a neural network for my model. We previously used random forests, but we're hoping to incorporate text in a multimodal model in the future, so we've moved to DL models. We also found that the neural network performs similarly to (perhaps slightly better than) the RF anyway. This obviously makes interpretability slightly more difficult, but we're fairly committed to DL at this point. So, I've turned to SHAP values to help with understanding feature importance (I've also done some permutation importance, and am going to compare them).

I'm using DeepExplainer to calculate SHAP values. Here is my code for preparing the explainer:

import shap, torch
background_df = x_train.sample(frac=.10) # This is about 4.5k samples for calculating the expectation
background_norm = util.normalize(background_df.values, background_df.mean(), background_df.std())
background = torch.Tensor(background_norm)
explainer = shap.DeepExplainer(model, background)

This runs very quickly. However, when I generate the shap values from this explainer for ~4k test samples, it takes about 40 minutes. Here's my code for that:

x_test = torch.Tensor(util.normalize(x_test_df.values, x_test_df.mean(), x_test_df.std()))
shap_values = explainer.shap_values(x_test)

Am I doing something wrong? 40 minutes seems like a very long time when I have <20 features. I understand that if I had a high number of features this is to be expected, but with under 20 features should it be taking so long?

If I'm not doing something "wrong", am I doing unnecessary things that take up extra time? For example, I've seen this GitHub issue about how many samples you need to generate a good prior and how many explanatory samples you need for the global feature importance to be fairly accurate. That brings these optimization questions to mind:

1. Does using more samples for the prior make the SHAP value calculation take longer? Most examples show using 100 samples for the background; I'm using 4.5k. Note that the expectation calculation doesn't take long with that many, but does it make the shap_values calculation take longer for some reason?
2. Do I not need to be calculating SHAP values for so many explanatory samples? 4k seems like a reasonable number to be calculating, and on the issue I linked above, Scott says he typically uses 1k-10k samples for visualizations.

Thanks for any insight you can lend me!


jaxondk · 2020-10-07T23:08:48Z

Author

Answer to the main part of my own question

The short answer to my own question is, yes I'm doing things that are likely unnecessarily expensive.

The docs for shap.DeepExplainer clearly state that the time complexity scales linearly with the length of your background data. What is a little less clear (but now fairly obvious to me) is that it is the call to explainer.shap_values, not the construction of the explainer, whose runtime scales linearly with len(background).

When I instead only use 100 samples in my background data (instead of 4.5k), calling explainer.shap_values takes about 1 minute instead of 40. Like they said, it scales linearly 😂
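As a rough sanity check on that scaling, the arithmetic can be sketched in a few lines. The function name `extrapolate_minutes` is made up for illustration, and strict proportionality with no fixed overhead is an assumption; only the 40-minute, 4.5k and 100 figures come from this thread.

```python
def extrapolate_minutes(known_minutes, known_background, new_background):
    """Rescale a measured runtime in proportion to the background size.

    Assumes runtime is proportional to len(background) with no fixed
    overhead, which is a simplification.
    """
    return known_minutes * new_background / known_background


# 40 minutes with ~4.5k background samples, rescaled to 100 samples
estimate = extrapolate_minutes(40, 4500, 100)
print(f"{estimate:.2f} minutes")
```

The estimate comes out a little under one minute, which is in line with the roughly 1 minute observed.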

Follow-up question

This does lead me to a related question. The docs say that using 100 samples will give a good estimate of the actual expected value, and that using 1000 samples will give a very good estimate of the expected value.

However, I'm dealing with a domain where a difference of even 0.01 in probability matters a lot. When I use 4.5k samples, I get expected values (after softmaxing them) of [0.45131 0.54869]. When using 1k samples, I get [0.454981 0.545019]. And when using 100, I get [0.448709 0.551291]. These are important differences. EDIT: I am not sure if softmaxing the expected values is appropriate, but the sentiment of this question is the same.
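On the softmax caveat in the EDIT: if the model outputs logits (an assumption about this setup), then the expected value is an average taken in logit space, and the softmax of an average is generally not the same as the average of the softmaxed outputs. A minimal sketch with made-up logits, not data from this thread:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two made-up background samples with two-class logit outputs
background_logits = [[0.0, 0.0], [4.0, 0.0]]
n = len(background_logits)

# Average the logits first, then apply softmax
mean_logits = [sum(row[k] for row in background_logits) / n for k in range(2)]
softmax_of_mean = softmax(mean_logits)

# Apply softmax per sample first, then average the probabilities
probs = [softmax(row) for row in background_logits]
mean_of_softmax = [sum(p[k] for p in probs) / n for k in range(2)]

print("softmax of mean logits:", softmax_of_mean)
print("mean of softmaxed outputs:", mean_of_softmax)
```

The two results differ noticeably for these inputs, so softmaxed expected values should probably be compared with some care.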

These of course vary every time I run them, depending on the random samples I get. If I find a random sample of 100 points that gets closer to the actual expectation than others, should I use that one? Will that make my SHAP values more accurate? To extend this question further, let's say I found a subset of samples that got the exact same expected value as using the entire data set. Would that give me the exact same SHAP values as if I had used the entire data set for my background? Or is the background used for far more than just the expected value? I assume it is, or else we could just pass in the expected value of the entire data set.
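The run-to-run variation itself can be illustrated with a small simulation. This is only a sketch on synthetic numbers: the Gaussian "model outputs", the pool size of 45,000 and the helper name `spread_of_background_mean` are all assumptions, not anything measured in this thread. It shows how much the mean over a random background set moves around for two background sizes.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic stand-in for per-sample model outputs on the training set
outputs = [rng.gauss(0.2, 1.0) for _ in range(45000)]


def spread_of_background_mean(n_background, n_repeats=200):
    """Std. dev. of the background mean across repeated random draws."""
    means = [
        statistics.fmean(rng.sample(outputs, n_background))
        for _ in range(n_repeats)
    ]
    return statistics.stdev(means)


spread_100 = spread_of_background_mean(100)
spread_1000 = spread_of_background_mean(1000)
print("spread with 100 background points:", spread_100)
print("spread with 1000 background points:", spread_1000)
```

The spread shrinks roughly with the square root of the background size, so going from 100 to 1,000 points buys about a threefold reduction in noise at about ten times the runtime.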

1 reply

jaxondk Oct 20, 2020
Author

@slundberg Any input on this follow-up question?

Also, do your comments found here apply similarly to DeepExplainer? I assume that using a single reference point for the prior for DeepExplainer is a bad idea, correct? How would you suggest I go about determining how many training points to use for calculating the background expectation?

As an aside, is there someone else I should be tagging in questions like this? I presume you get tagged in far more things than you have time to answer.

Many thanks for the hard work with shap - I've thoroughly enjoyed discovering it and using it!

jaxondk · 2020-10-21T16:56:11Z

Author

I have simply found a pragmatic solution by running with 100, 500, 1,000, and 10,000 background points and comparing the SHAP summary plots produced. In my case, 100 seemed to perform almost the same as 10k (some features that were neighbors in importance swapped places, but they had very similar mean SHAP values, so it isn't really concerning to me).

For reference, I have about 20 features, and had ~25k samples in my explanatory set when performing these experiments. Here are the approximate runtimes for varying background set sizes (you can see the linear relationship):

100: 8.5 min
500: 33 min
1,000: 69 min
10,000: 648 min
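To make the linear relationship a bit more concrete, here is a small sketch that computes the cost per background point for each run and the slope of a best-fit line through the origin. Fitting through the origin (no fixed overhead) is an assumption; the four data points are the runtimes listed above.

```python
# (background size, runtime in minutes) as reported above
runs = [(100, 8.5), (500, 33), (1000, 69), (10000, 648)]

# Minutes spent per background point in each run
per_sample = {n: minutes / n for n, minutes in runs}

# Least-squares slope of a line through the origin: sum(x*y) / sum(x*x)
slope = sum(n * m for n, m in runs) / sum(n * n for n, _ in runs)

for n, cost in per_sample.items():
    print(f"{n:>6} background points: {cost:.4f} min per point")
print(f"fitted slope: {slope:.4f} min per background point")
```

The per-point cost is fairly stable across the three larger runs, while the 100-point run sits somewhat higher, which would be consistent with a fixed overhead that the through-origin fit ignores.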

Doing the same thing with the number of explanatory samples will be slightly more involved, since I believe we have domain shift in our test data. My assumption is that I will need to generate explanations using only the pre-shift data and then only the post-shift data and analyze them separately, rather than getting one set of global explanations across the domain shift.

Any input is welcome, especially any that uses more theory to answer than experimentation, but I am marking the original thread as answered.

0 replies
