
how to extract the most important feature names? #632

Open · fyears opened this issue Jun 6, 2019 · 29 comments

@fyears commented Jun 6, 2019

We can visualize the feature importance by calling summary_plot, however it only outputs the plot, not any text.

Is there any way to output the feature names sorted by their importance as defined by shap_values?

Something like:

def get_feature_importance(shap_values_matrix):
    ... ???
    return np.array([...])

x = get_feature_importance(shap_values_matrix)
# x == np.array(['feature_1', 'feature_2', ... 'feature_n'])

Thanks.

@slundberg (Collaborator)

The array returned by shap_values is parallel to the data array you explained the predictions on, meaning it is the same shape as the data matrix you apply the model to. That means the names of the features for each column are the same as for your data matrix. If you have those names around somewhere as a list you can pass them to summary_plot using the feature_names argument.
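
For example (a minimal sketch of the above; X_array and column_names are assumed names for the NumPy matrix you explained and its matching column labels — with a pandas DataFrame the names are picked up automatically):

import shap

# feature_names maps each column of the explained matrix to a readable label
shap.summary_plot(shap_values, X_array, feature_names=column_names, plot_type="bar")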

@fyears (Author) commented Jun 10, 2019

I understand that shap_values_matrix.shape == train_data_matrix.shape.

clarifying

My issue is about getting the names of the features and their shap values, instead of visualizing them.

example

Take the demo in readme.md for example, we know the summary_plot plots the following figure:

shap.summary_plot(shap_values, X, plot_type="bar")

[boston summary_plot bar chart]

However, I am wondering, is there any way to get the names of the features ordered by importance:

# is there any function working like get_feature_importance? Or how to implement it?
shap.get_feature_importance(shap_values, X) == np.array(['LSTAT', 'RM', 'CRIM', ... 'CHAS']) 

Or even better, output the numeric values of each feature:

# is there any function working like get_feature_importance_2? Or how to implement it?
shap.get_feature_importance_2(shap_values, X) == {
  'LSTAT': 2.6,
  'RM': 1.7,
  ...,
  'CHAS': 0.0
}

Thank you so much.

@slundberg (Collaborator)

Ah. Well the numbers for the bar chart are just np.abs(shap_values).mean(0), so if you take X.columns[np.argsort(np.abs(shap_values).mean(0))] you should get the feature names in order (but note I didn't run that code to test it). You could then zip that up into a dictionary if you like.
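
Spelled out (a minimal sketch of the suggestion above; np.argsort sorts ascending, so it is reversed here to put the most important feature first — this assumes shap_values is a plain 2-D array and X is the matching DataFrame):

import numpy as np

mean_abs = np.abs(shap_values).mean(0)      # the bar-chart heights, one per feature
order = np.argsort(mean_abs)[::-1]          # indices from most to least important

sorted_names = list(X.columns[order])
importance = dict(zip(sorted_names, mean_abs[order]))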

@thoo commented Jul 3, 2019

I usually do this to get feature importance.

vals= np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns,vals)),columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'],ascending=False,inplace=True)
feature_importance.head()

@clappis commented Dec 27, 2019

Regarding @thoo's snippet above: I think the shap_values format has changed. For me it was necessary to add a sum():

import numpy as np
vals= np.abs(shap_values).mean(0)

feature_importance = pd.DataFrame(list(zip(features.columns, sum(vals))), columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False,inplace=True)
feature_importance.head()

@JohnStott

I think it would be useful to have this as a "feature_importance()" method, which I will look at adding as a pull request if anyone agrees.

On a different note, but related to this topic: I have only read summary information on "shap", and from my understanding the shap_values relate to individual observations (not the global level). As such, I wondered how to use them for an obscure cross-validation situation I have. I would like to find feature importances based on the training data of all the folds during CV (fully stratified folds, so no danger of missing variables/values in a particular fold etc.). I wondered if I should:

  1. get a normalised absolute mean of the shap values for each fold (i.e., so that the sum over all features in that fold adds up to 100%) and then average it across the folds...
  2. for each fold, get the shap values and append them into one shared large table, which is then used to calculate the absolute mean (see the sketch after this list)...
  3. none of the above?
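
A minimal sketch of what option 2 would look like (not an official recipe; make_model, X and y are hypothetical placeholders for your estimator factory and training data, and it assumes shap_values returns a single 2-D array, as for regression or binary margin output):

import numpy as np
import shap
from sklearn.model_selection import StratifiedKFold

all_shap = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in skf.split(X, y):
    model = make_model().fit(X.iloc[train_idx], y.iloc[train_idx])
    # SHAP values for this fold's training data
    all_shap.append(shap.TreeExplainer(model).shap_values(X.iloc[train_idx]))

pooled = np.vstack(all_shap)             # one shared large table of SHAP values
importance = np.abs(pooled).mean(0)      # mean |SHAP| per feature across all folds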

@TAMANNA08

If the feature importances do not add up to one, is that incorrect?

@jadhosn commented Oct 23, 2020

Regarding @clappis's snippet above: for anyone seeing this more recently, even if you are doing a binary classification problem, the returned shap_values is now a list of matrices (as mentioned in the TreeExplainer documentation), even though one class's shap values (say the positive class) are determined by the negative class's shap values (in the case of probabilities).

To reflect that in the snippet, I had to specifically select the positive class in my case in order to land on the correct shap values:

import numpy as np
vals= np.abs(shap_values[1]).mean(0)

@mburaksayici commented Feb 1, 2021

For anyone who is looking for a solution in shap version 0.38.1, I can reproduce the shap importance plot order by taking the absolute value of the proposed solution of @clappis. Here, "features" is the list of features; you can use validation.columns instead.

import numpy as np

feature_importance = pd.DataFrame(list(zip(features, sum(shap_values))), columns=['col_name','feature_importance_vals'])
feature_importance["feature_importance_vals"] = np.abs(feature_importance["feature_importance_vals"]) ##<----

feature_importance.sort_values(by=['feature_importance_vals'], ascending=False,inplace=True)
feature_importance.head()


@ba1mn commented Apr 21, 2021

For Shap version 0.39 :

def global_shap_importance(model, X):
    """ Return a dataframe containing the features sorted by Shap importance
    Parameters
    ----------
    model : The tree-based model 
    X : pd.Dataframe
         training set/test set/the whole dataset ... (without the label)
    Returns
    -------
    pd.Dataframe
        A dataframe containing the features sorted by Shap importance
    """
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    cohorts = {"": shap_values}
    cohort_labels = list(cohorts.keys())
    cohort_exps = list(cohorts.values())
    for i in range(len(cohort_exps)):
        if len(cohort_exps[i].shape) == 2:
            cohort_exps[i] = cohort_exps[i].abs.mean(0)
    features = cohort_exps[0].data
    feature_names = cohort_exps[0].feature_names
    values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
    feature_importance = pd.DataFrame(
        list(zip(feature_names, sum(values))), columns=['features', 'importance'])
    feature_importance.sort_values(
        by=['importance'], ascending=False, inplace=True)
    return feature_importance
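
A small usage sketch of the function above (assuming xgboost is installed and a shap version that still ships the Boston housing demo dataset used elsewhere in this thread; numpy and pandas are also needed by global_shap_importance itself):

import numpy as np
import pandas as pd
import shap
import xgboost

X, y = shap.datasets.boston()
model = xgboost.XGBRegressor().fit(X, y)

# features with the largest mean |SHAP| value come first
print(global_shap_importance(model, X).head())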

@krzysztoffiok

@ba1mn

shap==0.39.0
xgboost==1.1.1

In my case the above code didn't work, but it gave an idea of what to do. I had an xgboost model and did:

explainer = shap.Explainer(model)
shap_values = explainer(dataset.head())

# shap_values.shape == (5, 4000, 30)   -> 5 instances, 4000 features, 30 classes
# type(shap_values) == shap._explanation.Explanation

vals = shap_values.values
vals_abs = np.abs(vals)
val_mean = np.mean(vals_abs, axis=0)    # average over instances
val_final = np.mean(val_mean, axis=1)   # average over classes
feature_importance = pd.DataFrame(
    list(zip(shap_values.feature_names, val_final)), columns=['features', 'importance'])
feature_importance.sort_values(
    by=['importance'], ascending=False, inplace=True)

And finally some numbers looking like feature importances popped out.

@Timbimjim

Hey,

I have a very similar question. I am interested in exporting the data used to create the shap.dependence_plot("rank(0)", shap_values, X_train) diagrams.

Could anyone help me with that? I basically want to save the data in an Excel sheet so I can plot it myself.

thx
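
One way to export roughly the same data the plot uses (a sketch, not an official shap API: it assumes shap_values is a plain 2-D array aligned with X_train, it does not include the interaction feature that dependence_plot picks automatically for colouring, and writing .xlsx needs openpyxl installed):

import numpy as np
import pandas as pd

# "rank(0)" means the feature with the largest mean |SHAP| value
top_idx = np.argsort(-np.abs(shap_values).mean(0))[0]
top_name = X_train.columns[top_idx]

pd.DataFrame({
    top_name: X_train[top_name].values,
    "shap_value": shap_values[:, top_idx],
}).to_excel("dependence_plot_data.xlsx", index=False)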

@sonalisrijan commented May 20, 2021

How do we get a list of the most important features for a particular class in a multiclassification problem?
For eg: I have 8 different classes.
When I calculate the shap_values for my data using: shap_values = shap.TreeExplainer(xgb).shap_values(data) ,
then shap_values is a list containing 8 (= #classes) arrays. Each array is of the shape: (n_datapoints, n_features).

My question is: Using the array corresponding to each class (eg: For class 0, using array shap_values[0] ), how can I get a list of most important features for that particular class?

TLDR: Basically, I want to save the features shown on the y-axis (AT1G09740.1, etc) from the summary plot in a txt file. How do I do that?

[summary_plot screenshot]

@sonalisrijan


I think I figured out the answer. This is what I did for Class 0:

feature_importance = pd.DataFrame(list(zip(feature_names, shap_values[0].sum(0))), columns=['feature_name', 'feature_importance_vals'])
feature_importance = feature_importance.iloc[(-np.abs(feature_importance['feature_importance_vals'].values)).argsort()]
feature_importance.to_csv("class_0_shap_values.csv", sep="\t", header=True)

@banderlog commented Jun 3, 2021

Following up on the snippets from @thoo and @clappis above: now (with shap_values returned as an Explanation object) it wants shap_values.values, without any sum():

vals = np.abs(shap_values.values).mean(0)
feature_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()

@sheecegardezi

import pandas as pd
import xgboost
import shap
import numpy as np

X, y = shap.datasets.boston()
model = xgboost.XGBRegressor().fit(X, y)

explainer = shap.Explainer(model)
shap_values = explainer(X)

feature_names = list(X.columns.values)
vals = np.abs(shap_values.values).mean(0)
feature_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance


@wuhao199368

Regarding the snippets from @thoo and @clappis above: the order of the resulting "feature_importance" data frame is different from the order of the features in the corresponding summary plot, right? For example, here's the summary_plot of the FIFA data:

[summary_plot screenshot]

However, after I used the code above to calculate feature importance on the FIFA data, I got a feature importance data frame in which the order of the features is quite different:

[feature_importance screenshot]

So could you please tell me which one is the correct order to show the feature importance?

@xinwei-sher

@ba1mn's global_shap_importance function above works like a charm! Thanks!

@PARODBE commented Sep 12, 2021

feature_importance = pd.DataFrame(list(zip(feature_names, np.abs(shap_values)[0].mean(0))), columns=['feature_name', 'feature_importance_vals'])
feature_importance = feature_importance.iloc[(-np.abs(feature_importance['feature_importance_vals'].values)).argsort()]

Earlier, np.abs(shap_values[0]).mean(0) was used, and this is not correct because you would be selecting the first row and then taking the mean. The correct way is np.abs(shap_values)[0].

@jmrichardson commented Oct 29, 2021

Version 0.40.0:

    feature_names = shap_values.feature_names
    shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
    vals = np.abs(shap_df.values).mean(0)
    shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
    shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)

Verified this matches the summary bar plot.

@RaymondWJang commented Mar 22, 2022

If anyone is still wondering: after sorting through the code for a bit, I realized the solution is incredibly simple (this is with XGBoost; for LightGBM, shap_values must be shap_values[1] for binary classification):

feature_order = np.argsort(np.sum(np.abs(shap_values), axis=0))
[X.columns[i] for i in feature_order][::-1]

Confirmed the order is the same as the summary_plot.

@c1910475054 commented May 29, 2022

I am very curious why there is no example of how to extract the feature importance of lagged shap values (shape (22, 2, 9)) returned by a DeepExplainer. Every method I tried did not return the same order and impact as, for example, the summary plot did. I would appreciate a solution, as I am stuck summarizing feature importance for my thesis.
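
For what it's worth, one way to collapse a 3-D SHAP array to a single importance score per feature (a sketch only; it assumes the last axis of the (22, 2, 9) array is the feature axis, which is an assumption about your data, not something stated in the thread):

import numpy as np

shap_arr = np.asarray(shap_values)                  # e.g. shape (22, 2, 9)
importance = np.abs(shap_arr).mean(axis=(0, 1))     # average over samples and lags
order = np.argsort(importance)[::-1]                # most important feature index first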

@timothygao8710

I had to use np.abs(shap_values.values).mean(axis=0) not np.abs(shap_values).mean(axis=0) or np.abs(shap_values.data).mean(axis=0) to get it to work.

@amar94data

How can I see a detailed description of the features? Which function should I use?
I have the following columns in my dataframe. I want to see the details of the variable names, e.g. what the variable Medu stands for:
Data:

school object
sex object
age int64
address object
famsize object
Pstatus object
Medu int64
Fedu int64
Mjob object
Fjob object
reason object
guardian object
traveltime int64
studytime int64
failures int64
schoolsup object
famsup object
paid object
activities object
nursery object
higher object
internet object
romantic object
famrel int64
freetime int64
goout int64
Dalc int64
Walc int64
health int64
absences int64
G1 int64
G2 int64
G3 int64
dtype: object

@CarlaFernandez

@slundberg I'm curious as to why this is not already implemented as a feature. Any particular reason, or difficulty you foresee? Or has there just not been anyone to create the PR? Thank you

@TTFrance

I had exactly the same issue today and found a fairly easy way to get the top n features by shap value:

df_shap_values = pd.DataFrame(data=shap_values.values,columns=X.columns)
df_feature_importance = pd.DataFrame(columns=['feature','importance'])
for col in df_shap_values.columns:
    importance = df_shap_values[col].abs().mean()
    df_feature_importance.loc[len(df_feature_importance)] = [col,importance]
df_feature_importance = df_feature_importance.sort_values('importance',ascending=False)

then just pick off your top n features from the head of the df_feature_importance dataframe.

I'm not a python expert, so I have no doubt there's a more elegant way of doing this, but it solves the issue.
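
For what it's worth, a more compact equivalent of the loop above (a sketch with the same assumptions about shap_values and X):

import pandas as pd

df_feature_importance = (
    pd.DataFrame(shap_values.values, columns=X.columns)
      .abs()
      .mean()                                   # mean |SHAP| per column
      .sort_values(ascending=False)
      .rename("importance")
      .reset_index()
      .rename(columns={"index": "feature"})
)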

@barpaw1998 commented Nov 24, 2022

  model.fit(X_train, y_train)
  explainer = shap.Explainer(model)
  shap_values = explainer(X_test)
  # shap_values.values is 2-D (samples x features), so collapse to one mean |SHAP| per column
  shap_importance = pd.Series(
      np.abs(shap_values.values).mean(axis=0), index=X_train.columns
  ).sort_values(ascending=False)

With this code we have the importances in a Series.

@junaid1990

@Timbimjim Did you find a solution for exporting the dependence_plot data? Please tell me.

@TopCoder2K commented Apr 8, 2024

@slundberg, has something like get_feature_importances() been implemented? I haven't found any function in the documentation that would draw a histogram like the one shown here: https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#tree-s-feature-importance-from-mean-decrease-in-impurity-mdi
beeswarm only sorts in descending order, but I would also like to obtain importance scores.

UPD:
I've also noticed that the standard deviation is often greater than half of the mean itself in my case... Is the mean absolute value a good estimate of importance?
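
For reference, a small sketch of pulling both the mean and the standard deviation of |SHAP| per feature out of a modern shap.Explanation object, so the spread can be inspected alongside the importance score (whether the mean absolute value is a "good" estimate is problem-dependent; this just exposes the numbers):

import numpy as np
import pandas as pd

abs_vals = np.abs(shap_values.values)               # (n_samples, n_features)
scores = pd.DataFrame({
    "feature": shap_values.feature_names,
    "mean_abs_shap": abs_vals.mean(axis=0),
    "std_abs_shap": abs_vals.std(axis=0),
}).sort_values("mean_abs_shap", ascending=False)
print(scores.head())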
