
how to extract the most important feature names? #632

Open · fyears opened this issue Jun 6, 2019 · 29 comments

@fyears commented Jun 6, 2019

We can visualize the feature importance by calling summary_plot, however it only outputs the plot, not any text.

Is there any way to output the feature names sorted by their importance as defined by shap_values?

Something like:

def get_feature_importance(shap_values_matrix):
    ... ???
    return np.array([...])

x = get_feature_importance(shap_values_matrix)
# x == np.array(['feature_1', 'feature_2', ... 'feature_n'])

Thanks.

@slundberg (Collaborator)

The array returned by shap_values is parallel to the data array you explained the predictions on, meaning it is the same shape as the data matrix you apply the model to. That means the names of the features for each column are the same as for your data matrix. If you have those names around somewhere as a list you can pass them to summary_plot using the feature_names argument.
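
For example (a minimal sketch of the above; X_array and column_names are assumed names for the NumPy matrix you explained and its matching column labels — with a pandas DataFrame the names are picked up automatically):

import shap

# feature_names maps each column of the explained matrix to a readable label
shap.summary_plot(shap_values, X_array, feature_names=column_names, plot_type="bar")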

@fyears (Author) commented Jun 10, 2019

I understand that shap_values_matrix.shape == train_data_matrix.shape.

clarifying

My issue is about getting the names of the features and their shap values, instead of visualizing them.

example

Take the demo in readme.md for example, we know the summary_plot plots the following figure:

shap.summary_plot(shap_values, X, plot_type="bar")

[boston summary_plot bar chart]

However, I am wondering, is there any way to get the names of the features ordered by importance:

# is there any function working like get_feature_importance? Or how to implement it?
shap.get_feature_importance(shap_values, X) == np.array(['LSTAT', 'RM', 'CRIM', ... 'CHAS']) 

Or even better, output the numeric values of each feature:

# is there any function working like get_feature_importance_2? Or how to implement it?
shap.get_feature_importance_2(shap_values, X) == {
  'LSTAT': 2.6,
  'RM': 1.7,
  ...,
  'CHAS': 0.0
}

Thank you so much.

@slundberg (Collaborator)

Ah. Well the numbers for the bar chart are just np.abs(shap_values).mean(0), so if you take X.columns[np.argsort(np.abs(shap_values).mean(0))] you should get the feature names in order (but note I didn't run that code to test it). You could then zip that up into a dictionary if you like.
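
Spelled out (a minimal sketch of the suggestion above; np.argsort sorts ascending, so it is reversed here to put the most important feature first — this assumes shap_values is a plain 2-D array and X is the matching DataFrame):

import numpy as np

mean_abs = np.abs(shap_values).mean(0)      # the bar-chart heights, one per feature
order = np.argsort(mean_abs)[::-1]          # indices from most to least important

sorted_names = list(X.columns[order])
importance = dict(zip(sorted_names, mean_abs[order]))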

@thoo commented Jul 3, 2019

I usually do this to get feature importance.

vals= np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns,vals)),columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'],ascending=False,inplace=True)
feature_importance.head()

@clappis commented Dec 27, 2019

Regarding @thoo's snippet above: I think the shap_values format has changed. For me it was necessary to add a sum():

import numpy as np
vals= np.abs(shap_values).mean(0)

feature_importance = pd.DataFrame(list(zip(features.columns, sum(vals))), columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False,inplace=True)
feature_importance.head()

@JohnStott

I think it would be useful to have this as a "feature_importance()" method, which I will look at adding as a pull request if anyone agrees.

On a different note, but related to this topic: I have only read summary information on "shap", and from my understanding the shap_values relate to individual observations (not the global level). As such, I wondered how to use them for an obscure cross-validation situation I have. I would like to find feature importances based on the training data of all the folds during CV (fully stratified folds, so no danger of missing variables/values in a particular fold etc.). I wondered if I should:

  1. get a normalised absolute mean of the shap values for each fold (i.e., so that the sum over all features in that fold adds up to 100%) and then average it across the folds...
  2. for each fold, get the shap values and append them into one shared large table, which is then used to calculate the absolute mean (see the sketch after this list)...
  3. none of the above?
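
A minimal sketch of what option 2 would look like (not an official recipe; make_model, X and y are hypothetical placeholders for your estimator factory and training data, and it assumes shap_values returns a single 2-D array, as for regression or binary margin output):

import numpy as np
import shap
from sklearn.model_selection import StratifiedKFold

all_shap = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in skf.split(X, y):
    model = make_model().fit(X.iloc[train_idx], y.iloc[train_idx])
    # SHAP values for this fold's training data
    all_shap.append(shap.TreeExplainer(model).shap_values(X.iloc[train_idx]))

pooled = np.vstack(all_shap)             # one shared large table of SHAP values
importance = np.abs(pooled).mean(0)      # mean |SHAP| per feature across all folds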

@TAMANNA08

If the feature importances do not add up to one, is that incorrect?

@jadhosn commented Oct 23, 2020

Regarding @clappis's snippet above: for anyone seeing this more recently, even if you are doing a binary classification problem, the returned shap_values is now a list of matrices (as mentioned in the TreeExplainer documentation), even though one class's shap values (say the positive class) are determined by the negative class's shap values (in the case of probabilities).

To reflect that in the snippet, I had to specifically select the positive class in my case in order to land on the correct shap values:

import numpy as np
vals= np.abs(shap_values[1]).mean(0)

@mburaksayici commented Feb 1, 2021

For anyone who is looking for a solution in shap version 0.38.1, I can reproduce the shap importance plot order by taking the absolute value of the proposed solution of @clappis. Here, "features" is the list of features; you can use validation.columns instead.

import numpy as np

feature_importance = pd.DataFrame(list(zip(features, sum(shap_values))), columns=['col_name','feature_importance_vals'])
feature_importance["feature_importance_vals"] = np.abs(feature_importance["feature_importance_vals"]) ##<----

feature_importance.sort_values(by=['feature_importance_vals'], ascending=False,inplace=True)
feature_importance.head()


@ba1mn commented Apr 21, 2021

For Shap version 0.39 :

def global_shap_importance(model, X):
    """ Return a dataframe containing the features sorted by Shap importance
    Parameters
    ----------
    model : The tree-based model 
    X : pd.Dataframe
         training set/test set/the whole dataset ... (without the label)
    Returns
    -------
    pd.Dataframe
        A dataframe containing the features sorted by Shap importance
    """
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    cohorts = {"": shap_values}
    cohort_labels = list(cohorts.keys())
    cohort_exps = list(cohorts.values())
    for i in range(len(cohort_exps)):
        if len(cohort_exps[i].shape) == 2:
            cohort_exps[i] = cohort_exps[i].abs.mean(0)
    features = cohort_exps[0].data
    feature_names = cohort_exps[0].feature_names
    values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
    feature_importance = pd.DataFrame(
        list(zip(feature_names, sum(values))), columns=['features', 'importance'])
    feature_importance.sort_values(
        by=['importance'], ascending=False, inplace=True)
    return feature_importance
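
A small usage sketch of the function above (assuming xgboost is installed and a shap version that still ships the Boston housing demo dataset used elsewhere in this thread; numpy and pandas are also needed by global_shap_importance itself):

import numpy as np
import pandas as pd
import shap
import xgboost

X, y = shap.datasets.boston()
model = xgboost.XGBRegressor().fit(X, y)

# features with the largest mean |SHAP| value come first
print(global_shap_importance(model, X).head())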

@krzysztoffiok

@ba1mn

shap==0.39.0
xgboost==1.1.1

In my case the above code didn't work, but it gave an idea of what to do. I had an xgboost model and did:

explainer = shap.Explainer(model)
shap_values = explainer(dataset.head())

# shap_values.shape == (5, 4000, 30)   -> 5 instances, 4000 features, 30 classes
# type(shap_values) == shap._explanation.Explanation

vals = shap_values.values
vals_abs = np.abs(vals)
val_mean = np.mean(vals_abs, axis=0)    # average over instances
val_final = np.mean(val_mean, axis=1)   # average over classes
feature_importance = pd.DataFrame(
    list(zip(shap_values.feature_names, val_final)), columns=['features', 'importance'])
feature_importance.sort_values(
    by=['importance'], ascending=False, inplace=True)

And finally some numbers looking like feature importances popped out.

@Timbimjim

Hey,

I have a very similar question. I am interested in exporting the data used to create the shap.dependence_plot("rank(0)", shap_values, X_train) diagrams.

Could anyone help me with that? I basically want to save the data in an Excel sheet so I can plot it myself.

thx
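
One way to export roughly the same data the plot uses (a sketch, not an official shap API: it assumes shap_values is a plain 2-D array aligned with X_train, it does not include the interaction feature that dependence_plot picks automatically for colouring, and writing .xlsx needs openpyxl installed):

import numpy as np
import pandas as pd

# "rank(0)" means the feature with the largest mean |SHAP| value
top_idx = np.argsort(-np.abs(shap_values).mean(0))[0]
top_name = X_train.columns[top_idx]

pd.DataFrame({
    top_name: X_train[top_name].values,
    "shap_value": shap_values[:, top_idx],
}).to_excel("dependence_plot_data.xlsx", index=False)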

@sonalisrijan commented May 20, 2021

How do we get a list of the most important features for a particular class in a multiclassification problem?
For eg: I have 8 different classes.
When I calculate the shap_values for my data using: shap_values = shap.TreeExplainer(xgb).shap_values(data) ,
then shap_values is a list containing 8 (= #classes) arrays. Each array is of the shape: (n_datapoints, n_features).

My question is: Using the array corresponding to each class (eg: For class 0, using array shap_values[0] ), how can I get a list of most important features for that particular class?

TLDR: Basically, I want to save the features shown on the y-axis (AT1G09740.1, etc) from the summary plot in a txt file. How do I do that?

[summary_plot screenshot]

@sonalisrijan


I think I figured out the answer. This is what I did for Class 0:

feature_importance = pd.DataFrame(list(zip(feature_names, shap_values[0].sum(0))), columns=['feature_name', 'feature_importance_vals'])
feature_importance = feature_importance.iloc[(-np.abs(feature_importance['feature_importance_vals'].values)).argsort()]
feature_importance.to_csv("class_0_shap_values.csv", sep="\t", header=True)

@banderlog commented Jun 3, 2021

Following up on the snippets from @thoo and @clappis above: now (with shap_values returned as an Explanation object) it wants shap_values.values, without any sum():

vals = np.abs(shap_values.values).mean(0)
feature_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()

@sheecegardezi

import pandas as pd
import xgboost
import shap
import numpy as np

X, y = shap.datasets.boston()
model = xgboost.XGBRegressor().fit(X, y)

explainer = shap.Explainer(model)
shap_values = explainer(X)

feature_names = list(X.columns.values)
vals = np.abs(shap_values.values).mean(0)
feature_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance


@wuhao199368

Regarding the snippets from @thoo and @clappis above: the order of the resulting "feature_importance" data frame is different from the order of the features in the corresponding summary plot, right? For example, here's the summary_plot of the FIFA data:

[summary_plot screenshot]

However, after I used the code above to calculate feature importance on the FIFA data, I got a feature importance data frame in which the order of the features is quite different:

[feature_importance screenshot]

So could you please tell me which one is the correct order to show the feature importance?

@xinwei-sher

@ba1mn's global_shap_importance function above works like a charm! Thanks!

@PARODBE commented Sep 12, 2021

feature_importance = pd.DataFrame(list(zip(feature_names, np.abs(shap_values)[0].mean(0))), columns=['feature_name', 'feature_importance_vals'])
feature_importance = feature_importance.iloc[(-np.abs(feature_importance['feature_importance_vals'].values)).argsort()]

Earlier, np.abs(shap_values[0]).mean(0) was used, and this is not correct because you would be selecting the first row and then taking the mean. The correct way is np.abs(shap_values)[0].

@jmrichardson commented Oct 29, 2021

Version 0.40.0:

    feature_names = shap_values.feature_names
    shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
    vals = np.abs(shap_df.values).mean(0)
    shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
    shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)

Verified this matches the summary bar plot.

@RaymondWJang commented Mar 22, 2022

If anyone is still wondering: after sorting through the code for a bit, I realized the solution is incredibly simple (this is with XGBoost; for LightGBM, shap_values must be shap_values[1] for binary classification):

feature_order = np.argsort(np.sum(np.abs(shap_values), axis=0))
[X.columns[i] for i in feature_order][::-1]

Confirmed the order is the same as the summary_plot.

@c1910475054 commented May 29, 2022

I am very curious why there is no example of how to extract the feature importance of lagged shap values (shape (22, 2, 9)) returned by a DeepExplainer. Every method I tried did not return the same order and impact as, for example, the summary plot did. I would appreciate a solution, as I am stuck summarizing feature importance for my thesis.
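
For what it's worth, one way to collapse a 3-D SHAP array to a single importance score per feature (a sketch only; it assumes the last axis of the (22, 2, 9) array is the feature axis, which is an assumption about your data, not something stated in the thread):

import numpy as np

shap_arr = np.asarray(shap_values)                  # e.g. shape (22, 2, 9)
importance = np.abs(shap_arr).mean(axis=(0, 1))     # average over samples and lags
order = np.argsort(importance)[::-1]                # most important feature index first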

@timothygao8710

I had to use np.abs(shap_values.values).mean(axis=0) not np.abs(shap_values).mean(axis=0) or np.abs(shap_values.data).mean(axis=0) to get it to work.

@amar94data

How can I see a detailed description of the features? Which function should I use?
I have the following columns in my dataframe. I want to see the details of the variable names, e.g. what the variable Medu stands for:
Data:

school object
sex object
age int64
address object
famsize object
Pstatus object
Medu int64
Fedu int64
Mjob object
Fjob object
reason object
guardian object
traveltime int64
studytime int64
failures int64
schoolsup object
famsup object
paid object
activities object
nursery object
higher object
internet object
romantic object
famrel int64
freetime int64
goout int64
Dalc int64
Walc int64
health int64
absences int64
G1 int64
G2 int64
G3 int64
dtype: object

@CarlaFernandez

@slundberg I'm curious as to why this is not already implemented as a feature. Any particular reason, or difficulty you foresee? Or has there just not been anyone to create the PR? Thank you

@TTFrance

I had exactly the same issue today and found a fairly easy way to get the top n features by shap value:

df_shap_values = pd.DataFrame(data=shap_values.values,columns=X.columns)
df_feature_importance = pd.DataFrame(columns=['feature','importance'])
for col in df_shap_values.columns:
    importance = df_shap_values[col].abs().mean()
    df_feature_importance.loc[len(df_feature_importance)] = [col,importance]
df_feature_importance = df_feature_importance.sort_values('importance',ascending=False)

then just pick off your top n features from the head of the df_feature_importance dataframe.

I'm not a python expert, so I have no doubt there's a more elegant way of doing this, but it solves the issue.
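
For what it's worth, a more compact equivalent of the loop above (a sketch with the same assumptions about shap_values and X):

import pandas as pd

df_feature_importance = (
    pd.DataFrame(shap_values.values, columns=X.columns)
      .abs()
      .mean()                                   # mean |SHAP| per column
      .sort_values(ascending=False)
      .rename("importance")
      .reset_index()
      .rename(columns={"index": "feature"})
)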

@barpaw1998 commented Nov 24, 2022

  model.fit(X_train, y_train)
  explainer = shap.Explainer(model)
  shap_values = explainer(X_test)
  # shap_values.values is 2-D (samples x features), so collapse to one mean |SHAP| per column
  shap_importance = pd.Series(
      np.abs(shap_values.values).mean(axis=0), index=X_train.columns
  ).sort_values(ascending=False)

With this code we have the importances in a Series.

@junaid1990

@Timbimjim Did you find a solution for exporting the dependence_plot data? Please tell me.

@TopCoder2K commented Apr 8, 2024

@slundberg, has something like get_feature_importances() been implemented? I haven't found any function in the documentation that would draw a histogram like the one shown here: https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#tree-s-feature-importance-from-mean-decrease-in-impurity-mdi
beeswarm only sorts in descending order, but I would also like to obtain importance scores.

UPD:
I've also noticed that the standard deviation is often greater than half of the mean itself in my case... Is the mean absolute value a good estimate of importance?
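
For reference, a small sketch of pulling both the mean and the standard deviation of |SHAP| per feature out of a modern shap.Explanation object, so the spread can be inspected alongside the importance score (whether the mean absolute value is a "good" estimate is problem-dependent; this just exposes the numbers):

import numpy as np
import pandas as pd

abs_vals = np.abs(shap_values.values)               # (n_samples, n_features)
scores = pd.DataFrame({
    "feature": shap_values.feature_names,
    "mean_abs_shap": abs_vals.mean(axis=0),
    "std_abs_shap": abs_vals.std(axis=0),
}).sort_values("mean_abs_shap", ascending=False)
print(scores.head())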
