# Introduction

Previously, we discussed feature importance methods that evaluate the significance of a feature at the global level. But what about the level of an individual data sample?

Let's talk about methods that compute individual feature importance in this lecture.


## Prerequisites
- Courses 1 and 2, since most of the models used this week are imported from those courses.
- Feature Importance Methods

# What will be covered?

Under individual feature importance methods, this lecture covers:

1. Individual feature importance
2. Shapley Values
...


## Individual Feature Importance

In the previous lecture, we computed the importance of each feature for the overall model using the drop-column and permutation methods. Now suppose you want to know the effect of each feature value on the risk prediction for a single patient.

The individual feature importance method gives us exactly this kind of per-patient score.

Continuing the previous example of the `risk of death prediction` prognostic model, suppose we need to understand how much each of Patient A's feature values contributes to the prediction. The steps to compute these importance values are:

1. First, train the model using the entire feature set.
2. Next, obtain the model's __prediction__ (not performance) for Patient A's data.
3. As in the drop-column method, retrain the model with one feature removed at a time. Then compute each retrained model's output for Patient A's remaining feature values.

    <figure>
    <center><img src="../../assets/W3/W3_P2_individual_feature_importance.png" width="900">
    <figcaption align="center"> Fig 1: Individual Feature Importance method visualization </figcaption>
    </figure>
    
4. Finally, subtract each retrained model's output from the full model's output to obtain the individual feature importance scores:
    
    | Feature | Importance |
    | --- | --- |
    | Age | $f(\{\text{BP, Age}\}) - f(\{\text{BP}\})$ |
    | BP  | $f(\{\text{BP, Age}\}) - f(\{\text{Age}\})$ |
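
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: `predict` is a hypothetical stand-in for "train a model on these features and predict for Patient A", and the output values in `PATIENT_A_OUTPUTS` are made up.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical model outputs for Patient A, keyed by the feature set the
# model was trained on. These numbers are made up for illustration.
PATIENT_A_OUTPUTS: Dict[FrozenSet[str], float] = {
    frozenset({"Age", "BP"}): 0.30,  # model trained on both features
    frozenset({"BP"}): 0.20,         # model retrained without Age
    frozenset({"Age"}): 0.15,        # model retrained without BP
}


def predict(features: Iterable[str]) -> float:
    """Stand-in for 'train on these features, then predict for Patient A'."""
    return PATIENT_A_OUTPUTS[frozenset(features)]


def individual_importance(
    predict_fn: Callable[[Iterable[str]], float], features: Iterable[str]
) -> Dict[str, float]:
    """Score each feature as f(all features) - f(all features except it)."""
    full = frozenset(features)
    full_output = predict_fn(full)
    return {name: full_output - predict_fn(full - {name}) for name in sorted(full)}


scores = individual_importance(predict, ["Age", "BP"])
for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

In practice, `predict` would wrap real model training, so each call retrains a model on the given feature subset.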

## Disadvantage 

This method fails to account for _correlation between features_ (for example, systolic and diastolic blood pressure) when computing the scores. Ignoring this correlation can make the individual feature importance method understate the importance of correlated features.

Returning to the prognostic model, let's use a new set of features: Age, dBP (diastolic BP), and sBP (systolic BP). Computing the individualized feature importance with the previous method gives the values shown in Fig 2.

<figure>
<center><img src="../../assets/W3/W3_P2_correlated_feature_problem.png" width="900">
<figcaption align="center"> Fig 2: Individualized feature importance for patient A. </figcaption>
</figure>

We can clearly see that the importance values are low: the method fails to capture the importance of Patient A's __high sBP and dBP__, which increase the risk of death. This happens because sBP and dBP are positively correlated. When one of them is dropped, the other still carries most of the same information, so the prediction barely changes.
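
To make the problem concrete, here is a small sketch with made-up numbers (they are not the values in Fig 2). It assumes a hypothetical baseline risk of 0.10 and hypothetical model outputs in which dropping either blood pressure feature barely changes the prediction. The individual scores then add up to far less than the amount by which Patient A's risk exceeds the baseline.

```python
# Hypothetical outputs for Patient A; none of these numbers come from Fig 2.
BASELINE_RISK = 0.10  # assumed average risk in the dataset
outputs = {
    frozenset({"Age", "sBP", "dBP"}): 0.40,  # full model
    frozenset({"sBP", "dBP"}): 0.35,         # Age dropped
    frozenset({"Age", "dBP"}): 0.38,         # sBP dropped; dBP still signals high BP
    frozenset({"Age", "sBP"}): 0.38,         # dBP dropped; sBP still signals high BP
}

all_features = frozenset({"Age", "sBP", "dBP"})
full_output = outputs[all_features]

# Drop-one importance: full-model output minus the output without the feature.
importance = {
    name: full_output - outputs[all_features - {name}]
    for name in sorted(all_features)
}

explained = sum(importance.values())
gap = full_output - BASELINE_RISK

for name, score in importance.items():
    print(f"{name}: {score:+.2f}")
print(f"sum of scores: {explained:.2f}, risk above baseline: {gap:.2f}")
```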

## Shapley Values

Shapley values come to the rescue! This method of computing individualized feature importance can produce accurate importance values even when the features are correlated.

For instance, to compute the individualized feature importance of sBP with the Shapley values method, we consider every possible feature set that contains sBP and compare it (by subtracting model outputs) with the same set with sBP removed.

The method to compute the feature importance is as follows:
1. First, list the feature sets in which the feature of interest is present, alongside the matching sets in which it is absent, e.g.:

    <figure>
    <center><img src="../../assets/W3/W3_P2_shapley_values_1.png" width="400">
    <figcaption align="center"> Fig 3: Possible combination of feature set for sBP </figcaption>
    </figure>

2. Then compute the model output for each feature set. _For the empty set $f(\{\})$, we use the expected value of the event (here, death) in the dataset, i.e. the baseline risk (0.10)._ Note: this step requires training 8 models. (In general, computing Shapley values requires $2^n$ models.)

3. Next, subtract the predictions of each pair of feature sets that differ only by the feature of interest.

    <figure>
    <center><img src="../../assets/W3/W3_P2_shapley_values_2.png" width="900">
    <figcaption align="center"> Fig 4: Finding the intermediate values for computing sBP's shapley value </figcaption>
    </figure>

4. Finally, we average the differences computed in the previous step, but there is a catch. The average must account for every order in which the full feature set can be built up (i.e. the permutations of the set, $n!$ in total), which are:
    ```
    1. sBP, dBP, Age
    2. sBP, Age, dBP
    3. dBP, sBP, Age
    4. Age, sBP, dBP
    5. dBP, Age, sBP
    6. Age, dBP, sBP
    ```

    The order of each sequence indicates when each feature joins the feature set. In the sixth sequence, for example, Age joins first, then dBP, then sBP. Here, the contribution of sBP is $f(\{\text{Age, dBP, sBP}\}) - f(\{\text{Age, dBP}\})$.
    
    Similarly, for the first sequence, the contribution of sBP is $f(\{\text{sBP}\}) - f(\{\})$.

    Finally, we average the contributions over all permutations to obtain the Shapley value for sBP.

    <figure>
    <center><img src="../../assets/W3/W3_P2_shapley_values_3.png" width="900">
    <figcaption align="center"> Fig 5: Averaging the contributions over all permutations to obtain sBP's Shapley value </figcaption>
    </figure>
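
Written out for the six orderings listed in step 4, the average can be grouped by the feature set that sBP joins. The weights below are simply counts of orderings divided by $3! = 6$: sBP joins first in two orderings, joins after dBP alone in one, after Age alone in one, and joins last in two. The notation $I(\text{sBP})$ for the Shapley value of sBP is used here as shorthand.

$$
\begin{aligned}
I(\text{sBP}) = {} & \tfrac{2}{6}\,\big[f(\{\text{sBP}\}) - f(\{\})\big] \\
& + \tfrac{1}{6}\,\big[f(\{\text{dBP, sBP}\}) - f(\{\text{dBP}\})\big] \\
& + \tfrac{1}{6}\,\big[f(\{\text{Age, sBP}\}) - f(\{\text{Age}\})\big] \\
& + \tfrac{2}{6}\,\big[f(\{\text{Age, dBP, sBP}\}) - f(\{\text{Age, dBP}\})\big]
\end{aligned}
$$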

The sum of all the Shapley values should equal the difference between the output of the model trained on all features and the baseline risk $f(\{\})$.

$$I(\text{Age}) + I(\text{sBP}) + I(\text{dBP}) = f(\{\text{Age, dBP, sBP}\}) - f(\{\})$$

where $I(x)$ denotes the Shapley value of feature $x$.
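
The whole procedure, including this sum check, can be sketched in Python by walking through every ordering of the features. This is an illustrative sketch under stated assumptions: the `OUTPUTS` table stands in for the 8 trained models, and every value in it is hypothetical except the 0.10 baseline used for the empty set.

```python
from itertools import permutations

FEATURES = ("Age", "dBP", "sBP")

# Hypothetical outputs f(S) for Patient A from models trained on each feature
# subset S. Only the 0.10 baseline for the empty set comes from the lecture.
OUTPUTS = {
    frozenset(): 0.10,
    frozenset({"Age"}): 0.15,
    frozenset({"dBP"}): 0.30,
    frozenset({"sBP"}): 0.32,
    frozenset({"Age", "dBP"}): 0.38,
    frozenset({"Age", "sBP"}): 0.38,
    frozenset({"dBP", "sBP"}): 0.35,
    frozenset({"Age", "dBP", "sBP"}): 0.40,
}


def shapley_values(features, outputs):
    """Average each feature's marginal contribution over all join orders."""
    orders = list(permutations(features))
    totals = {name: 0.0 for name in features}
    for order in orders:
        present = frozenset()
        for name in order:
            with_feature = present | {name}
            # Contribution of `name` when it joins the features already present.
            totals[name] += outputs[with_feature] - outputs[present]
            present = with_feature
    return {name: total / len(orders) for name, total in totals.items()}


values = shapley_values(FEATURES, OUTPUTS)
for name, value in values.items():
    print(f"I({name}) = {value:.3f}")

# Sum property: the Shapley values add up to f(all features) - f({}).
gap = OUTPUTS[frozenset(FEATURES)] - OUTPUTS[frozenset()]
assert abs(sum(values.values()) - gap) < 1e-9
```

With these made-up outputs, sBP and dBP each receive a much larger share than they would from dropping one feature at a time, because orderings in which a blood pressure feature joins before its correlated partner credit it with the full jump in risk.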

That's all for the individualized feature importance section of this lecture. We learned how individualized feature importance and Shapley values measure the importance of each feature for a single patient (data sample).

As you might have noticed, computing Shapley values requires training $2^n$ models, where $n$ is the number of features. [__SHAP__](https://shap.readthedocs.io/en/latest/) builds on the concept of Shapley values while avoiding this retraining, which makes it computationally efficient. Discussing SHAP is beyond the scope of this lecture.
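
To get a feel for how quickly the exact computation grows, the short sketch below counts the $2^n$ feature subsets (one model each) and the $n!$ orderings for a few values of $n$. The helper name `exact_shapley_cost` is made up for this illustration.

```python
from math import factorial


def exact_shapley_cost(n_features: int) -> tuple:
    """Return (number of feature subsets, number of orderings) for n features."""
    return 2 ** n_features, factorial(n_features)


for n in (3, 10, 20):
    n_models, n_orders = exact_shapley_cost(n)
    print(f"n={n}: {n_models} models, {n_orders} orderings")
```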