### Tree-based importance

- Impurity-based feature importance.
- A feature is important if it is used often to split nodes and its splits reduce uncertainty a lot.
- Importance = how much total impurity this feature removes across the whole forest.
- Common impurity measures are Gini, entropy, and variance (MSE). For each split we compute the impurity difference between the parent node and its child nodes; this impurity decrease is credited to the feature used in that split.
- For each feature we sum its impurity decreases over the whole tree, then normalize by the total importance of all features.
- With T trees, we aggregate the per-tree importances (sum or average) and normalize again.

$$
\text{Importance(f)} = \sum_{s \in S_f} \Delta I_s
$$

Here $S_f$ is the set of splits that use feature $f$, and $\Delta I_s$ is the impurity decrease at split $s$.

The equation above gives the raw importance of a feature in one tree. Dividing by the total over all features normalizes it:

$$
\text{Normalized Importance(f)} = \frac{\text{Importance(f)}}{\sum_j \text{Importance(j)}}
$$

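With $T$ trees, we average the per-tree importances, where $\text{Importance}_t(f)$ is the importance of $f$ in tree $t$:

$$
\text{Importance}_{\text{forest}}(f) = \frac{1}{T} \sum_{t=1}^{T} \text{Importance}_t(f)
$$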

After this we can normalize again so that the importances lie between 0 and 1 and sum to 1.
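
The formulas above can be traced by hand on a toy example. The sketch below is only an illustration: the two trees, their class counts, and the helper names are hypothetical, Gini is used as the impurity measure, and each split's decrease is weighted by the number of samples reaching the node (a common convention that the notes above do not spell out).

```python
from collections import defaultdict

def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_decrease(parent, left, right, n_root):
    """Sample-weighted impurity decrease of one split (assumed weighting)."""
    n_p, n_l, n_r = sum(parent), sum(left), sum(right)
    return (n_p * gini(parent) - n_l * gini(left) - n_r * gini(right)) / n_root

def tree_importance(splits, n_root):
    """Sum the impurity decreases per feature, then normalize to sum to 1."""
    raw = defaultdict(float)
    for feature, parent, left, right in splits:
        raw[feature] += impurity_decrease(parent, left, right, n_root)
    total = sum(raw.values())
    return {f: v / total for f, v in raw.items()}

def forest_importance(per_tree):
    """Average the per-tree importances over T trees, then normalize again."""
    features = {f for imp in per_tree for f in imp}
    avg = {f: sum(imp.get(f, 0.0) for imp in per_tree) / len(per_tree)
           for f in features}
    total = sum(avg.values())
    return {f: v / total for f, v in avg.items()}

# Hypothetical toy trees: (feature, parent counts, left counts, right counts)
tree_1 = [
    ("age", [50, 50], [40, 10], [10, 40]),
    ("income", [40, 10], [40, 0], [0, 10]),
]
tree_2 = [
    ("income", [50, 50], [45, 5], [5, 45]),
]

imp_1 = tree_importance(tree_1, n_root=100)
imp_2 = tree_importance(tree_2, n_root=100)
forest = forest_importance([imp_1, imp_2])
print(imp_1, imp_2, forest)
```

In the first tree the root split on `age` removes 0.18 of weighted impurity and the `income` split removes 0.16, so the two features end up with similar normalized importance. The second tree only ever splits on `income`, which pulls the forest-level importance towards `income`.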

- This approach is biased towards continuous variables and high-cardinality categorical features, because they offer many candidate split points.
- If A and B are correlated, the tree may pick only one of them; the other gets low importance even if it is meaningful.

### Permutation importance

- Model-agnostic and performance-based.
- If a feature is important, breaking its relationship with the target should hurt performance.
- We do this by:
    - Randomly shuffling one feature column in the validation set.
    - Keeping everything else the same
    - Measuring how much the model performance drops
    - Importance = Performance drop
- Algorithm
    - Train the model on the training set
    - Evaluate on the validation set to get the baseline performance $M_{\text{base}}$
    - For each feature j:
        - Copy the validation set
        - Randomly permute column j in the copy
        - Predict with the trained model
        - Compute the new performance $M_j$
        - Importance of j = $M_{\text{base}} - M_j$ (for a score where higher is better)
    - Rank features by importance, then normalize
    - We can also repeat the permutation several times and average the resulting $M_j$
- Issues
    - If two features are correlated, shuffling one might not hurt much, because the model can still use the other feature.
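
The algorithm above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the synthetic data, the least-squares stand-in model, the choice of $R^2$ as the performance metric, and the function names are all assumptions made for the example.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination, used here as the performance metric M."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X_val, y_val, n_repeats=10, seed=0):
    """Mean drop in R^2 when each column of X_val is shuffled in turn."""
    rng = np.random.default_rng(seed)
    m_base = r2_score(y_val, predict(X_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()                          # copy the validation set
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle column j only
            drops.append(m_base - r2_score(y_val, predict(X_perm)))
        importances[j] = np.mean(drops)                    # average over repeats
    return importances

# Synthetic data (assumption): y depends strongly on x0, weakly on x1, not on x2.
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=600)
X_train, y_train, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

# Stand-in model: ordinary least squares with an intercept.
A = np.column_stack([X_train, np.ones(len(X_train))])
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict(X_new):
    return X_new @ w[:-1] + w[-1]

imp = permutation_importance(predict, X_val, y_val)
ranking = np.argsort(imp)[::-1]   # most important feature first
print(imp, ranking)
```

Because the model is only ever asked to predict, any fitted model with a `predict`-style function could be plugged in. scikit-learn ships a ready-made version as `sklearn.inspection.permutation_importance`.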

### Linear model coefficients

- In linear models, prediction is a weighted sum. Feature importance is based on the magnitude of the weight. But this is only meaningful after proper scaling.
- In linear regression, increasing $x_j$ by 1 unit (holding the other features fixed) changes the prediction by $w_j$.
- In logistic regression, increasing $x_j$ by 1 unit changes the log-odds by $w_j$.
- Comparing coefficient magnitudes is valid only if all features are on the same scale (e.g. standardized).
- Question answered: how strongly does this feature linearly change the model’s prediction?
- It does not measure causal effect, nonlinear importance, or interaction strength.
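
To see why scaling matters, the sketch below fits the same least-squares model on raw and on standardized features. The data is synthetic and the feature names (`income`, `age`) and true weights are hypothetical, chosen so that the ranking by coefficient magnitude flips once the features share a scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical features on very different scales.
income = rng.normal(50_000, 15_000, size=n)   # dollars
age = rng.normal(40, 10, size=n)              # years
X = np.column_stack([income, age])
y = 0.0002 * income + 0.1 * age + rng.normal(scale=0.5, size=n)

def fit_ols(X, y):
    """Least-squares weights of a linear model (intercept fitted, then dropped)."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w[:-1]

# Raw coefficients: change in prediction per 1 dollar / per 1 year.
w_raw = fit_ols(X, y)

# Standardized coefficients: change in prediction per 1 standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
w_std = fit_ols(X_std, y)

print("raw:", w_raw)
print("standardized:", w_std)
```

On the raw features `age` has the larger coefficient simply because one year is a much bigger step than one dollar. After standardization `income` has the larger coefficient, since a one-standard-deviation change in income moves the prediction more than a one-standard-deviation change in age.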

### Drop-column

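- Also called the leave-one-feature-out method.
- If a feature is important, removing it entirely should hurt performance.
- So we:
    - Train a model with all features
    - Then remove one feature
    - Retrain the model
    - Compare performance: importance = baseline performance minus performance without the feature
- Issues:
    - Correlated features cause problems: when one is dropped, the retrained model can recover the information from the other, so both can look unimportant.
    - The model changes after dropping a feature, so we compare different models rather than explain the original one. Retraining once per feature is also expensive.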
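
A minimal sketch of the procedure, which also reproduces the correlated-feature issue. Everything specific here is an assumption for illustration: the synthetic data (where `x1` is a near-copy of `x0`), the least-squares stand-in model, and $R^2$ as the metric.

```python
import numpy as np

def fit_predict(X_train, y_train, X_val):
    """Stand-in model: fit OLS with an intercept on train, predict on validation."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([X_val, np.ones(len(X_val))]) @ w

def r2_score(y_true, y_pred):
    """Coefficient of determination (higher is better)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def drop_column_importance(X_train, y_train, X_val, y_val):
    """Drop in validation R^2 after retraining without each feature."""
    base = r2_score(y_val, fit_predict(X_train, y_train, X_val))
    importances = []
    for j in range(X_train.shape[1]):
        keep = [k for k in range(X_train.shape[1]) if k != j]   # remove feature j
        preds = fit_predict(X_train[:, keep], y_train, X_val[:, keep])  # retrain
        importances.append(base - r2_score(y_val, preds))
    return np.array(importances)

# Synthetic data (assumption): x1 is a near-copy of x0, x2 is independent.
rng = np.random.default_rng(1)
x0 = rng.normal(size=600)
x1 = x0 + rng.normal(scale=0.05, size=600)
x2 = rng.normal(size=600)
X = np.column_stack([x0, x1, x2])
y = 2.0 * x0 + 1.0 * x2 + rng.normal(scale=0.1, size=600)

imp = drop_column_importance(X[:400], y[:400], X[400:], y[400:])
print(imp)
```

Although `x0` has the largest true effect on `y`, dropping it barely hurts because the retrained model leans on `x1` instead. The independent feature `x2` ends up with the highest drop-column importance.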

### LIME

- Local Interpretable Model-agnostic Explanations.
- LIME explains one prediction at a time by:
    - Creating fake data near that point (perturbations) and weighting each sample by its distance to the point (using a distance kernel)
    - Asking the black box model for predictions
    - Fitting a simple model locally
    - Using that simple model’s coefficient as feature importance.
- Question answered: around this specific point, which features matter most?
- LIME does not give global feature importance or describe overall model behavior.
- Problems with LIME
    - Instability due to sampling → different explanations each run
    - Sensitivity to the choice of kernel width
    - Unrealistic perturbations
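
The four steps can be sketched with NumPy alone. This is a simplified illustration of the idea, not the `lime` package: Gaussian perturbations, an exponential kernel on Euclidean distance, and a weighted least-squares surrogate are assumptions chosen for brevity, and `black_box` is a hypothetical model.

```python
import numpy as np

def lime_explain(predict, x, n_samples=2000, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a local linear surrogate around x; returns one coefficient per feature."""
    rng = np.random.default_rng(seed)
    # 1. Create perturbed samples near the point of interest.
    Z = x + rng.normal(scale=scale, size=(n_samples, x.size))
    # 2. Ask the black-box model for predictions.
    preds = predict(Z)
    # 3. Weight samples by proximity to x (assumed exponential kernel).
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Weighted least squares: scale rows by sqrt(weight), then solve.
    A = np.column_stack([Z - x, np.ones(n_samples)])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], preds * sw, rcond=None)
    return coef[:-1]   # drop the intercept

# Hypothetical black box: nonlinear in x0, linear in x1, ignores x2.
def black_box(X):
    return np.sin(X[:, 0]) + 0.5 * X[:, 1]

local_at_0 = lime_explain(black_box, np.array([0.0, 0.0, 0.0]))
local_at_peak = lime_explain(black_box, np.array([np.pi / 2, 0.0, 0.0]))
print(local_at_0, local_at_peak)
```

The same model gets different explanations at different points: near the origin `sin` is steep, so `x0` dominates, while at the peak of `sin` the local slope in `x0` is close to zero and `x1` matters most. Changing `seed`, `scale`, or `kernel_width` changes the coefficients, which is the instability and kernel-width sensitivity listed above.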

### Partial Dependence Plot (PDP)

- A visual tool for understanding how a feature affects predictions, especially for complex, black-box models.
- Shows the average effect of a feature on the model’s prediction, marginalising over all other features.
- Algorithm
    - Pick a feature $X_j$
    - Choose a grid of values for this feature
    - For each such value:
        - Replace the feature with that value in all instances.
        - Predict using the model
        - Average predictions across all instances
    - Plot average predictions over the selected grid of values
- Question answered: if we force feature $X_j = x$ for everyone, and let all other features vary as they normally do, what is the average prediction?
- Shows the overall relationship between the feature and the prediction.
- Is the relationship increasing, decreasing, nonlinear, or flat?
- Model-agnostic.
- Example: as age increases, predicted risk increases.
- Limitation: forcing one value onto every instance can create unrealistic feature combinations when features are correlated.
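
The algorithm above maps almost line for line onto code. The sketch below is illustrative only: `risk_model` is a hypothetical stand-in for a trained black-box model, and the data is synthetic.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value              # force X_j = value for every instance
        averages.append(predict(X_mod).mean()) # average over all instances
    return np.array(averages)

# Hypothetical risk model: risk rises with age (column 0) and with a second feature (column 1).
def risk_model(X):
    return 0.02 * X[:, 0] + 0.1 * X[:, 1]

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(20, 80, size=300), rng.normal(size=300)])

age_grid = np.linspace(20, 80, 7)   # grid of values for the chosen feature
pd_age = partial_dependence(risk_model, X, feature=0, grid=age_grid)
print(np.column_stack([age_grid, pd_age]))
```

Plotting `pd_age` against `age_grid` gives the partial dependence curve. Here it is an increasing straight line, matching the "as age increases, predicted risk increases" example.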

### ICE - Individual Conditional Expectation Plot

- Individual-level version of PDP
- It shows one curve per data point
- Algorithm
    - Start from a trained model
    - Select a feature (e.g. Age)
    - Choose a grid of values for that feature
    - Pick one sample (row)
    - Create copies of this sample, one per grid value, changing only the selected feature
    - Get the prediction for each copy
    - Draw the line for this sample
    - Repeat this for many samples
    - Each line shows how that particular feature affects the sample’s prediction
    - Example : If everything about this person stayed the same except Age, what would the model do?
- ICE = sample-wise effect: for this specific data point, how does changing this feature change the prediction?
- PDP = overall (average) effect: on average across all data points, how does changing this feature change the prediction?
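
A small sketch of ICE curves and their relation to the PDP. The model and data are hypothetical: `risk_model` includes an interaction so that age only raises risk for samples whose second column (a made-up smoker flag) is 1, which is the kind of pattern an average curve hides.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """One row per sample: predictions as `feature` sweeps over the grid."""
    curves = np.empty((X.shape[0], len(grid)))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value      # change only this feature, keep the rest
        curves[:, k] = predict(X_mod)
    return curves

# Hypothetical model: age (column 0) raises risk only when the smoker flag (column 1) is 1.
def risk_model(X):
    return 0.1 + 0.02 * X[:, 0] * X[:, 1]

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(20, 80, size=200),   # age
    rng.integers(0, 2, size=200),    # smoker flag: 0 or 1
]).astype(float)

age_grid = np.linspace(20, 80, 7)
ice = ice_curves(risk_model, X, feature=0, grid=age_grid)   # shape (200, 7)
pdp = ice.mean(axis=0)   # averaging the ICE curves pointwise gives the PDP

smokers = X[:, 1] == 1
print(ice[smokers][:2], ice[~smokers][:2], pdp)
```

The ICE curves split into two groups, rising lines for smokers and flat lines for non-smokers, while the PDP shows a single moderate slope in between.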