# Feature Importance Cheatsheet

## Feature Importance from Embedded Models

Embedded feature selection methods determine feature importance as part of the model training process, unlike filter or wrapper methods. They leverage the internal structure of a model to identify the most informative features, making selection efficient and model-aware.

* Tree-based models (e.g., XGBoost, Random Forest, CatBoost): Feature importance is derived from splits (gain, cover, or frequency).
* Linear models (e.g., Lasso, Ridge): Importance is based on the magnitude of the coefficients, which is only comparable across features when they are on the same scale.

For XGBoost specifically, feature importance can be measured in several ways:
* **weight** counts how many times a feature is used to split the data across all trees.
* **gain** measures the average improvement in loss (e.g., reduction in error) brought by splits on that feature; **total_gain** is the sum over those splits rather than the average.
* **cover** reflects the average coverage (roughly, the number of samples affected) of the splits involving that feature; **total_cover** is the corresponding sum.
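
To make the relationship between these scores concrete, here is a minimal pure-Python sketch. The split records (feature name, gain, cover) are invented for illustration and do not come from a real model. The sketch assumes that the averaged scores are simply the totals divided by the number of splits.

```python
from collections import defaultdict

# Hypothetical split records: (feature, gain of the split, cover of the split).
splits = [
    ("age", 4.0, 100.0),
    ("age", 2.0, 50.0),
    ("income", 9.0, 80.0),
]

def importance_scores(splits):
    """Aggregate per-split records into weight, gain and cover style scores."""
    weight = defaultdict(int)
    total_gain = defaultdict(float)
    total_cover = defaultdict(float)
    for feature, gain, cover in splits:
        weight[feature] += 1
        total_gain[feature] += gain
        total_cover[feature] += cover
    return {
        f: {
            "weight": weight[f],
            "total_gain": total_gain[f],
            "gain": total_gain[f] / weight[f],      # average gain per split
            "total_cover": total_cover[f],
            "cover": total_cover[f] / weight[f],    # average cover per split
        }
        for f in weight
    }

scores = importance_scores(splits)
for feature, s in scores.items():
    print(feature, s)
```

In this toy data, age ranks first by weight while income ranks first by gain, which is why the plots for different importance types can disagree.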


In [None]:
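import matplotlib.pyplot as plt
import xgboost as xgb

# Re-instantiate the model with all features.
model = xgb.XGBClassifier(
    n_estimators=500,
    eval_metric='logloss',
    random_state=42  # random_state already fixes the seed; a separate seed argument is redundant
)
model.fit(X_train, y_train)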

importance_types = ['weight', 'gain', 'total_gain', 'cover', 'total_cover']

fig, axes = plt.subplots(1, len(importance_types), figsize=(5*len(importance_types), 6))

for ax, imp_type in zip(axes, importance_types):
    xgb.plot_importance(
        model,
        importance_type=imp_type,
        height=0.4,
        ax=ax,
        color='steelblue'
    )
    ax.set_title(f'Feature Importance ({imp_type})')

plt.tight_layout()
plt.show()

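Using a selector such as scikit-learn's SelectFromModel, you can automatically keep only the features whose importance meets a threshold. If no threshold is given, SelectFromModel uses the mean importance (or 1e-5 for L1-penalized estimators).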
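
As a rough illustration of threshold-based selection, the sketch below keeps the features whose importance is at least the mean importance. The importance values and the helper name select_above_threshold are invented for this example; it is not how SelectFromModel is implemented internally.

```python
# Hypothetical importance scores (for example, read from a fitted model).
importances = {"age": 0.40, "income": 0.35, "city": 0.15, "id": 0.10}

def select_above_threshold(importances, threshold=None):
    """Keep features whose importance is at least the threshold (mean by default)."""
    if threshold is None:
        threshold = sum(importances.values()) / len(importances)
    return [name for name, value in importances.items() if value >= threshold]

selected = select_above_threshold(importances)
print(selected)
```

Lowering the threshold keeps more features, so the threshold acts as the main tuning knob for this kind of selection.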

Key considerations:
* Model-dependent: Different models rank features differently due to their internal mechanics.
* Training-dependent: Hyperparameters, randomness, and feature correlations can affect importance scores.
* Interpretation: Embedded importance reflects which features are most informative for the trained model, not necessarily universally important.

In [None]:
from sklearn.feature_selection import SelectFromModel

# Re-instantiate the model with all features.
model = xgb.XGBClassifier(
    n_estimators=500,
    eval_metric='logloss',
    random_state=42  # random_state alone fixes the seed
)
model.fit(X_train[feature_cols], y_train)

# model.set_params(importance_type='cover') # You can use this line to determine which importance type to use.

embedded_selector = SelectFromModel(model, prefit=True)
selected_features = X_train[feature_cols].columns[embedded_selector.get_support()]
print("Embedded selected features:", selected_features)


## Feature Importance from SHAP (SHapley Additive exPlanations)

SHAP is a model-agnostic explainability method that quantifies the contribution of each feature to a model’s prediction. It is based on Shapley values from cooperative game theory, which fairly distribute the “payout” (prediction) among all features according to their contribution.

How it works:

1. For a given prediction, SHAP considers all possible combinations of features and measures the marginal contribution of each feature when added to a subset.
2. These contributions are averaged across all permutations to compute a Shapley value for each feature.
3. The sum of all feature Shapley values plus the expected model output equals the model’s prediction for that instance.
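
The three steps above can be sketched for a tiny model by brute force. This is a simplified illustration, not the SHAP library's algorithm: the toy model, the instance, and the all-zero baseline standing in for "feature missing" are assumptions made for the example, whereas SHAP estimates the effect of a missing feature from background data.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)              # start with every feature "absent"
        previous = model(current)
        for f in order:
            current[f] = instance[f]          # reveal feature f
            value = model(current)
            totals[f] += value - previous     # marginal contribution of f
            previous = value
    return {f: totals[f] / len(orderings) for f in features}

# Hypothetical toy model with an interaction between a and c.
def toy_model(x):
    return 2.0 * x["a"] + 1.0 * x["b"] + x["a"] * x["c"]

instance = {"a": 1.0, "b": 2.0, "c": 3.0}
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}

phi = shapley_values(toy_model, instance, baseline)
base_value = toy_model(baseline)
print(phi)
print(base_value + sum(phi.values()), toy_model(instance))
```

The interaction term a * c is split evenly between a and c, and the baseline output plus the Shapley values adds up to the prediction, as step 3 states. Enumerating all orderings is only feasible for a handful of features.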

**TreeExplainer**: For tree-based models like XGBoost, shap.TreeExplainer efficiently computes exact Shapley values using the tree structure, making it fast and scalable.

In [None]:
import shap

# Re-instantiate the model with all features.
model = xgb.XGBClassifier(
    n_estimators=500,
    eval_metric='logloss',
    random_state=42  # random_state alone fixes the seed
)
model.fit(X_train[feature_cols], y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val[feature_cols])
shap.summary_plot(shap_values, X_val[feature_cols], feature_names=feature_cols)

## Permutation Feature Importance

Permutation feature importance is a post-hoc, model-agnostic method for evaluating the impact of each feature on a trained model’s performance. Unlike embedded methods that rely on the model’s internal feature scoring, permutation importance directly measures how much the model’s predictive ability decreases when a feature’s values are randomly shuffled.

How it works:

* Take a trained model and a validation dataset.
* Compute a baseline performance metric (e.g., recall, AUC) on the validation set.
* For each feature, randomly permute its values across all samples, breaking any association with the target.
* Measure the drop in model performance caused by the permutation.

Features that cause a larger drop are deemed more important.
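
The procedure can be written out from scratch in a few lines. The sketch below is a simplified illustration under stated assumptions: toy_model stands in for a trained model, the data is invented, and accuracy is used as the metric. It is not scikit-learn's implementation.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows the model labels correctly."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical "trained model": predicts 1 when the first feature is positive.
def toy_model(row):
    return int(row[0] > 0)

X_toy = [[1, 5], [-1, 3], [2, -4], [-2, 0], [3, 1], [-3, 2]]
y_toy = [1, 0, 1, 0, 1, 0]

importances = permutation_importance_sketch(toy_model, X_toy, y_toy)
print(importances)
```

The second feature is never used by the toy model, so shuffling it cannot change the score and its importance is exactly zero.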

In [None]:
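from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

# Re-instantiate the model with all features.
model = xgb.XGBClassifier(
    n_estimators=500,
    eval_metric='logloss',
    random_state=42  # random_state alone fixes the seed
)
model.fit(X_train[feature_cols], y_train)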

def auc_scorer(estimator, X, y):
    y_proba = estimator.predict_proba(X)[:, 1]
    return roc_auc_score(y, y_proba)

def f1_scorer(estimator, X, y):
    y_pred = estimator.predict(X)
    return f1_score(y, y_pred)

def accuracy_scorer(estimator, X, y):
    y_pred = estimator.predict(X)
    return accuracy_score(y, y_pred)

def precision_scorer(estimator, X, y):
    y_pred = estimator.predict(X)
    return precision_score(y, y_pred)

def recall_scorer(estimator, X, y):
    y_pred = estimator.predict(X)
    return recall_score(y, y_pred)

perm_importance = permutation_importance(
    model,
    X_val[feature_cols],  # use the same columns the model was trained on
    y_val,
    n_repeats=10,
    random_state=42,
    scoring=auc_scorer  # Swap in any of the scorers defined above.
)

# Rank features from most to least important.
sorted_idx = perm_importance.importances_mean.argsort()[::-1]
print("Top features by permutation importance:")
for idx in sorted_idx[:10]:
    print(feature_cols[idx], perm_importance.importances_mean[idx])


# Iterative Wrapper Methods

This section covers methods that use either a model's embedded feature importance or a metric computed on the model's outputs to select features.

1. Recursive Feature Elimination & Boruta Algorithm: Use the model's own embedded feature importance (the feature_importances_ attribute, or coef_ for linear models such as logistic regression).
2. Sequential Feature Selection: Selects features according to a specific metric calculated from the model's outputs.

### Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is a wrapper-based feature selection method that iteratively selects the most important features for a predictive model. The basic idea is:

1. Start with all features and train the estimator (in this case, XGBClassifier).
2. Compute feature importance scores from the trained model.
3. Remove the least important feature(s).
4. Refit the model on the remaining features.
5. Repeat this process until the desired number of features (n_features_to_select) is reached.
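
The loop above can be sketched without any library. In this minimal illustration, toy_importance is a hypothetical stand-in for "fit the estimator on the remaining features and read its importances", and the base scores are invented.

```python
def rfe_sketch(features, importance_fn, n_features_to_select, step=1):
    """Repeatedly drop the least important feature(s) until the target count remains."""
    remaining = list(features)
    while len(remaining) > n_features_to_select:
        scores = importance_fn(remaining)  # "refit" on the remaining features
        n_drop = min(step, len(remaining) - n_features_to_select)
        ranked = sorted(remaining, key=lambda f: scores[f])  # least important first
        for f in ranked[:n_drop]:
            remaining.remove(f)
    return remaining

# Hypothetical importances; a real run would retrain a model on each subset.
BASE_IMPORTANCE = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}

def toy_importance(feature_subset):
    total = sum(BASE_IMPORTANCE[f] for f in feature_subset)
    return {f: BASE_IMPORTANCE[f] / total for f in feature_subset}

selected = rfe_sketch(["a", "b", "c", "d"], toy_importance, n_features_to_select=2)
print(selected)
```

Because importances are recomputed after every removal, the ranking can change between rounds in a real model, which is the main difference from a single one-shot ranking.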

RFE is particularly useful when you want to systematically reduce the feature space while retaining the most informative predictors.

In [None]:
from sklearn.feature_selection import RFE

rfe_selector = RFE(
    estimator=xgb.XGBClassifier(n_estimators=100,
                                random_state=42,
                                importance_type="gain" # This can be changed to "cover" etc.
                                ),
    n_features_to_select=5
)
rfe_selector.fit(X_train[feature_cols], y_train)
print("RFE selected features:", [feature_cols[i] for i in rfe_selector.get_support(indices=True)])

### Sequential Feature Selection (SFS)

This is a wrapper-based method for selecting a subset of features by sequentially adding or removing features based on their contribution to model performance. There are two main approaches:

Forward selection (direction='forward'):

1. Starts with no features.
2. Iteratively adds the feature that improves performance the most.
3. Stops when the desired number of features (n_features_to_select) is reached.

Backward elimination (direction='backward'):

1. Starts with all features.
2. Iteratively removes the feature whose removal degrades performance the least.
3. Stops when the desired number of features remains.
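
Both directions can be sketched as greedy loops over a scoring function. Here toy_score is a hypothetical stand-in for a cross-validated metric, with an invented penalty that makes features a and b partly redundant; the function names are illustrative only.

```python
def forward_selection(features, score_fn, n_features_to_select):
    """Start empty and keep adding the feature that yields the best score."""
    selected = []
    while len(selected) < n_features_to_select:
        candidates = [f for f in features if f not in selected]
        best = max(candidates, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
    return selected

def backward_elimination(features, score_fn, n_features_to_select):
    """Start with everything and drop the feature whose removal hurts least."""
    selected = list(features)
    while len(selected) > n_features_to_select:
        drop = max(selected, key=lambda f: score_fn([g for g in selected if g != f]))
        selected.remove(drop)
    return selected

# Hypothetical score: individual value of each feature, minus a redundancy penalty.
def toy_score(subset):
    value = {"a": 0.30, "b": 0.25, "c": 0.10}
    score = sum(value[f] for f in subset)
    if "a" in subset and "b" in subset:
        score -= 0.20  # a and b carry overlapping information
    return score

forward = forward_selection(["a", "b", "c"], toy_score, 2)
backward = backward_elimination(["a", "b", "c"], toy_score, 2)
print(forward, backward)
```

In this toy case b looks strong on its own but adds little once a is present, so both directions end up preferring c as the second feature. On real data the two directions do not always agree.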

In [None]:
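from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

def auc_scorer(estimator, X, y):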
    y_proba = estimator.predict_proba(X)[:, 1]
    return roc_auc_score(y, y_proba)

def f1_scorer(estimator, X, y):
    y_pred = estimator.predict(X)
    return f1_score(y, y_pred)

def accuracy_scorer(estimator, X, y):
    y_pred = estimator.predict(X)
    return accuracy_score(y, y_pred)

def precision_scorer(estimator, X, y):
    y_pred = estimator.predict(X)
    return precision_score(y, y_pred)

def recall_scorer(estimator, X, y):
    y_pred = estimator.predict(X)
    return recall_score(y, y_pred)

sfs_selector = SequentialFeatureSelector(
    xgb.XGBClassifier(n_estimators=100,
                      random_state=42
                      ),
    n_features_to_select=5,
    direction='forward', # This could also be "backward"
    scoring=auc_scorer # This can be changed with the other scorers
)
sfs_selector.fit(X_train[feature_cols], y_train)
print("SFS selected features:", [feature_cols[i] for i in sfs_selector.get_support(indices=True)])

### Boruta Algorithm

Boruta is a wrapper-based feature selection algorithm that aims to identify all relevant features for a predictive model, rather than just a minimal subset. It works by comparing the importance of real features against **“shadow” features**.

Shadow features are synthetic copies of your original features, created by randomly shuffling the values of each feature across all samples. This preserves the distribution of the original feature but breaks any real association with the target variable. Features that consistently outperform their shadow counterparts are considered important and retained.
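
The shadow-feature comparison can be sketched as follows. This is a heavily simplified illustration: absolute correlation with the target stands in for a tree model's importance score, the data is invented, and the real algorithm additionally applies a statistical test to the hit counts before accepting or rejecting a feature.

```python
import random

def abs_correlation(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return abs(cov / (sd_x * sd_y)) if sd_x and sd_y else 0.0

def count_shadow_hits(columns, y, n_iter=20, seed=0):
    """Count how often each real feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        shadow_scores = []
        for values in columns.values():
            shadow = list(values)
            rng.shuffle(shadow)  # same distribution, no real link to y
            shadow_scores.append(abs_correlation(shadow, y))
        best_shadow = max(shadow_scores)
        for name, values in columns.items():
            if abs_correlation(values, y) > best_shadow:
                hits[name] += 1
    return hits

y_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
columns = {
    "signal": [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5],
    "noise": [3, 9, 4, 1, 5, 1, 2, 6, 5, 3],
}

hits = count_shadow_hits(columns, y_toy)
print(hits)
```

A feature that beats the best shadow in most iterations is a candidate for retention, while one that rarely does behaves like noise.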

The selected_features list contains all features that Boruta deemed truly relevant for predicting the target variable. This method is particularly useful when you want a comprehensive set of predictive features, including those that might be missed by simpler methods like RFE or sequential selection.

Key points:

* Boruta identifies all relevant features, not just the top N.
* Works well with tree-based models like Random Forest, XGBoost, or ExtraTrees because they provide feature importance scores.
* Reduces the risk of omitting important features when building the final predictive model.

There is a variation of Boruta that uses SHAP values, but it takes a long time to run. If you would like to read more about it, the package that implements it is called **"borutashapplus"**.

In [None]:
from boruta import BorutaPy

boruta_selector = BorutaPy(
    estimator=xgb.XGBClassifier(n_estimators=100,
                                random_state=42,
                                importance_type="gain" # This can be changed to "cover" etc.
                                ),
    n_estimators='auto',
    verbose=0,
    random_state=42
)

# BorutaPy works on NumPy arrays without missing values. Restrict to feature_cols
# so that support_ lines up with the feature names used below.
boruta_selector.fit(X_train[feature_cols].fillna(0).values, y_train.values)

selected_features = [col for col, keep in zip(feature_cols, boruta_selector.support_) if keep]
print("Boruta selected features:", selected_features)


# Statistical Filter Methods

This section covers a range of statistical methods that can also be used for feature selection.

Try running these tests to see how their results compare with your feature importance runs.

### ANOVA F-test
This is a statistical method used to identify features that are most strongly related to the target variable. It is commonly applied when the target is categorical (e.g., classes 0 and 1) and the features are numeric.

**How it works:**

1. For each feature, the F-test compares the variance between groups (classes) to the variance within groups.
2. A higher F-score indicates that the feature’s values differ significantly across classes — meaning it is more informative for predicting the target.
3. SelectKBest(score_func=f_classif, k=5) selects the top 5 features with the highest F-scores.
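
For a single feature, the F-score in steps 1 and 2 can be computed by hand. The sketch below uses the standard one-way ANOVA formula with invented data; the function name anova_f is illustrative.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic: between-class variance over within-class variance."""
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(c, []).append(v)
    n = len(values)
    grand_mean = sum(values) / n
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    df_between = len(groups) - 1
    df_within = n - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

labels = [0, 0, 0, 1, 1, 1]
weak_feature = [1, 2, 3, 2, 3, 4]       # class means 2 and 3, heavy overlap
strong_feature = [1, 2, 3, 11, 12, 13]  # class means 2 and 12, no overlap

print(anova_f(weak_feature, labels))
print(anova_f(strong_feature, labels))
```

Both features have the same spread within each class, so the much larger score of the second feature comes entirely from the larger gap between the class means.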

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

anova_selector = SelectKBest(score_func=f_classif, k=5)
# This currently uses all features; restrict feature_cols to numeric columns before relying on the result.
anova_selector.fit(X_train[feature_cols], y_train)

print("ANOVA selected features:", [feature_cols[i] for i in anova_selector.get_support(indices=True)])

### Chi-squared Test

This is a statistical method used to identify features that are most strongly associated with a categorical target variable. It is commonly applied when both the target and the features are categorical, such as when the feature is binary (0/1) and the target is a class label.

How it works:

1. For each feature, the Chi-squared test compares the observed frequency of each feature value in each class to the frequency expected if there were no association.

2. A higher Chi-squared statistic indicates that the feature’s distribution differs significantly across classes — meaning it is more informative for predicting the target.

3. SelectKBest(score_func=chi2, k=5) can be used to select the top 5 features with the highest Chi-squared scores.
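
The observed-versus-expected comparison in step 1 can be illustrated with the classical contingency-table statistic. The counts below are invented, and this is a sketch of the textbook test: scikit-learn's chi2 scorer works from per-class sums of non-negative feature values, so its scores can differ from this calculation.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic for a contingency table (rows: feature value, cols: class)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # count under no association
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are feature = 0 / 1, columns are class 0 / 1.
associated = [[10, 20], [30, 40]]
independent = [[10, 20], [20, 40]]

print(chi_squared(associated))
print(chi_squared(independent))
```

When the feature's distribution is identical in both classes, observed and expected counts coincide and the statistic is zero.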

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

chi2_selector = SelectKBest(score_func=chi2, k=5)
# This currently uses all features; chi2 requires non-negative values, so restrict to binary or count features.
chi2_selector.fit(X_train[feature_cols], y_train)

print("Chi2 selected features:", [feature_cols[i] for i in chi2_selector.get_support(indices=True)])

### Mutual Information

Mutual Information (MI) measures the dependency between each feature and the target variable. It captures any kind of statistical relationship, including non-linear associations, unlike the ANOVA F-test, which only detects differences in class means.

How it works:
1. For each feature, MI quantifies how much knowing the feature reduces uncertainty about the target.
2. A higher MI score means the feature contains more information about the target.
3. SelectKBest(score_func=mutual_info_classif, k=5) selects the top 5 features with the highest MI scores.
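
For discrete variables, the quantity in step 1 can be computed directly from counts. The sketch below is a plug-in estimate on invented data, in nats; mutual_info_classif uses a different, nearest-neighbour-based estimator when features are continuous, so this only illustrates the definition.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

target = [0, 1, 0, 1]
copy_of_target = [0, 1, 0, 1]  # knowing this feature removes all uncertainty
unrelated = [0, 0, 1, 1]       # statistically independent of the target here

print(mutual_information(copy_of_target, target))
print(mutual_information(unrelated, target))
```

A feature that determines a balanced binary target reaches the maximum of ln 2, while an independent feature scores zero.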

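Note that mutual_info_classif does not infer discreteness from column dtypes. With the default discrete_features='auto', every column of a dense input is treated as continuous (and every column of a sparse input as discrete). To have binary or other categorical features handled as discrete, pass discrete_features explicitly as a boolean mask or a list of column indices.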

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif

mi_selector = SelectKBest(score_func=mutual_info_classif, k=5)
mi_selector.fit(X_train[feature_cols], y_train)

print("Mutual Information features:", [feature_cols[i] for i in mi_selector.get_support(indices=True)])

### MRMR (Minimum Redundancy Maximum Relevance) Summary

MRMR is a filter-based feature selection method designed to choose features that are both:
* Highly relevant to the target (Maximum Relevance): Keep features that have strong statistical association with the target (e.g., high mutual information).
* Minimally redundant with each other (Minimum Redundancy): Avoid including features that are highly correlated or redundant with already selected features.

How it works:

1. Compute a relevance score (usually mutual information) between each feature and the target.
2. Compute redundancy between features (how much information is shared).
3. Iteratively select features that maximize relevance while minimizing redundancy with features already chosen.
4. Stop when the desired number of features K is selected.
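
The greedy loop in steps 3 and 4 can be sketched as below. The relevance and redundancy numbers are invented, and the sketch scores candidates by relevance minus mean redundancy (a difference scheme), which is an assumption made for this example; library implementations may combine the two terms differently, for instance as a ratio.

```python
def mrmr_sketch(relevance, redundancy, k):
    """Greedy mRMR: pick the feature with the best relevance minus mean redundancy."""
    selected = []
    candidates = set(relevance)

    def score(feature):
        if not selected:
            return relevance[feature]
        mean_redundancy = sum(
            redundancy[frozenset((feature, s))] for s in selected
        ) / len(selected)
        return relevance[feature] - mean_redundancy

    while candidates and len(selected) < k:
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical scores, e.g. mutual information with the target and between features.
relevance = {"a": 0.90, "b": 0.85, "c": 0.40}
redundancy = {
    frozenset(("a", "b")): 0.80,  # a and b are near-duplicates
    frozenset(("a", "c")): 0.10,
    frozenset(("b", "c")): 0.10,
}

mrmr_selected = mrmr_sketch(relevance, redundancy, k=2)
print(mrmr_selected)
```

Although b is the second most relevant feature, it is almost redundant with a, so the less relevant but complementary c is chosen instead.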

Notes:
* MRMR is model-agnostic.
* It works best with numeric features, and the target should be categorical or integer-encoded.

In [None]:
from mrmr import mrmr_classif  # provided by the mrmr_selection package

k = 5
mrmr_features = mrmr_classif(X=X_train[feature_cols], y=y_train, K=k)

print("\nMRMR selected features:", mrmr_features)