# **Feature Importance**: Comparing Random Forests vs. Permutation Importance 

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

In this example, we compare the impurity-based feature importance of a random forest classifier with permutation importance on the Titanic dataset. The permutation importance approach is "model agnostic" and defines a feature's importance as the decrease in model performance when the values of that single feature are randomly shuffled. We will show that impurity-based feature importance can inflate the importance of numerical features.

Furthermore, the impurity-based feature importance of random forests suffers from being computed on statistics derived from the training dataset: the importances can be high even for features that are not predictive of the target variable, as long as the model has the capacity to use them to overfit.

Source: https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html 
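
To make the definition of permutation importance concrete, here is a minimal NumPy sketch of the idea. It is not scikit-learn's implementation: the helper `permutation_importance_sketch`, the toy data, and the fixed rule `toy_predict` standing in for a fitted model are all hypothetical names introduced for illustration.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.RandomState(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column only; this breaks its link to the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[col] = np.mean(drops)
    return importances

# Toy data (assumption): the target depends on column 0 only, column 1 is noise
rng = np.random.RandomState(42)
X_toy = rng.randn(500, 2)
y_toy = (X_toy[:, 0] > 0).astype(int)

# Hypothetical stand-in for a fitted model: a fixed rule that reads column 0
def toy_predict(X):
    return (X[:, 0] > 0).astype(int)

toy_importances = permutation_importance_sketch(toy_predict, X_toy, y_toy)
print("Importance of signal column: %0.3f" % toy_importances[0])
print("Importance of noise column:  %0.3f" % toy_importances[1])
```

Because the rule never reads column 1, shuffling that column cannot change any prediction, so its importance is exactly zero. Shuffling column 0 pushes accuracy towards chance level.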

We will examine the Titanic data (the sinking of the RMS Titanic) from the [Kaggle competition](https://www.kaggle.com/c/titanic). There are many other examples of the Titanic dataset in introductory statistics and data science courses, so we also encourage you to look around and see how others have approached the problem.

<img src="https://upload.wikimedia.org/wikipedia/commons/9/95/Titanic_sinking%2C_painting_by_Willy_St%C3%B6wer.jpg" width="500" height="500" align="center"/>

Image source: https://upload.wikimedia.org/wikipedia/commons/9/95/Titanic_sinking%2C_painting_by_Willy_St%C3%B6wer.jpg

-------------

## **Part 0**: Setup

### Import packages

In [None]:
# Import all packages
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [16, 8]
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

import numpy as np

from sklearn.datasets        import fetch_openml
from sklearn.ensemble        import RandomForestClassifier
from sklearn.impute          import SimpleImputer
from sklearn.inspection      import permutation_importance
from sklearn.compose         import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import OneHotEncoder

### Constants

In [None]:
SEED = 42

## **Part 1**: Load & preprocess data

We add two random variables that are not correlated in any way with the target variable ``survived``:

- ``random_num`` is a high cardinality numerical variable (as many unique values as records)
- ``random_cat`` is a low cardinality categorical variable (3 possible values)

In [None]:
# Fetch Titanic data
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

# Set numpy random state to generate random variables
rng = np.random.RandomState(seed=SEED)

# Add random variables 
X['random_cat'] = rng.randint(3, size=X.shape[0])
X['random_num'] = rng.randn(X.shape[0])

# Select columns
categorical_columns = ['pclass', 'sex', 'embarked', 'random_cat']
numerical_columns = ['age', 'sibsp', 'parch', 'fare', 'random_num']

X = X[categorical_columns + numerical_columns]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=SEED)

# Impute and one-hot-encode data
categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

preprocessing = ColumnTransformer(
    [('cat', categorical_pipe, categorical_columns),
     ('num', numerical_pipe, numerical_columns)])

X.head()
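
As a quick sanity check on the cardinality claims for the two random columns, the sketch below regenerates them outside the DataFrame and counts unique values. The row count `n_rows = 1309` is an assumption about the size of the OpenML Titanic table; in the notebook itself you would use `X.shape[0]`.

```python
import numpy as np

n_rows = 1309  # assumption: number of rows in the OpenML Titanic table
rng = np.random.RandomState(seed=42)

# Same calls, in the same order, as in the cell above
random_cat = rng.randint(3, size=n_rows)
random_num = rng.randn(n_rows)

n_unique_cat = len(np.unique(random_cat))
n_unique_num = len(np.unique(random_num))
print("random_cat unique values: %d" % n_unique_cat)
print("random_num unique values: %d of %d rows" % (n_unique_num, n_rows))
```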

## **Part 2**: Fit random forest model and evaluate performance

In [None]:
# Set up a pipeline to preprocess data, then fit a random forest classifier
rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RandomForestClassifier(random_state=SEED))
])

# Fit random forest
rf.fit(X_train, y_train)

# Note the perfect predictions on the train set - the model memorizes the training data ... and overfits
print("RF train accuracy: %0.3f" % rf.score(X_train, y_train))
print("RF test accuracy: %0.3f" % rf.score(X_test, y_test))

## **Part 3**: Feature importance - Random Forest (mean decrease in impurity)

Random forests compute feature importance as the mean decrease in impurity (MDI). Impurity-based feature importance ranks the numerical features as the most important. As a result, the non-predictive ``random_num`` variable is ranked the most important!

This problem stems from two limitations of impurity-based feature importances:

- impurity-based importances are biased towards high-cardinality features (here, the ``random_num`` variable)
- impurity-based importances are computed on training set statistics and therefore do not reflect how useful a feature is for making predictions that generalize to the test set (when the model has enough capacity to overfit).
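
The first limitation can be illustrated at the level of a single tree node. The sketch below is a simplified simulation, not the random forest's own computation: it draws purely random labels and compares the best achievable Gini decrease for a random 3-valued feature with that for a random continuous feature. The helper names `gini` and `best_impurity_decrease` are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a binary 0/1 label array."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def best_impurity_decrease(feature, labels):
    """Largest Gini decrease over all threshold splits on one feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Candidate thresholds: every distinct value except the largest
    for threshold in np.unique(feature)[:-1]:
        left = labels[feature <= threshold]
        right = labels[feature > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best

rng = np.random.RandomState(42)
n_samples, n_trials = 200, 50
gain_cat, gain_num = [], []
for _ in range(n_trials):
    # Labels are random, so neither feature is truly predictive
    labels = rng.randint(2, size=n_samples)
    gain_cat.append(best_impurity_decrease(rng.randint(3, size=n_samples), labels))
    gain_num.append(best_impurity_decrease(rng.randn(n_samples), labels))

mean_gain_cat = np.mean(gain_cat)
mean_gain_num = np.mean(gain_num)
print("Mean best Gini decrease, 3-valued feature:   %0.4f" % mean_gain_cat)
print("Mean best Gini decrease, continuous feature: %0.4f" % mean_gain_num)
```

The continuous feature offers far more candidate thresholds, so by chance alone one of them tends to look like a useful split on the training data. A tree that picks the split with the largest decrease will therefore favour it.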

In [None]:
# Get feature names
ohe = (rf.named_steps['preprocess']
         .named_transformers_['cat']
         .named_steps['onehot'])
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out (scikit-learn >= 1.0)
feature_names = ohe.get_feature_names_out(categorical_columns)
feature_names = np.r_[feature_names, numerical_columns]

# Get feature importances from fitted model
tree_feature_importances = (
    rf.named_steps['classifier'].feature_importances_)
sorted_idx = tree_feature_importances.argsort()

# Plot feature importances 
y_ticks = np.arange(0, len(feature_names))
fig, ax = plt.subplots()
ax.barh(y_ticks, tree_feature_importances[sorted_idx])
# Set tick positions first, then attach the matching labels
ax.set_yticks(y_ticks)
ax.set_yticklabels(feature_names[sorted_idx])
ax.set_title("Random Forest Feature Importances (MDI)")
fig.tight_layout()
plt.show()

## **Part 4**: Feature importance - Permutation Importance

As an alternative, the permutation importances of ``rf`` are computed on a held-out test set. This shows that the low-cardinality categorical feature ``sex`` is the most important feature.

Also note that both random features have very low importances (close to 0) as expected.

In [None]:
# Compute permutation importance 
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=SEED, n_jobs=-1)
sorted_idx = result.importances_mean.argsort()

# Plot importance 
fig, ax = plt.subplots()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=X_test.columns[sorted_idx])
ax.set_title("Permutation Importances (TEST set)")
fig.tight_layout()
plt.show()
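
To see why the held-out set matters, the sketch below swaps the random forest for a simpler model that memorizes its training data, a 1-nearest-neighbour classifier written in NumPy. The synthetic data and the helpers `predict_1nn` and `accuracy_drop` are assumptions made for illustration. It measures the accuracy drop from shuffling a pure-noise column, once on the training set and once on held-out data.

```python
import numpy as np

def predict_1nn(X_fit, y_fit, X_query):
    """Label each query row with the label of its nearest stored row."""
    # Pairwise squared Euclidean distances, shape (n_query, n_fit)
    dists = ((X_query[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    return y_fit[dists.argmin(axis=1)]

def accuracy_drop(X_fit, y_fit, X_eval, y_eval, col, rng, n_repeats=10):
    """Mean accuracy drop on (X_eval, y_eval) when column `col` is shuffled."""
    baseline = np.mean(predict_1nn(X_fit, y_fit, X_eval) == y_eval)
    drops = []
    for _ in range(n_repeats):
        X_perm = X_eval.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        permuted = np.mean(predict_1nn(X_fit, y_fit, X_perm) == y_eval)
        drops.append(baseline - permuted)
    return np.mean(drops)

rng = np.random.RandomState(42)
n = 300
signal = rng.randn(2 * n)
noise = rng.randn(2 * n)  # column 1: unrelated to the target
# Noisy labels, so memorizing the training rows differs from generalizing
y_all = (signal + 0.5 * rng.randn(2 * n) > 0).astype(int)
X_all = np.column_stack([signal, noise])
X_tr, y_tr = X_all[:n], y_all[:n]
X_te, y_te = X_all[n:], y_all[n:]

noise_drop_train = accuracy_drop(X_tr, y_tr, X_tr, y_tr, col=1, rng=rng)
noise_drop_test = accuracy_drop(X_tr, y_tr, X_te, y_te, col=1, rng=rng)
print("Noise column importance on TRAIN set: %0.3f" % noise_drop_train)
print("Noise column importance on TEST set:  %0.3f" % noise_drop_test)
```

On the training set the memorizing model relies on the noise column to find each stored row again, so shuffling it hurts accuracy even though the column carries no signal. On held-out rows the same shuffle has little effect, which is the behaviour seen for the random features in the test-set plot above.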