<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Permutation-feature-importance" data-toc-modified-id="Permutation-feature-importance-1">Permutation feature importance</a></span></li><li><span><a href="#Train-a-linear-model-on-the-diabetes-dataset" data-toc-modified-id="Train-a-linear-model-on-the-diabetes-dataset-2">Train a linear model on the diabetes dataset</a></span></li><li><span><a href="#Permutation-importance-is-model-specific" data-toc-modified-id="Permutation-importance-is-model-specific-3">Permutation importance is model specific</a></span></li></ul></div>

<center><h2>Permutation feature importance</h2></center>

In [1]:
%reset -fs

<center><h2>Train a linear model on the diabetes dataset</h2></center>

In [2]:
from sklearn.datasets        import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(diabetes.data, diabetes.target, random_state=42)

In [3]:
# Refresh your memory about the dataset
print(diabetes.DESCR) 

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level

Note: Each of these 10 feature va

In [4]:
from sklearn.linear_model    import Ridge
from sklearn.metrics         import mean_squared_error

model = Ridge(alpha=1e-2).fit(X_train, y_train)
mse = mean_squared_error(y_val, model.predict(X_val))
print(f"Mean squared error: {mse:,.2f}") 

# Note: We know from domain knowledge that this model has useful predictive power.
# Only compute permutation feature importance for models that predict well;
# the importances of a poor model say little about the data.
# GIGO: Garbage In, Garbage Out!

Mean squared error: 2,836.40
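
Domain knowledge aside, one rough way to check that a model is "useful" is to compare it against a trivial baseline on the same held-out split. The sketch below assumes a mean-predicting `DummyRegressor` is an acceptable baseline; the names `baseline` and `ridge` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(
    diabetes.data, diabetes.target, random_state=42
)

# Baseline (assumption): always predict the training-set mean of the target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
ridge = Ridge(alpha=1e-2).fit(X_train, y_train)

baseline_mse = mean_squared_error(y_val, baseline.predict(X_val))
ridge_mse = mean_squared_error(y_val, ridge.predict(X_val))
ridge_r2 = r2_score(y_val, ridge.predict(X_val))

print(f"Baseline MSE: {baseline_mse:,.2f}")
print(f"Ridge MSE:    {ridge_mse:,.2f}")
print(f"Ridge R^2:    {ridge_r2:.3f}")
```

If the model does not clearly beat the baseline, its permutation importances are unlikely to be meaningful.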


In [5]:
from sklearn.inspection import permutation_importance

# Using a held-out set makes it possible to highlight which features
# contribute the most to the generalization power of the inspected model.
r = permutation_importance(model,
                           X_val, y_val,
                           n_repeats=30,
                           random_state=42)

# Note: Features that are important on the training set but not on the held-out set might cause the model to overfit.
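
To look for features that matter on the training set but not on the held-out set, the importances can be computed on both splits and printed side by side. This is a minimal sketch; the names `ridge`, `r_train`, and `r_held` are illustrative, and how large a gap counts as a warning sign is left to judgment.

```python
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(
    diabetes.data, diabetes.target, random_state=42
)
ridge = Ridge(alpha=1e-2).fit(X_train, y_train)

# Same model, same settings; only the data used for scoring differs.
r_train = permutation_importance(ridge, X_train, y_train,
                                 n_repeats=30, random_state=42)
r_held = permutation_importance(ridge, X_val, y_val,
                                n_repeats=30, random_state=42)

print(f"{'feature':<8}{'train':>8}{'held':>8}")
for name, tr, he in zip(diabetes.feature_names,
                        r_train.importances_mean,
                        r_held.importances_mean):
    print(f"{name:<8}{tr:>8.3f}{he:>8.3f}")
```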

In [6]:
# TODO: Find the most important feature according to permutation_importance
# 1. Sort features by mean value
# 2. Print feature names, mean value, and std value 





In [7]:
# Solutions

for i in r.importances_mean.argsort()[::-1]:
    print(f"{diabetes.feature_names[i]:<8}"
          f"{r.importances_mean[i]:.3f}"
          f" ± {r.importances_std[i]:.3f}")

s5      0.276 ± 0.051
bmi     0.227 ± 0.052
bp      0.070 ± 0.026
s1      0.067 ± 0.046
sex     0.066 ± 0.023
s4      0.024 ± 0.017
s2      0.023 ± 0.012
s6      0.004 ± 0.002
age     -0.004 ± 0.005
s3      -0.005 ± 0.010


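The `n_repeats` parameter sets how many times each feature is randomly shuffled. Every shuffle yields one importance estimate, so the result holds a sample of `n_repeats` importances per feature, summarized by `importances_mean` and `importances_std`.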
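
To make the role of `n_repeats` concrete, here is a hand-rolled sketch of the idea: shuffle one column, re-score the model, and record the drop in score, repeated `n_repeats` times per feature. The helper `manual_permutation_importance` is hypothetical and is not scikit-learn's implementation. It assumes the model's default `score` (R² for regressors) and uses its own random number generator, so the numbers will not match `permutation_importance` exactly.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=30, seed=42):
    """Hypothetical helper: score drop after shuffling one column at a time."""
    rng = np.random.default_rng(seed)
    baseline_score = model.score(X, y)  # R^2 for scikit-learn regressors
    importances = np.empty((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling breaks the link between this column and the target.
            X_shuffled[:, col] = rng.permutation(X[:, col])
            importances[col, rep] = baseline_score - model.score(X_shuffled, y)
    return importances


diabetes = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(
    diabetes.data, diabetes.target, random_state=42
)
ridge = Ridge(alpha=1e-2).fit(X_train, y_train)

manual = manual_permutation_importance(ridge, X_val, y_val, n_repeats=30)
manual_mean = manual.mean(axis=1)
manual_std = manual.std(axis=1)

for i in manual_mean.argsort()[::-1]:
    print(f"{diabetes.feature_names[i]:<8}"
          f"{manual_mean[i]:.3f} ± {manual_std[i]:.3f}")
```

Each row of `manual` is the sample of importances for one feature; its mean and standard deviation play the same role as `importances_mean` and `importances_std`.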

In [8]:
# TODO: Increase n_repeats to 100 and repeat the analysis
# What stays the same?
# What changes?

In [9]:
# Solutions
r = permutation_importance(model, 
                           X_val, y_val, 
                           n_repeats=100,
                           random_state=42)
for i in r.importances_mean.argsort()[::-1]:
    print(f"{diabetes.feature_names[i]:<8}"
          f"{r.importances_mean[i]:.3f}"
          f" ± {r.importances_std[i]:.3f}")

s5      0.266 ± 0.052
bmi     0.225 ± 0.053
bp      0.069 ± 0.032
sex     0.066 ± 0.020
s1      0.061 ± 0.041
s4      0.025 ± 0.016
s2      0.024 ± 0.011
s6      0.004 ± 0.003
age     -0.005 ± 0.005
s3      -0.006 ± 0.010


<center><h2>Permutation importance is model specific</h2></center>

Permutation importance does not reflect the intrinsic predictive value of a feature by itself, but how important that feature is for a particular model.
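
A small synthetic illustration of this point, under stated assumptions: the toy target below depends linearly on `x0` and quadratically on `x1`, so `x1` is genuinely predictive, yet a linear model cannot use it. The data, the variable names, and the choice of models are all made up for this sketch.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Toy data (assumption): y = x0 + 3 * x1^2 + small noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(4000, 2))
y_toy = X_toy[:, 0] + 3 * X_toy[:, 1] ** 2 + rng.normal(scale=0.1, size=4000)

Xt_train, Xt_val, yt_train, yt_val = train_test_split(
    X_toy, y_toy, random_state=0
)

toy_importances = {}
for name, estimator in [("Ridge", Ridge(alpha=1e-2)),
                        ("k-NN", KNeighborsRegressor())]:
    estimator.fit(Xt_train, yt_train)
    result = permutation_importance(estimator, Xt_val, yt_val,
                                    n_repeats=10, random_state=0)
    toy_importances[name] = result.importances_mean
    print(f"{name:<6} x0: {result.importances_mean[0]:.3f}  "
          f"x1: {result.importances_mean[1]:.3f}")
```

The linear model should assign `x1` an importance near zero, while k-NN should rank it highly, even though the data are identical.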

In [10]:
# TODO: Repeat the analysis with k-nearest neighbors (k-NN) Regressor
# Which features have similar importance?
# Which features have dissimilar importance?
# For each model, which features would you report as significant? (Report by full name)





In [11]:
# Solutions

from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor().fit(X_train, y_train)
mse = mean_squared_error(y_val, model.predict(X_val))
print(f"Mean squared error: {mse:,.2f}") 

# Using a held-out set makes it possible to highlight which features
# contribute the most to the generalization power of the inspected model.
r = permutation_importance(model,
                           X_val, y_val,
                           n_repeats=30,
                           random_state=42)

for i in r.importances_mean.argsort()[::-1]:
    print(f"{diabetes.feature_names[i]:<8}"
          f"{r.importances_mean[i]:.3f}"
          f" ± {r.importances_std[i]:.3f}")
    
    
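# There is evidence that ltg (s5), body mass index (bmi), and average blood pressure (bp)
# are significant: for each, the mean importance exceeds two standard deviations.
# Note: in the dataset description above, s5 is ltg; thyroid stimulating hormone is listed under s4.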

Mean squared error: 3,060.17
s5      0.217 ± 0.051
bmi     0.184 ± 0.055
bp      0.100 ± 0.041
s3      0.060 ± 0.043
age     0.051 ± 0.034
sex     0.046 ± 0.038
s6      0.027 ± 0.040
s4      0.025 ± 0.040
s2      -0.001 ± 0.029
s1      -0.007 ± 0.028
