### `shaprank` - An application to a _regression_ problem
In this notebook we run through a feature-ranking example based on `scikit-learn`'s `diabetes` dataset. We rely on `catboost` to grow a simple tree-based model, trained with default hyper-parameters on all raw input features, and then use this model to produce the (Tree)SHAP values consumed by `shaprank`.

To keep the example compact and focused on the main points, we work with the full dataset only and make no train/test split.

In [None]:
%load_ext autoreload
%autoreload 2

Import the required modules:
- `shaprank`: ranking engine
- `shaprank.addons.explain`: an optional module with utilities to generate (Tree)SHAP values given a model and the input data
- `examples`: the helper module supporting this tutorial notebook.

In [None]:
import logging

logging.getLogger().setLevel(logging.INFO)

import shaprank
import shaprank.addons.explain

import examples

Load the `diabetes` dataset (part of `scikit-learn`) into the frame `df`. The variables `c_inputs` and `c_output` are, respectively, the list of input features' names and the name of the regression target.

In [None]:
df, c_inputs, c_output = examples.load_dataset_diabetes()

Fit a `catboost` model on the raw data using default hyper-parameters, then generate per-example (Tree)SHAP values.

In [None]:
cb_model = examples.fit_catboost_regressor(df, c_inputs, c_output)

# concatenate the target column to the frame of SHAP values using `c_keep`
logging.info("Evaluating the SHAP values; find a few examples below.")
df_shap = shaprank.addons.explain.catboost.eval_shap_values(
    cb_model, df, c_keep=[c_output], prefix=""
)
df_shap.head(3)

### Greedy-search based feature ranking

Rank the input features using a "greedy search" algorithm that iteratively selects those features that provide the least contribution to a given optimization objective. Below, we inspect the results for the `mae` and `rmse` metrics, in that order.

In [None]:
result = shaprank.rank_regressor_features(
    df_shap, c_inputs, c_output, eval_metric="mae", verbose=True
)

In [None]:
result = shaprank.rank_regressor_features(
    df_shap, c_inputs, c_output, eval_metric="rmse", verbose=True
)
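To make the idea of a greedy search over SHAP values more concrete, here is a minimal sketch of one possible forward variant. It is not `shaprank`'s implementation, which may differ in both direction and details. The function `greedy_rank`, the toy data, and the choice to approximate the target as its mean plus the centered SHAP columns of the selected features are all assumptions made for illustration.

```python
import numpy as np


def mae(y, pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - pred)))


def rmse(y, pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - pred) ** 2)))


def greedy_rank(shap_values, y, names, metric):
    """Hypothetical forward greedy search (illustration only).

    At each step, add the feature whose centered SHAP column yields the
    lowest value of `metric` when added to the running approximation.
    """
    # Center each SHAP column so that adding it keeps the mean prediction fixed.
    centered = shap_values - shap_values.mean(axis=0)
    approx = np.full(len(y), y.mean())  # start from the constant predictor
    remaining = list(range(shap_values.shape[1]))
    ranking = []
    while remaining:
        # Score every candidate feature against the current approximation.
        scores = {j: metric(y, approx + centered[:, j]) for j in remaining}
        best = min(scores, key=scores.get)
        approx = approx + centered[:, best]
        remaining.remove(best)
        ranking.append((names[best], scores[best]))
    return ranking


# Toy SHAP-like contributions with clearly different magnitudes (made-up data).
rng = np.random.default_rng(0)
names = ["a", "b", "c"]
phi = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])
y = 10.0 + phi.sum(axis=1)

for metric in (mae, rmse):
    print(metric.__name__, [name for name, _ in greedy_rank(phi, y, names, metric)])
```

On this toy data the contributions differ by an order of magnitude, so both metrics select `a`, then `b`, then `c`. On real data, where contributions are closer in size, the two metrics may disagree.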

The `l1` ranking below, computed as the mean absolute deviation of each feature's SHAP values from their mean, coincides with the "average absolute SHAP value" ranking produced by `shap`'s [Global Bar Plot](https://shap.readthedocs.io/en/latest/example_notebooks/api_examples/plots/bar.html#Global-bar-plot).

In [None]:
(df_shap[c_inputs] - df_shap[c_inputs].mean(axis=0)).abs().mean(axis=0).sort_values(
    ascending=False
)

**Note**: The `rmse` metric favours input `bmi` over `s5`, whereas the `mae` metric and `shap`'s bar plot would rather pick `s5` over `bmi`.

Which of these two inputs supports the better approximation of the target? Albeit marginally, `bmi` should be preferred over `s5` when the loss takes an MSE-like form: a model trained on `bmi` alone reaches a training loss that is on average `2%` smaller than that of a model trained on `s5` alone.

In [None]:
cb_model_bmi = examples.fit_catboost_regressor(df, ["bmi"], c_output)
cb_model_s5 = examples.fit_catboost_regressor(df, ["s5"], c_output)
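The training losses of the two single-feature models can then be compared directly. The sketch below defines two small helpers, `rmse` and `relative_gap` (both hypothetical names, not part of `shaprank` or `examples`), and demonstrates them on toy arrays. The commented lines show how they might be applied to the models above, assuming `examples.fit_catboost_regressor` returns a fitted `catboost` regressor exposing `predict`.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two array-likes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def relative_gap(loss_a, loss_b):
    """Fraction by which loss_a is smaller than loss_b."""
    return (loss_b - loss_a) / loss_b


# Toy stand-ins for the target and the two models' predictions (made-up data).
y = np.array([1.0, 2.0, 3.0, 4.0])
pred_a = np.array([1.0, 2.0, 3.0, 5.0])  # one error of size 1
pred_b = np.array([2.0, 3.0, 4.0, 5.0])  # every prediction off by 1

print(rmse(y, pred_a), rmse(y, pred_b))
print(relative_gap(rmse(y, pred_a), rmse(y, pred_b)))

# In the notebook (assumption: the helper returns a fitted CatBoost regressor):
# rmse_bmi = rmse(df[c_output], cb_model_bmi.predict(df[["bmi"]]))
# rmse_s5 = rmse(df[c_output], cb_model_s5.predict(df[["s5"]]))
# relative_gap(rmse_bmi, rmse_s5)
```

Because no split is taken, these are training losses only. Gradient boosting involves randomness, so averaging over several fits with different seeds gives a more stable comparison.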