<!-- # Copyright (c) 2025 takotime808 -->
# Uncertainty Metrics for Multioutput Regressors (Flexible API)

This notebook demonstrates `get_uq_performance_metrics_flexible`, a type-annotated function that extracts per-output uncertainty metrics from a multi-output regressor using [uncertainty-toolbox](https://uncertainty-toolbox.readthedocs.io/en/latest/). It is designed to work with several uncertainty APIs, including scikit-learn estimators, Gaussian processes (GPs), and custom models.

---

**Install requirements**

```sh
pip install numpy pandas matplotlib scikit-learn uncertainty-toolbox
```

----
**Imports**

In [1]:
import numpy as np
# import pandas as pd
import matplotlib.pyplot as plt
from sklearn.multioutput import MultiOutputRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import train_test_split
# import uncertainty_toolbox as uct

from multioutreg.performance.metrics_generalized_api import get_uq_performance_metrics_flexible





**Create toy multi-output data and fit model**

In [2]:
rng = np.random.RandomState(42)
X = rng.rand(300, 5)
Y = np.dot(X, rng.rand(5, 3)) + rng.randn(300, 3) * 0.1

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

base_gp = GaussianProcessRegressor(random_state=0)
multi_gp = MultiOutputRegressor(base_gp)
multi_gp.fit(X_train, Y_train)


**Compute and plot metrics**

 (1/n) Calculating accuracy metrics
 (2/n) Calculating average calibration metrics
 (3/n) Calculating adversarial group calibration metrics
  [1/2] for mean absolute calibration error
Measuring adversarial group calibration by spanning group size between 0.0 and 1.0, in 10 intervals


100%|██████████| 10/10 [00:00<00:00, 32.27it/s]


  [2/2] for root mean squared calibration error
Measuring adversarial group calibration by spanning group size between 0.0 and 1.0, in 10 intervals


100%|██████████| 10/10 [00:00<00:00, 31.85it/s]


 (4/n) Calculating sharpness metrics
 (n/n) Calculating proper scoring rule metrics
**Finished Calculating All Metrics**


  MAE           0.573
  RMSE          0.819
  MDAE          0.382
  MARPD         80.731
  R2            -0.523
  Correlation   -0.016
  Root-mean-squared Calibration Error   0.118
  Mean-absolute Calibration Error       0.088
  Miscalibration Area                   0.089
  Mean-absolute Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.251
     Group Size: 0.56 -- Calibration Error: 0.135
     Group Size: 1.00 -- Calibration Error: 0.088
  Root-mean-squared Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.244
     Group Size: 0.56 -- Calibration Error: 0.159
     Group Size: 1.00 -- Calibration Error: 0.118
  Sharpness   0.594
  Negative-log-likelihood   16.878
  CRPS                      0.465
  Check Score               0.234
  Interval Score            3.126


  MAE           0.586
  RMSE          0.790
  MDAE          0.462
  MARPD         92.006
  R2            -0.757
  Correlation   -0.193
  Root-mean-squared Calibration Error   0.106
  Mean-absolute Calibration Error       0.084
  Miscalibration Area                   0.085
  Mean-absolute Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.228
     Group Size: 0.56 -- Calibration Error: 0.116
     Group Size: 1.00 -- Calibration Error: 0.084
  Root-mean-squared Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.268
     Group Size: 0.56 -- Calibration Error: 0.147
     Group Size: 1.00 -- Calibration Error: 0.106
  Sharpness   0.625
  Negative-log-likelihood   3637.057
  CRPS                      0.464
  Check Score               0.234
  Interval Score            2.914


  MAE           0.522
  RMSE          0.689
  MDAE          0.389
  MARPD         80.133
  R2            -0.592
  Correlation   -0.025
  Root-mean-squared Calibration Error   0.143
  Mean-absolute Calibration Error       0.113
  Miscalibration Area                   0.114
  Mean-absolute Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.263
     Group Size: 0.56 -- Calibration Error: 0.161
     Group Size: 1.00 -- Calibration Error: 0.113
  Root-mean-squared Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.315
     Group Size: 0.56 -- Calibration Error: 0.183
     Group Size: 1.00 -- Calibration Error: 0.143
  Sharpness   0.544
  Negative-log-likelihood   116.770
  CRPS                      0.425
  Check Score               0.214
  Interval Score            2.890
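
In the output above, the negative log-likelihood ranges from 16.878 to 3637.057 across outputs, while CRPS stays between 0.425 and 0.465. One plausible reason for such a spread is that a Gaussian NLL is dominated by any sample whose predicted standard deviation is very small relative to its error. The sketch below illustrates this with plain NumPy. It assumes the NLL is the per-sample average of the Gaussian negative log-density and that sharpness is the root-mean-square of the predicted standard deviations; the helper names `gaussian_nll` and `sharpness` are illustrative and are not taken from uncertainty-toolbox.

```python
import numpy as np


def gaussian_nll(y_true, y_pred, y_std):
    """Mean negative log-likelihood of y_true under N(y_pred, y_std**2)."""
    var = y_std ** 2
    per_sample = 0.5 * np.log(2 * np.pi * var) + (y_true - y_pred) ** 2 / (2 * var)
    return float(np.mean(per_sample))


def sharpness(y_std):
    """Root-mean-square of the predicted standard deviations."""
    return float(np.sqrt(np.mean(y_std ** 2)))


# Made-up values for illustration only
y_true_demo = np.array([0.0, 1.0, 2.0])
y_pred_demo = np.array([0.1, 0.9, 2.5])
std_plausible = np.array([0.3, 0.3, 0.3])
std_overconfident = np.array([0.3, 0.3, 0.01])  # one near-zero std on the worst error

nll_plausible = gaussian_nll(y_true_demo, y_pred_demo, std_plausible)
nll_overconfident = gaussian_nll(y_true_demo, y_pred_demo, std_overconfident)

print(f"NLL with plausible stds: {nll_plausible:.3f}")
print(f"NLL with one tiny std:   {nll_overconfident:.3f}")
print(f"Sharpness:               {sharpness(std_plausible):.3f}")
```

A single overconfident prediction is enough to inflate the average by orders of magnitude, so a large NLL next to a moderate CRPS is worth checking against the smallest predicted standard deviations.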
                                      

In [10]:
df_metrics

| | accuracy | avg_calibration | adv_group_calibration | sharpness | scoring_rule | output |
|---|---|---|---|---|---|---|
| 0 | {'mae': 0.5732925138850177, 'rmse': 0.81875989... | {'rms_cal': 0.11762337888047286, 'ma_cal': 0.0... | {'ma_adv_group_cal': {'group_sizes': [0.0, 0.1... | {'sharp': 0.5944722429835237} | {'nll': 16.878334488351395, 'crps': 0.46466063... | 0 |
| 1 | {'mae': 0.585920437227464, 'rmse': 0.789957837... | {'rms_cal': 0.1064111364770153, 'ma_cal': 0.08... | {'ma_adv_group_cal': {'group_sizes': [0.0, 0.1... | {'sharp': 0.6247389590516724} | {'nll': 3637.057356307649, 'crps': 0.464484585... | 1 |
| 2 | {'mae': 0.5219086048909067, 'rmse': 0.68891833... | {'rms_cal': 0.14304345834896118, 'ma_cal': 0.1... | {'ma_adv_group_cal': {'group_sizes': [0.0, 0.1... | {'sharp': 0.5442768176358086} | {'nll': 116.77034119817075, 'crps': 0.42503986... | 2 |
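
Each cell in the table above holds a whole dictionary rather than a number, so the frame cannot be sorted, aggregated, or plotted by metric as it stands. A minimal sketch of one way to flatten such records with `pandas.json_normalize` follows. The records are hypothetical stand-ins with made-up values that only mimic the nesting shown above.

```python
import pandas as pd

# Hypothetical records shaped like the rows above (values are made up)
demo_records = [
    {"accuracy": {"mae": 0.5, "rmse": 0.7}, "sharpness": {"sharp": 0.6}, "output": 0},
    {"accuracy": {"mae": 0.4, "rmse": 0.6}, "sharpness": {"sharp": 0.5}, "output": 1},
]

# Nested keys become dotted column names such as 'accuracy.mae'
flat_demo = pd.json_normalize(demo_records).set_index("output")

print(flat_demo)
```

After flattening, every metric is an ordinary numeric column, so calls such as `flat_demo["accuracy.mae"].idxmin()` or `flat_demo.plot.bar()` work directly.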


In [None]:
metrics_df, overall_metrics = get_uq_performance_metrics_flexible(
    model=multi_gp,
    X_test=X_test,
    y_test=Y_test,
    uncertainty_method="return_std",  # options: 'return_std', 'return_cov', 'predict_std', 'predict_var'
)

RuntimeError: Could not extract uncertainty from model. Please supply y_pred_std manually or specify 'uncertainty_method'.
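
The error message suggests supplying the predicted standard deviations manually. A likely cause of the failure is that `MultiOutputRegressor.predict` does not accept `return_std`, even though each wrapped `GaussianProcessRegressor` does. The sketch below shows one way to collect per-output means and standard deviations from the fitted sub-estimators in `estimators_`. The helper name `predict_mean_and_std` and the small demo data are assumptions for illustration; how the resulting arrays should be passed to `get_uq_performance_metrics_flexible` depends on its signature, which is not shown in this notebook.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.multioutput import MultiOutputRegressor


def predict_mean_and_std(model, X):
    """Return (mean, std) arrays of shape (n_samples, n_outputs).

    Assumes every fitted sub-estimator supports predict(X, return_std=True),
    as GaussianProcessRegressor does.
    """
    means, stds = [], []
    for est in model.estimators_:
        mean, std = est.predict(X, return_std=True)
        means.append(mean)
        stds.append(std)
    return np.column_stack(means), np.column_stack(stds)


# Small self-contained demo with the same structure as the toy data above
demo_rng = np.random.RandomState(42)
X_demo = demo_rng.rand(60, 5)
Y_demo = X_demo @ demo_rng.rand(5, 3) + demo_rng.randn(60, 3) * 0.1

demo_model = MultiOutputRegressor(GaussianProcessRegressor(random_state=0))
demo_model.fit(X_demo[:45], Y_demo[:45])

y_pred_demo, y_std_demo = predict_mean_and_std(demo_model, X_demo[45:])
print(y_pred_demo.shape, y_std_demo.shape)
```

Both arrays have one column per output, which is the layout the per-output loop over `uct.get_all_metrics` later in this notebook expects.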

In [None]:
metrics_df, overall_metrics = get_uq_performance_metrics_flexible(multi_gp, X_test, Y_test)

print("Available columns:", metrics_df.columns)
# Plot only the metrics that are actually present as columns
metrics_to_plot = [m for m in ['rmse', 'mae', 'nll', 'miscal_area'] if m in metrics_df.columns]
if not metrics_to_plot:
    print("No matching metrics found in metrics_df. Available columns:", metrics_df.columns)
else:
    ax = metrics_df[metrics_to_plot].plot.bar(figsize=(10, 6))
    ax.set_xticklabels([f"Output {i}" for i in metrics_df['output']])
    plt.xlabel('Output')
    plt.title('Uncertainty Toolbox Metrics per Output')
    plt.legend(title="Metric")
    plt.tight_layout()
    plt.show()


RuntimeError: Could not extract uncertainty from model. Please supply y_pred_std manually or specify 'uncertainty_method'.

----
**Make a performance metrics DataFrame**

In [None]:
import numpy as np
import pandas as pd
# from uncertainty_toolbox.metrics import performance_metrics
import uncertainty_toolbox as uct
# Example predictions
# y_true, y_pred, y_std: (n_samples, n_outputs)
# These would come from your MultiOutputRegressor model
n_samples = 100
n_outputs = 3
y_true = np.random.rand(n_samples, n_outputs)
y_pred = np.random.rand(n_samples, n_outputs)
y_std  = np.abs(np.random.randn(n_samples, n_outputs))  # predicted stddevs, positive

# For each output, compute the metrics
metrics_list = []
for i in range(y_true.shape[1]):
    metrics = uct.get_all_metrics(
        y_true[:, i],
        y_pred[:, i],
        y_std[:, i]
    )
    metrics['output'] = i
    metrics_list.append(metrics)

# Convert to DataFrame for easy viewing
df_metrics = pd.DataFrame(metrics_list)
print(df_metrics)

 (1/n) Calculating accuracy metrics
 (2/n) Calculating average calibration metrics
 (3/n) Calculating adversarial group calibration metrics
  [1/2] for mean absolute calibration error
Measuring adversarial group calibration by spanning group size between 0.0 and 1.0, in 10 intervals


100%|██████████| 10/10 [00:00<00:00, 32.67it/s]


  [2/2] for root mean squared calibration error
Measuring adversarial group calibration by spanning group size between 0.0 and 1.0, in 10 intervals


100%|██████████| 10/10 [00:00<00:00, 32.23it/s]


 (4/n) Calculating sharpness metrics
 (n/n) Calculating proper scoring rule metrics
**Finished Calculating All Metrics**


  MAE           0.566
  RMSE          0.744
  MDAE          0.430
  MARPD         85.364
  R2            -0.686
  Correlation   -0.133
  Root-mean-squared Calibration Error   0.176
  Mean-absolute Calibration Error       0.144
  Miscalibration Area                   0.145
  Mean-absolute Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.285
     Group Size: 0.56 -- Calibration Error: 0.177
     Group Size: 1.00 -- Calibration Error: 0.144
  Root-mean-squared Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.323
     Group Size: 0.56 -- Calibration Error: 0.220
     Group Size: 1.00 -- Calibration Error: 0.176
  Sharpness   0.568
  Negative-log-likelihood   70.107
  CRPS                      0.443
  Check Score               0.223
  Interval Score            2.720


  MAE           0.502
  RMSE          0.656
  MDAE          0.389
  MARPD         84.867
  R2            -0.613
  Correlation   -0.070
  Root-mean-squared Calibration Error   0.104
  Mean-absolute Calibration Error       0.081
  Miscalibration Area                   0.082
  Mean-absolute Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.219
     Group Size: 0.56 -- Calibration Error: 0.121
     Group Size: 1.00 -- Calibration Error: 0.081
  Root-mean-squared Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.258
     Group Size: 0.56 -- Calibration Error: 0.150
     Group Size: 1.00 -- Calibration Error: 0.104
  Sharpness   0.584
  Negative-log-likelihood   35.909
  CRPS                      0.393
  Check Score               0.198
  Interval Score            2.406


  MAE           0.456
  RMSE          0.625
  MDAE          0.344
  MARPD         74.169
  R2            -0.536
  Correlation   0.047
  Root-mean-squared Calibration Error   0.114
  Mean-absolute Calibration Error       0.083
  Miscalibration Area                   0.084
  Mean-absolute Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.230
     Group Size: 0.56 -- Calibration Error: 0.117
     Group Size: 1.00 -- Calibration Error: 0.083
  Root-mean-squared Adversarial Group Calibration Error
     Group Size: 0.11 -- Calibration Error: 0.265
     Group Size: 0.56 -- Calibration Error: 0.153
     Group Size: 1.00 -- Calibration Error: 0.114
  Sharpness   0.541
  Negative-log-likelihood   51.825
  CRPS                      0.364
  Check Score               0.183
  Interval Score            2.309
                                        

In [13]:
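import matplotlib.pyplot as plt
import pandas as pd

# uct.get_all_metrics returns nested dicts, so every column of df_metrics
# (except 'output') holds dict objects and cannot be plotted directly.
# Flatten them into one column per metric, e.g. 'accuracy.mae'.
flat_metrics = pd.json_normalize(df_metrics.drop(columns='output').to_dict(orient='records'))

# Plot only scalar numeric metrics (array-valued entries have object dtype)
metrics_to_plot = flat_metrics.select_dtypes(include='number').columns

ax = flat_metrics[metrics_to_plot].plot.bar(figsize=(10, 6))
ax.set_xticklabels([f"Output {i}" for i in df_metrics['output']])
plt.xlabel('Output')
plt.title('Uncertainty Toolbox Metrics per Output')
plt.legend(title="Metric")
plt.tight_layout()
plt.show()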