## time-to-event modelling and survival prediction

**Set-up instructions:** On binder, this should run out-of-the-box.

To run locally instead, ensure that `skpro` with basic dependency requirements is installed in your python environment.

---

`skpro` provides a unified interface to time-to-event prediction models, also known as survival prediction models.

**Time-to-event prediction** is a form of probabilistic regression where **labels can be "censored"**, i.e., of the form "time is t or later" instead of exact observations.

**Section 1** provides an overview of the basic **time-to-event prediction workflows** supported by `skpro`.

**Section 2** showcases **performance metrics and benchmarking** for time-to-event prediction with censored data.

**Section 3** discusses **advanced composition patterns**, including various ways to leverage `sklearn` regressors for time-to-event prediction with censored data.

**Section 4** gives an introduction to how to write **custom estimators** compliant with the `skpro` interface.

In [None]:
# hide warnings
import warnings

warnings.filterwarnings("ignore")

## 1. Basic survival prediction interface <a class="anchor" id="chapter1"></a>

In this section:

* explanation of censored time-to-event data
* `skpro` time-to-event/survival prediction interface
* metrics, evaluation

### 1.1 data representation, censoring

Survival prediction or time-to-event prediction can be seen as a generalization of probabilistic supervised learning.


Each sample consists of:

* a feature vector, row of a data frame
* a label, which can be an exact time of occurrence, or a statement about "time was t or later"

In [None]:
# simulated toy dataset, lung cancer survival times
import numpy as np

# demographics - age and smoker yes/no
age = np.random.uniform(low=20, high=100, size=50)
smoker = np.random.binomial(1, 0.3, size=50)

# actual survival time
scale = 200 / (0.5 * age + 30 * smoker)
survival = scale * np.random.weibull(1, size=50)

# patients are observed only for 5 years
# if they survive 5 years, we know they survived 5 years, but not exact time of death
censored = survival > 5
observation = np.minimum(survival, 5)

`skpro` represents this information in an `sklearn`-like interface:

In [None]:
import pandas as pd

# features
X = pd.DataFrame({"age": age, "smoker": smoker})

# time of survival or censoring
y = pd.DataFrame({"time": observation})

# indicator whether event was observed or censored
# censored = 1/True, observed = 0/False
# variable names should be the same as for y
C = pd.DataFrame({"time": censored})

In [None]:
X.head()

In [None]:
y.head()

In [None]:
C.head()
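To see why the censoring indicator matters, here is a minimal, self-contained sketch with hand-picked numbers. The values and the 5-year horizon are illustrative assumptions and not part of the simulated data above. It shows that averaging the observed times alone underestimates the true mean survival time.

```python
import numpy as np
import pandas as pd

# hypothetical true survival times in years, chosen by hand for illustration
true_time = np.array([1.0, 7.5, 3.2, 12.0, 4.9])
horizon = 5.0  # assumed end of the observation period

# right-censoring: for long survivors we only learn "time is 5 or later"
is_censored = true_time > horizon
observed_time = np.minimum(true_time, horizon)

# same layout as y and C above: identical column names, one row per sample
y_toy = pd.DataFrame({"time": observed_time})
C_toy = pd.DataFrame({"time": is_censored})

# ignoring the censoring indicator biases the mean downwards
naive_mean = y_toy["time"].mean()
true_mean = true_time.mean()
print(f"naive mean of observed times: {naive_mean:.2f}")
print(f"true mean survival time: {true_mean:.2f}")
```

A survival model that receives `C` can account for the fact that the censored rows are lower bounds rather than exact event times.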

### 1.2 basic survival prediction workflow

survival prediction is the task:

* given censored time-to-event labels and features, `X`, `y`, `C`
* learn a model that can predict `y` if it were uncensored, i.e., the true event time
* the prediction should take the form of a survival function or probability distribution

`skpro` survival predictors extend the interface of probabilistic regressors:

* `fit(X, y, C=None)`, with `X`, `y`, `C` as above; if `C=None`, all observations are uncensored
* `predict(X_test)` for mean survival time predictions
* `predict_proba(X_test)` for distributional predictions

Other prediction methods - `predict_interval`, `predict_quantiles`, `predict_var` - also generalize the same way.

Because `C` is optional, and means "uncensored" if not passed, all survival prediction models can be used as supervised probabilistic regressors.

Using probabilistic regressors as survival models is similarly possible, to be revisited later.

Basic deployment workflow:

In [None]:
from sklearn.model_selection import train_test_split

from skpro.survival.coxph import CoxPH

# step 1: data specification
# X, y, C, as above
X_train, X_new, y_train, _, C_train, _ = train_test_split(X, y, C)

# step 2: specifying the regressor
# example - Cox proportional hazards model from statsmodels
surv_model_cox = CoxPH()

# step 3: fitting the model to training data
surv_model_cox.fit(X_train, y_train, C_train)

# step 4: predicting labels on new data

# full distribution prediction
y_pred_proba_cox = surv_model_cox.predict_proba(X_new)

In [None]:
# mean predicted survival time
y_pred_proba_cox.mean().head()

In [None]:
# plot of survival functions
y_pred_proba_cox.iloc[range(5)].plot("surv")

In [None]:
# plotting survival functions in one figure, smokers in red
from matplotlib.pyplot import subplots

_, ax = subplots()

for i in range(len(y_pred_proba_cox)):
    # smoker status of the i-th test row, not of the i-th original row
    color = ["b", "r"][X_new["smoker"].iloc[i]]
    ax = y_pred_proba_cox.iat[i, 0].plot("surv", ax=ax, color=color)

### 1.3 survival prediction with parametric predictive distribution

example: using an accelerated failure time model with Weibull hazard

same workflow, only using different model:

In [None]:
from sklearn.model_selection import train_test_split

from skpro.survival.aft import AFTWeibull

# step 1: data specification
# X, y, C, as above
X_train, X_new, y_train, _, C_train, _ = train_test_split(X, y, C)

# step 2: specifying the regressor
# example - accelerated failure time model with Weibull hazard
surv_model_aft = AFTWeibull()

# step 3: fitting the model to training data
surv_model_aft.fit(X_train, y_train, C_train)

# step 4: predicting labels on new data

# full distribution prediction
y_pred_proba_aft = surv_model_aft.predict_proba(X_new)

In [None]:
# plotting survival functions in one figure, smokers in red
from matplotlib.pyplot import subplots

_, ax = subplots()

for i in range(len(y_pred_proba_aft)):
    # smoker status of the i-th test row, not of the i-th original row
    color = ["b", "r"][X_new["smoker"].iloc[i]]
    ax = y_pred_proba_aft.iat[i, 0].plot("surv", ax=ax, color=color, x_bounds=[0, 5])

hazard functions can be plotted the same way:

In [None]:
# plot of hazard functions
y_pred_proba_aft.iloc[range(5)].plot("haz", x_bounds=[0, 5])

In [None]:
# plotting hazard functions in one figure, smokers in red
from matplotlib.pyplot import subplots

_, ax = subplots()

for i in range(len(y_pred_proba_aft)):
    # smoker status of the i-th test row, not of the i-th original row
    color = ["b", "r"][X_new["smoker"].iloc[i]]
    ax = y_pred_proba_aft.iat[i, 0].plot("haz", ax=ax, color=color, x_bounds=[0, 5])

In [None]:
# estimated scale parameter
y_pred_proba_aft.to_df().head()

In [None]:
# actual Weibull scale parameter to compare
# unknown in a real scenario, but we know since we simulated the data
# select the rows of the test set, since X_new is a shuffled subset of X
scale[X_new.index[:5]]
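For reference, the Weibull distribution with shape $k$ and scale $\lambda$ has survival function $S(t) = \exp(-(t/\lambda)^k)$, hazard $h(t) = (k/\lambda)(t/\lambda)^{k-1}$, and mean $\lambda\,\Gamma(1 + 1/k)$. The simulation above uses shape $k = 1$, so the hazard is constant and the mean survival time equals the scale. A minimal sketch of these formulas in plain `numpy` follows; the function names are illustrative and not part of `skpro`.

```python
import math

import numpy as np


def weibull_surv(t, scale, shape=1.0):
    """Survival function S(t) = P(T > t) of a Weibull distribution."""
    return np.exp(-((np.asarray(t, dtype=float) / scale) ** shape))


def weibull_haz(t, scale, shape=1.0):
    """Hazard function h(t) = pdf(t) / S(t) of a Weibull distribution."""
    t = np.asarray(t, dtype=float)
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_mean(scale, shape=1.0):
    """Mean of a Weibull distribution."""
    return scale * math.gamma(1 + 1 / shape)


t_grid = np.array([0.5, 1.0, 2.0, 5.0])
surv_vals = weibull_surv(t_grid, scale=2.0)
haz_vals = weibull_haz(t_grid, scale=2.0)
mean_val = weibull_mean(scale=2.0)
print(surv_vals, haz_vals, mean_val)
```

With shape 1, a smaller scale means a higher constant hazard, which is why smokers and older patients in the simulated data have shorter survival times.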

### 1.4 simple evaluation workflow for time-to-event predictions

for simple evaluation:

1. split the data into train/test set - including the censoring variable
2. make predictions of either type for test features
3. compute metric on test set, comparing test predictions to held out test observations,
  including censoring indicator

Note:

* metrics will compare probabilistic prediction to tabular ground truth and
  censoring indicator
* the metric needs to be of a compatible type, e.g., a metric for proba predictions

In [None]:
from sklearn.model_selection import train_test_split

from skpro.metrics import ConcordanceHarrell
from skpro.survival.coxph import CoxPH

# step 1: data specification
X_train, X_test, y_train, y_test, C_train, C_test = train_test_split(X, y, C)

# step 2: specifying the regressor
# example - Cox proportional hazards model from statsmodels
surv_model = CoxPH()

# step 3: fitting the model to training data
surv_model.fit(X_train, y_train, C_train)

# step 4: predicting labels on new data
y_pred_proba = surv_model.predict_proba(X_test)

# step 5: specifying evaluation metric
metric = ConcordanceHarrell()

# step 6: evaluate metric, compare predictions to actuals
metric(y_test, y_pred_proba, C_true=C_test)
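To build intuition for what a concordance index measures, here is a rough sketch of Harrell's C-index in plain `numpy`. It follows the common textbook definition and is not a copy of the `ConcordanceHarrell` implementation in `skpro`, whose tie handling and conventions may differ. The function name `harrell_c` and the toy numbers are assumptions for illustration.

```python
import numpy as np


def harrell_c(time, censored, pred_time):
    """Fraction of comparable pairs that the predictions order correctly.

    A pair (i, j) is comparable if sample i has an observed event
    and its observed time is strictly smaller than that of sample j.
    """
    time = np.asarray(time, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    pred_time = np.asarray(pred_time, dtype=float)

    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if censored[i]:
            continue  # a censored sample cannot be the earlier event
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if pred_time[i] < pred_time[j]:
                    concordant += 1.0
                elif pred_time[i] == pred_time[j]:
                    concordant += 0.5  # ties in prediction count half
    return concordant / comparable


c_index = harrell_c(
    time=[1, 2, 3, 4],
    censored=[False, True, False, False],
    pred_time=[2.2, 2.5, 2.0, 5.0],
)
print(c_index)
```

In this toy example there are 4 comparable pairs, of which 3 are ordered correctly by the predictions. The censored second sample never acts as the "earlier" member of a pair, but it can still be the later one.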

how do we know that the metric is of the right type? Via the `scitype:y_pred` tag

In [None]:
metric.get_tags()
# scitype:y_pred is pred_proba - for proba predictions

how do we find metrics for a prediction type?

In [None]:
from skpro.registry import all_objects

all_objects("metric", as_dataframe=True, return_tags="scitype:y_pred")

extra note: quantile metrics can be applied to interval predictions as well

more details on metrics below

### 1.4 `skpro` objects - `scikit-base` interface, searching for regressors and metrics

* `skpro` objects - `skbase` interface points `get_tags`, `get_params`/`set_params`
* searching estimators and metrics via `all_objects`

### 1.4.1 primer on `skpro` object interface <a class="anchor" id="section1_3_1"></a>

metrics and estimators are first-class citizens in `skpro`, with a `scikit-base` compatible interface

In [None]:
# example object 1: CRPS metric
from skpro.metrics import CRPS

crps_metric = CRPS()

# example object 2: ResidualDouble regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

from skpro.regression.residual import ResidualDouble

reg_mean = LinearRegression()
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, reg_resid)

e.g., all have `get_tags` interface

In [None]:
crps_metric.get_tags()

In [None]:
reg_proba.get_tags()

the tag `object_type` indicates the type of object, e.g., metric or proba regressor

all objects also have the `get_params`/`set_params` interface known from `scikit-learn`

= reading or setting hyper-parameters

`get_params` returns `dict` `{paramname: paramvalue}`; `set_params` writes it

In [None]:
crps_metric.get_params()

composite objects have the nested param interface, keys `componentname__paramname`

In [None]:
# note that reg_proba has components LinearRegression and RandomForestRegressor
# each with their own parameters
reg_proba

so `reg_proba` will have parameters coming from itself and from each component:

In [None]:
reg_proba.get_params()
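The same double-underscore convention is used by `scikit-learn` composites, so it can be tried out without any `skpro` objects. A small sketch using an `sklearn` pipeline; the step names `scaler` and `reg` are arbitrary choices for this example.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_demo = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())])

# nested keys have the form componentname__paramname
params = pipe_demo.get_params()
print(params["reg__fit_intercept"])

# set_params accepts the same nested keys and modifies the components in place
pipe_demo.set_params(reg__fit_intercept=False, scaler__with_mean=False)
print(pipe_demo.get_params()["reg__fit_intercept"])
```

These nested keys are the ones used later when specifying tuning grids.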

further common interface points are `get_config`, `set_config`, and `get_fitted_params` (only fittable estimators)

### 1.4.2 searching for regressors and metrics <a class="anchor" id="section1_3_2"></a>

as first-class citizens, all objects in `skpro` are indexed via the `registry` utility `all_objects`.

To find probabilistic supervised regressors, use `all_objects` with the type `regressor_proba`:

In [None]:
from skpro.registry import all_objects

all_objects("regressor_proba", as_dataframe=True).head()

a full list can also be found in the online API reference.

for metrics, as seen above:

In [None]:
from skpro.registry import all_objects

all_objects("metric", as_dataframe=True, return_tags="scitype:y_pred")

all tags can be printed by the `all_tags` utility:

In [None]:
# all tags applicable to metrics
from skpro.registry import all_tags

all_tags("metric", as_dataframe=True)

In [None]:
# all tags applicable to probabilistic regressors
from skpro.registry import all_tags

all_tags("regressor_proba", as_dataframe=True)

filtering in search can be done with the `filter_tags` argument in `all_objects`, see docstring:

In [None]:
from skpro.registry import all_objects

# "retrieve all genuinely probabilistic loss functions"
all_objects("metric", as_dataframe=True, filter_tags={"scitype:y_pred": "pred_proba"})

## 2. Prediction types, metrics, benchmarking <a class="anchor" id="chapter2"></a>

This section gives more details on:

* different prediction types, including a methodological primer
* the API of metrics to compare probabilistic predictions to non-probabilistic actuals
* utilities for batch benchmarking of estimators and metrics

### 2.1 Probabilistic predictions - methodological primer <a class="anchor" id="section2_1"></a>

**readers familiar with, or less interested in theory, may like to skip section 2.1**

In supervised learning - probabilistic or not:

* we fit an estimator to i.i.d. samples $(X_1, Y_1), \dots, (X_N, Y_N) \sim (X_*, Y_*)$
* and want to predict $y$ given $x$ accurately, for $(x, y) \sim (X_*, Y_*)$

Let $y$ be the (true) value, for an observed feature $x$

(we consider $y$ a random variable)

| Name | param | prediction/estimate of | `skpro` |
| ---- | ----- | ---------------------- | -------- |
| point prediction | | conditional expectation $\mathbb{E}[y\|x]$ | `predict` |
| variance prediction | | conditional variance $Var[y\|x]$ | `predict_var` |
| quantile prediction | $\alpha\in (0,1)$ | $\alpha$-quantile of $y\|x$ | `predict_quantiles` |
| interval prediction | $c\in (0,1)$| $[a,b]$ s.t. $P(a\le y \le b\| x) = c$ | `predict_interval` |
| distribution prediction | | the law/distribution of $y\|x$ | `predict_proba` |

##### More formal details & intuition:

let's consider a toy example, the diabetes dataset shipped with `sklearn`

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_new, y_train, _ = train_test_split(X, y)

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from skpro.regression.residual import ResidualDouble

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_new, y_train, _ = train_test_split(X, y)


reg_mean = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean)

reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_new)

* a **"point prediction"** is a prediction/estimate of the conditional expectation $\mathbb{E}[y|x]$.\
 **Intuition**: "out of many repetitions/worlds, this value is the arithmetic average of all observations".

In [None]:
# if y_pred_proba were *true*, here's what many repetitions would look like:

# repeating this line is "one repetition"
y_pred_proba.sample().head()

In [None]:
many_samples = y_pred_proba.sample(100)
many_samples

In [None]:
# "doing many times and taking the mean" -> usual point prediction
mean_prediction = many_samples.groupby(level=1, sort=False).mean()
mean_prediction.head()

In [None]:
# if we did this infinitely many times instead of 100:
y_pred_proba.mean().head()

* a **"variance prediction"** is a prediction/estimate of the conditional variance $Var[y|x]$.\
 **Intuition:** "out of many repetitions/worlds, this value is the average squared distance of the observation to the perfect point prediction".


In [None]:
# same as above - take many samples, and then compute element-wise statistics
var_prediction = many_samples.groupby(level=1, sort=False).var()
var_prediction.head()

In [None]:
# e.g., predict_var should give the same result as an infinitely large sample's variance
y_pred_proba.var().head()

* a **"quantile prediction"**, at quantile point $\alpha\in (0,1)$, is a prediction/estimate of the $\alpha$-quantile of $y|x$, i.e., of $F^{-1}_{y|x}(\alpha)$, where $F^{-1}$ is the (generalized) inverse cdf = quantile function of the random variable $y|x$.\
 **Intuition**: "out of many repetitions/worlds, a fraction of exactly $\alpha$ will be equal to or smaller than this value."
* an **"interval prediction"** or "predictive interval" with (symmetric) coverage $c\in (0,1)$ is a prediction/estimate pair of lower bound $a$ and upper bound $b$ such that $P(a\le y \le b| x) = c$ and $P(y \gneq b| x) = P(y \lneq a| x) = (1 - c) /2$.\
 **Intuition**: "out of many repetitions/worlds, a fraction of exactly $c$ will be contained in the interval $[a,b]$, and being above is equally likely as being below".

(similar - exercise left to the reader)
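One way to work through this exercise is the same sample-based reasoning as for mean and variance, sketched here in plain `numpy`. The normal distribution with mean 42 and variance 1 is an assumed stand-in for the unknown distribution of $y|x$ at a single test point.

```python
import numpy as np

rng = np.random.default_rng(0)

# many "repetitions/worlds" for a single test point
samples = rng.normal(loc=42.0, scale=1.0, size=100_000)

# quantile prediction at alpha = 0.05, 0.5, 0.95
alpha = [0.05, 0.5, 0.95]
q_pred = np.quantile(samples, alpha)

# symmetric interval prediction with coverage c = 0.9
# lower bound is the (1 - c) / 2 quantile, upper bound the (1 + c) / 2 quantile
c = 0.9
lower, upper = np.quantile(samples, [(1 - c) / 2, (1 + c) / 2])

# empirical check: fraction of repetitions inside the interval
inside = np.mean((samples >= lower) & (samples <= upper))
print(q_pred, lower, upper, inside)
```

The interval with coverage 0.9 is bounded by the 0.05 and 0.95 quantiles, which is why quantile metrics can also be applied to interval predictions.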

* a **"distribution prediction"** or "full probabilistic prediction" is a prediction/estimate of the distribution of $y|x$, e.g., "it's a normal distribution with mean 42 and variance 1".\
**Intuition**: exhaustive description of the generating mechanism of many repetitions/worlds.

note: the true distribution is unknown, and not accessible easily!

`y_pred_proba` is a distribution, but in general not equal to the true one!

that is, there are:

* *true* distribution `y_pred_proba_true` - unknown and unknowable but estimable
* `y_pred_proba` - our guess at `y_pred_proba_true`
* the actual data `y_true` is *one* `y_pred_proba_true.sample()`


* `predict` produces guess of `y_pred_proba_true.mean()`
* `predict_var` produces guess of `y_pred_proba_true.var()`
* `predict_quantiles([0.05, 0.5, 0.95])` produces guess of `y_pred_proba_true.quantile([0.05, 0.5, 0.95])`
* `predict_proba` produces guess of `y_pred_proba_true`

the guesses are algorithm specific, and some algorithms are more accurate than others, given data

### 2.2 probabilistic metrics - details <a class="anchor" id="section2_2"></a>

General usage pattern same as for `sklearn` metrics:

1. get some actuals and predictions
2. specify the metric - similar to estimator specs
3. plug the actuals and predictions into metric to get metric values

*but*: need to use dedicated metric for probabilistic predictions

* ground truth: `y_true` samples
* prediction e.g., `y_predict_proba`, `y_predict_interval`
* so, match metric with type of prediction!
    * `metric(y_true: 2D pd.DataFrame, y_pred: proba_prediction_type) -> float`

Recall methods available for all probabilistic regressors:

- `predict_interval` produces interval predictions.
  Argument `coverage` (nominal interval coverage) must be provided.
- `predict_quantiles` produces quantile predictions.
  Argument `alpha` (quantile values) must be provided.
- `predict_var` produces variance predictions. Same args as `predict`.
- `predict_proba` produces full distributional predictions. Same args as `predict`.

| Name | param | prediction/estimate of | `skpro` |
| ---- | ----- | ---------------------- | -------- |
| point prediction | | conditional expectation $\mathbb{E}[y\|x]$ | `predict` |
| variance prediction | | conditional variance $Var[y\|x]$ | `predict_var` |
| quantile prediction | $\alpha\in (0,1)$ | $\alpha$-quantile of $y\|x$ | `predict_quantiles` |
| interval prediction | $c\in (0,1)$| $[a,b]$ s.t. $P(a\le y \le b\| x) = c$ | `predict_interval` |
| distribution prediction | | the law/distribution of $y\|x$ | `predict_proba` |

let's produce some probabilistic predictions!

In [None]:
# 1. get some actuals and predictions
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# actuals = y_test

In [None]:
from sklearn.ensemble import RandomForestRegressor

from skpro.regression.residual import ResidualDouble

reg_mean = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean)

reg_proba.fit(X_train, y_train)

# use any of the probabilistic methods, we have seen this
y_pred_int = reg_proba.predict_interval(X_test, coverage=0.95)
y_pred_q = reg_proba.predict_quantiles(X_test, alpha=[0.05, 0.95])
y_pred_proba = reg_proba.predict_proba(X_test)

recall, all have their own output format:

In [None]:
y_pred_int  # lower/upper intervals

In [None]:
y_pred_q  # quantiles

In [None]:
y_pred_proba  # sktime/skpro BaseDistribution

we now need to apply a suitable metric, `metric(y_test, y_pred)`

IMPORTANT: sequence matters, `y_test` first; `y_pred` has very different type!

In [None]:
# 2. specify metric
# CRPS = continuous ranked probability score, for distribution predictions
from skpro.metrics import CRPS

crps = CRPS()

# 3. evaluate metric
crps(y_test, y_pred_proba)
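For intuition on what CRPS measures: for a predictive distribution $F$ and an observation $y$, it can be written as $\mathbb{E}|X - y| - \frac{1}{2}\mathbb{E}|X - X'|$ with $X, X'$ independent draws from $F$; lower is better. For a normal prediction there is a known closed form. The sketch below compares the two in plain `numpy`. It is an illustration under these assumptions and does not reproduce the internals of the `CRPS` class; the function names are made up for this example.

```python
import math

import numpy as np


def crps_normal(y, mu, sigma):
    """Closed-form CRPS of a normal prediction N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))


def crps_sample(y, samples_a, samples_b):
    """Monte Carlo estimate of E|X - y| - 0.5 * E|X - X'|."""
    term_obs = np.mean(np.abs(samples_a - y))
    term_spread = np.mean(np.abs(samples_a - samples_b))
    return term_obs - 0.5 * term_spread


rng = np.random.default_rng(0)
x_a = rng.normal(loc=0.0, scale=1.0, size=200_000)
x_b = rng.normal(loc=0.0, scale=1.0, size=200_000)

crps_exact = crps_normal(y=0.5, mu=0.0, sigma=1.0)
crps_mc = crps_sample(0.5, x_a, x_b)
print(crps_exact, crps_mc)
```

The first term rewards predictions that are close to the observation, the second term penalizes predictive distributions that are wider than necessary.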

how do we find a metric that fits the prediction type?

answer: metrics are tagged

important tag: `scitype:y_pred`

* `"pred_proba"` - distributional, can be applied to distributions, `predict_proba` output
* `"pred_quantiles"` - quantile forecast metric, can be applied to quantile predictions, interval predictions, distributional predictions
    * applicable to `predict_quantiles`, `predict_interval`, `predict_proba` outputs
* `"pred_interval"` - interval forecast metric, can be applied to interval predictions, distributional predictions
    * applicable to `predict_interval`, `predict_proba` outputs

In [None]:
crps.get_tags()

listing metrics with the tag, filtering for probabilistic tags:

(let's try to find a quantile prediction metric!)

In [None]:
from skpro.registry import all_objects

all_objects(
    "metric",
    as_dataframe=True,
    return_tags="scitype:y_pred",
)

`PinballLoss` is a quantile forecast metric:

In [None]:
from skpro.metrics import PinballLoss

pinball_loss = PinballLoss()

pinball_loss(y_test, y_pred_q)

... this is by default an average (grand average, float)

* averages over samples in `y_pred` / `y_test` (rows)
* averages over variables (columns)
* averages over `alpha` values, quantile points
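For reference, the pinball (quantile) loss of a quantile prediction $q$ at level $\alpha$ against an observation $y$ is $\alpha\,(y - q)$ if $y \ge q$, and $(1 - \alpha)(q - y)$ otherwise. A minimal `numpy` sketch of this textbook definition and of the averaging steps follows. The function name and toy numbers are illustrative, and `PinballLoss` may apply additional conventions (such as scaling) that are not reproduced here.

```python
import numpy as np


def pinball(y_true, q_pred, alpha):
    """Element-wise pinball loss for quantile predictions at level alpha."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(q_pred, dtype=float)
    return np.where(diff >= 0, alpha * diff, (alpha - 1) * diff)


y_obs = np.array([1.0, 2.0, 3.0])

# one column of predictions per quantile point
alphas = [0.1, 0.9]
q_hat = np.array([[0.5, 1.5], [1.5, 2.5], [2.5, 3.5]])

# loss per sample (rows) and quantile point (columns)
loss_matrix = np.column_stack(
    [pinball(y_obs, q_hat[:, k], a) for k, a in enumerate(alphas)]
)

loss_by_alpha = loss_matrix.mean(axis=0)  # average over samples only
loss_grand = loss_matrix.mean()  # average over samples and quantile points
print(loss_by_alpha, loss_grand)
```

The asymmetry is what makes the loss target a quantile: at $\alpha = 0.9$, under-predicting is penalized nine times as heavily as over-predicting by the same amount.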

what if we don't want these averages?

* variable (column) averaging is controlled by the `multioutput` arg.
    * `"raw_values"` prevents averaging, `"uniform_average"` computes arithmetic mean.
* quantile points (`alpha`) or coverage (`coverage`) is controlled by `score_average` arg
* evaluation by row via the `evaluate_by_index` method
    * can be useful for diagnostics or statistical tests

In [None]:
# Example 1: Pinball loss by quantile point
loss_multi = PinballLoss(score_average=False)
loss_multi(y_test, y_pred_q)

In [None]:
# Example 2: CRPS by test sample index
crps.evaluate_by_index(y_test, y_pred_proba)

Caveat: not every metric is an average over samples, e.g., RMSE

In this case, `evaluate_by_index` computes jackknife pseudo-samples

(for mean statistics, jackknife pseudo-samples are equal to individual samples)
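The jackknife pseudo-sample for index $i$ is $n\,\hat\theta - (n-1)\,\hat\theta_{-i}$, where $\hat\theta$ is the metric on all $n$ samples and $\hat\theta_{-i}$ is the metric with sample $i$ left out. A small `numpy` sketch of this general construction with made-up errors follows; it illustrates the idea and is not taken from the `skpro` code base.

```python
import numpy as np


def jackknife_pseudo(values, stat):
    """Jackknife pseudo-samples of a statistic computed from per-sample values."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    full = stat(values)
    leave_one_out = np.array([stat(np.delete(values, i)) for i in range(n)])
    return n * full - (n - 1) * leave_one_out


sq_errors = np.array([0.25, 1.0, 4.0, 0.0])  # hypothetical squared errors

# mean statistic (MSE): pseudo-samples coincide with the individual values
pseudo_mse = jackknife_pseudo(sq_errors, np.mean)

# non-mean statistic (RMSE): pseudo-samples differ from per-sample abs errors
pseudo_rmse = jackknife_pseudo(sq_errors, lambda v: np.sqrt(np.mean(v)))
print(pseudo_mse, pseudo_rmse)
```

For the mean, the formula reduces to $\sum_j x_j - \sum_{j \neq i} x_j = x_i$, which is the statement in parentheses above.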

### 2.3 Benchmark evaluation of probabilistic regressors <a class="anchor" id="section2_3"></a>

for quick evaluation and benchmarking,

the `benchmarking.evaluate` utility can be used:

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

from skpro.benchmarking.evaluate import evaluate
from skpro.metrics import CRPS
from skpro.regression.residual import ResidualDouble

# 1. specify dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)

# 2. specify estimator
estimator = ResidualDouble(LinearRegression())

# 3. specify cross-validation schema
cv = KFold(n_splits=3)

# 4. specify evaluation metric
crps = CRPS()

# 5. evaluate - run the benchmark
results = evaluate(estimator=estimator, X=X, y=y, cv=cv, scoring=crps)

# results are pd.DataFrame
# each row is one repetition of the cross-validation on one fold fit/predict/evaluate
# columns report performance, runtime, and other optional information (see docstring)
results

## 3. Advanced composition patterns <a class="anchor" id="chapter3"></a>

we introduce a number of composition patterns available in `skpro`:

* reducer-wrappers that turn `sklearn` regressors into probabilistic ones
* pipelines of `sklearn` transformers with `skpro` regressors
* tuning `skpro` probabilistic regressors via grid/random search, minimizing a probabilistic metric
* ensembling multiple `skpro` probabilistic regressors

data used in this section:

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

evaluation metric used in this section:

In [None]:
from skpro.metrics import CRPS

crps = CRPS()

### 3.1 Reducers to turn `sklearn` regressors probabilistic <a class="anchor" id="section3_1"></a>

there are many common algorithms that turn a non-probabilistic tabular regressor probabilistic

formally, this is a type of "reduction" - of probabilistic supervised tabular to non-probabilistic supervised tabular

Examples:

* predicting variance equal to training residual variance - `ResidualDouble` with standard settings
    * or other unconditional distribution estimate for residuals
* "squaring the residual" two-step prediction - `ResidualDouble`
* bootstrap prediction intervals - `BootstrapRegressor`
* conformal prediction intervals - contributions appreciated :-)
* natural gradient boosting aka NGBoost - contributions appreciated :-)
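To make the first reduction concrete, here is a minimal sketch of the idea in plain `numpy`: fit a point predictor, estimate the residual standard deviation on held-out data, and predict a normal distribution with that constant scale. This illustrates the principle only and is not the `ResidualDouble` implementation; the synthetic data and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic data: y = 2 * x + 1 + noise with standard deviation 0.5
x_sim = rng.uniform(0, 10, size=1500)
y_sim = 2 * x_sim + 1 + rng.normal(scale=0.5, size=1500)
x_fit, x_cal, x_new = x_sim[:500], x_sim[500:1000], x_sim[1000:]
y_fit, y_cal, y_new = y_sim[:500], y_sim[500:1000], y_sim[1000:]

# step 1: non-probabilistic mean model (least squares line)
slope, intercept = np.polyfit(x_fit, y_fit, deg=1)

# step 2: residual scale, estimated on data not used for fitting
resid = y_cal - (slope * x_cal + intercept)
sigma_hat = resid.std(ddof=1)

# step 3: probabilistic prediction N(mean, sigma_hat^2) for new points
mean_new = slope * x_new + intercept
z90 = 1.6449  # approximate 0.95 quantile of the standard normal
lower_new = mean_new - z90 * sigma_hat
upper_new = mean_new + z90 * sigma_hat

# empirical coverage of the nominal 90% interval on the new points
coverage = np.mean((y_new >= lower_new) & (y_new <= upper_new))
print(sigma_hat, coverage)
```

Estimating the scale on held-out data plays the same role as the `cv` argument used below: residuals on the training data itself tend to be too small.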

### 3.1.1 constant variance prediction <a class="anchor" id="section3_1_1"></a>

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

from skpro.regression.residual import ResidualDouble

# estimator specification - use any sklearn regressor for reg_mean
reg_mean = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, cv=KFold(5))
# cv is used to estimate out-of-sample residual variance via 5-fold CV
# note - in-sample predictions will usually underestimate the variance!

# fit and predict
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_test)

# evaluate
crps(y_test, y_pred_proba)

In [None]:
from skpro.utils.plotting import plot_crossplot_interval

plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)

### 3.1.2 two-step residual prediction <a class="anchor" id="section3_1_2"></a>

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

from skpro.regression.residual import ResidualDouble

# estimator specification - use any sklearn regressor for reg_mean and reg_resid
reg_mean = RandomForestRegressor()
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, estimator_resid=reg_resid, cv=KFold(5))
# cv is used to estimate out-of-sample residual variance via 5-fold CV

# fit and predict
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_test)

# evaluate
crps(y_test, y_pred_proba)

In [None]:
from skpro.utils.plotting import plot_crossplot_interval

plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)

### 3.1.3 bootstrap prediction intervals <a class="anchor" id="section3_1_3"></a>

In [None]:
from sklearn.linear_model import LinearRegression

from skpro.regression.bootstrap import BootstrapRegressor

# estimator specification - use any sklearn regressor for reg_mean
reg_mean = LinearRegression()
reg_proba = BootstrapRegressor(reg_mean, n_bootstrap_samples=100)

# fit and predict
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_test)

# evaluate
crps(y_test, y_pred_proba)

In [None]:
from skpro.utils.plotting import plot_crossplot_interval

plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)

### 3.2 Pipelines of `skpro` regressor and `sklearn` transformers <a class="anchor" id="section3_2"></a>

`skpro` regressors can be pipelined with `sklearn` transformers, using the `skpro` pipeline.

This ensures the presence of `predict_proba` etc. in the pipeline object.

The syntax is exactly the same as for `sklearn`'s pipeline.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
from sklearn.impute import SimpleImputer as Imputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

from skpro.regression.compose import Pipeline
from skpro.regression.residual import ResidualDouble

# estimator specification
reg_mean = LinearRegression()
reg_proba = ResidualDouble(reg_mean)

# pipeline is specified as a list of tuples (name, estimator)
pipe = Pipeline(
    steps=[
        ("imputer", Imputer()),  # an sklearn transformer
        ("scaler", MinMaxScaler()),  # an sklearn transformer
        ("regressor", reg_proba),  # an skpro regressor
    ]
)

In [None]:
pipe

In [None]:
# the pipeline behaves as any skpro regressor
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X=X_test)
y_pred_proba = pipe.predict_proba(X=X_test)

the pipeline provides the familiar nested `get_params`, `set_params` interface:

nested parameters are keyed `componentname__parametername`

In [None]:
pipe.get_params()

pipelines can also be created via simple lists of estimators,

in this case names are generated automatically:

In [None]:
# pipeline is specified as a plain list of estimators, without names
pipe = Pipeline(
    steps=[
        Imputer(),  # an sklearn transformer
        MinMaxScaler(),  # an sklearn transformer
        reg_proba,  # an skpro regressor
    ]
)

### 3.3 Tuning of `skpro` regressors via grid and random search <a class="anchor" id="section3_3"></a>

`skpro` provides grid and random search tuners to tune arbitrary probabilistic regressors,

using probabilistic metrics. Besides this, they function as the `sklearn` tuners do.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

from skpro.metrics import CRPS
from skpro.model_selection import GridSearchCV
from skpro.regression.residual import ResidualDouble

# cross-validation specification for tuner
cv = KFold(n_splits=3)

# estimator to be tuned
estimator = ResidualDouble(LinearRegression())

# tuning grid - do we fit an intercept in the linear regression?
param_grid = {"estimator__fit_intercept": [True, False]}

# metric to be optimized
crps_metric = CRPS()

# specification of the grid search tuner
gscv = GridSearchCV(
    estimator=estimator,
    param_grid=param_grid,
    cv=cv,
    scoring=crps_metric,
)

In [None]:
gscv

the grid search tuner behaves like any `skpro` probabilistic regressor:

In [None]:
gscv.fit(X_train, y_train)
y_pred = gscv.predict(X_test)
y_pred_proba = gscv.predict_proba(X_test)

random search is similar, except that instead of a grid a parameter sampler should be specified:

In [None]:
from skpro.model_selection import RandomizedSearchCV

# only difference to GridSearchCV is the param_distributions argument

# specification of the random search parameter sampler
param_distributions = {"estimator__fit_intercept": [True, False]}

# specification of the random search tuner
rscv = RandomizedSearchCV(
    estimator=estimator,
    param_distributions=param_distributions,
    cv=cv,
    scoring=crps_metric,
)

### 3.4 Bagging/mixture ensemble of probabilistic regressors <a class="anchor" id="section3_3"></a>

Classical bagging does the following, for a wrapped estimator:

In `fit`:

1. subsample rows and/or columns of `X`, `y` to `X_subs`, `y_subs`
2. fit clone of wrapped estimator to `X_subs`, `y_subs`
3. Repeat 1-2 `n_estimators` times, store that many fitted clones.

In `predict`, for `X_test`:

1. for all fitted clones, obtain predictions on `X_test` - these are distributions
2. return the uniform mixture of these distributions, per test sample
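The uniform mixture in step 2 has a simple closed form for mean and variance. A `numpy` sketch for normal components follows, with assumed component parameters for a single test sample; it is illustrative only and does not show how `skpro` represents mixtures internally.

```python
import numpy as np

# hypothetical predictions of 3 fitted clones for one test sample
means = np.array([0.0, 2.0, 1.0])
sds = np.array([1.0, 1.0, 2.0])

# uniform mixture: the mean is the average of the component means
mix_mean = means.mean()

# law of total variance: average variance plus variance of the means
mix_var = (sds**2).mean() + ((means - mix_mean) ** 2).mean()

# sampling from the mixture: pick a clone uniformly, then sample from it
rng = np.random.default_rng(0)
which = rng.integers(0, len(means), size=200_000)
draws = rng.normal(loc=means[which], scale=sds[which])
print(mix_mean, mix_var, draws.mean(), draws.var())
```

The second term in the variance shows that disagreement between the fitted clones widens the predictive distribution of the ensemble.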

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
from sklearn.linear_model import LinearRegression

from skpro.regression.ensemble import BaggingRegressor
from skpro.regression.residual import ResidualDouble

reg_mean = LinearRegression()
reg_proba = ResidualDouble(reg_mean)

ens = BaggingRegressor(reg_proba, n_estimators=10)
ens.fit(X_train, y_train)

y_pred = ens.predict_proba(X_test)

In [None]:
# y_pred is a mixture distribution!
str(y_pred)

In [None]:
[type(x) for x in y_pred.distributions]

## 4. Extension guide - implementing your own probabilistic regressor <a class="anchor" id="chapter4"></a>


`skpro` is meant to be easily extensible, for direct contribution to `skpro` as well as for local/private extension with custom methods.

To get started:

* Follow the ["implementing estimator" developer guide](https://skpro.readthedocs.io/en/stable/developer_guide/add_estimators.html)
* Use the [probabilistic regressor template](https://github.com/sktime/skpro/blob/main/extension_templates/regression.py) to get started

1. Read through the [probabilistic regression extension template](https://github.com/sktime/skpro/blob/main/extension_templates/regression.py) - this is a `python` file with `todo` blocks that mark the places in which changes need to be added.
2. Copy the proba regressor extension template to a local folder in your own repository (local/private extension), or to a suitable location in your clone of the `skpro` or affiliated repository (if contributed extension), inside `skpro.regression`; rename the file and update the file docstring appropriately.
3. Address the "todo" parts. Usually, this means: changing the name of the class, setting the tag values, specifying hyper-parameters, filling in `__init__`, `_fit`, and at least one of the probabilistic prediction methods, preferably `_predict_proba`. You can add private methods as long as they do not override the default public interface. For more details, see the extension template.
4. To test your estimator manually: import your estimator and run it in the workflows in Section 1; then use it in the compositors in Section 3.
5. To test your estimator automatically: call `skpro.utils.check_estimator` on your estimator. You can call this on a class or object instance. Ensure you have specified test parameters in the `get_test_params` method, according to the extension template.

In case of direct contribution to `skpro` or one of its affiliated packages, additionally:

* Add yourself as an author to the code, and to the `CODEOWNERS` for the new estimator file(s).
* Create a pull request that contains only the new estimators (and their inheritance tree, if it's not just one class), as well as the automated tests as described above.
* In the pull request, describe the estimator and optimally provide a publication or other technical reference for the strategy it implements.
* Before making the pull request, ensure that you have all necessary permissions to contribute the code to a permissive license (BSD-3) open source project.

## 5. Summary<a class="anchor" id="chapter5"></a>

* `skpro` is a unified interface toolbox for probabilistic supervised regression, that is, for prediction intervals, quantiles, fully distributional predictions, in a tabular regression setting. The interface is fully interoperable with `scikit-learn` and `scikit-base` interface specifications.

* `skpro` comes with rich composition functionality that makes it easy to build complex pipelines and to connect with other parts of the open source ecosystem, such as `scikit-learn` and individual algorithm libraries.

* `skpro` is easy to extend, and comes with user friendly tools to facilitate implementing and testing your own probabilistic regressors and composition principles.

---

### Credits:

notebook creation: fkiraly

skpro: https://github.com/sktime/skpro/blob/main/CONTRIBUTORS.md