## skpro probability distributions

**Set-up instructions:** On binder, this should run out-of-the-box.

To run locally instead, ensure that `skpro` with basic dependency requirements is installed in your python environment.

`skpro` provides a unified interface to probability distributions, with an API that:

* is `scikit-learn` and `scikit-base` compliant: distributions are parametric objects with `get_params`, `set_params`
* makes it easy to represent data frames and 2D arrays of distributions, e.g., as return objects of `predict_proba`
* is `pandas` compliant, with `index`, `columns`, indexing and subsetting via `iloc`, `loc`
* provides easy access to informative and didactic visualizations via `matplotlib`

**Section 1** provides an **overview of core API elements**

**Section 2** gives an overview of **probability defining methods**, relevant tags and configs

**Section 3** introduces **composite distributions** and related APIs

**Section 4** gives an introduction to writing API compliant **custom probability distributions**

In [None]:
# hide warnings
import warnings

warnings.filterwarnings("ignore")

## 1. `skpro` probability distributions <a class="anchor" id="chapter1"></a>

### 1.1 probability distributions base interface

`skpro` distributions are parametric objects:

In [None]:
from skpro.distributions import Normal

# defining a normal distribution with mean 1 and std 2
n = Normal(mu=1, sigma=2)

n

The object `n` is a symbolic representation of a normal distribution.

The object provides various useful methods:

* properties and functions such as `pdf`, `cdf`, `mean`, `var`
* a method to sample from the distribution, `sample`
* a method to visualize the distribution, `plot`

In [None]:
n.plot("pdf")

In [None]:
# evaluate the pdf, compare the figure
n.pdf(2)

In [None]:
# mean/expectation of the distribution
# same as mu, for normal
n.mean()

In [None]:
# variance of the distribution
# same as sigma-squared, for normal
n.var()

In [None]:
# plot the cdf
n.plot("cdf")

In [None]:
n.cdf(1)
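As an independent cross-check of the values above, the same quantities can be computed with Python's standard library. This is only a sketch using `statistics.NormalDist`, not `skpro` itself; the variable names are illustrative.

```python
from statistics import NormalDist

# Same parameters as the skpro example: mean 1, standard deviation 2.
ref = NormalDist(mu=1, sigma=2)

pdf_at_2 = ref.pdf(2)    # density at x = 2, compare with n.pdf(2)
cdf_at_1 = ref.cdf(1)    # probability left of the mean, compare with n.cdf(1)
mean = ref.mean          # equals mu
variance = ref.variance  # equals sigma squared

print(pdf_at_2, cdf_at_1, mean, variance)
```

The cdf evaluated at the mean is 0.5 because the normal distribution is symmetric around `mu`.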

the `sample` method can be used to produce a single independent sample, or multiple:

In [None]:
# produce single sample
n.sample()

In [None]:
# produce an i.i.d. sample of size 10
n.sample(10)
# returned as a pd.DataFrame

parameters can be set and accessed via `get_params`, `set_params` - like in `scikit-learn`

In [None]:
n.get_params()

### 1.2 DataFrame-like distributions, array distributions

`skpro` distributions are designed to easily represent DataFrame-like distributions,

i.e., array distributions with row (instance) index and column (variable) index

all distributions can be constructed as a DataFrame-like distribution:

In [None]:
from skpro.distributions import Normal

instance_names = [1, 2, 3]
variable_names = ["foo", "bar"]

# the syntax is similar to pandas:
# parameters passed as arrays
# index, columns passed

mus = [[1, 2], [2, 3], [3, 4]]
sigmas = [[1, 1], [2, 1], [1, 2]]
n = Normal(mu=mus, sigma=sigmas, index=instance_names, columns=variable_names)

`n` represents a joint distribution of independent normals in an array, with row and column index

(most) methods act entry-wise on the marginal distributions and return array-like output with the same index and columns

e.g., visualisation:

In [None]:
n.plot("cdf")

In [None]:
n.plot("pdf")

In [None]:
# sampling produces DataFrame with same index, columns
n.sample()

In [None]:
# same for mean
n.mean()

In [None]:
# variance
n.var()

methods with arguments broadcast to 2D

In [None]:
# methods with arguments broadcast
n.pdf(1)

In [None]:
# 1D is considered as a row when broadcasting
n.pdf([1, 100])

In [None]:
# to broadcast column-wise, use 2D column vector
n.pdf([[1], [2], [100]])

In [None]:
# 2D is evaluated entry-wise
n.pdf([[1, 2], [3, 5], [5, 7]])
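The four cases above follow the familiar `numpy` broadcasting pattern. The sketch below reproduces the shapes with plain `numpy`, assuming the normal pdf formula and the parameter arrays from this section; `normal_pdf` is a hypothetical helper for illustration, not part of `skpro`.

```python
import numpy as np

# Parameter arrays from the example above, shape (3, 2).
mus = np.array([[1, 2], [2, 3], [3, 4]], dtype=float)
sigmas = np.array([[1, 1], [2, 1], [1, 2]], dtype=float)


def normal_pdf(x, mu, sigma):
    """Entry-wise normal density; x is broadcast against mu and sigma."""
    x = np.asarray(x, dtype=float)
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))


scalar_case = normal_pdf(1, mus, sigmas)                   # scalar -> every entry
row_case = normal_pdf([1, 100], mus, sigmas)               # 1D -> one value per column
col_case = normal_pdf([[1], [2], [100]], mus, sigmas)      # (3, 1) -> one value per row
entry_case = normal_pdf([[1, 2], [3, 5], [5, 7]], mus, sigmas)  # entry-wise

for result in (scalar_case, row_case, col_case, entry_case):
    print(result.shape)
```

In every case the result has the shape of the distribution, `(3, 2)`; only the way the argument is spread over the entries differs.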

`skpro` distribution objects are pandas-like!

In [None]:
n.shape

`index` and `columns` are coerced to `pandas.Index` subtypes

In [None]:
n.index

In [None]:
n.columns  # the column (variable) names passed at construction

subsetting with `iloc` (integer location) and `loc` works as in `pandas`:

* `iloc[rows, cols]` and `loc[rows, cols]` subset to an array distribution
* `iat[row, col]` and `at[row, col]` subset to a scalar distribution

In [None]:
# we subset to two rows and one column
n_subset = n.iloc[[0, 2], [1]]
# n_subset.shape = (2, 1)
n_subset.plot()

In [None]:
# same rows, cols, but with loc indexing
n_subset = n.loc[[1, 3], ["bar"]]
# n_subset.shape = (2, 1)
n_subset.plot()

In [None]:
# subsetting to a scalar distribution
n.at[1, "bar"]
# this behaves the same as the distribution in the previous section
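Because the row index here is `[1, 2, 3]`, positions (used by `iloc`) and labels (used by `loc`, `at`) differ by one, which is easy to mix up. The following sketch shows the same selections on a plain `pandas.DataFrame` holding the `mu` values from this section; it illustrates the `pandas` convention that the distribution indexers mirror and does not use `skpro`.

```python
import pandas as pd

# The mu values from above, with the same index and columns.
means = pd.DataFrame(
    [[1, 2], [2, 3], [3, 4]],
    index=[1, 2, 3],
    columns=["foo", "bar"],
)

# Positions 0 and 2 correspond to labels 1 and 3; position 1 is column "bar".
by_position = means.iloc[[0, 2], [1]]
by_label = means.loc[[1, 3], ["bar"]]

# at selects a single entry by label.
single = means.at[1, "bar"]

print(by_position.equals(by_label), single)
```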

### 1.3 DataFrame-like distributions - broadcasting

at construction, all (simple) probability distributions broadcast parameters:

* if `index` or `columns` are passed, always broadcasts to 2D
* 1D iterables are interpreted as row vectors, i.e., of shape (1, n)
* if `index` and `columns` are absent, and the result is 2D, uses `RangeIndex` (integers starting at 0)
* the result is a scalar distribution only if `index`, `columns` are not passed, and all parameters are zero-D (scalar)

In [None]:
# broadcasting example: index is passed
from skpro.distributions import Normal

n = Normal(mu=1, sigma=2, index=[1, 2, 3])
# results in a shape (3, 1) distribution, with all mu, sigma being 1, 2
n.shape

In [None]:
n.mean()

In [None]:
n.var()

In [None]:
# broadcasting example: a parameter is non-scalar
from skpro.distributions import Normal

n = Normal(mu=[[1, 2], [2, 3]], sigma=2)
# results in a shape (2, 2) distribution
# sigma is broadcast to the shape of mu
# index, columns are RangeIndex, i.e., [0, 1]
n.shape

In [None]:
n.mean()

In [None]:
# broadcasting example: 1D parameter is broadcast as a row
from skpro.distributions import Normal

n = Normal(mu=[1, 2, 3], sigma=2)
# results in a shape (1, 3) distribution
# mu is interpreted as (1, 3) row vector
# sigma is broadcast to the shape of mu
# index, columns are RangeIndex, i.e., index=[0] and columns=[0, 1, 2]
n.shape

In [None]:
n.mean()
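The broadcasting rules of this subsection can be summarized as a small shape calculation. The sketch below is a hypothetical helper, `expected_shape`, written with plain `numpy` under the assumption that the rules are exactly as listed above; it is not how `skpro` implements broadcasting and it only handles parameters of up to two dimensions.

```python
import numpy as np


def expected_shape(params, index=None, columns=None):
    """Shape of the resulting distribution under the rules listed above.

    Hypothetical helper for illustration; returns () for a scalar distribution.
    """
    shapes = [np.shape(p) for p in params]
    # Scalar only if no index/columns are passed and all parameters are 0-D.
    if index is None and columns is None and all(s == () for s in shapes):
        return ()
    # Promote to 2D: scalars become (1, 1), 1D iterables become row vectors (1, n).
    shapes_2d = [np.atleast_2d(p).shape for p in params]
    # index fixes the number of rows, columns the number of columns.
    if index is not None:
        shapes_2d.append((len(index), 1))
    if columns is not None:
        shapes_2d.append((1, len(columns)))
    return tuple(np.broadcast_shapes(*shapes_2d))


print(expected_shape([1, 2]))                      # all scalar, no index
print(expected_shape([1, 2], index=[1, 2, 3]))     # index passed
print(expected_shape([[[1, 2], [2, 3]], 2]))       # 2D mu, scalar sigma
print(expected_shape([[1, 2, 3], 2]))              # 1D mu as a row
```

The four calls correspond to a scalar distribution and to the three broadcasting examples in this subsection.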

### 1.4 searching for probability distributions

as first-class citizens, all objects in `skpro` are indexed via the `registry` utility `all_objects`.

To find probability distributions, use `all_objects` with the type `distribution`:

In [None]:
from skpro.registry import all_objects

all_objects("distribution", as_dataframe=True).head()

a full list can also be found in the online API reference.

all tags can be printed by the `all_tags` utility:

In [None]:
# all tags applicable to probability distribution
from skpro.registry import all_tags

all_tags("distribution", as_dataframe=True)

filtering in search can be done with the `filter_tags` argument in `all_objects`, see docstring:

In [None]:
from skpro.registry import all_objects

# "retrieve all absolutely continuous distributions on the reals"
all_objects("distribution", as_dataframe=True, filter_tags={"distr:measuretype": "continuous"})

## 4. Extension guide - implementing your own probability distribution <a class="anchor" id="chapter4"></a>


`skpro` is meant to be easily extensible, for direct contribution to `skpro` as well as for local/private extension with custom methods.

To get started:

* Follow the ["implementing estimator" developer guide](https://skpro.readthedocs.io/en/stable/developer_guide/add_estimators.html)
* Use the [probabilistic regressor template](https://github.com/sktime/skpro/blob/main/extension_templates/regression.py) to get started

1. Read through the [probability distribution extension template](https://github.com/sktime/skpro/blob/main/extension_templates/distributions.py) - this is a `python` file with `todo` blocks that mark the places in which changes need to be added.
2. Copy the distribution extension template to a local folder in your own repository (local/private extension), or to a suitable location in your clone of the `skpro` or affiliated repository (if contributed extension), inside `skpro.distributions`; rename the file and update the file docstring appropriately.
3. Address the "todo" parts. Usually, this means: changing the name of the class, setting the tag values, specifying hyper-parameters, filling in `__init__`, and as many methods as possible, most importantly `_ppf`, and possibly other common methods such as `_pdf` or `_pmf`, `_cdf`. You can add private methods as long as they do not override the default public interface. For more details, see the extension template.
4. To test your estimator manually: import your estimator and run it in the workflows in Section 1; then use it in the compositors in Section 3.
5. To test your estimator automatically: call `skpro.utils.check_estimator` on your estimator. You can call this on a class or object instance. Ensure you have specified test parameters in the `get_test_params` method, according to the extension template.
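Before filling in `_pdf`, `_cdf` and `_ppf` in the template, it can help to sanity-check the formulas in plain Python. The sketch below does this for an exponential distribution, chosen here purely as an illustration (it does not appear in this notebook). The functions are hypothetical stand-alone helpers and do not use the `skpro` base class or the extension template; in an actual extension the same formulas would go into the corresponding private methods.

```python
import math

RATE = 0.5  # hypothetical rate parameter of the exponential distribution


def pdf(x):
    """Density of the exponential distribution."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0


def cdf(x):
    """Cumulative distribution function."""
    return 1.0 - math.exp(-RATE * x) if x >= 0 else 0.0


def ppf(p):
    """Quantile function, the inverse of the cdf on (0, 1)."""
    return -math.log1p(-p) / RATE


# Check 1: ppf inverts cdf.
x = 1.7
roundtrip = ppf(cdf(x))

# Check 2: pdf matches a central finite difference of cdf.
h = 1e-6
numeric_density = (cdf(x + h) - cdf(x - h)) / (2 * h)

print(roundtrip, numeric_density, pdf(x))
```

Checks of this kind catch sign and parametrization mistakes early, before the automated test suite is run.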

In case of direct contribution to `skpro` or one of its affiliated packages, additionally:

* Add yourself as an author to the code, and to the `CODEOWNERS` for the new estimator file(s).
* Create a pull request that contains only the new estimators (and their inheritance tree, if it's not just one class), as well as the automated tests as described above.
* In the pull request, describe the estimator and optimally provide a publication or other technical reference for the strategy it implements.
* Before making the pull request, ensure that you have all necessary permissions to contribute the code to a permissive license (BSD-3) open source project.

## 5. Summary<a class="anchor" id="chapter5"></a>

* `skpro` is a unified interface toolbox for probabilistic supervised regression, that is, for prediction intervals, quantiles, fully distributional predictions, in a tabular regression setting. The interface is fully interoperable with `scikit-learn` and `scikit-base` interface specifications.

* `skpro` comes with rich composition functionality that makes it easy to build complex pipelines and to connect with other parts of the open source ecosystem, such as `scikit-learn` and individual algorithm libraries.

* `skpro` is easy to extend, and comes with user friendly tools to facilitate implementing and testing your own probabilistic regressors and composition principles.

---

### Credits:

notebook creation: fkiraly

skpro: https://github.com/sktime/skpro/blob/main/CONTRIBUTORS.md