Our repository **DistCal** implements the methods from our paper [Calibrated and Sharp Uncertainties in Deep Learning via Density Estimation](https://proceedings.mlr.press/v162/kuleshov22a.html). Below, we briefly discuss our idea and provide a demo of DistCal. 
## Introduction




An accurate probabilistic forecast is characterized by two properties: calibration and sharpness. These properties are grounded in statistical theory and are used to evaluate forecasts in domains such as meteorology and medicine. Intuitively, calibration means that a 90% confidence interval contains the true outcome 90% of the time. Sharpness means that these confidence intervals are narrow. Standard maximum likelihood training often yields models that are poorly calibrated and thus inaccurate: a 90% confidence interval typically does not contain the true outcome 90% of the time. In our [paper](https://proceedings.mlr.press/v162/kuleshov22a.html), we argue that calibration is important in practice and show that it is easy to maintain by performing low-dimensional density estimation.

Probabilistic models are key building blocks of machine learning systems in many areas—medicine, robotics, industrial automation, etc. Hence, maintaining calibration in such predictive models is beneficial for applications across many domains.


## Some background on calibration

Supervised machine learning models predict a probability distribution over the target variable, e.g., class membership probabilities or the parameters of a normal distribution. We seek to produce models with accurate probabilistic outputs.

**Notation**: We predict a target $y \in Y$, where $Y$ is either discrete (a classification problem) or $Y = \mathbb{R}$ (a regression problem), using input features $x \in X$. We are given a forecaster $H : X \rightarrow \triangle Y$, which outputs a probability distribution $F(y) : Y \rightarrow [0,1]$ within the set $\triangle Y$ of distributions over $Y$; the probability density function of $F$ is $f$. We are also given a training set $D=\{(x_i,y_i) \in X \times Y \}_{i=1}^{n}$ and a calibration set $C= \{(x_j,y_j) \in X \times Y\}_{j=1}^{m}$, each consisting of i.i.d. realizations of random variables $X,Y \sim \mathbb{P}$, where $\mathbb{P}$ is the data distribution.

### How do we define calibration?
Kuleshov et al. (2018) define quantile calibration for regression as $$\mathbb{P}(Y \leq \text{CDF}^{-1}_{F_X}(p)) = p$$ for all $p \in [0,1]$, where $F_X = H(X)$ is the forecast at $X$, itself a random variable that takes values in $\triangle Y$. Intuitively, the outcome $Y$ falls in the $p$-th confidence interval $(-\infty,\text{CDF}^{-1}_{F_X}(p)]$ predicted at $X$ a fraction $p$ of the time.

Song et al. (2019) define a stronger notion of distribution calibration as $$\mathbb{P}(Y = y \mid F_X = F) = f(y)$$ for all $y \in Y, F \in \triangle Y$, where $F_X = H(X)$ is the random forecast at $X$ and $f$ is its probability density or probability mass function. When $Y= \{0,1\}$ and $F_X$ is Bernoulli with parameter $p$, we can write the above condition as $\mathbb{P}(Y = 1 \mid F_X = p) = p$. Intuitively, the true probability of $Y = 1$ is $p$ conditioned on predicting it as $p$.

Distribution calibration also guarantees the weaker notion of quantile calibration that was defined earlier. We enforce this stronger notion of distribution calibration in **DistCal** by performing simple density estimation. 
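To make the quantile calibration definition concrete, the sketch below checks it empirically on synthetic data. It is not part of DistCal: the helper `quantile_coverage`, the Gaussian forecasts, and the sample size are illustrative assumptions, and only the Python standard library is used.

```python
import random
from statistics import NormalDist

def quantile_coverage(forecasts, ys, p):
    """Fraction of outcomes y that fall at or below the forecast's p-th quantile."""
    hits = sum(y <= f.inv_cdf(p) for f, y in zip(forecasts, ys))
    return hits / len(ys)

rng = random.Random(0)
n = 5000
# Hypothetical data: each outcome is drawn from N(mu_i, 1).
mus = [rng.uniform(-3.0, 3.0) for _ in range(n)]
ys = [rng.gauss(mu, 1.0) for mu in mus]

# One forecaster that matches the data distribution, and one that is too wide.
calibrated = [NormalDist(mu, 1.0) for mu in mus]
overdispersed = [NormalDist(mu, 2.0) for mu in mus]

for p in (0.1, 0.5, 0.9):
    print(p, quantile_coverage(calibrated, ys, p), quantile_coverage(overdispersed, ys, p))
```

For the matching forecaster, the coverage should be close to $p$ at every level. The overly wide forecaster should cover too little at low levels and too much at high levels, which is the kind of miscalibration that recalibration aims to remove.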


## How does our approach compare with existing approaches?

Recalibration is a widely used approach for improving probabilistic forecasts. In the classification setting, we have Platt scaling (Platt, 1999) and isotonic regression (Niculescu-Mizil & Caruana, 2005). These methods have been extended to settings like multi-class classification, structured prediction, online prediction and regression. 

Previous work on calibration in the regression setting (Kuleshov et al., 2018) targets quantile calibration, while our method targets distribution calibration. We estimate a different distribution ($\mathbb{P}(Y \leq F^{-1}_X(p))$ vs. $\mathbb{P}(Y \mid H(X) = F)$) using different objectives (e.g., the quantile divergence vs. the calibration error in Kuleshov et al. (2018)).

Unlike Song et al. (2019), our method can recalibrate any parametric distribution (not just Gaussians) while also being simpler. While Song et al. (2019) rely on variational inference in Gaussian processes (which is slow and complex to implement), our method uses a neural network that can be implemented in a few lines of code. Our method applies to both classification and regression and, in our experiments, outperforms Song et al. (2019), Kuleshov et al. (2018), as well as Platt and temperature scaling.







## Evaluation of calibration
We evaluate probabilistic predictions using the framework of proper scoring rules (Gneiting & Raftery, 2007). 

Formally, let $L: \triangle Y \times Y \rightarrow \mathbb{R}$ denote a loss between a probabilistic forecast $F \in \triangle Y$ and a realized outcome $y \in Y$. Writing $L(F,G) = \mathbb{E}_{y \sim G} L(F,y)$ for the expected loss of a forecast $F$ when outcomes follow $G$, we say that $L$ is a proper loss if the expected loss is minimized by forecasting the true distribution $G$: $L(F,G) \geq L(G,G)$ for all $F$.

One example is the log loss (negative log-likelihood) $L(F,y) = -\log f(y)$.
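As a quick numerical check of properness (an illustrative sketch, not DistCal code), take a Bernoulli outcome with true success probability 0.7 and compute the expected log loss of every forecast on a grid. The function name `expected_log_loss` and the grid are assumptions made for this example.

```python
import math

def expected_log_loss(f, g):
    """Expected log loss L(F, G) of forecasting Bernoulli(f) when Y ~ Bernoulli(g)."""
    return -(g * math.log(f) + (1.0 - g) * math.log(1.0 - f))

g = 0.7                                   # true probability of Y = 1
grid = [i / 100 for i in range(1, 100)]   # candidate forecasts 0.01, ..., 0.99
best = min(grid, key=lambda f: expected_log_loss(f, g))
print(best)
```

The minimizer on this grid is the true probability 0.7, and any other forecast, such as 0.5 or 0.9, incurs a strictly larger expected loss.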

Another example is the check score for $\tau \in [0,1]$: 
$$
\rho_\tau(F,y) = \begin{cases}
\tau\,(y - F^{-1}(\tau)), & \text{if } y \geq F^{-1}(\tau) \\
(1-\tau)\,(F^{-1}(\tau) - y), & \text{otherwise}
\end{cases}
$$
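The check score can be sketched in a few lines of Python, using the convention that the first case applies when $y$ is at or above the predicted quantile $F^{-1}(\tau)$. This is an illustrative helper rather than the implementation used in DistCal. The function name `check_score`, the candidate grid, and the synthetic samples are assumptions.

```python
import random
from statistics import NormalDist

def check_score(q, y, tau):
    """Check (pinball) score of a predicted tau-th quantile q for outcome y."""
    if y >= q:
        return tau * (y - q)
    return (1.0 - tau) * (q - y)

# At tau = 0.9, a quantile that is too low costs more than one that is too high.
print(check_score(1.0, 3.0, 0.9))  # predicted quantile 2 units below the outcome
print(check_score(3.0, 1.0, 0.9))  # predicted quantile 2 units above the outcome

# Monte Carlo check: the average score over samples of N(0, 1) is smallest
# near the true 0.9 quantile.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]
candidates = [i / 20 for i in range(0, 61)]  # 0.00, 0.05, ..., 3.00
best_q = min(candidates, key=lambda q: sum(check_score(q, y, 0.9) for y in samples))
print(best_q, NormalDist().inv_cdf(0.9))
```

The asymmetric penalty is what makes the true $\tau$-th quantile the minimizer of the expected score, so the candidate with the lowest average score should land close to the exact quantile printed next to it.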

In general, a proper loss decomposes into a calibration term and a sharpness term. Thus, calibration and sharpness are together necessary and sufficient for an accurate probabilistic forecast.

## Recalibration as density estimation


When performing recalibration, we employ an alternative training strategy in which a model $R : \triangle Y \rightarrow \triangle Y$ is fit on the calibration set $C$ such that the forecasts $R \circ F$ are calibrated (Platt, 1999; Vovk et al., 2005). When the pairs $(x_j,y_j)$ are sampled i.i.d. from $\mathbb{P}$, choosing $R(F) = \mathbb{P}(Y \mid H(X) = F)$ yields a distribution-calibrated model $R \circ H$ (the proof is provided in our [paper](https://proceedings.mlr.press/v162/kuleshov22a.html)).

Our approach is to define a featurization $\phi : \triangle Y \rightarrow \mathbb{R}^p$ of $F$, such that $ \phi (F)$ is represented by a small number of parameters $p$. Hence, learning $\mathbb{P}(Y | \phi (F))$ involves a tractable low-dimensional density estimation problem for which there exist efficient and provably correct algorithms (Wasserman, 2006).

Our approach optimizes a proper scoring rule $L$. Specifically, we choose a recalibrator $R$ that minimizes the objective $\sum_{(x,y) \in C} L(R(F_x),y)$ over a calibration dataset $C = \{(x_j,y_j)\}_{j=1}^m$ sampled i.i.d. from $\mathbb{P}$.


    

We provide the algorithm below:


**Algorithm 1: Distribution Recalibration Framework**

**Inputs** Pre-trained model $H : X \rightarrow \triangle_Y$, featurizer $\phi : \triangle_Y \rightarrow \mathbb{R}^p$, recalibrator $R: \mathbb{R}^p \rightarrow \triangle Y$, calibration set $C$

**Output** Recalibrated model $R \circ H : X \rightarrow \triangle_Y$

1. Create a training set for recalibrator: $S= \{(\phi (H(x)),y) |x,y \in C\}$
2. Fit the recalibrator R on S using a proper loss: $\min_R \sum_{(\phi ,y) \in S} L(R(\phi),y)$


We introduce specialized versions of Algorithm 1 for the settings of probabilistic regression and classification and we define additional details of the method—the model R, the features $\phi$, and the objective L. Our goal is to estimate the distribution $\mathbb{P}(Y | H(X) = F)$. We choose to represent this distribution via its cumulative distribution function (CDF) or, equivalently, its inverse, the quantile function (QF). Learning a model of the CDF or of the QF facilitates computing confidence intervals and yields more numerically stable algorithms.

Without loss of generality, we train $R$ to fit the QF; the density can be obtained from the CDF or QF via a derivative. This approach also yields the following equivalent notion of distribution calibration: $\mathbb{P}(Y \leq y \mid F_X = F) = F(y)$ for all $y \in Y, F \in \triangle_Y$. Our approach for estimating the QF relies on quantile function regression (Si et al., 2021). We define this method below in terms of three components: the model $R$, the features $\phi$, and the objective $L$.
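The remark that a density can be recovered from the QF by a derivative follows from the identity $f(Q(\tau)) = 1 / Q'(\tau)$, where $Q = F^{-1}$. The sketch below checks it numerically with a central finite difference. The helper name, the step size `h`, and the use of a Gaussian as a stand-in for a learned quantile function are assumptions made for illustration.

```python
from statistics import NormalDist

def density_from_quantile_fn(qf, tau, h=1e-4):
    """Density at the tau-th quantile, f(Q(tau)) = 1 / Q'(tau), via a central difference."""
    dq_dtau = (qf(tau + h) - qf(tau - h)) / (2.0 * h)
    return 1.0 / dq_dtau

dist = NormalDist(mu=1.0, sigma=2.0)  # stand-in for a learned quantile function
tau = 0.3
y = dist.inv_cdf(tau)
print(density_from_quantile_fn(dist.inv_cdf, tau), dist.pdf(y))
```

The two printed values should agree to several decimal places, which is why a model of the QF is enough to evaluate densities when they are needed.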

The resulting method is shown in the following algorithm.

**Algorithm 2: Distribution Calibrated Regression**

**Inputs** Pre-trained model $H : X \rightarrow \triangle_Y$, recalibrator $R : [0,1] \times \phi(\triangle_Y) \rightarrow \mathbb{R}$, training set $D$, calibration set $C$

**Output** Recalibrated model $R \circ H : X \rightarrow ([0,1] \rightarrow \mathbb{R})$

1. Create a training set for recalibrator: $S= \{(\phi (H(x)),y) |x,y \in C\}$
2. Fit the recalibrator R on S using a proper loss: $\min_R \sum_{(\phi ,y) \in S} \mathbb{E}_{\tau \sim U[0,1]}\, \rho_\tau (R(\tau; \phi),y)$


**Model**: We learn a model $R_\theta(\tau; \phi(F)) : [0,1] \times \mathbb{R}^p \rightarrow \mathbb{R}$ of the inverse of the CDF of $\mathbb{P}(Y \mid H(X) = F)$. Specifically, $R_\theta(\tau; \phi(F))$ takes in a scalar $\tau$ and features $\phi$, and outputs an estimate of the $\tau$-th quantile of $\mathbb{P}(Y \mid H(X) = F)$. In our experiments, $R$ is parameterized by a fully-connected neural network with inputs $\tau$ and $\phi$.

**Features**: We form a p-dimensional representation of F by featurizing it via its quantiles $\phi(F) = (F^{−1}(\alpha_i))_{i=1}^p$ for some sequence of p levels $\alpha_i$, typically uniform in [0,1]. This parameterization works across diverse types of F, including parametric functions (e.g., Gaussians) or F represented via a set of samples. 
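A minimal sketch of this featurization for a Gaussian forecast is shown below. It is not the DistCal implementation: the helper `quantile_features` and the choice of levels $\alpha_i = i/(p+1)$ are assumptions, picked so that the levels are evenly spaced while avoiding 0 and 1, where a Gaussian's quantiles are infinite.

```python
from statistics import NormalDist

def quantile_features(dist, p=5):
    """Featurize a forecast by its quantiles at p evenly spaced interior levels."""
    # Levels i / (p + 1) stay strictly inside (0, 1).
    levels = [i / (p + 1) for i in range(1, p + 1)]
    return [dist.inv_cdf(a) for a in levels]

phi = quantile_features(NormalDist(mu=2.0, sigma=0.5), p=5)
print(phi)
```

The same function works for any forecast object that exposes an inverse CDF, which reflects why quantile features apply across different families of $F$. For a forecast given as samples, empirical quantiles could play the same role.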

**Objective**: We train $R$ using the quantile proper loss $L_q$. Specifically, when $R(\tau; \phi)$ is an estimate of the $\tau$-th quantile, our objective becomes $$\mathbb{E}_{\phi, y} L_q(\phi,y) = \mathbb{E}_{\phi,y}\,\mathbb{E}_{\tau \sim U[0,1]}\, \rho_\tau (R(\tau; \phi),y),$$ where $\rho_\tau$ is the check score. We minimize the objective using gradient descent, approximating the expectations using Monte Carlo.
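The training loop can be illustrated with a deliberately small stand-in for the neural network. In the sketch below, the forecast is featurized by its mean and standard deviation, and the recalibrator is the two-parameter affine map $R(\tau; \phi) = \text{mean} + \text{std} \cdot (a + b\, z_\tau)$, where $z_\tau$ is the standard normal quantile. The affine form, the learning-rate schedule, the clipping of $\tau$ to $[0.01, 0.99]$, and the synthetic data are all assumptions made for illustration; DistCal itself uses a fully-connected network.

```python
import random
from statistics import NormalDist

def fit_affine_quantile_recalibrator(means, stds, ys, steps=20000, seed=0):
    """Fit R(tau; mean, std) = mean + std * (a + b * z_tau) by SGD on the check score."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    a, b = 0.0, 1.0                    # start at the identity map (base forecast unchanged)
    for t in range(steps):
        i = rng.randrange(len(ys))     # Monte Carlo sample of a calibration point
        tau = rng.uniform(0.01, 0.99)  # Monte Carlo sample of the quantile level
        z = std_normal.inv_cdf(tau)
        q = means[i] + stds[i] * (a + b * z)
        # Subgradient of the check score with respect to the predicted quantile q.
        dloss_dq = -tau if ys[i] >= q else 1.0 - tau
        lr = 0.1 / (t + 1) ** 0.5      # decaying step size
        a -= lr * dloss_dq * stds[i]
        b -= lr * dloss_dq * stds[i] * z
    return a, b

# Hypothetical calibration set: outcomes have unit spread, but the base model reports std 2.
rng = random.Random(1)
means = [rng.uniform(-3.0, 3.0) for _ in range(2000)]
stds = [2.0] * len(means)
ys = [rng.gauss(m, 1.0) for m in means]

a, b = fit_affine_quantile_recalibrator(means, stds, ys)
print(a, b)
```

Because the base forecasts here are twice as wide as the data, the fitted scale `b` should end up near 0.5 and the shift `a` near 0, meaning the recalibrated quantiles are pulled back toward the mean. A neural network plays the same role when the required correction is not a simple rescaling.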

   
 



    



In classification, each $F \in \triangle_Y$ is categorical and can be represented as a vector $p_F \in \triangle_K$ of $K \geq 2$ class membership probabilities living in a simplex $\triangle_K$ over K-dimensional probability vectors.

**Algorithm 3: Distribution Calibrated Classification**

**Inputs** Pre-trained model $H : X \rightarrow \triangle Y$, recalibrator $R : \triangle_K \rightarrow \triangle_K$, training set $D$, calibration set $C$

**Output** Recalibrated model $R \circ H : X \rightarrow \triangle Y$

1. Create a training set for recalibrator: $S= \{(p_{H(x)},y) |x,y \in C\}$
2. Fit the recalibrator R on S using a proper loss: $\min_R \sum_{(p ,y) \in S} L_{\log} (R(p),y)$
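For intuition, the two steps above can be sketched in the binary case ($K = 2$) with a piecewise-constant recalibrator instead of a neural network. Within each bin of forecast probabilities, the recalibrator that minimizes the log loss on the calibration set is the empirical frequency of $Y = 1$; the sketch uses that frequency with add-one smoothing. The binning scheme, the smoothing, all function names, and the synthetic overconfident base model are assumptions for illustration, not the DistCal implementation.

```python
import math
import random

def fit_binned_recalibrator(probs, labels, num_bins=10):
    """Estimate P(Y = 1 | forecast in bin) on the calibration set, with add-one smoothing."""
    pos = [0] * num_bins
    tot = [0] * num_bins
    for p, y in zip(probs, labels):
        k = min(int(p * num_bins), num_bins - 1)
        pos[k] += y
        tot[k] += 1
    table = [(pos[k] + 1) / (tot[k] + 2) for k in range(num_bins)]
    return lambda p: table[min(int(p * num_bins), num_bins - 1)]

def log_loss(probs, labels, eps=1e-12):
    """Average negative log-likelihood of binary labels under forecast probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)

def sample(n, rng):
    """Hypothetical overconfident base model: its logit is three times the true logit."""
    probs, labels = [], []
    for _ in range(n):
        t = rng.random()                 # true P(Y = 1) for this input
        probs.append(t**3 / (t**3 + (1 - t)**3))
        labels.append(1 if rng.random() < t else 0)
    return probs, labels

rng = random.Random(0)
cal_probs, cal_labels = sample(5000, rng)
test_probs, test_labels = sample(5000, rng)

recalibrate = fit_binned_recalibrator(cal_probs, cal_labels)
before = log_loss(test_probs, test_labels)
after = log_loss([recalibrate(p) for p in test_probs], test_labels)
print(before, after)
```

On held-out data, the recalibrated forecasts should achieve a lower log loss than the overconfident base forecasts. Replacing the lookup table with a small network over the full probability vector gives the multi-class setting described above.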


## Demo of our techniques using the DistCal library

### Distribution calibrated classification 

Below, we demonstrate distribution calibration of probabilistic forecasts over a discrete output.

First, we import the necessary modules.

    


In [1]:
import torch
import sys
sys.path.append('../..')


from torchuq.transform.distcal_discrete import *
from torchuq.evaluate.distribution_cal import *
from torchuq.dataset.classification import *
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from torchuq.evaluate import categorical
from matplotlib import pyplot as plt

For this demo, we use the [UCI Digit classification dataset](https://archive.ics.uci.edu/dataset/80/optical+recognition+of+handwritten+digits/).

In [10]:
subset_uci = ['digits']


Below, we use logistic regression as a simple base model. It predicts class probabilities $(p_0, p_1, \ldots, p_9)$ over the discrete outputs 0 to 9.

We use an object of the **DiscreteDistCalibrator** class to train a recalibrator that takes the probabilistic outcome from the base model and outputs the recalibrated distribution, represented as $(p'_0, p'_1, \ldots, p'_9)$.

We use an independent calibration dataset to train the DiscreteDistCalibrator. We evaluate the quality of probabilistic uncertainty with the calibration score as defined [here](https://arxiv.org/pdf/2112.07184). We also report the classification accuracy on the test dataset before and after calibration.



In [14]:

for name in subset_uci:

	# 60% Train, 20% Calibration, 20% Test dataset
	dataset = get_classification_datasets(name, val_fraction=0.2, test_fraction=0.2, split_seed=0, normalize=True, verbose=True)
	
	train_dataset, cal_dataset, test_dataset = dataset
	X_train, y_train = train_dataset[:][0], train_dataset[:][1]
	X_cal, y_cal = cal_dataset[:][0], cal_dataset[:][1]
	X_test, y_test = test_dataset[:][0], test_dataset[:][1]
	
	# Simple logistic regression classifier trained
	reg = LogisticRegression(random_state=0).fit(X_train, y_train)
	print("=="*25)
	print(f"Classification accuracy on Train: {reg.score(X_train, y_train):.2}")
	print(f"Classification accuracy on Test: {reg.score(X_test, y_test):.2}")
	print("=="*25)


	# Predict probabilistic outcome on K classes, on the calibration and test datasets 
	pred_cal = torch.Tensor(reg.predict_proba(X_cal.numpy()))

	pred_test = torch.Tensor(reg.predict_proba(X_test.numpy()))

	

	# Initialize platt scaling comparison baseline, train the model on calibration dataset

	platt_calibrator = DiscreteDistCalibrator(verbose=True, platt_scaling=True)
	platt_calibrator.train(pred_cal, torch.Tensor(y_cal))

	platt_cal = platt_calibrator(pred_cal)
	platt_test = platt_calibrator(pred_test)

	# Use the DiscreteDistCalibrator class without platt scaling and train it on the calibration dataset

	calibrator = DiscreteDistCalibrator(verbose=True)
	calibrator.train(pred_cal, torch.Tensor(y_cal))

	output_cal = calibrator(pred_cal)
	output_test = calibrator(pred_test)



	# Evaluation
	print("=="*25)
	print(f"[Calibration Dataset] Calibration scores  \n 1) Before calibration = {discrete_cal_score(y_cal, pred_cal):.3f} \n 2) After Platt Scaling = {discrete_cal_score(y_cal, platt_cal):.3f} \n 3) After calibration = {discrete_cal_score(y_cal, output_cal):.3f}")
	print(f"[Calibration Dataset] Classification accuracy \n 1) Before calibration = {accuracy_score(y_cal, pred_cal.argmax(axis=1)):.3f} \n 2) After Platt Scaling = {accuracy_score(y_cal, platt_cal.argmax(axis=1)):.3f} \n 3) After calibration = {accuracy_score(y_cal, output_cal.argmax(axis=1)):.3f}")
	


	print(f"[Test Dataset] Calibration scores \n 1) Before calibration = {discrete_cal_score(y_test, pred_test):.3f} \n 2) After Platt Scaling = {discrete_cal_score(y_test, platt_test):.3f} \n 3) After calibration = {discrete_cal_score(y_test, output_test):.3f}")
	print(f"[Test Dataset] Classification accuracy \n 1)  Before calibration = {accuracy_score(y_test, pred_test.argmax(axis=1)):.3f}\n 2) After Platt Scaling = {accuracy_score(y_test, platt_test.argmax(axis=1)):.3f} \n 3) After calibration = {accuracy_score(y_test, output_test.argmax(axis=1)):.3f}")
	print("=="*25)

Loading dataset digits....
Splitting into train/val/test with 1079/359/359 samples
Done loading dataset digits
Classification accuracy on Train: 1.0
Classification accuracy on Test: 0.96
[Calibration Dataset] Calibration scores  
 1) Before calibration = 0.115 
 2) After Platt Scaling = 0.073 
 3) After calibration = 0.059
[Calibration Dataset] Classification accuracy 
 1) Before calibration = 0.972 
 2) After Platt Scaling = 0.978 
 3) After calibration = 0.978
[Test Dataset] Calibration scores 
 1) Before calibration = 0.123 
 2) After Platt Scaling = 0.104 
 3) After calibration = 0.081
[Test Dataset] Classification accuracy 
 1)  Before calibration = 0.964
 2) After Platt Scaling = 0.964 
 3) After calibration = 0.964


On the test set, the calibration score improves from 0.123 before calibration to 0.104 with the Platt scaling baseline (as implemented in our library) and to 0.081 with our method. Classification accuracy is not affected here: it stays at 0.964 on the test set in all three cases.

### Distribution calibrated regression

Below, we demonstrate distribution calibration of probabilistic uncertainty over a continuous output.

First, we import the necessary modules. For this demo, we use the [California Housing Dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html).



In [None]:
import numpy as np  # np.linspace is used in the evaluation cell below
import torch
import sys
sys.path.append('../..')

from torchuq.transform.distcal_continuous import *
from torchuq.transform.calibrate import *
from torchuq.evaluate.distribution_cal import *
from torchuq.dataset.regression import *
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_absolute_error
from torchuq.evaluate import quantile as q_eval
from matplotlib import pyplot as plt

#uci_dataset = ["cal_housing", "protein", "superconductivity"]

subset_uci = ["cal_housing"]



Below, we use Bayesian ridge regression as the base model to predict a probabilistic outcome, represented by the mean and standard deviation of a Gaussian outcome distribution.

We use an object of the **DistCalibrator** class to train a recalibrator that takes the probabilistic outcome from the base model and outputs the recalibrated distribution parameterized by a fixed number of equispaced quantiles. In this example, we use 20 equispaced quantiles to featurize the outcome distribution.

We use an independent calibration dataset to train the DistCalibrator. We evaluate the quality of probabilistic uncertainty with the check score and calibration score as defined [here](https://arxiv.org/pdf/2112.07184). 

In [None]:
# Number of evaluation buckets
num_buckets=20
name = "cal_housing"
for name in subset_uci:
    # 60% Train, 20% Calibration, 20% Test dataset
    dataset = get_regression_datasets(name, val_fraction=0.2, test_fraction=0.2, split_seed=0, normalize=True, verbose=True)

    train_dataset, cal_dataset, test_dataset = dataset
    X_train, y_train = train_dataset[:][0], train_dataset[:][1]
    X_cal, y_cal = cal_dataset[:][0], cal_dataset[:][1]
    X_test, y_test = test_dataset[:][0], test_dataset[:][1]

    # Bayesian Ridge Regression to obtain probabilistic outcomes parameterized by the mean and std deviation of Gaussian outcome for each data-point
    reg = BayesianRidge().fit(X_train, y_train)
    print(f"Coeff of determination (R^2) on Train: {reg.score(X_train, y_train):.2}")
    print(f"Coeff of determination (R^2) on Test: {reg.score(X_test, y_test):.2}")



    # Predict mean and std deviation of the outcome distribution on the calibration and test datasets 
    mean_cal, std_dev_cal = reg.predict(X_cal.numpy(), return_std=True)
    mean_cal, std_dev_cal = torch.Tensor(mean_cal), torch.Tensor(std_dev_cal)

    mean_test, std_dev_test = reg.predict(X_test.numpy(), return_std=True)
    mean_test, std_dev_test = torch.Tensor(mean_test), torch.Tensor(std_dev_test)

    params_cal = torch.cat((mean_cal.reshape(-1, 1), std_dev_cal.reshape(-1, 1)), axis=1)
    params_test = torch.cat((mean_test.reshape(-1, 1), std_dev_test.reshape(-1, 1)), axis=1)

    # Convert probabilistic predictions to quantiles
    quantiles_cal = convert_normal_to_quantiles(mean_cal, std_dev_cal, num_buckets)
    quantiles_test = convert_normal_to_quantiles(mean_test, std_dev_test, num_buckets)



    # Use the DistCalibrator class and train it on the calibration dataset
    # Here, the recalibrator uses a fixed number of equispaced quantiles as featurization of the probabilistic outcome
    calibrator = DistCalibrator(num_buckets = num_buckets, quantile_input=True, verbose=True)
    calibrator.train(quantiles_cal, torch.Tensor(y_cal), num_epochs=10)


    quantile_calibrator = RegressionCalibrator()
    input_cdf = torch.distributions.Normal(mean_cal, std_dev_cal).cdf(y_cal)
    empirical_cdf = compute_empirical_cdf(input_cdf)
    quantile_calibrator.train(input_cdf, empirical_cdf)


    calibrated_quantiles_cal = convert_normal_cdf_to_quantiles(mean_cal, std_dev_cal, torch.Tensor(quantile_calibrator.inverse_calibrator(num_buckets=num_buckets)))
    calibrated_quantiles_test = convert_normal_cdf_to_quantiles(mean_test, std_dev_test, torch.Tensor(quantile_calibrator.inverse_calibrator(num_buckets=num_buckets)))

    # Use the code below instead if the Gaussian forecasts are featurized by their mean and standard deviation
    # calibrator = DistCalibrator(quantile_input=False, verbose=True)
    # calibrator.train(params_cal, torch.Tensor(y_cal))

    # Evaluation: compare check scores before calibration, after quantile calibration, and after distribution calibration
    print("="*25)
    check_score_before, check_score_after = comparison_quantile_check_score(quantiles_cal, torch.Tensor(y_cal), np.linspace(0, 1, num_buckets), model=calibrator)

    _ , check_score_baseline = comparison_quantile_check_score(quantiles_cal, torch.Tensor(y_cal), np.linspace(0, 1, num_buckets), quant_calibrated_outcome=calibrated_quantiles_cal)

    print(f"[Calibration Split]\n 1) Check score before calibration={check_score_before} \n 2) Check score after quantile calibration={check_score_baseline} \n 3) Check score after dist calibration={check_score_after}")


    print("="*25)

    check_score_before, check_score_after = comparison_quantile_check_score(quantiles_test, torch.Tensor(y_test), np.linspace(0, 1, num_buckets), model=calibrator)

    _, check_score_baseline = comparison_quantile_check_score(quantiles_test, torch.Tensor(y_test), np.linspace(0, 1, num_buckets), quant_calibrated_outcome=calibrated_quantiles_test)

    print(f"[Test Split] \n 1) Check score before calibration={check_score_before} \n 2) Check score after quantile calibration={check_score_baseline} \n 3) Check score after dist calibration={check_score_after}")

    print("="*25)




Loading dataset cal_housing....
Splitting into train/val/test with 12384/4128/4128 samples
Done loading dataset cal_housing
Coeff of determination (R^2) on Train: 0.61
Coeff of determination (R^2) on Test: 0.6
[Calibration Split]
 1) Check score before calibration=3.1939690113067627 
 2)Check score after quantile calibration=3.1262903213500977 
 3) Check score after dist calibration=3.050403356552124
[Test Split] 
 1) Check score before calibration=3.199695587158203 
 2) Check score after quantile calibration=3.1359283924102783 
 3) Check score after dist calibration=3.0494537353515625


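Above, we see that distribution calibration improves the check score relative to the quantile calibration baseline (Kuleshov et al., 2018). On the test split, the check score is 3.200 before calibration, 3.136 after quantile calibration, and 3.049 after distribution calibration.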

## Conclusions and closing thoughts




In this post, we saw that accurate predictive uncertainties are fully characterized by two properties: calibration and sharpness. We argued that predictive uncertainties should maximize sharpness subject to being calibrated (Gneiting et al., 2007) and proposed a recalibration technique that achieves this goal. We demonstrated the implementation of our techniques via the DistCal library on a classification and a regression task.

Our technique guarantees distribution calibration for a wide range of base models and is easy to implement in a few lines of code. It applies to both classification and regression and is formally guaranteed to produce asymptotically distributionally calibrated forecasts while minimizing regret. Finally, our analysis formalizes the well-known paradigm of Gneiting et al. (2007) and provides a simple method that provably achieves it. This lends strong support to this principle and informs how one should reason about uncertainty in machine learning. We believe that an important takeaway of our work is that calibration should be leveraged more broadly throughout machine learning.


