# Conformalized Quantile Regression

In this notebook, we implement Conformalized Quantile Regression (CQR) to produce statistically valid prediction intervals for the Boston Housing dataset. First, we train quantile regression models to estimate uncalibrated lower and upper bounds for house prices. Because these model-based intervals generally do not achieve the desired coverage level, we perform conformal calibration on a held-out calibration set. By computing nonconformity scores and selecting an appropriate quantile, we obtain a single calibration value qhat that adjusts all intervals just enough to guarantee finite-sample marginal coverage of at least 1−α.


## Imports

In [2]:
# Import numerical library
import numpy as np
# Plotting library (imported under its conventional alias)
import matplotlib.pyplot as plt
# High-level plotting library built on top of matplotlib
import seaborn as sns
# Loads datasets from OpenML (we use it to get the Boston Housing dataset)
from sklearn.datasets import fetch_openml
# Splits data into train and test subsets
from sklearn.model_selection import train_test_split
# Tree-based regression model that can do quantile regression when we set loss="quantile"
from sklearn.ensemble import GradientBoostingRegressor


## Load the Boston Housing data

The Boston Housing dataset contains:
- X: features (e.g., crime rate, number of rooms)
- y: target variable (median house value in thousands of dollars)

In [3]:
# Downloads the dataset named "boston" from OpenML
# as_frame=False returns NumPy arrays (boston.data, boston.target) instead of a pandas DataFrame
boston = fetch_openml(name="boston", version=1, as_frame=False)
# X is a 2D array of shape (n_samples, n_features)
X = boston.data
# We cast the target vector to float
labels = boston.target.astype(float)

print("X shape:", X.shape)
print("y shape:", labels.shape)


X shape: (506, 13)
y shape: (506,)


## Train base models for uncalibrated prediction intervals

__Uncalibrated prediction intervals__ = intervals whose coverage probability is unknown or incorrect

We split the data:
- 50% for training the models
- 50% for testing/calibration/validation

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=42
)


We want an interval with 1 − α = 0.9, i.e. 90% probability mass. In terms of quantiles:
- Lower bound: 5th percentile (0.05)
- Upper bound: 95th percentile (0.95)
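
As a quick sanity check on this reasoning, the sketch below draws synthetic values (an assumption for illustration, not the housing data) and confirms that the interval between the empirical 5th and 95th percentiles holds roughly 90% of the samples.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
# Synthetic "prices", purely illustrative
samples = rng.normal(loc=22.0, scale=9.0, size=10_000)

alpha = 0.1
lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])

# Fraction of samples that fall inside [lo, hi]
mass = np.mean((samples >= lo) & (samples <= hi))
print(f"[{lo:.2f}, {hi:.2f}] contains {mass:.1%} of the samples")
```

This only checks unconditional quantiles of one sample; the quantile regressors in this notebook estimate the same two levels conditionally on the features.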

In [6]:
alpha = 0.1  # target miscoverage; we want 90% coverage (1 - alpha)
# lower_alpha and upper_alpha are the quantile levels we will feed to the quantile regressors
lower_alpha = alpha / 2        # 0.05
upper_alpha = 1 - alpha / 2    # 0.95


__Now we define three models__

__lower_model__: Given features X=x, the model predicts a value such that
only 5% of the Y-values are below it, and 95% are above it.

__upper_model__: Given features X=x, the model predicts a value such that
95% of Y-values are below it, and only 5% are above it.

__mid_model__: a standard regression model that predicts the conditional mean E[Y∣X], just for reference.
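
To make the link between quantile levels and the training objective concrete, here is a minimal sketch of the pinball (quantile) loss. The helper name `pinball_loss` and the synthetic target are assumptions for illustration; this is not scikit-learn's internal implementation. Scanning constant predictions shows that the loss at level 0.95 is smallest near the empirical 95th percentile.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss at level q in (0, 1)."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

rng = np.random.default_rng(0)
y = rng.exponential(scale=10.0, size=5_000)  # synthetic skewed target, illustrative only

# Scan constant predictions and keep the one with the smallest loss at q = 0.95
candidates = np.linspace(y.min(), y.max(), 2_001)
losses = [pinball_loss(y, c, 0.95) for c in candidates]
best = candidates[int(np.argmin(losses))]

print("minimiser of pinball loss:", round(float(best), 2))
print("empirical 95th percentile:", round(float(np.quantile(y, 0.95)), 2))
```

The asymmetric penalty is what pushes the upper model high and the lower model low: at q = 0.95, predicting too low is 19 times more expensive than predicting too high.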


In [7]:
# loss="quantile" tells the GBM to minimize the quantile (pinball) loss
lower_model = GradientBoostingRegressor(
    loss="quantile", alpha=lower_alpha, random_state=42
)
upper_model = GradientBoostingRegressor(
    loss="quantile", alpha=upper_alpha, random_state=42
)
mid_model = GradientBoostingRegressor(
    loss="squared_error", random_state=42
)


In [8]:
# Fit models
lower_model.fit(X_train, y_train)
upper_model.fit(X_train, y_train)
mid_model.fit(X_train, y_train)


In [10]:
# Predict on X_test
lower = lower_model.predict(X_test)   # estimated lower quantile (uncalibrated)
upper = upper_model.predict(X_test)   # estimated upper quantile (uncalibrated)
y_test_pred = mid_model.predict(X_test)  # mean predictions (unused in conformal)


In [11]:
# X features for test data
X = X_test
# Ground truth values
labels = y_test
val_X_full = X

print("After train/test split, X shape:", X.shape)
print("lower/upper shapes:", lower.shape, upper.shape)


After train/test split, X shape: (253, 13)
lower/upper shapes: (253,) (253,)


## Split into calibration and validation

We want n points for calibration; the rest are used for final evaluation/validation. Boston is small, so we cap n at 100.

In [12]:
n = min(100, labels.shape[0] // 2)
# labels.shape[0] is the number of test samples
# np.array([1] * n + [0] * (labels.shape[0] - n)) creates an array like [1, 1, ..., 1, 0, 0, ..., 0]
# Comparing it with > 0 turns it into [True, True, ..., False, ...], which is our mask
idx = np.array([1] * n + [0] * (labels.shape[0] - n)) > 0
np.random.shuffle(idx)  # shuffles the True/False positions randomly
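
As an aside, the cell above shuffles with NumPy's global random state and no seed, so the calibration/validation split changes from run to run. A possible reproducible variant is sketched below; `make_calibration_mask` is a hypothetical helper name, and using it would produce a different split (and therefore different numbers) than the outputs shown in this notebook.

```python
import numpy as np

def make_calibration_mask(n_total, n_cal, seed=0):
    """Boolean mask with exactly n_cal True entries at random positions."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n_total, dtype=bool)
    # Choose n_cal distinct positions without replacement
    mask[rng.choice(n_total, size=n_cal, replace=False)] = True
    return mask

mask = make_calibration_mask(n_total=253, n_cal=100, seed=42)
print(mask.sum(), (~mask).sum())  # calibration count, validation count
```
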


We apply the boolean mask:

In [13]:
# idx: True marks calibration points
# ~idx: the remaining points form the validation set

cal_labels, val_labels = labels[idx], labels[~idx] 
cal_upper,  val_upper  = upper[idx],  upper[~idx]
cal_lower,  val_lower  = lower[idx],  lower[~idx]
val_X = X[~idx]

print("Calibration size:", cal_labels.shape[0])
print("Validation size:", val_labels.shape[0])


Calibration size: 100
Validation size: 153


# Conformal Prediction

## Nonconformity Score

$s_i = \max(y_i - U_i,\; L_i - y_i)$ <br>
- s_i measures how badly the interval failed on point i.

- Negative or zero = interval contained the label.

- Positive = label outside the interval by s_i.
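
A small worked example may help with the sign convention. The three toy points below are made up for illustration: one label inside its interval, one above the upper bound, and one below the lower bound.

```python
import numpy as np

# Toy values, not taken from the dataset
y = np.array([20.0, 35.0, 8.0])   # true labels
L = np.array([15.0, 22.0, 10.0])  # predicted lower bounds
U = np.array([25.0, 30.0, 18.0])  # predicted upper bounds

scores = np.maximum(y - U, L - y)
# Point 1: inside, 5 away from the nearest bound  -> score -5
# Point 2: 5 above the upper bound                -> score  5
# Point 3: 2 below the lower bound                -> score  2
print(scores)
```
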

In [15]:
cal_scores = np.maximum(cal_labels - cal_upper, cal_lower - cal_labels)


## Compute the conformal quantile

We want to choose qhat so that adjusting the intervals by qhat achieves at least (1 − α) marginal coverage on new data, with no assumptions beyond exchangeability of the calibration and new points.
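
One way to read this rule is as picking the k-th smallest calibration score with k = ⌈(n + 1)(1 − α)⌉. The sketch below spells that out; `conformal_qhat` is a hypothetical helper and the scores are synthetic. Calling `np.quantile` at level k/n with the `higher` rule may select a neighbouring order statistic that is at least as large, which keeps the guarantee and is, if anything, slightly more conservative.

```python
import math
import numpy as np

def conformal_qhat(scores, alpha):
    """k-th smallest score with k = ceil((n + 1) * (1 - alpha))."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.shape[0]
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only an infinite
        # offset would give the guarantee.
        return math.inf
    return float(scores[k - 1])  # k is 1-based

rng = np.random.default_rng(0)
toy_scores = rng.normal(size=100)  # synthetic scores, illustrative only
print(conformal_qhat(toy_scores, alpha=0.1))  # the 91st smallest of 100 scores
```

The explicit `k > n` branch also covers the small-sample case where the quantile level would exceed 1.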

In [17]:
# Finite-sample-corrected quantile level: ceil((n + 1) * (1 - alpha)) / n
q_level = np.ceil((n + 1) * (1 - alpha)) / n
# `method` replaces the deprecated `interpolation` keyword (NumPy >= 1.22)
qhat = np.quantile(cal_scores, q_level, method='higher')

print("Quantile level used:", q_level)
print("qhat (calibration offset):", qhat)


Quantile level used: 0.91
qhat (calibration offset): 2.349961807174717


## Apply calibration to validation set

We expand both sides by qhat.

If the original model was too narrow, qhat will be positive and enlarge the interval.

If the model was already conservative, qhat might be small or even negative, in which case the interval shrinks.
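
The effect of the sign of qhat can be seen on two toy intervals. The bounds and offsets below are made up, and `conformalize` is a hypothetical helper that mirrors the adjustment used in this notebook.

```python
import numpy as np

def conformalize(lower, upper, qhat):
    """Shift the lower bound down and the upper bound up by qhat."""
    return lower - qhat, upper + qhat

lower = np.array([10.0, 20.0])  # toy lower bounds
upper = np.array([14.0, 30.0])  # toy upper bounds

wide_lo, wide_hi = conformalize(lower, upper, qhat=2.0)       # positive: widen
narrow_lo, narrow_hi = conformalize(lower, upper, qhat=-1.0)  # negative: shrink

print("original widths:", upper - lower)
print("qhat = +2 widths:", wide_hi - wide_lo)
print("qhat = -1 widths:", narrow_hi - narrow_lo)
```

Every interval changes width by the same amount, 2 · qhat, regardless of how wide it was to begin with.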

In [18]:
prediction_sets = [val_lower - qhat, val_upper + qhat]


We also keep the uncalibrated version for comparison:

In [19]:
prediction_sets_uncalibrated = [val_lower, val_upper]


## Compute Empirical Coverage

In [20]:
empirical_coverage_uncalibrated = (
    (val_labels >= prediction_sets_uncalibrated[0]) &
    (val_labels <= prediction_sets_uncalibrated[1])
).mean()

print(f"The empirical coverage before calibration is: {empirical_coverage_uncalibrated}")


The empirical coverage before calibration is: 0.8104575163398693


In [21]:
empirical_coverage = (
    (val_labels >= prediction_sets[0]) &
    (val_labels <= prediction_sets[1])
).mean()

print(f"The empirical coverage after calibration is: {empirical_coverage}")


The empirical coverage after calibration is: 0.9673202614379085
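
The coverage guarantee is marginal: it holds on average over random draws of the calibration and validation data, so a single split can land above or below 1 − α. The simulation below is a rough sketch of that variability. It uses synthetic exchangeable scores (an assumption, not the housing data) with the same calibration and validation sizes as above, and takes qhat as the k-th smallest calibration score.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_cal, n_val, n_trials = 0.1, 100, 153, 2_000
k = int(np.ceil((n_cal + 1) * (1 - alpha)))  # rank of the score used as qhat

coverages = np.empty(n_trials)
for t in range(n_trials):
    # Synthetic exchangeable scores; the distribution itself is arbitrary
    cal = rng.normal(size=n_cal)
    val = rng.normal(size=n_val)
    qhat_sim = np.sort(cal)[k - 1]
    # A validation point is covered exactly when its score is <= qhat
    coverages[t] = np.mean(val <= qhat_sim)

print("mean coverage over trials:", coverages.mean().round(3))
print("5th-95th percentile of single-run coverage:",
      np.quantile(coverages, [0.05, 0.95]).round(3))
```

The average sits close to the target, while individual runs spread around it, which is worth keeping in mind when reading a single empirical coverage figure.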


# Show sample predictions

In [23]:


print("\n=== Example Predictions (True vs Intervals) ===\n")

# Choose how many examples to show
num_examples = 10

# Select random indices from validation set
rnd_idx = np.random.choice(len(val_labels), size=num_examples, replace=False)

for i, idx_i in enumerate(rnd_idx):
    # Read the true value
    y_true = val_labels[idx_i]
    # Read the uncalibrated model interval
    L_uncal = val_lower[idx_i]
    U_uncal = val_upper[idx_i]
    # Build the conformally calibrated interval
    L_cal = val_lower[idx_i] - qhat
    U_cal = val_upper[idx_i] + qhat
    # Check whether each interval covers the true value
    covered_uncal = (y_true >= L_uncal) and (y_true <= U_uncal)
    covered_cal   = (y_true >= L_cal)   and (y_true <= U_cal)
    
    print(f"Example {i+1}")
    print(f"True value: {y_true:.3f}")
    print(f"Uncalibrated interval: [{L_uncal:.3f}, {U_uncal:.3f}]  -> Covered? {covered_uncal}")
    print(f"Calibrated interval:   [{L_cal:.3f}, {U_cal:.3f}]  -> Covered? {covered_cal}")
    print("-" * 60)



=== Example Predictions (True vs Intervals) ===

Example 1
True value: 11.700
Uncalibrated interval: [10.742, 21.959]  -> Covered? True
Calibrated interval:   [8.393, 24.309]  -> Covered? True
------------------------------------------------------------
Example 2
True value: 42.800
Uncalibrated interval: [21.700, 47.881]  -> Covered? True
Calibrated interval:   [19.350, 50.231]  -> Covered? True
------------------------------------------------------------
Example 3
True value: 10.500
Uncalibrated interval: [7.219, 21.959]  -> Covered? True
Calibrated interval:   [4.869, 24.309]  -> Covered? True
------------------------------------------------------------
Example 4
True value: 48.800
Uncalibrated interval: [21.601, 50.220]  -> Covered? True
Calibrated interval:   [19.251, 52.570]  -> Covered? True
------------------------------------------------------------
Example 5
True value: 36.400
Uncalibrated interval: [21.601, 44.789]  -> Covered? True
Calibrated interval:   [19.251, 47.139]  -