## Chapter 7: Conformal Prediction for Regression

- CP for regression produces prediction **intervals**, as opposed to prediction sets for classification, but the idea is remarkably similar 

### Is a prediction interval the same as a confidence interval?

- Confidence interval:
    - What is the uncertainty in a parameter estimate (e.g. a mean) that is based on a sample statistic?
    - It is aggregate-level parameter uncertainty that exists because you only have a sample of data

- Prediction interval:
    - What is the uncertainty in **this** specific prediction?
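
To make the distinction concrete, here is a minimal sketch for the simplest possible "model" (predicting with the sample mean). It assumes roughly normal data and uses the normal quantile `z = 1.96` for a 95% level; the data are synthetic and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample (assumption): 400 draws from N(50, 10^2)
sample = rng.normal(loc=50.0, scale=10.0, size=400)

n = sample.size
mean = sample.mean()
sd = sample.std(ddof=1)
z = 1.96  # approximate 95% normal quantile

# Confidence interval: uncertainty about the population mean
ci_half = z * sd / np.sqrt(n)
# Prediction interval: uncertainty about one new observation
pi_half = z * sd * np.sqrt(1 + 1 / n)

print(f"CI for the mean:      [{mean - ci_half:.2f}, {mean + ci_half:.2f}]")
print(f"PI for a new value:   [{mean - pi_half:.2f}, {mean + pi_half:.2f}]")
```

The confidence interval shrinks towards zero width as `n` grows, while the prediction interval stays roughly `z * sd` wide on each side because individual outcomes remain noisy.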

### Components of a prediction interval

- Similar to classification:
    - Lower/Upper bounds
    - Coverage probability (i.e. confidence level) 
        - 90%/95%/99%
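
Given lower/upper bounds, the two quantities usually checked are the empirical coverage and the average interval width. A minimal sketch with hypothetical helper names (`empirical_coverage`, `mean_width`) and toy numbers:

```python
import numpy as np

def empirical_coverage(y_true, lower, upper):
    """Fraction of true values that fall inside their interval."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

def mean_width(lower, upper):
    """Average interval width (upper minus lower)."""
    return float(np.mean(np.asarray(upper) - np.asarray(lower)))

# Toy example: 3 of the 4 true values are covered
y_true = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 2.5, 2.0, 3.0]
upper = [1.5, 3.5, 4.0, 5.0]

print(empirical_coverage(y_true, lower, upper))  # 0.75
print(mean_width(lower, upper))                  # 1.5
```

A good method reaches the target coverage (e.g. 90%) with intervals that are as narrow as possible.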

### Measuring uncertainty in regression

- Use a confidence interval directly
    - Compute a confidence interval from the dataset
    - Add/subtract the resulting fixed value to/from each point prediction

- Resampling methods
    - Repeatedly resample a training set from your dataset excluding the specific test sample
    - Train a model on the sampled training set, and use it to form a prediction for the test sample
    - Collect these predictions into a distribution and use its min/max as the bounds of the "prediction interval"

- Bayesian methods
    - Use Bayesian methods to estimate the posterior distribution of the underlying model
    - Since the underlying model parameters have some distribution, naturally the model's predictions also follow some distribution

- Monte Carlo methods
    - Similar to resampling, but instead of choosing observations from all available observations, it goes one level deeper to resample from each input variable
    - Having reconstructed the training set this way, we can then run the regression again to get a predicted value for the unknown observation

- Sensitivity analysis
    - Adjust input values to determine how the prediction changes for an unknown observation
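
The resampling approach from the list above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the model is a straight line fitted by least squares, and `fit_line` is a hypothetical helper. Note that the spread of refitted predictions mainly reflects model uncertainty, so such intervals can be too narrow for individual outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 2x + 1 + noise
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=n)

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

x_new = 5.0   # the test sample; it is not part of (x, y)
n_boot = 500
preds = np.empty(n_boot)
for b in range(n_boot):
    # Resample a training set with replacement and refit the model
    idx = rng.integers(0, n, size=n)
    slope, intercept = fit_line(x[idx], y[idx])
    preds[b] = slope * x_new + intercept

# Min/max of the resampled predictions, as described in the notes;
# percentiles (e.g. 5th and 95th) are a common, less extreme alternative
lower, upper = preds.min(), preds.max()
print(f"Resampling interval at x = {x_new}: [{lower:.2f}, {upper:.2f}]")
```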


- Comparing methods and their features
    - Marginal Validity: Is the stated coverage guaranteed without assuming anything about the distribution of errors/parameters?
    - Scalability: Are we able to apply this to large datasets?
    - Domain Knowledge: Do we need knowledge of the specific problem domain?
    - Validation set: Is a separate validation set necessary?

| Method | Marginal Validity | Scalability | Domain Knowledge | Validation Set |
| --- | --- | --- | --- | --- |
| Bayesian Methods (Gaussian process and approximate GP) | No | Only with approximate inference | Yes | No |
| Ensemble Methods (Dropout ensemble, deep ensembles, and mean-variance estimator) | No | Yes if scalable models are used | No | No |
| Direct Interval Estimation (Neural network quantile regression) | No | Yes | No | Yes |
| Conformal Prediction (Neural networks and random forest) | Yes | Yes (For ICP) | No | Yes |
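
The "Marginal Validity" entry for conformal prediction can be stated more precisely. As a sketch, assuming the calibration and test points are exchangeable and absolute residuals are used as the nonconformity score (the symbols $n$, $\alpha$, $\hat{f}$, $s_i$ and $\hat{q}$ are notation introduced here, not taken from the chapter):

$$
s_i = |y_i - \hat{f}(x_i)|, \quad i = 1, \dots, n, \qquad
\hat{q} = s_{(\lceil (n+1)(1-\alpha) \rceil)}
$$

$$
C(x) = [\hat{f}(x) - \hat{q},\ \hat{f}(x) + \hat{q}], \qquad
\mathbb{P}\big(Y_{n+1} \in C(X_{n+1})\big) \ge 1 - \alpha
$$

Here $s_{(k)}$ is the $k$-th smallest calibration score, with $\hat{q} = \infty$ if $k > n$. The guarantee is marginal: it holds on average over calibration and test data, not conditionally for each individual $x$.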

### Practical: ICP for Regression

- https://github.com/PacktPublishing/Practical-Guide-to-Applied-Conformal-Prediction/blob/main/Chapter_07.ipynb

```python
import warnings

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnnotationBbox, TextArea
from matplotlib.ticker import FormatStrFormatter
import plotly.express as px
import plotly.graph_objects as go

import openml
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split
from lightgbm import LGBMRegressor
from scipy.stats import randint, uniform
from tqdm import tqdm
from yellowbrick.regressor import PredictionError

from crepes import WrapRegressor
from crepes.extras import DifficultyEstimator, binning
from mapie.metrics import regression_coverage_score, regression_mean_width_score
from mapie.regression import MapieQuantileRegressor, MapieRegressor
from mapie.subsample import Subsample

%matplotlib inline

warnings.filterwarnings('ignore')

random_state = 23
rng = np.random.default_rng(random_state)
round_to = 3
```

#### Get dataset

```python
# California housing dataset https://www.openml.org/search?type=data&status=active&id=43939
dataset = openml.datasets.get_dataset(43939)

# Print a summary
print(
    f"This is dataset '{dataset.name}', the target feature is "
    f"'{dataset.default_target_attribute}'"
)
print(f"URL: {dataset.url}")
print(dataset.description)
```

#### Clean dataset

```python
# openml API
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
df = pd.DataFrame(X, columns=attribute_names)
df["class"] = y

df['ocean_proximity'] = df['ocean_proximity'].astype('category')
df.dropna(subset=['total_bedrooms'], inplace=True)

# Separate features and target
X = df.drop('class', axis=1)
y = df['class']
```

#### Inductive Conformal Prediction

```python
# Hold out 1000 samples for testing, then 1000 more for calibration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, shuffle=True, random_state=42)
# random_state set here too, so the calibration split is reproducible
X_proper_train, X_cal, y_proper_train, y_cal = train_test_split(X_train, y_train, test_size=1000, random_state=42)
print('Data split. Parts sizes: train = {}, calib = {}, test = {}'.format(X_proper_train.shape, X_cal.shape, X_test.shape))
```
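
With a proper-training/calibration/test split in place, the remaining ICP steps are: fit on the proper training set, score the calibration set, take a quantile of the scores, and widen each test prediction by that quantile. The sketch below is not the notebook's code. It assumes synthetic data, a plain least-squares model instead of the notebook's regressors, and absolute residuals as the nonconformity score; `add_intercept` and `predict` are hypothetical helpers.

```python
import math
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the housing data (assumption)
n = 3000
X = rng.uniform(0, 10, size=(n, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=n)

# Three-way split: proper training, calibration, test
X_proper_train, y_proper_train = X[:1000], y[:1000]
X_cal, y_cal = X[1000:2000], y[1000:2000]
X_test, y_test = X[2000:], y[2000:]

def add_intercept(X):
    return np.column_stack([X, np.ones(len(X))])

# 1. Fit the underlying model on the proper training set only
coef, *_ = np.linalg.lstsq(add_intercept(X_proper_train), y_proper_train, rcond=None)

def predict(X):
    return add_intercept(X) @ coef

# 2. Nonconformity scores on the calibration set
scores = np.abs(y_cal - predict(X_cal))

# 3. Conformal quantile with the finite-sample (n + 1) correction
alpha = 0.1  # target coverage of 90%
n_cal = len(scores)
k = math.ceil((n_cal + 1) * (1 - alpha))
q_hat = np.sort(scores)[k - 1] if k <= n_cal else np.inf

# 4. Prediction intervals of constant width around each test prediction
y_pred = predict(X_test)
lower, upper = y_pred - q_hat, y_pred + q_hat

coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"q_hat = {q_hat:.3f}, empirical test coverage = {coverage:.3f}")
```

The model never sees the calibration data during fitting; that separation is what lets the calibration residuals stand in for the unseen test residuals.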

### Resources

- Comparing methods for generating prediction intervals for regression (https://arxiv.org/abs/2107.00363)