## Chapter 3: Fundamentals of Conformal Prediction

- We will cover 2 types of conformal prediction: (i) Inductive conformal prediction (ICP) and (ii) Transductive conformal prediction (TCP)
    - In TCP, you build your nonconformity score relative to your entire dataset
    - In ICP, you build your nonconformity score relative to your calibration dataset 

- Components of a Conformal Predictor
    - **Nonconformity measure**: A function that scores how different a new data point is from the entire dataset (TCP) or from your calibration set (ICP)
    - **Calibration set**: A held-out portion of your labelled data (representative of the whole population) on which nonconformity scores are computed. Using these scores, we will establish the prediction interval/region for new data points
    - **Test set**: The new points you want to test

### Non-conformity Measure

- How should we measure an observation's non-conformity?

- 2 ways: (i) model-dependent and (ii) model-independent non-conformity

- Model-dependent nonconformity measures: 
    - distance to support vectors in support vector machines
    - the residual error in linear regression models
    - discrepancy between predicted and actual class probabilities in probabilistic classifiers

- Model-agnostic (i.e. model-independent) nonconformity measures. This series will focus on these
    - hinge loss
    - margin 
    - Brier score

#### Classification Problems

##### Hinge Loss

- In a classification problem, suppose the true label of an observation is "Class 1"
- And a model predicts a probability of "Class 1" as $P(X)$
- Then
$$\begin{aligned}
    \text{Hinge Loss} &= 1 - P(X)
\end{aligned}$$

- Example
    - Model class probability outputs are Class 1: 0.5, Class 2: 0.3, Class 3: 0.2
    - True label is Class 2
    - Then hinge loss is $1-0.3=0.7$
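
The example above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `hinge_score` and the dictionary layout for class probabilities are assumptions made for illustration, not part of any library.

```python
def hinge_score(probs: dict[str, float], true_label: str) -> float:
    """Hinge-loss nonconformity: 1 minus the probability given to the true label."""
    return 1.0 - probs[true_label]


# Class probabilities from the example above.
probs = {"Class 1": 0.5, "Class 2": 0.3, "Class 3": 0.2}

score = hinge_score(probs, "Class 2")
print(round(score, 2))  # 0.7
```

The score depends only on the probability assigned to the true label, so the way the remaining probability mass is spread over the other classes has no effect on it.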

##### Margin

- The difference between the predicted probability of the most likely class that isn't the true class and the predicted probability of the true class
- That is:
$$\begin{aligned}
    \text{Margin} = \max_{y_i \neq y_{\text{true}}} P(y_i | X) - P(y_{\text{true}} | X)
\end{aligned}$$

- The more positive the margin, the more "nonconformal" the observation is

- Example
    - Model class probability outputs are Class 1: 0.5, Class 2: 0.3, Class 3: 0.2
    - True label is Class 2
    - Then margin is $0.5-0.3=0.2$
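
A minimal sketch of the margin score for the same example, assuming class probabilities are held in a dictionary. The function name `margin_score` is hypothetical.

```python
def margin_score(probs: dict[str, float], true_label: str) -> float:
    """Margin nonconformity: best competing probability minus the true-label probability."""
    best_other = max(p for label, p in probs.items() if label != true_label)
    return best_other - probs[true_label]


probs = {"Class 1": 0.5, "Class 2": 0.3, "Class 3": 0.2}

score_true_2 = margin_score(probs, "Class 2")  # 0.5 - 0.3
score_true_1 = margin_score(probs, "Class 1")  # 0.3 - 0.5
print(round(score_true_2, 2), round(score_true_1, 2))  # 0.2 -0.2
```

When the true label is also the most likely class, the margin is negative, which marks the observation as conforming.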

##### Brier Score

- Considered "proper" in the sense that it gives a true reflection of how good or bad a predicted probability is (no sharp cutoffs, consistent marginal change w.r.t. distance from true probability)

- Brier score is computed as the mean of squared differences between the labels (1 for the true class, 0 otherwise) and their predicted probabilities, taken over the $n$ classes

- Example
    - Model class probability outputs are Class 1: 0.5, Class 2: 0.3, Class 3: 0.2
    - True label is Class 2
    - Then the Brier score is $\frac{(0 - 0.5)^2 + (1 - 0.3)^2 + (0 - 0.2)^2}{3} = 0.26$

$$\begin{aligned}
    \text{Brier Score} = \frac{\sum_{i=1}^{n} (\text{Label}_i - P(y_i | X))^2}{n}
\end{aligned}$$
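
The formula can be checked against the example with a short sketch. It assumes the label is one-hot encoded (1 for the true class, 0 otherwise) and that the average is taken over the classes; `brier_score` is a hypothetical helper name.

```python
def brier_score(probs: dict[str, float], true_label: str) -> float:
    """Mean squared difference between one-hot labels and predicted probabilities."""
    squared_diffs = [
        ((1.0 if label == true_label else 0.0) - p) ** 2
        for label, p in probs.items()
    ]
    return sum(squared_diffs) / len(squared_diffs)


probs = {"Class 1": 0.5, "Class 2": 0.3, "Class 3": 0.2}

score = brier_score(probs, "Class 2")
print(round(score, 2))  # 0.26
```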

##### Evaluating Non-Conformity Measures

- Each non-conformity measure is typically evaluated along 2 criteria:
    - One-Class Classification (OneC)
        - The proportion of prediction sets that include only one label (higher is better)
    - AvgC
        - The average number of class labels in a prediction set (lower is better)

- TLDR:
    - To maximise singleton sets (i.e. maximise `OneC`), use `Margin` as non-conformity score
    - To minimise `AvgC`, use `Hinge Loss` as non-conformity score
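
Both criteria are simple summaries over a collection of prediction sets. The sketch below assumes `OneC` is reported as a proportion rather than a raw count, and the prediction sets are made up for illustration.

```python
def one_c(prediction_sets: list[set[str]]) -> float:
    """Proportion of prediction sets that contain exactly one label."""
    return sum(1 for s in prediction_sets if len(s) == 1) / len(prediction_sets)


def avg_c(prediction_sets: list[set[str]]) -> float:
    """Average number of labels per prediction set."""
    return sum(len(s) for s in prediction_sets) / len(prediction_sets)


# Hypothetical prediction sets for four test objects.
prediction_sets = [{"A"}, {"A", "B"}, {"B"}, {"A", "B", "C"}]

one_c_value = one_c(prediction_sets)
avg_c_value = avg_c(prediction_sets)
print(one_c_value, avg_c_value)  # 0.5 1.75
```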

#### Regression Problems

- Again, we give possible nonconformity scores; 2 are listed here

- Absolute error
$$\begin{aligned}
    \text{Non-conformity} &= | y_{\text{pred}} - y_{\text{true}} |
\end{aligned}$$

- Normalised error
$$\begin{aligned}
    \text{Non-conformity} &= \frac{| y_{\text{pred}} - y_{\text{true}} |}{\text{s.d.}_{\text{residuals}}}
\end{aligned}$$
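
A minimal sketch of both regression scores. The notes do not say which residuals the standard deviation is taken over, so this assumes it is the population standard deviation of residuals from the training data; the residual values and function names are invented for illustration.

```python
import statistics


def absolute_error(y_pred: float, y_true: float) -> float:
    """Absolute-error nonconformity score."""
    return abs(y_pred - y_true)


def normalised_error(y_pred: float, y_true: float, residual_sd: float) -> float:
    """Absolute error scaled by the standard deviation of the residuals."""
    return abs(y_pred - y_true) / residual_sd


# Hypothetical training residuals (assumption: s.d. is estimated from these).
train_residuals = [2.0, -2.0, 2.0, -2.0]
residual_sd = statistics.pstdev(train_residuals)  # 2.0

abs_score = absolute_error(y_pred=10.0, y_true=13.0)
norm_score = normalised_error(y_pred=10.0, y_true=13.0, residual_sd=residual_sd)
print(abs_score, norm_score)  # 3.0 1.5
```

Dividing by the residual spread puts scores from models with different error scales on a comparable footing.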

### Calibration Set: TCP vs ICP

#### ICP

- Procedure
    - Divide your data into a training set $T$ and a calibration set $C$
    - Train point prediction model $H$ using $T$
    - For each observation in $C$
        - Use $H$ to predict every class probability 
        - Use an appropriate `Non-conformity Measure` to compute nonconformity scores $\alpha$ 
    - For a new observation $x$, for each possible label, compute a non-conformity score by assuming $x$'s true label is that label 
        - i.e. if you have 3 classes, assume $x$ is class 1, and compute nonconformity score. Then assume $x$ is class 2, etc. 
    - Compute the p-value of the test object for each postulated label by:
        - Counting the proportion of observations in the calibration set whose nonconformity score is greater than or equal to the test object's score for that label
        - If this number is high, then the test object must be "conforming" to the rest of the dataset under that label. Else it is not!
        $$\begin{aligned}
            \text{p-value} &= \frac{|\{ i : \alpha_i \ge \alpha_T \}| + 1}{n + 1}
        \end{aligned}$$
        - Here $\alpha_i$ are the $n$ calibration scores and $\alpha_T$ is the test object's score. The $+1$ comes from the fact that you are adding the test object to the bag of objects you are comparing
    
    - If the computed p-value (i.e. credibility) is higher than the significance level (e.g. $1 - 0.95 = 0.05$), the test object's score is not among the most extreme 5% of nonconformity scores, so the postulated label is plausible.
        - Add the label to the return set. Else don't add
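
The procedure above can be sketched as follows, skipping the training step and starting from class probabilities that an already-trained model $H$ is assumed to have produced. Hinge loss is used as the nonconformity measure. All probabilities, labels, and function names here are hypothetical.

```python
def p_value(cal_scores: list[float], test_score: float) -> float:
    """(number of calibration scores >= test score, plus 1) / (n + 1)."""
    n_ge = sum(1 for a in cal_scores if a >= test_score)
    return (n_ge + 1) / (len(cal_scores) + 1)


def icp_prediction_set(cal_probs, cal_labels, test_probs, significance):
    # Hinge-loss score of each calibration point, using its known true label.
    cal_scores = [1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels)]
    # Postulate each label in turn for the test object and compute its p-value.
    p_values = {label: p_value(cal_scores, 1.0 - p) for label, p in test_probs.items()}
    # Keep the labels whose p-value exceeds the significance level.
    prediction_set = {label for label, p in p_values.items() if p > significance}
    return prediction_set, p_values


# Hypothetical model outputs on the calibration set C, with true labels.
cal_probs = [
    {"cat": 0.9, "dog": 0.1},
    {"cat": 0.8, "dog": 0.2},
    {"cat": 0.3, "dog": 0.7},
    {"cat": 0.4, "dog": 0.6},
]
cal_labels = ["cat", "cat", "dog", "dog"]

# Hypothetical model output for a new observation x.
test_probs = {"cat": 0.85, "dog": 0.15}

prediction_set, p_values = icp_prediction_set(
    cal_probs, cal_labels, test_probs, significance=0.25
)
print(prediction_set)  # {'cat'}
```

With only four calibration points the smallest possible p-value is $1/5 = 0.2$, so a significance level of 0.05 could never exclude a label. The sketch uses 0.25 purely so that the exclusion step is visible.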

#### TCP vs ICP

- The main differences between TCP and ICP are:
    - In ICP, you portion out a dataset to be used as calibration set, to compute nonconformal scores
    - In TCP, you train the model on the entire dataset PLUS the new observation you want to score 
        - Since you don't know what label the unknown observation should have, you do training $k$ times, where $k$ is the number of possible labels
        - For each trained model, use it to compute nonconformal score for the entire dataset 
        - Then compute p-value for the unknown observation

- Because TCP needs to train $k \times m$ models ($k$ being the number of labels, $m$ being the number of unknown observations), it is very computationally intensive.

- Hence, ICP is generally preferred unless your dataset is VERY small

- Procedure for TCP
    - Train the underlying classifier on the entire training set.
    - Append each test point to the training set with a postulated class label, one possible label at a time
    - For each appended test point with a postulated label, retrain the classifier and compute the nonconformity score for the test point given the postulated label.
    - Calculate the p-value for each postulated label by comparing the test point’s nonconformity score to the scores of the points in the training set
    - For each test point and each postulated label, include the postulated label in the prediction set if its p-value is greater than the chosen significance level.
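
A toy sketch of the TCP loop for a single test point. To keep it self-contained, the "classifier" is a stand-in chosen for illustration: a nearest-class-mean model on one feature, where retraining means recomputing the class means and the nonconformity score is the distance to the mean of the point's own (postulated) class. The data and function names are hypothetical.

```python
def class_means(xs: list[float], ys: list[str]) -> dict[str, float]:
    """'Train' the toy model: compute the mean feature value of each class."""
    sums: dict[str, float] = {}
    counts: dict[str, int] = {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def tcp_p_values(train_x, train_y, test_x, labels):
    p_values = {}
    for label in labels:
        # Append the test point with the postulated label.
        aug_x = train_x + [test_x]
        aug_y = train_y + [label]
        # Retrain on the augmented data.
        means = class_means(aug_x, aug_y)
        # Score every point, including the test point (last element).
        scores = [abs(x - means[y]) for x, y in zip(aug_x, aug_y)]
        test_score = scores[-1]
        # The test point counts itself, which supplies the +1 in the numerator.
        n_ge = sum(1 for s in scores if s >= test_score)
        p_values[label] = n_ge / len(scores)
    return p_values


train_x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
train_y = ["A", "A", "A", "B", "B", "B"]

p_values = tcp_p_values(train_x, train_y, test_x=3.5, labels=["A", "B"])
prediction_set = {label for label, p in p_values.items() if p > 0.2}
print(p_values)        # A: 2/7, B: 1/7
print(prediction_set)  # {'A'}
```

The model is retrained once per postulated label, which is where the $k \times m$ training cost comes from when there are $m$ test points.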

#### Confidence vs Credibility

- Remember, in CP, the overview is:
    - For any new observation, we compare its nonconformity score $\alpha_{\text{new}}$ with all the other nonconformity scores in the calibration set $\boldsymbol{\alpha_C}$
    - We count the proportion of values in $\boldsymbol{\alpha_C}$ that exceed $\alpha_{\text{new}}$
    - Let's say this is 0.96; that is, 96% of the nonconformity scores in the calibration set are higher than $\alpha_{\text{new}}$

- Confidence:
    - The confidence level $X\%$ (e.g. 95%) is the target proportion of prediction sets that should contain the true label
    - For a label to be included in the return set, its p-value must exceed $1 - X\%$; that is, the test observation's nonconformity score must not be among the most extreme $(100 - X)\%$ of scores
    - Confidence is $1 - \epsilon$ where $\epsilon$ is the significance level

- Credibility:
    - Credibility is just the p-value of the label in question

#### Example

- Assume we have some classification model and a calibration set of 4 points. Here are the nonconformity scores in the calibration set
    - Point 1 (Label A): 0.4
    - Point 2 (Label A): 0.3
    - Point 3 (Label B): 0.2
    - Point 4 (Label B): 0.5

- For the test observation, the model produces the following probs:
    - $P(\text{Class A}) = 0.75$
    - $P(\text{Class B}) = 0.65$

- Then, using hinge loss, the nonconformity scores are:
    - For A: $1 - 0.75 = 0.25$
    - For B: $1 - 0.65 = 0.35$

- Thus, the p-values for each label are:
    - For A: 0.25 is smaller than 0.4, 0.3, and 0.5. So the p-value is $\frac{3+1}{4+1} = 0.8$
    - For B: 0.35 is smaller than 0.4 and 0.5. So the p-value is $\frac{2+1}{4+1} = 0.6$
    - Our preset confidence of 0.95 implies a significance level of 0.05
    - Since both 0.8 and 0.6 exceed 0.05, we include both labels in the prediction set
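
The arithmetic in this example can be checked with a short script. The calibration scores and probabilities are the ones given above; the helper name `p_value` is an assumption.

```python
def p_value(cal_scores: list[float], test_score: float) -> float:
    """(number of calibration scores >= test score, plus 1) / (n + 1)."""
    n_ge = sum(1 for a in cal_scores if a >= test_score)
    return (n_ge + 1) / (len(cal_scores) + 1)


cal_scores = [0.4, 0.3, 0.2, 0.5]        # Points 1 to 4
test_probs = {"A": 0.75, "B": 0.65}      # model output for the test observation
significance = 1 - 0.95

# Hinge-loss score for each postulated label, then its p-value.
p_values = {label: p_value(cal_scores, 1 - p) for label, p in test_probs.items()}
prediction_set = {label for label, p in p_values.items() if p > significance}

print(p_values)                # {'A': 0.8, 'B': 0.6}
print(sorted(prediction_set))  # ['A', 'B']
```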

### Online vs Offline

- Offline conformal prediction uses a fixed model and a fixed calibration set to compute nonconformity scores; neither is updated over time

- Online conformal prediction is the opposite: the model and/or the calibration set are updated as new labelled observations arrive
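
One possible way to picture the online setting is a calibration set that changes as new labelled observations arrive. The sketch below is an assumption made for illustration, not a method from these notes: it keeps only the most recent scores in a fixed-size window and leaves the model itself untouched. The class name `OnlineCalibrator` is hypothetical.

```python
from collections import deque


class OnlineCalibrator:
    """Keeps a sliding window of the most recent nonconformity scores."""

    def __init__(self, max_size: int) -> None:
        # deque(maxlen=...) drops the oldest score when the window is full.
        self.scores: deque[float] = deque(maxlen=max_size)

    def p_value(self, test_score: float) -> float:
        n_ge = sum(1 for a in self.scores if a >= test_score)
        return (n_ge + 1) / (len(self.scores) + 1)

    def update(self, observed_score: float) -> None:
        # Called once the true label is known and the score can be computed.
        self.scores.append(observed_score)


calibrator = OnlineCalibrator(max_size=3)
for s in [0.1, 0.2, 0.3]:
    calibrator.update(s)

p_before = calibrator.p_value(0.25)  # window is [0.1, 0.2, 0.3]
calibrator.update(0.4)               # window becomes [0.2, 0.3, 0.4]
p_after = calibrator.p_value(0.25)
print(p_before, p_after)  # 0.5 0.75
```

The same test score gets a different p-value once the window moves, which is the behaviour an offline predictor with a fixed calibration set would not show.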

### Conformal Prediction Coverage

- When talking about coverage, we are basically asking "how well does our prediction interval cover the true value?"

- There are 2 types of coverage; **conditional** and **unconditional**

- **Unconditional Coverage**:
    - This looks at the proportion of times the predicted intervals contain the true value across the entire dataset
    - As the name suggests, it averages over everything: it does not tell you whether coverage holds for a particular subset of the data (maybe your prediction interval sucks there), or after the data distribution changes

- **Conditional Coverage**:
    - This looks at how well your intervals contain the true value for specific subsets of the data
        - i.e. for some specific category, for some specific time period, etc
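
The difference between the two notions shows up when coverage is computed overall and then per subset. In this sketch the intervals, true values, and group labels are invented, and the function names are hypothetical.

```python
def coverage(intervals: list[tuple[float, float]], y_true: list[float]) -> float:
    """Proportion of intervals that contain the true value."""
    hits = [lo <= y <= hi for (lo, hi), y in zip(intervals, y_true)]
    return sum(hits) / len(hits)


def coverage_by_group(intervals, y_true, groups) -> dict[str, float]:
    """Coverage computed separately within each group."""
    by_group: dict[str, list] = {}
    for interval, y, g in zip(intervals, y_true, groups):
        by_group.setdefault(g, []).append((interval, y))
    return {
        g: coverage([i for i, _ in items], [y for _, y in items])
        for g, items in by_group.items()
    }


# Hypothetical prediction intervals, true values, and a grouping variable.
intervals = [(0, 10), (0, 10), (0, 10), (0, 10), (5, 6), (5, 6)]
y_true = [1, 5, 9, 3, 7, 5.5]
groups = ["weekday"] * 4 + ["weekend"] * 2

overall = coverage(intervals, y_true)
per_group = coverage_by_group(intervals, y_true, groups)
print(round(overall, 3))  # 0.833
print(per_group)          # {'weekday': 1.0, 'weekend': 0.5}
```

Overall coverage looks acceptable here, while the weekend subset is covered only half the time. That gap is what conditional coverage is meant to expose.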

#### Applications of Conformal Prediction

- Conformal prediction can be used to:
    - Quantify uncertainty in any ML task
    - Complement model validation techniques like cross-validation/bootstrapping to better quantify uncertainty
