# Model-Agnostic Methods in Survival Analysis

Model-agnostic methods in survival analysis are techniques for interpreting survival models without requiring access to their internal structure or parameters. Most model-agnostic methods rely on evaluating the survival prediction function $f(t | X)$, which represents the model's predicted outcome at time $t$, given a feature set $X = \begin{bmatrix}
   x_1^1 & \ldots & x_L^1 \\
   \vdots & \ddots & \vdots \\
   x_1^n & \ldots & x_L^n \\
\end{bmatrix} \in \mathbb{R}^{n \times L}$ (with $n$ individuals and $L$ features). This function could be one of the following:

- Hazard function   $\lambda(t | X)$ 

- Survival function  $S(t | X)$

- Cumulative hazard function $\Lambda(t | X)$

Model-agnostic methods manipulate $f(t | X)$ to extract interpretable insights.



## Local explanation

### Individual Conditional Expectation (ICE)

Individual conditional expectation (ICE) curves provide a visual representation of how changes in a specific feature affect a model's prediction for an individual data point. These curves help describe the dependency of the model’s prediction (or an approximation of the conditional expected value) on the values of a selected feature.

The ICE function evaluates the model's prediction across a predefined grid of values $x_\ell^{\text{grid}} = \{x_\ell^1, \ldots, x_\ell^G\}$ for the selected feature $\ell$. It keeps all other features $x_{-\ell}^i$ fixed at their respective values for the given individual $x^i$. Mathematically, the ICE function evaluated at a specific time point $t$ is defined as:


$$\text{ICE}(f, t, x^i, \ell) = f(t | x_\ell^{\text{grid}}, x_{-\ell}^i).$$
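As a minimal sketch of the ICE definition above, the following Python snippet evaluates a prediction function on a grid for a single individual. The function `toy_survival` (an exponential survival model with made-up coefficients) and the helper name `ice_curve` are assumptions introduced for illustration only; any callable that returns $f(t | X)$ row by row could take their place.

```python
import numpy as np

def toy_survival(t, X):
    """Hypothetical stand-in for f(t | X): S(t | x) = exp(-t * exp(x @ beta))."""
    beta = np.array([0.5, -0.3])  # made-up coefficients
    return np.exp(-t * np.exp(X @ beta))

def ice_curve(f, t, x_i, ell, grid):
    """Evaluate f(t | x_ell = g, x_{-ell} = x_i[-ell]) for every g in grid."""
    X_rep = np.tile(x_i, (len(grid), 1))  # one copy of the individual per grid value
    X_rep[:, ell] = grid                  # overwrite only the selected feature
    return f(t, X_rep)

x_i = np.array([1.0, 2.0])                # the individual to explain
grid = np.linspace(-1.0, 1.0, 5)          # grid for feature ell = 0
curve = ice_curve(toy_survival, t=1.0, x_i=x_i, ell=0, grid=grid)
```

Plotting `curve` against `grid` gives one ICE curve; repeating the call for several individuals gives the usual bundle of curves.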

***$\underline{\bf Limitation}$***
- Single-Feature Focus: ICE curves are limited to visualizing one feature at a time. Including multiple features would require overlaying complex surfaces, making the plot unreadable.

- Correlation Issues: When the feature of interest is correlated with others, some plotted points may represent unrealistic or invalid combinations, misrepresenting the data's true joint distribution.

- Overcrowding: Drawing too many ICE curves can clutter the plot, reducing its clarity and interpretability.

### SurvLIME (Local interpretable model-agnostic explanations)

SurvLIME aims to explain the prediction of a black-box survival model through a local approximation by a simple, interpretable surrogate model (the Cox proportional hazards model), whose cumulative hazard function (CHF) is defined by
$$\Lambda^{\text{Cox-PH}}(t | x) = \Lambda_0^{\text{Cox-PH}}(t) \exp(\omega^\top x),$$
where the baseline CHF $\Lambda_0^{\text{Cox-PH}}(t)$ can be estimated by the Nelson-Aalen estimator.

In order to explain the prediction for an individual $x^i$:

1. Generate a neighborhood set of $p$ points $\{x^i_1, \ldots , x^i_p\}$ around $x^i$.

2. Optimize the coefficients $\omega$ of the surrogate model so that the weighted average distance between the log-CHFs of the explained model and of the surrogate model is minimized over all $x^i_u \in \{x^i_1, \ldots , x^i_p\}$:

$$\min_{\omega} \sum_{u=1}^p w_u \sum_{v=1}^q r_{uv}^2 \big[ \ln \Lambda(t_v | x^i_u) - \ln \Lambda_0^{\text{Cox-PH}}(t_v) - \omega^\top x^i_u\big]^2 (t_{v+1} - t_v)$$

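where

$t_1 < \ldots < t_{q+1}$ are the time points at which the two CHFs are compared,

$w_u = K(x^i_u, x^i)$ is a weight that decreases with the distance between $x^i_u$ and $x^i$ and is computed with a kernel function $K$, and

$r_{uv} = \frac{\Lambda(t_v | x^i_u)}{\ln \Lambda(t_v | x^i_u)}$ is a "straightening" weight that compensates for the distortion introduced by comparing the logarithms of the CHFs rather than the CHFs themselves.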
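The objective above is quadratic in $\omega$, so one possible way to minimize it is weighted least squares. The sketch below illustrates this under loud assumptions: the black-box model is a hypothetical Cox-type CHF with a known baseline $\Lambda_0(t) = 0.01\,t$ (in practice the baseline would be estimated, for example with the Nelson-Aalen estimator), the neighborhood is Gaussian noise around $x^i$, and $K$ is a Gaussian kernel. All names (`black_box_chf`, `baseline_chf`, `b_true`) are illustrative and not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black box: Lambda(t | x) = Lambda_0(t) * exp(b_true @ x), Lambda_0(t) = 0.01 * t.
b_true = np.array([0.8, -0.5])

def baseline_chf(t):
    return 0.01 * t

def black_box_chf(t, X):
    # Shape (p, q): one row per point, one column per time point.
    return baseline_chf(t)[None, :] * np.exp(X @ b_true)[:, None]

# Step 1: neighborhood of p points around the explained individual x_i.
x_i = np.array([0.2, -0.1])
p, sigma = 200, 0.1
neighbors = x_i + sigma * rng.standard_normal((p, x_i.size))

# Time grid t_1 < ... < t_{q+1}; the objective uses t_v and the widths t_{v+1} - t_v.
times = np.linspace(0.5, 2.0, 6)
t_v, dt = times[:-1], np.diff(times)

# Step 2: ingredients of the objective.
chf = black_box_chf(t_v, neighbors)                    # Lambda(t_v | x_u)
log_chf = np.log(chf)
target = log_chf - np.log(baseline_chf(t_v))[None, :]  # ln Lambda - ln Lambda_0
w = np.exp(-np.sum((neighbors - x_i) ** 2, axis=1) / (2 * sigma ** 2))  # kernel weights w_u
r = chf / log_chf                                      # straightening weights r_uv
W = w[:, None] * r ** 2 * dt[None, :]                  # total weight of each (u, v) term

# Setting the gradient with respect to omega to zero gives the normal equations.
A = (neighbors * W.sum(axis=1)[:, None]).T @ neighbors
c = neighbors.T @ (W * target).sum(axis=1)
omega = np.linalg.solve(A, c)
```

Because the toy black box is itself of Cox form, `omega` recovers `b_true` here; for a genuinely non-linear model the coefficients would only describe its local behavior around $x^i$. Note that $r_{uv}$ is undefined when $\Lambda(t_v | x^i_u) = 1$, which the toy setup avoids by keeping the CHF well below 1.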



***$\underline{\bf Limitation}$***
- Defining a proper neighborhood: The method struggles to correctly define a "neighborhood" of data points, especially when features are highly correlated. This can result in unrealistic samples that don't reflect real-world data.

- Unstable results: The explanations can vary with each run because of random sampling. This makes the results less reliable and harder to reproduce.

- Potential for Misleading Explanations: LIME explanations can be manipulated to hide biases, posing challenges for explanation receivers who cannot verify their truthfulness.

### SurvSHAP (SHapley Additive exPlanations)
The goal of SHAP is to explain the prediction of an instance $x$ by computing the contribution $\phi_\ell \in \{\phi_1, \ldots, \phi_L\}$ of each feature to the prediction $f(x)$. The SHAP explanation method computes Shapley values from coalitional game theory. The Shapley value for a feature $\ell \in \{1, \ldots, L\}$ of an instance $x$ is given by:
$$\phi_\ell(f_x) = \sum_{S \subseteq N \setminus \{\ell\}}\frac{\vert S \vert ! \, (\vert N \vert - \vert S \vert - 1)!}{\vert N \vert !}\big[f_x(S \cup \{\ell\}) - f_x(S)\big]$$
where $N$ is the set of all features, $S$ is a subset of features excluding $\ell$ (a coalition), $f_x(S)$ is the model's prediction when only the features in $S$ are used, $f_x(S \cup \{\ell\})$ is the model's prediction when feature $\ell$ is added to $S$, and
$$f_x(S) = \int f(x_1, \ldots, x_L)dP_{x \notin S} - E_X(f(X))$$
is the prediction for feature values in set $S$ that are marginalized over features that are not included in set $S$. Then, the feature contributions must add up to the difference of prediction for $x$ and the average.

$$\sum_{\ell=1}^L \phi_\ell = f(x) - E_X(f(X))$$
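For a handful of features, the Shapley values can be computed by brute force with the standard coalition weights $\frac{|S|!\,(|N| - |S| - 1)!}{|N|!}$, which makes the additivity property above easy to check numerically. The snippet below is a sketch under assumptions: the model `f`, the three-row `background` sample used to marginalize absent features, and the helper names are all made up for illustration.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(value, L):
    """Exact Shapley values for a coalition value function value(S)."""
    phi = np.zeros(L)
    for ell in range(L):
        others = [j for j in range(L) if j != ell]
        for size in range(L):
            weight = factorial(size) * factorial(L - size - 1) / factorial(L)
            for S in combinations(others, size):
                phi[ell] += weight * (value(S + (ell,)) - value(S))
    return phi

# Hypothetical toy model with an interaction, and a tiny background sample.
def f(X):
    return X[:, 0] + 2 * X[:, 1] * X[:, 2]

background = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0], [1.0, 4.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])  # instance to explain

def value(S):
    """f_x(S): features in S fixed at x, the rest averaged over the background, minus E[f(X)]."""
    idx = np.array(S, dtype=int)
    X_mix = background.copy()
    X_mix[:, idx] = x[idx]
    return f(X_mix).mean() - f(background).mean()

phi = shapley_values(value, L=3)
```

The number of coalitions grows as $2^L$, so this enumeration is only practical for very small $L$.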

SHAP assumes that the model's prediction can be decomposed into a sum of contributions from individual features. The Shapley value explanation is represented as an additive feature attribution method, a linear model. That view connects LIME and Shapley values. SHAP specifies the explanation as:

$$f(x) = \phi_0 + \sum_{\ell = 1}^L \phi_\ell z_{\ell},$$

where $z = (1, \ldots, 1) \in \{0, 1\}^L$ is the coalition vector with all features present and $\phi_0 = E_X(f(X))$. Calculating the Shapley values $\phi$ directly is computationally expensive. To make their computation feasible, approximations and optimizations are used depending on the model type:

- TreeSHAP (Optimized for Tree-Based Models)
- KernelSHAP (Model-Agnostic Method)


KernelSHAP is a kernel-based estimation approach for Shapley values inspired by local surrogate models (like LIME, and in this case, by a linear model $\Phi^\top z$). KernelSHAP, which is applied to explain the prediction of individual $x_i$, consists of six steps:
1. Sample coalitions $z_k = (z_{k1}, \ldots, z_{kL}) \in \{0, 1\}^L$, $k \in \{1, \ldots, K\}$ (1 = feature present in coalition, 0 = feature absent).
2. Map each $z_k$ to the original feature space via $h(x_i, z_k)$: keep the values of the features of $x_i$ for which $z_{k\ell} = 1$, and replace the values of the features for which $z_{k\ell} = 0$ with values randomly sampled from the distribution of the corresponding feature.
3. Get the prediction $f(h(x_i, z_k))$.
4. Compute the weight for each $z_k$ with the SHAP kernel
    $$\omega_k = \frac{L-1}{{L \choose s_k} s_k (L-s_k)},$$
    where $s_k = \sum_{\ell=1}^L z_{k\ell}$ is the number of features present in coalition $z_k$.
5. Fit the weighted linear model
    $$\min_{\Phi} \sum_{k=1}^K \omega_k \big(f(h(x_i, z_k)) - \Phi^\top z_k \big)^2,$$
    where $\Phi = (\phi_1, \ldots, \phi_L)$.
6. Return the Shapley values $\Phi$, the coefficients of the linear model:
    $$\Phi = (Z^\top \Omega Z)^{-1} Z^\top \Omega Y,$$
    where
    $$Z = [z_1, \ldots, z_K]^\top \in \mathbb{R}^{K \times L},$$
    $$\Omega = \text{diag}(\omega_1, \ldots, \omega_K) \in \mathbb{R}^{K \times K},$$
    $$Y = \big[f(h(x_i, z_1)), \ldots, f(h(x_i, z_K))\big]^\top \in \mathbb{R}^{K \times \tau},$$
    with $\tau$ denoting the number of time points at which $f$ is evaluated.
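The steps above can be sketched in a few lines of Python. This is a simplified illustration rather than a reference implementation, and it makes several assumptions: all $2^L$ coalitions are enumerated instead of sampled, absent features are marginalized by averaging predictions over a `background` sample, an intercept $\phi_0$ is added to the linear model, and the SHAP kernel (which is infinite for the empty and the full coalition) is replaced by a large finite weight `big_weight` for those two coalitions. The prediction is a scalar, for example $f$ at one fixed time point; for $\tau$ time points the same fit would be applied to each column of $Y$.

```python
import numpy as np
from itertools import product
from math import comb

def kernel_shap(f, x, background, big_weight=1e6):
    """Sketch of KernelSHAP with full coalition enumeration (small L only)."""
    L = x.size
    Z = np.array(list(product([0, 1], repeat=L)), dtype=float)  # step 1: coalitions
    y = np.empty(len(Z))
    for k, z in enumerate(Z):                                   # steps 2-3
        present = z.astype(bool)
        X_mix = background.copy()
        X_mix[:, present] = x[present]      # present features take the values of x
        y[k] = f(X_mix).mean()              # absent features averaged over background
    sizes = Z.sum(axis=1).astype(int)
    omega = np.array([                                          # step 4: SHAP kernel
        big_weight if int(s) in (0, L)
        else (L - 1) / (comb(L, int(s)) * int(s) * (L - int(s)))
        for s in sizes
    ])
    Z1 = np.column_stack([np.ones(len(Z)), Z])                  # intercept column for phi_0
    A = Z1.T @ (omega[:, None] * Z1)                            # steps 5-6: weighted least squares
    coef = np.linalg.solve(A, Z1.T @ (omega * y))
    return coef[0], coef[1:]

# Hypothetical linear model, for which the Shapley values are known in closed form.
rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
b = np.array([1.0, -2.0, 0.5])
f_lin = lambda X: X @ b + 1.0
x = np.array([0.5, 1.0, -1.0])
phi0, phi = kernel_shap(f_lin, x, background)
```

For this linear toy model the result matches the known Shapley values $b_\ell (x_\ell - \bar{x}_\ell)$, which gives a simple sanity check of the weighting and the regression step.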

***$\underline{\bf Limitation}$***
- Slow computation (on KernelSHAP): Impractical for calculating Shapley values for many instances.
- Ignores feature dependence (on KernelSHAP, like many other permutation-based interpretation methods): Assumes feature independence, which can lead to unrealistic data points when features are correlated.
- Misinterpretation Risk (like Shapley value): Shapley values represent the contribution of a feature to the difference between the prediction and the mean prediction, not the difference of the predicted value after removing the feature from the model training. Misinterpretation is common.
- Dependency on training data (like Shapley value): Access to the data is needed to explain a new data point. Access to the prediction function alone is not sufficient, because the data are needed to replace parts of the instance of interest with values from randomly drawn instances. This can only be avoided if you can create data instances that look like real data instances but are not actual instances from the training data.
- Potential for Misleading Explanations: SHAP explanations can be manipulated to hide biases, posing challenges for explanation receivers who cannot verify their truthfulness.

<!-- Like many other permutation-based interpretation methods, the Shapley value method suffers from inclusion of unrealistic data instances when features are correlated. -->

## Global explanation

### Partial Dependence Plot (PDP)

The partial dependence plot (PDP) describes how the expected value of the model prediction varies with respect to a chosen explanatory variable.
This is achieved by evaluating the model's behavior across a predefined grid of points $x_\ell^{\text{grid}} = \{x_\ell^1, \ldots, x_\ell^G\}$ for a selected feature $\ell$, while all other variables vary according to their respective marginal distributions.

In practice, a one-dimensional PDP is estimated as the average of the Individual Conditional Expectation (ICE) profiles across all $n$ observations 
in the dataset $X = \{x^1, \ldots, x^n\}$. Mathematically, this is expressed as:

$$\text{PDP}(f, t, \ell) = \frac{1}{n}\sum_{i=1}^n f(t | x_\ell^{\text{grid}}, x_{-\ell}^i)$$
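The estimator above can be sketched directly: for each grid value, overwrite feature $\ell$ for all individuals and average the predictions. The model `toy_survival` and the helper name `partial_dependence` are assumptions for illustration only.

```python
import numpy as np

def toy_survival(t, X):
    """Hypothetical stand-in for f(t | X): S(t | x) = exp(-t * exp(x @ beta))."""
    beta = np.array([0.5, -0.3])  # made-up coefficients
    return np.exp(-t * np.exp(X @ beta))

def partial_dependence(f, t, X, ell, grid):
    """Average of the ICE profiles of all rows of X, one value per grid point."""
    pdp = np.empty(len(grid))
    for g, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, ell] = value          # set feature ell to the grid value for everyone
        pdp[g] = f(t, X_mod).mean()    # average over the n individuals
    return pdp

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
grid = np.linspace(-2.0, 2.0, 9)
pdp = partial_dependence(toy_survival, t=1.0, X=X, ell=0, grid=grid)
```

Calling the function for several values of `t` yields one partial dependence curve per time point, which is the usual way to display PDPs for survival predictions.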

<!-- $$\text{PDP}(f, t, \ell) = \frac{1}{n}\sum_{i=1}^n f(t | x_\ell^{\text{grid}}, x_{\{1, \ldots, L\} \setminus \{\ell\}}^i)$$ -->

<!-- \begin{align*}
    S_{\text{DPD}, A}(t | X_{-A}) & = E_{X_{-A}}[S(t | X_A, X_{-A})]\\
    & = \int_{-\infty}^{+\infty} S(t | X_A, X_{-A}) dP(X_{-A})
\end{align*} -->

***$\underline{\bf Limitation}$***
- Dimensionality Limitation: PDP can realistically handle up to two features due to the constraints of 2D visualization and human cognitive limitations.

- Feature Distribution Omission: PDP may not show the feature distribution, which can mislead interpretations in regions with little or no data. This can be mitigated by including rugs or histograms.

- Assumption of Independence: PDP assumes features are independent, leading to unrealistic combinations in correlated features. Accumulated Local Effect (ALE) plots address this by considering conditional distributions.

- Hidden Heterogeneous Effects: PDP shows average effects, which can mask opposing effects in subgroups of the data. Individual Conditional Expectation (ICE) curves reveal such heterogeneous patterns by displaying instance-level effects.

### Accumulated Local Effects (ALE)

To avoid the problem that PDP has with correlated features, ALE computes differences in predictions (instead of averages of predictions) based on the conditional distribution of the features.

**Marginal plot (M plot)**

\begin{align}
    \text{M-plot}(t, f, \ell) &=  E_{X_{-\ell}}\big(f(t | X) \mid X_\ell = x_\ell\big) \\
                        &=  \int_{-\infty}^{\infty} f(t | X) \, dP(X_{-\ell} \mid X_\ell = x_\ell)
\end{align}

- Estimation
    $$\text{M-plot}(t, f, \ell) =  \frac{1}{\#N(x_\ell)} \sum_{i \in N(x_\ell)} f(t | x_\ell, X_{-\ell}^i),$$
where $N(x_\ell)$ is the set of individuals whose value of feature $\ell$ falls into a small neighborhood of $x_\ell$, and $\#N(x_\ell)$ is the size of that set.
The M-plot avoids unrealistic samples, but it still cannot isolate the pure effect of the feature of interest when that feature is correlated with other features, because the estimate mixes in the effects of the correlated features.

ALE handles this problem of the M-plot by averaging the changes of the predictions $\frac{\partial f(t | X_\ell, X_{-\ell})}{\partial X_\ell}$, not the predictions themselves. The change is defined as the partial derivative (local effect). ALE then averages the local effect over the conditional distribution, as the M-plot does, to avoid extrapolation (unrealistic samples). Lastly, it accumulates (integrates) the averaged local effect:

$$\text{ALE}(f, t, \ell) = \int_{q^0_\ell}^{x^{\star}} E_{X_{-\ell} | X_\ell = q_\ell} \Big(\frac{\partial f(t | X)}{\partial X_\ell} \Big| X_\ell = q_\ell \Big) dq_\ell,$$
where $x^\star$ is the value of feature $\ell$ at which the effect is evaluated, and the range of feature $\ell$ is split into $K$ intervals with boundaries $q^k_\ell \in \{q^0_\ell, \ldots, q^K_\ell\}$, in which $q^0_\ell = \min(X_\ell)$ and $q^K_\ell = \max(X_\ell)$.

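- Estimation
$$\text{ALE}(f, t, \ell) =  \sum_{k=1}^{k_\ell(x^\star)}\frac{1}{\#N(q^k_\ell)} \sum_{i \in N(q^k_\ell)} \big[f(t | q_\ell^k, X_{-\ell}^i) - f(t | q_\ell^{k-1}, X_{-\ell}^i)\big],$$

where $k_\ell(x^\star)$ is the index of the interval of feature $\ell$ that contains $x^\star$, and $N(q^k_\ell)$ is the set of individuals whose value of feature $\ell$ falls into the $k$-th interval $(q^{k-1}_\ell, q^k_\ell]$.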
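A minimal sketch of this estimator is given below. It makes a few assumptions that go beyond the text: the interval boundaries are taken as empirical quantiles of $X_\ell$, the curve is returned at the interval boundaries, and the result is left uncentered, as in the formula above. The names `ale_curve` and `toy_survival` are illustrative.

```python
import numpy as np

def toy_survival(t, X):
    """Hypothetical stand-in for f(t | X): S(t | x) = exp(-t * exp(x @ beta))."""
    beta = np.array([0.5, -0.3])  # made-up coefficients
    return np.exp(-t * np.exp(X @ beta))

def ale_curve(f, t, X, ell, n_intervals=10):
    """Uncentered first-order ALE estimate, evaluated at the interval boundaries."""
    x = X[:, ell]
    edges = np.quantile(x, np.linspace(0, 1, n_intervals + 1))  # q^0, ..., q^K
    # Interval index k in 1..K such that q^{k-1} < x <= q^k (the minimum goes to k = 1).
    k_of = np.clip(np.searchsorted(edges, x, side="left"), 1, n_intervals)
    local = np.zeros(n_intervals)
    for k in range(1, n_intervals + 1):
        members = X[k_of == k]              # individuals in N(q^k)
        if len(members) == 0:
            continue                        # empty interval contributes no effect
        upper, lower = members.copy(), members.copy()
        upper[:, ell] = edges[k]            # move feature ell to the upper boundary
        lower[:, ell] = edges[k - 1]        # and to the lower boundary
        local[k - 1] = np.mean(f(t, upper) - f(t, lower))
    return edges, np.concatenate([[0.0], np.cumsum(local)])     # accumulate local effects

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
X = np.column_stack([x0, x0 + 0.1 * rng.normal(size=500)])      # two strongly correlated features
edges, ale = ale_curve(toy_survival, t=1.0, X=X, ell=0, n_intervals=10)
```

Each difference is computed only for individuals that actually lie in the interval, so feature $\ell$ is never moved far from its observed value, which is how ALE avoids the unrealistic combinations that PDP can produce.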

***$\underline{\bf Limitation}$***

- Interpretation and Correlation: ALE plots show interval-wise effects that are accumulated into a smooth curve, but interpretation across intervals is invalid when features are highly correlated. Each interval uses different data points, making the effects strictly local.

- Deviation from Linear Models: In models with feature interactions and correlations, ALE plots may deviate from linear regression coefficients. First-order ALE effects can appear curved due to interaction terms, highlighting differences in how ALE and linear models attribute effects.

- Effect of Intervals: The number of intervals impacts the stability of ALE plots. With too many intervals, plots become shaky and overly detailed. With too few intervals, plots smooth out complexity but lose accuracy.

- Lack of ICE Curves: Unlike PDPs, ALE plots do not include ICE curves to reveal heterogeneous feature effects. While interval-level differences can be checked in ALE, they do not capture individual-level variability as ICE curves do.

### Permutation Feature Importance (PFI)

PFI measures how much the model's prediction accuracy decreases when the values of a specific feature are randomly shuffled, breaking the relationship between the feature and the true outcome. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction. A feature is “unimportant” if shuffling its values leaves the model error unchanged, because in this case the model ignored the feature for the prediction.

To compute the PFI for a specific feature $\ell$, follow these steps:

- Estimate the model loss $L(t, f(t | X))$. 

- Create a perturbed dataset $\tilde{X}$ by randomly shuffling the values of $X_\ell$ (i.e., $\tilde{X}_\ell = \text{shuffle}\{X_\ell\}$) while keeping the other features in $X$ unchanged.

- Compute the loss $L(t, f(t | \tilde{X}))$

- Calculate the PFI as the difference between $L(t, f(t | \tilde{X}))$ and $L(t, f(t | X))$, which can be the difference $L(t, f(t | \tilde{X})) - L(t, f(t | X))$ or the ratio $\frac{L(t, f(t | \tilde{X}))}{L(t, f(t | X))}$, so that a larger value indicates a more important feature.
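The steps above can be sketched as follows, using the difference form (loss after shuffling minus loss before) averaged over several shuffles. The setup is a toy assumption: the model predicts a survival probability at a fixed time and depends on the first feature only, the outcome is an uncensored indicator of surviving past that time, and the loss is a Brier-type squared error that ignores censoring. The names `permutation_importance`, `predict`, and `brier` are illustrative.

```python
import numpy as np

def permutation_importance(predict, loss, X, y, ell, n_repeats=20, seed=0):
    """Mean increase in loss after shuffling column ell (difference form of PFI)."""
    rng = np.random.default_rng(seed)
    base_loss = loss(y, predict(X))                  # loss on the original data
    increases = []
    for _ in range(n_repeats):                       # repeat to average out shuffle randomness
        X_perm = X.copy()
        X_perm[:, ell] = rng.permutation(X[:, ell])  # break the link between feature and outcome
        increases.append(loss(y, predict(X_perm)) - base_loss)
    return float(np.mean(increases))

# Hypothetical toy data: two features, of which the model uses only the first.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
t = 1.0

def predict(X):
    return np.exp(-t * np.exp(0.8 * X[:, 0]))        # predicted S(t | x)

y = (rng.random(300) < predict(X)).astype(float)     # toy indicator of surviving past t

def brier(y, pred):
    return np.mean((y - pred) ** 2)                  # squared error, censoring ignored

pfi_0 = permutation_importance(predict, brier, X, y, ell=0)
pfi_1 = permutation_importance(predict, brier, X, y, ell=1)
```

Shuffling the unused second feature leaves the predictions unchanged, so its importance is exactly zero, while the first feature receives a positive importance. With censored data, a censoring-aware loss would be needed in place of `brier`.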

***$\underline{\bf Limitation}$***

- Unstable result: The permutation feature importance depends on shuffling the feature, which adds randomness to the measurement. When the permutation is repeated, the results might vary greatly.

- Unrealistic sample: If features are correlated, the permutation feature importance can be biased by unrealistic data instances. The problem is the same as with PDP.

- Requirement for True Outcomes: PFI requires access to the true outcome to compute the loss. Without them, it cannot be computed, limiting its applicability in some scenarios.

- Performance vs. Variance: Permutation Feature Importance (PFI) is tied to model error, which may not always align with your needs. If you are interested in how much a feature influences the model's output variance (e.g., robustness to feature manipulation), PFI may not be the appropriate measure. Model variance (explained by the features) and feature importance correlate strongly when the model generalizes well (i.e. it does not overfit).

### Feature Interaction

Friedman’s H-statistic measures the degree of interaction between two features relative to their individual contributions. For two specific features $\ell$ and $k$, the H-statistic is defined in terms of mean-centered partial dependence functions, evaluated at the observed feature values, as

$$H_{\ell k}^2(t) = \frac{\sum_{i=1}^n [\text{PDP}(f, t , X^i_\ell, X^i_k) - \text{PDP}(f, t , X^i_\ell) - \text{PDP}(f, t , X^i_k)]^2}{\sum_{i=1}^n [\text{PDP}(f, t , X^i_\ell, X^i_k)]^2}.$$
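A direct sketch of this statistic evaluates the one- and two-dimensional partial dependence functions at every observed data point, centers them to mean zero, and forms the ratio. The helper names and the two toy prediction functions (one additive, one a pure product interaction) are assumptions for illustration.

```python
import numpy as np

def centered_pd(f, t, X, cols):
    """Partial dependence on the features in cols at each observed row, mean-centered."""
    n = len(X)
    pd = np.empty(n)
    for i in range(n):
        X_mod = X.copy()
        X_mod[:, cols] = X[i, cols]      # fix the selected features at row i's values
        pd[i] = f(t, X_mod).mean()       # average over the remaining features
    return pd - pd.mean()

def h_statistic(f, t, X, ell, k):
    """Friedman's H^2 for the feature pair (ell, k) at time t."""
    pd_lk = centered_pd(f, t, X, [ell, k])
    pd_l = centered_pd(f, t, X, [ell])
    pd_k = centered_pd(f, t, X, [k])
    return np.sum((pd_lk - pd_l - pd_k) ** 2) / np.sum(pd_lk ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

additive = lambda t, X: t * (X[:, 0] + 2 * X[:, 1] + X[:, 2])   # no interaction
product_ = lambda t, X: t * X[:, 0] * X[:, 1]                   # pure interaction

h_add = h_statistic(additive, 1.0, X, 0, 1)
h_int = h_statistic(product_, 1.0, X, 0, 1)
```

The additive function gives a value that is numerically zero, while the product function gives a value close to 1. Each partial dependence evaluation requires $n$ model calls per data point, so the cost grows as $n^2$, which is the expense noted in the limitations.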

***$\underline{\bf Limitation}$***
- Expensive computation

- Unstable result: Estimates may vary due to sampling variability if we do not use all data points to compute PDP, requiring multiple runs to ensure stability.

- Interpretation challenges: The H-statistic can exceed 1, which complicates interpretation, and there is no clear threshold for deciding when an interaction is "strong". When the total effect of two features is weak but consists mostly of interactions, the H-statistic will be very large. These spurious interactions arise from a small denominator of the H-statistic and are made worse when features are correlated. Visualizing the unnormalized H-statistic (the square root of the numerator) can mitigate the overemphasis on spurious interactions.

- Interaction strength vs. visualization: The H-statistic measures interaction strength but does not explain how the interaction works. Complementing it with 2D partial dependence plots is recommended.