# Stochastic Models in Neurocognition

## Class 3

<hr>

**Preliminary Notes**:

- $f$, 
- $j$, an estimator

<hr>

# 1 - Cross Validation

### Overview

Cross-validation (CV) is usually considered the most robust method for selecting a model or estimator. However, it is computationally intensive.

- $n$ IID observations $X = \{X_1,...,X_n\}$
- Observations are split into two sets 
    - learning/estimation/training samples
    - validation/transfer sample
    
<u>Needed:</u>

- A contrast
- $d$ different estimators, which may (but need not) come from $d$ different models

An estimator $\hat{f}_j$ is treated as a **black box** ($data \Rightarrow \hat{f}_j \Rightarrow \hat{f}_j^{data}$): given data, it returns an estimate of some target $f_0$, which may be a function or just a parameter $\theta_0$.


<u>Example:</u>
- projection estimators on $d$ different subsets
- density estimators (kernel estimators with $d$ different bandwidths)

### Hold-Out

#### Process

> Each $\hat{f}_j$ is computed using only the training set $S_L$, not the transfer set $S_T$.

$$\hat{f}_1^{S_L}, ..., \hat{f}_d^{S_L}$$

For each estimator $\hat{f}_j^{S_L}$:

> we compute $C(\hat{f}_j^{S_L}, S_T)$, where $C$ is the contrast designed for the target $f_0$. 
>
> It means that $\mathbb{E}_{f_0}[C(f, S_T)]$ is minimal when $f=f_0$.

**We select the estimator $\hat{f}_{\hat{j}}$, where $\hat{j}=\underset{j\in\{1,...,d\}}{\text{argmin}}\,C(\hat{f}^{S_L}_j,S_T)$ ($\hat{j}$ is the index of the best estimator)**.

<span style="color:red">/!\ The accuracy is computed over the whole sample, not just the transfer/validation one.</span>

<u>Example:</u>

$X_1, ..., X_n$ IID with density $f_0$.

$\hat{f}_j(x) = \frac{1}{nh_j}\sum^n_{i=1}K\left(\frac{x-X_i}{h_j}\right)$, where $K$ is the kernel and $h_1, ..., h_d$ are $d$ different bandwidths.

$$S_L = \{X_1, ..., X_{n/2}\}; S_T=\{X_{n/2+1}, ..., X_n\}$$

![sa](images/signalanalysis.png)

$$\hat{f}^{S_L}_j(x) = \frac{1}{(n/2)\,h_j}\sum^{n/2}_{i=1}K\left(\frac{x-X_i}{h_j}\right)$$
$$C(\hat{f}^{S_L}_j, S_T) = -\frac{2}{n/2}\sum^n_{i=n/2+1}\hat{f}_j^{S_L}(X_i) + \int\left(\hat{f}_j^{S_L}(x)\right)^2dx$$

The final estimator is therefore $\hat{f}^X_{\hat{j}}$: the method $\hat{j}$ selected by hold-out, recomputed with the whole sample $X$.
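
The hold-out example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the kernel $K$ is taken to be Gaussian (the notes do not fix it), which gives a closed form for $\int(\hat{f}_j^{S_L})^2$, and the bandwidth grid, sample size and function names are hypothetical.

```python
import math
import random

def kde(train, h):
    """Gaussian-kernel density estimator built on `train` with bandwidth h."""
    m = len(train)
    norm = m * h * math.sqrt(2 * math.pi)
    def f_hat(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in train) / norm
    return f_hat

def integral_of_square(train, h):
    """Closed form of the integral of f_hat^2 (valid for the Gaussian kernel only)."""
    m = len(train)
    total = sum(math.exp(-0.25 * ((a - b) / h) ** 2) for a in train for b in train)
    return total / (m * m * h * 2 * math.sqrt(math.pi))

def contrast(train, test, h):
    """Least-squares contrast C(f_hat^{S_L}, S_T) from the notes."""
    f_hat = kde(train, h)
    return -2 / len(test) * sum(f_hat(x) for x in test) + integral_of_square(train, h)

def holdout_select(sample, bandwidths):
    """Split the sample in two halves and return the index minimising the contrast."""
    half = len(sample) // 2
    s_l, s_t = sample[:half], sample[half:]
    scores = [contrast(s_l, s_t, h) for h in bandwidths]
    j_hat = min(range(len(bandwidths)), key=scores.__getitem__)
    return j_hat, scores

random.seed(0)
sample = [random.gauss(0, 1) for _ in range(200)]   # illustrative data
bandwidths = [0.01, 0.1, 0.4, 1.0, 5.0]             # hypothetical grid h_1..h_d
j_hat, scores = holdout_select(sample, bandwidths)

# The selected bandwidth is then reused on the whole sample X.
final_estimator = kde(sample, bandwidths[j_hat])
print("selected bandwidth:", bandwidths[j_hat])
```

Only the selection step uses the split; the last line refits the chosen method on all of $X$.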


### V/K-Fold Cross-Validation

#### Overview

<u>Note:</u> when $V=n$, it is called the **leave one out method**. 

A sample $X$ is split into $V$ (or $K$) subsamples $S_1, ..., S_{V}$. $S_{-k}$ denotes the whole sample with the $k$-th fold $S_k$ removed at a given CV step.

> For each $k$, $\hat{f}_j^{S_{-k}}$ for $j=1,...,d$ is computed with $S_{-k}$ and evaluated on $S_k$: $C(\hat{f}_j^{S_{-k}}, S_k)$.
>
> $S_L=S_{-k}$ and $S_T=S_k$

$$\hat{j} = \underset{j = 1,...,d}{\text{argmin}}\big[\frac{1}{V}\sum^V_{k=1}C(\hat{f}_j^{S_{-k}},S_k)\big]$$

The estimator of $f_0$ is then $\hat{f}^X_{\hat{j}}$, with $\hat{j}$ the method selected by $V$-fold cross-validation and $X$ the whole sample.
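
As a rough sketch of the $V$-fold procedure, the snippet below treats each estimator as a black box. The setting is an assumption chosen for brevity: the target is a mean, the contrast is the squared error (minimal in expectation at the true mean), and the three candidate estimators (`statistics.mean`, `statistics.median`, and a hypothetical `midrange`) are illustrative only.

```python
import random
import statistics

def midrange(xs):
    """Hypothetical third estimator: midpoint between the min and the max."""
    return (min(xs) + max(xs)) / 2

def squared_error_contrast(theta, test):
    """C(theta, S_T): mean squared distance between theta and the held-out points."""
    return sum((x - theta) ** 2 for x in test) / len(test)

def v_fold_select(sample, estimators, contrast, V):
    """Return the index j_hat minimising the contrast averaged over the V folds."""
    folds = [sample[k::V] for k in range(V)]          # S_1, ..., S_V
    averages = []
    for estimator in estimators:
        total = 0.0
        for k in range(V):
            # S_{-k}: every observation that is not in fold k
            train = [x for i, fold in enumerate(folds) if i != k for x in fold]
            total += contrast(estimator(train), folds[k])
        averages.append(total / V)
    j_hat = min(range(len(estimators)), key=averages.__getitem__)
    return j_hat, averages

random.seed(0)
sample = [random.gauss(3, 1) for _ in range(100)]     # illustrative data
estimators = [statistics.mean, statistics.median, midrange]
j_hat, averages = v_fold_select(sample, estimators, squared_error_contrast, V=5)

# Final estimate: the selected method recomputed on the whole sample X.
final_estimate = estimators[j_hat](sample)
print(estimators[j_hat].__name__, final_estimate)
```

Calling `v_fold_select` with `V=len(sample)` gives the leave-one-out variant.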

#### Comparison with other classes' method

$$\frac{1}{V}\sum^V_{k=1}\hat{f}_{\hat{j}}^{S_{-k}}$$

*This is apparently problematic because, depending on the problem, averaging estimators does not always make sense.*

# 2 - Testing

### Overview

Testing and model selection are different in the sense that **they do not answer the same question**. 

$$m = 1, ..., M\quad\text{different models}$$

> **Model selection**: one will always get a model $\hat{m}$ that is considered the best in terms of bias-variance trade-off. This does not mean that it selects the true model. The selected model will just likely be not too far from it, with a reasonable number of parameters.
>
> **Goodness-of-fit testing**: one tests $H_0$: a given model $m$ is true vs. $H_1$: the model $m$ is false.

<u>Example:</u> testing Gaussianity with the Shapiro-Wilk test

If the models are known well enough, one can compute a goodness-of-fit test for each one. **Do not forget to correct for multiplicity with Bonferroni**. 

### Bonferroni Correction

Let $\delta_1, ..., \delta_K$ be $K$ tests of level $\alpha$. Then $P(\exists\text{ one test which wrongly rejects a hypothesis}) \le \sum^K_{k=1}P(\delta_k\text{ wrongly rejects}) \le \sum^K_{k=1}\alpha = K\alpha$.

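With $K=15$ and, e.g., $\alpha=0.05$, the bound is $0.75$, and for independent tests the probability of at least one wrong rejection is about one half.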

> The Bonferroni correction replaces $\alpha$ with $\frac{\alpha}{K}$
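
The effect of the correction can be checked numerically. The snippet below is a small sketch: the function names are hypothetical, and the exact family-wise error formula assumes independent tests, which the union bound itself does not require.

```python
def union_bound(K, alpha):
    """Upper bound K * alpha on the probability of at least one wrong rejection."""
    return min(1.0, K * alpha)

def familywise_error_independent(K, alpha):
    """Exact probability of at least one wrong rejection for K independent tests."""
    return 1 - (1 - alpha) ** K

def bonferroni_reject(p_values, alpha):
    """Reject the k-th hypothesis when its p-value is below alpha / K."""
    K = len(p_values)
    return [p <= alpha / K for p in p_values]

K, alpha = 15, 0.05
print("union bound, uncorrected:", union_bound(K, alpha))
print("independent tests, uncorrected:", familywise_error_independent(K, alpha))
print("independent tests, corrected:", familywise_error_independent(K, alpha / K))

# Illustrative p-values: only those below alpha / 3 survive the correction.
print(bonferroni_reject([0.001, 0.02, 0.04], alpha))
```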

### Fisher Test (Linear Gaussian Models)

\begin{align}
Y&=m+\epsilon\\
Y&\in\mathbb{R}^n\\
\epsilon_i&\sim\mathcal{N}(0,\sigma^2)\ \text{IID}\\
m&\in V \subset \mathbb{R}^n
\end{align}

$$H_0: m\in W\quad \text{vs} \quad H_1: m\in V \setminus W$$

![ft](images/fishertest.png)

The Fisher test is based on the statistic $$T=\frac{||\Pi_V Y - \Pi_W Y||^2}{||Y-\Pi_VY||^2}\cdot\frac{n-dim(V)}{dim(V)-dim(W)}$$

<u>Note:</u> $W$ is a strict linear subspace of $V$

> Under $H_0$, $T$ follows a Fisher distribution $\mathcal{F}(dim(V)-dim(W), n-dim(V))$
>
> One rejects $H_0$ when $T$ is larger than the $1-\alpha$ quantile of this distribution; equivalently, $T$ can be converted into a p-value

<u>In R:</u> The global p-value reported for an `lm()` model is the p-value of the Fisher test with $W=Vect(1,...,1)$, i.e. against the intercept-only model.
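
To make the statistic concrete, here is a minimal sketch for the simplest case $V=Vect(\mathbf{1}, x)$ and $W=Vect(\mathbf{1})$, i.e. a simple linear regression tested against the constant model. Two assumptions are made for the illustration: the data are simulated, and the p-value is approximated by Monte Carlo rather than read from the $\mathcal{F}(1, n-2)$ distribution, which is legitimate because under $H_0$ the law of $T$ depends neither on $m\in W$ nor on $\sigma$. In practice one would use the Fisher quantiles from a statistics library.

```python
import random

def fisher_statistic(x, y):
    """T for V = span(1, x) and W = span(1), so dim(V) = 2 and dim(W) = 1."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n                                   # Pi_W Y is the constant y_bar
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    proj_v = [y_bar + slope * (xi - x_bar) for xi in x]  # Pi_V Y (least-squares fit)
    numerator = sum((pv - y_bar) ** 2 for pv in proj_v)            # ||Pi_V Y - Pi_W Y||^2
    denominator = sum((yi - pv) ** 2 for yi, pv in zip(y, proj_v))  # ||Y - Pi_V Y||^2
    return numerator / denominator * (n - 2) / (2 - 1)

def monte_carlo_p_value(x, t_obs, rng, n_sim=2000):
    """Approximate P(T >= t_obs) under H0 by simulating pure Gaussian noise."""
    exceed = 0
    for _ in range(n_sim):
        noise = [rng.gauss(0, 1) for _ in x]
        if fisher_statistic(x, noise) >= t_obs:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)

rng = random.Random(0)
x = [float(i) for i in range(20)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]       # illustrative data with a real slope

t_obs = fisher_statistic(x, y)
p_value = monte_carlo_p_value(x, t_obs, rng)
print("T =", t_obs, "approximate p-value =", p_value)
```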

### Wilks' Theorem and likelihood ratio test

<u>Example:</u>

You have two models for your data $X = (X_1, ..., X_n)\,\,IID$, with model 1 nested in model 2. 

\begin{align}
\{\text{model 1: }P_\theta, \theta\in\Theta\}&\subset \{\text{model 2: }P_\kappa, \kappa\in\mathcal{K}\}
\end{align}

![mf](images/modelfunc.png)

<u>Example on transfer model:</u>

![tf1](images/transfermodel1.png)

![tf2](images/transfermodel2.png)

#### Computing the likelihood ratio (Wilks' Theorem)

The likelihood ratio statistic is given by: 

$$T = \frac{\underset{\kappa\in\mathcal{K}}{\max}\,\,L_\kappa^{model2}(X)}{\underset{\theta\in\Theta}{\max}\,\,L_\theta^{model1}(X)}$$

i.e. how plausible is model 2 compared to how plausible is model 1.

> If $T$ is large, one rejects $H_0:\text{ model 1 holds}$ because model 2 seems more plausible.

![fb1](images/forgottenbit.png)

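Equivalently, one can reject $H_0$ when the following statistic is large, where $l$ denotes the log-likelihood and $\hat{\kappa}$, $\hat{\theta}$ the maximum likelihood estimators in model 2 and model 1:

$$W=2(l_{\hat{\kappa}}(X)-l_{\hat{\theta}}(X)) = 2\log T$$

> **Wilks' theorem says that under $H_0$, $W$ converges in distribution to $\chi^2(d)$ with $d$ the number of parameters in model 2 minus the number of parameters in model 1, i.e., $d=dim(model2) - dim(model1)$.**
> 
> One rejects when $W$ is larger than the $1-\alpha$ quantile of $\chi^2(d)$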

<u>Issue with difficult models:</u>

<span style="color:red">/!\ In practice, for difficult models, being sure of the number of parameters can be problematic. Simulation is necessary here.</span>

e.g., one could have parametrized the transfer model with $(c, w_1, w_2)$ in model 1 (forgetting that $w_1 + w_2 = 1$), and with $(c, w_1, w_2, w_3)$ in model 2 (forgetting that $w_1 + w_2 + w_3 = 1$).

![exp](images/exp.png)

**As such, it is better to perform simulations and verify that, under $H_0$, the p-values are uniform (their ECDF should lie on the diagonal, or under it, to guarantee the level of the test).**

<u>Simulation</u>

> Simulate $N$ samples $X_1, ..., X_n$ under model 1, computing $W$ each time
>
> Compute the p-value of each $W$
>
> The $N$ simulated p-values give the ECDF

![st1](images/simulationstep.png)

![st1](images/example.png)
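
The simulation check can be sketched on a toy nested pair, chosen here only because everything has a closed form (it is not the transfer model of the figures): model 1 is $\mathcal{E}(1)$ with no free parameter, model 2 is $\mathcal{E}(\theta)$ with $\theta$ free, so $d=1$. With $l(\theta)=n\log\theta-\theta\sum_i X_i$ and $\hat{\theta}=1/\bar{X}$, one gets $W=2n(\bar{X}-1-\log\bar{X})$, and the $\chi^2(1)$ p-value is $\text{erfc}(\sqrt{W/2})$.

```python
import math
import random

def wilks_statistic(sample):
    """W = 2 (l_model2 - l_model1) for model 1: E(1) nested in model 2: E(theta)."""
    n = len(sample)
    x_bar = sum(sample) / n
    # max(0, .) guards against tiny negative values caused by rounding
    return max(0.0, 2 * n * (x_bar - 1 - math.log(x_bar)))

def chi2_1_p_value(w):
    """Survival function of chi^2(1): P(chi2 > w) = erfc(sqrt(w / 2))."""
    return math.erfc(math.sqrt(w / 2))

def simulate_p_values(n_obs, n_rep, rng):
    """Simulate n_rep samples of size n_obs under model 1 and return their p-values."""
    p_values = []
    for _ in range(n_rep):
        sample = [rng.expovariate(1.0) for _ in range(n_obs)]
        p_values.append(chi2_1_p_value(wilks_statistic(sample)))
    return p_values

def ecdf(values, t):
    """Empirical CDF of `values` evaluated at t."""
    return sum(v <= t for v in values) / len(values)

rng = random.Random(1)
p_values = simulate_p_values(n_obs=50, n_rep=2000, rng=rng)

# Under H0 the ECDF of the p-values should stay close to the diagonal.
for t in (0.01, 0.05, 0.10, 0.50):
    print(f"ECDF({t:.2f}) = {ecdf(p_values, t):.3f}")
```

If the degrees of freedom were miscounted, the printed values would drift away from the diagonal, which is exactly what the check is meant to reveal.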

### Bootstrap

We have the following observations:

$$X_1, ..., X_n\sim\mathcal{E}(\theta_0)$$

We are interested in the distribution of $|\frac{1}{X}-\theta_0|$ where $X\sim\mathcal{E}(\theta_0)$.

One can **perform $N_{sim}$ simulations $X_1, ..., X_{N_{sim}}\sim\mathcal{E}(\theta_0)$, compute $T_i=|\frac{1}{X_i}-\theta_0|\,\,\forall i\in\{1,...,N_{sim}\}$, and from there obtain histograms, the ECDF/CDF, and empirical quantiles when $N_{sim}$ is large**.

If we observe a single $X_1\sim\mathcal{E}(\theta_0)$, an estimator of $\theta_0$ is $\frac{1}{X_1}$, and with the distribution above one can build a confidence interval for $\theta_0$, run tests, etc.

> One needs a distribution for how far $\hat{\theta}$ is from $\theta_0$.
>
> But one does not know $\theta_0$: the bootstrap can help, but it requires $n$ observations of the phenomenon

#### Parametric Bootstrap

$$X_1, ..., X_n\sim\mathcal{E}(\theta_0)$$
$$\hat{\theta}=\frac{1}{\bar{X}}$$

One would like to know the distribution of $\hat{\theta} - \theta_0$ to compute a confidence interval for instance.

With $\epsilon$ the $1-\alpha$ quantile of the distribution of $|\hat{\theta}-\theta_0|$, the confidence interval would be $[\hat{\theta} \pm \epsilon]$. This cannot be computed directly, because the distribution depends on $\theta_0$, which is unknown.

Data is simulated $N_{sim}$ times: $X_1^*, ..., X_n^*\sim\mathcal{E}(\hat{\theta})$, with $\hat{\theta}$ computed with the original data.

One computes the bootstrap version of $\hat{\theta}$ on each simulated sample: $\hat{\theta}^*_i$ for the $i$-th simulation. This yields:

$$T_i^* = |\hat{\theta}^*_i - \hat{\theta}|,\,\,\text{the surrogate of }|\hat{\theta}-\theta_0|\text{, which cannot be accessed}$$

> One can use the $N_{sim}$ values $T_i^*$ to get the CDF, quantiles, etc.
>
> e.g. with $\epsilon^*$ the $1-\alpha$ quantile of the bootstrapped distribution, the bootstrapped CI is $[\hat{\theta} \pm\epsilon^*]$

![graphs](images/graphs.png)

In general, given a model with parameter $\theta$ and data $X_1, ..., X_n$ observed and thought to come from this model with unknown parameter $\theta_0$:

- An estimator $\hat{\theta}$ of $\theta_0$ is proposed (MLE, least squares, empirical mean); one needs consistency, $\hat{\theta}\underset{n\rightarrow+\infty}{\rightarrow}\theta_0$.
- Simulate $X_1^*,..., X_n^*$ with parameter $\hat{\theta}$, $N_{sim}$ times.
- Each time, compute $\hat{\theta}^* -\hat{\theta}$, then apply whatever transformation is needed (distance, absolute value, etc.).
- This gives the empirical bootstrap distribution of the quantity of interest.
- Use this empirical bootstrap distribution as if it were that of $\hat{\theta} -\theta_0$ to build CIs, tests, etc.

<span style="color:red">/!\ Do not forget the centering: $\quad-\theta_0 \rightarrow -\hat{\theta}$</span>

Theory exists to show that this works, but it combines two limits:
- $N_{sim} \rightarrow \infty$
- $n\rightarrow\infty$
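
The parametric bootstrap recipe above might look as follows for the exponential example. This is a sketch: the observed data are simulated here with an arbitrary rate, and the function name, `alpha` and `n_sim` values are illustrative choices.

```python
import random

def parametric_bootstrap_ci(data, rng, alpha=0.05, n_sim=2000):
    """Bootstrap CI [theta_hat +/- eps*] for the rate of an exponential sample."""
    n = len(data)
    theta_hat = n / sum(data)                      # theta_hat = 1 / X_bar
    deviations = []
    for _ in range(n_sim):
        # Simulate X*_1..X*_n from the fitted model E(theta_hat)
        star = [rng.expovariate(theta_hat) for _ in range(n)]
        theta_star = n / sum(star)
        # Centering on theta_hat (not on the unknown theta_0)
        deviations.append(abs(theta_star - theta_hat))
    deviations.sort()
    index = min(n_sim - 1, int((1 - alpha) * n_sim))
    eps_star = deviations[index]                   # empirical 1 - alpha quantile
    return theta_hat, eps_star, (theta_hat - eps_star, theta_hat + eps_star)

rng = random.Random(0)
data = [rng.expovariate(2.0) for _ in range(100)]  # illustrative data, rate 2 assumed
theta_hat, eps_star, (low, high) = parametric_bootstrap_ci(data, rng)
print("theta_hat =", theta_hat, "CI =", (low, high))
```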

#### Non-Parametric Bootstrap

It corresponds to drawing at random from the available data when there is no model assumption.

<u>bootstrap of the mean:</u>

$$X_1, ..., X_n\,\,IID\,\,\text{with unknown mean } m=\mathbb{E}[X]$$

One can estimate $m$ by $\bar{X}$.

The non-parametric bootstrap is a uniform pick at random in $\{X_1,...,X_n\}$ to get $X_1^*,...,X_n^*$: **the bootstrapped sample must be of size $n$ too, drawn with replacement**. 

$$X_1^*,...,X_n^* \rightarrow \bar{X}^*$$

One can thus get:

$$\bar{X}_i^* - \bar{X}$$

To approximate the distribution of $\bar{X}-m$.

![g2](images/graphs2.png)
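
A minimal sketch of the bootstrap of the mean is given below. The data are simulated for the illustration, and the interval is built by treating the quantiles of $\bar{X}^*-\bar{X}$ as those of $\bar{X}-m$, which is one common way (among others) of turning the bootstrap distribution into a CI.

```python
import random

def bootstrap_mean_deviations(data, rng, n_boot=2000):
    """Return X_bar and the sorted bootstrap deviations X_bar* - X_bar."""
    n = len(data)
    x_bar = sum(data) / n
    deviations = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=n)          # size n, uniform, with replacement
        deviations.append(sum(resample) / n - x_bar)
    deviations.sort()
    return x_bar, deviations

def confidence_interval(x_bar, deviations, alpha=0.05):
    """If q_lo <= X_bar - m <= q_hi, then m lies in [X_bar - q_hi, X_bar - q_lo]."""
    b = len(deviations)
    q_lo = deviations[int(alpha / 2 * b)]
    q_hi = deviations[min(b - 1, int((1 - alpha / 2) * b))]
    return x_bar - q_hi, x_bar - q_lo

rng = random.Random(0)
data = [rng.gauss(10, 2) for _ in range(60)]       # illustrative data
x_bar, deviations = bootstrap_mean_deviations(data, rng)
low, high = confidence_interval(x_bar, deviations)
print("X_bar =", x_bar, "bootstrap CI for m =", (low, high))
```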

# 3 - Other methods with independence 

### Supervised Classification

Labelled data.

### Unsupervised Classification (clustering)

Unlabelled data.