# Stochastic Models in Neurocognition

## Class 2

<hr>

**Preliminary Notes**:

- From now on, the true parameter is denoted $\theta_0$ and is unknown.
- $\lambda$ is a reserved letter for the firing rate.

<hr>

# 1 - Contrast

## 1.1 - Maximum likelihood

$$\hat{\theta} = \underset{\theta\in\Theta}{argmax}\,\,l_\theta(X) = \underset{\theta\in\Theta}{argmin}\,\,-l_\theta(X)$$

In the Gaussian linear model: 

$$\hat{\mu} = \underset{\mu\in V}{argmin}\,\,||Y-\mu||^2$$
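
To see why the two expressions agree, here is a short sketch under an assumption that is not spelled out above: $Y\sim\mathcal{N}(\mu, \sigma^2 I_n)$ with $\mu\in V$ and $\sigma^2$ known. The negative log-likelihood is then:

\begin{align}
-l_\mu(Y) &= \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}||Y-\mu||^2
\end{align}

The first term does not depend on $\mu$, so minimizing $-l_\mu(Y)$ over $\mu\in V$ amounts to minimizing $||Y-\mu||^2$.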

## 1.2 - Contrast definition

In general, a contrast is a function $C(\theta, X)$, where $\theta\in\Theta$ and $X$ is the observation, such that:

> as a function of $\theta$, $\mathbb{E}_{\theta_0}[C(\theta,X)]$ is minimal at $\theta_0$

Then, the estimator defined by the minimum of the contrast is:

> $\hat{\theta} = \underset{\theta\in\Theta}{argmin}\,\,C(\theta, X)$

<u>Examples of contrast functions:</u> negative log-likelihood (whose minimizer is the MLE), least squares

## 1.3 - Log-Likelihood

Let us prove that $C(\theta, X) = -\mathcal{l}_\theta(X)$ is a contrast.

$$\mathbb{E}_{\theta_0}[-\mathcal{l}_\theta(X)] = \mathbb{E}_{\theta_0}[-\log f_\theta(X)]=\int-\log[f_\theta(x)]f_{\theta_0}(x)\,dx$$

where $f_\theta$ is the density of the model with parameter $\theta$.

### Kullback-Leibler Divergence

The KL divergence between two densities $f$ and $g$ is denoted $K(f, g)$ and is defined by:

\begin{align}
K(f,g)&=\int \log\Big(\frac{f(x)}{g(x)}\Big)f(x)\,dx\\
&=\mathbb{E}_{X\sim f}\Big[\log\frac{f(X)}{g(X)}\Big]\\
&= \int\frac{g}{f}f + \int -\log\Big(\frac{g}{f}\Big)f + \int(-1)f \quad\text{since $\int f = 1$ and $\int \frac{g}{f}f = \int g = 1$}\\
&=\int \Big[\frac{g(x)}{f(x)}-\log\frac{g(x)}{f(x)}-1\Big]f(x)\,dx
\end{align}

As such we have:

\begin{align}
\frac{g}{f} - log\frac{g}{f} - 1 &= e^u - u - 1 \quad\text{with $u=log\frac{g}{f}$}\\
u&\mapsto e^u - u - 1 = h(u)\\
h'(u) &= e^u - 1
\end{align}

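Since $h'(u) = e^u - 1$ is negative for $u<0$ and positive for $u>0$, $h$ decreases on $(-\infty, 0]$ and increases on $[0, +\infty)$. Its minimum is therefore $h(0) = 0$, so $h(u)\ge 0$ for all $u$, with equality iff $u = 0$.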

We find that:

$$K(f, g) = \int h\Big(log\frac{g}{f}\Big)f\,dx$$

As such:

> K is **always non-negative** and K = 0 *iff* $\forall x,\,\,log\frac{g(x)}{f(x)}=0$, which means $\frac{g}{f} = 1$, i.e. $g = f$.

#### <u>Summary</u>

\begin{align}
K(f, g) &= \mathbb{E}_{X\sim f}[log\frac{f}{g}]\\
&\ge 0\\
&= 0 \quad\text{iff $f=g$}
\end{align}

**The Kullback-Leibler divergence measures a dissimilarity between densities (it is not symmetric, so it is not a distance in the strict sense). The same computation applies to any pair of PDFs.**
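
As a small numerical illustration of the summary above, the sketch below evaluates $K(f, g)$ for two discrete distributions, so that integrals become sums. The probability vectors `f` and `g` are arbitrary values chosen for the example, and the helper names are hypothetical.

```python
import math

def kl_divergence(f, g):
    """K(f, g) = sum_x f(x) * log(f(x) / g(x)) for discrete distributions."""
    return sum(fx * math.log(fx / gx) for fx, gx in zip(f, g) if fx > 0)

def h(u):
    """h(u) = exp(u) - u - 1, non-negative with equality only at u = 0."""
    return math.exp(u) - u - 1.0

def kl_via_h(f, g):
    """Same quantity written as sum_x h(log(g(x) / f(x))) * f(x)."""
    return sum(h(math.log(gx / fx)) * fx for fx, gx in zip(f, g) if fx > 0)

# Two arbitrary distributions on three points (both sum to 1).
f = [0.5, 0.3, 0.2]
g = [0.2, 0.5, 0.3]

print(kl_divergence(f, g))   # strictly positive since f != g
print(kl_via_h(f, g))        # same value, computed through h
print(kl_divergence(f, f))   # zero when both arguments coincide
print(kl_divergence(g, f))   # differs from K(f, g): K is not symmetric
```

The agreement between the two functions relies on both vectors summing to 1 and sharing the same support, which mirrors the conditions $\int f = \int g = 1$ used in the derivation.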

### Tying back with $C(\theta,X) = -\mathcal{l}_\theta(X)$

\begin{align}
\theta\mapsto\mathbb{E}_{\theta_0}[C(\theta, X)] &= \mathbb{E}_{\theta_0}[-log(f_\theta(X))]\\
&=\mathbb{E}_{\theta_0}[log(f_{\theta_0}(X))] + \mathbb{E}_{\theta_0}[-log(f_\theta(X))] - \mathbb{E}_{\theta_0}[log(f_{\theta_0}(X))]\\
&=\mathbb{E}_{\theta_0}[log(\frac{f_{\theta_0}(X)}{f_\theta(X)})] - \mathbb{E}_{\theta_0}[log(f_{\theta_0}(X))]\\
&=K(f_{\theta_0}, f_\theta)- \mathbb{E}_{\theta_0}[log(f_{\theta_0}(X))]
\end{align}
 
<span style="color:red">CHECK f_\theta or f_0</span>
 
$\mathbb{E}_{\theta_0}[-log(f_{\theta_0}(X))]$ does not depend on $\theta$. So $\mathbb{E}_{\theta_0}[-log(f_\theta(X))]$ is minimal when $K(f_{\theta_0}, f_\theta)$ is null, hence when $\forall x,\,\,f_{\theta_0}(x) = f_\theta(x)$.

If there is no identifiability problem, i.e. two different parameters $\theta$ encode two different densities, this implies that $\theta=\theta_0$.
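
The following sketch illustrates this numerically, under assumptions that are not part of the notes: the model is $\mathcal{N}(\theta, 1)$, the true parameter is set to `theta_0 = 1.5`, and the expectation is replaced by an average over a simulated sample. The empirical contrast is evaluated on a grid of candidate values and its minimizer is compared with the sample mean, which is the MLE in this model.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_0 = 1.5                          # true parameter (assumed for the demo)
x = rng.normal(theta_0, 1.0, size=2000)

def neg_log_likelihood(theta, x):
    """Empirical contrast: mean of -log f_theta(X_i) for the N(theta, 1) density."""
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (x - theta) ** 2)

grid = np.linspace(-5.0, 5.0, 1001)    # candidate values of theta, step 0.01
contrast = np.array([neg_log_likelihood(t, x) for t in grid])
theta_hat = grid[np.argmin(contrast)]  # minimizer of the contrast on the grid

print(theta_hat, x.mean())
```

The grid search is only there to make the "argmin of the contrast" explicit; in this Gaussian case the minimizer is available in closed form.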

## 1.4 - Least Square Contrast

\begin{align}
X &= \theta_0 + \epsilon\\
\theta_0&\in\mathbb{R}^d\\
\epsilon&\sim \text{some distribution with }\mathbb{E}[\epsilon]=0
\end{align}

<span style="color:red">ADD MISSING LINE</span>

If we do not specify the distribution of $\epsilon$, we **cannot compute the MLE** but we can use the contrast:

$$C(\theta, X) = ||X-\theta||^2$$

### Verification that it is a contrast

\begin{align}
\mathbb{E}_{\theta_0}[C(\theta, X)] &= \mathbb{E}_{\theta_0}[||X-\theta||^2] = \mathbb{E}_{\theta_0}[||X||^2 - 2\langle\theta, X\rangle + ||\theta||^2]\\
&=\mathbb{E}_{\theta_0}[||X||^2] - 2\langle\theta, \mathbb{E}_{\theta_0}[X]\rangle + ||\theta||^2\\
&=\mathbb{E}_{\theta_0}[||X||^2] - 2\langle\theta, \theta_0\rangle + ||\theta||^2\quad\text{since $\mathbb{E}_{\theta_0}[X]=\theta_0$}\\
&=||\theta-\theta_0||^2 + \mathbb{E}_{\theta_0}[||X||^2] - ||\theta_0||^2
\end{align}

Up to terms that do not depend on $\theta$, this is $||\theta-\theta_0||^2$, so it is minimal when $\theta = \theta_0$. As such, $\theta\mapsto||X-\theta||^2$ is a contrast, also known as the **least-squares contrast for vectors**.

$$\hat{\theta} = \underset{\theta\in V}{argmin}||X-\theta||^2 = \prod_V X$$

With $\prod_V$ the orthogonal projection onto $V$.
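
A minimal sketch of this projection, assuming $V$ is given as the span of the columns of a matrix `A` (a hypothetical choice, drawn at random here): the least-squares solver returns the coefficients of $\prod_V X$ in that basis, and the residual is orthogonal to $V$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 2
A = rng.normal(size=(n, d))      # columns span a hypothetical subspace V of R^n
x = rng.normal(size=n)           # the observation X

# Least squares: coefficients minimizing ||x - A @ coef||^2.
coef, *_ = np.linalg.lstsq(A, x, rcond=None)
proj = A @ coef                  # projection of x onto V
residual = x - proj

# The residual is orthogonal to every column of A (up to rounding).
print(A.T @ residual)

# Any other element of V is farther from x than the projection.
other = A @ (coef + 0.1)
print(np.sum(residual ** 2) <= np.sum((x - other) ** 2))
```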

### Least-Square contrast for densities

We have $X = (X_1, ..., X_n)^T$ IID with density $f_0$. Let $f$ be a candidate density and $C(f, X) = -\frac{2}{n}\sum_{i=1}^n f(X_i) + \int[f(x)]^2dx$.

\begin{align}
\mathbb{E}_{X_i\sim f_0}[C(f,X)] &= -\frac{2}{n}\sum_{i=1}^n\int f(x)f_0(x)dx + \int[f(x)]^2dx\\
&=-2 \int f(x)f_0(x)dx + \int[f(x)]^2dx\\
&= \int(f(x)-f_0(x))^2 dx - \int[f_0(x)]^2dx
\end{align}

The last term does not depend on $f$. So $\mathbb{E}_{X_i\sim f_0}[C(f, X)]$ is minimal when $\int(f(x)-f_0(x))^2 dx$ is minimal, and this integral is $\ge 0$ and $=0$ *iff* $\forall x,\,\,f(x) = f_0(x)$.

**So $C(f, X)$ is a contrast, called the least-square contrast for densities**.
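
The sketch below evaluates this contrast, in the form $-\frac{2}{n}\sum_i f(X_i) + \int f^2$, on simulated data. Everything specific here is an assumption made for the example: the true density is $\mathcal{N}(0, 1)$, the candidates are $\mathcal{N}(0, s^2)$ for a few values of $s$, and $\int f^2 = \frac{1}{2s\sqrt{\pi}}$ is the closed form for a Gaussian density.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=5000)     # sample from f_0 = N(0, 1), assumed for the demo

def gaussian_pdf(x, s):
    """Density of N(0, s^2) evaluated at x."""
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2 * np.pi))

def ls_density_contrast(x, s):
    """C(f, X) = -(2/n) * sum_i f(X_i) + int f^2 for f = N(0, s^2)."""
    integral_f_squared = 1.0 / (2.0 * s * np.sqrt(np.pi))  # closed form for a Gaussian
    return -2.0 * np.mean(gaussian_pdf(x, s)) + integral_f_squared

candidates = [0.25, 0.5, 1.0, 2.0, 4.0]  # candidate standard deviations
values = [ls_density_contrast(x, s) for s in candidates]
best_s = candidates[int(np.argmin(values))]

print(best_s)
```

Only the candidate densities are needed to compute the contrast: $f_0$ enters through the sample alone, which is what makes it usable in practice.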

# 2 - Choice of Models

### Example from data science

Given $Y_i$ the firing rate, we have $Y_i = f_0(W_i) + \epsilon_i$ with $W_i$ the weight and $\epsilon_i$ iid $\mathcal{N}(0, \sigma^2)$.

A possible model would be $f(W) = a + b * W$ with a, b unknown (linear). We can also think about quadratic, cubic, etc. models.

Another set of models relies on the *angle*: $Y_i = f_0(U_i) + \epsilon_i$ with $U_i$ the angle of the movement and $\epsilon_i\sim\mathcal{N}(0, \sigma^2)$.

<u>model 1:</u> $f(U_i) = a + b * cos(2\pi U_i)$

<u>model 2:</u> $f(U_i) = a_0 + a_1*cos(U_i) + a_{-1}*sin(U_i) + \dots + a_d *cos(d*U_i) + a_{-d}*sin(d*U_i)$

<span style="color:red">ADD GRAPH FFT</span>

In general, we have a bunch of linearly independent functions:

$$\phi_1(X), ..., \phi_d(X)$$

The problem is set up as: $Y_i = f_0(X_i) + \epsilon_i$ with $\epsilon_i\sim\mathcal{N}(0,\sigma^2)$.

The model of dimension $d$ is $f(X)\in Vect(\phi_1(X), ..., \phi_d(X)) = V$, so that $f(X_i) = a_1\phi_1(X_i) + ... + a_d\phi_d(X_i) \quad\forall i$.

<hr>

$$d \ll n$$

<hr>

We want to stop well before $d$ reaches $n$, but where?

For this model, we know that $(\hat{f}_d(X_1), ..., \hat{f}_d(X_n))^T = \prod_V Y$ with $Y = (Y_1, ..., Y_n)^T$.
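
As an illustration, the sketch below fits the cosine model 1, $f(U) = a + b\cos(2\pi U)$, by projecting $Y$ onto the span of $\phi_1(U) = 1$ and $\phi_2(U) = \cos(2\pi U)$. The parameter values, the noise level and the choice of measuring the angle in turns (values in $[0, 1)$) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
a_true, b_true, sigma = 10.0, 4.0, 0.5       # hypothetical tuning-curve parameters
u = rng.uniform(0.0, 1.0, size=n)            # movement angle in turns (assumption)
y = a_true + b_true * np.cos(2 * np.pi * u) + rng.normal(0.0, sigma, size=n)

# Design matrix whose columns are phi_1(U) = 1 and phi_2(U) = cos(2 pi U).
Phi = np.column_stack([np.ones(n), np.cos(2 * np.pi * u)])

# Coefficients (a, b) of the projection of Y onto V = Vect(phi_1, phi_2).
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
fitted = Phi @ coef                          # the vector of fitted values

print(coef)
```

Richer models only change the columns of `Phi`; the projection step stays the same.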

<span style="color:red">ADD IMAGE PROJECTION (slide 17)</span>

## 2.1 - Bias Variance Decomposition

$$Y_i = f_0(X_i) + \epsilon_i$$

For model $d$, we define $(\hat{f}_d(X_1), ..., \hat{f}_d(X_n))^T =\hat{f}_d(X)$. What is the $d$ for which the following risk is the smallest?

$$\mathbb{E}_{X\sim F_0}[||f_0(X) - \hat{f}_d(X)||^2]$$

\begin{align}
\mathbb{E}_{X\sim F_0}[||f_0(X) - \hat{f}_d(X)||^2] &= \mathbb{E}_{X\sim F_0}[||f_0(X) - \prod_VY||^2]\\
&= \mathbb{E}_{X\sim F_0}[||f_0(X) - \prod_Vf_0(X) + \prod_Vf_0(X) - \prod_VY||^2]\\
&= \mathbb{E}_{X\sim F_0}[||f_0(X) - \prod_Vf_0(X)||^2] + \mathbb{E}_{X\sim F_0}[||\prod_V\epsilon||^2]\\
&= \mathbb{E}_{X\sim F_0}[||f_0(X) - \prod_Vf_0(X)||^2] + \sigma^2d\quad\text{since $||\prod_V\epsilon||^2\sim\sigma^2\chi^2(d)$}\\
\end{align}

<span style="color:red">CHECK COMPUTATION BIAS TERM</span>

$||f_0(X) - \prod_Vf_0(X)||^2$ is the bias term, which decreases as $d$ increases, whereas the variance term $\sigma^2d$ increases with $d$.
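
The trade-off can be tabulated without any simulation, since $\mathbb{E}[||\prod_V\epsilon||^2] = \sigma^2\,\mathrm{trace}(\prod_V) = \sigma^2 d$. The sketch below does so for nested models spanned by cosines; the signal `f0`, the noise level and the basis are assumptions chosen for the example.

```python
import numpy as np

n, sigma = 50, 0.5
x = (np.arange(n) + 0.5) / n
f0 = np.exp(-3.0 * x)                        # hypothetical true signal f_0(X_i)

d_max = 15
# Nested models: V_d is spanned by cos(j * pi * x), j = 0, ..., d - 1.
basis = np.column_stack([np.cos(j * np.pi * x) for j in range(d_max)])

bias, variance, traces = [], [], []
for d in range(1, d_max + 1):
    Q, _ = np.linalg.qr(basis[:, :d])        # orthonormal basis of V_d
    P = Q @ Q.T                              # projection matrix onto V_d
    bias.append(np.sum((f0 - P @ f0) ** 2))  # ||f_0 - Pi_V f_0||^2
    variance.append(sigma ** 2 * d)          # sigma^2 * d
    traces.append(np.trace(P))               # equals d, the dimension of V_d

risk = np.array(bias) + np.array(variance)
best_d = int(np.argmin(risk)) + 1            # dimension with the smallest risk
print(best_d)
```

Computing `best_d` requires the unknown `f0`, so it is a benchmark rather than an estimator.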

### Model choice for d

To choose a good $d$ you need a trade-off between:
- complexity of the model (to have a small bias)
- variance of each of the coefficients (to have a small variance)

We define an **oracle** $\tilde{d}$ which is a benchmark:

\begin{align}
\tilde{d}&=\underset{d\in\{1,...,n-1\}}{argmin}\big(||f_0(X) - \prod_Vf_0(X)||^2 + \sigma^2d\big)
\end{align}

### Mallows' Cp

This consists in choosing $\hat{d} = \underset{d}{argmin}\big(||Y-\prod_VY||^2 + 2\sigma^2d\big)$, assuming that $\sigma$ is known. $||Y-\prod_VY||^2$ is the least-squares term (the residual sum of squares), and $2\sigma^2d$ is a penalty.

Without the penalty, $\hat{d}=\underset{d}{argmin}(||Y-\prod_VY||^2)$ would always select the largest $d$.

**A penalty is always there to avoid overfitting. Mallows' Cp is a particular case which satisfies an oracle inequality.**

\begin{align}
\mathbb{E}[||f_0(X)-\hat{f}_{\hat{d}}(X)||^2] &\le C * \text{oracle risk}\\
 &= C * \underset{d}{min}\big(||f_0(X)-\prod_Vf_0(X)||^2 + \sigma^2d\big)\\
C&\rightarrow 1
\end{align}
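
A minimal sketch of the selection rule on simulated data, assuming nested cosine models, a known $\sigma$ and a true signal that lies in the model of dimension 4 (all choices made for the example). It compares the dimension selected with and without the penalty $2\sigma^2d$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma, d_max = 100, 0.5, 20
x = (np.arange(n) + 0.5) / n
f0 = 2.0 * np.cos(np.pi * x) + np.cos(3 * np.pi * x)   # lies in the model of dimension 4
y = f0 + rng.normal(0.0, sigma, size=n)

# Nested models: V_d is spanned by cos(j * pi * x), j = 0, ..., d - 1.
basis = np.column_stack([np.cos(j * np.pi * x) for j in range(d_max)])

def rss(d):
    """Residual sum of squares ||Y - Pi_V Y||^2 for the model of dimension d."""
    coef, *_ = np.linalg.lstsq(basis[:, :d], y, rcond=None)
    return np.sum((y - basis[:, :d] @ coef) ** 2)

dims = np.arange(1, d_max + 1)
rss_values = np.array([rss(d) for d in dims])

d_no_penalty = int(dims[np.argmin(rss_values)])                  # least squares alone
d_cp = int(dims[np.argmin(rss_values + 2 * sigma ** 2 * dims)])  # penalized criterion

print(d_no_penalty, d_cp)
```

Because the models are nested, the residual sum of squares can only decrease with $d$, which is why the unpenalized rule ends up at `d_max`.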

### Akaike Criterion (AIC)

For a model $M$ parametrized by $\Theta_M$, let $\hat{\theta}_M$ be the MLE on $M$. Then $f_{\hat{\theta}_M}(X)$ is the likelihood of model $M$ evaluated at $\hat{\theta}_M$, i.e. the density of $X$ when the parameter is set to its MLE.

AIC is: $$\hat{M} = \underset{M\in\mathcal{M}}{argmin}\big(-log(f_{\hat{\theta}_M}(X)) + dim(M)\big)$$

Oracle inequalities exist for that too, as long as there are few models with the same number of parameters.
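
As a worked check, assume the Gaussian regression setting of the previous sections ($Y = f_0(X) + \epsilon$ with $\epsilon_i$ iid $\mathcal{N}(0, \sigma^2)$ and $\sigma$ known) and a linear model $V$ of dimension $d$. The AIC criterion then reads:

\begin{align}
-\log f_{\hat{\theta}_M}(Y) + dim(M) &= \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}||Y-\prod_VY||^2 + d\\
&= \frac{1}{2\sigma^2}\big(||Y-\prod_VY||^2 + 2\sigma^2d\big) + \frac{n}{2}\log(2\pi\sigma^2)
\end{align}

The last term does not depend on the model, so under these assumptions AIC and Mallows' Cp select the same dimension.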

<span style="color:red">EXPLANATION MODEL DIM</span>

### BIC Criterion

\begin{align}
Y_i &= a_0 + a_1 X_i^1 + ... + a_pX_i^p + \epsilon_i\quad\text{ with }\epsilon_i\sim\mathcal{N}(0,\sigma^2)
\end{align}

To do variable selection, one could put in competition:
\begin{align}
V_0 &= Vect((1,...,1)^T)\\
V_{01} &= Vect((1,...,1)^T, X^1)\\
V_{0...d} &= Vect((1,...,1)^T, X^1, ..., X^d)
\end{align}

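The number of candidate models grows quickly: for instance, the number of models with dimension 2 is $\frac{p(p+1)}{2}$ (two vectors chosen among the constant vector and $X^1, ..., X^p$). BIC selects: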

\begin{align}
\hat{M}&= \underset{M\in\mathcal{M}}{argmin} \big[\frac{ln(n)dim(M)}{2} - log(f_{\hat{\theta}_M}(X))\big]
\end{align}

Be careful: the more models in competition, the larger the penalty, and one may end up with a model of small dimension, whereas with fewer models in competition one could have used AIC.
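
The sketch below compares the two penalties on simulated data, assuming nested cosine models, Gaussian noise with known $\sigma$ and a true signal in the model of dimension 3 (all choices made for the example). Since $\frac{ln(n)}{2} > 1$ as soon as $n \ge 8$, BIC never selects a larger dimension than AIC in this nested setting.

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, d_max = 200, 1.0, 15
x = (np.arange(n) + 0.5) / n
y = 1.0 + np.cos(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)

# Nested models: V_d is spanned by cos(j * pi * x), j = 0, ..., d - 1.
basis = np.column_stack([np.cos(j * np.pi * x) for j in range(d_max)])

def neg_log_likelihood(d):
    """-log f_{theta_hat_M}(Y) for the Gaussian model of dimension d, sigma known."""
    coef, *_ = np.linalg.lstsq(basis[:, :d], y, rcond=None)
    rss = np.sum((y - basis[:, :d] @ coef) ** 2)
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + rss / (2 * sigma ** 2)

dims = np.arange(1, d_max + 1)
nll = np.array([neg_log_likelihood(d) for d in dims])

d_aic = int(dims[np.argmin(nll + dims)])                     # penalty dim(M)
d_bic = int(dims[np.argmin(nll + 0.5 * np.log(n) * dims)])   # penalty ln(n) * dim(M) / 2

print(d_aic, d_bic)
```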

### Other penalties

There are other penalties: 

- Lasso
- Ridge
- ElasticNet

### Slope heuristic

<span style="color:red">ADD PLOT IMAGE </span>

**Note**: For more mathematical details, see <u>Birgé and Massart</u>, "Gaussian model selection".
