# Stochastic Models in Neurocognition

## Class 1

<hr>

**Preliminary Notes**:

<u>Class framework:</u>
- First 3 classes on independent models
- 2 classes on Markov chains and Poisson processes
- 2 classes on point processes and their statistics
- 3 classes on PDMPs (piecewise-deterministic Markov processes), Brownian motion, mean-field.

<u>On tutorials:</u>
- Answers are to be posted online (on Moodle) as a .pdf, together with the .R file used for the analysis
- Grading is based on progress and on the quality of our corrections of other students' work

<u>Final exam:</u>
- Held on-site in February (programming + written part)

<hr>

# 1 - SOME EXAMPLES OF MODELS IN NEUROCOGNITION

## 1.1 - Models with independence

### Examples of models for neurocognition with independence

<u>A. **Justifying independence matters**</u>

**Independence matters** as it makes many statistical analyses possible:
- Independence enables easier computation
- Identifying where independence lies is important in terms of modeling

<u>Case 1:</u> **Individuals**, e.g. as part of a cognitive experiment with different participants. <span style="color:red">**/!\** Participants interacting removes independence</span>.

<u>Case 2:</u> **Trials**, e.g. when the participants are asked to repeat an experiment. <span style="color:red">**/!\** The state of the participants matters: tiredness, location, etc.</span>.

<u>B. **Parametric vs Non-parametric models**</u>

| **Parametric** | **Non-Parametric** | 
| ---: | ---: |
| The distribution of the data is parametrized by a finite set of parameters, e.g. $\mathcal{N}(\mu,\sigma^2)$ with $\theta = (\mu, \sigma^2)\in\mathbb{R}^2$ | The distribution cannot be described by a finite number of parameters, e.g. an unknown cumulative distribution function $f$ characterizing ***iid*** $X_1, X_2, ..., X_n$ |

<hr>

## 1.2 - Interspike intervals

The main model studied in this class is that of **interspike intervals**.

### Model of a neuron

A neuron is composed of an **axon**, a **soma**, and **dendrites** connected to another axon via a **synapse**. The **voltage** of a neuron is a continuous time series characterized by fluctuations and an activation/spike called an <span style="color:red">**action potential**</span>.

> **For a given neuron, the pattern/shape of the spike/action potential is always consistent/the same**.

The spike is powerful enough to travel to the synapse (from the **pre-synaptic neuron**), which will consequently affect the voltage of the neuron down the line (pre-synaptic neuron -> axon -> synapse -> dendrite -> neuron). 

A pre-synaptic neuron is either:
- **excitatory**: its spike induces a higher voltage in the neuron N down-the-line, and the higher the voltage, the more likely N is to spike
- **inhibitory**: its spike induces a lower voltage in the neuron N down-the-line, and the lower the voltage, the less likely N is to spike

### Problem of such a model

<u>A. **Stating the problem**</u>

> One can rarely access such data at a large scale (a human brain has c. $10^{11}$ neurons and setting electrodes in a brain is tricky). 

Instead, **tetrodes** are used. Tetrodes are deep electrodes that can record activity in different areas of the brain. Signals captured by tetrodes differ from those of single electrodes, as a tetrode collects **unsorted spikes**, which give no information on the **location of the neurons**.

![tetrodes](images/tetrodes.png)

**Tetrodes** still make it possible to recover **spike trains**, i.e. to know when a given neuron has emitted spikes, since a neuron always emits spikes of the same shape.

<hr>

> **An interspike interval is the time elapsed between two consecutive spikes of a neuron on a spike train**
>
> Short-handed as ***ISI***
>
> <span style="color:red">We can model the ISIs of a given neuron as IID random variables. </span>
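
As a minimal sketch of the definition above (in Python, purely for illustration; the spike times are made-up values, not recorded data), the ISIs of one neuron are the differences between consecutive spike times:

```python
def interspike_intervals(spike_times):
    """Return the ISIs: differences between consecutive (sorted) spike times."""
    ordered = sorted(spike_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical spike times (in seconds) of a single, already sorted neuron.
spike_times = [0.012, 0.250, 0.310, 0.905, 1.400]
isis = interspike_intervals(spike_times)
print(isis)
```

A train of $n+1$ spikes yields $n$ ISIs, which are the $X_1, ..., X_n$ modeled in the rest of this section.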

<u>(Leaky) Integrate-and-Fire:</u> Model that characterizes a particular distribution for the ISI without challenging the fact that it is IID

<hr>

<u>B. **Why are ISI IID?**</u>

**Independence:**

>  ISIs are considered IID as spikes are interspersed with reset points. This is not a mathematical justification but a modeling one: that neurons 'reset' after a spike.
>
> "At least the voltage is reset to the same value/voltage at each spike" which would **legitimate** independence

Note: This does not hold for burst phenomena.

**Identical distribution:**

> As long as the behavior of the animal does not change too much and is recorded during small periods of time
>
> **Changes in behavior**, **memory effects** (capacitance, STDP) and **interactions between neurons are not modeled**

### Parametric Assumption

The ISIs $X_i$ follow an **exponential distribution**, related to Poisson processes.

\begin{align}
X_1, X_2, ..., X_n &\sim \text{IID}\,\,\mathcal{E}(\lambda),\,\,\lambda \text{ is unknown}
\end{align}

<span style="color:red">**Limit of assumption**:</span> This model does not take into account the **refractory period**. The refractory period is the physical delay right after a spike during which a neuron cannot produce another spike (due to the low voltage). This cannot be captured by the exponential distribution, whose density is maximal at 0.

<u>Solution:</u> **Shifted exponential**, meaning the density is given by 

\begin{align}
f(x)&=\lambda e^{-(x-\theta)\lambda}\mathbb{1}_{x>\theta}\\
\lambda &\text{ is expressed in } Hz\\
\theta &\text{ is the shift, i.e. the refractory period}
\end{align}

![shiftedexp](images/shiftedexp.png)

<u>Note:</u> When $\lambda$ is small (1 to 3 Hz), the mean ISI (about 0.3 s to 1 s) is much larger than the refractory period (c. 2ms to 5ms), so we will not see this effect and the first model is good.
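
The shifted exponential can be simulated by adding the shift $\theta$ to a plain exponential variable. The sketch below is in Python for illustration; the rate of 2 Hz and the refractory period of 3 ms are assumed values chosen within the ranges quoted above, and the function name is hypothetical.

```python
import random


def sample_shifted_exponential(rate_hz, refractory_s, n, rng):
    """Draw n ISIs X = theta + E with E ~ Exp(lambda).

    The density of X is lambda * exp(-lambda * (x - theta)) for x > theta.
    """
    return [refractory_s + rng.expovariate(rate_hz) for _ in range(n)]


rng = random.Random(0)   # fixed seed so the run is reproducible
rate_hz = 2.0            # assumed firing rate lambda (Hz)
refractory_s = 0.003     # assumed refractory period theta (s)

isis = sample_shifted_exponential(rate_hz, refractory_s, 10_000, rng)
mean_isi = sum(isis) / len(isis)

print("no ISI shorter than theta:", min(isis) >= refractory_s)
print("empirical mean:", mean_isi, " theoretical mean:", refractory_s + 1 / rate_hz)
```

With these values the shift changes the mean ISI by well under 1%, which is the sense in which the unshifted model is good enough at low rates.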

### Non-Parametric Assumption

\begin{align}
X_1, X_2, ..., X_n &\sim \text{IID with density } f\text{, with $f$ unknown} 
\end{align}

The function `density()` in R provides an estimator $\hat{f}$ of $f$.
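
To illustrate what a non-parametric density estimator does, here is a minimal Gaussian kernel density estimate written from scratch in Python. It is a sketch, not a reimplementation of any R function: the bandwidth is hand-picked, the data are simulated exponential ISIs with an assumed rate of 2 Hz, and all names are hypothetical.

```python
import math
import random


def gaussian_kde(sample, bandwidth):
    """Return the function x -> f_hat(x), a Gaussian kernel density estimate."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)

    def f_hat(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm

    return f_hat


rng = random.Random(1)
rate_hz = 2.0  # assumed rate, used only to simulate data
isis = [rng.expovariate(rate_hz) for _ in range(2000)]

f_hat = gaussian_kde(isis, bandwidth=0.05)  # hand-picked bandwidth (assumption)

# Riemann sum over [-1, 7]: an estimated density should integrate to about 1.
step = 0.01
grid = [-1.0 + step * k for k in range(801)]
total_mass = sum(f_hat(x) for x in grid) * step

print("total mass:", total_mass)
print("f_hat(0.5):", f_hat(0.5), " true density:", rate_hz * math.exp(-rate_hz * 0.5))
```

The estimate is poor near 0, where the true exponential density jumps, which is a known limitation of kernel estimators at a boundary.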

<hr>

## 1.3 - Neural Rate Coding

### Firing Rate

**Definition**: The firing rate of a neuron is the average number of spikes produced per second.

<u>Adrian and Zotterman's experiment, 1926:</u> 

They found that the firing rate of a sensory nerve of a muscle increases with the weight attached to it. 

<u>Georgopoulos, Schwartz and Kettner's experiment, 1986:</u> 

Beyond weights, neurons have a preferred direction (angle of the rotation of the underlying organ or limb). Neurons encode the strength of the underlying phenomenon, but also its mode.

### Modeling the Firing Rate

<u>Modeling:</u>

We set:

\begin{align}
Y&\text{ is the firing rate}\\
Y&=\begin{cases}
      a+b\,W+\sigma\epsilon & \text{for the weight $W$, with $a$, $b$, $\sigma$ unknown and $\epsilon\sim\mathcal{N}(0,1)$}\\
      a + b\cos\theta+\sigma\epsilon & \text{for the angle $\theta$ of the movement}
    \end{cases}
\end{align}

The model above is **parametric**. The corresponding **non-parametric model** for $Y$, the firing rate, can be stated as:


\begin{align}
Y&=\begin{cases}
      f(W)+\sigma\epsilon\quad\text{  with f unknown}\\
      f(\theta)+\sigma\epsilon
    \end{cases}
\end{align}

$\theta$ is the angle between the movement and the preferred direction of the neuron.

<u>Measuring variations:</u>

- For one cell, one tries several $W$ or $\theta$. The observations are independent.
- For different cells, one 'hopes' that the observations are still independent

<u>Is the noise Gaussian?</u>

$$Y = \text{firing rate} = \frac{\text{number of spikes}}{\text{duration of the experiment}}$$

- The number of spikes is usually modeled by Binomial or Poisson
- In both cases, when the duration of the experiment is long enough, these distributions may be approximated by a Gaussian

Sometimes, one needs to transform the data to make them look Gaussian. The Anscombe transform $N\rightarrow2\sqrt{N+3/8}$ is a classical variance-stabilizing transform: it makes a Poisson count look approximately Gaussian with a variance close to 1, whatever its mean (as long as the mean is not too small).
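
The effect of the Anscombe transform can be checked by simulation. The following Python sketch draws Poisson counts for a few assumed means and compares the variance before and after the transform. The Poisson sampler uses Knuth's multiplication method, chosen here only to keep the example free of external libraries.

```python
import math
import random


def sample_poisson(mean, rng):
    """Knuth's multiplication method (suitable for moderate means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def anscombe(count):
    """Anscombe transform N -> 2 * sqrt(N + 3/8)."""
    return 2.0 * math.sqrt(count + 3.0 / 8.0)


def sample_variance(values):
    center = sum(values) / len(values)
    return sum((v - center) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(2)
results = {}
for mean in (5.0, 20.0, 50.0):  # assumed spike-count means
    counts = [sample_poisson(mean, rng) for _ in range(20_000)]
    raw_var = sample_variance(counts)
    stabilized_var = sample_variance([anscombe(c) for c in counts])
    results[mean] = (raw_var, stabilized_var)
    print(f"mean={mean:5.1f}  var(N)={raw_var:6.2f}  var(transformed)={stabilized_var:4.2f}")
```

The variance of the raw counts grows with the mean, while the variance of the transformed counts stays close to 1.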

<hr>

***TO REMEMBER:***

- **Estimation (Law of Large numbers)**
- **Asymptotic confidence intervals (Central Limit Theorem)**
- **Tests -> Parametric (`lm()`) or Non-Parametric (`ks.test()`, `wilcox.test()`, `shapiro.test()`, $\chi^2$ test)**

<hr>

# 2 - LIKELIHOOD AND CONTRAST

## 2.1 - Likelihood

<u>Toy Example:</u>

One observes $X\sim\mathcal{N}(\mu,1)$ and two candidate values $m_1$ and $m_2$ are available for $\mu$. One would tend to choose the candidate that is closest to the observation $X$, i.e. the one with the larger density when comparing $f_{m_1}(X)$ and $f_{m_2}(X)$.

If the maximum is achieved in $\theta=m_1$, then $\hat{\theta}(MLE)=m_1$.
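
The toy example can be made concrete with numbers. In the Python sketch below, the observation and the two candidate means are hypothetical values chosen for illustration.

```python
import math


def normal_density(x, mean, sd=1.0):
    """Density of N(mean, sd^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


x_observed = 1.3                      # hypothetical observation X
candidates = {"m1": 1.0, "m2": 3.0}   # hypothetical candidate means

# Likelihood of each candidate: the density f_m evaluated at the observation.
likelihoods = {name: normal_density(x_observed, m) for name, m in candidates.items()}
best = max(likelihoods, key=likelihoods.get)

print(likelihoods)
print("selected candidate:", best)
```

Here $m_1$ is selected because it is closer to the observation, which for a Gaussian with fixed variance is the same as having the larger density.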

### Definition

In general, one has a **parametric** family $f_\theta$ parametrized by $\theta\in\mathbb{R}^d$, where $f_\theta$ is either:

\begin{align}
f_\theta&=\begin{cases}
      \text{ a density if the variable is continuous}\\
      \text{ the probability mass function if the variable is discrete}
    \end{cases}
\end{align}

$$\theta \rightarrow f_\theta(X)$$

One observes $X\sim f_\theta$ for $\theta$ unknown.

### Maximum Likelihood Estimator

An **estimator is a function of the data**. Since the data are random, the estimator is a **random variable**. Once the data are observed, they are considered **fixed** and the likelihood is viewed as a function of $\theta$.

$$\hat{\theta} = \underset{\theta\in\Theta}{argmax} f_\theta(X)$$ i.e. the point $\theta$ which maximizes $f_\theta(X)$. If several $\theta$ are available, the "best" one is selected as it maximizes the likelihood of observing $X$.

<u>Heuristics:</u> 

If one observes $X_1, ..., X_n$ ***IID*** with density $g_\theta(x)$ then the density of $X=(X_1, ..., X_n)$ is:

$$f_\theta(X) = g_\theta(X_1)\,\,\times\,\,...\,\,\times\,\,g_\theta(X_n)$$

<u>Notation:</u>

- **Likelihood**: $\mathcal{L}(\theta) = f_\theta(X)$. In this case (IID) it is: $\overset{n}{\underset{i=1}{\prod}}g_\theta(X_i)$
- **Log-Likelihood**: $\mathcal{l}(\theta) = log(f_\theta(X))$. In this case (IID) it is: $\overset{n}{\underset{i=1}{\sum}}log(g_\theta(X_i))$

<u>Assumptions of the MLE:</u>

With very few assumptions, the MLE is usually:

- **Consistent** (convergence to the true parameter $\theta_0$)
- With the **smallest asymptotic variance**

However:

- It is computable by hand in very few cases
- if $\mathcal{L}(\theta)$ is computable, its maximization might be tricky
- there are cases where even computing $\mathcal{L}(\theta)$ is a challenge

<hr>

## 2.2 - Example with exponential ISI

We have: $X_1, ..., X_n$ with density $\lambda e^{-\lambda x}\mathbb{1}_{x\ge0}$

As such:

\begin{align}
\mathcal{L}(\lambda) &= f_\lambda(X) = \overset{n}{\underset{i=1}{\prod}}(\lambda e^{-\lambda X_i}\mathbb{1}_{X_i\ge0})\\
&= \lambda^n e^{-\lambda \underset{i=1}{\overset{n}{\sum}}X_i}\mathbb{1}_{\min(X_i)\ge0}\\
\mathcal{l}(\lambda)&=n.log(\lambda) - \lambda\underset{i=1}{\overset{n}{\sum}}X_i + log(\mathbb{1}_{\min(X_i)\ge0})\\
&=n.log(\lambda) - \lambda\underset{i=1}{\overset{n}{\sum}}X_i \quad\quad(log(\mathbb{1}_{\min(X_i)\ge0})\,\,\text{is always 0, as the $X_i$ are nonnegative})
\end{align}

To find the maximum, we use the derivative:

\begin{align}
\mathcal{l}(\lambda)&=n.log(\lambda) - \lambda\underset{i=1}{\overset{n}{\sum}}X_i\\
\mathcal{l}'(\lambda)&=\frac{n}{\lambda} - \underset{i=1}{\overset{n}{\sum}}X_i \\
\mathcal{l}''(\lambda)&=-\frac{n}{\lambda^2} < 0 \quad\text{(so a zero of the derivative is a maximum)}\\
\mathcal{l}'(\lambda)&=0 \Leftrightarrow \lambda=\frac{n}{\underset{i=1}{\overset{n}{\sum}}X_i}
\end{align}

![mle](images/mle.png)
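
The closed form $\hat{\lambda} = n / \sum_i X_i$ can be checked numerically. The Python sketch below simulates exponential ISIs with an assumed true rate of 2 Hz, computes the closed-form MLE, and compares it with a brute-force maximization of the log-likelihood over a grid. The grid bounds and step are arbitrary choices.

```python
import math
import random


def exp_log_likelihood(rate, sample):
    """Log-likelihood n*log(lambda) - lambda*sum(X_i) of an IID exponential sample."""
    return len(sample) * math.log(rate) - rate * sum(sample)


rng = random.Random(3)
true_rate_hz = 2.0  # assumed value, used only to simulate the data
isis = [rng.expovariate(true_rate_hz) for _ in range(5000)]

# Closed-form maximum likelihood estimator.
rate_mle = len(isis) / sum(isis)

# Brute-force check on a grid from 0.5 Hz to 4.5 Hz with step 0.01 Hz.
grid = [0.5 + 0.01 * k for k in range(401)]
best_on_grid = max(grid, key=lambda rate: exp_log_likelihood(rate, isis))

print("closed-form MLE:", rate_mle)
print("grid maximizer :", best_on_grid)
```

Both values agree up to the grid resolution, and both are close to the rate used for the simulation.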

<hr>

## 2.3 - Gaussian Linear Models

\begin{align}
\begin{align}
Y_i=\begin{cases}
      a + b.W_i + \sigma\epsilon_i \\
      a + b.cos(\theta_i)+\sigma\epsilon_i
    \end{cases}
\end{align}

In general for linear Gaussian models (think also of ANOVA, etc.):

> $Y = (Y_1, ..., Y_n)^T = \mu+\sigma\epsilon$ with $\epsilon = (\epsilon_1, ..., \epsilon_n)^T$ and $\epsilon_i\,\,IID\,\,\sim\mathcal{N}(0,1)$.
>
> $\mu\in V$ and $\sigma$ are both unknown, where $V$ is a known linear subspace of $\mathbb{R}^n$, e.g., $V=vect((1,...,1)^T, (W_1,...,W_n)^T)$ or $V=vect((1,...,1)^T, (cos(\theta_1),...,cos(\theta_n))^T)$

### Estimation of $\mu$ and $\sigma$ by MLE

<u>Likelihood:</u>

$$\theta = (\mu, \sigma)$$
$$dim(V)+1\text{ parameters}$$

\begin{align}
\mathcal{L}(\theta) =f_\theta(Y) &= \underset{i=1}{\overset{n}{\prod}}\frac{e^{-\frac{(Y_i-\mu_i)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}}\\
&= \frac{e^{-\underset{i=1}{\overset{n}{\sum}}\frac{(Y_i-\mu_i)^2}{2\sigma^2}}}{(\sqrt{2\pi\sigma^2})^n}
\end{align}

**We note that $\mu = (\mu_1, ..., \mu_n)^T\in V \subset \mathbb{R}^n$. In the case $V=vect((1,...,1)^T, (W_1,...,W_n)^T)$, $\mu_i = a+b.W_i$, so $\mu$ is described by just two parameters.**

<u>Log-Likelihood:</u>

\begin{align}
\mathcal{l}(\mu, \sigma^2) &= -\underset{i=1}{\overset{n}{\sum}}\frac{(Y_i-\mu_i)^2}{2\sigma^2} - \frac{n}{2}.log(2\pi\sigma^2)\\
&= -\frac{||(Y-\mu)||^2}{2\sigma^2} - n.log(\sigma) - \frac{n}{2}.log(2\pi)
\end{align}

#### FOR $\mu$

**Maximizing the log-likelihood in $\mu$ corresponds to minimizing the norm $||Y-\mu||^2$ over $\mu \in V$.** The minimizer is the orthogonal projection of $Y$ onto $V$:

$$\hat{\mu}=\Pi_VY$$

![proj](images/projection.png)
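
For $V=vect((1,...,1)^T, (cos(\theta_1),...,cos(\theta_n))^T)$, the projection $\Pi_VY$ reduces to an ordinary least-squares fit of $a$ and $b$. The Python sketch below uses the closed-form simple regression formulas on simulated data. The values of $a$, $b$, $\sigma$ and the set of angles are assumptions made for the illustration, and the function name is hypothetical.

```python
import math
import random


def project_on_line_model(regressor, response):
    """Orthogonal projection of response onto vect(1, regressor).

    Returns (a_hat, b_hat, fitted) with fitted_i = a_hat + b_hat * regressor_i.
    """
    n = len(response)
    mean_x = sum(regressor) / n
    mean_y = sum(response) / n
    sxx = sum((x - mean_x) ** 2 for x in regressor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(regressor, response))
    b_hat = sxy / sxx
    a_hat = mean_y - b_hat * mean_x
    return a_hat, b_hat, [a_hat + b_hat * x for x in regressor]


rng = random.Random(4)
a_true, b_true, sigma_true = 20.0, 10.0, 2.0   # assumed tuning-curve parameters
angles = [2.0 * math.pi * k / 200 for k in range(200)]
cosines = [math.cos(t) for t in angles]
rates = [a_true + b_true * c + sigma_true * rng.gauss(0.0, 1.0) for c in cosines]

a_hat, b_hat, fitted = project_on_line_model(cosines, rates)
residuals = [y - m for y, m in zip(rates, fitted)]

# The residual Y - Pi_V Y must be orthogonal to both vectors spanning V.
dot_ones = sum(residuals)
dot_cos = sum(r * c for r, c in zip(residuals, cosines))

print("a_hat, b_hat:", a_hat, b_hat)
print("orthogonality checks:", dot_ones, dot_cos)
```

The two dot products are zero up to rounding error, which is the defining property of the orthogonal projection.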

#### FOR $\sigma^2$

\begin{align}
\mathcal{l}(\Pi_VY, \sigma^2) &= -\frac{||Y-\Pi_VY||^2}{2\sigma^2} - \frac{n}{2}log(\sigma^2) - \frac{n}{2}log(2\pi)\\
\frac{\partial\mathcal{l}}{\partial\sigma^2} &= \frac{||Y-\Pi_VY||^2}{2(\sigma^2)^2} - \frac{n}{2\sigma^2}
\end{align}

$\frac{\partial\mathcal{l}}{\partial\sigma^2}$ vanishes at $$\hat{\sigma^2}=\frac{||Y-\Pi_VY||^2}{n}$$

<u>Remark:</u> 
$$||Y-\Pi_VY||^2\sim \sigma^2\chi^2(n-dim(V))$$

So $\mathbb{E}(||Y-\Pi_VY||^2) = \sigma^2(n-dim(V))$ which means that:

$$\mathbb{E}(\hat{\sigma^2}) = \frac{\sigma^2(n-dim(V))}{n}$$

Hence, the **MLE IS BIASED** as 
$\mathbb{E}(\hat{\sigma^2}) = \frac{\sigma^2(n-dim(V))}{n} \neq \sigma^2$ **but there IS CONVERGENCE** of $\mathbb{E}(\hat{\sigma^2})$ to $\sigma^2$ when $n\rightarrow +\infty$

Most of the time, people prefer $$\hat{\sigma}^2_{classic} = \frac{||Y-\Pi_VY||^2}{n-dim(V)}\quad\text{cf. lm() in R}$$ as $$\mathbb{E}(\hat{\sigma}^2_{classic}) = \frac{\sigma^2(n-dim(V))}{n-dim(V)}=\sigma^2$$

<u>Classical estimator</u>

$$\hat{\sigma}^2_{classic}=\hat{\sigma}^2_{MLE}\times\frac{n}{n-dim(V)}$$
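
The bias of the MLE of $\sigma^2$ is easy to see by Monte Carlo when $n$ is small. The Python sketch below repeats a simulated experiment with $n=10$ and $dim(V)=2$, and averages both estimators. The design points $W_i$ and the true parameters are assumed values.

```python
import random


def residual_sum_of_squares(regressor, response):
    """||Y - Pi_V Y||^2 for V = vect(1, regressor), via simple linear regression."""
    n = len(response)
    mean_x = sum(regressor) / n
    mean_y = sum(response) / n
    sxx = sum((x - mean_x) ** 2 for x in regressor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(regressor, response))
    b_hat = sxy / sxx
    a_hat = mean_y - b_hat * mean_x
    return sum((y - a_hat - b_hat * x) ** 2 for x, y in zip(regressor, response))


rng = random.Random(5)
n, dim_v = 10, 2
sigma_true = 2.0                          # assumed noise level, so sigma^2 = 4
weights = [float(k) for k in range(n)]    # assumed design points W_i

mle_values, classic_values = [], []
for _ in range(5000):
    rates = [1.0 + 0.5 * w + sigma_true * rng.gauss(0.0, 1.0) for w in weights]
    rss = residual_sum_of_squares(weights, rates)
    mle_values.append(rss / n)                # MLE: divides by n
    classic_values.append(rss / (n - dim_v))  # classical: divides by n - dim(V)

mean_mle = sum(mle_values) / len(mle_values)
mean_classic = sum(classic_values) / len(classic_values)

print("average MLE      :", mean_mle, " expected:", sigma_true**2 * (n - dim_v) / n)
print("average classical:", mean_classic, " expected:", sigma_true**2)
```

The MLE averages around $\sigma^2(n-dim(V))/n = 3.2$ while the classical estimator averages around $\sigma^2 = 4$.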

<hr>

## 2.4 - Cognitive Model of Categorization

A participant is given a list of objects to categorize. The experiment has a learning phase and a transfer phase.

Models of transfer aim at modeling how a human categorizes objects given what they have learned. There is no right or wrong answer, and no feedback is given.

> modeling learning -> difficult as there is no independence -> different participants learn differently

Modeling transfer is easier because:

> one can imagine that the answer for each object is independent from the other ones (no feedback)
>
> for the same reason, they are "identically distributed" given the object that is presented

<u>Nosofsky, 1986:</u> Proposal of the Generalized Context Model (GCM)

A given object $x$ is represented by a **list of attributes** (color, shape, etc.) such that: $$x = (x_1, ..., x_d)$$

**Similarity** between an object $x$ and an object $y$ is represented by:

$$S(x, y) = e^{-c.d(x, y)}$$ where $c$ is a sensitivity parameter and $$d(x, y) = \overset{d}{\underset{i=1}{\sum}}|x_i - y_i|$$

We say that $\mathbb{P}(\text{y is said to be in A}) = \frac{\underset{x\in\mathbb{L}\cap A}{\sum} S(y, x)}{\underset{x\in\mathbb{L}}{\sum} S(y, x)}$ where $\mathbb{L}$ is the set of learned objects. 

In a 2-category situation (categories $A$ and $B$), the denominator decomposes as:

$$\underset{x\in\mathbb{L}}{\sum} S(y, x) = \underset{x\in\mathbb{L}\cap A}{\sum} S(y, x) + \underset{x\in\mathbb{L}\cap B}{\sum} S(y, x)$$
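
The categorization probability can be computed directly from these formulas. The Python sketch below is a minimal illustration: the learned objects, their two numeric attributes, the probe object and the value of $c$ are all made up, and the function names are hypothetical.

```python
import math


def city_block_distance(x, y):
    """d(x, y) = sum over attributes of |x_i - y_i|."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def similarity(x, y, c):
    """S(x, y) = exp(-c * d(x, y))."""
    return math.exp(-c * city_block_distance(x, y))


def probability_of_category(probe, learned, category, c):
    """P(probe is said to be in `category`), with learned = [(attributes, label), ...]."""
    total = sum(similarity(probe, x, c) for x, _ in learned)
    in_category = sum(similarity(probe, x, c) for x, label in learned if label == category)
    return in_category / total


# Hypothetical learned set L: two objects per category, two attributes each.
learned = [
    ((0.0, 0.0), "A"),
    ((0.0, 1.0), "A"),
    ((1.0, 1.0), "B"),
    ((1.0, 0.0), "B"),
]
c = 2.0               # assumed sensitivity parameter
probe = (0.1, 0.5)    # hypothetical transfer object y

p_a = probability_of_category(probe, learned, "A", c)
p_b = probability_of_category(probe, learned, "B", c)
print("P(A) =", p_a, " P(B) =", p_b)
```

The probe is closer to the learned objects of category A, so its probability of being classified in A exceeds one half, and the two probabilities sum to 1 as in the 2-category decomposition.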

<hr>