# Bayesian linear regression

At its core, the SINDy algorithm solves a linear regression problem:

$$
\boldsymbol{\dot{u}} = \boldsymbol{\Theta}\boldsymbol{\xi}
$$

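It does so by finding the coefficients $\boldsymbol{\xi}$ that minimize a loss, typically the squared residual $||\boldsymbol{\dot{u}} - \boldsymbol{\Theta}\boldsymbol{\xi}||^2$, with some optimization algorithm.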
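As a minimal sketch of this regression step, the snippet below builds a toy problem and solves it with ordinary least squares. The matrix `Theta`, the coefficients `xi_true`, and the problem sizes are made-up stand-ins for illustration, not an actual SINDy library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 200 samples, 3 candidate library terms.
n_samples, n_terms = 200, 3
Theta = rng.normal(size=(n_samples, n_terms))  # stand-in for the library matrix
xi_true = np.array([1.5, 0.0, -2.0])           # made-up "true" coefficients
u_dot = Theta @ xi_true                        # noise-free derivatives

# Ordinary least squares: minimizes ||u_dot - Theta @ xi||^2 over xi.
xi_hat, *_ = np.linalg.lstsq(Theta, u_dot, rcond=None)
```

Without noise, `xi_hat` recovers `xi_true` up to floating-point error.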

Let us now suppose that the data are affected by noise drawn from a Gaussian distribution:

$$
\boldsymbol{\dot{u}} = \boldsymbol{\Theta}\boldsymbol{\xi} + \boldsymbol{\epsilon}
$$

with 

$$
\epsilon_i \sim \mathcal{N}(0,\sigma^2)
$$

(_Question: we assume the noise on the trajectory is Gaussian. Does the noise on its time derivative follow the same distribution?_)

This model implies the following likelihood distribution for $\boldsymbol{\dot{u}}$:

$$
P(\boldsymbol{\dot{u}}|\boldsymbol{\Theta},\boldsymbol{\xi},\sigma^2) \propto (\sigma^2)^{-n/2} \exp \left ( -\frac{1}{2\sigma^2} (\boldsymbol{\dot{u}} - \boldsymbol{\Theta}\boldsymbol{\xi} )^T(\boldsymbol{\dot{u}} - \boldsymbol{\Theta}\boldsymbol{\xi} ) \right )
$$

So the minimization algorithm is effectively looking for the coefficients that maximize the likelihood of the observed data: for fixed $\sigma^2$, maximizing the likelihood is equivalent to minimizing the squared residual $||\boldsymbol{\dot{u}} - \boldsymbol{\Theta}\boldsymbol{\xi}||^2$.
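The equivalence between least squares and maximum likelihood can be checked numerically. The sketch below evaluates the Gaussian log-likelihood at the least-squares solution and at randomly perturbed coefficients; the helper name `gaussian_log_likelihood` and all numerical values are illustrative assumptions.

```python
import numpy as np

def gaussian_log_likelihood(u_dot, Theta, xi, sigma2):
    """Log of the Gaussian likelihood, including the normalizing constant."""
    n = u_dot.shape[0]
    residual = u_dot - Theta @ xi
    return -0.5 * n * np.log(2 * np.pi * sigma2) - residual @ residual / (2 * sigma2)

rng = np.random.default_rng(1)
sigma2 = 0.1 ** 2                               # assumed noise variance
Theta = rng.normal(size=(100, 2))
xi_true = np.array([0.5, -1.0])
u_dot = Theta @ xi_true + rng.normal(scale=np.sqrt(sigma2), size=100)

# Least-squares estimate and its log-likelihood.
xi_ls, *_ = np.linalg.lstsq(Theta, u_dot, rcond=None)
ll_ls = gaussian_log_likelihood(u_dot, Theta, xi_ls, sigma2)

# Log-likelihood at randomly perturbed coefficients, for comparison.
perturbed = [xi_ls + 0.05 * rng.normal(size=2) for _ in range(20)]
ll_perturbed = [gaussian_log_likelihood(u_dot, Theta, xi, sigma2) for xi in perturbed]
```

Every entry of `ll_perturbed` should be lower than `ll_ls`, since the least-squares solution minimizes the squared residual.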

### Prior distribution

The choice of the prior distribution used to estimate the posterior is crucial: it should reflect how the algorithm actually works. For instance, if we impose L2 regularization, the algorithm is not simply minimizing the residual $||\boldsymbol{\dot{u}} - \boldsymbol{\Theta}\boldsymbol{\xi}||^2$ (i.e. maximizing the likelihood of the data combined with a uniform prior); it is also trying to keep $||\boldsymbol{\xi}||^2$ small. We can show that this is equivalent to imposing a Gaussian prior. First, recall the Maximum A Posteriori (_MAP_) estimation problem:

$$
\hat{\boldsymbol{\xi}}_{MAP} = \argmax_{\boldsymbol{\xi}} P(\boldsymbol{\xi} | \boldsymbol{\Theta},\boldsymbol{\dot{u}},\sigma^2) = \argmax_{\boldsymbol{\xi}}  P(\boldsymbol{\dot{u}}|\boldsymbol{\xi},\boldsymbol{\Theta},\sigma^2) P(\boldsymbol{\xi})
$$

Taking the logarithm, and collecting the terms that do not depend on $\boldsymbol{\xi}$ into a constant:

$$
\log (P(\boldsymbol{\xi} | \boldsymbol{\Theta},\boldsymbol{\dot{u}},\sigma^2)) = \log (P(\boldsymbol{\dot{u}}|\boldsymbol{\xi},\boldsymbol{\Theta},\sigma^2)) + \log (P(\boldsymbol{\xi})) + \text{const}
$$

We already wrote down the likelihood $P(\boldsymbol{\dot{u}}|\boldsymbol{\xi},\boldsymbol{\Theta},\sigma^2)$ earlier. If we assume a multivariate normal prior $\boldsymbol{\xi}\sim \mathcal{N}(0,\boldsymbol{S})$, we have

$$
\log (P(\boldsymbol{\xi} )) = \log (\mathcal{N}(0,\boldsymbol{S})) = \log \left [ \frac{1}{(2\pi)^{D/2}|\boldsymbol{S}|^{1/2}} \exp (-\frac{1}{2}\boldsymbol{\xi}^T \boldsymbol{S}^{-1}\boldsymbol{\xi}) \right ]
$$

where $D$ is the number of coefficients. If we also assume an isotropic covariance $\boldsymbol{S}=\eta \boldsymbol{I}$, we find

$$
\log (P(\boldsymbol{\xi} )) = -\frac{1}{2\eta}||\boldsymbol{\xi}||^2 + \text{const}
$$

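Combining this with the log-likelihood, maximizing the posterior is the same as minimizing $\frac{1}{2\sigma^2}||\boldsymbol{\dot{u}} - \boldsymbol{\Theta}\boldsymbol{\xi}||^2 + \frac{1}{2\eta}||\boldsymbol{\xi}||^2$, i.e. a least-squares loss with an L2 penalty of weight $\sigma^2/\eta$. A minimization algorithm subject to L2 regularization is therefore also maximizing the posterior of a linear model with Gaussian noise and a Gaussian prior (we expect the coefficients not to stray too far from the origin!).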
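The link between L2 regularization and the Gaussian prior can be verified numerically. The sketch below computes the ridge solution with penalty weight $\sigma^2/\eta$ and checks that the gradient of the log-posterior vanishes there. The values of `sigma2` and `eta` and the toy data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, eta = 0.2 ** 2, 1.0   # noise variance and prior variance (made-up values)
alpha = sigma2 / eta          # equivalent ridge penalty weight

Theta = rng.normal(size=(50, 3))
u_dot = Theta @ np.array([1.0, 0.0, -0.5]) + rng.normal(scale=np.sqrt(sigma2), size=50)

# Ridge estimate: solve (Theta^T Theta + alpha I) xi = Theta^T u_dot.
xi_map = np.linalg.solve(Theta.T @ Theta + alpha * np.eye(3), Theta.T @ u_dot)

def log_posterior_gradient(xi):
    """Gradient of log-likelihood + log-prior with respect to xi."""
    return Theta.T @ (u_dot - Theta @ xi) / sigma2 - xi / eta

grad_at_map = log_posterior_gradient(xi_map)
```

`grad_at_map` should be zero up to floating-point error, confirming that the ridge solution is a stationary point (here the maximum) of the posterior.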

_Question: what about the distribution of $\sigma^2$?_

_Maybe:_

$$
P(\boldsymbol{\xi},\sigma^2 | \boldsymbol{\dot{u}},\boldsymbol{\Theta}) \propto P(\boldsymbol{\dot{u}} | \boldsymbol{\xi},\sigma^2,\boldsymbol{\Theta})P(\boldsymbol{\xi},\sigma^2)
$$

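_and if $\sigma^2$ and $\boldsymbol{\xi}$ are independent a priori, we have $P(\boldsymbol{\xi},\sigma^2)=P(\boldsymbol{\xi})P(\sigma^2)$, so_

$$
P(\boldsymbol{\xi},\sigma^2 | \boldsymbol{\dot{u}},\boldsymbol{\Theta}) \propto P(\boldsymbol{\dot{u}} | \boldsymbol{\xi},\sigma^2,\boldsymbol{\Theta})P(\boldsymbol{\xi})P(\sigma^2)
$$

_Note that prior independence does not by itself make the posterior factorize into $P(\boldsymbol{\xi} | \boldsymbol{\dot{u}},\boldsymbol{\Theta})P(\sigma^2 | \boldsymbol{\dot{u}},\boldsymbol{\Theta})$, because the likelihood couples $\boldsymbol{\xi}$ and $\sigma^2$._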

### SINDy optimizer

So we know the likelihood for a linear model with Gaussian noise. To perform a proper analysis and derive the correct posterior distribution of the coefficients, however, we need to identify the prior distribution that best reflects the minimization algorithm we are using. Let's look at what the default SINDy optimizer, `pysindy.optimizers.stlsq`, does according to its <a href="https://pysindy.readthedocs.io/en/latest/api/pysindy.optimizers.html#module-pysindy.optimizers.stlsq">documentation</a>:

> Attempts to minimize the objective function $||y-Xw||^2_2 + \alpha ||w||^2_2$ by iteratively performing least squares and masking out elements of the weight array w that are below a given threshold.


_Question: what prior distribution reflects this minimization algorithm?_

IDEA: the optimization algorithm performs a ridge regression and then masks all elements below a certain threshold.
That means that the first iteration maximizes the posterior of $w$ under a Gaussian prior.
THEN it masks all elements below the threshold: what is the probability of this happening?
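The two alternating steps described above (ridge regression, then masking) can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions, not pysindy's actual code: the function name `stlsq_sketch`, its default parameter values, and the stopping rule are all hypothetical, and masking is assumed to act on the magnitude of each coefficient.

```python
import numpy as np

def stlsq_sketch(Theta, u_dot, threshold=0.1, alpha=0.05, max_iter=20):
    """Sequentially thresholded ridge regression (illustrative sketch only)."""
    n_terms = Theta.shape[1]
    active = np.ones(n_terms, dtype=bool)  # terms still allowed to be nonzero
    xi = np.zeros(n_terms)
    for _ in range(max_iter):
        if not active.any():
            break
        A = Theta[:, active]
        # Ridge step on the active terms (MAP under a Gaussian prior).
        xi_active = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ u_dot)
        xi = np.zeros(n_terms)
        xi[active] = xi_active
        # Masking step: drop coefficients whose magnitude is below the threshold.
        new_active = np.abs(xi) >= threshold
        if np.array_equal(new_active, active):
            break  # the support stopped changing
        active = new_active
    xi[~active] = 0.0
    return xi

# Toy usage with made-up data: two of the four true coefficients are zero.
rng = np.random.default_rng(3)
Theta = rng.normal(size=(200, 4))
xi_true = np.array([2.0, 0.0, -1.0, 0.0])
u_dot = Theta @ xi_true + rng.normal(scale=0.05, size=200)
xi_hat = stlsq_sketch(Theta, u_dot, threshold=0.1, alpha=0.05)
```

In this toy setting the small coefficients are masked to exactly zero after the first ridge step, and the remaining ones are refit on the reduced library.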

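We need to calculate the probability that the MAXIMUM of the posterior distribution of each component $\xi_i$ lies, in magnitude, below a certain threshold (denoted $\alpha$ here, not to be confused with the ridge weight $\alpha$ in the quoted objective):

$$
P\left( \left| \argmax_{\xi_i} P(\xi_i | \boldsymbol{\dot{u}},\boldsymbol{\Theta}) \right| < \alpha \right)
$$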
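One possible reading of this probability, offered only as an assumption, is frequentist: treat the noise as random and ask how often the ridge estimate of a coefficient falls below the threshold in magnitude. The sketch below estimates this by Monte Carlo and compares it with the closed form that follows from the ridge estimate being a linear function of Gaussian data. All numerical values, including `xi_true`, are made up, and in practice the true coefficients are unknown.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
sigma, alpha, threshold = 0.5, 0.05, 0.1  # made-up noise level, ridge weight, threshold
Theta = rng.normal(size=(30, 2))
xi_true = np.array([1.0, 0.05])           # second coefficient sits near the threshold

# Linear map from data to ridge estimate: xi_hat = H @ u_dot.
H = np.linalg.solve(Theta.T @ Theta + alpha * np.eye(2), Theta.T)

# Monte Carlo over noise realizations.
n_trials = 20000
noise = rng.normal(scale=sigma, size=(n_trials, 30))
u_dot = Theta @ xi_true + noise           # shape (n_trials, 30)
xi_hat = u_dot @ H.T                      # shape (n_trials, 2)
p_masked_mc = np.mean(np.abs(xi_hat) < threshold, axis=0)

# Closed form: xi_hat is Gaussian with mean H Theta xi_true
# and covariance sigma^2 H H^T.
mean = H @ Theta @ xi_true
std = sigma * np.sqrt(np.sum(H ** 2, axis=1))

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_masked_exact = np.array([
    normal_cdf((threshold - m) / s) - normal_cdf((-threshold - m) / s)
    for m, s in zip(mean, std)
])
```

The two estimates should agree up to Monte Carlo error, and the coefficient closer to zero has the larger probability of being masked.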

Every "best coefficient" from the previous iteration has a certain probability of being set to zero, namely the probability that it falls below the threshold:


_Note: the following expression is WRONG!_

$$
P(\xi_i)_{\text{next prior}} = 
\begin{cases}
0 & \text{if } |\xi_i| < \alpha \\
P(\xi_i)_{\text{previous posterior}} & \text{otherwise}
\end{cases}
$$

This probability distribution acts as the prior at each step of the masking algorithm, combined with the Gaussian prior implied by L2 regularization.