# Bayesian Methods 

## Why is Bayesian inference controversial?

1. Bayesians treat the parameters as random, so that a credible interval is a valid probability statement about them. This interpretation is appealing, but the credible interval for the parameters of interest depends not only on the likelihood but also on the prior, which is usually hard to specify.

2. When the likelihood and prior are complicated, inference has to rely on MCMC sampling, which can be very slow in many real-world cases.

3. The biggest controversy about Bayesian inference is that you must quantify your prior knowledge about the question at hand. This makes it possible to actually influence your results, either accidentally or on purpose.

This is a genuine concern, and any Bayesian analysis worth its salt must demonstrate that the chosen priors aren't unduly influencing the final results.
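One common way to probe this concern is a prior sensitivity check: refit the same data under several priors and compare the posteriors. The sketch below is only an illustration, assuming a binomial model with conjugate Beta priors; the data (60 successes in 100 trials) and the three priors are hypothetical choices, not taken from any source.

```python
def beta_binomial_posterior_mean(successes, trials, a, b):
    """Posterior mean of theta under a Beta(a, b) prior and binomial data."""
    # Conjugacy: the posterior is Beta(a + successes, b + trials - successes),
    # whose mean is (a + successes) / (a + b + trials).
    return (a + successes) / (a + b + trials)


# Hypothetical data, chosen only for illustration.
successes, trials = 60, 100

# Three illustrative priors: flat, Jeffreys, and a sceptical prior centred on 0.5.
priors = {
    "flat Beta(1, 1)": (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "sceptical Beta(10, 10)": (10.0, 10.0),
}

posterior_means = {
    name: beta_binomial_posterior_mean(successes, trials, a, b)
    for name, (a, b) in priors.items()
}

for name, mean in posterior_means.items():
    print(f"{name}: posterior mean = {mean:.3f}")

# A small spread suggests the data dominate the prior for this summary.
spread = max(posterior_means.values()) - min(posterior_means.values())
print(f"spread across priors = {spread:.3f}")
```

With less data the spread would grow, which is exactly the situation in which the choice of prior deserves the most scrutiny.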






# [Bayesian Inference](http://pages.stat.wisc.edu/~larget/stat302/bayes.pdf)
  

###  Prior and Posterior Distributions, Likelihood

Before seeing data, the prior distribution of an unknown parameter $\theta$ is described by a
probability density (or mass function, if discrete) $f(\theta)$. The Bayesian approach connects
data and parameter through the likelihood function, $f(x \mid \theta)$. Viewed as a function of $\theta$, with $x$ fixed at the observed data, $f(x \mid \theta)$
is called the likelihood. Parameter values where the likelihood is high are those under which
the observed data have high probability. In the maximum likelihood approach to
statistics, the best estimate is the value $\hat{\theta}$ that maximizes the likelihood (equivalently, the log-likelihood)
function.
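As a small illustration of the likelihood as a function of $\theta$ with the data held fixed, the sketch below evaluates a binomial log-likelihood on a grid and picks the maximizer. The binomial model, the data (7 successes in 10 trials), and the grid resolution are assumptions made for the example.

```python
import math


def binomial_log_likelihood(theta, successes, trials):
    """log f(x | theta) for binomial data x = (successes, trials)."""
    return (
        math.log(math.comb(trials, successes))
        + successes * math.log(theta)
        + (trials - successes) * math.log(1.0 - theta)
    )


# Hypothetical observed data: the data stay fixed, theta varies.
successes, trials = 7, 10

# Grid over the open interval (0, 1), avoiding log(0) at the endpoints.
grid = [i / 1000 for i in range(1, 1000)]

# Grid-based maximum likelihood estimate.
theta_hat = max(grid, key=lambda t: binomial_log_likelihood(t, successes, trials))

print(f"grid MLE = {theta_hat:.3f}, closed form = {successes / trials:.3f}")
```

For the binomial model the maximizer has the closed form `successes / trials`, so the grid search is only there to make the "function of $\theta$" view concrete.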

All Bayesian inference is based on evaluation of the posterior distribution:
$$f(\theta \mid x) = \dfrac{f(x \mid \theta)f(\theta)}{f(x)}$$
where $f(x) = \int f(x \mid \theta)f(\theta) \, d\theta$ is the marginal likelihood, that is, the probability (or density)
of observing the data $x$ averaged over the entire parameter space with respect to the prior.
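The posterior formula can be checked numerically with a grid approximation: multiply likelihood by prior, integrate to get $f(x)$, and divide. The sketch below assumes a binomial likelihood with a Beta(2, 2) prior and hypothetical data (7 successes in 10 trials); these choices are illustrative and were picked because the conjugate answer is known in closed form.

```python
import math


def prior_density(theta):
    """Beta(2, 2) prior density, f(theta) = 6 * theta * (1 - theta)."""
    return 6.0 * theta * (1.0 - theta)


def likelihood(theta, successes, trials):
    """Binomial likelihood f(x | theta)."""
    return (
        math.comb(trials, successes)
        * theta**successes
        * (1.0 - theta) ** (trials - successes)
    )


successes, trials = 7, 10  # hypothetical data

# Midpoint grid on [0, 1] for numerical integration.
n_grid = 10_000
width = 1.0 / n_grid
grid = [(i + 0.5) * width for i in range(n_grid)]

# Numerator of Bayes' rule at each grid point.
unnormalised = [likelihood(t, successes, trials) * prior_density(t) for t in grid]

# Marginal likelihood f(x): integral of likelihood times prior over theta.
marginal = sum(unnormalised) * width

# Posterior density on the grid and its mean.
posterior = [u / marginal for u in unnormalised]
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * width

# Conjugate check: the posterior is Beta(2 + 7, 2 + 3), with mean 9 / 14.
print(f"f(x) = {marginal:.6f}")
print(f"grid posterior mean = {posterior_mean:.6f}, exact = {9 / 14:.6f}")
```

Grid integration like this only scales to one or two parameters, which is one reason higher-dimensional problems fall back on sampling methods such as MCMC.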


### [Bayesian Autoregressive Time Series Models](https://www.michaelchughes.com/blog/probability-basics/autoregressive-time-series-models/)

#### About stability

In general, it is worth noting that for most possible choices of the $A$ matrix, the resulting dynamics will not be “stable” in the sense intended here: the mean of the time series $y_1, \ldots, y_T$ will either decay to the zero vector or diverge to infinity, producing rather “uninteresting” dynamics. This can happen quite rapidly, sometimes within 10 or 20 timesteps.

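To explain this phenomenon, we need only see that, for a first-order autoregression $y_t = A y_{t-1} + \epsilon_t$ with zero-mean noise, the mean of the observation at timestep $t$ is a deterministic function of the initial point $y_1$ and repeated matrix multiplication by $A$. Powers of $A$ shrink toward zero when every eigenvalue of $A$ has modulus below 1, and grow without bound for generic starting points when some eigenvalue has modulus above 1.

$$\mu_t = \mathbb{E}[ y_t ] = A^{t-1} y_1 $$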


Most regression weights $A$ will therefore yield rather uninteresting dynamics over long time horizons. This motivates extending the time series model to overcome this tendency, for example with switching VAR processes, in which after a short time under some $(A,\Sigma)$ the system switches to other parameters $(A', \Sigma')$, so that different parameter sets can explain different regimes of time series behavior.
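The decay-or-divergence behavior can be seen by iterating the noise-free mean recursion. The following is a minimal sketch, assuming a two-dimensional first-order model; the two matrices `A_decay` and `A_grow` and the starting point are hypothetical values picked so that their largest eigenvalue moduli are 0.6 and 1.2.

```python
import numpy as np


def mean_trajectory_norms(A, y1, n_steps):
    """Norms of the noise-free iterates y1, A y1, A^2 y1, ..."""
    norms = []
    mean = np.asarray(y1, dtype=float)
    for _ in range(n_steps):
        norms.append(float(np.linalg.norm(mean)))
        mean = A @ mean  # one more multiplication by A
    return norms


def spectral_radius(A):
    """Largest eigenvalue modulus of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))


y1 = np.array([1.0, 1.0])  # hypothetical initial point

# Illustrative matrices: eigenvalues (0.5, 0.6) and (1.2, 1.1).
A_decay = np.array([[0.5, 0.1], [0.0, 0.6]])
A_grow = np.array([[1.2, 0.1], [0.0, 1.1]])

decay_norms = mean_trajectory_norms(A_decay, y1, n_steps=20)
grow_norms = mean_trajectory_norms(A_grow, y1, n_steps=20)

print(f"spectral radius {spectral_radius(A_decay):.2f}: "
      f"norm after 20 steps = {decay_norms[-1]:.2e}")
print(f"spectral radius {spectral_radius(A_grow):.2f}: "
      f"norm after 20 steps = {grow_norms[-1]:.2e}")
```

Only matrices whose largest eigenvalue modulus sits very close to 1 keep the mean at a comparable scale over long horizons, which is why a randomly chosen $A$ rarely does.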

#### About identifiability

It is important to note that under the Matrix Normal distribution, the Kronecker product $\Psi \otimes \Sigma$ is identifiable, but the individual parameters $\Psi, \Sigma$ are only identifiable up to a positive scalar constant $c$. This occurs because for any $\Psi,\Sigma$, we can use $\Psi/c, c\Sigma$ instead and obtain the same probability density for the matrix $A$.

Thus, it can be difficult to parameterize the priors appropriately, since $\Psi_0$ and $S_0$ can be scaled arbitrarily via the constant $c$ and yield similar results.  
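The scale ambiguity is easy to verify numerically. The sketch below evaluates a zero-mean matrix normal log-density through its Kronecker covariance and compares $(\Psi, \Sigma)$ with $(\Psi/c, c\Sigma)$. It assumes that $\Psi$ plays the role of the row covariance and $\Sigma$ the column covariance, which may not match the convention of the linked post; the matrices, the value of $c$, and the helper name `matrix_normal_logpdf` are all hypothetical.

```python
import numpy as np


def matrix_normal_logpdf(A, Psi, Sigma):
    """Zero-mean matrix normal log-density via the Kronecker covariance.

    Assumption: Psi is the row covariance and Sigma the column covariance,
    so the row-major flattening of A has covariance kron(Psi, Sigma).
    """
    K = np.kron(Psi, Sigma)
    v = A.ravel()  # row-major flattening
    _, logdet = np.linalg.slogdet(K)
    quad = v @ np.linalg.solve(K, v)
    return -0.5 * (v.size * np.log(2.0 * np.pi) + logdet + quad)


# Hypothetical positive-definite covariances and an arbitrary 2 x 3 matrix A.
Psi = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma = np.array([[1.0, 0.2, 0.0], [0.2, 1.5, 0.1], [0.0, 0.1, 0.8]])
A = np.array([[0.5, -1.0, 0.3], [1.2, 0.4, -0.7]])
c = 3.0

logp_original = matrix_normal_logpdf(A, Psi, Sigma)
logp_rescaled = matrix_normal_logpdf(A, Psi / c, c * Sigma)

# The Kronecker product, and hence the density, is unchanged by the rescaling.
print(np.allclose(np.kron(Psi, Sigma), np.kron(Psi / c, c * Sigma)))
print(f"log-density difference = {abs(logp_original - logp_rescaled):.2e}")
```

One common remedy, mentioned here only as a possibility, is to fix the scale of one factor (for example a single diagonal entry or the trace) so that the remaining parameters are pinned down.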
