# Stochastic Gradient Descent Monte Carlo
***
## Summary

$\textrm{We have learnt about MCMC algorithms, which are used to sample from a given target density }\mathrm{\pi(\theta)}\textrm{. The issue is that MCMC is}$ 
$\textrm{computationally slow, so the paper looks at alternatives and improvements to MCMC.}$
***

### Langevin Diffusion
$\textrm{The Langevin Diffusion is governed by the following equation:}$
$$\mathrm{d\theta(t) = \frac{-1}{2}\nabla U(\theta(t))dt + d\beta_{t}}$$
$$\textrm{Where: }\mathrm{\nabla U(\theta(t)) \rightarrow}\textrm{ Drift Term, }\mathrm{\beta_{t} \rightarrow}\textrm{ d-Dimensional Brownian Motion }$$
$\textrm{If we could simulate this exactly, it would be easy to sample from the target distribution. But this is not possible, so we have to use the }$
$\textrm{approximation given as: }$
$$\mathrm{\theta(t+h) \approx \theta(t)-\frac{h}{2}\nabla U(\theta(t))+\sqrt{h}\textbf{Z}}$$
$$\textrm{Where: }\mathrm{\textbf{Z} \rightarrow }\textrm{ vector of d independent standard Gaussian random variables.}$$
$\textrm{Hence, for a fixed step size h, we can approximately sample from the Langevin Diffusion if we know how to calculate the drift term. The}$ 
$\textrm{smaller the h, the more accurate our samples will be. Based on this idea, we have the following algorithms:}$ 
- $\textrm{MALA: Metropolis Adjusted Langevin Algorithm}$ 
$\textrm{The approximated thetas are used as proposals in the Metropolis Hastings algorithm. The proposal is then accepted or rejected}$
$\textrm{using the standard MH probability.}$
- $\textrm{ULA: Unadjusted Langevin Algorithm}$
$\textrm{In ULA, there is no accept or reject step. The thetas are taken directly as they are. Hence the samples are biased: they follow an approximation of the target density.}$ 
$\textrm{It is more robust to bad initialization than MALA.}$
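
As a rough illustration of the discretized update above, the following sketch runs ULA on a one-dimensional standard Gaussian target. It assumes $U(\theta) = \theta^{2}/2$, so that $\nabla U(\theta) = \theta$ and $\pi \propto \exp(-U)$. The function name `ula`, the step size `h = 0.5` and the chain length are illustrative choices, not taken from the paper. For this particular target the long-run variance of the chain works out to $1/(1 - h/4)$ rather than 1, which makes the bias of ULA at a fixed h visible.

```python
import numpy as np

def ula(grad_U, theta0, h, n_steps, rng):
    """Unadjusted Langevin Algorithm: no accept/reject step (hypothetical helper)."""
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_steps,) + theta.shape)
    for t in range(n_steps):
        noise = rng.standard_normal(theta.shape)  # Z, standard Gaussian
        theta = theta - 0.5 * h * grad_U(theta) + np.sqrt(h) * noise
        samples[t] = theta
    return samples

rng = np.random.default_rng(0)
h = 0.5  # deliberately large so the bias is easy to see

# Assumed target: standard Gaussian, so grad U(theta) = theta.
samples = ula(lambda th: th, theta0=np.zeros(1), h=h, n_steps=100_000, rng=rng)

empirical_var = samples[1_000:].var()   # discard a short burn-in
stationary_var = 1.0 / (1.0 - h / 4.0)  # variance ULA settles at for this target
print(empirical_var, stationary_var)    # both are above the true variance of 1
```

Shrinking `h` moves the ULA variance back towards 1, at the price of a more slowly moving chain.
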
***


### Calculating the drift term
$\textrm{The issue of calculating }\mathrm{\nabla U(\theta(t))}\textrm{ still remains. Since }\mathrm{U(\theta) = \sum_{i=1}^{N}U_{i}(\theta)}\textrm{ is a sum over all N data points, calculating it directly is not computationally efficient, hence we will try to estimate it.}$ 
$\textrm{We have the following possible ways to estimate:}$
- $\textrm{Stochastic Gradient Langevin Dynamics (SGLD)}$
$$\mathrm{\hat{\nabla}U(\theta)^{n} = \frac{N}{n}\sum_{i \in \delta_{n}}\nabla U_{i}(\theta)}$$

$\textrm{Where }\mathrm{\delta_{n}}\textrm{ is a subsample of size n from }\mathrm{\{1,2,\ldots,N\}.}$  
$\textrm{We initialize a starting value }\mathrm{\theta_{0}}\textrm{ and step sizes }\mathrm{\{h_{1},h_{2},\ldots,h_{K}\}.}\textrm{ Then for each step k, we calculate }\mathrm{\theta_{k}}\textrm{ using }\mathrm{\theta_{k-1}}\textrm{ and }\mathrm{ \hat{\nabla}U(\theta_{k-1})^{n} }\textrm{ (using the above estimator).}$ 
$\textrm{The noise term }\mathrm{\xi}\textrm{ is drawn from }\mathrm{N(0,h_{k}I).}\textrm{ If }\mathrm{h_{k} \rightarrow 0}\textrm{ as }\mathrm{ k \rightarrow \infty}\textrm{, the discretization error shrinks and the samples approach those of the}$
$\textrm{Langevin diffusion. The advantage of SGLD is that it is computationally more efficient than simply implementing MALA or ULA, since each step uses only n of the N gradients.}$
- $\textrm{Using control variates: }$ 
$$\mathrm{\hat{\nabla}U(\theta)^{n} = \sum\limits_{i=1}^{N}u_{i}(\theta) + (\frac{N}{n})\sum\limits_{i \in \delta_{n}}({\nabla}U_{i}(\theta) - u_{i}(\theta))}$$

$\textrm{Here, the }\mathrm{u_{i}(\theta)}\textrm{'s are control variates that we can choose; they must be known for all values of }\mathrm{\theta}\textrm{. If }\mathrm{u_{i}(\theta) \approx {\nabla}U_{i}(\theta)}$
$\textrm{then the variance of the estimator is smaller. The question is how to choose our }\mathrm{ u_{i}}\textrm{'s. We can use Stochastic Gradient Descent to find the }$
$\textrm{mode }\mathrm{\hat{\theta}}\textrm{ and then use the constant function: }$
$$\mathrm{ u_{i}(\theta) = \nabla U_{i}(\hat{\theta})}$$
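$\textrm{The full sum }\mathrm{\sum_{i=1}^{N}\nabla U_{i}(\hat{\theta})}\textrm{ is then computed only once, which reduces both the computational cost and the variance.}$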
- $\textrm{Using preferential sampling:}$
$$\mathrm{\hat{\nabla}U(\theta)^{n} = \sum_{i \in \delta_{n}}\frac{\nabla U_{i}(\theta)}{w_{i}}}$$

$\textrm{Where }\mathrm{w_{i}}\textrm{ is the expected number of times data point i appears in the subsample.}$
- $\textrm{Stratified sampling}$






$\textrm{Another question is how large our subsample }\mathrm{\delta_{n}}\textrm{ should be. The answer is that it should be large enough that the variance of }$
$\textrm{the estimated drift term is less than that of the noise term.}$
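
To make the difference between the plain subsampling estimator and the control-variate estimator more concrete, here is a small sketch on a hypothetical one-parameter logistic regression, where $U_{i}(\theta) = -\log p(y_{i} \mid x_{i}, \theta)$. The data, the sizes `N` and `n`, and all function names are assumptions made for illustration. The reference point `theta_hat` is found with plain full-gradient descent as a simple stand-in for the SGD step mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 5_000, 50                       # data set size and subsample size (assumed)
x = rng.standard_normal(N)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-1.5 * x))).astype(float)
all_idx = np.arange(N)

def grad_U_i(theta, idx):
    """Per-datum gradients of U_i(theta) for the data points in idx."""
    p = 1.0 / (1.0 + np.exp(-theta * x[idx]))
    return x[idx] * (p - y[idx])

# Reference point near the mode (full-gradient descent, stand-in for SGD).
theta_hat = 0.0
for _ in range(200):
    theta_hat -= (2.0 / N) * grad_U_i(theta_hat, all_idx).sum()

grad_at_hat = grad_U_i(theta_hat, all_idx)  # u_i = grad U_i(theta_hat), stored once
sum_grad_at_hat = grad_at_hat.sum()

def simple_estimate(theta, idx):
    """(N/n) * sum of subsampled gradients."""
    return (N / n) * grad_U_i(theta, idx).sum()

def cv_estimate(theta, idx):
    """Control-variate estimator with constant u_i = grad U_i(theta_hat)."""
    diff = grad_U_i(theta, idx) - grad_at_hat[idx]
    return sum_grad_at_hat + (N / n) * diff.sum()

theta = theta_hat + 0.05                    # a point close to the mode
full = grad_U_i(theta, all_idx).sum()       # exact gradient, for comparison only
draws = [rng.choice(N, size=n, replace=False) for _ in range(2_000)]
simple = np.array([simple_estimate(theta, d) for d in draws])
cv = np.array([cv_estimate(theta, d) for d in draws])
print(full, simple.mean(), cv.mean())       # both estimators centre on the full gradient
print(simple.var(), cv.var())               # the control-variate variance is far smaller here
```

The variance gain depends on `theta` staying close to `theta_hat`; far from the mode the two estimators behave more alike.
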

### Measuring the accuracy and efficiency of SGLD
$\textrm{Take a test function }\mathrm{\phi(\theta)}\textrm{, calculate }\mathrm{E_{\pi}(\phi(\theta))}\textrm{ and compare it to the sample average }\mathrm{\frac{1}{K}\sum\limits_{k=1}^{K}\phi(\theta_{k}).}\textrm{ This can be used as a measure of the accuracy of SGLD. Using these tests, studies have shown that we should use exact }$ 
$\textrm{MCMC if given a large enough computational resource. Previous research has also shown that using control variates to estimate the drift }$ 
$\textrm{term results in lower computational cost than other methods.}$
$\textrm{However, SGLD has the drawback that it results in the correct mode but inflated variance for large N.}$
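
The comparison described above can be sketched on a toy model where the posterior is known exactly. The example assumes data $x_{i} \sim N(\theta, 1)$ with a flat prior, so that $U(\theta) = \sum_{i}(\theta - x_{i})^{2}/2$ and the posterior is $N(\bar{x}, 1/N)$. The constants `N`, `n`, `h`, `K` and the fixed step size are illustrative assumptions. With the test function $\phi(\theta) = \theta$ the SGLD average lands near the true posterior mean, while the spread of the chain is noticeably wider than the true posterior variance.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, h, K = 1_000, 10, 1e-4, 20_000   # assumed sizes, step size and chain length
x = 1.0 + rng.standard_normal(N)       # synthetic data with true mean 1
post_mean, post_var = x.mean(), 1.0 / N  # exact posterior under a flat prior

theta = 0.0
chain = np.empty(K)
for k in range(K):
    idx = rng.choice(N, size=n, replace=False)     # subsample of size n
    grad_est = (N / n) * np.sum(theta - x[idx])    # estimate of grad U(theta)
    theta = theta - 0.5 * h * grad_est + np.sqrt(h) * rng.standard_normal()
    chain[k] = theta

kept = chain[1_000:]                   # drop burn-in
mean_estimate = kept.mean()            # sample average of phi(theta) = theta
var_estimate = kept.var()
print(mean_estimate, post_mean)        # close to each other
print(var_estimate, post_var)          # SGLD variance is inflated
```

In this setup the inflation comes mostly from the noise of the subsampled gradient, so increasing `n` or decreasing `h` narrows the gap.
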


## General Stochastic Gradient MCMC
$\textrm{The paper generalizes the above and shows that other diffusion processes that have the target as their stationary distribution can be used instead of the Langevin Diffusion. This results in the }$
$\textrm{generalized SGMCMC. }$
$\textrm{For the generalized case, we just replace }\mathrm{\theta}\textrm{ by the state term }\mathrm{\zeta}\textrm{. The state term encompasses }\mathrm{\theta}\textrm{ and any auxiliary variables. As above, we can }$
$\textrm{write the exact equation as follows: }$
$$\mathrm{d\zeta = \frac{1}{2}\textbf{b}(\zeta)dt + \sqrt{D(\zeta)}d\beta_{t}}$$
$$\textrm{Where }\textbf{b}\mathrm{(\zeta) \rightarrow }\textrm{Drift term }$$
$$\mathrm{D(\zeta) \rightarrow}\textrm{ Diffusion Matrix}$$
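$\textrm{Assuming that our state has no auxiliary variables, we can generate a stationary distribution proportional to our target density }\mathrm{\pi}\textrm{, by setting }$
$\textrm{the drift term as follows: }$
$$\textbf{b}\mathrm{(\zeta) = -[}\mathbf{D(\zeta)+Q(\zeta)]\nabla U(\zeta) + \Gamma(\zeta)}$$
$$\textrm{Where }\mathbf{Q(\zeta)} \rightarrow \textrm{ Skew-symmetric matrix, }\mathbf{\Gamma(\zeta)} \rightarrow \textrm{ Correction term built from the derivatives of }\mathbf{D(\zeta)+Q(\zeta)}$$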
$\textrm{We can use the same approximation as we used in the Langevin Diffusion to get the following time-discretized version of the above equation: }$
$$\mathbf{\zeta_{t+h} \approx \zeta_{t} - \frac{h}{2}[(D(\zeta_{t})+Q(\zeta_{t}))\nabla U(\zeta_{t}) - \Gamma(\zeta_{t})] + \sqrt{h}Z}$$
$\textrm{We can prevent an inflated variance by using }\mathbf{Z }\mathrm{ \sim N(0,}\mathbf{D(\zeta)}\mathrm{-h}\mathbf{B(\zeta)}\textrm{), where }\mathbf{B(\zeta)}\textrm{ is an estimate of the variance of the stochastic gradient noise. This can be done if h is small enough. }$
$\textrm{The diffusion }\mathbf{D(\zeta)}\textrm{ controls the level of noise. So a larger diffusion term means that you can escape local maxima.}$
$\textrm{Different choices of }\mathbf{H(\zeta), Q(\zeta), D(\zeta) }\textrm{ (where H generalizes U to states with auxiliary variables) result in different SGMCMC algorithms such as SGLD, SGRLD, SGHMC, SGRHMC and SGNHT.}$
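
As a quick consistency check, the Langevin case can be recovered from the general form. The choice below is an assumption made for illustration: take $\zeta = \theta$ with no auxiliary variables, $\mathbf{D(\zeta)} = \mathbf{I}$ and $\mathbf{Q(\zeta)} = \mathbf{0}$. Since $\mathbf{\Gamma(\zeta)}$ is built from derivatives of $\mathbf{D}$ and $\mathbf{Q}$, it vanishes when both are constant, so:

$$\mathrm{\textbf{b}(\theta) = -[\textbf{I} + \textbf{0}]\nabla U(\theta) + \textbf{0} = -\nabla U(\theta)}$$
$$\mathrm{d\theta = -\frac{1}{2}\nabla U(\theta)dt + d\beta_{t}}$$
$$\mathrm{\theta_{t+h} \approx \theta_{t} - \frac{h}{2}\nabla U(\theta_{t}) + \sqrt{h}\textbf{Z}, \quad \textbf{Z} \sim N(0, \textbf{I})}$$

This matches the Langevin Diffusion and its discretization from the first section. Replacing $\nabla U$ with a subsampled estimate then gives SGLD.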
***