# Implementation of Stochastic Gradient Hamiltonian Monte Carlo with Friction

## Abstract

This report implements stochastic gradient Hamiltonian Monte Carlo (SGHMC), the algorithm proposed by Tianqi Chen, Emily B. Fox and Carlos Guestrin. SGHMC computes gradients on minibatches rather than on the full data set, which reduces the cost of each update. We first describe the algorithm in detail, then implement and optimize it in code, and finally apply it to simulated and real data. Different versions of HMC are compared in terms of accuracy and running time.

## Background

Hamiltonian Monte Carlo (HMC) is a widely used MCMC sampling algorithm. It mimics a physical energy system: the potential energy is defined by the negative log of the target distribution, and the kinetic energy is defined by auxiliary 'momentum' variables.
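For reference, the standard HMC construction can be written as follows. The symbols $H$, $U$ and $\pi$ are introduced here for illustration; $r$ denotes the momentum and $M$ the mass matrix, and $U(\theta) = -\log p(\theta|x)$ up to an additive constant.

$$
H(\theta, r) = U(\theta) + \frac{1}{2} r^T M^{-1} r, \qquad \pi(\theta, r) \propto \exp\left(-H(\theta, r)\right)
$$

$$
d\theta = M^{-1} r \, dt, \qquad dr = -\nabla U(\theta) \, dt
$$

Samples of $\theta$ are obtained by simulating these dynamics and discarding $r$.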

Although HMC explores the state space quickly, it has one limitation: each update requires the gradient of the potential energy function, which is computed over the whole data set. Modern data sets can contain millions or even billions of observations, and this huge computational cost motivates using minibatches instead. However, a naive algorithm that simply replaces the full-data gradient with a minibatch gradient is not consistent, because it no longer leaves the target distribution invariant. An MH (Metropolis-Hastings) correction is needed to restore consistency, but the correction itself requires a considerable amount of computation.

In their paper, Tianqi Chen, Emily B. Fox and Carlos Guestrin propose a new algorithm that adds a friction term to the update of the 'momentum' variables. The friction counteracts the noise introduced by the minibatch gradient.

## Description of Algorithm

Suppose we have data $x_i \sim p(x|\theta), ~i=1,...,n$. We want to estimate $\theta$ by sampling from the posterior distribution $p(\theta|x)$.
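To make the minibatch idea concrete: writing the potential energy as $U(\theta) = -\sum_{i=1}^{n} \log p(x_i|\theta) - \log p(\theta)$, a minibatch estimate of its gradient rescales the sum over a random subset $\tilde{D}$ by $n/|\tilde{D}|$. This is the stochastic gradient $\nabla \tilde{U}$ that the algorithm relies on. The sketch below illustrates it for a toy model that is an assumption of this example, not taken from the report: $x_i \sim N(\theta, 1)$ with a $N(0, \sigma_0^2)$ prior. The function names and the `prior_var` parameter are hypothetical.

```python
import numpy as np

def grad_U_full(theta, x, prior_var=10.0):
    """Full-data gradient of U(theta) for the assumed toy model
    x_i ~ N(theta, 1) with prior theta ~ N(0, prior_var)."""
    return -np.sum(x - theta) + theta / prior_var

def grad_U_minibatch(theta, x, batch_size, rng, prior_var=10.0):
    """Minibatch estimate of the same gradient: the likelihood term is
    computed on a random subset and rescaled by n / batch_size."""
    n = x.shape[0]
    idx = rng.choice(n, size=batch_size, replace=False)
    return -(n / batch_size) * np.sum(x[idx] - theta) + theta / prior_var

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)
theta = 0.5

full = grad_U_full(theta, x)
estimates = np.array([grad_U_minibatch(theta, x, 100, rng) for _ in range(2000)])
print("full gradient:          ", full)
print("mean minibatch estimate:", estimates.mean())
print("std of one estimate:    ", estimates.std())
```

Averaged over many draws, the minibatch estimate should agree with the full gradient, while a single estimate is noisy. That noise is what the friction term in SGHMC is meant to compensate for.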

The SGHMC procedure is as follows:

Set the initial value $\theta^{(0)}$ and the number of epochs $T$.

For $t = 0, \ldots, T-1$:
1. Sample $r^{(t)} \sim N(0,M)$, where $r$ is the momentum variable and $M$ is the mass matrix.
2. Set $(\theta_0, r_0) = (\theta^{(t)}, r^{(t)})$.
3. For $i = 1, \ldots, m$, where $m$ is the number of minibatches:
    - $\theta_i=\theta_{i-1} + \epsilon_t M^{-1} r_{i-1}$
    - $r_i = r_{i-1} - \epsilon_t \nabla \tilde{U}(\theta_i) - \epsilon_t C M^{-1} r_{i-1} + N(0,2(C-\hat{B})\epsilon_t)$, where $\epsilon_t$ is the step size, $\nabla \tilde{U}$ is the minibatch estimate of the gradient of the potential energy, $C$ is a user-specified friction term, and $\hat{B} = \frac{1}{2}\epsilon_t V_t$, with $V_t$ the estimated Fisher information.
4. Set $(\theta^{(t+1)}, r^{(t+1)}) = (\theta_m, r_m)$.
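
The steps above can be sketched in NumPy as follows. This is a minimal illustration under simplifying assumptions, not the optimized implementation of this report: it takes $M = I$, a constant step size, and scalar $C$ and $\hat{B}$. The function `sghmc`, its signature, and the toy target (a standard normal with artificially noisy gradient of known variance `V`) are all hypothetical.

```python
import numpy as np

def sghmc(grad_U_tilde, theta0, n_epochs, n_steps, eps, C, B_hat=0.0, rng=None):
    """Sketch of SGHMC with friction, assuming M = I and scalar C, B_hat.

    grad_U_tilde(theta, rng) must return a (noisy) gradient of U at theta.
    Returns an array with one sample of theta per epoch.
    """
    if C < B_hat:
        raise ValueError("C must be at least B_hat so the noise variance is non-negative")
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float)).copy()
    samples = np.empty((n_epochs, theta.size))
    noise_sd = np.sqrt(2.0 * (C - B_hat) * eps)  # sd of N(0, 2(C - B_hat) eps)
    for t in range(n_epochs):
        r = rng.standard_normal(theta.size)  # step 1: r ~ N(0, M) with M = I
        for _ in range(n_steps):             # step 3: inner updates
            theta = theta + eps * r          # position update uses r_{i-1}
            r = (r
                 - eps * grad_U_tilde(theta, rng)   # gradient at the new theta_i
                 - eps * C * r                      # friction on r_{i-1}
                 + noise_sd * rng.standard_normal(theta.size))
        samples[t] = theta                   # step 4: keep the end of the trajectory
    return samples

# Toy target (assumption): U(theta) = theta^2 / 2, i.e. a standard normal,
# with gradient noise of known variance V standing in for minibatch noise.
V = 1.0

def noisy_grad(theta, rng):
    return theta + np.sqrt(V) * rng.standard_normal(theta.shape)

eps = 0.1
samples = sghmc(noisy_grad, theta0=0.0, n_epochs=3000, n_steps=20,
                eps=eps, C=1.0, B_hat=0.5 * eps * V,
                rng=np.random.default_rng(1))
print("sample mean:", samples.mean(), "sample variance:", samples.var())
```

Because `V` is known in this toy setting, `B_hat` can be set to exactly $\frac{1}{2}\epsilon V$; with real minibatches it would have to be estimated, or set to zero as a simple default. The sample mean and variance should then be close to 0 and 1, up to discretization and Monte Carlo error.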

## Algorithm Optimization

## Application to Simulated Data

## Application to Real Data

## Competing Algorithm

## Conclusion

## Code