# Assignment 5

## Question 2

### Section 2

## SGLD

This section deals with Stochastic Gradient Langevin Dynamics (SGLD). Langevin diffusion works by injecting random fluctuations into a system. How this is used becomes apparent once we consider its stochastic differential equation, which contains a drift term and a Brownian term.

We start by introducing the potential $U(\theta)$. The posterior density can be written in terms of the potential as follows.
$$
\pi(\theta) \propto \exp(-U(\theta))
$$
Taking $\pi(\theta)$ as the posterior and $y_1, \ldots, y_N$ as the observed data, we can write $U(\theta)$ as
$$
U(\theta) = \sum_{i = 1}^N U_i(\theta)
$$
where
$$
U_i(\theta) = -\log f(y_i|\theta) - \frac{1}{N} \log p(\theta)
$$

where $f$ is the likelihood function and $p(\theta)$ is the prior. Summing over $i$ gives $U(\theta) = -\sum_{i=1}^N \log f(y_i|\theta) - \log p(\theta)$, the negative log posterior up to an additive constant. The Langevin diffusion is defined by $d\theta(t) = -\frac{1}{2} \nabla U(\theta(t))dt + dB_t$.
Here, the first term is the drift, while the second term is the Brownian term, which provides randomness.
To draw new samples of the parameters with the Langevin algorithm, we discretise this equation with step size $h$ and iterate
$\theta_k = \theta_{k-1} - \frac{h}{2} \nabla U(\theta_{k-1}) + \xi_k$, where $\xi_k \sim N(0, hI)$.
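The discretised Langevin update can be sketched in Julia as follows. This is a minimal illustration, not part of the assignment code: it assumes a one-dimensional standard normal target, so that $U(\theta) = \theta^2/2$ and $\nabla U(\theta) = \theta$, and the names `grad_U` and `langevin_chain` are hypothetical. The drift step uses $-\frac{h}{2}\nabla U$, matching the sign of the drift in the diffusion.

```julia
using Random, Statistics

# Assumed target: standard normal, so U(θ) = θ^2 / 2 and ∇U(θ) = θ.
grad_U(θ) = θ

# Run K steps of the discretised Langevin update with a fixed step size h.
function langevin_chain(θ0::Float64, h::Float64, K::Int; rng=Random.default_rng())
    θ = θ0
    chain = Vector{Float64}(undef, K)
    for k in 1:K
        ξ = sqrt(h) * randn(rng)          # ξ ~ N(0, h)
        θ = θ - (h / 2) * grad_U(θ) + ξ   # drift step plus Brownian increment
        chain[k] = θ
    end
    return chain
end

chain = langevin_chain(3.0, 0.1, 20_000; rng=MersenneTwister(1))
```

With a small step size, the sample mean and variance of `chain` should be close to 0 and 1, the moments of the assumed target.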

However, computing $\nabla U(\theta)$ is computationally intensive because we have to sum over every element of the dataset, which is especially challenging when the dataset is extremely large. To overcome this, SGLD uses only a subset of $n$ distinct elements from the dataset each time the gradient is calculated, where $n \ll N$. The algorithm for SGLD is as follows.

Input: $\theta_0, \{h_0, \ldots, h_K\}$.

for $k \in 1, \ldots, K$ do:

   1. Draw $S_n \subset \{1, \ldots, N\}$ without replacement

   2. Estimate $\hat{\nabla} U(\theta_k)^{(n)}$

   3. Draw $\xi_k \sim N(0, h_k I)$

   4. Update $\theta_{k+1} \leftarrow \theta_k - \frac{h_k}{2} \hat{\nabla} U(\theta_k)^{(n)} + \xi_k$

end

Here, $\hat{\nabla} U(\theta)^{(n)} = \frac{N}{n} \sum_{i \in S_n} \nabla U_i(\theta)$ is an unbiased estimate of $\nabla U(\theta)$.
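The SGLD loop can be sketched in Julia on a toy problem. This sketch makes its own assumptions: the model is $y_i \sim N(\theta, 1)$ with prior $\theta \sim N(0, 10^2)$, the step size is constant, and the names `grad_Ui` and `sgld` are hypothetical. For simplicity the subsample is taken with `randperm`, which itself costs $O(N)$, so a real implementation would draw the indices more cheaply.

```julia
using Random, Statistics

# Assumed model: y_i ~ N(θ, 1), prior θ ~ N(0, 10^2).
# ∇U_i(θ) = (θ - y_i) + θ / (100 N), from U_i = -log f(y_i|θ) - (1/N) log p(θ).
grad_Ui(θ, yi, N) = (θ - yi) + θ / (100 * N)

function sgld(y::Vector{Float64}, θ0::Float64, h::Vector{Float64}, n::Int;
              rng=Random.default_rng())
    N = length(y)
    θ = θ0
    chain = Vector{Float64}(undef, length(h))
    for k in eachindex(h)
        S = randperm(rng, N)[1:n]                             # 1. subsample without replacement
        ghat = (N / n) * sum(grad_Ui(θ, y[i], N) for i in S)  # 2. subsampled gradient estimate
        ξ = sqrt(h[k]) * randn(rng)                           # 3. ξ_k ~ N(0, h_k)
        θ = θ - (h[k] / 2) * ghat + ξ                         # 4. update
        chain[k] = θ
    end
    return chain
end

rng = MersenneTwister(42)
y = 2.0 .+ randn(rng, 1_000)                  # synthetic data with true mean 2
chain = sgld(y, 0.0, fill(1e-4, 5_000), 50; rng=rng)
```

After discarding the first part of `chain` as burn-in, the remaining samples should concentrate around the sample mean of `y`, which is close to the posterior mean for this assumed model.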

In SGLD, computing the full gradient is not efficient, so we replace the true gradient with an estimate. When doing so, it is important to choose an estimator with low variance. We use control variates for exactly this purpose: for a set of functions $u_i$ whose gradients are cheap to sum, we estimate
$$
\sum_{i=1}^{N} \nabla U_i(\theta) \approx \sum_{i=1}^{N} \nabla u_i(\theta) + \frac{N}{n} \sum_{i\in S_n} \left(\nabla U_i(\theta) - \nabla u_i(\theta)\right)
$$
A common choice is $\nabla u_i(\theta) = \nabla U_i(\hat{\theta})$ for a fixed reference point $\hat{\theta}$, so that the first sum is computed only once. Control variates improve the cost per iteration from $O(N)$ to $O(1)$ in $N$, assuming we already know $\hat{\theta}$.
However, the variance of this estimator grows when $\hat{\theta}$ is far from $\theta$. There are two approaches to reduce the variance. One is to use the control variate only if $\hat{\theta}$ is sufficiently close to $\theta$. The other is to use preferential sampling.
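The effect of a control variate can be illustrated with a short Julia sketch. It assumes the choice $\nabla u_i(\theta) = \nabla U_i(\hat{\theta})$ and a toy Gaussian model ($y_i \sim N(\theta, 1)$, prior $\theta \sim N(0, 10^2)$). The function names are hypothetical, and $\hat{\theta}$ is simply set to the sample mean as a stand-in for a point near the posterior mode.

```julia
using Random, Statistics

grad_Ui(θ, yi, N) = (θ - yi) + θ / (100 * N)   # assumed toy Gaussian model

# Plain subsampling estimate of Σ_i ∇U_i(θ).
plain_estimate(θ, y, S) =
    (length(y) / length(S)) * sum(grad_Ui(θ, y[i], length(y)) for i in S)

# Control-variate estimate; full_at_hat = Σ_i ∇U_i(θhat) is precomputed once.
function cv_estimate(θ, θhat, full_at_hat, y, S)
    N = length(y)
    correction = sum(grad_Ui(θ, y[i], N) - grad_Ui(θhat, y[i], N) for i in S)
    return full_at_hat + (N / length(S)) * correction
end

rng = MersenneTwister(7)
y = 2.0 .+ randn(rng, 1_000)
N = length(y)
θhat = mean(y)                                       # reference point (assumption)
full_at_hat = sum(grad_Ui(θhat, yi, N) for yi in y)  # one O(N) pass, done once
θ = θhat + 0.05
exact = sum(grad_Ui(θ, yi, N) for yi in y)
plain = [plain_estimate(θ, y, randperm(rng, N)[1:20]) for _ in 1:200]
cv = [cv_estimate(θ, θhat, full_at_hat, y, randperm(rng, N)[1:20]) for _ in 1:200]
```

In this particular model the gradient is linear in $\theta$, so the correction term is the same for every $i$ and the control-variate estimate matches `exact` up to rounding. For a general model it would only be expected to have lower variance than `plain` when $\theta$ is near $\hat{\theta}$.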

### Section 3

In the previous section, we saw how SGLD is used as a stochastic gradient MCMC (SGMCMC) algorithm. In this section, we describe a general framework for SGMCMC algorithms, which includes SGLD among others.

We first define $\zeta$, the state of the sampler, which contains $\theta$ and may also contain a velocity component, $\rho$. The general stochastic differential equation for $\zeta$ is
$$
d\zeta = \frac{1}{2}b(\zeta)dt + \sqrt{D(\zeta)}dB_t
$$

Here, $b(\zeta)$ is the drift component, and $D(\zeta)$ plays a role similar to the Gaussian noise in SGLD. To represent $b$, we introduce two new terms, the function $H(\zeta)$ and the matrix $Q(\zeta)$, the latter of which is skew-symmetric.
We write $b(\zeta)$ as
$$
b(\zeta) = -[D(\zeta) + Q(\zeta)]\nabla H(\zeta) + \Gamma(\zeta)
$$
and
$$
\Gamma_i(\zeta) = \sum_{j=1}^d \frac{\partial}{\partial \zeta_j}\left(D_{ij}(\zeta) + Q_{ij}(\zeta)\right)
$$
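As a quick check of the framework, SGLD can be recovered as a special case. The choices here are stated as an illustration: take $\zeta = \theta$, $H(\zeta) = U(\theta)$, $D(\zeta) = I$ and $Q(\zeta) = 0$. Since $D$ and $Q$ are then constant, every derivative in $\Gamma$ vanishes, and
$$
\Gamma(\zeta) = 0, \qquad b(\zeta) = -\nabla U(\theta), \qquad d\theta = -\frac{1}{2}\nabla U(\theta)dt + dB_t
$$
which is the Langevin diffusion from the previous section.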

Therefore, new samples from the distribution can be drawn by using
$$
\zeta_{t+h} \approx \zeta_{t} - \frac{h}{2}\left[[D(\zeta_t) + Q(\zeta_t)]\nabla H(\zeta_t) - \Gamma(\zeta_t)\right] + \sqrt{h}Z
$$
where $h$ is the step size and $Z$ is sampled from $N(0, D(\zeta_t))$. When $\nabla H$ is replaced by a stochastic gradient estimate, the variance of that estimate, denoted $V(\theta_t)$, inflates the noise in the update. To counter this inflation we also change $Z$, drawing it from $N(0, D(\zeta_t) - \hat{B}(\zeta_t))$ instead, where $\hat{B}(\zeta_t)$ accounts for the extra variance. In doing so we have to make sure that the variance $D(\zeta_t) - \hat{B}(\zeta_t)$ does not fall below zero, otherwise the result becomes unstable.

## Question 2

### MNIST dataset training using Bayesian Neural Network

In [None]:
import Pkg;
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add("Distributions")
# Pkg.add("TensorFlow")

In [2]:
using CSV
using DataFrames
using Distributions
# using TensorFlow

We first load the MNIST dataset.

In [3]:
data = CSV.read("mnist_train.csv", DataFrame)


Unnamed: 0_level_0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,1x10,1x11
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,5,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0
5,9,0,0,0,0,0,0,0,0,0,0,0
6,2,0,0,0,0,0,0,0,0,0,0,0
7,1,0,0,0,0,0,0,0,0,0,0,0
8,3,0,0,0,0,0,0,0,0,0,0,0
9,1,0,0,0,0,0,0,0,0,0,0,0
10,4,0,0,0,0,0,0,0,0,0,0,0


The softmax function below is adapted from an online reference. We use it to turn the outputs computed from the weights into a probability for each digit.

In [4]:
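# Shift by the maximum before exponentiating, for numerical stability.
# Note: for a matrix input, maximum(x) and sum(e) run over all entries, not per row.
_exp(x::AbstractVecOrMat) = exp.(x .- maximum(x))

_sftmax(e::AbstractVecOrMat) = (e ./ sum(e))

function softMax(X::AbstractVecOrMat{T})::AbstractVecOrMat where T<:AbstractFloat
    _sftmax(_exp(X))
end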


softMax (generic function with 1 method)

In [5]:
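# Forward pass of a two-layer network: hidden layer X*B .+ b', then output layer with A and a.
# Returns the softmax outputs (one row per image, one column per digit); y is currently unused.
function logLikelihood(X, y, A, B, a, b)
    beta = softMax(X*B .+ b')
    beta = softMax(beta*A' .+ a')
    return beta
end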

logLikelihood (generic function with 1 method)

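We then initialise the weights and biases with random draws from a standard normal distribution, and draw each lambda hyperparameter from a Gamma(1, 1) distribution.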

In [6]:
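# Pixel columns 2:785 are the inputs; column 1 holds the digit labels.
X = data[1:60000, 2:785]
X = Matrix(X)
y = data[1:60000, 1:1]
# Output-layer weights A (10×100) and hidden-layer weights B (784×100), drawn from N(0, 1).
A = rand(Normal(), 10, 100)
B = rand(Normal(), 784, 100)
# Biases for the output layer (a) and the hidden layer (b).
a = rand(Normal(), 10)
b = rand(Normal(), 100)
# One Gamma(1, 1) draw for each block of parameters.
lambdaA = rand(Gamma(1,1))
lambdaB = rand(Gamma(1,1))
lambdaa = rand(Gamma(1,1))
lambdab = rand(Gamma(1,1))
beta = logLikelihood(X, y, A, B, a, b)
# beta[1][225]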



60000×10 Matrix{Float64}:
 1.3346e-6  1.65383e-6  9.98647e-7  …  3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6  9.98647e-7     3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6  9.98647e-7     3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6  9.98647e-7     3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6  9.98647e-7     3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6  9.98647e-7  …  3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6  9.98647e-7     3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6  9.98647e-7     3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6  9.98647e-7     3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6  9.98647e-7     3.16154e-6  1.33368e-6  4.85932e-7
 ⋮                                  ⋱                          
 1.3346e-6  1.65383e-6  9.98647e-7     3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6  9.98647e-7     3.16154e-6  1.33368e-6  4.85932e-7
 1.3346e-6  1.65383e-6
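
Every printed row above is identical and the entries are of order 1e-6. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that `softMax` normalises over the whole matrix instead of over each row. A row-wise variant could be sketched as follows; the name `rowSoftMax` is hypothetical and not part of the code above.

```julia
# Row-wise, numerically stable softmax: each row is shifted by its own maximum
# and divided by its own sum, so every row is a probability vector.
function rowSoftMax(X::AbstractMatrix{<:AbstractFloat})
    E = exp.(X .- maximum(X, dims=2))
    return E ./ sum(E, dims=2)
end

P = rowSoftMax([1.0 2.0 3.0; 1000.0 1000.0 1000.0])
```

Here each row of `P` sums to one, and the second row stays finite even though its inputs are large, because of the shift by the row maximum.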

We now have to train the model on this data. The training part is not complete, as I still have unresolved doubts about the SGLD method.

Essentially, we will use SGLD, selecting a subset of the dataset for each parameter update.